# Information Retrieval

In information retrieval, we search a corpus and retrieve the documents that contain the information we are looking for. In this tutorial, we search for a query in a list of sentences. The first step is to convert the text to numerical vectors. We explore different embedding techniques to see how the search results differ. The techniques are as follows:

1. Doc2bow + LSI  
2. TF-IDF
3. GloVe
4. Word2vec
5. Doc2vec
6. BERT

After representing each sentence and the query as vectors, we compute the cosine similarity between the query vector and each sentence vector. Cosine similarity lies in the range [-1, 1]; the closer the score is to 1, the more similar the document is to the query. We then sort the documents by score to find the ones most similar to the query.
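The ranking step described above can be sketched in a few lines of NumPy. The vectors and document names here are made up for illustration, and `cosine_sim` is a hypothetical helper, not part of any library used later in this notebook.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional vectors, invented for this sketch.
query_vec = np.array([1.0, 1.0, 0.0])
doc_vecs = {
    "doc_a": np.array([2.0, 2.0, 0.0]),  # same direction as the query
    "doc_b": np.array([1.0, 0.0, 1.0]),  # partially overlapping
    "doc_c": np.array([0.0, 0.0, 3.0]),  # orthogonal to the query
}

# Score every document against the query, then sort by descending score.
scores = {name: cosine_sim(query_vec, vec) for name, vec in doc_vecs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['doc_a', 'doc_b', 'doc_c']
```

Note that `doc_a` scores 1 even though it is twice as long as the query: cosine similarity depends only on direction, not on vector length.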



In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import gensim

ValueError: ignored

In [6]:
print(gensim.__version__)

NameError: ignored

In [7]:
!pip install --upgrade gensim




We start this section by defining a list of sentences and a query, and we want to find the sentence most similar to the query. The example is taken from the [Gensim similarity queries tutorial](https://radimrehurek.com/gensim_3.8.3/auto_examples/core/run_similarity_queries.html), which shows how to run a search query with the popular **Gensim** package.





In [8]:
# import libraries
from collections import defaultdict
from gensim import corpora


In [9]:


documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

search_terms = "human computer interaction"

## Doc2bow + LSI

In [10]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]


In [None]:
texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [None]:
corpus

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]

In the Gensim tutorial, LSI is used to project the vectors into a 2-dimensional space:



In [None]:
from gensim import models
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

The reason for using LSI is that it can identify patterns and relationships between terms (in our case, words in a document) and topics.
The LSI space is two-dimensional (`num_topics = 2`), so there are two topics, but this choice is arbitrary.
If you're interested, you can read more about LSI here: [Latent Semantic Indexing](https://en.wikipedia.org/wiki/Latent_semantic_indexing).
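To make the idea concrete, LSI can be viewed as a truncated SVD of the term-document matrix. The following is a minimal NumPy sketch under that assumption; the toy counts are invented, and it is not Gensim's implementation (which uses its own incremental algorithm).

```python
import numpy as np

# Toy term-document counts (rows = terms, columns = documents d0..d3),
# invented for this sketch.
terms = ["human", "computer", "graph", "trees"]
X = np.array([
    [1, 0, 0, 0],  # human
    [1, 1, 0, 0],  # computer
    [0, 0, 1, 0],  # graph
    [0, 0, 1, 2],  # trees
], dtype=float)

# Truncated SVD: keep the k strongest latent "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k = U[:, :k]             # term-to-topic projection

docs_lsi = (U_k.T @ X).T   # each row is one document in the 2-D topic space

# Project a query that contains only the word "human" into the same space.
query = np.array([1.0, 0.0, 0.0, 0.0])
query_lsi = U_k.T @ query

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

lsi_scores = [cosine_sim(query_lsi, d) for d in docs_lsi]
print(np.round(lsi_scores, 3))
```

Document d1 contains only "computer" and shares no word with the query "human", yet it lands on the same topic axis and scores close to 1, while the graph/trees documents score close to 0. This is the kind of semantic generalization the LSI projection is meant to provide.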

Now suppose a user typed in the query `"Human computer interaction"`. We would
like to sort our nine corpus documents in decreasing order of relevance to this query.
Unlike modern search engines, here we only concentrate on a single aspect of possible
similarities, namely the apparent semantic relatedness of their texts (words). No hyperlinks,
no random-walk static ranks, just a semantic extension over the boolean keyword match:



In [None]:

vec_bow = dictionary.doc2bow(search_terms.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

[(0, 0.46182100453271613), (1, -0.0700276652790001)]


In addition, we will be considering [cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity)
to determine the similarity of two vectors. Cosine similarity is a standard measure
in Vector Space Modeling, but wherever the vectors represent probability distributions,
[different similarity measures](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Symmetrised_divergence)
may be more appropriate.

### Initializing query structures


To prepare for similarity queries, we need to enter all documents which we want
to compare against subsequent queries. In our case, they are the same nine documents
used for training LSI, converted to the 2-D LSI space. But that's only incidental; we
might also be indexing a different corpus altogether.



In [None]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it



To obtain similarities of our query document against the nine indexed documents:



In [None]:
sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

[(0, 0.998093), (1, 0.93748635), (2, 0.9984453), (3, 0.98658866), (4, 0.90755945), (5, -0.12416792), (6, -0.1063926), (7, -0.09879464), (8, 0.05004177)]


The cosine measure returns similarities in the range `[-1, 1]` (the greater, the more similar),
so the first document (no. 0) has a score of 0.998093, and so on.

With some standard Python magic we sort these similarities into descending
order, and obtain the final answer to the query `"Human computer interaction"`:



In [None]:
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])

0.9984453 The EPS user interface management system
0.998093 Human machine interface for lab abc computer applications
0.98658866 System and human system engineering testing of EPS
0.93748635 A survey of user opinion of computer system response time
0.90755945 Relation of user perceived response time to error measurement
0.05004177 Graph minors A survey
-0.09879464 Graph minors IV Widths of trees and well quasi ordering
-0.1063926 The intersection graph of paths in trees
-0.12416792 The generation of random binary unordered trees


The thing to note here is that documents no. 2 ("The EPS user interface management system") and 4 ("Relation of user perceived response time to error measurement") would never be returned by a standard boolean fulltext search, because they do not share any common words with "Human computer interaction". However, after applying LSI, we can observe that both of them received quite high similarity scores (no. 2 is actually the most similar!), which corresponds better to our intuition of them sharing a “computer-human” related topic with the query. In fact, this semantic generalization is the reason why we apply transformations and do topic modelling in the first place.
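The claim about boolean search can be checked directly. The sketch below is a naive keyword match (a document is a hit if it shares at least one lowercase word with the query); it is a simplification for illustration, not how a real full-text engine works. The list `docs` repeats the nine sentences defined earlier so that the block runs on its own.

```python
docs = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
query_words = set("human computer interaction".lower().split())

# Indices (zero-based) of documents sharing at least one word with the query.
boolean_hits = [
    i for i, doc in enumerate(docs)
    if query_words & set(doc.lower().split())
]
print(boolean_hits)  # [0, 1, 3]
```

Documents no. 2 and 4 are missing from the hits, which matches the observation above that only the LSI projection brings them into the result.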

### Assignment
We removed the words that occur only once. Try keeping these words and rerun the code to see how the results differ.

## TF-IDF

TF-IDF vectorizes a sentence by weighting each word according to how relevant it is to that document within the corpus. The term frequency (TF) is the number of times a word occurs in a document relative to the total number of words in that document. The inverse document frequency (IDF) accounts for how many documents contain the word: words that occur in every document carry little information and are down-weighted, while rare words are up-weighted.
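As a rough illustration, the textbook form of the weight is `tf * log(N / df)`. The sketch below computes it by hand on a made-up toy corpus. Note that this is an assumption about the plain formula only: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
toy_docs = [
    "human computer interaction".split(),
    "computer system response time".split(),
    "graph of trees".split(),
]

n_docs = len(toy_docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)          # relative frequency in this document
    idf = math.log(n_docs / doc_freq[term])  # rarer across the corpus -> larger
    return tf * idf

weights = {term: round(tf_idf(term, toy_docs[0]), 3) for term in toy_docs[0]}
print(weights)
```

In the first toy document, "computer" also appears in another document, so it receives a lower weight than "human" and "interaction", which are unique to that document.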

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from nltk import word_tokenize          
import spacy
import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
nlp = spacy.load('en_core_web_sm')


def lemmatization(text):
  doc = nlp(text)
  mytokens = [word.lemma_ if word.lemma_ != "-PRON-" else word.lower_ for word in doc]
  return " ".join(mytokens)

#Lemmatize the corpus
corpus = [ lemmatization(text) for text in [search_terms] + documents ]





In [None]:
corpus

['human computer interaction',
 'human machine interface for lab abc computer application',
 'a survey of user opinion of computer system response time',
 'the EPS user interface management system',
 'system and human system engineering testing of EPS',
 'relation of user perceive response time to error measurement',
 'the generation of random binary unordered tree',
 'the intersection graph of path in tree',
 'Graph minor IV Widths of tree and well quasi ordering',
 'Graph minor a survey']

In [None]:
# Lemmatize the stop words
token_stop = set([lemmatization(word) for word in stoplist])


In [None]:

# Create the TF-IDF model (recent scikit-learn versions expect stop_words to be a list, not a set)
vectorizer = TfidfVectorizer(stop_words=list(token_stop))
doc_vectors = vectorizer.fit_transform(corpus)



In [None]:
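# Calculate similarity: TfidfVectorizer L2-normalizes each row by default,
# so the linear kernel (dot product) equals the cosine similarity
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
# Skip the first entry, which is the query compared with itself
document_scores = [item.item() for item in cosine_similarities[1:]]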

In [None]:
sims = sorted(enumerate(document_scores), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])

0.3157251202041625 Human machine interface for lab abc computer applications
0.17346750739842565 A survey of user opinion of computer system response time
0.16268722264438382 System and human system engineering testing of EPS
0.0 The EPS user interface management system
0.0 Relation of user perceived response time to error measurement
0.0 The generation of random binary unordered trees
0.0 The intersection graph of paths in trees
0.0 Graph minors IV Widths of trees and well quasi ordering
0.0 Graph minors A survey


We can see that TF-IDF does not capture the semantic meaning of words: it relies only on the occurrence of words, so every document that shares no word with the query scores 0.

## GloVe

GloVe combines two families of methods: global matrix factorization, as in latent semantic analysis (LSA), and local context window methods such as Skip-gram. It is trained with a weighted least-squares cost function on word co-occurrence counts, which keeps the computational cost of training down. The resulting word embeddings differ from those of Word2Vec and are often reported to improve on them.

In [None]:
from gensim.utils import simple_preprocess

# Convert a document into a list of tokens and remove stop words
def preprocess(text):
    return [token for token in simple_preprocess(text, min_len=0, max_len=float("inf")) if token not in stoplist]



In [None]:
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.utils import simple_preprocess
from gensim.similarities import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity



In [None]:
# Load the model: this is a large file and can take a while to download and open
glove = api.load("glove-wiki-gigaword-50")    
similarity_index = WordEmbeddingSimilarityIndex(glove) # computes cosine similarities between word embeddings





For more information about `gensim.similarities` visit https://radimrehurek.com/gensim/similarities/termsim.html#gensim.similarities.termsim.SparseTermSimilarityMatrix
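The term similarity matrix is what lets documents with no words in common still receive a nonzero score. A minimal sketch of the soft cosine measure, assuming the usual definition `(a^T S b) / (sqrt(a^T S a) * sqrt(b^T S b))`, looks like this. The three-word vocabulary and the similarity values in `S` are made up, and `soft_cosine` is a hypothetical helper rather than Gensim's code.

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine: (a^T S b) / (sqrt(a^T S a) * sqrt(b^T S b))."""
    return float((a @ S @ b) / (np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)))

# Vocabulary order: ["human", "user", "tree"]; similarity values are invented.
S = np.array([
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

query_bow = np.array([1.0, 0.0, 0.0])  # contains only "human"
doc_user = np.array([0.0, 1.0, 0.0])   # contains only "user"
doc_tree = np.array([0.0, 0.0, 1.0])   # contains only "tree"

# With the identity matrix, soft cosine reduces to ordinary cosine similarity.
plain = soft_cosine(query_bow, doc_user, np.eye(3))
soft_user = soft_cosine(query_bow, doc_user, S)
soft_tree = soft_cosine(query_bow, doc_tree, S)
print(plain, soft_user, soft_tree)  # 0.0 0.6 0.0
```

The ordinary cosine between "human" and "user" is 0 because the two vectors share no dimension, while the soft cosine picks up the assumed 0.6 similarity between the two words.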

In [None]:
simple_preprocess('doc is great', min_len=0, max_len=float("inf"))

['doc', 'is', 'great']

In [None]:
# Build the term dictionary, TF-idf model
tokens = [preprocess(document) for document in corpus]
dictionary = Dictionary(tokens)
tfidf = TfidfModel(dictionary=dictionary)



In [None]:
tokens

[['human', 'computer', 'interaction'],
 ['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'application'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceive', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'tree'],
 ['intersection', 'graph', 'path', 'tree'],
 ['graph', 'minor', 'iv', 'widths', 'tree', 'well', 'quasi', 'ordering'],
 ['graph', 'minor', 'survey']]

In [None]:
# Create the term similarity matrix.  
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf) # A sparse term similarity matrix built using a term similarity index

100%|██████████| 36/36 [00:01<00:00, 35.35it/s]


In [None]:
# Compute Soft Cosine Measure between the query and the documents.
query = preprocess(search_terms)
query_tf = tfidf[dictionary.doc2bow(query)]

index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in tokens]],
            similarity_matrix)

doc_similarity_scores = index[query_tf]
doc_similarity_scores = doc_similarity_scores[1::]



In [None]:
sims = sorted(enumerate(doc_similarity_scores), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])

0.458761 Relation of user perceived response time to error measurement
0.44832718 Human machine interface for lab abc computer applications
0.38101894 System and human system engineering testing of EPS
0.35302877 A survey of user opinion of computer system response time
0.24636662 The EPS user interface management system
0.19063666 The generation of random binary unordered trees
0.0 The intersection graph of paths in trees
0.0 Graph minors IV Widths of trees and well quasi ordering
0.0 Graph minors A survey


Word2Vec only captures the local context of words. During training, it only considers neighboring words to capture the context. GloVe considers the entire corpus and creates a large matrix that can capture the co-occurrence of words within the corpus.
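The difference can be illustrated with a small sketch that gathers co-occurrence counts over a context window. Word2Vec consumes such (word, context) pairs one window at a time during training, whereas GloVe first aggregates them into global counts like the ones below. The toy sentences and the window size are assumptions, and the distance weighting that real implementations may apply is left out.

```python
from collections import Counter

# Toy corpus and window size, invented for this sketch.
toy_sentences = [
    "human computer interaction".split(),
    "computer system response time".split(),
]
window = 1

# Count how often each ordered (word, context word) pair occurs within the window.
cooc = Counter()
for sentence in toy_sentences:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[(word, sentence[j])] += 1

print(cooc[("computer", "human")], cooc[("computer", "system")], cooc[("human", "system")])
# 1 1 0
```

Here "computer" co-occurs with both "human" and "system", but "human" and "system" never share a window, so their count is 0.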

## Word2vec

Word2vec is a word embedding technique that learns word vectors from the associations and dependencies between words that occur near each other.

In [None]:
# Load the model: this is a large file and can take a while to download and open
# Pre-trained embeddings: each word is represented as a 300-dimensional vector
import gensim.downloader as api
model_w2v = api.load("word2vec-google-news-300")



Alternatively, load the model from a local file:

In [None]:
# loading pre-trained embeddings, each word is represented as a 300 dimensional vector
# W2V_PATH="GoogleNews-vectors-negative300.bin.gz"
# model_w2v = gensim.models.KeyedVectors.load_word2vec_format(W2V_PATH, binary=True)

In [None]:
similarity_index_w = WordEmbeddingSimilarityIndex(model_w2v) # computes cosine similarities between word embeddings
similarity_matrix_w = SparseTermSimilarityMatrix(similarity_index_w, dictionary, tfidf)
index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in tokens]],
            similarity_matrix_w)

doc_similarity_scores_w = index[query_tf]
doc_similarity_scores_w = doc_similarity_scores_w[1::]

100%|██████████| 36/36 [00:23<00:00,  1.52it/s]


In [None]:
sims = sorted(enumerate(doc_similarity_scores_w), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])

0.42041993 A survey of user opinion of computer system response time
0.42041993 System and human system engineering testing of EPS
0.3918314 Human machine interface for lab abc computer applications
0.0 The EPS user interface management system
0.0 Relation of user perceived response time to error measurement
0.0 The generation of random binary unordered trees
0.0 The intersection graph of paths in trees
0.0 Graph minors IV Widths of trees and well quasi ordering
0.0 Graph minors A survey


### Assignment
Explain how and why the results of using Word2Vec differ from those of GloVe.

## Doc2Vec

In [None]:
documents_cleaned = []
for tok in tokens:
  documents_cleaned.append(" ".join(tok))


In [None]:
documents_cleaned

['human computer interaction',
 'human machine interface lab abc computer application',
 'survey user opinion computer system response time',
 'eps user interface management system',
 'system human system engineering testing eps',
 'relation user perceive response time error measurement',
 'generation random binary unordered tree',
 'intersection graph path tree',
 'graph minor iv widths tree well quasi ordering',
 'graph minor survey']

In [None]:
documents_not_cleaned = [search_terms ]
documents_not_cleaned.extend(documents)

In [None]:
documents_not_cleaned

['human computer interaction',
 'Human machine interface for lab abc computer applications',
 'A survey of user opinion of computer system response time',
 'The EPS user interface management system',
 'System and human system engineering testing of EPS',
 'Relation of user perceived response time to error measurement',
 'The generation of random binary unordered trees',
 'The intersection graph of paths in trees',
 'Graph minors IV Widths of trees and well quasi ordering',
 'Graph minors A survey']

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
tagged_data = [TaggedDocument(words=word_tokenize(doc), tags=[i]) for i, doc in enumerate(documents_cleaned)]

In [None]:
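# Set the number of epochs once and call train() a single time; calling
# train() inside a manual loop repeats the full schedule on every iteration
model_d2v = Doc2Vec(vector_size=300, alpha=0.025, min_count=1, epochs=100)

model_d2v.build_vocab(tagged_data)

model_d2v.train(tagged_data,
                total_examples=model_d2v.corpus_count,
                epochs=model_d2v.epochs)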

In [None]:
document_embeddings = np.zeros((len(documents_cleaned), 300))

# model.dv is the Gensim 4 name for the document vectors (docvecs in Gensim 3.x)
for i in range(len(document_embeddings)):
    document_embeddings[i] = model_d2v.dv[i]


In [None]:
pairwise_similarities=cosine_similarity(document_embeddings)


In [None]:
pairwise_similarities[0]

array([1.        , 0.89434382, 0.81315093, 0.8066324 , 0.77248275,
       0.69620613, 0.71216651, 0.70627077, 0.59890234, 0.79506582])

In [None]:
sims =np.argsort(pairwise_similarities[0])[::-1][1:]
for doc_position in sims:
    print(pairwise_similarities[0][doc_position], documents_not_cleaned[doc_position])

0.8943438240184495 Human machine interface for lab abc computer applications
0.8131509262338534 A survey of user opinion of computer system response time
0.8066324022665379 The EPS user interface management system
0.7950658239137872 Graph minors A survey
0.7724827504745031 System and human system engineering testing of EPS
0.7121665106918822 The generation of random binary unordered trees
0.706270773555438 The intersection graph of paths in trees
0.6962061277222966 Relation of user perceived response time to error measurement
0.5989023354588031 Graph minors IV Widths of trees and well quasi ordering


## BERT

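BERT relies on an attention mechanism and generates context-aware (contextualized) word embeddings. Here we use a pre-trained model from the `sentence-transformers` package, which turns each sentence into a single vector.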
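BERT produces one vector per token, so some pooling step is needed to obtain one vector per sentence. The model name used below, `bert-base-nli-mean-tokens`, suggests mean pooling over the token embeddings. The following is a minimal NumPy sketch of that pooling step with made-up numbers; `mean_pool` is a hypothetical helper and not the library's code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

# Made-up sequence of 4 tokens with 3-dimensional embeddings; the last is padding.
token_embeddings = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 1.0, 1.0],
    [9.0, 9.0, 9.0],  # padding token, should not affect the result
])
attention_mask = np.array([1, 1, 1, 0])

sentence_vec = mean_pool(token_embeddings, attention_mask)
print(sentence_vec)  # [2. 1. 1.]
```

Because every token embedding depends on the whole sentence through attention, the pooled vector reflects context in a way that averaging static word vectors would not.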

In [None]:
!pip install sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer


In [None]:
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')


In [None]:
document_embeddings = sbert_model.encode(documents_cleaned)

In [None]:
pairwise_similarities=cosine_similarity(document_embeddings)

In [None]:
sims =np.argsort(pairwise_similarities[0])[::-1][1:]
for doc_position in sims:
    print(pairwise_similarities[0][doc_position], documents_not_cleaned[doc_position])

0.5600678 A survey of user opinion of computer system response time
0.55632615 System and human system engineering testing of EPS
0.54429317 Human machine interface for lab abc computer applications
0.5148372 The EPS user interface management system
0.46277103 The intersection graph of paths in trees
0.43949413 Relation of user perceived response time to error measurement
0.38810784 Graph minors A survey
0.28876284 Graph minors IV Widths of trees and well quasi ordering
0.22333612 The generation of random binary unordered trees


# **Assignment**
1) Compare the results of each word/sentence embedding technique and try to explain the differences between them

2) Read the data from annual reports, define keywords for E, S, and G, and retrieve the data related to ESG