![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Lab2: Search Engine

by Fabian Märki (FHNW), revised by Andrei Popescu-Belis (HES-SO).

## Summary
The aim of this lab is to build a simple document search engine based on TF-IDF document vectors. 

The lab is inspired by a notebook designed by [Kavita Ganesan](https://github.com/kavgan/nlp-in-practice/blob/master/tf-idf/Keyword%20Extraction%20with%20TF-IDF%20and%20SKlearn.ipynb).

<font color='green'>Please answer the questions in green within this notebook, and submit the completed notebook under the corresponding homework on Moodle.</font>

In [None]:
import os    
import nltk
import gensim
import pandas as pd
from nltk.corpus import stopwords, wordnet
from TextPreprocessor import *
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel
from gensim import models, corpora, similarities

The data used in this lab is a set of 300 documents selected from the Australian Broadcasting Corporation's news mail service. It consists of texts of headline stories from around the years 2000-2001.  This is a shortened version of the Lee Background Corpus [described here](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF).  It is available as test data in the **gensim** package, so you do not need to download it separately.

The following code will load the documents into a Pandas dataframe.

In [None]:
# Code inspired from:
# https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
with open(lee_train_file) as f:  # the file is closed automatically after reading
    text = f.read().splitlines()  # one document per line
data_df = pd.DataFrame({'text': text})

The following code will run our in-house Text Preprocessor provided in the `TextPreprocessor.py` file.

<font color='green'>Please enrich the following code according to your needs (especially the stopwords).</font>

In [None]:
language = 'english'
stop_words = set(stopwords.words(language))
# Extend the list here:


processor = TextPreprocessor(
# Add options here:
 
)

In [None]:
data_df['processed'] = processor.transform(data_df['text'])

In [None]:
print(data_df['processed'].iloc[0])

## Generation of document vectors with [Scikit-learn](https://scikit-learn.org/stable)

We will use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class from scikit-learn to create a vocabulary and generate word counts or *Term Frequencies* (TF).
    
The result is a matrix of counts: each column represents a _word_ in the vocabulary, each row represents a document in our dataset, and each cell holds the number of occurrences of that word in that document.

The matrix is very sparse, because all words not appearing in a document have 0 counts.
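
To make the layout of this matrix concrete, here is a minimal sketch on a toy corpus. The three sentences and the names `toy_docs`, `toy_cv` and `toy_counts` are made up for illustration and are not part of the lab data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not part of the lab data)
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)  # sparse matrix: one row per document

# vocabulary_ maps each word to its column index in the matrix
vocab = toy_cv.vocabulary_
print(sorted(vocab, key=vocab.get))  # words in column order
print(toy_counts.toarray())          # dense view, only sensible for tiny examples

# Sparsity: number of non-zero cells versus total number of cells
n_cells = toy_counts.shape[0] * toy_counts.shape[1]
print(toy_counts.nnz, "non-zero cells out of", n_cells)
```

Even in this tiny example most cells are zero, which is why scikit-learn returns a sparse matrix rather than a dense array.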

In [None]:
cv = CountVectorizer(max_features=3000) # keep only the 3000 most frequent words in the corpus
word_count_vector = cv.fit_transform(data_df['processed'])

Let's look at some words from our vocabulary:

In [None]:
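# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires version 1.0 or later
feature_names = cv.get_feature_names_out().tolist()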

In [None]:
print(len(feature_names)) # has the max_features value been reached?
print(feature_names[2500:2505]) # try various slices
print(feature_names.index('hundred')) # find a word

**TfidfTransformer to Compute Inverse Document Frequency (IDF)**

We now use the (sparse) matrix generated by `CountVectorizer` to compute the IDF value of each word.  Note that in practice, IDF values should be estimated on a large and representative corpus rather than on a collection of 300 documents.

In [None]:
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

The IDF values are stored in the `idf_` attribute of the `TfidfTransformer`.  This array has the same length as the list of feature names (words).
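
As a sanity check, the sketch below recomputes the IDF values by hand on a toy corpus and compares them with `idf_`. It assumes the smoothed formula given in the scikit-learn documentation for `smooth_idf=True`, namely `ln((1 + n) / (1 + df)) + 1`. The toy documents and the names `toy_counts`, `doc_freq` and `manual_idf` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (hypothetical): "apple" occurs in every document, "date" in only one
toy_docs = ["apple banana", "apple cherry", "apple banana cherry date"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)
toy_tfidf = TfidfTransformer(smooth_idf=True, use_idf=True).fit(toy_counts)

n_docs = toy_counts.shape[0]
# Document frequency: number of documents in which each word occurs
doc_freq = np.asarray((toy_counts > 0).sum(axis=0)).ravel()
# Smoothed IDF (assumed formula, see lead-in): ln((1 + n) / (1 + df)) + 1
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(manual_idf)
print(toy_tfidf.idf_)
```

A word that occurs in every document receives the minimum IDF of 1, and rarer words receive higher values.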

In [None]:
print(len(tfidf_transformer.idf_)) # check length
print(tfidf_transformer.idf_[feature_names.index('hundred')]) # check IDF value of a word

**We define here two helper functions:**
 * the first one is a sorting function for the columns of a sparse matrix in COOrdinate format (a.k.a "ijv" or "triplet" format [explained here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html));
 * the second one extracts the feature names (*words*) and their TF-IDF values from the sorted list.
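
For readers unfamiliar with the COOrdinate format, the following sketch shows what the `row`, `col` and `data` arrays look like for a small vector, and how zipping `col` with `data` gives (column index, value) pairs that can be sorted by value. The numbers are made up for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

# A hypothetical 1 x 6 row vector with three non-zero scores
dense = np.array([[0.0, 0.7, 0.0, 0.2, 0.0, 0.5]])
toy_vector = coo_matrix(dense)

# Only the non-zero entries are stored, as parallel arrays
print(toy_vector.row)   # row index of each stored entry
print(toy_vector.col)   # column index of each stored entry
print(toy_vector.data)  # value of each stored entry

# Pair each column index with its value, then sort by decreasing value
pairs = sorted(zip(toy_vector.col, toy_vector.data),
               key=lambda x: (x[1], x[0]), reverse=True)
print(pairs)
```

In the lab, the column indices correspond to positions in the list of feature names, so each pair identifies a word and its TF-IDF score.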

In [None]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items from sorted list"""

    # use only topn items from vector
    sorted_items = sorted_items[:topn]

    # map each feature name to its rounded score, keeping the sorted order
    results = {}
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    return results

We now select a document for which we will generate TF-IDF values.  <font color="green">Please select a document of your choice, with an index between 0 and 299.</font>

In [None]:
doc_orig = data_df['text'].iloc[136]
doc_processed = data_df['processed'].iloc[136]

**The next instruction generates the vector of TF-IDF values for the document** using the `tfidf_transformer`.

In [None]:
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc_processed]))

Next, we sort the words in the `tf_idf_vector` by decreasing TF-IDF value (first converting the vector into COOrdinate format, and then applying our sorting function from above).  We then extract the 10 top-scoring words for the selected document using our second helper function, and display the words with their scores.

In [None]:
sorted_items=sort_coo(tf_idf_vector.tocoo())

topn_words = extract_topn_from_vector(feature_names, sorted_items, 10)

print(doc_orig, '\n', topn_words)

<font color="green">Please comment briefly on the relevance of these words with respect to the document content.</font>

## Document-document similarity using scikit-learn

In this section, you will write the commands to compute a document-document similarity matrix over the above documents, in scikit-learn.

Please use a processing [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) and a [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and compute the *cosine similarities* between all documents.  
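
As a starting point, here is a minimal sketch of pairwise cosine similarity on three toy sentences. It deliberately leaves out the `Pipeline` and the lab's preprocessor, and the sentences and the names `toy_matrix` and `toy_sims` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical): two related news snippets and one unrelated
toy_docs = [
    "fire crews battle bushfires near sydney",
    "bushfires threaten homes near sydney",
    "cricket team wins the test match",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_docs)  # one TF-IDF row per document
toy_sims = cosine_similarity(toy_matrix)                 # square matrix of shape (3, 3)

# Entry [i, j] is the similarity between document i and document j
print(toy_sims.round(2))
```

Each document has similarity 1 with itself, and documents that share no words have similarity 0, so the most similar documents to a given one can be read off its row of the matrix.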

<font color="green">At the end, you will be asked to display the five most similar documents to the one you selected above, and compare the 1st and the 5th best results.</font>

You can use inspiration from: 
 * the above code
 * https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XkK2ceFCe-Y
 * https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
 * https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity
 * https://markhneedham.com/blog/2016/07/27/scitkit-learn-tfidf-and-cosine-similarity-for-computer-science-papers

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
tfidf = TfidfVectorizer(use_idf=True)
pipe = Pipeline(steps=[('pre', processor), ('tfidf', tfidf)]) # the 'processor' was defined above

<font color='green'>Please write a function called `find_similar` which receives a `tfidf_matrix` containing the TF-IDF vectors of all documents, and the `index` of a document in the collection, and returns the `top_n` documents most similar to it according to cosine similarity.</font>

In [None]:
def find_similar(tfidf_matrix, index, top_n = 5):
    pass  # TODO: replace with your implementation


<font color="green">Using the data from the Pandas dataframe created above, please use "fit" and "transform" to generate the matrix of TF-IDF document vectors called "tfidf_matrix", from which the document similarities are computed.  How long do these two operations take on your computer?  Please explain briefly in your own words the difference between "fit" and "transform".</font>

<font color="green">Using `find_similar` and the `tfidf_matrix`, please display the five most similar documents to the one you selected above, with their scores, comment on them, and compare the 1st and the 5th best results.</font>

<font color='green'>Could you also use the dot product instead of the cosine similarity in the `find_similar` function?  Please answer in the following box.</font>

## Building a search engine using Gensim

<font color='green'>Using the [tutorial from Gensim](https://radimrehurek.com/gensim/tut3.html), and in particular the [TF-IDF Model](https://radimrehurek.com/gensim/models/tfidfmodel.html) to build the model and the [MatrixSimilarity](https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.MatrixSimilarity) to measure cosine similarity between documents, please implement a method that returns the documents most similar to a given query.</font>

<font color='green'>Please write a query of your own (5-10 words), retrieve the 5 most similar documents, and comment on the result.</font>

Thank you for your work!  Please submit the notebook on Moodle before the next course.