# DAY 3: Student version

**Machine Learning NLP**

The goal of this session is to improve the search engine using NLP features.

This notebook guides you through different techniques to explore. You are expected to be inventive and to improve on the techniques introduced.

First, let's import the useful packages and load the data.

## Installs

In [1]:
! pip install sentence-transformers --quiet



## Imports

In [2]:
import os
import re
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.sparse import find

from sentence_transformers import SentenceTransformer

import nltk
nltk.download('punkt')
nltk.download('wordnet')

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


## Load Data

In [6]:
# Only if you use Colab
from google.colab import drive
drive.mount('/content/drive')

import os

# TODO:
DATA_PATH = '/content/drive/MyDrive/EI/data' 

# CORR:
# DATA_PATH = '/content/drive/MyDrive/TP Centrale/data'
     

Mounted at /content/drive


In [7]:
posts = pd.read_xml(os.path.join(DATA_PATH, 'Posts.xml'), parser="etree", encoding="utf8")
posts

Unnamed: 0,Id,PostTypeId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastActivityDate,Title,Tags,...,ClosedDate,ContentLicense,AcceptedAnswerId,LastEditorUserId,LastEditDate,ParentId,OwnerDisplayName,CommunityOwnedDate,LastEditorDisplayName,FavoriteCount
0,5,1,2014-05-13T23:58:30.457,9,898.0,<p>I've always been interested in machine lear...,5.0,2014-05-14T00:36:31.077,How can I do simple machine learning without h...,<machine-learning>,...,2014-05-14T14:40:25.950,CC BY-SA 3.0,,,,,,,,
1,7,1,2014-05-14T00:11:06.457,4,478.0,"<p>As a researcher and instructor, I'm looking...",36.0,2014-05-16T13:45:00.237,What open-source books (or other materials) pr...,<education><open-source>,...,2014-05-14T08:40:54.950,CC BY-SA 3.0,10.0,97.0,2014-05-16T13:45:00.237,,,,,
2,9,2,2014-05-14T00:36:31.077,5,,"<p>Not sure if this fits the scope of this SE,...",51.0,2014-05-14T00:36:31.077,,,...,,CC BY-SA 3.0,,,,5.0,,,,
3,10,2,2014-05-14T00:53:43.273,13,,"<p>One book that's freely available is ""The El...",22.0,2014-05-14T00:53:43.273,,,...,,CC BY-SA 3.0,,,,7.0,,,,
4,14,1,2014-05-14T01:25:59.677,26,1901.0,<p>I am sure data science as will be discussed...,66.0,2020-08-16T13:01:33.543,Is Data Science the Same as Data Mining?,<data-mining><definitions>,...,,CC BY-SA 3.0,29.0,322.0,2014-06-17T16:17:20.473,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75722,119962,1,2023-03-04T20:06:06.820,0,8.0,<p>I am implementing a neural network of arbit...,147597.0,2023-03-04T20:22:12.523,Back Propagation on arbitrary depth network wi...,<neural-network><backpropagation>,...,,CC BY-SA 4.0,,147597.0,2023-03-04T20:22:12.523,,,,,
75723,119963,1,2023-03-04T20:12:19.677,0,10.0,<p>I am using KNN for a regression task</p>\n<...,147598.0,2023-03-04T20:12:19.677,Evaluation parameter in knn,<regression><k-nn>,...,,CC BY-SA 4.0,,,,,,,,
75724,119964,1,2023-03-05T00:14:12.597,0,7.0,<p>I have developed a small encoding algorithm...,44581.0,2023-03-05T00:14:12.597,Can I use zero-padded input and output layers ...,<deep-learning><convolutional-neural-network>,...,,CC BY-SA 4.0,,,,,,,,
75725,119965,1,2023-03-05T00:43:12.213,0,5.0,"<p>To my understanding, optimizing a model wit...",84437.0,2023-03-05T00:43:12.213,Why does cross validation and hyperparameter t...,<cross-validation><hyperparameter-tuning>,...,,CC BY-SA 4.0,,,,,,,,


## Data Cleaning

Implement a function to clean the posts. 

You can reuse what you have used in the Day 2 notebook or improve it.

In [8]:
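def clean_post(text:str)->str:
  # Strip HTML tags character by character; newlines and tabs become a single space
  res=""
  in_tag=False
  # True when a newline or tab was seen and a space is owed before the next kept character
  # (named pending_space so that the Python builtin `next` is not shadowed)
  pending_space=False
  for lettre in text:
    if lettre=='<' and not in_tag:
      in_tag=True
    elif lettre=='>' and in_tag:
      in_tag=False
    elif lettre=='\n' or lettre=='\t':
      pending_space=True
    elif not in_tag:
      if pending_space:
        res+=" " + str(lettre)
        pending_space=False
      else:
        res+=str(lettre)
  return res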

In [11]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
clean_posts = posts[['Id','Body']].copy()
clean_posts['Clean Body'] = clean_posts['Body'].fillna('').apply(clean_post)
clean_posts


Unnamed: 0,Id,Body,Clean Body
0,5,<p>I've always been interested in machine lear...,I've always been interested in machine learnin...
1,7,"<p>As a researcher and instructor, I'm looking...","As a researcher and instructor, I'm looking fo..."
2,9,"<p>Not sure if this fits the scope of this SE,...","Not sure if this fits the scope of this SE, bu..."
3,10,"<p>One book that's freely available is ""The El...","One book that's freely available is ""The Eleme..."
4,14,<p>I am sure data science as will be discussed...,I am sure data science as will be discussed in...
...,...,...,...
75722,119962,<p>I am implementing a neural network of arbit...,I am implementing a neural network of arbitrar...
75723,119963,<p>I am using KNN for a regression task</p>\n<...,I am using KNN for a regression task It's like...
75724,119964,<p>I have developed a small encoding algorithm...,I have developed a small encoding algorithm th...
75725,119965,"<p>To my understanding, optimizing a model wit...","To my understanding, optimizing a model with k..."
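As one possible improvement over the character loop, here is a minimal sketch of a regex-based cleaner that also decodes HTML entities such as `&lt;` and `&amp;`, which the function above leaves untouched. The name `clean_post_regex` is hypothetical and not part of the original notebook; a real HTML parser would be more robust on malformed markup.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")      # anything that looks like an HTML tag
WHITESPACE_RE = re.compile(r"\s+")   # runs of spaces, tabs and newlines

def clean_post_regex(text: str) -> str:
    """Hypothetical alternative cleaner: strip tags, decode entities, squeeze whitespace."""
    # Remove tags first, so that a decoded "&lt;" is never mistaken for a tag opener
    no_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)
    return WHITESPACE_RE.sub(" ", decoded).strip()

example = clean_post_regex("<p>a &lt; b</p>\n<p>c &amp; d</p>")
```

Swapping this in for `clean_post` would change the cleaned text, so every vector or embedding built from it would have to be recomputed.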


You can also implement a function that cleans the user's query.

This step is optional (if you don't think it is necessary, just return the query).

In [29]:
def clean_query(text:str)->str:
    cleaned_query = clean_post(text)
    return cleaned_query

## Text specific metadata

What metadata can you extract from the text at your disposal? Which ones are relevant?

In [14]:
#TODO
posts.keys()

Index(['Id', 'PostTypeId', 'CreationDate', 'Score', 'ViewCount', 'Body',
       'OwnerUserId', 'LastActivityDate', 'Title', 'Tags', 'AnswerCount',
       'CommentCount', 'ClosedDate', 'ContentLicense', 'AcceptedAnswerId',
       'LastEditorUserId', 'LastEditDate', 'ParentId', 'OwnerDisplayName',
       'CommunityOwnedDate', 'LastEditorDisplayName', 'FavoriteCount'],
      dtype='object')
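Beyond the existing columns, some metadata can be derived from the raw text itself. The sketch below computes a few simple candidates on a toy series. The feature names (`n_chars`, `n_words`, `n_code_blocks`, `n_links`) are illustrative assumptions, not columns of the dataset, and whether they help the search is for you to check.

```python
import pandas as pd

def text_metadata(bodies: pd.Series) -> pd.DataFrame:
    """Derive simple, hypothetical text features from raw HTML bodies."""
    bodies = bodies.fillna("")
    return pd.DataFrame({
        "n_chars": bodies.str.len(),                   # raw length, markup included
        "n_words": bodies.str.split().str.len(),       # whitespace-separated tokens
        "n_code_blocks": bodies.str.count("<code>"),   # number of <code> openers
        "n_links": bodies.str.count("<a "),            # number of anchor openers
    })

toy_bodies = pd.Series(["<p>Hello world <code>x</code></p>", None])
toy_metadata = text_metadata(toy_bodies)
```

On the real data this would be called as `text_metadata(posts['Body'])`.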

## Classic Preprocessing

The goal for this part is to implement a classic vectorization (Count vectorizer, tfidf...).

You can do it on your own or use scikit-learn.

Hints: pay attention to stopwords, additional preprocessing steps and vectorization techniques.


In [59]:
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(clean_posts["Clean Body"].values)
vectors = vectorizer.transform(clean_posts["Clean Body"].values)

Write a function that applies the same process to the query


In [60]:
def vectorize_query(query : str, vectorizer=vectorizer):
    """vectorizes the query
    Args:
        query (str): query string
        vectorizer (optional): Defaults to vectorizer.

    Returns:
        query vectorized
    """
    query_clean=clean_query(query)
    query_vectorized=vectorizer.transform([query_clean])
    return query_vectorized

In [64]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(vectors, vectorize_query("Test de merde"))

array([[0.       ],
       [0.       ],
       [0.       ],
       ...,
       [0.       ],
       [0.1774713],
       [0.       ]])

Determine a way to use this vectorization to suggest the closest items to the entry in the database

In [90]:
from sklearn.metrics.pairwise import cosine_similarity

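def vectorizer_search(query : str,
                      vectors=vectors,
                      vectorizer=vectorizer,
                      top=10) -> list:
    # Vectorize the query with the vectorizer passed in, not only the global default
    query_vector=vectorize_query(query, vectorizer)
    # cosine_similarity returns an (n_documents, 1) array; flatten it to 1-D
    scores=cosine_similarity(vectors, query_vector).ravel()
    # Pair each score with its row position and sort from most to least similar
    values=[[scores[k],k] for k in range(len(scores))]
    values.sort(reverse=True)
    # Positions come from the vector matrix, so look rows up by position with .iloc
    closest_items = [clean_posts["Clean Body"].iloc[x[1]] for x in values[:top]]
    return closest_items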

In [84]:
entry = 'what is stochastic gradient descent ?'

In [91]:
vectorizer_search(entry, top=1)[0]

'As I know, Gradient Descent has three variants which are: 1- Batch Gradient Descent: processes all the training examples for each iteration of gradient descent. 2- Stochastic Gradient Descent: processes one training example per iteration. Hence, the parameters are being updated even after one iteration in which only a single example has been processed. 3- Mini Batch gradient descent: which works faster than both batch gradient descent and stochastic gradient descent. Here, b examples where b &lt; m are processed per iteration. But in some cases they use Stochastic Gradient Descent and they define a batch size for training which is what I am confused about. Also, what about Adam, AdaDelta &amp; AdaGrad, are they all mini-batch gradient descent or not?'

How can you improve this approach?

*Answer* here

We could compute how rare each term is, so that similarities on rare words stand out. This could be done by adding weights.
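Weighting terms by rarity is what TF-IDF does, so one way to try the idea is to swap `CountVectorizer` for `TfidfVectorizer`. The sketch below runs on a three-document toy corpus. The names `toy_docs` and `tfidf_search` are hypothetical, and `sublinear_tf=True` is just one setting worth experimenting with.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_docs = [
    "stochastic gradient descent updates parameters one example at a time",
    "random forests combine many decision trees",
    "gradient boosting builds trees sequentially",
]

# TF-IDF down-weights terms that appear in many documents
tfidf = TfidfVectorizer(stop_words="english", sublinear_tf=True)
toy_vectors = tfidf.fit_transform(toy_docs)

def tfidf_search(query: str, top: int = 2) -> list:
    """Return (row position, score) pairs for the `top` most similar documents."""
    query_vector = tfidf.transform([query])
    scores = cosine_similarity(toy_vectors, query_vector).ravel()
    order = np.argsort(-scores)[:top]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

toy_results = tfidf_search("what is stochastic gradient descent")
```

On the real data, the same vectorizer would be fitted on `clean_posts["Clean Body"]` and passed to `vectorizer_search`.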

## Semantic similarity

Some NLP methods go further than word-by-word analysis by taking the context of the terms into account. Examples include Word2Vec and BERT.

From the Sentence Transformers documentation (https://www.sbert.net/docs/pretrained_models.html), choose the pre-trained model that you think is the most appropriate. Justify your choice.

In [None]:
sentence_transformer_model = ''

Answer Here

⚠ To use Sentence Transformers, it is recommended to activate the GPU in Google Colab.

In [None]:
MODEL_ST = SentenceTransformer(sentence_transformer_model)


Use this algorithm to encode the data in the database

In [None]:
embeddings = MODEL_ST.encode(clean_posts['Clean Body'].values, normalize_embeddings=True)

*If this process is slow, you can save this array in case you need to load it again*

In [None]:
import pickle

with open(os.path.join(DATA_PATH, 'embeddings.pkl'), 'wb') as file:
    pickle.dump(embeddings, file)
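Since the embeddings are a plain NumPy array, `np.save` and `np.load` are a possible alternative to pickle. The sketch below round-trips a small stand-in array through a temporary directory. The helper names and the `embeddings.npy` file name are assumptions made for this example.

```python
import os
import tempfile

import numpy as np

def save_embeddings(array: np.ndarray, path: str) -> None:
    """Write the array in NumPy's .npy format."""
    np.save(path, array)

def load_embeddings(path: str) -> np.ndarray:
    """Read back an array written by save_embeddings."""
    return np.load(path)

# Stand-in for the real embeddings, just to show the round trip
demo_embeddings = np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "embeddings.npy")
    save_embeddings(demo_embeddings, demo_path)
    restored_embeddings = load_embeddings(demo_path)
```

In the notebook, the path would instead be built with `os.path.join(DATA_PATH, ...)` so the file lands on the mounted drive.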

Make a function that transforms the input

In [None]:
def encode_query(query : str) -> np.ndarray:
    #TODO
    return encoded_query

Which distance is most relevant to measure the distance between the input and the data?

Answer here

Write a function that returns a matrix containing information about the similarity between the query and the data

In [None]:
def similarity(query, embeddings=embeddings):
    #TODO
    return similarity_matrix

In [None]:
matrix_similarity = similarity(entry, embeddings)
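When embeddings are L2-normalized (as with `normalize_embeddings=True`), cosine similarity reduces to a dot product, and squared Euclidean distance carries the same ranking information because it equals `2 - 2 * cosine`. The sketch below checks this on random unit vectors. `cosine_scores` is a hypothetical helper that takes an already encoded query, not the `similarity` function you are asked to write.

```python
import numpy as np

def cosine_scores(query_embedding: np.ndarray, doc_embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity of one query against all rows, assuming unit-norm inputs."""
    return doc_embeddings @ query_embedding

# Random stand-ins for real sentence embeddings, normalized to unit length
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 8))
toy_embeddings /= np.linalg.norm(toy_embeddings, axis=1, keepdims=True)

toy_query = toy_embeddings[2]  # use one document as the query
toy_scores = cosine_scores(toy_query, toy_embeddings)

# For unit vectors: ||d - q||^2 = 2 - 2 * cos(d, q)
squared_distances = np.linalg.norm(toy_embeddings - toy_query, axis=1) ** 2
```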

How do you determine which documents in the data set most closely match the input?

In [None]:
def ordre_en_fonction_similarité(matrix_similarity):
    #TODO
    return ordre

In [None]:
assert ordre_en_fonction_similarité([0.6, 0.8, 0.7]) == [1, 2, 0]

[1, 2, 0]
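One possible way to obtain such an ordering is to sort the negated scores with `np.argsort`. The function name `order_by_similarity` is hypothetical; the stable sort only serves to keep tied scores in their original order.

```python
import numpy as np

def order_by_similarity(scores) -> list:
    """Return document positions sorted from most to least similar."""
    # Negating the scores turns argsort's ascending order into a descending one
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable").tolist()

toy_order = order_by_similarity([0.6, 0.8, 0.7])
```

For very large collections where only the top few results matter, `np.argpartition` could avoid a full sort.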

Put it all together in a function.

In [None]:
def closest_semantic_doc(query, embeddings=embeddings, top_n=10):
    return closest_posts

What methods could be used to improve the recommendations of this algorithm?

Answer here

In [None]:
#todo

## Text clustering (BONUS)

We can use topic modeling techniques to identify groups of texts among our document base and classify the input to restrict the application of the proximity calculations seen previously.

#### LDA

Latent Dirichlet Allocation (LDA) is a topic modeling algorithm that performs soft clustering. Soft clustering means that LDA does not assign an input to a single cluster, but gives a probabilistic score for each identified cluster. This decomposition makes it possible to identify topics within the documents.

In order to compute this algorithm, you need to vectorize your data (you can use the one you have already done previously or make another one).

In [None]:
# Vectorize document using TF-IDF
vectorizer_lda = 

# Fit and Transform the documents
train_data = vectorizer_lda.fit_transform(clean_posts['Clean Body'].values)

You can use Gensim or scikit-learn to compute LDA.

In [None]:
#TODO

Assign a main topic to each document

In [None]:
topic_documents = 

Write a function that assigns a topic to the query


In [None]:
def get_topic_query(query, vectorizer=vectorizer_lda, lda_model=lda_model) -> int:
    #TODO
    return topic_query
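To make the steps concrete, here is a minimal scikit-learn sketch on a toy corpus: vectorize, fit LDA, take the most probable topic per document, then do the same for a query. All `toy_*` names are hypothetical, and two topics with raw counts are arbitrary choices for the example. On the real data, the number of topics needs tuning.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    "gradient descent optimizes neural network weights",
    "neural network training uses backpropagation and gradient descent",
    "pandas dataframe merge and groupby operations",
    "how to join two pandas dataframe columns",
    "learning rate schedule for neural network training",
    "read csv file into pandas dataframe",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# random_state fixes the initialization so that runs are repeatable
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topic = toy_lda.fit_transform(toy_counts)  # one topic distribution per row

# Main topic of each document: the topic with the highest probability
toy_topic_documents = toy_doc_topic.argmax(axis=1)

def toy_topic_query(query: str, vectorizer=toy_vectorizer, lda_model=toy_lda) -> int:
    """Assign the most probable topic to a query, reusing the fitted objects."""
    query_counts = vectorizer.transform([query])
    return int(lda_model.transform(query_counts).argmax(axis=1)[0])

toy_query_topic = toy_topic_query("gradient descent training")
```

Topic numbers carry no meaning by themselves; inspecting the top words in `toy_lda.components_` is a common way to interpret them.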

## Merge methods

Write an algorithm to merge the methods seen. Which methods should you use? How can you check whether they are relevant?

Answer here

In [None]:
def nlp_search_algorithm(query,
                         topic_documents=topic_documents,
                         vectors=vectors,
                         vectorizer=vectorizer,
                         vectorizer_lda=vectorizer_lda,
                         lda_model=lda_model,
                         embeddings=embeddings,
                         top_n=10
                         )->list:
    #TODO
 
    return matching_posts
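One simple way to merge several ranked lists, offered here only as a possible starting point, is reciprocal rank fusion: each document receives `1 / (k + rank)` from every list it appears in, and the sums are sorted. The function name, the example ids, and the conventional constant `k=60` are assumptions, not part of the notebook.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60, top_n: int = 10) -> list:
    """Merge ranked lists of document ids into a single ranking."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism
    ordered = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]

# Hypothetical outputs of a lexical search and a semantic search
lexical_ranking = [12, 7, 3, 40]
semantic_ranking = [7, 99, 12, 5]
fused_ranking = reciprocal_rank_fusion([lexical_ranking, semantic_ranking], top_n=3)
```

Because it only uses ranks, this approach does not require the scores of the different methods to be on the same scale.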




Once you have a list of possible results, you can rank it (you can use one of the ranking algorithms you wrote previously or come up with a new one):

In [None]:
def rank(possible_results):
    #to_do
    return possible_results

## Incorporation in the search engine

### Addition of metadata

You should now have a new set of metadata and data to add to your original index. You can load the index you built on Day 2 and add today's work to it.

In [None]:
#load previous data 

#TODO 

In [None]:
# add the new data to the previous index

Hint: if you changed the preprocessing function at the beginning of this notebook, make sure to update the `Clean Body` column.

### Adaptation to the index format

Adapt your nlp search results to the index format

In [None]:
def nlp_search_in_index(query,
                        index,
                        args):

    return matching_posts
  

### Compare the new searching and ranking method to the previous ones

Compare in terms of efficiency (precision, completeness, speed, memory usage)
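A comparison along these lines could start from a few hand-labelled queries. The sketch below shows precision and recall at `k` plus a small timing wrapper. The helper names and the example ids are hypothetical, and the relevance labels would have to come from your own judgment.

```python
import time

def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def timed(search_fn, query):
    """Run a search function and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = search_fn(query)
    return results, time.perf_counter() - start

# Hypothetical retrieved ids and hand-labelled relevant ids for one query
retrieved_ids = [1, 2, 3, 4]
relevant_ids = {2, 4, 9}
p_at_2 = precision_at_k(retrieved_ids, relevant_ids, 2)
r_at_4 = recall_at_k(retrieved_ids, relevant_ids, 4)
```

Averaging these numbers over several queries gives a fairer picture than a single example.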

### Merge all methods to make an efficient search algorithm