# DAY 3: Student version

**Machine Learning NLP**

The goal of this session is to improve the search engine using NLP features.

This notebook guides you through different techniques to explore. You are expected to be inventive and to improve on the techniques introduced.

First, let's import the useful packages and load the data.

## Installs

In [7]:
#  !pip install sentence-transformers --quiet

## Imports

In [None]:
import os
import re
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.sparse import find

from sentence_transformers import SentenceTransformer

import nltk
nltk.download('punkt')
nltk.download('wordnet')

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

## Load Data

In [None]:
# Only if you use Colab
# from google.colab import drive
# drive.mount('/content/drive')

import os

# TODO:
DATA_PATH = 'datascience.stackexchange.com/' 

# # CORR:
# DATA_PATH = '/content/drive/MyDrive/TP Centrale/data'
     

In [None]:
posts = pd.read_xml(os.path.join(DATA_PATH, 'Posts.xml'), parser="etree", encoding="utf8")
posts

## Data Cleaning

Implement a function to clean the posts. 

You can reuse what you have used in the Day 2 notebook or improve it.

In [None]:
def clean_post(text: str) -> str:
    """Strip HTML tags from a post body."""
    try:
        return re.sub(r'<.*?>', '', text)
    except TypeError:  # non-string input (NaN, None, etc.)
        return ""


In [None]:
posts['cleaned_body'] = posts.Body.apply(clean_post)

You can also implement a function that cleans the user's query.

This step is optional (if you don't think it is necessary, just return the query).

In [None]:
def clean_query(text:str)->str:
    # keep only letters
    cleaned_query = re.sub(r'[^a-zA-Z ]', '', text)
    return cleaned_query

## Text specific metadata

What metadata can you get from the text at your disposal? Which ones are relevant?

In [None]:
print(posts.columns)
# the relevant metadata is:
# the Title, the Score, the ViewCount, the Tags, the AnswerCount, the CommentCount and the FavoriteCount
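
Beyond the columns already present, some metadata can be derived from the text itself. The following is a minimal sketch, assuming raw HTML bodies like those in `Body`; the function name `text_metadata` and the feature names (`n_words`, `n_code_blocks`, `has_link`) are hypothetical choices for illustration, not part of the dataset.

```python
import re

def text_metadata(html_body: str) -> dict:
    """Derive simple, illustrative features from a raw HTML post body."""
    # Count <code> tags before stripping, since cleaning removes them
    n_code_blocks = len(re.findall(r"<code>", html_body))
    has_link = "<a href" in html_body
    # Same tag-stripping regex as clean_post
    text = re.sub(r"<.*?>", "", html_body)
    return {
        "n_words": len(text.split()),
        "n_code_blocks": n_code_blocks,
        "has_link": has_link,
    }

example = "<p>Use <code>fit</code> then <code>predict</code>.</p>"
print(text_metadata(example))
```

Such features could be applied with `posts.Body.apply(...)` on string bodies; whether they actually help ranking is something to check experimentally.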

## Classic Preprocessing

The goal for this part is to implement a classic vectorization (Count vectorizer, tfidf...).

You can do it on your own or use scikit-learn.

Hints: pay attention to stopwords, additional preprocessing steps, and vectorization techniques.
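
As one possible reading of these hints, here is a small sketch using scikit-learn's `TfidfVectorizer` on a toy corpus. The three documents and the parameter choices (English stopword list, bigrams, sublinear term frequency) are assumptions for illustration, not a recommended configuration for this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Gradient descent minimizes the loss function",
    "Stochastic gradient descent uses mini batches",
    "Random forests are ensembles of decision trees",
]

tfidf = TfidfVectorizer(
    stop_words="english",  # drop very common English words
    ngram_range=(1, 2),    # keep unigrams and bigrams
    sublinear_tf=True,     # use 1 + log(tf) instead of raw counts
)
doc_vectors = tfidf.fit_transform(docs)

print(doc_vectors.shape)
print("the" in tfidf.vocabulary_)
print("gradient descent" in tfidf.vocabulary_)
```

Each option trades vocabulary size against expressiveness, so it is worth comparing a few settings on real queries.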


In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(posts.cleaned_body.values)
vectors = vectorizer.transform(posts.cleaned_body.values)
print(vectors)

Write a function that applies the same process to the query


In [None]:
def vectorize_query(query : str, vectorizer=vectorizer):
    """vectorizes the query
    Args:
        query (str): query string
        vectorizer (optional): Defaults to vectorizer.

    Returns:
        query vectorized
    """
    query_vectorized = vectorizer.transform([query])

    return query_vectorized

Determine a way to use this vectorization to suggest the closest items to the entry in the database
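
One common option is cosine similarity, which divides the dot product by the vector norms so that long documents are not favored simply because they contain more words. The sketch below uses small dense NumPy arrays for readability (the real count matrix is sparse); the helper name `cosine_top_k` and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k rows of doc_matrix most similar to query_vec."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # Guard against division by zero for empty documents or queries
    denom = np.maximum(doc_norms * query_norm, 1e-12)
    scores = doc_matrix @ query_vec / denom
    # argsort is ascending, so reverse it and keep the first k
    return np.argsort(scores)[::-1][:k]

docs = np.array([[10.0, 10.0, 0.0],   # long document, mixed content
                 [1.0, 0.0, 0.0],     # short document, exact match
                 [0.0, 0.0, 3.0]])    # unrelated document
query = np.array([1.0, 0.0, 0.0])
print(cosine_top_k(query, docs, k=2))
```

With a raw dot product the long first document would score 10 against 1 for the exact match; after normalization the exact match ranks first.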

In [17]:
def vectorizer_search(query: str,
                      vectors=vectors,
                      vectorizer=vectorizer) -> pd.DataFrame:

    query_vectorized = vectorize_query(query, vectorizer)
    # dot product between the query and every post
    # (vectors are not normalized, so this is not a true cosine similarity)
    similarity = (vectors @ query_vectorized.T).toarray().ravel()
    # get the indices of the 5 most similar posts, best first
    indices = np.argsort(similarity)[::-1][:5]
    print(len(indices))

    # get the posts
    print(indices)
    closest_items = posts.iloc[indices]
    return closest_items

In [18]:
entry = 'what is stochastic gradient descent ?'

In [19]:
v = vectorizer_search(entry,vectorizer=vectorizer)
# print(v)
for _, row in v.iterrows():
    print(row['cleaned_body'])
    print('-------------------')

5
[45389 25089 74678 10551  8366]
I have been training a WGAN for a while now, with my generator training once in every five epochs.

I have tried several model architectures(no of filters) and also tried varying the relationship with each other. No matter what happens, my output is essentially noise. On further reading, it seems to be a classic case of convergence failure. 

Over time, my generator loss gets more and more negative while my discriminator loss remains around -0.4

My guess is that since the discriminator isn't improving enough, the generator doesn't get improve enough.

gen_loss = 0.0, disc_loss = -0.03792113810777664
Time for epoch 567 is 3.381150007247925 sec - gen_loss = 0.0, disc_loss = -0.037839196622371674
Time for epoch 568 is 3.3113789558410645 sec - gen_loss = 0.0, disc_loss = -0.040219761431217194
Time for epoch 569 is 3.2963240146636963 sec - gen_loss = 0.0, disc_loss = -0.04105686396360397
Time for epoch 570 is 9.3097665309906 sec - gen_loss = -0.31678220629

How can you improve this approach?

Answer here

## Semantic similarity

There are NLP methods that go further than word-by-word analysis by taking the context of the terms into account. Examples include Word2vec and BERT.

From the Sentence Transformers documentation: https://www.sbert.net/docs/pretrained_models.html choose the pre-trained model that you think is the most appropriate. Justify your choice.

In [20]:
sentence_transformer_model = 'distilbert-base-uncased'

Answer Here

⚠ To use Sentence Transformers, it is recommended to activate the GPU in Google Colab.

In [21]:
MODEL_ST = SentenceTransformer(sentence_transformer_model)

No sentence-transformers model found with name /home/himmi/.cache/torch/sentence_transformers/distilbert-base-uncased. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /home/himmi/.cache/torch/sentence_transformers/distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Use this algorithm to encode the data in the database

In [22]:
# keep only 100 posts for the sake of computation
posts = posts.iloc[:100]
embeddings = MODEL_ST.encode(posts.cleaned_body.values, normalize_embeddings=True)  # do not run on the full dataset: too slow

*If this process is slow, you can save this array in case you need to load it again*

In [23]:
# import pickle

# with open(os.path.join(DATA_PATH, 'embeddings.pkl'), 'wb') as file:
#     pickle.dump(embeddings, file)
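
As an alternative to pickle, NumPy's own `np.save` and `np.load` can store the embedding array directly. This is a minimal sketch: the random array stands in for real embeddings, and the temporary directory and file name are assumptions for the demo (in the notebook you would write under `DATA_PATH` instead).

```python
import os
import tempfile
import numpy as np

# Stand-in for the real embeddings array (hypothetical shape)
embeddings_demo = np.random.rand(4, 8).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npy")
    np.save(path, embeddings_demo)   # write the array to disk
    reloaded = np.load(path)         # read it back into memory

print(np.array_equal(embeddings_demo, reloaded))
```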

Make a function that transforms the input

In [24]:
def encode_query(query: str) -> np.ndarray:
    
    encoded_query = MODEL_ST.encode(query, normalize_embeddings=True)
    return encoded_query

Which distance is most relevant to measure the distance between the input and the data?

Answer here
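
A useful fact when thinking about this question: for unit-length vectors (which is what `normalize_embeddings=True` produces), the dot product equals the cosine similarity, and the squared Euclidean distance is `2 - 2 * cosine`. Both measures therefore give the same ranking. The sketch below checks this identity numerically on two random vectors, which are stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=16)
b = rng.normal(size=16)

# Normalize both vectors to unit length
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
squared_euclidean = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(squared_euclidean, 2 - 2 * cosine))
```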

Write a function that returns a matrix containing information about the similarity between the query and the data

In [25]:
def similarity(query, embeddings=embeddings):
    
    query_embedding = encode_query(query)
    similarity_matrix = np.dot(query_embedding, embeddings.T)
    return similarity_matrix

In [26]:
query = 'what is stochastic gradient descent ?'
matrix_similarity = similarity(query, embeddings)

How do you determine which documents in the data set most closely match the input?

In [6]:
def ordre_en_fonction_similarité(matrix_similarity):
    
    ordre = np.argsort(matrix_similarity)[::-1]
    return ordre

In [27]:
print(ordre_en_fonction_similarité([0.6, 0.8, 0.7]))
# assert list(ordre_en_fonction_similarité([0.6, 0.8, 0.7])) == [1, 2, 0]

[1 2 0]
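
A full `argsort` orders every document even when only the first few are needed. On a large collection, one possible refinement is `np.argpartition`, which isolates the `k` best scores first and sorts only those. The helper name `top_k_indices` is hypothetical; this is a sketch, not a required part of the exercise.

```python
import numpy as np

def top_k_indices(scores, k: int) -> np.ndarray:
    """Return the indices of the k highest scores, best first."""
    scores = np.asarray(scores)
    k = min(k, scores.size)
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition puts the k largest scores in the last k slots, unordered
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, best first
    return candidates[np.argsort(scores[candidates])[::-1]]

print(top_k_indices([0.6, 0.8, 0.7, 0.1], k=2))
```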


Put it all together in a function.

In [23]:
def closest_semantic_doc(query, embeddings=embeddings, top_n=10):

    matrix_similarity = similarity(query, embeddings)
    ordre = ordre_en_fonction_similarité(matrix_similarity)
    closest_posts = posts.iloc[ordre[:top_n]]
    return closest_posts

In [27]:
results = closest_semantic_doc(query)
# for _, row in results.iterrows():
#     print(row['cleaned_body'])
#     print('-------------------')

What are the data conditions that we should watch out for, where p-values may not be the best way of deciding statistical significance?  Are there specific problem types that fall into this category?

-------------------
Logic often states that by overfitting a model, its capacity to generalize is limited, though this might only mean that overfitting stops a model from improving after a certain complexity. Does overfitting cause models to become worse regardless of the complexity of data, and if so, why is this the case?



Related: Followup to the question above, "When is a Model Underfitted?"

-------------------
If small p-values are plentiful in big data, what is a comparable replacement for p-values in data with million of samples?

-------------------
Assume that we have a set of elements E and a similarity (not distance) function sim(ei, ej) between two elements ei,ej ∈ E. 

How could we (efficiently) cluster the elements of E, using sim?

k-means, for example, requires a given 

What methods could be used to improve the recommendations of this algorithm?

Answer here

Unnamed: 0,Id,PostTypeId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastActivityDate,Title,Tags,...,ContentLicense,AcceptedAnswerId,LastEditorUserId,LastEditDate,ParentId,OwnerDisplayName,CommunityOwnedDate,LastEditorDisplayName,FavoriteCount,cleaned_body
53,71,1,2014-05-14T22:12:37.203,14,759.0,<p>What are the data conditions that we should...,179.0,2014-05-15T08:25:47.933,When are p-values deceptive?,<bigdata><statistics>,...,CC BY-SA 3.0,84.0,,,,,,,,What are the data conditions that we should wa...
44,61,1,2014-05-14T18:09:01.940,56,16477.0,<p>Logic often states that by overfitting a mo...,158.0,2017-09-17T02:27:31.110,Why Is Overfitting Bad in Machine Learning?,<machine-learning><predictive-modeling>,...,CC BY-SA 3.0,62.0,-1.0,2017-04-13T12:50:41.230,,,,,,Logic often states that by overfitting a model...
57,75,1,2014-05-15T00:26:11.387,5,166.0,<p>If small p-values are plentiful in big data...,158.0,2019-05-07T04:16:29.673,Is there a replacement for small p-values in b...,<statistics><bigdata>,...,CC BY-SA 3.0,78.0,1330.0,2019-05-07T04:16:29.673,,,,,,"If small p-values are plentiful in big data, w..."
81,103,1,2014-05-16T14:26:12.270,24,9227.0,<p>Assume that we have a set of elements <em>E...,113.0,2021-06-28T09:13:21.753,Clustering based on similarity scores,<clustering><algorithms><similarity>,...,CC BY-SA 3.0,,,,,,,,,Assume that we have a set of elements E and a ...
98,123,5,2014-05-17T21:10:41.990,0,,<p>The most basic relationship to describe is ...,53.0,2014-05-20T13:50:21.763,,,...,CC BY-SA 3.0,,53.0,2014-05-20T13:50:21.763,,,,,,The most basic relationship to describe is a l...
5,15,1,2014-05-14T01:41:23.110,2,656.0,<p>In which situations would one system be pre...,64.0,2014-05-14T01:41:23.110,What are the advantages and disadvantages of S...,<databases>,...,CC BY-SA 3.0,,,,,,,,,In which situations would one system be prefer...
27,41,1,2014-05-14T11:15:40.907,55,10152.0,<p>R has many libraries which are aimed at Dat...,136.0,2019-02-23T11:34:41.513,Is the R language suitable for Big Data,<bigdata><r>,...,CC BY-SA 3.0,44.0,118.0,2014-05-14T13:06:28.407,,,,,,R has many libraries which are aimed at Data A...
55,73,2,2014-05-14T22:43:23.587,5,,<p>You shouldn't consider the p-value out of c...,14.0,2014-05-14T22:43:23.587,,,...,CC BY-SA 3.0,,,,71.0,,,,,You shouldn't consider the p-value out of cont...
56,74,2,2014-05-14T22:58:11.583,2,,<p>One thing you should be aware of is the sam...,64.0,2014-05-14T22:58:11.583,,,...,CC BY-SA 3.0,,,,71.0,,,,,One thing you should be aware of is the sample...
87,109,4,2014-05-16T20:24:38.980,0,,An activity that seeks patterns in a continuou...,200.0,2014-05-20T13:52:00.620,,,...,CC BY-SA 3.0,,200.0,2014-05-20T13:52:00.620,,,,,,An activity that seeks patterns in a continuou...


## Text clustering (BONUS)

We can use topic modeling techniques to identify groups of texts among our document base and classify the input to restrict the application of the proximity calculations seen previously.

#### LDA

Latent Dirichlet Allocation (LDA) is a topic modeling algorithm that performs soft clustering: instead of assigning an input to a single cluster, it gives a probabilistic score for each identified cluster. This decomposition makes it possible to identify topics within the documents.

To run this algorithm, you need to vectorize your data (you can reuse the vectorization you did previously or make another one).

In [29]:
# Vectorize document using TF-IDF
vectorizer_lda = TfidfVectorizer(stop_words='english', max_features=1000)

# Fit and Transform the documents
train_data = vectorizer_lda.fit_transform(posts.cleaned_body.values) 

You can use Gensim or scikit-learn to compute LDA.

In [30]:
# Init the Model
lda_model = LatentDirichletAllocation(n_components=10, max_iter=10)

# Fit the Model
lda_model.fit(train_data)



Assign a main topic to each document

In [31]:
topic_documents = lda_model.transform(train_data)

print(topic_documents)


[[0.01808268 0.8372429  0.01808522 0.01808199 0.01808166 0.01808664
  0.018082   0.01808126 0.01808391 0.01809172]
 [0.01924021 0.01924048 0.01923927 0.01924172 0.01924096 0.82683405
  0.01924181 0.01923931 0.01923989 0.01924231]
 [0.01471923 0.01472343 0.01471998 0.01471924 0.01471926 0.01472362
  0.0147213  0.01471876 0.0147193  0.86751587]
 [0.82627387 0.01930321 0.0192983  0.01930017 0.01930394 0.01930413
  0.01930715 0.01929825 0.01930288 0.0193081 ]
 [0.02097993 0.0209786  0.02097855 0.02097818 0.02097826 0.81118557
  0.02098118 0.02097949 0.02097834 0.02098192]
 [0.02662726 0.02662849 0.02662705 0.02662705 0.02662706 0.02662763
  0.02662705 0.02662721 0.02662723 0.76035397]
 [0.01915965 0.01917139 0.8275217  0.01916477 0.01915682 0.01916545
  0.0191633  0.01915639 0.01916189 0.01917865]
 [0.02740677 0.02740769 0.02740714 0.02740677 0.02740718 0.02740677
  0.02740684 0.02740677 0.75333704 0.02740705]
 [0.1        0.1        0.1        0.1        0.1        0.1
  0.1        0.1   
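
One simple way to turn these soft scores into a main topic is to take the most probable topic of each row, then keep only the documents that share the query's main topic. The arrays below are small stand-ins for the output of `lda_model.transform`, and the variable names are hypothetical.

```python
import numpy as np

# Stand-in for lda_model.transform(...): one row per document, one column per topic
doc_topics = np.array([[0.1, 0.8, 0.1],
                       [0.7, 0.2, 0.1],
                       [0.2, 0.7, 0.1]])
query_topics = np.array([0.15, 0.75, 0.10])

main_topic = doc_topics.argmax(axis=1)   # most probable topic per document
query_topic = int(query_topics.argmax())

# Restrict the search to documents sharing the query's main topic
candidates = np.flatnonzero(main_topic == query_topic)
print(main_topic, candidates)
```

Note that a uniform row (every topic at the same probability) carries no topic signal: `argmax` then returns topic 0 only because it is the first index, so such rows may deserve separate handling.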

Write a function that assigns a topic to the query


In [40]:
def get_topic_query(query, vectorizer=vectorizer_lda, lda_model=lda_model):

    query_vectorized = vectorizer.transform([query])
    topic_query = lda_model.transform(query_vectorized)
    return topic_query[0]

query = 'what is stochastic gradient descent ?'
topic_query = get_topic_query(query)
print(topic_query)

[0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]


## Merge methods

Write an algorithm to merge the methods seen. Which methods should you use? How can you check whether they are relevant?

Answer here
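
One possible way to merge rankings coming from different methods is reciprocal rank fusion, which only needs each method's ordered list of document ids and so avoids comparing scores on different scales. This is a sketch under assumptions: the function name, the toy rankings, and the constant `k = 60` (a conventional default) are illustrative, and other merging schemes are equally valid.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked lists of document ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked high in any list receive a larger contribution
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = [3, 1, 2]    # hypothetical ids returned by the count-vector search
semantic = [1, 4, 3]   # hypothetical ids returned by the embedding search
print(reciprocal_rank_fusion([lexical, semantic]))
```

Documents that appear in both lists rise to the top, while documents found by a single method are kept further down.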

In [42]:
def nlp_search_algorithm(query,
                         topic_documents=topic_documents,
                         vectors=vectors,
                         vectorizer=vectorizer,
                         vectorizer_lda=vectorizer_lda,
                         lda_model=lda_model,
                         embeddings=embeddings,
                         top_n=10
                         )->list:
    
    
    return matching_posts

results = nlp_search_algorithm(query)
for _, row in results.iterrows():
    print(row['cleaned_body'])
    print('-------------------')


ValueError: X has 2515 features, but LatentDirichletAllocation is expecting 1000 features as input.

Once you have a list of possible results, you can rank it (you can use one of the ranking algorithms you wrote previously or make up a new one):

In [None]:
def rank(possible_results):
    #to_do
    return possible_results

## Incorporation in the search engine

### Addition of metadata

You should now have a new set of metadata and data to add to your original index. You can load the index you built in Day 2 and add today's work to it.

In [None]:
#load previous data 

#TODO 

In [None]:
# add the new data to the previous index

Hint: if you changed the preprocessing function at the beginning of this notebook, make sure to update the Clean Body attribute.

### Adaptation to the index format

Adapt your NLP search results to the index format.

In [None]:
def nlp_search_in_index(query,
                        index,
                        args):

    return matching_posts
  

### Compare the new searching and ranking method to the previous ones

Compare in terms of efficiency (precision, completeness, speed, memory usage)
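
The comparison can be sketched as follows, reading "completeness" as recall (an interpretation, not something the notebook states). The relevance labels and post ids are hypothetical; in practice you would label a handful of queries by hand. `timed` wraps any search function to measure its wall-clock time.

```python
import time

def precision_recall_at_k(retrieved, relevant, k: int):
    """Precision and recall of the first k retrieved ids (k must be positive)."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def timed(search_fn, query):
    """Run search_fn(query) and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = search_fn(query)
    return results, time.perf_counter() - start

retrieved_ids = [71, 61, 75, 103, 123]   # hypothetical search output
relevant_ids = {71, 75, 73}              # hypothetical hand-made labels
print(precision_recall_at_k(retrieved_ids, relevant_ids, k=5))

results, elapsed = timed(lambda q: retrieved_ids, "stochastic gradient descent")
print(f"{elapsed:.6f} s")
```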

### Merge all methods to make an efficient search algorithm