# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 2:** Word and Sentence Embeddings

## Word Embedding 

![](https://qph.fs.quoracdn.net/main-qimg-3e812fd164a08f5e4f195000fecf988f)


**Key takeaways** from lessons and in-class practices:
- Word embeddings are able to map words into a semantic-aware vector space
- There are multiple architectures for the generation of word embeddings
- Each architecture has its advantages and disadvantages
- Word embedding evaluation can be intrinsic (intermediate tasks) or extrinsic (downstream tasks)
- You can either use a pre-trained word embedding model or train one from scratch on a large amount of text


### **Question 1**

Train a new Word2Vec model using gensim with the text8 corpus available in the python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Compute the training time for the model and store it for subsequent steps.
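A minimal sketch of one way to measure a training time and keep it for later comparison. The helper `timed` and the dictionary `training_times` are hypothetical names introduced only for this illustration, and the workload is a stand-in: in the notebook the timed call would be the model training itself.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical registry of training times, keyed by model name
training_times = {}

# Stand-in workload; in the lab this would be something like
# timed(Word2Vec, dataset) instead of summing integers
_, training_times["demo"] = timed(sum, range(1_000_000))
print(f"demo took {training_times['demo']:.4f} s")
```

Storing the elapsed time in a variable (rather than only printing it with `%%time`) makes it available when comparing Word2Vec with the other models in the following questions.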

In [None]:
! pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.1.2


In [None]:
import gensim.downloader as api
from gensim.models import Word2Vec

dataset = api.load("text8")  # load dataset as iterable




In [None]:
%%time
model = Word2Vec(dataset)  # train w2v model
model.save("word2vec.model")


CPU times: user 1.7 ms, sys: 999 µs, total: 2.69 ms
Wall time: 6.45 ms


(0, 1)

### **Question 2**:
Perform intrinsic evaluation of the model for the task of word analogy by exploiting the data collection available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv). 

1. read CSV file
2. group analogy entries by type (column: `type`)
3. for each type entry (**in the lab, just set type="family"** to reduce the required time) use the first 3 word vectors to compute the fourth
    - Entry: `Athens,Greece,Baghdad,Iraq`
    - `v(Greece) - v(Athens) + v(Baghdad) = res_v` 
    - Get the most similar vectors to `res_v`
    - Compute in how many cases the correct word is among the top K (i.e., whether `Iraq` is among the K words most similar to `res_v`) with `K = 1, 3, 5, 10`

$top(k) = \dfrac{\sum_{i=1}^{|E|} f(i)}{|E|}$

where $f(i) = 1$ if the target word of entry $i$ is among the top k and $f(i) = 0$ otherwise.

$|E|$ is the total number of entries for the considered type.
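The metric above can be sketched in a few lines of plain Python. The function name `top_k_accuracy` and the ranked predictions are made up for illustration; they are not produced by any model in this notebook.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of entries whose target appears among the first k predictions."""
    hits = sum(
        1
        for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)

# Made-up ranked outputs for three analogy entries (best candidate first)
ranked = [
    ["sister", "mother", "aunt"],
    ["mother", "queen", "niece"],
    ["uncle", "father", "son"],
]
targets = ["sister", "queen", "daughter"]

for k in (1, 2, 3):
    print(f"top({k}) = {top_k_accuracy(ranked, targets, k):.3f}")
```

Note that top(k) can only grow or stay the same as k increases, since a hit at rank r counts for every k ≥ r.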

**Notes:**
1. Try with the model trained on `text8`, is there any issue?
2. Test the model trained on Google News available in gensim.
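The vector arithmetic of step 3 can be illustrated with hand-made 2-dimensional vectors. This is only a toy sketch: the vectors, the `analogy` helper, and the extra distractor words are assumptions chosen so that the arithmetic is easy to follow, not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy vectors: axis 0 separates capitals from countries, axis 1 separates regions
toy = {
    "Athens": [0.0, 1.0],
    "Greece": [1.0, 1.0],
    "Baghdad": [0.0, -1.0],
    "Iraq": [1.0, -1.0],
    "Rome": [0.0, 0.3],
    "Italy": [1.0, 0.3],
}

def analogy(vectors, word1, word2, word3, topn=1):
    # res_v = v(word2) - v(word1) + v(word3)
    res_v = [b - a + c for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])]
    # The three query words are excluded from the candidates
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return sorted(candidates, key=lambda w: cosine(res_v, vectors[w]), reverse=True)[:topn]

print(analogy(toy, "Athens", "Greece", "Baghdad", topn=3))
```

Excluding the query words matters: the nearest neighbours of `res_v` are often the input words themselves. To my understanding, gensim's `most_similar` filters them out when the query is given as words (`positive=[...]`, `negative=[...]`), but it cannot do so when it receives a precomputed vector.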



In [None]:
%%capture
! wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv
! pip install --upgrade pandas

In [None]:
from tqdm import tqdm

def score_word_embedding_model(complete_df, model, analogy_type="family", MAX_K=10):
    # Full models (Word2Vec, FastText) keep their vectors under .wv;
    # pre-trained models loaded with gensim.downloader are already KeyedVectors
    keyed_vectors = getattr(model, "wv", model)

    # top_dict[k] counts the entries whose target is among the top-k neighbours
    top_dict = {k: 0 for k in range(1, MAX_K + 1)}

    temp_df = complete_df.loc[complete_df['type'] == analogy_type]
    word_1_list = temp_df["word1"].tolist()
    word_2_list = temp_df["word2"].tolist()
    word_3_list = temp_df["word3"].tolist()
    target_list = temp_df["target"].tolist()

    for i, _ in enumerate(tqdm(word_1_list)):
        try:
            # v(word2) - v(word1) + v(word3); the query words are excluded from the results
            most_similar_words = keyed_vectors.most_similar(
                positive=[word_2_list[i], word_3_list[i]],
                negative=[word_1_list[i]],
                topn=MAX_K,
            )
            most_similar_words_list = [w[0] for w in most_similar_words]
            if target_list[i] in most_similar_words_list:
                index = most_similar_words_list.index(target_list[i])
                # a hit at rank index+1 counts for every k >= index+1
                for pk in range(index + 1, MAX_K + 1):
                    top_dict[pk] += 1
        except KeyError as e:
            # raised when one of the query words is out of vocabulary
            print(e)

    print("-------------------------")
    print("Results for", analogy_type)
    print("-------------------------")
    for k in top_dict.keys():
        print("@" + str(k), "=", top_dict[k] / len(target_list))

In [None]:
# Executing this cell could take ~5 minutes
import gensim.downloader
w2v_google_news_model = gensim.downloader.load('word2vec-google-news-300')

In [None]:
import pandas as pd
df = pd.read_csv("google_analogies.csv")

df_family = df[df["type"]=="family"]
df_family.reset_index(drop=True, inplace=True)
print(df_family)

     Unnamed: 0    type  word1 word2        word3        target
0          8363  family    boy  girl      brother        sister
1          8364  family    boy  girl     brothers       sisters
2          8365  family    boy  girl          dad           mom
3          8366  family    boy  girl       father        mother
4          8367  family    boy  girl  grandfather   grandmother
..          ...     ...    ...   ...          ...           ...
501        8864  family  uncle  aunt          son      daughter
502        8865  family  uncle  aunt         sons     daughters
503        8866  family  uncle  aunt  stepbrother    stepsister
504        8867  family  uncle  aunt   stepfather    stepmother
505        8868  family  uncle  aunt      stepson  stepdaughter

[506 rows x 6 columns]


In [None]:
arr = []
for i in range(len(df_family)):
    word1 = df_family['word1'][i]
    word2 = df_family['word2'][i]
    word3 = df_family['word3'][i]
    target = df_family['target'][i]

    # gensim.downloader returns KeyedVectors, which are indexed directly (no .wv)
    vec1 = w2v_google_news_model[word1]
    vec2 = w2v_google_news_model[word2]
    vec3 = w2v_google_news_model[word3]

    # v(word2) - v(word1) + v(word3), as in the Athens/Greece/Baghdad example
    res_v = vec2 - vec1 + vec3

    result = w2v_google_news_model.most_similar(positive=[res_v], topn=3)
    # 1 if the target is among the 3 nearest neighbours, 0 otherwise
    arr.append(int(target in [w for w, _ in result]))




In [None]:
import numpy as np
arr = np.array(arr)
r = len(df_family)
accuracy = arr.sum() / r
print("The accuracy with k=3 is:",accuracy)


In [None]:
# Recorded results: with k=1, no prediction matched the target word
# With k=3, 14 predictions matched the target word

# The problems with text8 were related to the fact that words like "stepbrother" were not in the vocabulary

### **Question 3:**

Train a new FastText model using gensim with text8 corpus available in the python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Compute the training time for the model and store it for subsequent steps. 

- Is there any significant difference in training time compared with Word2Vec training?

In [None]:
import gensim.downloader as api
from gensim.models import FastText
import time
dataset = api.load("text8")
start = time.time()
ft_model = FastText(dataset)
end = time.time()

print ("(FastText on Text8) Training took", end-start, "seconds")

(FastText on Text8) Training took 541.6233084201813 seconds


In [None]:
ft_model.save("text8_ft.model")

### **Question 4:**
Score the FastText model by exploiting the same methodology presented in Q2. 

**Notes:**
- Is there any issue similar to those observed with the Word2Vec model?
- Test the model trained on Wikipedia+News available in gensim.
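One reason FastText may behave differently from Word2Vec on rare or unseen words is that it represents a word through its character n-grams. The sketch below only shows how such n-grams can be extracted; the function `char_ngrams` is a hypothetical helper, and the real FastText implementation additionally hashes the n-grams into buckets and keeps a vector for the whole word.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in the boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("aunt", min_n=3, max_n=4))

# Words that share a prefix or suffix also share some of their n-grams
shared = set(char_ngrams("stepbrother")) & set(char_ngrams("stepsister"))
print(sorted(shared))
```

Because a vector can be composed from n-grams, a FastText model trained with subword information can still produce a representation for a word such as "stepbrother" even when the exact word is missing from the training corpus. The extra n-gram vectors are also a plausible reason for a longer training time.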

In [None]:
import gensim.downloader
ft_wiki_news_model = gensim.downloader.load('fasttext-wiki-news-subwords-300')



In [None]:
print ("Text8 model - FastText")
score_word_embedding_model(df_family,ft_model)
print ("Wikipedia+News model - FastText")
score_word_embedding_model(df_family,ft_wiki_news_model)

Text8 model - FastText


100%|██████████| 506/506 [00:02<00:00, 197.19it/s]


-------------------------
Results for family
-------------------------
@1 = 0.2845849802371542
@2 = 0.3952569169960474
@3 = 0.4505928853754941
@4 = 0.4762845849802372
@5 = 0.4980237154150198
@6 = 0.5177865612648221
@7 = 0.5375494071146245
@8 = 0.5513833992094862
@9 = 0.5711462450592886
@10 = 0.5790513833992095
Wikipedia+News model - FastText


100%|██████████| 506/506 [01:07<00:00,  7.46it/s]

-------------------------
Results for family
-------------------------
@1 = 0.849802371541502
@2 = 0.924901185770751
@3 = 0.950592885375494
@4 = 0.9525691699604744
@5 = 0.9604743083003953
@6 = 0.9664031620553359
@7 = 0.9723320158102767
@8 = 0.9802371541501976
@9 = 0.9822134387351779
@10 = 0.9841897233201581





### **Question 5** (optional) 
Evaluate Word2Vec and FastText models on the analogy task for the whole dataset (include all analogy types).

## Sentence Embeddings

Key takeaways from lessons and in-class practices:
- Doc2Vec is an extension of the Word2Vec framework
- It incorporates a document ID to obtain a more accurate representation of a document/paragraph
- Vectors for the training documents are pre-computed; vectors for new documents can be inferred
- InferSent uses a deep learning architecture to learn sentence representations in a supervised way
- InferSent can use either Word2Vec or FastText as its word embedding model.

### **Question 6:**

Train a Doc2Vec model using gensim with text8 corpus. Compute the training time for the model and store it for subsequent steps.

In [None]:
import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

dataset = api.load("text8")
data = [d for d in dataset]

def tagged_document(list_of_list_of_words):
    # Wrap each token list in a TaggedDocument, using its position as the document tag
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield TaggedDocument(list_of_words, [i])

data_for_training = list(tagged_document(data))

In [None]:
# Train Doc2Vec on the tagged text8 documents
model = Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
d2v_model = model  # name used by the classification cells in Question 8

### **Question 7 (qualitative Evaluation)**
Perform some qualitative experiments by computing the cosine similarity between sentences that you write yourself.
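As a reminder of what the similarity score means, here is a minimal pure-Python sketch of cosine similarity on made-up vectors. The function name `cosine` is chosen for this illustration; in the notebook the same quantity is computed with scikit-learn.

```python
import math

def cosine(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The score depends only on the angle between the two vectors, not on their length, so it ranges from -1 to 1. When using scikit-learn's `cosine_similarity`, each vector has to be passed as a 2-D array of shape `(1, dim)`, which is why the inferred vectors are reshaped with `reshape(1, -1)`.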

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

vector_1 = model.infer_vector(["university", "course"])
vector_2 = model.infer_vector(["college", "graduate"])
print (vector_1.shape)
print (cosine_similarity(vector_1.reshape(1, -1), vector_2.reshape(1, -1)))

(40,)
[[0.5709212]]


### **Question 8** (Extrinsic Evaluation)

Extrinsic evaluation measures the performance of a word/sentence/paragraph embedding model on a downstream NLP task (e.g., Text Classification).

We can use different configurations, training corpora, or even different models to build a complete architecture for the task at hand.

For this practice we use the text classification dataset available [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P2/news_headline_classification.csv) - [source: Kaggle](https://www.kaggle.com/rmisra/news-category-dataset)

**Note:** consider using just the first 10,000 headlines to reduce runtime during the lab.

Compute the accuracy of the classification models, each built on one of the embedding models introduced in this practice:
- Word2Vec model pretrained on Google News corpus
- FastText model pretrained on Wikipedia+News corpus
- **[Optional]** Doc2Vec model pretrained on Text8 corpus
- **[Optional]** InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent)

The procedure to create a classification system is sketched below:
1. Choose a machine learning (multi-class) classifier (e.g., MLP)
2. Split the data collection into train/test sets (80%/20%)
3. Use the text vectors obtained from the pretrained model as input to the classifier
4. Measure the accuracy of the classification system
5. Repeat steps 3-4 using different embedding models 


**Note:** You need to choose an aggregation function (e.g., average) to obtain sentence embeddings from word vectors.
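A minimal sketch of the average aggregation function, using plain Python lists and a made-up two-word vocabulary. The helper name `average_embedding` and the toy vectors are assumptions for illustration only.

```python
def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of the known tokens; zero vector if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over the components of all known vectors at once
    return [sum(component) / len(known) for component in zip(*known)]

# Made-up 2-dimensional word vectors
toy_vectors = {"stocks": [1.0, 0.0], "fall": [0.0, 1.0]}

print(average_embedding(["stocks", "fall", "sharply"], toy_vectors, dim=2))
print(average_embedding(["unknown"], toy_vectors, dim=2))
```

Out-of-vocabulary tokens are simply skipped, and a sentence with no known token falls back to a zero vector whose size must match the embedding dimension of the model in use.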

Which model has better performance? Report the performance of each variant of the classification system.
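When reporting accuracy, a simple reference point helps to judge whether a variant is actually learning something. The sketch below, on made-up data, shows an 80%/20% split and the accuracy of a majority-class baseline. The helper names, the labels, and the samples are all hypothetical, and the notebook itself relies on scikit-learn for the split and the metric.

```python
import random
from collections import Counter

def split_80_20(samples, labels, seed=42):
    """Shuffle the indices with a fixed seed, then cut at 80%."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return (
        [samples[i] for i in train_idx],
        [samples[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up data: 10 one-dimensional samples and 2 hypothetical categories
samples = [[float(i)] for i in range(10)]
labels = ["POLITICS"] * 7 + ["SPORTS"] * 3

X_train, X_test, y_train, y_test = split_80_20(samples, labels)

# Baseline: always predict the most frequent training label
majority = Counter(y_train).most_common(1)[0][0]
baseline = accuracy(y_test, [majority] * len(y_test))
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

Using the same split (same seed) for every embedding variant keeps the reported accuracies comparable.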

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv

In [None]:
import pandas as pd
df_news_clf = pd.read_csv("news_headline_classification.csv")
list_sentences = df_news_clf["headline"].tolist()
list_sentences = list_sentences[:10000]
list_labels = df_news_clf["category"].tolist()
list_labels = list_labels[:10000]

**Word2Vec + Average aggregation function**

In [None]:
from nltk import word_tokenize
import numpy as np
from tqdm import tqdm
list_w2v_vectors = []
for s in tqdm(list_sentences):
    words = word_tokenize(s)
    words_vectors = []
    for w in words:
        try:
            words_vectors.append(w2v_google_news_model[w])
        except Exception as e:
            continue
    if len(words_vectors) > 0:
        sentence_vector = np.mean(words_vectors, axis=0)
    else:
        sentence_vector = np.zeros(300)
    list_w2v_vectors.append(sentence_vector)

# Free the memory held by the pre-trained Google News vectors
del w2v_google_news_model

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(list_w2v_vectors, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

**FastText + Average aggregation function**

In [None]:
from nltk import word_tokenize
import numpy as np
list_ft_vectors = []
for s in tqdm(list_sentences):
    words = word_tokenize(s)
    words_vectors = []
    for w in words:
        try:
            words_vectors.append(ft_wiki_news_model[w])
        except Exception as e:
            #print (e)
            continue
    if len(words_vectors) > 0:
        sentence_vector = np.mean(words_vectors, axis=0)
    else:
        sentence_vector = np.zeros(300)
    list_ft_vectors.append(sentence_vector)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(list_ft_vectors, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

del ft_wiki_news_model
del ft_model
del mlp

**Doc2Vec (Text8)**

In [None]:
from nltk import word_tokenize
import numpy as np
list_d2v_vectors = []
for s in tqdm(list_sentences):
    words = word_tokenize(s)
    try:
        sentence_vector = d2v_model.infer_vector(words)
    except Exception as e:
        print (e)
        sentence_vector = np.zeros(d2v_model.vector_size)  # fallback must match the Doc2Vec dimension

    list_d2v_vectors.append(sentence_vector)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(list_d2v_vectors, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

del d2v_model
del mlp

**InferSent**

In [None]:
%%capture
# InferSent download required files

! mkdir fastText
! curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
! unzip fastText/crawl-300d-2M.vec.zip -d fastText/
! mkdir encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
! git clone https://github.com/facebookresearch/InferSent.git

In [None]:
from InferSent.models import InferSent
import torch
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

In [None]:
infersent.build_vocab(list_sentences, tokenize=True)

In [None]:
from nltk import word_tokenize
import numpy as np
infersent_embeddings = infersent.encode(list_sentences, tokenize=True)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(infersent_embeddings, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))