<a href="https://colab.research.google.com/github/LeoMaggio/Deep-NLP/blob/main/practices/P2/Practice_2_Word_and_Sentence_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 2:** Word and Sentence Embeddings

## Word Embedding 

![](https://qph.fs.quoracdn.net/main-qimg-3e812fd164a08f5e4f195000fecf988f)


**Key takeaways** from lessons and in-class practices:
- Word embeddings map words into a semantic-aware vector space
- There are multiple architectures for generating word embeddings
- Each architecture has its advantages and disadvantages
- Word embedding evaluation can be intrinsic (intermediate tasks) or extrinsic (downstream tasks)
- It is possible to use pre-trained word embedding models or to train a model from scratch on a large amount of text


### **Question 1**

Train a new Word2Vec model using gensim on the text8 corpus, which is available through the gensim downloader ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time of the model and store it for subsequent steps.

In [1]:
%%capture
!pip install --upgrade gensim

In [22]:
import gensim.downloader as api
import time
from gensim.models import Word2Vec

dataset = api.load("text8")  # load dataset as iterable
start = time.time()
w2v_model = Word2Vec(dataset)  # train w2v model
w2v_time = time.time() - start

print(f"Training time: {w2v_time}s")

Training time: 170.66526007652283s


### **Question 2**:
Perform intrinsic evaluation of the model for the task of word analogy by exploiting the data collection available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv). 

1. read CSV file
2. group analogy entries by type (column: `type`)
3. for each type entry (**in the lab, just set type="family"** to reduce the required time) use the first 3 word vectors to compute the fourth
    - Entry: `Athens,Greece,Baghdad,Iraq`
    - `v(Greece) - v(Athens) + v(Baghdad) = res_v` 
    - Get the most similar vectors to `res_v`
    - Compute in how many cases the correct word is among the top K (if `v[Iraq]` is among the K most similar words) with `K = 1, 3, 5, 10`

$top(k) = \dfrac{\sum_{i=1}^{|E|} f(i)}{|E|}$

where $f(i) = 1$ if the target word of entry $i$ is among the top k and $f(i) = 0$ otherwise.

$|E|$ is the total number of entries for the considered type.
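
The vector arithmetic and the $top(k)$ metric above can be sketched on a toy vocabulary. The 2-d vectors and the helpers `analogy` and `top_k_accuracy` are assumptions made for illustration; they are not part of gensim. With a real model, `most_similar(positive=[word2, word3], negative=[word1], topn=k)` plays the role of `analogy`, and its ranking may differ slightly because gensim normalizes the vectors before combining them.

```python
import numpy as np

# Hypothetical 2-d toy embeddings (illustrative values only).
toy_vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, -2.0]),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(vectors, word1, word2, word3, topn):
    """Rank candidates for word1 : word2 = word3 : ? by cosine similarity."""
    res_v = vectors[word2] - vectors[word1] + vectors[word3]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    candidates.sort(key=lambda w: cosine(vectors[w], res_v), reverse=True)
    return candidates[:topn]

def top_k_accuracy(vectors, entries, k):
    """Fraction of entries whose target is among the k best candidates."""
    hits = sum(target in analogy(vectors, w1, w2, w3, k)
               for w1, w2, w3, target in entries)
    return hits / len(entries)

entries = [("man", "woman", "king", "queen"), ("king", "queen", "man", "woman")]
for k in (1, 2):
    print(f"top({k}) = {top_k_accuracy(toy_vectors, entries, k)}")
```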

**Notes:**
1. Try the model trained on `text8` first. Is there any issue?
2. Test the model trained on Google News available in gensim.
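
One way to investigate the first note is to measure how many analogy entries are fully covered by a model's vocabulary, since a lookup for an out-of-vocabulary word fails. The sketch below uses a hypothetical helper, `entry_coverage`, and a toy vocabulary that stands in for a real one (in gensim 4 the vocabulary of a trained model is exposed as `model.wv.key_to_index`). Since text8 is lowercased, capitalized entries such as `Athens` would be expected to miss.

```python
def entry_coverage(entries, vocabulary):
    """Fraction of analogy entries whose four words are all in the vocabulary."""
    covered = sum(all(word in vocabulary for word in entry) for entry in entries)
    return covered / len(entries)

# Toy lowercase vocabulary (illustrative only).
toy_vocabulary = {"athens", "greece", "king", "queen", "man", "woman"}
entries = [
    ("Athens", "Greece", "Baghdad", "Iraq"),  # capitalized words are not found
    ("man", "woman", "king", "queen"),
]
print(entry_coverage(entries, toy_vocabulary))
```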



In [3]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv
!pip install --upgrade pandas
!pip install -U pandas-profiling

In [23]:
# Executing this cell could take ~5 minutes
import gensim.downloader
w2v_google_news_model = gensim.downloader.load('word2vec-google-news-300')

In [7]:
import pandas as pd
import numpy as np

df = pd.read_csv('google_analogies.csv', index_col='Unnamed: 0')
df = df.sort_values(by=['type', 'word1'])

In [None]:
k_list = [1, 3, 5, 10]

types = df['type'].unique()
for t in types:
  top_k_text8 = [0, 0, 0, 0]
  top_k_news = [0, 0, 0, 0]
  if t == 'family':
    print(f"Type: {t}")
    print("")
    df_by_type = df[df['type'] == t]
    for _, row in df_by_type.iterrows():
      for i, k in enumerate(k_list):
        try:
          # text8 model (trained in Q1): a missing word raises KeyError and counts as a miss
          if row['target'] in [x[0] for x in w2v_model.wv.most_similar(positive=[row['word2'], row['word3']], negative=[row['word1']], topn=k)]:
            top_k_text8[i] += 1
        except KeyError:
          pass
        if row['target'] in [x[0] for x in w2v_google_news_model.most_similar(positive=[row['word2'], row['word3']], negative=[row['word1']], topn=k)]:
          top_k_news[i] += 1
    for i, k in enumerate(k_list):
      print(f"Top {k} text8: {top_k_text8[i] / len(df_by_type)}")
      print(f"Top {k} Google News: {top_k_news[i] / len(df_by_type)}")
      print("")
    print("-----------------------------------------------------------")
    print("")

Type: family
Top 1 text8: 0.5
Top 1 Google News: 0.8458498023715415

Top 3 text8: 0.5711462450592886
Top 3 Google News: 0.9229249011857708

Top 5 text8: 0.6126482213438735
Top 5 Google News: 0.9525691699604744

Top 10 text8: 0.6561264822134387
Top 10 Google News: 0.974308300395257

-----------------------------------------------------------



### **Question 3:**

Train a new FastText model using gensim on the text8 corpus, which is available through the gensim downloader ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time of the model and store it for subsequent steps.

- Is there any significant difference in training time compared with Word2Vec?

In [8]:
import gensim.downloader as api
import time
from gensim.models import FastText

dataset = api.load("text8")  # load dataset as iterable
start = time.time()
ft_model = FastText(dataset)  # train FastText model
ft_time = time.time() - start

print(f"Training time: {ft_time}s")

Training time: 561.4850926399231s
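
To answer the question, the two stored timings can be compared directly. The sketch below plugs in the values printed in this run, which will differ across machines and gensim versions; the `recorded_*` names and `speed_ratio` are introduced here for illustration.

```python
# Timings copied from the outputs of Q1 and Q3 in this run;
# in a live notebook, w2v_time and ft_time can be used directly.
recorded_w2v_time = 170.66526007652283
recorded_ft_time = 561.4850926399231

speed_ratio = recorded_ft_time / recorded_w2v_time
print(f"FastText took {speed_ratio:.2f}x the Word2Vec training time")
```

In this run FastText is roughly three times slower, which is consistent with FastText additionally learning vectors for character n-grams.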


### **Question 4:**
Score the FastText model using the same methodology as in Q2. 

**Notes:**
- Does the issue observed with the Word2Vec model show up here as well?
- Test the model trained on Wikipedia+News available in gensim.

In [9]:
# Executing this cell could take ~5 minutes
import gensim.downloader
ft_wiki_news_model = gensim.downloader.load('fasttext-wiki-news-subwords-300')



In [10]:
k_list = [1, 3, 5, 10]

types = df['type'].unique()
for t in types:
  top_k_text8 = [0, 0, 0, 0]
  top_k_news = [0, 0, 0, 0]
  if t == 'family':
    print(f"Type: {t}")
    print("")
    df_by_type = df[df['type'] == t]
    for _, row in df_by_type.iterrows():
      for i, k in enumerate(k_list):
        try:
          # text8 FastText model (trained in Q3)
          if row['target'] in [x[0] for x in ft_model.wv.most_similar(positive=[row['word2'], row['word3']], negative=[row['word1']], topn=k)]:
            top_k_text8[i] += 1
        except KeyError:
          pass
        if row['target'] in [x[0] for x in ft_wiki_news_model.most_similar(positive=[row['word2'], row['word3']], negative=[row['word1']], topn=k)]:
          top_k_news[i] += 1
    for i, k in enumerate(k_list):
      print(f"Top {k} text8: {top_k_text8[i] / len(df_by_type)}")
      print(f"Top {k} Wiki News: {top_k_news[i] / len(df_by_type)}")
      print("")
    print("-----------------------------------------------------------")
    print("")

Type: family

Top 1 text8: 0.27865612648221344
Top 1 Wiki News: 0.849802371541502

Top 3 text8: 0.4308300395256917
Top 3 Wiki News: 0.950592885375494

Top 5 text8: 0.48616600790513836
Top 5 Wiki News: 0.9604743083003953

Top 10 text8: 0.5553359683794467
Top 10 Wiki News: 0.9841897233201581

-----------------------------------------------------------



### **Question 5** (optional) 
Evaluate the Word2Vec and FastText models on the analogy task for the whole dataset (include all analogy types).

In [None]:
# Your code here
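
A possible starting point, sketched under assumptions: `evaluate_by_type` is a hypothetical helper that takes analogy rows as `(type, word1, word2, word3, target)` tuples and any ranking function with the signature `rank(word1, word2, word3, topn)`. The `toy_rank` function is a stand-in so that the example runs on its own; with gensim one would wrap `most_similar(positive=[word2, word3], negative=[word1], topn=topn)` instead.

```python
from collections import defaultdict

def evaluate_by_type(rows, rank, k_list=(1, 3, 5, 10)):
    """Return {type: {k: top-k accuracy}}; entries raising KeyError count as misses."""
    hits = defaultdict(lambda: {k: 0 for k in k_list})
    totals = defaultdict(int)
    max_k = max(k_list)
    for type_, word1, word2, word3, target in rows:
        totals[type_] += 1
        try:
            ranked = rank(word1, word2, word3, max_k)  # query once, slice per k
        except KeyError:
            continue  # out-of-vocabulary word: no hit for any k
        for k in k_list:
            if target in ranked[:k]:
                hits[type_][k] += 1
    return {t: {k: hits[t][k] / totals[t] for k in k_list} for t in totals}

# Stand-in ranking function with hard-coded answers (illustrative only).
def toy_rank(word1, word2, word3, topn):
    answers = {"king": ["queen", "princess"], "Baghdad": ["Iran", "Iraq"]}
    return answers[word3][:topn]  # raises KeyError for unknown words

rows = [
    ("family", "man", "woman", "king", "queen"),
    ("capital", "Athens", "Greece", "Baghdad", "Iraq"),
    ("capital", "Athens", "Greece", "Oslo", "Norway"),
]
print(evaluate_by_type(rows, toy_rank, k_list=(1, 3)))
```

Querying once with the largest k and slicing the result gives the same hits as one query per k, because the top-k list is a prefix of the longer ranking; on the full dataset this reduces the number of similarity searches.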

## Sentence Embeddings

Key takeaways from lessons and in-class practices:
- Doc2Vec is an extension of the Word2Vec framework
- It incorporates a document ID to obtain a more accurate representation of a document/paragraph
- Vectors for the training documents are pre-computed; vectors for new documents can be inferred
- InferSent exploits a deep learning architecture to learn sentence representations in a supervised fashion
- InferSent can use either Word2Vec or FastText as its word embedding model.

### **Question 6:**

Train a Doc2Vec model using gensim on the text8 corpus. Measure the training time of the model and store it for subsequent steps.

In [11]:
%%capture
!pip install gensim
!pip install nltk
!pip install scikit-learn

In [13]:
import nltk
from IPython.utils import io
with io.capture_output() as captured:
  nltk.download('punkt')

In [14]:
import gensim.downloader as api
import time
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

dataset = api.load("text8")  # load dataset as iterable
data = [d for d in dataset]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(data)]

In [16]:
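start = time.time()
d2v_model = Doc2Vec(vector_size=200, min_count=1, epochs=10)  # no corpus here, so the model is trained only once below
d2v_model.build_vocab(documents)
d2v_model.train(documents, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)
d2v_time = time.time() - start  # stored for subsequent steps, as in Q1 and Q3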

### **Question 7** (Qualitative Evaluation)
Perform some qualitative experiments by computing the cosine similarity between sentences that you write yourself.

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

vector_1 = d2v_model.infer_vector(["university", "course"])
vector_2 = d2v_model.infer_vector(["college", "graduate"])
print (vector_1.shape)
print (cosine_similarity(vector_1.reshape(1, -1), vector_2.reshape(1, -1)))

(200,)
[[0.4930306]]
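
For reference, the cosine similarity reported above is the dot product of the two vectors divided by the product of their norms. A minimal NumPy sketch on toy vectors follows; the values are illustrative, not model outputs, and `cosine_sim` is a name chosen here.

```python
import numpy as np

def cosine_sim(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||), for non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_sim(a, a))   # same direction
print(cosine_sim(a, b))   # 45 degrees apart
print(cosine_sim(a, -a))  # opposite direction
```

Because `infer_vector` starts from a random initialization, the similarity for the same pair of sentences may vary slightly between calls.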


### **Question 8** (Extrinsic Evaluation)

Extrinsic evaluation measures the performance of a word/sentence/paragraph embedding model on a downstream NLP task (e.g., text classification).

We can use different configurations, training corpora, or even different models to build a complete architecture for the task at hand.

For this practice we use the text classification dataset available [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P2/news_headline_classification.csv) - [source: Kaggle](https://www.kaggle.com/rmisra/news-category-dataset)

**Note:** consider using just the first 10,000 headlines to reduce runtime during the lab.

Compute the accuracy of 3 classification models, each built with one of the models introduced in this practice:
- Word2Vec model pretrained on Google News corpus
- FastText model pretrained on Wikipedia+News corpus
- **[Optional]** Doc2Vec model pretrained on Text8 corpus
- **[Optional]** InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent)

The procedure to build a classification system is sketched below:
1. Choose a (multi-class) machine learning classifier (e.g., MLP)
2. Split the data collection into train/test sets (80%/20%)
3. Use the text vectors obtained from a pretrained model as input to the classifier
4. Measure the accuracy of the classification system
5. Repeat steps 3-4 using different embedding models 


**Note:** You need to choose an aggregation function (e.g., average) to obtain sentence embeddings from word vectors.
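
The aggregation step can be sketched as follows. `aggregate` and the toy lookup table are hypothetical names used for illustration; average pooling is what the cells in this practice use, while max pooling is shown only as one possible alternative.

```python
import numpy as np

def aggregate(tokens, lookup, dim, how="mean"):
    """Combine the vectors of the known tokens into one sentence vector."""
    vectors = [lookup[t] for t in tokens if t in lookup]
    if not vectors:
        return np.zeros(dim)  # no known token: fall back to the zero vector
    stacked = np.vstack(vectors)
    if how == "mean":
        return stacked.mean(axis=0)
    if how == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unknown aggregation: {how}")

# Toy 3-d word vectors (illustrative values only).
toy_lookup = {
    "stocks": np.array([1.0, 0.0, 2.0]),
    "rise": np.array([3.0, 4.0, 0.0]),
}
tokens = ["stocks", "rise", "sharply"]  # "sharply" is out of vocabulary here
print(aggregate(tokens, toy_lookup, dim=3, how="mean"))
print(aggregate(tokens, toy_lookup, dim=3, how="max"))
```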

Which model has better performance? Report the performance of each variant of the classification system.

In [18]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv

In [19]:
# Reading data
import pandas as pd
df_news_clf = pd.read_csv("news_headline_classification.csv")
list_sentences = df_news_clf["headline"].tolist()
list_sentences = list_sentences[:10000]
list_labels = df_news_clf["category"].tolist()
list_labels = list_labels[:10000]

**Word2Vec + Average aggregation function**

In [24]:
# Word2Vec + Avg
from nltk import word_tokenize
import numpy as np
from tqdm import tqdm
list_w2v_vectors = []
for s in tqdm(list_sentences):
    words = word_tokenize(s)
    words_vectors = []
    for w in words:
        try:
            words_vectors.append(w2v_google_news_model[w])
        except Exception as e:
            continue
    if len(words_vectors) > 0:
        sentence_vector = np.mean(words_vectors, axis=0)
    else:
        sentence_vector = np.zeros(300)
    list_w2v_vectors.append(sentence_vector)

100%|██████████| 10000/10000 [00:02<00:00, 4059.87it/s]


In [27]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(list_w2v_vectors, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=False)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

0.646
                precision    recall  f1-score   support

ARTS & CULTURE       0.00      0.00      0.00         9
  BLACK VOICES       0.47      0.37      0.41        95
      BUSINESS       0.31      0.15      0.21        26
        COMEDY       0.56      0.53      0.55        96
         CRIME       0.41      0.31      0.35        42
     EDUCATION       0.29      0.29      0.29         7
 ENTERTAINMENT       0.63      0.76      0.69       376
         GREEN       0.33      0.50      0.40         8
HEALTHY LIVING       0.17      0.14      0.15        14
        IMPACT       0.50      0.24      0.32        17
 LATINO VOICES       0.38      0.33      0.35        18
         MEDIA       0.52      0.39      0.44        57
       PARENTS       0.55      0.33      0.41        18
      POLITICS       0.77      0.84      0.80       736
  QUEER VOICES       0.69      0.51      0.59       100
      RELIGION       0.54      0.39      0.45        18
       SCIENCE       0.00      0.00      



**FastText + Average aggregation function**

In [28]:
# FastText + Avg
from nltk import word_tokenize
import numpy as np
list_ft_vectors = []
for s in tqdm(list_sentences):
    words = word_tokenize(s)
    words_vectors = []
    for w in words:
        try:
            words_vectors.append(ft_wiki_news_model[w])
        except Exception as e:
            #print (e)
            continue
    if len(words_vectors) > 0:
        sentence_vector = np.mean(words_vectors, axis=0)
    else:
        sentence_vector = np.zeros(300)
    list_ft_vectors.append(sentence_vector)

100%|██████████| 10000/10000 [00:02<00:00, 4118.76it/s]


In [29]:
X_train, X_test, y_train, y_test = train_test_split(list_ft_vectors, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=False)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

0.644
                precision    recall  f1-score   support

ARTS & CULTURE       0.00      0.00      0.00         9
  BLACK VOICES       0.47      0.41      0.44        95
      BUSINESS       0.40      0.08      0.13        26
        COMEDY       0.67      0.48      0.56        96
         CRIME       0.39      0.33      0.36        42
     EDUCATION       0.25      0.14      0.18         7
 ENTERTAINMENT       0.62      0.78      0.69       376
         GREEN       0.33      0.38      0.35         8
HEALTHY LIVING       0.00      0.00      0.00        14
        IMPACT       0.29      0.12      0.17        17
 LATINO VOICES       0.40      0.22      0.29        18
         MEDIA       0.49      0.40      0.44        57
       PARENTS       0.40      0.11      0.17        18
      POLITICS       0.74      0.86      0.79       736
  QUEER VOICES       0.62      0.61      0.61       100
      RELIGION       0.33      0.17      0.22        18
       SCIENCE       0.33      0.20      



**Doc2Vec (Text8)**

In [30]:
# Doc2Vec
from nltk import word_tokenize
import numpy as np
list_d2v_vectors = []
for s in tqdm(list_sentences):
    words = word_tokenize(s)
    try:
        sentence_vector = d2v_model.infer_vector(words)
    except Exception as e:
        print (e)
        sentence_vector = np.zeros(d2v_model.vector_size)  # match the Doc2Vec dimensionality (200)

    list_d2v_vectors.append(sentence_vector)

100%|██████████| 10000/10000 [00:06<00:00, 1439.57it/s]


In [31]:
X_train, X_test, y_train, y_test = train_test_split(list_d2v_vectors, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=False)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

0.3675
                precision    recall  f1-score   support

ARTS & CULTURE       0.00      0.00      0.00         9
  BLACK VOICES       0.00      0.00      0.00        95
      BUSINESS       0.00      0.00      0.00        26
        COMEDY       0.00      0.00      0.00        96
         CRIME       0.00      0.00      0.00        42
     EDUCATION       0.00      0.00      0.00         7
 ENTERTAINMENT       0.00      0.00      0.00       376
         GREEN       0.00      0.00      0.00         8
HEALTHY LIVING       0.00      0.00      0.00        14
        IMPACT       0.00      0.00      0.00        17
 LATINO VOICES       0.00      0.00      0.00        18
         MEDIA       0.00      0.00      0.00        57
       PARENTS       0.00      0.00      0.00        18
      POLITICS       0.37      1.00      0.54       736
  QUEER VOICES       0.00      0.00      0.00       100
      RELIGION       0.00      0.00      0.00        18
       SCIENCE       0.00      0.00     
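
The report above shows almost every test headline being assigned to `POLITICS`, so the accuracy is close to what a majority-class baseline achieves. A small sketch of that baseline on toy labels follows; `majority_baseline_accuracy` is a hypothetical helper and the labels are illustrative.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(label == majority_label for label in test_labels)
    return correct / len(test_labels)

train = ["POLITICS", "POLITICS", "COMEDY", "POLITICS", "MEDIA"]
test = ["POLITICS", "COMEDY", "POLITICS", "MEDIA"]
print(majority_baseline_accuracy(train, test))
```

With the supports in the report, 736 `POLITICS` headlines out of the 2,000 test headlines (20% of 10,000) give a baseline of about 0.37, so the Doc2Vec features add essentially nothing over it in this run.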



**InferSent**

In [32]:
%%capture
# InferSent download required files

! mkdir fastText
! curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
! unzip fastText/crawl-300d-2M.vec.zip -d fastText/
! mkdir encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
! git clone https://github.com/facebookresearch/InferSent.git

In [33]:
from InferSent.models import InferSent
import torch
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

In [34]:
infersent.build_vocab(list_sentences, tokenize=True)

Found 13627(/15255) words with w2v vectors
Vocab size : 13627


In [35]:
# InferSent
from nltk import word_tokenize
import numpy as np
infersent_embeddings = infersent.encode(list_sentences, tokenize=True)



In [36]:
X_train, X_test, y_train, y_test = train_test_split(infersent_embeddings, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=False)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

0.685
                precision    recall  f1-score   support

ARTS & CULTURE       0.00      0.00      0.00         9
  BLACK VOICES       0.54      0.54      0.54        95
      BUSINESS       0.46      0.23      0.31        26
        COMEDY       0.71      0.64      0.67        96
         CRIME       0.56      0.36      0.43        42
     EDUCATION       0.40      0.29      0.33         7
 ENTERTAINMENT       0.66      0.77      0.71       376
         GREEN       0.44      0.50      0.47         8
HEALTHY LIVING       0.25      0.14      0.18        14
        IMPACT       0.43      0.18      0.25        17
 LATINO VOICES       0.50      0.39      0.44        18
         MEDIA       0.54      0.49      0.51        57
       PARENTS       0.71      0.28      0.40        18
      POLITICS       0.79      0.87      0.83       736
  QUEER VOICES       0.71      0.62      0.66       100
      RELIGION       0.53      0.44      0.48        18
       SCIENCE       0.00      0.00      

