# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 2:** Word and Sentence Embeddings

## Word Embedding 

![](https://qph.fs.quoracdn.net/main-qimg-3e812fd164a08f5e4f195000fecf988f)


**Key takeaways** from lessons and in-class practices:
- Word embeddings are able to map words into a semantic-aware vector space
- There are multiple architectures for the generation of word embeddings
- Each architecture has its advantages and disadvantages
- Word embedding evaluation can be intrinsic (intermediate tasks) or extrinsic (downstream tasks)
- It is possible to use a pre-trained word embedding model or to train one from scratch on a large amount of text


### **Question 1**

Train a new Word2Vec model using gensim on the text8 corpus available through the Python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time and save the trained model for subsequent steps.

In [None]:
! pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |████████████████████████████████| 24.1 MB 2.8 kB/s
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.1.2


In [None]:
import gensim.downloader as api
from gensim.models import Word2Vec
import time
dataset = api.load("text8")
start = time.time()
w2v_model = Word2Vec(dataset)
end = time.time()

print ("(Word2Vec on Text8) Training took", end-start, "seconds")

(Word2Vec on Text8) Training took 145.4744997024536 seconds


In [None]:
w2v_model.save("text8_w2v.model")

### **Question 2**:
Perform intrinsic evaluation of the model for the task of word analogy by exploiting the data collection available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv). 

1. read CSV file
2. group analogy entries by type (column: `type`)
3. for each type entry (**in the lab, just set type="family"** to reduce the required time) use the first 3 word vectors to compute the fourth
    - Entry: `Athens,Greece,Baghdad,Iraq`
    - `v(Greece) - v(Athens) + v(Baghdad) = res_v` 
    - Get the most similar vectors to `res_v`
    - Compute in how many cases the correct word is among the top K (if `v[Iraq]` is among the K most similar words) with `K = 1, 3, 5, 10`

$top(k) = \dfrac{\sum_{i=1}^{|E|} f(i)}{|E|}$

where $f(i) = 1$ if the target word of entry $i$ is among the top k and $f(i) = 0$ otherwise.

$|E|$ is the total number of entries for the considered type.
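The procedure and the metric above can be sketched on toy data. This is a minimal illustration, not the lab solution: the 2-D vectors in `toy_vectors` are invented by hand, and the helper names (`cosine`, `rank_candidates`, `top_k_accuracy`) are hypothetical. The three query words are excluded from the candidates, which is also what gensim's `most_similar` does.

```python
import math

# Hypothetical 2-D embeddings, invented for illustration only.
toy_vectors = {
    "Athens":   [1.0, 0.0],
    "Greece":   [1.0, 1.0],
    "Baghdad":  [3.0, 0.0],
    "Iraq":     [3.0, 1.0],
    "Bangkok":  [5.0, 0.0],
    "Thailand": [5.0, 2.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(word1, word2, word3, vectors):
    """Rank the vocabulary by similarity to v(word2) - v(word1) + v(word3)."""
    res_v = [b - a + c for a, b, c in
             zip(vectors[word1], vectors[word2], vectors[word3])]
    # The three query words are not valid answers.
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return sorted(candidates, key=lambda w: cosine(vectors[w], res_v), reverse=True)

def top_k_accuracy(entries, vectors, k):
    """top(k): fraction of entries whose target is among the k best candidates."""
    hits = sum(1 for w1, w2, w3, target in entries
               if target in rank_candidates(w1, w2, w3, vectors)[:k])
    return hits / len(entries)

entries = [
    ("Athens", "Greece", "Baghdad", "Iraq"),
    ("Athens", "Greece", "Bangkok", "Thailand"),
]

for k in (1, 3):
    print("@" + str(k), "=", top_k_accuracy(entries, toy_vectors, k))
```

With these toy vectors the second entry ranks `Thailand` second, so it counts as a miss for `K = 1` and as a hit for `K = 3`.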

**Notes:**
1. Try with the model trained on `text8`. Is there any issue?
2. Test the model trained on Google News available in gensim.



In [None]:
%%capture
! wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv
! pip install --upgrade pandas

In [None]:
# Executing this cell could take ~5 minutes
import gensim.downloader
w2v_google_news_model = gensim.downloader.load('word2vec-google-news-300')



In [None]:
import pandas as pd
df_google_dataset = pd.read_csv("google_analogies.csv")
print (df_google_dataset.head())
types = set(df_google_dataset["type"].tolist())
print (types)

   Unnamed: 0                      type   word1   word2    word3       target
0           0  capital-common-countries  Athens  Greece  Baghdad         Iraq
1           1  capital-common-countries  Athens  Greece  Bangkok     Thailand
2           2  capital-common-countries  Athens  Greece  Beijing        China
3           3  capital-common-countries  Athens  Greece   Berlin      Germany
4           4  capital-common-countries  Athens  Greece     Bern  Switzerland
{'gram6-nationality-adjective', 'gram7-past-tense', 'gram8-plural', 'gram9-plural-verbs', 'family', 'gram1-adjective-to-adverb', 'gram5-present-participle', 'currency', 'gram4-superlative', 'capital-world', 'capital-common-countries', 'city-in-state', 'gram2-opposite', 'gram3-comparative'}


**Answer 2.1:** The model trained on text8 has a limited vocabulary. Analogy entries that contain a word missing from it (e.g., `stepbrother`, `policewoman`, `stepsister`) raise an out-of-vocabulary (OOV) error.

In [None]:
from tqdm import tqdm

def score_word_embedding_model (complete_df, model, analogy_type="family", MAX_K=10):

    # top_dict[k] counts the entries whose target is among the top k results
    top_dict = {k: 0 for k in range(1, MAX_K+1)}

    temp_df = complete_df.loc[complete_df['type'] == analogy_type]
    word_1_list = temp_df["word1"].tolist()
    word_2_list = temp_df["word2"].tolist()
    word_3_list = temp_df["word3"].tolist()
    target_list = temp_df["target"].tolist()

    # Models trained here expose their vectors as `.wv`;
    # pre-trained models loaded with the downloader are queried directly
    keyed_vectors = getattr(model, "wv", model)

    for i, _ in enumerate(tqdm(word_1_list)):
        try:
            most_similar_words = keyed_vectors.most_similar(positive=[word_2_list[i], word_3_list[i]], negative=[word_1_list[i]], topn=MAX_K)
            most_similar_words_list=[w[0] for w in most_similar_words]
            if target_list[i] in most_similar_words_list:
                # A hit at rank r counts for every k >= r
                index = most_similar_words_list.index(target_list[i])
                positive_keys = range(index+1, MAX_K+1)
                for pk in positive_keys:
                    top_dict[pk]+=1
        except KeyError as e:
            # Out-of-vocabulary word: the entry is counted as a miss
            print (e)
    
    print ("-------------------------")
    print ("Results for", analogy_type)
    print ("-------------------------")
    for k in top_dict.keys():
        print ("@"+str(k), "=", top_dict[k]/len(target_list))

print ("Text8 model - Word2Vec")
score_word_embedding_model(df_google_dataset,w2v_model)
print ("Google News model - Word2Vec")
score_word_embedding_model(df_google_dataset,w2v_google_news_model)

Text8 model - Word2Vec


  6%|▌         | 28/506 [00:00<00:01, 278.72it/s]

"Key 'stepbrother' not present"
"Key 'stepbrother' not present"
"Key 'stepbrother' not present"


 13%|█▎        | 64/506 [00:00<00:01, 321.51it/s]

"Key 'stepbrother' not present"


 20%|█▉        | 100/506 [00:00<00:01, 337.34it/s]

"Key 'stepbrother' not present"
"Key 'stepbrother' not present"


 26%|██▋       | 134/506 [00:00<00:01, 336.65it/s]

"Key 'stepbrother' not present"
"Key 'stepbrother' not present"


 40%|███▉      | 202/506 [00:00<00:00, 328.80it/s]

"Key 'stepbrother' not present"
"Key 'stepbrother' not present"
"Key 'stepbrother' not present"


 47%|████▋     | 236/506 [00:00<00:00, 331.18it/s]

"Key 'stepbrother' not present"


 54%|█████▍    | 272/506 [00:00<00:00, 337.51it/s]

"Key 'stepbrother' not present"
"Key 'stepbrother' not present"


 61%|██████    | 307/506 [00:00<00:00, 340.75it/s]

"Key 'stepbrother' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'policewoman' not present"
"Key 'stepbrother' not present"


 71%|███████   | 357/506 [00:01<00:00, 384.20it/s]

"Key 'stepbrother' not present"


 78%|███████▊  | 396/506 [00:01<00:00, 360.42it/s]

"Key 'stepbrother' not present"


 89%|████████▉ | 450/506 [00:01<00:00, 408.92it/s]

"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepsister' not present"
"Key 'stepbrother' not present"
"Key 'stepbrother' not present"


100%|██████████| 506/506 [00:01<00:00, 354.60it/s]


"Key 'stepbrother' not present"
-------------------------
Results for family
-------------------------
@1 = 0.5019762845849802
@2 = 0.5553359683794467
@3 = 0.5830039525691699
@4 = 0.6047430830039525
@5 = 0.6185770750988142
@6 = 0.6324110671936759
@7 = 0.6442687747035574
@8 = 0.6541501976284585
@9 = 0.658102766798419
@10 = 0.6640316205533597
Google News model - Word2Vec


100%|██████████| 506/506 [03:55<00:00,  2.15it/s]

-------------------------
Results for family
-------------------------
@1 = 0.8458498023715415
@2 = 0.9011857707509882
@3 = 0.9229249011857708
@4 = 0.9347826086956522
@5 = 0.9525691699604744
@6 = 0.9545454545454546
@7 = 0.9624505928853755
@8 = 0.9683794466403162
@9 = 0.9723320158102767
@10 = 0.974308300395257





### **Question 3:**

Train a new FastText model using gensim on the text8 corpus available through the Python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time and save the trained model for subsequent steps.

- Is there any significant difference in training time if compared with Word2Vec training?

In [None]:
import gensim.downloader as api
from gensim.models import FastText
import time
dataset = api.load("text8")
start = time.time()
ft_model = FastText(dataset)
end = time.time()

print ("(FastText on Text8) Training took", end-start, "seconds")

(FastText on Text8) Training took 528.5233931541443 seconds


In [None]:
ft_model.save("text8_ft.model")

### **Question 4:**
Score the FastText model by exploiting the same methodology presented in Q2. 

**Notes:**
- Does the issue observed with the Word2Vec model also arise here?
- Test the model trained on Wikipedia+News available in gensim.
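As background for the first note: FastText represents a word through its character n-grams in addition to the word itself, so it can usually compose a vector for a word that never appeared in the training corpus. The sketch below only illustrates the n-gram extraction; `char_ngrams` is a hypothetical helper, not gensim's implementation, and the 3 to 6 range is assumed to mirror the library defaults (`min_n=3`, `max_n=6`).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = "<" + word + ">"   # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# The 3-grams of "where"
print(char_ngrams("where", min_n=3, max_n=3))

# A word missing from the vocabulary still shares n-grams with known words
shared = set(char_ngrams("stepbrother")) & set(char_ngrams("brother"))
print(sorted(shared))
```

Because `stepbrother` shares n-grams such as `bro` and `her>` with `brother`, a subword model has material to build a vector from even when the full word was never seen.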

In [None]:
# Executing this cell could take ~5 minutes
import gensim.downloader
ft_wiki_news_model = gensim.downloader.load('fasttext-wiki-news-subwords-300')



In [None]:
print ("Text8 model - FastText")
score_word_embedding_model(df_google_dataset,ft_model)
print ("Wikipedia+News model - FastText")
score_word_embedding_model(df_google_dataset,ft_wiki_news_model)

Text8 model - FastText


100%|██████████| 506/506 [00:01<00:00, 376.23it/s]


-------------------------
Results for family
-------------------------
@1 = 0.2707509881422925
@2 = 0.3774703557312253
@3 = 0.43478260869565216
@4 = 0.4644268774703557
@5 = 0.4841897233201581
@6 = 0.5019762845849802
@7 = 0.5177865612648221
@8 = 0.5197628458498024
@9 = 0.5296442687747036
@10 = 0.5434782608695652
Wikipedia+News model - FastText


100%|██████████| 506/506 [01:18<00:00,  6.42it/s]

-------------------------
Results for family
-------------------------
@1 = 0.849802371541502
@2 = 0.924901185770751
@3 = 0.950592885375494
@4 = 0.9525691699604744
@5 = 0.9604743083003953
@6 = 0.9664031620553359
@7 = 0.9723320158102767
@8 = 0.9802371541501976
@9 = 0.9822134387351779
@10 = 0.9841897233201581





### **Question 5** (optional) 
Evaluate the Word2Vec and FastText models on the analogy task for the whole dataset (include all analogy types).

## Sentence Embeddings

Key takeaways from lessons and in-class practices:
- Doc2Vec is an extension of the Word2Vec framework
- It incorporates a document ID to obtain a more accurate representation of a document/paragraph
- The vectors of the training documents are pre-computed; vectors for new documents can be inferred
- InferSent exploits a deep learning architecture to learn sentence representations in a supervised fashion
- InferSent can use either Word2Vec or FastText as its word embedding model

### **Question 6:**

Train a Doc2Vec model using gensim on the text8 corpus. Measure the training time and save the trained model for subsequent steps.

In [None]:
! pip install gensim
! pip install nltk
! pip install scikit-learn



In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import gensim
import gensim.downloader as api
import time
dataset = api.load("text8")
data = [d for d in dataset]

def tagged_document(list_of_list_of_words):
    # Tag each document with its position in the corpus
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])
data_for_training = list(tagged_document(data))

print(data_for_training [:1])

[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers'

In [None]:
# Takes ~5 minutes to train
d2v_model = gensim.models.doc2vec.Doc2Vec(vector_size=200, min_count=1, epochs=10)
d2v_model.build_vocab(data_for_training)
start = time.time()
d2v_model.train(data_for_training, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)
end = time.time()
print ("(Doc2Vec on Text8) Training took", end-start, "seconds")
d2v_model.save("text8_doc2vec.model")

### **Question 7** (Qualitative Evaluation)
Perform some qualitative experiments by computing the cosine similarity between sentences that you write yourself.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

vector_1 = d2v_model.infer_vector(["university", "course"])
vector_2 = d2v_model.infer_vector(["college", "graduate"])
print (vector_1.shape)
print (cosine_similarity(vector_1.reshape(1, -1), vector_2.reshape(1, -1)))

(200,)
[[0.48853233]]


### **Question 8** (Extrinsic Evaluation)

Extrinsic evaluation measures the performance of a word/sentence/paragraph embedding model on a downstream NLP task (e.g., Text Classification).

We can use different configurations, training corpora, or even different models to build a complete architecture for the task at hand.

For this practice we use the text classification dataset available [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P2/news_headline_classification.csv) - [source: Kaggle](https://www.kaggle.com/rmisra/news-category-dataset)

**Note:** consider using just the first 10,000 headlines to reduce runtime during the lab.

Compute the accuracy of the classification models, each one built on one of the embedding models introduced in this practice:
- Word2Vec model pretrained on Google News corpus
- FastText model pretrained on Wikipedia+News corpus
- **[Optional]** Doc2Vec model pretrained on Text8 corpus
- **[Optional]** InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent)

The procedure to create a classification system is sketched below:
1. Choose a machine learning (multi-class) classifier (e.g., MLP)
2. Split the data collection into train/test sets (80%/20%)
3. Use the text vectors obtained from the pretrained model as input to the classifier
4. Measure the accuracy of the classification system
5. Repeat steps 3-4 using different embedding models
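The split/fit/score loop in the steps above can be sketched without any external library. This is an illustration under loud assumptions: the data are synthetic 2-D points rather than headline embeddings, a nearest-centroid rule stands in for the MLP, and all helper names (`split_train_test`, `fit_centroids`, `predict`, `accuracy`) are hypothetical.

```python
import random

def split_train_test(X, y, test_fraction=0.2, seed=42):
    """Shuffle the indices once, then cut them into train and test parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda data, idx: [data[i] for i in idx]
    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)

def fit_centroids(X, y):
    """Stand-in classifier: one mean vector per class."""
    sums, counts = {}, {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, X):
    """Assign each vector to the class with the closest centroid."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(centroids, key=lambda label: sq_dist(centroids[label], vec))
            for vec in X]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data: two well-separated clusters, 50 points each
rng = random.Random(0)
X = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)] + \
    [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(50)]
y = ["A"] * 50 + ["B"] * 50

X_train, X_test, y_train, y_test = split_train_test(X, y, test_fraction=0.2, seed=42)
centroids = fit_centroids(X_train, y_train)
y_pred = predict(centroids, X_test)
print("accuracy:", accuracy(y_test, y_pred))
```

Fixing the seed keeps the split identical across embedding models, so the accuracy figures of step 5 are compared on the same test set.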


**Note:** You need to choose an aggregation function (e.g., average) to obtain sentence embeddings from word vectors.

Which model has better performance? Report the performance of each variant of the classification system.
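Average aggregation, mentioned in the note above, can be sketched as follows. The function name `average_sentence_vector` and the toy 2-D vectors are hypothetical; a real run would look words up in a pretrained model. Tokens without a vector are skipped, and a sentence with no known token falls back to a zero vector (an assumption, though a common one).

```python
def average_sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; zero vector if none is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known words
    return [sum(component) / len(known) for component in zip(*known)]

# Toy 2-D word vectors, invented for illustration only
toy_word_vectors = {
    "stock":  [1.0, 3.0],
    "market": [3.0, 5.0],
}

print(average_sentence_vector(["stock", "market", "zzz"], toy_word_vectors, dim=2))
print(average_sentence_vector(["zzz"], toy_word_vectors, dim=2))
```

Averaging discards word order, which is one reason to compare it against sentence-level models such as Doc2Vec or InferSent.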

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv

--2021-10-18 18:01:40--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15212143 (15M) [text/plain]
Saving to: ‘news_headline_classification.csv’


2021-10-18 18:01:41 (73.7 MB/s) - ‘news_headline_classification.csv’ saved [15212143/15212143]



In [None]:
# Reading data
import pandas as pd
df_news_clf = pd.read_csv("news_headline_classification.csv")
list_sentences = df_news_clf["headline"].tolist()
list_sentences = list_sentences[:10000]
list_labels = df_news_clf["category"].tolist()
list_labels = list_labels[:10000]


**Word2Vec + Average aggregation function**

In [None]:
# Word2Vec + Avg
from nltk import word_tokenize
import numpy as np
from tqdm import tqdm
list_w2v_vectors = []
for s in tqdm(list_sentences):
    words = word_tokenize(s)
    words_vectors = []
    for w in words:
        try:
            words_vectors.append(w2v_google_news_model[w])
        except Exception as e:
            continue
    if len(words_vectors) > 0:
        sentence_vector = np.mean(words_vectors, axis=0)
    else:
        sentence_vector = np.zeros(300)
    list_w2v_vectors.append(sentence_vector)

del w2v_google_news_model
del w2v_model

100%|██████████| 10000/10000 [00:02<00:00, 4003.05it/s]


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(list_w2v_vectors, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

del mlp

Iteration 1, loss = 2.76255525
Iteration 2, loss = 2.16554658
Iteration 3, loss = 1.98061449
Iteration 4, loss = 1.86324597
Iteration 5, loss = 1.77400730
Iteration 6, loss = 1.70044677
Iteration 7, loss = 1.63641406
Iteration 8, loss = 1.58155364
Iteration 9, loss = 1.52878528
Iteration 10, loss = 1.48285162
Iteration 11, loss = 1.44202070
Iteration 12, loss = 1.40539003
Iteration 13, loss = 1.37310066
Iteration 14, loss = 1.34256409
Iteration 15, loss = 1.31496687
Iteration 16, loss = 1.28881830
Iteration 17, loss = 1.26625007
Iteration 18, loss = 1.24542080
Iteration 19, loss = 1.22553857
Iteration 20, loss = 1.20579339
Iteration 21, loss = 1.18859194
Iteration 22, loss = 1.17418780
Iteration 23, loss = 1.15791409
Iteration 24, loss = 1.14236068
Iteration 25, loss = 1.12949809
Iteration 26, loss = 1.11729219
Iteration 27, loss = 1.10577414
Iteration 28, loss = 1.09314858
Iteration 29, loss = 1.08082377
Iteration 30, loss = 1.07086291
Iteration 31, loss = 1.06061891
Iteration 32, los

  _warn_prf(average, modifier, msg_start, len(result))


**FastText + Average aggregation function**

In [None]:
# FastText + Avg
from nltk import word_tokenize
import numpy as np
list_ft_vectors = []
for s in tqdm(list_sentences):
    words = word_tokenize(s)
    words_vectors = []
    for w in words:
        try:
            words_vectors.append(ft_wiki_news_model[w])
        except Exception as e:
            #print (e)
            continue
    if len(words_vectors) > 0:
        sentence_vector = np.mean(words_vectors, axis=0)
    else:
        sentence_vector = np.zeros(300)
    list_ft_vectors.append(sentence_vector)

100%|██████████| 10000/10000 [00:02<00:00, 3718.43it/s]


In [None]:
X_train, X_test, y_train, y_test = train_test_split(list_ft_vectors, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

del ft_wiki_news_model
del ft_model
del mlp

Iteration 1, loss = 2.85711381
Iteration 2, loss = 2.35164327
Iteration 3, loss = 2.22868343
Iteration 4, loss = 2.17500058
Iteration 5, loss = 2.12023218
Iteration 6, loss = 2.05618887
Iteration 7, loss = 1.99015652
Iteration 8, loss = 1.92829891
Iteration 9, loss = 1.87167157
Iteration 10, loss = 1.81954718
Iteration 11, loss = 1.77421762
Iteration 12, loss = 1.73331637
Iteration 13, loss = 1.69815265
Iteration 14, loss = 1.66501192
Iteration 15, loss = 1.63535149
Iteration 16, loss = 1.60770394
Iteration 17, loss = 1.58048436
Iteration 18, loss = 1.55414573
Iteration 19, loss = 1.53018308
Iteration 20, loss = 1.50817006
Iteration 21, loss = 1.48693819
Iteration 22, loss = 1.46830402
Iteration 23, loss = 1.44964394
Iteration 24, loss = 1.43224486
Iteration 25, loss = 1.41655785
Iteration 26, loss = 1.40141233
Iteration 27, loss = 1.38631595
Iteration 28, loss = 1.37196174
Iteration 29, loss = 1.35864699
Iteration 30, loss = 1.34531329
Iteration 31, loss = 1.33331648
Iteration 32, los

  _warn_prf(average, modifier, msg_start, len(result))


**Doc2Vec (Text8)**

In [None]:
# Doc2Vec
from nltk import word_tokenize
import numpy as np
list_d2v_vectors = []
for s in tqdm(list_sentences):
    words = word_tokenize(s)
    try:
        sentence_vector = d2v_model.infer_vector(words)
    except Exception as e:
        print (e)
        # Fall back to a zero vector with the model's own dimensionality (200 here)
        sentence_vector = np.zeros(d2v_model.vector_size)

    list_d2v_vectors.append(sentence_vector)

100%|██████████| 10000/10000 [00:08<00:00, 1215.53it/s]


In [None]:
X_train, X_test, y_train, y_test = train_test_split(list_d2v_vectors, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

del d2v_model
del mlp

Iteration 1, loss = 3.13347502
Iteration 2, loss = 2.95704716
Iteration 3, loss = 2.73036703
Iteration 4, loss = 2.50486370
Iteration 5, loss = 2.35256679
Iteration 6, loss = 2.29926926
Iteration 7, loss = 2.28509586
Iteration 8, loss = 2.27945792
Iteration 9, loss = 2.27650970
Iteration 10, loss = 2.27455119
Iteration 11, loss = 2.27338032
Iteration 12, loss = 2.27254641
Iteration 13, loss = 2.27179921
Iteration 14, loss = 2.27124185
Iteration 15, loss = 2.27073448
Iteration 16, loss = 2.27049389
Iteration 17, loss = 2.27015898
Iteration 18, loss = 2.26962910
Iteration 19, loss = 2.26931528
Iteration 20, loss = 2.26892134
Iteration 21, loss = 2.26855641
Iteration 22, loss = 2.26821958
Iteration 23, loss = 2.26805619
Iteration 24, loss = 2.26762986
Iteration 25, loss = 2.26736766
Iteration 26, loss = 2.26742864
Iteration 27, loss = 2.26681383
Iteration 28, loss = 2.26650271
Iteration 29, loss = 2.26644954
Iteration 30, loss = 2.26617871
Iteration 31, loss = 2.26576581
Iteration 32, los

  _warn_prf(average, modifier, msg_start, len(result))


**InferSent**

In [None]:
%%capture
# InferSent download required files

! mkdir fastText
! curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
! unzip fastText/crawl-300d-2M.vec.zip -d fastText/
! mkdir encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
! git clone https://github.com/facebookresearch/InferSent.git

In [None]:
from InferSent.models import InferSent
import torch
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

In [None]:
infersent.build_vocab(list_sentences, tokenize=True)

Found 13627(/15255) words with w2v vectors
Vocab size : 13627


In [None]:
# InferSent
from nltk import word_tokenize
import numpy as np
infersent_embeddings = infersent.encode(list_sentences, tokenize=True)

  sentences = np.array(sentences)[idx_sort]


In [None]:
X_train, X_test, y_train, y_test = train_test_split(infersent_embeddings, list_labels , test_size=0.20, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
print (accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

Iteration 1, loss = 2.40991633
Iteration 2, loss = 1.88529433
Iteration 3, loss = 1.69703724
Iteration 4, loss = 1.56397961
Iteration 5, loss = 1.45397494
Iteration 6, loss = 1.36370284
Iteration 7, loss = 1.28759151
Iteration 8, loss = 1.21961230
Iteration 9, loss = 1.16347353
Iteration 10, loss = 1.11345247
Iteration 11, loss = 1.07126255
Iteration 12, loss = 1.03145725
Iteration 13, loss = 0.99523648
Iteration 14, loss = 0.95960877
Iteration 15, loss = 0.92698344
Iteration 16, loss = 0.89842346
Iteration 17, loss = 0.87003258
Iteration 18, loss = 0.84503250
Iteration 19, loss = 0.81904346
Iteration 20, loss = 0.79222011
Iteration 21, loss = 0.77111216
Iteration 22, loss = 0.74666715
Iteration 23, loss = 0.72639160
Iteration 24, loss = 0.70594812
Iteration 25, loss = 0.68487796
Iteration 26, loss = 0.66943699
Iteration 27, loss = 0.64619767
Iteration 28, loss = 0.62806498
Iteration 29, loss = 0.61124729
Iteration 30, loss = 0.59640576
Iteration 31, loss = 0.57804956
Iteration 32, los

