# Evolving vector-space model

This lab uses the `doc2vec` model for information retrieval and text classification.

## 1. Searching in the curious facts database
The facts dataset is available [here](https://github.com/hsu-ai-course/hsu.ai/blob/master/code/datasets/nlp/facts.txt); take a look. The goal is to retrieve facts relevant to a query: for example, typing "good mood" should surface the fact that Cherophobia is the fear of fun. The idea is to compare document vectors. However, instead of building tf-idf vectors and reducing their dimensionality, this time we obtain fixed-size document vectors from a `doc2vec` model.

### 1.1 Loading trained `doc2vec` model

First, let's load the pre-trained `doc2vec` model from https://github.com/jhlau/doc2vec (Associated Press News DBOW, 0.6 GB).

In [0]:
import nltk 
import numpy as np
import os 
import pandas as pd

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#### 1.1.1 Download pre-trained `doc2vec` models

In [0]:
!wget https://public.boxcloud.com/d/1/b1!do2HFzvOsOzItdsAgqBa-ErA4C3VYUaMHaZyOJmKyLCEOMw9sKdJvQdewdfX_oe_tWmfNd0mMZ3hpa4QhfggxzAwdhx8LA3Lsd9ZIJ0GIt5XiDpZjwltLku3G54wjTd8qxik7qsJ-mw9a5NslVMnAzxiy3uUISn3TI-pZzgL0tEmjpffJMm9zh2EDJ17bGmCRv9dsfVPOeD2K1LgNI-DlV9nQRWzzpqsTXB4dEghHFJBvtkey0Qpw0QXXenu16rfKSvVy6E5ky6yklC6NbnNYiAZaKMx3caffSmDfgWVcIs_oks0VFOxV7wkNoHos_ULqx23bgtWw9KZION0vDGabjm3mWHS-fbfzrEbD8uX9GM6XNSLyrKHqDLuYn4O4KYBr-1KDBLgpXL5rLYttUQpbl0xXEo0uIODC3N6zW0-6uw1oWc7yp5qSfQiOkpJWMIIIfZtvxvIuV9pEsaMtECm2K8aDIXDMKDk0Y5CwboGQIME7v6w1qSwnsKY0QRbCdWJwkRXpxzUKqhVhMCvgMf3DgWkFc17oKR9c_by9osF9PInFWYjO9BWAjHAmkIpEXAIi_ueG5NVvk4WQAefUQd3VmobN5ki2HgUQWqDZjSZNCdP6KxFdDTtczj72N_O2QIEEiO2K3cCFX8Pw4eCEBIzgRgicoRqTm6lHkIVZ2ySW_DYkeIkYXUhtRYDAoOPlF2TEILYayZzGqiQghkEPXOTbvi6pNOZNG3CVrlhpDzGrd0POBE9B5OQJ65j3HFPWkhVeee_4YdjtNs-4jYJCZ4jiwxkzgaBT8HX5clJz964pZZrH0ZTWyQ4_nK3wJylqhxrLkasTX5diTjLAkSRfsGuwLE60ZqrKtXP1EeAkDzIZgOBSUUMURjzfPaqxQxZh6qGaY3o-4w-qfKx4b4hc_6k9C6ZmB1Vr1DoBjG2NDJAJ91gk89yK3G4QUqebIjM3BTkeeJuIftdOIVTlzl4ishKBFE5kKiEPSypMlpVBL7kRTThu9xoT_znvtpRgUqalC99mJVXWU1F9kuthcm2pGffUWxkT9_WBf3w0j8m-OaZ6cy5T1BbbgJJQ8WueCeGsTP0UVKdNuQvuASJIChsjICLnF7X6mIb9OJdvhq0YQ9Ceva4i7zN9O4X9BvGTNGn9ntWCUomvNILYImY_QmIY0GePCaqm3g7UJ9xmBB_uj7jJm0aR-qvHKH-xKPROv25wZKeZ5sDE_i109e383lpw_XzFPKzJb5EbEFuj0gHEOP2sR7jOZ1UwTi-dWHnPtMl2mxZhGJyYFNYOvD_0fELkLEldzPyR6qTm-Mk5efW_xnIIl2OXVZi1tLNft0dYYL6It89lA../download

!tar -xvzf "./download"


In [0]:
!pip install gensim

In [0]:
from gensim.models.doc2vec import Doc2Vec

# the archive unpacks into 3 files; load the main one
# doc2vec.bin  <---------- this
# doc2vec.bin.syn0.npy
# doc2vec.bin.syn1neg.npy
model = Doc2Vec.load('./apnews_dbow/doc2vec.bin', mmap=None)
print(type(model))
print(type(model.infer_vector(["to", "be", "or", "not"])))



<class 'gensim.models.doc2vec.Doc2Vec'>
<class 'numpy.ndarray'>




### 1.2 Reading data

Now, let's read the facts dataset. Download it from the URL given above and read it into a list of sentences.

In [0]:
!wget https://raw.githubusercontent.com/hsu-ai-course/hsu.ai/master/code/datasets/nlp/facts.txt

--2020-02-10 08:23:59--  https://raw.githubusercontent.com/hsu-ai-course/hsu.ai/master/code/datasets/nlp/facts.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13158 (13K) [text/plain]
Saving to: ‘facts.txt.1’


2020-02-10 08:23:59 (1.04 MB/s) - ‘facts.txt.1’ saved [13158/13158]



In [0]:
#TODO read facts into list
# the file contains non-ASCII characters (curly quotes), so read it as UTF-8
with open("./facts.txt", encoding="utf-8") as f:
  facts = [line.strip() for line in f]

len(facts)

159

### 1.3 Tests

In [0]:
print(*facts[:5], sep='\n')

assert len(facts) == 159
assert ('our lovely little planet') in facts[0]

1. If you somehow found a way to extract all of the gold from the bubbling core of our lovely little planet, you would be able to cover all of the land in a layer of gold up to your knees.
2. McDonalds calls frequent buyers of their food “heavy users.”
3. The average person spends 6 months of their lifetime waiting on a red light to turn green.
4. The largest recorded snowflake was in Keogh, MT during year 1887, and was 15 inches wide.
5. You burn more calories sleeping than you do watching television.


### 1.4  Transforming sentences to vectors

Transform the list of facts into a NumPy array of document vectors (`sent_vecs`), one row per fact, inferring each vector with the model we just loaded.

In [0]:
#TODO infer vectors
def norm_vectors(A):
  '''divide each row of A by its norm ||A[i]||'''
  return A / np.linalg.norm(A, axis=1, keepdims=True)

# infer one vector per fact from its NLTK word tokens
result = []
for doc in facts:
  result.append(model.infer_vector(nltk.word_tokenize(doc)).tolist())

# unit-length rows, shape (number of facts, vector size)
sent_vecs = norm_vectors(np.array(result))

In [0]:
sent_vecs.shape

(159, 300)
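
Why normalize the rows? Here is a minimal sketch with made-up 2-D vectors (not `doc2vec` output), assuming only NumPy. Once every row has unit length, the dot product of two rows equals their cosine similarity.

```python
import numpy as np

def norm_vectors(A):
    """Divide each row of A by its Euclidean norm."""
    return A / np.linalg.norm(A, axis=1, keepdims=True)

# toy vectors chosen for illustration only
toy = np.array([[3.0, 4.0],
                [1.0, 0.0],
                [0.0, 2.0]])
unit = norm_vectors(toy)

# every row now has length 1
row_norms = np.linalg.norm(unit, axis=1)

# cosine similarity computed from the raw vectors, for comparison
cos_01 = toy[0] @ toy[1] / (np.linalg.norm(toy[0]) * np.linalg.norm(toy[1]))

print(row_norms)
print(unit[0] @ unit[1], cos_01)
```

The two numbers on the last line agree, which is why the search step can use a plain dot product against `sent_vecs`.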

### 1.5 Tests 

In [0]:
print(sent_vecs.shape)
assert sent_vecs.shape == (159, 300)

(159, 300)


### 1.6 Find closest

Now, reusing the code from the last lab, find the facts closest to the query by cosine similarity.

In [0]:
#TODO output closest facts to the query
def find_k_closest(query, dataset, k=5):
  """Return [index, vector, score] for the k rows of dataset closest to query.

  Rows of dataset are unit-length, so the dot product equals the cosine
  similarity multiplied by ||query||; the ranking matches cosine similarity.
  """
  similarities = np.dot(dataset, query)
  top_k = (-similarities).argsort()[:k]  # indices of the k largest scores

  return [[x, y, z] for x, y, z in zip(top_k, dataset[top_k], similarities[top_k])]

query = "good mood"
query_vec = model.infer_vector(nltk.word_tokenize(query))
r = find_k_closest(query_vec, sent_vecs)

print("Results for query:", query)
for k, v, p in r:
    print("\t", facts[k], "sim=", p)

Results for query: good mood
	 68. Cherophobia is the fear of fun. sim= 0.6018843634032023
	 144. Dolphins sleep with one eye open! sim= 0.5927347226508698
	 76. You breathe on average about 8,409,600 times a year sim= 0.5774093156930318
	 57. Gorillas burp when they are happy sim= 0.5743244816649642
	 97. 111,111,111 X 111,111,111 = 12,345,678,987,654,321 sim= 0.5716028199341777
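
The scores printed above are dot products with a query vector that was not normalized, so they are proportional to cosine similarity rather than equal to it. Below is a minimal sketch of a variant that reports true cosine similarities, assuming only NumPy. `cosine_top_k` and the toy vectors are hypothetical and not part of the lab code.

```python
import numpy as np

def cosine_top_k(query, dataset, k=5):
    """Return (row index, cosine similarity) for the k rows closest to query."""
    rows = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = rows @ q                    # cosine similarity of every row with the query
    order = np.argsort(-sims)[:k]      # indices of the k largest similarities
    return [(int(i), float(sims[i])) for i in order]

# toy data for illustration only
toy_docs = np.array([[1.0, 0.0],
                     [1.0, 1.0],
                     [0.0, 1.0],
                     [-1.0, 0.0]])
toy_query = np.array([2.0, 0.0])

top = cosine_top_k(toy_query, toy_docs, k=2)
print(top)
```

Normalizing the query does not change the ranking, only the scale of the reported scores, which then fall in the range [-1, 1].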


## 2. Training doc2vec model and documents classifier

Now train a `doc2vec` model yourself on [this topic-modeling dataset](https://code.google.com/archive/p/topic-modeling-tool/downloads).

### 2.1 Read dataset

First, read the dataset. It consists of 4 parts, which you need to merge into a single list.

In [0]:
!mkdir testdata_news
!wget -O ./testdata_news/music.txt https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_music_2084docs.txt
!wget -O ./testdata_news/economy.txt https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_economy_2073docs.txt
!wget -O ./testdata_news/fuel.txt https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_fuel_845docs.txt
!wget -O ./testdata_news/braininjury.txt https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_braininjury_10000docs.txt


--2020-02-10 13:11:40--  https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_music_2084docs.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.217.128, 2607:f8b0:400c:c00::80
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.217.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13985603 (13M) [application/octet-stream]
Saving to: ‘./testdata_news/music.txt’


2020-02-10 13:11:41 (52.5 MB/s) - ‘./testdata_news/music.txt’ saved [13985603/13985603]

--2020-02-10 13:11:41--  https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_economy_2073docs.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.210.128, 2607:f8b0:400c:c00::80
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.210.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1368

In [0]:
#TODO read the dataset into list
rows = []           # one (tokens, label) pair per document
label_mapping = {}  # numeric label -> topic name
# note: os.listdir returns files in arbitrary order, so label ids may differ between runs
for i, file in enumerate(os.listdir("./testdata_news")):
  label_mapping[i] = os.path.splitext(file)[0]
  with open(os.path.join("./testdata_news", file)) as f:
    for line in f:  # each line holds one document
      rows.append((nltk.word_tokenize(line), i))

all_data = pd.DataFrame(rows, columns=["doc", "label"])
print(len(all_data))

15002


In [0]:
label_mapping

{0: 'music', 1: 'braininjury', 2: 'fuel', 3: 'economy'}
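
`os.listdir` returns entries in arbitrary order, so the numeric labels above can change between machines. A small sketch, assuming you want a stable mapping, is to sort the file names first. `build_label_mapping` is a hypothetical helper, and the temporary directory only stands in for `./testdata_news`.

```python
import os
import tempfile

def build_label_mapping(folder):
    """Map 0..n-1 to topic names taken from the sorted file names in folder."""
    names = sorted(os.listdir(folder))
    return {i: os.path.splitext(name)[0] for i, name in enumerate(names)}

# stand-in directory with empty files named like the downloaded ones
with tempfile.TemporaryDirectory() as folder:
    for name in ["music.txt", "economy.txt", "fuel.txt", "braininjury.txt"]:
        open(os.path.join(folder, name), "w").close()
    mapping = build_label_mapping(folder)

print(mapping)
```

With sorted names the label ids differ from the mapping printed above, so any later report would list the classes in alphabetical order.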

### 2.2 Tests 

In [0]:
print(len(all_data))
assert len(all_data) == 15002

15002


### 2.3 Training `doc2vec` model

Train a `doc2vec` model on the dataset you've loaded. A training example is provided.

In [0]:
#TODO change this according to the task
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# peek at the first rows of the tokenized dataset
print(all_data[:10], "\n")
# tag every document with its row index
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(all_data.doc.values)]
print(documents[:10], "\n")
# train a model
model = Doc2Vec(
    documents,     # collection of tagged documents
    vector_size=5, # output vector size
    window=2,      # maximum distance between the target word and its neighboring word
    min_count=1,   # ignore words with total frequency lower than this
    workers=4      # number of worker threads
)

# free memory that is only needed during training (a gensim 3.x method)
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

# save and load
model.save("d2v.model")
model = Doc2Vec.load("d2v.model")

vec = model.infer_vector(["system", "response"])
print(vec)

                                                 doc  label
0  [the, ball, still, glittered, atop, times, squ...      0
1  [the, backstage, area, for, the, uso, holiday,...      0
2  [with, flags, waving, and, confetti, falling, ...      0
3  [the, new, york, times, said, editorial, for, ...      0
4  [amwest, stock, gets, lift, for, use, times, n...      0
5  [young, friend, mine, drew, ramsey, second, ye...      0
6  [the, seminal, russian, filmmaker, sergei, eis...      0
7  [where, are, boricuas, anthony, morales, shout...      0
8  [felt, like, was, going, church, marry, guy, n...      0
9  [performances, work, martha, graham, have, bee...      0 






[ 0.03309583 -0.06166314 -0.02303167 -0.16588251 -0.15204777]


### 2.4 Form train and test datasets

Transform the documents into vectors and split the data into train and test sets. Make sure the split is stratified, since the classes are imbalanced.

In [0]:
#TODO transform and make a train-test split
from sklearn.model_selection import train_test_split
# stratify on the label so both sets keep the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    all_data.doc, all_data.label, stratify=all_data.label,
    shuffle=True, random_state=89, test_size=0.25)
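
To see what stratification does, here is a hand-rolled stratified split in plain Python. It is a sketch for illustration, not scikit-learn's implementation, so its per-class counts may differ from `train_test_split` by a document. `stratified_split` is a hypothetical helper, and the class sizes are the document counts given in the dataset file names.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=89):
    """Split indices so each class sends about test_size of its items to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # same fraction for every class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# class sizes taken from the file names: music, economy, fuel, braininjury
labels = ["music"] * 2084 + ["economy"] * 2073 + ["fuel"] * 845 + ["braininjury"] * 10000
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Every class keeps roughly a quarter of its documents in the test set, so the small fuel class is represented in both sets.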


### 2.5 Train topics classifier

Train a classifier that would classify any document to one of four categories: fuel, brain injury, music, and economy.
Print a classification report for test data.

In [0]:
#TODO train a classifier and measure its performance
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# infer a vector for every tokenized document with the doc2vec model trained above
train_embeddings = np.array([model.infer_vector(doc_words) for doc_words in X_train])
test_embeddings = np.array([model.infer_vector(doc_words) for doc_words in X_test])

rf = RandomForestClassifier(n_estimators=50)
rf.fit(train_embeddings, y_train.values)
y_pred = rf.predict(test_embeddings)

# target_names follow the numeric label order 0..3 stored in label_mapping
print(classification_report(y_test.values, y_pred, target_names=list(label_mapping.values())))

              precision    recall  f1-score   support

       music       0.74      0.77      0.76       521
 braininjury       1.00      1.00      1.00      2501
        fuel       0.34      0.14      0.20       211
     economy       0.65      0.77      0.70       518

    accuracy                           0.89      3751
   macro avg       0.68      0.67      0.66      3751
weighted avg       0.88      0.89      0.88      3751



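Which class is the hardest one to recognize? <br> 
**Answer**: Fuel news. It has the lowest scores in the report (precision 0.34, recall 0.14, F1 0.20) and is also the smallest class, with 211 test documents.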
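
The averages at the bottom of the report can be recomputed from the per-class rows, which shows why accuracy looks high despite the weak fuel class. This is a quick arithmetic sketch using the rounded recall values and supports printed in the report, so results only match to two decimals.

```python
# per-class (recall, support) copied from the classification report
report = {
    "music": (0.77, 521),
    "braininjury": (1.00, 2501),
    "fuel": (0.14, 211),
    "economy": (0.77, 518),
}

recalls = [r for r, _ in report.values()]
supports = [s for _, s in report.values()]

# macro average: every class counts equally
macro_recall = sum(recalls) / len(recalls)

# weighted average: classes count in proportion to their support
weighted_recall = sum(r * s for r, s in report.values()) / sum(supports)

print(round(macro_recall, 2), round(weighted_recall, 2))  # prints 0.67 0.89
```

The large braininjury class pulls the weighted average up to 0.89, the same value as the accuracy, while the macro average of 0.67 exposes the weak fuel class.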

### 2.6 Bonus task

What if we trained our `doc2vec` model with a window size of 5 or 10? Would it improve the classification accuracy? What about vector dimensionality: does increasing it lead to better classification performance?

Explore the influence of these parameters on classification performance, visualizing it as a graph (e.g. window size vs f1-score, vector dim vs f1-score).

In [0]:
#TODO bonus task
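
One way to organise the experiment, as a sketch: loop over the parameter values, call an evaluation function for each, and collect the scores for plotting. `sweep` and `evaluate` are hypothetical names. In the lab, `evaluate` would train `Doc2Vec` with the given parameters, fit the classifier, and return the F1-score on the test set. The stub below only makes the sketch runnable and returns a placeholder, not a real result.

```python
def sweep(param_name, values, evaluate, **fixed):
    """Call evaluate once per value of param_name and return (value, score) pairs."""
    results = []
    for value in values:
        params = dict(fixed, **{param_name: value})  # fixed parameters plus the swept one
        results.append((value, evaluate(**params)))
    return results

calls = []  # records which parameter combinations were evaluated

def evaluate(window, vector_size):
    """Stub standing in for 'train doc2vec, fit the classifier, return F1'."""
    calls.append((window, vector_size))
    return 0.0  # placeholder score, not a real result

# vary the window size while keeping the vector size fixed
window_scores = sweep("window", [2, 5, 10], evaluate, vector_size=5)
for window, score in window_scores:
    print(window, score)
```

The returned pairs can be passed to a plotting library, for example `matplotlib.pyplot.plot`, with the parameter on the x-axis and the F1-score on the y-axis. A second call such as `sweep("vector_size", ..., evaluate, window=2)` would cover the dimensionality question.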