# Evolving vector-space model

This lab will be devoted to the use of `doc2vec` model for the needs of information retrieval and text classification.  

## 1. Searching in the curious facts database
The facts dataset is given [here](https://github.com/hsu-ai-course/hsu.ai/blob/master/code/datasets/nlp/facts.txt), take a look.  We want you to retrieve facts relevant to the query, for example, you type "good mood", and get to know that Cherophobia is the fear of fun. For this, the idea is to utilize document vectors. However, instead of forming vectors with tf-idf and reducing dimensions, this time we want to obtain fixed-size vectors for documents using `doc2vec` model.

### 1.1 Loading trained `doc2vec` model

First, let's load the pre-trained `doc2vec` model from https://github.com/jhlau/doc2vec (Associated Press News DBOW (0.6GB))

In [0]:
!pip install gensim
!wget https://public.boxcloud.com/d/1/b1!z4mQUletMDL50OUZ5OX9LOcOjDMU_Me8s-t6uvcr6U4HWOojZMV-UwevTZsfiijxyKQQYrdzQbTMXMK4cReD7KjXADFCc_5ySlm3hV9s47-5QPiXH8uMAdY5y9HwVDn6OWRU-OM0VrQ4I1CEcsclZDCplbTDj0kCsB84HXY533kALhbGCN7pvpj2FZJF-GhJARGGGK7WeQx4r9z6Q_fMz49ljUdFNoC0kz3qmhp1fSjsg2A2SFrP_9ZAlaK2aT6Te3WAEq-xNanneeZc_H_MAJy4Z7WU97ac9Jq9lI2pA-KxWyPrwkIGvr9y2muTV5T5hhJ6kI1JYe73FUqImoZ1maFyT40QP4i2XhqVKLcfQ9JRQDUqaJkTLrLzMLOWbpP1FdzZG679FYpxpAQYWRu9BJ8wyusIIDDdTYZXa16o3BBNMUDdUkcuiD74vphaFCbYvTTlSnhu1k3GfjkYUaKn6uizEysOk4MykB-s_mOGipGWFmyyUFgEYfW7ofoAOhVFBNtM8wQPwN1dLiFduKG7HSTV-8giZ4yLUUJ3bFUaysRZqWlU-d1evdkAf0Q73j3kpLwretWFg4U_yt2TP2RgWn7HrNu8aiozDyrPE6JnwahFtropbFF5VRXqecOtgvDHsK6TnO3lV0ml4_V51fPn4AuoPEMs8XMfaVQKqX2TfhQBOrSkA_cjpnA_SWL1K6YJw1nZWaAEeLSFHZgD6AABpQ42OXNmF45jkVdXC1MxJpHf9hG8pbV3d2ln2mDO9MzjPf_roPrVEB_BtYD7A6Zr9jMBqNo4cMxVlPFrsiSW6upvHOJkVHhnXTiTTGMBVVtN9XiFfXPB7WmNoNNjaIk2vuZP3uGBcK_LcxBZqDML2JLfcRCd3_XWx7TCmjmCwaFnxkximEIRBsr_xcxr0QyVGfTWg6jOsECQBHUATHVnawK3ZxxMdzyLnmlS0HLLgsg2kqTpAsO8zP4-LJ0GM7esuL_3-_dsevu3vxSSqRRuMd_UcDfxuEfrpEqEwv1huBBRltX86Xvo4QqvrBycyJVBGYhCnAdoqNJIki3FQQAXXWklZBlxAuywmP7CYMSNWoUE3olvD-D_IVltuvIDeMqEYRP6x7S2-EsvrzlDyrnAIFaNRvRKCBAzbKs8Atwm5qT6chdL-DuTu3XIIQ1tME0qt9nlIRiyo9QmWsnV_BtSz26CPTCecb2CU6TzEOsy70_vsuKJodD4txwrMhLhShBwG3CdRU1ffo8uwqpDD_EiFjtpFefDOQ3f1yzBmJoKZJJPI-_ey6HYjZkA9MSrcK-Ek5hIOj4fj-JVYTyfZGSHL2o2H5KAYV541fFjQyEolGI./download

--2020-02-16 16:13:44--  https://public.boxcloud.com/d/1/b1!z4mQUletMDL50OUZ5OX9LOcOjDMU_Me8s-t6uvcr6U4HWOojZMV-UwevTZsfiijxyKQQYrdzQbTMXMK4cReD7KjXADFCc_5ySlm3hV9s47-5QPiXH8uMAdY5y9HwVDn6OWRU-OM0VrQ4I1CEcsclZDCplbTDj0kCsB84HXY533kALhbGCN7pvpj2FZJF-GhJARGGGK7WeQx4r9z6Q_fMz49ljUdFNoC0kz3qmhp1fSjsg2A2SFrP_9ZAlaK2aT6Te3WAEq-xNanneeZc_H_MAJy4Z7WU97ac9Jq9lI2pA-KxWyPrwkIGvr9y2muTV5T5hhJ6kI1JYe73FUqImoZ1maFyT40QP4i2XhqVKLcfQ9JRQDUqaJkTLrLzMLOWbpP1FdzZG679FYpxpAQYWRu9BJ8wyusIIDDdTYZXa16o3BBNMUDdUkcuiD74vphaFCbYvTTlSnhu1k3GfjkYUaKn6uizEysOk4MykB-s_mOGipGWFmyyUFgEYfW7ofoAOhVFBNtM8wQPwN1dLiFduKG7HSTV-8giZ4yLUUJ3bFUaysRZqWlU-d1evdkAf0Q73j3kpLwretWFg4U_yt2TP2RgWn7HrNu8aiozDyrPE6JnwahFtropbFF5VRXqecOtgvDHsK6TnO3lV0ml4_V51fPn4AuoPEMs8XMfaVQKqX2TfhQBOrSkA_cjpnA_SWL1K6YJw1nZWaAEeLSFHZgD6AABpQ42OXNmF45jkVdXC1MxJpHf9hG8pbV3d2ln2mDO9MzjPf_roPrVEB_BtYD7A6Zr9jMBqNo4cMxVlPFrsiSW6upvHOJkVHhnXTiTTGMBVVtN9XiFfXPB7WmNoNNjaIk2vuZP3uGBcK_LcxBZqDML2JLfcRCd3_XWx7TCmjmCwaFnxkximEIRBsr_xcxr0QyVGfTWg6jOsECQBHUATHVnawK3

In [0]:
from gensim.models.doc2vec import Doc2Vec

# unpack a model into 3 files and target the main one
! tar -xvf download && mv apnews_dbow/*
# doc2vec.bin  <---------- this
# doc2vec.bin.syn0.npy
# doc2vec.bin.sin1neg.npy
model = Doc2Vec.load('apnews_dbow/doc2vec.bin', mmap=None)
print(type(model))
print(type(model.infer_vector(["to", "be", "or", "not"])))

apnews_dbow/
apnews_dbow/doc2vec.bin.syn1neg.npy
apnews_dbow/doc2vec.bin.syn0.npy
apnews_dbow/doc2vec.bin
mv: target 'apnews_dbow/doc2vec.bin.syn1neg.npy' is not a directory


  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


<class 'gensim.models.doc2vec.Doc2Vec'>
<class 'numpy.ndarray'>




### 1.2 Reading data

Now, let's read the facts dataset. Download it from the abovementioned url and read to the list of sentences.

In [0]:
!wget https://raw.githubusercontent.com/hsu-ai-course/hsu.ai/master/code/datasets/nlp/facts.txt

--2020-02-16 16:14:57--  https://raw.githubusercontent.com/hsu-ai-course/hsu.ai/master/code/datasets/nlp/facts.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13158 (13K) [text/plain]
Saving to: ‘facts.txt’


2020-02-16 16:14:57 (1.79 MB/s) - ‘facts.txt’ saved [13158/13158]



In [0]:
facts = []
with open("facts.txt", encoding="windows-1251", errors='ignore') as facts0:
    for b in facts0:
      facts.append(b)

### 1.3 Tests

In [0]:
print(*facts[:5], sep='\n')

assert len(facts) == 159
assert ('our lovely little planet') in facts[0]

1. If you somehow found a way to extract all of the gold from the bubbling core of our lovely little planet, you would be able to cover all of the land in a layer of gold up to your knees.

2. McDonalds calls frequent buyers of their food “heavy users.”

3. The average person spends 6 months of their lifetime waiting on a red light to turn green.

4. The largest recorded snowflake was in Keogh, MT during year 1887, and was 15 inches wide.

5. You burn more calories sleeping than you do watching television.



### 1.4  Transforming sentences to vectors

Transform the list of facts to numpy array of vectors corresponding to each document (`sent_vecs`), inferring them from the model we just loaded.

In [0]:
#TODO infer vectors
import numpy as np
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

class Preprocessor:
    def __init__(self):
        self.stop_words = {'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'has', 'he', 'in', 'is', 'it', 'its',
                      'of', 'on', 'that', 'the', 'to', 'was', 'were', 'will', 'with'}
        self.ps = nltk.stem.PorterStemmer()

    
    def tokenize(self, text):
        #TODO word tokenize text using nltk lib
        
        return word_tokenize (text)

    
    def stem(self, word, stemmer):
        return  stemmer.stem(word.lower())

    
    def is_apt_word(self, word):
        #TODO check if word is appropriate - not a stop word and isalpha, 
        # i.e consists of letters, not punctuation, numbers, datesа
  
        if word.isalpha():
          if word not in self.stop_words:
            return True
        return False
    def preprocess(self, text):
        #TODO combine all previous methods together: tokenize lowercased text 
        # and stem it, ignoring not appropriate words
        tokenized = self.tokenize (text)
        filtered = []
        for word in tokenized:
          word = self.stem(word,self.ps)
          if self.is_apt_word(word):
            filtered.append(word)
        
          
        return filtered
preprocessed = Preprocessor()
sent_vecs = np.array([])
sent_vecs = np.vstack( [model.infer_vector( preprocessed.preprocess(fact.split(". ")[1]) ) for fact in facts] )

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### 1.5 Tests 

In [0]:
print(sent_vecs.shape)
assert sent_vecs.shape == (159, 300)

(159, 300)


### 1.6 Find closest

Now, reusing the code from the last lab, find facts which are closest to the query using cosine similarity measure.

In [0]:
def find_k_closest(query, dataset, k=5):    
    index = list((i, v, np.dot(query, v)) for i, v in enumerate(dataset))    
    return sorted(index, key=lambda pair: pair[2], reverse=True)[:k]

query = "good mood"
query1 = model.infer_vector( preprocessed.preprocess(query) ) 

def norm_vectors(A):
    An = A.copy()
    for i, row in enumerate(An):
        An[i, :] /= np.linalg.norm(row)
    return An
    
query1 /= np.linalg.norm(query1)
sent_vecs = norm_vectors(sent_vecs)



r = find_k_closest(query1, sent_vecs)



print("Results for query:", query)
for k, v, p in r:
    print("\t", facts[k], "sim=", p)

Results for query: good mood
	 76. You breathe on average about 8,409,600 times a year
 sim= 0.61738324
	 68. Cherophobia is the fear of fun.
 sim= 0.6098009
	 145. It is impossible to sneeze with your eyes open.
 sim= 0.60838455
	 69. The toothpaste “Colgate” in Spanish translates to “go hang yourself”
 sim= 0.5818511
	 144. Dolphins sleep with one eye open!
 sim= 0.5698239


## 2. Training doc2vec model and documents classifier

Now we would like you to train doc2vec model yourself based on [this topic-modeling dataset](https://code.google.com/archive/p/topic-modeling-tool/downloads).

### 2.1 Read dataset

First, read the dataset - it consists of 4 parts, you need to merge them into single list. 

In [0]:
!wget -O music.txt https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_music_2084docs.txt
!wget -O economy.txt https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_economy_2073docs.txt
!wget -O fuel.txt https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_fuel_845docs.txt
!wget -O brain.txt https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_braininjury_10000docs.txt

--2020-02-16 16:15:01--  https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_music_2084docs.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.197.128, 2607:f8b0:400e:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.197.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13985603 (13M) [application/octet-stream]
Saving to: ‘music.txt’


2020-02-16 16:15:02 (36.3 MB/s) - ‘music.txt’ saved [13985603/13985603]

--2020-02-16 16:15:03--  https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_economy_2073docs.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128, 2607:f8b0:400e:c09::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13682532 (13M) [application/octet-stream

In [0]:
#TODO read the dataset into list
import itertools
import pandas as pd
brain=[]
economy = []
music = []
fuel = []

bb = open('brain.txt', 'r')
for b in bb:
  brain.append(b)

ee = open("economy.txt", 'r')
for b in ee:
  economy.append(b)


mu = open("music.txt",'r')
for b in mu:
  music.append(b)

fu = open("fuel.txt",'r')
for b in fu:
  fuel.append(b)
all_data = list(itertools.chain(brain, economy, music, fuel))

In [0]:
economy

['the new york times said editorial for tuesday jan tuesday the same coins and banknotes can used buy cup coffee and the morning paper amsterdam lisbon helsinki naples dublin and dresden the franc mark lira and other currencies are disappearing vanquished the euro nothing quite like this changeover has ever taken place all goes according plan some billion worth newly minted euros will enter into circulation tuesday every bank and retail establishment from the southwestern corner portugal where the atlantic meets the mediterranean finland northernmost tundra over the next two months the euro will displace the traditional currencies used million people countries day planners call represents more than breathtaking logistical challenge and financial milestone for europe also day great political significance with euros circulation the process european integration first championed half century ago visionary french statesman named jean monnet acquires its most potent and tangible symbol the e

### 2.2 Tests 

In [0]:
print(len(all_data))
assert len(all_data) == 15002

15002


### 2.3 Training `doc2vec` model

Train a `doc2vec` model based on the dataset you've loaded. The example of training is provided.

In [0]:
#TODO change this according to the task
# small set of tokenized sentences
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# just a test set of tokenized sentences
# print(common_texts, "\n")
documents = [TaggedDocument(doc, 'brain') for i, doc in enumerate(all_data)]
for B in range (len(brain), len(brain)+len(economy)):
  documents[B] = TaggedDocument(documents[B].words, "economy")
for B in range (len(brain)+len(economy),len(brain)+len(economy)+len(music)):
  documents[B] = TaggedDocument(documents[B].words, "music")
for B in range (len(brain)+len(economy)+len(music),len(brain)+len(economy)+len(music)+len(fuel)):
  documents[B] = TaggedDocument(documents[B].words, "fuel")



# print(documents, "\n")
# train a model
model = Doc2Vec(
    documents,     # collection of texts
    vector_size=5, # output vector size
    window=2,      # maximum distance between the target word and its neighboring word
    min_count=1,   # minimal number of 
    workers=4      # in parallel
)

# clean training data
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

# save and load
model.save("d2v.model")
model = Doc2Vec.load("d2v.model")

vec = model.infer_vector(["system", "response"])
# print(vec)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


### 2.4 Form train and test datasets

Transform documents to vectors and split data to train and test sets. Make sure that the split is stratified as the classes are imbalanced.

In [0]:
#TODO transform and make a train-test split
import sklearn
from sklearn.model_selection import train_test_split
vektor_l = []
tags_l = []
for b in range (len(documents)):
  vektor_l.append(model.infer_vector(documents[b].words))
for b in range (len(documents)):
  tags_l.append(documents[b].tags)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(vektor_l, tags_l,stratify = tags_l)











 ### 2.5 Train topics classifier

Train a classifier that would classify any document to one of four categories: fuel, brain injury, music, and economy.
Print a classification report for test data.

In [52]:
#TODO train a classifier and measure its performance

from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       brain       0.90      0.98      0.94      2501
     economy       0.44      0.53      0.49       518
        fuel       0.00      0.00      0.00       211
       music       0.53      0.41      0.46       521

    accuracy                           0.79      3751
   macro avg       0.47      0.48      0.47      3751
weighted avg       0.74      0.79      0.76      3751



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
  _warn_prf(average, modifier, msg_start, len(result))


Which class is the hardest one to recognize?

### 2.6 Bonus task

What if we trained our `doc2vec` model using window size = 5 or 10? Would it improve the classification acccuracy? What about vector dimensionality? Does it mean that increasing it we will achieve better performance in terms of classification?

Explore the influence of these parameters on classification performance, visualizing it as a graph (e.g. window size vs f1-score, vector dim vs f1-score).

In [0]:
#TODO bonus task