# Evolving vector-space model

This lab uses the `doc2vec` model for information retrieval and text classification.

## 1. Searching in the curious facts database
The facts dataset is available [here](https://github.com/hsu-ai-course/hsu.ai/blob/master/code/datasets/nlp/facts.txt); take a look. We want you to retrieve facts relevant to a query: for example, you type "good mood" and learn that Cherophobia is the fear of fun. The idea is to use document vectors. However, instead of building tf-idf vectors and reducing their dimensionality, this time we obtain fixed-size document vectors from the `doc2vec` model.

### 1.1 Loading trained `doc2vec` model

First, let's load the pre-trained `doc2vec` model from https://github.com/jhlau/doc2vec (Associated Press News DBOW, 0.6 GB).

In [1]:
!pip install gensim



In [1]:
# `!cd` would run in a temporary subshell and leave the notebook's working directory unchanged, so use the `%cd` magic
%cd apnews_dbow/

/home/jafar/PycharmProjects/Search-Engine-IR/Notebooks/lab5/apnews_dbow


In [2]:
from gensim.models.doc2vec import Doc2Vec

# unpacking the archive yields 3 files; load the main one
# doc2vec.bin  <---------- this
# doc2vec.bin.syn0.npy
# doc2vec.bin.syn1neg.npy
model = Doc2Vec.load('doc2vec.bin', mmap=None)
print(type(model))
print(type(model.infer_vector(["to", "be", "or", "not"])))

<class 'gensim.models.doc2vec.Doc2Vec'>
<class 'numpy.ndarray'>


### 1.2 Reading data

Now, let's read the facts dataset. Download it from the URL above and read it into a list of sentences.

In [6]:
!wget 'https://raw.githubusercontent.com/hsu-ai-course/hsu.ai/master/code/datasets/nlp/facts.txt'

--2020-03-24 17:10:24--  https://raw.githubusercontent.com/hsu-ai-course/hsu.ai/master/code/datasets/nlp/facts.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.244.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.244.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13158 (13K) [text/plain]
Saving to: ‘facts.txt’


2020-03-24 17:10:25 (199 KB/s) - ‘facts.txt’ saved [13158/13158]



In [3]:
#TODO read facts into list
with open('facts.txt',  encoding="ISO-8859-1") as f:
    facts = f.readlines()

### 1.3 Tests

In [4]:
print(*facts[:5], sep='\n')

assert len(facts) == 159
assert ('our lovely little planet') in facts[0]

1. If you somehow found a way to extract all of the gold from the bubbling core of our lovely little planet, you would be able to cover all of the land in a layer of gold up to your knees.

2. McDonalds calls frequent buyers of their food heavy users.

3. The average person spends 6 months of their lifetime waiting on a red light to turn green.

4. The largest recorded snowflake was in Keogh, MT during year 1887, and was 15 inches wide.

5. You burn more calories sleeping than you do watching television.



### 1.4  Transforming sentences to vectors

Transform the list of facts into a numpy array of vectors, one per document (`sent_vecs`), inferring them with the model we just loaded.

In [5]:
print(model.docvecs)

<gensim.models.keyedvectors.Doc2VecKeyedVectors object at 0x7f29cf0a50d0>


In [6]:
import numpy as np
#TODO infer vectors
# split() with no arguments also drops the trailing newline that readlines() keeps
sent_vecs = np.array([model.infer_vector(fact.split()) for fact in facts])

### 1.5 Tests 

In [7]:
print(sent_vecs.shape)
assert sent_vecs.shape == (159, 300)

(159, 300)


### 1.6 Find closest

Now, reusing the code from the last lab, find the facts closest to the query using the cosine similarity measure.

In [8]:
from System.Indexer.indexer import Indexer
#TODO output closest facts to the query
query = "good mood"
indexer = Indexer()
# index every fact under its 1-based position, used as a string document id
for counter, fact in enumerate(facts, start=1):
    indexer((str(counter), fact))
#print(indexer.term_doc_matrix)
r = indexer.optimize_via_cosine(query, [str(i) for i in range(1, len(facts) + 1)])

[nltk_data] Downloading package punkt to /home/jafar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jafar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jafar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [9]:
# pair each similarity score with its document id and sort from most to least similar
r = sorted(zip(r[0], r[1]), reverse=True)
print(r[0], facts[int(r[0][1])-1])
# print("Results for query:", query)
# r = model.most_similar(query.split(' '))
# print(r)
# for k, v, p in r:
#     print("\t", facts[k], "sim=", p)

(0.2357022603955158, '64') 60. It is considered good luck in Japan when a sumo wrestler makes your baby cry.
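
The cells above rank facts with the `Indexer` class from the previous lab. As a minimal sketch of the same ranking done directly on document vectors, the snippet below computes cosine similarity with numpy. The helper name `top_k_cosine` and the toy vectors are assumptions made for illustration; in the lab, the rows would be `sent_vecs` and the query vector would come from `model.infer_vector`.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return (row index, similarity) pairs for the k rows of doc_vecs closest to query_vec."""
    # Cosine similarity is the dot product divided by the product of the norms;
    # the small epsilon guards against division by zero for all-zero vectors.
    doc_norms = np.linalg.norm(doc_vecs, axis=1)
    query_norm = np.linalg.norm(query_vec)
    sims = doc_vecs @ query_vec / (doc_norms * query_norm + 1e-12)
    # Sort by descending similarity and keep the first k indices.
    best = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in best]

# Toy stand-ins (assumed data) for sent_vecs and the inferred query vector.
toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([2.0, 0.1])
ranked = top_k_cosine(toy_query, toy_vecs, k=2)
print(ranked)
```

With the lab's variables, a call such as `top_k_cosine(model.infer_vector(query.split()), sent_vecs)` would return indices that can be looked up in `facts`.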



## 2. Training doc2vec model and documents classifier

Now we would like you to train a `doc2vec` model yourself on [this topic-modeling dataset](https://code.google.com/archive/p/topic-modeling-tool/downloads).

### 2.1 Read dataset

First, read the dataset. It consists of 4 parts, which you need to merge into a single list.

In [30]:
file_names = ['testdata_braininjury_10000docs.txt', 'testdata_news_economy_2073docs.txt',
              'testdata_news_fuel_845docs.txt','testdata_news_music_2084docs.txt']
all_data = []

for file in file_names:
    with open('../../' + file, 'r') as f:
        # pair every document (one per line) with its source file name, which serves as the class label
        all_data.extend((line, file) for line in f)

### 2.2 Tests 

In [31]:
print(len(all_data))
assert len(all_data) == 15002

15002


### 2.3 Training `doc2vec` model

Train a `doc2vec` model on the dataset you've loaded. A training example is provided below.

In [27]:
#TODO change this according to the task
# small set of tokenized sentences
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# just a test set of tokenized sentences
print(common_texts, "\n")
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
print(documents, "\n")
# train a model
model = Doc2Vec(
    documents,     # collection of tagged texts
    vector_size=5, # output vector size
    window=2,      # maximum distance between the target word and its neighboring word
    min_count=1,   # ignore words that occur fewer times than this
    workers=4      # number of worker threads used for training
)

# clean training data
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

# save and load
model.save("d2v.model")
model = Doc2Vec.load("d2v.model")

vec = model.infer_vector(["system", "response"])
print(vec)

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']] 

[TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]), TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]), TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), TaggedDocument(words=['graph', 'trees'], tags=[6]), TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]), TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])] 

[-0.05293733 -0.00942373  0.09095236  0.05740672  0.02836587]


### 2.4 Form train and test datasets

Transform the documents into vectors and split the data into train and test sets. Make sure the split is stratified, as the classes are imbalanced.

In [33]:
import numpy as np
#TODO transform and make a train-test split
numpy_all = np.array(all_data)
np.random.shuffle(numpy_all)
thresh = len(numpy_all) * 8 // 10

# column 0 holds the document text, column 1 the class label (source file name)
X_train, X_test = numpy_all[:thresh, 0], numpy_all[thresh:, 0]
y_train, y_test = numpy_all[:thresh, 1], numpy_all[thresh:, 1]

In [34]:
print(X_train[:1])


['the beginning january gustavo sibona young physicist the edge international fame departed argentina with his wife carola and three children the youngest whom was only few months old and just his pocket gustavo hadn received his last two paychecks the national technological university and carola who worked the dean office the university cordoba suffered the same despite desperate protests and attempts deal with the banking maze and bureaucratic webs that each day grew more complex they were also unable withdraw savings they had set aside for emergencies unlike the majority argentineans abandoning masse the country where they were educated gustavo was hired almost immediately augsburg bavarian city kilometers northeast munich the vast distance and loss loved ones made this exodus for both them indescribable separation without name emptiness like that following death carola the great granddaughter jorge newbery hero argentine aviation and the engineer who designed the centennial illumin
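
The split above works on raw text and is not stratified. A minimal sketch of a stratified split using scikit-learn's `train_test_split` follows. The random features are an assumed stand-in for inferred document vectors (in the lab they could be built with something like `np.array([model.infer_vector(text.split()) for text, _ in all_data])`), and the label counts are made up to mimic class imbalance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins (assumed data): 100 "document vectors" with imbalanced class labels.
X = rng.normal(size=(100, 5))
y = np.array(["braininjury"] * 70 + ["economy"] * 20 + ["fuel"] * 10)

# stratify=y keeps the class proportions the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(dict(zip(*np.unique(y_test, return_counts=True))))
```

Fixing `random_state` makes the split reproducible between notebook runs.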



### 2.5 Train topics classifier

Train a classifier that assigns any document to one of four categories: fuel, brain injury, music, and economy.
Print a classification report for test data.

In [None]:
#TODO train a classifier and measure its performance

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Which class is the hardest one to recognize?
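
One possible way to fill in the TODO is sketched below with logistic regression, which is a choice made here for illustration and not something the lab prescribes. The two synthetic clusters are an assumed stand-in for document vectors of two classes; with the lab's data, `X` would hold the inferred vectors and `y` the four category labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in features (assumed data): two well-separated clusters of "document vectors".
X = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 5)),
    rng.normal(4.0, 1.0, size=(40, 5)),
])
y = np.array(["music"] * 60 + ["fuel"] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit on the training part and predict labels for the held-out part.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall and f1; the lowest f1 marks the hardest class.
print(classification_report(y_test, y_pred))
```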

### 2.6 Bonus task

What if we trained our `doc2vec` model with a window size of 5 or 10? Would it improve the classification accuracy? What about vector dimensionality: does increasing it give better classification performance?

Explore the influence of these parameters on classification performance and visualize it as a graph (e.g. window size vs f1-score, vector dim vs f1-score).

In [None]:
#TODO bonus task