# Train Our Model
Author: Sean Flannery [sflanner@purdue.edu](mailto:sflanner@purdue.edu)

Last Updated: June 13th, 2019

This notebook was developed to support modeling work with Professor Daisuke Kihara [dkihara@purdue.edu](mailto:dkihara@purdue.edu).

### Description
This notebook trains a Doc2Vec model on the preprocessed article text, checks how well the model recovers each training document, and saves the resulting document embeddings.

**Libraries Needed:** 
[pandas](https://pandas.pydata.org/pandas-docs/stable/install.html), 
[numpy](https://www.numpy.org), 
[gensim](https://radimrehurek.com/gensim/install.html), 
[tqdm](https://github.com/tqdm/tqdm), 
[scikit-learn](https://scikit-learn.org/stable/install.html),
[nltk](https://www.nltk.org)

**Reference:** This notebook relied heavily on information from a tutorial found here: [https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb) 

In [1]:
FILE_PREFIX = '' # optional! include to save files to certain area

In [2]:
# Standard imports
import random
random.seed(42)
import collections
import os
import pandas as pd
import numpy as np
np.random.seed(42)
# Progress bar
from tqdm import tqdm_notebook as tqdm
# Document similarity
import gensim
import gensim.parsing.preprocessing as gpp
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

In [3]:
db = pd.read_csv('preprocessed-data.csv')
db.head()

| Unnamed: 0 | year | article-link | local-path | title | abstract | authors | introduction | preprocessed_data |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2019 | https://doi.org/10.1093/nar/gky993 | articles/2019/1-NAR.html | Database Resources of the BIG Data Center in 2019 | The BIG Data Center at Beijing Institute of Ge... | ['BIG Data Center Members'] | The BIG Data Center (http://bigd.big.ac.cn) at... | big center beij institute genomic big chinese... |
| 1 | 2019 | https://doi.org/10.1093/nar/gky1124 | articles/2019/2-NAR.html | The European Bioinformatics Institute in 2018:... | The European Bioinformatics Institute (https:/... | ['Charles E Cook', 'Rodrigo Lopez', 'Oana Stro... | A primary mission of EMBL-EBI is to collect, o... | european bioinformatic institute ebi archive ... |
| 2 | 2019 | https://doi.org/10.1093/nar/gky1069 | articles/2019/3-NAR.html | Database resources of the National Center for ... | The National Center for Biotechnology Informat... | ['Eric W Sayers', 'Richa Agarwala', 'Evan E Bo... | The National Center for Biotechnology Informat... | national center biotechnology ncbi large suit... |
| 3 | 2019 | https://doi.org/10.1093/nar/gky843 | articles/2019/4-NAR.html | AmtDB: a database of ancient human mitochondri... | Ancient mitochondrial DNA is used for tracing ... | ['Edvard Ehler', 'Jiří Novotný', 'Anna Juras',... | Ancient DNA (aDNA) is a genetic material obtai... | ancient mitochondrial dna trace human past de... |
| 4 | 2019 | https://doi.org/10.1093/nar/gky822 | articles/2019/5-NAR.html | AnimalTFDB 3.0: a comprehensive resource for a... | The Animal Transcription Factor DataBase (Anim... | ['Hui Hu', 'Ya-Ru Miao', 'Long-Hao Jia', 'Qing... | Transcription factors (TFs) are special protei... | animal transcription factor animaltfdb aim co... |


## Build a Vocabulary
The vocabulary is essentially a dictionary (accessible via `model.wv.vocab`) that holds every unique word extracted from the training corpus along with its count (e.g., `model.wv.vocab['penalty'].count` gives the number of occurrences of the word `penalty`).
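As a rough illustration of what such a vocabulary holds, the sketch below counts words with the standard library only. The toy corpus and the names `toy_corpus`, `word_counts`, and `vocab` are hypothetical and are not part of gensim; this is not how gensim stores its vocabulary internally.

```python
from collections import Counter

# Hypothetical toy corpus; each string stands in for one row of `preprocessed_data`.
toy_corpus = [
    "protein sequence database",
    "protein structure database",
    "genome sequence archive",
]

# Count every whitespace-separated token across all documents.
word_counts = Counter(token for doc in toy_corpus for token in doc.split())

# With a threshold of 1 every observed word is kept;
# a higher threshold would drop rare words.
min_count = 1
vocab = {word: count for word, count in word_counts.items() if count >= min_count}

print(len(vocab))         # number of unique words kept
print(vocab["database"])  # how often "database" occurs
```

Raising `min_count` in this sketch shrinks `vocab`, which mirrors the role the `min_count` argument plays when the model is built.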

In [4]:
documents = [gensim.models.doc2vec.TaggedDocument(doc.split(),[i])
            for i, doc in enumerate(list(db['preprocessed_data']))]

In [5]:
print(documents[1])

TaggedDocument(['european', 'bioinformatic', 'institute', 'ebi', 'archive', 'curate', 'analyse', 'life', 'science', 'produce', 'researcher', 'world', 'make', 'use', 'globally', 'ebi', 'volume', 'continue', 'grow', 'exponentially', 'total', 'raw', 'storage', 'capacity', 'exceed', 'petabyte', 'manage', 'increase', 'flow', 'maintain', 'quality', 'service', 'year', 'improve', 'efficiency', 'computational', 'infrastructure', 'double', 'bandwidth', 'connection', 'worldwide', 'report', 'single', 'cell', 'expression', 'atla', 'ebi', 'gxa', 'component', 'expression', 'atla', 'pdbe', 'knowledgebase', 'ebi', 'pdbe', 'pdbe', 'collate', 'functional', 'annotation', 'prediction', 'structure', 'bank', 'additionally', 'europe', 'pmc', 'europepmc', 'org', 'add', 'preprint', 'abstract', 'result', 'supplement', 'result', 'peer', 'review', 'publication', 'embl', 'ebi', 'maintain', 'analytical', 'bioinformatic', 'complement', 'interface', 'programmatically', 'application', 'programme', 'interface', 'whilst'

## Building a Doc2Vec Instance

The basis for these parameters came from this paper: https://www.aclweb.org/anthology/W16-1609

In [6]:
import time
start_time = time.time()
model = gensim.models.doc2vec.Doc2Vec(
                documents=documents,
                vector_size=300,
                alpha=0.002, 
                dm = 1, # we want to use distributed memory not just Bag of Words
                dm_concat = 1, # Use more context (generates larger model)
                min_alpha=0.0001, 
                min_count=1,
                epochs=1000,
                negative=5,
                window=5, # have also used 15
                sample=0.00001,
                seed=42,
                max_vocab_size=None, # No limit on the size of our built vocabulary...
                workers=8,
                train_lbls=False)
file_time = time.time() - start_time
print("Training model with `documents` took {:.3f} minutes".format(file_time/60.))

Training model with `documents` took 20.971 minutes
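The `alpha` and `min_alpha` arguments control the learning rate, which gensim documents as dropping linearly to `min_alpha` as training progresses. The sketch below is a simplified, per-epoch approximation of that schedule using the values from the cell above; the helper `linear_alpha_schedule` is a hypothetical name, and gensim adjusts the rate at a finer granularity than whole epochs.

```python
def linear_alpha_schedule(alpha, min_alpha, epochs):
    """Return one learning rate per epoch, decreasing linearly from alpha towards min_alpha."""
    step = (alpha - min_alpha) / epochs
    return [alpha - step * epoch for epoch in range(epochs)]


# Values taken from the Doc2Vec call above.
rates = linear_alpha_schedule(alpha=0.002, min_alpha=0.0001, epochs=1000)

# Learning rate at the start, the midpoint, and the final epoch.
print(rates[0], rates[500], rates[-1])
```

With 1000 epochs the per-epoch change is tiny, so each pass over the corpus makes only small adjustments to the vectors.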


In [7]:
model.save(FILE_PREFIX + "trained.model")
print("Model Saved")

Model Saved


In [8]:
#Uncomment if you would like to load a local model
#model = gensim.models.Doc2Vec.load(FILE_PREFIX + "trained.model")

## Evaluate Self-Similarity
We re-infer a vector for each training document from its own words and check whether the model ranks that same document as the most similar one.

In [9]:
def infer_vector(doc_id):
    # Infer a fresh vector from the words of the training document with this id
    return model.infer_vector(documents[doc_id].words)

In [10]:
from multiprocessing import Pool

Here, we create a dictionary of similarities that records, for each document, the full list of trained document vectors ordered from most to least similar. This lets us check whether the model ranks a document as most similar to itself when given that document's own words as input.

We define `getRanks` to infer a vector for a document and fetch the most similar documents based on the model's trained document vectors.

In [11]:
def getRanks(doc_id):
    res_dict = {'doc_id': doc_id}
    # Re-infer a vector from the document's own words
    inferred_vector = model.infer_vector(documents[doc_id].words)
    # Rank every trained document vector by similarity to the inferred vector
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    # Position of the original document in that ranking (0 means it ranked itself first)
    res_dict['rank'] = [docid for docid, sim in sims].index(doc_id)
    res_dict['second_rank'] = sims[1]
    res_dict['sims'] = sims
    return res_dict
entry_range = list(range(len(documents)))
# Run the ranking for all documents in parallel
with Pool(8) as p:
    req_list = list(tqdm(p.imap(getRanks, entry_range), total=len(entry_range), leave=True))

In [13]:
ranks = []
sim_dict = dict()
for ent in req_list:
    ranks.append(ent['rank'])
    sim_dict[ent['doc_id']] = ent['sims']
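The ranking step that `getRanks` performs can be sketched without gensim. The example below uses hypothetical two-dimensional vectors and hand-written helpers (`cosine`, `self_rank`), assuming cosine similarity as the ranking measure; it is an illustration of the idea, not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_rank(doc_id, inferred, stored):
    """Position of doc_id when stored vectors are sorted by similarity to `inferred`."""
    sims = sorted(
        ((other_id, cosine(inferred, vec)) for other_id, vec in stored.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [other_id for other_id, _ in sims].index(doc_id)


# Hypothetical 2-D "document vectors" keyed by document id.
stored_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0]}

# A re-inferred vector is rarely identical to the stored one, so perturb it slightly.
inferred_for_doc0 = [0.9, 0.1]
print(self_rank(0, inferred_for_doc0, stored_vectors))
```

A rank of 0 means the document is its own nearest neighbour, which is the outcome we hope for on the real corpus.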

Ideally, most entries have rank 0, meaning the most similar document is the original document itself.

In [14]:
collections.Counter(ranks).most_common()

[(0, 3111), (1, 2)]

We see that 3,111 of the 3,113 articles rank themselves first, and the remaining 2 rank themselves second. Let's see what other conclusions we may draw from a single example.

In [15]:
#print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
doc_id = random.randint(0, len(documents) - 1)  # randint is inclusive on both ends
print('Document ({})\n'.format(doc_id))
print(db.loc[doc_id, 'title'])
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
print(sim_dict[doc_id][0])
for label, index in [
    ('MOST', 0), 
    ('SECOND-MOST', 1), 
    ('THIRD-MOST', 2), 
    ('FOURTH-MOST', 3), 
    ('FIFTH-MOST', 4), 
    ('SIXTH-MOST', 5),
    ('SEVENTH-MOST', 6), 
    ('EIGHTH-MOST', 7),
    ('NINTH-MOST', 8), 
    ('TENTH-MOST', 9),
    ('MEDIAN', len(sim_dict[0])//2), 
    ('LEAST', len(sim_dict[0]) - 1)]:
    print(u'%s \n%s \n%s\n' % (label, sim_dict[doc_id][index], str(db.loc[sim_dict[doc_id][index][0], 'title'])))

Document (2619)

trEST, trGEN and Hits: access to databases of predicted protein sequences
SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/c,d300,n5,w5,s1e-05,t8):

(2619, 0.9403233528137207)
MOST 
(2619, 0.9403233528137207) 
trEST, trGEN and Hits: access to databases of predicted protein sequences

SECOND-MOST 
(2317, 0.4497912526130676) 
trome, trEST and trGEN: databases of predicted protein sequences

THIRD-MOST 
(1933, 0.27196189761161804) 
ARED 3.0: the large and diverse AU-rich transcriptome

FOURTH-MOST 
(1912, 0.24466779828071594) 
DDBJ in preparation for overview of research activities behind data submissions

FIFTH-MOST 
(2357, 0.23794858157634735) 
TheArabidopsisSeedGenes Project

SIXTH-MOST 
(2492, 0.23058755695819855) 
FLAGdb/FST: a database of mapped flanking insertion sites (FSTs) ofArabidopsis thalianaT-DNA transformants

SEVENTH-MOST 
(2389, 0.22896608710289001) 
TheArabidopsisInformation Resource (TAIR): a model organism database providing a centralized, curated gateway 

## Evaluate Similarities
Now, we will randomly pick an example and see what the second most similar document might be.

In [36]:
# Pick a random document from the corpus and infer a vector from the model
doc_id = random.randint(0, len(documents) - 1)
# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, db.loc[doc_id, 'title']))
sim_id = sim_dict[doc_id][1][0]
print('Similar Document ({}): «{}»\n'.format(sim_id, db.loc[sim_id, 'title']))

Train Document (2069): «The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes»

Similar Document (2622): «The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species»



### Self-Similarity Confidence Levels
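Now, we want to systematically evaluate our model for accuracy. We measure the fraction of documents that rank themselves first and, for those documents, the mean similarity score of the match (reported below as "confidence").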

In [37]:
number_correct = 0.
confidence_sum = 0.
for original_id in range(len(documents)):
    if sim_dict[original_id][0][0] == original_id:
        number_correct += 1
        confidence_sum += sim_dict[original_id][0][1]
print("Percentage Correct Predictions for self-similarity: %.3f%%" % (100.* number_correct / len(documents)))
print("Average Confidence of self-similarity correctness: %.3f%%" % (100.* confidence_sum / number_correct))

Percentage Correct Predictions for self-similarity: 99.936%
Average Confidence of self-similarity correctness: 95.538%


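### i-th Similarity Confidence Levels

For each rank `i` in `sim_range`, we average the similarity score of the `i`-th most similar document across all documents. Similarity scores can be negative, so the averages at distant ranks may fall below zero.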

In [38]:
sim_range = [0,1,2,3,4,5,10,15,20,25,100,250,500,1000,2000,3000]

In [39]:
for i in sim_range:
    conf_sum = 0.
    total_count = 0.
    for original_id in range(len(documents)):
        conf_sum += sim_dict[original_id][i][1]
        total_count += 1
    print("Predicted similarity %% for rank %d: %.3f%%" % (i, 100. * conf_sum / total_count))

Predicted similarity % for rank 0: 95.537%
Predicted similarity % for rank 1: 52.687%
Predicted similarity % for rank 2: 45.244%
Predicted similarity % for rank 3: 40.751%
Predicted similarity % for rank 4: 37.638%
Predicted similarity % for rank 5: 35.581%
Predicted similarity % for rank 10: 29.951%
Predicted similarity % for rank 15: 27.133%
Predicted similarity % for rank 20: 25.387%
Predicted similarity % for rank 25: 24.114%
Predicted similarity % for rank 100: 17.003%
Predicted similarity % for rank 250: 12.366%
Predicted similarity % for rank 500: 8.473%
Predicted similarity % for rank 1000: 3.769%
Predicted similarity % for rank 2000: -3.262%
Predicted similarity % for rank 3000: -14.692%
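To make the averaging concrete, here is a small worked sketch with made-up numbers. `toy_sim_dict` is a hypothetical stand-in shaped like `sim_dict` (a list of `(doc_id, similarity)` pairs per document, ordered from most to least similar), and `mean_similarity_at_rank` is an illustrative helper, not part of the notebook's pipeline.

```python
from statistics import mean

# Hypothetical similarity lists, shaped like the values of `sim_dict`.
toy_sim_dict = {
    0: [(0, 0.95), (1, 0.40), (2, -0.10)],
    1: [(1, 0.97), (0, 0.38), (2, -0.20)],
    2: [(2, 0.93), (1, -0.05), (0, -0.15)],
}


def mean_similarity_at_rank(sim_dict, rank):
    """Average the similarity score found at position `rank` over all documents."""
    return mean(sims[rank][1] for sims in sim_dict.values())


for rank in range(3):
    print(rank, round(100 * mean_similarity_at_rank(toy_sim_dict, rank), 3))
```

Because the scores at the far end of each list are negative in this toy data, the average for the last rank is negative as well, just as in the rank 2000 and rank 3000 rows above.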


Now, let's extract the trained embedding of each document from `model.docvecs`.

In [40]:
entry_range = list(range(len(documents)))
def getVectorFromCorpus(doc_id):
    return model.docvecs[doc_id]

In [41]:
with Pool(25) as p:
    req_list = list(tqdm(p.imap(getVectorFromCorpus, entry_range), total=len(entry_range), leave=True))

In [42]:
data = np.asarray(req_list, dtype=np.float64)  # one row per document

In [43]:
np.save(FILE_PREFIX + 'doc_embeddings.npy', data)
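A quick round-trip check can confirm that the saved array loads back with the expected shape and dtype. The sketch below uses a small random stand-in (`toy_embeddings`, 4 hypothetical documents with 300 dimensions each) and a temporary directory rather than the real `doc_embeddings.npy`.

```python
import os
import tempfile

import numpy as np

# Stand-in for `data`: 4 hypothetical documents with 300-dimensional vectors.
rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(4, 300)).astype(np.float64)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "doc_embeddings.npy")
    np.save(path, toy_embeddings)  # write the array in NumPy's .npy format
    restored = np.load(path)       # read it back into memory

print(restored.shape, restored.dtype)
```

On the real file, the first dimension should equal the number of documents and the second should equal `vector_size`.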