# Training Doc2Vec on scientific articles

This notebook replicates the **Document Embedding with Paragraph Vectors** paper, http://arxiv.org/abs/1507.07998.

In that paper, the authors only showed results from the DBOW ("distributed bag of words") mode, trained on the article dataset. Here we replicate this experiment using not only DBOW, but also the DM ("distributed memory") mode of the Paragraph Vector algorithm aka Doc2Vec.
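
To make the difference between the two modes concrete, here is a minimal sketch of the training examples each mode produces for a single document. It is a simplified illustration, not Gensim's implementation: the function names `dbow_pairs` and `dm_pairs` and the toy document are made up for this example, and details such as negative sampling, subsampling and random window shrinking are left out.

```python
from typing import Iterator, List, Tuple

def dbow_pairs(tag: str, words: List[str]) -> Iterator[Tuple[str, str]]:
    """PV-DBOW: the document tag alone is used to predict each word of the document."""
    for target in words:
        yield tag, target

def dm_pairs(tag: str, words: List[str], window: int) -> Iterator[Tuple[List[str], str]]:
    """PV-DM: the tag plus the surrounding context words jointly predict the target word."""
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]      # up to `window` words before the target
        right = words[i + 1:i + 1 + window]     # up to `window` words after the target
        yield [tag] + left + right, target

# Hypothetical toy document with a single tag.
doc_words = ["neural", "networks", "learn", "features"]
dbow_examples = list(dbow_pairs("doc_0", doc_words))
dm_examples = list(dm_pairs("doc_0", doc_words, window=1))

print(dbow_examples)
print(dm_examples)
```

In DBOW the input is always just the tag, so word order inside the document is ignored; in DM the tag acts as an extra context token that is combined with the neighbouring words.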

## Basic setup

Let's import the necessary modules and set up logging. The code below assumes Python 3.7+ and Gensim 4.0+.

In [1]:
import logging
import multiprocessing
from pprint import pprint

import smart_open
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [151]:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from collections import Counter
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## Preparing the corpus

In [152]:
# Metadata columns that are not needed for training.
COLUMNS_TO_DROP = ['year', 'n_citation', 'references', 'authors', 'venue', 'lang', 'page_start', 'page_end', 'volume',
                   'issue', 'issn', 'isbn', 'doi', 'pdf', 'url']
RANDOM_STATE = 42
NUM_PARTS = 5

def get_text_data(file_path):
    """Load one JSON part and return `_id`, `keywords`, `fos` and a combined `text` column."""
    data = pd.read_json(file_path, dtype={'title': 'string', 'abstract': 'string'})
    data = data.drop(columns=COLUMNS_TO_DROP)
    # Treat empty abstracts as missing so that dropna() below removes them.
    data['abstract'] = data['abstract'].replace('', np.nan)
    data = data.dropna(subset=['keywords', 'abstract', 'title', 'fos'])
    # Concatenate title and abstract into a single text field.
    data['text'] = data[['title', 'abstract']].apply(lambda row: ' '.join(row.astype(str)), axis=1).astype('string')
    data = data.drop(columns=['title', 'abstract'])
    return data

In [154]:
articles = pd.concat(get_text_data(f'data/part_{i+1}.json') for i in range(NUM_PARTS))
articles.reset_index(drop=True, inplace=True)
articles.to_json('articles.json')
articles.to_csv('articles.csv')
articles = pd.read_json('articles.json')

articles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1124438 entries, 0 to 1124437
Data columns (total 4 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   _id       1124438 non-null  object
 1   keywords  1124438 non-null  object
 2   fos       1124438 non-null  object
 3   text      1124438 non-null  object
dtypes: object(4)
memory usage: 42.9+ MB


In [155]:
test_articles = get_text_data('data/part_6.json')
test_articles.reset_index(drop=True, inplace=True)
test_articles.to_json('test_articles.json')  # separate file, so the training set in articles.json is not overwritten
test_articles = pd.read_json('test_articles.json')

test_articles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 231269 entries, 0 to 231268
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   _id       231269 non-null  object
 1   keywords  231269 non-null  object
 2   fos       231269 non-null  object
 3   text      231269 non-null  object
dtypes: object(4)
memory usage: 8.8+ MB


In [231]:
test_articles.head()

|   | _id | keywords | fos | text |
|---|-----|----------|-----|------|
| 0 | 53e9b036b7602d9703a91d9b | [single class, performance analysis, response ... | [Timing diagram, Sequence diagram, Computer sc... | Applying the UML class diagram in the performa... |
| 1 | 53e9b036b7602d9703a922a3 | [grid computing, share geographically, differe... | [Reservation, Preemption, Virtual machine, Sch... | Resource Leasing and the Art of Suspending Vir... |
| 2 | 53e9b036b7602d9703a920d5 | [potential cause, entropy rate constancy princ... | [Gapping, Branching factor, Entropy rate, Tree... | Variation of entropy and parse trees of senten... |
| 3 | 53e9b036b7602d9703a92304 | [database management systems, best-first searc... | [Skyline, Data mining, R-tree, Ranking, Comput... | Skyline ranking for uncertain data with maybe ... |
| 4 | 53e9b036b7602d9703a92101 | [State feedback, Closed loop systems, Linear m... | [Iterative method, Control theory, Exponential... | Robust Controller Design of Uncertain Discrete... |


## Normalize data

In [158]:
%%time

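def normalize(sentence):
    # Imported here to keep the function self-contained for pandarallel's worker processes.
    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()
    # Tag lists (`keywords`, `fos`) are only lowercased, not stemmed.
    if isinstance(sentence, list):
        return [word.lower() for word in sentence]
        #return [porter.stem(word) for word in sentence]
    # String cells are split on whitespace and Porter-stemmed token by token.
    return ' '.join(porter.stem(word) for word in sentence.split())

# Note: applymap touches every cell, so the `_id` strings pass through the stemmer as well.
articles = articles.parallel_applymap(normalize)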

CPU times: user 25 s, sys: 9.05 s, total: 34.1 s
Wall time: 3min 2s


## Topic selection

In [160]:
tag_freq = Counter()
for index, doc in articles.iterrows():
    tag_freq.update(Counter(doc['fos']))
    tag_freq.update(Counter(doc['keywords']))

tag_freq.most_common(50)

[('computer science', 716471),
 ('mathematics', 261368),
 ('artificial intelligence', 216887),
 ('algorithm', 104301),
 ('computer vision', 80242),
 ('data mining', 78938),
 ('discrete mathematics', 74711),
 ('computer network', 74309),
 ('distributed computing', 69909),
 ('engineering', 66918),
 ('combinatorics', 62436),
 ('mathematical optimization', 60434),
 ('theoretical computer science', 59880),
 ('pattern recognition', 59719),
 ('machine learning', 51758),
 ('world wide web', 49397),
 ('programming language', 46461),
 ('information retrieval', 45807),
 ('control theory', 44805),
 ('software engineering', 42317),
 ('multimedia', 35088),
 ('computer security', 35015),
 ('knowledge management', 34590),
 ('software', 33129),
 ('natural language processing', 31679),
 ('parallel computing', 31594),
 ('human–computer interaction', 30237),
 ('feature extraction', 29798),
 ('embedded system', 29575),
 ('artificial neural network', 29138),
 ('simulation', 28845),
 ('speech recognition', 2

In [161]:
len(tag_freq)

3557260

In [225]:
most_freq_tags = {tag:tag_freq[tag] for tag in tag_freq if tag_freq[tag] > 1499}
len(most_freq_tags)

1749
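
As a small illustration of the counting and thresholding above, here is the same logic on a toy dataset. The documents and the helper name `frequent_tags` are made up for this sketch; note that `Counter.update` already counts the elements of a list, so wrapping each list in another `Counter` is optional.

```python
from collections import Counter

# Hypothetical toy records with the same two tag columns as the real data.
toy_docs = [
    {"fos": ["computer science", "algorithm"], "keywords": ["sorting"]},
    {"fos": ["computer science"], "keywords": ["algorithm", "graph"]},
    {"fos": ["computer science", "mathematics"], "keywords": ["graph"]},
]

toy_freq = Counter()
for doc in toy_docs:
    toy_freq.update(doc["fos"])       # update() counts each list element once
    toy_freq.update(doc["keywords"])

def frequent_tags(freq, min_freq):
    """Keep only the tags that occur strictly more than `min_freq` times."""
    return {tag: count for tag, count in freq.items() if count > min_freq}

kept = frequent_tags(toy_freq, min_freq=1)
print(kept)
```

The strict `>` comparison mirrors the notebook, which is why thresholds such as 1499 correspond to "at least 1,500 occurrences".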

In [226]:
class TaggedCorpus:
    def __init__(self, dataframe, min_freq=99):
        self.df = dataframe
        self.min_freq = min_freq
        
    def __iter__(self):
        for index, row in self.df.iterrows():
            kws = {kw for kw in row['keywords'] +  row['fos'] if tag_freq[kw] > self.min_freq}
            if len(kws) < 2:
                continue
            yield TaggedDocument(words=row['text'].split(), tags=list(kws))

documents299 = TaggedCorpus(articles, min_freq=299)            
documents99 = TaggedCorpus(articles, min_freq=99)  
documents29 = TaggedCorpus(articles, min_freq=29) 
documents599 = TaggedCorpus(articles, min_freq=599)
documents1499 = TaggedCorpus(articles, min_freq=1499)
#documents = [TaggedDocument(row['text'], [row['_id']]) for index, row in articles.iterrows()]
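
The corpus is wrapped in a class with `__iter__` rather than a plain generator because Gensim iterates over the corpus several times (once in `build_vocab` and once per training epoch), and a generator can be consumed only once. The sketch below shows the difference on toy data; `ToyTaggedDocument` is a stand-in for Gensim's `TaggedDocument` so that the example runs without Gensim, and the rows and counts are invented.

```python
from collections import Counter, namedtuple

# Stand-in for gensim's TaggedDocument (hypothetical, for illustration only).
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])

class ToyTaggedCorpus:
    """Restartable corpus: every call to __iter__ starts a fresh pass over the rows."""

    def __init__(self, rows, tag_counts, min_freq):
        self.rows = rows
        self.tag_counts = tag_counts
        self.min_freq = min_freq

    def __iter__(self):
        for row in self.rows:
            # Keep only tags that are frequent enough; sort for a stable order.
            tags = sorted({t for t in row["keywords"] + row["fos"]
                           if self.tag_counts[t] > self.min_freq})
            if len(tags) < 2:
                continue  # documents with fewer than two surviving tags are skipped
            yield ToyTaggedDocument(words=row["text"].split(), tags=tags)

rows = [
    {"text": "graph search algorithm", "keywords": ["graph"], "fos": ["algorithm", "rare topic"]},
    {"text": "sorting numbers", "keywords": ["sorting"], "fos": ["algorithm"]},
]
tag_counts = Counter({"graph": 5, "algorithm": 9, "rare topic": 1, "sorting": 1})
corpus = ToyTaggedCorpus(rows, tag_counts, min_freq=2)

first_pass = list(corpus)    # e.g. the vocabulary scan
second_pass = list(corpus)   # e.g. a training epoch: same documents again

one_shot = iter(corpus)      # a single generator object, by contrast...
consumed = list(one_shot)
leftover = list(one_shot)    # ...is empty on the second attempt
print(len(first_pass), len(second_pass), len(leftover))
```

Because documents with fewer than two surviving tags are skipped, each `min_freq` setting yields a slightly different corpus, which is consistent with the different word counts in the vocabulary-building logs further down.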


In [227]:
# Load and print the first preprocessed document, as a sanity check = "input eyeballing".
first_doc = next(iter(documents1499))
print(first_doc.tags, ': ', first_doc.words)


['environmental science', 'spectrum', 'indexes', 'indexing terms', 'indexation'] :  ['the', 'relationship', 'between', 'canopi', 'paramet', 'and', 'spectrum', 'of', 'winter', 'wheat', 'under', 'differ', 'irrig', 'in', 'hebei', 'province.', 'intern', 'geoscienc', 'and', 'remot', 'sens', 'symposium', 'drought', 'is', 'the', 'first', 'place', 'in', 'all', 'the', 'natur', 'disast', 'in', 'the', 'world.', 'It', 'is', 'especi', 'seriou', 'in', 'north', 'china', 'plain.', 'In', 'thi', 'paper,', 'differ', 'soil', 'water', 'content', 'control', 'level', 'at', 'winter', 'wheat', 'growth', 'stage', 'are', 'perform', 'on', 'gucheng', 'ecological-meteorolog', 'integr', 'observ', 'experi', 'station', 'of', 'cams,', 'china.', 'some', 'canopi', 'parameters,', 'includ', 'growth', 'conditions,', 'dri', 'weight,', 'physiolog', 'paramet', 'and', 'hyperspectr', 'reflectance,', 'are', 'measur', 'from', 'erect', 'stage', 'to', 'milk', 'stage', 'for', 'winter', 'wheat', 'in', '2009.', 'the', 'relationship', '

The document seems legit, so let's finally move on to training some Doc2Vec models.

## Training Doc2Vec

The original paper had a vocabulary size of 915,715 word types, so we'll try to match it by setting `max_final_vocab` to 1,000,000 in the Doc2Vec constructor.
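
As far as I understand it, `max_final_vocab` does not truncate the vocabulary directly; it raises the effective minimum word count until the vocabulary fits under the cap, and never goes below the configured `min_count`. The sketch below is a simplified model of that rule on invented counts, not Gensim's own code; the function name `effective_min_count` is an assumption made for this example.

```python
from collections import Counter

def effective_min_count(raw_counts, min_count, max_final_vocab):
    """Smallest count threshold that respects both `min_count` and the vocabulary cap."""
    ranked = sorted(raw_counts.values(), reverse=True)
    if max_final_vocab is None or max_final_vocab >= len(ranked):
        return min_count                  # the cap is not binding
    # Count of the first word that no longer fits, plus one, excludes it and its ties.
    cutoff = ranked[max_final_vocab] + 1
    return max(cutoff, min_count)

# Hypothetical raw word counts.
raw = Counter({"the": 50, "model": 20, "vector": 8, "wheat": 3, "hebei": 1})

loose = effective_min_count(raw, min_count=5, max_final_vocab=1_000_000)
tight = effective_min_count(raw, min_count=5, max_final_vocab=2)

kept_loose = [w for w, c in raw.items() if c >= loose]
kept_tight = [w for w, c in raw.items() if c >= tight]
print(loose, kept_loose)
print(tight, kept_tight)
```

This would also explain why the training logs below report a vocabulary of roughly 290,000 words rather than one million: with the default `min_count=5` (the `mc5` in the model description), the cap is likely never reached on this corpus.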

Other critical parameters were left unspecified in the paper, so we'll go with a window size of eight (a prediction window of 8 tokens to either side). It looks like the authors tried vector dimensionalities of 100, 300, 1,000 & 10,000 in the paper (with 10k dims performing the best), but we'll train with only 200 dimensions here (100 for the last model, `model_dbow1499`), to keep RAM usage in check.

Feel free to tinker with these values yourself if you like:

In [194]:
workers = multiprocessing.cpu_count() - 2  # leave two cores for the OS & other stuff

# PV-DBOW: paragraph vector in distributed bag of words mode
model_dbow29 = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)

# PV-DBOW: paragraph vector in distributed bag of words mode
model_dbow99 = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)

# PV-DBOW: paragraph vector in distributed bag of words mode
model_dbow299 = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)


# PV-DM: paragraph vector in distributed memory mode
model_dm = Doc2Vec(
    dm=1, dm_mean=1,  # use average of context word vectors to train DM
    vector_size=200, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)

2022-10-19 17:34:07,138 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>', 'datetime': '2022-10-19T17:34:07.138120', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'created'}
2022-10-19 17:34:07,172 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>', 'datetime': '2022-10-19T17:34:07.172679', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'created'}
2022-10-19 17:34:07,173 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>', 'datetime': '2022-10-19T17:34:07.173916', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'created'}
2022-10-19 17:34:07,174 : INFO : Doc2

In [195]:
workers

30

In [196]:
model_dbow29.build_vocab(documents29, progress_per=100000)
print(model_dbow29)
print('---------------------------------------')

model_dbow99.build_vocab(documents99, progress_per=100000)
print(model_dbow99)
print('---------------------------------------')

model_dbow299.build_vocab(documents299, progress_per=100000)
print(model_dbow299)

# Save some time by copying the vocabulary structures from the DBOW model to the DM model.
# Both models are built on top of exactly the same data, so there's no need to repeat the vocab-building step.
#model_dm.reset_from(model_dbow)
#print(model_dm)

2022-10-19 17:34:14,001 : INFO : collecting all words and their counts
2022-10-19 17:34:14,003 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2022-10-19 17:34:28,033 : INFO : PROGRESS: at example #100000, processed 13973232 words (996004 words/s), 385945 word types, 49025 tags
2022-10-19 17:34:41,866 : INFO : PROGRESS: at example #200000, processed 27589815 words (984397 words/s), 636898 word types, 49920 tags
2022-10-19 17:34:56,269 : INFO : PROGRESS: at example #300000, processed 42242831 words (1017444 words/s), 851710 word types, 49976 tags
2022-10-19 17:35:10,659 : INFO : PROGRESS: at example #400000, processed 56854607 words (1015492 words/s), 1045186 word types, 49980 tags
2022-10-19 17:35:25,235 : INFO : PROGRESS: at example #500000, processed 71412120 words (998771 words/s), 1226341 word types, 49980 tags
2022-10-19 17:35:40,101 : INFO : PROGRESS: at example #600000, processed 85962511 words (978836 words/s), 1396240 word types, 49980 tag

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
---------------------------------------


2022-10-19 17:37:19,245 : INFO : PROGRESS: at example #100000, processed 13973232 words (1004736 words/s), 385945 word types, 19731 tags
2022-10-19 17:37:33,042 : INFO : PROGRESS: at example #200000, processed 27589815 words (986960 words/s), 636898 word types, 19732 tags
2022-10-19 17:37:47,127 : INFO : PROGRESS: at example #300000, processed 42242831 words (1040398 words/s), 851710 word types, 19732 tags
2022-10-19 17:38:01,498 : INFO : PROGRESS: at example #400000, processed 56854607 words (1016844 words/s), 1045186 word types, 19732 tags
2022-10-19 17:38:16,091 : INFO : PROGRESS: at example #500000, processed 71412120 words (997647 words/s), 1226341 word types, 19732 tags
2022-10-19 17:38:30,662 : INFO : PROGRESS: at example #600000, processed 85962511 words (998620 words/s), 1396240 word types, 19732 tags
2022-10-19 17:38:45,382 : INFO : PROGRESS: at example #700000, processed 100626352 words (996236 words/s), 1559209 word types, 19732 tags
2022-10-19 17:38:59,992 : INFO : PROGRES

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
---------------------------------------


2022-10-19 17:40:09,950 : INFO : PROGRESS: at example #100000, processed 13973232 words (1023841 words/s), 385945 word types, 8152 tags
2022-10-19 17:40:23,708 : INFO : PROGRESS: at example #200000, processed 27589815 words (989751 words/s), 636898 word types, 8152 tags
2022-10-19 17:40:38,034 : INFO : PROGRESS: at example #300000, processed 42242831 words (1022899 words/s), 851710 word types, 8152 tags
2022-10-19 17:40:52,395 : INFO : PROGRESS: at example #400000, processed 56854607 words (1017524 words/s), 1045186 word types, 8152 tags
2022-10-19 17:41:06,808 : INFO : PROGRESS: at example #500000, processed 71412120 words (1010074 words/s), 1226341 word types, 8152 tags
2022-10-19 17:41:21,377 : INFO : PROGRESS: at example #600000, processed 85962511 words (998780 words/s), 1396240 word types, 8152 tags
2022-10-19 17:41:36,028 : INFO : PROGRESS: at example #700000, processed 100626352 words (1000953 words/s), 1559209 word types, 8152 tags
2022-10-19 17:41:50,566 : INFO : PROGRESS: at

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>


In [197]:
# Train DBOW doc2vec
# Report progress every 5 min.
model_dbow299.train(documents299, total_examples=model_dbow299.corpus_count, epochs=model_dbow299.epochs, report_delay=5*60)

2022-10-19 17:42:44,952 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 30 workers on 290617 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-10-19T17:42:44.952305', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'train'}
2022-10-19 17:42:47,199 : INFO : EPOCH 0 - PROGRESS: at 0.01% examples, 3698 words/s, in_qsize 59, out_qsize 0
2022-10-19 17:47:47,242 : INFO : EPOCH 0 - PROGRESS: at 30.88% examples, 136641 words/s, in_qsize 60, out_qsize 0
2022-10-19 17:52:47,262 : INFO : EPOCH 0 - PROGRESS: at 60.37% examples, 136103 words/s, in_qsize 60, out_qsize 0
2022-10-19 17:57:47,269 : INFO : EPOCH 0 - PROGRESS: at 88.13% examples, 132684 words/s, in_qsize 60, out_qsize 0
2022-10-19 17:59:45,155 : INFO : EPOCH 0: training on 161599099 raw words (135843241 effective words) took 1020.2s, 133157 effective w

In [202]:
model_dbow599 = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)
model_dbow599.build_vocab(documents599, progress_per=100000)
print(model_dbow599)

2022-10-19 21:16:03,857 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>', 'datetime': '2022-10-19T21:16:03.857502', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'created'}
2022-10-19 21:16:03,858 : INFO : collecting all words and their counts
2022-10-19 21:16:03,860 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2022-10-19 21:16:17,482 : INFO : PROGRESS: at example #100000, processed 13992516 words (1027257 words/s), 381557 word types, 4455 tags
2022-10-19 21:16:31,389 : INFO : PROGRESS: at example #200000, processed 27656910 words (982648 words/s), 628044 word types, 4455 tags
2022-10-19 21:16:45,748 : INFO : PROGRESS: at example #300000, processed 42317911 words (1021078 words/s), 840906 word types, 4455 tags
2022-10-19 21:17:00,039 : INFO : PROGRESS: at example #400000, processed 56934419 words (10

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>


In [203]:
# Train the min_freq=599 DBOW model.
# Report progress every 5 min.
model_dbow599.train(documents599, total_examples=model_dbow599.corpus_count, epochs=model_dbow599.epochs, report_delay=5*60)

2022-10-19 21:18:50,259 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 30 workers on 286533 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-10-19T21:18:50.259471', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'train'}
2022-10-19 21:18:52,367 : INFO : EPOCH 0 - PROGRESS: at 0.01% examples, 3920 words/s, in_qsize 59, out_qsize 0
2022-10-19 21:23:52,381 : INFO : EPOCH 0 - PROGRESS: at 32.75% examples, 143335 words/s, in_qsize 59, out_qsize 0
2022-10-19 21:28:52,416 : INFO : EPOCH 0 - PROGRESS: at 63.98% examples, 142423 words/s, in_qsize 59, out_qsize 0
2022-10-19 21:33:52,503 : INFO : EPOCH 0 - PROGRESS: at 95.83% examples, 142311 words/s, in_qsize 60, out_qsize 0
2022-10-19 21:34:30,982 : INFO : EPOCH 0: training on 161060371 raw words (133972441 effective words) took 940.7s, 142416 effective wo

In [229]:
model_dbow1499 = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=100, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)
model_dbow1499.build_vocab(documents1499, progress_per=100000)
print(model_dbow1499)

2022-10-20 01:38:29,164 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d100,n5,w8,mc5,s0.001,t30>', 'datetime': '2022-10-20T01:38:29.164632', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'created'}
2022-10-20 01:38:29,202 : INFO : collecting all words and their counts
2022-10-20 01:38:29,204 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2022-10-20 01:38:42,739 : INFO : PROGRESS: at example #100000, processed 13978134 words (1032729 words/s), 379245 word types, 1749 tags
2022-10-20 01:38:56,555 : INFO : PROGRESS: at example #200000, processed 27666763 words (990871 words/s), 624354 word types, 1749 tags
2022-10-20 01:39:10,646 : INFO : PROGRESS: at example #300000, processed 42319066 words (1039897 words/s), 836069 word types, 1749 tags
2022-10-20 01:39:24,941 : INFO : PROGRESS: at example #400000, processed 56912252 words (10

Doc2Vec<dbow+w,d100,n5,w8,mc5,s0.001,t30>


In [230]:
model_dbow1499.train(documents1499, total_examples=model_dbow1499.corpus_count, epochs=model_dbow1499.epochs, report_delay=5*60)

2022-10-20 01:41:21,812 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 30 workers on 283895 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-10-20T01:41:21.812708', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'train'}
2022-10-20 01:41:23,639 : INFO : EPOCH 0 - PROGRESS: at 0.01% examples, 4395 words/s, in_qsize 60, out_qsize 0
2022-10-20 01:46:23,651 : INFO : EPOCH 0 - PROGRESS: at 39.03% examples, 167929 words/s, in_qsize 60, out_qsize 1
2022-10-20 01:51:23,675 : INFO : EPOCH 0 - PROGRESS: at 76.80% examples, 167449 words/s, in_qsize 60, out_qsize 0
2022-10-20 01:54:24,128 : INFO : EPOCH 0: training on 160170076 raw words (131030834 effective words) took 782.3s, 167496 effective words/s
2022-10-20 01:54:25,844 : INFO : EPOCH 1 - PROGRESS: at 0.01% examples, 4734 words/s, in_qsize 60, out_qsize

## Finding similar documents

In [232]:
models = [model_dbow299, model_dbow599, model_dbow1499]

In [237]:
for model in models:
    print(model)
    pprint(model.dv.most_similar(positive=["reinforcement learning"], topn=15))
    

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
[('q-learning', 0.9207262992858887),
 ('temporal difference learning', 0.8835273385047913),
 ('error-driven learning', 0.8363973498344421),
 ('action selection', 0.8000704050064087),
 ('reinforcement', 0.7967041730880737),
 ('learning classifier system', 0.7761959433555603),
 ('markov decision process', 0.7275181412696838),
 ('partially observable markov decision process', 0.6965964436531067),
 ('value function', 0.6812279224395752),
 ('robot learning', 0.6616670489311218),
 ('bellman equation', 0.6610139608383179),
 ('sequence learning', 0.6240488290786743),
 ('optimal policy', 0.6211299896240234),
 ('learning', 0.6179113984107971),
 ('active learning (machine learning)', 0.6147891283035278)]
Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
[('learning classifier system', 0.7715908885002136),
 ('markov decision process', 0.732695460319519),
 ('partially observable markov decision process', 0.7013351917266846),
 ('robot learning', 0.6769681572914124),
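
The scores returned by `most_similar` are cosine similarities between the query tag's vector and every other tag vector. A minimal pure-Python sketch of that ranking follows; the two-dimensional vectors are invented for illustration and the local `most_similar` function is a simplified stand-in, not Gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn):
    """Rank every tag except the query by cosine similarity to the query vector."""
    scored = [(tag, cosine(vectors[query], vec))
              for tag, vec in vectors.items() if tag != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-d tag vectors; the real models use 100 or 200 dimensions.
toy_vectors = {
    "reinforcement learning": [1.0, 0.0],
    "q-learning": [0.9, 0.1],
    "face recognition": [0.0, 1.0],
}

ranking = most_similar("reinforcement learning", toy_vectors, topn=2)
print(ranking)
```

Since cosine similarity ignores vector length, a score near 1 means two tags point in almost the same direction in the embedding space, regardless of how often each tag occurs.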

## Examples


In [235]:
text = test_articles.text[0]
print(text)
print('\n')
for model in models:
    print(model)
    doc_vector = model.infer_vector(text.split())
    sims = model.dv.most_similar([doc_vector], topn=10)
    print('\n'.join(map(str,sims)))
    print('\n')
# original tags from the dataset
set(topic.lower() for topic in test_articles.keywords[0] + test_articles.fos[0])

Applying the UML class diagram in the performance analysis  This paper covers the performance parameters for an object= oriented software system: The number of classes in the class diagram of this system, the number of attributes and methods in each class, their data types, the multiplicities of single classes, the number of relationships in this diagram, the types and multiplicities of relationships, the lengths of access paths, and the allocation of methods and attributes to classes. A performance analysis is described. It treats a class diagram, which must be in attendance at each analysis because used dynamic diagrams must be consistent with it, and encloses these parameters. It is based on an approach which enables one to predict the performance values of response time, throughput and utilization, for use cases that can operate on databases related to this diagram


Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
('class diagram', 0.45280298590660095)
('flowchart', 0.4430471956729889)
(

{'access path',
 'algorithm',
 'class diagram',
 'communication diagram',
 'composite structure diagram',
 'computer science',
 'data type',
 'dynamic diagram',
 'oriented software system',
 'performance analysis',
 'performance parameter',
 'performance value',
 'response time',
 'sequence diagram',
 'single class',
 'system context diagram',
 'system sequence diagram',
 'theoretical computer science',
 'timing diagram',
 'uml class diagram',
 'use case',
 'use case diagram'}
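
One thing to keep in mind when reading these results: the training text went through `normalize` (Porter stemming), while the test text here is passed to `infer_vector` as raw tokens, so many test tokens may not match the stemmed training vocabulary. Below is a hedged sketch of a preprocessing helper that applies the same kind of stemming before inference. The names `prepare_for_inference` and `toy_stem` are hypothetical; `toy_stem` is a crude stand-in so that the sketch runs without NLTK, and whether this step actually improves the rankings has not been tested here.

```python
from typing import Callable, List

def prepare_for_inference(text: str, stem: Callable[[str], str]) -> List[str]:
    """Split on whitespace and stem each token, mirroring the training-time preprocessing."""
    return [stem(token) for token in text.split()]

def toy_stem(word: str) -> str:
    # Hypothetical stand-in for a real stemmer: lowercase and strip a plural "s".
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

tokens = prepare_for_inference("Performance parameters for software systems", toy_stem)
print(tokens)

# In the notebook, one would pass the real stemmer instead (assumption, untested):
#   from nltk.stem.porter import PorterStemmer
#   tokens = prepare_for_inference(text, PorterStemmer().stem)
#   doc_vector = model.infer_vector(tokens)
```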

In [238]:
text = test_articles.text[11]
print(text)
print('\n')
for model in models:
    print(model)
    doc_vector = model.infer_vector(text.split())
    sims = model.dv.most_similar([doc_vector], topn=10)
    print('\n'.join(map(str,sims)))
    print('\n')
    
# original tags from the dataset
set(topic.lower() for topic in test_articles.keywords[11] + test_articles.fos[11])

My Requirements? Well, That Depends  Using scenarios can help you quantify your requirements. Scenarios make requirements quantification easier by describing and restricting the context necessary for quantification.


Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
('user needs', 0.5663952827453613)
('usable', 0.5588348507881165)
('programming language', 0.5476263165473938)
('documentation', 0.5468922853469849)
('ask price', 0.5420323610305786)
('user interface', 0.5368489027023315)
('software engineering', 0.5357776284217834)
('better understanding', 0.5351817607879639)
('software engineer', 0.5344774127006531)
('as is', 0.5340394377708435)


Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
('software engineering', 0.5559115409851074)
('position paper', 0.5510293841362)
('better understanding', 0.5479234457015991)
('usable', 0.5449410080909729)
('documentation', 0.5432483553886414)
('user interface', 0.5380824208259583)
('programming language', 0.5376284122467041)
('internet privacy', 0.5290253162

{'computer science',
 'formal specification',
 'goal modeling',
 'requirement',
 'requirements',
 'software engineering',
 'systems engineering',
 'use case'}
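
To go slightly beyond eyeballing, the overlap between the predicted tags and the original tags can be counted directly. The sketch below does this for the first model's top-10 list and the original tag set printed above; the helper `tag_overlap` is an illustrative assumption, and exact string matching is a strict criterion that gives no credit for near-synonyms such as "user needs" versus "requirements".

```python
def tag_overlap(predicted, reference):
    """Return the exact-match hits plus precision and recall against the reference tags."""
    hits = [tag for tag in predicted if tag in reference]
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return hits, precision, recall

# Top-10 tags from the first model's output above.
predicted = [
    "user needs", "usable", "programming language", "documentation", "ask price",
    "user interface", "software engineering", "better understanding",
    "software engineer", "as is",
]
# Original tags of the same article, as printed above.
reference = {
    "computer science", "formal specification", "goal modeling", "requirement",
    "requirements", "software engineering", "systems engineering", "use case",
}

hits, precision, recall = tag_overlap(predicted, reference)
print(hits, precision, recall)
```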

In [239]:
text = test_articles.text[33]
print(text)
print('\n')
for model in models:
    print(model)
    doc_vector = model.infer_vector(text.split())
    sims = model.dv.most_similar([doc_vector], topn=10)
    print('\n'.join(map(str,sims)))
    print('\n')
# original tags from the dataset
set(topic.lower() for topic in test_articles.keywords[33] + test_articles.fos[33])

Examplers based image fusion features for face recognition    Examplers of a face are formed from multiple gallery images of a person and are used in the process of classification of a test image. We incorporate such examplers in forming a biologically inspired local binary decisions on similarity based face recognition method. As opposed to single model approaches such as face averages the exampler based approach results in higher recognition accu- racies and stability. Using multiple training samples per person, the method shows the following recognition accuracies: 99.0% on AR, 99.5% on FERET, 99.5% on ORL, 99.3% on EYALE, 100.0% on YALE and 100.0% on CALTECH face databases. In addition to face recognition, the method also detects the natural variability in the face images which can find application in automatic tagging of face images. 


Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
('face recognition', 0.5977910757064819)
('face detection', 0.5977813601493835)
('image recognition', 0.

{'artificial intelligence',
 'computer science',
 'computer vision',
 'face detection',
 'face recognition',
 'facial recognition system',
 'feret',
 'image fusion',
 'natural variability',
 'object-class detection',
 'pattern recognition',
 'standard test image',
 'three-dimensional face recognition'}

In [240]:
text = test_articles.text[77]
print(text)
print('\n')
for model in models:
    print(model)
    doc_vector = model.infer_vector(text.split())
    sims = model.dv.most_similar([doc_vector], topn=10)
    print('\n'.join(map(str,sims)))
    print('\n')
# original tags from the dataset
set(topic.lower() for topic in test_articles.keywords[77] + test_articles.fos[77])

Teaching an Agent to Test Students International Conference on Machine Learning This paper presents an innovative application of the Disciple Learning Agent Shell to the building of an educational agent that generates history tests for middle school students, to assist in the assessment of their understanding and use of higher-order thinking skills. Disciple has been taught by an educator to generate and answer basic test questions and to explain the answers. From its interaction with the educational expert, Disciple has learned general rules that allow it to generate a large number of new test questions for students, together with hints, answers, and exp- lanations of the answers. As a result, it can guide the students during their practice of higher-order thinking skills as they would be directly guided by the educator. It can also be used by the edu- cator to generate a different exam for each student in the class. Disciple has been experimentally evaluated by history experts, stude

{'computer science',
 'higher order',
 'knowledge acquisition',
 'learning agent',
 'machine learning',
 'mathematics education',
 'test students',
 'thinking skills'}

In [241]:
text = test_articles.text[555]
print(text)
print('\n')
for model in models:
    print(model)
    doc_vector = model.infer_vector(text.split())
    sims = model.dv.most_similar([doc_vector], topn=15)
    print('\n'.join(map(str,sims)))
    print('\n')
# original tags from the dataset
set(topic.lower() for topic in test_articles.keywords[555] + test_articles.fos[555])

Computing modular polynomials in quasi-linear time  We analyse and compare the complexity of several algorithms for computing modular polynomials. Under the assumption that rounding errors do not influence the correctness of the result, which appears to be satisfied in practice, we show that an algorithm relying on floating point evaluation of modular functions and on interpolation has a complexity that is up to logarithmic factors linear in the size of the computed polynomials. In particular, it obtains the classical modular polynomial Phi(l) of prime level l in time O (l(2) log(3) l M(l)) subset of O (l(3) log(4+epsilon) l), where M(l) is the time needed to multiply two l-bit numbers. Besides treating modular polynomials for Gamma(0)(l), which are an important ingredient in many algorithms dealing with isogenies of elliptic curves, the algorithm is easily adapted to more general situations. Composite levels are handled just as easily as prime levels, as well as polynomials between a 

{'calculus',
 'computational complexity',
 'difference polynomials',
 'discrete mathematics',
 'elliptic curve',
 'floating point',
 'interpolation',
 'linear time',
 'macdonald polynomials',
 'mathematics',
 'modular equation',
 'modular form',
 'modular function',
 'number theory',
 'polynomial',
 'prime (order theory)',
 'time complexity'}

In [243]:
model_dbow299.save('doc2vec_dbow299.model')
model_dbow599.save('doc2vec_dbow599.model')
model_dbow1499.save('doc2vec_dbow1499.model')


2022-10-20 04:22:55,842 : INFO : Doc2Vec lifecycle event {'fname_or_handle': 'doc2vec_dbow299.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-10-20T04:22:55.842525', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'saving'}
2022-10-20 04:22:55,858 : INFO : storing np array 'vectors' to doc2vec_dbow299.model.wv.vectors.npy
2022-10-20 04:22:55,972 : INFO : storing np array 'syn1neg' to doc2vec_dbow299.model.syn1neg.npy
2022-10-20 04:22:56,083 : INFO : not storing attribute cum_table
2022-10-20 04:22:56,236 : INFO : saved doc2vec_dbow299.model
2022-10-20 04:22:56,237 : INFO : Doc2Vec lifecycle event {'fname_or_handle': 'doc2vec_dbow599.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-10-20T04:22:56.237580', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'pl

#### Open question: use only `fos` for topics?


To continue your doc2vec explorations, refer to the official API documentation in Gensim: https://radimrehurek.com/gensim/models/doc2vec.html