# Doc2Vec on Wikipedia articles

We replicate **Document Embedding with Paragraph Vectors** (http://arxiv.org/abs/1507.07998).
The paper reports only DBOW results on Wikipedia data, so we replicate the experiments using both DBOW and DM.

## Basic Setup

In [1]:
! pip3 install gensim

Collecting gensim
Collecting numpy>=1.3 (from gensim)
  Using cached numpy-1.12.1-cp35-cp35m-manylinux1_x86_64.whl
Collecting smart-open>=1.2.1 (from gensim)
Collecting six>=1.5.0 (from gensim)
  Using cached six-1.10.0-py2.py3-none-any.whl
Collecting scipy>=0.7.0 (from gensim)
  Using cached scipy-0.19.0-cp35-cp35m-manylinux1_x86_64.whl
Collecting bz2file (from smart-open>=1.2.1->gensim)
Collecting boto>=2.32 (from smart-open>=1.2.1->gensim)
  Using cached boto-2.47.0-py2.py3-none-any.whl
Collecting requests (from smart-open>=1.2.1->gensim)
  Using cached requests-2.17.0-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests->smart-open>=1.2.1->gensim)
  Using cached chardet-3.0.3-py2.py3-none-any.whl
Collecting certifi>=2017.4.17 (from requests->smart-open>=1.2.1->gensim)
  Using cached certifi-2017.4.17-py2.py3-none-any.whl
Collecting idna<2.6,>=2.5 (from requests->smart-open>=1.2.1->gensim)
  Using cached idna-2.5-py2.py3-none-any.whl
Collecting urllib3<1.22,>=1.21.1 

In [2]:
# Always set up logging first
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  

Let's import the Doc2Vec module.

In [3]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

2017-05-29 19:39:04,614 : INFO : 'pattern' package not found; tag filters are not available for English


## Preparing the corpus

First, download the dump of all Wikipedia articles from [here](http://download.wikimedia.org/enwiki/) (you want the file enwiki-latest-pages-articles.xml.bz2, or enwiki-YYYYMMDD-pages-articles.xml.bz2 for date-specific dumps).

Second, convert the articles to a WikiCorpus. WikiCorpus constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.

For more details on WikiCorpus, see [Corpus from a Wikipedia dump](https://radimrehurek.com/gensim/corpora/wikicorpus.html).

In [1]:
print('Start creating wiki corpus:')
wiki = WikiCorpus("data/enwiki-20170520-pages-articles.xml.bz2")
print('Finished')
#wiki = WikiCorpus("enwiki-YYYYMMDD-pages-articles.xml.bz2")


Define a **TaggedWikiDocument** class to convert the WikiCorpus into a form suitable for Doc2Vec.

In [5]:
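class TaggedWikiDocument(object):
    """Iterable that wraps a WikiCorpus and yields one TaggedDocument per article."""
    def __init__(self, wiki):
        self.wiki = wiki
        # Make get_texts() yield (tokens, (page_id, title)) instead of tokens only.
        self.wiki.metadata = True
        self.count = 0
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            # Debug output: show the first 10 articles only.
            if self.count < 10:
                print(page_id)
                print(title)
                print(content)
                self.count += 1

            # Tokens arrive as bytes here; decode them and tag the article with its title.
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])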
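
Doc2Vec passes over the corpus more than once (one pass to build the vocabulary, then one pass per training epoch), which is why the wrapper is a class with `__iter__` rather than a one-shot generator. The following is a minimal stdlib-only sketch of that pattern. `FakeWiki`, `TaggedDoc`, and `TaggedStream` are hypothetical stand-ins for `WikiCorpus`, gensim's `TaggedDocument`, and the class above; they are not the real implementations.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (a words/tags pair).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


class FakeWiki:
    """Hypothetical stand-in for a WikiCorpus with metadata enabled."""

    def get_texts(self):
        # Yields (tokens, (page_id, title)); tokens are bytes, as in the output shown in this notebook.
        yield [b"anarchism", b"is", b"political", b"philosophy"], ("12", "Anarchism")
        yield [b"toy", b"article", b"text"], ("0", "Toy article")


class TaggedStream:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, wiki):
        self.wiki = wiki

    def __iter__(self):
        for tokens, (page_id, title) in self.wiki.get_texts():
            yield TaggedDoc([t.decode("utf-8") for t in tokens], [title])


stream = TaggedStream(FakeWiki())
first_pass = [doc.tags[0] for doc in stream]
second_pass = [doc.tags[0] for doc in stream]  # a second pass works

# By contrast, a plain generator is exhausted after a single pass.
one_shot = (doc for doc in stream)
list(one_shot)
leftover = list(one_shot)

print(first_pass, second_pass, leftover)
```

Passing a bare generator to a model that needs several passes would leave later passes with no data, which is the failure this class-based design avoids.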

In [6]:
documents = TaggedWikiDocument(wiki)

In [7]:
count_doc = 0
for doc in documents:
    if count_doc < 10:
        print(doc)
        count_doc += 1
    else:
        break

12
Anarchism
[b'anarchism', b'is', b'political', b'philosophy', b'that', b'advocates', b'self', b'governed', b'societies', b'based', b'on', b'voluntary', b'institutions', b'these', b'are', b'often', b'described', b'as', b'stateless', b'societies', b'although', b'several', b'authors', b'have', b'defined', b'them', b'more', b'specifically', b'as', b'institutions', b'based', b'on', b'non', b'hierarchical', b'free', b'associations', b'anarchism', b'holds', b'the', b'state', b'to', b'be', b'undesirable', b'unnecessary', b'and', b'harmful', b'while', b'anti', b'statism', b'is', b'central', b'anarchism', b'generally', b'entails', b'opposing', b'authority', b'or', b'hierarchical', b'organisation', b'in', b'the', b'conduct', b'of', b'all', b'human', b'relations', b'including', b'but', b'not', b'limited', b'to', b'the', b'state', b'system', b'anarchism', b'is', b'usually', b'considered', b'radical', b'left', b'wing', b'ideology', b'and', b'much', b'of', b'anarchist', b'economics', b'and', b'anar

## Preprocessing
To match the vocabulary size of the original paper, we first calculate the optimal **min_count** parameter.

In [8]:
# pre = Doc2Vec(min_count=0)
# pre.scan_vocab(documents)

In [9]:
#for num in range(0, 20):
#    print('min_count: {}, size of vocab: '.format(num), pre.scale_vocab(min_count=num, dry_run=True)['memory']['vocab']/700)

The original paper uses a vocabulary of 915,715 words. Setting min_count = 19 gives a similar vocabulary size (898,725 words).
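
Conceptually, choosing `min_count` comes down to counting how often each word occurs and asking how many distinct words survive a given threshold. Here is a minimal sketch of that idea on a toy corpus. The helpers `vocab_size_at` and `closest_min_count` are hypothetical names introduced for illustration; this is not how gensim implements its vocabulary scan.

```python
from collections import Counter


def vocab_size_at(counts, min_count):
    """Number of distinct words that occur at least min_count times."""
    return sum(1 for c in counts.values() if c >= min_count)


def closest_min_count(counts, target_size, candidates=range(1, 21)):
    """Threshold whose surviving vocabulary is closest in size to target_size."""
    return min(candidates, key=lambda m: abs(vocab_size_at(counts, m) - target_size))


# Toy corpus (illustrative data only).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
counts = Counter(word for doc in docs for word in doc)

for m in range(1, 5):
    print("min_count:", m, "size of vocab:", vocab_size_at(counts, m))

best = closest_min_count(counts, target_size=3, candidates=range(1, 5))
print("closest min_count for a 3-word vocabulary:", best)
```

On the full Wikipedia dump the same comparison would be made against the paper's target of 915,715 words.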

## Training the Doc2Vec Model
To train Doc2Vec with both methods, DBOW and DM, we define a list of models.

In [10]:
cores = multiprocessing.cpu_count()
print('Cores: ' + str(cores))

models = [
    # PV-DBOW 
    Doc2Vec(dm=0, dbow_words=1, size=200, window=8, min_count=19, iter=10, workers=cores),
    # PV-DM w/average
    Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=19, iter=10, workers=cores),
]

Cores: 8


In [11]:
models[0].build_vocab(documents)
print(str(models[0]))
models[1].reset_from(models[0])
print(str(models[1]))

2017-05-29 19:40:32,768 : INFO : collecting all words and their counts
2017-05-29 19:40:33,075 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-05-29 19:41:20,201 : INFO : PROGRESS: at example #10000, processed 31159151 words (661208/s), 441596 word types, 10000 tags
2017-05-29 19:41:42,707 : INFO : finished iterating over Wikipedia corpus of 14901 documents with 46096422 positions (total 19857 articles, 46117421 positions before pruning articles shorter than 50 words)
2017-05-29 19:41:42,708 : INFO : collected 551375 word types and 14901 unique tags from a corpus of 14901 examples and 46096422 words
2017-05-29 19:41:42,709 : INFO : Loading a fresh vocabulary
2017-05-29 19:41:43,179 : INFO : min_count=19 retains 62534 unique words (11% of original 551375, drops 488841)
2017-05-29 19:41:43,180 : INFO : min_count=19 leaves 44748674 word corpus (97% of original 46096422, drops 1347748)
2017-05-29 19:41:43,340 : INFO : deleting the raw counts dictionary 

Doc2Vec(dbow+w,d200,n5,w8,mc19,s0.001,t8)
Doc2Vec(dm/m,d200,n5,w8,mc19,s0.001,t8)
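
The vocabulary log above reports how much of the corpus `min_count=19` keeps in this run. As a quick arithmetic check, the sketch below recomputes the dropped counts and retained shares from the numbers in that log. The variable names are introduced here for illustration only.

```python
# Numbers copied from the build_vocab log above.
total_types, kept_types = 551375, 62534
total_words, kept_words = 46096422, 44748674

dropped_types = total_types - kept_types
dropped_words = total_words - kept_words

# Shares retained, in percent.
type_share = 100 * kept_types / total_types
word_share = 100 * kept_words / total_words

print(f"dropped {dropped_types} word types, kept {type_share:.1f}% of types")
print(f"dropped {dropped_words} tokens, kept {word_share:.1f}% of tokens")
```

The threshold discards roughly 89% of the distinct word types but only about 3% of the running text, because the dropped words are the rare ones.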


Now we’re ready to train Doc2Vec on the English Wikipedia.

In [12]:
for model in models:
    model.train(documents, total_examples=model.corpus_count, epochs=model.iter)

2017-05-29 19:41:45,588 : INFO : training model with 8 workers on 62534 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8
2017-05-29 19:41:46,645 : INFO : PROGRESS: at 0.02% examples, 92835 words/s, in_qsize 7, out_qsize 0
2017-05-29 19:41:47,647 : INFO : PROGRESS: at 0.05% examples, 127810 words/s, in_qsize 16, out_qsize 1
2017-05-29 19:41:48,665 : INFO : PROGRESS: at 0.10% examples, 151163 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:41:49,668 : INFO : PROGRESS: at 0.15% examples, 163586 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:41:50,702 : INFO : PROGRESS: at 0.20% examples, 170053 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:41:51,719 : INFO : PROGRESS: at 0.26% examples, 173930 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:41:52,735 : INFO : PROGRESS: at 0.32% examples, 177738 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:41:53,739 : INFO : PROGRESS: at 0.41% examples, 178875 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:41:54,752 : IN

2017-05-29 19:43:07,617 : INFO : PROGRESS: at 4.49% examples, 190973 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:43:08,628 : INFO : PROGRESS: at 4.55% examples, 191035 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:43:09,638 : INFO : PROGRESS: at 4.60% examples, 191228 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:43:10,648 : INFO : PROGRESS: at 4.65% examples, 191069 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:43:11,666 : INFO : PROGRESS: at 4.69% examples, 190963 words/s, in_qsize 14, out_qsize 1
2017-05-29 19:43:12,721 : INFO : PROGRESS: at 4.76% examples, 190899 words/s, in_qsize 11, out_qsize 0
2017-05-29 19:43:13,732 : INFO : PROGRESS: at 4.81% examples, 190999 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:43:14,794 : INFO : PROGRESS: at 4.88% examples, 191140 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:43:15,959 : INFO : PROGRESS: at 4.93% examples, 190909 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:43:16,985 : INFO : PROGRESS: at 4.98% examples, 191040 word

2017-05-29 19:44:30,149 : INFO : PROGRESS: at 9.08% examples, 192047 words/s, in_qsize 16, out_qsize 0
2017-05-29 19:44:31,258 : INFO : PROGRESS: at 9.15% examples, 191898 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:44:32,332 : INFO : PROGRESS: at 9.20% examples, 191930 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:44:33,340 : INFO : PROGRESS: at 9.27% examples, 191930 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:44:34,428 : INFO : PROGRESS: at 9.33% examples, 192039 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:44:35,443 : INFO : PROGRESS: at 9.37% examples, 192018 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:44:36,459 : INFO : PROGRESS: at 9.42% examples, 192087 words/s, in_qsize 14, out_qsize 1
2017-05-29 19:44:37,467 : INFO : PROGRESS: at 9.48% examples, 192026 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:44:38,517 : INFO : PROGRESS: at 9.55% examples, 192068 words/s, in_qsize 12, out_qsize 0
2017-05-29 19:44:39,563 : INFO : PROGRESS: at 9.61% examples, 192060 word

2017-05-29 19:45:49,934 : INFO : PROGRESS: at 13.50% examples, 191899 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:45:50,948 : INFO : PROGRESS: at 13.55% examples, 191921 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:45:51,981 : INFO : PROGRESS: at 13.61% examples, 191916 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:45:53,025 : INFO : PROGRESS: at 13.67% examples, 191933 words/s, in_qsize 13, out_qsize 2
2017-05-29 19:45:54,107 : INFO : PROGRESS: at 13.73% examples, 191971 words/s, in_qsize 13, out_qsize 0
2017-05-29 19:45:55,113 : INFO : PROGRESS: at 13.77% examples, 191926 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:45:56,118 : INFO : PROGRESS: at 13.83% examples, 191924 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:45:57,216 : INFO : PROGRESS: at 13.89% examples, 191888 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:45:58,247 : INFO : PROGRESS: at 13.95% examples, 191921 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:45:59,253 : INFO : PROGRESS: at 14.01% examples, 1

2017-05-29 19:47:11,179 : INFO : PROGRESS: at 17.90% examples, 192018 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:47:12,197 : INFO : PROGRESS: at 17.96% examples, 192069 words/s, in_qsize 12, out_qsize 0
2017-05-29 19:47:13,231 : INFO : PROGRESS: at 18.04% examples, 192062 words/s, in_qsize 11, out_qsize 0
2017-05-29 19:47:14,298 : INFO : PROGRESS: at 18.11% examples, 192058 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:47:15,354 : INFO : PROGRESS: at 18.20% examples, 192058 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:47:16,388 : INFO : PROGRESS: at 18.27% examples, 192093 words/s, in_qsize 13, out_qsize 0
2017-05-29 19:47:17,421 : INFO : PROGRESS: at 18.33% examples, 192036 words/s, in_qsize 15, out_qsize 2
2017-05-29 19:47:18,466 : INFO : PROGRESS: at 18.39% examples, 192070 words/s, in_qsize 12, out_qsize 0
2017-05-29 19:47:19,480 : INFO : PROGRESS: at 18.43% examples, 192027 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:47:20,489 : INFO : PROGRESS: at 18.48% examples, 1

2017-05-29 19:48:30,567 : INFO : PROGRESS: at 22.33% examples, 192145 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:48:31,599 : INFO : PROGRESS: at 22.41% examples, 192186 words/s, in_qsize 11, out_qsize 0
2017-05-29 19:48:32,599 : INFO : PROGRESS: at 22.46% examples, 192173 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:48:33,616 : INFO : PROGRESS: at 22.52% examples, 192166 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:48:34,630 : INFO : PROGRESS: at 22.57% examples, 192168 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:48:35,632 : INFO : PROGRESS: at 22.63% examples, 192193 words/s, in_qsize 13, out_qsize 0
2017-05-29 19:48:36,756 : INFO : PROGRESS: at 22.68% examples, 192161 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:48:37,776 : INFO : PROGRESS: at 22.74% examples, 192201 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:48:38,776 : INFO : PROGRESS: at 22.80% examples, 192211 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:48:39,823 : INFO : PROGRESS: at 22.85% examples, 1

2017-05-29 19:49:51,781 : INFO : PROGRESS: at 26.77% examples, 192360 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:49:52,788 : INFO : PROGRESS: at 26.82% examples, 192313 words/s, in_qsize 16, out_qsize 0
2017-05-29 19:49:53,820 : INFO : PROGRESS: at 26.88% examples, 192305 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:49:54,837 : INFO : PROGRESS: at 26.92% examples, 192358 words/s, in_qsize 13, out_qsize 0
2017-05-29 19:49:55,884 : INFO : PROGRESS: at 26.98% examples, 192330 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:49:56,908 : INFO : PROGRESS: at 27.03% examples, 192323 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:49:57,923 : INFO : PROGRESS: at 27.09% examples, 192353 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:49:58,945 : INFO : PROGRESS: at 27.14% examples, 192358 words/s, in_qsize 16, out_qsize 0
2017-05-29 19:49:59,956 : INFO : PROGRESS: at 27.22% examples, 192372 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:50:00,974 : INFO : PROGRESS: at 27.26% examples, 1

2017-05-29 19:51:10,963 : INFO : PROGRESS: at 31.22% examples, 192224 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:51:11,999 : INFO : PROGRESS: at 31.26% examples, 192242 words/s, in_qsize 13, out_qsize 0
2017-05-29 19:51:12,999 : INFO : PROGRESS: at 31.31% examples, 192220 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:51:14,010 : INFO : PROGRESS: at 31.37% examples, 192268 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:51:15,023 : INFO : PROGRESS: at 31.42% examples, 192262 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:51:16,077 : INFO : PROGRESS: at 31.48% examples, 192258 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:51:17,085 : INFO : PROGRESS: at 31.53% examples, 192266 words/s, in_qsize 12, out_qsize 1
2017-05-29 19:51:18,111 : INFO : PROGRESS: at 31.58% examples, 192267 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:51:19,160 : INFO : PROGRESS: at 31.64% examples, 192245 words/s, in_qsize 16, out_qsize 0
2017-05-29 19:51:20,178 : INFO : PROGRESS: at 31.68% examples, 1

2017-05-29 19:52:32,033 : INFO : PROGRESS: at 35.62% examples, 192429 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:52:33,040 : INFO : PROGRESS: at 35.67% examples, 192410 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:52:34,045 : INFO : PROGRESS: at 35.74% examples, 192413 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:52:35,045 : INFO : PROGRESS: at 35.79% examples, 192418 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:52:36,098 : INFO : PROGRESS: at 35.84% examples, 192425 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:52:37,110 : INFO : PROGRESS: at 35.91% examples, 192435 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:52:38,118 : INFO : PROGRESS: at 35.97% examples, 192415 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:52:39,128 : INFO : PROGRESS: at 36.03% examples, 192426 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:52:40,204 : INFO : PROGRESS: at 36.09% examples, 192392 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:52:41,216 : INFO : PROGRESS: at 36.13% examples, 1

2017-05-29 19:53:50,765 : INFO : PROGRESS: at 40.00% examples, 192471 words/s, in_qsize 0, out_qsize 0
2017-05-29 19:53:51,818 : INFO : PROGRESS: at 40.02% examples, 192394 words/s, in_qsize 2, out_qsize 0
2017-05-29 19:53:52,828 : INFO : PROGRESS: at 40.06% examples, 192351 words/s, in_qsize 14, out_qsize 1
2017-05-29 19:53:53,867 : INFO : PROGRESS: at 40.10% examples, 192381 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:53:54,872 : INFO : PROGRESS: at 40.16% examples, 192378 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:53:55,977 : INFO : PROGRESS: at 40.21% examples, 192377 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:53:56,983 : INFO : PROGRESS: at 40.28% examples, 192395 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:53:58,033 : INFO : PROGRESS: at 40.35% examples, 192388 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:53:59,076 : INFO : PROGRESS: at 40.44% examples, 192395 words/s, in_qsize 10, out_qsize 0
2017-05-29 19:54:00,091 : INFO : PROGRESS: at 40.49% examples, 192

2017-05-29 19:55:12,615 : INFO : PROGRESS: at 44.51% examples, 192481 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:55:13,624 : INFO : PROGRESS: at 44.57% examples, 192476 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:55:14,712 : INFO : PROGRESS: at 44.62% examples, 192446 words/s, in_qsize 9, out_qsize 0
2017-05-29 19:55:15,729 : INFO : PROGRESS: at 44.67% examples, 192467 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:55:16,736 : INFO : PROGRESS: at 44.73% examples, 192488 words/s, in_qsize 8, out_qsize 0
2017-05-29 19:55:17,821 : INFO : PROGRESS: at 44.79% examples, 192480 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:55:18,829 : INFO : PROGRESS: at 44.85% examples, 192488 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:55:19,831 : INFO : PROGRESS: at 44.90% examples, 192471 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:55:20,851 : INFO : PROGRESS: at 44.95% examples, 192491 words/s, in_qsize 14, out_qsize 1
2017-05-29 19:55:21,896 : INFO : PROGRESS: at 44.99% examples, 192

2017-05-29 19:56:34,475 : INFO : PROGRESS: at 49.08% examples, 192588 words/s, in_qsize 13, out_qsize 0
2017-05-29 19:56:35,505 : INFO : PROGRESS: at 49.15% examples, 192575 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:56:36,510 : INFO : PROGRESS: at 49.20% examples, 192602 words/s, in_qsize 13, out_qsize 0
2017-05-29 19:56:37,515 : INFO : PROGRESS: at 49.26% examples, 192587 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:56:38,530 : INFO : PROGRESS: at 49.31% examples, 192585 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:56:39,538 : INFO : PROGRESS: at 49.37% examples, 192589 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:56:40,541 : INFO : PROGRESS: at 49.42% examples, 192608 words/s, in_qsize 13, out_qsize 0
2017-05-29 19:56:41,572 : INFO : PROGRESS: at 49.47% examples, 192603 words/s, in_qsize 16, out_qsize 1
2017-05-29 19:56:42,645 : INFO : PROGRESS: at 49.54% examples, 192611 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:56:43,746 : INFO : PROGRESS: at 49.61% examples, 1

2017-05-29 19:57:53,959 : INFO : PROGRESS: at 53.53% examples, 192694 words/s, in_qsize 15, out_qsize 1
2017-05-29 19:57:54,985 : INFO : PROGRESS: at 53.58% examples, 192699 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:57:56,066 : INFO : PROGRESS: at 53.65% examples, 192715 words/s, in_qsize 12, out_qsize 0
2017-05-29 19:57:57,067 : INFO : PROGRESS: at 53.71% examples, 192713 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:57:58,082 : INFO : PROGRESS: at 53.76% examples, 192711 words/s, in_qsize 12, out_qsize 0
2017-05-29 19:57:59,142 : INFO : PROGRESS: at 53.80% examples, 192693 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:58:00,148 : INFO : PROGRESS: at 53.88% examples, 192712 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:58:01,156 : INFO : PROGRESS: at 53.92% examples, 192682 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:58:02,178 : INFO : PROGRESS: at 53.99% examples, 192716 words/s, in_qsize 12, out_qsize 0
2017-05-29 19:58:03,196 : INFO : PROGRESS: at 54.05% examples, 1

2017-05-29 19:59:15,265 : INFO : PROGRESS: at 57.98% examples, 192800 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:59:16,266 : INFO : PROGRESS: at 58.06% examples, 192798 words/s, in_qsize 16, out_qsize 0
2017-05-29 19:59:17,284 : INFO : PROGRESS: at 58.12% examples, 192805 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:59:18,296 : INFO : PROGRESS: at 58.21% examples, 192795 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:59:19,338 : INFO : PROGRESS: at 58.27% examples, 192784 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:59:20,353 : INFO : PROGRESS: at 58.34% examples, 192791 words/s, in_qsize 11, out_qsize 0
2017-05-29 19:59:21,365 : INFO : PROGRESS: at 58.40% examples, 192794 words/s, in_qsize 14, out_qsize 0
2017-05-29 19:59:22,372 : INFO : PROGRESS: at 58.43% examples, 192787 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:59:23,403 : INFO : PROGRESS: at 58.49% examples, 192808 words/s, in_qsize 15, out_qsize 0
2017-05-29 19:59:24,479 : INFO : PROGRESS: at 58.53% examples, 1

2017-05-29 20:00:34,270 : INFO : PROGRESS: at 62.43% examples, 192869 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:00:35,287 : INFO : PROGRESS: at 62.49% examples, 192857 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:00:36,351 : INFO : PROGRESS: at 62.55% examples, 192873 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:00:37,431 : INFO : PROGRESS: at 62.60% examples, 192863 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:00:38,433 : INFO : PROGRESS: at 62.66% examples, 192863 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:00:39,526 : INFO : PROGRESS: at 62.72% examples, 192880 words/s, in_qsize 8, out_qsize 1
2017-05-29 20:00:40,559 : INFO : PROGRESS: at 62.77% examples, 192877 words/s, in_qsize 14, out_qsize 0
2017-05-29 20:00:41,596 : INFO : PROGRESS: at 62.83% examples, 192881 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:00:42,611 : INFO : PROGRESS: at 62.87% examples, 192875 words/s, in_qsize 14, out_qsize 0
2017-05-29 20:00:43,632 : INFO : PROGRESS: at 62.93% examples, 19

2017-05-29 20:01:55,692 : INFO : PROGRESS: at 66.86% examples, 192897 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:01:56,758 : INFO : PROGRESS: at 66.91% examples, 192890 words/s, in_qsize 14, out_qsize 1
2017-05-29 20:01:57,778 : INFO : PROGRESS: at 66.96% examples, 192900 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:01:58,801 : INFO : PROGRESS: at 67.02% examples, 192901 words/s, in_qsize 14, out_qsize 0
2017-05-29 20:01:59,819 : INFO : PROGRESS: at 67.07% examples, 192891 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:02:00,820 : INFO : PROGRESS: at 67.12% examples, 192907 words/s, in_qsize 12, out_qsize 0
2017-05-29 20:02:01,822 : INFO : PROGRESS: at 67.19% examples, 192901 words/s, in_qsize 12, out_qsize 0
2017-05-29 20:02:02,825 : INFO : PROGRESS: at 67.25% examples, 192899 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:02:03,857 : INFO : PROGRESS: at 67.29% examples, 192881 words/s, in_qsize 16, out_qsize 0
2017-05-29 20:02:04,874 : INFO : PROGRESS: at 67.35% examples, 1

2017-05-29 20:03:14,968 : INFO : PROGRESS: at 71.34% examples, 192912 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:03:16,024 : INFO : PROGRESS: at 71.38% examples, 192909 words/s, in_qsize 14, out_qsize 1
2017-05-29 20:03:17,089 : INFO : PROGRESS: at 71.45% examples, 192930 words/s, in_qsize 10, out_qsize 1
2017-05-29 20:03:18,140 : INFO : PROGRESS: at 71.50% examples, 192901 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:03:19,185 : INFO : PROGRESS: at 71.55% examples, 192921 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:03:20,229 : INFO : PROGRESS: at 71.60% examples, 192910 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:03:21,235 : INFO : PROGRESS: at 71.66% examples, 192923 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:03:22,244 : INFO : PROGRESS: at 71.70% examples, 192913 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:03:23,245 : INFO : PROGRESS: at 71.75% examples, 192913 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:03:24,271 : INFO : PROGRESS: at 71.81% examples, 1

2017-05-29 20:04:36,297 : INFO : PROGRESS: at 75.79% examples, 193012 words/s, in_qsize 13, out_qsize 0
2017-05-29 20:04:37,331 : INFO : PROGRESS: at 75.84% examples, 193009 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:04:38,336 : INFO : PROGRESS: at 75.91% examples, 193018 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:04:39,337 : INFO : PROGRESS: at 75.97% examples, 193018 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:04:40,342 : INFO : PROGRESS: at 76.04% examples, 193025 words/s, in_qsize 14, out_qsize 0
2017-05-29 20:04:41,420 : INFO : PROGRESS: at 76.09% examples, 193005 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:04:42,444 : INFO : PROGRESS: at 76.14% examples, 193013 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:04:43,583 : INFO : PROGRESS: at 76.19% examples, 193015 words/s, in_qsize 9, out_qsize 0
2017-05-29 20:04:44,683 : INFO : PROGRESS: at 76.27% examples, 193006 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:04:45,739 : INFO : PROGRESS: at 76.33% examples, 19

2017-05-29 20:05:55,546 : INFO : PROGRESS: at 80.16% examples, 193035 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:05:56,667 : INFO : PROGRESS: at 80.21% examples, 193030 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:05:57,669 : INFO : PROGRESS: at 80.28% examples, 193050 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:05:58,771 : INFO : PROGRESS: at 80.36% examples, 193037 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:05:59,832 : INFO : PROGRESS: at 80.44% examples, 193037 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:06:00,848 : INFO : PROGRESS: at 80.50% examples, 193048 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:06:01,849 : INFO : PROGRESS: at 80.56% examples, 193043 words/s, in_qsize 14, out_qsize 0
2017-05-29 20:06:02,853 : INFO : PROGRESS: at 80.60% examples, 193044 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:06:03,903 : INFO : PROGRESS: at 80.67% examples, 193050 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:06:04,919 : INFO : PROGRESS: at 80.77% examples, 1

2017-05-29 20:07:17,222 : INFO : PROGRESS: at 84.72% examples, 193072 words/s, in_qsize 10, out_qsize 0
2017-05-29 20:07:18,373 : INFO : PROGRESS: at 84.78% examples, 193061 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:07:19,383 : INFO : PROGRESS: at 84.85% examples, 193075 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:07:20,432 : INFO : PROGRESS: at 84.91% examples, 193076 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:07:21,530 : INFO : PROGRESS: at 84.96% examples, 193068 words/s, in_qsize 13, out_qsize 0
2017-05-29 20:07:22,560 : INFO : PROGRESS: at 85.01% examples, 193073 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:07:23,643 : INFO : PROGRESS: at 85.06% examples, 193067 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:07:24,755 : INFO : PROGRESS: at 85.10% examples, 193078 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:07:25,776 : INFO : PROGRESS: at 85.16% examples, 193085 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:07:26,786 : INFO : PROGRESS: at 85.20% examples, 1

2017-05-29 20:08:38,891 : INFO : PROGRESS: at 89.33% examples, 193138 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:08:39,963 : INFO : PROGRESS: at 89.37% examples, 193131 words/s, in_qsize 16, out_qsize 0
2017-05-29 20:08:41,003 : INFO : PROGRESS: at 89.43% examples, 193144 words/s, in_qsize 13, out_qsize 1
2017-05-29 20:08:42,038 : INFO : PROGRESS: at 89.49% examples, 193138 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:08:43,119 : INFO : PROGRESS: at 89.55% examples, 193138 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:08:44,132 : INFO : PROGRESS: at 89.61% examples, 193141 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:08:45,211 : INFO : PROGRESS: at 89.67% examples, 193145 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:08:46,222 : INFO : PROGRESS: at 89.73% examples, 193160 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:08:47,248 : INFO : PROGRESS: at 89.79% examples, 193169 words/s, in_qsize 13, out_qsize 0
2017-05-29 20:08:48,272 : INFO : PROGRESS: at 89.84% examples, 1

2017-05-29 20:09:58,049 : INFO : PROGRESS: at 93.74% examples, 193159 words/s, in_qsize 14, out_qsize 0
2017-05-29 20:09:59,085 : INFO : PROGRESS: at 93.78% examples, 193158 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:10:00,113 : INFO : PROGRESS: at 93.84% examples, 193149 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:10:01,130 : INFO : PROGRESS: at 93.91% examples, 193161 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:10:02,137 : INFO : PROGRESS: at 93.96% examples, 193152 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:10:03,141 : INFO : PROGRESS: at 94.02% examples, 193159 words/s, in_qsize 13, out_qsize 0
2017-05-29 20:10:04,142 : INFO : PROGRESS: at 94.08% examples, 193156 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:10:05,178 : INFO : PROGRESS: at 94.14% examples, 193157 words/s, in_qsize 12, out_qsize 0
2017-05-29 20:10:06,212 : INFO : PROGRESS: at 94.21% examples, 193153 words/s, in_qsize 16, out_qsize 0
2017-05-29 20:10:07,252 : INFO : PROGRESS: at 94.24% examples, 1

2017-05-29 20:11:19,025 : INFO : PROGRESS: at 98.21% examples, 193166 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:11:20,055 : INFO : PROGRESS: at 98.28% examples, 193170 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:11:21,060 : INFO : PROGRESS: at 98.35% examples, 193168 words/s, in_qsize 9, out_qsize 0
2017-05-29 20:11:22,105 : INFO : PROGRESS: at 98.40% examples, 193162 words/s, in_qsize 11, out_qsize 0
2017-05-29 20:11:23,135 : INFO : PROGRESS: at 98.44% examples, 193166 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:11:24,142 : INFO : PROGRESS: at 98.49% examples, 193174 words/s, in_qsize 13, out_qsize 0
2017-05-29 20:11:25,143 : INFO : PROGRESS: at 98.53% examples, 193170 words/s, in_qsize 15, out_qsize 0
2017-05-29 20:11:26,146 : INFO : PROGRESS: at 98.59% examples, 193178 words/s, in_qsize 13, out_qsize 0
2017-05-29 20:11:27,224 : INFO : PROGRESS: at 98.64% examples, 193167 words/s, in_qsize 14, out_qsize 1
2017-05-29 20:11:28,257 : INFO : PROGRESS: at 98.70% examples, 19

CPU times: user 2h 42min 30s, sys: 23.1 s, total: 2h 42min 53s
Wall time: 30min 5s


2017-05-29 20:11:52,584 : INFO : PROGRESS: at 0.04% examples, 220483 words/s, in_qsize 4, out_qsize 0
2017-05-29 20:11:53,585 : INFO : PROGRESS: at 0.13% examples, 312308 words/s, in_qsize 5, out_qsize 0
2017-05-29 20:11:54,658 : INFO : PROGRESS: at 0.25% examples, 344500 words/s, in_qsize 4, out_qsize 1
2017-05-29 20:11:55,686 : INFO : PROGRESS: at 0.42% examples, 359549 words/s, in_qsize 0, out_qsize 1
2017-05-29 20:11:56,700 : INFO : PROGRESS: at 0.53% examples, 362688 words/s, in_qsize 0, out_qsize 1
2017-05-29 20:11:57,724 : INFO : PROGRESS: at 0.65% examples, 370840 words/s, in_qsize 3, out_qsize 0
2017-05-29 20:11:58,763 : INFO : PROGRESS: at 0.82% examples, 374596 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:11:59,767 : INFO : PROGRESS: at 0.96% examples, 377672 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:12:00,774 : INFO : PROGRESS: at 1.06% examples, 374685 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:12:01,821 : INFO : PROGRESS: at 1.18% examples, 378295 words/s, in_q

2017-05-29 20:13:14,861 : INFO : PROGRESS: at 9.43% examples, 394371 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:13:15,872 : INFO : PROGRESS: at 9.55% examples, 393961 words/s, in_qsize 5, out_qsize 0
2017-05-29 20:13:16,879 : INFO : PROGRESS: at 9.67% examples, 394472 words/s, in_qsize 3, out_qsize 0
2017-05-29 20:13:17,903 : INFO : PROGRESS: at 9.79% examples, 394649 words/s, in_qsize 4, out_qsize 0
2017-05-29 20:13:18,912 : INFO : PROGRESS: at 9.88% examples, 394617 words/s, in_qsize 8, out_qsize 0
2017-05-29 20:13:19,777 : INFO : finished iterating over Wikipedia corpus of 14901 documents with 46096422 positions (total 19857 articles, 46117421 positions before pruning articles shorter than 50 words)
2017-05-29 20:13:20,090 : INFO : PROGRESS: at 10.00% examples, 394247 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:13:21,127 : INFO : PROGRESS: at 10.07% examples, 393700 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:13:22,148 : INFO : PROGRESS: at 10.16% examples, 393558 words/s,

2017-05-29 20:14:34,417 : INFO : PROGRESS: at 18.46% examples, 394429 words/s, in_qsize 0, out_qsize 1
2017-05-29 20:14:35,418 : INFO : PROGRESS: at 18.55% examples, 394540 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:14:36,418 : INFO : PROGRESS: at 18.66% examples, 394655 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:14:37,445 : INFO : PROGRESS: at 18.79% examples, 394785 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:14:38,452 : INFO : PROGRESS: at 18.87% examples, 394510 words/s, in_qsize 5, out_qsize 1
2017-05-29 20:14:39,484 : INFO : PROGRESS: at 18.98% examples, 394553 words/s, in_qsize 2, out_qsize 0
2017-05-29 20:14:40,485 : INFO : PROGRESS: at 19.15% examples, 394717 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:14:41,512 : INFO : PROGRESS: at 19.26% examples, 394701 words/s, in_qsize 1, out_qsize 1
2017-05-29 20:14:42,529 : INFO : PROGRESS: at 19.36% examples, 394559 words/s, in_qsize 11, out_qsize 0
2017-05-29 20:14:43,529 : INFO : PROGRESS: at 19.49% examples, 395009 wo

2017-05-29 20:15:54,394 : INFO : PROGRESS: at 27.41% examples, 394659 words/s, in_qsize 4, out_qsize 0
2017-05-29 20:15:55,399 : INFO : PROGRESS: at 27.51% examples, 394653 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:15:56,439 : INFO : PROGRESS: at 27.63% examples, 394441 words/s, in_qsize 0, out_qsize 2
2017-05-29 20:15:57,442 : INFO : PROGRESS: at 27.74% examples, 394463 words/s, in_qsize 4, out_qsize 2
2017-05-29 20:15:58,479 : INFO : PROGRESS: at 27.88% examples, 394392 words/s, in_qsize 1, out_qsize 0
2017-05-29 20:15:59,493 : INFO : PROGRESS: at 28.02% examples, 394425 words/s, in_qsize 6, out_qsize 0
2017-05-29 20:16:00,522 : INFO : PROGRESS: at 28.18% examples, 394506 words/s, in_qsize 2, out_qsize 0
2017-05-29 20:16:01,527 : INFO : PROGRESS: at 28.31% examples, 394416 words/s, in_qsize 12, out_qsize 0
2017-05-29 20:16:02,540 : INFO : PROGRESS: at 28.42% examples, 394487 words/s, in_qsize 6, out_qsize 0
2017-05-29 20:16:03,549 : INFO : PROGRESS: at 28.52% examples, 394632 wo

2017-05-29 20:17:13,591 : INFO : PROGRESS: at 36.42% examples, 395165 words/s, in_qsize 3, out_qsize 0
2017-05-29 20:17:14,614 : INFO : PROGRESS: at 36.55% examples, 395295 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:17:15,626 : INFO : PROGRESS: at 36.65% examples, 395220 words/s, in_qsize 8, out_qsize 1
2017-05-29 20:17:16,664 : INFO : PROGRESS: at 36.78% examples, 395283 words/s, in_qsize 4, out_qsize 0
2017-05-29 20:17:17,675 : INFO : PROGRESS: at 36.91% examples, 395355 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:17:18,680 : INFO : PROGRESS: at 37.02% examples, 395373 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:17:19,742 : INFO : PROGRESS: at 37.13% examples, 395412 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:17:20,825 : INFO : PROGRESS: at 37.26% examples, 395309 words/s, in_qsize 2, out_qsize 0
2017-05-29 20:17:21,849 : INFO : PROGRESS: at 37.37% examples, 395388 words/s, in_qsize 1, out_qsize 0
2017-05-29 20:17:22,860 : INFO : PROGRESS: at 37.49% examples, 395376 wor

2017-05-29 20:18:33,169 : INFO : PROGRESS: at 45.44% examples, 395564 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:18:34,184 : INFO : PROGRESS: at 45.60% examples, 395507 words/s, in_qsize 1, out_qsize 0
2017-05-29 20:18:35,201 : INFO : PROGRESS: at 45.73% examples, 395516 words/s, in_qsize 1, out_qsize 2
2017-05-29 20:18:36,263 : INFO : PROGRESS: at 45.84% examples, 395489 words/s, in_qsize 3, out_qsize 0
2017-05-29 20:18:37,278 : INFO : PROGRESS: at 45.97% examples, 395501 words/s, in_qsize 0, out_qsize 1
2017-05-29 20:18:38,303 : INFO : PROGRESS: at 46.09% examples, 395421 words/s, in_qsize 1, out_qsize 0
2017-05-29 20:18:39,322 : INFO : PROGRESS: at 46.18% examples, 395448 words/s, in_qsize 2, out_qsize 0
2017-05-29 20:18:40,328 : INFO : PROGRESS: at 46.31% examples, 395414 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:18:41,340 : INFO : PROGRESS: at 46.40% examples, 395454 words/s, in_qsize 2, out_qsize 0
2017-05-29 20:18:42,369 : INFO : PROGRESS: at 46.51% examples, 395460 wor

2017-05-29 20:19:52,471 : INFO : PROGRESS: at 54.52% examples, 395679 words/s, in_qsize 3, out_qsize 0
2017-05-29 20:19:53,474 : INFO : PROGRESS: at 54.64% examples, 395733 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:19:54,566 : INFO : PROGRESS: at 54.75% examples, 395664 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:19:55,572 : INFO : PROGRESS: at 54.85% examples, 395612 words/s, in_qsize 2, out_qsize 0
2017-05-29 20:19:56,592 : INFO : PROGRESS: at 54.97% examples, 395630 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:19:57,613 : INFO : PROGRESS: at 55.06% examples, 395609 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:19:58,620 : INFO : PROGRESS: at 55.14% examples, 395517 words/s, in_qsize 7, out_qsize 0
2017-05-29 20:19:59,622 : INFO : PROGRESS: at 55.25% examples, 395600 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:20:00,631 : INFO : PROGRESS: at 55.38% examples, 395576 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:20:01,635 : INFO : PROGRESS: at 55.48% examples, 395606 wor

2017-05-29 20:21:12,001 : INFO : PROGRESS: at 63.63% examples, 395890 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:21:13,001 : INFO : PROGRESS: at 63.75% examples, 395912 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:21:14,009 : INFO : PROGRESS: at 63.85% examples, 395833 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:21:15,024 : INFO : PROGRESS: at 63.96% examples, 395800 words/s, in_qsize 5, out_qsize 0
2017-05-29 20:21:16,071 : INFO : PROGRESS: at 64.09% examples, 395794 words/s, in_qsize 5, out_qsize 1
2017-05-29 20:21:17,073 : INFO : PROGRESS: at 64.21% examples, 395822 words/s, in_qsize 1, out_qsize 0
2017-05-29 20:21:18,074 : INFO : PROGRESS: at 64.30% examples, 395836 words/s, in_qsize 4, out_qsize 0
2017-05-29 20:21:19,077 : INFO : PROGRESS: at 64.39% examples, 395837 words/s, in_qsize 0, out_qsize 1
2017-05-29 20:21:20,083 : INFO : PROGRESS: at 64.48% examples, 395764 words/s, in_qsize 13, out_qsize 0
2017-05-29 20:21:21,096 : INFO : PROGRESS: at 64.59% examples, 395836 wo

2017-05-29 20:22:31,327 : INFO : PROGRESS: at 72.65% examples, 396050 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:22:32,365 : INFO : PROGRESS: at 72.76% examples, 396034 words/s, in_qsize 2, out_qsize 0
2017-05-29 20:22:33,411 : INFO : PROGRESS: at 72.86% examples, 396008 words/s, in_qsize 0, out_qsize 2
2017-05-29 20:22:34,417 : INFO : PROGRESS: at 72.98% examples, 396051 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:22:35,454 : INFO : PROGRESS: at 73.10% examples, 396029 words/s, in_qsize 4, out_qsize 0
2017-05-29 20:22:36,462 : INFO : PROGRESS: at 73.22% examples, 396026 words/s, in_qsize 10, out_qsize 0
2017-05-29 20:22:37,475 : INFO : PROGRESS: at 73.37% examples, 396097 words/s, in_qsize 0, out_qsize 1
2017-05-29 20:22:38,481 : INFO : PROGRESS: at 73.49% examples, 396096 words/s, in_qsize 0, out_qsize 1
2017-05-29 20:22:39,481 : INFO : PROGRESS: at 73.60% examples, 396134 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:22:40,482 : INFO : PROGRESS: at 73.72% examples, 396150 wo

2017-05-29 20:23:50,731 : INFO : PROGRESS: at 81.64% examples, 395974 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:23:51,740 : INFO : PROGRESS: at 81.73% examples, 395952 words/s, in_qsize 10, out_qsize 0
2017-05-29 20:23:52,763 : INFO : PROGRESS: at 81.85% examples, 396001 words/s, in_qsize 6, out_qsize 0
2017-05-29 20:23:53,765 : INFO : PROGRESS: at 81.98% examples, 396009 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:23:54,814 : INFO : PROGRESS: at 82.08% examples, 396021 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:23:55,865 : INFO : PROGRESS: at 82.21% examples, 396007 words/s, in_qsize 1, out_qsize 2
2017-05-29 20:23:56,867 : INFO : PROGRESS: at 82.35% examples, 396042 words/s, in_qsize 5, out_qsize 0
2017-05-29 20:23:57,876 : INFO : PROGRESS: at 82.48% examples, 396050 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:23:58,877 : INFO : PROGRESS: at 82.59% examples, 396058 words/s, in_qsize 4, out_qsize 0
2017-05-29 20:23:59,887 : INFO : PROGRESS: at 82.71% examples, 396110 wo

2017-05-29 20:25:10,187 : INFO : PROGRESS: at 90.64% examples, 396095 words/s, in_qsize 2, out_qsize 1
2017-05-29 20:25:11,188 : INFO : PROGRESS: at 90.80% examples, 396094 words/s, in_qsize 2, out_qsize 0
2017-05-29 20:25:12,189 : INFO : PROGRESS: at 90.95% examples, 396106 words/s, in_qsize 0, out_qsize 1
2017-05-29 20:25:13,203 : INFO : PROGRESS: at 91.06% examples, 396080 words/s, in_qsize 1, out_qsize 0
2017-05-29 20:25:14,234 : INFO : PROGRESS: at 91.17% examples, 396067 words/s, in_qsize 7, out_qsize 0
2017-05-29 20:25:15,237 : INFO : PROGRESS: at 91.29% examples, 396137 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:25:16,237 : INFO : PROGRESS: at 91.38% examples, 396116 words/s, in_qsize 4, out_qsize 0
2017-05-29 20:25:17,329 : INFO : PROGRESS: at 91.52% examples, 396126 words/s, in_qsize 0, out_qsize 0
2017-05-29 20:25:18,347 : INFO : PROGRESS: at 91.60% examples, 396123 words/s, in_qsize 1, out_qsize 0
2017-05-29 20:25:19,359 : INFO : PROGRESS: at 91.71% examples, 396109 wor

2017-05-29 20:26:31,800 : INFO : PROGRESS: at 99.98% examples, 396368 words/s, in_qsize 0, out_qsize 1
2017-05-29 20:26:31,895 : INFO : finished iterating over Wikipedia corpus of 14901 documents with 46096422 positions (total 19857 articles, 46117421 positions before pruning articles shorter than 50 words)
2017-05-29 20:26:31,902 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-29 20:26:31,904 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-29 20:26:31,904 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-29 20:26:31,905 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-29 20:26:31,906 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-29 20:26:31,906 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-29 20:26:31,907 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-29 20:26:31,919 : INFO : worker thread finishe

CPU times: user 32min 45s, sys: 15.9 s, total: 33min 1s
Wall time: 14min 40s


## Save the models

In [13]:
models[0].save('models/wikipedia_dbow_model.doc2vec')
models[1].save('models/wikipedia_dm_model.doc2vec')

2017-05-29 20:26:31,925 : INFO : saving Doc2Vec object under wikipedia_dbow_model.doc2vec, separately None
2017-05-29 20:26:31,927 : INFO : not storing attribute syn0norm
2017-05-29 20:26:31,927 : INFO : storing np array 'syn0' to wikipedia_dbow_model.doc2vec.wv.syn0.npy
2017-05-29 20:26:31,960 : INFO : not storing attribute cum_table
2017-05-29 20:26:31,961 : INFO : storing np array 'syn1neg' to wikipedia_dbow_model.doc2vec.syn1neg.npy
2017-05-29 20:26:32,284 : INFO : saved wikipedia_dbow_model.doc2vec
2017-05-29 20:26:32,285 : INFO : saving Doc2Vec object under wikipedia_dm_model.doc2vec, separately None
2017-05-29 20:26:32,286 : INFO : not storing attribute syn0norm
2017-05-29 20:26:32,286 : INFO : storing np array 'syn0' to wikipedia_dm_model.doc2vec.wv.syn0.npy
2017-05-29 20:26:32,317 : INFO : not storing attribute cum_table
2017-05-29 20:26:32,318 : INFO : storing np array 'syn1neg' to wikipedia_dm_model.doc2vec.syn1neg.npy
2017-05-29 20:26:32,640 : INFO : saved wikipedia_dm_mode

## Similarity interface

Now let's test both models. The DBOW model shows results similar to those in the original paper. First, we calculate the cosine similarity to "Machine learning" using Paragraph Vectors. Word vectors and document vectors are stored separately, so we have to append `.docvecs` to the model name to access the document vectors of a Doc2Vec model.

In [14]:
for model in models:
    print(str(model))
    pprint(model.docvecs.most_similar(positive=["Anarchism"], topn=20))

2017-05-29 20:26:32,645 : INFO : precomputing L2-norms of doc weight vectors


Doc2Vec(dbow+w,d200,n5,w8,mc19,s0.001,t8)


TypeError: unorderable types: str() < int()

The DBOW model interprets "Machine learning" as part of the computer science field, while the DM model treats it as a field related to data science.
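To make the ranking step concrete, here is a minimal sketch of a cosine-similarity lookup over document vectors. It is not gensim's implementation: the function names `cosine_similarity` and `most_similar`, and the three-dimensional toy vectors, are assumptions made up for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query_tag, doc_vectors, topn=10):
    """Rank every other document by cosine similarity to the query document."""
    query = doc_vectors[query_tag]
    scores = [
        (tag, cosine_similarity(query, vec))
        for tag, vec in doc_vectors.items()
        if tag != query_tag
    ]
    # Highest similarity first, truncated to the requested number of hits.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical document vectors; real models in this notebook use 200 dimensions.
toy_doc_vectors = {
    "Machine learning": [0.9, 0.1, 0.0],
    "Artificial neural network": [0.8, 0.2, 0.1],
    "Lady Gaga": [0.0, 0.1, 0.9],
}

ranking = most_similar("Machine learning", toy_doc_vectors, topn=2)
print(ranking)
```

In this toy setup the vector that points in nearly the same direction as the query is ranked first, which is the behaviour the similarity queries in this section rely on.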

Second, we calculate the cosine similarity to "Lady Gaga" using Paragraph Vectors.

In [None]:
for model in models:
    print(str(model))
    pprint(model.docvecs.most_similar(positive=["Lady Gaga"], topn=10))

The DBOW model returns similar singers in the U.S., while the DM model finds that many of Lady Gaga's songs are similar to "Lady Gaga".

Third, we calculate the cosine similarity to "Lady Gaga" - "American" + "Japanese" using a document vector and word vectors. "American" and "Japanese" are word vectors, not Paragraph Vectors. WikiCorpus has already converted all words to lowercase, so the word vectors are looked up as "american" and "japanese".

In [None]:
for model in models:
    print(str(model))
    vec = [model.docvecs["Lady Gaga"] - model["american"] + model["japanese"]]
    pprint([m for m in model.docvecs.most_similar(vec, topn=11) if m[0] != "Lady Gaga"])

As a result, the DBOW model returns artists in Japan who are similar to Lady Gaga, such as 'Perfume', which is the most famous idol in Japan. On the other hand, the DM model's results do not include Japanese artists among the top 10 similar documents. They are almost the same as the results obtained without any vector arithmetic.

These results demonstrate that DBOW, the model employed in the original paper, works well for calculating the similarity between document vectors and word vectors.
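The vector arithmetic in the third experiment can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `cosine_similarity` and `analogy_query`, the document tags, and the two-dimensional toy vectors (first axis for "pop singer", second axis for nationality) are all hypothetical and are not taken from the trained models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy_query(doc_vec, minus_word_vec, plus_word_vec):
    """Element-wise doc_vec - minus_word_vec + plus_word_vec."""
    return [d - m + p for d, m, p in zip(doc_vec, minus_word_vec, plus_word_vec)]


# Hypothetical document vectors (tags are placeholders, not real query results).
toy_docs = {
    "Lady Gaga": [1.0, 1.0],
    "American singer": [0.9, 0.8],
    "Japanese artist": [1.0, -1.0],
}
# Hypothetical word vectors, keyed in lowercase as WikiCorpus would produce them.
toy_words = {"american": [0.0, 1.0], "japanese": [0.0, -1.0]}

query = analogy_query(
    toy_docs["Lady Gaga"], toy_words["american"], toy_words["japanese"]
)

# Rank all documents except the starting one against the shifted query vector.
ranked = sorted(
    (
        (tag, cosine_similarity(query, vec))
        for tag, vec in toy_docs.items()
        if tag != "Lady Gaga"
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
print(query)
print(ranked)
```

The sketch only works because the toy document and word vectors share the same axes. Whether a trained model places document vectors and word vectors in a comparable space is exactly what the DBOW and DM comparison above probes.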