# Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
This notebook is adapted from the original one provided by Gensim.  
You can find it here: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb  

The major difference is that this notebook uses **different text preprocessing methods** from the original.

## Load corpus

Let's download the IMDB archive if it is not already downloaded (84 MB). This will be our text data for this tutorial.   
The data can be found here: http://ai.stanford.edu/~amaas/data/sentiment/

In [1]:
import locale
import glob
import os.path
import requests
import tarfile
import sys
import codecs
import smart_open

In [2]:
dirname = 'data/aclImdb'
filename = 'data/aclImdb_v1.tar.gz'
locale.setlocale(locale.LC_ALL, 'C')

if sys.version_info[0] >= 3:
    control_chars = [chr(0x85)]
else:
    control_chars = [unichr(0x85)]

Strip HTML tags and non-letter characters, convert the text to lowercase, and lemmatize each word.

In [3]:
from bs4 import BeautifulSoup
import re
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def pos_tag_map(tag):
    if tag.startswith('J'):
        return 'a'
    elif tag.startswith('V'):
        return 'v'
    elif tag.startswith('N'):
        return 'n'
    elif tag.startswith('R'):
        return 'r'
    else:
        return 'n'

def normalize_text(text):
    # 1. remove all html tags
    only_text = BeautifulSoup(text, 'lxml').get_text()
    
    # 2. only keep english characters
    letters_only = re.sub("[^a-zA-Z]", " ", only_text)
    
    # 3. Turn all words into their lowercase and split for use
    words = letters_only.lower().split()
    words = [w.strip() for w in words]
    
    # 4. Lemmatizing
    word_pos = nltk.pos_tag(words)
    clean_words = [wordnet_lemmatizer.lemmatize(w, pos_tag_map(pos)) for (w, pos) in word_pos]
    
    norm_text = ' '.join(clean_words)
    return norm_text
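To make steps 2 to 4 concrete, here is a minimal sketch that runs without NLTK or BeautifulSoup. It reuses the regex and the `pos_tag_map` logic from the cell above; the helper name `letters_lowercase_split` and the sample sentence are invented for illustration, and the Penn Treebank tags are typed by hand rather than produced by `nltk.pos_tag`.

```python
import re

def pos_tag_map(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    for prefix, wordnet_pos in (('J', 'a'), ('V', 'v'), ('N', 'n'), ('R', 'r')):
        if tag.startswith(prefix):
            return wordnet_pos
    return 'n'

def letters_lowercase_split(text):
    """Hypothetical helper: keep only ASCII letters, lowercase, split on whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower().split()

tokens = letters_lowercase_split("It's a GREAT movie, 10/10!")
print(tokens)

# Hand-written Penn Treebank tags, standing in for nltk.pos_tag output
for tag in ['JJ', 'VBD', 'NNS', 'RB', 'DT']:
    print(tag, '->', pos_tag_map(tag))
```

Note that the regex splits contractions ("It's" becomes "it" and "s") and drops digits entirely, which is one reason the vocabulary ends up smaller than in the original notebook.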

In [20]:
import time
start = time.clock()

if not os.path.isfile('data/aclImdb/alldata-id.txt'):
    if not os.path.isdir(dirname):
        if not os.path.isfile(filename):
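            # Download IMDB archive (the server path uses only the file's base name)
            print("Downloading IMDB archive...")
            url = u'http://ai.stanford.edu/~amaas/data/sentiment/' + os.path.basename(filename)
            r = requests.get(url)
            with open(filename, 'wb') as f:
                f.write(r.content)
        # The archive unpacks to 'aclImdb/', so extract it next to the archive to match dirname
        tar = tarfile.open(filename, mode='r')
        tar.extractall(os.path.dirname(filename))
        tar.close()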

    # Concatenate and normalize test/train data
    print("Cleaning up dataset...")
    folders = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']
    alldata = u''
    for fol in folders:
        temp = u''
        # get all txt files inside the folder
        output = fol.replace('/', '-') + '.txt'
        txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))
        for txt in txt_files:
            # each txt file represent one review
            with smart_open.smart_open(txt, "rb") as t:
                t_clean = t.read().decode("utf-8")
                for c in control_chars:
                    t_clean = t_clean.replace(c, ' ')
                temp += normalize_text(t_clean)
            temp += "\n"
        with smart_open.smart_open(os.path.join(dirname, output), "wb") as n:
            n.write(temp.encode("utf-8"))
        alldata += temp

    with smart_open.smart_open(os.path.join(dirname, 'alldata-id.txt'), 'wb') as f:
        for idx, line in enumerate(alldata.splitlines()):
            num_line = u"_*{0} {1}\n".format(idx, line)
            f.write(num_line.encode("utf-8"))

end = time.clock()
print ("Total running time: ", end-start)

Cleaning up dataset...
('Total running time: ', 5346.248143999999)
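Each line of `alldata-id.txt` is therefore an `_*<index>` tag followed by one normalized review. A small sketch of writing that format and splitting it back apart, using two invented reviews instead of the real corpus:

```python
# Two made-up, already-normalized reviews standing in for the real data
reviews = [u"great movie", u"bad plot"]

# Same line format as the cell above: "_*<index> <text>\n"
lines = [u"_*{0} {1}\n".format(idx, text) for idx, text in enumerate(reviews)]

# Splitting on whitespace recovers the id tag and the word list
parsed = []
for line in lines:
    tokens = line.split()
    parsed.append((tokens[0], tokens[1:]))

print(parsed)
```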


In [4]:
import os.path
assert os.path.isfile("data/aclImdb/alldata-id.txt"), "alldata-id.txt unavailable"

The text data is small enough to be read into memory. 

In [5]:
import gensim
from gensim.models.doc2vec import TaggedDocument
from collections import namedtuple

In [6]:
SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

alldocs = []  # Will hold all docs in original order
with open('data/aclImdb/alldata-id.txt') as alldata:
    for line_no, line in enumerate(alldata):
        tokens = gensim.utils.to_unicode(line).split()
        words = tokens[1:]
        tags = [line_no] # 'tags = [tokens[0]]' would also work at extra memory cost
        split = ['train', 'test', 'extra', 'extra'][line_no//25000]  # 25k train, 25k test, 50k extra
        sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no//12500] # [12.5K pos, 12.5K neg]*2 then unknown
        alldocs.append(SentimentDocument(words, tags, split, sentiment))
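The split and sentiment labels come purely from the line position, which works because the folders were concatenated in a fixed order (`train/pos`, `train/neg`, `test/pos`, `test/neg`, `train/unsup`). A sketch of the same integer-division lookup, wrapped in a hypothetical `label_for` helper:

```python
def label_for(line_no):
    """Return (split, sentiment) for a 0-based line number in alldata-id.txt."""
    split = ['train', 'test', 'extra', 'extra'][line_no // 25000]
    sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no // 12500]
    return split, sentiment

# First line of each 12.5k block, plus the very last document
for n in (0, 12500, 25000, 37500, 50000, 99999):
    print(n, label_for(n))
```

This lookup silently assigns wrong labels if the file order or the per-folder counts ever change, so it is worth keeping the document-count printout in the next cell as a sanity check.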

In [7]:
train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']
doc_list = alldocs[:]  # For reshuffling per pass
print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_docs), len(test_docs)))

100000 docs: 25000 train-sentiment, 25000 test-sentiment


## Set-up Doc2Vec Training & Evaluation Models

We approximate the experiment of Le & Mikolov ["Distributed Representations of Sentences and Documents"](http://cs.stanford.edu/~quocle/paragraph_vector.pdf) with guidance from Mikolov's [example go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ):

`./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1`

We vary the following parameter choices:
* **100-dimensional** vectors, as the 400-d vectors of the paper don't seem to offer much benefit on this task
* Similarly, **frequent word subsampling** seems to decrease sentiment-prediction accuracy, so it's left out
* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`
* Added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)
* A `min_count=2` saves quite a bit of model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each doc vectors themselves)
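As a rough illustration of the `min_count` bullet: gensim documents `min_count` as a threshold on a word's total frequency in the corpus. The sketch below mimics that filter with a plain `Counter` on three invented documents; it is not gensim's actual vocabulary code.

```python
from collections import Counter

# Three tiny made-up documents
docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "plot", "twist"],
]

min_count = 2
counts = Counter(word for doc in docs for word in doc)

# Keep only words seen at least min_count times across the whole corpus
kept = {word for word, count in counts.items() if count >= min_count}
dropped = set(counts) - kept

print(sorted(kept), sorted(dropped))
```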

In [8]:
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

In [9]:
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

We test three simple models here.

In [10]:
simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

Speed up setup by sharing results of the 1st model's vocabulary scan

In [11]:
# PV-DM w/ concat requires one special NULL word so it serves as template
%time simple_models[0].build_vocab(alldocs)  
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

CPU times: user 5.6 s, sys: 64 ms, total: 5.66 s
Wall time: 5.61 s
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)


Le and Mikolov note that **combining a paragraph vector from Distributed Bag of Words (DBOW) and Distributed Memory (DM) improves performance**.  
We follow their approach and pair the models together for evaluation. Here, **we concatenate the paragraph vectors obtained from each model.**
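Conceptually, the pairing just places one model's document vector after the other's. A minimal NumPy sketch with two placeholder 100-d vectors (the values are arbitrary, and this is not the `ConcatenatedDoc2Vec` implementation itself):

```python
import numpy as np

# Placeholder vectors standing in for one document's DBOW and DM vectors
dbow_vec = np.arange(100, dtype=np.float32)
dm_vec = np.ones(100, dtype=np.float32)

# The paired representation is the two vectors placed end to end
combined = np.concatenate([dbow_vec, dm_vec])
print(combined.shape)
```

The downstream classifier therefore sees 200 features per document for the paired models, versus 100 for each single model.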

In [12]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec

models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[2]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[0]])

In [13]:
models_by_name

OrderedDict([('Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)',
              <gensim.models.doc2vec.Doc2Vec at 0x7f53fc61a610>),
             ('Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)',
              <gensim.models.doc2vec.Doc2Vec at 0x7f53fc61a710>),
             ('Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)',
              <gensim.models.doc2vec.Doc2Vec at 0x7f53fc61a910>),
             ('dbow+dmm',
              <gensim.test.test_doc2vec.ConcatenatedDoc2Vec at 0x7f53b14c2290>),
             ('dbow+dmc',
              <gensim.test.test_doc2vec.ConcatenatedDoc2Vec at 0x7f534e88e290>)])

### vocab size
The original notebook has a vocabulary size of **116045**, while here we get **72599** because of the additional text preprocessing steps.

In [27]:
len(models_by_name['Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)'].wv.vocab)

72599

## Predictive Evaluation Methods

Let's define some helper methods for evaluating the performance of our Doc2vec using paragraph vectors.  
We will classify document sentiments using a **logistic regression** model based on our paragraph embeddings.  
We will compare the error rates based on paragraph vectors from our various Doc2vec models.

In [14]:
import numpy as np
import statsmodels.api as sm
from random import sample


A small context manager for timing:

In [15]:
from contextlib import contextmanager
from timeit import default_timer
import time 

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end-start
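A short usage sketch of this timer, repeated here so the block runs on its own. The workload (`sum(range(...))`) is arbitrary. Calling the yielded function inside the `with` block gives the running time so far; after the block exits, it keeps returning the same frozen total.

```python
from contextlib import contextmanager
from timeit import default_timer

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    # Rebinding freezes the reading once the with-block has finished
    elapser = lambda: end - start

with elapsed_timer() as elapsed:
    total = sum(range(100000))   # arbitrary work to time
    during = elapsed()           # running time, still ticking

after = elapsed()                # frozen total
after_again = elapsed()          # same value on every later call
print('%.6f %.6f' % (during, after))
```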

In [16]:
def logistic_predictor_from_data(train_targets, train_regressors):
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    #print(predictor.summary())
    return predictor

In [17]:
def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets, train_regressors = zip(*[(doc.sentiment, test_model.docvecs[doc.tags[0]]) for doc in train_set])
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_data = test_set
    if infer:
        if infer_subsample < 1.0:
            test_data = sample(test_data, int(infer_subsample * len(test_data)))
        test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]
    else:
        test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_data]
    test_regressors = sm.add_constant(test_regressors)
    
    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)
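The final scoring step rounds each predicted probability to 0 or 1 and counts mismatches against the true labels. A tiny worked example with four invented predictions:

```python
import numpy as np

# Made-up predicted probabilities and true sentiment labels
test_predictions = np.array([0.9, 0.2, 0.6, 0.4])
true_sentiments = np.array([1.0, 0.0, 0.0, 1.0])

# Round to the nearest class, then compare element-wise
corrects = int(np.sum(np.rint(test_predictions) == true_sentiments))
errors = len(test_predictions) - corrects
error_rate = float(errors) / len(test_predictions)

print(corrects, errors, error_rate)
```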

## Bulk Training

We use an explicit multiple-pass, alpha-reduction approach as sketched in this [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/) with added shuffling of corpus on each pass.  
Note that vector training is occurring on **all** documents of the dataset, which includes all **TRAIN/TEST/DEV** docs.  
We evaluate each model's sentiment predictive power based on error rate, and **the evaluation is repeated after each pass** so we can see the rates of relative improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. 

**(On a 4-core 2.6 GHz Intel Core i7, these 20 passes of training and evaluating the 3 main models take about an hour.)**
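The alpha-reduction schedule is linear: each pass runs at a fixed learning rate, which is then lowered by a constant step. A sketch of the values this produces, using the same constants as the training cell (the `schedule` list is only for illustration):

```python
alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

# Learning rate used on each of the 20 passes
schedule = [alpha - i * alpha_delta for i in range(passes)]

print('first pass: %.4f' % schedule[0])
print('last pass:  %.4f' % schedule[-1])
```

With this step size the last pass runs at 0.0022 rather than at `min_alpha` itself; the rate would only reach 0.001 after one further decrement.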

In [19]:
from collections import defaultdict
best_error = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [20]:
from random import shuffle
import datetime

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)  # Shuffling gets best results
    
    for name, train_model in models_by_name.items():
        # Train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)
            duration = '%.1f' % elapsed()
            
        # Evaluate
        eval_duration = ''
        with elapsed_timer() as eval_elapsed:
            err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs)
        eval_duration = '%.1f' % eval_elapsed()
        best_indicator = ' '
        if err <= best_error[name]:
            best_error[name] = err
            best_indicator = '*' 
        print("%s%f : %i passes : %s %ss %ss" % (best_indicator, err, epoch + 1, name, duration, eval_duration))

        if ((epoch + 1) % 5) == 0 or epoch == 0:
            eval_duration = ''
            with elapsed_timer() as eval_elapsed:
                infer_err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs, infer=True)
            eval_duration = '%.1f' % eval_elapsed()
            best_indicator = ' '
            if infer_err < best_error[name + '_inferred']:
                best_error[name + '_inferred'] = infer_err
                best_indicator = '*'
            print("%s%f : %i passes : %s %ss %ss" % (best_indicator, infer_err, epoch + 1, name + '_inferred', duration, eval_duration))

    print('Completed pass %i at alpha %f' % (epoch + 1, alpha))
    alpha -= alpha_delta
    
print("END %s" % str(datetime.datetime.now()))

START 2018-01-09 21:01:42.650922
*0.378480 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8) 29.3s 0.7s
*0.342000 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)_inferred 29.3s 6.3s
*0.239960 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t8) 13.5s 0.6s
*0.216800 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)_inferred 13.5s 2.3s
*0.260440 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8) 16.0s 0.7s
*0.214800 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)_inferred 16.0s 2.8s
*0.213000 : 1 passes : dbow+dmm 0.0s 1.8s
*0.182000 : 1 passes : dbow+dmm_inferred 0.0s 5.3s
*0.240840 : 1 passes : dbow+dmc 0.0s 1.4s
*0.238800 : 1 passes : dbow+dmc_inferred 0.0s 9.0s
Completed pass 1 at alpha 0.025000
*0.314280 : 2 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8) 30.8s 0.7s
*0.151560 : 2 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t8) 13.4s 0.7s
*0.208400 : 2 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8) 15.7s 0.6s
*0.144920 : 2 passes : dbow+dmm 0.0s 1.6s
*0.150120 : 2 passes : dbo

## Achieved Sentiment-Prediction Accuracy

In [21]:
# Print best error rates achieved
print("Err rate Model")
for rate, name in sorted((rate, name) for name, rate in best_error.items()):
    print("%f %s" % (rate, name))

Err rate Model
0.108400 dbow+dmm_inferred
0.108840 dbow+dmm
0.109600 dbow+dmc
0.109720 Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)
0.111600 Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)_inferred
0.112400 dbow+dmc_inferred
0.152960 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)
0.158200 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)
0.171200 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)_inferred
0.173200 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)_inferred


### More sophisticated text preprocessing does not seem to improve performance; the results are slightly worse
This might be because of the smaller vocabulary size.

Among the individual models, PV-DBOW performs best in our testing, contrary to the results of the paper.  
Concatenating vectors from different models offers only a small predictive improvement (10.84% error for `dbow+dmm` versus 10.97% for DBOW alone). The best results reproduced here are just under 11% error rate, still a long way from the paper's reported 7.42% error rate.

## Examining Results

### Are inferred vectors close to the precalculated ones?

In [31]:
doc_id = np.random.randint(simple_models[0].docvecs.count)  # Pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))

for doc 95329...
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8):
 [(95329, 0.7945490479469299), (2911, 0.4073815643787384), (63358, 0.3943607211112976)]
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8):
 [(95329, 0.9538134336471558), (58788, 0.6186803579330444), (20746, 0.6146473288536072)]
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8):
 [(95329, 0.8836967945098877), (30224, 0.6142271757125854), (56893, 0.6125851273536682)]


(Yes, here the stored vector from 20 epochs of training is usually one of the closest to a freshly-inferred vector for the same words.   
Note the defaults for inference are very abbreviated – just 5 steps starting at a high alpha – and likely need tuning for other applications.)

### Do close documents seem more related than distant ones?

In [35]:
import random

doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples
model = random.choice(simple_models)  # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))

TARGET (74844): «since today is steven spielberg's 60th birthday , i wanted to comment on one of his movies . he only produced " young sherlock holmes " - barry levinson directed it - but it's a pretty cool movie . portraying sherlock ( nicholas rowe ) and watson ( alan cox ) meeting in a boarding school while some strange murders are occurring in london , they do pretty much anything that they want . the whole movie has the definite feel of a spielberg movie , what with the burning of a giant set and all . even if the movie doesn't have the most impressive plot , the hallucinations make up for everything ( it's not often that we get to see cream puffs and chocolate éclairs attack someone ; serves him right for eating junk food ! ) . i recommend it .»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8):

MOST (36931, 0.4433958828449249): «hollywood is one of the best and the beautiful things that had occurred in my life . i admire and am very much fascinated by th

(Somewhat, in terms of reviewer tone, movie genre, etc... the MOST cosine-similar docs usually seem more like the TARGET than the MEDIAN or LEAST.)
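For reference, the MOST/MEDIAN/LEAST ordering is a ranking by cosine similarity. A minimal sketch with invented 2-d vectors (the names `near`, `orthogonal`, and `opposite` are placeholders, not documents from the corpus):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target = np.array([1.0, 0.0])
candidates = {
    'near': np.array([0.9, 0.1]),
    'orthogonal': np.array([0.0, 1.0]),
    'opposite': np.array([-1.0, 0.0]),
}

# Highest similarity first, as a most-similar query would return
ranked = sorted(candidates, key=lambda name: cosine(target, candidates[name]), reverse=True)
print(ranked)
```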

### Do the word vectors show useful similarities?

In [36]:
word_models = simple_models[:]

In [37]:
word_models

[<gensim.models.doc2vec.Doc2Vec at 0x7f8b06601c10>,
 <gensim.models.doc2vec.Doc2Vec at 0x7f8b06601cd0>,
 <gensim.models.doc2vec.Doc2Vec at 0x7f8b069bea10>]

In [39]:
import random
from IPython.display import HTML

# pick a random word with a suitable number of occurrences
while True:
    word = random.choice(word_models[0].wv.index2word)
    if word_models[0].wv.vocab[word].count > 10:
        break
        
# or uncomment below line, to just pick a word from the relevant domain:
#word = 'comedy/drama'
similars_per_model = [str(model.wv.most_similar(word, topn=20)).replace('), ','),<br>\n') for model in word_models]
similar_table = ("<table><tr><th>" +
    "</th><th>".join([str(model) for model in word_models]) + 
    "</th></tr><tr><td>" +
    "</td><td>".join(similars_per_model) +
    "</td></tr></table>")
print("most similar words for '%s' (%d occurrences)" % (word, simple_models[0].wv.vocab[word].count))
HTML(similar_table)

most similar words for 'frustrated' (476 occurrences)



| Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8) | Doc2Vec(dbow,d100,n5,mc2,s0.001,t8) | Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8) |
|---|---|---|
| disturbed, 0.7746529579162598 | organisation, 0.42548567056655884 | confused, 0.6229063868522644 |
| depressed, 0.7320755124092102 | mongolia, 0.39597493410110474 | disturbed, 0.5801238417625427 |
| confused, 0.706322431564331 | asimov's, 0.3929506838321686 | irritated, 0.5658557415008545 |
| disgusted, 0.7054495215415955 | eamon, 0.38869649171829224 | depressed, 0.5652328729629517 |
| disillusioned, 0.7042022943496704 | sleez, 0.38152042031288147 | disgusted, 0.5620701313018799 |
| angry, 0.7025586366653442 | refunded, 0.3796987533569336 | bored, 0.5554786324501038 |
| bored, 0.7015896439552307 | today's, 0.3741285502910614 | repulsed, 0.5281217098236084 |
| irritated, 0.6992833614349365 | radhika, 0.3675813674926758 | disillusioned, 0.5236408710479736 |
| enraged, 0.6737393140792847 | slides, 0.36348873376846313 | upset, 0.5170360207557678 |
| unimpressed, 0.6725138425827026 | mazar, 0.3630625009536743 | motivated, 0.5022591352462769 |
| perplexed, 0.6615269780158997 | babbitt, 0.36268505454063416 | satisfied, 0.501629650592804 |
| jealous, 0.6610188484191895 | tonic, 0.36204156279563904 | jealous, 0.49524572491645813 |
| annoyed, 0.6493697166442871 | sec, 0.358733206987381 | angry, 0.4920088052749634 |
| frightened, 0.6388447284698486 | sullivan, 0.35872185230255127 | annoyed, 0.4907613694667816 |
| dissatisfied, 0.6363586187362671 | effeminacy, 0.3576321005821228 | engrossed, 0.48971816897392273 |
| distraught, 0.633137047290802 | magistrate, 0.35597896575927734 | perplexed, 0.4884505271911621 |
| perturbed, 0.6268664598464966 | pere, 0.3558776378631592 | unsatisfied, 0.4835604429244995 |
| nonplussed, 0.6229336857795715 | radio/tv, 0.3516930341720581 | confounded, 0.4811770021915436 |
| troubled, 0.6206403970718384 | kilometer, 0.35146278142929077 | numb, 0.4778806269168854 |
| impatient, 0.6191350221633911 | $50, 0.35077035427093506 | impatient, 0.47536855936050415 |


Do the DBOW words look meaningless?  
That's because the gensim DBOW model doesn't train word vectors – they remain at their random initialized values – unless you ask with the `dbow_words=1` initialization parameter.  
Concurrent word-training slows DBOW mode significantly, and offers little improvement (and sometimes a little worsening) of the error rate on this IMDB sentiment-prediction task.   
Words from DM models tend to show meaningfully similar words when there are many examples in the training data (as with 'plot' or 'actor'). (All DM modes inherently involve word vector training concurrent with doc vector training.)  

### Are the word vectors from this dataset any good at analogies?

In [41]:
# Download this file: https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt
# and place it in the local directory
# Note: this takes many minutes
if os.path.isfile('data/questions-words.txt'):
    for model in word_models:
        sections = model.accuracy('data/questions-words.txt')
        correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])
        print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))

Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8): 30.23% correct (3051 of 10094)
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8): 0.01% correct (1 of 10094)
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8): 32.13% correct (3243 of 10094)


Even though this is a tiny, domain-specific dataset, it shows some meager capability on the general word analogies – at least for the DM/concat and DM/mean models which actually train word vectors. (The untrained random-initialized words of the DBOW model of course fail miserably.)
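For readers unfamiliar with the analogy test, each question has the form "a is to b as c is to ?", and it is answered by vector arithmetic followed by a nearest-neighbour lookup. The sketch below uses invented 2-d toy vectors and a hypothetical `solve_analogy` helper; it simplifies what gensim's evaluation does and is not its actual code.

```python
import numpy as np

# Toy 2-d vectors invented for illustration; the real word vectors here are 100-d
vectors = {
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([1.0, 1.0]),
    'king':  np.array([3.0, 0.0]),
    'queen': np.array([3.0, 1.0]),
    'apple': np.array([0.0, 3.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "man is to king as woman is to ?"
answer = solve_analogy('man', 'king', 'woman')
print(answer)
```

A question counts as correct only if the top-ranked word matches the expected answer exactly, which is why randomly initialized DBOW word vectors score close to zero.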