# Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
This notebook is adapted from the original one provided by Gensim.  
You can find it here: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

## Introduction
In this tutorial, we will learn how to apply Doc2vec using gensim by recreating the results of <a href="https://arxiv.org/pdf/1405.4053.pdf">Le and Mikolov 2014</a>. 

### Bag-of-words Model

Previous state-of-the-art document representations were based on the <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">bag-of-words model</a>, which represents each input document as a fixed-length vector. For example, borrowing from the Wikipedia article, the two documents  
  
(1) `John likes to watch movies. Mary likes movies too.`  
(2) `John also likes to watch football games.`  
  
are used to construct a length 10 list of words  
  
`["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]`  
  
so then we can represent the two documents as **fixed length vectors** whose elements are the frequencies of the corresponding words in our list  
  
(1) `[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]`  
(2) `[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]`  
  
Bag-of-words models are surprisingly effective but still **lose information about word order**.  
Bag of <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> models count **word phrases of length n** instead of single words. They still produce fixed-length vectors and capture local word order, but they suffer from data sparsity and high dimensionality.
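
The bag-of-words construction above can be reproduced in a few lines of plain Python. This is a minimal sketch: the `tokenize`, `bow_vector` and `ngrams` helpers are hypothetical names introduced here for illustration, and the tokenizer is deliberately naive (it only drops periods and splits on whitespace).

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): drop periods, split on whitespace
    return text.replace('.', '').split()

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

# Build the vocabulary in order of first appearance
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

def bow_vector(text, vocab):
    # One count per vocabulary word, so every document gets the same length
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

def ngrams(tokens, n):
    # Consecutive runs of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bow_vectors = [bow_vector(doc, vocab) for doc in docs]
bigrams = ngrams(tokenize(docs[1]), 2)

print(vocab)
print(bow_vectors)
print(bigrams)
```

The bigram list keeps pairs such as `('John', 'also')`, which is the local word order a plain bag-of-words vector discards.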

### Word2vec Model

Word2vec is a more recent model that embeds words in a dense vector space (much lower-dimensional than a one-hot encoding of the vocabulary) using a shallow neural network.  
The result is a set of word vectors where vectors close together in vector space have similar meanings based on context, and word vectors distant from each other have differing meanings.  
For example, `strong` and `powerful` would be close together and `strong` and `Paris` would be relatively far apart.  
There are two versions of this model based on skip-grams and continuous bag of words.

#### Word2vec - Skip-gram Model

The skip-gram <a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">word2vec</a> model, for example, takes in **pairs (word1, word2)** generated by moving a window across text data, and trains a **1-hidden-layer neural network** on a fake task: given an input word, predict a probability distribution over the words that appear near it.  
The words are fed in with <a href="https://en.wikipedia.org/wiki/One-hot">one-hot</a> encoding, so each input word selects one row of the **input-to-hidden weights**, and those rows are the word embeddings. So if the hidden layer has 300 neurons, this network will give us 300-dimensional word embeddings.
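
To make the "moving a window across text data" step concrete, here is a small sketch of how the (word1, word2) training pairs could be generated. The helper name `skipgram_pairs` and the toy sentence are assumptions for illustration; real implementations add details such as subsampling and randomly shrunk windows.

```python
def skipgram_pairs(tokens, window):
    """Return (input_word, nearby_word) pairs for a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```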

#### Word2vec - Continuous-bag-of-words Model

Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It is also a **1-hidden-layer neural network**.  
The fake task is reversed: given the context words in a window around a center word, predict the center word. Again, we use one-hot encoding, and the **input-to-hidden weights** give us the word embeddings.
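
For comparison with skip-gram, the sketch below builds continuous-bag-of-words training examples, where the whole context window is the input and the center word is the target. `cbow_examples` is a hypothetical helper written for this illustration only.

```python
def cbow_examples(tokens, window):
    """Return (context_words, center_word) training examples."""
    examples = []
    for i, center in enumerate(tokens):
        # Words to the left and to the right of position i, without the center itself
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples

examples = cbow_examples("the quick brown fox".split(), window=1)
for context, center in examples:
    print(context, "->", center)
```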

### Paragraph Vector

Le and Mikolov 2014 introduce the **Paragraph Vector**, which outperforms more naïve representations of documents such as averaging the Word2vec word vectors of a document.  
The idea is straightforward: we act as if **a paragraph (or document) is just another vector** like a word vector, but we will call it a paragraph vector.  
We determine the embedding of the paragraph in vector space in the same way as words. Our paragraph vector model **considers local word order like bag of n-grams**, but gives us a denser representation in vector space compared to a sparse, high-dimensional representation.
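
As a point of reference, the naïve averaging baseline mentioned above can be sketched with NumPy. The three-dimensional word vectors are made-up values for illustration, and `average_vector` is a hypothetical helper, not part of gensim.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for this illustration
word_vectors = {
    "great": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.2, 0.8, 0.1]),
    "boring": np.array([-0.7, 0.2, 0.3]),
}

def average_vector(words, word_vectors):
    # Skip out-of-vocabulary words; fall back to zeros if none are known
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        dim = len(next(iter(word_vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_a = average_vector(["great", "movie"], word_vectors)
doc_b = average_vector(["movie", "great"], word_vectors)
print(doc_a, doc_b)
```

Both orderings give the same vector, which illustrates why plain averaging ignores word order.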

#### Paragraph Vector - Distributed Memory (PV-DM)

This is the Paragraph Vector model analogous to **Continuous-bag-of-words Word2vec**.  
The paragraph vectors are obtained by **training a neural network on the fake task of inferring a center word based on context words and a context paragraph**. A paragraph is a context for all words in the paragraph, and a word in a paragraph can have that paragraph as a context. 
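
A rough sketch of the PV-DM training examples follows: each example pairs a paragraph identifier and the context words with the center word to be predicted. The helper `pv_dm_examples` and the toy paragraphs are assumptions for illustration; gensim's actual training code differs in its details.

```python
def pv_dm_examples(paragraphs, window):
    """Return (paragraph_id, context_words, center_word) examples."""
    examples = []
    for par_id, tokens in enumerate(paragraphs):
        for i, center in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
            # The paragraph id acts as one extra context item shared by
            # every example drawn from the same paragraph
            examples.append((par_id, context, center))
    return examples

paragraphs = ["great story too".split(), "not my taste".split()]
dm_examples = pv_dm_examples(paragraphs, window=1)
for example in dm_examples:
    print(example)
```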

#### Paragraph Vector - Distributed Bag of Words (PV-DBOW)

This is the Paragraph Vector model analogous to **Skip-gram Word2vec**.  
The paragraph vectors are obtained by training a neural network on the **fake task of predicting words randomly sampled from a paragraph, given only that paragraph's vector as input**.
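
Following the description in Le and Mikolov, where the paragraph vector is the input and words sampled from the paragraph are the targets, PV-DBOW training examples could be sketched as below. `pv_dbow_examples` and its parameters are hypothetical names used only for this illustration.

```python
import random

def pv_dbow_examples(paragraphs, samples_per_paragraph, seed=0):
    """Return (paragraph_id, target_word) examples; no context words are used."""
    rng = random.Random(seed)
    examples = []
    for par_id, tokens in enumerate(paragraphs):
        for _ in range(samples_per_paragraph):
            # The paragraph id is the only input; the sampled word is the target
            examples.append((par_id, rng.choice(tokens)))
    return examples

paragraphs = ["great story too".split(), "not my taste".split()]
dbow_examples = pv_dbow_examples(paragraphs, samples_per_paragraph=2)
print(dbow_examples)
```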

### Requirements

The following Python modules are dependencies for this tutorial:
* testfixtures ( `pip install testfixtures` )
* statsmodels ( `pip install statsmodels` )

## Load corpus

Let's download the IMDB archive if it is not already downloaded (84 MB). This will be our text data for this tutorial.   
The data can be found here: http://ai.stanford.edu/~amaas/data/sentiment/

In [1]:
import locale
import glob
import os.path
import requests
import tarfile
import sys
import codecs
import smart_open

In [2]:
dirname = 'aclImdb'
filename = 'aclImdb_v1.tar.gz'
locale.setlocale(locale.LC_ALL, 'C')

# Compare the numeric major version rather than the version string
if sys.version_info[0] >= 3:
    control_chars = [chr(0x85)]
else:
    control_chars = [unichr(0x85)]

Convert text to lower case and pad punctuation with spaces, so that each punctuation mark becomes a separate token.

In [3]:
def normalize_text(text):
    norm_text = text.lower()
    # Replace breaks with spaces
    norm_text = norm_text.replace('<br />', ' ')
    # Pad punctuation with spaces on both sides
    for char in ['.', '"', ',', '(', ')', '!', '?', ';', ':']:
        norm_text = norm_text.replace(char, ' ' + char + ' ')
    return norm_text
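
A quick check of what this normalization does to a short review fragment is sketched below. The function is repeated so the snippet runs on its own, and the sample string is made up for illustration.

```python
def normalize_text(text):
    norm_text = text.lower()
    # Replace breaks with spaces
    norm_text = norm_text.replace('<br />', ' ')
    # Pad punctuation with spaces on both sides
    for char in ['.', '"', ',', '(', ')', '!', '?', ';', ':']:
        norm_text = norm_text.replace(char, ' ' + char + ' ')
    return norm_text

sample = "Great movie!<br />Don't miss it."
tokens = normalize_text(sample).split()
print(tokens)
```

Punctuation marks become their own tokens, while apostrophes are left alone, so contractions such as `don't` stay intact.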

In [None]:
from timeit import default_timer  # time.clock() was removed in Python 3.8
start = default_timer()

if not os.path.isfile(os.path.join(dirname, 'alldata-id.txt')):
    if not os.path.isdir(dirname):
        if not os.path.isfile(filename):
            # Download IMDB archive
            print("Downloading IMDB archive...")
            url = u'http://ai.stanford.edu/~amaas/data/sentiment/' + filename
            r = requests.get(url)
            with open(filename, 'wb') as f:
                f.write(r.content)
        tar = tarfile.open(filename, mode='r')
        tar.extractall()
        tar.close()

    # Concatenate and normalize test/train data
    print("Cleaning up dataset...")
    folders = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']
    alldata = u''
    for fol in folders:
        temp = u''
        output = fol.replace('/', '-') + '.txt'
        # Is there a better pattern to use?
        txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))
        for txt in txt_files:
            with smart_open.smart_open(txt, "rb") as t:
                t_clean = t.read().decode("utf-8")
                for c in control_chars:
                    t_clean = t_clean.replace(c, ' ')
                temp += t_clean
            # One review per line
            temp += "\n"
        temp_norm = normalize_text(temp)
        with smart_open.smart_open(os.path.join(dirname, output), "wb") as n:
            n.write(temp_norm.encode("utf-8"))
        alldata += temp_norm

    # Prefix every review with a unique "_*<index>" token
    with smart_open.smart_open(os.path.join(dirname, 'alldata-id.txt'), 'wb') as f:
        for idx, line in enumerate(alldata.splitlines()):
            num_line = u"_*{0} {1}\n".format(idx, line)
            f.write(num_line.encode("utf-8"))

end = default_timer()
print("Total running time: %.1f s" % (end - start))

In [4]:
import os.path
assert os.path.isfile("../../gensim_word2vec/aclImdb/alldata-id.txt"), "alldata-id.txt unavailable"

The text data is small enough to be read into memory. 

In [5]:
import gensim
from gensim.models.doc2vec import TaggedDocument
from collections import namedtuple

In [6]:
SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

alldocs = []  # Will hold all docs in original order
with open('../../gensim_word2vec/aclImdb/alldata-id.txt') as alldata:
    for line_no, line in enumerate(alldata):
        tokens = gensim.utils.to_unicode(line).split()
        words = tokens[1:]
        tags = [line_no] # 'tags = [tokens[0]]' would also work at extra memory cost
        split = ['train', 'test', 'extra', 'extra'][line_no//25000]  # 25k train, 25k test, 50k extra
        sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no//12500] # [12.5K pos, 12.5K neg]*2 then unknown
        alldocs.append(SentimentDocument(words, tags, split, sentiment))
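
The integer divisions in the loading cell rely on the order in which the folders were concatenated (`train/pos`, `train/neg`, `test/pos`, `test/neg`, `train/unsup`). The sketch below isolates that index arithmetic; `split_for` and `sentiment_for` are hypothetical helper names, and the total of 100,000 lines is taken from the document count reported by this notebook.

```python
from collections import Counter

def split_for(line_no):
    # Blocks of 25,000 lines: train, test, then two blocks of unlabeled data
    return ['train', 'test', 'extra', 'extra'][line_no // 25000]

def sentiment_for(line_no):
    # Blocks of 12,500 lines: pos, neg, pos, neg, then unlabeled
    return [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no // 12500]

split_counts = Counter(split_for(i) for i in range(100000))
print(split_counts)
print(sentiment_for(0), sentiment_for(12500), sentiment_for(50000))
```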

In [7]:
train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']
doc_list = alldocs[:]  # For reshuffling per pass
print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_docs), len(test_docs)))

100000 docs: 25000 train-sentiment, 25000 test-sentiment


In [8]:
train_docs[0]

SentimentDocument(words=[u'sudden', u'impact', u'is', u'the', u'best', u'of', u'the', u'five', u'dirty', u'harry', u'movies', u'.', u'they', u"don't", u'come', u'any', u'leaner', u'and', u'meaner', u'than', u'this', u'as', u'harry', u'romps', u'through', u'a', u'series', u'of', u'violent', u'clashes', u',', u'with', u'the', u'bad', u'guys', u'getting', u'their', u'just', u'desserts', u'.', u'which', u'is', u'just', u'the', u'way', u'i', u'like', u'it', u'.', u'great', u'story', u'too', u'and', u'ably', u'directed', u'by', u'clint', u'himself', u'.', u'excellent', u'entertainment', u'.'], tags=[0], split='train', sentiment=1.0)

## Set-up Doc2Vec Training & Evaluation Models

We approximate the experiment of Le & Mikolov ["Distributed Representations of Sentences and Documents"](http://cs.stanford.edu/~quocle/paragraph_vector.pdf) with guidance from Mikolov's [example go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ):

`./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1`

We vary the following parameter choices:
* **100-dimensional** vectors, as the 400-d vectors of the paper don't seem to offer much benefit on this task
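* Similarly, **frequent word subsampling** seems to decrease sentiment-prediction accuracy, so it was left out of the original experiment. Note that the model summaries printed further down show `s0.001`, which suggests that gensim's default `sample=0.001` is in effect in this run; pass `sample=0` explicitly to disable subsampling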
* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`
* Added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)
* A `min_count=2` saves quite a bit of model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each doc vectors themselves)

In [9]:
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

In [10]:
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

We test three simple models here.

In [11]:
simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

Speed up setup by sharing results of the 1st model's vocabulary scan

### It turns out that the intersect can sometimes be quite low

In [12]:
# PV-DM w/ concat requires one special NULL word so it serves as template
%time simple_models[0].build_vocab(alldocs)  
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

CPU times: user 6.92 s, sys: 133 ms, total: 7.05 s
Wall time: 6.95 s
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)


### Check the vocabulary size

In [13]:
print(len(simple_models[0].wv.vocab))

116046


Le and Mikolov note that **combining a paragraph vector from Distributed Bag of Words (DBOW) and Distributed Memory (DM) improves performance**.  
We will follow, pairing the models together for evaluation. Here, **we concatenate the paragraph vectors obtained from each model.**

In [14]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec

models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[2]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[0]])
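
Conceptually, pairing two models this way means each document is represented by its two paragraph vectors joined end to end. The NumPy sketch below uses random stand-in vectors rather than trained ones, and is only meant to show the resulting shape.

```python
import numpy as np

rng = np.random.RandomState(0)
dbow_vec = rng.rand(100)  # stand-in for a 100-d PV-DBOW paragraph vector
dm_vec = rng.rand(100)    # stand-in for a 100-d PV-DM paragraph vector

# Join the two vectors end to end: 100 + 100 = 200 dimensions
combined = np.concatenate([dbow_vec, dm_vec])
print(combined.shape)
```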

In [15]:
models_by_name

OrderedDict([('Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)',
              <gensim.models.doc2vec.Doc2Vec at 0x7f66224adf10>),
             ('Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)',
              <gensim.models.doc2vec.Doc2Vec at 0x7f66224adfd0>),
             ('Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)',
              <gensim.models.doc2vec.Doc2Vec at 0x7f66223e80d0>),
             ('dbow+dmm',
              <gensim.test.test_doc2vec.ConcatenatedDoc2Vec at 0x7f6684bc5850>),
             ('dbow+dmc',
              <gensim.test.test_doc2vec.ConcatenatedDoc2Vec at 0x7f661ef47490>)])

## Predictive Evaluation Methods

Let's define some helper methods for evaluating the performance of our Doc2vec models.  
We will classify document sentiments using a **logistic regression** model based on our paragraph embeddings.  
We will compare the error rates obtained with the paragraph embeddings from our various Doc2vec models.

In [16]:
import numpy as np
import statsmodels.api as sm
from random import sample


A small context manager for timing:

In [17]:
from contextlib import contextmanager
from timeit import default_timer
import time 

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end-start
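
The timer can be read both during and after the `with` block. The sketch below repeats the definition so it runs on its own; the summation is an arbitrary stand-in workload.

```python
from contextlib import contextmanager
from timeit import default_timer

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    # After the block exits, the reading is frozen at the total duration
    elapser = lambda: end - start

with elapsed_timer() as elapsed:
    total = sum(range(100000))  # stand-in workload
    during = elapsed()          # running time so far

after = elapsed()  # total time spent inside the block
later = elapsed()  # same value again, since the timer has stopped
print(during, after, later)
```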

In [18]:
def logistic_predictor_from_data(train_targets, train_regressors):
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    #print(predictor.summary())
    return predictor

In [19]:
def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets, train_regressors = zip(*[(doc.sentiment, test_model.docvecs[doc.tags[0]]) for doc in train_set])
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_data = test_set
    if infer:
        if infer_subsample < 1.0:
            test_data = sample(test_data, int(infer_subsample * len(test_data)))
        test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]
    else:
        test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_data]
    test_regressors = sm.add_constant(test_regressors)
    
    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)
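
The error-rate arithmetic at the end of this function can be checked on a handful of made-up predictions. The probabilities and labels below are invented for illustration only.

```python
import numpy as np

predicted_probs = np.array([0.9, 0.2, 0.6, 0.4, 0.1])  # made-up model outputs
true_labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0])      # made-up sentiments

# np.rint rounds each probability to the nearest label (0.0 or 1.0)
predicted_labels = np.rint(predicted_probs)
corrects = int(np.sum(predicted_labels == true_labels))
errors = len(predicted_probs) - corrects
error_rate = float(errors) / len(predicted_probs)
print(corrects, errors, error_rate)
```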

## Bulk Training

We use an explicit multiple-pass, alpha-reduction approach as sketched in this [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/) with added shuffling of corpus on each pass.  
Note that vector training is occurring on **all** documents of the dataset, which includes all **TRAIN/TEST/DEV** docs.  
We evaluate each model's sentiment predictive power based on error rate, and **the evaluation is repeated after each pass** so we can see the rates of relative improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. 

**(On a 4-core 2.6 GHz Intel Core i7, these 20 passes of training and evaluating the 3 main models take about an hour.)**
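
The linear alpha reduction used by the training loop can be previewed without training anything. This sketch uses the same starting values as the loop and only records the learning rate applied on each pass.

```python
alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

schedule = []
for epoch in range(passes):
    schedule.append(alpha)  # learning rate used for this pass
    alpha -= alpha_delta

print(["%.4f" % a for a in schedule])
print("alpha after the final decrement: %.4f" % alpha)
```

With this schedule, `min_alpha` is only reached by the decrement after the last pass; the last pass itself runs at a slightly higher rate.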

In [20]:
from collections import defaultdict
best_error = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [21]:
from random import shuffle
import datetime

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)  # Shuffling gets best results
    
    for name, train_model in models_by_name.items():
        # Train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)
            duration = '%.1f' % elapsed()
            
        # Evaluate
        eval_duration = ''
        with elapsed_timer() as eval_elapsed:
            err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs)
        eval_duration = '%.1f' % eval_elapsed()
        best_indicator = ' '
        if err <= best_error[name]:
            best_error[name] = err
            best_indicator = '*' 
        print("%s%f : %i passes : %s %ss %ss" % (best_indicator, err, epoch + 1, name, duration, eval_duration))

        if ((epoch + 1) % 5) == 0 or epoch == 0:
            eval_duration = ''
            with elapsed_timer() as eval_elapsed:
                infer_err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs, infer=True)
            eval_duration = '%.1f' % eval_elapsed()
            best_indicator = ' '
            if infer_err < best_error[name + '_inferred']:
                best_error[name + '_inferred'] = infer_err
                best_indicator = '*'
            print("%s%f : %i passes : %s %ss %ss" % (best_indicator, infer_err, epoch + 1, name + '_inferred', duration, eval_duration))

    print('Completed pass %i at alpha %f' % (epoch + 1, alpha))
    alpha -= alpha_delta
    
print("END %s" % str(datetime.datetime.now()))

START 2018-01-26 10:48:29.687572
*0.406120 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8) 36.1s 1.0s
*0.345200 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)_inferred 36.1s 8.0s
*0.243600 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t8) 17.7s 0.7s
*0.181200 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)_inferred 17.7s 2.9s
*0.261920 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8) 18.1s 0.7s
*0.212400 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)_inferred 18.1s 3.4s
*0.213680 : 1 passes : dbow+dmm 0.0s 1.6s
*0.169200 : 1 passes : dbow+dmm_inferred 0.0s 6.3s
*0.241280 : 1 passes : dbow+dmc 0.0s 1.6s
*0.197200 : 1 passes : dbow+dmc_inferred 0.0s 10.6s
Completed pass 1 at alpha 0.025000
*0.344880 : 2 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8) 33.3s 0.6s
*0.143680 : 2 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t8) 15.4s 0.6s
*0.212400 : 2 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8) 18.7s 0.7s
*0.136640 : 2 passes : dbow+dmm 0.0s 1.7s
*0.143200 : 2 passes : db

## Achieved Sentiment-Prediction Accuracy

In [22]:
# Print best error rates achieved
print("Err rate Model")
for rate, name in sorted((rate, name) for name, rate in best_error.items()):
    print("%f %s" % (rate, name))

Err rate Model
0.101600 dbow+dmc_inferred
0.101680 dbow+dmm
0.102840 dbow+dmc
0.103200 Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)
0.105600 Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)_inferred
0.111200 dbow+dmm_inferred
0.149680 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)
0.176000 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)
0.184400 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)_inferred
0.186400 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)_inferred


## Try other models

In [92]:
from sklearn.metrics import roc_auc_score

In [104]:
def clf_error_rate_for_model(clf, test_model, name, train_set, test_set, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):
    """Print accuracy and ROC AUC on test_set sentiments, for stored and inferred vectors, using the supplied classifier"""
    
    train_targets = np.array([doc.sentiment for doc in train_set])
    train_feats = np.array([test_model.docvecs[doc.tags[0]] for doc in train_set])
    
    clf.fit(train_feats, train_targets)

    test_data = test_set
    if infer_subsample < 1.0:
        test_data = sample(test_data, int(infer_subsample * len(test_data)))
        
    test_feats_infer = np.array([test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data])
    test_feats = np.array([test_model.docvecs[doc.tags[0]] for doc in test_data])
    
    # Predict & evaluate
    for feats_name, feats in zip(['', '_infer'], [test_feats, test_feats_infer]):
        test_predictions = clf.predict(feats)
        test_probs = clf.predict_proba(feats)[:, 1]
        test_true = np.array([doc.sentiment for doc in test_data])
        acc_rate = np.mean(test_predictions==test_true)
        roc = roc_auc_score(test_true, test_probs)
        print("%s : acc rate: %f  auc_roc: %f" % (name + feats_name, acc_rate, roc))
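
The two metrics printed by this function can be illustrated on toy data with NumPy alone. The `pairwise_auc` helper is a hypothetical stand-in that computes the area under the ROC curve as the fraction of correctly ranked (positive, negative) pairs; it is not the scikit-learn implementation, and the labels and scores are made up.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Score difference for every positive/negative pair
    diffs = pos[:, None] - neg[None, :]
    return float((np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / diffs.size)

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]  # threshold the scores at 0.5

acc = accuracy(y_true, y_pred)
auc = pairwise_auc(y_true, scores)
print(acc, auc)
```

Accuracy depends on the chosen threshold, whereas the AUC only depends on how the scores rank positives against negatives.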

### XGBoost

In [80]:
import xgboost as xgb

In [105]:
for name, train_model in models_by_name.items():
    xgb_clf = xgb.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=100, objective='binary:logistic', 
                                subsample=0.85, colsample_bytree=0.85, colsample_bylevel=0.85)
    clf_error_rate_for_model(xgb_clf, train_model, name, train_docs, test_docs)

Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8) : acc rate: 0.769200  auc_roc: 0.864437
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)_infer : acc rate: 0.750400  auc_roc: 0.837426
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8) : acc rate: 0.852400  auc_roc: 0.927155
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)_infer : acc rate: 0.852000  auc_roc: 0.927317
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8) : acc rate: 0.825600  auc_roc: 0.901677
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)_infer : acc rate: 0.788800  auc_roc: 0.869700
dbow+dmm : acc rate: 0.866000  auc_roc: 0.940335
dbow+dmm_infer : acc rate: 0.852000  auc_roc: 0.932618
dbow+dmc : acc rate: 0.867600  auc_roc: 0.945008
dbow+dmc_infer : acc rate: 0.863200  auc_roc: 0.941105


### RandomForest

In [69]:
from sklearn.ensemble import RandomForestClassifier

In [106]:
for name, train_model in models_by_name.items():
    rf_clf = RandomForestClassifier(n_estimators=100, max_features=0.85, n_jobs=-1)
    clf_error_rate_for_model(rf_clf, train_model, name, train_docs, test_docs)

Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8) : acc rate: 0.756000  auc_roc: 0.828077
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)_infer : acc rate: 0.743200  auc_roc: 0.816011
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8) : acc rate: 0.828800  auc_roc: 0.909808
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)_infer : acc rate: 0.821200  auc_roc: 0.905437
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8) : acc rate: 0.779200  auc_roc: 0.856277
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)_infer : acc rate: 0.742800  auc_roc: 0.816159
dbow+dmm : acc rate: 0.826400  auc_roc: 0.902402
dbow+dmm_infer : acc rate: 0.811200  auc_roc: 0.890690
dbow+dmc : acc rate: 0.815600  auc_roc: 0.896250
dbow+dmc_infer : acc rate: 0.810800  auc_roc: 0.888118


### LogisticRegression
Simple logistic regression is doing quite a good job!

In [47]:
from sklearn.linear_model import LogisticRegression

In [107]:
for name, train_model in models_by_name.items():
    lr_clf = LogisticRegression(n_jobs=-1)
    clf_error_rate_for_model(lr_clf, train_model, name, train_docs, test_docs)

Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8) : acc rate: 0.822800  auc_roc: 0.904801
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8)_infer : acc rate: 0.798000  auc_roc: 0.883578
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8) : acc rate: 0.895200  auc_roc: 0.960083
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8)_infer : acc rate: 0.893600  auc_roc: 0.959221
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8) : acc rate: 0.846000  auc_roc: 0.917237
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8)_infer : acc rate: 0.815600  auc_roc: 0.891742
dbow+dmm : acc rate: 0.901600  auc_roc: 0.962566
dbow+dmm_infer : acc rate: 0.899200  auc_roc: 0.960734
dbow+dmc : acc rate: 0.894400  auc_roc: 0.961836
dbow+dmc_infer : acc rate: 0.895200  auc_roc: 0.960089


### One-layer neural network

In [23]:
import tensorflow as tf

In [26]:
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [103]:
reset_graph()

# structure
n_input = 200
n_hidden = 50
n_output = 2

# input
X = tf.placeholder(tf.float32, shape=[None, n_input])
y = tf.placeholder(tf.int32, shape=[None])

hidden = tf.layers.dense(X, n_hidden, activation=tf.tanh, name="hidden")
logits = tf.layers.dense(hidden, n_output, name="logits")

with tf.name_scope("train"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy)
    optimizer = tf.train.AdamOptimizer(0.0005)
    training_op = optimizer.minimize(loss)
    
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, tf.cast(tf.reshape(y, (-1,)), tf.int32), 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.name_scope("init_save"):
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

In [109]:
doc_model = models_by_name['dbow+dmc']

train_targets = np.array([doc.sentiment for doc in train_docs])
train_feats = np.array([doc_model.docvecs[doc.tags[0]] for doc in train_docs])

print(train_targets.shape)
print(train_feats.shape)

train_dataset = np.array([{"feats": train_feats[idx], "target": train_targets[idx]} for idx in range(len(train_targets))])

(25000,)
(25000, 200)


In [110]:
len(train_dataset)

25000

In [111]:
def get_train_batch(iteration, batch_size):
    if iteration == 0:
        np.random.shuffle(train_dataset)
    batch_dataset = train_dataset[iteration * batch_size : (iteration + 1) * batch_size]
    X_batch = np.array([item["feats"] for item in batch_dataset]).reshape(-1, n_input)
    y_batch = np.array([item["target"] for item in batch_dataset])
    return X_batch, y_batch

In [112]:
n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(len(train_targets) // batch_size):
            X_batch, y_batch = get_train_batch(iteration, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            
            if (iteration % 100 == 0) and (iteration != 0):
                loss_eval, accuracy_eval = sess.run([loss, accuracy], feed_dict={X: X_batch, y: y_batch})
                print("Step %d: loss: %.6f, accuracy: %.6f" %(len(train_targets) * epoch + iteration, loss_eval, accuracy_eval))
        
        loss_eval, accuracy_eval = sess.run([loss, accuracy], feed_dict={X: X_batch, y: y_batch})
        print("Epoch %d: loss: %.6f, accuracy: %.6f" %(epoch, loss_eval, accuracy_eval))
    save_path = saver.save(sess, "./my_imdb_model")

Step 100: loss: 0.312429, accuracy: 0.900000
Step 200: loss: 0.242779, accuracy: 0.920000
Step 300: loss: 0.298778, accuracy: 0.920000
Step 400: loss: 0.252701, accuracy: 0.900000
Epoch 0: loss: 0.168647, accuracy: 0.960000
Step 25100: loss: 0.199757, accuracy: 0.940000
Step 25200: loss: 0.249264, accuracy: 0.860000
Step 25300: loss: 0.213603, accuracy: 0.920000
Step 25400: loss: 0.243585, accuracy: 0.860000
Epoch 1: loss: 0.203530, accuracy: 0.900000
Step 50100: loss: 0.402180, accuracy: 0.880000
Step 50200: loss: 0.179384, accuracy: 0.920000
Step 50300: loss: 0.130244, accuracy: 0.940000
Step 50400: loss: 0.387198, accuracy: 0.820000
Epoch 2: loss: 0.242061, accuracy: 0.900000
Step 75100: loss: 0.253615, accuracy: 0.860000
Step 75200: loss: 0.149337, accuracy: 0.960000
Step 75300: loss: 0.206783, accuracy: 0.940000
Step 75400: loss: 0.487248, accuracy: 0.820000
Epoch 3: loss: 0.162171, accuracy: 0.940000
Step 100100: loss: 0.216643, accuracy: 0.900000
Step 100200: loss: 0.193093, acc

In [113]:
doc_model = models_by_name['dbow+dmc']

test_targets = np.array([doc.sentiment for doc in test_docs])
%time test_feats_infer = np.array([doc_model.infer_vector(doc.words, steps=3, alpha=0.1) for doc in test_docs])

print(test_targets.shape)
print(test_feats_infer.shape)

CPU times: user 1min 32s, sys: 510 ms, total: 1min 32s
Wall time: 1min 31s
(25000,)
(25000, 200)


In [114]:
n_test_batches = 50
X_test_batches = np.array_split(test_feats_infer, n_test_batches)
y_test_batches = np.array_split(test_targets, n_test_batches)

with tf.Session() as sess:
    saver.restore(sess, "./my_imdb_model")

    print("Computing final accuracy on the test set (this will take a while)...")
    acc_test = np.mean([
        accuracy.eval(feed_dict={X: X_test_batch, y: y_test_batch})
        for X_test_batch, y_test_batch in zip(X_test_batches, y_test_batches)])
    print("Test accuracy:", acc_test)

INFO:tensorflow:Restoring parameters from ./my_imdb_model
Computing final accuracy on the test set (this will take a while)...
('Test accuracy:', 0.89304)


In our testing, contrary to the results of the paper, PV-DBOW performs best.  
Concatenating vectors from different models only offers a small predictive improvement over averaging vectors. The best results reproduced are just under a 10% error rate, still a long way from the paper's reported 7.42% error rate.

## Examining Results

### Are inferred vectors close to the precalculated ones?

In [31]:
doc_id = np.random.randint(simple_models[0].docvecs.count)  # Pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))

for doc 95329...
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8):
 [(95329, 0.7945490479469299), (2911, 0.4073815643787384), (63358, 0.3943607211112976)]
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8):
 [(95329, 0.9538134336471558), (58788, 0.6186803579330444), (20746, 0.6146473288536072)]
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8):
 [(95329, 0.8836967945098877), (30224, 0.6142271757125854), (56893, 0.6125851273536682)]


(Yes, here the stored vector from 20 epochs of training is usually one of the closest to a freshly-inferred vector for the same words.   
Note the defaults for inference are very abbreviated – just 5 steps starting at a high alpha – and likely need tuning for other applications.)

### Do close documents seem more related than distant ones?

In [35]:
import random

doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples
model = random.choice(simple_models)  # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))

TARGET (74844): «since today is steven spielberg's 60th birthday , i wanted to comment on one of his movies . he only produced " young sherlock holmes " - barry levinson directed it - but it's a pretty cool movie . portraying sherlock ( nicholas rowe ) and watson ( alan cox ) meeting in a boarding school while some strange murders are occurring in london , they do pretty much anything that they want . the whole movie has the definite feel of a spielberg movie , what with the burning of a giant set and all . even if the movie doesn't have the most impressive plot , the hallucinations make up for everything ( it's not often that we get to see cream puffs and chocolate éclairs attack someone ; serves him right for eating junk food ! ) . i recommend it .»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8):

MOST (36931, 0.4433958828449249): «hollywood is one of the best and the beautiful things that had occurred in my life . i admire and am very much fascinated by th

(Somewhat. In terms of reviewer tone, movie genre, etc., the MOST cosine-similar docs usually seem more like the TARGET than the MEDIAN or LEAST do.)
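The MOST/MEDIAN/LEAST comparison above boils down to ranking every other document by cosine similarity to the target. The following is a small pure-Python sketch of that idea on toy 2-d vectors; `cosine_similarity`, `rank_by_similarity`, and `toy_vectors` are illustrative names, not gensim APIs, and the sketch assumes the target itself is excluded from the ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(target_id, vectors):
    """Return (doc_id, similarity) pairs for all other docs, most similar first."""
    target = vectors[target_id]
    sims = [(doc_id, cosine_similarity(target, vec))
            for doc_id, vec in vectors.items() if doc_id != target_id]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)


# Toy document vectors (made up for illustration).
toy_vectors = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [-1.0, 0.0],
}

sims = rank_by_similarity(0, toy_vectors)
most, median, least = sims[0], sims[len(sims) // 2], sims[-1]
print("MOST %s, MEDIAN %s, LEAST %s" % (most, median, least))
```

Picking the first, middle, and last entries of the sorted list mirrors the indexing used in the cell above.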

### Do the word vectors show useful similarities?

In [36]:
word_models = simple_models[:]

In [37]:
word_models

[<gensim.models.doc2vec.Doc2Vec at 0x7f8b06601c10>,
 <gensim.models.doc2vec.Doc2Vec at 0x7f8b06601cd0>,
 <gensim.models.doc2vec.Doc2Vec at 0x7f8b069bea10>]

In [39]:
import random
from IPython.display import HTML

# pick a random word with a suitable number of occurrences
while True:
    word = random.choice(word_models[0].wv.index2word)
    if word_models[0].wv.vocab[word].count > 10:
        break

# or uncomment the line below to pick a word from the relevant domain instead:
#word = 'comedy/drama'
# query through `.wv`; calling most_similar() on the model itself is deprecated
similars_per_model = [str(model.wv.most_similar(word, topn=20)).replace('), ','),<br>\n') for model in word_models]
similar_table = ("<table><tr><th>" +
    "</th><th>".join([str(model) for model in word_models]) +
    "</th></tr><tr><td>" +
    "</td><td>".join(similars_per_model) +
    "</td></tr></table>")
print("most similar words for '%s' (%d occurrences)" % (word, word_models[0].wv.vocab[word].count))
HTML(similar_table)

most similar words for 'frustrated' (476 occurrences)


| Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8) | Doc2Vec(dbow,d100,n5,mc2,s0.001,t8) | Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8) |
| --- | --- | --- |
| `disturbed`, 0.7746529579162598 | `organisation`, 0.42548567056655884 | `confused`, 0.6229063868522644 |
| `depressed`, 0.7320755124092102 | `mongolia`, 0.39597493410110474 | `disturbed`, 0.5801238417625427 |
| `confused`, 0.706322431564331 | `asimov's`, 0.3929506838321686 | `irritated`, 0.5658557415008545 |
| `disgusted`, 0.7054495215415955 | `eamon`, 0.38869649171829224 | `depressed`, 0.5652328729629517 |
| `disillusioned`, 0.7042022943496704 | `sleez`, 0.38152042031288147 | `disgusted`, 0.5620701313018799 |
| `angry`, 0.7025586366653442 | `refunded`, 0.3796987533569336 | `bored`, 0.5554786324501038 |
| `bored`, 0.7015896439552307 | `today's`, 0.3741285502910614 | `repulsed`, 0.5281217098236084 |
| `irritated`, 0.6992833614349365 | `radhika`, 0.3675813674926758 | `disillusioned`, 0.5236408710479736 |
| `enraged`, 0.6737393140792847 | `slides`, 0.36348873376846313 | `upset`, 0.5170360207557678 |
| `unimpressed`, 0.6725138425827026 | `mazar`, 0.3630625009536743 | `motivated`, 0.5022591352462769 |
| `perplexed`, 0.6615269780158997 | `babbitt`, 0.36268505454063416 | `satisfied`, 0.501629650592804 |
| `jealous`, 0.6610188484191895 | `tonic`, 0.36204156279563904 | `jealous`, 0.49524572491645813 |
| `annoyed`, 0.6493697166442871 | `sec`, 0.358733206987381 | `angry`, 0.4920088052749634 |
| `frightened`, 0.6388447284698486 | `sullivan`, 0.35872185230255127 | `annoyed`, 0.4907613694667816 |
| `dissatisfied`, 0.6363586187362671 | `effeminacy`, 0.3576321005821228 | `engrossed`, 0.48971816897392273 |
| `distraught`, 0.633137047290802 | `magistrate`, 0.35597896575927734 | `perplexed`, 0.4884505271911621 |
| `perturbed`, 0.6268664598464966 | `pere`, 0.3558776378631592 | `unsatisfied`, 0.4835604429244995 |
| `nonplussed`, 0.6229336857795715 | `radio/tv`, 0.3516930341720581 | `confounded`, 0.4811770021915436 |
| `troubled`, 0.6206403970718384 | `kilometer`, 0.35146278142929077 | `numb`, 0.4778806269168854 |
| `impatient`, 0.6191350221633911 | `$50`, 0.35077035427093506 | `impatient`, 0.47536855936050415 |


Do the DBOW words look meaningless?  
That's because the gensim DBOW model doesn't train word vectors (they remain at their randomly initialized values) unless you ask for it with the `dbow_words=1` initialization parameter.  
Concurrent word training slows DBOW mode significantly and offers little improvement (sometimes a slight worsening) of the error rate on this IMDB sentiment-prediction task.  
Words from DM models tend to show meaningfully similar words when there are many examples in the training data (as with 'plot' or 'actor'). (All DM modes inherently train word vectors concurrently with doc vectors.)  
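A quick way to see why untrained vectors produce such weak "neighbors" is to measure cosine similarities between purely random vectors. The sketch below uses only the standard library; the dimensionality, vocabulary size, and uniform initialization range are assumptions chosen for illustration and are not taken from gensim's internals.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)   # fixed seed so the run is repeatable
dim, vocab_size = 100, 500   # assumed sizes, for illustration only

# Every "word" gets an independent random vector, as if it had never been trained.
random_vectors = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                  for _ in range(vocab_size)]

query = random_vectors[0]
sims = [cosine_similarity(query, vec) for vec in random_vectors[1:]]

best = max(sims)
mean_abs = sum(abs(s) for s in sims) / len(sims)
print("best similarity: %.3f, mean |similarity|: %.3f" % (best, mean_abs))
```

In high dimensions, independent random vectors are nearly orthogonal, so even the top-ranked neighbor scores low and carries no semantic meaning. That matches the pattern in the DBOW column above, where the best scores are far lower than those of the DM models and the words are unrelated.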

### Are the word vectors from this dataset any good at analogies?

In [41]:
import os

# Download this file: https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt
# and place it in the local `data/` directory
# Note: this takes many minutes
if os.path.isfile('data/questions-words.txt'):
    for model in word_models:
        # evaluate through `.wv`; calling accuracy() on the model itself is deprecated
        sections = model.wv.accuracy('data/questions-words.txt')
        # the last section holds the totals across all analogy categories
        correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])
        print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))

Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t8): 30.23% correct (3051 of 10094)
Doc2Vec(dbow,d100,n5,mc2,s0.001,t8): 0.01% correct (1 of 10094)
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t8): 32.13% correct (3243 of 10094)


Even though this is a tiny, domain-specific dataset, it shows some meager capability on the general word analogies, at least for the DM/concat and DM/mean models, which actually train word vectors. (The untrained, randomly initialized words of the DBOW model of course fail miserably.)
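For readers unfamiliar with how an analogy question such as "man is to woman as king is to ?" is scored, here is a minimal sketch of the usual vector-offset approach: take `b - a + c` and pick the nearest remaining word by cosine similarity. The toy vectors and the `solve_analogy` helper are made up for illustration; gensim's own evaluation differs in details (for example, how vectors are normalized and how the vocabulary is restricted).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the nearest neighbor of b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


# Hand-made toy vectors: axis 1 encodes gender, axis 2 encodes royalty.
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.2, -0.5],
}

# Each question is (a, b, c, expected), like a line of questions-words.txt.
questions = [("man", "woman", "king", "queen"),
             ("king", "queen", "man", "woman")]

correct = sum(solve_analogy(a, b, c, toy) == expected
              for a, b, c, expected in questions)
accuracy = 100.0 * correct / len(questions)
print("%0.2f%% correct (%d of %d)" % (accuracy, correct, len(questions)))
```

With randomly initialized vectors there is no consistent offset between related word pairs, which is why the DBOW model answers almost none of the questions correctly.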