In [1]:
%matplotlib inline


Doc2Vec Model
=============

Introduces Gensim's Doc2Vec model and demonstrates its use on the Lee Corpus.




In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Doc2Vec is a `core_concepts_model` that represents each
`core_concepts_document` as a `core_concepts_vector`.  This
tutorial introduces the model and demonstrates how to train and assess it.

Here's a list of what we'll be doing:

0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec
1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)
2. Train a Doc2Vec `core_concepts_model` model using the training corpus
3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`
4. Assess the model
5. Test the model on the test corpus

Review: Bag-of-words
--------------------

.. Note:: Feel free to skip these review sections if you're already familiar with the models.

You may be familiar with the `bag-of-words model
<https://en.wikipedia.org/wiki/Bag-of-words_model>`_ from the
`core_concepts_vector` section.
This model transforms each document to a fixed-length vector of integers.
For example, given the sentences:

- ``John likes to watch movies. Mary likes movies too.``
- ``John also likes to watch football games. Mary hates football.``

The model outputs the vectors:

- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``
- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``

Each vector has 11 elements, where each element counts the number of times a
particular word occurred in the document.
The order of elements is arbitrary.
In the example above, the order of the elements corresponds to the words:
``["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]``.
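The counting step can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not Gensim's implementation: the helper names ``tokenize`` and ``bow_vector`` are made up for illustration, and the vocabulary is simply built in first-seen order, which happens to match the word order listed above.

```python
import re
from collections import Counter

sentences = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games. Mary hates football.",
]

def tokenize(text):
    # Hypothetical helper: keep alphabetic runs, drop punctuation
    return re.findall(r"[A-Za-z]+", text)

# Build the vocabulary in first-seen order across both sentences
vocab = []
for sentence in sentences:
    for token in tokenize(sentence):
        if token not in vocab:
            vocab.append(token)

def bow_vector(text, vocab):
    # Count each vocabulary word in the text (0 if absent)
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vectors = [bow_vector(sentence, vocab) for sentence in sentences]
for vector in vectors:
    print(vector)
```

Both vectors have ``len(vocab)`` entries, one per distinct word in the two sentences.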

Bag-of-words models are surprisingly effective, but have several weaknesses.

First, they lose all information about word order: "John likes Mary" and
"Mary likes John" correspond to identical vectors. There is a solution: bag
of `n-grams <https://en.wikipedia.org/wiki/N-gram>`__
models represent documents as fixed-length vectors over word phrases of
length n, which captures local word order but suffers from data
sparsity and high dimensionality.
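To make the word-order point concrete, the sketch below compares bags of unigrams and bags of bigrams for the two example phrases. The function name ``bag_of_ngrams`` is illustrative only and is not part of Gensim.

```python
from collections import Counter

def bag_of_ngrams(tokens, n):
    # Count every run of n consecutive tokens
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "John likes Mary".split()
b = "Mary likes John".split()

# Unigram bags ignore order, so the two phrases look identical
same_unigrams = bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1)
# Bigram bags keep local order, so the phrases differ
same_bigrams = bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2)

print(same_unigrams, same_bigrams)
```

The price is a much larger feature space: the number of possible bigrams grows with the square of the vocabulary size, which is the sparsity problem mentioned above.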

Second, the model does not attempt to learn the meaning of the underlying
words, and as a consequence, the distance between vectors doesn't always
reflect the difference in meaning.  The ``Word2Vec`` model addresses this
second problem.

Review: ``Word2Vec`` Model
--------------------------

``Word2Vec`` is a more recent model that embeds words in a lower-dimensional
vector space using a shallow neural network. The result is a set of
word-vectors where vectors close together in vector space have similar
meanings based on context, and word-vectors distant to each other have
differing meanings. For example, ``strong`` and ``powerful`` would be close
together and ``strong`` and ``Paris`` would be relatively far.

Gensim's :py:class:`~gensim.models.word2vec.Word2Vec` class implements this model.

With the ``Word2Vec`` model, we can calculate the vectors for each **word** in a document.
But what if we want to calculate a vector for the **entire document**\ ?
We could average the vectors for each word in the document - while this is quick and crude, it can often be useful.
However, there is a better way...
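The averaging baseline can be sketched as follows. The two-dimensional vectors in ``toy_word_vectors`` are invented for illustration and do not come from a trained model; ``average_vector`` is a hypothetical helper, not a Gensim function.

```python
# Made-up 2-d word vectors, for illustration only
toy_word_vectors = {
    "strong": [0.9, 0.1],
    "powerful": [0.8, 0.2],
    "paris": [-0.7, 0.6],
}

def average_vector(tokens, word_vectors):
    # Average the vectors of the tokens we know; skip unknown tokens
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

doc_vector = average_vector(["strong", "powerful", "unseen"], toy_word_vectors)
print(doc_vector)
```

Note that averaging discards word order entirely, just as bag-of-words does.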

Introducing: Paragraph Vector
-----------------------------

.. Important:: In Gensim, we refer to the Paragraph Vector model as ``Doc2Vec``.

In 2014, Le and Mikolov introduced the `Doc2Vec algorithm <https://cs.stanford.edu/~quocle/paragraph_vector.pdf>`__, which usually outperforms simple averaging of ``Word2Vec`` vectors.

The basic idea is: act as if a document has another floating word-like
vector, which contributes to all training predictions, and is updated like
other word-vectors, but we will call it a doc-vector. Gensim's
:py:class:`~gensim.models.doc2vec.Doc2Vec` class implements this algorithm.

There are two implementations:

1. Paragraph Vector - Distributed Memory (PV-DM)
2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)

.. Important::
  Don't let the implementation details below scare you.
  They're advanced material: if it's too much, then move on to the next section.

PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training
a neural network on the synthetic task of predicting a center word based on an
average of both context word-vectors and the full document's doc-vector.

PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training
a neural network on the synthetic task of predicting a target word just from
the full document's doc-vector. (It is also common to combine this with
skip-gram training, using both the doc-vector and nearby word-vectors to
predict a single target word, but only one at a time.)
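The difference between the two synthetic tasks is easiest to see in the (input, target) pairs each one produces. The sketch below only enumerates those pairs. It is a simplification under stated assumptions and not Gensim's training code: the function names and the fixed ``window`` argument are hypothetical, and no neural network is trained.

```python
def pv_dm_examples(tag, tokens, window=2):
    # PV-DM: the doc tag plus surrounding context words predict the center word
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append(((tag, tuple(context)), target))
    return examples

def pv_dbow_examples(tag, tokens):
    # PV-DBOW: the doc tag alone predicts each word of the document
    return [(tag, target) for target in tokens]

tokens = ["john", "likes", "movies"]
for example in pv_dm_examples(0, tokens, window=1):
    print("PV-DM  ", example)
for example in pv_dbow_examples(0, tokens):
    print("PV-DBOW", example)
```

In Gensim's ``Doc2Vec`` class, the ``dm`` parameter selects the variant (``dm=1`` for PV-DM, ``dm=0`` for PV-DBOW), and ``dbow_words=1`` adds the skip-gram word training mentioned above.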

Prepare the Training and Test Data
----------------------------------

For this tutorial, we'll be training our model using the `Lee Background
Corpus
<https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf>`_
included in Gensim. This corpus contains 300 documents selected from the
Australian Broadcasting Corporation’s news mail service, which provides text
e-mails of headline stories and covers a number of broad topics.

And we'll test our model by eye using the much shorter `Lee Corpus
<https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf>`_
which contains 50 documents.




In [2]:
import os
import gensim
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')



Define a Function to Read and Preprocess Text
---------------------------------------------

Below, we define a function to:

- open the train/test file (with ISO-8859-1, i.e. Latin-1, encoding)
- read the file line-by-line
- pre-process each line (tokenize text into individual words, remove punctuation, set to lowercase, etc.)

The file we're reading is a **corpus**.
Each line of the file is a **document**.

.. Important::
  To train the model, we'll need to associate a tag/number with each document
  of the training corpus. In our case, the tag is simply the zero-based line
  number.




In [17]:
import smart_open

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

In [3]:
import utility
all_state, all_sjc = utility.combine_cases()


imported utility.py



Unnamed: 0,case,headnote,text,date,county
0,COMMONWEALTH vs. ADMILSON RESENDE.,"Controlled Substances. Constitutional Law, Ple...",\nThe present case is the most recent in a ser...,,
1,COMMONWEALTH vs. GEORGE PHILBROOK.,"Homicide. Evidence, Prior violent conduct, Sta...",\nThe defendant was convicted of murder in the...,,
2,"LINDA S. BOWERS vs. P. WILE'S, INC.","Negligence, Retailer. Notice. Practice, Civil,...",\nIn this case we are called upon to determine...,,
3,COMMONWEALTH vs. JARED ABDALLAH.,"Constitutional Law, Search and seizure. Search...","\nAfter causing a disturbance, the defendant w...",,
4,COMMONWEALTH vs. ROBERT D. WADE.,"Amended October 28, 2016\nDeoxyribonucleic Aci...",\nThis case requires us to decide whether the ...,,
...,...,...,...,...,...
2004,IN THE MATTER OF CLAUDE DAVID,"SJC-12642\nAttorney at Law, Disciplinary proce...","The respondent, Claude David Grayer, appeals f...","December 2, 2019",
2005,"KAROL E. SIMONTON, pet","SJC-12588\nHabeas Corpus. Practice, Criminal, ...",Karol E. Simonton appeals from a judgment of t...,"December 11, 2019",
2006,A.F.,SJC-12686\nHarassment Prevention. Supreme Judi...,"The petitioner, A.F., appeals from a judgment ...","December 11, 2019",
2007,IN THE MATTER OF CARL MARTIN,"SJC-12589\nAttorney at Law, Admission to pract...","Carl Martin Swanson has filed, in the county c...","December 12, 2019",


Let's take a look at the training corpus:




In [31]:
print(train_corpus[20])

TaggedDocument(['argentine', 'president', 'adolfo', 'rodriguez', 'saa', 'has', 'asked', 'the', 'country', 'banks', 'to', 'help', 're', 'establish', 'peace', 'by', 'facilitating', 'the', 'payment', 'of', 'pensions', 'and', 'salaries', 'to', 'workers', 'and', 'retirees', 'he', 'says', 'he', 'issued', 'the', 'appeal', 'at', 'meeting', 'with', 'leaders', 'of', 'the', 'banking', 'community', 'very', 'concerned', 'about', 'what', 'has', 'happened', 'in', 'argentina', 'mr', 'rodriguez', 'saa', 'said', 'he', 'says', 'he', 'has', 'asked', 'banks', 'to', 'remain', 'open', 'from', 'am', 'to', 'pm', 'monday', 'to', 'be', 'able', 'to', 'cash', 'checks', 'of', 'up', 'to', 'pesos', 'or', 'us', 'per', 'person'], [20])


And the testing corpus looks like this:




In [7]:
print(test_corpus[:2])

[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to'

Notice that the testing corpus is just a list of lists and does not contain
any tags.




Training the Model
------------------

Now, we'll instantiate a Doc2Vec model with a vector size of 50 dimensions
that iterates over the training corpus 40 times. We set the minimum word count to
2 in order to discard words with very few occurrences. (Without a variety of
representative examples, retaining such infrequent words can often make a
model worse!) Typical iteration counts in the published `Paragraph Vector paper <https://cs.stanford.edu/~quocle/paragraph_vector.pdf>`__
results, using tens of thousands to millions of docs, are 10-20. More
iterations take more time and eventually reach a point of diminishing
returns.

However, this is a very small dataset (300 documents) with shortish
documents (a few hundred words). Adding training passes can sometimes help
with such small datasets.
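The effect of a minimum word count can be illustrated on a toy corpus. This is a sketch of the idea only, not Gensim's vocabulary-building code: ``prune_vocab`` and ``toy_corpus`` are hypothetical names.

```python
from collections import Counter

toy_corpus = [
    ["cat", "sat", "mat"],
    ["cat", "sat"],
    ["dog"],
]

def prune_vocab(corpus, min_count):
    # Count every token, then keep only words seen at least min_count times
    counts = Counter(token for doc in corpus for token in doc)
    return {word: count for word, count in counts.items() if count >= min_count}

kept = prune_vocab(toy_corpus, min_count=2)
print(kept)
```

Words that fall below the threshold (here ``mat`` and ``dog``) are ignored during training, so they neither receive vectors nor contribute to the prediction tasks.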




In [8]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

  "C extension not loaded, training will be slow. "


Build a vocabulary



In [9]:
model.build_vocab(train_corpus)

2020-05-01 10:55:42,753 : INFO : collecting all words and their counts
2020-05-01 10:55:42,754 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2020-05-01 10:55:42,768 : INFO : collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
2020-05-01 10:55:42,769 : INFO : Loading a fresh vocabulary
2020-05-01 10:55:42,780 : INFO : effective_min_count=2 retains 3955 unique words (56% of original 6981, drops 3026)
2020-05-01 10:55:42,781 : INFO : effective_min_count=2 leaves 55126 word corpus (94% of original 58152, drops 3026)
2020-05-01 10:55:42,795 : INFO : deleting the raw counts dictionary of 6981 items
2020-05-01 10:55:42,796 : INFO : sample=0.001 downsamples 46 most-common words
2020-05-01 10:55:42,796 : INFO : downsampling leaves estimated 42390 word corpus (76.9% of prior 55126)
2020-05-01 10:55:42,809 : INFO : estimated required memory for 3955 words and 50 dimensions: 3619500 bytes
2020-05-01 10:55:42,810 : INFO : res

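Essentially, the vocabulary is a dictionary (accessible via
``model.wv.vocab``\ ) of all of the unique words extracted from the training
corpus along with their counts (e.g., ``model.wv.vocab['penalty'].count`` gives
the count for the word ``penalty``\ ). This is the Gensim 3.x API used in this
notebook; Gensim 4 replaces it with ``model.wv.key_to_index`` and
``model.wv.get_vecattr('penalty', 'count')``.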




Next, train the model on the corpus.
If the BLAS library is being used, this should take no more than 3 seconds.
If the BLAS library is not being used, this should take no more than 2
minutes, so use BLAS if you value your time.
In the run logged below, Gensim's C extension was not loaded, and training
took about 508 seconds.




In [10]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

2020-05-01 10:55:48,528 : INFO : training model with 3 workers on 3955 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-05-01 10:55:55,018 : INFO : EPOCH 1 - PROGRESS: at 16.67% examples, 1093 words/s, in_qsize 5, out_qsize 0
2020-05-01 10:56:01,292 : INFO : EPOCH 1 - PROGRESS: at 66.67% examples, 2241 words/s, in_qsize 2, out_qsize 1
2020-05-01 10:56:01,293 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-05-01 10:56:01,668 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-05-01 10:56:01,699 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-05-01 10:56:01,700 : INFO : EPOCH - 1 : training on 58152 raw words (42508 effective words) took 13.2s, 3228 effective words/s
2020-05-01 10:56:07,931 : INFO : EPOCH 2 - PROGRESS: at 16.67% examples, 1128 words/s, in_qsize 5, out_qsize 0
2020-05-01 10:56:14,217 : INFO : EPOCH 2 - PROGRESS: at 66.67% examples, 2275 words/s, in_qsize 2, out_qsize 1
202

2020-05-01 10:58:43,267 : INFO : EPOCH 14 - PROGRESS: at 16.67% examples, 1153 words/s, in_qsize 5, out_qsize 0
2020-05-01 10:58:49,218 : INFO : EPOCH 14 - PROGRESS: at 66.67% examples, 2363 words/s, in_qsize 2, out_qsize 1
2020-05-01 10:58:49,219 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-05-01 10:58:49,609 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-05-01 10:58:49,617 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-05-01 10:58:49,618 : INFO : EPOCH - 14 : training on 58152 raw words (42392 effective words) took 12.5s, 3404 effective words/s
2020-05-01 10:58:55,707 : INFO : EPOCH 15 - PROGRESS: at 16.67% examples, 1154 words/s, in_qsize 5, out_qsize 0
2020-05-01 10:59:01,839 : INFO : EPOCH 15 - PROGRESS: at 66.67% examples, 2324 words/s, in_qsize 2, out_qsize 1
2020-05-01 10:59:01,840 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-05-01 10:59:02,075 : INFO : worker thread finished

2020-05-01 11:01:32,953 : INFO : EPOCH 27 - PROGRESS: at 66.67% examples, 2286 words/s, in_qsize 2, out_qsize 1
2020-05-01 11:01:32,954 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-05-01 11:01:32,990 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-05-01 11:01:33,010 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-05-01 11:01:33,011 : INFO : EPOCH - 27 : training on 58152 raw words (42311 effective words) took 12.5s, 3385 effective words/s
2020-05-01 11:01:39,361 : INFO : EPOCH 28 - PROGRESS: at 15.33% examples, 1152 words/s, in_qsize 5, out_qsize 0
2020-05-01 11:01:45,423 : INFO : EPOCH 28 - PROGRESS: at 66.67% examples, 2294 words/s, in_qsize 2, out_qsize 1
2020-05-01 11:01:45,424 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-05-01 11:01:45,544 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-05-01 11:01:45,572 : INFO : worker thread finished; awaiting finish of 

2020-05-01 11:04:16,213 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-05-01 11:04:16,527 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-05-01 11:04:16,547 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-05-01 11:04:16,547 : INFO : EPOCH - 40 : training on 58152 raw words (42379 effective words) took 12.6s, 3376 effective words/s
2020-05-01 11:04:16,548 : INFO : training on a 2326080 raw words (1695699 effective words) took 508.0s, 3338 effective words/s


Now, we can use the trained model to infer a vector for any piece of text
by passing a list of words to the ``model.infer_vector`` function. This
vector can then be compared with other vectors via cosine similarity.




In [11]:
vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)

[ 0.09087306  0.32701305  0.06860156 -0.05938696  0.11786573 -0.07121419
 -0.02741257  0.1069333   0.13580915 -0.23466207 -0.01952953 -0.02204405
 -0.02715763  0.05652005  0.34927163  0.23526803 -0.157106    0.01667246
  0.22597577  0.07169604  0.1451198  -0.07471588  0.22620082  0.05322214
  0.10549111 -0.05575021  0.0084979   0.04471283 -0.04137658 -0.03507899
  0.09665898  0.03951665  0.08097854  0.04166994 -0.24736458 -0.08139656
 -0.09626713 -0.25195324 -0.25258428 -0.14652404 -0.06479393  0.18179579
 -0.06570094  0.02638765  0.05731441  0.00684524 -0.12158128 -0.01456059
  0.25988027 -0.1881302 ]


Note that ``infer_vector()`` does *not* take a string, but rather a list of
string tokens, which should have already been tokenized the same way as the
``words`` property of the original training document objects.

Also note that because the underlying training/inference algorithms are an
iterative approximation problem that makes use of internal randomization,
repeated inferences of the same text will return slightly different vectors.
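Cosine similarity, the comparison mentioned above, can be written out in plain Python. This is a minimal sketch for short lists of floats; the function name is illustrative, and returning ``0.0`` for a zero-length vector is a convention chosen here, not something the tutorial specifies.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the measure depends only on direction, two slightly different inferred vectors for the same text should still score close to 1 against each other.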




Assessing the Model
-------------------

To assess our new model, we'll first infer new vectors for each document of
the training corpus, compare the inferred vectors with the training corpus,
and then return the rank of each document based on self-similarity.
Basically, we're pretending that the training corpus is some new unseen data
and then seeing how it compares with the trained model. The expectation is
that we've likely overfit our model (i.e., all of the ranks will be less than
2) and so we should be able to find similar documents very easily.
Additionally, we'll keep track of the second-ranked document for each
inference, for a comparison with less similar documents.




In [12]:
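ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    # Re-infer a vector for the training document as if it were unseen
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # Rank every trained doc-vector by similarity to the inferred vector
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    # Position of the document's own tag in that ranking (0 = most similar)
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    # Keep the (tag, similarity) pair that ranked second
    second_ranks.append(sims[1])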

2020-05-01 11:04:17,961 : INFO : precomputing L2-norms of doc weight vectors


Let's count how each document ranks with respect to the training corpus.

NB. Results vary between runs due to random seeding and the very small corpus.



In [13]:
import collections

counter = collections.Counter(ranks)
print(counter)

Counter({0: 292, 1: 8})


Basically, greater than 95% of the inferred documents are found to be most
similar to themselves, and less than 5% of the time a document is mistakenly
most similar to another document (in the run above, 292 of 300 documents rank
themselves first). Checking the inferred-vector against a
training-vector is a sort of 'sanity check' as to whether the model is
behaving in a usefully consistent manner, though not a real 'accuracy' value.
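As a quick check, the proportions can be recomputed from the ``Counter`` output printed above. The counts are copied from that output; the variable names are illustrative.

```python
from collections import Counter

# Rank counts copied from the run shown above
counter = Counter({0: 292, 1: 8})

total = sum(counter.values())
self_first_share = counter[0] / total   # documents most similar to themselves
other_first_share = counter[1] / total  # documents that ranked second to another

print("{:.1%} ranked first, {:.1%} ranked second".format(self_first_share, other_first_share))
```

Since results vary between runs, these exact shares apply only to the run recorded in this notebook.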

This is great and not entirely surprising. We can take a look at an example:




In [14]:
# doc_id and sims still hold the values from the last iteration of the loop above
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not v

Notice above that the most similar document (usually the same text) has a
similarity score approaching 1.0. However, the similarity score for the
second-ranked documents should be significantly lower (assuming the documents
are in fact different), and the reason becomes obvious when we examine the
text itself.

We can run the next cell repeatedly to see a sampling of other target-document
comparisons.




In [15]:
# Pick a random document from the corpus and infer a vector from the model
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (47): «australia will be aiming to take early wickets on day two of the second cricket test against south africa at the mcg the proteas will resume at three for after day one was badly affected by rain with only overs possible australian paceman glenn mcgrath who has two wickets says the catch taken by matthew hayden yesterday is typical of australia outstanding slips fielding this summer in the series so far there been some great catches ricky ponting in the last test occasionally get one myself he said it gives you so much more confidence when you know per cent of the catches that go flying to the slips or through the slips are going to be taken»

Similar Document (139, 0.8667811751365662): «australia will be looking to score quickly today to set south africa challenging victory target on day four of the first cricket test in adelaide the australians will resume their second innings at for an overall lead of south africa was dismissed late yesterday for with shane warn

Testing the Model
-----------------

Using the same approach as above, we'll infer the vector for a randomly chosen
test document, and compare the document to our model by eye.




In [16]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Test Document (26): «how did allegedly unregistered missile warheads come to be stored on canadian businessman anti terrorism training facility in new mexico and canadian officials are still trying to figure that out but one security expert says the mystery is chilling one david hudak was arrested in the united states more than week ago when according to court documents agents searching his property found the warheads stored in crates that were marked charge demolition»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (41, 0.6550991535186768): «the man accused to trying to blow up an american airlines flight on sunday could not have acted alone according to british islamic leader who knew richard reid well abdul hak baker is the head of the brixton mosque in south london where year old mr reid had worshipped mr baker says mr reid is petty criminal who had converted to islam while in jail he says mr reid had become more and more militant in his outlook aft

Conclusion
----------

Let's review what we've seen in this tutorial:

0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec
1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)
2. Train a Doc2Vec `core_concepts_model` model using the training corpus
3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`
4. Assess the model
5. Test the model on the test corpus

That's it! Doc2Vec is a great way to explore relationships between documents.

Additional Resources
--------------------

If you'd like to know more about the subject matter of this tutorial, check out the links below.

* `Word2Vec Paper <https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf>`_
* `Doc2Vec Paper <https://cs.stanford.edu/~quocle/paragraph_vector.pdf>`_
* `Dr. Michael D. Lee's Website <http://faculty.sites.uci.edu/mdlee>`_
* `Lee Corpus <http://faculty.sites.uci.edu/mdlee/similarity-data/>`__
* `IMDB Doc2Vec Tutorial <doc2vec-IMDB.ipynb>`_


