## doc2vec: How to Implement doc2vec

### Train Our Own Model

In [1]:
# Read in the data, clean it, split it into train and test sets.
import gensim
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('data/spam.csv', encoding='latin-1').drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
messages.columns = ['label', 'text']

# Tokenize and lowercase each message (simple_preprocess also drops very short tokens)
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'], messages['label'], test_size=0.2)

In [3]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [4]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['dont', 'make', 'ne', 'plans', 'for', 'nxt', 'wknd', 'coz', 'she', 'wants', 'us', 'to', 'come', 'down', 'then', 'ok'], tags=[0])

In [5]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                 vector_size=100,
                                 window=5,
                                 min_count=2)

In [6]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [7]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(["I", "am", "learning", "nlp"])

array([-7.16968905e-04, -4.41293418e-03, -4.79495805e-03, -3.08861537e-03,
        1.51488204e-02, -2.31668795e-03, -3.79394367e-03, -2.18230463e-03,
       -4.92574787e-03,  3.93426605e-03, -4.06323047e-03,  9.50545131e-04,
       -3.20752594e-03,  1.67644786e-04, -4.03973646e-03,  4.43230802e-03,
       -5.48151927e-03,  3.05316644e-03,  5.72893827e-04,  5.48121287e-03,
       -7.36276293e-03,  1.63312914e-04,  3.76269408e-03,  1.45297346e-03,
       -2.65672267e-03, -9.96853632e-05,  2.35169264e-03, -1.34399179e-02,
       -1.28155947e-02,  6.66901469e-03, -5.75580355e-03, -1.34783620e-02,
       -3.76020907e-03, -1.43665560e-02,  8.56505800e-03, -4.93677508e-04,
        3.30523908e-04, -1.20870266e-02, -1.04676895e-02, -1.96060236e-03,
        1.51292747e-03, -1.26621649e-02, -1.99436047e-03, -1.96347404e-02,
       -1.69641012e-03, -7.86308292e-03,  3.50990682e-03,  1.76143888e-02,
        7.00404076e-03,  8.64590611e-03, -9.88931861e-03, -2.12387438e-03,
       -7.15942960e-03, ...

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no simple API for loading them, as there is for `word2vec`, so using them takes more work.

Pre-trained vectors from training on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!