# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('data/spam.csv', encoding='latin-1')
messages = messages.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [2]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [3]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['hmv', 'bonus', 'special', 'pounds', 'of', 'genuine', 'hmv', 'vouchers', 'to', 'be', 'won', 'just', 'answer', 'easy', 'questions', 'play', 'now', 'send', 'hmv', 'to', 'more', 'info', 'www', 'percent', 'real', 'com'], tags=[0])

In [4]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                 vector_size=100,
                                 window=5,
                                 min_count=2)

In [5]:
# What happens if we pass in a single word as a bare string, like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [6]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([ 2.62265746e-03,  1.20117189e-02,  7.36849476e-03,  5.01720409e-04,
        1.98712619e-03, -3.16745006e-02, -6.94196438e-04,  3.60809229e-02,
       -2.64994577e-02, -1.38318026e-02, -5.91123942e-03, -2.21291073e-02,
       -1.51363749e-03,  9.78325028e-03,  5.24505088e-03, -1.42252063e-02,
        1.07265059e-02, -9.48298629e-03, -5.18206321e-03, -3.42651680e-02,
        1.66297331e-02,  1.65504292e-02,  1.13149295e-02, -1.71890203e-02,
        3.73426992e-05,  8.64912849e-03, -1.49079347e-02, -1.09734833e-02,
       -2.69840267e-02, -1.51310058e-03,  2.97942162e-02,  3.28300567e-03,
        1.40803615e-02, -1.10984296e-02, -3.64107359e-03,  2.61366535e-02,
        4.90047969e-04, -1.16027417e-02, -1.18197296e-02, -3.70565318e-02,
       -3.13327578e-03, -1.61756631e-02, -5.66129852e-03, -4.49734554e-03,
        1.09385131e-02, -8.96043237e-03, -1.20821409e-02, -1.35634001e-02,
       -2.38964958e-06,  1.36643797e-02,  1.32778045e-02, -2.09862944e-02,
       -1.13230746e-03, -

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no convenient API for loading them the way there is for `word2vec`, so using them takes more manual work.

Pre-trained doc2vec models trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec](https://github.com/jhlau/doc2vec) repository. Feel free to explore on your own!