# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

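# Load the spam dataset
messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
# Drop the three unnamed extra columns and give the remaining two readable names
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
# Lowercase and tokenize each message into a list of words
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)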

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [3]:
# Create tagged document objects to prepare to train the model;
# each message gets a unique integer tag (its position in X_train)
tagged_docs = [gensim.models.doc2vec.TaggedDocument(words, [i]) for i, words in enumerate(X_train)]

In [4]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['how', 'are', 'you', 'doing', 'hope', 'you', 've', 'settled', 'in', 'for', 'the', 'new', 'school', 'year', 'just', 'wishin', 'you', 'gr', 'day'], tags=[0])
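
Because `train_test_split` shuffles the rows by default, the integer tags produced by `enumerate` are positions within `X_train`, not the original row labels from `messages`. A small sketch of that pattern follows. It uses a `namedtuple` as a stand-in for gensim's `TaggedDocument`, and the tokens, row labels, and the `tag_to_row` lookup are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument
# (assumption: used here only so the sketch runs without gensim installed)
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Made-up shuffled training rows: (original row label, tokens)
shuffled_rows = [
    (4021, ["free", "entry", "win"]),
    (17, ["see", "you", "at", "lunch"]),
    (988, ["call", "me", "later"]),
]

# enumerate() assigns tags 0, 1, 2 by position, ignoring the original labels
tagged = [TaggedDocument(words, [i]) for i, (_, words) in enumerate(shuffled_rows)]

# Hypothetical lookup for mapping a tag back to its original row label
tag_to_row = {i: label for i, (label, _) in enumerate(shuffled_rows)}

print(tagged[1])
print(tag_to_row[1])
```

Keeping a lookup like this is one option if you later need to trace a learned document vector back to the message it came from.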

In [5]:
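# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,  # dimensionality of each document vector
                                  window=5,         # max distance between the current and predicted word
                                  min_count=2)      # ignore words that appear fewer than 2 times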

In [7]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).
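
The error above comes from passing a raw string where a list of tokens is expected. One way to guard against that is a small wrapper that normalizes the input first. This is a minimal sketch: `as_token_list` is a hypothetical helper (not part of gensim), and the lowercase whitespace split is only a stand-in for `gensim.utils.simple_preprocess`.

```python
def as_token_list(doc):
    """Return doc as a list of string tokens (hypothetical helper, not a gensim API)."""
    if isinstance(doc, str):
        # Assumption: lowercase + whitespace split as a stand-in tokenizer;
        # in the notebook you would call gensim.utils.simple_preprocess(doc) instead
        return doc.lower().split()
    # Already an iterable of tokens: copy it into a plain list of strings
    return [str(token) for token in doc]

tokens = as_token_list("I am learning NLP")
print(tokens)
```

The result could then be passed along as `d2v_model.infer_vector(as_token_list(some_text))`.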

In [8]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i','am','learning','nlp'])

array([ 2.35179148e-04,  5.68072079e-03, -1.40326889e-02,  2.25870148e-03,
       -1.11643749e-03,  2.27967999e-03, -7.52759818e-03,  1.20253535e-02,
        1.06882341e-02,  6.18451228e-03, -7.74371496e-04, -9.97726712e-03,
       -1.13244611e-03, -1.99307455e-03, -5.24999062e-03, -3.42936534e-03,
        2.67923088e-03,  1.31541658e-02,  2.05536094e-03,  7.78192654e-03,
       -3.01792403e-03,  3.32075125e-03,  1.77785046e-02,  3.70438583e-03,
        3.37608648e-03,  1.33636314e-02, -8.65585287e-04,  1.56611507e-03,
       -4.01868951e-03, -3.32553405e-03, -4.06278111e-03, -1.02125555e-02,
       -5.48871886e-03,  1.31693471e-03, -1.00226153e-03, -1.13672540e-02,
       -5.21322619e-03, -9.96119436e-03, -3.77047062e-03,  4.26980620e-03,
       -1.47341052e-02,  5.59096131e-03, -4.20746207e-03,  8.55168514e-03,
        6.57766731e-03, -1.29871080e-02, -1.15023414e-02, -5.45739895e-03,
        5.53386798e-03, -5.83495898e-03,  3.01053631e-03,  5.80336945e-03,
       -9.63474158e-04, ...])
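
On its own, an inferred vector is hard to interpret; it becomes useful when compared with other vectors. As a rough illustration, here is a sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the short example vectors are made up for illustration; real inputs would be the arrays returned by `infer_vector`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional stand-ins for inferred document vectors
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: no shared direction
```

Values near 1 suggest two documents point in a similar direction in the embedding space, while values near 0 suggest little relationship.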

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no API for loading them that is as convenient as the one available for `word2vec`, so using them takes more time.

Pre-trained vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!