# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [2]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [3]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['it', 'still', 'not', 'working', 'and', 'this', 'time', 'also', 'tried', 'adding', 'zeros', 'that', 'was', 'the', 'savings', 'the', 'checking', 'is', 'lt', 'gt'], tags=[0])

In [4]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,
                                  window=5,
                                  min_count=2)

In [5]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [6]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([-0.01457021,  0.01112458, -0.00198042, -0.00304092, -0.0082907 ,
       -0.02291431,  0.00725796,  0.0315044 , -0.02181566, -0.01737636,
       -0.00864291, -0.02826939,  0.01117655,  0.00959948,  0.00044348,
       -0.01862943,  0.00850201, -0.01652957, -0.00791296, -0.03504454,
        0.00959241,  0.00653022,  0.01015233, -0.00870823, -0.00586789,
        0.00524248, -0.00365885, -0.01936797, -0.01975201, -0.00596575,
        0.02106355,  0.00198226,  0.00549159, -0.0236636 , -0.00393954,
        0.01843485, -0.00178811, -0.00621671, -0.00786219, -0.0341428 ,
       -0.0062568 , -0.02063946, -0.00540194, -0.00846386,  0.01184228,
       -0.01150194, -0.01417251, -0.00488152,  0.00776502,  0.00834555,
        0.00033677, -0.0039801 ,  0.00295076, -0.00928845, -0.00988752,
        0.00218077,  0.00082026, -0.00601022, -0.02448654,  0.0004508 ,
        0.00087885, -0.00303179,  0.00833922, -0.00130846, -0.01393051,
        0.01984945,  0.00980385,  0.01387077, -0.02997819,  0.01

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no convenient API for loading them, as there is for `word2vec`, so using them takes more work.

Pre-trained models trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!