# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [6]:
# Wrap each tokenized training message in a TaggedDocument, using its position as a unique tag
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [7]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['sorry', 'man', 'accidentally', 'left', 'my', 'phone', 'on', 'silent', 'last', 'night', 'and', 'didn', 'check', 'it', 'til', 'got', 'up'], tags=[0])

In [8]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=2)

In [9]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).
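
The `TypeError` above guards against an easy mistake: a Python string is itself a sequence of characters, so code that expects a list of tokens would otherwise treat each letter as a separate word. The short sketch below uses plain Python to show the difference; `str.split()` is only a stand-in here for a real tokenizer such as `gensim.utils.simple_preprocess`.

```python
single_string = 'text'

# Iterating over a string yields individual characters, not words
as_characters = list(single_string)
print(as_characters)  # ['t', 'e', 'x', 't']

# Tokenizing first produces the list of words that infer_vector() expects
tokens = 'i am learning nlp'.split()
print(tokens)  # ['i', 'am', 'learning', 'nlp']
```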

In [10]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([-0.00700438,  0.01663531,  0.00863076,  0.00041389,  0.00027562,
       -0.0339589 ,  0.00563524,  0.03175854, -0.00817696, -0.02149219,
        0.00388519, -0.01745444, -0.00127102,  0.01249101,  0.01017968,
       -0.01696631,  0.01095164, -0.00938394, -0.00613327, -0.03153675,
        0.0126316 ,  0.001393  ,  0.01482108, -0.01540773, -0.00029658,
        0.00083479, -0.01247164, -0.0104255 , -0.02467247, -0.00172686,
        0.01675825,  0.0049846 ,  0.01089753, -0.01739047, -0.00826481,
        0.0249942 ,  0.00276795, -0.00509331, -0.00841393, -0.03133387,
       -0.01548341, -0.01489053, -0.01909465, -0.00909864,  0.01223203,
       -0.01579312, -0.00887041, -0.0075473 ,  0.01421885,  0.01681462,
        0.00403175, -0.01382973,  0.00432675, -0.00156719, -0.00427518,
        0.01039628,  0.00255965, -0.00146323, -0.01194454,  0.00480561,
        0.00700911,  0.00432272, -0.00290664, -0.00019616, -0.01866924,
        0.02826177,  0.00930049,  0.01766691, -0.01874159,  0.02

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no convenient API for loading them, as there is for `word2vec`, so using them takes more time.

Pre-trained vectors from models trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!

In [11]:
# How do we prepare these vectors to be used in an ML model?
vectors = [[d2v_model.infer_vector(words)] for words in X_test]
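
Each entry of `vectors` is a one-element list wrapping a NumPy array, whereas most scikit-learn style estimators expect a single 2-D array with one row per document. One possible way to get that shape is sketched below. The random arrays are stand-ins for real `infer_vector` output, and the names `nested_vectors` and `feature_matrix` are assumptions made for this illustration.

```python
import numpy as np

# Stand-in for the nested list built above: each entry is a one-element list
# holding a 100-dimensional vector (random values, not real doc2vec output)
rng = np.random.default_rng(0)
nested_vectors = [[rng.normal(size=100)] for _ in range(3)]

# Pull the single array out of each inner list and stack the arrays row-wise
feature_matrix = np.vstack([inner[0] for inner in nested_vectors])
print(feature_matrix.shape)  # (3, 100)
```

The resulting array has one row per document and one column per vector dimension.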

In [12]:
vectors[0]

[array([-0.00554773,  0.00724458,  0.00329775, -0.00521936,  0.00226071,
        -0.01753865,  0.0076242 ,  0.02584808, -0.01059273, -0.00605462,
         0.00103585, -0.01811593,  0.00269339,  0.00043434,  0.00351522,
        -0.01278899,  0.0074543 , -0.01031785,  0.00543193, -0.02477286,
         0.0076733 ,  0.00635272,  0.00754381, -0.01372754,  0.00316142,
         0.00650532, -0.01064131, -0.00772442, -0.01489499, -0.00298955,
         0.0130528 ,  0.00680355,  0.0015731 , -0.00945888, -0.007642  ,
         0.01858883,  0.00739441, -0.00325762, -0.00644208, -0.0191325 ,
        -0.00514582, -0.00759772, -0.00316734, -0.00785367,  0.00089852,
        -0.01210361, -0.00723339, -0.00082462, -0.00062712,  0.00634365,
         0.00674049, -0.01309119,  0.00492129, -0.00735441, -0.00650074,
         0.00132795,  0.00785866,  0.00384196, -0.01147376,  0.01078517,
         0.00308634,  0.00184158, -0.00675979,  0.00201015, -0.01298821,
         0.01513827,  0.00274793,  0.00884538, -0.0