# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'], messages['label'], test_size=0.2)

In [2]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [3]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['have', 'to', 'take', 'exam', 'with', 'in', 'march'], tags=[0])

In [4]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=2)

In [6]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [7]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([ 0.00876464, -0.00437824,  0.00474751, -0.00528096,  0.00604473,
       -0.00618271, -0.00830345,  0.00305331, -0.00693561, -0.00508801,
       -0.00052642, -0.00786128, -0.00050121, -0.01623155,  0.00557657,
       -0.00762649,  0.00186532, -0.00534158,  0.00471001,  0.00457831,
        0.02211423, -0.00712979,  0.00836985, -0.00552413,  0.01001146,
        0.01283312, -0.00803131,  0.0149781 , -0.00441569,  0.00274851,
       -0.008547  ,  0.0043332 , -0.01139454, -0.01073827,  0.00436417,
       -0.00235552, -0.00778084, -0.00071293, -0.00617901,  0.00108291,
        0.00118454, -0.01095434, -0.00828469, -0.00738097, -0.00910355,
       -0.00736761, -0.01276081, -0.00211034, -0.00222148, -0.00268137,
       -0.00129238,  0.00578987, -0.00879106, -0.00453289,  0.00186002,
        0.00581456, -0.00211341,  0.01043526,  0.00448215, -0.00342329,
        0.00210592, -0.02059046,  0.00203683,  0.00721813, -0.00401543,
       -0.00326609, -0.00792226,  0.00362662,  0.00558912, -0.00

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no convenient API for loading them, as there is for `word2vec`, so using them takes more work.

Pre-trained models trained on Wikipedia and Associated Press News can be found in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!