### What is doc2vec?

doc2vec is a shallow, two-layer neural network that accepts a text corpus as input and returns a set of vectors (also known as embeddings); each vector is a numeric representation of a given sentence, paragraph, or document.

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

# Load the spam dataset, drop the three unused columns, and rename the rest
messages = pd.read_csv('spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]

# Tokenize: lowercase each message and split it into a list of word tokens
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

# Hold out 20% of the messages as the test set
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [2]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [3]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['that', 'the', 'trouble', 'with', 'classes', 'that', 'go', 'well', 'you', 're', 'due', 'dodgey', 'one', 'û_', 'expecting', 'mine', 'tomo', 'see', 'you', 'for', 'recovery', 'same', 'time', 'same', 'place'], tags=[0])
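
To make the structure of the output above easier to see, here is a minimal sketch on a toy corpus. The `TaggedDocument` defined here is a `namedtuple` stand-in (an assumption so the example runs without gensim), and `toy_docs` is made-up data, not part of the spam dataset.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument, assumed here so the
# example runs without gensim; it has the same two fields: words and tags.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Hypothetical toy corpus: each document is already a list of tokens
toy_docs = [["free", "prize", "call", "now"],
            ["see", "you", "at", "class"]]

# Tag each document with its position in the list, as in the cell above
toy_tagged = [TaggedDocument(words, [i]) for i, words in enumerate(toy_docs)]

for doc in toy_tagged:
    print(doc.tags, doc.words)
```

Because `enumerate` starts counting at 0, the tags refer to positions within `X_train`, not to the original row labels in `messages`.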

In [4]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,
                                  window=5,
                                  min_count=2)
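
In the call above, `vector_size=100` sets the length of each embedding, `window=5` sets how many neighboring words are considered as context, and `min_count=2` drops words that appear fewer than two times in the corpus. The sketch below illustrates only the `min_count` idea on a made-up corpus; it is a simplified picture using `collections.Counter`, not gensim's actual vocabulary-building code.

```python
from collections import Counter

# Hypothetical toy corpus of tokenized documents
toy_corpus = [["free", "prize", "call", "now"],
              ["call", "me", "later"],
              ["free", "call"]]

# Count how often each word occurs across the whole corpus
word_counts = Counter(word for doc in toy_corpus for word in doc)

# Keep only the words that occur at least min_count times
min_count = 2
kept_vocab = {word for word, count in word_counts.items() if count >= min_count}

print(sorted(kept_vocab))
```

With a small corpus like the spam dataset, a low `min_count` keeps more of the vocabulary, at the cost of learning from words that have very little context.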

In [5]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).
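
The error above arises because a document is expected to be a sequence of word tokens, and a bare string would be read one character at a time. The sketch below shows the difference. `simple_tokenize` is a hypothetical helper that only roughly approximates `gensim.utils.simple_preprocess` (ASCII letters only), so that the example runs without gensim.

```python
import re

def simple_tokenize(text):
    """Hypothetical stand-in for gensim.utils.simple_preprocess:
    lowercase the text and keep alphabetic tokens of 2 to 15 characters."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower())
            if 2 <= len(tok) <= 15]

# Iterating over a string yields characters, not words
chars = list("text")
print(chars)

# Tokenizing first yields a list of words
tokens = simple_tokenize("I am learning NLP")
print(tokens)

# With the trained model from above, the token list could then be passed on:
# d2v_model.infer_vector(tokens)
```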

In [6]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([-4.45364043e-03, -8.95637274e-03,  6.86274283e-03,  4.30390006e-03,
       -8.66194139e-04,  3.33436322e-03,  3.89745436e-03,  7.93918967e-03,
        1.88355753e-03, -7.71176349e-03, -3.93946283e-03, -9.21483524e-03,
       -4.54162888e-04, -7.88476970e-03, -3.14446253e-04, -1.87810289e-03,
        3.92775238e-03,  4.25519011e-06, -9.20689479e-03,  2.53066607e-03,
        3.29231168e-03,  1.33812753e-02,  9.49568022e-03,  8.65204493e-04,
        1.69633306e-03, -5.92216011e-03,  3.91225889e-03, -2.77321343e-03,
        2.71082856e-03,  2.12223874e-03,  6.47228211e-03,  5.48303872e-03,
       -3.83526291e-04, -4.23274888e-03, -2.08270882e-04, -3.35586141e-03,
        2.45539611e-03, -4.83063795e-03,  1.05192950e-02, -2.46394286e-03,
        1.29672873e-03,  3.33187636e-03, -4.36946284e-03,  7.58815077e-05,
        4.90884623e-03,  4.65069665e-04, -6.34506159e-03, -1.41423650e-03,
        4.23741620e-03, -2.07102601e-03,  2.21205456e-03,  8.90070107e-04,
        6.90404838e-03, -
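
A vector like the one above is hard to interpret on its own; it becomes useful when compared with other vectors or fed to a model. One common way to compare two embeddings is cosine similarity. The sketch below is a plain-Python version on made-up 3-dimensional vectors (stand-ins for the 100-dimensional vectors the model returns), offered as an illustration rather than as part of the original notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors standing in for inferred embeddings
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # points in the same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.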

### What About Pre-trained Document Vectors?

There are fewer pre-trained options for document vectors than for word vectors. There is also no easy API for reading them in, as there is for `word2vec`, so using them is more time-consuming.

Pre-trained vectors trained on Wikipedia and Associated Press News can be found [here](https://github.com/jhlau/doc2vec).

In [7]:
# How to prepare vectors to be used in a machine learning model?
vectors = [[d2v_model.infer_vector(words)] for words in X_test]

In [8]:
vectors[0]

[array([-0.00135667, -0.02554042,  0.01385472,  0.02426753, -0.00610694,
         0.02937614,  0.03307242,  0.03144768, -0.00735158, -0.05141122,
         0.00263753, -0.02761181, -0.01105818, -0.01329363, -0.01397014,
        -0.00281641,  0.0061922 , -0.01606524, -0.02255584,  0.01138902,
         0.01251789,  0.03696594,  0.02281269, -0.00983004,  0.00728614,
        -0.02237232,  0.02067966, -0.02982088,  0.0136727 ,  0.02267332,
         0.01099983,  0.00935793, -0.00489108,  0.00248693, -0.00575307,
        -0.0120107 ,  0.03164999, -0.00526734,  0.02659447, -0.01956563,
         0.00531843, -0.00731028, -0.00303448, -0.02078244,  0.02685647,
        -0.00350828, -0.01962113, -0.00889408,  0.01893955,  0.01236731,
         0.01039139,  0.01018578,  0.00727288,  0.01036821, -0.00659475,
         0.00876178, -0.00996866,  0.01524262,  0.02326941, -0.01306667,
         0.01023581,  0.01141928, -0.04556716, -0.00565641, -0.01788141,
         0.00940144,  0.00161431, -0.00955609, -0.0
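
As the output above shows, each inferred vector is wrapped in a one-element list. Most scikit-learn estimators expect a 2-D input of shape `(n_samples, n_features)`, so the extra level usually needs to be removed first. The sketch below shows that step with plain lists; `toy_vectors` is made-up data with 3 values per vector instead of 100.

```python
# Hypothetical stand-in for the output of the cell above: each inferred
# vector (shortened to 3 values here) sits inside a one-element list
toy_vectors = [[[0.1, -0.2, 0.3]],
               [[0.0, 0.5, -0.1]]]

# Unwrap the inner list so that each row is one document's vector
features = [wrapped[0] for wrapped in toy_vectors]

n_samples = len(features)
n_features = len(features[0])
print(n_samples, n_features)
```

With the real NumPy arrays, something like `np.vstack([v[0] for v in vectors])` should give a single 2-D array with one row per test document.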