# Doc2Vec: How to Implement It

Instead of creating a vector for each word, this technique creates a vector for each document or collection of text, whether that is a sentence or a paragraph. The goal is the same as with `word2vec`: to create a numeric representation of a set of texts that we can feed to a machine learning model to help it better capture the meaning.

Recall that `word2vec` is a shallow two-layer neural network that accepts a text corpus as input and returns a set of vectors, also known as embeddings. Each vector is a numeric representation of a given word. `doc2vec` is basically the same thing, but instead of returning a numeric vector for each word, it returns a numeric vector for each sentence or paragraph.

The real benefit of `doc2vec` is that it captures information about a sentence or paragraph, which is what we need, in a more sophisticated way than creating word vectors and then averaging them. With `word2vec`, we lose information when we average the word vectors together to create a sentence-level or text-level representation. `doc2vec` instead learns a vector for the whole document during training.
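To make the information loss from averaging concrete, here is a minimal sketch in plain Python. The two-dimensional "word vectors" are made-up values chosen for illustration, not output from a trained model, and `average_vector` is a hypothetical helper. It shows one kind of information that averaging throws away: word order.

```python
# Toy 2-dimensional "word vectors" (made-up values, not from a trained model)
toy_word_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}


def average_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have an entry in word_vectors."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return []
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]


sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]

avg_a = average_vector(sentence_a, toy_word_vectors)
avg_b = average_vector(sentence_b, toy_word_vectors)

# Same words in a different order produce the same averaged vector
print(avg_a == avg_b)  # True
```

The two sentences mean different things, yet their averaged representations are identical.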

### Train Our Own Model

In [5]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('data/spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], 
                                                    test_size=0.2)

messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |


Now, one of the differences between `word2vec` and `doc2vec` is that `doc2vec` requires you to create tagged documents. Each tagged document holds a list of words and a tag for that document.

We iterate through `X_train` with `enumerate`, which returns the index and the value for each text message in `X_train`. The index serves as the document's tag.

In [8]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['am', 'in', 'hospital', 'da', 'will', 'return', 'home', 'in', 'evening'], tags=[0])

In [9]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,      # Tagged documents for training
                                  vector_size=100,  # Dimensionality of the document vectors
                                  window=5,         # Maximum distance between the current and predicted word within a sentence
                                  min_count=2)      # Minimum number of occurrences of a word to be considered
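To get a feel for what `min_count=2` means, the sketch below imitates its effect in plain Python: words seen fewer than `min_count` times across the whole corpus are left out of the vocabulary. This is only an illustration of the idea, not gensim's implementation; `filter_rare_words` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_rare_words(tokenized_docs, min_count=2):
    """Keep only tokens whose corpus-wide frequency is at least min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    vocabulary = {token for token, n in counts.items() if n >= min_count}
    filtered = [[t for t in doc if t in vocabulary] for doc in tokenized_docs]
    return filtered, vocabulary


# Made-up tokenized messages for illustration
toy_docs = [
    ["free", "entry", "to", "win"],
    ["win", "free", "tickets"],
    ["ok", "see", "you", "later"],
]

filtered_docs, vocabulary = filter_rare_words(toy_docs, min_count=2)
print(sorted(vocabulary))  # ['free', 'win']
print(filtered_docs)       # [['free', 'win'], ['win', 'free'], []]
```

Note that the third toy message ends up empty, which is worth keeping in mind for short texts like SMS messages when choosing `min_count`.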

In [10]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector(['text'])

array([-0.01278291,  0.01385637,  0.00353282, -0.00238422,  0.00467299,
       -0.03675859,  0.00250603,  0.04687937, -0.02790658, -0.01388085,
       -0.03651403, -0.02648865, -0.00297569, -0.00889337,  0.01047053,
       -0.04358591, -0.0002554 , -0.0226766 , -0.00327277, -0.04131157,
        0.02080492,  0.00489327,  0.02491716,  0.00317137,  0.00695801,
        0.0137176 , -0.02586616,  0.00102074, -0.00716305,  0.00327264,
        0.02259056,  0.00783179,  0.00463704,  0.01325294,  0.00277061,
        0.03034823,  0.00244458, -0.01216242, -0.01260882, -0.02291754,
        0.00689542, -0.00736155, -0.01786701, -0.02653614,  0.01008196,
       -0.02338689, -0.02427665, -0.00778264,  0.00328216,  0.0140456 ,
        0.00458468, -0.00955786,  0.00651383,  0.01409424,  0.00259051,
        0.01095745, -0.00033736, -0.01037658, -0.01538919,  0.01542132,
       -0.00360482,  0.00156963, -0.00743088,  0.00562836, -0.03563394,
        0.02587644,  0.01428414,  0.00813987, -0.02170091,  0.02

In [11]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i','am','learning','nlp'])

array([-0.00195068,  0.00776231,  0.0048783 , -0.00510567, -0.0045562 ,
       -0.02515473,  0.00220668,  0.03511856, -0.01456716, -0.01292632,
       -0.016941  , -0.02226035, -0.00354946,  0.00380918,  0.00120318,
       -0.02651529,  0.00421664, -0.01306386, -0.01014609, -0.03112501,
        0.01729575,  0.00504408,  0.01124415, -0.0078456 , -0.00469393,
        0.00120825, -0.01017751, -0.00550199, -0.01298617,  0.00108419,
        0.01511465,  0.00391247,  0.01086077, -0.00812416, -0.00496949,
        0.02287463,  0.00999818, -0.01166083, -0.01518472, -0.0243851 ,
       -0.00221257, -0.01534109, -0.00935461, -0.00781237,  0.01012304,
       -0.00980569, -0.01169394, -0.00061955,  0.0102037 ,  0.01977467,
        0.00984916, -0.00073883,  0.00279152,  0.00264358, -0.00433003,
        0.00853682,  0.00952507, -0.00676516, -0.01401376,  0.0082838 ,
       -0.00380021,  0.00097366, -0.00064311,  0.00248695, -0.02615652,
        0.02211107,  0.00652581,  0.01287649, -0.0216535 ,  0.01
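Whether we pass one word or several, `infer_vector` returns a single vector with `vector_size` entries. One common way to compare two such vectors is cosine similarity. Below is a minimal sketch in plain Python; `cosine_similarity` is a hypothetical helper, and the short vectors are made-up stand-ins for real model output.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional vectors standing in for inferred document vectors
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

With the model trained above, the same helper could be applied to the outputs of two `d2v_model.infer_vector(...)` calls.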

### What About Pre-trained Document Vectors?

There are fewer pre-trained options for document vectors than there are for word vectors. There is also no easy API for reading them in, as there is for `word2vec`, so using them is more time-consuming.

Pre-trained vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!

# How To Prep Document Vectors For Modeling

In [None]:
# Read in data, clean it, split it into train/test, and then train a doc2vec model
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)


messages = pd.read_csv('data/spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))


X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], 
                                                    test_size=0.2)


# Create tagged document objects to prepare to train the model
tagged_docs_tr = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

d2v_model = gensim.models.Doc2Vec(tagged_docs_tr,
                                  vector_size=50,
                                  window=2,
                                  min_count=2)
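A plausible next step, sketched here as an assumption rather than as the original notebook's code, is to turn every message in the train and test sets into a vector so that a downstream classifier can consume them. The helper below is hypothetical; it only assumes that `model` exposes `infer_vector(list_of_tokens)`, as the `d2v_model` trained above does.

```python
def infer_document_vectors(model, tokenized_docs):
    """Return one inferred vector per tokenized document.

    `model` is assumed to expose `infer_vector(list_of_tokens)`;
    `tokenized_docs` is any iterable of token lists, such as X_train.
    """
    return [model.infer_vector(list(tokens)) for tokens in tokenized_docs]
```

With the objects defined above, this would be called as `train_vectors = infer_document_vectors(d2v_model, X_train)` and `test_vectors = infer_document_vectors(d2v_model, X_test)`. The test messages are inferred with the model trained on `X_train` only, so nothing from the test set influences training.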