# doc2vec: How To Implement doc2vec

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Train Our Own Model

In [2]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/LinkedIn Learning/03_Advanced NLP with Python for Machine Learning/Ex_Files_Adv_NLP_Python_ML/Exercise Files/data/spam.csv', encoding='latin-1')
messages = messages.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [3]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [4]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['motivate', 'behind', 'every', 'darkness', 'there', 'is', 'shining', 'light', 'waiting', 'for', 'you', 'to', 'find', 'it', 'behind', 'every', 'best', 'friend', 'there', 'is', 'always', 'trust', 'and', 'love', 'bslvyl'], tags=[0])

In [5]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,
                                  window=5,
                                  min_count=2)

In [6]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

# This raises a TypeError: infer_vector() expects a list of string tokens, not a single string

TypeError: ignored

In [7]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([-5.9985947e-03,  6.3496311e-03,  5.3363563e-03, -7.9971319e-03,
        6.6965171e-05, -3.6055163e-02,  2.0777979e-03,  3.1047551e-02,
       -2.3622267e-02, -2.0700747e-02, -1.6930789e-02, -2.7685340e-02,
       -2.5317762e-03,  8.0003068e-03,  1.1843971e-03, -1.8999359e-02,
        2.1791125e-03, -1.6204547e-02, -3.6400149e-03, -2.7803198e-02,
        1.2003009e-02,  1.5870104e-02,  1.6533514e-02, -1.1403088e-02,
       -8.8226134e-03, -2.5914914e-03, -1.4095995e-02, -1.0216591e-02,
       -1.3129865e-02, -4.5326278e-03,  2.3207583e-02,  8.1692729e-03,
        1.1349378e-02, -1.4395045e-02, -5.2026706e-03,  2.8392898e-02,
       -1.7934184e-03, -1.6818730e-02, -1.7112000e-02, -3.1318039e-02,
        9.8277663e-04, -7.9029398e-03, -1.4490120e-03, -1.0342345e-02,
        1.5781837e-02, -7.7420222e-03, -8.3679408e-03, -6.4208987e-03,
        1.8172039e-02,  2.9923598e-04,  1.2592814e-02, -2.1875773e-02,
       -1.2092551e-02, -8.7217782e-03, -8.4299129e-03,  9.8230559e-03,
      

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no convenient API for loading them, as there is for `word2vec`, so using them takes more work.

Pre-trained vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!