# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
%pip install gensim
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

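# Load the spam data set and keep only the label and text columns
messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
# Lowercase and tokenize each message; simple_preprocess also drops very short and very long tokens
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)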

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)



In [2]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [3]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['congrats', 'kano', 'whr', 'the', 'treat', 'maga'], tags=[0])

In [4]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=2)

In [6]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector(['text'])

array([-0.00695794,  0.01102642,  0.00791691, -0.00685867,  0.0100231 ,
       -0.02416742,  0.0040943 ,  0.04822392, -0.00828213, -0.01323164,
       -0.01133444, -0.01881858,  0.00289192,  0.01361318,  0.00309901,
       -0.02563928,  0.0105012 , -0.01153763,  0.00543096, -0.03275829,
        0.00653578,  0.00784433,  0.00533204, -0.00710077,  0.00025901,
       -0.00126091, -0.011543  , -0.01961317, -0.0122204 , -0.00835116,
        0.00681895, -0.00335724,  0.0241813 , -0.01539448, -0.00201652,
        0.01577563, -0.00092291, -0.02181119, -0.01173105, -0.02877375,
        0.00056839, -0.0175397 , -0.01076417, -0.00291008,  0.0075044 ,
       -0.01330562, -0.00761019, -0.00846967,  0.00760719,  0.01426211,
        0.00764756, -0.01605777,  0.01021032,  0.00564735, -0.00969031,
        0.01201043,  0.01143895, -0.00052674, -0.0263099 ,  0.01017373,
        0.00262663,  0.00763592, -0.00666569, -0.00589905, -0.02075949,
        0.0166045 ,  0.00982166,  0.00100728, -0.02785631,  0.01

In [7]:
# What happens if we pass in a list of words?
# Tokens missing from the training vocabulary (such as the capitalized 'I') are ignored
d2v_model.infer_vector(['I', 'am', 'learning', 'nlp'])

array([-1.2690783e-02,  2.6742611e-03,  8.6970469e-03, -4.3624560e-03,
        2.5087146e-03, -2.6862273e-02,  2.3073100e-03,  3.8417671e-02,
       -6.5484657e-03, -8.1529645e-03, -9.2960894e-03, -1.7031113e-02,
       -4.5445045e-03,  1.7968586e-02, -3.1333370e-03, -1.3602014e-02,
        3.0189406e-03, -1.6097143e-02,  1.8837925e-03, -2.5524555e-02,
        5.9417533e-03,  8.6924024e-03,  8.8350167e-03, -3.1726102e-03,
       -5.6489655e-03, -4.2956774e-03, -1.4344425e-02, -1.8299723e-02,
       -1.4978578e-02, -5.0987299e-03,  7.5405356e-03, -4.2764884e-03,
        1.3847619e-02, -1.3941579e-02, -1.8402614e-03,  1.3327149e-02,
        1.0015505e-03, -1.4537542e-02, -2.7704507e-03, -1.6631756e-02,
       -6.9107865e-03, -1.7751167e-02, -8.1142625e-03, -2.0929093e-03,
        1.3109328e-02, -6.9115814e-03, -8.5080881e-03, -7.1837194e-04,
        1.2882298e-02,  1.8438496e-02,  4.1275490e-03, -6.7977924e-03,
        9.6025877e-03, -5.3389464e-03, -7.6418025e-03,  1.3043826e-02,
      

### What About Pre-trained Document Vectors?

There are fewer pre-trained options for document vectors than there are for word vectors. There is also no simple API for loading them, as there is for `word2vec`, so using them takes more work.

Pre-trained models trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!