
# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [None]:
# Read in the data, clean it, and split it into train and test sets.
# As with word2vec, doc2vec gives us two options: use pre-trained vectors,
# or train vectors directly on our own data. Here we train our own.
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [None]:
# Create tagged document objects to prepare to train the model.
# Unlike word2vec, doc2vec requires each document to be wrapped in a TaggedDocument,
# which pairs the token list with a tag (here, the document's position in X_train).
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [None]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['how', 'it', 'feel', 'mr', 'your', 'not', 'my', 'real', 'valentine', 'just', 'my', 'yo', 'valentine', 'even', 'tho', 'hardly', 'play'], tags=[0])

In [None]:
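# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,  # dimensionality of each document vector
                                  window=5,         # max distance between the current and predicted word
                                  min_count=2)      # ignore words that appear fewer than 2 times in total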
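
To make the `min_count=2` setting concrete, the sketch below mimics the vocabulary filter with a plain `Counter`: words whose total count across the corpus falls below the threshold are dropped. The toy corpus and the `filter_vocab` helper are hypothetical, and this only illustrates the idea rather than reproducing gensim's internals.

```python
from collections import Counter


def filter_vocab(tokenized_docs, min_count):
    """Return the set of words whose total corpus frequency is >= min_count."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, n in counts.items() if n >= min_count}


# Hypothetical toy corpus, already tokenized (not the spam dataset)
toy_docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "entry", "call", "now"],
]

vocab = filter_vocab(toy_docs, min_count=2)
print(sorted(vocab))  # ['call', 'free', 'now']
```

On a small corpus of short messages, a higher `min_count` can remove a large share of the vocabulary, so it may be worth checking how many words survive before settling on a value.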

In [None]:
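# What happens if we pass in a single word like we did for word2vec?
# infer_vector expects a list of tokens, not a string. The gensim version used here
# iterates over the bare string character by character ('t', 'e', 'x', 't'), and single
# characters are not in our vocabulary, so the result carries no learned information.
# Recent gensim releases raise a TypeError for a bare string instead.
d2v_model.infer_vector('text')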

array([-2.2326701e-03, -3.5634146e-03, -1.7704996e-03, -4.0488616e-03,
        7.2301237e-04,  3.0870789e-03, -4.7173253e-03, -1.2697446e-03,
        2.3805794e-04,  2.9918496e-03, -4.1260035e-03, -1.2737868e-03,
        1.9132482e-03, -1.8230529e-03, -4.5664571e-03,  4.2173150e-03,
       -4.3275948e-03, -1.4867083e-03, -3.6390380e-03, -4.6032988e-03,
       -4.0002489e-03,  4.0340470e-03, -4.8187000e-04, -2.1900644e-03,
       -6.3356146e-04, -3.9652879e-03,  3.4744418e-03, -3.6678663e-03,
       -3.5300381e-03, -1.6069327e-03,  2.5174061e-03, -3.4876138e-03,
       -2.7389801e-03,  2.9286032e-03, -1.7116150e-03, -4.4756155e-03,
       -4.7625755e-03,  3.1161190e-03, -2.5071442e-04,  2.6763375e-05,
        2.0529262e-03,  2.3196908e-03, -3.6168660e-03,  3.4722791e-03,
        3.3206067e-03, -8.8848046e-04, -5.1860715e-04,  3.5495819e-03,
       -5.1671127e-04, -3.4644147e-03, -1.7398024e-03,  2.7742311e-03,
        1.4327039e-03,  1.3514541e-03, -1.5428691e-04,  3.9644032e-03,
      

In [None]:
# A multi-word string has the same problem: it is not tokenized for us
d2v_model.infer_vector("I am happy")

array([ 0.00476229, -0.00410813, -0.00432742,  0.00221697,  0.00258297,
       -0.00457672,  0.0028555 , -0.00043436, -0.0046322 ,  0.00405697,
        0.00408254, -0.0043    , -0.0026018 , -0.00051803,  0.00014161,
        0.00069059,  0.00447023, -0.00123623, -0.00263513,  0.00359927,
       -0.00078081,  0.0044362 ,  0.00176397,  0.00227264, -0.00032037,
       -0.00063183, -0.0047198 , -0.00299461, -0.00269   ,  0.00279792,
       -0.00275908, -0.00499277,  0.00331413, -0.00038839,  0.00189426,
        0.0008245 , -0.00250563, -0.00328209,  0.00243776, -0.0041229 ,
       -0.00481323,  0.00219378, -0.00455507, -0.00220185, -0.00343017,
        0.00138382,  0.00288867, -0.00469271,  0.00369675,  0.00301789,
       -0.00311218, -0.0007797 ,  0.00303383, -0.00247174,  0.00053889,
       -0.00013933,  0.00112811,  0.00414079,  0.00088544, -0.00302888,
        0.00253753,  0.00177831, -0.00036581, -0.00287338, -0.0022585 ,
       -0.00230084, -0.00144759,  0.00497996,  0.00185216,  0.00
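
The two calls above pass a bare string where a list of tokens is expected. The sketch below shows, in plain Python, why that matters: iterating over a string yields single characters, not words. The `as_tokens` helper is hypothetical and uses a simple regular expression as a stand-in tokenizer; in this notebook, `gensim.utils.simple_preprocess` plays that role (and, unlike this stand-in, also drops one-character tokens by default).

```python
import re


def as_tokens(text_or_tokens):
    """Return a list of lowercase word tokens, tokenizing only if given a string."""
    if isinstance(text_or_tokens, str):
        # Stand-in tokenizer: lowercase, then keep runs of letters
        return re.findall(r"[a-z]+", text_or_tokens.lower())
    return list(text_or_tokens)


# A string is a sequence of characters, so code that loops over "words" sees letters
print(list("text"))                     # ['t', 'e', 'x', 't']

# Tokenizing first gives the word list that inference expects
print(as_tokens("I am happy"))          # ['i', 'am', 'happy']
print(as_tokens(["i", "am", "happy"]))  # ['i', 'am', 'happy']
```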

In [None]:
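d2v_model.infer_vector(['I','am', 'happy'])
# Passing a list of tokens returns a vector of length 100.
# Although no message like "I am happy" appears in our training set, the model can still
# infer a vector from what it learned during training, even though it never saw this exact
# set of words together.
# Note: tokens should match the training preprocessing. simple_preprocess lowercases text and
# drops one-character tokens, so 'I' is not in the vocabulary.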

array([ 1.38239644e-03, -7.50522176e-03, -6.64790813e-03, -2.71562494e-05,
       -8.55291635e-03, -7.30679277e-03, -8.24065227e-03, -7.74651719e-03,
       -2.01971410e-03, -2.83822231e-03,  7.45578017e-03,  3.87547282e-03,
        2.46211793e-03,  1.19846119e-02, -1.35928765e-02, -2.12312210e-03,
       -1.81488227e-02, -6.16508815e-03, -2.31090205e-04,  1.26408590e-02,
        7.64404889e-04,  5.19988127e-03, -4.09272645e-04,  7.50457449e-03,
        5.10838069e-03, -1.02284066e-02,  4.98170790e-04,  1.04547217e-02,
        1.21220527e-02,  1.22777363e-02,  9.82968230e-03,  3.42295435e-03,
       -3.97764938e-03,  1.14126466e-02, -3.91635066e-03,  1.03745488e-02,
        2.84817535e-03,  1.13663878e-02,  1.88347779e-03,  1.40928961e-02,
       -1.23195406e-02, -5.46222320e-03,  1.78946613e-03,  8.81224219e-03,
       -6.39777025e-03,  4.18703072e-03, -3.81324673e-03,  3.30180349e-03,
        1.22977244e-02, -2.65206414e-04, -7.02176988e-03,  8.85208417e-03,
        1.98847242e-03,  

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no simple API for loading them, as there is for `word2vec`, so using them takes more work.

Pre-trained vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!