## Doc2Vec

This notebook demonstrates how to train a Doc2Vec model on a small custom corpus.

In [1]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Choukrallah
[nltk_data]     Lachhab\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
data = ["dog bites man",
        "man bites dog",
        "dog eats meat",
        "man eats food"]

# Tokenize each sentence and tag it with its index, so the model learns one vector per document.
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(data)]

In [4]:
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food'], tags=['3'])]
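
The tagging step above can be sketched without gensim or NLTK to show the shape of the data. This is a minimal illustration, not the library's implementation: `TaggedDoc` and `tag_corpus` are hypothetical names introduced here, and plain whitespace splitting stands in for `word_tokenize` (an assumption that only holds for simple text like this corpus).

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the two fields (words, tags) shown in the output above.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tag_corpus(sentences):
    """Lowercase and whitespace-split each sentence, tagging it with its index as a string."""
    return [
        TaggedDoc(words=sentence.lower().split(), tags=[str(i)])
        for i, sentence in enumerate(sentences)
    ]


corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
tagged = tag_corpus(corpus)
for doc in tagged:
    print(doc)
```

Each document carries a list of tokens plus a list of tags; the tag is what the model associates a document vector with.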

In [5]:
# DBOW (distributed bag of words): selected with dm=0
model_dbow = Doc2Vec(tagged_data, vector_size=20, min_count=1, epochs=2, dm=0)

In [11]:
print(model_dbow.infer_vector(['man', 'eats', 'food']))  # inferred document vector for "man eats food"

[-0.00497926 -0.00747131  0.00269496  0.0024915   0.00251809 -0.02208188
 -0.0091675  -0.02124474  0.02117798  0.00771315  0.02289761 -0.00328511
  0.01053256 -0.00290302  0.02079763 -0.01540546 -0.01744637 -0.00165545
  0.02105976 -0.00523596]


In [12]:
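# Top 5 most similar words to "man".
# Note: with dm=0 and the default dbow_words=0, gensim trains only document vectors,
# so these word similarities reflect the random initialization of the word vectors.
model_dbow.wv.most_similar("man", topn=5)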

[('meat', 0.39641639590263367),
 ('bites', 0.05595850199460983),
 ('dog', 0.050178974866867065),
 ('food', -0.06502574682235718),
 ('eats', -0.2928890585899353)]

In [14]:
model_dbow.wv.n_similarity(["dog"],["man"])

0.050178967
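
To make the similarity score above more concrete: as gensim documents it, `n_similarity` is the cosine similarity between the mean vectors of two word lists. The sketch below reproduces that computation in plain Python. The helper names and `toy_vectors` are hypothetical and chosen for illustration; they are not taken from the trained model.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def set_similarity(vectors_a, vectors_b):
    """Cosine similarity between the mean vectors of two sets of word vectors."""
    return cosine(mean_vector(vectors_a), mean_vector(vectors_b))


# Hypothetical 2-dimensional word vectors, for illustration only.
toy_vectors = {"dog": [1.0, 0.0], "man": [0.0, 1.0], "pet": [1.0, 1.0]}

print(set_similarity([toy_vectors["dog"]], [toy_vectors["man"]]))
print(set_similarity([toy_vectors["dog"], toy_vectors["man"]], [toy_vectors["pet"]]))
```

With a single word on each side, as in the cell above, this reduces to the ordinary cosine similarity between the two word vectors.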

In [15]:
# DM (distributed memory): selected with dm=1
model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2, dm=1)

print("Inference Vector of man eats food\n ",model_dm.infer_vector(['man','eats','food']))

print("Most similar words to man in our corpus\n",model_dm.wv.most_similar("man",topn=5))
print("Similarity between man and dog: ",model_dm.wv.n_similarity(["dog"],["man"]))

Inference Vector of man eats food
  [-0.00497926 -0.00747131  0.00269496  0.0024915   0.00251809 -0.02208188
 -0.0091675  -0.02124474  0.02117798  0.00771315  0.02289761 -0.00328511
  0.01053256 -0.00290302  0.02079763 -0.01540546 -0.01744637 -0.00165545
  0.02105976 -0.00523596]
Most similar words to man in our corpus
 [('meat', 0.39641639590263367), ('bites', 0.05595850199460983), ('dog', 0.050178974866867065), ('food', -0.06502574682235718), ('eats', -0.2928890585899353)]
Similarity between man and dog:  0.050178967


What happens when we compare a word that is not in the vocabulary?

In [16]:
model_dm.wv.n_similarity(['covid'],['man'])

0.0
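
Because behavior on unknown words can differ across gensim versions (some raise a `KeyError` instead of returning a value), it can help to check vocabulary membership before comparing. The sketch below shows the pattern on a plain dictionary; `safe_similarity` and `toy_vectors` are hypothetical names used for illustration. With a trained model, the analogous membership check would be `word in model.wv`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def safe_similarity(vectors, word_a, word_b):
    """Return the cosine similarity of two words, or None if either is out of vocabulary."""
    missing = [w for w in (word_a, word_b) if w not in vectors]
    if missing:
        print("Out-of-vocabulary word(s):", missing)
        return None
    return cosine(vectors[word_a], vectors[word_b])


# Hypothetical word vectors standing in for a trained vocabulary.
toy_vectors = {"dog": [1.0, 0.0], "man": [0.0, 1.0]}

print(safe_similarity(toy_vectors, "covid", "man"))
print(safe_similarity(toy_vectors, "dog", "man"))
```

Returning `None` for unknown words keeps "no information" distinct from a genuine similarity of 0.0, which the raw score alone cannot express.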