Exploring Sentence-, Document-, and Character-Level Embeddings

Information related to the ordering of words, along with their semantics, can be taken into
account when building embeddings to represent words.
Doc2Vec builds on this by providing an embedding for entire documents.

Distributed Memory Model of Paragraph Vectors (PV-DM): In the sentence Learn
Natural Language Processing (NLP), the document vector is used along with the
word vectors of Learn, Natural, and Language to predict the next word,
Processing. The model is tuned based on how well it predicted the word
Processing.
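The prediction task described above can be sketched in plain Python. This is only an illustration of how a document tag and its context words pair up with a target word; the function name `pv_dm_examples`, the tag value, and the window size are assumptions made for this sketch. It uses only the words before the target as context, whereas a real implementation may also look at words after it.

```python
def pv_dm_examples(tag, words, window):
    """Return (tag, context_words, target_word) triples for one document."""
    examples = []
    for i in range(window, len(words)):
        context = words[i - window:i]  # the `window` words before the target
        examples.append((tag, context, words[i]))
    return examples


# Hypothetical tokenization of the example sentence, tagged as document 0.
sentence = ["Learn", "Natural", "Language", "Processing"]
dm_examples = pv_dm_examples(tag=0, words=sentence, window=3)
print(dm_examples)
# [(0, ['Learn', 'Natural', 'Language'], 'Processing')]
```

During training, the vectors for the tag and the context words would be combined and adjusted so that the target word becomes more likely.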

Distributed Bag-of-Words Model of Paragraph Vectors (PV-DBOW): In this
approach, word vectors aren't taken into account. Instead, the paragraph vector
is used to predict randomly sampled words from the paragraph. During gradient
descent and backpropagation, the paragraph vectors are adjusted based on how
well or badly they predict those words. This approach is analogous to the
Skip-gram approach used in Word2Vec.
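A minimal sketch of the PV-DBOW training pairs, assuming a hypothetical helper `pv_dbow_examples` and a fixed random seed: the only input is the document tag, and the targets are words sampled at random from that document. No context words are involved.

```python
import random


def pv_dbow_examples(tag, words, n_samples, seed=0):
    """Return (tag, target_word) pairs with targets sampled from the document."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [(tag, rng.choice(words)) for _ in range(n_samples)]


# One of the short documents from the corpus used below, tagged as document 7.
doc = ["graph", "minors", "trees"]
dbow_examples = pv_dbow_examples(tag=7, words=doc, n_samples=4)
print(dbow_examples)
```

Each pair stands for one prediction the paragraph vector has to make; the error on that prediction is what drives the update.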

**Building a Doc2Vec model**


In [1]:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [2]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [4]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
documents

[TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]),
 TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]),
 TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]),
 TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]),
 TaggedDocument(words=['user', 'response', 'time'], tags=[4]),
 TaggedDocument(words=['trees'], tags=[5]),
 TaggedDocument(words=['graph', 'trees'], tags=[6]),
 TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]),
 TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])]

In [5]:
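# Passing documents to the constructor already builds the vocabulary and trains
# the model, so the explicit train() call below runs additional training passes.
model = Doc2Vec(documents, vector_size=5, min_count=1, workers=4, epochs=40)
model.train(documents, total_examples=model.corpus_count,
            epochs=model.epochs)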



In [6]:
model.vector_size

5

In [8]:
len(model.dv)

9

In [10]:
len(model.wv)


12

In [16]:
words = list(model.wv.key_to_index.keys())
words

['system',
 'graph',
 'trees',
 'user',
 'minors',
 'eps',
 'time',
 'response',
 'survey',
 'computer',
 'interface',
 'human']

In [18]:
vector=model.infer_vector(['user', 'interface', 'for','computer'])
vector

array([ 0.01769252, -0.10492857, -0.01893673,  0.01627251, -0.03091071],
      dtype=float32)
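An inferred vector is usually compared with other vectors rather than read directly, and cosine similarity is a common choice for that. The sketch below is plain Python and not part of gensim; `cosine_similarity` is a helper defined here, and the comparison vectors are derived from the values printed above purely for illustration. Inferred values can differ from run to run, so the numbers themselves carry no special meaning.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Values copied from the inferred vector shown above.
inferred = [0.01769252, -0.10492857, -0.01893673, 0.01627251, -0.03091071]
same_direction = [2 * x for x in inferred]  # scaled copy, illustrative only
opposite = [-x for x in inferred]           # flipped copy, illustrative only

print(cosine_similarity(inferred, same_direction))
print(cosine_similarity(inferred, opposite))
```

The first comparison is close to 1 and the second close to -1, because cosine similarity depends only on direction and not on vector length.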

**Changing vector size and min_count**

In [19]:
model=Doc2Vec(documents, min_count=3,epochs=40,vector_size=50)
model.train(documents, total_examples=model.corpus_count,epochs=model.epochs)



In [20]:
len(model.wv)

4

In [21]:
word=list(model.wv.key_to_index.keys())
word

['system', 'graph', 'trees', 'user']

In [22]:
vector=model.infer_vector(['user', 'interface', 'for','computer'])
vector


array([ 2.7921072e-03, -9.5139854e-03, -8.5343439e-05,  2.7636734e-03,
       -3.5706391e-03,  3.7900740e-03,  8.9099603e-03,  8.5471487e-03,
        9.3369018e-03,  1.3304256e-03,  7.2353575e-03, -5.9009641e-03,
       -5.7782428e-03,  9.4995331e-03,  8.1670834e-03,  3.8477441e-03,
       -2.7677598e-03, -3.6295371e-03,  4.9458374e-03, -1.9870708e-03,
        3.6297771e-03,  4.6719038e-03,  3.1809886e-03, -4.9725231e-03,
        7.0057483e-03, -5.1878626e-03,  2.4053396e-03,  1.2494436e-03,
        4.5998469e-03,  9.1097044e-04, -6.7735417e-04, -7.3037618e-03,
       -3.7056257e-03,  7.2824978e-03,  1.8508495e-04,  8.8006724e-03,
        8.2237655e-03, -6.3653821e-03,  2.5660803e-03,  1.4124048e-03,
        7.2619976e-03,  7.9020681e-03,  3.3601034e-03, -8.3770780e-03,
       -3.7034908e-03, -7.2631100e-03,  8.4346188e-03, -5.3134216e-03,
       -3.7815624e-03,  3.0810458e-03], dtype=float32)

As we can see, the vector size is now 50 and only 4 terms are in the vocabulary.
This is because min_count was changed to 3 and, consequently, only terms that
occur at least 3 times in the corpus are kept in the vocabulary.
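The effect of min_count can be checked without training a model by counting term frequencies directly. The sketch below uses only the standard library; `build_vocab` is a hypothetical helper, not a gensim function, and the corpus is copied from the `common_texts` output shown earlier. It also filters the query tokens against each vocabulary, on the assumption that tokens outside the vocabulary do not contribute to an inferred vector.

```python
from collections import Counter

# Corpus copied from the common_texts output shown earlier.
common_texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]


def build_vocab(texts, min_count):
    """Keep only the terms that occur at least min_count times."""
    counts = Counter(word for doc in texts for word in doc)
    return {word for word, n in counts.items() if n >= min_count}


query = ["user", "interface", "for", "computer"]
for min_count in (1, 3):
    vocab = build_vocab(common_texts, min_count)
    known = [word for word in query if word in vocab]
    print(min_count, len(vocab), known)
# 1 12 ['user', 'interface', 'computer']
# 3 4 ['user']
```

With min_count=3, only user from the query remains in the vocabulary, which helps explain why the inferred values are so small in magnitude.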

**Testing two different approaches of Doc2Vec**


1. PV-DM
2. PV-DBOW



In [23]:
model=Doc2Vec(documents,vector_size=50,min_count=2,epochs=40,dm=1)
model.train(documents,total_examples=model.corpus_count,epochs=model.epochs)



Setting dm=1 builds the Doc2Vec model with the distributed memory (PV-DM) approach, while dm=0 builds it with the distributed bag-of-words (PV-DBOW) approach.

In [24]:
model=Doc2Vec(documents,vector_size=50,min_count=2,epochs=40,dm=0)
model.train(documents, total_examples=model.corpus_count,epochs=model.epochs)



The dm_concat parameter is used in the PV-DM approach. Its value, when set to 1,
indicates to the algorithm that the context vectors should be concatenated while trying to
predict the target word. This, of course, leads to building a larger model since multiple
word embeddings get concatenated.
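To see why concatenation makes the model larger, the sketch below compares the size of the combined input with and without concatenation. It is plain Python rather than gensim internals; the `combine` helper, the placeholder vector values, the use of averaging for the non-concatenated case, and the assumption of `window` context words on each side of the target are all illustrative.

```python
def combine(doc_vec, context_vecs, concat):
    """Combine a document vector with context word vectors."""
    if concat:
        # Concatenation: the result grows with every context vector.
        combined = list(doc_vec)
        for vec in context_vecs:
            combined.extend(vec)
        return combined
    # Averaging: the result keeps the size of a single vector.
    vectors = [doc_vec] + list(context_vecs)
    return [sum(column) / len(vectors) for column in zip(*vectors)]


vector_size, window = 50, 2
doc_vec = [0.1] * vector_size                                    # placeholder values
context_vecs = [[0.2] * vector_size for _ in range(2 * window)]  # 4 context words

print(len(combine(doc_vec, context_vecs, concat=True)))   # 250
print(len(combine(doc_vec, context_vecs, concat=False)))  # 50
```

Under these assumptions the concatenated input is (1 + 2 * window) times the vector size, so the layer that consumes it needs correspondingly more weights.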

In [25]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=1,
                window=2, min_alpha=0.005, dm_concat=1)
model.train(documents, total_examples=model.corpus_count,epochs=model.epochs)


