Exploring Sentence-, Document-, and Character-Level Embeddings

Information related to the ordering of words, along with their semantics, can be taken into
account when building embeddings to represent words.  
Doc2Vec provides an embedding for entire documents.

Distributed Memory Model of Paragraph Vectors (PV-DM): In the sentence
Learn Natural Language Processing (NLP), the document vector is used along
with the word vectors of Learn, Natural, and Language to predict the next
word, Processing. The model is tuned based on how well it predicted the word
Processing, and it learns throughout training.

Distributed Bag-of-Words Model of Paragraph Vectors (PV-DBOW): In this
approach, word vectors aren't taken into account. Instead, the paragraph vector
is used to predict randomly sampled words from the paragraph. During gradient
descent and backpropagation, the paragraph vectors are adjusted, and learning
happens based on how well or poorly they predict the sampled words. This
approach is analogous to the Skip-gram approach used in Word2Vec.
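
To make the difference between the two approaches concrete, here is a minimal sketch of the training examples each one would see. The helper names `pv_dm_examples` and `pv_dbow_examples` are hypothetical and are not part of gensim. The sketch follows the example above, where only the preceding words serve as context; a real implementation may also use the words that follow.

```python
import random

def pv_dm_examples(doc_tag, words, window=3):
    """PV-DM: the document tag plus up to `window` preceding words predict the next word."""
    examples = []
    for i in range(1, len(words)):
        context = tuple(words[max(0, i - window):i])
        examples.append(((doc_tag, context), words[i]))
    return examples

def pv_dbow_examples(doc_tag, words, n_samples=3, seed=0):
    """PV-DBOW: the document tag alone predicts words sampled from the document."""
    rng = random.Random(seed)
    return [(doc_tag, rng.choice(words)) for _ in range(n_samples)]

doc = ["Learn", "Natural", "Language", "Processing"]
dm = pv_dm_examples("DOC_0", doc)
dbow = pv_dbow_examples("DOC_0", doc)

# Last PV-DM example: the tag and three context words predict "Processing"
print(dm[-1])
# PV-DBOW examples: no context words, only the tag and a sampled target
for example in dbow:
    print(example)
```

Note that the PV-DBOW inputs never contain word vectors, which is why that variant is cheaper to train.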

**Building a Doc2Vec model**


In [1]:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [2]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [3]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
documents

[TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]),
 TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]),
 TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]),
 TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]),
 TaggedDocument(words=['user', 'response', 'time'], tags=[4]),
 TaggedDocument(words=['trees'], tags=[5]),
 TaggedDocument(words=['graph', 'trees'], tags=[6]),
 TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]),
 TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])]

In [4]:
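# Passing documents to the constructor already builds the vocabulary and trains the model;
# the explicit train() call runs a further model.epochs passes over the same data.
model = Doc2Vec(documents, vector_size=5, min_count=1, workers=4, epochs=40)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)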



In [5]:
model.vector_size

5

In [6]:
len(model.dv)

9

In [7]:
len(model.wv)


12

In [8]:
words = list(model.wv.key_to_index.keys())
words

['system',
 'graph',
 'trees',
 'user',
 'minors',
 'eps',
 'time',
 'response',
 'survey',
 'computer',
 'interface',
 'human']

In [9]:
vector=model.infer_vector(['user', 'interface', 'for','computer'])
vector

array([-0.05479828,  0.03763286, -0.06908114, -0.07776993,  0.07566229],
      dtype=float32)

Changing vector size and min_count

In [10]:
model = Doc2Vec(documents, min_count=3, epochs=40, vector_size=50)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)



In [11]:
len(model.wv)

4

In [12]:
word=list(model.wv.key_to_index.keys())
word

['system', 'graph', 'trees', 'user']

In [13]:
vector=model.infer_vector(['user', 'interface', 'for','computer'])
vector


array([-5.2483571e-03,  4.0307618e-03, -6.2962021e-03, -7.2687604e-03,
        7.4628475e-03, -3.4633225e-03,  1.8517875e-04,  8.7290276e-03,
       -9.3542095e-03, -2.9459449e-03,  2.2662533e-03, -6.2356647e-03,
        9.4316434e-03, -5.4678768e-03,  8.8230189e-04,  9.0770731e-03,
       -5.6036492e-03, -2.1582763e-03, -5.8416356e-03,  6.4541609e-04,
        2.6744364e-03,  3.3104499e-03, -1.6974021e-03,  2.9066496e-04,
        7.6739360e-03, -2.7164239e-03,  7.2400584e-03,  9.0947598e-03,
       -5.2203024e-03, -7.1979202e-03,  4.5083674e-05, -3.4565341e-03,
       -8.5999845e-03, -4.5912508e-03, -3.0163312e-03, -4.5598154e-03,
        8.0624688e-03,  5.0757970e-03,  7.2172610e-03, -2.6858069e-03,
        1.7141332e-03,  7.4147373e-03,  7.2725094e-03,  6.6911322e-03,
       -2.3159175e-03,  7.1912785e-03, -1.8938316e-03, -4.5383503e-03,
       -2.5647706e-03, -3.5156982e-03], dtype=float32)

As we can see, the vector size is now 50 and only 4 terms are in the vocabulary.
This is because min_count was changed to 3 and, consequently, only terms that
occur at least 3 times in the corpus are kept in the vocabulary.
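
The effect of min_count can be reproduced with a simple frequency count. The sketch below is not how gensim builds its vocabulary internally; `build_vocab` here is a hypothetical helper, and `corpus` is a hand-typed copy of `common_texts` so that the block runs without gensim.

```python
from collections import Counter

# Hand-typed copy of gensim's common_texts, as displayed earlier
corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]

def build_vocab(texts, min_count):
    """Keep only the tokens that occur at least `min_count` times."""
    counts = Counter(token for text in texts for token in text)
    return [token for token, count in counts.most_common() if count >= min_count]

print(len(build_vocab(corpus, min_count=1)))  # 12 terms, every token is kept
print(build_vocab(corpus, min_count=3))       # only the four most frequent terms remain
```

The four surviving terms are system, graph, trees, and user, although the ordering of equally frequent terms may differ from gensim's.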

Testing two different approaches of doc2vec


1.   PV-DM
2.   PV-DBOW



In [14]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)



Setting dm=1 builds the Doc2Vec model using the distributed memory (PV-DM) approach, while dm=0 builds it using the distributed bag-of-words (PV-DBOW) approach.

In [15]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=0)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)



The dm_concat parameter is used in the PV-DM approach. Its value, when set to 1,
indicates to the algorithm that the context vectors should be concatenated while trying to
predict the target word. This, of course, leads to building a larger model since multiple
word embeddings get concatenated.
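
To see why concatenation leads to a larger model, compare the size of the input that reaches the prediction layer in each case. This is a rough sketch with dummy values, assuming one document vector, a full window of 2 words on each side, and averaging as the non-concatenated alternative. The helpers `combine_mean` and `combine_concat` are hypothetical.

```python
def combine_mean(vectors):
    """Average equally sized vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def combine_concat(vectors):
    """Concatenate vectors end to end."""
    return [value for vector in vectors for value in vector]

vector_size = 50
window = 2

# Dummy values: one document vector plus 2 * window context word vectors
doc_vector = [0.1] * vector_size
context_vectors = [[0.2] * vector_size for _ in range(2 * window)]
inputs = [doc_vector] + context_vectors

print(len(combine_mean(inputs)))    # 50: same size as a single vector
print(len(combine_concat(inputs)))  # 250: (1 + 2 * window) * vector_size
```

With concatenation, the input grows linearly with the window size, so the weights that follow it grow as well.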

In [16]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=1, window=2, min_alpha=0.005, dm_concat=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)




The window parameter controls the maximum distance between the word under consideration
and the word to be predicted, similar to the Word2Vec approach.
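
As a simplified illustration of what a window of 2 means, the sketch below collects the words that fall within the window around a given position. `context_words` is a hypothetical helper, and real implementations may treat the window differently (for example, near document boundaries).

```python
def context_words(tokens, position, window):
    """Return the tokens within `window` positions of tokens[position]."""
    start = max(0, position - window)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

tokens = ["survey", "user", "computer", "system", "response", "time"]

print(context_words(tokens, 2, window=2))  # ['survey', 'user', 'system', 'response']
print(context_words(tokens, 0, window=2))  # ['user', 'computer']
```

Words at the edges of a document simply get fewer context words, as the second call shows.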

In [17]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=0)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)



Now, let's explore what the learning rate is and how it can be leveraged.

For Doc2Vec, the initial learning rate can be specified using the alpha parameter. With the
min_alpha parameter, we can specify what value the learning rate should drop to over the
course of training.
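
The drop from alpha to min_alpha can be pictured as a schedule. The sketch below assumes a linear decay evaluated once per epoch, with example values of 0.3 and 0.05. The library may update the rate at a finer granularity than whole epochs, so treat this as an approximation, and `learning_rate` as a hypothetical helper.

```python
def learning_rate(progress, alpha=0.3, min_alpha=0.05):
    """Linearly interpolate from alpha (progress=0.0) to min_alpha (progress=1.0)."""
    return alpha - (alpha - min_alpha) * progress

epochs = 40
schedule = [learning_rate(epoch / (epochs - 1)) for epoch in range(epochs)]

# First and last learning rates of the schedule
print(round(schedule[0], 4), round(schedule[-1], 4))  # 0.3 0.05
```

A larger gap between alpha and min_alpha means bigger steps early in training and finer adjustments towards the end.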

In [18]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, alpha=0.3, min_alpha=0.05, window=2, dm=1)
model.train(documents, epochs=model.epochs, total_examples=model.corpus_count)




Exploring fastText

Let's see the two- and three-character n-grams for the word language:

la, lan, an, ang, ng, ngu, gu, gua, ua, uag, ag, age, ge

fastText leads to parameter sharing among words that have overlapping n-grams. We capture
morphological information from sub-words to build an embedding for the word itself. Also,
when certain words are missing from the training vocabulary or rarely occur, we can still
have a representation for them if their n-grams are present as part of other words.


Why n-grams are useful:

Sharing Similarities: The machine can find connections between words that share these n-gram pieces. For example, "language" and "angle" both contain the n-grams "an" and "ng", suggesting they might be related in some way.

Understanding Unfamiliar Words: If the machine encounters a new word (like "lingual") that it hasn't seen before, it can still make some sense of it because "lingual" shares n-grams ("in", "gu") with words it already knows.

Basically, n-grams help the machine learn word meanings by looking at smaller building blocks that can appear in many different words. This is especially useful for rare words or for languages with complex morphology (where words are built from smaller meaningful parts).
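
The n-grams discussed above are easy to generate by hand. The following sketch uses a hypothetical helper, `char_ngrams`, that extracts plain character n-grams without any word-boundary markers, which matches the list shown for the word language. fastText itself additionally wraps each word in `<` and `>` before extracting n-grams.

```python
def char_ngrams(word, min_n=2, max_n=3):
    """Return the set of character n-grams of `word` with min_n <= n <= max_n."""
    return {
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    }

language = char_ngrams("language")
print(sorted(language))

# n-grams that "language" shares with other words
print(sorted(language & char_ngrams("angle")))    # ['an', 'ang', 'ng']
print(sorted(language & char_ngrams("lingual")))  # ['gu', 'gua', 'ng', 'ngu', 'ua']
```

The shared pieces are what allow parameters learned for one word to inform the representation of another.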

**Building a fastText model**

In [21]:
from gensim.models import FastText
from gensim.test.utils import common_texts

In [30]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [34]:
model = FastText(vector_size=5, window=3, min_count=1)

model.build_vocab(common_texts)
model.train(common_texts, total_examples=len(common_texts), epochs=10)

(36, 290)

In [36]:
vocab=list(model.wv.key_to_index.keys())
vocab

['system',
 'graph',
 'trees',
 'user',
 'minors',
 'eps',
 'time',
 'response',
 'survey',
 'computer',
 'interface',
 'human']

In [37]:
model.wv['human']

array([-0.03166137,  0.02326731,  0.01241683,  0.00036033,  0.02841445],
      dtype=float32)

In [38]:
model.wv.most_similar(positive=['computer','interface'],negative=['human'])

[('user', 0.7968785762786865),
 ('system', 0.17462188005447388),
 ('response', 0.104334257543087),
 ('survey', 0.009604760445654392),
 ('trees', -0.07640466839075089),
 ('time', -0.1330047994852066),
 ('minors', -0.13927175104618073),
 ('eps', -0.24093686044216156),
 ('graph', -0.291752427816391)]

Since word representations in fastText are built from character n-grams, the
min_n and max_n parameters let us set the minimum and maximum lengths of the
character n-grams used to build those representations.
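
To get a feel for what these two parameters change, the sketch below counts the n-grams produced for a word under two settings and checks which of them also occur in the training vocabulary. It is a simplification: `marked_ngrams` is a hypothetical helper that wraps the word in `<` and `>` boundary markers, the word rubber is just an example of a word outside the vocabulary, and the real library hashes n-grams into buckets rather than storing them by name. The range 3 to 6 is, to my knowledge, gensim's default.

```python
def marked_ngrams(word, min_n, max_n):
    """List the character n-grams of '<word>' with min_n <= n <= max_n."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]

# The 12 terms in the vocabulary built from common_texts
vocabulary = ["system", "graph", "trees", "user", "minors", "eps",
              "time", "response", "survey", "computer", "interface", "human"]

print(len(marked_ngrams("rubber", 3, 6)))  # 18
print(len(marked_ngrams("rubber", 1, 5)))  # 30

# n-grams of the unseen word that also occur in vocabulary words
known = {gram for word in vocabulary for gram in marked_ngrams(word, 3, 6)}
shared = sorted(set(marked_ngrams("rubber", 3, 6)) & known)
print(shared)
```

A smaller min_n produces more, shorter n-grams, so an unseen word is more likely to overlap with known words, at the cost of less specific pieces.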

In [40]:
model=FastText(vector_size=5,window=3,min_count=1, min_n=1, max_n=5)
model.build_vocab(common_texts)
model.train(common_texts,total_examples=len(common_texts), epochs=10)

(36, 290)

In [41]:
model.wv['rubber']

array([ 0.01833104, -0.02146881,  0.00600105, -0.03445042, -0.0165866 ],
      dtype=float32)

In [42]:
model.wv.most_similar(positive=['computer','human'],negative=['rubber'])

[('trees', 0.795038104057312),
 ('eps', 0.7793108820915222),
 ('minors', 0.2440604716539383),
 ('time', 0.1623203009366989),
 ('user', -0.04820726439356804),
 ('graph', -0.15672056376934052),
 ('survey', -0.20417772233486176),
 ('interface', -0.3921482563018799),
 ('response', -0.6897355914115906),
 ('system', -0.8435077667236328)]

Extending the built model to incorporate words from new sentences

In [52]:
sentences_to_be_added = [["I", "am", "learning", "Natural", "Language", "Processing"],
                         ["Natural", "Language", "Processing"]]

In [53]:
model.build_vocab(sentences_to_be_added, update=True)
# Continue training on the new sentences so that their words and n-grams are learned
model.train(sentences_to_be_added, total_examples=len(sentences_to_be_added), epochs=10)

In [54]:
vocab=list(model.wv.key_to_index.keys())
vocab

['Processing',
 'Language',
 'Natural',
 'cool',
 'is',
 'learning',
 'am',
 'I',
 'M',
 'a',
 'c',
 'h',
 'i',
 'n',
 'e',
 'l',
 'r',
 'g',
 's',
 't',
 'f',
 'o',
 'm',
 'w',
 'd']