# Document embedding techniques

Ссылка на Google colab : https://colab.research.google.com/drive/12_G5VIjqOX3CM7zXoYQI8TH0_M5fYtlh

Word embedding techniques and generalizations of these methods are sometime called *Neural Probabilistic Language Models*.

Many of the methods presented in this lecture are inspired by prominent word embedding techniques, chief among them word2vec, and they are sometimes even direct generalizations of these methods. As such, a basic **understanding of word embedding techniques is essential for understanding this section!!!** If you are not familiar with the topic, watch the previous [Lecture_06](https://github.com/ML-HSE/ML-course-2019/blob/master/Lectures/Lecture_06/Part1_word_embeddings.ipynb).

The [distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics#:~:targetText=The%20distributional%20hypothesis%20suggests%20that,occur%20in%20similar%20linguistic%20contexts.) in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to purport similar meanings. The underlying idea that **“a word is characterized by the company it keeps”** was popularized by Firth. The distributional hypothesis is the basis for statistical semantics.

So, it is easy to see that word2vec, and other self-supervised methods for learning word representations, rely heavily on this hypothesis; the crux of the model, after all, is that word representations learned while learning to predict the context of a word from the word itself (or vice versa) represent a vector space capturing deep semantic and syntactic concepts and phenomena. Meaning, learning from the context of a word can teach us about both its meaning and its syntactic role.

### word2vec: skip gram & cbow

Lets remember how  skip gram & cbow model for word2vec looks like:

Models __CBOW (Continuous Bag of Words)__ and __Skip gram__ were invented in the now distant 2013,
*article*:
[*Tomas Mikolov et al.*](https://arxiv.org/pdf/1301.3781v3.pdf)

<img src="https://drive.google.com/uc?export=download&id=123tOrqr958DAwU60oW4nPccqr7JbnBF7"/>


Comparing 2 models
<img src="https://drive.google.com/uc?export=download&id=1AW7mMz3e6AyA0azAwvM40qGE2CnjJ7mS"/>



# n-gram embeddings

[Mikolov et al, 2013b](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) extended word2vec’s skip-gram model to handle short phrases by identifying a large number of short phrases — the authors focus on two- and three-word phrases — using a data-driven approach, and then treating the phrases as individual tokens during the training of the word2vec model. Naturally, this is less suitable for learning longer phrases — as the size of the vocabulary explodes when increasing phrase length — and is bound to not generalize to unseen phrases as well as the methods that follow it.

# Averaging word embeddings

There is a very intuitive way to construct document embeddings from meaningful word embeddings: Given a document, perform some vector arithmetics on all the vectors corresponding to the words of the document to summarize them into a single vector in the same embedding space; two such common summarization operators are average and sum.
Building upon this, you can perhaps already imagine that extending the encoder-decoder architecture of word2vec and its relatives to learn how to combine word vectors into document embeddings can be interesting; the methods that follow this one fall into this category.

A second possibility is to use a fixed (unlearnable) operator for vector summarization — e.g. averaging — and learn word embeddings in a preceding layer, using a learning target that is aimed at producing rich document embeddings; a common example is using a sentence to predict context sentences. Thus the main advantage here is that word embeddings are optimized for averaging into document representations.

<img src="https://drive.google.com/uc?export=download&id=1KIdBHa0zrFsetJ7kF_mItjc0j_bIy9Tr"/>

 **Siamese CBOW network architecture from [Kenter et al, 2016](https://arxiv.org/pdf/1606.04640.pdf)**

Another demonstration of the power of correctly-averaged word “embeddings” can perhaps be found when looking at attention-based machine translation models. The one-directional decoder RNN gets the previous translated word as input, plus not just the “embedding” (i.e. the bi-directional activations from the encoder RNN) of the current word to translate, but also those of words around it; these are averaged in a weighted manner into a context vector. It is teaching that this weighted averaging is able to maintain the complex compositional and order-dependent information from the encoder network’s activations (recall, these are not isolated embeddings like in our case; each is infused with the context of previous/following words).

# Sent2Vec

Presented in [Pagliardini et al, 2017](https://aclweb.org/anthology/N18-1049) and [Gupta et al, 2019](https://www.aclweb.org/anthology/N19-1098) (including an official C++-based Python implementation), this technique is very much a combination of the two above approaches: The classic CBOW model of word2vec is both extended to include word n-grams and adapted to optimize the word (and n-grams) embeddings for the purpose of averaging them to yield document vectors.

<img src="https://drive.google.com/uc?export=download&id=1myFO_g9YRF0vZPnHGg7evUnr9LvHQfpl"/>

In addition, the process of input subsampling is removed, considering the entire sentence as context instead. This means both that (a) the use of frequent word subsampling is discarded — so as not to prevent the generation of n-grams features — and (b) the dynamic context windows used by word2vec are made away with: the entire sentence is considered as the context window, instead of sampling the context window size for each subsampled word uniformly between 1 and the length of the current sentence.

Another way to think of sent2vec is as an unsupervised version of fastText (see Figure 6), where the entire sentence is the context and possible class labels are all vocabulary words. Coincidentally, [Agibetov et al, 2018](https://link.springer.com/article/10.1186/s12859-018-2496-4) compare the performance of a multi-layer perceptron using sent2vec vectors as features to that of fastText, against the task of biomedical sentence classification.

# Paragraph vectors

Sometimes referred to as doc2vec, this method, presented in [Le & Mikolov, 2014](https://cs.stanford.edu/~quocle/paragraph_vector.pdf) is perhaps the first attempt to generalize word2vec to work with word sequences. The authors introduce two variants of the paragraph vectors model: Distributed Memory and Distributed Bag-of-Words.

### Paragraph Vectors: Distributed Memory (PV-DM)
The PV-DM model augments the standard encoder-decoder model by adding a memory vector, aimed at capturing the topic of the paragraph, or context from the input. The training task here is quite similar to that of continuous bag of words; a single word is to be predicted from its context. In this case, the context words are the preceding words, not the surrounding words, as is the paragraph.

<img src="https://drive.google.com/uc?export=download&id=181PSREcRSPB93PsYlajqX7kJr2lmpfj7"/>

**The Distributed Bag of Words model of Paragraph Vectors (PV-DBOW)**

To achieve this, every paragraph is mapped to a unique vector, represented by a column in a matrix (denoted by D), as is each word in the vocabulary. The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. Naturally, word embeddings are global, and pre-trained word embeddings can be used (see implementations and enhancements below).

As in word2vec, vectors must be summarized in some way into a single vector; but unlike word2vec, the authors use concatenation in their experiments. Notice that this preserves order information. Similar to word2vec, a simple softmax classifier (in this case, actually hierarchical softmax) is used over this summarized vector representation to predict the task output. Training is done the standard way, using stochastic gradient descent and obtaining the gradient via backpropagation.

Notice that only the paragraphs in the training corpus have a column vector from D associated with them. At prediction time, one needs to perform an inference step to compute the paragraph vector for a new paragraph: The document vector is initialized randomly. Then, repeatedly, a random word is selected from the new document, and gradient descent is used to adjust input-to-hidden-layer weights such that softmax probability is maximized for the selected word, while hidden-to-softmax-output weights are fixed. This results in a representation of the new document as a mixture of training corpus document vectors (i.e. columns of D), naturally residing in the document embedding space.

### Paragraph Vectors: Distributed Bag of Words (PV-DBOW)
The second variant of paragraph vectors, despite its name, is perhaps the parallel of word2vec’s skip-gram architecture; the classification task is to predict a single context word using only the paragraph vector. At each iteration of stochastic gradient descent, a text window is sampled, then a single random word is sampled from that window, forming the below classification task.

<img src="https://drive.google.com/uc?export=download&id=17XncDnPaGp5XRytQkCy_L0_ILYp6WIFQ"/>

Training is otherwise similar, except for the fact that word vectors are not jointly learned along with paragraph vectors. This makes both memory and runtime performance of the PV-DBOW variant much better.

**Note:** *In its **Gensim** implementation, PV-DBOW uses randomly initialized word embeddings by default; if dbow_words is set to 1, a single step of skip-gram is ran to update word embeddings before running dbow. [Lau & Baldwin, 2016](https://arxiv.org/pdf/1607.05368.pdf) argue that even though dbow can in theory work with randomized word embeddings, this degrades performance severely in the tasks they have examined.*

*An intuitive explanation can be traced back to the model’s objective function, which is to maximize the dot product between the document embedding and its constituent word embeddings: if word embeddings are randomly distributed, it becomes more difficult to optimize the document embedding to be close to its more critical content words.*

# Doc2VecC
[Chen, 2017](https://arxiv.org/pdf/1707.02377.pdf) presented an interesting approach inspired by both the distributed memory model of the paragraph vectors approach (PV-DM) and approaches that average word embeddings to represent documents.

<img src="https://drive.google.com/uc?export=download&id=17t8IMVkqIroa_m_POaoKPFh34qKDRMF5"/>

The architecture of the Doc2VecC model

Similar to paragraph vectors, Doc2VecC (an acronym of document

vector through corruption) consists of an input layer, a projection layer and an output layer to predict the target word (“ceremony” in the above example). The embeddings of neighboring words (e.g. “opening”, “for”, “the”) provide local context while the vector representation of the entire document (shown in grey) serves as the global context. In contrast to paragraph vectors, which directly learns a unique vector for each document, Doc2VecC represents each document as an average of the embeddings of words randomly sampled from the document (e.g. “performance” at position p, “praised” at position q, and “brazil” at position r).

Additionally, the authors choose to corrupt the original document by randomly removing a significant portion of words, representing the document by averaging only the embeddings of the remaining words. This corruption mechanism allows a speedup during training as it significantly reduces the number of parameters to update in back propagation. The authors also show how it introduces a special form of regularization, which they believe results in the observed performance improvement, benchmarked on a sentiment analysis task, a document classification task and a semantic relatedness task versus a plethora of state-of-the-art document embedding techniques.
An open source C-based implementation of the method and code to reproduce the experiments in the paper can be found in a public Github repository.

The general idea of corrupting, or adding noise, to the document embedding learning process to produce a more robust embedding space has also been applied by [Hill et al, 2016](https://www.aclweb.org/anthology/N16-1162) to the skip-thought model (see the following sub-section) to create their sequential denoising autoencoder (SDAE) model.

# Skip-thought vectors

Presented in [Kiros et al, 2015](https://arxiv.org/abs/1506.06726), this is another early attempt to generalize word2vec, and was published with an official pure Python implementation (and recently also boasting implementations for PyTorch and TensorFlow).

This, however, extends word2vec — specifically the skip-gram architecture — in another intuitive way: the base unit is now sentences, and an encoded sentence is used to predict the sentences around it. The vector representations are learned using an encoder-decoder model trained on the above task; the authors use an RNN encoder with GRU activations and RNN decoders with a conditional GRU. Two different decoders are trained for previous and next sentences.

<img src="https://drive.google.com/uc?export=download&id=18BqmFinW_whcR_SRuoNLSLZZNRJHmAQJ"/>

The skip-thoughts model. Given a tuple of contiguous sentences, the sentence sᵢ is encoded and tries to reconstruct the previous sentence sᵢ₋₁ and the next sentence sᵢ₊₁

### Vocabulary expansion in skip-thought
The skip-thought encoder uses a word embedding layer that converts each word in the input sentence to its corresponding word embedding, effectively converting the input sentence into a sequence of word embeddings. This embedding layer is also shared with both of the decoders.

<img src="https://drive.google.com/uc?export=download&id=19-qwdWdRu4SRLHADSuqCjW2bfqSKa3PC"/>

 In the skip-thoughts model, sentence sᵢ is encoded by the encoder; the two decoders condition on the hidden representation of the encoder’s output hᵢ to predict sᵢ₋₁ and sᵢ₊₁ 
 
 However, the authors only use a small vocabulary of 20,000 words, and as a result many unseen words might be encountered during use in various tasks. To overcome this, a mapping is learned from a word embedding space trained on a much larger vocabulary (e.g. word2vec) to the word embedding space of the skip-thoughts model, by solving an un-regularized L2 linear regression loss for the matrix W parameterizing this mapping.

### Applications, enhancements and further reading
The authors demonstrate the use of skip-thought vectors for semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and four sentiment and subjectivity datasets. [Broere, 2017](http://arno.uvt.nl/show.cgi?fid=146003) further investigates the syntactic properties of skip-thought sentence representations by training logistic regression on them to predict POS tags and dependency relations.

# Quick-thought vectors

[Logeswaran & Lee, 2018](https://arxiv.org/pdf/1803.02893.pdf) reformulate the document embedding task — the problem of predicting the context in which a sentence appears — as a supervised classification problem rather than the prediction task of previous approaches .

<img src="https://drive.google.com/uc?export=download&id=1q8KSjnUzmqLJh3KBD8Nf2nnC-Bwpub13"/>

The Quick-Thought problem formulation (b) contrasted with the Skip-Thought approach (a)

The gist is to use the meaning of the current sentence to predict the meanings of adjacent sentences, where meaning is represented by an embedding of the sentence computed from an encoding function; notice two encoders are learned here: f for the input sentence and g for candidates. Given an input sentence, it is encoded by an encoder (RNNs, in this case), but instead of generating the target sentence, the model chooses the correct target sentence from a set of candidate sentences; the candidate set is built from both valid context sentences (ground truth) and many other non-context sentences. Finally, the constructed training objective maximizes the probability of identifying the correct context sentences for each sentence in the training data. Viewing the former sentence prediction formulation as choosing a sentence from all possible sentences, this new approach can be seen as a discriminative approximation to the prediction problem.

The authors evaluate their approach on various text classification, paraphrase identification and semantic relatedness tasks, and also provide an official Python implementation.(https://github.com/lajanugen/S2V)

# Word Mover’s Embedding (WME)

A very recent method, coming out of IBM research, is Word Mover’s Embedding (WME), presented in [Wu et al, 2018b](https://arxiv.org/pdf/1811.01713v1.pdf). An official C-based, Python-wrapped implementation is provided.(https://github.com/IBM/WordMoversEmbeddings)

[Kushner et al, 2015](http://proceedings.mlr.press/v37/kusnerb15.pdf) presented Word Mover’s Distance (WMD); this measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to “travel” in the embedding space to reach the embedded words of another document (see Figure 13a). Additionally, [Wu et al, 2018a](https://arxiv.org/pdf/1802.04956.pdf) proposed D2KE (distances to kernels and embeddings), a general methodology for the derivation of a positive-definite kernel from a given distance function.

<img src="https://drive.google.com/uc?export=download&id=1MsKeLNQgBg7wWcRByFiSbIELhFla72Da"/>

 Contrasting WMD with WME. (a) WMD measures the distance between two documents x and y, while (b) WME approximates a kernel derived from WMD with a set of random documents 𝜔

WME builds on three components to learn continuous vector representations for texts of varying lengths:
1. The ability to learn high-quality word embedding in an unsupervised manner (e.g., using word2vec).
2. The ability to construct a distance measure for documents based on said embeddings using Word Mover’s Distance (WMD).
3. The ability to derive positive-definite kernel from a given distance function using D2KE.

Using these three components, the following approach is applied:
1. Construct a positive-definite Word Mover’s Kernel (WMK) via an infinite-dimensional feature map given by the Word Mover’s distance (WMD) to random documents 𝜔 from a given distribution, using D2KE. Due to its use of the WMD, the feature map takes into account alignments of individual words between the documents in the semantic space given by the pre-trained word embeddings (see Figure 13b).
2. Based on this kernel, derive a document embedding via a random features approximation of the kernel, whose inner products approximate exact kernel computations.
This framework is extensible, since its two building blocks, word2vec and WMD, can be replaced by other techniques such as GloVe (for word embeddings) or S-WMD (for translation of the word embedding space into a document distance metric).

The authors evaluate WME on 9 real-world text classification tasks and 22 textual similarity tasks, and demonstrate that it consistently matches, and sometimes even outperforms, other state-of-the art techniques.


# Sentence-BERT (SBERT)
2018 in NLP was marked by the rise of the transformers, state-of-the-art neural language models inspired by the transformer model presented in [Vaswani et al 2017](https://arxiv.org/pdf/1706.03762.pdf) — a sequence model that dispenses of both convolutions and recurrence and uses attention instead to incorporate sequential information into sequence representation. This booming family includes BERT (and its extensions), GPT (1 and 2) and the XL-flavored transformers.

<img src="https://drive.google.com/uc?export=download&id=1nqUo8oR2FUkqTTKlHrfeLoMr_wzugR_f"/>

These models generate contextual embeddings of input tokens (commonly sub-word units), each infused with information of its neighborhood, but are not aimed at generating a rich embedding space for input sequences. BERT even has a special [CLS] token whose output embedding is used for classification tasks, but still turns out to be a poor embedding of the input sequence for other tasks. [Reimers & Gurevych, 2019](https://arxiv.org/pdf/1908.10084.pdf)

Sentence-BERT, presented in [Reimers & Gurevych, 2019](https://arxiv.org/pdf/1908.10084.pdf) and accompanied by a Python implementation(https://github.com/UKPLab/sentence-transformers), aims to adapt the BERT architecture by using siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity.

<img src="https://drive.google.com/uc?export=download&id=1myoOewhwnB9Laqs58gnJC_smBNiu2O7c"/>

The SBERT architecture in training on a classification objective (left) and inference (right)

## How to choose which technique to use

I have no easy answers here, but here are a few possible takeaways:

1. **Averaging word vectors is a strong baseline**, so a good idea is to start your quest for good document embeddings by focusing on generating very good word vectors, and simply averaging them at first. Undoubtedly, much of the power of document embeddings comes from the word vectors upon which they are built, and I think it is safe to say there is a significant delta of information to optimize in that layer before moving forwards. You can try different pre-trained word embeddings, exploring which source domains and which methods (e.g. *word2vec vs GloVe vs BERT vs ELMo*) capture the type of information you need in a better way. Then, extending this slightly by trying different summarization operators or other tricks (like those in [Arora et al, 2016](https://pdfs.semanticscholar.org/3fc9/7768dc0b36449ec377d6a4cad8827908d5b4.pdf)) might prove to be enough.

2. **Performance can be a key consideration**, especially without a clear leader among the methods. In that case, both averaging word vectors, and some lean methods like *sent2vec* and *FastSent*, are good candidates. In contrast, the real-time vector representation inference required for each sentence when using *doc2vec* might prove costly given application constraints. *SentEval*, an evaluation toolkit for sentence representations presented in [Conneau & Kiela, 2018](https://arxiv.org/pdf/1803.05449.pdf), is a tool worth mentioning in this context.

3. **Consider the validity of the learning objective to your task**. The different self-supervised techniques covered above extended the distributional hypothesis in different ways, with *skip-thought* and *quick-thought* modeling a strong relation between sentences/paragraphs based on their distance in a document. This perhaps applies trivially for books, articles and social media posts, but might not apply as strongly to other sequences of texts, especially structured ones, and might thus project your documents into an embedding space which does not apply to them. Similarly, the word-alignment approach which WME relies on might not apply to every scenario.

## There are no clear task-specific leaders.

# DOC2VEC gensim implementation

**Gensim** is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

In [1]:
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [2]:
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

In [3]:
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)
  
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")



iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
iteration 12
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
iteration 18
iteration 19
iteration 20
iteration 21
iteration 22
iteration 23
iteration 24
iteration 25
iteration 26
iteration 27
iteration 28
iteration 29
iteration 30
iteration 31
iteration 32
iteration 33
iteration 34
iteration 35
iteration 36
iteration 37
iteration 38
iteration 39
iteration 40
iteration 41
iteration 42
iteration 43
iteration 44
iteration 45
iteration 46
iteration 47
iteration 48
iteration 49
iteration 50
iteration 51
iteration 52
iteration 53
iteration 54
iteration 55
iteration 56
iteration 57
iteration 58
iteration 59
iteration 60
iteration 61
iteration 62
iteration 63
iteration 64
iteration 65
iteration 66
iteration 67
iteration 68
iteration 69
iteration 70
iteration 71
iteration 72
iteration 73
iteration 74
iteration 75
iteration 76
iteration

**Note:** dm defines the training algorithm. If dm=1 means ‘distributed memory’ (PV-DM) and dm =0 means ‘distributed bag of words’ (PV-DBOW). Distributed Memory model preserves the word order in a document whereas Distributed Bag of words just uses the bag of words approach, which doesn’t preserve any word order.

So we have saved the model and it’s ready for implementation. Lets play with it.

In [4]:
from gensim.models.doc2vec import Doc2Vec

model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

V1_infer [ 0.01291809 -0.01923922 -0.01209419 -0.0105249   0.00963472  0.00369297
  0.01455202 -0.01023043 -0.00756906 -0.00471234 -0.02581836 -0.00018772
  0.01648297 -0.02244008 -0.02320112  0.02979133  0.00964482 -0.01407811
 -0.0085121   0.01085268]


In [5]:
# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])# to find most similar doc using tags


[-0.154322    0.01624887  0.19770841 -0.13146916  0.15434411  0.5247697
  0.159032   -0.27075085  0.03293942  0.09079414 -0.20218018  0.1747214
  0.20651653 -0.31260332 -0.34205368  0.22538683  0.24806097  0.04244787
  0.53436244 -0.10261851]


In [6]:
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)

[('0', 0.9932781457901001), ('2', 0.9892051219940186), ('3', 0.9805758595466614)]


In [7]:
# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])

[-0.154322    0.01624887  0.19770841 -0.13146916  0.15434411  0.5247697
  0.159032   -0.27075085  0.03293942  0.09079414 -0.20218018  0.1747214
  0.20651653 -0.31260332 -0.34205368  0.22538683  0.24806097  0.04244787
  0.53436244 -0.10261851]
