# The pinnacle of counting methods: GloVe (Global Vectors)

GloVe, short for “Global Vectors for Word Representation” (see the [paper](http://www.aclweb.org/anthology/D14-1162) by Pennington, Socher and Manning), is a kind of __hybrid solution__ between the __"old school" matrix factorization__ and the __predictive window approaches__ we will discuss below.

It is trained on a __global word-word__ co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus.

(The "word2vec" method came a bit earlier in time, but we discuss GloVe first for rhetorical reasons.)

The essential understanding is:

"...the authors of GloVe discovered via empirical methods that __instead of learning the raw co-occurrence probabilities__, it may make more sense to __learn ratios of these co-occurrence probabilities__, which seem to better discriminate subtleties in term-term relevance."

<a href="https://cdn-images-1.medium.com/max/1500/1*J-8GUgMcuqrNDWS5pV1aXg.png"><img src="https://drive.google.com/uc?export=view&id=1K842S9HR_1PAKbSOAGWzWAbNUo5I6MOJ" width=65%></a>

GloVe is a sophisticated model which represents the pinnacle of co-occurrence based models. It was __dominant and even competitive with predictive models__ - until the advent of full sequence models.
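To make the ratio idea concrete, here is a minimal sketch (not GloVe itself, which fits vectors to the logarithms of such counts). It builds window-based co-occurrence counts on a tiny, made-up corpus and compares $P(k \mid \text{ice}) / P(k \mid \text{steam})$ for a few probe words $k$. The corpus, the window size and the add-0.1 smoothing are all assumptions chosen for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical data, chosen only to make the effect visible).
corpus = [
    "ice is solid and cold",
    "steam is gas and hot",
    "ice is cold water",
    "steam is hot water",
    "solid ice melts into water",
    "gas steam condenses into water",
]

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context) pair occurs within +-window tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

def cond_prob(counts, word, context, smoothing=0.1):
    """Smoothed estimate of P(context | word); smoothing avoids division by zero."""
    vocab = {c for (_, c) in counts}
    total = sum(n for (w, _), n in counts.items() if w == word)
    return (counts[(word, context)] + smoothing) / (total + smoothing * len(vocab))

counts = cooccurrence_counts(corpus)
ratios = {
    probe: cond_prob(counts, "ice", probe) / cond_prob(counts, "steam", probe)
    for probe in ["solid", "gas", "water"]
}
for probe, ratio in ratios.items():
    print(f"P({probe}|ice) / P({probe}|steam) = {ratio:.2f}")
```

A ratio far above 1 marks a probe word that is specific to "ice", a ratio far below 1 marks one specific to "steam", and a ratio near 1 marks a word that does not discriminate between the two.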

A very good in-depth discussion of GloVe can be found in [this](https://towardsdatascience.com/emnlp-what-is-glove-part-i-3b6ce6a7f970) blogpost series, especially in [part 2](https://towardsdatascience.com/emnlp-what-is-glove-part-ii-9e5ad227ee0).




# Role of transfer learning

One important factor to consider here is that, with the advent of __incremental / predictive methods__ for learning vector spaces, models could be trained on **huge scale data** more easily and without prohibitive hardware costs, so the learned representations became far more expressive.

As an illustration: we ourselves could use a single server with 24 CPU cores to train a predictive model for Hungarian on 7.2 billion tokens.

Besides this, even on equal data, predictive models can capitalize more on the available information, showing superior performance to n-gram based methods.

<a href="https://i0.wp.com/deliprao.com/wp-content/uploads/2016/09/knvrnn.png?w=922"><img src="https://drive.google.com/uc?export=view&id=1U9Ijf6PE7vMntfNVLcnEpowU37s0faIy" width=55%></a>

(Just for illustrative purposes, exact models down below.)

For a more detailed analysis of the __effects of corpus size__ and other factors on predictive models see [here](http://deliprao.com/archives/201).

With this in mind, it made sense to __train on huge "general" corpora__, and __use the learned representations to solve the current task__. Thus, **transfer learning** became dominant.

The usage of __pre-trained embeddings__ can give _huge_ boosts in resource-scarce settings! (see [here](http://aclweb.org/anthology/N18-2084))

It is important to note that in __many cases__ the __"naive" way__ of transferring the models __works well enough__, so __no task specific training__ is __necessary__, but the overlap of vocabulary (or the lack of it) can be a decisive factor!
(see for example [here](https://www.kaggle.com/sbongo/do-pretrained-embeddings-give-you-the-extra-edge)).

If we work with **terminology heavy** or "non-standard language" corpora (think scientific, medical text, etc.) we should probably use pre-trained models **only for initialization and then continue training** in an appropriate way.
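A minimal sketch of "use pre-trained vectors for initialization" is shown below. It copies vectors for the words that the pre-trained model knows, randomly initializes the rest, and reports the vocabulary coverage. The function name `build_embedding_matrix`, the toy vectors and the word list are hypothetical; real pre-trained vectors would be loaded from a file.

```python
import numpy as np

def build_embedding_matrix(task_vocab, pretrained, dim, seed=0):
    """Copy pre-trained vectors where available, random-initialise the rest."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(task_vocab), dim))
    found = 0
    for row, word in enumerate(task_vocab):
        if word in pretrained:
            matrix[row] = pretrained[word]
            found += 1
    coverage = found / len(task_vocab)
    return matrix, coverage

# Hypothetical "pre-trained" vectors, for illustration only.
pretrained = {
    "patient": np.array([0.2, 0.1, -0.3]),
    "doctor": np.array([0.3, 0.0, -0.2]),
    "the": np.array([0.0, 0.5, 0.1]),
}
# Hypothetical task vocabulary with two terminology-heavy words.
task_vocab = ["the", "patient", "doctor", "myocarditis", "troponin"]

matrix, coverage = build_embedding_matrix(task_vocab, pretrained, dim=3)
print(f"vocabulary coverage: {coverage:.0%}")
```

The resulting matrix can serve as the initial weight of an embedding layer that is then trained further on the domain corpus. A low coverage value is an early warning that naive transfer is unlikely to work well.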

For a wrap-up of the general problems of transfer learning, see our separate Optional material notebook.


We will get back to the topic of language model fine tuning in detail later while discussing [ULMFiT](https://arxiv.org/abs/1801.06146).


# "Don't count, Predict!"

Philosophically, the big __shift away from the "count based" methods__ came with the advent of methods that try to directly __model__ the __generative process hypothesized behind language__ with the help of __predictive models__ that __approximate__ the __"next step"__ in the __generative behavior__.
For a more thorough elaboration see [Baroni et al.](http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf)

The slogan is:

**Don't count, predict!!!**


# Sequence matters

The predictive paradigm explicitly tries to account for the fact that in language, sequence matters.

**Illustration:**

<a href="https://4.bp.blogspot.com/-B32VmhfzIuc/UGYEnvHSE8I/AAAAAAAAAdY/qWn3B3fW_kk/s400/Snarky_meteorite.JPG"><img src="https://drive.google.com/uc?export=view&id=107OgogHT0GD2Hjx-VDFo49_Ad_u1Kjll" width=45%></a>

Moreover, **long term dependencies** have a strong influence on the semantics of a given phrase.

<a href="https://cdn-images-1.medium.com/max/1600/1*hBkRpcC7sMJGrxjXpMiF9Q.png"><img src="https://drive.google.com/uc?export=view&id=1LYckVy6Qn1khkKg_5I6yFOrVxGA5sV7W" width=50%></a>

"Bag of words" is essentially blind to this.

**Example:**

- “She only told him that she loved him.” 
- “She told only him that she loved him.” 
- “She told him only that she loved him.” 
- “She told him that only she loved him.” 
- “She told him that she only loved him.” 
- “She told him that she loved only him.” 

The sentences above all have the same BOW representation, whereas for humans they are indeed quite different!
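This is easy to check. The sketch below uses a deliberately naive tokenizer (lower-casing, stripping the final period and splitting on whitespace, all of which are assumptions of this example) and compares the word counts of the six sentences.

```python
from collections import Counter

sentences = [
    "She only told him that she loved him.",
    "She told only him that she loved him.",
    "She told him only that she loved him.",
    "She told him that only she loved him.",
    "She told him that she only loved him.",
    "She told him that she loved only him.",
]

def bag_of_words(sentence):
    """Lower-case, strip the final period, and count tokens (order is discarded)."""
    return Counter(sentence.lower().rstrip(".").split())

bags = [bag_of_words(s) for s in sentences]
all_identical = all(bag == bags[0] for bag in bags)
print(bags[0])
print("all six bags identical:", all_identical)
```

Every sentence is reduced to the same multiset of words, so no model that only sees these counts can tell the six meanings apart.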






## Some basics of sequential data (recap)  

**Sequential Data:**

The order of the data matters, but the time stamp is irrelevant. (Example: a DNA sequence. The concept of time is irrelevant here, so the order is not temporal.)


**Temporal Sequence:**

In addition to the order of data, the time stamp also matters. (Example: Data collected from customers’ shopping behaviour, considering their transaction time stamp as the temporal dimension.)

**Time Series:**

The data is in order, with a fixed time-difference between occurrence of successive data points. (Example: Time series of the temperature of a surface being recorded every 120 seconds.) 

([source](https://www.quora.com/What-is-the-difference-between-time-series-and-sequential-data))


**There are obvious differences from other datasets:**

- The __sequential position__ of the datapoints (commonly represented by arrows) is of __paramount__ importance (we are in trouble if we would like to draw a random sample...)
- We should __not make__ an __i.i.d. assumption__; we are well advised to assume that there is a relationship between successive datapoints
- It is always __suspicious__ whether the __[Markov assumption](https://en.wikipedia.org/wiki/Markov_property)__ holds for the data. For random walks it does, but it is rarely the case in practice. (We can try to figure out which "order" of Markov property is present, but it is not a clear cut line. Does the weather of tomorrow depend on today? And yesterday? And the year before? And a hundred years before?...)


Or more precisely: language data is a **categorical sequence** that we can try to model either by a **fixed, window like context** or a **full sequence model**.

All the approaches below will try to balance these two views.



# "Rolling window" models

Example of a __rolling window__ approach in the case of a __sequential dataset__:

<a href="https://www.mathworks.com/help/econ/rollingwindow.png"><img src="https://drive.google.com/uc?export=view&id=1vISPeGSgLkMKVrNkNGGyZ_X9vGAELhBG" width=45%></a>

This is most eminently used by: **word2vec** 

In their paper titled "Distributed Representations of Words and Phrases and their Compositionality", [Mikolov et al. 2013](https://arxiv.org/pdf/1310.4546.pdf) started a huge revolution. They used a **very simple neural model** with __one hidden layer__ for the prediction of next words / contexts. Their predictive model was **"streaming compatible"**, meaning it only __needed one minibatch__ of data __at a time__ to __iteratively approximate__ the __distribution of language__, and thus could scale to __enormous corpora__ (in the range of 1-100 billion tokens!). 

The __quality__ of the learned vector representations was also __impressive__ - more on that below.


It is important to emphasize that later [research](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization) showed that word2vec effectively learns **a good approximation of some factorization of the co-occurrence space**, thus it is a highly effective substitute for counting based models.

It is also noteworthy that it had a very limited notion of context. 

Remember the "1-skip-grams"? 
<a href="https://cdn-images-1.medium.com/max/1600/1*yiH5sZI-IBxDSQMKhvbcHw.png"><img src="https://drive.google.com/uc?export=view&id=1Ox_eUKCXGAIFbIiU0uillCh4kKsjRGS8" width=45%></a>

This was originally a word2vec illustration.

The __limitation of context__ will be held against it later on.


### Left context? Both contexts?

The question is how realistic this word2vec approach is for modeling context.

We naively assume that humans process text by strictly __following the dominant writing direction__ of the text, character by character. In reality though, given the limitations of our visual apparatus, this is a bit more complicated.

One simplistic example of a retina reading model is: 

<a href="https://washingtonvisiontherapy.com/wp-content/uploads/2018/01/Poor.gif"><img src="https://drive.google.com/uc?export=view&id=1JPT-NAVUWvZ9TVID4e7e_gt1_DX2D6dv" width=55%></a>

Besides this though, the general __eye movements__ are also __non-linear__, following __"saccades"__, __bursts of back and forth movement__.

<a href="https://www.researchgate.net/profile/Kenneth_Holmqvist/publication/240623157/figure/fig1/AS:337277172633602@1457424552865/Eye-movements-during-reading-of-a-text.png"><img src="https://drive.google.com/uc?export=view&id=1YoceCCR7lGX_Thlc1Q84b2ChAF6Y1bdi" width=55%></a>

So we can easily argue for an approach that is not strictly "left context", and even for strong bi-directionality (as in the case of bi-LSTM models).


### "Blueprint"

The general blueprint of the word2vec model is as follows:

<a href="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/models.png"><img src="https://drive.google.com/uc?export=view&id=1ZUwSGzEj4EWHCEabrrvmXadfRsdRF7_f" width=60%></a>

<a href="https://i.stack.imgur.com/igSuE.png"><img src="https://drive.google.com/uc?export=view&id=1R5hI-k4Zu8P6bE3oGgmJ5cMpDYshU7ML" width=30%></a>


According to Mikolov et al.:

"__Skip-gram__: works well with __small amount of the training data__, represents well __even rare words or phrases__.
__CBOW__: several times __faster to train__ than the skip-gram, slightly __better accuracy__ for the __frequent words__."
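The difference between the two variants is easiest to see in the training examples they produce from the same rolling window. The sketch below is a simplification: it uses a fixed symmetric window and omits the sub-sampling and dynamic window sizing of the original implementation. The helper names are hypothetical.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding position i itself."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) example per context word."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

def cbow_examples(tokens, window=2):
    """CBOW: one (context words, center) example per position."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens)
examples = cbow_examples(tokens)
print(pairs[:4])
print(examples[2])
```

Skip-gram produces many more (and smaller) training examples from the same text than CBOW, which is consistent with it being slower to train but better at representing rare words.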

### Advantage

These types of models are generally called __"neural word embeddings"__, since we __"embed"__ or __transform__ the __input__ words __into dense vector representations__. They are typically __not used for the task__ they are __trained for__, which is predicting a word from a minimal window context (though this can be the case in predictive text input or sentence generation).

Their main advantage is their **internal, "latent" representation** of the semantic space. Typically the softmax layer for outputting next word predictions is completely discarded and only the "mapping" layer is kept.
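In code, "keeping the mapping layer" simply means keeping the input weight matrix: multiplying a one-hot vector by it selects one row, and that row is the word's embedding. The sketch below uses random stand-in weights (`W_in`, `W_out` are hypothetical names, and nothing is trained here).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "apple"]
dim = 4

# Stand-ins for trained weights (random here, purely for illustration).
W_in = rng.normal(size=(len(vocab), dim))    # "mapping" layer: one row per word
W_out = rng.normal(size=(dim, len(vocab)))   # softmax layer, discarded after training

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

idx = vocab.index("queen")
hidden = one_hot(idx, len(vocab)) @ W_in     # forward pass through the mapping layer

# What we actually keep after training: a plain word -> vector lookup table.
embeddings = {word: W_in[i] for i, word in enumerate(vocab)}
same = np.allclose(hidden, embeddings["queen"])
print(same)
```

This is also why practical implementations never materialize the one-hot vector: an index lookup into the matrix gives the same result.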


<a href="http://drive.google.com/uc?export=view&id=1heogQhMfvtiOSfPtKvmc2OtyGadAsOmd"><img src="https://drive.google.com/uc?export=view&id=1HbQDy8orwiRH7SiaJjE5Gu5g99iavZa5" width=60%></a>


A good analysis can be found in [Marek Rei's blogpost](http://www.marekrei.com/blog/dont-count-predict/).

(Naturally the advance of these models did not stop, some of their more elaborate forms can be found [here](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a).)

### Why can the inner representations be meaningful?

The fact that the __inner representation space__ of word2vec seems thoroughly meaningful points towards one of the __general phenomena underpinning modern machine learning__:

_For the model to __solve the predictive task__ it was given, it had to __"store" and "compress"__ the __information__ inside the corpus so that it __fits through the "bottleneck"__, the __limited capacity__ of the model. The __best compression__ of the data is to capture their __"salient features"__, that is in this case their __morphologic and semantic characteristics__, their __"form" and "meaning"__. Thus a __systematic mapping arises__._


Optimal compression, understanding and intelligence are deeply related, see the [Hutter Prize](https://en.wikipedia.org/wiki/Hutter_Prize).

For more on the information bottleneck see [Tishby et al.'s work](https://en.wikipedia.org/wiki/Information_bottleneck_method).

We can say that here the "side effect" of prediction is the main result.

**Good representation is all!**



## Usage of rolling window models

### Sidenote: GloVe was for a long time competitive!

<a href="https://adriancolyer.files.wordpress.com/2016/04/glove-vs-word2vec.png"><img src="https://drive.google.com/uc?export=view&id=1hEBCI-SSQPq9csE38psebcHPybih8qCC" width=55%></a>

### Vocabulary wars

#### Speed

It is important to note that since the __vocabulary__ of these predictive models is __large__, the layers of vocabulary width $v$ consumed an __extreme amount of computation__ (by 2013 standards). (Think of a vocabulary of 300k words.)

When the __linear layer__ is __very wide__ (e.g., there are a lot of classes in a classification task), __computing__ the full __softmax__ is __expensive__, because even to compute the probability of a single output the exponentials of all outputs have to be computed. "Cheaper" alternatives have been developed, e.g. 

#####  Hierarchic Softmax

Here we treat the softmax layer as a binary classification tree for a multiclass classification problem, and the linear outputs are at the internal nodes of the tree, where they are interpreted as the logarithms of odds for the classes on the left and right sides of the corresponding subtree:

<a href="https://i.stack.imgur.com/L6siJ.png"><img src="https://drive.google.com/uc?export=view&id=1vlN7k9sAbV4TneHeYjB2TlNbYB2ZEulj" width=45%></a>

To calculate the probability corresponding to an individual class one simply multiplies the probabilities belonging to the nodes on the path, e.g. for class $w_2$

$$P(w_2) = (1-\sigma(n(w_2,1)))(1-\sigma(n(w_2,2)))\sigma(n(w_2,3))$$

$$= \sigma(-n(w_2,1))\sigma(-n(w_2,2))\sigma(n(w_2,3))$$

In general, for a balanced tree over a vocabulary of size $V$, it is enough to compute about $\log_2 V$ sigmoids (one per node on the path) to calculate the probability of a single item in the layer.

Since (as we will soon see) during training it's typically enough to calculate the probability of one item, this can significantly speed up training. In contrast, there is typically no speedup for prediction, when all probabilities need to be calculated.
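The path product can be sketched in a few lines. The example assumes a small balanced tree over a hypothetical four-word vocabulary (not the tree in the figure), random node parameters, and the convention that sign $+1$ means "multiply by $\sigma(\text{score})$" and sign $-1$ means "multiply by $\sigma(-\text{score})$".

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical balanced tree over a 4-word vocabulary:
# node 0 is the root, node 1 covers (w0, w1), node 2 covers (w2, w3).
# Each path lists (internal node, sign) pairs from the root down to the word.
paths = {
    "w0": [(0, +1), (1, +1)],
    "w1": [(0, +1), (1, -1)],
    "w2": [(0, -1), (2, +1)],
    "w3": [(0, -1), (2, -1)],
}

rng = np.random.default_rng(0)
dim = 5
node_vectors = rng.normal(size=(3, dim))  # one parameter vector per internal node
hidden = rng.normal(size=dim)             # hidden representation of the current context

def word_probability(word):
    """Multiply the branch probabilities along the path from the root to `word`."""
    prob = 1.0
    for node, sign in paths[word]:
        score = node_vectors[node] @ hidden
        prob *= sigmoid(sign * score)
    return prob

probs = {word: word_probability(word) for word in paths}
print(probs)
print("sum over vocabulary:", sum(probs.values()))
```

Because $\sigma(x) + \sigma(-x) = 1$ at every internal node, the leaf probabilities sum to one without any explicit normalization over the vocabulary, and each single probability needs only two sigmoid evaluations here ($\log_2 4 = 2$).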

See [Ruder's summary](http://ruder.io/word-embeddings-softmax/) about other alternatives.


#####  Negative Sampling
Another way of improving performance is to not predict over the whole vocabulary every time but to do negative sampling: predicting the true class and a number of false classes (say 10), rather than all.

See [here](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/) for more details.
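As a rough sketch, the negative sampling loss for one (center, context) pair is $-\log \sigma(u_o^\top v_c) - \sum_k \log \sigma(-u_k^\top v_c)$, where $u_o$ is the output vector of the true context word and the $u_k$ are the output vectors of the sampled negatives. The vectors below are random stand-ins, and negatives are drawn uniformly for brevity, whereas word2vec draws them from a smoothed unigram distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center, true_context, negative_contexts):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)."""
    positive = -np.log(sigmoid(true_context @ center))
    negative = -np.sum(np.log(sigmoid(-(negative_contexts @ center))))
    return positive + negative

rng = np.random.default_rng(0)
vocab_size, dim, num_negatives = 1000, 16, 10
input_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))   # v: center words
output_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))  # u: context words

center_id, context_id = 3, 42  # hypothetical word ids of one training pair

# Assumption: uniform sampling over all words except the true context word.
candidates = np.delete(np.arange(vocab_size), context_id)
negative_ids = rng.choice(candidates, size=num_negatives, replace=False)

loss = negative_sampling_loss(
    input_vectors[center_id],
    output_vectors[context_id],
    output_vectors[negative_ids],
)
print(f"loss = {loss:.4f}")
```

Only 11 of the 1000 output vectors are touched for this training pair, which is where the speedup over a full softmax comes from.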



#### Rigidity

As already mentioned in the section about transfer learning, one of the crucial __problems__ in __language model transfer__ is the potential existence of a __non-overlapping__ part of the __vocabulary__, whereby some words present in the corpus are simply missing from the model. 

It is important to understand that this does not just affect transfer scenarios. It can very well be present in standard use cases, since **we cannot guarantee that all words encountered in production were already present in the training set!** They are simply **out of vocabulary** (OoV).

Some solutions trying to mitigate this problem are:

##### Explicit Out of Vocabulary (OoV) token

The default solution is to add a __general OoV__ token to the vocabulary, so as to sidestep this problem. Often the **"UNK" token** is used for this purpose.

The advantage of this approach is that we can handle __rare words inside the corpus__ that we __explicitly don't want to include in the model__ in just the __same way__, replacing them with UNK.

The disadvantage though is that we definitely __lose information__ on these tokens, which frequently are (new) proper names and are thus crucial e.g. for search scenarios. (If we replace all the customer names with UNK, that would definitely not help document recall for sales...) 
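A minimal sketch of the UNK approach is shown below. The `min_count` threshold, the helper names and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

UNK = "UNK"

def build_vocab(tokenized_corpus, min_count=2):
    """Keep tokens seen at least `min_count` times; everything else maps to UNK."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    return {UNK} | {tok for tok, n in counts.items() if n >= min_count}

def encode(tokens, vocab):
    """Replace every token outside the vocabulary with the UNK token."""
    return [tok if tok in vocab else UNK for tok in tokens]

train = [
    "the customer signed the contract".split(),
    "the customer called".split(),
    "acme signed".split(),
]
vocab = build_vocab(train)

rare = encode("the customer acme signed".split(), vocab)      # "acme": rare in training
unseen = encode("the customer globex called".split(), vocab)  # "globex": never seen
print(rare)
print(unseen)
```

Both the rare name and the unseen name end up as the same token, which illustrates the information loss described above: after encoding, the two customers are indistinguishable.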

##### Dedicated OoV model

A more thorough approach to this problem is the usage of a dedicated __OoV model__, which gets as __input the word's surface form__, and possibly also a __"list of brothers" (potential related words)__, and has __to produce a word vector as prediction output__. Intuitively we could argue that this approach works best if the words are __morphological variants__ of each other.

See an example of such models [here](https://hal.archives-ouvertes.fr/hal-01623784/document).

Beyond these, the approach of **"subword embeddings"** emerged as a preferred choice in practice. See more on that later.

### The problem of word senses

As already mentioned before, it is a problem in all word embeddings that the __different senses__ of a word (like bank(1) and bank(2)) are __treated in a mixed manner__, since the "key", the vocabulary entry, subsumes all the different usages and contexts.

For a more explicit treatment of homonymy, we can try to capitalize on the fact that __homonyms__ __sometimes__ (though definitely _not always_) come __from different parts of speech__. Thus, if we __separate__ the __vocabulary__ entries __by parts of speech__, relying on a POS tagger as an external resource, we can build up more fine grained representations. That is exactly what [sense2vec](https://arxiv.org/abs/1511.06388) and others, like [here](https://www.cs.rochester.edu/~lsong10/papers/area.pdf) and [here](http://www.aclweb.org/anthology/Q15-1017), try to achieve.

<a href="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/36f9886ad1cb9ee3f66c5af0282ae7a3359b86b2/3-Figure2-1.png"><img src="https://drive.google.com/uc?export=view&id=1nJVY5Kmt68-lX9_Wl9n-zxUb-IAJMC-P" width=45%></a>

Furthermore, there is literature which suggests that word senses are represented quite well in the components of word vectors, even to the point that [detection of polysemy](https://arxiv.org/abs/1709.08858) or [explicit capitalization on included word senses](http://www.offconvex.org/2018/09/18/alacarte/) is possible.

Example:
"tie  only has two meanings (clothing and game), it is easy to see that its word embedding is a linear transformation of the sum of the average context vectors of its two senses:"

$$v_w=A\mathbb{E}v_w^\textrm{avg}=A\mathbb{E}\left[v_\textrm{clothing}^\textrm{avg}+v_\textrm{game}^\textrm{avg}\right]=A\mathbb{E}v_\textrm{clothing}^\textrm{avg}+A\mathbb{E}v_\textrm{game}^\textrm{avg}$$

[(source)](http://www.offconvex.org/2018/09/18/alacarte/)

### The problem of longer context

The biggest problem though with rolling window models is that they __enforce the same Markov assumption__ as the n-gram models: they think in a __fixed context window__ and are unable to model dependencies in a broad context. So we are back to square one.

We need full sequence models.

# "Full sequence" models

One of the ways to circumvent the Markov assumption is to use neural models, specifically recurrent neural networks, that are capable of modeling potentially long range dependencies. We consider text as one long sequence and use the appropriate architectures to predict the next elements in the sequence:

<a href="http://drive.google.com/uc?export=view&id=1y8QYr9ftTvXAxgzS-ldnGlijVpmK2l21"><img src="https://drive.google.com/uc?export=view&id=1WJ-wC8yGEhnKcdYP9FJxGh4gHvJCWNaU" width=55%></a>

Some things to note:

- Input is a one-hot vector for the model, and **we use an embedding to transform it to a dense vector** (hint: pretrained representations?)
- Output is softmax over the vocabulary. We can use tricks for that...
- Here a simple RNN is displayed, but LSTMs and other variants can (and should) be used.

## Ok, but what is an RNN?

Please consult Optional material if needed.

## Teaching RNN-s on language data

_In theory_ an RNN could be trained with full GD on the corpus in one go:

<a href="http://drive.google.com/uc?export=view&id=1XsBoRp7cNay3svFLRDv2JEDyC7m7CUdC"><img src="https://drive.google.com/uc?export=view&id=1xYDgVCan47mmVZHstaoZd2qP12dsWyqi" width=50%></a>

- The loss is generally the well-known cross-entropy, which in this case (since the target is a one-hot vector) is:
  $$J^{(i)}(\Theta) = -\log (\hat y[x^{(i+1)}])$$
  the negative logarithm of the probability assigned by the network to the right word / next word.

- For the sake of more frequent updates, and since BPTT for long sequences is very expensive, teaching is done in smaller units with not necessarily the same length.
- The unit is typically one or more sentences, or if the length allows, and we have enough material, a paragraph can be a good candidate.
- Initial state in case of the time-series units: if the boundaries are inside a unit of text, it is important to _transfer the hidden state_ from the previous unit, in other cases initialization can be done by some fixed value.
- (Somewhat misleading) terminology: the length of the "time" unit is _time step_, but sometimes certain implementations call it _minibatch_, though that would generally mean the number of units processed in one go for the sake of computational efficiency.
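The points above can be sketched with a simple (Elman) RNN in NumPy. This is a forward pass only: it computes the next-token cross-entropy per chunk and carries the hidden state from one chunk to the next, with no backpropagation or parameter updates. All sizes, parameter names and token ids are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden_dim = 6, 4, 8

# Randomly initialised parameters of a simple RNN language model.
E = rng.normal(scale=0.1, size=(vocab_size, emb_dim))        # embedding table
W_xh = rng.normal(scale=0.1, size=(emb_dim, hidden_dim))     # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))  # hidden -> logits

def softmax(z):
    z = z - z.max()          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def chunk_loss(token_ids, h):
    """Mean next-token cross-entropy over one chunk; returns (loss, final hidden state)."""
    losses = []
    for current, nxt in zip(token_ids[:-1], token_ids[1:]):
        h = np.tanh(E[current] @ W_xh + h @ W_hh)
        y_hat = softmax(h @ W_hy)
        losses.append(-np.log(y_hat[nxt]))   # J = -log y_hat[x^(i+1)]
    return float(np.mean(losses)), h

text = [0, 3, 1, 4, 2, 5, 0, 3, 1, 4]   # hypothetical token ids of one running text
chunks = [text[0:5], text[4:10]]        # overlap by one token so no transition is lost
h = np.zeros(hidden_dim)                # fixed initial state at the start of the text
chunk_losses = []
for chunk in chunks:
    loss, h = chunk_loss(chunk, h)      # hidden state is carried into the next chunk
    chunk_losses.append(loss)
    print(f"loss = {loss:.3f}")
```

With untrained parameters the loss should sit close to $\log 6 \approx 1.79$, the cross-entropy of a uniform guess over six tokens. In real training, gradients would be propagated only within each chunk, while the hidden state value is passed across the chunk boundary.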

### Parameters of LSTM-architectures

+ An LSTM is a complete layer! The most important parameter of it is the "number of (memory) units", which is the length of the hidden state vector, thus, the memory capacity.
+ It is quite widespread to use multiple LSTM layers ("stacked LSTMs") -- as in the case of ConvNets the hope is, that the layers learn a hierarchy of abstract representations:

<a href="http://wenchenli.github.io/assets/img/GNMT_residual.png"><img src="https://drive.google.com/uc?export=view&id=1cbf6VvnPTkQwJZjv2jZfpC2-w3zxLSbq" width=600 heigth=600></a>

(on the right side a network is shown with skip/residual connections!)
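To get a feeling for how the "number of units" drives model size, the sketch below counts the trainable parameters of an LSTM layer. It assumes the standard formulation with four gate/candidate transforms, a single bias vector per transform, and no peephole connections or projection layer. Some libraries keep two bias vectors per transform, which adds $4 \times \text{units}$ further parameters.

```python
def lstm_param_count(input_dim, units):
    """4 transforms, each with input weights, recurrent weights and one bias vector."""
    return 4 * (units * (input_dim + units) + units)

def stacked_lstm_param_count(input_dim, layer_units):
    """Total for stacked LSTM layers; each layer reads the previous layer's hidden state."""
    total, in_dim = 0, input_dim
    for units in layer_units:
        total += lstm_param_count(in_dim, units)
        in_dim = units
    return total

single = lstm_param_count(input_dim=300, units=512)
stacked = stacked_lstm_param_count(300, [512, 512])
print(single)
print(stacked)
```

The count grows quadratically with the number of units, so doubling the memory capacity roughly quadruples the recurrent weights.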





## Are LSTM-s really necessary?

Please consult Optional material about ConvNets over sequences.

# Some words about usage of language models

Recap: But what is a language model good for?

For example:
- Predictive text input ("autocomplete")
- Generating text
- Spell checking
- Language understanding
- And most importantly representation learning 

## Generating text with a language model

The language model produces a tree with probable continuations of the text:

<a href="https://4.bp.blogspot.com/-Jjpb7iyB37A/WBZI4ImGQII/AAAAAAAAA9s/ululnUWt2vw9NMKuEr-F9H8tR2LEv36lACLcB/s1600/prefix_probability_tree.png"><img src="https://drive.google.com/uc?export=view&id=1S2lC_pg0odAOuQGIafPJbd5IySkTrr8a" width=50%></a>

Using this tree we can try different algorithms to search for the best "continuations". A full breadth-first search is usually impossible, due to the high branching factor of the tree.

Alternatives:
- "Greedy": we choose the continuation which has the highest direct probability. This will most probably be suboptimal, since the probability of the full sequence is the product of the probabilities of the continuations, and if we had chosen a different path, we might have been able to choose later words with high probabilities.
- Beam-search: we always store a fixed $k$ number of partial sequences, and we always try to expand these, always keeping the most probable $k$ from the possible continuations. 
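The difference between the two strategies can be shown on a toy "language model" that is just a lookup table of next-token distributions. The table and its probabilities are invented for this example, and sequences are fixed to two tokens for simplicity.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far.
LM = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.4, "dog": 0.3, "end": 0.3},
    ("a",): {"cat": 0.9, "dog": 0.1},
}

def greedy(lm, steps):
    """Always take the single most probable next token."""
    seq, logp = (), 0.0
    for _ in range(steps):
        token, p = max(lm[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, math.exp(logp)

def beam_search(lm, steps, k):
    """Keep the k most probable partial sequences after every expansion."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [(seq + (tok,), logp + math.log(p))
                      for seq, logp in beams
                      for tok, p in lm[seq].items()]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)

greedy_result = greedy(LM, steps=2)
beam_result = beam_search(LM, steps=2, k=2)
print("greedy:", greedy_result)
print("beam  :", beam_result)
```

Greedy search commits to "the" because it is the most probable first token, and ends with a sequence of probability $0.6 \cdot 0.4 = 0.24$. Beam search with $k=2$ also keeps "a" alive and finds "a cat" with probability $0.4 \cdot 0.9 = 0.36$. With $k=1$ beam search reduces to greedy search.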

Example ($k$=5):

<a href="http://opennmt.net/OpenNMT/img/beam_search.png"><img src="https://drive.google.com/uc?export=view&id=1dmId93EuHxiQYrIXmlceXDj_En0mVXNV" width=50%></a>
 
 

## Scaling up sequence models to larger chunks of text

The big question is: even when we have good quality representations for words in dense vector form, how can we create such a mapping for higher levels of the textual hierarchy, like sentences, paragraphs and documents?

This also opens up two sub questions:
1. (How) can we keep the representation of more complex elements in the same space as the one for words?
2. How much can we learn about long chunks of text at all? If we compress a book into a sentence, does it make sense?


There are more "direct" and "indirect" methods for obtaining representations for longer chunks of texts.



### "Simply add it up"

**"Bag-of-word-vectors"**

We simply take all the word vectors in a unit of text and we calculate some kind of "average".

<a href="https://i.stack.imgur.com/wBu7G.png"><img src="https://drive.google.com/uc?export=view&id=122qUM49HhJ-iHIT84KTFc884JuQ69lLC" width=35%></a>

And with this we get back to problems (and techniques) of BOW.
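A minimal sketch of the averaging approach, using made-up three-dimensional vectors, shows both how simple it is and why the BOW problems return:

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only.
vectors = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man": np.array([0.0, 0.0, 1.0]),
}

def average_embedding(tokens, vectors):
    """Mean of the vectors of all known tokens (unknown tokens are skipped)."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0)

a = average_embedding("dog bites man".split(), vectors)
b = average_embedding("man bites dog".split(), vectors)
print(a)
print("same representation:", np.allclose(a, b))
```

The two sentences get exactly the same vector, since averaging discards word order. Note that this sketch assumes at least one known token per text; an all-unknown input would need separate handling.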

### "Use some weighting over it!"

We have learned from BOW that we would not like to let all words contribute equally to the representation, so we can use whatever we have learned by "normalization": POS, TFIDF, TextRank,...

In short: "Multiply by something, then average."


### A very strong "baseline" - SIF

"A Simple but Tough-to-beat baseline for sentence embeddings" - Sanjeev Arora, Yingyu Liang, Tengyu Ma, ICLR 2017

- [Blog post](http://www.offconvex.org/2018/06/17/textembeddings/)
- [Implementation](https://github.com/PrincetonML/SIF)
- [Paper](https://openreview.net/pdf?id=SyK00v5xx)

<a href="https://user-images.githubusercontent.com/544269/41765307-a3e1392a-763e-11e8-9804-5eefc11f9459.png"><img src="https://drive.google.com/uc?export=view&id=1eiXM1Kp7VZHUFOKhPqc1VWacMnyJgtXU" width=55%></a>

This quite recent model is in a sense "averaging on steroids", where we use a hyperparameter to tune the averaging based on corpus frequency of words, then do SVD and remove the first principal component of the representation. 

SIF embeddings are motivated by the empirical observation that word embeddings have various peculiarities stemming from the training method, which tries to capture word co-occurrence probabilities using vector inner product, and words sometimes occur out of context in documents. These anomalies cause the average of word vectors to have nontrivial components along semantically meaningless directions. SIF embeddings try to combat this in two ways, which I describe intuitively first, followed by more theoretical justification.

__Idea 1: Nonuniform weighting of words__. Conventional wisdom in information retrieval holds that "frequent words carry less signal." Usually this is captured via TF-IDF weighting, which assigns weightings to words inversely proportional to their frequency. We introduce a new variant we call Smoothed Inverse Frequency (SIF) weighting, which assigns to word $w$ a weighting $\alpha_{w}=a /\left(a+p_{w}\right)$ where $p_{w}$ is the frequency of $w$ in the corpus and $a$ is a hyperparameter. Thus the embedding of a piece of text is $\sum_{w} \alpha_{w} v_{w}$ where the sum is over words in it. (Aside: word frequencies can be estimated from any sufficiently large corpus; we find embedding quality to be not too dependent upon this.)

On a related note, we found that folklore understanding of word2vec, viz., expression (1), is false. A dig into the code reveals a resampling trick that is tantamount to a weighted average quite similar to our SIF weighting. (See Section 3.1 in our paper for a discussion.)

__Idea 2: Remove component from top singular direction__. The next idea is to modify the above weighted average by removing the component in a special direction, corresponding to the top singular direction of the set of weighted embeddings of a smallish sample of sentences from the domain (if doing domain adaptation, the component is computed using sentences of the target domain). The paper notes that the direction corresponding to the top singular vector tends to contain information related to grammar and stop words, and removing the component in this subspace really cleans up the text embedding's ability to express meaning.
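The two ideas can be sketched together in a few lines of NumPy. This is a simplified reading of the description above, not the authors' implementation (which is linked at the start of this section). The word vectors are random, the unigram frequencies are made up, and the default `a=1e-3` is an assumed value for the hyperparameter.

```python
import numpy as np

def sif_embeddings(sentences, vectors, word_prob, a=1e-3):
    """SIF-weighted sentence vectors with the top singular direction removed."""
    rows = []
    for tokens in sentences:
        # Idea 1: weight each word vector by a / (a + p_w) and sum.
        weighted = [a / (a + word_prob[t]) * vectors[t] for t in tokens]
        rows.append(np.sum(weighted, axis=0))
    X = np.vstack(rows)
    # Idea 2: find the top right-singular vector of the sentence matrix ...
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    # ... and subtract every sentence vector's projection onto it.
    return X - np.outer(X @ u, u), u

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "dog", "ran"]
vectors = {w: rng.normal(size=4) for w in vocab}   # random stand-in word vectors
# Hypothetical unigram frequencies; "the" is by far the most frequent word.
word_prob = {"the": 0.5, "cat": 0.1, "sat": 0.1, "dog": 0.2, "ran": 0.1}
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

emb, u = sif_embeddings(sentences, vectors, word_prob)
print(emb.shape)
print("projection on removed direction:", emb @ u)
```

After the removal step every sentence vector is orthogonal to the removed direction, so the printed projections should be numerically zero.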

Empirics prevail.

It is indeed VERY competitive with more complex methods.


### "Calculate it separately"


#### Topic + RNN models

Three different approaches for combining sequence and topic models came out during 2016-17:

- [TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency](https://arxiv.org/abs/1611.01702)
- [Topically Driven Neural Language Model - TDLM](https://arxiv.org/abs/1704.08012)
- [Topic Compositional Neural Language Model](https://arxiv.org/abs/1712.09783)

What they have in common is that, besides the predictive LSTM model, they also use a topic model as a kind of mixture. That is, they somehow mix the predictions of the topic part into the final output for next element prediction, for example by concatenating the topic model's output to the input of the LSTMs (or in other, more complicated ways...)

<a href="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/ef3971b8e822820d583edb0ed76a39647b18577c/2-Figure1-1.png"><img src="https://drive.google.com/uc?export=view&id=1jKtDdjzOEKZ0AgSxhWXc-v_TsCldUM-y" width=75%></a>

The main idea is that the production probabilities are influenced simultaneously by the sequence and by some kind of topic mixture.

There are newer, even more complex models in this direction, like [this](https://arxiv.org/abs/1810.03947).


**Takeaway:**

The topic RNN models try to answer the question of building multi layered representations of text in one model. The big question is whether we can do this in a more flexible way.  


# Merging ontologies and prediction

We have already seen that statistical / predictive methods can be used to assist in building up ontologies. The question is whether ontologies can be of any use for predictive models themselves, or whether any kind of "merger" is necessary.

The problem is in itself complex, due to the different nature of the two representations: while graphical knowledge representations are **discrete**, that is, one atomic unit of them represents a "piece" of knowledge, vector representations are **"global"** or distributed, meaning their global "state" encodes the knowledge and we cannot pinpoint any single element as a "carrier". Their differing nature makes merging the two a great challenge.

## Knowledge graphs as constraints

### "Retrofitting"

One of the most basic forms of using knowledge graph information is to modify vector spaces "after the fact" based on the relations captured. That is, in an additional postprocessing optimization step, we apply a vector transformation based on individual relations observed in a knowledge base, e.g. moving the vectors of two given words closer to each other based on synonymy.  

<a href="http://drive.google.com/uc?export=view&id=1E64Wadb_X1Ufqj9iHdMb-uz2yWB0jtB6"><img src="https://drive.google.com/uc?export=view&id=1J9GOdUFSWu_Z13r7aVrSgo9LBH3KZWF0" width=85%></a>

The two important papers describing this approach are [this](https://arxiv.org/abs/1411.4166) and [this](https://arxiv.org/abs/1603.00892); the latter also includes direct synonymy and antonymy information.

Some discussion of "retrofitting" can be found [here](https://becominghuman.ai/enriching-word-vectors-with-lexicon-knowledge-and-semantic-relations-an-efficient-retrofitting-bcb5f1208a3e). 
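A minimal sketch of such a postprocessing step is given below, in the spirit of the iterative update of Faruqui et al.: each vector is repeatedly replaced by a weighted average of its original value and the current vectors of its graph neighbours. The weights (`alpha = 1` for the original vector, `beta = 1 / degree` per edge), the iteration count and the toy data are assumptions of this example.

```python
import numpy as np

def retrofit(original, neighbours, iterations=10, alpha=1.0):
    """Iteratively pull each vector towards its graph neighbours and its original value."""
    new = {word: vec.copy() for word, vec in original.items()}
    for _ in range(iterations):
        for word, nbrs in neighbours.items():
            nbrs = [n for n in nbrs if n in new]
            if word not in new or not nbrs:
                continue                         # nothing to pull this word towards
            beta = 1.0 / len(nbrs)               # equal weight per edge
            neighbour_sum = sum(beta * new[n] for n in nbrs)
            new[word] = (alpha * original[word] + neighbour_sum) / (alpha + beta * len(nbrs))
    return new

# Hypothetical 2-dimensional embeddings and a tiny synonymy graph.
original = {
    "happy": np.array([1.0, 0.0]),
    "glad": np.array([0.0, 1.0]),
    "car": np.array([-1.0, -1.0]),
}
neighbours = {"happy": ["glad"], "glad": ["happy"]}

new = retrofit(original, neighbours)
before = np.linalg.norm(original["happy"] - original["glad"])
after = np.linalg.norm(new["happy"] - new["glad"])
print(f"distance happy-glad: {before:.3f} -> {after:.3f}")
```

The synonyms move closer together without collapsing onto each other, since each vector is still anchored to its original position. Words without edges in the graph, like "car" here, are left unchanged.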

There are also more recent postprocessing methods, like ["extrofitting"](https://arxiv.org/abs/1808.07337), which are noteworthy.

### A nice example: ConceptNet enhanced embeddings

"We can represent the __[ConceptNet](https://arxiv.org/abs/1612.03975) graph as a sparse, symmetric term-term matrix.__ Each cell contains the __sum of the weights of all edges that connect the two corresponding terms__. For performance reasons, when building this matrix, we prune the ConceptNet graph by discarding terms connected to fewer than three edges.

We consider __this matrix to represent terms and their contexts__. In a corpus of text, the context of a term would be the terms that appear nearby in the text; here, the context is the other nodes it is connected to in ConceptNet. 

We can calculate word embeddings directly from this sparse matrix by following the practical recommendations of Levy, Goldberg, and Dagan (2015). As in Levy et al.:

- we determine the pointwise mutual information of the matrix entries with context distributional smoothing,
- clip the negative values to yield positive pointwise mutual information (PPMI), 
- reduce the dimensionality of the result to 300 dimensions with truncated SVD, 
- and combine the terms and contexts symmetrically into a single matrix of word embeddings. 

This gives a matrix of word embeddings we call ConceptNet-PPMI. These embeddings implicitly represent the overall graph structure of ConceptNet, and allow us to compute the approximate connectedness of any pair of nodes.  

__Retrofitting__ [(Faruqui et al. 2015)](https://www.aclweb.org/anthology/N15-1184/) is a process that adjusts an existing matrix of word embeddings using a knowledge graph. Retrofitting __infers new vectors $q_i$ with the objective of being close to their original values, $\hat{q}_i$, and also close to their neighbors in the graph__ with edges $E$, by minimizing this objective function:  

$$ \Psi(Q)=\sum_{i=1}^{n}\left[\alpha_{i}\left\|q_{i}-\hat{q}_{i}\right\|^{2}+\sum_{(i, j) \in E} \beta_{i j}\left\|q_{i}-q_{j}\right\|^{2}\right] $$  

Faruqui et al. give a __simple iterative process__ to minimize this function over the vocabulary of the original embeddings. 

The process of “expanded retrofitting” (Speer and Chin 2016) can optimize this objective over a larger vocabulary, including terms from the knowledge graph that do not appear in the vocabulary of the word embeddings. This effectively sets $\alpha_i = 0$ for terms whose original values are undefined. We set $\beta_{ij}$ according to the weights of the edges in ConceptNet."

<img src="http://drive.google.com/uc?export=view&id=18-i1JTOe4xL0AeOwa_CZRl-f6FL9Sulz" width=65%>
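The four matrix steps quoted above (smoothed PMI, clipping to PPMI, truncated SVD, symmetric combination) can be sketched on a toy term-term matrix as follows. This is only an illustration under stated assumptions: the function name `ppmi_embeddings`, the smoothing exponent 0.75, the square-root weighting of the singular values, and the use of a dense `np.linalg.svd` are choices made here for a tiny example. A real ConceptNet-sized matrix would need a sparse truncated SVD instead.

```python
import numpy as np

def ppmi_embeddings(counts, dim, cds=0.75):
    """Turn a term-context weight matrix into low-dimensional embeddings.

    Assumes every row and column has at least one non-zero entry.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_joint = counts / total
    p_term = counts.sum(axis=1, keepdims=True) / total
    # Context distributional smoothing: raise context totals to the power cds.
    ctx = counts.sum(axis=0, keepdims=True) ** cds
    p_ctx = ctx / ctx.sum()
    with np.errstate(divide="ignore"):
        pmi = np.log(p_joint / (p_term * p_ctx))
    # Clip negative values (including -inf for empty cells) to get PPMI.
    ppmi = np.maximum(pmi, 0.0)
    # Truncated SVD: keep only the `dim` largest singular values.
    u, s, vt = np.linalg.svd(ppmi, full_matrices=False)
    u, s, vt = u[:, :dim], s[:dim], vt[:dim]
    # Assumed symmetric combination of term and context vectors.
    return (u + vt.T) * np.sqrt(s)

# Toy symmetric "graph" matrix (hypothetical edge weights between 4 terms).
counts = np.array([
    [0, 3, 1, 0],
    [3, 0, 0, 1],
    [1, 0, 0, 2],
    [0, 1, 2, 0],
])
embeddings = ppmi_embeddings(counts, dim=2)
print(embeddings.shape)
```

Each row of `embeddings` is the vector of one term; in the quoted setup the target dimensionality is 300 rather than 2.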

### "Conditioning on..."

Hierarchic representations of knowledge can be understood as having a distinct top-down topology. While building a vector space, we can treat this topology as a constraint that we would like to enforce on the final vector representation of the semantic space we are learning.

A [Poincaré disk](https://en.wikipedia.org/wiki/Poincar%C3%A9_disk_model) can be thought of as a special topology that constrains the vector space to lie inside a unit disc, with distances following the (hyperbolic) metric of this space.

<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Poincare_disc_hyperbolic_parallel_lines.svg/600px-Poincare_disc_hyperbolic_parallel_lines.svg.png"><img src="https://drive.google.com/uc?export=view&id=1O-6R01BlmXa6U1t_D5BB1Oa40KAYH2UA" width=55%></a>

The paper describing [Poincaré embeddings](https://arxiv.org/abs/1705.08039) uses this topology to represent hierarchical semantic structure in word2vec-like models.
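As a small illustration of why this space suits hierarchies, the sketch below computes the distance between two points inside the unit disc using the standard Poincaré distance formula, $d(u, v) = \operatorname{arcosh}\left(1 + 2\frac{\|u-v\|^2}{(1-\|u\|^2)(1-\|v\|^2)}\right)$. The function name and the sample points are chosen here for illustration only.

```python
import numpy as np

def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit disc."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom)

# Distance from the origin to (0.5, 0) equals ln(3).
print(poincare_distance([0.0, 0.0], [0.5, 0.0]))

# Two pairs with the same angular separation: one near the centre,
# one near the boundary of the disc.
near_centre = poincare_distance([0.1, 0.0], [0.0, 0.1])
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
print(near_centre, near_boundary)
```

Distances grow rapidly towards the boundary, so there is "more room" near the edge of the disc. This is what lets general concepts sit near the centre while the many specific concepts below them spread out towards the boundary.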

The resulting vector models encode all semantic relatedness information but lay special emphasis on **learned hierarchic relations**. 

<a href="https://s3.ap-south-1.amazonaws.com/techleerimages/8435c41a-a112-41f7-88c0-b8edb0d4365c.jpg"><img src="https://drive.google.com/uc?export=view&id=13kdQkrmb8rHKurFAlZUDshJymY9MVr3S" width=55%></a>

Downstream tasks could further benefit from this kind of structure.

A good discussion of Poincaré embeddings can be found [here](https://s3.ap-south-1.amazonaws.com/techleerimages/8435c41a-a112-41f7-88c0-b8edb0d4365c.jpg).

## Knowledge graphs as additional sources of input

As mentioned earlier, the recent emergence of neural models capable of handling graph-based input, like graph-convolutional and graph-recursive neural networks, raises the possibility of including graph-based information in the training of neural models in a (quasi) end-to-end manner. One [recent work](https://arxiv.org/pdf/1902.07282.pdf) in neural machine translation specifically uses semantic labeling information, produced by a separate model, to enhance an attentive model (see later) and thus boost translation performance.

<a href="http://drive.google.com/uc?export=view&id=1ysmQSLfBjU9y39BVJNXCSxh0pBjTKlEG"><img src="https://drive.google.com/uc?export=view&id=1k1i6j58GKwSgjROUUnIxyo9AEGo_hu6r" width=85%></a>

With this approach the authors were able to improve on the state of the art in NMT.
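To give a feeling for what "handling graph-based input" means in practice, here is a minimal sketch of a single graph-convolution layer with the widely used normalized-adjacency propagation rule: every node's new representation is a degree-normalized mix of its own and its neighbours' features, followed by a linear map and a ReLU. This is a generic illustration and not the architecture of the cited work; the function name, the toy graph, and the random weights are assumptions.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: normalize, aggregate neighbours, transform."""
    # Add self-loops so each node also keeps its own features.
    a_hat = adjacency + np.eye(adjacency.shape[0])
    degree = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalization: D^(-1/2) (A + I) D^(-1/2)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Aggregate, apply the linear map, then a ReLU non-linearity.
    return np.maximum(a_norm @ features @ weights, 0.0)

# Toy graph (hypothetical): three nodes on a path 0 - 1 - 2.
adjacency = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4))  # one 4-dimensional feature vector per node
weights = rng.normal(size=(4, 2))   # would be learned during training
hidden = gcn_layer(adjacency, features, weights)
print(hidden.shape)
```

Stacking several such layers lets information travel along longer paths in the graph, and the resulting node representations can be fed to a downstream model such as a translation decoder.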


With the recent performance gains of semantic taggers (like Google's [SLING](https://ai.googleblog.com/2017/11/sling-natural-language-frame-semantic.html)), this direction seems more and more viable as a kind of "pipeline" approach to enhancing the final performance of models.

## Direct ontology reasoning with neural models

A noteworthy [recent paper](https://arxiv.org/pdf/1808.07980.pdf) even tries to completely "subsume" the question of ontological reasoning over graphs.

As the authors write:

"The ability to conduct logical reasoning is a fundamental aspect of intelligent behavior, and thus an important problem along the way to human-level artificial intelligence. Traditionally, symbolic logic-based methods from the field of knowledge representation and reasoning have been used to equip agents with capabilities that resemble human logical reasoning qualities. More recently, however, there has been an increasing interest in using machine learning rather than symbolic logic-based formalisms to tackle these tasks. In this paper, we employ state-of-the-art methods for training deep neural networks to devise a novel model that is able to learn how to effectively perform logical reasoning in the form of basic ontology reasoning. This is an important and at the same time very natural logical reasoning task, which is why the presented approach is applicable to a plethora of important real-world problems. We present the outcomes of several experiments, which show that our model learned to perform precise ontology reasoning on diverse and challenging tasks. Furthermore, it turned out that the suggested approach suffers much less from different obstacles that prohibit logic-based symbolic reasoning, and, at the same time, is surprisingly plausible from a biological point of view."

If these models prove to be widely applicable, that would open up new possibilities in merging logical reasoning with distributed models.