# Naive modeling: Frequencies

Following the "frequentist" school of statistics, a straightforward solution is to __model__ the probability distribution __simply with the co-occurrence statistics__ of words. 
Our solution is thus: **take huge corpora and count!**


## On the level of words: co-occurrence

The simplest approach, given our "mini corpus", is:

<a href="http://drive.google.com/uc?export=view&id=1knwiBQem4174oOsc02VzBv4QpTYNbqA9"><img src="https://drive.google.com/uc?export=view&id=1t-avm9QHCveWfZ2tedZZrm2Aoi2PRkr6" width=45%></a>

This table will represent our language model.
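
As a minimal sketch of how such a table could be built by counting: the three-sentence corpus below is a made-up stand-in (the actual mini corpus is only shown in the figure), and "co-occurrence" is assumed here to mean "appearing in the same sentence". The helper name `cooccurrence` is hypothetical as well.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini corpus; not the one from the figure above.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

cooc = Counter()
for sentence in corpus:
    # Use the set of words so a repeated word is counted once per sentence.
    words = sorted(set(sentence.split()))
    for w1, w2 in combinations(words, 2):
        cooc[(w1, w2)] += 1  # pairs are stored in alphabetical order

def cooccurrence(w1, w2):
    """Number of sentences in which both words appear."""
    return cooc[tuple(sorted((w1, w2)))]

print(cooccurrence("cat", "the"))  # sentences 1 and 3
print(cooccurrence("cat", "dog"))  # sentence 3 only
```

Other window definitions (a fixed number of neighbouring words, whole documents) only change the inner loop; the "model" remains a table of counts.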



## On the level of "documents": Bag of ...

We can treat __any units of text__, like sentences, paragraphs and documents as **"bag-of-words"**, thus effectively modeling them as __statistical tables of frequencies__.

Some say that this is an exhaustive and sufficient description of meaning. In certain cases they might be right. :-)

<a href="https://i.imgur.com/L5MDTcO.jpg"><img src="https://drive.google.com/uc?export=view&id=1VhKdexEdhneHmidnULg2bYl9LWRVPk6f" width=25%></a>

The data representation for this approach is:

<a href="http://www.rdatamining.com/_/rsrc/1421498854941/examples/social-network-analysis/term-doc-matrix.png"><img src="https://drive.google.com/uc?export=view&id=1DxUt0_ZqfWMVmA63Qb_LC_a0QWazoeFH"></a>
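
A term-document matrix like the one in the figure can be sketched in a few lines. The three toy documents and all variable names below are illustrative assumptions, not part of the original material.

```python
from collections import Counter

# Hypothetical documents, used only for illustration.
docs = {
    "d1": "it is a puppy",
    "d2": "it is a cat",
    "d3": "that is a dog and this is a pen",
}

# One bag of words (word -> count) per document; word order is discarded.
bags = {name: Counter(text.split()) for name, text in docs.items()}

# A shared, sorted vocabulary defines the rows of the matrix.
vocabulary = sorted(set(word for bag in bags.values() for word in bag))

# term_doc[i][j] = how often vocabulary[i] occurs in the j-th document.
doc_names = list(docs)
term_doc = [[bags[name][word] for name in doc_names] for word in vocabulary]

for word, row in zip(vocabulary, term_doc):
    print(f"{word:>6}", row)
```

Each column of this matrix is the frequency table of one document, and each row shows how one word is distributed over the documents.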



### Vector space

With a slight reformulation of this approach we can imagine that the words form separate **dimensions of a space**, and thus the documents are **vectors in this space**. The presence of a __word__ in a document __influences the vector's "angle"__, and its __frequency determines the "length"__ along that dimension.

<a href="http://blog.christianperone.com/wp-content/uploads/2013/09/vector_space.png"><img src="https://drive.google.com/uc?export=view&id=1-ZHcX48twMasyUzM3G7Cm6N4BRjmbNwq" width=45%></a>


(This is "osgooding", but based on corpus counts.)




### N-grams

Naturally, __thinking only in distinct words (tokens) can cause problems__ in itself: in the case of "United States of America" we would not want to process the vector directions of "United" and especially "of" as separate elements, but would like to handle the whole as __one expression__. (And this is definitely _not_ just about proper names, see "get up"...)

In this case one of the __possibilities__ is to use __part-of-speech (POS)__ information explicitly, that is, to go for __"noun phrase embeddings"__ or similar. There is literature pointing in this direction (see e.g. [here](http://www.aclweb.org/anthology/Q15-1017)), and it is also true that utilizing POS information for differentiating (some) word senses (like bank(1) and bank(2)) can be useful, see for example [sense2vec](https://arxiv.org/abs/1511.06388) and [others](https://www.cs.rochester.edu/~lsong10/papers/area.pdf). 

Nonetheless, the __most widespread solution__ is the **inclusion of N-grams** or **skip-grams**, that is, __consecutive or "skipped" combinations of N words for modeling__.

Example: uni-, bi- and trigrams

<a href="https://i.stack.imgur.com/8ARA1.png"><img src="https://drive.google.com/uc?export=view&id=1pMgv_5Xd6amdyEUif-66WyRf-FzR8cxQ" width=45%></a>

Example: 1-skip-bigrams:

In [1]:
text = "The quick brown fox jumps over the lazy dog ."
print(text,"\n")
# Naive whitespace tokenization
tokens = text.split(" ")
print(tokens,"\n")
for position in range(len(tokens)):
    # Adjacent pair: a plain bigram (0 words skipped)
    if position+1 < len(tokens):
        print(tokens[position], tokens[position+1])
    # Pair with exactly one word skipped in between
    if position+2 < len(tokens):
        print(tokens[position], tokens[position+2])

The quick brown fox jumps over the lazy dog . 

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'] 

The quick
The brown
quick brown
quick fox
brown fox
brown jumps
fox jumps
fox over
jumps over
jumps the
over the
over lazy
the lazy
the dog
lazy dog
lazy .
dog .


**We add these N-grams to the vocabulary and count the frequencies accordingly.**
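
A minimal sketch of this counting step, generalized to arbitrary $n$ and to $k$-skip-bigrams. The helper names `ngrams` and `skip_bigrams` are our own, and "k-skip" is taken to mean "at most k tokens skipped", which matches the listing above.

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    """Pairs of tokens with at most k tokens skipped between them."""
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + k + 2, len(tokens)))
    ]

tokens = "The quick brown fox jumps over the lazy dog .".split()

# Vocabulary extended with bigrams and trigrams, all counted together.
counts = Counter()
for n in (1, 2, 3):
    counts.update(ngrams(tokens, n))

print(len(ngrams(tokens, 2)), len(ngrams(tokens, 3)))  # 9 bigrams, 8 trigrams
print(len(skip_bigrams(tokens, 1)))                    # the 17 pairs listed above
```

Note that the vocabulary (and with it the dimensionality of the vector space) grows quickly as higher-order N-grams are added.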



### Sidenote: some assumptions 

Observe that, with a _gross_ simplification, we assume that the __distribution only depends on the previous $n-1$ words__ (where typically $n \le 4$), thus we assume a __Markov chain of order $n-1$__:

 $$P(w ~\vert ~ w_1,\dots,w_k) = P(w ~\vert ~ w_{k- n + 2},\dots,w_k)$$

We simply compute these probabilities in a frequentist style by calculating the $n$-gram statistics of the corpus at hand:

$$P(w_2 ~\vert ~w_1) = \frac{c(\langle w_1, w_2 \rangle)}{c(w_1)}$$

$$P(w_{k+1} \vert~ w_1,\dots,w_k) = \frac{c(\langle w_1,\dots,w_k, w_{k+1} \rangle)}{c(\langle w_1, \dots, w_k\rangle)}$$
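
The bigram case of the formulas above can be sketched directly with two counters. The toy token sequence and the function name `p_mle` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
tokens = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(w2, w1):
    """P(w2 | w1) = c(<w1, w2>) / c(w1); 0.0 if w1 was never seen."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_mle("cat", "the"))  # "the cat" occurs 2 times, "the" 3 times
print(p_mle("dog", "the"))  # unseen bigram: probability 0
```

The second call already shows the weak spot of pure counting: anything that did not occur in the corpus gets probability zero.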

Please note, that in this case we are using __"memorization"__, a form of database learning, with __minimal compression__ - "counting".

### Smoothing

Or: _"What to do with too rare or infrequent words?"_

What do we do when the given __$n$-grams occur rarely or never__? We have to employ some __smoothing__ solution, such as: 

##### Additive smoothing
We pretend that we have seen each $n$-gram more often than we actually did, by a __fixed number $\delta$__ (the simplest case is $n=2$):

$$P(w_{i+1} ~\vert ~w_{i-n+2},..,~w_{i}) = \frac{c(\langle w_{i-n+2},.., w_{i}, w_{i+1} \rangle) + \delta}{\sum_{w\in V} [c(\langle w_{i-n+2},.., w_i, w\rangle) + \delta]}$$

where $V$ is the set of all words (the vocabulary).

A widespread choice is $\delta = 1$ (add-one or Laplace smoothing).
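
A minimal sketch of additive smoothing for bigrams ($n=2$). The toy corpus and the name `p_additive` are hypothetical; the vocabulary is assumed to be just the set of words seen in that corpus.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
tokens = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(tokens))
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_additive(w2, w1, delta=1.0):
    """Add-delta estimate of P(w2 | w1) over the vocabulary `vocab`."""
    numerator = bigram_counts[(w1, w2)] + delta
    denominator = sum(bigram_counts[(w1, w)] + delta for w in vocab)
    return numerator / denominator

print(p_additive("cat", "the"))  # seen bigram
print(p_additive("sat", "the"))  # unseen bigram, but no longer zero
print(sum(p_additive(w, "the") for w in vocab))  # still sums to 1 (up to rounding)
```

Every unseen continuation of "the" receives exactly the same probability here, regardless of how frequent the word is on its own.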

Problem:
- If $c(\langle w_1,w_2\rangle)=0$ and $c(\langle w_1,w_3\rangle)=0$, then under additive smoothing

$$ P(w_2 ~\vert~ w_1)=P(w_3 ~\vert~ w_1)$$

- Suppose that __$w_2$__ is much __more common than $w_3$__. Then we should have the following, so the result from additive smoothing seems wrong:

$$ P(w_2 ~\vert~ w_1)>P(w_3 ~\vert~ w_1)$$

- Solution: interpolate between bigram and unigram models

##### Interpolation

In case of bigrams, we add - with a certain weight - the probabilities coming from the unigram frequencies:

$$P(w_2 ~\vert ~w_1)_{\mathrm{interp}} = \lambda_1\frac{c(\langle w_1, w_2 \rangle)}{c(w_1)} + (1 - \lambda_1)\frac{c(w_2)}{\sum_{w\in V}c(w)}$$

Recursive solution for arbitrary $k$:

$$P(w_{k+1} \vert~ w_1,\dots,w_k)_\mathrm{interp} = \lambda_k\frac{c(\langle w_1,...,w_k, w_{k+1} \rangle)}{c(\langle w_1, \dots w_k\rangle)} + (1-\lambda_k)P(w_{k+1} \vert~ w_2,\dots,w_k)_\mathrm{interp}$$

$\lambda_k$ is set empirically by examining the corpus, typically with the [Expectation Maximization algorithm](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm), which - as we have mentioned - iteratively tunes the parameters to maximize the likelihood.
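
A sketch of the bigram/unigram interpolation. The weight `lam=0.7` is an arbitrary placeholder (it is not tuned by EM here), the corpus is a toy example, and the unigram term is taken to be the relative frequency of the predicted word.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
tokens = "the cat sat on the mat the cat ran".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total = sum(unigram_counts.values())

def p_interp(w2, w1, lam=0.7):
    """lam * bigram MLE + (1 - lam) * unigram MLE of the predicted word w2."""
    c1 = unigram_counts[w1]
    p_bigram = bigram_counts[(w1, w2)] / c1 if c1 else 0.0
    p_unigram = unigram_counts[w2] / total
    return lam * p_bigram + (1 - lam) * p_unigram

# Both bigrams are unseen, but the more frequent word gets more probability.
print(p_interp("the", "sat"), p_interp("ran", "sat"))
```

Unlike additive smoothing, two unseen bigrams are no longer forced to have the same probability.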


A good overview of smoothing methods: [MacCartney, NLP Lunch Tutorial: Smoothing](https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf)




### "Normalizing" with meaningful words

It is fairly obvious, that __not all words contribute equally to the meaning of a text__, or to put it another way: from the viewpoint of information theory, **not all words have the same added information**. (For more information theoretic considerations see: [here](https://ccc.inaoep.mx/~villasen/index_archivos/cursoTL/articulos/Aizawa-tf-idfMeasures.pdf).) 

Certain classes of words are used mainly for __syntactic purposes__ (e.g. pronouns), thus __their information content__ is fairly __low__ from a __semantic perspective__.

It is also true, as noted above, that they are very frequent in the corpus, in line with ["Zipf's law"](https://en.wikipedia.org/wiki/Zipf%27s_law).


#### TF-IDF (family)

One of the possible solutions for this is to "smooth" based on word frequencies. The most widespread solution for this is **"term frequency–inverse document frequency" or [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)**.

The basic intuition is that the words which are **generally frequent** in documents are less informative than the ones that are **generally rare, but frequent in a small subset of documents**. The latter represent the "discriminative information" for that subset.

INTUITION BEHIND TF-IDF: make words that are rare overall but common in a subset of the documents more prominent, and effectively ignore common words.



<a href="https://cdn-images-1.medium.com/max/1600/1*8XpbsR4HdAHBXy5MgpIyug.png"><img src="https://drive.google.com/uc?export=view&id=1l1ocR4yuVFJFvcWQahW-DJ0ZznOdnFUQ" width=45%></a>


<a href="https://cdn-images-1.medium.com/max/1600/1*jNnpbGPxkjehlvTCXq9B8g.png"><img src="https://drive.google.com/uc?export=view&id=17MPUwfV-oc785DRTzqONMMzj2o8pLy_e" width=45%></a>

The __logarithm__ turns __1 into 0__, and makes __large numbers__ (those much greater than 1) __smaller__. (More on this later.) Then a word that appears in __every single document__ will be effectively __zeroed out__, and a word that appears in very __few documents__ will have an even __larger count__ than before.

For the effects of feature scaling see [here](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/ch04.html).

Let’s look at some pictures to understand what it’s all about. The next figure shows a simple example that contains four sentences: “it is a puppy,” “it is a cat,” “it is a kitten,” and “that is a dog and this is a pen.” We plot these sentences in the feature space of three words: “puppy,” “cat,” and “is.”

<a href="https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/assets/feml_0402.png"><img src="https://drive.google.com/uc?export=view&id=1tZQRJrk0FHZlMOT1zV3918ZCcG-U9oD1" width=45%></a>
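
The same four sentences can be used to sketch the weighting numerically. The exact variant is an assumption on our side: raw term count multiplied by $\log(N / \mathrm{df})$, with $N$ the number of documents and $\mathrm{df}$ the number of documents containing the word; libraries often use smoothed variants.

```python
import math
from collections import Counter

docs = [
    "it is a puppy",
    "it is a cat",
    "it is a kitten",
    "that is a dog and this is a pen",
]
bags = [Counter(doc.split()) for doc in docs]
n_docs = len(bags)

# Document frequency: in how many documents does each word occur?
df = Counter(word for bag in bags for word in bag)

def tfidf(word, bag):
    """Raw term count times log(N / df); assumes the word occurs in some document."""
    return bag[word] * math.log(n_docs / df[word])

print(tfidf("is", bags[3]))     # "is" occurs in every document: zeroed out
print(tfidf("puppy", bags[0]))  # "puppy" occurs in one document: log(4)
```

After this rescaling the "is" axis collapses, so the sentences are separated only by the words that actually distinguish them.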

It is noteworthy that TF-IDF is one of the (most basic) methods for locating keywords in a document, thus we can say, that in principle **all keyword extraction methods** can be used to do smoothing for frequency counts (or even other vector representations - for that matter).





#### TextRank (family)

Another group of unsupervised keyword extraction techniques is **TextRank** and its many variants (original publication [here](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)), which build on the PageRank algorithm of Google founders Page and Brin.

Its basis is a **co-occurrence graph** of words, which is in itself a filtered, "sparsified" co-occurrence matrix.

<a href="https://i.stack.imgur.com/ohF5r.png"><img src="https://drive.google.com/uc?export=view&id=1ACpwqO2WAKGlMfy7Uwwmv0CtKuB9ENHY" width=35%></a>

We then simulate a **random walk over this graph** to approximate some kind of "centrality"-like metric for the nodes, that is, to find the "key" words.

<a href="https://i1.wp.com/1.bp.blogspot.com/-5DGkqiLF87U/Uzqm0Vah16I/AAAAAAAABIE/aPgVRreUvts/s1600/g4.gif?w=456"><img src="https://drive.google.com/uc?export=view&id=1LLMmU-Msvpu0dPQ8oW1sip9EFg8Kd2Xm" width=35%></a>

[Source](https://www.r-bloggers.com/from-random-walks-to-personalized-pagerank/)
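
The random walk can be approximated with a simple power iteration. This is a sketch under assumptions: the tiny graph is hypothetical, edges are unweighted, and the update is the normalized PageRank form with the customary damping factor of 0.85 rather than the exact formulation of the TextRank paper.

```python
# Hypothetical undirected co-occurrence graph: word -> neighbouring words.
graph = {
    "language": ["model", "natural", "processing"],
    "model": ["language", "statistical"],
    "natural": ["language", "processing"],
    "processing": ["language", "natural"],
    "statistical": ["model"],
}

def textrank(graph, damping=0.85, iterations=50):
    """Power iteration of the PageRank update on an unweighted graph."""
    n = len(graph)
    scores = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            # Each neighbour passes on an equal share of its own score.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) / n + damping * incoming
        scores = new_scores
    return scores

scores = textrank(graph)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:<12}{score:.3f}")
```

The best connected word ends up with the highest score and would be returned as the top keyword candidate.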


#### Alternatives for unsupervised keyword extraction

Over and beyond these techniques numerous variants exist.

Some alternatives for unsupervised keyword extraction:

- [RAKE](https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470689646.ch1)
- [OKAPI BM25](https://en.wikipedia.org/wiki/Okapi_BM25)
- [SGRank](http://www.aclweb.org/anthology/S15-1013)
- [DivRank](http://dx.doi.org/10.1145/1835804.1835931)

And more recent advancements: 
[here](http://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14377/14168),
[here](http://arxiv.org/abs/1702.04457), [here](http://arxiv.org/abs/1801.04470), [here](https://arxiv.org/abs/1803.08493) and [here](https://arxiv.org/abs/1811.10831).

#### Sidenote: From vectors to keywords

As mentioned in some of the publications above, there is a "reverse" way in which we __start from some available (typically pre-trained) semantic vector model__, __transform__ the __documents__ and their words __separately__ with the help of those embeddings, and try to detect the __most salient keywords by comparing the vector representations of documents and words__. 

**The words closest to the document vector can be considered the keywords.**

In certain cases we use this "reverse technique".

See for example [here](https://arxiv.org/abs/1710.07503)

#### Inverse: The difficult case of stopwords

The inverse problem of __finding "keywords"__, namely the __listing__ of so-called __"stop words"__, which __do not add (too much) to the semantics of the given text__, is also intriguing. We have a strong assumption that the most frequent words are not informative.

There have been some attempts to explicitly create stopword lists based on corpus statistics, like [here](http://terrierteam.dcs.gla.ac.uk/publications/rtlo_DIRpaper.pdf).

The basic idea is, that __somewhere after the most frequent stopwords__ the __"domain keywords" start__, then the whole distribution finishes with "noise".

<a href="https://image.slidesharecdn.com/stopwords-140602111606-phpapp01/95/on-stopwords-filtering-and-data-sparsity-for-sentiment-analysis-of-twitter-13-638.jpg"><img src="https://drive.google.com/uc?export=view&id=1Bf2Yh6vkQGYAItOuelD12go-Bx598azR" width=65%></a>

([source](https://www.slideshare.net/Staano/stopwords))

This sounds nice, but often **does not work in practice**.

More on types of stopwords: [here](http://text-analytics101.rxnlp.com/2014/10/all-about-stop-words-for-text-mining.html)

Empirical observation tells us:
It __does not always help to remove stopwords__, or at least it is controversial, see [here](https://ieeexplore.ieee.org/document/7375527).


### Problems with n-gram vector spaces

### Size!

On a large enough corpus, the __memory footprint__ of the __$n$-gram__ models is __huge__, e.g. for the 1T n-gram corpus of Google ([see here](https://catalog.ldc.upenn.edu/LDC2006T13)), containing 1,024,908,267,229 tokens, the $n$-gram counts are as follows:
- unigrams: 13,588,391
- bigrams: 314,843,401
- trigrams: 977,069,902
- fourgrams: 1,313,818,354
- fivegrams: 1,176,470,663

### Curse of dimensionality

Consider the following table, which shows the edge length of a hypercube that covers a given fraction $f$ of the volume of the $[0,1]^D$ unit hypercube, for $D$ dimensions ($\sqrt[D]{f}$):

In [2]:
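import pandas as pd

dims = list(range(1,11))
fracs = [0.2, 0.01, 0.001,0.0001]
# For each volume fraction f, the required edge length in D dimensions is f**(1/D).
result = []
for frac in fracs:
    result.append([frac**(1/dim) for dim in dims])
# In 1 dimension the edge length equals f itself, so that column is labeled "f".
pd.DataFrame(result, columns = ["f"] +[f"{dim} dims" for dim in dims[1:]])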

| | f | 2 dims | 3 dims | 4 dims | 5 dims | 6 dims | 7 dims | 8 dims | 9 dims | 10 dims |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2 | 0.447214 | 0.584804 | 0.66874 | 0.72478 | 0.764724 | 0.794597 | 0.817765 | 0.836251 | 0.85134 |
| 1 | 0.01 | 0.1 | 0.215443 | 0.316228 | 0.398107 | 0.464159 | 0.517947 | 0.562341 | 0.599484 | 0.630957 |
| 2 | 0.001 | 0.031623 | 0.1 | 0.177828 | 0.251189 | 0.316228 | 0.372759 | 0.421697 | 0.464159 | 0.501187 |
| 3 | 0.0001 | 0.01 | 0.046416 | 0.1 | 0.158489 | 0.215443 | 0.26827 | 0.316228 | 0.359381 | 0.398107 |


#### Increasing sample requirements

The figures show that if we train an ML algorithm using one feature on samples covering 20% of the population, and then start adding new features, we need dramatically more samples to maintain the same coverage of the feature space: concretely, even for 3 features we would have to cover almost 60% of the range of each feature:

<a href="http://www.visiondummy.com/wp-content/uploads/2014/04/curseofdimensionality.png"><img src="https://drive.google.com/uc?export=view&id=1rZ3tRLxGoDRi4KgHTwaRUi5L_BSPGMx-" width=50%></a>

([image source](http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/))

#### Problems with "neighborhoods"

Another strange effect of high dimensionality is that the ratio of normal/central examples that are close to the centroid of the population radically decreases, because of the strange behavior of hyperspheres. E.g., considering again a $[0,1]^D$ feature space and an inscribed hypersphere with a radius of $0.5$, the volume of the hypersphere, and consequently the ratio of central examples, tends to 0 as $D$ grows (even though the spheres touch all sides of the hypercubes they are inscribed in):

<a href="http://www.visiondummy.com/wp-content/uploads/2014/04/sparseness.png"><img src="https://drive.google.com/uc?export=view&id=1kud-R8BRUkKqSDxchlb5ZUIwuvfAHzQ2" width=50%></a>

<a href="http://www.visiondummy.com/wp-content/uploads/2014/04/hypersphere.png"><img src="https://drive.google.com/uc?export=view&id=1eXFwO8Kalj2xUdIXryuj42RzoVvWBpDm" width=30%></a>

([image source](http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/))
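
The shrinking of the inscribed hypersphere can be checked numerically with the standard volume formula of a $D$-dimensional ball, $V_D(r) = \frac{\pi^{D/2}}{\Gamma(D/2 + 1)} r^D$. The function name below is our own.

```python
import math

def inscribed_sphere_fraction(dim, radius=0.5):
    """Volume of a `dim`-dimensional ball of the given radius.

    For radius 0.5 this equals the fraction of the unit hypercube
    [0, 1]^dim that the inscribed hypersphere occupies.
    """
    return math.pi ** (dim / 2) * radius ** dim / math.gamma(dim / 2 + 1)

for dim in (1, 2, 3, 5, 10, 20):
    print(dim, inscribed_sphere_fraction(dim))
```

In 2 dimensions the circle still covers $\pi/4$ of the square, in 3 dimensions the sphere covers $\pi/6$ of the cube, and by 10 dimensions the fraction is already below 1%.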


Please observe that in our raw representation of "counts" we did not "learn" too much in the sense of **representation learning**, since there is no **compression** of the data, just **memorization**.



## Decomposition

Or: _"What should we do with these huge co-occurrence matrices?"_

The solution is: **dimension reduction**.

We would like to capture the main sources of variance in the matrices and to "learn" a lower dimensional, "compressed" representation of them. (We can also suspect that this lower dimensional representation is something closer to the meaning space behind the produced language, that is: its **latent structure**.)

The main technique which has been used is **"principal component analysis"** (and its variants).

(Whoever attended the class on deep learning, is not really surprised now.)


<a href="https://www.bogotobogo.com/python/scikit-learn/images/CompressData-1-DimensionalityReduction-PCA/PrincipalDirection.png"><img src="https://drive.google.com/uc?export=view&id=1RyGgmJ4NeuKtnYIc4sPBVXVRj166SRWS" width=65%></a>

About dimensionality reduction:
- [here](https://en.wikipedia.org/wiki/Dimensionality_reduction) and
- [here](https://arxiv.org/pdf/1403.2877.pdf)


We search for the rotation that "explains" the largest amount of variance in our data.


The most popular method in semantics is **Latent Semantic Indexing (LSI)**,
see [here](https://en.wikipedia.org/wiki/Latent_semantic_analysis).

The assumption again is that the observed co-occurrences are explainable from the perspective of latent meanings, from **"concepts"**, which are brought forward by the dimension reduction.

<a href="https://technowiki.files.wordpress.com/2011/08/diagram2.png"><img src="https://drive.google.com/uc?export=view&id=1O45EowtprBGkL2s_qC2JlxK_5fpb9EMQ" width=25%></a>


LSI is not exactly PCA, see [here](https://irthoughts.wordpress.com/2007/05/05/pca-is-not-lsi/), but close enough.

There is also a probabilistic extension of LSI called [PLSI](https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis), which explicitly states that the probabilities for generating words come directly from the **"topic model"**.
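
As a rough sketch of the LSI idea: a truncated singular value decomposition of a term-document matrix, keeping only the $k$ strongest "concepts". The tiny count matrix and the choice $k=2$ are made up for illustration; no mean-centering is applied, which is one of the ways this differs from textbook PCA.

```python
import numpy as np

# Hypothetical term-document count matrix: rows = terms, columns = documents.
terms = ["cat", "kitten", "puppy", "dog"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent "concepts" to keep
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T      # one k-dimensional row per document
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # best rank-k approximation of X

print(np.round(s, 3))
print(np.round(X_approx, 3))
```

In the reduced space the two "cat-like" documents collapse onto the same point and the two "dog-like" documents onto another, which is the kind of latent "concept" structure the decomposition is meant to expose.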

### Does decomposition help mitigate size?

Well, if it fits into memory...

The problem is that for PCA-like approaches it is a prerequisite to have the whole matrix in memory for the decomposition to happen, thus LSI and the like are often **prohibitively expensive** in case of a decently big vocabulary and / or many documents. 

There have been some attempts to apply different algorithms like [incremental pca](https://scikit-learn.org/stable/auto_examples/decomposition/plot_incremental_pca.html) to the problem, but generally this turned out to be a limited success. 

**This is a major motivation for using different, "streaming compatible" models!**


# How to use vector spaces?

## Basic querying: Cosine distance

<a href="https://slideplayer.com/slide/5993529/20/images/16/Documents+%26+Query+in+n-dimensional+Space.jpg"><img src="https://drive.google.com/uc?export=view&id=1Yi4FdZRy8cUqdAfSYpOyz0-ZNEPQKeEa" width=45%></a>

<a href="http://blog.christianperone.com/wp-content/uploads/2013/09/Dot_Product.png"><img src="https://drive.google.com/uc?export=view&id=1ja1csLYW2zj8_YoPBqSaZQDGj5QizQ5I" width=25%></a>

## Alternative distance metrics

### Euclidean distance

<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Euclidean_distance_3d_2_cropped.png/1024px-Euclidean_distance_3d_2_cropped.png"><img src="https://drive.google.com/uc?export=view&id=1QUWzWVynns5SvAXm3hhH38_Zl9v5T5SW" width=35%></a>

With frequency-based semantic vectors we would naively be tempted to go for full Euclidean distance calculations, but that would __disregard__ the fact that a query and a document contain words with __radically different frequencies__, so we might be better served by the cosine approach, which focuses on relative differences in meaning regardless of frequency.
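
A small numeric sketch of this difference. The three count vectors are invented: a short query, a long document with the same word proportions, and a short document on a partly different topic.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between the end points of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query = [1, 1, 0]        # short query
long_doc = [10, 10, 0]   # same word proportions, ten times the counts
other_doc = [1, 0, 1]    # short document, only partly overlapping

for name, doc in (("long_doc", long_doc), ("other_doc", other_doc)):
    print(name, cosine_similarity(query, doc), euclidean_distance(query, doc))
```

Euclidean distance ranks `other_doc` as the closer one simply because it is short, while cosine similarity ranks `long_doc` first because it points in exactly the same direction as the query.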


### Earth mover's (Wasserstein) distance


<a href="http://drive.google.com/uc?export=view&id=1L1jRYCjfRbQjGyqVKssLB2pfOc47Mch7"><img src="https://drive.google.com/uc?export=view&id=1bEkjN-KOAmlyNU9zw5j_btJLAuoGIjvx" width=40%></a>


<a href="https://slideplayer.com/slide/4511821/15/images/30/Option+3%3A+The+Earth+Mover+Distance+%28EMD%29.jpg"><img src="https://drive.google.com/uc?export=view&id=1bcKY4nEiu7BCMdtrS3yEXVvQz-jO1Rna" width=45%></a>

<a href="https://vene.ro/images/wmd-obama.png"><img src="https://drive.google.com/uc?export=view&id=1Oqqr8Gr_Qh2l0YROpjE6Ri3WKkxdiwsC" width=45%></a>

The usage of EMD makes sense in this context, since we suppose again, that certain locations of our vectors are representing some meaningful units, thus some kind of "topic mapping" is going on in case of a distance measurement.

For the usage of EMD in semantic spaces see [this](http://proceedings.mlr.press/v37/kusnerb15.pdf) article.

**Fair warning:** The __runtime__ requirements of EMD scale __super-linearly__ with the number of distinct words (roughly cubic for exact solvers). For production environments this can mean a lot of CPU load. (Though [Radim Řehůřek](https://radimrehurek.com/about/) did some work on speeding it up in [Gensim](https://radimrehurek.com/gensim/)'s [wmdistance](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.wmdistance.html). For easy usage see [here](https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html).)



For a more general discussion about the different distance metrics in context of semantic clustering see [this](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.4480&rep=rep1&type=pdf) article.


## What does this look like practically?


The general architecture of a (linguistic) information retrieval system is similar to this:

<a href="https://3.bp.blogspot.com/-Ix9pGH_mYNM/Wfgm064NpqI/AAAAAAAAGCA/X-P4V2917Pgr5hPnRVvRkfRD6GwgjL1rACK4BGAYYCw/s1600/information%2Bretrieval_2.PNG"><img src="https://drive.google.com/uc?export=view&id=1GWJ3p8BRZ1G7325ODfUuMAKxrqtbDisK" width=55%></a>

Naturally, if we tried to __calculate__ this process __on the fly__ for all queries, that would be prohibitively expensive (think Google search...)

We can thus decompose the process to __two steps__:

1. __Preprocessing (vectorization)__

Typically we __preprocess__ the __documents__ based on our models and store the __vector representation__ of them in an appropriate data structure (SQL is not always the best idea here, but can work).

2. __Retrieval__

We __vectorize__ the __incoming query__ "on the fly", and try to __calculate__ a __ranking__ for the pre-processed database entries. This step in itself is not a linearly scaling problem, so good care has to be taken that the lookup remains quick enough. We can use approximate nearest neighbor search for the shortlist of items, just like Spotify does with [Annoy](https://github.com/spotify/annoy).
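
The two steps can be sketched with plain bag-of-words vectors and a brute-force cosine ranking. Everything here is a simplifying assumption: the four toy documents, the helper names `vectorize` and `search`, and the exhaustive matrix product, which an approximate nearest neighbor index would replace at scale.

```python
import numpy as np

documents = [
    "it is a puppy",
    "it is a cat",
    "it is a kitten",
    "that is a dog and this is a pen",
]
vocabulary = sorted({w for doc in documents for w in doc.split()})
index = {w: i for i, w in enumerate(vocabulary)}

def vectorize(text):
    """Bag-of-words count vector scaled to unit length; unknown words are ignored."""
    vec = np.zeros(len(vocabulary))
    for word in text.split():
        if word in index:
            vec[index[word]] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Step 1: preprocessing, done once and stored.
doc_matrix = np.stack([vectorize(doc) for doc in documents])

# Step 2: retrieval, done for every incoming query.
def search(query, top_k=2):
    scores = doc_matrix @ vectorize(query)  # cosine similarity to every document
    ranked = np.argsort(-scores)[:top_k]
    return [(documents[i], float(scores[i])) for i in ranked]

print(search("a cat"))
```

Because all vectors are normalized in advance, the per-query work reduces to one matrix-vector product plus a sort.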


## Main takeaway

---------------------

<font color='red'>
Whoever can learn a right transformation from objects to a meaningful vector space and can do similarity comparisons at scale has a good search / recommender engine!
</font>
 
---------------------

 
# General problems with frequency based models


1. Extreme __memory hunger__ (as discussed already).

2. Their basic __assumptions are not realistic__, since the probabilities of words are _indeed_ **influenced by far away words and sequence information**, but we cannot capture these with these methods (even increasing the N-gram order is infeasible).