# Background: Distributions of words


## Psychological models of meaning: "Osgooding"

As with many innovations in AI, the foundations of the most important NLP techniques can be __traced back to psychological research__. [Osgood](https://en.wikipedia.org/wiki/Charles_E._Osgood), in his seminal paper ["The nature and measurement of meaning."](https://pdfs.semanticscholar.org/ca12/a908e86a87db152c0991ae9c5a40f1a5d2a3.pdf), proposed an __experimental method__ for __understanding__ the human subject's "mapping" of __meaning__.

[Semantic Differential](https://en.wikipedia.org/wiki/Semantic_differential) (SD) is a type of __rating scale__ designed to measure the __connotative meaning__ of __objects, events, and concepts__. The connotations are used to derive the attitude towards the given object, event or concept.

Osgood's Semantic Differential was an application of his more general attempt to measure the semantics or meaning of words, particularly adjectives, and their referent concepts. The respondent is asked to choose where his or her position lies, on a scale between two polar adjectives (for example: "Adequate-Inadequate", "Good-Evil" or "Valuable-Worthless"). Semantic differentials can be used to measure opinions, attitudes and values on a psychometrically controlled scale.

<a href="http://methods.sagepub.com/images/virtual/encyclopedia-of-survey-research-methods/image32.jpg"><img src="https://drive.google.com/uc?export=view&id=1u3ETKwTMT72xuiRenupCvjKWPh9nTABE" width=45%></a>

This laboratory method became a standard in social sciences as well as experimental linguistics, earning the name "osgooding".

Important things to note for this research:
- Using __human judgements__ for the __measurement of association strengths between words__
- Implicit assumption of a __"space" (distance)__ which __represents meaning relations__
- Usage of __factor analysis__, that is the search for a __lower number of causal factors behind the observed variance__ of word affinities (dimensionality reduction)




## Zipf's law

The other influential foundational notion we have to take into account is the phenomenon described by Zipf in his 1935 work [The psycho-biology of language](https://psycnet.apa.org/record/1935-04756-000), which later became known as "Zipf's law".

"...Zipf's law states that given some corpus of natural language utterances, the __frequency of any word__ is __inversely proportional to its rank in the frequency table__. Thus the __most frequent word__ will __occur approximately twice as often__ as the __second most frequent word__, __three times__ as often as the __third most frequent word__, etc.: the rank-frequency distribution is an inverse relation. For example, in the Brown Corpus of American English text, the word _the_ is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word _of_ accounts for slightly over 3.5% of words (36,411 occurrences), followed by _and_ (28,852). Only __135 vocabulary items__ are needed to account for __half the Brown Corpus__."

<a href="https://3c1703fe8d.site.internapcdn.net/newman/csz/news/800/2017/solutiontoac.png"><img src="https://drive.google.com/uc?export=view&id=1KeW5cFpaezvqh7ks8wp_tNydVlEiETiV" width=55%></a>

The main takeaways from this research are:
- The __empirical distribution__ of __written words__ has a very __distinct, characteristic__ shape
- This __distribution__ points towards __definable generating mechanisms__ for language production
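As a rough check of the rank-frequency relation, the three Brown Corpus counts quoted above can be multiplied by their ranks: under an ideal Zipf distribution the product rank × frequency would stay constant. The sketch below does this, and also includes a small helper (`rank_frequency`, a name introduced here purely for illustration) that builds a rank-frequency table from any list of tokens.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Brown Corpus counts quoted in the text, already ordered by rank.
brown_top3 = [("the", 69971), ("of", 36411), ("and", 28852)]

# Under an ideal Zipf distribution, rank * count would be constant.
zipf_products = [rank * count
                 for rank, (_, count) in enumerate(brown_top3, start=1)]

for (word, count), product in zip(brown_top3, zipf_products):
    print(f"{word:>4}  count={count:>6}  rank*count={product}")
```

The products (69971, 72822, 86556) drift upwards rather than staying fixed, which illustrates why the law is stated as an approximate relation.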

### Possible explanations of Zipf's law

Though Zipf's "law" is empirically very stable and has been observed across multiple languages and corpora, a thorough explanation of its __causes__ is still __lacking__. Among the proposed explanations, __two__ are __notable for the purpose of NLP__:

On the one hand, Yu et al. propose __distinct cognitive mechanisms__ for __language processing__ as the background of Zipf's law:

"Yu and co say the word frequencies in these languages share a common structure that differs from the one that statistical errors would produce. What’s more, they say this structure suggests that the __brain processes common words differently from uncommon ones__, an idea that has important consequences for natural-language processing and the automatic generation of text." ([source](https://www.technologyreview.com/s/611640/data-mining-reveals-fundamental-pattern-of-human-thinking/))

However, Yu and co are able to reproduce this structure using a model of the way the brain works called the __dual-process theory__. This is the idea that the brain works in two different ways.

The first is ___fast intuitive thinking___ that requires __little or no reasoning__. This type of thinking is thought to have evolved to allow humans to react quickly in threatening situations. It generally provides good solutions to difficult problems, such as pattern recognition, but can easily be tricked by non-intuitive situations.

However, humans are capable of much more rational thinking. This __second type__ of thinking is __slower, more calculating, and deliberate__. It is this kind of thinking that allows us to solve complex problems like mathematical puzzles and so on.

The dual-process theory suggests that __common words like the, and, if and so on are processed by fast, intuitive thinking and so are used more often__. These words form a kind of backbone for sentences.

However, __less common words and phrases like "hypothesis" and "Zipf’s Law" require much more careful thought__. And because of this they occur less often.

We will come back to the topic of "informative" words later in detail.



On the other hand, [Manin](https://pdfs.semanticscholar.org/b04b/1ccabc3e614ba4fe784030b41d6f1e753844.pdf)'s 2007 research proposes that the Zipfian distribution is a direct result of the __processes governing hypernymy ("type-of" relationship) and synonymy ("same-as" relationship)__ in the human linguistic structure. This draws a strong connection between the distributional and ontological approaches to semantics. The already discussed topic of "relationship extraction" can capitalize strongly on this effect, and vice versa: some distributional methods emphasize the hierarchical structure of "meaning space" in the form of topological constraints (see for example [this](http://arxiv.org/abs/1705.08039) paper).

Interestingly enough, the __distributional properties of images tagged by humans__ also show a kind of __hierarchical distribution__, where __"higher order" concepts have more visual variability__ in the associated images than the "lower level" words in a hierarchy. This effect has been explicitly used for mining meronymy relations for common-sense ontologies (see [here](https://people.mpi-inf.mpg.de/~ntandon/papers/pwkb-aaai2016-tandon.pdf)).






## Distributional hypothesis

_"You shall know a word by the company it keeps" (Firth, 1957)_

The grounding hypothesis of **distributional semantics** is that **language production** (the choice of sequences of words) can be considered a **probabilistic process**; thus meaning can and should be modeled with some kind of (conditional) probability distribution.

 
-------------------
In short:
<font color='red'>
Meaning of a word = distribution of its neighbors
</font>

-------------------
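To make "distribution of its neighbors" concrete, here is a minimal sketch that counts, for each word, the words occurring within a fixed distance of it. The function name `neighbor_counts`, the window size, and the toy corpus are all assumptions chosen for illustration, not a prescribed method.

```python
from collections import Counter, defaultdict

def neighbor_counts(tokens, window=2):
    """For each word, count the words seen within `window` positions of it."""
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbor
                contexts[word][tokens[j]] += 1
    return contexts

# Toy corpus, made up for this example.
tokens = "the cat drinks milk the dog drinks water".split()
contexts = neighbor_counts(tokens, window=1)

print(contexts["cat"])
print(contexts["dog"])
```

In this toy corpus "cat" and "dog" end up with identical neighbor counts ("the" and "drinks"), which is the sense in which a distributional approach would treat them as similar.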

## Language modeling

A __language model__ is a __probability distribution__ over __sequences of words__, modeling language (production): if the set of words is $W$, then for an arbitrary sequence $\mathbf w = \langle w_1,\dots, w_n\rangle$ ($w_i\in W$) it defines a probability $P(\mathbf w)$.

Probability with chain rule:

$$P(\mathbf w)= P(w_1)\cdot P(w_2 \vert w_1 )\cdot P(w_3\vert w_1, w_2)\cdot\dots\cdot P(w_n\vert w_1,\dots, w_{n-1})$$

This means that for the __modeling__ we only need to give the __conditional probability__ of the __"continuation"__, the __next word__: for a word $w$ and a sequence $\langle w_1,\dots,w_n\rangle$, the probability that the next word will be $w$ is

$$P(w ~\vert ~ w_1,\dots,w_n)$$
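As a small worked illustration of the chain rule, take the three-word sequence $\langle \text{the}, \text{cat}, \text{sleeps}\rangle$. The numerical values below are made up purely to show the arithmetic and do not come from any corpus:

$$P(\langle \text{the}, \text{cat}, \text{sleeps}\rangle) = P(\text{the})\cdot P(\text{cat}\vert \text{the})\cdot P(\text{sleeps}\vert \text{the}, \text{cat}) = 0.1 \cdot 0.02 \cdot 0.05 = 0.0001$$

Each factor is a next-word probability of exactly the form given above, so a model that can score continuations can score whole sequences.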

There are also character-based models, which take individual characters rather than words as units and model language as a distribution over sequences of characters (think T9...).



**Language modeling is the practical application of the distributional hypothesis.**

### Windows

It is important to note that this conditional probability can be treated as a Markov chain of some order. The problem is that we do not know how "deep" it is, that is, **how far the causal influence of a word "travels" through the following words**.

**In practice, we do not think that the first word of a book like "War and Peace" directly influences the last one. Some higher-level concept, like topic, narrative, etc., is in place, but for practical reasons we will consider a causal (rolling) window.** (Rings a bell? Time series?)
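The simplest rolling window looks back exactly one word, which gives a bigram model. The sketch below estimates $P(w_i \vert w_{i-1})$ by relative frequency on a toy corpus; the function name `train_bigram` and the corpus are assumptions made for illustration, and no smoothing is applied, so unseen word pairs simply have no entry.

```python
from collections import Counter

def train_bigram(tokens):
    """Estimate P(next | previous) by relative frequency (no smoothing)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only where it has a successor, so that the
    # probabilities for a given previous word sum to 1.
    prev_counts = Counter(tokens[:-1])
    return {(prev, nxt): count / prev_counts[prev]
            for (prev, nxt), count in pair_counts.items()}

# Toy corpus, made up for this example.
tokens = "the cat sat on the mat the cat slept".split()
bigram = train_bigram(tokens)

print(bigram[("the", "cat")])  # "the" is followed by "cat" 2 times out of 3
print(bigram[("cat", "sat")])  # "cat" is followed by "sat" 1 time out of 2
```

A wider window works the same way with longer tuples as the conditioning context, at the cost of far sparser counts.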

## Measurement of predictive performance: Perplexity
Extrinsic vs intrinsic evaluation criteria:

__Extrinsic__ - put the model to a task: spell checking, speech recognition, etc.

__Intrinsic__ - performance on the test set, e.g. measured as perplexity

Perplexity: Basic evaluation criterion of a language model - does it __prefer good sentences to bad ones__?
- assign __higher probability__ to __"real"__ or __"frequently observed"__ sentences than to __"ungrammatical"__ or __"rarely observed"__ sentences



A language model $\mathcal M$'s perplexity over the word series $\mathbf w = \langle w_1,\dots, w_n\rangle$ is:

$$\mathbf{PP}_{\mathcal M}(\mathbf w) = \sqrt[n]{\frac{1}{P_{\mathcal M}(\mathbf w)}}$$

With the chain rule, this can be rewritten as:

$$\mathbf{PP}_{\mathcal M}(\mathbf w) = {\sqrt[n]{\frac{1}{P_{\mathcal M}(w_1)}\cdot \frac{1}{P_{\mathcal M}(w_2 \vert w_1 )}\cdot \frac{1}{P_{\mathcal M}(w_3\vert w_1, w_2)}\cdot\dots\cdot \frac{1}{P_{\mathcal M}(w_n\vert w_1,\dots, w_{n-1})}}}$$

which is exactly the __geometric mean__ of the __reciprocals of the conditional probabilities__ of all __words in the corpus__.

In the case of a __bigram model__ this further simplifies to:
$$\mathbf{PP}_{\mathcal M}(\mathbf w) = \sqrt[n]{\frac{1}{P_{\mathcal M}(w_1)}\cdot \frac{1}{P_{\mathcal M}(w_2 \vert w_1 )}\cdot \frac{1}{P_{\mathcal M}(w_3\vert w_2)}\cdot\dots\cdot \frac{1}{P_{\mathcal M}(w_n\vert w_{n-1})}}$$

The lower the perplexity the better (it means that our model deemed the actually observed sequence more likely).
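The definition can be sketched directly in code. The function below takes the list of conditional probabilities a model assigned to each word of a sequence and returns the $n$-th root of the reciprocal of their product. The two "models" are hypothetical lists of probabilities invented for the comparison, not outputs of a trained model.

```python
import math

def perplexity(cond_probs):
    """Perplexity from the conditional probabilities P(w_i | history)."""
    n = len(cond_probs)
    sequence_prob = math.prod(cond_probs)  # chain rule: P(w) is the product
    return (1.0 / sequence_prob) ** (1.0 / n)

# Hypothetical probabilities assigned by two models to the same 4-word sequence.
good_model = [0.5, 0.5, 0.5, 0.5]
bad_model = [0.1, 0.1, 0.1, 0.1]

pp_good = perplexity(good_model)
pp_bad = perplexity(bad_model)
print(pp_good, pp_bad)
```

A model that assigns probability $1/k$ to every word has perplexity $k$, so perplexity can be read as the effective number of equally likely choices the model faces at each step.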

#### Connection to cross-entropy

Taking the logarithm of perplexity, with a few simple steps of algebraic manipulation we can see that the result is

$$
\frac{1}{n} \Big(-\log P_{\mathcal M}(w_1) - \log P_{\mathcal M}(w_2 \vert w_1 ) - \log P_{\mathcal M}(w_3\vert w_1, w_2) - \dots - \log P_{\mathcal M}(w_n\vert w_1,\dots, w_{n-1})\Big)
$$

which is the average cross-entropy per word. A simple consequence: by minimizing cross-entropy one also minimizes the model's perplexity on the training data.
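This identity is easy to verify numerically. The sketch below computes perplexity directly from its definition and, separately, the average negative log-probability per word, then checks that the logarithm of the first equals the second. The probabilities are hypothetical values chosen for the example, and the function names are introduced here for illustration.

```python
import math

def perplexity(cond_probs):
    """n-th root of the reciprocal of the sequence probability."""
    n = len(cond_probs)
    return (1.0 / math.prod(cond_probs)) ** (1.0 / n)

def average_negative_log_likelihood(cond_probs):
    """Mean of -log P(w_i | history) over the sequence (natural log)."""
    return -sum(math.log(p) for p in cond_probs) / len(cond_probs)

# Hypothetical conditional probabilities for a 4-word sequence.
cond_probs = [0.2, 0.5, 0.1, 0.4]

pp = perplexity(cond_probs)
nll = average_negative_log_likelihood(cond_probs)

print(math.log(pp), nll)  # the two values agree up to rounding
print(pp, math.exp(nll))  # equivalently, PP = exp(average cross-entropy)
```

For long sequences the log form is usually the safer one to compute, since the raw product of many small probabilities can underflow to zero in floating point.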