# Language modelling

* The task: assign a probability $P(w_1,\ldots,w_n)$ to a sequence of words $w_1,\ldots,w_n$
* The probability expresses how likely it is that this sequence *belongs to the language*
* Typically modelled by factoring with the chain rule: $P(w_1,\ldots,w_n)=\prod_{i=1}^{n} P(w_i|w_1,\ldots,w_{i-1})$
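
As a worked instance of this factorization (the three-word sentence is only an illustration):

$$
P(the\ dog\ barks) = P(the)\cdot P(dog|the)\cdot P(barks|the\ dog)
$$

Each factor is the probability of the next word given all the words before it.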

## N-gram models

* Traditional approach to language modelling
* $P(w_i|w_1,\ldots,w_{i-1})$ can be approximated using $P(w_i|w_{i-N+1},\ldots,w_{i-1})$, i.e. by conditioning only on the previous $N-1$ words, for some sufficiently small $N$
* This is then estimated using the maximum-likelihood estimate $\frac{C(w_{i-N+1},\ldots,w_i)}{C(w_{i-N+1},\ldots,w_{i-1})}$, where $C(x)$ is the count of occurrences of $x$ in some large corpus of text
  * $P(barks|the\ dog)=\frac{C(the\ dog\ barks)}{C(the\ dog)}$: two words of context plus the predicted word, so this would be called a 3-gram model
  * 5-gram models are quite common
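
The maximum-likelihood estimate can be sketched in a few lines of Python. This is a minimal illustration, not a production model: the function names `train_ngram` and `mle_prob` and the toy corpus are assumptions made up for this example, and there is no smoothing, so unseen contexts simply get probability 0.

```python
from collections import Counter

def train_ngram(tokens, n):
    """Count every n-gram and the (n-1)-word context it starts with."""
    ngrams = Counter()
    contexts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        ngrams[gram] += 1
        contexts[gram[:-1]] += 1  # the context is the n-gram minus its last word
    return ngrams, contexts

def mle_prob(word, context, ngrams, contexts):
    """Maximum-likelihood estimate C(context + word) / C(context)."""
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0  # unseen context: no estimate is possible without smoothing
    return ngrams[context + (word,)] / contexts[context]

# Toy corpus (hypothetical); a real model would be estimated from a large corpus.
corpus = "the dog barks . the dog sleeps . the cat sleeps .".split()
ngrams, contexts = train_ngram(corpus, 3)
print(mle_prob("barks", ["the", "dog"], ngrams, contexts))  # 0.5
```

Here "the dog" occurs twice and "the dog barks" once, so the estimate is 1/2.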

## Traditional applications

* Decoding: choose the most likely from several options
* Speech recognition
  * At any moment, the recognizer provides a probability distribution over the alphabet
  * The language model is used to choose the most likely path: **decoding**
* Decoding in machine translation
  * Language model used to find a natural sounding translation
* Text filtering and post-processing
  * Recognize segments of text which are not natural, i.e. which the model finds improbable
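
Choosing the most likely of several options can be sketched as rescoring candidates with a language model. The example below is a simplified illustration under stated assumptions: the bigram model with add-one smoothing, the function names, the training sentences, and the two candidate transcriptions are all invented for this sketch and do not come from any particular recognizer or translation system.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their one-word contexts, with sentence boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    return contexts, bigrams, vocab

def log_prob(sentence, contexts, bigrams, vocab_size):
    """Sum of log P(word | previous word) with add-one smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total

# Hypothetical training data and candidate outputs.
training = [
    "it is hard to recognize speech",
    "we recognize speech with a model",
    "the beach is nice",
]
contexts, bigrams, vocab = train_bigram(training)
vocab_size = len(vocab) + 1  # assumption: one extra slot for unseen words

candidates = [
    "it is hard to recognize speech",
    "it is hard to wreck a nice beach",
]
best = max(candidates, key=lambda c: log_prob(c, contexts, bigrams, vocab_size))
print(best)
```

Log probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow on longer sequences.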
  
## Generation

* The probabilistic formulation above can also be used to generate language from the models
* At any point one can sample from the distribution $P(w_i|w_1\ldots w_{i-1})$ and get the next word
* Generate the text one word at a time:
    * https://corpora.linguistik.uni-erlangen.de/cgi-bin/demos/TextParrot/parrot.perl
    * Why do you think it is *this* bad?
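
Word-by-word generation can be sketched as repeated sampling from the conditional distribution. The snippet below is a minimal illustration assuming a bigram model, so each word is sampled given only the single previous word; the function names, the toy corpus, and the fixed random seed are assumptions for this example.

```python
import random
from collections import Counter, defaultdict

def train_successors(tokens):
    """For each word, count which words follow it in the corpus."""
    successors = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev][word] += 1
    return successors

def generate(successors, start, max_words, rng):
    """Sample one word at a time, each conditioned on the previous word."""
    words = [start]
    for _ in range(max_words - 1):
        counts = successors.get(words[-1])
        if not counts:
            break  # the last word was never followed by anything in the corpus
        choices, weights = zip(*counts.items())
        # Sampling proportionally to counts equals sampling from the MLE distribution.
        words.append(rng.choices(choices, weights=weights)[0])
    return words

# Toy corpus (hypothetical); a real model would use far more text.
corpus = "the dog barks . the dog sleeps . the cat sleeps .".split()
successors = train_successors(corpus)
rng = random.Random(0)  # fixed seed only to make the run repeatable
generated = generate(successors, "the", 10, rng)
print(" ".join(generated))
```

Every adjacent word pair in the output has been seen in the corpus, but nothing ties the words together beyond that one-word window.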

## Modern language models

* Language modelling can be seen as a classification problem
* Given a context, predict which word is the best continuation
* The N-gram model is a very simple probabilistic way to achieve this
* Modern methods employ considerably more powerful techniques (deep neural networks)
* Most importantly, they consider a much longer context (hundreds of words in length)
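
The classification view can be illustrated as follows. This is only a sketch under assumptions not stated in the notes above: it supposes that some model has already produced one score per candidate word for a given context, and it uses softmax, a common way to turn such scores into a probability distribution. The scores and the three-word vocabulary are invented.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    highest = max(scores.values())
    # Subtracting the maximum keeps exp() numerically stable.
    exps = {word: math.exp(s - highest) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical scores for continuing the context "the dog".
scores = {"barks": 2.0, "sleeps": 1.0, "flies": -1.0}
probs = softmax(scores)

# Classification picks the single best class; generation would sample instead.
prediction = max(probs, key=probs.get)
print(prediction, round(probs[prediction], 3))
```

The classes are the words of the vocabulary, so a realistic model distinguishes tens of thousands of classes rather than three.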

* https://talktotransformer.com/
* https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
* See the difference?

## Modern language models transcend their role

* Clearly, to generate language well, one needs to understand it well
* Modern language models are used not only for their ability to generate text, but also for their ability to encode the context
* The model internally encodes the context, and uses this encoding to generate the output
* It turns out this encoding is extremely useful as input for almost any task in NLP
* Massive pre-trained language models yield state-of-the-art results across the board in today's NLP
* For those interested, the Transformer paper "Attention Is All You Need": https://arxiv.org/pdf/1706.03762
* Or attend our deep learning in NLP course (Turku / 4th period)





