<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

<center><h1>Introduction to Language Modelling</h1></center>

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>

**2.** [**Statistical Language Models**](#Section2)<br>
  - **2.1** [**Character Based**](#Section21)
  - **2.2** [**N-Gram Model**](#Section22)
  - **2.3** [**Limitation of Statistical Approach**](#Section23)

**3.** [**Neural Language Model**](#Section3)<br>
**4.** [**Generalised Language Model**](#Section4)<br>
**5.** [**Application of Language Model**](#Section5)<br>
**6.** [**Conclusion**](#Section6)<br>

---
<a name = Section1></a>
# **1. Introduction**
---

- **Language modeling** is the use of various **statistical** and **probabilistic** techniques to determine the **probability** of a given sequence of words occurring in a sentence. 

- Language models **analyze** bodies of text data to **provide** a basis for their **word predictions**. They are used in natural language processing (NLP) applications, particularly ones that **generate** text as an **output**. 

  - For example, a language model used for **predicting** the next word in a **search query** will differ considerably from one used for **predicting** the next word in a **long document** (such as in Google Docs).

  - The approach followed to **train** the model would also **differ** between the two cases.

- Language Models determine the **probability** of the **next** word by **analyzing** the text in data. These models **interpret** the data by **feeding** it through algorithms. 

- For **training** a language model, a number of **probabilistic** approaches are used.

- These approaches **vary** on the basis of **purpose** for which a **language** model is created.

- Besides assigning a **probability** to each **sequence** of words, a **language model** also assigns a **probability** to a given word **following** a **sequence** of words.


<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/lm10.gif"/></center>

<br> 

- **Language modelling** by itself does not have a **direct** practical use, but it is a **crucial** component in **real-world applications** such as **machine translation** and **automatic speech recognition**.

- A translation system might generate **multiple** candidate **translations** of the same source sentence, and the **language model** scores all the candidates to pick the one that is most likely.
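
The rescoring idea can be sketched in a few lines of Python. This is only an illustration: the bigram probabilities in `BIGRAM_PROB`, the fallback `UNSEEN_PROB`, and the helper `log_score` are all made up for this example and do not come from any real translation system.

```python
import math

# Hypothetical bigram probabilities P(next | previous), invented for illustration.
BIGRAM_PROB = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.3,
    ("sat", "</s>"): 0.4,
    ("<s>", "cat"): 0.05,
    ("cat", "the"): 0.01,
    ("the", "sat"): 0.01,
}
UNSEEN_PROB = 1e-6  # assumed tiny fallback for word pairs missing from the table


def log_score(sentence):
    """Sum the log-probabilities of every adjacent word pair in the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log(BIGRAM_PROB.get(pair, UNSEEN_PROB))
        for pair in zip(tokens, tokens[1:])
    )


candidates = ["the cat sat", "cat the sat"]
best = max(candidates, key=log_score)
print(best)  # the cat sat
```

Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on longer sentences.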

---
<a name = Section2></a>
# **2. Statistical Language Models**
---

- **Statistical language models** are **probabilistic models** that predict the **next word** in a sequence, given the words that **precede** it.

- A number of **statistical language models** are in use already. Some of them are as follows:

  - **The count-based methods**, such as traditional **statistical models**, usually involve making an n-th order **Markov assumption** and estimating **N-gram** probabilities via **counting** and subsequent **smoothing**. 

  - A **statistical** formulation of **language modelling** constructs the joint **probability** distribution of a sequence of words.

<a name = Section21></a>
### **2.1 Character Based**

- It predicts **next character** based on **previous characters.**

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/tughtu.JPG"height = "300"Width="400"/></center>
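
A character-based model can be approximated with simple counting. The sketch below assumes a fixed context length (`order`) and picks the most frequent continuation; the function names and the toy corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_char_model(text, order=2):
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        next_char = text[i + order]
        counts[context][next_char] += 1
    return counts


def predict_next_char(counts, context):
    """Return the most frequent next character, or None for an unseen context."""
    if context not in counts:
        return None
    return counts[context].most_common(1)[0][0]


corpus = "hello hello hello help"
model = train_char_model(corpus, order=3)
print(predict_next_char(model, "hel"))  # l
```

In this corpus the context `"hel"` is followed by `"l"` three times and by `"p"` once, so the model predicts `"l"`.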

<a name = Section22></a>
### **2.2 N-Gram Model**

- In **N-Gram Model**, the process of **predicting** a word sequence is **broken** up into predicting one word at a time.

- The **LM probability** *p(w1,w2,…,wn)* is a product of word **probabilities** based on a history of **preceding** words, whereby the history is limited to **m** words:
<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/lm3.png" height = "60" Width="600"/></center>

<br>  
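
Written out in LaTeX, the factorization described above is presumably the following. It is a reconstruction from the text, with $m$ as the history length; the exact indexing convention is an assumption.

$$
p(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{n} p(w_i \mid w_{i-m}, \ldots, w_{i-1})
$$

The equality is the chain rule of probability; the approximation is the Markov assumption that only the last $m$ words matter.
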

- This is also called a **Markov chain**, where the number of **previous** states is the **order** of the model.

- The basic idea of an **n-gram LM** is to predict the **probability** of w_(n+1) from its **preceding context**. In the simplest case, a **bigram** model, this is done by dividing the **number** of **occurrences** of the pair w_n, w_(n+1) by the number of **occurrences** of w_n.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/lm4.png"height = "50"Width="300"/></center>
<br>  
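
The bigram estimate described above (pair count divided by the count of the first word) can be sketched as follows. The function name and the toy sentence are hypothetical; a real implementation would also add sentence-boundary tokens.

```python
from collections import Counter


def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) = count(prev_word, word) / count(prev_word)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[prev_word] == 0:
        return 0.0  # the context word never occurs, so no estimate is possible
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]


tokens = "i like tea and i like coffee and i drink tea".split()
p_like = bigram_probability(tokens, "i", "like")
print(p_like)  # "i" occurs 3 times and is followed by "like" twice, so 2/3
```
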

## Unigram
- An n-gram of **size 1** is referred to as a "unigram"

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/unigram.JPG" height = "100"Width="600"/></center>

## Bigram
- An n-gram of **size 2** is referred to as a "bigram"

<center><img src="https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/bigram.JPG" height = "100"Width="600"/></center>

## Trigram 
-  An n-gram of **size 3** is referred to as a "trigram"

<center><img src="https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/trigram.JPG" height = "100"Width="600"/></center>

## Unigram, Bigram, and Trigram Together

<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/language.JPG" height = "300"Width="600"/></center>
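
Extracting unigrams, bigrams, and trigrams from a tokenized sentence takes only a sliding window. The helper `ngrams` below is a hypothetical name used for this sketch, and whitespace splitting stands in for a proper tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this is a sentence".split()
for n, name in [(1, "unigrams"), (2, "bigrams"), (3, "trigrams")]:
    print(name, ngrams(tokens, n))
# unigrams [('this',), ('is',), ('a',), ('sentence',)]
# bigrams [('this', 'is'), ('is', 'a'), ('a', 'sentence')]
# trigrams [('this', 'is', 'a'), ('is', 'a', 'sentence')]
```
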

- **Usually**, the **higher** the N, the **better** the model. However, a higher N adds considerable **computational** overhead and requires more **memory** (RAM).

- N-grams are a **sparse representation** of language. This is because we build the **model** based on the **probability** of words **co-occurring**. It assigns **zero probability** to any word or word sequence that is **not present** in the training corpus.


<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/lm15.png"height = "300"Width="700"/></center>

<br> 
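
One common remedy for the zero-probability problem is smoothing, mentioned earlier. The sketch below uses add-one (Laplace) smoothing as an example; it is one of several smoothing schemes and is not necessarily the one the figures above assume. The function name is hypothetical.

```python
from collections import Counter


def smoothed_bigram_probability(tokens, prev_word, word, alpha=1.0):
    """Add-alpha estimate: (count(prev, word) + alpha) / (count(prev) + alpha * |V|)."""
    vocab = set(tokens)
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    numerator = bigram_counts[(prev_word, word)] + alpha
    denominator = unigram_counts[prev_word] + alpha * len(vocab)
    return numerator / denominator


tokens = "the cat sat on the mat".split()
p_seen = smoothed_bigram_probability(tokens, "the", "cat")    # (1 + 1) / (2 + 5)
p_unseen = smoothed_bigram_probability(tokens, "cat", "mat")  # (0 + 1) / (1 + 5)
print(p_seen, p_unseen)
```

The unseen pair now gets a small non-zero probability. Words that never appear in the corpus at all would still need separate handling, for example an unknown-word token.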

#### **Bidirectional**:

- Unlike N-gram models, which **analyze** text in **one direction** (backwards), **bidirectional** models analyze text in **both** directions, backwards and forwards. 

- These models can **predict** any word in a **sentence** or **body** of text by using every **other** word in the text. Examining text **bidirectionally** increases result **accuracy**. 

- This type is often **utilized** in machine learning and **speech generation** applications. For example, **Google** uses a **bidirectional model** to process search **queries**.

<a name = Section23></a>
### **2.3 Limitation of Statistical Approach**

- The **distributed representation** approach allows the **embedding** representation to **scale** better with the size of the vocabulary. 

- **Classical methods** that have one **discrete** representation per word suffer from the **curse of dimensionality**: **larger vocabularies** of words result in **longer** and more **sparse** representations.

- Despite **smoothing** techniques and the **practical usability** of n-grams, the **curse of dimensionality** is especially potent here, as there is a **huge number** of different **combinations** of values of the **input** variables that must be **discriminated** from each other.

- For LM, this is the huge **number** of possible **sequences** of words, e.g., with a **sequence** of **10 words** taken from a vocabulary of **100,000**, there are **10⁵⁰ possible** sequences.

- **Sparsity** remains the core limitation: as noted for **N-grams** above, any word sequence that is **not present** in the **training** corpus receives **zero probability** unless smoothing is applied.
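
The size of the sequence space quoted above is easy to check. The corpus size used for comparison (`corpus_tokens`) is a hypothetical figure chosen only to show the scale of the gap.

```python
vocabulary_size = 100_000
sequence_length = 10

# Every position can hold any vocabulary word: |V| ** length sequences.
possible_sequences = vocabulary_size ** sequence_length
print(possible_sequences == 10 ** 50)  # True

# Hypothetical corpus of one billion tokens: it contains at most
# (tokens - length + 1) distinct 10-word windows.
corpus_tokens = 10 ** 9
max_observed = corpus_tokens - sequence_length + 1
coverage = max_observed / possible_sequences
print(coverage)
```

Even under this generous assumption, the corpus can cover only a vanishingly small fraction of all possible 10-word sequences, which is why count-based estimates are so sparse.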

---
<a name = Section3></a>
# **3. Neural Language Model**
---

- These language models are based on **neural networks** and are often considered as an **advanced approach** to execute NLP tasks. 

- There are **two** main types of NLM:
  
  - **Feed-forward neural network based LM**
  
  - **Recurrent neural network based LM**

- **Neural Language Models** (NLM) address the **N-gram data sparsity** issue through parameterization of words as vectors (**word embeddings**) and using them as **inputs** to a neural network. 

- **Word embeddings** obtained through NLMs **exhibit** the property whereby **semantically** close words are **likewise close** in the induced **vector space**.

- This learned **representation** of words based on their **usage** allows words with a **similar meaning** to have a similar representation.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/lm8.png"width="690" height="350"/></center>

<br>  
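
The claim that semantically close words lie close together in vector space is usually measured with cosine similarity. The three-dimensional vectors below are hand-picked for illustration and are not taken from any trained model.

```python
import math

# Hypothetical embeddings, invented for this example only.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_royal > sim_fruit)  # True
```
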

### Feed-Forward Neural Network Based Models

- The first neural approach to LM is the **neural probabilistic language model**.

- It learns the **parameters** of the **conditional probability distribution** of the next word, given the previous n-1 words, using a three-layer **feed-forward neural** network. An overview of the **network** architecture is given in the figure below.

- Build a **mapping** C from each **word** i of the **vocabulary** V to a distributed, **real-valued feature vector** C(i) ∈ ℝ^m, with m being the number of **features**. C is a |V| × m **matrix**, whose **row** i is the **feature** vector C(i) for **word** i.

- Learn the word **feature vectors** and the parameters of that probability function with a composite **function f**, comprised of the **two mappings C and g**.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/lm5.png"width="490" height="350"/></center>

<br>  

- In this model, each word in the **vocabulary** is associated with a **distributed word** feature vector, and the joint **probability function** of a word sequence is expressed as a **function** of the **feature vectors** of the words in that sequence.

- The model can learn the word **feature vectors** and the parameters of that **probability function** simultaneously.

- This **neural network** approach can solve the **sparseness problem**, and has also been shown to **generalize** well in comparison to n-gram models in terms of perplexity.

- However, a major weakness of this approach is the **very long training** and **testing** times.
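
The two mappings C and g can be sketched as a forward pass in plain Python. This is a simplified, untrained illustration: the weights are random, bias terms and direct input-to-output connections are omitted, and the sizes (`m`, `context_size`, `hidden`) as well as the toy vocabulary are assumptions made for this example.

```python
import math
import random

random.seed(0)

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, m, context_size, hidden = len(vocab), 4, 2, 8


def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


C = random_matrix(V, m)                      # |V| x m matrix; row i is C(i)
H = random_matrix(hidden, context_size * m)  # hidden-layer weights
U = random_matrix(V, hidden)                 # output-layer weights


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(context_words):
    # Mapping C: look up and concatenate the context words' feature vectors.
    x = [value for w in context_words for value in C[word_to_id[w]]]
    # Mapping g: tanh hidden layer, then a softmax over the whole vocabulary.
    h = [math.tanh(a) for a in matvec(H, x)]
    return softmax(matvec(U, h))


probs = next_word_distribution(["the", "cat"])
print(dict(zip(vocab, probs)))
```

Because the weights are untrained, the distribution is close to uniform. Training would adjust C, H, and U together, which is how the word vectors and the probability function are learned simultaneously. The softmax over the full vocabulary is also where much of the training and testing cost comes from.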

### Recurrent Neural Network Based Models


<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/LM7.png"width="390" height="350"/></center>

<br>  


- As seen above, **feed-forward** neural network based LMs use a **fixed-length** context.

- Recurrent neural networks, however, are not restricted to a **limited** **context** size.

- **Recurrent** connections allow the model to use a context of **variable** size.

- The recurrent neural network based **language** model (RNNLM) provides further **generalization** by considering several **preceding words**. 

- A variant of **RNNLM** was presented to further **improve** the original RNNLM by **decreasing** its **computational complexity**, which was **implemented** by **factorization** of the output layer.
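
The difference from a fixed-length context can be seen in a minimal recurrent step. The sketch below uses a plain tanh recurrence with random, untrained weights; the sizes and the toy input vectors are assumptions for illustration, and the output layer (including the factorized variant mentioned above) is left out.

```python
import math
import random

random.seed(0)
hidden_size, embed_size = 3, 2


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W_xh = random_matrix(hidden_size, embed_size)   # input-to-hidden weights
W_hh = random_matrix(hidden_size, hidden_size)  # recurrent hidden-to-hidden weights


def rnn_step(h_prev, x):
    """One recurrence: h_t = tanh(W_xh x_t + W_hh h_{t-1})."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_xh[j], x))
            + sum(w * hi for w, hi in zip(W_hh[j], h_prev))
        )
        for j in range(hidden_size)
    ]


def encode(sequence):
    """Fold a sequence of any length into a single hidden state."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = rnn_step(h, x)
    return h


short_state = encode([[1.0, 0.0]])
long_state = encode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(short_state), len(long_state))  # 3 3
```

The same weights handle a one-word and a three-word history, and the hidden state always has the same size, so the context length does not have to be fixed in advance.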


<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/lm9.png"height = "300"Width="700"/></center>

<br>  

- The best-performing models were the **largest** ones, specifically in terms of the number of memory units.

- Use of **regularization** like dropout on **input connections** improves results.

- **Character-level** Convolutional Neural Network (CNN) models can be used on the **front-end** instead of word embeddings, achieving **similar** and sometimes **better** results.

- **Combining** the **predictions** from **multiple models** can offer large improvements in model performance.


---
<a name = Section4></a>
# **4. Generalised Language Model**
---

- **Large-scale** pre-trained **language** models like **OpenAI GPT** and **BERT** have achieved **great** performance on a **variety** of language tasks using **generic** model **architectures**.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/lm11.jpg"height = "400"Width="700"/></center>

<br>

### **ELMo**

- **ELMo**, short for **Embeddings from Language Models**, learns **contextualized** word representations by pre-training a **language** model in an unsupervised way.

- ELMo has been applied to both **semantic-intensive** and **syntax-intensive** tasks.

- **Semantic task:** The **word sense disambiguation** (WSD) task emphasizes the meaning of a **word** given a context. The **biLM** top layer is better at this task than the first layer.

- **Syntax task:** The part-of-speech (POS) **tagging task** aims to infer the grammatical role of a **word** in one sentence. A **higher accuracy** can be achieved by using the biLM **first layer** than the top layer.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/elmo.png"height = "350"Width="700"/></center>

<br>  
- A **residual connection** is added between the **first** and **second** LSTM layers. The input to the first layer is **added** to its **output** before being **passed** on as the **input** to the **second** layer.

### **GPT**

OpenAI GPT, short for **Generative Pre-trained Transformer**, expands the **unsupervised language** model to a much **larger** scale by training on a giant **collection** of free text corpora. Despite the **similarity**, GPT has two major **differences** from ELMo.

- The model architectures are different: ELMo uses a **shallow concatenation** of independently trained **left-to-right** and **right-to-left** multi-layer LSTMs, while GPT is a **multi-layer transformer decoder**.

- The use of **contextualized embeddings** in downstream tasks is different: ELMo feeds **embeddings** into models **customized** for specific tasks as **additional features**, while GPT fine-tunes the same base **model** for all end tasks.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/gpt.png"width="750" height="550"/></center>

<br> 

### **BERT**

- BERT, short for **Bidirectional Encoder Representations from Transformers**, is a direct descendant of GPT: train a large language model on free text and then fine-tune on specific tasks without **customized** network **architectures**.

- Compared to GPT, the **largest** difference and **improvement** of BERT is to make **training bi-directional**. The model learns to make predictions using context on both the left and the right.

- To encourage the **bi-directional** prediction and **sentence-level** understanding, BERT is trained with **two tasks**, masked language modelling and next sentence prediction, instead of the **basic language** modelling task.
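
One of BERT's two training tasks is masked language modelling: some input tokens are hidden and the model must recover them from the context on both sides. The sketch below only prepares such inputs and is deliberately simplified. The function name is hypothetical, and it always substitutes `[MASK]`, whereas BERT as published sometimes keeps the original token or swaps in a random one.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide each token with probability `mask_prob`; return inputs and targets."""
    rng = random.Random(seed)
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(token)  # the model must predict this token
        else:
            masked.append(token)
            targets.append(None)   # no prediction needed at this position
    return masked, targets


tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(targets)
```

Because a masked position can be anywhere in the sentence, the model has to use words both before and after it, which is what makes the training bi-directional.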

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/bert.png"width="850" height="350"/></center>

<br>  

### **T5**

- The language model T5 is short for **Text-to-Text Transfer Transformer**.

- The **encoder-decoder** implementation follows the original Transformer architecture: 

   - tokens → embedding → encoder → decoder → output. 
   
- Instead of an **explicit** QA format, T5 uses **short task prefixes** to distinguish **task intentions** and separately **fine-tunes** the model on every **individual** task.

- The **text-to-text** framework **enables easier** transfer learning **evaluation** with the same model on a **diverse** set of tasks.
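
The task-prefix idea amounts to turning every task into plain string-to-string examples. The helper below is a hypothetical sketch of that input formatting only; it does not load or call a T5 model, and the prefixes follow the style of the examples reported for T5.

```python
def to_text_to_text(task_prefix, text):
    """Prepend a short task prefix so one model can tell the tasks apart."""
    return f"{task_prefix}: {text}"


examples = [
    to_text_to_text("translate English to German", "That is good."),
    to_text_to_text("summarize", "Language models assign probabilities to word sequences."),
]
for example in examples:
    print(example)
# translate English to German: That is good.
# summarize: Language models assign probabilities to word sequences.
```

Since inputs and outputs are always text, the same model, loss, and decoding procedure can be reused across tasks.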

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/T5.png"height = "300"Width="700"/></center>

<br> 

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/s2s_data/sum.png"width="850" height="690"/></center>

<br>  

---
<a name = Section5></a>
# **5. Application of Language Model**
---

## **Text Suggestions**

- **Google services** such as **Gmail** or **Google Docs** use language models to help **users** get text **suggestions** while they compose an email or create **long text** documents.  

<br>  
<center><img src="https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/MonstersEmail.0.gif" height = "300" Width="600"/></center>

<br>  

##  **Automatic Speech Recognition**

- It is the technology that **allows human beings** to use their voices to **speak with a computer** interface.

- In its most **sophisticated variations**, the interaction resembles normal **human conversation**.

<br>  
<center><img src ="https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/facetalk3.gif"height = "300"Width="500"/> </center>


## **Machine Translation**

- It is the task of **automatically converting** one natural language into another.

- The goal is to **preserve** the **meaning** of the input text while **producing fluent text** in the output language.

<br>  
<center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/1_zaK7KBzFtPFodu4CyauDYw.gif"height = "200"Width="650"/></center>


---
<a name = Section6></a>
# **6. Conclusion**
---

- **Word embeddings** obtained through **NLMs** exhibit the **property** whereby **semantically** close words are likewise close in the **induced** vector space. 

- NLMs can also **capture** the **contextual information** at the **sentence-level**, corpus-level and **subword-level**.

-  **Language modeling** is the art of determining the probability of a sequence of words.

- This is useful in a large **variety** of areas **including** speech recognition, **optical character recognition**, **handwriting recognition**, **machine translation**, and **spelling correction**.

- **Nonlinear** neural network models solve some of the **shortcomings** of **traditional** language models. 

- They allow **conditioning** on **increasingly** large **context sizes** with only a **linear** increase in the number of parameters, and they support generalization across **different contexts**.

- **State-of-the-art** results are achieved using **neural language models**, specifically those with **word embeddings** and recurrent neural network algorithms.