# Deep Learning for NLP & Sequence Modelling

<br /><br />

### Motivation

In this class we cover popular Deep Learning models for **sequential data**, i.e. data where each data point $X_i$ is a sequence of tokens $X_i=\{s_0, s_1, \dots, s_n\}$. Note that sequence lengths can **vary** within a given dataset, i.e. in general $|X_i| \neq |X_j|$.

NLP is a classical example of sequence modeling: a dataset is a set of sentences, paragraphs, or documents, each of which is a sequence of words. In NLP we usually assume that the tokens (words) come from a predefined, fixed **vocabulary**.


### Learning Objectives

The goal of this class is a high-level overview of recent trends in Deep Learning for NLP and sequence modeling. The material is vast and most details are beyond the scope of this class; however, many links to extra reading are provided for deeper insight.


## Recurrent Neural Networks (RNN)

### Motivation

When we work with sequences, we can simply compute embeddings for the individual tokens of a sequence (e.g. with word2vec) and then aggregate them by summing, averaging, or concatenating. This simple aggregation might work on simple tasks, but real-world problems require a more expressive aggregation mechanism; summing and averaging, for instance, ignore token order. RNNs are a natural fit.

### Vanilla RNNs

<img src='images/seq_models/rnn.png' />

$X=\{x_0, x_1, \dots, x_t\}$ is our input data point, drawn as blue circles (each individual $x_i$ can be a bag-of-words vector, a word2vec representation, or anything else you come up with). A green box is an **RNN cell**, which contains a computation and its weights. $h_0, h_1, \dots, h_t$ is a sequence of **hidden states** which, intuitively, carry information from cell to cell.

The computation in an RNN cell is as follows:  
<br />
$h_{-1} = 0$ (the initial hidden state, before the first input $x_0$ is seen)  
$h_t = \sigma(W x_t + U h_{t-1} + b)$, where $\sigma$ is element-wise sigmoid    
<br />
$W, U, b$ are learnable weights, shared across all time steps  
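
The recurrence above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the sizes `E` and `H`, the random weight initialisation, and the one-hot toy inputs are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_step(x_t, h_prev, W, U, b):
    """One RNN cell update: h_t = sigmoid(W x_t + U h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(sigmoid(pre))
    return h_t


def rnn_forward(xs, W, U, b):
    """Run the same cell over the whole sequence, starting from a zero hidden state."""
    h = [0.0] * len(b)
    hidden_states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, U, b)
        hidden_states.append(h)
    return hidden_states


# Toy setup (assumed for illustration): input dim E = 3, hidden dim H = 2.
random.seed(0)
E, H = 3, 2
W = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(H)]
U = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(H)]
b = [0.0] * H

xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
hs = rnn_forward(xs, W, U, b)
print(len(hs), "hidden states, each of size", len(hs[0]))
```

Note that the same `W`, `U`, `b` are reused at every step, so the number of parameters does not depend on the sequence length.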

<br />

**Q:** How to do sequence classification/regression with RNN?  
**A:** Option 1: attach a classifier/regressor to the last hidden state. Option 2: apply average or max pooling over the hidden states and pass the result to a classifier/regressor.
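
As a rough sketch of these options, the snippet below turns a list of hidden states into one fixed-size feature vector in three ways and feeds it to a tiny logistic classifier. The hidden-state values and the classifier weights are made-up numbers for illustration only.

```python
import math


def last_state(hidden_states):
    """Option 1: use only the final hidden state."""
    return hidden_states[-1]


def mean_pool(hidden_states):
    """Option 2a: average each hidden dimension over time."""
    T = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[i] for h in hidden_states) / T for i in range(dim)]


def max_pool(hidden_states):
    """Option 2b: take the maximum of each hidden dimension over time."""
    dim = len(hidden_states[0])
    return [max(h[i] for h in hidden_states) for i in range(dim)]


def classify(features, weights, bias):
    """Binary logistic classifier on top of the pooled features."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical hidden states for a sequence of length 3 with hidden size 2.
hidden_states = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3]]
weights, bias = [1.0, -1.0], 0.0  # made-up classifier parameters

for pool in (last_state, mean_pool, max_pool):
    features = pool(hidden_states)
    print(pool.__name__, features, classify(features, weights, bias))
```

Whatever the sequence length, each pooling function returns a vector of the hidden size, which is what lets a fixed-size classifier sit on top.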

**Q:** How to do sequence batching?  
**A:** You will have a 3D tensor of shape [B, T, E] or [T, B, E], where B is the batch size, T is the maximum sequence length in the batch, and E is the feature vector dimension. Because sequences have variable length, you take the maximum length in the batch and pad the shorter sequences with zeros or some special token. See also https://pytorch.org/docs/master/generated/torch.nn.utils.rnn.pack_padded_sequence.html
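
A minimal sketch of the padding step, assuming the sequences are lists of integer token ids and that id `0` is reserved as the padding token (both are assumptions for this example). It produces a [B, T] batch; the E dimension would appear once each id is replaced by its feature vector.

```python
PAD = 0  # assumed id of the special padding token


def pad_batch(sequences, pad_value=PAD):
    """Pad variable-length sequences to the longest one in the batch.

    Returns the padded batch, the original lengths, and a 0/1 mask that
    marks real tokens (1) versus padding (0).
    """
    max_len = max(len(s) for s in sequences)
    lengths = [len(s) for s in sequences]
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    mask = [[1] * n + [0] * (max_len - n) for n in lengths]
    return padded, lengths, mask


batch = [[5, 2, 9], [7], [3, 8]]
padded, lengths, mask = pad_batch(batch)
print(padded)   # [[5, 2, 9], [7, 0, 0], [3, 8, 0]]
print(lengths)  # [3, 1, 2]
print(mask)     # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

The lengths (or the mask) are kept so that later steps, such as pooling or the loss, can ignore the padded positions.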

### Long Short-Term Memory Networks (LSTM)

Vanilla RNNs suffer badly from the so-called **vanishing gradient** problem. If you write down the gradient equation, you will find a product of many small numbers, where the number of factors grows linearly with the input sequence length, which drives the result towards zero. (Details here: https://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture07-fancy-rnn.pdf) For long enough sequences this becomes a serious problem and a different architecture is required. The LSTM addresses it by adding a separate cell state controlled by learned gates, which lets information and gradients flow across many time steps.
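
To make the product of small numbers concrete, here is a toy calculation for a scalar RNN $h_t = \sigma(w x + u h_{t-1} + b)$ with a constant input. By the chain rule, $\partial h_T / \partial h_0$ is a product of $T$ factors $u \cdot \sigma'(a_t)$, and $\sigma'$ is never larger than 0.25. The parameter values are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_through_time(num_steps, u=1.0, w=1.0, b=0.0, x=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = sigmoid(w*x + u*h_{t-1} + b)."""
    h = 0.0
    grad = 1.0
    for _ in range(num_steps):
        h = sigmoid(w * x + u * h + b)
        # Chain rule: each step multiplies in u * sigmoid'(a), and
        # sigmoid'(a) = h * (1 - h) <= 0.25.
        grad *= u * h * (1.0 - h)
    return grad


for T in (1, 5, 20, 50):
    print(T, gradient_through_time(T))
```

With these values every factor is at most 0.25, so the gradient after 50 steps is below $0.25^{50}$, far too small to carry a useful learning signal back to the first token.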

<img src='images/seq_models/lstm.png' />

#### Sequence-to-Sequence Framework

Sequence-to-Sequence (seq2seq) is a well-known method in Deep Learning for transforming one sequence into another (original paper: https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf). The classical example is machine translation, where a sentence in one language is transformed into the corresponding sentence in another language. Many translation systems, as well as systems for other tasks, are built on seq2seq.  

The idea is very similar to Autoencoders. We have two networks (LSTMs, for instance): an encoder and a decoder. The encoder accepts the input sequence and the decoder predicts the transformed sequence. The decoder's initial hidden state is the last hidden state of the encoder; that is how information is passed from encoder to decoder. Intuitively, the last hidden state holds a compressed representation of the input sentence, which the decoder uses to produce the transformation.
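
The data flow can be sketched as below. This is a toy under several assumptions: a plain sigmoid RNN cell stands in for the LSTM, the vocabulary and all names are hypothetical, and the weights are random and untrained, so the "translation" is meaningless. The point is only how the encoder's last hidden state becomes the decoder's initial state.

```python
import math
import random

random.seed(0)
VOCAB = ["<sos>", "<eos>", "a", "b", "c"]  # hypothetical toy vocabulary
V, H = len(VOCAB), 4                       # vocabulary size, hidden size


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def cell(token_id, h, W, U):
    """Plain RNN step on a one-hot token: W x reduces to column token_id of W."""
    Uh = matvec(U, h)
    return [1.0 / (1.0 + math.exp(-(W[i][token_id] + Uh[i]))) for i in range(H)]


def encode(token_ids, W, U):
    h = [0.0] * H
    for t in token_ids:
        h = cell(t, h, W, U)
    return h  # last hidden state: compressed summary of the input


def decode(h, W, U, out_proj, max_len=5):
    token = VOCAB.index("<sos>")
    output = []
    for _ in range(max_len):
        h = cell(token, h, W, U)
        logits = matvec(out_proj, h)
        token = max(range(V), key=lambda i: logits[i])  # greedy choice
        if VOCAB[token] == "<eos>":
            break
        output.append(VOCAB[token])
    return output


enc_W, enc_U = rand_matrix(H, V), rand_matrix(H, H)
dec_W, dec_U = rand_matrix(H, V), rand_matrix(H, H)
out_proj = rand_matrix(V, H)

source = [VOCAB.index(w) for w in ["a", "b", "c"]]
context = encode(source, enc_W, enc_U)                 # encoder's last hidden state
translation = decode(context, dec_W, dec_U, out_proj)  # decoder starts from it
print(translation)
```

In a trained system the encoder and decoder weights would be learnt jointly from pairs of source and target sequences.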

<img src='images/seq_models/seq2seq_trans.jpg' />

#### Attention Mechanism 

**Q:** When doing classification/regression with an RNN, is there a smarter input to the final classifier/regressor layer than the last hidden state or an average/max pooling of the hidden states?  
  
**A:** Attention mechanism: we can take a weighted average of the hidden states, where each weight indicates the importance of that particular hidden state. The weights are learnt jointly with the main task.
<br />
<img src='images/seq_models/attn.png' />  
<br /><br />
The Y's are hidden states, Z is the final aggregated output, and TANH is the computation that measures similarity (it can be a one-layer fully-connected network with tanh activation or, more commonly, just a scalar product). C, depending on the task, can be a learnable parameter or some other vector, for example a decoder hidden state in seq2seq. **NOTE:** Implementations of the attention mechanism vary from paper to paper.  
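
A minimal sketch of this weighted average, assuming the scalar-product variant of the similarity and a softmax to turn scores into weights (one common choice; as noted, papers differ). The hidden states and the context vector are made-up numbers; in a real model the context would be learnt or supplied by the task.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Weighted average of hidden states, weighted by similarity to `context`."""
    scores = [sum(y_i * c_i for y_i, c_i in zip(y, context)) for y in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    z = [sum(w * y[i] for w, y in zip(weights, hidden_states)) for i in range(dim)]
    return z, weights


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # the Y's
context = [2.0, 0.0]                                  # the C vector (assumed values)

z, weights = attention_pool(hidden_states, context)
print("weights:", weights)
print("z:", z)
```

With an all-zero context every score is equal, the weights become uniform, and the result reduces to plain average pooling.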

Examples: https://distill.pub/2016/augmented-rnns/

### Suggested Readings about RNNs

https://karpathy.github.io/2015/05/21/rnn-effectiveness/


### PyTorch Examples

https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html  
https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html

## Transformer Neural Networks

### Intro

Until 2017, LSTMs were arguably the de facto standard for NLP tasks. Then https://arxiv.org/abs/1706.03762 was released, introducing a different type of neural network architecture that set a new SOTA on neural machine translation. It is called the "Transformer" and mainly uses the attention mechanism, residual connections, and feed-forward neural networks. Since then this approach has gained popularity, and today almost all SOTA deep NLP models rely on it.  

**Computational cost** is the main advantage of the Transformer over LSTMs and other RNNs. Modern sequence modeling tasks can have sequence lengths on the order of 1000 tokens or more (https://arxiv.org/pdf/2004.05150.pdf). An RNN is inherently sequential: the computation of each cell needs the result of the previous cell, so it cannot be parallelized across time steps. As a result, training on large datasets with long sequences is very slow. The Transformer, on the other hand, is highly parallelizable and almost all of its components can be implemented as matrix operations.

**Self-attention (a special kind of attention used in the Transformer)** is the second thing that makes Transformers superior to RNNs. It can relate signals from arbitrarily distant sequence positions in O(1) sequential operations, whereas an LSTM needs O(|distance between tokens|) operations to do the same. It is believed that the latter makes it harder to train the model and to learn complex long-range relationships.  
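
A rough sketch of scaled dot-product self-attention in plain Python. It is simplified on purpose: the queries, keys, and values are the input vectors themselves, whereas the original paper first applies learned linear projections and uses several attention heads. The input values are made up.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Each position attends to every position of the same sequence.

    Simplification (assumed for this sketch): queries = keys = values = X.
    """
    d = len(X[0])
    outputs = []
    for q in X:  # every query is independent, so this loop can run in parallel
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, X)) for i in range(d)])
    return outputs


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 tokens, dimension 2
for row in self_attention(X):
    print(row)
```

The score between the first and the last token is computed directly, in one step, no matter how many tokens lie between them; an RNN would have to pass that information through every intermediate hidden state.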

Code and an explanation of the original paper can be found here: http://nlp.seas.harvard.edu/2018/04/03/attention.html

### BERT (Pre-Trained Transformer for Language Modeling)

#### Language Modeling

**Language Modeling** is one of the classical NLP tasks. It aims to model a natural language, say English. What does it mean to model a language? It means learning a probability distribution over the sentences of that language. Given an arbitrary sequence of words, the model must output a probability that measures how likely this sequence is to appear as a sentence in the real world.  

Mathematically, the neural network must learn $P(X)=P(word_0, word_1, \dots, word_n)=\prod_{i=0}^n P(word_i \mid word_{i-1}, \dots, word_{0})$, where the factor for $i=0$ is simply $P(word_0)$.  
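
The factorization can be illustrated with a toy model. In the sketch below the conditional probabilities come from a hypothetical hand-written table that looks only at the previous word (a bigram simplification); a neural language model would instead compute each factor from the full history. Log-probabilities are summed to avoid numerical underflow on long sentences.

```python
import math

# Hypothetical conditional probabilities P(word | previous word).
# "<s>" stands for the start of the sentence.
BIGRAM = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sleeps"): 0.1,
}


def bigram_prob(word, history):
    """Toy stand-in for P(word | history); uses only the last word."""
    prev = history[-1] if history else "<s>"
    return BIGRAM.get((prev, word), 1e-6)  # tiny probability for unseen pairs


def sentence_log_prob(words, cond_prob):
    """log P(w_0, ..., w_n) = sum_i log P(w_i | w_{i-1}, ..., w_0)."""
    log_p = 0.0
    history = []
    for word in words:
        log_p += math.log(cond_prob(word, history))
        history.append(word)
    return log_p


log_p = sentence_log_prob(["the", "cat", "sleeps"], bigram_prob)
print(math.exp(log_p))  # approximately 0.5 * 0.2 * 0.1 = 0.01
```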

There are many ways to approach this task. Two popular choices are the so-called **"Next Word Prediction task"** and **"Masked Language Modeling task"**.  

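RNN language models are usually trained on the former. Transformers are used with both: GPT is trained on next word prediction, while BERT is trained on masked language modeling. Attendees of this class are strongly encouraged to get familiar with these tasks in detail.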
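
To show how the two tasks differ, the sketch below builds training examples for each from one tokenized sentence. It is deliberately simplified: the function names are made up for this example, and the masking simply replaces a random subset of tokens with `[MASK]`, whereas BERT's actual procedure has extra rules for the selected tokens.

```python
import random


def next_word_examples(tokens):
    """Next word prediction: predict token i from all tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]", rng=random):
    """Masked LM: hide random tokens and predict them from both sides."""
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets[i] = tok  # position -> original token to be predicted
        else:
            inputs.append(tok)
    return inputs, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]

for context, target in next_word_examples(tokens):
    print(context, "->", target)

inputs, targets = masked_lm_example(tokens, mask_prob=0.3, rng=random.Random(0))
print(inputs, targets)
```

In next word prediction the model only sees the left context, while in masked language modeling it sees the words on both sides of each masked position.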

**NOTE:** Language Modeling can be thought of as a pre-training method for NLP, analogous to ImageNet pre-training in Computer Vision. It shows great results in transfer learning.  

https://arxiv.org/abs/1801.06146 - a nice paper by Jeremy Howard (fast.ai) and Sebastian Ruder that summarizes many tips and tricks for getting a good language model with an RNN.

#### BERT

**BERT** is a Transformer neural network from Google, pre-trained on the Masked Language Modeling task. https://arxiv.org/abs/1810.04805

BERT became so popular that a huge variety of BERT-based models has been released. You can check them out, along with the code, here: https://huggingface.co/transformers/pretrained_models.html

#### OpenAI GPT

OpenAI released three versions of their own Transformer variant (GPT-1, GPT-2, GPT-3) that achieved very good results on text generation.

https://openai.com/blog/better-language-models/  
https://github.com/openai/gpt-3