## A Hands-on Workshop series in Machine Learning
#### Instructor: Dr. Aashita Kesarwani




Can you think of ways to vectorize sentences while preserving the order of words?

The easiest way to preserve the order of words during vectorization would be to:
* Set a fixed length $n$ for input sequences, say $n = 20$ words
* Process all the sentences in the training dataset so that they consist of exactly $n$ words, either by padding shorter sentences with null words or by chopping off longer ones
* Create a vocabulary of all words in the training data and assign them hashes (numbers)
* Vectorize the input sequences by replacing the words with their hashes in the vocabulary
* Feed this input to a vanilla neural network
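The first four steps above can be sketched in plain Python. This is only a minimal illustration: the `<pad>` and `<unk>` tokens, the helper names, the whitespace tokenizer, and the two sample sentences are assumptions made for the example, and the "hash" of a word is taken to be its integer index in the vocabulary.

```python
PAD = "<pad>"   # hypothetical null word used for padding
UNK = "<unk>"   # hypothetical token for words missing from the vocabulary

def tokenize(sentence):
    """Lowercase, drop punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in sentence.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()

def build_vocab(sentences):
    """Assign each distinct word an integer; 0 and 1 are reserved for PAD and UNK."""
    vocab = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for word in tokenize(sentence):
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(sentence, vocab, n=20):
    """Replace words with their numbers, then pad or chop to exactly n entries."""
    ids = [vocab.get(word, vocab[UNK]) for word in tokenize(sentence)]
    ids = ids[:n]                                  # chop longer sentences
    return ids + [vocab[PAD]] * (n - len(ids))     # pad shorter sentences

train = ["I went to Hawaii in 2018.", "In 2018, I went to Hawaii."]
vocab = build_vocab(train)
v1 = vectorize(train[0], vocab, n=8)
v2 = vectorize(train[1], vocab, n=8)
print(v1)  # [2, 3, 4, 5, 6, 7, 0, 0]
print(v2)  # [6, 7, 2, 3, 4, 5, 0, 0]
```

Both vectors contain the same numbers, but at different positions, which is exactly what a network with one weight per input position has to cope with.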


The feed-forward neural network with the above input might be able to capture information about the order of words in the sentence, but it would need to learn all the rules of the language separately at each position. For example, it would treat the following two sentences differently:
* "I went to Hawaii in 2018."
* "In 2018, I went to Hawaii."

The relevant information about when the narrator went to Hawaii is present in the 6th and 2nd positions, respectively. Many more training examples would be needed to train the network to extract the relevant information, as the network needs to learn the same rules separately for all positions in the input layer.


Recurrent Neural Networks (RNNs) are specifically designed to work well on sequential data. They are much more efficient for textual data than vanilla neural networks (i.e. multi-layer perceptrons).

Other than natural language processing, RNNs can be applied to time-series data as well.

### Recurrent Neural Networks (RNN)

* An RNN takes a sequence as input, denoted as $x^{(1)}, x^{(2)}, \dots, x^{(t)}, \dots$, instead of a single input vector as seen previously.
* It can handle input sequences of any length.
* The information from the time step $t-1$ is passed on as input to the next time step $t$.


<img src="https://github.com/AashitaK/datasets/blob/main/images/Fig0.png?raw=True" width="800" height="200" />


In mathematics, we often study dynamical systems that involve recurrence relations:

$$s^{(t)} = f(s^{(t-1)}; \theta)$$

where $s^{(t)}$ is the state of the system at time $t$.

These dynamical systems can also involve input signals $x^{(t)}$ at each step:

$$s^{(t)} = f(s^{(t-1)}, x^{(t)} ; \theta)$$

Such a recurrence relation is involved in the architecture of most RNNs (though the architecture can vary greatly, as discussed below). The hidden nodes in an RNN at time $t$ can be defined as:

$$h^{(t)} = f(h^{(t-1)}, x^{(t)} ; \theta)$$

Here, the hidden state $h^{(t)}$, defined in terms of $h^{(t-1)}$, can be unfolded in the following manner for, say, $t=3$:

\begin{align}
 h^{(3)} &= f(h^{(2)}, x^{(3)}; \theta) \\ 
 &= f(f(h^{(1)}, x^{(2)}; \theta), x^{(3)}; \theta) \\
 &= f(f(f(h^{(0)}, x^{(1)}; \theta), x^{(2)}; \theta), x^{(3)}; \theta) \\
 &= g^{(3)}(x^{(1)}, x^{(2)}, x^{(3)})
\end{align}

Thus, the unfolded recurrence can also be represented by a function $g^{(t)}$ as:

\begin{align}
 h^{(t)} &= f(h^{(t-1)}, x^{(t)}; \theta) \\ 
 &= g^{(t)}(x^{(1)}, \dots, x^{(t-1)}, x^{(t)})
\end{align}

Advantages of using the model $h^{(t)} = f(h^{(t-1)}, x^{(t)} ; \theta)$ over $h^{(t)} = g^{(t)}(x^{(1)}, \dots, x^{(t-1)}, x^{(t)})$:
* The function $g^{(t)}$ can be a different function at each time step, but the ***same*** transition function $f$ with the same parameters $\theta$ can be used to define the hidden state in terms of $h^{(t-1)}$ and the input $x^{(t)}$ for every time step.
* The input sequences can be of variable length, but the input for the model is always the same size, consisting of $h^{(t-1)}$ and the input $x^{(t)}$.

Thus, a single shared model $f$ can be learned that will generalize to sequences of any length and will eliminate the need to learn a separate model $g^{(t)}$ for each time step. This parameter sharing also allows us to train the model with far fewer training examples.
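A small NumPy sketch of this idea is given here. The particular form of $f$ (a $\tanh$ of an affine map), the parameter names `W`, `U`, `b` inside `theta`, and all sizes are assumptions chosen for illustration. The point is only that one function `f` with one set of parameters handles sequences of any length, and that the loop agrees with unfolding by hand for $t=3$.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3   # illustrative sizes

# theta: a single set of parameters shared by every time step
theta = {
    "W": rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
    "U": rng.normal(scale=0.5, size=(hidden_size, input_size)),
    "b": np.zeros(hidden_size),
}

def f(h_prev, x_t, theta):
    """Transition function: h^(t) from h^(t-1) and x^(t)."""
    return np.tanh(theta["W"] @ h_prev + theta["U"] @ x_t + theta["b"])

def run(xs, theta):
    """Apply the same f at every step, whatever the sequence length."""
    h = np.zeros(hidden_size)    # h^(0), assumed to be all zeros
    for x_t in xs:
        h = f(h, x_t, theta)
    return h

short_seq = rng.normal(size=(3, input_size))
long_seq = rng.normal(size=(10, input_size))
h_short = run(short_seq, theta)
h_long = run(long_seq, theta)

# Unfolding by hand for t = 3 matches the loop
h0 = np.zeros(hidden_size)
h3_unfolded = f(f(f(h0, short_seq[0], theta), short_seq[1], theta), short_seq[2], theta)
print(np.allclose(h_short, h3_unfolded))  # True
```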

As an RNN composes the same function many times, the resulting composite function can be highly non-linear.

Some common types of RNN architecture:

### 1. Output at each step and recurrent connections between hidden units 

<img src="https://github.com/AashitaK/datasets/blob/main/images/Fig3.png?raw=True" width="600" />


The equations for the simple RNN architecture:

\begin{align}
 a^{(t)} &= b + Wh^{(t-1)} + U x^{(t)} \\ 
 h^{(t)} &= \tanh (a^{(t)})\\
 o^{(t)} &= c + Vh^{(t)}\\
 \hat{y}^{(t)} &= \text{softmax} (o^{(t)})\\
\end{align}
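The four equations can be transcribed almost line by line into NumPy. In this sketch the weights are random and untrained, the layer sizes are arbitrary, and $h^{(0)}$ is assumed to be a zero vector; only the structure of the forward pass is the point.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c):
    """Run the simple RNN over a sequence; return hidden states and predictions."""
    h = np.zeros(W.shape[0])          # h^(0), assumed to be all zeros
    hs, y_hats = [], []
    for x_t in xs:
        a = b + W @ h + U @ x_t       # a^(t)
        h = np.tanh(a)                # h^(t)
        o = c + V @ h                 # o^(t)
        hs.append(h)
        y_hats.append(softmax(o))     # y_hat^(t)
    return np.array(hs), np.array(y_hats)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 5, 4, 6   # illustrative sizes
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)

xs = rng.normal(size=(T, n_in))
hs, y_hats = rnn_forward(xs, U, W, V, b, c)
print(hs.shape, y_hats.shape)  # (6, 5) (6, 4)
```

Each row of `y_hats` is a probability distribution over the output classes for one time step.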

The activation function for the hidden units is the hyperbolic tangent, which is simply the sigmoid function rescaled and shifted so that its output lies in $(-1, 1)$ instead of $(0, 1)$: $\tanh(x) = 2\sigma(2x) - 1$.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Hyperbolic_Tangent.svg/980px-Hyperbolic_Tangent.svg.png?20090905154026" width="450" height="200" />

The cost function $L$ is the sum over time steps of the negative log-likelihood of $y^{(t)}$ given $x^{(1)}, \dots, x^{(t-1)}, x^{(t)}$:

$$ L = \sum_t L^{(t)} = - \sum_t \log p \big(y^{(t)} | x^{(1)}, \dots, x^{(t-1)}, x^{(t)} \big)$$
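As a small worked example of this formula, suppose the softmax outputs $\hat{y}^{(t)}$ for three time steps are already available. The probabilities and target classes below are made up for illustration.

```python
import numpy as np

def sequence_nll(y_hats, targets):
    """Sum over t of -log p(y^(t) | x^(1), ..., x^(t)).

    y_hats[t] is the predicted distribution at step t,
    targets[t] is the index of the true class at step t.
    """
    steps = np.arange(len(targets))
    probs = y_hats[steps, targets]     # probability assigned to the true class
    per_step = -np.log(probs)          # L^(t)
    return per_step.sum(), per_step

# Made-up predictions for T = 3 steps over 3 classes
y_hats = np.array([[0.70, 0.20, 0.10],
                   [0.10, 0.80, 0.10],
                   [0.25, 0.25, 0.50]])
targets = np.array([0, 1, 2])

L, per_step = sequence_nll(y_hats, targets)
print(L, per_step)
```

Here $L = -\log(0.7) - \log(0.8) - \log(0.5) = -\log(0.28)$, so confident correct predictions contribute little to the loss and uncertain ones contribute more.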


### 2. A single output at the end and recurrent connections between hidden units
When the output is only needed at the end after reading the entire input sequence, this architecture is used.


<img src="https://github.com/AashitaK/datasets/blob/main/images/Fig5.png?raw=True" width="600" />


Some applications for a single output:
* Sentiment analysis (Text classification)
* Sentence completion (guessing the next word in the sequence)
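A minimal sketch of this architecture follows, assuming the same $\tanh$ recurrence as in the first architecture and a softmax over a hypothetical set of two classes (for example, positive and negative sentiment). The weights are random and untrained; the sketch only shows that the output layer is applied once, to the last hidden state.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def rnn_classify(xs, U, W, V, b, c):
    """Read the whole sequence, then produce a single output at the end."""
    h = np.zeros(W.shape[0])               # h^(0), assumed zero
    for x_t in xs:
        h = np.tanh(b + W @ h + U @ x_t)   # no output at intermediate steps
    return softmax(c + V @ h)              # single prediction from the last hidden state

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 3, 5, 2        # illustrative sizes
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_classes, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_classes)

# Sequences of different lengths give outputs of the same size
p_short = rnn_classify(rng.normal(size=(4, n_in)), U, W, V, b, c)
p_long = rnn_classify(rng.normal(size=(12, n_in)), U, W, V, b, c)
print(p_short.shape, p_long.shape)  # (2,) (2,)
```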

#### Back-propagation through time (BPTT)
Forward propagation moves from left to right along the time steps, whereas backpropagation moves in the opposite direction, from right to left. The computations for weight updates cannot be parallelized across time steps: forward propagation is sequential, so the computations for time step $t=\tau$ cannot be done until they are done for all time steps up to $t=\tau-1$, and backpropagation has the same dependence in reverse. Thus, RNNs can be powerful but very expensive to train.
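To make the two passes concrete, here is a minimal NumPy sketch of BPTT for a simple RNN with a $\tanh$ hidden layer, a softmax output at every step, and the summed negative log-likelihood as the loss. The parameter dictionary `p`, the sizes, and the random data are assumptions for illustration. Note that both loops are inherently sequential: each iteration needs the result of the previous one.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs, ys, p):
    """Left-to-right pass; stores hidden states and predictions for the backward pass."""
    h = np.zeros(p["W"].shape[0])
    hs, y_hats, loss = [h], [], 0.0          # hs[0] is h^(0)
    for x_t, y_t in zip(xs, ys):
        h = np.tanh(p["b"] + p["W"] @ h + p["U"] @ x_t)
        y_hat = softmax(p["c"] + p["V"] @ h)
        loss -= np.log(y_hat[y_t])
        hs.append(h)
        y_hats.append(y_hat)
    return loss, hs, y_hats

def bptt(xs, ys, p):
    """Right-to-left pass; accumulates gradients of the loss for the shared parameters."""
    loss, hs, y_hats = forward(xs, ys, p)
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    dh_next = np.zeros_like(hs[0])           # gradient flowing back from step t+1
    for t in reversed(range(len(xs))):
        do = y_hats[t].copy()
        do[ys[t]] -= 1.0                     # dL/do^(t) for softmax + log-likelihood
        grads["V"] += np.outer(do, hs[t + 1])
        grads["c"] += do
        dh = p["V"].T @ do + dh_next         # from the output and from the future
        da = (1.0 - hs[t + 1] ** 2) * dh     # derivative of tanh
        grads["W"] += np.outer(da, hs[t])
        grads["U"] += np.outer(da, xs[t])
        grads["b"] += da
        dh_next = p["W"].T @ da
    return loss, grads

rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 3, 4, 2, 5             # illustrative sizes
p = {
    "U": rng.normal(scale=0.5, size=(n_h, n_in)),
    "W": rng.normal(scale=0.5, size=(n_h, n_h)),
    "V": rng.normal(scale=0.5, size=(n_out, n_h)),
    "b": np.zeros(n_h),
    "c": np.zeros(n_out),
}
xs = rng.normal(size=(T, n_in))
ys = rng.integers(0, n_out, size=T)
loss, grads = bptt(xs, ys, p)
print(loss, {k: v.shape for k, v in grads.items()})
```

Because the parameters are shared across time steps, the gradient for each of them is a sum of contributions from every step. A finite-difference check on a few entries is a simple way to validate such hand-written gradients.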

### Pros and Cons

When we read and try to understand a sentence, we do not only pay attention to the current word, but we also keep in mind the context from what we read previously in the text. This dependence on previous words to understand the context for the current word is called a dependency. An RNN can learn short-term dependencies much better than long-term dependencies, meaning it remembers the context from the neighbouring previous words better but keeps losing information as we move along the sentence.

Simple RNNs 
* Pros:
    * The order/sequence of words is considered
    * Input can be of any length
    * RNNs were unreasonably far more effective for textual data than the BOW methods in use at that time (see [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy). RNN architectures have been developed and studied since the mid-1980s, but their application in NLP tasks did not pick up pace until computational power and libraries became widely accessible around 2015.
* Cons:
    1. Suffer from vanishing gradients, especially for longer sequences.
    2. Not sufficiently good at long-term memory in sequences.
    3. Assume a one-to-one correspondence between input and output sequences and hence, the architecture is not suitable for many common NLP tasks such as translation between languages, text summarization, question-answering, etc.
    4. Slow to train as the training is sequential and thus, cannot be sufficiently parallelized (think in terms of being able to write vectorized code for forward and backward propagation through time steps).
    5. Transfer learning is not very useful in application.
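The vanishing-gradient problem in the first point can be observed numerically. In a $\tanh$ RNN, the gradient of $h^{(t)}$ with respect to $h^{(0)}$ is a product of $t$ Jacobians of the form $\text{diag}(1 - (h^{(k)})^2)\, W$. The sketch below tracks the size of that product; the recurrent matrix (an orthogonal matrix scaled by 0.5) and the random inputs are assumptions chosen so that the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, T = 8, 30                        # illustrative sizes

# Assumed recurrent weights: orthogonal matrix scaled to spectral norm 0.5
Q, _ = np.linalg.qr(rng.normal(size=(n_h, n_h)))
W = 0.5 * Q

h = np.zeros(n_h)
J = np.eye(n_h)                       # accumulates d h^(t) / d h^(0)
norms = []
for t in range(T):
    x_t = rng.normal(size=n_h)        # stands in for U x^(t) + b
    h = np.tanh(W @ h + x_t)
    J_t = np.diag(1.0 - h ** 2) @ W   # d h^(t) / d h^(t-1)
    J = J_t @ J
    norms.append(np.linalg.norm(J, 2))   # spectral norm of the product

print(norms[0], norms[9], norms[-1])
```

Each factor has spectral norm at most 0.5 here, so the norm shrinks at least geometrically with the number of steps, and the contribution of early words to the gradient becomes negligible. With a recurrent matrix of large norm the product can instead blow up, which is the exploding-gradient counterpart.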

**Transfer learning**: Using pre-trained neural networks (that were trained on a larger dataset requiring more computing resources) and fine-tuning them for a specific NLP task. The use of transfer learning makes state-of-the-art models widely accessible for users.


LSTM and GRU are slight modifications to the simple RNN that address the problem of long-term dependencies. Please read [this illustrative article](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) if you want to learn more about LSTM. They improve the computational time and performance to some extent.


### Encoder-decoder or sequence-to-sequence models

Simple RNN architectures assume a correspondence between the input and the output at each time step. For many common NLP tasks, the input and output sequences need not be of the same size and a one-to-one correspondence between input and output may be absent. For this purpose, the sequence-to-sequence or encoder-decoder architecture was introduced.


The encoder creates a numerical context vector from the entire input sequence and passes it on to the decoder to generate the output.

* The encoder RNN reads the input sequence, called the “context”, and produces a representation of the context, say C
* The final hidden state of the encoder RNN is used to compute the (generally) fixed-length context variable C, which represents a semantic summary of the input sequence
* The context variable C is the input to the decoder RNN
* The decoder RNN generates the output sequence
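The flow described in these steps can be sketched with two small $\tanh$ RNNs in NumPy. Everything specific here is an assumption for illustration: the weights are random and untrained, C is taken to be the encoder's final hidden state and used as the decoder's initial hidden state, decoding is greedy, and `SOS`/`EOS` are hypothetical start and end token ids.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_vocab = 3, 6, 5          # illustrative sizes
SOS, EOS = 0, 1                       # hypothetical start / end token ids

def init(*shape):
    return rng.normal(scale=0.5, size=shape)

enc = {"W": init(n_h, n_h), "U": init(n_h, n_in)}
dec = {"W": init(n_h, n_h), "E": init(n_h, n_vocab), "V": init(n_vocab, n_h)}

def encode(xs):
    """Read the whole input sequence; the final hidden state is the context C."""
    h = np.zeros(n_h)
    for x_t in xs:
        h = np.tanh(enc["W"] @ h + enc["U"] @ x_t)
    return h

def decode(C, max_len=8):
    """Generate output tokens one at a time, starting from the context C."""
    h, token, output = C, SOS, []
    for _ in range(max_len):
        # the previous output token is fed back in through an embedding column
        h = np.tanh(dec["W"] @ h + dec["E"][:, token])
        token = int(np.argmax(dec["V"] @ h))   # greedy choice of the next token
        if token == EOS:
            break
        output.append(token)
    return output

xs = rng.normal(size=(4, n_in))       # input sequence of length 4
C = encode(xs)
ys = decode(C)
print(C.shape, ys)
```

The length of `ys` is decided by the decoder (it stops at `EOS` or at `max_len`), so it need not match the length of the input sequence.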

<img src="https://miro.medium.com/max/1400/1*1JcHGUU7rFgtXC_mydUA_Q.jpeg" width="500" height="70" />

The encoder and decoder can also be LSTM/GRU units instead of simple RNN units. Transformers are also encoder-decoder (sequence-to-sequence) models, but they have a different architecture built around attention layers.

Encoder-decoder (or sequence-to-sequence models) can be used for several NLP tasks:
* Text generation
* Text summarization
* Question-Answering
* Neural machine translation
* Audio captioning (speech-to-text)
* Text-to-speech conversion
* Image captioning



### Attention mechanism

The context vector creates a bottleneck in passing relevant information for longer sequences. Encoder-decoder models perform a lot better with an attention mechanism, which simply passes on more information from each encoder unit to each decoder unit.
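One common way to do this is to let each decoder step compute its own context as a weighted average of all encoder hidden states. The sketch below assumes dot-product scoring between the decoder state and each encoder state; other scoring functions exist, and the toy vectors are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Return a context vector for one decoder step and the attention weights.

    encoder_states has shape (T, n_h): one hidden state per input position.
    """
    scores = encoder_states @ decoder_state    # similarity with each input position
    weights = softmax(scores)                  # non-negative, sums to 1
    context = weights @ encoder_states         # weighted average of encoder states
    return context, weights

# Toy example: three encoder states; the decoder state resembles the second one
encoder_states = 5.0 * np.eye(3)
decoder_state = encoder_states[1]
context, weights = attend(decoder_state, encoder_states)
print(weights.argmax())  # 1
```

The context is recomputed at every decoder step, so different output words can focus on different parts of the input instead of sharing one fixed-length summary.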

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*B33MU-QFnkTngXklABfJnw.png" width="750" />

#### Evolution of models for textual data

<center> Bag-Of-Words </center>
$$\Downarrow$$

<center> Simple RNN </center>
$$\Downarrow$$

<center> LSTM/GRU </center>
$$\Downarrow$$

<center> Encoder-Decoder with RNNs </center>
$$\Downarrow$$

<center> Encoder-Decoder with RNNs and attention mechanism </center>
$$\Downarrow$$

<center> Transformers </center>

### Transformer models

The breakthrough in the performance of deep learning for NLP came from the introduction of transformer models that use the attention mechanism (from the [Attention is all you need](https://arxiv.org/abs/1706.03762) paper) instead of recurrent connections across time steps. Transformer models discard both recurrence and convolution entirely. This enables parallelization across time steps, which speeds up the training process.
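The core operation of that paper is scaled dot-product attention, $\text{Attention}(Q, K, V) = \text{softmax}\big(QK^T / \sqrt{d_k}\big) V$. A single-head NumPy sketch is given below; the projection matrices are random and the sizes are arbitrary assumptions. Notice that there is no loop over time steps: all positions are processed with a few matrix products.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise numerically stable softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a whole sequence X of shape (T, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (T, T): every position against every other
    weights = softmax_rows(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4                      # illustrative sizes
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)  # (5, 4) (5, 5)
```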

<img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81c2aa73-dd8c-46bf-85b0-90e01145b0ed_1422x1460.png" width="350" />


Transformer-based models such as the GPT series by OpenAI, Google's BERT, and many modifications of these algorithms have shown remarkable results in most NLP tasks. Transfer learning can be used to fine-tune pre-trained transformer models that are freely available online and apply them to various specific tasks in several different languages.

### Types of transformer models

Transformer models consist of encoders and/or decoders. 
<img src="https://jalammar.github.io/images/t/The_transformer_encoders_decoders.png" width="350" />

There are three kinds of transformer models:
* *Encoders only*: Transformer models such as BERT (Bidirectional Encoder Representations from Transformers), introduced in the [paper](https://arxiv.org/pdf/1810.04805.pdf) by Devlin et al. from Google AI Language in 2019, consist of **only encoders**. The pre-trained BERT model is widely used to produce "contextualized word embeddings". These word embeddings, when paired with another architecture such as a Multi-Layer Perceptron or LSTM, can be used for the text classification problems that we have seen earlier.
* *Decoders only*: Transformers such as GPT-1/2/3/4 (the Generative Pre-trained Transformer series), which are language models used for text generation, consist of **only decoders**.
* *Encoder-decoder models*: Transformers such as BART, a denoising autoencoder for pretraining sequence-to-sequence models, consist of **both encoders and decoders** and can be used for several NLP tasks such as question-answering, text summarization, etc.
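One practical difference between encoder and decoder blocks is which positions each word is allowed to attend to. Encoders such as BERT attend in both directions, whereas decoders used for text generation apply a causal mask so that a position can only attend to itself and earlier positions. The sketch below illustrates the mask on made-up attention scores; the function name and the `causal` flag are assumptions for this example.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Row-wise softmax of a (T, T) score matrix, optionally with a causal mask."""
    T = scores.shape[0]
    if causal:
        future = np.triu(np.ones((T, T), dtype=bool), k=1)   # positions after the current one
        scores = np.where(future, -np.inf, scores)           # forbid attending to the future
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                 # made-up attention scores
w_bi = attention_weights(scores)                 # encoder-style: every position sees all others
w_causal = attention_weights(scores, causal=True)  # decoder-style: lower-triangular weights
print(np.round(w_causal, 2))
```

In the causal case the first row puts all its weight on the first position, and the upper triangle of the weight matrix is exactly zero.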


<img src="https://amatriain.net/blog/images/02-06.png" width="900" />
<h4 align="center">
Timeline for Transformer models 
</h4>

Image credit: [Xavier Amatriain](https://arxiv.org/pdf/2302.07730.pdf)

<img src="https://pbs.twimg.com/media/Fuz4UrZaYAAE4ZS?format=jpg&name=large" width="800" />
<h4 align="center">
Evolutionary Tree for LLMs (Large Language Models)  
</h4>

[Image](https://pbs.twimg.com/media/Fuz4UrZaYAAE4ZS?format=jpg&name=large) credit: [Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond by Yang et al.](https://arxiv.org/pdf/2304.13712.pdf)

<img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce3c437-4b9c-4d15-947d-7c177c9518e5_4258x5745.png" width="500" />

<h4 align="center">
Overview of Transformer models 
</h4>

Image credit: [Sebastian Raschka](https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder)

Full forms:
* BERT: Bidirectional Encoder Representations from Transformers
* GPT: Generative Pre-trained Transformer
* BART: Bidirectional and Auto-Regressive Transformers

#### Acknowledgement
Some content and diagrams are taken from the online book: https://www.deeplearningbook.org/contents/rnn.html