# Deep Learning Models

## 1. Siamese Network

- A Siamese network is a neural network architecture that contains two or more identical subnetworks.
- The **subnetworks share the same weights** and learn to solve the same problem, **but with different inputs**. 
- The subnetworks are trained to **minimize the difference between their outputs on pairs of inputs that are similar, and maximize the difference between their outputs on pairs of inputs that are dissimilar**.
- Sentence-BERT (SBERT) is a variant of BERT that uses a Siamese network structure.
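
The training objective described above can be sketched with a contrastive loss. This is a minimal illustration, not a specific library's implementation: the `encode` function, the weight matrix `W`, and the `margin` value are all assumptions made for the example.

```python
import numpy as np

def encode(x, W):
    """Shared encoder: the same weights W are applied to every input."""
    return np.tanh(x @ W)

def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    dist = np.linalg.norm(emb_a - emb_b, axis=-1)
    pos = is_similar * dist ** 2                                  # similar pairs: penalize distance
    neg = (1 - is_similar) * np.maximum(margin - dist, 0.0) ** 2  # dissimilar pairs: penalize closeness
    return float(np.mean(pos + neg))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # hypothetical shared weights
x_a = rng.normal(size=(2, 4))    # first input of each pair
x_b = rng.normal(size=(2, 4))    # second input of each pair
labels = np.array([1.0, 0.0])    # 1 = similar pair, 0 = dissimilar pair

loss = contrastive_loss(encode(x_a, W), encode(x_b, W), labels)
print(loss)
```

Both inputs pass through `encode` with the same `W`, which is what "sharing weights" means in practice.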

## 2. xDeepFM

1. xDeepFM (eXtreme Deep Factorization Machine) is a deep learning model that combines the benefits of **wide and deep learning models**, learning both explicit and implicit feature interactions.
2. There are three main components of xDeepFM:
    1. **Wide (linear) component**: A linear model over the raw input features that captures **low-order feature effects**.
    2. **Deep component**: A plain deep neural network (DNN) on top of the feature embeddings that captures **implicit high-order feature interactions**.
    3. **Cross component**: A Compressed Interaction Network (CIN) that captures **explicit high-order feature interactions** at the vector-wise level.
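
To make the CIN idea more concrete, the following is a rough NumPy sketch of a single CIN layer for one example. The shapes, the weight tensor `W1`, and the helper name `cin_layer` are assumptions for illustration; a real model would learn the weights and stack several layers.

```python
import numpy as np

def cin_layer(x_prev, x0, W):
    """One CIN layer for a single example.

    x_prev: (H_prev, D) feature maps from the previous layer
    x0:     (m, D)      field embeddings (m fields, embedding size D)
    W:      (H_next, H_prev, m) weights, one (H_prev, m) matrix per output map
    Returns (H_next, D).
    """
    # Element-wise product along the embedding dimension for every pair (i, j)
    pairwise = np.einsum("id,jd->ijd", x_prev, x0)   # (H_prev, m, D)
    # Each output feature map is a weighted sum of all pairwise maps
    return np.einsum("hij,ijd->hd", W, pairwise)

rng = np.random.default_rng(0)
m, D, H1 = 3, 4, 5
x0 = rng.normal(size=(m, D))
W1 = rng.normal(size=(H1, m, m))   # hypothetical weights for the first layer

x1 = cin_layer(x0, x0, W1)         # first layer interacts x0 with itself
pooled = x1.sum(axis=1)            # sum pooling over the embedding dimension
print(x1.shape, pooled.shape)
```

The interactions are explicit because every output is a weighted sum of products of embedding vectors, and vector-wise because the product keeps the embedding dimension intact.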

<p align="center">
<img src="./images/2023-02-03.jpeg" width=50% height=50%>
</p>

## 3. Long Short-Term Memory (LSTM)

LSTM (Long Short-Term Memory) is **a type of Recurrent Neural Network (RNN)** that is widely used in Natural Language Processing (NLP) and speech recognition tasks. It is designed to **mitigate the vanishing gradient problem** of traditional RNNs by introducing a memory cell whose cell state is controlled by three gates (input, forget, output) that regulate the flow of information into and out of the cell. Because the cell state is updated additively rather than by repeated matrix multiplication, gradients can flow across long sequences with much less decay.

The basic idea behind LSTMs is to allow information to persist in the network for a longer time period. The input gate regulates the flow of input data into the memory cell. The forget gate decides what information should be discarded from the memory cell. The output gate controls the flow of information from the memory cell to the output. The cell state acts as a “memory” that retains information for an extended period of time.

The LSTM network takes **the current input, previous hidden state, and previous cell state as input** and **produces the current hidden state and cell state**, which are then used to make predictions. The hidden state and cell state are then passed as input to the next time step in a sequence.
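
A single time step can be sketched in NumPy as follows. This is a minimal sketch of the standard LSTM equations; the stacked weight layout (`W`, `U`, `b`) and the gate ordering are assumptions chosen for compactness, not a particular framework's convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4H, input_size), U: (4H, H), b: (4H,)
    Assumed gate order in the stacked weights: input, forget, output, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell values
    c_t = f * c_prev + i * g       # additive cell-state update
    h_t = o * np.tanh(c_t)         # hidden state exposed to the next layer/step
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, U, b)       # state is carried to the next step
print(h.shape, c.shape)
```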

Overall, LSTMs help model dependencies between elements in a sequence, which makes them a powerful tool for sequential data modeling.

## 4. Transformer

<p align="center">
<img src="./images/2022-09-08-11-01-25.png" width=50% height=50%>
</p>

In natural language processing (NLP), the Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Google researchers. It is used for tasks such as language translation, text summarization, and question answering.

The Transformer architecture is based on self-attention, which allows the model to weigh the importance of different parts of the input when making predictions. It uses a multi-head self-attention mechanism: several attention heads run in parallel, each learning to attend to different relationships in the input. This allows the model to better capture the relationships between words in a sentence, which helps it generate more accurate and coherent text.

The Transformer architecture also uses a technique called positional encoding, which allows the model to understand the order of words in a sentence. Additionally, it uses a feed-forward neural network to further process the output of the self-attention mechanism.
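
The text above does not say which positional encoding scheme is meant. As an illustration, here is a sketch of the fixed sinusoidal encoding, which is an assumption about the variant; learned position embeddings are another common choice. The function name is hypothetical, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even. Even columns use sine, odd columns use cosine.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2), the even indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)
```

The resulting matrix is added to the token embeddings, so that identical tokens at different positions receive different input vectors.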

The Transformer architecture has become the dominant approach in NLP, and variants such as BERT, GPT, GPT-2, and GPT-3 have achieved state-of-the-art results on many tasks.

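### Formula of Attention

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Here $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the dimension of the keys.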
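
The formula translates almost directly into NumPy. This is a minimal single-head sketch without masking or batching; the function names and toy shapes are chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every query to every key
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Dividing by the square root of the key dimension keeps the scores from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.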

## 5. Pre-training

### Masked Language Model (MLM)

MLM is a pre-training technique used in NLP. Some of the tokens in a sentence are masked, and the model is trained to predict the masked tokens from the surrounding context on both sides.
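
A simplified sketch of how MLM training examples could be built is shown below. The `mask_tokens` helper, the `[MASK]` string, and the 15% default rate are assumptions for illustration; real recipes such as BERT's also sometimes keep or randomize the selected tokens, which is omitted here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with [MASK] with probability mask_prob.

    Returns (inputs, labels); labels hold the original token at masked
    positions and None elsewhere, so the loss is computed only on masked tokens.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(inputs)
print(labels)
```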

### Causal Language Model (CLM)

CLM predicts the next token given only the tokens that precede it, so the model never sees future context.
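
In practice this usually means the targets are the inputs shifted by one position. A minimal sketch, with hypothetical token IDs and a hypothetical helper name:

```python
def causal_lm_pairs(token_ids):
    """Split a sequence into (inputs, targets) where targets[i] follows inputs[i]."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets

token_ids = [5, 6, 7, 8]               # hypothetical token IDs
inputs, targets = causal_lm_pairs(token_ids)

# At position i the model may only use inputs[: i + 1] to predict targets[i].
for i, target in enumerate(targets):
    print(inputs[: i + 1], "->", target)
```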

## 6. BERT

BERT, which stands for **Bidirectional Encoder Representations from Transformers**, is based on Transformers and was introduced in a 2018 paper by Google researchers. It is used for tasks such as question answering, text classification, and named entity recognition.

BERT was pre-trained on text from English Wikipedia and BookCorpus. It can be fine-tuned on question answering datasets, such as SQuAD, to extract answers to questions. It can also be fine-tuned on the text classification tasks of the GLUE benchmark to classify text into different categories.

The BERT Base model has 12 Transformer layers and about 110M parameters, with a hidden size of 768 and an embedding size equal to the hidden size. This large size makes it computationally expensive to train.

### BERT Input

1. BERT takes two main inputs: the token IDs and the segment IDs (implementations typically also take an attention mask to mark padding). The token IDs are a list of integers, where each integer represents a token in the input. The segment IDs are a list of integers of the same length, where each integer indicates the segment to which the corresponding token belongs. They are used to differentiate between the two sentences in a sentence pair: all 0 for a single sentence, and 0 for the first sentence and 1 for the second in a pair.
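
The following sketch shows how the tokens and segment IDs line up for a sentence pair. It works on token strings rather than integer IDs to stay readable; the `[CLS]` and `[SEP]` special tokens follow BERT's convention, while the helper name and the pre-tokenized inputs are assumptions for the example.

```python
def build_bert_inputs(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) for a single sentence or a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                 # first segment, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)    # second segment, including its [SEP]
    return tokens, segment_ids

tokens, segment_ids = build_bert_inputs(["how", "are", "you"], ["i", "am", "fine"])
print(tokens)
print(segment_ids)
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'i', 'am', 'fine', '[SEP]']
# [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

A real pipeline would then map each token string to its integer ID using the model's vocabulary.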

### BERT Family

1. ALBERT: ALBERT stands for "A Lite BERT". The base model has about 12 million parameters, with a hidden size of 768 and an embedding size of 128. The following two techniques are used:
   1. Cross-layer parameter sharing: Only one set of encoder-layer parameters is learned, and the same parameters are reused across all encoder layers.
   2. Factorized embedding parameterization: Instead of tying the embedding size to the hidden size of 768, the embedding matrix is factorized into a smaller embedding of size 128 followed by a projection up to the hidden size.
2. RoBERTa: RoBERTa stands for “Robustly Optimized BERT pre-training Approach”. In many ways this is a better-trained version of the BERT model. The key points of difference are as follows:
   1. Dynamic Masking: BERT uses static masking, i.e. the masked positions are chosen once during preprocessing, so the same parts of a sentence are masked in each epoch. In contrast, RoBERTa uses dynamic masking, where different parts of the sentence are masked in different epochs. This makes the model more robust.
   2. Remove NSP Task: The Next Sentence Prediction (NSP) task was observed not to be very useful for pre-training the BERT model. Therefore, RoBERTa is pre-trained with the MLM task only.
   3. More Data Points: BERT is pre-trained on the Toronto BookCorpus and English Wikipedia datasets, a total of about 16 GB of data. In addition to these two datasets, RoBERTa was also trained on other datasets such as CC-News (Common Crawl News) and OpenWebText. The total size of these datasets is around 160 GB.
   4. Large Batch Size: To improve the speed and performance of the model, RoBERTa used a batch size of 8,000 with 300,000 steps. In comparison, BERT uses a batch size of 256 with 1 million steps.
3. ELECTRA: ELECTRA stands for “Efficiently Learning an Encoder that Classifies Token Replacements Accurately”. The model uses a generator-discriminator structure. Besides being cheaper to pre-train than BERT, ELECTRA has the following distinguishing features:
   1. Replaced Token Detection: Instead of MLM for pre-training, ELECTRA uses a task called “Replaced Token Detection” (RTD). In RTD, instead of masking a token, a small generator replaces it with a plausible alternative, and the discriminator is trained to classify, for every token, whether it is original or replaced.
   2. No NSP pre-training is performed.
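
To see why ALBERT's factorized embedding (item 1 above) saves parameters, the embedding sizes can be compared directly. The vocabulary size of 30,000 is an assumption for this back-of-the-envelope calculation; the hidden size 768 and embedding size 128 come from the text.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, with or without factorization."""
    if embedding_size is None:
        # Single matrix: vocab -> hidden
        return vocab_size * hidden_size
    # Factorized: vocab -> small embedding, then projection embedding -> hidden
    return vocab_size * embedding_size + embedding_size * hidden_size

V, H, E = 30_000, 768, 128   # V is an assumed vocabulary size

full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(full)        # 23040000
print(factorized)  # 3938304
```

Under these assumptions the factorized version needs roughly one sixth of the embedding parameters.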

## 7. T5

T5 (Text-to-Text Transfer Transformer) is an encoder-decoder Transformer that casts every NLP task as a text-to-text problem: both the input and the output are text strings. It builds on ideas from earlier models such as BERT and GPT-2 and is pre-trained on a large text corpus with a denoising objective. The changes compared to BERT include:

1. Adding a causal decoder to the bidirectional architecture.
2. Replacing the fill-in-the-blank cloze task with a mix of alternative pre-training tasks.

![T5 model structure](./images/2022-09-08-11-11-34.png)
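
One common form of denoising objective is span corruption: spans of the input are replaced by sentinel tokens, and the target lists each sentinel followed by the text it replaced. The sketch below is illustrative only. The spans are given by hand rather than sampled, and the `<extra_id_N>` sentinel naming is an assumption modelled on a common tokenizer convention.

```python
def span_corrupt(tokens, spans):
    """Build (inputs, targets) by replacing spans with sentinel tokens.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])   # keep text before the span
        inputs.append(sentinel)               # the span collapses to one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])     # the decoder must reproduce the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
# Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))
# <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The encoder reads the corrupted input bidirectionally, while the decoder generates the target left to right.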


## 8. ChatGPT

ChatGPT is a variant of the GPT (Generative Pre-trained Transformer) model, a type of language model that uses deep learning to generate human-like text. The model is trained on a large dataset of text and learns patterns in the data, which lets it generate new text that resembles its training data.

When given a prompt, ChatGPT uses the patterns it learned during training to generate a response. The model is autoregressive: it predicts the next token based on all the previous tokens, appends it to the sequence, and repeats. This allows it to generate text that is contextually appropriate and coherent.
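
The generation loop can be illustrated with a toy stand-in for the model. Everything here is hypothetical: `toy_model` is a hand-written lookup table rather than a neural network, and greedy selection is just one of several possible decoding strategies.

```python
def greedy_generate(next_token_probs, prompt, max_new_tokens, eos="<eos>"):
    """Repeatedly pick the most likely next token and append it to the sequence."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)          # distribution given all tokens so far
        next_tok = max(probs, key=probs.get)      # greedy choice
        if next_tok == eos:
            break
        tokens.append(next_tok)
    return tokens

# Toy "model": a lookup table keyed on the last token only (hypothetical values).
TOY_TABLE = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "sat": {"<eos>": 1.0},
}

def toy_model(tokens):
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})

generated = greedy_generate(toy_model, ["the"], max_new_tokens=5)
print(generated)  # ['the', 'cat', 'sat']
```

A real language model conditions on the whole sequence rather than only the last token, and sampling is often used instead of the greedy choice to produce more varied text.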

ChatGPT is built on the Transformer architecture. The model is first pre-trained on a large dataset of text and then fine-tuned on a smaller dataset specific to the task at hand, in this case dialogue, including training with human feedback. Pre-training allows the model to learn general patterns in language, while fine-tuning allows it to focus on the specific task and generate more accurate and relevant responses.