## Seq2Seq Models

1. Encoder-Decoders
2. Attention
3. Transformers
4. Transfer Learning - 2018
5. LLMs - ChatGPT

2018-2020 - Vision Transformers

2021 - GenAI

2022 - ChatGPT / Stable Diffusion


## BERT (Encoder-only architecture) - (Bidirectional Encoder Representations from Transformers)

* The encoder-only architecture reflects BERT's primary emphasis on understanding input sequences rather than generating output sequences.
* BERT base has 12 encoder layers.
* Each layer has a large feed-forward network (3072 hidden units) and 12 attention heads.
* BERT base has about 110M trainable parameters.
* Takes [CLS] as the first token of every input.
* Accepts at most 512 input tokens at a time.
* Outputs a 768-dimensional vector per token.

* WordPiece is a subword tokenization algorithm used in natural language processing (NLP) tasks.
* It breaks words into smaller units called subword tokens, which lets models handle out-of-vocabulary (OOV) words and improves performance on various NLP tasks.
* “geeksforgeeks” can be split into “geeks”, “##for”, and “##geeks”. The “##” prefix indicates that the subword continues the previous one.
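
The splitting rule described above can be sketched as a greedy longest-match-first search. This is a minimal illustration, not the tokenizer's actual implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for the example, and a real WordPiece vocabulary is learned from a corpus.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word maps to [UNK]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"geeks", "##for", "##geeks", "play", "##ing"}
print(wordpiece_tokenize("geeksforgeeks", toy_vocab))  # ['geeks', '##for', '##geeks']
print(wordpiece_tokenize("playing", toy_vocab))        # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))            # ['[UNK]']
```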

### Usage:
1. Classification.
2. Question answering
3. NER
4. Text Summarization
5. Semantic similarity

## BERT is pretrained on 2 tasks

### 1. Masked Language Model (MLM)
* Before BERT learns from a sentence, about 15% of its tokens are selected for prediction, and most of them are replaced with the special [MASK] token.
* BERT adds a classification layer on top of the encoder output to guess the selected tokens, then measures how close its guesses are to the actual tokens.
* The training loss is computed only on the selected tokens; predictions for the other tokens are ignored.
* Of the selected 15%: 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.
* This mix is used because [MASK] never appears at fine-tuning time. The model learns that a selected position may hold [MASK], a wrong token, or the correct token, so it cannot rely on [MASK] alone.
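
The 15% selection and the 80/10/10 split can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function name `mask_tokens`, the toy sentence, and the use of `None` to mark unselected positions are all choices made for the example. Real implementations work on token IDs, skip special tokens such as [CLS] and [SEP], and usually mark ignored positions with -100.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None = position not selected, no loss computed
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # roughly 85% of positions are left alone and ignored by the loss
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


sentence = "the quick brown fox jumps over the lazy dog".split()
toy_vocab = sorted(set(sentence))  # hypothetical vocabulary for the random replacements
corrupted, labels = mask_tokens(sentence * 20, toy_vocab)
selected = sum(label is not None for label in labels)
print(f"selected {selected} of {len(labels)} positions")
```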

### 2. Next Sentence Prediction (NSP)
* BERT predicts whether the second sentence is connected to the first.
* During training, BERT sees pairs of sentences and learns to predict whether the second sentence follows the first in the original document.
* This is done by passing the output of the [CLS] token through a classification layer to get a 2×1 vector, then applying softmax to obtain the probability that the second sentence follows the first.
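
The classification step on the [CLS] output can be sketched with NumPy. The weights here are random stand-ins rather than trained parameters, and the variable names are assumptions for the example. In the actual BERT model the [CLS] vector first passes through a pooling layer (dense + tanh), which this sketch leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768  # BERT base hidden size

cls_output = rng.normal(size=hidden_size)          # stand-in for the [CLS] vector
W = rng.normal(scale=0.02, size=(hidden_size, 2))  # classification layer weights
b = np.zeros(2)                                    # classification layer bias

logits = cls_output @ W + b          # shape (2,): one score per class
exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
probs = exp / exp.sum()              # softmax: two probabilities that sum to 1
print(probs, probs.sum())
```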

# BERT (Google)

In [1]:
import sys
!{sys.executable} -m pip install -q transformers

In [11]:
from transformers import BertTokenizer, BertModel, AdamW, get_linear_schedule_with_warmup
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader

In [7]:
# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased",
                                            unk_token="[UNK]",
                                            sep_token="[SEP]",
                                            pad_token="[PAD]",
                                            cls_token="[CLS]",
                                            mask_token="[MASK]",
                                            never_split=None)

In [4]:
text = 'ChatGPT is a language model developed by OpenAI, based on the GPT (Generative Pre-trained Transformer) architecture. '

# Tokenize and encode the text
encoding = tokenizer.encode(text, max_length=512, truncation=True)
print("Token IDs:", encoding)

Token IDs: [101, 24705, 1204, 17095, 1942, 1110, 170, 1846, 2235, 1872, 1118, 3353, 1592, 2240, 117, 1359, 1113, 1103, 15175, 1942, 113, 9066, 15306, 11689, 118, 3972, 13809, 23763, 114, 4220, 119, 102]


In [8]:
tokenizer(text)

{'input_ids': [101, 24705, 1204, 17095, 1942, 1110, 170, 1846, 2235, 1872, 1118, 3353, 1592, 2240, 117, 1359, 1113, 1103, 15175, 1942, 113, 9066, 15306, 11689, 118, 3972, 13809, 23763, 114, 4220, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [5]:
# Convert the first token ID back to its token
tokens = tokenizer.convert_ids_to_tokens([encoding[0]])
print("Tokens:", tokens)

Tokens: ['[CLS]']


In [12]:
# Load the basic BERT model 
bert_model = BertModel.from_pretrained("bert-base-cased")



In [13]:
bert_model.config

BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.37.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}
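
As a rough check on the parameter count, the sizes in the config above can be multiplied out by hand. This is a back-of-the-envelope sketch that assumes the standard BERT layout (token, position, and segment embeddings; per-layer attention and feed-forward blocks with biases and LayerNorm; a final pooler). The helper `linear` is a name made up for the example. The commonly quoted "110M" figure is a rounded value, and the exact count depends on the vocabulary size.

```python
# Values copied from the BertConfig printed above.
vocab_size, hidden, max_pos, type_vocab = 28996, 768, 512, 2
layers, intermediate = 12, 3072

def linear(n_in, n_out):
    """Parameter count of a dense layer: weights plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab_size + max_pos + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # Q, K, V and output projections
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
pooler = linear(hidden, hidden)

total = embeddings + layers * (attention + feed_forward) + pooler
print(f"{total:,}")  # 108,310,272
```

Note that the number of attention heads does not enter the count: the heads split the same 768-dimensional projections rather than adding new ones.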

In [None]:
# Build the sentiment classifier
class SentimentClassifier(nn.Module):

    # Constructor
    def __init__(self, n_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-cased")
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    # Forward pass
    def forward(self, input_ids, attention_mask):
        # BertModel returns a ModelOutput object by default (transformers v4+),
        # so read the pooled [CLS] representation by attribute rather than tuple unpacking.
        outputs = self.bert(
          input_ids=input_ids,
          attention_mask=attention_mask
        )
        pooled_output = outputs.pooler_output
        # Apply dropout before the classification layer
        output = self.drop(pooled_output)
        return self.out(output)

In [None]:
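# Instantiate the model and move it to the target device.
# `class_names` is assumed to be the list of label names from the dataset (not defined in these notes).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SentimentClassifier(len(class_names))
model = model.to(device)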

In [None]:
print(bert_model.config.hidden_size)


In [None]:
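# Number of training epochs
EPOCHS = 10

# AdamW optimizer; correct_bias=False mirrors the original BERT optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)

# `train_data_loader` is assumed to be a DataLoader defined elsewhere
total_steps = len(train_data_loader) * EPOCHS

# Learning rate decays linearly to 0 over training, with no warmup
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

# Set the loss function
loss_fn = nn.CrossEntropyLoss().to(device)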
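
To see what the linear schedule does to the learning rate, the multiplier can be written out by hand. This is a sketch of the usual warmup-then-linear-decay rule, not the library's source. The function name `linear_schedule_factor` and the value `total_steps = 1000` are assumptions for the example.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)  # linear warmup from 0 to 1
    remaining = num_training_steps - step
    # linear decay from 1 down to 0 at the end of training
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 1000  # hypothetical; the notebook uses len(train_data_loader) * EPOCHS
for step in (0, 250, 500, 1000):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With `num_warmup_steps=0`, as in the cell above, training starts at the full learning rate and decays steadily to zero.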

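## T5 - Encoder-decoder
* T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks.
* For the unsupervised objective, the input of the encoder is the corrupted sentence, with each dropped span replaced by a sentinel token.
* The target is the dropped-out tokens delimited by their sentinel tokens; the decoder input is this target shifted right by one position.
* Training uses teacher forcing: the decoder is fed the ground-truth previous tokens rather than its own predictions.
* The loss is cross-entropy over the target tokens.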
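
The corrupted input and sentinel-delimited target can be illustrated with a small helper. This is a sketch only: the function name `span_corrupt` and the hand-picked spans are assumptions for the example, whereas real pre-training samples the spans randomly and works on token IDs. The `<extra_id_N>` names follow the sentinel tokens in the T5 vocabulary.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (encoder_input, target)."""
    encoder_input, target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{sentinel_id}>"
        encoder_input += tokens[cursor:start] + [sentinel]  # kept text, then a sentinel
        target += [sentinel] + tokens[start:end]            # sentinel, then the dropped text
        cursor = end
    encoder_input += tokens[cursor:]
    target.append(f"<extra_id_{len(spans)}>")               # a final sentinel closes the target
    return " ".join(encoder_input), " ".join(target)


tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```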

In [15]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

In [16]:
tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")



In [19]:
max_source_length = 512
max_target_length = 128

# Suppose we have the following 2 training examples:
input_sequence_1 = "Welcome to NYC"
output_sequence_1 = "Bienvenue à NYC"

input_sequence_2 = "HuggingFace is a company"
output_sequence_2 = "HuggingFace est une entreprise"

# encode the inputs
task_prefix = "translate English to French: "
input_sequences = [input_sequence_1, input_sequence_2]
input_sequences

['Welcome to NYC', 'HuggingFace is a company']

In [21]:
encoding = tokenizer(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=max_source_length,
    truncation=True,
    return_tensors="pt",
)
encoding

{'input_ids': tensor([[13959,  1566,    12,  2379,    10,  5242,    12, 13465,     1,     0,
             0,     0,     0,     0],
        [13959,  1566,    12,  2379,    10, 11560,  3896,   371,  3302,    19,
             3,     9,   349,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [22]:
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask

In [23]:
# encode the targets
target_encoding = tokenizer(
    [output_sequence_1, output_sequence_2],
    padding="longest",
    max_length=max_target_length,
    truncation=True,
    return_tensors="pt",
)
target_encoding

{'input_ids': tensor([[10520, 15098,     3,    85, 13465,     1,     0,     0],
        [11560,  3896,   371,  3302,   259,   245, 11089,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}

In [25]:
labels = target_encoding.input_ids
# replace padding token ids in the labels with -100 so they are ignored by the loss
labels[labels == tokenizer.pad_token_id] = -100

In [26]:
labels

tensor([[10520, 15098,     3,    85, 13465,     1,  -100,  -100],
        [11560,  3896,   371,  3302,   259,   245, 11089,     1]])

In [27]:
# forward pass
loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
loss.item()

0.18801367282867432
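
When `labels` are passed, the decoder inputs for teacher forcing are derived from them by shifting one position to the right. The sketch below shows that shift on a plain list. The function name `shift_right` is an assumption for the example, and it relies on T5 using the pad token (id 0) as the decoder start token.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    """Build teacher-forcing decoder inputs from a list of label ids."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # -100 only has meaning for the loss; the decoder must see a real token id.
    return [pad_token_id if token == -100 else token for token in shifted]


labels_row = [10520, 15098, 3, 85, 13465, 1, -100, -100]  # first row of `labels` above
print(shift_right(labels_row))
# [0, 10520, 15098, 3, 85, 13465, 1, 0]
```

At each position the decoder sees the correct previous token and is scored on predicting the label at that position; positions labelled -100 contribute nothing to the loss.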

In [31]:
y = model.generate(input_ids)
print(tokenizer.decode(y[0]))
print(tokenizer.decode(y[0], skip_special_tokens=True))

<pad> Bienvenue à NYC</s><pad><pad>
Bienvenue à NYC




In [None]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

#set up tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)


In [None]:
one_piece_sequence = ("The series focuses on Monkey D. Luffy, a young man made of rubber, who, inspired by his childhood idol, "
             "the powerful pirate Red-Haired Shanks, sets off on a journey from the East Blue Sea to find the mythical treasure, "
             "the One Piece, and proclaim himself the King of the Pirates. In an effort to organize his own crew, the Straw Hat Pirates, "
             "Luffy rescues and befriends a pirate hunter and swordsman named Roronoa Zoro, and they head off in search of the "
             "titular treasure. They are joined in their journey by Nami, a money-obsessed thief and navigator; Usopp, a sniper "
             "and compulsive liar; and Sanji, a perverted but chivalrous cook. They acquire a ship, the Going Merry, and engage in confrontations "
             "with notorious pirates of the East Blue. As Luffy and his crew set out on their adventures, others join the crew later in the series, "
             "including Tony Tony Chopper, an anthropomorphized reindeer doctor; Nico Robin, an archaeologist and former Baroque Works assassin; "
             "Franky, a cyborg shipwright; Brook, a skeleton musician and swordsman; and Jimbei, a fish-man helmsman and former member of the Seven "
             "Warlords of the Sea. Once the Going Merry is damaged beyond repair, Franky builds the Straw Hat Pirates a new ship, the Thousand Sunny. "
             "Together, they encounter other pirates, bounty hunters, criminal organizations, revolutionaries, secret agents, and soldiers of the "
             "corrupt World Government, and various other friends and foes, as they sail the seas in pursuit of their dreams.")
inputs = tokenizer.encode("summarize: " + one_piece_sequence,
                          return_tensors='pt',
                          max_length=512,
                          truncation=True)
summarization_ids = model.generate(inputs, max_length=80, min_length=40, length_penalty=5., num_beams=2)
summarization = tokenizer.decode(summarization_ids[0])
print(summarization)


In [None]:
## translate English to French

language_sequence = ("You should definitely watch 'One Piece', it is so good, you will love the comic book")
input_ids = tokenizer("translate English to French: "+language_sequence, return_tensors="pt").input_ids 
language_ids = model.generate(input_ids)
language_translation = tokenizer.decode(language_ids[0],skip_special_tokens=True)
print(language_translation)


In [None]:
### Sentence Similarity

stsb_sentence_1 = ("Luffy was fighting in the war.")
stsb_sentence_2 = ("Luffy's fighting style is comical.")
input_ids = tokenizer("stsb sentence 1: "+stsb_sentence_1+" sentence 2: "+stsb_sentence_2, return_tensors="pt").input_ids 
stsb_ids = model.generate(input_ids)
stsb = tokenizer.decode(stsb_ids[0],skip_special_tokens=True)
print(stsb)


## A Lite BERT (ALBERT)

* Parameter compression through factorized embeddings and cross-layer parameter sharing.
* ALBERT-large trains about 1.7x faster than BERT-large.
* The largest ALBERT configurations need much more computation than BERT because of their larger structure.
* ALBERT is better suited to problems where speed can be traded for higher accuracy.
* Embedding size - 128.
* Hidden units - [768, 1024, 2048, 4096] (base, large, xlarge, xxlarge)

## RoBERTa (Robustly Optimized BERT Pretraining Approach) (Meta AI)

* Doesn’t use the next-sentence prediction pretraining objective.
* Is trained with much larger mini-batches and learning rates.
* Uses a byte-level BPE tokenizer.
* Trained on 160GB of uncompressed text.
* Trained on longer sequences.
* Trained with dynamic masking: the masking pattern is re-sampled each time a sequence is fed to the model instead of being fixed once during preprocessing.
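
The difference between static and dynamic masking can be shown with a toy sampler. This is an illustrative sketch only: the function name `sample_mask` and the sizes are assumptions for the example, and real data collators apply the masking to batches of token IDs.

```python
import random

def sample_mask(num_tokens, rng, mask_prob=0.15):
    """Return the positions chosen for masking."""
    return [i for i in range(num_tokens) if rng.random() < mask_prob]


num_tokens, epochs = 1000, 3  # hypothetical sizes

# Static masking: positions are chosen once and reused every epoch.
static_positions = sample_mask(num_tokens, random.Random(0))
static_masks = [static_positions for _ in range(epochs)]

# Dynamic masking: positions are re-sampled each time the sequence is seen.
rng = random.Random(0)
dynamic_masks = [sample_mask(num_tokens, rng) for _ in range(epochs)]

print(all(m == static_masks[0] for m in static_masks))  # True
print(len({tuple(m) for m in dynamic_masks}))           # number of distinct masks seen
```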

In [None]:
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Load the model and tokenizer
model_name = "cardiffnlp/twitter-roberta-base-emotion"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name)

# Tokenize the input
inputs = tokenizer("I love my cat", return_tensors="pt")

# Get the logits and use them to predict the underlying emotion
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

## GPT (OpenAI’s Generative Pre-trained Transformers)

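### GPT1-2
* Semi-supervised: unsupervised pre-training followed by supervised fine-tuning.
* Decoder-only model.
* Learned positional embeddings, added to the token embeddings.
* The resulting embedding carries information about both the word and its position in the sequence.
* Autoregressive generation: each new token is predicted from the tokens before it.
* GPT-2: 1.5B-parameter Transformer.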
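
Autoregressive generation can be sketched as a loop that feeds each predicted token back in as input. The "model" here is a hypothetical bigram score table invented for the example. A real GPT conditions on the whole prefix, not just the last token, and usually samples rather than always taking the top score.

```python
# Toy "model": next-token scores from a bigram table (hypothetical values).
bigram_scores = {
    "<s>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
}

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = bigram_scores.get(tokens[-1], {"</s>": 1.0})
        next_token = max(scores, key=scores.get)  # greedy: pick the most likely token
        tokens.append(next_token)                 # feed the output back in as input
        if next_token == "</s>":
            break
    return tokens


print(generate(["<s>"]))  # ['<s>', 'the', 'cat', 'sat', '</s>']
```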

### GPT3
* 175 billion parameters.
* GPT-3 can get pretty good at most tasks without fine-tuning the model.
* Instead, you experiment with the input to the model to find the right input format (prompt) for a particular task.

* And thus, PROMPT ENGINEERING was born.

### GPT4
* Reportedly uses a Mixture of Experts (MoE); OpenAI has not confirmed the architecture. Instead of running the whole model, parts of the model (experts) specialize, for example in domains like art or science, and only the relevant experts are used or loaded.
* Reportedly needs only about 280 billion of the model's 1.8 trillion parameters (roughly 15%) on any given inference.
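
The idea of activating only a few experts per token can be sketched with top-k routing in NumPy. This is a generic Mixture-of-Experts illustration under assumed toy sizes, with random weights. It says nothing about how GPT-4 itself is built, and all names and dimensions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 8, 2  # hypothetical toy sizes

x = rng.normal(size=d_model)                                # one token's hidden state
router = rng.normal(size=(d_model, num_experts))            # gating weights
experts = rng.normal(size=(num_experts, d_model, d_model))  # one linear map per expert

gate_logits = x @ router
chosen = np.argsort(gate_logits)[-top_k:]  # indices of the top-k experts
weights = np.exp(gate_logits[chosen] - gate_logits[chosen].max())
weights /= weights.sum()                   # softmax over the chosen experts only

# Only the chosen experts run; the others cost no compute for this token.
y = sum(w * (x @ experts[e]) for w, e in zip(weights, chosen))
print(chosen, y.shape)
print(f"active share from the notes: {280e9 / 1.8e12:.1%}")  # 15.6%
```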


In [4]:
import sys
!{sys.executable} -m pip install -q ipywidgets torch transformers

In [None]:
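from transformers import pipeline

# Assumption: a text-generation pipeline with the public "gpt2" checkpoint is intended here;
# the abstract `Pipeline` base class cannot be instantiated without a model.
generator = pipeline("text-generation", model="gpt2")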

Input: 100 tokens
Embedding dim: 512

Input embedding: [100 * 512]

Positional embedding: [100 * 512] (added to the input embedding)

Number of heads = 8

Self-attention
Default weight initialization??

With multiple heads, each head works in 512 / 8 = 64 dimensions

Wq, Wk, Wv = [512 * 64] each, per head

Input = [100 * 512]

q = [100 * 64]
k = [100 * 64]
v = [100 * 64]

Attention scores = q . kT
[100 * 64] . [64 * 100] = [100 * 100]

Scale the scores by 1 / sqrt(d_k) = 1 / sqrt(64) so that large dot products do not saturate the softmax:

[100 * 100] scaled scores

softmax (row-wise) => [100 * 100] attention probabilities

attention probabilities . v
[100 * 100] . [100 * 64] = [100 * 64] per head
concat 8 heads of [100 * 64] = [100 * 512] from multi-head attention

[100 * 512] output of the multi-head attention layer
Layer normalization - normalizes each token's vector across its 512 features (independent of sentence length)
Residual connections - help avoid vanishing gradients
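
The shape bookkeeping above can be checked with a small NumPy sketch. This is a minimal illustration that uses random matrices in place of learned weights. It leaves out biases and the learned scale and shift of layer normalization, and the variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 100, 512, 8
d_head = d_model // num_heads  # 512 / 8 = 64

x = rng.normal(size=(seq_len, d_model))  # token + positional embeddings, [100, 512]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for _ in range(num_heads):
    # Random projections stand in for the learned Wq, Wk, Wv of this head.
    Wq, Wk, Wv = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_head)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv    # each [100, 64]
    scores = q @ k.T / np.sqrt(d_head)  # [100, 100], scaled by sqrt(d_k)
    weights = softmax(scores)           # each row sums to 1
    head_outputs.append(weights @ v)    # [100, 64]

concat = np.concatenate(head_outputs, axis=-1)  # 8 heads -> [100, 512]
Wo = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
attn_out = concat @ Wo                          # output projection, [100, 512]

# Residual connection, then layer normalization over the feature axis.
residual = x + attn_out
mean = residual.mean(axis=-1, keepdims=True)
std = residual.std(axis=-1, keepdims=True)
normed = (residual - mean) / (std + 1e-12)
print(concat.shape, normed.shape)  # (100, 512) (100, 512)
```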