<a href="https://colab.research.google.com/github/Firojpaudel/GenAI-Chronicles/blob/main/BERTs/Seq2Seq_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Learning Seq2Seq Models and Trying to Implement Them**
---

First, what is a Seq2Seq model? Let's define it, then focus on various models built on the Seq2Seq idea.

#### _Seq2Seq:_

The **Sequence-to-Sequence (Seq2Seq)** model is a powerful architecture in machine learning designed to transform one sequence into another, making it particularly useful for tasks involving *sequential data*, such as **language translation, text generation, and more**.

A Seq2Seq model consists of two main components:
___
1. **Encoder** \
***Function:*** The encoder processes the input sequence and encodes it into a fixed-size context vector, which represents the essential information of the input data. \
***Architecture:*** Typically implemented using Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, or Gated Recurrent Units (GRUs). The encoder reads the input sequence one element at a time and updates its hidden state accordingly. Once the entire input sequence is processed, the final hidden state is used as the context vector.

2. **Decoder** \
***Function:*** The decoder takes the context vector produced by the encoder and generates the output sequence step-by-step. \
***Process:*** The decoder operates in an autoregressive manner, meaning it generates one element of the output at each time step while considering its previous outputs. It uses both the context vector and its own previous hidden states to predict the next element in the sequence.
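
The encoder/decoder flow described above can be sketched in plain Python. This is only a toy illustration with random, untrained weights: the vocabulary, the hidden size, and helper names such as `rnn_step`, `encode`, and `decode` are all assumptions made for the example, not part of any library.

```python
import math
import random

random.seed(0)  # reproducible toy weights

VOCAB = ["<sos>", "<eos>", "a", "b", "c"]  # hypothetical toy vocabulary
HIDDEN = 4                                  # size of the hidden state / context vector


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def one_hot(index):
    return [1.0 if i == index else 0.0 for i in range(len(VOCAB))]


def rnn_step(w_in, w_hid, x, h):
    # Simple recurrent update: h_new = tanh(W_in x + W_hid h)
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_hid, h))]


# Untrained, random parameters (a real model would learn these)
enc_in, enc_hid = rand_matrix(HIDDEN, len(VOCAB)), rand_matrix(HIDDEN, HIDDEN)
dec_in, dec_hid = rand_matrix(HIDDEN, len(VOCAB)), rand_matrix(HIDDEN, HIDDEN)
dec_out = rand_matrix(len(VOCAB), HIDDEN)


def encode(token_ids):
    # Read the input one token at a time; the final hidden state is the context vector
    h = [0.0] * HIDDEN
    for token_id in token_ids:
        h = rnn_step(enc_in, enc_hid, one_hot(token_id), h)
    return h


def decode(context, max_len=5):
    # Autoregressive greedy decoding: each step is fed the previously generated token
    h, prev, output = context, VOCAB.index("<sos>"), []
    for _ in range(max_len):
        h = rnn_step(dec_in, dec_hid, one_hot(prev), h)
        scores = matvec(dec_out, h)
        prev = max(range(len(VOCAB)), key=lambda i: scores[i])
        if VOCAB[prev] == "<eos>":
            break
        output.append(VOCAB[prev])
    return output


context = encode([VOCAB.index(t) for t in ["a", "b", "c"]])
print(len(context), decode(context))
```

Because the weights are random, the decoded tokens are meaningless; the point is only the data flow: the whole input is squeezed into one fixed-size vector, and the decoder unrolls from it step by step.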

---
<figure align= "center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*Ismhi-muID5ooWf3ZIQFFg.png" alt="Encoder_Decoder_s2s_illustrating_architecture" width= "550"/>
  <figcaption><i>Simple illustration of the process</i></figcaption>
</figure>

---

#### _Models on Seq2Seq_

##### 1. **BART**

Click [here](https://arxiv.org/pdf/1910.13461) to access the paper.

**The summary:**

**BART** (*Bidirectional and Auto-Regressive Transformers*) is a denoising autoencoder designed for pretraining sequence-to-sequence models. Its training involves:

1. Corrupting text using a noising function.
2. Learning to reconstruct the original text.

BART employs a Transformer-based architecture, combining features of BERT (bidirectional encoder) and GPT (left-to-right decoder). It excels in text generation tasks and achieves strong results in comprehension tasks, matching RoBERTa on GLUE and SQuAD benchmarks.

***Key contributions:***

- Uses novel noising approaches like shuffling sentence order and text span infilling with a single mask token.
- Sets new state-of-the-art results in abstractive tasks (dialogue, QA, summarization) with gains of up to 6 ROUGE points.
- Improves BLEU scores for machine translation with target language pretraining (+1.1 BLEU over back-translation systems).
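
To make the two noising approaches mentioned above more concrete, here is a minimal sketch in plain Python. It is a simplification: the paper samples span lengths from a Poisson distribution (λ = 3), whereas this example masks one fixed span, splits sentences naively on periods, and uses hypothetical helper names (`permute_sentences`, `infill_span`).

```python
import random

MASK = "<mask>"


def permute_sentences(text, rng):
    # Sentence permutation: split on periods (naive) and shuffle the order
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


def infill_span(tokens, start, length):
    # Text infilling: replace a whole span with a single mask token,
    # so the model also has to infer how many tokens are missing
    return tokens[:start] + [MASK] + tokens[start + length:]


rng = random.Random(0)
original = "The cat sat. The dog barked. The bird flew."
shuffled = permute_sentences(original, rng)
print(shuffled)

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted = infill_span(tokens, start=1, length=3)
print(corrupted)  # ['the', '<mask>', 'jumps', 'over', 'the', 'lazy', 'dog']
```

During pretraining the corrupted text would be the encoder input and the original text the decoder target.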

---
Starting the code:

##### _First, let's set up the environment_

In [None]:
!pip install datasets #@ transformers is already available in colab

> *I'll try to learn summarization first, then go through translation and other tasks as well.*

In [2]:
from transformers import pipeline

summarizer= pipeline("summarization", model="facebook/bart-large-cnn")

Device set to use cuda:0


In [3]:
input_text = '''
      In recent years, the study of artificial intelligence (AI) has gained significant attention across various fields, including computer science, engineering, healthcare, and social sciences. AI, defined broadly as the simulation of human intelligence in machines, encompasses various subfields such as machine learning, natural language processing, robotics, and computer vision. A key factor driving this surge in interest is the rapid development of algorithms capable of processing vast amounts of data with increasing accuracy. In healthcare, for example, AI systems are being used to diagnose diseases, predict patient outcomes, and even personalize treatment plans based on an individual’s genetic makeup. Machine learning models, which allow computers to learn from data without explicit programming, have been particularly influential in this regard. However, despite these advances, AI also presents several challenges, including issues of data privacy, algorithmic bias, and the ethical implications of autonomous decision-making. Moreover, as AI continues to evolve, there are growing concerns about its potential impact on employment, as automation threatens to replace human labor in certain industries. These challenges highlight the need for interdisciplinary research that addresses both the technological and societal dimensions of AI. Researchers are increasingly exploring the role of regulation, ethics, and governance in the development and deployment of AI systems, ensuring that they serve the broader good of society while minimizing risks.
'''

In [4]:
summary = summarizer(input_text, max_length= 200, min_length= 100, do_sample= False)

In [5]:
print(summary[0]['summary_text'])

In recent years, the study of artificial intelligence (AI) has gained significant attention. In healthcare, for example, AI systems are being used to diagnose diseases, predict patient outcomes, and even personalize treatment plans based on an individual’s genetic makeup. Despite these advances, AI also presents several challenges, including issues of data privacy, algorithmic bias, and the ethical implications of autonomous decision-making. These challenges highlight the need for interdisciplinary research that addresses both the technological and societal dimensions of AI.


The model does in fact summarize the text. Now let's get a dataset and pass it through the model that I just "imported".

> *Of course, we could create our own custom dataset, but for the sake of learning I'm going a bit easy.*

In [6]:
##@ loading the dataset

from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")  # dataset name and config version on the Hugging Face Hub

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [7]:
print(dataset['train'][0])

{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office char
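
Each record pairs an `article` with reference `highlights`, so a generated summary can be scored against the reference. Below is a minimal sketch of a ROUGE-1-style unigram overlap score, assuming lowercasing and whitespace tokenization. The official ROUGE tooling handles tokenization and stemming more carefully, and `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    # Count unigrams on both sides (lowercased, whitespace-tokenized)
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score, 3))  # 0.667
```

In the notebook's terms, `candidate` would be a summary produced by the model and `reference` the `highlights` field of the same record.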

Time for Fine-Tuning...
> **Slight Note:** We could in fact use `Trainer` from Hugging Face and it would be a lot easier, but I'm trying to learn the mechanism here, so I'll stick with plain PyTorch for a while.

In [9]:
import torch
from torch.utils.data import DataLoader
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer= BartTokenizer.from_pretrained("facebook/bart-large-cnn")

## Preprocess

def preprocess_fun(batch):
  # Tokenize the articles (model inputs), padded/truncated to 1024 tokens
  inputs = tokenizer(batch['article'], max_length=1024, truncation=True,
                     padding="max_length")
  # Tokenize the reference summaries (targets), padded/truncated to 128 tokens
  targets = tokenizer(batch['highlights'], max_length=128, truncation=True,
                      padding="max_length")
  # Dataset.map stores plain lists, so tensors are requested later via set_format
  return {'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask'],
          'labels': targets['input_ids']}

## Preparing the dataloaders

train_data = dataset['train'].map(preprocess_fun, batched= True)
train_loader = DataLoader(train_data, batch_size= 8, shuffle= True)


## Defining the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr= 5e-5)

AttributeError: 'list' object has no attribute 'to'
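
A related detail about the `labels` built in the preprocessing step: they are padded to 128 tokens, and as written the padding positions count toward the loss. Hugging Face seq2seq models ignore label positions set to `-100`, so a common pattern is to replace the padding ids before training. Here is a minimal sketch in plain Python, assuming a pad id of 1 (in practice it is safer to read it from `tokenizer.pad_token_id`); `mask_label_padding` is a hypothetical helper and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_label_padding(label_ids, pad_id):
    # Replace every padding id with IGNORE_INDEX, leaving real tokens untouched
    return [[tok if tok != pad_id else IGNORE_INDEX for tok in seq]
            for seq in label_ids]


PAD_ID = 1  # assumption for this example; use tokenizer.pad_token_id in real code
batch_labels = [[0, 713, 16, 2, 1, 1],   # made-up token ids, padded to length 6
                [0, 100, 2, 1, 1, 1]]
masked = mask_label_padding(batch_labels, PAD_ID)
print(masked[0])  # [0, 713, 16, 2, -100, -100]
```

Only the labels are masked this way; `input_ids` keep their padding, which the `attention_mask` already accounts for.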

In [12]:
# Make the dataset return PyTorch tensors; without this, batches hold
# Python lists, which have no .to() method
train_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

epochs = 3
for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # Report the mean loss over the epoch rather than only the last batch
    print(f"Epoch {epoch + 1}: Loss = {total_loss / len(train_loader):.4f}")

OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 1.06 MiB is free. Process 11134 has 14.74 GiB memory in use. Of the allocated memory 14.56 GiB is allocated by PyTorch, and 64.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
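
A rough back-of-the-envelope estimate helps explain the out-of-memory error. The sketch below assumes the roughly 1.63 GB `facebook/bart-large-cnn` checkpoint stores 32-bit floats, that gradients take as much space as the weights, and that Adam keeps two extra values per parameter. Activations are deliberately left out, so this is a lower bound and not a measurement.

```python
GIB = 2 ** 30

checkpoint_bytes = 1.63e9   # size of model.safetensors reported at download time
bytes_per_value = 4         # assumption: weights stored as 32-bit floats

n_params = checkpoint_bytes / bytes_per_value

weights = n_params * bytes_per_value         # the model itself
gradients = n_params * bytes_per_value       # one gradient per parameter
adam_state = 2 * n_params * bytes_per_value  # Adam's first and second moment estimates

static_gib = (weights + gradients + adam_state) / GIB
print(f"~{n_params / 1e6:.0f}M parameters, ~{static_gib:.1f} GiB before activations")
```

Under these assumptions about 6 GiB of the 14.75 GiB is taken before a single batch is processed. The rest has to hold activations for 8 sequences of 1024 tokens, so a smaller batch size (optionally with gradient accumulation) or a shorter `max_length` is usually the first thing to try.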

In [None]:
## Saving the model:
model.save_pretrained('./bart-finetuned')
tokenizer.save_pretrained('./bart-finetuned')

Turns out the plain PyTorch implementation takes a lot of time, so I'll be using `Trainer` instead.

> _"Saving on my GPU quota"_