<a href="https://colab.research.google.com/github/Firojpaudel/GenAI-Chronicles/blob/main/Seq2Seq/Seq2Seq_BART_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Learning Seq2Seq Models**: *BART Summarization and Fine-Tuning*
---

First, what is a Seq2Seq model? Let's define it, then focus on the models built on it.

#### _Seq2Seq:_

The **Sequence-to-Sequence (Seq2Seq)** model is a powerful architecture in machine learning designed to transform one sequence into another, making it particularly useful for tasks involving *sequential data*, such as **language translation, text generation, and more**.

A Seq2Seq model consists of two main components:
___
1. **Encoder** \
***Function:*** The encoder processes the input sequence and encodes it into a fixed-size context vector, which represents the essential information of the input data. \
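***Architecture:*** Classic Seq2Seq encoders are typically implemented using Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, or Gated Recurrent Units (GRUs). The encoder reads the input sequence one element at a time and updates its hidden state accordingly. Once the entire input sequence is processed, the final hidden state is used as the context vector. (Transformer-based models such as BART replace the recurrence with self-attention, and the decoder attends to all encoder states rather than a single vector.)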

2. **Decoder** \
***Function:*** The decoder takes the context vector produced by the encoder and generates the output sequence step-by-step. \
***Process:*** The decoder operates in an autoregressive manner, meaning it generates one element of the output at each time step while considering its previous outputs. It uses both the context vector and its own previous hidden states to predict the next element in the sequence.
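
To make the two roles concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the weights are hand-picked and untrained, the names `encode`, `decode` and `HIDDEN` are made up, and the scoring rule is only a stand-in for a learned output layer. It only shows the control flow: the encoder folds an input of any length into a fixed-size state, and the decoder generates one token per step while feeding its previous output back in.

```python
import math

HIDDEN = 3  # size of the context vector (hypothetical toy value)

def encode(tokens):
    """Toy recurrent encoder: read tokens one at a time, return the final hidden state."""
    h = [0.0] * HIDDEN
    for t in tokens:
        # Each hidden unit mixes the previous state with the current token id.
        h = [math.tanh(0.5 * h[i] + 0.1 * (i + 1) * t) for i in range(HIDDEN)]
    return h  # fixed-size context vector, whatever the input length

def decode(context, vocab_size=5, eos_id=0, max_steps=10):
    """Toy autoregressive decoder: each step uses its state and its previous output."""
    h, prev, out = list(context), eos_id, []  # eos_id doubles as the start token here
    for _ in range(max_steps):
        h = [math.tanh(0.5 * h[i] + 0.1 * (i + 1) * prev) for i in range(HIDDEN)]
        # Untrained scoring rule standing in for a learned output layer.
        scores = [sum(h) * v - 0.5 * v * v for v in range(vocab_size)]
        prev = max(range(vocab_size), key=scores.__getitem__)  # greedy choice
        if prev == eos_id:
            break
        out.append(prev)
    return out

context = encode([1, 2, 3])
generated = decode(context)
```

The context has the same length for `encode([1, 2])` and `encode([1, 2, 3, 4, 5, 6])`, which is exactly the bottleneck that attention-based models later relaxed.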

---
<figure align= "center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*Ismhi-muID5ooWf3ZIQFFg.png" alt="Encoder_Decoder_s2s_illustrating_architecture" width= "550"/>
  <figcaption><i>Simple illustration of the process</i></figcaption>
</figure>

---

#### _Models on Seq2Seq_

##### 1. **BART**

Click [here](https://arxiv.org/pdf/1910.13461) to access the paper.

**The summary:**

**BART** (*Bidirectional and Auto-Regressive Transformers*) is a denoising autoencoder designed for pretraining sequence-to-sequence models. Its training involves:

1. Corrupting text using a noising function.
2. Learning to reconstruct the original text.

BART employs a Transformer-based architecture, combining features of BERT (bidirectional encoder) and GPT (left-to-right decoder). It excels in text generation tasks and achieves strong results in comprehension tasks, matching RoBERTa on GLUE and SQuAD benchmarks.

***Key contributions:***

- Uses novel noising approaches like shuffling sentence order and text span infilling with a single mask token.
- Sets new state-of-the-art results in abstractive tasks (dialogue, QA, summarization) with gains of up to 6 ROUGE points.
- Improves BLEU scores for machine translation with target language pretraining (+1.1 BLEU over back-translation systems).
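
The two noising schemes named above are easy to picture with a small sketch. This is a simplified illustration, not the paper's implementation: the helper names are hypothetical, it works on whitespace tokens, and the span position and length are passed in by hand (the paper samples span lengths from a Poisson distribution with λ = 3).

```python
import random

def text_infilling(tokens, start, length, mask="<mask>"):
    """Replace the span tokens[start:start+length] with a single mask token.

    A length of 0 simply inserts a mask token at `start`.
    """
    return tokens[:start] + [mask] + tokens[start + length:]

def permute_sentences(sentences, seed=0):
    """Return the sentences in a shuffled order (seeded for reproducibility)."""
    rng = random.Random(seed)
    shuffled = sentences[:]  # copy so the original order is kept for the target
    rng.shuffle(shuffled)
    return shuffled

original = "the cat sat on the mat".split()
corrupted = text_infilling(original, start=1, length=3)
# corrupted == ['the', '<mask>', 'the', 'mat']
```

During pretraining the model sees `corrupted` as input and is trained to reproduce `original`. Because a whole span collapses into one mask token, the model also has to work out how many tokens are missing.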

---
Starting the code:

##### _First, let's set up the environment_

In [None]:
!pip install torch transformers datasets accelerate

> *I'll start with summarization first, then move on to translation and other tasks.*

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn", clean_up_tokenization_spaces=True)

In [3]:
input_text = '''
      In recent years, the study of artificial intelligence (AI) has gained significant attention across various fields, including computer science, engineering, healthcare, and social sciences. AI, defined broadly as the simulation of human intelligence in machines, encompasses various subfields such as machine learning, natural language processing, robotics, and computer vision. A key factor driving this surge in interest is the rapid development of algorithms capable of processing vast amounts of data with increasing accuracy. In healthcare, for example, AI systems are being used to diagnose diseases, predict patient outcomes, and even personalize treatment plans based on an individual’s genetic makeup. Machine learning models, which allow computers to learn from data without explicit programming, have been particularly influential in this regard. However, despite these advances, AI also presents several challenges, including issues of data privacy, algorithmic bias, and the ethical implications of autonomous decision-making. Moreover, as AI continues to evolve, there are growing concerns about its potential impact on employment, as automation threatens to replace human labor in certain industries. These challenges highlight the need for interdisciplinary research that addresses both the technological and societal dimensions of AI. Researchers are increasingly exploring the role of regulation, ethics, and governance in the development and deployment of AI systems, ensuring that they serve the broader good of society while minimizing risks.
'''

In [4]:
summary = summarizer(input_text, max_length= 200, min_length= 100, do_sample= False)

In [5]:
print(summary[0]['summary_text'])

In recent years, the study of artificial intelligence (AI) has gained significant attention. In healthcare, for example, AI systems are being used to diagnose diseases, predict patient outcomes, and even personalize treatment plans based on an individual’s genetic makeup. Despite these advances, AI also presents several challenges, including issues of data privacy, algorithmic bias, and the ethical implications of autonomous decision-making. These challenges highlight the need for interdisciplinary research that addresses both the technological and societal dimensions of AI.


The model does in fact summarize the text. Now let's get a dataset and pass it through the model that I just loaded.

> *Of course, we could create our own custom dataset, but for the sake of learning I'm keeping it easy.*

In [None]:
##@ loading the dataset

from datasets import load_dataset

dataset= load_dataset("cnn_dailymail", "3.0.0") #The dataset and its version that is there in HuggingFace
dataset

In [7]:
print(dataset['train'][0])

{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office char

So, let's start fine-tuning with `Trainer`. I first tried a plain PyTorch training loop, but it turned out to take a lot of time, so I'll go with `Trainer` instead. Before that, we need a preprocessing step that converts the inputs and targets into tokens, and we also need to define the model and tokenizer :)

In [8]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model= AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

In [9]:
##@ defining the preprocessing function

def preprocess_function(batch):
    source = batch['article']
    target = batch['highlights']

    source_ids = tokenizer(source, truncation= True, max_length= 1024, padding='max_length')
    target_ids = tokenizer(target, truncation= True, max_length= 128, padding='max_length')

    source_ids["labels"] = target_ids["input_ids"]
    return source_ids
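
One thing worth noting about `preprocess_function`: the targets are padded to 128 tokens and copied straight into `labels`, so the padding positions also count toward the loss. A common adjustment is to replace the pad ids in the labels with `-100`, the value that the PyTorch cross-entropy loss ignores by default. Below is a minimal sketch of that idea in plain Python; `mask_padding` is a hypothetical helper, and the pad id `1` in the example batch is an assumption (in practice it would come from `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value that the loss skips

def mask_padding(label_ids, pad_token_id):
    """Return a copy of the batched label ids with padding replaced by IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Hypothetical padded batch; 1 is assumed to be the pad id here.
labels = [[0, 713, 16, 2, 1, 1], [0, 42, 2, 1, 1, 1]]
masked = mask_padding(labels, pad_token_id=1)
# masked == [[0, 713, 16, 2, -100, -100], [0, 42, 2, -100, -100, -100]]
```

Inside `preprocess_function` this would look like `source_ids["labels"] = mask_padding(target_ids["input_ids"], tokenizer.pad_token_id)`. The losses reported in this notebook were obtained without this masking, so the numbers would differ if it were applied.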

In [None]:
tokenized_data = dataset.map(preprocess_function, batched=True)

In [11]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    fp16=True, ## For Mixed Precision
)



Okay, so with the entire dataset the estimated training time was roughly 64 hours, and that is too much. So I'll use a small subset instead...

In [12]:
small_train = tokenized_data["train"].select(range(10000))
small_eval = tokenized_data["validation"].select(range(5000))

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
)


In [14]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("wandb_api_key")

In [15]:
import wandb
wandb.login(key= secret_value_0)
wandb.init(project="Summarization_using_BART", name="run1")

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: firojpaudel (firojpaudel-madan-bhandari-memorial-college). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [16]:
trainer.train()



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.7307 | 0.731931 |
| 2 | 0.4673 | 0.767436 |
| 3 | 0.3978 | 0.801959 |


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


TrainOutput(global_step=1875, training_loss=0.49771818033854165, metrics={'train_runtime': 9877.295, 'train_samples_per_second': 3.037, 'train_steps_per_second': 0.19, 'total_flos': 6.501313806336e+16, 'train_loss': 0.49771818033854165, 'epoch': 3.0})
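
The numbers in `TrainOutput` can be sanity-checked with a bit of arithmetic. The sketch below is my own reconstruction, not something the `Trainer` prints: `total_steps` is a hypothetical helper, and `num_devices=2` is an assumption inferred from the logged `global_step` (with a single device, 10,000 samples at batch size 8 over 3 epochs would give 3750 steps, not 1875).

```python
import math

def total_steps(num_samples, per_device_batch, num_devices, epochs, grad_accum=1):
    """Optimizer steps for a run: ceil(samples / effective batch) per epoch, times epochs."""
    effective_batch = per_device_batch * num_devices * grad_accum
    return math.ceil(num_samples / effective_batch) * epochs

train_runtime = 9877.295  # seconds, taken from the TrainOutput above

steps = total_steps(10_000, per_device_batch=8, num_devices=2, epochs=3)
samples_per_second = 10_000 * 3 / train_runtime
steps_per_second = steps / train_runtime

print(steps)                         # 1875
print(round(samples_per_second, 3))  # 3.037
print(round(steps_per_second, 2))    # 0.19
```

All three values line up with the logged `global_step`, `train_samples_per_second` and `train_steps_per_second`, which supports the two-device assumption.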

In [17]:
##@ evaluating the results 

eval_results = trainer.evaluate()

In [19]:
eval_results

{'eval_loss': 0.8019587993621826,
 'eval_runtime': 474.314,
 'eval_samples_per_second': 10.542,
 'eval_steps_per_second': 0.66,
 'epoch': 3.0}
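
`trainer.evaluate()` here only reports the loss, while summarization quality is usually discussed in terms of ROUGE (the metric quoted in the BART summary above). As a rough idea of what ROUGE-1 measures, here is a simplified sketch in plain Python. It is an assumption-laden toy: `rouge1_f1` is a hypothetical helper that lowercases and splits on whitespace, whereas real implementations (for example the `rouge_score` package) add proper tokenization, optional stemming, and the ROUGE-2 / ROUGE-L variants.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each word counts at most min(cand, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat", "the cat sat on the mat")
# precision = 2/2, recall = 2/6, so F1 = 0.5
```

To use something like this during training, the generated summaries would first have to be decoded back to text, so it is shown here only as a standalone illustration.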

Now let's save the model, and then test the fine-tuned model on unseen text...

In [18]:
model.save_pretrained('./model_finetuned')
tokenizer.save_pretrained('./model_finetuned')

Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


('./model_finetuned/tokenizer_config.json',
 './model_finetuned/special_tokens_map.json',
 './model_finetuned/vocab.json',
 './model_finetuned/merges.txt',
 './model_finetuned/added_tokens.json',
 './model_finetuned/tokenizer.json')

In [21]:
##@ Now loading the fine-tuned model and tokenizer to pass new text through them...

model= AutoModelForSeq2SeqLM.from_pretrained('/kaggle/working/model_finetuned')
tokenizer= AutoTokenizer.from_pretrained('/kaggle/working/model_finetuned')

In [25]:
def summarize(testing):
    #@ Tokenizing the input text for the fine-tuned model (truncated to BART's 1024-token limit)
    inputs = tokenizer(testing, max_length=1024, truncation=True, return_tensors='pt')

    #@ Generating the summary with beam search; lengths are counted in tokens, not words
    summary_enc = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=500, min_length=250, length_penalty=2.0, num_beams=4, early_stopping=True)

    #@ Decoding the generated summary
    summary_fin = tokenizer.decode(summary_enc[0], skip_special_tokens= True)

    return summary_fin
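
A quick note on `length_penalty=2.0` in `generate`: as far as I understand the Transformers beam search, each finished beam is scored as its summed log-probability divided by `length ** length_penalty`. Since log-probabilities are negative, a larger penalty makes longer sequences look better. The sketch below just reproduces that formula with made-up numbers; `beam_score` is a hypothetical helper, not a library function.

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalised beam score: total log-probability divided by length ** penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams: a short one and a longer, less probable one.
short_raw = beam_score(-6.0, length=10, length_penalty=0.0)  # -6.0
long_raw = beam_score(-9.0, length=20, length_penalty=0.0)   # -9.0  -> short beam wins

short_pen = beam_score(-6.0, length=10, length_penalty=2.0)  # -0.06
long_pen = beam_score(-9.0, length=20, length_penalty=2.0)   # -0.0225 -> long beam wins
```

So with the penalty at 2.0 (plus `min_length=250`), the model is nudged quite strongly toward long summaries.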

In [26]:
##@ Input test text : 
testing = '''
Reuters
Published at : January 2, 2025Updated at : January 2, 2025 10:31Seoul
South Korean police said on Thursday they had raided Jeju Air (089590.KS) and the operator of Muan International Airport as part of their investigation into Sunday’s crash that killed 179 people in the worst aviation disaster on the country’s soil.

Jeju Air 7C2216, which departed the Thai capital of Bangkok for Muan in southwestern South Korea, belly-landed and overshot the regional airport’s runway, exploding into flames after hitting an embankment.

Two crew members, who were sitting in the tail end of the Boeing 737-800, were pulled alive by rescuers but injured.

Police investigators are searching the offices of the airport operator and the transportation ministry aviation authority in the southwestern city of Muan, as the well as office of Jeju Air in Seoul, the South Jeolla provincial police said in a media statement.

Investigators plan to seize documents and materials related to the operation and maintenance of the aircraft as well as the operation of airport facilities, a police official told Reuters.

A Jeju Air spokesperson said the airline is checking the situation. The airport operator company was not immediately available for comment.

Questions by air safety experts on what led to the deadly explosion have focused on the embankment designed to prop up navigation equipment that they said are too rigid and too close to the end of the runway.

“This rigid structure proved catastrophic when the skidding aircraft made impact,” said Najmedin Meshkati, an engineering professor at the University of Southern California, adding it was concerning that the navigation antenna was mounted on “such a formidable concrete structure, rather than the standard metal tower/pylon installation”.

A probe into the Jeju Air flight is also under way involving South Korean officials and the US National Transportation Safety Board (NTSB), Federal Aviation Administration (FAA) and the aircraft’s maker, Boeing (BA.N).

It remains unanswered why the aircraft did not deploy its landing gear and what led the pilot to apparently rush into a second attempt at landing after telling air traffic control the plane had suffered a bird strike and declaring an emergency.

The aircraft’s flight data recorder, which sustained some damage, is being taken to the United States for analysis in cooperation with the NTSB.

The conversion of data from the cockpit voice recorder to audio file should be completed by Friday, acting President Choi Sang-mok said, which could provide critical information on the final minutes of the doomed flight.

A transport ministry official said on Wednesday it may be difficult to release the audio files to the public as they will be critical to the ongoing investigation.

Choi said in a disaster management meeting immediate action must be taken if a special inspection of all Boeing 737-800 aircraft operated in the country finds any issues.

“As there’s great public concern about the same aircraft model involved in the accident, the transport ministry and relevant organisations must conduct a thorough inspection of operation maintenance, education, and training,” Choi said.

Choi’s comments at the start of the meeting were provided by his office.

Investigators from the NTSB, FAA and Boeing are in South Korea to help the probe.

Choi asked that no effort be spared in helping the families of the victims as the remains of those killed are handed over them. He also asked the police to take action against anyone posting “malicious” messages and fake news on social media related to the disaster.

'''

In [27]:
summary = summarize(testing) 

In [28]:
summary

'Investigators plan to seize documents and materials related to the operation and maintenance of the aircraft and airport facilities.\n179 people died when Jeju Air 7C2216 belly-landed and overshot the regional airport’s runway, exploding into flames after hitting an embankment.\nTwo crew members, who were sitting in the tail end of the Boeing 737-800, were pulled alive by rescuers but injured.\nFlight data recorder, which sustained some damage, is being taken to the United States for analysis in cooperation with the U.S. NTSB.\nThe conversion of data from the cockpit voice recorder to audio file should be completed by Friday, acting President Choi Sang-mok said.\nChoi asked that no effort be spared in helping the families of the victims as the remains of those killed are handed over to them.\nHe also asked the police to take action against anyone posting ‘malicious’ messages and fake news on social media related to disaster.\nInvestigators from the NTSB, FAA and Boeing are in South Ko

And, voilà... the summary is generated!!