## ⚙️ 1. Environment Setup

This first cell installs the libraries the notebook depends on. Besides the core `transformers` and `datasets` libraries (and `accelerate`, which the `Trainer` requires), we install:

- **`sentencepiece`**: Required by the T5 model's tokenizer.
- **`py7zr`**: Handles `.7z` compressed files, which some dataset loaders need to decompress their data.

In [None]:
!pip install -U transformers
!pip install -U accelerate
!pip install -U datasets
!pip install -U bertviz
!pip install -U umap-learn
!pip install -U sentencepiece
!pip install -U urllib3
!pip install py7zr

In [None]:
import warnings
warnings.filterwarnings('ignore')

## 📥 2. Loading and Exploring the Dataset

We use `load_dataset` to download SAMSum (here the `knkarthick/samsum` copy on the Hugging Face Hub), a collection of dialogues with human-written summaries. This is the data we will use to teach our T5 model to summarize conversations.

After loading, we inspect the first element of the training set to understand its structure. We can see it's a dictionary with three keys: `id`, `dialogue`, and `summary`.

In [None]:
from datasets import load_dataset

# `knkarthick/samsum` ships plain data files, so `trust_remote_code` should not be needed
samsum = load_dataset('knkarthick/samsum')

In [None]:
samsum['train'][0]

## 📊 3. Exploratory Data Analysis (EDA)

Before training, it's good practice to understand the data. We compute the length, in whitespace-separated words, of every dialogue and summary in the training set, then plot histograms of both distributions. As expected for a summarization task, the dialogues are generally much longer than the summaries. The distributions also help in choosing sensible maximum lengths for tokenization, keeping in mind that token counts are usually higher than word counts.

In [None]:
dialogue_len = [len(x['dialogue'].split()) for x in samsum['train'] if x['dialogue'] is not None]
summary_len = [len(x['summary'].split()) for x in samsum['train'] if x['summary'] is not None]

In [None]:
import pandas as pd

data = pd.DataFrame([dialogue_len, summary_len]).T
data.columns = ['Dialogue Length', 'Summary Length']

data.hist(figsize=(10,3))
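
The histogram gives a visual impression; percentiles turn it into a concrete candidate for `max_length`. The following is a minimal sketch using only the standard library and a made-up list of word counts. The helper `length_percentiles` and the list `toy_dialogue_len` are illustrative names, not part of the notebook.

```python
import statistics

def length_percentiles(lengths, percentiles=(50, 90, 95, 99)):
    """Return {percentile: value} for a list of word counts."""
    # With n=100, quantiles() returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in percentiles}

# Made-up word counts, used only to show the mechanics.
toy_dialogue_len = [12, 35, 48, 60, 75, 90, 120, 150, 210, 400]

stats = length_percentiles(toy_dialogue_len)
for p, value in stats.items():
    print(f"{p}th percentile: {value:.1f} words")
```

In the notebook, the same helper could be called on `dialogue_len` or `summary_len`. Since these are word counts rather than token counts, it is probably wise to leave some headroom when turning a percentile into a token limit.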

In [None]:
from transformers import pipeline
import torch

device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')

## 🤖 4. Loading the T5 Model and Tokenizer

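We are using a T5 (Text-to-Text Transfer Transformer) model, which casts every task as text in, text out and is therefore well suited to sequence-to-sequence tasks like summarization. We load the `t5-large` checkpoint for better summary quality; `t5-small` (commented out in the cell below) is a much lighter option for quick experiments or limited GPU memory.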

- **`AutoTokenizer`**: Loads the specific tokenizer that corresponds to the T5 model.
- **`AutoModelForSeq2SeqLM`**: Loads the pre-trained T5 model architecture, ready for fine-tuning on our summarization task. The model is then moved to the GPU (`cuda`) if available.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# model_ckpt = 't5-small'
model_ckpt = 't5-large'

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
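
To get a feel for what the choice between `t5-small` and `t5-large` means for GPU memory, here is a rough back-of-the-envelope sketch. It assumes the approximate parameter counts given on the Hugging Face model cards (about 60 million and 770 million) and a common rule of thumb of 16 bytes per parameter for training with AdamW. The helper `training_memory_bytes` is a hypothetical name, and the result is only a lower bound because activations are not counted.

```python
# Approximate parameter counts as stated on the Hugging Face model cards (assumption).
APPROX_PARAMS = {"t5-small": 60e6, "t5-large": 770e6}

def training_memory_bytes(num_params, bytes_per_param=16):
    """Rule-of-thumb memory for weights, gradients and AdamW state.

    16 bytes/param = 4 (fp32 weights) + 4 (gradients) + 8 (two AdamW moments).
    Activations are not included, so real usage is higher.
    """
    return num_params * bytes_per_param

for name, n_params in APPROX_PARAMS.items():
    gib = training_memory_bytes(n_params) / 2**30
    print(f"{name}: roughly {gib:.1f} GiB before activations")
```

This is one reason the training setup later in the notebook uses a very small per-device batch size.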

## ✍️ 5. Data Preprocessing and Tokenization

This is a critical step. We define a `tokenize` function that converts text into numerical token IDs. A sequence-to-sequence model needs both the input (`dialogue`) and the target (`summary`) tokenized; passing the summaries via the `text_target` argument does both in one call and stores the summary IDs under `labels`. We set `max_length=200` with `truncation=True`, which caps dialogues and summaries alike at 200 tokens.

Before tokenizing, we filter the dataset to remove any rows that might have missing dialogues or summaries. Finally, we apply our `tokenize` function to the entire dataset using the efficient `.map()` method.

In [None]:
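def tokenize(batch):
    # No padding here: DataCollatorForSeq2Seq pads each batch later and fills label
    # padding with -100, so padded positions are ignored by the loss.
    # return_tensors is omitted as well because .map() stores plain lists.
    encoding = tokenizer(batch['dialogue'], text_target=batch['summary'], max_length=200, truncation=True)

    return encoding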

In [None]:
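def filter_none_values(example):
    # Keep only rows where both the dialogue and the summary are present
    return example['dialogue'] is not None and example['summary'] is not None

# Drop incomplete rows from every split, then tokenize in batches
samsum_clean = samsum.filter(filter_none_values)
samsum_pt = samsum_clean.map(tokenize, batched=True)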

## 🚀 6. Setting up the Trainer

We configure the training process using the Hugging Face `Trainer` API.

1.  **`DataCollatorForSeq2Seq`**: Prepares batches of data for sequence-to-sequence models. It pads the input sequences and the label (summary) sequences independently to the longest example in each batch, and pads labels with `-100` so those positions are ignored by the loss.
2.  **`TrainingArguments`**: Here we define all the hyperparameters. Notably, we set a small `per_device_train_batch_size` of 2 but use `gradient_accumulation_steps=100`, so gradients from 100 small batches are accumulated before each optimizer step. This gives an effective batch size of 2 × 100 = 200 per device, which stabilizes training without the GPU memory a real batch of 200 would need. We also enable `fp16=True` for faster mixed-precision training on a CUDA GPU; T5 checkpoints can be numerically unstable in fp16, so if the loss becomes `nan`, `bf16=True` is worth trying on hardware that supports it.
3.  **`Trainer`**: Finally, we instantiate the `Trainer`, bringing together our model, arguments, tokenizer, data collator, and datasets.
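
To make the padding behavior of the collator concrete, here is a plain-Python illustration of the idea. It is a simplified sketch, not the library's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, `0` is assumed as the pad token ID (which is T5's), and `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels separately to the longest example in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        # Inputs are padded with the pad token and masked out via attention_mask.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # Labels are padded with -100 so the loss skips those positions.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_lab - n_lab))
    return batch

# Two toy examples with made-up token IDs.
batch = pad_batch([
    {"input_ids": [5, 6, 7], "labels": [8, 1]},
    {"input_ids": [9], "labels": [3, 4, 1]},
])
print(batch["input_ids"])  # [[5, 6, 7], [9, 0, 0]]
print(batch["labels"])     # [[8, 1, -100], [3, 4, 1]]
```

Because padding is decided per batch, a batch of short dialogues is not padded out to the length of the longest dialogue in the whole dataset.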

In [None]:
from transformers import DataCollatorForSeq2Seq
from transformers import TrainingArguments, Trainer

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = TrainingArguments(
    output_dir="train_dir",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    eval_strategy='epoch',
    save_strategy='epoch',
    weight_decay=0.01,
    learning_rate=2e-5,
    gradient_accumulation_steps=100,
    fp16=True
)

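# Recent transformers releases expect the tokenizer under `processing_class`;
# the older `tokenizer=` keyword is deprecated.
trainer = Trainer(model=model,
                  args=args,
                  processing_class=tokenizer,
                  data_collator=data_collator,
                  train_dataset=samsum_pt['train'],
                  eval_dataset=samsum_pt['validation']
                  )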
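
The batch-size arithmetic behind gradient accumulation can be sketched in a few lines. The helpers below are hypothetical, the dataset size of 14,000 is a round number chosen for illustration rather than the real size of the training split, and the update count is an approximation because the exact rounding depends on the `transformers` version.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch * accumulation_steps * num_devices

def approx_updates_per_epoch(num_examples, per_device_batch, accumulation_steps, num_devices=1):
    """Approximate number of optimizer steps in one pass over the data."""
    batches = math.ceil(num_examples / (per_device_batch * num_devices))
    return max(batches // accumulation_steps, 1)

print(effective_batch_size(2, 100))              # 200
print(approx_updates_per_epoch(14_000, 2, 100))  # 70
```

With a large accumulation value the model weights are updated only a few dozen times per epoch, which is worth keeping in mind when judging the learning rate and the number of epochs.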


## ▶️ 7. Training and Saving the Model

With all the setup complete, we begin the training process by simply calling `trainer.train()`. The `Trainer` handles the entire fine-tuning loop, including feeding batches to the model, calculating loss, updating weights, and evaluating on the validation set.

Once training is finished, we save the fine-tuned summarization model to a directory using `trainer.save_model()`. Because the tokenizer was passed to the `Trainer`, it is saved alongside the model, so the directory can be reloaded later as is.

In [None]:
trainer.train()

In [None]:
trainer.save_model("t5_samsum_summarization")

## 🧐 8. Inference with the Fine-Tuned Model

The final step is to test our new summarization model. We load it into a `summarization` `pipeline`, which provides a very simple interface for inference. We write a new, custom dialogue and pass it to the pipeline. The output is the summary generated by our fine-tuned T5 model, demonstrating its ability to summarize conversations it has never seen before.

In [None]:
from transformers import pipeline

In [None]:
pipe = pipeline('summarization', model='t5_samsum_summarization', device=device)

custom_dialogue="""
Laxmi Kant: what work you planning to give Tom?
Juli: i was hoping to send him on a business trip first.
Laxmi Kant: cool. is there any suitable work for him?
Juli: he did excellent in last quarter. i will assign new project, once he is back.
"""

output = pipe(custom_dialogue)
output
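
When dialogues come from code rather than a hand-written string, it helps to build them in the same `Speaker: utterance` line format used above. Here is a small sketch with a hypothetical helper, `format_dialogue`, that assumes each turn is a `(speaker, utterance)` pair.

```python
def format_dialogue(turns):
    """Join (speaker, utterance) pairs into 'Speaker: utterance' lines."""
    return "\n".join(f"{speaker}: {utterance.strip()}" for speaker, utterance in turns)

turns = [
    ("Laxmi Kant", "what work you planning to give Tom?"),
    ("Juli", "i was hoping to send him on a business trip first."),
]

dialogue_text = format_dialogue(turns)
print(dialogue_text)
```

The resulting string can be passed to `pipe(...)` just like `custom_dialogue`. The summarization pipeline returns a list with one dictionary per input, and the generated text is stored under the `summary_text` key.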