# Hugging Face Projects

This notebook uses the Hugging Face libraries to build a few interesting projects:

1. Text Summarization  
2. 2
3. 3

Each project is built by fine-tuning an existing pretrained baseline model.

We will need a GPU in order to fine tune the models:

In [None]:
!nvidia-smi

## 1. Text Summarization Project (Seq2Seq)

For this task we are going to use a class of models called *Seq2Seq*.

Seq2Seq models map an input sequence to an output sequence, which makes them useful for tasks such as translation, summarization, and dialogue.
Transformer-based Seq2Seq models (like T5 and BART) have largely replaced older RNN-based ones and perform much better on these tasks.

### 1.1 Install Dependencies

We need some packages in order to start with our project:

In [None]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q
!pip install --upgrade datasets -q

In [None]:
# uninstall and reinstall transformers and accelerate to get recent versions (needed for GPU training)

!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate  # Colab sometimes ships older versions
!pip install transformers accelerate  # now we're sure we're using a recent version

In [None]:
# import to test that everything is fine

from transformers import pipeline, set_seed
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer # For the model we're going to use
from datasets import load_dataset, load_from_disk # For the datasets

# python libraries
import matplotlib.pyplot as plt
import pandas as pd

# tokenization
import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm # just progress bar

import torch

nltk.download("punkt")

In [None]:
# let's check the device
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

In [None]:
# Choose our model "checkpoint" (ckpt)
model_ckpt = "google/pegasus-cnn_dailymail" # https://huggingface.co/google/pegasus-cnn_dailymail

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
# load the model and send it to device
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

### 1.2 Get the Data

In [None]:
# Loading sometimes fails unless datasets (and fsspec) are upgraded first
!pip install --upgrade datasets fsspec

In [None]:
# load the dataset
dataset_samsum = load_dataset("knkarthick/samsum") # https://huggingface.co/datasets/knkarthick/samsum

In [None]:
dataset_samsum  # it's composed of (dialogue, summary) pairs

In [None]:
dataset_samsum["train"]["dialogue"][1]

In [None]:
dataset_samsum["train"]["summary"][1]

In [None]:
samsum_train_df = pd.DataFrame(dataset_samsum['train'])
samsum_test_df = pd.DataFrame(dataset_samsum['test'])

#### 1.2.1 Always Inspect Your Data Thoroughly

In [None]:
# I was getting an error when mapping the dataset, so I went back and checked the data for NaN values

print(samsum_train_df.isnull().sum())
print(samsum_test_df.isnull().sum())

In [None]:
samsum_train_df[samsum_train_df.isnull().any(axis=1)] # bad data here

In [None]:
# filter the dataset to remove it
# Define a filter function
def clean_example(example):
    return (example['dialogue'] is not None and
            example['summary'] is not None)

# Apply the filter to each split
dataset_samsum_clean = dataset_samsum.map(lambda x: x, remove_columns=[])  # make a copy
dataset_samsum_clean['train'] = dataset_samsum['train'].filter(clean_example)
dataset_samsum_clean['validation'] = dataset_samsum['validation'].filter(clean_example)
dataset_samsum_clean['test'] = dataset_samsum['test'].filter(clean_example)

> **Note:** Hugging Face `Dataset` objects are not modified in place by operations such as `.filter()` and `.map()`.
>
> When you apply `.filter()`, it returns a new object; the original dataset is left unchanged.
>
> A `DatasetDict`, on the other hand, behaves like a regular Python dictionary, so assigning to `dataset_samsum_clean['train']` replaces that split inside the dictionary. If you want to keep your original `dataset_samsum` untouched, you can build a separate `DatasetDict` before assigning the filtered splits:
> ```python
> dataset_samsum_clean = dataset_samsum.map(lambda x: x, remove_columns=[])
> ```
> This identity `map` returns a new `DatasetDict`, so reassigning splits on the copy does not affect the original.
>
> In this case we didn't really need to keep the original with NaN values, but I made a copy first just to be safe.
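
The `clean_example` predicate used in this section can be checked on a few hand-made rows before running it on the real dataset. The sketch below uses plain dictionaries instead of a `Dataset`; the rows and their `id` values are made up for illustration, and the list comprehension only stands in for `.filter()`.

```python
# Hypothetical rows mimicking the dialogue/summary schema; values are made up.
rows = [
    {"id": "a", "dialogue": "Amy: hi\nBob: hello", "summary": "Amy greets Bob."},
    {"id": "b", "dialogue": None, "summary": "No dialogue was recorded."},
    {"id": "c", "dialogue": "Cat: done?\nDan: yes", "summary": None},
]


def clean_example(example):
    """Keep a row only if both text fields are present."""
    return example["dialogue"] is not None and example["summary"] is not None


# The comprehension plays the role of Dataset.filter here: it builds a new
# list and leaves `rows` untouched.
kept = [row for row in rows if clean_example(row)]

print([row["id"] for row in kept])  # ids of the rows that survive the filter
print(len(rows))                    # the original list still has every row
```

Only the row with both fields present is kept, and the original list keeps its length, which mirrors how `.filter()` returns a new object rather than editing the dataset in place.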

In [None]:
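samsum_train_df = pd.DataFrame(dataset_samsum_clean['train'])
samsum_test_df = pd.DataFrame(dataset_samsum_clean['test'])  # rebuild from the cleaned split as well
print(samsum_train_df.isnull().sum())
print(samsum_test_df.isnull().sum())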

### 1.3 Preprocess Data (Tokenization)

In [None]:
def convert_examples_to_features(example_batch):
  """
  Encodes the dataset in batches
  """

  input_encodings = tokenizer(example_batch['dialogue'],
                              padding='max_length',
                              max_length=1024,
                              truncation=True)

  # target tokenizer context manager (see below); recent transformers releases
  # deprecate it in favor of tokenizer(text_target=...)
  with tokenizer.as_target_tokenizer():
    target_encodings = tokenizer(example_batch['summary'],
                                 padding='max_length',
                                 max_length=128,
                                 truncation=True)

  return {  # do all tokenizers return input_ids, attention_mask, etc., or do they have different structures?
            'input_ids' : input_encodings['input_ids'],
            'attention_mask' : input_encodings['attention_mask'],
            'labels' : target_encodings['input_ids']
  }

> **Note:**
>
> In sequence-to-sequence (seq2seq) models like Pegasus, the source text goes through the encoder and the target text is what the decoder learns to generate, so the two sides are tokenized separately. `tokenizer.as_target_tokenizer()` switches the tokenizer into target mode, so that any target-specific settings are applied while the summaries are encoded. For many tokenizers the output is the same either way; the distinction matters for models whose target side uses different special tokens or a different vocabulary (for example, some translation models). The target `input_ids` are then stored under `labels`, which is what the model uses to compute the loss, so tokenizing the targets incorrectly would lead to incorrect training and poor performance.
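
To make the `padding='max_length'` and `truncation=True` arguments in `convert_examples_to_features` more concrete, here is a minimal pure-Python sketch of what they do to one sequence of token ids. It is not the tokenizer's implementation: the function name `pad_or_truncate`, the padding id `0`, and the token ids are assumptions chosen for illustration, and special tokens such as end-of-sequence are ignored.

```python
PAD_ID = 0  # assumed padding id, for illustration only


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])  # truncation: drop tokens past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad             # right-pad up to max_length
    mask += [0] * n_pad                 # 0 marks padding positions
    return ids, mask


# A short sequence gets padded, a long one gets truncated.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=1024` for dialogues and `max_length=128` for summaries, every encoded example ends up with the same fixed length, at the cost of storing many padding positions for short texts.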


In [None]:
# apply tokenization with map
dataset_samsum_pt = dataset_samsum_clean.map(convert_examples_to_features,
                                             batched=True)


### 1.4 Training

#### 1.4.1 Data Collator

When we have a large amount of data, the machine can easily run out of memory if we load everything at once during training. That is the main reason we train in batches.
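
As a rough illustration of batching, the sketch below walks over a list in fixed-size chunks so that only one chunk is handled at a time. The helper name `iter_batches` and the toy data are assumptions for this example; during training the `Trainer` takes care of this internally.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive slices of `examples` with at most `batch_size` items."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# Toy stand-in for a dataset: ten integer "examples".
examples = list(range(10))
batches = list(iter_batches(examples, batch_size=4))

for batch in batches:
    print(batch)  # the last batch may be smaller than batch_size
```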

To form batches correctly for training, we can use a [`DataCollator`](https://huggingface.co/docs/transformers/main_classes/data_collator#data-collator). It assembles individual examples into a batch with the shape the model expects.

There are some default data collators for different classes of models. In this case we'll use the [`DataCollatorForSeq2Seq` class](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForSeq2Seq).
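
The core idea behind a seq2seq collator can be sketched in plain Python: pad every example in a batch to the longest one, and pad the labels with `-100` so that those positions are ignored by the loss. This is a simplified illustration, not the library's implementation; the function name `collate`, the padding id `0`, and the token ids are assumptions. The real `DataCollatorForSeq2Seq` also returns tensors and can prepare decoder inputs when a model is passed.

```python
PAD_ID = 0           # assumed padding id for input_ids, for illustration only
LABEL_PAD_ID = -100  # label positions with this id are ignored by the loss


def collate(features, pad_id=PAD_ID, label_pad_id=LABEL_PAD_ID):
    """Pad a list of tokenized examples to the longest example in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_in)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_lab)
    return batch


# Two made-up examples of different lengths.
features = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10]},
]
batch = collate(features)
print(batch)
```

Note that in this notebook the examples are already padded to a fixed `max_length` during tokenization, so the collator has no length differences left to pad. The padded label positions then hold the tokenizer's padding id rather than `-100`, which likely means they still contribute to the loss.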

In [None]:
from transformers import DataCollatorForSeq2Seq

seq2seq_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

#### 1.4.2 Training Arguments


In [None]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir =
)