# **Hugging Face Projects**

This notebook uses Hugging Face libraries to build a few projects:

1. Text Summarization (**seq2seq**)
2. Text To Image Generation (**diffusion**)
3. 3

All of these will be done by fine-tuning existing baseline models.

We will need a GPU in order to fine-tune the models:

In [None]:
!nvidia-smi

## 1. Text Summarization Project (Seq2Seq)

For this task we are going to use a class of models called *Seq2Seq*.

Seq2Seq models map an input sequence to an output sequence, which makes them useful for tasks like translation, summarization, and dialogue.
Transformer-based Seq2Seq models (like T5 and BART) have largely replaced older RNN-based ones, mainly because attention handles long-range dependencies better and training parallelizes well.

### 1.1 Install Dependencies

We need some packages in order to start with our project:

In [None]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q
!pip install --upgrade datasets -q

In [None]:
# uninstall and reinstall transformers and accelerate so the Trainer can use the GPU with current versions

!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate  # sometimes colab uses older versions
!pip install transformers accelerate  # now we're sure we're using a new version

In [None]:
# import to test that everything is fine

from transformers import pipeline, set_seed
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer # For the model we're going to use
from datasets import load_dataset, load_from_disk # For the datasets

# python libraries
import matplotlib.pyplot as plt
import pandas as pd

# tokenization
import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm # just progress bar

import torch

nltk.download("punkt")

In [None]:
# let's check the device
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

In [None]:
# Choose our model "checkpoint" (ckpt)
model_ckpt = "google/pegasus-cnn_dailymail" # https://huggingface.co/google/pegasus-cnn_dailymail

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
# load the model and send it to device
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

### 1.2 Get the Data

In [None]:
# loading sometimes fails unless datasets (and fsspec) are upgraded first
!pip install --upgrade datasets fsspec

In [None]:
# load the dataset
dataset_samsum = load_dataset("knkarthick/samsum") # https://huggingface.co/datasets/knkarthick/samsum

In [None]:
dataset_samsum  # it's composed of dialogue and summary couples

In [None]:
dataset_samsum["train"]["dialogue"][1]

In [None]:
dataset_samsum["train"]["summary"][1]

In [None]:
samsum_train_df = pd.DataFrame(dataset_samsum['train'])
samsum_test_df = pd.DataFrame(dataset_samsum['test'])

#### 1.2.1 Always Inspect Your Data Thoroughly

In [None]:
# I was getting an error when mapping my dataset, went back and checked the data for NaN values...

print(samsum_train_df.isnull().sum())
print(samsum_test_df.isnull().sum())

In [None]:
samsum_train_df[samsum_train_df.isnull().any(axis=1)] # bad data here

In [None]:
# Keep only examples where both fields are present
def clean_example(example):
    return (example['dialogue'] is not None and
            example['summary'] is not None)

# DatasetDict.filter applies the function to every split and returns a new DatasetDict,
# so the original dataset_samsum is left untouched
dataset_samsum_clean = dataset_samsum.filter(clean_example)

> **Note:** Hugging Face `Dataset` operations do not modify data in place.
>
> When you apply `.filter()` (or `.map()`), it returns a new object and leaves the original dataset as it was.
>
> This means `dataset_samsum` still contains the rows with missing values, while `dataset_samsum_clean` holds the filtered version. An explicit copy made beforehand, for example with `dataset_samsum.map(lambda x: x, remove_columns=[])`, is therefore not needed.
>
> In this case we didn't really need to keep the original with NaN values, but it stays available for inspection.

In [None]:
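# Rebuild both DataFrames from the cleaned splits before re-checking for missing values
samsum_train_df = pd.DataFrame(dataset_samsum_clean['train'])
samsum_test_df = pd.DataFrame(dataset_samsum_clean['test'])
print(samsum_train_df.isnull().sum())
print(samsum_test_df.isnull().sum())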

### 1.3 Preprocess Data (Tokenization)

In [None]:
def convert_examples_to_features(example_batch):
  """
  Tokenizes a batch of dialogues (inputs) and summaries (labels)
  """

  # No padding here: DataCollatorForSeq2Seq pads each batch dynamically at training time
  input_encodings = tokenizer(example_batch['dialogue'],
                              max_length=1024,
                              truncation=True)

  # text_target tokenizes the summaries as decoder targets
  # (it replaces the deprecated tokenizer.as_target_tokenizer() context manager, see below)
  target_encodings = tokenizer(text_target=example_batch['summary'],
                               max_length=128,
                               truncation=True)

  # Hugging Face tokenizers return a dict-like object with 'input_ids';
  # other keys such as 'attention_mask' depend on the model
  return {
            'input_ids' : input_encodings['input_ids'],
            'attention_mask' : input_encodings['attention_mask'],
            'labels' : target_encodings['input_ids']
  }

> **Note:**
>
> In sequence-to-sequence (seq2seq) models like Pegasus, the encoder reads the source text and the decoder generates the target text, so the tokenizer needs to know which side it is encoding. Passing the summaries as `text_target` tokenizes them with the tokenizer's target-side settings. Older code does the same with the `tokenizer.as_target_tokenizer()` context manager, which is deprecated in recent `transformers` releases.
>
> For Pegasus both sides share one vocabulary, so the resulting token IDs are the same either way. The distinction matters for models whose target side is configured differently (for example, multilingual translation models that add a target-language token). Using `text_target` keeps the preprocessing correct if you swap in such a model.


In [None]:
# apply tokenization with map
dataset_samsum_pt = dataset_samsum_clean.map(convert_examples_to_features,
                                             batched=True)


### 1.4 Training

#### 1.4.1 Data Collator

With a large dataset, loading everything at once can easily exhaust memory during training. That is the main reason we train in batches.

To form batches correctly, we can use a [`DataCollator`](https://huggingface.co/docs/transformers/main_classes/data_collator#data-collator): it takes a list of examples and assembles them into a batch of tensors with consistent shapes.

There are some default data collators for different classes of models. In this case we'll use the [`DataCollatorForSeq2Seq` class](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForSeq2Seq).

In [None]:
from transformers import DataCollatorForSeq2Seq

seq2seq_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)
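
To make the collator's job concrete, here is a minimal pure-Python sketch of dynamic padding. It is not the library's implementation: `pad_batch` is a hypothetical helper, `pad_token_id=0` is an assumption matching Pegasus, and `-100` is the label padding value that `DataCollatorForSeq2Seq` uses by default so that the loss ignores padded positions.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every sequence in a batch to the longest one in that batch.

    Hypothetical helper for illustration only; the real collator also
    returns tensors and can build decoder inputs from the labels.
    """
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad_in = max_inputs - len(f["input_ids"])
        pad_lab = max_labels - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad_in)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad_in)
        # -100 tells the loss function to skip these positions
        batch["labels"].append(f["labels"] + [label_pad_token_id] * pad_lab)
    return batch


batch = pad_batch([
    {"input_ids": [5, 6, 7], "labels": [9]},
    {"input_ids": [8], "labels": [3, 4]},
])
print(batch)
```

Padding to the longest sequence in each batch, rather than to a fixed maximum, keeps batches small when most dialogues are short.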

#### 1.4.2 Training Arguments


In [None]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum',               # Where to save model checkpoints and logs
    num_train_epochs=1,                        # Number of full passes over the training dataset
    warmup_steps=500,                          # Number of warmup steps for learning rate scheduler
    per_device_train_batch_size=1,             # Batch size per GPU/TPU core/CPU during training
    per_device_eval_batch_size=1,              # Batch size per GPU/TPU core/CPU during evaluation
    weight_decay=0.01,                         # Weight decay coefficient (regularization to limit overfitting)
    logging_steps=10,                          # Log training metrics every 10 steps
    eval_strategy='steps',                     # Evaluate the model every `eval_steps`
    eval_steps=500,                            # Number of steps between evaluations
    save_steps=1_000_000,                      # Save model every 1,000,000 steps (effectively disables frequent saving)
    gradient_accumulation_steps=16             # Accumulate gradients over 16 batches before each optimizer update (effective batch size 16)
)
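
To see how these arguments interact, the sketch below estimates the number of optimizer updates and the learning-rate multiplier under a linear warmup schedule. It assumes a single device and the default linear scheduler; `optimizer_steps` and `linear_warmup_factor` are hypothetical helpers, the dataset size of 819 is only an illustrative figure, and the `Trainer`'s exact rounding of the last partial accumulation window may differ between versions.

```python
import math


def optimizer_steps(num_examples, batch_size, grad_accum_steps, epochs=1):
    """Approximate number of optimizer updates (single device assumed)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * epochs


def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier on the base learning rate: linear warmup, then linear decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


num_examples = 819  # hypothetical size of a small split
total = optimizer_steps(num_examples, batch_size=1, grad_accum_steps=16)
print("effective batch size:", 1 * 16)
print("optimizer updates:", total)
print("LR multiplier at the last update:", linear_warmup_factor(total, 500, total))
```

With so few updates, `warmup_steps=500` is never completed, so the learning rate stays at a fraction of its configured value. For a short run like this one, a smaller warmup may be worth trying.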


In [None]:
trainer = Trainer(model=model_pegasus,
                  args=trainer_args,
                  tokenizer=tokenizer,                        # newer transformers releases name this argument `processing_class`
                  data_collator=seq2seq_collator,
                  train_dataset=dataset_samsum_pt['test'],   # Tokenized data. Using 'test' for a quick example; for real training we should use 'train'
                  eval_dataset=dataset_samsum_pt['validation']
                  )

In [None]:
trainer.train()

### 1.5 Evaluation Metrics: ROUGE and Beyond

Evaluating text generation tasks like summarization requires metrics that capture how **similar** a model’s output is to a reference text. One of the most widely used metrics for summarization is **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures lexical overlap: shared n-grams (contiguous word sequences) and longest common subsequences between the generated text and the ground truth summary.

The most common ROUGE variants are:

- `ROUGE-1`: Overlap of unigrams (single words)
- `ROUGE-2`: Overlap of bigrams (two-word sequences)
- `ROUGE-L`: Longest common subsequence (captures sentence-level similarity)

ROUGE emphasizes *recall*, meaning it rewards summaries that successfully include important pieces of the reference text. Scores range from $0$ to $1$, and the closer a score is to $1$, the better our model is performing.
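
As a rough illustration of what these variants count, here is a minimal sketch of ROUGE-N and the longest common subsequence behind ROUGE-L. It assumes lowercased whitespace tokenization; `rouge_n` and `lcs_length` are hypothetical helpers, and the real `rouge_score` package adds its own tokenization and optional stemming, so its numbers can differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) of n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # each n-gram counted at most as often as in both texts
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(prediction, reference, n=1))
print("ROUGE-2:", rouge_n(prediction, reference, n=2))
print("LCS length:", lcs_length(prediction.split(), reference.split()))
```

The single changed word costs one unigram but two bigrams, which is why ROUGE-2 is usually lower than ROUGE-1 for the same pair of texts.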

Each NLP task tends to have its own set of suitable metrics. Here's a quick overview:

| Task                      | Common Metrics                          | What It Measures                                   |
|--------------------------|-----------------------------------------|----------------------------------------------------|
| **Summarization**        | `ROUGE`, `BLEU`                         | Content overlap, fluency                          |
| **Translation**          | `BLEU`, `METEOR`, `CHRF`                | N-gram matches, semantic similarity               |
| **Text Generation**      | `BLEU`, `ROUGE`, `BERTScore`, `Perplexity` | Fluency, diversity, semantic similarity         |
| **Question Answering**   | `Exact Match`, `F1 Score`               | Span correctness and token-level overlap          |
| **Classification**       | `Accuracy`, `F1`, `Precision`, `Recall` | Correctness of predicted labels                   |
| **Named Entity Recognition (NER)** | `F1 Score`, `Precision`, `Recall` | Entity extraction span correctness            |

In summarization tasks, ROUGE-F1 score is often the most reported metric, as it balances precision and recall. For semantic understanding, metrics like BERTScore may also be used.

In [None]:
# Splits a list into batches of a given size for easier processing
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]    # yield is a memory-efficient alternative to return.


# Calculates evaluation metric (like ROUGE) on a dataset using a model
def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
    # Split the dataset into batches of input articles and target summaries
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    # Loop over batches and generate summaries
    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        # Tokenize the input batch of articles
        inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                           padding="max_length", return_tensors="pt")

        # Generate summaries with beam search and length penalty to avoid long output
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                   attention_mask=inputs["attention_mask"].to(device),
                                   length_penalty=0.8, num_beams=8, max_length=128)

        # Decode token IDs to strings, clean special tokens
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                              clean_up_tokenization_spaces=True)
                             for s in summaries]

        # Pegasus marks line breaks with the "<n>" token; replace it with a space
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]

        # Add generated vs reference summaries to the metric for scoring
        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    # Compute final ROUGE score across the dataset
    score = metric.compute()
    return score
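
A quick way to check the batching helper on its own, without a model, is to run it on a toy list. The function is repeated here so the snippet is self-contained, and the `dialogues` list is made-up data.

```python
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]


dialogues = ["d0", "d1", "d2", "d3", "d4"]  # made-up stand-ins for real dialogues
chunks = list(generate_batch_sized_chunks(dialogues, batch_size=2))
print(chunks)  # [['d0', 'd1'], ['d2', 'd3'], ['d4']]
```

The last chunk is simply shorter when the list length is not a multiple of the batch size, so no example is dropped during evaluation.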


In [None]:
!pip install evaluate

from evaluate import load

rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

rouge_metric = load('rouge')

In [None]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'][0:10], rouge_metric, trainer.model, tokenizer,
    batch_size=2, column_text='dialogue', column_summary='summary'
)

# evaluate's ROUGE returns plain floats (aggregated F-measures) keyed by metric name
rouge_dict = {rn: score[rn] for rn in rouge_names}

pd.DataFrame(rouge_dict, index=['pegasus'])

### 1.6 Save and Load the Model




In [None]:
## Save model
model_pegasus.save_pretrained("pegasus-samsum-model")

In [None]:
## Save tokenizer
tokenizer.save_pretrained("tokenizer")

In [None]:
#Load
tokenizer = AutoTokenizer.from_pretrained("tokenizer")  # same relative path used when saving above

### 1.7 Perform Inference with Our Model

We can now load the saved model into a summarization pipeline and compare its output with the reference summary.

In [None]:
#Prediction

gen_kwargs = {"length_penalty": 0.8, # Exponent applied to sequence length when scoring beams: higher values favor longer outputs, lower values favor shorter ones.
              "num_beams":8, # Enables beam search with 8 beams.
              "max_length": 128}

sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]

pipe = pipeline("summarization", model="pegasus-samsum-model",tokenizer=tokenizer)

print("Dialogue:")
print(sample_text)

print("\nReference Summary:")
print(reference)

print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

The summary is far from perfect: we trained for only one epoch, and on the small test split rather than the training split. Training for longer on the actual training set should give better results.
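
To build some intuition for `length_penalty`, the sketch below ranks two made-up beam hypotheses the way beam search is generally described in the generation docs: the summed log-probability is divided by the length raised to `length_penalty`. `beam_score` and `best_hypothesis` are hypothetical helpers, and the log-probabilities are invented numbers.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score used to rank finished beam hypotheses."""
    return sum_logprobs / (length ** length_penalty)


def best_hypothesis(hypotheses, length_penalty):
    """Return the name of the hypothesis with the highest normalized score."""
    return max(
        hypotheses,
        key=lambda name: beam_score(*hypotheses[name], length_penalty),
    )


# name -> (summed log-probability, number of tokens); invented values
hypotheses = {
    "short": (-4.0, 5),
    "long": (-7.0, 10),
}

for penalty in (0.0, 1.0):
    print(penalty, "->", best_hypothesis(hypotheses, penalty))
```

With a penalty of `0.0` the raw log-probabilities are compared and the short hypothesis wins; with `1.0` the scores are averaged per token and the long one wins. Values in between, like the `0.8` used above, shift the balance gradually.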

## 2. Text To Image Generation (Diffusion)

Next, we use a diffusion model from the Hugging Face Hub for text-to-image generation.

We will use the [`diffusers`](https://huggingface.co/docs/diffusers/index#diffusers) library from Hugging Face.

### 2.1 Get Dependencies

In [None]:
# diffusers is the Hugging Face library for running diffusion models from the Hugging Face Hub
!pip install diffusers transformers accelerate

In [None]:
from diffusers import StableDiffusionPipeline
import matplotlib.pyplot as plt
import torch

In [None]:
!pip show torch

### 2.2 Choose Model

In [None]:
model_id1 = "dreamlike-art/dreamlike-diffusion-1.0"   # https://huggingface.co/dreamlike-art/dreamlike-diffusion-1.0
# Listed for reference only: SDXL checkpoints load with StableDiffusionXLPipeline, not StableDiffusionPipeline
model_id2 = "stabilityai/stable-diffusion-xl-base-1.0"    # https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0

model = StableDiffusionPipeline.from_pretrained(model_id1, torch_dtype=torch.float16, use_safetensors=True)
model = model.to("cuda")

### 2.3 Use the Loaded Model

In [None]:
prompt = """dreamlikeart, a grungy woman with rainbow hair, travelling between dimensions, dynamic pose, happy, soft eyes and narrow chin,
extreme bokeh, dainty figure, long hair straight down, torn kawaii shirt and baggy jeans"""

In [None]:
image = model(prompt).images[0]

In [None]:
image

### 2.4 Playing with Parameters


In [None]:
def generate_image(pipe, prompt, params):
  """Run the pipeline and display the generated image(s) side by side."""
  img = pipe(prompt, **params).images

  num_images = len(img)
  if num_images > 1:
    fig, ax = plt.subplots(nrows=1, ncols=num_images)
    for i in range(num_images):
      ax[i].imshow(img[i])
      ax[i].axis('off')

  else:
    fig = plt.figure()
    plt.imshow(img[0])
    plt.axis('off')
  plt.tight_layout()

In [None]:
prompt = "dreamlike, beautiful girl playing the festival of colors, draped in traditional Indian attire, throwing colors"

params = {}

In [None]:
generate_image(model, prompt, params)  # `model` is the diffusion pipeline loaded in 2.2

In [None]:
#num inference steps
params = {'num_inference_steps': 100}

generate_image(model, prompt, params)  # `model` is the diffusion pipeline loaded in 2.2

In [None]:
#height width
params = {'num_inference_steps': 100, 'width': 512, 'height': int(1.5*640)}

generate_image(model, prompt, params)  # `model` is the diffusion pipeline loaded in 2.2
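
When experimenting with other sizes, keep in mind that Stable Diffusion pipelines expect `height` and `width` to be divisible by 8. A small helper can snap arbitrary values to valid ones; `snap_to_multiple` is a hypothetical name and this is only a sketch of one way to do it.

```python
def snap_to_multiple(value, multiple=8):
    """Round a pixel dimension down to the nearest multiple (at least one multiple)."""
    return max(multiple, (int(value) // multiple) * multiple)


params = {
    "num_inference_steps": 100,
    "width": snap_to_multiple(512),
    "height": snap_to_multiple(1.5 * 640),
}
print(params)
```

Rounding down rather than up keeps memory use from creeping above what was requested.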