# Contronto tra il fine-tuning completo e il fine-tuning con  LoRA

#### Configurazioni

Installazione della libreria `loralib` per implementare la Low-Rank Adaptation.

In [1]:
!pip install loralib

Collecting loralib
  Downloading loralib-0.1.2-py3-none-any.whl.metadata (15 kB)
Downloading loralib-0.1.2-py3-none-any.whl (10 kB)
Installing collected packages: loralib
Successfully installed loralib-0.1.2


Carico i file `lora_utilis.py` e `models.py`.

In [2]:
import sys
sys.path.append('/kaggle/input/lora-utils/')

Importo i moduli necessari.

In [3]:
import os
import random
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import transformers
import lora_utils, models

Impostazione del seme casuale per la riproducibilità.

In [4]:
seed_value = 42

os.environ['PYTHONHASHSEED'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)

# Imposto il seme casuale anche per i calcoli CUDA
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)  
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

## Sentiment Analisys

### 1. Ottenimento dei dati e preprocessing

Confronto il fine-tuning completo e quello basato su LoRA sul task di Sentiment Analysis utilizzando il dataset IMDB Reviews.  

Il dataset IMDB è costituito da 50.000 recensioni di film, etichettate con **positive** se la recensione è positiva o **negative**, altrimenti.  

Utilizzo la libreria `kagglehub` per scaricare il dataset da Kaggle.

In [5]:
import kagglehub

# Scarico l'ultima versione del dataset
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

dataset_path = path + "/IMDB Dataset.csv"
dataset = pd.read_csv(dataset_path)

# Stampa di verifica
print(dataset.loc[0:4])
print(dataset.isnull().sum())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
review       0
sentiment    0
dtype: int64


Definisco una funzione per l'ottenimento dei dati e la loro divisione in training, validation e test set.

In [6]:
LABELS = {"negative": 0, "positive": 1}
classes = list(LABELS.keys())

In [7]:
def get_data(dataset_path, n_train=5000, n_val=500, n_test=512): 

    # Leggo il dataset
    dataset = pd.read_csv(dataset_path)

    # Converto le etichette in numeri
    dataset['sentiment'] = dataset["sentiment"].map(LABELS)

    # Divido gli esempi in negativi e positivi
    neg = dataset[ dataset['sentiment'] == LABELS['negative'] ]
    pos = dataset[ dataset['sentiment'] == LABELS['positive'] ]

    # Verifico che ci siano abbastanza esempi
    if len(neg) < n_train + n_val + n_test or len(pos) < n_train + n_val + n_test:
        raise ValueError("Non ci sono abbastanza esempi per le dimensioni del train, validation e test set specificate.")
    
    # Creo una permutazione degli esempi negativi e positivi
    neg = neg.sample(frac=1, random_state=42).reset_index(drop=True)
    pos = pos.sample(frac=1, random_state=42).reset_index(drop=True)

    # Seleziono gli elementi da inserire nei set
    neg_train, pos_train = neg[:n_train], pos[:n_train]
    neg_val, pos_val = neg[n_train:n_train+n_val], pos[n_train:n_train+n_val]
    neg_test, pos_test = neg[n_train+n_val:n_train+n_val+n_test], pos[n_train+n_val:n_train+n_val+n_test]

    # Concateno gli esempi negativi e positivi per formare un solo insieme
    train_data = pd.concat([neg_train, pos_train])
    val_data = pd.concat([neg_val, pos_val])
    test_data = pd.concat([neg_test, pos_test])

    # Mescolo i dati negli insiemi
    train_data = train_data.sample(frac=1, random_state=42).reset_index(drop=True)
    val_data = val_data.sample(frac=1, random_state=42).reset_index(drop=True)
    test_data = test_data.sample(frac=1, random_state=42).reset_index(drop=True)

    # Ottengo le features e le labels
    sentences_train, labels_train = train_data['review'] , train_data['sentiment']
    sentences_val, labels_val = val_data['review'] , val_data['sentiment']
    sentences_test, labels_test = test_data['review'] , test_data['sentiment']

    return sentences_train, labels_train, sentences_val, labels_val, sentences_test, labels_test

Creo una classe Dataset personalizzata in cui viene effettuata la tokenizzaione delle recensioni e la conversione dei dati in tensori.

In [8]:
from torch.utils.data import Dataset

class IMDBDataset(Dataset):

    def __init__(self, sentences, labels, tokenizer, max_len):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __len__(self):
        return len(self.sentences)
    
    def __getitem__(self,index):
        sentence = self.sentences[index]
        label = self.labels[index]
        
        encoding = self.tokenizer.encode_plus(
            sentence,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,
            return_token_type_ids=True,
            padding="max_length",
            return_attention_mask=True,
            return_tensors='pt')
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'token_type_ids': encoding["token_type_ids"].flatten(),
            'labels': torch.tensor(label, dtype=torch.float)         # usa torch.tensor visto che label è uno scalare 
            }

Ottengo i dati grezzi, li divido in training, validation e test set con la funzione `get_data()`. Inizializzo il Tokenizer BERT per tokenizzare le frasi e creo i dataset personalizzati.

In [9]:
from transformers import BertTokenizer
from torch.utils.data import DataLoader

MAX_SEQ_LEN = 128

# Ottieni i dati grezzi
sentences_train, labels_train, sentences_val, labels_val, sentences_test, labels_test = get_data(dataset_path)

# Inizializza il Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

#Ottieni i dataset
training_data = IMDBDataset(sentences = sentences_train,
                           labels = labels_train,
                           tokenizer = tokenizer,
                           max_len = MAX_SEQ_LEN)

validation_data = IMDBDataset(sentences = sentences_val.values,
                           labels = labels_val.values,
                           tokenizer = tokenizer,
                           max_len = MAX_SEQ_LEN)

test_data = IMDBDataset(sentences = sentences_test.values,
                           labels = labels_test.values,
                           tokenizer = tokenizer,
                           max_len = MAX_SEQ_LEN)


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



### 2. Configurazione dei modelli

Definisco un classificatore basato su BERT. Utilizzo il modello BERT pre-addestrato e aggiungo un livello di pooling e un layer lineare con un solo neurone di output per la classificazione binaria. Se il parametro lora è attivo, integra LoRA nel modello.

In [10]:
from transformers import BertModel

class BERTClassifier(nn.Module):
    
    def __init__(self, lora: bool = False, r: int = 16):
        super(BERTClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.avg_pooling = nn.AdaptiveAvgPool1d(1) 
        self.linear = nn.Linear(self.bert.config.hidden_size, 1)

        if lora:
            print("Adding LoRA to BERT")
            lora_utils.add_lora_to_bert(self.bert, r=r)
            lora_utils.mark_only_lora_as_trainable(self.bert)

    
    def forward(self, input_ids, attention_mask, token_type_ids):
        output_bert = self.bert(
            input_ids, 
            attention_mask=attention_mask, 
            token_type_ids=token_type_ids
        )
        last_hidden_state = output_bert.last_hidden_state  
        avg_pooled = self.avg_pooling(last_hidden_state.transpose(1, 2)).squeeze(-1)
        logits = self.linear(avg_pooled)
        return logits

In [11]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [12]:
# MODELLO BERT 
full_model = BERTClassifier(lora=False)
full_model.to(device)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

BERTClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

In [13]:
# MODELLO CON LORA
lora_model = BERTClassifier(lora=True, r=16)
lora_model.to(device)

Adding LoRA to BERT


BERTClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=False)
              (key): Linear(in_features=768, out_features=768, bias=False)
              (value): Linear(in_features=768, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

### 3. Addestramento dei modelli

Definisco una serie di funzioni per l'addestramento e la valutazione di un modello.

In [14]:
import time
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


# Funzione di training e valutazione
def train_and_evaluate_model(model, model_name, train_loader, val_loader, criterion, optimizer, scheduler, device, epochs=4):
    
    history = {"train_loss": [], "train_acc": [], "val_loss": [], "val_acc": []}
    best_accuracy = 0

    start_time = time.time()

    for epoch in range(epochs):
    
        print(f"\nEpoch {epoch + 1}/{epochs}")
        print('-' * 10)

        # Training
        train_loss, train_acc = train_model(model, train_loader, criterion, optimizer, scheduler, device)
        print(f"Train loss: {train_loss:.4f}, Train accuracy: {train_acc:.4f}")
 
        # Valutazione sul validation set
        val_loss, val_acc, val_f1, val_auc = eval_model(model, val_loader, criterion, device)
        print(f"Validation loss: {val_loss:.4f}, Validation accuracy: {val_acc:.4f}")
 
        # Salvataggio del modello migliore
        if val_acc > best_accuracy:
            torch.save(model.state_dict(), f"imbd_best_{model_name}_state.bin")
            best_accuracy = val_acc

        # Salvo le metriche
        history["train_loss"].append(train_loss)
        history["train_acc"].append(train_acc)
        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)
    
    end_time = time.time()
    total_training_time = end_time - start_time
    
    return history, total_training_time

In [15]:
# Funzione di training 
def train_model(model, data_loader, criterion, optimizer, scheduler, device):
  
    model = model.train() # imposto il modello in modalità di aggiornamento
    
    total_loss = 0
    all_preds = []
    all_labels = []

    for batch in data_loader:
        
        # Sposto i dati sul device
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch["labels"].unsqueeze(1).to(device)

        #  --- Forward pass ---
        
        # Azzero il gradiente
        optimizer.zero_grad()
 
        # Effettuo la previsione per il batch corrente
        outputs = model(
            input_ids = input_ids,
            attention_mask = attention_mask,
            token_type_ids = token_type_ids
        )# output contiene i valori grezzi non normalizzati prodotti dal modello per ogni classe.

        #outputs = outputs.logits  # Estrai i logits

        # Calcolo la loss
        loss = criterion(outputs, labels)


        # --- Backward pass ---
        
        # Calcolo i gradienti della loss
        loss.backward()

        # Effettuo il clipping dei gradienti
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Aggiorno i pesi
        optimizer.step()

        # Aggiorno il learning rate
        scheduler.step()

        # Salvo la loss, le previsioni e le etichette
        total_loss += loss.item()
        
        preds = (torch.sigmoid(outputs) > 0.5).long()    # trasformo i dati grezzi in etichette binarie
        
        all_preds.extend(preds.detach().cpu().numpy())
        all_labels.extend(labels.detach().cpu().numpy())

    # Calcolo la loss e le metriche
    avg_loss = total_loss / len(data_loader)
    accuracy = accuracy_score(all_labels, all_preds)
   
    return avg_loss, accuracy

In [16]:
# Funzione di valutazione
def eval_model(model, data_loader, criterion, device):
    
    model = model.eval()    # imposto il modello in modalità valutavione

    total_loss = 0
    all_preds = []
    all_labels = []
    all_probs = []

    with torch.no_grad():
        for batch in data_loader:
            
            # Sposto i dati sul device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            labels = batch['labels'].unsqueeze(1).to(device)

            # Effettuo la previsione per il batch corrente
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids = token_type_ids
            )
            #outputs = outputs.logits  # Estrai i logits
            
            # Calcolo la loss
            loss = criterion(outputs, labels)

            # Aggiorno la loss, salvo le previsioni e le etichette
            total_loss += loss.item()
    
            probs = torch.sigmoid(outputs)  # Calcolo le probabilità

            preds = (probs > 0.5).long()  # Trasformo in etichette binarie
            
            all_preds.extend(preds.detach().cpu().numpy())
            all_labels.extend(labels.detach().cpu().numpy())
            all_probs.extend(probs.detach().cpu().numpy())

        # Calcolo loss e metriche
        avg_loss = total_loss / len(data_loader)
        accuracy = accuracy_score(all_labels, all_preds)
        f1 = f1_score(all_labels, all_preds, average="weighted")
        roc_auc = roc_auc_score(all_labels, all_probs, average='weighted', multi_class='ovr')
    
    return avg_loss, accuracy, f1, roc_auc

Definisco le configurazion principali per il training.

In [17]:
# Parametri principali
learning_rate = 3e-5
EPOCHS = 4
BATCH_SIZE = 32


# Creo i DataLoader
train_loader = DataLoader(training_data, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(validation_data, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)

total_steps = len(train_loader) * EPOCHS

# Funzione di loss
criterion = torch.nn.BCEWithLogitsLoss() # Applica automaticamente la sigmoide


# Ottimizzatore
optimizer = torch.optim.AdamW(params = full_model.parameters(), lr = learning_rate)

# Scheduler
scheduler = transformers.get_linear_schedule_with_warmup(optimizer = optimizer,
                                                       num_warmup_steps = 0,
                                                       num_training_steps = total_steps)

In [18]:
history_bert, total_time_bert = train_and_evaluate_model(
    full_model, "full_model", train_loader, val_loader, criterion, optimizer, scheduler, device, epochs=4
)
print(f"\nBERT Training Time: {total_time_bert:.2f} seconds, {total_time_bert/60:.2f} minutes.")


Epoch 1/4
----------
Train loss: 0.3672, Train accuracy: 0.8342
Validation loss: 0.3072, Validation accuracy: 0.8660

Epoch 2/4
----------
Train loss: 0.1908, Train accuracy: 0.9284
Validation loss: 0.3086, Validation accuracy: 0.8900

Epoch 3/4
----------
Train loss: 0.0862, Train accuracy: 0.9722
Validation loss: 0.4209, Validation accuracy: 0.8870

Epoch 4/4
----------
Train loss: 0.0362, Train accuracy: 0.9902
Validation loss: 0.5490, Validation accuracy: 0.8820

BERT Training Time: 666.33 seconds, 11.11 minutes.


In [19]:
# Parametri principali
learning_rate = 5e-4
EPOCHS = 4
BATCH_SIZE = 16


# Creo i DataLoader
train_loader = DataLoader(training_data, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(validation_data, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)

total_steps = len(train_loader) * EPOCHS

# Funzione di loss
criterion = torch.nn.BCEWithLogitsLoss() # Applica automaticamente la sigmoide

# Ottimizzatore
optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, lora_model.parameters()), lr = learning_rate)

# Scheduler
scheduler = transformers.get_linear_schedule_with_warmup(optimizer = optimizer,
                                                       num_warmup_steps = 0,
                                                       num_training_steps = total_steps)

In [20]:
history_lora, total_time_lora = train_and_evaluate_model(
    lora_model,"lora_model", train_loader, val_loader, criterion, optimizer, scheduler, device, epochs=4
)
print(f"BERT with LoRA Training Time: {total_time_lora:.2f} seconds, {total_time_lora/60:.2f} minutes.")



Epoch 1/4
----------
Train loss: 0.4022, Train accuracy: 0.8144
Validation loss: 0.3351, Validation accuracy: 0.8580

Epoch 2/4
----------
Train loss: 0.3150, Train accuracy: 0.8626
Validation loss: 0.3248, Validation accuracy: 0.8580

Epoch 3/4
----------
Train loss: 0.2919, Train accuracy: 0.8774
Validation loss: 0.3186, Validation accuracy: 0.8660

Epoch 4/4
----------
Train loss: 0.2771, Train accuracy: 0.8855
Validation loss: 0.3171, Validation accuracy: 0.8680
BERT with LoRA Training Time: 546.44 seconds, 9.11 minutes.


### 4. Valutazione dei modelli
Valuto i modello calcolando la loss sul test set, l'accuracy, l'F1-score e ROC AUC.

In [21]:
full_model.load_state_dict(torch.load("imbd_best_full_model_state.bin"))

test_loss, test_acc, test_f1, test_auc = eval_model(full_model, test_loader, criterion, device)
print(f"Full Fine-Tuning - Test loss: {test_loss:.4f}, Accuracy: {test_acc:.4f}, F1 score: {test_f1:.4f}, ROC AUC: {test_auc:.4f}")

  full_model.load_state_dict(torch.load("imbd_best_full_model_state.bin"))


Full Fine-Tuning - Test loss: 0.3159, Accuracy: 0.8857, F1 score: 0.8856, ROC AUC: 0.9537


In [22]:
lora_model.load_state_dict(torch.load("imbd_best_lora_model_state.bin"))

lora_test_loss, lora_test_acc, lora_test_f1, lora_test_auc = eval_model(lora_model, test_loader, criterion, device)
print(f"LoRA Fine-Tuning - Test loss: {lora_test_loss:.4f}, Accuracy: {lora_test_acc:.4f}, F1 score: {lora_test_f1:.4f}, ROC AUC: {lora_test_auc:.4f}")

  lora_model.load_state_dict(torch.load("imbd_best_lora_model_state.bin"))


LoRA Fine-Tuning - Test loss: 0.3004, Accuracy: 0.8730, F1 score: 0.8730, ROC AUC: 0.9469


## Toxicity Detection

### 1. Ottenimento dei dati e preprocessing  

Confronto il fine-tuning completo e quello basato su LoRA sul task di classificazione multi-label utilizzando il dataset **Toxic Comment Classification**.  

Il dataset contiene commenti testuali etichettati con sei classi: **toxic**, **severe_toxic**, **obscene**, **threat**, **insult**, e **identity_hate**.

Definisco una funzione per l'ottenimento dei dati e la loro divisione in training, validation e test set.

In [23]:
classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
N_CLASSES = 6

In [24]:
def get_data(dataset_path, n_train=5000, n_val=500, n_test=1024): 

    # Leggo il dataset
    dataset = pd.read_csv(dataset_path)

    # Effettuo un mescolamento casuale dei dati
    dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)

    # Estraggo il testo e le etichette dal dataset
    sentences = dataset["comment_text"].fillna("null").str.lower()
    labels = dataset[classes].values.astype(np.float32)
    
    # Seleziono gli elementi da inserire nei set
    train_sentences, train_labels = sentences[:n_train], labels[:n_train]
    val_sentences, val_labels = sentences[n_train:n_train+n_val].reset_index(drop=True), labels[n_train:n_train+n_val]
    test_sentences,test_labels = sentences[n_train+n_val:n_train+n_val+n_test].reset_index(drop=True), labels[n_train+n_val:n_train+n_val+n_test]

    return train_sentences, train_labels, val_sentences, val_labels, test_sentences,test_labels

Creo una classe Dataset personalizzata in cui viene effettuata la tokenizzaione delle recensioni e la conversione dei dati in tensori.

In [25]:
from torch.utils.data import Dataset

class ToxicityDataset(Dataset):

    def __init__(self, sentences, labels, tokenizer, max_len):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        sentence = self.sentences[index]
        label = self.labels[index]
        
        encoding = self.tokenizer.encode_plus(
            sentence,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,
            return_token_type_ids=True,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'token_type_ids': encoding["token_type_ids"].flatten(),
            'labels': torch.FloatTensor(label)
        }     

Ottengo i dati grezzi, li divido in training, validation e test set con la funzione `get_data()`. Inizializzo il Tokenizer BERT per tokenizzare le frasi e creo i dataset personalizzati.

In [26]:
from transformers import BertTokenizer
from torch.utils.data import DataLoader

MAX_SEQ_LEN = 128

# Ottengo i dati divisi in training set, validation set e test set
dataset_path = "/kaggle/input/toxisity-detection-dataset/train.csv"

train_sentences, train_labels, val_sentences, val_labels, test_sentences,test_labels = get_data(dataset_path)


# Inizializzo il Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Ottengo i dataset
train_dataset = ToxicityDataset(sentences = train_sentences,
                                labels = train_labels, 
                                tokenizer = tokenizer, 
                                max_len = MAX_SEQ_LEN)

validation_dataset = ToxicityDataset(sentences = val_sentences,
                                labels = val_labels, 
                                tokenizer = tokenizer, 
                                max_len = MAX_SEQ_LEN)

test_dataset = ToxicityDataset(sentences = test_sentences,
                                labels = test_labels, 
                                tokenizer = tokenizer, 
                                max_len = MAX_SEQ_LEN)



Controllo che il caricamento dei dati sia avvenuto correttamente.

### 2. Configurazione dei modelli 

Definisco un classificatore basato su BERT per il task multi-label. Utilizzo il modello BERT pre-addestrato, seguito da un livello di dropout e un layer lineare con **N_CLASSES** neuroni di output. Se attivato, il modello integra LoRA per una fine-tuning efficiente.

In [27]:
from torch import nn
from transformers import BertModel

class BERTClassifierMultilabel(nn.Module):

    def __init__(self, lora: bool = False, r: int = 16):
        super(BERTClassifierMultilabel, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = torch.nn.Dropout(p=0.3)
        self.linear = torch.nn.Linear(self.bert.config.hidden_size, N_CLASSES)

        if lora:
            print("Adding LoRA to BERT")
            lora_utils.add_lora_to_bert(self.bert, r=r)
            lora_utils.mark_only_lora_as_trainable(self.bert)


    
    def forward(self, input_ids, attention_mask, token_type_ids):
        output_bert = self.bert(
            input_ids, 
            attention_mask=attention_mask, 
            token_type_ids=token_type_ids
        )
        output_dropout = self.dropout(output_bert.pooler_output)
        output = self.linear(output_dropout)
        return output

In [28]:
# Device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [29]:
# MODELLO STANDARD
full_model = BERTClassifierMultilabel(lora=False)
full_model.to(device)

BERTClassifierMultilabel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, 

In [30]:
# MODELLO CON LORA
lora_model = BERTClassifierMultilabel(lora=True, r=16)
lora_model.to(device)

Adding LoRA to BERT


BERTClassifierMultilabel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=False)
              (key): Linear(in_features=768, out_features=768, bias=False)
              (value): Linear(in_features=768, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-1

### 3. Addestramento dei modelli

Definisco una serie di funzioni per l'addestramento e la valutazione di un modello.

In [31]:
import time
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Funzione di training e valutazione sul validation set

def train_and_evaluate_model(model, model_name, train_loader, val_loader, criterion, optimizer, scheduler, device, epochs=4):

    history = {"train_loss": [], "train_acc": [], "val_loss": [], "val_acc": []}
    best_accuracy = 0

    start_time = time.time()

    for epoch in range(epochs):
        
        print(f"\nEpoch {epoch + 1}/{epochs}")
        print('-' * 30)

        # Training
        train_loss, train_acc = train_model(model, train_loader, criterion, optimizer, scheduler, device)
        print(f"Train loss: {train_loss:.3f}, Train accuracy: {train_acc:.4f}")

        # Valutazione
        val_loss, val_acc, val_f1, val_auc = eval_model(model, val_loader, criterion, device)
        print(f"Validation loss: {val_loss:.3f}, Validation accuracy: {val_acc:.3f}")

        # Salvataggio del modello migliore
        if val_acc > best_accuracy:
            torch.save(model.state_dict(),  f"toxicity_best_{model_name}_state.bin")
            best_accuracy = val_acc

        # Salvaggio delle metriche
        history["train_loss"].append(train_loss)
        history["train_acc"].append(train_acc)
        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)
    
    end_time = time.time()
    total_training_time = end_time - start_time
    
    return history, total_training_time


In [32]:
def train_model(model, data_loader, criterion, optimizer, scheduler, device):
    
    model.train()  # imposto il modello in modalità di aggiornamento

    total_loss = 0 
    all_preds = []
    all_labels = []

    for batch in data_loader:
        
        # Sposto i dati sul device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        
        # --- Forward pass ---
        
        # Azzero il gradiente
        optimizer.zero_grad()
        
        # Effettuo la previsione per il batch corrente
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids = token_type_ids
        )
        # output contiene i valori grezzi non normalizzati prodotti dal modello per ogni classe.
        
        # Calcolo la loss
        loss = criterion(outputs, labels)

        
        # --- Backward pass ---

        # Calcolo i gradienti della loss
        loss.backward()

        # Effettuo il clipping dei gradienti
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Aggiorno i pesi
        optimizer.step()

        # Aggiorno il learning rate
        scheduler.step()

        
        # Salvo la loss, le previsioni e le etichette
        total_loss += loss.item()

        preds = (torch.sigmoid(outputs) > 0.5).long() #  trasformo i dati grezzi in etichette binarie

        all_preds.extend(preds.detach().cpu().numpy()) # disconnetto il tensore dalla computazione del gradiente, lo sposto sulla cpu, lo trasformo in array numpy e aggiungo il risultato a all_preds
        all_labels.extend(labels.detach().cpu().numpy())


    # Calcolo la loss e le metriche
    avg_loss = total_loss / len(data_loader)
    accuracy = accuracy_score(all_labels, all_preds)
    
    return avg_loss, accuracy
    

In [33]:
def eval_model(model, data_loader, criterion, device):

    model.eval()   # imposto il modello in modalità valutavione

    total_loss = 0
    all_preds = []
    all_labels = []
    all_probs = []

    with torch.no_grad():
        for batch in data_loader:

            # Sposto i dati sul device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            labels = batch['labels'].to(device)

            # Effettuo la previsione per il batch corrente
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids = token_type_ids
            )
            
            # Calcolo la loss
            loss = criterion(outputs, labels)

            # Aggiorno la loss, e salvo le previsioni le etichette
            total_loss += loss.item()
    
            probs = torch.sigmoid(outputs)  # Calcolo le probabilità

            preds = (probs > 0.5).long()  # Trasformo in etichette binarie
            
            all_preds.extend(preds.detach().cpu().numpy())
            all_labels.extend(labels.detach().cpu().numpy())
            all_probs.extend(probs.detach().cpu().numpy())

        # Calcolo loss e metriche
        avg_loss = total_loss / len(data_loader)
        accuracy = accuracy_score(all_labels, all_preds)
        f1 = f1_score(all_labels, all_preds, average="weighted")
        roc_auc = roc_auc_score(all_labels, all_probs, average='weighted', multi_class='ovr')
    
    return avg_loss, accuracy, f1, roc_auc
        

Definisco le configurazion principali per il training.

In [34]:
# Parametri principali
learning_rate = 3e-5
EPOCHS = 4
BATCH_SIZE = 16


# Creo i DataLoader
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

total_steps = len(train_loader) * EPOCHS


# Funzione di loss
criterion = torch.nn.BCEWithLogitsLoss() # Applica automaticamente la sigmoide


# Ottimizzatore
optimizer = torch.optim.AdamW(params = full_model.parameters(), lr = learning_rate)

# Scheduler
scheduler = transformers.get_linear_schedule_with_warmup(optimizer = optimizer,
                                                       num_warmup_steps = 0,
                                                       num_training_steps = total_steps)

In [35]:
history_bert, total_time_bert = train_and_evaluate_model(
    full_model, "full_model", train_loader, train_loader, criterion, optimizer, scheduler, device, epochs=4
)
print(f"\nBERT Training Time: {total_time_bert:.2f} seconds, {total_time_bert/60:.2f} minutes.")


Epoch 1/4
------------------------------
Train loss: 0.107, Train accuracy: 0.8962
Validation loss: 0.048, Validation accuracy: 0.929

Epoch 2/4
------------------------------
Train loss: 0.046, Train accuracy: 0.9274
Validation loss: 0.034, Validation accuracy: 0.948

Epoch 3/4
------------------------------
Train loss: 0.034, Train accuracy: 0.9440
Validation loss: 0.026, Validation accuracy: 0.959

Epoch 4/4
------------------------------
Train loss: 0.027, Train accuracy: 0.9582
Validation loss: 0.025, Validation accuracy: 0.969

BERT Training Time: 376.70 seconds, 6.28 minutes.


In [36]:
# Parametri principali
learning_rate = 5e-4
EPOCHS = 6
BATCH_SIZE = 16

# Creo i DataLoader
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

total_steps = len(train_loader) * EPOCHS


# Funzione di loss
criterion = torch.nn.BCEWithLogitsLoss() # Applica automaticamente la sigmoide

# Ottimizzatore
optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, lora_model.parameters()), lr = learning_rate)

# Scheduler
scheduler = transformers.get_linear_schedule_with_warmup(optimizer = optimizer,
                                                       num_warmup_steps = 0,
                                                       num_training_steps = total_steps)

In [37]:
history_lora, total_time_lora = train_and_evaluate_model(
    lora_model,"lora_model", train_loader, val_loader, criterion, optimizer, scheduler, device, epochs=EPOCHS
)
print(f"BERT with LoRA Training Time: {total_time_lora:.2f} seconds, {total_time_lora/60:.2f} minutes.")


Epoch 1/6
------------------------------
Train loss: 0.114, Train accuracy: 0.8936
Validation loss: 0.073, Validation accuracy: 0.894

Epoch 2/6
------------------------------
Train loss: 0.059, Train accuracy: 0.9104
Validation loss: 0.065, Validation accuracy: 0.894

Epoch 3/6
------------------------------
Train loss: 0.051, Train accuracy: 0.9156
Validation loss: 0.065, Validation accuracy: 0.896

Epoch 4/6
------------------------------
Train loss: 0.046, Train accuracy: 0.9210
Validation loss: 0.065, Validation accuracy: 0.888

Epoch 5/6
------------------------------
Train loss: 0.043, Train accuracy: 0.9254
Validation loss: 0.065, Validation accuracy: 0.892

Epoch 6/6
------------------------------
Train loss: 0.041, Train accuracy: 0.9274
Validation loss: 0.065, Validation accuracy: 0.888
BERT with LoRA Training Time: 310.94 seconds, 5.18 minutes.


### 4. Valutazione dei modelli
Valuto i modello calcolando la loss sul test set, l'accuracy, l'F1-score e ROC AUC.

In [38]:
full_model.load_state_dict(torch.load("toxicity_best_full_model_state.bin"))

test_loss, test_acc, test_f1, test_auc = eval_model(full_model, test_loader, criterion, device)
print(f"Full Fine-Tuning - Test loss: {test_loss:.4f}, Accuracy: {test_acc:.4f}, F1 score: {test_f1:.4f}, ROC AUC: {test_auc:.4f}")

  full_model.load_state_dict(torch.load("toxicity_best_full_model_state.bin"))


Full Fine-Tuning - Test loss: 0.0646, Accuracy: 0.9062, F1 score: 0.7164, ROC AUC: 0.9785


In [39]:
lora_model.load_state_dict(torch.load("toxicity_best_lora_model_state.bin"))

lora_test_loss, lora_test_acc, lora_test_f1, lora_test_auc = eval_model(lora_model, test_loader, criterion, device)
print(f"LoRA Fine-Tuning - Test loss: {lora_test_loss:.4f}, Accuracy: {lora_test_acc:.4f}, F1 score: {lora_test_f1:.4f}, ROC AUC: {lora_test_auc:.4f}")

  lora_model.load_state_dict(torch.load("toxicity_best_lora_model_state.bin"))


LoRA Fine-Tuning - Test loss: 0.0621, Accuracy: 0.9082, F1 score: 0.7009, ROC AUC: 0.9739


<a href="/kaggle/working/imbd_best_lora_model_state.bin"> Download File </a>


