# DistilBERT

DistilBERT is a lighter, faster version of BERT designed to use fewer computational resources. It achieves this by reducing the number of layers and parameters through a process called knowledge distillation, in which the smaller model is trained to imitate BERT's behavior. It also simplifies the architecture by halving the number of Transformer layers and removing BERT's token-type embeddings and pooler.
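
As a rough illustration of the distillation idea, the sketch below computes a soft-target loss between a teacher's and a student's output distributions. The logits, the temperature of 2.0, and the function names are illustrative assumptions, not values from this notebook, and the actual DistilBERT training objective combines a term of this kind with other losses.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits for a 3-class toy problem
teacher = [4.0, 1.0, 0.5]
student_close = [3.5, 1.2, 0.4]
student_far = [0.5, 1.0, 4.0]

loss_close = distillation_loss(teacher, student_close)
loss_far = distillation_loss(teacher, student_far)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

A student whose outputs resemble the teacher's gets a smaller loss, which is the signal that pushes the smaller model to imitate the larger one.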

This notebook evaluates the fine-tuning capabilities of a DistilBERT model.

In [1]:
! pip install pandas seaborn tensorflow==2.15.0



In [2]:
!pip install scikit-learn



In [3]:
import os

import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib_inline.backend_inline import set_matplotlib_formats
import seaborn as sns
import tensorflow as tf

from tensorflow.keras import activations, optimizers, losses
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
# from tftrainer import Trainer

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    confusion_matrix,
    classification_report
)



# Load the data and split it into training and validation sets

In [4]:
df = pd.read_csv('./data/goemotions_clean.csv', sep=",")
df.head()

| Unnamed: 0 | text | emotion |
|---|---|---|
| 0 | Shhh dont give idea | anger |
| 1 | Thank much kind stranger I really need | gratitude |
| 2 | Ion know would better buy trim make hard dose | neutral |
| 3 | Im honestly surprised We fallen much farther | excitement |
| 4 | Jurisprudence fetishist get technicality | neutral |


In [5]:
# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(df['text'], df['emotion'], test_size=0.2, random_state=0)

# Data preprocessing

To train the DistilBERT model, the text must first be preprocessed and converted to numbers. This is done with the `DistilBertTokenizerFast` tokenizer provided by Hugging Face's Transformers library.

In addition, the target labels are one-hot encoded, and PyTorch datasets are created for training.
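
As a minimal sketch of what one-hot encoding does, the plain-Python example below maps each emotion name to a 0/1 vector. The helper names are hypothetical and the labels are a toy subset; the notebook itself relies on scikit-learn's `OneHotEncoder`, which likewise orders categories in sorted order by default. The sketch fits the category list once and reuses it for both splits, which keeps the column order identical even when a split is missing some class.

```python
def fit_categories(labels):
    """Return the sorted list of distinct labels; a label's index is its class id."""
    return sorted(set(labels))

def one_hot(labels, categories):
    """Encode each label as a 0/1 vector with a single 1 at its class index."""
    index = {label: i for i, label in enumerate(categories)}
    return [[1.0 if index[label] == j else 0.0 for j in range(len(categories))]
            for label in labels]

# Toy labels using emotion names that appear in the sample rows
train_labels = ["anger", "gratitude", "neutral", "excitement", "neutral"]
val_labels = ["neutral", "anger"]

categories = fit_categories(train_labels)       # fitted once, on the training split
train_encoded = one_hot(train_labels, categories)
val_encoded = one_hot(val_labels, categories)   # same column order as training

print(categories)
print(val_encoded)
```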

In [6]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")



In [7]:
import torch
from torch.utils.data import Dataset
from sklearn.preprocessing import OneHotEncoder

class PyTorchDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels.values.reshape(-1, 1)
        
        # One-hot encode the labels
        self.encoder = OneHotEncoder()
        self.labels_encoded = self.encoder.fit_transform(self.labels).toarray() 
        
    def __len__(self):
        return len(self.inputs)
    
    def __getitem__(self, idx):
        sample = {
            'input_ids': torch.tensor(self.inputs[idx]),
            'labels': torch.tensor(self.labels_encoded[idx], dtype=torch.float32)  # Use float32 for binary labels
        }
        return sample

In [8]:
# Tokenize the training and validation sets
X_train_tokenized = tokenizer(X_train.tolist(), truncation=True, padding=True)
X_val_tokenized = tokenizer(X_val.tolist(), truncation=True, padding=True)

# Create PyTorch datasets
train_dataset = PyTorchDataset(X_train_tokenized["input_ids"], y_train)
test_dataset = PyTorchDataset(X_val_tokenized["input_ids"], y_val)
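
Because `padding=True` pads every sequence to the length of the longest one, the model needs a way to tell real tokens from padding. The tokenizer output already contains an `attention_mask` entry for this purpose, which the datasets above do not forward to the model. The sketch below builds such a mask by hand in plain Python to show what it looks like; the helper names and most token ids are made up for illustration, and the pad id of 0 is the `[PAD]` id in the `distilbert-base-uncased` vocabulary.

```python
PAD_ID = 0  # id of [PAD] in the distilbert-base-uncased vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

def attention_mask(padded, pad_id=PAD_ID):
    """1 marks a real token, 0 marks padding that the model should ignore."""
    return [[0 if tok == pad_id else 1 for tok in seq] for seq in padded]

# Illustrative token ids: 101 = [CLS], 102 = [SEP]; the rest are placeholders
batch = [[101, 2054, 2003, 102], [101, 4283, 102]]
input_ids = pad_batch(batch)
mask = attention_mask(input_ids)
print(input_ids)
print(mask)
```

In this notebook's setting, one option would be to return `X_train_tokenized["attention_mask"][idx]` under an `attention_mask` key from `__getitem__`, next to `input_ids`.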

# Training the DistilBERT model

In [9]:
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", 
    num_labels=23, 
    problem_type="multi_label_classification")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
!pip install transformers[torch]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [11]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
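    # eval_steps only takes effect with a step-based evaluation strategy; by default no evaluation runs during training
    eval_steps=10,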
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="logs",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

Detected kernel version 4.14.343, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
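
The `warmup_steps=500` setting means the learning rate is not constant during training. As a rough sketch, assuming the `Trainer` defaults of a peak learning rate of 5e-5 and a linear schedule (neither is set explicitly in this notebook), and using the 4038 total steps reported by `trainer.train()` further down, the rate at a given step could be computed as follows. The function name is hypothetical.

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=4038):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Learning rate at a few points of the run
for step in (10, 100, 500, 2269, 4038):
    print(step, linear_warmup_lr(step))
```

Under these assumptions, every logged step from 10 to 100 in the training-loss table falls inside the warmup phase, where the learning rate is still a small fraction of its peak.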


In [12]:
trainer.train()

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


| Step | Training Loss |
|---|---|
| 10 | 0.6911 |
| 20 | 0.6813 |
| 30 | 0.6617 |
| 40 | 0.6275 |
| 50 | 0.5747 |
| 60 | 0.5188 |
| 70 | 0.473 |
| 80 | 0.436 |
| 90 | 0.4036 |
| 100 | 0.376 |


TrainOutput(global_step=4038, training_loss=0.15103981555718607, metrics={'train_runtime': 1440.4944, 'train_samples_per_second': 179.28, 'train_steps_per_second': 2.803, 'total_flos': 5146785641771928.0, 'train_loss': 0.15103981555718607, 'epoch': 3.0})

In [13]:
model_path = os.path.join("./models", "distilbert_model")
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)

('./models/distilbert_model/tokenizer_config.json',
 './models/distilbert_model/special_tokens_map.json',
 './models/distilbert_model/vocab.txt',
 './models/distilbert_model/added_tokens.json',
 './models/distilbert_model/tokenizer.json')

# Model evaluation

In [14]:
# See loss
trainer.evaluate(test_dataset)

{'eval_loss': 0.1332094669342041,
 'eval_runtime': 26.362,
 'eval_samples_per_second': 816.401,
 'eval_steps_per_second': 12.784,
 'epoch': 3.0}

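The evaluation loss is relatively low, but on its own this says little about classification quality: with `problem_type="multi_label_classification"` the loss is a binary cross-entropy averaged over all 23 outputs, and for any given sample most of those outputs have a target of 0. Classification metrics are needed to judge the predictions.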
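
As a rough point of comparison for the evaluation loss, the sketch below computes the mean binary cross-entropy (the loss used for `multi_label_classification`) of two uninformative predictors: one that outputs 0.5 everywhere and one that always outputs the constant 1/23. It assumes exactly one positive label per sample and, as a simplification that does not hold for this dataset, equally frequent classes. The function name is hypothetical.

```python
import math

NUM_LABELS = 23  # matches num_labels in the model definition

def constant_prediction_bce(p, num_labels=NUM_LABELS):
    """Mean binary cross-entropy when every output predicts probability p
    and each sample has exactly one positive label out of num_labels."""
    positive_term = -math.log(p)        # loss on the single positive label
    negative_term = -math.log(1.0 - p)  # loss on each of the negative labels
    return (positive_term + (num_labels - 1) * negative_term) / num_labels

untrained = constant_prediction_bce(0.5)              # every logit at 0
prior_only = constant_prediction_bce(1 / NUM_LABELS)  # always predict the base rate
print(f"p=0.5: {untrained:.4f}, p=1/23: {prior_only:.4f}")
```

The first value is about 0.69, close to the loss logged at step 10, and the second is about 0.18. A loss below 0.2 is therefore reachable without telling the classes apart, so the 0.133 evaluation loss should be read alongside the classification metrics.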

In [15]:
# Predict the validation set and take the highest-scoring class per sample
output = np.argmax(trainer.predict(test_dataset).predictions, axis=1)

In [33]:
y_val_reshaped = y_val.values.reshape(-1, 1)

encoder = OneHotEncoder()
labels_encoded = encoder.fit_transform(y_val_reshaped).toarray() 

In [39]:
# Get the confusion matrix
cm = confusion_matrix(np.argmax(labels_encoded, axis=1), output)
cm

array([[ 579,   24,    8,    4,   27,    5,    4,    5,    4,    6,    4,
           9,    2,    9,    1,   42,   50,  117,  309,   24,    2,    3,
          41],
       [  31,  436,   11,   15,    7,    2,    3,    5,    1,    3,    3,
           4,    2,    5,    4,   11,   29,   15,  159,    5,    1,    4,
          10],
       [  12,   25,  193,   52,   11,    3,    6,    8,    5,    9,   24,
          30,    2,    3,    6,    5,    1,    4,  277,    5,    0,   17,
          10],
       [  30,   54,  109,  109,   25,   10,    8,   18,    7,   15,   44,
          46,   12,   12,   13,   12,   13,   18,  733,   10,    1,   23,
          14],
       [ 140,   42,   19,   30,  136,   28,   19,   13,   14,    8,   40,
          15,    2,    9,   18,   16,   21,   57, 1021,   31,    3,   11,
          14],
       [  17,    8,    3,    3,   11,   66,    3,    5,    4,    2,    5,
           3,    0,    1,    7,    7,    9,   13,  271,   35,    2,   16,
           4],
       [  20,   17,   
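
Each row of the confusion matrix corresponds to a true class and each column to a predicted class. As a small worked example, the sketch below takes the first row of the matrix printed above and derives the recall for class 0 and the column that receives most of its misclassified samples. The variable names are illustrative.

```python
# First row of the confusion matrix above (true class 0)
row0 = [579, 24, 8, 4, 27, 5, 4, 5, 4, 6, 4,
        9, 2, 9, 1, 42, 50, 117, 309, 24, 2, 3,
        41]
true_class = 0

support = sum(row0)                  # number of samples whose true class is 0
recall = row0[true_class] / support  # correctly predicted share of the row

# Most frequent wrong prediction for this class
errors = {j: n for j, n in enumerate(row0) if j != true_class}
top_confusion = max(errors, key=errors.get)

print(f"support={support}, recall={recall:.2f}, most confused with class {top_confusion}")
```

The support and recall obtained this way agree with the class 0 row of the classification report, and column 18 is also the largest off-diagonal entry in every other row shown, which suggests the model over-predicts that class.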

In [40]:
print(classification_report(np.argmax(labels_encoded, axis=1), output))

              precision    recall  f1-score   support

           0       0.36      0.45      0.40      1279
           1       0.42      0.57      0.48       766
           2       0.27      0.27      0.27       708
           3       0.22      0.08      0.12      1336
           4       0.25      0.08      0.12      1707
           5       0.27      0.13      0.18       495
           6       0.24      0.10      0.15       692
           7       0.30      0.20      0.24       739
           8       0.34      0.25      0.29       321
           9       0.23      0.07      0.11       752
          10       0.23      0.10      0.14      1182
          11       0.23      0.18      0.20       490
          12       0.24      0.09      0.13       220
          13       0.22      0.09      0.13       414
          14       0.26      0.31      0.29       228
          15       0.65      0.63      0.64       714
          16       0.30      0.24      0.27       611
          17       0.40    

The model reaches an average precision of 31% and a weighted F1-score of 30%, which indicates poor performance on the 23-class classification task. Several classes are classified especially poorly: in the visible part of the report, classes 3, 4, 9, 12, and 13 all have a recall below 0.10.

Because some classes have far fewer samples than others (support ranges from 220 to 1,707 in the rows shown), class imbalance may be hurting the model's performance.
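
To make the role of support concrete, the sketch below shows how a macro F1 and a support-weighted F1 are obtained from per-class values. It uses only the 17 classes whose rows are visible in the truncated report, so the results are illustrative and are not expected to reproduce the 30% figure quoted for all 23 classes.

```python
# Per-class F1 and support for classes 0-16, copied from the visible rows of the report
f1 = [0.40, 0.48, 0.27, 0.12, 0.12, 0.18, 0.15, 0.24, 0.29,
      0.11, 0.14, 0.20, 0.13, 0.13, 0.29, 0.64, 0.27]
support = [1279, 766, 708, 1336, 1707, 495, 692, 739, 321,
           752, 1182, 490, 220, 414, 228, 714, 611]

# Macro average: every class counts equally
macro_f1 = sum(f1) / len(f1)
# Weighted average: each class counts in proportion to its number of samples
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(f"macro F1 (17 classes): {macro_f1:.3f}")
print(f"weighted F1 (17 classes): {weighted_f1:.3f}")
print(f"support range: {min(support)} to {max(support)}")
```

Because the weighted average follows the large classes, a poorly classified class with many samples, such as class 4, pulls it down more than an equally poor class with few samples.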