# **Notebook objective**

Because the dataset is small, training a classical ML model led to considerable overfitting. Hyperparameter tuning reduced the overfitting to 4%, with an accuracy of 70%.

The goal of this notebook is to improve on those metrics by using a pretrained DL model that is less affected by the scarcity of training data.

# **Notebook contents**

In [1]:
# Basic libraries
import pandas as pd
import numpy as np
import random

# Data cleaning
import re

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# DL libraries
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

In [4]:
# Set seeds for reproducibility
np.random.seed(1)
tf.random.set_seed(1)
random.seed(1)


In [5]:
# Load the dataset

data = pd.read_csv('/content/youtoxic_english_1000.csv')

In [6]:
# Function to clean the text
def clean_text(text):
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs (http\S+ already covers https)
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses (and any other token containing '@')
    text = " ".join(text.split())  # Collapse extra whitespace last, so gaps left by the removals disappear
    return text


data['CleanedText'] = data['Text'].apply(clean_text)
data_cleaned = data.drop_duplicates(subset=['Text'], keep='first').reset_index(drop=True)
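
To make the cleaning step concrete, here is a minimal standalone sketch that applies the same two patterns to a made-up comment. The function name `clean_text_sketch` and the sample string are assumptions for illustration, not part of the dataset, and whitespace is collapsed last so that gaps left by removed tokens disappear.

```python
import re

URL_PATTERN = re.compile(r'http\S+|www\S+')   # same URL pattern as the cleaning cell
AT_PATTERN = re.compile(r'\S*@\S*\s?')        # any token containing '@'

def clean_text_sketch(text):
    """Hypothetical stand-in for the notebook's cleaning function."""
    text = URL_PATTERN.sub('', text)      # drop URLs
    text = AT_PATTERN.sub('', text)       # drop emails and @mentions
    return " ".join(text.split())         # collapse runs of whitespace

# Made-up sample, not taken from the dataset
sample = "Watch   this https://example.com/video now, or mail me@example.com  please"
cleaned = clean_text_sketch(sample)
print(cleaned)
```

Note that the `@` pattern matches any whitespace-delimited token containing `@`, so it strips `@mentions` as well as email addresses.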

In [7]:
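# Split the dataset (90% train, 10% validation)
# Note: this splits `data`, not `data_cleaned`, so any duplicate comments remain in the split
train_data, val_data = train_test_split(data, test_size=0.1, random_state=1)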


In [8]:
# Initialize the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize the text
train_encodings = tokenizer(list(train_data['CleanedText']), truncation=True, padding=True)
val_encodings = tokenizer(list(val_data['CleanedText']), truncation=True, padding=True)
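
The tokenizer call uses `truncation=True` (cut sequences that exceed the maximum length) and `padding=True` (pad every sequence to the longest one in the batch). The toy sketch below mimics that behaviour on plain lists of integers. It is not the Hugging Face implementation: `pad_and_truncate`, `max_length=4` and `pad_id=0` are assumptions for illustration, and the real tokenizer also adds special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy stand-in for truncation=True / padding=True on lists of token ids."""
    # Truncation: keep at most max_length ids per sequence
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend every sequence to the longest one in the batch
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids of different lengths
batch = pad_and_truncate([[7, 8, 9], [7], [1, 2, 3, 4, 5, 6]], max_length=4)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

The attention mask is what lets the model ignore padded positions, which is why the encodings are passed to the model as a dictionary rather than as bare id lists.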



In [9]:
# Convert the 'IsToxic' column to integers
train_labels = train_data['IsToxic'].astype(int).values
val_labels = val_data['IsToxic'].astype(int).values

In [10]:
# Convert encodings and labels to TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels))

In [11]:
# Load the pre-trained DistilBERT model
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [12]:
# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.00005)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
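
The loss is configured with `from_logits=True`, meaning it receives the model's raw scores and integer class labels rather than probabilities or one-hot vectors. As a rough illustration of the per-example value, here is a plain-Python sketch. It is not the Keras implementation, and the function name and logits are made up for the example.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Made-up logits in the order [Not Toxic, Toxic]
logits = [2.0, -1.0]
loss_right = sparse_ce_from_logits(logits, 0)  # true class is the high-scoring one
loss_wrong = sparse_ce_from_logits(logits, 1)  # true class is the low-scoring one
print(f"loss if label=0: {loss_right:.4f}")
print(f"loss if label=1: {loss_wrong:.4f}")
```

With its default settings, Keras averages these per-example values over the batch. A confident wrong prediction is penalised much more heavily than a confident correct one.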

In [13]:
# Train the model
batch_size = 8
epochs = 1
history = model.fit(
    # The dataset is already batched, so batch_size is not passed to fit()
    train_dataset.shuffle(10000, seed=1).batch(batch_size),
    epochs=epochs,
    validation_data=val_dataset.batch(batch_size)
)
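
The number of steps per epoch follows directly from the batch size. The sketch below works it out by hand, assuming 900 training rows and 100 validation rows. The 100 matches the support shown in the classification report of the evaluation cell; the 900 assumes a 1,000-row file, as the file name suggests. The helper `batch_sizes` is hypothetical.

```python
def batch_sizes(num_examples, batch_size):
    """Sizes of the batches produced when the final partial batch is kept."""
    full, remainder = divmod(num_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])

train_batches = batch_sizes(900, 8)  # assumed training set size
val_batches = batch_sizes(100, 8)    # validation size from the report's support

print(len(train_batches), "training steps, last batch has", train_batches[-1], "examples")
print(len(val_batches), "validation steps, last batch has", val_batches[-1], "examples")
```

`tf.data.Dataset.batch` keeps the final partial batch unless `drop_remainder=True` is set, which is the behaviour mirrored here.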




In [14]:
# Evaluate the model on the training and validation sets
train_results = model.evaluate(train_dataset.batch(batch_size), return_dict=True)
val_results = model.evaluate(val_dataset.batch(batch_size), return_dict=True)

# Display training and validation results
print(f"Training Loss: {train_results['loss']}, Training Accuracy: {train_results['accuracy']}")
print(f"Validation Loss: {val_results['loss']}, Validation Accuracy: {val_results['accuracy']}")

# Predicting on the validation set
val_preds = model.predict(val_dataset.batch(batch_size))
val_preds_labels = np.argmax(val_preds.logits, axis=1)

# Confusion Matrix and Classification Report
cm = confusion_matrix(val_labels, val_preds_labels)
report = classification_report(val_labels, val_preds_labels, target_names=['Not Toxic', 'Toxic'])

print("Confusion Matrix:\n", cm)
print("Classification Report:\n", report)

# Overfitting estimate: training accuracy minus validation accuracy, in percentage points
overfitting_level = (train_results['accuracy'] - val_results['accuracy']) * 100
print(f"This model has an overfitting of {overfitting_level:.2f}%")

Training Loss: 0.3827913701534271, Training Accuracy: 0.8633333444595337
Validation Loss: 0.4181479215621948, Validation Accuracy: 0.8199999928474426
Confusion Matrix:
 [[47  5]
 [13 35]]
Classification Report:
               precision    recall  f1-score   support

   Not Toxic       0.78      0.90      0.84        52
       Toxic       0.88      0.73      0.80        48

    accuracy                           0.82       100
   macro avg       0.83      0.82      0.82       100
weighted avg       0.83      0.82      0.82       100

This model has an overfitting of 4.33%
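
The figures in the classification report can be re-derived from the confusion matrix alone. The sketch below does this in plain Python, using the matrix printed above (rows are true classes, columns are predicted classes). The helper `per_class_metrics` is introduced here for illustration.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[47, 5],    # true Not Toxic
      [13, 35]]   # true Toxic

def per_class_metrics(cm, cls):
    """Precision, recall and F1 for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    fp = sum(row[cls] for row in cm) - tp   # predicted as cls but belonging elsewhere
    fn = sum(cm[cls]) - tp                  # belonging to cls but predicted elsewhere
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for name, cls in [("Not Toxic", 0), ("Toxic", 1)]:
    p, r, f = per_class_metrics(cm, cls)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Read this way, the weakest figure is the recall for the Toxic class: 13 of the 48 toxic comments in the validation set are classified as Not Toxic.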


In [15]:
# Save the model
model.save("BERT_model")

