<header style="width:100%;position:relative">
  <div style="width:80%;float:right;">
    <h1>False Political Claim Detection</h1>
    <h3>Feature selection and preprocessing</h3>
    <h5>Group 2</h5>
  </div>
        <img style="width:15%;" src="images/logo.jpg" alt="UPM" />
</header>

# Índice

1. [Importar librerias](#1.-Importar-librerias)
2. [Variables globales y funciones auxiliares](#2.-Variables-globales-y-funciones-auxiliares)
3. [Carga del dataframe](#3.-Carga-del-dataframe)
4. [Análisis de las características](#4.-Analisis-de-las-caracteristicas)
5. [Entrenamiento de un modelo fijo con varios casos de preprocesado](#5.-Entrenamiento-de-un-modelo-fijo-con-varios-casos-de-preprocesado)
    * 5.1 [Caso 1: Statement](#5.1-Caso-1:-Statement)
    * 5.2 [Caso 2: Statement y Speaker](#5.2-Caso-2:-Statement-y-Speaker)
    * 5.3 [Caso 3: Statement, Speaker y Subject](#5.3-Caso-3:-Statement,-Speaker-y-Subject)
    * 5.4 [Caso 4: Statement, Speaker y Party](#5.4-Caso-4:-Statement,-Speaker-y-Party)
    * 5.5 [Caso 5: Statement, Speaker, Party y Subject](#5.5-Caso-5:-Statement,-Speaker,-Party-y-Subject)
    * 5.6 [Caso 6: Statement, Speaker, Party y Speaker Job](#5.6-Caso-6:-Statement,-Speaker,-Party-y-Speaker-Job)
    * 5.7 [Caso 7: Statement, Speaker, Party, Speaker Job y Subject](#5.7-Caso-7:-Statement,-Speaker,-Party,-Speaker-Job-y-Subject)
6. [Entrenamiento de un caso fijo con varios modelos](#6.-Entrenamiento-de-un-caso-fijo-con-varios-modelos)
    * 6.1 [RoBERTa](#6.1-RoBERTa)
    * 6.2 [BERT](#6.2-BERT)
    * 6.3 [ALBERT](#6.3-ALBERT)
    * 6.4 [DeBERTaV3](#6.4-DeBERTaV3)
7. [Conclusiones generales](#7.-Conclusiones-generales)
8. [Referencias](#8.-Referencias)

## 1. Importar librerias

In [1]:
# General import and load data
import pandas as pd
import numpy as np

# Hugging Face
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

# Splitting
from sklearn.model_selection import train_test_split

# Evaluation
from sklearn.metrics import accuracy_score, f1_score

print("All libraries were imported successfully.")

All libraries were imported successfully.


## 2. Variables globales y funciones auxiliares

A single *seed* is set for the whole notebook so that random operations are repeatable and the results can be reproduced.

In [2]:
seed = 42
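A fixed seed makes pseudo-random operations repeatable. The sketch below illustrates the idea using only Python's standard `random` module; the helper `shuffled_indices` is hypothetical and does not reproduce how `train_test_split` shuffles internally.

```python
import random

SEED = 42  # same value as the notebook's `seed`


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in a pseudo-random order determined by `seed`."""
    rng = random.Random(seed)  # private generator, global state is untouched
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, SEED)
second = shuffled_indices(10, SEED)

print(first == second)                   # same seed, same order
print(sorted(first) == list(range(10)))  # still a permutation of the indices
```

Passing the same `random_state` to `train_test_split` plays the same role: every case below is split on identical rows, so the cases can be compared fairly.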

Helper function that selects the text column (the *statement* variant) used to train the model.

In [3]:
def selStatement(case, df):
    mapping = {
        1: 'statement',
        2: 'speaker_statement',
        3: 'speaker_statement_subject',
        4: 'speaker_party_statement',
        5: 'speaker_party_statement_subject',
        6: 'speaker_speaker-job_party_statement',
        7: 'speaker_speaker-job_party_statement_subject'
    }

    if case not in mapping:
        raise ValueError("Invalid value for 'case'. It must be between 1 and 7.")

    return df[mapping[case]]

Helper function that maps a short model key to the *Hugging Face* checkpoint used for training.

In [4]:
def selModel(model):
    mapping = {
        'albert': 'albert/albert-base-v2',
        'bert': 'google-bert/bert-base-uncased',
        'deberta': 'microsoft/deberta-v3-base',
        'distil_roberta': 'distilbert/distilroberta-base',
        'roberta': 'FacebookAI/roberta-base'
    }

    if model not in mapping:
        raise ValueError("The selected model is not among the available ones. Check the value provided.")

    return mapping[model]

Function that computes the evaluation metrics (accuracy and macro-averaged F1) for the *Hugging Face* model.

In [5]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="macro")
    }
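To make the `average="macro"` choice concrete, here is a minimal pure-Python sketch that computes accuracy and macro-F1 by hand. The labels are invented for illustration and the helper names are hypothetical; the notebook itself relies on `accuracy_score` and `f1_score` from scikit-learn.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if 2 * tp + fp + fn == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, invented for illustration: the model over-predicts class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print("accuracy:", accuracy(y_true, y_pred))
print("F1 class 0:", f1_for_class(y_true, y_pred, 0))
print("F1 class 1:", f1_for_class(y_true, y_pred, 1))
print("macro F1:", macro_f1(y_true, y_pred))
```

In this toy case accuracy is 0.7 while macro-F1 is about 0.6, because the weak class 0 score (0.4) counts as much as the class 1 score (0.8). Macro-F1 therefore penalises a model that favours one label, which accuracy alone can hide.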

Class that wraps the tokenized texts and their labels as a *dataset*, used for training and evaluation.

In [6]:
class OwnDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

An equivalent class without labels is used for the *Kaggle* test set.

In [7]:
class KaggleDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings['input_ids'])
    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


## 3. Carga del dataframe

The training data are loaded from *formated/train_hugging.csv* (read here from the copy under `/content/`). This file had already been preprocessed by one of our teammates.

In [8]:
url = "/content/train_hugging.csv"
df = pd.read_csv(url)

print("Data loaded successfully\n")

Data loaded successfully



We also load the test set whose labels must be predicted for Kaggle, from *formated/test_hugging.csv*.

In [9]:
url = "/content/test_hugging.csv"
df_test = pd.read_csv(url)

print("Test set loaded successfully\n")

Test set loaded successfully



## 4. Analisis de las caracteristicas

pandas is configured to display all columns, and *head* is then called to inspect their contents.

In [10]:
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,id,label,statement,speaker_statement,speaker_statement_subject,speaker_party_statement,speaker_party_statement_subject,speaker_speaker-job_party_statement,speaker_speaker-job_party_statement_subject
0,81f884c64a7,1,China is in the South China Sea and (building)...,"donald-trump said: ""China is in the South Chin...","donald-trump said: ""China is in the South Chin...",donald-trump (affiliated with the republican p...,donald-trump (affiliated with the republican p...,donald-trump (President-Elect) (affiliated wit...,donald-trump (President-Elect) (affiliated wit...
1,30c2723a188,0,With the resources it takes to execute just ov...,"chris-dodd said: ""With the resources it takes ...","chris-dodd said: ""With the resources it takes ...",chris-dodd (affiliated with the democrat party...,chris-dodd (affiliated with the democrat party...,chris-dodd (U.S. senator) (affiliated with the...,chris-dodd (U.S. senator) (affiliated with the...
2,6936b216e5d,0,The (Wisconsin) governor has proposed tax give...,"donna-brazile said: ""The (Wisconsin) governor ...","donna-brazile said: ""The (Wisconsin) governor ...",donna-brazile (affiliated with the democrat pa...,donna-brazile (affiliated with the democrat pa...,donna-brazile (Political commentator) (affilia...,donna-brazile (Political commentator) (affilia...
3,b5cd9195738,1,Says her representation of an ex-boyfriend who...,"rebecca-bradley said: ""Says her representation...","rebecca-bradley said: ""Says her representation...",rebecca-bradley (affiliated with the none part...,rebecca-bradley (affiliated with the none part...,rebecca-bradley (affiliated with the none part...,rebecca-bradley (affiliated with the none part...
4,84f8dac7737,0,At protests in Wisconsin against proposed coll...,"republican-party-wisconsin said: ""At protests ...","republican-party-wisconsin said: ""At protests ...",republican-party-wisconsin (affiliated with th...,republican-party-wisconsin (affiliated with th...,republican-party-wisconsin (affiliated with th...,republican-party-wisconsin (affiliated with th...


The column names are listed for reference.

In [11]:
df.columns

Index(['id', 'label', 'statement', 'speaker_statement',
       'speaker_statement_subject', 'speaker_party_statement',
       'speaker_party_statement_subject',
       'speaker_speaker-job_party_statement',
       'speaker_speaker-job_party_statement_subject'],
      dtype='object')

The columns are written out as a Python list to make them easier to reuse, grouped by category.

In [12]:
all_features = [
    # Identifier
    'id',

    # Target label
    'label',

    # Original text
    'statement',

    # Original text + Speaker
    'speaker_statement',

    # Original text + Speaker + Subject
    'speaker_statement_subject',

    # Original text + Speaker + Party affiliation
    'speaker_party_statement',

    # Original text + Speaker + Party affiliation + Subject
    'speaker_party_statement_subject',

    # Original text + Speaker + Party affiliation + Speaker Job
    'speaker_speaker-job_party_statement',

    # Original text + Speaker + Party affiliation + Speaker Job + Subject
    'speaker_speaker-job_party_statement_subject'
]

As shown, the columns combine the available fields progressively. We also avoided removing words from the original text, since *Hugging Face* models already process the raw text internally through their own *tokenizer*. Each column represents one training case. An example of each column is shown below:

* **Caso 1 -** ***statement***

In [13]:
df['statement'].iloc[0]

'China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen.'

* **Caso 2 -** ***speaker_statement***

In [14]:
df['speaker_statement'].iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Caso 3 -** ***speaker_statement_subject***

In [15]:
df['speaker_statement_subject'].iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

* **Caso 4 -** ***speaker_party_statement***

In [16]:
df['speaker_party_statement'].iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Caso 5 -** ***speaker_party_statement_subject***

In [17]:
df['speaker_party_statement_subject'].iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

* **Caso 6 -** ***speaker_speaker-job_party_statement***

In [18]:
df['speaker_speaker-job_party_statement'].iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Caso 7 -** ***speaker_speaker-job_party_statement_subject***

In [19]:
df['speaker_speaker-job_party_statement_subject'].iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'
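The seven formats follow a common template. As a rough reconstruction inferred only from the examples above (the actual preprocessing was done elsewhere and is not shown in this notebook), the texts could be assembled as follows. The function `build_text` and its parameters are hypothetical.

```python
def build_text(case, speaker, statement, party=None, job=None, subject=None):
    """Assemble the input text for training cases 1-7 (hypothetical helper)."""
    if case not in range(1, 8):
        raise ValueError("case must be between 1 and 7")
    if case == 1:
        return statement

    prefix = speaker
    if case in (6, 7) and job:
        # Assumption: the job is simply omitted when it is missing,
        # as the rebecca-bradley row in df.head() suggests.
        prefix += f" ({job})"
    if case >= 4:
        prefix += f" (affiliated with the {party} party)"

    text = f'{prefix} said: "{statement}"'
    if case in (3, 5, 7):
        text += f" The statement concerns the topics of: {subject}."
    return text


example = dict(
    speaker="donald-trump",
    statement="China is in the South China Sea and (building)a military "
              "fortress the likes of which perhaps the world has not seen.",
    party="republican",
    job="President-Elect",
    subject="china,foreign-policy,military",
)

for case in range(1, 8):
    print(case, build_text(case, **example))
```

Writing the metadata as a natural-language sentence, rather than as separate columns, lets a single text classifier consume it without any change to the model architecture.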

## 5. Entrenamiento de un modelo fijo con varios casos de preprocesado

The goal of this section is to determine which of the preprocessing cases presented above works best. To do so, a single fixed model is used for every case: "distilbert/distilroberta-base".

### 5.1 Caso 1: Statement

#### 5.1.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that holds the data. The same case is selected from the Kaggle test set as X_kaggle.

In [None]:
X =  selStatement(1, df)
y = df['label']

X_kaggle = selStatement(1, df_test)

The first row is inspected with *iloc* as a quick check that the correct training case was selected.

In [None]:
X.iloc[0]

'China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen.'

In [None]:
X_kaggle.iloc[0]

'Five members of [the Common Cause Georgia] board accepted maximum campaign contributions.'

The data are split into training (80%) and test (20%) sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 5.1.2 Training

We select the model with the *selModel* function and the key *'distil_roberta'*.

In [None]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We load the model's tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [None]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)
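With `truncation=True` and `padding=True`, each call truncates long inputs and pads the rest to the longest sequence in that call. The toy function below mimics that behaviour on plain lists of token ids. It is only an illustration: `pad_and_truncate`, the `pad_id` value and the `max_length` of 4 are assumptions, and real tokenizers also handle special tokens, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy version of truncation followed by padding to the longest sequence.

    `batch` is a list of token-id lists; `pad_id` is an arbitrary placeholder.
    """
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14, 15]]
encoded = pad_and_truncate(toy_batch, max_length=4)

print(encoded["input_ids"])       # [[5, 6, 7, 0], [8, 9, 0, 0], [10, 11, 12, 13]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```

Because the training, test and Kaggle texts are tokenized in three separate calls, each set is padded to its own longest sequence, so the three sets need not share the same padded length.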

We wrap the tokenized texts in *dataset* objects using the *OwnDataset* and *KaggleDataset* classes (the only difference is that *KaggleDataset* is built without *labels*).

In [None]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results_caso1",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

We train the model.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.606371,0.673184,0.557581
2,0.625400,0.631998,0.648045,0.61223
3,0.571800,0.683536,0.655307,0.621772


TrainOutput(global_step=1344, training_loss=0.5739225773584276, metrics={'train_runtime': 1016.7765, 'train_samples_per_second': 21.126, 'train_steps_per_second': 1.322, 'total_flos': 2845399723130880.0, 'train_loss': 0.5739225773584276, 'epoch': 3.0})

We evaluate the model with the metrics defined earlier.

In [None]:
trainer.evaluate()

{'eval_loss': 0.6835362315177917,
 'eval_accuracy': 0.6553072625698324,
 'eval_f1': 0.621771972776815,
 'eval_runtime': 4.3383,
 'eval_samples_per_second': 412.605,
 'eval_steps_per_second': 12.908,
 'epoch': 3.0}
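The runtime and throughput figures reported for case 1 can be cross-checked against the 80/20 split. The sketch below recovers the approximate sample counts from those figures. It assumes a single device and no gradient accumulation, and the recovered counts are estimates rather than values read from the data.

```python
import math

# Figures reported by trainer.train() and trainer.evaluate() for case 1.
train_runtime = 1016.7765
train_samples_per_second = 21.126
global_step = 1344
epochs = 3
batch_size = 16  # per_device_train_batch_size

eval_runtime = 4.3383
eval_samples_per_second = 412.605

# Samples per epoch, recovered from throughput (approximate by nature).
train_samples = round(train_runtime * train_samples_per_second / epochs)
eval_samples = round(eval_runtime * eval_samples_per_second)

# The last batch of an epoch may be incomplete, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / batch_size)

print(train_samples, eval_samples)
print(steps_per_epoch * epochs == global_step)
print(eval_samples / (train_samples + eval_samples))
```

Under these assumptions the estimate is 7160 training and 1790 evaluation samples, which gives 448 steps per epoch (1344 in total) and a test share of 0.2, in line with `test_size=0.2`.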

#### 5.1.3 Exporting the CSV

The predictions for the Kaggle test set are exported to a CSV file so that they can be stored and evaluated by others.

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)
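The call to `argmax(-1)` turns each row of logits into the index of its largest value, which is the predicted label. A pure-Python equivalent on invented logits is sketched below; `argmax_last_axis` is a hypothetical helper written for this illustration only.

```python
def argmax_last_axis(rows):
    """Index of the largest value in each row (first index wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Toy logits for three statements and two classes, invented for illustration.
toy_logits = [[1.2, -0.3], [-0.8, 0.4], [0.1, 0.1]]

print(argmax_last_axis(toy_logits))  # [0, 1, 0]
```

No softmax is needed before this step, since softmax preserves the ordering of the logits within a row.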

In [None]:
df_label_caso1 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso1 = pd.concat([df_id, df_label_caso1], axis=1)

In [None]:
df_final_caso1['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,2413
0,1423


In [None]:
df_final_caso1.to_csv('./submit/distilroberta_caso1.csv', index=False)

### 5.2 Caso 2: Statement y Speaker

#### 5.2.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that holds the data. The same case is selected from the Kaggle test set as X_kaggle.

In [None]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

The first row is inspected with *iloc* as a quick check that the correct training case was selected.

In [None]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [None]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

The data are split into training (80%) and test (20%) sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 5.2.2 Training

We select the model with the *selModel* function and the key *'distil_roberta'*.

In [None]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We load the model's tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [None]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized texts in *dataset* objects using the *OwnDataset* and *KaggleDataset* classes (the only difference is that *KaggleDataset* is built without *labels*).

In [None]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results_caso2",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

We train the model.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.603201,0.672626,0.573186
2,0.625600,0.624444,0.648045,0.613378
3,0.568200,0.672483,0.641341,0.616682


TrainOutput(global_step=1344, training_loss=0.5724630128769648, metrics={'train_runtime': 1040.3146, 'train_samples_per_second': 20.648, 'train_steps_per_second': 1.292, 'total_flos': 2845399723130880.0, 'train_loss': 0.5724630128769648, 'epoch': 3.0})

We evaluate the model with the metrics defined earlier.

In [None]:
trainer.evaluate()

{'eval_loss': 0.6724834442138672,
 'eval_accuracy': 0.641340782122905,
 'eval_f1': 0.6166824105799097,
 'eval_runtime': 4.6029,
 'eval_samples_per_second': 388.884,
 'eval_steps_per_second': 12.166,
 'epoch': 3.0}

#### 5.2.3 Exporting the CSV

The predictions for the Kaggle test set are exported to a CSV file so that they can be stored and evaluated by others.

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [None]:
df_label_caso2 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso2 = pd.concat([df_id, df_label_caso2], axis=1)

In [None]:
df_final_caso2['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,2316
0,1520


In [None]:
df_final_caso2.to_csv('./submit/distilroberta_caso2.csv', index=False)

### 5.3 Caso 3: Statement, Speaker y Subject

#### 5.3.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that holds the data. The same case is selected from the Kaggle test set as X_kaggle.

In [None]:
X =  selStatement(3, df)
y = df['label']

X_kaggle = selStatement(3, df_test)

The first row is inspected with *iloc* as a quick check that the correct training case was selected.

In [None]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

In [None]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions." The statement concerns the topics of: campaign-finance,ethics,government-regulation.'

The data are split into training (80%) and test (20%) sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 5.3.2 Training

We select the model with the *selModel* function and the key *'distil_roberta'*.

In [None]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We load the model's tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [None]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized texts in *dataset* objects using the *OwnDataset* and *KaggleDataset* classes (the only difference is that *KaggleDataset* is built without *labels*).

In [None]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results_caso3",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

We train the model.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.605667,0.671508,0.575288
2,0.628100,0.620204,0.646927,0.612718
3,0.572400,0.665811,0.641341,0.618052


TrainOutput(global_step=1344, training_loss=0.5774543853033156, metrics={'train_runtime': 1047.8118, 'train_samples_per_second': 20.5, 'train_steps_per_second': 1.283, 'total_flos': 2845399723130880.0, 'train_loss': 0.5774543853033156, 'epoch': 3.0})

We evaluate the model with the metrics defined earlier.

In [None]:
trainer.evaluate()

{'eval_loss': 0.6658113598823547,
 'eval_accuracy': 0.641340782122905,
 'eval_f1': 0.6180522319007051,
 'eval_runtime': 5.1125,
 'eval_samples_per_second': 350.123,
 'eval_steps_per_second': 10.954,
 'epoch': 3.0}

#### 5.3.3 Exporting the CSV

The predictions for the Kaggle test set are exported to a CSV file so that they can be stored and evaluated by others.

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [None]:
df_label_caso3 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso3 = pd.concat([df_id, df_label_caso3], axis=1)

In [None]:
df_final_caso3['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,2251
0,1585


In [None]:
df_final_caso3.to_csv('./submit/distilroberta_caso3.csv', index=False)

### 5.4 Caso 4: Statement, Speaker y Party

#### 5.4.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that holds the data. The same case is selected from the Kaggle test set as X_kaggle.

In [None]:
X =  selStatement(4, df)
y = df['label']

X_kaggle = selStatement(4, df_test)

The first row is inspected with *iloc* as a quick check that the correct training case was selected.

In [None]:
X.iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [None]:
X_kaggle.iloc[0]

'kasim-reed (affiliated with the democrat party) said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

The data are split into training (80%) and test (20%) sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 5.4.2 Training

We select the model with the *selModel* function and the key *'distil_roberta'*.

In [None]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We load the model's tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [None]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized texts in *dataset* objects using the *OwnDataset* and *KaggleDataset* classes (the only difference is that *KaggleDataset* is built without *labels*).

In [None]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results_caso4",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

We train the model.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.611783,0.662011,0.599538
2,0.625900,0.61726,0.658659,0.620291
3,0.571900,0.655361,0.638547,0.609654
4,0.489400,0.699826,0.651955,0.617251


TrainOutput(global_step=1792, training_loss=0.5446085163525173, metrics={'train_runtime': 1382.5259, 'train_samples_per_second': 20.716, 'train_steps_per_second': 1.296, 'total_flos': 3793866297507840.0, 'train_loss': 0.5446085163525173, 'epoch': 4.0})

We evaluate the model with the metrics defined earlier.

In [None]:
trainer.evaluate()

{'eval_loss': 0.6172598004341125,
 'eval_accuracy': 0.658659217877095,
 'eval_f1': 0.6202912226651099,
 'eval_runtime': 4.5452,
 'eval_samples_per_second': 393.824,
 'eval_steps_per_second': 12.321,
 'epoch': 4.0}
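The figures returned by `evaluate()` for this case (loss 0.6173, F1 0.6203) match epoch 2 of the training log rather than the final epoch. This is consistent with `load_best_model_at_end=True` combined with `metric_for_best_model="f1"`, which restores the checkpoint with the highest validation F1. A short sketch of that selection, using the values from the log above:

```python
# Per-epoch validation results for case 4, copied from the training log.
history = [
    {"epoch": 1, "eval_loss": 0.611783, "accuracy": 0.662011, "f1": 0.599538},
    {"epoch": 2, "eval_loss": 0.617260, "accuracy": 0.658659, "f1": 0.620291},
    {"epoch": 3, "eval_loss": 0.655361, "accuracy": 0.638547, "f1": 0.609654},
    {"epoch": 4, "eval_loss": 0.699826, "accuracy": 0.651955, "f1": 0.617251},
]

# greater_is_better=True with metric_for_best_model="f1": keep the highest F1.
best_by_f1 = max(history, key=lambda row: row["f1"])

# For comparison: what selecting on validation loss would have kept.
best_by_loss = min(history, key=lambda row: row["eval_loss"])

print(best_by_f1["epoch"])    # 2
print(best_by_loss["epoch"])  # 1
```

The two criteria disagree here: the validation loss is lowest after the first epoch, while F1 peaks at the second. The choice of selection metric therefore changes which checkpoint is used for the Kaggle predictions.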

#### 5.4.3 Exporting the CSV

The predictions for the Kaggle test set are exported to a CSV file so that they can be stored and evaluated by others.

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [None]:
df_label_caso4 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso4 = pd.concat([df_id, df_label_caso4], axis=1)

In [None]:
df_final_caso4['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,2519
0,1317


In [None]:
df_final_caso4.to_csv('./submit/distilroberta_caso4.csv', index=False)

### 5.5 Caso 5: Statement, Speaker, Party y Subject

#### 5.5.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that holds the data. The same case is selected from the Kaggle test set as X_kaggle.

In [None]:
X =  selStatement(5, df)
y = df['label']

X_kaggle = selStatement(5, df_test)

The first row is inspected with *iloc* as a quick check that the correct training case was selected.

In [None]:
X.iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

In [None]:
X_kaggle.iloc[0]

'kasim-reed (affiliated with the democrat party) said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions." The statement concerns the topics of: campaign-finance,ethics,government-regulation.'

We split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 5.5.2 Training

We select the model with the *selModel* function and the parameter *'distil_roberta'*.

In [None]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We select the model's tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [None]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized data in datasets with *OwnDataset* and *KaggleDataset* (the difference between them is that *KaggleDataset* is built without *labels*).

In [None]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results_caso5",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.609181 | 0.668156 | 0.582708 |
| 2 | 0.626700 | 0.621496 | 0.648045 | 0.623389 |
| 3 | 0.573000 | 0.648106 | 0.646369 | 0.621247 |


TrainOutput(global_step=1344, training_loss=0.577937954948062, metrics={'train_runtime': 1039.2381, 'train_samples_per_second': 20.669, 'train_steps_per_second': 1.293, 'total_flos': 2845399723130880.0, 'train_loss': 0.577937954948062, 'epoch': 3.0})
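
The *global_step=1344* reported above can be checked by hand. The sketch below assumes a training split of 7160 rows (a figure inferred from the logged throughput, not printed by the notebook), a single GPU, and the Trainer default of logging the training loss every 500 steps.

```python
import math

# Assumption: 7160 training rows, inferred from the logged throughput
# (20.669 samples/s * 1039.2381 s / 3 epochs); the notebook does not print it.
train_rows = 7160
batch_size = 16       # per_device_train_batch_size, single GPU assumed
epochs = 3            # num_train_epochs
logging_steps = 500   # Trainer default when logging_steps is not set

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
# Epoch during which the first training-loss value is logged
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)
```

Under these assumptions an epoch has fewer than 500 steps, so no training loss has been logged when epoch 1 ends, which is consistent with the *No log* entry in the first row.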

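We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values correspond to the checkpoint with the best F1 (epoch 2 in the training log above), even though *'epoch'* reports 3.0.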

In [None]:
trainer.evaluate()

{'eval_loss': 0.6214957237243652,
 'eval_accuracy': 0.6480446927374302,
 'eval_f1': 0.6233889583533712,
 'eval_runtime': 5.2336,
 'eval_samples_per_second': 342.023,
 'eval_steps_per_second': 10.7,
 'epoch': 3.0}
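
With *load_best_model_at_end=True* and *metric_for_best_model="f1"*, the checkpoint kept after training is the one with the highest validation F1, not the last one. A minimal sketch of that selection rule, using the per-epoch values logged for this case (the *history* list and the *best* variable are illustrative and are not Trainer internals):

```python
# Per-epoch validation metrics copied from the training log of case 5
history = [
    {"epoch": 1, "eval_loss": 0.609181, "accuracy": 0.668156, "f1": 0.582708},
    {"epoch": 2, "eval_loss": 0.621496, "accuracy": 0.648045, "f1": 0.623389},
    {"epoch": 3, "eval_loss": 0.648106, "accuracy": 0.646369, "f1": 0.621247},
]

# greater_is_better=True: keep the epoch with the largest F1
best = max(history, key=lambda row: row["f1"])
print(best["epoch"], best["eval_loss"], best["f1"])
```

Epoch 1 has the lowest loss and the highest accuracy, but epoch 2 has the highest F1, so the choice of metric changes which checkpoint is evaluated and used for the predictions.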

The predictions can also be saved so that others can evaluate them.

#### 5.5.3 Export CSV

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [None]:
df_label_caso5 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso5 = pd.concat([df_id, df_label_caso5], axis=1)

In [None]:
df_final_caso5['label'].value_counts()

| label | count |
|---|---|
| 1 | 2319 |
| 0 | 1517 |


In [None]:
df_final_caso5.to_csv('./submit/distilroberta_caso5.csv', index=False)
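
The exported file has two columns, *id* and *label*, paired by row position and written without the dataframe index. As a rough illustration of the resulting layout, here is a sketch with the standard *csv* module; the ids and labels are made up, and the notebook itself uses pandas for this step.

```python
import csv
import io

ids = ["a1", "b2", "c3"]        # hypothetical ids
predicted_labels = [1, 0, 1]    # hypothetical predictions

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "label"])                 # header row
writer.writerows(zip(ids, predicted_labels))     # one row per statement
submission_text = buffer.getvalue()
print(submission_text)
```

Pairing by position only works if both sequences are in the same order, which is why the notebook resets the index of *df_test['id']* before concatenating.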

### 5.6 Case 6: Statement, Speaker, Party and Speaker Job

#### 5.6.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes as parameters the selected training case and the *dataframe* that holds the data.

In [None]:
X =  selStatement(6, df)
y = df['label']

X_kaggle = selStatement(6, df_test)

We inspect the first row with *iloc* as a quick check that the correct training case was selected.

In [None]:
X.iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [None]:
X_kaggle.iloc[0]

'kasim-reed (affiliated with the democrat party) said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

We split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 5.6.2 Training

We select the model with the *selModel* function and the parameter *'distil_roberta'*.

In [None]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We select the model's tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [None]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized data in datasets with *OwnDataset* and *KaggleDataset* (the difference between them is that *KaggleDataset* is built without *labels*).

In [None]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results_caso6",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.609073 | 0.67095 | 0.584363 |
| 2 | 0.622800 | 0.614569 | 0.662011 | 0.630465 |
| 3 | 0.566300 | 0.657541 | 0.657542 | 0.629432 |


TrainOutput(global_step=1344, training_loss=0.5704996699378604, metrics={'train_runtime': 1028.557, 'train_samples_per_second': 20.884, 'train_steps_per_second': 1.307, 'total_flos': 2845399723130880.0, 'train_loss': 0.5704996699378604, 'epoch': 3.0})

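We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values correspond to the checkpoint with the best F1 (epoch 2 in the training log above), even though *'epoch'* reports 3.0.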

In [None]:
trainer.evaluate()

{'eval_loss': 0.6145690679550171,
 'eval_accuracy': 0.6620111731843575,
 'eval_f1': 0.6304645067462962,
 'eval_runtime': 5.235,
 'eval_samples_per_second': 341.931,
 'eval_steps_per_second': 10.697,
 'epoch': 3.0}

The predictions can also be saved so that others can evaluate them.

#### 5.6.3 Export CSV

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)
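
*predictions.predictions* holds one row of logits per statement and one column per class, and *argmax(-1)* picks the column with the largest value in each row. A small pure-Python sketch of the same operation, with made-up logits:

```python
# Hypothetical logits: one row per statement, one column per class (0 and 1)
logits = [
    [0.2, 1.3],
    [1.1, -0.4],
    [-0.7, 0.9],
]

# Index of the largest value in each row, equivalent to argmax over the last axis
predicted_labels = [max(range(len(row)), key=row.__getitem__) for row in logits]
print(predicted_labels)
```

No softmax is needed before this step, because the largest logit always corresponds to the largest probability.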

In [None]:
df_label_caso6 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso6 = pd.concat([df_id, df_label_caso6], axis=1)

In [None]:
df_final_caso6['label'].value_counts()

| label | count |
|---|---|
| 1 | 2419 |
| 0 | 1417 |


In [None]:
df_final_caso6.to_csv('./submit/distilroberta_caso6.csv', index=False)

### 5.7 Case 7: Statement, Speaker, Party, Speaker Job and Subject

#### 5.7.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes as parameters the selected training case and the *dataframe* that holds the data.

In [None]:
X =  selStatement(7, df)
y = df['label']

X_kaggle = selStatement(7, df_test)

We inspect the first row with *iloc* as a quick check that the correct training case was selected.

In [None]:
X.iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

In [None]:
X_kaggle.iloc[0]

'kasim-reed (affiliated with the democrat party) said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions." The statement concerns the topics of: campaign-finance,ethics,government-regulation.'

We split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 5.7.2 Training

We select the model with the *selModel* function and the parameter *'distil_roberta'*.

In [None]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We select the model's tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [None]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized data in datasets with *OwnDataset* and *KaggleDataset* (the difference between them is that *KaggleDataset* is built without *labels*).

In [None]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results_caso7",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.613306 | 0.675978 | 0.559656 |
| 2 | 0.625100 | 0.618718 | 0.656983 | 0.623473 |
| 3 | 0.572700 | 0.657804 | 0.645251 | 0.616894 |


TrainOutput(global_step=1344, training_loss=0.5765270165034703, metrics={'train_runtime': 1047.2704, 'train_samples_per_second': 20.51, 'train_steps_per_second': 1.283, 'total_flos': 2845399723130880.0, 'train_loss': 0.5765270165034703, 'epoch': 3.0})

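We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values correspond to the checkpoint with the best F1 (epoch 2 in the training log above), even though *'epoch'* reports 3.0.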

In [None]:
trainer.evaluate()

{'eval_loss': 0.6187180876731873,
 'eval_accuracy': 0.6569832402234637,
 'eval_f1': 0.6234734205246829,
 'eval_runtime': 5.7012,
 'eval_samples_per_second': 313.968,
 'eval_steps_per_second': 9.822,
 'epoch': 3.0}

The predictions can also be saved so that others can evaluate them.

#### 5.7.3 Export CSV

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [None]:
df_label_caso7 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso7 = pd.concat([df_id, df_label_caso7], axis=1)

In [None]:
df_final_caso7['label'].value_counts()

| label | count |
|---|---|
| 1 | 2440 |
| 0 | 1396 |


In [None]:
df_final_caso7.to_csv('./submit/distilroberta_caso7.csv', index=False)
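
The *value_counts* outputs of cases 4 to 7 can be put side by side to see how the share of statements predicted as label 1 changes with the preprocessing. The sketch below only rearranges the counts printed in those cells; the dictionary layout and variable names are illustrative.

```python
# Predicted label counts on the Kaggle set, copied from the value_counts outputs
counts = {
    "caso4": {1: 2519, 0: 1317},
    "caso5": {1: 2319, 0: 1517},
    "caso6": {1: 2419, 0: 1417},
    "caso7": {1: 2440, 0: 1396},
}

totals = {name: c[0] + c[1] for name, c in counts.items()}
shares = {name: c[1] / totals[name] for name, c in counts.items()}

for name in counts:
    print(f"{name}: {totals[name]} rows, {shares[name]:.1%} predicted as label 1")
```

Every case predicts on the same number of rows, so the shares are directly comparable.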

## 6. Training a fixed case with several models

The best result in the previous section was obtained with case 2, so we train other *Hugging Face* models using that preprocessing case.

### 6.1 RoBERTa

#### 6.1.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes as parameters the selected training case and the *dataframe* that holds the data.

In [None]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

We inspect the first row with *iloc* as a quick check that the correct training case was selected.

In [None]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [None]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

We split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.1.2 Training

We select the model with the *selModel* function and the parameter *'roberta'*.

In [None]:
model_name = selModel('roberta')
print("Selected model:", model_name)

Selected model: FacebookAI/roberta-base


We select the model's tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [None]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)
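
With *padding=True*, every sequence passed in one tokenizer call is padded to the length of the longest one, and *truncation=True* cuts sequences that exceed the maximum length. Since each split is tokenized in a single call, all of its rows end up as long as its longest text. The sketch below imitates that behaviour on plain lists; *pad_batch*, the pad id 0 and the token ids are hypothetical, and a real tokenizer also handles special tokens.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (and optionally truncate) token-id lists to one common length."""
    if max_length is not None:
        # Simplified truncation: keep only the first max_length ids
        sequences = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 8, 13], [7], [4, 9]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

One very long statement is enough to make every row in the split that long, which increases memory use during training.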

We wrap the tokenized data in datasets with *OwnDataset* and *KaggleDataset* (the difference between them is that *KaggleDataset* is built without *labels*).

In [None]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results_roberta",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.608629 | 0.668156 | 0.497078 |
| 2 | 0.631900 | 0.616887 | 0.664804 | 0.632594 |
| 3 | 0.572000 | 0.662082 | 0.65419 | 0.622982 |


TrainOutput(global_step=1344, training_loss=0.57720947265625, metrics={'train_runtime': 1943.9999, 'train_samples_per_second': 11.049, 'train_steps_per_second': 0.691, 'total_flos': 5651625469132800.0, 'train_loss': 0.57720947265625, 'epoch': 3.0})

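We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values correspond to the checkpoint with the best F1 (epoch 2 in the training log above), even though *'epoch'* reports 3.0.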

In [None]:
trainer.evaluate()

{'eval_loss': 0.6168866753578186,
 'eval_accuracy': 0.664804469273743,
 'eval_f1': 0.6325944170771758,
 'eval_runtime': 8.7821,
 'eval_samples_per_second': 203.823,
 'eval_steps_per_second': 6.377,
 'epoch': 3.0}

The predictions can also be saved so that others can evaluate them.

#### 6.1.3 Export CSV

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [None]:
df_label_roberta = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_roberta = pd.concat([df_id, df_label_roberta], axis=1)

In [None]:
df_final_roberta['label'].value_counts()

| label | count |
|---|---|
| 1 | 2416 |
| 0 | 1420 |


In [None]:
df_final_roberta.to_csv('./submit/caso2_roberta.csv', index=False)

### 6.2 BERT

#### 6.2.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes as parameters the selected training case and the *dataframe* that holds the data.

In [None]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

We inspect the first row with *iloc* as a quick check that the correct training case was selected.

In [None]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [None]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

We split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.2.2 Training

We select the model with the *selModel* function and the parameter *'bert'*.

In [None]:
model_name = selModel('bert')
print("Selected model:", model_name)

Selected model: google-bert/bert-base-uncased


We select the model's tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [None]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized data in datasets with *OwnDataset* and *KaggleDataset* (the difference between them is that *KaggleDataset* is built without *labels*).

In [None]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results_bert",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.608573 | 0.678771 | 0.530896 |
| 2 | 0.628300 | 0.621739 | 0.663128 | 0.627013 |
| 3 | 0.555100 | 0.690798 | 0.659777 | 0.625015 |


TrainOutput(global_step=1344, training_loss=0.5568229243868873, metrics={'train_runtime': 1950.1837, 'train_samples_per_second': 11.014, 'train_steps_per_second': 0.689, 'total_flos': 5651625469132800.0, 'train_loss': 0.5568229243868873, 'epoch': 3.0})

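We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values correspond to the checkpoint with the best F1 (epoch 2 in the training log above), even though *'epoch'* reports 3.0.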

In [None]:
trainer.evaluate()

{'eval_loss': 0.6217394471168518,
 'eval_accuracy': 0.6631284916201118,
 'eval_f1': 0.6270125863425589,
 'eval_runtime': 8.0131,
 'eval_samples_per_second': 223.386,
 'eval_steps_per_second': 6.989,
 'epoch': 3.0}

The predictions can also be saved so that others can evaluate them.

#### 6.2.3 Export CSV

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [None]:
df_label_bert = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_bert = pd.concat([df_id, df_label_bert], axis=1)

In [None]:
df_final_bert['label'].value_counts()

| label | count |
|---|---|
| 1 | 2495 |
| 0 | 1341 |


In [None]:
df_final_bert.to_csv('./submit/case2_bert.csv', index=False)

### 6.3 ALBERT

#### 6.3.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes as parameters the selected training case and the *dataframe* that holds the data.

In [20]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

We inspect the first row with *iloc* as a quick check that the correct training case was selected.

In [21]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [22]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

We split the data into training and test sets.

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.3.2 Training

We select the model with the *selModel* function and the parameter *'albert'*.

In [24]:
model_name = selModel('albert')
print("Selected model:", model_name)

Selected model: albert/albert-base-v2


We select the model's tokenizer.

In [25]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [26]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized data in datasets with *OwnDataset* and *KaggleDataset* (the difference between them is that *KaggleDataset* is built without *labels*).

In [27]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [28]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
training_args = TrainingArguments(
    output_dir="./results_albert",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [30]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [31]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.625946 | 0.659218 | 0.400425 |
| 2 | 0.642400 | 0.612375 | 0.664804 | 0.607491 |
| 3 | 0.600500 | 0.640534 | 0.659777 | 0.606433 |


TrainOutput(global_step=1344, training_loss=0.5963993753705706, metrics={'train_runtime': 2070.0754, 'train_samples_per_second': 10.376, 'train_steps_per_second': 0.649, 'total_flos': 513331225804800.0, 'train_loss': 0.5963993753705706, 'epoch': 3.0})

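We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values correspond to the checkpoint with the best F1 (epoch 2 in the training log above), even though *'epoch'* reports 3.0.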

In [32]:
trainer.evaluate()

{'eval_loss': 0.6123751997947693,
 'eval_accuracy': 0.664804469273743,
 'eval_f1': 0.6074911447955664,
 'eval_runtime': 9.8561,
 'eval_samples_per_second': 181.613,
 'eval_steps_per_second': 5.682,
 'epoch': 3.0}
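
In epoch 1 ALBERT reaches an accuracy of 0.659 with an F1 of only 0.400. That pattern is what a macro-averaged F1 gives when a model predicts the majority class for almost every row. The sketch below illustrates it under two assumptions that the notebook does not confirm: that *compute_metrics* (defined earlier and not shown here) averages F1 over both classes, and that the validation split has 1180 rows of label 1 and 610 of label 0 (hypothetical numbers).

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating that class as the positive one."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [1] * 1180 + [0] * 610   # hypothetical class split
y_pred = [1] * 1790               # a model that always predicts label 1
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred), 3))
```

Under these assumptions the accuracy looks reasonable while the macro F1 stays low, because the class that is never predicted contributes an F1 of zero.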

The predictions can also be saved so that others can evaluate them.

#### 6.3.3 Export CSV

In [33]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [34]:
df_label_albert = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_albert = pd.concat([df_id, df_label_albert], axis=1)

In [35]:
df_final_albert['label'].value_counts()

| label | count |
|---|---|
| 1 | 2830 |
| 0 | 1006 |


In [37]:
df_final_albert.to_csv('./submit/case2_albert.csv', index=False)

### 6.4 DeBERTaV3

#### 6.4.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes as parameters the selected training case and the *dataframe* that holds the data.

In [38]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

We inspect the first row with *iloc* as a quick check that the correct training case was selected.

In [39]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [40]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

We split the data into training and test sets.

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.4.2 Training

We select the model with the *selModel* function and the parameter *'deberta'*.

In [42]:
model_name = selModel('deberta')
print("Selected model:", model_name)

Selected model: microsoft/deberta-v3-base


We select the model's tokenizer.

In [43]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [44]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


We wrap the tokenized data in datasets with *OwnDataset* and *KaggleDataset* (the difference between them is that *KaggleDataset* is built without *labels*).

In [45]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [46]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

We define the training hyperparameters.


In [47]:
training_args = TrainingArguments(
    output_dir="./results_deberta",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define the trainer.

In [48]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


We train the model.

In [49]:
trainer.train()

OutOfMemoryError: CUDA out of memory. Tried to allocate 368.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 328.12 MiB is free. Process 3192 has 14.42 GiB memory in use. Of the allocated memory 14.12 GiB is allocated by PyTorch, and 169.40 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
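
The error above shows the GPU running out of memory during training. The sketch below illustrates one common way to lower peak memory while keeping the same effective batch size of 16: a smaller per-device batch combined with gradient accumulation, plus gradient checkpointing. The specific values are illustrative assumptions, not settings the authors report using, and it is untested whether they are enough for this model on this GPU. The dictionary name `low_memory_kwargs` is hypothetical; the keys are existing `TrainingArguments` parameters.

```python
# Illustrative low-memory overrides (assumed values, not the authors' configuration).
low_memory_kwargs = {
    "per_device_train_batch_size": 4,   # was 16 in the cell above
    "gradient_accumulation_steps": 4,   # accumulate 4 small batches per optimizer step
    "per_device_eval_batch_size": 8,    # was 32 in the cell above
    "gradient_checkpointing": True,     # trade extra compute for lower activation memory
}

# On a single GPU, the effective training batch size is the product of the two values.
effective_batch_size = (
    low_memory_kwargs["per_device_train_batch_size"]
    * low_memory_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 16

# Hypothetical usage: pass the overrides alongside the other arguments of the cell above.
# training_args = TrainingArguments(output_dir="./results_deberta", **low_memory_kwargs)
```

Capping the token length at tokenization time, which the conclusions mention as one of the things tried, reduces memory further.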

We evaluate the metrics defined earlier.

In [None]:
trainer.evaluate()

#### 6.4.3 Export CSV

The predictions can be stored in a CSV file so that others can evaluate them.

In [None]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [None]:
df_label_deberta = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_deberta = pd.concat([df_id, df_label_deberta], axis=1)

In [None]:
df_final_deberta['label'].value_counts()

In [None]:
df_final_deberta.to_csv('./submit/caso2_deberta.csv', index=False)

## 7. General conclusions

Overall, the *Hugging Face* models gave the best results, specifically preprocessing case 2 combined with the *distilbert/distilroberta-base* model.

It is worth noting that this *notebook* does not show every experiment we ran, and it does not even show the best *Kaggle* case. The reason is the long execution time of these models. To use the time efficiently, this same notebook was shared with small modifications so that several copies could be run in parallel.

The procedure was as follows. First, one person ran section 5 of this *notebook* and found that the best preprocessing case was case 2. Then two people ran section 6 (two models each). During this step we found that DeBERTaV3 could not be run because no group member had sufficient resources. After running this section, the best model turned out to be the first one tried, *distilbert/distilroberta-base*.

Next, three people were needed. One re-ran the models from section 6 with different preprocessing cases, in case they performed better with another one. That person also tried capping the maximum number of *tokens* per sentence, since these four models are larger (they have more parameters). Even so, this was not enough, and *distilbert/distilroberta-base* remained the best. The other two people focused on optimizing *distilbert/distilroberta-base*, using the first two preprocessing cases, which were also the best ones. The optimization experiments were:
* Capping the maximum number of *tokens*: we tried no cap, a cap of 128 *tokens*, and a cap of 256.
* Varying the number of epochs, from 2 to 6.
* Varying the learning rate: 2e-5, 1e-5, and 5e-6.
* Varying the batch size: 8 for training and 16 for evaluation, 16 for training and 32 for evaluation, and 32 for training and 64 for evaluation.

These settings were not tested independently and in isolation; they were combined with one another (except for the batch sizes, which were tested at the end on the best case only). This produced a large number of combinations.
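
To give a sense of the size of that search space, the sketch below enumerates the combinations of the settings listed above with `itertools.product`. It assumes integer epoch counts from 2 to 6 inclusive and that every combination is a candidate; the text does not state that all of them were actually run, so the count is an upper bound. Variable names are hypothetical.

```python
from itertools import product

# Values taken from the list of optimization experiments above.
preprocessing_cases = [1, 2]          # the two preprocessing cases tested
max_lengths = [None, 128, 256]        # None means no token cap
epoch_counts = [2, 3, 4, 5, 6]        # assumed: every integer from 2 to 6
learning_rates = [2e-5, 1e-5, 5e-6]

# One dictionary per candidate configuration (batch size is excluded,
# since it was only varied at the end on the best case).
grid = [
    {
        "case": case,
        "max_length": max_length,
        "num_train_epochs": epochs,
        "learning_rate": lr,
    }
    for case, max_length, epochs, lr in product(
        preprocessing_cases, max_lengths, epoch_counts, learning_rates
    )
]

print(len(grid))  # 90
```

Adding the three batch-size settings for the single best configuration would bring the total to at most 93 runs under these assumptions.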

In the end, the best configuration was *distilbert/distilroberta-base* with preprocessing case 2 and the following training hyperparameters:

training_args = TrainingArguments(
    output_dir="./results_caso2",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

## 8. References

* [Moraites, A. (2023, April 2). Fine-tuning RoBERTa for Topic Classification with Hugging Face Transformers and Datasets Library. Medium.](https://achimoraites.medium.com/fine-tuning-roberta-for-topic-classification-with-hugging-face-transformers-and-datasets-library-c6f8432d0820)