<header style="width:100%;position:relative">
  <div style="width:80%;float:right;">
    <h1>False Political Claim Detection</h1>
    <h3>Feature selection and preprocessing</h3>
    <h5>Group 2</h5>
  </div>
        <img style="width:15%;" src="images/logo.jpg" alt="UPM" />
</header>

# Table of contents

1. [Import libraries](#1.-Importar-librerias)
2. [Global variables and helper functions](#2.-Variables-globales)
3. [Loading the dataframe](#3.-Carga-del-dataframe)
4. [Feature analysis and selection](#4.-Analisis-y-seleccion-de-las-caracteristicas)
5. [Loading the data and splitting into training and test sets](#5.-Carga-de-los-datos-y-division-en-entrenamiento-y-test)
6. [Data preprocessing](#6.-Preprocesado-de-los-datos)
    * 6.1 [Introduction](#6.1-Introduccion)
    * 6.2 [Preprocessing cases](#6.2-Casos-de-preprocesado)
    * 6.3 [Conclusions](#6.3-Conclusiones)
7. [References](#7.-Referencias)


<a id="1.-Importar-librerias"></a>
## 1. Import libraries

In [1]:
# General import and load data
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Splitting
from sklearn.model_selection import train_test_split

# Estimators
from sklearn.ensemble import RandomForestClassifier

# Evaluation
from sklearn.metrics import accuracy_score, f1_score

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

print("All libraries were imported successfully.")

All libraries were imported successfully.


<a id="2.-Variables-globales"></a>
## 2. Global variables and helper functions

A single *seed* is fixed for the whole notebook so that the random operations are controlled and the results are reproducible.

In [2]:
seed = 42
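
To make the effect of a fixed seed concrete, the following minimal sketch shuffles a list of indices with a seeded generator and holds out a fraction of them. It is written in plain Python and is not the scikit-learn implementation used later in the notebook; *holdout_split* is a hypothetical helper introduced only for this illustration.

```python
import random

def holdout_split(n_samples, test_size, seed):
    """Return (train_idx, test_idx) after a seeded shuffle of range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

seed = 42
first = holdout_split(10, 0.2, seed)
second = holdout_split(10, 0.2, seed)
print(first == second)  # True
```

Calling the helper twice with the same seed yields the same partition, which is the property the notebook relies on when it passes *seed* to the split.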

Helper function that selects the text column (*statement* variant) used to train the model.

In [3]:
def selStatement(case, df):
    mapping = {
        1: 'statement',
        2: 'speaker_statement',
        3: 'speaker_statement_subject',
        4: 'speaker_party_statement',
        5: 'speaker_party_statement_subject',
        6: 'speaker_speaker-job_party_statement',
        7: 'speaker_speaker-job_party_statement_subject'
    }

    if case not in mapping:
        raise ValueError("Invalid value for 'case'. It must be between 1 and 7.")

    return df[mapping[case]]

Helper function that selects the pretrained model to fine-tune.

In [4]:
def selModel(model):
    mapping = {
        'albert': 'albert/albert-base-v2',
        'bert': 'google-bert/bert-base-uncased',
        'deberta': 'microsoft/deberta-v3-base',
        'distil_roberta': 'distilbert/distilroberta-base',
        'roberta': 'FacebookAI/roberta-base'
    }

    if model not in mapping:
        raise ValueError("The selected model is not among the available ones. Check the value provided.")

    return mapping[model]

Function that computes the evaluation metrics (accuracy and F1) for the *Hugging Face* model.

In [5]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions),
    }
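
As a worked example of what these two metrics measure, the sketch below recomputes them by hand in plain Python. The logits and labels are made up for illustration, and *accuracy_and_f1* is a hypothetical helper; it assumes the F1 score is taken for the positive class (label 1), which is the default behaviour of *f1_score* for binary targets.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels):
    """Accuracy and F1 of the positive class (label 1) from raw logits."""
    preds = [argmax(row) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

# Made-up logits for four statements: columns are the scores for labels 0 and 1.
logits = [[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [-0.2, 0.9]]
labels = [1, 0, 0, 1]
accuracy, f1 = accuracy_and_f1(logits, labels)
print(accuracy, f1)  # 0.75 0.8
```

Here three of the four predictions are correct (accuracy 0.75), while the single false positive lowers the F1 of class 1 to 0.8.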

Classes that wrap the tokenized *dataframe* columns as PyTorch *datasets*.

In [6]:
class OwnDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

In [7]:
class KaggleDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings['input_ids'])
    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
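
The indexing logic of both classes can be seen without PyTorch. The sketch below is a torch-free stand-in (the class name *ListDataset* and the token ids are made up for illustration) that returns plain lists instead of tensors.

```python
class ListDataset:
    """Torch-free stand-in mirroring the indexing logic of OwnDataset."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Take the idx-th row of every encoding field, then attach its label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

# Made-up token ids for two already padded sentences.
encodings = {
    "input_ids": [[0, 713, 2], [0, 42, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
dataset = ListDataset(encodings, labels=[1, 0])
print(len(dataset), dataset[1])
# 2 {'input_ids': [0, 42, 2], 'attention_mask': [1, 1, 1], 'labels': 0}
```

Each item is one row of every encoding field plus its label, which is the dictionary layout the trainer consumes; *KaggleDataset* follows the same pattern without the *labels* key.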


<a id="3.-Carga-del-dataframe"></a>
## 3. Loading the dataframe

The training data is loaded from *formated/train_hugging.csv* (read here from the */content* directory); it was already preprocessed by one of our teammates.

In [8]:
url = "/content/train_hugging.csv"
df = pd.read_csv(url)

print("Data loaded successfully\n")

Data loaded successfully



We also load the test set whose labels must be predicted for Kaggle, from *formated/test_hugging.csv*.

In [9]:
url = "/content/test_hugging.csv"
df_test = pd.read_csv(url)

print("Test set loaded successfully\n")

Test set loaded successfully



<a id="4.-Analisis-y-seleccion-de-las-caracteristicas"></a>
## 4. Feature analysis

pandas is configured to display all columns, and *head* is then called to inspect their contents.

In [10]:
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,id,label,statement,speaker_statement,speaker_statement_subject,speaker_party_statement,speaker_party_statement_subject,speaker_speaker-job_party_statement,speaker_speaker-job_party_statement_subject
0,81f884c64a7,1,China is in the South China Sea and (building)...,"donald-trump said: ""China is in the South Chin...","donald-trump said: ""China is in the South Chin...",donald-trump (affiliated with the republican p...,donald-trump (affiliated with the republican p...,donald-trump (President-Elect) (affiliated wit...,donald-trump (President-Elect) (affiliated wit...
1,30c2723a188,0,With the resources it takes to execute just ov...,"chris-dodd said: ""With the resources it takes ...","chris-dodd said: ""With the resources it takes ...",chris-dodd (affiliated with the democrat party...,chris-dodd (affiliated with the democrat party...,chris-dodd (U.S. senator) (affiliated with the...,chris-dodd (U.S. senator) (affiliated with the...
2,6936b216e5d,0,The (Wisconsin) governor has proposed tax give...,"donna-brazile said: ""The (Wisconsin) governor ...","donna-brazile said: ""The (Wisconsin) governor ...",donna-brazile (affiliated with the democrat pa...,donna-brazile (affiliated with the democrat pa...,donna-brazile (Political commentator) (affilia...,donna-brazile (Political commentator) (affilia...
3,b5cd9195738,1,Says her representation of an ex-boyfriend who...,"rebecca-bradley said: ""Says her representation...","rebecca-bradley said: ""Says her representation...",rebecca-bradley (affiliated with the none part...,rebecca-bradley (affiliated with the none part...,rebecca-bradley (affiliated with the none part...,rebecca-bradley (affiliated with the none part...
4,84f8dac7737,0,At protests in Wisconsin against proposed coll...,"republican-party-wisconsin said: ""At protests ...","republican-party-wisconsin said: ""At protests ...",republican-party-wisconsin (affiliated with th...,republican-party-wisconsin (affiliated with th...,republican-party-wisconsin (affiliated with th...,republican-party-wisconsin (affiliated with th...


The column names are listed for inspection.

In [11]:
df.columns

Index(['id', 'label', 'statement', 'speaker_statement',
       'speaker_statement_subject', 'speaker_party_statement',
       'speaker_party_statement_subject',
       'speaker_speaker-job_party_statement',
       'speaker_speaker-job_party_statement_subject'],
      dtype='object')

The columns are written out as a Python list so that they are easier to reuse, and they are grouped into categories.

In [12]:
all_features = [
    # Identifier
    'id',

    # Target label
    'label',

    # Original text
    'statement',

    # Original text + Speaker
    'speaker_statement',

    # Original text + Speaker + Subject
    'speaker_statement_subject',

    # Original text + Speaker + Party affiliation
    'speaker_party_statement',

    # Original text + Speaker + Party affiliation + Subject
    'speaker_party_statement_subject',

    # Original text + Speaker + Speaker job + Party affiliation
    'speaker_speaker-job_party_statement',

    # Original text + Speaker + Speaker job + Party affiliation + Subject
    'speaker_speaker-job_party_statement_subject'
]

As shown, the columns progressively combine more metadata with the statement. We also avoided removing words from the original text, because the *Hugging Face* models already handle it internally through their own *tokenizer*. Each text column represents one training case. An example of each column is shown below:

* **Case 1 -** ***statement***

In [13]:
df['statement'].iloc[0]

'China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen.'

* **Case 2 -** ***speaker_statement***

In [14]:
df['speaker_statement'].iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Case 3 -** ***speaker_statement_subject***

In [15]:
df['speaker_statement_subject'].iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

* **Case 4 -** ***speaker_party_statement***

In [16]:
df['speaker_party_statement'].iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Case 5 -** ***speaker_party_statement_subject***

In [17]:
df['speaker_party_statement_subject'].iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

* **Case 6 -** ***speaker_speaker-job_party_statement***

In [18]:
df['speaker_speaker-job_party_statement'].iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Case 7 -** ***speaker_speaker-job_party_statement_subject***

In [19]:
df['speaker_speaker-job_party_statement_subject'].iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'
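
The seven examples follow a regular pattern, so the way such strings could be assembled can be sketched in a few lines. The columns in this notebook were prepared beforehand by a teammate, so the function below is only an assumption about that step: *build_text* and its parameters are hypothetical names, and the template is inferred from the examples of cases 2 to 7 shown above.

```python
def build_text(statement, speaker, party=None, job=None, subject=None):
    """Assemble one input string following the pattern of the example columns."""
    prefix = speaker
    if job:
        prefix += f" ({job})"
    if party:
        prefix += f" (affiliated with the {party} party)"
    text = f'{prefix} said: "{statement}"'
    if subject:
        text += f" The statement concerns the topics of: {subject}."
    return text

example = build_text(
    "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions.",
    speaker="kasim-reed",
    party="democrat",
    subject="campaign-finance,ethics,government-regulation",
)
print(example)
```

With a speaker, a party and a subject, the call reproduces the layout of case 5; leaving out optional arguments gives the shorter variants.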

## 6. Training a fixed model on several preprocessing cases

The goal of this section is to find which of the preprocessing cases presented above works best. To do so, a single fixed model is used: "distilbert/distilroberta-base".

### 6.1 Case 1: Statement

#### 6.1.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that stores the data.

In [20]:
X =  selStatement(1, df)
y = df['label']

X_kaggle = selStatement(1, df_test)

We use *iloc* as a quick check that the correct training case was selected.

In [21]:
X.iloc[0]

'China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen.'

In [22]:
X_kaggle.iloc[0]

'Five members of [the Common Cause Georgia] board accepted maximum campaign contributions.'

The data is split into training (80%) and test (20%) sets.

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.1.2 Training

We select the model with the function *selModel* and the parameter *'distil_roberta'*.

In [24]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We load the model's tokenizer.

In [25]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [26]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)
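
To illustrate what *padding=True* produces, the sketch below pads a small batch of token ids to the length of its longest sequence and builds the matching attention mask. It is a simplified stand-in, not the tokenizer's implementation: *pad_batch* is a hypothetical helper, the token ids and *pad_id* are made up, and truncation is not modelled.

```python
def pad_batch(batch, pad_id):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; pad_id=1 is an assumption for this illustration.
encoded = pad_batch([[0, 10, 11, 2], [0, 12, 2]], pad_id=1)
print(encoded["input_ids"])       # [[0, 10, 11, 2], [0, 12, 2, 1]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because each call pads to the longest sequence it receives, the training, test and Kaggle encodings may end up with different sequence lengths, which is harmless since each set is fed to the model separately.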

We wrap the tokenized data as *datasets* with the classes *OwnDataset* and *KaggleDataset* (the only difference is that *KaggleDataset* is built without *labels*).

In [27]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [28]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
training_args = TrainingArguments(
    output_dir="./results_caso1",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [30]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [31]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.609881,0.669832,0.774685
2,0.626900,0.621559,0.653631,0.738397


TrainOutput(global_step=896, training_loss=0.6057974440710885, metrics={'train_runtime': 679.5388, 'train_samples_per_second': 21.073, 'train_steps_per_second': 1.319, 'total_flos': 1896933148753920.0, 'train_loss': 0.6057974440710885, 'epoch': 2.0})
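
The value *global_step=896* can be checked with a short calculation. The number of training samples is not printed anywhere in the notebook, so the figure below is derived from the logged throughput (an assumption), and a single training device is assumed as well.

```python
import math

train_samples = 7160  # derived from the log: 21.073 samples/s * 679.5388 s / 2 epochs
batch_size = 16       # per_device_train_batch_size, assuming a single device
epochs = 2            # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 448 896
```

With 448 steps per epoch, and assuming the trainer's default logging interval of 500 steps, the first training loss would only be logged during the second epoch, which is consistent with the "No log" entry for epoch 1.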

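We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values below correspond to the checkpoint with the best F1 (epoch 1), not to the last epoch.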

In [32]:
trainer.evaluate()

{'eval_loss': 0.6098808646202087,
 'eval_accuracy': 0.6698324022346369,
 'eval_f1': 0.7746854746473504,
 'eval_runtime': 4.2336,
 'eval_samples_per_second': 422.804,
 'eval_steps_per_second': 13.227,
 'epoch': 2.0}
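
The evaluation figures match the first row of the training table rather than the second. A minimal sketch of the selection rule implied by *metric_for_best_model="f1"* and *greater_is_better=True* is shown below; *best_epoch* is a hypothetical helper, not part of the *Trainer* API, and the values are copied from the table above.

```python
# Validation metrics per epoch, copied from the training table for case 1.
history = [
    {"epoch": 1, "eval_loss": 0.609881, "f1": 0.774685},
    {"epoch": 2, "eval_loss": 0.621559, "f1": 0.738397},
]

def best_epoch(history, metric="f1", greater_is_better=True):
    """Pick the entry with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda row: row[metric])

print(best_epoch(history)["epoch"])  # 1
```

In this run the same epoch would also be chosen by validation loss, so the choice of metric does not change which checkpoint is reloaded.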

#### 6.1.3 Export CSV

If needed, the predictions can be exported so that others can evaluate them.

In [33]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [34]:
df_label_caso1 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso1 = pd.concat([df_id, df_label_caso1], axis=1)

In [35]:
df_final_caso1['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,3106
0,730
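
A quick calculation turns these counts into a proportion, which makes it easier to see how skewed the predictions are. The counts are copied from the output above; no comparison with the label distribution of the training data is made here, since that distribution is not shown in this notebook.

```python
# Predicted label counts for case 1, copied from the value_counts() output above.
counts = {1: 3106, 0: 730}

total = sum(counts.values())
share_positive = counts[1] / total
print(total, f"{share_positive:.1%}")  # 3836 81.0%
```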


In [37]:
df_final_caso1.to_csv('./submit/roberta_caso1.csv', index=False)

### 6.2 Case 2: Statement and Speaker

#### 6.2.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that stores the data.

In [38]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

We use *iloc* as a quick check that the correct training case was selected.

In [39]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [40]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

The data is split into training (80%) and test (20%) sets.

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.2.2 Training

We select the model with the function *selModel* and the parameter *'distil_roberta'*.

In [42]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We load the model's tokenizer.

In [43]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [44]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized data as *datasets* with the classes *OwnDataset* and *KaggleDataset* (the only difference is that *KaggleDataset* is built without *labels*).

In [45]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [46]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [47]:
training_args = TrainingArguments(
    output_dir="./results_caso2",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [48]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [49]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.602166,0.676536,0.776879
2,0.623500,0.617187,0.650279,0.726399


TrainOutput(global_step=896, training_loss=0.6010246447154454, metrics={'train_runtime': 686.4326, 'train_samples_per_second': 20.861, 'train_steps_per_second': 1.305, 'total_flos': 1896933148753920.0, 'train_loss': 0.6010246447154454, 'epoch': 2.0})

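We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values below correspond to the checkpoint with the best F1 (epoch 1), not to the last epoch.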

In [50]:
trainer.evaluate()

{'eval_loss': 0.602165937423706,
 'eval_accuracy': 0.676536312849162,
 'eval_f1': 0.776878612716763,
 'eval_runtime': 4.5764,
 'eval_samples_per_second': 391.136,
 'eval_steps_per_second': 12.237,
 'epoch': 2.0}

#### 6.2.3 Export CSV

If needed, the predictions can be exported so that others can evaluate them.

In [51]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [52]:
df_label_caso2 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso2 = pd.concat([df_id, df_label_caso2], axis=1)

In [53]:
df_final_caso2['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,3010
0,826


In [54]:
df_final_caso2.to_csv('./submit/roberta_caso2.csv', index=False)

### 6.3 Case 3: Statement, Speaker and Subject

#### 6.3.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that stores the data.

In [55]:
X =  selStatement(3, df)
y = df['label']

X_kaggle = selStatement(3, df_test)

We use *iloc* as a quick check that the correct training case was selected.

In [56]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

In [57]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions." The statement concerns the topics of: campaign-finance,ethics,government-regulation.'

The data is split into training (80%) and test (20%) sets.

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.3.2 Training

We select the model with the function *selModel* and the parameter *'distil_roberta'*.

In [59]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We load the model's tokenizer.

In [60]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [61]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized data as *datasets* with the classes *OwnDataset* and *KaggleDataset* (the only difference is that *KaggleDataset* is built without *labels*).

In [62]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [63]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [64]:
training_args = TrainingArguments(
    output_dir="./results_caso3",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [65]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [66]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.602819,0.67933,0.782246
2,0.626300,0.620244,0.64581,0.720951


TrainOutput(global_step=896, training_loss=0.6044660125459943, metrics={'train_runtime': 686.4964, 'train_samples_per_second': 20.86, 'train_steps_per_second': 1.305, 'total_flos': 1896933148753920.0, 'train_loss': 0.6044660125459943, 'epoch': 2.0})

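We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values below correspond to the checkpoint with the best F1 (epoch 1), not to the last epoch.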

In [67]:
trainer.evaluate()

{'eval_loss': 0.6028192043304443,
 'eval_accuracy': 0.6793296089385474,
 'eval_f1': 0.7822458270106222,
 'eval_runtime': 5.1324,
 'eval_samples_per_second': 348.767,
 'eval_steps_per_second': 10.911,
 'epoch': 2.0}

#### 6.3.3 Export CSV

If needed, the predictions can be exported so that others can evaluate them.

In [68]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [69]:
df_label_caso3 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso3 = pd.concat([df_id, df_label_caso3], axis=1)

In [70]:
df_final_caso3['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,3089
0,747


In [71]:
df_final_caso3.to_csv('./submit/roberta_caso3.csv', index=False)

### 6.4 Case 4: Statement, Speaker and Party

#### 6.4.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that stores the data.

In [72]:
X =  selStatement(4, df)
y = df['label']

X_kaggle = selStatement(4, df_test)

We use *iloc* as a quick check that the correct training case was selected.

In [73]:
X.iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [74]:
X_kaggle.iloc[0]

'kasim-reed (affiliated with the democrat party) said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

The data is split into training (80%) and test (20%) sets.

In [75]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.4.2 Training

We select the model with the function *selModel* and the parameter *'distil_roberta'*.

In [76]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We load the model's tokenizer.

In [77]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [78]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized data as *datasets* with the classes *OwnDataset* and *KaggleDataset* (the only difference is that *KaggleDataset* is built without *labels*).

In [79]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [80]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [81]:
training_args = TrainingArguments(
    output_dir="./results_caso4",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [82]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [83]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.5981,0.681564,0.787789
2,0.624600,0.614373,0.665363,0.742144


TrainOutput(global_step=896, training_loss=0.6021936961582729, metrics={'train_runtime': 685.051, 'train_samples_per_second': 20.904, 'train_steps_per_second': 1.308, 'total_flos': 1896933148753920.0, 'train_loss': 0.6021936961582729, 'epoch': 2.0})

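We evaluate the metrics defined earlier. Because *load_best_model_at_end=True*, the values below correspond to the checkpoint with the best F1 (epoch 1), not to the last epoch.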

In [84]:
trainer.evaluate()

{'eval_loss': 0.5981001257896423,
 'eval_accuracy': 0.6815642458100558,
 'eval_f1': 0.7877885331347729,
 'eval_runtime': 4.5183,
 'eval_samples_per_second': 396.163,
 'eval_steps_per_second': 12.394,
 'epoch': 2.0}
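
With the fourth case evaluated, the validation F1 scores reported so far can be placed side by side. The sketch below only ranks the values copied from the *evaluate* outputs of cases 1 to 4; each comes from a single run, so the small gaps between them should be read with caution.

```python
# Validation F1 of the best checkpoint, copied from the evaluate() outputs of cases 1 to 4.
f1_by_case = {
    1: 0.7746854746473504,
    2: 0.776878612716763,
    3: 0.7822458270106222,
    4: 0.7877885331347729,
}

ranking = sorted(f1_by_case, key=f1_by_case.get, reverse=True)
best = ranking[0]
gain = f1_by_case[best] - f1_by_case[1]
print(ranking, f"{gain:.4f}")  # [4, 3, 2, 1] 0.0131
```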

#### 6.4.3 Export CSV

If needed, the predictions can be exported so that others can evaluate them.

In [85]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [86]:
df_label_caso4 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso4 = pd.concat([df_id, df_label_caso4], axis=1)

In [87]:
df_final_caso4['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,3223
0,613


In [88]:
df_final_caso4.to_csv('./submit/roberta_caso4.csv', index=False)

### 6.5 Case 5: Statement, Speaker, Party and Subject

#### 6.5.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that stores the data.

In [89]:
X =  selStatement(5, df)
y = df['label']

X_kaggle = selStatement(5, df_test)

We use *iloc* as a quick check that the correct training case was selected.

In [90]:
X.iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

In [91]:
X_kaggle.iloc[0]

'kasim-reed (affiliated with the democrat party) said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions." The statement concerns the topics of: campaign-finance,ethics,government-regulation.'

The data is split into training (80%) and test (20%) sets.

In [92]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.5.2 Training

We select the model with the function *selModel* and the parameter *'distil_roberta'*.

In [93]:
model_name = selModel('distil_roberta')
print("Selected model:", model_name)

Selected model: distilbert/distilroberta-base


We load the model's tokenizer.

In [94]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [95]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We wrap the tokenized data as *datasets* with the classes *OwnDataset* and *KaggleDataset* (the only difference is that *KaggleDataset* is built without *labels*).

In [96]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)
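The real *OwnDataset* and *KaggleDataset* are defined earlier in the notebook. For readers who only see this section, here is a minimal sketch of what such wrappers might look like. The class names `LabeledEncodings` and `UnlabeledEncodings` are hypothetical, and the actual classes may differ, for example by converting values to `torch` tensors.

```python
class LabeledEncodings:
    """Hypothetical stand-in for OwnDataset: tokenizer output plus labels."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # One example: every tokenizer field at position idx, plus its label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


class UnlabeledEncodings:
    """Hypothetical stand-in for KaggleDataset: same idea, without labels."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {key: values[idx] for key, values in self.encodings.items()}


# Toy encodings with the two fields a tokenizer call usually returns.
toy_encodings = {
    "input_ids": [[0, 5, 2], [0, 6, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
train_like = LabeledEncodings(toy_encodings, [1, 0])
kaggle_like = UnlabeledEncodings(toy_encodings)
print(train_like[0], kaggle_like[1])
```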

We load the model.

In [97]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [98]:
training_args = TrainingArguments(
    output_dir="./results_caso5",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [99]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [100]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.601001,0.675419,0.782152
2,0.625000,0.614967,0.661453,0.738116


TrainOutput(global_step=896, training_loss=0.6031361477715629, metrics={'train_runtime': 686.3498, 'train_samples_per_second': 20.864, 'train_steps_per_second': 1.305, 'total_flos': 1896933148753920.0, 'train_loss': 0.6031361477715629, 'epoch': 2.0})
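The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below infers the training set size from the logged throughput (it is not read from the dataframe) and assumes a single device, no gradient accumulation, and the default `logging_steps` of 500. Under those assumptions it reproduces `global_step=896` and suggests why the first epoch shows "No log" for the training loss.

```python
import math

# Values copied from the TrainOutput and TrainingArguments above.
train_runtime = 686.3498
train_samples_per_second = 20.864
num_train_epochs = 2
per_device_train_batch_size = 16

# Samples processed over the whole run, divided by the number of epochs.
n_train = round(train_runtime * train_samples_per_second / num_train_epochs)

# Optimizer steps per epoch and in total (assumes one device, no accumulation).
steps_per_epoch = math.ceil(n_train / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Assumed default logging interval: the first epoch ends before the first log.
logging_steps = 500
first_epoch_logged = steps_per_epoch >= logging_steps

print(n_train, steps_per_epoch, total_steps, first_epoch_logged)
```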

We evaluate the metrics defined earlier.

In [101]:
trainer.evaluate()

{'eval_loss': 0.6010012030601501,
 'eval_accuracy': 0.6754189944134078,
 'eval_f1': 0.7821522309711286,
 'eval_runtime': 5.2653,
 'eval_samples_per_second': 339.962,
 'eval_steps_per_second': 10.636,
 'epoch': 2.0}

#### 6.5.3 Export CSV

The predictions are stored so that they can be evaluated by others.

In [102]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [103]:
df_label_caso5 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso5 = pd.concat([df_id, df_label_caso5], axis=1)

In [104]:
df_final_caso5['label'].value_counts()

label,count
1,3198
0,638


In [105]:
df_final_caso5.to_csv('./submit/roberta_caso5.csv', index=False)

### 6.6 Case 6: Statement, Speaker, Party and Speaker Job

#### 6.6.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes the selected training case and the *dataframe* that holds the data.

In [106]:
X =  selStatement(6, df)
y = df['label']

X_kaggle = selStatement(6, df_test)

An *iloc* lookup gives a quick check that the correct training case was selected.

In [107]:
X.iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [108]:
X_kaggle.iloc[0]

'kasim-reed (affiliated with the democrat party) said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

The data is split into training and test sets.

In [109]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.6.2 Training

We select the model with the *selModel* function and the parameter *'roberta'*.

In [110]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


We load the model's tokenizer.

In [111]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [112]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)
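To make the effect of `truncation=True, padding=True` concrete, the toy function below mimics the idea on plain lists of token ids. It is a simplified illustration, not the tokenizer's implementation: `pad_batch`, `pad_id` and `max_length` are hypothetical names, and special tokens are ignored.

```python
def pad_batch(token_ids, pad_id, max_length):
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [ids[:max_length] for ids in token_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7, 8, 9, 10], [5, 6], [5, 6, 7]], pad_id=0, max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because each of the three tokenizer calls pads to the longest text it receives, the training, test and Kaggle encodings may end up with different sequence lengths.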

We convert the tokenized *dataframes* into *datasets* with the *OwnDataset* and *KaggleDataset* functions (the only difference is that *KaggleDataset* is built without *labels*).

In [113]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [114]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [115]:
training_args = TrainingArguments(
    output_dir="./results_caso6",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [116]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [117]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.604272,0.686592,0.791682
2,0.621700,0.619557,0.655866,0.734711


TrainOutput(global_step=896, training_loss=0.6005993570600238, metrics={'train_runtime': 742.568, 'train_samples_per_second': 19.284, 'train_steps_per_second': 1.207, 'total_flos': 1896933148753920.0, 'train_loss': 0.6005993570600238, 'epoch': 2.0})

We evaluate the metrics defined earlier.

In [118]:
trainer.evaluate()

{'eval_loss': 0.6042715907096863,
 'eval_accuracy': 0.6865921787709497,
 'eval_f1': 0.7916821388785741,
 'eval_runtime': 5.2787,
 'eval_samples_per_second': 339.1,
 'eval_steps_per_second': 10.609,
 'epoch': 2.0}
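The evaluation output matches the epoch 1 row of the training table rather than epoch 2. A plausible reading is that `load_best_model_at_end=True` restored the checkpoint with the highest F1 before `evaluate()` ran, while the `'epoch': 2.0` field appears to report only how far training progressed. The sketch below reproduces that selection rule on the logged values; it illustrates the rule and is not the `Trainer` internals.

```python
# Per-epoch evaluation results copied from the training table above.
history = [
    {"epoch": 1, "eval_loss": 0.604272, "eval_accuracy": 0.686592, "eval_f1": 0.791682},
    {"epoch": 2, "eval_loss": 0.619557, "eval_accuracy": 0.655866, "eval_f1": 0.734711},
]

# Settings taken from TrainingArguments.
metric_for_best_model = "f1"
greater_is_better = True

# Pick the row that is best according to the chosen metric.
key = f"eval_{metric_for_best_model}"
pick = max if greater_is_better else min
best = pick(history, key=lambda row: row[key])

print(best["epoch"], best[key])
```

Both the validation loss and the F1 score get worse in the second epoch, so selecting by either metric would keep the epoch 1 checkpoint in this run.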

#### 6.6.3 Export CSV

The predictions are stored so that they can be evaluated by others.

In [119]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)
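`predictions.predictions` holds one row of logits per example, with one column per class, and `argmax(-1)` picks the column with the largest logit. A small sketch with made-up logits (the values are hypothetical):

```python
import numpy as np

# Hypothetical logits for three examples and two classes (columns: label 0, label 1).
logits = np.array([
    [0.2, 1.1],
    [1.5, -0.3],
    [-0.4, 0.9],
])

# Index of the largest value along the last axis, i.e. the predicted class.
predicted = logits.argmax(-1)
print(predicted)
```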

In [120]:
df_label_caso6 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso6 = pd.concat([df_id, df_label_caso6], axis=1)

In [121]:
df_final_caso6['label'].value_counts()

label,count
1,3212
0,624


In [122]:
df_final_caso6.to_csv('./submit/roberta_caso6.csv', index=False)

### 6.7 Case 7: Statement, Speaker, Party, Speaker Job and Subject

#### 6.7.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes the selected training case and the *dataframe* that holds the data.

In [123]:
X =  selStatement(7, df)
y = df['label']

X_kaggle = selStatement(7, df_test)

An *iloc* lookup gives a quick check that the correct training case was selected.

In [124]:
X.iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

In [125]:
X_kaggle.iloc[0]

'kasim-reed (affiliated with the democrat party) said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions." The statement concerns the topics of: campaign-finance,ethics,government-regulation.'

The data is split into training and test sets.

In [126]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.7.2 Training

We select the model with the *selModel* function and the parameter *'roberta'*.

In [127]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


We load the model's tokenizer.

In [128]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [129]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We convert the tokenized *dataframes* into *datasets* with the *OwnDataset* and *KaggleDataset* functions (the only difference is that *KaggleDataset* is built without *labels*).

In [130]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [131]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [132]:
training_args = TrainingArguments(
    output_dir="./results_caso7",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [133]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [134]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.604165,0.682123,0.789804
2,0.624200,0.621089,0.650279,0.729004


TrainOutput(global_step=896, training_loss=0.603698696408953, metrics={'train_runtime': 737.6993, 'train_samples_per_second': 19.412, 'train_steps_per_second': 1.215, 'total_flos': 1896933148753920.0, 'train_loss': 0.603698696408953, 'epoch': 2.0})

We evaluate the metrics defined earlier.

In [135]:
trainer.evaluate()

{'eval_loss': 0.6041653752326965,
 'eval_accuracy': 0.6821229050279329,
 'eval_f1': 0.7898042113040266,
 'eval_runtime': 5.7459,
 'eval_samples_per_second': 311.528,
 'eval_steps_per_second': 9.746,
 'epoch': 2.0}

#### 6.7.3 Export CSV

The predictions are stored so that they can be evaluated by others.

In [136]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [137]:
df_label_caso7 = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_caso7 = pd.concat([df_id, df_label_caso7], axis=1)
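The `reset_index(drop=True)` call matters because `pd.concat(..., axis=1)` aligns rows by index, not by position. The sketch below uses a hypothetical test frame with a non-default index to show what could go wrong without it. Whether the notebook's `df_test` actually has such an index is not shown in this section.

```python
import pandas as pd

# Hypothetical test frame whose index does not start at 0.
df_test_toy = pd.DataFrame({"id": ["a", "b", "c"]}, index=[10, 11, 12])
# Predicted labels get a fresh 0..n-1 index, as in the notebook.
labels_toy = pd.DataFrame([1, 0, 1], columns=["label"])

# Without resetting, the indexes do not overlap and rows are padded with NaN.
misaligned = pd.concat([df_test_toy["id"], labels_toy], axis=1)

# With reset_index(drop=True), ids and labels line up row by row.
aligned = pd.concat([df_test_toy["id"].reset_index(drop=True), labels_toy], axis=1)

print(misaligned.shape, aligned.shape)
```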

In [138]:
df_final_caso7['label'].value_counts()

label,count
1,3245
0,591
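As a quick sanity check on the three exports of cases 5 to 7, the share of rows predicted as label 1 can be computed from the counts printed above. The sketch only restates those counts and draws no conclusion about the true label distribution of the Kaggle set, which is unknown here.

```python
# Predicted label counts copied from the value_counts() outputs of cases 5, 6 and 7.
counts = {
    "caso5": {1: 3198, 0: 638},
    "caso6": {1: 3212, 0: 624},
    "caso7": {1: 3245, 0: 591},
}

# Total number of predictions and share predicted as label 1, per case.
totals = {name: c[0] + c[1] for name, c in counts.items()}
positive_share = {name: c[1] / totals[name] for name, c in counts.items()}

for name in counts:
    print(name, totals[name], round(positive_share[name], 4))
```

All three cases predict label 1 for roughly 83% to 85% of the rows, so the models lean heavily toward that class.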


In [139]:
df_final_caso7.to_csv('./submit/roberta_caso7.csv', index=False)

## 7. Training a fixed case with several models

The best model in the previous section was obtained with case 2, so we train other *Hugging Face* models using that preprocessing case.

### 7.1 RoBERTa

#### 7.1.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes the selected training case and the *dataframe* that holds the data.

In [20]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

An *iloc* lookup gives a quick check that the correct training case was selected.

In [21]:
X.iloc[0]

'China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen.'

In [22]:
X_kaggle.iloc[0]

'Five members of [the Common Cause Georgia] board accepted maximum campaign contributions.'

The data is split into training and test sets.

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 7.1.2 Training

We select the model with the *selModel* function and the parameter *'roberta'*.

In [24]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


We load the model's tokenizer.

In [25]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [26]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We convert the tokenized *dataframes* into *datasets* with the *OwnDataset* and *KaggleDataset* functions (the only difference is that *KaggleDataset* is built without *labels*).

In [27]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [28]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
training_args = TrainingArguments(
    output_dir="./results_roberta",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [30]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)
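The `compute_metrics` function passed to the trainer is defined earlier in the notebook. For orientation, here is a hedged sketch of a function with the same role, computing accuracy and binary F1 from logits with NumPy only. The name `compute_metrics_sketch` and the toy inputs are hypothetical, and the notebook's own function may be implemented differently (for example with `accuracy_score` and `f1_score`).

```python
import numpy as np


def compute_metrics_sketch(eval_pred):
    """Hypothetical metric function: accuracy and binary F1 from logits."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)

    # Confusion-matrix counts for the positive class (label 1).
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))

    accuracy = float(np.mean(preds == labels))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy check: predictions are [1, 0, 1, 1] against labels [1, 0, 0, 1].
toy_logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]])
toy_labels = [1, 0, 0, 1]
metrics = compute_metrics_sketch((toy_logits, toy_labels))
print(metrics)
```

The returned key `"f1"` is what `metric_for_best_model="f1"` refers to. The trainer reports it with an `eval_` prefix, as in `eval_f1`.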


We train the model.

In [31]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.609881,0.669832,0.774685
2,0.626900,0.621559,0.653631,0.738397


TrainOutput(global_step=896, training_loss=0.6057974440710885, metrics={'train_runtime': 679.5388, 'train_samples_per_second': 21.073, 'train_steps_per_second': 1.319, 'total_flos': 1896933148753920.0, 'train_loss': 0.6057974440710885, 'epoch': 2.0})

We evaluate the metrics defined earlier.

In [32]:
trainer.evaluate()

{'eval_loss': 0.6098808646202087,
 'eval_accuracy': 0.6698324022346369,
 'eval_f1': 0.7746854746473504,
 'eval_runtime': 4.2336,
 'eval_samples_per_second': 422.804,
 'eval_steps_per_second': 13.227,
 'epoch': 2.0}

#### 7.1.3 Export CSV

The predictions are stored so that they can be evaluated by others.

In [33]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [34]:
df_label_roberta = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_roberta = pd.concat([df_id, df_label_roberta], axis=1)

In [35]:
df_final_roberta['label'].value_counts()

label,count
1,3106
0,730


In [37]:
df_final_roberta.to_csv('./submit/caso2_roberta.csv', index=False)

### 7.2 BERT

#### 7.2.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes the selected training case and the *dataframe* that holds the data.

In [38]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

An *iloc* lookup gives a quick check that the correct training case was selected.

In [39]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [40]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

The data is split into training and test sets.

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 7.2.2 Training

We select the model with the *selModel* function and the parameter *'bert'*.

In [42]:
model_name = selModel('bert')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


We load the model's tokenizer.

In [43]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [44]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We convert the tokenized *dataframes* into *datasets* with the *OwnDataset* and *KaggleDataset* functions (the only difference is that *KaggleDataset* is built without *labels*).

In [45]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [46]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [47]:
training_args = TrainingArguments(
    output_dir="./results_bert",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [48]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [49]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.602166,0.676536,0.776879
2,0.623500,0.617187,0.650279,0.726399


TrainOutput(global_step=896, training_loss=0.6010246447154454, metrics={'train_runtime': 686.4326, 'train_samples_per_second': 20.861, 'train_steps_per_second': 1.305, 'total_flos': 1896933148753920.0, 'train_loss': 0.6010246447154454, 'epoch': 2.0})

We evaluate the metrics defined earlier.

In [50]:
trainer.evaluate()

{'eval_loss': 0.602165937423706,
 'eval_accuracy': 0.676536312849162,
 'eval_f1': 0.776878612716763,
 'eval_runtime': 4.5764,
 'eval_samples_per_second': 391.136,
 'eval_steps_per_second': 12.237,
 'epoch': 2.0}

#### 7.2.3 Export CSV

The predictions are stored so that they can be evaluated by others.

In [51]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [52]:
df_label_bert = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_bert = pd.concat([df_id, df_label_bert], axis=1)

In [53]:
df_final_bert['label'].value_counts()

label,count
1,3010
0,826


In [54]:
df_final_bert.to_csv('./submit/case2_bert.csv', index=False)

### 7.3 ALBERT

#### 7.3.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes the selected training case and the *dataframe* that holds the data.

In [55]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

An *iloc* lookup gives a quick check that the correct training case was selected.

In [56]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

In [57]:
X_kaggle.iloc[0]

'kasim-reed said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions." The statement concerns the topics of: campaign-finance,ethics,government-regulation.'

The data is split into training and test sets.

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 7.3.2 Training

We select the model with the *selModel* function and the parameter *'albert'*.

In [59]:
model_name = selModel('albert')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


We load the model's tokenizer.

In [60]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [61]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We convert the tokenized *dataframes* into *datasets* with the *OwnDataset* and *KaggleDataset* functions (the only difference is that *KaggleDataset* is built without *labels*).

In [62]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [63]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [64]:
training_args = TrainingArguments(
    output_dir="./results_albert",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [65]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [66]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.602819,0.67933,0.782246
2,0.626300,0.620244,0.64581,0.720951


TrainOutput(global_step=896, training_loss=0.6044660125459943, metrics={'train_runtime': 686.4964, 'train_samples_per_second': 20.86, 'train_steps_per_second': 1.305, 'total_flos': 1896933148753920.0, 'train_loss': 0.6044660125459943, 'epoch': 2.0})

We evaluate the metrics defined earlier.

In [67]:
trainer.evaluate()

{'eval_loss': 0.6028192043304443,
 'eval_accuracy': 0.6793296089385474,
 'eval_f1': 0.7822458270106222,
 'eval_runtime': 5.1324,
 'eval_samples_per_second': 348.767,
 'eval_steps_per_second': 10.911,
 'epoch': 2.0}

#### 7.3.3 Export CSV

The predictions are stored so that they can be evaluated by others.

In [68]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [69]:
df_label_albert = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_albert = pd.concat([df_id, df_label_albert], axis=1)

In [70]:
df_final_albert['label'].value_counts()

label,count
1,3089
0,747


In [71]:
df_final_albert.to_csv('./submit/case2_albert.csv', index=False)

### 7.4 DeBERTaV3

#### 7.4.1 Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is built with the helper function *selStatement*, which takes the selected training case and the *dataframe* that holds the data.

In [72]:
X =  selStatement(2, df)
y = df['label']

X_kaggle = selStatement(2, df_test)

An *iloc* lookup gives a quick check that the correct training case was selected.

In [73]:
X.iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

In [74]:
X_kaggle.iloc[0]

'kasim-reed (affiliated with the democrat party) said: "Five members of [the Common Cause Georgia] board accepted maximum campaign contributions."'

The data is split into training and test sets.

In [75]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 7.4.2 Training

We select the model with the *selModel* function and the parameter *'deberta'*.

In [76]:
model_name = selModel('deberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


We load the model's tokenizer.

In [77]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [78]:
# Training and evaluation
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

# Kaggle
kaggle_encodings = tokenizer(list(X_kaggle), truncation=True, padding=True)

We convert the tokenized *dataframes* into *datasets* with the *OwnDataset* and *KaggleDataset* functions (the only difference is that *KaggleDataset* is built without *labels*).

In [79]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

kaggle_dataset = KaggleDataset(kaggle_encodings)

We load the model.

In [80]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [81]:
training_args = TrainingArguments(
    output_dir="./results_deberta",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

We define a trainer.

In [82]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [83]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.5981,0.681564,0.787789
2,0.624600,0.614373,0.665363,0.742144


TrainOutput(global_step=896, training_loss=0.6021936961582729, metrics={'train_runtime': 685.051, 'train_samples_per_second': 20.904, 'train_steps_per_second': 1.308, 'total_flos': 1896933148753920.0, 'train_loss': 0.6021936961582729, 'epoch': 2.0})

We evaluate the metrics defined earlier.

In [84]:
trainer.evaluate()

{'eval_loss': 0.5981001257896423,
 'eval_accuracy': 0.6815642458100558,
 'eval_f1': 0.7877885331347729,
 'eval_runtime': 4.5183,
 'eval_samples_per_second': 396.163,
 'eval_steps_per_second': 12.394,
 'epoch': 2.0}

#### 7.4.3 Export CSV

The predictions are stored so that they can be evaluated by others.

In [85]:
predictions = trainer.predict(kaggle_dataset)
predicted_labels = predictions.predictions.argmax(-1)

In [86]:
df_label_deberta = pd.DataFrame(predicted_labels, columns=['label'])
df_id = df_test['id'].reset_index(drop=True)
df_final_deberta = pd.concat([df_id, df_label_deberta], axis=1)

In [87]:
df_final_deberta['label'].value_counts()

label,count
1,3223
0,613


In [88]:
df_final_deberta.to_csv('./submit/caso2_deberta.csv', index=False)

## 8. References

* [ColumnTransformer. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
* [RandomForestClassifier. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [RandomUnderSampler (Version 0.13.0). (n.d.). Imbalanced-learn.](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html)
* [SMOTE (Version 0.13.0). (n.d.). Imbalanced-learn.](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html)
* [StandardScaler. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
* [TfidfVectorizer. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
* [train_test_split. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
* [TruncatedSVD. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)