<header style="width:100%;position:relative">
  <div style="width:80%;float:right;">
    <h1>False Political Claim Detection</h1>
    <h3>Selección de características y preprocesado</h3>
    <h5>Grupo 2</h5>
  </div>
        <img style="width:15%;" src="images/logo.jpg" alt="UPM" />
</header>

# Índice

1. [Importar librerias](#1.-Importar-librerias)
2. [Variables globales](#2.-Variables-globales)
3. [Carga del dataframe](#3.-Carga-del-dataframe)
4. [Análisis y selección de las características](#4.-Analisis-y-seleccion-de-las-caracteristicas)
5. [Carga de los datos y división en entrenamiento y test](#5.-Carga-de-los-datos-y-division-en-entrenamiento-y-test)
6. [Preprocesado de los datos](#6.-Preprocesado-de-los-datos)
    * 6.1 [Introducción](#6.1-Introduccion)
    * 6.2 [Casos de preprocesado](#6.2-Casos-de-preprocesado)
    * 6.3 [Conclusiones](#6.3-Conclusiones)
7. [Referencias](#7.-Referencias)


## 1. Importar librerias

In [22]:
# General import and load data
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Splitting
from sklearn.model_selection import train_test_split

# Estimators
from sklearn.ensemble import RandomForestClassifier

# Evaluation
from sklearn.metrics import accuracy_score, f1_score

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

print("Todas las librerias fueron correctamente importadas.")

Todas las librerias fueron correctamente importadas.


## 2. Variables globales y funciones auxiliares

Se fija un *seed* para todo el documento para fijar la aleatoriedad y así obtener resultados replicables.

In [23]:
seed = 42

Función utilizada para seleccionar el *statement* con el que se quiere entrenar el modelo.

In [24]:
def selStatement(case, df):
    mapping = {
        1: 'statement',
        2: 'speaker_statement',
        3: 'speaker_statement_subject',
        4: 'speaker_party_statement',
        5: 'speaker_party_statement_subject',
        6: 'speaker_speaker-job_party_statement',
        7: 'speaker_speaker-job_party_statement_subject'
    }

    if case not in mapping:
        raise ValueError("El valor de 'case' no es válido. Debe estar entre 1 y 7.")

    return df[mapping[case]]

Función utilizada para seleccionar el modelo con el que se quiere entrenar.

In [25]:
def selModel(model):
    mapping = {
        'deberta': 'microsoft/deberta-v3-base',
        'roberta': 'distilbert/distilroberta-base',
        'destilado': 'distilbert/distilbert-base-uncased',
        'berta': 'google-bert/bert-base-uncased'
    }

    if model not in mapping:
        raise ValueError("El modelo seleccionado no existe entre los disponibles. Revise el valor introducido.")

    return mapping[model]

Función encargada de evaluar el modelo de *Hugging Face*.

In [26]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions),
    }

Función para transformar el *dataframe* en *dataset*.

In [27]:
class OwnDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) 
                for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

## 3. Carga del dataframe

Se cargan los datos de la ruta *formated/train_hugging.csv*, los cuales son los datos ya procesados por uno de nuestros compañeros.

In [28]:
url = "formated/train_hugging.csv"
df = pd.read_csv(url)

print("Datos cargados correctamente\n")

Datos cargados correctamente



También cargamos el test que debemos predecir para Kaggle de la ruta *formated/test_hugging.csv*.

In [29]:
url = "formated/test_hugging.csv"
df_test = pd.read_csv(url)

print("Test cargado correctamente\n")

Test cargado correctamente



## 4. Analisis de las caracteristicas

Se configura pandas para que muestre todas las columnas y después se realiza un head para ver el contenido de las mismas.

In [30]:
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,id,label,statement,speaker_statement,speaker_statement_subject,speaker_party_statement,speaker_party_statement_subject,speaker_speaker-job_party_statement,speaker_speaker-job_party_statement_subject
0,81f884c64a7,1,China is in the South China Sea and (building)...,"donald-trump said: ""China is in the South Chin...","donald-trump said: ""China is in the South Chin...",donald-trump (affiliated with the republican p...,donald-trump (affiliated with the republican p...,donald-trump (President-Elect) (affiliated wit...,donald-trump (President-Elect) (affiliated wit...
1,30c2723a188,0,With the resources it takes to execute just ov...,"chris-dodd said: ""With the resources it takes ...","chris-dodd said: ""With the resources it takes ...",chris-dodd (affiliated with the democrat party...,chris-dodd (affiliated with the democrat party...,chris-dodd (U.S. senator) (affiliated with the...,chris-dodd (U.S. senator) (affiliated with the...
2,6936b216e5d,0,The (Wisconsin) governor has proposed tax give...,"donna-brazile said: ""The (Wisconsin) governor ...","donna-brazile said: ""The (Wisconsin) governor ...",donna-brazile (affiliated with the democrat pa...,donna-brazile (affiliated with the democrat pa...,donna-brazile (Political commentator) (affilia...,donna-brazile (Political commentator) (affilia...
3,b5cd9195738,1,Says her representation of an ex-boyfriend who...,"rebecca-bradley said: ""Says her representation...","rebecca-bradley said: ""Says her representation...",rebecca-bradley (affiliated with the none part...,rebecca-bradley (affiliated with the none part...,rebecca-bradley (affiliated with the none part...,rebecca-bradley (affiliated with the none part...
4,84f8dac7737,0,At protests in Wisconsin against proposed coll...,"republican-party-wisconsin said: ""At protests ...","republican-party-wisconsin said: ""At protests ...",republican-party-wisconsin (affiliated with th...,republican-party-wisconsin (affiliated with th...,republican-party-wisconsin (affiliated with th...,republican-party-wisconsin (affiliated with th...


Se muestran todas las columnas para poder observarlas.

In [31]:
df.columns

Index(['id', 'label', 'statement', 'speaker_statement',
       'speaker_statement_subject', 'speaker_party_statement',
       'speaker_party_statement_subject',
       'speaker_speaker-job_party_statement',
       'speaker_speaker-job_party_statement_subject'],
      dtype='object')

Se muestran todas las columnas en formato código para que sea más sencillo su uso. Además se estructuran en diversas categorias.

In [32]:
all_features = [
    # Identificador
    'id',

    # Etiqueta objetivo
    'label',

    # Texto original
    'statement',

    # Texto original + Speaker
    'speaker_statement',

    # Texto original + Speaker + Subject
    'speaker_statement_subject',

    # Texto original + Speaker + Party affilation
    'speaker_party_statement',
    
    # Texto original + Speaker + Party affilation + Subject
    'speaker_party_statement_subject',

    # Texto original + Speaker + Party affilation + Speaker Job
    'speaker_speaker-job_party_statement',

    # Texto original + Speaker + Party affilation + Speaker Job + Subject
    'speaker_speaker-job_party_statement_subject'
]

Como se observa se ha buscado unificar de forma progresiva las columnas. Además, se ha intentado no quitar palabras del texto original ya que los modelos de *Hugging Face* ya los tratan internamente mediante su propio *tokenizador*. Cada una de las columnas representara un caso de entrenamiento. A continuación se muestra un ejemplo de cada una de las columnas:

* **Caso 1 -** ***statement***

In [33]:
df['statement'].iloc[0]

'China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen.'

* **Caso 2 -** ***speaker_statement***

In [34]:
df['speaker_statement'].iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Caso 3 -** ***speaker_statement_subject***

In [35]:
df['speaker_statement_subject'].iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

* **Caso 4 -** ***speaker_party_statement***

In [36]:
df['speaker_party_statement'].iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Caso 5 -** ***speaker_party_statement_subject***

In [37]:
df['speaker_party_statement_subject'].iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

* **Caso 6 -** ***speaker_speaker-job_party_statement***

In [38]:
df['speaker_speaker-job_party_statement'].iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Caso 7 -** ***speaker_speaker-job_party_statement_subject***

In [39]:
df['speaker_speaker-job_party_statement_subject'].iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

## 6. Entrenamiento

El objetivo de este notebook es buscar mejorar nuestro clasificador utilizando solamente la columna textual. Para ello se ha realizado el preprocesamiento explicado en el anterior apartado. A la hora de entrenarlos se utilizarán modelos famosos de *Hugging Face*.

Como puede observarse en el apartado dos, se han creado dos funciones para centralizar tanto el caso de entrenamiento como la selección del modelo.

### 6.1. Caso 1

#### 6.1.1 Carga de los datos y division en entrenamiento y test

Cargamos las caracteristicas seleccionadas en la variable X y el objetivo *label* en la variable y. Para la seleccion de la variable X se utilizará la función auxiliar *selStatement*, introducciendo como parámetros el caso de entrenamiento seleccionado y el *dataframe* donde estan almacenados los datos.

In [40]:
X =  selStatement(1, df)
y = df['label']

Se realiza un *iloc* para comprobar de forma sencilla que hemos escogido el caso de entrenamiento correcto

In [41]:
X.iloc[0]

'China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen.'

Se separa en entrenamiento y test.

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.1.2 Entrenamiento

Seleccionamos el modelo mediante la función *selModel* y el parámetro *'deberta'*.

In [43]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


Seleccionamos el tokenizador del modelo.

In [44]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenizamos los textos.

In [45]:
train_encodings = tokenizer(list(X), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

Transformamos los *dataframes* en *dataset* con la funcion propia *OwnDataset*.

In [46]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

Cargamos el modelo.

In [47]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [48]:
training_args = TrainingArguments(
    output_dir="./results_caso1",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

Definimos un entrenador.

In [49]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

  trainer = Trainer(


Entrenamos el modelo.

In [50]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.645528,0.658659,0.794207
2,0.656300,0.64137,0.658659,0.794207


TrainOutput(global_step=896, training_loss=0.6547049113682338, metrics={'train_runtime': 330.9577, 'train_samples_per_second': 43.268, 'train_steps_per_second': 2.707, 'total_flos': 1896933148753920.0, 'train_loss': 0.6547049113682338, 'epoch': 2.0})

Evaluamos los estadísticos establecidos previamente.

In [51]:
trainer.evaluate()

{'eval_loss': 0.6455276608467102,
 'eval_accuracy': 0.658659217877095,
 'eval_f1': 0.7942068036375884,
 'eval_runtime': 2.232,
 'eval_samples_per_second': 801.988,
 'eval_steps_per_second': 25.09,
 'epoch': 2.0}

Almacenamos el modelo para poder reutilizarse.

In [52]:
trainer.save_model("roberta_caso1")
tokenizer.save_pretrained("roberta_caso1")

('roberta_caso1\\tokenizer_config.json',
 'roberta_caso1\\special_tokens_map.json',
 'roberta_caso1\\vocab.json',
 'roberta_caso1\\merges.txt',
 'roberta_caso1\\added_tokens.json',
 'roberta_caso1\\tokenizer.json')

### 6.2. Caso 2

#### 6.2.1 Carga de los datos y division en entrenamiento y test

Cargamos las caracteristicas seleccionadas en la variable X y el objetivo *label* en la variable y. Para la seleccion de la variable X se utilizará la función auxiliar *selStatement*, introducciendo como parámetros el caso de entrenamiento seleccionado y el *dataframe* donde estan almacenados los datos.

In [53]:
X =  selStatement(2, df)
y = df['label']

Se realiza un *iloc* para comprobar de forma sencilla que hemos escogido el caso de entrenamiento correcto

In [54]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

Se separa en entrenamiento y test.

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.2.2 Entrenamiento

Seleccionamos el modelo mediante la función *selModel* y el parámetro *'deberta'*.

In [56]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


Seleccionamos el tokenizador del modelo.

In [57]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenizamos los textos.

In [58]:
train_encodings = tokenizer(list(X), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

Transformamos los *dataframes* en *dataset* con la funcion propia *OwnDataset*.

In [59]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

Cargamos el modelo.

In [60]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [61]:
training_args = TrainingArguments(
    output_dir="./results_caso2",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

Definimos un entrenador.

In [62]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

  trainer = Trainer(


Entrenamos el modelo.

In [63]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.646287,0.658659,0.794207
2,0.655000,0.641471,0.658659,0.794207


TrainOutput(global_step=896, training_loss=0.6533178261348179, metrics={'train_runtime': 448.7501, 'train_samples_per_second': 31.911, 'train_steps_per_second': 1.997, 'total_flos': 1896933148753920.0, 'train_loss': 0.6533178261348179, 'epoch': 2.0})

Evaluamos los estadísticos establecidos previamente.

In [64]:
trainer.evaluate()

{'eval_loss': 0.6462871432304382,
 'eval_accuracy': 0.658659217877095,
 'eval_f1': 0.7942068036375884,
 'eval_runtime': 3.0775,
 'eval_samples_per_second': 581.639,
 'eval_steps_per_second': 18.197,
 'epoch': 2.0}

Almacenamos el modelo para poder reutilizarse.

In [65]:
trainer.save_model("roberta_caso2")
tokenizer.save_pretrained("roberta_caso2")

('roberta_caso2\\tokenizer_config.json',
 'roberta_caso2\\special_tokens_map.json',
 'roberta_caso2\\vocab.json',
 'roberta_caso2\\merges.txt',
 'roberta_caso2\\added_tokens.json',
 'roberta_caso2\\tokenizer.json')

### 6.4. Caso 3

#### 6.4.1 Carga de los datos y division en entrenamiento y test

Cargamos las caracteristicas seleccionadas en la variable X y el objetivo *label* en la variable y. Para la seleccion de la variable X se utilizará la función auxiliar *selStatement*, introducciendo como parámetros el caso de entrenamiento seleccionado y el *dataframe* donde estan almacenados los datos.

In [66]:
X =  selStatement(3, df)
y = df['label']

Se realiza un *iloc* para comprobar de forma sencilla que hemos escogido el caso de entrenamiento correcto

In [67]:
X.iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

Se separa en entrenamiento y test.

In [68]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.4.2 Entrenamiento

Seleccionamos el modelo mediante la función *selModel* y el parámetro *'deberta'*.

In [69]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


Seleccionamos el tokenizador del modelo.

In [70]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenizamos los textos.

In [71]:
train_encodings = tokenizer(list(X), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

Transformamos los *dataframes* en *dataset* con la funcion propia *OwnDataset*.

In [72]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

Cargamos el modelo.

In [73]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [74]:
training_args = TrainingArguments(
    output_dir="./results_caso3",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

Definimos un entrenador.

In [75]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

  trainer = Trainer(


Entrenamos el modelo.

In [76]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.646832,0.658659,0.794207
2,0.654900,0.641615,0.658659,0.794207


TrainOutput(global_step=896, training_loss=0.6540344783238002, metrics={'train_runtime': 447.6495, 'train_samples_per_second': 31.989, 'train_steps_per_second': 2.002, 'total_flos': 1896933148753920.0, 'train_loss': 0.6540344783238002, 'epoch': 2.0})

Evaluamos los estadísticos establecidos previamente.

In [77]:
trainer.evaluate()

{'eval_loss': 0.6468316316604614,
 'eval_accuracy': 0.658659217877095,
 'eval_f1': 0.7942068036375884,
 'eval_runtime': 3.6824,
 'eval_samples_per_second': 486.1,
 'eval_steps_per_second': 15.208,
 'epoch': 2.0}

Almacenamos el modelo para poder reutilizarse.

In [78]:
trainer.save_model("roberta_caso3")
tokenizer.save_pretrained("roberta_caso3")

('roberta_caso3\\tokenizer_config.json',
 'roberta_caso3\\special_tokens_map.json',
 'roberta_caso3\\vocab.json',
 'roberta_caso3\\merges.txt',
 'roberta_caso3\\added_tokens.json',
 'roberta_caso3\\tokenizer.json')

### 6.5. Caso 4

#### 6.6.1 Carga de los datos y division en entrenamiento y test

Cargamos las caracteristicas seleccionadas en la variable X y el objetivo *label* en la variable y. Para la seleccion de la variable X se utilizará la función auxiliar *selStatement*, introducciendo como parámetros el caso de entrenamiento seleccionado y el *dataframe* donde estan almacenados los datos.

In [79]:
X =  selStatement(4, df)
y = df['label']

Se realiza un *iloc* para comprobar de forma sencilla que hemos escogido el caso de entrenamiento correcto

In [80]:
X.iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

Se separa en entrenamiento y test.

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.5.2 Entrenamiento

Seleccionamos el modelo mediante la función *selModel* y el parámetro *'deberta'*.

In [82]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


Seleccionamos el tokenizador del modelo.

In [83]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenizamos los textos.

In [84]:
train_encodings = tokenizer(list(X), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

Transformamos los *dataframes* en *dataset* con la funcion propia *OwnDataset*.

In [86]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

Cargamos el modelo.

In [87]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [88]:
training_args = TrainingArguments(
    output_dir="./results_caso4",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

Definimos un entrenador.

In [89]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

  trainer = Trainer(


Entrenamos el modelo.

In [90]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.64744,0.658659,0.794207
2,0.655100,0.641454,0.658659,0.794207


TrainOutput(global_step=896, training_loss=0.6541287217821393, metrics={'train_runtime': 544.6036, 'train_samples_per_second': 26.294, 'train_steps_per_second': 1.645, 'total_flos': 1896933148753920.0, 'train_loss': 0.6541287217821393, 'epoch': 2.0})

Evaluamos los estadísticos establecidos previamente.

In [91]:
trainer.evaluate()

{'eval_loss': 0.6474397778511047,
 'eval_accuracy': 0.658659217877095,
 'eval_f1': 0.7942068036375884,
 'eval_runtime': 2.4925,
 'eval_samples_per_second': 718.147,
 'eval_steps_per_second': 22.467,
 'epoch': 2.0}

Almacenamos el modelo para poder reutilizarse.

In [92]:
trainer.save_model("roberta_caso4")
tokenizer.save_pretrained("roberta_caso4")

('roberta_caso4\\tokenizer_config.json',
 'roberta_caso4\\special_tokens_map.json',
 'roberta_caso4\\vocab.json',
 'roberta_caso4\\merges.txt',
 'roberta_caso4\\added_tokens.json',
 'roberta_caso4\\tokenizer.json')

### 6.6. Caso 5

#### 6.6.1 Carga de los datos y division en entrenamiento y test

Cargamos las caracteristicas seleccionadas en la variable X y el objetivo *label* en la variable y. Para la seleccion de la variable X se utilizará la función auxiliar *selStatement*, introducciendo como parámetros el caso de entrenamiento seleccionado y el *dataframe* donde estan almacenados los datos.

In [93]:
X =  selStatement(5, df)
y = df['label']

Se realiza un *iloc* para comprobar de forma sencilla que hemos escogido el caso de entrenamiento correcto

In [94]:
X.iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

Se separa en entrenamiento y test.

In [95]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.6.2 Entrenamiento

Seleccionamos el modelo mediante la función *selModel* y el parámetro *'deberta'*.

In [96]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


Seleccionamos el tokenizador del modelo.

In [97]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenizamos los textos.

In [98]:
train_encodings = tokenizer(list(X), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

Transformamos los *dataframes* en *dataset* con la funcion propia *OwnDataset*.

In [99]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

Cargamos el modelo.

In [100]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [101]:
training_args = TrainingArguments(
    output_dir="./results_caso5",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

Definimos un entrenador.

In [102]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

  trainer = Trainer(


Entrenamos el modelo.

In [103]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.645673,0.658659,0.794207
2,0.655500,0.641781,0.658659,0.794207


TrainOutput(global_step=896, training_loss=0.6545685359409877, metrics={'train_runtime': 336.0034, 'train_samples_per_second': 42.619, 'train_steps_per_second': 2.667, 'total_flos': 1896933148753920.0, 'train_loss': 0.6545685359409877, 'epoch': 2.0})

Evaluamos los estadísticos establecidos previamente.

In [104]:
trainer.evaluate()

{'eval_loss': 0.6456727981567383,
 'eval_accuracy': 0.658659217877095,
 'eval_f1': 0.7942068036375884,
 'eval_runtime': 2.7181,
 'eval_samples_per_second': 658.557,
 'eval_steps_per_second': 20.603,
 'epoch': 2.0}

Almacenamos el modelo para poder reutilizarse.

In [105]:
trainer.save_model("roberta_caso5")
tokenizer.save_pretrained("roberta_caso5")

('roberta_caso5\\tokenizer_config.json',
 'roberta_caso5\\special_tokens_map.json',
 'roberta_caso5\\vocab.json',
 'roberta_caso5\\merges.txt',
 'roberta_caso5\\added_tokens.json',
 'roberta_caso5\\tokenizer.json')

### 6.7. Caso 6

#### 6.7.1 Carga de los datos y division en entrenamiento y test

Cargamos las caracteristicas seleccionadas en la variable X y el objetivo *label* en la variable y. Para la seleccion de la variable X se utilizará la función auxiliar *selStatement*, introducciendo como parámetros el caso de entrenamiento seleccionado y el *dataframe* donde estan almacenados los datos.

In [106]:
X =  selStatement(6, df)
y = df['label']

Se realiza un *iloc* para comprobar de forma sencilla que hemos escogido el caso de entrenamiento correcto

In [107]:
X.iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

Se separa en entrenamiento y test.

In [108]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.7.2 Entrenamiento

Seleccionamos el modelo mediante la función *selModel* y el parámetro *'deberta'*.

In [109]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


Seleccionamos el tokenizador del modelo.

In [110]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenizamos los textos.

In [111]:
train_encodings = tokenizer(list(X), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

Transformamos los *dataframes* en *dataset* con la funcion propia *OwnDataset*.

In [112]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

Cargamos el modelo.

In [113]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [114]:
training_args = TrainingArguments(
    output_dir="./results_caso6",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

Definimos un entrenador.

In [115]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

  trainer = Trainer(


Entrenamos el modelo.

In [116]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.647161,0.658659,0.794207
2,0.655000,0.641081,0.658659,0.794207


TrainOutput(global_step=896, training_loss=0.6539074352809361, metrics={'train_runtime': 335.589, 'train_samples_per_second': 42.671, 'train_steps_per_second': 2.67, 'total_flos': 1896933148753920.0, 'train_loss': 0.6539074352809361, 'epoch': 2.0})

Evaluamos los estadísticos establecidos previamente.

In [117]:
trainer.evaluate()

{'eval_loss': 0.647160530090332,
 'eval_accuracy': 0.658659217877095,
 'eval_f1': 0.7942068036375884,
 'eval_runtime': 2.6734,
 'eval_samples_per_second': 669.551,
 'eval_steps_per_second': 20.947,
 'epoch': 2.0}

Almacenamos el modelo para poder reutilizarse.

In [118]:
trainer.save_model("roberta_caso6")
tokenizer.save_pretrained("roberta_caso6")

('roberta_caso6\\tokenizer_config.json',
 'roberta_caso6\\special_tokens_map.json',
 'roberta_caso6\\vocab.json',
 'roberta_caso6\\merges.txt',
 'roberta_caso6\\added_tokens.json',
 'roberta_caso6\\tokenizer.json')

### 6.8. Caso 7

#### 6.8.1 Carga de los datos y division en entrenamiento y test

Cargamos las caracteristicas seleccionadas en la variable X y el objetivo *label* en la variable y. Para la seleccion de la variable X se utilizará la función auxiliar *selStatement*, introducciendo como parámetros el caso de entrenamiento seleccionado y el *dataframe* donde estan almacenados los datos.

In [119]:
X =  selStatement(7, df)
y = df['label']

Se realiza un *iloc* para comprobar de forma sencilla que hemos escogido el caso de entrenamiento correcto

In [120]:
X.iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

Se separa en entrenamiento y test.

In [121]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

#### 6.8.2 Entrenamiento

Seleccionamos el modelo mediante la función *selModel* y el parámetro *'deberta'*.

In [122]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


Seleccionamos el tokenizador del modelo.

In [123]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenizamos los textos.

In [124]:
train_encodings = tokenizer(list(X), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)

Transformamos los *dataframes* en *dataset* con la funcion propia *OwnDataset*.

In [125]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

Cargamos el modelo.

In [126]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [127]:
training_args = TrainingArguments(
    output_dir="./results_caso7",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

Definimos un entrenador.

In [128]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

  trainer = Trainer(


Entrenamos el modelo.

In [129]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.646643,0.658659,0.794207
2,0.654900,0.641809,0.658659,0.794207


TrainOutput(global_step=896, training_loss=0.6541224888392857, metrics={'train_runtime': 338.7094, 'train_samples_per_second': 42.278, 'train_steps_per_second': 2.645, 'total_flos': 1896933148753920.0, 'train_loss': 0.6541224888392857, 'epoch': 2.0})

Evaluamos los estadísticos establecidos previamente.

In [130]:
trainer.evaluate()

{'eval_loss': 0.6466431617736816,
 'eval_accuracy': 0.658659217877095,
 'eval_f1': 0.7942068036375884,
 'eval_runtime': 2.8947,
 'eval_samples_per_second': 618.367,
 'eval_steps_per_second': 19.346,
 'epoch': 2.0}

Almacenamos el modelo para poder reutilizarse.

In [131]:
trainer.save_model("roberta_caso7")
tokenizer.save_pretrained("roberta_caso7")

('roberta_caso7\\tokenizer_config.json',
 'roberta_caso7\\special_tokens_map.json',
 'roberta_caso7\\vocab.json',
 'roberta_caso7\\merges.txt',
 'roberta_caso7\\added_tokens.json',
 'roberta_caso7\\tokenizer.json')

# 7. Referencias

* [ColumnTransformer. (s. f.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
* [RandomForestClassifier. (s. f.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [RandomUnderSampler — Version 0.13.0. (s. f.). Imbalance learn.](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html)
* [SMOTE — Version 0.13.0. (s. f.). Imbalanced learn.](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html)
* [StandardScaler. (s. f.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
* [TfidfVectorizer. (s. f.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
* [train_test_split. (s. f.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
* [TruncatedSVD. (s. f.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)