<header style="width:100%;position:relative">
  <div style="width:80%;float:right;">
    <h1>False Political Claim Detection</h1>
    <h3>Feature selection and preprocessing</h3>
    <h5>Group 2</h5>
  </div>
        <img style="width:15%;" src="images/logo.jpg" alt="UPM" />
</header>

# Table of contents

1. [Import libraries](#1.-Importar-librerias)
2. [Global variables](#2.-Variables-globales)
3. [Loading the dataframe](#3.-Carga-del-dataframe)
4. [Feature analysis and selection](#4.-Analisis-y-seleccion-de-las-caracteristicas)
5. [Loading the data and splitting into training and test sets](#5.-Carga-de-los-datos-y-division-en-entrenamiento-y-test)
6. [Data preprocessing](#6.-Preprocesado-de-los-datos)
    * 6.1 [Introduction](#6.1-Introduccion)
    * 6.2 [Preprocessing cases](#6.2-Casos-de-preprocesado)
    * 6.3 [Conclusions](#6.3-Conclusiones)
7. [References](#7.-Referencias)


## 1. Import libraries

In [5]:
# General import and load data
import pandas as pd
import numpy as np

# Resampling
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Splitting
from sklearn.model_selection import train_test_split

# Estimators
from sklearn.ensemble import RandomForestClassifier

# Evaluation
from sklearn.metrics import accuracy_score, f1_score

# Visualization
import matplotlib.pyplot as plt

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

print("All libraries were imported successfully.")

All libraries were imported successfully.


## 2. Global variables and helper functions

A single *seed* is fixed for the whole notebook to control randomness and make the results reproducible.

In [8]:
seed = 42
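Defining `seed` does not seed any library by itself; it only takes effect where it is passed explicitly (for example as `random_state`). A minimal sketch of how the same value could be propagated to the common random number generators follows. The helper name `set_global_seed` is hypothetical and not part of the notebook, and the NumPy and PyTorch calls run only when those packages are installed.

```python
import random


def set_global_seed(seed: int) -> None:
    """Seed every random number generator that happens to be installed."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's legacy global generator
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch generators
    except ImportError:
        pass


# Seeding twice with the same value reproduces the same draw.
set_global_seed(42)
first = random.random()
set_global_seed(42)
second = random.random()
print(first == second)  # True
```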

Function used to select the *statement* column on which the model will be trained.

In [10]:
def selStatement(case, df):
    # Map each training case (1-7) to the dataframe column that holds its text.
    mapping = {
        1: 'statement',
        2: 'speaker_statement',
        3: 'speaker_statement_subject',
        4: 'speaker_party_statement',
        5: 'speaker_party_statement_subject',
        6: 'speaker_speaker-job_party_statement',
        7: 'speaker_speaker-job_party_statement_subject'
    }

    if case not in mapping:
        raise ValueError("The value of 'case' is not valid. It must be between 1 and 7.")

    return df[mapping[case]]

Function used to select the model to train.

In [152]:
def selModel(model):
    # Map each short model key to its Hugging Face checkpoint name.
    mapping = {
        'deberta': 'microsoft/deberta-v3-base',
        'roberta': 'distilbert/distilroberta-base',
        'destilado': 'distilbert/distilbert-base-uncased',
        'berta': 'google-bert/bert-base-uncased'
    }

    if model not in mapping:
        raise ValueError("The selected model is not among the available ones. Check the value provided.")

    return mapping[model]

Function that evaluates the *Hugging Face* model by computing accuracy and F1 from the predicted logits.

In [14]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions),
    }
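To make the two metrics concrete, here is a small worked example in plain Python. It is a sketch with made-up logits and labels, not the notebook's data, and it reimplements the arithmetic by hand instead of calling `accuracy_score` and `f1_score`. Class 1 is treated as the positive class, which matches the default binary setting of `f1_score`.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


# Made-up logits for four examples (two classes) and their true labels.
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.3], [1.2, 0.4]]
labels = [0, 1, 0, 0]

# The predicted class is the position of the largest logit.
predictions = [argmax(row) for row in logits]

pairs = list(zip(predictions, labels))
accuracy = sum(p == t for p, t in pairs) / len(labels)

# F1 for the positive class: 2*TP / (2*TP + FP + FN).
tp = sum(p == 1 and t == 1 for p, t in pairs)
fp = sum(p == 1 and t == 0 for p, t in pairs)
fn = sum(p == 0 and t == 1 for p, t in pairs)
f1 = 2 * tp / (2 * tp + fp + fn)

print(predictions, accuracy, round(f1, 3))  # [0, 1, 1, 0] 0.75 0.667
```

One false positive out of four examples costs 25 points of accuracy but a third of the F1, which is why F1 is the stricter signal when the positive class is small.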

Class that wraps the tokenized encodings and the labels as a PyTorch *dataset*.

In [85]:
class OwnDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) 
                for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

## 3. Loading the dataframe

The data is loaded from *formated/train_hugging.csv*, which contains the data already processed by one of our teammates.

In [17]:
url = "formated/train_hugging.csv"
df = pd.read_csv(url)

print("Data loaded successfully\n")

Data loaded successfully



We also load the test set that we must predict for Kaggle from *formated/test_hugging.csv*.

In [19]:
url = "formated/test_hugging.csv"
df_test = pd.read_csv(url)

print("Test set loaded successfully\n")

Test set loaded successfully



## 4. Feature analysis

pandas is configured to display every column, and then *head* is called to inspect their contents.

In [22]:
pd.set_option('display.max_columns', None)
df.head()

| | id | label | statement | speaker_statement | speaker_statement_subject | speaker_party_statement | speaker_party_statement_subject | speaker_speaker-job_party_statement | speaker_speaker-job_party_statement_subject |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 81f884c64a7 | 1 | China is in the South China Sea and (building)... | donald-trump said: "China is in the South Chin... | donald-trump said: "China is in the South Chin... | donald-trump (affiliated with the republican p... | donald-trump (affiliated with the republican p... | donald-trump (President-Elect) (affiliated wit... | donald-trump (President-Elect) (affiliated wit... |
| 1 | 30c2723a188 | 0 | With the resources it takes to execute just ov... | chris-dodd said: "With the resources it takes ... | chris-dodd said: "With the resources it takes ... | chris-dodd (affiliated with the democrat party... | chris-dodd (affiliated with the democrat party... | chris-dodd (U.S. senator) (affiliated with the... | chris-dodd (U.S. senator) (affiliated with the... |
| 2 | 6936b216e5d | 0 | The (Wisconsin) governor has proposed tax give... | donna-brazile said: "The (Wisconsin) governor ... | donna-brazile said: "The (Wisconsin) governor ... | donna-brazile (affiliated with the democrat pa... | donna-brazile (affiliated with the democrat pa... | donna-brazile (Political commentator) (affilia... | donna-brazile (Political commentator) (affilia... |
| 3 | b5cd9195738 | 1 | Says her representation of an ex-boyfriend who... | rebecca-bradley said: "Says her representation... | rebecca-bradley said: "Says her representation... | rebecca-bradley (affiliated with the none part... | rebecca-bradley (affiliated with the none part... | rebecca-bradley (affiliated with the none part... | rebecca-bradley (affiliated with the none part... |
| 4 | 84f8dac7737 | 0 | At protests in Wisconsin against proposed coll... | republican-party-wisconsin said: "At protests ... | republican-party-wisconsin said: "At protests ... | republican-party-wisconsin (affiliated with th... | republican-party-wisconsin (affiliated with th... | republican-party-wisconsin (affiliated with th... | republican-party-wisconsin (affiliated with th... |


All column names are listed so they can be inspected.

In [24]:
df.columns

Index(['id', 'label', 'statement', 'speaker_statement',
       'speaker_statement_subject', 'speaker_party_statement',
       'speaker_party_statement_subject',
       'speaker_speaker-job_party_statement',
       'speaker_speaker-job_party_statement_subject'],
      dtype='object')

The columns are listed again as code so they are easier to reuse, and they are grouped into categories.

In [26]:
all_features = [
    # Identifier
    'id',

    # Target label
    'label',

    # Original text
    'statement',

    # Original text + speaker
    'speaker_statement',

    # Original text + speaker + subject
    'speaker_statement_subject',

    # Original text + speaker + party affiliation
    'speaker_party_statement',

    # Original text + speaker + party affiliation + subject
    'speaker_party_statement_subject',

    # Original text + speaker + party affiliation + speaker job
    'speaker_speaker-job_party_statement',

    # Original text + speaker + party affiliation + speaker job + subject
    'speaker_speaker-job_party_statement_subject'
]

As shown, the columns combine the available fields progressively. We also avoided removing words from the original text, because *Hugging Face* models already process it internally through their own *tokenizer*. Each column represents one training case. An example of each column follows:

* **Case 1 -** ***statement***

In [29]:
df['statement'].iloc[0]

'China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen.'

* **Case 2 -** ***speaker_statement***

In [31]:
df['speaker_statement'].iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Case 3 -** ***speaker_statement_subject***

In [33]:
df['speaker_statement_subject'].iloc[0]

'donald-trump said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

* **Case 4 -** ***speaker_party_statement***

In [35]:
df['speaker_party_statement'].iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Case 5 -** ***speaker_party_statement_subject***

In [37]:
df['speaker_party_statement_subject'].iloc[0]

'donald-trump (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'

* **Case 6 -** ***speaker_speaker-job_party_statement***

In [39]:
df['speaker_speaker-job_party_statement'].iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen."'

* **Case 7 -** ***speaker_speaker-job_party_statement_subject***

In [41]:
df['speaker_speaker-job_party_statement_subject'].iloc[0]

'donald-trump (President-Elect) (affiliated with the republican party) said: "China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen." The statement concerns the topics of: china,foreign-policy,military.'
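The seven examples above follow a regular template. The code that built these columns is not part of this notebook, so the following is only a sketch that reproduces the observed format; the function `build_text` and its parameter names are hypothetical. Optional fields that are missing are simply left out of the sentence.

```python
def build_text(statement, speaker, party=None, job=None, subject=None):
    """Compose one training text from the raw fields (hypothetical helper)."""
    prefix = speaker
    if job:
        prefix += f" ({job})"
    if party:
        prefix += f" (affiliated with the {party} party)"
    text = f'{prefix} said: "{statement}"'
    if subject:
        text += f" The statement concerns the topics of: {subject}."
    return text


statement = ("China is in the South China Sea and (building)a military fortress "
             "the likes of which perhaps the world has not seen.")

# Case 2 uses only the speaker; case 7 uses every available field.
case_2 = build_text(statement, "donald-trump")
case_7 = build_text(statement, "donald-trump", party="republican",
                    job="President-Elect", subject="china,foreign-policy,military")
print(case_2)
print(case_7)
```

Keeping the statement itself untouched inside the quotation marks lets the model see the original wording in every case, with the metadata added only as surrounding context.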

## 6. Training

### 6.1 Introduction

The goal of this notebook is to improve our classifier using only the text column. To that end, the text columns described in the previous section were prepared. Training uses well-known pretrained models from *Hugging Face*.

As shown in section 2, two functions were created to centralize the choice of training case and the choice of model.

### 6.2. Loading the data and splitting into training and test sets

We load the selected features into the variable X and the target *label* into the variable y. X is selected with the helper function *selStatement*, which takes the chosen training case and the *dataframe* that holds the data.

In [47]:
X =  selStatement(1, df)
y = df['label']

An *iloc* lookup gives a quick check that the correct training case was selected.

In [49]:
X.iloc[0]

'China is in the South China Sea and (building)a military fortress the likes of which perhaps the world has not seen.'

The data is split into training and test sets, with 20% of the rows held out for testing.

In [201]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
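The idea behind a seeded split can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not the algorithm `train_test_split` uses, so it will not produce the same rows; the function name `seeded_split` and the toy data are hypothetical. It shows the two properties that matter here: texts and labels stay aligned, and the same seed always gives the same partition.

```python
import math
import random


def seeded_split(texts, labels, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    n_test = math.ceil(len(texts) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(values, idx):
        return [values[i] for i in idx]

    # Texts and labels are selected with the same indices, so they stay paired.
    return (pick(texts, train_idx), pick(texts, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


texts = [f"claim {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = seeded_split(texts, labels)
print(len(X_tr), len(X_te))  # 8 2
```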

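### 6.3. DistilRoBERTa

We select the model with the *selModel* function. The run shown here uses the *'roberta'* key, which maps to the *distilbert/distilroberta-base* checkpoint; passing *'deberta'* instead would select DeBERTaV3.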

In [220]:
model_name = selModel('roberta')
print("Modelo seleccionado:", model_name)

Modelo seleccionado: distilbert/distilroberta-base


We load the tokenizer that belongs to the model.

In [223]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

We tokenize the texts.

In [225]:
train_encodings = tokenizer(list(X_train), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True)
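With `truncation=True` the tokenizer cuts sequences that exceed the model's maximum length, and with `padding=True` it pads every sequence to the longest one in the call, returning an attention mask that marks real tokens with 1 and padding with 0. The toy sketch below imitates that behaviour on lists of integers. It is not the Hugging Face implementation: the token ids, the padding id and the function `pad_and_truncate` are made up, and a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all of them to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what allows the encodings to be stacked into tensors later.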

We wrap the tokenized encodings and the labels as *datasets* with our own *OwnDataset* class.

In [228]:
train_dataset = OwnDataset(train_encodings, list(y_train))
test_dataset = OwnDataset(test_encodings, list(y_test))

We load the model.

In [232]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilbert/distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [233]:
training_args = TrainingArguments(
    output_dir="./results_models",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)
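The batch size and the number of epochs determine how many optimizer steps the run performs. The sketch below works through that arithmetic for a hypothetical training set of 8,000 rows, since the real row count is not shown in this notebook. It assumes a single device, no gradient accumulation, and a learning rate that decays linearly to zero with no warmup; if any of those assumptions does not hold for the actual run, the numbers change.

```python
import math

n_train = 8_000        # hypothetical number of training rows
train_batch_size = 16  # per_device_train_batch_size
num_epochs = 2         # num_train_epochs
base_lr = 2e-5         # learning_rate

steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step):
    """Learning rate at a given step under a linear decay to zero (assumed schedule)."""
    return base_lr * max(0.0, 1 - step / total_steps)


print(steps_per_epoch, total_steps)  # 500 1000
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

Because evaluation and checkpointing are both set to run once per epoch, this configuration would evaluate and save twice over the whole run.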

We define a trainer.

In [237]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)


We train the model.

In [240]:
trainer.train()



Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

We evaluate the metrics defined earlier.

In [None]:
trainer.evaluate()

## 7. References

Notes on candidate models:

* **distilbert-base-uncased**
    * Fast and lightweight.
    * Good accuracy on general classification tasks.
    * Well suited to short texts such as the statements used here.
    * Pretrained on general English, so it may need fine-tuning.
* **bert-base-uncased**
    * Slightly more accurate than DistilBERT, but slower.
    * A good starting point when compute capacity is available.
    * Robust on general binary classification tasks.
* **roberta-base**
    * Often outperforms BERT on many NLP tasks.
    * Heavier than DistilBERT, but with better performance.
    * A very good option when training time is not a major constraint.
* **albert-base-v2**
    * Compact and memory-efficient.
    * Can be faster than BERT with similar results.
    * Requires more fine-tuning to reach good results.

* [Fine-tuning RoBERTa for topic classification with Hugging Face Transformers and Datasets library. (n.d.). Medium.](https://achimoraites.medium.com/fine-tuning-roberta-for-topic-classification-with-hugging-face-transformers-and-datasets-library-c6f8432d0820)
* [ColumnTransformer. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
* [RandomForestClassifier. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [RandomUnderSampler, version 0.13.0. (n.d.). Imbalanced-learn.](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html)
* [SMOTE, version 0.13.0. (n.d.). Imbalanced-learn.](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html)
* [StandardScaler. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
* [TfidfVectorizer. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
* [train_test_split. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
* [TruncatedSVD. (n.d.). Scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)