In this script we train both the traditional models (which serve as a baseline) and the modern transformer-based models for binary clickbait classification.

In [1]:
import pandas as pd

df1 = pd.read_csv('dataset_clickbait_clasificacion_limpio.csv')
df2 = pd.read_csv('dataset_no_clickbait_clasificacion_limpio.csv')

# Merge both files into a single DataFrame
df = pd.concat([df1, df2], ignore_index=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)
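
Called without arguments, `drop_duplicates` only removes rows that are identical in every column, so a headline present in both files with different labels would survive as two contradictory rows. A minimal sketch of how such conflicts could be detected, assuming the columns are named `title` and `is_clickbait` as in the cells that follow; the toy rows are invented for illustration and the drop policy is only one possible choice:

```python
import pandas as pd

# Toy data (hypothetical): the second headline appears with both labels.
df = pd.DataFrame({
    "title": [
        "No creerás lo que pasó",
        "El Gobierno aprueba la ley",
        "El Gobierno aprueba la ley",
    ],
    "is_clickbait": [1, 0, 1],
})

# Row-level de-duplication keeps both copies because the labels differ.
df = df.drop_duplicates()

# Count distinct labels per title; more than one means a conflict.
labels_per_title = df.groupby("title")["is_clickbait"].nunique()
conflicting_titles = labels_per_title[labels_per_title > 1].index.tolist()
print(conflicting_titles)

# One possible policy: drop every row whose title has conflicting labels.
df_clean = df[~df["title"].isin(conflicting_titles)].reset_index(drop=True)
```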

We vectorize the titles with TF-IDF (unigrams and bigrams) and split the data into training (80%) and validation (20%) sets.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Make sure there are no NaN values in the text column
df = df.dropna(subset=['title'])

X = df['title']
y = df['is_clickbait']

vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=10000)
X_vec = vectorizer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)
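
In the cell above the vectorizer is fitted on the whole corpus before the split, so the vocabulary and IDF weights are also computed from the validation titles. A minimal sketch of a variant that splits the raw text first and fits TF-IDF on the training portion only; `stratify=y` is an addition that is not in the original cell, and the toy corpus is invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical) standing in for df['title'] and df['is_clickbait'].
X = [f"titular de ejemplo {i}" for i in range(20)]
y = [i % 2 for i in range(20)]

# Split the raw text first; stratify keeps the class ratio in both parts.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training text only, then reuse the fitted vocabulary on the rest.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)
print(X_train.shape, X_test.shape)
```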

We train the traditional models: SVM, Naive Bayes, Decision Tree, and Random Forest.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

models = {
    "SVM": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n{name}\n")
    print(classification_report(y_test, y_pred))


SVM

              precision    recall  f1-score   support

           0       0.72      0.77      0.74        82
           1       0.73      0.67      0.70        76

    accuracy                           0.72       158
   macro avg       0.72      0.72      0.72       158
weighted avg       0.72      0.72      0.72       158


Naive Bayes

              precision    recall  f1-score   support

           0       0.71      0.73      0.72        82
           1       0.70      0.68      0.69        76

    accuracy                           0.71       158
   macro avg       0.71      0.71      0.71       158
weighted avg       0.71      0.71      0.71       158


Decision Tree

              precision    recall  f1-score   support

           0       0.67      0.62      0.65        82
           1       0.62      0.67      0.65        76

    accuracy                           0.65       158
   macro avg       0.65      0.65      0.65       158
weighted avg       0.65      0.65     
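
Each report above rests on a single split with 158 validation titles, so differences of a few points between models may be noise. A hedged sketch of k-fold cross-validation with the vectorizer inside a pipeline, so that TF-IDF is refitted on each training fold; the toy corpus, the choice of 5 folds, and macro-F1 scoring are assumptions made for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical): two easily separable kinds of headline.
texts = [f"no creerás el truco número {i}" for i in range(15)] + [
    f"el congreso aprueba la ley {i}" for i in range(15)
]
labels = [1] * 15 + [0] * 15

# The vectorizer lives inside the pipeline, so each fold fits its own vocabulary.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=10000),
    LinearSVC(),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")
print(f"macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```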

We train the modern transformer-based models.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import numpy as np
import evaluate
import torch

# Hugging Face dataset
dataset = Dataset.from_pandas(df[['title', 'is_clickbait']].rename(columns={"title": "text", "is_clickbait": "label"}))
dataset = dataset.train_test_split(test_size=0.2)

# Preprocessing function
def preprocess(example):
    return tokenizer(example['text'], truncation=True, padding='max_length', max_length=128)

# Evaluation metric
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Model list
model_list = [
    "dccuchile/bert-base-spanish-wwm-cased",
    "PlanTL-GOB-ES/roberta-base-bne",
    "bertin-project/bertin-roberta-base-spanish",
    "microsoft/mdeberta-v3-base",
    "cardiffnlp/twitter-xlm-roberta-base",
    "Twitter/twhin-bert-base",
    "dccuchile/distilbert-base-spanish-uncased",
    "CenIA/albert-base-spanish"
]

results = []

for model_name in model_list:
    print(f"\n🔍 Entrenando modelo: {model_name}\n")

    # Tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Tokenize the dataset
    tokenized = dataset.map(preprocess, batched=True)

    # Training
    training_args = TrainingArguments(
        output_dir=f"./results_{model_name.replace('/', '_')}",
        eval_strategy="epoch",
        save_strategy="no",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        logging_dir=f"./logs_{model_name.replace('/', '_')}",
        logging_steps=10
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    trainer.train()

    # Evaluation
    eval_results = trainer.evaluate()
    results.append({
        "Modelo": model_name,
        "Accuracy": eval_results["eval_accuracy"]
    })
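
The `compute_metrics` function above reports accuracy only, whereas the baselines are reported with precision, recall, and F1. A minimal sketch of a drop-in variant built on scikit-learn instead of `evaluate` (a substitution, not the original approach); it relies only on the `(logits, labels)` pair that the original function already unpacks, and the metric key names are illustrative:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Return accuracy plus macro-averaged precision, recall and F1."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
    }

# Quick check with made-up logits for four examples.
fake_logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.3], [0.9, 1.1]])
fake_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

Since the original loop reads the accuracy back as `eval_accuracy`, the extra keys would presumably be available in `eval_results` under the same `eval_` prefix.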



🔍 Entrenando modelo: dccuchile/bert-base-spanish-wwm-cased



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch  Training Loss  Validation Loss  Accuracy
    1          0.402         0.327874  0.867089
    2         0.2968         0.314485  0.917722
    3         0.0011         0.375673  0.905063



🔍 Entrenando modelo: PlanTL-GOB-ES/roberta-base-bne



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at PlanTL-GOB-ES/roberta-base-bne and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch  Training Loss  Validation Loss  Accuracy
    1         0.4186         0.453675  0.822785
    2         0.4096         0.364286  0.892405
    3         0.0436         0.410912  0.911392



🔍 Entrenando modelo: bertin-project/bertin-roberta-base-spanish



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at bertin-project/bertin-roberta-base-spanish and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch  Training Loss  Validation Loss  Accuracy
    1         0.6773         0.701065  0.455696
    2         0.7058          0.69723  0.455696
    3         0.5163         0.497461  0.791139



🔍 Entrenando modelo: microsoft/mdeberta-v3-base



Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch  Training Loss  Validation Loss  Accuracy
    1         0.5413         0.377286   0.85443
    2         0.3247         0.566044   0.85443
    3         0.2374         0.562825  0.892405



🔍 Entrenando modelo: cardiffnlp/twitter-xlm-roberta-base



Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch  Training Loss  Validation Loss  Accuracy
    1         0.4282         0.311155  0.867089
    2         0.3139         0.417015  0.905063
    3         0.0458         0.503802  0.898734



🔍 Entrenando modelo: Twitter/twhin-bert-base



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Twitter/twhin-bert-base and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch  Training Loss  Validation Loss  Accuracy
    1         0.5306         0.362029  0.848101
    2         0.4528         0.373994  0.898734
    3         0.1951         0.610631  0.867089



🔍 Entrenando modelo: dccuchile/distilbert-base-spanish-uncased



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at dccuchile/distilbert-base-spanish-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch  Training Loss  Validation Loss  Accuracy
    1         0.3984         0.393586  0.835443
    2         0.4052         0.362561  0.867089
    3         0.2228         0.434421  0.848101



🔍 Entrenando modelo: CenIA/albert-base-spanish




Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at CenIA/albert-base-spanish and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch  Training Loss  Validation Loss  Accuracy
    1         0.5978         0.575582   0.71519
    2         0.5372         0.528708  0.797468
    3         0.4314         0.566902  0.816456
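
In several of the logs above the last epoch is not the best one: validation loss rises while training loss keeps falling, and the accuracy stored in `results` is whatever epoch 3 produced. A small sketch that compares the final and the best per-epoch accuracy, with the values copied from four of the tables above:

```python
# Per-epoch validation accuracy copied from the training logs above.
epoch_accuracy = {
    "dccuchile/bert-base-spanish-wwm-cased": [0.867089, 0.917722, 0.905063],
    "cardiffnlp/twitter-xlm-roberta-base": [0.867089, 0.905063, 0.898734],
    "Twitter/twhin-bert-base": [0.848101, 0.898734, 0.867089],
    "dccuchile/distilbert-base-spanish-uncased": [0.835443, 0.867089, 0.848101],
}

summary = {}
for name, accs in epoch_accuracy.items():
    # Epochs are 1-indexed in the logs, list positions are 0-indexed.
    best_epoch = max(range(len(accs)), key=accs.__getitem__) + 1
    summary[name] = {"final": accs[-1], "best": max(accs), "best_epoch": best_epoch}
    print(f"{name}: final={accs[-1]:.6f} best={max(accs):.6f} (epoch {best_epoch})")
```

`TrainingArguments` offers `load_best_model_at_end` and `metric_for_best_model` for this purpose; they depend on checkpoints being saved, so `save_strategy="no"` would have to change. Picking the epoch on the same split that is used for the final comparison would make the reported figure optimistic, so a separate validation split would be advisable.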


In [None]:
# Comparison of the results (sorted by accuracy, highest first)
results_df = pd.DataFrame(results)
results_df.sort_values(by="Accuracy", ascending=False)

   Modelo                                      Accuracy
1  PlanTL-GOB-ES/roberta-base-bne              0.911392
0  dccuchile/bert-base-spanish-wwm-cased       0.905063
4  cardiffnlp/twitter-xlm-roberta-base         0.898734
3  microsoft/mdeberta-v3-base                  0.892405
5  Twitter/twhin-bert-base                     0.867089
6  dccuchile/distilbert-base-spanish-uncased   0.848101
7  CenIA/albert-base-spanish                   0.816456
2  bertin-project/bertin-roberta-base-spanish  0.791139
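
With an evaluation split of 158 titles, every accuracy in the table is a multiple of 1/158, and the top two models differ by a single headline (144 versus 143 correct). A short sketch that converts the accuracies back into counts and attaches an approximate 95% Wilson score interval; the helper name `wilson_interval` is introduced here for illustration:

```python
import math

N_EVAL = 158  # size of the 20% evaluation split

# Final accuracies copied from the comparison table above.
accuracy = {
    "PlanTL-GOB-ES/roberta-base-bne": 0.911392,
    "dccuchile/bert-base-spanish-wwm-cased": 0.905063,
    "cardiffnlp/twitter-xlm-roberta-base": 0.898734,
    "microsoft/mdeberta-v3-base": 0.892405,
    "Twitter/twhin-bert-base": 0.867089,
    "dccuchile/distilbert-base-spanish-uncased": 0.848101,
    "CenIA/albert-base-spanish": 0.816456,
    "bertin-project/bertin-roberta-base-spanish": 0.791139,
}

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion correct/n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Recover the number of correctly classified titles for each model.
correct_counts = {name: round(acc * N_EVAL) for name, acc in accuracy.items()}
for name, correct in correct_counts.items():
    low, high = wilson_interval(correct, N_EVAL)
    print(f"{name}: {correct}/{N_EVAL} correct, 95% CI [{low:.3f}, {high:.3f}]")
```

The interval for the best model is roughly nine percentage points wide and contains the accuracies of the next four models, so the ranking among the top entries should be read with caution.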
