# Model Development

In this notebook we prepare and train our model, using `bert-base-spanish-wwm-uncased` as the base (foundation) model.

Our objective is to build a classifier capable of predicting whether a news article is fake or real. To accomplish this, we will fine-tune the base model using the dataset prepared in `data-preparation.ipynb`.

These are the steps we are going to follow:

1. Load our data - Stored in a `.csv` file.
2. Preprocess and tokenize the data - Implement a truncation strategy to fit the model's 512-token limit.
3. Create a `Dataset` object.
4. Fine-tune the pre-trained model.
5. Evaluate the results - Check metrics on the validation set.

### Import libraries

In [4]:
import torch
import pandas as pd
import evaluate
import numpy as np

from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset

### Device agnostic code

In [5]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")

Running on: cuda


### Global variables

The paths may differ depending on whether the notebook runs on **Google Colab** or locally.

In [None]:
LONG_DATASET_PATH = "../data/final_long_dataset.csv"
SHORT_DATASET_PATH = "../data/final_short_dataset.csv"
SAVE_STAGE1_MODEL_PATH = "../models/fake-news-classifier-stage1"
SAVE_FINAL_MODEL_PATH = "/content/models/fake-news-classifier-final-v2"
SAVE_TOKENIZER_PATH = "../tokenizer"

# Data Preprocessing & Tokenization

In this stage we are going to:

* Load our data from the `.csv` files.
* Clean the text (if needed).
* Tokenize the data - Handling the truncation strategy.
* Create the `Dataset` objects - With train/val/test splits.

#### Short Articles Dataset (`≃50k articles`)


In [7]:
short_df_temp = pd.read_csv(SHORT_DATASET_PATH, names=["label", "title", "text"], header=0)
print(f"Short articles dataframe was loaded from {SHORT_DATASET_PATH} successfully with a total of {short_df_temp.shape[0]} articles.")

Short articles dataframe was loaded from /content/final_short_dataset.csv successfully with a total of 59231 articles.


In [8]:
# Check if dataset has any Null values
print(f"Null values per column before cleanup:\n{short_df_temp.isnull().sum()}\n")
short_df = short_df_temp.dropna().copy()
print(f"Null values per column after cleanup:\n{short_df.isnull().sum()}\n")
print(f"Final dataset length: {short_df['text'].count()}")

Null values per column before cleanup:
label       0
title    2000
text        0
dtype: int64

Null values per column after cleanup:
label    0
title    0
text     0
dtype: int64

Final dataset length: 57231


### Add title to text with separator

In [9]:
short_df["text"] = short_df["title"] + ". " + short_df["text"]
short_df = short_df.drop(columns=["title"])

#### Long Articles Dataset (`≃2k articles`)





In [10]:
long_df = pd.read_csv(LONG_DATASET_PATH, names=["text", "label"], header=0)
print(f"Long articles dataframe was loaded from {LONG_DATASET_PATH} successfully with a total of {long_df.shape[0]} articles.")

Long articles dataframe was loaded from /content/final_long_dataset.csv successfully with a total of 2141 articles.


In [11]:
# Check if dataset has any Null values
print(f"Null values per column:\n{long_df.isnull().sum()}")

Null values per column:
text     0
label    0
dtype: int64


*Observation: As we don't have any Null values, we can proceed.*

### Tokenization & Truncation

We are going to use a strategy called `Head Truncation`, where we keep only the first n tokens of each article.

If this turns out not to be the best approach, we can later try another strategy called `Head-Tail Truncation`, where we truncate from the middle instead of from one side only, keeping n tokens from the head and m tokens from the tail. The drawback is that it has to be implemented manually and applied article by article, which makes it less efficient than the tokenizer's built-in truncation.
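
As a rough illustration of the head-tail idea, the sketch below works on a plain list of token ids (without special tokens) and keeps the first `head_len` ids plus as many trailing ids as still fit. The function name, the default `head_len` of 128 and the `cls_id`/`sep_id` arguments are assumptions made for illustration and are not part of this notebook; in practice the ids would come from `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.

```python
def head_tail_truncate(token_ids, cls_id, sep_id, max_length=512, head_len=128):
    """Keep the first `head_len` ids and the last ids that still fit in `max_length`."""
    budget = max_length - 2  # reserve room for the [CLS] and [SEP] tokens
    if head_len > budget:
        raise ValueError("head_len must leave room for the special tokens")
    if len(token_ids) > budget:
        tail_len = budget - head_len
        tail = token_ids[-tail_len:] if tail_len > 0 else []
        token_ids = token_ids[:head_len] + tail
    return [cls_id] + token_ids + [sep_id]


# Toy example: 1000 fake token ids and hypothetical special-token ids 101 / 102
ids = list(range(1000, 2000))
truncated = head_tail_truncate(ids, cls_id=101, sep_id=102)
print(len(truncated))                # 512
print(truncated[1], truncated[-2])   # 1000 1999
```

Since this is only list slicing, the cost per article is small; the tokenizer would first have to be called with `add_special_tokens=False` and without truncation so that the full sequence of ids is available.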




In [None]:
PRE_TRAINED_MODEL_NAME = "dccuchile/bert-base-spanish-wwm-uncased"

print(f"Loading Tokenizer for {PRE_TRAINED_MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

#### Save Tokenizer

We save the tokenizer alongside the model so that the whole model bundle is available.

In [None]:
# Save tokenizer locally
tokenizer.save_pretrained(SAVE_TOKENIZER_PATH)
print(f"Tokenizer saved locally to: {SAVE_TOKENIZER_PATH}")

In [13]:
def preprocess_dataset(dataframe: pd.DataFrame,
                       text_column: str = "text",
                       label_column: str = "label") -> dict[str, list]:
    """
    Preprocess a dataframe using head truncation.

    Args:
        dataframe (pd.DataFrame): Pandas dataframe with our articles + labels
        text_column (str): Name of the column that contains the article text (default = "text")
        label_column (str): Name of the column that contains the article labels (default = "label")

    Returns:
        Dictionary with input_ids, attention_mask and labels
    """
    print(f"Preprocessing {dataframe[text_column].count()} articles...")
    # Tokenize all articles at once
    encoded = tokenizer(
        dataframe[text_column].tolist(),
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors=None
    )

    return {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
        "labels": dataframe[label_column].tolist()
    }

### Create `Dataset` + Split into Train/Val/Test

#### Short Articles Dataset (`≃50k articles`)

For this dataset we are only going to split into training and validation sets, as it is going to be used to `pre-fine-tune` our transformer.

In [14]:
# Preprocess short articles
processed_data_short = preprocess_dataset(dataframe=short_df)
# Create hugging face Dataset
dataset_short = Dataset.from_dict(processed_data_short)

# Split data into Train/Validation
train_test_split = dataset_short.train_test_split(test_size=0.2, shuffle=True)

# Get train and validation data
train_data_short = train_test_split["train"]
val_data_short = train_test_split["test"]

print("Successfully split data into Train/Val:")
print(f"Train dataset: {train_data_short.num_rows} articles")
print(f"Validation dataset: {val_data_short.num_rows} articles")


Preprocessing 57231 articles...
Successfully split data into Train/Val:
Train dataset: 45784 articles
Validation dataset: 11447 articles


#### Long Articles Dataset (`≃2k articles`)

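Here we are going to do the full train/validation/test split: 20% of the articles are held out for testing, and 12.5% of the remainder is used for validation (roughly 70/10/20 overall).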

In [15]:
# Preprocess articles
processed_data_long = preprocess_dataset(dataframe=long_df)
# Create hugging face Dataset
dataset_long = Dataset.from_dict(processed_data_long)

# Split data into Train/Validation/Test
train_test_split = dataset_long.train_test_split(test_size=0.2, shuffle=True)

# Get train and test data
train_data_long_temp = train_test_split["train"]
test_data_long = train_test_split["test"]

# Get validation data from train
train_val_split = train_data_long_temp.train_test_split(test_size=0.125, shuffle=True)
train_data_long = train_val_split["train"]
val_data_long = train_val_split["test"]

print("Successfully split data into Train/Val/Test:")
print(f"Train dataset: {train_data_long.num_rows} articles")
print(f"Validation dataset: {val_data_long.num_rows} articles")
print(f"Test dataset: {test_data_long.num_rows} articles")

Preprocessing 2141 articles...
Successfully split data into Train/Val/Test:
Train dataset: 1498 articles
Validation dataset: 214 articles
Test dataset: 429 articles
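
The sizes printed above can be reproduced with simple arithmetic. The sketch below assumes the held-out portion is rounded up, which matches the reported numbers; `split_sizes` is a hypothetical helper written for this illustration, not a function of the `datasets` library.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (train_rows, test_rows), rounding the held-out portion up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


# Long articles: carve out the test set first, then the validation set from the remainder
train_tmp, test = split_sizes(2141, 0.2)
train, val = split_sizes(train_tmp, 0.125)
print(train, val, test)          # 1498 214 429

# Short articles: a single train/validation split
print(split_sizes(57231, 0.2))   # (45784, 11447)
```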


# Fine-Tuning Model

In this stage we take our pre-trained model `bert-base-spanish-wwm-uncased` and fine-tune it with our data.

The first approach is to train it only on our 2k long articles dataset and see how it performs. If it doesn't perform as expected, we will add an extra fine-tuning stage in between, using our 50k short articles dataset, and check whether it improves the performance of our custom transformer (`Sequential Fine-Tune`).

In addition we are going to use our validation set to find "good" hyperparameters using `hyperparameter_search()`.

The steps we are going to be following:

* Create evaluation metrics.
* Create model instance using `AutoModelForSequenceClassification`.
* Create training arguments (`TrainingArguments`).
* Create instance of `Trainer`.
* Perform Hyperparameter search.
* Train final model using best hyperparameters.


*Note: After trying only with the 2k dataset and getting bad results, we are going for the other approach -> Sequential Fine-Tuning.*

### Create Evaluation Metrics

We are going to be using the `accuracy`, `f1-score`, `precision` and `recall` metrics for evaluating our model.

In [None]:
metrics = evaluate.combine(["accuracy", "precision", "recall", "f1"])

def compute_metrics(eval_pred):
    """
    Computes metrics for our transformer model. It uses a combination of 'accuracy', 'precision', 'recall' and 'f1' metrics.

    Args:
        eval_pred (tuple): Pair of (logits, labels) passed in by the `Trainer`

    Returns:
        Dictionary containing the metrics.
    """
    # Destructure logits and labels
    logits, labels = eval_pred
    # Generate predictions
    predictions = np.argmax(logits, axis=-1)

    return metrics.compute(predictions=predictions,
                           references=labels)
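
To make the four metrics concrete, here is a minimal pure-Python sketch of what they compute for a binary task. It assumes label `1` is the positive class; which class (fake or real) is encoded as `1` is not stated in this notebook. The helper name `binary_metrics` and the toy logits are hypothetical.

```python
def binary_metrics(predictions, references):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    correct = sum(1 for p, r in pairs if p == r)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Toy logits for four articles: the predicted class is the index of the largest logit
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 1]
predictions = [row.index(max(row)) for row in logits]  # same idea as np.argmax(logits, axis=-1)
print(predictions)                          # [0, 1, 1, 0]
print(binary_metrics(predictions, labels))  # every metric is 0.5 for this toy data
```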

### Create Instance of `TrainingArguments`

We are going to set up the main arguments for our training:

* `output_dir`: The output directory where the model predictions and checkpoints will be written.
* `eval_strategy`: The evaluation strategy to adopt during training (named `evaluation_strategy` in older `transformers` releases).
* `save_strategy`: The checkpoint save strategy to adopt during training.
* `load_best_model_at_end`: Whether or not to load the best model found during training at the end of training.
* `metric_for_best_model`: Specify the metric to use to compare two different models.
* `greater_is_better`: Specify if better models should have a greater metric or not.
* `per_device_train_batch_size`: The batch size per device in training.
* `per_device_eval_batch_size`: The batch size per device in evaluation.
* `num_train_epochs`: Total number of training epochs to perform.
* `learning_rate`: The initial learning rate for AdamW optimizer.
* `weight_decay`: The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in AdamW optimizer.
* `logging_steps`: Number of update steps between two logs.
* `warmup_ratio`: Fraction of the total training steps used to warm up the learning rate. The warmup phase prevents large gradients early in training from destabilizing the model, leading to better performance and stability.
* `save_total_limit`: Limit the total amount of checkpoints.
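
As a back-of-the-envelope example of how `warmup_ratio`, the batch size and `gradient_accumulation_steps` interact, the sketch below estimates the number of optimizer updates and warmup steps for the two stages of this notebook. It assumes a single device and simple rounding, so the exact step counts used by the `Trainer` may differ slightly between library versions; `estimate_steps` is a hypothetical helper.

```python
import math


def estimate_steps(n_samples, batch_size, epochs, warmup_ratio, grad_accum=1):
    """Rough estimate of (total optimizer updates, warmup updates) on one device."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = updates_per_epoch * epochs
    return total_updates, math.ceil(total_updates * warmup_ratio)


# Stage 1: 45,784 short articles, batch size 16, 3 epochs
print(estimate_steps(45784, 16, 3, 0.1))                # (8586, 859)
# Stage 2: 1,498 long articles, batch size 16, 5 epochs, 2 accumulation steps
print(estimate_steps(1498, 16, 5, 0.1, grad_accum=2))   # (235, 24)
```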

In [17]:
# Stage 1 default training arguments
training_args_stage1 = TrainingArguments(
    output_dir="./results_stage1",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    report_to="none",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=50,
    warmup_ratio=0.1,
    save_total_limit=2,
)

# Stage 2 default training arguments
training_args_stage2 = TrainingArguments(
    output_dir="./results_stage2",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    report_to="none",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    learning_rate=5e-6,
    weight_decay=0.01,
    logging_steps=10,
    warmup_ratio=0.1,
    save_total_limit=2,
    gradient_accumulation_steps=2,
    max_grad_norm=1.0,
)

### Load `bert-base-spanish-wwm-uncased` (base model)

As mentioned before, we first start with only the 2k long articles dataset. For this we freeze part of the transformer and train the classifier.

In case the model doesn't perform well, we will include the 50k short articles dataset following this strategy:

* Pre-Fine-Tune using the 50k dataset - Training everything for a couple of epochs.
* Fine-Tune the model using the 2k dataset - Freeze only the encoder and train the classifier (what we are doing originally).

## Stage 1: Pre-Fine-Tune

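We are going to pre-fine-tune our base model using the 50k short articles dataset. In the implementation below, only the last two encoder layers, the pooler and the classification head are trainable; the rest of the model is frozen.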

In [18]:
PRE_TRAINED_MODEL_NAME = "dccuchile/bert-base-spanish-wwm-uncased"

def model_init_stage1():
    model = AutoModelForSequenceClassification.from_pretrained(
        PRE_TRAINED_MODEL_NAME,
        num_labels=2
    )

    # Freeze all parameters in the base BERT model first
    for param in model.bert.parameters():
        param.requires_grad = False

    # Unfreeze the last 2 layers of the encoder
    for param in model.bert.encoder.layer[-2:].parameters():
        param.requires_grad = True

    # Unfreeze the pooler
    for param in model.bert.pooler.parameters():
        param.requires_grad = True

    # Unfreeze the classifier head
    for param in model.classifier.parameters():
        param.requires_grad = True

    return model

In [19]:
# Temporary instance of the model to check parameters
temp_model = model_init_stage1()

total_params = sum(p.numel() for p in temp_model.parameters())
trainable_params = sum(p.numel() for p in temp_model.parameters() if p.requires_grad)

print("Model Parameters Count:")
print(f"Total: {total_params}")
print(f"Trainable: {trainable_params}")
print(f"Frozen: {total_params - trainable_params}")

del temp_model

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model Parameters Count:
Total: 109852418
Trainable: 14767874
Frozen: 95084544
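
The trainable count above can be checked by hand. The sketch below assumes the standard BERT-base dimensions (hidden size 768, feed-forward size 3072) and 2 labels; the dimensions are not printed in this notebook, but the resulting totals match the reported output.

```python
HIDDEN, FFN, NUM_LABELS = 768, 3072, 2  # assumed BERT-base dimensions


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)   # query, key, value and attention output projections
    + layer_norm
    + linear(HIDDEN, FFN)        # feed-forward expansion
    + linear(FFN, HIDDEN)        # feed-forward projection
    + layer_norm
)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

print(encoder_layer)                            # 7087872
print(2 * encoder_layer + pooler + classifier)  # 14767874 (last 2 layers unfrozen)
print(1 * encoder_layer + pooler + classifier)  # 7680002 (last layer only)
```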


### Create the `Trainer` instance for **Stage 1**

In [None]:
# Stage 1 trainer
trainer_stage1 = Trainer(
    model=None,
    model_init=model_init_stage1,
    args=training_args_stage1,
    train_dataset=train_data_short,
    eval_dataset=val_data_short,
    compute_metrics=compute_metrics,
)

# Train stage 1 trainer
trainer_stage1.train()
# Save model to be used in stage 2
trainer_stage1.save_model(SAVE_STAGE1_MODEL_PATH)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2188,0.193999,0.928802,0.931096,0.947487,0.93922
2,0.1996,0.176485,0.937364,0.94978,0.94192,0.945834
3,0.1709,0.168711,0.943479,0.9447,0.958772,0.951684


## Stage 2: Final Fine-Tune

Using the pre-fine-tuned model from stage 1, we are going to do a final fine-tune using the 2k long articles dataset, unfreezing the last encoder layer, the pooler and the classification head.

In [None]:
def model_init_stage2():
    # Load the checkpoint saved at the end of stage 1
    model = AutoModelForSequenceClassification.from_pretrained(
        SAVE_STAGE1_MODEL_PATH,
    )

    for param in model.bert.parameters():
        param.requires_grad = False

    # Unfreeze the last encoder layer
    for param in model.bert.encoder.layer[-1:].parameters():
        param.requires_grad = True

    # Unfreeze the pooler
    for param in model.bert.pooler.parameters():
        param.requires_grad = True

    # Unfreeze the classifier head
    for param in model.classifier.parameters():
        param.requires_grad = True

    return model

In [21]:
# Temporary instance of the model to check parameters
temp_model = model_init_stage2()

total_params = sum(p.numel() for p in temp_model.parameters())
trainable_params = sum(p.numel() for p in temp_model.parameters() if p.requires_grad)

print("Model Parameters Count:")
print(f"Total: {total_params}")
print(f"Trainable: {trainable_params}")
print(f"Frozen: {total_params - trainable_params}")

del temp_model

Model Parameters Count:
Total: 109852418
Trainable: 7680002
Frozen: 102172416


### Create the `Trainer` instance for **Stage 2**

In [22]:
# Stage 2 trainer
trainer_stage2 = Trainer(
    model=None,
    model_init=model_init_stage2,
    args=training_args_stage2,
    train_dataset=train_data_long,
    eval_dataset=val_data_long,
    compute_metrics=compute_metrics,
)
# Hyperparameter Search
best_run = trainer_stage2.hyperparameter_search(n_trials=10,
                                        direction="maximize",
                                        backend="optuna")
best_hyperparams = best_run.hyperparameters

# Re-Create the TrainingArguments using the best hyperparameters
training_args_best = TrainingArguments(
    # Configuration
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    report_to="none",
    # Use best hyperparameters from search
    learning_rate=best_hyperparams["learning_rate"],
    per_device_train_batch_size=int(best_hyperparams.get("per_device_train_batch_size", 16)),
    per_device_eval_batch_size=int(best_hyperparams.get("per_device_eval_batch_size", 16)),
    num_train_epochs=int(best_hyperparams["num_train_epochs"]),
    weight_decay=best_hyperparams.get("weight_decay", 0.01),
    # General params
    logging_steps=50,
    warmup_ratio=0.1,
    save_total_limit=2,
    gradient_accumulation_steps=2,
)

# New Trainer Instance with the best hyperparameters
final_trainer = Trainer(
    model=None,
    model_init=model_init_stage2,
    args=training_args_best,
    train_dataset=train_data_long,
    eval_dataset=val_data_long,
    compute_metrics=compute_metrics,
)

# Train final model using best hyperparameters
final_trainer.train()
# Save final model
final_trainer.save_model(SAVE_FINAL_MODEL_PATH)

[I 2025-12-11 15:58:38,405] A new study created in memory with name: no-name-c4ecf54d-8543-4438-8092-58404c75f963


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.4464,1.393983,0.504673,0.519553,0.823009,0.636986
2,1.3354,1.34459,0.5,0.517045,0.80531,0.629758


[I 2025-12-11 16:01:26,105] Trial 0 finished with value: 2.452112974525857 and parameters: {'learning_rate': 2.8960728024867e-06, 'num_train_epochs': 2, 'seed': 19, 'per_device_train_batch_size': 64}. Best is trial 0 with value: 2.452112974525857.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.9707,0.996438,0.570093,0.574468,0.716814,0.637795
2,0.8578,0.861032,0.579439,0.589147,0.672566,0.628099
3,0.7826,0.829948,0.593458,0.6,0.690265,0.641975


[I 2025-12-11 16:05:44,907] Trial 1 finished with value: 2.5256987392928725 and parameters: {'learning_rate': 3.4604984292103825e-06, 'num_train_epochs': 3, 'seed': 26, 'per_device_train_batch_size': 8}. Best is trial 1 with value: 2.5256987392928725.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.8469,0.846468,0.593458,0.6,0.690265,0.641975


[I 2025-12-11 16:07:20,977] Trial 2 finished with value: 2.5256987392928725 and parameters: {'learning_rate': 6.426591924596102e-06, 'num_train_epochs': 1, 'seed': 8, 'per_device_train_batch_size': 4}. Best is trial 1 with value: 2.5256987392928725.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.8477,1.04291,0.537383,0.547297,0.716814,0.62069


[I 2025-12-11 16:08:55,891] Trial 3 finished with value: 2.42218428933184 and parameters: {'learning_rate': 2.864941345726809e-06, 'num_train_epochs': 1, 'seed': 2, 'per_device_train_batch_size': 4}. Best is trial 1 with value: 2.5256987392928725.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.4618,0.504006,0.738318,0.708029,0.858407,0.776
2,0.5392,0.465564,0.785047,0.781513,0.823009,0.801724


[I 2025-12-11 16:12:01,651] Trial 4 finished with value: 3.191292321502536 and parameters: {'learning_rate': 2.7924812330022644e-05, 'num_train_epochs': 2, 'seed': 5, 'per_device_train_batch_size': 4}. Best is trial 4 with value: 3.191292321502536.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.3544,1.217892,0.514019,0.52795,0.752212,0.620438


[I 2025-12-11 16:13:05,078] Trial 5 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.1153,1.193594,0.509346,0.524691,0.752212,0.618182


[I 2025-12-11 16:14:45,582] Trial 6 finished with value: 2.4044313599795637 and parameters: {'learning_rate': 4.014221245186902e-06, 'num_train_epochs': 1, 'seed': 37, 'per_device_train_batch_size': 16}. Best is trial 4 with value: 3.191292321502536.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.6538,0.65594,0.649533,0.663793,0.681416,0.672489
2,0.5869,0.595676,0.67757,0.669231,0.769912,0.716049


[I 2025-12-11 16:17:41,216] Trial 7 finished with value: 2.832761749829541 and parameters: {'learning_rate': 1.4834290833204622e-05, 'num_train_epochs': 2, 'seed': 21, 'per_device_train_batch_size': 8}. Best is trial 4 with value: 3.191292321502536.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.6986,0.52517,0.766355,0.756098,0.823009,0.788136
2,0.4404,0.442135,0.794393,0.785124,0.840708,0.811966
3,0.3289,0.427918,0.803738,0.784,0.867257,0.823529
4,0.2895,0.43203,0.808411,0.776923,0.893805,0.831276


[I 2025-12-11 16:23:50,079] Trial 8 finished with value: 3.31041532177547 and parameters: {'learning_rate': 9.897764875650956e-05, 'num_train_epochs': 4, 'seed': 9, 'per_device_train_batch_size': 32}. Best is trial 8 with value: 3.31041532177547.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.0794,0.928929,0.579439,0.590551,0.663717,0.625


[I 2025-12-11 16:24:55,843] Trial 9 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.544833,0.733645,0.818182,0.637168,0.716418
2,No log,0.455799,0.780374,0.761905,0.849558,0.803347
3,0.645900,0.437711,0.803738,0.770992,0.893805,0.827869
4,0.645900,0.425096,0.808411,0.776923,0.893805,0.831276


*Observation: The last table corresponds to the final model trained with the best hyperparameters; the tables before it belong to the hyperparameter search trials (0 to 9).*

**Best Hyperparameters:**
```
{
 "learning_rate": 9.897764875650956e-05,
 "num_train_epochs": 4,
 "seed": 9,
 "per_device_train_batch_size": 32
}
```
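
Note that no `compute_objective` was passed to `hyperparameter_search()`, so the trial values logged above are not F1 scores. They are consistent with summing the four evaluation metrics, which is how the default objective in `transformers` behaves when metrics are available. The sketch below mimics that behaviour and shows an F1-only alternative; both function names are hypothetical, and the metric values are copied from the last epoch of trial 8.

```python
def summed_objective(metrics):
    """Mimics the default objective: sum every metric except loss, epoch and speed stats."""
    skipped_suffixes = ("_runtime", "_per_second", "_compilation_time")
    return sum(
        value for key, value in metrics.items()
        if key not in ("eval_loss", "epoch") and not key.endswith(skipped_suffixes)
    )


def f1_objective(metrics):
    """Alternative objective that optimises F1 only."""
    return metrics["eval_f1"]


# Last epoch of trial 8, taken from the log above
trial_8 = {
    "eval_loss": 0.43203,
    "eval_accuracy": 0.808411,
    "eval_precision": 0.776923,
    "eval_recall": 0.893805,
    "eval_f1": 0.831276,
    "epoch": 4.0,
}
print(round(summed_objective(trial_8), 4))  # 3.3104
print(f1_objective(trial_8))                # 0.831276
```

An objective such as `f1_objective` could be supplied through the `compute_objective` argument of `hyperparameter_search()`, which would align the search with `metric_for_best_model="f1"`.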

### Evaluate Final Model on Test Set

In [26]:
print("Evaluating final model on the test set...")
results_final_model = final_trainer.evaluate(test_data_long)
print("Test Set Evaluation Results:")
for key, value in results_final_model.items():
    print(f"  {key}: {value:.4f}")

Evaluating final model on the test set...


Test Set Evaluation Results:
  eval_loss: 0.4183
  eval_accuracy: 0.8205
  eval_precision: 0.7835
  eval_recall: 0.8702
  eval_f1: 0.8246
  eval_runtime: 14.0762
  eval_samples_per_second: 30.4770
  eval_steps_per_second: 1.9180
  epoch: 4.0000
