<div align="center">

### End-to-End Fine-Tuning on the ISOT Fake News Dataset

Fine-tuning a pre-trained transformer model for binary classification (Real vs. Fake News) using Hugging Face `transformers`.

---

</div>


## Introduction

In this notebook, we fine-tune a pre-trained transformer model to perform fake news detection.  
Using the ISOT Fake News Dataset, we demonstrate the full end-to-end process, including data loading, preprocessing, tokenization, model training, evaluation, and saving the fine-tuned model.  
We use the Hugging Face `transformers` library, whose `Trainer` API handles the training loop, evaluation, and checkpointing for us.



## Install Dependencies

The following libraries are required for this notebook:
- `transformers` (for model and training pipeline)
- `datasets` (for easy dataset handling)
- `tensorboard` (for training visualization)

If you are running this notebook for the first time, you may need to install them by uncommenting the commands below.


In [1]:
#!pip install -q transformers
#!pip install -q datasets
#!pip install -q tensorboard  # TensorBoard visualizes training progress in real time


## Setup and Configuration

We import the required libraries and define the project paths used for results, logs, and the saved model.

### Libraries

We import all the necessary libraries for model training, evaluation, and logging.


In [2]:
# Silence warnings and reduce logging noise
import warnings
warnings.filterwarnings("ignore")
# Silence logging output from the Hugging Face transformers library (which can be verbose)
#from transformers.utils import logging
#logging.set_verbosity_error()

In [3]:
# Libraries
from transformers import Trainer, TrainingArguments, pipeline,\
                         DistilBertTokenizerFast, DistilBertForSequenceClassification
from datasets import Dataset
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from torch.utils.tensorboard import SummaryWriter

import pandas as pd
import os
import time
import shutil



### Paths and Directory Setup

Define project paths for saving results, logs, and trained models.


In [4]:
# paths
PROJECT_ROOT = "/kaggle/working"
DATA_ROOT = "/kaggle/input/fake-news-dataset-for-llm-fine-tuning/fake_news_detection_Kaggle"

RESULTS_DIR = os.path.join(PROJECT_ROOT, "results")
LOGS_DIR = os.path.join(PROJECT_ROOT, "logs")
MODEL_SAVE_DIR = os.path.join(PROJECT_ROOT, "models", "fine_tuned_model")
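
The cell above only defines the paths; it does not create the directories. A minimal sketch of creating them up front is shown below, assuming a temporary directory as a stand-in for `PROJECT_ROOT`. The `make_project_dirs` helper is hypothetical and not part of the notebook.

```python
import os
import tempfile


def make_project_dirs(project_root):
    """Create the results, logs, and model directories under project_root."""
    paths = {
        "results": os.path.join(project_root, "results"),
        "logs": os.path.join(project_root, "logs"),
        "model": os.path.join(project_root, "models", "fine_tuned_model"),
    }
    for path in paths.values():
        # exist_ok=True makes the call safe to repeat when the cell is re-run
        os.makedirs(path, exist_ok=True)
    return paths


# A temporary directory stands in for /kaggle/working in this sketch
demo_root = tempfile.mkdtemp()
demo_paths = make_project_dirs(demo_root)
print(sorted(demo_paths))
```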

## Quick Check: Using a Pre-trained Model

We first load a pre-trained DistilBERT model fine-tuned for sentiment analysis to quickly test the text classification workflow.  
*Note: This model is only used as a placeholder and will later be replaced by our fine-tuned model.*


In [5]:
#---------------------------------------------------
# Load the pre-trained model from Hugging Face
# This will download on first run and cache locally
classifier = pipeline("text-classification", \
             model="distilbert-base-uncased-finetuned-sst-2-english")

#---------------------------------------------------
def predict_with_llm(text):
    """
    Predict whether the input text is FAKE or REAL using a pre-trained LLM.
    Note: This model is fine-tuned for sentiment analysis, so this is just a placeholder.
    """
    result = classifier(text)[0]
    label = result['label']  # 'POSITIVE' or 'NEGATIVE'

    # TEMPORARY MAPPING
    if label == 'NEGATIVE':
        return "FAKE"
    else:
        return "REAL"

Device set to use cuda:0
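
For reference, a `text-classification` pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. The sketch below mimics that structure with hand-written values to show how the temporary mapping behaves. The scores are made up for illustration and are not real model output, and `map_sentiment_to_news_label` is a hypothetical helper.

```python
def map_sentiment_to_news_label(result):
    """Map an SST-2 sentiment label to the placeholder FAKE/REAL label."""
    # Same temporary rule as predict_with_llm: NEGATIVE -> FAKE, otherwise REAL
    return "FAKE" if result["label"] == "NEGATIVE" else "REAL"


# Hand-written stand-ins for pipeline output; the scores are illustrative only
mock_results = [
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.91},
]
mapped = [map_sentiment_to_news_label(r) for r in mock_results]
print(mapped)  # ['FAKE', 'REAL']
```

Sentiment is unrelated to whether an article is truthful, so this mapping only checks that the plumbing works.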


## Dataset Preparation

We load the [ISOT Fake News dataset](https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/), assign labels (1 for REAL, 0 for FAKE), and combine the true and fake news articles into a single DataFrame.  
The dataset is then shuffled, converted to a Hugging Face `Dataset`, and split into training and test sets (80% train / 20% test).


In [6]:
def prepare_dataset():
    # Load both parts of ISOT dataset
    true_df = pd.read_csv(os.path.join(DATA_ROOT, "data", "True.csv"))
    fake_df = pd.read_csv(os.path.join(DATA_ROOT, "data", "Fake.csv"))

    # Add labels
    true_df["label"] = 1  # REAL
    fake_df["label"] = 0  # FAKE

    # Keep only the 'text' and 'label' columns
    true_df = true_df[["text", "label"]]
    fake_df = fake_df[["text", "label"]]

    # Combine and shuffle
    df = pd.concat([true_df, fake_df], ignore_index=True)
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)

    # Convert to Hugging Face Dataset and split
    dataset = Dataset.from_pandas(df)
    dataset = dataset.train_test_split(test_size=0.2)

    return dataset  # A DatasetDict with 'train' and 'test'
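
Note that `prepare_dataset` seeds the pandas shuffle with `random_state=42` but passes no seed to `train_test_split`, so the train/test membership may differ between runs. `Dataset.train_test_split` accepts a `seed` argument for this purpose. The plain-Python sketch below illustrates the effect of a fixed seed on a shuffle-and-split; it is a toy stand-in, not the `datasets` implementation, and `shuffle_and_split` is a hypothetical helper.

```python
import random


def shuffle_and_split(rows, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed and hold out a test_size fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled rows form the test set, the rest the train set
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows standing in for (text, label) pairs; 1 = REAL, 0 = FAKE
rows = [(f"article {i}", i % 2) for i in range(10)]
train_a, test_a = shuffle_and_split(rows)
train_b, test_b = shuffle_and_split(rows)

# The same seed gives the same split on every call
print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```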

## Tokenization

We use the pre-trained DistilBERT tokenizer from Hugging Face to tokenize the text data.  
Each text entry is tokenized with truncation and padding to a maximum length of 512 tokens.  
The tokenized dataset is then formatted for PyTorch compatibility, making it ready for model training.


In [7]:
def tokenize_dataset(dataset):
    '''
    Load the tokenizer from Hugging Face and tokenize the dataset:
    1. Tokenize the text field.
    2. Apply truncation and padding up to 512 tokens.
    3. Apply this to the full dataset using map.
    4. Set the format of the returned dataset for PyTorch.
    '''
    tokenizer = DistilBertTokenizerFast.from_pretrained(\
                 'distilbert-base-uncased')
    def tokenize_fn(example):
        return tokenizer(
            example['text'], 
            padding='max_length', 
            truncation=True, 
            max_length=512
        )
    tokenized_dataset = dataset.map(tokenize_fn, batched=True)
    # This part is essential for model training with PyTorch
    tokenized_dataset.set_format(
        type='torch',
        columns=['input_ids', 'attention_mask', 'label']
    )
    return tokenized_dataset, tokenizer
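
To make the effect of `padding='max_length'` and `truncation=True` concrete, the toy sketch below pads or truncates a list of token IDs to a fixed length and builds the matching attention mask. The token IDs are arbitrary and `pad_id=0` is an assumption. The real tokenizer also handles special tokens, which this sketch ignores. `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids and the matching attention_mask."""
    kept = token_ids[:max_length]      # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)     # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([5, 6, 7], max_length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short)
print(long)
```

With a maximum length of 512, every example occupies 512 positions regardless of its real length, which keeps tensor shapes uniform at the cost of extra computation on short texts.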

In [8]:
# Step 1: Prepare the dataset
dataset = prepare_dataset()  # This returns a Hugging Face DatasetDict
#print(dataset)  # Should now show DatasetDict with train/test

# Step 2: Tokenize the dataset
tokenized_dataset, tokenizer = tokenize_dataset(dataset)

# Print to verify
#print(tokenized_dataset)
#print(tokenized_dataset['train'][0])  # Check the first example

Map:   0%|          | 0/35918 [00:00<?, ? examples/s]

Map:   0%|          | 0/8980 [00:00<?, ? examples/s]

In [9]:
tokenized_dataset = tokenized_dataset["train"] 
train_dataset, eval_dataset = tokenized_dataset.train_test_split(test_size=0.2).values()
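
After this cell, `tokenized_dataset` refers only to the original training split, so the 20% test split from `prepare_dataset` is no longer referenced. The arithmetic below sketches the resulting sizes, starting from the row counts shown in the `Map` output above. It assumes the held-out share is rounded up, which reproduces the 35,918 / 8,980 counts. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the held-out share up (assumed rule)."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


total = 35918 + 8980  # row counts from the Map progress output
n_train_full, n_test = split_sizes(total, 0.2)
n_train, n_eval = split_sizes(n_train_full, 0.2)

print(n_train_full, n_test)  # 35918 8980
print(n_train, n_eval)       # 28734 7184
# Approximate share of the full dataset in each part
print(round(n_train / total, 2), round(n_eval / total, 2), round(n_test / total, 2))
```

Under these assumptions the effective split is roughly 64% train, 16% validation, and 20% unused test data.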
 

In [10]:
# Save tokenizer
tokenizer.save_pretrained(MODEL_SAVE_DIR)

('/kaggle/working/models/fine_tuned_model/tokenizer_config.json',
 '/kaggle/working/models/fine_tuned_model/special_tokens_map.json',
 '/kaggle/working/models/fine_tuned_model/vocab.txt',
 '/kaggle/working/models/fine_tuned_model/added_tokens.json',
 '/kaggle/working/models/fine_tuned_model/tokenizer.json')

## Defining Evaluation Metrics

We define a custom `compute_metrics` function to evaluate the model's performance during training.  
The metrics calculated are:
- Accuracy
- Precision
- Recall
- F1 Score

Taken together, these metrics are more informative than accuracy alone, particularly when the two classes are not evenly represented.


In [11]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
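
With `average='binary'`, scikit-learn treats label `1` as the positive class by default, so in this notebook precision and recall describe the REAL class. The sketch below recomputes the four metrics from raw counts on a small hand-made example; `binary_metrics` is a hypothetical helper written for illustration, not a replacement for the scikit-learn functions.

```python
def binary_metrics(labels, preds, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)

    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = REAL, 0 = FAKE
preds = [1, 1, 0, 0, 1, 0, 0, 0]
toy_metrics = binary_metrics(labels, preds)
print(toy_metrics)
```

Here two of four REAL articles are found (recall 0.5) and two of three REAL predictions are correct (precision about 0.667), while accuracy is 0.625.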

## Training Setup and Execution

We fine-tune a pre-trained DistilBERT model for binary classification (REAL vs FAKE news).  
The training configuration includes:
- **TrainingArguments** to specify training parameters (batch size, number of epochs, etc.)
- **Trainer API** from Hugging Face, which simplifies the training loop
- Real-time training visualization using **TensorBoard** logs

Finally, after training, we save the fine-tuned model for future inference.


In [12]:

def train_model(tokenized_dataset):
    """
    Build the DistilBERT classifier and a configured Trainer.
    Note: the Trainer reads the module-level train_dataset and eval_dataset
    created above; the tokenized_dataset argument is currently unused.
    Training is started separately with trainer.train().
    """
    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)
    training_args = TrainingArguments(
        output_dir=RESULTS_DIR,
        logging_dir=LOGS_DIR,  # TensorBoard event files are written here
        num_train_epochs=10,  # use 1 for a quick test run
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        warmup_steps=500,
        weight_decay=0.01,
        logging_steps=10,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        report_to="tensorboard"
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    return model, trainer
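
To put `warmup_steps=500` in context, the sketch below estimates the number of optimizer steps and the learning rate at a few points in training. It assumes a single device, no gradient accumulation, 28,734 training rows (80% of the 35,918-row split), and the `Trainer` defaults of a 5e-5 peak learning rate with a linear schedule. None of these values are set explicitly in the notebook, and both helper functions are hypothetical.

```python
import math


def steps_per_epoch(n_train, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_train / batch_size)


def linear_warmup_lr(step, total_steps, warmup_steps=500, peak_lr=5e-5):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


n_train = 28734  # assumed training-set size, see the lead-in
per_epoch = steps_per_epoch(n_train, 16)
total_steps = per_epoch * 10

print(per_epoch, total_steps)  # 1796 17960
print(round(500 / total_steps, 3))  # fraction of training spent in warmup
for step in (0, 250, 500, total_steps):
    print(step, linear_warmup_lr(step, total_steps))
```

Under these assumptions, warmup covers a small fraction of the 17,960 steps, and evaluation and checkpointing happen once per epoch, every 1,796 steps.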

In [13]:
model, trainer = train_model(tokenized_dataset)
print("Starting training...")
start_time = time.time()
    
trainer.train()

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Training completed in {elapsed_time:.2f} seconds")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training...


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.0001 | 0.001406 | 0.999861 | 0.999856 | 1.0 | 0.999711 |
| 2 | 0.0 | 0.000487 | 0.999861 | 0.999856 | 0.999711 | 1.0 |
| 3 | 0.0003 | 0.000253 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.0 | 0.001081 | 0.999861 | 0.999856 | 1.0 | 0.999711 |
| 5 | 0.0 | 0.001186 | 0.999861 | 0.999856 | 1.0 | 0.999711 |
| 6 | 0.0001 | 0.00114 | 0.999861 | 0.999856 | 1.0 | 0.999711 |
| 7 | 0.0 | 0.001554 | 0.999861 | 0.999856 | 1.0 | 0.999711 |
| 8 | 0.0 | 0.00165 | 0.999861 | 0.999856 | 1.0 | 0.999711 |
| 9 | 0.0 | 0.001624 | 0.999861 | 0.999856 | 1.0 | 0.999711 |
| 10 | 0.0 | 0.001648 | 0.999861 | 0.999856 | 1.0 | 0.999711 |


Training completed in 8720.20 seconds
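
A quick arithmetic check helps interpret these numbers. The repeated accuracy of 0.999861 is consistent with a single misclassified example, assuming the validation set holds 7,184 rows (20% of the 35,918-row training split, rounded up). The sketch also converts the reported training time into hours and an average per epoch.

```python
n_eval = 7184  # assumed validation size, see the lead-in
one_error_accuracy = round(1 - 1 / n_eval, 6)
print(one_error_accuracy)  # 0.999861

elapsed_seconds = 8720.20  # reported by the training cell
n_epochs = 10
hours = round(elapsed_seconds / 3600, 2)
seconds_per_epoch = round(elapsed_seconds / n_epochs, 1)
print(hours)              # 2.42
print(seconds_per_epoch)  # 872.0
```

The per-epoch figure is an average of wall-clock time and includes the end-of-epoch evaluation and checkpoint saving.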


#### Launch TensorBoard
The following commands launch TensorBoard on the log directory so you can inspect the loss, accuracy, and other logged metrics as learning curves.
To watch the curves update during training, run this cell before starting the training cell; the plots begin to populate after the first few logging steps. Run afterwards, as here, it shows the completed curves.

The fine-tuned model itself is saved after evaluation, in the final section of this notebook.

In [14]:
%load_ext tensorboard
%tensorboard --logdir /kaggle/working/logs


## Model Evaluation

In [15]:
# Evaluate model on eval_dataset
predictions = trainer.predict(eval_dataset)

# Print evaluation metrics
metrics = compute_metrics(predictions)
print("Evaluation Metrics:")
for key, value in metrics.items():
    print(f"{key}: {value:.4f}")


Evaluation Metrics:
accuracy: 1.0000
f1: 1.0000
precision: 1.0000
recall: 1.0000


#### Saving the Fine-Tuned Model and Tokenizer

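After training, we save the fine-tuned model to `MODEL_SAVE_DIR` with `save_pretrained` and then pack that directory into a **ZIP archive** for easy storage or sharing.  
The archive contains the model's weights and configuration, together with the tokenizer files saved earlier to the same directory, making it portable for use in other environments.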


In [16]:
# Save fine-tuned model
os.makedirs(MODEL_SAVE_DIR, exist_ok=True)
model.save_pretrained(MODEL_SAVE_DIR)

In [17]:
shutil.make_archive("/kaggle/working/fine_tuned_model", 'zip', MODEL_SAVE_DIR)

'/kaggle/working/fine_tuned_model.zip'
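
To reuse the archive elsewhere, it first has to be unpacked. The sketch below shows the round trip with `shutil.make_archive` and `shutil.unpack_archive`, using a temporary directory and a placeholder `config.json` in place of the real model files. All paths here are illustrative.

```python
import os
import shutil
import tempfile

work_dir = tempfile.mkdtemp()
model_dir = os.path.join(work_dir, "fine_tuned_model")
os.makedirs(model_dir)

# Placeholder file standing in for the files written by save_pretrained
with open(os.path.join(model_dir, "config.json"), "w") as f:
    f.write("{}")

# Pack the directory contents into <work_dir>/fine_tuned_model.zip
archive_path = shutil.make_archive(
    os.path.join(work_dir, "fine_tuned_model"), "zip", model_dir
)

# Unpack into a fresh directory, as one would in another environment
restored_dir = os.path.join(work_dir, "restored")
shutil.unpack_archive(archive_path, restored_dir)
print(sorted(os.listdir(restored_dir)))  # ['config.json']
```

With a real archive, the restored directory could then be passed to `DistilBertForSequenceClassification.from_pretrained` and `DistilBertTokenizerFast.from_pretrained`, provided it contains both the model and tokenizer files.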