<a href="https://colab.research.google.com/github/MichailLepin/Fake-News-Classifier/blob/main/notebooks/bert_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# BERT Model Training for Fake News Classification

This notebook fine-tunes a BERT-base-uncased model for fake news classification in Google Colab.

## Notebook Structure

1. **Install Dependencies** - Install required libraries including transformers
2. **Imports** - Import all necessary modules
3. **Load and Process Data** - Load the ISOT/Kaggle dataset and preprocess it
4. **Initialize BERT Tokenizer and Model** - Load pre-trained BERT tokenizer and model
5. **PyTorch Dataset** - Create datasets for training with BERT tokenization
6. **Training Functions** - Functions for training and evaluating the model
7. **Training** - Model fine-tuning process
8. **Test Set Evaluation** - Final model evaluation
9. **Download Model** - Download model files for Railway deployment


## Install Dependencies

Install all required libraries for PyTorch, transformers, data processing, and visualization.


In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install transformers datasets scikit-learn pandas numpy matplotlib seaborn tqdm
!pip install kagglehub


## Imports

Import all necessary libraries and modules for data processing, BERT model, and metrics.


In [None]:
import os
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
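# AdamW comes from torch.optim: the copy in transformers was deprecated and is
# missing from recent releases (note that torch's default weight_decay is 0.01)
from torch.optim import AdamW
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup
)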
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB')


## Load and Process Data

Load the ISOT dataset from Kaggle and preprocess it:
- Load Fake.csv and True.csv via kagglehub
- Text cleaning (lowercase, URL removal, whitespace normalization)
- Create binary labels
- Stratified train/validation/test split (64%/16%/20%)
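
As a rough illustration of the cleaning and splitting steps listed above, here is a standalone sketch that uses only the standard library. The names `clean_text_demo` and `split_counts` and the sample string are made up for this example; the notebook's own cell works on pandas columns and uses scikit-learn's `train_test_split` with stratification, which this sketch does not reproduce.

```python
import re


def clean_text_demo(text):
    """Lowercase, drop URLs, and collapse whitespace (hypothetical helper)."""
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    return text.strip()


def split_counts(n_samples, test_size=0.2, val_size=0.2):
    """Sizes after holding out a test share, then a validation share of the rest."""
    n_test = round(n_samples * test_size)
    n_val = round((n_samples - n_test) * val_size)
    n_train = n_samples - n_test - n_val
    return n_train, n_val, n_test


print(clean_text_demo("  Breaking NEWS: see https://example.com/x   now "))
print(split_counts(1000))
```

For 1,000 samples the second call yields 640/160/200, which is where the 64%/16%/20% figures come from: 20% is held out for testing, and 20% of the remaining 80% becomes the validation set.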


In [None]:
import re
import kagglehub

# Download dataset via kagglehub (no API keys required)
path = kagglehub.dataset_download("clmentbisaillon/fake-and-real-news-dataset")

# Load data
fake_df = pd.read_csv(f"{path}/Fake.csv")
true_df = pd.read_csv(f"{path}/True.csv")

print(f"✓ Fake news loaded: {fake_df.shape}")
print(f"✓ True news loaded: {true_df.shape}")

# Text cleaning function
def clean_text(text):
    """Clean text: lowercase, remove URLs, normalize whitespace"""
    if pd.isna(text):
        return ""
    text = str(text)
    # Convert to lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

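# Identify text column: the first string column named 'text', 'title', or 'article'
# (in file order) is used, so check the printed column name before training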
text_col = None
for col in fake_df.columns:
    if fake_df[col].dtype == 'object' and col.lower() in ['text', 'title', 'article']:
        text_col = col
        break
if text_col is None:
    text_col = fake_df.select_dtypes(include=['object']).columns[0]

print(f"\nUsing column: '{text_col}'")

# Add labels
fake_df['label'] = 'fake'
true_df['label'] = 'real'

# Combine data
combined_data = pd.concat([fake_df, true_df], ignore_index=True)

# Clean text
print("\nCleaning text...")
combined_data['text_cleaned'] = combined_data[text_col].apply(clean_text)

# Create binary labels
combined_data['label_binary'] = combined_data['label'].map({'fake': 1, 'real': 0})

# Remove empty texts
combined_data = combined_data[
    combined_data['text_cleaned'].notna() &
    (combined_data['text_cleaned'].str.len() > 0)
]

print(f"\nCombined dataset: {combined_data.shape}")
print(f"Label distribution: {combined_data['label'].value_counts().to_dict()}")

# Split into train/val/test with stratification
X = combined_data['text_cleaned'].values
y = combined_data['label_binary'].values

# First split: train+val (80%) and test (20%)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: train (64%) and val (16%)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.2, random_state=42, stratify=y_train_val
)

print(f"\nData split:")
print(f"  Train: {len(X_train):,} ({len(X_train)/len(combined_data)*100:.1f}%)")
print(f"  Validation: {len(X_val):,} ({len(X_val)/len(combined_data)*100:.1f}%)")
print(f"  Test: {len(X_test):,} ({len(X_test)/len(combined_data)*100:.1f}%)")


## Initialize BERT Tokenizer and Model

Load pre-trained BERT-base-uncased tokenizer and model for sequence classification.


In [None]:
# Model name
MODEL_NAME = 'bert-base-uncased'

# Initialize tokenizer
print(f"Loading tokenizer: {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print("✓ Tokenizer loaded")

# Initialize model for sequence classification (2 classes: fake/real)
print(f"\nLoading model: {MODEL_NAME}...")
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2
).to(device)
print("✓ Model loaded")

# Model parameters
MAX_LEN = 256
print(f"\nMax sequence length: {MAX_LEN}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")


## PyTorch Dataset

Create Dataset class for BERT tokenization and DataLoader for efficient data loading.
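
To make the fixed-length encoding concrete, the following toy sketch shows what padding and truncating to a maximum length produce, together with the matching attention mask. The function `pad_or_truncate` and the token ids are illustrative assumptions, not the tokenizer's actual implementation; in particular, a real BERT tokenizer keeps the closing special token when it truncates, which this simplified version does not.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return fixed-length ids plus a mask: 1 for real tokens, 0 for padding."""
    ids = list(token_ids)[:max_len]          # truncate anything past max_len
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)               # how many pad positions are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Illustrative ids only; a real tokenizer would produce these from text
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
long_ids, long_mask = pad_or_truncate([101, 7592, 2088, 999, 2651, 102], max_len=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the DataLoader can stack them into a single tensor per batch, and the attention mask tells the model which positions to ignore.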


In [None]:
class BERTDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]

        # Tokenize text by calling the tokenizer directly
        # (encode_plus is a legacy alias for the same operation)
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Create datasets
train_dataset = BERTDataset(X_train, y_train, tokenizer, MAX_LEN)
val_dataset = BERTDataset(X_val, y_val, tokenizer, MAX_LEN)
test_dataset = BERTDataset(X_test, y_test, tokenizer, MAX_LEN)

BATCH_SIZE = 16
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print(f"✓ Datasets created")
print(f"  Train batches: {len(train_loader)}")
print(f"  Validation batches: {len(val_loader)}")
print(f"  Test batches: {len(test_loader)}")


## Training Functions

Functions for training the model for one epoch and for evaluating it on a held-out set (validation or test).
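
The evaluation function in the next cell reports a weighted F1-score via scikit-learn. As a minimal sketch of what that averaging means, the hypothetical helper `weighted_f1` below computes per-class F1 and weights each class by its number of true samples. It is written in plain Python for illustration, and the labels are toy values.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn                    # number of true samples of this class
        score += f1 * support / total
    return score


# Toy labels: 1 = fake, 0 = real (same encoding as the notebook)
toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 1]
print(f"{weighted_f1(toy_true, toy_pred):.4f}")
```

With roughly balanced classes the weighted and macro averages stay close; the weighting matters more when one class dominates.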


In [None]:
def train_epoch(model, train_loader, optimizer, scheduler, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch in tqdm(train_loader, desc="Training"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()

        total_loss += loss.item()
        _, predicted = torch.max(outputs.logits.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    return total_loss / len(train_loader), 100 * correct / total

def evaluate(model, val_loader, device):
    model.eval()
    total_loss = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in tqdm(val_loader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

            total_loss += loss.item()
            _, predicted = torch.max(outputs.logits.data, 1)

            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(val_loader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='weighted')

    return avg_loss, accuracy, f1, all_preds, all_labels


## Training

Fine-tune the BERT model for up to 10 epochs, stopping early if the validation F1-score does not improve for 3 consecutive epochs.
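
The early-stopping rule can be sketched in isolation as follows. The helper `early_stopping_epoch` and the score sequence are hypothetical and exist only to show the bookkeeping; they are not results from this notebook.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return (last epoch run, best epoch, best score) under a patience rule."""
    best = 0.0
    best_epoch = None
    waited = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:                     # strict improvement resets the counter
            best, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:           # too many epochs without improvement
                return epoch, best_epoch, best
    return len(val_scores), best_epoch, best


# Made-up validation F1 values, one per epoch
made_up_scores = [0.90, 0.92, 0.91, 0.92, 0.90, 0.95]
print(early_stopping_epoch(made_up_scores))
```

In this made-up run, training stops after epoch 5 and the checkpoint from epoch 2 is the one kept, even though a later epoch would have scored higher. A tie with the best score counts as no improvement, matching the strict `>` comparison in the training loop.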


In [None]:
print("\n" + "=" * 60)
print("TRAINING BERT MODEL")
print("=" * 60)

# Optimizer and scheduler
num_epochs = 10
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
# The linear decay is planned over all epochs, even if early stopping ends training sooner
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

best_f1 = 0
patience = 3
patience_counter = 0

train_losses = []
val_losses = []
train_accs = []
val_accs = []
val_f1s = []

for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")

    train_loss, train_acc = train_epoch(model, train_loader, optimizer, scheduler, device)
    val_loss, val_acc, val_f1, _, _ = evaluate(model, val_loader, device)

    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accs.append(train_acc)
    # evaluate() returns accuracy as a 0-1 fraction; store it as a percentage like train_acc
    val_accs.append(val_acc * 100)
    val_f1s.append(val_f1)

    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc * 100:.2f}%, Val F1: {val_f1:.4f}")

    if val_f1 > best_f1:
        best_f1 = val_f1
        patience_counter = 0
        # Save model
        model.save_pretrained('best_bert_model')
        tokenizer.save_pretrained('best_bert_model')
        print(f"✓ New best F1: {best_f1:.4f}, model saved")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping after {epoch+1} epochs")
            break

print("\n" + "=" * 60)
print(f"Best validation F1: {best_f1:.4f}")
print("=" * 60)


## Test Set Evaluation

Final model evaluation on the test set with metrics (accuracy, F1-score, precision, recall) and confusion matrix.
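
For reading the confusion matrix, a small plain-Python sketch may help. It follows the same layout scikit-learn uses (rows are true labels, columns are predicted labels) and the notebook's encoding of 0 = real, 1 = fake. The helper `confusion_counts` and the toy labels are assumptions made for illustration only.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Build a confusion matrix: rows = true label, columns = predicted label."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


# Toy labels: 0 = real, 1 = fake
cm = confusion_counts([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
tn, fp = cm[0]   # true "real" row: correctly kept / wrongly flagged as fake
fn, tp = cm[1]   # true "fake" row: missed / correctly flagged

precision_fake = tp / (tp + fp)   # of articles flagged fake, how many were fake
recall_fake = tp / (tp + fn)      # of fake articles, how many were flagged
print(cm, round(precision_fake, 3), round(recall_fake, 3))
```

The off-diagonal cells are the two error types: the top-right cell counts real articles flagged as fake, and the bottom-left cell counts fake articles that slipped through.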


In [None]:
# Load best model
model = AutoModelForSequenceClassification.from_pretrained('best_bert_model').to(device)

print("\nEvaluating BERT model on test set:")
test_loss, test_acc, test_f1, test_preds, test_labels = evaluate(
    model, test_loader, device
)

print(f"\nTest Results:")
print(f"  Loss: {test_loss:.4f}")
print(f"  Accuracy: {test_acc:.4f}")
print(f"  F1-Score: {test_f1:.4f}")

print("\nClassification Report:")
print(classification_report(test_labels, test_preds, target_names=['Real', 'Fake']))

# Confusion Matrix
cm = confusion_matrix(test_labels, test_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Real', 'Fake'], yticklabels=['Real', 'Fake'])
plt.title('BERT - Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

bert_results = {
    'test_loss': float(test_loss),
    'test_accuracy': float(test_acc),
    'test_f1': float(test_f1),
    'test_precision': float(precision_score(test_labels, test_preds, average='weighted')),
    'test_recall': float(recall_score(test_labels, test_preds, average='weighted'))
}

print(f"\nBERT Results: {bert_results}")


## Download Model for Railway Deployment

After training, download the best BERT model files for deployment on Railway.
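
The archive step can be tried locally without Colab. The sketch below zips a stand-in model directory and unpacks it again using only the standard library. The temporary paths, the function name `archive_round_trip`, and the placeholder `config.json` are assumptions for illustration; a real saved model contains more files.

```python
import os
import shutil
import tempfile


def archive_round_trip():
    """Zip a stand-in model directory and unpack it, returning the extracted names."""
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = os.path.join(tmp, "best_bert_model")
        os.makedirs(model_dir)
        # Placeholder file standing in for the real saved model files
        with open(os.path.join(model_dir, "config.json"), "w") as f:
            f.write("{}")

        # Same call shape as in the notebook: files are stored at the archive root
        zip_path = shutil.make_archive(model_dir, "zip", model_dir)

        # Unpack into an explicitly named target folder
        target = os.path.join(tmp, "models", "best_bert_model")
        shutil.unpack_archive(zip_path, target)
        return sorted(os.listdir(target))


print(archive_round_trip())
```

Because `make_archive` is given the model directory as its root, the files sit at the top level of the zip rather than inside a `best_bert_model` folder, so it is safest to extract into a folder with that name.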


In [None]:
# Download BERT model for Railway deployment
from google.colab import files
import os
import shutil

model_dir = 'best_bert_model'

if os.path.exists(model_dir):
    # Create zip archive
    shutil.make_archive('best_bert_model', 'zip', model_dir)
    
    print(f"✓ Model directory found: {model_dir}")
    print(f"  Archive size: {os.path.getsize('best_bert_model.zip') / (1024*1024):.2f} MB")
    
    # Download the archive
    files.download('best_bert_model.zip')
    
    print("\n" + "="*60)
    print("✓ MODEL DOWNLOADED!")
    print("="*60)
    print("\nNext steps:")
    print("1. The best_bert_model.zip file should be downloaded to your Downloads folder")
    print("2. Extract the zip file into a folder named 'best_bert_model' (the archive stores the model files at its top level)")
    print("3. Copy that 'best_bert_model' folder to the models/ folder in your project")
    print("4. The path should be: Fake-News-Classifier-2/models/best_bert_model/")
else:
    print(f"⚠ Warning: Model directory not found at {model_dir}")
    print("Make sure training has been completed successfully.")
