**FAKE NEWS DETECTION WITH BERT**

Step 1: Setting Up Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Step 2: Loading the Dataset

In [3]:
import pandas as pd

# Load Data from Google Drive
fake_data = pd.read_csv('/content/drive/MyDrive/Fake_News_Detection/data/Fake.csv')
true_data = pd.read_csv('/content/drive/MyDrive/Fake_News_Detection/data/True.csv')

# Display basic info
print(f"Fake News: {fake_data.shape}")
print(f"True News: {true_data.shape}")

Fake News: (23481, 4)
True News: (21417, 4)


Step 3: Data Preprocessing

In [5]:
# Combine Data
fake_data['label'] = 0  # Fake = 0
true_data['label'] = 1  # True = 1

# Concatenate and Shuffle Data
data = pd.concat([fake_data, true_data], ignore_index=True)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Drop rows with missing text
data.dropna(subset=['text'], inplace=True)
data.reset_index(drop=True, inplace=True)

# Save cleaned data
cleaned_data_path = '/content/drive/MyDrive/Fake_News_Detection/data/cleaned_data.csv'
data.to_csv(cleaned_data_path, index=False)

print(f"Cleaned Data saved at: {cleaned_data_path}")
print(f"Total Records: {data.shape[0]}")


Cleaned Data saved at: /content/drive/MyDrive/Fake_News_Detection/data/cleaned_data.csv
Total Records: 44898
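
The counts printed so far are enough to check class balance by hand. The sketch below uses only the row counts reported in Step 2; the variable names are illustrative and not part of the notebook.

```python
# Row counts reported by the shape printouts in Step 2.
n_fake, n_true = 23481, 21417
total = n_fake + n_true  # matches "Total Records" in Step 3

fake_share = n_fake / total
true_share = n_true / total

# A classifier that always predicts the majority class would reach
# this accuracy, so it is the floor any trained model should beat.
majority_baseline = max(fake_share, true_share)

print(f"Fake: {fake_share:.1%}, True: {true_share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

The two classes are close to balanced, so plain accuracy is a reasonable headline metric here.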


Step 4: Install Necessary Libraries

In [7]:
!pip install transformers torch



Step 5: Import Required Libraries

In [8]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset, RandomSampler
from torch.amp import autocast, GradScaler  # torch.cuda.amp is deprecated in recent PyTorch
from tqdm import tqdm
import os


Step 6: Load Cleaned Data

In [9]:
# Load the cleaned data
data = pd.read_csv('/content/drive/MyDrive/Fake_News_Detection/data/cleaned_data.csv')

# Display the first few rows
data.head()


Unnamed: 0,title,text,subject,date,label
0,Ben Stein Calls Out 9th Circuit Court: Committ...,"21st Century Wire says Ben Stein, reputable pr...",US_News,"February 13, 2017",0
1,Trump drops Steve Bannon from National Securit...,WASHINGTON (Reuters) - U.S. President Donald T...,politicsNews,"April 5, 2017",1
2,Puerto Rico expects U.S. to lift Jones Act shi...,(Reuters) - Puerto Rico Governor Ricardo Rosse...,politicsNews,"September 27, 2017",1
3,OOPS: Trump Just Accidentally Confirmed He Le...,"On Monday, Donald Trump once again embarrassed...",News,"May 22, 2017",0
4,Donald Trump heads for Scotland to reopen a go...,"GLASGOW, Scotland (Reuters) - Most U.S. presid...",politicsNews,"June 24, 2016",1


Step 7: Split Data into Train and Test Sets

In [10]:
# Split data into training and testing sets
X = data['text'].values
y = data['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Samples: {len(X_train)}")
print(f"Testing Samples: {len(X_test)}")


Training Samples: 35918
Testing Samples: 8980
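
The training loop in Step 13 uses the test loader for early stopping, so the checkpoint is chosen on the same data that Step 16 reports on. One possible alternative is a three-way split with a separate validation set. The sketch below is plain Python on toy labels; `stratified_three_way_split` and its fractions are assumptions for illustration, not something the notebook defines. In practice, calling `train_test_split` twice with `stratify=` would likely do the same job.

```python
import random
from collections import defaultdict

def stratified_three_way_split(labels, val_frac=0.1, test_frac=0.2, seed=42):
    """Return (train_idx, val_idx, test_idx), keeping class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_val = round(len(indices) * val_frac)
        test_idx += indices[:n_test]
        val_idx += indices[n_test:n_test + n_val]
        train_idx += indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

# Toy labels: 60 fake (0) and 40 true (1).
toy_labels = [0] * 60 + [1] * 40
train_idx, val_idx, test_idx = stratified_three_way_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With such a split, early stopping would monitor the validation indices and the test indices would be touched only once, for the final report.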


Step 8: Initialize BERT Model and Tokenizer

In [11]:
# Initialize BERT Tokenizer and Model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

print("Using device:", device)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using device: cuda


Step 9: Define Tokenization Function

In [12]:
# Function to tokenize data
def tokenize_data(texts, labels):
    input_ids, attention_masks = [], []

    for text in texts:
        encoded = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=128,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])

    return torch.cat(input_ids), torch.cat(attention_masks), torch.tensor(labels)
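
To make the effect of `max_length=128`, `padding="max_length"` and `truncation=True` concrete, here is a rough pure-Python imitation for a single sequence. It is only a sketch: the real tokenizer also performs WordPiece splitting, and `pad_or_truncate` plus the toy token IDs are made up for illustration. The special-token IDs are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased

def pad_or_truncate(token_ids, max_length=128):
    """Imitate truncation=True with padding='max_length' for one sequence."""
    # Two positions are reserved for [CLS] and [SEP].
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)

    # Right-pad to the fixed length; padded positions get mask 0.
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# Arbitrary toy IDs, with a small max_length so the result is readable.
short_ids, short_mask = pad_or_truncate([2023, 2003, 2739], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=128`, this means at most 126 wordpieces from the start of each article reach the model; anything beyond that is dropped.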


Step 10: Tokenize the Data

In [13]:
train_inputs, train_masks, train_labels = tokenize_data(X_train, y_train)
test_inputs, test_masks, test_labels = tokenize_data(X_test, y_test)

print(f"Train Inputs: {train_inputs.shape}, Train Masks: {train_masks.shape}, Train Labels: {train_labels.shape}")
print(f"Test Inputs: {test_inputs.shape}, Test Masks: {test_masks.shape}, Test Labels: {test_labels.shape}")


Train Inputs: torch.Size([35918, 128]), Train Masks: torch.Size([35918, 128]), Train Labels: torch.Size([35918])
Test Inputs: torch.Size([8980, 128]), Test Masks: torch.Size([8980, 128]), Test Labels: torch.Size([8980])


Step 11: Create DataLoader for Train and Test Data

In [14]:
from torch.utils.data import DataLoader, TensorDataset, RandomSampler, SequentialSampler

# Set Batch Size
batch_size = 16

# Train DataLoader
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)

# Test DataLoader
test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_dataloader = DataLoader(test_data, sampler=SequentialSampler(test_data), batch_size=batch_size)

print(f"Train Dataloader: {len(train_dataloader)} batches")
print(f"Test Dataloader: {len(test_dataloader)} batches")


Train Dataloader: 2245 batches
Test Dataloader: 562 batches
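
The batch counts follow directly from the sample counts, because `DataLoader` keeps the final partial batch by default. A small arithmetic check, with variable names chosen here for illustration:

```python
import math

train_samples, test_samples = 35918, 8980
batch_size, epochs = 16, 10

# Default DataLoader behaviour (drop_last=False) rounds up.
train_batches = math.ceil(train_samples / batch_size)
test_batches = math.ceil(test_samples / batch_size)

# Size of the final, partial training batch.
last_train_batch = train_samples - (train_batches - 1) * batch_size

# Number of optimizer steps if all epochs run to completion.
total_steps = train_batches * epochs

print(train_batches, test_batches, last_train_batch, total_steps)
```

The `total_steps` value is what the scheduler in Step 12 receives as `num_training_steps`.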


Step 12: Initialize Optimizer, Scheduler, and Early Stopping

In [15]:
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW
import numpy as np

# Initialize Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)

# Gradient Scaler for Mixed Precision (torch.cuda.amp.GradScaler is deprecated)
scaler = torch.amp.GradScaler("cuda")

# Total Steps for Scheduler
total_steps = len(train_dataloader) * 10  # 10 epochs (can be adjusted)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Early Stopping Parameters
early_stopping_patience = 3  # Stop after 3 epochs without improvement
best_val_loss = np.inf
early_stopping_counter = 0

print("Optimizer, Scheduler, and Early Stopping Initialized.")


Optimizer, Scheduler, and Early Stopping Initialized.
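
With `num_warmup_steps=0`, the linear schedule decays the learning rate from its initial value to zero over `num_training_steps`. The sketch below mirrors that multiplier in plain Python to show the values at a few steps; `linear_schedule_factor` is a name chosen here, not a library function.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 2245 * 10  # batches per epoch * epochs, as in the cell above

for step in (0, 2245, 11225, 22450):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

If early stopping ends training before the last epoch, the learning rate never reaches zero, since the schedule is sized for the full 10 epochs.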



Step 13: Define Training Function with Early Stopping

In [17]:
from tqdm import tqdm

def train_model(model, train_dataloader, test_dataloader, optimizer, scheduler, scaler, epochs=10):
    global best_val_loss, early_stopping_counter

    for epoch in range(epochs):
        print(f"\nEpoch {epoch + 1}/{epochs}")
        model.train()
        total_loss = 0

        for batch in tqdm(train_dataloader):
            batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)
            optimizer.zero_grad()

            with torch.amp.autocast("cuda"):  # Mixed precision (torch.cuda.amp.autocast is deprecated)
                outputs = model(input_ids=batch_inputs, attention_mask=batch_masks, labels=batch_labels)
                loss = outputs.loss

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()

            total_loss += loss.item()

        avg_train_loss = total_loss / len(train_dataloader)
        print(f"Average Training Loss: {avg_train_loss:.4f}")

        # Validation (note: this reuses the test set, so metrics later
        # reported on that set are optimistic)
        val_loss = evaluate_model(model, test_dataloader)
        print(f"Validation Loss: {val_loss:.4f}")

        # Early Stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            early_stopping_counter = 0

            # Save Best Model
            model.save_pretrained('/content/drive/MyDrive/Fake_News_Detection/models/best_model')
            tokenizer.save_pretrained('/content/drive/MyDrive/Fake_News_Detection/models/best_model')
            print("✅ Best Model Saved.")
        else:
            early_stopping_counter += 1
            if early_stopping_counter >= early_stopping_patience:
                print("Early Stopping Triggered.")
                break


Step 14: Define Evaluation Function

In [18]:
def evaluate_model(model, dataloader):
    model.eval()
    total_loss = 0

    with torch.no_grad():
        for batch in dataloader:
            batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)
            outputs = model(input_ids=batch_inputs, attention_mask=batch_masks, labels=batch_labels)
            loss = outputs.loss
            total_loss += loss.item()

    return total_loss / len(dataloader)


Step 15: Start Training with Early Stopping

In [19]:
# Start Training
train_model(model, train_dataloader, test_dataloader, optimizer, scheduler, scaler, epochs=10)



Epoch 1/10


100%|██████████| 2245/2245 [04:39<00:00,  8.03it/s]


Average Training Loss: 0.0105
Validation Loss: 0.0059
✅ Best Model Saved.

Epoch 2/10


100%|██████████| 2245/2245 [04:41<00:00,  7.98it/s]


Average Training Loss: 0.0030
Validation Loss: 0.0064

Epoch 3/10


100%|██████████| 2245/2245 [04:40<00:00,  8.00it/s]


Average Training Loss: 0.0021
Validation Loss: 0.0025
✅ Best Model Saved.

Epoch 4/10


100%|██████████| 2245/2245 [04:41<00:00,  7.97it/s]


Average Training Loss: 0.0010
Validation Loss: 0.0021
✅ Best Model Saved.

Epoch 5/10


100%|██████████| 2245/2245 [04:41<00:00,  7.98it/s]


Average Training Loss: 0.0004
Validation Loss: 0.0008
✅ Best Model Saved.

Epoch 6/10


100%|██████████| 2245/2245 [04:41<00:00,  7.97it/s]


Average Training Loss: 0.0005
Validation Loss: 0.0042

Epoch 7/10


100%|██████████| 2245/2245 [04:40<00:00,  8.01it/s]


Average Training Loss: 0.0003
Validation Loss: 0.0003
✅ Best Model Saved.

Epoch 8/10


100%|██████████| 2245/2245 [04:41<00:00,  7.96it/s]


Average Training Loss: 0.0002
Validation Loss: 0.0003

Epoch 9/10


100%|██████████| 2245/2245 [04:40<00:00,  8.02it/s]


Average Training Loss: 0.0002
Validation Loss: 0.0005

Epoch 10/10


100%|██████████| 2245/2245 [04:40<00:00,  8.01it/s]


Average Training Loss: 0.0003
Validation Loss: 0.0006
Early Stopping Triggered.
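
The log above can be replayed to see why training stopped at epoch 10. The sketch below keeps the early-stopping state in a small class instead of module-level globals; `EarlyStopping` is a hypothetical helper, and the loss values are the rounded numbers printed in the log.

```python
class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.counter = 0

    def step(self, val_loss, epoch):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            # Strict improvement: remember it and reset the counter.
            self.best_loss, self.best_epoch, self.counter = val_loss, epoch, 0
            return False
        self.counter += 1
        return self.counter >= self.patience

# Validation losses as printed for epochs 1..10.
val_losses = [0.0059, 0.0064, 0.0025, 0.0021, 0.0008,
              0.0042, 0.0003, 0.0003, 0.0005, 0.0006]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss, epoch):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

The last improvement is at epoch 7, so the checkpoint on disk comes from that epoch, and epochs 8 to 10 exhaust the patience of 3.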


Step 16: Evaluate the Best Model on Test Data

In [21]:
from sklearn.metrics import classification_report, accuracy_score

# Load Best Model
model = BertForSequenceClassification.from_pretrained('/content/drive/MyDrive/Fake_News_Detection/models/best_model')
model.to(device)

model.eval()
predictions, true_labels = [], []

for batch in test_dataloader:
    batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)

    with torch.no_grad():
        outputs = model(input_ids=batch_inputs, attention_mask=batch_masks)
        preds = torch.argmax(outputs.logits, dim=1)

    predictions.extend(preds.cpu().numpy())
    true_labels.extend(batch_labels.cpu().numpy())

# Classification Report
print("\nClassification Report:\n")
print(classification_report(true_labels, predictions))

# Accuracy
accuracy = accuracy_score(true_labels, predictions)
print(f"\n✅ Model Accuracy: {accuracy * 100:.2f}%")



Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4710
           1       1.00      1.00      1.00      4270

    accuracy                           1.00      8980
   macro avg       1.00      1.00      1.00      8980
weighted avg       1.00      1.00      1.00      8980


✅ Model Accuracy: 99.99%
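
Because the classification report rounds to two decimals, every cell reads 1.00. The accuracy line is more informative: on 8980 test samples, only one error count formats as 99.99%. A quick check, with variable names chosen for illustration:

```python
n_test = 8980
reported = "99.99"  # accuracy string printed above, without the % sign

# Collect every error count whose accuracy prints as the reported value.
matching_errors = [
    errors
    for errors in range(0, 11)
    if f"{(n_test - errors) / n_test * 100:.2f}" == reported
]

print(matching_errors)
```

The reported figure therefore corresponds to a single misclassified article in the test set.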


Step 17: Display Test Results for Sample Texts

In [22]:
# Function to Display Predictions
def display_test_results(model, tokenizer, test_texts, test_labels, num_samples=5):
    model.eval()
    print("\nDisplaying Test Results on Sample Texts:\n")

    for i in range(min(num_samples, len(test_texts))):  # Never index past the inputs
        text = test_texts[i]
        label = test_labels[i]

        # Tokenize and Encode
        encoded = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=128,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        input_ids = encoded['input_ids'].to(device)
        attention_mask = encoded['attention_mask'].to(device)

        with torch.no_grad():
            output = model(input_ids, attention_mask=attention_mask)
            pred_label = torch.argmax(output.logits, dim=1).item()

        print(f"Text: {text[:200]}...")  # Display first 200 characters
        print(f"Actual Label: {'Fake' if label == 0 else 'True'}")
        print(f"Predicted Label: {'Fake' if pred_label == 0 else 'True'}")
        print("===" * 20)

# Display Results for a Few Test Samples
display_test_results(
    model,
    tokenizer,
    X_test[:10],   # Only the first num_samples (default 5) are displayed
    y_test[:10]
)



Displaying Test Results on Sample Texts:

Text: Well, that didn t take long. In the short time since Americans kinda-sorta elected Donald Trump to be Pussygrabber-in-Chief, Trump has appointed a bona fide white nationalist to a high-level position ...
Actual Label: Fake
Predicted Label: Fake
Text: (Reuters) - Republican lawmaker Devin Nunes’ investigation into whether Obama administration officials used classified intelligence reports to discredit Donald Trump’s 2016 campaign team could backfir...
Actual Label: True
Predicted Label: True
Text: WASHINGTON (Reuters) - President Donald Trump said on Friday that churches in Texas should be able to receive money from the Federal Emergency Management Agency for helping victims of Hurricane Harvey...
Actual Label: True
Predicted Label: True
Text: Print journalism and longstanding papers have been struggling since the advent of the internet and increased competition faced by blogs and op-ed sites. That, combined with poor understanding of mar