# Fake News Detector - Project

In this project we will train a model to distinguish between fake and real news. The dataset comes from kaggle.com. The following is the link to the dataset:

- https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset/data

Let's install all the libraries we will need:

In [None]:
!pip install transformers
!pip install datasets
!pip install scikit-learn
!pip install pandas
!pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl.metadata (9.2 kB)
Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


Next, we download the required data from Kaggle.

In [None]:
import os
import opendatasets as od

url = 'https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset/data'

od.download(url)

os.listdir('./fake-and-real-news-dataset')

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: CODElearn22
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
Downloading fake-and-real-news-dataset.zip to ./fake-and-real-news-dataset


100%|██████████| 41.0M/41.0M [00:00<00:00, 697MB/s]

['True.csv', 'Fake.csv']

In [None]:
import pandas as pd

fake_news = pd.read_csv("./fake-and-real-news-dataset/Fake.csv")
real_news = pd.read_csv("./fake-and-real-news-dataset/True.csv")

# The two files need to be combined and then shuffled, so first tag every row with a 'real' label (0 = fake, 1 = real)

fake_news['real'] = 0
real_news['real'] = 1

# now combine both into a single DataFrame

data = pd.concat([fake_news, real_news], ignore_index=True)

In [None]:
# let's check the data

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   real     44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB


Here we have five columns, but we don't need subject, because it is only a news-category label, and date says nothing about whether an article is fake or real, so we can leave both out of our training.
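Before dropping a column, it can help to look at how its values are spread across the two labels. The following is a minimal sketch on a made-up toy frame (the subject values are illustrative assumptions, not taken from the dataset). On the real data the equivalent call would be `pd.crosstab(data['subject'], data['real'])`.

```python
import pandas as pd

# Toy stand-in for the combined frame; the subject values here are made up.
toy = pd.DataFrame({
    "subject": ["News", "News", "politics", "worldnews", "worldnews"],
    "real":    [0, 0, 0, 1, 1],
})

# Rows: subject values, columns: label (0 = fake, 1 = real).
subject_by_label = pd.crosstab(toy["subject"], toy["real"])
print(subject_by_label)
```

If a subject value only ever appears under one label, the column would give the answer away to the model, which is one more reason to leave it out.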

<br>

Next on our agenda is to shuffle the dataset so that the model doesn't see all the fake articles at the beginning and all the real ones at the end. We also keep only the columns we need:
1. title
2. text
3. real


In [None]:
data = data[['title', 'text', 'real']]
data = data.sample(frac=1).reset_index(drop=True)
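The shuffle above produces a different order on every run. If a repeatable order is wanted, `sample` accepts a `random_state` argument. A small sketch on a toy frame (the seed value 42 is an arbitrary assumption):

```python
import pandas as pd

toy = pd.DataFrame({"text": list("abcdef"), "real": [0, 0, 0, 1, 1, 1]})

# frac=1 returns every row in random order; random_state fixes that order.
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))                      # same seed, same order
print(sorted(shuffled_a["text"]) == sorted(toy["text"]))  # no rows lost
```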

Let's check whether the data looks the way we want.

In [None]:
data

| | title | text | real |
|---|---|---|---|
| 0 | WATCH: Nancy Pelosi Takes House Intel Chair T... | GOP Rep. Devin Nunes is feeling the burn after... | 0 |
| 1 | (VIDEO)ICE PROTECTING OBAMA: WON’T RELEASE NAM... | | 0 |
| 2 | Forbes pegs Trump's wealth at $3.7 billion, $8... | WASHINGTON (Reuters) - U.S. Republican preside... | 1 |
| 3 | More than a thousand turn Philippine funeral t... | MANILA (Reuters) - More than a thousand people... | 1 |
| 4 | White House narrows search to three for Suprem... | WASHINGTON/AUSTIN, Texas (Reuters) - The White... | 1 |
| ... | ... | ... | ... |
| 44893 | CHRISTIAN HIGH SCHOOL Told By State They Are N... | The drip drip drip of communism Leftists are s... | 0 |
| 44894 | CITY OF CHICAGO Forcing Out Homeless Veterans ... | There is no reason to believe the welfare of o... | 0 |
| 44895 | California governor signs climate policy exten... | LOS ANGELES (Reuters) - California Governor Je... | 1 |
| 44896 | Democratic Candidates SLAM Trump After Bloody... | Republican frontrunner Donald Trump is in the ... | 0 |


Let's preprocess the data. Instead of feeding title and text to the model separately, we can merge them into a single content column, which is all the model needs as input.

In [None]:
data["content"] = "[TITLE] " + data["title"] + " [TEXT] " + data["text"]

# drop the rest; keep only 'content' and 'real'

data = data[['content', 'real']]
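The preview earlier shows at least one row (index 1) whose text appears to be empty, so some content strings may end up holding little more than the title. Here is a minimal sketch, on made-up rows, of how such cases could be counted before concatenating. Stripping whitespace from the text is an assumption of this sketch and is not done in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Headline A", "Headline B", "Headline C"],
    "text":  ["Body of article A", "   ", ""],
})

# True where the body is empty or whitespace only.
blank_text = toy["text"].str.strip().eq("")
print(blank_text.sum(), "rows have no body text")

# Same format as above, with surrounding whitespace removed from the body.
toy["content"] = "[TITLE] " + toy["title"] + " [TEXT] " + toy["text"].str.strip()
print(toy["content"].tolist())
```

Whether to drop such rows or keep them as title-only examples is a judgment call; the notebook keeps them.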

In [None]:
data

| | content | real |
|---|---|---|
| 0 | [TITLE] WATCH: Nancy Pelosi Takes House Intel... | 0 |
| 1 | [TITLE] (VIDEO)ICE PROTECTING OBAMA: WON’T REL... | 0 |
| 2 | [TITLE] Forbes pegs Trump's wealth at $3.7 bil... | 1 |
| 3 | [TITLE] More than a thousand turn Philippine f... | 1 |
| 4 | [TITLE] White House narrows search to three fo... | 1 |
| ... | ... | ... |
| 44893 | [TITLE] CHRISTIAN HIGH SCHOOL Told By State Th... | 0 |
| 44894 | [TITLE] CITY OF CHICAGO Forcing Out Homeless V... | 0 |
| 44895 | [TITLE] California governor signs climate poli... | 1 |
| 44896 | [TITLE] Democratic Candidates SLAM Trump Afte... | 0 |


In [None]:
import torch
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Tokenize every article at once; inputs longer than 512 tokens are truncated
# and shorter ones are padded to the longest sequence in the batch
train_encodings = tokenizer(
    data["content"].tolist(),
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors='pt'
)

# Labels as a tensor: 0 = fake, 1 = real
labels = torch.tensor(data["real"].tolist())

print("Input IDs shape:", train_encodings['input_ids'].shape)
print("Labels shape:", labels.shape)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Input IDs shape: torch.Size([44898, 512])
Labels shape: torch.Size([44898])


Next, we need to wrap the encodings and labels in a form that PyTorch can batch. A custom Dataset has to provide `__getitem__` and `__len__`, so let's create a new class that implements both.

In [None]:
from torch.utils.data import Dataset, DataLoader

class NewsDataset(Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, index):
    item = {key: val[index] for key, val in self.encodings.items()}
    item['labels'] = self.labels[index]
    return item

  def __len__(self):
    return len(self.labels)

train_dataset = NewsDataset(train_encodings, labels)
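To see what such a dataset hands back, here is a small self-contained sketch with made-up tensors in place of the real encodings. `ToyDataset` is a hypothetical copy of the class above, repeated only so the example runs on its own.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Same shape as NewsDataset: a dict of tensors plus a label tensor."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, index):
        item = {key: val[index] for key, val in self.encodings.items()}
        item["labels"] = self.labels[index]
        return item

    def __len__(self):
        return len(self.labels)

# Made-up encodings: 4 samples, 3 tokens each.
encodings = {
    "input_ids": torch.arange(12).reshape(4, 3),
    "attention_mask": torch.ones(4, 3, dtype=torch.long),
}
labels = torch.tensor([0, 1, 1, 0])

dataset = ToyDataset(encodings, labels)
sample = dataset[2]  # one sample: a dict with one row per key
print(sample["input_ids"], sample["labels"])

# A DataLoader stacks such dicts into batched tensors.
batch = next(iter(DataLoader(dataset, batch_size=2)))
print({key: tuple(val.shape) for key, val in batch.items()})
```

Each item is a dictionary whose keys match the arguments the model expects, which is why the training loop can later read `batch['input_ids']`, `batch['attention_mask']` and `batch['labels']` directly.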

Next, we split the data into a training set (80%) and a test set (20%). We can do this as follows.

In [None]:
from torch.utils.data import random_split

train_size = int(0.8 * len(train_dataset))
test_size = len(train_dataset) - train_size

train_dataset, test_dataset = random_split(train_dataset, [train_size, test_size])
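As written, the split changes on every run. `random_split` accepts a `generator` argument, so passing a seeded `torch.Generator` should make the split repeatable. A minimal sketch on a toy dataset (the seed 42 and the helper name `split` are assumptions for illustration):

```python
import torch
from torch.utils.data import TensorDataset, random_split

toy_dataset = TensorDataset(torch.arange(10))

train_size = int(0.8 * len(toy_dataset))
test_size = len(toy_dataset) - train_size

def split(seed):
    # A fresh generator with a fixed seed yields the same permutation each time.
    generator = torch.Generator().manual_seed(seed)
    return random_split(toy_dataset, [train_size, test_size], generator=generator)

train_a, test_a = split(seed=42)
train_b, test_b = split(seed=42)

print(len(train_a), len(test_a))
print(train_a.indices == train_b.indices)  # identical split for the same seed
```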

In [None]:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

The next step is to load the pretrained model and set up the optimizer for training.

In [None]:
from transformers import DistilBertForSequenceClassification
from torch.optim import AdamW

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now that we have everything ready, it's finally time to train our model.

In [15]:
from tqdm.auto import tqdm

for epoch in range(3):
    # training: switch back to train mode at the start of every epoch,
    # because the evaluation step below leaves the model in eval mode
    model.train()
    for batch in tqdm(train_loader):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    # evaluation on the held-out test split
    model.eval()
    total_correct = 0
    total_samples = 0
    with torch.no_grad():
        for batch in tqdm(test_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            predictions = torch.argmax(outputs.logits, dim=-1)
            total_correct += (predictions == labels).sum().item()
            total_samples += labels.size(0)

    accuracy = total_correct / total_samples
    print(f"Epoch {epoch+1} - Test Accuracy: {accuracy}")

  0%|          | 0/2245 [00:00<?, ?it/s]

  0%|          | 0/562 [00:00<?, ?it/s]

Epoch 1 - Test Accuracy: 0.9997772828507795


  0%|          | 0/2245 [00:00<?, ?it/s]

  0%|          | 0/562 [00:00<?, ?it/s]

Epoch 2 - Test Accuracy: 0.9997772828507795


  0%|          | 0/2245 [00:00<?, ?it/s]

  0%|          | 0/562 [00:00<?, ?it/s]

Epoch 3 - Test Accuracy: 0.9997772828507795
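Accuracy is the only number the loop prints. As a complement, here is a minimal plain-Python sketch of precision, recall and F1 for one class. `binary_metrics` is a hypothetical helper, and the labels and predictions below are made up purely to show the call. In the loop, one could collect `predictions.tolist()` and `labels.tolist()` from every batch and pass the combined lists in.

```python
def binary_metrics(labels, predictions, positive=1):
    """Precision, recall and F1 for one class, from two equal-length lists."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predictions, only to show the call.
toy_labels      = [1, 1, 1, 0, 0, 0]
toy_predictions = [1, 1, 0, 0, 0, 1]
print(binary_metrics(toy_labels, toy_predictions))
```

Calling it once with `positive=1` and once with `positive=0` gives a per-class view for real and fake articles respectively.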


Let's save it to Google Drive so that we don't lose the model.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
model.save_pretrained("/content/drive/MyDrive/fake_news_detector_model")
tokenizer.save_pretrained("/content/drive/MyDrive/fake_news_detector_model")

Now that we have trained the model, we need a function that uses it to make predictions.
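One possible shape for such a function is sketched here. `predict_news` is a hypothetical name, and the sketch assumes it receives the `tokenizer` and `model` objects from the cells above. It rebuilds the same `[TITLE] ... [TEXT] ...` string used for training and maps class 1 to real and class 0 to fake, following the labels set earlier.

```python
import torch

def predict_news(title, text, tokenizer, model, device="cpu"):
    """Return 'real' or 'fake' for one article (hypothetical helper)."""
    # Same input format that was used to build the training column.
    content = "[TITLE] " + title + " [TEXT] " + text
    encoded = tokenizer(
        content, truncation=True, padding=True, max_length=512, return_tensors="pt"
    )
    # Move every input tensor to the device the model lives on.
    inputs = {key: val.to(device) for key, val in encoded.items()}

    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class = int(torch.argmax(logits, dim=-1).item())
    # The notebook labels fake as 0 and real as 1.
    return "real" if predicted_class == 1 else "fake"
```

In this notebook it could be called as `predict_news("Some headline", "Some article body", tokenizer, model, device)`, where the two strings are placeholders for an actual article.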