# NLP: fake and true news discrimination with BERT

The data, collected from [Kaggle](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?select=True.csv), consists of two datasets: one of fake news and one of true news.

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

In [3]:
fake = pd.read_csv('Fake.csv')
true = pd.read_csv('True.csv')

print('Fake news dataset: \n',fake.head())
print('\nTrue news dataset: \n',true.head())

Fake news dataset: 
                                                title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   
2   Sheriff David Clarke Becomes An Internet Joke...   
3   Trump Is So Obsessed He Even Has Obama’s Name...   
4   Pope Francis Just Called Out Donald Trump Dur...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   
2  On Friday, it was revealed that former Milwauk...    News   
3  On Christmas day, Donald Trump announced that ...    News   
4  Pope Francis used his annual Christmas Day mes...    News   

                date  
0  December 31, 2017  
1  December 31, 2017  
2  December 30, 2017  
3  December 29, 2017  
4  December 25, 2017  

True news dataset: 
                                                title  \
0  As U.S. budget fight looms, Republican

In [4]:
#build dataset with training and test split

#set the text of the article as samples
fake['text'] = fake['title'] + ' ' + fake['text']
true['text'] = true['title'] + ' ' + true['text']

#set the labels
fake['label'] = 0
true['label'] = 1

#concatenate fake and true
df = pd.concat([fake, true], ignore_index=True)

#shuffle rows and reset the index
df = df.sample(frac=1, random_state=42).reset_index(drop=True)


#training and test dataset
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
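
Passing `stratify=df['label']` should keep the fake/true ratio the same in both splits. Below is a minimal sketch of how one might check this on a small synthetic frame. The `toy` data and its 60/40 ratio are invented for illustration and are not taken from the Kaggle files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 "fake" (0) and 40 "true" (1) rows
toy = pd.DataFrame({
    'text': [f'article {i}' for i in range(100)],
    'label': [0] * 60 + [1] * 40,
})

# Same split settings as in the cell above
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy['label']
)

# Share of each label in the two splits
train_share = toy_train['label'].value_counts(normalize=True).sort_index()
test_share = toy_test['label'].value_counts(normalize=True).sort_index()

print(train_share.to_dict())
print(test_share.to_dict())
```

The same two `value_counts(normalize=True)` calls can be applied to `train_df['label']` and `test_df['label']` to confirm the balance on the real data.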

### Fine-tuning BERT

In [None]:
!pip install transformers datasets torch accelerate

In [None]:
from datasets import Dataset
from transformers import AutoTokenizer

#tokenizer from BERT
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

#preprocess function for dataset
def preprocess_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=128
    )

#convert the dataframes to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

#map on training and test sets
train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)
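
With `truncation=True`, `padding='max_length'` and `max_length=128`, every article becomes a fixed-length sequence: longer inputs are cut, and shorter ones are filled with padding that the attention mask marks as ignorable. The pure-Python sketch below imitates that behavior on made-up token ids. The helper `pad_or_truncate` and the pad id `0` are assumptions for illustration. The real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = token_ids[:max_length]       # truncation: drop everything past max_length
    n_pad = max_length - len(kept)      # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets truncated
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Since the title is prepended to the text before tokenization, truncation at 128 tokens means the model most likely sees the title and only the opening part of each article.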

In [None]:
from transformers import AutoModelForSequenceClassification

#load model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

In [23]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

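#define metrics
#works both when called directly with (pred_labels, true_labels)
#and when called by Trainer with a single (logits, labels) pair
def compute_metrics(pred_labels, true_labels=None):
    if true_labels is None:
        logits, true_labels = pred_labels
        pred_labels = logits.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, pred_labels, average='binary')
    accuracy = accuracy_score(true_labels, pred_labels)
    return {'accuracy': accuracy, 'f1': f1, 'precision': precision, 'recall': recall}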
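
With `average='binary'`, scikit-learn treats label 1 (true news here) as the positive class by default. The sketch below recomputes the same four quantities from raw counts on a small invented example, to make explicit what each number measures. The helper `binary_metrics` and the toy labels are hypothetical and are not part of the notebook.

```python
def binary_metrics(pred_labels, true_labels, positive=1):
    """Accuracy, precision, recall and F1 computed from raw counts."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(p == positive and t == positive for p, t in pairs)  # true positives
    fp = sum(p == positive and t != positive for p, t in pairs)  # false positives
    fn = sum(p != positive and t == positive for p, t in pairs)  # false negatives
    correct = sum(p == t for p, t in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        'accuracy': correct / len(pairs),
        'f1': f1,
        'precision': precision,
        'recall': recall,
    }

# Toy example: one true article (label 1) is wrongly predicted as fake (label 0)
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0]
toy_metrics = binary_metrics(toy_pred, toy_true)
print(toy_metrics)
```

In this toy case precision is 1.0 because nothing fake was predicted as true, while recall drops to 0.75 because one of the four true articles was missed.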


In [9]:
from transformers import TrainingArguments, Trainer

#prepare training arguments
training_args = TrainingArguments(
    output_dir='./bert-fake-news',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=50,
    save_total_limit=1,
    report_to='none'
)
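
The batch size and epoch count determine how many optimizer updates the run performs. Here is a rough sketch of that arithmetic, assuming a single device and no gradient accumulation. The training-set size used here is a placeholder, not the actual size of `train_df`, and `total_training_steps` is a hypothetical helper.

```python
import math

def total_training_steps(n_train, batch_size, epochs):
    """Optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last partial batch is kept
    return steps_per_epoch * epochs

# Placeholder training-set size (assumption, for illustration only)
n_train_example = 10_000

steps = total_training_steps(n_train_example, batch_size=8, epochs=3)
log_entries = steps // 50  # one log entry every logging_steps=50 steps

print(steps)
print(log_entries)
```

Replacing `n_train_example` with `len(train_df)` gives the estimate for the actual run.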

In [None]:
#fit the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
#predictions on test set
predictions = trainer.predict(test_dataset)
pred_labels = predictions.predictions.argmax(-1)
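
`predictions.predictions` holds one row of raw logits per article, with one column per class, and `argmax(-1)` picks the column with the larger logit. A small NumPy sketch with invented logits shows this step, along with a softmax that turns logits into probabilities. The softmax is an addition for illustration and is not part of the original cell.

```python
import numpy as np

# Invented logits: one row per article, columns = [fake (0), true (1)]
toy_logits = np.array([
    [3.2, -2.9],   # strongly class 0
    [-1.5, 2.4],   # clearly class 1
    [0.1, 0.3],    # nearly undecided, slightly class 1
])

# Predicted label = index of the largest logit in each row
toy_pred = toy_logits.argmax(-1)

# Numerically stable softmax: subtract the row maximum before exponentiating
shifted = toy_logits - toy_logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(toy_pred)
print(probs.round(3))
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities gives the same labels as taking the argmax of the logits.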

In [24]:
#compute metrics
compute_metrics(pred_labels, test_df['label'])

{'accuracy': 0.9998886414253898,
 'f1': 0.999883273024396,
 'precision': 1.0,
 'recall': 0.9997665732959851}

All four evaluation metrics (accuracy, precision, recall, and F1 score) are very close to 1.0 on the held-out test split. With fake news labeled 0 and true news labeled 1, a precision of 1.0 means that no fake article was classified as true, and a recall of about 0.9998 means that only a tiny fraction of true articles were classified as fake. On this dataset, the model has learned to separate true from fake content very reliably, which suggests it could be effective for detecting misinformation in practice.