# Project Description
In this project, we build a <strong>Fake News Prediction System</strong> using machine learning in Python.
We train and compare four classifiers: Logistic Regression, Decision Tree, Gradient Boosting, and Random Forest.

# Tools Required
<ul>
  <li>Python 3.12 or greater</li>
  <li>An IDE <i>(e.g. VS Code, PyCharm, Spyder, Jupyter Notebook, Google Colab)</i></li>
</ul>

# Prerequisites
<ul>
  <li>Basic knowledge of machine learning concepts</li>
  <li>Understanding of how to build a classification model</li>
  <li>Knowledge of the fundamental concepts of probability</li>
</ul>
<br>
<cite>Author: SH3PO | 03 Dec 2025 11:39</cite>

<hr>

<h1 style="text-align: center;">Dependencies</h1>
<br>
Run the cell below to install the required dependencies for the project.

In [1]:
!pip install pandas numpy scikit-learn matplotlib seaborn



# Libraries

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

import re
import string
import warnings
warnings.filterwarnings('ignore')

In [19]:
# IMPORT THE FAKE DATA
df_fake = pd.read_csv('Fake.csv')
df_fake.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


In [20]:
# IMPORT THE TRUE DATA
df_true = pd.read_csv('True.csv')
df_true.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


In [21]:
# ADD A TARGET COLUMN
df_fake['class'] = 0
print('Fake data dimensions: {}'.format(df_fake.shape))

df_true['class'] = 1
print('True data dimensions: {}'.format(df_true.shape))


Fake data dimensions: (23481, 5)
True data dimensions: (21417, 5)


In [22]:
# HOLD OUT THE LAST 10 ROWS OF THE FAKE DATA FOR MANUAL TESTING
new_df_fake = df_fake.tail(10).copy()  # copy so later column assignments do not touch df_fake
df_fake = df_fake.iloc[:-10]           # keep everything except the last 10 rows

print('Fake data dimensions: {}'.format(df_fake.shape))

Fake data dimensions: (23471, 5)


In [23]:
# HOLD OUT THE LAST 10 ROWS OF THE TRUE DATA FOR MANUAL TESTING
new_df_true = df_true.tail(10).copy()  # copy so later column assignments do not touch df_true
df_true = df_true.iloc[:-10]           # keep everything except the last 10 rows
print('True data dimensions: {}'.format(df_true.shape))

True data dimensions: (21407, 5)


In [25]:
# ADD THE TARGET TO THE TEST DATA
new_df_fake['class'] = 0
new_df_true['class'] = 1

In [26]:
# MERGE THE DATASETS
df = pd.concat([df_fake, df_true], axis=0)
df.head()

| | title | text | subject | date | class |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 0 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 0 |


In [35]:
# REMOVE UNNECESSARY COLUMNS
df_filtered = df.drop(['title', 'subject', 'date'], axis=1)
df_filtered.head()

| | text | class |
|---|---|---|
| 0 | Donald Trump just couldn t wish all Americans ... | 0 |
| 1 | House Intelligence Committee Chairman Devin Nu... | 0 |
| 2 | On Friday, it was revealed that former Milwauk... | 0 |
| 3 | On Christmas day, Donald Trump announced that ... | 0 |
| 4 | Pope Francis used his annual Christmas Day mes... | 0 |


In [36]:
# CHECK FOR NULLS
df_filtered.isnull().sum()

| | 0 |
|---|---|
| text | 0 |
| class | 0 |


In [37]:
# RANDOM SHUFFLING
df_filtered = df_filtered.sample(frac=1).reset_index(drop=True)
df_filtered.head()

| | text | class |
|---|---|---|
| 0 | (In this June 9 story, corrects name of compa... | 1 |
| 1 | BEIRUT (Reuters) - France should not interfere... | 1 |
| 2 | YANGON (Reuters) - Myanmar s army said on Wedn... | 1 |
| 3 | WASHINGTON (Reuters) - President Barack Obama ... | 1 |
| 4 | No ,it s not a reality show but Obama s made i... | 0 |


In [39]:
# TEXT PROCESSING

def process_text(text):
  """Lowercase the text and strip bracketed spans, URLs, HTML tags, punctuation and words containing digits."""
  text = text.lower()
  text = re.sub(r'\[.*?\]', '', text) # remove text in square brackets
  text = re.sub(r'https?://\S+|www\.\S+', '', text) # remove urls (must run before punctuation is stripped)
  text = re.sub(r'<.*?>+', '', text) # remove html tags
  text = re.sub(r'\W', ' ', text) # replace non-word characters with spaces
  text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove remaining punctuation (underscores)
  text = re.sub(r'\n', '', text) # remove any remaining new lines
  text = re.sub(r'\w*\d\w*', '', text) # remove words containing digits
  return text

df_filtered['text'] = df_filtered['text'].apply(process_text)
df_filtered.head()

In [40]:
# DATA PARTITIONING (use the filtered, shuffled frame prepared above)
X = df_filtered['text']
y = df_filtered['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
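
The split above is not seeded, so the exact rows in each partition (and therefore the scores) change from run to run. `train_test_split` accepts a `random_state` argument for reproducibility and a `stratify` argument to preserve the class ratio. The sketch below is not scikit-learn's implementation; it is a pure-Python illustration of what a seeded 75/25 split does, and `seeded_split` and the toy data are hypothetical names introduced only for this example.

```python
import random

def seeded_split(items, labels, test_size=0.25, seed=42):
    """Shuffle the row indices with a fixed seed, then cut off a test share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # same seed -> same order
    n_test = int(round(len(items) * test_size))   # number of test rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data standing in for the article texts and their labels.
texts = [f"article {i}" for i in range(8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

first = seeded_split(texts, labels)
second = seeded_split(texts, labels)
print(len(first[0]), len(first[1]))  # sizes of the train and test partitions
print(first == second)               # identical partitions on every run
```

With a fixed seed, the accuracy scores reported later become comparable across reruns of the notebook.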

In [41]:
# CONVERT TEXT TO VECTORS
vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(X_train)
xv_test = vectorization.transform(X_test)
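
To make the fit/transform distinction concrete, here is a rough pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF formula and L2 normalisation that `TfidfVectorizer` uses by default, but it tokenises on whitespace only, so it is a simplification and not the library's code. `fit_idf`, `transform` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed inverse document frequencies from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def transform(doc, idf):
    """Weight term counts by IDF, ignore terms unseen during fitting, L2-normalise."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

train_docs = ["trump wins election", "election results announced", "pope speaks"]
idf = fit_idf(train_docs)

# "hoax" and "exposed" never appeared in training, so they are dropped.
vector = transform("election hoax exposed", idf)
print(sorted(idf, key=idf.get)[0])  # the most common term gets the lowest IDF
print(vector)
```

This is why the notebook calls `fit_transform` on the training text but only `transform` on the test text: the vocabulary and IDF weights must come from the training data alone.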

# Model Development
In this section, we build four models:
<ul>
  <li><i>Logistic Regression</i>: Predicts the probability of a given outcome.</li>
  <li><i>DecisionTree Classifier</i>: Constructs a tree-like model of decisions.</li>
  <li><i>RandomForest Classifier</i>: Constructs a multitude of decision trees during training.</li>
  <li><i>GradientBoosting Classifier</i>: Builds a strong predictive model by sequentially combining multiple weak prediction models.</li>
</ul>
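
As an illustration of "predicts the probability of a given outcome", the sketch below shows the core of logistic regression: a weighted sum of features passed through the sigmoid function. The weights here are invented for the example and are not taken from the fitted `LR` model; `predict_proba` is a hypothetical helper, not the scikit-learn method of the same name.

```python
import math

def predict_proba(weights, bias, features):
    """Return P(class = 1) as the sigmoid of a weighted sum of the features."""
    z = bias + sum(weights.get(term, 0.0) * value
                   for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: positive values push towards class 1 (true news),
# negative values towards class 0 (fake news).
weights = {"reuters": 2.0, "shocking": -1.5}
bias = 0.0

p_true = predict_proba(weights, bias, {"reuters": 1.0})
p_fake = predict_proba(weights, bias, {"shocking": 1.0})
print(round(p_true, 3), round(p_fake, 3))
```

A probability above 0.5 is mapped to class 1 and anything else to class 0, which matches the labels assigned earlier in the notebook.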
<hr>

#### 1. Logistic Regression

In [48]:
# MODEL DEVELOPMENT
LR = LogisticRegression()
LR.fit(xv_train, y_train)

y_pred = LR.predict(xv_test)

# MODEL EVALUATION
LR.score(xv_test, y_test)

0.985650623885918

In [49]:
# CLASSIFICATION REPORT
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      5929
           1       0.98      0.99      0.98      5291

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220
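
For readers new to the columns in the report above, this sketch computes precision, recall, F1 and support by hand. The labels are toy values chosen for the example, not the notebook's predictions, and `class_metrics` is a hypothetical helper.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall, F1 and support for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, tp + fn  # support = number of true examples

# Toy labels: one fake article (class 0) is wrongly predicted as true (class 1).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1]

report = {label: class_metrics(y_true, y_pred, label) for label in (0, 1)}
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (precision, recall, f1, support) in report.items():
    print(f"{label}  precision={precision:.2f}  recall={recall:.2f}  "
          f"f1={f1:.2f}  support={support}")
print(f"accuracy={accuracy:.3f}")
```

The single mistake lowers recall for class 0 and precision for class 1, which is the same pattern to look for when the two rows of a report differ.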



#### 2. DecisionTree Classifier

In [50]:
# BUILD CLASSIFIER MODEL
DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)

y_pred = DT.predict(xv_test)

# MODEL EVALUATION
DT.score(xv_test, y_test)

0.9943850267379679

In [51]:
# CLASSIFICATION REPORT
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      5929
           1       1.00      0.99      0.99      5291

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220



#### 3. GradientBoosting Classifier

In [54]:
# INSTANTIATE THE MODEL
GB = GradientBoostingClassifier(random_state=0)

# FIT THE MODEL
GB.fit(xv_train, y_train)

# PREDICTIONS
y_pred = GB.predict(xv_test)

# MODEL EVALUATION
GB.score(xv_test, y_test)


0.9946524064171123

In [55]:
# CLASSIFICATION REPORT
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      5929
           1       0.99      1.00      0.99      5291

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220



#### 4. RandomForest Classifier

In [59]:
# INSTANTIATE THE MODEL
RF = RandomForestClassifier(random_state=0)

# FIT THE MODEL
RF.fit(xv_train, y_train)

# PREDICTIONS
y_pred = RF.predict(xv_test)

# MODEL EVALUATION
RF.score(xv_test, y_test)

0.9893939393939394

In [60]:
# CLASSIFICATION REPORT
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5929
           1       0.99      0.99      0.99      5291

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220
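
With all four models evaluated, the accuracy scores printed by the `score` calls above can be collected and ranked. The classification reports round to two decimals and so all show 0.99; the raw scores separate the models more clearly. This is a small convenience sketch with the values copied from the cell outputs.

```python
# Accuracy scores copied from the cell outputs above.
scores = {
    "Logistic Regression": 0.985650623885918,
    "DecisionTree": 0.9943850267379679,
    "GradientBoosting": 0.9946524064171123,
    "RandomForest": 0.9893939393939394,
}

# Sort from best to worst and print each score as a percentage.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<20} {score:.1%}")
```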



### Results 😎🎉✨
From the results above, the <strong>GradientBoosting</strong> and <strong>DecisionTree</strong> classifiers are the best performing models, with accuracy scores of <strong>99.5%</strong> and <strong>99.4%</strong> respectively. The <strong>Random Forest</strong> and <strong>Logistic Regression</strong> classifiers achieved accuracy scores of 98.9% and 98.6%.
<hr>

### Next Steps
The next steps would be to test whether the models will still perform as expected on unseen data.
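
One way to do this is to score a model on rows it never saw during training, such as the 20 rows held out earlier in `new_df_fake` and `new_df_true`. The sketch below shows the idea with a stand-in predictor and made-up texts. `holdout_accuracy` and `toy_predict` are hypothetical; in the notebook the predictor would wrap `process_text`, `vectorization.transform` and a fitted model's `predict`.

```python
def holdout_accuracy(predict, texts, labels):
    """Share of held-out examples for which `predict` returns the true label."""
    predictions = [predict(text) for text in texts]
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

# Stand-in predictor for illustration only (not one of the trained models).
def toy_predict(text):
    return 1 if "reuters" in text.lower() else 0

# Made-up held-out examples: 1 = not fake, 0 = fake.
held_out_texts = [
    "WASHINGTON (Reuters) - Senate passes budget bill",
    "You won't believe what happened next",
    "LONDON (Reuters) - Markets close higher",
    "Officials confirm new trade agreement",
]
held_out_labels = [1, 0, 1, 1]

accuracy = holdout_accuracy(toy_predict, held_out_texts, held_out_labels)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Because the function only needs a callable, the same check can be repeated for each of the four models.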

# Model Testing
Model testing validates a trained model by checking that it performs as expected on data it has not seen before.

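In this project, we use a <strong>custom</strong> testing strategy: a news article entered by the user is preprocessed and vectorized in the same way as the training data, then classified by all four models.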

In [66]:
def binary_category(target):
  if target == 0:
    return 'Fake news'
  elif target == 1:
    return 'Not fake news'

def get_user_input():
  return input('Enter a news article: ')

def testing_strategy(news):
  test_news = {'text':[news]}
  new_def_test = pd.DataFrame(test_news)
  new_def_test['text'] = new_def_test['text'].apply(process_text)
  new_x_test = new_def_test['text']
  new_xv_test = vectorization.transform(new_x_test)
  pred_LR = LR.predict(new_xv_test)
  pred_DT = DT.predict(new_xv_test)
  pred_GBC = GB.predict(new_xv_test)
  pred_RFC = RF.predict(new_xv_test)

  # Print one prediction per model (the function itself returns nothing)
  print(f'\n\nLogistic Regression predictions: {binary_category(pred_LR[0])}')
  print(f'\nDecisionTree predictions: {binary_category(pred_DT[0])}')
  print(f'\nGradientBoosting predictions: {binary_category(pred_GBC[0])}')
  print(f'\nRandomForest predictions: {binary_category(pred_RFC[0])}')

In [68]:
# TESTING
news = get_user_input()
testing_strategy(news)

Enter a news article: The anonymous bulletin board sites then focused their attention on the pizza shop called Comet Ping Pong, which was frequently mentioned in the e-mail of John Podesta, head of the Clinton campaign, whose e-mails were being successively leaked on the whistle-blower site WikiLeaks at about this same time. This escalated into posts that this shop was the site of child sex trafficking. The day before the voting in the presidential election, the hashtag “#pizzagate” appeared. Even after Ms. Clinton’s defeat the following day, the tweets did not subside, and instead continued to expand. It was reported that the Central Intelligence Agency (“CIA”) had determined that there were cyber-attacks on the e-mail of Democratic Party officials, like Mr. Podesta, indicating that there was intervention from Russia aimed at ensuring that Mr. Trump would win the election; and President Obama demanded a thorough investigation of the government intelligence agencies before his own reti