<a href="https://colab.research.google.com/github/AbdelrahmanOrm/Fake-New-Prediction/blob/main/FakeNewsPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing Dependencies

In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Stopwords in English

In [3]:
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

### Data preprocessing

In [4]:
import kagglehub
path = kagglehub.dataset_download("jainpooja/fake-news-detection")

Downloading from https://www.kaggle.com/api/v1/datasets/download/jainpooja/fake-news-detection?dataset_version_number=1...


100%|██████████| 41.0M/41.0M [00:00<00:00, 108MB/s]

Extracting files...





In [5]:
import os

# Loading the dataset to a pandas Dataframe
fake_df = pd.read_csv(os.path.join(path, 'Fake.csv'))
true_df = pd.read_csv(os.path.join(path, 'True.csv'))

In [6]:
print("Fake news dataset shape: ",fake_df.shape)
fake_df.head()

Fake news dataset shape:  (23481, 4)


Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [7]:
print("True news dataset shape: ",true_df.shape)
true_df.head()

True news dataset shape:  (21417, 4)


Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [8]:
fake_df.isnull().sum()

Unnamed: 0,0
title,0
text,0
subject,0
date,0


In [9]:
true_df.isnull().sum()

Unnamed: 0,0
title,0
text,0
subject,0
date,0


In [12]:
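# Remove the "Reuters" tag so the model cannot use it as a shortcut for the "true" class.
# With regex=True the parentheses form a regex group, so only the word is removed and "()" is left behind;
# wordopt() strips the leftover parentheses later.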
true_df['text'] = true_df['text'].replace("(Reuters)", "",regex=True)
true_df.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON () - The head of a conservative Rep...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON () - Transgender people will be all...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON () - The special counsel investigat...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON () - Trump campaign adviser George ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON () - President Donald Trump...,politicsNews,"December 29, 2017"
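
The output above still shows `()` because the pattern was interpreted as a regular expression. The following is a minimal sketch of the difference between a regex match and a literal match. It uses `Series.str.replace` on a made-up one-row Series (the sample sentence and variable names are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Hypothetical one-row sample that mimics the start of a Reuters article
sample = pd.Series(["WASHINGTON (Reuters) - The head of a conservative group"])

# regex=True: "(Reuters)" is a capture group, so only the word "Reuters" matches
as_regex = sample.str.replace("(Reuters)", "", regex=True)

# regex=False: the pattern is a literal string, parentheses included
as_literal = sample.str.replace("(Reuters)", "", regex=False)

print(as_regex[0])
print(as_literal[0])
```

Escaping the parentheses (`r"\(Reuters\)"`) with `regex=True` would be another way to match the literal tag.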


In [13]:
# Label the data: 1 = true news, 0 = fake news
true_df['target'] = 1
fake_df['target'] = 0

In [14]:
true_df.head(), fake_df.head()

(                                               title  \
 0  As U.S. budget fight looms, Republicans flip t...   
 1  U.S. military to accept transgender recruits o...   
 2  Senior U.S. Republican senator: 'Let Mr. Muell...   
 3  FBI Russia probe helped by Australian diplomat...   
 4  Trump wants Postal Service to charge 'much mor...   
 
                                                 text       subject  \
 0  WASHINGTON () - The head of a conservative Rep...  politicsNews   
 1  WASHINGTON () - Transgender people will be all...  politicsNews   
 2  WASHINGTON () - The special counsel investigat...  politicsNews   
 3  WASHINGTON () - Trump campaign adviser George ...  politicsNews   
 4  SEATTLE/WASHINGTON () - President Donald Trump...  politicsNews   
 
                  date  target  
 0  December 31, 2017        1  
 1  December 29, 2017        1  
 2  December 31, 2017        1  
 3  December 30, 2017        1  
 4  December 29, 2017        1  ,
                             

In [15]:
true_df = true_df.drop(['date'], axis=1)
fake_df = fake_df.drop(['date'], axis=1)

In [16]:
# Merging the title and text body of the news for better prediction
true_df['news'] = true_df['title'] + " " + true_df['text']
fake_df['news'] = fake_df['title'] + " " + fake_df['text']

In [17]:
true_df.head()

Unnamed: 0,title,text,subject,target,news
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON () - The head of a conservative Rep...,politicsNews,1,"As U.S. budget fight looms, Republicans flip t..."
1,U.S. military to accept transgender recruits o...,WASHINGTON () - Transgender people will be all...,politicsNews,1,U.S. military to accept transgender recruits o...
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON () - The special counsel investigat...,politicsNews,1,Senior U.S. Republican senator: 'Let Mr. Muell...
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON () - Trump campaign adviser George ...,politicsNews,1,FBI Russia probe helped by Australian diplomat...
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON () - President Donald Trump...,politicsNews,1,Trump wants Postal Service to charge 'much mor...


## Merging the two datasets

In [18]:
merged_df = pd.concat([true_df, fake_df], axis=0)
merged_df.head()

Unnamed: 0,title,text,subject,target,news
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON () - The head of a conservative Rep...,politicsNews,1,"As U.S. budget fight looms, Republicans flip t..."
1,U.S. military to accept transgender recruits o...,WASHINGTON () - Transgender people will be all...,politicsNews,1,U.S. military to accept transgender recruits o...
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON () - The special counsel investigat...,politicsNews,1,Senior U.S. Republican senator: 'Let Mr. Muell...
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON () - Trump campaign adviser George ...,politicsNews,1,FBI Russia probe helped by Australian diplomat...
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON () - President Donald Trump...,politicsNews,1,Trump wants Postal Service to charge 'much mor...


In [19]:
merged_df.tail()

Unnamed: 0,title,text,subject,target,news
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,0,McPain: John McCain Furious That Iran Treated ...
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,0,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,0,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,0,How to Blow $700 Million: Al Jazeera America F...
23480,10 U.S. Navy Sailors Held by Iranian Military ...,21st Century Wire says As 21WIRE predicted in ...,Middle-east,0,10 U.S. Navy Sailors Held by Iranian Military ...


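The two datasets are now concatenated but not shuffled: all true articles come first, followed by all fake ones. The next cells shuffle the rows and reset the index.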

In [24]:
merged_df = merged_df.sample(frac = 1)
merged_df.head(10)

Unnamed: 0,title,text,subject,target,news
4930,Former Trump Campaign Manager Says It’s Too D...,"After being fired from the Trump campaign, Tru...",News,0,Former Trump Campaign Manager Says It’s Too D...
1251,Trump Willing To Discuss Solar Powered Border...,Trump took some time today during a White Hous...,News,0,Trump Willing To Discuss Solar Powered Border...
1737,Patriots White House Group Photo Sure To Infu...,It s no secret that Donald Trump has deep inse...,News,0,Patriots White House Group Photo Sure To Infu...
10796,Obama says upgrading U.S. cybersecurity is com...,WASHINGTON () - President Barack Obama said on...,politicsNews,1,Obama says upgrading U.S. cybersecurity is com...
4573,Reince Priebus To Top GOP: Endorse Trump Or Y...,Several top Republicans have yet to endorse Do...,News,0,Reince Priebus To Top GOP: Endorse Trump Or Y...
2040,U.S. judge throws out Texas voter ID law suppo...,() - A federal court judge on Wednesday threw ...,politicsNews,1,U.S. judge throws out Texas voter ID law suppo...
6635,This DA Just Called B.S. On Anti-Trans Republ...,As the GOP s fight to control which bathrooms ...,News,0,This DA Just Called B.S. On Anti-Trans Republ...
12724,Poland's president designates finance minister...,WARSAW () - Poland s President Andrzej Duda de...,worldnews,1,Poland's president designates finance minister...
9156,Senate Democrats push for new gun control meas...,() - Leading U.S. Senate Democrats on Monday u...,politicsNews,1,Senate Democrats push for new gun control meas...
8509,"Obama, Senate Democrats urge Zika funding vote...",WASHINGTON () - President Barack Obama on Thur...,politicsNews,1,"Obama, Senate Democrats urge Zika funding vote..."


In [25]:
merged_df = merged_df.reset_index(drop=True)
merged_df.head(10)

Unnamed: 0,title,text,subject,target,news
0,Former Trump Campaign Manager Says It’s Too D...,"After being fired from the Trump campaign, Tru...",News,0,Former Trump Campaign Manager Says It’s Too D...
1,Trump Willing To Discuss Solar Powered Border...,Trump took some time today during a White Hous...,News,0,Trump Willing To Discuss Solar Powered Border...
2,Patriots White House Group Photo Sure To Infu...,It s no secret that Donald Trump has deep inse...,News,0,Patriots White House Group Photo Sure To Infu...
3,Obama says upgrading U.S. cybersecurity is com...,WASHINGTON () - President Barack Obama said on...,politicsNews,1,Obama says upgrading U.S. cybersecurity is com...
4,Reince Priebus To Top GOP: Endorse Trump Or Y...,Several top Republicans have yet to endorse Do...,News,0,Reince Priebus To Top GOP: Endorse Trump Or Y...
5,U.S. judge throws out Texas voter ID law suppo...,() - A federal court judge on Wednesday threw ...,politicsNews,1,U.S. judge throws out Texas voter ID law suppo...
6,This DA Just Called B.S. On Anti-Trans Republ...,As the GOP s fight to control which bathrooms ...,News,0,This DA Just Called B.S. On Anti-Trans Republ...
7,Poland's president designates finance minister...,WARSAW () - Poland s President Andrzej Duda de...,worldnews,1,Poland's president designates finance minister...
8,Senate Democrats push for new gun control meas...,() - Leading U.S. Senate Democrats on Monday u...,politicsNews,1,Senate Democrats push for new gun control meas...
9,"Obama, Senate Democrats urge Zika funding vote...",WASHINGTON () - President Barack Obama on Thur...,politicsNews,1,"Obama, Senate Democrats urge Zika funding vote..."
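
The shuffle above is unseeded, so the row order changes on every run. A minimal sketch of a reproducible shuffle, assuming a fixed `random_state` (the value 42 and the toy frame are illustrative choices, not taken from the notebook):

```python
import pandas as pd

# Toy stand-in for merged_df (hypothetical data)
toy_df = pd.DataFrame({"news": ["a", "b", "c", "d"], "target": [1, 1, 0, 0]})

# Shuffle every row with a fixed seed, then rebuild a clean 0..n-1 index
shuffled = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Repeating the call with the same seed gives the same order
shuffled_again = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled.equals(shuffled_again))
```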


Cleaning the text with regular expressions

In [29]:
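import string

def wordopt(text):
  """Lowercase the text and strip brackets, digits, URLs, HTML tags, punctuation and newlines."""
  text = text.lower()
  text = re.sub(r'\[.*?\]', '', text)  # bracketed tags such as [video]
  text = re.sub(r'[()]','', text)  # stray parentheses, e.g. the "()" left after removing "Reuters"
  text = re.sub(r'\w*\d\w*', '', text)  # any token that contains a digit
  text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
  text = re.sub(r'<.*?>+', '', text)  # HTML tags
  text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
  text = re.sub(r'\n', '', text)  # newlines
  return text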
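
To see what the cleaning steps do, here is a small self-contained sketch that repeats the same substitutions on a made-up headline (the sample string is an assumption for illustration; the function body mirrors `wordopt` above):

```python
import re
import string

def wordopt(text):
    # Same substitutions as the notebook's wordopt, applied in the same order
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'[()]', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    return text

# Hypothetical input covering a bracketed tag, a URL, parentheses, a year and punctuation
sample = "Breaking [VIDEO]: Visit https://example.com (now) in 2017!\n"
cleaned_tokens = wordopt(sample).split()
print(cleaned_tokens)
```

Removed pieces leave their surrounding spaces behind, so the cleaned string can contain runs of spaces; `TfidfVectorizer` tokenizes on word boundaries, so this does not affect the features.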

In [30]:
merged_df['news'] = merged_df['news'].apply(wordopt)
merged_df.head(30)

Unnamed: 0,title,text,subject,target,news
0,Former Trump Campaign Manager Says It’s Too D...,"After being fired from the Trump campaign, Tru...",News,0,former trump campaign manager says it’s too d...
1,Trump Willing To Discuss Solar Powered Border...,Trump took some time today during a White Hous...,News,0,trump willing to discuss solar powered border...
2,Patriots White House Group Photo Sure To Infu...,It s no secret that Donald Trump has deep inse...,News,0,patriots white house group photo sure to infu...
3,Obama says upgrading U.S. cybersecurity is com...,WASHINGTON () - President Barack Obama said on...,politicsNews,1,obama says upgrading us cybersecurity is compl...
4,Reince Priebus To Top GOP: Endorse Trump Or Y...,Several top Republicans have yet to endorse Do...,News,0,reince priebus to top gop endorse trump or yo...
5,U.S. judge throws out Texas voter ID law suppo...,() - A federal court judge on Wednesday threw ...,politicsNews,1,us judge throws out texas voter id law support...
6,This DA Just Called B.S. On Anti-Trans Republ...,As the GOP s fight to control which bathrooms ...,News,0,this da just called bs on antitrans republica...
7,Poland's president designates finance minister...,WARSAW () - Poland s President Andrzej Duda de...,worldnews,1,polands president designates finance minister ...
8,Senate Democrats push for new gun control meas...,() - Leading U.S. Senate Democrats on Monday u...,politicsNews,1,senate democrats push for new gun control meas...
9,"Obama, Senate Democrats urge Zika funding vote...",WASHINGTON () - President Barack Obama on Thur...,politicsNews,1,obama senate democrats urge zika funding vote ...


In [32]:
# Defining data and labels
X = merged_df['news']
Y = merged_df['target']

Train/test split with a 25% test size, stratified on Y

In [33]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, stratify=Y, random_state=2)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((33673,), (11225,), (33673,), (11225,))
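
As a minimal sketch of what `stratify` does, the toy example below (synthetic labels with a 60/40 class ratio, chosen only for illustration) checks that the class proportions are preserved in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 60 samples of class 0 and 40 of class 1
toy_y = np.array([0] * 60 + [1] * 40)
toy_X = np.arange(100).reshape(-1, 1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=2
)

# The share of class 1 is the same in the train and test parts
print(toy_y_train.mean(), toy_y_test.mean())
```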

Converting textual data into numerical data

In [34]:
vectorization = TfidfVectorizer()
Xvectorized_train = vectorization.fit_transform(X_train)
Xvectorized_test = vectorization.transform(X_test)
print('X_train vectorized shape', Xvectorized_train.shape)
print('X_test vectorized shape', Xvectorized_test.shape)

X_train vectorized shape (33673, 175026)
X_test vectorized shape (11225, 175026)
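
The vectorizer is fitted on the training text only and then reused on the test text, which is why both matrices have the same number of columns. A small sketch of that behavior on a made-up corpus (the sentences and variable names are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["fake news spreads fast", "official report confirms news"]
toy_test = ["unseen words report"]

toy_vectorizer = TfidfVectorizer()
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_matrix = toy_vectorizer.transform(toy_test)  # reuses it; unknown words are ignored

print(toy_train_matrix.shape, toy_test_matrix.shape)
```

Words that appear only in the test text ("unseen", "words") get no column, so they contribute nothing to the prediction.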


Logistic Regression model training

In [36]:
lr = LogisticRegression()
lr.fit(Xvectorized_train, Y_train)
# Accuracy
train_accuracy = lr.score(Xvectorized_train, Y_train)
print('Training accuracy: ', train_accuracy)

Training accuracy:  0.9888337837436522


In [37]:
test_prediction = lr.predict(Xvectorized_test)
test_accuracy = accuracy_score(Y_test, test_prediction)  # signature is (y_true, y_pred)
print('Test accuracy: ', test_accuracy)

Test accuracy:  0.9813808463251671


Model Evaluation

In [39]:
from sklearn.metrics import classification_report
print(classification_report(Y_test,test_prediction))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98      5871
           1       0.98      0.98      0.98      5354

    accuracy                           0.98     11225
   macro avg       0.98      0.98      0.98     11225
weighted avg       0.98      0.98      0.98     11225
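
The report gives rounded per-class scores; a confusion matrix shows the raw counts behind them. A minimal sketch on made-up labels (the toy arrays are assumptions, not results from this notebook):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels: 1 = true news, 0 = fake news
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision_class_1 = tp / (tp + fp)
recall_class_1 = tp / (tp + fn)
print(tn, fp, fn, tp, precision_class_1, recall_class_1)
```

The same call on `Y_test` and `test_prediction` would show how many fake articles were labeled true and vice versa.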



Passive-Aggressive Classifier

In [40]:
from sklearn.linear_model import PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=1000)
pac.fit(Xvectorized_train, Y_train)
pac_train_accuracy = pac.score(Xvectorized_train, Y_train)
print('Training accuracy: ', pac_train_accuracy)

Training accuracy:  0.9999703026163395


In [41]:
pac_test_prediction = pac.predict(Xvectorized_test)
pac_test_accuracy = accuracy_score(Y_test, pac_test_prediction)  # signature is (y_true, y_pred)
print('Test accuracy: ', pac_test_accuracy)

Test accuracy:  0.9903786191536749


In [43]:
print(classification_report(Y_test,pac_test_prediction))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5871
           1       0.99      0.99      0.99      5354

    accuracy                           0.99     11225
   macro avg       0.99      0.99      0.99     11225
weighted avg       0.99      0.99      0.99     11225



Linear SVM

In [44]:
from sklearn.svm import LinearSVC
lsvc = LinearSVC()
lsvc.fit(Xvectorized_train, Y_train)
lsvc_train_accuracy = lsvc.score(Xvectorized_train, Y_train)
print('Training accuracy: ', lsvc_train_accuracy)

Training accuracy:  0.999643631396074


In [45]:
lsvc_test_prediction = lsvc.predict(Xvectorized_test)
lsvc_test_accuracy = accuracy_score(Y_test, lsvc_test_prediction)  # signature is (y_true, y_pred)
print('Test accuracy: ', lsvc_test_accuracy)

Test accuracy:  0.9903786191536749


Manual testing

In [61]:
def output_label(n):
  if n == 0:
    return "The News is Fake"
  elif n == 1:
    return "The News is True"

def manual_testing(the_news, title):
  # Combine title and news body into a single string
  combined_text = title + " " + the_news

  # Apply wordopt to clean the combined text
  cleaned_text = wordopt(combined_text)

  # Vectorize the cleaned text
  vectorized_text = vectorization.transform([cleaned_text]) # Pass as a list for transform

  # Make predictions using the trained models
  lr_prediction = lr.predict(vectorized_text)
  pac_prediction = pac.predict(vectorized_text)
  lsvc_prediction = lsvc.predict(vectorized_text)

  # print() returns None, so print directly instead of returning its result
  print("\n\nLR Prediction: {} \nPAC Prediction: {} \nLSVC Prediction: {}".format(output_label(lr_prediction[0]), output_label(pac_prediction[0]), output_label(lsvc_prediction[0])))
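
The function above has to repeat the vectorization step by hand before predicting. One possible alternative, shown here only as a sketch on a tiny made-up corpus, is to bundle the vectorizer and a classifier in a scikit-learn pipeline so that raw text can be passed straight to `predict` (the toy texts, labels, and names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = true news, 0 = fake news
toy_texts = [
    "senate votes on budget bill",
    "president signs trade agreement",
    "shocking secret they do not want you to know",
    "you will not believe this outrageous claim",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one step
toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_model.fit(toy_texts, toy_labels)

toy_prediction = toy_model.predict(["senate votes on trade bill"])
print(toy_prediction[0])
```

A corpus this small says nothing about real accuracy; the point is only that the fitted vocabulary travels with the model.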

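Manual testing on an article that already exists in the dataset. The row may have been part of the training split, and the `news` column already contains the cleaned title, so this is a sanity check rather than an evaluation.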

In [62]:
title = merged_df['title'][12]
the_news = merged_df['news'][12]
manual_testing(the_news,title)



LR Prediction: The News is Fake 
PAC Prediction: The News is Fake 
LSVC Prediction: The News is Fake


Manual testing on input data

In [63]:
title = input()  # input() already returns a str
the_news = input()
manual_testing(the_news,title)

Palestinian-US teen freed after nine months in Israeli jail
A Palestinian-American teenager who spent nine months in Israeli detention without charge has been freed.  Mohammed Ibrahim was 15 when he was arrested in February in the Israeli-occupied West Bank, where he was visiting on holiday from Florida, for allegedly throwing stones at Jewish settlers, which he previously denied.  The US state department said it welcomed the news of Mohammed's release.  The BBC has contacted the Israeli authorities but has not received a reply.  Mohammed, now 16, was taken to hospital for treatment immediately after release, relatives told the media. They said he is, pale, underweight and is suffering from conditions contracted in captivity.  In a statement, Mohammed's uncle spoke of the family's "immense relief". Zeyad Kadur said the family had been "living a horrific and endless nightmare" over the last nine months.  "Right now, we are focused on getting Mohammed the immediate medical attention he n