About the Dataset:

The dataset is too large to upload here; it is available on Kaggle:
https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification

1. title: the title of a news article
2. text: the text of the article; could be incomplete
3. label: marks whether the news article is real or fake:
           1: Fake news
           0: Real news





In [92]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [93]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sidha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [94]:
# printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-processing

In [95]:
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('WELFake_Dataset.csv')

In [96]:
news_dataset.shape

(72134, 4)

In [97]:
# print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [98]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

Unnamed: 0      0
title         558
text           39
label           0
dtype: int64

In [99]:
# dropping the redundant index column, keeping only the first 20,000 rows,
# and replacing the null values with empty strings
news_dataset = news_dataset.drop(news_dataset.columns[0],axis='columns')
news_dataset = news_dataset.drop(news_dataset.index[20000:])
news_dataset = news_dataset.fillna('')
news_dataset.head()

Unnamed: 0,title,text,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,,Did they post their votes for Hillary already?,1
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [100]:
# merging the article text and the news title
news_dataset['content'] = news_dataset['text']+' '+news_dataset['title']

In [101]:
print(news_dataset['content'])

0        No comment is expected from Barack Obama Membe...
1          Did they post their votes for Hillary already? 
2         Now, most of the demonstrators gathered last ...
3        A dozen politically active pastors came here f...
4        The RS-28 Sarmat missile, dubbed Satan 2, will...
                               ...                        
19995    FRANKFURT (Reuters) - German public prosecutor...
19996    Those wacky conservatives are at it again with...
19997    Scott Walker 2016 begins today. After this spe...
19998    WASHINGTON (Reuters) - U.S. immigration offici...
19999    Governor Rick Snyder and his emergency managem...
Name: content, Length: 20000, dtype: object


In [102]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [103]:
print(X)
print(Y)

                                                   title  \
0      LAW ENFORCEMENT ON HIGH ALERT Following Threat...   
1                                                          
2      UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...   
3      Bobby Jindal, raised Hindu, uses story of Chri...   
4      SATAN 2: Russia unvelis an image of its terrif...   
...                                                  ...   
19995  Former Nazi death camp guard charged with acce...   
19996   Conservatives Raise Hell Over Obama’s Wedding...   
19997              Round I in Iowa: Scott Walker Emerges   
19998  Exclusive: U.S. plans new wave of immigrant de...   
19999   Flint’s Mayor Knows Which Candidate Has Their...   

                                                    text  \
0      No comment is expected from Barack Obama Membe...   
1         Did they post their votes for Hillary already?   
2       Now, most of the demonstrators gathered last ...   
3      A dozen politically active pasto

Stemming:

Stemming reduces a word to its root form (stem) by stripping suffixes, so that different inflections of a word are counted as the same term.

example:
connected, connecting, connection --> connect
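
To make the idea concrete, here is a minimal sketch that runs NLTK's PorterStemmer on a few words. The word list is chosen for illustration and is not taken from the dataset.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words only; they are not drawn from the dataset.
words = ["connected", "connecting", "connection", "acting", "votes"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

A stem does not have to be a dictionary word: the processed output further down contains stems such as "hillari" and "alreadi".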

In [104]:
port_stem = PorterStemmer()

In [105]:
# building the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # drop stopwords and stem the remaining words
    stemmed_words = [port_stem.stem(word) for word in words if word not in stop_words]
    return ' '.join(stemmed_words)

In [106]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [107]:
print(news_dataset['content'])

0        comment expect barack obama member fyf fukyofl...
1                                post vote hillari alreadi
2        demonstr gather last night exercis constitut p...
3        dozen polit activ pastor came privat dinner fr...
4        rs sarmat missil dub satan replac ss fli mile ...
                               ...                        
19995    frankfurt reuter german public prosecutor char...
19996    wacki conserv new insan attempt invent obama s...
19997    scott walker begin today speech freedomsummit ...
19998    washington reuter u immigr offici plan month l...
19999    governor rick snyder emerg manag team liter po...
Name: content, Length: 20000, dtype: object


In [108]:
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [109]:
print(X)

['comment expect barack obama member fyf fukyoflag blacklivesmatt movement call lynch hang white peopl cop encourag other radio show tuesday night turn tide kill white peopl cop send messag kill black peopl america one f yoflag organ call sunshin radio blog show host texa call sunshin f ing opinion radio show snapshot fyf lolatwhitefear twitter page p show urg support call fyf tonight continu dismantl illus white snapshot twitter radio call invit fyf radio show air p eastern standard time show caller clearli call lynch kill white peopl minut clip radio show heard provid breitbart texa someon would like refer hannib alreadi receiv death threat result interrupt fyf confer call unidentifi black man said mother f ker start f ing like us bunch ni er takin one us roll said caus alreadi roll gang anyway six seven black mother f cker see white person lynch ass let turn tabl conspir cop start lose peopl state emerg specul one two thing would happen big ass r war ni er go start backin alreadi ge

In [110]:
print(Y)

[1 1 1 ... 0 0 1]


In [111]:
Y.shape

(20000,)

In [112]:
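# converting the textual data to numerical data (TF-IDF features)
# note: the vectorizer is fitted on the full dataset before the train/test split,
# so vocabulary and IDF statistics from the test rows are included in the features
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)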
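
As a rough illustration of what the vectorizer computes, the sketch below reproduces the TF-IDF weighting by hand in plain Python, using the formula that TfidfVectorizer applies with its default settings (smoothed IDF, L2-normalized rows). The three documents are hypothetical, and tokens are split on whitespace, which is simpler than the vectorizer's own tokenizer (its default pattern also drops single-character tokens).

```python
import math
from collections import Counter

# Hypothetical, already-stemmed documents used only for illustration.
docs = [
    "obama vote vote",
    "vote hillari",
    "missil russia",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights for one document."""
    counts = Counter(tokens)
    weights = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that appear in fewer documents get a larger IDF, so in the second document "hillari" outweighs "vote". The printed `(row, column)  weight` pairs in the next cell are the same kind of values, stored as a sparse matrix.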

In [113]:
print(X)

  (0, 84557)	0.025672368273997645
  (0, 84522)	0.021328685981468712
  (0, 84415)	0.12280664542545672
  (0, 84221)	0.044190597441787785
  (0, 83236)	0.019786315734036084
  (0, 82719)	0.03364887609859938
  (0, 82488)	0.023284237780893773
  (0, 82318)	0.14596511350182267
  (0, 82196)	0.025439846750978826
  (0, 81939)	0.01412209292542166
  (0, 81784)	0.022868619630963646
  (0, 81645)	0.013401687589991364
  (0, 81565)	0.0589649335838173
  (0, 81405)	0.0343127913458304
  (0, 81301)	0.054796479835235216
  (0, 80306)	0.030357682780029215
  (0, 80168)	0.05091014317670827
  (0, 79314)	0.0332458865830848
  (0, 79162)	0.042604037399474826
  (0, 79121)	0.022396799981526338
  (0, 79052)	0.02911934107065803
  (0, 79034)	0.032920736719534265
  (0, 78462)	0.0326879150938389
  (0, 78452)	0.03653522684809661
  (0, 77998)	0.042045622970585804
  :	:
  (19999, 13741)	0.18673758180998637
  (19999, 13568)	0.0349912849942122
  (19999, 13408)	0.07877812463845678
  (19999, 12819)	0.06085700642394503
  (19999, 11

Splitting the dataset into training & test data

In [114]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)

Training the Model: Logistic Regression

In [115]:
model = LogisticRegression()

In [116]:
model.fit(X_train, Y_train)

LogisticRegression()

Evaluation

Accuracy score

In [117]:
# accuracy score on the training data (accuracy_score expects true labels first)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [118]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9551875


In [119]:
# accuracy score on the test data (accuracy_score expects true labels first)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [120]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.92875
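
Accuracy alone does not show which kind of mistake the model makes. The sketch below counts the four confusion-matrix cells and derives precision and recall for the "fake" class in plain Python. The label lists are hypothetical; in this notebook they would be Y_test and X_test_prediction.

```python
# Hypothetical labels for illustration (1 = fake, 0 = real).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

# Confusion-matrix cells, treating "fake" (1) as the positive class.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fake predicted fake
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fake predicted real
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # real predicted fake
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # real predicted real

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # share of predicted-fake articles that are fake
recall = tp / (tp + fn)     # share of fake articles that were caught

print('Confusion counts (tp, fn, fp, tn): ', (tp, fn, fp, tn))
print('Accuracy : ', accuracy)
print('Precision : ', precision)
print('Recall : ', recall)
```

sklearn.metrics offers confusion_matrix, precision_score, recall_score and classification_report for the same purpose.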


Making a Predictive System

In [121]:
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[0]
The news is Real


In [122]:
print(Y_test[3])

0
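
The cell above predicts on a row that was already vectorized. To classify raw text, the same vectorizer has to be applied first. One possible way to keep these steps together is a scikit-learn Pipeline, sketched below under stated assumptions: the toy corpus, its labels, and the new article are all hypothetical, and the stemming step is left out for brevity. A side effect of this layout is that the vectorizer is fitted on the training texts only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (1 = fake, 0 = real), for illustration only.
texts = [
    "shocking secret plot exposed unbelievable",
    "unbelievable scandal they hide the shocking truth",
    "secret cure exposed shocking doctors furious",
    "you will not believe this shocking secret",
    "senate passes budget bill after vote",
    "officials report quarterly economic growth",
    "court rules on immigration policy appeal",
    "minister announces new trade agreement",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first, so the test texts stay unseen during fitting.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# The pipeline fits the vectorizer and the classifier on training data only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(test_texts)
print(predictions, test_labels)

# Classifying a new, raw article: the pipeline vectorizes it internally.
new_article = "officials announce budget vote"
label = pipeline.predict([new_article])[0]
print('The news is Real' if label == 0 else 'The news is Fake')
```

To include the notebook's preprocessing, the stemming function could be passed to TfidfVectorizer through its preprocessor parameter.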
