In this project, we are going to build a Machine Learning system that predicts whether a news item is fake or real. The project is written in Python and uses Logistic Regression, a classification algorithm that answers YES/NO questions. I like to call it the “THIS or THAT” algorithm.

The dataset used in this project is from Kaggle (https://www.kaggle.com/competitions/fake-news/data?select=train.csv). Downloading it gives you a zip file, and when you expand it, you will see train.csv. That is the dataset we will be using. Its columns are:

. id is the unique ID of each news article

. title is the news article title

. author is the author of the news article

. text is a snippet of the body of the article

. label tells whether the news is fake or true: 1 = fake, 0 = true

##### Project Assumptions
. You are using VS Code as your IDE

. You are not a quitter :)

. You understand Python basics

. You will pay close attention to spelling

With that out of the way, let’s get to the fun part. Rooting for you!

# FAKE NEWS DETECTION PROJECT

In [1]:
# import libraries


import numpy as np #creating numpy arrays
import pandas as pd #creating dataframes and storing data in dataframes
import re #for searching for text in a document
from nltk.corpus import stopwords  #removes words that don't add value to the article
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer #converts texts into numbers
from sklearn.model_selection import train_test_split #splits data into training and testing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\ANZAR
[nltk_data]     AZIZ\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Print the English stopwords. These are words that do not add much value to an article, e.g. articles like “a” and “the”.

In [3]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
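To see what removing stopwords does to a sentence, here is a minimal sketch. The `toy_stopwords` set is a small hand-picked stand-in (an assumption for illustration); in the notebook the full list comes from `stopwords.words('english')`.

```python
# Hypothetical mini stopword set; the notebook uses stopwords.words('english')
toy_stopwords = {'a', 'the', 'is', 'on', 'of'}

sentence = 'the cat is on the roof of a house'

# keep only the words that are not stopwords
kept = [word for word in sentence.split() if word not in toy_stopwords]
print(kept)  # ['cat', 'roof', 'house']
```

Only the words that carry meaning are left for the model to learn from.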

## Data Preprocessing
We are going to name our dataset ‘news_dataset’. You can use any name that you will remember.

Load the dataset. Copy the path to your train.csv file and replace mine with it. The rest of the code is commented.

In [4]:
#loading the dataset

news_dataset = pd.read_csv('train.csv')
news_dataset

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
...,...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,0
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,0
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,0
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",1


In [5]:
df = news_dataset.copy()

In [6]:
#checking the shape of the dataset
news_dataset.shape

(20800, 5)

Check for missing values. If the dataset is large enough, you can either drop the rows with missing values or replace the missing values with blank strings. If you don’t have enough data, you can opt for imputation, where you replace the missing values with estimated values. In this project, we will replace missing values with blank strings since we have enough data.

In [7]:
#counting the number of missing values in the dataset

news_dataset.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [8]:
#replacing the missing values with a blank string (a single space)

news_dataset= news_dataset.fillna(' ')
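The two options for missing values can be compared on a tiny table. This is only a sketch: the `toy` DataFrame and its values are made up for illustration, while `isnull`, `dropna` and `fillna` are the regular pandas methods.

```python
import pandas as pd

# Hypothetical three-row table with one missing author and one missing title
toy = pd.DataFrame({
    'author': ['Jane Doe', None, 'John Roe'],
    'title': ['Headline A', 'Headline B', None],
    'label': [0, 1, 1],
})

missing_counts = toy.isnull().sum()   # missing values per column
dropped = toy.dropna()                # option 1: drop rows with any missing value
filled = toy.fillna(' ')              # option 2: keep every row, fill with a blank

# with blanks filled in, string concatenation works on every row
filled['content'] = filled['author'] + ' ' + filled['title']

print(missing_counts)
print(len(dropped), len(filled))
```

Dropping keeps only the complete rows, while filling keeps every row. That matters here because adding a missing value to a string would produce another missing value in the merged column.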

We are going to merge the author’s name and the news title into a new column called ‘content’. Using the text column instead would take more time to train the model (there is nothing wrong with that approach, so you can try it too).

In [9]:
# merging the author name and news title

news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

After you print, you will notice that the columns have merged.

In [10]:
print(news_dataset['content'])

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


In [11]:
pd.set_option('display.max_rows',30)
pd.set_option('display.max_columns',30)

In [12]:
# separating the data & label

X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [13]:

print(Y)

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64


In [14]:
print(X)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2                             Consortiu

### Stemming
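Stemming is the process of reducing a word to its root form, so that variants of the same word are counted as one feature. This shrinks the vocabulary and saves time when training the model. In short, the Porter stemmer used here strips common suffixes.

example: acting, acted, acts → act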
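The idea of suffix stripping can be sketched in a few lines. Note that `toy_stem` is a deliberately naive, hypothetical helper and not the Porter algorithm, which applies many more rules; it only shows how several word forms collapse into one.

```python
def toy_stem(word):
    """Naive illustration: strip one common suffix if enough of the word remains."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

stems = [toy_stem(w) for w in ['acting', 'acted', 'acts', 'act']]
print(stems)  # ['act', 'act', 'act', 'act']
```

Four different strings become a single feature, which is the effect we want from stemming.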

In [15]:
port_stem = PorterStemmer()

In [16]:
#creating a function stemming because it is not inbuilt

def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]',' ',content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if not word in stopwords.words('english')]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content



In [17]:
#Let's unpack the above code

def stemming(content):
    # 'content' here is just the parameter name; it is not tied to the
    # 'content' column we created when merging. Any name would work.

    stemmed_content = re.sub('[^a-zA-Z]',' ',content)
    # re is the regular-expression library we imported, and 'sub' substitutes values.
    # '[^a-zA-Z]' matches every character that is not a letter
    # (the dataset has numbers and punctuation marks), and ' ' replaces each one with a space.
    # the cleaned text is stored in stemmed_content

    stemmed_content = stemmed_content.lower()
    # converting everything to lowercase.
    # why? otherwise the model would treat 'News' and 'news' as different words

    stemmed_content = stemmed_content.split()
    # splitting the text into a list of words

    stemmed_content = [port_stem.stem(word) for word in stemmed_content if not word in stopwords.words('english')]
    # dropping the stopwords, then stemming each remaining word with port_stem.stem

    stemmed_content = ' '.join(stemmed_content)
    # joining the words back into a single string

    return stemmed_content

#NB: the ^ symbol might not be available on some keyboards, so you might get an error.
#if so, copy it from the code above

In [18]:
#applying the above function to the content column
news_dataset['content'] = news_dataset['content'].apply(stemming) # this takes a long time on my PC
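A likely reason this step is slow is that `stopwords.words('english')` is called again for every single word. Below is a sketch of the same cleaning steps with the stopwords passed in once as a set. The name `preprocess` and the toy stand-ins are assumptions for illustration, so the snippet runs without NLTK; in the notebook you would pass `port_stem.stem` and `set(stopwords.words('english'))` instead.

```python
import re

def preprocess(content, stem, stop_words):
    """Same steps as stemming(), but the stopwords are built once and passed in."""
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    return ' '.join(stem(word) for word in words if word not in stop_words)

# Hypothetical stand-ins so the sketch runs on its own
toy_stop_words = {'the', 'a', 'is', 'in'}   # notebook: set(stopwords.words('english'))

def toy_stem(word):                         # notebook: port_stem.stem
    return word

result = preprocess('The F-35 is in the news!', toy_stem, toy_stop_words)
print(result)  # f news
```

Checking membership in a set is fast, so the cost per word no longer depends on the length of the stopword list.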

In [19]:
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [20]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [21]:
print(Y)

[1 0 1 ... 0 1 1]


When I print X, the output is still textual data. A model can only work with numbers, so we will convert the textual data to numerical data.

In [22]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)

In [23]:
print(X)

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420

You have now successfully converted the textual data to numerical data. TfidfVectorizer combines two ideas: TF (Term Frequency) and IDF (Inverse Document Frequency).

TF counts the number of times a word appears in a document. A higher count suggests the word is important to that document, so the word gets a larger numerical value.

IDF accounts for the fact that words repeated across many documents carry little meaning. For example, if we were building a system to classify positive and negative reviews for a hotel called ‘hoax’, most of the reviews would mention the name of the hotel. The name doesn’t add value, so IDF reduces the weight of such words.
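The hotel example can be worked out by hand. This is a simplified sketch: the reviews are made up, and the IDF formula is the smoothed variant, ln((1 + n) / (1 + df)) + 1, that scikit-learn uses by default. TfidfVectorizer also normalizes each row afterwards, which this sketch skips.

```python
import math

# Hypothetical reviews for the hotel called 'hoax'
reviews = [
    'hoax hotel clean room',
    'hoax hotel rude staff',
    'hoax hotel great pool',
]
docs = [review.split() for review in reviews]
n_docs = len(docs)

def tf(word, doc):
    # term frequency: how often the word appears in this document
    return doc.count(word)

def idf(word):
    # inverse document frequency: words found in fewer documents score higher
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

score_hoax = tfidf('hoax', docs[0])
score_clean = tfidf('clean', docs[0])
print(score_hoax, score_clean)
```

Both words appear once in the first review, but ‘hoax’ appears in every review while ‘clean’ appears in only one, so ‘clean’ ends up with the higher weight.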

Next, check the shape of the label array Y. It should be one-dimensional.

In [24]:
Y.shape

(20800,)

## Splitting the dataset into training and testing data
The test_size of 0.2 means that 20% of the dataset is held out as testing data and the remaining 80% is used as training data. stratify=Y keeps the proportion of fake and real articles the same in both parts. The random state can be any integer, but if you are following along with my code, use 2 to get the same results.

In [25]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
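To make the idea of a stratified 80/20 split concrete, here is a pure-Python sketch. `stratified_split` is a hypothetical helper written for illustration; it is not how `train_test_split` is implemented, and using the same seed will not select the same rows as scikit-learn.

```python
import random

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_indices, test_indices), holding out test_size of each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                      # shuffle within the class
        n_test = round(len(idx) * test_size)  # same share taken from every class
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

toy_labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))  # 16 4
```

Because the split is done per class, the test part holds two articles of each class, the same 50/50 balance as the full toy dataset.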


This is a binary classification problem (two classes, 0 and 1), so Logistic Regression is a simple fit for it. If the results are not good, we can move on to another algorithm.

## Training the model: Logistic Regression

In [26]:
#load the model in a variable
model = LogisticRegression()

In [27]:
#training the model
model.fit(X_train, Y_train)
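How does Logistic Regression turn numbers into a YES/NO answer? Roughly, it computes a weighted score from the features, squashes it into a probability with the sigmoid function, and picks a class based on a threshold. The sketch below assumes the score is already computed; `classify` and the 0.5 threshold are illustrative choices, not the library's internals.

```python
import math

def sigmoid(z):
    # maps any real number to a value between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def classify(score, threshold=0.5):
    # 1 (fake) if the probability reaches the threshold, otherwise 0 (real)
    return 1 if sigmoid(score) >= threshold else 0

print(sigmoid(0.0))    # 0.5
print(classify(2.0))   # 1
print(classify(-2.0))  # 0
```

Training the model amounts to finding the weights that make these probabilities match the labels in the training data as closely as possible.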

In [28]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [29]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9865985576923076


I got about 98.7% (0.9866) on the training data. However, the accuracy on the training data matters less than the accuracy on the test data: the model has already seen the training data, whereas the test data shows how it handles articles it has never seen. So we will check that next.

In [30]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)


In [31]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9790865384615385


I got about 97.9% (0.979) on the test data.
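The accuracy score itself is simple: it is the share of predictions that match the true labels. A minimal sketch with made-up labels (`accuracy` is a hypothetical helper that mirrors what `accuracy_score` reports):

```python
def accuracy(y_true, y_pred):
    # fraction of positions where the prediction equals the true label
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]
score = accuracy(toy_true, toy_pred)
print(score)  # 0.8
```

Four of the five predictions are right, so the accuracy is 0.8, or 80%.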

# Making a predictive system

In [34]:
X_new = X_test[3] #change the index to check whether the model predicts other test articles correctly

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[0]
The news is Real


In [35]:
X_new = X_test[125] #change the index to check whether the model predicts other test articles correctly

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[1]
The news is Fake
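The cells above pick rows that are already vectorized. To classify a brand-new headline, the text has to go through the same steps first: cleaning, then the already fitted vectorizer (with `transform`, not `fit`), then the model. Here is a self-contained sketch on a made-up toy corpus; `predict_headline` and the toy data are assumptions for illustration, and in the notebook you would pass the real `vectorizer` and `model` and run the text through `stemming` first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, already cleaned toy headlines (1 = fake, 0 = real)
toy_texts = [
    'shock miracl cure doctor hate',
    'senat pass budget bill',
    'alien secret cure reveal',
    'court rule budget case',
]
toy_labels = [1, 0, 1, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

def predict_headline(cleaned_text, vectorizer, model):
    # reuse the fitted vocabulary so the columns line up with the training data
    features = vectorizer.transform([cleaned_text])
    return 'Fake' if model.predict(features)[0] == 1 else 'Real'

label = predict_headline('secret miracl cure', toy_vectorizer, toy_model)
print(label)
```

With only four training rows the prediction itself means little. The point is the order of the steps.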


# Summary
We learned how to import libraries, load the data, clean the data, perform stemming, convert text to numbers, train the model, check the accuracy score and make a predictive system. One thing we did not learn here is how to comment code (pun intended).

What stood out for me was the concept of Stemming. What stood out for you?

I am currently working on the same project but with real-world data, where I am web scraping