
Fake news detection is a binary classification problem, so we will use a logistic regression model to predict whether a news article is real or fake.
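
As a minimal sketch of why logistic regression suits a two-class problem, the snippet below turns a weighted sum of features into a probability and then into a 0/1 label. The weights, bias, and feature values are made-up numbers for illustration only; they are not learned from this dataset.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    return probability, int(probability >= threshold)

# Hypothetical weights and features, chosen only to illustrate the mechanics.
weights = [1.5, -2.0, 0.5]
bias = -0.25
features = [0.8, 0.1, 0.4]

probability, label = predict_label(features, weights, bias)
print(round(probability, 3), label)
```

A probability at or above the threshold is read as label 1 (fake), anything below as label 0 (real).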

The dataset is collected from Kaggle:
1 Each news article has a unique id

2 Each news article has its own title

3 Each news article has text, which might be inaccurate

4 Each news article has an author

5 Each news article has a label that marks whether it is real or fake.

Zero (0) represents real news.

One (1) represents fake news.

In [16]:
import pandas as pd
import numpy as np
import re             # re is Python's regular expression module, used to search and replace text patterns
from nltk.corpus import stopwords   # nltk is the Natural Language Toolkit. Stopwords are common words that add little meaning to a text
from nltk.stem.porter import PorterStemmer  # Stemming is a text normalization technique that reduces each word to its stem
from sklearn.feature_extraction.text import TfidfVectorizer # Used for converting text into numeric feature vectors
from sklearn.model_selection import train_test_split # Used to split the data into training data and test data
from sklearn.linear_model import LogisticRegression  # Importing the logistic regression model
from sklearn.metrics import accuracy_score

In [17]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
print(stopwords.words('english'))   # These are common English words that add little meaning to a text or news article

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-processing Part

In [19]:
# Loading the dataset into a pandas DataFrame
newsdataset=pd.read_csv('/content/train.csv.zip')

In [20]:
newsdataset.head()   #Here columns are id, title, author, text and label. In label column, 0 represents real news and 1 represents fake news.

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [21]:
newsdataset.shape  # We have 20800 news articles and 5 columns

(20800, 5)

In [22]:
# Counting the number of missing values in each column
newsdataset.isnull().sum()   # No values are missing in the id and label columns. 558 values are missing in title, 1957 in author, and 39 in text.

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [23]:
# The dataset is large, so we simply replace null values with empty strings
newsdataset=newsdataset.fillna('')

In [27]:
# Merging the author and title columns. We won't use the text column because it contains many words and could make training and testing the model slow.
newsdataset['content']=newsdataset['author']+' '+newsdataset['title']  # Storing the merged author and title in a new 'content' column
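
The empty-string fill matters for this merge: concatenating a missing value with a string produces a missing value. A minimal sketch on a tiny hypothetical frame (the names and titles are invented, not rows from the dataset):

```python
import pandas as pd

# Tiny hypothetical frame: the second row has no author.
toy = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["Budget passes", "Storm hits coast"],
})

# Without filling, a missing author makes the whole merged value missing.
merged_raw = toy["author"] + " " + toy["title"]

# After filling with empty strings, every row yields a usable string.
filled = toy.fillna("")
merged_filled = filled["author"] + " " + filled["title"]

print(merged_raw.isna().tolist())
print(merged_filled.tolist())
```

Rows without an author keep their title in the merged column, with a leading space that the later whitespace split discards.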


In [28]:
print(newsdataset['content'])  # We will use this content column to train our model

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


In [31]:
# Separating the data and label
X = newsdataset.drop(columns='label') # Dropping the label column from newsdataset and storing the remaining columns in X
Y = newsdataset['label']                     # Storing the label column in Y

In [32]:
print(X)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2                             Consortiu

In [33]:
print(Y)

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64


Next, we apply stemming.


Stemming is the process of reducing a word to its root form (its stem).

For example, acting, acted, and acts all reduce to the stem act. A stem need not be a dictionary word: the Porter stemmer used here turns house into hous.
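
A short sketch of what the Porter stemmer returns for a few words. The word list is chosen for illustration; most of the words are lowercased forms of headline words from this dataset.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; the stemmer needs no corpus download.
words = ["house", "killed", "civilians", "exercises", "schedule", "acting"]
for word in words:
    print(word, "->", stemmer.stem(word))
```

Stems are produced by suffix-stripping rules, so related words collapse to one token even when that token is not itself an English word.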

In [36]:
portstem=PorterStemmer()
stop_words = set(stopwords.words('english'))  # Building the stopword set once; set lookups are fast
def stemming(content):     # content is the string we want to clean and stem.
  stemmedcontent = re.sub('[^a-zA-Z]',' ',content)  # Replace every character that is not a letter (digits, punctuation) with a space, so only alphabetic text remains.
  stemmedcontent = stemmedcontent.lower()     # Convert the text to lowercase so that words are compared consistently
  stemmedcontent = stemmedcontent.split()     # Split the text into a list of words
  # Stem each word, skipping English stopwords.
  stemmedcontent = [portstem.stem(word) for word in stemmedcontent if word not in stop_words]
  stemmedcontent = ' '.join(stemmedcontent)  # Join the stemmed words back into a single string
  return stemmedcontent                      # Return the cleaned, stemmed string
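
To trace the cleaning steps on one headline, here is a minimal sketch that uses only the standard library. It skips the stemming step, and STOP_WORDS is a small hypothetical subset of the NLTK English list, kept inline so the snippet runs without any download.

```python
import re

# Small hypothetical subset of the NLTK English stopword list.
STOP_WORDS = {"what", "the", "a", "in", "to", "for"}

headline = "David Swanson What Keeps the F-35 Alive"

letters_only = re.sub(r"[^a-zA-Z]", " ", headline)   # digits and punctuation become spaces
tokens = letters_only.lower().split()                # lowercase, then split on whitespace
kept = [token for token in tokens if token not in STOP_WORDS]

print(tokens)
print(kept)
```

The hyphen and digits in "F-35" are replaced by spaces, which is why only a lone "f" survives from that token.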

In [37]:
newsdataset['content']=newsdataset['content'].apply(stemming)  #Applying the stemming function to content column.

In [38]:
print(newsdataset['content'])   # All digits and punctuation have been removed, and the text is entirely lowercase.

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object


In [39]:
# Separating data and labels
X=newsdataset['content'].values
Y=newsdataset['label'].values

In [40]:
print(X)  #X has only content column

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [41]:
print(Y)  #Y has label column

[1 0 1 ... 0 1 1]


In [42]:
Y.shape  # There are 20800 values in Y

(20800,)

In [43]:
X.shape  # There are 20800 values in X

(20800,)

In [44]:
# All values in X are text, which a model cannot work with directly, so we convert the text into numeric features.
# We do this with the TF-IDF vectorizer.
vectorizer = TfidfVectorizer()  # TF-IDF stands for term frequency-inverse document frequency: a word's weight grows with how often it occurs in a document and shrinks when the word occurs in many documents.
vectorizer.fit(X)    # Learning the vocabulary and IDF weights from all the data in X
X=vectorizer.transform(X)   # Converting each text into its TF-IDF feature vector
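
The weighting can be worked through by hand. The sketch below follows the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as the TfidfVectorizer defaults; treat it as an illustration rather than a replacement for the library. The three documents are hypothetical.

```python
import math
from collections import Counter

# Three hypothetical, already-cleaned documents.
documents = [
    "trump poster child",
    "nato russia exercis",
    "trump russia",
]

tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(value ** 2 for value in weights.values()))
    return {term: value / norm for term, value in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, the term that appears in only one document (poster) receives a larger weight than the term shared with another document (trump).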

In [45]:
print(X)  # The text data is now a sparse matrix of numeric feature vectors

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420

In [55]:
#Splitting dataset into training and test data
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)
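
The stratify argument keeps the share of each label the same in both splits. A minimal sketch on hypothetical, deliberately imbalanced labels (the features are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 real (0) and 30 fake (1).
labels = np.array([0] * 70 + [1] * 30)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # share of label 1 in each split
```

Without stratification, a random split could leave the test set with noticeably more or fewer fake articles than the training set.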

Training the model

In [56]:
model=LogisticRegression()

In [57]:
model.fit(X_train,Y_train)

Evaluating the model

In [58]:
# Accuracy score on training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # accuracy_score expects (true labels, predicted labels)

In [59]:
print('Accuracy Score of Training data: ',training_data_accuracy)

Accuracy Score of Training data:  0.9865985576923076


In [60]:
# Accuracy score on test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # accuracy_score expects (true labels, predicted labels)

In [61]:
print('Accuracy Score of Testing data: ',test_data_accuracy)  # Training and test accuracy are close, so the model does not appear to be overfitting the training data

Accuracy Score of Testing data:  0.9790865384615385
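
Accuracy is the number of correct predictions divided by the number of predictions, so the two printed scores can be converted back into error counts. The split sizes below are derived from the 20800 rows and test_size=0.2 rather than printed by the notebook.

```python
# Sizes derived from the 20800-row dataset and test_size=0.2.
n_total = 20800
n_test = n_total // 5
n_train = n_total - n_test

# Accuracies printed by the notebook.
train_accuracy = 0.9865985576923076
test_accuracy = 0.9790865384615385

# Accuracy = correct predictions / total predictions, so invert it.
train_errors = n_train - round(train_accuracy * n_train)
test_errors = n_test - round(test_accuracy * n_test)

print(n_train, n_test)
print(train_errors, test_errors)
print(round(train_accuracy - test_accuracy, 4))
```

This works out to 223 misclassified training articles out of 16640 and 87 misclassified test articles out of 4160, a gap of under one percentage point.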


Making a predictive system:

In [63]:
X_new=X_test[0]  # Storing the first row of X_test in X_new
prediction=model.predict(X_new) # Predicting the label of X_new with the trained model
print(prediction)
if(prediction[0]==0):
  print('The news is Real.')
else:
  print('The news is Fake.')


[1]
The news is Fake.


In [64]:
print(Y_test[0])  # Checking the true label of row 0: the model predicted fake, and the true label is 1 (fake), so the prediction is correct.

1


In [65]:
# Now checking the row at index 1
X_new=X_test[1]
prediction=model.predict(X_new)
print(prediction)
if(prediction[0]==0):
  print('The news is Real.')
else:
  print('The news is Fake.')


[0]
The news is Real.


In [67]:
print(Y_test[1])   # The true label is 0 (real), which matches the label predicted by the trained model.

0
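
The checks above reuse rows that are already vectorized. For a new headline, the text has to pass through the same fitted vectorizer before prediction. The sketch below shows that flow on a toy model; the training texts, labels, and the classify helper are hypothetical and stand in for the notebook's vectorizer and model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, already-cleaned training texts with made-up labels
# (0 = real, 1 = fake); not taken from the dataset.
train_texts = [
    "senat pass budget bill",
    "court rule tax case",
    "shock secret cure doctor hate",
    "alien secret govern cover",
]
train_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(train_texts)
toy_model = LogisticRegression().fit(toy_features, train_labels)

def classify(cleaned_text):
    """Vectorize one cleaned string with the fitted vocabulary and label it."""
    vector = toy_vectorizer.transform([cleaned_text])  # transform only, never refit
    label = int(toy_model.predict(vector)[0])
    return "The news is Fake." if label == 1 else "The news is Real."

print(classify("secret alien cure"))
```

Calling transform instead of fit keeps the new text in the same feature space the model was trained on; in the notebook, the raw headline would also go through the stemming function first.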
