# Fake News Prediction

## Workflow
News Data -> Data Pre-Processing -> Train/Test Split -> Logistic Regression Model -> Trained Logistic Regression Model -> New Data -> Prediction: REAL or FAKE

About the Dataset:
1. id: unique id for a news article
2. title: the title of the news article
3. author: the author of the news article
4. text: the text of the article; it may be incomplete
5. label: marks whether the news article is real or fake:
               0: Real News
               1: Fake News

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer  #### converts text into TF-IDF feature vectors
import re                                                    #### regular expressions, used to search and replace patterns in text
from nltk.corpus import stopwords                            #### NLTK (Natural Language Toolkit) stop words: common words that add little meaning, e.g. the, an, what, when, then, it
from nltk.stem.porter import PorterStemmer                   #### reduces each word to its stem (root form)


In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Nishant
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
#### printing all the English stop words

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## Data Pre-Processing

In [4]:
news_data=pd.read_csv('train.csv')

In [5]:
news_data.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [6]:
news_data.shape

(20800, 5)

In [7]:
#### finding the missing values in the datasets

news_data.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

#### The dataset has some missing values, but they are few compared with its 20,800 rows, so we replace them with empty strings.


In [8]:
#### replacing the missing (null) values with an empty string

news_data=news_data.fillna('')

#### fillna() can put any value in place of missing values; in
#### this case we use the empty string ''
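
A minimal sketch of what `fillna('')` does, using a small made-up DataFrame (the rows below are hypothetical and not taken from `train.csv`). It also shows why the replacement matters when `author` and `title` are joined into one string: concatenating a string with a missing value gives a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, for illustration only (not from train.csv)
toy = pd.DataFrame({
    "author": ["Jane Doe", np.nan],
    "title": ["Some headline", "Another headline"],
})

# Without filling, a missing author makes the joined string missing as well
joined_raw = toy["author"] + " " + toy["title"]
print(joined_raw.isnull().sum())   # number of missing joined values

# After fillna(''), every row produces a usable string
toy_filled = toy.fillna("")
joined_filled = toy_filled["author"] + " " + toy_filled["title"]
print(joined_filled.tolist())
```

Note that a row with no author keeps a leading space in the joined string; the later `split()` step discards it.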

In [9]:
news_data.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

#### Now we merge the "author" and "title" columns into one column. We do not use the "text" column, because processing it would take much longer.

In [10]:
#### merging the author and title columns

news_data['Content']=news_data['author']+' '+ news_data['title']

In [11]:
news_data.head()

Unnamed: 0,id,title,author,text,label,Content
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,Consortiumnews.com Why the Truth Might Get You...
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,Jessica Purkiss 15 Civilians Killed In Single ...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,Howard Portnoy Iranian woman jailed for fictio...


#### After merging the title and author columns into the Content column, we separate the label from the rest of the dataset.

In [12]:
X=news_data.drop("label",axis=1)
Y=news_data['label']

In [13]:
X

Unnamed: 0,id,title,author,text,Content
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",Consortiumnews.com Why the Truth Might Get You...
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,Jessica Purkiss 15 Civilians Killed In Single ...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,Howard Portnoy Iranian woman jailed for fictio...
...,...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,"Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma..."
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,Michael J. de la Merced and Rachel Abrams Macy...
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...","Alex Ansary NATO, Russia To Hold Parallel Exer..."


In [14]:
Y

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64

#### Stemming:
- Stemming is the process of reducing a word to its root word.
- Example: actor, actress, and acting all share the root word "act".
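
A short sketch of the Porter stemmer on words that appear in the titles of this notebook. The Porter algorithm is rule-based: it strips suffixes, so the stems are not always dictionary words, and it is not guaranteed to map every related form to the same root.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the titles shown in this notebook
for word in ["House", "Killed", "Single", "jailed"]:
    # stem() lowercases its input by default
    print(word, "->", stemmer.stem(word))
```

The stems `hous`, `kill`, `singl`, and `jail` match the cleaned `Content` values printed later in the notebook.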

In [15]:
port_stem=PorterStemmer()

In [16]:
#### defining the stemming function

def stemming(content):                               #### takes one "Content" string (author + title)
    stemmed_content=re.sub('[^a-zA-Z]',' ',content)  #### replace every non-letter character with a space
    stemmed_content=stemmed_content.lower()          #### convert all letters to lowercase
    stemmed_content=stemmed_content.split()          #### split the string into a list of words
    stemmed_content=[port_stem.stem(word) for word in stemmed_content if word not in stopwords.words('english')] #### drop the stop words and stem the remaining words
    stemmed_content=' '.join(stemmed_content)        #### join the stemmed words back into a single string
    return stemmed_content
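
The cleaning steps before stemming can be traced with the standard library alone. This is a sketch under assumptions: `SAMPLE_STOP_WORDS` and `clean_tokens` are hypothetical names, and the tiny stop-word set stands in for NLTK's full English list.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook itself uses NLTK's full English list.
SAMPLE_STOP_WORDS = {"we", "didn", "t", "s"}

def clean_tokens(text, stop_words=SAMPLE_STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in stop_words]

tokens = clean_tokens("House Dem Aide: We Didn’t Even See Comey’s Letter")
print(tokens)
```

Because the apostrophe is replaced by a space, "Didn’t" splits into `didn` and `t`, and "Comey’s" into `comey` and `s`; the stop-word filter then removes the leftover fragments.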

In [17]:
news_data['Content']=news_data['Content'].apply(stemming)

In [18]:
news_data.head()

Unnamed: 0,id,title,author,text,label,Content
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,darrel lucu hous dem aid even see comey letter...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,daniel j flynn flynn hillari clinton big woman...
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,consortiumnew com truth might get fire
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,jessica purkiss civilian kill singl us airstri...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,howard portnoy iranian woman jail fiction unpu...


In [19]:
#### Separating data and label
X=news_data["Content"].values   #### we only need the "Content" column
Y=news_data["label"].values

In [20]:
X

array(['darrel lucu hous dem aid even see comey letter jason chaffetz tweet',
       'daniel j flynn flynn hillari clinton big woman campu breitbart',
       'consortiumnew com truth might get fire', ...,
       'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time',
       'alex ansari nato russia hold parallel exercis balkan',
       'david swanson keep f aliv'], dtype=object)

In [21]:
Y

array([1, 0, 1, ..., 0, 1, 1], dtype=int64)

#### A model cannot work with raw text directly, so we convert the text into numerical data.

In [22]:
#### converting the textual data to numerical data

vectorizer=TfidfVectorizer() #### weights each word by how often it occurs in a document, discounted by how common it is across all documents
vectorizer.fit(X)            #### learn the vocabulary and the IDF weights from X
X=vectorizer.transform(X)    #### convert each document into its TF-IDF feature vector
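
A small sketch of what the vectorizer produces, on a toy corpus. The first two strings are cleaned `Content` values from this notebook; the third is made up so that two words repeat across documents. The names `toy_docs`, `toy_vectorizer`, and `toy_matrix` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-cleaned documents (the third one is made up)
toy_docs = [
    "truth might get fire",
    "nato russia hold parallel exercis",
    "truth nato",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # fit and transform in one call

print(toy_matrix.shape)                    # (number of documents, vocabulary size)
print(sorted(toy_vectorizer.vocabulary_))  # the learned vocabulary

# With the default settings each row is scaled to unit L2 length
row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)
```

The result is a sparse matrix with one row per document and one column per vocabulary word, which is why `print(X)` lists `(row, column)` pairs with a weight.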

In [23]:
print(X)

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420

####  Now our X and Y are ready for training and testing

In [24]:
#### splitting the dataset into training and test sets

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=1,stratify=Y)
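
A sketch of what `stratify` is for, using made-up labels (60 real, 40 fake) and placeholder features. With stratification, the share of each class in the training and test sets stays close to the share in the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 real (0) and 40 fake (1)
toy_y = np.array([0] * 60 + [1] * 40)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=1, stratify=toy_y
)

print(len(y_tr), len(y_te))      # sizes of the two splits
print(y_tr.mean(), y_te.mean())  # share of label 1 in each split
```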

In [25]:
print(X.shape,X_train.shape,X_test.shape)

(20800, 17128) (14560, 17128) (6240, 17128)


#### Training the Logistic Regression model

In [26]:
model=LogisticRegression()

model.fit(X_train,Y_train)

#### After training the model, we compute its accuracy score

In [27]:
#### accuracy score on the training data
X_train_prediction=model.predict(X_train)
training_data_accuracy=accuracy_score(Y_train,X_train_prediction)   #### arguments: true labels first, then predictions

In [28]:
print(training_data_accuracy)

0.9858516483516484


In [30]:
#### accuracy score on the test data
X_test_prediction=model.predict(X_test)
test_data_accuracy=accuracy_score(Y_test,X_test_prediction)   #### arguments: true labels first, then predictions

In [31]:
print(test_data_accuracy)

0.9748397435897436
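
Accuracy is the fraction of predictions that equal the true label. A minimal hand-written version, on made-up labels, makes the definition concrete; `accuracy` here is a hypothetical helper, not the scikit-learn function.

```python
# Hypothetical true labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

print(accuracy(y_true, y_pred))  # 6 of 8 predictions match
```

In this notebook the training accuracy (about 0.986) and the test accuracy (about 0.975) are close, which suggests the model is not badly overfitting the training set.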


## Making a Predictive System
- This step is important to learn: it shows how the trained model classifies a single news item.

In [39]:
X_news=X_test[20]

prediction=model.predict(X_news)

if prediction[0] == 1:
    print("The news is Fake")
else:
    print("The news is Real")

The news is Real
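
`X_test[20]` is already a TF-IDF vector. A genuinely new article would first need the same cleaning and then `transform` (never a new `fit`) with the vectorizer used for training. The sketch below illustrates this on a tiny made-up corpus; the texts, the labels, and the names `toy_vectorizer`, `toy_model`, and `predict_label` are hypothetical and stand in for the notebook's `vectorizer` and `model`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up, already-cleaned training texts; labels: 0 = real, 1 = fake
train_texts = [
    "offici confirm budget vote",
    "senat pass bill vote",
    "shock secret cure reveal",
    "alien secret shock truth",
]
train_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(train_texts)  # fit on training texts only
toy_model = LogisticRegression().fit(X_toy, train_labels)

def predict_label(cleaned_text):
    """Vectorise one cleaned text with the fitted vectorizer and classify it."""
    features = toy_vectorizer.transform([cleaned_text])  # transform only, no re-fit
    return int(toy_model.predict(features)[0])

new_features = toy_vectorizer.transform(["secret shock reveal"])
print(new_features.shape)  # one row, same number of columns as the training matrix
print("Fake" if predict_label("secret shock reveal") == 1 else "Real")
```

Using `transform` keeps the new vector in the same feature space as the training data, so the column count matches what the model expects.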


In [41]:
#### to cross-check, we can print the true label Y_test[20]

print(Y_test[20])

0
