Hello all, this is my very first notebook. I'm a novice in machine learning, so you may not find anything new as you go through it, but I'm very excited to share my understanding of the Fake News Detection dataset. I would really appreciate your suggestions. Happy learning!

A short description of this notebook: it mainly focuses on Natural Language Processing techniques such as stop-word removal and stemming. Suggestions for improving it would be really appreciated. So let's get started :)

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# NLP libraries to clean the text data
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

# Vectorization technique TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# For Splitting the dataset
from sklearn.model_selection import train_test_split

# Model libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

#Accuracy measuring library
from sklearn.metrics import accuracy_score


## 2. Loading the data

In [2]:
data = pd.read_csv("data.csv")

In [3]:
data.shape #Returns the number of rows and columns present in the dataset

(4009, 4)

In [4]:
data.head()  # Returns the first 5 rows of the dataset

Unnamed: 0,URLs,Headline,Body,Label
0,http://www.bbc.com/news/world-us-canada-414191...,Four ways Bob Corker skewered Donald Trump,Image copyright Getty Images\nOn Sunday mornin...,1
1,https://www.reuters.com/article/us-filmfestiva...,Linklater's war veteran comedy speaks to moder...,"LONDON (Reuters) - “Last Flag Flying”, a comed...",1
2,https://www.nytimes.com/2017/10/09/us/politics...,Trump’s Fight With Corker Jeopardizes His Legi...,The feud broke into public view last week when...,1
3,https://www.reuters.com/article/us-mexico-oil-...,Egypt's Cheiron wins tie-up with Pemex for Mex...,MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin...,1
4,http://www.cnn.com/videos/cnnmoney/2017/10/08/...,Jason Aldean opens 'SNL' with Vegas tribute,"Country singer Jason Aldean, who was performin...",1


In [5]:
data.columns # Returns the column headings

Index(['URLs', 'Headline', 'Body', 'Label'], dtype='object')

In [6]:
data.isnull().sum() #To check the null values in the dataset, if any

URLs         0
Headline     0
Body        21
Label        0
dtype: int64

## 3. Data Preprocessing

Before further analysis, the data needs to be cleaned.
In this notebook, I will do 4 stages of data cleaning:
1. Removing the null values
2. Adding a new field
3. Dropping features that are not needed
4. Text processing

In [7]:
df = data.copy() #Creating a copy of my data, I will be working on this Dataframe

## 3.1. Removing the Null Values

As the Body field has some null values, they can be handled in two ways:
1. Drop the 21 rows
2. Replace the null value with a dummy string

Here, I will go with the 2nd option. Dropping 21 of the 4009 rows would probably have little effect on accuracy, but those rows still have a headline, so I would rather keep them.

I will replace the null (NaN) values in the 'Body' field with an empty string ('').

In [8]:
df['Body'] = df['Body'].fillna('')   # Fill missing Body values with an empty string

In [9]:
df.isnull().sum()  # No null values found

URLs        0
Headline    0
Body        0
Label       0
dtype: int64

## 3.2. Adding a new column
For ease of implementation, I combined the Headline and Body columns into a new column, 'News'.

In [10]:
df['News'] = df['Headline']+df['Body']

In [11]:
df.head()

Unnamed: 0,URLs,Headline,Body,Label,News
0,http://www.bbc.com/news/world-us-canada-414191...,Four ways Bob Corker skewered Donald Trump,Image copyright Getty Images\nOn Sunday mornin...,1,Four ways Bob Corker skewered Donald TrumpImag...
1,https://www.reuters.com/article/us-filmfestiva...,Linklater's war veteran comedy speaks to moder...,"LONDON (Reuters) - “Last Flag Flying”, a comed...",1,Linklater's war veteran comedy speaks to moder...
2,https://www.nytimes.com/2017/10/09/us/politics...,Trump’s Fight With Corker Jeopardizes His Legi...,The feud broke into public view last week when...,1,Trump’s Fight With Corker Jeopardizes His Legi...
3,https://www.reuters.com/article/us-mexico-oil-...,Egypt's Cheiron wins tie-up with Pemex for Mex...,MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin...,1,Egypt's Cheiron wins tie-up with Pemex for Mex...
4,http://www.cnn.com/videos/cnnmoney/2017/10/08/...,Jason Aldean opens 'SNL' with Vegas tribute,"Country singer Jason Aldean, who was performin...",1,Jason Aldean opens 'SNL' with Vegas tributeCou...
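
One thing worth noticing in the output above: `+` joins the two columns with no separator, so the last word of the headline is glued to the first word of the body (`TrumpImag...`). Here is a minimal sketch of the effect on tokens, using the visible text of the first row and a hypothetical `tokens` helper that mirrors the letters-only cleaning used later in this notebook:

```python
import re

# Visible text from the first row of the dataset (body shortened)
headline = "Four ways Bob Corker skewered Donald Trump"
body = "Image copyright Getty Images"

def tokens(text):
    # Hypothetical helper: keep letters only, lowercase, split on whitespace
    return re.sub(r"[^a-zA-Z]", " ", text).lower().split()

glued = tokens(headline + body)         # no separator, as in the cell above
spaced = tokens(headline + " " + body)  # a single space between the fields

print(glued)
print(spaced)
```

Joining with a space (for example `df['Headline'] + ' ' + df['Body']`) would keep `trump` and `image` as separate tokens. The outputs shown in the rest of this notebook come from the version without a separator.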


In [12]:
df.columns

Index(['URLs', 'Headline', 'Body', 'Label', 'News'], dtype='object')

## 3.3. Drop features that are not needed

In [13]:
features_dropped = ['URLs','Headline','Body']
df = df.drop(features_dropped, axis =1)

In [14]:
df.columns

Index(['Label', 'News'], dtype='object')

## 3.4. Text Processing
1. Remove symbols (',', '-', etc.)
2. Remove stop words
3. Stemming

In [15]:
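ps = PorterStemmer()
# Build the stop-word set once; calling stopwords.words() for every word is slow.
# Requires the NLTK stopwords corpus (nltk.download('stopwords')).
stop_words = set(stopwords.words('english'))

def wordopt(text):
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Replace every non-letter character with a space
    text = text.lower()
    tokens = text.split()  # Tokenize on whitespace
    # Drop stop words, then stem the remaining words
    tokens = [ps.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)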
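
To see what each of the three steps listed above does, here is a small sketch that returns the intermediate results for one sentence. The `clean_steps` helper and the tiny `STOP_WORDS` set are assumptions made for illustration (the notebook itself uses NLTK's full English stop-word list); only `PorterStemmer` is the same as in the cell above.

```python
import re
from nltk.stem.porter import PorterStemmer

# Tiny hand-picked stop-word set, an assumption for this sketch only;
# the notebook uses NLTK's full English list instead.
STOP_WORDS = {"on", "the", "a", "of", "to"}

ps = PorterStemmer()

def clean_steps(text):
    """Return the intermediate result after each cleaning step."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()  # 1. remove symbols
    tokens = letters_only.split()
    kept = [w for w in tokens if w not in STOP_WORDS]        # 2. remove stop words
    stemmed = [ps.stem(w) for w in kept]                     # 3. stemming
    return tokens, kept, stemmed

tokens, kept, stemmed = clean_steps("Four ways Bob Corker skewered the President on Twitter!")
print(tokens)
print(kept)
print(stemmed)
```

Stemming trims words to a common root (for example `ways` becomes `way`), so the stems are not always dictionary words.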

In [16]:
df['News'] = df['News'].apply(wordopt) #Applying the text processing techniques onto every row data

In [17]:
df.head()

Unnamed: 0,Label,News
0,1,four way bob corker skewer donald trumpimag co...
1,1,linklat war veteran comedi speak modern americ...
2,1,trump fight corker jeopard legisl agendath feu...
3,1,egypt cheiron win tie pemex mexican onshor oil...
4,1,jason aldean open snl vega tributecountri sing...


In [18]:
df['News'][0]

'four way bob corker skewer donald trumpimag copyright getti imag sunday morn donald trump went twitter tirad member parti exactli huge news far first time presid turn rhetor cannon rank time howev attack particularli bite person essenti call tennesse senat bob corker chair power senat foreign relat committe coward run elect said mr corker beg presid endors refus give wrongli claim mr corker support iranian nuclear agreement polit accomplish unlik colleagu mr corker free worri immedi polit futur hold tongu skip twitter post senbobcork shame white hous becom adult day care center someon obvious miss shift morn senat bob corker senbobcork octob report end though spoke new york time realli let presid four choic quot tennesse senat interview time particularli damn know presid tweet thing true know everyon know realli sugarcoat one mr corker flat say presid liar everyon know senat particular challeng mr trump insist unsuccess plead endors accus much broader mr corker presid use someth akin 

## 4. Splitting the Data

In [19]:
X = df['News']
Y = df['Label']

In [20]:
X.head()

0    four way bob corker skewer donald trumpimag co...
1    linklat war veteran comedi speak modern americ...
2    trump fight corker jeopard legisl agendath feu...
3    egypt cheiron win tie pemex mexican onshor oil...
4    jason aldean open snl vega tributecountri sing...
Name: News, dtype: object

In [21]:
#Split the data into training and test set
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
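
The split above draws a different random partition on every run, so the accuracies reported later will vary slightly from run to run. A minimal sketch of a reproducible split, on toy data rather than the real dataset, could look like this. The value `random_state=42` is an arbitrary choice, and `stratify` is an optional extra that keeps the class ratio similar in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's X (text) and Y (labels); not the real data
X = pd.Series([f"article {i}" for i in range(20)])
Y = pd.Series([1] * 12 + [0] * 8)

# random_state fixes the shuffle (42 is an arbitrary choice);
# stratify=Y keeps the class ratio roughly the same in both parts
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=Y
)

print(len(x_train), len(x_test))
print(y_train.mean(), y_test.mean())
```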

## 5. Vectorization

Machine learning models need numeric input, so the text is converted into numeric vectors.

In [22]:
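# Vectorization with TF-IDF
# Term frequency (TF): how often a word appears in a given document.
# Inverse document frequency (IDF): down-weights words that appear in many documents, so rarer words count for more.
# The vectorizer is fitted on the training set only and then reused to transform the test set.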
vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)
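
To make the TF-IDF weighting more concrete, here is a hand-rolled sketch in plain Python on a toy corpus. It follows the smoothed formula that scikit-learn documents for the default `TfidfVectorizer` settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation. The names `doc_freq`, `idf`, and `tfidf` are made up for this example, and it is an illustration of the idea, not a replacement for the library.

```python
import math
from collections import Counter

docs = ["trump fight corker", "corker senat trump", "oil win mexican"]  # toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed IDF, as documented for TfidfVectorizer's default smooth_idf=True
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    counts = Counter(doc)  # raw term frequency within this document
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 normalisation
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

In the first document, `fight` appears in only one document of the corpus, so it gets a larger weight than `trump` and `corker`, which each appear in two.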

## 6. Model Fitting
I will fit my data to three classification models:
1. Logistic Regression
2. SVM
3. RandomForestClassifier

The best of the three will be used in the next section.

In [23]:
#1. Logistic Regression - a common baseline model for binary classification
LR_model = LogisticRegression()

#Fitting training set to the model
LR_model.fit(xv_train,y_train)

#Predicting the test set results based on the model
lr_y_pred = LR_model.predict(xv_test)

#Calculate the accuracy of this model
score = accuracy_score(y_test,lr_y_pred)
print('Accuracy of LR model is ', score)


Accuracy of LR model is  0.9780658025922233


In [24]:
#2. Support Vector Machine(SVM) - SVM works relatively well when there is a clear margin of separation between classes.
svm_model = SVC(kernel='linear')

#Fitting training set to the model
svm_model.fit(xv_train,y_train)

#Predicting the test set results based on the model
svm_y_pred = svm_model.predict(xv_test)

#Calculate the accuracy score of this model
score = accuracy_score(y_test,svm_y_pred)
print('Accuracy of SVM model is ', score)

Accuracy of SVM model is  0.9900299102691924


In [25]:
#3. Random Forest Classifier 
RFC_model = RandomForestClassifier(random_state=0)

#Fitting training set to the model
RFC_model.fit(xv_train, y_train)

#Predicting the test set results based on the model
rfc_y_pred = RFC_model.predict(xv_test)

#Calculate the accuracy score of this model
score = accuracy_score(y_test,rfc_y_pred)
print('Accuracy of RFC model is ', score)

Accuracy of RFC model is  0.9690927218344965
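
The three accuracies above come from a single random split and lie within about two percentage points of each other, so the ranking could change with a different split. One way to get a steadier comparison is k-fold cross-validation. The sketch below runs on a made-up toy corpus; the texts, the labels, and `cv=3` are assumptions for illustration only, and the scores it prints say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy corpus: 1 = real, 0 = fake (not the notebook's data)
texts = [
    "senat vote budget bill", "presid sign trade deal", "court rule tax case",
    "minist visit pari summit", "bank rais interest rate", "citi open new school",
    "click link win prize money", "shock secret doctor hate", "alien land white hous",
    "miracl cure found share now", "enter leagu win cash", "celebr clone replac star",
]
labels = [1] * 6 + [0] * 6

models = {
    "LR": LogisticRegression(),
    "SVM": SVC(kernel="linear"),
    "RFC": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # The pipeline refits TF-IDF inside each fold, so each test fold stays unseen
    pipe = make_pipeline(TfidfVectorizer(), model)
    results[name] = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
    print(name, results[name].mean())
```

Wrapping the vectorizer and the model in one pipeline also keeps the "fit on training data only" rule from the vectorization step intact inside every fold.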


## 7. Manual Model Testing 

In [26]:
# As SVM gave the best accuracy, it will be used to check whether a news item is fake

def fake_news_det(news):
    input_data = {"text":[news]}
    new_def_test = pd.DataFrame(input_data)
    new_def_test["text"] = new_def_test["text"].apply(wordopt)  # Same cleaning as the training data
    new_x_test = new_def_test["text"]
    vectorized_input_data = vectorization.transform(new_x_test)  # Reuse the fitted TF-IDF vocabulary
    prediction = svm_model.predict(vectorized_input_data)

    # predict returns an array; take its single element (label 1 is treated as real news)
    if prediction[0] == 1:
        print("Not a Fake News.")
    else:
        print("Fake News")

In [27]:
fake_news_det('U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.')

Not a Fake News.


In [28]:
fake_news_det("""The second Covid-19 wave in India is now on the "downswing," the Centre said on Thursday, highlighting that the current number of active cases is still "very high" and advised states and Union territories (UTs) to not let down their guards.""")

Not a Fake News.


In [29]:
fake_news_det("JetNation FanDuel League; Week 4 of readers think this story is Fact. Add your two cents.(Before Its News)Our FanDuel league is back again this week. Here are the details:$900 in total prize money. $250 to the winner. $10 to enter.Remember this is a one week league, pick your lineup against the salary cap and next week if you want to play again you can pick a completely different lineup if you want.Click this link to enter — http://fanduel.com/JetNation You can discuss this with other NY Jets fans on the Jet Nation message board. Or visit of on Facebook.Source: http://www.jetnation.com/2017/09/27/jetnation-fanduel-league-week-4/")

Fake News
