# Use Case - Fake News Classifier

### Problem Statement

News media are a primary channel for telling people what is happening in the world, and audiences often assume that whatever is reported is true. Yet there have been cases where news outlets later acknowledged that a story they published was not accurate. False reports can matter a great deal: a single story can affect not only people and governments but also the economy, moving markets up or down depending on public sentiment and the political situation.

It is therefore important to distinguish fake news from real news. Natural Language Processing (NLP) tools can help by learning from historical, labelled articles to predict whether a new article is fake or real.

### Objective

The objective is to build a model that classifies a news article as fake or real.

### Dataset

https://www.kaggle.com/c/fake-news/data#

### Importing Libraries

In [1]:
import nltk
import re
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# nltk.download('wordnet')
# nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split




### Loading dataset

In [2]:
# Read only the first 1,000 rows
data = pd.read_csv('Data.csv', nrows=1000)

### Analyzing data

In [3]:
data.head()

|   | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [4]:
data.shape

(1000, 5)

In [5]:
data.columns

Index(['id', 'title', 'author', 'text', 'label'], dtype='object')

In [6]:
#Checking NULL Values
data.isnull().sum()

id          0
title      27
author    103
text        2
label       0
dtype: int64

In [7]:
data.duplicated().sum()

0

In [8]:
# Work on a deep copy so the original DataFrame is left unchanged
df = data.copy(deep=True)

# Drop every row that has a missing value in any column
df.dropna(inplace=True)

df.shape

(870, 5)
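
Dropping every row with any missing value removes 130 of the 1,000 rows, yet most of the gaps are in `author`, a column that is discarded later. A minimal sketch of a narrower alternative, assuming only `title` and `text` are needed downstream, restricts `dropna` to those columns. The toy frame below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real file (values are made up for illustration)
toy = pd.DataFrame({
    "title":  ["A", None, "C", "D"],
    "author": ["x", "y", None, None],
    "text":   ["aa", "bb", "cc", None],
    "label":  [0, 1, 0, 1],
})

# Drop a row if ANY column is missing (the default behaviour of dropna)
strict = toy.dropna()

# Drop a row only if a column that is actually used later is missing
lenient = toy.dropna(subset=["title", "text"])

print(len(strict), len(lenient))
```

Whether the extra rows help depends on the data; this only shows how the row count changes with the `subset` argument.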

In [9]:
# Renumber rows from 0 after dropping; the old index is kept as a new 'index' column
df.reset_index(inplace=True)

### Data Cleaning and Preprocessing

In [10]:
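# Combine title and text. '+' adds no separator, so the title's last word fuses with the
# text's first word ("ItHouse" below); df['title'] + ' ' + df['text'] avoids this.
df['news'] = df['title'] + df['text']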

In [11]:
df.columns

Index(['index', 'id', 'title', 'author', 'text', 'label', 'news'], dtype='object')

In [12]:
#dropping unnecessary columns
df.drop(['index','id','title','author','text'],inplace=True,axis=1)

In [13]:
df.head()

|   | label | news |
|---|---|---|
| 0 | 1 | House Dem Aide: We Didn’t Even See Comey’s Let... |
| 1 | 0 | FLYNN: Hillary Clinton, Big Woman on Campus - ... |
| 2 | 1 | Why the Truth Might Get You FiredWhy the Truth... |
| 3 | 1 | 15 Civilians Killed In Single US Airstrike Hav... |
| 4 | 1 | Iranian woman jailed for fictional unpublished... |


In [14]:
df['news'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted ItHouse Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) \nWith apologies to Keith Olbermann, there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide, it looks like we also know who the second-worst person is as well. It turns out that when Comey sent his now-infamous letter announcing that the FBI was looking into emails that may be related to Hillary Clinton’s email server, the ranking Democrats on the relevant committees didn’t hear about it from Comey. They found out via a tweet from one of the Republican committee chairmen. \nAs we now know, Comey notified the Republican chairmen and Democratic ranking members of the House Int
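
The sample above shows the title and body fused at the join ("...Tweeted ItHouse Dem Aide..."), because `+` concatenates the two strings with nothing in between. Once the text is reduced to lowercase letters, that yields an artificial token such as `ithouse`. A small sketch of the effect, using made-up strings and a letters-only regex like the one in the cleaning step:

```python
import re

title = "Comey Tweeted It"      # made-up example strings
text = "House aide responds"

def tokens(s):
    # Keep letters only, lowercase, split on whitespace
    return re.sub("[^a-zA-Z]", " ", s).lower().split()

fused = tokens(title + text)          # no separator between title and body
spaced = tokens(title + " " + text)   # single space between title and body

print(fused)
print(spaced)
```

With pandas columns the equivalent join would be `df['title'] + ' ' + df['text']`.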

In [15]:
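lm = WordNetLemmatizer()
# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
news_corpus = []

for i in range(len(df)):
    # Keep letters only, lowercase, and split on whitespace
    sent = re.sub('[^a-zA-Z]', ' ', df['news'][i])
    sent = sent.lower()
    sent = sent.split()

    # Drop stop words and lemmatize the remaining tokens
    sent = [lm.lemmatize(word) for word in sent if word not in stop_words]
    sent = ' '.join(sent)
    news_corpus.append(sent)
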

In [16]:
news_corpus[0]

'house dem aide even see comey letter jason chaffetz tweeted ithouse dem aide even see comey letter jason chaffetz tweeted darrell lucus october subscribe jason chaffetz stump american fork utah image courtesy michael jolley available creative common license apology keith olbermann doubt worst person world week fbi director james comey according house democratic aide look like also know second worst person well turn comey sent infamous letter announcing fbi looking email may related hillary clinton email server ranking democrat relevant committee hear comey found via tweet one republican committee chairman know comey notified republican chairman democratic ranking member house intelligence judiciary oversight committee agency reviewing email recently discovered order see contained classified information long letter went oversight committee chairman jason chaffetz set political world ablaze tweet fbi dir informed fbi learned existence email appear pertinent investigation case reopened j
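
The cleaning steps can also be packaged as a function so that the same treatment is applied to new articles at prediction time. The sketch below is one possible shape, not the notebook's code: `clean_text` is a hypothetical helper, and the tiny stop-word set and identity lemmatizer are placeholders so the example runs without the NLTK corpora. In practice one would pass `set(stopwords.words('english'))` and `WordNetLemmatizer().lemmatize`.

```python
import re

def clean_text(text, stop_words, lemmatize=lambda w: w):
    """Letters only -> lowercase -> drop stop words -> lemmatize -> rejoin."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(lemmatize(w) for w in words if w not in stop_words)

# Placeholder stop words for the demo; NLTK's English list is much longer
demo_stop_words = {"the", "is", "on", "a", "in"}

cleaned = clean_text("The FBI is looking into 3 emails!", demo_stop_words)
print(cleaned)
```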

### Vectorizing the corpus 

In [17]:
tf = TfidfVectorizer(max_features=5000,ngram_range=(1,2))

X = tf.fit_transform(news_corpus)

In [18]:
y = df['label']

In [19]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf.get_feature_names_out()[:15].tolist()

['abandoned',
 'abc',
 'abedin',
 'ability',
 'able',
 'abortion',
 'abroad',
 'absolute',
 'absolutely',
 'absurd',
 'abuse',
 'abused',
 'academic',
 'academy',
 'accept']

In [20]:
tf.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': 5000,
 'min_df': 1,
 'ngram_range': (1, 2),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': None,
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'use_idf': True,
 'vocabulary': None}

### Splitting the Dataset into test and train

In [21]:
# Note: train_size=0.30 puts 30% of the rows in the training set and 70% in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.30, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((261, 5000), (609, 5000), (261,), (609,))
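
With these shapes the model is fitted on 261 articles and evaluated on 609. A more common arrangement holds out 30% for testing instead, and `stratify` keeps the class ratio the same in both parts. A sketch on made-up labels (the arrays are placeholders, not the article data):

```python
from sklearn.model_selection import train_test_split

# 100 placeholder samples: 70 of class 0 and 30 of class 1
X_toy = [[i] for i in range(100)]
y_toy = [0] * 70 + [1] * 30

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.30,     # hold out 30% for testing, train on the other 70%
    stratify=y_toy,     # preserve the 70/30 class ratio in both splits
    random_state=1,
)

print(len(X_tr), len(X_te))
print(sum(y_tr), sum(y_te))   # number of class-1 samples in each split
```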

### Creating Model

In [22]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

nb.fit(X_train,y_train)

MultinomialNB()

In [23]:
y_pred = nb.predict(X_test)
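
In the cells above the vectorizer is fitted on the whole corpus before the split, so the vocabulary and IDF weights are learned partly from test articles. One way to avoid this, sketched here on made-up sentences rather than the article data, is to wrap both steps in a scikit-learn pipeline so that `fit` only ever sees training text. The sentences and labels are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training and test texts (1 = fake, 0 = real in this toy example)
train_texts = [
    "shocking secret cure doctors hate",
    "you will not believe this shocking trick",
    "senate passes budget bill after debate",
    "central bank holds interest rates steady",
]
train_labels = [1, 1, 0, 0]
test_texts = ["shocking trick doctors hate", "senate debate on interest rates"]

# The vectorizer is fitted inside the pipeline, on the training text only
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

preds = model.predict(test_texts)
print(preds)
```

The same `model` object can then be applied directly to raw (cleaned) strings, with no separate `transform` call.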

### Calculating Accuracy

In [24]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [25]:
# scikit-learn metrics take (y_true, y_pred) in that order
print('Accuracy on Test data is : ', accuracy_score(y_test, y_pred))

Accuracy on Test data is :  0.8013136288998358


In [26]:
# y_true first, y_pred second; the reverse order swaps precision and recall
print('Classification report:\n', classification_report(y_test, y_pred))

Classification report:
               precision    recall  f1-score   support

           0       0.75      0.98      0.85       354
           1       0.96      0.55      0.70       255

    accuracy                           0.80       609
   macro avg       0.86      0.77      0.78       609
weighted avg       0.84      0.80      0.79       609



In [27]:
# Rows are true labels and columns are predicted labels
print('Confusion Matrix is :\n', confusion_matrix(y_test, y_pred))

Confusion Matrix is :
 [[348   6]
 [115 140]]
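
The figures can be cross-checked by hand. Reading the matrix together with the argument order of the call, the four counts are: 348 articles with label 0 predicted as 0, 140 with label 1 predicted as 1, 115 with label 1 predicted as 0, and 6 with label 0 predicted as 1. scikit-learn documents its metric functions as taking `(y_true, y_pred)` in that order; accuracy is unaffected by the order, but per-class precision and recall swap if the arguments are reversed. A plain-Python sketch that recomputes the figures for label 1 with the true labels as the reference:

```python
# Counts read from the confusion matrix (true label -> predicted label)
true0_pred0 = 348
true0_pred1 = 6
true1_pred0 = 115
true1_pred1 = 140

total = true0_pred0 + true0_pred1 + true1_pred0 + true1_pred1
accuracy = (true0_pred0 + true1_pred1) / total

# Label 1 precision: of everything predicted 1, how much was truly 1?
precision_1 = true1_pred1 / (true1_pred1 + true0_pred1)
# Label 1 recall: of everything truly 1, how much was predicted 1?
recall_1 = true1_pred1 / (true1_pred1 + true1_pred0)

# Share of the test set that is truly label 0 (accuracy of always predicting 0)
majority_baseline = (true0_pred0 + true0_pred1) / total

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(majority_baseline, 2))
```

Under these counts the classifier rarely assigns label 1 to a label-0 article, but it misses close to half of the label-1 articles, and the 0.80 accuracy should be read against a baseline of about 0.58 for always predicting label 0.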
