<h1> Fake News Detection Project </h1>

<p> This is a machine learning project in which I built a model to detect whether a news article is fake or real. The model is trained on a Kaggle dataset that contains both fake and real news. This is a binary classification problem: I use a TF-IDF vectorizer to convert the news text into numeric features and a Multinomial Naive Bayes classifier to label each article. </p>

In [1]:
# importing necessary libraries 

import numpy as np
import pandas as pd

In [2]:
# loading the fake news dataset into a pandas DataFrame
df_fake = pd.read_csv('Fake.csv')

# loading the true news dataset into a pandas DataFrame
df_true = pd.read_csv('true.csv')

In [3]:
df_fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [4]:
df_true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [5]:
# the datasets do not contain a label column, so we add one

# "0" label for false news
df_fake['label'] = 0

# "1" Label for true news
df_true['label'] = 1

In [6]:
df_fake.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [7]:
df_true.head()

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [8]:
# to classify the news we only need two columns: the news text and the label
# dropping the rest of the columns

df_true.drop(['title', "subject", "date"],axis = 1, inplace = True)
df_fake.drop(['title', "subject", "date"],axis = 1, inplace = True)

In [9]:
# concatenating the fake and true news datasets
dataset = pd.concat([df_fake, df_true], axis = 0)

In [10]:
# checking for any null values
dataset.isnull().sum()

text     0
label    0
dtype: int64

In [11]:
# checking whether the dataset is balanced or imbalanced
dataset['label'].value_counts()

0    23481
1    21417
Name: label, dtype: int64

In [12]:
# Shuffling the dataset for better training of the model
dataset = dataset.sample(frac = 1)
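<p>The shuffle above gives a different row order on every run and keeps the original row numbers of both source files as the index. The following is a minimal sketch, on hypothetical toy data, of one way to make the shuffle repeatable and to get a clean index. The names <code>toy_fake</code>, <code>toy_true</code> and the seed value 42 are assumptions made for illustration and are not part of the original notebook.</p>

```python
import pandas as pd

# Toy stand-ins for the fake and true news frames (hypothetical data).
toy_fake = pd.DataFrame({"text": ["a", "b", "c"], "label": 0})
toy_true = pd.DataFrame({"text": ["d", "e", "f"], "label": 1})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating each source frame's row numbers.
toy = pd.concat([toy_fake, toy_true], ignore_index=True)

# A fixed random_state makes the shuffle repeatable across runs;
# reset_index(drop=True) discards the shuffled index labels.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
same_again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
print(shuffled.equals(same_again))
```

<p>Because <code>train_test_split</code> shuffles by default and is given a fixed <code>random_state</code> later in the notebook, this step mainly affects how the preview of the data looks rather than the final split.</p>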

In [13]:
dataset.head(20)

Unnamed: 0,text,label
22263,21st Century Wire says Earlier this week the m...,0
16745,"TOKYO (Reuters) - Prime Minister Shinzo Abe, b...",1
17766,Beyonce made an attempt to glorify the violent...,0
21279,"For the umpteenth time, Obama takes the opport...",0
9484,(Reuters) - A Texas judge identified by Donald...,1
17322,In the face of mounting threats of terrorism a...,0
1330,German Chancellor Angela Merkel is up for reel...,0
12850,"LESBOS, Greece (Reuters) - Syrian migrant Bash...",1
4434,WASHINGTON (Reuters) - The majority of House F...,1
7093,"We all heard the reasons why Republicans, main...",0


<h1> Pre-processing the Data </h1>

<p> Machine learning models work only with numerical data, so the text has to be converted into numeric vectors that keep enough information about word usage for a classifier to learn from.
To do this, we first clean the text with the Natural Language Toolkit (nltk) and regular expression (re) libraries, and then convert the cleaned text into vectors with a TF-IDF vectorizer.</p>
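<p>To make the TF-IDF step concrete, here is a small worked sketch in plain Python on three hypothetical, already cleaned sentences. It uses the smoothed IDF formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation, which matches the default settings of scikit-learn's <code>TfidfVectorizer</code>. It covers single words only, whereas the notebook also uses two-word phrases. The sentences and the helper names <code>idf</code> and <code>tfidf</code> are assumptions for illustration.</p>

```python
import math
from collections import Counter

# Three hypothetical cleaned documents.
docs = [
    "trump wish american happy new year",
    "reuters senator say budget fight",
    "trump budget fight",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger weights.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    # Term count times IDF, then scaled so the vector has unit L2 length.
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf(tokenized[0])
third = tfidf(tokenized[2])
print(first)
print(third)
```

<p>In the first document, "happy" appears in only one document and therefore gets a larger weight than "trump", which appears in two. In the third document all three words are equally common across the corpus, so they receive equal weights.</p>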

In [14]:
# importing required libraries 
import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saink\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\saink\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [15]:
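# the lemmatizer reduces each word to its dictionary base form (lemma), e.g. "cars" -> "car"
lemmatizer = WordNetLemmatizer()

# stop words are very common words (e.g. "the", "is") that carry little signal, so we remove them
# stored as a set so that membership checks are fast
stop_words = set(stopwords.words('english'))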

In [16]:
# this function cleans the text data and lemmatizes it
def clean_data(text):
    text = text.lower()                      # lowercase everything
    text = re.sub('[^a-zA-Z]', ' ', text)    # replace every non-letter with a space
    tokens = text.split()
    # drop stop words and lemmatize the remaining words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    clean_news = ' '.join(tokens)

    return clean_news


# applying the above function on the text dataset
dataset['text'] = dataset['text'].apply(clean_data)
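<p>The effect of the cleaning step is easiest to see on a single sentence. The sketch below repeats the lowercasing, non-letter removal and stop-word filtering in plain Python. It deliberately leaves out lemmatization and uses a tiny hypothetical stop-word set (<code>toy_stop_words</code>) so that it runs without downloading NLTK data. The sample sentence is also made up for illustration.</p>

```python
import re

# Tiny hypothetical stop-word set; the notebook uses NLTK's English list.
toy_stop_words = {"the", "a", "to", "all", "just", "in", "t"}

def toy_clean(text):
    text = text.lower()
    # Digits and punctuation (including apostrophes) become spaces.
    text = re.sub(r"[^a-z]", " ", text)
    tokens = [word for word in text.split() if word not in toy_stop_words]
    return " ".join(tokens)

sample = "Donald Trump just couldn't wish all Americans a Happy New Year in 2018!"
cleaned = toy_clean(sample)
print(cleaned)
```

<p>Note that replacing the apostrophe with a space splits "couldn't" into "couldn" and "t", so contractions leave fragments behind unless they are covered by the stop-word list.</p>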

In [17]:
# importing tfidf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features = 50000 , lowercase=False , ngram_range=(1,2))

In [18]:
# splitting the features and the labels

X = dataset['text']
y = dataset['label']

In [19]:
# splitting train and test data with test data size 20%

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 0)

In [20]:
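# applying tfidf on text data
# the vectorizer is fitted on the training text only and then reused on the test text

v_train = vectorizer.fit_transform(x_train)
v_train = v_train.toarray()

v_test = vectorizer.transform(x_test).toarray()

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() replaces it (available from version 1.0)
feature_names = vectorizer.get_feature_names_out()
train_data = pd.DataFrame(v_train , columns=feature_names)
test_data = pd.DataFrame(v_test , columns=feature_names)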
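<p>Converting the TF-IDF output to a dense array can use a lot of memory: with about 35,900 training articles and up to 50,000 features stored as 8-byte floats, the dense training matrix alone is roughly 14 GB. <code>MultinomialNB</code> also accepts the sparse matrix that the vectorizer returns, so the dense conversion can be skipped. The following is a minimal sketch of that alternative on a hypothetical toy corpus. All <code>toy_*</code> and <code>sparse_*</code> names are assumptions for illustration, and this is a suggested variation rather than what the notebook does.</p>

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for x_train / x_test.
toy_train = [
    "shocking secret exposed",
    "senate passes budget bill",
    "you will not believe this secret",
    "minister announces budget",
]
toy_labels = [0, 1, 0, 1]  # 0 = fake, 1 = real
toy_test = ["secret exposed", "budget bill"]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# fit_transform and transform both return SciPy sparse matrices,
# which store only the non-zero entries.
sparse_train = toy_vectorizer.fit_transform(toy_train)
sparse_test = toy_vectorizer.transform(toy_test)

# The classifier is trained directly on the sparse matrix.
toy_clf = MultinomialNB()
toy_clf.fit(sparse_train, toy_labels)
toy_predictions = toy_clf.predict(sparse_test)

print(sparse.issparse(sparse_train))
print(toy_predictions)
```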

<h1> Creating Model </h1>

In [22]:
# importing multinomial naive bayes classifier
# importing accuracy score and classification report

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
clf = MultinomialNB()

In [23]:
# fitting the data

clf.fit(train_data, y_train)

# predicting the labels for test data
predictions = clf.predict(test_data)

In [24]:
# checking the accuracy of predictions on test data

accuracy_score(y_test, predictions)

0.9522271714922049

<p>Although we got 95.2% accuracy on the test data, we should also check precision and recall, because accuracy alone can be misleading in classification problems, for example when the classes are imbalanced.</p>
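<p>As an illustration of how accuracy can mislead, consider a hypothetical, heavily imbalanced test set of 95 real and 5 fake articles, and a model that labels everything as real. The numbers below are invented for this example and are not results from this project.</p>

```python
# Hypothetical imbalanced test set: 95 real (1) and 5 fake (0) articles,
# scored by a model that predicts "real" for every article.
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100

def confusion_counts(y_true, y_pred, positive):
    # True positives, false positives and false negatives for one class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Treat "fake" (label 0) as the class of interest.
tp, fp, fn = confusion_counts(y_true, y_pred, positive=0)
precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 when nothing is predicted fake
recall = tp / (tp + fn) if tp + fn else 0.0

print(accuracy, precision, recall)
```

<p>Accuracy is 0.95 even though the model never finds a single fake article: recall for the fake class is 0. Precision and recall expose this failure, which is why they are worth checking alongside accuracy.</p>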

In [25]:
# checking the classification report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.96      0.95      0.95      4730
           1       0.95      0.95      0.95      4250

    accuracy                           0.95      8980
   macro avg       0.95      0.95      0.95      8980
weighted avg       0.95      0.95      0.95      8980



In [27]:
# saving the model as a pickle file for further use
import pickle
filename = "final_model.sav"
# the with-block closes the file even if pickling fails
with open(filename, 'wb') as f:
    pickle.dump(clf, f)
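<p>The saved classifier can only score text that has been transformed by the same fitted vectorizer, so the vectorizer needs to be saved as well. One possible approach, sketched here on a hypothetical toy corpus, is to bundle both steps in a scikit-learn <code>Pipeline</code> and pickle that single object. The <code>toy_*</code> names and step names are assumptions for illustration. The sketch pickles to memory rather than to a file, and new text would still need the same cleaning function applied before prediction.</p>

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 0 = fake, 1 = real.
toy_texts = [
    "shocking secret exposed",
    "senate passes budget bill",
    "you will not believe this secret",
    "minister announces budget",
]
toy_labels = [0, 1, 0, 1]

# The pipeline keeps the fitted vectorizer and classifier together.
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
toy_pipeline.fit(toy_texts, toy_labels)

# Round trip through pickle (in memory here; a file works the same way).
blob = pickle.dumps(toy_pipeline)
restored = pickle.loads(blob)

# The restored pipeline accepts raw (cleaned) text directly.
new_texts = ["secret exposed", "budget bill"]
restored_predictions = restored.predict(new_texts)
print(restored_predictions)
```

<p>Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.</p>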

<h1>Conclusion:</h1>

<ul>
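    <li> The accuracy score of our classification model on the test data is 95.2%.</li>
    <li>Because our dataset was roughly balanced, accuracy is a reasonable summary here, and both precision and recall are high (about 0.95) for each class. </li>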
</ul>