# Fake news filter using Machine Learning

The first step is to import the libraries I will be using. The machine learning algorithm libraries will be imported later, alongside each model.

In [1]:
import itertools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

import numpy as np
import pandas as pd
import csv

import warnings
warnings.filterwarnings('ignore')

### Data analysis
Importing the dataset

In [2]:
# Named `news` rather than `all`, which would shadow a Python built-in
news = pd.read_csv('news.csv')

To prevent any confusion, I will work on a copy of the original dataset.

In [3]:
all_copy = news.copy()

In [4]:
all_copy.shape

(7795, 4)

In [5]:
all_copy.head()

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


From this preview, the main columns to use are text (the feature) and label (the target).

Let's also check how many articles there are and whether there are any null values.

In [6]:
all_copy.describe()

Unnamed: 0,id,title,text,label
count,7576,7185,6929,6755
unique,7517,7085,6644,437
top,#NAME?,FAKE,"Killing Obama administration rules, dismantlin...",REAL
freq,30,5,58,3161


In [7]:
all_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7795 entries, 0 to 7794
Data columns (total 4 columns):
id       7576 non-null object
title    7185 non-null object
text     6929 non-null object
label    6755 non-null object
dtypes: object(4)
memory usage: 243.7+ KB


It seems that several rows contain null values. Let's take a closer look.

In [8]:
all_copy.isna().sum()

id        219
title     610
text      866
label    1040
dtype: int64

As these columns hold text, missing values cannot be filled in with a mean or median. That leaves dropping the rows with null values, which is acceptable here because only two columns (text and label) are used.

In [9]:
all_copy = all_copy.dropna()
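Since only text and label feed the model, a gentler alternative would be to drop a row only when one of those two columns is missing. The sketch below compares both options on a made-up frame; `toy`, `strict` and `lenient` are hypothetical names, and the values are not from news.csv.

```python
import pandas as pd

# Toy frame with the same column names as news.csv (values are made up)
toy = pd.DataFrame({
    'id': ['1', '2', None, '4'],
    'title': ['A', None, 'C', 'D'],
    'text': ['body a', 'body b', 'body c', None],
    'label': ['FAKE', 'REAL', 'REAL', 'FAKE'],
})

# Default behaviour: drop a row if ANY column is missing
strict = toy.dropna()

# Alternative: drop a row only if 'text' or 'label' is missing
lenient = toy.dropna(subset=['text', 'label'])

print(len(toy), len(strict), len(lenient))
```

Whether the extra rows are worth keeping depends on how trustworthy a row with a missing id or title is, so this is a judgment call rather than a clear improvement.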

### Preparing the data

Let's then extract the feature and label from the data

In [10]:
features = all_copy.iloc[:,2]  # third column: text
labels = all_copy.iloc[:,3]    # fourth column: label

In [11]:
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2)
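The split above is not seeded, so the accuracy figures change from run to run. A minimal sketch of a reproducible, class-balanced split is shown below, assuming toy data; `toy_features`, `toy_labels` and the seed 42 are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the text and label columns
toy_features = pd.Series([f'article {i}' for i in range(10)])
toy_labels = pd.Series(['FAKE'] * 5 + ['REAL'] * 5)

# random_state fixes the shuffle; stratify keeps the FAKE/REAL ratio
# the same in the training and test sets
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels,
    test_size=0.2, random_state=42, stratify=toy_labels,
)

print(y_te.value_counts())
```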

### Algorithm Testing

In this section, I will test several algorithms to determine which is best suited to this problem. The libraries each one requires are imported in the same cell.

In [12]:
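from sklearn.model_selection import KFold
# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise a ValueError if it is set without shuffling
cv = KFold(n_splits = 5, shuffle = True, random_state = 42)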
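The `cv` splitter is defined here but not used in the later cells. One way it could be applied is with `cross_val_score` on a pipeline, so that the vectorizer is refitted inside each fold. This is only a sketch on a made-up corpus; `docs`, `y` and `pipe` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: 5 "fake"-style and 5 "real"-style headlines
docs = [
    'shocking secret they hide from you',
    'you will not believe this shocking trick',
    'secret cure doctors hide',
    'shocking truth about the secret plan',
    'believe this unbelievable secret',
    'senate passes budget bill',
    'minister announces new budget',
    'court rules on election bill',
    'senate committee reviews election law',
    'new law passes in parliament',
]
y = ['FAKE'] * 5 + ['REAL'] * 5

# Vectorizer and classifier in one estimator, so TF-IDF is fitted
# on the training part of each fold only
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(solver='liblinear'))

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, docs, y, cv=cv, scoring='accuracy')

print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())
```

Averaging over folds gives a steadier estimate than a single train/test split, at the cost of fitting each model five times.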

In [13]:
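# Initialise a TfidfVectorizer; max_df = 0.7 ignores terms that appear in more than 70% of documents
tfidf_vectorizer = TfidfVectorizer(max_df = 0.7)
# Fit the vectorizer on the training set and transform it, then transform the test set with the same vocabulary
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)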
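To illustrate what `max_df = 0.7` does, here is a small sketch on a made-up four-document corpus (`corpus` and `vec` are hypothetical names). The word "news" appears in every document, so it exceeds the 70% document-frequency threshold and is left out of the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "news" occurs in all four documents
corpus = [
    'breaking news about the election',
    'sports news from the weekend',
    'news on markets today',
    'local weather news update',
]

vec = TfidfVectorizer(max_df=0.7)
matrix = vec.fit_transform(corpus)

# Terms found in more than 70% of documents are dropped
print('news' in vec.vocabulary_)      # too common, so excluded
print('election' in vec.vocabulary_)  # rare enough to be kept
print(matrix.shape)                   # (documents, kept terms)
```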

In [14]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(solver = 'liblinear')
LR.fit(tfidf_train, y_train)
y_pred_LR = LR.predict(tfidf_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred_LR))

Accuracy score:  0.8623242042931162
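Accuracy alone does not show which class is being misclassified. `confusion_matrix` is imported at the top but not used, so here is a minimal sketch of how it could complement the score, using made-up predictions (`y_true_toy` and `y_pred_toy` are hypothetical).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions
y_true_toy = ['FAKE', 'FAKE', 'REAL', 'REAL', 'REAL', 'FAKE']
y_pred_toy = ['FAKE', 'REAL', 'REAL', 'REAL', 'FAKE', 'FAKE']

# Rows are true classes, columns are predicted classes,
# both in the order given by `labels`
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=['FAKE', 'REAL'])
print(cm)
# [[2 1]
#  [1 2]]

print(accuracy_score(y_true_toy, y_pred_toy))  # 4 correct out of 6
```

In the notebook, the same call would take `y_test` and, for example, `y_pred_LR`.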


In [15]:
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
MNB = MultinomialNB()
MNB.fit(tfidf_train, y_train)
y_pred_MNB = MNB.predict(tfidf_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred_MNB))

Accuracy score:  0.7698001480384901


In [16]:
# GaussianNB does not accept sparse matrices, so convert the TF-IDF output to dense arrays
tfidf_train_dense = tfidf_train.toarray()
tfidf_test_dense = tfidf_test.toarray()

GNB = GaussianNB()
GNB.fit(tfidf_train_dense, y_train)
y_pred_GNB = GNB.predict(tfidf_test_dense)

print('Accuracy score: ', accuracy_score(y_test, y_pred_GNB))

Accuracy score:  0.7416728349370837


In [17]:
BNB = BernoulliNB()
BNB.fit(tfidf_train, y_train)
y_pred_BNB = BNB.predict(tfidf_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred_BNB))

Accuracy score:  0.7609178386380459


In [18]:
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier()
DT.fit(tfidf_train,y_train)
y_pred_DT = DT.predict(tfidf_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred_DT))

Accuracy score:  0.7475943745373798


In [19]:
from sklearn.ensemble import RandomForestClassifier
classifierRF = RandomForestClassifier()
classifierRF.fit(tfidf_train, y_train)
y_pred_RF = classifierRF.predict(tfidf_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred_RF))

Accuracy score:  0.7609178386380459


In [20]:
from xgboost import XGBClassifier
XGB = XGBClassifier()
XGB.fit(tfidf_train,y_train)
y_pred_XGB = XGB.predict(tfidf_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred_XGB))

Accuracy score:  0.7779422649888971


In [21]:
from sklearn.linear_model import PassiveAggressiveClassifier
PAC = PassiveAggressiveClassifier()
PAC.fit(tfidf_train, y_train)
y_pred_PAC = PAC.predict(tfidf_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred_PAC))

Accuracy score:  0.8867505551443375


In [22]:
# Scores below were entered by hand and differ slightly from the outputs printed above:
# train_test_split is called without random_state, so each run produces a different split.
table = pd.DataFrame({
    'Model': ['Logistic Regression', 'Multinomial Naive Bayes', 'Gaussian Naive Bayes',
              'Bernoulli Naive Bayes', 'Decision Tree', 'Random Forest',
              'XGBoost', 'Passive Aggressive Classifier'],
    'Accuracy Scores': [0.852, 0.753, 0.756, 0.734, 0.742, 0.766, 0.750, 0.869],
})

# Rank the models from most to least accurate
pd.pivot_table(table, index = ['Model']).sort_values(by = 'Accuracy Scores', ascending = False)

Model,Accuracy Scores
Passive Aggressive Classifier,0.869
Logistic Regression,0.852
Random Forest,0.766
Gaussian Naive Bayes,0.756
Multinomial Naive Bayes,0.753
XGBoost,0.75
Decision Tree,0.742
Bernoulli Naive Bayes,0.734
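Typing the scores into the table by hand makes it easy for the table and the cell outputs to drift apart. A possible alternative is to fit the models in a loop and build the table from the computed scores. The sketch below does this on a made-up corpus with three of the models; all data and the names `models`, `rows` and `results` are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Made-up training and test data
train_docs = [
    'shocking secret they hide', 'unbelievable shocking trick',
    'secret cure they hide', 'senate passes budget bill',
    'court rules on election law', 'minister announces budget',
]
train_y = ['FAKE', 'FAKE', 'FAKE', 'REAL', 'REAL', 'REAL']
test_docs = ['shocking secret trick', 'senate budget law']
test_y = ['FAKE', 'REAL']

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

models = {
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Fit each model and record its test accuracy
rows = []
for name, model in models.items():
    model.fit(X_train, train_y)
    acc = accuracy_score(test_y, model.predict(X_test))
    rows.append({'Model': name, 'Accuracy': acc})

# Best model first
results = (pd.DataFrame(rows)
           .sort_values('Accuracy', ascending=False)
           .reset_index(drop=True))
print(results)
```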


### Ensemble Learning

Scikit-learn also provides a VotingClassifier, which lets users combine two or more machine learning algorithms to get the best of both worlds. However, after trying several combinations of algorithms with different weights, there was no substantial difference in accuracy compared with the best single model. Hence, for this dataset, this method of ensemble learning does not appear to be very useful.

* I did not include the passive aggressive classifier in the combination, as PassiveAggressiveClassifier does not have a predict_proba function. Soft voting averages predicted probabilities, so including it always causes an error.

In [24]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(solver = 'liblinear')

from sklearn.ensemble import RandomForestClassifier
classifierRF = RandomForestClassifier()

ensemble = VotingClassifier(estimators = [('Random Forest', classifierRF)
                            , ('Logistic regression', LR)]
                            , voting = 'soft', weights = [1,1.5]).fit(tfidf_train, y_train)


print('The accuracy for the combined algorithms are:',ensemble.score(tfidf_test, y_test))

The accuracy for the combined algorithms are: 0.8527017024426351
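As noted above, the passive aggressive classifier cannot take part in soft voting because it has no predict_proba. Hard voting only needs each model's predicted label, so it may be one way to include it. The sketch below checks the missing method and fits a hard-voting ensemble on a made-up corpus; the data and the names `pac` and `hard_vote` are hypothetical, and nothing here says hard voting would score better on news.csv.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus
docs = [
    'shocking secret they hide', 'unbelievable shocking trick',
    'secret cure they hide', 'you will not believe this secret',
    'senate passes budget bill', 'court rules on election law',
    'minister announces budget', 'parliament debates new law',
]
y = ['FAKE'] * 4 + ['REAL'] * 4
X = TfidfVectorizer().fit_transform(docs)

pac = PassiveAggressiveClassifier(random_state=0)
# No probability estimates are available, which is why soft voting fails
print(hasattr(pac, 'predict_proba'))

# Hard voting takes a majority vote over predicted labels instead;
# three estimators are used so that a two-way tie cannot occur
hard_vote = VotingClassifier(
    estimators=[
        ('pac', pac),
        ('lr', LogisticRegression(solver='liblinear')),
        ('mnb', MultinomialNB()),
    ],
    voting='hard',
).fit(X, y)

preds = hard_vote.predict(X)
print(preds)
```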
