Hello there and welcome to this notebook!

Here I try to distinguish between fake and real news and label the articles accordingly.
I used 2 types of vectorizers (count and tfidf) as well as 3 types of classifiers (Naive Bayes, a linear classifier and SVM). I know that we were supposed to use MaxEnt instead, but I found the linear classifier quite useful and, I must say, the most successful.
I also tried 2 types of predictions: based on text and on titles separately.

I am submitting 2 files with predictions because I am not sure whether you will consider the linear classifier appropriate for this task. prediction.csv is based on SVM classification, while prediction1.csv is based on the linear one.

Sorry for the inconvenience, and let's start!

In [1]:
#Importing necessary packages and modules
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn import metrics

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import svm

Now let's load and explore the data we have here.

In [2]:
#Loading the data
train = pd.read_csv('fake_or_real_news_training.csv')
test = pd.read_csv('fake_or_real_news_test.csv')

In [3]:
#Head
print('Train df' , train.head())
print(' ')
print('Test df', test.head())

Train df       ID                                              title  \
0   8476                       You Can Smell Hillary’s Fear   
1  10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2   3608        Kerry to go to Paris in gesture of sympathy   
3  10142  Bernie supporters on Twitter erupt in anger ag...   
4    875   The Battle of New York: Why This Primary Matters   

                                                text label   X1   X2  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  NaN  NaN  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  NaN  NaN  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  NaN  NaN  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  NaN  NaN  
4  It's primary day in New York and front-runners...  REAL  NaN  NaN  
 
Test df       ID                                              title  \
0  10498  September New Homes Sales Rise——-Back To 1992 ...   
1   2439  Why The Obamacare Doomsday Cult Can't Ad

In [4]:
#Shape
print('Train shape:',train.shape)
print('Test shape:', test.shape)

Train shape: (3999, 6)
Test shape: (2321, 3)


In [5]:
#Info
print('Train info', train.info())
print(' ')
print('Test info', test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
ID       3999 non-null int64
title    3999 non-null object
text     3999 non-null object
label    3999 non-null object
X1       33 non-null object
X2       2 non-null object
dtypes: int64(1), object(5)
memory usage: 187.5+ KB
Train info None
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2321 entries, 0 to 2320
Data columns (total 3 columns):
ID       2321 non-null int64
title    2321 non-null object
text     2321 non-null object
dtypes: int64(1), object(2)
memory usage: 54.5+ KB
Test info None


I decided to drop columns X1 and X2 based on the dataset info: X1 has only 33 non-null values and X2 has just 2, so I think these columns are not that valuable for further analysis.

In [6]:
#Discarding X1 and X2
train = train[['ID', 'title', 'text', 'label']]
print(train.head())
print(' ')
print('Train shape:', train.shape)

      ID                                              title  \
0   8476                       You Can Smell Hillary’s Fear   
1  10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2   3608        Kerry to go to Paris in gesture of sympathy   
3  10142  Bernie supporters on Twitter erupt in anger ag...   
4    875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  
 
Train shape: (3999, 4)
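
Because X1 and X2 are filled in for a handful of rows, it may be worth checking whether those rows were shifted to the right by a stray delimiter, in which case their label column would hold something other than FAKE or REAL. The sketch below shows one way to check this on a small made-up frame; `toy` and its contents are hypothetical and only mimic the shape of the training data.

```python
import pandas as pd

# Hypothetical miniature of the training frame. The third row mimics a record
# whose fields were pushed one column to the right by an extra comma.
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'title': ['Title A', 'Title B', 'Title C'],
    'text': ['body a', 'body b', ' second half of title C'],
    'label': ['FAKE', 'REAL', 'body c'],
    'X1': [None, None, 'REAL'],
})

# Show which values actually occur in the label column
print(toy['label'].value_counts())

# Keep only rows whose label is one of the two expected classes
valid = toy['label'].isin(['FAKE', 'REAL'])
n_bad = int((~valid).sum())
clean = toy[valid]
print('Rows with unexpected labels:', n_bad)
```

If the same check on the real `train` frame reports unexpected labels, those rows could be dropped or repaired before splitting.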


Now I start building classifiers based on the 'text' column of the train dataset.
First, I split the dataset into train and test parts, then initialize 2 vectorizers, and finally run each classifier on the output of both vectorizers.

In [7]:
#Labels and splitting
y = train.label 
X_train, X_test, y_train, y_test = train_test_split(train['text'], y, test_size=0.33, random_state=13) 
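
The split, vectorize, classify sequence described above can also be written as a single scikit-learn pipeline, which keeps each vectorizer tied to its classifier. This is only a sketch on made-up documents: `texts`, `labels` and `pipe` are hypothetical names, and `stratify` is an optional extra that the cell above does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for the 'text' and 'label' columns
texts = [
    'shocking secret cure doctors hate',
    'aliens endorse candidate in shocking video',
    'secret memo proves shocking conspiracy',
    'miracle cure banned by secret agency',
    'senate passes budget bill after debate',
    'president signs trade agreement with canada',
    'court rules on voting rights case',
    'governor announces new budget plan',
]
labels = ['FAKE'] * 4 + ['REAL'] * 4

# stratify keeps the FAKE/REAL ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=13, stratify=labels)

# The pipeline fits the vectorizer on the training part only,
# then reuses the learned vocabulary when predicting
pipe = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)
print(list(zip(y_te, preds)))
```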

COUNT VECTORIZER

In [8]:
#Count vectorizer initialisation 
count_vector = CountVectorizer(stop_words='english') 
count_train = count_vector.fit_transform(X_train)
count_test = count_vector.transform(X_test)

TFIDF VECTORIZER

In [9]:
#Tfidf Vectorizer
tfidf_vector = TfidfVectorizer(stop_words="english", max_df=0.7) 
tfidf_train = tfidf_vector.fit_transform(X_train) 
tfidf_test = tfidf_vector.transform(X_test) 
final = tfidf_vector.transform(test['text']) 
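
The difference between `fit_transform` and `transform`, and the effect of `max_df`, can be seen on a tiny made-up corpus. The documents below are hypothetical; the point is only that the vocabulary is learned from the training documents and that terms appearing in more than 70% of them are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training documents; 'election' occurs in 3 of 4 (75%)
train_docs = [
    'election fraud claim',
    'election result official',
    'election poll numbers',
    'senate vote tonight',
]

vec = TfidfVectorizer(max_df=0.7)
X = vec.fit_transform(train_docs)   # learns the vocabulary and idf weights

# 'election' exceeds max_df, so it is not part of the vocabulary
print(sorted(vec.vocabulary_))

# transform reuses the learned vocabulary: unseen words are ignored
new = vec.transform(['election fraud unseen'])
print(X.shape, new.shape, new.nnz)
```

This is why the held-out part and the final test file go through `transform` only: they must be mapped onto the columns learned from the training part.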

NAIVE BAYES CLASSIFIER

In [10]:
#Count NB Classifier: initialisation and prediction
nb_class1 = MultinomialNB()
nb_class1.fit(count_train, y_train) 
count_nb_pred = nb_class1.predict(count_test) 

In [11]:
#Count NB Classifier: results
score = metrics.accuracy_score(y_test, count_nb_pred) 
print('Count NB score:', score)

con_mat = metrics.confusion_matrix(y_test, count_nb_pred, labels=['FAKE', 'REAL']) 
print('Confusion matrix', '\n' ,con_mat)

Count NB score: 0.8833333333333333
Confusion matrix 
 [[572  92]
 [ 49 594]]
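
To read the confusion matrix above: rows are the true labels and columns the predicted labels, both ordered FAKE, REAL. The short sketch below derives per-class recall and precision from the printed numbers. It also compares the matrix total with the size of the held-out part (33% of 3999 rows, rounded up); the variable names are mine, and the interpretation of the gap is an assumption that would need checking against the data.

```python
import math

# Confusion matrix printed above for the count-based Naive Bayes model
cm = [[572, 92],
      [49, 594]]
(fake_as_fake, fake_as_real), (real_as_fake, real_as_real) = cm

fake_recall = fake_as_fake / (fake_as_fake + fake_as_real)
real_recall = real_as_real / (real_as_fake + real_as_real)
fake_precision = fake_as_fake / (fake_as_fake + real_as_fake)
print('FAKE recall:', round(fake_recall, 3))
print('REAL recall:', round(real_recall, 3))
print('FAKE precision:', round(fake_precision, 3))

# Size of the held-out part: test_size=0.33 of 3999 rows, rounded up
n_test = math.ceil(0.33 * 3999)
correct = fake_as_fake + real_as_real
in_matrix = sum(sum(row) for row in cm)
print('Accuracy over the held-out part:', correct / n_test)
print('Rows not covered by the matrix:', n_test - in_matrix)
```

The printed accuracy matches `correct / n_test`, while the matrix covers fewer rows than the held-out part. One possible explanation is that a few held-out rows carry a label other than FAKE or REAL, since the `labels` argument restricts the matrix to those two classes.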


In [12]:
#Tfidf NB classifier: initialisation and prediction
nb_class2 = MultinomialNB()
nb_class2.fit(tfidf_train, y_train) 
tf_nb_pred = nb_class2.predict(tfidf_test) 

In [13]:
#Tfidf NB classifier: results
tf_score = metrics.accuracy_score(y_test, tf_nb_pred) 
print('Tfidf NB score:', tf_score)

tf_con_mat = metrics.confusion_matrix(y_test, tf_nb_pred, labels=['FAKE', 'REAL']) 
print('Confusion matrix', '\n', tf_con_mat)

Tfidf NB score: 0.796969696969697
Confusion matrix 
 [[418 246]
 [  9 634]]


LINEAR CLASSIFIER

The idea of using a linear classifier came to me after looking at this notebook:
https://github.com/docketrun/Detecting-Fake-News-with-Scikit-Learn/blob/master/Attempting%20to%20detect%20fake%20news.ipynb

In [14]:
#Linear count classifier (max_iter replaces n_iter, which newer scikit-learn releases no longer accept)
linear_class1 = PassiveAggressiveClassifier(max_iter=200)
linear_class1.fit(count_train, y_train)
lin_count_pred = linear_class1.predict(count_test)



In [15]:
#Linear count classifier results
lin_score = metrics.accuracy_score(y_test, lin_count_pred)
print('Linear classifier count score:', lin_score)

lin_con_mat = metrics.confusion_matrix(y_test, lin_count_pred, labels=['FAKE', 'REAL'])
print('Confusion matrix', '\n', lin_con_mat)

Linear classifier count score: 0.875
Confusion matrix 
 [[583  76]
 [ 68 572]]


In [16]:
#Linear tfidf classifier (max_iter replaces n_iter, which newer scikit-learn releases no longer accept)
linear_class2 = PassiveAggressiveClassifier(max_iter=200)
linear_class2.fit(tfidf_train, y_train)
lin_tf_pred = linear_class2.predict(tfidf_test)



In [17]:
#Linear tfidf classifier results
lin_tf_score = metrics.accuracy_score(y_test, lin_tf_pred)
print('Linear classifier tf score:', lin_tf_score)

lin_con_mat = metrics.confusion_matrix(y_test, lin_tf_pred, labels=['FAKE', 'REAL'])
print('Confusion matrix', '\n', lin_con_mat)

Linear classifier tf score: 0.918939393939394
Confusion matrix 
 [[632  32]
 [ 62 581]]
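
For comparison with the MaxEnt classifier mentioned at the top: in scikit-learn, maximum-entropy classification corresponds to `LogisticRegression`. The sketch below shows how it could be plugged into the same tfidf setup. It uses made-up documents and default, untuned hyperparameters, and it was not run on the real dataset, so it says nothing about how MaxEnt would score here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for the training texts and labels
docs = [
    'shocking secret cure doctors hate',
    'aliens endorse candidate in shocking video',
    'secret memo proves shocking conspiracy',
    'miracle cure banned by secret agency',
    'senate passes budget bill after debate',
    'president signs trade agreement with canada',
    'court rules on voting rights case',
    'governor announces new budget plan',
]
labels = ['FAKE'] * 4 + ['REAL'] * 4

# Logistic regression is the usual MaxEnt model for two classes
maxent = make_pipeline(
    TfidfVectorizer(stop_words='english', max_df=0.7),
    LogisticRegression(max_iter=1000),
)
maxent.fit(docs, labels)

new_docs = ['shocking secret video', 'senate budget debate']
probs = maxent.predict_proba(new_docs)   # one column per class
print(list(maxent.classes_))
print(probs.round(3))
```

Unlike the other classifiers used in this notebook, this model also gives class probabilities, which can be useful when a prediction should be flagged as uncertain.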


SVM CLASSIFIER

In [18]:
#SVM on count 
svm_class1 = svm.SVC(C=1, kernel='linear', gamma=1)
svm_class1.fit(count_train, y_train)
svm_count_pred = svm_class1.predict(count_test)

In [19]:
#SVM on count results
svm_score = metrics.accuracy_score(y_test, svm_count_pred)
print('SVM count score :', svm_score)

svm_con_mat = metrics.confusion_matrix(y_test, svm_count_pred, labels=['FAKE', 'REAL'])
print('Confusion matrix', '\n', svm_con_mat)

SVM count score : 0.8204545454545454
Confusion matrix 
 [[559  68]
 [ 96 524]]


In [20]:
#SVM on tfidf 
svm_class2 = svm.SVC(C=1, kernel='linear', gamma=1)
svm_class2.fit(tfidf_train, y_train)
svm_tfidf_pred = svm_class2.predict(tfidf_test)

In [21]:
#SVM on tfidf results
svm_tf_score = metrics.accuracy_score(y_test, svm_tfidf_pred)
print('SVM tf score:', svm_tf_score)

svm_con_mat = metrics.confusion_matrix(y_test, svm_tfidf_pred, labels=['FAKE', 'REAL'])
print('Confusion matrix', '\n', svm_con_mat)

SVM tf score: 0.9151515151515152
Confusion matrix 
 [[635  29]
 [ 70 573]]
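
A note on the SVM settings: with `kernel='linear'` the `gamma` argument has no effect, because gamma only enters the rbf, poly and sigmoid kernels. The sketch below checks this on made-up documents by fitting two linear SVMs that differ only in gamma and comparing their decision values.

```python
import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the training texts and labels
docs = [
    'shocking secret cure doctors hate',
    'aliens endorse candidate in shocking video',
    'secret memo proves shocking conspiracy',
    'senate passes budget bill after debate',
    'president signs trade agreement with canada',
    'court rules on voting rights case',
]
labels = ['FAKE'] * 3 + ['REAL'] * 3

X = TfidfVectorizer().fit_transform(docs)

# Same C and kernel, very different gamma
svm_a = svm.SVC(C=1, kernel='linear', gamma=1).fit(X, labels)
svm_b = svm.SVC(C=1, kernel='linear', gamma=100).fit(X, labels)

same = np.allclose(svm_a.decision_function(X), svm_b.decision_function(X))
print('Identical decision values:', same)
```

So for a linear kernel only `C` is worth tuning. `LinearSVC` is another option that is usually faster on sparse text features, although it optimizes a slightly different objective.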


The next cells are dedicated to the title-based classifiers. I use the same code as for the text-based ones to build both vectorizers and all classifiers.

In [22]:
#Splitting
X_t_train, X_t_test, y_t_train, y_t_test = train_test_split(train['title'], y, test_size=0.33, random_state=12) 

COUNT VECTORIZER

In [23]:
#Count vectorizer initialisation 
count_vector = CountVectorizer(stop_words='english') 
count_t_train = count_vector.fit_transform(X_t_train)
count_t_test = count_vector.transform(X_t_test)

TFIDF VECTORIZER

In [24]:
#Tfidf Vectorizer
tfidf_vector = TfidfVectorizer(stop_words="english", max_df=0.7) 
tfidf_t_train = tfidf_vector.fit_transform(X_t_train) 
tfidf_t_test = tfidf_vector.transform(X_t_test) 

NAIVE BAYES CLASSIFIERS

In [25]:
#Count NB Classifier: initialisation and prediction
nb_class3 = MultinomialNB()
nb_class3.fit(count_t_train, y_t_train) 
count_nb_t_pred = nb_class3.predict(count_t_test) 

In [26]:
#Count NB Classifier: results
score_t = metrics.accuracy_score(y_t_test, count_nb_t_pred) 
print('Count NB score:', score_t)

con_mat = metrics.confusion_matrix(y_t_test, count_nb_t_pred, labels=['FAKE', 'REAL']) 
print('Confusion matrix', '\n' ,con_mat)

Count NB score: 0.7840909090909091
Confusion matrix 
 [[464 196]
 [ 82 571]]


In [27]:
#Tfidf NB classifier: initialisation and prediction
nb_class4 = MultinomialNB()
nb_class4.fit(tfidf_t_train, y_t_train) 
tf_nb_t_pred = nb_class4.predict(tfidf_t_test) 

In [28]:
#Tfidf NB classifier: results
tf_t_score = metrics.accuracy_score(y_t_test, tf_nb_t_pred) 
print('Tfidf NB score:', tf_t_score)

tf_con_mat = metrics.confusion_matrix(y_t_test, tf_nb_t_pred, labels=['FAKE', 'REAL']) 
print('Confusion matrix', '\n', tf_con_mat)

Tfidf NB score: 0.7704545454545455
Confusion matrix 
 [[450 210]
 [ 86 567]]


LINEAR CLASSIFIERS

In [29]:
#Linear count classifier (max_iter replaces n_iter, which newer scikit-learn releases no longer accept)
linear_class3 = PassiveAggressiveClassifier(max_iter=100)
linear_class3.fit(count_t_train, y_t_train)
lin_count_t_pred = linear_class3.predict(count_t_test)



In [30]:
#Linear count classifier results
lin_t_score = metrics.accuracy_score(y_t_test, lin_count_t_pred)
print('Linear classifier score:', lin_t_score)

lin_con_mat = metrics.confusion_matrix(y_t_test, lin_count_t_pred, labels=['FAKE', 'REAL'])
print('Confusion matrix', '\n', lin_con_mat)

Linear classifier score: 0.7416666666666667
Confusion matrix 
 [[455 201]
 [126 524]]


In [31]:
#Linear tfidf classifier (max_iter replaces n_iter, which newer scikit-learn releases no longer accept)
linear_class4 = PassiveAggressiveClassifier(max_iter=200)
linear_class4.fit(tfidf_t_train, y_t_train)
lin_tf_t_pred = linear_class4.predict(tfidf_t_test)



In [32]:
#Linear tfidf classifier results
lin_tf_t_score = metrics.accuracy_score(y_t_test, lin_tf_t_pred)
print('Linear classifier score:', lin_tf_t_score)

lin_con_mat = metrics.confusion_matrix(y_t_test, lin_tf_t_pred, labels=['FAKE', 'REAL'])
print('Confusion matrix', '\n', lin_con_mat)

Linear classifier score: 0.7583333333333333
Confusion matrix 
 [[471 187]
 [122 530]]


SVM CLASSIFIERS

In [33]:
#SVM on count 
svm_class3 = svm.SVC(C=1, kernel='linear', gamma=1)
svm_class3.fit(count_t_train, y_t_train)
svm_count_t_pred = svm_class3.predict(count_t_test)

In [34]:
#SVM on count results
svm_t_score = metrics.accuracy_score(y_t_test, svm_count_t_pred)
print('SVM score:', svm_t_score)

svm_con_mat = metrics.confusion_matrix(y_t_test, svm_count_t_pred, labels=['FAKE', 'REAL'])
print('Confusion matrix', '\n', svm_con_mat)

SVM score: 0.7674242424242425
Confusion matrix 
 [[481 179]
 [121 532]]


In [35]:
#SVM on tfidf 
svm_class4 = svm.SVC(C=1, kernel='linear', gamma=1)
svm_class4.fit(tfidf_t_train, y_t_train)
svm_tfidf_t_pred = svm_class4.predict(tfidf_t_test)

In [36]:
#SVM on tfidf results
svm_tf_t_score = metrics.accuracy_score(y_t_test, svm_tfidf_t_pred)
print('SVM score:', svm_tf_t_score)

svm_con_mat = metrics.confusion_matrix(y_t_test, svm_tfidf_t_pred, labels=['FAKE', 'REAL'])
print('Confusion matrix', '\n', svm_con_mat)

SVM score: 0.7909090909090909
Confusion matrix 
 [[504 156]
 [113 540]]


Next, the results of the text-based and title-based classifiers are compared.

In [37]:
#Comparing scores
print('Text based classifiers')
print('Count NB score:', score)
print('Tfidf NB score:', tf_score)
print('Linear classifier count score:', lin_score)
print('Linear classifier tf score:', lin_tf_score)
print('SVM count score :', svm_score)
print('SVM tf score:', svm_tf_score)

print('\n','Title based classifiers')
print('Count NB score:', score_t)
print('Tfidf NB score:', tf_t_score)
print('Linear classifier count score:', lin_t_score)
print('Linear classifier tf score:', lin_tf_t_score)
print('SVM count score:', svm_t_score)
print('SVM tf score:', svm_tf_t_score)

Text based classifiers
Count NB score: 0.8833333333333333
Tfidf NB score: 0.796969696969697
Linear classifier count score: 0.875
Linear classifier tf score: 0.918939393939394
SVM count score : 0.8204545454545454
SVM tf score: 0.9151515151515152

 Title based classifiers
Count NB score: 0.7840909090909091
Tfidf NB score: 0.7704545454545455
Linear classifier count score: 0.7416666666666667
Linear classifier tf score: 0.7583333333333333
SVM count score: 0.7674242424242425
SVM tf score: 0.7909090909090909


We can see that the count vectorizer works better with the Naive Bayes classifier, while the tfidf vectorizer works better with the linear and SVM classifiers. The gain is clear on text (0.875 to 0.919 for linear, 0.820 to 0.915 for SVM) and small on titles (0.742 to 0.758 and 0.767 to 0.791).

We also observe that title-based classification is less accurate than text-based classification for every classifier. This is probably because titles contain far less text than the articles themselves.

For the final prediction I chose 2 text-based classifiers built on top of the tfidf vectorizer: linear and SVM.
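
The scores printed above are easier to compare side by side. The sketch below puts them into one pandas table and adds the tfidf minus count difference for each classifier; the numbers are copied from the output above, while the table layout and names are my own choice.

```python
import pandas as pd

# Accuracy scores copied from the comparison output above
index = pd.MultiIndex.from_product(
    [['text', 'title'], ['NB', 'linear', 'SVM']],
    names=['input', 'classifier'])
scores = pd.DataFrame({
    'count': [0.8833333333333333, 0.875, 0.8204545454545454,
              0.7840909090909091, 0.7416666666666667, 0.7674242424242425],
    'tfidf': [0.796969696969697, 0.918939393939394, 0.9151515151515152,
              0.7704545454545455, 0.7583333333333333, 0.7909090909090909],
}, index=index)

# Positive values mean tfidf did better than count for that classifier
scores['tfidf_minus_count'] = scores['tfidf'] - scores['count']
print(scores.round(3))

# Input and classifier with the highest tfidf accuracy
best_tfidf = scores['tfidf'].idxmax()
print('Best tfidf model:', best_tfidf)
```

Since every score comes from a single train/test split, small differences (such as those on titles) could easily change with a different `random_state`.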

FINAL PREDICTIONS

In [38]:
#Linear classifier trained on tfidf features (linear_class2), since `final` was built with the tfidf vectorizer
lin_tfidf_pred_test = linear_class2.predict(final)

#SVM 
svm_tfidf_pred_test = svm_class2.predict(final)

In [39]:
#Preparing predictions

#Text based linear classifier on tfidf
lin_df=pd.DataFrame({'News_id':test['ID'], 'prediction':lin_tfidf_pred_test})
print(lin_df.head())
print(lin_df.shape, '\n')

#Text based SVM on tfidf
svm_df=pd.DataFrame({'News_id':test['ID'], 'prediction':svm_tfidf_pred_test})
print(svm_df.head())
print(svm_df.shape)

   News_id prediction
0    10498       FAKE
1     2439       REAL
2      864       REAL
3     4128       REAL
4      662       REAL
(2321, 2) 

   News_id prediction
0    10498       FAKE
1     2439       REAL
2      864       REAL
3     4128       REAL
4      662       REAL
(2321, 2)
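
Since this notebook keeps several vectorizers and classifiers around at once, one cheap safeguard before predicting is to confirm that a feature matrix has as many columns as the classifier was fitted on. The sketch below uses the `n_features_in_` attribute that fitted scikit-learn estimators expose (version 0.24 and later). The helper `matches` and the documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up documents; 'election' occurs in 3 of 4 and is dropped by max_df=0.7
docs = [
    'election fraud claim',
    'election result official',
    'election poll numbers',
    'senate vote tonight',
]
labels = ['FAKE', 'REAL', 'REAL', 'REAL']

count_X = CountVectorizer().fit_transform(docs)
tfidf_X = TfidfVectorizer(max_df=0.7).fit_transform(docs)

# Classifier fitted on the count features only
clf = MultinomialNB().fit(count_X, labels)

def matches(classifier, X):
    """Return True if X has the number of columns the classifier was fitted on."""
    return X.shape[1] == classifier.n_features_in_

print('count features fit the classifier:', matches(clf, count_X))
print('tfidf features fit the classifier:', matches(clf, tfidf_X))
```

This check only catches mismatches in the number of columns. Two vectorizers with the same vocabulary size would pass it, so it is still important to pair each classifier with the vectorizer it was trained on.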


In [40]:
#Saving predictions
lin_df.to_csv('prediction1.csv', index=False)
svm_df.to_csv('prediction.csv', index=False)

Thanks for reading this notebook! Please let me know if you have any questions. 