# Developing a machine learning program to identify when an article might be fake news. Run by the UTK Machine Learning Club.

### Source: https://www.kaggle.com/c/fake-news/overview

##### Importing libraries and loading the dataset. Both the train and test data sets will be cleaned before creating the model

In [8]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


In [9]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


In [10]:
downloaded = drive.CreateFile({'id':'1kzYS0Mh6HvFvtEmO0LTTnrcv1073NBbP'}) 
downloaded.GetContentFile('fakenewstrain.csv') 

In [11]:
import pandas as pd
df=pd.read_csv('fakenewstrain.csv')
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [14]:
#Observation: the label column marks fake news: 1 means the article is fake, 0 means it is not.

### Exploring the dataset and drawing a few insights

In [15]:
df.shape

(20800, 5)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [17]:
#Observation: the dataset has 20800 observations and 5 attributes, including the dependent variable (label).
#id and label are int64; title, author and text are object (string) columns. This is a text classification (NLP) problem.

In [18]:
df.text.head(5)

0    House Dem Aide: We Didn’t Even See Comey’s Let...
1    Ever get the feeling your life circles the rou...
2    Why the Truth Might Get You Fired October 29, ...
3    Videos 15 Civilians Killed In Single US Airstr...
4    Print \nAn Iranian woman has been sentenced to...
Name: text, dtype: object

In [19]:
df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

Observation: The text column carries most of the information needed to predict whether an article is fake, while id, title and author do not give sufficient evidence on their own, so we drop those three columns. Only 39 rows have a null text value, so we drop those rows rather than try to fill them in.


In [145]:
df.drop(['title','author','id'], axis=1, inplace=True)

In [146]:
df=df.dropna(axis=0)

In [23]:
df.reset_index(inplace=True)
df.head()

Unnamed: 0,index,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,Ever get the feeling your life circles the rou...,0
2,2,"Why the Truth Might Get You Fired October 29, ...",1
3,3,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Print \nAn Iranian woman has been sentenced to...,1


In [149]:
df.drop(['index'], axis=1, inplace=True)
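
The `reset_index` and `drop(['index'])` steps above can probably be collapsed into one call. The following is a minimal sketch on a made-up three-row frame (the `toy` data and the `clean` name are assumptions for illustration, not the competition data): `reset_index(drop=True)` renumbers the rows without keeping the old index as a column.

```python
import pandas as pd

# Toy frame standing in for the training data (values are illustrative only).
toy = pd.DataFrame({
    "text": ["first article", None, "third article"],
    "label": [1, 0, 1],
})

# Drop rows with a missing text, then renumber from 0 without keeping the old index.
clean = toy.dropna(subset=["text"]).reset_index(drop=True)

print(clean.index.tolist())
print(clean.columns.tolist())
```

A contiguous index matters in this notebook because the preprocessing loop looks rows up by label with `df['text'][i]`.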

In [26]:
df.shape

(20761, 2)

In [150]:
#testdf is the competition test set; the cell that loads and cleans it is not shown here
testdf.shape

(5193, 1)

In [138]:
testdf.tail()

Unnamed: 0,text
5195,Of all the dysfunctions that plague the world’...
5196,WASHINGTON — Gov. John Kasich of Ohio on Tu...
5197,Good morning. (Want to get California Today by...
5198,« Previous - Next » 300 US Marines To Be Deplo...
5199,Perhaps you’ve seen the new TV series whose pi...


### Preprocessing dataset

In [28]:
import nltk
#nltk.download('all')
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer,PorterStemmer
import re
lemmet=WordNetLemmatizer()
stemmer=PorterStemmer()

In [29]:
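stop_words=set(stopwords.words('english'))  #build the stop-word set once instead of once per word
corpus=[]
for i in range(df.shape[0]):
    text=re.sub('[^a-zA-Z]',' ', df['text'][i])  #replace everything except letters with a space
    text=text.lower()
    text=text.split()
    text=[lemmet.lemmatize(word) for word in text if word not in stop_words]  #drop stop words, lemmatize the rest
    text=' '.join(text)
    corpus.append(text)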
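
The cleaning loop above can also be written as a small reusable function, which makes it easier to apply the same steps to the test set. This is a simplified sketch, not the notebook's exact pipeline: `clean_text` and `STOP_WORDS` are hypothetical names, the stop-word set is a tiny hand-written stand-in for the NLTK list, and lemmatization is left out so the example runs without any NLTK downloads.

```python
import re

# Tiny stand-in for stopwords.words('english'); an assumption for illustration only.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is", "and"}

def clean_text(raw: str) -> str:
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw)
    tokens = letters_only.lower().split()
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(kept)

print(clean_text("The Truth, in 2016!"))
```

With a function like this, the corpus could be built with something like `df['text'].apply(clean_text)`, and the same call would work on the test frame.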

In [34]:
#Weighting the words by vectorizing the corpus, once with raw counts and once with TF-IDF

In [110]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
cv=CountVectorizer(max_features=10000)
tf=TfidfVectorizer(max_features=10000)
cv_x=cv.fit_transform(corpus).toarray()
tf_x=tf.fit_transform(corpus).toarray()
y=df['label']
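
To make the difference between the two vectorizers concrete, here is a hand-rolled sketch on a three-document toy corpus. It assumes the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization; the corpus and the helper names `count_vector` and `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (illustrative only).
docs = ["fake news spreads fast", "real news travels slowly", "fake claims spread"]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

def count_vector(tokens):
    """Raw term counts, in vocabulary order (what a count vectorizer stores)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def tfidf_vector(tokens):
    """Counts scaled by smoothed idf, then L2-normalized."""
    raw = [
        c * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for c, term in zip(count_vector(tokens), vocab)
    ]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

first = tfidf_vector(tokenized[0])
print(dict(zip(vocab, count_vector(tokenized[0]))))
print({term: round(w, 3) for term, w in zip(vocab, first)})
```

In the first document every word has a count of 1, but "spreads" (found in one document) ends up with a larger TF-IDF weight than "news" (found in two), which is the down-weighting of common words that TF-IDF is meant to provide.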

In [78]:
#Model Evaluation

In [119]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, recall_score
X_train,X_test,y_train,y_test=train_test_split(cv_x,y,test_size=.33, random_state=1)
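
As a quick sanity check on the split, the sizes can be worked out by hand. This sketch assumes scikit-learn rounds the test fraction up and gives the remainder to training, which is how `train_test_split` behaves when only `test_size` is passed.

```python
import math

n_rows = 20761      # rows left after dropping null texts (df.shape above)
test_size = 0.33    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_rows)   # test-set size is rounded up
n_train = n_rows - n_test                # the remainder goes to training

print(n_test, n_train)
```

The test size computed this way is 6852, which matches the total support reported by `classification_report` further down.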

In [120]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
bnb=BernoulliNB()
mnb=MultinomialNB()
bnb.fit(X_train,y_train)
print("BernoulliNB accuracy:", accuracy_score(y_test, bnb.predict(X_test)))
mnb.fit(X_train,y_train)
print("MultinomialNB accuracy:", accuracy_score(y_test, mnb.predict(X_test)))

BernoulliNB accuracy: 0.7279626386456509
MultinomialNB accuracy: 0.895942790426153


In [121]:
print('matrix:', confusion_matrix(y_test, mnb.predict(X_test)))

matrix: [[3177  267]
 [ 446 2962]]


In [122]:
print('report:', classification_report(y_test, mnb.predict(X_test)))

report:               precision    recall  f1-score   support

           0       0.88      0.92      0.90      3444
           1       0.92      0.87      0.89      3408

    accuracy                           0.90      6852
   macro avg       0.90      0.90      0.90      6852
weighted avg       0.90      0.90      0.90      6852



In [115]:
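#MultinomialNB is clearly more accurate than BernoulliNB here (0.896 vs 0.728).
#False positives (267 real articles tagged as fake) are fewer than false negatives (446 fake articles missed).
#That is the preferable direction: tagging a valid article as fake does more harm.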
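
The error counts discussed above can be read straight off the confusion matrix. The sketch below recomputes the headline metrics from the printed MultinomialNB matrix, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, so the matrix is `[[TN, FP], [FN, TP]]` with 1 = fake as the positive class).

```python
# Confusion matrix printed above for MultinomialNB on count features.
matrix = [[3177, 267],
          [446, 2962]]

tn, fp = matrix[0]   # true label 0: correctly kept, wrongly tagged as fake
fn, tp = matrix[1]   # true label 1: missed fakes, correctly caught fakes

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_fake = tp / (tp + fp)   # of the articles tagged fake, how many were fake
recall_fake = tp / (tp + fn)      # of the fake articles, how many were caught

print(round(accuracy, 4), round(precision_fake, 2), round(recall_fake, 2))
```

These values agree with the accuracy printed earlier and with the class-1 row of the classification report (precision 0.92, recall 0.87).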

In [123]:
X_train_tf,X_test_tf,y_train_tf,y_test_tf=train_test_split(tf_x,y,test_size=.33, random_state=1)

In [124]:
bnb_tf=BernoulliNB()
mnb_tf=MultinomialNB()
bnb_tf.fit(X_train_tf,y_train_tf)
print("BernoulliNB accuracy:", accuracy_score(y_test_tf, bnb_tf.predict(X_test_tf)))
mnb_tf.fit(X_train_tf,y_train_tf)
print("MultinomialNB accuracy:", accuracy_score(y_test_tf, mnb_tf.predict(X_test_tf)))

BernoulliNB accuracy: 0.7279626386456509
MultinomialNB accuracy: 0.8992994746059545
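
BernoulliNB scores exactly the same on TF-IDF features as on count features. A likely explanation is that `BernoulliNB` binarizes its input (its `binarize` parameter defaults to 0.0), so any positive count and any positive TF-IDF weight both become 1. Assuming both vectorizers picked the same vocabulary, the model then sees identical presence/absence data. A toy sketch of that effect, with made-up numbers and a hypothetical `binarize` helper:

```python
# Two toy feature matrices for the same documents (values are illustrative only).
counts = [[2, 0, 1],
          [0, 3, 0]]
tfidf_weights = [[0.89, 0.0, 0.45],
                 [0.0, 1.0, 0.0]]

def binarize(matrix, threshold=0.0):
    """Map every value above the threshold to 1 and everything else to 0."""
    return [[1 if value > threshold else 0 for value in row] for row in matrix]

print(binarize(counts))
print(binarize(tfidf_weights))
```

Both calls print the same 0/1 matrix, so a model that only looks at word presence cannot benefit from the TF-IDF weighting.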


In [125]:
print('matrix:', confusion_matrix(y_test_tf, mnb_tf.predict(X_test_tf)))

matrix: [[3197  247]
 [ 443 2965]]


In [106]:
print('report:',classification_report(y_test_tf, mnb_tf.predict(X_test_tf)))

report:               precision    recall  f1-score   support

           0       0.87      0.91      0.89      3444
           1       0.91      0.86      0.89      3408

    accuracy                           0.89      6852
   macro avg       0.89      0.89      0.89      6852
weighted avg       0.89      0.89      0.89      6852



### With TF-IDF, false positives (247) are again fewer than false negatives (443), and both are slightly lower than with CountVectorizer (267 and 446). Accuracy is nearly the same (0.899 vs 0.896), so we rely on this combination: MultinomialNB with TF-IDF is used to predict the test set.
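
One way the final prediction step could look is sketched below. It is an illustration under stated assumptions, not the notebook's own code: the training and test texts are made-up toy strings, and the vectorizer and classifier are wrapped in a scikit-learn pipeline so that the TF-IDF vocabulary is learned from the training texts only (the cells above fit the vectorizer on the full corpus before splitting).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only): 1 = fake, 0 = not fake.
train_texts = [
    "shocking secret cure doctors hate",
    "shocking miracle secret exposed",
    "senate passes budget bill",
    "senate committee budget hearing",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits TF-IDF and the classifier together on the training texts.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Unseen texts go through the same vectorizer automatically.
test_texts = ["shocking secret miracle", "senate budget vote"]
predictions = model.predict(test_texts)
print(predictions.tolist())
```

On the real data, the cleaned training corpus and the cleaned `testdf` text would take the place of the toy lists.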