**Dataset:** [Kaggle Machine Learning Fake News Dataset](https://www.kaggle.com/c/fake-news/data)

## Import Libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
import re

In [3]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [6]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier

In [7]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [8]:
import warnings
warnings.filterwarnings("ignore")

## Load Dataset

In [9]:
data = pd.read_csv("data.csv")
data.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


`label`: marks whether the article is potentially unreliable
- **1**: unreliable
- **0**: reliable

In [10]:
data.shape

(20800, 5)

## Data Cleaning

In [11]:
data[data.duplicated()]

| | id | title | author | text | label |
|---|---|---|---|---|---|


In [12]:
data.isnull().sum()/len(data)

id        0.000000
title     0.026827
author    0.094087
text      0.001875
label     0.000000
dtype: float64

In [13]:
data = data[ data['title'].notna() ]

In [14]:
data.shape

(20242, 5)

In [15]:
data.isnull().sum()/len(data)

id        0.000000
title     0.000000
author    0.096680
text      0.001927
label     0.000000
dtype: float64
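
The row counts and the missing-value fractions above can be cross-checked with a little arithmetic. This is only a sanity check on the figures already printed by `data.shape` and `data.isnull().sum()/len(data)`; the variable names are illustrative.

```python
# Row counts reported by data.shape before and after dropping rows with no title
rows_before = 20800
rows_after = 20242

# Number of rows removed by the notna() filter on 'title'
dropped = rows_before - rows_after
print(dropped)

# Fraction of the original rows that had a missing title
missing_title_fraction = round(dropped / rows_before, 6)
print(missing_title_fraction)
```

The fraction agrees with the `title` entry in the first missing-value summary.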

In [16]:
data.reset_index(drop=True, inplace=True)

## Data Preprocessing

In [17]:
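# Requires the NLTK stopwords corpus: nltk.download('stopwords')
ps   = PorterStemmer()
stop = set(stopwords.words('english'))

def clean_title(title):
    # Keep letters only, lowercase, drop stopwords, then stem each remaining word
    words = re.sub('[^a-zA-Z]', ' ', title).lower().split()
    return ' '.join(ps.stem(word) for word in words if word not in stop)

# Assign the whole column at once instead of chained indexing (data['title'][i] = ...)
data['title'] = data['title'].apply(clean_title)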

In [18]:
data.loc[:,'title']

0        hous dem aid even see comey letter jason chaff...
1          flynn hillari clinton big woman campu breitbart
2                                     truth might get fire
3                 civilian kill singl us airstrik identifi
4        iranian woman jail fiction unpublish stori wom...
                               ...                        
20237            rapper trump poster child white supremaci
20238      n f l playoff schedul matchup odd new york time
20239    maci said receiv takeov approach hudson bay ne...
20240             nato russia hold parallel exercis balkan
20241                                          keep f aliv
Name: title, Length: 20242, dtype: object
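
Entries such as `n f l` in the output above come from the regex step, which replaces every non-letter (including periods and digits) with a space before splitting. A minimal sketch of just that step follows, with stopword removal and stemming left out; the helper name `strip_and_split` and the input string are hypothetical and not taken from the dataset.

```python
import re

def strip_and_split(title):
    # Replace every non-letter with a space, lowercase, then split on whitespace
    return re.sub('[^a-zA-Z]', ' ', title).lower().split()

# Hypothetical input, chosen to show how abbreviations and digits are handled
tokens = strip_and_split("N.F.L. Playoffs: 2017 Schedule")
print(tokens)
# ['n', 'f', 'l', 'playoffs', 'schedule']
```

Abbreviations are split into single letters and numbers disappear entirely, which may or may not be desirable for headline classification.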

## Splitting and Vectorizing data

In [19]:
X = data.loc[:,'title']
y = data['label']

In [20]:
cv = CountVectorizer( max_features=5000 ,ngram_range=(1,3))
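
With `ngram_range=(1,3)` the vectorizer counts single words, word pairs, and word triples. The sketch below illustrates which n-grams that produces for one stemmed title from the output above. It is a plain-Python approximation, not scikit-learn's implementation: the helper `word_ngrams` is hypothetical, it splits on whitespace only, and it ignores that `CountVectorizer` by default drops single-character tokens and keeps only the `max_features` most frequent terms across the corpus.

```python
def word_ngrams(text, min_n=1, max_n=3):
    # Build all contiguous word sequences of length min_n..max_n
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

grams = word_ngrams("truth might get fire")
for gram in grams:
    print(gram)
```

A four-word title yields four unigrams, three bigrams, and two trigrams, so the feature space grows quickly, which is one reason to cap it with `max_features`.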

In [21]:
tfidf = TfidfVectorizer( max_features=4000 ,ngram_range=(1,3))
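
As a rough illustration of what TF-IDF weighting does, the sketch below computes weights for unigrams using the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the documented defaults of `TfidfVectorizer`. The function `tfidf_rows` and the two toy documents are assumptions made for this example, and n-grams and `max_features` are ignored.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize each document vector
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Toy documents (hypothetical), already in stemmed form
docs = ["truth might get fire", "keep truth aliv"]
rows = tfidf_rows(docs)
print(rows[0])
```

In the first document `truth` receives a lower weight than `might` because it also appears in the second document.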

## Model

In [22]:
def prepare_Model(X,y,vectorizer,model):
    
    X_train ,X_test ,y_train ,y_test = train_test_split( 
                                                          X , 
                                                          y ,
                                                          test_size=0.2 ,
                                                          random_state=0
                                                        )
    
    X_train = vectorizer.fit_transform(X_train).toarray()
    X_test  = vectorizer.transform(X_test).toarray()
    
    # Take the name from the vectorizer's class instead of comparing against the global `cv`
    vec = type(vectorizer).__name__
    
    print("\n\n")
    print("*"*50)
    
    print("\nModel      : ",model)
    print("\nVectorizer : ",vec)
    
#     param = {
#               'alpha':np.arange(0,1,0.1)
#             }
    
#     model = GridSearchCV(
#                           model ,
#                           param ,
#                           cv=2
#                         )
    
    model  = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
#     print("Best Parameter : ",model.best_params_)
    
    acc = accuracy_score(y_test ,y_pred)
    print("\nAccuracy   : ",acc)
    

    cm = confusion_matrix(y_test,y_pred)
    print("\nConfusion Matrix , \n",cm)
    
    print("\nClassification Report ,\n" ,classification_report(y_test,y_pred),"\n")
    
    print("*"*50)
    print("\n\n")

**1. MultinomialNB**

In [23]:
nb = MultinomialNB()
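
For intuition, here is a small plain-Python sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, the default of `MultinomialNB`). It is not scikit-learn's implementation; the helpers `fit_nb` and `predict_nb` and the toy headlines are hypothetical and serve only to show how word counts per class turn into a prediction.

```python
import math
from collections import Counter, defaultdict

def fit_nb(docs, labels, alpha=1.0):
    vocab = {term for doc in docs for term in doc.split()}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())
    model = {}
    for label in class_counts:
        # Smoothed denominator: total words in the class plus alpha per vocabulary term
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[label] / len(labels))
        log_lik = {t: math.log((term_counts[label][t] + alpha) / total) for t in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    def score(label):
        log_prior, log_lik = model[label]
        # Terms outside the training vocabulary are skipped
        return log_prior + sum(log_lik[t] for t in doc.split() if t in log_lik)
    return max(model, key=score)

# Toy stemmed headlines (hypothetical): 1 = unreliable, 0 = reliable
docs = ["shock secret reveal", "secret plot expos", "senat vote budget", "budget talk senat"]
labels = [1, 1, 0, 0]
model = fit_nb(docs, labels)
print(predict_nb(model, "secret budget plot"))
print(predict_nb(model, "senat vote"))
```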

In [24]:
prepare_Model(X,y,cv,nb)




**************************************************

Model      :  MultinomialNB()

Vectorizer :  CountVectorizer

Accuracy   :  0.8970116078043961

Confusion Matrix , 
 [[1851  216]
 [ 201 1781]]

Classification Report ,
               precision    recall  f1-score   support

           0       0.90      0.90      0.90      2067
           1       0.89      0.90      0.90      1982

    accuracy                           0.90      4049
   macro avg       0.90      0.90      0.90      4049
weighted avg       0.90      0.90      0.90      4049
 

**************************************************
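
The figures in the classification report can be derived directly from the confusion matrix. The sketch below does so for the MultinomialNB and CountVectorizer run above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper `precision_recall_f1` is illustrative.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
cm = [[1851, 216],
      [201, 1781]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision_recall_f1(cm, k):
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print("accuracy:", accuracy)
for k in (0, 1):
    p, r, f1 = precision_recall_f1(cm, k)
    print(k, round(p, 2), round(r, 2), round(f1, 2))
```

The same calculation applies to the other three confusion matrices in this notebook.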





In [25]:
prepare_Model(X,y,tfidf,nb)




**************************************************

Model      :  MultinomialNB()

Vectorizer :  TfidfVectorizer

Accuracy   :  0.8794764139293653

Confusion Matrix , 
 [[1912  155]
 [ 333 1649]]

Classification Report ,
               precision    recall  f1-score   support

           0       0.85      0.93      0.89      2067
           1       0.91      0.83      0.87      1982

    accuracy                           0.88      4049
   macro avg       0.88      0.88      0.88      4049
weighted avg       0.88      0.88      0.88      4049
 

**************************************************





**2. Passive Aggressive Classifier**

In [26]:
pa = PassiveAggressiveClassifier()
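
The name "passive aggressive" describes the update rule: the model leaves its weights alone when an example is classified with enough margin, and otherwise moves them just far enough to correct the mistake. Below is a minimal sketch of one such step in the PA-I form, which corresponds to the classifier's default hinge loss with `C=1.0`. It is a simplification under stated assumptions: labels are taken as -1/+1, there is no intercept, and the function `pa_update` is hypothetical rather than part of scikit-learn.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step for a feature vector x and a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)            # hinge loss
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)                       # passive: margin already satisfied
    tau = min(C, loss / sq_norm)             # aggressive: step size capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

x = [1.0, 0.0, 1.0]
w0 = [0.0, 0.0, 0.0]
w1 = pa_update(w0, x, +1)   # misclassified with zero margin, so weights move
w2 = pa_update(w1, x, +1)   # margin is now 1, so weights stay put
print(w1)
print(w2)
```

The cap `C` limits how far a single noisy example can pull the weights.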

In [27]:
prepare_Model(X,y,cv,pa)




**************************************************

Model      :  PassiveAggressiveClassifier()

Vectorizer :  CountVectorizer

Accuracy   :  0.9192393183502099

Confusion Matrix , 
 [[1905  162]
 [ 165 1817]]

Classification Report ,
               precision    recall  f1-score   support

           0       0.92      0.92      0.92      2067
           1       0.92      0.92      0.92      1982

    accuracy                           0.92      4049
   macro avg       0.92      0.92      0.92      4049
weighted avg       0.92      0.92      0.92      4049
 

**************************************************





In [28]:
prepare_Model(X,y,tfidf,pa)




**************************************************

Model      :  PassiveAggressiveClassifier()

Vectorizer :  TfidfVectorizer

Accuracy   :  0.9189923437885897

Confusion Matrix , 
 [[1860  207]
 [ 121 1861]]

Classification Report ,
               precision    recall  f1-score   support

           0       0.94      0.90      0.92      2067
           1       0.90      0.94      0.92      1982

    accuracy                           0.92      4049
   macro avg       0.92      0.92      0.92      4049
weighted avg       0.92      0.92      0.92      4049
 

**************************************************



