## PROJECT : Detecting Fake News Using Bag of Words

OBJECTIVE: 

> We aim to train a machine learning model that can accurately classify news articles as fake or real based on their text content using the Bag of Words approach.

DATASET : https://www.kaggle.com/c/fake-news/data#

__________________________________________

In [None]:
# importing the libraries 
import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from IPython.display import display

# The NLTK resources below are needed for text preprocessing (tokenization, stop-word removal, lemmatization); uncomment them on the first run
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

In [None]:
## GETTING THE DATASET :

df = pd.read_csv('/content/drive/MyDrive/UNIV.AI/NLP Intro /Datasets/FAKE NEWS DATASET/train.csv', usecols = ['text', 'label'])
display(df.shape, df.head())

(20800, 2)

| | text | label |
|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | Ever get the feeling your life circles the rou... | 0 |
| 2 | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | Print \nAn Iranian woman has been sentenced to... | 1 |


In [None]:
# Removing all the rows having null values

df = df.dropna()

# Resetting the index so rows are numbered 0..n-1 again (the old index is kept as an 'index' column)
df.reset_index(inplace = True)

display(df.shape, df.head())

(20761, 3)

| | index | text | label |
|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Print \nAn Iranian woman has been sentenced to... | 1 |


In [None]:
# Splitting the predictor and the label:
X = df['text']
y = df['label']

## Cleaning the texts and performing Bag-of-Words

In [None]:
# Data cleaning and preprocessing: keep letters only, lowercase, remove stopwords, and stem.
porter  = PorterStemmer()
stop_words = set(stopwords.words('english'))  # built once instead of once per word
corpus = []

for i in range(len(X)):

  review = re.sub("[^a-zA-Z]", ' ', X[i])
  review = review.lower()
  review = review.split()
  review = [porter.stem(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  
  corpus.append(review)
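As a rough illustration of what the cleaning loop above does to a single text, the sketch below applies the same regex, lowercasing, and stop-word filter using only the standard library. The tiny `STOP_WORDS` set, the `clean_text` helper, and the sample sentence are hypothetical stand-ins chosen for this example; stemming is left out.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses NLTK's much longer English list.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is"}

def clean_text(text):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

sample = "The Mayor's office denied the 3 claims, again!"  # made-up sentence
cleaned = clean_text(sample)
print(cleaned)
```

Note that the apostrophe in "Mayor's" becomes a space, which leaves a stray `s` token. With the default `token_pattern` of `CountVectorizer`, which only matches tokens of two or more characters, such single-letter leftovers are ignored when the features are built.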

In [None]:
# CREATING BAG-OF-WORDS

from sklearn.feature_extraction.text import CountVectorizer # To perform bag of words

# For Bag Of Words 

bow = CountVectorizer(max_features = 7000, ngram_range= (1,3)) # max_features keeps only the 7000 most frequent n-grams (unigrams to trigrams) as features.
X = bow.fit_transform(corpus).toarray()    #  X will be our predictors dataset
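To make the effect of `ngram_range` and `max_features` more concrete, here is a minimal standard-library sketch of the same idea on a toy corpus. The `ngrams` helper and the two toy documents are assumptions made for this example, and the tie-breaking between equally frequent n-grams may differ from what `CountVectorizer` does.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams (space-joined) for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

docs = ["fake news fake news", "real news"]  # toy corpus

# Corpus-wide frequency of every unigram, bigram, and trigram.
totals = Counter(g for doc in docs for g in ngrams(doc.split()))

# Keep only the most frequent n-grams, then sort them alphabetically,
# mimicking the effect of max_features (3 here instead of 7000).
max_features = 3
vocab = sorted(g for g, _ in totals.most_common(max_features))

# One count vector per document, one column per vocabulary entry.
matrix = []
for doc in docs:
    counts = Counter(ngrams(doc.split()))
    matrix.append([counts[term] for term in vocab])

print(vocab)
print(matrix)
```

Each row of `matrix` plays the role of one row of `X`: the document is reduced to counts of the retained n-grams, and word order beyond the n-gram window is discarded.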

In [None]:

# Deleting variables that are no longer needed to free memory:
del corpus, df, review

In [None]:
X.shape


(20761, 7000)

In [None]:
y.shape

(20761,)

In [None]:
bow.get_feature_names_out()[:20]

array(['aaron', 'abandon', 'abbott', 'abc', 'abc news', 'abduct', 'abe',
       'abedin', 'abid', 'abil', 'abl', 'aboard', 'abort', 'abraham',
       'abroad', 'abruptli', 'absenc', 'absent', 'absolut', 'absorb'],
      dtype=object)

In [None]:
bow.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': 7000,
 'min_df': 1,
 'ngram_range': (1, 3),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}

### Splitting the Train and Validation Data

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test , y_train, y_test = train_test_split(X,y, train_size = 0.75)

## MODEL: Multinomial Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

model_nb = MultinomialNB()

model_nb.fit(x_train, y_train)
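For intuition about what a multinomial Naive Bayes classifier estimates from count features, the following is a simplified standard-library sketch, not scikit-learn's implementation. The function names, the toy count matrix, and the toy labels are all assumptions made for this example; `alpha` is the additive smoothing term that keeps unseen features from getting zero probability.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature within class c.
        feature_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(feature_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in feature_totals]
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, x):
    """Return the class with the highest log posterior score for count vector x."""
    classes, log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(cnt * ll for cnt, ll in zip(x, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)

# Toy count matrix with three features; the labels 1 and 0 are arbitrary.
X_toy = [[3, 0, 0], [2, 1, 0], [0, 2, 3], [0, 1, 2]]
y_toy = [1, 1, 0, 0]

toy_model = fit_multinomial_nb(X_toy, y_toy)
pred_a = predict_multinomial_nb(toy_model, [2, 0, 0])
pred_b = predict_multinomial_nb(toy_model, [0, 1, 2])
print(pred_a, pred_b)
```

A document is assigned to the class whose typical word counts best explain its own counts, weighted by how common that class is in the training data.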

In [None]:
# Getting the predictions :
y_pred_nb_test = model_nb.predict(x_test)  # prediction on the test set 
y_pred_nb_train = model_nb.predict(x_train) # prediction on the train set 

In [None]:
from sklearn.metrics import classification_report , accuracy_score

print (f'''
RESULTS :

* TRAIN ACCURACY              : {accuracy_score(y_train, y_pred_nb_train)}
* VALIDATION ACCURACY         : {accuracy_score(y_test, y_pred_nb_test)}

* CLASSIFICATION REPORT (VALIDATION SET ):
{classification_report(y_test, y_pred_nb_test)}
''')


RESULTS :

* TRAIN ACCURACY              : 0.9098908156711625
* VALIDATION ACCURACY         : 0.8996339818917357

* CLASSIFICATION REPORT (VALIDATION SET ):
              precision    recall  f1-score   support

           0       0.88      0.92      0.90      2609
           1       0.92      0.88      0.90      2582

    accuracy                           0.90      5191
   macro avg       0.90      0.90      0.90      5191
weighted avg       0.90      0.90      0.90      5191
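The columns of the classification report can be derived from simple counts of correct and incorrect predictions. The sketch below shows the arithmetic for one class on made-up labels; `binary_report` is a hypothetical helper written for this example, not a scikit-learn function.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 1]
report = binary_report(y_true_toy, y_pred_toy)
print(report)
```

Precision answers "of the articles predicted as class 1, how many really were class 1", while recall answers "of the articles that really were class 1, how many were found".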




## MODEL: Multinomial Naive Bayes with Hyperparameter Tuning

In [None]:
# Start from alpha = 0.1 and keep the best model found in the search below

model_nb_1 = MultinomialNB(alpha = 0.1)

best_score = 0

# Trying alpha = 0.1, 0.2, ..., 0.9 and scoring each model on the validation set
for alpha in np.arange(0.1,1,0.1):
  sub_model = MultinomialNB(alpha = np.round(alpha,2))
  sub_model.fit(x_train,y_train)
  y_pred = sub_model.predict(x_test)
  acc_score = accuracy_score(y_test, y_pred)
  if acc_score > best_score:
    model_nb_1 = sub_model
    best_score = acc_score
  print (f'For Alpha : {np.round(alpha,2)}, Accuracy Score is : {acc_score}')

print (f"We are getting best accuracy as : {np.round(best_score,4)}")

We are getting best accuracy as : 0.901


## MODEL: Passive Aggressive Classifier

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier

# initializing and fitting the model
pac = PassiveAggressiveClassifier()
pac.fit(x_train,y_train)


# Getting the predictions for both the training and validation sets:

y_pred_train_pac = pac.predict(x_train)
y_pred_test_pac  = pac.predict(x_test)
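The name "passive aggressive" describes the update rule: the weights stay unchanged when an example is classified with enough margin, and move just far enough to fix the example otherwise. Below is a minimal standard-library sketch of the PA-I variant for labels in {-1, +1}; scikit-learn documents its default `loss='hinge'` as equivalent to PA-I. The function names and toy data are assumptions for this example, and the intercept term is omitted for brevity.

```python
def pa_fit(X, y, C=1.0, epochs=5):
    """Train a binary passive-aggressive (PA-I) classifier; labels must be -1 or +1."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            loss = max(0.0, 1.0 - margin)        # hinge loss
            norm_sq = sum(xi * xi for xi in x)
            if loss > 0.0 and norm_sq > 0.0:
                tau = min(C, loss / norm_sq)     # aggressive step, capped by C
                w = [wi + tau * label * xi for wi, xi in zip(w, x)]
            # loss == 0: passive step, the weights are left unchanged
    return w

def pa_predict(w, x):
    """Predict +1 or -1 from the sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Toy data: the first feature indicates class +1, the second class -1.
X_toy = [[2.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
y_toy = [1, 1, -1, -1]

w = pa_fit(X_toy, y_toy)
preds = [pa_predict(w, x) for x in X_toy]
print(w, preds)
```

On this toy data the weights stop changing after the first pass, because every example is then classified with a margin of at least 1.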


In [None]:
print (f'''
RESULTS :

* TRAIN ACCURACY              : {accuracy_score(y_train, y_pred_train_pac)}
* VALIDATION ACCURACY         : {accuracy_score(y_test, y_pred_test_pac)}

* CLASSIFICATION REPORT (VALIDATION SET ):
{classification_report(y_test, y_pred_test_pac)}
''')


RESULTS :

* TRAIN ACCURACY              : 0.9998715478484265
* VALIDATION ACCURACY         : 0.9435561548834521

* CLASSIFICATION REPORT (VALIDATION SET ):
              precision    recall  f1-score   support

           0       0.95      0.93      0.94      2609
           1       0.93      0.96      0.94      2582

    accuracy                           0.94      5191
   macro avg       0.94      0.94      0.94      5191
weighted avg       0.94      0.94      0.94      5191




## MODEL: Random Forest


In [None]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators = 200, criterion= 'entropy')

# Fitting the model on the training data:
model_rf.fit(x_train,y_train)

# Getting the prediction for train and validation set :

y_pred_train_rf = model_rf.predict(x_train)
y_pred_test_rf  = model_rf.predict(x_test)
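Setting `criterion='entropy'` means each tree in the forest chooses its splits by information gain, that is, by how much a split reduces label entropy. The sketch below computes both quantities on toy labels; the helper names and the example splits are assumptions for illustration, not part of scikit-learn.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    weighted = (len(left) / len(parent)) * entropy(left) \
             + (len(right) / len(parent)) * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]

# A split that separates the classes perfectly.
gain_pure = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])

# A split that leaves both children as mixed as the parent.
gain_useless = information_gain(parent, [1, 1, 0, 0], [1, 1, 0, 0])

print(gain_pure, gain_useless)
```

A tree prefers the first kind of split; in the notebook, a candidate split corresponds to thresholding the count of one n-gram.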


In [None]:
print (f'''
RESULTS :

* TRAIN ACCURACY              : {accuracy_score(y_train, y_pred_train_rf)}
* VALIDATION ACCURACY         : {accuracy_score(y_test, y_pred_test_rf)}

* CLASSIFICATION REPORT (VALIDATION SET ):
{classification_report(y_test, y_pred_test_rf)}
''')


RESULTS :

* TRAIN ACCURACY              : 0.9999357739242132
* VALIDATION ACCURACY         : 0.9497206703910615

* CLASSIFICATION REPORT (VALIDATION SET ):
              precision    recall  f1-score   support

           0       0.96      0.94      0.95      2609
           1       0.94      0.96      0.95      2582

    accuracy                           0.95      5191
   macro avg       0.95      0.95      0.95      5191
weighted avg       0.95      0.95      0.95      5191




## MODEL PERFORMANCE :

> In conclusion, the Random Forest model gave the highest validation accuracy (about 95%) on the Bag-of-Words features, ahead of the Passive Aggressive Classifier (about 94%) and Multinomial Naive Bayes (about 90%). Both Random Forest and the Passive Aggressive Classifier reach nearly 100% training accuracy, so part of their fit is specific to the training set, but they still stay ahead of Naive Bayes on the validation data. Bag-of-Words features are therefore a workable basis for fake news classification on this dataset.



RANDOM FOREST > PASSIVE AGGRESSIVE > NAIVE BAYES