# **Fake news Classification**



- Fake news refers to misinformation or disinformation that spreads through word of mouth and, more recently, through digital channels such as WhatsApp messages and social media posts.

- Fake news can spread faster than real news and creates problems and fear among groups and in society.

- We address this problem with classical NLP techniques by classifying whether a given message or text is **Real or Fake**.

- You will use a bag of n-grams to represent the text and apply different classification algorithms.

- Sklearn's `CountVectorizer` has a built-in implementation of bag of words and, through its `ngram_range` parameter, bag of n-grams.
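
To make the bag of n-grams idea concrete, here is a minimal sketch on a toy corpus. The two sentences are hypothetical and are not taken from the dataset; `ngram_range=(1, 2)` is chosen only to keep the vocabulary small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical sentences, not from the dataset)
corpus = [
    "fake news spreads fast",
    "real news spreads slowly",
]

# ngram_range=(1, 2) keeps every unigram and every bigram
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each n-gram to its column index in the count matrix
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each row of the matrix is one document and each column counts one n-gram, which is the representation the classifiers in this notebook receive.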


## **About Data: Fake News Detection**

Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset


- The raw files contain the columns `title`, `text`, `subject`, and `date`. This notebook works with two columns:
  - text
  - Fake (the label, added below)
- The text column holds statements or messages regarding a particular event or situation.

- The label tells whether the given text is fake (1) or real (0).

- As there are only 2 classes, this problem is a **binary classification** problem.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp -r "/content/drive/MyDrive/NLP Journey/Data/" "/content/"

In [3]:
!unzip Data/Fake.csv.zip
!unzip Data/True.csv.zip


Archive:  Data/Fake.csv.zip
  inflating: Fake.csv                
Archive:  Data/True.csv.zip
  inflating: True.csv                


In [69]:
#import pandas library
import pandas as pd

#read the fake-news file "Fake.csv" and store it in df_1
df_1 = pd.read_csv("Fake.csv")

#label every row of this file as fake (1)
df_1['Fake']=1
df_1.head()
df_1.shape

(23481, 5)

In [70]:
#read the real-news file "True.csv" and label every row as real (0)
df_2  =pd.read_csv('True.csv')
df_2['Fake']=0
df_2.head()
df_2.shape

(21417, 5)

In [71]:
# Keep only 5,000 rows from each class because of limited computing power
df_1 = df_1.sample(5000)
df_2 = df_2.sample(5000)

In [72]:
#combine both classes into one dataframe and inspect a random sample
res = pd.concat([df_1, df_2],axis=0)
res.sample(5)

| | title | text | subject | date | Fake |
|---|---|---|---|---|---|
| 7812 | Comedy Icon Dick Van Dyke’s First Endorsement... | Dick Van Dyke is 90 years old. The man was bor... | News | February 28, 2016 | 1 |
| 10760 | Obama digs into research on potential Supreme ... | WASHINGTON (Reuters) - President Barack Obama ... | politicsNews | February 19, 2016 | 0 |
| 5133 | Even Fox News Is Slamming Trump’s ‘Dangerous’... | Donald Trump is so far-out and detached from r... | News | August 8, 2016 | 1 |
| 14743 | Lebanon wants good ties with Saudi Arabia: for... | BEIRUT (Reuters) - Lebanese Foreign Minister G... | worldnews | November 15, 2017 | 0 |
| 6389 | Trump news conference sets worldwide social me... | NEW YORK (Reuters) - In his first news confere... | politicsNews | January 12, 2017 | 0 |


In [73]:
res.shape

(10000, 5)

In [74]:
simplified_df = res.drop(['title','subject','date'],axis=1)

In [75]:
simplified_df

| | text | Fake |
|---|---|---|
| 17587 | Former Baywatch star Pamela Anderson tried t... | 1 |
| 4905 | Donald Trump is taking a lot of fire over the ... | 1 |
| 9225 | Sarah Huckabee Sanders was hot today when she ... | 1 |
| 8105 | There s crazy and then there s nonsensical, co... | 1 |
| 21556 | What would a speech from a modern Democrat be ... | 1 |
| ... | ... | ... |
| 3579 | LONDON (Reuters) - The most senior U.S. diplom... | 0 |
| 14428 | WARSAW (Reuters) - European Council President ... | 0 |
| 18197 | LONDON (Reuters) - Britain s environment minis... | 0 |
| 9004 | (Reuters) - A blackout of television cameras i... | 0 |


In [76]:
simplified_df.Fake.value_counts()

1    5000
0    5000
Name: Fake, dtype: int64

In [78]:
#separate the input texts (features) from the target labels
features = simplified_df['text']
labels = simplified_df['Fake']


# **Modelling without Pre-processing Text data**

In [79]:
#import train_test_split from sklearn
from sklearn.model_selection import train_test_split as tts

#do the train-test split with a test size of 20%, random_state=42, and stratified sampling on the labels
X_train, X_test, y_train, y_test = tts(features, labels, test_size =0.2, random_state=42,stratify= labels)



In [80]:
#print the shapes of X_train and X_test

print(X_train.shape  , X_test.shape)


(8000,) (2000,)


In [81]:
y_train.value_counts()

0    4000
1    4000
Name: Fake, dtype: int64
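
The balanced counts above follow from `stratify=labels`. As a small sketch of what stratification does when the classes are not balanced, the example below uses hypothetical toy labels (80 zeros, 20 ones) rather than the news data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0 and 20 of class 1
toy_labels = pd.Series([0] * 80 + [1] * 20)
toy_texts = pd.Series([f"doc {i}" for i in range(100)])

# stratify keeps the 80/20 class ratio in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Without `stratify`, a random 20% test split could end up with noticeably more or fewer minority-class rows, which makes per-class metrics harder to compare across runs.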

## Useful functions for all models

In [82]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score

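def get_confusion_values(y_true, y_pred):
  # return the 2x2 confusion matrix flattened as (tn, fp, fn, tp);
  # use the y_true argument rather than the global y_test
  tn,fp,fn,tp = confusion_matrix(y_true, y_pred).ravel()
  return (tn,fp,fn,tp)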


def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.
  Args:
      y_true: true labels in the form of a 1D array
      y_pred: predicted labels in the form of a 1D array
  Returns a dictionary of accuracy, precision, recall, f1-score.
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using the "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results
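
As a quick sanity check of what these helpers report, the sketch below runs the same sklearn calls on hypothetical toy labels where the counts are easy to verify by hand. The label lists are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Hypothetical toy labels: 4 real (0) and 4 fake (1) samples
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() flattens the matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(tn, fp, fn, tp)

# Accuracy is the share of correct predictions: (tn + tp) / total
accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(accuracy)

# "weighted" averages the per-class scores, weighting each class by its support
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, average="weighted"
)
print(precision, recall, f1)
```

Note that `calculate_results` reports accuracy as a percentage while precision, recall, and F1 stay in the 0 to 1 range, so the columns of the comparison tables are on different scales.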


## Model 1 : KNN EUCLIDEAN

**Attempt 1** :

1. using sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with unigram, bigram, and trigrams.
- use KNN as the classifier with n_neighbors of 10 and metric as 'euclidean' distance.
- print the classification report.


In [83]:
#1. create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

model_1 = Pipeline([
    ('Praba_Vec', CountVectorizer(ngram_range=(1,3))),
    ('Praba_clf', KNeighborsClassifier(n_neighbors=10, metric ='euclidean'))
])
#2. fit with X_train and y_train

model_1.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred

y_pred = model_1.predict(X_test)

#4. build the classification report

report_1 = classification_report(y_test, y_pred)

In [84]:
print('KNN Classifier :  No preprocess on Input data : Count Vectoriser ngram (1,3) \n \n ')
print(report_1)


KNN Classifier :  No preprocess on Input data : Count Vectoriser ngram (1,3) 
 
 
              precision    recall  f1-score   support

           0       0.66      0.72      0.69      1000
           1       0.69      0.64      0.66      1000

    accuracy                           0.68      2000
   macro avg       0.68      0.68      0.68      2000
weighted avg       0.68      0.68      0.68      2000



In [85]:
model_1_conf_values = get_confusion_values(y_test, y_pred)
model_1_conf_values

(716, 284, 361, 639)

In [86]:
model_1_results = calculate_results(y_test, y_pred)
model_1_results

{'accuracy': 67.75,
 'precision': 0.6785586743804014,
 'recall': 0.6775,
 'f1': 0.6770212647697049}

## Model 2 : KNN COSINE

**Attempt 2** :

1. using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with unigram, bigram, and trigrams.
- use **KNN** as the classifier with n_neighbors of 10 and metric as 'cosine' distance.
- print the classification report.
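
To illustrate how the two distance metrics differ on count vectors, here is a small sketch with hypothetical three-term count vectors. It shows that Euclidean distance is sensitive to document length while cosine distance depends only on the direction of the vector.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical count vectors over a three-term vocabulary
short_doc = np.array([[1, 2, 0]])
long_doc = short_doc * 10          # same word proportions, ten times longer
other_doc = np.array([[0, 1, 2]])  # similar length, different word mix

euc_long = euclidean_distances(short_doc, long_doc)[0, 0]
cos_long = cosine_distances(short_doc, long_doc)[0, 0]
euc_other = euclidean_distances(short_doc, other_doc)[0, 0]
cos_other = cosine_distances(short_doc, other_doc)[0, 0]

print("short vs long :", euc_long, cos_long)
print("short vs other:", euc_other, cos_other)
```

Under Euclidean distance the longer copy of the same text looks farther away than a text with a different word mix; under cosine distance it is the nearest possible neighbour. News articles vary a lot in length, so the two metrics can pick very different neighbours.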


In [87]:


#1. create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

model_2 = Pipeline([
    ('Praba_Vec', CountVectorizer(ngram_range=(1,3))),
    ('Praba_clf', KNeighborsClassifier(n_neighbors=10, metric ='cosine'))
])
#2. fit with X_train and y_train

model_2.fit(X_train,y_train)

#3. get the predictions for X_test and store them in y_preds_2

y_preds_2 = model_2.predict(X_test)

#4. build the classification report
report_2 = classification_report(y_test, y_preds_2)


In [88]:
print('KNN Classifier :  No preprocess on Input data : Count Vectoriser ngram (1,3) : COSINE distance for cluster \n \n ')
print(report_2)

KNN Classifier :  No preprocess on Input data : Count Vectoriser ngram (1,3) : COSINE distance for cluster 
 
 
              precision    recall  f1-score   support

           0       0.88      0.34      0.49      1000
           1       0.59      0.95      0.73      1000

    accuracy                           0.64      2000
   macro avg       0.73      0.64      0.61      2000
weighted avg       0.73      0.64      0.61      2000



In [89]:
model_2_conf_values = get_confusion_values(y_test, y_preds_2)
model_2_conf_values

(336, 664, 47, 953)

In [90]:
model_2_results = calculate_results(y_test, y_preds_2)
model_2_results

{'accuracy': 64.45,
 'precision': 0.7333238066173537,
 'recall': 0.6445,
 'f1': 0.6071075406341151}

## Model 3 : RANDOM FOREST - Depth 70


**Attempt 3** :

1. using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with only trigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [91]:
#1. create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model_3 = Pipeline([
    ('Praba_Vec', CountVectorizer(ngram_range=(3,3))),
    ('Praba_clf', RandomForestClassifier(max_depth=70)) # If max depth is not defined it will take more time 
])
#2. fit with X_train and y_train

model_3.fit(X_train,y_train)

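#3. get the predictions for X_test and store them in y_preds_3
# NOTE: the original run called model_2.predict here by mistake, so the recorded
# output below duplicates Model 2's numbers and does not reflect model_3.

y_preds_3 = model_3.predict(X_test)

#4. build the classification report
report_3 = classification_report(y_test, y_preds_3)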


In [92]:
print('Random Forest Classifier -depth 70 :  No preprocess on Input data : \n \n ')
print(report_3)

Random Forest Classifier -depth 70 :  No preprocess on Input data : 
 
 
              precision    recall  f1-score   support

           0       0.88      0.34      0.49      1000
           1       0.59      0.95      0.73      1000

    accuracy                           0.64      2000
   macro avg       0.73      0.64      0.61      2000
weighted avg       0.73      0.64      0.61      2000



In [93]:
model_3_conf_values = get_confusion_values(y_test, y_preds_3)
model_3_conf_values

(336, 664, 47, 953)

In [94]:
model_3_results = calculate_results(y_test, y_preds_3)
model_3_results

{'accuracy': 64.45,
 'precision': 0.7333238066173537,
 'recall': 0.6445,
 'f1': 0.6071075406341151}

## Model 4 : Multinomial Naive Bayes : Alpha 0.75


**Attempt 4** :

1. using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with bigrams and trigrams (`ngram_range=(2,3)`, as in the code below).
- use **Multinomial Naive Bayes** as the classifier with an alpha value of 0.75.
- print the classification report.
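
The `alpha` value is the additive smoothing parameter. The sketch below, on a hypothetical toy count matrix, recomputes the smoothed per-class term probabilities by hand and compares them with what `MultinomialNB` stores in `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy counts: rows are documents, columns are three terms
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])

alpha = 0.75
nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Manual smoothing: (term count in class + alpha) / (all counts in class + alpha * n_terms)
class_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in (0, 1)])
manual_probs = (class_counts + alpha) / (
    class_counts.sum(axis=1, keepdims=True) + alpha * X_toy.shape[1]
)

print(manual_probs)
print(np.exp(nb.feature_log_prob_))
```

The third term never occurs in class 0, yet its smoothed probability stays above zero, so a single unseen n-gram cannot force the class probability to zero.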


In [95]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
#1. create a pipeline object

model_4 = Pipeline([
    ('Praba_vect',CountVectorizer(ngram_range=(2,3))),
    ('Praba_clf', MultinomialNB(alpha=0.75))
])



#2. fit with X_train and y_train

model_4.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred

y_preds_4 = model_4.predict(X_test)

#4. build the classification report

report_4 = classification_report(y_test, y_preds_4)

In [96]:
print('MultiNomial Naive Bayes  - Alpha 0.75:  No preprocess on Input data : \n \n ')
print(report_4)

MultiNomial Naive Bayes  - Alpha 0.75:  No preprocess on Input data : 
 
 
              precision    recall  f1-score   support

           0       0.94      0.98      0.96      1000
           1       0.98      0.94      0.96      1000

    accuracy                           0.96      2000
   macro avg       0.96      0.96      0.96      2000
weighted avg       0.96      0.96      0.96      2000



In [97]:
model_4_conf_values = get_confusion_values(y_test, y_preds_4)
model_4_conf_values

(984, 16, 65, 935)

In [98]:
model_4_results = calculate_results(y_test, y_preds_4)
model_4_results

{'accuracy': 95.95,
 'precision': 0.9606059148014383,
 'recall': 0.9595,
 'f1': 0.9594756752740832}

## Model result Comparison

In [100]:
import pandas as pd 
# Combine model results into a DataFrame
all_model_results = pd.DataFrame({"m1 Knn Euclidean ": model_1_results,
                                  "m2 Knn Cosine": model_2_results,
                                  "m3 Random Forest": model_3_results,
                                  "m4 Naive Bayes": model_4_results,
                                  })
all_model_results = all_model_results.transpose()
all_model_results


| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| m1 Knn Euclidean | 67.75 | 0.678559 | 0.6775 | 0.677021 |
| m2 Knn Cosine | 64.45 | 0.733324 | 0.6445 | 0.607108 |
| m3 Random Forest | 64.45 | 0.733324 | 0.6445 | 0.607108 |
| m4 Naive Bayes | 95.95 | 0.960606 | 0.9595 | 0.959476 |


✨✨  Naive Bayes    **ROCKS** 

# Preprocessed 

<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>

In [101]:
#use this utility function to get the preprocessed text data

import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 

def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = [token.lemma_ for token in doc if not (token.is_stop or token.is_punct)]
    return " ".join(filtered_tokens) 

In [102]:
# create a new column "preprocessed_text" and use the utility function above to get the clean data
# this will take some time, please be patient

simplified_df['preprocessed_text'] = simplified_df['text'].apply(preprocess)

In [104]:
#print the top 5 rows
simplified_df.head()

| | text | Fake | preprocessed_text |
|---|---|---|---|
| 17587 | Former Baywatch star Pamela Anderson tried t... | 1 | Baywatch star Pamela Anderson try use char... |
| 4905 | Donald Trump is taking a lot of fire over the ... | 1 | Donald Trump take lot fire letter allege physi... |
| 9225 | Sarah Huckabee Sanders was hot today when she ... | 1 | Sarah Huckabee Sanders hot today go CNN s Jim ... |
| 8105 | There s crazy and then there s nonsensical, co... | 1 | s crazy s nonsensical conspiratorial head para... |
| 21556 | What would a speech from a modern Democrat be ... | 1 | speech modern Democrat didn t include LGBT cro... |


**Build a model with pre processed text**

In [105]:
#Do the train-test split with a test size of 20%, random_state=42, and stratified sampling on the labels
#Note: Make sure to use only the "preprocessed_text" column for splitting
processed_features = simplified_df['preprocessed_text']
labels = simplified_df['Fake']
X_train_pr, X_test_pr , y_train, y_test = tts (processed_features,labels, test_size=0.2, random_state=42,stratify= labels )



**Let's check how Random Forest scores on the preprocessed text**

## Model 5: Random forest Clean data - ngram -3

**Attempt1** :

1. using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with only trigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [107]:
#1. create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model_5 = Pipeline([
    ('Praba_Vec', CountVectorizer(ngram_range=(3,3))),
    ('Praba_clf', RandomForestClassifier(max_depth=70)) # If max depth is not defined it will take more time 
])
#2. fit with X_train and y_train

model_5.fit(X_train_pr,y_train)

#3. get the predictions for X_test and store it in y_pred

y_preds_5 = model_5.predict(X_test_pr)

#4. build the classification report
report_5 = classification_report(y_test, y_preds_5)


In [108]:
print('Random Forest Classifier -depth 70 :  Preprocessed Input data : \n \n ')
print(report_5)

Random Forest Classifier -depth 70 :  Preprocessed Input data : 
 
 
              precision    recall  f1-score   support

           0       0.96      0.73      0.83      1000
           1       0.78      0.97      0.87      1000

    accuracy                           0.85      2000
   macro avg       0.87      0.85      0.85      2000
weighted avg       0.87      0.85      0.85      2000



In [109]:
model_5_conf_values = get_confusion_values(y_test, y_preds_5)
model_5_conf_values

(735, 265, 34, 966)

In [110]:
model_5_results = calculate_results(y_test, y_preds_5)
model_5_results

{'accuracy': 85.05,
 'precision': 0.8702572997731975,
 'recall': 0.8505,
 'f1': 0.8484786675447132}

## Model 6: Random Forest - Clean data ngram 1-3

**Attempt2** :

1. using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with unigram, Bigram, and trigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [112]:
#1. create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model_6 = Pipeline([
    ('Praba_Vec', CountVectorizer(ngram_range=(1,3))),
    ('Praba_clf', RandomForestClassifier(max_depth=70)) # If max depth is not defined it will take more time 
])
#2. fit with X_train and y_train

model_6.fit(X_train_pr,y_train)

#3. get the predictions for X_test and store it in y_pred

y_preds_6 = model_6.predict(X_test_pr)

#4. build the classification report
report_6 = classification_report(y_test, y_preds_6)


In [113]:
print('Random Forest Classifier -depth 70 :  Preprocessed Input data : \n \n ')
print(report_6)

Random Forest Classifier -depth 70 :  Preprocessed Input data : 
 
 
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      1000
           1       0.98      0.96      0.97      1000

    accuracy                           0.97      2000
   macro avg       0.97      0.97      0.97      2000
weighted avg       0.97      0.97      0.97      2000



In [118]:
model_6_conf_values = get_confusion_values(y_test, y_preds_6)
model_6_conf_values

(983, 17, 42, 958)

In [119]:
model_6_results = calculate_results(y_test, y_preds_6)
model_6_results

{'accuracy': 97.05,
 'precision': 0.9707942464040025,
 'recall': 0.9705,
 'f1': 0.9704953899046727}

In [116]:
#finally print the confusion matrix for the best model
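
The cell above was left empty. One possible way to fill it, sketched here as an assumption, is to lay out the four confusion values as a labelled table. The numbers are copied from the recorded `get_confusion_values` output for Model 6; in a live session `confusion_matrix(y_test, y_preds_6)` would supply them instead. The row and column names are illustrative.

```python
import pandas as pd

# Values copied from the recorded Model 6 output: (tn, fp, fn, tp)
tn, fp, fn, tp = 983, 17, 42, 958

# Rows are the actual classes, columns are the predicted classes
conf_df = pd.DataFrame(
    [[tn, fp], [fn, tp]],
    index=["actual Real (0)", "actual Fake (1)"],
    columns=["predicted Real (0)", "predicted Fake (1)"],
)
print(conf_df)

# Per-class scores for the fake class, derived from the same four numbers
recall_fake = tp / (tp + fn)
precision_fake = tp / (tp + fp)
print(round(recall_fake, 3), round(precision_fake, 3))
```

These two values agree with the fake-class row of the Model 6 classification report (precision 0.98, recall 0.96).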



# Final Comparison

In [120]:
import pandas as pd 
# Combine model results into a DataFrame
all_model_results = pd.DataFrame({"m1 Knn Euclidean ": model_1_results,
                                  "m2 Knn Cosine": model_2_results,
                                  "m3 Random Forest": model_3_results,
                                  "m4 Naive Bayes": model_4_results,
                                  "m5 Random Forest":model_5_results,
                                  "m6 Random Forest": model_6_results
                                  })
all_model_results = all_model_results.transpose()
all_model_results

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| m1 Knn Euclidean | 67.75 | 0.678559 | 0.6775 | 0.677021 |
| m2 Knn Cosine | 64.45 | 0.733324 | 0.6445 | 0.607108 |
| m3 Random Forest | 64.45 | 0.733324 | 0.6445 | 0.607108 |
| m4 Naive Bayes | 95.95 | 0.960606 | 0.9595 | 0.959476 |
| m5 Random Forest | 85.05 | 0.870257 | 0.8505 | 0.848479 |
| m6 Random Forest | 97.05 | 0.970794 | 0.9705 | 0.970495 |


## **Observations**


Model 1:
 1. It generalizes poorly (about 68% test accuracy). This may be overfitting, although training scores were not recorded to confirm it.
 2. The feature space is very high dimensional (unigrams to trigrams), so computing Euclidean distances took a long time.

Model 2:
  1. It also generalizes poorly (about 64% accuracy). Recall is 0.95 for the fake class but only 0.34 for the real class, so most texts are predicted as fake.

Model 3 & 4:
  1. Model 3 uses trigrams only; Model 4 uses bigrams and trigrams.
  2. The recorded Model 3 metrics are identical to Model 2's because the predictions in that cell were generated with `model_2`, so they say nothing about Random Forest on the raw text. For Model 4, recall and F1 improved to about 0.96.
  3. Random Forest uses bootstrapping (row and column sampling) across many decision trees, which reduces the high variance and overfitting that high-dimensional data causes, and it uses the feature importance of words to separate the categories.

  4. Multinomial Naive Bayes only needs per-class probabilities for the words in the corpus (bag of words), which are cheap to compute and store in a contingency table. This is the major reason it is a text-classification-friendly algorithm.


Model 5 & 6:
  1. Random Forest trained on the pre-processed data with trigrams only reaches about 85% accuracy, well below Naive Bayes on the raw text.

  2. The same Random Forest with unigram-to-trigram features produces the best results in the entire list, with accuracy, recall, and F1 of about 97%.

>  😎 Machine learning is like a trial-and-error scientific method: we keep trying the algorithms we have and select the one that gives good results and satisfies requirements such as latency and interpretability. ⛳