<a href="https://colab.research.google.com/github/PrabaKDataScience/DeepLearning/blob/main/NLP/Basics/Fake_news_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Bag of n-grams: Exercise**

- Fake news is misinformation or disinformation that spreads by word of mouth and, more recently, through digital channels such as WhatsApp messages and social media posts.

- Fake news often spreads faster than real news and creates problems and fear among groups and in society.

- We are going to address this problem with classical NLP techniques by classifying whether a given message/text is a **Real or Fake Message**.

- You will use a Bag of n-grams to pre-process the text and apply different classification algorithms.

- Scikit-learn's `CountVectorizer` has a built-in implementation of Bag of Words; its `ngram_range` parameter extends it to a Bag of n-grams.
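
To make the idea concrete, here is a minimal pure-Python sketch of what a bag of n-grams counts. The helper `bag_of_ngrams` and the sample sentence are illustrative assumptions and not part of scikit-learn; `CountVectorizer` uses its own tokenizer and returns a sparse matrix, but its `ngram_range=(min_n, max_n)` parameter selects n-gram sizes in the same spirit.

```python
from collections import Counter

def bag_of_ngrams(text, min_n=1, max_n=2):
    """Count every word n-gram with min_n <= n <= max_n (hypothetical helper)."""
    tokens = text.lower().split()  # naive whitespace tokenizer, for illustration only
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

bag = bag_of_ngrams("fake news spreads faster than real news", 1, 2)
print(bag["news"])       # the unigram "news" occurs twice
print(bag["fake news"])  # the bigram "fake news" occurs once
print(len(bag))          # number of distinct unigrams and bigrams
```

Widening the range adds longer phrases as features, which captures more context but also grows the vocabulary quickly.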


# **About Data: Fake News Detection**

Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset


- After the unused columns are dropped, this data consists of two columns.
        - text
        - Fake (the label)
- text holds the statements or messages regarding a particular event/situation.

- The Fake label tells whether the given text is Fake (1) or Real (0).

- As there are only 2 classes, this problem comes under the **Binary Classification.**


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp -r "/content/drive/MyDrive/NLP Journey/Data/" "/content/"

In [3]:
!unzip Data/Fake.csv.zip
!unzip Data/True.csv.zip


Archive:  Data/Fake.csv.zip
  inflating: Fake.csv                
Archive:  Data/True.csv.zip
  inflating: True.csv                


In [4]:
#import pandas library
import pandas as pd

#read the fake news dataset "Fake.csv" into df_1 and mark every row as fake (Fake = 1)
df_1 = pd.read_csv("Fake.csv")

df_1['Fake']=1
df_1.head()

Unnamed: 0,title,text,subject,date,Fake
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",1


In [5]:
#read the real news dataset "True.csv" into df_2 and mark every row as real (Fake = 0)
df_2  =pd.read_csv('True.csv')
df_2['Fake']=0
df_2.head()

Unnamed: 0,title,text,subject,date,Fake
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",0
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",0
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",0
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",0
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",0


In [6]:
#combine both datasets into one dataframe and look at a random sample of rows
res = pd.concat([df_1, df_2],axis=0)
res.sample(5)

Unnamed: 0,title,text,subject,date,Fake
19097,83 Yr Old Supreme Court Justice Ginsburg Gives...,"Speaking to BBC Newsnight in a rare interview,...",left-news,"Feb 23, 2017",1
20912,"Bernie Sanders: When you’re White, you don’t k...",Bernie Sanders is so focused on pandering to b...,left-news,"Mar 7, 2016",1
6113,Ethics lawyers to sue Trump over foreign payments,(Reuters) - A group including former White Hou...,politicsNews,"January 23, 2017",0
2289,Trump eyes top policy aide for communications ...,WASHINGTON (Reuters) - The White House may app...,politicsNews,"August 5, 2017",0
9741,SAN JUAN MAYOR On Video Disrespecting Our Cons...,With Puerto Rico teetering on financial oblivi...,politics,"Oct 5, 2017",1


In [7]:
simplified_df = res.drop(['title','subject','date'],axis=1)

In [8]:
simplified_df

Unnamed: 0,text,Fake
0,Donald Trump just couldn t wish all Americans ...,1
1,House Intelligence Committee Chairman Devin Nu...,1
2,"On Friday, it was revealed that former Milwauk...",1
3,"On Christmas day, Donald Trump announced that ...",1
4,Pope Francis used his annual Christmas Day mes...,1
...,...,...
21412,BRUSSELS (Reuters) - NATO allies on Tuesday we...,0
21413,"LONDON (Reuters) - LexisNexis, a provider of l...",0
21414,MINSK (Reuters) - In the shadow of disused Sov...,0
21415,MOSCOW (Reuters) - Vatican Secretary of State ...,0


In [9]:
simplified_df.Fake.value_counts()

1    23481
0    21417
Name: Fake, dtype: int64

In [10]:
#separate the input text (features) from the target labels
features = simplified_df['text']
labels = simplified_df['Fake']


### **Modelling without Pre-processing Text data**

In [11]:
#import train-test-split from sklearn 
from sklearn.model_selection import train_test_split as tts

#Do the 'train-test' splitting with a test size of 20%, a random state of 42, and stratified sampling on the labels
X_train, X_test, y_train, y_test = tts(features, labels, test_size=0.2, random_state=42, stratify=labels)



In [12]:
#print the shapes of X_train and X_test

print(X_train.shape  , X_test.shape)


(35918,) (8980,)


In [13]:
y_train.value_counts()

1    18785
0    17133
Name: Fake, dtype: int64
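
The effect of `stratify=labels` can be checked with simple arithmetic on the counts printed above: the share of fake articles should be nearly identical in the full data and in the training split. The sketch below hard-codes those counts; the variable names are illustrative.

```python
# Counts copied from the value_counts() outputs above.
full_fake, full_real = 23481, 21417
train_fake, train_real = 18785, 17133

full_ratio = full_fake / (full_fake + full_real)
train_ratio = train_fake / (train_fake + train_real)

print(f"fake share, full data: {full_ratio:.4f}")
print(f"fake share, train set: {train_ratio:.4f}")
```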

## Useful function for all models

In [22]:
from sklearn.metrics import confusion_matrix

def get_confusion_values(y_true, y_pred):
  # for binary labels, ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
  tn,fp,fn,tp = confusion_matrix(y_true, y_pred).ravel()
  return (tn,fp,fn,tp)


## Model 1

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use KNN as the classifier with n_neighbors of 10 and 'euclidean' as the distance metric.
- Print the classification report.


In [15]:
#1. create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

model_1 = Pipeline([
    ('Praba_Vec', CountVectorizer(ngram_range=(1,3))),
    ('Praba_clf', KNeighborsClassifier(n_neighbors=10, metric ='euclidean'))
])
#2. fit with X_train and y_train

model_1.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred

y_pred = model_1.predict(X_test)

#4. build the classification report (printed in the next cell)

report = classification_report(y_test, y_pred)

In [18]:
print('KNN Classifier :  No preprocess on Input data : Count Vectoriser ngram (1,3) \n \n ')
print(report)
report_1=report

KNN Classifier :  No preprocess on Input data : Count Vectoriser ngram (1,3) 
 
 
              precision    recall  f1-score   support

           0       0.72      0.74      0.73      4284
           1       0.76      0.73      0.74      4696

    accuracy                           0.74      8980
   macro avg       0.74      0.74      0.74      8980
weighted avg       0.74      0.74      0.74      8980



In [24]:
model_1_conf_values = get_confusion_values(y_test, y_pred)
model_1_conf_values

(3173, 1111, 1254, 3442)
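
These four numbers are enough to reproduce the headline figures of the classification report for the fake class (label 1). The sketch below hard-codes the values from the output above; the variable names are illustrative.

```python
# Values returned by get_confusion_values for Model 1 (copied from the output above).
tn, fp, fn, tp = 3173, 1111, 1254, 3442

precision_fake = tp / (tp + fp)  # of the texts predicted fake, how many really are fake
recall_fake = tp / (tp + fn)     # of the truly fake texts, how many were found
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision_fake, 2), round(recall_fake, 2), round(accuracy, 2))
# 0.76 0.73 0.74
```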

## Model 2

**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use **KNN** as the classifier with n_neighbors of 10 and 'cosine' as the distance metric.
- Print the classification report.
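
One difference between the two metrics is how they treat document length on raw count vectors: Euclidean distance grows when one text is simply longer, while cosine distance only compares the direction of the vectors. The following is a small sketch with made-up count vectors; the helper functions are hand-written for illustration and are not the scikit-learn implementations.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

short_doc = [1, 2, 0]  # toy n-gram counts
long_doc = [3, 6, 0]   # same wording repeated three times
other_doc = [0, 1, 2]  # different wording, similar length to short_doc

# Euclidean: the longer copy looks farther away than the unrelated text.
print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))
# Cosine: the longer copy is (numerically almost) at distance zero.
print(cosine_distance(short_doc, long_doc), cosine_distance(short_doc, other_doc))
```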


In [16]:


#1. create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

model_2 = Pipeline([
    ('Praba_Vec', CountVectorizer(ngram_range=(1,3))),
    ('Praba_clf', KNeighborsClassifier(n_neighbors=10, metric ='cosine'))
])
#2. fit with X_train and y_train

model_2.fit(X_train,y_train)

#3. get the predictions for X_test and store them in y_preds_2

y_preds_2 = model_2.predict(X_test)

#4. build the classification report (printed in the next cell)
report = classification_report(y_test, y_preds_2)


In [17]:
print('KNN Classifier :  No preprocess on Input data : Count Vectoriser ngram (1,3) : COSINE distance for cluster \n \n ')
print(report)

KNN Classifier :  No preprocess on Input data : Count Vectoriser ngram (1,3) : COSINE distance for cluster 
 
 
              precision    recall  f1-score   support

           0       0.87      0.46      0.60      4284
           1       0.65      0.94      0.77      4696

    accuracy                           0.71      8980
   macro avg       0.76      0.70      0.68      8980
weighted avg       0.76      0.71      0.69      8980



## Model 3


**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [18]:
#1. create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model_3 = Pipeline([
    ('Praba_Vec', CountVectorizer(ngram_range=(3,3))),
    ('Praba_clf', RandomForestClassifier(max_depth=70)) # If max depth is not defined it will take more time 
])
#2. fit with X_train and y_train

model_3.fit(X_train,y_train)

#3. get the predictions for X_test and store them in y_preds_3

y_preds_3 = model_3.predict(X_test)

#4. build the classification report (printed in the next cell)
report = classification_report(y_test, y_preds_3)


In [19]:
print('Random Forest Classifier -depth 70 :  No preprocess on Input data : \n \n ')
print(report)

Random Forest Classifier -depth 70 :  No preprocess on Input data : 
 
 
              precision    recall  f1-score   support

           0       0.99      0.85      0.91      4284
           1       0.88      0.99      0.93      4696

    accuracy                           0.92      8980
   macro avg       0.93      0.92      0.92      8980
weighted avg       0.93      0.92      0.92      8980



## Model 4


**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with bigrams and trigrams (the code below sets `ngram_range=(2,3)`).
- Use **Multinomial Naive Bayes** as the classifier with an alpha value of 0.75.
- Print the classification report.
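
The `alpha` value controls additive smoothing: every n-gram gets `alpha` added to its count so that an n-gram never seen in a class does not force the class probability to zero. The sketch below shows the smoothed estimate with made-up counts; the function name and the numbers are illustrative assumptions, not values from this dataset.

```python
def smoothed_probability(count, total_count, vocab_size, alpha=0.75):
    """Additive smoothing: (count + alpha) / (total_count + alpha * vocab_size)."""
    return (count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical numbers: a class with 1000 n-gram occurrences over a 200-term vocabulary.
seen = smoothed_probability(count=10, total_count=1000, vocab_size=200)
unseen = smoothed_probability(count=0, total_count=1000, vocab_size=200)

print(seen)    # probability of an n-gram seen 10 times in the class
print(unseen)  # an unseen n-gram still gets a small non-zero probability
```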


In [30]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
#1. create a pipeline object

model_4 = Pipeline([
    ('Praba_vect',CountVectorizer(ngram_range=(2,3))),
    ('Praba_clf', MultinomialNB(alpha=0.75))
])



#2. fit with X_train and y_train

model_4.fit(X_train,y_train)

#3. get the predictions for X_test and store them in y_preds_4

y_preds_4 = model_4.predict(X_test)

#4. build the classification report (printed in the next cell)

rep = classification_report(y_test, y_preds_4)


In [31]:
print('MultiNomial Naive Bayes  - Alpha 0.75:  No preprocess on Input data : \n \n ')
print(rep)

MultiNomial Naive Bayes  - Alpha 0.75:  No preprocess on Input data : 
 
 
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      4284
           1       0.99      0.98      0.99      4696

    accuracy                           0.98      8980
   macro avg       0.98      0.98      0.98      8980
weighted avg       0.98      0.98      0.98      8980



In [40]:
from sklearn.metrics import confusion_matrix
tn,fp,fn,tp = confusion_matrix(y_test, y_preds_4).ravel()
model_4_conf = tn,fp,fn,tp

# Preprocessed 

<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>

In [32]:
#use this utility function to get the preprocessed text data

import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 

def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [22]:
# create a new column "preprocessed_txt" and use the utility function above to get the clean data
# this will take some time, please be patient


In [23]:
#print the top 5 rows


**Build a model with pre processed text**

In [24]:
#Do the 'train-test' splitting with test size of 20% with random state of 2022 and stratify sampling too
#Note: Make sure to use only the "preprocessed_txt" column for splitting




**Let's check the scores with our best model till now**
- Random Forest

**Attempt1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [25]:
#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report


**Attempt2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [26]:
#1. create a pipeline object




#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report


In [27]:
#finally print the confusion matrix for the best model



## **Please write down Final Observations**


## [**Solution**](./bag_of_n_grams_exercise_solutions.ipynb)