### **Bag of n-grams: Exercise**

- Fake news is misinformation or disinformation that spreads by word of mouth and, more recently, through digital channels such as WhatsApp messages and social media posts.

- Fake news can spread faster than real news and creates problems and fear among groups and in society.

- We are going to address this problem with classical NLP techniques by classifying whether a given message or text is a **Real or Fake Message**.

- You will use a bag of n-grams to turn the text into features and then apply a classification algorithm.

- scikit-learn's `CountVectorizer` has a built-in implementation of bag of words, and its `ngram_range` parameter extends it to bag of n-grams.
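
To make the idea of a bag of n-grams concrete, here is a minimal pure-Python sketch. It assumes simple lowercasing and whitespace tokenization, which is a simplification and not the tokenizer `CountVectorizer` uses; the helper name `bag_of_ngrams` is hypothetical.

```python
from collections import Counter

def bag_of_ngrams(text, n_min=1, n_max=2):
    """Count every n-gram with n_min <= n <= n_max in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = bag_of_ngrams("fake news spreads faster than real news", 1, 2)
print(counts["news"])       # how often the unigram "news" occurs
print(counts["fake news"])  # how often the bigram "fake news" occurs
```

Word order inside each n-gram is kept, so "fake news" and "news fake" would be different features, while the order of the n-grams in the document is discarded.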


### **About Data: Fake News Detection**

Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset


- This exercise uses two columns of the data.
        - text
        - Fake (the label)
- text holds the statements or messages regarding a particular event/situation.

- The Fake label tells whether the given text is fake (1) or real (0).

- As there are only 2 classes, this is a **binary classification** problem.


In [2]:
#import pandas library
import pandas as pd

#read "Fake.csv" and "True.csv" into two separate dataframes
df_Fake = pd.read_csv('Fake.csv')
df_True = pd.read_csv('True.csv')

#add the label column: 1 = fake, 0 = real
df_Fake['Fake'] = 1
df_True['Fake'] = 0

#stack both dataframes into a single dataframe df
df = pd.concat([df_Fake, df_True], axis=0, ignore_index=True)


#print the shape of dataframe

print(df.shape)


#print top 5 rows
df.head()

(44898, 5)


| | title | text | subject | date | Fake |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 1 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 1 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 1 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 1 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 1 |


In [3]:
#check the distribution of labels 
df.Fake.value_counts()

Fake
1    23481
0    21417
Name: count, dtype: int64
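
As a quick check on class balance, the two counts above can be turned into shares of the whole dataset. This is a small sketch that hard-codes the counts from the output; the variable names are illustrative.

```python
# label counts copied from the value_counts() output above (1 = fake, 0 = real)
label_counts = {1: 23481, 0: 21417}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
```

The two classes are close to balanced, so accuracy is a reasonable first metric alongside the per-class precision and recall.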

### **Modelling without Pre-processing Text data**

In [4]:
#import train-test-split from sklearn 
from sklearn.model_selection import train_test_split


#Do the 'train-test' splitting with test size of 20% with random state of 2022 and stratify sampling too
X = df.text
y = df.Fake

X_train, X_test, Y_train, Y_test=train_test_split(X, y, test_size=0.20, random_state=2022, stratify=y)
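
To illustrate what `stratify` does, here is a minimal sketch on a made-up dataset of 100 documents with a 60/40 label split. The toy data and the `toy_`/`_tr`/`_te` names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 60 documents labelled 1 and 40 labelled 0
toy_X = pd.Series([f"document {i}" for i in range(100)])
toy_y = pd.Series([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=2022, stratify=toy_y
)

# with stratification both parts keep the 60/40 label ratio
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the label ratio in the test set would vary with the random seed.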

In [5]:
#print the shapes of X and y
print(f'{X.shape} and {y.shape}')


(44898,) and (44898,)


**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer to build the features; with its default settings it counts unigrams only, and `ngram_range=(1, 3)` would add bigrams and trigrams.
- use XGBClassifier with default hyperparameters as the classifier.
- print the classification report.


In [6]:
#import the pipeline, vectorizer, classifier and report utilities
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

#1. create a pipeline object
myModel = Pipeline(steps = [
    ('preprocessor', CountVectorizer()),
    ('model', XGBClassifier())
])

#2. fit with X_train and Y_train
myModel.fit(X_train, Y_train)

#3. get the predictions for X_test and store them in y_pred
y_pred = myModel.predict(X_test)

#4. print the classification report
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4284
           1       1.00      1.00      1.00      4696

    accuracy                           1.00      8980
   macro avg       1.00      1.00      1.00      8980
weighted avg       1.00      1.00      1.00      8980
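
To bring bigrams and trigrams into such a pipeline, `CountVectorizer` accepts `ngram_range`. The sketch below pairs it with a KNN classifier using the 'euclidean' metric on a tiny made-up corpus. The texts, labels, and the `ngram_knn` name are hypothetical, and `n_neighbors=3` is chosen only because the toy corpus is so small.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# hypothetical toy corpus: 1 = fake, 0 = real
toy_texts = [
    "shocking secret they do not want you to know",
    "you will not believe this shocking claim",
    "secret cure doctors do not want revealed",
    "the committee approved the budget on tuesday",
    "officials said the budget vote is scheduled for tuesday",
    "the ministry published the quarterly report",
]
toy_labels = [1, 1, 1, 0, 0, 0]

ngram_knn = Pipeline(steps=[
    # count unigrams, bigrams and trigrams
    ('preprocessor', CountVectorizer(ngram_range=(1, 3))),
    ('model', KNeighborsClassifier(n_neighbors=3, metric='euclidean')),
])
ngram_knn.fit(toy_texts, toy_labels)

vocab = ngram_knn.named_steps['preprocessor'].vocabulary_
print(len(vocab))  # number of distinct 1- to 3-grams
preds = ngram_knn.predict(["shocking secret they hide"])
print(preds)
```

Adding n-grams grows the vocabulary quickly, which makes distance-based models such as KNN slower and more sensitive to sparsity on a real dataset.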



<h3>Use text pre-processing to remove stop words and punctuation and apply lemmatization</h3>

In [7]:
#use this utility function to get the preprocessed text data

import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 

def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [8]:
# run every document through the utility function above and collect the clean text in X_filtered
# this will take some time, please be patient
X_filtered = []


for index, x in enumerate(X):
    X_filtered.append(preprocess(x))
    # report progress every 1000 documents instead of on every iteration
    if index % 1000 == 0:
        print(f"{100 * index / X.shape[0]:.1f} % complete")

In [9]:
#print the first 5 preprocessed documents (X_filtered is a list, so slice it instead of calling head())
X_filtered[:5]

**Build a model with preprocessed text**

In [None]:
#Do the 'train-test' split with a test size of 20%, a random state of 2022 and stratified sampling
#Note: Make sure to use only the preprocessed text (X_filtered) for splitting

X_train, X_test, Y_train, Y_test=train_test_split(X_filtered, y, test_size=0.20, random_state=2022, stratify=y)


#1. create a pipeline object
myModel = Pipeline(steps = [
    ('preprocessor', CountVectorizer()),
    ('model', XGBClassifier())
])

#2. fit with X_train and Y_train
myModel.fit(X_train, Y_train)

#3. get the predictions for X_test and store them in y_pred
y_pred = myModel.predict(X_test)

#4. print the classification report
print(classification_report(Y_test, y_pred))
