### **Bag of n-grams: Exercise**

- Fake news is misinformation or disinformation that spreads by word of mouth and, more recently, through digital channels such as WhatsApp messages and social media posts.

- Fake news can spread faster than real news, creating problems and fear among groups and in society.

- We will address this problem with classical NLP techniques by classifying whether a given message or text is a **Real or Fake Message**.

- You will use a Bag of n-grams to vectorize the text and then apply different classification algorithms.

- scikit-learn's `CountVectorizer` has a built-in Bag of Words implementation, and its `ngram_range` parameter extends it to n-grams.
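To make the idea of a Bag of n-grams concrete, here is a minimal sketch on a single toy sentence. The sentence is invented for illustration and is not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentence, invented for illustration (not from the dataset)
docs = ["fake news spreads fast"]

# ngram_range=(1, 2) extracts unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each n-gram to its column index
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(counts.toarray())
```

The four words yield four unigrams and three bigrams, so the single document becomes a vector with seven columns. Widening `ngram_range` to `(1, 3)` would add trigrams on top.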


### **About Data: Fake News Detection**

Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset


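- This exercise uses two columns.
        - text
        - fakeness (the label, added when each file is loaded)
- The raw files, `Fake.csv` and `True.csv`, also contain `title`, `subject`, and `date` columns, which are not used here.

- text holds the statements or messages about a particular event or situation.

- fakeness tells whether the given text is Fake (1) or Real (0).

- As there are only 2 classes, this is a **Binary Classification** problem.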


In [2]:
#import pandas library
import pandas as pd
!kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset



Dataset URL: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
License(s): CC-BY-NC-SA-4.0
fake-and-real-news-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [3]:
import zipfile
import os

# Specify the path to the zip file and the destination directory
zip_file_path = '/content/fake-and-real-news-dataset.zip'
extract_dir = '/content'

# Open the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the specified directory
    zip_ref.extractall(extract_dir)

    print(f"Extracted all files to {extract_dir}")

Extracted all files to /content


In [4]:
train_Fake  = pd.read_csv('Fake.csv')
train_Fake['fakeness'] = 1
train_Fake

Unnamed: 0,title,text,subject,date,fakeness
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",1
...,...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",1
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",1
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",1
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",1


In [5]:
train_True  = pd.read_csv('True.csv')
train_True['fakeness'] = 0
train_True

Unnamed: 0,title,text,subject,date,fakeness
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",0
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",0
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",0
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",0
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",0
...,...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",0
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",0
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",0
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",0


In [6]:
#concatenate the two datasets, then shuffle the rows and reset the index
train_data = pd.concat([train_True, train_Fake] , axis=0 )
train_data = train_data.sample(frac=1).reset_index(drop=True)
train_data

Unnamed: 0,title,text,subject,date,fakeness
0,Russian lawmaker Kerimov detained by French po...,"NICE, France (Reuters) - Russian businessman a...",worldnews,"November 21, 2017",0
1,Highlights: The Trump presidency on April 26 a...,(Reuters) - Highlights for U.S. President Dona...,politicsNews,"April 26, 2017",0
2,Australian citizenship crisis deepens as eight...,SYDNEY (Reuters) - The citizenship crisis engu...,worldnews,"November 13, 2017",0
3,U.S. senator: Launch probe if inappropriate Tr...,WASHINGTON (Reuters) - Congress should launch ...,politicsNews,"February 15, 2017",0
4,Sister of NY attack suspect says he may have b...,ALMATY (Reuters) - The sister of the Uzbek imm...,politicsNews,"November 3, 2017",0
...,...,...,...,...,...
44893,German citizen freed in Turkey but banned from...,ISTANBUL (Reuters) - A Turkish court has order...,worldnews,"September 7, 2017",0
44894,THOUSANDS Of Containers ROTTING At San Juan Po...,San Juan s mayor railed against President Trum...,left-news,"Oct 1, 2017",1
44895,Factbox: Trump on Twitter (Sept 27) - Filibust...,The following statements were posted to the ve...,politicsNews,"September 28, 2017",0
44896,"Ryan calls Trump, Cruz to discuss House Republ...",WASHINGTON (Reuters) - U.S. House of Represent...,politicsNews,"March 8, 2016",0


In [7]:
#check the distribution of labels
print('fake data: ',len(train_data[train_data['fakeness'] == 1]))
print('true data: ',len(train_data[train_data['fakeness'] == 0]))

fake data:  23481
true data:  21417
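The two counts above can also be read as class shares, which shows how balanced the labels are. The sketch below rebuilds a label column from the printed counts only; in the notebook itself the same call could be applied to `train_data['fakeness']`.

```python
import pandas as pd

# Rebuild a label series from the counts printed above (1 = fake, 0 = real)
labels = pd.Series([1] * 23481 + [0] * 21417, name="fakeness")

# normalize=True returns fractions instead of raw counts
share = labels.value_counts(normalize=True)
print(share.round(3))
```

Roughly 52% of the rows are fake and 48% are real, so the classes are close to balanced and accuracy is a reasonable headline metric alongside precision and recall.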


### **Modelling without Pre-processing Text data**

In [8]:
from sklearn.model_selection import train_test_split


#Split into train and validation sets: 20% validation, random_state=2022, stratified by label
X_train , X_val , y_train , y_val = train_test_split(train_data['text'] , train_data['fakeness'] , test_size=0.2 , random_state = 2022 , stratify=train_data.fakeness)


In [9]:
#print the shapes of X_train and X_val

print("x_train shape: ",X_train.shape)
print("x_val shape: ",X_val.shape)

x_train shape:  (35918,)
x_val shape:  (8980,)
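The `stratify` argument keeps the class ratio the same in both splits. A minimal sketch on made-up labels (60 positives, 40 negatives; the texts and names are hypothetical) shows the effect:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 documents, 60% labelled 1 and 40% labelled 0
toy_texts = [f"doc {i}" for i in range(100)]
toy_labels = [1] * 60 + [0] * 40

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_texts, toy_labels,
    test_size=0.2, random_state=2022, stratify=toy_labels,
)

# Both splits keep the 60/40 ratio of the full data
print(sum(y_tr) / len(y_tr), sum(y_va) / len(y_va))
```

Without `stratify`, a random split could leave the validation set with a noticeably different share of fake news than the training set.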


**Attempt 1** :

1. Using the sklearn `Pipeline` module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use KNN as the classifier with n_neighbors of 10 and 'euclidean' as the distance metric.
- Print the classification report.


In [10]:
#1. create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from  sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

pipe = Pipeline([('v', CountVectorizer(ngram_range=(1, 3))),
                 ('knn', KNeighborsClassifier(n_neighbors=10 , metric='euclidean'))

                 ])


#2. fit with X_train and y_train

pipe.fit(X_train, y_train)

#3. get the predictions for X_val and store it in y_pred

y_pred = pipe.predict(X_val)

#4. print the classification report
print(classification_report(y_val , y_pred))


              precision    recall  f1-score   support

           0       0.72      0.72      0.72      4284
           1       0.75      0.74      0.75      4696

    accuracy                           0.73      8980
   macro avg       0.73      0.73      0.73      8980
weighted avg       0.73      0.73      0.73      8980



**Attempt 2** :

1. Using the sklearn `Pipeline` module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use **KNN** as the classifier with n_neighbors of 10 and 'cosine' as the distance metric.
- Print the classification report.
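As a starting point for this attempt, the sketch below wires the same kind of pipeline on four invented sentences. It is not the exercise solution: the texts and labels are made up, and `n_neighbors=3` is used only because the toy set is too small for 10.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Invented toy data (0 = real, 1 = fake); not from the Kaggle dataset
toy_texts = [
    "reuters reports the senate passed the budget bill",
    "reuters reports the court upheld the ruling",
    "shocking you will not believe what he said",
    "shocking video you will not believe",
]
toy_labels = [0, 0, 1, 1]

knn_pipe = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 3))),
    # Assumption: 3 neighbours here; the exercise asks for 10 on the full data
    ("knn", KNeighborsClassifier(n_neighbors=3, metric="cosine")),
])
knn_pipe.fit(toy_texts, toy_labels)

knn_pred = knn_pipe.predict(["you will not believe this shocking clip"])
print(knn_pred)
```

Cosine distance compares the direction of the count vectors rather than their length, so a long article and a short one with similar wording can still be close neighbours.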



**Attempt 3** :

1. Using the sklearn `Pipeline` module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [None]:
from sklearn.ensemble import RandomForestClassifier


pipe = Pipeline([('v', CountVectorizer(ngram_range=(3, 3))),
                 ('rf_clf', RandomForestClassifier()) ])

#2. fit with X_train and y_train

pipe.fit(X_train, y_train)

#3. get the predictions for X_val and store it in y_pred

y_pred = pipe.predict(X_val)

#4. print the classification report
print(classification_report(y_val , y_pred))

              precision    recall  f1-score   support

           0       0.98      0.97      0.97      4284
           1       0.98      0.98      0.98      4696

    accuracy                           0.98      8980
   macro avg       0.98      0.98      0.98      8980
weighted avg       0.98      0.98      0.98      8980




**Attempt 4** :

1. Using the sklearn `Pipeline` module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **Multinomial Naive Bayes** as the classifier with an alpha value of 0.75.
- Print the classification report.
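The shape of such a pipeline can be sketched on a handful of invented sentences. This is an illustration under stated assumptions, not the exercise solution: the toy texts, labels, and variable names are hypothetical.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data (0 = real, 1 = fake); not from the Kaggle dataset
nb_texts = [
    "reuters reports the senate passed the budget bill",
    "reuters reports the court upheld the ruling",
    "shocking you will not believe what he said",
    "shocking video you will not believe",
]
nb_labels = [0, 0, 1, 1]

nb_pipe = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    # alpha is the additive smoothing applied to the n-gram counts
    ("nb", MultinomialNB(alpha=0.75)),
])
nb_pipe.fit(nb_texts, nb_labels)

nb_pred = nb_pipe.predict([
    "reuters reports the senate upheld the bill",
    "you will not believe this shocking clip",
])
print(nb_pred)
```

Multinomial Naive Bayes works directly on the raw n-gram counts, which makes it a natural and fast baseline for Bag of n-grams features.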


<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>

In [None]:
#use this utility function to get the preprocessed text data

import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [None]:
# create a new column "preprocessed_txt" and use the utility function above to get the clean data
# this will take some time, please be patient
train_data['preprocessed_txt'] = train_data['text'].apply(preprocess)


In [None]:
#print the top 5 rows
train_data.head(5)


Unnamed: 0,title,text,subject,date,fakeness,preprocessed_txt
0,Russian lawmaker Kerimov detained by French po...,"NICE, France (Reuters) - Russian businessman a...",worldnews,"November 21, 2017",0,nice France Reuters russian businessman lawmak...
1,Highlights: The Trump presidency on April 26 a...,(Reuters) - Highlights for U.S. President Dona...,politicsNews,"April 26, 2017",0,Reuters Highlights U.S. President Donald Trump...
2,Australian citizenship crisis deepens as eight...,SYDNEY (Reuters) - The citizenship crisis engu...,worldnews,"November 13, 2017",0,SYDNEY Reuters citizenship crisis engulf austr...
3,U.S. senator: Launch probe if inappropriate Tr...,WASHINGTON (Reuters) - Congress should launch ...,politicsNews,"February 15, 2017",0,WASHINGTON Reuters Congress launch bipartisan ...
4,Sister of NY attack suspect says he may have b...,ALMATY (Reuters) - The sister of the Uzbek imm...,politicsNews,"November 3, 2017",0,ALMATY Reuters sister Uzbek immigrant accuse k...


**Build a model with pre-processed text**

In [None]:
#Split into train and validation sets: 20% validation, random_state=2022, stratified by label
#Note: Make sure to use only the "preprocessed_txt" column for splitting
X_train , X_val , y_train , y_val = train_test_split(train_data['preprocessed_txt'] , train_data['fakeness'] , test_size=0.2 , random_state = 2022 , stratify=train_data.fakeness)




**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1** :

1. Using the sklearn `Pipeline` module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [None]:
pipe = Pipeline([('v', CountVectorizer(ngram_range=(3, 3))),
                 ('rf_clf', RandomForestClassifier()) ])

#2. fit with X_train and y_train

pipe.fit(X_train, y_train)

#3. get the predictions for X_val and store it in y_pred

y_pred = pipe.predict(X_val)

#4. print the classification report
print(classification_report(y_val , y_pred))


**Attempt 2** :

1. Using the sklearn `Pipeline` module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [None]:
pipe = Pipeline([('v', CountVectorizer(ngram_range=(1, 3))),
                 ('rf_clf', RandomForestClassifier()) ])

#2. fit with X_train and y_train

pipe.fit(X_train, y_train)

#3. get the predictions for X_val and store it in y_pred

y_pred = pipe.predict(X_val)

#4. print the classification report
print(classification_report(y_val , y_pred))


In [None]:
#finally print the confusion matrix for the best model
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

disp = ConfusionMatrixDisplay.from_estimator(pipe,X_val,y_val,cmap=plt.cm.Blues)
plt.show()
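The plot shows the four cells of the confusion matrix as a heatmap. If the raw counts are needed, for example to write up the final observations, `confusion_matrix` returns them as an array. The labels below are made up for illustration; in the notebook, `y_val` and `y_pred` would take their place.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (0 = real, 1 = fake)
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)

# For a binary problem, ravel() unpacks the cells in this order
tn, fp, fn, tp = cm.ravel()
print(cm)
print("real flagged as fake:", fp, "| fake missed:", fn)
```

Here `fp` counts real articles flagged as fake and `fn` counts fake articles that slipped through, which are usually the two numbers worth discussing.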


## **Please write down Final Observations**


## [**Solution**](./bag_of_n_grams_exercise_solutions.ipynb)