<a href="https://colab.research.google.com/github/Symphoen1x/Almadani.github.io/blob/main/bag_of_n_grams_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Bag of n-grams: Exercise**

- Fake news refers to misinformation or disinformation that spreads by word of mouth and, more recently, through digital channels such as WhatsApp messages and social media posts.

- Fake news spreads faster than Real news and creates problems and fear among groups and in society.

- We will address this problem with classical NLP techniques and classify whether a given message or text is a **Real or Fake Message**.

- You will use a Bag of n-grams to pre-process the text and apply different classification algorithms.

- sklearn's CountVectorizer has a built-in implementation of Bag of Words, and its ngram_range parameter extends it to a Bag of n-grams.
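To make the idea concrete, here is a minimal sketch of what a bag of n-grams contains. The toy sentence is invented for illustration and is not from the dataset; `ngram_range=(1, 2)` asks CountVectorizer for unigrams and bigrams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentence (hypothetical, not taken from the dataset)
corpus = ["Thor ate pizza"]

# ngram_range=(1, 2) extracts unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(corpus)

# vocabulary_ maps each n-gram to its column index; sort the keys for display
ngrams = sorted(vectorizer.vocabulary_)
print(ngrams)
```

Note that CountVectorizer lowercases the text by default, so "Thor" is stored as "thor".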


### **About Data: Fake News Detection**

Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset


- This data consists of two columns:
    - Text
    - label
- Text contains the statements or messages regarding a particular event or situation.

- The label column tells whether the given Text is Fake or Real.

- As there are only 2 classes, this is a **Binary Classification** problem.


In [1]:
#import library
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import spacy
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
import seaborn as sn

In [2]:
# prompt: give me code to import the dataset from GitHub. I have this link: https://github.com/codebasics/nlp-tutorials/blob/main/11_bag_of_n_grams/Fake_Real_Data.csv

import pandas as pd
!wget https://raw.githubusercontent.com/codebasics/nlp-tutorials/main/11_bag_of_n_grams/Fake_Real_Data.csv
df = pd.read_csv('Fake_Real_Data.csv')


--2024-09-03 20:57:48--  https://raw.githubusercontent.com/codebasics/nlp-tutorials/main/11_bag_of_n_grams/Fake_Real_Data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25876225 (25M) [text/plain]
Saving to: ‘Fake_Real_Data.csv.2’


2024-09-03 20:57:49 (174 MB/s) - ‘Fake_Real_Data.csv.2’ saved [25876225/25876225]



In [3]:
# shape of the dataframe (a notebook displays only the last expression of a cell, so wrap this in print() to see it)
df.shape

# top 5 rows
df.head()

|   | Text | label |
|---|------|-------|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake |
| 1 | U.S. conservative leader optimistic of common ... | Real |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake |
| 4 | Democrats say Trump agrees to work on immigrat... | Real |


In [4]:
# prompt: how to check the distribution of labels?

df['label'].value_counts()


| label | count |
|-------|-------|
| Fake | 5000 |
| Real | 4900 |


In [5]:
min_samples = 3000 # rows to sample per class; Real is the minority class with 4900 rows, so both classes have enough


df_fake = df[df.label=="Fake"].sample(min_samples, random_state=2022)
df_real = df[df.label=="Real"].sample(min_samples, random_state=2022)

df_balanced = pd.concat([df_fake,df_real],axis=0)
df_balanced.label.value_counts()

| label | count |
|-------|-------|
| Fake | 3000 |
| Real | 3000 |
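The cell above balances the classes by sampling a fixed 3000 rows from each. A possible alternative, sketched here on a toy frame invented for illustration, derives the per-class size from the minority class and samples all classes in one call with pandas' `groupby(...).sample(...)`. The names `toy`, `n_per_class`, and `balanced` are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for df (hypothetical data)
toy = pd.DataFrame({
    "Text": [f"doc {i}" for i in range(10)],
    "label": ["Fake"] * 6 + ["Real"] * 4,
})

# Undersample every class down to the size of the smallest one
n_per_class = toy["label"].value_counts().min()
balanced = toy.groupby("label").sample(n=n_per_class, random_state=2022)

counts = balanced["label"].value_counts().to_dict()
print(counts)
```

This keeps as much data as the minority class allows, whereas a fixed number such as 3000 trades some data for a round sample size.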


In [6]:
df_balanced.head()

|   | Text | label |
|---|------|-------|
| 6715 | Trump Throws An EPIC, Brain-Melting Tantrum A... | Fake |
| 356 | GOP Senator Just Smacked Down The Most Puncha... | Fake |
| 7317 | Trump Was Just Busted For Daydreaming At G20 ... | Fake |
| 6051 | These Banks Are Fighting The Cycle Of Poverty... | Fake |
| 7257 | Trump Jr. Forced To Admit To Meeting With Sha... | Fake |


In [7]:
# Add a new column "label_num" that encodes each label as a number: Real -> 1, anything else (here Fake) -> 0
def get_label_num(x):
  if x == 'Real':
    return 1
  else:
    return 0

df_balanced['label_num'] = df_balanced['label'].apply(get_label_num)

#check the results with top 5 rows


In [8]:
df_balanced.head()

|   | Text | label | label_num |
|---|------|-------|-----------|
| 6715 | Trump Throws An EPIC, Brain-Melting Tantrum A... | Fake | 0 |
| 356 | GOP Senator Just Smacked Down The Most Puncha... | Fake | 0 |
| 7317 | Trump Was Just Busted For Daydreaming At G20 ... | Fake | 0 |
| 6051 | These Banks Are Fighting The Cycle Of Poverty... | Fake | 0 |
| 7257 | Trump Jr. Forced To Admit To Meeting With Sha... | Fake | 0 |
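The same encoding can also be written with a dictionary and `Series.map`, shown here as a small sketch on toy labels invented for illustration. One difference to keep in mind: `map` leaves any label missing from the dictionary as NaN, while `get_label_num` sends everything that is not 'Real' to 0.

```python
import pandas as pd

# Toy labels (hypothetical); in the notebook this would be df_balanced['label']
labels = pd.Series(["Fake", "Real", "Real", "Fake"])

# Explicit mapping from label to number
label_num = labels.map({"Fake": 0, "Real": 1})
print(label_num.tolist())
```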


### **Modelling without Pre-processing Text data**

In [9]:
# split data: 80% train, 20% test, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    df_balanced['Text'],
    df_balanced['label_num'],
    test_size=0.2,
    random_state=2022,
    stratify=df_balanced['label_num']
)


In [10]:
# shape of X_train (X_test is checked in a later cell)
X_train.shape


(4800,)

In [11]:
X_train.head()

|   | Text |
|---|------|
| 938 | Trump on Twitter (Dec 20) - Tax Bill The follo... |
| 7540 | Gulf carriers may be in focus under foreign ai... |
| 1701 | Stricter Missouri abortion rules take effect a... |
| 7234 | Trump to meet congressional leaders next week ... |
| 321 | U.S. attack will not lead to military escalati... |


In [12]:
X_test.shape

(1200,)

In [13]:
y_train.value_counts()

| label_num | count |
|-----------|-------|
| 1 | 2400 |
| 0 | 2400 |
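The even 2400/2400 split in the training labels comes from the `stratify` argument of train_test_split, which preserves class proportions in both parts. A minimal sketch on toy data (all values here are invented for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 documents per class
texts = [f"doc {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify=labels keeps the 50/50 class ratio in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=2022, stratify=labels
)

print(Counter(y_tr), Counter(y_te))
```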


**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use **KNN** as the classifier with n_neighbors of 10 and metric as 'euclidean' distance.
- Print the classification report.


In [14]:
#1. create a pipeline object
# note: this cell uses unigrams with Multinomial Naive Bayes,
# not the unigram-to-trigram KNN setup described in the note above
clf = Pipeline([
     ('vectorizer_bow', CountVectorizer(ngram_range = (1, 1))),        # unigrams only
     ('Multi NB', MultinomialNB())
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       600
           1       0.98      0.97      0.98       600

    accuracy                           0.98      1200
   macro avg       0.98      0.98      0.98      1200
weighted avg       0.98      0.98      0.98      1200
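The note for Attempt 1 asks for KNN with Euclidean distance over unigrams, bigrams, and trigrams, while the cell above fits Multinomial Naive Bayes on unigrams. A minimal sketch of the KNN variant follows. It runs on a toy corpus invented for illustration; the sentences and the names `knn_euclidean`, `X_toy`, and `X_test_toy` are assumptions, and in the notebook you would fit on X_train and y_train instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical); 0 = Fake, 1 = Real
X_toy = (["shocking secret exposed you will not believe"] * 8
         + ["senate passes budget bill after long debate"] * 8)
y_toy = [0] * 8 + [1] * 8
X_test_toy = ["shocking secret exposed", "senate passes budget bill"]
y_test_toy = [0, 1]

# 1- to 3-grams feeding a 10-nearest-neighbour vote with Euclidean distance
knn_euclidean = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 3))),
    ("knn", KNeighborsClassifier(n_neighbors=10, metric="euclidean")),
])
knn_euclidean.fit(X_toy, y_toy)

preds = knn_euclidean.predict(X_test_toy)
print(classification_report(y_test_toy, preds))
```

On the real data, KNN on sparse count vectors can be slow at prediction time because every test document is compared against all training documents.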



**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use **KNN** as the classifier with n_neighbors of 10 and metric as 'cosine' distance.
- Print the classification report.


In [15]:
#1. create a pipeline object




#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report
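One way the four steps of Attempt 2 might look is sketched below, again on a toy corpus invented for illustration (the sentences and the name `knn_cosine` are assumptions). Cosine distance compares the direction of the count vectors rather than their length, so a short and a long document that use the same n-grams in similar proportions end up close together.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical); 0 = Fake, 1 = Real
X_toy = (["shocking secret exposed you will not believe"] * 8
         + ["senate passes budget bill after long debate"] * 8)
y_toy = [0] * 8 + [1] * 8
X_test_toy = ["shocking secret exposed", "senate passes budget bill"]
y_test_toy = [0, 1]

# Same vectorizer as Attempt 1; only the distance metric changes
knn_cosine = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 3))),
    ("knn", KNeighborsClassifier(n_neighbors=10, metric="cosine")),
])
knn_cosine.fit(X_toy, y_toy)

preds = knn_cosine.predict(X_test_toy)
print(classification_report(y_test_toy, preds))
```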



**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [16]:
#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report
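A possible shape for Attempt 3 is sketched below on a toy corpus invented for illustration. `ngram_range=(3, 3)` restricts the features to trigrams only; the sentences, the name `rf_trigram`, and the choice of `random_state=2022` for the forest are assumptions, not something the exercise specifies.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical); 0 = Fake, 1 = Real
X_toy = (["shocking secret exposed you will not believe"] * 8
         + ["senate passes budget bill after long debate"] * 8)
y_toy = [0] * 8 + [1] * 8
X_test_toy = ["shocking secret exposed you will not believe",
              "senate passes budget bill after long debate"]
y_test_toy = [0, 1]

# Trigram-only features feeding a random forest
rf_trigram = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(3, 3))),
    ("random_forest", RandomForestClassifier(random_state=2022)),
])
rf_trigram.fit(X_toy, y_toy)

preds = rf_trigram.predict(X_test_toy)
print(classification_report(y_test_toy, preds))
```

With trigrams only, a test document contributes nothing unless it shares at least one three-word sequence with the training vocabulary, which is worth remembering when reading the scores on the real data.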



**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **Multinomial Naive Bayes** as the classifier with an alpha value of 0.75.
- Print the classification report.


In [17]:

#1. create a pipeline object




#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report
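Attempt 4 could be sketched as follows, on a toy corpus invented for illustration (the sentences and the name `nb_bigram` are assumptions). The `alpha` parameter of MultinomialNB is the additive smoothing strength: it keeps n-grams that never occur in a class from forcing that class's probability to zero.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical); 0 = Fake, 1 = Real
X_toy = (["shocking secret exposed you will not believe"] * 8
         + ["senate passes budget bill after long debate"] * 8)
y_toy = [0] * 8 + [1] * 8
X_test_toy = ["shocking secret", "budget bill debate"]
y_test_toy = [0, 1]

# Unigrams and bigrams with smoothed Multinomial Naive Bayes
nb_bigram = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("multi_nb", MultinomialNB(alpha=0.75)),
])
nb_bigram.fit(X_toy, y_toy)

preds = nb_bigram.predict(X_test_toy)
print(classification_report(y_test_toy, preds))
```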


### **Use text pre-processing to remove stop words and punctuation and apply lemmatization**

In [18]:
#use this utility function to get the preprocessed text data

import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [19]:
# create a new column "preprocessed_txt" and use the utility function above to get the clean data
# this will take some time, please be patient
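Creating the new column comes down to applying the utility function to every row of the Text column. The sketch below shows that pattern on a toy frame. So that it runs without the spaCy model, it uses a hypothetical stand-in, `toy_preprocess`, which only lowercases and drops a tiny invented stop-word list and does no lemmatization; in the notebook you would pass the spaCy-based `preprocess` to `apply` instead.

```python
import pandas as pd

# Tiny stop-word list (hypothetical); spaCy's real list is much larger
STOP_WORDS = {"the", "a", "is", "to"}


def toy_preprocess(text):
    # Stand-in for the spaCy-based preprocess(): no lemmatization here
    words = text.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)


# Toy frame (hypothetical); in the notebook this would be df_balanced
toy = pd.DataFrame({"Text": ["The senate is voting", "A plan to cut taxes"]})

# Apply the function row by row and store the result in a new column
toy["preprocessed_txt"] = toy["Text"].apply(toy_preprocess)
print(toy["preprocessed_txt"].tolist())
```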


In [20]:
#print the top 5 rows


**Build a model with pre-processed text**

In [21]:
# Do the train-test split with a test size of 20%, a random state of 2022, and stratified sampling
# Note: use only the "preprocessed_txt" column as the input text for the split




**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [22]:
#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report


**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with unigrams, bigrams, and trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [23]:
#1. create a pipeline object




#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report


In [24]:
#finally print the confusion matrix for the best model
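A minimal sketch of the confusion-matrix step follows, using toy labels invented for illustration; in the notebook you would pass y_test and the y_pred of your best model. The helper `plot_confusion` is a hypothetical name; it draws the matrix with seaborn and matplotlib, which the notebook already imports as `sn` and `plt`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical); 0 = Fake, 1 = Real
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)


def plot_confusion(matrix):
    """Draw the matrix as an annotated heatmap (needs matplotlib and seaborn)."""
    from matplotlib import pyplot as plt
    import seaborn as sn

    plt.figure(figsize=(6, 4))
    sn.heatmap(matrix, annot=True, fmt="d")
    plt.xlabel("Prediction")
    plt.ylabel("Truth")
    plt.show()
```

Calling `plot_confusion(cm)` in a notebook cell renders the heatmap; the off-diagonal cells are the misclassified documents.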



## **Please write down Final Observations**


## [**Solution**](./bag_of_n_grams_exercise_solutions.ipynb)