<a href="https://colab.research.google.com/github/robitussin/CCMACLRL_EXERCISES/blob/main/Exercise7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 7: Hate Speech Classification using Multinomial Naive Bayes

Instructions:
- You do not need to split your data. Use the training, validation and test sets provided below.
- Use Multinomial Naive Bayes to train a model that classifies whether a sentence is hate speech or non-hate speech
- A sentence with a label of zero (0) is classified as non-hate speech
- A sentence with a label of one (1) is classified as hate speech

Apply text pre-processing techniques such as:
- Converting to lowercase
- Stop word removal
- Removal of digits and special characters
- Stemming or lemmatization, but not both
- Count Vectorizer or TF-IDF Vectorizer, but not both

Evaluate your model by:
- Providing input by yourself
- Creating a Confusion Matrix
- Calculating the Accuracy, Precision, Recall and F1-Score

In [220]:
import pandas as pd
import re

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Neil\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Neil\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [168]:
splits = {'train': 'unique_train_dataset.csv', 'validation': 'unique_validation_dataset.csv', 'test': 'unique_test_dataset.csv'}

**Training Set**

Use this to train your model

In [169]:
df_train = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["train"])

**Validation Set**

Use this set to evaluate your model

In [170]:
df_validation = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["validation"])

In [115]:
df_validation

Unnamed: 0,text,label
0,VinTee [USERNAME] [USERNAME] and [USERNAME] Ka...,1
1,binay's sidekicks were employees of makati cit...,1
2,This is expected as we use different methodol...,0
3,Ang tanga tanga talaga ni Nancy Binay eh. Tskkk.,1
4,Binay giving away bracelets after every selfie...,0
...,...,...
2795,DAYS NA LANG ELEKSYON NA! Alam mo na ba kung s...,0
2796,MRT-Mar Roxas Tanga,1
2797,Between Jan. 1 and Nov. 30 last yearBinay alre...,0
2798,Yes burger tayo jan Let Leni LeadKakampink Len...,0


**Test Set**
  
Use this set to test your model

In [171]:
df_test = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["test"])

In [172]:
df_test

Unnamed: 0,text,label
0,Binay: Patuloy ang kahirapan dahil sa maling p...,0
1,SA GOBYERNONG TAPAT WELCOME SA BAGUO ANG LAHAT...,0
2,wait so ur telling me Let Leni Lead mo pero NY...,1
3,[USERNAME]wish this is just a nightmare that ...,0
4,doc willie ong and isko sabunutan po,0
...,...,...
2805,[USERNAME] January xx LENI ROBREDO declinedNo ...,0
2806,Cena__O [USERNAME] Marcos MagnanakawCes Ore a-...,1
2807,JUSKO PLEASE HO WAG HO SI ROXAS!! DUTERTE OR M...,1
2808,Isko Moreno Norberto Gonzales & Ping Lacson ar...,0


## A. Understanding your training data

1. Check the first 10 rows of the training dataset

In [173]:
df_train.head(10)

Unnamed: 0,text,label
0,Presidential candidate Mar Roxas implies that ...,1
1,Parang may mali na sumunod ang patalastas ng N...,1
2,Bet ko. Pula Ang Kulay Ng Posas,1
3,[USERNAME] kakampink,0
4,Bakit parang tahimik ang mga PINK about Doc Wi...,1
5,"""Ang sinungaling sa umpisa ay sinungaling hang...",1
6,Leni Kiko,0
7,Nahiya si Binay sa Makati kaya dito na lang sa...,1
8,Another reminderHalalan,0
9,[USERNAME] Maybe because VP Leni Sen Kiko and ...,0


2. Check how many rows and columns are in the training dataset using `.info()`

In [77]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21773 entries, 0 to 21772
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    21773 non-null  object
 1   label   21773 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 340.3+ KB


3. Check for NaN values

In [179]:
df_train.isnull().sum()

text     0
label    0
dtype: int64

4. Check for duplicate rows

In [182]:
df_train.duplicated().sum()

np.int64(0)

5. Check how many rows belong to each class

In [180]:
print(len(df_train['text']))
print(len(df_train['label']))
print(len(df_validation['text']))
print(len(df_validation['label']))

21773
21773
2800
2800
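The cell above prints the total row counts of the training and validation sets, not the number of rows per class. A minimal sketch of a per-class count follows, using the ten labels shown by `df_train.head(10)` as a stand-in for the full column; on the actual DataFrame, `df_train['label'].value_counts()` should give the same kind of breakdown. The variable names are illustrative only.

```python
from collections import Counter

# Labels of the first ten training rows, standing in for df_train['label']
# (0 = non-hate speech, 1 = hate speech)
sample_labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]

class_counts = Counter(sample_labels)
total = len(sample_labels)

# Report the count and share of each class, in label order
for label in sorted(class_counts):
    share = class_counts[label] / total
    print(f"label {label}: {class_counts[label]} rows ({share:.0%})")
```

A strongly skewed split would be a reason to read accuracy with caution and to rely more on precision, recall and F1.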


## B. Text pre-processing

6. Remove duplicate rows

In [181]:
df_train.drop_duplicates(inplace=True)
df_train

Unnamed: 0,text,label
0,presidential candidate mar roxas imply govt li...,1
1,parang mali sumunod patalastas nescaf coffee b...,1
2,bet pula kulay posas,1
3,username kakampink,0
4,parang tahimik pink doc willie ong reaction paper,1
...,...,...
21768,marcos talunan marcos magnanakaw,1
21769,grabe kayo kay binay,0
21770,username cnu ba naman hindimabibighani maamkak...,0
21771,rt username tabi tabi yung nagsasabing parang ...,1


7. Remove rows with NaN values

In [82]:
df_train.dropna(inplace=True)
df_train.isnull().sum()

text     0
label    0
dtype: int64

8. Convert all text to lowercase

In [174]:
df_train['text'] = df_train['text'].str.lower()
df_validation['text'] = df_validation['text'].str.lower()
df_train

Unnamed: 0,text,label
0,presidential candidate mar roxas implies that ...,1
1,parang may mali na sumunod ang patalastas ng n...,1
2,bet ko. pula ang kulay ng posas,1
3,[username] kakampink,0
4,bakit parang tahimik ang mga pink about doc wi...,1
...,...,...
21768,marcos talunan marcos magnanakaw,1
21769,grabe kayo kay binay ??????????,0
21770,[username] cnu ba naman ang hindimabibighani s...,0
21771,rt [username]: tabi tabi yung mga nagsasabing ...,1


9. Remove digits, URLs and special characters

In [175]:
def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r"\d+", "", text)      # Remove digits
        text = re.sub(r"http\S+", "", text)  # Remove URLs
        text = re.sub(r"[^\w\s]", "", text)  # Remove special characters
    return text  # Non-string values (e.g. NaN) are returned unchanged

df_train['text'] = df_train['text'].apply(clean_text)
df_validation['text'] = df_validation['text'].apply(clean_text)

10. Remove stop words

In [176]:
english_stopwords = set(stopwords.words('english'))

tagalog_stopwords = set([
    'akin', 'aking', 'ako', 'alin', 'am', 'amin', 'aming', 'ang', 'ano', 'anumang',
    'apat', 'at', 'atin', 'ating', 'ay', 'bababa', 'bago', 'bakit', 'bawat', 
    'bilang', 'dahil', 'dalawa', 'dapat', 'din', 'dito', 'doon', 'gagawin', 
    'gayunman', 'ginagawa', 'ginawa', 'ginawang', 'gumawa', 'gusto', 'habang', 
    'hanggang', 'hindi', 'huwag', 'iba', 'ibaba', 'ibabaw', 'ibig', 'ikaw', 
    'ilagay', 'ilalim', 'ilan', 'inyong', 'isa', 'isang', 'itaas', 'ito', 'iyo', 
    'iyon', 'iyong', 'ka', 'kahit', 'kailangan', 'kailanman', 'kami', 'kanila', 
    'kanilang', 'kanino', 'kanya', 'kanyang', 'kapag', 'kapwa', 'karamihan', 
    'katiyakan', 'katulad', 'kaya', 'kaysa', 'ko', 'kong', 'kulang', 'kumuha', 
    'kung', 'laban', 'lahat', 'lamang', 'likod', 'lima', 'maaari', 'maaaring', 
    'maging', 'mahusay', 'makita', 'marami', 'marapat', 'masyado', 'may', 
    'mayroon', 'mga', 'minsan', 'mismo', 'mula', 'muli', 'na', 'nabanggit', 
    'naging', 'nagkaroon', 'nais', 'nakita', 'namin', 'napaka', 'narito', 
    'nasaan', 'ng', 'ngayon', 'ni', 'nila', 'nilang', 'nito', 'niya', 'niyang', 
    'noon', 'o', 'pa', 'paano', 'pababa', 'paggawa', 'pagitan', 'pagkakaroon', 
    'pagkatapos', 'palabas', 'pamamagitan', 'panahon', 'pangalawa', 'para', 
    'paraan', 'pareho', 'pataas', 'pero', 'pumunta', 'pumupunta', 'sa', 
    'saan', 'sabi', 'sabihin', 'sarili', 'sila', 'sino', 'siya', 'tatlo', 
    'tayo', 'tulad', 'tungkol', 'una', 'walang'
])

def remove_stopwords(text):
    if isinstance(text, str):
        text = ' '.join(
            [word for word in text.split() 
             if word.lower() not in english_stopwords and word.lower() not in tagalog_stopwords]
        )
    return text
    
df_train['text'] = df_train['text'].apply(remove_stopwords)
df_validation['text'] = df_validation['text'].apply(remove_stopwords)

11. Use Stemming or Lemmatization

In [177]:
wnl = WordNetLemmatizer()
df_train["text"] = df_train["text"].apply(lambda x: " ".join(wnl.lemmatize(word, "v") for word in x.split()))
df_validation["text"] = df_validation["text"].apply(lambda x: " ".join(wnl.lemmatize(word, "v") for word in x.split()))

df_train.head()

Unnamed: 0,text,label
0,presidential candidate mar roxas imply govt li...,1
1,parang mali sumunod patalastas nescaf coffee b...,1
2,bet pula kulay posas,1
3,username kakampink,0
4,parang tahimik pink doc willie ong reaction paper,1


## C. Training your model

12. Put all text training data in variable **X_train**

In [183]:
x_train = df_train['text']

13. Put all training data labels in variable **y_train**

In [184]:
y_train = df_train['label']

14. Use `CountVectorizer()` or `TfidfVectorizer()` to convert text data to its numerical form.

Put the converted data to **X_train_transformed** variable

In [211]:
vectorizer = TfidfVectorizer()
x_train_transformed = vectorizer.fit_transform(df_train['text'])
x_validation_transformed = vectorizer.transform(df_validation['text'])
x_test_transformed = vectorizer.transform(df_test['text'])
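The vectorizer is fitted on the training text only and then reused, unchanged, on the validation and test text. The sketch below illustrates why that matters: the vocabulary and IDF weights come from the training documents, and tokens never seen in training are dropped at transform time. It is a simplified pure-Python illustration, assuming scikit-learn's default smoothed IDF (`ln((1 + n) / (1 + df)) + 1`) and L2 normalization, with whitespace tokenization in place of the library's tokenizer. The helpers `fit_idf` and `transform` are hypothetical names, not part of scikit-learn.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn a vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    # Document frequency: in how many training documents each token appears
    doc_freq = Counter(tok for doc in train_docs for tok in set(doc.split()))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def transform(doc, idf):
    """Weight one document with the learned IDF values, then L2-normalize it."""
    # Tokens that were not seen during fitting are dropped
    counts = Counter(tok for tok in doc.split() if tok in idf)
    weights = {tok: tf * idf[tok] for tok, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}

# Two cleaned training rows taken from the outputs shown earlier
train_docs = ["marcos talunan marcos magnanakaw", "grabe kayo kay binay"]
idf = fit_idf(train_docs)

print(transform("grabe kayo marcos", idf))  # all three tokens are in the vocabulary
print(transform("bagong salita", idf))      # no known tokens, so the vector is empty
```

Calling `fit_transform` again on the validation or test text would learn a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.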

15. Create an instance of `MultinomialNB()`

In [198]:
model = MultinomialNB()

16. Train the model using `.fit()`

In [199]:
model.fit(x_train_transformed, y_train)
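To make the training step less of a black box, here is a rough pure-Python sketch of what a multinomial Naive Bayes classifier estimates: a log prior per class and a Laplace-smoothed log likelihood per token. It assumes raw token counts and a tiny made-up corpus, whereas the notebook feeds TF-IDF weights to scikit-learn's `MultinomialNB`, so it illustrates the idea and does not reproduce the library. The helper names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    vocab = {tok for doc in docs for tok in doc}
    docs_per_class = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_priors = {c: math.log(n / len(labels)) for c, n in docs_per_class.items()}
    log_likelihoods = {}
    for c in docs_per_class:
        # Additive (Laplace) smoothing keeps unseen class/token pairs above zero
        denom = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihoods[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / denom) for tok in vocab
        }
    return log_priors, log_likelihoods

def predict_nb(doc, log_priors, log_likelihoods):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, prior in log_priors.items():
        # Tokens outside the training vocabulary are ignored
        scores[c] = prior + sum(
            log_likelihoods[c][tok] for tok in doc if tok in log_likelihoods[c]
        )
    return max(scores, key=scores.get)

# Made-up toy corpus: 0 = non-hate speech, 1 = hate speech
toy_docs = [["mahal", "kita"], ["salamat", "mahal"], ["tanga", "ka"], ["tanga", "talaga"]]
toy_labels = [0, 0, 1, 1]

log_priors, log_likelihoods = train_nb(toy_docs, toy_labels)
print(predict_nb(["mahal"], log_priors, log_likelihoods))
print(predict_nb(["tanga"], log_priors, log_likelihoods))
```

The `alpha=1.0` default mirrors the smoothing parameter of `MultinomialNB`; lowering it makes the model trust rare tokens more.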

## D. Evaluate your model

17. Use `.predict()` to generate model predictions using the **validation dataset**


- Put all text validation data in **X_validation** variable

- Convert **X_validation** to its numerical form.

- Put the converted data to **X_validation_transformed**

- Put all predictions in **y_validation_pred** variable

In [207]:
y_validation_pred = model.predict(x_validation_transformed)

18. Get the Accuracy, Precision, Recall and F1-Score of the model using the **validation dataset**

- Put all validation data labels in **y_validation** variable

In [222]:
y_validation = df_validation['label']
print("Validation Accuracy: ", accuracy_score(y_validation, y_validation_pred))

print("\nClassification Report:")
print(classification_report(y_validation, y_validation_pred))

Validation Accuracy:  0.8360714285714286

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.79      0.83      1385
           1       0.81      0.88      0.84      1415

    accuracy                           0.84      2800
   macro avg       0.84      0.84      0.84      2800
weighted avg       0.84      0.84      0.84      2800



19. Create a confusion matrix using the **validation dataset**

In [208]:
print("Confusion Matrix:")
print(confusion_matrix(y_validation, y_validation_pred))

Confusion Matrix:
[[1101  284]
 [ 175 1240]]
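The figures in the classification report can be recomputed by hand from this matrix. The sketch below assumes scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, so the cells read `[[TN, FP], [FN, TP]]` with hate speech (1) as the positive class. The function name `binary_metrics` is illustrative.

```python
def binary_metrics(matrix):
    """Compute accuracy, precision, recall and F1 for the positive class (label 1)."""
    (tn, fp), (fn, tp) = matrix
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # of all predicted hate speech, how much was correct
    recall = tp / (tp + fn)     # of all actual hate speech, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Validation confusion matrix reported above
validation_matrix = [[1101, 284], [175, 1240]]

accuracy, precision, recall, f1 = binary_metrics(validation_matrix)
print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

These values line up with the accuracy and the label-1 row of the validation classification report. The 284 false positives against 175 false negatives also show that the model leans toward flagging text as hate speech.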


20. Use `.predict()` to generate the model predictions using the **test dataset**


- Put all text test data in **X_test** variable

- Convert **X_test** to its numerical form.

- Put the converted data to **X_test_transformed**

- Put all predictions in **y_test_pred** variable

In [212]:
y_test_pred = model.predict(x_test_transformed)

21. Get the Accuracy, Precision, Recall and F1-Score of the model using the **test dataset**

- Put all test data labels in **y_test** variable



In [221]:
y_test = df_test['label']
print("Test Accuracy: ", accuracy_score(y_test, y_test_pred))


print("\nClassification Report:")
print(classification_report(y_test, y_test_pred))

Test Accuracy:  0.8327402135231317

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.79      0.83      1412
           1       0.80      0.88      0.84      1398

    accuracy                           0.83      2810
   macro avg       0.84      0.83      0.83      2810
weighted avg       0.84      0.83      0.83      2810



22. Create a confusion matrix using the **test dataset**

In [214]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))

Confusion Matrix:
[[1113  299]
 [ 171 1227]]


## E. Test the model

23. Test the model by providing a non-hate speech input. The model should predict it as 0

In [217]:
# Classify a new Tagalog sentence that should be non-hate speech
new_text = pd.Series("Mahal ko kayo")

# Preprocessing is left disabled, so the raw text is vectorized as-is
#new_text_cleaned = new_text.apply(preprocess_text)

# Transform the new text using the fitted vectorizer (vectorizer)
new_text_transform = vectorizer.transform(new_text)

# Make the prediction using the trained Naive Bayes model (model)
prediction = model.predict(new_text_transform)

# Interpret the prediction result (predict returns a one-element array)
if prediction[0] == 1:
    print("The sentence is classified as hate speech.")
else:
    print("The sentence is classified as non-hate speech.")

The sentence is classified as non-hate speech.


24. Test the model by providing a hate speech input. The model should predict it as 1

In [218]:

# Classify a new Tagalog sentence that should be hate speech
new_text = pd.Series("Putangina Ina")

# Preprocessing is left disabled, so the raw text is vectorized as-is
#new_text_cleaned = new_text.apply(preprocess_text)

# Transform the new text using the fitted vectorizer (vectorizer)
new_text_transform = vectorizer.transform(new_text)

# Make the prediction using the trained Naive Bayes model (model)
prediction = model.predict(new_text_transform)

# Interpret the prediction result (predict returns a one-element array)
if prediction[0] == 1:
    print("The sentence is classified as hate speech.")
else:
    print("The sentence is classified as non-hate speech.")

The sentence is classified as hate speech.
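
The two test cells above pass the raw sentence straight to the vectorizer, without the cleaning steps from Section B. One possible shape for a `preprocess_text` helper is sketched below; it is an assumption, not a function defined in this notebook. It chains the lowercase, digit, URL, special-character and stop-word steps in the same order as Section B, leaves out lemmatization to stay dependency-free, and takes the stop-word set as a parameter (the `demo_stopwords` set here is a small illustrative subset).

```python
import re

def preprocess_text(text, stop_words=frozenset()):
    """Apply the Section B cleaning steps to a single raw sentence."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)      # remove digits
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"[^\w\s]", "", text)  # remove special characters
    # Drop stop words and collapse any leftover whitespace
    return " ".join(word for word in text.split() if word not in stop_words)

# Illustrative subset of the Tagalog stop words listed in Section B
demo_stopwords = {"ko", "ang", "ng"}

cleaned = preprocess_text("Mahal ko kayo!!! 2022 https://example.com", demo_stopwords)
print(cleaned)  # mahal kayo
```

The cleaned string could then be wrapped in a list and passed to `vectorizer.transform(...)` in place of the raw `new_text`, so that new inputs go through the same steps as the training data.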
