<a href="https://colab.research.google.com/github/NotMark6/CCMACLRL_EXERCISES_COM221/blob/main/Exercise7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 7: Hate Speech Classification using Multinomial Naive Bayes

Instructions:
- You do not need to split your data. Use the training, validation and test sets provided below.
- Use Multinomial Naive Bayes to train a model that can classify if a sentence if a hate speech or not
- A sentence with a label of zero (0) is classified as non-hate speech
- A sentence with a label of one (1) is classified as a hate speech

Apply text pre-processing techniques such as
- Converting to lowercase
- Stop word Removal
- Removal of digits, special characters
- Stemming or Lemmatization but not both
- Count Vectorizer or TF-IDF Vectorizer but not both

Evaluate your model by:
- Providing input by yourself
- Creating a Confusion Matrix
- Calculating the Accuracy, Precision, Recall and F1-Score

In [484]:
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, precision_score, recall_score, accuracy_score, balanced_accuracy_score, ConfusionMatrixDisplay
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [485]:
splits = {'train': 'unique_train_dataset.csv', 'validation': 'unique_validation_dataset.csv', 'test': 'unique_test_dataset.csv'}

## Training Set

Use this to train your model

In [486]:
df_train = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["train"])
df_train.head(10)

Unnamed: 0,text,label
0,Presidential candidate Mar Roxas implies that ...,1
1,Parang may mali na sumunod ang patalastas ng N...,1
2,Bet ko. Pula Ang Kulay Ng Posas,1
3,[USERNAME] kakampink,0
4,Bakit parang tahimik ang mga PINK about Doc Wi...,1
5,"""Ang sinungaling sa umpisa ay sinungaling hang...",1
6,Leni Kiko,0
7,Nahiya si Binay sa Makati kaya dito na lang sa...,1
8,Another reminderHalalan,0
9,[USERNAME] Maybe because VP Leni Sen Kiko and ...,0


In [487]:
df_train.isnull().any().sum()

0

In [488]:
# converting to lowercase
df_train["text"] = df_train["text"].str.lower()
# removing special characters
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"[@#$%^&*\/\+-_=\{\}<>]", "", x))
# removing digits
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"1234567890", "", x))
# removing stop words
stop_words = stopwords.words("english")
df_train["text"] = df_train["text"].apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))
# removing stop words /filipino
with open("tagalog_stopwords.txt", "r") as file:
  stopwords = file.read().splitlines()
df_train["text"] = df_train["text"].apply(lambda x: " ".join(word for word in x.split() if word not in stopwords))



In [489]:
# applying lemmatization
wnl = WordNetLemmatizer()
df_train["text"] = df_train["text"].apply(lambda x: " ".join(wnl.lemmatize(word, "v") for word in x.split()))
df_train[["text"]].head()

Unnamed: 0,text
0,presidential candidate mar roxas imply govt li...
1,parang mali sumunod patalastas nescaf coffee b...
2,bet pula kulay posas
3,username kakampink
4,parang tahimik pink doc willie ong reaction paper


In [490]:
df_train.head()

Unnamed: 0,text,label
0,presidential candidate mar roxas imply govt li...,1
1,parang mali sumunod patalastas nescaf coffee b...,1
2,bet pula kulay posas,1
3,username kakampink,0
4,parang tahimik pink doc willie ong reaction paper,1


In [491]:
#checking accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred_test)*100)

#Making Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred_test)
print(cm)

81.99288256227759
[[1048  364]
 [ 142 1256]]


In [492]:
df_test = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["test"])

In [493]:
X = df_train["text"].values
Y = df_train["label"].values
x_test = df_test["text"].values
y_test = df_test["label"].values


In [494]:
vect = TfidfVectorizer(stop_words='english',max_df=0.5)

#fitting train data and then transforming it to count matrix#fitting
X_train_transformed = vect.fit_transform(X)
#print(x_train)

#transforming the test data into the count matrix initiated for train data
X_test_transformed = vect.transform(x_test)


In [495]:
# importing naive bayes algorithm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

#fitting the model into train data
nb.fit(X_train_transformed, Y)

#predicting the model on train and test data
y_pred_test = nb.predict(X_test_transformed)
y_pred_train = nb.predict(X_train_transformed)



## Validation Set

Use this set to evaluate your model

In [496]:
df_validation = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["validation"])

In [497]:
#checking accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred_test)*100)

#Making Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred_test)
print(cm)

81.70818505338077
[[1041  371]
 [ 143 1255]]


## Test Set

Use this set to test your model

In [498]:
df_test.isnull().any().sum()

0

In [499]:
#checking accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred_test)*100)

#Making Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred_test)
print(cm)

81.70818505338077
[[1041  371]
 [ 143 1255]]


In [500]:
new_text = pd.Series('ambobo mo naman')
new_text_transform = vect.transform(new_text)
print(" The message is a" ,nb.predict(new_text_transform))

 The message is a [1]
