# Naive Bayes Classifier

This project is an implementation of the Naive Bayes algorithm for text classification.

A dataset of user comments from an online store is given. Our goal is to use Naive Bayes to analyze the sentiment of each comment and predict whether the user will recommend the product or not. The steps are as follows:<br>
1. Dataset Preparation: load the dataset and perform basic preprocessing. The dataset is then split into train and validation sets.
2. Feature Engineering: transform the raw dataset into flat features that a machine learning model can use. This step also includes creating new features from the existing data.
3. Model Training: train a machine learning model on the labelled dataset.

In [28]:
import numpy as np
import pandas as pd
from hazm import *
import re
import string
import random
from collections import Counter


A list of stopwords is loaded below for later processing. The commented-out line reads a stopword file taken from Kharazi/Persian-Stopwords [1]; the active line uses hazm's `stopwords_list()` instead.

In [77]:
#stop_words = [re.sub(r'\u200c','',line.rstrip('\n')) for line in open('stop_words.txt',mode="r", encoding="utf-8")]
stop_words = stopwords_list()


Train and test datasets are provided from Digikala, an e-commerce company.

In [78]:
test = pd.read_csv('comment_test.csv')
train = pd.read_csv('comment_train.csv')

In [46]:
train

Unnamed: 0,title,comment,recommend
0,زیبا اما کم دوام,با وجود سابقه خوبی که از برند ایرانی نهرین سرا...,not_recommended
1,بسیار عالی,بسیار عالی,recommended
2,سلام,من الان ۳ هفته هست استفاده میکنم\r\nبرای کسایی...,not_recommended
3,به درد نمیخورهههه,عمرش کمه تا یه هفته بیشتر نمیشه استفاده کرد یا...,not_recommended
4,کلمن آب,فکر کنین کلمن بخرین با ذوق. کلی پولشو بدین. به...,not_recommended
...,...,...,...
5995,جنسش عالیه,خیلی جنس پارچش نرم ولطیفه خیلیم جنسش خوبه اما ...,recommended
5996,خرید محصول,سلام.واقعا فکر نمی کردم به این راحتی اصلاح کنم...,recommended
5997,تعریف,من از دیجی کالا خریدم خیلی زود دستم رسید،زیبا،...,recommended
5998,اصلا چای ماچا نیسش,یا شرکت نمیدونسته چای ماچا امپریال چیه یا واقع...,not_recommended


Only the comment column of the dataset is used from here on, for simplicity.<br>
To train our classifier, we need to transform the comments into numbers, since the algorithm works only with numeric input. Below are two functions: one preprocesses and cleans the text, and the other builds a feature matrix that stores the number of occurrences of each word in each comment.

In [47]:
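def cleanText(text):
    """Normalize a comment and strip characters that are not Persian words."""
    normalizer = Normalizer()
    #stemer = Stemmer()
    #text = stemer.stem(text)
    #lemmantiz = Lemmatizer()
    #text = lemmantiz.lemmatize(text)
    text = normalizer.normalize(text)
    text = re.sub(r'(\u200c)*','',text)    # remove zero-width non-joiners
    text = re.sub(r'(\r)*','',text)        # remove carriage returns
    text = re.sub(r'\d+', '', text)        # remove digits
    text = re.sub(r'[A-Za-z]+', '', text)  # remove Latin letters
    text = re.sub(r'[^\w\s]', '', text)    # remove punctuation
    text = re.sub(r'\s\s*',' ',text)       # collapse whitespace runs into one space
    # drop stopwords that are surrounded by spaces
    for word in stop_words: 
        text = text.replace(' ' + word + ' ', ' ') 
    return text

def bag_of_words(document,unique_words):
    #construction of the bag of words matrix (rows: documents, columns: words)
    number_of_documents = len(document)
    number_of_words = len(unique_words)
    bag_of_words  = np.zeros((number_of_documents, number_of_words))
    for i in range(number_of_documents):
        for j in range(number_of_words):
            # str.count counts substring matches, so a word is also counted
            # when it occurs inside a longer word
            bag_of_words[i, j] = document[i].count(unique_words[j])
    return bag_of_words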
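
As a small illustration of the counting step, the sketch below builds a vocabulary and a count matrix for two toy English sentences. It is a simplified variant and not the notebook's function: it splits on whitespace instead of using hazm's `word_tokenize`, it returns plain lists instead of a NumPy array, and it counts whole tokens with `Counter`, whereas `bag_of_words` above relies on `str.count`. The names `build_vocabulary` and `count_matrix` are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the unique tokens in first-seen order (whitespace tokenization)."""
    # dict.fromkeys keeps insertion order and avoids repeated list scans.
    return list(dict.fromkeys(tok for doc in documents for tok in doc.split()))


def count_matrix(documents, vocabulary):
    """Return one row per document holding whole-token counts per vocabulary word."""
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(documents):
        for word, n in Counter(doc.split()).items():
            if word in index:  # words outside the vocabulary are ignored
                matrix[i][index[word]] = n
    return matrix


docs = ["good product good price", "bad product"]
vocab = build_vocabulary(docs)
X = count_matrix(docs, vocab)
print(vocab)  # ['good', 'product', 'price', 'bad']
print(X)      # [[2, 1, 1, 0], [0, 1, 0, 1]]
```

Counting whole tokens avoids matching a short word inside a longer one, which may or may not be desirable for a given dataset.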

In the cleanText function, a stemmer and a lemmatizer were tried (both are now commented out). Both reduce a word to its root; stemming follows a fixed sequence of rule-based steps, which makes it faster than lemmatization. Since many verbs and words share the same root, I expected better performance with one of these techniques. However, the results did not show any improvement.

In this problem we want to classify a comment based on its features, so we have to calculate P(recommend|features). To do so, P(feature|recommend) and P(feature|not_recommend) are computed first, and then P(recommend) and P(not_recommend).<br>
Posterior = P(recommend|features) and P(not_recommend|features)<br>
Likelihood = P(feature|recommend) and P(feature|not_recommend)<br>
Prior = P(recommend) and P(not_recommend)<br>
Evidence = P(features)
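
Written out, and assuming the usual naive independence of the features given the class, the four terms combine as follows. The symbols are notation introduced here rather than taken from the notebook: $c$ is a class, $w_j$ is the $j$-th vocabulary word, $x_j$ is its count in the comment, and $V$ is the vocabulary size.

$$
P(c \mid \text{features}) = \frac{P(\text{features} \mid c)\, P(c)}{P(\text{features})},
\qquad
P(\text{features} \mid c) \propto \prod_{j=1}^{V} P(w_j \mid c)^{x_j}
$$

Because the evidence $P(\text{features})$ is the same for both classes, it can be dropped when comparing them, and taking logarithms gives the decision rule

$$
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{j=1}^{V} x_j \log P(w_j \mid c) \Big]
$$

which is the quantity that `predict_prob` of the `MultinomialNB` class computes for each class.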

In [48]:
class MultinomialNB(object):
    """Multinomial Naive Bayes in log space with additive smoothing (alpha)."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # group by class
        separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
        count_sample = X.shape[0]
        self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated] #P(c)
        # per-class word counts, with alpha added to every count
        count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
        # normalize each class row so the word probabilities sum to 1
        self.feature_log_prob_ = np.log(count / count.sum(axis=1)[np.newaxis].T) #P(x|c)
        
    def predict_prob(self,X):
        # log P(c) + sum_j x_j * log P(w_j|c) for every class
        return [(self.feature_log_prob_ * x).sum(axis=1) + self.class_log_prior_ for x in X]
    
    def predict(self,X):
        return np.argmax(self.predict_prob(X), axis=1)

class NaiveBayes(object):
    """Variant without logarithms, used for the runs without smoothing."""
    def __init__(self, alpha=0.0):
        self.alpha = alpha

    def fit(self, X, y):
        # group by class
        separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
        count_sample = X.shape[0]
        self.class_prior_ = [np.round(len(i) / count_sample,5) for i in separated] #P(c)
        count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
        self.feature_prob_ = np.round(count / count.sum(axis=1)[np.newaxis].T,5) #P(x|c)
        
    def predict_prob(self,X):
        # score = sum_j x_j * P(w_j|c) + P(c): raw probabilities are combined
        # additively here, not multiplied
        return [(self.feature_prob_ * x).sum(axis=1) + self.class_prior_ for x in X]
    
    def predict(self,X):
        return np.argmax(self.predict_prob(X), axis=1)

In the MultinomialNB implementation above, log probabilities are used to avoid floating-point underflow, and alpha is used for additive smoothing. Whenever a word in a comment never appears in the training comments of a class, its estimated probability for that class is zero, so the whole product for that class becomes zero and all other words have no impact on the result. Additive smoothing is used to overcome this problem.<br>P(a very good product | not_recommend) = P(a | not_recommend) x P(very | not_recommend) x P(good | not_recommend) x P(product | not_recommend)
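
The underflow issue can be seen with a few lines of plain Python. This is only an illustrative sketch: the 400 words and the per-word likelihood of 0.001 are made-up numbers, not values from the dataset.

```python
import math

# Hypothetical per-word likelihoods: 400 words, each with probability 1e-3.
word_probs = [1e-3] * 400

product = 1.0
for p in word_probs:
    product *= p  # drops below the smallest positive float and becomes 0.0

# Summing logarithms instead keeps the score finite.
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # 0.0
print(log_sum)  # a large negative but finite number
```

Since the logarithm is monotonic, comparing the summed log scores of the two classes selects the same class as comparing the products would, as long as the products do not underflow.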

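In additive smoothing, a constant (alpha) is added to the count of each feature, that is, to the numerator of each feature's probability, and the denominator grows accordingly so that the probabilities still sum to one. This prevents any probability from being zero.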
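
A minimal numeric sketch of this idea follows, mirroring the `count + alpha` normalization in `fit` but on a single class with three made-up word counts. The helper name `smoothed_probs` is hypothetical.

```python
def smoothed_probs(counts, alpha):
    """Return P(word | class) for one class from raw word counts."""
    # Adding alpha to every count raises the total by alpha * vocabulary size.
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Hypothetical counts of three words in one class; the last word never occurs.
counts = [3, 1, 0]

unsmoothed = smoothed_probs(counts, alpha=0.0)
smoothed = smoothed_probs(counts, alpha=1.0)

print(unsmoothed)  # the last probability is exactly 0
print(smoothed)    # every probability is positive and they still sum to 1
```

With alpha = 1 the unseen word gets probability 1/7 instead of 0, so a comment containing it no longer forces the class score to zero (or to minus infinity in log space).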



In [49]:
def precision(TP,TN,FP,FN):
    return TP / (TP + FP)
def recall(TP,TN,FP,FN):
    return TP / (TP + FN)
def F1(precision,recall):
    return (2*precision*recall)/(precision+recall)
def Accuracy(TP,TN,FP,FN):
    return (TP + TN)/(TP + TN + FP + FN)

Neither precision nor recall alone is sufficient to analyze the result of a model. Consider a model that predicts only one comment as recommended, and that comment really is recommended, but it misses all the other recommended comments: precision is 100%, yet the model is not a good one. On the other hand, a model that predicts every comment as recommended reaches a recall of 100%, but none of the not_recommended comments are detected.<br>
We use the F1 score, the harmonic mean of precision and recall, which gives a better measure of the incorrectly classified cases than accuracy. Moreover, it penalizes extreme values.
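
The two scenarios above can be checked on a toy example. The sketch below assumes ten made-up comments (five recommended, encoded as 1, and five not recommended, encoded as 0); the helper `scores` and the guards against division by zero are additions for illustration and are not part of the notebook's metric functions.

```python
def scores(y_true, y_pred):
    """Return (precision, recall, f1) with class 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1] * 5 + [0] * 5

# Scenario 1: only one comment is predicted as recommended, and it is correct.
cautious = [1] + [0] * 9
# Scenario 2: every comment is predicted as recommended.
greedy = [1] * 10

print(scores(y_true, cautious))  # precision 1.0, recall 0.2, F1 about 0.33
print(scores(y_true, greedy))    # precision 0.5, recall 1.0, F1 about 0.67
```

In both cases one of the two metrics is perfect, while the F1 score stays well below 1 because the harmonic mean is pulled toward the smaller value.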

## Preprocessing + Additive Smoothing

In [79]:
x_train = list(map(lambda x: cleanText(x),train['comment']))
x_test = list(map(lambda x: cleanText(x),test['comment']))

In [80]:
unique_words = list()
for sentence in x_train:
    tokens = word_tokenize(sentence)
    for token in tokens:
        if token not in unique_words:
            unique_words.append(token)

In [81]:
bag_of_words_train = bag_of_words(x_train,unique_words)
bag_of_words_test = bag_of_words(x_test,unique_words)

In [82]:
X_train = pd.DataFrame(bag_of_words_train)
X_test = pd.DataFrame(bag_of_words_test)

In [83]:
# y_train = train['recommend']
# y_test = test['recommend']
y_train = train['recommend'].astype('category').cat.codes
y_test = test['recommend'].astype('category').cat.codes

In [84]:
model = MultinomialNB()
model.fit(X_train.to_numpy(),y_train.to_numpy())
y= model.predict(X_test.to_numpy())
df_confusion = pd.crosstab(y_test, y)
FN = df_confusion[0][1]
TN = df_confusion[0][0]
TP = df_confusion[1][1]
FP = df_confusion[1][0]
preci = precision(TP,TN,FP,FN)
recal = recall(TP,TN,FP,FN)
print("Accuracy :  %f" % (Accuracy(TP,TN,FP,FN)))
print("Precision :  %f" % (preci))
print("Recall :  %f" % (recal))
print("F1 :  %f" % (F1(preci,recal)))
wron_id =  [i for i in range(len(y_test)) if y_test[i] != y[i]]

Accuracy :  0.887500
Precision :  0.878049
Recall :  0.900000
F1 :  0.888889


## Only Additive Smoothing

In [85]:
x_train = list(train['comment'])
x_test = list(test['comment'])

unique_words = list()
for sentence in x_train:
    tokens = word_tokenize(sentence)
    for token in tokens:
        if token not in unique_words:
            unique_words.append(token)

bag_of_words_train = bag_of_words(x_train,unique_words)
bag_of_words_test = bag_of_words(x_test,unique_words)

X_train = pd.DataFrame(bag_of_words_train)
X_test = pd.DataFrame(bag_of_words_test)

y_train = train['recommend'].astype('category').cat.codes
y_test = test['recommend'].astype('category').cat.codes

model = MultinomialNB(alpha=1)
model.fit(X_train.to_numpy(),y_train.to_numpy())
y= model.predict(X_test.to_numpy())
df_confusion = pd.crosstab(y_test, y)
FN = df_confusion[0][1]
TN = df_confusion[0][0]
TP = df_confusion[1][1]
FP = df_confusion[1][0]
preci = precision(TP,TN,FP,FN)
recal = recall(TP,TN,FP,FN)
print("Accuracy :  %f" % (Accuracy(TP,TN,FP,FN)))
print("Precision :  %f" % (preci))
print("Recall :  %f" % (recal))
print("F1 :  %f" % (F1(preci,recal)))

Accuracy :  0.885000
Precision :  0.871981
Recall :  0.902500
F1 :  0.886978


## Only Preprocessing

In [86]:
x_train = list(map(lambda x: cleanText(x),train['comment']))
x_test = list(map(lambda x: cleanText(x),test['comment']))

unique_words = list()
for sentence in x_train:
    tokens = word_tokenize(sentence)
    for token in tokens:
        if token not in unique_words:
            unique_words.append(token)

bag_of_words_train = bag_of_words(x_train,unique_words)
bag_of_words_test = bag_of_words(x_test,unique_words)

X_train = pd.DataFrame(bag_of_words_train)
X_test = pd.DataFrame(bag_of_words_test)

y_train = train['recommend'].astype('category').cat.codes
y_test = test['recommend'].astype('category').cat.codes

model = NaiveBayes(alpha=0)
model.fit(X_train.to_numpy(),y_train.to_numpy())
y= model.predict(X_test.to_numpy())
df_confusion = pd.crosstab(y_test, y)
FN = df_confusion[0][1]
TN = df_confusion[0][0]
TP = df_confusion[1][1]
FP = df_confusion[1][0]
preci = precision(TP,TN,FP,FN)
recal = recall(TP,TN,FP,FN)
print("Accuracy :  %f" % (Accuracy(TP,TN,FP,FN)))
print("Precision :  %f" % (preci))
print("Recall :  %f" % (recal))
print("F1 :  %f" % (F1(preci,recal)))

Accuracy :  0.641250
Precision :  0.591870
Recall :  0.910000
F1 :  0.717241


## Neither Preprocessing nor Additive Smoothing

In [87]:
x_train = list(train['comment'])
x_test = list(test['comment'])

unique_words = list()
for sentence in x_train:
    tokens = word_tokenize(sentence)
    for token in tokens:
        if token not in unique_words:
            unique_words.append(token)
            
bag_of_words_train = bag_of_words(x_train,unique_words)
bag_of_words_test = bag_of_words(x_test,unique_words)

X_train = pd.DataFrame(bag_of_words_train)
X_test = pd.DataFrame(bag_of_words_test)

y_train = train['recommend'].astype('category').cat.codes
y_test = test['recommend'].astype('category').cat.codes

model = NaiveBayes(alpha=0)
model.fit(X_train.to_numpy(),y_train.to_numpy())
y= model.predict(X_test.to_numpy())
df_confusion = pd.crosstab(y_test, y)
FN = df_confusion[0][1]
TN = df_confusion[0][0]
TP = df_confusion[1][1]
FP = df_confusion[1][0]
preci = precision(TP,TN,FP,FN)
recal = recall(TP,TN,FP,FN)
print("Accuracy :  %f" % (Accuracy(TP,TN,FP,FN)))
print("Precision :  %f" % (preci))
print("Recall :  %f" % (recal))
print("F1 :  %f" % (F1(preci,recal)))

Accuracy :  0.682500
Precision :  0.633212
Recall :  0.867500
F1 :  0.732068


| Configuration | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Preprocessing + additive smoothing | 0.8875 | 0.8780 | 0.9000 | 0.8889 |
| Preprocessing only | 0.6413 | 0.5919 | 0.9100 | 0.7172 |
| Additive smoothing only | 0.8850 | 0.8720 | 0.9025 | 0.8870 |
| Neither | 0.6825 | 0.6332 | 0.8675 | 0.7321 |


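Additive smoothing greatly improves the accuracy of the model (from roughly 0.64-0.68 to roughly 0.89), while preprocessing does not seem that helpful. With smoothing, the results with and without preprocessing are nearly identical. Without smoothing, preprocessing improves recall (0.91 versus 0.87), meaning that the model correctly identifies more of the recommended comments, but it lowers precision and accuracy. A recall of 1 would mean that no recommended comment is wrongly labeled as not_recommended.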

In [74]:
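# Note: `i` is the loop position, not a row label, so this prints the first
# len(wron_id) rows of `test`. Use `test.iloc[wron_id[i]]` to print the
# misclassified rows themselves.
for i in range(len(wron_id)):
    print(i)
    print(test.iloc[i])
    print('---------------------')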

0
title                                                  وری گود
comment      تازه خریدم یه مدت کار بکنه مشخص میشه کیفیت قطعاتش
recommend                                          recommended
Name: 0, dtype: object
---------------------
1
title         زیاد مناسب نیست رنگ پس میده یه وقتایی موقع نوشتن
comment      با این قیمت گزینه های بهتری هم میشه گرفت.\r\nر...
recommend                                      not_recommended
Name: 1, dtype: object
---------------------
2
title                                                پنکه گوشی
comment      خیلی عالیه، فقط کاش از اون سمتش میشد به پاوربا...
recommend                                          recommended
Name: 2, dtype: object
---------------------
3
title                                         دستگاه خیلی ضعیف
comment      من این فیس براس چند روز یپش به دستم رسید و الا...
recommend                                      not_recommended
Name: 3, dtype: object
---------------------
4
title                                              عال

Using n-grams in feature extraction might improve the accuracy, since many word combinations have the opposite meaning from the individual words taken separately.<br>
Using titles in addition to comments might help as well.
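
As a rough sketch of the n-gram idea, the helper below turns a token list into word n-grams that could serve as extra vocabulary entries. It was not part of the experiments above; the function name `ngrams` and the English example sentence are hypothetical, and whitespace splitting stands in for hazm's `word_tokenize`.

```python
def ngrams(tokens, n):
    """Return the list of n consecutive tokens joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "not good at all".split()
bigrams = ngrams(tokens, 2)
print(bigrams)  # ['not good', 'good at', 'at all']
```

With unigrams alone, "good" counts as evidence for a recommendation; the bigram "not good" lets the model learn the negated meaning as a separate feature.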

References<br>
[1] GitHub. 2020. Kharazi/Persian-Stopwords. [online] Available at: <https://github.com/kharazi/persian-stopwords> [Accessed 18 November 2020].

[2] Kenzo's Blog. 2020. Naive Bayes From Scratch In Python. [online] Available at: <http://kenzotakahashi.github.io/naive-bayes-from-scratch-in-python.html> [Accessed 21 November 2020].