### Hadi Heidari Rad - 810197011

#### AI - Naive Bayes Classification

<br>

Reading the data from the file, then using the first 20% of the rows as testing data and the remaining 80% as training data:

In [46]:
import re
import heapq
import pandas as pd
import numpy as np

data = pd.read_csv('train_test.csv')

# Optional: shuffle the rows before splitting.
# data = data.sample(frac = 1)

# The first 20% of the rows become the test set; the rest is the training set.
split_at = round(len(data) * 0.2)
test_data = data.iloc[:split_at]
train_data = data.iloc[split_at:]
test_data.head()

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | چون می‌رود این کشتی سرگشته که آخر | hafez |
| 1 | که همین بود حد امکانش | saadi |
| 2 | ارادتی بنما تا سعادتی ببری | hafez |
| 3 | خدا را زین معما پرده بردار | hafez |
| 4 | گویی که در برابر چشمم مصوری | saadi |
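
The split above keeps the rows in file order, with shuffling left commented out. As a minimal sketch of a shuffled but reproducible split, assuming a fixed seed is acceptable: the helper name `shuffled_split`, its `seed` parameter, and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def shuffled_split(data, test_frac=0.2, seed=0):
    """Shuffle the rows, then return (train, test) with test_frac of the rows in test."""
    shuffled = data.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_test = round(len(shuffled) * test_frac)
    test = shuffled.iloc[:n_test]
    train = shuffled.iloc[n_test:]
    return train, test

# Toy frame standing in for train_test.csv (hypothetical rows).
toy = pd.DataFrame({"text": [f"line {i}" for i in range(10)],
                    "label": ["hafez", "saadi"] * 5})
train, test = shuffled_split(toy)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the reported metrics comparable between runs; changing it gives a different split.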


<br>

Separating the Hafez and Saadi poems in order to build a word dataset for each poet:

In [47]:
hafez_df = train_data[train_data.label != 'saadi']
saadi_df = train_data[train_data.label != 'hafez']

<br>

Using the Naive Bayes formula $P(c|X) \propto P(x_1|c)\,P(x_2|c)\,P(x_3|c) \cdots P(x_n|c)\,P(c)$ (the denominator is omitted, as stated in the project description), we first calculate $P(Hafez)$ and $P(Saadi)$ from the training data, and then the probability of occurrence of each word within each poet's own word dataset, which is $P(x_i|c)$:
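
To make the formula concrete, here is a small sketch with made-up numbers. The priors, the word likelihoods, and the function name `unnormalized_posterior` are all hypothetical; nothing here is taken from the dataset.

```python
from math import prod

# Hypothetical priors and per-class word likelihoods (not taken from the dataset).
priors = {"hafez": 0.4, "saadi": 0.6}
likelihoods = {
    "hafez": {"wine": 0.05, "rose": 0.02},
    "saadi": {"wine": 0.01, "rose": 0.03},
}

def unnormalized_posterior(words, label):
    """P(c) * prod_i P(x_i | c); the shared denominator P(X) is omitted."""
    return priors[label] * prod(likelihoods[label][w] for w in words)

scores = {c: unnormalized_posterior(["wine", "rose"], c) for c in priors}
prediction = max(scores, key=scores.get)
print(prediction)  # hafez
```

Because the denominator is the same for both classes, comparing the unnormalized scores is enough to pick the more likely class.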

In [48]:
def p(favorable, total):
    return favorable / total

p_hafez = p(len(hafez_df) , len(train_data))
p_saadi = p(len(saadi_df) , len(train_data))
print("P(hafez) : " + str( p_hafez ))
print("P(saadi) : " + str( p_saadi ))

P(hafez) : 0.4030279456645323
P(saadi) : 0.5969720543354676


<br>

Splitting each line of poetry into its words:

In [49]:
hafez_poems = hafez_df['text']
hafez_poems_splitted = []
for each in hafez_poems:
    hafez_poems_splitted.append(re.split(' ', each))
    
saadi_poems = saadi_df['text']
saadi_poems_splitted = []
for each in saadi_poems:
    saadi_poems_splitted.append(re.split(' ', each))

whole_train_poems = train_data['text']
train_poems_splitted = []
for each in whole_train_poems:
    train_poems_splitted.append(re.split(' ', each))
    
test_poems = test_data['text']
test_labels = []
for each in test_data['label']:
    test_labels.append(each)
test_poems_splitted = []
for each in test_poems:
    test_poems_splitted.append(re.split(' ', each))  

<br>

Counting how many times each unique word occurs in the Hafez and Saadi word datasets, and in both combined:

In [50]:
whole_words_cnt = 0
words_freq = {}
for poem in train_poems_splitted:
    for word in poem:
        if word not in words_freq:
            words_freq[word] = 1
            whole_words_cnt += 1
        else:
            words_freq[word] += 1
            whole_words_cnt += 1
            
words_freq_cnt = len(words_freq)

hafez_words_cnt = 0
hafez_words_freq = {}
for poem in hafez_poems_splitted:
    for word in poem:
        if word not in hafez_words_freq:
            hafez_words_freq[word] = 1
            hafez_words_cnt += 1
        else:
            hafez_words_freq[word] += 1
            hafez_words_cnt += 1

saadi_words_cnt = 0
saadi_words_freq = {}
for poem in saadi_poems_splitted:
    for word in poem:
        if word not in saadi_words_freq:
            saadi_words_freq[word] = 1
            saadi_words_cnt += 1
        else:
            saadi_words_freq[word] += 1
            saadi_words_cnt += 1
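
The three counting loops above follow the same pattern. One possible way to express that pattern once, sketched with `collections.Counter` from the standard library; the helper name `count_words` and the toy input are hypothetical.

```python
from collections import Counter

def count_words(poems_splitted):
    """Return (word -> count mapping, total number of word occurrences)."""
    freq = Counter(word for poem in poems_splitted for word in poem)
    return freq, sum(freq.values())

# Toy tokenized lines (hypothetical), in the same shape as hafez_poems_splitted.
toy_poems = [["a", "b", "a"], ["b", "c"]]
freq, total = count_words(toy_poems)
print(freq["a"], freq["b"], freq["c"], total, len(freq))  # 2 2 1 5 3
```

Here `total` plays the role of `hafez_words_cnt` and `len(freq)` the role of `words_freq_cnt`.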


<br>

## Laplace Smoothing

Suppose that, while testing, we encounter a word that never appeared in the training data of a class. We should not conclude with certainty that the line does not belong to that class, because the training set obviously does not cover every possible word; the poet may well have used the word even though we have not seen it. So we assign such words a very small probability, which prevents the probability of the whole expression from becoming zero.

To do this, we use the Laplace smoothing formula, which replaces every previous $P$ with a smoothed value while keeping the relative weights of the words as before:

<p style="text-align:center">$P = \frac{x_i + \alpha}{total + \alpha*d}$</p>

$x_i$ is the number of occurrences of the word in a class and $total$ is the number of all word occurrences in that class. Setting $\alpha = 0$ gives back the ordinary formula for $P$, but in practice $\alpha$ is often chosen as 1. $d$ is the number of unique words across both datasets.
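
As a small worked example of the formula, the sketch below smooths a toy count table. The counts and the name `laplace_estimate` are hypothetical; the word `ship` stands for a word that occurs only in the other class.

```python
def laplace_estimate(count, total, alpha, d):
    """Smoothed estimate (count + alpha) / (total + alpha * d)."""
    return (count + alpha) / (total + alpha * d)

# Hypothetical counts for one class; "ship" was only seen in the other class.
counts = {"wine": 3, "rose": 2, "garden": 2, "night": 1, "ship": 0}
total = sum(counts.values())   # 8 word occurrences in this class
d = len(counts)                # 5 unique words overall

smoothed = {w: laplace_estimate(c, total, 1, d) for w, c in counts.items()}
unsmoothed = {w: laplace_estimate(c, total, 0, d) for w, c in counts.items()}

print(unsmoothed["ship"])  # 0.0
print(smoothed["ship"] > 0)  # True
```

With $\alpha = 1$ the unseen word gets $1/13$ instead of $0$, and the smoothed values still sum to 1 over the vocabulary.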

In [51]:
def laplace_p(favorable, total, alpha, d):
    return (favorable + alpha) / (total + alpha*d)

def naive_bayes(splitted_poem, alpha):
    
    global hafez_words_cnt
    global saadi_words_cnt
    global hafez_words_freq
    global saadi_words_freq
    global p_hafez
    global p_saadi
    global words_freq_cnt
    
    p_is_hafez = p_hafez
    for each in splitted_poem:
        if each not in hafez_words_freq and each not in saadi_words_freq:
            continue
        if each in hafez_words_freq:
            p_is_hafez *= laplace_p(hafez_words_freq[each] , hafez_words_cnt, alpha, words_freq_cnt)
        else:
            p_is_hafez *= laplace_p(0 , hafez_words_cnt, alpha, words_freq_cnt)
    
    p_is_saadi = p_saadi
    for each in splitted_poem:
        if each not in hafez_words_freq and each not in saadi_words_freq:
            continue
        if each in saadi_words_freq:
            p_is_saadi *= laplace_p(saadi_words_freq[each] , saadi_words_cnt, alpha, words_freq_cnt)
        else:
            p_is_saadi *= laplace_p(0 , saadi_words_cnt, alpha, words_freq_cnt)
    
    if p_is_hafez > p_is_saadi:
        return "hafez"
    else:
        return "saadi"
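
Multiplying many small probabilities can underflow to zero on long inputs. A common alternative, which is not what the function above does, is to add logarithms instead. The following is only a sketch of that variant under the assumption that $\alpha > 0$; the function `log_score` and the toy counts are hypothetical.

```python
import math

def log_score(words, prior, word_freq, total_words, alpha, vocab):
    """log P(c) + sum_i log P(x_i | c) with Laplace smoothing (requires alpha > 0)."""
    score = math.log(prior)
    for w in words:
        if w not in vocab:
            # Skip words never seen in any class, as the function above does.
            continue
        count = word_freq.get(w, 0)
        score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
    return score

# Hypothetical toy counts, standing in for hafez_words_freq / saadi_words_freq.
hafez_freq = {"wine": 3, "rose": 1}
saadi_freq = {"rose": 2, "garden": 2}
vocab = set(hafez_freq) | set(saadi_freq)

line = ["wine", "rose", "unknown"]
s_hafez = log_score(line, 0.5, hafez_freq, 4, 1, vocab)
s_saadi = log_score(line, 0.5, saadi_freq, 4, 1, vocab)
print("hafez" if s_hafez > s_saadi else "saadi")  # hafez
```

Since the logarithm is monotonic, comparing the log scores selects the same class as comparing the products, as long as no factor is zero.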


<br>

### Without Laplace smoothing:

In [52]:
hafez_cnt = 0
for each in test_labels:
    if each == 'hafez':
        hafez_cnt += 1
hit = 0
hit_hafez = 0
detected_hafez = 0
alpha = 0
for i in range(0, len(test_poems_splitted)):
    detected = naive_bayes(test_poems_splitted[i], alpha)
    if detected == 'hafez':
        detected_hafez += 1
    if detected == test_labels[i]:
        hit += 1
    if detected == test_labels[i] and detected == 'hafez':
        hit_hafez += 1

print("\nAccuracy: " + str(hit/len(test_data)))
print("Recall: " + str(hit_hafez/hafez_cnt))
print("Precision: " + str(hit_hafez/detected_hafez))


Accuracy: 0.7642412637625658
Recall: 0.6559714795008913
Precision: 0.7311258278145696


### With Laplace Smoothing:

In [53]:
hafez_cnt = 0
for each in test_labels:
    if each == 'hafez':
        hafez_cnt += 1
hit = 0
hit_hafez = 0
detected_hafez = 0
alpha = 1
for i in range(0, len(test_poems_splitted)):
    detected = naive_bayes(test_poems_splitted[i], alpha)
    if detected == 'hafez':
        detected_hafez += 1
    if detected == test_labels[i]:
        hit += 1
    if detected == test_labels[i] and detected == 'hafez':
        hit_hafez += 1

print("\nAccuracy: " + str(hit/len(test_data)))
print("Recall: " + str(hit_hafez/hafez_cnt))
print("Precision: " + str(hit_hafez/detected_hafez))


Accuracy: 0.8054092867400671
Recall: 0.6916221033868093
Precision: 0.7983539094650206


<br>

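Laplace smoothing improved all three metrics: accuracy rose from about 0.764 to 0.805, recall from 0.656 to 0.692, and precision from 0.731 to 0.798.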
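
The two evaluation cells differ only in the value of `alpha`. A minimal sketch of how the three metrics could be computed by one helper; the function `evaluate` and the toy labels are hypothetical and do not reproduce the numbers above.

```python
def evaluate(predicted, actual, positive="hafez"):
    """Return (accuracy, recall, precision), treating `positive` as the target class."""
    pairs = list(zip(predicted, actual))
    correct = sum(p == a for p, a in pairs)
    true_pos = sum(p == a == positive for p, a in pairs)
    detected = sum(p == positive for p in predicted)
    relevant = sum(a == positive for a in actual)
    accuracy = correct / len(pairs)
    recall = true_pos / relevant if relevant else 0.0
    precision = true_pos / detected if detected else 0.0
    return accuracy, recall, precision

# Hypothetical predictions and true labels.
predicted = ["hafez", "hafez", "saadi", "saadi", "hafez"]
actual = ["hafez", "saadi", "saadi", "hafez", "hafez"]
accuracy, recall, precision = evaluate(predicted, actual)
print(accuracy)  # 0.6
```

The guards on `relevant` and `detected` avoid a division by zero when a class is never present or never predicted.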

<br>

Now, writing the predicted label for every row of evaluate.csv into resault.csv:

In [55]:
eval_data = pd.read_csv('evaluate.csv')
eval_poems = eval_data.text
n = len(eval_poems)
alpha = 1
predictions = []
for i in range(0, n):
    # naive_bayes expects a list of words, so split the line like the training data.
    predictions.append(naive_bayes(re.split(' ', eval_poems[i]), alpha))
tmp = [{'id': i + 1, 'label': label} for i, label in enumerate(predictions)]
df = pd.DataFrame(tmp, index=None)
# df.head()
df.to_csv('resault.csv', index = False)

<br>

### What if we only consider precision?

Precision is the number of correctly predicted records of a class divided by all records predicted as that class. It therefore tells us how trustworthy the predictions for a class are, but it says nothing about whether all members of that class were detected. As an example, consider a surgeon who has to remove the cancerous parts of an important organ such as the brain: ideally, every cancerous part is removed and no healthy part is. If we consider <b>only</b> precision, the surgeon is pushed to remove only the parts he is fairly sure are cancerous, which lowers recall because some cancerous parts are left behind. (There is a trade-off between the two.)
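
The trade-off can be illustrated with a toy threshold rule: the more confident we require a detection to be, the higher the precision and the lower the recall. The scores, labels, and the function `precision_recall` below are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Flag items with score >= threshold as positive and compare with labels (1 = positive)."""
    flagged = [s >= threshold for s in scores]
    true_pos = sum(f and y == 1 for f, y in zip(flagged, labels))
    n_flagged = sum(flagged)
    n_positive = sum(labels)
    precision = true_pos / n_flagged if n_flagged else 0.0
    recall = true_pos / n_positive if n_positive else 0.0
    return precision, recall

# Hypothetical confidence scores for six samples; 1 marks a truly cancerous one.
scores = [0.95, 0.90, 0.60, 0.55, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0]

cautious = precision_recall(scores, labels, 0.8)
aggressive = precision_recall(scores, labels, 0.2)
print(cautious)    # (1.0, 0.5)
print(aggressive)  # (0.8, 1.0)
```

The cautious rule never flags a healthy sample but misses half of the cancerous ones; the aggressive rule finds them all at the cost of one false alarm.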

<br>

### What if we only consider accuracy?

The problem appears when the dataset we are working with is imbalanced. For example, on a dataset with 80% class A and 20% class B, a model that always predicts A (!) reaches an accuracy of 0.8 without ever detecting a single record of class B.
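
The numbers in this example can be checked directly. The sketch below builds a hypothetical label list with the 80/20 ratio and scores a model that always predicts A.

```python
# Hypothetical imbalanced dataset: 80 records of class A, 20 of class B.
labels = ["A"] * 80 + ["B"] * 20
predictions = ["A"] * len(labels)   # a model that always predicts A

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall_b = sum(p == y == "B" for p, y in zip(predictions, labels)) / labels.count("B")
print(accuracy, recall_b)  # 0.8 0.0
```

Accuracy alone hides that recall for class B is zero, which is why it is reported together with precision and recall.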