# Naive Bayes Model - Gaussian

Gaussian Naive Bayes assumes that each feature is normally distributed within each class, so the input data must be numeric. Our features are already numbers, extracted from the text as counts of tokens, letters, digits and so on, so there was no need for further treatment of the data.
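To make the numeric requirement concrete, here is a minimal pure-Python sketch of what a Gaussian Naive Bayes model estimates for a single feature: one mean and one variance per class, plugged into the normal density. The values are the `length` column and the labels of the ten example rows used in this section; the helper names (`gaussian_pdf`, `class_stats`, `posterior`) are made up for this illustration, and the sketch leaves out the small variance smoothing that scikit-learn applies.

```python
import math

# `length` feature and labels of the ten example rows used in this section.
lengths = [25, 37, 59, 76, 38, 65, 59, 64, 44, 67]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def class_stats(values, labels, cls):
    """Mean and (population) variance of the values that belong to class `cls`."""
    selected = [v for v, l in zip(values, labels) if l == cls]
    mean = sum(selected) / len(selected)
    var = sum((v - mean) ** 2 for v in selected) / len(selected)
    return mean, var

# One (mean, variance) pair per class.
stats = {cls: class_stats(lengths, labels, cls) for cls in (0, 1)}

def posterior(x):
    """Class probabilities for one value: prior * likelihood, then normalized."""
    priors = {cls: labels.count(cls) / len(labels) for cls in stats}
    scores = {cls: priors[cls] * gaussian_pdf(x, *stats[cls]) for cls in stats}
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

print(stats)
print(posterior(30))  # a short text
print(posterior(70))  # a long text
```

With several features, the model multiplies one such likelihood per feature, which is the "naive" independence assumption.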

First we need to separate the features from the label. The features are the data the model analyzes during training; the label is the classification it learns to predict.

In [60]:
import numpy as np
import pandas as pd

words_info = pd.DataFrame({
        'length': [25, 37, 59, 76, 38, 65, 59, 64, 44, 67],
        'digit_count': [7, 7, 11, 11, 9, 12, 12, 11, 8, 6],
        'digits_group_count': [2, 2, 3, 5, 2, 5, 5, 3, 2, 4],
        'token_count': [4, 6, 9, 9, 7, 8, 7, 9, 6, 9],
        'comma_count': [1, 1, 3, 4, 2, 2, 3, 4, 2, 3],
        'comma_entities_numbers': [2, 2, 3, 2, 2, 3, 3, 2, 2, 2],
        'comma_entities_numbers_words': [2, 2, 2, 1, 1, 3, 2, 1, 2, 2],
        'label': [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
    })

#print('Info:\n',words_info)
data = words_info.drop(['label'], axis=1)
label = np.array(words_info['label'])

print('\nData:\n',data)
print('\nLabel:\n',label)


Data:
    length  digit_count  digits_group_count  token_count  comma_count  \
0      25            7                   2            4            1   
1      37            7                   2            6            1   
2      59           11                   3            9            3   
3      76           11                   5            9            4   
4      38            9                   2            7            2   
5      65           12                   5            8            2   
6      59           12                   5            7            3   
7      64           11                   3            9            4   
8      44            8                   2            6            2   
9      67            6                   4            9            3   

   comma_entities_numbers  comma_entities_numbers_words  
0                       2                             2  
1                       2                             2  
2                       3                             2  
3                       2                             1  
4                       2                             1  
5                       3                             3  
6                       3                             2  
7                       2                             1  
8                       2                             2  
9                       2                             2  

Label:
 [1 1 1 0 1 0 0 0 1 0]

With that settled, it's time to prepare the data so the model can analyze it. The model expects one sequence of feature values per sample, such as (x, y, z), so we used the DataFrame method itertuples() to turn each row into a tuple.

In [36]:
features = list(data.itertuples(index=False, name=None))
print('Features: ', features)

Features:  [(25, 7, 2, 4, 1, 2, 2), (37, 7, 2, 6, 1, 2, 2), (59, 11, 3, 9, 3, 3, 2), (76, 11, 5, 9, 4, 2, 1), (38, 9, 2, 7, 2, 2, 1), (65, 12, 5, 8, 2, 3, 3), (59, 12, 5, 7, 3, 3, 2), (64, 11, 3, 9, 4, 2, 1), (44, 8, 2, 6, 2, 2, 2), (67, 6, 4, 9, 3, 2, 2)]
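The tuple conversion is one of several equivalent options. The sketch below uses a small stand-in DataFrame (the hypothetical `sample`, holding two of the columns and the first three rows from above) to show that `DataFrame.to_numpy()` produces the same rows as `itertuples()`. scikit-learn estimators generally accept either form.

```python
import pandas as pd

# Hypothetical stand-in: two of the feature columns, first three rows.
sample = pd.DataFrame({
    'length': [25, 37, 59],
    'digit_count': [7, 7, 11],
})

# Option 1: a list of plain tuples, one per row (as in the cell above).
rows_as_tuples = list(sample.itertuples(index=False, name=None))

# Option 2: a 2-D NumPy array with one row per sample.
rows_as_array = sample.to_numpy()

print(rows_as_tuples)
print(rows_as_array.shape)
```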


Now that the data is prepared, we can create a Gaussian Naive Bayes classifier, train it, and predict labels for new feature vectors.

In [54]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(features,label)

predicted1 = model.predict([[62,7,3,8,2,2,2]])
print('62,7,3,8,2,2,2 => ', predicted1)

predicted2 = model.predict([[62,3,3,8,2,2,2]])
print('62, ->3<- ,3,8,2,2,2 => ', predicted2)

predicted3 = model.predict([[62,3,3,4,2,2,2]])
print('62,3,3, ->4<- ,2,2,2 => ', predicted3)

62,7,3,8,2,2,2 =>  [1]
62, ->3<- ,3,8,2,2,2 =>  [0]
62,3,3, ->4<- ,2,2,2 =>  [1]
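The hard 0/1 predictions hide how close each decision was. As a sketch, `predict_proba` (a standard scikit-learn classifier method) returns the estimated class probabilities for the same three inputs. The data is copied from the cells above so the block runs on its own. Naive Bayes probabilities are often pushed toward 0 or 1, so they are probably better read as a ranking than as calibrated confidence.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Feature tuples and labels copied from the cells above.
features = [
    (25, 7, 2, 4, 1, 2, 2), (37, 7, 2, 6, 1, 2, 2), (59, 11, 3, 9, 3, 3, 2),
    (76, 11, 5, 9, 4, 2, 1), (38, 9, 2, 7, 2, 2, 1), (65, 12, 5, 8, 2, 3, 3),
    (59, 12, 5, 7, 3, 3, 2), (64, 11, 3, 9, 4, 2, 1), (44, 8, 2, 6, 2, 2, 2),
    (67, 6, 4, 9, 3, 2, 2),
]
label = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0])

model = GaussianNB()
model.fit(features, label)

queries = [
    [62, 7, 3, 8, 2, 2, 2],
    [62, 3, 3, 8, 2, 2, 2],
    [62, 3, 3, 4, 2, 2, 2],
]

# One row per query, one column per class in model.classes_ (here 0 and 1).
probabilities = model.predict_proba(queries)
for query, probs in zip(queries, probabilities):
    print(query, '=>', dict(zip(model.classes_.tolist(), probs.round(3).tolist())))
```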


The model works and responds to variations in the input (keeping in mind that it's a small data set), so we can now test how accurate it is.

In [57]:
def get_precision_recall_f1(actual: list, predicted: list):
    # Labels are binary: 1 is the positive class, 0 the negative class.
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    for actual_val, predicted_val in zip(actual, predicted):
        if actual_val and predicted_val:
            true_positives += 1
        elif predicted_val:
            # Predicted positive, but the actual label is negative.
            false_positives += 1
        elif actual_val:
            # Predicted negative, but the actual label is positive.
            false_negatives += 1

    # max(..., 1) avoids a division by zero when a denominator is empty.
    precision = true_positives / max(true_positives + false_positives, 1)
    recall = true_positives / max(true_positives + false_negatives, 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Return the three metrics as percentages.
    return [el * 100 for el in [precision, recall, f1]]

[precision, recall, f1] = get_precision_recall_f1([0,1,1,1,0], [1,1,0,0,0])

print('Precision: {:.2f} | Recall: {:.2f} | F1: {:.2f}'.format(precision, recall, f1))

Precision: 50.00 | Recall: 33.33 | F1: 40.00
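For this toy input there is one true positive, one false positive and two false negatives, so precision = TP / (TP + FP) = 1/2, recall = TP / (TP + FN) = 1/3, and F1 (their harmonic mean) = 0.4. As a cross-check, the same values can be computed with the metric functions in `sklearn.metrics`. This is a sketch for comparison, not a replacement for the function above.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

actual = [0, 1, 1, 1, 0]
predicted = [1, 1, 0, 0, 0]

# For binary labels these functions treat 1 as the positive class by default.
precision = precision_score(actual, predicted) * 100
recall = recall_score(actual, predicted) * 100
f1 = f1_score(actual, predicted) * 100

print('Precision: {:.2f} | Recall: {:.2f} | F1: {:.2f}'.format(precision, recall, f1))
# Precision: 50.00 | Recall: 33.33 | F1: 40.00
```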


In [59]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, label, test_size=0.3, random_state=109)

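# Note: `model` was fitted on all ten rows earlier, so the test rows below were
# already seen during training; the test scores are not a true held-out estimate.
train_predictions = model.predict(train_features)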
train_accuracy = model.score(train_features, train_labels) * 100.0
[train_precision, train_recall, train_f1] = get_precision_recall_f1(train_labels, train_predictions)

test_predictions = model.predict(test_features)
test_accuracy = model.score(test_features, test_labels) * 100.0
[test_precision, test_recall, test_f1] = get_precision_recall_f1(test_labels, test_predictions)

print('For training data:')
print('Accuracy: {:.2f} | Precision: {:.2f} | Recall: {:.2f} | F1: {:.2f}'
      .format(train_accuracy, train_precision, train_recall, train_f1))
print('For test data:')
print('Accuracy: {:.2f} | Precision: {:.2f} | Recall: {:.2f} | F1: {:.2f}'
      .format(test_accuracy, test_precision, test_recall, test_f1))

For training data:
Accuracy: 100.00 | Precision: 100.00 | Recall: 100.00 | F1: 100.00
For test data:
Accuracy: 66.67 | Precision: 100.00 | Recall: 50.00 | F1: 66.67
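One caveat about these numbers: `model` was fitted on all ten rows before the split, so the "test" rows were part of its training data. A minimal sketch of a held-out evaluation fits a fresh classifier on the training split only. The name `held_out_model` is introduced here for illustration, and the data is copied from the cells above so the block runs on its own. No scores are quoted, since they depend on the split, and with only three test rows any estimate is very coarse.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Feature tuples and labels copied from the cells above.
features = [
    (25, 7, 2, 4, 1, 2, 2), (37, 7, 2, 6, 1, 2, 2), (59, 11, 3, 9, 3, 3, 2),
    (76, 11, 5, 9, 4, 2, 1), (38, 9, 2, 7, 2, 2, 1), (65, 12, 5, 8, 2, 3, 3),
    (59, 12, 5, 7, 3, 3, 2), (64, 11, 3, 9, 4, 2, 1), (44, 8, 2, 6, 2, 2, 2),
    (67, 6, 4, 9, 3, 2, 2),
]
label = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0])

# Same split parameters as in the cell above.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, label, test_size=0.3, random_state=109)

# Fit on the training rows only, so the test rows stay unseen.
held_out_model = GaussianNB()
held_out_model.fit(train_features, train_labels)

train_accuracy = held_out_model.score(train_features, train_labels) * 100.0
test_accuracy = held_out_model.score(test_features, test_labels) * 100.0

print('Train accuracy: {:.2f} | Test accuracy: {:.2f}'.format(train_accuracy, test_accuracy))
```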
