# Comparison of classifiers on texts under the naive Bayes assumption

In this laboratory session we compare two probabilistic models for categorical data on the same dataset:
1. multivariate Bernoulli
2. multinomial

We use a dataset of Twitter messages labelled with emotions (Joy vs Sadness).

The data are loaded from a file by the program shown further below, after a description of the storage format.


The data are loaded into a matrix X using a sparse representation, which saves space and time.
The program describes the matrix with three "parallel" arrays that hold the values of the non-zero cells and the indices of those cells. Strictly speaking this is the coordinate (triplet) form; csr_matrix accepts it as input and converts it to the CSR format internally.
The arrays are called:
* data
* row
* col

For the non-zero cell #i:
* data[i] stores the value of the cell,
* row[i] stores the index of its row in the matrix,
* col[i] stores the index of its column.
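
The triplet representation can be tried on a tiny matrix. The following is a minimal sketch; the names toy_data, toy_row, toy_col and toy_matrix, as well as the values, are made up for illustration and are not part of the lab data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three non-zero cells of a hypothetical 3x4 matrix:
# cell #0 = 5 at (row 0, col 1), cell #1 = 7 at (row 0, col 3), cell #2 = 2 at (row 2, col 0)
toy_data = np.array([5, 7, 2])
toy_row = np.array([0, 0, 2])
toy_col = np.array([1, 3, 0])

# csr_matrix accepts the (data, (row, col)) triplets and builds the CSR structure itself
toy_matrix = csr_matrix((toy_data, (toy_row, toy_col)), shape=(3, 4))

print(toy_matrix.toarray())  # dense view, only sensible for small matrices
print(toy_matrix.nnz)        # number of stored (non-zero) cells
print(toy_matrix.indptr)     # CSR row pointers: where each row starts in the stored values
```

Only the three stored cells take up memory; the remaining nine cells of the dense view are implicit zeros.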


The data file is in CSV format.
Each Twitter message has been preprocessed by a natural language pipeline that removed stop words and replaced each interesting document element with an integer identifier.
The interesting document elements may be words, emoji or emoticons. An element can occur several times in the same document, and it is identified across all documents by the same integer (named "element_id" in the program). The program subtracts 1 from this "element_id" and uses the result as the column index of the data matrix.

Each row of the CSV file reports the content of one document (a Twitter message). It consists of a list of integer pairs followed by a string, the label of the document ("Joy" or "Sadness").
The first number of each pair is the identifier of a document element (the "element_id");
the second number is the count (frequency) of that element in that document.
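
As an illustration of this row layout, the sketch below parses a single line. The helper parse_line and the sample line are hypothetical (the line is invented, not taken from the data file); the loading program further below does the same work inline.

```python
def parse_line(line):
    """Split one CSV row into ([(element_id, count), ...], label)."""
    fields = line.strip().split(",")
    label = fields[-1].strip()                 # last field is the class label
    numbers = [int(f) for f in fields[:-1]]    # everything before it is integers
    # even positions hold element ids, odd positions hold the matching counts
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    return pairs, label

# Invented example: element 12 once, element 407 twice, element 9 once, labelled Joy
sample_pairs, sample_label = parse_line("12,1,407,2,9,1,Joy\n")
print(sample_pairs, sample_label)
```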

The dataset has:

* tot_n_docs (rows in the file) = n_rows = 11981
* n_features (total number of distinct elements in the corpus) = 11288


The following program reads the data file and loads the matrix in sparse form using the scipy.sparse library.


In [11]:
from numpy import ndarray, zeros
import numpy as np
import scipy
from scipy.sparse import csr_matrix

class_labels = ["Joy", "Sadness"]
n_features = 11288  # number of columns in the matrix = number of features (distinct elements in the documents)
n_rows = 11981  # number rows of the matrix
n_elements = 71474  # number of the existing values in the matrix (not empty, to be loaded in the matrix in a sparse way)

# path_training="/Users/meo/Documents/Didattica/Laboratorio-15-16-Jupyter/"
path_training = "./"
file_name = "joy_sadness6000.txt"

# declare the row and col arrays with the indexes of the matrix cells (non empty) to be loaded from file
# they are needed because the matrix is sparse and we load in the matrix only the elements which are present
row = np.empty(n_elements, dtype=int)
col = np.empty(n_elements, dtype=int)
data = np.empty(n_elements, dtype=int)

row_n = 0
cur_el = 0
twitter_labels = []
twitter_target = []  # list of 0/1 for class labels
with open(path_training + file_name, "r") as fi:
    for line in fi:
        el_list = line.split(",")
        l = len(el_list)
        last_el = el_list[l - 1]
        class_name = last_el.strip()
        twitter_labels.append(class_name)
        if class_name == class_labels[0]:
            twitter_target.append(0)
        else:
            twitter_target.append(1)
        i = 0
        while i < (l - 1):
            element_id = int(el_list[i])
            element_id = element_id - 1
            i = i + 1
            value_cell = int(el_list[i])
            i = i + 1
            row[cur_el] = row_n
            col[cur_el] = element_id
            data[cur_el] = value_cell
            cur_el = cur_el + 1
        row_n = row_n + 1
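# the with block above closes the file, so no explicit close is needed
# note: this prints the array of row indices (row), not the row counter (row_n)
print("final n_row=" + str(row))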
# loads the matrix by means of the indexes and the values in the three arrays just filled
twitter_data = csr_matrix((data, (row, col)), shape=(n_rows, n_features)).toarray()
print("resulting matrix:")
print(twitter_data)
print(twitter_labels)
print(twitter_target)

final n_row=[0 0 0 ... 0 0 0]
resulting matrix:
[[1 1 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]]
['Joy', 'Joy', 'Joy', ...
[output truncated]

Write a program in the following cell that splits the data matrix into a training set and a test set (by random selection) and predicts the class (Joy/Sadness) of the messages on the basis of the words.
Consider the two possible models:
multivariate Bernoulli and multinomial.
Find the accuracy of the models and test whether the observed differences are significant.


# Fitting

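I fit both models on a random 75/25 train/test split and also score them with cross-validation (five folds, the default of cross_validate).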

In [12]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.model_selection import cross_validate, train_test_split

X_train, X_test, y_train, y_test = train_test_split(twitter_data, twitter_target, test_size=0.25, random_state=42, shuffle=True)

multivariate_bernoulli_model = BernoulliNB(force_alpha=True)
multinomial_bernoulli_model = MultinomialNB(force_alpha=True)
multivariate_bernoulli_model.fit(X_train, y_train)
multinomial_bernoulli_model.fit(X_train, y_train)

multivariate_scores = cross_validate(multivariate_bernoulli_model, twitter_data, twitter_target)
multinomial_scores = cross_validate(multinomial_bernoulli_model, twitter_data, twitter_target)
print(multivariate_scores)
print(multinomial_scores)

{'fit_time': array([3.11490297, 3.09955907, 3.08687711, 3.09306383, 3.11200619]), 'score_time': array([0.18115592, 0.18207598, 0.17255902, 0.17957401, 0.18122482]), 'test_score': array([0.95953275, 0.95075125, 0.94824708, 0.95033389, 0.9490818 ])}
{'fit_time': array([2.80076599, 2.78538513, 2.77980185, 2.77790403, 2.61185622]), 'score_time': array([0.06381416, 0.06462407, 0.06238008, 0.06434393, 0.06150508]), 'test_score': array([0.95661243, 0.95075125, 0.94323873, 0.94782972, 0.94866444])}
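
To see what distinguishes the two models, the sketch below fits both on a toy count matrix. The data (X_toy, y_toy) are invented for illustration. With its default setting binarize=0.0, BernoulliNB only records whether a feature is present in a document, whereas MultinomialNB accumulates the raw counts.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical counts: 4 documents, 3 features, 2 classes
X_toy = np.array([[3, 0, 1],
                  [2, 0, 0],
                  [0, 4, 1],
                  [0, 1, 2]])
y_toy = [0, 0, 1, 1]

toy_bernoulli = BernoulliNB().fit(X_toy, y_toy)
toy_multinomial = MultinomialNB().fit(X_toy, y_toy)

# Per class: number of documents in which each feature occurs at least once
print(toy_bernoulli.feature_count_)
# Per class: total number of occurrences of each feature
print(toy_multinomial.feature_count_)
```

On short texts such as tweets most counts are 0 or 1, so the two views of the data differ little, which is consistent with the close scores above.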


# Predictions

I compute the confusion matrix of each model on the held-out test set.
I also show the F1 scores.

In [13]:
from sklearn.metrics import confusion_matrix, f1_score

mv_y_pred = multivariate_bernoulli_model.predict(X_test)
mv_confusion_matrix = confusion_matrix(y_test, mv_y_pred)
print(f"Multivariate Confusion Matrix:\n{mv_confusion_matrix}")
print(f"The Multivariate F_1 Score:\t{f1_score(y_test, mv_y_pred)}")

mn_y_pred = multinomial_bernoulli_model.predict(X_test)
mn_confusion_matrix = confusion_matrix(y_test, mn_y_pred)
print(f"Multinomial Confusion Matrix:\n{mn_confusion_matrix}")
print(f"The Multinomial F_1 Score:\t{f1_score(y_test, mn_y_pred)}")


Multivariate Confusion Matrix:
[[1405  120]
 [  28 1443]]
The Multivariate F_1 Score:	0.9512195121951219
Multinomial Confusion Matrix:
[[1396  129]
 [  28 1443]]
The Multinomial F_1 Score:	0.9484061781137035
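
The F1 scores can be recomputed by hand from the confusion matrices printed above. This is a minimal sketch: the helper summarize is hypothetical, and it assumes scikit-learn's layout (rows are true classes, columns are predicted classes) with label 1 (Sadness) as the positive class, which is the default of f1_score.

```python
import numpy as np

def summarize(cm):
    """Return (accuracy, precision, recall, F1) for the positive class (label 1)."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1

# Confusion matrices copied from the output above
mv_metrics = summarize([[1405, 120], [28, 1443]])
mn_metrics = summarize([[1396, 129], [28, 1443]])

for name, m in (("multivariate", mv_metrics), ("multinomial", mn_metrics)):
    print(f"{name}: accuracy={m[0]:.4f} precision={m[1]:.4f} recall={m[2]:.4f} F1={m[3]:.4f}")
```

Both models miss the same number of Sadness messages (28); they differ only in how many Joy messages they label as Sadness (120 vs 129).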


# Hypothesis Test

I now check whether the two models perform differently in a statistically significant way:
- using cross-validation, I obtain one score per fold for each model;
- using a Student's t-test on the fold-wise score differences, I test whether their mean is 0 (the models perform identically).
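
As a sketch of what the test computes (the symbols $s$ and $k$ are introduced here for illustration): with $k = 5$ cross-validation folds, let $\mu$ be the mean of the $k$ score differences and $s$ their sample standard deviation. The one-sample statistic is

$$ t = \frac{\mu - \mu_0}{s / \sqrt{k}}, \qquad \mu_0 = 0, $$

and under $H_0$ it is compared with a Student's t distribution with $k - 1 = 4$ degrees of freedom.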

In [33]:
from scipy.stats import ttest_1samp

score_differences = [mn_s - mv_s for mn_s, mv_s in zip(multinomial_scores["test_score"], multivariate_scores["test_score"])]
print(score_differences)

print(f"mean of observations: {np.average(score_differences)}")
# popmean is the mean under H_0 (0: the two models have the same expected score)
# alternative selects a one-sided H_1: "greater" tests mean > 0 (multinomial better), "less" tests mean < 0 (multivariate better)
g = ttest_1samp(score_differences, popmean=0, alternative="greater")
l = ttest_1samp(score_differences, popmean=0, alternative="less")
print(g)
print(l)

[-0.0029203170629953368, 0.0, -0.005008347245408995, -0.0025041736227044975, -0.0004173622704507496]
mean of observations: -0.002170040040311916
TtestResult(statistic=-2.388303180985578, pvalue=0.9623467726165449, df=4)
TtestResult(statistic=-2.388303180985578, pvalue=0.03765322738345512, df=4)
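
For comparison, a two-sided version of the same test (alternative hypothesis: the mean difference is simply different from 0) can be sketched as follows. The list diffs is copied from the output above; the name two_sided is introduced here for illustration.

```python
from scipy.stats import ttest_1samp

# Fold-wise differences (multinomial minus multivariate) copied from the output above
diffs = [-0.0029203170629953368, 0.0, -0.005008347245408995,
         -0.0025041736227044975, -0.0004173622704507496]

# alternative="two-sided" is the default of ttest_1samp
two_sided = ttest_1samp(diffs, popmean=0)
print(two_sided.statistic, two_sided.pvalue)
```

The statistic is unchanged, while the two-sided p-value is twice the smaller of the two one-sided p-values.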


# Results

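The mean of the fold-wise score differences (multinomial minus multivariate) is $\mu = -0.0022$. It is negative, so on average the multivariate model scores slightly higher.

Recall that the t-test works on means: under $H_0$ we treat the observed mean $\mu$ as a draw from a random variable $M$ whose mean is $\mu_0 = 0$ (as set in popmean).

We got:

- $P(M \ge \mu \mid H_0) = 96.2 \%$
- $P(M \le \mu \mid H_0) = 3.8 \%$

The observed mean therefore lies in the *left* tail of the distribution: under $H_0$, a value this low or lower would occur only $3.8\%$ of the time. At the $5\%$ significance level we reject $H_0$ in favour of the one-sided alternative that the true mean difference is below $\mu_0 = 0$, that is, that the multivariate model performs better.
Two caveats apply. The $96.2\%$ figure is a tail probability computed under $H_0$, not the probability that the multivariate model is better. The evidence is also weak: a two-sided test on the same differences would give a p-value of about $7.5\%$, which is not significant at the $5\%$ level, and with only five folds that share training data the t-test is approximate.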
