Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Kanawut Kaewnoparat"
ID = "st122109"

---

# Lab 06: Generative Classifiers: Naive Bayes

As discussed in class, a naive Bayes classifier works as follows.

We are given a feature space $\mathcal{X}$ that could be discrete, continuous, or a mix of discrete and continuous features.

We are also given a discrete set $\mathcal{Y} = { y_1, \ldots,
y_K }$ of exhaustive, mutually exclusive classes thought to be the provenance of a dataset elements $\mathbf{x} \in \mathcal{X}$.

What does it mean to say that the features come from the classes? Specifically, we mean that the observation $\mathbf{x}^{(i)}$ is a random vector statistically dependent on a random variable $y^{(i)}$.

This means that $\mathbf{x}^{(i)} \sim p(\mathbf{x} \mid y^{(i)})$, where $y^{(i)} \in \mathcal{Y}$ and $y^{(i)} \sim p(y)$. $p(y)$, the *prior*, is assumed to be a multinomial distribution over the possible classes $\mathcal{Y}$, but the class conditional distribution $p(\mathbf{x} \mid y)$ can be an arbitrarily complicated joint distribution over the feature space that is different for each $y \in \mathcal{Y}$.

The random process just described, in which a $y$ is first sampled from a multinomial distribution over $\mathcal{Y}$ then an $\mathbf{x}$ is sampled from an arbitrary joint distribution over $\mathcal{X}$ that is conditioned on $y$, is a *generative model* for the provenance of our dataset. It may not be a fully accurate model for how nature gave us our dataset, but we nevertheless assume that it is.

With all those preliminaries, now, given a new sample $\mathbf{x}$ assumed to have been generated by the same generative process, we estimate, for each $y \in \mathcal{Y}$, the *posterior* $p(y \mid \mathbf{x})$ using the following strategy:
$$\begin{eqnarray}
p(y \mid \mathbf{x} ; \theta) & = & \frac{p(\mathbf{x} \mid y ; \theta) p(y ; \theta)}{p(\mathbf{x} ; \theta)} \\
& \propto & p(\mathbf{x} \mid y ; \theta) p(y ; \theta) \\
& = & p(y ; \theta) \prod_j p(x_j \mid y, x_1, \ldots, x_{j-1} ; \theta) \\
& \approx & p(y ; \theta) \prod_j p(x_j \mid y ; \theta).
\end{eqnarray}$$

The critical assumption here (besides the story of the generative random process assumed to be the origin of our dataset) is the *naive Bayes assumption* that the approximation

$$ p(x_j \mid y, x_1, \ldots, x_{j-1} ; \theta) \approx p(x_j \mid y ; \theta)$$

is close enough to reality to be useful. Note that if the features are truly *conditionally independent of each other given the class*, then the naive Bayes classifier is an exact probabilistic classifier.

So now we know that the parameters of a naive Bayes classifier will always include the parameters $\phi_1, \ldots, \phi_k$ of the multinomial distribution over $\mathcal{Y}$ plus the individual conditional feature distributions $p(x_j \mid y)$. If $x_j$ is discrete, we can represent this conditional distribution using a simple table of probabilities, and if $x_j$ is continuous, we represent the conditional distribution using the parameters of some continuous distribution such as a univariate Gaussian, univariate exponential, etc.

In today's lab, we will use naive Bayes to perform diabetes diagnosis and text classification.

## Example 1: Diabetes classification

In this example we predict wheter a patient with specific diagnostic measurements has diabetes or not. The target classes $\mathcal{Y} = { y_1, y_2 }$ correspond respectively to "no diabetes" and "diabetes." As the features are continuous, we will model their conditional probabilities $p(x_j \mid y ; \theta)$ as univariate Gaussians with means $\mu_{j,y}$ and standard deviations $\sigma_{j,y}$.

The data are originally from the U.S. National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and are available from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [2]:
import csv
import math
import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Data manipulation

First we have some functions to read the dataset, split it into train and test, and partition it according to target class ($y$).

In [3]:
# Load data from CSV file
def loadCsv(filename):
    data_raw = pd.read_csv(filename)
    headers = data_raw.columns
    dataset = data_raw.values
    return dataset, headers

# Split dataset into test and train with given ratio
def splitDataset(test_size,*arrays,**kwargs):
    return train_test_split(*arrays,test_size=test_size,**kwargs)

# Separate training data according to target class
# Return key value pairs array in which keys are possible target variable values
# and values are the data records.

def data_split_byClass(dataset):
    Xy = {}
    for i in range(len(dataset)):
        datapair = dataset[i]
        # datapair[-1] (the last column) is the target class for this record.
        # Check if we already have this value as a key in the return array
        if (datapair[-1] not in Xy):
            # Add class as key
            Xy[datapair[-1]] = []
        # Append this record to array of records for this class key
        Xy[datapair[-1]].append(datapair)
    return Xy

### Model training

Next we have some functions used for training the model. Parameters include the conditional means and standard deviations for each feature as well as the parameters of the multinomial distribution (more specifically the Bernoulli distribution since this is a binary classification problem) over $\mathcal{Y}$.

In [4]:
# Calculate Gaussian parameters mu and sigma for each attribute over a dataset

def get_gaussian_parameters(X, y):
    parameters = {}
    unique_y = np.unique(y)
    for uy in unique_y:
        mean = np.mean(X[y==uy], axis=0)
        std = np.std(X[y==uy], axis=0)
        py = y[y==uy].size / y.size
        parameters[uy] = { 'prior': py, 'mean': mean, 'std': std }
    return parameters, unique_y

def calculateProbability(x, mu, sigma):
    sigma = np.diag(sigma**2)
    x = x.reshape(-1,1)
    mu = mu.reshape(-1,1)
    exponent = np.exp(-1/2*(x-mu).T@np.linalg.inv(sigma)@(x-mu))
    return ((1/(np.sqrt(((2*np.pi)**x.size)*np.linalg.det(sigma))))*exponent)[0,0]

### Model testing

Next are some functions for testing the model on a test set and computing its accuracy. Note that `predict_one()` allows us to calculate $p(y \mid \mathbf{x} ; \theta)$ with or without the prior, i.e., as either

$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta),$$

which corresponds to the assumption that the priors $p(y)$ are equal, i.e., $p(y) = \frac{1}{K}$ for all $y$, or

$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta) p(y ; \theta),$$

which correctly includes the prior.

In [5]:
# Calculate class conditional probabilities for given input data vector

def predict_one(x, parameters, unique_y, prior=True):
    probabilities = []
    for key in parameters.keys():
        probabilities.append(calculateProbability(x, parameters[key]['mean'], parameters[key]['std']) * (parameters[key]['prior']**(float(prior))))
    probabilities = np.array(probabilities)
    return unique_y[np.argmax(probabilities)]

def getPredictions(X, parameters, unique_y,prior=True):
    predictions = []
    for i in range(X.shape[0]):
        predictions.append(predict_one(X[i],parameters,unique_y,prior))
    return np.array(predictions)

# Get accuracy for test set

def getAccuracy(y, y_pred):
    correct = len(y[y==y_pred])
    return correct/y.size

### Experiment

Here we load the diabetes dataset, split it into training and test data, train a Gaussian NB model, and test the model on the test set.

In [6]:
# Load dataset

filename = 'diabetes.csv'
dataset, headers = loadCsv(filename)
#print(headers)
#print(np.array(dataset)[0:5,:])

# Split into training and test

X_train,X_test,y_train,y_test = splitDataset(0.4,dataset[:,:-1],dataset[:,-1])
print("Total =",len(dataset),"Train =", len(X_train),"Test =",len(X_test))

# Train model

parameters, unique_y = get_gaussian_parameters(X_train,y_train)
prediction = getPredictions(X_test,parameters,unique_y)
print("Accuracy with Prior =",getAccuracy(y_test,prediction))

# Test model

prediction = getPredictions(X_test,parameters,unique_y,prior = False)
print("Accuracy without Prior =",getAccuracy(y_test,prediction))

Total = 768 Train = 460 Test = 308
Accuracy with Prior = 0.762987012987013
Accuracy without Prior = 0.737012987012987


###  Exercise In lab / take home work (20 points)

Find out the proportion of the records in your dataset are positive vs. negative.  Can we conclude that $p(y=1) = p(y=0)$? If not, we should use the version of the model in which we use the priors $p(y=1)$ and $p(y=0)$. Explain
whether/how it improves the result.


In [7]:
for class_ in np.unique(y_train):
    print(f"The proportion of of records in {class_} class is {round(parameters[class_]['prior'], 2 )}")

The proportion of of records in 0.0 class is 0.66
The proportion of of records in 1.0 class is 0.34


In [8]:
import copy
parameter_test = copy.deepcopy(parameters)

In [9]:
parameters

{0.0: {'prior': 0.658695652173913,
  'mean': array([  3.20132013, 109.15181518,  68.9339934 ,  19.32013201,
          68.93069307,  30.53663366,   0.44179868,  30.94389439]),
  'std': array([ 2.97776107, 27.31088689, 16.4204935 , 15.09922595, 99.31095188,
          7.95976752,  0.31564193, 11.34546713])},
 1.0: {'prior': 0.34130434782608693,
  'mean': array([  4.86624204, 143.47770701,  72.94904459,  22.15923567,
         104.51592357,  35.30318471,   0.51773885,  37.18471338]),
  'std': array([  3.6048367 ,  29.87011927,  19.00546223,  16.87817328,
         143.32638825,   6.80652947,   0.33223427,  11.03545049])}}

In [10]:
parameter_test[0]['prior'] = 0.5
parameter_test[1]['prior'] = 0.5
print(parameter_test)

{0.0: {'prior': 0.5, 'mean': array([  3.20132013, 109.15181518,  68.9339934 ,  19.32013201,
        68.93069307,  30.53663366,   0.44179868,  30.94389439]), 'std': array([ 2.97776107, 27.31088689, 16.4204935 , 15.09922595, 99.31095188,
        7.95976752,  0.31564193, 11.34546713])}, 1.0: {'prior': 0.5, 'mean': array([  4.86624204, 143.47770701,  72.94904459,  22.15923567,
       104.51592357,  35.30318471,   0.51773885,  37.18471338]), 'std': array([  3.6048367 ,  29.87011927,  19.00546223,  16.87817328,
       143.32638825,   6.80652947,   0.33223427,  11.03545049])}}


In [11]:
y_train_prediction = getPredictions(X_train,parameters,unique_y ,prior =True)
y_test_prediction = getPredictions(X_test,parameters,unique_y, prior = True)

print("Accuracy with Prior on train set =",getAccuracy(y_train,y_train_prediction))
print("Accuracy with Prior on train set =",getAccuracy(y_test,y_test_prediction))

Accuracy with Prior on train set = 0.758695652173913
Accuracy with Prior on train set = 0.762987012987013


In [12]:
y_train_prediction2 = getPredictions(X_train,parameter_test,unique_y)
y_test_prediction2 = getPredictions(X_test,parameter_test,unique_y)

print("Accuracy without Prior on train set =",getAccuracy(y_train,y_train_prediction2))
print("Accuracy without Prior on train set =",getAccuracy(y_test,y_test_prediction2))

Accuracy without Prior on train set = 0.7478260869565218
Accuracy without Prior on train set = 0.737012987012987


**Explain whether you can conclude that $p(y=1) = p(y=0)$? If not, add
the priors $p(y=1)$ and $p(y=0)$ to your NB model and explain how it improves the result.**


## Answer for Q1
- We CANNOT conclude that $p(y=1) = p(y=0)$ as $p(y=1)$ is at 64% and $p(y=0)$ = 36%, basically meaning that there is imbalance sampling populated more for class y =1
- If we had NOT taken these priors into account, in other words, set the sampling to be of the same probability distribution at 0.5, the accuracy $\textbf{decreases}$ a bit (72% on without prior test set)
- But when taking the priors into account, the accuracy is not better as it takes the $\textbf{different frequency/ occurences of each class}$ into the computation of likelihood of observation as well

## Example 2: Text classification

This example has been adapted from a post by Jaya Aiyappan, available at
[Analytics Vidhya](https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data).

We will generate a small dataset of sentences that are classified as either "statements" or "questions."

We will assume that occurance and placement of words within a sentence are independent of each other, so the sentence "this is my book" will have the same features as the sentence "is this my book." We will treat words without case sensitivity.

In [13]:
# Generate text data for two classes, "statement" and "question"

text_train = [['This is my novel book', 'statement'],
              ['this book has more than one author', 'statement'],
              ['is this my book', 'question'],
              ['They are novels', 'statement'],
              ['have you read this book', 'question'],
              ['who is the novels author', 'question'],
              ['what are the characters', 'question'],
              ['This is how I bought the book', 'statement'],
              ['I like fictional characters', 'statement'],
              ['what is your favorite book', 'question']]

text_test = [['this is the book', 'statement'], 
             ['who are the novels characters', 'question'], 
             ['is this the author', 'question'],
            ['I like apples']]

# Load training and test data into pandas data frames

training_data = pd.DataFrame(text_train, columns= ['sentence', 'class'])
print(training_data)
print('\n------------------------------------------\n')
testing_data = pd.DataFrame(text_test, columns= ['sentence', 'class'])
print(testing_data)


                             sentence      class
0               This is my novel book  statement
1  this book has more than one author  statement
2                     is this my book   question
3                     They are novels  statement
4             have you read this book   question
5            who is the novels author   question
6             what are the characters   question
7       This is how I bought the book  statement
8         I like fictional characters  statement
9          what is your favorite book   question

------------------------------------------

                        sentence      class
0               this is the book  statement
1  who are the novels characters   question
2             is this the author   question
3                  I like apples       None


In [14]:
# Partition training data by class

stmt_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'statement']
question_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'question']
all_docs = [train['sentence'] for index,train in training_data.iterrows()]

# Get word frequencies for each sentence and class

def get_words(text):
    # Initialize word list
    words = [];
    # Loop through each sentence in input array
    for text_row in text:       
        # Check the number of words. Assume each word is separated by a blank space
        # so that the number of words is the number of blank spaces + 1
        number_of_spaces = text_row.count(' ')
        # loop through the sentence and get words between blank spaces.
        for i in range(number_of_spaces):
            # Check for for last word
            words.append([text_row[:text_row.index(' ')].lower()])
            text_row = text_row[text_row.index(' ')+1:]  
            i = i + 1        
        words.append([text_row])
    return np.unique(words)

# Get frequency of each word in each document

def get_doc_word_frequency(words, text):  
    word_freq_table = np.zeros((len(text),len(words)), dtype=int)
    i = 0
    for text_row in text:
        # Insert extra space between each pair of words to prevent
        # partial match of words
        text_row_temp = ''
        for idx, val in enumerate(text_row):
            if val == ' ':
                 text_row_temp = text_row_temp + '  '
            else:
                  text_row_temp = text_row_temp + val.lower()
        text_row = ' ' + text_row_temp + ' '
        j = 0
        for word in words: 
            word = ' ' + word + ' '
            freq = text_row.count(word)
            word_freq_table[i,j] = freq
            j = j + 1
        i = i + 1
    
    return word_freq_table

In [15]:
get_words(stmt_docs)

array(['are', 'author', 'book', 'bought', 'characters', 'fictional',
       'has', 'how', 'i', 'is', 'like', 'more', 'my', 'novel', 'novels',
       'one', 'than', 'the', 'they', 'this'], dtype='<U10')

In [16]:
# Get word frequencies for statement documents

word_list_s = get_words(stmt_docs)
word_freq_table_s = get_doc_word_frequency(word_list_s, stmt_docs)
tdm_s = pd.DataFrame(word_freq_table_s, columns=word_list_s)
print(tdm_s.head())

   are  author  book  bought  characters  fictional  has  how  i  is  like  \
0    0       0     1       0           0          0    0    0  0   1     0   
1    0       1     1       0           0          0    1    0  0   0     0   
2    1       0     0       0           0          0    0    0  0   0     0   
3    0       0     1       1           0          0    0    1  1   1     0   
4    0       0     0       0           1          1    0    0  1   0     1   

   more  my  novel  novels  one  than  the  they  this  
0     0   1      1       0    0     0    0     0     1  
1     1   0      0       0    1     1    0     0     1  
2     0   0      0       1    0     0    0     1     0  
3     0   0      0       0    0     0    1     0     1  
4     0   0      0       0    0     0    0     0     0  


In [17]:
word_freq_table_s

array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
       [0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
       [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [18]:
# Get word frequencies over all statement documents

freq_list_s = word_freq_table_s.sum(axis=0) 
freq_s = dict(zip(word_list_s,freq_list_s))
print(freq_s)

{'are': 1, 'author': 1, 'book': 3, 'bought': 1, 'characters': 1, 'fictional': 1, 'has': 1, 'how': 1, 'i': 2, 'is': 2, 'like': 1, 'more': 1, 'my': 1, 'novel': 1, 'novels': 1, 'one': 1, 'than': 1, 'the': 1, 'they': 1, 'this': 3}


In [19]:
freq_list_s

array([1, 1, 3, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3])

In [20]:
# Get word frequencies for question documents

word_list_q = get_words(question_docs)
word_freq_table_q = get_doc_word_frequency(word_list_q, question_docs)
tdm_q = pd.DataFrame(word_freq_table_q, columns=word_list_q)
print(tdm_q)

   are  author  book  characters  favorite  have  is  my  novels  read  the  \
0    0       0     1           0         0     0   1   1       0     0    0   
1    0       0     1           0         0     1   0   0       0     1    0   
2    0       1     0           0         0     0   1   0       1     0    1   
3    1       0     0           1         0     0   0   0       0     0    1   
4    0       0     1           0         1     0   1   0       0     0    0   

   this  what  who  you  your  
0     1     0    0    0     0  
1     1     0    0    1     0  
2     0     0    1    0     0  
3     0     1    0    0     0  
4     0     1    0    0     1  


In [21]:
# Get word frequencies over all question documents

freq_list_q = word_freq_table_q.sum(axis=0) 
freq_q = dict(zip(word_list_q,freq_list_q))
print(freq_q)
print(freq_list_s)
print(freq_list_q)

{'are': 1, 'author': 1, 'book': 3, 'characters': 1, 'favorite': 1, 'have': 1, 'is': 3, 'my': 1, 'novels': 1, 'read': 1, 'the': 2, 'this': 2, 'what': 2, 'who': 1, 'you': 1, 'your': 1}
[1 1 3 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 3]
[1 1 3 1 1 1 3 1 1 1 2 2 2 1 1 1]


In [22]:
# Get word probabilities for statement class
a = 1
prob_s = []
for count in freq_list_s:
    #print(word, count)
    prob_s.append((count+a)/(sum(freq_list_s)+len(freq_list_s)*a)) #Notice the a from Laplace smoothing to avoid calculating with 0
prob_s.append(a/(sum(freq_list_s)+len(freq_list_s)*a))
    
# Get word probabilities for question class

prob_q = []
for count in freq_list_q:
    prob_q.append((count+a)/(sum(freq_list_q)+len(freq_list_q)*a))
prob_q.append(a/(sum(freq_list_q)+len(freq_list_q)*a))   
    
    
print('Probability of words for "statement" class \n')
print(dict(zip(word_list_s, prob_s)))
print('------------------------------------------- \n')
print('Probability of words for "question" class \n')
print(dict(zip(word_list_q, prob_q)))

Probability of words for "statement" class 

{'are': 0.043478260869565216, 'author': 0.043478260869565216, 'book': 0.08695652173913043, 'bought': 0.043478260869565216, 'characters': 0.043478260869565216, 'fictional': 0.043478260869565216, 'has': 0.043478260869565216, 'how': 0.043478260869565216, 'i': 0.06521739130434782, 'is': 0.06521739130434782, 'like': 0.043478260869565216, 'more': 0.043478260869565216, 'my': 0.043478260869565216, 'novel': 0.043478260869565216, 'novels': 0.043478260869565216, 'one': 0.043478260869565216, 'than': 0.043478260869565216, 'the': 0.043478260869565216, 'they': 0.043478260869565216, 'this': 0.08695652173913043}
------------------------------------------- 

Probability of words for "question" class 

{'are': 0.05128205128205128, 'author': 0.05128205128205128, 'book': 0.10256410256410256, 'characters': 0.05128205128205128, 'favorite': 0.05128205128205128, 'have': 0.05128205128205128, 'is': 0.10256410256410256, 'my': 0.05128205128205128, 'novels': 0.0512820512

In [23]:
dict(zip(word_list_s, prob_s))

{'are': 0.043478260869565216,
 'author': 0.043478260869565216,
 'book': 0.08695652173913043,
 'bought': 0.043478260869565216,
 'characters': 0.043478260869565216,
 'fictional': 0.043478260869565216,
 'has': 0.043478260869565216,
 'how': 0.043478260869565216,
 'i': 0.06521739130434782,
 'is': 0.06521739130434782,
 'like': 0.043478260869565216,
 'more': 0.043478260869565216,
 'my': 0.043478260869565216,
 'novel': 0.043478260869565216,
 'novels': 0.043478260869565216,
 'one': 0.043478260869565216,
 'than': 0.043478260869565216,
 'the': 0.043478260869565216,
 'they': 0.043478260869565216,
 'this': 0.08695652173913043}

In [24]:
dict(zip(word_list_q, prob_s))

{'are': 0.043478260869565216,
 'author': 0.043478260869565216,
 'book': 0.08695652173913043,
 'characters': 0.043478260869565216,
 'favorite': 0.043478260869565216,
 'have': 0.043478260869565216,
 'is': 0.043478260869565216,
 'my': 0.043478260869565216,
 'novels': 0.06521739130434782,
 'read': 0.06521739130434782,
 'the': 0.043478260869565216,
 'this': 0.043478260869565216,
 'what': 0.043478260869565216,
 'who': 0.043478260869565216,
 'you': 0.043478260869565216,
 'your': 0.043478260869565216}

In [25]:
stmt_docs

['This is my novel book',
 'this book has more than one author',
 'They are novels',
 'This is how I bought the book',
 'I like fictional characters']

In [26]:
# Calculate prior for one class

def prior(className):    
    denominator = len(stmt_docs) + len(question_docs)
    
    if className == 'statement':
        numerator =  len(stmt_docs)
    else:
        numerator =  len(question_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
        print(word)
        if className == 'statement':
            idx = np.where(word_list_s == word)
            prob = prob * prob_s[np.array(idx)[0,0]]
        else:
            idx = np.where(word_list_q == word)
            prob = prob * prob_q[np.array(idx)[0,0]]   
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    if  prob_statement > prob_question:
        return 'statement'
    else:
        return 'question'

In [27]:
# Get word probabilities for statement class

prob_s = []
for word, count in zip(word_list_s, freq_list_s):
    #print(word, count)
    prob_s.append(count/len(word_list_s))
    
# Get word probabilities for question class

prob_q = []
for count in freq_list_q:
    prob_q.append(count/len(word_list_q))
    
print('Probability of words for "statement" class \n')
print(dict(zip(word_list_s, prob_s)))
print('------------------------------------------- \n')
print('Probability of words for "question" class \n')
print(dict(zip(word_list_q, prob_q)))

Probability of words for "statement" class 

{'are': 0.05, 'author': 0.05, 'book': 0.15, 'bought': 0.05, 'characters': 0.05, 'fictional': 0.05, 'has': 0.05, 'how': 0.05, 'i': 0.1, 'is': 0.1, 'like': 0.05, 'more': 0.05, 'my': 0.05, 'novel': 0.05, 'novels': 0.05, 'one': 0.05, 'than': 0.05, 'the': 0.05, 'they': 0.05, 'this': 0.15}
------------------------------------------- 

Probability of words for "question" class 

{'are': 0.0625, 'author': 0.0625, 'book': 0.1875, 'characters': 0.0625, 'favorite': 0.0625, 'have': 0.0625, 'is': 0.1875, 'my': 0.0625, 'novels': 0.0625, 'read': 0.0625, 'the': 0.125, 'this': 0.125, 'what': 0.125, 'who': 0.0625, 'you': 0.0625, 'your': 0.0625}


### In-lab exercise: Laplace smoothing

Run the code below and figure out why it fails.

When a word does not appear with a specific class in the training data, its class-conditional probability is 0, and we are unable to
get a reasonable probability for that class.

Research Laplace smoothing, and modify the code above to implement Laplace smoothing (setting the frequency of all words with frequency 0 to a frequency of 1).
Run the modified code on the test set.

In [28]:
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])
print('Getting prediction for %s"' % test_docs[0])
predict(test_docs[0])


Getting prediction for this is the book"



IndexError: index 0 is out of bounds for axis 1 with size 0

### Exercise 1.1 (10 points)

Explain Why it failed and explain how to solve the problem.

Explanation here! (Double click to explain)

## Explanation
1) When trying the original function get_words(), it returns each letter instead of the whole word so I rewrote the function to extract unique words from each input setence

2) Using original classCondProb(), we can see that if a word to be predicted is not included in the training set for that class, it will result in error

3) To solve this, I created the laplace_smooth() to make sure that for every class, every missing words from other classses must be included with the frequency squeezed to only 1 to avoid non-zero computation

4) With the new word list for each class, it is necessary to recalculate the probability of each word, and must make sure the sum of all words on that class equal to 1

### Exercise 1.2 (20 points)

Modify the code to make it work using Laplace smoothing. Include the functions `prior()`, `classCondProb()`, and `predict()`.

### 1) Redefine get_words()

In [29]:
def get_words(sentence):
    split = sentence.lower().split()
    strip = [word.strip('.,!;()[]') for word in split]
    words = np.unique(strip)
    
    
    return words

### 2) Create laplace_smooth() to ensure collectively exhaustive words in each class

In [30]:
def laplace_smooth(word_list_s, freq_list_s, word_list_q, freq_list_q):
    s_extra = np.setdiff1d(word_list_q, word_list_s, assume_unique=True)
    q_extra = np.setdiff1d(word_list_s, word_list_q, assume_unique=True)
    
    word_list_s = np.concatenate((word_list_s, s_extra))
    word_list_q = np.concatenate((word_list_q, q_extra))  
    
    freq_list_s = np.concatenate((freq_list_s, np.zeros(s_extra.shape, dtype=int)))
    freq_list_q = np.concatenate((freq_list_q, np.zeros(q_extra.shape, dtype=int)))
    
    freq_list_s = freq_list_s + 1
    freq_list_q = freq_list_q + 1
    
    return word_list_s, freq_list_s, word_list_q, freq_list_q

new_word_list_s, new_freq_list_s, new_word_list_q, new_freq_list_q = laplace_smooth(word_list_s, freq_list_s, word_list_q, freq_list_q)

### 3) Recalculate the word list frequency

In [31]:
a = 1
new_prob_s = []
for count in new_freq_list_s:
    new_prob_s.append((count+a)/(sum(new_freq_list_s)+len(new_freq_list_s)*a)) #Notice the a from Laplace smoothing to avoid calculating with 0
    
# new_prob_s.append(a/(sum(new_freq_list_s)+len(new_freq_list_s)*a))
    
# Get word probabilities for question class

new_prob_q = []
for count in new_freq_list_q:
    new_prob_q.append((count+a)/(sum(new_freq_list_q)+len(new_freq_list_q)*a))
# prob_q.append(a/(sum(freq_list_q)+len(freq_list_q)*a))   

In [32]:
# MY VERSION 1
# Calculate prior for one class

def prior(className):    
    denominator = len(stmt_docs) + len(question_docs)
    
    if className == 'statement':
        numerator =  len(stmt_docs)
    else:
        numerator =  len(question_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
#         print(word)
        if className == 'statement':
            if word in new_word_list_s:
#                 print('yes')
                idx = np.where(new_word_list_s == word)
                prob = prob * new_prob_s[idx[0][0]] #####
            else:
                print('Not in list!!!!!!!!!!!!')
#             print()
        else:
            if word in new_word_list_q:
#                 print('yes')
                idx = np.where(new_word_list_q == word)
                prob = prob * new_prob_q[idx[0][0]] #####
            else:
                print('Not in list!!!!!!!!!!!!')
#             print()
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    print(prob_statement, prob_question )
    
    if  prob_statement > prob_question:
        return 'statement'
    else:
        return 'question'
    
    
    # YOUR CODE HERE
# raise NotImplementedError()

---
## Test the new result

In [33]:
len(new_freq_list_s)

27

In [34]:
len(new_prob_s)

27

In [35]:
print('Probability of words for "statement" class \n')
print(dict(zip(new_word_list_s, new_prob_s)))
print('------------------------------------------- \n')
print('Probability of words for "question" class \n')
print(dict(zip(new_word_list_q, new_prob_q)))

Probability of words for "statement" class 

{'are': 0.0375, 'author': 0.0375, 'book': 0.0625, 'bought': 0.0375, 'characters': 0.0375, 'fictional': 0.0375, 'has': 0.0375, 'how': 0.0375, 'i': 0.05, 'is': 0.05, 'like': 0.0375, 'more': 0.0375, 'my': 0.0375, 'novel': 0.0375, 'novels': 0.0375, 'one': 0.0375, 'than': 0.0375, 'the': 0.0375, 'they': 0.0375, 'this': 0.0625, 'favorite': 0.025, 'have': 0.025, 'read': 0.025, 'what': 0.025, 'who': 0.025, 'you': 0.025, 'your': 0.025}
------------------------------------------- 

Probability of words for "question" class 

{'are': 0.03896103896103896, 'author': 0.03896103896103896, 'book': 0.06493506493506493, 'characters': 0.03896103896103896, 'favorite': 0.03896103896103896, 'have': 0.03896103896103896, 'is': 0.06493506493506493, 'my': 0.03896103896103896, 'novels': 0.03896103896103896, 'read': 0.03896103896103896, 'the': 0.05194805194805195, 'this': 0.05194805194805195, 'what': 0.05194805194805195, 'who': 0.03896103896103896, 'you': 0.038961038961

In [36]:
assert np.sum(new_prob_s) == np.sum(new_prob_q) == 1.0

In [37]:
df_s = pd.DataFrame(np.vstack((new_word_list_s, new_freq_list_s, new_prob_s)).T, columns = ['word', 'occurrence', 'prob'])
df_q = pd.DataFrame(np.vstack((new_word_list_q, new_freq_list_q, new_prob_q)).T, columns = ['word', 'occurrence', 'prob'])

In [38]:
df_q.sort_values("prob", ascending =False).head()

Unnamed: 0,word,occurrence,prob
2,book,4,0.0649350649350649
6,is,4,0.0649350649350649
10,the,3,0.0519480519480519
12,what,3,0.0519480519480519
11,this,3,0.0519480519480519


In [39]:
df_s.sort_values("prob", ascending =False).head()

Unnamed: 0,word,occurrence,prob
2,book,4,0.0625
19,this,4,0.0625
8,i,3,0.05
9,is,3,0.05
0,are,2,0.0375


In [40]:
testing_data.values

array([['this is the book', 'statement'],
       ['who are the novels characters', 'question'],
       ['is this the author', 'question'],
       ['I like apples', None]], dtype=object)

In [41]:
answer = []
for i, sentence in enumerate(test_docs[:-1]):
    predicted_class = predict(sentence)
    answer.append(predicted_class)

    
    
correct= np.sum(np.asarray(answer) == testing_data.values[:-1,1])
total = len(answer)

print(f"The accuracy rate on this test set is {round(correct/ total *100, 2)}%")
    

3.662109375e-06 5.689408207955606e-06
2.471923828125e-08 5.984961881096159e-08
2.197265625e-06 3.413644924773365e-06
The accuracy rate on this test set is 66.67%


In [42]:
print(answer)
print(testing_data.values[:-1,1])

['question', 'question', 'question']
['statement' 'question' 'question']


In [43]:
# Test function: Do not remove
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])

for sentence in test_docs:
    print('Getting prediction for %s"' % sentence)
    print(predict(sentence))
    
print("success!")
# End Test function

Getting prediction for this is the book"
3.662109375e-06 5.689408207955606e-06
question
Getting prediction for who are the novels characters"
2.471923828125e-08 5.984961881096159e-08
question
Getting prediction for is this the author"
2.197265625e-06 3.413644924773365e-06
question
Getting prediction for I like apples"
Not in list!!!!!!!!!!!!
Not in list!!!!!!!!!!!!
0.0009375 0.000337325012649688
statement
success!


**Expected result**:\
Getting prediction for this is the book"\
question\
Getting prediction for who are the novels characters"\
question\
Getting prediction for is this the author"\
question\
Getting prediction for I like apples"\
statement\
success!

### Take home exercise

Find a more substantial text classification dataset, clean up the documents, and build your NB classifier. Write a brief report on your in-lab and take home exercises and results here.

In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

import re
import string

from sklearn.model_selection import train_test_split

In [45]:
data = pd.read_csv('tweet_emotions.csv')

In [46]:
data['sentiment'].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

In [47]:
data = data[data['sentiment'].isin(['neutral', 'sadness', 'happiness'])]

In [48]:
emotionClass = data['sentiment'].unique()

In [49]:
emotionClass

array(['sadness', 'neutral', 'happiness'], dtype=object)

In [50]:
data = data.reset_index(drop = True)

In [51]:
data

Unnamed: 0,tweet_id,sentiment,content
0,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...
1,1956967696,sadness,Funeral ceremony...gloomy friday...
2,1956968416,neutral,@dannycastillo We want to trade with someone w...
3,1956968487,sadness,"I should be sleep, but im not! thinking about ..."
4,1956969035,sadness,@charviray Charlene my love. I miss you
...,...,...,...
19007,1753918881,neutral,@jasimmo Ooo showing of your French skills!! l...
19008,1753918892,neutral,"@sendsome2me haha, yeah. Twitter has many uses..."
19009,1753918900,happiness,Succesfully following Tayla!!
19010,1753918954,neutral,@JohnLloydTaylor


In [52]:
data['content'] = data['content'].apply(lambda x: re.sub("@[A-Za-z0-9_]+","", x))
data['content'] = data['content'].apply(lambda x: re.sub(r'\W+', ' ', x))
data['content'] = data['content'].apply(lambda x: re.sub(r'[0-9]+', '', x))
data['content'] = data['content'].apply(lambda x: x.lower())

In [53]:
X = data['content']

In [54]:
y = data['sentiment']

In [55]:
data.sentiment.unique()

array(['sadness', 'neutral', 'happiness'], dtype=object)

In [56]:
X = X[3500:5200]

In [57]:
y = y[3500:5200]

In [58]:
y.value_counts()

sadness      852
neutral      670
happiness    178
Name: sentiment, dtype: int64

In [59]:
X_train ,X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

In [60]:
y_train

4754      neutral
4239    happiness
4230      sadness
4841      sadness
5012      sadness
          ...    
4893    happiness
4844      sadness
4027      sadness
4649      neutral
4789      neutral
Name: sentiment, Length: 1190, dtype: object

In [61]:
X_train.head(20)

4754     actually i think the na release date was conf...
4239                        doh i m an inside job lolzor 
4230    ok so everyone is out an im stuck in bored and...
4841    has has an ok day with jo she bit me and now i...
5012                            crappy music on the radio
4218     okay so rose told me your phone was shut off ...
4773    ha i got another followfriday take that oh you...
4344       last day at the ko olina off to north shoreee 
4618    red rock for lunch with colin and couldn t com...
3578                                   bout to go to work
3679                         to cold for the beach sucky 
5013    whoishonorsociety lt never wear your pajama pa...
4157    ready for this saddenning depressing dullful u...
4594                                          work  close
4331    oh my god my favorite havaianas just broke aft...
5007     i m sorry ahahah you and rachel look so much ...
3621                     im upset cuz now everyone agrees
3572     poor 

In [62]:
y_train.value_counts()

sadness      593
neutral      472
happiness    125
Name: sentiment, dtype: int64

In [63]:
y_test.value_counts()

sadness      259
neutral      198
happiness     53
Name: sentiment, dtype: int64

In [64]:
neutral_idx = np.where(y_train == 'neutral')[0]
sadness_idx = np.where(y_train == 'sadness')[0]
happiness_idx = np.where(y_train == 'happiness')[0]

In [65]:
neutral_doc = X_train.iloc[neutral_idx]
sadness_doc = X_train.iloc[sadness_idx]
happiness_doc = X_train.iloc[happiness_idx]

In [66]:
neutral_doc

4754     actually i think the na release date was conf...
4218     okay so rose told me your phone was shut off ...
4618    red rock for lunch with colin and couldn t com...
5013    whoishonorsociety lt never wear your pajama pa...
4594                                          work  close
                              ...                        
3871      ha except omg amy bb i have to give my loane...
4532                                 working on homework 
5020     just wondering if you are going to put quot o...
4649    just sneezed three times in quick succession b...
4789                        cold amp raining in inglewood
Name: content, Length: 472, dtype: object

In [67]:
sadness_doc

4230    ok so everyone is out an im stuck in bored and...
4841    has has an ok day with jo she bit me and now i...
5012                            crappy music on the radio
4773    ha i got another followfriday take that oh you...
4344       last day at the ko olina off to north shoreee 
                              ...                        
4597     i won t be getting any rotf toys till much la...
4680       watched prison break special such a sad ending
4647    willie is pouting because grandma didn t put a...
4844     its been clownin since it got flooded in htow...
4027     if i can get a ticket but the pickings are lo...
Name: content, Length: 593, dtype: object

In [68]:
happiness_doc

4239                        doh i m an inside job lolzor 
4267     haha i am aware of how one contracts a uti an...
4854     awwwww would a virtual high five make it any ...
4441    it s been so nice all day and now it looks lik...
3825     ohh what fun a night at slimes i miss that pl...
                              ...                        
4255    jon amp kate  kids have attracted a huge tv au...
4074     hey so glad it s friday but not happy that i ...
4944     quot i wanna go to prom one day quot i wish u...
4233    what a glorious week my best holiday ever i th...
4893    fever of  awesome my tonsils are so swollen i ...
Name: content, Length: 125, dtype: object

In [69]:
def get_words(text):
    # Initialize word list
    words = [];
    # Loop through each sentence in input array
    for text_row in text:       
        # Check the number of words. Assume each word is separated by a blank space
        # so that the number of words is the number of blank spaces + 1
        number_of_spaces = text_row.count(' ')
        # loop through the sentence and get words between blank spaces.
        for i in range(number_of_spaces):
            # Check for for last word
            words.append([text_row[:text_row.index(' ')].lower()])
            text_row = text_row[text_row.index(' ')+1:]  
            i = i + 1        
        words.append([text_row])
    
    return np.unique(words)[1:]


def get_doc_word_frequency(words, text):  
    word_freq_table = np.zeros((len(text),len(words)), dtype=int)
    i = 0
    for text_row in text:
        # Insert extra space between each pair of words to prevent
        # partial match of words
        text_row_temp = ''
        for idx, val in enumerate(text_row):
            if val == ' ':
                 text_row_temp = text_row_temp + '  '
            else:
                  text_row_temp = text_row_temp + val.lower()
        text_row = ' ' + text_row_temp + ' '
        j = 0
        for word in words: 
            word = ' ' + word + ' '
            freq = text_row.count(word)
            word_freq_table[i,j] = freq
            j = j + 1
        i = i + 1
    
    return word_freq_table

In [70]:
neutral_words = get_words(neutral_doc)
sadness_words = get_words(sadness_doc)
happiness_words = get_words(happiness_doc)

In [71]:
len(neutral_words)

1656

In [72]:
len(sadness_words)

2066

In [73]:
len(happiness_words)

794

In [74]:
neutral_words_f = get_doc_word_frequency(neutral_words, neutral_doc).sum(axis= 0)
sadness_words_f = get_doc_word_frequency(sadness_words, sadness_doc).sum(axis= 0)
happiness_words_f = get_doc_word_frequency(happiness_words, happiness_doc).sum(axis= 0)

In [75]:
neutral_words_f

array([ 1, 91,  1, ...,  1,  2,  2])

In [76]:
sadness_words_f

array([  2, 112,   1, ...,   1,   2,   1])

In [77]:
neutral_words_f_dict = dict(zip(neutral_words, neutral_words_f ))
sadness_words_f_dict = dict(zip(sadness_words, sadness_words_f ))
happiness_words_f_dict = dict(zip(happiness_words, happiness_words_f ))

In [78]:
def laplace_smooth(word_list_s, freq_list_s, word_list_q, freq_list_q):
    s_extra = np.setdiff1d(word_list_q, word_list_s, assume_unique=True)
    q_extra = np.setdiff1d(word_list_s, word_list_q, assume_unique=True)
    
    word_list_s = np.concatenate((word_list_s, s_extra))
    word_list_q = np.concatenate((word_list_q, q_extra))  
    
    freq_list_s = np.concatenate((freq_list_s, np.zeros(s_extra.shape, dtype=int)))
    freq_list_q = np.concatenate((freq_list_q, np.zeros(q_extra.shape, dtype=int)))
    
    freq_list_s = freq_list_s + 1
    freq_list_q = freq_list_q + 1
    
    return word_list_s, freq_list_s, word_list_q, freq_list_q

In [79]:
from functools import reduce
all_words = reduce(np.union1d, (neutral_words, sadness_words, happiness_words   ))
print(len(all_words))

3257


In [80]:
extra_n = np.setdiff1d(all_words, neutral_words, assume_unique =True )
extra_s = np.setdiff1d(all_words, sadness_words , assume_unique =True)
extra_h = np.setdiff1d(all_words, happiness_words , assume_unique =True)

In [81]:
neutral_words = np.concatenate((neutral_words, extra_n))
neutral_words_f = np.concatenate((neutral_words_f, np.zeros(extra_n.shape, dtype=int))) +1 

sadness_words = np.concatenate((sadness_words, extra_s))
sadness_words_f = np.concatenate((sadness_words_f, np.zeros(extra_s.shape, dtype=int))) +1 

happiness_words = np.concatenate((happiness_words, extra_h))
happiness_words_f = np.concatenate((happiness_words_f, np.zeros(extra_h.shape, dtype=int))) +1 

In [82]:
a = 1
new_prob_n = []
for count in neutral_words_f:
    new_prob_n.append((count+a)/(sum(neutral_words_f)+len(neutral_words_f)*a)) #Notice the a from Laplace smoothing to avoid calculating with 0
    


new_prob_s = []
for count in sadness_words_f:
    new_prob_s.append((count+a)/(sum(sadness_words_f)+len(sadness_words_f)*a))
    
    
new_prob_h = []
for count in happiness_words_f:
    new_prob_h.append((count+a)/(sum(happiness_words_f)+len(happiness_words_f)*a))


In [83]:
# MY VERSION 1
# Calculate prior for one class

def prior(className):    
    denominator = len(neutral_words) + len(sadness_doc) + len(happiness_doc)
    
    if className == 'neutral':
        numerator =  len(neutral_words)
        
    elif className == 'sadness':
        numerator =  len(sadness_doc)
        
    elif className == 'happiness':
        numerator =  len(happiness_doc) 
    
    else:
        print('class name not exists')
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:

        if className == 'neutral':
            if word in neutral_words:

                idx = np.where(neutral_words == word)[0][0]
                prob = prob * new_prob_n[idx] #####!!
#             else:
#                 print(f'{word} Not in list!!!!!!!!!!!!')
               

        elif className == 'sadness':
            if word in sadness_words:

                idx = np.where(sadness_words == word)[0][0]
                prob = prob * new_prob_s[idx] #####
#             else:
#                 print(f'{word} Not in list!!!!!!!!!!!!')
                
        elif className == 'happiness':
            if word in happiness_words:

                idx = np.where(happiness_words == word)[0][0]
                prob = prob * new_prob_h[idx] #####
#             else:
#                 print(f'{word} Not in list!!!!!!!!!!!!')

    
    return prob



def predict(sentence):
    prob_neutral = classCondProb(sentence, 'neutral') * prior('neutral')
    prob_sadness = classCondProb(sentence, 'sadness') * prior('sadness')
    prob_happiness = classCondProb(sentence, 'happiness') * prior('happiness')
    
    
    result = np.asarray((prob_neutral, prob_sadness,  prob_happiness ))
    
    return result
    

In [84]:
X_test.iloc[0]

'at the risk of sounding like a whiny child i gotta say i wanna go hoooooooooooomeeeee ugh '

In [85]:
y_test.iloc[0]

'sadness'

In [86]:
predict(X_test.iloc[0])

array([3.31914271e-57, 3.23799202e-57, 4.32141164e-60])

In [87]:
result_list = np.zeros((len(y_test), 3 ))

for i, sentence in enumerate(X_test):
    yhat = predict(sentence)

    result_list[i] = yhat


In [88]:
result_list.shape

(510, 3)

In [89]:
prediction = np.argmax(result_list, axis = 1)

In [90]:
prediction

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,

In [91]:
y_test_encoded = y_test.copy()

In [92]:
y_test_encoded[y_test_encoded == 'neutral'] = 0
y_test_encoded[y_test_encoded == 'sadness'] = 1
y_test_encoded[y_test_encoded == 'happiness'] = 2

In [93]:
y_test_encoded

4877    1
4319    1
4804    1
4381    0
5035    1
       ..
3723    2
3989    0
4408    1
4091    2
3618    1
Name: sentiment, Length: 510, dtype: object

In [94]:
neutral_correct= len(np.intersect1d(np.where(prediction == y_test_encoded) , np.where(y_test_encoded == 0) ))
all_neutral = len(np.where(y_test_encoded == 0)[0])


sadness_correct = len(np.intersect1d(np.where(prediction == y_test_encoded) , np.where(y_test_encoded == 1) ))
all_sadness = len(np.where(y_test_encoded == 1)[0])

happiness_correct = len(np.intersect1d(np.where(prediction == y_test_encoded) , np.where(y_test_encoded == 2) ))
all_happiness = len(np.where(y_test_encoded == 2)[0])


all_correct = len(np.where(prediction == y_test_encoded)[0])

In [95]:
print(f"Accuracy on neutral class: {neutral_correct / all_neutral}")
print(f"Accuracy on sadness class: {sadness_correct / all_sadness}")
print(f"Accuracy on happiness class: {happiness_correct / all_happiness}")
print()
print(f"Overall prediction performance: {all_correct/ len(y_test)}")

Accuracy on neutral class: 0.8333333333333334
Accuracy on sadness class: 0.18146718146718147
Accuracy on happiness class: 0.0

Overall prediction performance: 0.41568627450980394


--- 
## Summary
### Data preparation and cleansing
- This text analysis classificaiton exercises originally contains over 40,000 rows with as many as 13 sentiment classes.
- To scope down for further focus and computational ease, I have sliced 1,700 samples, from the only selected 3 classes: 'neutral', happiness' and 'sadness'
- However, it is noteworthy that the each class we observe the imbalance distribution, with more density in 'neutral' and 'sadness'  while only around 18% from 'happiness'
- Next, 3 major strategies are performed to manipulate the data, 1) deleting all hashtage names, 2) delete all words beginning with numbers and 3) delete special characters

### Train & Test the model
- As we studied, I have calculated the frequency of words in texts corresponding to each class /sentiment, then the prior, 
- Laplace comes into picture when assruing that each class contails the collectively exhaustive set of words, with the default value of 1 for any word NOT existing in that classs
- Next is to calculate the probability on test set to answer whichch class each sentence is likely to belong to using Bayes's Theorem ,and get the final predicted class from argmax() function 

### Result Interpretation and Further Optimization
- Overall, on test set, the accuracy of test set is at only 42%. But when we look deep into each category we'll see the significantly high accuracy rate on neutral class and horrible accuracy on happiness class
- The explanation for this is
    - The imbalance distribuiton of the data in training set where happpiness class accounts less than 20% of the entire dataset
    - The neutral message can contain a large number of words overlapped in happiness and sadness classes, thus giving more likelihood for the model to interpret them as neutral
- The further optimization
    - The training data should be resampled with the normally distributed approach
    - Implementation of more advanced technique like TF-IDF text analysis should be used to reducte the importance frequent words, and more on rare words for better classification