<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Lecture 5: Probability and Statistics Applications</a>
## <a name="0">Lab 5.2: Naïve Bayes</a>

 1. <a href="#1">Business Problem: Fraud Detection</a> 
 2. <a href="#2">Naive Bayes</a> 
 3. <a href="#3">Naive Bayes sklearn Implementation</a> 
 4. <a href="#4">Summary</a> 

In [None]:
# Standard libraries
# Upgrade dependencies
#!pip install --upgrade pip
#!pip install --upgrade scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

import warnings
warnings.filterwarnings("ignore")

## <a name="1">1. Business Problem: Fraud Detection</a> 
(<a href="#0">Go to top</a>)

A product manager examines customer data to predict __Fraudulent__ activities, based on the following variables collected: __Grammar__, __Cheapest__, __Country__, __Picture__, and __Review__. 

This is a __classification__ problem that can be handled, aside from the other ML techniques discussed previously, with a Naive Bayes Classifier.

<img style="width: 30%;" src="../../images/fraud_detection.png">

#### Data Loading

In [None]:
# Load the dataset and list its columns
data = pd.read_csv('../../data/MATH_Lecture_5_Data.csv')
data.columns

model_features = ['Grammar', 'Cheapest', 'Country', 'Picture', 'Review']
data = data.drop('Customer ID', axis = 1)

from sklearn.model_selection import train_test_split

train, test = train_test_split(data, random_state = 0)
print('Train shape: {}; test shape: {}'.format(train.shape, test.shape))

## <a name="2">2. Naive Bayes</a> 
(<a href="#0">Go to top</a>)

Naive Bayes is a linear classifier that leverages Bayes' Theorem to build a probabilistic model that predicts a class label $y$ for a given data observation $Data$:

$$P(y|Data) = \frac{P(Data|y) P(y)}{P(Data)},$$

That is, it aims to estimate the conditional probability of a class label given the observed data by combining the probability of the class, which we can estimate from the labeled dataset, with the conditional probability of the observed data given the class label. For Bayes' Theorem to succeed in computing this conditional probability, especially when each input variable depends on all other variables, a large number of observed data samples is needed to estimate reliable probability distributions for all possible combinations of variable values. Even when such large datasets are available, (as we remarked during the lecture) applying Bayes' Theorem directly is computationally expensive. 

However, the calculation of Bayes' Theorem can be simplified with some *naive* assumptions on the input variables: consider each __input variable as being independent of the others, given the class label__; and also consider that __all the input variables have an equal effect on the outcome__. Under these assumptions, the conditional probability of all input variables given the class label becomes a product of the separate conditional probabilities of each variable value given the class label. 
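
To get a feel for how much the independence assumption saves, the sketch below counts the free parameters needed per class for binary input variables, with and without the naive assumption. The function names are hypothetical helpers introduced only for this illustration.

```python
def joint_parameter_count(n_binary_features: int) -> int:
    """Free parameters per class for a full joint distribution over binary features."""
    # 2**n possible combinations, minus one because the probabilities sum to 1
    return 2 ** n_binary_features - 1


def naive_parameter_count(n_binary_features: int) -> int:
    """Free parameters per class when features are independent given the class."""
    # One Bernoulli parameter P(x_i = 1 | y) per feature
    return n_binary_features


for n_features in (5, 10, 20):
    print(n_features, joint_parameter_count(n_features), naive_parameter_count(n_features))
```

For the five binary variables used in this lab, the full joint model needs 31 parameters per class, while the naive model needs only 5.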

Overall, the estimates of the conditional probabilities for each class label given the observed data are compared, and the class label with the largest probability is used for final class assignment for the given data instance.

Let's see how a __Naive Bayes__ classifier works in practice, for our binary classification problem with five categorical (encoded binary) input variables, ```Grammar``` $(Gr)$, ```Cheapest``` $(Ch)$, ```Country``` $(Co)$, ```Picture``` $(Pi)$, and ```Review``` $(Re)$. Under the Naive Bayes assumptions, the conditional probability of class $y$ = ```Fraudulent``` $(1)$ given the observed data would be:


$$
P(y = 1|Gr, Ch, Co, Pi, Re) = \frac{P({Gr}|y = 1) P({Ch}|y = 1) P({Co}|y = 1) P({Pi}|y = 1) P({Re}|y = 1) P(y = 1)}{P(Gr, Ch, Co, Pi, Re)},
$$


with a similar computation for the other class, $(0)$. Because $P(Gr, Ch, Co, Pi, Re)$ is the same regardless of the class, and only has the effect of normalizing the results, we further simplify the computations by dropping $P(Gr, Ch, Co, Pi, Re)$ when comparing the conditional probabilities of the two classes.
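
As a small illustration of why the denominator can be dropped, the sketch below normalizes two unnormalized class scores and checks that the winning class is unchanged. The score values and the helper name `normalize_scores` are made up for this example.

```python
def normalize_scores(scores):
    """Rescale unnormalized class scores so that they sum to 1."""
    total = sum(scores)
    return [score / total for score in scores]


# Hypothetical unnormalized scores P(y) * prod_i P(x_i | y) for y = 0 and y = 1
example_scores = [0.012, 0.003]
example_posteriors = normalize_scores(example_scores)

# Dividing by the same positive constant does not change which class is largest
best_before = example_scores.index(max(example_scores))
best_after = example_posteriors.index(max(example_posteriors))
print(example_posteriors, best_before, best_after)
```

The normalized values are proper probabilities (roughly 0.8 and 0.2 here), but the predicted class is the same either way.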


### Summarize dataset by class, compute class probabilities

Let's first summarize our train dataset by class, and compute the two class probabilities. The dataset is first split by class, then class probabilities are calculated based on counts from each subset, using

$$
P(y = 0) = \frac{count(y = 0)}{count(y = 0) + count(y = 1)},
$$ 

and 

$$
P(y = 1) = \frac{count(y = 1)}{count(y = 0) + count(y = 1)}.
$$


In [None]:
# Compute class probabilities 
def class_probabilities(train):
    class_count = [train[train['Fraudulent'] == i].shape[0] for i in [0,1]]

    class_proba = [class_count[i]/(np.sum(class_count)) for i in [0, 1]]
    
    return class_count, class_proba

class_count, class_proba = class_probabilities(train)
print(class_count, class_proba)

### Summarize class subsets by variable values, compute conditional probabilities

The conditional probabilities are the probabilities of each input variable value given each class value. They can be calculated by splitting each class-specific subset further into subsets specific to each variable value, and counting the elements of each subset. For example, for ```Grammar``` the conditional probabilities will be given by

$$
\begin{aligned}
P({Gr} = 1|y = 1) &= \frac{P({Gr} = 1, y = 1)}{P(y = 1)}, \\
P({Gr} = 0|y = 1) &= \frac{P({Gr} = 0, y = 1)}{P(y = 1)}, \\
P({Gr} = 1|y = 0) &= \frac{P({Gr} = 1, y = 0)}{P(y = 0)}, \\
P({Gr} = 0|y = 0) &= \frac{P({Gr} = 0, y = 0)}{P(y = 0)},
\end{aligned}
$$

with similar computations for the other four variables. 

__Smoothing__: If a given pair of variable value and class label never occurs together in the training data, then the conditional probability estimate for that variable value conditioned on that class will be zero, and the whole product for that class collapses to zero. One way to avoid zero probabilities is to introduce [additive (Laplace) smoothing](https://en.wikipedia.org/wiki/Additive_smoothing). We code a very simple approach to additive smoothing below: we add $1$ to each of the counts, that is, we assume we saw each variable value for each class once more than we actually did. To keep the probabilities summing to one, the class count in the denominator grows by $2$, the number of possible values of each binary variable. 
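
The effect of smoothing on a single estimate can be sketched in isolation. The helper `smoothed_probability` and the counts below are hypothetical and only illustrate the formula (count + smoothing) / (class count + number of values × smoothing).

```python
def smoothed_probability(value_count, class_count, smoothing=1, n_values=2):
    """Additive-smoothing estimate of P(variable = value | class)."""
    return (value_count + smoothing) / (class_count + n_values * smoothing)


# Hypothetical counts: the value 1 was never seen among 10 examples of a class
unsmoothed = smoothed_probability(0, 10, smoothing=0)
smoothed_one = smoothed_probability(0, 10, smoothing=1)
smoothed_zero = smoothed_probability(10, 10, smoothing=1)

print(unsmoothed, smoothed_one, smoothed_zero)
```

Without smoothing the estimate is exactly zero. With smoothing it becomes 1/12, and the two smoothed estimates for the variable still sum to one.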

In [None]:
# Compute the conditional probabilities 
def variables_conditional_probabilities(train, smoothing):
    # For each variable, returns [P(0|y=0), P(1|y=0), P(0|y=1), P(1|y=1)]
    vars_proba = []
    for var in model_features:
        var_proba = []
        for class_value in [0, 1]:
            data_class_value = train[train['Fraudulent'] == class_value]
            # Size of this class subset, computed from the data passed in
            # (instead of relying on a global class_count)
            n_class = data_class_value.shape[0]
            var_count = [data_class_value[data_class_value[var] == val].shape[0] for val in [0, 1]]
            for val in [0, 1]:                     
                var_y_proba = (var_count[val] + smoothing)/(n_class + 2*smoothing)
                var_proba.append(var_y_proba)
        vars_proba.append(var_proba)
    
    return vars_proba
    
vars_proba = variables_conditional_probabilities(train, smoothing = 1)
print(vars_proba)

### Make class predictions with Naive Bayes

We now have all the ingredients to compute the conditional probabilities for each class, and decide on the final class assignment based on the largest probability value. Let's see how this works for one datapoint.

In [None]:
n = 24
# Select features and label by name rather than by position
datapoint = train.iloc[n][model_features]
datapoint_label = train.iloc[n]['Fraudulent']

def compute_nb_probas(datapoint, class_proba, vars_proba): 
    # Unnormalized scores P(y) * prod_i P(x_i | y) for y = 0 and y = 1
    predicted_probas = []
    for class_value in [0, 1]:
        nb_proba = class_proba[class_value]
        for i, var in enumerate(model_features):
            # vars_proba[i] is ordered as [P(0|y=0), P(1|y=0), P(0|y=1), P(1|y=1)]
            if datapoint[var] == 0:
                nb_proba = nb_proba*vars_proba[i][2*class_value]
            if datapoint[var] == 1:
                nb_proba = nb_proba*vars_proba[i][2*class_value + 1]
        predicted_probas.append(nb_proba)
    # Pick the class with the largest score (ties go to class 0)
    predicted_class = int(np.argmax(predicted_probas))
    max_predicted_proba = predicted_probas[predicted_class]
            
    return predicted_probas, max_predicted_proba, predicted_class
    
predicted_probas, max_predicted_proba, predicted_class = compute_nb_probas(datapoint, class_proba, vars_proba)

# The normalizing term P(Data) was dropped, so these are scores proportional to the probabilities
print('P(y = 0 | {}) is proportional to {}'.format(datapoint.values, predicted_probas[0]))
print('P(y = 1 | {}) is proportional to {}'.format(datapoint.values, predicted_probas[1]))
print('Predicted class: y = {}'.format(predicted_class))
print('True class: y = {}'.format(datapoint_label))

Using the train/test split created earlier, we can also fit our model implementation on the training set, and evaluate the fitted model on both the training and the test datasets. 

In [None]:
def Accuracy(y_true, y_predict):
    return np.mean(y_true == y_predict)

def Naive_Bayse_custom(train, test):
    # Fit on train: class probabilities and smoothed conditional probabilities
    _, class_proba = class_probabilities(train)
    vars_proba = variables_conditional_probabilities(train, smoothing = 1)
    predictions = []
    for i in range(test.shape[0]):
        datapoint = test.iloc[i]
        _, _, predicted_class = compute_nb_probas(datapoint, class_proba, vars_proba)
        predictions.append(predicted_class)
    return predictions

# Accuracy on the training set, then on the test set
print(Accuracy(train['Fraudulent'], Naive_Bayse_custom(train, train)))
print(Accuracy(test['Fraudulent'], Naive_Bayse_custom(train, test)))


## <a name="3">3. Naive Bayes sklearn Implementation</a> 
(<a href="#0">Go to top</a>)

Let's compare the performance of our Naive Bayes implementation to sklearn's [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) classifiers. The different Naive Bayes classifiers of sklearn differ mainly in the assumptions they make regarding the distribution of $P(variable = value | y = class)$. If the probability distribution for a variable is complex or unknown, it can be a good idea to use a kernel density estimator to approximate the distribution directly from the data samples.


In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score

for model in [GaussianNB(), MultinomialNB(), ComplementNB(), BernoulliNB(), CategoricalNB()]:
    
    model.fit(train[model_features], train['Fraudulent'])
    train_predictions_sk = model.predict(train[model_features])
    test_predictions_sk = model.predict(test[model_features])
    print('{} model train accuracy: {}'.format(model, accuracy_score(train['Fraudulent'], train_predictions_sk)))
    print('{} model test accuracy: {}'.format(model, accuracy_score(test['Fraudulent'], test_predictions_sk)))
    print('\n')

## <a name="4">4. Summary</a> 
(<a href="#0">Go to top</a>)

Our Naive Bayes classifier performed as well as the sklearn implementations, but not very well overall. In particular, MultinomialNB, which implements the Naive Bayes algorithm for multinomially distributed data, worked best on training and test. Keep in mind that this is a synthetic dataset, and again we focus here on the technique itself, and not on performance. However, this also gives us the opportunity to comment on the relevance and importance of Naive Bayes as yet another tool in the classifier's toolkit. 
 
In spite of their apparently over-simplified assumptions, Naive Bayes classifiers have worked quite well in many real-world situations (document classification, spam filtering). They require only a small amount of training data to estimate the necessary parameters, and can be extremely fast compared to more sophisticated methods. A comparison of logistic regression and Naive Bayes can be found [here](https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf).

When trying different classifiers on a particular dataset, it's worth remembering that Naive Bayes assumes the input variables are independent of each other given the class. The classifier often still works well when some of the variables are in fact dependent. Nevertheless, its performance degrades as the input variables become more strongly dependent on each other.
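
One way to see how dependence distorts the estimates is to feed the same piece of evidence to the model twice. The sketch below uses made-up likelihoods and a hypothetical helper `posterior_fraud`; a perfectly duplicated variable is the extreme case of dependence.

```python
def posterior_fraud(prior_fraud, likelihoods_fraud, likelihoods_legit):
    """Normalized Naive Bayes posterior P(y = 1 | data) for a binary problem."""
    score_fraud = prior_fraud
    score_legit = 1.0 - prior_fraud
    for p_fraud, p_legit in zip(likelihoods_fraud, likelihoods_legit):
        score_fraud *= p_fraud
        score_legit *= p_legit
    return score_fraud / (score_fraud + score_legit)


# Hypothetical values: P(x = 1 | y = 1) = 0.8 and P(x = 1 | y = 0) = 0.4
single = posterior_fraud(0.5, [0.8], [0.4])
# The same variable entered twice is treated as two independent observations
duplicated = posterior_fraud(0.5, [0.8, 0.8], [0.4, 0.4])

print(single, duplicated)
```

The duplicated variable carries no new information, yet the posterior moves from about 0.67 to about 0.8: the classifier becomes overconfident because it counts the same evidence twice.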

Also, related to the data distribution assumptions of Naive Bayes: as new data becomes available (think new spam/not-spam emails coming in), it can be relatively straightforward to combine this new data with the old data to update the estimates of the parameters for each variable's probability distribution. This allows the model to easily make use of new data or of data distributions that change over time.
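
For count-based estimates like the ones used in this lab, updating with new data amounts to adding the new counts to the old ones. The class `CountingNaiveBayes` below is a hypothetical, minimal sketch of that idea for binary variables with additive smoothing; it is not how sklearn is implemented, although sklearn's Naive Bayes estimators offer a `partial_fit` method for the same purpose.

```python
from collections import defaultdict


class CountingNaiveBayes:
    """Hypothetical helper that stores counts so new batches can be folded in."""

    def __init__(self, smoothing=1):
        self.smoothing = smoothing
        self.class_counts = defaultdict(int)
        # Key: (feature index, feature value, class label)
        self.value_counts = defaultdict(int)

    def update(self, rows, labels):
        """Add the counts of a new batch to the stored counts."""
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for i, value in enumerate(row):
                self.value_counts[(i, value, label)] += 1

    def conditional(self, i, value, label):
        """Smoothed estimate of P(feature i = value | label) for binary features."""
        numerator = self.value_counts[(i, value, label)] + self.smoothing
        denominator = self.class_counts[label] + 2 * self.smoothing
        return numerator / denominator


model = CountingNaiveBayes()
model.update([[1, 0], [1, 1]], [1, 1])      # first batch (made-up data)
before = model.conditional(0, 1, 1)
model.update([[0, 1], [0, 0]], [1, 1])      # new batch arrives later
after = model.conditional(0, 1, 1)
print(before, after)
```

No retraining on the old rows is needed: the estimate for the first variable shifts from 0.75 to 0.5 purely by accumulating the new counts.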



### Exercise 1

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Try it yourself!</b></p>
        <p><b>Exercise 1.</b> Train a logistic regression classifier on the same training dataset and compute its accuracy on the training and test datasets, then compare it with the results obtained using Naïve Bayes.</p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

In [None]:
# %load solutions/lab52_ex1_solutions.txt

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        You have completed Lab 5.2: Naïve Bayes of Lecture 5: Probability and Statistics Applications of MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>