<b><font size="6">Naïve Bayes</font><a class="anchor" id='toc'></a></b><br>

Categorical Naive Bayes in Python

<div class="alert alert-block">
    
# TOC<a class="anchor" id='toc'></a> - Naïve Bayes for Categorical Data<br>
- [<font color='#E8800A'>Predict manually, using the formulas</font>](#first-bullet)<br>
- [<font color='#E8800A'>Predict using scikit-learn</font>](#third-bullet)<br>
    
</div>


# <font color='#E8800A'>Predict manually, using the formulas</font><a class="anchor" id="first-bullet"></a> 
[Back to TOC](#toc)

<a class="anchor" id="company">

## Weather Data
</a>

Here, you will find a small dataset made up of categorical features. Our goal, just like in the theoretical class, will be to predict whether an individual will play tennis or not using a Naive Bayes classifier. This classifier, however, will be built from scratch!

__`Step 0:`__ Import the necessary packages

In [1]:
#standard packages
import pandas as pd
import numpy as np

#sk-learn
from sklearn.metrics import classification_report, confusion_matrix #performance metrics
from sklearn.model_selection import train_test_split

__`Step 1:`__ Import the Dataset

In [2]:
weather = pd.read_csv('dataset/weather.csv')

In [3]:
weather['WINDY'] = weather['WINDY'].astype(str)

In [4]:
weather

Unnamed: 0,OUTLOOK,TEMPERATURE,HUMIDITY,WINDY,PLAY
0,Sunny,Hot,High,False,No
1,Sunny,Hot,High,True,No
2,Overcast,Hot,High,False,Yes
3,Rainy,Mild,High,False,Yes
4,Rainy,Cool,Normal,False,Yes
5,Rainy,Cool,Normal,True,No
6,Overcast,Cool,Normal,True,Yes
7,Sunny,Mild,High,False,No
8,Sunny,Cool,Normal,False,Yes
9,Overcast,Mild,Normal,False,Yes


Assume this instance: `today = (Sunny, Cool, High, True)`<br>
Now, with regard to our dataset, we can apply Bayes’ theorem in the following way:

$$P(y|X)=\frac{P(X|y)\cdot P(y)}{P(X)}$$

Here, y is the class variable and X is a feature vector (of size n) whose components are assumed to be conditionally independent given y:

$$X=(x_1,x_2,x_3,...,x_n)$$

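Now we will make the prediction by identifying the y that maximizes the posterior probability $P(y|X)$. Since $P(X)$ does not depend on y, this is the same as maximizing the numerator:

$$\hat{y}=\underset{y}{\arg\max}\; P(y)\cdot\prod_{i=1}^{n}{P(x_i|y)}$$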

To do this in the next steps we calculate $P(x_i | y_j)$ for each $x_i$ in $X$ and $y_j$ in $y$.
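
As a rough illustration of that counting step, the sketch below estimates one conditional probability table by relative frequency, using only the standard library. The `rows` list and the helper `conditional_table` are names introduced here for illustration. Only the ten rows displayed above are used, so the numbers need not match the outputs printed later in the notebook, which are computed from the full `weather.csv`.

```python
from collections import Counter

# Only the ten rows displayed above: (OUTLOOK, TEMPERATURE, HUMIDITY, WINDY, PLAY).
rows = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Overcast", "Mild", "Normal", "False", "Yes"),
]


def conditional_table(rows, feature_idx, label_idx=-1):
    """Estimate P(feature = v | label = c) by relative frequency."""
    class_counts = Counter(r[label_idx] for r in rows)
    joint_counts = Counter((r[feature_idx], r[label_idx]) for r in rows)
    return {(v, c): n / class_counts[c] for (v, c), n in joint_counts.items()}


# Table for the first feature (OUTLOOK) given PLAY.
outlook_given_play = conditional_table(rows, feature_idx=0)
print(outlook_given_play[("Sunny", "No")])  # share of "No" rows that are Sunny
```

Pairs that never occur together (for example `("Overcast", "No")` in these ten rows) are simply absent from the table, which corresponds to an estimated probability of zero.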

__`Step 2:`__ Let's start with y = Yes.

We want to determine $$P(Yes|today) = \frac{P(today|Yes) \cdot P(Yes)}{P(today)}$$


We simplify the calculation by using the fact that $P(Yes|today)$ is proportional to $\underline{P(today|Yes) \cdot P(Yes)}$: $P(today)$ can be ignored because it is the same for all classes.

So the term that still has to be worked out is $P(today|Yes)$, highlighted in red:
$$P(Yes|today) = \frac{\textcolor{red}{P(today|Yes)} \cdot P(Yes)}{P(today)}$$


Assuming all features are conditionally independent given the class, we simply need to determine:
$$  \mathbf{P(Sunny \land Cool \land High \land True | Yes)} \cdot P(Yes) =$$

$$\mathbf{P(Outlook=Sunny | Yes) \cdot P(Temperature=Cool | Yes) \cdot P(Humidity=High | Yes) \cdot P(Windy=True | Yes)} \cdot P(Yes)$$

So,
$$
P(Yes)\cdot \prod_{i=1}^{n} P(today_i | Yes)
$$

<br>(Remember that today = Sunny, Cool, High, True)

__`Step 2.1:`__ Calculate each part.

p1_yes = $ P(Outlook = Sunny|Yes) = \frac{P(Outlook = Sunny  \land  Yes)}{ P(Yes)}$

In [6]:
p1_yes = len(weather[(weather.PLAY == 'Yes') & (weather.OUTLOOK == 'Sunny')]) / len(weather[weather.PLAY == 'Yes'])

p2_yes = $P(Temperature = Cool|Yes)$

In [7]:
# xi = number of rows with PLAY == 'Yes'; reused as the denominator in the next cells
xi = len(weather[weather.PLAY == 'Yes'])
p2_yes = len(weather[(weather.PLAY == 'Yes') & (weather.TEMPERATURE == 'Cool')]) / xi


p3_yes = $P(Humidity = High|Yes)$

In [8]:
#CODE HERE
p3_yes = len(weather[(weather.PLAY == 'Yes') & (weather.HUMIDITY == 'High')]) / xi


p4_yes = $P(Windy = True|Yes)$

In [11]:
#CODE HERE 
p4_yes = len(weather[(weather.PLAY == 'Yes') & (weather.WINDY == "True")]) / xi


p5_yes = P(Yes)

In [12]:
p5_yes = len(weather[weather.PLAY == 'Yes'])/len(weather)

p_yes_today = $P(today|Yes) \cdot P(Yes)$, the unnormalized score for Yes (proportional to $P(Yes|today)$)

In [13]:
p_yes_today = p1_yes * p2_yes * p3_yes * p4_yes * p5_yes

In [14]:
p_yes_today

0.005291005291005291

__`Step 3:`__ Now with y = No.

__`Step 3.1:`__ Calculate each part in the same manner as before.

In [15]:
# zeta = number of rows with PLAY == 'No'; reused as the denominator in the next cells
zeta = len(weather[weather.PLAY == 'No'])


p1_no = len(weather[(weather.PLAY == 'No') & (weather.OUTLOOK == 'Sunny')]) / zeta


In [16]:
#CODE HERE
p2_no = len(weather[(weather.PLAY == 'No') & (weather.TEMPERATURE == "Cool")]) / zeta


In [17]:
#CODE HERE
p3_no = len(weather[(weather.PLAY == 'No') & (weather.HUMIDITY == 'High')]) / zeta


In [18]:
#CODE HERE
p4_no = len(weather[(weather.PLAY == 'No') & (weather.WINDY == 'True')]) / zeta


In [19]:
#CODE HERE
p5_no = zeta/len(weather)


In [20]:
#CODE HERE
p_no_today = p1_no * p2_no * p3_no * p4_no * p5_no


In [21]:
p_no_today

0.02057142857142857
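
With only four features the products above are small but harmless. With many features, multiplying probabilities can underflow, so a common alternative (not what this notebook does) is to compare sums of logarithms instead. The following is a minimal sketch of that idea; `log_score` is a hypothetical helper and the factors are illustrative values, not numbers taken from `weather.csv`.

```python
import math


def log_score(prior, likelihoods):
    """Return log(prior * product of likelihoods); -inf if any factor is zero."""
    factors = [prior, *likelihoods]
    if any(f == 0 for f in factors):
        return float("-inf")
    return sum(math.log(f) for f in factors)


# Illustrative factors only (hypothetical, not taken from weather.csv).
score_a = log_score(0.6, [0.5, 0.25, 0.5])
score_b = log_score(0.4, [0.75, 0.5, 0.5])

# The logarithm is monotonic, so the larger log-score marks the same winner
# as the larger product would.
best = "a" if score_a > score_b else "b"
print(best)
```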

### Conclusion

After calculating the unnormalized probabilities for both classes ("Yes" and "No"), the most likely outcome is **"No"**. 

In practice, for classification tasks using Naive Bayes, it's not necessary to compute the actual probability of the evidence $P(\text{today})$, because this value is the same for all classes and does not affect the final decision.

Instead, we simply compare the **unnormalized probabilities** and select the class with the highest value. This allows us to make the correct classification without the need to normalize.

However, if we are interested in the **actual posterior probabilities** $ P(\text{Yes}|\text{today})$ and $ P(\text{No}|\text{today})$, we can perform an additional step to calculate the probability of the evidence $P(\text{today}) $ using the total probability theorem.

__`Step 4:`__ Normalize the results so that:

`P_yes_today` + `P_no_today` = 1



So far, we have calculated **unnormalized probabilities**, not the actual posterior probabilities. That is,

`p_yes_today` $= P(\text{today}|\text{Yes}) \cdot P(\text{Yes})$

`p_no_today` $= P(\text{today}|\text{No}) \cdot P(\text{No})$

If we want to get the actual $P(\text{Yes}|\text{today})$ and $P(\text{No}|\text{today})$, we need to divide these values by $P(X) = P(\text{today})$, which can be given by the total probability theorem as:

$$P(\text{today}) = \sum_y P(\text{today}|y)\cdot P(y)$$

In this case:

$$P(\text{today}) = P(\text{today}|\text{Yes}) \cdot P(\text{Yes}) + P(\text{today}|\text{No}) \cdot P(\text{No})$$

This simplifies to:

$$P(\text{today}) = p_{\text{yes\_today}} + p_{\text{no\_today}}$$


In [20]:
P_yes_today = p_yes_today / (p_yes_today + p_no_today)

In [21]:
P_yes_today

0.20458265139116202

In [22]:
#CODE HERE
P_no_today = p_no_today / (p_yes_today + p_no_today)


In [23]:
P_no_today

0.795417348608838

__Result:__ The outcome of today is No!
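
The same normalization works for any number of classes. The sketch below wraps it in a small helper (`normalize` is a name introduced here, not part of the notebook) and adds a guard for the case where every class score is zero, in which the division would fail. The two scores are copied from the outputs printed above.

```python
def normalize(scores):
    """Turn unnormalized class scores into posteriors that sum to 1."""
    evidence = sum(scores.values())  # P(today), by the law of total probability
    if evidence == 0:
        raise ValueError("all class scores are zero; posteriors are undefined")
    return {label: s / evidence for label, s in scores.items()}


# Unnormalized scores copied from the outputs printed above.
posteriors = normalize({"Yes": 0.005291005291005291, "No": 0.02057142857142857})
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```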

__`Step 5:`__ Run the cell below. It defines a function for the tennis dataset that, given the features of an instance, predicts the outcome. <br>

In [24]:
def result(outlook, temperature, humidity, windy):
    
    #calculate the probability of playing today
    a1 = len(weather[(weather.PLAY == 'Yes') & (weather.OUTLOOK == outlook)]) / len(weather[weather.PLAY == 'Yes'])
    a2 = len(weather[(weather.PLAY == 'Yes') & (weather.TEMPERATURE == temperature)]) / len(weather[weather.PLAY == 'Yes'])
    a3 = len(weather[(weather.PLAY == 'Yes') & (weather.HUMIDITY == humidity)]) / len(weather[weather.PLAY == 'Yes'])
    a4 = len(weather[(weather.PLAY == 'Yes') & (weather.WINDY == windy)]) / len(weather[weather.PLAY == 'Yes'])
    a5 = len(weather[weather.PLAY == 'Yes'])/len(weather)
    p_yes_today = a1 * a2 * a3 * a4 * a5
    
    #repeat the same for no 
    b1 = len(weather[(weather.PLAY == 'No') & (weather.OUTLOOK == outlook)]) / len(weather[weather.PLAY == 'No'])
    b2 = len(weather[(weather.PLAY == 'No') & (weather.TEMPERATURE == temperature)]) / len(weather[weather.PLAY == 'No'])
    b3 = len(weather[(weather.PLAY == 'No') & (weather.HUMIDITY == humidity)]) / len(weather[weather.PLAY == 'No'])
    b4 = len(weather[(weather.PLAY == 'No') & (weather.WINDY == windy)]) / len(weather[weather.PLAY == 'No'])
    b5 = len(weather[weather.PLAY == 'No'])/len(weather)
    p_no_today = b1 * b2 * b3 * b4 * b5
    
    #normalize results to ensure probability of event and complement = 1
    P_yes_today = p_yes_today / (p_yes_today + p_no_today)
    P_no_today = p_no_today / (p_yes_today + p_no_today)
    
    #make decision
    if P_yes_today > P_no_today:
        outcome = 'Yes'
    else:
        outcome = 'No'
    return outcome

In [25]:
result('Sunny', 'Cool', 'High', 'True')

'No'
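
The function above hard-codes four features and two classes. As a hedged sketch of how the same counting logic could be generalized, the version below loops over whatever features and classes it is given. The name `naive_bayes_predict` and the toy data are hypothetical and only serve the illustration; no smoothing is applied, just as in the manual calculation.

```python
from collections import Counter


def naive_bayes_predict(rows, labels, instance):
    """Return (best_label, posteriors) using relative-frequency estimates."""
    class_counts = Counter(labels)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(labels)  # prior P(c)
        for j, value in enumerate(instance):
            matches = sum(1 for r, y in zip(rows, labels) if y == c and r[j] == value)
            score *= matches / n_c  # likelihood P(x_j = value | c)
        scores[c] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("every class has zero score; smoothing would be needed")
    posteriors = {c: s / total for c, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Hypothetical toy data with two features, not the weather file.
rows = [("Sunny", "High"), ("Sunny", "Normal"), ("Rainy", "High"),
        ("Rainy", "Normal"), ("Sunny", "High"), ("Rainy", "High")]
labels = ["No", "Yes", "Yes", "Yes", "No", "No"]

label, posteriors = naive_bayes_predict(rows, labels, ("Sunny", "High"))
print(label, posteriors)
```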

## Now we can use the Scikit-learn Naive Bayes for categorical features to confirm our result

# <font color='#E8800A'>Naive Bayes for Categorical Data with sklearn</font><a class="anchor" id="third-bullet"></a>
 [Back to TOC](#toc)

__`Step 1:`__ Import CategoricalNB from sklearn.naive_bayes

In [26]:
from sklearn.naive_bayes import CategoricalNB

__`Step 2:`__ Assign to the object `data` the dataset `weather` without the dependent variable, `PLAY`

In [27]:
weather.head()

Unnamed: 0,OUTLOOK,TEMPERATURE,HUMIDITY,WINDY,PLAY
0,Sunny,Hot,High,False,No
1,Sunny,Hot,High,True,No
2,Overcast,Hot,High,False,Yes
3,Rainy,Mild,High,False,Yes
4,Rainy,Cool,Normal,False,Yes


In [28]:
data = weather.drop('PLAY',axis=1)

__`Step 3:`__ Assign to the object `target` the dependent variable

In [29]:
target = pd.DataFrame(weather['PLAY'], columns = ['PLAY'])

In [30]:
target

Unnamed: 0,PLAY
0,No
1,No
2,Yes
3,Yes
4,Yes
5,No
6,Yes
7,No
8,Yes
9,Yes


__`Step 4:`__ Encode the dataset to apply the model

__`Step 4.1:`__ Import the OrdinalEncoder

In [31]:
from sklearn.preprocessing import OrdinalEncoder

__`Step 4.2:`__ Create two instances of the encoder

In [32]:
enc1 = OrdinalEncoder() #encoder for features
enc2 = OrdinalEncoder() #encoder for labels

__`Step 4.3:`__ Fit the encoder to the data and the target

In [33]:
enc1.fit(data)

In [34]:
data

Unnamed: 0,OUTLOOK,TEMPERATURE,HUMIDITY,WINDY
0,Sunny,Hot,High,False
1,Sunny,Hot,High,True
2,Overcast,Hot,High,False
3,Rainy,Mild,High,False
4,Rainy,Cool,Normal,False
5,Rainy,Cool,Normal,True
6,Overcast,Cool,Normal,True
7,Sunny,Mild,High,False
8,Sunny,Cool,Normal,False
9,Overcast,Mild,Normal,False


In [35]:
data = pd.DataFrame(enc1.transform(data), columns = data.columns)

In [36]:
data

Unnamed: 0,OUTLOOK,TEMPERATURE,HUMIDITY,WINDY
0,2.0,1.0,0.0,0.0
1,2.0,1.0,0.0,1.0
2,0.0,1.0,0.0,0.0
3,1.0,2.0,0.0,0.0
4,1.0,0.0,1.0,0.0
5,1.0,0.0,1.0,1.0
6,0.0,0.0,1.0,1.0
7,2.0,2.0,0.0,0.0
8,2.0,0.0,1.0,0.0
9,0.0,2.0,1.0,0.0
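
With its default settings, `OrdinalEncoder` numbers the categories of each column in sorted order, which is consistent with the table above (`Overcast` → 0.0, `Rainy` → 1.0, `Sunny` → 2.0). The sketch below mimics that mapping for a single column in plain Python; `fit_ordinal` is a hypothetical helper, not part of scikit-learn.

```python
def fit_ordinal(values):
    """Map each distinct category to its position in sorted order."""
    return {cat: float(i) for i, cat in enumerate(sorted(set(values)))}


outlook_codes = fit_ordinal(["Sunny", "Overcast", "Rainy", "Sunny"])
windy_codes = fit_ordinal(["False", "True", "False"])

# Reversing the dictionary gives the equivalent of an inverse transform.
decode_outlook = {code: cat for cat, code in outlook_codes.items()}

print(outlook_codes)
print(decode_outlook[2.0])
```

The integer order carries no meaning for the model here: `CategoricalNB` treats each code as a category index, not as a magnitude.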


In [37]:
target = enc2.fit_transform(target)

In [38]:
target = target.flatten()


__`Step 5:`__ Split the dataset into X_train, X_val, y_train and y_val, setting `test_size` to 0.25 and `random_state` to 5, and stratifying by the target.

In [47]:
# stratified 75/25 split with a fixed seed for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    data, target,
    test_size=0.25,
    random_state=5,
    stratify=target,
)
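
To make the effect of `stratify` more concrete, here is a simplified sketch of a stratified split: indices are grouped by class and the same fraction of each group goes to the validation set. This is not scikit-learn's actual algorithm, and `stratified_split` and the label list are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split indices so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * test_size)  # validation rows for this class
        val_idx += idxs[:n_val]
        train_idx += idxs[n_val:]
    return sorted(train_idx), sorted(val_idx)


# Hypothetical labels: 8 "Yes" and 4 "No".
labels = ["Yes"] * 8 + ["No"] * 4
train_idx, val_idx = stratified_split(labels, test_size=0.25, seed=5)
print(train_idx, val_idx)
```

With these labels the validation set receives two "Yes" rows and one "No" row, preserving the 2:1 class ratio of the full list.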


__`Step 6:`__ Using CategoricalNB, create a Naive Bayes classifier instance called modelNB.

In [48]:
#CODE HERE
modelNB = CategoricalNB()


### Methods in CategoricalNB

__`Step 7:`__ Use the `.fit()` method of `modelNB` to fit the model to `X_train` and `y_train`, i.e., pass `X_train` as the keyword argument `X` and `y_train` as `y`.

In [49]:
modelNB.fit(X = X_train, y = y_train)



__`Step 8:`__ Use the `.predict()` method to classify `X_train` and assign the result to the object `labels_train`. Do the same for `X_val` and assign the result to the object `labels_val`.

In [50]:
#CODE HERE
labels_train = modelNB.predict(X_train)
labels_val = modelNB.predict(X_val)


__`Step 9:`__ Use the `.predict_proba()` method to obtain the probability estimates for the `X_val`

In [51]:
modelNB.predict_proba(X_val)

array([[0.61049285, 0.38950715],
       [0.31982232, 0.68017768],
       [0.61049285, 0.38950715],
       [0.46545455, 0.53454545]])

__`Step 10:`__ Use the `.score()` method of modelNB to obtain the mean accuracy on the training data `X_train`, given the true labels `y_train`

In [52]:
modelNB.score(X_train, y_train)

0.9

__`Step 11:`__ Use the `.score()` method of modelNB to obtain the mean accuracy on the validation data `X_val`, given the true labels `y_val`

In [54]:
#CODE HERE
modelNB.score(X_val, y_val)

0.75
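
For a classifier, `.score()` reports mean accuracy: the fraction of predictions that match the true labels. A minimal sketch of that computation, using hypothetical label vectors instead of the notebook's `y_val` and `labels_val`:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical encoded labels: three of the four predictions are correct.
acc = accuracy([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0])
print(acc)
```

With a validation set of only a few rows, each single misclassification moves the accuracy by a large step, so scores like these should be read with caution.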

__`Step 12:`__ Make your prediction for today!

In [55]:
today = pd.DataFrame([['Sunny', 'Cool', 'High', 'True']], columns = X_train.columns)


In [56]:
today

Unnamed: 0,OUTLOOK,TEMPERATURE,HUMIDITY,WINDY
0,Sunny,Cool,High,True


__`Step 13:`__ Transform the data with the encoder

In [57]:
# reuse the encoder fitted on the features so that today gets the same codes
today = pd.DataFrame(enc1.transform(today), columns = data.columns)


In [58]:
today

Unnamed: 0,OUTLOOK,TEMPERATURE,HUMIDITY,WINDY
0,2.0,0.0,0.0,1.0


__`Step 13.1:`__ Make the prediction for the encoded datapoint

In [62]:
modelNB.predict_proba(today)

array([[0.7966805, 0.2033195]])
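
These probabilities (about 0.797 for No and 0.203 for Yes) are close to the manual result (about 0.795 and 0.205) but not identical. Two things plausibly contribute: the model here was fitted on `X_train` only, and `CategoricalNB` applies additive (Laplace) smoothing with `alpha=1.0` by default, whereas the manual calculation used raw relative frequencies. The sketch below shows the smoothed estimate on hypothetical counts; `smoothed_likelihood` is a name introduced here.

```python
def smoothed_likelihood(n_value_and_class, n_class, n_categories, alpha=1.0):
    """Additive smoothing: (N_vc + alpha) / (N_c + alpha * number of categories)."""
    return (n_value_and_class + alpha) / (n_class + alpha * n_categories)


# Hypothetical counts: a value never seen with a class of 5 rows,
# for a feature that has 3 possible categories.
unseen = smoothed_likelihood(0, 5, 3)    # no longer exactly zero
frequent = smoothed_likelihood(3, 5, 3)  # pulled slightly towards uniform
raw = smoothed_likelihood(3, 5, 3, alpha=0.0)  # plain relative frequency

print(unseen, frequent, raw)
```

Smoothing keeps a single unseen feature value from forcing the whole product for a class to zero.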

In [59]:
# stored as `prediction` so the `result` function defined earlier is not overwritten
prediction = modelNB.predict(today)

In [60]:
prediction

array([0.])

__`Step 13.2:`__ Reverse the encoding to understand the result

In [61]:
enc2.inverse_transform([prediction])

array([['No']], dtype=object)