# Naive Bayes
<div class="alert alert-block alert-info">
<b>Content:</b> In this notebook,
    we demonstrate a categorical Naive Bayes classifier (scikit-learn's <code>CategoricalNB</code>) on the play-tennis dataset.
</div>


In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

from sklearn.naive_bayes import CategoricalNB

In [2]:
df = pd.read_csv("play_tennis.csv")

In [3]:
df

Unnamed: 0,day,outlook,temp,humidity,wind,play
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
2,D3,Overcast,Hot,High,Weak,Yes
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes
5,D6,Rain,Cool,Normal,Strong,No
6,D7,Overcast,Cool,Normal,Strong,Yes
7,D8,Sunny,Mild,High,Weak,No
8,D9,Sunny,Cool,Normal,Weak,Yes
9,D10,Rain,Mild,Normal,Weak,Yes


In [4]:
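# Drop the row identifier, then separate the features from the target
df = df.drop("day", axis='columns')
X_df = df.drop("play", axis='columns')
y_df = df.loc[:, ['play']]  # keep it 2D: OrdinalEncoder expects a 2D input

X_raw = X_df.to_numpy()
y_raw = y_df.to_numpy()
X_raw, y_raw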

(array([['Sunny', 'Hot', 'High', 'Weak'],
        ['Sunny', 'Hot', 'High', 'Strong'],
        ['Overcast', 'Hot', 'High', 'Weak'],
        ['Rain', 'Mild', 'High', 'Weak'],
        ['Rain', 'Cool', 'Normal', 'Weak'],
        ['Rain', 'Cool', 'Normal', 'Strong'],
        ['Overcast', 'Cool', 'Normal', 'Strong'],
        ['Sunny', 'Mild', 'High', 'Weak'],
        ['Sunny', 'Cool', 'Normal', 'Weak'],
        ['Rain', 'Mild', 'Normal', 'Weak'],
        ['Sunny', 'Mild', 'Normal', 'Strong'],
        ['Overcast', 'Mild', 'High', 'Strong'],
        ['Overcast', 'Hot', 'Normal', 'Weak'],
        ['Rain', 'Mild', 'High', 'Strong']], dtype=object),
 array([['No'],
        ['No'],
        ['Yes'],
        ['Yes'],
        ['Yes'],
        ['No'],
        ['Yes'],
        ['No'],
        ['Yes'],
        ['Yes'],
        ['Yes'],
        ['Yes'],
        ['Yes'],
        ['No']], dtype=object))

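We can safely use an ordinal encoder here: categorical Naive Bayes treats the encoded integers purely as category identifiers and estimates a separate probability for each one, so the arbitrary order of the codes has no effect. The same holds for the encoded class labels.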

<div class="alert alert-block alert-warning">
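<b>Warning:</b> The same is not true for many other classification algorithms (for example, linear models or distance-based methods), which interpret the codes as numeric magnitudes.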
</div>
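The claim about the encoding order can be checked empirically. The following is a minimal sketch on a made-up toy dataset (the arrays `X`, `y`, and the permutation `perm` are illustrative assumptions, not part of the tennis data): relabelling the integer codes of a feature should leave the predicted probabilities of `CategoricalNB` unchanged.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy data: one categorical feature encoded as integers 0, 1, 2.
X = np.array([[0], [0], [1], [2], [2], [1]])
y = np.array([0, 0, 1, 1, 0, 1])

# Relabel the categories with an arbitrary permutation: 0 -> 2, 1 -> 0, 2 -> 1.
perm = np.array([2, 0, 1])
X_perm = perm[X]

# Fit one model per encoding and compare the predicted class probabilities.
proba = CategoricalNB().fit(X, y).predict_proba(X)
proba_perm = CategoricalNB().fit(X_perm, y).predict_proba(X_perm)

same = np.allclose(proba, proba_perm)
print(same)
```

If the two encodings give the same probabilities, the particular integer assigned to each category carries no information for this classifier.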


In [5]:
target_enc = OrdinalEncoder()
# Encode the target and flatten it to 1D: 'No' -> 0, 'Yes' -> 1
y = target_enc.fit_transform(y_df)[:, 0]
y

array([0., 0., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0.])

In [6]:
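# The encoder is part of the pipeline, so it is refit on each training fold
clf = Pipeline([('encoder', OrdinalEncoder()), ('classifier', CategoricalNB())])

# 3-fold stratified cross-validation, repeated 10 times with different splits
outer_cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=1)
cv_result = cross_validate(clf, X=X_raw, y=y, cv=outer_cv, scoring="balanced_accuracy", n_jobs=8)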
print(f"The mean balanced acc is {cv_result['test_score'].mean():.2f} with std {cv_result['test_score'].std():.2f}.")

The mean balanced acc is 0.66 with std 0.18.


In [7]:
clf.fit(X_raw, y)

In [8]:
clf['classifier'].category_count_  # one array per feature, each of shape (n_classes, n_categories)

[array([[0., 2., 3.],
        [4., 3., 2.]]),
 array([[1., 2., 2.],
        [3., 2., 4.]]),
 array([[4., 1.],
        [3., 6.]]),
 array([[3., 2.],
        [3., 6.]])]

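Each array holds the counts for one feature: rows correspond to the classes (No, Yes) and columns to the feature's categories in sorted order, so the entries show how each category is distributed across the classes.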
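As a minimal sketch of how such counts become conditional probabilities, the snippet below applies additive (Laplace) smoothing to the counts of the first feature (outlook), copied from the output above. It assumes the default smoothing parameter `alpha=1.0` of `CategoricalNB`; the variable names are illustrative.

```python
import numpy as np

# Counts for the first feature (outlook), copied from the output above.
# Rows: classes (No, Yes); columns: categories (Overcast, Rain, Sunny).
outlook_counts = np.array([[0.0, 2.0, 3.0],
                           [4.0, 3.0, 2.0]])

alpha = 1.0  # assumption: the default smoothing parameter of CategoricalNB
n_categories = outlook_counts.shape[1]

# Additive smoothing: (count + alpha) / (class total + alpha * n_categories)
smoothed = (outlook_counts + alpha) / (
    outlook_counts.sum(axis=1, keepdims=True) + alpha * n_categories
)
print(smoothed)
```

Note that the zero count for Overcast given No becomes a probability of 1/8 rather than 0, so a single unseen combination does not force the whole product of likelihoods to zero.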

In [9]:
new_instances=np.array([
    ["Sunny", "Hot", "High", "Strong"],
    ["Overcast", "Hot", "High", "Strong"]
])

In [10]:
pred=clf.predict(new_instances)
pred

array([0., 1.])

In [11]:
target_enc.inverse_transform(pred.reshape(-1,1))

array([['No'],
       ['Yes']], dtype=object)

In [12]:
probas = clf.predict_proba(new_instances)
probas

array([[0.83725436, 0.16274564],
       [0.43556515, 0.56443485]])
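These probabilities can be reproduced by hand from the category counts printed earlier. The sketch below is not scikit-learn's implementation; it multiplies the class prior by the smoothed per-feature likelihoods and normalizes, assuming the default `alpha=1.0` and the sorted category order used by `OrdinalEncoder`. The helper `posterior` is a hypothetical name introduced for illustration.

```python
import numpy as np

# category_count_ as printed above: one array per feature,
# rows = classes (No, Yes), columns = categories in sorted order.
counts = [
    np.array([[0, 2, 3], [4, 3, 2]], dtype=float),  # outlook: Overcast, Rain, Sunny
    np.array([[1, 2, 2], [3, 2, 4]], dtype=float),  # temp: Cool, Hot, Mild
    np.array([[4, 1], [3, 6]], dtype=float),        # humidity: High, Normal
    np.array([[3, 2], [3, 6]], dtype=float),        # wind: Strong, Weak
]
class_counts = np.array([5.0, 9.0])  # 5 x No, 9 x Yes in the training data
alpha = 1.0  # assumption: default smoothing parameter


def posterior(codes):
    """Return P(class | instance) for one instance given as category indices."""
    joint = class_counts / class_counts.sum()  # class priors
    for feature_counts, code in zip(counts, codes):
        n_categories = feature_counts.shape[1]
        likelihood = (feature_counts[:, code] + alpha) / (
            feature_counts.sum(axis=1) + alpha * n_categories
        )
        joint = joint * likelihood
    return joint / joint.sum()  # normalize over the two classes


# Sunny, Hot, High, Strong -> category indices 2, 1, 0, 0
p_first = posterior([2, 1, 0, 0])
# Overcast, Hot, High, Strong -> category indices 0, 1, 0, 0
p_second = posterior([0, 1, 0, 0])
print(p_first, p_second)
```

Both results should agree with the rows of `probas` above: the only difference between the two instances is the outlook, and Overcast never co-occurs with No in the training data, which shifts the second prediction towards Yes.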

<div class="alert alert-block alert-info">
<b>Takeaways:</b> 

* How to run Naive Bayes on categorical data.
* How to interpret the results and the predicted probabilities.
</div>

<div class="alert alert-block alert-success">
<b>Play with:</b> 
    
* Create further weather situations and classify them.
</div>
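One possible starting point for this exercise is sketched below. It is self-contained and therefore re-enters the 14 training rows shown earlier instead of reading `play_tennis.csv`; the names `rows`, `model`, and `situations` are illustrative. It fits the same pipeline directly on the string labels and classifies every combination of the observed category values.

```python
from itertools import product

import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# The 14 training rows as printed earlier: outlook, temp, humidity, wind, play.
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
data = np.array(rows, dtype=object)
X, labels = data[:, :4], data[:, 4].astype(str)

model = Pipeline([("encoder", OrdinalEncoder()), ("classifier", CategoricalNB())])
model.fit(X, labels)

# Enumerate every combination of the category values seen during training.
categories = model["encoder"].categories_
situations = np.array(list(product(*categories)), dtype=object)
predictions = model.predict(situations)

# Print the first few situations together with their predicted label.
for situation, label in zip(situations[:5], predictions[:5]):
    print(", ".join(situation), "->", label)
```

Restricting the enumeration to observed category values avoids the encoder's error on unknown categories; a situation with a new value (say, a hypothetical "Foggy" outlook) would need extra handling.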