# Naive Bayes

`scikit-learn` provides several implementations of Naive Bayes that differ in how the conditional probabilities are calculated, so each implementation suits a different type of data.

- `CategoricalNB` works with categorical data once it has been encoded with an `OrdinalEncoder`.
- `GaussianNB` assumes that the numerical features have a Gaussian distribution.
- `BernoulliNB` is for binary data.
- `MultinomialNB` is for count data, e.g. word counts.
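
As a rough illustration of the list above, the sketch below fits each variant to a small synthetic array of the matching data type. The arrays, the target `y_toy` and the dictionary names are hypothetical and only serve to show which estimator goes with which kind of input.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y_toy = np.array([0, 1] * 20)  # hypothetical binary target, 40 samples

# Hypothetical toy features: one array per data type, paired with its estimator
toy_data = {
    "categorical": (CategoricalNB(), rng.integers(0, 3, size=(40, 4))),  # ordinal codes 0..2
    "numeric": (GaussianNB(), rng.normal(size=(40, 4))),                 # real-valued
    "binary": (BernoulliNB(), rng.integers(0, 2, size=(40, 4))),         # 0/1 indicators
    "counts": (MultinomialNB(), rng.poisson(3.0, size=(40, 4))),         # non-negative counts
}

fitted = {}
for kind, (model, X_kind) in toy_data.items():
    fitted[kind] = model.fit(X_kind, y_toy)
    # Class probabilities for the first sample of each toy array
    print(kind, type(model).__name__, model.predict_proba(X_kind[:1]).round(3))
```

The data here is random, so the predictions carry no meaning; the point is only the pairing of estimator and input type.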

In [None]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB, CategoricalNB
from sklearn.metrics import confusion_matrix 
from sklearn.preprocessing import OneHotEncoder

In [None]:
swim = pd.read_csv('Swimming.csv')
swim

## Categorical NB

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
swim = pd.read_csv('Swimming.csv')
y = swim.pop('Swimming').values # Set this as the y (target)
print(swim.columns)
print(y)
ord_encoder = OrdinalEncoder()
swimOE = ord_encoder.fit_transform(swim)
swimOE

In [None]:
catNB = CategoricalNB(fit_prior=True,alpha = 0.0001)
swim_catNB = catNB.fit(swimOE,y)
y_dash = swim_catNB.predict(swimOE)
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 

In [None]:
ord_encoder.categories_

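The model is stored as log probabilities in `feature_log_prob_`, a list with one array per feature; each array has shape (number of classes, number of categories).  
There are five features and two classes.  
The five features have 3, 3, 2, 3 and 2 possible values respectively. 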
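
To make this layout concrete, here is a minimal sketch on a hypothetical toy dataset (not the Swimming data) with two features that have 3 and 2 categories. The names `X_toy`, `y_toy` and `toy_nb` are made up for this example.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy data: two ordinal-encoded features with 3 and 2 categories
X_toy = np.array([[0, 0], [1, 1], [2, 0], [0, 1], [1, 0], [2, 1]])
y_toy = np.array(["No", "Yes", "No", "Yes", "No", "Yes"])

toy_nb = CategoricalNB(alpha=1.0).fit(X_toy, y_toy)

# One array per feature, each of shape (n_classes, n_categories)
shapes = [lp.shape for lp in toy_nb.feature_log_prob_]
print(shapes)

# Exponentiating recovers the conditional probabilities; each row sums to 1
row_sums = [np.exp(lp).sum(axis=1) for lp in toy_nb.feature_log_prob_]
print(row_sums)

# P(feature 1 = category 0 | class 'No'): rows are ordered as in toy_nb.classes_
p_cat0_given_no = np.exp(toy_nb.feature_log_prob_[1][0, 0])
print(p_cat0_given_no)
```

With `alpha=1.0`, the last value is the smoothed estimate (3 + 1) / (3 + 2): all three 'No' rows have category 0 for that feature, and one pseudo-count is added to each of its two categories.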

In [None]:
catNB.classes_

In [None]:
mparams = catNB.feature_log_prob_
mparams

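The cell below is meant to recover P(Rain_Today = 'Heavy' | 'No') and P(Temp = 'Warm' | 'No').  
Each array is indexed `[class, category]`, so the class index must match the position of 'No' in `catNB.classes_` and the category index must match the position of the value in `ord_encoder.categories_`.  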

In [None]:
import numpy as np
np.exp(mparams[1][1, 0]), np.exp(mparams[2][1, 1])

In [None]:
# Three query examples, two from the lecture and one from the training data.

squery = pd.DataFrame([["Moderate","Moderate","Warm","Light","Some"],
                       ["Moderate","Moderate","Cold","Moderate","Some"],
                       ["Moderate","Light","Warm","Light","None"]
                      ], columns=swim.columns)

In [None]:
X_query = ord_encoder.transform(squery)
X_query, X_query.shape

In [None]:
y_query = swim_catNB.predict(X_query)
y_query

In [None]:
q_probs = swim_catNB.predict_proba(X_query)
q_probs

In [None]:
swim_catNB.get_params()

### Taking care of category order
Here we provide the `OrdinalEncoder` with the correct category order.  
This makes no difference to the classifier because `CategoricalNB` does not consider order.
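
The claim that order is irrelevant can be checked with a small sketch. It encodes a hypothetical toy dataset twice, once with the default alphabetical order and once with an explicit order, and compares the predicted probabilities. All names and values below are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data with two categorical features
X_raw = np.array([
    ["Light", "Warm"], ["Heavy", "Cold"], ["Moderate", "Warm"],
    ["Light", "Cold"], ["Heavy", "Warm"], ["Moderate", "Cold"],
], dtype=object)
y_toy = np.array(["Yes", "No", "Yes", "Yes", "No", "No"])

enc_default = OrdinalEncoder()  # categories sorted alphabetically
enc_ordered = OrdinalEncoder(
    categories=[["Light", "Moderate", "Heavy"], ["Cold", "Warm"]]
)

probs = []
for enc in (enc_default, enc_ordered):
    X_enc = enc.fit_transform(X_raw)
    model = CategoricalNB(alpha=1.0).fit(X_enc, y_toy)
    probs.append(model.predict_proba(X_enc))

# The integer codes differ, but the per-category counts do not
same = np.allclose(probs[0], probs[1])
print(same)
```

The codes act only as labels for the per-category counts, so relabelling them leaves the fitted probabilities unchanged.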

In [None]:
swim_cats =[['Light','Moderate','Heavy'],
            ['Light','Moderate','Heavy'],
            ['Cold','Warm'],
            ['Light', 'Moderate','Gale'],
            ['None','Some'],
           ]

In [None]:
swim = pd.read_csv('Swimming.csv')
y = swim.pop('Swimming').values # Set this as the y (target)
print(swim.columns)
print(y)
ord_encoderV2 = OrdinalEncoder(categories = swim_cats)
swimOEV2 = ord_encoderV2.fit_transform(swim)
swimOEV2

In [None]:
catNB = CategoricalNB(fit_prior=True,alpha = 0.0001)
swim_catNB = catNB.fit(swimOEV2,y)
y_dash = swim_catNB.predict(swimOEV2)
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 

### One-Hot-Encode the training data
Here we use one-hot encoding to convert the Swimming dataset to a numeric format.  
The result is binary, so it is valid to use `BernoulliNB` and possibly `MultinomialNB`; `GaussianNB` is a poor fit because 0/1 indicators are not normally distributed. 
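
As a small sketch of what the encoding produces, the example below one-hot encodes a hypothetical toy dataset and checks that the result is binary with exactly one active indicator per original feature. It calls `.toarray()` on the default sparse output, which should be equivalent to passing `sparse_output=False`; the data and names are made up.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical features with 3 and 2 values
X_raw = np.array([
    ["Light", "Warm"], ["Heavy", "Cold"], ["Moderate", "Warm"],
    ["Light", "Cold"], ["Heavy", "Warm"], ["Moderate", "Cold"],
], dtype=object)
y_toy = np.array(["Yes", "No", "Yes", "Yes", "No", "No"])

X_oh = OneHotEncoder().fit_transform(X_raw).toarray()

is_binary = np.isin(X_oh, [0.0, 1.0]).all()   # only 0s and 1s
ones_per_row = X_oh.sum(axis=1)               # one active column per original feature
print(X_oh.shape, is_binary, ones_per_row)

# Both estimators accept the binary matrix
for model in (BernoulliNB(), MultinomialNB()):
    model.fit(X_oh, y_toy)
    print(type(model).__name__, model.predict(X_oh))
```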

In [None]:
swim = pd.read_csv('Swimming.csv')
y = swim.pop('Swimming').values # Set this as the y (target)


onehot_encoder = OneHotEncoder(sparse_output=False)
swimOH = onehot_encoder.fit_transform(swim)
swimOH

In [None]:
onehot_encoder.get_feature_names_out(swim.columns)

In [None]:
gnb = GaussianNB()      # alternative: swap in for bnb below to compare
mnb = MultinomialNB()   # alternative: swap in for bnb below to compare
bnb = BernoulliNB()
swim_numNB = bnb.fit(swimOH,y)
y_dash = swim_numNB.predict(swimOH)

In [None]:
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 

In [None]:
swim_numNB.classes_

In [None]:
swim_numNB.feature_log_prob_

In [None]:
# Three query examples, two from the lecture and one from the training data.

squery = pd.DataFrame([["Moderate","Moderate","Warm","Light","Some"],
                       ["Moderate","Moderate","Cold","Moderate","Some"],
                       ["Moderate","Light","Warm","Light","None"]
                      ], columns=swim.columns)

In [None]:
X_query = onehot_encoder.transform(squery)
X_query, X_query.shape

In [None]:
y_query = swim_numNB.predict(X_query)
y_query

In [None]:
q_probs = swim_numNB.predict_proba(X_query)
q_probs

In [None]:
swim_numNB.classes_

## Gaussian Approximations
Gaussian Naive Bayes models each numerical feature, within each class, using a Normal distribution.  
Here we look at the distributions of the Penguin features to see if this looks reasonable. 

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
penguins_all = pd.read_csv('penguins_af.csv')
f_names = ['bill_length_mm', 'bill_depth_mm','flipper_length_mm', 'body_mass_g']
X = penguins_all[f_names].values
y = penguins_all['species']
species_names = np.unique(y)
species_names

In [None]:
findex = 0 # any value in [0,1,2,3]
c1 = 'Adelie'     # any of ['Adelie', 'Chinstrap', 'Gentoo']
c2 = 'Chinstrap'
sns.histplot(X[y == c1][:,findex], label=c1,
            kde=True, stat="density", linewidth=0)
sns.histplot(X[y == c2][:,findex], label=c2, color = 'orange',
            kde=True, stat="density", linewidth=0)
plt.legend();
plt.xlabel(f_names[findex])
plt.ylabel('Density')  # stat="density" plots a density, not a probability

### Discretization
An alternative to Gaussian Naive Bayes is to discretize the data and use `CategoricalNB`.  
Discretization with `KBinsDiscretizer` works as follows:

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
X = [[-2, 1],
     [-1, 3],
     [ 0, 4],
     [ 2, 5]]
distOrd = KBinsDiscretizer(n_bins=3, encode='ordinal', 
                           strategy='uniform', subsample=None)

distOH = KBinsDiscretizer(n_bins=3, encode='onehot-dense', 
                           strategy='uniform', subsample=None)

distOrd.fit(X)
X_ord = distOrd.transform(X)
X_ord  

In [None]:
distOH.fit(X)
X_OH = distOH.transform(X)
X_OH 
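
The ordinal output can be passed straight to `CategoricalNB`. The following is a minimal sketch of that combination on synthetic data; the data, the choice of four bins and the pipeline name `disc_nb` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# Hypothetical numeric data: two well-separated classes, two features
X_num = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(3.0, 1.0, size=(50, 2)),
])
y_num = np.array(["a"] * 50 + ["b"] * 50)

n_bins = 4  # assumed value, worth tuning on real data
disc_nb = make_pipeline(
    KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform"),
    # min_categories makes the model aware of every bin index, even an empty one
    CategoricalNB(min_categories=n_bins),
)
disc_nb.fit(X_num, y_num)

train_acc = disc_nb.score(X_num, y_num)
print(train_acc)
```

Setting `min_categories` to the number of bins guards against a query falling into a bin that no training sample occupied.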