# Naive Bayes with sklearn

We saw in class that Naive Bayes is a probabilistic classifier that can easily support both categorical and quantitative variables.
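
As a reminder, the standard Naive Bayes model assumes the features are conditionally independent given the class, so the posterior factorizes into one term per feature:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

Each factor $P(x_i \mid y)$ can use its own distribution (for example a Gaussian for a quantitative variable and a categorical distribution for a qualitative one), which is why the model can in principle mix both types.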

The problem is that `sklearn` does not natively handle both types in a single model.

We will need to split our data into quantitative and qualitative variables again, and then write our own code to take both into account.

You can start with quantitative or qualitative data depending on the majority data type in your dataset.

In [13]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

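# Fill missing values with the next valid observation
# (bfill() is the non-deprecated equivalent of fillna(method="backfill"))
df = pd.read_csv("../titanic.csv").bfill()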

In [14]:
# The target is the Survived column
y = df.Survived.values
# For demonstration, I'm taking only 4 columns
categorical_variables = ["Sex", "Embarked"]
quantitative_variables = ["Age", "Fare"]
X = df[categorical_variables + quantitative_variables].values
X_quantitative = df[quantitative_variables].values
X_categorical = df[categorical_variables].values

## Working with quantitative data
With quantitative data, we can use the `GaussianNB` class.

In [18]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

In [26]:
gaussian_nb = GaussianNB()

print("======= Training")
gaussian_nb.fit(X_quantitative, y)

print("======= Prediction")
predictions = gaussian_nb.predict(X_quantitative)

print("======= Results")
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.66      0.95      0.78       549
           1       0.72      0.22      0.33       342

    accuracy                           0.67       891
   macro avg       0.69      0.58      0.56       891
weighted avg       0.68      0.67      0.61       891



**Questions**:
1. Plot the statistical distribution of your variables and see if any is highly skewed.
2. Apply Gaussian Naive Bayes to the quantitative variables of your dataset.
3. Retrieve class probability and plot the results as a function of the different features, using heatmap colors.
4. Perform k-fold cross-validation and return the classification scores (accuracy, recall, precision).
5. Try removing highly correlated data and see if your results improve.
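
For the cross-validation question, one possible starting point is `cross_validate` with several scorers. This is only a sketch: `X_demo` and `y_demo` are synthetic stand-ins (an assumption, not the Titanic data) for `X_quantitative` and `y`, and the choice of 5 stratified folds is arbitrary.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for X_quantitative / y: two numeric features, binary target
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(loc=y_demo[:, None] * 1.5, scale=1.0, size=(200, 2))

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    GaussianNB(),
    X_demo,
    y_demo,
    cv=cv,
    scoring=["accuracy", "precision", "recall"],
)

# One score per fold and per metric, stored under "test_<metric>"
for metric in ("accuracy", "precision", "recall"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.2f} +/- {fold_scores.std():.2f}")
```

Replacing `X_demo` and `y_demo` with your own arrays should be enough to reuse it on your dataset.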

## Working with qualitative data
With qualitative data, we can use the `CategoricalNB` class.

In [24]:
from sklearn.naive_bayes import CategoricalNB

In [25]:
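# NOTE: CategoricalNB expects non-negative integer category codes. This cell fits on
# X_quantitative, whose values are converted to integers; X_categorical holds strings
# and must be integer-encoded first (e.g. with sklearn.preprocessing.OrdinalEncoder).
categorical_nb = CategoricalNB()

print("======= Training")
categorical_nb.fit(X_quantitative, y)

print("======= Prediction")
predictions = categorical_nb.predict(X_quantitative)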

print("======= Results")
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.75      0.92      0.82       549
           1       0.79      0.50      0.62       342

    accuracy                           0.76       891
   macro avg       0.77      0.71      0.72       891
weighted avg       0.76      0.76      0.74       891
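
Because `X_categorical` holds strings, it has to be turned into integer codes before `CategoricalNB` can use it. Here is a minimal sketch, assuming `OrdinalEncoder` is an acceptable encoding and using a small made-up array (`X_cat_demo`, `y_cat_demo`, both hypothetical) in place of the Titanic columns:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for X_categorical: string-valued columns similar to Sex and Embarked
X_cat_demo = np.array(
    [
        ["male", "S"], ["female", "C"], ["female", "S"], ["male", "Q"],
        ["female", "Q"], ["male", "S"], ["female", "C"], ["male", "C"],
    ],
    dtype=object,
)
y_cat_demo = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# Map each distinct string to an integer code, column by column
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_cat_demo)

categorical_nb_demo = CategoricalNB()
categorical_nb_demo.fit(X_encoded, y_cat_demo)

demo_predictions = categorical_nb_demo.predict(X_encoded)
demo_probabilities = categorical_nb_demo.predict_proba(X_encoded)
print(demo_predictions)
```

On real data, fit the encoder on the training split only and reuse it with `encoder.transform` on the test split.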



**Questions**:
1. Apply Categorical Naive Bayes to the qualitative variables of your dataset.
2. Retrieve class probability and plot the results as a function of the different features, using heatmap colors.
3. Perform k-fold cross-validation and return the classification scores (accuracy, recall, precision).
4. Compare to previous results.
5. Transform every variable within your dataset to a qualitative one using the class `sklearn.preprocessing.KBinsDiscretizer` and compare with previous results.
6. Compare to what you achieved using `knn`.
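
For the discretization question, a possible sketch chains `KBinsDiscretizer` and `CategoricalNB` in a pipeline. The data (`X_demo`, `y_demo`) is synthetic, and the choices of 5 quantile bins and `min_categories=5` are assumptions made for illustration; `min_categories` is set so that a bin absent from the training data does not cause an error at prediction time.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for quantitative features and a binary target
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
X_demo = rng.normal(loc=y_demo[:, None] * 2.0, scale=1.0, size=(300, 2))

n_bins = 5
discretized_nb = make_pipeline(
    # encode="ordinal" returns one integer bin index per feature
    KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile"),
    # min_categories reserves a slot for every possible bin index
    CategoricalNB(min_categories=n_bins),
)
discretized_nb.fit(X_demo, y_demo)

train_accuracy = discretized_nb.score(X_demo, y_demo)
print(f"Training accuracy: {train_accuracy:.2f}")
```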

## Working with both data types
It is annoying that `sklearn` does not provide a way to deal with both variable types at once.

One solution is to:
- Fit a `GaussianNB` on the quantitative variables and get the probabilities `quantitative_probabilities`.
- Fit a `CategoricalNB` on the qualitative variables and get the probabilities `qualitative_probabilities`.
- Fit a new `GaussianNB` on the probabilities `quantitative_probabilities` and `qualitative_probabilities`.
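
The three steps above could be sketched as follows. This is one possible reading, not a reference implementation: the data (`X_num`, `X_cat`, `y_demo`) is synthetic, and dropping the first probability column of each model (it is redundant, since each row sums to 1) is a design choice made here rather than part of the recipe.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import OrdinalEncoder

# Synthetic data: two numeric features and one string feature, binary target
rng = np.random.default_rng(0)
n = 300
y_demo = rng.integers(0, 2, size=n)
X_num = rng.normal(loc=y_demo[:, None] * 1.0, scale=1.0, size=(n, 2))
labels = np.array(["a", "b"])
flip = rng.random(n) < 0.2  # the categorical feature disagrees with y 20% of the time
X_cat = labels[np.where(flip, 1 - y_demo, y_demo)].reshape(-1, 1)

# CategoricalNB needs integer codes, so encode the strings first
X_cat_encoded = OrdinalEncoder().fit_transform(X_cat)

# Steps 1 and 2: one model per variable type, then class probabilities
quantitative_nb = GaussianNB().fit(X_num, y_demo)
qualitative_nb = CategoricalNB().fit(X_cat_encoded, y_demo)
quantitative_probabilities = quantitative_nb.predict_proba(X_num)
qualitative_probabilities = qualitative_nb.predict_proba(X_cat_encoded)

# Step 3: a new GaussianNB on the stacked probabilities
stacked_features = np.hstack(
    [quantitative_probabilities[:, 1:], qualitative_probabilities[:, 1:]]
)
combined_nb = GaussianNB().fit(stacked_features, y_demo)
combined_predictions = combined_nb.predict(stacked_features)

combined_accuracy = (combined_predictions == y_demo).mean()
print(f"Training accuracy of the combined model: {combined_accuracy:.2f}")
```

Note that the scores here are computed on the training data, so they are optimistic; evaluating on held-out data or with cross-validation gives a fairer comparison.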

**Questions**:
1. Implement this solution and compare the results with what you obtained previously.
2. **Bonus**: Suggest your own implementation using `sklearn` API for classifiers (see https://scikit-learn.org/stable/developers/develop.html).
