# Naive Bayes with sklearn

We saw in class that Naive Bayes is a probabilistic classifier that can easily support both categorical and quantitative variables.

The problem is that `sklearn` does not natively handle both types in a single model.

We will need to split our data again into quantitative and qualitative variables, and then write our own code to take both into account.

You can start with quantitative or qualitative data depending on the majority data type in your dataset.

In [6]:
# Install and import the required libraries
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
!pip install scipy
!pip install scikit-learn requests

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import csv

# Load the course dataset
dfpokemon = pd.read_csv("../pokemon.csv")

# Load our own dataset
import requests
response=requests.get('https://api.hearthstonejson.com/v1/190920/frFR/cards.collectible.json')
data = response.json()
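# Write a list of dicts to a CSV file, keeping only the listed attributes
# (attributes missing from a record are written as empty strings)
def save_csv(dictio, attributs, name):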
    with open(name, mode='w', newline='', encoding='utf-8') as fichier_csv:
        writer = csv.DictWriter(fichier_csv, fieldnames=attributs)
        writer.writeheader()
        for objet in dictio:
            ligne = {attr: objet.get(attr, '') for attr in attributs}
            writer.writerow(ligne)


save_csv(data,["id","name","artist","cardClass","cost","attack","health","set","type","rarity"], "dataMK1.csv")
df = pd.read_csv("dataMK1.csv")

In [8]:
# The target is the `is_legendary` column of the course (Pokémon) dataset
y = dfpokemon.is_legendary.values
# For demonstration, I'm taking only a few columns: 2 categorical and 5 quantitative
categorical_variables = ["type1", "type2"]
quantitative_variables = ['defense', 'experience_growth', 'height_m', 'hp',
       'percentage_male']

X = dfpokemon[categorical_variables + quantitative_variables].values

# Missing quantitative values are replaced by the column mean
X_quantitative = dfpokemon[quantitative_variables].fillna(dfpokemon[quantitative_variables].mean()).values

X_categorical = dfpokemon[categorical_variables].values



## Working with quantitative data
With quantitative data, we can use the `GaussianNB` class.

In [10]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

In [11]:
gaussian_nb = GaussianNB()

print("======= Training")
gaussian_nb.fit(X_quantitative, y)

print("======= Prediction")
predictions = gaussian_nb.predict(X_quantitative)

print("======= Results")
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.97      0.95      0.96       731
           1       0.56      0.71      0.62        70

    accuracy                           0.93       801
   macro avg       0.76      0.83      0.79       801
weighted avg       0.94      0.93      0.93       801



In [12]:
predictions = gaussian_nb.predict_proba(X_quantitative)
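
As a minimal sketch of what `predict_proba` returns, the example below fits `GaussianNB` on a small synthetic dataset. The names `X_demo`, `y_demo`, `positive_column` and `p_positive` are illustrative assumptions, not part of the notebook. The columns of the returned array follow the order of `model.classes_`, so it is safer to look up the column of the class of interest than to hard-code its index.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for X_quantitative and y (assumption: two classes, two features)
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
X_demo = rng.normal(loc=y_demo[:, None] * 2.0, scale=1.0, size=(100, 2))

model = GaussianNB().fit(X_demo, y_demo)
probabilities = model.predict_proba(X_demo)

print(model.classes_)       # column order of `probabilities`
print(probabilities.shape)  # (n_samples, n_classes)

# Probability of the positive class (label 1) for every sample
positive_column = list(model.classes_).index(1)
p_positive = probabilities[:, positive_column]
```

`p_positive` is the kind of vector that can then be passed as the colour argument of a scatter plot of two features.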

**Questions**:
1. Plot the statistical distribution of your variables and see if any is highly skewed.
2. Apply Gaussian Naive Bayes to the quantitative variables of your dataset.
3. Retrieve class probability and plot the results as a function of the different features, using heatmap colors.
4. Perform k-fold cross-validation and return the classification scores (accuracy, recall, precision).
5. Try removing highly correlated data and see if your results improve.
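
One possible way to approach the cross-validation question is sketched below with `cross_validate` and stratified folds. The data is synthetic (`make_classification`) and the names `X_demo`, `y_demo`, `cv` and `scores` are assumptions for illustration; with the real data you would pass `X_quantitative` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for X_quantitative and y (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=1,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the class ratio similar in every fold,
# which matters when the positive class is rare
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    GaussianNB(), X_demo, y_demo, cv=cv,
    scoring=["accuracy", "precision", "recall"],
)

# One score per fold and per metric; report mean and spread
for metric in ["accuracy", "precision", "recall"]:
    values = scores[f"test_{metric}"]
    print(f"{metric}: {values.mean():.3f} +/- {values.std():.3f}")
```

Unlike the report above, which scores the model on the same rows it was trained on, each fold here is scored on rows the model has not seen.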

## Working with qualitative data
With qualitative data, we can use the `CategoricalNB` class.

In [14]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder

In [15]:
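# CategoricalNB expects non-negative integer codes, so each categorical
# column is encoded separately (the encoder is refit for every column)
encoder = LabelEncoder()
encoded_vars = []
for category in X_categorical.T:
    encoded_vars.append(encoder.fit_transform(category))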

In [16]:
X_cat = np.array(encoded_vars).T

In [17]:
categorical_nb = CategoricalNB()

print("======= Training")
categorical_nb.fit(X_cat, y)

print("======= Prediction")
predictions = categorical_nb.predict(X_cat)

print("======= Results")
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.91      1.00      0.95       731
           1       0.00      0.00      0.00        70

    accuracy                           0.91       801
   macro avg       0.46      0.50      0.48       801
weighted avg       0.83      0.91      0.87       801



**Questions**:
1. Apply Categorical Naive Bayes to the qualitative variables of your dataset.
2. Retrieve class probability and plot the results as a function of the different features, using heatmap colors.
3. Perform k-fold cross-validation and return the classification scores (accuracy, recall, precision).
4. Compare to previous results.
5. Transform every variable within your dataset to a qualitative one using the class `sklearn.preprocessing.KBinsDiscretizer` and compare with previous results.
6. Compare to what you achieved using `knn`.
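
For the discretization question, a minimal sketch could chain `KBinsDiscretizer` and `CategoricalNB` in a pipeline. The data is synthetic, and the number of bins (5), the `quantile` strategy and the names `X_demo`, `y_demo`, `model` and `binned_predictions` are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for the quantitative variables (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=1,
    random_state=0,
)

# encode="ordinal" yields one integer-valued bin index per feature,
# which is the input format CategoricalNB expects
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

# min_categories tells CategoricalNB that each feature has at least 5 bins,
# even if a bin happens to be absent from the training rows
model = make_pipeline(discretizer, CategoricalNB(min_categories=5))
model.fit(X_demo, y_demo)

binned_predictions = model.predict(X_demo)
```

Wrapping both steps in a pipeline means the bin edges are learned only from the training rows when the model is later evaluated with cross-validation.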

## Working with both data types
It is annoying that `sklearn` does not let us handle both variable types in a single model...

One way to work around this is to:
- Fit a GaussianNB on the quantitative variables and get the probabilities `quantitative_probabilities`
- Fit a CategoricalNB on the qualitative variables and get the probabilities `qualitative_probabilities`
- Fit a new GaussianNB on the probabilities `quantitative_probabilities` and `qualitative_probabilities`.
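
The three steps above can be sketched as follows. This is only one possible reading of the approach, run on synthetic data; every `*_demo` name, `stacked`, `final_model` and `final_predictions` are assumptions introduced for the example. With the real data, `X_quantitative`, `X_cat` and `y` would take their place.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic stand-ins (assumption): 3 quantitative and 2 categorical features
rng = np.random.default_rng(0)
n = 300
y_demo = rng.integers(0, 2, size=n)
X_quant_demo = rng.normal(loc=y_demo[:, None] * 1.5, scale=1.0, size=(n, 3))
X_cat_demo = np.column_stack([
    rng.integers(0, 4, size=n),                 # uninformative category
    (y_demo + rng.integers(0, 3, size=n)) % 4,  # category that depends on the class
])

# Step 1: GaussianNB on the quantitative variables
quant_model = GaussianNB().fit(X_quant_demo, y_demo)
quantitative_probabilities = quant_model.predict_proba(X_quant_demo)

# Step 2: CategoricalNB on the (integer-encoded) qualitative variables
cat_model = CategoricalNB().fit(X_cat_demo, y_demo)
qualitative_probabilities = cat_model.predict_proba(X_cat_demo)

# Step 3: a new GaussianNB on the two sets of probabilities, side by side
stacked = np.hstack([quantitative_probabilities, qualitative_probabilities])
final_model = GaussianNB().fit(stacked, y_demo)
final_predictions = final_model.predict(stacked)

print("Training accuracy:", (final_predictions == y_demo).mean())
```

The probabilities here are computed on the same rows used for training, so the printed accuracy is optimistic; a train/test split or cross-validation gives a fairer estimate.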

**Questions**:
1. Implement this solution and compare the results with what you obtained previously.
2. **Bonus**: Suggest your own implementation using `sklearn` API for classifiers (see https://scikit-learn.org/stable/developers/develop.html).


In [60]:
# Next step: evaluate this method with a train/test split or cross-validation.