# Neural Networks - Practical Assignment No. 1 - Exercise 2 - Notebook #2
This second notebook determines which metric is most appropriate for assessing model performance and which hyperparameters will be tuned during validation. These decisions are then used to select the best model for the diabetes classification problem.

### Useful sources
* https://en.wikipedia.org/wiki/Bessel%27s_correction
* https://en.wikipedia.org/wiki/Kernel_density_estimation
* https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
* https://stackoverflow.com/questions/58046129/can-someone-give-a-good-math-stats-explanation-as-to-what-the-parameter-var-smoo

### Group members
* Gaytan, Joaquín Oscar
* Kammann, Lucas Agustín

# 1. Metric
The metric used to quantify model performance, select the hyperparameters and validate them is **sensitivity**, also known as **recall**.
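
Recall is the fraction of actual positive cases that the classifier detects, TP / (TP + FN). The sketch below computes it by hand on made-up labels; the function name `recall` and the data are illustrative only. For binary labels, `sklearn.metrics.recall_score`, used later in this notebook, should give the same value.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that are predicted as positive."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for t, p in pairs if t == positive and p == positive)
    false_negatives = sum(1 for t, p in pairs if t == positive and p != positive)
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("recall is undefined without positive samples")
    return true_positives / actual_positives


# Toy labels (not taken from the dataset): 4 positives, 3 of them detected
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0]
print(recall(y_true_toy, y_pred_toy))  # 0.75
```

Note that the false positive in the fifth position does not affect recall; only missed positives lower it.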

## 1.1. Justification

# 2. Dataset preparation

## 2.1. Loading the original dataset

In [1]:
import pandas as pd

In [2]:
# Read database from .csv
df = pd.read_csv('../assets/diabetes.csv', delimiter=',')

## 2.2. Filtering invalid values
Values considered invalid for each variable are marked as missing (NaN). The criteria come from the analysis performed in notebook #1.

In [3]:
import numpy as np

In [4]:
# A zero in any of these columns is considered invalid (see notebook #1),
# so it is replaced with NaN. Assigning the result back avoids calling an
# in-place method on a column selection, which may act on a temporary copy.
invalid_zero_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[invalid_zero_columns] = df[invalid_zero_columns].replace(0, np.nan)

## 2.3. Outlier filtering

In [5]:
from src.helper import remove_outliers

In [6]:
for column in df.columns:
    remove_outliers(df, column)
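
The helper `remove_outliers` lives in `src.helper` and is not shown in this notebook, so its exact rule is unknown here. As an illustration only, the sketch below assumes a common choice, the 1.5 × IQR rule, and marks outlying values as NaN. The function name `mask_outliers_iqr` and the factor `k` are hypothetical.

```python
import numpy as np


def mask_outliers_iqr(values, k=1.5):
    """Return a float copy of `values` with points outside
    [Q1 - k*IQR, Q3 + k*IQR] replaced by NaN (assumed rule, see text)."""
    values = np.asarray(values, dtype=float)
    # nanpercentile ignores values that are already missing
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    masked = values.copy()
    masked[(values < lower) | (values > upper)] = np.nan
    return masked


# Toy column: 100.0 lies far above the upper fence and becomes NaN
sample_column = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
cleaned_column = mask_outliers_iqr(sample_column)
print(cleaned_column)
```

Marking outliers as NaN instead of dropping rows keeps every subject in the dataset, so the missing values can be imputed in a later step.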

In [7]:
# Summarize dataset
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 764.0 | 763.0 | 719.0 | 538.0 | 370.0 | 749.0 | 739.0 | 759.0 | 768.0 |
| mean | 3.786649 | 121.686763 | 72.115438 | 28.903346 | 132.610811 | 32.204005 | 0.429832 | 32.805007 | 0.348958 |
| std | 3.278714 | 30.535641 | 11.239072 | 9.86548 | 74.285393 | 6.491385 | 0.249684 | 11.113182 | 0.476951 |
| min | 0.0 | 44.0 | 40.0 | 7.0 | 14.0 | 18.2 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 64.0 | 22.0 | 75.0 | 27.4 | 0.238 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 29.0 | 120.0 | 32.0 | 0.356 | 29.0 | 0.0 |
| 75% | 6.0 | 141.0 | 80.0 | 36.0 | 177.5 | 36.5 | 0.587 | 40.0 | 1.0 |
| max | 13.0 | 199.0 | 104.0 | 56.0 | 360.0 | 50.0 | 1.191 | 66.0 | 1.0 |


## 2.4. Splitting the datasets
The original dataset is split into a train set and a test set; the validation folds are drawn from the train set during cross-validation. The NaN values that replaced the invalid entries are then imputed with the column means of the train set.
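
The imputation idea can be sketched on a toy array. The means are computed on the training rows only and then reused for the test rows, so no statistic is estimated from the test set. The array names and values below are made up for illustration.

```python
import numpy as np

# Toy data: two features, NaN marks a missing value
train_toy = np.array([[1.0, np.nan],
                      [3.0, 4.0]])
test_toy = np.array([[np.nan, 10.0]])

# Column means of the training rows, ignoring missing values
toy_means = np.nanmean(train_toy, axis=0)

# Fill the gaps in both sets with the training means
train_filled = np.where(np.isnan(train_toy), toy_means, train_toy)
test_filled = np.where(np.isnan(test_toy), toy_means, test_toy)

print(toy_means)     # [2. 4.]
print(test_filled)   # [[ 2. 10.]]
```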

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# Split into the full train set and the test set; the full train set
# is later divided into train and validation folds for
# hyperparameter selection
train, test = train_test_split(df, test_size=0.2, random_state=40)

In [10]:
# Compute the column means of the train dataset
train_means = train.mean()

# Replace NaN values of the train dataset with the training means.
# fillna returns a new DataFrame, which avoids modifying a slice in place.
train = train.fillna(train_means)

# Replace NaN values of the test dataset with the same training means
test = test.fillna(train_means)


In [11]:
# Extracting the inputs and outputs of the train dataset
x_train = train.to_numpy()[:,:8]
y_train = train.to_numpy()[:,8]

# Extracting the inputs and outputs of the test dataset
x_test = test.to_numpy()[:,:8]
y_test = test.to_numpy()[:,8]

# 3. Model and hyperparameter selection

In [12]:
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV
from src.gaussian_naive_bayes import BinaryGaussianNaiveBayes

## 3.1. Training all models

In [13]:
%%time

# Hyperparameters
parameters = {
    'smoothing': [0],
    'bessel_correction': [False],
    'filter_variables': [[True, True, True, True, False, True, False, False]]
}

# Estimator or model
estimator = BinaryGaussianNaiveBayes()

# GridSearch Cross-Validation
grid = GridSearchCV(estimator, parameters, cv=5, scoring='recall')
grid.fit(x_train, y_train)

Wall time: 80.8 ms


GridSearchCV(cv=5,
             estimator=BinaryGaussianNaiveBayes(bessel_correction=False,
                                                smoothing=0),
             param_grid={'bessel_correction': [False],
                         'filter_variables': [[True, True, True, True, False,
                                               True, False, False]],
                         'smoothing': [0]},
             scoring='recall')
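
The class `BinaryGaussianNaiveBayes` is defined in `src` and not shown here, so how it uses `bessel_correction` is an assumption: presumably the flag switches between the biased variance estimate (dividing by n) and the unbiased one (dividing by n - 1), as described in the Bessel's correction source listed above. The sketch below only illustrates that difference with the standard library on made-up samples.

```python
import statistics

# Made-up samples: mean is 5.0, sum of squared deviations is 32.0
toy_samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Biased estimate: divide by n (no Bessel's correction)
biased_variance = statistics.pvariance(toy_samples)

# Unbiased estimate: divide by n - 1 (Bessel's correction)
unbiased_variance = statistics.variance(toy_samples)

print(biased_variance)    # 32 / 8 = 4.0
print(unbiased_variance)  # 32 / 7, slightly larger
```

The gap between the two estimates shrinks as the number of samples grows, so the choice matters most for small classes.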

## 3.2. Best model

In [14]:
print(grid.best_params_)

{'bessel_correction': False, 'filter_variables': [True, True, True, True, False, True, False, False], 'smoothing': 0}


In [15]:
print(grid.best_score_)

0.6120278971903284


## 3.3. Full training

In [16]:
# Create and train the model
classifier = BinaryGaussianNaiveBayes(
    smoothing=grid.best_params_['smoothing'], 
    bessel_correction=grid.best_params_['bessel_correction'], 
    filter_variables=grid.best_params_['filter_variables']
)
classifier.fit(x_train, y_train)

# 4. Model validation and performance

In [17]:
predictions = classifier.predict(x_test)

In [18]:
score = recall_score(y_test, predictions)

In [19]:
print(score)

0.5084745762711864


# 5. Comparison with KDE

In [20]:
from sklearn.base import BaseEstimator
from sklearn.neighbors import KernelDensity

class BinaryKDENaiveBayes(BaseEstimator):
    
    def __init__(self, kernel='gaussian', bandwidth=0.5, filter_variables=None):
        self.priori_distribution = None
        self.log_priori_distribution = None
        self.filter_variables = filter_variables
        self.kde = None
        self.kernel = kernel
        self.bandwidth = bandwidth
    
    def fit(self, x_data, y_data):
        # Instantiating one KDE object per class (index 0 and index 1)
        self.kde = [
            KernelDensity(kernel=self.kernel, bandwidth=self.bandwidth)
            for _ in range(2)
        ]
        # Filtering data if required
        if self.filter_variables is not None:
            x_data = x_data[:,self.filter_variables]
        
        # Calculating priori distribution
        self.priori_distribution = np.array([len(y_data[y_data == 0]) , len(y_data[y_data == 1])])
        self.priori_distribution = self.priori_distribution / self.priori_distribution.sum()
        self.log_priori_distribution = np.log(self.priori_distribution)
        
        # Fitting data into KDE object
        self.kde[0].fit(x_data[y_data == 0])
        self.kde[1].fit(x_data[y_data == 1])

        # Returning self follows the scikit-learn estimator convention
        return self
            
    def predict(self, x_data):
        # Filtering data if required
        if self.filter_variables is not None:
            x_data = x_data[:,self.filter_variables]
            
        # Per-class log-likelihood of every subject, shape (n_samples, 2)
        log_likelihood = np.column_stack(
            [kde.score_samples(x_data) for kde in self.kde]
        )
        log_posteriori_unnormalized = log_likelihood + self.log_priori_distribution
        log_odds = log_posteriori_unnormalized[:, 1] - log_posteriori_unnormalized[:, 0]

        # Predict class 1 only when its unnormalized posterior is strictly larger
        return np.where(log_odds > 0, 1.0, 0.0)
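
To make the decision rule in `predict` concrete, here is a minimal 1-D sketch in plain Python, assuming a Gaussian kernel. The helper names (`gaussian_kde_logpdf`, `kde_bayes_predict`) and the toy samples are hypothetical and not part of the notebook; the class above delegates the density estimate to `sklearn.neighbors.KernelDensity`.

```python
import math


def gaussian_kde_logpdf(x, samples, bandwidth):
    """Log of a 1-D Gaussian kernel density estimate evaluated at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return math.log(norm * total)


def kde_bayes_predict(x, class0, class1, bandwidth):
    """Return 1 when log prior + log likelihood is larger for class 1."""
    n0, n1 = len(class0), len(class1)
    log_post0 = math.log(n0 / (n0 + n1)) + gaussian_kde_logpdf(x, class0, bandwidth)
    log_post1 = math.log(n1 / (n0 + n1)) + gaussian_kde_logpdf(x, class1, bandwidth)
    log_odds = log_post1 - log_post0
    return 1 if log_odds > 0 else 0


# Toy training samples for each class (one feature)
class0_toy = [1.0, 1.5, 2.0, 2.5]
class1_toy = [5.0, 5.5, 6.0]

print(kde_bayes_predict(1.8, class0_toy, class1_toy, bandwidth=0.5))  # 0
print(kde_bayes_predict(5.4, class0_toy, class1_toy, bandwidth=0.5))  # 1
```

Working with log densities, as the class does, keeps the comparison stable when the densities themselves are very small.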

In [65]:
%%time

# Expected wait time: about 11 min
# Hyperparameters
param = {
    'kernel': ['exponential', 'gaussian'],
    'bandwidth': np.linspace(0.0001,0.1, 5),
    'filter_variables': [[(i & (0x1 << (7-j)) > 0) for j in range(8)] for i in range(1,256)] # If all columns disabled, then kde fails
}

# Estimator or model
estimator = BinaryKDENaiveBayes()


# GridSearch Cross-Validation
grid = GridSearchCV(estimator, param, cv=5, scoring='recall')
grid.fit(x_train, y_train)

Wall time: 10min 12s


GridSearchCV(cv=5, estimator=BinaryKDENaiveBayes(),
             param_grid={'bandwidth': array([0.0001  , 0.025075, 0.05005 , 0.075025, 0.1     ]),
                         'filter_variables': [[False, False, False, False,
                                               False, False, False, True],
                                              [False, False, False, False,
                                               False, False, True, False],
                                              [False, False, False, False,
                                               False, False, True, True],
                                              [False, False, False, False,
                                               False, True, False, False],
                                              [False, False, Fa...
                                               False, False, False],
                                              [False, False, False, True, True,
                                    

In [66]:
# Get best model params
print(grid.best_params_)

{'bandwidth': 0.0001, 'filter_variables': [False, True, True, True, True, False, False, True], 'kernel': 'exponential'}
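
The `filter_variables` grid used in this search enumerates every non-empty subset of the 8 input columns by reading the bits of the integers 1 to 255. The sketch below reproduces that enumeration in plain Python and counts the resulting candidates, which helps explain the long run time. The helper name `feature_masks` is illustrative; the counts follow from the grid shown above (2 kernels, 5 bandwidths, 5 folds).

```python
from itertools import product

N_FEATURES = 8


def feature_masks(n_features):
    """All non-empty boolean masks; the most significant bit of i maps to column 0."""
    return [
        [(i & (1 << (n_features - 1 - j))) > 0 for j in range(n_features)]
        for i in range(1, 2 ** n_features)
    ]


masks = feature_masks(N_FEATURES)
kernels = ['exponential', 'gaussian']
n_bandwidths = 5
cv_folds = 5

# Every (kernel, bandwidth, mask) combination is one candidate model
candidates = list(product(kernels, range(n_bandwidths), range(len(masks))))

print(len(masks))                  # 255 masks
print(len(candidates))             # 2550 candidates
print(len(candidates) * cv_folds)  # 12750 cross-validation fits

# The best mask printed above corresponds to i = 121 (binary 01111001)
print(masks[121 - 1])
```

With the default `refit=True`, `GridSearchCV` performs one additional fit on the whole train set after the cross-validation fits.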


In [67]:
# Get model best score
print(grid.best_score_)

0.6843732849291427


In [46]:
%%time

# Estimator or model
#filter_variables=[True, True, True, True, False, True, False, False]
estimator = BinaryKDENaiveBayes(kernel='exponential', bandwidth=1.0)
estimator.fit(x_train, y_train)
p = estimator.predict(x_test)

score = recall_score(y_test, p)
print(score)



0.5932203389830508
Wall time: 101 ms


# 6. Comparing against sklearn

In [23]:
from sklearn.naive_bayes import GaussianNB

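# Feature mask selected for the Gaussian model in section 3.2; it is
# written out because `grid` now holds the KDE search from section 5
filter_variables = [True, True, True, True, False, True, False, False]

# Create and train the model
c = GaussianNB()
c.fit(x_train[:, filter_variables], y_train)

# Predict and compute score
p = c.predict(x_test[:, filter_variables])
score = recall_score(y_test, p)
print(score)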

0.5084745762711864
