# Neural Networks - Practical Assignment No. 1 - Exercise 2 - Notebook #2
This second notebook decides which metric is most appropriate for assessing model performance and which hyperparameters are tuned during validation. These decisions are then used to select the best model for the diabetes diagnosis classification problem.

### Useful sources
* https://en.wikipedia.org/wiki/Bessel%27s_correction
* https://en.wikipedia.org/wiki/Kernel_density_estimation
* https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
* https://stackoverflow.com/questions/58046129/can-someone-give-a-good-math-stats-explanation-as-to-what-the-parameter-var-smoo
* https://scikit-learn.org/stable/modules/density.html

### Group members
* Gaytan, Joaquín Oscar
* Kammann, Lucas Agustín

# 1. Metric
The metric used to quantify model performance, select the hyperparameters, and validate them is **sensitivity**, also called **recall**.

## 1.1. Justification
Recall, or sensitivity with respect to the positive class, is computed as:
$$\text{recall} = \frac{TP}{TP+FN}$$
where $TP$ is the number of true positives and $FN$ the number of false negatives. In other words, this metric reports the proportion of actual positives that the model identifies. When diagnosing a disease, the goal is to minimize the number of false negatives, because a person classified as negative who is actually sick may not receive the corresponding treatment, which worsens their condition.
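As a small illustration of the formula, the sketch below computes recall by hand from a confusion matrix and compares it with scikit-learn's `recall_score`. The label arrays are made up for the example and are not taken from the dataset used in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = sick (positive), 0 = healthy (negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall only looks at the actual positives: TP / (TP + FN)
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print(recall_manual, recall_sklearn)
```

Note that the two false positives in the example do not affect the result, which is why recall on its own says nothing about how many healthy people are flagged as sick.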

# 2. Dataset preparation

## 2.1. Loading the original dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read database from .csv
df = pd.read_csv('../assets/diabetes.csv', delimiter=',')

## 2.2. Replacing null values
Null values of the classification features, which are encoded as 0 in the dataset, are replaced by NaN (Not a Number) so that they are not processed in the subsequent statistical analysis. They are later replaced by a statistic.
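The effect of this substitution can be seen on a toy series. The values below are hypothetical and only illustrate that pandas skips NaN entries when computing statistics, while zeros would be counted as real measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical glucose readings where 0 encodes "not measured"
glucose = pd.Series([100.0, 0.0, 140.0, 0.0, 120.0])

# Zeros are treated as real values and drag the mean down
mean_with_zeros = glucose.mean()

# After masking, NaN entries are skipped by default (skipna=True)
masked = glucose.replace(0, np.nan)
mean_without_zeros = masked.mean()

print(mean_with_zeros, mean_without_zeros)
```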

In [3]:
# Columns in which a zero encodes a missing measurement
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace zeros with NaN by assigning the result back, which avoids
# relying on inplace=True over a column selection
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

## 2.3. Outlier filtering

In [4]:
from src.helper import remove_outliers

In [5]:
for column in df.columns:
    remove_outliers(df, column)
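The source of `remove_outliers` in `src.helper` is not shown in this notebook. As a rough idea of what such a helper could look like, the sketch below assumes an interquartile-range (IQR) rule that marks outliers as NaN in place; the function name `mask_outliers_iqr`, the factor `k`, and the demo values are all hypothetical.

```python
import numpy as np
import pandas as pd

def mask_outliers_iqr(frame: pd.DataFrame, column: str, k: float = 1.5) -> None:
    """Hypothetical helper: set values outside [Q1 - k*IQR, Q3 + k*IQR] to NaN, in place."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # NaN entries compare as False on both sides, so they are left untouched
    outside = (frame[column] < lower) | (frame[column] > upper)
    frame.loc[outside, column] = np.nan

# Made-up column with one obviously extreme value
demo = pd.DataFrame({'Insulin': [80.0, 90.0, 100.0, 110.0, 120.0, 900.0]})
mask_outliers_iqr(demo, 'Insulin')
print(demo)
```

Marking outliers as NaN instead of dropping rows keeps the other features of the affected samples available, at the cost of having to impute the masked values afterwards.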

## 2.4. Splitting the datasets
The original dataset is split into the train and test datasets; the valid datasets are drawn from the train dataset during hyperparameter selection. In addition, the invalid values of the original dataset, which were replaced by NaN, still have to be corrected.

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# Split into the total train dataset and the test dataset; the total
# train dataset contains the train and valid datasets used for
# hyperparameter selection
train, test = train_test_split(df, test_size=0.2, random_state=27)

## 2.5. Replacing invalid values
The invalid values of each variable are filtered and replaced with the mean obtained from the training set. In particular, the mean of the whole training set is used, so that no bias is introduced into the set used to evaluate the model.
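The same idea, learning the means on the training set only and reusing them for the test set, can also be expressed with scikit-learn's `SimpleImputer`. This is an alternative sketch on made-up arrays, not the approach used in the cell that follows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrices with missing entries
x_train_demo = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
x_test_demo = np.array([[np.nan, np.nan], [5.0, 50.0]])

imputer = SimpleImputer(strategy='mean')

# The column means are learned from the training data only
x_train_filled = imputer.fit_transform(x_train_demo)

# The test data is filled with those same training means
x_test_filled = imputer.transform(x_test_demo)

print(imputer.statistics_)
print(x_test_filled)
```

Calling `fit_transform` on the test set instead would compute the means from the test data, which is exactly the leakage this section tries to avoid.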

In [8]:
# Compute the per-column mean of the training set (NaN values are skipped)
train_means = train.mean()

# Replace NaN values of the train dataset with the training mean values
train = train.fillna(train_means)

# Replace NaN values of the test dataset with the training mean values
test = test.fillna(train_means)


## 2.6. Training and evaluation sets

In [9]:
# Extracting the inputs and outputs of the train dataset
x_train = train.to_numpy()[:,:8]
y_train = train.to_numpy()[:,8]

# Extracting the inputs and outputs of the test dataset
x_test = test.to_numpy()[:,:8]
y_test = test.to_numpy()[:,8]

In [10]:
train.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 |
| mean | 3.859247 | 122.032787 | 72.069808 | 28.893519 | 135.921311 | 32.366556 | 0.431246 | 32.792763 | 0.356678 |
| std | 3.308473 | 30.333958 | 10.708903 | 8.288153 | 52.552469 | 6.605536 | 0.244934 | 10.988938 | 0.479409 |
| min | 0.0 | 44.0 | 40.0 | 7.0 | 14.0 | 18.2 | 0.084 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 65.0 | 25.0 | 122.0 | 27.5 | 0.238 | 24.0 | 0.0 |
| 50% | 3.0 | 118.0 | 72.069808 | 28.893519 | 135.921311 | 32.366556 | 0.3815 | 29.0 | 0.0 |
| 75% | 6.0 | 141.0 | 78.0 | 32.0 | 135.921311 | 36.8 | 0.58525 | 40.0 | 1.0 |
| max | 13.0 | 199.0 | 104.0 | 54.0 | 360.0 | 50.0 | 1.189 | 66.0 | 1.0 |


In [11]:
test.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 |
| mean | 3.499086 | 120.318395 | 72.282847 | 28.92785 | 127.967511 | 31.57597 | 0.424461 | 32.853106 | 0.318182 |
| std | 3.10637 | 30.901014 | 11.544378 | 8.147233 | 46.974638 | 5.539948 | 0.245579 | 11.315674 | 0.46729 |
| min | 0.0 | 56.0 | 44.0 | 8.0 | 15.0 | 18.4 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 100.0 | 64.0 | 25.25 | 114.0 | 27.8 | 0.256 | 24.0 | 0.0 |
| 50% | 3.0 | 116.5 | 72.069808 | 28.893519 | 135.921311 | 31.75 | 0.349 | 29.0 | 0.0 |
| 75% | 5.0 | 136.75 | 80.0 | 31.0 | 135.921311 | 35.075 | 0.55125 | 40.0 | 1.0 |
| max | 13.0 | 197.0 | 104.0 | 56.0 | 285.0 | 49.7 | 1.191 | 66.0 | 1.0 |


# 3. Model selection, validation, and evaluation
To validate the model, k-fold cross-validation is used, because only a small amount of data is available. It is carried out by the **grid search** function, which looks for the hyperparameter values that maximize the chosen metric (recall). The model is then trained with the hyperparameters that result from this analysis.
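The procedure can be sketched with scikit-learn's own `GaussianNB` on synthetic data. The dataset produced by `make_classification` and the candidate values for `var_smoothing` are placeholders chosen for the example, not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data, not the dataset used in this notebook
x_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values for GaussianNB's var_smoothing hyperparameter
param_grid = {'var_smoothing': [1e-9, 1e-6, 1e-3]}

# 10-fold cross-validation: each candidate is scored by its mean recall over the folds
search = GridSearchCV(GaussianNB(), param_grid, cv=10, scoring='recall')
search.fit(x_demo, y_demo)

print(search.best_params_, search.best_score_)
```

With the default `refit=True`, `GridSearchCV` also retrains the best candidate on all the data passed to `fit`, so `search.best_estimator_` can be used directly for prediction.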

In [12]:
from sklearn.model_selection import GridSearchCV

In [23]:
from sklearn.metrics import recall_score

In [13]:
import itertools

## 3.1. Hyperparameters

## 3.2. Model using Gaussian Naive Bayes

In [14]:
from src.gaussian_naive_bayes import BinaryGaussianNaiveBayes

In [39]:
%%time

# Declare the hyperparameters and their candidate values so that
# GridSearchCV can search for the best model according to performance
parameters = {
    'std_smoothing': [0, 0.001, 0.01, 0.1, 1],
    'std_correction': [False, True],
    'filter_variables': list(itertools.product([True], repeat=8)),
}

# Estimator or model
estimator = BinaryGaussianNaiveBayes()

# Search the best model
grid = GridSearchCV(estimator, parameters, cv=10, scoring='recall', n_jobs=-1)
grid.fit(x_train, y_train)

Wall time: 7.53 s


GridSearchCV(cv=10, estimator=BinaryGaussianNaiveBayes(), n_jobs=-1,
             param_grid={'filter_variables': [(True, True, True, True, True,
                                               True, True, True)],
                         'std_correction': [False, True],
                         'std_smoothing': [0, 0.001, 0.01, 0.1, 1]},
             scoring='recall')

In [40]:
print(grid.best_params_)

{'filter_variables': (True, True, True, True, True, True, True, True), 'std_correction': False, 'std_smoothing': 0}
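The implementation of `BinaryGaussianNaiveBayes` is not shown in this notebook. Assuming that `std_correction` toggles Bessel's correction (listed among the useful sources) when estimating the standard deviation of each feature, the difference between the two settings would look like the sketch below. The sample values are made up.

```python
import numpy as np

# Hypothetical values of one feature within one class
samples = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Without correction: the sum of squared deviations is divided by n
std_uncorrected = np.std(samples, ddof=0)

# With Bessel's correction: the sum is divided by n - 1
std_corrected = np.std(samples, ddof=1)

print(std_uncorrected, std_corrected)
```

The corrected estimate is always slightly larger, and the gap shrinks as the number of samples grows, so with a few hundred training samples the choice is expected to have only a small effect.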


In [41]:
print(grid.best_score_)

0.6319326555797143


In [42]:
# Train the found model with the complete train set
classifier = BinaryGaussianNaiveBayes(
    std_smoothing=grid.best_params_['std_smoothing'], 
    std_correction=grid.best_params_['std_correction'], 
    filter_variables=grid.best_params_['filter_variables']
)
classifier.fit(x_train, y_train)

In [43]:
# Predictions using the test dataset and computing the score
predictions = classifier.predict(x_test)
score = recall_score(y_test, predictions)

In [46]:
print(f'Score of the model {np.round(score, 3)}')

Score of the model 0.653


In [47]:
from sklearn.naive_bayes import GaussianNB

# Create and train the model
c = GaussianNB()
c.fit(x_train, y_train)

# Predict and compute score
p = c.predict(x_test)
s = recall_score(y_test, p)

# Show the resulting score
print(f'Score of the model {np.round(s, 3)}')

Score of the model 0.653


## 3.3. Model using Kernel Density Estimator
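The source of `BinaryKDENaiveBayes` is not included in this notebook. The sketch below shows one way the idea could be implemented, assuming one univariate `KernelDensity` estimate from scikit-learn per class and per feature, combined under the naive independence assumption. The class name `SketchKDENaiveBayes` and the demo data are hypothetical, and the `filter_variables` hyperparameter is left out.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

class SketchKDENaiveBayes:
    """Illustrative binary naive Bayes with one 1-D KDE per (class, feature)."""

    def __init__(self, kernel='gaussian', bandwidth=1.0):
        self.kernel = kernel
        self.bandwidth = bandwidth

    def fit(self, x, y):
        self.classes_ = np.unique(y)
        # Log prior of each class, estimated from the class frequencies
        self.log_priors_ = np.log([np.mean(y == c) for c in self.classes_])
        # One univariate density estimate per class and per feature
        self.kdes_ = [
            [KernelDensity(kernel=self.kernel, bandwidth=self.bandwidth).fit(x[y == c][:, [j]])
             for j in range(x.shape[1])]
            for c in self.classes_
        ]
        return self

    def predict(self, x):
        # Naive assumption: the joint log-likelihood is the sum over the features
        log_posteriors = np.column_stack([
            prior + sum(kde.score_samples(x[:, [j]]) for j, kde in enumerate(kdes))
            for prior, kdes in zip(self.log_priors_, self.kdes_)
        ])
        return self.classes_[np.argmax(log_posteriors, axis=1)]

# Two well-separated synthetic clusters, used only to exercise the sketch
rng = np.random.default_rng(0)
x_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

model = SketchKDENaiveBayes().fit(x_demo, y_demo)
pred = model.predict(np.array([[0.0, 0.0], [5.0, 5.0]]))
print(pred)
```

Kernels with finite support, such as `tophat` or `epanechnikov`, assign zero density to points far from every training sample, so their log-density can be `-inf` for both classes when the bandwidth is very small.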

In [48]:
from src.kde_naive_bayes import BinaryKDENaiveBayes

In [107]:
%%time

# Declare the hyperparameters and their candidate values so that
# GridSearchCV can search for the best model according to performance
parameters = {
    'kernel': ['exponential', 'tophat', 'linear', 'cosine', 'gaussian', 'epanechnikov'],
    'bandwidth': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10],
    # Only the all-True mask is searched, matching the recorded output below
    'filter_variables': list(itertools.product([True], repeat=8)),
}

# Estimator or model
estimator = BinaryKDENaiveBayes()

# Search the best model
grid = GridSearchCV(estimator, parameters, cv=10, scoring='recall', n_jobs=-1)
grid.fit(x_train, y_train)

Wall time: 10.5 s


GridSearchCV(cv=10, estimator=BinaryKDENaiveBayes(), n_jobs=-1,
             param_grid={'bandwidth': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1,
                                       1, 10],
                         'filter_variables': [(True, True, True, True, True,
                                               True, True, True)],
                         'kernel': ['exponential', 'tophat', 'linear', 'cosine',
                                    'gaussian', 'epanechnikov']},
             scoring='recall')

In [108]:
print(grid.best_params_)

{'bandwidth': 1, 'filter_variables': (True, True, True, True, True, True, True, True), 'kernel': 'exponential'}


In [109]:
print(grid.best_score_)

0.5722334267040149


In [110]:
# Train the found model with the complete train set
classifier = BinaryKDENaiveBayes(
    kernel=grid.best_params_['kernel'], 
    bandwidth=grid.best_params_['bandwidth'], 
    filter_variables=grid.best_params_['filter_variables']
)
classifier.fit(x_train, y_train)

In [111]:
# Predictions using the test dataset and computing the score
predictions = classifier.predict(x_test)
score = recall_score(y_test, predictions)

In [112]:
print(f'Score of the model {np.round(score, 3)}')

Score of the model 0.592
