# Redes Neuronales - Trabajo Práctico N° 1 - Ejercicio 2 - Notebook #2
En esta segunda notebook, se busca definir cuál métrica es más apropiada para analizar la performance del modelo y qué hiper parámetros se van a utilizar para el ajuste del modelo acorde a la validación. Finalmente, estas decisiones se vuelcan en la selección del mejor modelo para el problema de la clasificación de correos electrónicos asociados grupos de noticias.

### Integrantes del grupo
* Gaytan, Joaquín Oscar
* Kammann, Lucas Agustín

# 1. Métrica
La métrica a utilizar para cuantificar la performance de los modelos, seleccionar los hiperparámetros y validarlos, será la **sensibilidad** o **recall**.

## 1.1. Justificación

# 2. Preparación de los datasets

## 2.1. Cargando el dataset original

In [40]:
import pandas as pd

In [41]:
# Read database from .csv
df = pd.read_csv('../assets/diabetes.csv', delimiter=',')

In [42]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


## 2.2. Filtrado de valores inválidos
Se filtran los valores que se consideran inválidos para las variables en cuestión, estas consideraciones se obtuvieron como resultado del análisis realizado en el notebook #1.

In [43]:
import numpy as np

In [44]:
# Filtering Glucose values
df['Glucose'].replace(0, np.nan, inplace=True)

# Filtering Blood Pressure values
df['BloodPressure'].replace(0, np.nan, inplace=True)

# Filtering Skin Thickness values
df['SkinThickness'].replace(0, np.nan, inplace=True)

# Filtering Insulin values
df['Insulin'].replace(0, np.nan, inplace=True)

# Filtering Body Mass Index values
df['BMI'].replace(0, np.nan, inplace=True)

## 2.3. Filtrado de outliers
Por el momento, se consideró no emplear un filtrado de outliers.

## 2.4. Separación de datasets
Se separa el dataset original en los datasets de train, valid y test. Además, se debe corregir que los valores inválidos del dataset original fueron reemplazados por el valor NaN.

In [45]:
from sklearn.model_selection import train_test_split

In [46]:
# Splitting into the total train and the test datasets, because
# the total train contains the train and valid datasets used for
# hiper parameter selection
total_train, test = train_test_split(df, test_size=0.2, random_state=24)

In [47]:
# Creating the datasets for the hiper parameter selection
train, valid = train_test_split(total_train, test_size=0.2, random_state=24)

In [48]:
train_means = train.mean().to_numpy()

# Replacing nan values of the test dataset with training mean values
for index, column in enumerate(total_train.columns):
    total_train.loc[:,column].replace(np.nan, train_means[index], inplace=True)

# Replacing nan values of the test dataset with training mean values
for index, column in enumerate(train.columns):
    train.loc[:,column].replace(np.nan, train_means[index], inplace=True)

# Replacing nan values of the test dataset with training mean values
for index, column in enumerate(test.columns):
    test.loc[:,column].replace(np.nan, train_means[index], inplace=True)

# Replacing nan values of the valid dataset with training mean values
for index, column in enumerate(valid.columns):
    valid.loc[:,column].replace(np.nan, train_means[index], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


In [75]:
# Extracting the inputs and outputs of the total train dataset
x_total_train = total_train.to_numpy()[:,:8]
y_total_train = total_train.to_numpy()[:,8]

# Extracting the inputs and outputs of the train dataset
x_train = train.to_numpy()[:,:8]
y_train = train.to_numpy()[:,8]

# Extracting the inputs and outputs of the valid dataset
x_valid = valid.to_numpy()[:,:8]
y_valid = valid.to_numpy()[:,8]

# Extracting the inputs and outputs of the test dataset
x_test = test.to_numpy()[:,:8]
y_test = test.to_numpy()[:,8]

# 3. Selección de modelo e hiper parámetros

In [62]:
from sklearn.metrics import recall_score

## 3.1. Entrenamiento de todos los modelos

In [87]:
%%time

from src.gaussian_naive_bayes import BinaryGaussianNaiveBayes

# Creating the lists with model parameters and their score
model_params = []
model_score = []

# Iterating over all hiper parameters to find the best model
for use_filter in [[(i & (0x1 << (7-j)) > 0) for j in range(8)] for i in range(256)]:
    for use_smoothing in np.linspace(0, 0.1, 100):

        # Create and train the model
        classifier = BinaryGaussianNaiveBayes(smoothing=use_smoothing)
        classifier.fit(x_train[:,use_filter], y_train)

        # Estimate and predict with the model
        predictions = classifier.predict(x_valid[:,use_filter])
        score = recall_score(y_valid, predictions)
        params = {
            'use_smoothing': use_smoothing,
            'use_filter': use_filter
        }

        # Save the results
        model_params.append(params)
        model_score.append(score)

Wall time: 7min 24s


## 3.2. Mejor modelo

In [88]:
# Search the model with the best score
selected_model_index = np.argmax(model_score)
selected_model_score = model_score[selected_model_index]
selected_model_params = model_params[selected_model_index]

In [89]:
selected_model_params

{'use_smoothing': 0.07171717171717172,
 'use_filter': [False, True, True, True, False, True, False, False]}

## 3.3. Entrenamiento completo

In [90]:
# Create and train the model
classifier = BinaryGaussianNaiveBayes(smoothing=selected_model_params['use_smoothing'])
classifier.fit(x_total_train, y_total_train)

# 4. Validación y performance del modelo

In [95]:
predictions = classifier.predict(x_test[:,selected_model_params['use_filter']])

In [96]:
score = recall_score(y_test, predictions)

In [97]:
print(score)

0.7857142857142857
