# Neural Networks - Practical Assignment No. 1 - Exercise 2 - Notebook #2
This second notebook decides which metric is most appropriate for analyzing the model's performance and which hyperparameters will be tuned according to validation. These decisions then drive the selection of the best model for the diabetes classification problem addressed in this exercise.

### Useful sources
* https://en.wikipedia.org/wiki/Bessel%27s_correction
* https://en.wikipedia.org/wiki/Kernel_density_estimation
* https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
* https://stackoverflow.com/questions/58046129/can-someone-give-a-good-math-stats-explanation-as-to-what-the-parameter-var-smoo
* https://scikit-learn.org/stable/modules/density.html

### Group members
* Gaytan, Joaquín Oscar
* Kammann, Lucas Agustín

# 1. Metric
The metric used to quantify the performance of the models, select the hyperparameters, and validate them will be **sensitivity**, also known as **recall**.
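
As a minimal illustration of the metric, recall is the fraction of actual positives that the model detects, TP / (TP + FN). The labels below are made up for the example and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall computed by hand and with sklearn
manual_recall = tp / (tp + fn)
sklearn_recall = recall_score(y_true, y_pred)
print(manual_recall, sklearn_recall)
```

Both values agree because `recall_score` treats label 1 as the positive class by default for binary problems.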

## 1.1. Justification

# 2. Dataset preparation

## 2.1. Loading the original dataset

In [1]:
import pandas as pd

In [2]:
# Read database from .csv
df = pd.read_csv('../assets/diabetes.csv', delimiter=',')

## 2.2. Filtering invalid values
Invalid values in each variable are filtered out and replaced with the mean obtained from the training set. In particular, the mean is computed over the whole training set only, so that no bias is introduced into the set used to evaluate the model.
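
A minimal sketch of this imputation scheme on toy data, assuming pandas `fillna` is used with a Series of column means. The frames `toy_train` and `toy_test` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks an invalid measurement
toy_train = pd.DataFrame({'Glucose': [100.0, np.nan, 140.0], 'BMI': [30.0, 20.0, np.nan]})
toy_test = pd.DataFrame({'Glucose': [np.nan, 90.0], 'BMI': [np.nan, 28.0]})

# Means are estimated on the training rows only (NaN is skipped by default)
toy_means = toy_train.mean()

# The same training means fill the gaps of both sets
toy_train_filled = toy_train.fillna(toy_means)
toy_test_filled = toy_test.fillna(toy_means)
print(toy_test_filled)
```

The test rows never contribute to the statistics, so the evaluation set stays independent of the fitting step.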

In [3]:
import numpy as np

In [4]:
# Filtering Glucose values (a 0 marks an invalid measurement)
# Assigning the result back avoids relying on chained in-place replacement
df['Glucose'] = df['Glucose'].replace(0, np.nan)

# Filtering Blood Pressure values
df['BloodPressure'] = df['BloodPressure'].replace(0, np.nan)

# Filtering Skin Thickness values
df['SkinThickness'] = df['SkinThickness'].replace(0, np.nan)

# Filtering Insulin values
df['Insulin'] = df['Insulin'].replace(0, np.nan)

# Filtering Body Mass Index values
df['BMI'] = df['BMI'].replace(0, np.nan)

## 2.3. Outlier filtering

In [5]:
from src.helper import remove_outliers

In [6]:
for column in df.columns:
    remove_outliers(df, column)
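
The implementation of `remove_outliers` lives in `src.helper` and is not shown in this notebook. As a rough idea of what such a helper could look like, the hypothetical `mask_outliers_iqr` below applies the common 1.5 x IQR rule and replaces out-of-range values with NaN. Both the rule and the function name are assumptions, not the authors' code.

```python
import numpy as np
import pandas as pd

def mask_outliers_iqr(frame, column, factor=1.5):
    """ Hypothetical helper: set values outside [Q1 - factor*IQR, Q3 + factor*IQR] to NaN. """
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    inside = frame[column].between(q1 - factor * iqr, q3 + factor * iqr)
    # Series.where keeps the values where the condition holds and writes NaN elsewhere
    frame[column] = frame[column].where(inside)
    return frame

# Toy column with one extreme value
toy = pd.DataFrame({'Insulin': [90.0, 100.0, 110.0, 120.0, 900.0]})
toy = mask_outliers_iqr(toy, 'Insulin')
print(toy)
```

Marking outliers as NaN rather than dropping rows would let them be filled later with the training mean, like the other invalid values.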

## 2.4. Splitting the datasets
The original dataset is split into the train, validation, and test datasets. In addition, the NaN values that replaced the invalid entries of the original dataset must now be filled in.

In [7]:
from sklearn.model_selection import train_test_split

In [275]:
# Splitting into the total train and the test datasets, because
# the total train contains the train and valid datasets used for
# hyperparameter selection
train, test = train_test_split(df, test_size=0.2, random_state=27)

In [276]:
# Compute the mean of each column using the training set only
train_means = train.mean()

# Replacing nan values of the train dataset with training mean values
train = train.fillna(train_means)

# Replacing nan values of the test dataset with training mean values
test = test.fillna(train_means)


In [277]:
# Extracting the inputs and outputs of the train dataset
x_train = train.to_numpy()[:,:8]
y_train = train.to_numpy()[:,8]

# Extracting the inputs and outputs of the test dataset
x_test = test.to_numpy()[:,:8]
y_test = test.to_numpy()[:,8]

In [278]:
train.describe()

| Statistic | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 | 614.0 |
| mean | 3.859247 | 122.032787 | 72.069808 | 28.893519 | 135.921311 | 32.366556 | 0.431246 | 32.792763 | 0.356678 |
| std | 3.308473 | 30.333958 | 10.708903 | 8.288153 | 52.552469 | 6.605536 | 0.244934 | 10.988938 | 0.479409 |
| min | 0.0 | 44.0 | 40.0 | 7.0 | 14.0 | 18.2 | 0.084 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 65.0 | 25.0 | 122.0 | 27.5 | 0.238 | 24.0 | 0.0 |
| 50% | 3.0 | 118.0 | 72.069808 | 28.893519 | 135.921311 | 32.366556 | 0.3815 | 29.0 | 0.0 |
| 75% | 6.0 | 141.0 | 78.0 | 32.0 | 135.921311 | 36.8 | 0.58525 | 40.0 | 1.0 |
| max | 13.0 | 199.0 | 104.0 | 54.0 | 360.0 | 50.0 | 1.189 | 66.0 | 1.0 |


In [279]:
test.describe()

| Statistic | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 | 154.0 |
| mean | 3.499086 | 120.318395 | 72.282847 | 28.92785 | 127.967511 | 31.57597 | 0.424461 | 32.853106 | 0.318182 |
| std | 3.10637 | 30.901014 | 11.544378 | 8.147233 | 46.974638 | 5.539948 | 0.245579 | 11.315674 | 0.46729 |
| min | 0.0 | 56.0 | 44.0 | 8.0 | 15.0 | 18.4 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 100.0 | 64.0 | 25.25 | 114.0 | 27.8 | 0.256 | 24.0 | 0.0 |
| 50% | 3.0 | 116.5 | 72.069808 | 28.893519 | 135.921311 | 31.75 | 0.349 | 29.0 | 0.0 |
| 75% | 5.0 | 136.75 | 80.0 | 31.0 | 135.921311 | 35.075 | 0.55125 | 40.0 | 1.0 |
| max | 13.0 | 197.0 | 104.0 | 56.0 | 285.0 | 49.7 | 1.191 | 66.0 | 1.0 |


# 3. Model and hyperparameter selection

## 3.1. Training all models

In [13]:
import itertools

In [14]:
from sklearn.model_selection import GridSearchCV

In [21]:
# Including libraries from sklearn
from sklearn.base import BaseEstimator

# Including libraries from scipy
from scipy import stats

# Including libraries from numpy
import numpy as np

def gaussian_pdf(value, parameters):
    """ Probability density function of a gaussian distributed continuous random variable.
        @param value Value where the pdf is evaluated
        @param parameters Values used to parametrize the distribution, such as the mean and the std
    """
    mean = parameters['mean']
    std = parameters['std']
    return stats.norm.pdf((value - mean) / std) / std

def exponential_pdf(value, parameters):
    """ Probability density function of an exponentially distributed continuous random variable.
        @param value Value where the pdf is evaluated
        @param parameters Values used to parametrize the distribution, such as the rate lambda
    """
    _lambda = parameters['lambda']
    return stats.expon.pdf(value * _lambda) * _lambda

class BinaryNaiveBayes(BaseEstimator):
    """ Implements the Naive Bayes classification criterion for problems with two classes,
        allowing parametric distributions based on well-known density functions.
    """

    # Dictionary used to map the type of distribution set in the configuration of the model
    # to the function that evaluates it. Basically, a dispatcher of probability density functions
    supported_distributions = {
        'gaussian': gaussian_pdf,
        'exponential': exponential_pdf
    }

    def __init__(self, smoothing=0, bessel_correction=False, filter_variables=None, variables_distributions=None):
        # Configuration of the model, also known as the hyperparameters. The names match the keys
        # used in the grid search, and the values are stored unmodified so that sklearn can
        # clone the estimator through get_params() and set_params()
        self.smoothing = smoothing
        self.bessel_correction = bessel_correction
        self.filter_variables = filter_variables
        self.variables_distributions = variables_distributions

    def _select_variables(self, x_data):
        """ Apply the boolean variable filter to the data and to the distribution names.
            @param x_data Matrix where the rows contain study cases and the columns contain variables or features
        """
        n_variables = x_data.shape[1]
        # Assumption: without a filter all the variables are kept, and without
        # explicit distributions every variable is modelled as gaussian
        mask = np.ones(n_variables, dtype=bool) if self.filter_variables is None else np.array(self.filter_variables, dtype=bool)
        names = ['gaussian'] * n_variables if self.variables_distributions is None else self.variables_distributions
        return x_data[:, mask], np.array(names)[mask]

    def fit(self, x_data, y_data):
        """ Fit the model with the training dataset given.
            @param x_data Matrix where the rows contain study cases and the columns contain variables or features
            @param y_data Array containing the class to which the corresponding study case belongs
        """
        # Filtering data if required
        x_data, distributions = self._select_variables(x_data)

        # Computing the probability distribution of all classes, also known as the prior probabilities
        self.classes_distribution = np.array([(y_data == 0).sum(), (y_data == 1).sum()])
        self.classes_distribution = self.classes_distribution / self.classes_distribution.sum()
        self.classes_log_distribution = np.log(self.classes_distribution)

        # Initializing parameter container, indexed as [class_index][variable_index]
        self.variables_parameters = [[None] * x_data.shape[1] for _ in range(2)]

        # Fitting the distribution assigned to each variable, separately for each class
        for variable_index in range(x_data.shape[1]):
            for class_index in range(2):
                values = x_data[y_data == class_index, variable_index]
                mean = np.nanmean(values)

                if distributions[variable_index] == 'exponential':
                    # Assumption: the rate is estimated as the reciprocal of the sample mean
                    parameters = {'lambda': 1 / mean}
                else:
                    std = np.nanstd(values)

                    # Apply Bessel's correction to the standard deviation
                    if self.bessel_correction:
                        n = (y_data == class_index).sum()
                        std = std * np.sqrt(n / (n - 1))

                    # The smoothing term is added to the standard deviation
                    parameters = {'mean': mean, 'std': std + self.smoothing}

                self.variables_parameters[class_index][variable_index] = parameters

        return self

    def predict(self, x_data):
        """ Predict the class of the given input data.
            @param x_data Matrix where the rows represent study cases and the columns contain the variables or features to analyze
        """
        # Filtering data if required
        x_data, distributions = self._select_variables(x_data)

        # Start from the log prior of each class (one column per class)
        log_posterior = np.tile(self.classes_log_distribution, (x_data.shape[0], 1))

        # Add the log likelihood of each variable for each class (negative, positive)
        for variable_index in range(x_data.shape[1]):
            pdf = self.supported_distributions[distributions[variable_index]]
            for class_index in range(2):
                parameters = self.variables_parameters[class_index][variable_index]
                log_posterior[:, class_index] += np.log(pdf(x_data[:, variable_index], parameters))

        # Predict the class with the largest unnormalized log posterior
        return (log_posterior[:, 1] > log_posterior[:, 0]).astype(float)
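
The density functions and the Bessel correction used in this cell rest on three identities that can be checked numerically. The sketch below uses a synthetic sample and arbitrary parameter values, all of which are assumptions made for the check.

```python
import numpy as np
from scipy import stats

# Synthetic sample, used only to compare the two standard deviation estimators
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=50)
n = sample.size

# Bessel's correction: rescaling the biased std by sqrt(n / (n - 1))
# matches the estimator that divides by n - 1 (ddof=1)
biased_std = np.std(sample)
corrected_std = biased_std * np.sqrt(n / (n - 1))
bessel_ok = np.isclose(corrected_std, np.std(sample, ddof=1))

# Standardizing by hand and rescaling matches scipy's loc/scale parametrization
mean, std, rate = 5.0, 2.0, 0.5
x = np.array([1.0, 4.0, 7.5])
manual_gaussian = stats.norm.pdf((x - mean) / std) / std
manual_exponential = stats.expon.pdf(x * rate) * rate
gaussian_ok = np.allclose(manual_gaussian, stats.norm.pdf(x, loc=mean, scale=std))
exponential_ok = np.allclose(manual_exponential, stats.expon.pdf(x, scale=1 / rate))

print(bessel_ok, gaussian_ok, exponential_ok)
```

In scipy the exponential distribution is parametrized by `scale`, which is the reciprocal of the rate lambda.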

## 3.2. Searching for the best model

In [None]:
%%time

# Hyperparameters
parameters = {
    'smoothing': [0],
    'bessel_correction': [False, True],
    'filter_variables': list(itertools.product([True, False], repeat=8)),
    'variables_distributions': list(itertools.product(['gaussian', 'exponential'], repeat=8))
}

# Estimator or model
estimator = BinaryNaiveBayes()

# GridSearch Cross-Validation
grid = GridSearchCV(estimator, parameters, cv=2, scoring='recall', n_jobs=-1)
grid.fit(x_train, y_train)
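
To get a sense of the cost of this search, the number of candidates can be counted with `ParameterGrid` from sklearn before fitting anything. This is a small sketch that rebuilds the same grid under the hypothetical name `grid_parameters`.

```python
import itertools
from sklearn.model_selection import ParameterGrid

# Same grid as in the search cell
grid_parameters = {
    'smoothing': [0],
    'bessel_correction': [False, True],
    'filter_variables': list(itertools.product([True, False], repeat=8)),
    'variables_distributions': list(itertools.product(['gaussian', 'exponential'], repeat=8))
}

# Number of hyperparameter combinations and of cross-validation fits
n_candidates = len(ParameterGrid(grid_parameters))
n_fits = n_candidates * 2  # cv=2 folds per candidate
print(n_candidates, n_fits)
```

Many of these candidates are equivalent, because the distribution assigned to a variable that is filtered out does not affect the model.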

In [335]:
print(grid.cv_results_['std_test_score'])

[0.13982733 0.12352915 0.13322312 0.13769068 0.14566054 0.15064078
 0.13788783 0.15489862 0.1296138  0.11888205 0.14031653 0.1454927
 0.14386318 0.13066683 0.13461839 0.1411286  0.15224343 0.14982042
 0.15164502 0.14020658 0.13639343 0.11942394 0.13189924 0.14580557
 0.14423311 0.15127575 0.15538071 0.13530837 0.13018426 0.12771773
 0.12780679 0.13262705 0.14574343 0.11152302 0.13344825 0.12214079
 0.14097944 0.13500021 0.14773403 0.15228223 0.12997736 0.12088271
 0.11787785 0.11646893 0.1500729  0.14997126 0.14363061 0.13848072
 0.15072644 0.13749571 0.15213447 0.12551807 0.14548354 0.13921248
 0.13621628 0.12970173 0.14506799 0.11822153 0.14248933 0.12195976
 0.13579715 0.1201416  0.12097858 0.12970173 0.14223443 0.13179208
 0.13691623 0.11301223 0.14429445 0.13016406 0.13379332 0.1224398
 0.13027479 0.14370411 0.14901826 0.12198762 0.15200741 0.13913634
 0.13225284 0.13768913 0.14920295 0.14566614 0.15117307 0.13972853
 0.15313099 0.14257774 0.14055807 0.1429472  0.1515879  0.149928

In [102]:
import pprint 

In [103]:
pprint.pprint(grid.best_params_)

{'bessel_correction': False,
 'filter_variables': (True, True, True, True, True, True, True, True),
 'smoothing': 0,
 'variables_distributions': ('gaussian',
                             'gaussian',
                             'gaussian',
                             'gaussian',
                             'exponential',
                             'gaussian',
                             'gaussian',
                             'gaussian')}


In [104]:
print(grid.best_score_)

0.6102681335692172


## 3.3. Full training

In [326]:
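# Create and train the model
# Note: variables_distributions is hard-coded here and differs from
# the value reported in grid.best_params_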
classifier = BinaryNaiveBayes(
    smoothing=grid.best_params_['smoothing'], 
    bessel_correction=grid.best_params_['bessel_correction'], 
    filter_variables=grid.best_params_['filter_variables'],
    variables_distributions=('exponential',
                             'gaussian',
                             'exponential',
                             'gaussian',
                             'gaussian',
                             'gaussian',
                             'exponential',
                             'gaussian')
)
classifier.fit(x_train, y_train)

# 4. Model validation and performance

In [327]:
from sklearn.metrics import recall_score

In [328]:
predictions = classifier.predict(x_test)

In [329]:
score = recall_score(y_test, predictions)

In [330]:
print(score)

0.6530612244897959


# 5. Comparing against sklearn

In [285]:
from sklearn.naive_bayes import GaussianNB

# Create and train the model
c = GaussianNB()
c.fit(x_train, y_train)

# Predict and compute score
p = c.predict(x_test)
score = recall_score(y_test, p)
print(score)

0.6530612244897959
