# Neural Networks - Practical Assignment No. 2 - Exercise 1 - Logistic Regression
# Notebook #3: Implementing an MLP model
This notebook builds on the previous ones to implement an MLP model that estimates whether a patient is diabetic, using the Pima Indians Dataset analyzed in the first notebook.

# 1. Loading the database

In [1]:
import numpy as np

In [2]:
import pandas as pd

In [16]:
# Read database from .csv
df = pd.read_csv('../../databases/diabetes.csv', delimiter=',')

# Show first rows of data
df.head()

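| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |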


# 2. Data preprocessing

## 2.1 Filtering invalid values

In [4]:
# Columns in which a zero is treated as an invalid value
invalid_zero_labels = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Mark those zeros as missing (NaN). The result is assigned back to df instead of
# calling replace(..., inplace=True) on a column selection, which pandas may apply
# to a temporary copy (it does not update df under Copy-on-Write).
df[invalid_zero_labels] = df[invalid_zero_labels].replace(0, np.nan)
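
To check how many entries this filter marks as missing, the NaN values can be counted per column afterwards. The sketch below uses a small hypothetical frame (`toy`, not rows from `diabetes.csv`) and assumes, as the cell above does, that zero is invalid only in the measurement columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from diabetes.csv
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "Insulin": [0, 0, 94],
    "Pregnancies": [6, 1, 0],  # zero is left untouched in this column
})

# Replace zeros with NaN only in the columns where zero is considered invalid
invalid_zero_columns = ["Glucose", "Insulin"]
toy[invalid_zero_columns] = toy[invalid_zero_columns].replace(0, np.nan)

# Count the missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` on `df` would show how much of each column has to be imputed later.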

# 3. Splitting into training and evaluation sets

In [6]:
from sklearn import model_selection

In [7]:
from sklearn import preprocessing

In [8]:
# Define input and output variables for the model
x_labels = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction','Age']
y_labels = ['Outcome']

df_x = df[x_labels]
df_y = df[y_labels]

In [9]:
# Split the dataset into train_valid and test
x_train_valid, x_test, y_train_valid, y_test = model_selection.train_test_split(df_x, df_y, test_size=0.2, random_state=15, shuffle=True)

# Split the train_valid sub-dataset into train and valid
x_train, x_valid, y_train, y_valid = model_selection.train_test_split(x_train_valid, y_train_valid, test_size=0.3, random_state=23, shuffle=True)
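
The two consecutive splits can be sanity-checked by looking at the resulting subset sizes. The sketch below assumes the dataset has 768 rows (an assumption, although it is consistent with the count of 429 that `x_train.describe()` reports later in this notebook) and splits a plain index array with the same parameters.

```python
import numpy as np
from sklearn import model_selection

# Assumed number of rows in the dataset (not stated in this notebook)
n_rows = 768
indices = np.arange(n_rows)

# First split: 20% of the rows are held out for testing
train_valid_idx, test_idx = model_selection.train_test_split(
    indices, test_size=0.2, random_state=15, shuffle=True)

# Second split: 30% of the remaining rows are used for validation
train_idx, valid_idx = model_selection.train_test_split(
    train_valid_idx, test_size=0.3, random_state=23, shuffle=True)

print(len(train_idx), len(valid_idx), len(test_idx))  # 429 185 154
```

Under that assumption the split is roughly 56% training, 24% validation and 20% test.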

# 4. Replacing invalid values

In [10]:
# Select columns to perform non-valid values replacement
replace_labels = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Work on explicit copies so the assignments below do not raise SettingWithCopyWarning
x_train, x_valid, x_test = x_train.copy(), x_valid.copy(), x_test.copy()

# Perform NaN replacement
for label in replace_labels:
    # Compute the column mean on the training subset only and reuse it for
    # train, valid and test, so no statistics leak from valid/test
    mean = np.nanmean(x_train[label])
    x_train[label] = x_train[label].fillna(mean)
    x_valid[label] = x_valid[label].fillna(mean)
    x_test[label] = x_test[label].fillna(mean)
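
The same train-only mean imputation can also be written with scikit-learn's `SimpleImputer`. This is an alternative to the loop above, not what the notebook uses; the frames `toy_train` and `toy_valid` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy subsets with missing values
toy_train = pd.DataFrame({"Glucose": [100.0, np.nan, 140.0],
                          "BMI": [30.0, 20.0, np.nan]})
toy_valid = pd.DataFrame({"Glucose": [np.nan, 90.0],
                          "BMI": [np.nan, 25.0]})

# Learn the column means from the training subset only
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)

# Fill NaN in every subset with the training means
toy_train_filled = pd.DataFrame(imputer.transform(toy_train),
                                columns=toy_train.columns, index=toy_train.index)
toy_valid_filled = pd.DataFrame(imputer.transform(toy_valid),
                                columns=toy_valid.columns, index=toy_valid.index)

print(imputer.statistics_)  # training means per column
print(toy_valid_filled)
```

Fitting on the training subset and only transforming the others keeps validation and test data out of the imputation statistics.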


# 5. Input data normalization: z-score

In [11]:
# IMPORTANT! Back up the unnormalized subsets for later use.
# .copy() is needed: plain assignment only creates a second name for the same
# DataFrame, so the scaling below would overwrite the backup as well.
x_train_un = x_train.copy()
x_valid_un = x_valid.copy()
x_test_un = x_test.copy()

# Apply z-score to all sub-datasets
scalable_variables = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction','Age']

if scalable_variables:
    # Create a single StandardScaler; it stores one mean and one standard deviation per column
    scaler = preprocessing.StandardScaler()

    # Fit the scaler on the training subset only
    scaler.fit(x_train[scalable_variables])

    # Normalize every subset with the training statistics. Assigning whole columns
    # on explicit copies avoids SettingWithCopyWarning and lets integer columns become float.
    x_train, x_valid, x_test = x_train.copy(), x_valid.copy(), x_test.copy()
    x_train[scalable_variables] = scaler.transform(x_train[scalable_variables])
    x_test[scalable_variables] = scaler.transform(x_test[scalable_variables])
    x_valid[scalable_variables] = scaler.transform(x_valid[scalable_variables])
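
The z-score applied here is z = (x - mean_train) / std_train, where both statistics come from the training subset. As a minimal sketch on hypothetical arrays (`toy_train`, `toy_valid`), the manual formula can be compared against `StandardScaler`:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy subsets with two features
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_valid = np.array([[2.5, 25.0], [5.0, 0.0]])

# Fit on the training subset only
scaler = preprocessing.StandardScaler().fit(toy_train)

# Manual z-score using the training mean and population standard deviation (ddof=0)
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)
manual_valid = (toy_valid - train_mean) / train_std

# Both routes give the same normalized validation data
matches = np.allclose(manual_valid, scaler.transform(toy_valid))
print(matches)
```

The first validation row equals the training mean in both features, so it maps to zero after normalization.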


In [14]:
x_train.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 429.0 | 429.0 | 429.0 | 429.0 | 429.0 | 429.0 | 429.0 | 429.0 |
| mean | -4.1406920000000004e-17 | -7.349728000000001e-17 | -5.341493e-16 | -5.300086e-16 | 4.1406920000000004e-17 | -6.211038e-18 | 5.1758650000000005e-17 | -1.407835e-16 |
| std | 1.001168 | 1.001168 | 1.001168 | 1.001168 | 1.001168 | 1.001168 | 1.001168 | 1.001168 |
| min | -1.167432 | -2.179287 | -3.943736 | -2.674862 | -1.815199 | -2.0217 | -1.164196 | -1.053405 |
| 25% | -0.8686085 | -0.6929662 | -0.6794 | -0.3574133 | -0.3498596 | -0.7120718 | -0.6945041 | -0.7945294 |
| 50% | -0.2709613 | -0.1524859 | 0.0 | -4.33328e-16 | 0.0 | -0.03567041 | -0.3581441 | -0.3630697 |
| 75% | 0.6255096 | 0.5906746 | 0.6263344 | 0.3744127 | 0.0 | 0.5543819 | 0.5357857 | 0.6724333 |
| max | 3.016099 | 2.617476 | 4.053887 | 4.155514 | 6.898339 | 5.015753 | 5.911486 | 3.347483 |
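
The `std` row shows 1.001168 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). The ratio between the two is sqrt(n / (n - 1)), which for n = 429 is about 1.001168. A minimal sketch with synthetic data (the `feature` column and its distribution are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Synthetic column with the same number of rows as x_train
n_samples = 429
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.normal(loc=120.0, scale=30.0, size=n_samples)})

# Normalize with StandardScaler, which uses ddof=0 internally
scaled = pd.DataFrame(preprocessing.StandardScaler().fit_transform(toy),
                      columns=toy.columns)

population_std = scaled["feature"].std(ddof=0)  # what the scaler sets to 1
sample_std = scaled["feature"].std(ddof=1)      # what describe() reports
expected_ratio = np.sqrt(n_samples / (n_samples - 1))

print(round(population_std, 6), round(sample_std, 6))
```

The means in the table are on the order of 1e-16, which is floating-point rounding around zero.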


# 6. MLP model implementation