# Neural Networks - Practical Assignment No. 2 - Exercise 1 - Logistic Regression
# Notebook #2: Implementing a Logistic Regression
This notebook implements a logistic regression to estimate whether a patient is diabetic, using the Pima Indians Dataset analyzed in the previous notebook.

# TODO List
* Check that NaN values are correctly replaced by the mean.

# 1. Loading the dataset

In [1]:
import numpy as np

In [2]:
import pandas as pd

In [3]:
# Read database from .csv
df = pd.read_csv('../../databases/diabetes.csv', delimiter=',')

# Show first rows of data
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


# 2. Data preprocessing

## 2.1 Filtering invalid values

In [4]:
# Columns where a zero marks an invalid measurement rather than a real value
zero_invalid_labels = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace zeros with NaN; assigning the result back avoids chained in-place replacement
df[zero_invalid_labels] = df[zero_invalid_labels].replace(0, np.nan)
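A quick way to see how many entries the filtering step marks as invalid is to count the NaN values per column afterwards. The sketch below does this on a small made-up frame; `toy`, its values and `invalid_zero_cols` are illustrative assumptions, not data from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "Insulin": [0, 0, 94],
    "Pregnancies": [6, 1, 0],  # a zero is a valid value here, so it is left alone
})

# Columns where a zero is treated as an invalid measurement
invalid_zero_cols = ["Glucose", "Insulin"]

# Mark zeros as missing in those columns only
toy[invalid_zero_cols] = toy[invalid_zero_cols].replace(0, np.nan)

# Count missing entries per column
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running the same `isna().sum()` on `df` right after the filtering cell would give the number of invalid entries per variable.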

# 3. Splitting into training and evaluation sets

In [5]:
from sklearn import model_selection

In [6]:
from sklearn import preprocessing

In [7]:
# Define input and output variables for the model
x_labels = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction','Age']
y_labels = ['Outcome']

df_x = df[x_labels]
df_y = df[y_labels]

In [8]:
# Split the dataset into train_valid and test
x_train_valid, x_test, y_train_valid, y_test = model_selection.train_test_split(df_x, df_y, test_size=0.2, random_state=15, shuffle=True)

# Split the train_valid sub-dataset into train and valid
x_train, x_valid, y_train, y_valid = model_selection.train_test_split(x_train_valid, y_train_valid, test_size=0.3, random_state=23, shuffle=True)
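The two nested splits leave roughly 56% of the rows for training, 24% for validation and 20% for testing (0.8 × 0.7, 0.8 × 0.3 and 0.2). The sketch below reproduces the resulting set sizes on a plain index array. The total of 768 rows is an assumption, inferred from the counts reported by `describe()` in this notebook (429 + 185 + 154).

```python
import numpy as np
from sklearn import model_selection

# Assumed total number of rows (429 + 185 + 154 from the describe() outputs)
n_samples = 768
indices = np.arange(n_samples)

# Same two-stage split as in the notebook, applied to row indices only
train_valid_idx, test_idx = model_selection.train_test_split(
    indices, test_size=0.2, random_state=15, shuffle=True
)
train_idx, valid_idx = model_selection.train_test_split(
    train_valid_idx, test_size=0.3, random_state=23, shuffle=True
)

split_sizes = {
    "train": len(train_idx),
    "valid": len(valid_idx),
    "test": len(test_idx),
}
print(split_sizes)  # {'train': 429, 'valid': 185, 'test': 154}
```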

In [9]:
# Train set before NaN replacement
x_train.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,429.0,426.0,406.0,287.0,216.0,420.0,429.0,429.0
mean,3.90676,120.514085,72.325123,28.930314,152.740741,32.247857,0.469189,33.207459
std,3.350363,29.742282,12.611486,10.04128,107.966521,7.030966,0.330389,11.6021
min,0.0,56.0,24.0,7.0,14.0,18.2,0.085,21.0
25%,1.0,100.0,64.0,22.0,82.0,27.175,0.24,24.0
50%,3.0,115.0,74.0,29.0,128.0,31.75,0.351,29.0
75%,6.0,138.0,80.0,36.0,185.75,36.3,0.646,41.0
max,14.0,198.0,122.0,63.0,680.0,67.1,2.42,72.0


# 4. Replacing invalid values
As noted in the statistical analysis of the data, the supplied dataset has missing values for some individuals. We assume that in production the model will receive every variable correctly reported, with none missing. We therefore replace the invalid values in **train**, **valid** and **test** with the corresponding mean computed on the train set. Here, the mean is considered an adequate estimator.

In [14]:
# Select columns to perform non-valid values replacement
replace_labels = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Compute the per-column means on the train set only (NaN values are skipped)
train_means = x_train[replace_labels].mean()

# Fill NaN in train, valid and test with the train means;
# reassigning the result avoids pandas' SettingWithCopyWarning
x_train = x_train.fillna(train_means)
x_valid = x_valid.fillna(train_means)
x_test = x_test.fillna(train_means)
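The same train-only mean imputation can also be expressed with scikit-learn's `SimpleImputer`, which stores the means learned in `fit` and reuses them in `transform`. This is a minimal sketch of an alternative, not the approach used in this notebook; the frames `toy_train` and `toy_test` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train and test frames with missing entries (illustrative values only)
toy_train = pd.DataFrame({"Glucose": [100.0, np.nan, 140.0],
                          "BMI": [30.0, 34.0, np.nan]})
toy_test = pd.DataFrame({"Glucose": [np.nan, 90.0],
                         "BMI": [np.nan, 25.0]})

# Learn the column means from the train set only
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(toy_train)

# Apply the train means to both sets, keeping column names and row indices
toy_train_filled = pd.DataFrame(imputer.transform(toy_train),
                                columns=toy_train.columns, index=toy_train.index)
toy_test_filled = pd.DataFrame(imputer.transform(toy_test),
                               columns=toy_test.columns, index=toy_test.index)

print(toy_test_filled)
```

Fitting on the train set alone keeps information from the validation and test sets out of the imputed values.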


In [15]:
# Train set after NaN replacement
x_train.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,429.0,429.0,429.0,429.0,429.0,429.0,429.0,429.0
mean,3.90676,120.514085,72.325123,28.930314,152.740741,32.247857,0.469189,33.207459
std,3.350363,29.637862,12.267947,8.208243,76.522025,6.956649,0.330389,11.6021
min,0.0,56.0,24.0,7.0,14.0,18.2,0.085,21.0
25%,1.0,100.0,64.0,26.0,126.0,27.3,0.24,24.0
50%,3.0,116.0,72.325123,28.930314,152.740741,32.0,0.351,29.0
75%,6.0,138.0,80.0,32.0,152.740741,36.1,0.646,41.0
max,14.0,198.0,122.0,63.0,680.0,67.1,2.42,72.0


In [13]:
# Validation set after NaN replacement
x_valid.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,185.0,185.0,185.0,185.0,185.0,185.0,185.0,185.0
mean,3.908108,124.081233,73.325949,29.225539,156.113313,32.909988,0.471708,33.464865
std,3.474627,30.6682,11.926671,8.568083,87.39368,6.636033,0.325015,11.88304
min,0.0,44.0,50.0,11.0,15.0,18.2,0.084,21.0
25%,1.0,101.0,65.0,24.0,115.0,28.8,0.236,24.0
50%,3.0,120.0,72.0,28.930314,152.740741,32.5,0.389,30.0
75%,6.0,145.0,80.0,34.0,152.740741,36.9,0.6,40.0
max,17.0,199.0,114.0,60.0,545.0,57.3,1.893,81.0


In [16]:
# Test set after NaN replacement
x_test.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,154.0,154.0,154.0,154.0,154.0,154.0,154.0,154.0
mean,3.597403,122.038961,71.503903,29.359428,155.872054,32.482778,0.479565,33.064935
std,3.304818,32.320876,11.814455,10.513698,103.288567,6.946169,0.343303,12.118519
min,0.0,61.0,30.0,7.0,23.0,18.4,0.078,21.0
25%,1.0,95.25,64.0,23.25,108.25,26.925,0.254,24.0
50%,3.0,117.0,72.0,28.930314,152.740741,32.273929,0.3765,28.0
75%,5.75,142.75,80.0,33.75,152.740741,36.95,0.60375,41.0
max,13.0,197.0,106.0,99.0,846.0,55.0,2.329,69.0
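The `describe()` outputs already suggest that the replacement worked: every count equals the set size, and the train means are the same before and after. For the pending TODO item, a more explicit check could look like the sketch below. The helper `check_mean_imputation` is hypothetical, and it assumes a copy of each set was kept before the replacement, which this notebook does not do.

```python
import numpy as np
import pandas as pd

def check_mean_imputation(before, after, train_means, columns):
    """Return True if, for every column, `after` has no NaN, each formerly
    missing entry equals the train mean, and observed entries are unchanged."""
    for col in columns:
        was_missing = before[col].isna()
        if after[col].isna().any():
            return False
        if not np.allclose(after.loc[was_missing, col], train_means[col]):
            return False
        if not np.allclose(after.loc[~was_missing, col],
                           before.loc[~was_missing, col]):
            return False
    return True

# Toy data (illustrative values only)
before = pd.DataFrame({"Insulin": [94.0, np.nan, 168.0]})
means = before.mean()            # NaN values are skipped by default
good = before.fillna(means)      # correct mean imputation
bad = before.fillna(0.0)         # wrong fill value, should fail the check

print(check_mean_imputation(before, good, means, ["Insulin"]))
print(check_mean_imputation(before, bad, means, ["Insulin"]))
```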


# 5. Logistic Regression Model