# Preprocessing


In [4]:
# Data handling
# -----------------------------------------------------------------------
import numpy as np
import pandas as pd

# Standardization
# ------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler

# Encoding
# ------------------------------------------------------------------------------
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder

# Warning management
# ------------------------------------------------------------------------------
import warnings
warnings.filterwarnings("ignore")


In this lesson we will make the changes needed to run the logistic regression model.

When we worked on linear regression problems, we saw that some changes to the data were needed before the models could be fitted.

The same holds for logistic regression. These changes include:
* Standardizing the numeric predictor variables
* Encoding the categorical variables
* Balancing the response variable
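
As a rough illustration of the first item, standardization rescales each numeric predictor to mean 0 and standard deviation 1. The sketch below uses a small invented dataframe (the names `toy` and `toy_scaled` are hypothetical and the values are made up for the example), not the obesity data used in this lesson.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [21.0, 23.0, 27.0, 22.0],
    "weight": [64.0, 77.0, 87.0, 89.8],
})

scaler = StandardScaler()
# fit_transform returns a NumPy array, so we wrap it back into a DataFrame
toy_scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# After scaling, each column has mean ~0 and population std ~1
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

`StandardScaler` uses the population standard deviation, which is why the check passes `ddof=0`.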

Encoding

In [2]:
# Load the dataframe
df = pd.read_pickle('../Datos/obesity_eda.pkl')
df.head()

Unnamed: 0,age,vegetales,num_comidas_dia,cantidad_agua_dia,freq_ejercicio,tiempo_digital,height,weight,gender,antecedentes_familiares,comida_calorica,snacks,smoke,mide_calorias,freq_alcohol,medio_transporte,nivel_obesidad
0,21.0,2.0,3.0,2.0,0.0,1.0,1.62,64.0,Female,yes,no,Sometimes,no,no,no,Public_Transportation,Normal_Weight
1,21.0,3.0,3.0,3.0,3.0,0.0,1.52,56.0,Female,yes,no,Sometimes,yes,yes,Sometimes,Public_Transportation,Normal_Weight
2,23.0,2.0,3.0,2.0,2.0,1.0,1.8,77.0,Male,yes,no,Sometimes,no,no,Frequently,Public_Transportation,Normal_Weight
3,27.0,3.0,3.0,2.0,2.0,0.0,1.8,87.0,Male,no,no,Sometimes,no,no,Frequently,Walking,Overweight_Level_I
4,22.0,2.0,1.0,2.0,0.0,0.0,1.78,89.8,Male,no,no,Sometimes,no,no,Sometimes,Public_Transportation,Overweight_Level_II


We check which columns need to be encoded

In [3]:
# Keep only the columns stored as object (string) dtype
categoricas = df.select_dtypes(include=object)
categoricas

Unnamed: 0,gender,antecedentes_familiares,comida_calorica,snacks,smoke,mide_calorias,freq_alcohol,medio_transporte,nivel_obesidad
0,Female,yes,no,Sometimes,no,no,no,Public_Transportation,Normal_Weight
1,Female,yes,no,Sometimes,yes,yes,Sometimes,Public_Transportation,Normal_Weight
2,Male,yes,no,Sometimes,no,no,Frequently,Public_Transportation,Normal_Weight
3,Male,no,no,Sometimes,no,no,Frequently,Walking,Overweight_Level_I
4,Male,no,no,Sometimes,no,no,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...
2106,Female,yes,yes,Sometimes,no,no,Sometimes,Public_Transportation,Obesity_Type_III
2107,Female,yes,yes,Sometimes,no,no,Sometimes,Public_Transportation,Obesity_Type_III
2108,Female,yes,yes,Sometimes,no,no,Sometimes,Public_Transportation,Obesity_Type_III
2109,Female,yes,yes,Sometimes,no,no,Sometimes,Public_Transportation,Obesity_Type_III
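
Before encoding, it can help to list the distinct labels in each categorical column, since those labels determine the codes. A minimal sketch on an invented dataframe (`toy` and `levels` are hypothetical names; the columns are cast to `object` explicitly so that `select_dtypes(include=object)` behaves the same regardless of how pandas infers string columns):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "snacks": ["Sometimes", "no", "Always"],
    "age": [21.0, 23.0, 27.0],
}).astype({"gender": object, "snacks": object})

toy_categoricas = toy.select_dtypes(include=object)

# Sorted distinct labels per categorical column
levels = {
    col: sorted(toy_categoricas[col].unique())
    for col in toy_categoricas.columns
}
for col, values in levels.items():
    print(col, values)
```

Note that Python's default string sort places uppercase letters before lowercase ones, so a label such as `no` sorts after `Sometimes`.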


We will use label encoding.

We create a `for` loop to encode all the categorical variables:


In [13]:
for columna in categoricas.columns:
    # Fit a separate LabelEncoder per column; codes follow the sorted order of the labels
    le = LabelEncoder()
    df[columna] = le.fit_transform(df[columna])
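
The loop above discards each encoder once the column is transformed. One possible variation, sketched here on invented data, keeps every fitted encoder in a dictionary so that the label-to-code mapping can be read from `classes_` and the codes can be reversed with `inverse_transform`. The names `toy`, `encoders` and `mappings` are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "freq_alcohol": ["no", "Sometimes", "Frequently"],
})

encoders = {}  # column name -> fitted LabelEncoder
mappings = {}  # column name -> {original label: integer code}
for columna in toy.columns:
    le = LabelEncoder()
    toy[columna] = le.fit_transform(toy[columna])
    encoders[columna] = le
    # classes_ holds the labels in sorted order; the position is the code
    mappings[columna] = {str(label): code for code, label in enumerate(le.classes_)}

print(mappings)

# Recover the original labels from the integer codes
original = encoders["gender"].inverse_transform(toy["gender"])
print(list(original))
```

Keeping the encoders also makes it possible to apply the same mapping to new data with `transform` instead of refitting.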


In [28]:
# Check that the encoding was applied correctly:

df.head(3)

Unnamed: 0,age,vegetales,num_comidas_dia,cantidad_agua_dia,freq_ejercicio,tiempo_digital,height,weight,gender,antecedentes_familiares,comida_calorica,snacks,smoke,mide_calorias,freq_alcohol,medio_transporte,nivel_obesidad
0,21.0,2.0,3.0,2.0,0.0,1.0,1.62,64.0,0,1,0,2,0,0,3,3,1
1,21.0,3.0,3.0,3.0,3.0,0.0,1.52,56.0,0,1,0,2,1,1,2,3,1
2,23.0,2.0,3.0,2.0,2.0,1.0,1.8,77.0,1,1,0,2,0,0,1,3,1


### Additional encoding information:

| Column | Encoding |
|---|---|
| gender | Female = 0, Male = 1 |
| antecedentes_familiares | no = 0, yes = 1 |
| comida_calorica | no = 0, yes = 1 |
| snacks | Always = 0, Frequently = 1, Sometimes = 2, no = 3 |
| smoke | no = 0, yes = 1 |
| mide_calorias | no = 0, yes = 1 |
| freq_alcohol | Always = 0, Frequently = 1, Sometimes = 2, no = 3 |
| medio_transporte | Automobile = 0, Bike = 1, Motorbike = 2, Public_Transportation = 3, Walking = 4 |
| nivel_obesidad | Insufficient_Weight = 0, Normal_Weight = 1, Obesity_Type_I = 2, Obesity_Type_II = 3, Obesity_Type_III = 4, Overweight_Level_I = 5, Overweight_Level_II = 6 |
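
The codes listed above follow the sorted order of the labels, which is how `LabelEncoder` assigns them, so for frequency-like columns such as `snacks` the numbers do not increase with frequency. If an explicit order were wanted, one alternative (not what this notebook does) is `OrdinalEncoder` with a `categories` list. The sketch below uses invented data, and the order in `orden` is an assumption made for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data, invented for illustration only
toy = pd.DataFrame({"snacks": ["Sometimes", "no", "Always", "Frequently"]})

# Assumed order from least to most frequent (not taken from the notebook)
orden = ["no", "Sometimes", "Frequently", "Always"]

oe = OrdinalEncoder(categories=[orden])
# OrdinalEncoder expects a 2D input and returns a 2D float array
toy["snacks_ord"] = oe.fit_transform(toy[["snacks"]]).ravel().astype(int)

print(toy)
```

With this order, `no` maps to 0 and `Always` to 3, so larger codes mean more frequent snacking.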