## Scaling and Encoding

In this notebook we review what was covered in class about scaling and encoding variables.

In [29]:
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Note: you may need to restart the kernel to use updated packages.
Collecting https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Using cached https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'






In [30]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

# Normalization and standardization
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# Encoding
from sklearn import preprocessing


We will use the California Housing dataset.

In [31]:
# You can use any path or URL you like, as long as it points to the dataset.
# Pay attention to the separator: the "sep" parameter indicates how the values are separated in the dataset.

url = "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv"
df_housing = pd.read_csv(url, sep=",")
df_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0
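The `sep` parameter matters when a file uses a delimiter other than a comma. As a minimal sketch, the following reads a small made-up semicolon-separated text from memory (the content and column names are hypothetical, not part of the housing file) and compares the result with the wrong and the right separator:

```python
import io

import pandas as pd

# Hypothetical semicolon-separated content, used only for illustration
raw = "city;rooms;price\nA;3;100.5\nB;5;250.0\n"

# With the wrong separator, every line is read as a single column
df_wrong = pd.read_csv(io.StringIO(raw), sep=",")

# With the right separator, the three columns are parsed
df_right = pd.read_csv(io.StringIO(raw), sep=";")

print(df_wrong.shape)
print(df_right.shape)
```

If a loaded DataFrame has only one column with delimiters visible inside the values, the separator is the first thing to check.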


In [32]:
# Some basic statistics for the numeric variables only
df_housing.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.562108,35.625225,28.589353,2643.664412,539.410824,1429.573941,501.221941,3.883578,207300.912353
std,2.005166,2.13734,12.586937,2179.947071,421.499452,1147.852959,384.520841,1.908157,115983.764387
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.79,33.93,18.0,1462.0,297.0,790.0,282.0,2.566375,119400.0
50%,-118.49,34.25,29.0,2127.0,434.0,1167.0,409.0,3.5446,180400.0
75%,-118.0,37.72,37.0,3151.25,648.25,1721.0,605.25,4.767,265000.0
max,-114.31,41.95,52.0,37937.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [33]:
# Check the data types
df_housing.dtypes

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
dtype: object

**Many other things can be computed, but we use pandas profiling, which does many of them in a single step.**

In [34]:
ProfileReport(df_housing)

Summarize dataset: 100%|██████████| 99/99 [00:12<00:00,  8.16it/s, Completed]                                     
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.79s/it]
Render HTML: 100%|██████████| 1/1 [00:02<00:00,  2.90s/it]
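For reference, a few of the quantities such a report summarizes can also be computed by hand with pandas. This is only a sketch on a small made-up frame (`df_demo` and its columns are hypothetical stand-ins for `df_housing`), not a reproduction of the report:

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for df_housing
df_demo = pd.DataFrame({
    "rooms": [3.0, 5.0, np.nan, 4.0],
    "price": [100.0, 250.0, 180.0, 210.0],
})

missing_per_column = df_demo.isna().sum()   # number of missing values per column
distinct_per_column = df_demo.nunique()     # number of distinct non-missing values
correlations = df_demo.corr()               # pairwise Pearson correlations

print(missing_per_column)
print(distinct_per_column)
print(correlations)
```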




# Normalization and standardization

In [35]:
# First, create the StandardScaler object
scaler = StandardScaler()

# Then "train" (fit) on the data to obtain the scaling parameters
scaler.fit(df_housing)

# Finally, apply the scaling
array_scaler = scaler.transform(df_housing)
array_scaler

array([[ 2.619365  , -0.67152023, -1.07967114, ..., -0.07599796,
        -1.25254316, -1.21055773],
       [ 2.53956878, -0.57326437, -0.76187201, ..., -0.09940441,
        -1.08148298, -1.09674535],
       [ 2.4946834 , -0.90546278, -0.92077158, ..., -0.99925206,
        -1.17010515, -1.04846131],
       ...,
       [-2.36291168,  2.90780067, -0.92077158, ..., -0.11760942,
        -0.44666313, -0.89412482],
       [-2.36291168,  2.88908527, -0.76187201, ..., -0.06039367,
        -0.99778717, -1.04759909],
       [-2.387848  ,  2.29955006,  1.85997083, ..., -0.60134255,
        -0.45536288, -0.97172417]])
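To make the fit and transform steps concrete, here is a minimal sketch on made-up data (the array `X` is hypothetical). It assumes the usual definition of standardization, z = (x - mean) / std computed per column, and checks that the manual result matches `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 rows, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

scaler_demo = StandardScaler()
X_sklearn = scaler_demo.fit_transform(X)  # fit and transform in one call

# Same computation by hand, column by column.
# np.std uses ddof=0 (population standard deviation) by default.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_sklearn, X_manual))
print(scaler_demo.mean_, scaler_demo.scale_)
```

The fitted parameters live in `mean_` and `scale_`, which is what "training" the scaler produces.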

The output looks more "disorganized". This is because the scaler returns a NumPy array rather than a DataFrame. We will convert it to a DataFrame.

In [36]:
df_scaler = pd.DataFrame(array_scaler)
df_scaler

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,2.619365,-0.671520,-1.079671,1.361695,1.764204,-0.361184,-0.075998,-1.252543,-1.210558
1,2.539569,-0.573264,-0.761872,2.296608,3.230441,-0.261865,-0.099404,-1.081483,-1.096745
2,2.494683,-0.905463,-0.920772,-0.882462,-0.866956,-0.955354,-0.999252,-1.170105,-1.048461
3,2.489696,-0.928857,-1.159121,-0.524186,-0.480230,-0.796793,-0.715774,-0.362600,-1.154514
4,2.489696,-0.961609,-0.682422,-0.545747,-0.506328,-0.701830,-0.622148,-1.026454,-1.222629
...,...,...,...,...,...,...,...,...,...
16995,-2.342963,2.318265,1.859971,-0.195728,-0.344995,-0.455275,-0.343872,-0.799999,-0.826872
16996,-2.347950,2.369733,0.588774,-0.135174,-0.027073,-0.205236,-0.094203,-0.715727,-1.106230
16997,-2.362912,2.907801,-0.920772,0.015292,-0.019955,-0.161675,-0.117609,-0.446663,-0.894125
16998,-2.362912,2.889085,-0.761872,0.012999,0.029868,-0.114630,-0.060394,-0.997787,-1.047599


**The problem now is that the columns have lost their names.**

We can restore the names:

In [37]:
df_scaler.columns = df_housing.columns
df_scaler

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,2.619365,-0.671520,-1.079671,1.361695,1.764204,-0.361184,-0.075998,-1.252543,-1.210558
1,2.539569,-0.573264,-0.761872,2.296608,3.230441,-0.261865,-0.099404,-1.081483,-1.096745
2,2.494683,-0.905463,-0.920772,-0.882462,-0.866956,-0.955354,-0.999252,-1.170105,-1.048461
3,2.489696,-0.928857,-1.159121,-0.524186,-0.480230,-0.796793,-0.715774,-0.362600,-1.154514
4,2.489696,-0.961609,-0.682422,-0.545747,-0.506328,-0.701830,-0.622148,-1.026454,-1.222629
...,...,...,...,...,...,...,...,...,...
16995,-2.342963,2.318265,1.859971,-0.195728,-0.344995,-0.455275,-0.343872,-0.799999,-0.826872
16996,-2.347950,2.369733,0.588774,-0.135174,-0.027073,-0.205236,-0.094203,-0.715727,-1.106230
16997,-2.362912,2.907801,-0.920772,0.015292,-0.019955,-0.161675,-0.117609,-0.446663,-0.894125
16998,-2.362912,2.889085,-0.761872,0.012999,0.029868,-0.114630,-0.060394,-0.997787,-1.047599
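The two steps above (building the DataFrame, then restoring the names) can also be done in one call by passing the labels to the constructor. A small sketch on a made-up frame (`df_demo` is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up frame with a non-default index
df_demo = pd.DataFrame(
    {"rooms": [3.0, 5.0, 4.0], "price": [100.0, 250.0, 180.0]},
    index=["a", "b", "c"],
)

scaled = StandardScaler().fit_transform(df_demo)

# Pass the original column names and index so nothing has to be restored later
df_scaled = pd.DataFrame(scaled, columns=df_demo.columns, index=df_demo.index)
print(df_scaled)
```

Keeping the index as well matters when the scaled frame is later joined back to other columns of the original data.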


Finally, we verify the result:

In [38]:
df_scaler.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-6.633961e-15,2.674984e-17,-7.606987000000001e-17,-9.028072e-17,-8.693699000000001e-17,7.356207e-17,2.0062380000000002e-17,-3.644666e-16,1.337492e-17
std,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029,1.000029
min,-2.387848,-1.443531,-2.191968,-1.211838,-1.277408,-1.242856,-1.300935,-1.773323,-1.658056
25%,-1.111108,-0.7931704,-0.8413218,-0.542077,-0.5751323,-0.5572079,-0.570134,-0.6903217,-0.7578949
50%,0.5346887,-0.6434471,0.03262582,-0.2370147,-0.2500927,-0.228759,-0.2398431,-0.1776521,-0.2319437
75%,0.7790646,0.9801141,0.6682241,0.2328499,0.2582266,0.2538954,0.2705474,0.462985,0.4974902
max,2.619365,2.959268,1.859971,16.19047,14.01132,29.84131,14.51402,5.825963,2.523704


### How do we know it worked correctly?
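One way to check: after standardization every column should have a mean of approximately 0 and a standard deviation of 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `describe()` reports the sample standard deviation (`ddof=1`). That is why the `describe()` output above shows 1.000029, which is sqrt(17000/16999), rather than exactly 1. The sketch below reproduces the effect on made-up random data (`df_demo` is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data: 1000 rows, 3 columns
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(50, 10, size=(1000, 3)), columns=["a", "b", "c"])

df_check = pd.DataFrame(StandardScaler().fit_transform(df_demo), columns=df_demo.columns)

print(np.allclose(df_check.mean(), 0))       # means are ~0 up to floating-point error
print(np.allclose(df_check.std(ddof=0), 1))  # population std is 1
print(df_check.std(ddof=1).iloc[0])          # sample std, slightly above 1
print(np.sqrt(1000 / 999))                   # the ratio between the two definitions
```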

## Exercise: normalize with MinMaxScaler

We create the MinMaxScaler object:

In [39]:
minmax_scaler = MinMaxScaler()

We fit it on the data to obtain the scaling parameters:

In [40]:
minmax_scaler.fit(df_housing)
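As a sketch of what this scaler computes, assuming its default output range of [0, 1]: each column is mapped with x' = (x - min) / (max - min). The made-up array below (`X` is hypothetical) compares the manual formula with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: 3 rows, 2 features
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [4.0, 700.0]])

X_sklearn = MinMaxScaler().fit_transform(X)

# Manual min-max normalization, column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(X_sklearn, X_manual))
print(X_sklearn.min(axis=0), X_sklearn.max(axis=0))
```

After the transformation, the minimum of every column is 0 and the maximum is 1, which is the check to look for in `describe()`.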

Finally, we apply the scaling:

In [25]:
array_minmax = minmax_scaler.transform(df_housing)
array_minmax

array([[1.        , 0.17534538, 0.2745098 , ..., 0.07745437, 0.06853009,
        0.10701193],
       [0.98406375, 0.19766206, 0.35294118, ..., 0.07597435, 0.09104012,
        0.13422831],
       [0.9750996 , 0.12221041, 0.31372549, ..., 0.01907581, 0.07937822,
        0.14577466],
       ...,
       [0.00498008, 0.98831031, 0.31372549, ..., 0.07482322, 0.1745769 ,
        0.18268172],
       [0.00498008, 0.98405951, 0.35294118, ..., 0.07844105, 0.10205376,
        0.14598084],
       [0.        , 0.8501594 , 1.        , ..., 0.04423615, 0.17343209,
        0.1641251 ]])

In [26]:
df_minmax = pd.DataFrame(array_minmax)
df_minmax

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,1.000000,0.175345,0.274510,0.147885,0.198945,0.028364,0.077454,0.068530,0.107012
1,0.984064,0.197662,0.352941,0.201608,0.294848,0.031559,0.075974,0.091040,0.134228
2,0.975100,0.122210,0.313725,0.018927,0.026847,0.009249,0.019076,0.079378,0.145775
3,0.974104,0.116897,0.254902,0.039515,0.052142,0.014350,0.037000,0.185639,0.120414
4,0.974104,0.109458,0.372549,0.038276,0.050435,0.017405,0.042921,0.098281,0.104125
...,...,...,...,...,...,...,...,...,...
16995,0.008964,0.854410,1.000000,0.058389,0.060987,0.025337,0.060516,0.128081,0.198764
16996,0.007968,0.866100,0.686275,0.061869,0.081782,0.033381,0.076303,0.139170,0.131960
16997,0.004980,0.988310,0.313725,0.070515,0.082247,0.034782,0.074823,0.174577,0.182682
16998,0.004980,0.984060,0.352941,0.070384,0.085506,0.036296,0.078441,0.102054,0.145981


In [27]:
df_minmax.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,0.476882,0.327867,0.540968,0.069637,0.083552,0.039984,0.08226,0.233354,0.396497
std,0.199718,0.227135,0.246803,0.057465,0.06541,0.032172,0.063233,0.131595,0.239141
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.25498,0.147715,0.333333,0.038487,0.045934,0.022058,0.04621,0.142514,0.215259
50%,0.583665,0.181722,0.54902,0.056017,0.067194,0.032624,0.067094,0.209976,0.341032
75%,0.63247,0.550478,0.705882,0.083017,0.100442,0.048152,0.099367,0.294279,0.515464
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0

# Encoding

We will look at LabelEncoder, which converts the categorical values of a variable into numeric values.

For this we will use a small dataset so the results are easier to see.


In [19]:
datos = {'alumno': ['Zutano', 'Mengano', 'Zutano','Pepe' ,'Fulanito, Cosme', 'Maria'],
         'Primera prueba': [7, 8,7,7,8.5, 10],
         'Segunda prueba': [10,9,10,4,10,6.7],
         'observaciones':['ninguna','libre','ninguna','ninguna','libre','oyente'],
         'DNI':[23000000, 12389100,23000000, 99999, 1001,30406011]}
df = pd.DataFrame(datos)
df

Unnamed: 0,alumno,Primera prueba,Segunda prueba,observaciones,DNI
0,Zutano,7.0,10.0,ninguna,23000000
1,Mengano,8.0,9.0,libre,12389100
2,Zutano,7.0,10.0,ninguna,23000000
3,Pepe,7.0,4.0,ninguna,99999
4,"Fulanito, Cosme",8.5,10.0,libre,1001
5,Maria,10.0,6.7,oyente,30406011


In [20]:
# create the object
le = preprocessing.LabelEncoder()

# take a categorical variable; here we use "observaciones"
data = df['observaciones']
data

0    ninguna
1      libre
2    ninguna
3    ninguna
4      libre
5     oyente
Name: observaciones, dtype: object

In [21]:
# now we do the "training" (fit), which amounts to obtaining the encoder's parameters
le.fit(data)

data_transformada = le.transform(data)
data_transformada

array([1, 0, 1, 1, 0, 2])
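To see where these codes come from, the fitted encoder can be inspected. The sketch below repeats the same values in a standalone list (`values` and `le_demo` are illustrative names): `classes_` holds the sorted labels, and each label's position is its code. The scikit-learn documentation describes `LabelEncoder` as intended for target labels; it is used on a feature here only to illustrate the idea.

```python
from sklearn.preprocessing import LabelEncoder

# Same values as the "observaciones" column
values = ["ninguna", "libre", "ninguna", "ninguna", "libre", "oyente"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(values)  # fit and transform in one call

print(le_demo.classes_)                   # sorted labels; position = code
print(codes)                              # integer code for each value
print(le_demo.inverse_transform(codes))   # back to the original labels
```

Because the labels are sorted alphabetically, `libre` gets 0, `ninguna` gets 1 and `oyente` gets 2, which matches the array above.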

In [22]:
# now we write the encoded values back into the data
df['observaciones'] = data_transformada
df

Unnamed: 0,alumno,Primera prueba,Segunda prueba,observaciones,DNI
0,Zutano,7.0,10.0,1,23000000
1,Mengano,8.0,9.0,0,12389100
2,Zutano,7.0,10.0,1,23000000
3,Pepe,7.0,4.0,1,99999
4,"Fulanito, Cosme",8.5,10.0,0,1001
5,Maria,10.0,6.7,2,30406011


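## Did you ask yourself whether the categorical variable was nominal or ordinal?

It is a nominal categorical variable: its values have no inherent order, so the integer codes 0, 1, 2 should not be read as a ranking.

## Exercise: apply OneHotEncoder to the categorical variable.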
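For contrast, when a variable is ordinal the order can be stated explicitly instead of relying on alphabetical sorting. A minimal sketch with scikit-learn's `OrdinalEncoder`, using a hypothetical variable with levels `low`, `medium`, `high` that does not appear in this notebook's data:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal variable, shaped as a single 2D column
sizes = np.array(["medium", "low", "high", "low"]).reshape(-1, 1)

# The category order is given explicitly: low < medium < high
enc_ord = OrdinalEncoder(categories=[["low", "medium", "high"]])
codes = enc_ord.fit_transform(sizes)

print(codes.ravel())
```

Alphabetical sorting would have produced high < low < medium, which is why the explicit list matters for ordinal data.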

In [28]:
from sklearn.preprocessing import OneHotEncoder

# Create the object
enc = OneHotEncoder(handle_unknown='ignore')

# Take the categorical variable as a 2D column
# (it was label-encoded above, so its values are now 0, 1, 2)
data = df['observaciones'].to_numpy().reshape(-1, 1)

# Fit the encoder
enc.fit(data)

# Apply the transformation
data_transformada = enc.transform(data).toarray()
data_transformada

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])