***
____
![revit](https://i.ibb.co/bQ3dB8C/curso-revit.png)

***
***


# Class 07
## Feature selection

In [1]:
from IPython.display import Image
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

matplotlib.rcParams['figure.figsize'] = [8, 8]

To guarantee reproducibility (that is, so that running this notebook again produces identical results), we fix the seed that scikit-learn uses for its random partitions. When an estimator or splitter is not given an explicit `random_state`, scikit-learn falls back to NumPy's global random state, so the seed has to be set every time the notebook is run.

In [2]:
np.random.seed(42)
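
A global seed is easy to disturb by accident, so an alternative is to pass `random_state` directly to the object that does the shuffling. The following sketch is an illustration and not part of the original notebook: it uses `KFold` on ten made-up indices, and the helper `first_test_fold` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up sample indices, used only for illustration
indices = np.arange(10)

def first_test_fold(seed):
    """Return the test indices of the first fold of a shuffled 5-fold split."""
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    _, test_idx = next(cv.split(indices))
    return test_idx

run_a = first_test_fold(42)
run_b = first_test_fold(42)

# The same seed gives the same partition on every call
print(np.array_equal(run_a, run_b))
```

Passing the seed explicitly keeps each split reproducible even if other code consumes random numbers in between.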

## Loading the data

We load the Titanic passenger data from `titanic.csv`.

In [3]:
datos = pd.read_csv("titanic.csv")
# .drop(["ID"], axis = 1)

In [4]:
datos.shape

(891, 8)

In [6]:
datos.head()

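| | superviviente | clase_billete | genero | edad | n_hermanos_esposos | n_hijos_padres | precio_billete | puerto_salida |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | hombre | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | mujer | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | mujer | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | mujer | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | hombre | 35.0 | 0 | 0 | 8.05 | S |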


## Data processing

In [7]:
# Split the columns into numeric and categorical
datos_numericos = datos.select_dtypes(include=['float64', 'int64'])
datos_categoricos = datos.select_dtypes(exclude=['float64', 'int64'])

# Impute missing numeric values with the column mean.
# Reassigning the result (instead of calling fillna(..., inplace=True) on each
# column of a slice) avoids pandas' SettingWithCopyWarning.
datos_numericos = datos_numericos.fillna(datos_numericos.mean())

# One-hot encode the categorical columns
datos_categoricos_codificados = pd.get_dummies(datos_categoricos)
df_final = pd.concat([datos_numericos, datos_categoricos_codificados], axis=1)


In [8]:
df_final.shape

(891, 11)
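
To make the effect of the two preprocessing steps concrete, here is a minimal sketch on a made-up three-row frame (the data and the names `toy`, `numeric`, `categorical` and `toy_final` are assumptions for illustration, not taken from `titanic.csv`). The missing age is replaced by the mean of the other two, and the single categorical column expands into one dummy column per category.

```python
import numpy as np
import pandas as pd

# Made-up data: one numeric column with a gap, one categorical column
toy = pd.DataFrame({
    "edad": [20.0, np.nan, 40.0],
    "genero": ["hombre", "mujer", "mujer"],
})

# Mean imputation for the numeric part
numeric = toy.select_dtypes(include="number")
numeric = numeric.fillna(numeric.mean())

# One-hot encoding for the categorical part
categorical = pd.get_dummies(toy.select_dtypes(exclude="number"))

toy_final = pd.concat([numeric, categorical], axis=1)
print(toy_final)
```

The same mechanism explains why the real dataset grows from 8 to 11 columns: the two categorical columns are replaced by five dummy columns.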

We now have a dataset ready for training models. It has 10 independent variables (11 columns minus the target), which is few, but enough to illustrate the feature selection techniques that are used to reduce dimensionality.

First, let's see what scores several models achieve when they are trained on the dataset with all the variables.

In [9]:
from sklearn.model_selection import cross_validate
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

objetivo = "superviviente"
X=df_final.drop(objetivo, axis=1)
y=df_final[objetivo]

We will use the `cross_validate` function, a more flexible version of `cross_val_score`: besides the test scores it also reports fit and score times and, optionally, the training scores. We evaluate using the area under the ROC curve.

In [10]:
def evaluar_modelo(estimador, X, y):
    resultados_estimador = cross_validate(estimador, X, y,
                     scoring="roc_auc", n_jobs=-1, cv=10, return_train_score=True)
    return resultados_estimador
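
As a quick illustration of what `cross_validate` returns, the sketch below runs it on synthetic data produced by `make_classification` (the data and the names `X_demo`, `y_demo` and `cv_out` are assumptions made for this example). With `return_train_score=True` the result is a dictionary with one array per key, holding one value per fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification problem, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

cv_out = cross_validate(
    LogisticRegression(), X_demo, y_demo,
    scoring="roc_auc", cv=5, return_train_score=True,
)

print(sorted(cv_out))  # ['fit_time', 'score_time', 'test_score', 'train_score']
print(cv_out["test_score"].shape)  # one AUC value per fold
print(cv_out["test_score"].mean())
```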

In [11]:
resultados = {}

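def ver_resultados():
    # One row per model; each cell initially holds the per-fold array from cross_validate
    resultados_df = pd.DataFrame(resultados).T
    resultados_cols = list(resultados_df.columns)
    for col in resultados_cols:
        # Average over the cross-validation folds
        resultados_df[col] = resultados_df[col].apply(np.mean)
        # Ratio to the smallest value in the column (1.0 marks the minimum)
        resultados_df[col+"_idx"] = resultados_df[col] / resultados_df[col].min()
    return resultados_df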
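
The `_idx` columns can be hard to read at first, so here is a small sketch of the same computation on made-up scores (the model names and numbers in `toy_results` are invented for illustration). Each metric is averaged over folds and then divided by the smallest average in its column, so the minimum shows as 1.0 and every other row is a multiple of it.

```python
import numpy as np
import pandas as pd

# Invented per-fold scores for two hypothetical models
toy_results = {
    "model_a": {"test_score": np.array([0.5, 1.0]), "train_score": np.array([1.0, 1.0])},
    "model_b": {"test_score": np.array([0.25, 0.5]), "train_score": np.array([0.5, 0.5])},
}

toy_df = pd.DataFrame(toy_results).T
for col in list(toy_df.columns):
    toy_df[col] = toy_df[col].apply(np.mean)             # mean over folds
    toy_df[col + "_idx"] = toy_df[col] / toy_df[col].min()  # ratio to the column minimum

print(toy_df)
```

Note that for scores a higher index is better, whereas for `fit_time` and `score_time` a higher index means slower.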

In [12]:
resultados["reg_logis_sin_seleccion"] = evaluar_modelo(LogisticRegression(), X, y)
resultados["knn_sin_seleccion"] = evaluar_modelo(KNeighborsClassifier(n_neighbors=500, weights="distance"), X, y)
resultados["rf_sin_seleccion"] = evaluar_modelo(RandomForestClassifier(n_estimators=100), X, y)

In [13]:
ver_resultados()

| | fit_time | score_time | test_score | train_score | fit_time_idx | score_time_idx | test_score_idx | train_score_idx |
|---|---|---|---|---|---|---|---|---|
| reg_logis_sin_seleccion | 0.040514 | 0.005302 | 0.852308 | 0.857388 | 8.420465 | 1.0 | 1.107752 | 1.0 |
| knn_sin_seleccion | 0.004811 | 0.032233 | 0.769403 | 0.999314 | 1.0 | 6.079395 | 1.0 | 1.165534 |
| rf_sin_seleccion | 0.236197 | 0.018566 | 0.859351 | 0.997575 | 49.091525 | 3.50176 | 1.116906 | 1.163505 |


There are three general types of feature selection strategies: filter methods, wrapper methods, and embedded methods. This notebook covers the first two.

# Filter methods

Filter methods use statistical measures to select the variables that carry the most information. They are applied before training the model (as a preprocessing step) and **are completely independent of the choice of estimator**. They generally work by defining a scoring function $S(x_{k,i}, y_k)$, where $x_{k,i}$ is the value of independent variable $i$ for observation $k$ and $y_k$ is the target of that observation, scoring each independent variable against the target, and keeping the `K` variables with the best scores.
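
The "score every variable, keep the best `K`" idea can be sketched in a few lines. The example below is an illustration on synthetic data from `make_classification` (the data and the names `X_demo`, `y_demo`, `top_k_manual` and `top_k_selector` are assumptions for this sketch). It scores each column with `f_classif`, picks the two highest scores by hand, and compares the result with what `SelectKBest` selects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 6 variables, of which only 2 are informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

# Score every variable against the target (ANOVA F statistic)
scores, p_values = f_classif(X_demo, y_demo)

# Keep the K variables with the highest score, by hand
k = 2
top_k_manual = np.sort(np.argsort(scores)[-k:])

# The same selection through SelectKBest
selector = SelectKBest(f_classif, k=k).fit(X_demo, y_demo)
top_k_selector = np.flatnonzero(selector.get_support())

print(top_k_manual, top_k_selector)
```

No estimator is involved at any point, which is what makes this a filter method.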



In [14]:
from sklearn.feature_selection import SelectKBest, f_classif

In [20]:
# genero_hombre is redundant with genero_mujer (dummies of a two-category variable), so drop it
X = X.drop("genero_hombre", axis=1)

In [21]:
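# Keep the 5 variables with the highest f_classif score.
# Note: k is 5, even though the "10" suffix in X_kbest10 (and related names) suggests otherwise.
selector_kbest5 = SelectKBest(f_classif, k=5)
X_kbest10 = selector_kbest5.fit_transform(X, y)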

In [22]:
X_kbest10.shape

(891, 5)

The `get_support` method returns a boolean mask (True/False); the entries set to True correspond to the columns that were selected.

In [23]:
columnas_seleccion_kbest10 = X.loc[:,selector_kbest5.get_support()].columns
columnas_seleccion_kbest10

Index(['clase_billete', 'precio_billete', 'genero_mujer', 'puerto_salida_C',
       'puerto_salida_S'],
      dtype='object')

The selector's `scores_` attribute holds the value of the scoring function for each variable.

In [25]:
evaluacion_kbest5 = pd.DataFrame({"variable":X.columns, 
                                   "Score":selector_kbest5.scores_, 
                                   "Seleccionado":selector_kbest5.get_support()})

In [27]:
evaluacion_kbest5.sort_values("Score", ascending = False)

| | variable | Score | Seleccionado |
|---|---|---|---|
| 5 | genero_mujer | 372.405724 | True |
| 0 | clase_billete | 115.031272 | True |
| 4 | precio_billete | 63.030764 | True |
| 6 | puerto_salida_C | 25.895987 | True |
| 8 | puerto_salida_S | 22.075469 | True |
| 3 | n_hijos_padres | 5.963464 | False |
| 1 | edad | 4.353516 | False |
| 2 | n_hermanos_esposos | 1.110572 | False |
| 7 | puerto_salida_Q | 0.011846 | False |


This lets us see the score that the `f_classif` scoring function assigns to each independent variable.

In [28]:
resultados["reg_logis_kbest_10"] = evaluar_modelo(LogisticRegression(), X_kbest10, y)
resultados["knn_kbest_10"] = evaluar_modelo(KNeighborsClassifier(n_neighbors=500, weights="distance"), X_kbest10, y)
resultados["rf_kbest_10"] = evaluar_modelo(RandomForestClassifier(n_estimators=100), X_kbest10, y)

In [29]:
ver_resultados()

| | fit_time | score_time | test_score | train_score | fit_time_idx | score_time_idx | test_score_idx | train_score_idx |
|---|---|---|---|---|---|---|---|---|
| reg_logis_sin_seleccion | 0.040514 | 0.005302 | 0.852308 | 0.857388 | 32.368376 | 2.735035 | 1.107752 | 1.015611 |
| knn_sin_seleccion | 0.004811 | 0.032233 | 0.769403 | 0.999314 | 3.844013 | 16.62736 | 1.0 | 1.183728 |
| rf_sin_seleccion | 0.236197 | 0.018566 | 0.859351 | 0.997575 | 188.708465 | 9.577439 | 1.116906 | 1.181668 |
| reg_logis_kbest_10 | 0.029148 | 0.001939 | 0.844105 | 0.844209 | 23.28784 | 1.0 | 1.097091 | 1.0 |
| knn_kbest_10 | 0.001252 | 0.027539 | 0.829311 | 0.97171 | 1.0 | 14.206053 | 1.077863 | 1.15103 |
| rf_kbest_10 | 0.21539 | 0.016171 | 0.8486 | 0.964179 | 172.084556 | 8.341721 | 1.102933 | 1.14211 |


# Wrapper methods

Wrapper methods work in a similar way to filter methods. However, instead of scoring the variables with a statistical function that is independent of the model, they use the model's own evaluation function or performance as the input for deciding which variables to keep (that is, they "wrap" the estimator). This means wrapper methods can be applied regardless of the choice of model, since they treat the model as a black box that produces evaluations, although different models will of course produce different selections of variables.
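
To show what "wrapping" the estimator means in practice, the sketch below writes out a recursive elimination loop by hand: fit the model, drop the variable with the smallest absolute coefficient, and repeat. It is a simplified illustration on synthetic data (the data and the names `X_demo`, `y_demo`, `remaining` and `rfe_selected` are assumptions for this example), and it uses `LogisticRegression` because its fit is deterministic. The result is then compared with scikit-learn's `RFE` using the same estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 variables, of which only 2 are informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

n_keep = 2
remaining = list(range(X_demo.shape[1]))  # column indices still in play

while len(remaining) > n_keep:
    # Fit the wrapped estimator on the variables that are left
    model = LogisticRegression().fit(X_demo[:, remaining], y_demo)
    # Use the absolute coefficient as the importance of each variable
    importance = np.abs(model.coef_[0])
    # Remove the least important variable and go again
    del remaining[int(np.argmin(importance))]

# The same procedure through RFE (one variable removed per step by default)
rfe = RFE(LogisticRegression(), n_features_to_select=n_keep).fit(X_demo, y_demo)
rfe_selected = np.flatnonzero(rfe.get_support()).tolist()

print(remaining, rfe_selected)
```

Because the model is refitted after every removal, a wrapper method costs many more fits than a filter method on the same data.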

In [33]:
from sklearn.feature_selection import RFE
estimador_selector = RandomForestClassifier()
selector_rfe5_rf = RFE(estimador_selector, n_features_to_select=5)
X_rfe5_rf = selector_rfe5_rf.fit_transform(X, y)



In [34]:
X_rfe5_rf.shape

(891, 5)

In [36]:
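# Note: this reuses the name evaluacion_kbest5 for the RFE results.
# Here "Score" holds the RFE ranking: 1 means selected, higher values were eliminated earlier.
evaluacion_kbest5 = pd.DataFrame({"variable":X.columns, 
                                   "Score":selector_rfe5_rf.ranking_, 
                                   "Seleccionado":selector_rfe5_rf.get_support()})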

# evaluacion_rfe10_rf = sorted(
#     filter(lambda c: c[2], 
#         zip(
#             X.columns,
#             selector_rfe10_rf.ranking_,
#             selector_rfe10_rf.get_support()
#         )
#     ), key=lambda c: c[1],reverse=True
# )

In [41]:
X.head()

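| | clase_billete | edad | n_hermanos_esposos | n_hijos_padres | precio_billete | genero_mujer | puerto_salida_C | puerto_salida_Q | puerto_salida_S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 1 | 0 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 0 | 1 |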


In [40]:
evaluacion_kbest5.sort_values("Score")

| | variable | Score | Seleccionado |
|---|---|---|---|
| 0 | clase_billete | 1 | True |
| 1 | edad | 1 | True |
| 2 | n_hermanos_esposos | 1 | True |
| 4 | precio_billete | 1 | True |
| 5 | genero_mujer | 1 | True |
| 3 | n_hijos_padres | 2 | False |
| 8 | puerto_salida_S | 3 | False |
| 6 | puerto_salida_C | 4 | False |
| 7 | puerto_salida_Q | 5 | False |


In [44]:
# resultados["reg_lineal_rfe10_rf"] = evaluar_modelo(LogisticRegression(), X_rfe5_rf, y)
# resultados["rf_rfe10_rf"] = evaluar_modelo(RandomForestClassifier(), X_rfe5_rf, y)
# resultados["svr_rfe10_rf"] = evaluar_modelo(SVC(), X_rfe5_rf, y)

resultados["reg_logis_rfe5_rf"] = evaluar_modelo(LogisticRegression(), X_rfe5_rf, y)
resultados["knn_rfe5_rf"] = evaluar_modelo(KNeighborsClassifier(n_neighbors=500, weights="distance"), X_rfe5_rf, y)
resultados["rf_rfe5_rf"] = evaluar_modelo(RandomForestClassifier(n_estimators=100), X_rfe5_rf, y)

In [45]:
ver_resultados()

| | fit_time | score_time | test_score | train_score | fit_time_idx | score_time_idx | test_score_idx | train_score_idx |
|---|---|---|---|---|---|---|---|---|
| reg_logis_sin_seleccion | 0.040514 | 0.005302 | 0.852308 | 0.857388 | 32.368376 | 2.735035 | 1.131679 | 1.015611 |
| knn_sin_seleccion | 0.004811 | 0.032233 | 0.769403 | 0.999314 | 3.844013 | 16.62736 | 1.021599 | 1.183728 |
| rf_sin_seleccion | 0.236197 | 0.018566 | 0.859351 | 0.997575 | 188.708465 | 9.577439 | 1.14103 | 1.181668 |
| reg_logis_kbest_10 | 0.029148 | 0.001939 | 0.844105 | 0.844209 | 23.28784 | 1.0 | 1.120788 | 1.0 |
| knn_kbest_10 | 0.001252 | 0.027539 | 0.829311 | 0.97171 | 1.0 | 14.206053 | 1.101144 | 1.15103 |
| rf_kbest_10 | 0.21539 | 0.016171 | 0.8486 | 0.964179 | 172.084556 | 8.341721 | 1.126756 | 1.14211 |
| reg_lineal_rfe10_rf | 0.022176 | 0.005909 | 0.85209 | 0.855866 | 17.717646 | 3.048162 | 1.13139 | 1.013809 |
| rf_rfe10_rf | 0.030288 | 0.00464 | 0.856066 | 0.995046 | 24.198198 | 2.393733 | 1.136669 | 1.178672 |
| svr_rfe10_rf | 0.079356 | 0.006062 | 0.753136 | 0.935878 | 63.401101 | 3.126997 | 1.0 | 1.108585 |
| reg_logis_rfe5_rf | 0.007547 | 0.002146 | 0.85209 | 0.855866 | 6.029392 | 1.107036 | 1.13139 | 1.013809 |


If we use a different estimator for the selection, the chosen variables can be completely different. The estimators that can be used must expose a `coef_` or a `feature_importances_` attribute after fitting (that is, they need a way of ordering the variables by importance). For example, an SVM with a non-linear kernel (such as the default `SVC()`) cannot be used, because `coef_` is only available with a linear kernel.
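
One way to check whether an estimator can drive this kind of selection is to fit it and look for either attribute. The sketch below does that for a handful of classifiers on synthetic data (the data, the `candidates` dictionary and the name `usable_with_rfe` are assumptions made for this illustration). It relies on the fact that `SVC` only exposes `coef_` when its kernel is linear.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Small synthetic problem, used only so the estimators can be fitted
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "svc_linear": SVC(kernel="linear"),
    "svc_rbf": SVC(kernel="rbf"),
    "knn": KNeighborsClassifier(),
}

usable_with_rfe = {}
for name, estimator in candidates.items():
    estimator.fit(X_demo, y_demo)
    # An estimator qualifies if it offers some per-variable importance after fitting
    usable_with_rfe[name] = (
        hasattr(estimator, "coef_") or hasattr(estimator, "feature_importances_")
    )

print(usable_with_rfe)
```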

# Ensembles

In [47]:
# Bagging 

from sklearn.ensemble import BaggingRegressor, BaggingClassifier

resultados["Bagging_sin_sel"] = evaluar_modelo(BaggingClassifier(n_estimators=100), X, y)
resultados["Bagging_5nest"] = evaluar_modelo(BaggingClassifier(n_estimators=100), X_kbest10, y)
resultados["Bagging_rfe5"] = evaluar_modelo(BaggingClassifier(n_estimators=100), X_rfe5_rf, y)

In [48]:
ver_resultados()

| | fit_time | score_time | test_score | train_score | fit_time_idx | score_time_idx | test_score_idx | train_score_idx |
|---|---|---|---|---|---|---|---|---|
| reg_logis_sin_seleccion | 0.040514 | 0.005302 | 0.852308 | 0.857388 | 32.368376 | 2.735035 | 1.131679 | 1.015611 |
| knn_sin_seleccion | 0.004811 | 0.032233 | 0.769403 | 0.999314 | 3.844013 | 16.62736 | 1.021599 | 1.183728 |
| rf_sin_seleccion | 0.236197 | 0.018566 | 0.859351 | 0.997575 | 188.708465 | 9.577439 | 1.14103 | 1.181668 |
| reg_logis_kbest_10 | 0.029148 | 0.001939 | 0.844105 | 0.844209 | 23.28784 | 1.0 | 1.120788 | 1.0 |
| knn_kbest_10 | 0.001252 | 0.027539 | 0.829311 | 0.97171 | 1.0 | 14.206053 | 1.101144 | 1.15103 |
| rf_kbest_10 | 0.21539 | 0.016171 | 0.8486 | 0.964179 | 172.084556 | 8.341721 | 1.126756 | 1.14211 |
| reg_lineal_rfe10_rf | 0.022176 | 0.005909 | 0.85209 | 0.855866 | 17.717646 | 3.048162 | 1.13139 | 1.013809 |
| rf_rfe10_rf | 0.030288 | 0.00464 | 0.856066 | 0.995046 | 24.198198 | 2.393733 | 1.136669 | 1.178672 |
| svr_rfe10_rf | 0.079356 | 0.006062 | 0.753136 | 0.935878 | 63.401101 | 3.126997 | 1.0 | 1.108585 |
| reg_logis_rfe5_rf | 0.007547 | 0.002146 | 0.85209 | 0.855866 | 6.029392 | 1.107036 | 1.13139 | 1.013809 |


In [50]:
# XGBOOST

from xgboost import XGBClassifier

resultados["XGBOOST_sin_sel"] = evaluar_modelo(XGBClassifier(n_estimators=100), X, y)
resultados["XGBOOST_5best"] = evaluar_modelo(XGBClassifier(n_estimators=100), X_kbest10, y)
resultados["XGBOOST_5rfe"] = evaluar_modelo(XGBClassifier(n_estimators=100), X_rfe5_rf, y)

In [52]:
ver_resultados()

| | fit_time | score_time | test_score | train_score | fit_time_idx | score_time_idx | test_score_idx | train_score_idx |
|---|---|---|---|---|---|---|---|---|
| reg_logis_sin_seleccion | 0.040514 | 0.005302 | 0.852308 | 0.857388 | 32.368376 | 2.797834 | 1.131679 | 1.015611 |
| knn_sin_seleccion | 0.004811 | 0.032233 | 0.769403 | 0.999314 | 3.844013 | 17.009134 | 1.021599 | 1.183728 |
| rf_sin_seleccion | 0.236197 | 0.018566 | 0.859351 | 0.997575 | 188.708465 | 9.797343 | 1.14103 | 1.181668 |
| reg_logis_kbest_10 | 0.029148 | 0.001939 | 0.844105 | 0.844209 | 23.28784 | 1.022961 | 1.120788 | 1.0 |
| knn_kbest_10 | 0.001252 | 0.027539 | 0.829311 | 0.97171 | 1.0 | 14.532233 | 1.101144 | 1.15103 |
| rf_kbest_10 | 0.21539 | 0.016171 | 0.8486 | 0.964179 | 172.084556 | 8.533252 | 1.126756 | 1.14211 |
| reg_lineal_rfe10_rf | 0.022176 | 0.005909 | 0.85209 | 0.855866 | 17.717646 | 3.11815 | 1.13139 | 1.013809 |
| rf_rfe10_rf | 0.030288 | 0.00464 | 0.856066 | 0.995046 | 24.198198 | 2.448694 | 1.136669 | 1.178672 |
| svr_rfe10_rf | 0.079356 | 0.006062 | 0.753136 | 0.935878 | 63.401101 | 3.198795 | 1.0 | 1.108585 |
| reg_logis_rfe5_rf | 0.007547 | 0.002146 | 0.85209 | 0.855866 | 6.029392 | 1.132454 | 1.13139 | 1.013809 |

