# Modeling

In this section we train four models on the previously processed electrocardiogram (ECG) dataset. We start from the Analytical Data Table (TAD, from its Spanish name "Tabla Analítica de Datos") produced in the previous notebook, which contains the mean and standard deviation of the 12 leads of each electrocardiogram.

The models trained are:

- Neural network, using the sklearn interface
- Random Forest
- AdaBoost
- XGBoost

Each model is tuned with a hyperparameter search, and its performance is then evaluated on the test set using the area under the ROC curve (AUC-ROC). Throughout the notebook we build up a table called metricas that stores the AUC-ROC of each model, so that we end with a comparison table from which to choose the best model.
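
As a reminder of what the metric measures, AUC-ROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pair counting on a four-point toy example; the function name `auc_by_pairs` and the toy data are assumptions for illustration and are not part of the notebook, which relies on sklearn's `roc_auc` scorer instead.

```python
def auc_by_pairs(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): one positive (0.35) is ranked below one negative (0.4)
toy_auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to ranking at random and 1.0 to a perfect ranking, which is the scale used to read the comparison table at the end.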

In [1]:
# Data wrangling
import pandas as pd
import numpy as np
from scipy.stats import uniform, randint
# Metrics and split
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score
# Modeling
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import xgboost as xgb
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
X = pd.read_csv('./data/TAD_ECG.csv', index_col = 'ecg_id')

In [3]:
X

ecg_id,Unnamed: 0,W_C_age,W_sex,W_nurse,W_site,W_device,y,der_1_seg_1_fft_media,der_1_seg_2_fft_media,der_1_seg_3_fft_media,...,der_12_seg_1_fft_std,der_12_seg_2_fft_std,der_12_seg_3_fft_std,der_12_seg_4_fft_std,der_12_seg_5_fft_std,der_12_seg_6_fft_std,der_12_seg_7_fft_std,der_12_seg_8_fft_std,der_12_seg_9_fft_std,der_12_seg_10_fft_std
16286,0,-0.476704,-0.225092,-0.069063,0.050721,-0.640735,1,1.476646,1.137670,1.052837,...,5.270704,3.929790,3.777307,4.135906,4.109803,3.630042,3.830436,5.282345,3.583558,3.368695
2647,1,1.529517,0.237182,-0.069063,0.050721,1.498685,0,0.715700,0.533308,0.863365,...,3.400847,1.907335,3.587203,3.687820,3.696344,3.326853,1.925027,3.602505,3.627987,3.758223
11732,2,0.552014,-0.225092,-0.069063,0.050721,1.498685,0,0.751542,0.744743,0.718097,...,3.834868,3.573462,3.569211,3.662488,3.500226,3.513594,3.600067,3.758661,3.717570,2.495752
19751,3,-0.212872,-0.225092,-0.069063,0.050721,-0.640735,1,1.489811,1.356284,1.459757,...,2.548027,2.441432,2.518861,2.476902,2.520664,3.335534,2.661575,2.532827,2.598910,2.525304
7898,4,-0.476704,0.237182,0.157180,0.069803,-0.125788,0,1.028991,1.192041,0.629712,...,13.332728,3.131690,4.544796,2.670560,3.992905,1.792294,1.876424,2.221627,3.106862,3.359491
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15377,15280,0.552014,-0.225092,-0.069063,0.050721,1.498685,0,0.563758,0.568873,0.582547,...,6.460662,7.820084,11.530706,8.504455,2.593219,5.890009,6.468781,4.283926,5.226723,2.600098
5475,15281,0.552014,-0.225092,-0.069063,0.050721,-0.640735,0,1.006199,0.871506,0.748085,...,5.348450,3.841426,4.624238,3.864639,3.881883,3.764343,3.758935,5.159382,4.464792,4.242138
3790,15282,-0.653033,0.237182,-0.240539,-0.695637,-0.819288,1,1.345652,1.390240,1.804303,...,3.900970,3.766219,3.845107,5.152145,4.383191,3.830861,3.700145,4.072904,5.465196,4.738237
21643,15283,-0.653033,0.237182,-0.069063,0.050721,-0.640735,1,3.550677,3.063009,1.707286,...,11.052293,9.660376,9.828239,9.793048,9.219899,8.669737,8.386153,8.928521,8.356999,8.155467


In [4]:
# Target
y = X['y']
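# Predictors: every column except the target.
# Note: 'Unnamed: 0', a row counter left over from the CSV export, stays among the predictors here.
X = X[[col for col in X.columns if col!='y']]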

In the following DataFrame, called metricas, we store the AUC-ROC of each model to build a comparison table.

In [5]:
metricas = pd.DataFrame({'Modelo':[], 'AUC-ROC':[]})
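
Concatenating onto an initially empty DataFrame works, but recent pandas versions may emit a FutureWarning for concatenation involving empty frames. A minimal alternative sketch, assuming a hypothetical list `rows` and helper `record_metric` that are not part of the original notebook, collects plain dictionaries and builds the table once at the end. The two scores used are the test AUC-ROC values reported in the final table of this notebook.

```python
import pandas as pd

rows = []  # hypothetical accumulator, one dict per evaluated model


def record_metric(model_name, auc):
    """Store one model's test AUC-ROC for the final comparison table."""
    rows.append({"Modelo": model_name, "AUC-ROC": auc})


# Values taken from the final table of this notebook
record_metric("ANN", 0.801923)
record_metric("RandomForest", 0.792303)

# Build the comparison table in a single step, with the same column names as metricas
metricas_demo = pd.DataFrame(rows, columns=["Modelo", "AUC-ROC"])
print(metricas_demo)
```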

## Train Test Split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Neural Network

### Modeling

In [7]:
mlp = MLPClassifier(max_iter= 1000)

### Hyperparameter search

In [8]:
param_grid = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu','logistic'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

In [9]:
search = RandomizedSearchCV(param_distributions=param_grid, cv=4, n_jobs=-1, scoring="roc_auc", estimator=mlp, n_iter=10, verbose=5)
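
RandomizedSearchCV does not try every combination: with `n_iter=10` it samples ten settings from the grid. A quick count, repeating the `param_grid` defined above, gives a sense of how much of the search space that covers. The variable names `n_combinations` and `fraction_explored` are introduced here only for illustration.

```python
import math

# Same grid as in the cell above
param_grid = {
    'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100,)],
    'activation': ['tanh', 'relu', 'logistic'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
}

# Size of the full grid: product of the number of options per hyperparameter
n_combinations = math.prod(len(values) for values in param_grid.values())

n_iter = 10  # number of settings sampled by the randomized search
fraction_explored = n_iter / n_combinations

print(n_combinations)               # 3 * 3 * 2 * 2 * 2 = 72
print(round(fraction_explored, 3))  # about 0.139
```

Since only about 14% of the grid is visited and no `random_state` is set on the search, a rerun may select a different configuration.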

### Training

In [10]:
%%time
search.fit(X_train, y_train)

Fitting 4 folds for each of 10 candidates, totalling 40 fits
CPU times: user 7.36 s, sys: 14.3 s, total: 21.6 s
Wall time: 44.5 s


In [11]:
search.best_estimator_

In [12]:
search.best_score_

0.8048630314232478

### Test performance

In [13]:
auc_ANN = search.score(X_test, y_test) # ROC AUC
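
Because the search was created with `scoring="roc_auc"`, `search.score` evaluates the refitted best estimator with that scorer rather than with accuracy. The sketch below checks this on synthetic data. The dataset, the logistic regression, and every name ending in `_demo` are assumptions chosen to keep the example fast, not the notebook's actual model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Small synthetic binary problem (hypothetical stand-in for the ECG table)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

search_demo = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.01, 0.1, 1.0]},
    n_iter=3,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search_demo.fit(X_tr, y_tr)

# Score reported by the search object on held-out data
auc_from_score = search_demo.score(X_te, y_te)

# Same quantity computed by hand from the positive-class probabilities
proba_pos = search_demo.best_estimator_.predict_proba(X_te)[:, 1]
auc_manual = roc_auc_score(y_te, proba_pos)

print(abs(auc_from_score - auc_manual) < 1e-9)
```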

In [14]:
modelo = pd.DataFrame({'Modelo':['ANN'], 'AUC-ROC':[auc_ANN]})
metricas = pd.concat([metricas, modelo], ignore_index=True)

## Random Forest

### Modeling

In [15]:
bos = RandomForestClassifier()

### Hyperparameter search

In [16]:
param_dict = {"n_estimators": list(range(100, 1500, 100)), # Number of trees to build
              "max_features": ["sqrt", "log2"], # Maximum number of features considered at each split
              "criterion": ["gini", "entropy"], # Criterion used to choose the split
              "class_weight": ["balanced", None], # Whether or not to reweight classes to balance the target
              "min_samples_split": list(range(2, 50, 2)), # Minimum number of samples a node needs in order to be split
              "min_samples_leaf": [x/100 for x in range(5, 55, 5)]} # Minimum fraction of the samples that each leaf must keep (floats are read as fractions)
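
The float values given to `min_samples_leaf` are interpreted by sklearn as fractions of the training samples, with the minimum leaf size taken as the ceiling of fraction times sample count. The sketch below shows how strongly that range limits tree size. The training-set size `n_train` is a hypothetical round number, and the leaf bound is a simple upper limit derived from the leaf size, not something sklearn reports.

```python
import math

n_train = 1000  # hypothetical training-set size, for illustration only
fractions = [x / 100 for x in range(5, 55, 5)]  # same values as in param_dict

leaf_limits = {}
for frac in fractions:
    # Minimum number of samples each leaf must contain
    min_leaf = math.ceil(frac * n_train)
    # A tree cannot have more leaves than n_train // min_leaf
    max_leaves = n_train // min_leaf
    leaf_limits[frac] = (min_leaf, max_leaves)

for frac, (min_leaf, max_leaves) in leaf_limits.items():
    print(f"min_samples_leaf={frac:.2f}: at least {min_leaf} samples per leaf, at most {max_leaves} leaves")
```

At 0.5 every leaf must hold half of the data, so each tree can make at most one split. Candidates drawn from the upper end of this range are therefore very shallow forests, which may be one reason the Random Forest trails the boosted models here.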

In [17]:
# Randomized search
search = RandomizedSearchCV(param_distributions=param_dict, cv=4, n_jobs=-1, scoring="roc_auc", estimator=bos, verbose=5,n_iter=10)

### Training

In [18]:
%%time
search.fit(X_train, y_train)

Fitting 4 folds for each of 10 candidates, totalling 40 fits
CPU times: user 17.4 s, sys: 198 ms, total: 17.6 s
Wall time: 56.8 s


In [19]:
search.best_estimator_

### Test performance

In [20]:
auc_RF = search.score(X_test, y_test) # ROC AUC

In [21]:
modelo = pd.DataFrame({'Modelo':['RandomForest'], 'AUC-ROC':[auc_RF]})
metricas = pd.concat([metricas, modelo], ignore_index=True)

## AdaBoost

### Modeling

In [22]:
ada = AdaBoostClassifier(algorithm="SAMME", n_estimators=100, learning_rate=0.05)

### Hyperparameter search

In [23]:
# Hyperparameter grid (3 x 3 = 9 combinations, all of them covered by n_iter=9)
search_grid={'n_estimators':[50,100,200],
             'learning_rate':[.001,0.01,.1]}

In [24]:
search = RandomizedSearchCV(param_distributions=search_grid, cv=4, n_jobs=-1, scoring="roc_auc", estimator=ada, verbose=5,n_iter=9)

### Training

In [25]:
%%time
search.fit(X_train, y_train)

Fitting 4 folds for each of 9 candidates, totalling 36 fits
CPU times: user 1min 11s, sys: 97.7 ms, total: 1min 11s
Wall time: 5min 37s


In [26]:
search.best_estimator_

In [27]:
search.best_score_

0.8213914726710967

### Test performance

In [28]:
auc_ADA = search.score(X_test, y_test) # ROC AUC

In [29]:
modelo = pd.DataFrame({'Modelo':['ADABoost'], 'AUC-ROC':[auc_ADA]})
metricas = pd.concat([metricas, modelo], ignore_index=True)

## XGBoost

### Modeling

In [30]:
xgb_clf = xgb.XGBClassifier(objective="binary:logistic", eval_metric="logloss") 

### Hyperparameter search

In [31]:
param_grid = {
    'colsample_bytree': np.linspace(0.5, 1, 10), 
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6],
}

In [32]:
random_search = RandomizedSearchCV(xgb_clf, param_distributions = param_grid, n_iter = 50,
                             scoring = 'roc_auc', cv = 3, random_state = 42, n_jobs = -1)

### Training

In [33]:
random_search.fit(X_train, y_train)


In [34]:
random_search.best_estimator_

In [35]:
random_search.best_score_

0.8544235477626417

### Test performance

In [39]:
auc_XGB = random_search.score(X_test, y_test) # ROC AUC

In [40]:
modelo = pd.DataFrame({'Modelo':['XGBoost'], 'AUC-ROC':[auc_XGB]})
metricas = pd.concat([metricas, modelo], ignore_index=True)

## Metrics

In [42]:
metricas

,Modelo,AUC-ROC
0,ANN,0.801923
1,RandomForest,0.792303
2,ADABoost,0.820862
3,XGBoost,0.851227


We conclude that XGBoost was the best model, with a test AUC-ROC of 0.851, ahead of AdaBoost (0.821), the neural network (0.802), and Random Forest (0.792). The careful treatment and engineering of the predictors made this level of performance possible. The target is nearly balanced at 50%, and the AUC-ROC is well above the 0.5 of a random ranking, in fact closer to 1 than to 0.5, so the model performs well on data it has not seen. Its test score is also close to the cross-validated score obtained during the search (0.854), which indicates that it generalizes.
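
One simple way to support the generalization claim is to compare each model's cross-validated AUC-ROC (`best_score_`) with its test AUC-ROC. The sketch below does so using only the numbers printed in this notebook. Random Forest is left out because its `best_score_` was not printed, and the dictionary names are introduced here for illustration.

```python
# Cross-validated AUC-ROC reported by best_score_ in this notebook
cv_scores = {
    "ANN": 0.8048630314232478,
    "ADABoost": 0.8213914726710967,
    "XGBoost": 0.8544235477626417,
}

# Test AUC-ROC from the final comparison table
test_scores = {
    "ANN": 0.801923,
    "ADABoost": 0.820862,
    "XGBoost": 0.851227,
}

# Gap between validation and test performance; a small gap suggests little overfitting to the folds
gaps = {name: cv_scores[name] - test_scores[name] for name in cv_scores}

for name, gap in gaps.items():
    print(f"{name}: CV - test = {gap:.4f}")
```

All three gaps are below 0.005, so the test results agree closely with what cross-validation predicted.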