<h1 align=center><font size = 5>Machine Learning models for aiding the decision-making process in emergency departments</font></h1>

<h1>Algorithm comparison table</h1>
<h3>Description</h3>
<p>
This script builds a series of predictions from emergency-department records of the San Juan de Dios Hospital in Curicó for the year 2018. The goal is to predict the severity category of an emergency patient with a decision-tree algorithm, taking as input the data the patient provides at registration and the vital signs recorded at triage.
Several prediction algorithms are run: decision trees and random forests, logistic regression, support vector machines, and neural networks. The predictive performance of each algorithm is then evaluated with indicators such as accuracy, F1-score, ROC curve, Jaccard index, and log loss.
</p>
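
As a rough illustration of the evaluation indicators listed above, the sketch below computes accuracy, weighted F1-score, weighted Jaccard index, and log loss on a handful of made-up labels. The arrays `y_true`, `y_pred`, and `y_prob` are hypothetical and are not hospital data. It uses `jaccard_score`, since the `jaccard_similarity_score` import that appears commented out in this notebook is no longer available in recent scikit-learn releases.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score, log_loss

# Toy triage categories (hypothetical values, not hospital data)
y_true = np.array([1, 2, 2, 3, 3, 3])
y_pred = np.array([1, 2, 3, 3, 3, 2])

# Hypothetical predicted probabilities; columns follow the labels [1, 2, 3]
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.1, 0.5, 0.4],
])

acc = accuracy_score(y_true, y_pred)
# "weighted" averages the per-class scores by class frequency
f1 = f1_score(y_true, y_pred, average="weighted")
jac = jaccard_score(y_true, y_pred, average="weighted")
# Log loss needs probabilities rather than hard predictions
ll = log_loss(y_true, y_prob, labels=[1, 2, 3])

print(acc, f1, jac, ll)
```

With imbalanced triage categories, the weighted averages are dominated by the most frequent class, so the per-class rows of `classification_report` are worth reading alongside these summary numbers.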

<h1 id="Descripción de datos">Data description</h1>
<p>
The data were provided by the San Juan de Dios Hospital, Curicó, Chile, and comprise 4,971 records of patients who attended the emergency department between 1 January 2018 and August 2019. The data were cleaned and transformed in a previously developed script.
<ul>
    <li>Data: <a href="https://drive.google.com/open?id=1Bp7_MnK6cGwgBuwIq1a8S4DS_0wVDiAD" target="_blank">https://drive.google.com/open?id=1Bp7_MnK6cGwgBuwIq1a8S4DS_0wVDiAD</a></li>
    <li>Data type: CSV</li>
</ul>
<p>
The variables in the data are described below:
<ul>    
   
   <li><b>PAC_EDAD</b>: the patient's age, as an integer</li>
   <li><b>MOTIVO_CONSULTA</b>: the reason the patient comes to the emergency department (string)</li>
   <li><b>MEDIO</b>: the means of arrival by which the patient reaches the emergency department</li>
   <li><b>SEXO</b>: the patient's sex</li>
   <li><b>CAT</b>: the severity category assigned to the patient during triage</li>
   <li><b>PRESION_SIST</b>: the patient's systolic blood pressure</li>
   <li><b>PRESION_DIAST</b>: the patient's diastolic blood pressure</li>
   <li><b>SATO2</b>: numeric value for the patient's oxygen saturation (blood oxygen level)</li>
   <li><b>TEMPERATURA</b>: the patient's body temperature at the time of categorization</li>
   <li><b>GLASGOW</b>: the patient's score on the Glasgow scale</li>
   <li><b>DM</b>: whether the patient has diabetes mellitus</li>
   <li><b>EVA</b>: whether an airway evaluation is applied to the patient</li>
   <li><b>HGT</b>: the patient's blood sugar measurement</li>
   <li><b>LCFA</b>: whether the patient has chronic airway obstruction</li>
   <li><b>FR</b>: the patient's respiratory rate</li>
   <li><b>HTA</b>: whether the patient has arterial hypertension</li>
   <li><b>HORA_INSC</b>: the hour at which the patient was categorized</li>
   <li><b>MIN_INSC</b>: the minute at which the patient was categorized</li>
   <li><b>TIEMPO_ESPERA_CAT</b>: the time the patient waits between registration and categorization</li>
      
</ul>
</p>

Load the required packages

In [None]:
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
import sklearn as sk  
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Evaluation metrics
from sklearn.metrics import f1_score
#from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import log_loss
from sklearn.metrics import classification_report
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
# Class-balancing methods
from pylab import rcParams
 
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import RandomOverSampler
from imblearn.combine import SMOTETomek
from imblearn.ensemble import BalancedBaggingClassifier
 
from collections import Counter

Read the data

In [None]:
archivo= 'df_definitiva_A.csv'
archivo2='df_cat_soloTexto.csv'
df_cat= pd.read_csv(archivo,encoding='latin-1',sep=",")
df_PCA= pd.read_csv(archivo2,encoding='latin-1',sep=",")

Check for unnamed columns and drop them

In [None]:
df_cat=df_cat.loc[:, ~df_cat.columns.str.contains('^Unnamed')]
df_PCA=df_PCA.loc[:, ~df_PCA.columns.str.contains('^Unnamed')]

Keep the target variable in only one of the data frames

In [None]:
df_PCA.drop(columns =["CAT"], inplace = True)

Select the variables to use

In [None]:
df_cat=df_cat[['ID_PACIENTE', 'PAC_EDAD', 'CAT', 'SATO2', 'TEMPERATURA', 'GLASGOW',
       'EVA', 'HGT', 'FR','SEXO_M', 'SEXO_F', 'DM_D', 'DM_N',
       'DM_S', 'LCFA_D', 'LCFA_N', 'LCFA_S', 'LCFA_D.1', 'LCFA_N.1',
       'LCFA_S.1']]
df_cat.columns

Join the data frames horizontally

In [None]:
df_cat =  df_cat.join(df_PCA, lsuffix='_caller', rsuffix='_other')

Remove unnecessary variables

In [None]:
# Drop DESC_EVENTO
df_cat.drop(columns =["DESC_EVENTO"], inplace = True)

In [None]:
# Drop rows with missing values in the ID columns
df_cat.dropna(subset=["ID_PACIENTE_caller"], axis=0, inplace = True)
df_cat.dropna(subset=["ID_PACIENTE_other"], axis=0, inplace = True)

In [None]:
# Drop ID_PACIENTE
df_cat.drop(columns =["ID_PACIENTE_caller"], inplace = True)
df_cat.drop(columns =["ID_PACIENTE_other"], inplace = True)

Assign the target variable to "y"

In [None]:
y = df_cat["CAT"]

Remove the target variable from the data frame

In [None]:
# Drop CAT
df_cat.drop(columns =["CAT"], inplace = True)

In [None]:
df_cat.columns

Assign the remaining independent variables to "X"

In [None]:
X=df_cat

In [None]:
X.shape

In [None]:
y.shape

 <h1 id="Preprocesamiento de datos">Data preprocessing</h1>
<p>
The decision-tree package requires numeric data. At this point most variables are still categorical, so they are converted to numeric codes.
</p>
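
The encoding step described above can be sketched on a tiny, made-up column. The values in `toy` are hypothetical arrival modes, not taken from the hospital file, and the `MEDIO_CODE` column name is only for illustration. `LabelEncoder` sorts the distinct values and maps each one to its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical arrival-mode values, for illustration only
toy = pd.DataFrame({"MEDIO": ["AMBULANCIA", "PIE", "AUTO", "PIE"]})

le = LabelEncoder()
toy["MEDIO_CODE"] = le.fit_transform(toy["MEDIO"])

# Classes are stored sorted, so the codes follow alphabetical order
classes = list(le.classes_)
codes = toy["MEDIO_CODE"].tolist()

# The fitted encoder can map codes back to the original strings
decoded = list(le.inverse_transform(codes))

print(classes, codes, decoded)
```

The integer codes carry an arbitrary ordering. Tree-based models usually tolerate that, whereas for linear models such as logistic regression a one-hot encoding may be the safer choice.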

In [None]:
from sklearn import preprocessing

Encoding the reason for consultation

In [None]:
le_motivo_consulta = preprocessing.LabelEncoder()
le_motivo_consulta.fit(X["MOTIVO_CONSULTA"])

X["MOTIVO_CONSULTA"]=le_motivo_consulta.transform(X["MOTIVO_CONSULTA"]) 

In [None]:
X[0:2]

Encoding the means of arrival

In [None]:
le_medio = preprocessing.LabelEncoder()
le_medio.fit(X["MEDIO"])
X["MEDIO"] =le_medio.transform(X["MEDIO"])

Encoding sex

In [None]:
le_sexo = preprocessing.LabelEncoder()
le_sexo.fit(X['SEXO'])
X["SEXO"] = le_sexo.transform(X["SEXO"])

Encoding DM (diabetes mellitus)

In [None]:
le_DM = preprocessing.LabelEncoder()
le_DM.fit(X['DM'])
X["DM"] = le_DM.transform(X["DM"])

Encoding LCFA (chronic airflow limitation)

In [None]:
le_LCFA = preprocessing.LabelEncoder()
le_LCFA.fit(X["LCFA"])
X["LCFA"] = le_LCFA.transform(X["LCFA"]) 

Encoding HTA (arterial hypertension)

In [None]:
le_HTA = preprocessing.LabelEncoder()
le_HTA.fit(X["HTA"])
X["HTA"] = le_HTA.transform(X["HTA"]) 

Assignment of the dependent variable to predict (category) to the vector y

In [None]:
X[0:1]

 <h1 id="Análisis de componentes principales">Principal component analysis</h1>
<p>
Principal component analysis was performed to reduce the dimensionality of the data set used to predict the category. Reducing the dimensionality speeds up training and prediction of the patient category, and helps identify the variables that are most useful for the prediction while discarding those that contribute no information.
</p>
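
As a minimal sketch of the idea, assuming synthetic data rather than the hospital records, the eigenvalues of the covariance matrix of the standardized data should match the variances that scikit-learn's `PCA` reports. The names `X_toy`, `pca_toy`, and `n_95` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
a, b, c, d = rng.normal(size=(4, 200))

# Synthetic data: the second column is strongly correlated with the first
X_toy = np.column_stack([a, a + 0.3 * b, c, d])
X_std = StandardScaler().fit_transform(X_toy)

# Route 1: eigenvalues of the covariance matrix, sorted in descending order
eig_vals = np.linalg.eigvalsh(np.cov(X_std.T))[::-1]
ratio_manual = eig_vals / eig_vals.sum()

# Route 2: scikit-learn PCA fitted on the same standardized data
pca_toy = PCA().fit(X_std)

# Smallest number of components whose cumulative ratio reaches 95%
n_95 = int(np.searchsorted(np.cumsum(ratio_manual), 0.95) + 1)

print(ratio_manual)
print(pca_toy.explained_variance_ratio_)
print(n_95)
```

Passing a fraction such as `PCA(0.95)` asks scikit-learn to make the same cumulative-variance selection automatically.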

In [None]:
# Standardizing the data set
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
pca = PCA()
sc = StandardScaler()
X = preprocessing.StandardScaler().fit(X).transform(X)

In [None]:
# Compute the covariance matrix

print('NumPy covariance matrix: \n%s' %np.cov(X.T))

In [None]:
# Compute and display the eigenvalues and eigenvectors of the matrix

cov_mat = np.cov(X.T)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

In [None]:

# Build a list of (eigenvalue, eigenvector) pairs
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the pairs in descending order of eigenvalue
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Display the eigenvalues in descending order
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

In [None]:
# Compute the explained variance from the eigenvalues
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

In [None]:
# Bar chart of the variance explained by each eigenvalue, together with the cumulative variance
with plt.style.context('seaborn-pastel'):
    plt.figure(figsize=(10, 8))

    # Size the x axis from the data rather than a hard-coded 300 components
    plt.bar(range(len(var_exp)), var_exp, alpha=0.5, align='center',
            label='Individual explained variance', color='g')
    plt.step(range(len(cum_var_exp)), cum_var_exp, where='mid', linestyle='--', label='Cumulative explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()

In [None]:

# Build the projection matrix from the two leading eigenvalue-eigenvector pairs
matrix_w = np.hstack((eig_pairs[0][1].reshape(-1,1),
                      eig_pairs[1][1].reshape(-1,1)))

print('Matrix W:\n', matrix_w)

Y = X.dot(matrix_w)

In [None]:
with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(10, 8))
    # Plot one group per category found in y (replaces leftover Iris labels)
    y_arr = np.asarray(y)
    for lab in np.unique(y_arr):
        plt.scatter(Y[y_arr == lab, 0],
                    Y[y_arr == lab, 1],
                    label=str(lab))
    plt.xlabel('Principal component 1')
    plt.ylabel('Principal component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()

In [None]:
from sklearn.decomposition import PCA
#Make an instance of the Model
pca = PCA(0.95)
pca.fit(X)
pca.explained_variance_ratio_
pca.n_components_ 

In [None]:
pca.explained_variance_ratio_

In [None]:
X=pca.fit_transform(X)

In [None]:
X[0:5]
A=pd.DataFrame(X)

In [None]:
A

In [None]:
#DF_PCA_Desc_evento=pd.merge(df_ID, A, on='ID_PACIENTE')
DF_PCA_Desc_evento=pd.concat([df_ID,A], axis=1)

In [None]:
DF_PCA_Desc_evento.head(10)

In [None]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

In [None]:
DF_PCA_Desc_evento.to_csv('DF_PCA_Desc_evento.csv')

 <h1 id="Configurando algoritmos">Configuring the algorithms</h1>
<p>
This section defines the parameters needed to apply the algorithms correctly and splits the data set into test (30%) and training (70%) data. The chosen parameters can be modified to obtain different results.
</p>
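
The 70/30 split can be sketched on synthetic labels as follows. The `stratify` argument is an optional addition that the cells in this notebook do not use. It is shown here only because it keeps the class proportions the same in both subsets, which may matter when some triage categories are rare. `X_toy` and `y_toy` are made-up stand-ins.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels standing in for the triage categories
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [2] * 30 + [3] * 60)

# 30% of the rows go to the test set; stratify preserves class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=3, stratify=y_toy
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print(train_counts, test_counts)
```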

Load packages

In [None]:
from sklearn.model_selection import train_test_split

Splitting the data set

In [None]:
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

Undersampling

In [None]:
us = NearMiss(sampling_strategy='auto',version=1,n_neighbors=2, n_neighbors_ver3=3,n_jobs=None,)
X_trainset_res_1, y_trainset_res_1 = us.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution after resampling {}".format(Counter(y_trainset_res_1)))

In [None]:
us = NearMiss(sampling_strategy='majority',version=1,n_neighbors=2, n_neighbors_ver3=3,n_jobs=None,)
X_trainset_res_2, y_trainset_res_2 = us.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution after resampling {}".format(Counter(y_trainset_res_2)))

In [None]:
us = NearMiss(sampling_strategy='not minority',version=1,n_neighbors=2, n_neighbors_ver3=3,n_jobs=None,)
X_trainset_res_3, y_trainset_res_3 = us.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution after resampling {}".format(Counter(y_trainset_res_3)))

In [None]:
us = NearMiss(sampling_strategy='not majority',version=1,n_neighbors=2, n_neighbors_ver3=3,n_jobs=None,)
X_trainset_res_4, y_trainset_res_4 = us.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution after resampling {}".format(Counter(y_trainset_res_4)))

In [None]:
us = NearMiss(sampling_strategy='all',version=1,n_neighbors=2, n_neighbors_ver3=3,n_jobs=None,)
X_trainset_res_5, y_trainset_res_5 = us.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution after resampling {}".format(Counter(y_trainset_res_5)))

Oversampling

In [None]:
os =  RandomOverSampler(sampling_strategy='auto', random_state=None)
X_trainset_res_A, y_trainset_res_A = os.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution labels after resampling {}".format(Counter(y_trainset_res_A)))

In [None]:
os =  RandomOverSampler(sampling_strategy='not minority', random_state=None)
X_trainset_res_C, y_trainset_res_C = os.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution labels after resampling {}".format(Counter(y_trainset_res_C)))

In [None]:
os =  RandomOverSampler(sampling_strategy='not majority', random_state=None)
X_trainset_res_D, y_trainset_res_D = os.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution labels after resampling {}".format(Counter(y_trainset_res_D)))

In [None]:
os =  RandomOverSampler(sampling_strategy='all', random_state=None)
X_trainset_res_E, y_trainset_res_E = os.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution labels after resampling {}".format(Counter(y_trainset_res_E)))
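
To make the effect of random oversampling concrete, here is a NumPy-only sketch of the idea. It is not imbalanced-learn's implementation. The helper `random_oversample` and the toy arrays are hypothetical. Rows of the smaller classes are duplicated at random until every class has as many rows as the largest one.

```python
from collections import Counter

import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate rows at random until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    counts = Counter(y.tolist())
    target = max(counts.values())
    idx_parts = []
    for label, n in counts.items():
        idx = np.flatnonzero(y == label)
        # Draw the missing rows with replacement from the same class
        extra = rng.choice(idx, size=target - n, replace=True)
        idx_parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(idx_parts)
    return X[keep], y[keep]


# Toy data: class 1 has 6 rows, class 2 has 3, class 3 has 1
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 3])

X_res, y_res = random_oversample(X_toy, y_toy)
print(Counter(y_res.tolist()))
```

As in the cells above, only the training split is resampled, so the test set keeps the original class distribution.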

SMOTE-Tomek

In [None]:
os_us = SMOTETomek(sampling_strategy='auto',random_state=None,smote=None,tomek=None,n_jobs=None,)
X_trainset_res_ST1, y_trainset_res_ST1 = os_us.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution labels after resampling {}".format(Counter(y_trainset_res_ST1)))

In [None]:
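# NOTE: SMOTE is an over-sampler, so imbalanced-learn may reject sampling_strategy='majority' here
os_us = SMOTETomek(sampling_strategy='majority',random_state=None,smote=None,tomek=None,n_jobs=None,)
X_trainset_res_ST2, y_trainset_res_ST2 = os_us.fit_resample(X_trainset, y_trainset)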
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution labels after resampling {}".format(Counter(y_trainset_res_ST2)))

In [None]:
os_us = SMOTETomek(sampling_strategy='not minority',random_state=None,smote=None,tomek=None,n_jobs=None,)
X_trainset_res_ST3, y_trainset_res_ST3 = os_us.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution labels after resampling {}".format(Counter(y_trainset_res_ST3)))

In [None]:
os_us = SMOTETomek(sampling_strategy='not majority',random_state=None,smote=None,tomek=None,n_jobs=None,)
X_trainset_res_ST4, y_trainset_res_ST4 = os_us.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution labels after resampling {}".format(Counter(y_trainset_res_ST4)))

In [None]:
os_us = SMOTETomek(sampling_strategy='all',random_state=None,smote=None,tomek=None,n_jobs=None,)
X_trainset_res_ST5, y_trainset_res_ST5 = os_us.fit_resample(X_trainset, y_trainset)
 
print ("Distribution before resampling {}".format(Counter(y_trainset)))
print ("Distribution labels after resampling {}".format(Counter(y_trainset_res_ST5)))

Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

In [None]:
DT= DecisionTreeClassifier(criterion="entropy", max_depth = 4)
#pipe = Pipeline(steps=[('pca', pca), ('DT', DT)])
DT.fit(X_trainset,y_trainset)

In [None]:
DT= DecisionTreeClassifier(criterion="entropy", max_depth = 4)
DT.fit(X_trainset,y_trainset)
DT_Ss1=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_1,y_trainset_res_1)
DT_Ss2=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_2,y_trainset_res_2)
DT_Ss3=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_3,y_trainset_res_3)
DT_Ss4=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_4,y_trainset_res_4)
DT_Ss5=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_5,y_trainset_res_5)
DT_OsA=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_A,y_trainset_res_A)
DT_OsC=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_C,y_trainset_res_C)
DT_OsD=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_D,y_trainset_res_D)
DT_OsE=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_E,y_trainset_res_E)
# Only the non-default settings are kept; the other arguments were scikit-learn defaults
DT_ST1=DecisionTreeClassifier(criterion='entropy', max_depth=13,
                       random_state=42).fit(X_trainset_res_ST1,y_trainset_res_ST1)
DT_ST2=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_ST2,y_trainset_res_ST2)
DT_ST3=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_ST3,y_trainset_res_ST3)
DT_ST4=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_ST4,y_trainset_res_ST4)
DT_ST5=DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_trainset_res_ST5,y_trainset_res_ST5)

Principal component analysis to identify the variables that yield a better classification

In [None]:
# Requires `search`, the PCA + logistic-regression grid search fitted in the logistic regression section
pca.fit(X)

fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6))
ax0.plot(np.arange(1, pca.n_components_ + 1),
         pca.explained_variance_ratio_, '+', linewidth=2)
ax0.set_ylabel('PCA explained variance ratio')

ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))

# For each number of components, find the best classifier results
results = pd.DataFrame(search.cv_results_)
components_col = 'param_pca__n_components'
best_clfs = results.groupby(components_col).apply(
    lambda g: g.nlargest(1, 'mean_test_score'))

best_clfs.plot(x=components_col, y='mean_test_score', yerr='std_test_score',
               legend=False, ax=ax1)
ax1.set_ylabel('Classification accuracy (val)')
ax1.set_xlabel('n_components')

plt.xlim(-1, 70)

plt.tight_layout()
plt.show()

In [None]:
yhat_1 = DT.predict(X_testset)
print (classification_report(y_testset, yhat_1))

In [None]:
yhat_1_1 =DT_Ss1.predict(X_testset)
print (classification_report(y_testset, yhat_1_1))

In [None]:
yhat_1_2 = DT_Ss2.predict(X_testset)
print (classification_report(y_testset, yhat_1_2))


In [None]:
yhat_1_3 = DT_Ss3.predict(X_testset)
print (classification_report(y_testset, yhat_1_3))

In [None]:
yhat_1_4 = DT_Ss4.predict(X_testset)
print (classification_report(y_testset, yhat_1_4))

In [None]:
yhat_1_5 = DT_Ss5.predict(X_testset)
print (classification_report(y_testset, yhat_1_5))

In [None]:
yhat_1_A = DT_OsA.predict(X_testset)
print (classification_report(y_testset, yhat_1_A))

In [None]:
yhat_1_C = DT_OsC.predict(X_testset)
print (classification_report(y_testset, yhat_1_C))

In [None]:
yhat_1_D = DT_OsD.predict(X_testset)
print (classification_report(y_testset, yhat_1_D))

In [None]:
yhat_1_E = DT_OsE.predict(X_testset)
print (classification_report(y_testset, yhat_1_E))

In [None]:
yhat_1_ST1 = DT_ST1.predict(X_testset)
print (classification_report(y_testset, yhat_1_ST1))

In [None]:
yhat_1_ST2 = DT_ST2.predict(X_testset)
print (classification_report(y_testset, yhat_1_ST2))

In [None]:
yhat_1_ST3 = DT_ST3.predict(X_testset)
print (classification_report(y_testset, yhat_1_ST3))

In [None]:
yhat_1_ST4 = DT_ST4.predict(X_testset)
print (classification_report(y_testset, yhat_1_ST4))

In [None]:
yhat_1_ST5 = DT_ST5.predict(X_testset)
print (classification_report(y_testset, yhat_1_ST5))

In [None]:
yhat_prob_1=DT.predict_proba(X_testset)
DT_Acc=round(metrics.accuracy_score(y_testset,  yhat_1),4)
DT_F1=f1_score(y_testset, yhat_1, average='weighted') 

Best parameters obtained through grid search

In [None]:
# Only the non-default settings are kept; the other arguments were scikit-learn defaults
DT_GS =DecisionTreeClassifier(criterion='entropy', max_depth=13, random_state=42)
DT_GS.fit(X_trainset,y_trainset)


Running the grid search

In [None]:
DT_parameters = [{'criterion': ['entropy', 'gini'], 'max_depth': np.arange(3, 21)},{'min_samples_leaf': [2,5, 10, 20, 50, 100]}]
DT_GS = GridSearchCV(DecisionTreeClassifier(random_state=42), DT_parameters, verbose=1, cv=5, scoring='balanced_accuracy')
DT_GS.fit(X_trainset,y_trainset)
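
The grid search above scores candidates with `balanced_accuracy`. A small made-up example shows why that choice can matter with imbalanced categories: a classifier that always predicts the majority class looks good on plain accuracy but not on balanced accuracy, which averages the recall of each class. The labels below are hypothetical.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Nine samples of the frequent class and one of the rare class
y_true = [0] * 9 + [1]
# A degenerate model that always predicts the frequent class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)                # 9 of 10 correct
bal_acc = balanced_accuracy_score(y_true, y_pred)   # mean of recalls 1.0 and 0.0

print(acc, bal_acc)
```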

In [None]:
DT_GS.best_estimator_

In [None]:
yhat_1_GS= DT_GS.predict(X_testset)
DT_Acc_GS=round(metrics.accuracy_score(y_testset, yhat_1_GS),4)
DT_Acc_ST1=round(metrics.accuracy_score(y_testset, yhat_1_ST1),4)
DT_F1_GS=round(f1_score(y_testset, yhat_1_GS, average='weighted'),4) 
DT_F1_ST1=round(f1_score(y_testset, yhat_1_ST1, average='weighted'),4) 

In [None]:
resultados_DT = {'Performance metric':['Accuracy','F1-Score'],
             'Decision tree':[DT_Acc,DT_F1],
             'Grid Search':[DT_Acc_GS,DT_F1_GS],
             'SMOTE-Tomek':[DT_Acc_ST1,DT_F1_ST1]}
Tabla_resultados_DT=pd.DataFrame(resultados_DT)
print(Tabla_resultados_DT)

Installing the eli5 package to apply the "Permutation Importance" technique

In [None]:
pip install eli5

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

In [None]:
perm = PermutationImportance(DT, random_state=1).fit(X, y)
eli5.show_weights(perm, feature_names =X.columns.tolist())
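
For readers without eli5, the same idea can be sketched with scikit-learn's own `permutation_importance` from `sklearn.inspection`. This is a swapped-in alternative, not the call used in this notebook, and the data are synthetic: the label depends only on the first feature, so shuffling the second one should not change the score.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Two synthetic features; only the first one determines the label
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Shuffle each column several times and record the drop in accuracy
result = permutation_importance(tree, X_toy, y_toy, n_repeats=5, random_state=1)

for name, mean in zip(["informative", "noise"], result.importances_mean):
    print(name, round(float(mean), 3))
```

Computing the importances on held-out data rather than on the data used for fitting generally gives a more honest picture of which variables the model relies on.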

In [None]:
perm = PermutationImportance(DT_GS, random_state=1).fit(X, y)
eli5.show_weights(perm, feature_names =X.columns.tolist())

In [None]:
feat_importances = DT.feature_importances_
indices = np.argsort(feat_importances)
features = X.columns
plt.figure(figsize=(10,5))
plt.title('Feature importance')
plt.barh(range(len(indices)), feat_importances[indices], color='g', align='center',linestyle="solid",alpha=0.8)
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Importance')
plt.show()

Importing packages to compute SHAP values

In [None]:
import shap
ex = shap.TreeExplainer(DT)
shap_values = ex.shap_values(X)
shap.summary_plot(shap_values, X,max_display=9)

In [None]:
conda install -c conda-forge pdpbox

In [None]:
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=DT_GS, dataset=X, model_features=X.columns, feature='EVA')

# plot it
pdp.pdp_plot(pdp_goals, 'EVA')
plt.show()

In [None]:
# Kernel Explainer
explainer = shap.KernelExplainer(DT.predict_proba,X[:100])
shap_values = explainer.shap_values(X[:100])
shap.summary_plot(shap_values, X[:100])

In [None]:
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[0], X)

In [None]:
shap.initjs()
columIndex= 2
shap.force_plot(explainer.expected_value[1], shap_values[1][columIndex,:], X.iloc[columIndex,:], link="logit")

Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
RF= RandomForestClassifier(max_depth=2, random_state=0)
RF.fit(X_trainset,y_trainset)

In [None]:
RF_Ss1= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_1,y_trainset_res_1)
RF_Ss2= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_2,y_trainset_res_2)
RF_Ss3= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_3,y_trainset_res_3)
RF_Ss4= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_4,y_trainset_res_4)
RF_Ss5= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_5,y_trainset_res_5)
RF_OsA= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_A,y_trainset_res_A)
RF_OsC= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_C,y_trainset_res_C)
RF_OsD= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_D,y_trainset_res_D)
RF_OsE= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_E,y_trainset_res_E)
# Only the non-default settings are kept; the other arguments were scikit-learn defaults
RF_ST1= RandomForestClassifier(criterion='gini', max_depth=13, n_estimators=300,
                       random_state=0).fit(X_trainset_res_ST1,y_trainset_res_ST1)
RF_ST2= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_ST2,y_trainset_res_ST2)
RF_ST3= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_ST3,y_trainset_res_ST3)
RF_ST4= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_ST4,y_trainset_res_ST4)
RF_ST5= RandomForestClassifier(max_depth=2, random_state=0).fit(X_trainset_res_ST5,y_trainset_res_ST5)

In [None]:
yhat_2 = RF.predict(X_testset)
print (classification_report(y_testset, yhat_2))

In [None]:
yhat_2_1 = RF_Ss1.predict(X_testset)
print (classification_report(y_testset, yhat_2_1))

In [None]:
yhat_2_2 = RF_Ss2.predict(X_testset)
print (classification_report(y_testset, yhat_2_2))

In [None]:
yhat_2_3 = RF_Ss3.predict(X_testset)
print (classification_report(y_testset, yhat_2_3))

In [None]:
yhat_2_4 = RF_Ss4.predict(X_testset)
print (classification_report(y_testset, yhat_2_4))

In [None]:
yhat_2_5 = RF_Ss5.predict(X_testset)
print (classification_report(y_testset, yhat_2_5))

In [None]:
yhat_2_A = RF_OsA.predict(X_testset)
print (classification_report(y_testset, yhat_2_A))

In [None]:
yhat_2_C = RF_OsC.predict(X_testset)
print (classification_report(y_testset, yhat_2_C))

In [None]:
yhat_2_D = RF_OsD.predict(X_testset)
print (classification_report(y_testset, yhat_2_D))

In [None]:
yhat_2_E = RF_OsE.predict(X_testset)
print (classification_report(y_testset, yhat_2_E))

In [None]:
yhat_2_ST1 = RF_ST1.predict(X_testset)
print (classification_report(y_testset, yhat_2_ST1 ))

In [None]:
yhat_2_ST2 = RF_ST2.predict(X_testset)
print (classification_report(y_testset, yhat_2_ST2 ))

In [None]:
yhat_2_ST3 = RF_ST3.predict(X_testset)
print (classification_report(y_testset, yhat_2_ST3 ))

In [None]:
yhat_2_ST4 = RF_ST4.predict(X_testset)
print (classification_report(y_testset, yhat_2_ST4 ))

In [None]:
yhat_2_ST5 = RF_ST5.predict(X_testset)
print (classification_report(y_testset, yhat_2_ST5 ))

In [None]:
yhat_prob_2=RF.predict_proba(X_testset)
RF_Acc=round(metrics.accuracy_score(y_testset, yhat_2),4)
#RF_Jcc=round(jaccard_similarity_score(y_testset, yhat_2),4)
RF_F1=f1_score(y_testset, yhat_2, average='weighted') 
RF_lgl=round(log_loss(y_testset, yhat_prob_2),4)

In [None]:
# Only the non-default settings are kept; the other arguments were scikit-learn defaults
RF_GS=RandomForestClassifier(criterion='gini', max_depth=13, n_estimators=300,
                       random_state=0)
RF_GS.fit(X_trainset,y_trainset)

In [None]:
RF_parameters = [{'n_estimators': [100,200,300,400,500],'criterion': ['entropy', 'gini'], 'max_depth': np.arange(3, 21)},{'min_samples_leaf': [2,5, 10, 20, 50, 100]}]
RF_GS = GridSearchCV(estimator=RF,param_grid=RF_parameters, cv= 5,scoring="accuracy")
RF_GS.fit(X_trainset,y_trainset)

In [None]:
RF_GS.best_estimator_

In [None]:
yhat_2_GS= RF_GS.predict(X_testset)
yhat_prob_2_GS=RF_GS.predict_proba(X_testset)
RF_Acc_GS=round(metrics.accuracy_score(y_testset, yhat_2_GS),4)
RF_Acc_ST1=round(metrics.accuracy_score(y_testset, yhat_2_ST1),4)
#RF_Jcc_GS=round(jaccard_similarity_score(y_testset, yhat_2_GS),4)
#RF_lgl_GS=round(log_loss(y_testset, yhat_prob_2_GS),4)
RF_F1_GS=round(f1_score(y_testset, yhat_2_GS, average='weighted'),4) 
RF_F1_ST1=round(f1_score(y_testset, yhat_2_ST1, average='weighted'),4) 

In [None]:
resultados_RF = {'Performance metric':['Accuracy','F1-Score'],
             'Random forest':[RF_Acc, RF_F1],
             'Grid Search':[RF_Acc_GS,RF_F1_GS],
                'SMOTE-Tomek':[RF_Acc_ST1,RF_F1_ST1]}
Tabla_resultados_RF=pd.DataFrame(resultados_RF)
print(Tabla_resultados_RF)

In [None]:
perm = PermutationImportance(RF_GS, random_state=1).fit(X, y)
eli5.show_weights(perm, feature_names =X.columns.tolist())

In [None]:
#import shap
ex = shap.TreeExplainer(RF_GS)
shap_values = ex.shap_values(X_testset)
shap.summary_plot(shap_values, X_testset)

In [None]:
# Kernel Explainer
explainer = shap.KernelExplainer(RF.predict_proba,X[:100])
shap_values = explainer.shap_values(X[:100])
shap.summary_plot(shap_values, X[:100])

Logistic regression (multi-class)

In [None]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000,penalty='l2',C=1).fit(X_trainset,y_trainset)

In [None]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000,penalty='l2',C=1)
pca = PCA()
pipe = Pipeline(steps=[('pca', pca), ('LR', LR)])
param_grid = {
    'pca__n_components': [2,5, 15, 30, 45, 64,100,300],
    
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X, y)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

In [None]:
pca.fit(X)

fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6))
ax0.plot(np.arange(1, pca.n_components_ + 1),
         pca.explained_variance_ratio_, '+', linewidth=2)
ax0.set_ylabel('PCA explained variance ratio')

ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))

# For each number of components, find the best classifier results
results = pd.DataFrame(search.cv_results_)
components_col = 'param_pca__n_components'
best_clfs = results.groupby(components_col).apply(
    lambda g: g.nlargest(1, 'mean_test_score'))

best_clfs.plot(x=components_col, y='mean_test_score', yerr='std_test_score',
               legend=False, ax=ax1)
ax1.set_ylabel('Classification accuracy (val)')
ax1.set_xlabel('n_components')

plt.xlim(-1, 70)

plt.tight_layout()
plt.show()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000,penalty='l2',C=1).fit(X_trainset,y_trainset)
LR_Ss1 = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_1,y_trainset_res_1)
LR_Ss2 = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_2,y_trainset_res_2)
LR_Ss3 = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_3,y_trainset_res_3)
LR_Ss4 = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_4,y_trainset_res_4)
LR_Ss5 = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_5,y_trainset_res_5)
LR_OsA = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_A,y_trainset_res_A)
LR_OsC = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_C,y_trainset_res_C)
LR_OsD = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_D,y_trainset_res_D)
LR_OsE = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_E,y_trainset_res_E)
LR_ST1 = LogisticRegression(C=0.1, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=1000, multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False).fit(X_trainset_res_ST1,y_trainset_res_ST1)
LR_ST2 = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_ST2,y_trainset_res_ST2)
LR_ST3 = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_ST3,y_trainset_res_ST3)
LR_ST4 = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_ST4,y_trainset_res_ST4)
LR_ST5 = LogisticRegression(multi_class='ovr',class_weight='balanced', max_iter=1000).fit(X_trainset_res_ST5,y_trainset_res_ST5)
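
Fitting one model per resampled training set can also be written as a loop over a dictionary, which keeps each model paired with its data by name. The sketch below uses synthetic data and a crude random undersampler as a stand-in for the resampling done earlier in the notebook. The helper `undersample`, the dictionary keys and all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (assumption): class shares of roughly 60/30/10 percent.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     n_classes=3, n_clusters_per_class=1,
                                     weights=[0.6, 0.3, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0)

def undersample(X, y, seed):
    """Hypothetical helper: keep as many rows per class as the rarest class has."""
    rng = np.random.default_rng(seed)
    n_min = np.bincount(y).min()
    keep = np.concatenate([rng.choice(np.flatnonzero(y == c), n_min, replace=False)
                           for c in np.unique(y)])
    return X[keep], y[keep]

train_sets = {'original': (X_tr, y_tr),
              'undersampled_1': undersample(X_tr, y_tr, seed=1),
              'undersampled_2': undersample(X_tr, y_tr, seed=2)}

# One classifier per training set, all scored on the same untouched test set.
scores = {}
for name, (X_fit, y_fit) in train_sets.items():
    model = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_fit, y_fit)
    scores[name] = round(f1_score(y_te, model.predict(X_te), average='weighted'), 4)

print(scores)
```

Only the training rows are resampled in this sketch. The test set keeps its original class distribution, so the scores of the different variants stay comparable.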

In [None]:
yhat_3= LR.predict(X_testset)
print (classification_report(y_testset, yhat_3))

In [None]:
yhat_3_1= LR_Ss1.predict(X_testset)
print (classification_report(y_testset, yhat_3_1))

In [None]:
yhat_3_2= LR_Ss2.predict(X_testset)
print (classification_report(y_testset, yhat_3_2))

In [None]:
yhat_3_3= LR_Ss3.predict(X_testset)
print (classification_report(y_testset, yhat_3_3))

In [None]:
yhat_3_4= LR_Ss4.predict(X_testset)
print (classification_report(y_testset, yhat_3_4))

In [None]:
yhat_3_5= LR_Ss5.predict(X_testset)
print (classification_report(y_testset, yhat_3_5))

In [None]:
yhat_3_A= LR_OsA.predict(X_testset)
print (classification_report(y_testset, yhat_3_A))

In [None]:
yhat_3_C= LR_OsC.predict(X_testset)
print (classification_report(y_testset, yhat_3_C))

In [None]:
yhat_3_D= LR_OsD.predict(X_testset)
print (classification_report(y_testset, yhat_3_D))

In [None]:
yhat_3_E= LR_OsE.predict(X_testset)
print (classification_report(y_testset, yhat_3_E))

In [None]:
yhat_3_ST1= LR_ST1.predict(X_testset)
print (classification_report(y_testset, yhat_3_ST1))

In [None]:
yhat_3_ST2= LR_ST2.predict(X_testset)
print (classification_report(y_testset, yhat_3_ST2))

In [None]:
yhat_3_ST3= LR_ST3.predict(X_testset)
print (classification_report(y_testset, yhat_3_ST3))

In [None]:
yhat_3_ST4= LR_ST4.predict(X_testset)
print (classification_report(y_testset, yhat_3_ST4))

In [None]:
yhat_3_ST5= LR_ST5.predict(X_testset)
print (classification_report(y_testset, yhat_3_ST5))

In [None]:
yhat_prob_3_4=LR_Ss4.predict_proba(X_testset)
LR_Acc=round(metrics.accuracy_score(y_testset, yhat_3_4),4)
#LR_Jcc=round(jaccard_similarity_score(y_testset, yhat_3_4),4)
#LR_lgl=round(log_loss(y_testset, yhat_prob_3_4),4)  # log_loss expects probabilities, not hard labels
LR_F1=round(f1_score(y_testset, yhat_3_4, average='weighted'),4)
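
The indicators collected above can be checked by hand on a small example. The sketch below uses made-up labels and probabilities (assumptions, not notebook results). It passes class probabilities to `log_loss`, which is what the metric expects, and it uses `jaccard_score`, the function that replaced `jaccard_similarity_score` in newer scikit-learn releases.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score, log_loss

labels = ['C1', 'C2', 'C3']
y_true = ['C1', 'C1', 'C2', 'C2', 'C3', 'C3']

# Made-up class probabilities, one row per patient, columns ordered as `labels`.
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.4, 0.5, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.6, 0.2],
                   [0.1, 0.1, 0.8],
                   [0.5, 0.2, 0.3]])

# Hard predictions are the most probable class of each row.
y_pred = [labels[i] for i in y_prob.argmax(axis=1)]

acc = accuracy_score(y_true, y_pred)                      # share of exact matches
f1w = f1_score(y_true, y_pred, average='weighted')        # per-class F1, weighted by support
jac = jaccard_score(y_true, y_pred, average='weighted')   # per-class TP / (TP + FP + FN)
lgl = log_loss(y_true, y_prob, labels=labels)             # mean negative log-probability of the true class

print(round(acc, 4), round(f1w, 4), round(jac, 4), round(lgl, 4))
```

Accuracy, F1 and Jaccard depend only on the hard predictions, while log loss also penalizes a correct prediction that was made with low confidence.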

In [None]:
LR_GS =LogisticRegression(C=0.1, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=1000, multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
LR_GS.fit(X_trainset,y_trainset)

In [None]:
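# The default lbfgs solver only supports the l2 penalty, so 'saga' is set for every candidate.
# l1_ratio=0.5 is an assumed value: elasticnet cannot be fitted without one.
LR_parameters = [{'C': [0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['saga']},
                 {'C': [0.1, 1, 10], 'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.5]}]
LR_GS = GridSearchCV(estimator=LR_Ss4, param_grid=LR_parameters, cv=5, verbose=0)
LR_GS.fit(X_trainset, y_trainset)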

In [None]:
LR_GS.best_estimator_

In [None]:
yhat_3_GS= LR_GS.predict(X_testset)
yhat_prob_3_GS=LR_GS.predict_proba(X_testset)
LR_Acc_GS=round(metrics.accuracy_score(y_testset, yhat_3_GS),4)
LR_Acc_ST1=round(metrics.accuracy_score(y_testset, yhat_3_ST1),4)
#LR_Jcc_GS=round(jaccard_similarity_score(y_testset, yhat_3_GS),4)
#LR_lgl_GS=round(log_loss(y_testset, yhat_prob_3_GS),4)
LR_F1_GS=round(f1_score(y_testset, yhat_3_GS, average='weighted'),4) 
LR_F1_ST1=round(f1_score(y_testset, yhat_3_ST1, average='weighted'),4) 

In [None]:
resultados_LR = {'Performance metric':['Accuracy','F1-Score'],
                 'Logistic regression':[LR_Acc,LR_F1],
                 'Grid Search':[LR_Acc_GS,LR_F1_GS],
                 'Smote Tomek':[LR_Acc_ST1,LR_F1_ST1]}
Tabla_resultados_LR=pd.DataFrame(resultados_LR)
print(Tabla_resultados_LR)

In [None]:
perm = PermutationImportance(LR_GS, random_state=1).fit(X, y)
eli5.show_weights(perm, feature_names =X.columns.tolist())

In [None]:
# Kernel Explainer
explainer = shap.KernelExplainer(LR.predict_proba,X[:100])
shap_values = explainer.shap_values(X[:100])
shap.summary_plot(shap_values, X[:100])

Neural networks

In [None]:
from sklearn.neural_network import MLPClassifier
NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=500).fit(X_trainset,y_trainset)

In [None]:
from sklearn.neural_network import MLPClassifier
NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
pca = PCA()
pipe = Pipeline(steps=[('pca', pca), ('NN', NN)])
param_grid = {
    'pca__n_components': [2,5, 15, 30, 45, 64,100,300],
    
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X, y)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

In [None]:
pca.fit(X)

fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6))
ax0.plot(np.arange(1, pca.n_components_ + 1),
         pca.explained_variance_ratio_, '+', linewidth=2)
ax0.set_ylabel('PCA explained variance ratio')

ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))

# For each number of components, find the best classifier results
results = pd.DataFrame(search.cv_results_)
components_col = 'param_pca__n_components'
best_clfs = results.groupby(components_col).apply(
    lambda g: g.nlargest(1, 'mean_test_score'))

best_clfs.plot(x=components_col, y='mean_test_score', yerr='std_test_score',
               legend=False, ax=ax1)
ax1.set_ylabel('Classification accuracy (val)')
ax1.set_xlabel('n_components')

plt.xlim(-1, 70)

plt.tight_layout()
plt.show()

In [None]:
NN_Ss1 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_1,y_trainset_res_1)
NN_Ss2 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_2,y_trainset_res_2)
NN_Ss3 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_3,y_trainset_res_3)
NN_Ss4 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_4,y_trainset_res_4)
NN_Ss5 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_5,y_trainset_res_5)
NN_OsA = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_A,y_trainset_res_A)
NN_OsC = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_C,y_trainset_res_C)
NN_OsD = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_D,y_trainset_res_D)
NN_OsE = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_E,y_trainset_res_E)
NN_ST1 = MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=1000,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='sgd',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False).fit(X_trainset_res_ST1,y_trainset_res_ST1)
NN_ST2 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_ST2,y_trainset_res_ST2)
NN_ST3 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_ST3,y_trainset_res_ST3)
NN_ST4 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_ST4,y_trainset_res_ST4)
NN_ST5 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000).fit(X_trainset_res_ST5,y_trainset_res_ST5)

In [None]:
yhat_4 = NN.predict(X_testset)
print (classification_report(y_testset, yhat_4))

In [None]:
yhat_4_1 = NN_Ss1.predict(X_testset)
print (classification_report(y_testset, yhat_4_1))

In [None]:
yhat_4_2 = NN_Ss2.predict(X_testset)
print (classification_report(y_testset, yhat_4_2))

In [None]:
yhat_4_3 = NN_Ss3.predict(X_testset)
print (classification_report(y_testset, yhat_4_3))

In [None]:
yhat_4_4 = NN_Ss4.predict(X_testset)
print (classification_report(y_testset, yhat_4_4))

In [None]:
yhat_4_5 = NN_Ss5.predict(X_testset)
print (classification_report(y_testset, yhat_4_5))

In [None]:
yhat_4_A = NN_OsA.predict(X_testset)
print (classification_report(y_testset, yhat_4_A))

In [None]:
yhat_4_C = NN_OsC.predict(X_testset)
print (classification_report(y_testset, yhat_4_C))

In [None]:
yhat_4_D = NN_OsD.predict(X_testset)
print (classification_report(y_testset, yhat_4_D))

In [None]:
yhat_4_E = NN_OsE.predict(X_testset)
print (classification_report(y_testset, yhat_4_E))

In [None]:
yhat_4_ST1 = NN_ST1.predict(X_testset)
print (classification_report(y_testset, yhat_4_ST1))

In [None]:
yhat_4_ST2 = NN_ST2.predict(X_testset)
print (classification_report(y_testset, yhat_4_ST2))

In [None]:
yhat_4_ST3 = NN_ST3.predict(X_testset)
print (classification_report(y_testset, yhat_4_ST3))

In [None]:
yhat_4_ST4 = NN_ST4.predict(X_testset)
print (classification_report(y_testset, yhat_4_ST4))

In [None]:
yhat_4_ST5 = NN_ST5.predict(X_testset)
print (classification_report(y_testset, yhat_4_ST5))

In [None]:
yhat_prob_4=NN.predict_proba(X_testset)
NN_Acc=round(metrics.accuracy_score(y_testset, yhat_4),4)
#NN_Jcc=round(jaccard_similarity_score(y_testset, yhat_4),4)
NN_lgl=round(log_loss(y_testset, yhat_prob_4),4)
NN_F1=round(f1_score(y_testset, yhat_4, average='weighted'),4)

In [None]:
NN_GS =MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=400,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='sgd',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
NN_GS.fit(X_trainset,y_trainset)

In [None]:
NN_parameters = {'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],'alpha': [0.0001, 0.05],'learning_rate': ['constant','adaptive'],'max_iter':[300,400]}

NN_GS = GridSearchCV(estimator=NN,param_grid=NN_parameters, cv= 5, verbose=True,scoring="accuracy")
NN_GS.fit(X_trainset,y_trainset)
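
An exhaustive grid grows multiplicatively with every parameter added, which matters for the run time of the search above. The sketch below counts the candidates of that grid with `ParameterGrid`. The `_demo` suffix marks the copy of the grid used for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid searched above.
NN_parameters_demo = {'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100,)],
                      'activation': ['tanh', 'relu'],
                      'solver': ['sgd', 'adam'],
                      'alpha': [0.0001, 0.05],
                      'learning_rate': ['constant', 'adaptive'],
                      'max_iter': [300, 400]}

n_candidates = len(ParameterGrid(NN_parameters_demo))  # 3 * 2 * 2 * 2 * 2 * 2
cv_folds = 5
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # prints: 96 480
```

With the default `refit=True`, `GridSearchCV` trains one more network on the whole training set after these 480 fits.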

In [None]:
NN_GS.best_estimator_

In [None]:
yhat_4_GS= NN_GS.predict(X_testset)
yhat_prob_4_GS=NN_GS.predict_proba(X_testset)
NN_Acc_GS=round(metrics.accuracy_score(y_testset, yhat_4_GS),4)
NN_Acc_ST1=round(metrics.accuracy_score(y_testset, yhat_4_ST1),4)
#NN_Jcc_GS=round(jaccard_similarity_score(y_testset, yhat_4_GS),4)
#NN_lgl_GS=round(log_loss(y_testset, yhat_prob_4_GS),4)
NN_F1_GS=round(f1_score(y_testset, yhat_4_GS, average='weighted'),4) 
NN_F1_ST1=round(f1_score(y_testset, yhat_4_ST1, average='weighted'),4) 

In [None]:
resultados_NN = {'Performance metric':['Accuracy','F1-Score'],
                 'Neural network':[NN_Acc,NN_F1],
                 'Grid Search':[NN_Acc_GS,NN_F1_GS],
                 'Smote-Tomek':[NN_Acc_ST1,NN_F1_ST1]}
Tabla_resultados_NN=pd.DataFrame(resultados_NN)
print(Tabla_resultados_NN)

In [None]:
data_DF=pd.DataFrame(X, columns=df_cat.columns)

In [None]:
import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(NN_GS, random_state=1).fit(X, y)
eli5.show_weights(perm, feature_names =data_DF.columns.tolist())

In [None]:
# Kernel Explainer
import shap
explainer = shap.KernelExplainer(NN.predict_proba,data_DF[:100])
shap_values = explainer.shap_values(data_DF[:100])
shap.summary_plot(shap_values, data_DF[:100])

Support Vector Machine

In [None]:
from sklearn import svm
SVM = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset,y_trainset)
SVM_Ss1 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_1,y_trainset_res_1)
SVM_Ss2 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_2,y_trainset_res_2)
SVM_Ss3 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_3,y_trainset_res_3)
SVM_Ss4 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_4,y_trainset_res_4)
SVM_Ss5 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_5,y_trainset_res_5)
SVM_OsA = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_A,y_trainset_res_A)
SVM_OsC = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_C,y_trainset_res_C)
SVM_OsD = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_D,y_trainset_res_D)
SVM_OsE = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_E,y_trainset_res_E)
SVM_ST1 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_ST1,y_trainset_res_ST1)
SVM_ST2 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_ST2,y_trainset_res_ST2)
SVM_ST3 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_ST3,y_trainset_res_ST3)
SVM_ST4 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_ST4,y_trainset_res_ST4)
SVM_ST5 = svm.SVC(kernel='rbf',decision_function_shape='ovo', probability=True).fit(X_trainset_res_ST5,y_trainset_res_ST5)
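
With `decision_function_shape='ovo'` the classifier exposes one decision value per pair of classes, while `probability=True` adds per-class probability estimates obtained through an internal cross-validation, which makes fitting slower. The sketch below shows both output shapes on synthetic four-class data. The data and the `_demo` names are assumptions for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic stand-in (assumption): 4 classes, like the categories C1 to C4.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_redundant=0, n_classes=4,
                                     n_clusters_per_class=1, random_state=0)

svm_demo = svm.SVC(kernel='rbf', decision_function_shape='ovo',
                   probability=True, random_state=0).fit(X_demo, y_demo)

# One column per pair of classes: 4 * 3 / 2 = 6.
ovo_values = svm_demo.decision_function(X_demo[:5])
# One column per class, each row summing to 1.
proba = svm_demo.predict_proba(X_demo[:5])

print(ovo_values.shape, proba.shape)  # prints: (5, 6) (5, 4)
```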

In [None]:
yhat_5 = SVM.predict(X_testset)
print (classification_report(y_testset, yhat_5))

In [None]:
yhat_5_1 = SVM_Ss1.predict(X_testset)
print (classification_report(y_testset, yhat_5_1))

In [None]:
yhat_5_2 = SVM_Ss2.predict(X_testset)
print (classification_report(y_testset, yhat_5_2))

In [None]:
yhat_5_3 = SVM_Ss3.predict(X_testset)
print (classification_report(y_testset, yhat_5_3))

In [None]:
yhat_5_4 = SVM_Ss4.predict(X_testset)
print (classification_report(y_testset, yhat_5_4))

In [None]:
yhat_5_5 = SVM_Ss5.predict(X_testset)
print (classification_report(y_testset, yhat_5_5))

In [None]:
yhat_5_A = SVM_OsA.predict(X_testset)
print (classification_report(y_testset, yhat_5_A))

In [None]:
yhat_5_C = SVM_OsC.predict(X_testset)
print (classification_report(y_testset, yhat_5_C))

In [None]:
yhat_5_D = SVM_OsD.predict(X_testset)
print (classification_report(y_testset, yhat_5_D))

In [None]:
yhat_5_E = SVM_OsE.predict(X_testset)
print (classification_report(y_testset, yhat_5_E))

In [None]:
yhat_5_ST1 = SVM_ST1.predict(X_testset)
print (classification_report(y_testset, yhat_5_ST1))

In [None]:
yhat_5_ST2 = SVM_ST2.predict(X_testset)
print (classification_report(y_testset, yhat_5_ST2))

In [None]:
yhat_5_ST3 = SVM_ST3.predict(X_testset)
print (classification_report(y_testset, yhat_5_ST3))

In [None]:
yhat_5_ST4 = SVM_ST4.predict(X_testset)
print (classification_report(y_testset, yhat_5_ST4))

In [None]:
yhat_5_ST5 = SVM_ST5.predict(X_testset)
print (classification_report(y_testset, yhat_5_ST5))

In [None]:
yhat_prob_5=SVM.predict_proba(X_testset)
SVM_Acc=round(metrics.accuracy_score(y_testset, yhat_5),4)
#SVM_Jcc=round(jaccard_similarity_score(y_testset, yhat_5),4)
SVM_lgl=round(log_loss(y_testset, yhat_prob_5),4)
SVM_F1=round(f1_score(y_testset, yhat_5, average='weighted'),4)

In [None]:
SVM_GS =svm.SVC(C=100, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
    verbose=False)
SVM_GS.fit(X_trainset,y_trainset)

In [None]:
SVM_parameters ={'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf', 'poly', 'sigmoid']}
SVM_GS = GridSearchCV(estimator=SVM,param_grid=SVM_parameters, cv= 3, verbose=True, n_jobs=-1)
SVM_GS.fit(X_trainset,y_trainset)

In [None]:
SVM_GS.best_estimator_

In [None]:
yhat_5_GS= SVM_GS.predict(X_testset)
yhat_prob_5_GS=SVM_GS.predict_proba(X_testset)
SVM_Acc_GS=round(metrics.accuracy_score(y_testset, yhat_5_GS),4)
#SVM_Jcc_GS=round(jaccard_similarity_score(y_testset, yhat_5_GS),4)
SVM_lgl_GS=round(log_loss(y_testset, yhat_prob_5_GS),4)
SVM_F1_GS=round(f1_score(y_testset, yhat_5_GS, average='weighted'),4) 

In [None]:
resultados_SVM = {'Performance metric':['Accuracy','LogLoss','F1-Score'],
                  'Support Vector Machine':[SVM_Acc,SVM_lgl,SVM_F1],
                  'Grid Search':[SVM_Acc_GS,SVM_lgl_GS,SVM_F1_GS]}
Tabla_resultados_SVM=pd.DataFrame(resultados_SVM)
print(Tabla_resultados_SVM)

In [None]:
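import shap
# TreeExplainer only supports tree-based models, so the model-agnostic KernelExplainer is used for the SVM
ex = shap.KernelExplainer(SVM.predict_proba, shap.sample(X_trainset, 100))
shap_values = ex.shap_values(X_testset[:100])
shap.summary_plot(shap_values, X_testset[:100])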

In [None]:
# Kernel Explainer
explainer = shap.KernelExplainer(SVM.predict_proba,X[:100])
shap_values = explainer.shap_values(X[:100])
shap.summary_plot(shap_values, X[:100])

In [None]:
resultados = {'Classification algorithm':['Decision tree','DT+GS','Random forest','RF+GS','Logistic regression','LR+GS','Neural network','NN+GS'],#'Support Vector Machine','SVM+GS'],
             'Accuracy':[DT_Acc,DT_Acc_GS,RF_Acc,RF_Acc_GS,LR_Acc,LR_Acc_GS,NN_Acc,NN_Acc_GS],#SVM_Acc,SVM_Acc_GS],
             #'Jaccard':[DT_Jcc,DT_Jcc_GS,RF_Jcc,RF_Jcc_GS,LR_Jcc,LR_Jcc_GS,NN_Jcc,NN_Jcc_GS],#SVM_Jcc,SVM_Jcc_GS],
             #'LogLoss':[DT_lgl,DT_lgl_GS,RF_lgl,RF_lgl_GS,LR_lgl,LR_lgl_GS,NN_lgl,NN_lgl_GS],#SVM_lgl,SVM_lgl_GS],
             'F1-Score':[DT_F1,DT_F1_GS,RF_F1,RF_F1_GS,LR_F1,LR_F1_GS,NN_F1,NN_F1_GS]}#,SVM_F1,SVM_F1_GS]}
Tabla_resultados=pd.DataFrame(resultados)
print(Tabla_resultados)

<h1 id="Matrices de confusión">Confusion matrices</h1>
<p>
This section builds a confusion matrix for each algorithm to evaluate how well it classifies each category.
</p>

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix on the current axes.
    Normalization is applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    #print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
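
Row normalization, as applied by `normalize=True` in the function above, divides each row by the number of true cases of that class, so the diagonal becomes the per-class recall. A small sketch with made-up labels (assumptions, not notebook results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ['C1', 'C2', 'C3', 'C4']
# Made-up true and predicted categories for eight patients.
y_true_demo = ['C1', 'C1', 'C2', 'C2', 'C2', 'C3', 'C3', 'C4']
y_pred_demo = ['C1', 'C2', 'C2', 'C2', 'C3', 'C3', 'C3', 'C3']

# Rows are true classes, columns are predicted classes, both ordered as `classes`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=classes)

# Divide every row by its total so that each row sums to 1.
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

print(cm)
print(cm_norm.round(2))
print(dict(zip(classes, cm_norm.diagonal().round(2))))  # per-class recall
```

The division assumes every class appears at least once among the true labels. A class with no true cases has a row total of zero and would produce undefined values.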


Decision tree

In [None]:
cnf_matrix1 = confusion_matrix(y_testset, yhat_1_ST1, labels=['C1','C2','C3','C4'])
#cnf_matrix2 = confusion_matrix(y_testset, yhat_1_GS, labels=['C1','C2','C3','C4'])
np.set_printoptions(precision=2)
# plot_confusion_matrix draws on the current axes and returns None, so a single figure is used
plt.figure()
plot_confusion_matrix(cnf_matrix1, classes=['C1','C2','C3','C4'], normalize=False, title='Confusion matrix: decision tree + ST')
#plot_confusion_matrix(cnf_matrix2, classes=['C1','C2','C3','C4'], normalize=False, title='Confusion matrix: decision tree')
# Report for the same predictions shown in the matrix
print(classification_report(y_testset, yhat_1_ST1))

Random forest

In [None]:
cnf_matrix = confusion_matrix(y_testset, yhat_2_GS, labels=['C1','C2','C3','C4'])
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['C1','C2','C3','C4'], normalize=False, title='Confusion matrix: random forest + GS')
# Report for the same predictions shown in the matrix
print(classification_report(y_testset, yhat_2_GS))

Logistic regression

In [None]:
cnf_matrix = confusion_matrix(y_testset, yhat_3_GS, labels=['C1','C2','C3','C4'])
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['C1','C2','C3','C4'], normalize=False, title='Confusion matrix: logistic regression + GS')
# Report for the same predictions shown in the matrix
print(classification_report(y_testset, yhat_3_GS))

Neural network

In [None]:
cnf_matrix = confusion_matrix(y_testset, yhat_4_ST1, labels=['C1','C2','C3','C4'])
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['C1','C2','C3','C4'], normalize=False, title='Confusion matrix: neural network + ST')
# Report for the same predictions shown in the matrix
print(classification_report(y_testset, yhat_4_ST1))

Support Vector Machine

In [None]:
cnf_matrix = confusion_matrix(y_testset, yhat_5, labels=['C1','C2','C3','C4'])
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['C1','C2','C3','C4'], normalize=False, title='Confusion matrix: SVM')
print (classification_report(y_testset, yhat_5))