<H1>Predicting Patient No-Shows for Medical Appointments</H1>

<H2>Comparing Different Feature Sets and Algorithms (XGBoost)</H2>

This example is based on 100,000 medical appointments from one region of Brazil (Vitória, Espírito Santo).
Using the patient's demographic, social, and chronic-illness features, the goal is to predict whether the patient will attend or miss the scheduled appointment.

This example compares the results obtained previously with those of another algorithm:
<ul>
    <li><em>ABT with 11 features and Logistic Regression</em></li>
    <li><em>ABT with 110 features and Logistic Regression</em></li>
    <li><em>ABT with 11 features and XGBoost</em></li>
</ul>


<H2>Loading Python Libraries</H2>

In [1]:
#Initial load of the essential libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#import seaborn as sns
#sns.set(style="whitegrid", color_codes=True)
%matplotlib inline

#ML imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

#Pipeline needed for PMML Export
from sklearn.pipeline import Pipeline
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline
from sklearn2pmml import sklearn2pmml


from sklearn import metrics
from sklearn.metrics import roc_curve
#from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score

from xgboost import XGBClassifier



<H2>Loading Data with Spark</H2>

In [2]:
#"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"
sc.setLogLevel("INFO")

In [3]:
#Load the simple 'Abstract Base Table' (ABT) into a Spark DataFrame
sp_dfSimple=spark.read.format("com.intersystems.spark").\
option("url", "IRIS://iris4ml:51773/MLACADEMY").option("user", "SuperUser").\
option("password", "sys").\
option("query", "PublishedABT.MLSimpleAppointmentsGetfeatures()").load().limit(10000) 

In [4]:
#Load the large 'Abstract Base Table' (ABT) into a Spark DataFrame
sp_dfLarge=spark.read.format("com.intersystems.spark").\
option("url", "IRIS://iris4ml:51773/MLACADEMY").option("user", "SuperUser").\
option("password", "sys").\
option("query", "PublishedABT.MLAppointmentsGetfeatures()").load().limit(10000) 

In [5]:
#Show the feature columns and the outcome column (classification: present/absent)
#sp_dfSimple.printSchema()
sp_dfLarge.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- AgeGroup_Adult: integer (nullable = true)
 |-- AgeGroup_Baby: integer (nullable = true)
 |-- AgeGroup_Child: integer (nullable = true)
 |-- AgeGroup_MiddleAged: integer (nullable = true)
 |-- AgeGroup_Senior: integer (nullable = true)
 |-- AgeGroup_TeenYoungAdult: integer (nullable = true)
 |-- Alcoholism: boolean (nullable = true)
 |-- AppointmentWeekday_Friday: integer (nullable = true)
 |-- AppointmentWeekday_Monday: integer (nullable = true)
 |-- AppointmentWeekday_Saturday: integer (nullable = true)
 |-- AppointmentWeekday_Thursday: integer (nullable = true)
 |-- AppointmentWeekday_Tuesday: integer (nullable = true)
 |-- AppointmentWeekday_Wednesday: integer (nullable = true)
 |-- Delay: integer (nullable = true)
 |-- DelayGroup_LessThanTen: integer (nullable = true)
 |-- DelayGroup_LessthanFour: integer (nullable = true)
 |-- DelayGroup_NA: integer (nullable = true)
 |-- DelayGroup_None: integer (nullable = true)
 |-- DelayGroup_One: 
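
The schema above suggests that categorical attributes such as the appointment weekday and the age group were expanded into one 0/1 indicator column per category (`AppointmentWeekday_Monday`, `AgeGroup_Adult`, and so on). How the ABT actually builds these columns is not shown in this notebook, so the following is only a minimal pure-Python sketch of that kind of one-hot encoding. The `one_hot` helper and the sample rows are hypothetical.

```python
def one_hot(rows, column, categories):
    """Return copies of `rows` where `column` is replaced by 0/1 indicator columns."""
    encoded = []
    for row in rows:
        # Keep every other column untouched
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per known category
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows (not taken from the real dataset)
appointments = [
    {"Age": 34, "AppointmentWeekday": "Monday"},
    {"Age": 7, "AppointmentWeekday": "Friday"},
]
weekdays = ["Friday", "Monday"]

encoded = one_hot(appointments, "AppointmentWeekday", weekdays)
print(encoded[0])
```

Each row ends up with exactly one indicator set to 1 per encoded attribute, which is how a feature table of this kind can grow from a few raw attributes to many columns.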

<H2>Copying to Pandas DataFrames</H2>

In [None]:
# Enable Arrow-based columnar data transfers
# (Spark 3.x renames this key to "spark.sql.execution.arrow.pyspark.enabled")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Convert the Spark DataFrame to a Pandas DataFrame using Arrow
# (Note: dates must be Timestamps; Date columns do not convert well)
dataSimple = sp_dfSimple.select("*").toPandas()
# data.drop(['ID'],axis=1,inplace=True)  # Drop the ID columns; they do not help determine the outcome

In [None]:
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Convert the Spark DataFrame to a Pandas DataFrame using Arrow
# (Note: dates must be Timestamps; Date columns do not convert well)
dataLarge = sp_dfLarge.select("*").toPandas()
# data.drop(['ID'],axis=1,inplace=True)  # Drop the ID columns; they do not help determine the outcome

In [None]:
#Show the first rows of the ABT returned by PublishedABT.MLAppointmentsGetfeatures()
dataLarge.head()

<H2>Algorithms</H2>

In [None]:
#Extract the feature names for later use (e.g. the XGBoost decision tree)
featuresSimple=list(dataSimple.columns)
featuresSimple.remove('noShow')

In [None]:
#Extract the features (into X) and the outcome (column "noShow", into y) as NumPy arrays
XSimple = dataSimple.loc[:, dataSimple.columns.isin(featuresSimple)].values
ySimple = dataSimple.loc[:, 'noShow'].values

In [None]:
#Extract the features (into X) and the outcome (column "noShow", into y) as NumPy arrays
XLarge = dataLarge.loc[:, ~dataLarge.columns.isin(['noShow'])].values
yLarge = dataLarge.loc[:, 'noShow'].values

In [None]:
#Split the data into 2 sets: train to fit the model, and test to validate its results
# (the full dataset has about 110,000 rows; the load above caps it at 10,000 with limit(10000))
# 0.2 => 20% held out for test
XSimple_train, XSimple_test, ySimple_train, ySimple_test = train_test_split(XSimple, ySimple, test_size = 0.2, random_state=42) 

In [None]:
#Split the data into 2 sets: train to fit the model, and test to validate its results
# (the full dataset has about 110,000 rows; the load above caps it at 10,000 with limit(10000))
# 0.2 => 20% held out for test
XLarge_train, XLarge_test, yLarge_train, yLarge_test = train_test_split(XLarge, yLarge, test_size = 0.2, random_state=42) 
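
As a rough illustration of what the 80/20 split above does, the sketch below shuffles row indices with a fixed seed and holds out 20% of them for testing. It is a simplified stand-in, not scikit-learn's implementation: `simple_split` is a hypothetical helper and will not reproduce the exact rows that `train_test_split(..., random_state=42)` selects.

```python
import random


def simple_split(rows, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then hold out a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Rows and labels are selected with the same indices so they stay aligned
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return rows_train, rows_test, labels_train, labels_test


# Hypothetical toy data: 10 rows with a single feature each
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = simple_split(X, y)
print(len(X_train), len(X_test))  # prints: 8 2
```

Because the shuffle depends only on the seed and the number of rows, repeating the split with the same seed on data of the same length selects the same rows. That property is what makes results from separately split copies of the same table comparable.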

<H3>ABT with 11 Features and Logistic Regression</H3>

In [None]:
#Train the Logistic Regression algorithm with cross-validation
model1 = LogisticRegressionCV(n_jobs=-1, cv=10, max_iter=200, scoring = 'f1', multi_class='ovr')
model1.fit(XSimple_train, ySimple_train)
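
With `scoring='f1'`, the cross-validation inside `LogisticRegressionCV` compares candidate regularization strengths by their F1 score on the held-out folds. F1 is the harmonic mean of precision and recall for the positive class. The sketch below shows the arithmetic on made-up labels; the `f1` helper is hypothetical and assumes the positive class (a no-show) is encoded as 1.

```python
def f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and/or recall is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 3 actual no-shows, only 1 of them predicted
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

score = f1(y_true, y_pred)
print(round(score, 3))  # precision 0.5, recall 1/3 -> 0.4
```

Unlike plain accuracy, F1 ignores true negatives, so a model that always predicts "will attend" scores 0 no matter how many patients actually attend.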

In [None]:
# ROC AUC is poor: 0.65
metrics.roc_auc_score(ySimple_test,model1.predict_proba(XSimple_test)[:,1])
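
To make the ROC AUC figure easier to read: it can be interpreted as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. A value of 0.5 therefore means the scores rank no better than chance. The following pure-Python sketch computes it by comparing every positive/negative pair; `pairwise_auc` is a hypothetical helper for illustration, and it is far slower than `metrics.roc_auc_score` on real data.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, scores)
print(auc)  # 3 of the 4 pairs are ranked correctly -> 0.75
```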

<H3>ABT with 110 Features and Logistic Regression</H3>

In [None]:
#Train the Logistic Regression algorithm with cross-validation
model2 = LogisticRegressionCV(n_jobs=-1, cv=10, max_iter=500, scoring = 'f1', multi_class='ovr')
model2.fit(XLarge_train, yLarge_train)

In [None]:
# ROC AUC is better
metrics.roc_auc_score(yLarge_test,model2.predict_proba(XLarge_test)[:,1])

<H3>XGBoost</H3>

In [None]:
#XGBoost
#Split the dataset again for XGBoost, keeping DataFrames so that feature names are retained
dfSimpleX_train, dfSimpleX_test, dfSimpley_train, dfSimpley_test = train_test_split(dataSimple[featuresSimple], dataSimple['noShow'],test_size=0.2, random_state=42)
#print (dfSimpleX_train.shape)
#dfSimpleX_train.head()

model3 = XGBClassifier(nthread=8)
model3.fit(dfSimpleX_train, dfSimpley_train)

In [None]:
# we have a decent AUC of 0.726 
metrics.roc_auc_score(dfSimpley_test,model3.predict_proba(dfSimpleX_test)[:,1])

<H3>Comparison of the 3 Models</H3>

In [None]:
y1_pred_prob=model1.predict_proba(XSimple_test)[:,1]
y2_pred_prob=model2.predict_proba(XLarge_test)[:,1]
y3_pred_prob=model3.predict_proba(dfSimpleX_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr1, tpr1, thresholds1 = roc_curve(ySimple_test, y1_pred_prob)
fpr2, tpr2, thresholds2 = roc_curve(yLarge_test, y2_pred_prob)
fpr3, tpr3, thresholds3 = roc_curve(dfSimpley_test, y3_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr1, tpr1,"C1",label="Lg11")
plt.plot(fpr2, tpr2,"C2",label="Lg110")
plt.plot(fpr3, tpr3,"C3",label="XGBoost")

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve- LogRegs vs XGBoost')
plt.legend()
plt.show()
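
For readers unfamiliar with how the curves plotted above are built, the sketch below sweeps a decision threshold over the predicted scores and records the false positive rate and true positive rate at each step. It is a simplified illustration, not the algorithm `roc_curve` uses internally; `roc_points` and `trapezoid_area` are hypothetical helpers and the data is made up.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold step by step."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)
print(trapezoid_area(points))  # area under the curve: 0.75
```

A curve that hugs the top-left corner has more area under it, which is why the model whose curve sits highest in the plot also has the highest AUC.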

<H3>XGBoost Details</H3>

In [None]:
import xgboost
xgboost.plot_importance(model3, max_num_features=20)

<H3>Miscellaneous Tests</H3>

In [None]:
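# Inspect the current Spark configuration
sc._conf.getAll()

# Restart the SparkContext with larger memory limits
from pyspark import SparkConf, SparkContext

sc.stop()
conf = SparkConf().setAppName("App")
conf = (conf.setMaster('local[*]')
        .set('spark.executor.memory', '3G')
        .set('spark.driver.memory', '3G')
        .set('spark.driver.maxResultSize', '3G'))
sc = SparkContext(conf=conf)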