<a href="https://colab.research.google.com/github/HWP-Wilson/Taxa_de_Mortalidade_SUS/blob/main/Previs%C3%A3o_de_interna%C3%A7%C3%A3o_em_UTI_Sirio_Liban%C3%AAs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###<font color='black'> **PREDICTING, WITH MACHINE LEARNING, THE NEED TO ADMIT COVID-19 PATIENTS TO AN INTENSIVE CARE UNIT (ICU)**

helanowilson@ufc.br
Date completed: 11/03/2021.

###<font color='blue'> **INTRODUCTION**

After four Data Science projects in the Alura Bootcamp, covering data exploration, data visualization and statistical testing, among other topics, this final project focuses on building a Machine Learning model that predicts whether patients with Covid-19 will need to be transferred to the ICU. As is widely known, the health system is in collapse and needs new beds and improvements to its structure as a whole. A model like this one can support that planning.


###<font color='blue'> **1. DATASET**

The dataset was prepared by the intelligence team at Hospital Sírio-Libanês, with clinical data on patients admitted to its São Paulo and Brasília units, in columns such as GENDER, DISEASE GROUPING 1, TEMPERATURE_MEAN, OXYGEN_SATURATION_MIN, WINDOW and, last of all, ICU.
All data were anonymized following international best practices and recommendations. The data were cleaned and scaled column by column to lie between -1 and 1.
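
The dataset description does not say how the columns were scaled. As a minimal sketch, assuming plain per-column min-max scaling (this is an assumption, and the helper `scale_to_unit_range` and the toy values are hypothetical), a column can be mapped onto the interval [-1, 1] as follows:

```python
import pandas as pd


def scale_to_unit_range(df: pd.DataFrame) -> pd.DataFrame:
    """Map each column linearly so its minimum becomes -1 and its maximum +1.

    Hypothetical helper: the dataset authors do not document their method.
    Constant columns (max == min) end up at -1 instead of dividing by zero.
    """
    col_min = df.min()
    col_range = df.max() - col_min
    safe_range = col_range.replace(0, 1)  # avoid division by zero
    return 2 * (df - col_min) / safe_range - 1


# Invented raw readings, only to show the mapping.
raw = pd.DataFrame({
    "TEMPERATURE": [36.0, 37.0, 38.0],
    "HEART_RATE": [60.0, 90.0, 120.0],
})
scaled = scale_to_unit_range(raw)
print(scaled)
```

After scaling, every column shares the same range, so features measured in different units become comparable in magnitude.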




**Important**:
1. The PATIENT_VISIT_IDENTIFIER column holds the patient identifier.
There are 5 rows per patient; each row corresponds to a time window, shown in the WINDOW column, counted from hospital admission up to "above 12 hours". Following guidance from the hospital team, and because we do not yet have advanced Machine Learning expertise, we will use only the data from the patient's first two hours in hospital.

|Window|Description|
---|---
0-2|0 to 2 hours after hospital admission
2-4|2 to 4 hours after hospital admission
4-6|4 to 6 hours after hospital admission
6-12|6 to 12 hours after hospital admission
ABOVE_12|More than 12 hours after hospital admission

2. Dataset variables:

- Demographic information: 3 categorical variables

- Pre-existing conditions: 9 categorical variables

- Blood tests: 36 continuous variables, expanded where applicable into mean, median, max, min, diff (max - min) and relative diff (diff/median)

- Vital signs: 6 continuous variables


3. The last column, ICU, is the target of this project. It records whether or not the patient went to the ICU, as a binary class: admitted (ICU = 1) or not admitted (ICU = 0), so it is ready to use in prediction models.

4. Following guidance from the Sírio-Libanês team, data from rows where ICU = 1 must not be used. However, the data recorded before the ICU admission must be kept. So if a patient already shows ICU = 1 in the first window, that patient's data have to be discarded. It is important not to lose the information that a patient was in the ICU at some point.
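
The rule in item 4 can be sketched on a toy table. The three patients and their values below are invented for illustration, and the variable names are hypothetical; this is only one way to express the rule, under the assumption that the first window is the only one kept as features.

```python
import pandas as pd

# Toy data (invented): three patients, two windows each.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 1, 1, 2, 2],
    "WINDOW": ["0-2", "2-4", "0-2", "2-4", "0-2", "2-4"],
    "ICU": [0, 1, 1, 1, 0, 0],
})

# Patients already in the ICU during the first window are discarded entirely.
in_icu_at_start = toy.loc[
    (toy["WINDOW"] == "0-2") & (toy["ICU"] == 1), "PATIENT_VISIT_IDENTIFIER"
]
kept = toy[~toy["PATIENT_VISIT_IDENTIFIER"].isin(in_icu_at_start)]

# For the remaining patients, record whether they ever reached the ICU.
ever_icu = kept.groupby("PATIENT_VISIT_IDENTIFIER")["ICU"].max()

# Attach that label to the first-window row, the only row kept as features.
first_window = kept[kept["WINDOW"] == "0-2"].drop(columns="ICU")
first_window = first_window.merge(
    ever_icu.rename("ICU").reset_index(), on="PATIENT_VISIT_IDENTIFIER"
)
print(first_window)
```

Patient 1 is dropped because ICU = 1 already appears in window 0-2. Patient 0 keeps the first-window features but carries the label ICU = 1 from the later window.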


###<font color='blue'> **2. LIBRARIES**

In [263]:
# Importing libraries; loading files and preparing the data
import random
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
from scipy.stats import norm

import sklearn
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier)
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix, plot_confusion_matrix,
                             roc_auc_score, plot_roc_curve, auc,
                             accuracy_score, recall_score, f1_score)
from sklearn.model_selection import (cross_validate, StratifiedKFold, RepeatedStratifiedKFold,
                                     train_test_split, GridSearchCV, RandomizedSearchCV)
from sklearn.tree import DecisionTreeClassifier

plt.style.use('ggplot')
warnings.filterwarnings('ignore')



###<font color='blue'> **2. FUNCTIONS**

In [223]:
## Fills missing values with values from neighbouring windows.
## Note: the forward fill runs per patient; the backward fill that follows runs on the whole frame.
def preenche_tabela(dados):
  
  features_continuas_colunas = dados.iloc[:,13:-2].columns
  features_continuas = dados.groupby('PATIENT_VISIT_IDENTIFIER', as_index = False)[features_continuas_colunas].fillna(method='ffill').fillna(method='bfill')
  features_categoricas = dados.iloc[:,:13]
  saida = dados.iloc[:,-2:]
  dados_finais = pd.concat([features_categoricas, features_continuas, saida], ignore_index=True, axis=1 )
  dados_finais.columns = dados.columns
  return dados_finais

## Sets ICU to 1 in the first window when the patient was admitted to the ICU
## in any later window, and returns only the first-window row.
def prepare_window(rows):
    if(np.any(rows["ICU"])):
        rows.loc[rows["WINDOW"]=="0-2", "ICU"] = 1
    return rows.loc[rows["WINDOW"] == "0-2"]


def roda_modelo_cv(modelo, dados: pd.DataFrame, n_splits: int, n_repeats: int):
    """
    Runs cross-validation for a given model, with a number of splits of the data
    and a number of repetitions of the tests.

    Returns the model, the mean AUC, and the bounds mean - 2*std and mean + 2*std.
    """
  
    np.random.seed(321351654)
    dados = dados.sample(frac=1).reset_index(drop=True)
    x_columns = dados.columns
    y = dados['ICU']
    x = dados.drop(['ICU'], axis=1)
    
   
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats)
    resultados = cross_validate(modelo, x, y, cv=cv, scoring='roc_auc')


    auc_medio = np.mean(resultados['test_score'])
    auc_std= np.std(resultados['test_score'])
    
    return modelo, auc_medio, (auc_medio - (2*auc_std)), (auc_medio + (2*auc_std))


def roda_cross_validate_modelos(modelos: list, dados: pd.DataFrame, n_splits: int, n_repeats: int):
    """
    Automates the cross-validation process so that several models in a list
    can be evaluated.
    
    Returns a dataframe with the information and metrics for each model.
    """
    scores = []
    for i in modelos:
        modelo, auc_medio, ic_auc_min, ic_auc_max = roda_modelo_cv(i, dados, n_splits, n_repeats)
        scores.append([modelo, auc_medio, ic_auc_min, ic_auc_max])
    return pd.DataFrame(data=scores, columns=['Modelo',f'AUC_Mean', 'IC_Min', 'IC_Max'])

def roda_modelo_cv_min_max(modelo, dados: pd.DataFrame, n_splits: int, n_repeats: int):
    """
    Very similar to roda_modelo_cv(), with small changes so that it can be
    used with the dataframe of fictitious patients.
    """
  
    np.random.seed(321351654)
    dados = dados.sample(frac=1).reset_index(drop=True)
    x_columns = dados.columns
    y = dados['ICU_EVER']
    x = dados.drop(['ICU', 'ICU_EVER', 'WINDOW', 'PATIENT_VISIT_IDENTIFIER'], axis=1)
    
   
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats)
    resultados = cross_validate(modelo, x, y, cv=cv, scoring='roc_auc')


    auc_medio = np.mean(resultados['test_score'])
    auc_std= np.std(resultados['test_score'])
    
    return modelo, auc_medio, (auc_medio - (2*auc_std)), (auc_medio + (2*auc_std))

def roda_cross_validate_modelos_min_max(modelos: list, dados: pd.DataFrame, n_splits: int, n_repeats: int):
    """
    Automates the cross-validation process so that several models in a list
    can be evaluated.
    
    Returns a dataframe with the information and metrics for each model.
    """
    scores = []
    for i in modelos:
        modelo, auc_medio, ic_auc_min, ic_auc_max = roda_modelo_cv_min_max(i, dados, n_splits, n_repeats)
        scores.append([modelo, auc_medio, ic_auc_min, ic_auc_max])
    return pd.DataFrame(data=scores, columns=['Modelo',f'AUC_Mean', 'IC_Min', 'IC_Max'])
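
As a rough illustration of what the cross-validation helpers above compute, the sketch below runs the same scikit-learn calls on synthetic data. The data come from `make_classification`, not from the hospital file, and the choice of `LogisticRegression`, 5 splits and 2 repeats is arbitrary. The band "mean plus or minus two standard deviations" summarises the spread of the fold scores; it is a simple summary, not a formal confidence interval.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic binary problem (not the hospital data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5 stratified folds repeated 2 times gives 10 train/test splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
resultados = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)

auc_medio = np.mean(resultados["test_score"])
auc_std = np.std(resultados["test_score"])

print(len(resultados["test_score"]))  # one AUC per split per repeat
print(auc_medio - 2 * auc_std, auc_medio + 2 * auc_std)
```

Stratified folds keep the proportion of ICU and non-ICU cases roughly constant in every split, which matters when one class is much rarer than the other.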


###<font color='blue'> **3. DATASET**

In [224]:
dados = pd.read_excel("/content/Kaggle_Sirio_Libanes_ICU_Prediction.xlsx")
dados.head()

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_ABOVE65,AGE_PERCENTIL,GENDER,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,HTN,IMMUNOCOMPROMISED,OTHER,ALBUMIN_MEDIAN,ALBUMIN_MEAN,ALBUMIN_MIN,ALBUMIN_MAX,ALBUMIN_DIFF,BE_ARTERIAL_MEDIAN,BE_ARTERIAL_MEAN,BE_ARTERIAL_MIN,BE_ARTERIAL_MAX,BE_ARTERIAL_DIFF,BE_VENOUS_MEDIAN,BE_VENOUS_MEAN,BE_VENOUS_MIN,BE_VENOUS_MAX,BE_VENOUS_DIFF,BIC_ARTERIAL_MEDIAN,BIC_ARTERIAL_MEAN,BIC_ARTERIAL_MIN,BIC_ARTERIAL_MAX,BIC_ARTERIAL_DIFF,BIC_VENOUS_MEDIAN,BIC_VENOUS_MEAN,BIC_VENOUS_MIN,BIC_VENOUS_MAX,BIC_VENOUS_DIFF,BILLIRUBIN_MEDIAN,BILLIRUBIN_MEAN,...,DIMER_MAX,DIMER_DIFF,BLOODPRESSURE_DIASTOLIC_MEAN,BLOODPRESSURE_SISTOLIC_MEAN,HEART_RATE_MEAN,RESPIRATORY_RATE_MEAN,TEMPERATURE_MEAN,OXYGEN_SATURATION_MEAN,BLOODPRESSURE_DIASTOLIC_MEDIAN,BLOODPRESSURE_SISTOLIC_MEDIAN,HEART_RATE_MEDIAN,RESPIRATORY_RATE_MEDIAN,TEMPERATURE_MEDIAN,OXYGEN_SATURATION_MEDIAN,BLOODPRESSURE_DIASTOLIC_MIN,BLOODPRESSURE_SISTOLIC_MIN,HEART_RATE_MIN,RESPIRATORY_RATE_MIN,TEMPERATURE_MIN,OXYGEN_SATURATION_MIN,BLOODPRESSURE_DIASTOLIC_MAX,BLOODPRESSURE_SISTOLIC_MAX,HEART_RATE_MAX,RESPIRATORY_RATE_MAX,TEMPERATURE_MAX,OXYGEN_SATURATION_MAX,BLOODPRESSURE_DIASTOLIC_DIFF,BLOODPRESSURE_SISTOLIC_DIFF,HEART_RATE_DIFF,RESPIRATORY_RATE_DIFF,TEMPERATURE_DIFF,OXYGEN_SATURATION_DIFF,BLOODPRESSURE_DIASTOLIC_DIFF_REL,BLOODPRESSURE_SISTOLIC_DIFF_REL,HEART_RATE_DIFF_REL,RESPIRATORY_RATE_DIFF_REL,TEMPERATURE_DIFF_REL,OXYGEN_SATURATION_DIFF_REL,WINDOW,ICU
0,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,0.08642,-0.230769,-0.283019,-0.59322,-0.285714,0.736842,0.08642,-0.230769,-0.283019,-0.586207,-0.285714,0.736842,0.237113,0.0,-0.162393,-0.5,0.208791,0.89899,-0.247863,-0.459459,-0.432836,-0.636364,-0.42029,0.736842,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0-2,0
1,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,0.333333,-0.230769,-0.132075,-0.59322,0.535714,0.578947,0.333333,-0.230769,-0.132075,-0.586207,0.535714,0.578947,0.443299,0.0,-0.025641,-0.5,0.714286,0.838384,-0.076923,-0.459459,-0.313433,-0.636364,0.246377,0.578947,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2-4,0
2,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.93895,-0.93895,...,-0.994912,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4-6,0
3,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,-0.107143,0.736842,,,,,-0.107143,0.736842,,,,,0.318681,0.89899,,,,,-0.275362,0.736842,,,,,-1.0,-1.0,,,,,-1.0,-1.0,6-12,0
4,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-1.0,-0.871658,-0.871658,-0.871658,-0.871658,-1.0,-0.863874,-0.863874,-0.863874,-0.863874,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.414634,-0.414634,-0.414634,-0.414634,-1.0,-0.979069,-0.979069,...,-0.996762,-1.0,-0.243021,-0.338537,-0.213031,-0.317859,0.033779,0.665932,-0.283951,-0.376923,-0.188679,-0.37931,0.035714,0.631579,-0.340206,-0.4875,-0.57265,-0.857143,0.098901,0.79798,-0.076923,0.286486,0.298507,0.272727,0.362319,0.947368,-0.33913,0.325153,0.114504,0.176471,-0.238095,-0.818182,-0.389967,0.407558,-0.230462,0.096774,-0.242282,-0.814433,ABOVE_12,1


In [226]:
## The file has 1925 rows, but each patient occupies 5 rows (one per window).
## As explained above, only the data from the first window after hospital
## admission will be used.
dados.shape

(1925, 231)

In [227]:
## Confirming that the dataset actually holds 385 patients.
dados.PATIENT_VISIT_IDENTIFIER.nunique()

385

3.1 Checking for null and missing data.

This step matters because Machine Learning models work on numeric data. Missing or null values cause errors in the algorithms.
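
One common way to fill such gaps is to borrow values from the same patient's neighbouring windows: a forward fill followed by a backward fill. The sketch below is an illustration on invented values, not the notebook's `preenche_tabela`; it uses `transform` so that both fills stay inside each patient's rows.

```python
import numpy as np
import pandas as pd

# Toy data (invented): two patients, three windows each, with gaps.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 0, 1, 1, 1],
    "ALBUMIN_MEAN": [np.nan, 0.6, np.nan, np.nan, np.nan, -0.2],
})

# Forward fill, then backward fill, both restricted to each patient's rows.
filled = toy.copy()
filled["ALBUMIN_MEAN"] = (
    toy.groupby("PATIENT_VISIT_IDENTIFIER")["ALBUMIN_MEAN"]
       .transform(lambda s: s.ffill().bfill())
)
print(filled)
```

A patient with no measurement at all in a column would still be left with nulls, which is why a final `dropna()` is a reasonable last step.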

In [228]:
dados_limpos = preenche_tabela(dados)

In [229]:
dados_limpos.shape

(1925, 231)

In [230]:
dados_limpos = dados_limpos.dropna()

In [231]:
## Missing data
dados.isnull().sum()

PATIENT_VISIT_IDENTIFIER        0
AGE_ABOVE65                     0
AGE_PERCENTIL                   0
GENDER                          0
DISEASE GROUPING 1              5
                             ... 
RESPIRATORY_RATE_DIFF_REL     748
TEMPERATURE_DIFF_REL          694
OXYGEN_SATURATION_DIFF_REL    686
WINDOW                          0
ICU                             0
Length: 231, dtype: int64

In [232]:
a_remover = dados_limpos.query("WINDOW=='0-2' and ICU==1")['PATIENT_VISIT_IDENTIFIER'].values
dados_limpos = dados_limpos.query("PATIENT_VISIT_IDENTIFIER not in @a_remover")

dados_limpos.describe()

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_ABOVE65,GENDER,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,HTN,IMMUNOCOMPROMISED,OTHER,ALBUMIN_MEDIAN,ALBUMIN_MEAN,ALBUMIN_MIN,ALBUMIN_MAX,ALBUMIN_DIFF,BE_ARTERIAL_MEDIAN,BE_ARTERIAL_MEAN,BE_ARTERIAL_MIN,BE_ARTERIAL_MAX,BE_ARTERIAL_DIFF,BE_VENOUS_MEDIAN,BE_VENOUS_MEAN,BE_VENOUS_MIN,BE_VENOUS_MAX,BE_VENOUS_DIFF,BIC_ARTERIAL_MEDIAN,BIC_ARTERIAL_MEAN,BIC_ARTERIAL_MIN,BIC_ARTERIAL_MAX,BIC_ARTERIAL_DIFF,BIC_VENOUS_MEDIAN,BIC_VENOUS_MEAN,BIC_VENOUS_MIN,BIC_VENOUS_MAX,BIC_VENOUS_DIFF,BILLIRUBIN_MEDIAN,BILLIRUBIN_MEAN,BILLIRUBIN_MIN,...,DIMER_MIN,DIMER_MAX,DIMER_DIFF,BLOODPRESSURE_DIASTOLIC_MEAN,BLOODPRESSURE_SISTOLIC_MEAN,HEART_RATE_MEAN,RESPIRATORY_RATE_MEAN,TEMPERATURE_MEAN,OXYGEN_SATURATION_MEAN,BLOODPRESSURE_DIASTOLIC_MEDIAN,BLOODPRESSURE_SISTOLIC_MEDIAN,HEART_RATE_MEDIAN,RESPIRATORY_RATE_MEDIAN,TEMPERATURE_MEDIAN,OXYGEN_SATURATION_MEDIAN,BLOODPRESSURE_DIASTOLIC_MIN,BLOODPRESSURE_SISTOLIC_MIN,HEART_RATE_MIN,RESPIRATORY_RATE_MIN,TEMPERATURE_MIN,OXYGEN_SATURATION_MIN,BLOODPRESSURE_DIASTOLIC_MAX,BLOODPRESSURE_SISTOLIC_MAX,HEART_RATE_MAX,RESPIRATORY_RATE_MAX,TEMPERATURE_MAX,OXYGEN_SATURATION_MAX,BLOODPRESSURE_DIASTOLIC_DIFF,BLOODPRESSURE_SISTOLIC_DIFF,HEART_RATE_DIFF,RESPIRATORY_RATE_DIFF,TEMPERATURE_DIFF,OXYGEN_SATURATION_DIFF,BLOODPRESSURE_DIASTOLIC_DIFF_REL,BLOODPRESSURE_SISTOLIC_DIFF_REL,HEART_RATE_DIFF_REL,RESPIRATORY_RATE_DIFF_REL,TEMPERATURE_DIFF_REL,OXYGEN_SATURATION_DIFF_REL,ICU
count,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,...,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0,1760.0
mean,192.818182,0.457386,0.380682,0.1125,0.026136,0.101705,0.021591,0.132955,0.049432,0.2125,0.163636,0.828977,0.556086,0.556086,0.556086,0.556086,-1.0,-0.985555,-0.985555,-0.985555,-0.985555,-1.0,-0.948736,-0.948736,-0.948736,-0.948736,-1.0,-0.314399,-0.314399,-0.314399,-0.314399,-1.0,-0.315992,-0.315992,-0.315992,-0.315992,-1.0,-0.945832,-0.945832,-0.945832,...,-0.954625,-0.954625,-1.0,-0.067473,-0.335719,-0.263742,-0.462974,0.074319,0.749404,-0.071528,-0.340052,-0.266134,-0.455741,0.071124,0.75314,0.018076,-0.169759,-0.227282,-0.457508,0.363012,0.84961,-0.260004,-0.445989,-0.324635,-0.408023,-0.024177,0.803125,-0.827599,-0.818077,-0.831367,-0.813369,-0.841694,-0.925178,-0.852285,-0.809092,-0.874791,-0.821924,-0.84261,-0.925164,0.201705
std,110.637724,0.498322,0.485692,0.31607,0.159586,0.302345,0.145385,0.339622,0.216829,0.409193,0.37005,0.376636,0.18054,0.18054,0.18054,0.18054,0.0,0.105194,0.105194,0.105194,0.105194,0.0,0.139014,0.139014,0.139014,0.139014,0.0,0.064575,0.064575,0.064575,0.064575,0.0,0.101929,0.101929,0.101929,0.101929,0.0,0.062164,0.062164,0.062164,...,0.138371,0.138371,0.0,0.253403,0.265682,0.24671,0.202831,0.243375,0.127638,0.258404,0.268284,0.252963,0.213449,0.250693,0.127605,0.268843,0.259842,0.260774,0.24566,0.19135,0.224878,0.240722,0.25314,0.268898,0.335868,0.260564,0.138728,0.304501,0.336412,0.308805,0.385784,0.265921,0.23444,0.270025,0.347524,0.228338,0.358887,0.264551,0.234701,0.401387
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.578571,-1.0,-1.0,-1.0,-1.0,-1.0,-0.607143,-1.0,-1.0,-1.0,-1.0,-1.0,-0.340659,-1.0,-1.0,-1.0,-1.0,-1.0,-0.652174,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0
25%,97.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.965725,-0.965725,-0.965725,...,-0.983811,-0.983811,-1.0,-0.223486,-0.537042,-0.432692,-0.583838,-0.107143,0.684211,-0.225309,-0.538462,-0.433962,-0.586207,-0.107143,0.684211,-0.175258,-0.375,-0.401709,-0.571429,0.230769,0.838384,-0.418803,-0.610811,-0.507463,-0.575758,-0.217391,0.736842,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0
50%,191.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.93895,-0.93895,-0.93895,...,-0.978029,-0.978029,-1.0,-0.061728,-0.384615,-0.289913,-0.525424,0.035714,0.753289,-0.08642,-0.384615,-0.287736,-0.517241,0.035714,0.736842,0.030928,-0.1875,-0.247863,-0.428571,0.340659,0.878788,-0.247863,-0.459459,-0.358209,-0.515152,-0.072464,0.789474,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0
75%,289.25,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.93895,-0.93895,-0.93895,...,-0.968681,-0.968681,-1.0,0.08642,-0.193846,-0.132075,-0.389831,0.214286,0.842105,0.08642,-0.215385,-0.132075,-0.37931,0.214286,0.842105,0.237113,0.0,-0.076923,-0.357143,0.494505,0.919192,-0.094017,-0.318919,-0.179104,-0.333333,0.15942,0.894737,-0.773913,-0.754601,-0.801527,-0.823529,-0.785714,-0.939394,-0.804348,-0.763835,-0.840119,-0.817204,-0.784374,-0.938144,0.0
max,384.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,...,1.0,1.0,-1.0,1.0,1.0,1.0,1.0,0.964286,1.0,1.0,1.0,1.0,1.0,0.964286,1.0,1.0,1.0,1.0,1.0,0.978022,1.0,1.0,0.632432,1.0,1.0,1.0,1.0,1.0,0.91411,1.0,1.0,0.333333,1.0,1.0,1.0,1.0,1.0,0.32966,1.0,1.0


3.2 Setting ICU to 1 if the patient went to the ICU at any point, applied to the first window of each patient.


In [233]:
dados_limpos = dados_limpos.groupby("PATIENT_VISIT_IDENTIFIER").apply(prepare_window).set_index('PATIENT_VISIT_IDENTIFIER').reset_index()
dados_limpos

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_ABOVE65,AGE_PERCENTIL,GENDER,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,HTN,IMMUNOCOMPROMISED,OTHER,ALBUMIN_MEDIAN,ALBUMIN_MEAN,ALBUMIN_MIN,ALBUMIN_MAX,ALBUMIN_DIFF,BE_ARTERIAL_MEDIAN,BE_ARTERIAL_MEAN,BE_ARTERIAL_MIN,BE_ARTERIAL_MAX,BE_ARTERIAL_DIFF,BE_VENOUS_MEDIAN,BE_VENOUS_MEAN,BE_VENOUS_MIN,BE_VENOUS_MAX,BE_VENOUS_DIFF,BIC_ARTERIAL_MEDIAN,BIC_ARTERIAL_MEAN,BIC_ARTERIAL_MIN,BIC_ARTERIAL_MAX,BIC_ARTERIAL_DIFF,BIC_VENOUS_MEDIAN,BIC_VENOUS_MEAN,BIC_VENOUS_MIN,BIC_VENOUS_MAX,BIC_VENOUS_DIFF,BILLIRUBIN_MEDIAN,BILLIRUBIN_MEAN,...,DIMER_MAX,DIMER_DIFF,BLOODPRESSURE_DIASTOLIC_MEAN,BLOODPRESSURE_SISTOLIC_MEAN,HEART_RATE_MEAN,RESPIRATORY_RATE_MEAN,TEMPERATURE_MEAN,OXYGEN_SATURATION_MEAN,BLOODPRESSURE_DIASTOLIC_MEDIAN,BLOODPRESSURE_SISTOLIC_MEDIAN,HEART_RATE_MEDIAN,RESPIRATORY_RATE_MEDIAN,TEMPERATURE_MEDIAN,OXYGEN_SATURATION_MEDIAN,BLOODPRESSURE_DIASTOLIC_MIN,BLOODPRESSURE_SISTOLIC_MIN,HEART_RATE_MIN,RESPIRATORY_RATE_MIN,TEMPERATURE_MIN,OXYGEN_SATURATION_MIN,BLOODPRESSURE_DIASTOLIC_MAX,BLOODPRESSURE_SISTOLIC_MAX,HEART_RATE_MAX,RESPIRATORY_RATE_MAX,TEMPERATURE_MAX,OXYGEN_SATURATION_MAX,BLOODPRESSURE_DIASTOLIC_DIFF,BLOODPRESSURE_SISTOLIC_DIFF,HEART_RATE_DIFF,RESPIRATORY_RATE_DIFF,TEMPERATURE_DIFF,OXYGEN_SATURATION_DIFF,BLOODPRESSURE_DIASTOLIC_DIFF_REL,BLOODPRESSURE_SISTOLIC_DIFF_REL,HEART_RATE_DIFF_REL,RESPIRATORY_RATE_DIFF_REL,TEMPERATURE_DIFF_REL,OXYGEN_SATURATION_DIFF_REL,WINDOW,ICU
0,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.000000,-1.000000,-1.000000,-1.000000,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.938950,-0.938950,...,-0.994912,-1.0,0.086420,-0.230769,-0.283019,-0.593220,-0.285714,0.736842,0.086420,-0.230769,-0.283019,-0.586207,-0.285714,0.736842,0.237113,0.0000,-0.162393,-0.500000,0.208791,0.898990,-0.247863,-0.459459,-0.432836,-0.636364,-0.420290,0.736842,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,1
1,2,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.000000,-1.000000,-1.000000,-1.000000,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.938950,-0.938950,...,-0.978029,-1.0,-0.489712,-0.685470,-0.048218,-0.645951,0.357143,0.935673,-0.506173,-0.815385,-0.056604,-0.517241,0.357143,0.947368,-0.525773,-0.5125,-0.111111,-0.714286,0.604396,0.959596,-0.435897,-0.491892,0.000000,-0.575758,0.101449,1.000000,-0.547826,-0.533742,-0.603053,-0.764706,-1.000000,-0.959596,-0.515528,-0.351328,-0.747001,-0.756272,-1.000000,-0.961262,0-2,1
2,3,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.263158,-0.263158,-0.263158,-0.263158,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.000000,-1.000000,-1.000000,-1.000000,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.972789,-0.972789,...,-0.978029,-1.0,0.012346,-0.369231,-0.528302,-0.457627,-0.285714,0.684211,0.012346,-0.369231,-0.528302,-0.448276,-0.285714,0.684211,0.175258,-0.1125,-0.384615,-0.357143,0.208791,0.878788,-0.299145,-0.556757,-0.626866,-0.515152,-0.420290,0.684211,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,0
3,4,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.000000,-1.000000,-1.000000,-1.000000,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.935113,-0.935113,...,-1.000000,-1.0,0.333333,-0.153846,0.160377,-0.593220,0.285714,0.868421,0.333333,-0.153846,0.160377,-0.586207,0.285714,0.868421,0.443299,0.0000,0.196581,-0.571429,0.538462,0.939394,-0.076923,-0.351351,-0.044776,-0.575758,0.072464,0.894737,-1.000000,-0.877301,-0.923664,-0.882353,-0.952381,-0.979798,-1.000000,-0.883669,-0.956805,-0.870968,-0.953536,-0.980333,0-2,0
4,5,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.000000,-1.000000,-1.000000,-1.000000,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.938950,-0.938950,...,-1.000000,-1.0,-0.037037,-0.538462,-0.537736,-0.525424,-0.196429,0.815789,-0.037037,-0.538462,-0.537736,-0.517241,-0.196429,0.815789,0.030928,-0.3750,-0.401709,-0.428571,0.252747,0.919192,-0.247863,-0.567568,-0.626866,-0.575758,-0.333333,0.842105,-0.826087,-0.754601,-0.984733,-1.000000,-0.976190,-0.979798,-0.860870,-0.714460,-0.986481,-1.000000,-0.975891,-0.980129,0-2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
347,380,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.578947,-0.578947,-0.578947,-0.578947,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.000000,-1.000000,-1.000000,-1.000000,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.293564,-0.293564,...,-0.978029,-1.0,-0.160494,-0.692308,0.339623,-0.457627,0.142857,0.736842,-0.160494,-0.692308,0.339623,-0.448276,0.142857,0.736842,0.030928,-0.3750,0.401709,-0.357143,0.472527,0.898990,-0.418803,-0.783784,0.059701,-0.515152,-0.072464,0.736842,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,1
348,381,1,Above 90th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.000000,-1.000000,-1.000000,-1.000000,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.938950,-0.938950,...,-0.978029,-1.0,-0.407407,-0.692308,-0.283019,-0.457627,-0.059524,0.526316,-0.407407,-0.692308,-0.283019,-0.448276,-0.250000,0.526316,-0.175258,-0.3750,-0.162393,-0.357143,0.230769,0.818182,-0.589744,-0.783784,-0.432836,-0.515152,0.072464,0.526316,-1.000000,-1.000000,-1.000000,-1.000000,-0.619048,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-0.612627,-1.000000,0-2,0
349,382,0,50th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.000000,-1.000000,-1.000000,-1.000000,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.938950,-0.938950,...,-0.964461,-1.0,0.012346,-0.384615,-0.320755,-0.457627,-0.071429,0.894737,0.012346,-0.384615,-0.320755,-0.448276,-0.071429,0.894737,0.175258,-0.1250,-0.196581,-0.357143,0.340659,0.959596,-0.299145,-0.567568,-0.462687,-0.515152,-0.246377,0.894737,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,1
350,383,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.000000,-1.000000,-1.000000,-1.000000,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.938950,-0.938950,...,-0.978029,-1.0,0.086420,-0.230769,-0.301887,-0.661017,-0.107143,0.736842,0.086420,-0.230769,-0.301887,-0.655172,-0.107143,0.736842,0.237113,0.0000,-0.179487,-0.571429,0.318681,0.898990,-0.247863,-0.459459,-0.447761,-0.696970,-0.275362,0.736842,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,0


3.3 Handling non-numeric columns

In [234]:
dados_limpos.select_dtypes(object)

Unnamed: 0,AGE_PERCENTIL,WINDOW
0,60th,0-2
1,10th,0-2
2,40th,0-2
3,10th,0-2
4,10th,0-2
...,...,...
347,40th,0-2
348,Above 90th,0-2
349,50th,0-2
350,40th,0-2


3.4 We chose to binarize (one-hot encode) the AGE_PERCENTIL column


In [235]:
dados_limpos.drop('WINDOW', axis=1, inplace=True)

In [236]:
coluna_para_binarizar = dados_limpos.select_dtypes(object).columns
coluna_para_binarizar

Index(['AGE_PERCENTIL'], dtype='object')

In [237]:
dados_limpos_binarizado = pd.get_dummies(dados_limpos, columns = coluna_para_binarizar)      # one-hot encoding
dados_limpos_binarizado.head()

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_ABOVE65,GENDER,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,HTN,IMMUNOCOMPROMISED,OTHER,ALBUMIN_MEDIAN,ALBUMIN_MEAN,ALBUMIN_MIN,ALBUMIN_MAX,ALBUMIN_DIFF,BE_ARTERIAL_MEDIAN,BE_ARTERIAL_MEAN,BE_ARTERIAL_MIN,BE_ARTERIAL_MAX,BE_ARTERIAL_DIFF,BE_VENOUS_MEDIAN,BE_VENOUS_MEAN,BE_VENOUS_MIN,BE_VENOUS_MAX,BE_VENOUS_DIFF,BIC_ARTERIAL_MEDIAN,BIC_ARTERIAL_MEAN,BIC_ARTERIAL_MIN,BIC_ARTERIAL_MAX,BIC_ARTERIAL_DIFF,BIC_VENOUS_MEDIAN,BIC_VENOUS_MEAN,BIC_VENOUS_MIN,BIC_VENOUS_MAX,BIC_VENOUS_DIFF,BILLIRUBIN_MEDIAN,BILLIRUBIN_MEAN,BILLIRUBIN_MIN,...,BLOODPRESSURE_SISTOLIC_MEDIAN,HEART_RATE_MEDIAN,RESPIRATORY_RATE_MEDIAN,TEMPERATURE_MEDIAN,OXYGEN_SATURATION_MEDIAN,BLOODPRESSURE_DIASTOLIC_MIN,BLOODPRESSURE_SISTOLIC_MIN,HEART_RATE_MIN,RESPIRATORY_RATE_MIN,TEMPERATURE_MIN,OXYGEN_SATURATION_MIN,BLOODPRESSURE_DIASTOLIC_MAX,BLOODPRESSURE_SISTOLIC_MAX,HEART_RATE_MAX,RESPIRATORY_RATE_MAX,TEMPERATURE_MAX,OXYGEN_SATURATION_MAX,BLOODPRESSURE_DIASTOLIC_DIFF,BLOODPRESSURE_SISTOLIC_DIFF,HEART_RATE_DIFF,RESPIRATORY_RATE_DIFF,TEMPERATURE_DIFF,OXYGEN_SATURATION_DIFF,BLOODPRESSURE_DIASTOLIC_DIFF_REL,BLOODPRESSURE_SISTOLIC_DIFF_REL,HEART_RATE_DIFF_REL,RESPIRATORY_RATE_DIFF_REL,TEMPERATURE_DIFF_REL,OXYGEN_SATURATION_DIFF_REL,ICU,AGE_PERCENTIL_10th,AGE_PERCENTIL_20th,AGE_PERCENTIL_30th,AGE_PERCENTIL_40th,AGE_PERCENTIL_50th,AGE_PERCENTIL_60th,AGE_PERCENTIL_70th,AGE_PERCENTIL_80th,AGE_PERCENTIL_90th,AGE_PERCENTIL_Above 90th
0,0,1,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.93895,-0.93895,-0.93895,...,-0.230769,-0.283019,-0.586207,-0.285714,0.736842,0.237113,0.0,-0.162393,-0.5,0.208791,0.89899,-0.247863,-0.459459,-0.432836,-0.636364,-0.42029,0.736842,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1,0,0,0,0,0,1,0,0,0,0
1,2,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.93895,-0.93895,-0.93895,...,-0.815385,-0.056604,-0.517241,0.357143,0.947368,-0.525773,-0.5125,-0.111111,-0.714286,0.604396,0.959596,-0.435897,-0.491892,0.0,-0.575758,0.101449,1.0,-0.547826,-0.533742,-0.603053,-0.764706,-1.0,-0.959596,-0.515528,-0.351328,-0.747001,-0.756272,-1.0,-0.961262,1,1,0,0,0,0,0,0,0,0,0
2,3,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.263158,-0.263158,-0.263158,-0.263158,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.972789,-0.972789,-0.972789,...,-0.369231,-0.528302,-0.448276,-0.285714,0.684211,0.175258,-0.1125,-0.384615,-0.357143,0.208791,0.878788,-0.299145,-0.556757,-0.626866,-0.515152,-0.42029,0.684211,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0,0,0,0,1,0,0,0,0,0,0
3,4,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.935113,-0.935113,-0.935113,...,-0.153846,0.160377,-0.586207,0.285714,0.868421,0.443299,0.0,0.196581,-0.571429,0.538462,0.939394,-0.076923,-0.351351,-0.044776,-0.575758,0.072464,0.894737,-1.0,-0.877301,-0.923664,-0.882353,-0.952381,-0.979798,-1.0,-0.883669,-0.956805,-0.870968,-0.953536,-0.980333,0,1,0,0,0,0,0,0,0,0,0
4,5,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.93895,-0.93895,-0.93895,...,-0.538462,-0.537736,-0.517241,-0.196429,0.815789,0.030928,-0.375,-0.401709,-0.428571,0.252747,0.919192,-0.247863,-0.567568,-0.626866,-0.575758,-0.333333,0.842105,-0.826087,-0.754601,-0.984733,-1.0,-0.97619,-0.979798,-0.86087,-0.71446,-0.986481,-1.0,-0.975891,-0.980129,0,1,0,0,0,0,0,0,0,0,0


In [238]:
dados_limpos_binarizado.shape

(352, 239)

###<font color='blue'> **4. METHODOLOGY FOR APPLYING MACHINE LEARNING MODELS**
Two datasets will be analyzed: one built with the Sírio-Libanês methodology for selecting each patient's characteristics, and another obtained by removing highly correlated variables.

In [240]:
dataset_sirio = dados_limpos_binarizado.copy()

In [241]:
dataset_sirio.shape

(352, 239)

In [242]:
'''
This is the Sírio-Libanês function, which selects the variables they consider most relevant.
'''

def makebio_df(df:pd.DataFrame):

  # Mean arterial pressure: diastolic pressure weighted twice
  df["BLOODPRESSURE_ARTERIAL_MEAN"] = (df['BLOODPRESSURE_SISTOLIC_MEAN'] + 2*df['BLOODPRESSURE_DIASTOLIC_MEAN'])/3
 
  # Neutrophil-to-lymphocyte ratio
  df["NEUTROPHILES/LINFOCITOS"] = df['NEUTROPHILES_MEAN']/df['LINFOCITOS_MEAN']

  # GASO flags whether an arterial blood gas value is known for the patient (forward-filled per patient)
  df["GASO"] = df.groupby("PATIENT_VISIT_IDENTIFIER")["P02_ARTERIAL_MEAN"].ffill()
  df["GASO"] = (~df["GASO"].isna()).astype(int)

  return df[["ICU",
               "AGE_ABOVE65", 
               "GENDER", 
               "BLOODPRESSURE_ARTERIAL_MEAN", 
               "RESPIRATORY_RATE_MAX", 
               "HTN", 
               "DISEASE GROUPING 1",
               "DISEASE GROUPING 2",
               "DISEASE GROUPING 3",
               "DISEASE GROUPING 4",
               "DISEASE GROUPING 5",
               "DISEASE GROUPING 6",
               "NEUTROPHILES/LINFOCITOS",
               "GASO",
               "OXYGEN_SATURATION_MIN",
               "HEART_RATE_MAX",
               "PCR_MEAN",
               "CREATININ_MEAN"]]

In [243]:
dataset_sirio = makebio_df(dataset_sirio)
dataset_sirio.head()

Unnamed: 0,ICU,AGE_ABOVE65,GENDER,BLOODPRESSURE_ARTERIAL_MEAN,RESPIRATORY_RATE_MAX,HTN,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,NEUTROPHILES/LINFOCITOS,GASO,OXYGEN_SATURATION_MIN,HEART_RATE_MAX,PCR_MEAN,CREATININ_MEAN
0,1,1,0,-0.01931,-0.636364,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.949515,1,0.89899,-0.432836,-0.875236,-0.868365
1,1,0,0,-0.554965,-0.575758,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.45445,1,0.959596,0.0,-0.939887,-0.912243
2,0,0,1,-0.114846,-0.515152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.938541,1,0.878788,-0.626866,-0.503592,-0.968861
3,0,0,0,0.17094,-0.575758,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.267746,1,0.939394,-0.044776,-0.990926,-0.913659
4,0,0,0,-0.204179,-0.575758,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.48741,1,0.919192,-0.626866,-0.997732,-0.891012
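
To make the derived columns computed in `makebio_df` more concrete, here is a small sketch on made-up values. The toy frame and its numbers are assumptions for illustration only, not hospital data. It reproduces the mean arterial pressure column and the `GASO` flag, which is 1 once an arterial blood gas value has been observed for that patient.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two patients, two rows each.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 1, 1],
    "BLOODPRESSURE_SISTOLIC_MEAN": [0.3, 0.0, -0.3, 0.6],
    "BLOODPRESSURE_DIASTOLIC_MEAN": [0.0, 0.3, 0.0, 0.0],
    "P02_ARTERIAL_MEAN": [np.nan, -0.2, np.nan, np.nan],
})

# Mean arterial pressure: diastolic pressure weighted twice.
toy["BLOODPRESSURE_ARTERIAL_MEAN"] = (
    toy["BLOODPRESSURE_SISTOLIC_MEAN"] + 2 * toy["BLOODPRESSURE_DIASTOLIC_MEAN"]
) / 3

# Forward-fill within each patient, then flag rows where a value is known.
filled = toy.groupby("PATIENT_VISIT_IDENTIFIER")["P02_ARTERIAL_MEAN"].ffill()
toy["GASO"] = filled.notna().astype(int)

gaso_flags = toy["GASO"].tolist()
print(toy[["PATIENT_VISIT_IDENTIFIER", "BLOODPRESSURE_ARTERIAL_MEAN", "GASO"]])
```

Patient 0 gets the flag only from the row where the measurement first appears, and patient 1, who has no measurement at all, never gets it.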


In [244]:
dataset_sirio.shape

(352, 18)

In [245]:
## Export and save the resulting dataset
dataset_sirio.to_csv("dataset_sirio.csv", index=False)

###<font color='blue'> **5. CORRELATION ANALYSIS TO BUILD THE SECOND DATASET USED FOR MACHINE LEARNING**

In [246]:
### Function to remove highly correlated variables
def remove_alta_correlacao(dados, valor_corte):

  # Absolute pairwise correlations, computed from column 24 up to (but excluding) the last column
  matriz_correlacao = dados.iloc[:, 24:-1].corr().abs()  

  # Keep only the upper triangle so each pair of variables is evaluated once
  matriz_superior = matriz_correlacao.where(np.triu(np.ones_like(matriz_correlacao, dtype=bool), k=1))

  excluir = []
  for coluna in matriz_superior.columns:
    if any(matriz_superior[coluna] > valor_corte):
      excluir.append(coluna)

  print(f"{len(excluir)} variables have a correlation above {valor_corte} and will therefore be removed")

  return dados.drop(excluir, axis=1)

In [247]:
### Removing the highly correlated variables from the dataset
dados_limpos_binarizado_removido = remove_alta_correlacao(dados_limpos_binarizado, 0.9)

126 variables have a correlation above 0.9 and will therefore be removed
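
The upper-triangle filter can be easier to follow on a tiny example. The sketch below applies the same idea to a synthetic frame (the column names `a`, `a_copy` and `b` are hypothetical): for each pair whose absolute correlation exceeds the cutoff, the column that appears later is dropped and the earlier one is kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

# Synthetic data: "a_copy" is almost a linear function of "a"; "b" is independent.
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 0.01 * rng.normal(size=200),
    "b": b,
})

corr = toy.corr().abs()
# Mask everything except the entries strictly above the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it is highly correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Because only the entries above the diagonal are inspected, `a` survives and `a_copy` is removed, so one representative of each correlated group remains.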


In [248]:
## Export and save the second dataset
dados_limpos_binarizado_removido.to_csv("dados_limpos_binarizado_removido.csv", index=False)

###<font color='blue'> **6. DATASETS USED IN THE MACHINE LEARNING MODELS**

The first dataset uses the selection suggested by Sírio-Libanês itself, resulting in 18 variables (including the `ICU` target). The second dataset was obtained after the cleaning and correlation filtering performed above, resulting in 113 variables.

In [181]:
dataset_sirio.shape

(352, 18)

In [180]:
dados_limpos_binarizado_removido.shape

(352, 113)

###<font color='blue'> **7. MACHINE LEARNING MODELS**

The following models will be tested:
1. LogisticRegression;
2. DecisionTreeClassifier;
3. RandomForestClassifier;
4. GradientBoostingClassifier;
5. DummyClassifier (baseline; it appears under the label `DummyRegressor` in the result tables).

The models are compared using AUC (Area Under the ROC Curve), which summarizes the trade-off between the true positive rate and the false positive rate across all classification thresholds.
 
The closer the AUC is to 1, the better the model ranks ICU cases above non-ICU cases; a value near 0.5 corresponds to random guessing.
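
As an illustration of what the AUC measures, the sketch below computes it directly from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive case receives the higher score. The helper `auc_by_pairs` and the four toy scores are hypothetical and exist only for this example; `roc_auc_score` from scikit-learn returns the same quantity.

```python
from itertools import product


def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for illustration).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = auc_by_pairs(y_true, scores)
print(auc_value)  # 0.75: three of the four positive/negative pairs are ordered correctly
```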

In [251]:
### Function that screens several classification models on the data using n random stratified train/test splits.
def roda_modelos_selecao_unica(dados: pd.DataFrame, n: int):
    dados = dados.sample(frac=1).reset_index(drop=True)
    y = dados['ICU']
    x = dados.drop(['ICU'], axis=1)
    
    modelo_dummy =          DummyClassifier()
    modelo_logit =          LogisticRegression(max_iter=10000)
    modelo_arvore =         DecisionTreeClassifier()
    modelo_random_forest =  RandomForestClassifier(n_estimators=100)
    modelo_gradient_boost = GradientBoostingClassifier(n_estimators=100)
  
    lista_modelos = [modelo_dummy, modelo_logit, modelo_arvore, modelo_random_forest, modelo_gradient_boost]
    nome_modelos = ['DummyRegressor', 'LogisticRegression', 'DecisionTreeClassifier', 'RandomForestClassifier',
                    'GradientBoostingClassifier']
    
    
    df = pd.DataFrame(columns=['Modelo', f'AUC_Mean{n}', f'IC_Min{n}', f'IC_Max{n}'])
    lines = []
    for modelo in range(len(lista_modelos)):
        auc_lista = []
        for _ in range(n):
            x_train, x_test, y_train, y_test = train_test_split(x, y, stratify = y)
            lista_modelos[modelo].fit(x_train, y_train)
            prob_predict = lista_modelos[modelo].predict_proba(x_test)
            auc = roc_auc_score(y_test, prob_predict[:,1])
            auc_lista.append(auc)
        auc_medio = np.mean(auc_lista)
        auc_std = np.std(auc_lista)
        lines.append([nome_modelos[modelo], auc_medio, auc_medio - 2* auc_std, auc_medio + 2* auc_std])
    for i in range(len(lines)):
        df.loc[i] = lines[i]
    
    return df.sort_values(f'AUC_Mean{n}', ascending=False).reset_index(drop=True)
    
modelo_random_forest =  RandomForestClassifier(n_estimators=100)
modelo_gradient_boost = GradientBoostingClassifier(n_estimators=100)
modelo_logit = LogisticRegression()

###<font color='blue'> **8. MACHINE LEARNING MODELS ON THE DATASETS - CROSS VALIDATION**

In [253]:
### Dataset with 18 variables
Modelos_SIRIO = roda_modelos_selecao_unica(dataset_sirio, 50)
Modelos_SIRIO

Unnamed: 0,Modelo,AUC_Mean50,IC_Min50,IC_Max50
0,RandomForestClassifier,0.766331,0.674365,0.858297
1,LogisticRegression,0.754209,0.659944,0.848473
2,GradientBoostingClassifier,0.740903,0.63975,0.842055
3,DecisionTreeClassifier,0.630633,0.533999,0.727267
4,DummyRegressor,0.489289,0.374338,0.60424


In [254]:
roda_cross_validate_modelos([modelo_random_forest, modelo_gradient_boost, modelo_logit], dataset_sirio,
                           5, 10)

Unnamed: 0,Modelo,AUC_Mean,IC_Min,IC_Max
0,"RandomForestClassifier(bootstrap=True, ccp_alp...",0.776009,0.67352,0.878499
1,"GradientBoostingClassifier(ccp_alpha=0.0, crit...",0.752605,0.657255,0.847955
2,"LogisticRegression(C=1.0, class_weight=None, d...",0.746286,0.632394,0.860179
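
The definition of `roda_cross_validate_modelos` does not appear in this section. As a rough sketch of what such a helper might do, the code below assumes that the arguments `5, 10` mean 5 folds repeated 10 times, and that the interval is the mean plus or minus two standard deviations, as in `roda_modelos_selecao_unica`. The name `cv_auc_summary` and the synthetic data are hypothetical; this is not the notebook's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate


def cv_auc_summary(model, X, y, n_splits=5, n_repeats=10):
    """Hypothetical helper: mean AUC and a mean +/- 2*std band over repeated stratified folds."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
    scores = cross_validate(model, X, y, cv=cv, scoring="roc_auc")["test_score"]
    mean, std = scores.mean(), scores.std()
    return mean, mean - 2 * std, mean + 2 * std


# Synthetic stand-in data (not the hospital dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

auc_mean, ic_min, ic_max = cv_auc_summary(LogisticRegression(max_iter=1000), X, y)
print(f"AUC {auc_mean:.3f} [{ic_min:.3f}, {ic_max:.3f}]")
```

Compared with independent random splits, repeated stratified k-fold uses every row as test data the same number of times, which tends to give a steadier estimate on a dataset with only 352 rows.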


In [256]:
### Dataset with 113 variables
modelo_Dataset_limpo = roda_modelos_selecao_unica(dados_limpos_binarizado_removido, 50)
modelo_Dataset_limpo

Unnamed: 0,Modelo,AUC_Mean50,IC_Min50,IC_Max50
0,RandomForestClassifier,0.793051,0.716608,0.869495
1,GradientBoostingClassifier,0.769704,0.699507,0.839901
2,LogisticRegression,0.759875,0.671515,0.848236
3,DecisionTreeClassifier,0.591074,0.502122,0.680027
4,DummyRegressor,0.501417,0.407113,0.59572


In [257]:
roda_cross_validate_modelos([modelo_random_forest, modelo_gradient_boost, modelo_logit], dados_limpos_binarizado_removido,
                           5, 10)

Unnamed: 0,Modelo,AUC_Mean,IC_Min,IC_Max
0,"RandomForestClassifier(bootstrap=True, ccp_alp...",0.797866,0.712462,0.883271
1,"GradientBoostingClassifier(ccp_alpha=0.0, crit...",0.771154,0.66732,0.874987
2,"LogisticRegression(C=1.0, class_weight=None, d...",0.749383,0.655862,0.842903


###<font color='blue'> **9. USING RANDOMIZEDSEARCHCV TO IMPROVE THE HYPERPARAMETERS (Hyperparameter Tuning)**

RandomizedSearchCV samples a fixed number of hyperparameter combinations from the supplied grid, evaluates each one with cross-validation, and reports the best-performing combination.

In [258]:
# Splitting the data into training and test sets, using the dataset with more variables,
# since it achieved the better AUC results.
X, y = dados_limpos_binarizado_removido.drop('ICU',axis=1), dados_limpos_binarizado_removido['ICU']
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify = y)

In [259]:
# Number of trees
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features considered at each split
# ('auto' is equivalent to 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['auto', 'sqrt']
# Maximum depth of each tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at a leaf node
min_samples_leaf = [1, 2, 4]
# Whether each tree is trained on a bootstrap sample
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
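
To see how much of this grid a randomized search actually explores, the sketch below counts the distinct combinations using only the standard library. The lists are rebuilt with `range` so the snippet runs on its own; they are intended to hold the same candidate values as the grid above.

```python
from math import prod

# Same candidate values as the grid above, rebuilt without NumPy.
grid = {
    "n_estimators": list(range(200, 2001, 200)),    # 10 values: 200, 400, ..., 2000
    "max_features": ["auto", "sqrt"],
    "max_depth": list(range(10, 111, 10)) + [None],  # 11 depths plus None
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

total_combinations = prod(len(values) for values in grid.values())
n_iter = 100  # number of candidates sampled by the randomized search
coverage = n_iter / total_combinations

print(total_combinations)   # 4320
print(f"{coverage:.1%}")    # about 2.3% of the grid
```

With 100 sampled candidates, only a small share of the 4320 combinations is evaluated, which is the trade-off a randomized search makes for a shorter run time.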

In [260]:
modelo_random_forest = RandomForestClassifier()

In [261]:
rf_random = RandomizedSearchCV(estimator = modelo_random_forest, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1,
                               return_train_score=True)

In [262]:
rf_random.fit(x_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   55.5s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  7.1min finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [264]:
# No `scoring` was passed to RandomizedSearchCV, so score() reports accuracy here, not AUC
print(f'Train score : {rf_random.score(x_train, y_train):.3f}')
print(f'Test score : {rf_random.score(x_test, y_test):.3f}')

Train score : 1.000
Test score : 0.636
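
Since the models in this notebook are compared by AUC, one option is to have the search optimize AUC as well. The sketch below is an assumption about how that could look, not what the notebook ran: it passes `scoring="roc_auc"` to `RandomizedSearchCV` on synthetic data, with a deliberately small grid so it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (not the hospital dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    cv=3,
    scoring="roc_auc",  # select candidates by AUC instead of the default accuracy
    random_state=0,
)
search.fit(x_train, y_train)

# With scoring="roc_auc", score() returns the AUC on the data it is given.
test_auc = search.score(x_test, y_test)
manual_auc = roc_auc_score(y_test, search.predict_proba(x_test)[:, 1])
print(f"CV AUC: {search.best_score_:.3f} | test AUC: {test_auc:.3f}")
```

With this setting, `best_score_` and `score()` are on the same scale as the AUC tables, which makes the tuned and untuned models directly comparable.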


In [265]:
modelo_random_forest_param = rf_random.best_estimator_

In [266]:
modelo_random_forest_param

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=100, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=800,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [268]:
modelos = [modelo_random_forest, modelo_random_forest_param]
df_comparacao_parametros = roda_cross_validate_modelos(modelos, dados_limpos_binarizado_removido, 5, 10)

In [269]:
df_comparacao_parametros

Unnamed: 0,Modelo,AUC_Mean,IC_Min,IC_Max
0,"RandomForestClassifier(bootstrap=True, ccp_alp...",0.797866,0.712462,0.883271
1,"(DecisionTreeClassifier(ccp_alpha=0.0, class_w...",0.789326,0.701116,0.877536


###<font color='blue'> **10. RESULTS AND CONCLUSION**

1. RandomForestClassifier obtained the best results on both datasets: AUC = 0.797866 on the 113-variable dataset (from `roda_cross_validate_modelos`) and AUC = 0.766331 on the 18-variable Sírio dataset (from the 50 random splits; 0.776009 from `roda_cross_validate_modelos`).

2. Tuning the hyperparameters with RandomizedSearchCV gave a slightly lower AUC (0.789326 versus 0.797866 for the default RandomForestClassifier), a difference well within the reported intervals.

3. This work, despite its importance, served as my starting point in Machine Learning. Much better results can likely be obtained with other methodologies; I chose to use what I learned during the Bootcamp. There is still a long road ahead. Onward, always!

