# Introduction



## Our challenge

Predict ICU admission for confirmed COVID-19 cases based on the available data. Is it feasible to predict which patients will need intensive care unit support?  
The goal is to give tertiary and quaternary hospitals the most accurate answer possible, so that ICU resources can be arranged or the patient's transfer can be scheduled.  

## Understanding our data

We use the [COVID-19 dataset](https://www.kaggle.com/S%C3%ADrio-Libanes/covid19) published on Kaggle by Hospital Sírio-Libanês.

The data is organized into event windows: each row holds the data of one patient in a given time window since admission.

Window   | Description
---------|---------------------------------
0-2      | From 0 to 2 hours after admission.
2-4      | From 2 to 4 hours after admission.
4-6      | From 4 to 6 hours after admission.
6-12     | From 6 to 12 hours after admission.
ABOVE_12 | More than 12 hours after admission.

Examples:

![Example 01](https://raw.githubusercontent.com/LucasGabrielB/Previsao-de-internacao-na-UTI-para-casos-de-COVID19/main/Imagem%20exemplo%2001.png)  

![Example 02](https://raw.githubusercontent.com/LucasGabrielB/Previsao-de-internacao-na-UTI-para-casos-de-COVID19/main/Imagem%20exemplo%2002.png)  


## Feature engineering
The data was processed as follows:

* Two new features were created for the models:

    * BLOODPRESSURE_ARTERIAL_MEAN = (BLOODPRESSURE_SISTOLIC_MEAN + 2* BLOODPRESSURE_DIASTOLIC_MEAN)/3
  
    * NEUTROPHILES/LINFOCITOS = NEUTROPHILES_MEAN/LINFOCITOS_MEAN

    These features are based on the articles below:  
[BLOODPRESSURE ARTERIAL MEAN, Nature, Aug/2020](https://www.nature.com/articles/s41440-020-00541-w)   
[NEUTROPHILES/LINFOCITOS, Revista Brasileira de Análises Clínicas, Aug/2020](http://www.rbac.org.br/artigos/covid-19-e-o-laboratorio-de-hematologia-uma-revisao-da-literatura-recente/)

* If the patient went to the ICU in any time window, the ICU value of that patient's first window (0-2) was set to 1, because we want to know already in the first window whether the patient will need the ICU.

* One-hot encoding was applied to the AGE_PERCENTIL column to make the machine learning process more accurate.

* Features with correlations above 0.95 were removed.

* Patients whose ICU column was already 1 in the first window (0-2) of the original data were discarded.

* Patients with no features filled in were discarded.

* For patients with NaN values in some features in the 0-2 window but with values in later windows, the gaps were filled using forward-fill and back-fill.
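
The two derived features listed above can be illustrated with a small sketch. The numbers are made-up toy values in clinical units, chosen only so the arithmetic is easy to follow; the columns in the dataset itself appear to be scaled to roughly [-1, 1], so the actual values look different.

```python
import pandas as pd

# Toy values for illustration only (hypothetical, not taken from the dataset).
toy = pd.DataFrame({
    "BLOODPRESSURE_SISTOLIC_MEAN": [120.0, 90.0],
    "BLOODPRESSURE_DIASTOLIC_MEAN": [80.0, 60.0],
    "NEUTROPHILES_MEAN": [6.0, 9.0],
    "LINFOCITOS_MEAN": [2.0, 1.5],
})

# Mean arterial pressure: (systolic + 2 * diastolic) / 3
toy["BLOODPRESSURE_ARTERIAL_MEAN"] = (
    toy["BLOODPRESSURE_SISTOLIC_MEAN"] + 2 * toy["BLOODPRESSURE_DIASTOLIC_MEAN"]
) / 3

# Neutrophil-to-lymphocyte ratio
toy["NEUTROPHILES/LINFOCITOS"] = toy["NEUTROPHILES_MEAN"] / toy["LINFOCITOS_MEAN"]

print(toy[["BLOODPRESSURE_ARTERIAL_MEAN", "NEUTROPHILES/LINFOCITOS"]])
```

Both features are plain column-wise operations, so they can be computed after the rows have been filtered and filled.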


## Results

The chosen model was [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).   
It reached a final mean precision of 73.17%, a ROC AUC of 77%, and a recall of 65.7%.
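
As a reminder of how these metrics relate to each other, here is a minimal sketch that derives accuracy, precision and recall from predictions. The labels are made up for illustration and `classification_metrics` is a hypothetical helper, not part of the notebook.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = ICU, 0 = no ICU)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged ICU
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly flagged non-ICU
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed ICU cases
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For this problem recall deserves particular attention, since a missed ICU case (a false negative) is a patient who needed intensive care and was not flagged.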

# Exploratory analysis

Data:

![Features](https://raw.githubusercontent.com/LucasGabrielB/Previsao-de-internacao-na-UTI-para-casos-de-COVID19/main/Features.png)

In [None]:
!pip install pycaret

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from pycaret.classification import *
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt

In [None]:
# load the data
url = 'https://github.com/LucasGabrielB/Previsao-de-internacao-na-UTI-para-casos-de-COVID19/raw/main/Kaggle_Sirio_Libanes_ICU_Prediction.xlsx?raw=true'

df = pd.read_excel(url)

df.head()

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_ABOVE65,AGE_PERCENTIL,GENDER,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,HTN,IMMUNOCOMPROMISED,OTHER,ALBUMIN_MEDIAN,ALBUMIN_MEAN,ALBUMIN_MIN,ALBUMIN_MAX,ALBUMIN_DIFF,BE_ARTERIAL_MEDIAN,BE_ARTERIAL_MEAN,BE_ARTERIAL_MIN,BE_ARTERIAL_MAX,BE_ARTERIAL_DIFF,BE_VENOUS_MEDIAN,BE_VENOUS_MEAN,BE_VENOUS_MIN,BE_VENOUS_MAX,BE_VENOUS_DIFF,BIC_ARTERIAL_MEDIAN,BIC_ARTERIAL_MEAN,BIC_ARTERIAL_MIN,BIC_ARTERIAL_MAX,BIC_ARTERIAL_DIFF,BIC_VENOUS_MEDIAN,BIC_VENOUS_MEAN,BIC_VENOUS_MIN,BIC_VENOUS_MAX,BIC_VENOUS_DIFF,BILLIRUBIN_MEDIAN,BILLIRUBIN_MEAN,...,DIMER_MAX,DIMER_DIFF,BLOODPRESSURE_DIASTOLIC_MEAN,BLOODPRESSURE_SISTOLIC_MEAN,HEART_RATE_MEAN,RESPIRATORY_RATE_MEAN,TEMPERATURE_MEAN,OXYGEN_SATURATION_MEAN,BLOODPRESSURE_DIASTOLIC_MEDIAN,BLOODPRESSURE_SISTOLIC_MEDIAN,HEART_RATE_MEDIAN,RESPIRATORY_RATE_MEDIAN,TEMPERATURE_MEDIAN,OXYGEN_SATURATION_MEDIAN,BLOODPRESSURE_DIASTOLIC_MIN,BLOODPRESSURE_SISTOLIC_MIN,HEART_RATE_MIN,RESPIRATORY_RATE_MIN,TEMPERATURE_MIN,OXYGEN_SATURATION_MIN,BLOODPRESSURE_DIASTOLIC_MAX,BLOODPRESSURE_SISTOLIC_MAX,HEART_RATE_MAX,RESPIRATORY_RATE_MAX,TEMPERATURE_MAX,OXYGEN_SATURATION_MAX,BLOODPRESSURE_DIASTOLIC_DIFF,BLOODPRESSURE_SISTOLIC_DIFF,HEART_RATE_DIFF,RESPIRATORY_RATE_DIFF,TEMPERATURE_DIFF,OXYGEN_SATURATION_DIFF,BLOODPRESSURE_DIASTOLIC_DIFF_REL,BLOODPRESSURE_SISTOLIC_DIFF_REL,HEART_RATE_DIFF_REL,RESPIRATORY_RATE_DIFF_REL,TEMPERATURE_DIFF_REL,OXYGEN_SATURATION_DIFF_REL,WINDOW,ICU
0,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,0.08642,-0.230769,-0.283019,-0.59322,-0.285714,0.736842,0.08642,-0.230769,-0.283019,-0.586207,-0.285714,0.736842,0.237113,0.0,-0.162393,-0.5,0.208791,0.89899,-0.247863,-0.459459,-0.432836,-0.636364,-0.42029,0.736842,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0-2,0
1,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,0.333333,-0.230769,-0.132075,-0.59322,0.535714,0.578947,0.333333,-0.230769,-0.132075,-0.586207,0.535714,0.578947,0.443299,0.0,-0.025641,-0.5,0.714286,0.838384,-0.076923,-0.459459,-0.313433,-0.636364,0.246377,0.578947,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,2-4,0
2,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.605263,0.605263,0.605263,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.93895,-0.93895,...,-0.994912,-1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4-6,0
3,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,-0.107143,0.736842,,,,,-0.107143,0.736842,,,,,0.318681,0.89899,,,,,-0.275362,0.736842,,,,,-1.0,-1.0,,,,,-1.0,-1.0,6-12,0
4,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-1.0,-0.871658,-0.871658,-0.871658,-0.871658,-1.0,-0.863874,-0.863874,-0.863874,-0.863874,-1.0,-0.317073,-0.317073,-0.317073,-0.317073,-1.0,-0.414634,-0.414634,-0.414634,-0.414634,-1.0,-0.979069,-0.979069,...,-0.996762,-1.0,-0.243021,-0.338537,-0.213031,-0.317859,0.033779,0.665932,-0.283951,-0.376923,-0.188679,-0.37931,0.035714,0.631579,-0.340206,-0.4875,-0.57265,-0.857143,0.098901,0.79798,-0.076923,0.286486,0.298507,0.272727,0.362319,0.947368,-0.33913,0.325153,0.114504,0.176471,-0.238095,-0.818182,-0.389967,0.407558,-0.230462,0.096774,-0.242282,-0.814433,ABOVE_12,1


In [None]:
df.info(memory_usage='deep')

How the need for the ICU behaves across the time windows.
As time passes, more and more patients need the ICU, so it is important to find out already in the first window whether a person will need it, since a delay can lead to complications for the patient.

In [None]:
df_icu_windows = pd.crosstab(df['WINDOW'], df['ICU'], normalize='index') * 100

fig = make_subplots(y_title='%', x_title='Window')

fig.add_trace(go.Bar(
    x=df_icu_windows.index,
    y=df_icu_windows[0],
    name='Not admitted',
))

fig.add_trace(go.Bar(
    x=df_icu_windows.index,
    y=df_icu_windows[1],
    name='Admitted',
))

fig.update_traces(
    hovertemplate='<b>%{y:.2f}%</b>'
)

fig.update_yaxes(range=[0, 100], ticksuffix='%')

fig.update_layout(
    title={
        'text': 'Percentage of ICU admissions across the windows',
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font':{
            'size': 22
        }
    }
)

fig.show()

Let's take a general look at how age may influence the need for ICU admission.

In [None]:
df_icu_age_percentil = pd.crosstab(df['AGE_PERCENTIL'], df['ICU'], normalize='index') * 100

fig = make_subplots(y_title='%', x_title='Age percentile')

fig.add_trace(go.Bar(
    x=df_icu_age_percentil.index,
    y=df_icu_age_percentil[0],
    name='Not admitted',
))

fig.add_trace(go.Bar(
    x=df_icu_age_percentil.index,
    y=df_icu_age_percentil[1],
    name='Admitted',
))

fig.update_traces(
    hovertemplate='<b>%{y:.2f}%</b>'
)

fig.update_yaxes(range=[0, 100], ticksuffix='%')

fig.update_layout(
    title={
        'text': 'Percentage of ICU admissions by patient age',
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font':{
            'size': 22
        }
    }
)

fig.show()

As the correlation heatmap below shows, some columns are highly correlated. It may be worth removing them before training our model, since they add little information.

In [None]:
# keep numeric columns only; AGE_PERCENTIL and WINDOW are strings
corr = df.drop(columns=['ICU']).select_dtypes('number').corr().abs() * 100
mask = np.triu(np.ones_like(corr, dtype=bool))
rLT = corr.mask(mask)

fig = go.Figure(
    data=go.Heatmap(
            z=rLT.values,
            x=rLT.columns,
            y=rLT.index,
            colorscale='hot_r',
            name='')
)

fig.update_layout(
    title={
        'text': 'Correlation heatmap of the data',
        'y':1,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font':{
            'size': 22
        }
    },
    width=1500,
    height=1500,
    yaxis_autorange='reversed',
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig.update_traces(hovertemplate='%{x}<br>----<br>%{y}<br><br>=%{z:.2f}%')

fig.show()
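
The idea of dropping one column from each highly correlated pair can be sketched on toy data. The column names `a`, `a_copy` and `b` are hypothetical; the 0.95 threshold is the one stated in the feature-engineering notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Toy frame: "a_copy" is a linear function of "a", "b" is independent noise.
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + 1.0,
    "b": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates above the threshold with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is dropped.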

## Data pipeline


Based on the observations above, the following pipeline was developed.

In [None]:
class DataFramePipeline():
    ''' Applies all the required transformations to the DataFrame. '''
    
    def __init__(self, data_frame: pd.DataFrame) -> None:
        self._data_frame = data_frame
    

    def run(self) -> pd.DataFrame:
        ''' Runs the whole pipeline and returns the processed DataFrame. '''
        
        self._fill_data_frame()
        
        # drop patients who were already in the ICU in the first window (0-2)
        to_remove = self._data_frame.query('WINDOW == "0-2" and ICU == 1')['PATIENT_VISIT_IDENTIFIER'].values
        self._data_frame = self._data_frame.query('PATIENT_VISIT_IDENTIFIER not in @to_remove').dropna()
        
        self._data_frame = self._data_frame.groupby('PATIENT_VISIT_IDENTIFIER').apply(self._prepare_window)

        self._create_additional_info()

        self._remove_high_corr_colums()

        self._data_frame.reset_index(drop=True, inplace=True)

        return self._data_frame
    

    def _fill_data_frame(self) -> None:
        ''' Fills in the missing values of the DataFrame. '''

        def _fill_na(rows):
            # fill only the rows recorded before ICU admission
            not_icu = rows['ICU'] != 1
            rows.loc[not_icu] = rows.loc[not_icu].ffill().bfill()
            return rows

        continuous_features_columns = self._data_frame.iloc[:, 13:-2].columns
        continuous_features = self._data_frame.groupby('PATIENT_VISIT_IDENTIFIER', as_index=False)[continuous_features_columns.to_list() + ['ICU']].apply(_fill_na)
        continuous_features.drop("ICU", axis=1, inplace=True)

        categorical_features = self._data_frame.iloc[:, :13]

        # use the pipeline's own DataFrame rather than the global `df`
        outputs = self._data_frame.iloc[:, -2:]
        
        final_df = pd.concat([categorical_features, continuous_features, outputs], ignore_index=True, axis=1)
        final_df.columns = self._data_frame.columns
        
        self._data_frame = final_df


    def _create_additional_info(self) -> None:
        ''' Creates the BLOODPRESSURE_ARTERIAL_MEAN and NEUTROPHILES/LINFOCITOS columns. '''

        self._data_frame['BLOODPRESSURE_ARTERIAL_MEAN'] = (self._data_frame['BLOODPRESSURE_SISTOLIC_MEAN'] + 2*self._data_frame['BLOODPRESSURE_DIASTOLIC_MEAN']) / 3
        self._data_frame['NEUTROPHILES/LINFOCITOS'] = self._data_frame['NEUTROPHILES_MEAN'] / self._data_frame['LINFOCITOS_MEAN']


    def _remove_high_corr_colums(self) -> None:
        ''' Removes columns that are highly correlated (>95%). '''
        
        # columns considered important according to the creators of the dataset
        not_remove = [
                'BIC_ARTERIAL_MEAN',
                'UREA_MEAN',
                'TEMPERATURE_MEAN',
                'OXYGEN_SATURATION_MIN',
                'HEART_RATE_MAX',
                'PCR_MEAN',
                'CREATININ_MEAN',
                'BLOODPRESSURE_ARTERIAL_MEAN',
                'NEUTROPHILES/LINFOCITOS',
                'ICU',
                'PATIENT_VISIT_IDENTIFIER'
        ]
        
        # find the columns with correlation above 95%, using the pipeline's own
        # DataFrame (numeric columns only) rather than the global `df`
        corr_matrix = self._data_frame.select_dtypes('number').corr().abs()
        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
        to_remove = [column for column in upper.columns if any(upper[column] > 0.95) and column not in not_remove]
        
        # drop the highly correlated columns
        self._data_frame = self._data_frame.drop(to_remove, axis=1)


    def _prepare_window(self, rows):
        ''' If the patient was admitted to the ICU in any window, sets the value of their first window ("0-2") to 1. '''

        if np.any(rows['ICU']):
            rows.loc[rows['WINDOW'] == '0-2', 'ICU'] = 1
        
        return rows.loc[rows['WINDOW'] == '0-2']
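
The two row-level steps of the pipeline, filling gaps within each patient and moving the ICU outcome onto the first window, can be followed on a toy frame. This is a simplified sketch with made-up values and a single feature column, not a replacement for the class above.

```python
import numpy as np
import pandas as pd

# Two hypothetical patients, three windows each.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 0, 1, 1, 1],
    "WINDOW": ["0-2", "2-4", "4-6", "0-2", "2-4", "4-6"],
    "HEART_RATE_MEAN": [np.nan, 0.1, 0.3, -0.2, np.nan, np.nan],
    "ICU": [0, 0, 1, 0, 0, 0],
})

# Step 1: forward-fill then back-fill within each patient,
# using only rows recorded before ICU admission.
pre_icu = toy["ICU"] != 1
filled = toy.copy()
filled.loc[pre_icu, "HEART_RATE_MEAN"] = (
    toy[pre_icu]
    .groupby("PATIENT_VISIT_IDENTIFIER")["HEART_RATE_MEAN"]
    .transform(lambda s: s.ffill().bfill())
)

# Step 2: keep only the first window and label it with
# "was this patient ever admitted to the ICU?".
ever_icu = filled.groupby("PATIENT_VISIT_IDENTIFIER")["ICU"].transform("max")
first_window = filled[filled["WINDOW"] == "0-2"].copy()
first_window["ICU"] = ever_icu.loc[first_window.index]

print(first_window)
```

Patient 0 ends up with the value from the 2-4 window back-filled into the 0-2 row and an ICU label of 1; patient 1 keeps the original value and a label of 0.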


In [None]:
df_clear = DataFramePipeline(df).run() 

df_clear.head()

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_ABOVE65,AGE_PERCENTIL,GENDER,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,HTN,IMMUNOCOMPROMISED,OTHER,ALBUMIN_MEDIAN,ALBUMIN_DIFF,BE_ARTERIAL_MEDIAN,BE_ARTERIAL_DIFF,BE_VENOUS_MEDIAN,BE_VENOUS_DIFF,BIC_ARTERIAL_MEDIAN,BIC_ARTERIAL_MEAN,BIC_ARTERIAL_DIFF,BIC_VENOUS_MEDIAN,BIC_VENOUS_DIFF,BILLIRUBIN_MEDIAN,BILLIRUBIN_DIFF,BLAST_MEDIAN,BLAST_DIFF,CALCIUM_MEDIAN,CALCIUM_DIFF,CREATININ_MEDIAN,CREATININ_MEAN,CREATININ_DIFF,FFA_MEDIAN,FFA_DIFF,GGT_MEDIAN,GGT_DIFF,GLUCOSE_MEDIAN,GLUCOSE_DIFF,HEMATOCRITE_MEDIAN,...,SODIUM_MEDIAN,SODIUM_DIFF,TGO_MEDIAN,TGO_DIFF,TGP_DIFF,TTPA_MEDIAN,TTPA_DIFF,UREA_MEDIAN,UREA_MEAN,UREA_DIFF,DIMER_MEDIAN,DIMER_DIFF,BLOODPRESSURE_DIASTOLIC_MEAN,BLOODPRESSURE_SISTOLIC_MEAN,HEART_RATE_MEAN,RESPIRATORY_RATE_MEAN,TEMPERATURE_MEAN,OXYGEN_SATURATION_MEAN,OXYGEN_SATURATION_MEDIAN,BLOODPRESSURE_DIASTOLIC_MIN,BLOODPRESSURE_SISTOLIC_MIN,HEART_RATE_MIN,RESPIRATORY_RATE_MIN,TEMPERATURE_MIN,OXYGEN_SATURATION_MIN,BLOODPRESSURE_DIASTOLIC_MAX,BLOODPRESSURE_SISTOLIC_MAX,HEART_RATE_MAX,RESPIRATORY_RATE_MAX,TEMPERATURE_MAX,OXYGEN_SATURATION_MAX,BLOODPRESSURE_DIASTOLIC_DIFF,BLOODPRESSURE_SISTOLIC_DIFF,HEART_RATE_DIFF,RESPIRATORY_RATE_DIFF,TEMPERATURE_DIFF,WINDOW,ICU,BLOODPRESSURE_ARTERIAL_MEAN,NEUTROPHILES/LINFOCITOS
0,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-1.0,-0.317073,-1.0,-0.93895,-1.0,-1.0,-1.0,0.183673,-1.0,-0.868365,-0.868365,-1.0,-0.742004,-1.0,-0.945093,-1.0,-0.891993,-1.0,0.090147,...,-0.028571,-1.0,-0.997201,-1.0,-1.0,-0.825613,-1.0,-0.836145,-0.836145,-1.0,-0.994912,-1.0,0.08642,-0.230769,-0.283019,-0.59322,-0.285714,0.736842,0.736842,0.237113,0.0,-0.162393,-0.5,0.208791,0.89899,-0.247863,-0.459459,-0.432836,-0.636364,-0.42029,0.736842,-1.0,-1.0,-1.0,-1.0,-1.0,0-2,1,-0.01931,0.949515
1,2,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-1.0,-0.317073,-1.0,-0.93895,-1.0,-1.0,-1.0,0.357143,-1.0,-0.912243,-0.912243,-1.0,-0.742004,-1.0,-0.958528,-1.0,-0.780261,-1.0,0.144654,...,0.085714,-1.0,-0.995428,-1.0,-1.0,-0.846633,-1.0,-0.836145,-0.836145,-1.0,-0.978029,-1.0,-0.489712,-0.68547,-0.048218,-0.645951,0.357143,0.935673,0.947368,-0.525773,-0.5125,-0.111111,-0.714286,0.604396,0.959596,-0.435897,-0.491892,0.0,-0.575758,0.101449,1.0,-0.547826,-0.533742,-0.603053,-0.764706,-1.0,0-2,1,-0.554965,0.45445
2,3,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.263158,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-1.0,-0.317073,-1.0,-0.972789,-1.0,-1.0,-1.0,0.326531,-1.0,-0.968861,-0.968861,-1.0,-0.19403,-1.0,-0.316589,-1.0,-0.891993,-1.0,-0.203354,...,0.2,-1.0,-0.989549,-1.0,-1.0,-0.846633,-1.0,-0.937349,-0.937349,-1.0,-0.978029,-1.0,0.012346,-0.369231,-0.528302,-0.457627,-0.285714,0.684211,0.684211,0.175258,-0.1125,-0.384615,-0.357143,0.208791,0.878788,-0.299145,-0.556757,-0.626866,-0.515152,-0.42029,0.684211,-1.0,-1.0,-1.0,-1.0,-1.0,0-2,0,-0.114846,0.938541
3,4,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-1.0,-0.317073,-1.0,-0.935113,-1.0,-1.0,-1.0,0.357143,-1.0,-0.913659,-0.913659,-1.0,-0.829424,-1.0,-0.938084,-1.0,-0.851024,-1.0,0.358491,...,0.142857,-1.0,-0.998507,-1.0,-1.0,-0.846633,-1.0,-0.903614,-0.903614,-1.0,-1.0,-1.0,0.333333,-0.153846,0.160377,-0.59322,0.285714,0.868421,0.868421,0.443299,0.0,0.196581,-0.571429,0.538462,0.939394,-0.076923,-0.351351,-0.044776,-0.575758,0.072464,0.894737,-1.0,-0.877301,-0.923664,-0.882353,-0.952381,0-2,0,0.17094,1.267746
4,5,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-1.0,-0.317073,-1.0,-0.93895,-1.0,-1.0,-1.0,0.357143,-1.0,-0.891012,-0.891012,-1.0,-0.742004,-1.0,-0.958528,-1.0,-0.891993,-1.0,0.291405,...,0.085714,-1.0,-0.997947,-1.0,-1.0,-0.846633,-1.0,-0.884337,-0.884337,-1.0,-1.0,-1.0,-0.037037,-0.538462,-0.537736,-0.525424,-0.196429,0.815789,0.815789,0.030928,-0.375,-0.401709,-0.428571,0.252747,0.919192,-0.247863,-0.567568,-0.626866,-0.575758,-0.333333,0.842105,-0.826087,-0.754601,-0.984733,-1.0,-0.97619,0-2,0,-0.204179,2.48741


# Prediction


We will use the PyCaret framework to train our model.

In [None]:
# pycaret splits the data into train and test sets internally, so
# we can pass all of our data to the setup function
ml_setup = setup(
            df_clear,
            target='ICU',
            ignore_features=['PATIENT_VISIT_IDENTIFIER'],
            experiment_name='Previsão de ICU',
            session_id=45676,
            n_jobs=5
)

Unnamed: 0,Description,Value
0,session_id,45676
1,Target,ICU
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(294, 114)"
5,Missing Values,False
6,Numeric Features,99
7,Categorical Features,13
8,Ordinal Features,False
9,High Cardinality Features,False
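
The feature-engineering notes mention one-hot encoding of AGE_PERCENTIL. How PyCaret encodes the categorical features internally is not shown in this notebook, so the following is only an illustration of the technique using plain pandas on made-up rows.

```python
import pandas as pd

# Made-up rows using the same kind of labels as the AGE_PERCENTIL column.
toy = pd.DataFrame({"AGE_PERCENTIL": ["60th", "10th", "40th", "10th"]})

# One indicator column per distinct category; exactly one of them is 1 per row.
encoded = pd.get_dummies(
    toy, columns=["AGE_PERCENTIL"], prefix="AGE_PERCENTIL", dtype=int
)

print(encoded)
```

Each row has a single 1 in the column matching its category, so no artificial ordering or distance between percentile labels is imposed on the model.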


In [None]:
# train and compare several different models
compare_models(fold=10, turbo=False);

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.7221,0.7779,0.4857,0.7344,0.5649,0.3736,0.4021,0.319
et,Extra Trees Classifier,0.6879,0.7309,0.4196,0.6843,0.5027,0.29,0.3191,0.294
ridge,Ridge Classifier,0.6829,0.0,0.4821,0.6278,0.5244,0.2952,0.3114,0.017
nb,Naive Bayes,0.6733,0.8,0.1911,0.7167,0.2917,0.1831,0.2513,0.02
lr,Logistic Regression,0.6731,0.726,0.4429,0.6395,0.4951,0.2652,0.2875,0.867
gbc,Gradient Boosting Classifier,0.664,0.7164,0.525,0.5758,0.5358,0.2762,0.2839,0.2
lightgbm,Light Gradient Boosting Machine,0.6638,0.6968,0.4964,0.5766,0.5191,0.2653,0.2731,0.298
mlp,MLP Classifier,0.6593,0.6457,0.4946,0.5645,0.5135,0.2552,0.2617,0.705
knn,K Neighbors Classifier,0.6493,0.6067,0.2571,0.6112,0.346,0.164,0.1962,0.056
lda,Linear Discriminant Analysis,0.6486,0.6551,0.4964,0.5945,0.5149,0.2426,0.2617,0.031
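
For readers who want to see roughly what a comparison like this involves, here is a minimal sketch using scikit-learn directly. It runs on synthetic data from `make_classification`, covers only three candidate models, and assumes stratified 10-fold cross-validation; it is not PyCaret's actual implementation and will not reproduce the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared data (assumption: ~300 rows, binary target).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

candidates = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "et": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "lr": LogisticRegression(max_iter=1000),
}

# Same folds for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

mean_accuracy = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Best model first.
ranking = sorted(mean_accuracy, key=mean_accuracy.get, reverse=True)
for name in ranking:
    print(f"{name}: {mean_accuracy[name]:.4f}")
```

With fewer than 300 patients, each fold holds only about 30 rows, so differences of a few percentage points between models should be read with caution.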


## Random Forest

Example:


![random_forest_gif](https://aigraduate.com/content/images/downloaded_images/Building-Intuition-for-Random-Forests/1-Vko_J9ejaHOfwc_CqdSQpg.gif)

![random_forest_gif2](https://aigraduate.com/content/images/downloaded_images/Building-Intuition-for-Random-Forests/1-bYGSIgMlmVdedFJaE6PuBg.gif)

In [None]:
model_rf = create_model('rf')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6667,0.7644,0.625,0.5556,0.5882,0.3099,0.3114
1,0.7143,0.7404,0.5,0.6667,0.5714,0.3636,0.3721
2,0.7143,0.7837,0.375,0.75,0.5,0.3298,0.3686
3,0.6667,0.6923,0.5,0.5714,0.5333,0.2759,0.2774
4,0.8095,0.8077,0.5,1.0,0.6667,0.5532,0.6183
5,0.75,0.7912,0.2857,1.0,0.4444,0.3421,0.4543
6,0.8,0.8297,0.5714,0.8,0.6667,0.5294,0.5447
7,0.7,0.901,0.375,0.75,0.5,0.3182,0.3572
8,0.6,0.6875,0.375,0.5,0.4286,0.1304,0.1336
9,0.8,0.7813,0.75,0.75,0.75,0.5833,0.5833
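
The per-fold values above can be summarised by hand. The sketch below copies the Accuracy and Recall columns from the table and averages them; the means agree with the `rf` row of the model comparison table (accuracy about 0.7221, recall about 0.4857).

```python
from statistics import mean, stdev

# Accuracy and Recall columns copied from the 10-fold table above.
fold_accuracy = [0.6667, 0.7143, 0.7143, 0.6667, 0.8095, 0.75, 0.8, 0.7, 0.6, 0.8]
fold_recall = [0.625, 0.5, 0.375, 0.5, 0.5, 0.2857, 0.5714, 0.375, 0.375, 0.75]

mean_accuracy = mean(fold_accuracy)
mean_recall = mean(fold_recall)

print(f"accuracy: mean={mean_accuracy:.4f}, std={stdev(fold_accuracy):.4f}")
print(f"recall:   mean={mean_recall:.4f}, std={stdev(fold_recall):.4f}")
```

The spread across folds is worth reporting alongside the mean, since recall ranges from 0.2857 to 0.75 between individual folds.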


Let's take a closer look at the Random Forest Classifier and some of its metrics.

In [None]:
evaluate_model(model_rf)


In [None]:
# save the model
save_model(model_rf, 'tuned_model_rf')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True,
                                       features_todrop=['PATIENT_VISIT_IDENTIFIER'],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='ICU',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numer...
                  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                         class_weight=None, criterion='gini',
                                         max_depth=None, max_features='auto',
                                         

# In production...

In [None]:
loaded_model = load_model('tuned_model_rf')

Transformation Pipeline and Model Successfully Loaded


In [None]:
loaded_model

Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=True,
                                      features_todrop=['PATIENT_VISIT_IDENTIFIER'],
                                      id_columns=[],
                                      ml_usecase='classification',
                                      numerical_features=[], target='ICU',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numer...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=No

Making predictions

In [None]:
data = df_clear.head(1)

In [None]:
data

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_ABOVE65,AGE_PERCENTIL,GENDER,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,HTN,IMMUNOCOMPROMISED,OTHER,ALBUMIN_MEDIAN,ALBUMIN_DIFF,BE_ARTERIAL_MEDIAN,BE_ARTERIAL_DIFF,BE_VENOUS_MEDIAN,BE_VENOUS_DIFF,BIC_ARTERIAL_MEDIAN,BIC_ARTERIAL_MEAN,BIC_ARTERIAL_DIFF,BIC_VENOUS_MEDIAN,BIC_VENOUS_DIFF,BILLIRUBIN_MEDIAN,BILLIRUBIN_DIFF,BLAST_MEDIAN,BLAST_DIFF,CALCIUM_MEDIAN,CALCIUM_DIFF,CREATININ_MEDIAN,CREATININ_MEAN,CREATININ_DIFF,FFA_MEDIAN,FFA_DIFF,GGT_MEDIAN,GGT_DIFF,GLUCOSE_MEDIAN,GLUCOSE_DIFF,HEMATOCRITE_MEDIAN,...,SODIUM_MEDIAN,SODIUM_DIFF,TGO_MEDIAN,TGO_DIFF,TGP_DIFF,TTPA_MEDIAN,TTPA_DIFF,UREA_MEDIAN,UREA_MEAN,UREA_DIFF,DIMER_MEDIAN,DIMER_DIFF,BLOODPRESSURE_DIASTOLIC_MEAN,BLOODPRESSURE_SISTOLIC_MEAN,HEART_RATE_MEAN,RESPIRATORY_RATE_MEAN,TEMPERATURE_MEAN,OXYGEN_SATURATION_MEAN,OXYGEN_SATURATION_MEDIAN,BLOODPRESSURE_DIASTOLIC_MIN,BLOODPRESSURE_SISTOLIC_MIN,HEART_RATE_MIN,RESPIRATORY_RATE_MIN,TEMPERATURE_MIN,OXYGEN_SATURATION_MIN,BLOODPRESSURE_DIASTOLIC_MAX,BLOODPRESSURE_SISTOLIC_MAX,HEART_RATE_MAX,RESPIRATORY_RATE_MAX,TEMPERATURE_MAX,OXYGEN_SATURATION_MAX,BLOODPRESSURE_DIASTOLIC_DIFF,BLOODPRESSURE_SISTOLIC_DIFF,HEART_RATE_DIFF,RESPIRATORY_RATE_DIFF,TEMPERATURE_DIFF,WINDOW,ICU,BLOODPRESSURE_ARTERIAL_MEAN,NEUTROPHILES/LINFOCITOS
0,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-1.0,-0.317073,-1.0,-0.93895,-1.0,-1.0,-1.0,0.183673,-1.0,-0.868365,-0.868365,-1.0,-0.742004,-1.0,-0.945093,-1.0,-0.891993,-1.0,0.090147,...,-0.028571,-1.0,-0.997201,-1.0,-1.0,-0.825613,-1.0,-0.836145,-0.836145,-1.0,-0.994912,-1.0,0.08642,-0.230769,-0.283019,-0.59322,-0.285714,0.736842,0.736842,0.237113,0.0,-0.162393,-0.5,0.208791,0.89899,-0.247863,-0.459459,-0.432836,-0.636364,-0.42029,0.736842,-1.0,-1.0,-1.0,-1.0,-1.0,0-2,1,-0.01931,0.949515


In [None]:
predict_model(loaded_model, data)

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_ABOVE65,AGE_PERCENTIL,GENDER,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,HTN,IMMUNOCOMPROMISED,OTHER,ALBUMIN_MEDIAN,ALBUMIN_DIFF,BE_ARTERIAL_MEDIAN,BE_ARTERIAL_DIFF,BE_VENOUS_MEDIAN,BE_VENOUS_DIFF,BIC_ARTERIAL_MEDIAN,BIC_ARTERIAL_MEAN,BIC_ARTERIAL_DIFF,BIC_VENOUS_MEDIAN,BIC_VENOUS_DIFF,BILLIRUBIN_MEDIAN,BILLIRUBIN_DIFF,BLAST_MEDIAN,BLAST_DIFF,CALCIUM_MEDIAN,CALCIUM_DIFF,CREATININ_MEDIAN,CREATININ_MEAN,CREATININ_DIFF,FFA_MEDIAN,FFA_DIFF,GGT_MEDIAN,GGT_DIFF,GLUCOSE_MEDIAN,GLUCOSE_DIFF,HEMATOCRITE_MEDIAN,...,TGO_MEDIAN,TGO_DIFF,TGP_DIFF,TTPA_MEDIAN,TTPA_DIFF,UREA_MEDIAN,UREA_MEAN,UREA_DIFF,DIMER_MEDIAN,DIMER_DIFF,BLOODPRESSURE_DIASTOLIC_MEAN,BLOODPRESSURE_SISTOLIC_MEAN,HEART_RATE_MEAN,RESPIRATORY_RATE_MEAN,TEMPERATURE_MEAN,OXYGEN_SATURATION_MEAN,OXYGEN_SATURATION_MEDIAN,BLOODPRESSURE_DIASTOLIC_MIN,BLOODPRESSURE_SISTOLIC_MIN,HEART_RATE_MIN,RESPIRATORY_RATE_MIN,TEMPERATURE_MIN,OXYGEN_SATURATION_MIN,BLOODPRESSURE_DIASTOLIC_MAX,BLOODPRESSURE_SISTOLIC_MAX,HEART_RATE_MAX,RESPIRATORY_RATE_MAX,TEMPERATURE_MAX,OXYGEN_SATURATION_MAX,BLOODPRESSURE_DIASTOLIC_DIFF,BLOODPRESSURE_SISTOLIC_DIFF,HEART_RATE_DIFF,RESPIRATORY_RATE_DIFF,TEMPERATURE_DIFF,WINDOW,ICU,BLOODPRESSURE_ARTERIAL_MEAN,NEUTROPHILES/LINFOCITOS,Label,Score
0,0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.605263,-1.0,-1.0,-1.0,-1.0,-1.0,-0.317073,-0.317073,-1.0,-0.317073,-1.0,-0.93895,-1.0,-1.0,-1.0,0.183673,-1.0,-0.868365,-0.868365,-1.0,-0.742004,-1.0,-0.945093,-1.0,-0.891993,-1.0,0.090147,...,-0.997201,-1.0,-1.0,-0.825613,-1.0,-0.836145,-0.836145,-1.0,-0.994912,-1.0,0.08642,-0.230769,-0.283019,-0.59322,-0.285714,0.736842,0.736842,0.237113,0.0,-0.162393,-0.5,0.208791,0.89899,-0.247863,-0.459459,-0.432836,-0.636364,-0.42029,0.736842,-1.0,-1.0,-1.0,-1.0,-1.0,0-2,1,-0.01931,0.949515,1,0.73
