<a href="https://colab.research.google.com/github/RPGraciotti/BootCampAlura/blob/main/Projeto_final/TPOT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model selection with AutoML

To decide which ML models to explore, I opted for an Automated Machine Learning tool, the so-called [AutoML](https://medium.com/data-hackers/automated-machine-learning-automl-parte-i-1d3219d57d31). AutoML comprises a series of steps, methods, and algorithms designed to automatically search for models and pipelines and to tune their parameters for a given purpose. Several tools implement AutoML, and they are accessible to anyone who uses Python, even those without much familiarity with the language (such as yours truly).
Examples include [Auto Sklearn](https://automl.github.io/auto-sklearn/master/), [HyperOpt](http://hyperopt.github.io/hyperopt-sklearn/), [LazyPredict](https://lazypredict.readthedocs.io/en/latest/#), and [TPOT](http://epistasislab.github.io/tpot/).

AutoML tools come with several advantages and [also disadvantages](https://www.kdnuggets.com/2019/03/why-automl-wont-replace-data-scientists.html), and using them is not necessarily trivial. Still, they are extremely useful for establishing a starting point, they can be used in the early stages of a project to delimit its scope, and they are great teaching tools. For this project, I ran tests with the tools mentioned above following some [tutorials](https://machinelearningmastery.com/automl-libraries-for-python/), and in the end chose to present TPOT (Tree-based Pipeline Optimization Tool).

What attracted me to TPOT was how easy it is to set up and to export results, as well as the ability to search for the best parameter space, something that is sometimes done with libraries separate from the AutoML algorithm. In addition, the search and validation of models can be heavily customized to meet the user's needs, depending on the goal.

# TPOT

![TPOT logo](https://github.com/RPGraciotti/BootCampAlura/raw/main/Projeto_final/figs/tpot-logo.jpg)

The goal of TPOT is to determine not just an ML model to apply, but the best *pipeline* to use, including preprocessing steps. It can also search for the best parameters of that pipeline jointly. The rationale behind TPOT is summarized in the following figure (available in the [documentation](http://epistasislab.github.io/tpot/) itself):

![description of how TPOT works](https://github.com/RPGraciotti/BootCampAlura/raw/main/Projeto_final/figs/tpot-ml-pipeline.png)

On TPOT's ["how to use"](http://epistasislab.github.io/tpot/) page, there is a series of examples and recommendations on how to use an AutoML algorithm, the precautions to take, its limitations, and so on. Knowing that an exhaustive search takes time and resources, I constrained some of the search parameters so that I could run it efficiently with less time and computing power. It can easily be installed on Google Colab, but for an exhaustive search that tool ends up being fairly constrained by its connection time limit. The results presented here should therefore be read as a didactic exercise. With those caveats in place, I will walk through the procedure I adopted.

**Note that some outputs in this notebook are suppressed, because the algorithm was run many times, in different instances, and each run takes a long time. When condensing everything into a single notebook to build the final project presentation, I omitted the model search outputs.**

The first step is to install the library:

In [1]:
pip install tpot



Next, import the libraries used in this step: NumPy and pandas for importing and, if needed, manipulating the dataset, and some sklearn tools for splitting the data and validating models.

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_validate

from tpot import TPOTClassifier



Reading the data:

In [3]:
path = "https://raw.githubusercontent.com/RPGraciotti/BootCampAlura/main/Data/data_clean_ohe.csv"

In [4]:
df_clean = pd.read_csv(path)
df_clean

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_ABOVE65,GENDER,DISEASE GROUPING 1,DISEASE GROUPING 2,DISEASE GROUPING 3,DISEASE GROUPING 4,DISEASE GROUPING 5,DISEASE GROUPING 6,HTN,IMMUNOCOMPROMISED,OTHER,AGE_PERCENTIL_10th,AGE_PERCENTIL_20th,AGE_PERCENTIL_30th,AGE_PERCENTIL_40th,AGE_PERCENTIL_50th,AGE_PERCENTIL_60th,AGE_PERCENTIL_70th,AGE_PERCENTIL_80th,AGE_PERCENTIL_90th,ALBUMIN_MEDIAN,BE_ARTERIAL_MEDIAN,BE_VENOUS_MEDIAN,BIC_ARTERIAL_MEDIAN,BIC_VENOUS_MEDIAN,BILLIRUBIN_MEDIAN,BLAST_MEDIAN,CALCIUM_MEDIAN,CREATININ_MEDIAN,FFA_MEDIAN,GGT_MEDIAN,GLUCOSE_MEDIAN,HEMATOCRITE_MEDIAN,INR_MEDIAN,LACTATE_MEDIAN,LEUKOCYTES_MEDIAN,LINFOCITOS_MEDIAN,P02_VENOUS_MEDIAN,PC02_ARTERIAL_MEDIAN,PC02_VENOUS_MEDIAN,PCR_MEDIAN,PH_VENOUS_MEDIAN,PLATELETS_MEDIAN,POTASSIUM_MEDIAN,SAT02_VENOUS_MEDIAN,SODIUM_MEDIAN,TGO_MEDIAN,TGP_MEDIAN,TTPA_MEDIAN,UREA_MEDIAN,DIMER_MEDIAN,BLOODPRESSURE_DIASTOLIC_MEAN,BLOODPRESSURE_SISTOLIC_MEAN,HEART_RATE_MEAN,RESPIRATORY_RATE_MEAN,TEMPERATURE_MEAN,OXYGEN_SATURATION_MEAN,OXYGEN_SATURATION_MIN,OXYGEN_SATURATION_MAX,BLOODPRESSURE_DIASTOLIC_DIFF,BLOODPRESSURE_SISTOLIC_DIFF,HEART_RATE_DIFF,RESPIRATORY_RATE_DIFF,TEMPERATURE_DIFF,OXYGEN_SATURATION_DIFF,HEART_RATE_DIFF_REL,WINDOW,ICU
0,0,1,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0,0,0,0,0,1,0,0,0,0.605263,-1.0,-1.000000,-0.317073,-0.317073,-0.938950,-1.0,0.183673,-0.868365,-0.742004,-0.945093,-0.891993,0.090147,-0.932246,1.000000,-0.835844,-0.914938,-0.704142,-0.77931,-0.754601,-0.875236,0.363636,-0.540721,-0.518519,0.345679,-0.028571,-0.997201,-0.990854,-0.825613,-0.836145,-0.994912,0.086420,-0.230769,-0.283019,-0.593220,-0.285714,0.736842,0.898990,0.736842,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,1
1,2,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,0,0,0,0,0,0,0,0,0.605263,-1.0,-1.000000,-0.317073,-0.317073,-0.938950,-1.0,0.357143,-0.912243,-0.742004,-0.958528,-0.780261,0.144654,-0.959849,1.000000,-0.382773,-0.908714,-0.704142,-0.77931,-0.754601,-0.939887,0.363636,-0.399199,-0.703704,0.345679,0.085714,-0.995428,-0.986662,-0.846633,-0.836145,-0.978029,-0.489712,-0.685470,-0.048218,-0.645951,0.357143,0.935673,0.959596,1.000000,-0.547826,-0.533742,-0.603053,-0.764706,-1.000000,-0.959596,-0.747001,0-2,1
2,3,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,1,0,0,0,0,0,-0.263158,-1.0,-1.000000,-0.317073,-0.317073,-0.972789,-1.0,0.326531,-0.968861,-0.194030,-0.316589,-0.891993,-0.203354,-0.959849,-0.828421,-0.729239,-0.836100,-0.633136,-0.77931,-0.779141,-0.503592,0.363636,-0.564753,-0.777778,0.580247,0.200000,-0.989549,-0.956555,-0.846633,-0.937349,-0.978029,0.012346,-0.369231,-0.528302,-0.457627,-0.285714,0.684211,0.878788,0.684211,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,0
3,4,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,0,0,0,0,0,0,0,0,0.605263,-1.0,-1.000000,-0.317073,-0.317073,-0.935113,-1.0,0.357143,-0.913659,-0.829424,-0.938084,-0.851024,0.358491,-0.959849,1.000000,-0.702202,-0.641079,-0.704142,-0.77931,-0.754601,-0.990926,0.363636,-0.457944,-0.592593,0.345679,0.142857,-0.998507,-0.991235,-0.846633,-0.903614,-1.000000,0.333333,-0.153846,0.160377,-0.593220,0.285714,0.868421,0.939394,0.894737,-1.000000,-0.877301,-0.923664,-0.882353,-0.952381,-0.979798,-0.956805,0-2,0
4,5,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,0,0,0,0,0,0,0,0,0.605263,-1.0,-1.000000,-0.317073,-0.317073,-0.938950,-1.0,0.357143,-0.891012,-0.742004,-0.958528,-0.891993,0.291405,-0.959849,1.000000,-0.706450,-0.340249,-0.704142,-0.77931,-0.754601,-0.997732,0.363636,-0.292390,-0.666667,0.345679,0.085714,-0.997947,-0.988948,-0.846633,-0.884337,-1.000000,-0.037037,-0.538462,-0.537736,-0.525424,-0.196429,0.815789,0.919192,0.842105,-0.826087,-0.754601,-0.984733,-1.000000,-0.976190,-0.979798,-0.986481,0-2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,380,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,1,0,0,0,0,0,-0.578947,-1.0,-1.000000,-0.317073,-0.317073,-0.293564,-1.0,0.326531,-0.937721,1.000000,-0.147196,-0.824953,-0.253669,-0.806775,1.000000,-0.704519,-0.879668,-0.704142,-0.77931,-0.754601,-0.565974,0.363636,-0.895861,-0.629630,0.345679,-0.428571,-0.925725,-0.981326,-0.629428,-0.860241,-0.978029,-0.160494,-0.692308,0.339623,-0.457627,0.142857,0.736842,0.898990,0.736842,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,1
290,381,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0.605263,-1.0,-1.000000,-0.317073,-0.317073,-0.938950,-1.0,0.285714,-0.886766,-0.742004,-0.958528,-0.891993,-0.241090,-0.959849,1.000000,-0.794129,-0.921162,-0.704142,-0.77931,-0.754601,-0.993195,0.363636,-0.516689,-0.518519,0.345679,-0.314286,-0.998507,-0.995808,-0.846633,-0.855422,-0.978029,-0.407407,-0.692308,-0.283019,-0.457627,-0.059524,0.526316,0.818182,0.526316,-1.000000,-1.000000,-1.000000,-1.000000,-0.619048,-1.000000,-1.000000,0-2,0
291,382,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,0,0.605263,-1.0,-1.000000,-0.317073,-0.317073,-0.938950,-1.0,0.357143,-0.905166,-0.742004,-0.958528,-0.891993,0.064990,-0.959849,1.000000,-0.718038,-0.838174,-0.704142,-0.77931,-0.754601,-0.034405,0.363636,-0.658211,-0.407407,0.345679,-0.085714,-0.995428,-0.986662,-0.846633,-0.787952,-0.964461,0.012346,-0.384615,-0.320755,-0.457627,-0.071429,0.894737,0.959596,0.894737,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,1
292,383,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,0,0,0,0,0,0.605263,-1.0,-1.000000,-0.317073,-0.317073,-0.938950,-1.0,0.357143,-0.922151,-0.742004,-0.958528,-0.843575,-0.069182,-0.959849,1.000000,-0.877559,-0.819502,-0.704142,-0.77931,-0.754601,-0.804159,0.363636,-0.623498,-0.555556,0.345679,0.085714,-0.995428,-0.986662,-0.846633,-0.937349,-0.978029,0.086420,-0.230769,-0.301887,-0.661017,-0.107143,0.736842,0.898990,0.736842,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0-2,0


TPOT implements the model search much like the procedure for fitting a model in sklearn: define x and y, choose a validation method, instantiate the model, and "fit" it.
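For comparison, here is a minimal sketch of that plain sklearn workflow. The synthetic dataset, the `LogisticRegression` model, and all the `*_demo` names are assumptions chosen only for illustration; they are not part of the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project data (assumption, illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# 1. Validation method: a simple stratified hold-out split
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# 2. Model call
model_demo = LogisticRegression(max_iter=1000)

# 3. Fit on the training part, then score on the held-out part
model_demo.fit(X_train, y_train)
holdout_accuracy = model_demo.score(X_test, y_test)
print(f"hold-out accuracy: {holdout_accuracy:.3f}")
```

TPOT keeps the same define, instantiate, fit rhythm, but what gets fitted is a search over many candidate pipelines rather than a single estimator.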

The search was run several times with different parameters. To keep the data being tested reproducible across each new run of the algorithm, I fixed some elements with built-in randomness: the "shuffle" of the dataset and the stratified cross-validation procedure:

In [5]:
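# Shuffle the rows with a fixed seed so every search run sees the same order
df_clean = df_clean.sample(frac = 1, random_state = 78329).reset_index(drop = True)
x_columns = df_clean.columns
# Target: ICU admission; features: everything except the identifier, the target and the time window
y = df_clean.loc[:,"ICU"]
x = df_clean.drop(["PATIENT_VISIT_IDENTIFIER", "ICU", "WINDOW"], axis = 1)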

In [6]:
y = y.rename("target")
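As a quick sanity check on the shuffle, the sketch below shows that `sample(frac=1)` with a fixed `random_state` returns the same row order every time. The toy DataFrame and the helper `shuffle_rows` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy frame standing in for df_clean (assumption, illustration only)
toy = pd.DataFrame({"feature": range(10), "ICU": [0, 1] * 5})


def shuffle_rows(df, seed):
    """Return a shuffled copy of df with a fresh 0..n-1 index."""
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)


first = shuffle_rows(toy, seed=78329)
second = shuffle_rows(toy, seed=78329)

# Same seed, same order: the two shuffles are identical
print(first.equals(second))
```

No rows are lost or duplicated by the shuffle; only their order changes.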

As mentioned, TPOT allows the search for the best pipeline to be carried out with a variety of algorithms. One advantage I identified is the ability to use a cross-validation scheme instead of splitting the data into training and test sets at each new run. I used the "[RepeatedStratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html)" method:

In [7]:
cv = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 10, random_state = 78329)
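To make the behaviour of this validator concrete, the sketch below applies the same settings to a toy label vector (an assumption, not the project data) and checks two things: how many train/test splits it produces, and whether each test fold keeps the class proportions.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy labels: 60 negatives and 40 positives (assumption, illustration only)
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # the features do not influence the splitting

cv_toy = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=78329)

# 5 folds repeated 10 times gives 50 train/test splits
n_total = cv_toy.get_n_splits(X_toy, y_toy)

# Fraction of positives in each test fold; stratification keeps it at 40%
test_pos_fractions = [
    y_toy[test_idx].mean() for _, test_idx in cv_toy.split(X_toy, y_toy)
]
print(n_total, min(test_pos_fractions), max(test_pos_fractions))
```

Every candidate pipeline that TPOT scores with this `cv` object is therefore fitted 50 times, which is worth keeping in mind when budgeting search time.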

# What to search for?

The TPOT implementation allows some interesting customizations. Besides its own parameters that control how extensive the search is, such as ```generations``` and ```population_size```, it is possible, as mentioned earlier, to cross-validate against a metric you want to maximize (or an error metric to minimize). My goal here was to optimize the metrics I judged most relevant for evaluating the models in general, while also looking more closely at some indicators specific to the type of model and to the original goal (correctly identifying which patients need to be admitted to an ICU bed). I therefore optimized for accuracy, precision, recall, ROC AUC, and F1 score (the harmonic mean of precision and recall).
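As a small worked example of four of these metrics, the sketch below scores a made-up set of ten predictions (the labels are an assumption, chosen so the counts are easy to follow: 2 true positives, 2 false negatives, 1 false positive, 5 true negatives). ROC AUC is left out because it needs predicted scores rather than hard labels.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels (assumption): 1 = needs ICU, 0 = does not
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7/10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/4
f1 = f1_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
harmonic_mean = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1, harmonic_mean)
```

In this toy case half of the patients who needed the ICU are missed even though accuracy is 70%, which is why recall is worth optimizing on its own for this problem.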


![example of TPOT commands](https://github.com/RPGraciotti/BootCampAlura/raw/main/Projeto_final/figs/tpot_exemplo.png)

The ```generations``` and ```population_size``` parameters both default to 100. With those settings TPOT would evaluate roughly 10,000 different pipeline configurations, a task that takes a lot of time and computing power, even more so when the search uses cross-validation. Although that is the best strategy in the long run, I reduced this search space following TPOT's own examples and the [tutorial](https://machinelearningmastery.com/automl-libraries-for-python/) mentioned earlier.
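A rough back-of-the-envelope estimate of the search cost can be sketched as follows. The helper `tpot_search_cost` is hypothetical; it assumes the count given in the TPOT documentation as I understand it (`population_size + generations * offspring_size`, with `offspring_size` defaulting to `population_size`) and ignores duplicate or invalid pipelines, so the figures are upper-bound approximations rather than exact counts.

```python
def tpot_search_cost(generations, population_size, cv_splits, offspring_size=None):
    """Approximate (pipelines evaluated, model fits) for one TPOT search.

    Assumption: the initial population plus one batch of offspring per
    generation are all evaluated, each with `cv_splits` fits.
    """
    if offspring_size is None:
        offspring_size = population_size
    pipelines = population_size + generations * offspring_size
    return pipelines, pipelines * cv_splits


cv_splits = 5 * 10  # RepeatedStratifiedKFold(n_splits=5, n_repeats=10)

default_cost = tpot_search_cost(100, 100, cv_splits)  # TPOT defaults
reduced_cost = tpot_search_cost(5, 20, cv_splits)     # settings used for most searches here
recall_cost = tpot_search_cost(50, 20, cv_splits)     # deeper search used for recall

print(default_cost, reduced_cost, recall_cost)
```

Under these assumptions the default search is about 84 times more expensive than the reduced one (10,100 versus 120 pipelines), which is what makes the reduced settings practical on Colab.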




ACC

First, the model call. As mentioned, I set the same ```random_state``` for all searches.

In [8]:
ACC = TPOTClassifier(generations = 5, population_size = 20, cv = cv, 
                      scoring = "accuracy", verbosity = 2, random_state = 78329)

The model search is launched much like the "fit" of a model in sklearn:

PS: this is the slow step of the process.

In [None]:
ACC.fit(x, y)

Finally, the model can be exported as a .py script:

In [None]:
ACC.export("shuffle_random_state_acc.py")

![output of the model that maximizes accuracy](https://github.com/RPGraciotti/BootCampAlura/raw/main/Projeto_final/figs/TPOT_output.png)
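Once a pipeline has been exported, it can be re-scored on several metrics at once with `cross_validate`. The sketch below is only an illustration under assumptions: the pipeline is a hand-written stand-in (a scaler plus a random forest), not one found by TPOT, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a stand-in pipeline (assumptions, illustration only)
X_syn, y_syn = make_classification(n_samples=150, n_features=8, random_state=0)
candidate = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=25, random_state=0)
)

# Fewer repeats than in the real search, just to keep the example fast
cv_check = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
metrics = ["accuracy", "precision", "recall", "roc_auc", "f1"]

# One cross-validation pass that reports every metric for every split
results = cross_validate(candidate, X_syn, y_syn, cv=cv_check, scoring=metrics)
summary = {name: results[f"test_{name}"].mean() for name in metrics}
print(summary)
```

Scoring every exported pipeline on the same set of metrics makes it easier to compare, for example, the pipeline that maximizes accuracy against the one that maximizes recall.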

Moving on to the other pipelines, each optimizing a different performance metric:

PRECISION

In [9]:
Prec = TPOTClassifier(generations = 5, population_size = 20, cv = cv, 
                      scoring = "precision", verbosity = 2, random_state = 78329)

In [None]:
Prec.fit(x, y)

In [None]:
Prec.export("shuffle_random_state_prec.py")

ROC AUC

In [10]:
ROC = TPOTClassifier(generations = 5, population_size = 20, cv = cv, 
                      scoring = "roc_auc", verbosity = 2, random_state = 78329)


In [None]:
ROC.fit(x, y)

In [None]:
ROC.export("shuffle_random_state_roc.py")

RECALL

One detail about optimizing "recall" is that this search had to go deeper. While all the other searches start from a good initial point and show no significant improvement with more generations, the search for recall does not return good results with few generations, so I expanded the search space.

In [11]:
REC = TPOTClassifier(generations = 50, population_size = 20, 
                      cv = cv, scoring = "recall", verbosity = 2, random_state = 78329)

In [None]:
REC.fit(x, y)

In [None]:
REC.export("shuffle_random_state_recall_50.py")

F1

In [12]:
F1 = TPOTClassifier(generations = 5, population_size = 20, cv = cv, 
                      scoring = "f1", verbosity = 2, random_state = 78329)

In [None]:
F1.fit(x, y)

In [None]:
F1.export("shuffle_random_state_f1.py")