# Identification

**Subject:** Modeling

**Tutors:** Manoel Veríssimo dos Santos Neto and Matheus Patusco

## 1- Learning Objectives
In this notebook, we will:
1. Load the processed dataset from a CSV file.
2. Compare multiple machine learning algorithms.
3. Store and version in MLflow the model with the best performance and the lowest computational cost.

### 1.1- Required Libraries

In [1]:
# Data manipulation and visualization
import pandas as pd
import seaborn as sns
import time

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# MLflow for experiment management
import mlflow
import mlflow.sklearn

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

## 2- Retrieving the Processed Dataset

In [2]:
# Path to the processed dataset artifact (a relative path on the local filesystem)
artifact_path = "../02-dados/nha/dados_processados.csv"

# Load the processed dataset
dados = pd.read_csv(artifact_path)
dados.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## 3- Splitting the Data into Training and Test Sets

In [3]:
# Convert the 'Sex' column (or any other categorical column) into dummy variables
dados_limpos = dados.copy()
dados_limpos = pd.get_dummies(dados_limpos, columns=["Sex"], drop_first=False)

# Separate the features (X) from the target (y)
X = dados_limpos.drop(columns=["Survived", "Name", "Ticket", "Cabin", "Embarked"], errors='ignore')  # Replace 'Survived' with the target column name, if needed
y = dados_limpos["Survived"]  # Replace 'Survived' with the target column name, if needed

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Conjunto de treinamento: {X_train.shape}")
print(f"Conjunto de teste: {X_test.shape}")

# Fill NaNs with the column mean learned from the training set only
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
# Keep the original row index so X and y stay aligned by label as well as by position
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns, index=X_test.index)

Conjunto de treinamento: (712, 8)
Conjunto de teste: (179, 8)
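
The imputer in the cell above is fitted on the training split and only applied to the test split, so the test rows do not influence the fill value. A minimal sketch of that behavior on toy data follows; the values and the `_demo` names are hypothetical and are not taken from the Titanic dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical values) standing in for X_train / X_test.
X_train_demo = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]}, index=[7, 2, 9, 4])
X_test_demo = pd.DataFrame({"Age": [np.nan, 50.0]}, index=[1, 5])

imputer_demo = SimpleImputer(strategy="mean")

# fit_transform learns the column mean from the training rows only.
X_train_filled = pd.DataFrame(
    imputer_demo.fit_transform(X_train_demo),
    columns=X_train_demo.columns,
    index=X_train_demo.index,
)

# transform reuses the training mean; the test value 50.0 plays no part in it.
X_test_filled = pd.DataFrame(
    imputer_demo.transform(X_test_demo),
    columns=X_test_demo.columns,
    index=X_test_demo.index,
)

print(imputer_demo.statistics_)  # mean learned from the training column
print(X_test_filled)
```

Passing `index=` when rebuilding the DataFrame keeps the shuffled row labels produced by `train_test_split`, which makes it easier to join predictions back to the original rows later.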


## 4- Comparing Machine Learning Algorithms

In [36]:
from mlflow.tracking import MlflowClient  # added import

# Models to compare. Each block is one experiment round; only the last
# block (Modelos_Docker6) is active, the earlier ones are kept for reference.
# Modelos_Docker1 (trailing values: accuracy, F1-score, training time in seconds)
# modelos = {
#     "Random Forest": RandomForestClassifier(random_state=42), #0.804469  0.802769                  0.058543
#     "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42), #0.810056  0.808114                  0.021659
#     "K-Nearest Neighbors": KNeighborsClassifier(), #0.659218  0.638369                  0.002176
#     "Gradient Boosting": GradientBoostingClassifier(random_state=42) #0.810056  0.807491                  0.050162
# }

# Modelos_Docker2
# Everything got worse :(
# modelos = {
#     "Random Forest": RandomForestClassifier(random_state=42, n_estimators=200, max_depth=10),
#     "Logistic Regression": LogisticRegression(max_iter=1000, C=0.5, penalty='l2', solver='lbfgs', random_state=42),
#     "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=7, weights='distance', p=2),
#     "Gradient Boosting": GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42)
# }

# Modelos_Docker3
# Random Forest improved a lot
# modelos = {
#     "Random Forest": RandomForestClassifier(random_state=20, n_estimators=200, max_depth=10),
#     "Logistic Regression": LogisticRegression(max_iter=1000, C=0.5, penalty='l1', solver='liblinear', random_state=20),
#     "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=7, weights='uniform', p=2),
#     "Gradient Boosting": GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=4, random_state=20)
# }

# Modelos_Docker4
# Gradient Boosting improved a lot =D
# Logistic Regression got much worse =(
# modelos = {
#     "Random Forest": RandomForestClassifier(random_state=20, n_estimators=300, max_depth=10),
#     "Logistic Regression": LogisticRegression(max_iter=1000, C=1.0, penalty='l1', solver='saga', random_state=42),
#     "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=7, metric='minkowski', p=1),
#     "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.2, random_state=42)
# }

# Modelos_Docker5
# Trying to improve Nearest Neighbors
# Good grief, it got worse
# modelos = {
#     "Random Forest": RandomForestClassifier(random_state=20, n_estimators=300, max_depth=10),
#     "Logistic Regression": LogisticRegression(max_iter=1000, C=1.0, penalty='l1', solver='saga', random_state=42),
#     "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3, n_jobs=-1),
#     "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.2, random_state=42)
# }

# Modelos_Docker6
# Still trying to improve Nearest Neighbors
# Nearly a mission impossible
modelos = {
    "Random Forest": RandomForestClassifier(random_state=20, n_estimators=300, max_depth=10),
    "Logistic Regression": LogisticRegression(max_iter=1000, C=1.0, penalty='l1', solver='saga', random_state=42),
    "K-Nearest Neighbors Auto": KNeighborsClassifier(n_neighbors=15, algorithm='auto'),
    "K-Nearest Neighbors Uniform": KNeighborsClassifier(n_neighbors=7, weights='uniform'),
    "K-Nearest Neighbors 15 vizinhos": KNeighborsClassifier(n_neighbors=15),
    "K-Nearest Neighbors 3 vizinhos": KNeighborsClassifier(n_neighbors=3),
    "K-Nearest Neighbors KD_TREE": KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree'),
    "K-Nearest Neighbors KD_TREE 15 vizinhos": KNeighborsClassifier(n_neighbors=15, algorithm='kd_tree'),
    "K-Nearest Neighbors Brute": KNeighborsClassifier(n_neighbors=5, algorithm='brute'),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.2, random_state=42)
}

# 'mlflow' is the service name in docker-compose; from the host machine
# the tracking server is reached through localhost:5050
MLFLOW_TRACKING_URI = "http://localhost:5050"
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
print(f"Configurando MLflow Tracking URI para: {MLFLOW_TRACKING_URI}")

# Set the experiment
mlflow.set_experiment("Comparacao_Modelos_Docker6")

resultados = []

# Evaluate each model
for nome, modelo in modelos.items():
    inicio = time.time()
    modelo.fit(X_train, y_train)  # Training
    fim = time.time()

    # Predictions
    y_pred = modelo.predict(X_test)

    # Metrics
    acuracia = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="weighted")
    tempo_treino = fim - inicio

    # Log to MLflow
    with mlflow.start_run(run_name=nome):
        mlflow.log_param("Modelo", nome)
        mlflow.log_metric("Acurácia", acuracia)
        mlflow.log_metric("F1-Score", f1)
        mlflow.log_metric("Tempo de Treinamento", tempo_treino)
        mlflow.sklearn.log_model(modelo, "modelo")

    # Store the results
    resultados.append({
        "Modelo": nome,
        "Acurácia": acuracia,
        "F1-Score": f1,
        "Tempo de Treinamento (s)": tempo_treino
    })
    print(f"Modelo {nome} treinado e registrado no MLflow.")

Configurando MLflow Tracking URI para: http://localhost:5050

🏃 View run Random Forest at: http://localhost:5050/#/experiments/525698902962365616/runs/032e13fea3d348c388a8f1d2ad3323da
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo Random Forest treinado e registrado no MLflow.

🏃 View run Logistic Regression at: http://localhost:5050/#/experiments/525698902962365616/runs/1db2455cd7884c1a9c7a7d85a8f499bd
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo Logistic Regression treinado e registrado no MLflow.

🏃 View run K-Nearest Neighbors Auto at: http://localhost:5050/#/experiments/525698902962365616/runs/8388ac27f9484162bbfe12941db97f69
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo K-Nearest Neighbors Auto treinado e registrado no MLflow.

🏃 View run K-Nearest Neighbors Uniform at: http://localhost:5050/#/experiments/525698902962365616/runs/231afa4c04484f4eacd999a3606c4b74
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo K-Nearest Neighbors Uniform treinado e registrado no MLflow.

🏃 View run K-Nearest Neighbors 15 vizinhos at: http://localhost:5050/#/experiments/525698902962365616/runs/b2e0914584434db4ba2c1a5cedd39f0d
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo K-Nearest Neighbors 15 vizinhos treinado e registrado no MLflow.

🏃 View run K-Nearest Neighbors 3 vizinhos at: http://localhost:5050/#/experiments/525698902962365616/runs/92c418b88b4b4878a23b47daa7dd8102
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo K-Nearest Neighbors 3 vizinhos treinado e registrado no MLflow.

🏃 View run K-Nearest Neighbors KD_TREE at: http://localhost:5050/#/experiments/525698902962365616/runs/d22616a2f4234edd9bafcb468b68c72a
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo K-Nearest Neighbors KD_TREE treinado e registrado no MLflow.

🏃 View run K-Nearest Neighbors KD_TREE 15 vizinhos at: http://localhost:5050/#/experiments/525698902962365616/runs/c9651e0f325a4551a5334fd7b0f206e7
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo K-Nearest Neighbors KD_TREE 15 vizinhos treinado e registrado no MLflow.

🏃 View run K-Nearest Neighbors Brute at: http://localhost:5050/#/experiments/525698902962365616/runs/bd7064c87e6c4b5e9157332dcd3d279c
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo K-Nearest Neighbors Brute treinado e registrado no MLflow.

🏃 View run Gradient Boosting at: http://localhost:5050/#/experiments/525698902962365616/runs/21e3e0e24025412792814366b9a003f5
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Modelo Gradient Boosting treinado e registrado no MLflow.
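
The loop above logs `f1_score(..., average="weighted")`, which is the mean of the per-class F1 scores weighted by how many true samples each class has. A minimal hand-rolled sketch of that definition follows, using hypothetical labels; `f1_weighted` and the `_demo` lists are illustrative names, not part of scikit-learn or of this notebook.

```python
from collections import Counter


def f1_weighted(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += f1 * n_true
    return total / len(y_true)


# Hypothetical labels: four samples of class 0, two of class 1.
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0]

# Class 0 has F1 = 0.75 and class 1 has F1 = 0.5,
# so the weighted score is (0.75 * 4 + 0.5 * 2) / 6.
print(f1_weighted(y_true_demo, y_pred_demo))
```

Because the weights follow class frequency, the weighted F1 tends to stay close to accuracy on imbalanced data, which is consistent with the two columns tracking each other in the runs logged here.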


## 5- Comparison Results

In [37]:
# Build a DataFrame with the results, best accuracy first and
# shortest training time as the tie-breaker
df_resultados = pd.DataFrame(resultados)
df_resultados.sort_values(by=["Acurácia", "Tempo de Treinamento (s)"], ascending=[False, True], inplace=True)
print("Resultados da Comparação:")
print(df_resultados)

# Show the best-performing model
melhor_modelo = df_resultados.iloc[0]
print(f"Melhor Modelo: {melhor_modelo['Modelo']}")

Resultados da Comparação:
                                    Modelo  Acurácia  F1-Score  \
9                        Gradient Boosting  0.815642  0.813462   
0                            Random Forest  0.810056  0.808681   
1                      Logistic Regression  0.703911  0.665356   
8                K-Nearest Neighbors Brute  0.659218  0.638369   
6              K-Nearest Neighbors KD_TREE  0.659218  0.638369   
3              K-Nearest Neighbors Uniform  0.653631  0.631249   
4          K-Nearest Neighbors 15 vizinhos  0.648045  0.602216   
2                 K-Nearest Neighbors Auto  0.648045  0.602216   
7  K-Nearest Neighbors KD_TREE 15 vizinhos  0.648045  0.602216   
5           K-Nearest Neighbors 3 vizinhos  0.592179  0.574788   

   Tempo de Treinamento (s)  
9                  0.051495  
0                  0.154133  
1                  0.036246  
8                  0.001028  
6                  0.001064  
3                  0.001128  
4                  0.001075  
2   
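
All K-Nearest Neighbors variants in the table above score well below the tree ensembles. One possible cause, not verified in this notebook, is that KNN works on raw distances, so a feature with a large numeric range can dominate the others. A minimal sketch of comparing KNN with and without standardization follows; it uses synthetic data from `make_classification` (not the Titanic dataset), and the `_demo` names and the scaling factor are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical); one feature is inflated to a much larger scale.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_demo[:, 0] *= 1000.0

Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# KNN on the raw features.
knn_raw = KNeighborsClassifier(n_neighbors=7).fit(Xtr_demo, ytr_demo)

# Same KNN, but every feature is standardized first; the scaler is fitted
# on the training split only because it lives inside the pipeline.
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=7)
).fit(Xtr_demo, ytr_demo)

print("raw accuracy:   ", knn_raw.score(Xte_demo, yte_demo))
print("scaled accuracy:", knn_scaled.score(Xte_demo, yte_demo))
```

A pipeline like `knn_scaled` could be added to the `modelos` dictionary as one more entry, so the effect on this dataset is measured and logged as its own MLflow run rather than assumed.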

## 6- Storing the Best Model in MLflow

In [38]:
# Retrieve the best-performing model
nome_melhor_modelo = melhor_modelo["Modelo"]
modelo_final = modelos[nome_melhor_modelo]

#mlflow.set_tracking_uri("http://localhost:5050")

# Test to check that MLflow and SQLite were configured correctly
# with mlflow.start_run(run_name="teste_persistente"):
#     mlflow.log_param("modelo", "RandomForest")
#     mlflow.log_metric("acuracia", 0.91)
#     print("✅ Run registrada no MLflow!")

# Use a local folder as the artifact root
#artifact_root = "./mlflow_artifacts"  # this folder exists on your Mac and is writable

# Store the final model in MLflow
with mlflow.start_run(run_name="Melhor Modelo Aula2.1"):
    mlflow.log_param("Modelo", nome_melhor_modelo)
    mlflow.log_metric("Acurácia", melhor_modelo["Acurácia"])
    mlflow.log_metric("F1-Score", melhor_modelo["F1-Score"])
    mlflow.log_metric("Tempo de Treinamento", melhor_modelo["Tempo de Treinamento (s)"])
    mlflow.sklearn.log_model(modelo_final, "melhor_modelo")

print(f"Melhor modelo ({nome_melhor_modelo}) armazenado com sucesso no MLflow.")

🏃 View run Melhor Modelo Aula2.1 at: http://localhost:5050/#/experiments/525698902962365616/runs/4141ffa9cedd4096bf8633f83c2d0ab3
🧪 View experiment at: http://localhost:5050/#/experiments/525698902962365616
Melhor modelo (Gradient Boosting) armazenado com sucesso no MLflow.


## 7- Exercises


1.   Read the documentation for RandomForestClassifier, LogisticRegression, KNeighborsClassifier, and GradientBoostingClassifier, change or add a parameter in these models, and compare the results with the baseline run in this notebook.
2.   Find another dataset on Kaggle for a regression problem and train new models on it. Remember to change the metrics, e.g., to MSE.
3.   Run MLflow so that the data is not lost when the container stops, storing it either in SQLite (default) or in another database of your choice.

**Important:**

*   All changes must be logged in MLflow (running in a container) so that the experiments can be compared.
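
For exercise 2, accuracy and F1 no longer apply, since a regression target is continuous. A minimal sketch of the metrics that could replace them follows, using hypothetical true and predicted values; the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions from some regression model.
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

# MSE: mean of the squared errors (0.25 + 0.25 + 0.0 + 1.0) / 4.
mse = mean_squared_error(y_true_demo, y_pred_demo)

# RMSE: square root of the MSE, expressed in the same unit as the target.
rmse = float(np.sqrt(mse))

# R^2: share of the target variance explained by the predictions.
r2 = r2_score(y_true_demo, y_pred_demo)

print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

In the comparison loop, these values would take the place of `acuracia` and `f1` in the `mlflow.log_metric` calls. Because lower MSE is better, the sort in section 5 would also need `ascending=True` for that column.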

## Conclusion


This notebook showed how to compare multiple machine learning algorithms, evaluate their performance, and store the best model in MLflow.
MLflow was used to track and version both the experiments and the models.