# Training and Prediction with Machine Learning

## Objective
Train a supervised model to predict COVID-19 from metatranscriptomic data and apply it to the patient's sample.

## Pipeline Steps
1. Exploratory Data Analysis (EDA)
2. Preprocessing
3. Model Training (Random Forest, XGBoost, Logistic Regression)
4. Evaluation and Selection of the Best Model
5. Prediction for the Patient Sample

In [1]:
# Initial setup
import os
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Locate the project root directory
notebook_dir = Path().resolve()
if notebook_dir.name == 'notebooks':
    project_root = notebook_dir.parent
else:
    project_root = Path('/Users/larissa/Desktop/TCC_metatrascriptomica')

# Add the project root to sys.path so that `scripts` can be imported
sys.path.insert(0, str(project_root))

# Check that the scripts directory exists
scripts_dir = project_root / "scripts"
if not scripts_dir.exists():
    raise FileNotFoundError(
        f"Directory 'scripts' not found in {project_root}.\n"
        f"Current working directory: {Path().resolve()}\n"
        f"Try running: os.chdir('{project_root / 'notebooks'}')"
    )

# Configure paths relative to the project root
DATA_DIR = project_root / "data"
RESULTS_DIR = project_root / "results"
ML_DIR = RESULTS_DIR / "ml_results"
FEATURES_DIR = RESULTS_DIR / "features"

ML_DIR.mkdir(parents=True, exist_ok=True)

# Change to the notebooks directory (optional, but useful)
os.chdir(project_root / "notebooks")

# Configure matplotlib
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    # Older matplotlib versions only ship the style under its legacy name
    plt.style.use('seaborn')
sns.set_palette("husl")

print(f"Project root: {project_root}")
print(f"Scripts directory: {scripts_dir} {'✅' if scripts_dir.exists() else '❌'}")
print(f"Working directory: {os.getcwd()}")

Project root: /Users/larissa/Desktop/TCC_metatrascriptomica
Scripts directory: /Users/larissa/Desktop/TCC_metatrascriptomica/scripts ✅
Working directory: /Users/larissa/Desktop/TCC_metatrascriptomica/notebooks
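
The setup cell falls back to an absolute, machine-specific path when the notebook is not started from `notebooks/`. As a minimal sketch of a more portable alternative, the root could be found by walking up from the working directory until a `scripts` folder appears. The helper name `find_project_root` and its `marker` parameter are hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "scripts") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


# Hypothetical usage: report a clear message instead of relying on a fixed path.
found_root = find_project_root(Path.cwd())
if found_root is None:
    print("No 'scripts' directory found at or above the current working directory.")
else:
    print(f"Project root: {found_root}")
```

With this approach the notebook would behave the same whether it is launched from the project root, from `notebooks/`, or from a subfolder of it.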


## 1. Load Data

In [2]:
from scripts.ml_utils import load_training_data, prepare_data

# Load the training matrix
training_matrix_path = str(DATA_DIR / "training" / "pivoted-virome-organisms-atleast10tpm-species-covid-TCC-pos-2.csv")

if os.path.exists(training_matrix_path):
    print("Loading training data...")
    X, y = load_training_data(training_matrix_path, target_column='COVID')
    print(f"  ✅ Data loaded: {X.shape[0]} samples, {X.shape[1]} features")
    print("  ✅ Class distribution:")
    print(y.value_counts().to_string())

    # Prepare data (log transform, scaling, train/test split)
    print("\nPreparing data (log transform, scaling, train/test split)...")
    X_train, X_test, y_train, y_test, scaler = prepare_data(
        X=X, y=y,
        test_size=0.2,
        random_state=42,
        apply_log=True,
        apply_scaling=True
    )
    print(f"  ✅ Training set: {X_train.shape[0]} samples")
    print(f"  ✅ Test set: {X_test.shape[0]} samples")
else:
    print(f"❌ Training matrix not found: {training_matrix_path}")
    X_train, X_test, y_train, y_test, scaler = None, None, None, None, None

Loading training data...
  ✅ Data loaded: 100 samples, 4670 features
  ✅ Class distribution:
COVID
1    85
0    15

Preparing data (log transform, scaling, train/test split)...
🔹 After dropping all-NaN columns: 2186 features
🔹 After VarianceThreshold: 793 features
⚠️ Removing 7 all-NaN columns in the training set
✅ Final dataset: 786 features
  ✅ Training set: 80 samples
  ✅ Test set: 20 samples
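
The body of `prepare_data` lives in `scripts/ml_utils.py` and is not shown here, so the following is only a sketch of what a log transform, variance filter, scaling and train/test split could look like with scikit-learn. The synthetic data, the `_demo` variable names, the stratified split and the order of the steps are all assumptions, not the project's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the abundance matrix (100 samples, 50 features).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.gamma(shape=1.0, scale=20.0, size=(100, 50)),
    columns=[f"species_{i}" for i in range(50)],
)
X_demo["species_0"] = 0.0  # constant column, expected to be filtered out
y_demo = pd.Series([1] * 85 + [0] * 15, name="COVID")

# 1. Stratified split first, so every later step is fitted on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2. log1p compresses the heavy right tail of abundance values.
X_tr_log, X_te_log = np.log1p(X_tr), np.log1p(X_te)

# 3. Drop zero-variance features (selector fitted on the training split).
selector = VarianceThreshold(threshold=0.0)
X_tr_sel = selector.fit_transform(X_tr_log)
X_te_sel = selector.transform(X_te_log)

# 4. Standardize using training-set statistics only.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr_sel)
X_te_scaled = scaler_demo.transform(X_te_sel)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(y_te.value_counts().to_dict())
```

Fitting the selector and the scaler on the training split alone keeps test-set statistics out of the model, which matters with only 100 samples and an 85/15 class ratio.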


## 2. Model Training

In [3]:
from scripts.ml_utils import (
    train_random_forest,
    train_xgboost,
    train_logistic_regression,
    evaluate_model,
    get_feature_importance
)

if X_train is not None and y_train is not None:
    # Train models
    print("\n" + "="*50)
    print("MODEL TRAINING")
    print("="*50)

    print("\n1. Training Random Forest...")
    rf_model = train_random_forest(X_train, y_train)
    rf_metrics = evaluate_model(rf_model, X_test, y_test, "Random Forest")

    print("\n2. Training XGBoost...")
    try:
        xgb_model = train_xgboost(X_train, y_train)
        xgb_metrics = evaluate_model(xgb_model, X_test, y_test, "XGBoost")
    except Exception as e:
        # Keep going with the other models if XGBoost fails
        print(f"  ⚠️ Error while training XGBoost: {e}")
        xgb_model = None
        xgb_metrics = None

    print("\n3. Training Logistic Regression...")
    lr_model = train_logistic_regression(X_train, y_train)
    lr_metrics = evaluate_model(lr_model, X_test, y_test, "Logistic Regression")

    # Compare models
    print("\n" + "="*50)
    print("MODEL COMPARISON")
    print("="*50)
    models_comparison = []
    models_comparison.append({"Model": "Random Forest", **rf_metrics})
    if xgb_metrics:
        models_comparison.append({"Model": "XGBoost", **xgb_metrics})
    models_comparison.append({"Model": "Logistic Regression", **lr_metrics})

    comparison_df = pd.DataFrame(models_comparison)
    print(comparison_df.to_string(index=False))
else:
    print("❌ Cannot train models: data not loaded.")


MODEL TRAINING

1. Training Random Forest...

=== Random Forest ===
              precision    recall  f1-score   support

           0       1.00      0.33      0.50         3
           1       0.89      1.00      0.94        17

    accuracy                           0.90        20
   macro avg       0.95      0.67      0.72        20
weighted avg       0.91      0.90      0.88        20

ROC-AUC: 0.7353

2. Training XGBoost...
  ⚠️ Error while training XGBoost: feature_names must be string, and may not contain [, ] or <

3. Training Logistic Regression...

=== Logistic Regression ===
              precision    recall  f1-score   support

           0       0.50      0.33      0.40         3
           1       0.89      0.94      0.91        17

    accuracy                           0.85        20
   macro avg       0.69      0.64      0.66        20
weighted avg       0.83      0.85      0.84        20

ROC-AUC: 0.6275

MODEL COMPARISON
              Model  accur
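
XGBoost was skipped above because at least one feature name is not a string or contains `[`, `]` or `<`. Which of these applies here cannot be told from the output. The sketch below shows one way such names could be cleaned before training; the helper `sanitize_feature_names` and the example labels are hypothetical and not part of `scripts.ml_utils`.

```python
import re


def sanitize_feature_names(names):
    """Cast names to str, replace the characters [, ] and < with '_',
    and append a counter to names that repeat after cleaning."""
    seen = {}
    cleaned = []
    for name in names:
        new = re.sub(r"[\[\]<]", "_", str(name))
        count = seen.get(new, 0)
        seen[new] = count + 1
        cleaned.append(new if count == 0 else f"{new}_{count}")
    return cleaned


# Hypothetical species labels of the kind that would trigger the error.
raw_names = [
    "Human coronavirus [strain A]",
    "Phage <unclassified>",
    786,  # non-string column label
    "Human coronavirus _strain A_",  # collides with the first name once cleaned
]
safe_names = sanitize_feature_names(raw_names)
print(safe_names)
```

If `X_train` is a pandas DataFrame, the cleaned list could be assigned to `X_train.columns`. The same renaming would then have to be applied to `X_test` and to the patient vector so that all three share identical column names.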

## 3. Best Model Selection and Patient Prediction

In [4]:
from scripts.ml_utils import predict_patient
import joblib

if X_train is not None:
    # Select the best model (based on ROC-AUC)
    print("\n" + "="*50)
    print("BEST MODEL SELECTION")
    print("="*50)

    # Compare ROC-AUC
    models_scores = [
        ("Random Forest", rf_model, rf_metrics['roc_auc']),
    ]
    if xgb_model:
        models_scores.append(("XGBoost", xgb_model, xgb_metrics['roc_auc']))
    models_scores.append(("Logistic Regression", lr_model, lr_metrics['roc_auc']))

    # Pick the model with the highest ROC-AUC
    best_name, best_model, best_score = max(models_scores, key=lambda x: x[2])
    print(f"\n✅ Best model: {best_name} (ROC-AUC: {best_score:.4f})")

    # Save model and scaler
    model_file = str(ML_DIR / "best_model.pkl")
    scaler_file = str(ML_DIR / "scaler.pkl")

    joblib.dump(best_model, model_file)
    joblib.dump(scaler, scaler_file)
    print(f"  ✅ Model saved: {model_file}")
    print(f"  ✅ Scaler saved: {scaler_file}")

    # Load the patient's features
    patient_features_path = str(FEATURES_DIR / "patient_joao_features_vector_log.csv")

    print("\n" + "="*50)
    print("PATIENT PREDICTION")
    print("="*50)

    if os.path.exists(patient_features_path):
        patient_features = pd.read_csv(patient_features_path)
        print(f"\n✅ Patient features loaded: {patient_features.shape[1]} features")

        # Run the prediction
        prediction_result = predict_patient(
            model=best_model,
            patient_features=patient_features,
            scaler=scaler,
            apply_log=False  # Already applied to the feature vector
        )

        print("\n" + "-"*50)
        print("PREDICTION RESULT")
        print("-"*50)
        prediction_text = "POSITIVE for COVID-19" if prediction_result['prediction'] == 1 else "NEGATIVE for COVID-19"
        print(f"Prediction: {prediction_text}")
        print(f"Probability COVID-19: {prediction_result['probability_covid']:.4f} ({prediction_result['probability_covid']*100:.2f}%)")
        print(f"Probability NOT COVID-19: {prediction_result['probability_no_covid']:.4f} ({prediction_result['probability_no_covid']*100:.2f}%)")
        print("-"*50)

        # Save the result
        result_df = pd.DataFrame({
            'patient': ['João'],
            'prediction': [prediction_result['prediction']],
            'prediction_text': [prediction_text],
            'probability_covid': [prediction_result['probability_covid']],
            'probability_no_covid': [prediction_result['probability_no_covid']]
        })
        result_file = str(ML_DIR / "patient_joao_prediction.csv")
        result_df.to_csv(result_file, index=False)
        print(f"\n✅ Result saved to: {result_file}")
    else:
        print(f"\n❌ Patient features not found: {patient_features_path}")
        print("   Run Notebook 3 first to generate the feature vector.")
else:
    print("❌ Cannot run prediction: model was not trained.")


BEST MODEL SELECTION

✅ Best model: Random Forest (ROC-AUC: 0.7353)
  ✅ Model saved: /Users/larissa/Desktop/TCC_metatrascriptomica/results/ml_results/best_model.pkl
  ✅ Scaler saved: /Users/larissa/Desktop/TCC_metatrascriptomica/results/ml_results/scaler.pkl

PATIENT PREDICTION

✅ Patient features loaded: 4678 features

--------------------------------------------------
PREDICTION RESULT
--------------------------------------------------
Prediction: POSITIVE for COVID-19
Probability COVID-19: 0.9579 (95.79%)
Probability NOT COVID-19: 0.0421 (4.21%)
--------------------------------------------------

✅ Result saved to: /Users/larissa/Desktop/TCC_metatrascriptomica/results/ml_results/patient_joao_prediction.csv
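
The patient file has 4678 columns, while the training matrix was reduced to 786 features, so the vector has to be brought to the training feature set before the scaler and model can use it. How `predict_patient` does this is not shown in this notebook. The sketch below illustrates one common approach with `DataFrame.reindex`; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical training feature order (the real list would come from the fitted pipeline).
training_columns = ["species_a", "species_b", "species_c"]

# Hypothetical single-row patient vector with one extra and one missing feature.
patient = pd.DataFrame([{"species_b": 2.3, "species_a": 0.0, "species_z": 5.1}])

# reindex keeps the training order, drops unknown columns,
# and fills features absent from the patient sample with 0.
aligned = patient.reindex(columns=training_columns, fill_value=0.0)

print(aligned)
```

Filling absent features with zero assumes that a species missing from the patient sample was simply not detected, which would need to be checked against how the feature vector was built in Notebook 3.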
