# Final Project - Machine Learning 2025-2

This notebook writes metrics, probabilities, and predictions to disk (`artifacts/askl2/<dataset>/`).
After running it, run `model_comparison.ipynb` to generate the comparative visualizations of the models.

## Installing the packages

We install only what is needed to:
- load datasets from OpenML;
- preprocess the datasets with scikit-learn;
- train ASKL 2.0 (Auto-Sklearn 2.0).

The visualization and analysis dependencies now live in `model_comparison.ipynb`.


In [10]:
%pip install auto-sklearn==0.15.0 \
    pandas==1.5.3 \
    openml==0.14.2 \
    numpy==1.24.4 \
    scikit-learn

Note: you may need to restart the kernel to use updated packages.
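
Since `scikit-learn` is left unpinned while the other packages are pinned, it can help to confirm what actually got installed before training. The following is a minimal sketch using only the standard library; `PINNED`, `installed_version`, and `check_pins` are names introduced here for illustration and are not part of the notebook.

```python
from importlib import metadata

# Versions requested in the %pip cell above; scikit-learn is unpinned there.
PINNED = {
    "auto-sklearn": "0.15.0",
    "pandas": "1.5.3",
    "openml": "0.14.2",
    "numpy": "1.24.4",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pinned):
    """Map each package to (installed, expected, matches)."""
    report = {}
    for name, expected in pinned.items():
        found = installed_version(name)
        report[name] = (found, expected, found == expected)
    return report


for name, (found, expected, ok) in check_pins(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: installed={found} expected={expected} [{status}]")
```

Running this right after the kernel restart gives a quick record of the environment that produced the artifacts.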


## Configuring the environment
- defines the global hyperparameters
- fixes the random seed for reproducibility


In [11]:
SEARCH_ITERATIONS = 5
RANDOM_STATE = 42
CV_FOLDS = 5
N_JOBS = -1

import numpy as np
import warnings

warnings.filterwarnings('ignore')
np.random.seed(RANDOM_STATE)

## Obtaining and preparing the data
Loads the 10 smallest classification datasets (by number of instances) from the
OpenML-CC18 suite and keeps them in memory for the split that follows.


In [12]:
import openml

OPENML_CC18_ID = 99
NUM_DATASETS = 10

suite = openml.study.get_suite(suite_id=OPENML_CC18_ID)
datasets_df = openml.datasets.list_datasets(data_id=suite.data, output_format='dataframe')

datasets_df_sorted = datasets_df.sort_values(by='NumberOfInstances')
top_datasets = datasets_df_sorted.head(NUM_DATASETS)

datasets_memory = {}
for idx, row in top_datasets.iterrows():
    dataset_id = row['did']
    dataset_name = row['name']
    print(f"Fetching {dataset_name} (ID: {dataset_id}, Instances: {row['NumberOfInstances']})...")
    try:
        dataset = openml.datasets.get_dataset(dataset_id)
        X, y, _, _ = dataset.get_data(
            target=dataset.default_target_attribute,
            dataset_format='dataframe'
        )
        if y is not None:
            X['target'] = y
        datasets_memory[dataset_name] = X
    except Exception as exc:
        print(f"Failed to load {dataset_name}: {exc}")

print(f"Done! {len(datasets_memory)} datasets available in 'datasets_memory'.")


Fetching dresses-sales (ID: 23381, Instances: 500.0)...
Fetching kc2 (ID: 1063, Instances: 522.0)...
Fetching cylinder-bands (ID: 6332, Instances: 540.0)...
Fetching climate-model-simulation-crashes (ID: 40994, Instances: 540.0)...
Fetching wdbc (ID: 1510, Instances: 569.0)...
Fetching ilpd (ID: 1480, Instances: 583.0)...
Fetching balance-scale (ID: 11, Instances: 625.0)...
Fetching credit-approval (ID: 29, Instances: 690.0)...
Fetching breast-w (ID: 15, Instances: 699.0)...
Fetching eucalyptus (ID: 188, Instances: 736.0)...
Done! 10 datasets available in 'datasets_memory'.
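
Two of the fetched datasets (`cylinder-bands` and `climate-model-simulation-crashes`) have 540 instances each, so sorting on instance count alone leaves their relative order up to the sort algorithm. Below is a sketch of a deterministic selection that breaks ties by dataset ID, written in plain Python over a few of the (name, ID, instances) triples printed above; `CATALOG` and `smallest_datasets` are hypothetical names used only for this illustration.

```python
# (name, OpenML dataset ID, number of instances), copied from the output above
# and deliberately shuffled to show that the input order does not matter.
CATALOG = [
    ("eucalyptus", 188, 736),
    ("climate-model-simulation-crashes", 40994, 540),
    ("wdbc", 1510, 569),
    ("dresses-sales", 23381, 500),
    ("cylinder-bands", 6332, 540),
    ("kc2", 1063, 522),
]


def smallest_datasets(rows, n):
    """Return the n smallest datasets, breaking ties on instance count by ID."""
    return sorted(rows, key=lambda row: (row[2], row[1]))[:n]


for name, did, instances in smallest_datasets(CATALOG, 4):
    print(f"{name} (ID: {did}, Instances: {instances})")
```

In the notebook itself, the same effect could likely be obtained by sorting on both columns, for example `datasets_df.sort_values(by=['NumberOfInstances', 'did'])`.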


## Splitting the data into training and test sets
Each dataset is split into 70% training and 30% test using the same seed.


In [13]:
from sklearn.model_selection import train_test_split

train_test_splits = {}
for dataset_name, df in datasets_memory.items():
    X = df.drop(columns=['target'])
    y = df['target']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=RANDOM_STATE
    )

    train_test_splits[dataset_name] = {
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test,
        'train_size': len(X_train),
        'test_size': len(X_test)
    }

    print(f"{dataset_name}: train={len(X_train)} | test={len(X_test)}")

print(f"Total of {len(train_test_splits)} datasets split successfully!")


dresses-sales: train=350 | test=150
kc2: train=365 | test=157
cylinder-bands: train=378 | test=162
climate-model-simulation-crashes: train=378 | test=162
wdbc: train=398 | test=171
ilpd: train=408 | test=175
balance-scale: train=437 | test=188
credit-approval: train=483 | test=207
breast-w: train=489 | test=210
eucalyptus: train=515 | test=221
Total of 10 datasets split successfully!
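
The train/test counts above follow from how `train_test_split` rounds a fractional `test_size`: the test set gets `ceil(test_size * n_samples)` rows and the training set gets the remainder. Here is a small sketch that reproduces some of the printed sizes from the instance counts; `split_sizes` and `INSTANCES` are names introduced here for illustration.

```python
import math


def split_sizes(n_samples, test_size=0.30):
    """Return (n_train, n_test) for a fractional test_size.

    Mirrors the rounding scikit-learn's train_test_split applies when only
    test_size is given: the test set is rounded up, the rest is training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# Instance counts taken from the dataset listing earlier in the notebook.
INSTANCES = {"dresses-sales": 500, "kc2": 522, "wdbc": 569, "balance-scale": 625}

for name, n in INSTANCES.items():
    n_train, n_test = split_sizes(n)
    print(f"{name}: train={n_train} | test={n_test}")
```

The split in the notebook is not stratified. Passing `stratify=y` to `train_test_split` would keep class proportions the same in both parts, which may matter on datasets this small; whether to do so is a design choice the notebook leaves open.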


## Preprocessing and shared artifacts
Utilities that encode the data and save the ASKL 2.0 artifacts
to `artifacts/askl2/<dataset_slug>/` (metadata + predictions).


In [14]:
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from artifact_utils import write_artifact_bundle, get_dataset_dir

ASKL_MODEL_KEY = 'askl2'


def artifact_exists(dataset_name: str) -> bool:
    dataset_dir = get_dataset_dir(ASKL_MODEL_KEY, dataset_name)
    return (
        dataset_dir.joinpath('metadata.json').exists()
        and dataset_dir.joinpath('predictions.npz').exists()
    )


def artifact_output_dir(dataset_name: str) -> str:
    return str(get_dataset_dir(ASKL_MODEL_KEY, dataset_name))


def preprocess_dataset(splits):
    """Encode labels and features, fitting every transformer on the training split only."""
    X_train = splits['X_train']
    y_train = splits['y_train']
    X_test = splits['X_test']
    y_test = splits['y_test']

    # Encode class labels as integers; transform() raises if the test split
    # contains a label that never appears in training.
    le = LabelEncoder()
    y_train_encoded = le.fit_transform(y_train)
    y_test_encoded = le.transform(y_test)

    X_train_encoded = X_train.copy()
    X_test_encoded = X_test.copy()
    categorical_cols = X_train_encoded.select_dtypes(include=['object', 'category']).columns.tolist()

    # Ordinal-encode categorical columns; categories unseen in training map to -1.
    # Casting to str also turns missing values into a regular 'nan' category.
    if categorical_cols:
        oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
        X_train_encoded[categorical_cols] = oe.fit_transform(X_train_encoded[categorical_cols].astype(str))
        X_test_encoded[categorical_cols] = oe.transform(X_test_encoded[categorical_cols].astype(str))

    # Standardize numeric columns. The ordinal codes written above are numeric
    # by now, so they are scaled together with the original numeric columns.
    scaler = StandardScaler()
    numeric_cols = X_train_encoded.select_dtypes(include=[np.number]).columns
    X_train_scaled = X_train_encoded.copy()
    X_test_scaled = X_test_encoded.copy()
    if len(numeric_cols) > 0:
        X_train_scaled[numeric_cols] = scaler.fit_transform(X_train_encoded[numeric_cols])
        X_test_scaled[numeric_cols] = scaler.transform(X_test_encoded[numeric_cols])

    return {
        'X_train_encoded': X_train_encoded,
        'X_test_encoded': X_test_encoded,
        'X_train_scaled': X_train_scaled,
        'X_test_scaled': X_test_scaled,
        'y_train_encoded': y_train_encoded,
        'y_test_encoded': y_test_encoded,
        'y_train_original': y_train.reset_index(drop=True),
        'y_test_original': y_test.reset_index(drop=True),
        'label_encoder': le,
        'categorical_cols': categorical_cols,
        'numeric_cols': numeric_cols
    }
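
`preprocess_dataset` fits each transformer on the training split and only applies it to the test split. The sketch below mimics that behaviour in plain Python for one categorical and one numeric column, to show what `unknown_value=-1` and train-only scaling mean in practice. It illustrates the idea and is not scikit-learn's implementation; all helper names are hypothetical.

```python
import statistics


def fit_ordinal(values):
    """Learn a category -> integer code mapping from training values."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping, unknown_value=-1):
    """Encode values; categories unseen during fitting get unknown_value."""
    return [mapping.get(value, unknown_value) for value in values]


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)  # avoid dividing by zero


def transform_scaler(values, mean, std):
    """Standardize values with statistics learned on the training split."""
    return [(value - mean) / std for value in values]


train_colors = ["red", "blue", "red", "green"]
test_colors = ["blue", "purple"]  # "purple" never appears in training
mapping = fit_ordinal(train_colors)
print(transform_ordinal(test_colors, mapping))

train_ages = [20.0, 30.0, 40.0]
test_ages = [30.0, 50.0]
mean, std = fit_scaler(train_ages)
print(transform_scaler(test_ages, mean, std))
```

Because the test values never influence the fitted mapping or statistics, nothing about the test split leaks into the features used for training.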


## Dedicated training with ASKL 2.0
Runs Auto-Sklearn 2.0 on each dataset and persists only the artifacts
(which do not depend on the environment) that the comparison notebook will consume.


In [15]:
from autosklearn.experimental.askl2 import AutoSklearn2Classifier
from sklearn.metrics import accuracy_score
import time

processed_datasets = {}
total_datasets = len(train_test_splits)

ASKL_TIME_BUDGET = 300  # seconds
ASKL_PER_RUN_LIMIT = 120  # seconds per configuration
ASKL_MEMORY_LIMIT = 6144  # MB


def run_askl_training(processed_data):
    X_train = processed_data['X_train_scaled'].to_numpy()
    y_train = processed_data['y_train_encoded']
    X_test = processed_data['X_test_scaled'].to_numpy()
    y_test = processed_data['y_test_encoded']

    classifier = AutoSklearn2Classifier(
        time_left_for_this_task=ASKL_TIME_BUDGET,
        per_run_time_limit=ASKL_PER_RUN_LIMIT,
        memory_limit=ASKL_MEMORY_LIMIT,
        n_jobs=N_JOBS,
        seed=RANDOM_STATE,
    )

    start_time = time.time()
    classifier.fit(X_train, y_train)
    total_time = time.time() - start_time

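    # Best internal validation score among the evaluated configurations.
    # ASKL 2.0 picks its own resampling strategy, so this is not necessarily
    # a cross-validation score; it stays NaN if the results are unavailable.
    cv_score = np.nan
    try:
        cv_scores = classifier.cv_results_.get('mean_test_score')
        if cv_scores is not None:
            cv_score = float(np.nanmax(cv_scores))
    except Exception:
        pass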

    y_prob = classifier.predict_proba(X_test)
    y_pred = classifier.predict(X_test)
    test_score = accuracy_score(y_test, y_pred)

    return {
        'y_prob': y_prob,
        'y_pred': y_pred,
        'cv_score': cv_score,
        'test_score': test_score,
        'runtime': total_time,
    }


askl_datasets_processed = 0
for dataset_idx, (dataset_name, splits) in enumerate(train_test_splits.items(), 1):
    print(f"{'='*80}")
    print(f"ASKL 2.0 | Dataset {dataset_idx}/{total_datasets}: {dataset_name}")
    print(f"{'='*80}")

    if artifact_exists(dataset_name):
        print("  ✓ ASKL 2.0 artifacts already exist. Skipping...")
        continue

    processed_data = processed_datasets.get(dataset_name)
    if processed_data is None:
        processed_data = preprocess_dataset(splits)
        processed_datasets[dataset_name] = processed_data

    try:
        askl_result = run_askl_training(processed_data)
        write_artifact_bundle(
            model_key=ASKL_MODEL_KEY,
            dataset_name=dataset_name,
            y_true=processed_data['y_test_encoded'],
            y_pred=askl_result['y_pred'],
            y_prob=askl_result['y_prob'],
            class_labels=processed_data['label_encoder'].classes_.tolist(),
            metrics={
                'cv_accuracy': askl_result['cv_score'],
                'test_accuracy': askl_result['test_score'],
            },
            hyperparams={
                'time_left_for_this_task': ASKL_TIME_BUDGET,
                'per_run_time_limit': ASKL_PER_RUN_LIMIT,
                'memory_limit': ASKL_MEMORY_LIMIT,
                'seed': RANDOM_STATE,
                'n_jobs': N_JOBS,
            },
            runtime_seconds=askl_result['runtime'],
            extra_metadata={
                'train_samples': len(processed_data['X_train_encoded']),
                'test_samples': len(processed_data['X_test_encoded']),
            },
        )
        print(
            f"  ✓ ASKL 2.0 trained! Test ACC = {askl_result['test_score']:.4f}\n"
            f"    → Artifacts saved to {artifact_output_dir(dataset_name)}"
        )
        askl_datasets_processed += 1
    except Exception as exc:
        print(f"  ✗ Error while training ASKL 2.0: {str(exc)[:120]}")

print(f"{'='*80}")
print(f"ASKL 2.0 ran on {askl_datasets_processed} datasets.")
print(f"{'='*80}")


ASKL 2.0 | Dataset 1/10: dresses-sales
  ✓ ASKL 2.0 artifacts already exist. Skipping...
ASKL 2.0 | Dataset 2/10: kc2
  ✓ ASKL 2.0 artifacts already exist. Skipping...
ASKL 2.0 | Dataset 3/10: cylinder-bands
  ✓ ASKL 2.0 trained! Test ACC = 0.8086
    → Artifacts saved to artifacts/askl2/cylinder-bands
ASKL 2.0 | Dataset 4/10: climate-model-simulation-crashes
  ✓ ASKL 2.0 trained! Test ACC = 0.9753
    → Artifacts saved to artifacts/askl2/climate-model-simulation-crashes
ASKL 2.0 | Dataset 5/10: wdbc
  ✓ ASKL 2.0 trained! Test ACC = 0.9883
    → Artifacts saved to artifacts/askl2/wdbc
ASKL 2.0 | Dataset 6/10: ilpd
	Models besides current dummy model: 0
	Dummy models: 1
	[...]
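
The training loop above is resumable: a dataset is skipped when its artifacts are already on disk, and a failure on one dataset does not stop the others. Below is a stripped-down sketch of that pattern using the standard library, assuming the two file names checked by `artifact_exists`. `is_complete`, `run_all`, and `fake_train` are hypothetical stand-ins, not the API of the notebook's `artifact_utils` module.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FILES = ("metadata.json", "predictions.npz")


def is_complete(dataset_dir):
    """True when every required artifact file exists in dataset_dir."""
    return all((dataset_dir / name).exists() for name in REQUIRED_FILES)


def run_all(root, dataset_names, train):
    """Train each dataset once; skip finished ones and survive failures."""
    summary = {"trained": [], "skipped": [], "failed": []}
    for name in dataset_names:
        dataset_dir = Path(root) / name
        if is_complete(dataset_dir):
            summary["skipped"].append(name)
            continue
        try:
            metrics = train(name)
            dataset_dir.mkdir(parents=True, exist_ok=True)
            (dataset_dir / "metadata.json").write_text(json.dumps(metrics))
            # Placeholder bytes: a real run would store prediction arrays here.
            (dataset_dir / "predictions.npz").write_bytes(b"")
            summary["trained"].append(name)
        except Exception:
            summary["failed"].append(name)
    return summary


def fake_train(name):
    """Stand-in for model fitting; fails on one made-up dataset on purpose."""
    if name == "broken-dataset":
        raise RuntimeError("simulated failure")
    return {"dataset": name}


with tempfile.TemporaryDirectory() as tmp:
    names = ["kc2", "wdbc", "broken-dataset"]
    first = run_all(tmp, names, fake_train)
    second = run_all(tmp, names, fake_train)
    print(first)
    print(second)
```

On the second call the two finished datasets are skipped and only the failing one is retried, which is why re-running the notebook after an interruption is cheap.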

## Next steps
1. Run `SAINT.ipynb` to generate `artifacts/saint/<dataset>/`.
2. Run this notebook to generate `artifacts/askl2/<dataset>/`.
3. Open `model_comparison.ipynb` to load both sets of artifacts and produce the
   tables, plots, and statistical tests.


In [16]:
from pathlib import Path

artifact_root = Path('artifacts') / ASKL_MODEL_KEY
print(f"ASKL 2.0 artifacts: {artifact_root.resolve()}")
if not artifact_root.exists():
    print("No dataset processed yet. Run the training above.")
else:
    for dataset_dir in sorted(artifact_root.iterdir()):
        if dataset_dir.is_dir():
            print(f"  - {dataset_dir.name}")


ASKL 2.0 artifacts: /workspaces/AM-1/artifacts/askl2
  - balance-scale
  - breast-w
  - climate-model-simulation-crashes
  - credit-approval
  - cylinder-bands
  - dresses-sales
  - eucalyptus
  - ilpd
  - kc2
  - wdbc
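
Before opening `model_comparison.ipynb`, it may be worth checking that both models produced artifacts for the same datasets. The following is a minimal sketch, assuming each model writes one subdirectory per dataset under `artifacts/<model>/` as described in the steps above; `dataset_names` and `compare_models` are hypothetical helpers.

```python
import tempfile
from pathlib import Path


def dataset_names(model_root):
    """Return the set of dataset subdirectory names under model_root."""
    root = Path(model_root)
    if not root.exists():
        return set()
    return {entry.name for entry in root.iterdir() if entry.is_dir()}


def compare_models(artifacts_root, model_a, model_b):
    """Report datasets shared by both models and those present in only one."""
    a = dataset_names(Path(artifacts_root) / model_a)
    b = dataset_names(Path(artifacts_root) / model_b)
    return {
        "both": sorted(a & b),
        f"only_{model_a}": sorted(a - b),
        f"only_{model_b}": sorted(b - a),
    }


# Demonstration on a throwaway directory tree rather than real artifacts.
with tempfile.TemporaryDirectory() as tmp:
    for model, names in {"askl2": ["kc2", "wdbc"], "saint": ["kc2"]}.items():
        for name in names:
            (Path(tmp) / model / name).mkdir(parents=True)
    report = compare_models(tmp, "askl2", "saint")
    print(report)
```

In the project directory the call would be `compare_models('artifacts', 'askl2', 'saint')`. Note that a directory's presence alone does not show that its files are complete.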
