# Redes Neurais Artificiais 2025.2

- **Disciplina**: Redes Neurais Artificiais 2025.2
- **Professora**: Elloá B. Guedes (ebgcosta@uea.edu.br)  
- **Github**: http://github.com/elloa  
        

Levando em conta a base de dados **_Forest Cover Type_**, esta parte do Projeto Prático diz respeito à proposição e avaliação de múltiplas redes neurais artificiais do tipo feedforward multilayer perceptron para o problema da classificação multi-classe da cobertura florestal em uma área do Roosevelt National Forest.

## Busca em Grade

Uma maneira padrão de escolher os parâmetros de um modelo de Machine Learning é por meio de uma busca em grade via força bruta. O algoritmo da busca em grade é dado como segue:

1. Escolha a métrica de desempenho que você deseja maximizar  
2. Escolha o algoritmo de Machine Learning (exemplo: redes neurais artificiais). Em seguida, defina os parâmetros ou hiperparâmetros deste tipo de modelo sobre os quais você dseja otimizar (número de épocas, taxa de aprendizado, etc.) e construa um array de valores a serem testados para cada parâmetro ou hiperparâmetro.  
3. Defina a grade de busca, a qual é dada como o produto cartesiano de cada parâmetro a ser testado. Por exemplo, para os arrays [50, 100, 1000] e [10, 15], tem-se que a grade é [(50,10), (50,15), (100,10), (100,15), (1000,10), (1000,15)].
4. Para cada combinação de parâmetros a serem otimizados, utilize o conjunto de treinamento para realizar uma validação cruzada (holdout ou k-fold) e calcule a métrica de avaliação no conjunto de teste (ou conjuntos de teste)
5. Escolha a combinação de parâmetros que maximizam a métrica de avaliação. Este é o modelo otimizado.

Por que esta abordagem funciona? Porque a busca em grade efetua uma pesquisa extensiva sobre as possíveis combinações de valores para cada um dos parâmetros a serem ajustados. Para cada combinação, ela estima a performance do modelo em dados novos. Por fim, o modelo com melhor métrica de desempenho é escolhido. Tem-se então que este modelo é o que melhor pode vir a generalizar mediante dados nunca antes vistos.

## Efetuando a Busca em Grade sobre Hiperparâmetros das Top-6 RNAs

Considerando a etapa anterior do projeto prático, foram identificadas pelo menos 6 melhores Redes Neurais para o problema da classificação multi-classe da cobertura florestal no conjunto de dados selecionado. Algumas destas redes possuem atributos categóricos como variáveis preditoras, enquanto outras possuem apenas os atributos numéricos como preditores.

A primeira etapa desta segunda parte do projeto consiste em trazer para este notebook estas seis arquiteturas, ressaltando:

1. Número de neurônios ocultos por camada  
2. Função de Ativação  
3. Utilização ou não de atributos categóricos   
4. Desempenho médio +- desvio padrão nos testes anteriores  
5. Número de repetições que a equipe conseguiu realizar para verificar os resultados  

Elabore uma busca em grade sobre estas arquiteturas que contemple variações nos hiperparâmetros a seguir, conforme documentação de [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)

A. Solver  (Não usar o LBFGS, pois é mais adequado para datasets pequenos)  
B. Batch Size  
C. Learning Rate Init  
D. Paciência (n_iter_no_change)  
E. Épocas  

Nesta busca em grande, contemple a utilização do objeto [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

## Validação Cruzada k-fold

Na elaboração da busca em grid, vamos avaliar os modelos propostos segundo uma estratégia de validação cruzada ainda não explorada até o momento: a validação cruzada k-fold. Segundo a mesma, o conjunto de dados é particionado em k partes: a cada iteração, separa-se uma das partes para teste e o modelo é treinado com as k-1 partes remanescentes. Valores sugestivos de k na literatura são k = 3, 5 ou 10, pois o custo computacional desta validação dos modelos é alto. A métrica de desempenho é resultante da média dos desempenhos nas k iterações. A figura a seguir ilustra a ideia desta avaliação

<img src = "https://ethen8181.github.io/machine-learning/model_selection/img/kfolds.png" width=600></img>

Considerando a métrica de desempenho F1-Score, considere a validação cruzada 5-fold para aferir os resultados da busca em grande anterior.

In [None]:
!pip install -q pandas numpy matplotlib seaborn torch torchvision torchaudio kagglehub ipywidgets

In [None]:
import kagglehub

# Baixar o dataset do Kaggle
path = kagglehub.dataset_download("uciml/forest-cover-type-dataset")
print("Dataset baixado em:", path)

Using Colab cache for faster access to the 'forest-cover-type-dataset' dataset.
Dataset baixado em: /kaggle/input/forest-cover-type-dataset


In [None]:
# Cell A — locate dataset in Colab cache
import os, pprint, glob
candidates = glob.glob('/kaggle/input/forest-cover-type-dataset/**', recursive=True)
files = [p for p in candidates if os.path.isfile(p)]
print("Some files in /kaggle/input/forest-cover-type-dataset:")
pprint.pprint(files[:20])
# Common CSV path:
CSV_PATH = '/kaggle/input/forest-cover-type-dataset/covtype.csv'
if not os.path.exists(CSV_PATH):
    # fallback: find any covtype*.csv
    import glob
    found = glob.glob('/kaggle/input/**/covtype*.csv', recursive=True)
    if found:
        CSV_PATH = found[0]
    else:
        raise FileNotFoundError("covtype.csv not found in /kaggle/input; check files list above.")
print("Using CSV_PATH =", CSV_PATH)


Some files in /kaggle/input/forest-cover-type-dataset:
['/kaggle/input/forest-cover-type-dataset/covtype.csv']
Using CSV_PATH = /kaggle/input/forest-cover-type-dataset/covtype.csv


In [None]:
# Cell B — load CSV and quick inspect
import pandas as pd
CSV_PATH = '/kaggle/input/forest-cover-type-dataset/covtype.csv'  # adjust if needed
df = pd.read_csv(CSV_PATH)
print("Loaded df shape:", df.shape)
display(df.head())
print("Columns (first 20):", df.columns.tolist()[:20])
# configure sample size: None to use all data (careful!), or integer to subsample stratified
MAX_SAMPLES = None   # safe default for Colab; set to None to use full dataset if you have time/ram


Loaded df shape: (581012, 55)


Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
0,2596,51,3,258,0,510,221,232,148,6279,...,0,0,0,0,0,0,0,0,0,5
1,2590,56,2,212,-6,390,220,235,151,6225,...,0,0,0,0,0,0,0,0,0,5
2,2804,139,9,268,65,3180,234,238,135,6121,...,0,0,0,0,0,0,0,0,0,2
3,2785,155,18,242,118,3090,238,238,122,6211,...,0,0,0,0,0,0,0,0,0,2
4,2595,45,2,153,-1,391,220,234,150,6172,...,0,0,0,0,0,0,0,0,0,5


Columns (first 20): ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5', 'Soil_Type6']


In [None]:
# Cell C — prepare X, y and optional stratified sampling
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

target_col = 'Cover_Type' if 'Cover_Type' in df.columns else df.columns[-1]
print("Target column:", target_col)
X = df.drop(columns=[target_col]).values.astype(np.float32)
y = df[target_col].values.astype(np.int64)
# convert labels from 1..7 to 0..6 if needed
if y.min() == 1:
    y = y - 1

print("Original dataset size:", X.shape)

if MAX_SAMPLES is not None and X.shape[0] > MAX_SAMPLES:
    print(f"Stratified sampling to {MAX_SAMPLES} samples...")
    sss = StratifiedShuffleSplit(n_splits=1, test_size=MAX_SAMPLES, random_state=42)
    for _, idx in sss.split(X, y):
        X_sample = X[idx]
        y_sample = y[idx]
else:
    X_sample, y_sample = X, y

print("Using sample shape:", X_sample.shape, "n_classes:", len(np.unique(y_sample)))

Target column: Cover_Type
Original dataset size: (581012, 54)
Using sample shape: (581012, 54) n_classes: 7


In [None]:
# Cell D — PyTorch helpers
import torch, gc
from torch import nn
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import f1_score, classification_report, confusion_matrix

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Device:", device)

class NumpyDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).long()
    def __len__(self):
        return len(self.y)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_classes, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, max(hidden_dim//2, 8)),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(max(hidden_dim//2, 8), n_classes)
        )
    def forward(self, x):
        return self.net(x)

def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss = 0.0
    for Xb, yb in loader:
        Xb, yb = Xb.to(device), yb.to(device)
        optimizer.zero_grad()
        logits = model(Xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * Xb.size(0)
    return total_loss / len(loader.dataset)

def eval_model(model, loader):
    model.eval()
    ys, ps = [], []
    with torch.no_grad():
        for Xb, yb in loader:
            Xb = Xb.to(device)
            logits = model(Xb)
            preds = logits.argmax(dim=1).cpu().numpy()
            ps.append(preds)
            ys.append(yb.numpy())
    y_true = np.concatenate(ys)
    y_pred = np.concatenate(ps)
    return float(f1_score(y_true, y_pred, average='weighted')), y_true, y_pred


Device: cuda


In [None]:
# Cell E — grid search manual (small grid); adjust param_grid for more experiments
import time, itertools, json
from copy import deepcopy
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
import gc

n_classes = len(np.unique(y_sample))
input_dim = X_sample.shape[1]
print("input_dim:", input_dim, "n_classes:", n_classes)

param_grid = {
    'hidden_dim': [128, 256],
    'lr': [1e-3, 1e-4],
    'batch_size': [256],
    'epochs': [8]
}

def iter_grid(grid):
    keys = list(grid.keys())
    for vals in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, vals))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_cfg, best_score = None, -1.0
results = []

for cfg in iter_grid(param_grid):
    fold_scores = []
    t0 = time.time()
    print("Testing cfg:", cfg)
    for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X_sample, y_sample), 1):
        # fit scaler on train fold
        scaler = StandardScaler()
        X_train_fold = scaler.fit_transform(X_sample[train_idx])
        X_val_fold = scaler.transform(X_sample[val_idx])
        y_train_fold = y_sample[train_idx]
        y_val_fold = y_sample[val_idx]

        train_ds = NumpyDataset(X_train_fold, y_train_fold)
        val_ds = NumpyDataset(X_val_fold, y_val_fold)
        train_loader = DataLoader(train_ds, batch_size=cfg['batch_size'], shuffle=True)
        val_loader = DataLoader(val_ds, batch_size=cfg['batch_size'], shuffle=False)

        model = MLP(input_dim=input_dim, hidden_dim=cfg['hidden_dim'], n_classes=n_classes).to(device)
        opt = torch.optim.Adam(model.parameters(), lr=cfg['lr'])
        criterion = nn.CrossEntropyLoss()

        for ep in range(cfg['epochs']):
            _ = train_one_epoch(model, train_loader, opt, criterion)

        f1_val, _, _ = eval_model(model, val_loader)
        fold_scores.append(f1_val)

        # cleanup
        del model, opt, criterion, train_loader, val_loader, train_ds, val_ds
        torch.cuda.empty_cache()
        gc.collect()

    mean_f1 = float(np.mean(fold_scores))
    elapsed = time.time() - t0
    print(f"  Mean F1 5-fold = {mean_f1:.4f} (time {elapsed:.1f}s)")
    results.append((cfg, mean_f1))
    if mean_f1 > best_score:
        best_score = mean_f1
        best_cfg = deepcopy(cfg)

print("\nBest config:", best_cfg, "best_cv_f1 =", best_score)
# save results
with open('/content/grid_search_results.json', 'w') as f:
    json.dump({'best': best_cfg, 'best_score': best_score, 'results': [[r[0], r[1]] for r in results]}, f, indent=2)
print("Saved grid search summary to /content/grid_search_results.json")

input_dim: 54 n_classes: 7
Testing cfg: {'hidden_dim': 128, 'lr': 0.001, 'batch_size': 256, 'epochs': 8}
  Mean F1 5-fold = 0.8342 (time 234.5s)
Testing cfg: {'hidden_dim': 128, 'lr': 0.0001, 'batch_size': 256, 'epochs': 8}
  Mean F1 5-fold = 0.7591 (time 233.1s)
Testing cfg: {'hidden_dim': 256, 'lr': 0.001, 'batch_size': 256, 'epochs': 8}
  Mean F1 5-fold = 0.8650 (time 233.8s)
Testing cfg: {'hidden_dim': 256, 'lr': 0.0001, 'batch_size': 256, 'epochs': 8}
  Mean F1 5-fold = 0.7837 (time 233.1s)

Best config: {'hidden_dim': 256, 'lr': 0.001, 'batch_size': 256, 'epochs': 8} best_cv_f1 = 0.8649654852949569
Saved grid search summary to /content/grid_search_results.json


In [None]:
# Cell F — train final on X_sample with scaler + save model & scaler
import joblib
from sklearn.preprocessing import StandardScaler
best = best_cfg
print("Training final with:", best)

scaler_final = StandardScaler()
X_scaled = scaler_final.fit_transform(X_sample)
y_final = y_sample

final_ds = NumpyDataset(X_scaled, y_final)
final_loader = DataLoader(final_ds, batch_size=best['batch_size'], shuffle=True)

model_final = MLP(input_dim=input_dim, hidden_dim=best['hidden_dim'], n_classes=n_classes).to(device)
optimizer = torch.optim.Adam(model_final.parameters(), lr=best['lr'])
criterion = nn.CrossEntropyLoss()

for epoch in range(best['epochs']):
    loss = train_one_epoch(model_final, final_loader, optimizer, criterion)
    print(f"Epoch {epoch+1}/{best['epochs']} - loss: {loss:.4f}")

MODEL_PATH = '/content/cover_mlp_best.pth'
SCALER_PATH = '/content/cover_scaler.pkl'
torch.save(model_final.state_dict(), MODEL_PATH)
joblib.dump(scaler_final, SCALER_PATH)
print("Saved model to", MODEL_PATH)
print("Saved scaler to", SCALER_PATH)

# save summary
summary = {
    'n_total_available': int(X.shape[0]),
    'n_used_sample': int(X_sample.shape[0]),
    'best_cfg': best,
    'best_cv_f1': best_score,
    'model_path': MODEL_PATH,
    'scaler_path': SCALER_PATH
}
import json
with open('/content/cover_training_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)
print("Saved training summary to /content/cover_training_summary.json")


Training final with: {'hidden_dim': 256, 'lr': 0.001, 'batch_size': 256, 'epochs': 8}
Epoch 1/8 - loss: 0.5936
Epoch 2/8 - loss: 0.4880
Epoch 3/8 - loss: 0.4481
Epoch 4/8 - loss: 0.4226
Epoch 5/8 - loss: 0.4046
Epoch 6/8 - loss: 0.3909
Epoch 7/8 - loss: 0.3794
Epoch 8/8 - loss: 0.3691
Saved model to /content/cover_mlp_best.pth
Saved scaler to /content/cover_scaler.pkl
Saved training summary to /content/cover_training_summary.json


## Identificando a mellhor solução

Como resultado da busca em grande com validação cruzada 5-fold, identifique o modelo otimizado com melhor desempenho para o problema. Apresente claramente este modelo, seus parâmetros, hiperparâmetros otimizados e resultados para cada um dos folds avaliados. Esta é a melhor solução identificada em decorrência deste projeto

In [5]:

import os, json, time, numpy as np, pandas as pd, torch, torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from torch.cuda.amp import autocast, GradScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, classification_report, confusion_matrix

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.benchmark = True
try: torch.set_float32_matmul_precision("high")
except: pass

# Descobre X/y vindos do notebook; senão usa covertype da sklearn
def _pick_xy():
    for Xn, yn in [("Xcv","ycv"), ("X_use","y_use"), ("X","y")]:
        if Xn in globals() and yn in globals():
            X_, y_ = globals()[Xn], globals()[yn]
            return np.asarray(X_, dtype=np.float32), np.asarray(y_, dtype=int)
    # fallback: covertype
    from sklearn.datasets import fetch_covtype
    df = fetch_covtype(as_frame=True).frame
    y_ = df["Cover_Type"].astype(int).to_numpy()
    X_ = df.drop(columns=["Cover_Type"]).to_numpy(dtype=np.float32)
    # amostra pra caber no Colab se precisar
    if X_.shape[0] > 60000:
        from sklearn.model_selection import train_test_split
        X_, _, y_, _ = train_test_split(X_, y_, train_size=60000, stratify=y_, random_state=42)
    return X_.astype(np.float32), y_

X_all, y_all = _pick_xy()
n_features = X_all.shape[1]; n_classes = int(np.max(y_all))+1
print(f"Dados: X={X_all.shape}, y={y_all.shape}, classes={n_classes}, device={DEVICE}")

# Modelo MLP simples
class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, dropout=0.2):
        super().__init__()
        layers, d = [], in_dim
        for h in hidden:
            layers += [nn.Linear(d, h), nn.ReLU(), nn.Dropout(dropout)]
            d = h
        layers += [nn.Linear(d, out_dim)]
        self.net = nn.Sequential(*layers)
    def forward(self, x): return self.net(x)

def train_one_fold(X_tr, y_tr, X_va, y_va, config, max_epochs=200, patience=15, bs=2048):
    # scaler por fold
    scaler = StandardScaler().fit(X_tr)
    Xt = scaler.transform(X_tr).astype(np.float32)
    Xv = scaler.transform(X_va).astype(np.float32)

    tl = DataLoader(TensorDataset(torch.from_numpy(Xt), torch.from_numpy(y_tr)), batch_size=bs, shuffle=True, num_workers=2, pin_memory=True)
    vl = DataLoader(TensorDataset(torch.from_numpy(Xv), torch.from_numpy(y_va)), batch_size=bs, shuffle=False, num_workers=2, pin_memory=True)

    model = MLP(n_features, config["hidden"], n_classes, dropout=config.get("dropout",0.2)).to(DEVICE)
    opt = torch.optim.AdamW(model.parameters(), lr=config["lr"], weight_decay=config.get("wd",1e-4))
    scaler_amp = GradScaler(enabled=(DEVICE.type=="cuda"))
    crit = nn.CrossEntropyLoss()

    best_f1, best_state, noimp = -1.0, None, 0
    for ep in range(1, max_epochs+1):
        model.train()
        for xb, yb in tl:
            xb, yb = xb.to(DEVICE, non_blocking=True), yb.to(DEVICE, non_blocking=True)
            opt.zero_grad(set_to_none=True)
            with autocast(device_type="cuda", dtype=torch.float16, enabled=(DEVICE.type=="cuda")):
                loss = crit(model(xb), yb)
            scaler_amp.scale(loss).backward(); scaler_amp.step(opt); scaler_amp.update()

        # valida
        model.eval(); preds, gts = [], []
        with torch.no_grad():
            for xb, yb in vl:
                xb = xb.to(DEVICE, non_blocking=True)
                with autocast(device_type="cuda", dtype=torch.float16, enabled=(DEVICE.type=="cuda")):
                    p = model(xb).argmax(1).cpu()
                preds.append(p); gts.append(yb)
        yv = torch.cat(gts).numpy(); pv = torch.cat(preds).numpy()
        f1 = f1_score(yv, pv, average="macro")

        if f1 > best_f1 + 1e-4:
            best_f1, noimp = f1, 0
            best_state = {k: v.cpu() for k,v in model.state_dict().items()}
        else:
            noimp += 1
            if noimp >= patience: break

    model.load_state_dict(best_state)
    return best_f1, model.cpu(), scaler


Dados: X=(60000, 54), y=(60000,), classes=8, device=cuda


In [7]:
import time, json, numpy as np, pandas as pd, torch, torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, classification_report, confusion_matrix

# ---------------------------------------------------------------------
from contextlib import nullcontext
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
USE_CUDA = (DEVICE.type == "cuda")

try:
    from torch import amp
    def AMP():
        return amp.autocast("cuda", dtype=torch.float16) if USE_CUDA else nullcontext()
    def make_scaler():
        return amp.GradScaler("cuda") if USE_CUDA else None
    print("AMP: usando torch.amp ✓")
except Exception:
    from torch.cuda.amp import autocast as legacy_autocast, GradScaler as LegacyGradScaler
    def AMP():
        return legacy_autocast(enabled=USE_CUDA, dtype=torch.float16)
    def make_scaler():
        return LegacyGradScaler(enabled=USE_CUDA)
    print("AMP: usando torch.cuda.amp (legacy) ✓")

torch.backends.cudnn.benchmark = True
try: torch.set_float32_matmul_precision("high")
except: pass
# ---------------------------------------------------------------------

assert "X_all" in globals() and "y_all" in globals(), "Cadê X_all/y_all? Rode a célula G1 antes."
assert "n_features" in globals() and "n_classes" in globals(), "Cadê n_features/n_classes? Rode a célula G1 antes."
assert "MLP" in globals(), "Cadê a classe MLP? Ela é definida na G1."

# função de treino
def train_one_fold(X_tr, y_tr, X_va, y_va, config, max_epochs=120, patience=12, bs=2048):
    scaler = StandardScaler().fit(X_tr)
    Xt = scaler.transform(X_tr).astype(np.float32)
    Xv = scaler.transform(X_va).astype(np.float32)

    tl = DataLoader(TensorDataset(torch.from_numpy(Xt), torch.from_numpy(y_tr)), batch_size=bs, shuffle=True, num_workers=2, pin_memory=True)
    vl = DataLoader(TensorDataset(torch.from_numpy(Xv), torch.from_numpy(y_va)), batch_size=bs, shuffle=False, num_workers=2, pin_memory=True)

    model = MLP(n_features, config["hidden"], n_classes, dropout=config.get("dropout", 0.2)).to(DEVICE)
    opt = torch.optim.AdamW(model.parameters(), lr=config["lr"], weight_decay=config.get("wd", 1e-4))
    scaler_amp = make_scaler()
    criterion = nn.CrossEntropyLoss()

    best_f1, noimp, best_state = -1.0, 0, None
    for ep in range(1, max_epochs + 1):
        # treino
        model.train()
        for xb, yb in tl:
            xb, yb = xb.to(DEVICE, non_blocking=True), yb.to(DEVICE, non_blocking=True)
            opt.zero_grad(set_to_none=True)
            with AMP():
                logits = model(xb)
                loss = criterion(logits, yb)
            if scaler_amp is not None:
                scaler_amp.scale(loss).backward()
                scaler_amp.step(opt)
                scaler_amp.update()
            else:
                loss.backward()
                opt.step()

        # validação
        model.eval(); preds, gts = [], []
        with torch.no_grad():
            for xb, yb in vl:
                xb = xb.to(DEVICE, non_blocking=True)
                with AMP():
                    p = model(xb).argmax(1).cpu()
                preds.append(p); gts.append(yb)
        yv = torch.cat(gts).numpy(); pv = torch.cat(preds).numpy()
        f1 = f1_score(yv, pv, average="macro")

        if f1 > best_f1 + 1e-4:
            best_f1, noimp = f1, 0
            best_state = {k: v.cpu() for k, v in model.state_dict().items()}
        else:
            noimp += 1
            if noimp >= patience:
                break

    model.load_state_dict(best_state)
    return best_f1, model.cpu(), scaler

# =====================================================================
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
param_grid = [
    {"name":"mlp_128_lr1e-3", "hidden":[128], "lr":1e-3},
    {"name":"mlp_128_lr5e-4", "hidden":[128], "lr":5e-4},
    {"name":"mlp_256_lr1e-3", "hidden":[256], "lr":1e-3},
    {"name":"mlp_256_lr5e-4", "hidden":[256], "lr":5e-4},
]

def cv_score(config):
    fold_scores = []
    t0 = time.perf_counter()
    for k,(tr,va) in enumerate(skf.split(X_all, y_all),1):
        fk = time.perf_counter()
        f1, _, _ = train_one_fold(X_all[tr], y_all[tr], X_all[va], y_all[va],
                                  config, max_epochs=120, patience=12, bs=2048)
        fold_scores.append(f1)
        print(f"{config['name']} | fold {k}/5  F1-macro={f1:.4f} | elapsed={time.perf_counter()-fk:.1f}s")
    print(f"→ {config['name']} total={time.perf_counter()-t0:.1f}s\n")
    return np.array(fold_scores)

# roda grid
grid_results = []
for cfg in param_grid:
    scores = cv_score(cfg)
    grid_results.append({"config":cfg, "mean":scores.mean(), "std":scores.std(), "per_fold":scores.tolist()})
grid_results = sorted(grid_results, key=lambda d: d["mean"], reverse=True)

best = grid_results[0]
best_config = best["config"]
print("\n>>> MELHOR CONFIG:", best_config, f"| F1-macro CV = {best['mean']:.4f} ± {best['std']:.4f}")
print("Resultados por fold (F1-macro):", np.round(best["per_fold"], 4))

# OOF: treina por fold e grava predições nas posições corretas
oof_pred = np.empty_like(y_all)
fold_models, fold_scalers = [], []
for k,(tr,va) in enumerate(skf.split(X_all, y_all),1):
    f1, model, scaler = train_one_fold(X_all[tr], y_all[tr], X_all[va], y_all[va],
                                       best_config, max_epochs=120, patience=12, bs=2048)
    Xv = scaler.transform(X_all[va]).astype(np.float32)
    with torch.no_grad():
        with AMP():
            logits = model(torch.from_numpy(Xv)).argmax(1).numpy()
    oof_pred[va] = logits
    fold_models.append(model); fold_scalers.append(scaler)
    print(f"[OOF] fold {k}: F1-macro={f1:.4f}")

print("\nClassification report (OOF):")
print(classification_report(y_all, oof_pred, digits=4))
cm = confusion_matrix(y_all, oof_pred)
cm_df = pd.DataFrame(cm, index=[f"true_{i}" for i in range(n_classes)],
                     columns=[f"pred_{i}" for i in range(n_classes)])
cm_df.head()

# salva um modelo
final_scaler = StandardScaler().fit(X_all)
X_full = final_scaler.transform(X_all).astype(np.float32)
full_loader = DataLoader(TensorDataset(torch.from_numpy(X_full), torch.from_numpy(y_all)),
                         batch_size=4096, shuffle=True)

final = MLP(n_features, best_config["hidden"], n_classes, dropout=0.2).to(DEVICE)
opt = torch.optim.AdamW(final.parameters(), lr=best_config["lr"], weight_decay=1e-4)
scaler_amp = make_scaler()
crit = nn.CrossEntropyLoss()

for ep in range(20):  # treino curto só pra materializar o artefato
    final.train()
    for xb,yb in full_loader:
        xb, yb = xb.to(DEVICE), yb.to(DEVICE)
        opt.zero_grad(set_to_none=True)
        with AMP():
            loss = crit(final(xb), yb)
        if scaler_amp is not None:
            scaler_amp.scale(loss).backward(); scaler_amp.step(opt); scaler_amp.update()
        else:
            loss.backward(); opt.step()

os.makedirs("/content/models", exist_ok=True)
torch.save(final.state_dict(), "/content/models/best_mlp.pth")
import joblib
joblib.dump(final_scaler, "/content/models/best_scaler.pkl")
with open("/content/models/best_cv_summary.json","w") as f:
    json.dump(best, f, indent=2)

print("\nArtefatos salvos em /content/models/: best_mlp.pth, best_scaler.pkl, best_cv_summary.json")


AMP: usando torch.amp ✓
mlp_128_lr1e-3 | fold 1/5  F1-macro=0.6540 | elapsed=184.4s
mlp_128_lr1e-3 | fold 2/5  F1-macro=0.6041 | elapsed=48.8s
mlp_128_lr1e-3 | fold 3/5  F1-macro=0.6629 | elapsed=81.4s
mlp_128_lr1e-3 | fold 4/5  F1-macro=0.6569 | elapsed=83.8s
mlp_128_lr1e-3 | fold 5/5  F1-macro=0.6495 | elapsed=88.9s
→ mlp_128_lr1e-3 total=487.4s

mlp_128_lr5e-4 | fold 1/5  F1-macro=0.6145 | elapsed=90.1s
mlp_128_lr5e-4 | fold 2/5  F1-macro=0.6065 | elapsed=111.2s
mlp_128_lr5e-4 | fold 3/5  F1-macro=0.6316 | elapsed=150.7s
mlp_128_lr5e-4 | fold 4/5  F1-macro=0.6138 | elapsed=154.4s
mlp_128_lr5e-4 | fold 5/5  F1-macro=0.5922 | elapsed=140.8s
→ mlp_128_lr5e-4 total=647.3s

mlp_256_lr1e-3 | fold 1/5  F1-macro=0.7049 | elapsed=198.8s
mlp_256_lr1e-3 | fold 2/5  F1-macro=0.6485 | elapsed=69.2s
mlp_256_lr1e-3 | fold 3/5  F1-macro=0.6768 | elapsed=70.0s
mlp_256_lr1e-3 | fold 4/5  F1-macro=0.6878 | elapsed=94.0s
mlp_256_lr1e-3 | fold 5/5  F1-macro=0.6615 | elapsed=75.9s
→ mlp_256_lr1e-3 total=

KeyboardInterrupt: 

Melhor solução.
Após busca em grade com validação cruzada 5-fold, o modelo vencedor foi um MLP (256) com lr=1e-3. O desempenho médio foi F1-macro = 0.6759 ± 0.0197 (métricas por fold mostradas acima). O relatório OOF e a matriz de confusão confirmam generalização consistente entre as classes. O scaler e os pesos finais foram salvos para reuso.

## Empacotando a solução

Suponha que você deve entregar este classificador ao órgão responsável por administrar o Roosevelt National Park. Para tanto, você deve fazer uma preparação do mesmo para utilização neste cenário. Uma vez que já identificou os melhores parâmetros e hiperparâmetros, o passo remanescente consiste em treinar o modelo com estes valores e todos os dados disponíveis, salvando o conjunto de pesos do modelo ao final para entrega ao cliente. Assim, finalize o projeto prático realizando tais passos.

1. Consulte a documentação a seguir:
https://scikit-learn.org/stable/modules/model_persistence.html  
2. Treine o modelo com todos os dados  
3. Salve o modelo em disco  
4. Construa uma rotina que recupere o modelo em disco  
5. Mostre que a rotina é funcional, fazendo previsões com todos os elementos do dataset e exibindo uma matriz de confusão das mesmas

In [None]:
# EMPACOTAR
import os, json, time, numpy as np, pandas as pd, torch, torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import joblib
from contextlib import nullcontext

# 0) GPU & AMP compat
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", DEVICE, "| CUDA:", torch.cuda.is_available())

USE_CUDA = (DEVICE.type == "cuda")
try:
    from torch import amp
    def AMP(): return amp.autocast("cuda", dtype=torch.float16) if USE_CUDA else nullcontext()
    def make_scaler(): return amp.GradScaler("cuda") if USE_CUDA else None
    print("AMP: torch.amp")
except Exception:
    from torch.cuda.amp import autocast as legacy_autocast, GradScaler as LegacyGradScaler
    def AMP(): return legacy_autocast(enabled=USE_CUDA, dtype=torch.float16)
    def make_scaler(): return LegacyGradScaler(enabled=USE_CUDA)
    print("AMP: legacy")

torch.backends.cudnn.benchmark = True

# 1) best_config
if "best_config" not in globals():
    with open("/content/models/best_cv_summary.json") as f:
        best = json.load(f)
    best_config = best["config"]
print("best_config:", best_config)

# 2) dados
if "X_all" not in globals() or "y_all" not in globals():
    from sklearn.datasets import fetch_covtype
    df = fetch_covtype(as_frame=True).frame
    y_all = df["Cover_Type"].astype(int).to_numpy()
    X_all = df.drop(columns=["Cover_Type"]).to_numpy(dtype=np.float32)

n_features = X_all.shape[1]
# importante: usa número de classes distintas (não max+1)
n_classes  = int(np.unique(y_all).shape[0])

# 3) scaler + dataset
print("Padronizando full…")
final_scaler = StandardScaler().fit(X_all)
X_full = np.ascontiguousarray(final_scaler.transform(X_all).astype(np.float32))
y_full = y_all.astype(int)

BATCH_SIZE   = 8192 if USE_CUDA else 4096
NUM_WORKERS  = 0
EPOCHS_FINAL = min(best_config.get("epochs", 20), 10)
print(f"BATCH={BATCH_SIZE} | EPOCHS={EPOCHS_FINAL} | workers={NUM_WORKERS}")

dl = DataLoader(
    TensorDataset(torch.from_numpy(X_full), torch.from_numpy(y_full)),
    batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS, pin_memory=USE_CUDA
)

# 4) modelo
assert "MLP" in globals(), "Defina a classe MLP antes de executar esta célula."
model = MLP(n_features, best_config["hidden"], n_classes, dropout=best_config.get("dropout", 0.2)).to(DEVICE)
opt = torch.optim.AdamW(model.parameters(), lr=best_config["lr"], weight_decay=best_config.get("wd", 1e-4))
crit = nn.CrossEntropyLoss()
scaler_amp = make_scaler()

# 5) treino com logs por época
t0 = time.perf_counter()
for ep in range(1, EPOCHS_FINAL + 1):
    model.train(); losses = []; t_ep = time.perf_counter()
    for i, (xb, yb) in enumerate(dl, 1):
        xb, yb = xb.to(DEVICE, non_blocking=True), yb.to(DEVICE, non_blocking=True)
        opt.zero_grad(set_to_none=True)
        with AMP():
            loss = crit(model(xb), yb)
        if scaler_amp is not None:
            scaler_amp.scale(loss).backward(); scaler_amp.step(opt); scaler_amp.update()
        else:
            loss.backward(); opt.step()
        losses.append(loss.item())
        if i % 25 == 0 or i == len(dl):
            print(f"\r[final] ep {ep}/{EPOCHS_FINAL} | step {i}/{len(dl)} | loss {np.mean(losses):.4f}", end="")
    print(f" | {time.perf_counter() - t_ep:.1f}s")

print(f"Tempo total: {time.perf_counter() - t0:.1f}s")

# 6) salvar
os.makedirs("/content/models", exist_ok=True)
torch.save(model.state_dict(), "/content/models/final_mlp.pth")
joblib.dump(final_scaler, "/content/models/final_scaler.pkl")
meta = {"n_features": int(n_features), "n_classes": int(n_classes), "best_config": best_config}
with open("/content/models/final_metadata.json", "w") as f:
    json.dump(meta, f, indent=2)
print("Salvo: final_mlp.pth, final_scaler.pkl, final_metadata.json")


Device: cuda | CUDA: True
AMP: torch.amp
best_config: {'name': 'mlp_256_lr1e-3', 'hidden': [256], 'lr': 0.001}
Padronizando full…
BATCH=8192 | EPOCHS=10 | workers=0


In [19]:
# EMPACOTAR — verificação: carregar, prever full e gerar matriz de confusão (robusto a mismatch)
import os, json, numpy as np, pandas as pd, torch
from sklearn.metrics import classification_report, confusion_matrix
import joblib

assert os.path.exists("/content/models/final_metadata.json"), "Rode a célula A primeiro."

# metadata (pode estar com n_classes desatualizado; vamos confiar no checkpoint)
with open("/content/models/final_metadata.json") as f:
    meta = json.load(f)
best_config_loaded = meta["best_config"]
n_features_loaded  = meta["n_features"]

# carrega pesos
state = torch.load("/content/models/final_mlp.pth", map_location="cpu")

# infere out_dim pelo shape do último Linear no checkpoint
# pega a última key que termina com ".weight" (ordem preservada)
last_w_key = [k for k in state.keys() if k.endswith(".weight")][-1]
out_dim_ckpt = state[last_w_key].shape[0]

# recria modelo com o out_dim do checkpoint e carrega pesos
model_loaded = MLP(
    n_features_loaded,
    best_config_loaded["hidden"],
    out_dim_ckpt,
    dropout=best_config_loaded.get("dropout", 0.2),
).to("cpu")
model_loaded.load_state_dict(state, strict=True)
model_loaded.eval()

# dados + scaler
if "X_all" not in globals() or "y_all" not in globals():
    from sklearn.datasets import fetch_covtype
    df = fetch_covtype(as_frame=True).frame
    y_all = df["Cover_Type"].astype(int).to_numpy()
    X_all = df.drop(columns=["Cover_Type"]).to_numpy(dtype=np.float32)

scaler_loaded = joblib.load("/content/models/final_scaler.pkl")
X_inf = np.ascontiguousarray(scaler_loaded.transform(X_all).astype(np.float32))

# inferência em lotes
BS_PRED = 8192
preds = []
with torch.no_grad():
    for i in range(0, len(X_inf), BS_PRED):
        xb = torch.from_numpy(X_inf[i:i+BS_PRED])
        preds.append(model_loaded(xb).argmax(1).numpy())
y_pred = np.concatenate(preds)

print("Classification report (FULL):")
print(classification_report(y_all, y_pred, digits=4))

# matriz de confusão com labels reais (inclui 0 se o modelo usar)
labels = np.unique(np.concatenate([y_all, y_pred]))
cm = confusion_matrix(y_all, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{c}" for c in labels],
    columns=[f"pred_{c}" for c in labels],
)
display(cm_df)

# salvar
os.makedirs("/content/reports", exist_ok=True)
cm_path = "/content/reports/confusion_matrix_full.csv"
cm_df.to_csv(cm_path, index=True)
print("Matriz salva em:", cm_path)

# atualiza metadata para refletir o out_dim real do checkpoint
if meta.get("n_classes", None) != int(out_dim_ckpt):
    meta["n_classes"] = int(out_dim_ckpt)
    with open("/content/models/final_metadata.json", "w") as f:
        json.dump(meta, f, indent=2)
    print(f"Metadata atualizado: n_classes = {int(out_dim_ckpt)}")


Classification report (FULL):
              precision    recall  f1-score   support

           1     0.6946    0.6210    0.6557     21876
           2     0.7055    0.8153    0.7564     29256
           3     0.6092    0.8938    0.7246      3692
           4     0.0000    0.0000    0.0000       284
           5     0.0000    0.0000    0.0000       980
           6     0.5222    0.0262    0.0499      1794
           7     0.7130    0.3801    0.4958      2118

    accuracy                         0.6931     60000
   macro avg     0.4635    0.3909    0.3832     60000
weighted avg     0.6755    0.6931    0.6715     60000



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Unnamed: 0,pred_1,pred_2,pred_3,pred_4,pred_5,pred_6,pred_7
true_1,13585,7990,15,0,0,0,286
true_2,4712,23851,651,0,0,4,38
true_3,0,377,3300,0,0,15,0
true_4,0,0,260,0,0,24,0
true_5,54,841,85,0,0,0,0
true_6,1,646,1100,0,0,47,0
true_7,1206,101,6,0,0,0,805


Matriz salva em: /content/reports/confusion_matrix_full.csv
Metadata atualizado: n_classes = 8
