## Steps

1. Feature creation
2. Outlier treatment
3. Data normalization
4. Preprocessing pipeline
5. Train/test split
6. Model training

In [1]:
!pip install xgboost catboost lightgbm

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [1]:
# @title Import the libraries used in the program

import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin

# Dataset loading
import requests
from pathlib import Path

# Feature creation
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Data normalization
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Model training
from sklearn.model_selection import cross_validate, TimeSeriesSplit
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier



In [2]:
# @title Load the dataset

github_link = "https://github.com/mfigueireddo/ciencia-de-dados/blob/ba579573c5b8a9246ca04f7da29bc2c74c8b362c/datasets/pre-pipeline_wildfires.parquet"
url = github_link.replace("/blob/", "/raw/")

local_file_path = Path("/content/raw_wildfires.parquet")

# Send an HTTP GET request to GitHub and stream the response to disk
with requests.get(url, stream=True) as response:

    response.raise_for_status()  # Raise an exception if the request failed

    with open(local_file_path, "wb") as file:
        for chunk in response.iter_content(chunk_size=1024*1024):
            if chunk:
                file.write(chunk)

wildfires = pd.read_parquet(local_file_path, engine="pyarrow")  # Read the file with the pyarrow engine

In [3]:
# @title Global variables

data_column_name = 'data'
id_column_name = 'fire_id'
latitude_column_name = 'latitude'
longitude_column_name = 'longitude'
precipitation_column_name = 'precipitacao'
max_temperature_column_name = 'temperatura_max'
precipitation_sum_window_column_name = 'soma_precipitacao_14d'
max_temperature_mean_column_name = 'media_temp_max_7d'
season_column_name = 'estacao_ano_id'
region_column_name = 'regiao_incendio'
target_column_name = 'houve_incendio'

## 1. Feature creation

Generated features
- Season of the year for the Southern Hemisphere (0=Summer, 1=Autumn, 2=Winter, 3=Spring)
- Fire region
- Precipitation sum over the last 14 days
- Mean maximum temperature over the last 7 days
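
The season and rolling-window features listed above can be illustrated on toy data. This is a minimal sketch, not the notebook's implementation: the data frame `toy`, the helper `month_to_season`, the output column `soma_precipitacao` and the 2-row window are assumptions chosen to keep the example small.

```python
import pandas as pd

# Hypothetical toy data: two fires ("a" and "b") with daily precipitation readings
toy = pd.DataFrame({
    "fire_id": ["a", "a", "a", "b", "b"],
    "data": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-07-01", "2020-07-02"]
    ),
    "precipitacao": [1.0, 0.0, 2.0, 5.0, 1.0],
})


def month_to_season(month: int) -> int:
    """Southern Hemisphere seasons: 0=summer, 1=autumn, 2=winter, 3=spring."""
    seasons = {12: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}
    return seasons.get(month, 3)


toy["estacao_ano_id"] = toy["data"].dt.month.map(month_to_season)

# Rolling sum over the last 2 rows of each fire (window shortened for the toy data).
# The window counts rows, so it equals days only when there is one row per fire per day.
toy = toy.sort_values(["fire_id", "data"])
toy["soma_precipitacao"] = (
    toy.groupby("fire_id")["precipitacao"]
       .rolling(2, min_periods=1)
       .sum()
       .reset_index(level=0, drop=True)
)

print(toy)
```

Grouping before rolling keeps the window from mixing readings that belong to different fires.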

In [5]:
# @title Class responsible for feature creation

class FeaturesCreation(BaseEstimator, TransformerMixin):

    def __init__(self):
        # Columns used
        self.m_date_column = data_column_name
        self.m_group_column = id_column_name
        self.m_latitude_column = latitude_column_name
        self.m_longitude_column = longitude_column_name
        self.m_precipitation_column = precipitation_column_name
        self.m_max_temperature_column = max_temperature_column_name

        # Custom parameters for feature creation, matching the 14-day and 7-day windows in the column names.
        # The rolling windows count rows, so they equal days only with one row per fire per day.
        self.m_precipitation_window_days = 14
        self.m_max_temperature_window_days = 7

        # DBSCAN parameters
        self.mm_max_radiuskm = 1.0
        self.m_min_samples = 5

        # Objects learned in fit()
        self.m_dbscan = None
        self.m_nearest_neighbors_core = None
        self.m_core_labels = None
        self.m_max_radius = None

    # Conversion required by the haversine metric used in DBSCAN
    @staticmethod
    def convert_to_radians(latitude_or_longitude):
        return np.radians(latitude_or_longitude.astype(float))

    # Southern Hemisphere seasons
    @staticmethod
    def convert_month_to_season(month):
        if month in (12, 1, 2): return 0  # Summer
        if month in (3, 4, 5): return 1  # Autumn
        if month in (6, 7, 8): return 2  # Winter
        return 3  # 9, 10, 11 -> Spring
    def fit(self, dataframe, target=None):
        dataframe = dataframe.copy()

        # Reset everything learned in a previous fit
        self.m_dbscan = None
        self.m_nearest_neighbors_core = None
        self.m_core_labels = None
        self.m_max_radius = None

        # Skip DBSCAN when latitude or longitude is missing
        missing_cols = [column for column in [self.m_latitude_column, self.m_longitude_column] if column not in dataframe.columns]
        if missing_cols:
            return self

        coordinates = dataframe[[self.m_latitude_column, self.m_longitude_column]].to_numpy()
        coordinates_radians = self.convert_to_radians(coordinates)

        # Haversine distances are angles in radians, so the radius in km is divided by the Earth's radius in km
        earth_radius = 6371.0
        self.m_max_radius = self.mm_max_radiuskm / earth_radius

        dbscan = DBSCAN(eps=self.m_max_radius, min_samples=self.m_min_samples, metric='haversine')
        dbscan.fit(coordinates_radians)
        self.m_dbscan = dbscan

        # Fit a NearestNeighbors model on the core points only, so that transform() can assign new points to existing clusters
        core_indices = dbscan.core_sample_indices_
        if len(core_indices) > 0:
            nearest_neighbors = NearestNeighbors(n_neighbors=1, metric='haversine')
            nearest_neighbors.fit(coordinates_radians[core_indices])
            self.m_nearest_neighbors_core = nearest_neighbors
            self.m_core_labels = dbscan.labels_[core_indices]

        return self

    # Labels new points
    def assign_dbscanlabels(self, dataframe):

        # Without latitude/longitude or a fitted DBSCAN, every point gets -1 (the DBSCAN noise label)
        if self.m_nearest_neighbors_core is None or self.m_core_labels is None or self.m_max_radius is None:
            return pd.Series([-1] * len(dataframe), index=dataframe.index, dtype='int64')

        latitude_or_longitude = dataframe[[self.m_latitude_column, self.m_longitude_column]].to_numpy()
        latitude_or_longitude_radians = self.convert_to_radians(latitude_or_longitude)

        # Assign the label of the nearest core point, provided it lies within the radius
        distances, indices = self.m_nearest_neighbors_core.kneighbors(latitude_or_longitude_radians, n_neighbors=1, return_distance=True)
        distances = distances.reshape(-1)
        indices = indices.reshape(-1)

        labels = np.full(len(dataframe), -1, dtype='int64')
        within = distances <= self.m_max_radius
        labels[within] = self.m_core_labels[indices[within]]

        return pd.Series(labels, index=dataframe.index, dtype='int64')

    # Adds the rolling precipitation sum and the rolling mean of maximum temperature for each fire group
    def add_temporal_rollings(self, dataframe):

        # Sort a copy by group and time so that each rolling window follows chronological order.
        # Results are assigned back by index, so the row order of the input (and its alignment with the target) is preserved.
        sort_columns = [column for column in (self.m_group_column, self.m_date_column) if column in dataframe.columns]
        ordered = dataframe.sort_values(sort_columns) if sort_columns else dataframe

        # Without the group column, all rows are treated as a single group
        if self.m_group_column in ordered.columns:
            group_keys = ordered[self.m_group_column]
        else:
            group_keys = np.zeros(len(ordered), dtype=int)

        # Rolling precipitation sum over the last m_precipitation_window_days rows
        if self.m_precipitation_column in dataframe.columns:
            dataframe[precipitation_sum_window_column_name] = (
                ordered.groupby(group_keys, dropna=False)[self.m_precipitation_column]
                  .rolling(self.m_precipitation_window_days, min_periods=1)
                  .sum()
                  .reset_index(level=0, drop=True)
            )
        else:
            dataframe[precipitation_sum_window_column_name] = np.nan

        # Rolling mean of maximum temperature over the last m_max_temperature_window_days rows
        if self.m_max_temperature_column in dataframe.columns:
            dataframe[max_temperature_mean_column_name] = (
                ordered.groupby(group_keys, dropna=False)[self.m_max_temperature_column]
                  .rolling(self.m_max_temperature_window_days, min_periods=1)
                  .mean()
                  .reset_index(level=0, drop=True)
            )
        else:
            dataframe[max_temperature_mean_column_name] = np.nan

        return dataframe

    # Applies the transformations and creates the new features
    def transform(self, dataframe, target=None):

        # Work on a DataFrame copy to keep column names and indices
        dataframe = pd.DataFrame(dataframe).copy()

        # Season of the year
        if self.m_date_column in dataframe.columns:
            # Ensure a datetime dtype; invalid dates become NaT
            dataframe[self.m_date_column] = pd.to_datetime(dataframe[self.m_date_column], errors='coerce')
            months = dataframe[self.m_date_column].dt.month
            # na_action='ignore' leaves missing dates as NaN (imputed later) instead of mapping them to a season
            dataframe[season_column_name] = months.map(self.convert_month_to_season, na_action='ignore').astype('float')
        else:
            dataframe[season_column_name] = np.nan

        # Region
        if all(c in dataframe.columns for c in [self.m_latitude_column, self.m_longitude_column]):
            dataframe[region_column_name] = self.assign_dbscanlabels(dataframe).astype('int64')
        else:
            dataframe[region_column_name] = -1

        # Temporal rolling features
        dataframe = self.add_temporal_rollings(dataframe)

        return dataframe


### Data adjustments

The season and fire region features created above still need further treatment.
Their values are categorical codes, so there is no numerical relationship between them.
This is handled in the data normalization step.

## 2. Outlier treatment

**Logarithmic transformation**
- Applies log(x) or log(x + constant) to positive values
- Very effective for data with a positively skewed distribution
- Compresses large values and spreads out small ones
- Formula: X_log = log(X + c), where c avoids log(0)

**Square root transformation**
- Less drastic than the logarithmic transformation
- Useful for count data and positively skewed variables
- Formula: X_sqrt = sqrt(X)

**Winsorization (capping/clipping)**

**Winsorization** is an outlier treatment technique that **limits extreme values** without removing them. Instead of dropping outliers, we replace the extreme values with the values at specific percentiles.

**How it works:**
- Limits are defined from percentiles (e.g., the 5th and 95th percentiles)
- Values below the lower limit are replaced by the lower limit
- Values above the upper limit are replaced by the upper limit
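
The three treatments described above can be compared on a small sample. This is a minimal sketch under assumed data: `values` is a hypothetical series (1 to 99 plus one extreme reading), and skewness is used only as a rough indicator of the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: 1..99 plus one extreme value
values = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

# Logarithmic transformation with c = 1, which avoids log(0)
log_values = np.log(values + 1)

# Square root transformation
sqrt_values = np.sqrt(values)

# Winsorization at the 5th and 95th percentiles
lower, upper = values.quantile(0.05), values.quantile(0.95)
winsorized = values.clip(lower=lower, upper=upper)

skewness = pd.DataFrame({
    "raw": values,
    "log": log_values,
    "sqrt": sqrt_values,
    "winsorized": winsorized,
}).skew()
print(skewness)
```

In a train/test setting the percentile limits would be computed on the training data only and then reused on new data, which is what a `fit`/`transform` split makes possible.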

| Variable                                 | Best method           | Justification (one line)                                                                                    |
| :--------------------------------------- | :-------------------- | :---------------------------------------------------------------------------------------------------------- |
| **precipitacao**                         | **Log(x + 1)**        | Strongly reduced skewness (7.87 → 2.59) and kept the IQR limits stable; well suited to long tails.          |
| **umidade_relativa_max**                 | **Winsorize 5%-5%**   | Removed outliers (23 → 0) and slightly improved symmetry; log/sqrt inflated values.                         |
| **umidade_relativa_min**                 | **Winsorize 5%-5%**   | Skewness and outliers were fully corrected (1155 → 0).                                                      |
| **umidade_especifica**                   | **Sqrt**              | Better symmetry (0.89 → 0.16) and a strong reduction in outliers (6967 → 2102).                             |
| **radiacao_solar**                       | **No transformation** | Already symmetric and free of outliers; log made it worse, and winsorization changes nothing.               |
| **temperatura_min**                      | **Winsorize 5%-5%**   | Complete removal of outliers (5066 → 0) and a slight gain in symmetry.                                      |
| **temperatura_max**                      | **Winsorize 5%-5%**   | Same behavior as `temperatura_min`.                                                                         |
| **velocidade_vento**                     | **Log(x + 1)**        | Reduced skewness (1.23 → 0.19) and outliers (8723 → 1128) without eliminating real extremes.                |
| **indice_queima**                        | **Winsorize 5%-5%**   | Log and sqrt increased the IQR-based outlier count; winsorization eliminated them with minimal distortion.  |
| **umidade_combustivel_morto_100_horas**  | **Sqrt**              | Sharp drop in outliers (44 → 9) and a slight smoothing of the shape.                                        |
| **umidade_combustivel_morto_1000_horas** | **Sqrt**              | Reduced outliers (373 → 4) while keeping a coherent distribution.                                           |
| **componente_energia_lancada**           | **No transformation** | Already balanced; transformations create false outliers.                                                    |
| **evapotranspiracao_real**               | **Sqrt**              | Drastically improved symmetry (0.71 → -0.00) and reduced outliers (3292 → 153).                             |
| **evapotranspiracao_potencial**          | **Log(x + 1)**        | Skewness fell (0.53 → -0.36) and outliers dropped to zero (679 → 0).                                        |
| **deficit_pressao_vapor**                | **Log(x + 1)**        | Heavy right tail smoothed (1.46 → 0.52) and outliers plummeted (14758 → 672).                               |


In [6]:
# @title Class responsible for outlier treatment

log_columns = [
    "precipitacao",
    "velocidade_vento",
    "evapotranspiracao_potencial",
    "deficit_pressao_vapor",
]
sqrt_columns = [
    "umidade_especifica",
    "umidade_combustivel_morto_100_horas",
    "umidade_combustivel_morto_1000_horas",
    "evapotranspiracao_real",
]
winsor_columns = [
    "umidade_relativa_max",
    "umidade_relativa_min",
    "temperatura_min",
    "temperatura_max",
    "indice_queima",
]

class OutliersTreatment(BaseEstimator, TransformerMixin):

    def __init__(self):
        self.m_log_columns = log_columns
        self.m_sqrt_columns = sqrt_columns
        self.m_winsor_columns = winsor_columns
        self.m_winsor_limits = (0.05, 0.05)

    # Computes the parameters needed to apply the transformations correctly
    def fit(self, dataframe, target=None):

        # Make sure we are working with a DataFrame
        dataframe = dataframe if isinstance(dataframe, pd.DataFrame) else pd.DataFrame(dataframe)

        # Compute offsets so that no zero or negative values reach log/sqrt
        # errors="coerce" converts invalid values to NaN

        self.m_log_offset = {}
        for column in self.m_log_columns:
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                min_value = series.min()
                self.m_log_offset[column] = (abs(min_value) + 1) if pd.notna(min_value) and min_value <= 0 else 1.0

        self.m_sqrt_offset = {}
        for column in self.m_sqrt_columns:
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                min_value = series.min()
                self.m_sqrt_offset[column] = (abs(min_value) + 0.01) if pd.notna(min_value) and min_value < 0 else 0.0

        # Winsorize only the columns that are actually present in the dataframe
        actual_columns_to_winsor = [column for column in self.m_winsor_columns if column in dataframe.columns]
        low_quantile, high_quantile = self.m_winsor_limits

        if actual_columns_to_winsor:
            # Convert each column to numeric (invalid values -> NaN) and compute the quantiles per column
            winsor_dataframe = dataframe[actual_columns_to_winsor].apply(pd.to_numeric, errors="coerce")
            self.m_low_quantile  = winsor_dataframe.quantile(low_quantile)
            self.m_high_quantile = winsor_dataframe.quantile(1 - high_quantile)
        else:
            # Keep empty attributes so that transform() does not fail
            self.m_low_quantile  = pd.Series(dtype=float)
            self.m_high_quantile = pd.Series(dtype=float)

        return self

    # Applies the transformations
    def transform(self, dataframe):

        # Work on a DataFrame copy so the input is not modified
        dataframe = dataframe.copy() if isinstance(dataframe, pd.DataFrame) else pd.DataFrame(dataframe).copy()

        # Log
        for column, offset in self.m_log_offset.items():
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                dataframe[column] = np.log(series + offset)

        # Square root
        for column, offset in self.m_sqrt_offset.items():
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                dataframe[column] = np.sqrt(series + offset)

        # Winsorization with the limits learned in fit()
        for column in self.m_low_quantile.index:
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                dataframe[column] = series.clip(lower=self.m_low_quantile[column], upper=self.m_high_quantile[column])

        return dataframe

## 3. Data normalization



In [7]:
# @title Data normalization
from google.colab import drive
from sklearn.preprocessing import MinMaxScaler

drive.mount('/content/drive')

# Load the already transformed dataset
df = pd.read_csv('/content/drive/MyDrive/wildfires_transformed.csv')

# Numeric columns to normalize
colunas_numericas = [
    'precipitacao',
    'umidade_relativa_max', 'umidade_relativa_min',
    'umidade_especifica',
    'radiacao_solar',
    'temperatura_min', 'temperatura_max',
    'velocidade_vento',
    'indice_queima',
    'umidade_combustivel_morto_100_horas',
    'umidade_combustivel_morto_1000_horas',
    'componente_energia_lancada',
    'evapotranspiracao_real',
    'evapotranspiracao_potencial',
    'deficit_pressao_vapor'
]

# Apply Min-Max scaling
scaler = MinMaxScaler()
df_normalizado = df.copy()
df_normalizado[colunas_numericas] = scaler.fit_transform(df[colunas_numericas])

# Check the resulting range
print("Range of the variables after normalization:")
print(df_normalizado[colunas_numericas].describe().loc[['min','max']])

# Save the normalized dataset
df_normalizado.to_csv('wildfires_normalized.csv', index=False, encoding='utf-8-sig')

print("\n✅ Normalization completed successfully!")
print("📁 File saved as 'wildfires_normalized.csv'")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Range of the variables after normalization:
     precipitacao  umidade_relativa_max  umidade_relativa_min  \
min           0.0                   0.0                   0.0   
max           1.0                   1.0                   1.0   

     umidade_especifica  radiacao_solar  temperatura_min  temperatura_max  \
min                 0.0             0.0              0.0              0.0   
max                 1.0             1.0              1.0              1.0   

     velocidade_vento  indice_queima  umidade_combustivel_morto_100_horas  \
min               0.0            0.0                                  0.0   
max               1.0            1.0                                  1.0   

     umidade_combustivel_morto_1000_horas  componente_energia_lancada  \
min                                   0.0                         0.0   
max               

In [8]:
# @title Normalization algorithm

categoric_cols = [season_column_name, region_column_name]

def numeric_columns_selector(dataframe):
    # Select numeric columns, except the numerically encoded categorical ones and the group identifier
    excluded = set(categoric_cols) | {id_column_name}
    num = dataframe.select_dtypes(include='number').columns.tolist()
    return [column for column in num if column not in excluded]

numeric_columns_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ])

categoric_columns_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

DataNormalization = ColumnTransformer(
    transformers=[
        ('numerical', numeric_columns_pipeline, numeric_columns_selector),
        ('categoric', categoric_columns_pipeline, categoric_cols)
    ],
    remainder='passthrough',  # keeps any columns not listed above
    verbose_feature_names_out=False,  # keep the original column names (no 'numerical__' prefixes)
).set_output(transform='pandas')  # return a DataFrame so later steps can select columns by name

## 4. Preprocessing pipeline

In [9]:
# @title Preprocessing pipeline code

preprocess = Pipeline(steps=[
    ("features_creation", FeaturesCreation()),
    ("outliers_treatment", OutliersTreatment()),
    ("data_normalization", DataNormalization)
])

# Drops the date and fire_id columns and keeps only numeric columns, the format expected by the classifiers
def sanitize_after_preprocess(features):
    # Only DataFrames are filtered; any other input (e.g. a numpy.ndarray) is returned unchanged
    if isinstance(features, pd.DataFrame):
        cols = [column for column in features.columns if column not in (data_column_name, id_column_name)]
        features = features[cols]
        features = features.select_dtypes(include='number')
    return features

sanitize = FunctionTransformer(sanitize_after_preprocess, validate=False, feature_names_out='one-to-one')

## 5. Train/test split

### Time Series Cross Validation

Time series cross-validation is a technique for validating models when the data have a chronological order. Unlike standard k-fold cross-validation, it preserves the temporal structure of the data: each fold trains on earlier observations and tests on later ones.
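
As a small illustration of the idea, the sketch below uses scikit-learn's `TimeSeriesSplit` on eight hypothetical, chronologically ordered samples (the array `observations` is an assumption made for the example).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical samples, already sorted from oldest to newest
observations = np.arange(8)

splitter = TimeSeriesSplit(n_splits=3)
folds = [(train.tolist(), test.tolist()) for train, test in splitter.split(observations)]

for train, test in folds:
    print(f"train={train} test={test}")
# train=[0, 1] test=[2, 3]
# train=[0, 1, 2, 3] test=[4, 5]
# train=[0, 1, 2, 3, 4, 5] test=[6, 7]
```

The training window grows with each fold, and every test index is later than every training index, so the model is never evaluated on data that precede its training set.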

In [10]:
# @title Splitting algorithm

# Generates (train_idx, test_idx) folds that respect the temporal order of the groups, apply a gap (embargo) at the group level and keep train and test groups mutually exclusive.
def group_time_series_cross_validation():
    dataframe = wildfires
    time_column = data_column_name
    group_column = id_column_name
    folds_amount = 5
    fold_groups_size = 1
    gap_between_groups_amount = 0

    # Order the groups by their first timestamp
    first_time = (
        dataframe[[group_column, time_column]]
        .dropna(subset=[time_column])
        .groupby(group_column)[time_column]
        .min()
        .sort_values()
    )
    ordered_groups = first_time.index.to_numpy()
    ordered_groups_len = len(ordered_groups)

    groups_amount_by_step = fold_groups_size

    min_train_groups = max(1, fold_groups_size)  # at least 1

    # Anchor: number of groups included in the training set
    # We must leave room for the gap and the test groups ahead of it
    max_anchor = ordered_groups_len - gap_between_groups_amount - fold_groups_size
    if max_anchor < min_train_groups:
        return  # no splits are possible

    splits = 0
    anchor = min_train_groups
    while anchor <= max_anchor and splits < folds_amount:
        train_groups = ordered_groups[:anchor]

        test_start = anchor + gap_between_groups_amount
        test_end = test_start + fold_groups_size
        test_groups = ordered_groups[test_start:test_end]

        # Positional indices, as expected by scikit-learn's cv argument
        train_idx = np.flatnonzero(dataframe[group_column].isin(train_groups).to_numpy())
        test_idx  = np.flatnonzero(dataframe[group_column].isin(test_groups).to_numpy())

        if train_idx.size and test_idx.size:
            yield (train_idx, test_idx)
            splits += 1

        anchor += groups_amount_by_step

cross_validation = cross_validation_splits = list(group_time_series_cross_validation())
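
The group-wise expanding split can be traced on toy data. This is a simplified sketch, not the function above: `expanding_group_folds` and the toy frame are hypothetical, and it uses one test group per fold with no gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three fires whose rows are not stored in chronological order
toy = pd.DataFrame({
    "fire_id": ["c", "a", "a", "b", "b", "c"],
    "data": pd.to_datetime(
        ["2020-03-01", "2020-01-01", "2020-01-02", "2020-02-01", "2020-02-02", "2020-03-02"]
    ),
})


def expanding_group_folds(frame, group_column, time_column):
    """Yield positional (train, test) indices: train on the earliest groups, test on the next one."""
    first_seen = frame.groupby(group_column)[time_column].min().sort_values()
    ordered_groups = first_seen.index.to_numpy()
    for anchor in range(1, len(ordered_groups)):
        train_mask = frame[group_column].isin(ordered_groups[:anchor]).to_numpy()
        test_mask = (frame[group_column] == ordered_groups[anchor]).to_numpy()
        yield np.flatnonzero(train_mask), np.flatnonzero(test_mask)


folds = [
    (train.tolist(), test.tolist())
    for train, test in expanding_group_folds(toy, "fire_id", "data")
]
print(folds)
# [([1, 2], [3, 4]), ([1, 2, 3, 4], [0, 5])]
```

All rows of a fire land on the same side of a split, so a model is never tested on a fire it has already seen during training.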


## 6. Model training

In [5]:
# Dataset, sorted chronologically so that TimeSeriesSplit respects time order
wildfires = pd.read_csv('/content/drive/MyDrive/outliers_wildfires.csv')
wildfires['data'] = pd.to_datetime(wildfires['data'], errors='coerce')
wildfires = wildfires.sort_values('data').reset_index(drop=True)

X = wildfires.drop(columns=["houve_incendio", "data", "fire_id"])
y = wildfires["houve_incendio"].astype(int)

# Models with light hyperparameter adjustments, following the course lectures

modelos = {
    # Simple baseline
    "Dummy (most frequent)": DummyClassifier(strategy="most_frequent"),

    # Logistic Regression - L2 regularization (C=0.5 is slightly stronger than the default C=1.0)
    "Logistic Regression": LogisticRegression(
        C=0.5, penalty='l2', solver='liblinear', max_iter=2000
    ),

    # Decision Tree - limited depth and leaf size
    "Decision Tree": DecisionTreeClassifier(
        max_depth=8, min_samples_split=4, min_samples_leaf=2, random_state=42
    ),

    # Random Forest - more trees and moderate depth
    "Random Forest": RandomForestClassifier(
        n_estimators=150, max_depth=10, min_samples_split=5,
        random_state=42, n_jobs=-1
    ),

    # Naive Bayes - light smoothing (more stable)
    "Naive Bayes": GaussianNB(var_smoothing=1e-8),

    # KNN - more neighbors and distance weighting
    "KNN": KNeighborsClassifier(n_neighbors=7, weights='distance'),

    # Gradient Boosting - light parameters, as in the lectures
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.05, max_depth=5, random_state=42
    ),

    # AdaBoost - lower learning rate
    "AdaBoost": AdaBoostClassifier(
        n_estimators=100, learning_rate=0.5, random_state=42
    ),

    # HistGradientBoosting - more iterations and a lower learning rate
    "HistGradientBoosting": HistGradientBoostingClassifier(
        max_iter=150, learning_rate=0.05, max_depth=5, random_state=42
    ),

    # RidgeClassifier - default regularization strength
    "RidgeClassifier": RidgeClassifier(alpha=1.0),


    # XGBoost - typical balanced parameters (lecture 15, part 4)
    "XGBoost": XGBClassifier(
        eval_metric='logloss', n_estimators=150, learning_rate=0.05,
        max_depth=5, subsample=0.8, colsample_bytree=0.8, random_state=42
    ),

    # CatBoost - fixed learning rate and a small number of iterations
    "CatBoost": CatBoostClassifier(
        iterations=150, learning_rate=0.05, depth=5,
        verbose=0, random_state=42
    ),

    # LightGBM - limited depth and moderate learning rate
    "LightGBM": LGBMClassifier(
        n_estimators=150, learning_rate=0.05,
        num_leaves=31, max_depth=6, random_state=42
    )
}


# Cross-validation
tscv = TimeSeriesSplit(n_splits=3)

# Metrics
# make_scorer(roc_auc_score) scores the hard labels returned by predict();
# the built-in 'roc_auc' scorer ranks by predict_proba/decision_function instead
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score, zero_division=0),
    'f1': make_scorer(f1_score, zero_division=0),
    'roc_auc': make_scorer(roc_auc_score)
}

# Training
resultados = []

for nome, modelo in modelos.items():
    print(f"Training model: {nome}...")
    try:
        # The scaler is refit inside each fold, so test folds do not leak into the scaling
        pipeline = Pipeline([
            ("scaler", StandardScaler()),
            ("clf", modelo)
        ])

        scores = cross_validate(pipeline, X, y, cv=tscv, scoring=scoring)

        resultados.append({
            "Model": nome,
            "Accuracy": scores['test_accuracy'].mean(),
            "Precision": scores['test_precision'].mean(),
            "Recall": scores['test_recall'].mean(),
            "F1-score": scores['test_f1'].mean(),
            "ROC AUC": scores['test_roc_auc'].mean(),
        })

        print(f" Model {nome} trained successfully.\n")

    except Exception as e:
        print(f" Error while running model {nome}: {e}\n")

# Final result
df_resultados = pd.DataFrame(resultados).sort_values("F1-score", ascending=False)
df_resultados

Training model: Dummy (most frequent)...
 Model Dummy (most frequent) trained successfully.

Training model: Logistic Regression...
 Model Logistic Regression trained successfully.

Training model: Decision Tree...
 Model Decision Tree trained successfully.

Training model: Random Forest...
 Model Random Forest trained successfully.

Training model: Naive Bayes...
 Model Naive Bayes trained successfully.

Training model: KNN...
 Model KNN trained successfully.

Training model: Gradient Boosting...
 Model Gradient Boosting trained successfully.

Training model: AdaBoost...
 Model AdaBoost trained successfully.

Training model: HistGradientBoosting...
 Model HistGradientBoosting trained successfully.

Training model: RidgeClassifier...
 Model RidgeClassifier trained successfully.

Training model: MLP (Neural Net)...
 Model MLP (Neural Net) trained successfully.

Training model: XGBoost...
 Model XGBoost trained successfully.

Training model: CatBoost...
 Model CatBoost trained successfully.

Training model: LightGBM...
[LightGBM] [Info] Number of positive: 10171, number of negative: 68075
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004452 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3552
[LightGBM] [Info] Number of data points in the train set: 78246, number of used features: 18
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129987 -> initscore=-1.901070
[LightGBM] [Info] Start training from score -1.901070

[LightGBM] [Info] Number of positive: 23751, number of negative: 132738
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.015630 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3729
[LightGBM] [Info] Number of data points in the train set: 156489, number of used features: 18
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.151774 -> initscore=-1.720753
[LightGBM] [Info] Start training from score -1.720753

[LightGBM] [Info] Number of positive: 38141, number of negative: 196591
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.013085 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3742
[LightGBM] [Info] Number of data points in the train set: 234732, number of used features: 18
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.162487 -> initscore=-1.639836
[LightGBM] [Info] Start training from score -1.639836

 Model LightGBM trained successfully.



|    | Model                | Accuracy | Precision | Recall   | F1-score | ROC AUC  |
| -: | :------------------- | -------: | --------: | -------: | -------: | -------: |
| 4  | Naive Bayes          | 0.588798 | 0.188665  | 0.354248 | 0.227289 | 0.504112 |
| 5  | KNN                  | 0.784922 | 0.214438  | 0.036801 | 0.061151 | 0.501886 |
| 2  | Decision Tree        | 0.775541 | 0.198405  | 0.041659 | 0.059747 | 0.498515 |
| 10 | MLP (Neural Net)     | 0.781386 | 0.127466  | 0.040767 | 0.058487 | 0.501949 |
| 6  | Gradient Boosting    | 0.802768 | 0.068505  | 0.002476 | 0.004765 | 0.499875 |
| 13 | LightGBM             | 0.801678 | 0.046412  | 0.001374 | 0.002666 | 0.498769 |
| 8  | HistGradientBoosting | 0.803407 | 0.064399  | 0.000845 | 0.001668 | 0.499614 |
| 1  | Logistic Regression  | 0.804404 | 0.036364  | 0.000147 | 0.000293 | 0.499947 |
| 11 | XGBoost              | 0.804174 | 0.026847  | 9.5e-05  | 0.00019  | 0.499785 |
| 3  | Random Forest        | 0.8044   | 0.008772  | 2.3e-05  | 4.6e-05  | 0.499894 |
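
The ROC AUC column deserves a note on how it is computed. In the training cell, `make_scorer(roc_auc_score)` passes the hard labels from `predict()` to the metric, which may help explain values close to 0.5 for models that rarely predict the positive class. The sketch below uses hypothetical labels and probabilities to show the difference between scoring labels and scoring probabilities; it is an illustration, not a re-evaluation of the models above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1])
probabilities = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40])

# Thresholding at 0.5 predicts the negative class for every sample
hard_labels = (probabilities >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_labels)    # ranking information is lost
auc_from_scores = roc_auc_score(y_true, probabilities)  # ranking information is kept

print(auc_from_labels)  # 0.5
print(auc_from_scores)  # 1.0
```

With `cross_validate`, passing the string `'roc_auc'` in the `scoring` dictionary uses scikit-learn's built-in scorer, which relies on `predict_proba` or `decision_function` rather than on `predict`.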
