# Modeling: Predictive Baseline

This notebook builds a **predictive baseline model** to estimate the
probability of flight delays (>15 minutes) from information available
**before the event**, avoiding data leakage.

This notebook:
- reconstructs the conceptually defined target
- prepares the data for modeling
- performs a temporal split
- trains simple, robust models
- evaluates initial performance

It includes no advanced tuning or in-depth interpretability analysis.


In [None]:
import sys
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Make the project root importable so the local `src` package resolves.
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src import config
from src.data.clean import drop_invalid_rows, drop_leakage_columns, drop_missing
from src.data.load import load_raw_data
from src.features.selection import get_baseline_features
from src.features.target import build_delay_probability, binarize_target
from src.models.preprocessing import make_preprocessor
from src.models.train import train_pipeline

In [14]:
df = load_raw_data()
df.shape

(318017, 21)

## Target reconstruction

Target: probability of delay (>15 minutes), computed as the share of
arriving flights delayed by more than 15 minutes

delay_probability = arr_del15 / arr_flights


In [15]:
df = df.copy()

# drop rows with no arriving flights (the ratio would be undefined)
df = df[df["arr_flights"] > 0]

df["delay_probability"] = df["arr_del15"] / df["arr_flights"]

df["delay_probability"].describe()

count    317280.000000
mean          0.197760
std           0.112311
min           0.000000
25%           0.122302
50%           0.184896
75%           0.258065
max           1.000000
Name: delay_probability, dtype: float64

## Conversion to binary classification

For the baseline, the continuous target (probability) is converted into a
binary classification problem:

Class 1 → frequent delays (probability at or above the 75th percentile)
Class 0 → less frequent delays


In [16]:
threshold = df["delay_probability"].quantile(0.75)
threshold

np.float64(0.25806451612903225)

In [17]:
df["target"] = (df["delay_probability"] >= threshold).astype(int)

df["target"].value_counts(normalize=True)

target
0    0.748479
1    0.251521
Name: proportion, dtype: float64
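
One detail worth checking in the binarization above: in pandas, a comparison against `NaN` evaluates to `False`, so a row whose `delay_probability` is missing would be assigned class 0 rather than excluded. The recorded outputs hint that this may apply here, since `describe()` reports 317,280 non-null values while a later cell keeps 317,517 rows. A minimal sketch on toy data (all values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: the second row has a missing arr_del15.
toy = pd.DataFrame({
    "arr_flights": [100.0, 50.0, 80.0, 40.0],
    "arr_del15": [10.0, np.nan, 40.0, 8.0],
})
toy["delay_probability"] = toy["arr_del15"] / toy["arr_flights"]

# quantile() skips NaN values by default.
threshold = toy["delay_probability"].quantile(0.75)

# Naive binarization: NaN >= threshold is False, so the missing row becomes class 0.
naive_target = (toy["delay_probability"] >= threshold).astype(int)

# Safer variant: drop rows with an undefined target before binarizing.
valid = toy.dropna(subset=["delay_probability"])
safe_target = (valid["delay_probability"] >= threshold).astype(int)

print(naive_target.tolist())
print(safe_target.tolist())
```

Dropping rows with an undefined target before binarizing, as in the second variant, keeps them from being counted as negatives.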

In [18]:
df = df.drop(columns=config.LEAKAGE_COLS, errors="ignore")

## Baseline features

Simple, explainable features available pre-event.


In [19]:
FEATURE_COLS = [
    "year",
    "month",
    "airport",
    "carrier",
    "arr_flights",
]

TARGET_COL = "target"

df[FEATURE_COLS + [TARGET_COL]].head()

   year  month airport carrier  arr_flights  target
0  2022      5     ABE      9E        136.0       0
1  2022      5     ABY      9E         91.0       0
2  2022      5     ACK      9E         19.0       0
3  2022      5     AEX      9E         88.0       0
4  2022      5     AGS      9E        181.0       0


In [20]:
df = df[FEATURE_COLS + [TARGET_COL]].dropna()
df.shape

(317517, 6)

## Temporal split

Train: years before 2019  
Test: 2019 onward

Training only on years that precede the test period avoids temporal data leakage.
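
The split can be sketched as a small helper with a sanity check on the year ranges. The helper name `temporal_split` and the toy data are hypothetical. The last lines show a variant that the notebook does not use: deriving the label threshold from training rows only, instead of from the full dataset before the split.

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, year_col: str, first_test_year: int):
    """Hypothetical helper: rows before first_test_year go to train, the rest to test."""
    train = df[df[year_col] < first_test_year]
    test = df[df[year_col] >= first_test_year]
    return train, test

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "year": [2016, 2017, 2018, 2019, 2020],
    "delay_probability": [0.10, 0.20, 0.30, 0.40, 0.50],
})

train, test = temporal_split(toy, "year", first_test_year=2019)

# Sanity check: every training year precedes every test year.
assert train["year"].max() < test["year"].min()

# Variant (not what the notebook does): take the threshold from training rows only.
train_threshold = train["delay_probability"].quantile(0.75)
test_target = (test["delay_probability"] >= train_threshold).astype(int)

print(len(train), len(test), train_threshold)
```

In the notebook's own test set, class 1 accounts for 11,116 of 68,116 rows (about 16%), below the 25% implied by the full-data quantile, which suggests the delay distribution differs between the two periods.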


In [21]:
train_df = df[df["year"] < 2019]
test_df = df[df["year"] >= 2019]

X_train = train_df[FEATURE_COLS]
y_train = train_df[TARGET_COL]

X_test = test_df[FEATURE_COLS]
y_test = test_df[TARGET_COL]

X_train.shape, X_test.shape

((249401, 5), (68116, 5))

## Preprocessing

- One-hot encoding for the categorical variables (`airport`, `carrier`); numeric columns pass through unchanged
- Keeps the pipeline reproducible


In [22]:
categorical_cols = ["airport", "carrier"]
numeric_cols = ["year", "month", "arr_flights"]

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore",
         sparse_output=False), categorical_cols),
        ("num", "passthrough", numeric_cols),
    ]
)

In [28]:
logreg_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LogisticRegression(
            max_iter=1000,
            class_weight="balanced",
            random_state=config.RANDOM_STATE
        )),
    ]
)

In [29]:
logreg_pipeline.fit(X_train, y_train)

y_pred_proba = logreg_pipeline.predict_proba(X_test)[:, 1]
y_pred = logreg_pipeline.predict(X_test)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
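
The warning above means the solver stopped at `max_iter=1000` without converging, so the fitted coefficients may be suboptimal. As the message suggests, scaling the numeric columns is one common remedy. A sketch on synthetic data, assuming a `StandardScaler` replaces the passthrough step (this is not part of the original notebook, and the data is invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic data, invented for illustration; value ranges mimic the notebook's columns.
X = pd.DataFrame({
    "carrier": rng.choice(["9E", "AA", "DL"], size=n),
    "year": rng.integers(2013, 2019, size=n),
    "month": rng.integers(1, 13, size=n),
    "arr_flights": rng.uniform(1, 5000, size=n),
})
# Noisy label: some months are "high delay", with 20% of labels flipped.
y = (X["month"].isin([6, 7, 12]) ^ (rng.random(n) < 0.2)).astype(int)

scaled_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["carrier"]),
        ("num", StandardScaler(), ["year", "month", "arr_flights"]),  # scaled, not passthrough
    ]
)

pipeline = Pipeline(
    steps=[
        ("preprocessor", scaled_preprocessor),
        ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ]
)
pipeline.fit(X, y)

# Number of solver iterations actually used; well below max_iter indicates convergence.
n_iter = int(pipeline.named_steps["model"].n_iter_[0])
print(n_iter)
```

Whether scaling alone resolves the warning on the real data would need to be checked by re-running the cell.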


In [33]:
roc_auc = roc_auc_score(y_test, y_pred_proba)
roc_auc


0.6020971454139127

In [34]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.62      0.72     57000
           1       0.21      0.52      0.30     11116

    accuracy                           0.60     68116
   macro avg       0.54      0.57      0.51     68116
weighted avg       0.76      0.60      0.65     68116
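
The per-class precision and recall in the report above derive from the confusion matrix (`confusion_matrix` is imported at the top of the notebook but not used). A minimal sketch on toy labels, invented for illustration:

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_1 = tp / (tp + fp)  # of the rows predicted as class 1, the share that truly are
recall_1 = tp / (tp + fn)     # of the true class-1 rows, the share that was found

print(tn, fp, fn, tp, precision_1, recall_1)
```

Read this way, the logistic regression report says the model finds about half of the high-delay rows (recall 0.52), while roughly four out of five of its class-1 predictions are wrong (precision 0.21).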



In [37]:
tree_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", DecisionTreeClassifier(
            max_depth=6,
            min_samples_leaf=100,
            class_weight="balanced",
            random_state=config.RANDOM_STATE
        )),
    ]
)

In [38]:
tree_pipeline.fit(X_train, y_train)

y_pred_tree = tree_pipeline.predict(X_test)
y_pred_tree_proba = tree_pipeline.predict_proba(X_test)[:, 1]

roc_auc_tree = roc_auc_score(y_test, y_pred_tree_proba)
roc_auc_tree

0.6487326763066357

In [39]:
print(classification_report(y_test, y_pred_tree))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86     57000
           1       0.29      0.28      0.29     11116

    accuracy                           0.77     68116
   macro avg       0.57      0.57      0.57     68116
weighted avg       0.77      0.77      0.77     68116



## Initial comparison

- Logistic regression (test ROC AUC ≈ 0.60):
  - linear baseline
  - interpretable
  - robust to noise
  - provisional result: the solver hit the `max_iter=1000` limit without converging

- Shallow tree (test ROC AUC ≈ 0.65):
  - captures nonlinearities
  - higher risk of overfitting
  - good structural reference

Both models serve as a starting point for improvements in
Notebook 04.
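
One way to probe the overfitting risk noted for the tree is to compare ROC AUC on the training and test sets. The sketch below does this on synthetic data (invented for illustration) for a constrained tree and an unconstrained one; the helper `train_test_auc` is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: a noisy signal in the first feature.
X = rng.normal(size=(2000, 5))
y = ((X[:, 0] + rng.normal(scale=2.0, size=2000)) > 0).astype(int)
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

def train_test_auc(model):
    """Fit the model and return (train ROC AUC, test ROC AUC)."""
    model.fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return auc_train, auc_test

shallow = train_test_auc(
    DecisionTreeClassifier(max_depth=3, min_samples_leaf=100, random_state=0)
)
deep = train_test_auc(DecisionTreeClassifier(random_state=0))  # unconstrained depth

# A large gap between train and test AUC is the usual symptom of overfitting.
print(f"shallow: train={shallow[0]:.3f} test={shallow[1]:.3f}")
print(f"deep:    train={deep[0]:.3f} test={deep[1]:.3f}")
```

Applying the same comparison to `tree_pipeline` on `X_train` and `X_test` would show whether `max_depth=6` and `min_samples_leaf=100` keep that gap small.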


## Conclusion of Notebook 03

This notebook showed that:
- the problem can be modeled
- delay patterns can be predicted without data leakage
- simple models already capture some signal (test ROC AUC of about 0.60 to 0.65)
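
To put those ROC AUC values in context, a model that ignores the features scores 0.5. A minimal sketch using scikit-learn's `DummyClassifier` on synthetic labels (invented for illustration, with a positive rate of roughly 25%):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels, invented for illustration.
y_train = (rng.random(1000) < 0.25).astype(int)
y_test = (rng.random(400) < 0.25).astype(int)
X_train = np.zeros((1000, 1))  # the dummy model ignores the features
X_test = np.zeros((400, 1))

# strategy="prior" predicts the training class frequencies for every row.
dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
dummy_auc = roc_auc_score(y_test, dummy.predict_proba(X_test)[:, 1])

print(dummy_auc)  # constant scores cannot rank rows, so the AUC is 0.5
```

Against that reference, the tree's 0.65 is a modest but real improvement, which leaves clear room for the follow-up work.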

Next steps:
- feature engineering
- more expressive models
- systematic improvement of metrics
