# 04 — Tuning Models (Time-aware CV)

This notebook runs the XGBoost tuner, which performs chronological cross-validation **inside the training window**, selects hyperparameters by **AUC-PR** (average precision, reported as AP below), and then trains a final model and evaluates it on the held-out validation window.

Before running anything, switch the notebook kernel (upper right in VS Code) to `airline-delay-prediction` (Python 3.10).

### Conda / environment
```bash
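# create (or activate) your env with xgboost, optuna, pyarrow, scikit-learn, matplotlib
conda create -n airline-delay-prediction python=3.10 -y
conda activate airline-delay-prediction
pip install xgboost==2.0.3 optuna pyarrow pandas scikit-learn matplotlib joblib
# macOS only: the OpenMP preload cell below expects libomp.dylib inside this env
conda install -c conda-forge llvm-openmp -y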
```

In [1]:
from pathlib import Path
import os

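def to_repo_root(start=None):
    """Walk up from `start` (default: cwd) and chdir into the project root."""
    start = Path(start) if start is not None else Path.cwd()
    for p in [start, *start.parents]:
        # The project root is the first directory holding both markers.
        if (p / "src").exists() and (p / "requirements.txt").exists():
            os.chdir(p)
            print("Project root:", p)
            return
    raise SystemExit("Could not locate project root (needs ./src and ./requirements.txt)")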

to_repo_root()


Project root: /Users/nikhilroy/Documents/MSML610/repo


In [None]:
from src.utils_model import (
    SCHEMA, BASE_CATEGORICAL, BASE_NUMERIC,
    load_model, load_metrics, predict_proba, coerce_schema,
    pick_threshold, load_all_metrics_table, score_row, score_dataframe
)


In [2]:
import os, sys, glob, ctypes, importlib

env_prefix = os.environ.get("CONDA_PREFIX", sys.prefix)
candidates = glob.glob(os.path.join(env_prefix, "lib", "libomp*.dylib"))
print("Found libomp candidates:", candidates)

if not candidates:
    raise RuntimeError("libomp.dylib not found in the conda env. Make sure llvm-openmp is installed in THIS env.")

libomp_path = candidates[0]
# Preload OpenMP before importing xgboost
ctypes.CDLL(libomp_path)

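# (Optional) expose the env's lib directory to child processes. macOS dyld generally
# reads DYLD_LIBRARY_PATH at process launch, so this is unlikely to affect the running
# kernel; the ctypes preload above is what makes libomp available here.
os.environ["DYLD_LIBRARY_PATH"] = env_prefix + "/lib:" + os.environ.get("DYLD_LIBRARY_PATH","")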

# Now import/test xgboost
import xgboost as xgb, numpy as np
print("XGBoost version:", xgb.__version__)
X = np.random.randn(200, 10); y = (np.random.rand(200) > 0.8).astype(int)
d = xgb.DMatrix(X, label=y)
xgb.train({'objective':'binary:logistic','tree_method':'hist','verbosity':0}, d, num_boost_round=1)
print("XGBoost OK")


Found libomp candidates: ['/Users/nikhilroy/opt/anaconda3/envs/airline-delay-prediction/lib/libomp.dylib']
XGBoost version: 2.1.1
XGBoost OK


## Tuning the XGBoost model on all features (departure delay included) with time-aware cross-validation and Bayesian optimization

This took over 40 hours to run, literally a full work week of compute just for the tuning part of modeling, so hopefully the grade is worth the commitment.

#### Data & I/O

- `--in_path` (str, default: `data/processed/flights_with_weather.parquet`): path to the Parquet file with the engineered dataset.
- `--out_dir` (str, default: `models`): directory where all artifacts (CSV of trials, model, metrics, plots) are written.
- `--tag` (str, default: `tuned_all_features_bo`): prefix for artifact filenames so runs don't overwrite each other.

#### Train/validation split

- `--split` (`time` | `random`, default: `time`):
  - `time`: sorts by `FL_DATE` and uses the last `eval_size` fraction as the final validation set.
  - `random`: random stratified split.
- `--eval_size` (float, default: 0.20): fraction of data reserved for the final validation holdout.
- `--use_departure_delay` (bool-like str, default: `true`): if `true`, includes `DEPARTURE_DELAY` as a feature; if `false`, excludes it.
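
To make the `time` split concrete, here is a minimal sketch under stated assumptions: rows carry an ISO-formatted `FL_DATE` string, and the helper name `time_holdout_split` and the toy rows are hypothetical, not taken from `src/tuning_models.py`. It uses plain Python lists to stay dependency-free; on the real Parquet data the same idea would typically be expressed with pandas.

```python
def time_holdout_split(rows, date_key="FL_DATE", eval_size=0.20):
    """Sort rows chronologically and hold out the last `eval_size` fraction."""
    if not 0.0 < eval_size < 1.0:
        raise ValueError("eval_size must be strictly between 0 and 1")
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    ordered = sorted(rows, key=lambda r: r[date_key])
    n_eval = max(1, round(len(ordered) * eval_size))
    cut = len(ordered) - n_eval
    return ordered[:cut], ordered[cut:]


# Toy data (hypothetical): ten days of flights, deliberately out of order.
rows = [{"FL_DATE": f"2015-01-{day:02d}", "delayed": day % 2} for day in range(10, 0, -1)]
train, valid = time_holdout_split(rows, eval_size=0.20)
print(len(train), "training rows,", len(valid), "validation rows")
print("latest train date:", train[-1]["FL_DATE"], "| earliest validation date:", valid[0]["FL_DATE"])
```

Because the holdout is taken from the end of the sorted data, every validation row is later than every training row, which is the property a random split does not give you.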

#### CV (for the Bayesian objective)

- `--cv_folds` (int, default: 5): number of time-aware folds inside the training pool. Each fold validates only on a block that comes after its training data, so no future information leaks into training.
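
One way to build folds that only validate on future blocks is an expanding window, similar in spirit to scikit-learn's `TimeSeriesSplit`. The sketch below is an illustration, not the tuner's actual code: the function name `expanding_window_folds` is hypothetical, and the source does not say whether the script uses an expanding or a sliding window.

```python
def expanding_window_folds(n_rows, n_folds=5):
    """Yield (train_range, valid_range) index pairs over chronologically sorted rows.

    The rows are cut into n_folds + 1 consecutive blocks. Fold k trains on
    blocks 0..k-1 and validates on block k, so validation is always later.
    """
    if n_rows < n_folds + 1:
        raise ValueError("need at least n_folds + 1 rows")
    block = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = block * k
        # The last fold absorbs any remainder rows.
        valid_end = n_rows if k == n_folds else block * (k + 1)
        yield range(0, train_end), range(train_end, valid_end)


folds = list(expanding_window_folds(n_rows=60, n_folds=5))
for train_idx, valid_idx in folds:
    print(f"train [0, {train_idx.stop})  ->  validate [{valid_idx.start}, {valid_idx.stop})")
```

Each fold's training range ends exactly where its validation range begins, so the per-fold metric is always measured on data the model could not have seen.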

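#### Bayesian optimization (Optuna)

- `--bo_trials` (int, default: 40): total number of trials to run. More trials search the space more thoroughly but take longer.
- `--bo_startup_trials` (int, default: 10): number of initial randomly sampled warm-up trials before the sampler starts using earlier results to guide its suggestions. If this is at least `--bo_trials`, the search never leaves the warm-up phase.
- `--bo_timeout` (int, seconds, default: 0): hard wall-clock limit for the search; 0 means no limit.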

#### Boosting / early stopping (used in CV and the final fit)

- `--n_rounds` (int, default: 1200): maximum number of boosting rounds allowed.
- `--early_stopping` (int, default: 100): stop if the validation metric doesn't improve for this many rounds, and keep the best iteration.
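
The patience rule behind early stopping can be sketched in a few lines. This is a simplified illustration for a higher-is-better metric; the function name `best_round_with_patience` and the toy scores are hypothetical, and XGBoost's own implementation handles more cases (multiple metrics, minimizing metrics).

```python
def best_round_with_patience(scores, patience=100):
    """Return (best_round, rounds_run) for a higher-is-better validation metric."""
    best_round, best_score = 0, float("-inf")
    for rnd, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop and keep the best round.
            return best_round, rnd + 1
    return best_round, len(scores)


# Toy validation scores per boosting round (hypothetical values).
scores = [0.80, 0.85, 0.88, 0.87, 0.88, 0.86, 0.85, 0.84]
best, ran = best_round_with_patience(scores, patience=3)
print(f"best round: {best}, rounds actually run: {ran}")
```

With `--n_rounds` as the ceiling and `--early_stopping` as the patience, the reported `best_iter` is the round that would be kept by a rule like this one.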

#### Hyperparameter search bounds (inclusive)

These define the search space Optuna samples from. Tighten them if you know the sweet spot; widen them if you want more exploration.

- `--lr_low`, `--lr_high` (floats, default: 0.03 … 0.2): learning rate (`eta`). Lower values learn more slowly but are often more stable; higher values can converge faster but may overfit.
- `--max_depth_low`, `--max_depth_high` (ints, default: 5 … 9): tree depth. Deeper trees can fit more complex interactions; trees that are too deep risk overfitting.
- `--min_child_weight_low`, `--min_child_weight_high` (ints, default: 1 … 8): minimum sum of instance weight needed in a child. Higher values are more conservative and give simpler trees.
- `--subsample_low`, `--subsample_high` (floats, default: 0.6 … 1.0): fraction of rows sampled per tree. Values below 1.0 add randomness, which generally helps generalization.
- `--colsample_bytree_low`, `--colsample_bytree_high` (floats, default: 0.6 … 1.0): fraction of columns sampled per tree. Similar bias-variance trade-off to `subsample`.
- `--reg_alpha_low`, `--reg_alpha_high` (floats, default: 1e-8 … 1.0, log scale): L1 regularization on leaf weights. Encourages sparsity by shrinking weak leaf weights toward zero.
- `--reg_lambda_low`, `--reg_lambda_high` (floats, default: 1e-2 … 10.0, log scale): L2 regularization on leaf weights. Shrinks weights smoothly and often stabilizes training.
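
To show what sampling from these bounds looks like, including the log scale for the regularization terms, here is a dependency-free sketch. It draws configurations purely at random, so it is a stand-in for the search space only and not for Optuna's TPE sampler; in Optuna the equivalent draws are made with `trial.suggest_int(...)` and `trial.suggest_float(..., log=True)`. The `SEARCH_SPACE` layout and `sample_params` name are hypothetical.

```python
import math
import random

# Bounds copied from the defaults listed above; the dict layout is illustrative.
SEARCH_SPACE = {
    "learning_rate":    ("float", 0.03, 0.2),
    "max_depth":        ("int", 5, 9),
    "min_child_weight": ("int", 1, 8),
    "subsample":        ("float", 0.6, 1.0),
    "colsample_bytree": ("float", 0.6, 1.0),
    "reg_alpha":        ("log", 1e-8, 1.0),
    "reg_lambda":       ("log", 1e-2, 10.0),
}


def sample_params(rng):
    """Draw one configuration uniformly (or log-uniformly) within the bounds."""
    params = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "int":
            params[name] = rng.randint(low, high)  # inclusive on both ends
        elif kind == "log":
            # Uniform in log space, so every decade is equally likely.
            value = math.exp(rng.uniform(math.log(low), math.log(high)))
            params[name] = min(max(value, low), high)  # guard against float rounding
        else:
            params[name] = rng.uniform(low, high)
    return params


rng = random.Random(0)
trial_params = sample_params(rng)
for name, value in trial_params.items():
    print(f"{name}: {value}")
```

Sampling `reg_alpha` on a log scale matters because its range spans eight orders of magnitude; a plain uniform draw would almost never land below 0.01.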

In [5]:
# Bayesian tuning with Optuna (maximize AP over time-aware CV)
%run src/tuning_models.py \
  --in_path data/processed/flights_with_weather.parquet \
  --out_dir models \
  --split time --eval_size 0.20 \
  --use_departure_delay true \
  --tag tuned_all_features_bo \
  --cv_folds 5 \
  --bo_trials 10 \
  --bo_startup_trials 10 \
  --bo_timeout 0 \
  --n_rounds 1001 \
  --early_stopping 100 \
  --lr_low 0.03 \
  --lr_high 0.2 \
  --max_depth_low 5 \
  --max_depth_high 9 \
  --min_child_weight_low 1 \
  --min_child_weight_high 8 \
  --subsample_low 0.6 \
  --subsample_high 1.0 \
  --colsample_bytree_low 0.6 \
  --colsample_bytree_high 1.0 \
  --reg_alpha_low 1e-8 \
  --reg_alpha_high 1.0 \
  --reg_lambda_low 1e-2 \
  --reg_lambda_high 10.0


[I 2025-11-14 23:20:50,114] A new study created in memory with name: no-name-4a8ca277-4e9f-43be-aaa6-2f7098cb077c
[I 2025-11-15 00:32:07,274] Trial 0 finished with value: 0.9125777861991686 and parameters: {'learning_rate': 0.08757364264182173, 'max_depth': 5, 'min_child_weight': 3, 'subsample': 0.8805135844073053, 'colsample_bytree': 0.6817038283811238, 'reg_alpha': 9.572006494251443e-06, 'reg_lambda': 0.09496277503603154}. Best is trial 0 with value: 0.9125777861991686.
[I 2025-11-15 01:05:38,672] Trial 1 finished with value: 0.9108925521988819 and parameters: {'learning_rate': 0.12007457923586612, 'max_depth': 9, 'min_child_weight': 7, 'subsample': 0.6841278824951569, 'colsample_bytree': 0.7284174394919806, 'reg_alpha': 0.2390630533840136, 'reg_lambda': 0.14296883396345103}. Best is trial 0 with value: 0.9125777861991686.
[I 2025-11-15 01:36:34,865] Trial 2 finished with value: 0.9115564443909854 and parameters: {'learning_rate': 0.1770459916789657, 'max_depth': 8, 'min_child_weight


Best hyperparams: {'learning_rate': 0.08757364264182173, 'max_depth': 5, 'min_child_weight': 3, 'subsample': 0.8805135844073053, 'colsample_bytree': 0.6817038283811238, 'reg_alpha': 9.572006494251443e-06, 'reg_lambda': 0.09496277503603154}

tuned_all_features_bo: AUC=0.962  AP=0.918  F1=0.839  P=0.895  R=0.790  (best_iter=983)
Saved: models/tuned_all_features_bo_tune_trials.csv and model/metrics/plots under models/


As the output above shows (or see `models/tuned_all_features_bo_tune_trials.csv`), tuning the XGBoost hyperparameters with a Bayesian optimizer barely moves performance: cross-validated AP ranges only from about 0.9109 to 0.9126 across the 10 trials. A likely reason is that with close to a thousand boosting rounds, each new tree corrects the errors of the trees before it, so most configurations in this search space end up at nearly the same precision and recall. Any of these models is therefore a valid choice with very similar performance (RIP my 40 hours of running this, though).
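
As a quick sanity check on the reported numbers, the snippet below recomputes F1 from the printed precision and recall and measures how far apart the best and worst trials are. All values are copied from this notebook's run output and trials table; nothing here is recomputed from the model itself.

```python
# Values copied from the run output and the trials table in this notebook.
precision, recall = 0.895, 0.790
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
print(f"F1 from P and R: {f1:.3f}")

cv_ap = [0.912578, 0.912544, 0.912128, 0.912100, 0.912019,
         0.911994, 0.911701, 0.911556, 0.910990, 0.910893]
spread = max(cv_ap) - min(cv_ap)
print(f"CV AP spread across {len(cv_ap)} trials: {spread:.6f}")
```

The recomputed F1 rounds to the reported 0.839, and the gap between the best and worst trial is under 0.002 AP, which is the basis for treating the configurations as practically equivalent.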

### Let's look at the trial table, sorted by descending CV average precision:

In [11]:
import pandas as pd

# Load the per-trial CV results and show the ten best trials by average precision.
pd.read_csv("models/tuned_all_features_bo_tune_trials.csv").sort_values("cv_ap", ascending=False).head(10)


| Unnamed: 0 | trial | cv_ap | cv_auc | cv_logloss | learning_rate | max_depth | min_child_weight | subsample | colsample_bytree | reg_alpha | reg_lambda |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.912578 | 0.961354 | 0.204983 | 0.087574 | 5 | 3 | 0.880514 | 0.681704 | 9.572006e-06 | 0.094963 |
| 1 | 9 | 0.912544 | 0.961229 | 0.20387 | 0.085377 | 6 | 8 | 0.815019 | 0.77833 | 0.05983642 | 0.069709 |
| 2 | 6 | 0.912128 | 0.960903 | 0.203714 | 0.124273 | 6 | 4 | 0.844534 | 0.826475 | 0.1361383 | 0.07286 |
| 3 | 7 | 0.9121 | 0.960854 | 0.20261 | 0.082109 | 7 | 3 | 0.759804 | 0.762482 | 0.0005758946 | 0.500824 |
| 4 | 4 | 0.912019 | 0.960881 | 0.20166 | 0.148901 | 8 | 5 | 0.940639 | 0.672563 | 0.004182561 | 0.255453 |
| 5 | 3 | 0.911994 | 0.960929 | 0.204755 | 0.093496 | 6 | 5 | 0.706706 | 0.734326 | 0.5182465 | 0.569428 |
| 6 | 5 | 0.911701 | 0.96075 | 0.204188 | 0.126309 | 7 | 7 | 0.742425 | 0.821419 | 4.883618e-08 | 0.015254 |
| 7 | 2 | 0.911556 | 0.960639 | 0.203269 | 0.177046 | 8 | 8 | 0.832793 | 0.676556 | 0.05474506 | 0.724493 |
| 8 | 8 | 0.91099 | 0.960219 | 0.203602 | 0.19388 | 8 | 8 | 0.86109 | 0.950996 | 2.745679e-08 | 0.025091 |
| 9 | 1 | 0.910893 | 0.960345 | 0.202722 | 0.120075 | 9 | 7 | 0.684128 | 0.728417 | 0.2390631 | 0.142969 |
