### This is the example.ipynb notebook: a simplified, minimal notebook for running my programs. For more detail on my programs, sanity checks, and so on, please see the individual notebooks labeled 00-05.ipynb.

In [None]:
!conda create -n airline-delay-prediction python=3.10 -y || true
!conda run -n airline-delay-prediction python -m pip install --upgrade pip
!conda run -n airline-delay-prediction python -m pip install -r requirements.txt

# (Highly recommended on macOS/Linux) Java runtime for the Spark ETL step; XGBoost itself does not need Java
!conda install -n airline-delay-prediction -c conda-forge openjdk=11 -y || true

!echo "✓ In the top right of VS Code, make sure to change the kernel/environment to: 'airline-delay-prediction'"


In [None]:
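from pathlib import Path
import os

def to_repo_root(start=None):
    """Walk up from `start` (default: current directory) to the folder holding ./src and ./requirements.txt, then chdir there."""
    start = Path(start) if start is not None else Path.cwd()
    for p in [start, *start.parents]:
        if (p / "src").exists() and (p / "requirements.txt").exists():
            os.chdir(p)
            print("Project root:", p)
            return
    raise SystemExit("Could not locate project root (needs ./src and ./requirements.txt)")

to_repo_root()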


In [None]:
from src.utils_model import (
    SCHEMA, BASE_CATEGORICAL, BASE_NUMERIC,
    load_model, load_metrics, predict_proba, coerce_schema,
    pick_threshold, load_all_metrics_table, score_row, score_dataframe
)
from pathlib import Path
import shutil

# Clear any previous enriched output, then recreate the empty directory
enriched_dir = Path("data/processed/flights_enriched")
print("Removing:", enriched_dir.resolve())
shutil.rmtree(enriched_dir, ignore_errors=True)
enriched_dir.mkdir(parents=True, exist_ok=True)


### From 01.ipynb:

In [None]:
!conda run -n airline-delay-prediction python src/spark_etl.py


In [None]:
# Weather join (Meteostat)

!conda run -n airline-delay-prediction python src/merge_weather.py


### From 02.ipynb:

In [None]:
# Shows the plots as well as the report/table generated by the script

%run src/eda_report.py --show true


In [None]:
# Does not show the plots here, but still saves them in the reports folder
%run src/eda_report.py --show false
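
The two cells above differ only in the `--show true` / `--show false` flag. As a minimal sketch of how a script could parse that kind of flag (this is an assumption, not the actual code in `src/eda_report.py`; `str2bool` and `build_parser` are hypothetical names), note that `argparse` with `type=bool` would treat the string "false" as truthy, so an explicit converter is needed:

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert common true/false spellings to a bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true/false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser with a single --show flag."""
    parser = argparse.ArgumentParser(description="EDA report (illustrative sketch)")
    # --show true  -> display plots inline; --show false -> only save them
    parser.add_argument("--show", type=str2bool, default=False)
    return parser


# Parse an explicit argument list so the sketch also runs inside a notebook.
args = build_parser().parse_args(["--show", "true"])
print(args.show)  # True
```

With this converter, `--show false` parses to `False` rather than to a non-empty (truthy) string.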

### From 03.ipynb:

In [None]:
# Run in the airline-delay-prediction kernel
import sys, subprocess, os
subprocess.run(
    ["conda","install","-n","airline-delay-prediction","-c","conda-forge","llvm-openmp","-y"],
    check=False
)
print("CONDA_PREFIX:", os.environ.get("CONDA_PREFIX", sys.prefix))


In [None]:
import os, sys, glob, ctypes

# macOS only: look for the OpenMP runtime (libomp*.dylib) inside the active conda env
env_prefix = os.environ.get("CONDA_PREFIX", sys.prefix)
candidates = glob.glob(os.path.join(env_prefix, "lib", "libomp*.dylib"))
print("Found libomp candidates:", candidates)

if not candidates:
    raise RuntimeError("libomp.dylib not found in the conda env. Make sure llvm-openmp is installed in THIS env.")

libomp_path = candidates[0]
# Preload OpenMP before importing xgboost
ctypes.CDLL(libomp_path)

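# (Optional) put the env's lib directory first for child processes.
# dyld reads DYLD_LIBRARY_PATH at process launch, so this does not change the already-running kernel.
os.environ["DYLD_LIBRARY_PATH"] = env_prefix + "/lib:" + os.environ.get("DYLD_LIBRARY_PATH","")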

# Now import/test xgboost
import xgboost as xgb, numpy as np
print("XGBoost version:", xgb.__version__)
X = np.random.randn(200, 10); y = (np.random.rand(200) > 0.8).astype(int)
d = xgb.DMatrix(X, label=y)
xgb.train({'objective':'binary:logistic','tree_method':'hist','verbosity':0}, d, num_boost_round=1)
print("XGBoost OK")


Training XGBoost on the entire data (all features):

In [None]:
%run src/train_xgb.py \
  --in_path data/processed/flights_with_weather.parquet \
  --out_dir models \
  --split time --eval_size 0.20 \
  --early_stopping 50 \
  --n_estimators 1001 \
  --log_period 25 \
  --use_departure_delay true \
  --tag all_features \
  --native true \
  --learning_rate 0.1

Training XGBoost without the most valuable feature, departure delay (to see how much removing it lowers our prediction quality, for comparison and funzies):

In [None]:
%run src/train_xgb.py \
  --in_path data/processed/flights_with_weather.parquet \
  --out_dir models \
  --split time --eval_size 0.20 \
  --early_stopping 50 \
  --n_estimators 1001 \
  --log_period 25 \
  --use_departure_delay false \
  --tag removed_departure_delay_feature \
  --native true \
  --learning_rate 0.2
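
Both training runs above pass `--split time --eval_size 0.20`. As a rough sketch of what a time-based holdout generally means (an assumption about the idea, not the actual logic in `src/train_xgb.py`; `time_split` and the `flight_date` column are hypothetical names), the rows are ordered by date and the most recent 20% are kept for evaluation, so the model is never scored on flights older than its training data:

```python
from datetime import date


def time_split(rows, date_key, eval_size=0.20):
    """Hold out the most recent `eval_size` fraction of rows for evaluation."""
    if not 0.0 < eval_size < 1.0:
        raise ValueError("eval_size must be strictly between 0 and 1")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    # Sort oldest to newest so the cut point separates past from future.
    ordered = sorted(rows, key=lambda r: r[date_key])
    n_eval = max(1, int(round(len(ordered) * eval_size)))
    cut = len(ordered) - n_eval
    return ordered[:cut], ordered[cut:]


# Toy data: ten flights on consecutive days (made-up values).
flights = [{"flight_date": date(2024, 1, d), "delayed": d % 3 == 0} for d in range(1, 11)]
train, evaluation = time_split(flights, "flight_date", eval_size=0.20)
print(len(train), len(evaluation))  # 8 2
print(train[-1]["flight_date"] < evaluation[0]["flight_date"])  # True
```

A random split would instead mix future flights into training, which tends to make the evaluation score look better than it would be on genuinely new data.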

Training CatBoost and LightGBM (bonus, so hopefully extra marks):

In [None]:
# LightGBM & CatBoost baselines (time-aware split; same knobs as XGB script)
%run src/train_baselines.py \
  --in_path data/processed/flights_with_weather.parquet \
  --out_dir models \
  --split time --eval_size 0.20 \
  --use_departure_delay true \
  --model all \
  --tag all_features \
  --n_estimators 975 \
  --learning_rate 0.1 \
  --early_stopping 100 \
  --log_period 25 \
  --lgbm_max_depth 7 \
  --cat_depth 7


### From 04.ipynb:

Tuning our model with Bayesian Optimization

Note: My MacBook is slow, so it literally took 20+ hours to run just 10 trials; you can raise the trial count if you have more compute. My computer was about to explode lol, but the TA said this was fine and won't dock points for it.

In [None]:
# Bayesian tuning with Optuna (maximize AP over time-aware CV)
%run src/tuning_models.py \
  --in_path data/processed/flights_with_weather.parquet \
  --out_dir models \
  --split time --eval_size 0.20 \
  --use_departure_delay true \
  --tag tuned_all_features_bo \
  --cv_folds 5 \
  --bo_trials 10 \
  --bo_startup_trials 10 \
  --bo_timeout 0 \
  --n_rounds 1001 \
  --early_stopping 100 \
  --lr_low 0.03 \
  --lr_high 0.2 \
  --max_depth_low 5 \
  --max_depth_high 9 \
  --min_child_weight_low 1 \
  --min_child_weight_high 8 \
  --subsample_low 0.6 \
  --subsample_high 1.0 \
  --colsample_bytree_low 0.6 \
  --colsample_bytree_high 1.0 \
  --reg_alpha_low 1e-8 \
  --reg_alpha_high 1.0 \
  --reg_lambda_low 1e-2 \
  --reg_lambda_high 10.0
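
The tuning cell above uses `--cv_folds 5` with a time-aware CV. One common way to build such folds is an expanding window, sketched below under the assumption that the rows are already sorted by time. This is not necessarily how `src/tuning_models.py` builds its folds, and `expanding_window_folds` is a hypothetical name (scikit-learn's `TimeSeriesSplit` offers a similar scheme):

```python
def expanding_window_folds(n_rows, n_folds):
    """Yield (train_indices, valid_indices) pairs over time-ordered rows.

    Rows are cut into n_folds + 1 contiguous blocks; fold k trains on
    blocks 0..k-1 and validates on block k, so validation is always later.
    """
    if n_folds < 1 or n_rows < n_folds + 1:
        raise ValueError("need n_folds >= 1 and at least n_folds + 1 rows")
    block = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = block * k
        # The last fold absorbs any leftover rows at the end.
        valid_end = n_rows if k == n_folds else block * (k + 1)
        yield range(0, train_end), range(train_end, valid_end)


for train_idx, valid_idx in expanding_window_folds(n_rows=12, n_folds=5):
    print(len(train_idx), list(valid_idx))
```

Each fold's validation block comes strictly after its training rows, which is what keeps the per-trial score honest for a forecasting-style problem.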


In [None]:
import pandas as pd
pd.read_csv("models/tuned_all_features_bo_tune_trials.csv").sort_values("cv_ap", ascending=False).head(10)
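
The trials table above is sorted by `cv_ap`, which I read as the cross-validated average precision (AP). For reference, here is a small pure-Python sketch of AP, assuming binary labels and no tied scores; `average_precision` is a hypothetical helper, not the function used by the tuning script:

```python
def average_precision(y_true, scores):
    """Average precision: mean of precision@k taken at the rank k of each positive."""
    # Rank examples from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == 1)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos


labels = [1, 0, 1, 0, 0]          # made-up labels: 1 = delayed
scores = [0.9, 0.8, 0.7, 0.4, 0.2]  # made-up predicted probabilities
print(round(average_precision(labels, scores), 4))  # 0.8333
```

AP rewards ranking the delayed flights near the top, which makes it a reasonable target when delays are the minority class.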


### From 05.ipynb:

In [None]:
# (inside your repo root; swap in your own env name if it differs)
# `!conda activate` does not persist across notebook shell commands, so each command targets the env explicitly
!conda install -n airline-delay-prediction -c conda-forge openjdk=11 -y || true
!conda run -n airline-delay-prediction python -m pip install -r requirements.txt





In [None]:
# Run the Streamlit app (this cell keeps running until you stop it)
!conda run -n airline-delay-prediction streamlit run src/app.py
