# Environmental expert (XGBoost): final HSI model (context expert)

**Author:** Kacper Ryske  
**Date:** January 2026  
**Purpose:** Train and evaluate an environmental feature-based classifier whose calibrated probabilities serve as the contextual (HSI) expert in the late-fusion ensemble.

This notebook trains the **final environmental (context) expert** used in the thesis: an XGBoost multi-class model that outputs **occurrence probabilities for 8 target insect species**, conditioned on weather, land-cover, temporal context, and related environmental features.

## Validation strategy
- **Train:** 80% of **2018–2024** observations (random stratified split by species)
- **Validation:** 20% of **2018–2024** observations (random stratified split by species)
- **Test:** **2025** observations (temporal holdout; not used in model selection)
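
The split logic above can be illustrated on synthetic data. This is a minimal sketch, not the notebook's implementation (which uses scikit-learn's `train_test_split` in section 4): the helper `stratified_split` and the toy arrays are hypothetical. It only shows that 2025 rows never enter the train/val pool and that each species contributes about 20% of its rows to validation.

```python
import numpy as np

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) so each class gives ~val_fraction of its rows to val."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_val = int(round(val_fraction * len(idx)))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(sorted(train_idx)), np.array(sorted(val_idx))

# Toy data: 100 observations per year for 2018-2025, two species at a 3:1 ratio.
years = np.repeat(np.arange(2018, 2026), 100)
species = np.tile(np.array(["A"] * 75 + ["B"] * 25), 8)

historical = np.flatnonzero(years <= 2024)   # pool for train/val
test_idx = np.flatnonzero(years == 2025)     # temporal holdout, never split

train_rel, val_rel = stratified_split(species[historical])
train_idx, val_idx = historical[train_rel], historical[val_rel]

print(len(train_idx), len(val_idx), len(test_idx))  # 560 140 100
```

Because the split is done per class, the 3:1 species ratio of the pool is preserved in both the train and the validation indices.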

## Data layout (local, not tracked)
- `data/amsterdam_2/train/observations_filtered_50m_accuracy.parquet`  (2018–2024)
- `data/amsterdam_2/val/observations_filtered_50m_accuracy.parquet`    (2025; used as the temporal test set)

## Outputs (local, not tracked)
- Model + metadata: `models/context/`
- Predictions + metrics: `outputs/context_xgb/`

> Tip: keep `data/`, `models/`, and `outputs/` in `.gitignore`.

## How to run
Run the notebook **top-to-bottom**. The precomputed parquet files are expected to include:
- observation metadata (e.g., `species`, `observed_at`, `final_latitude`, `final_longitude`, …)
- weather features (e.g., `temp_c`, `rhum`, `wspd_ms`, `prcp_mm`, `cloud_cover`, `swrad`, `vpd_kpa`)
- land-cover fractions / indices at multiple radii (e.g., `wc50_water`, `wc250_tree`, …)

## What this notebook produces
- Trained XGBoost model (context expert) + inference metadata (feature list, species mapping)
- Test-set probabilities and summary metrics for downstream fusion

## Workflow
1. **Data loading & splitting**: load preprocessed observations and create train/val split; keep 2025 as temporal test
2. **Feature engineering**: derive environmental features from contextual inputs
3. **Feature selection**: remove identifiers/leakage features and known non-informative features
4. **Model training**: train candidate XGBoost configurations with early stopping on the validation split
5. **Evaluation**: report Top-k accuracy and probabilistic metrics (log loss, Brier)
6. **Export**: save model, metadata, and test-set probabilities for fusion

> Note: Probability calibration and diagnostic plots (reliability diagrams, confusion matrix, feature importance) are typically kept in a separate **analysis notebook** to keep this training notebook lightweight.

## Target species
1. *Aglais urticae* (Small Tortoiseshell)  
2. *Apis mellifera* (Western Honey Bee)  
3. *Bombus lapidarius* (Red-tailed Bumblebee)  
4. *Bombus terrestris* (Buff-tailed Bumblebee)  
5. *Coccinella septempunctata* (Seven-spot Ladybird)  
6. *Episyrphus balteatus* (Marmalade Hoverfly)  
7. *Eristalis tenax* (Drone Fly)  
8. *Eupeodes corollae* (Footballer Hoverfly)


## 1. Setup
Imports, plotting defaults, and random seed for reproducibility.


In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from pathlib import Path

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import (
    classification_report, log_loss, accuracy_score,
)

# Optional plotting (comment out if you want a pure-training notebook)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
plt.rcParams["figure.dpi"] = 120

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## 2. Paths and output folders
Repo-relative paths so the notebook runs on any machine (no absolute `/Users/...` paths).
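
The root detection below steps up at most one level. A more general variant could walk up through all parent directories until a marker folder is found. The following is a minimal sketch under that assumption; `find_repo_root` is a hypothetical helper that is not part of the notebook, and `configs` is used as the marker only because the cell below checks for it.

```python
from pathlib import Path
import tempfile

def find_repo_root(start: Path, marker: str = "configs") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return start  # fall back to the starting directory

# Demonstration in a throwaway tree: <tmp>/repo/configs and <tmp>/repo/notebooks/exp
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    (repo / "configs").mkdir(parents=True)
    nested = repo / "notebooks" / "exp"
    nested.mkdir(parents=True)
    found = find_repo_root(nested)
    is_repo = found == repo.resolve()

print(is_repo)  # True
```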


In [None]:
# ============================================================================
# PATHS (repo-friendly; no absolute /Users/... paths)
# ============================================================================

REPO_ROOT = Path.cwd().resolve()
# If running from notebooks/, step up one level
if not (REPO_ROOT / "configs").exists() and (REPO_ROOT.parent / "configs").exists():
    REPO_ROOT = REPO_ROOT.parent

DATA_ROOT = REPO_ROOT / "data" / "amsterdam"

HISTORICAL_PATH = DATA_ROOT / "train" / "observations_filtered_50m_accuracy.parquet"  # 2018–2024
TEST_PATH       = DATA_ROOT / "val"   / "observations_filtered_50m_accuracy.parquet"  # 2025 holdout

MODEL_DIR = REPO_ROOT / "models" / "context"
OUT_DIR   = REPO_ROOT / "outputs" / "context_xgb"
SPECIES_MAP_PATH = MODEL_DIR / "species_mapping_FINAL_no_vespula.csv"

MODEL_DIR.mkdir(parents=True, exist_ok=True)
(OUT_DIR / "preds").mkdir(parents=True, exist_ok=True)
(OUT_DIR / "metrics").mkdir(parents=True, exist_ok=True)
(OUT_DIR / "figures").mkdir(parents=True, exist_ok=True)

print("REPO_ROOT:", REPO_ROOT)
print("HISTORICAL_PATH:", HISTORICAL_PATH)
print("TEST_PATH:", TEST_PATH)


## 3. Experiment configuration
Target species and global constants.


In [None]:
# ============================================================================
# CONFIGURATION
# ============================================================================

TARGET_SPECIES = [
    "Apis mellifera", "Eristalis tenax",
    "Bombus terrestris", "Coccinella septempunctata",
    "Bombus lapidarius", "Episyrphus balteatus",
    "Aglais urticae", "Eupeodes corollae",
]

GRID_DECIMALS = 2  # used only for coarse bins (not used as model features)
SCALE = 10 ** GRID_DECIMALS

print("="*80)
print("FINAL XGBOOST HSI MODEL (PROPER TRAIN/VAL/TEST SPLIT)")
print("="*80)
print(f"Target species: {len(TARGET_SPECIES)}")
print("Validation: 80/20 stratified split (2018–2024) + 2025 temporal test")
print("Metrics: Top-K Accuracy + Log Loss + Brier Score")


## 4. Load and split data
The 2018–2024 observations are split into train/val (stratified by species). The 2025 data are kept as a temporal holdout test set.


In [None]:
# ============================================================================
# STEP 1: LOAD AND SPLIT DATA
# ============================================================================

def add_coarse_bins(df: pd.DataFrame, grid_decimals: int = 2) -> pd.DataFrame:
    """Add coarse spatial bins (lat_bin, lon_bin) from final_latitude/final_longitude.

    These bins are for analysis only; coordinates are excluded from model features.
    """
    df = df.copy()
    scale = 10 ** grid_decimals

    # robust parsing
    if "final_latitude" in df.columns:
        df["latitude"] = pd.to_numeric(df["final_latitude"], errors="coerce")
    elif "latitude" in df.columns:
        df["latitude"] = pd.to_numeric(df["latitude"], errors="coerce")
    else:
        df["latitude"] = np.nan

    if "final_longitude" in df.columns:
        df["longitude"] = pd.to_numeric(df["final_longitude"], errors="coerce")
    elif "longitude" in df.columns:
        df["longitude"] = pd.to_numeric(df["longitude"], errors="coerce")
    else:
        df["longitude"] = np.nan

    df["lat_bin"] = np.round(df["latitude"] * scale).astype("Int64")
    df["lon_bin"] = np.round(df["longitude"] * scale).astype("Int64")
    return df

In [None]:
print("Loading historical data (2018–2024)...")
historical_df = pd.read_parquet(HISTORICAL_PATH)
historical_df = historical_df[historical_df["species"].isin(TARGET_SPECIES)].copy()
historical_df = add_coarse_bins(historical_df, GRID_DECIMALS)
print(f"  Loaded: {len(historical_df):,} observations")

print("Creating train/val split (80/20, stratified by species)...")
train_df, val_df = train_test_split(
    historical_df,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=historical_df["species"],
)

print("Loading 2025 temporal holdout test data...")
test_df = pd.read_parquet(TEST_PATH)
test_df = test_df[test_df["species"].isin(TARGET_SPECIES)].copy()
test_df = add_coarse_bins(test_df, GRID_DECIMALS)
print(f"  Loaded: {len(test_df):,} observations")

print("\nSplit summary:")
print(f"  Train: {len(train_df):,}")
print(f"  Val:   {len(val_df):,}")
print(f"  Test:  {len(test_df):,}")


In [None]:
# Species distribution per split (counts)
dist = (
    pd.concat([
        train_df.assign(split="train"),
        val_df.assign(split="val"),
        test_df.assign(split="test"),
    ], ignore_index=True)
    .groupby(["split", "species"])
    .size()
    .unstack("split")
    .fillna(0)
    .astype(int)
    .loc[sorted(TARGET_SPECIES)]
)

dist


## 5. Feature engineering
Transforms raw columns into model-ready environmental features.
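
Two of the transformations used in the cell below can be checked on toy inputs: the sine/cosine encoding of periodic variables and the Shannon diversity of land-cover fractions. This is a minimal sketch; the helper names `cyclical` and `shannon_diversity` are hypothetical and only mirror the formulas used in `engineer_features`.

```python
import numpy as np

def cyclical(values, period):
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

def shannon_diversity(fractions, eps=1e-6):
    """Shannon entropy of land-cover fractions (rows are renormalised to sum to 1)."""
    p = np.asarray(fractions, dtype=float) + eps
    p = p / p.sum(axis=1, keepdims=True)
    return -np.sum(p * np.log(p), axis=1)

# 23:00 and 00:00 are one hour apart on the circle, although |23 - 0| = 23.
s, c = cyclical([23, 0, 12], period=24)
dist_23_0 = np.hypot(s[0] - s[1], c[0] - c[1])
dist_12_0 = np.hypot(s[2] - s[1], c[2] - c[1])

# One dominant class vs. four equally mixed classes.
h = shannon_diversity([[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]])

print(dist_23_0 < dist_12_0, h[0] < h[1])  # True True
```

The encoding keeps late-evening and early-morning observations close together, and the diversity index is near zero for a single dominant class and equals ln(4) for four equally mixed classes.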


In [None]:
# ============================================================================
# STEP 2: FEATURE ENGINEERING (final)
# ============================================================================
# Notes:
# - Assumes your parquet already contains weather + landcover columns
#   e.g., temp_c, rhum, wspd_ms, prcp_mm, cloud_cover, swrad, vpd_kpa,
#   and worldcover fractions wc{radius}_<class> for radius in {10,50,100,250}.
# - If some columns are missing, you can either regenerate the parquet or
#   adapt the feature engineering below.

def engineer_features(df: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    """Create comprehensive environmental features."""
    df = df.copy()

    # Ensure time basics exist (derive if needed)
    if "obs_dt_utc" in df.columns:
        dt_utc = pd.to_datetime(df["obs_dt_utc"], utc=True, errors="coerce")
    elif "observed_at" in df.columns:
        dt_utc = pd.to_datetime(df["observed_at"], utc=True, errors="coerce")
    else:
        dt_utc = pd.to_datetime(pd.NaT)

    if "hour_local" not in df.columns:
        # derive local hour from UTC timestamp if possible
        try:
            dt_local = dt_utc.dt.tz_convert("Europe/Amsterdam")
            df["hour_local"] = dt_local.dt.hour
        except Exception:
            df["hour_local"] = np.nan

    if "obs_month" not in df.columns:
        try:
            dt_local = dt_utc.dt.tz_convert("Europe/Amsterdam")
            df["obs_month"] = dt_local.dt.month
        except Exception:
            df["obs_month"] = np.nan

    if "doy" not in df.columns:
        try:
            dt_local = dt_utc.dt.tz_convert("Europe/Amsterdam")
            df["doy"] = dt_local.dt.dayofyear
        except Exception:
            df["doy"] = np.nan

    # TEMPORAL FEATURES
    if verbose: print("  ✓ Temporal features...")
    df["hour_sin"] = np.sin(2 * np.pi * df["hour_local"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour_local"] / 24)

    # Fallback keeps df's index so the assignment below aligns row by row
    week_of_year = dt_utc.dt.isocalendar().week if hasattr(dt_utc, "dt") else pd.Series(np.nan, index=df.index)
    df["week_of_year"] = week_of_year.astype("Int64")
    df["week_sin"] = np.sin(2 * np.pi * df["week_of_year"].astype(float) / 52)
    df["week_cos"] = np.cos(2 * np.pi * df["week_of_year"].astype(float) / 52)

    # Approximate day length in hours: a sinusoid of 12 ± 6 h that crosses 12 h
    # near the March equinox (doy 80); same as the thesis code
    day_length = 12 + 6 * np.sin(2 * np.pi * (df["doy"].astype(float) - 80) / 365)
    sunrise_hour = 12 - day_length / 2
    sunset_hour = 12 + day_length / 2
    df["hours_since_sunrise"] = df["hour_local"] - sunrise_hour
    df["hours_until_sunset"] = sunset_hour - df["hour_local"]
    # Note: the comparisons are one-sided, so hours before sunrise or after sunset are flagged as well
    df["is_golden_hour"] = ((df["hours_since_sunrise"] < 2) | (df["hours_until_sunset"] < 2)).astype(int)

    df["is_spring"] = df["obs_month"].isin([3, 4, 5]).astype(int)
    df["is_summer"] = df["obs_month"].isin([6, 7, 8]).astype(int)
    df["is_fall"]   = df["obs_month"].isin([9, 10]).astype(int)

    # WEATHER FEATURES
    if verbose: print("  ✓ Weather features...")
    df["is_optimal_temp"] = ((df["temp_c"] >= 15) & (df["temp_c"] <= 28)).astype(int)
    df["temp_squared"] = df["temp_c"] ** 2
    df["is_humid"] = (df["rhum"] > 70).astype(int)
    df["is_dry"] = (df["rhum"] < 40).astype(int)
    df["is_calm"] = (df["wspd_ms"] < 3).astype(int)
    df["is_windy"] = (df["wspd_ms"] > 7).astype(int)
    df["has_rain"] = (df["prcp_mm"] > 0.5).astype(int)
    df["is_sunny"] = (df["cloud_cover"] < 30).astype(int)
    df["is_overcast"] = (df["cloud_cover"] > 70).astype(int)
    df["swrad_per_hour"] = df["swrad"] / np.maximum(day_length, 1)

    # HABITAT COMPOSITION
    if verbose: print("  ✓ Habitat composition...")
    for radius in [10, 50, 100, 250]:
        df[f"vegetation_total_{radius}"] = (
            df[f"wc{radius}_tree"] + df[f"wc{radius}_shrub"] + df[f"wc{radius}_grass"]
        )
        df[f"natural_total_{radius}"] = (
            df[f"wc{radius}_tree"] + df[f"wc{radius}_shrub"] +
            df[f"wc{radius}_grass"] + df[f"wc{radius}_herb_wetland"]
        )
        df[f"impervious_{radius}"] = df[f"wc{radius}_builtup"] + df[f"wc{radius}_bare"]

    # HABITAT DIVERSITY
    if verbose: print("  ✓ Habitat diversity...")
    for radius in [10, 50, 100, 250]:
        habitat_cols = [
            f"wc{radius}_tree", f"wc{radius}_shrub", f"wc{radius}_grass",
            f"wc{radius}_cropland", f"wc{radius}_builtup", f"wc{radius}_water",
        ]
        habitat_matrix = df[habitat_cols].values + 1e-6
        habitat_matrix = habitat_matrix / habitat_matrix.sum(axis=1, keepdims=True)
        shannon = -np.sum(habitat_matrix * np.log(habitat_matrix), axis=1)

        df[f"habitat_diversity_{radius}"] = shannon
        df[f"habitat_richness_{radius}"] = (df[habitat_cols] > 0.05).sum(axis=1)
        df[f"habitat_dominance_{radius}"] = df[habitat_cols].max(axis=1)

    # CROSS-SCALE GRADIENTS
    if verbose: print("  ✓ Cross-scale gradients...")
    df["vegetation_gradient_10_50"] = df["vegetation_total_10"] - df["vegetation_total_50"]
    df["vegetation_gradient_50_250"] = df["vegetation_total_50"] - df["vegetation_total_250"]
    df["urban_gradient_10_50"] = df["wc10_builtup"] - df["wc50_builtup"]
    df["urban_gradient_50_250"] = df["wc50_builtup"] - df["wc250_builtup"]
    df["water_gradient_10_100"] = df["wc10_water"] - df["wc100_water"]
    df["tree_gradient_10_100"] = df["wc10_tree"] - df["wc100_tree"]

    # WEATHER × HABITAT INTERACTIONS
    if verbose: print("  ✓ Weather-habitat interactions...")
    df["temp_x_vegetation_50"] = df["temp_c"] * df["vegetation_total_50"]
    df["temp_x_builtup_50"] = df["temp_c"] * df["wc50_builtup"]
    df["temp_x_water_50"] = df["temp_c"] * df["wc50_water"]
    df["humidity_x_vegetation_50"] = df["rhum"] * df["vegetation_total_50"]
    df["humidity_x_wetland_50"] = df["rhum"] * df["wc50_herb_wetland"]
    df["wind_x_vegetation_100"] = df["wspd_ms"] * df["vegetation_total_100"]
    df["wind_x_tree_shelter"] = df["wspd_ms"] * df["wc100_tree"]
    df["solar_x_vegetation"] = df["swrad"] * df["vegetation_total_50"]
    df["solar_x_builtup"] = df["swrad"] * df["wc50_builtup"]
    df["vpd_x_vegetation"] = df["vpd_kpa"] * df["vegetation_total_50"]
    df["vpd_x_water_proximity"] = df["vpd_kpa"] * (1 - df["wc50_water"])

    # URBAN CONTEXT
    if verbose: print("  ✓ Urban context...")
    df["urban_heat_index"] = df["wc50_builtup"] * 2 + df["wc250_builtup"] - 0.5 * df["vegetation_total_50"]
    df["floral_resources"] = (
        df["wc10_grass"] * 0.5 + df["wc50_grass"] * 1.0 +
        df["wc50_shrub"] * 1.5 + df["wc50_cropland"] * 0.8
    )
    df["cavity_nesting_habitat"] = df["wc50_tree"] + df["wc50_builtup"] * 0.2
    df["ground_nesting_habitat"] = df["wc10_grass"] + df["wc10_bare"] * 0.5
    df["habitat_edges_50"] = df["habitat_richness_50"] * df["habitat_diversity_50"]

    # TEMPORAL × HABITAT INTERACTIONS
    if verbose: print("  ✓ Temporal-habitat interactions...")
    df["spring_x_vegetation"] = df["is_spring"] * df["vegetation_total_50"]
    df["summer_x_water"] = df["is_summer"] * df["wc50_water"]
    df["morning_x_flowers"] = (df["hour_local"] < 12).astype(int) * df["floral_resources"]
    df["afternoon_x_flowers"] = (df["hour_local"] >= 12).astype(int) * df["floral_resources"]

    if verbose:
        print(f"✅ Feature engineering complete! ({len(df.columns)} columns)")
    return df


In [None]:
print("Engineering features...")
train_df = engineer_features(train_df, verbose=True)
val_df   = engineer_features(val_df, verbose=False)
test_df  = engineer_features(test_df, verbose=False)

print("Done.")


## 6. Feature selection and matrices
Drops identifiers and potential leakage (coordinates), removes zero-importance features, and builds `X_*` / `y_*`.


In [None]:
# ============================================================================
# STEP 3: PREPARE FEATURE MATRICES
# ============================================================================

EXCLUDE_COLS = [
    'observation_id', 'gbifID', 'taxon_name', 'species', 'speciesKey',
    'observed_at', 'obs_dt_local_naive', 'obs_dt_utc', 'obs_dt_hour_utc', 'day_utc',
    'dataGeneralizations', 'recordedBy', 'individualCount', 'sex', 'lifeStage',
    'level0Name', 'level1Name', 'level2Name', 'mediaURL', 'mediaLicense',
    'mediaURL_is_image', 'source_ref', 'occurrenceLicense',
    'precise_latitude', 'precise_longitude', 'extraction_success', 'extraction_message',
    'coord_unc_m', 'eff_unc_m', 'is_generalized',
    'week_of_year',
    # EXCLUDE COORDINATES (no spatial leakage)
    'latitude', 'longitude', 
    'final_latitude', 'final_longitude', 
    'obs_day', 'obs_hour', 'obs_min', 'time_imputed', 'gps_accuracy_m',
    'obs_year','obs_month', 
     #"is_spring", "is_fall", "is_summer", 
     #"doy", 
     #'hour_sin'
    'lat_bin', 'lon_bin',  # coarse spatial bins are for analysis only, not model features
     #'final_accuracy_m', 
     #'temp_squared'
]

# ✅ EXCLUDE ZERO-IMPORTANCE FEATURES
ZERO_IMPORTANCE_FEATURES = [
    # Snow/ice features (not relevant in Netherlands urban environment)
    'wc10_snowice', 'wc50_snowice', 'wc100_snowice', 'wc250_snowice',
    'snow_mm',
    
    # Shrub features (minimal urban coverage)
    'wc10_shrub', 'wc50_shrub', 'wc100_shrub', 'wc250_shrub',
    
    # Moss/lichen features (not relevant for these species)
    'wc10_moss_lichen', 'wc50_moss_lichen', 'wc100_moss_lichen', 'wc250_moss_lichen',
    
    # Mangrove features (not present in Netherlands)
    'wc10_mangroves', 'wc50_mangroves', 'wc100_mangroves', 'wc250_mangroves',
    
    # Other zero-importance features
    'wc10_cropland',  # Urban environment
    'wc10_bare',      # Minimal bare ground at 10m scale
]


def select_feature_columns(df: pd.DataFrame) -> list[str]:
    cols = []
    for c in df.columns:
        if c in EXCLUDE_COLS or c in ZERO_IMPORTANCE_FEATURES:
            continue
        # keep only numeric columns
        if not pd.api.types.is_numeric_dtype(df[c]):
            continue
        # skip all-NA columns
        if df[c].notna().sum() == 0:
            continue
        cols.append(c)
    return cols

feature_cols = select_feature_columns(train_df)

print(f"Selected {len(feature_cols)} numeric features")

X_train = train_df[feature_cols].fillna(0).values
X_val   = val_df[feature_cols].fillna(0).values
X_test  = test_df[feature_cols].fillna(0).values

species_to_idx = {sp: i for i, sp in enumerate(sorted(TARGET_SPECIES))}
idx_to_species = {i: sp for sp, i in species_to_idx.items()}

y_train = train_df["species"].map(species_to_idx).values
y_val   = val_df["species"].map(species_to_idx).values
y_test  = test_df["species"].map(species_to_idx).values

sample_weights = compute_sample_weight("balanced", y=y_train)

print("Shapes:")
print("  X_train:", X_train.shape)
print("  X_val:  ", X_val.shape)
print("  X_test: ", X_test.shape)


## 7. Hyperparameter comparison
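Quick comparison of three XGBoost configurations that differ in tree depth and learning rate. Early stopping (50 rounds) monitors the validation split; because `eval_metric` lists two metrics, XGBoost uses the last one (`merror`) as the stopping criterion.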
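
The idea behind patience-based early stopping can be sketched in a few lines. This is a simplified illustration and not XGBoost's internal implementation; the function `best_round_with_patience` and the validation scores are hypothetical.

```python
def best_round_with_patience(val_scores, patience):
    """Return (best_round, stopped_at) for a metric where lower is better."""
    best_round, best_score = 0, float("inf")
    for rnd, score in enumerate(val_scores):
        if score < best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd          # no improvement for `patience` rounds
    return best_round, len(val_scores) - 1  # ran through all rounds

# Made-up validation losses: the minimum is reached at round 2.
scores = [1.9, 1.6, 1.5, 1.52, 1.51, 1.53, 1.55]
print(best_round_with_patience(scores, patience=3))  # (2, 5)
```

Training continues for a few rounds past the best one before it stops, which is why the best iteration reported for a model can be smaller than the number of trees actually fitted.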


In [None]:
# ============================================================================
# STEP 4: QUICK HYPERPARAMETER COMPARISON
# ============================================================================

def top_k_accuracy(y_true: np.ndarray, y_proba: np.ndarray, k: int) -> float:
    top_k = np.argsort(y_proba, axis=1)[:, -k:]
    return float(np.mean([y_true[i] in top_k[i] for i in range(len(y_true))]))

base_params = {
    "objective": "multi:softprob",
    "num_class": len(TARGET_SPECIES),
    "subsample": 0.8,
    "colsample_bytree": 0.7,
    "colsample_bylevel": 0.7,
    "min_child_weight": 5,
    "gamma": 0.1,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
    "random_state": RANDOM_STATE,
    "tree_method": "hist",
    "n_jobs": -1,
    "early_stopping_rounds": 50,
    "eval_metric": ["mlogloss", "merror"],
}

configs = [
    {"name": "Config A (More Regularization)", "max_depth": 5, "learning_rate": 0.05, "n_estimators": 500},
    {"name": "Config B (Baseline)",           "max_depth": 6, "learning_rate": 0.05, "n_estimators": 500},
    {"name": "Config C (More Capacity)",      "max_depth": 7, "learning_rate": 0.03, "n_estimators": 500},
]

results = []

for cfg in configs:
    params = dict(base_params)
    params.update({k: v for k, v in cfg.items() if k != "name"})

    model = xgb.XGBClassifier(**params)
    model.fit(
        X_train, y_train,
        sample_weight=sample_weights,
        eval_set=[(X_val, y_val)],
        verbose=False,
    )

    y_pred_val = model.predict(X_val)
    y_proba_val = model.predict_proba(X_val)

    results.append({
        "name": cfg["name"],
        "config": cfg,
        "model": model,
        "val_top1": accuracy_score(y_val, y_pred_val),
        "val_top3": top_k_accuracy(y_val, y_proba_val, 3),
        "val_top5": top_k_accuracy(y_val, y_proba_val, 5),
        "val_logloss": log_loss(y_val, y_proba_val),
        "best_iteration": getattr(model, "best_iteration", None),
    })

df_results = pd.DataFrame([{k: v for k, v in r.items() if k not in ["model","config"]} for r in results])
df_results


## 8. Select best configuration
Selection criterion: maximise validation Top-3 accuracy, then minimise validation log loss.
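
The two-level criterion relies on Python's element-wise tuple comparison. A minimal sketch with made-up numbers (illustration only, not results from this notebook) shows how the tie on Top-3 accuracy is resolved by log loss:

```python
# Hypothetical validation results (made-up numbers for illustration only).
results = [
    {"name": "A", "val_top3": 0.81, "val_logloss": 1.42},
    {"name": "B", "val_top3": 0.83, "val_logloss": 1.40},
    {"name": "C", "val_top3": 0.83, "val_logloss": 1.37},
]

# Tuples compare element by element: Top-3 decides first; the negated log loss
# breaks ties so that the lower loss wins.
best = max(results, key=lambda r: (r["val_top3"], -r["val_logloss"]))
print(best["name"])  # C
```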


In [None]:
# Pick best config: highest Top-3, then lowest logloss
best = max(results, key=lambda r: (r["val_top3"], -r["val_logloss"]))
model = best["model"]
print("Best:", best["name"])
print(best)


## 9. Final evaluation
Reports Top-k accuracy and log loss across train/val/test.
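
Both metrics can be verified by hand on a tiny example. The sketch below uses toy probabilities (illustrative values, not model output) and a vectorised variant of the notebook's `top_k_accuracy`; `mean_log_loss` is a hypothetical helper that mirrors the definition of multi-class log loss.

```python
import numpy as np

def top_k_accuracy(y_true, y_proba, k):
    """Fraction of rows whose true class is among the k highest-probability classes."""
    top_k = np.argsort(y_proba, axis=1)[:, -k:]
    return float(np.mean((top_k == np.asarray(y_true)[:, None]).any(axis=1)))

def mean_log_loss(y_true, y_proba, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    p_true = np.clip(y_proba[np.arange(len(y_true)), y_true], eps, 1.0)
    return float(-np.mean(np.log(p_true)))

# Toy probabilities for 3 observations and 4 classes.
y_true = np.array([0, 2, 3])
y_proba = np.array([
    [0.70, 0.10, 0.10, 0.10],   # true class ranked 1st
    [0.50, 0.05, 0.30, 0.15],   # true class ranked 2nd
    [0.40, 0.30, 0.20, 0.10],   # true class ranked 4th
])

top1 = top_k_accuracy(y_true, y_proba, 1)   # 1 of 3 rows correct
top2 = top_k_accuracy(y_true, y_proba, 2)   # 2 of 3 rows correct
ll = mean_log_loss(y_true, y_proba)
print(top1, top2, round(ll, 4))
```

Top-k accuracy only looks at the ranking, whereas log loss also penalises the third row for assigning just 0.10 to the true class.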


In [None]:
# ============================================================================
# STEP 5: FINAL EVALUATION (Train / Val / Test)
# ============================================================================

y_pred_train = model.predict(X_train)
y_pred_val   = model.predict(X_val)
y_pred_test  = model.predict(X_test)

y_proba_train = model.predict_proba(X_train)
y_proba_val   = model.predict_proba(X_val)
y_proba_test  = model.predict_proba(X_test)

print(f"{'Split':<10} {'Top-1':>8} {'Top-3':>8} {'Top-5':>8} {'Log Loss':>10}")
print("-"*60)
# One row per split; identical format specs keep the columns aligned with the header
for name, y, y_pred, y_proba in [
    ("Train", y_train, y_pred_train, y_proba_train),
    ("Val",   y_val,   y_pred_val,   y_proba_val),
    ("Test",  y_test,  y_pred_test,  y_proba_test),
]:
    print(f"{name:<10} {accuracy_score(y, y_pred):>8.1%} {top_k_accuracy(y, y_proba, 3):>8.1%} "
          f"{top_k_accuracy(y, y_proba, 5):>8.1%} {log_loss(y, y_proba):>10.4f}")


## 10. Detailed test report
Per-class precision/recall/F1 on the 2025 temporal holdout.


In [None]:
print("="*80)
print("TEST SET CLASSIFICATION REPORT (2025 Temporal Holdout)")
print("="*80)
print(classification_report(
    y_test, y_pred_test,
    target_names=[idx_to_species[i] for i in range(len(TARGET_SPECIES))],
    digits=3,
    zero_division=0,
))


## 11. Calibration quality
Multi-class (normalised) Brier score and per-species Brier on test.
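
The normalised multi-class Brier score is the squared difference between the one-hot label and the predicted probability vector, averaged over both samples and classes. A minimal sketch on toy predictions (the helper name `multiclass_brier` and the inputs are hypothetical) shows its range:

```python
import numpy as np

def multiclass_brier(y_true, y_proba):
    """Squared error between one-hot labels and probabilities, averaged over samples and classes."""
    onehot = np.eye(y_proba.shape[1])[y_true]
    return float(np.mean((y_proba - onehot) ** 2))

y_true = np.array([0, 1])
perfect = np.array([[1.0, 0.0], [0.0, 1.0]])
uniform = np.array([[0.5, 0.5], [0.5, 0.5]])
wrong   = np.array([[0.0, 1.0], [1.0, 0.0]])

print(multiclass_brier(y_true, perfect))  # 0.0
print(multiclass_brier(y_true, uniform))  # 0.25
print(multiclass_brier(y_true, wrong))    # 1.0
```

Under this normalisation, a uniform guess over 8 classes scores 7/64 ≈ 0.109, which is a useful reference point when reading the values reported below.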


In [None]:
# ============================================================================
# STEP 5B: BRIER SCORE (multi-class)
# ============================================================================

def multiclass_brier_score(y_true: np.ndarray, y_proba: np.ndarray) -> float:
    """Mean squared error between one-hot labels and probabilities,
    averaged over samples and classes (vectorised)."""
    n_samples, n_classes = y_proba.shape
    onehot = np.eye(n_classes)[np.asarray(y_true, dtype=int)]
    return float(np.sum((y_proba - onehot) ** 2) / (n_samples * n_classes))

brier_train = multiclass_brier_score(y_train, y_proba_train)
brier_val   = multiclass_brier_score(y_val, y_proba_val)
brier_test  = multiclass_brier_score(y_test, y_proba_test)

print("Brier (multi-class):")
print("  Train:", round(brier_train, 4))
print("  Val:  ", round(brier_val, 4))
print("  Test: ", round(brier_test, 4))

# Per-species Brier on test
species_brier_scores = []
for k, sp in enumerate(sorted(TARGET_SPECIES)):
    y_true_k = (y_test == k).astype(int)
    y_pred_k = y_proba_test[:, k]
    b = float(np.mean((y_pred_k - y_true_k) ** 2))
    species_brier_scores.append({"species": sp, "brier": b})

pd.DataFrame(species_brier_scores).sort_values("brier")


## 12. Save artefacts
Writes the model, feature list, species mapping, test probabilities, and summary metrics.


In [None]:
# ============================================================================
# STEP 6: SAVE MODEL + METADATA + PREDICTIONS
# ============================================================================

# Model
model_path = MODEL_DIR / "xgboost_hsi_model_FINAL_no_vespula.json"
model.save_model(str(model_path))
print("Saved model:", model_path)

# Feature names + species mapping (needed for inference/fusion)
(pd.DataFrame({"feature": feature_cols})
   .to_csv(MODEL_DIR / "feature_names_FINAL_no_vespula.csv", index=False))

(pd.DataFrame([{"species": sp, "idx": idx} for sp, idx in species_to_idx.items()])
   .to_csv(MODEL_DIR / "species_mapping_FINAL_no_vespula.csv", index=False))

print("Saved feature names + mapping in:", MODEL_DIR)

# ---------------------------------------------------------------------------
# Test predictions parquet (for fusion)
# ---------------------------------------------------------------------------
if "observation_id" in test_df.columns:
    test_preds_df = test_df[["observation_id", "species"]].copy()
else:
    test_preds_df = test_df[["species"]].copy()

for i, sp in enumerate(sorted(TARGET_SPECIES)):
    col = f"prob_{sp.replace(' ', '_')}"
    test_preds_df[col] = y_proba_test[:, i]

test_preds_df["predicted_species"] = [idx_to_species[i] for i in y_pred_test]

test_pred_path = OUT_DIR / "preds" / "test_predictions_2025_FINAL_no_vespula.parquet"
test_preds_df.to_parquet(test_pred_path, index=False)
print("Saved test predictions:", test_pred_path)

# ---------------------------------------------------------------------------
# Metrics
# ---------------------------------------------------------------------------
summary_metrics = pd.DataFrame({
    "metric": ["top_1_accuracy", "top_3_accuracy", "top_5_accuracy", "log_loss", "brier_score"],
    "test_value": [
        accuracy_score(y_test, y_pred_test),
        top_k_accuracy(y_test, y_proba_test, 3),
        top_k_accuracy(y_test, y_proba_test, 5),
        log_loss(y_test, y_proba_test),
        brier_test,
    ],
})
summary_metrics.to_csv(OUT_DIR / "metrics" / "summary_metrics_FINAL.csv", index=False)

pd.DataFrame(species_brier_scores).to_csv(OUT_DIR / "metrics" / "brier_scores_per_species.csv", index=False)

print("Saved metrics in:", OUT_DIR / "metrics")


In [None]:
# %% 6) SAVE PREDICTIONS WITH PROBABILITIES + CONTEXT
# Species mapping (order matters: entry i must correspond to column i of the probability matrix)
species_map = pd.read_csv(SPECIES_MAP_PATH).sort_values("idx")
species_names = species_map["species"].tolist()


def save_predictions_with_context(df, y_true, y_proba, species_names, split_name, out_dir=None):
    """Save predictions with context for analysis (robust column selection)."""
    print(f"\n💾 Saving {split_name} predictions with context...")

    if out_dir is None:
        out_dir = OUT_DIR / "preds"
    out_dir.mkdir(parents=True, exist_ok=True)

    # Pick a sensible set of context columns if available
    preferred_cols = [
        "observation_id",
        "species",
        "final_latitude", "final_longitude",
        "latitude", "longitude",
        "observed_at", "obs_dt_utc",
        "temp_c", "hour_local", "doy",
        "rhum", "wspd_ms", "prcp_mm", "cloud_cover", "swrad", "vpd_kpa",
        "lat_bin", "lon_bin",
    ]
    context_cols = [c for c in preferred_cols if c in df.columns]
    pred_df = df[context_cols].copy()

    # Truth / prediction
    pred_df["true_species"] = df["species"].astype(str).values if "species" in df.columns else y_true
    pred_df["true_species_idx"] = y_true
    pred_df["predicted_species_idx"] = y_proba.argmax(axis=1)
    pred_df["predicted_species"] = [species_names[i] for i in pred_df["predicted_species_idx"]]
    pred_df["max_probability"] = y_proba.max(axis=1)
    pred_df["correct"] = (y_true == y_proba.argmax(axis=1)).astype(int)

    # Top-3 predictions
    top3_idx = np.argsort(y_proba, axis=1)[:, -3:][:, ::-1]
    for k in range(3):
        pred_df[f"top{k+1}_species"] = [species_names[i] for i in top3_idx[:, k]]
        pred_df[f"top{k+1}_probability"] = y_proba[np.arange(len(y_proba)), top3_idx[:, k]]

    # Per-species probabilities
    for i, sp in enumerate(species_names):
        safe_name = sp.replace(" ", "_").replace(".", "")
        pred_df[f"prob_{safe_name}"] = y_proba[:, i]

    out_path = out_dir / f"{split_name}_predictions_with_hsi.parquet"
    pred_df.to_parquet(out_path, index=False)
    print(f"✅ Saved {len(pred_df):,} predictions to: {out_path}")

    # sample CSV (handy for manual inspection)
    sample = pred_df.sample(min(500, len(pred_df)), random_state=42)
    sample_path = out_dir / f"{split_name}_predictions_sample.csv"
    sample.to_csv(sample_path, index=False)
    print(f"✅ Saved sample CSV to: {sample_path}")

    return pred_df


In [None]:
_ = save_predictions_with_context(test_df, y_test, y_proba_test, species_names, "test")
_ = save_predictions_with_context(val_df,  y_val,  y_proba_val,  species_names, "val")
_ = save_predictions_with_context(train_df, y_train, y_proba_train, species_names, "train")


## Next steps
- Probability calibration (isotonic / sigmoid with spatial CV) and model diagnostics (feature importance, confusion matrix, reliability diagrams) belong in a separate analysis notebook, so that this training notebook stays focused and lightweight.
