# NeurIPS 2024 Ariel Data Challenge — Baseline Models

**Goal**: Establish quantitative lower bounds on predictive performance using simple, interpretable baselines before training neural or tree-based models.

**Baselines implemented**:
1. **Constant predictor** — predict the training-set median for every planet and wavelength.
2. **Per-wavelength Ridge regression** — fit 283 independent Ridge regressors on the 9 auxiliary features.
3. **Sigma sensitivity analysis** — demonstrate how miscalibrated uncertainty degrades GLL even when the mean prediction is held fixed.

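**Metric**: Gaussian Log-Likelihood (GLL). Higher is better. The raw value has no fixed upper bound: with exact means it equals −½·E[log(2πσ²)], which grows as σ shrinks, so scores are best read relative to a baseline.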

$$\text{GLL}(y, \mu, \sigma) = -\frac{1}{2} \mathbb{E}\left[\log(2\pi\sigma^2) + \left(\frac{y - \mu}{\sigma}\right)^2\right]$$

> **Note**: This notebook is Kaggle-ready. Run it as a Kaggle kernel with the `ariel-data-challenge-2024` dataset attached.

## 1. Setup

In [None]:
import sys
from pathlib import Path

# ── Kaggle: clone repo and add to sys.path ─────────────────────────────────
# Uncomment the block below and replace YOUR_GITHUB_USERNAME once the repo
# has been pushed to GitHub:
#
# import subprocess
# subprocess.run(
#     ["git", "clone",
#      "https://github.com/YOUR_GITHUB_USERNAME/ariel-exoplanet-ml.git",
#      "/kaggle/working/ariel-exoplanet-ml"],
#     check=True
# )

sys.path.insert(0, "/kaggle/working/ariel-exoplanet-ml")

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.ticker  # explicit import: ScalarFormatter is used in Section 5
import matplotlib.pyplot as plt
import scipy.stats
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline

# ── Data root ──────────────────────────────────────────────────────────────
DATA_ROOT = Path("/kaggle/input/ariel-data-challenge-2024")

# ── Plot style ─────────────────────────────────────────────────────────────
plt.rcParams.update({
    "figure.dpi": 110,
    "axes.spines.top": False,
    "axes.spines.right": False,
})

# ── Constants ──────────────────────────────────────────────────────────────
N_WAVELENGTHS = 283   # competition output dimension
RANDOM_STATE  = 42

print(f"Python      : {sys.version.split()[0]}")
print(f"NumPy       : {np.__version__}")
print(f"Pandas      : {pd.__version__}")
print(f"Matplotlib  : {matplotlib.__version__}")
print(f"DATA_ROOT   : {DATA_ROOT}")
print(f"Exists      : {DATA_ROOT.exists()}")
print("[Done] Setup complete.")

## 2. Load Labels and Auxiliary Features

- `AuxillaryTable.csv` — 9 stellar/planetary parameters per planet (all planets, labelled + unlabelled).
- `QuartilesTable.csv` — quartile labels `{i}_q1`, `{i}_q2`, `{i}_q3` for i in 0..282 (labelled planets only, ≈24%).

We merge the two tables on planet ID and retain only the labelled planets for training.
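
The effect of the inner merge can be illustrated on toy tables. This is a minimal sketch, assuming a shared ID column named `planet_id` and made-up feature and label values; the real tables may use a different ID column name.

```python
import pandas as pd

# Hypothetical toy tables: 4 planets with aux features, 2 of them labelled.
df_aux_toy = pd.DataFrame({
    "planet_id": ["0", "1", "2", "3"],
    "star_temp_k": [5200.0, 6100.0, 4800.0, 5900.0],
})
df_q_toy = pd.DataFrame({
    "planet_id": ["1", "3"],
    "0_q2": [0.011, 0.009],
})

# An inner join keeps only IDs present in BOTH tables, i.e. the labelled planets.
df_toy = pd.merge(df_q_toy, df_aux_toy, on="planet_id", how="inner")
print(df_toy)

# Unlabelled planets are the aux rows whose ID has no quartile entry.
unlabelled = df_aux_toy.loc[~df_aux_toy["planet_id"].isin(df_q_toy["planet_id"])]
print(f"labelled={len(df_toy)}, unlabelled={len(unlabelled)}")
```

Casting both ID columns to the same dtype before merging, as the cell below does, avoids silent mismatches between integer and string IDs.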

In [None]:
# ── Attempt to load real data; fall back to synthetic if not found ──────────
_USING_SYNTHETIC = False

aux_path = DATA_ROOT / "AuxillaryTable.csv"
q_path   = DATA_ROOT / "QuartilesTable.csv"

if aux_path.exists() and q_path.exists():
    print("Loading real CSV files...")
    df_aux = pd.read_csv(aux_path)
    df_q   = pd.read_csv(q_path)
    print(f"  AuxillaryTable : {df_aux.shape}")
    print(f"  QuartilesTable : {df_q.shape}")
else:
    print("WARNING: CSV files not found. Generating SYNTHETIC fallback data.")
    _USING_SYNTHETIC = True

    rng_synth = np.random.default_rng(RANDOM_STATE)
    N_PLANETS_ALL      = 400
    N_PLANETS_LABELLED = 100
    N_AUX_FEATURES     = 9

    # Synthetic planet IDs
    all_planet_ids = np.arange(N_PLANETS_ALL)

    # Auxiliary features: stellar radius, temperature, mass, log-g,
    #                     planet radius, mass, period, semi-major axis, eccentricity
    aux_feature_names = [
        "star_radius_rsun", "star_temp_k", "star_mass_msun", "star_logg",
        "planet_radius_rearth", "planet_mass_mearth", "orbital_period_days",
        "semi_major_axis_au", "eccentricity"
    ]
    aux_data = rng_synth.normal(0, 1, size=(N_PLANETS_ALL, N_AUX_FEATURES))

    df_aux = pd.DataFrame(aux_data, columns=aux_feature_names)
    df_aux.insert(0, "planet_id", all_planet_ids)

    # Synthetic quartiles for first N_PLANETS_LABELLED planets
    labelled_ids = all_planet_ids[:N_PLANETS_LABELLED]
    q2_base  = rng_synth.normal(0.01, 0.003, size=(N_PLANETS_LABELLED, N_WAVELENGTHS))
    iqr_vals = np.abs(rng_synth.normal(0.002, 0.0005, size=(N_PLANETS_LABELLED, N_WAVELENGTHS)))

    q_cols_dict = {"planet_id": labelled_ids}
    for i in range(N_WAVELENGTHS):
        q_cols_dict[f"{i}_q1"] = q2_base[:, i] - iqr_vals[:, i]
        q_cols_dict[f"{i}_q2"] = q2_base[:, i]
        q_cols_dict[f"{i}_q3"] = q2_base[:, i] + iqr_vals[:, i]

    df_q = pd.DataFrame(q_cols_dict)
    print(f"  Synthetic AuxillaryTable : {df_aux.shape}")
    print(f"  Synthetic QuartilesTable : {df_q.shape}")

# ── Identify planet ID column in each table ────────────────────────────────
# The planet ID is the first column in both tables.
aux_id_col = df_aux.columns[0]
q_id_col   = df_q.columns[0]

print(f"\nAux ID column    : '{aux_id_col}'")
print(f"Quartile ID col  : '{q_id_col}'")

# Normalise planet ID types to strings for safe merging
df_aux[aux_id_col] = df_aux[aux_id_col].astype(str)
df_q[q_id_col]     = df_q[q_id_col].astype(str)

# ── Merge: keep only labelled planets ─────────────────────────────────────
df_merged = pd.merge(
    df_q,
    df_aux,
    left_on=q_id_col,
    right_on=aux_id_col,
    how="inner",
    suffixes=("_q", "_aux"),
)

print(f"\nMerged labelled set shape : {df_merged.shape}")
print(f"  Labelled planets  : {len(df_merged)}")
print(f"  Total columns     : {df_merged.shape[1]}")

# ── Extract q1 / q2 / q3 column names ─────────────────────────────────────
q2_cols = [f"{i}_q2" for i in range(N_WAVELENGTHS)]
q1_cols = [f"{i}_q1" for i in range(N_WAVELENGTHS)]
q3_cols = [f"{i}_q3" for i in range(N_WAVELENGTHS)]

# Verify the columns exist; if the naming scheme differs, detect automatically.
# All three quartiles are resolved, because Y_q1 and Y_q3 are extracted below too.
def _resolve_quartile_cols(expected, tag):
    """Return `expected` if all columns are present, else auto-detect by `tag`."""
    if all(c in df_merged.columns for c in expected):
        return expected
    alt = [c for c in df_merged.columns if f"_{tag}" in c or f"{tag}_" in c]
    if len(alt) != N_WAVELENGTHS:
        raise ValueError(
            f"Could not find {N_WAVELENGTHS} {tag} columns. Found {len(alt)}: {alt[:5]}..."
        )
    alt = sorted(alt, key=lambda c: int(c.split(f"_{tag}")[0].split(f"{tag}_")[-1]))
    print(f"Auto-detected {tag} columns via alternate naming: first={alt[0]}, last={alt[-1]}")
    return alt

q1_cols = _resolve_quartile_cols(q1_cols, "q1")
q2_cols = _resolve_quartile_cols(q2_cols, "q2")
q3_cols = _resolve_quartile_cols(q3_cols, "q3")

# ── Extract numpy arrays ───────────────────────────────────────────────────
Y_q1 = df_merged[q1_cols].values.astype(np.float64)   # (n_planets, 283)
Y_q2 = df_merged[q2_cols].values.astype(np.float64)   # (n_planets, 283)
Y_q3 = df_merged[q3_cols].values.astype(np.float64)   # (n_planets, 283)

# Auxiliary feature columns (numeric, excluding ID columns)
id_cols_set = {aux_id_col, q_id_col,
               aux_id_col + "_aux", q_id_col + "_q"}
aux_feat_cols = [
    c for c in df_merged.columns
    if c not in id_cols_set
    and c not in q1_cols and c not in q2_cols and c not in q3_cols
    and pd.api.types.is_numeric_dtype(df_merged[c])
]
X_aux = df_merged[aux_feat_cols].values.astype(np.float64)  # (n_planets, 9)

print(f"\nY_q2  shape : {Y_q2.shape}  (labelled planets x wavelengths)")
print(f"X_aux shape : {X_aux.shape}  (labelled planets x aux features)")
print(f"Aux features ({len(aux_feat_cols)}): {aux_feat_cols}")
print("[Done] Labels and auxiliary features loaded and merged.")

## Gaussian Log-Likelihood (GLL) — Implementation

The competition metric is:

$$\text{GLL}(y, \mu, \sigma) = -\frac{1}{2} \operatorname{mean}\!\left[\log(2\pi\sigma^2) + \left(\frac{y-\mu}{\sigma}\right)^2\right]$$

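- **Higher is better**. With exact means (y = μ) the score reduces to −½·mean[log(2πσ²)], which is positive for σ < 1/√(2π) ≈ 0.40 and increases without bound as σ → 0, so 0 is not a ceiling.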
- Both **mean accuracy** (term 2) and **uncertainty calibration** (term 1) matter.
- Overconfident predictions (σ too small) are penalised by the squared residual blowing up.
- Underconfident predictions (σ too large) are penalised by the log(σ²) term.
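
The trade-off between the two terms can be seen on a single observation. The following is a small sketch using only the standard library; the helper name `gll_single` and the numbers are illustrative assumptions, not part of the competition code.

```python
import math

def gll_single(y, mu, sigma):
    """Gaussian log-likelihood of one observation (the quantity inside the mean)."""
    return -0.5 * (math.log(2.0 * math.pi * sigma**2) + ((y - mu) / sigma) ** 2)

y, mu = 1.0, 1.2   # illustrative values: the prediction misses by 0.2

# Overconfident, calibrated and underconfident choices of sigma.
for sigma in (0.02, 0.2, 2.0):
    print(f"sigma={sigma:<5} GLL={gll_single(y, mu, sigma):8.3f}")
```

With a residual of 0.2, the score is highest when σ is close to 0.2; a σ ten times smaller is punished far more severely than a σ ten times larger.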

In [None]:
def gaussian_log_likelihood(
    y: np.ndarray,
    mu: np.ndarray,
    sigma: np.ndarray,
) -> float:
    """
    Compute the competition Gaussian Log-Likelihood score.

    GLL = -0.5 * mean( log(2*pi*sigma^2) + ((y - mu) / sigma)^2 )

    Higher is better.  With y == mu the score equals
    -0.5 * mean(log(2*pi*sigma^2)), which is unbounded above as sigma -> 0.

    Parameters
    ----------
    y     : (...,) ground truth values
    mu    : (...,) predicted means (same shape as y)
    sigma : (...,) predicted stds  (must be positive; clipped at 1e-9)

    Returns
    -------
    float  — GLL score
    """
    sigma = np.clip(sigma, 1e-9, None)
    term1 = np.log(2.0 * np.pi * sigma ** 2)
    term2 = ((y - mu) / sigma) ** 2
    return float(-0.5 * np.mean(term1 + term2))


# ── Quick sanity checks ────────────────────────────────────────────────────
y_test    = np.array([1.0, 2.0, 3.0])
mu_test   = np.array([1.0, 2.0, 3.0])
sig_test  = np.array([0.1, 0.1, 0.1])

gll_perfect_mean = gaussian_log_likelihood(y_test, mu_test, sig_test)
gll_scipy = float(np.mean(scipy.stats.norm.logpdf(y_test, loc=mu_test, scale=sig_test)))

print(f"GLL (perfect mean, sigma=0.1)  : {gll_perfect_mean:.6f}")
print(f"scipy reference                : {gll_scipy:.6f}")
print(f"Difference vs scipy            : {abs(gll_perfect_mean - gll_scipy):.2e}  (should be <1e-10)")

# Verify: with an exact mean, a tiny sigma gives a large POSITIVE GLL
# (-0.5 * log(2*pi*1e-12) ≈ 12.8966), so 0 is not an upper bound
sigma_tiny = np.full_like(y_test, 1e-6)
gll_tiny = gaussian_log_likelihood(y_test, mu_test, sigma_tiny)
print(f"GLL (perfect mean, sigma=1e-6) : {gll_tiny:.4f}  (≈ 12.8966 expected)")

# Verify: large sigma → negative GLL (-0.5 * log(2*pi*1e4) ≈ -5.5241)
sigma_large = np.full_like(y_test, 100.0)
gll_large = gaussian_log_likelihood(y_test, mu_test, sigma_large)
print(f"GLL (perfect mean, sigma=100)  : {gll_large:.4f}  (≈ -5.5241 expected)")

print("[Done] GLL function implemented and verified.")

## 3. Baseline 1 — Constant Predictor (Training-Set Median)

The simplest possible baseline: ignore auxiliary features entirely and predict the training-set
median for every planet.

- **μ[λ]** = median of q2 values over all training planets at wavelength λ
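- **σ[λ]** = 0.5 × median(q3 − q1) at wavelength λ  (half the median IQR). For a Gaussian, σ ≈ IQR / 1.349, so this is roughly two-thirds of the Gaussian-equivalent σ. It also reflects the per-planet label uncertainty rather than the planet-to-planet spread of q2, which is why Section 5 scans a range of multipliers.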

This establishes the floor that *any* model must beat: if a model cannot outperform a global
constant, it has learned nothing from the data.
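
The two definitions above can be sketched on a tiny example. The arrays below are hypothetical toy labels (4 planets, 3 wavelength channels), chosen only to show the shapes involved and the robustness of the median to one outlying planet.

```python
import numpy as np

# Hypothetical toy labels: 4 training planets x 3 wavelength channels.
q2 = np.array([[0.010, 0.011, 0.012],
               [0.012, 0.013, 0.014],
               [0.011, 0.012, 0.013],
               [0.020, 0.021, 0.022]])   # last planet is an outlier
q1 = q2 - 0.001
q3 = q2 + 0.001

mu = np.median(q2, axis=0)                 # one value per wavelength
sigma = 0.5 * np.median(q3 - q1, axis=0)   # half the median IQR per wavelength

# Broadcasting applies the same (3,) vector to every validation planet.
n_val = 2
mu_val = np.broadcast_to(mu, (n_val, mu.size))

print("mu    :", mu)
print("sigma :", sigma)
print("mu_val shape:", mu_val.shape)
```

The prediction is constant across planets, not across wavelengths: each channel gets its own median and its own σ.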

In [None]:
# ── Train / held-out 80/20 split ───────────────────────────────────────────
n_planets = Y_q2.shape[0]
(
    X_train, X_val,
    Y_q1_train, Y_q1_val,
    Y_q2_train, Y_q2_val,
    Y_q3_train, Y_q3_val,
) = train_test_split(
    X_aux,
    Y_q1, Y_q2, Y_q3,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

print(f"Train split : {len(X_train)} planets")
print(f"Val   split : {len(X_val)} planets")

# ── Compute constant predictions from training set ─────────────────────────
# mu[λ]    = median of q2 over training planets
mu_const    = np.median(Y_q2_train, axis=0)     # (283,)

# sigma[λ] = 0.5 * median(q3 - q1) over training planets  (robust IQR-based std)
iqr_train   = Y_q3_train - Y_q1_train           # (n_train, 283)
sigma_const = 0.5 * np.median(iqr_train, axis=0)  # (283,)
sigma_const = np.clip(sigma_const, 1e-9, None)   # guard against zero

print(f"\nConstant predictor:")
print(f"  mu    : mean={mu_const.mean():.6f},  std={mu_const.std():.6f}")
print(f"  sigma : mean={sigma_const.mean():.6f}, std={sigma_const.std():.6f}")

# ── Evaluate on held-out validation set ───────────────────────────────────
# Broadcast constant predictions to all validation planets
n_val = len(Y_q2_val)
mu_val_const    = np.tile(mu_const,    (n_val, 1))   # (n_val, 283)
sigma_val_const = np.tile(sigma_const, (n_val, 1))   # (n_val, 283)

gll_const = gaussian_log_likelihood(Y_q2_val, mu_val_const, sigma_val_const)

print(f"\nBaseline (constant median) GLL = {gll_const:.4f}")
print("[Done] Baseline 1 (constant predictor) evaluated.")

## 4. Baseline 2 — Per-Wavelength Ridge Regression on Aux Features

Fit 283 independent Ridge regressors (one per wavelength channel), each using the 9 standardised
auxiliary features to predict q2 at that wavelength.

- **Features** (`X`): 9 aux parameters, standardised via `StandardScaler`.
- **Target** (`y`): q2 at each wavelength.
- **Sigma**: estimated from the standard deviation of in-fold training residuals.
- **Evaluation**: 5-fold cross-validation → mean GLL across all folds and wavelengths.

This tests whether the auxiliary features carry *any* predictive signal about the spectrum shape.
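
Because every wavelength shares the same feature matrix and the same regularisation strength, fitting one Ridge model per wavelength is equivalent to solving a single linear system with a 2-D target. The sketch below checks this with the closed-form Ridge solution in NumPy on random toy data. It is illustrative only and is not scikit-learn's implementation; features and targets are centred so that the intercept is left unpenalised.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, alpha = 50, 9, 4, 1.0           # toy sizes; k stands in for the 283 channels
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, k))

Xc = X - X.mean(axis=0)                   # centring handles the unpenalised intercept
Yc = Y - Y.mean(axis=0)
A = Xc.T @ Xc + alpha * np.eye(p)         # (p, p), shared by every target column

# All targets at once -> coefficient matrix of shape (p, k).
W_joint = np.linalg.solve(A, Xc.T @ Yc)

# One target column at a time, stacked back into a (p, k) matrix.
W_indep = np.column_stack(
    [np.linalg.solve(A, Xc.T @ Yc[:, j]) for j in range(k)]
)

print("max abs difference:", np.abs(W_joint - W_indep).max())
```

The equivalence would no longer hold if each wavelength were given its own regularisation strength, in which case separate fits are required.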

In [None]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold

ALPHA       = 1.0    # Ridge regularisation strength
N_FOLDS     = 5

kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=RANDOM_STATE)

# Collect per-fold GLL scores
fold_gll_scores = []

for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X_aux)):
    X_tr, X_vl  = X_aux[train_idx], X_aux[val_idx]
    Y_tr, Y_vl  = Y_q2[train_idx],  Y_q2[val_idx]

    # Standardise features
    scaler = StandardScaler()
    X_tr_s = scaler.fit_transform(X_tr)
    X_vl_s = scaler.transform(X_vl)

    # Fit 283 Ridge regressors in one shot using multi-output Ridge
    # (Ridge natively supports multi-output when y is 2-D)
    ridge = Ridge(alpha=ALPHA, fit_intercept=True)
    ridge.fit(X_tr_s, Y_tr)

    # Predict on validation fold
    mu_pred  = ridge.predict(X_vl_s)   # (n_val, 283)

    # Estimate sigma as std of in-fold TRAINING residuals (per wavelength)
    resid_tr = Y_tr - ridge.predict(X_tr_s)   # (n_train, 283)
    sigma_wl = resid_tr.std(axis=0)            # (283,)  — one sigma per wavelength
    sigma_wl = np.clip(sigma_wl, 1e-9, None)

    # Broadcast sigma to validation set
    n_vl = len(X_vl)
    sigma_pred = np.tile(sigma_wl, (n_vl, 1))   # (n_val, 283)

    fold_gll = gaussian_log_likelihood(Y_vl, mu_pred, sigma_pred)
    fold_gll_scores.append(fold_gll)
    print(f"  Fold {fold_idx + 1}/{N_FOLDS}:  val size={n_vl:3d}  GLL={fold_gll:.4f}")

gll_ridge_cv = float(np.mean(fold_gll_scores))
gll_ridge_std = float(np.std(fold_gll_scores))

print(f"\nBaseline (Ridge regression) GLL (5-fold CV) = {gll_ridge_cv:.4f}  "
      f"± {gll_ridge_std:.4f} (std across folds)")
print("[Done] Baseline 2 (Ridge regression) evaluated.")

## 5. Baseline 3 — Constant Sigma Sensitivity Analysis

This experiment holds the **mean prediction fixed** at the training median (Baseline 1) and
varies only σ across a range of IQR multiples.

Key insight: **GLL is maximised at a well-calibrated σ**. Too small (overconfident) or too large
(underconfident) both hurt.  For a fixed mean, the optimal σ is the root-mean-square residual,
`sqrt(mean((y - mu)**2))`, which equals `std(y - mu)` only when the residuals have zero mean.
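
The optimum can be made precise. For a fixed μ and a single shared σ, setting the derivative of GLL with respect to σ to zero gives σ² = mean((y − μ)²). The sketch below checks this numerically on synthetic data with a deliberately biased mean; the data, the grid and the helper `gll` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.3, scale=2.0, size=5000)   # synthetic targets
mu = 0.0                                         # deliberately biased fixed mean

def gll(y, mu, sigma):
    """Mean Gaussian log-likelihood for a scalar sigma."""
    return float(-0.5 * np.mean(np.log(2 * np.pi * sigma**2) + ((y - mu) / sigma) ** 2))

sigmas = np.linspace(0.5, 5.0, 451)              # grid with step 0.01
scores = np.array([gll(y, mu, s) for s in sigmas])
sigma_best = sigmas[scores.argmax()]

sigma_rms = np.sqrt(np.mean((y - mu) ** 2))      # analytic optimum
print(f"grid optimum={sigma_best:.2f}, RMS residual={sigma_rms:.2f}, "
      f"std of residual={np.std(y - mu):.2f}")
```

When the mean prediction is biased, the RMS residual exceeds the residual standard deviation, so the best σ has to absorb the bias as well as the noise.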

In [None]:
# ── IQR multipliers to test ────────────────────────────────────────────────
sigma_multipliers = [0.1, 0.25, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0, 10.0]

# sigma_const is the 0.5× IQR baseline computed in Section 3
# We use the same held-out val set (Y_q2_val, mu_val_const) from Baseline 1
iqr_baseline_median = np.median(iqr_train, axis=0)   # (283,)  raw IQR median

gll_by_mult = []
for mult in sigma_multipliers:
    sigma_test = np.clip(0.5 * mult * iqr_baseline_median, 1e-9, None)  # (283,)
    sigma_val_test = np.tile(sigma_test, (n_val, 1))                     # (n_val, 283)
    gll_score = gaussian_log_likelihood(Y_q2_val, mu_val_const, sigma_val_test)
    gll_by_mult.append(gll_score)

gll_by_mult = np.array(gll_by_mult)
best_idx  = int(np.argmax(gll_by_mult))
best_mult = sigma_multipliers[best_idx]
best_gll  = gll_by_mult[best_idx]

print(f"{'Multiplier':>12}  {'Sigma (mean)':>14}  {'GLL':>10}")
print("-" * 42)
for mult, gll_s in zip(sigma_multipliers, gll_by_mult):
    sigma_test = 0.5 * mult * iqr_baseline_median
    mark = " <<< BEST" if mult == best_mult else ""
    print(f"{mult:>12.2f}  {sigma_test.mean():>14.6f}  {gll_s:>10.4f}{mark}")

# ── Plot GLL vs sigma multiplier ───────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(sigma_multipliers, gll_by_mult,
        marker="o", markersize=7, lw=2, color="steelblue", label="GLL (constant mean)")

# Highlight optimal point
ax.plot(best_mult, best_gll,
        marker="*", markersize=16, color="crimson", zorder=5,
        label=f"Optimal: mult={best_mult:.2f}, GLL={best_gll:.4f}")

ax.annotate(
    f"Optimal σ multiplier = {best_mult:.2f}\nGLL = {best_gll:.4f}",
    xy=(best_mult, best_gll),
    xytext=(best_mult * 1.3 if best_mult < 5 else best_mult * 0.4,
            best_gll - (gll_by_mult.max() - gll_by_mult.min()) * 0.25),
    arrowprops=dict(arrowstyle="->", color="gray", lw=1.5),
    fontsize=10, color="crimson",
    bbox=dict(boxstyle="round,pad=0.4", facecolor="lightyellow",
              edgecolor="gray", alpha=0.9),
)

# Shade the overconfident / underconfident halves
ax.axvspan(sigma_multipliers[0], best_mult,
           alpha=0.06, color="tomato",
           label="Overconfident (σ too small)")
ax.axvspan(best_mult, sigma_multipliers[-1],
           alpha=0.06, color="royalblue",
           label="Underconfident (σ too large)")

ax.set_xscale("log")
ax.set_xlabel("σ multiplier  (σ = mult × 0.5 × IQR)", fontsize=11)
ax.set_ylabel("GLL score  (higher = better)", fontsize=11)
ax.set_title(
    "Uncertainty Sensitivity Analysis\n"
    "GLL vs σ multiplier with fixed constant mean (training median)",
    fontsize=12
)
ax.legend(fontsize=9, loc="lower right")
ax.set_xticks(sigma_multipliers)
ax.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())

plt.tight_layout()
plt.show()

print(f"\nOptimal σ multiplier : {best_mult:.2f}  (out of {sigma_multipliers})")
print(f"Optimal GLL          : {best_gll:.4f}")
print("[Done] Sigma sensitivity analysis complete.")

## 6. Plot Baseline Predictions

For 3 representative held-out planets, compare the two baselines:
- **Baseline 1** (constant median): the same per-wavelength spectrum for every planet, uncertainty from IQR.
- **Baseline 2** (Ridge regression): per-wavelength prediction conditioned on aux features.

Ground truth is shown as the q2 median spectrum with a shaded q1–q3 band.

In [None]:
# ── Retrain Ridge on full training set for illustration ────────────────────
scaler_final = StandardScaler()
X_train_s    = scaler_final.fit_transform(X_train)
X_val_s      = scaler_final.transform(X_val)

ridge_final  = Ridge(alpha=ALPHA, fit_intercept=True)
ridge_final.fit(X_train_s, Y_q2_train)

mu_ridge_val  = ridge_final.predict(X_val_s)               # (n_val, 283)
resid_train   = Y_q2_train - ridge_final.predict(X_train_s)
sigma_ridge_wl = np.clip(resid_train.std(axis=0), 1e-9, None)  # (283,)

# ── Select 3 example planets from validation set ──────────────────────────
N_EXAMPLE   = 3
example_idx = np.linspace(0, len(Y_q2_val) - 1, N_EXAMPLE, dtype=int)

wl_axis = np.arange(N_WAVELENGTHS)   # wavelength channel index (0..282)

fig, axes = plt.subplots(N_EXAMPLE, 2, figsize=(16, 4 * N_EXAMPLE))
fig.suptitle(
    "Baseline Predictions vs Ground Truth — 3 Example Validation Planets",
    fontsize=13, fontweight="bold", y=1.01
)

for row, pid in enumerate(example_idx):
    # Ground truth
    gt_q1 = Y_q1_val[pid]
    gt_q2 = Y_q2_val[pid]
    gt_q3 = Y_q3_val[pid]

    # ── Left: Baseline 1 (constant median) ────────────────────────────────
    ax_left = axes[row, 0]

    # Ground truth band
    ax_left.fill_between(wl_axis, gt_q1, gt_q3,
                         alpha=0.25, color="steelblue", label="GT q1–q3 band")
    ax_left.plot(wl_axis, gt_q2, lw=1.2, color="steelblue",
                 label="GT q2 (median)")

    # Constant prediction
    ax_left.plot(wl_axis, mu_const, lw=1.2, color="darkorange", linestyle="--",
                 label="Pred μ (const median)")
    ax_left.fill_between(wl_axis,
                         mu_const - sigma_const,
                         mu_const + sigma_const,
                         alpha=0.18, color="darkorange", label="Pred μ±σ")

    ax_left.set_title(f"Planet val[{pid}] — Baseline 1 (Constant Median)", fontsize=10)
    ax_left.set_xlabel("Wavelength channel", fontsize=9)
    ax_left.set_ylabel("Transit depth", fontsize=9)
    if row == 0:
        ax_left.legend(fontsize=8, loc="upper right")
    ax_left.tick_params(labelsize=8)

    # ── Right: Baseline 2 (Ridge regression) ──────────────────────────────
    ax_right = axes[row, 1]

    # Ground truth band
    ax_right.fill_between(wl_axis, gt_q1, gt_q3,
                          alpha=0.25, color="steelblue", label="GT q1–q3 band")
    ax_right.plot(wl_axis, gt_q2, lw=1.2, color="steelblue",
                  label="GT q2 (median)")

    # Ridge prediction
    mu_ridge_p = mu_ridge_val[pid]          # (283,)
    ax_right.plot(wl_axis, mu_ridge_p, lw=1.2, color="crimson", linestyle="--",
                  label="Pred μ (Ridge)")
    ax_right.fill_between(wl_axis,
                          mu_ridge_p - sigma_ridge_wl,
                          mu_ridge_p + sigma_ridge_wl,
                          alpha=0.18, color="crimson", label="Pred μ±σ")

    ax_right.set_title(f"Planet val[{pid}] — Baseline 2 (Ridge Regression)", fontsize=10)
    ax_right.set_xlabel("Wavelength channel", fontsize=9)
    ax_right.set_ylabel("Transit depth", fontsize=9)
    if row == 0:
        ax_right.legend(fontsize=8, loc="upper right")
    ax_right.tick_params(labelsize=8)

plt.tight_layout()
plt.show()

print(f"[Done] Prediction plots for {N_EXAMPLE} example planets.")

## 7. Summary

### Results Table

| Method | GLL | Notes |
|---|---|---|
| Baseline 1: Constant median | *see cell below* | Global median μ, IQR-based σ. No aux features used. |
| Baseline 2: Ridge regression (5-fold CV) | *see cell below* | 283 independent Ridge models on 9 aux features. σ from training residuals. |

### Key Takeaways

1. **GLL is sensitive to both mean accuracy and uncertainty calibration.**  
   The sigma sensitivity analysis (Section 5) holds the mean prediction fixed at the training
   median and shows that mis-calibrated uncertainty alone can change the GLL score
   substantially.  A model that outputs a good μ but a bad σ will score poorly.

2. **The constant predictor is a meaningful floor.**  
   Any model that fails to beat Baseline 1 has not learned anything useful from the data.
   Concretely, ~24% of planets are labelled, so the training-set median is computed on a
   modest sample.  The model must generalise beyond this.

3. **Auxiliary features may carry signal (Baseline 2 vs 1).**  
   If Ridge regression outperforms the constant predictor, stellar/planetary parameters
   (stellar temperature, planet radius, etc.) are correlated with the atmospheric spectrum.  
   This is physically expected: hotter stars illuminate more, larger planets block more light.

4. **What the full model needs to beat.**  
   The competition-winning approaches use direct photometric extraction from AIRS-CH0 raw
   light curves.  The HDF5 data provides per-channel flux time series that, once preprocessed
   (see `02_preprocessing.ipynb`), yield a transit depth spectrum per planet.  This spectrum,
   combined with aux features, is expected to drive GLL well above the baselines here.

5. **Raw GLL has no natural zero point.**  
   With exact means the score reduces to −½·mean[log(2πσ²)], which is positive whenever
   σ < 1/√(2π) ≈ 0.40.  Absolute values are therefore only meaningful relative to a reference
   such as Baseline 1; see the Kaggle discussion forum for leaderboard context.

In [None]:
# ── Print final summary table ───────────────────────────────────────────────
print("=" * 65)
print("BASELINE RESULTS SUMMARY")
print("=" * 65)
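# Note: Baseline 1 is scored on a single 20% hold-out and Baseline 2 on 5-fold CV,
# so the comparison below is indicative rather than like-for-like.
print(f"{'Method':<45}  {'GLL':>8}")
print("-" * 65)
print(f"{'Baseline 1: Constant median (20% val split)':<45}  {gll_const:>8.4f}")
print(f"{'Baseline 2: Ridge regression (5-fold CV)':<45}  {gll_ridge_cv:>8.4f}")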
print("-" * 65)

improvement = gll_ridge_cv - gll_const
sign = "+" if improvement >= 0 else ""
print(f"Ridge vs Constant improvement : {sign}{improvement:.4f} GLL points")
print(f"Using synthetic data           : {_USING_SYNTHETIC}")
print("=" * 65)

print("\nBaseline (constant median) GLL =", round(gll_const, 4))
print("Baseline (Ridge regression) GLL (5-fold CV) =", round(gll_ridge_cv, 4))

print("\n[Done] Baseline notebook complete.")