# Model 3 (XGB/LGBM Blend)

This model was an attempt to push performance by combining:
- a stronger feature set,
- photometric redshift augmentation using `Z_err`,
- an ensemble with both XGBoost and LightGBM,
- and a blend weight (alpha) and decision threshold, both tuned on out-of-fold (OOF) predictions.

## Results

Best parameters:
- n_estimators: 4328
- learning_rate: 0.007079630495604182
- max_depth: 4
- min_child_weight: 1
- subsample: 0.5936362024881353
- colsample_bytree: 0.9146736072473098
- gamma: 0.6466321530023438
- reg_alpha: 4.130135428812432
- reg_lambda: 5.5649710878468825
- max_delta_step: 1
- grow_policy: depthwise

OOF multiseed best threshold: 0.01  
OOF multiseed best F1: 0.51875  
OOF best alpha: 0.03  

| Submission | Public LB F1 | Private LB F1 |
|-------------|--------------|----------------|
| 1 | 0.5613 | 0.5313 |
| 2 | 0.5082 | 0.5119 |

## What changed vs Model 2
- Added `Z_err` usage and photo-z augmentation.
- Added a LightGBM model alongside XGBoost.
- Added OOF blending: `p = alpha * p_xgb + (1 - alpha) * p_lgb`.
- Optuna tuning target changed to PR AUC inside CV, while thresholding is still tuned for F1.
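
The blend formula and the joint alpha/threshold search can be sketched as below. This is a minimal illustration on made-up data, not the tuning code used for the submissions: `tune_blend` and `f1_binary` are hypothetical helper names, and the toy probabilities are constructed so that one model is informative and the other is pure noise.

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1 for 0/1 labels, written out so the sketch only needs NumPy."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)


def tune_blend(y_true, p_xgb, p_lgb, alphas, thresholds):
    """Grid-search (alpha, threshold) that maximizes F1 on OOF predictions."""
    best = (-1.0, None, None)
    for alpha in alphas:
        # Same blend as in the write-up: alpha weights XGBoost, (1 - alpha) weights LightGBM
        p = alpha * p_xgb + (1.0 - alpha) * p_lgb
        for thr in thresholds:
            score = f1_binary(y_true, (p >= thr).astype(int))
            if score > best[0]:
                best = (score, float(alpha), float(thr))
    return best


# Toy OOF predictions (assumption: not competition data)
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.05).astype(int)
p_lgb = 0.10 * y + 0.05 * rng.random(2000)   # separates the classes, but on a low probability scale
p_xgb = rng.random(2000)                     # uninformative noise

best_f1, best_alpha, best_thr = tune_blend(
    y, p_xgb, p_lgb,
    alphas=np.linspace(0.0, 1.0, 101),
    thresholds=np.linspace(0.01, 0.99, 99),
)
print(best_f1, best_alpha, best_thr)
```

In this toy setup the search puts all weight on the informative model and picks a low threshold, which is the same pattern as a small OOF alpha combined with a low OOF threshold.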

## Likely reasons this underperformed
- The OOF-optimal blend weight was alpha = 0.03, meaning the blend heavily preferred one model (mostly LGB).
- The OOF-optimal threshold was extremely low (0.01), suggesting probability calibration / distribution mismatch.
- Optimizing PR AUC but selecting thresholds by F1 can create objective mismatch.
- Ensembling may have added noise rather than signal here. After the competition ended I compared notes with other people in the top 100, and the reports were mixed: some said ensembles lowered their LB results, while others were blending 100+ models and getting great results.
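
One way the threshold and objective points above can interact is sketched below on synthetic scores (an assumption, not my actual OOF output). A rank-preserving distortion of the scores leaves any ranking metric such as PR AUC unchanged, yet it moves the F1-optimal threshold a long way. `best_f1_threshold` is a hypothetical helper.

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1 for 0/1 labels using only NumPy."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)


def best_f1_threshold(y_true, p, thresholds):
    """Return the threshold on the grid with the highest F1."""
    scores = [f1_binary(y_true, (p >= thr).astype(int)) for thr in thresholds]
    return float(thresholds[int(np.argmax(scores))])


rng = np.random.default_rng(1)
y = (rng.random(5000) < 0.05).astype(int)

# Synthetic scores: positives sit higher on average than negatives
score = np.clip(rng.normal(0.3 + 0.3 * y, 0.1), 0.001, 0.999)
# Monotone distortion: the ordering of objects is unchanged, the scale is not
squashed = score ** 4

grid = np.linspace(0.01, 0.99, 99)
thr_raw = best_f1_threshold(y, score, grid)
thr_squashed = best_f1_threshold(y, squashed, grid)

# Sorting by the raw score also sorts the distorted score, so rankings agree
same_order = bool(np.all(np.diff(squashed[np.argsort(score)]) >= 0))
print(thr_raw, thr_squashed, same_order)
```

So a model can be tuned to a good PR AUC and still need a threshold close to zero, because F1 thresholding depends on the probability scale while PR AUC does not.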

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, average_precision_score
from sklearn.model_selection import StratifiedGroupKFold
from xgboost import XGBClassifier
from extinction import fitzpatrick99
from lightgbm import LGBMClassifier
import optuna
from pathlib import Path

## Filter and extinction constants

- `FILTERS`: the 6 LSST bands used in the lightcurves (u,g,r,i,z,y)
- `EFF_WL_AA`: effective wavelength (Angstrom) per band
- `R_V`: extinction curve parameter used to convert EBV -> A_V
- `STETSON_DT_MAX`: max time separation (in days, since times are MJD) for pairing consecutive points in the Stetson J variability statistic

These constants support consistent de-extinction and time-series variability calculations.
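
To make the role of `STETSON_DT_MAX` concrete, here is a small sketch of the pairing rule on its own. `close_pairs` is a hypothetical helper written for this illustration, and the times are made up. It also shows why the observed-frame and rest-frame Stetson J values can differ: dividing times by `(1 + z)` shrinks the gaps, so more consecutive points can fall inside the window.

```python
import numpy as np

STETSON_DT_MAX = 0.5  # same value as the notebook constant


def close_pairs(t, dt_max=STETSON_DT_MAX):
    """Indices (i, i + 1) of consecutive observations at most dt_max apart."""
    t = np.asarray(t, dtype=float)
    return [(i, i + 1) for i in range(len(t) - 1) if (t[i + 1] - t[i]) <= dt_max]


t_obs = np.array([0.0, 0.2, 3.0, 3.8, 10.0])  # toy observation times
z = 1.0                                       # toy redshift
t_rest = t_obs / (1.0 + z)

pairs_obs = close_pairs(t_obs)    # [(0, 1)]
pairs_rest = close_pairs(t_rest)  # [(0, 1), (2, 3)]
print(pairs_obs, pairs_rest)
```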


In [2]:
FILTERS = ["u", "g", "r", "i", "z", "y"]

EFF_WL_AA = {
    "u": 3641.0,
    "g": 4704.0,
    "r": 6155.0,
    "i": 7504.0,
    "z": 8695.0,
    "y": 10056.0,
}

R_V = 3.1
STETSON_DT_MAX = 0.5

### Data safety + numerics
- `safe_float`: safe conversion that handles missing / NaN values
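- `trapz_safe`: trapezoidal integration that uses `np.trapezoid` when available and falls back to a manual trapezoid sum on older NumPy versions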

### Robust statistics
- MAD, IQR, skewness, kurtosis
- von Neumann eta (variability / smoothness)
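
As a quick sanity check on what the von Neumann eta measures, the sketch below applies the same formula (mean squared successive difference divided by the variance) to two synthetic series. The function is repeated here only so the snippet runs on its own, and the white-noise and sine inputs are assumptions chosen for illustration.

```python
import numpy as np


def von_neumann_eta(x):
    """Mean squared successive difference divided by the variance."""
    x = np.asarray(x, dtype=float)
    if len(x) < 3:
        return np.nan
    v = np.var(x)
    if v < 1e-12:
        return 0.0
    return float(np.mean(np.diff(x) ** 2) / v)


rng = np.random.default_rng(0)
noise = rng.normal(size=2000)                        # uncorrelated jitter
smooth = np.sin(np.linspace(0.0, 2.0 * np.pi, 2000)) # slowly evolving curve

eta_noise = von_neumann_eta(noise)    # close to 2 for white noise
eta_smooth = von_neumann_eta(smooth)  # close to 0 for a smooth trend
print(eta_noise, eta_smooth)
```

Values near 2 mean consecutive points are essentially independent, while values near 0 mean the series changes slowly compared with its overall spread.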

### Time-series dynamics
- slope summaries (`max_slope`, `median_abs_slope`, `linear_slope`)
- goodness-of-fit to a constant: reduced chi-square around the median flux (`chi2_to_constant`)
- interpolation for color-at-time features (`interp_flux_at_time`)
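
The constant-model check can be illustrated on synthetic data. The sketch below repeats the reduced chi-square around the median so it runs standalone. The flat and bump lightcurves are made up, with errors that match the injected noise.

```python
import numpy as np


def chi2_to_constant(f, ferr):
    """Reduced chi-square of the flux against a constant set to its median."""
    f = np.asarray(f, dtype=float)
    ferr = np.asarray(ferr, dtype=float)
    n = len(f)
    if n < 3:
        return np.nan
    mu = np.median(f)
    chi2 = np.sum((f - mu) ** 2 / (ferr + 1e-8) ** 2)
    return float(chi2 / max(1, n - 1))


rng = np.random.default_rng(0)
t = np.linspace(0.0, 100.0, 500)
ferr = np.full(500, 0.5)

flat = 10.0 + rng.normal(0.0, 0.5, 500)                   # constant source plus noise
bump = flat + 5.0 * np.exp(-0.5 * ((t - 50.0) / 5.0) ** 2) # same noise plus a transient

chi2_flat = chi2_to_constant(flat, ferr)  # near 1: scatter is explained by the errors
chi2_bump = chi2_to_constant(bump, ferr)  # well above 1: real variability
print(chi2_flat, chi2_bump)
```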

### Variability metrics
- `stetson_J`: correlated variability statistic for closely-spaced observations
- `fractional_variability`: noise-corrected intrinsic variability amplitude
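
A worked example of the fractional variability formula, `F_var = sqrt(max(0, S^2 - mean(err^2))) / |mean(f)|`, using a tiny hand-made flux array (an assumption for illustration). With small errors the excess variance survives. With large errors the noise term exceeds the sample variance and the result is clipped to zero.

```python
import numpy as np


def fractional_variability(f, ferr):
    """Noise-corrected variability amplitude relative to the mean flux."""
    f = np.asarray(f, dtype=float)
    ferr = np.asarray(ferr, dtype=float)
    if len(f) < 3:
        return np.nan
    mu = np.mean(f)
    if np.abs(mu) < 1e-8:
        return np.nan
    excess = max(0.0, np.var(f, ddof=1) - np.mean(ferr ** 2))
    return float(np.sqrt(excess) / np.abs(mu))


flux = np.array([8.0, 10.0, 12.0, 10.0, 10.0])  # mean 10, sample variance 2

# Errors of 0.5: excess = 2 - 0.25 = 1.75, so F_var = sqrt(1.75) / 10
fvar_low_noise = fractional_variability(flux, np.full(5, 0.5))

# Errors of 2.0: mean(err^2) = 4 > 2, so the excess is clipped and F_var = 0
fvar_high_noise = fractional_variability(flux, np.full(5, 2.0))
print(fvar_low_noise, fvar_high_noise)
```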

### De-extinction (dust correction)
- `deextinct_band` and `deextinct_lightcurve` correct flux + flux_err using EBV and wavelength-dependent extinction.

These functions are building blocks for the per-object feature extraction function.
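
The arithmetic behind the de-extinction step can be sketched without the `extinction` package. The example below assumes the extinction in the band equals `A_V` (true at the V band by definition). In the notebook the per-band value comes from `fitzpatrick99` instead, and the EBV and flux numbers here are made up.

```python
import numpy as np

R_V = 3.1
ebv = 0.05                      # toy E(B-V)
a_v = R_V * ebv                 # A_V = R_V * E(B-V)

# Assumption for illustration only: use A_V as the band extinction A_lambda
a_lambda = a_v

flux = np.array([10.0, 12.0, 9.0])
flux_err = np.array([1.0, 1.5, 0.9])

# Removing a_lambda magnitudes of dimming multiplies flux by 10^(0.4 * a_lambda)
fac = 10.0 ** (0.4 * a_lambda)
flux_corr = flux * fac
err_corr = flux_err * fac

snr_before = flux / flux_err
snr_after = flux_corr / err_corr
print(a_v, fac, np.allclose(snr_before, snr_after))
```

Because flux and error are scaled by the same factor, the correction changes brightness levels and colors but leaves the per-point SNR as it was.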


In [3]:
def safe_float(x, default=np.nan):
    try:
        if x is None:
            return default
        x = float(x)
        if np.isnan(x):
            return default
        return x
    except Exception:
        return default


def trapz_safe(y, x):
    # np.trapezoid was added in NumPy 2.0; older versions fall through to the manual rule below
    if hasattr(np, "trapezoid"):
        return np.trapezoid(y, x)
    y = np.asarray(y)
    x = np.asarray(x)
    return np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) * 0.5)


def median_abs_dev(x):
    x = np.asarray(x)
    med = np.median(x)
    return float(np.median(np.abs(x - med)))


def iqr(x):
    q75, q25 = np.percentile(x, [75, 25])
    return float(q75 - q25)


def skewness(x):
    x = np.asarray(x)
    mu = np.mean(x)
    s = np.std(x)
    if s < 1e-12:
        return 0.0
    m3 = np.mean((x - mu) ** 3)
    return float(m3 / (s ** 3))


def kurtosis_excess(x):
    x = np.asarray(x)
    mu = np.mean(x)
    s = np.std(x)
    if s < 1e-12:
        return 0.0
    m4 = np.mean((x - mu) ** 4)
    return float(m4 / (s ** 4) - 3.0)


def von_neumann_eta(x):
    # Mean squared successive difference over variance: ~2 for white noise, near 0 for smooth trends
    x = np.asarray(x)
    n = len(x)
    if n < 3:
        return np.nan
    v = np.var(x)
    if v < 1e-12:
        return 0.0
    dif = np.diff(x)
    return float(np.mean(dif ** 2) / v)


def max_slope(t, f):
    t = np.asarray(t)
    f = np.asarray(f)
    if len(t) < 3:
        return np.nan
    dt = np.diff(t)
    df = np.diff(f)
    good = dt > 0
    if not np.any(good):
        return np.nan
    slopes = df[good] / dt[good]
    return float(np.max(np.abs(slopes)))


def median_abs_slope(t, f):
    t = np.asarray(t)
    f = np.asarray(f)
    if len(t) < 3:
        return np.nan
    dt = np.diff(t)
    df = np.diff(f)
    good = dt > 0
    if not np.any(good):
        return np.nan
    slopes = df[good] / dt[good]
    return float(np.median(np.abs(slopes)))


def linear_slope(t, f):
    t = np.asarray(t)
    f = np.asarray(f)
    if len(t) < 3:
        return np.nan
    try:
        a, b = np.polyfit(t, f, 1)
        return float(a)
    except Exception:
        return np.nan


def chi2_to_constant(f, ferr):
    f = np.asarray(f)
    ferr = np.asarray(ferr)
    n = len(f)
    if n < 3:
        return np.nan
    mu = np.median(f)  # constant level is the median, which resists outlying points
    denom = (ferr + 1e-8) ** 2
    chi2 = np.sum((f - mu) ** 2 / denom)
    dof = max(1, n - 1)
    return float(chi2 / dof)


def interp_flux_at_time(tb, fb, t0):
    # returns flux at t0 using linear interpolation
    tb = np.asarray(tb)
    fb = np.asarray(fb)
    if len(tb) < 2:
        return np.nan
    if (t0 < tb.min()) or (t0 > tb.max()):
        return np.nan
    return float(np.interp(t0, tb, fb))


def stetson_J(t, f, ferr, dt_max=STETSON_DT_MAX):
    """
    Simplified Stetson J:
    Pair consecutive observations that are close in time.
    delta_i = sqrt(n/(n-1)) * (f_i - mean_f) / err_i
    J = mean( sign(P_k)*sqrt(|P_k|) ) where P_k = delta_i * delta_j
    """
    t = np.asarray(t)
    f = np.asarray(f)
    ferr = np.asarray(ferr)

    n = len(t)
    if n < 4:
        return np.nan

    mu = np.mean(f)
    scale = np.sqrt(n / max(1, n - 1))
    delta = scale * (f - mu) / (ferr + 1e-8)

    pairs = []
    for i in range(n - 1):
        if (t[i + 1] - t[i]) <= dt_max:
            pairs.append((i, i + 1))

    if len(pairs) == 0:
        return np.nan

    vals = []
    for i, j in pairs:
        P = delta[i] * delta[j]
        vals.append(np.sign(P) * np.sqrt(np.abs(P)))

    return float(np.mean(vals))


def fractional_variability(f, ferr):
    """
    Noise-corrected intrinsic variability amplitude:
    F_var = sqrt(max(0, S^2 - mean(err^2))) / |mean(f)|
    """
    f = np.asarray(f, float)
    ferr = np.asarray(ferr, float)
    n = len(f)
    if n < 3:
        return np.nan

    mu = np.mean(f)
    if np.abs(mu) < 1e-8:
        return np.nan

    s2 = np.var(f, ddof=1)
    mean_err2 = np.mean(ferr**2)

    excess = max(0.0, s2 - mean_err2)
    return float(np.sqrt(excess) / np.abs(mu))


def deextinct_band(flux, flux_err, ebv, band, r_v=R_V):
    if ebv is None or (isinstance(ebv, float) and np.isnan(ebv)):
        return flux, flux_err, 0.0

    A_V = float(ebv) * float(r_v)  # A_V = R_V * E(B-V)
    wave = np.array([EFF_WL_AA[band]], dtype=float)  # Angstrom
    A_lambda = float(fitzpatrick99(wave, A_V, r_v=r_v, unit="aa")[0])  # mag

    # Undo A_lambda magnitudes of dimming; flux and error scale by the same factor
    fac = 10.0 ** (0.4 * A_lambda)
    return flux * fac, flux_err * fac, A_lambda


def deextinct_lightcurve(lc, ebv):
    flux = lc["Flux"].to_numpy().astype(float)
    ferr = lc["Flux_err"].to_numpy().astype(float)
    filt = lc["Filter"].to_numpy()

    flux_corr = flux.copy()
    ferr_corr = ferr.copy()

    for b in FILTERS:
        m = (filt == b)
        if not np.any(m):
            continue
        flux_corr[m], ferr_corr[m], _ = deextinct_band(flux_corr[m], ferr_corr[m], ebv, b)

    return flux_corr, ferr_corr

# Model 3: Expanded variability + per-band diagnostics + color-evolution features

Differences vs Model 2:
 - Adds photo-z uncertainty features (Z_err, log1pZerr) so the model can learn when redshift is unreliable.
 - Adds stronger variability diagnostics: chi2-to-constant, Stetson J, fractional variability (global + per-band).
 - Adds robust amplitude features (p95 - p5) globally and per-band to reduce sensitivity to outliers.
 - Computes widths/rise/decay/asymmetry at 50% amplitude in both observed + rest frame (more detailed transient shape).
 - Adds color features (g-r and r-i) at r-band peak and their 10-day color slopes (captures spectral evolution).

## Global (all-filters combined) features

These are computed using all observations across all bands for a given object.  
They summarize time coverage, brightness distribution, cadence, variability, and context (redshift + dust + redshift uncertainty).

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `n_obs` | Total number of observations across all filters | Captures overall sampling density and how well-measured the object is |
| `total_time_obs` | Observed-frame time baseline: `max(t_rel) - min(t_rel)` | Separates short transients vs long events and measures overall monitoring duration |
| `total_time_rest` | Rest-frame time baseline: `total_time_obs / (1+z)` | Removes time dilation so the model compares intrinsic evolution speed across redshifts |
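
A tiny worked example of the two baselines, with made-up MJD values and a toy redshift:

```python
import numpy as np

t_mjd = np.array([60000.0, 60020.0, 60100.0])  # toy observation times (MJD)
z = 1.0                                        # toy redshift

t_rel = t_mjd - t_mjd.min()      # observed-frame days since first observation
t_rest = t_rel / (1.0 + z)       # rest-frame days (time dilation removed)

total_time_obs = float(t_rel.max() - t_rel.min())     # 100.0
total_time_rest = float(t_rest.max() - t_rest.min())  # 50.0
print(total_time_obs, total_time_rest)
```

At `z = 1` a 100-day observed baseline corresponds to 50 rest-frame days.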

### Flux distribution (dust-corrected `flux_corr`)

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `flux_mean` | Mean corrected flux | Measures average intrinsic brightness level (sensitive to sustained high flux) |
| `flux_median` | Median corrected flux | Robust typical brightness baseline (less sensitive to one-off spikes) |
| `flux_std` | Standard deviation of corrected flux | Captures variability strength (high = more change over time) |
| `flux_min` | Minimum corrected flux | Captures deep fades / dips / negative excursions from noise-subtraction artifacts |
| `flux_max` | Maximum corrected flux | Captures peak brightness or flare intensity (key transient signature) |
| `flux_mad` | Median absolute deviation of corrected flux | Robust variability estimate that doesn’t get bullied by outliers |
| `flux_iqr` | Interquartile range of corrected flux | Another robust variability measure (spread of the middle 50%) |
| `flux_skew` | Skewness of corrected flux distribution | Detects asymmetric lightcurves (fast rise / slow decay vs vice versa) |
| `flux_kurt_excess` | Excess kurtosis of corrected flux distribution | Detects heavy tails/spiky behavior from rare bursts or sharp transients |
| `flux_p5` | 5th percentile of corrected flux | Robust low-end brightness level (less sensitive than min) |
| `flux_p25` | 25th percentile of corrected flux | Lower-quartile brightness level |
| `flux_p75` | 75th percentile of corrected flux | Upper-quartile brightness level |
| `flux_p95` | 95th percentile of corrected flux | Robust high-end brightness level (less sensitive than max) |
| `robust_amp_global` | Robust amplitude: `flux_p95 - flux_p5` | Outlier-resistant variability scale, often better than max-min |
| `neg_flux_frac` | Fraction of corrected flux values `< 0` | Flags noise-dominated objects or weak detections where measurements hover around zero |
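
To show why `robust_amp_global` is more stable than a plain max-min range, the sketch below corrupts a single point in a toy flux array (made-up values) and compares both amplitudes.

```python
import numpy as np


def robust_amp(flux):
    """Percentile-based amplitude: p95 - p5."""
    p5, p95 = np.percentile(flux, [5, 95])
    return float(p95 - p5)


clean = np.arange(101, dtype=float)   # toy flux values 0, 1, ..., 100
spiked = clean.copy()
spiked[-1] = 10000.0                  # one extreme outlier

range_clean = float(clean.max() - clean.min())
range_spiked = float(spiked.max() - spiked.min())
robust_clean = robust_amp(clean)
robust_spiked = robust_amp(spiked)
print(range_clean, range_spiked, robust_clean, robust_spiked)
```

The max-min range jumps by two orders of magnitude, while the percentile amplitude does not move.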

### SNR (using corrected errors `err_corr`)

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `snr_median` | Median SNR where `snr = \|flux_corr\| / (err_corr + 1e-8)` | Typical detection quality (separates clean signals from noisy junk) |
| `snr_max` | Maximum SNR | Captures the strongest detection event (some transients “light up” briefly) |

### Cadence / gaps

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `median_dt` | Median time gap between consecutive observations in `t_rel` | Describes typical cadence (important since sparse sampling hides shape) |
| `max_gap` | Maximum time gap between consecutive observations in `t_rel` | Detects missing windows (large gaps can explain unreliable peak/width estimates) |

### Time-series variability / shape

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `eta_von_neumann` | Von Neumann eta statistic on `flux_corr` (smoothness vs jumpiness) | Separates smooth evolving curves from noisy jitter or sudden jumps |
| `chi2_const_global` | Chi-square vs constant-flux model using `err_corr` | Quantifies variability relative to measurement noise (true variability vs noise) |
| `stetsonJ_global_obs` | Stetson J index using observed-frame times | Captures correlated variability behavior (often strong for real transients) |
| `stetsonJ_global_rest` | Stetson J index using rest-frame times | Same idea, but corrected for time dilation so timing-related correlation is comparable |

### Slopes / rate of change (global)

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `max_slope_global_obs` | Maximum absolute slope in observed time (`t_rel`) | Captures fastest brightness change (sharp rise/fall events) |
| `max_slope_global_rest` | Maximum absolute slope in rest-frame time (`t_rest`) | Intrinsic fastest change rate (removes redshift stretching) |
| `med_abs_slope_global_obs` | Median absolute slope in observed time | Typical observed change rate (slow drifters vs active transients) |
| `med_abs_slope_global_rest` | Median absolute slope in rest-frame time | Typical intrinsic change rate |
| `slope_global_obs` | Best-fit linear slope over observed time | Captures long-term trend direction (rising vs fading overall) |
| `slope_global_rest` | Best-fit linear slope over rest-frame time | Same trend, but comparable across redshifts |

### Fractional variability

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `fvar_global` | Fractional variability accounting for measurement errors | Estimates intrinsic variability strength after subtracting noise contribution |

### Context metadata

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `Z` | Redshift `z` | Encodes distance/epoch effects and shifts events into different observed regimes |
| `log1pZ` | `log(1+z)` | Stabilizes redshift scaling for models (less extreme leverage at high `z`) |
| `Z_err` | Redshift uncertainty (clipped to `>= 0`) | Captures confidence in rest-frame correction; noisy redshifts degrade timing features |
| `log1pZerr` | `log(1+Z_err)` | Stabilizes uncertainty scaling and helps tree models split more smoothly |
| `EBV` | Dust reddening used for extinction correction | Helps the model learn residual dust systematics and measurement conditions |

### Filter coverage

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `n_filters_present` | Number of filters with ≥ 1 observation | Multi-band coverage gives richer color/shape info; missing bands can correlate with class |
| `total_obs` | Observations summed over the six bands in `FILTERS` (equals `n_obs` when every row carries one of those filters) | Redundant but convenient sanity/coverage signal that some models exploit |

## Per-filter (band-wise) features

For each band `b ∈ {u,g,r,i,z,y}`, the following features are computed independently per filter.  
These capture color-dependent brightness behavior and band-specific temporal dynamics.

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `n_{b}` | Number of observations in band `b` | Band missingness and sampling density vary by object/class and affect reliability |
| `amp_{b}` | Amplitude above baseline: `max(fb) - median(fb)` | Measures event strength in that band (key for color-specific transient signatures) |
| `robust_amp_{b}` | Robust amplitude: `p95_b - p5_b` | More stable amplitude estimate when peaks/outliers are noisy |
| `tpeak_{b}_obs` | Observed-frame time of peak flux in band `b` | Captures when the band reaches maximum brightness (timing is class-dependent) |
| `tpeak_{b}_rest` | Rest-frame time of peak flux: `tpeak_obs / (1+z)` | Removes time dilation so peak timing is comparable across redshifts |
| `width50_{b}_obs` | Observed-frame width above 50% amplitude | Measures mid-brightness duration in observed time (useful when timing is observationally relevant) |
| `width50_{b}_rest` | Rest-frame width above 50% amplitude | Intrinsic mid-brightness duration (fast vs slow transients) |
| `width80_{b}_obs` | Observed-frame width above 80% amplitude | Captures core peak width in observed time (sharp vs broad peaks) |
| `width80_{b}_rest` | Rest-frame width above 80% amplitude | Intrinsic core peak width (class-discriminative) |
| `rise50_{b}_obs` | Observed-frame time from 50% rise crossing to peak | Encodes rise speed in observed time (cadence-aware) |
| `decay50_{b}_obs` | Observed-frame time from peak to 50% decay crossing | Encodes decay speed in observed time |
| `asym50_{b}_obs` | Rise/decay asymmetry: `rise50 / (decay50 + 1e-8)` | Separates fast-rise slow-decay vs slow-rise fast-decay shapes |
| `rise50_{b}_rest` | Rest-frame time from 50% rise crossing to peak | Intrinsic rise timescale per band |
| `decay50_{b}_rest` | Rest-frame time from peak to 50% decay crossing | Intrinsic decay timescale per band |
| `asym50_{b}_rest` | Rest-frame rise/decay asymmetry | Intrinsic shape asymmetry (less biased by redshift) |
| `auc_pos_{b}_obs` | Observed-frame AUC of positive signal: `∫ max(fb - baseline, 0) dt` | Energy-like summary in observed time (cadence and time dilation included) |
| `auc_pos_{b}_rest` | Rest-frame AUC of positive signal | Energy-like summary comparable across redshifts |
| `snrmax_{b}` | Maximum SNR within band `b` | Strongest detection in that band (some classes peak strongly only in certain filters) |
| `eta_{b}` | Von Neumann eta within band `b` | Detects smooth evolution vs noise inside a single wavelength band |
| `chi2_const_{b}` | Chi-square vs constant-flux model within band | Measures variability significance relative to band-specific noise |
| `slope_{b}_obs` | Best-fit linear slope in band over observed time | Captures overall rise/fade trend per band |
| `slope_{b}_rest` | Best-fit linear slope in band over rest-frame time | Intrinsic trend per band (comparable across redshifts) |
| `maxslope_{b}_obs` | Maximum absolute slope in band (observed time) | Captures sharpest observed change (rise/fall) per band |
| `maxslope_{b}_rest` | Maximum absolute slope in band (rest time) | Captures sharpest intrinsic change rate per band |
| `stetsonJ_{b}_obs` | Stetson J in band using observed time | Detects correlated variability patterns per band |
| `stetsonJ_{b}_rest` | Stetson J in band using rest-frame time | Same, but corrected for time dilation |
| `p5_{b}` | 5th percentile of band flux `fb` | Robust low-end level per band |
| `p25_{b}` | 25th percentile of `fb` | Lower-quartile level per band |
| `p75_{b}` | 75th percentile of `fb` | Upper-quartile level per band |
| `p95_{b}` | 95th percentile of `fb` | Robust high-end level per band |
| `mad_{b}` | Median absolute deviation of `fb` | Robust band variability (outlier-resistant) |
| `iqr_{b}` | Interquartile range of `fb` | Robust spread of the middle 50% per band |
| `mad_over_std_{b}` | `mad_b / (std_b + 1e-8)` | Flags spike-dominated vs Gaussian-like variability (robustness/shape cue) |
| `fvar_{b}` | Fractional variability within band (noise-corrected) | Intrinsic variability strength per band |
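
The width, rise, and decay features can be sketched as follows. This is a simplified, sample-based version written for illustration: `shape_at_fraction` is a hypothetical helper, it does not interpolate the crossing times, and the extraction code in this notebook may compute the crossings differently.

```python
import numpy as np


def shape_at_fraction(t, f, frac=0.5):
    """Sample-based rise, decay, and width at a given fraction of the amplitude."""
    t = np.asarray(t, dtype=float)
    f = np.asarray(f, dtype=float)
    if len(t) < 3:
        return np.nan, np.nan, np.nan
    baseline = np.median(f)
    pidx = int(np.argmax(f))
    amp = f[pidx] - baseline
    if amp <= 0:
        return np.nan, np.nan, np.nan
    level = baseline + frac * amp

    # Walk outward from the peak while samples stay at or above the level
    lo = pidx
    while lo - 1 >= 0 and f[lo - 1] >= level:
        lo -= 1
    hi = pidx
    while hi + 1 < len(f) and f[hi + 1] >= level:
        hi += 1

    rise = float(t[pidx] - t[lo])
    decay = float(t[hi] - t[pidx])
    return rise, decay, rise + decay


# Toy single-band lightcurve: fast rise, slower decay
t = np.arange(13, dtype=float)
f = np.array([0, 0, 0, 0, 6, 10, 8, 5, 2, 0, 0, 0, 0], dtype=float)

rise50, decay50, width50 = shape_at_fraction(t, f, frac=0.5)
_, _, width80 = shape_at_fraction(t, f, frac=0.8)
asym50 = rise50 / (decay50 + 1e-8)
print(rise50, decay50, width50, width80, asym50)
```

Rest-frame versions follow by passing `t / (1 + z)` instead of `t`.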

## Cross-band pair features (adjacent pairs: `ug, gr, ri, iz, zy`)

For each adjacent filter pair `(a,b)`, these compare amplitude, timing, and peak ratios across wavelengths.

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `ampdiff_{a}{b}` | Amplitude difference: `amp_a - amp_b` | Captures color gradients and temperature evolution signatures between bands |
| `tpeakdiff_{a}{b}_obs` | Observed-frame peak time difference: `tpeak_a_obs - tpeak_b_obs` | Chromatic peak lag/lead in observed time (also reflects cadence + time dilation) |
| `tpeakdiff_{a}{b}_rest` | Rest-frame peak time difference: `tpeak_a_rest - tpeak_b_rest` | Measures intrinsic chromatic peak lag/lead (more physically comparable) |
| `peakratio_{a}{b}` | Peak flux ratio: `peak_flux_a / (peak_flux_b + 1e-8)` | Strong color/SED proxy without needing explicit magnitudes |
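
A minimal sketch of how the adjacent-pair features can be assembled from per-band summaries. The dictionary values are made up, and letting NaN propagate for a missing band is an assumption of this sketch rather than a statement about the notebook's handling.

```python
import numpy as np

FILTERS = ["u", "g", "r", "i", "z", "y"]

# Toy per-band summaries; NaN marks a band with no observations
band_amp = {"u": np.nan, "g": 4.0, "r": 6.0, "i": 5.0, "z": 3.0, "y": 1.0}
band_peak = {"u": np.nan, "g": 5.0, "r": 8.0, "i": 6.0, "z": 4.0, "y": 2.0}
band_tpeak_obs = {"u": np.nan, "g": 20.0, "r": 22.0, "i": 25.0, "z": 27.0, "y": 30.0}

pair_feats = {}
for a, b in zip(FILTERS[:-1], FILTERS[1:]):   # ug, gr, ri, iz, zy
    pair_feats[f"ampdiff_{a}{b}"] = band_amp[a] - band_amp[b]
    pair_feats[f"tpeakdiff_{a}{b}_obs"] = band_tpeak_obs[a] - band_tpeak_obs[b]
    pair_feats[f"peakratio_{a}{b}"] = band_peak[a] / (band_peak[b] + 1e-8)

pair_names = [a + b for a, b in zip(FILTERS[:-1], FILTERS[1:])]
print(pair_names)
print(pair_feats["ampdiff_gr"], pair_feats["tpeakdiff_gr_obs"])
```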

## Color features at r-peak (observed-frame) + 10-day color evolution

These interpolate `g`, `r`, `i` flux at the observed time when the r-band peaks (`tpeak_r_obs`), then compute log-flux colors.

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `color_gr_at_rpeak_obs` | `log1p(f_g) - log1p(f_r)` evaluated at `tpeak_r_obs` | Measures g-r color at peak, which is highly class-dependent |
| `color_ri_at_rpeak_obs` | `log1p(f_r) - log1p(f_i)` evaluated at `tpeak_r_obs` | Measures r-i color at peak (temperature / SED proxy) |
| `color_gr_slope10_obs` | `(color_gr(t+10) - color_gr(t)) / 10` with `t = tpeak_r_obs` (color change per day over the next 10 days) | Captures how color evolves after peak (cooling/heating signatures) |
| `color_ri_slope10_obs` | `(color_ri(t+10) - color_ri(t)) / 10` with `t = tpeak_r_obs` (color change per day over the next 10 days) | Another post-peak color evolution cue (very discriminative for transients) |
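
The color-at-peak and 10-day slope features can be sketched like this. `log_color` is a hypothetical helper, the lightcurves are made up, and clipping negative interpolated flux to zero before `log1p` is an assumption of this sketch. The interpolation helper mirrors `interp_flux_at_time` and returns NaN outside a band's time coverage.

```python
import numpy as np


def interp_flux_at_time(tb, fb, t0):
    """Linear interpolation of flux at t0, NaN outside the sampled range."""
    tb = np.asarray(tb, dtype=float)
    fb = np.asarray(fb, dtype=float)
    if len(tb) < 2 or t0 < tb.min() or t0 > tb.max():
        return np.nan
    return float(np.interp(t0, tb, fb))


def log_color(t0, t_a, f_a, t_b, f_b):
    """log1p(f_a) - log1p(f_b) at time t0 (negative flux clipped to 0: an assumption)."""
    fa = interp_flux_at_time(t_a, f_a, t0)
    fb = interp_flux_at_time(t_b, f_b, t0)
    if np.isnan(fa) or np.isnan(fb):
        return np.nan
    return float(np.log1p(max(fa, 0.0)) - np.log1p(max(fb, 0.0)))


# Toy g and r lightcurves on a shared observed-frame time grid (days)
t = np.array([0.0, 10.0, 20.0, 30.0])
f_g = np.array([2.0, 4.0, 2.0, 1.0])
f_r = np.array([1.0, 5.0, 3.0, 1.0])

tpeak_r_obs = float(t[np.argmax(f_r)])   # r band peaks at t = 10
color_gr_peak = log_color(tpeak_r_obs, t, f_g, t, f_r)
color_gr_later = log_color(tpeak_r_obs + 10.0, t, f_g, t, f_r)
color_gr_slope10 = (color_gr_later - color_gr_peak) / 10.0   # color change per day
color_outside = log_color(45.0, t, f_g, t, f_r)              # NaN: beyond coverage
print(color_gr_peak, color_gr_slope10, color_outside)
```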

In [4]:
# Comments added through AI
def extract_features_for_object(lc_raw, z, z_err, ebv):
    feats = {}

    # Sort observations by time so time-based calculations make sense
    lc = lc_raw.sort_values("Time (MJD)").reset_index(drop=True)

    # Extract time values and filter (band) labels
    t = lc["Time (MJD)"].to_numpy().astype(float)
    filt = lc["Filter"].to_numpy()

    # If there are no observations, return minimal info
    if len(t) == 0:
        feats["n_obs"] = 0
        return feats

    # Make sure metadata fields are valid numbers (avoid NaNs / strings / missing values)
    z = safe_float(z, default=0.0)                     # redshift (distance proxy)
    z_err = safe_float(z_err, default=0.0)             # redshift uncertainty
    ebv = safe_float(ebv, default=np.nan)              # dust amount (can be missing)

    # Convert time to start at 0 (relative time axis)
    t_rel = t - t.min()

    # Convert observed time to the object's rest frame
    # Cosmological time dilation stretches observed durations by (1 + z), so divide by (1 + z)
    t_rest = t_rel / (1.0 + z)

    # Correct brightness values for dust in the Milky Way
    # (dust makes objects look dimmer than they really are)
    flux_corr, err_corr = deextinct_lightcurve(lc, ebv)

    # Basic observation statistics
    feats["n_obs"] = int(len(t))                                  # total number of measurements
    feats["total_time_obs"] = float(t_rel.max() - t_rel.min())    # total observed duration
    feats["total_time_rest"] = float(t_rest.max() - t_rest.min()) # duration with time dilation removed

    # Global brightness statistics (after dust correction)
    feats["flux_mean"] = float(np.mean(flux_corr))                # average brightness level
    feats["flux_median"] = float(np.median(flux_corr))            # robust typical brightness
    feats["flux_std"] = float(np.std(flux_corr))                  # overall variability
    feats["flux_min"] = float(np.min(flux_corr))                  # dimmest point
    feats["flux_max"] = float(np.max(flux_corr))                  # brightest point

    # Robust statistics that are less sensitive to outliers
    feats["flux_mad"] = median_abs_dev(flux_corr)                 # median absolute deviation
    feats["flux_iqr"] = iqr(flux_corr)                            # interquartile range (Q3 - Q1)

    # Distribution shape features
    feats["flux_skew"] = skewness(flux_corr)                      # asymmetry of values
    feats["flux_kurt_excess"] = kurtosis_excess(flux_corr)        # tail heaviness / spikiness

    # Robust amplitude using percentiles (stable against a few extreme points)
    p5, p25, p75, p95 = np.percentile(flux_corr, [5, 25, 75, 95])
    feats["flux_p5"] = float(p5)
    feats["flux_p25"] = float(p25)
    feats["flux_p75"] = float(p75)
    feats["flux_p95"] = float(p95)
    feats["robust_amp_global"] = float(p95 - p5)                  # robust amplitude proxy

    # Fraction of measurements that are below zero
    # (often indicates noise-dominated detections)
    feats["neg_flux_frac"] = float(np.mean(flux_corr < 0))

    # Signal-to-noise ratio summaries
    snr = np.abs(flux_corr) / (err_corr + 1e-8)
    feats["snr_median"] = float(np.median(snr))                   # typical signal quality
    feats["snr_max"] = float(np.max(snr))                         # strongest detection

    # Observation timing properties
    if len(t_rel) >= 2:
        dt = np.diff(t_rel)
        feats["median_dt"] = float(np.median(dt))                 # typical time between observations
        feats["max_gap"] = float(np.max(dt))                      # largest observation gap
    else:
        feats["median_dt"] = np.nan                               # undefined with <2 points
        feats["max_gap"] = np.nan

    # Global time-series shape / variability diagnostics
    feats["eta_von_neumann"] = von_neumann_eta(flux_corr)              # smoothness vs noise proxy
    feats["chi2_const_global"] = chi2_to_constant(flux_corr, err_corr) # variability vs constant model
    feats["stetsonJ_global_obs"] = stetson_J(t_rel, flux_corr, err_corr)  # correlated variability (obs frame)
    feats["stetsonJ_global_rest"] = stetson_J(t_rest, flux_corr, err_corr) # correlated variability (rest frame)

    # Global slope features (obs + rest frame)
    feats["max_slope_global_obs"] = max_slope(t_rel, flux_corr)        # fastest brightness change (obs)
    feats["max_slope_global_rest"] = max_slope(t_rest, flux_corr)      # fastest brightness change (rest)

    feats["med_abs_slope_global_obs"] = median_abs_slope(t_rel, flux_corr)   # typical change rate (obs)
    feats["med_abs_slope_global_rest"] = median_abs_slope(t_rest, flux_corr) # typical change rate (rest)

    feats["slope_global_obs"] = linear_slope(t_rel, flux_corr)         # best-fit linear trend (obs)
    feats["slope_global_rest"] = linear_slope(t_rest, flux_corr)       # best-fit linear trend (rest)

    # Fractional variability (accounts for measurement noise)
    feats["fvar_global"] = fractional_variability(flux_corr, err_corr)

    # Metadata features
    feats["Z"] = float(z)                           # distance proxy (redshift)
    feats["log1pZ"] = float(np.log1p(max(0.0, z)))  # compressed redshift scale
    feats["Z_err"] = float(max(0.0, z_err))         # clamp negative uncertainty to 0
    feats["log1pZerr"] = float(np.log1p(max(0.0, feats["Z_err"])))  # compressed uncertainty scale
    feats["EBV"] = ebv                              # dust amount

    # Counters for band coverage
    feats["n_filters_present"] = 0                  # how many bands have >= 1 observation
    feats["total_obs"] = 0                          # total observations across all bands

    # Storage for cross-band comparison features
    band_amp = {}                                   # per-band amplitude values
    band_tpeak_rest = {}                            # per-band peak time (rest frame)
    band_tpeak_obs = {}                             # per-band peak time (obs frame)
    band_peak = {}                                  # per-band peak flux values

    # Storage for interpolation-based color features (need time series per band)
    band_tb_rest = {}                               # per-band time arrays (rest frame)
    band_tb_obs = {}                                # per-band time arrays (obs frame)
    band_fb = {}                                    # per-band flux arrays (dust-corrected)

    # Loop over each wavelength band (u, g, r, i, z, y)
    for b in FILTERS:
        m = (filt == b)
        nb = int(np.sum(m))

        # Number of observations in this band
        feats[f"n_{b}"] = nb
        feats["total_obs"] += nb

        # Initialize band features as missing by default
        feats[f"amp_{b}"] = np.nan                   # peak - baseline amplitude
        feats[f"robust_amp_{b}"] = np.nan            # p95 - p5 amplitude (robust)

        feats[f"tpeak_{b}_obs"] = np.nan             # peak time in observed frame
        feats[f"tpeak_{b}_rest"] = np.nan            # peak time in rest frame

        feats[f"width50_{b}_obs"] = np.nan           # width above 50% amplitude (obs)
        feats[f"width50_{b}_rest"] = np.nan          # width above 50% amplitude (rest)
        feats[f"width80_{b}_obs"] = np.nan           # width above 80% amplitude (obs)
        feats[f"width80_{b}_rest"] = np.nan          # width above 80% amplitude (rest)

        feats[f"rise50_{b}_obs"] = np.nan            # time from 50% crossing to peak (obs)
        feats[f"decay50_{b}_obs"] = np.nan           # time from peak to 50% decay (obs)
        feats[f"asym50_{b}_obs"] = np.nan            # rise/decay asymmetry at 50% (obs)

        feats[f"rise50_{b}_rest"] = np.nan           # time from 50% crossing to peak (rest)
        feats[f"decay50_{b}_rest"] = np.nan          # time from peak to 50% decay (rest)
        feats[f"asym50_{b}_rest"] = np.nan           # rise/decay asymmetry at 50% (rest)

        feats[f"auc_pos_{b}_obs"] = np.nan           # area above baseline, positive only (obs)
        feats[f"auc_pos_{b}_rest"] = np.nan          # area above baseline, positive only (rest)

        feats[f"snrmax_{b}"] = np.nan                # best SNR in this band
        feats[f"eta_{b}"] = np.nan                   # Von Neumann smoothness for this band
        feats[f"chi2_const_{b}"] = np.nan            # variability vs constant model (band)

        feats[f"slope_{b}_obs"] = np.nan             # best-fit linear trend (obs)
        feats[f"slope_{b}_rest"] = np.nan            # best-fit linear trend (rest)

        feats[f"maxslope_{b}_obs"] = np.nan          # fastest change rate (obs)
        feats[f"maxslope_{b}_rest"] = np.nan         # fastest change rate (rest)

        feats[f"stetsonJ_{b}_obs"] = np.nan          # correlated variability (obs)
        feats[f"stetsonJ_{b}_rest"] = np.nan         # correlated variability (rest)

        feats[f"p5_{b}"] = np.nan                    # 5th percentile flux
        feats[f"p25_{b}"] = np.nan                   # 25th percentile flux
        feats[f"p75_{b}"] = np.nan                   # 75th percentile flux
        feats[f"p95_{b}"] = np.nan                   # 95th percentile flux
        feats[f"mad_{b}"] = np.nan                   # median absolute deviation
        feats[f"iqr_{b}"] = np.nan                   # interquartile range
        feats[f"mad_over_std_{b}"] = np.nan          # robust-to-standard variability ratio

        feats[f"fvar_{b}"] = np.nan                  # fractional variability (noise-corrected)

        # Skip bands with no data
        if nb == 0:
            continue

        feats["n_filters_present"] += 1

        # Extract time, brightness, and error for this band
        tb_obs = t_rel[m]
        fb = flux_corr[m]
        eb = err_corr[m]

        # Sort observations within the band by time
        order = np.argsort(tb_obs)
        tb_obs = tb_obs[order]
        fb = fb[order]
        eb = eb[order]

        # Convert to intrinsic time scale
        tb_rest = tb_obs / (1.0 + z)

        # Define baseline brightness and peak brightness
        baseline = float(np.median(fb))              # typical level (robust baseline)
        pidx = int(np.argmax(fb))                    # index of brightest point
        peak_flux = float(fb[pidx])                  # peak brightness

        tpeak_obs = float(tb_obs[pidx])              # time of peak (observed)
        tpeak_rest = float(tb_rest[pidx])            # time of peak (intrinsic)

        # Amplitude of brightening (relative to baseline)
        amp = peak_flux - baseline

        # Robust per-band amplitude using percentiles (stable against outliers)
        p5b, p25b, p75b, p95b = np.percentile(fb, [5, 25, 75, 95])
        feats[f"p5_{b}"] = float(p5b)
        feats[f"p25_{b}"] = float(p25b)
        feats[f"p75_{b}"] = float(p75b)
        feats[f"p95_{b}"] = float(p95b)
        feats[f"robust_amp_{b}"] = float(p95b - p5b)

        # Robust variability summaries
        feats[f"mad_{b}"] = median_abs_dev(fb)
        feats[f"iqr_{b}"] = iqr(fb)
        stdb = float(np.std(fb))
        feats[f"mad_over_std_{b}"] = float(feats[f"mad_{b}"] / (stdb + 1e-8))

        # Core band features
        feats[f"amp_{b}"] = float(amp)
        feats[f"tpeak_{b}_obs"] = tpeak_obs
        feats[f"tpeak_{b}_rest"] = tpeak_rest

        # Band-level quality + variability diagnostics
        feats[f"snrmax_{b}"] = float(np.max(np.abs(fb) / (eb + 1e-8)))  # best detection quality
        feats[f"eta_{b}"] = von_neumann_eta(fb)                         # smoothness vs noise
        feats[f"chi2_const_{b}"] = chi2_to_constant(fb, eb)             # variability vs constant

        # Linear trend + slope features (obs + rest frame)
        feats[f"slope_{b}_obs"] = linear_slope(tb_obs, fb)              # best-fit trend (obs)
        feats[f"slope_{b}_rest"] = linear_slope(tb_rest, fb)            # best-fit trend (rest)

        feats[f"maxslope_{b}_obs"] = max_slope(tb_obs, fb)              # fastest change (obs)
        feats[f"maxslope_{b}_rest"] = max_slope(tb_rest, fb)            # fastest change (rest)

        feats[f"stetsonJ_{b}_obs"] = stetson_J(tb_obs, fb, eb)          # correlated variability (obs)
        feats[f"stetsonJ_{b}_rest"] = stetson_J(tb_rest, fb, eb)        # correlated variability (rest)

        # Noise-corrected variability
        feats[f"fvar_{b}"] = fractional_variability(fb, eb)

        # Total positive signal above baseline (area under curve)
        if nb >= 2:
            feats[f"auc_pos_{b}_obs"] = float(trapz_safe(np.maximum(fb - baseline, 0.0), tb_obs))
            feats[f"auc_pos_{b}_rest"] = float(trapz_safe(np.maximum(fb - baseline, 0.0), tb_rest))

        # Width / rise / decay features at 50% and 80% of amplitude
        # (requires a positive amplitude and enough points to have both sides of the peak)
        if (amp > 0) and (nb >= 3):
            lvl50 = baseline + 0.50 * amp
            lvl80 = baseline + 0.80 * amp

            # Find the first time the curve crosses a level on the rise/decay side
            # (this is a coarse but stable width proxy without curve fitting)
            def first_crossing_time(tt, ff, level, mode):
                if len(tt) < 2:
                    return np.nan
                if mode == "rise":
                    idx = np.where(ff >= level)[0]
                    return float(tt[idx[0]]) if len(idx) else np.nan
                if mode == "decay":
                    idx = np.where(ff <= level)[0]
                    return float(tt[idx[0]]) if len(idx) else np.nan
                return np.nan

            # Split into rising and falling segments around the peak
            tb_rise_obs = tb_obs[:pidx + 1]
            fb_rise = fb[:pidx + 1]
            tb_dec_obs = tb_obs[pidx:]
            fb_dec = fb[pidx:]

            # Observed-frame width at 50% amplitude
            t_rise50_obs = first_crossing_time(tb_rise_obs, fb_rise, lvl50, "rise")
            t_fall50_obs = first_crossing_time(tb_dec_obs, fb_dec, lvl50, "decay")
            if (not np.isnan(t_rise50_obs)) and (not np.isnan(t_fall50_obs)):
                feats[f"width50_{b}_obs"] = float(t_fall50_obs - t_rise50_obs)
                feats[f"rise50_{b}_obs"] = float(tpeak_obs - t_rise50_obs)
                feats[f"decay50_{b}_obs"] = float(t_fall50_obs - tpeak_obs)
                feats[f"asym50_{b}_obs"] = float(feats[f"rise50_{b}_obs"] / (feats[f"decay50_{b}_obs"] + 1e-8))

            # Observed-frame width at 80% amplitude (no rise/decay split stored here)
            t_rise80_obs = first_crossing_time(tb_rise_obs, fb_rise, lvl80, "rise")
            t_fall80_obs = first_crossing_time(tb_dec_obs, fb_dec, lvl80, "decay")
            if (not np.isnan(t_rise80_obs)) and (not np.isnan(t_fall80_obs)):
                feats[f"width80_{b}_obs"] = float(t_fall80_obs - t_rise80_obs)

            # Rest-frame times for the same segments
            tb_rise_rest = tb_rest[:pidx + 1]
            tb_dec_rest = tb_rest[pidx:]

            # Rest-frame width at 50% amplitude
            t_rise50_rest = first_crossing_time(tb_rise_rest, fb_rise, lvl50, "rise")
            t_fall50_rest = first_crossing_time(tb_dec_rest, fb_dec, lvl50, "decay")
            if (not np.isnan(t_rise50_rest)) and (not np.isnan(t_fall50_rest)):
                feats[f"width50_{b}_rest"] = float(t_fall50_rest - t_rise50_rest)
                feats[f"rise50_{b}_rest"] = float(tpeak_rest - t_rise50_rest)
                feats[f"decay50_{b}_rest"] = float(t_fall50_rest - tpeak_rest)
                feats[f"asym50_{b}_rest"] = float(feats[f"rise50_{b}_rest"] / (feats[f"decay50_{b}_rest"] + 1e-8))

            # Rest-frame width at 80% amplitude
            t_rise80_rest = first_crossing_time(tb_rise_rest, fb_rise, lvl80, "rise")
            t_fall80_rest = first_crossing_time(tb_dec_rest, fb_dec, lvl80, "decay")
            if (not np.isnan(t_rise80_rest)) and (not np.isnan(t_fall80_rest)):
                feats[f"width80_{b}_rest"] = float(t_fall80_rest - t_rise80_rest)

        # Store values for cross-band comparisons
        band_amp[b] = feats[f"amp_{b}"]
        band_tpeak_obs[b] = feats[f"tpeak_{b}_obs"]
        band_tpeak_rest[b] = feats[f"tpeak_{b}_rest"]
        band_peak[b] = peak_flux

        # Store time series for interpolation-based color features
        band_tb_obs[b] = tb_obs
        band_tb_rest[b] = tb_rest
        band_fb[b] = fb

    # Cross-band comparison features between adjacent wavelengths
    # (captures color differences and peak-time lags across filters)
    pairs = [("u", "g"), ("g", "r"), ("r", "i"), ("i", "z"), ("z", "y")]
    for a, b in pairs:
        va, vb = band_amp.get(a, np.nan), band_amp.get(b, np.nan)
        ta_obs, tb_obs = band_tpeak_obs.get(a, np.nan), band_tpeak_obs.get(b, np.nan)
        ta_rest, tb_rest = band_tpeak_rest.get(a, np.nan), band_tpeak_rest.get(b, np.nan)
        pa, pb = band_peak.get(a, np.nan), band_peak.get(b, np.nan)

        # Difference in brightness amplitude
        feats[f"ampdiff_{a}{b}"] = (va - vb) if (not np.isnan(va) and not np.isnan(vb)) else np.nan

        # Difference in peak timing (obs + rest frame)
        feats[f"tpeakdiff_{a}{b}_obs"] = (ta_obs - tb_obs) if (not np.isnan(ta_obs) and not np.isnan(tb_obs)) else np.nan
        feats[f"tpeakdiff_{a}{b}_rest"] = (ta_rest - tb_rest) if (not np.isnan(ta_rest) and not np.isnan(tb_rest)) else np.nan

        # Ratio of peak brightness values
        feats[f"peakratio_{a}{b}"] = (pa / (pb + 1e-8)) if (not np.isnan(pa) and not np.isnan(pb)) else np.nan

    # Safe log transform used for color features
    # (log1p + clamp avoids log(negative) explosions when flux is noisy)
    def logp(x):
        if np.isnan(x):
            return np.nan
        return float(np.log1p(max(0.0, x)))

    # Color features anchored at r-band peak time
    # (measures spectral shape at peak and how it evolves shortly after)
    tpr_obs = feats.get("tpeak_r_obs", np.nan)
    if not np.isnan(tpr_obs):
        # Interpolate g, r, i flux at the r-band peak time
        fr = interp_flux_at_time(band_tb_obs.get("r", np.array([])), band_fb.get("r", np.array([])), tpr_obs)
        fg = interp_flux_at_time(band_tb_obs.get("g", np.array([])), band_fb.get("g", np.array([])), tpr_obs)
        fi = interp_flux_at_time(band_tb_obs.get("i", np.array([])), band_fb.get("i", np.array([])), tpr_obs)

        # Approximate colors using log-flux differences (stable scale, less dominated by raw amplitude)
        feats["color_gr_at_rpeak_obs"] = (logp(fg) - logp(fr)) if (not np.isnan(fg) and not np.isnan(fr)) else np.nan
        feats["color_ri_at_rpeak_obs"] = (logp(fr) - logp(fi)) if (not np.isnan(fr) and not np.isnan(fi)) else np.nan

        # Evaluate colors again 10 days after peak to estimate color evolution slope
        dt_obs = 10.0
        t2 = tpr_obs + dt_obs

        fr2 = interp_flux_at_time(band_tb_obs.get("r", np.array([])), band_fb.get("r", np.array([])), t2)
        fg2 = interp_flux_at_time(band_tb_obs.get("g", np.array([])), band_fb.get("g", np.array([])), t2)
        fi2 = interp_flux_at_time(band_tb_obs.get("i", np.array([])), band_fb.get("i", np.array([])), t2)

        cgr1 = feats["color_gr_at_rpeak_obs"]
        cri1 = feats["color_ri_at_rpeak_obs"]

        cgr2 = (logp(fg2) - logp(fr2)) if (not np.isnan(fg2) and not np.isnan(fr2)) else np.nan
        cri2 = (logp(fr2) - logp(fi2)) if (not np.isnan(fr2) and not np.isnan(fi2)) else np.nan

        # Color slopes (change per day over a 10-day window)
        feats["color_gr_slope10_obs"] = ((cgr2 - cgr1) / dt_obs) if (not np.isnan(cgr2) and not np.isnan(cgr1)) else np.nan
        feats["color_ri_slope10_obs"] = ((cri2 - cri1) / dt_obs) if (not np.isnan(cri2) and not np.isnan(cri1)) else np.nan
    else:
        # If r-band has no peak time (missing r-band), color-at-rpeak is undefined
        feats["color_gr_at_rpeak_obs"] = np.nan
        feats["color_ri_at_rpeak_obs"] = np.nan
        feats["color_gr_slope10_obs"] = np.nan
        feats["color_ri_slope10_obs"] = np.nan

    return feats
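
As a rough illustration of the first-crossing width computed above, the sketch below applies the same idea to a made-up five-point light curve. The times and fluxes are invented, and `width_at_fraction` is a hypothetical helper written for this example, not part of the notebook's pipeline.

```python
import numpy as np

def width_at_fraction(t, f, frac):
    """Coarse width at `frac` of the amplitude above the median baseline.

    Hypothetical helper: mirrors the first-crossing logic above,
    with no interpolation between samples.
    """
    t = np.asarray(t, float)
    f = np.asarray(f, float)
    baseline = np.median(f)
    pidx = int(np.argmax(f))
    amp = f[pidx] - baseline
    if amp <= 0 or len(t) < 3:
        return np.nan
    level = baseline + frac * amp
    rise = np.where(f[:pidx + 1] >= level)[0]   # first sample at/above level, up to the peak
    fall = np.where(f[pidx:] <= level)[0]       # first sample at/below level, from the peak on
    if len(rise) == 0 or len(fall) == 0:
        return np.nan
    return float(t[pidx:][fall[0]] - t[:pidx + 1][rise[0]])

# Made-up light curve sampled every 5 days
t = [0.0, 5.0, 10.0, 15.0, 20.0]
f = [1.0, 4.0, 10.0, 5.0, 1.0]
w50 = width_at_fraction(t, f, 0.50)
print(w50)
```

With samples only every 5 days, the rise crossing lands on the peak sample itself, so the width is quantized to the sampling cadence. That is the sense in which this proxy is coarse.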

In [5]:
def build_lightcurve_cache(splits, base_dir, kind="train"):
    base_dir = Path(base_dir)
    lc_cache = {}
    idx_cache = {}

    for s in splits:
        path = base_dir / str(s) / f"{kind}_full_lightcurves.csv"
        lc = pd.read_csv(path)
        lc["object_id"] = lc["object_id"].astype(str)
        groups = lc.groupby("object_id").indices
        lc_cache[s] = lc
        idx_cache[s] = groups

    return lc_cache, idx_cache


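def get_lightcurve(lc_cache, idx_cache, split, object_id):
    # Cache keys are strings (object_id is cast with astype(str) in
    # build_lightcurve_cache), so cast the lookup key to match.
    idx = idx_cache[split].get(str(object_id), None)
    if idx is None:
        return None
    return lc_cache[split].iloc[idx]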


def build_feature_table(
    log_df,
    lc_cache,
    idx_cache,
    augment_photoz=False,
    test_zerr_pool=None,
    n_aug=1,
    seed=6
):
    rng = np.random.default_rng(seed)
    rows = []

    if test_zerr_pool is not None:
        test_zerr_pool = np.asarray(test_zerr_pool, float)
        test_zerr_pool = test_zerr_pool[np.isfinite(test_zerr_pool)]
        test_zerr_pool = test_zerr_pool[test_zerr_pool > 0]

    for i in range(len(log_df)):
        r = log_df.iloc[i]
        obj = r["object_id"]
        split = r["split"]

        lc = get_lightcurve(lc_cache, idx_cache, split, obj)

        if lc is None:
            feats = {"n_obs": 0}
            feats["object_id"] = obj
            feats["split"] = split
            feats["photoz_aug"] = 0
            if "target" in log_df.columns:
                feats["target"] = int(r["target"])
            rows.append(feats)
            continue

        feats = extract_features_for_object(
            lc_raw=lc,
            z=r["Z"],
            z_err=r.get("Z_err", 0.0),
            ebv=r["EBV"],
        )
        feats["object_id"] = obj
        feats["split"] = split
        feats["photoz_aug"] = 0
        if "target" in log_df.columns:
            feats["target"] = int(r["target"])
        rows.append(feats)

        if augment_photoz and ("target" in log_df.columns) and (test_zerr_pool is not None) and (len(test_zerr_pool) > 0):
            z0 = safe_float(r["Z"], default=0.0)

            for _ in range(n_aug):
                sigma = float(rng.choice(test_zerr_pool))
                z_sim = max(0.0, z0 + float(rng.normal(0.0, sigma)))

                feats2 = extract_features_for_object(
                    lc_raw=lc,
                    z=z_sim,
                    z_err=sigma,
                    ebv=r["EBV"],
                )
                feats2["object_id"] = obj
                feats2["split"] = split
                feats2["target"] = int(r["target"])
                feats2["photoz_aug"] = 1
                rows.append(feats2)

    return pd.DataFrame(rows)
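
The photo-z augmentation step above can be sketched in isolation as follows. This is a minimal illustration with an invented redshift and error pool; `simulate_photoz` is a hypothetical name, and the real pipeline re-extracts all features at each simulated redshift rather than just returning the draws.

```python
import numpy as np

def simulate_photoz(z_true, zerr_pool, n_aug, seed=6):
    """Draw `n_aug` perturbed redshifts for one object (hypothetical helper).

    Each draw picks a sigma from the pool of test-set Z_err values and adds
    Gaussian noise of that width to the known redshift, clipped at 0.
    """
    rng = np.random.default_rng(seed)
    pool = np.asarray(zerr_pool, float)
    pool = pool[np.isfinite(pool) & (pool > 0)]   # same filtering as build_feature_table
    draws = []
    for _ in range(n_aug):
        sigma = float(rng.choice(pool))
        z_sim = max(0.0, z_true + float(rng.normal(0.0, sigma)))
        draws.append((z_sim, sigma))
    return draws

# Made-up inputs: one training redshift and a small pool of test Z_err values
draws = simulate_photoz(0.45, [0.02, 0.05, np.nan, 0.0, 0.10], n_aug=4)
for z_sim, sigma in draws:
    print(f"z_sim={z_sim:.4f}  z_err={sigma:.2f}")
```

Because rest-frame times are computed as `t_obs / (1 + z)`, every `*_rest` feature changes with each simulated redshift, while the observed-frame features stay the same.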

In [6]:
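def clean_features(df, drop_cols):
    """Return a model-ready matrix: drop `drop_cols`, map +/-inf to NaN,
    fill NaN with the per-column medians of `df`, then 0.0 for all-NaN columns.

    `photoz_aug` is not in `drop_cols` at any call site, so it is kept as a feature.
    """
    X = df.drop(columns=drop_cols).copy()

    X = X.replace([np.inf, -np.inf], np.nan)

    med = X.median(numeric_only=True)
    X = X.fillna(med)
    X = X.fillna(0.0)
    return X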


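def best_threshold_f1(y_true, probs):
    # Grid search over 200 thresholds in [0.01, 0.99]. A best value of 0.01 or 0.99
    # sits on the grid edge, so the true optimum may lie outside the searched range.
    ths = np.linspace(0.01, 0.99, 200)
    f1s = [f1_score(y_true, probs > t, zero_division=0) for t in ths]
    j = int(np.argmax(f1s))
    return float(ths[j]), float(f1s[j])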


def best_alpha_and_threshold(y_true, p_xgb, p_lgb):
    alphas = np.linspace(0.0, 1.0, 101)
    best = (0.5, 0.5, -1.0)  # alpha, th, f1

    for a in alphas:
        p = a * p_xgb + (1.0 - a) * p_lgb
        th, f1 = best_threshold_f1(y_true, p)
        if f1 > best[2]:
            best = (float(a), float(th), float(f1))

    return best


def make_splitter(n_splits, random_state=6):
    return StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

In [7]:
def run_optuna_xgb(train_feat, n_folds_tune=10, timeout_sec=7200):
    y = train_feat["target"].astype(int).to_numpy()
    groups = train_feat["split"].to_numpy()

    X = clean_features(train_feat, drop_cols=["object_id", "split", "target"])

    def objective(trial):
        params = {
            "objective": "binary:logistic",
            "eval_metric": "logloss",
            "random_state": 6,
            "n_jobs": -1,

            "tree_method": "hist",
            "device": "cuda",

            "n_estimators": trial.suggest_int("n_estimators", 600, 6000),
            "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.12, log=True),

            "max_depth": trial.suggest_int("max_depth", 2, 10),
            "min_child_weight": trial.suggest_int("min_child_weight", 1, 40),

            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),

            "gamma": trial.suggest_float("gamma", 0.0, 10.0),
            "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 20.0),
            "reg_lambda": trial.suggest_float("reg_lambda", 0.05, 30.0),

            "max_delta_step": trial.suggest_int("max_delta_step", 0, 10),

            "grow_policy": trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"]),
        }

        if params["grow_policy"] == "lossguide":
            params["max_leaves"] = trial.suggest_int("max_leaves", 16, 256)

        scores = []

        splitter = make_splitter(n_folds_tune, random_state=6)
        split_iter = splitter.split(X, y, groups)

        for fold, (tr_idx, va_idx) in enumerate(split_iter, 1):
            X_tr, y_tr = X.iloc[tr_idx], y[tr_idx]
            X_va, y_va = X.iloc[va_idx], y[va_idx]

            neg = np.sum(y_tr == 0)
            pos = np.sum(y_tr == 1)
            params["scale_pos_weight"] = float(neg / max(1, pos))

            model = XGBClassifier(**params)
            model.fit(X_tr, y_tr, verbose=False)

            probs = model.predict_proba(X_va)[:, 1]
            ap = average_precision_score(y_va, probs)
            scores.append(ap)

            trial.report(float(np.mean(scores)), step=fold)
            if trial.should_prune():
                raise optuna.TrialPruned()

        return float(np.mean(scores))

    sampler = optuna.samplers.TPESampler(seed=6, multivariate=True, group=True)
    pruner = optuna.pruners.MedianPruner(n_startup_trials=30, n_warmup_steps=3)

    study = optuna.create_study(
        direction="maximize",
        sampler=sampler,
        pruner=pruner,
        study_name="xgb_ap_split_cv_gpu",
        storage="sqlite:///optuna_xgb_ap_gpu.db",
        load_if_exists=True
    )

    study.optimize(objective, n_trials=999999, timeout=timeout_sec)

    print("\nOptuna best AP:", study.best_value)
    print("Best params:")
    for k, v in study.best_params.items():
        print(k, "=", v)

    return study.best_params


## Training the full ensemble (XGB + LGB) and calibrating the blend

This function trains two separate model families per fold:
- XGBoost models (using Optuna-tuned params)
- LightGBM models (fixed baseline params)

For each fold:
- train both models on the fold training set
- store OOF probabilities for both models
- report the best-F1 threshold of a temporary 0.5/0.5 blend (a per-fold sanity check only; it is not used for prediction)

After all folds:
- compute the best blend weight `alpha` and best threshold using full OOF predictions:
  - `alpha_best = 0.03`
  - `best_th = 0.01`
  - OOF blended best F1 = 0.51875

These values are then used for test prediction.
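
To make the meaning of `alpha` concrete, here is a small worked example of the blend formula `p = alpha * p_xgb + (1 - alpha) * p_lgb`. The two probabilities are made up, and `blend` is a hypothetical one-line helper used only for this illustration.

```python
def blend(alpha, p_xgb, p_lgb):
    """Weighted average of the two model probabilities; alpha is the XGBoost weight."""
    return alpha * p_xgb + (1.0 - alpha) * p_lgb

# Made-up case where the two models disagree strongly
p_xgb, p_lgb = 0.90, 0.10

p_even = blend(0.50, p_xgb, p_lgb)   # equal weights
p_oof = blend(0.03, p_xgb, p_lgb)    # the OOF-selected weight

print(f"alpha=0.50 -> {p_even:.3f}")
print(f"alpha=0.03 -> {p_oof:.3f}")
```

At `alpha = 0.03` the blend is 3% XGBoost and 97% LightGBM, so the output stays close to `p_lgb` even when XGBoost is confident.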


In [8]:
def train_full_ensemble(train_feat, xgb_params, n_splits_full=20):
    y = train_feat["target"].astype(int).to_numpy()
    groups = train_feat["split"].to_numpy()

    X = clean_features(train_feat, drop_cols=["object_id", "split", "target"])

    splitter = make_splitter(n_splits_full, random_state=6)
    split_iter = splitter.split(X, y, groups)

    xgb_base = {
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "random_state": 6,
        "n_jobs": -1,
        "tree_method": "hist",
        "device": "cuda",
        **xgb_params
    }

    lgb_base = dict(
        objective="binary",
        boosting_type="gbdt",
        n_estimators=4000,
        learning_rate=0.02,
        num_leaves=64,
        max_depth=-1,
        min_child_samples=20,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0.0,
        reg_lambda=0.0,
        n_jobs=-1,
        random_state=6
    )

    xgb_models = []
    lgb_models = []

    oof_xgb = np.zeros(len(X), dtype=float)
    oof_lgb = np.zeros(len(X), dtype=float)

    for fold, (tr_idx, va_idx) in enumerate(split_iter, 1):
        X_tr, y_tr = X.iloc[tr_idx], y[tr_idx]
        X_va, y_va = X.iloc[va_idx], y[va_idx]

        neg = np.sum(y_tr == 0)
        pos = np.sum(y_tr == 1)
        spw = float(neg / max(1, pos))

        xgb_base["scale_pos_weight"] = spw
        xgb_model = XGBClassifier(**xgb_base)
        xgb_model.fit(X_tr, y_tr, verbose=False)
        p_xgb = xgb_model.predict_proba(X_va)[:, 1]
        oof_xgb[va_idx] = p_xgb
        xgb_models.append(xgb_model)

        lgb_model = LGBMClassifier(**{**lgb_base, "scale_pos_weight": spw})
        lgb_model.fit(X_tr, y_tr)
        p_lgb = lgb_model.predict_proba(X_va)[:, 1]
        oof_lgb[va_idx] = p_lgb
        lgb_models.append(lgb_model)

        p_tmp = 0.5 * p_xgb + 0.5 * p_lgb
        th, f1 = best_threshold_f1(y_va, p_tmp)
        print(f"Fold {fold:02d} | temp blend(0.5) best F1={f1:.4f} @ th={th:.3f}")

    alpha_best, th_best, f1_best = best_alpha_and_threshold(y, oof_xgb, oof_lgb)
    print("\nOOF best alpha:", alpha_best)
    print("OOF best threshold:", th_best)
    print("OOF blended best F1:", f1_best)

    return xgb_models, lgb_models, alpha_best, th_best


def predict_ensemble(test_feat, xgb_models, lgb_models, alpha):
    X_test = clean_features(test_feat, drop_cols=["object_id", "split"])

    p_xgb = np.mean([m.predict_proba(X_test)[:, 1] for m in xgb_models], axis=0)
    p_lgb = np.mean([m.predict_proba(X_test)[:, 1] for m in lgb_models], axis=0)

    p_blend = alpha * p_xgb + (1.0 - alpha) * p_lgb
    return p_blend

In [9]:
from pathlib import Path
ROOT = Path.cwd().parents[0]
DATA_DIR = ROOT / "data"

train_log = pd.read_csv(DATA_DIR / "train_log.csv")
test_log  = pd.read_csv(DATA_DIR / "test_log.csv")

train_log["Z_err"] = train_log["Z_err"].fillna(0.0)
test_log["Z_err"] = test_log["Z_err"].fillna(0.0)

train_splits = sorted(train_log["split"].unique())
test_splits = sorted(test_log["split"].unique())

train_lc_cache, train_idx_cache = build_lightcurve_cache(train_splits, DATA_DIR, kind="train")
test_lc_cache, test_idx_cache = build_lightcurve_cache(test_splits, DATA_DIR, kind="test")
test_zerr_pool = test_log["Z_err"].dropna().values

In [24]:
train_feat = build_feature_table(
    train_log, train_lc_cache, train_idx_cache,
    augment_photoz=True,
    test_zerr_pool=test_zerr_pool,
    n_aug=1,
    seed=6
)

test_feat = build_feature_table(test_log, test_lc_cache, test_idx_cache)

print("train_feat:", train_feat.shape)
print("test_feat :", test_feat.shape)

train_feat: (6086, 272)
test_feat : (7135, 271)


In [None]:
best_xgb_params = run_optuna_xgb(train_feat, n_folds_tune=10, timeout_sec=7200)
xgb_models, lgb_models, alpha_best, best_th = train_full_ensemble(
    train_feat, best_xgb_params, n_splits_full=len(train_splits)
)

Best Parameters:
```json
{
  "n_estimators": 4328,
  "learning_rate": 0.007080,
  "max_depth": 4,
  "min_child_weight": 1,
  "subsample": 0.5936,
  "colsample_bytree": 0.9147,
  "gamma": 0.6466,
  "reg_alpha": 4.1301,
  "reg_lambda": 5.5650,
  "max_delta_step": 1,
  "grow_policy": "depthwise"
}
```

OOF best alpha: 0.03  
OOF best threshold: 0.01  
OOF blended best F1: 0.51875  

These results are strange: the threshold is extremely low, and it almost looks like an error, which wouldn't surprise me. Note that 0.01 is the smallest value in the grid searched by `best_threshold_f1` (`np.linspace(0.01, 0.99, 200)`), so the search stopped at the edge of its range and the true optimum may be lower still. I didn't give this model much attention; it was more of a test I ran overnight to see whether a small ensemble improved performance. I saw the 0.03 and assumed LGBM wasn't pulling its weight, when it was actually XGB that wasn't helping: with `p = alpha * p_xgb + (1 - alpha) * p_lgb`, alpha = 0.03 puts 97% of the weight on LGBM. I even ran a test on a later model and LGBM was weighted at ~0 every single time, so I removed LGBM from subsequent models. Next competition I will give LGBM, and possibly CatBoost, more of a chance instead of ignoring them. I know I used LGBM for predicting SpecType, but I could use it a lot more.
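
One way to check whether a threshold like 0.01 is an artifact of the search grid is to flag a best value that lands on the grid edge and rescan with a grid that reaches lower. The sketch below does this on synthetic scores that are deliberately squeezed near zero; the data are made up and are not the notebook's OOF predictions, and `f1_at` and `scan_thresholds` are hypothetical helpers (F1 is computed by hand so the block needs only NumPy).

```python
import numpy as np

def f1_at(y_true, probs, th):
    """F1 of the rule `probs > th`, computed from TP/FP/FN counts."""
    pred = probs > th
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def scan_thresholds(y_true, probs, grid):
    """Return (best threshold, best F1, whether the best sits on a grid edge)."""
    f1s = np.array([f1_at(y_true, probs, t) for t in grid])
    j = int(np.argmax(f1s))
    on_edge = (j == 0) or (j == len(grid) - 1)
    return float(grid[j]), float(f1s[j]), on_edge

# Synthetic, badly scaled scores: positives in [0.002, 0.02], negatives in [0, 0.004]
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.05).astype(int)
probs = np.where(y == 1,
                 rng.uniform(0.002, 0.02, y.size),
                 rng.uniform(0.0, 0.004, y.size))

coarse = np.linspace(0.01, 0.99, 200)    # same grid as best_threshold_f1
fine = np.geomspace(1e-4, 0.99, 400)     # log-spaced grid reaching below 0.01

th_c, f1_c, edge_c = scan_thresholds(y, probs, coarse)
th_f, f1_f, edge_f = scan_thresholds(y, probs, fine)
print(f"coarse: th={th_c:.4f} F1={f1_c:.3f} on_edge={edge_c}")
print(f"fine  : th={th_f:.4f} F1={f1_f:.3f} on_edge={edge_f}")
```

If the same check on the real OOF probabilities reports `on_edge=True`, the 0.01 threshold says more about the grid's lower bound than about the model, and a rescan with a lower bound would be the natural next step.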

In [None]:
test_probs = predict_ensemble(test_feat, xgb_models, lgb_models, alpha=alpha_best)
test_pred = (test_probs > best_th).astype(int)

sub = pd.DataFrame({
    "object_id": test_feat["object_id"].values,
    "target": test_pred
})
sub.to_csv("XGB-LGBM-2.csv", index=False)