# Model 2 (Physics-aware Features + Split-Safe CV XGBoost Ensemble)

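Model 2 is a major upgrade over Model 1: it adds dust correction, a richer feature set, split-aware validation, a fold ensemble, and CV-based tuning.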

## What changed vs Model 1
### De-extinction correction (EBV)
Model 1 used raw flux values directly.
Model 2 applies de-extinction using EBV to correct flux measurements for dust obscuration before feature extraction.

De-extinction is a common astronomy preprocessing step, and a notebook published by the competition host showed how to apply it.

### Stronger feature extraction pipeline
Model 1 had simple summary stats.
Model 2 builds a much richer set of time-series features.

### Split-aware cross validation (GroupKFold)
Model 1 used a standard `train_test_split`, which can leak patterns across splits.
Model 2 uses GroupKFold grouped by `split`, meaning the model must generalize across different split domains.

This makes validation more realistic and reduces leakage.
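
To illustrate the idea of split-aware validation, here is a minimal sketch that holds out one whole group at a time. It is a hand-rolled NumPy stand-in, not scikit-learn's `GroupKFold` itself; when the number of folds equals the number of groups, the two should produce the same held-out sets, though possibly in a different order. The helper name `leave_one_group_out` and the split labels are hypothetical.

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (train_idx, val_idx) pairs, holding out one whole group per fold."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        val_mask = groups == g
        yield np.where(~val_mask)[0], np.where(val_mask)[0]

# Hypothetical split labels for eight objects
groups = np.array(["split_01", "split_01", "split_02", "split_02",
                   "split_02", "split_03", "split_03", "split_03"])

folds = list(leave_one_group_out(groups))
for tr_idx, va_idx in folds:
    # No split label appears on both the training and validation side
    assert set(groups[tr_idx]).isdisjoint(groups[va_idx])
```

A plain random split would usually place objects from the same split on both sides, which is the leakage this scheme is meant to avoid.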

### Fold ensemble + OOF thresholding
Instead of training one model, Model 2 trains one model per fold and:
- collects out-of-fold predictions (OOF)
- selects a global best threshold based on OOF
- predicts test using the average of fold probabilities
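
The three steps above can be sketched on toy numbers. This is only an illustration under assumptions: `f1_binary` and `pick_threshold` are hypothetical helpers (a hand-written F1 standing in for `sklearn.metrics.f1_score`), and the probabilities are made up.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for the positive class; 0.0 when there is nothing to score."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def pick_threshold(y_true, oof_prob, grid):
    """Return the first grid threshold with the highest F1 on OOF predictions."""
    scores = [f1_binary(y_true, oof_prob > t) for t in grid]
    j = int(np.argmax(scores))
    return float(grid[j]), float(scores[j])

# Made-up labels and out-of-fold probabilities (one prediction per training row)
y = np.array([0, 0, 0, 1, 0, 1, 1, 0])
oof = np.array([0.05, 0.10, 0.30, 0.35, 0.20, 0.80, 0.60, 0.15])
th, f1 = pick_threshold(y, oof, np.linspace(0.01, 0.99, 199))

# Made-up test probabilities from three fold models (rows) for two objects (columns)
fold_probs = np.array([[0.10, 0.70], [0.20, 0.90], [0.15, 0.50]])
test_prob = fold_probs.mean(axis=0)       # average across fold models
test_pred = (test_prob > th).astype(int)  # apply the single global threshold
```

Choosing one threshold on the pooled OOF predictions, rather than one per fold, gives a single cut-off that can be applied to the averaged test probabilities.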

### Optuna tuning is now CV-based (not one holdout split)
Model 1 tuned on one validation split.
Model 2 tunes hyperparameters using grouped CV, so the best params are more stable.

## Performance
- Public leaderboard F1: 0.5921


In [None]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score
from scipy.optimize import curve_fit

FILTERS = ["u", "g", "r", "i", "z", "y"]

# Effective wavelengths (in angstroms) for each band, provided by the competition and sourced from the SVO Filter Profile Service
EFF_WL_AA = {
    "u": 3641.0,
    "g": 4704.0,
    "r": 6155.0,
    "i": 7504.0,
    "z": 8695.0,
    "y": 10056.0,
}

## De-extinction using EBV (dust correction)

In astronomy, observed flux is often reduced by dust between the source and the observer.

- EBV represents how much the source is obscured by dust (reddening)
- Different filters are affected differently because dust attenuation depends on wavelength

This model applies de-extinction before feature extraction, meaning the model uses an estimate of the intrinsic flux rather than the dust-dimmed observed flux.
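
As a rough illustration of the correction, an extinction of `A_lambda` magnitudes in a band corresponds to multiplying the observed flux by `10 ** (0.4 * A_lambda)`. The sketch below uses made-up `A_lambda` values instead of an extinction law, and the helper name `deextinct` is hypothetical.

```python
import numpy as np

def deextinct(flux, flux_err, a_lambda):
    """Scale flux and its error by 10**(0.4 * A_lambda) to undo dust dimming."""
    fac = 10.0 ** (0.4 * a_lambda)
    return flux * fac, flux_err * fac

flux = np.array([10.0, 20.0])
err = np.array([1.0, 2.0])

# A_lambda values are made up for illustration: larger in a bluer band,
# smaller in a redder band
blue_flux, blue_err = deextinct(flux, err, a_lambda=0.5)
red_flux, red_err = deextinct(flux, err, a_lambda=0.1)
```

Because flux and error are scaled by the same factor, the signal-to-noise ratio within a band is unchanged; what changes is the relative brightness between bands.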

## De-extinction functions

These functions correct flux values using the Fitzpatrick (1999) extinction law.

### `deextinct_band()`
Applies a wavelength-dependent correction factor to flux and flux error for one filter band:
- converts EBV into A_V using R_V (default 3.1)
- uses the effective wavelength of the band
- returns corrected flux, corrected error, and A_lambda

### `deextinct_lightcurve()`
Applies `deextinct_band` across all filters in the lightcurve so that feature extraction uses corrected flux values.

**I am not an astronomer, so I used AI to help me implement this correctly.
The competition also provided an example notebook showing the de-extinction steps.**


In [None]:
from extinction import fitzpatrick99

def deextinct_band(flux, flux_err, ebv, band, r_v=3.1):
    if ebv is None or (isinstance(ebv, float) and np.isnan(ebv)):
        return flux, flux_err, 0.0

    A_V = float(ebv) * float(r_v)
    wave = np.array([EFF_WL_AA[band]], dtype=float)

    A_lambda = float(fitzpatrick99(wave, A_V, r_v=r_v, unit="aa")[0])

    fac = 10.0 ** (0.4 * A_lambda)
    return flux * fac, flux_err * fac, A_lambda


def deextinct_lightcurve(lc, ebv):
    flux = lc["Flux"].to_numpy().astype(float)
    ferr = lc["Flux_err"].to_numpy().astype(float)
    filt = lc["Filter"].to_numpy()

    flux_corr = flux.copy()
    ferr_corr = ferr.copy()

    for b in FILTERS:
        m = (filt == b)
        if not np.any(m):
            continue
        flux_corr[m], ferr_corr[m], _ = deextinct_band(flux_corr[m], ferr_corr[m], ebv, b)

    return flux_corr, ferr_corr

## Utility functions (safe casting + robust statistics)

These helper functions are used throughout feature extraction:
- `safe_float`: converts values to float and handles missing values
- weighted summary functions: mean and std with weights
- robust dispersion stats like MAD and IQR
- distribution shape metrics like skewness and kurtosis
- variability measures like von Neumann eta
- slope-based time-series stats
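
To give a feel for one of these statistics, the sketch below evaluates the von Neumann eta (mean squared successive difference divided by the variance) on a smooth curve and on pure noise. The function mirrors the notebook's `von_neumann_eta`; the two input series are synthetic and exist only for this illustration.

```python
import numpy as np

def von_neumann_eta(x):
    """Mean squared successive difference divided by the variance."""
    x = np.asarray(x, dtype=float)
    if len(x) < 3:
        return np.nan
    v = np.var(x)
    if v < 1e-12:
        return 0.0
    return float(np.mean(np.diff(x) ** 2) / v)

t = np.linspace(0.0, 1.0, 200)
smooth = np.sin(2 * np.pi * t)      # slowly varying signal
rng = np.random.default_rng(0)
noisy = rng.normal(size=200)        # uncorrelated noise

eta_smooth = von_neumann_eta(smooth)
eta_noisy = von_neumann_eta(noisy)
```

Smooth, correlated series give values close to 0, while uncorrelated noise gives values around 2, which is why the statistic is useful for separating real variability from scatter.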


In [None]:
def safe_float(x, default=np.nan):
    try:
        if x is None:
            return default
        x = float(x)
        if np.isnan(x):
            return default
        return x
    except Exception:
        return default


def weighted_mean(x, w):
    s = np.sum(w)
    if s <= 0:
        return np.nan
    return float(np.sum(w * x) / s)


def weighted_std(x, w):
    mu = weighted_mean(x, w)
    if np.isnan(mu):
        return np.nan
    s = np.sum(w)
    if s <= 0:
        return np.nan
    var = np.sum(w * (x - mu) ** 2) / s
    return float(np.sqrt(var))


def median_abs_dev(x):
    med = np.median(x)
    return float(np.median(np.abs(x - med)))


def iqr(x):
    q75, q25 = np.percentile(x, [75, 25])
    return float(q75 - q25)


def skewness(x):
    x = np.asarray(x)
    mu = np.mean(x)
    s = np.std(x)
    if s < 1e-12:
        return 0.0
    m3 = np.mean((x - mu) ** 3)
    return float(m3 / (s ** 3))


def kurtosis_excess(x):
    x = np.asarray(x)
    mu = np.mean(x)
    s = np.std(x)
    if s < 1e-12:
        return 0.0
    m4 = np.mean((x - mu) ** 4)
    return float(m4 / (s ** 4) - 3.0)


def von_neumann_eta(x):
    x = np.asarray(x)
    n = len(x)
    if n < 3:
        return np.nan
    v = np.var(x)
    if v < 1e-12:
        return 0.0
    dif = np.diff(x)
    return float(np.mean(dif ** 2) / v)


def fraction_beyond_n_std(x, n=1.5):
    x = np.asarray(x)
    if len(x) < 3:
        return np.nan
    mu = np.mean(x)
    s = np.std(x)
    if s < 1e-12:
        return 0.0
    return float(np.mean(np.abs(x - mu) > n * s))


def max_slope(t, f):
    if len(t) < 3:
        return np.nan
    dt = np.diff(t)
    df = np.diff(f)
    good = dt > 0
    if not np.any(good):
        return np.nan
    slopes = df[good] / dt[good]
    return float(np.max(np.abs(slopes)))


def median_abs_slope(t, f):
    if len(t) < 3:
        return np.nan
    dt = np.diff(t)
    df = np.diff(f)
    good = dt > 0
    if not np.any(good):
        return np.nan
    slopes = df[good] / dt[good]
    return float(np.median(np.abs(slopes)))


def first_crossing_time(t, f, level, mode):
    if len(t) < 2:
        return np.nan

    if mode == "rise":
        idx = np.where(f >= level)[0]
        if len(idx) == 0:
            return np.nan
        return float(t[idx[0]])

    if mode == "decay":
        idx = np.where(f <= level)[0]
        if len(idx) == 0:
            return np.nan
        return float(t[idx[0]])

    return np.nan

## Bazin parametric lightcurve model (curve fitting)

This model fits a Bazin-style parametric function to the lightcurve:

- captures rise and decay behavior in a compact set of parameters
- gives physically meaningful shape descriptors like:
  - rise time
  - fall time
  - peak time
  - fit quality (reduced chi^2-like metric)

In addition to summary stats, the model can learn from fitted shape parameters that describe the transient profile more directly. A participant in a Kaggle discussion also suggested paying attention to the shape of the lightcurve.
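
For intuition about the parameters, the sketch below evaluates a Bazin-style curve on a dense grid without any fitting. The parameter values are arbitrary, and the unclipped `bazin` helper is a simplified version of the function written for this example.

```python
import numpy as np

def bazin(t, A, t0, tfall, trise, B):
    """Exponential decay multiplied by a sigmoid rise, plus a constant baseline."""
    return A * np.exp(-(t - t0) / tfall) / (1.0 + np.exp(-(t - t0) / trise)) + B

# Arbitrary illustrative parameters
t = np.linspace(-50.0, 150.0, 2001)
flux = bazin(t, A=100.0, t0=0.0, tfall=30.0, trise=3.0, B=5.0)

t_peak = float(t[np.argmax(flux)])
# For tfall > trise the maximum sits at t0 + trise * ln(tfall / trise - 1)
t_peak_expected = 0.0 + 3.0 * np.log(30.0 / 3.0 - 1.0)
```

Note that `t0` is a reference time rather than the time of maximum: with these values the curve peaks several days after `t0`, and well before the rise it flattens to the baseline `B`.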

In [None]:
def bazin_model(t, A, t0, tfall, trise, B):
    x1 = -(t - t0) / tfall
    x2 = -(t - t0) / trise
    x1 = np.clip(x1, -60, 60)
    x2 = np.clip(x2, -60, 60)
    return A * np.exp(x1) / (1.0 + np.exp(x2)) + B



def bazin_fit_features(t, f):
    out = {
        "bazin_A": np.nan,
        "bazin_B": np.nan,
        "bazin_t0": np.nan,
        "bazin_trise": np.nan,
        "bazin_tfall": np.nan,
        "bazin_redchi2": np.nan,
    }

    if len(t) < 6:
        return out

    t = np.asarray(t, dtype=float)
    f = np.asarray(f, dtype=float)

    B0 = float(np.median(f))
    A0 = float(np.max(f) - B0)
    t0 = float(t[np.argmax(f)])
    tr0 = max(1.0, 0.1 * (t.max() - t.min() + 1e-6))
    tf0 = max(5.0, 0.3 * (t.max() - t.min() + 1e-6))

    p0 = [A0, t0, tf0, tr0, B0]

    lower = [0.0, t.min(), 0.5, 0.5, np.min(f) - abs(A0)]
    upper = [10.0 * abs(A0) + 1e-6, t.max(), 5000.0, 5000.0, np.max(f) + abs(A0)]

    try:
        popt, _ = curve_fit(
            bazin_model,
            t,
            f,
            p0=p0,
            bounds=(lower, upper),
            maxfev=20000
        )

        A, t0, tfall, trise, B = popt
        pred = bazin_model(t, *popt)

        resid = f - pred
        dof = max(1, len(t) - len(popt))
        redchi2 = float(np.sum(resid ** 2) / dof)

        out["bazin_A"] = float(A)
        out["bazin_B"] = float(B)
        out["bazin_t0"] = float(t0)
        out["bazin_trise"] = float(trise)
        out["bazin_tfall"] = float(tfall)
        out["bazin_redchi2"] = float(redchi2)

    except Exception:
        pass

    return out

## Feature extraction for a single object (high-level)

`extract_features_for_object()` converts one raw lightcurve into a single feature row.

Key preprocessing steps added in this model:
- Sort observations in time order
- Apply de-extinction correction using EBV
- Convert time to:
  - observed time (relative)
  - rest-frame time using redshift Z

## Features

### Global (all filters combined)
- `n_obs`: total number of observations across all filters  
- `total_time_obs`: observed-frame time baseline `(max(t_rel) - min(t_rel))`  
- `total_time_rest`: rest-frame time baseline `(max(t_rest) - min(t_rest)) = total_time_obs / (1+z)`  

**Flux distribution (using extinction-corrected `flux_corr`)**
- `flux_mean`: mean corrected flux  
- `flux_median`: median corrected flux  
- `flux_std`: standard deviation of corrected flux  
- `flux_min`: minimum corrected flux  
- `flux_max`: maximum corrected flux  
- `flux_range`: `flux_max - flux_min`  
- `flux_mad`: median absolute deviation of corrected flux  
- `flux_iqr`: interquartile range of corrected flux  
- `flux_skew`: skewness of corrected flux distribution  
- `flux_kurt_excess`: excess kurtosis of corrected flux distribution  
- `neg_flux_frac`: fraction of corrected flux values `< 0`  

**SNR (using corrected errors `err_corr`)**
- `snr_median`: median SNR where `snr = |flux_corr| / (err_corr + 1e-8)`  
- `snr_max`: maximum SNR  

**Cadence / gaps**
- `median_dt`: median time step between consecutive observations in `t_rel`  
- `max_gap`: maximum time step between consecutive observations in `t_rel`  

**Time-series variability / shape**
- `eta_von_neumann`: von Neumann eta statistic (smoothness/jumpiness)  
- `beyond_1p5std`: fraction of points beyond `1.5 * std` from the center  
- `max_slope_global`: maximum absolute slope over time (fastest change)  
- `med_abs_slope_global`: median absolute slope over time (typical change rate)  

**Context metadata**
- `Z`: redshift  
- `log1pZ`: `log(1+z)`  
- `EBV`: dust reddening used for extinction correction  

**Filter coverage**
- `n_filters_present`: number of filters with at least 1 observation  
- `total_obs`: total observations summed across all filters  

### Per-filter features (for each band `b ∈ {u,g,r,i,z,y}`)
- `n_{b}`: number of observations in band `b`  
- `amp_{b}`: peak amplitude above baseline in band `b` where `baseline = median(fb)` and `amp = max(fb) - baseline`  
- `tpeak_{b}_obs`: observed-frame time of peak flux in band `b`  
- `tpeak_{b}_rest`: rest-frame time of peak flux in band `b`  
- `width50_{b}_rest`: rest-frame width above 50% amplitude (if measurable)  
- `width80_{b}_rest`: rest-frame width above 80% amplitude (if measurable)  
- `auc_pos_{b}_rest`: rest-frame trapezoid integral of `max(fb - baseline, 0)`  
- `snrmax_{b}`: max SNR within band `b`  
- `eta_{b}`: von Neumann eta computed within band `b`  
- `maxslope_{b}`: maximum slope within band `b` (rest-frame time)  
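
As a worked example of two of these features, the sketch below computes the amplitude and the positive area under the curve for a made-up single-band lightcurve. The trapezoid rule is written out by hand so the snippet does not depend on `np.trapezoid`, which requires NumPy 2.0 or newer.

```python
import numpy as np

# Made-up rest-frame times and corrected fluxes for one band
tb_rest = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
fb = np.array([1.0, 1.0, 5.0, 3.0, 1.0])

baseline = float(np.median(fb))           # typical flux level
amp = float(np.max(fb) - baseline)        # peak height above the baseline

# Keep only the part of the curve above the baseline, then integrate it
excess = np.maximum(fb - baseline, 0.0)
auc_pos = float(np.sum(0.5 * (excess[1:] + excess[:-1]) * np.diff(tb_rest)))
```

Here the baseline is 1, the amplitude is 4, and the excess flux `[0, 0, 4, 2, 0]` integrates to 6.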

**Bazin fit features (only if `n_b >= 6`)**
- `bazin_A_{b}`: Bazin amplitude-like parameter  
- `bazin_B_{b}`: Bazin baseline-like parameter  
- `bazin_trise_{b}`: Bazin rise timescale  
- `bazin_tfall_{b}`: Bazin fall timescale  
- `bazin_redchi2_{b}`: reduced chi-square-like fit quality (unweighted sum of squared residuals divided by the degrees of freedom)  

### Cross-band pair features (adjacent pairs: `ug, gr, ri, iz, zy`)
For each pair `(a,b)`:
- `ampdiff_{a}{b}`: `amp_a - amp_b`  
- `tpeakdiff_{a}{b}_rest`: `tpeak_a_rest - tpeak_b_rest`  
- `peakratio_{a}{b}`: `peak_flux_a / (peak_flux_b + 1e-8)`  
- `aucdiff_{a}{b}`: `auc_a - auc_b`  


In [None]:
def extract_features_for_object(lc_raw, z, ebv):
    feats = {}

    # Sort observations by time so time-based calculations make sense
    lc = lc_raw.sort_values("Time (MJD)").reset_index(drop=True)

    # Extract time values and filter (band) labels
    t = lc["Time (MJD)"].to_numpy().astype(float)
    filt = lc["Filter"].to_numpy()

    # If there are no observations, return minimal info
    if len(t) == 0:
        feats["n_obs"] = 0
        return feats

    # Correct brightness values for dust in the Milky Way
    # (dust makes objects look dimmer than they really are)
    flux_corr, err_corr = deextinct_lightcurve(lc, ebv)

    # Make sure redshift is a valid number
    z = safe_float(z, default=0.0)

    # Convert time to start at 0 (relative time axis)
    t_rel = t - t.min()

    # Convert observed time to intrinsic time of the object
    # Distant objects appear to evolve slower, so divide by (1 + z)
    t_rest = t_rel / (1.0 + z)

    # Basic observation statistics
    feats["n_obs"] = int(len(t))                                  # total number of measurements
    feats["total_time_obs"] = float(t_rel.max() - t_rel.min())    # total observed duration
    feats["total_time_rest"] = float(t_rest.max() - t_rest.min()) # duration corrected for distance effects

    # Global brightness statistics (after dust correction)
    feats["flux_mean"] = float(np.mean(flux_corr))
    feats["flux_median"] = float(np.median(flux_corr))
    feats["flux_std"] = float(np.std(flux_corr))
    feats["flux_min"] = float(np.min(flux_corr))
    feats["flux_max"] = float(np.max(flux_corr))
    feats["flux_range"] = feats["flux_max"] - feats["flux_min"]

    # Robust statistics that are less sensitive to outliers
    feats["flux_mad"] = median_abs_dev(flux_corr)   # median absolute deviation
    feats["flux_iqr"] = iqr(flux_corr)              # interquartile range

    # Distribution shape features
    feats["flux_skew"] = skewness(flux_corr)                # asymmetry of values
    feats["flux_kurt_excess"] = kurtosis_excess(flux_corr)  # tail heaviness / spikiness

    # Fraction of measurements that are below zero
    # (often indicates noise-dominated detections)
    feats["neg_flux_frac"] = float(np.mean(flux_corr < 0))

    # Signal-to-noise ratio summaries
    snr = np.abs(flux_corr) / (err_corr + 1e-8)
    feats["snr_median"] = float(np.median(snr))    # typical signal quality
    feats["snr_max"] = float(np.max(snr))          # strongest detection

    # Observation timing properties
    if len(t) >= 2:
        dt = np.diff(t_rel)
        feats["median_dt"] = float(np.median(dt))  # typical time between observations
        feats["max_gap"] = float(np.max(dt))       # largest observation gap
    else:
        feats["median_dt"] = np.nan
        feats["max_gap"] = np.nan

    # Global time-series shape features
    feats["eta_von_neumann"] = von_neumann_eta(flux_corr)              # smoothness vs noise
    feats["beyond_1p5std"] = fraction_beyond_n_std(flux_corr, n=1.5)   # outlier fraction
    feats["max_slope_global"] = max_slope(t_rel, flux_corr)            # fastest brightness change
    feats["med_abs_slope_global"] = median_abs_slope(t_rel, flux_corr) # typical change rate

    # Metadata features
    feats["Z"] = float(z)                           # distance proxy
    feats["log1pZ"] = float(np.log1p(max(0.0, z)))  # scaled redshift
    feats["EBV"] = safe_float(ebv, default=np.nan)  # dust amount

    # Counters for band coverage
    feats["n_filters_present"] = 0
    feats["total_obs"] = 0

    # Storage for cross-band comparison features
    band_amp = {}
    band_tpeak = {}
    band_peak = {}
    band_auc = {}

    # Loop over each wavelength band (u, g, r, i, z, y)
    for b in FILTERS:
        m = (filt == b)
        nb = int(np.sum(m))

        # Number of observations in this band
        feats[f"n_{b}"] = nb
        feats["total_obs"] += nb

        # Initialize band features as missing by default
        feats[f"amp_{b}"] = np.nan
        feats[f"tpeak_{b}_obs"] = np.nan
        feats[f"tpeak_{b}_rest"] = np.nan
        feats[f"width50_{b}_rest"] = np.nan
        feats[f"width80_{b}_rest"] = np.nan
        feats[f"auc_pos_{b}_rest"] = np.nan
        feats[f"snrmax_{b}"] = np.nan
        feats[f"eta_{b}"] = np.nan
        feats[f"maxslope_{b}"] = np.nan

        # Parametric light-curve shape features
        feats[f"bazin_A_{b}"] = np.nan
        feats[f"bazin_B_{b}"] = np.nan
        feats[f"bazin_trise_{b}"] = np.nan
        feats[f"bazin_tfall_{b}"] = np.nan
        feats[f"bazin_redchi2_{b}"] = np.nan

        # Skip bands with no data
        if nb == 0:
            continue

        feats["n_filters_present"] += 1

        # Extract time, brightness, and error for this band
        tb = t_rel[m]
        fb = flux_corr[m]
        eb = err_corr[m]

        # Sort observations within the band by time
        order = np.argsort(tb)
        tb = tb[order]
        fb = fb[order]
        eb = eb[order]

        # Convert to intrinsic time scale
        tb_rest = tb / (1.0 + z)

        # Define baseline brightness and peak brightness
        baseline = float(np.median(fb))     # typical level
        pidx = int(np.argmax(fb))           # index of brightest point
        peak_flux = float(fb[pidx])         # peak brightness
        tpeak_obs = float(tb[pidx])         # time of peak (observed)
        tpeak_rest = float(tb_rest[pidx])   # time of peak (intrinsic)

        # Amplitude of brightening
        amp = peak_flux - baseline

        # Core band features
        feats[f"amp_{b}"] = float(amp)
        feats[f"tpeak_{b}_obs"] = tpeak_obs
        feats[f"tpeak_{b}_rest"] = tpeak_rest
        feats[f"snrmax_{b}"] = float(np.max(np.abs(fb) / (eb + 1e-8)))

        feats[f"eta_{b}"] = von_neumann_eta(fb)         # smoothness
        feats[f"maxslope_{b}"] = max_slope(tb_rest, fb) # fastest change rate

        # Total positive signal above baseline (area under curve)
        if nb >= 2:
            feats[f"auc_pos_{b}_rest"] = float(
                np.trapezoid(np.maximum(fb - baseline, 0.0), tb_rest)
            )

        # Width of the brightening at 50% and 80% of peak
        if (amp > 0) and (nb >= 3):
            lvl50 = baseline + 0.50 * amp
            lvl80 = baseline + 0.80 * amp

            # Rising side
            trise_seg_t = tb_rest[:pidx + 1]
            trise_seg_f = fb[:pidx + 1]

            # Falling side
            tdec_seg_t = tb_rest[pidx:]
            tdec_seg_f = fb[pidx:]

            # Width at 50%
            t_rise50 = first_crossing_time(trise_seg_t, trise_seg_f, lvl50, "rise")
            t_fall50 = first_crossing_time(tdec_seg_t, tdec_seg_f, lvl50, "decay")
            if (not np.isnan(t_rise50)) and (not np.isnan(t_fall50)):
                feats[f"width50_{b}_rest"] = float(t_fall50 - t_rise50)

            # Width at 80%
            t_rise80 = first_crossing_time(trise_seg_t, trise_seg_f, lvl80, "rise")
            t_fall80 = first_crossing_time(tdec_seg_t, tdec_seg_f, lvl80, "decay")
            if (not np.isnan(t_rise80)) and (not np.isnan(t_fall80)):
                feats[f"width80_{b}_rest"] = float(t_fall80 - t_rise80)

        # Fit a smooth transient curve model if enough points exist
        if nb >= 6:
            bf = bazin_fit_features(tb_rest, fb)
            feats[f"bazin_A_{b}"] = bf["bazin_A"]
            feats[f"bazin_B_{b}"] = bf["bazin_B"]
            feats[f"bazin_trise_{b}"] = bf["bazin_trise"]
            feats[f"bazin_tfall_{b}"] = bf["bazin_tfall"]
            feats[f"bazin_redchi2_{b}"] = bf["bazin_redchi2"]

        # Store values for cross-band comparisons
        band_amp[b] = feats[f"amp_{b}"]
        band_tpeak[b] = feats[f"tpeak_{b}_rest"]
        band_peak[b] = peak_flux
        band_auc[b] = feats[f"auc_pos_{b}_rest"]

    # Cross-band comparison features between adjacent wavelengths
    pairs = [("u", "g"), ("g", "r"), ("r", "i"), ("i", "z"), ("z", "y")]
    for a, b in pairs:
        va, vb = band_amp.get(a, np.nan), band_amp.get(b, np.nan)
        ta, tb_ = band_tpeak.get(a, np.nan), band_tpeak.get(b, np.nan)

        # Difference in brightness amplitude
        feats[f"ampdiff_{a}{b}"] = (va - vb) if (not np.isnan(va) and not np.isnan(vb)) else np.nan

        # Difference in peak timing (intrinsic frame)
        feats[f"tpeakdiff_{a}{b}_rest"] = (ta - tb_) if (not np.isnan(ta) and not np.isnan(tb_)) else np.nan

        # Ratio of peak brightness values
        pa, pb = band_peak.get(a, np.nan), band_peak.get(b, np.nan)
        feats[f"peakratio_{a}{b}"] = (pa / (pb + 1e-8)) if (not np.isnan(pa) and not np.isnan(pb)) else np.nan

        # Difference in total positive signal
        aa, ab = band_auc.get(a, np.nan), band_auc.get(b, np.nan)
        feats[f"aucdiff_{a}{b}"] = (aa - ab) if (not np.isnan(aa) and not np.isnan(ab)) else np.nan

    return feats


## Lightcurve cache

The cache loads each split's `{kind}_full_lightcurves.csv` once and stores:

- `lc_cache[split]`: raw dataframe for that split
- `idx_cache[split]`: map from object_id to row indices

This avoids repeated disk reads and repeated groupby filtering during feature extraction.
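
The index map relies on the `indices` attribute of a pandas groupby, which maps each group key to the positional row indices of that group. Here is a small sketch with a made-up table; the column values are illustrative only.

```python
import pandas as pd

# Made-up lightcurve rows for three objects, interleaved
lc = pd.DataFrame({
    "object_id": ["a", "b", "a", "c", "b", "a"],
    "Flux": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Build the lookup once: object_id -> array of positional row indices
index_by_object = lc.groupby("object_id").indices

# Fetch one object's rows by position, without filtering the whole table again
rows_a = lc.iloc[index_by_object["a"]]
missing = index_by_object.get("not_present")  # None for an unknown object
```

Each lookup is then a dictionary access followed by a positional slice, rather than a boolean filter over every row of the split.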


In [None]:
def build_lightcurve_cache(splits, base_dir="data", kind="train"):
    lc_cache = {}
    idx_cache = {}

    for s in splits:
        path = f"{base_dir}/{s}/{kind}_full_lightcurves.csv"
        lc = pd.read_csv(path)
        groups = lc.groupby("object_id").indices
        lc_cache[s] = lc
        idx_cache[s] = groups

    return lc_cache, idx_cache


def get_lightcurve(lc_cache, idx_cache, split, object_id):
    idx = idx_cache[split].get(object_id, None)
    if idx is None:
        return None
    return lc_cache[split].iloc[idx]

In [None]:
def build_feature_table(log_df, lc_cache, idx_cache, drop_cols=None):
    rows = []
    for i in range(len(log_df)):
        r = log_df.iloc[i]
        obj = r["object_id"]
        split = r["split"]

        lc = get_lightcurve(lc_cache, idx_cache, split, obj)
        if lc is None:
            feats = {"n_obs": 0}
        else:
            feats = extract_features_for_object(
                lc_raw=lc,
                z=r["Z"],
                ebv=r["EBV"]
            )

        feats["object_id"] = obj
        feats["split"] = split
        if "target" in log_df.columns:
            feats["target"] = int(r["target"])

        rows.append(feats)

    feat_df = pd.DataFrame(rows)

    if drop_cols is not None:
        for c in drop_cols:
            if c in feat_df.columns:
                feat_df.drop(columns=[c], inplace=True)

    return feat_df

In [9]:
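def best_f1_threshold(y_true, prob):
    # Sweep a fixed grid of thresholds and keep the one with the highest F1
    ths = np.linspace(0.01, 0.99, 199)
    # zero_division=0 scores thresholds with no positive predictions as 0 without warnings
    f1s = [f1_score(y_true, prob > t, zero_division=0) for t in ths]
    j = int(np.argmax(f1s))
    return float(ths[j]), float(f1s[j])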

In [None]:
def train_cv_ensemble(train_feat_df, base_params=None, n_splits=20, seed=6):
    y = train_feat_df["target"].to_numpy().astype(int)
    groups = train_feat_df["split"].to_numpy()

    X = train_feat_df.drop(columns=["object_id", "split", "target"])

    oof = np.zeros(len(train_feat_df), dtype=float)

    gkf = GroupKFold(n_splits=n_splits)
    models = []

    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X, y, groups=groups), 1):
        X_tr, y_tr = X.iloc[tr_idx], y[tr_idx]
        X_va, y_va = X.iloc[va_idx], y[va_idx]

        neg = np.sum(y_tr == 0)
        pos = np.sum(y_tr == 1)
        spw = float(neg / max(1, pos))

        params = dict(
            objective="binary:logistic",
            eval_metric="logloss",
            tree_method="hist",
            n_estimators=5000,
            learning_rate=0.02,
            max_depth=5,
            min_child_weight=10,
            subsample=0.9,
            colsample_bytree=0.9,
            gamma=0.5,
            reg_alpha=5.0,
            reg_lambda=2.0,
            scale_pos_weight=spw,
            random_state=seed,
            n_jobs=-1,
        )

        if base_params is not None:
            params.update(base_params)

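        # Early stopping is configured on the estimator: XGBoost 2.0+ no longer
        # accepts early_stopping_rounds as a fit() argument.
        params.setdefault("early_stopping_rounds", 200)
        model = XGBClassifier(**params)

        model.fit(
            X_tr, y_tr,
            eval_set=[(X_va, y_va)],
            verbose=False,
        )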

        p_va = model.predict_proba(X_va)[:, 1]
        oof[va_idx] = p_va

        models.append(model)

        th, f1 = best_f1_threshold(y_va, p_va)
        print(f"Fold {fold:02d} | val best F1={f1:.4f} @ th={th:.3f} | best_iter={model.best_iteration}")

    th_global, f1_global = best_f1_threshold(y, oof)
    print("\nOOF best F1:", f1_global, " @ th=", th_global)

    return models, th_global, oof


In [11]:
def predict_ensemble(models, test_feat_df):
    X_test = test_feat_df.drop(columns=["object_id", "split"])
    probs = np.zeros(len(test_feat_df), dtype=float)
    for m in models:
        probs += m.predict_proba(X_test)[:, 1]
    probs /= len(models)
    return probs

In [None]:
train_log = pd.read_csv("data/train_log.csv")

for c in ["English Translation", "SpecType"]:
    if c in train_log.columns:
        train_log.drop(columns=[c], inplace=True)

splits = sorted(train_log["split"].unique())

train_lc_cache, train_idx_cache = build_lightcurve_cache(splits, base_dir="data", kind="train")
train_feat = build_feature_table(train_log, train_lc_cache, train_idx_cache)

In [None]:
import numpy as np
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score

y = train_feat["target"].astype(int).to_numpy()
groups = train_feat["split"].to_numpy()
X = train_feat.drop(columns=["object_id", "split", "target"])

def best_threshold_f1(y_true, probs):
    ths = np.linspace(0.03, 0.97, 80)
    f1s = [f1_score(y_true, probs > t, zero_division=0) for t in ths]
    j = int(np.argmax(f1s))
    return float(ths[j]), float(f1s[j])

unique_groups = np.unique(groups)
N_FOLDS_TUNE = min(10, len(unique_groups))

def objective(trial):
    params = {
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "random_state": 6,
        "n_jobs": -1,
        "tree_method": "hist",
        "device": "cuda",
        "n_estimators": trial.suggest_int("n_estimators", 500, 5000),
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.15, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 40),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "gamma": trial.suggest_float("gamma", 0.0, 10.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 20.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.05, 30.0),
        "max_delta_step": trial.suggest_int("max_delta_step", 0, 10),
    }

    gkf = GroupKFold(n_splits=N_FOLDS_TUNE)
    fold_f1s = []

    for fold, (tr_idx, va_idx) in enumerate(gkf.split(X, y, groups), 1):
        X_tr, y_tr = X.iloc[tr_idx], y[tr_idx]
        X_va, y_va = X.iloc[va_idx], y[va_idx]

        neg = np.sum(y_tr == 0)
        pos = np.sum(y_tr == 1)
        params["scale_pos_weight"] = float(neg / max(1, pos))

        model = XGBClassifier(**params)
        model.fit(X_tr, y_tr, verbose=False)

        probs = model.predict_proba(X_va)[:, 1]
        _, f1 = best_threshold_f1(y_va, probs)
        fold_f1s.append(f1)

        trial.report(float(np.mean(fold_f1s)), step=fold)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return float(np.mean(fold_f1s))


sampler = optuna.samplers.TPESampler(
    seed=6,
    multivariate=True,
    group=True
)

pruner = optuna.pruners.MedianPruner(
    n_startup_trials=30,
    n_warmup_steps=3
)

study = optuna.create_study(
    direction="maximize",
    sampler=sampler,
    pruner=pruner,
    study_name="xgb_split_cv_gpu",
    storage="sqlite:///optuna_xgb_gpu.db",
    load_if_exists=True
)

study.optimize(objective, n_trials=2000, timeout=12*60*60)

print("\nBest F1:", study.best_value)
print("Best params:", study.best_params)

[I 2026-01-22 01:36:26,633] Using an existing study with name 'xgb_split_cv_gpu' instead of creating a new one.
Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.

[I 2026-01-22 01:37:03,895] Trial 1 finished with value: 0.5118836951842481 and parameters: {'n_estimators': 2185, 'learning_rate': 0.12684995383408856, 'max_depth': 8, 'min_child_weight': 24, 'subsample': 0.5780093202212182, 'colsample_bytree': 0.5779972601681014, 'gamma': 0.5808361216819946, 'reg_alpha': 17.323522915498703, 'reg_lambda': 18.053394601709105, 'max_delta_step': 7}. Best is trial 1 with value: 0.5118836951842481.
[I 2026-01-22 01:37:18,372] Trial 2 finished with value: 0.5247092761623329 and parameters: {'n_estimators': 592, 'learning_rate': 0.13540804321249839, 'max_depth': 9, 'min_child_weight': 9, 'subsample': 0.590912


Best F1: 0.5586239807478686
Best params: {'n_estimators': 3560, 'learning_rate': 0.02401398593158807, 'max_depth': 5, 'min_child_weight': 11, 'subsample': 0.5332205617155243, 'colsample_bytree': 0.5562957788395715, 'gamma': 0.702445538151101, 'reg_alpha': 5.862037822434758, 'reg_lambda': 9.498770508139778, 'max_delta_step': 9}


In [None]:
best_params = study.best_params

final_params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "tree_method": "hist",
    "random_state": 6,
    "n_jobs": -1,
    **best_params
}

gkf = GroupKFold(n_splits=len(np.unique(groups)))

models = []
oof_probs = np.zeros(len(X), dtype=float)

for fold, (tr_idx, va_idx) in enumerate(gkf.split(X, y, groups), 1):
    X_tr, y_tr = X.iloc[tr_idx], y[tr_idx]
    X_va, y_va = X.iloc[va_idx], y[va_idx]

    neg = np.sum(y_tr == 0)
    pos = np.sum(y_tr == 1)
    final_params["scale_pos_weight"] = float(neg / max(1, pos))

    model = XGBClassifier(**final_params)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        verbose=False,
    )

    probs = model.predict_proba(X_va)[:, 1]
    oof_probs[va_idx] = probs
    models.append(model)

    th, f1 = best_threshold_f1(y_va, probs)
    print(f"Fold {fold:02d} | F1={f1:.4f} @ th={th:.3f}")

best_t, best_f1 = best_threshold_f1(y, oof_probs)
print("\nBest F1:", best_f1)
print("Best threshold:", best_t)


Fold 01 | F1=0.6667 @ th=0.220
Fold 02 | F1=0.5455 @ th=0.375
Fold 03 | F1=0.3529 @ th=0.125
Fold 04 | F1=0.8571 @ th=0.589
Fold 05 | F1=0.7692 @ th=0.470
Fold 06 | F1=0.2727 @ th=0.030
Fold 07 | F1=0.7500 @ th=0.232
Fold 08 | F1=0.5333 @ th=0.208
Fold 09 | F1=0.6000 @ th=0.554
Fold 10 | F1=0.6486 @ th=0.220
Fold 11 | F1=0.5116 @ th=0.113
Fold 12 | F1=0.5714 @ th=0.078
Fold 13 | F1=0.0000 @ th=0.030
Fold 14 | F1=0.0000 @ th=0.030
Fold 15 | F1=0.7143 @ th=0.435
Fold 16 | F1=0.7273 @ th=0.530
Fold 17 | F1=0.8333 @ th=0.803
Fold 18 | F1=0.5385 @ th=0.304
Fold 19 | F1=0.5714 @ th=0.161
Fold 20 | F1=0.7500 @ th=0.518

Best F1: 0.5102040816326531
Best threshold: 0.4583544303797469


In [None]:
test_log = pd.read_csv("data/test_log.csv")

for c in ["English Translation", "SpecType", "target"]:
    if c in test_log.columns:
        test_log.drop(columns=[c], inplace=True)

test_splits = sorted(test_log["split"].unique())

test_lc_cache, test_idx_cache = build_lightcurve_cache(
    test_splits,
    base_dir="data",
    kind="test"
)

test_feat = build_feature_table(test_log, test_lc_cache, test_idx_cache)

In [None]:
X_test = test_feat.drop(columns=["object_id", "split"])

test_probs = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
test_pred = (test_probs > best_t).astype(int)

sub = pd.DataFrame({
    "object_id": test_feat["object_id"].values,
    "target": test_pred
})
sub.to_csv("submission.csv", index=False)