# Model 1 (Baseline XGBoost)

First attempt at building a model for this competition.

## What this model does
- Uses basic lightcurve-derived statistical features
- Trains an XGBoost binary classifier
- Uses Optuna to tune hyperparameters for F1
- Tunes a custom probability threshold to maximize validation F1
- Generates Kaggle submissions (I submitted twice using different seeds)

## Results

Best parameters:
- n_estimators: 1556
- learning_rate: 0.011529
- max_depth: 5
- min_child_weight: 11
- subsample: 0.990469
- colsample_bytree: 0.964860
- colsample_bylevel: 0.931161
- gamma: 0.008020
- reg_alpha: 7.434848
- reg_lambda: 1.937161

Best threshold (tuned on the hold-out validation split): 0.5147491638795987  
Best validation F1: 0.730769  

| Submission | Public LB F1 | Private LB F1 |
|-------------|--------------|----------------|
| 1 | 0.4582 | 0.4153 |
| 2 | 0.4610 | 0.4540 |


This model does not use heavy feature engineering; it mainly extracts simple per-object and per-filter summary statistics from the raw lightcurves.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from pathlib import Path

ROOT = Path.cwd().parents[0]
DATA_DIR = ROOT / "data"

In [None]:
df = pd.read_csv(DATA_DIR / "train_log.csv")
df

Unnamed: 0,object_id,Z,Z_err,EBV,SpecType,English Translation,split,target
0,Dornhoth_fervain_onodrim,3.0490,,0.110,AGN,Trawn Folk (Dwarfs) + northern + Ents (people),split_01,0
1,Dornhoth_galadh_ylf,0.4324,,0.058,SN II,Trawn Folk (Dwarfs) + tree + drinking vessel,split_01,0
2,Elrim_melethril_thul,0.4673,,0.577,AGN,Elves + lover (fem.) + breath,split_01,0
3,Ithil_tobas_rodwen,0.6946,,0.012,AGN,moon + roof + noble maiden,split_01,0
4,Mirion_adar_Druadan,0.4161,,0.058,AGN,"jewel, Silmaril + father + Wild Man",split_01,0
...,...,...,...,...,...,...,...,...
3038,tinnu_gellui_tathar,0.8898,,0.042,AGN,"dusk, twilight + triumphant + tathar",split_20,0
3039,uir_heleg_corf,0.9598,,0.042,AGN,eternity + ice + ring,split_20,0
3040,uir_rhosc_law,0.1543,,0.024,SN II,"eternity + russet, red, brown + no! don't!",split_20,0
3041,uruk_in_pess,1.1520,,0.019,AGN,evil creature + year + feather,split_20,0


In [23]:
df.isnull().sum()

object_id                 0
Z                         0
Z_err                  3043
EBV                       0
SpecType                  0
English Translation       0
split                     0
target                    0
dtype: int64

## Dropping columns

Before building features, I removed the columns that are not useful for this baseline model. I didn't know what to do with `SpecType` at the time, so I removed it; a Kaggle user later pointed out in a discussion that there is a very good way to use it. `Z_err` was not used yet either: it is null only in the training data (all 3043 rows in the output above). Leaving these columns out is also one of the reasons the model didn't perform too well.


In [24]:
df.drop(columns=['Z_err', 'English Translation', 'SpecType'], inplace=True)

## Loading lightcurve files

The dataset is stored in separate folders by `split`.

Instead of re-reading CSV files every time I want an object's lightcurve, I load each split's full lightcurve file once and store it.


In [None]:
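light_curve_cache = {}  # split name -> full lightcurve DataFrame for that split
idx_cache = {}          # split name -> {object_id: positional row indices}

for s in df['split'].unique():
    path = DATA_DIR / str(s) / "train_full_lightcurves.csv"
    light_curve = pd.read_csv(path)

    # Map each object_id to the integer positions of its rows for fast lookup
    groups = light_curve.groupby("object_id").indices

    light_curve_cache[s] = light_curve
    idx_cache[s] = groups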

## get_lightcurve(split, object_id)

This function retrieves the full lightcurve rows for a single object.

Inputs:
- split: which data split folder the object belongs to
- object_id: unique object identifier

Steps:
- Look up the row indices for that object in idx_cache
- Use .iloc[idx] to extract the object's rows
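
To illustrate the lookup described above, here is a minimal sketch on a toy table. The object ids and flux values are invented; only `groupby(...).indices` and `.iloc` mirror what the notebook uses.

```python
import pandas as pd

# Toy lightcurve table; ids and values are made up for illustration.
toy_lc = pd.DataFrame({
    "object_id": ["a", "b", "a", "b", "a"],
    "Flux": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# .indices maps each group key to the integer positions of its rows
toy_idx = toy_lc.groupby("object_id").indices

# Positional lookup: pull all rows of object "a" without re-filtering the table
rows_a = toy_lc.iloc[toy_idx["a"]]
print(rows_a["Flux"].tolist())
```

Because the positions are computed once per split, each later lookup is a plain positional slice rather than a full scan of the table.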


In [None]:
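def get_lightcurve(split, object_id):
    # Use a local name that does not shadow the global training dataframe `df`
    light_curve = light_curve_cache[split]
    idx = idx_cache[split].get(object_id)
    if idx is None:
        # Fail loudly instead of passing None to .iloc
        raise KeyError(f"{object_id!r} not found in split {split!r}")
    return light_curve.iloc[idx]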

Quick test: fetch the lightcurve for the object in row 3.

In [48]:
x = df.iloc[3]
split = x["split"]
obj_id = x["object_id"]
lc = get_lightcurve(split, obj_id)
lc

Unnamed: 0,object_id,Time (MJD),Flux,Flux_err,Filter
267,Ithil_tobas_rodwen,62867.5631,0.462736,1.159424,y
268,Ithil_tobas_rodwen,62867.5631,1.250500,0.342737,z
269,Ithil_tobas_rodwen,62867.5631,1.298654,0.274093,i
270,Ithil_tobas_rodwen,62864.6990,0.752622,0.080461,g
271,Ithil_tobas_rodwen,62864.6990,1.028319,0.199470,i
...,...,...,...,...,...
1060,Ithil_tobas_rodwen,61807.8308,0.126247,0.247917,z
1061,Ithil_tobas_rodwen,61810.6950,0.585870,0.314342,i
1062,Ithil_tobas_rodwen,61799.2384,0.455831,0.175963,i
1063,Ithil_tobas_rodwen,61807.8308,0.598166,0.606470,y


## Creating feature columns

Each object has a time series (lightcurve) with observations in up to 6 filters: u, g, r, i, z, y.

Most of these features were created by asking an AI to go through astronomy research papers and propose features. I am not an astronomer, so I could not design astronomy-specific features from my own knowledge.

Each lightcurve row has the following columns:

- `Time (MJD)`: observation time in Modified Julian Date  
- `Flux`: measured brightness (can be negative due to noise/subtraction artifacts)  
- `Flux_err`: uncertainty in the flux measurement  
- `Filter`: which band the observation belongs to  

The goal of feature engineering here is to compress each irregular time series into a fixed-length numeric vector so a tabular model (like XGBoost) can learn patterns that separate classes.

### 1) Global features (all filters combined)
These are computed using all observations across all bands for a given object.  
They summarize the overall time coverage, brightness distribution, variability, uncertainty, and signal quality.

### 2) Per-filter features (computed separately per band)
These are computed independently for each filter band.
They let the model detect color-dependent behavior (for example: strong variability in `g` but not in `i`).

## Global (all-filters combined) features

Below are the global feature columns and what each one represents:

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `total_time` | Total time span covered by the object's observations: `(max(Time) - min(Time))` after shifting to start at zero | Separates fast events vs slow events and distinguishes sparse vs long-baseline coverage |
| `n_obs` | Total number of observations across all filters | Captures sampling density and whether an object is well-measured |
| `median_flux` | Median flux across all observations | Robust estimate of typical brightness (less sensitive to spikes) |
| `mean_flux` | Mean flux across all observations | Captures average brightness but is more sensitive to outliers |
| `std_flux` | Standard deviation of flux across all observations | Measures overall variability (high = more change over time) |
| `min_flux` | Minimum observed flux | Captures dips, fading, or negative excursions from noise |
| `max_flux` | Maximum observed flux | Captures peak brightness or flare intensity |
| `range_flux` | Flux range: `max_flux - min_flux` | Simple variability amplitude proxy |
| `median_err` | Median flux uncertainty across all observations | Measures how noisy the measurements are overall |
| `median_snr` | Median signal-to-noise ratio: `median(\|Flux\| / Flux_err)` | Typical detection strength across observations |
| `max_snr` | Maximum signal-to-noise ratio: `max(\|Flux\| / Flux_err)` | Whether the object ever has a highly confident detection |
| `neg_flux_frac` | Fraction of observations where `Flux < 0` | Indicates low-SNR objects or subtraction-dominated measurements |
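
As a rough sketch of how the global features in the table can be computed for one object, the helper below takes plain arrays and returns a feature dictionary. The function name `global_features` and the toy inputs are hypothetical; the formulas follow the table, and the small `eps` guards against division by zero in the SNR.

```python
import numpy as np

def global_features(time, flux, flux_err, eps=1e-8):
    """Summary statistics over all observations of one object (all bands)."""
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    flux_err = np.asarray(flux_err, dtype=float)

    snr = np.abs(flux) / (flux_err + eps)  # per-observation signal-to-noise

    return {
        "total_time": float(time.max() - time.min()),
        "n_obs": int(flux.size),
        "median_flux": float(np.median(flux)),
        "mean_flux": float(np.mean(flux)),
        "std_flux": float(np.std(flux)),
        "min_flux": float(flux.min()),
        "max_flux": float(flux.max()),
        "range_flux": float(flux.max() - flux.min()),
        "median_err": float(np.median(flux_err)),
        "median_snr": float(np.median(snr)),
        "max_snr": float(snr.max()),
        "neg_flux_frac": float(np.mean(flux < 0)),
    }

# Toy inputs, invented for illustration
example = global_features(
    time=[100.0, 103.0, 110.0, 104.0],
    flux=[1.0, -1.0, 3.0, 5.0],
    flux_err=[0.5, 0.5, 1.0, 1.0],
)
print(example["total_time"], example["range_flux"], example["neg_flux_frac"])
```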

## Per-filter (band-wise) features

For each band in `filters = ["u", "g", "r", "i", "z", "y"]`, the following features are created:

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `n_obs_{band}` | Number of observations in this band | Some classes are observed more in specific bands; also captures missingness patterns |
| `total_time_{band}` | Time span covered within this band | Band-dependent cadence coverage (important if data is uneven across filters) |
| `median_flux_{band}` | Median flux in this band | Typical brightness in this band (captures spectral/color behavior) |
| `std_flux_{band}` | Flux standard deviation in this band | Variability strength in that band |
| `amp_{band}` | Simple amplitude proxy: `max(flux) - median(flux)` | Captures flare-like peaks or transient bursts without being too sensitive to one negative outlier |
| `median_err_{band}` | Median uncertainty in this band | Band-specific noise level (some filters are noisier) |
| `median_snr_{band}` | Median SNR in this band: `median(\|flux\| / err)` | Typical detection quality per filter |
| `max_snr_{band}` | Max SNR in this band: `max(\|flux\| / err)` | Best detection strength per filter |
| `neg_flux_frac_{band}` | Fraction of band observations with `flux < 0` | Band-specific low-SNR indicator |
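
The per-band features follow the same pattern, restricted to one filter at a time. The sketch below is a simplified illustration that covers only a few of the columns in the table; `band_features` and the toy arrays are hypothetical names and values, not part of the notebook.

```python
import numpy as np

def band_features(flux, bands, band):
    """A few per-band statistics; a band with no observations yields only its count."""
    flux = np.asarray(flux, dtype=float)
    bands = np.asarray(bands)

    fb = flux[bands == band]  # flux values observed in this band only
    out = {f"n_obs_{band}": int(fb.size)}
    if fb.size == 0:
        return out

    out[f"median_flux_{band}"] = float(np.median(fb))
    out[f"amp_{band}"] = float(fb.max() - np.median(fb))  # peak above the median
    out[f"neg_flux_frac_{band}"] = float(np.mean(fb < 0))
    return out

# Toy observations, invented for illustration
toy_flux = [1.0, 2.0, 9.0, -1.0, 4.0]
toy_bands = ["g", "g", "g", "r", "r"]

feats_g = band_features(toy_flux, toy_bands, "g")
feats_r = band_features(toy_flux, toy_bands, "r")
feats_u = band_features(toy_flux, toy_bands, "u")  # band with no observations
print(feats_g, feats_r, feats_u)
```

Measuring the amplitude from the median rather than the minimum means a single strongly negative point cannot inflate `amp_{band}`.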

## Additional summary features

These features describe how much band coverage exists overall:

| Feature | Meaning | Why it helps |
|--------|---------|--------------|
| `n_filters_present` | Count of how many filters have at least 1 observation | Objects with multi-band coverage have richer information; missing bands may correlate with class |
| `total_obs` | Total observations summed over all filters (same as `n_obs`) | Redundant but convenient for downstream logic or sanity checks |


The code comments in the cell below were added by AI.

In [None]:
filters = ["u", "g", "r", "i", "z", "y"]  # creating features for each filter

# Global feature columns (computed using all observations combined)
base_cols = [
    "total_time",      # total time span covered by all observations
    "n_obs",           # total number of observations across all filters
    "median_flux",     # median flux across all observations
    "mean_flux",       # mean flux across all observations
    "std_flux",        # standard deviation of flux across all observations
    "min_flux",        # minimum flux observed
    "max_flux",        # maximum flux observed
    "range_flux",      # max_flux - min_flux, variability range proxy
    "median_err",      # median measurement uncertainty across all observations
    "median_snr",      # median |flux| / err across all observations
    "max_snr",         # maximum |flux| / err across all observations
    "neg_flux_frac"    # fraction of observations with flux < 0
]

# Initialize global features to NaN (filled later per object)
for c in base_cols:
    df[c] = np.nan

# Initialize per-filter features
for f in filters:
    df[f"n_obs_{f}"] = 0                  # number of observations in this band
    df[f"total_time_{f}"] = 0.0           # time span covered in this band
    df[f"median_flux_{f}"] = 0.0          # typical flux level in this band
    df[f"std_flux_{f}"] = 0.0             # variability in this band
    df[f"amp_{f}"] = 0.0                  # amplitude proxy in this band
    df[f"median_err_{f}"] = 0.0           # typical uncertainty in this band
    df[f"median_snr_{f}"] = 0.0           # typical SNR in this band
    df[f"max_snr_{f}"] = 0.0              # best SNR in this band
    df[f"neg_flux_frac_{f}"] = 0.0        # fraction of negative flux values in this band

# Summary features for filter coverage
df["n_filters_present"] = 0               # how many bands have >= 1 observation
df["total_obs"] = 0                       # total observations across all bands

## Extracting features from the raw lightcurve (per object)

This loop runs through every object in the dataframe and builds features from its lightcurve.

### Steps per object
1. Fetch the object's lightcurve with `get_lightcurve(split, object_id)`
2. Extract arrays for:
   - Time (MJD)
   - Flux
   - Flux_err
3. Convert time into a relative scale (`t_rel = t - t.min()`)
4. Compute global lightcurve stats:
   - time span
   - total observation count
   - flux summary stats (median/mean/std/min/max/range)
   - error and SNR summary stats
   - fraction of negative flux values
5. For each filter band (u,g,r,i,z,y):
   - subset lightcurve rows for that band
   - compute band-specific stats (n_obs, time span, median flux, std, amplitude, SNR, etc.)
6. Track how many filters are actually present (`n_filters_present`)
7. Track the total number of observations across all filters (`total_obs`)

This produces a structured tabular feature dataset where each row corresponds to exactly one object.
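
A per-row loop is easy to follow but can be slow on large tables. As a possible alternative (not what this notebook does), a few of the same features can be built with grouped aggregations. The toy data and variable names below are assumptions made for illustration.

```python
import pandas as pd

# Toy lightcurve rows; ids and values are invented for illustration.
toy_lc = pd.DataFrame({
    "object_id":  ["a", "a", "a", "b", "b"],
    "Time (MJD)": [10.0, 12.0, 15.0, 20.0, 21.0],
    "Flux":       [1.0, 3.0, -2.0, 4.0, 6.0],
    "Filter":     ["g", "r", "g", "g", "g"],
})

by_obj = toy_lc.groupby("object_id")

# Global features: one row per object
global_feats = pd.DataFrame({
    "total_time":    by_obj["Time (MJD)"].max() - by_obj["Time (MJD)"].min(),
    "n_obs":         by_obj["Flux"].size(),
    "median_flux":   by_obj["Flux"].median(),
    "neg_flux_frac": by_obj["Flux"].apply(lambda s: (s < 0).mean()),
})

# Per-band observation counts, one column per filter (missing bands become 0)
band_counts = (
    toy_lc.groupby(["object_id", "Filter"]).size()
    .unstack(fill_value=0)
    .add_prefix("n_obs_")
)

features = global_feats.join(band_counts)
print(features)
```

One thing to watch if standard deviations are added this way: pandas `std` defaults to `ddof=1`, while `np.std` defaults to `ddof=0`, so the two would not match unless `ddof` is set explicitly.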

In [None]:
for i in range(df.shape[0]):
    x = df.iloc[i]
    lc = get_lightcurve(x["split"], x["object_id"])

    # Extract arrays for all observations (all filters combined)
    t = lc["Time (MJD)"].to_numpy()
    f = lc["Flux"].to_numpy()
    e = lc["Flux_err"].to_numpy()

    # Shift time so the first observation occurs at t=0 for numerical stability
    t_rel = t - t.min()

    df.loc[i, "total_time"] = float(t_rel.max() - t_rel.min())  # overall time baseline
    df.loc[i, "n_obs"] = int(lc.shape[0])                       # total number of observations

    df.loc[i, "median_flux"] = float(np.median(f))              # typical brightness (robust)
    df.loc[i, "mean_flux"]   = float(np.mean(f))                # average brightness
    df.loc[i, "std_flux"]    = float(np.std(f))                 # overall variability
    df.loc[i, "min_flux"]    = float(np.min(f))                 # dimmest point
    df.loc[i, "max_flux"]    = float(np.max(f))                 # brightest point
    df.loc[i, "range_flux"]  = float(np.max(f) - np.min(f))     # simple variability range

    df.loc[i, "median_err"] = float(np.median(e))               # typical measurement noise
    snr = np.abs(f) / (e + 1e-8)                                # SNR per observation
    df.loc[i, "median_snr"] = float(np.median(snr))             # typical detection quality
    df.loc[i, "max_snr"]    = float(np.max(snr))                # best detection quality
    df.loc[i, "neg_flux_frac"] = float(np.mean(f < 0))          # how often flux is negative

    present = 0
    total_obs = 0

    for band in filters:
        sub = lc[lc["Filter"] == band] # only observations in this band
        n = int(sub.shape[0])

        df.loc[i, f"n_obs_{band}"] = n
        total_obs += n

        # If the band is missing entirely, skip the rest
        if n == 0:
            continue
        present += 1

        tb = sub["Time (MJD)"].to_numpy()
        fb = sub["Flux"].to_numpy()
        eb = sub["Flux_err"].to_numpy()

        tb_rel = tb - tb.min()

        # Band time span
        df.loc[i, f"total_time_{band}"] = float(tb_rel.max() - tb_rel.min())

        # Band flux distribution
        df.loc[i, f"median_flux_{band}"] = float(np.median(fb))
        df.loc[i, f"std_flux_{band}"] = float(np.std(fb))

        # Band amplitude proxy (peak relative to median baseline)
        df.loc[i, f"amp_{band}"] = float(np.max(fb) - np.median(fb))

        # Band uncertainty + SNR
        df.loc[i, f"median_err_{band}"] = float(np.median(eb))
        snr_b = np.abs(fb) / (eb + 1e-8)
        df.loc[i, f"median_snr_{band}"] = float(np.median(snr_b))
        df.loc[i, f"max_snr_{band}"] = float(np.max(snr_b))

        # Band negative flux fraction
        df.loc[i, f"neg_flux_frac_{band}"] = float(np.mean(fb < 0))

    # Final band coverage summaries
    df.loc[i, "n_filters_present"] = int(present)
    df.loc[i, "total_obs"] = int(total_obs)

In [57]:
y = df['target']
X = df.drop(columns=['object_id', 'split', 'target'])
X

Unnamed: 0,Z,EBV,total_time,n_obs,median_flux,mean_flux,std_flux,min_flux,max_flux,range_flux,...,total_time_y,median_flux_y,std_flux_y,amp_y,median_err_y,median_snr_y,max_snr_y,neg_flux_frac_y,n_filters_present,total_obs
0,3.0490,0.110,1254.2719,65.0,-0.367840,0.928483,4.766352,-2.756285,25.047343,27.803628,...,1241.0691,-1.424537,2.463050,7.290787,1.111663,1.344504,3.762247,0.545455,6,65
1,0.4324,0.058,2362.1560,167.0,0.094237,0.388622,1.367368,-1.747082,11.375499,13.122581,...,2362.1560,0.094237,2.457391,11.281263,1.300994,0.576262,14.659265,0.482759,6,167
2,0.4673,0.577,1206.0218,35.0,1.076724,1.691347,2.602937,-6.400816,6.617915,13.018732,...,767.8628,0.032667,6.433483,6.433483,1.121800,5.735005,5.770355,0.500000,6,35
3,0.6946,0.012,2858.4129,798.0,0.327391,0.375366,0.859220,-7.641818,5.353821,12.995639,...,2841.2281,0.523353,1.798179,4.830468,1.079121,0.888426,3.630331,0.365217,6,798
4,0.4161,0.058,2202.3065,129.0,0.308845,0.233832,1.142101,-3.060399,5.384463,8.444862,...,1809.1164,0.826709,1.642351,1.543811,0.981047,1.312869,2.640597,0.285714,6,129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3038,0.8898,0.042,2582.0800,148.0,0.328642,0.459678,0.999605,-1.419681,6.428462,7.848143,...,2392.4960,0.758492,1.853276,5.669970,1.135060,0.997986,2.542929,0.375000,6,148
3039,0.9598,0.042,2916.1706,138.0,0.400943,0.374945,1.448300,-5.330229,7.369065,12.699294,...,2563.4588,0.363133,1.773719,2.076330,1.336029,0.498384,2.478262,0.333333,6,138
3040,0.1543,0.024,1936.1637,172.0,0.104000,0.376307,1.070895,-2.773028,5.085714,7.858741,...,1919.4185,-0.031903,1.798593,5.117617,1.198319,0.698936,4.619904,0.548387,6,172
3041,1.1520,0.019,2699.8022,161.0,0.562826,0.352351,1.075207,-2.895248,2.871105,5.766353,...,2635.9214,0.300762,1.446578,2.512944,1.378667,0.673259,2.712396,0.407407,6,161


In [None]:
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=6
)

In [None]:
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
neg / pos  # The dataset is heavily imbalanced; this ratio is used for 'scale_pos_weight'

np.float64(19.627118644067796)
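
The later cells pass this ratio to XGBoost as a rounded constant. A small helper like the one below (a hypothetical name, shown on toy labels) could compute it directly from the label array, so that it stays in sync if the split changes.

```python
import numpy as np

def neg_pos_ratio(labels):
    """Ratio of negative to positive labels, usable as XGBoost's scale_pos_weight."""
    labels = np.asarray(labels)
    pos = int((labels == 1).sum())
    neg = int((labels == 0).sum())
    if pos == 0:
        raise ValueError("no positive samples; ratio is undefined")
    return neg / pos

# Toy labels, invented for illustration: 20 negatives and 1 positive
toy_labels = [0] * 20 + [1]
ratio = neg_pos_ratio(toy_labels)
print(ratio)
```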

## Hyperparameter tuning using Optuna

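Optuna searches for XGBoost hyperparameters that maximize the F1 score on the validation split. Each trial scores itself with the best F1 found over a sweep of probability thresholds on that same split, so the reported validation F1 is likely optimistic.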
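
The threshold sweep used for scoring can be sketched on its own. The notebook uses scikit-learn's `f1_score`; the version below writes F1 out by hand so that it depends only on NumPy. The function names and toy data are assumptions.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for binary labels: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_threshold(y_true, probs, thresholds):
    """Return (threshold, F1) for the first threshold that maximizes F1."""
    probs = np.asarray(probs)
    scores = [f1_binary(y_true, probs > t) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])

# Toy labels and predicted probabilities, invented for illustration
y_toy = np.array([0, 0, 0, 1, 1, 0])
p_toy = np.array([0.1, 0.2, 0.6, 0.7, 0.9, 0.3])

ths_toy = np.linspace(0.01, 0.99, 99)
t_best, f1_best = best_threshold(y_toy, p_toy, ths_toy)
print(t_best, f1_best)
```

Picking the threshold on the same data used to report the score tends to overstate performance; choosing it on separate folds would give a more honest estimate.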

In [None]:
import optuna
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

def objective(trial):
    params = {
        "objective": "binary:logistic",
        "eval_metric": "logloss",

        "n_estimators": trial.suggest_int("n_estimators", 300, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.2, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),

        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.5, 1.0),

        "gamma": trial.suggest_float("gamma", 0.0, 5.0),

        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 10.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.1, 10.0),

        "scale_pos_weight": 19.6,

        "random_state": 42,
        "n_jobs": -1
    }

    model = XGBClassifier(**params)

    model.fit(X_train, y_train)

    probs = model.predict_proba(X_val)[:,1]

    ths = np.linspace(0.01, 0.99, 200)
    f1s = [f1_score(y_val, probs > t) for t in ths]

    return float(np.max(f1s))


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)

print("\nBest F1:", study.best_value)
print("\nBest params:")
for k, v in study.best_params.items():
    print(f"{k}: {v}")

Best F1: 0.7308

## Results

Best parameters:
- n_estimators: 1556
- learning_rate: 0.011529
- max_depth: 5
- min_child_weight: 11
- subsample: 0.990469
- colsample_bytree: 0.964860
- colsample_bylevel: 0.931161
- gamma: 0.008020
- reg_alpha: 7.434848
- reg_lambda: 1.937161

In [71]:
best_params = study.best_params

model = XGBClassifier(
    **best_params,
    objective="binary:logistic",
    eval_metric="logloss",
    scale_pos_weight=19.6,
    random_state=42,
    n_jobs=-1
)

model.fit(X_train, y_train)

probs = model.predict_proba(X_val)[:,1]

ths = np.linspace(0.01, 0.99, 300)
f1s = [f1_score(y_val, probs > t) for t in ths]
best_t = ths[np.argmax(f1s)]

print("Best threshold:", best_t)

Best threshold: 0.5147491638795987


In [None]:
from loader.test_loader import build_test

X_test, test_df = build_test()
probs = model.predict_proba(X_test)[:,1]
y_test_pred = (probs > best_t).astype(int)

submission = pd.DataFrame({
    "object_id": test_df["object_id"],
    "target": y_test_pred
})
submission.to_csv("Submissions/first_XGB-2.csv", index=False)