[![Open In Colab](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-colab.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb)
[![Open In Kaggle](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-kaggle.svg)](https://www.kaggle.com/code/crunchdao/structural-break-baseline)

# ADIA Lab Structural Break Challenge

## Challenge Overview

Welcome to the ADIA Lab Structural Break Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.
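
As a rough illustration, the sketch below generates two toy series with the standard library: one whose mean jumps at the boundary and one that stays unchanged. The helper name `make_series`, the segment lengths, and the shift size are assumptions made for this example; the competition data is not claimed to be generated this way.

```python
import random
import statistics

def make_series(n_pre, n_post, shift, seed=0):
    """Gaussian noise whose mean jumps by `shift` at index n_pre (the boundary)."""
    rng = random.Random(seed)
    pre = [rng.gauss(0.0, 1.0) for _ in range(n_pre)]
    post = [rng.gauss(shift, 1.0) for _ in range(n_post)]
    return pre, post

# One series with a break in the mean, one without.
pre, post = make_series(1500, 500, shift=2.0)
flat_pre, flat_post = make_series(1500, 500, shift=0.0)

mean_gap_break = statistics.fmean(post) - statistics.fmean(pre)
mean_gap_none = statistics.fmean(flat_post) - statistics.fmean(flat_pre)
print(f"mean gap with break:    {mean_gap_break:.2f}")
print(f"mean gap without break: {mean_gap_none:.2f}")
```

A mean shift is only one kind of break; changes in variance, autocorrelation, or distribution shape also count and are usually harder to see.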


### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), which measures the performance of detection algorithms regardless of their specific calibration.

- ROC AUC around `0.5`: No better than random chance;
- ROC AUC approaching `1.0`: Perfect detection.
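
To see why calibration does not matter, the sketch below computes ROC AUC in its pairwise form: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function `roc_auc` and the toy labels and scores are written for this illustration; for binary labels it should agree with scikit-learn's `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

# A strictly increasing transform changes the values but not their order.
squashed = [s ** 3 for s in scores]

auc_raw = roc_auc(labels, scores)
auc_squashed = roc_auc(labels, squashed)
print(auc_raw, auc_squashed)
```

Because only the ordering of the scores enters the computation, any strictly increasing transformation of your predictions leaves the metric unchanged.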

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [7]:
# Install the Crunch CLI
%pip install crunch-cli --upgrade --quiet --progress-bar off

# Setup your local environment
!crunch setup-notebook structural-break sSwhgkE7cS2nPb7aAet6thaR

crunch-cli, version 7.4.0
delete /content/.crunchdao
you appear to have never submitted code before
data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
                                
---
Success! Your environment has been correctly setup.
Next recommended actions:
1. Load the Crunch Toolings: `crunch = crunch.load_notebook()`
2. Execute the cells with your code
3. Run a test: `crunch.test()`
4. Downloa

# Your model

## Setup

In [20]:
import os
import typing

# Import your dependencies
import joblib
import numpy as np
import pandas as pd
import scipy
import sklearn.metrics


In [10]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 7.4.0
available ram: 12.67 gb
available cpu: 2 core
----


## Understanding the Data

The dataset consists of univariate time series, each containing ~2,000-5,000 values with a designated boundary point. For each time series, you need to determine whether a structural break occurred at this boundary point.

The data was downloaded when you set up your local environment and is now available in the `data/` directory.
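
The code later in this notebook assumes that `X_train` has a MultiIndex of `(id, time)` with a `value` column and a `period` column, where `period == 0` marks observations before the boundary and `period == 1` those after it. The toy frame below mimics that assumed layout with made-up numbers, so you can see how the two segments of each series are separated.

```python
import pandas as pd

# Toy frame mimicking the assumed layout: MultiIndex (id, time), columns value/period.
toy = pd.DataFrame(
    {
        "value": [0.1, -0.2, 0.3, 1.9, 2.2, 0.0, 0.1, -0.1, 0.2],
        "period": [0, 0, 0, 1, 1, 0, 0, 1, 1],
    },
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (1, 1), (1, 2), (1, 3)],
        names=["id", "time"],
    ),
)

# Count and average the observations on each side of the boundary, per series.
summary = toy.groupby(["id", "period"])["value"].agg(["count", "mean"])
print(summary)
```

Running the same `groupby` on the real `X_train` is a quick way to check how long the segments before and after the boundary are.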

In [11]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


## Strategy Implementation

There are multiple approaches you can take to detect structural breaks:

1. **Statistical Tests**: Compare distributions before and after the boundary point;
2. **Feature Engineering**: Extract features from both segments for comparison;
3. **Time Series Modeling**: Detect deviations from expected patterns;
4. **Machine Learning**: Train models to recognize break patterns from labeled examples.

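The implementation below combines the feature-engineering and machine-learning approaches: it computes robust statistics, FFT band energies, and wavelet energies for the segments before and after the boundary point, then trains a gradient-boosted classifier on those features.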
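
To make the first approach in the list concrete, here is a minimal sketch of a statistical test using only the standard library: Welch's t-statistic for a difference in means, squashed into `[0, 1)`. The helper names `welch_t` and `break_score` and the squashing `t / (1 + t)` are assumptions made for this example, not part of the challenge tooling. Since ROC AUC only looks at the ranking of scores, any monotone mapping of `|t|` would serve equally well.

```python
import math
import statistics

def welch_t(pre, post):
    """Welch's t-statistic for a difference in means (unequal variances allowed)."""
    m0, m1 = statistics.fmean(pre), statistics.fmean(post)
    v0, v1 = statistics.variance(pre), statistics.variance(post)
    return (m1 - m0) / math.sqrt(v0 / len(pre) + v1 / len(post))

def break_score(pre, post):
    """Map |t| into [0, 1) so that larger mean shifts get higher scores."""
    t = abs(welch_t(pre, post))
    return t / (1.0 + t)

pre = [0.1, -0.3, 0.2, 0.0, -0.1, 0.3]
shifted = [2.1, 1.8, 2.4, 2.0, 1.9, 2.2]     # clear jump in the mean
similar = [0.0, 0.2, -0.2, 0.1, -0.1, 0.25]  # same level as `pre`

score_shifted = break_score(pre, shifted)
score_similar = break_score(pre, similar)
print(score_shifted, score_similar)
```

A test on the means only reacts to level shifts; breaks in variance or dynamics need other statistics, which is one motivation for the richer features used in this notebook.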

In [40]:
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
import pywt  # pip install PyWavelets
from numpy.fft import rfft
from scipy.stats import kurtosis, skew
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# --- Rich features: robust stats + higher moments + FFT + wavelets ---

In [30]:
def _stats_robust(x):
    """Location, spread, quantile, and shape statistics for one segment."""
    if x.size == 0:
        return {}  # absent keys are filled with 0.0 in extract_features_rich
    q5, q10, q25, q35, q50, q65, q75, q90, q95 = np.percentile(
        x, [5, 10, 25, 35, 50, 65, 75, 90, 95]
    )
    mad = np.median(np.abs(x - q50))  # median absolute deviation
    return {
        "mean": np.mean(x), "std": np.std(x),
        "min": np.min(x), "max": np.max(x), "median": q50,
        "q5": q5, "q10": q10, "q25": q25, "q35": q35,
        "q65": q65, "q75": q75, "q90": q90, "q95": q95,
        "mad": mad,
        # Bias-corrected skewness and kurtosis need at least 3 and 4 points.
        "skew": skew(x, bias=False) if x.size > 2 else 0.0,
        "kurt": kurtosis(x, fisher=True, bias=False) if x.size > 3 else 0.0,
        "rms": np.sqrt(np.mean(x**2)),
        "ptp": np.ptp(x),  # peak-to-peak range
    }

def _fft_energy_bands(x, n_bands=5):
    """Simple relative-band energies from power spectrum."""
    if x.size < 8:
        return {f"fft_band_{i}": 0.0 for i in range(n_bands)}
    spec = np.abs(rfft(x - np.mean(x)))**2
    spec = spec[1:]  # drop DC for energy ratios
    if spec.size == 0:
        return {f"fft_band_{i}": 0.0 for i in range(n_bands)}
    # split into equal bands
    bands = np.array_split(spec, n_bands)
    energies = np.array([b.sum() for b in bands], dtype=float)
    tot = energies.sum() + 1e-12
    return {f"fft_band_{i}": float(e/tot) for i, e in enumerate(energies)}

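def _wavelet_energies(x, wavelet="db4", level=None):
    """Relative energy per scale from a multilevel DWT.

    `wl_cA` is the approximation band and `wl_cD_1` the coarsest detail band.
    With `level=None`, PyWavelets picks the depth from the segment length,
    so the number of `wl_cD_*` keys can differ between segments.
    """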
    if x.size < 8:
        return {}
    coeffs = pywt.wavedec(x - np.mean(x), wavelet=wavelet, level=level)
    energies = [np.sum(c**2) for c in coeffs]  # [cA_L, cD_L, ..., cD1]
    tot = np.sum(energies) + 1e-12
    out = {"wl_cA": float(energies[0]/tot)}
    for i, e in enumerate(energies[1:], start=1):
        out[f"wl_cD_{i}"] = float(e/tot)
    return out

def _segment_features(x):
    """Compose robust stats + spectral features for one segment."""
    f = {}
    f.update(_stats_robust(x))
    f.update(_fft_energy_bands(x, n_bands=5))
    f.update(_wavelet_energies(x))
    return f

def extract_features_rich(X: pd.DataFrame) -> pd.DataFrame:
    """Per-id features using value and period columns."""
    feats = []
    print_count = 0
    for id_, g in X.groupby(level="id"):

        if print_count == 1_000:
            print('Progress report...extracting features from id: {}'.format(id_))
            print_count = 0
        print_count += 1

        v = g["value"].values.astype(float)
        pre = g.loc[g["period"] == 0, "value"].values.astype(float)
        post = g.loc[g["period"] == 1, "value"].values.astype(float)

        d = {"id": id_}
        # global
        d.update({f"g_{k}": val for k, val in _segment_features(v).items()})

        # pre/post
        d.update({f"pre_{k}": val for k, val in _segment_features(pre).items()})
        d.update({f"post_{k}": val for k, val in _segment_features(post).items()})

        # deltas (post - pre) for key stats
        for k in ["mean", "std", "median", "mad", "skew", "kurt", "rms"]:
            d[f"delta_{k}"] = (d.get(f"post_{k}", 0.0) - d.get(f"pre_{k}", 0.0))

        # counts & ratio
        d["len_total"] = int(v.size)
        d["n_pre"] = int(pre.size)
        d["n_post"] = int(post.size)
        d["ratio_post_pre"] = float(d["n_post"]/(d["n_pre"]+1e-6))
        feats.append(d)

    df = pd.DataFrame(feats).set_index("id")
    return df.replace([np.inf, -np.inf], 0.0).fillna(0.0)

In [46]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.utils.class_weight import compute_sample_weight

def tune_hgb(Xf, y):
    est = HistGradientBoostingClassifier(
        random_state=42, early_stopping=True, validation_fraction=0.2
    )

    grid = {
        "learning_rate":      [0.04, 0.06, 0.10],
        "max_depth":          [4, 6, 10],
        "min_samples_leaf":   [20, 40, 100],
        "l2_regularization":  [1e-3, 1e-2],
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    sw = compute_sample_weight("balanced", y.astype(int))
    gs = GridSearchCV(est, grid, scoring="roc_auc", cv=cv, n_jobs=-1, refit=True, verbose=1)
    gs.fit(Xf, y.astype(int), sample_weight=sw)
    print("BEST AUC:", gs.best_score_, "BEST PARAMS:", gs.best_params_)
    return gs.best_estimator_

def fit_ensemble(Xf, y, base_params, seeds=(11, 23, 42, 77)):
    members = []
    for s in seeds:
        clf = HistGradientBoostingClassifier(random_state=s, early_stopping=True, validation_fraction=0.1, **base_params)
        clf.fit(Xf, y.astype(int))
        members.append(clf)
    return members

def predict_ensemble(members, Xf):
    ps = [m.predict_proba(Xf)[:,1] for m in members]
    return np.mean(ps, axis=0)

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

The implementation below does train a model: it extracts per-id features, fits a `HistGradientBoostingClassifier`, and saves the estimator together with its feature names as `model.joblib` in `model_directory_path`.
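
When a model is trained on engineered features, it helps to save the training column order next to the estimator so that inference can rebuild rows in exactly that order. The sketch below shows the alignment step with pandas `reindex`. The feature names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical schema captured at training time (the column order matters).
feature_names = ["pre_mean", "post_mean", "post_wl_cD_5"]

# A row built at inference time: different order, one column missing, one extra.
row = pd.DataFrame([{"post_mean": 1.2, "pre_mean": 0.1, "post_extra": 9.9}])

# Reorder to the training schema, fill missing features with 0.0, drop extras.
aligned = row.reindex(columns=feature_names, fill_value=0.0)
print(aligned)
```

This matters here because some features, such as the wavelet energies, can produce a different number of columns depending on the length of the series.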

In [47]:
import os
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

# assumes extract_features_rich(X) is defined/imported in this module

def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):
    """
    Train on per-id features, align y to ids, save bundle with feature schema.
    X_train: MultiIndex (id, time), columns ["value","period"]
    y_train: Series/DataFrame indexed by id with {0,1} or {False,True}
    """
    os.makedirs(model_directory_path, exist_ok=True)

    # 1) y: squeeze to Series[int], index = ids
    if isinstance(y_train, pd.DataFrame):
        y_train = y_train.squeeze()
    y = y_train.astype(int).copy()

    # 2) X: aggregate to per-id features
    print("Progress report...extracting features per id")
    Xf = extract_features_rich(X_train).replace([np.inf, -np.inf], 0.0).fillna(0.0)


    # 3) align y to feature rows (ids)
    # ensure both indices are comparable types
    try:
        Xf.index = Xf.index.astype(int)
        y.index = y.index.astype(int)
    except (TypeError, ValueError):
        pass  # ids are not integer-like: keep the original index types
    y = y.reindex(Xf.index)
    if y.isna().any():
        # Drop any ids without a label (shouldn't happen on official train)
        mask = ~y.isna()
        Xf = Xf.loc[mask]
        y = y.loc[mask]

    # 4) fit model (deterministic baseline)
    model = HistGradientBoostingClassifier(
        random_state=42, early_stopping=True, validation_fraction=0.1,
        max_iter=1_000, learning_rate=0.02,
        max_depth=15,  min_samples_leaf=50, l2_regularization=1e-3,
    )
    model.fit(Xf, y)

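    # Alternative A: tune hyperparameters with a grid search and keep the best estimator
    # (comment out the fit above and use this instead):
    # best = tune_hgb(Xf, y)
    # model = best

    # Alternative B: small seed ensemble built from the tuned parameters.
    # Note: infer() below expects a single estimator, so storing a list here
    # also requires switching infer() to predict_ensemble().
    # base_params = {k: v for k, v in best.get_params().items()
    #                if k in {"learning_rate", "max_depth", "max_iter", "min_samples_leaf", "l2_regularization"}}
    # members = fit_ensemble(Xf, y, base_params, seeds=(11, 23, 42, 77))
    # model = members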

    # 5) persist bundle with schema for inference
    bundle = {"model": model, "feature_names": list(Xf.columns)}
    joblib.dump(bundle, os.path.join(model_directory_path, "model.joblib"))
    print(f"[train] saved -> {os.path.join(model_directory_path, 'model.joblib')}")


### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

**Important workflow:**
1. Load your model;
2. Use the `yield` statement to signal readiness to the runner;
3. Process each dataset one by one within the for loop;
4. For each dataset, use `yield prediction` to return your prediction.

**Note:** The datasets can only be iterated once!
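
The workflow above is a generator protocol. The toy example below mimics it in plain Python: `infer` yields once with no value to signal readiness, then yields one prediction per dataset. `toy_runner` is a hypothetical stand-in written for this illustration; the real runner is not shown in this notebook and may consume the generator differently.

```python
from typing import Iterable, Iterator, List, Optional

def infer(datasets: Iterable[List[float]]) -> Iterator[Optional[float]]:
    # 1. Load your model here (omitted in this toy example).
    yield  # 2. Bare yield: signal readiness to the runner

    for values in datasets:  # 3. Process the datasets one by one
        # 4. Yield one prediction per dataset (here simply the mean).
        yield sum(values) / len(values)

def toy_runner(datasets: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for the real runner, for illustration only."""
    gen = infer(iter(datasets))  # a one-shot iterator, like the real datasets
    ready = next(gen)            # consume the readiness signal
    assert ready is None
    return list(gen)             # collect one prediction per dataset

predictions = toy_runner([[0.0, 1.0], [0.5, 0.5], [1.0, 1.0]])
print(predictions)
```

Because the datasets arrive through a one-shot iterator, anything you need from a dataset must be computed before the loop moves on to the next one.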

In [48]:
# Earlier t-test baseline, kept commented out for reference.
# It is superseded by the model-based infer() defined in the next cell.
#
# def infer(
#     X_test: typing.Iterable[pd.DataFrame],
#     model_directory_path: str,
# ):
#     yield  # Mark as ready
#
#     # X_test can only be iterated once.
#     # Before getting the next dataset, you must predict the current one.
#     for dataset in X_test:
#         # Compute a t-test between the values before and after the boundary point.
#         # The negative p-value is the score: a smaller p-value gives a score closer
#         # to 0 (the maximum), which means more evidence against the null hypothesis
#         # that both segments share the same mean, and so suggests a structural break.
#         def t_test(u: pd.DataFrame):
#             return -scipy.stats.ttest_ind(
#                 u["value"][u["period"] == 0],  # Values before boundary point
#                 u["value"][u["period"] == 1],  # Values after boundary point
#             ).pvalue
#
#         prediction = t_test(dataset)
#         yield prediction  # Send the prediction for the current dataset

In [49]:
import os, typing as t
import numpy as np
import pandas as pd
import joblib

def _to_feature_row(df_one_id: pd.DataFrame, feature_names: t.List[str]) -> pd.DataFrame:
    """
    Turn a single-id time series DataFrame into a 1xD feature row,
    aligned to 'feature_names' (missing features -> 0.0).
    """
    # If df has MultiIndex (id, time), drop the id level
    if isinstance(df_one_id.index, pd.MultiIndex) and "id" in df_one_id.index.names:
        # Expect a single id per dataset; drop it
        df_one_id = df_one_id.droplevel("id")

    # Build features for a single id by reusing the rich extractor over a tiny fake batch
    # (wrap in a MultiIndex with id=0 temporarily)
    tmp = df_one_id.copy()
    tmp.index = pd.MultiIndex.from_product([[0], tmp.index], names=["id", "time"])
    feats = extract_features_rich(tmp)          # returns DataFrame indexed by id
    row = feats.iloc[[0]]                       # 1xD
    if feature_names:                           # align to saved schema
        row = row.reindex(columns=feature_names, fill_value=0.0)
    row = row.replace([np.inf, -np.inf], 0.0).fillna(0.0)
    return row

def infer(
    X_test: t.Iterable[pd.DataFrame],
    model_directory_path: str,
):
    """
    Load trained model and yield P(y=1) for each incoming dataset.
    """
    bundle = joblib.load(os.path.join(model_directory_path, "model.joblib"))

    # Support both: (a) raw estimator, (b) dict/bundle with feature_names.
    if isinstance(bundle, dict) and "model" in bundle:
        model = bundle["model"]
        feature_names = bundle.get("feature_names", [])
    else:
        model = bundle
        # Try to load feature schema if saved separately
        feat_path = os.path.join(model_directory_path, "feature_names.joblib")
        feature_names = joblib.load(feat_path) if os.path.exists(feat_path) else []

    # Handshake: ready
    yield

    # Iterate ONCE over datasets
    for dataset in X_test:
        # Compute 1xD feature row aligned to training schema
        x_row = _to_feature_row(dataset, feature_names)

        # Predict the probability of the positive class (y=1: structural break).
        # predict_proba orders its columns by model.classes_, so column 1 is
        # class 1 when the model was trained on {0, 1} labels.
        proba_pos = float(model.predict_proba(x_row)[:, 1][0])

        # Yield a scalar in [0,1]
        yield proba_pos


## Local testing

To make sure your `train()` and `infer()` functions work properly, call `crunch.test()`, which reproduces the cloud environment locally. <br />
Even if the reproduction is not perfect, it should quickly tell you whether your model works.

In [50]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

14:28:16 no forbidden library found
14:28:16 
14:28:17 started
14:28:17 running local test
14:28:17 internet access isn't restricted, no check will be done
14:28:17 
14:28:18 starting unstructured loop...
14:28:18 executing - command=train


data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match
Progress report...extracting features per id
Progress report...extracting features from id: 1000
Progress report...ex

Traceback (most recent call last):
  File "https://github.com/crunchdao/competitions/raw/refs/heads/master/competitions/structural-break/scoring/runner.py", line 27, in run
  File "/usr/local/lib/python3.12/dist-packages/crunch/runner/local.py", line 559, in execute
    result = utils.smart_call(
             ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/crunch/utils.py", line 270, in smart_call
    return function(**arguments)
           ^^^^^^^^^^^^^^^^^^^^^
  File "https://github.com/crunchdao/competitions/raw/refs/heads/master/competitions/structural-break/scoring/runner.py", line 78, in train
  File "/usr/local/lib/python3.12/dist-packages/crunch/utils.py", line 270, in smart_call
    return function(**arguments)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ipython-input-225521241.py", line 65, in train
    best = tune_hgb(Xf, y)
           ^^^^^^^^^^^^^^^
  File "/tmp/ipython-input-750138674.py", line 20, in tune_hgb
    gs.fit(Xf, y.astype(int), sample_weigh

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/crunch/utils.py", line 434, in limit_traceback
    yield
  File "/usr/local/lib/python3.12/dist-packages/crunch/utils.py", line 268, in smart_call
    return function(**arguments)
           ^^^^^^^^^^^^^^^^^^^^^
  File "https://github.com/crunchdao/competitions/raw/refs/heads/master/competitions/structural-break/scoring/runner.py", line 27, in run
  File "/usr/local/lib/python3.12/dist-packages/crunch/runner/local.py", line 559, in execute
    result = utils.smart_call(
             ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/crunch/utils.py", line 270, in smart_call
    return function(**arguments)
           ^^^^^^^^^^^^^^^^^^^^^
  File "https://github.com/crunchdao/competitions/raw/refs/heads/master/competitions/structural-break/scoring/runner.py", line 78, in train
  File "/usr/local/lib/python3.12/dist-packages/crunch/utils.py", line 270, in smart_call
    return function(**ar

TypeError: object of type 'NoneType' has no len()

## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [38]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

| id | prediction |
|---|---|
| 10001 | 0.266338 |
| 10002 | 0.352830 |
| 10003 | 0.229006 |
| 10004 | 0.141836 |
| 10005 | 0.282935 |
| ... | ... |
| 10097 | 0.255436 |
| 10098 | 0.274997 |
| 10099 | 0.296849 |
| 10100 | 0.191998 |


### Local scoring

You can call the function that the system uses to estimate your score locally.

In [39]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

np.float64(0.6938967136150235)