[![Open In Colab](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-colab.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb)
[![Open In Kaggle](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-kaggle.svg)](https://www.kaggle.com/code/crunchdao/structural-break-baseline)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# ADIA Lab Structural Break Challenge

## Challenge Overview

Welcome to the ADIA Lab Structural Break Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)
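
To make the definition concrete, here is a minimal synthetic sketch (not the competition's data generator): Gaussian noise whose mean and scale may change at the boundary. The helper `make_series` and its parameters are hypothetical names introduced for illustration; only the `value` and `period` column names follow the code used later in this notebook.

```python
import numpy as np
import pandas as pd


def make_series(n_before=200, n_after=100, mean_shift=0.0, scale_ratio=1.0, seed=0):
    """Simulate Gaussian noise whose mean and scale may change at the boundary."""
    rng = np.random.default_rng(seed)
    before = rng.normal(loc=0.0, scale=1.0, size=n_before)
    after = rng.normal(loc=mean_shift, scale=scale_ratio, size=n_after)
    return pd.DataFrame({
        "value": np.concatenate([before, after]),
        # 0 marks observations before the boundary, 1 marks those after it
        "period": np.repeat([0, 1], [n_before, n_after]),
    })


no_break = make_series()                                   # same process throughout
with_break = make_series(mean_shift=1.5, scale_ratio=2.0)  # mean and variance both change

# Compare per-segment summary statistics for the two cases
print(no_break.groupby("period")["value"].agg(["mean", "std"]))
print(with_break.groupby("period")["value"].agg(["mean", "std"]))
```

Real breaks can be far subtler than a mean or variance shift, for example a change in autocorrelation, so summary statistics like these are only a starting point.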

### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.
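
As a rough illustration of what such a score could look like, the sketch below maps the p-value of Welch's t-test to `1 - p`. This is only one possible choice, it reacts to mean shifts alone, and the function name `break_score` is hypothetical.

```python
import numpy as np
import scipy.stats


def break_score(before, after):
    """Return a score in [0, 1]; higher means stronger evidence of a mean shift."""
    result = scipy.stats.ttest_ind(before, after, equal_var=False)  # Welch's t-test
    p_value = result.pvalue
    if np.isnan(p_value):
        return 0.5  # neutral score when the test is undefined (e.g. zero variance)
    return float(1.0 - p_value)


rng = np.random.default_rng(0)
same = break_score(rng.normal(0, 1, 200), rng.normal(0, 1, 100))
shifted = break_score(rng.normal(0, 1, 200), rng.normal(1, 1, 100))
print(f"no shift: {same:.3f}, shifted: {shifted:.3f}")
```

Because the metric is rank-based, the score does not need to be a calibrated probability. It only needs to order series with a break above series without one.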

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). It depends only on how your scores rank the time series, so it measures detection performance regardless of how the scores are calibrated.

- ROC AUC around `0.5`: no better than random guessing;
- ROC AUC approaching `1.0`: near-perfect detection.
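
The independence from calibration can be checked on a toy example. In this small sketch with made-up labels and scores, a strictly increasing transform of the scores leaves the ROC AUC unchanged, while reversing the ranking flips it around `0.5`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

auc_raw = roc_auc_score(y_true, scores)
auc_cubed = roc_auc_score(y_true, scores ** 3)       # strictly increasing on [0, 1]: same ranking
auc_reversed = roc_auc_score(y_true, 1.0 - scores)   # reversed ranking

# 8 of the 9 (positive, negative) pairs are ordered correctly, so the AUC is 8/9
print(round(auc_raw, 3), round(auc_cubed, 3), round(auc_reversed, 3))
```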

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [1]:
%pip install crunch-cli --upgrade --quiet --progress-bar off
!crunch setup-notebook structural-break Z1nLW76rKrAZxriHTfIpS1tm

crunch-cli, version 7.5.0
main.py: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/29012/main.py (5864 bytes)
notebook.ipynb: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/29012/notebook.ipynb (74090 bytes)
requirements.txt: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/29012/requirements.original.txt (203 bytes)
data/X_train.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_test.reduced.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/y_train.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_test.reduced.parquet: download from https://crunchdao--c

In [2]:
import os
import typing

# Import your dependencies
import joblib
import pandas as pd
import numpy as np
import scipy
import scipy.stats
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

import crunch

# Load the Crunch tooling
crunch = crunch.load_notebook()

# Load the data simply
X_train, y_train, X_test = crunch.load_data()
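
The training code below reads an `id` index level and the `value` and `period` columns. The toy frame in this sketch mimics that layout so you can see how one series splits at the boundary. The second index level (`time`) and the boolean label series are assumptions about the real data, so check `X_train.head()` and `y_train.head()` before relying on them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two toy series (id 0 and 1), six observations each; `time` is an assumed level name
index = pd.MultiIndex.from_product([[0, 1], range(6)], names=["id", "time"])
X_toy = pd.DataFrame(
    {"value": rng.normal(size=12), "period": [0, 0, 0, 0, 1, 1] * 2},
    index=index,
)
# One label per series id (assumed boolean, named like the test target column)
y_toy = pd.Series([False, True], index=pd.Index([0, 1], name="id"), name="structural_breakpoint")

segment_sizes = {}
for ts_id, series in X_toy.groupby(level="id"):
    before = series.loc[series["period"] == 0, "value"]  # observations before the boundary
    after = series.loc[series["period"] == 1, "value"]   # observations after the boundary
    segment_sizes[ts_id] = (len(before), len(after), bool(y_toy.loc[ts_id]))

print(segment_sizes)
```

Iterating with `groupby(level="id")` visits each series once, which tends to be faster on large frames than repeated `.loc[ts_id]` lookups.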

In [None]:
def extract_features(df: pd.DataFrame) -> dict:
    """Extract features that compare the segments before and after the boundary.

    Defined at module level rather than inside `train`: a nested function
    cannot be pickled by `joblib.dump`, and inference needs the same code.
    """
    before = df[df["period"] == 0]["value"]
    after = df[df["period"] == 1]["value"]

    features = {}

    # Basic statistics
    features['mean_diff'] = np.abs(after.mean() - before.mean())
    features['std_diff'] = np.abs(after.std() - before.std())
    features['var_ratio'] = after.var() / (before.var() + 1e-8)  # epsilon avoids division by zero
    features['median_diff'] = np.abs(after.median() - before.median())

    # Distribution tests (fall back to a neutral p-value of 0.5 if a test fails)
    try:
        _, features['ttest_pvalue'] = scipy.stats.ttest_ind(before, after)
    except Exception:
        features['ttest_pvalue'] = 0.5

    try:
        _, features['ks_pvalue'] = scipy.stats.ks_2samp(before, after)
    except Exception:
        features['ks_pvalue'] = 0.5

    try:
        _, features['mannwhitney_pvalue'] = scipy.stats.mannwhitneyu(before, after)
    except Exception:
        features['mannwhitney_pvalue'] = 0.5

    # Trend analysis: change in the slope of a straight-line fit
    try:
        before_trend = np.polyfit(range(len(before)), before, 1)[0] if len(before) > 1 else 0
        after_trend = np.polyfit(range(len(after)), after, 1)[0] if len(after) > 1 else 0
        features['trend_change'] = np.abs(after_trend - before_trend)
    except Exception:
        features['trend_change'] = 0

    # Volatility measures: change in the average rolling standard deviation
    try:
        before_vol = before.rolling(min(10, len(before)//2), min_periods=1).std().mean()
        after_vol = after.rolling(min(10, len(after)//2), min_periods=1).std().mean()
        features['volatility_change'] = np.abs(after_vol - before_vol)
    except Exception:
        features['volatility_change'] = 0

    # Quantile differences
    for q in [0.25, 0.5, 0.75, 0.9]:
        try:
            features[f'quantile_{q}_diff'] = np.abs(after.quantile(q) - before.quantile(q))
        except Exception:
            features[f'quantile_{q}_diff'] = 0

    # Range changes
    features['range_before'] = before.max() - before.min()
    features['range_after'] = after.max() - after.min()
    features['range_change'] = np.abs(features['range_after'] - features['range_before'])

    # Skewness and kurtosis changes
    try:
        features['skew_change'] = np.abs(scipy.stats.skew(after) - scipy.stats.skew(before))
        features['kurtosis_change'] = np.abs(scipy.stats.kurtosis(after) - scipy.stats.kurtosis(before))
    except Exception:
        features['skew_change'] = 0
        features['kurtosis_change'] = 0

    return features


def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):

    # Extract features for all training time series
    feature_list = []
    labels = []

    for ts_id in X_train.index.get_level_values('id').unique():
        ts_data = X_train.loc[ts_id]
        features = extract_features(ts_data)
        feature_list.append(features)
        labels.append(y_train.loc[ts_id])

    # Convert to DataFrame
    feature_df = pd.DataFrame(feature_list)

    # Handle missing values
    feature_df = feature_df.fillna(0)
    feature_df = feature_df.replace([np.inf, -np.inf], 0)

    # Scale features
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(feature_df)

    # Train Random Forest classifier
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42,
        class_weight='balanced'
    )
    model.fit(scaled_features, labels)

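    # Save the model, the scaler, and the feature column order.
    # The feature function is not stored: pickle saves functions by reference only
    # and fails on functions nested inside another function, so inference must
    # call the same `extract_features` code instead.
    joblib.dump({
        'model': model,
        'scaler': scaler,
        'feature_columns': feature_df.columns.tolist(),
    }, os.path.join(model_directory_path, 'model.joblib'))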

# Your model

In [None]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [None]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

### Local scoring

You can estimate your score locally by calling the same scoring function that the platform uses.

In [None]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)
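
`roc_auc_score` pairs targets and predictions by position, not by index, so it can be worth aligning the two on the series `id` first. The toy sketch below shows the effect. The `prediction` column name and the `id` index are assumptions, so check the actual layout of `data/prediction.parquet` before adapting it.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy target and prediction whose rows are stored in different orders
target_toy = pd.Series(
    [True, False, True, False],
    index=pd.Index([10, 11, 12, 13], name="id"),
)
prediction_toy = pd.DataFrame(
    {"prediction": [0.2, 0.9, 0.1, 0.8]},  # hypothetical column name
    index=pd.Index([11, 10, 13, 12], name="id"),
)

# Reorder predictions to match the target's ids, and check that none is missing
aligned = prediction_toy["prediction"].reindex(target_toy.index)
assert not aligned.isna().any()

score_aligned = roc_auc_score(target_toy, aligned)
score_positional = roc_auc_score(target_toy, prediction_toy["prediction"].to_numpy())
print(score_aligned, score_positional)
```

In this toy case the aligned score is perfect while the positional one is fully inverted, which is the kind of silent error an explicit alignment guards against.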