[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# ADIA Lab Structural Break Challenge


## Challenge Overview

Welcome to the ADIA Lab Structural Break Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)
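As a rough illustration of the definition above, the following sketch simulates two series: one whose generating process stays the same across the boundary, and one whose mean shifts after it. The helper `simulate_series` and all of its parameters are hypothetical, and a mean shift is only one kind of break; the challenge data may contain other kinds (variance, autocorrelation, and so on).

```python
import random
import statistics


def simulate_series(n_before, n_after, shift, seed=0):
    """Gaussian noise whose mean moves by `shift` after the boundary (toy example)."""
    rng = random.Random(seed)
    before = [rng.gauss(0.0, 1.0) for _ in range(n_before)]
    after = [rng.gauss(shift, 1.0) for _ in range(n_after)]
    return before, after


# Same process on both sides of the boundary: no structural break
no_break_before, no_break_after = simulate_series(500, 500, shift=0.0)

# Mean jumps by 2 after the boundary: a structural break
break_before, break_after = simulate_series(500, 500, shift=2.0)

gap_no_break = abs(statistics.fmean(no_break_after) - statistics.fmean(no_break_before))
gap_break = abs(statistics.fmean(break_after) - statistics.fmean(break_before))

print(f"mean gap without break: {gap_no_break:.3f}")
print(f"mean gap with break:    {gap_break:.3f}")
```

The gap between segment means stays near zero in the first case and close to the injected shift in the second.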

### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), which measures the performance of detection algorithms regardless of their specific calibration.

- ROC AUC around `0.5`: No better than random chance;
- ROC AUC approaching `1.0`: Perfect detection.
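To make the calibration point concrete, here is a minimal pure-Python sketch of ROC AUC computed as the share of (positive, negative) pairs that are ranked correctly, with ties counting one half. The helper name `pairwise_roc_auc` and the toy labels and scores are illustrative assumptions; actual scoring uses `sklearn.metrics.roc_auc_score`.

```python
def pairwise_roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

auc_raw = pairwise_roc_auc(labels, scores)
# A strictly increasing transform changes the scores but not their ranking
auc_cubed = pairwise_roc_auc(labels, [s ** 3 for s in scores])

print(auc_raw, auc_cubed)
```

Because only the ordering of the scores matters, any strictly increasing transformation of your predictions leaves the metric unchanged.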

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [None]:
%pip install crunch-cli --upgrade --quiet --progress-bar off
!crunch setup-notebook structural-break HBS3kIvMp1h4YLvQTBuMZTNB

# Your model

## Setup

In [None]:
import os
import typing
import tqdm
import logging

# Import your dependencies
import joblib
import numpy as np # == 1.26.4
import pandas as pd
from datetime import datetime
from scipy.stats import ttest_ind, ks_2samp, levene
from scipy.stats import entropy, normaltest, jarque_bera
from statsmodels.tsa.stattools import adfuller
from scipy.signal import periodogram

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import sklearn.metrics
from xgboost import XGBClassifier
from prophet import Prophet
# from pmdarima import auto_arima # == 2.0.4

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Install required libraries
# !pip install numpy==1.26.4
# !pip install pmdarima==2.0.4

In [None]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

## Understanding the Data

The dataset consists of univariate time series, each containing ~2,000-5,000 values with a designated boundary point. For each time series, you need to determine whether a structural break occurred at this boundary point.

The data was downloaded when you set up your local environment and is now available in the `data/` directory.

In [None]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

### Understanding `X_train`

The training data is structured as a pandas DataFrame with a MultiIndex:

**Index Levels:**
- `id`: Identifies the unique time series
- `time`: The timestep within each time series

**Columns:**
- `value`: The actual time series value at each timestep
- `period`: A binary indicator where `0` represents the **period before** the boundary point, and `1` represents the **period after** the boundary point
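The layout described above can be explored on a small stand-in before touching the real data. The sketch below builds a toy DataFrame with the same index levels and columns (the values are made up), selects one series by `id`, splits it at the boundary, and recovers the boundary timestep for each series.

```python
import pandas as pd

# Toy data in the same layout: two series (ids 0 and 1), four timesteps each
index = pd.MultiIndex.from_product([[0, 1], range(4)], names=["id", "time"])
toy = pd.DataFrame(
    {
        "value": [0.1, 0.2, 1.5, 1.7, -0.3, -0.1, -0.2, 0.0],
        "period": [0, 0, 1, 1, 0, 0, 0, 1],
    },
    index=index,
)

# Select a single series by its id; the result is indexed by `time`
series = toy.loc[0]

# Split the series into the segments before and after the boundary point
before = series.loc[series["period"] == 0, "value"]
after = series.loc[series["period"] == 1, "value"]

# Boundary timestep per id: the first timestep where period == 1
boundary = toy[toy["period"] == 1].reset_index().groupby("id")["time"].min()

print(before.tolist(), after.tolist())
print(boundary)
```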

In [None]:
X_train

### Understanding `y_train`

This is a `pandas.Series` that indicates whether each time series, identified by its `id`, contains a structural break.

**Index:**
- `id`: the ID of the dataset

**Value:**
- `structural_breakpoint`: Boolean indicating whether a structural break occurred (`True`) or not (`False`)

In [None]:
y_train
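Because the labels are keyed by `id`, it is safer to align them explicitly with any per-series feature table than to rely on row order. The sketch below uses made-up ids and a single illustrative feature column; `reindex` is the standard pandas call for this kind of alignment.

```python
import pandas as pd

# Toy per-series feature table, deliberately not sorted by id
features = pd.DataFrame(
    {"diff_mean": [0.9, 0.1, 1.4]},
    index=pd.Index([2, 0, 1], name="id"),
)

# Toy labels in the same shape as y_train
labels = pd.Series(
    [False, True, True],
    index=pd.Index([0, 1, 2], name="id"),
    name="structural_breakpoint",
)

# Reorder the labels so that row i of `aligned` matches row i of `features`
aligned = labels.reindex(features.index)

# A missing label would show up as NaN after reindexing
assert not aligned.isna().any()

print(aligned)
```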

### Understanding `X_test`

The test data is provided as a **`list` of `pandas.DataFrame`s** with the same format as [`X_train`](#understanding-X_train).

It is structured as a list to encourage processing records one by one, which will be mandatory in the `infer()` function.

In [None]:
print("Number of datasets:", len(X_test))

In [None]:
X_test[0]

## Strategy Implementation

There are multiple approaches you can take to detect structural breaks:

1. **Statistical Tests**: Compare distributions before and after the boundary point;
2. **Feature Engineering**: Extract features from both segments for comparison;
3. **Time Series Modeling**: Detect deviations from expected patterns;
4. **Machine Learning**: Train models to recognize break patterns from labeled examples.

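The implementation below combines the second and fourth approaches: it extracts summary statistics, statistical-test p-values, and frequency features from the segments before and after the boundary point, then trains an XGBoost classifier on a selected subset of those features.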
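As a standalone illustration of the first approach in the list (statistical tests), the sketch below computes Welch's t statistic for a difference in means using only the standard library. This is a swapped-in, simplified test rather than the notebook's method: the notebook's feature code calls `scipy.stats.ttest_ind`, `ks_2samp`, and `levene` and uses their p-values as features. The helper name and the toy values are assumptions.

```python
import math
import statistics


def welch_t_statistic(before, after):
    """Welch's t statistic for a difference in means (unequal variances allowed)."""
    mean_gap = statistics.fmean(after) - statistics.fmean(before)
    standard_error = math.sqrt(
        statistics.variance(before) / len(before)
        + statistics.variance(after) / len(after)
    )
    return mean_gap / standard_error


before = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
after_same = [0.0, 0.1, -0.3, 0.2, 0.1, -0.1]
after_shifted = [value + 2.0 for value in after_same]  # same noise, shifted mean

t_same = welch_t_statistic(before, after_same)
t_shifted = welch_t_statistic(before, after_shifted)

print(f"t without shift: {t_same:.2f}")
print(f"t with shift:    {t_shifted:.2f}")
```

A larger absolute t statistic is stronger evidence that the two segments have different means. A test like this only detects mean shifts, which is one reason to combine several tests and features.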

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

The implementation below does train a model: `train()` fits an `XGBClassifier` on the engineered features and saves it as `model.joblib` in `model_directory_path`, where `infer()` later loads it.

In [None]:
# Generate features from values before and after breakpoint
def generate_features(df: pd.DataFrame):
    # Define number of windows
    n_points = df.groupby('period')['id'].count().min()
    start = [s for s in range(0, 50, 10)]
    windows = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90] + [w for w in range(100, 1001, 50)]
    print('STARTED FEATURE GENERATION: ', datetime.now().strftime("%d-%m-%Y %H:%M:%S"))

    def extract_statistical_test_features(time_series_data, n):
        period_0 = time_series_data[time_series_data['period'] == 0]['value']
        period_1 = time_series_data[time_series_data['period'] == 1]['value']
        _ , p_value_1 = ks_2samp(period_0, period_1)
        _ , p_value_2 = ttest_ind(period_0, period_1)
        _ , p_value_3 = levene(period_0, period_1)
        p_values_stat_tests = pd.DataFrame({f'pval_ks_2samp_{n}': p_value_1, f'pval_ttest_ind_{n}': p_value_2, f'pval_levene_{n}': p_value_3}, index=[0])
        return p_values_stat_tests

    def extract_frequency_features(time_series_data, sampling_rate=1, num_main_freqs=3):
        """
        Extracts the main frequencies from time series data using a periodogram.

        Args:
          time_series_data (np.array): The input time series data.
          sampling_rate (float): The sampling rate of the time series data (samples per unit time).
          num_main_freqs (int): The number of main frequencies to extract.

        Returns:
          dict: A dictionary containing the top frequencies, their corresponding periods,
                and their power spectral densities.
        """
        # Estimate power spectral density using a periodogram
        frequencies, power_spectral_density = periodogram(time_series_data, fs=sampling_rate)

        # Get indices for the highest power spectral density values
        top_freq_indices = np.argsort(power_spectral_density)[::-1][:num_main_freqs]

        # Extract the top frequencies, powers, and calculate periods
        main_frequencies = frequencies[top_freq_indices]
        main_powers = power_spectral_density[top_freq_indices]
        main_periods = 1 / main_frequencies

        results = {}
        for i in range(num_main_freqs):
            results[f'freq_{i+1}'] = main_frequencies[i]
            results[f'period_{i+1}'] = main_periods[i]
            results[f'power_{i+1}'] = main_powers[i]

        df_freq = pd.DataFrame(results, index=[0])
        return df_freq

    agg_funcs = [('mean', 'mean'), ('std', 'std'), ('max', 'max'), ('min', 'min'), ('median', 'median'), ('skew', 'skew'), ('kurtosis', lambda y: y.kurt()),
                 ('ptp', lambda y: np.ptp(y)), ('percentile_25', lambda y: np.percentile(y, 25)), ('percentile_75', lambda y: np.percentile(y, 75)),
                 ('count', 'count'), ('first', 'first'), ('last', 'last'),
                 ('trend', lambda y: np.mean(np.diff(y))), ('volatility', lambda y: np.std(np.diff(y))), ('range', lambda y: np.max(y) - np.min(y)),
                 ('mean_absolute_change', lambda y: np.mean(np.abs(np.diff(y)))), ('max_jump', lambda y: np.nan if len(np.abs(np.diff(y))) == 0 else np.max(np.abs(np.diff(y)))), # Change dynamics
                 ('num_turning_points', lambda y: np.sum(np.diff(np.sign(np.diff(y))) != 0)), ('upward_steps', lambda y:  np.sum(np.diff(y) > 0)), ('downward_steps', lambda y: np.sum(np.diff(y) < 0)), # Complexity
                 # ('entropy', lambda y: lambda y: entropy(np.histogram(y, bins=20, density=True)[0])), ('normality', lambda y: normaltest(y)[0]), ('stationarity', lambda y: adfuller(y)[0]), # Distribution shape
                 ('avg_fft_mag', lambda y: np.mean(np.abs(np.fft.fft(y)))), ('dominant_freq_ind', lambda y: np.argmax(np.abs(np.fft.fft(y)))), ('sum_fft_mag', lambda y: np.sum(np.abs(np.fft.fft(y)))),  # Frequency
                 ('auto_1', lambda y: y.autocorr(lag=1)), ('auto_2', lambda y: y.autocorr(lag=2)), ('auto_3', lambda y: y.autocorr(lag=3))]

    # Find breakpoint time & position difference for each 'id'
    break_point_times = df[df['period'] == 1].groupby(['id','period'])[['time']].first()\
                                           .rename({'time': 'breakpoint_time'}, axis=1).reset_index()
    df = pd.merge(df, break_point_times[['id', 'breakpoint_time']], on='id', how='inner')
    df['position_from_breakpoint_time'] = df['time'] - df['breakpoint_time']
    pos = df['position_from_breakpoint_time'].values

    # Generate aggregate features for the full range before and after the boundary
    pre_break_features = df[df['period'] == 0].groupby('id').agg({'value': agg_funcs}).sort_index()
    pre_break_features.columns = [col[0] + '_' + col[1] + '_pre' for col in  pre_break_features.columns]
    post_break_features = df[df['period'] == 1].groupby('id').agg({'value': agg_funcs}).sort_index()
    post_break_features.columns = [col[0] + '_' + col[1] + '_post' for col in  post_break_features.columns]
    pre_break_features['value_slope_pre'] = pre_break_features.apply(lambda x: (x['value_last_pre'] - x['value_first_pre'])/x['value_count_pre'], axis=1)
    post_break_features['value_slope_post'] = post_break_features.apply(lambda x: (x['value_last_post'] - x['value_first_post'])/x['value_count_post'], axis=1)

    df_features = pd.concat([pre_break_features, post_break_features], axis=1)

    # Calculate diff, ratio and pct_change between pre and post-break periods for aggregate functions
    for agg in [f[0] for f in agg_funcs]:
        df_features[f'diff_{agg}'] = df_features[f'value_{agg}_post'] - df_features[f'value_{agg}_pre']
        df_features[f'avg_{agg}'] =  (df_features[f'value_{agg}_post'] + df_features[f'value_{agg}_pre'])/2
        df_features[f'ratio_{agg}'] =  df_features.apply(lambda x: round(x[f'value_{agg}_post']/x[f'value_{agg}_pre'], 4)
                                                                         if x[f'value_{agg}_pre'] != 0 else np.nan, axis=1)
        df_features[f'pct_change_{agg}'] = df_features.apply(lambda x: round(x[f'diff_{agg}']/x[f'avg_{agg}'], 4)
                                                                         if x[f'avg_{agg}'] != 0 else np.nan, axis=1)

    print('DONE 1: ', datetime.now().strftime("%d-%m-%Y %H:%M:%S"))

    # Generate aggregate features for n-point windows before and after the boundary (sliding windows)
    # for n in tqdm.tqdm(range(1, 21, 1)):
    #     pre_break_features = df[df['position_from_breakpoint_time'].between(-50*n, -50*(n-1))].groupby('id').agg({'value': agg_funcs}).sort_index()
    #     pre_break_features.columns = [col[0] + '_' + col[1] + f'_pre_{n}_mw' for col in  pre_break_features.columns]
    #     post_break_features = df[df['position_from_breakpoint_time'].between(50*(n-1), 50*n)].groupby('id').agg({'value': agg_funcs}).sort_index()
    #     post_break_features.columns = [col[0] + '_' + col[1] + f'_post_{n}_mw' for col in  post_break_features.columns]
    #     pre_break_features[f'value_slope_pre_{n}_mw'] = pre_break_features.apply(lambda x: (x[f'value_last_pre_{n}_mw'] - x[f'value_first_pre_{n}_mw'])/x[f'value_count_pre_{n}_mw'], axis=1)
    #     post_break_features[f'value_slope_post_{n}_mw'] = post_break_features.apply(lambda x: (x[f'value_last_post_{n}_mw'] - x[f'value_first_post_{n}_mw'])/x[f'value_count_post_{n}_mw'], axis=1)

    #     df_features = pd.concat([df_features, pre_break_features, post_break_features], axis=1)

    # Calculate diff, ratio and pct_change between pre and post-break periods for aggregate functions
    # for agg in [f[0] for f in agg_funcs]:
    #     df_features[f'diff_{agg}_{n}_mw'] = df_features[f'value_{agg}_post_{n}_mw'] - df_features[f'value_{agg}_pre_{n}_mw']
    #     df_features[f'avg_{agg}_{n}_mw'] =  (df_features[f'value_{agg}_post_{n}_mw'] + df_features[f'value_{agg}_pre_{n}_mw'])/2
    #     df_features[f'ratio_{agg}_{n}_mw'] =  df_features.apply(lambda x: round(x[f'value_{agg}_post_{n}_mw']/x[f'value_{agg}_pre_{n}_mw'], 4)
    #                                                                        if x[f'value_{agg}_pre_{n}_mw'] != 0 else np.nan, axis=1)
    #     df_features[f'pct_change_{agg}_{n}_mw'] = df_features.apply(lambda x: round(x[f'diff_{agg}_{n}_mw']/x[f'avg_{agg}_{n}_mw'], 4)
    #                                                                        if x[f'avg_{agg}_{n}_mw'] != 0 else np.nan, axis=1)

    # Generate aggregate features for n-point windows before and after the boundary (centre windows)
    for n in tqdm.tqdm(windows):
        pre_break_features = df[df['position_from_breakpoint_time'].between(-n, -1)].groupby('id').agg({'value': agg_funcs}).sort_index()
        pre_break_features.columns = [col[0] + '_' + col[1] + f'_pre_{n}' for col in  pre_break_features.columns]

        post_break_features = df[df['position_from_breakpoint_time'].between(0, n-1)].groupby('id').agg({'value': agg_funcs}).sort_index()
        post_break_features.columns = [col[0] + '_' + col[1] + f'_post_{n}' for col in  post_break_features.columns]

        pre_break_features[f'value_slope_pre_{n}'] = pre_break_features.apply(lambda x: (x[f'value_last_pre_{n}'] - x[f'value_first_pre_{n}'])/x[f'value_count_pre_{n}'], axis=1)
        post_break_features[f'value_slope_post_{n}'] = post_break_features.apply(lambda x: (x[f'value_last_post_{n}'] - x[f'value_first_post_{n}'])/x[f'value_count_post_{n}'], axis=1)

        df_features = pd.concat([df_features, pre_break_features, post_break_features], axis=1)

        # pre_break_features_out = df[~df['position_from_breakpoint_time'].between(-n, -1)].groupby('id').agg({'value': agg_funcs}).sort_index()
        # post_break_features_out = df[~df['position_from_breakpoint_time'].between(0, n-1)].groupby('id').agg({'value': agg_funcs}).sort_index()
        # pre_break_features_out.columns = [col[0] + '_' + col[1] + f'_pre_{n}_out' for col in  pre_break_features_out.columns]
        # post_break_features_out.columns = [col[0] + '_' + col[1] + f'_post_{n}_out' for col in  post_break_features_out.columns]

        # pre_break_features_out[f'value_slope_pre_{n}_out'] = pre_break_features_out.apply(lambda x: (x[f'value_last_pre_{n}_out'] - x[f'value_first_pre_{n}_out'])/x[f'value_count_pre_{n}_out'], axis=1)
        # post_break_features_out[f'value_slope_post_{n}_out'] = post_break_features_out.apply(lambda x: (x[f'value_last_post_{n}_out'] - x[f'value_first_post_{n}_out'])/x[f'value_count_post_{n}_out'], axis=1)

        # df_features = pd.concat([df_features, pre_break_features_out, post_break_features_out], axis=1)

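    # NOTE: this loop runs after the window loop above has finished, so `n` is the
    # last window (1000) and only the `*_1000` diff/avg/ratio/pct_change features are created.
    for agg in [f[0] for f in agg_funcs]: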
        df_features[f'diff_{agg}_{n}'] = df_features[f'value_{agg}_post_{n}'] - df_features[f'value_{agg}_pre_{n}']
        df_features[f'avg_{agg}_{n}'] =  (df_features[f'value_{agg}_post_{n}'] + df_features[f'value_{agg}_pre_{n}'])/2
        df_features[f'ratio_{agg}_{n}'] =  df_features.apply(lambda x: round(x[f'value_{agg}_post_{n}']/x[f'value_{agg}_pre_{n}'], 4)
                                                                           if x[f'value_{agg}_pre_{n}'] != 0 else np.nan, axis=1)
        df_features[f'pct_change_{agg}_{n}'] = df_features.apply(lambda x: round(x[f'diff_{agg}_{n}']/x[f'avg_{agg}_{n}'], 4)
                                                                           if x[f'avg_{agg}_{n}'] != 0 else np.nan, axis=1)

    print('DONE 2: ', datetime.now().strftime("%d-%m-%Y %H:%M:%S"))

    # Generate statistical test features for n-point window pre and post boundary (sliding windows)
    # for n in tqdm.tqdm(range(1, 21, 1)):
    #     p_values_stat_tests = df[df['position_from_breakpoint_time'].between(-n, n-1)].groupby('id').apply(lambda x: extract_statistical_test_features(x, n))
    #     p_values_stat_tests = p_values_stat_tests.droplevel(1)
    #     df_features = pd.concat([df_features, p_values_stat_tests], axis=1)

    # Generate statistical test features for n-point window pre and post boundary (centre windows)
    for n in tqdm.tqdm(windows):
        p_values_stat_tests = df[df['position_from_breakpoint_time'].between(-n, n-1)].groupby('id').apply(lambda x: extract_statistical_test_features(x, n))
        p_values_stat_tests = p_values_stat_tests.droplevel(1)
        df_features = pd.concat([df_features, p_values_stat_tests], axis=1)

    print('DONE 3: ', datetime.now().strftime("%d-%m-%Y %H:%M:%S"))

    # # Generate frequency features for pre and post boundary signals
    for i in (0, 1):
        suffix = '_pre' if i == 0 else '_post'
        freq_features = df[df['period'] == i].groupby('id').apply(lambda x: extract_frequency_features(x['value'], num_main_freqs=3))
        freq_features = freq_features.droplevel(1).filter(regex='^freq|^power')
        freq_features.columns = [col + suffix for col in freq_features.columns]
        df_features = pd.concat([df_features, freq_features], axis=1)

    # # Calculate percentage change between pre and post-break frequencies (n=1-3)
    for n in range(1, 4):
        df_features[f'diff_freq_{n}'] = df_features[f'freq_{n}_post'] - df_features[f'freq_{n}_pre']
        df_features[f'avg_freq_{n}'] =  (df_features[f'freq_{n}_post'] + df_features[f'freq_{n}_pre'])/2
        df_features[f'pct_change_freq_{n}'] = df_features.apply(lambda x: round(x[f'diff_freq_{n}']/x[f'avg_freq_{n}'], 4)
                                                             if x[f'avg_freq_{n}'] != 0 else np.nan, axis=1)
    print('DONE 4: ', datetime.now().strftime("%d-%m-%Y %H:%M:%S"))

    # Raw string avoids invalid escape sequences; underscores need no escaping in a regex
    feature_cols = df_features.filter(regex=r'^value_|^diff_|^ratio.*|^freq.*|^pval.*').columns
    df_features = df_features[feature_cols]
    df_features = df_features.copy()

    return df_features

In [None]:
# # Generate features from train data
# X_train_features = generate_features(X_train.reset_index())
# assert len(list(X_train_features.columns)) == len(set(X_train_features.columns))

# X_train_features = pd.read_parquet('/content/data/X_train_features.parquet')
# X_train_features = X_train_features[best_features]

# Select top-k best features for training
best_features = ['value_max_jump_pre_600', 'pval_levene_40', 'value_mean_absolute_change_post_650', 'value_ptp_pre_150', 'value_kurtosis_post_850', 'value_mean_post', 'value_min_post_600', 'value_percentile_25_pre_750', 'value_avg_fft_mag_pre_20', 'pval_ks_2samp_200', 'value_downward_steps_pre_850', 'value_max_post', 'value_kurtosis_pre', 'value_percentile_25_pre_850', 'value_percentile_25_post_450', 'pval_levene_100', 'value_mean_absolute_change_pre_250', 'value_min_post_500', 'value_percentile_25_post_250', 'value_num_turning_points_post_800', 'value_max_post_300', 'value_avg_fft_mag_pre_350', 'value_ptp_post_50', 'value_percentile_25_pre_150', 'value_kurtosis_pre_600', 'value_std_post_60', 'value_sum_fft_mag_post_750', 'value_kurtosis_pre_700', 'value_mean_pre_550', 'value_std_pre_20', 'value_avg_fft_mag_post', 'value_volatility_pre_30', 'pval_ks_2samp_90', 'value_mean_absolute_change_post_50', 'pval_ks_2samp_950', 'value_auto_1_post', 'value_auto_2_post', 'value_slope_post_350', 'value_mean_absolute_change_pre_700', 'value_std_pre_700', 'value_kurtosis_post_600', 'value_min_pre_70', 'value_mean_absolute_change_pre_60', 'value_percentile_25_pre_80', 'value_avg_fft_mag_post_650', 'pval_ks_2samp_30', 'value_mean_pre_450', 'value_max_pre_300', 'value_auto_2_pre_900', 'value_percentile_25_post_90', 'value_percentile_25_pre', 'value_min_pre_950', 'value_mean_absolute_change_post_90', 'value_max_post_40', 'value_num_turning_points_pre_10', 'value_upward_steps_pre_950', 'value_ptp_post_250', 'value_avg_fft_mag_post_500', 'value_ptp_pre_250', 'value_std_post_10', 'value_ptp_post_600', 'value_downward_steps_post_200', 'value_sum_fft_mag_post_400', 'value_downward_steps_post', 'value_mean_post_500', 'value_auto_1_post_450', 'value_max_jump_pre_70', 'value_num_turning_points_pre_200', 'pval_levene_400', 'value_mean_post_650', 'value_percentile_25_post_150', 'value_skew_post_800', 'value_std_post_20', 'value_ptp_post_5', 'diff_max_jump', 'pval_ks_2samp_60', 
'value_mean_absolute_change_pre_450', 'value_auto_3_pre_350', 'ratio_mean_absolute_change', 'diff_mean', 'value_ptp_pre_500', 'pval_levene_30', 'value_num_turning_points_post_80', 'value_volatility_post_90', 'value_ptp_pre_700', 'value_upward_steps_post_90', 'value_upward_steps_pre_250', 'value_percentile_25_pre_250', 'value_volatility_post_300', 'value_num_turning_points_post_50', 'value_max_jump_pre_10', 'value_volatility_pre_20', 'value_dominant_freq_ind_post_800', 'value_ptp_pre_550', 'value_upward_steps_pre_70', 'value_std_pre_500', 'ratio_std_1000', 'pval_ks_2samp_150', 'pval_ks_2samp_500', 'value_ptp_pre_950', 'value_percentile_75_post_450', 'value_std_post_80', 'pval_ttest_ind_950', 'value_volatility_pre_700', 'value_ptp_pre_650', 'value_upward_steps_post', 'pval_ks_2samp_700', 'value_upward_steps_pre_60', 'value_auto_1_pre_800', 'ratio_mean_absolute_change_1000', 'value_skew_post_400', 'value_auto_3_post_550', 'value_mean_pre_350', 'value_num_turning_points_pre_250', 'pval_ks_2samp_5', 'pval_levene_80', 'value_auto_1_pre_700', 'value_downward_steps_post_500', 'value_trend_post_350', 'ratio_num_turning_points', 'value_volatility_pre_750', 'value_median_pre_250', 'value_trend_post_80', 'value_skew_post', 'value_max_pre_80', 'value_ptp_post_750', 'value_mean_absolute_change_post_60', 'value_avg_fft_mag_pre', 'value_mean_pre_800', 'value_median_pre_150', 'value_percentile_75_pre_600', 'ratio_median', 'pval_ks_2samp_20', 'value_skew_pre_40', 'diff_percentile_25', 'value_trend_pre_150', 'value_kurtosis_post_150', 'value_auto_2_post_550', 'pval_levene_750', 'value_upward_steps_pre_350', 'value_max_jump_pre', 'value_mean_absolute_change_pre_650', 'value_percentile_25_post_500', 'pval_ks_2samp_550', 'value_auto_2_post_300', 'value_max_jump_post_250', 'diff_num_turning_points_1000', 'value_ptp_pre', 'value_mean_absolute_change_pre_20', 'value_dominant_freq_ind_post_950']

# Select best params for model training
best_params = {'n_estimators': 1000, 'max_depth': 8, 'learning_rate': 0.1}

In [None]:
# def validate(X_train_features: pd.DataFrame, y_train: pd.Series):
#     # Train classifier model to predict whether a structural break occurred
#     model = XGBClassifier(n_estimators=500, objective='binary:logistic', max_depth=6, n_jobs=-1, random_state=15)
#     print('RUNNING CROSS VALIDATION :')
#     roc_auc_scores = cross_val_score(model, X_train_features, y_train.values, cv=5, scoring='roc_auc', verbose=1)
#     return roc_auc_scores

# roc_auc_scores = validate(X_train_features, y_train)
# print(roc_auc_scores)
# print(roc_auc_scores.mean(), roc_auc_scores.std())

# from google.colab import files

# X_train_features.to_parquet('X_train_features.parquet')
# files.download('X_train_features.parquet')

In [None]:
def train(X_train: pd.DataFrame, y_train: pd.Series, model_directory_path: str):
    # Generate features from train data
    X_train_features = generate_features(X_train.reset_index())
    X_train_features = X_train_features[best_features]

    # Train classifier model to predict whether a structural break occurred
    model = XGBClassifier(**best_params, objective='binary:logistic', n_jobs=-1, random_state=15)
    print('TRAINING MODEL :')
    model.fit(X_train_features, y_train.values)

    # Get prediction from trained model
    # y_train_prediction = model.predict(X_train_features)
    # train_auc = sklearn.metrics.roc_auc_score(y_train['structural_breakpoint'], y_train_prediction)

    joblib.dump(model, os.path.join(model_directory_path, 'model.joblib'))

In [None]:
def feature_importance(X_train_features: pd.DataFrame, y_train: pd.Series):
    model = XGBClassifier(n_estimators=500, objective='binary:logistic', max_depth=6, n_jobs=-1, random_state=15)
    # Fit first: `feature_importances_` is only available on a trained model
    model.fit(X_train_features, y_train.values)

    # Put into a DataFrame for easy sorting
    feat_imp = pd.DataFrame({
        "feature": X_train_features.columns,
        "importance": model.feature_importances_})

    # Sort by importance (descending)
    feat_imp = feat_imp.sort_values(by="importance", ascending=False)
    return feat_imp

# feat_imp = feature_importance(X_train_features, y_train)
# feat_imp

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

**Important workflow:**
1. Load your model;
2. Use the `yield` statement to signal readiness to the runner;
3. Process each dataset one by one within the for loop;
4. For each dataset, use `yield prediction` to return your prediction.

**Note:** The datasets can only be iterated once!
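The generator protocol described in the workflow above can be sketched without any of the competition tooling. In the example below, `toy_infer` follows the same shape as `infer()` (one bare `yield`, then one `yield` per dataset), and `toy_runner` is a hypothetical driver written for illustration only; it is not the real Crunch runner, whose internals are not shown in this notebook.

```python
def toy_infer(datasets):
    """Same shape as infer(): one bare yield, then one yield per dataset."""
    threshold = 1.0  # stand-in for loading a trained model
    yield  # signal readiness before any data is consumed

    for before, after in datasets:
        gap = abs(sum(after) / len(after) - sum(before) / len(before))
        yield 1.0 if gap > threshold else 0.0  # prediction for this dataset


def toy_runner(infer_fn, datasets):
    """Hypothetical driver for illustration; NOT the real Crunch runner."""
    generator = infer_fn(iter(datasets))  # a one-shot iterator, like X_test
    ready = next(generator)               # runs the body up to the bare `yield`
    assert ready is None
    return list(generator)                # every further yield is a prediction


datasets = [
    ([0.0, 0.1], [0.1, 0.0]),  # same mean on both sides
    ([0.0, 0.1], [3.0, 3.2]),  # large mean shift
]
predictions = toy_runner(toy_infer, datasets)
print(predictions)
```

Since the input is consumed as a one-shot iterator, each prediction has to be yielded before the next dataset is requested.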

In [None]:
# Infer structural break based on trained model predictions
def infer(
    X_test: typing.Iterable[pd.DataFrame],
    model_directory_path: str,
):
    model = joblib.load(os.path.join(model_directory_path, 'model.joblib'))
    yield  # Mark as ready
    best_features = ['value_max_jump_pre_600', 'pval_levene_40', 'value_mean_absolute_change_post_650', 'value_ptp_pre_150', 'value_kurtosis_post_850', 'value_mean_post', 'value_min_post_600', 'value_percentile_25_pre_750', 'value_avg_fft_mag_pre_20', 'pval_ks_2samp_200', 'value_downward_steps_pre_850', 'value_max_post', 'value_kurtosis_pre', 'value_percentile_25_pre_850', 'value_percentile_25_post_450', 'pval_levene_100', 'value_mean_absolute_change_pre_250', 'value_min_post_500', 'value_percentile_25_post_250', 'value_num_turning_points_post_800', 'value_max_post_300', 'value_avg_fft_mag_pre_350', 'value_ptp_post_50', 'value_percentile_25_pre_150', 'value_kurtosis_pre_600', 'value_std_post_60', 'value_sum_fft_mag_post_750', 'value_kurtosis_pre_700', 'value_mean_pre_550', 'value_std_pre_20', 'value_avg_fft_mag_post', 'value_volatility_pre_30', 'pval_ks_2samp_90', 'value_mean_absolute_change_post_50', 'pval_ks_2samp_950', 'value_auto_1_post', 'value_auto_2_post', 'value_slope_post_350', 'value_mean_absolute_change_pre_700', 'value_std_pre_700', 'value_kurtosis_post_600', 'value_min_pre_70', 'value_mean_absolute_change_pre_60', 'value_percentile_25_pre_80', 'value_avg_fft_mag_post_650', 'pval_ks_2samp_30', 'value_mean_pre_450', 'value_max_pre_300', 'value_auto_2_pre_900', 'value_percentile_25_post_90', 'value_percentile_25_pre', 'value_min_pre_950', 'value_mean_absolute_change_post_90', 'value_max_post_40', 'value_num_turning_points_pre_10', 'value_upward_steps_pre_950', 'value_ptp_post_250', 'value_avg_fft_mag_post_500', 'value_ptp_pre_250', 'value_std_post_10', 'value_ptp_post_600', 'value_downward_steps_post_200', 'value_sum_fft_mag_post_400', 'value_downward_steps_post', 'value_mean_post_500', 'value_auto_1_post_450', 'value_max_jump_pre_70', 'value_num_turning_points_pre_200', 'pval_levene_400', 'value_mean_post_650', 'value_percentile_25_post_150', 'value_skew_post_800', 'value_std_post_20', 'value_ptp_post_5', 'diff_max_jump', 'pval_ks_2samp_60', 
'value_mean_absolute_change_pre_450', 'value_auto_3_pre_350', 'ratio_mean_absolute_change', 'diff_mean', 'value_ptp_pre_500', 'pval_levene_30', 'value_num_turning_points_post_80', 'value_volatility_post_90', 'value_ptp_pre_700', 'value_upward_steps_post_90', 'value_upward_steps_pre_250', 'value_percentile_25_pre_250', 'value_volatility_post_300', 'value_num_turning_points_post_50', 'value_max_jump_pre_10', 'value_volatility_pre_20', 'value_dominant_freq_ind_post_800', 'value_ptp_pre_550', 'value_upward_steps_pre_70', 'value_std_pre_500', 'ratio_std_1000', 'pval_ks_2samp_150', 'pval_ks_2samp_500', 'value_ptp_pre_950', 'value_percentile_75_post_450', 'value_std_post_80', 'pval_ttest_ind_950', 'value_volatility_pre_700', 'value_ptp_pre_650', 'value_upward_steps_post', 'pval_ks_2samp_700', 'value_upward_steps_pre_60', 'value_auto_1_pre_800', 'ratio_mean_absolute_change_1000', 'value_skew_post_400', 'value_auto_3_post_550', 'value_mean_pre_350', 'value_num_turning_points_pre_250', 'pval_ks_2samp_5', 'pval_levene_80', 'value_auto_1_pre_700', 'value_downward_steps_post_500', 'value_trend_post_350', 'ratio_num_turning_points', 'value_volatility_pre_750', 'value_median_pre_250', 'value_trend_post_80', 'value_skew_post', 'value_max_pre_80', 'value_ptp_post_750', 'value_mean_absolute_change_post_60', 'value_avg_fft_mag_pre', 'value_mean_pre_800', 'value_median_pre_150', 'value_percentile_75_pre_600', 'ratio_median', 'pval_ks_2samp_20', 'value_skew_pre_40', 'diff_percentile_25', 'value_trend_pre_150', 'value_kurtosis_post_150', 'value_auto_2_post_550', 'pval_levene_750', 'value_upward_steps_pre_350', 'value_max_jump_pre', 'value_mean_absolute_change_pre_650', 'value_percentile_25_post_500', 'pval_ks_2samp_550', 'value_auto_2_post_300', 'value_max_jump_post_250', 'diff_num_turning_points_1000', 'value_ptp_pre', 'value_mean_absolute_change_pre_20', 'value_dominant_freq_ind_post_950']

    # X_test can only be iterated once.
    # Before getting the next dataset, you must predict the current one.
    for dataset in X_test:
        def predict(df_features: pd.DataFrame):
            prob = model.predict_proba(df_features)[:, 1]
            return prob

        dataset = dataset.reset_index()

        # Generate features from values before and after breakpoint
        X_test_features = generate_features(dataset)
        X_test_features = X_test_features[best_features]

        prediction = predict(X_test_features)
        yield prediction  # Send the prediction for the current dataset

        # Note: the prediction is the trained classifier's probability of the
        # positive class (a structural break) for the current dataset.
        # Higher values indicate stronger evidence of a break at the boundary point.

## Local testing

To make sure your `train()` and `infer()` functions are working properly, you can call the `crunch.test()` function that will reproduce the cloud environment locally. <br />
Even if it is not perfect, it should give you a quick idea if your model is working properly.

In [None]:
crunch.test(
    # The forced first train is disabled here; comment this line out to restore the default
    force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [None]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

### Local scoring

You can call the function that the system uses to estimate your score locally.

In [None]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)