# 📌 ADIA Lab Structural Break Challenge  

**Assalam o Alaikum 👋**  
In this notebook, we explore **structural breaks (regime shifts)**: points where the trend, mean, or variance of the data suddenly changes 📊.  

Structural break detection is an important problem because real-world data is rarely smooth and stable. Major shifts sometimes occur midway through a series, and they affect both forecasting and analysis.  

---

## 🔎 Challenge Overview  

Welcome to the **ADIA Lab Structural Break Challenge!**  
In this competition, you will analyze **univariate time series data** to determine whether a **structural break** has occurred at a specified boundary point.  

### 📖 What is a Structural Break?  

A **structural break** occurs when the process governing the data generation changes at a certain point in time.  
These changes can be subtle or dramatic, and detecting them accurately is crucial across domains:  

- 🌦 **Climatology** → shifts in long-term weather patterns  
- 🏭 **Industrial Monitoring** → sudden changes in machine behavior  
- 💹 **Finance** → market crashes or regime shifts  
- 🏥 **Healthcare** → sudden changes in patient health indicators  

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)  
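
To make the definition concrete, here is a minimal sketch on synthetic data. The segment lengths, the size of the mean shift, and the seed are illustrative assumptions and are not taken from the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the toy example is reproducible

# Hypothetical toy series: 300 points before the boundary, 200 after it.
before = rng.normal(loc=0.0, scale=1.0, size=300)
after = rng.normal(loc=2.0, scale=1.0, size=200)  # the mean shifts from 0 to 2
series = np.concatenate([before, after])
boundary = len(before)

# Compare simple statistics on either side of the boundary.
mean_shift = series[boundary:].mean() - series[:boundary].mean()
std_ratio = series[boundary:].std() / series[:boundary].std()
print(f"mean shift: {mean_shift:.2f}, std ratio: {std_ratio:.2f}")
```

Here the mean changes while the spread stays roughly the same. A real break can just as well affect the variance or the trend instead.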

---

## 📝 Our Task  

For each time series in the **test set**, we need to predict a **score between `0` and `1`:**  

- `0` → No structural break at the specified boundary point  
- `1` → A structural break **did occur**  
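
One simple way to obtain such a score, sketched below, is the two-sample Kolmogorov-Smirnov statistic between the values before and after the boundary, since it always lies in `[0, 1]`. This is only an illustrative baseline, not the method used later in this notebook, and the helper name `ks_statistic` is hypothetical.

```python
import numpy as np

def ks_statistic(before, after):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    before = np.sort(np.asarray(before, dtype=float))
    after = np.sort(np.asarray(after, dtype=float))
    pooled = np.concatenate([before, after])
    # Empirical CDF of each sample, evaluated at every pooled observation.
    cdf_before = np.searchsorted(before, pooled, side="right") / len(before)
    cdf_after = np.searchsorted(after, pooled, side="right") / len(after)
    return float(np.max(np.abs(cdf_before - cdf_after)))

same = ks_statistic([1, 2, 3], [1, 2, 3])         # identical samples
disjoint = ks_statistic([0, 1, 2], [10, 11, 12])  # fully separated samples
print(same, disjoint)
```

Identical segments score `0.0` and completely separated segments score `1.0`, so the statistic can be submitted directly as a break score.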

---

## 📊 Evaluation Metric  

The challenge uses **ROC AUC (Area Under the Receiver Operating Characteristic Curve)** as the evaluation metric:  

- **ROC AUC ≈ 0.5** → No better than random guessing  
- **ROC AUC → 1.0** → Perfect detection performance  

More about ROC AUC: [sklearn.metrics.roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)  
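
ROC AUC can also be read as the probability that a randomly chosen positive series receives a higher score than a randomly chosen negative one, with ties counting as one half. The sketch below computes it that way with plain NumPy; the helper name `roc_auc_pairwise` is hypothetical, and in practice `sklearn.metrics.roc_auc_score` is the function to use.

```python
import numpy as np

def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

labels = [0, 0, 1, 1]
perfect = roc_auc_pairwise(labels, [0.1, 0.2, 0.8, 0.9])   # 1.0
constant = roc_auc_pairwise(labels, [0.5, 0.5, 0.5, 0.5])  # 0.5
mixed = roc_auc_pairwise(labels, [0.1, 0.4, 0.35, 0.8])    # 0.75
print(perfect, constant, mixed)
```

Because only the ranking of scores matters, the predictions do not need to be calibrated probabilities.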

---

## 📂 Notebook Flow 🚀  

1. **Exploratory Data Analysis (EDA)**: visualize and understand the dataset.  
2. **Methods**: try different techniques (statistical and ML-based) for break detection.  
3. **Evaluation**: compare the results and see which method performs best.  

---

⚡ **Goal**: A clean, reproducible, and easy-to-follow Kaggle-style notebook that beginners can learn from and advanced users can enjoy.  

**Let's get started 🚀**  


In [2]:
# Install the extra libraries needed by this notebook

!pip install antropy --quiet
!pip install PyWavelets --quiet


In [3]:
import os
import typing
import warnings

# Import your dependencies
import joblib
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import sklearn.metrics
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report, RocCurveDisplay
from sklearn.model_selection import cross_val_score, train_test_split

import lightgbm as lgb
from xgboost import XGBClassifier

from scipy.signal import welch, hilbert
# wasserstein_distance is the 1D Earth Mover's Distance
from scipy.stats import skew, kurtosis, ks_2samp, mannwhitneyu, wasserstein_distance
from statsmodels.tsa.stattools import acf, pacf

# Ignore UserWarning
warnings.filterwarnings("ignore", category=UserWarning)

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

### Crunch CLI Setup

In [1]:
%pip install crunch-cli --upgrade --quiet --progress-bar off
!crunch setup-notebook structural-break 0X4q9UObLYpTP0DOnQ6i6nyK

Note: you may need to restart the kernel to use updated packages.
crunch-cli, version 7.5.0
main.py: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/29908/main.py (18776 bytes)
notebook.ipynb: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/29908/notebook.ipynb (102836 bytes)
requirements.txt: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/29908/requirements.original.txt (249 bytes)
data/X_train.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_test.reduced.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/y_train.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 

### Basic EDA

In [5]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

X_train, y_train, X_test = crunch.load_data()

loaded inline runner with module: <module '__main__'>

cli version: 7.5.0
available ram: 31.35 gb
available cpu: 4 core
----
data/X_train.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https://crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file lengt

In [6]:
print(y_train)

print(type(X_train))
print(type(y_train))
print(type(X_test))

print(type(X_test[0]))
print(X_test[0].shape)
display(X_test[0][:5])


X_train

id
0        False
1        False
2         True
3        False
4        False
         ...  
9996     False
9997     False
9998     False
9999     False
10000     True
Name: structural_breakpoint, Length: 10001, dtype: bool
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'list'>
<class 'pandas.core.frame.DataFrame'>
(2779, 2)


               value  period
id    time                  
10001 0     0.010753       0
      1    -0.031915       0
      2    -0.010989       0
      3    -0.011111       0
      4     0.011236       0


               value  period
id    time                  
0     0    -0.005564       0
      1     0.003705       0
      2     0.013164       0
      3     0.007151       0
      4    -0.009979       0
...              ...     ...
10000 2134  0.001137       1
      2135  0.003526       1
      2136  0.000687       1
      2137  0.001640       1


In [7]:
def preprocess_test_data(raw_test_list, start_id=10001):
    """
    Convert the raw test list into a single MultiIndex DataFrame with a clean index.

    Args:
        raw_test_list (list): Raw test data. Each element is a DataFrame (or array)
            of shape (T, 2) with columns (value, period).
        start_id (int): Starting sample ID for test data indexing (avoids overlap with train IDs).

    Returns:
        pd.DataFrame: MultiIndex DataFrame with index levels ['id', 'time'].
    """
    test_list = []
    for i, ts in enumerate(raw_test_list):
        df = pd.DataFrame(ts, columns=['value', 'period'])
        # Assign sequential IDs and rebuild the (id, time) index
        df['id'] = start_id + i
        df['time'] = df.index
        df.set_index(['id', 'time'], inplace=True)
        test_list.append(df)

    X_test_df = pd.concat(test_list)

    # If an element was already indexed by (id, time), df.index yields tuples,
    # so keep only the time component of each tuple
    new_time_level = [t[1] if isinstance(t, tuple) else t for t in X_test_df.index.get_level_values('time')]
    new_index = pd.MultiIndex.from_arrays([
        X_test_df.index.get_level_values('id'),
        new_time_level,
    ], names=['id', 'time'])
    X_test_df.index = new_index

    return X_test_df


In [8]:
# X_test is the list of per-series DataFrames returned by crunch.load_data()

X_test_df = preprocess_test_data(X_test)



In [9]:
print("Train index levels:", X_train.index.names)
print("Train index sample:")
print(X_train.head())

print("\nTest index levels:", X_test_df.index.names)
print("Test index sample:")
print(X_test_df.head())

assert X_train.index.names == X_test_df.index.names, "Index levels mismatch!"


Train index levels: ['id', 'time']
Train index sample:
            value  period
id time                  
0  0    -0.005564       0
   1     0.003705       0
   2     0.013164       0
   3     0.007151       0
   4    -0.009979       0

Test index levels: ['id', 'time']
Test index sample:
               value  period
id    time                  
10001 0     0.010753       0
      1    -0.031915       0
      2    -0.010989       0
      3    -0.011111       0
      4     0.011236       0
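
Since the `period` column marks which side of the boundary each observation falls on (`0` before, `1` after), a natural next step in the EDA is to split a series into its two segments. The following is a minimal sketch on a toy frame that mimics the `(id, time)` layout shown above; the helper name `split_by_period` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def split_by_period(series_df):
    """Return (before, after) value arrays for one series with 'value' and 'period' columns."""
    before = series_df.loc[series_df["period"] == 0, "value"].to_numpy()
    after = series_df.loc[series_df["period"] == 1, "value"].to_numpy()
    return before, after

# Toy frame with a single series (id 0) and five time steps.
toy = pd.DataFrame(
    {"value": [0.1, -0.2, 0.05, 1.1, 0.9], "period": [0, 0, 0, 1, 1]},
    index=pd.MultiIndex.from_product([[0], range(5)], names=["id", "time"]),
)

before, after = split_by_period(toy.loc[0])
print(len(before), len(after), after.mean() - before.mean())
```

With the real data, the same helper could be applied to `X_train.loc[some_id]` for any series ID.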


### Feature Engineering

In [21]:
import numpy as np
import pandas as pd
import torch  # needed because extract_features returns a torch.Tensor
from scipy.signal import find_peaks
from scipy.stats import entropy

def extract_features(df, seq_ids, val_col='value'):
    features_list = []

    for seq_id in seq_ids:
        seq_vals = df.loc[seq_id][val_col].values.astype(np.float32)
        mean = np.mean(seq_vals)
        std = np.std(seq_vals)
        minimum = np.min(seq_vals)
        maximum = np.max(seq_vals)
        median = np.median(seq_vals)
        range_ = maximum - minimum
        skewness = pd.Series(seq_vals).skew()
        kurtosis = pd.Series(seq_vals).kurtosis()
        energy = np.sum(seq_vals ** 2)
        # Shannon entropy of a 30-bin histogram; the small constant avoids log(0) on
        # empty bins, and scipy's entropy() normalizes the counts to sum to 1
        hist, _ = np.histogram(seq_vals, bins=30, density=True)
        seq_entropy = entropy(hist + 1e-6)
        peaks, _ = find_peaks(seq_vals)
        num_peaks = len(peaks)
        # Peak distances statistics
        if len(peaks) > 1:
            peak_distances = np.diff(peaks)
            peak_dist_mean = np.mean(peak_distances)
            peak_dist_std = np.std(peak_distances)
        else:
            peak_dist_mean = 0
            peak_dist_std = 0
        rms = np.sqrt(np.mean(seq_vals**2))

        features = [
            mean, std, minimum, maximum, median, range_, skewness, kurtosis,
            energy, seq_entropy, num_peaks, peak_dist_mean, peak_dist_std, rms
        ]
        features_list.append(features)

    features_array = np.array(features_list, dtype=np.float32)
    return torch.tensor(features_array)
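
The features above summarize each series as a whole and do not use the `period` column. As a possible complement, the sketch below contrasts statistics computed separately before and after the boundary. It is an illustrative assumption rather than part of this notebook's pipeline, and the helper name `segment_contrast_features` is hypothetical.

```python
import numpy as np

def segment_contrast_features(values, period):
    """Contrast simple statistics of the after-boundary segment against the before-boundary one."""
    values = np.asarray(values, dtype=float)
    period = np.asarray(period)
    before, after = values[period == 0], values[period == 1]
    eps = 1e-12  # guards against division by zero for constant segments
    return {
        "mean_diff": after.mean() - before.mean(),
        "median_diff": np.median(after) - np.median(before),
        "std_log_ratio": np.log((after.std() + eps) / (before.std() + eps)),
        "range_diff": np.ptp(after) - np.ptp(before),
    }

feats = segment_contrast_features(
    values=[0, 1, 2, 10, 11, 12, 13],
    period=[0, 0, 0, 1, 1, 1, 1],
)
print(feats)
```

Differences and log ratios stay near zero when the two segments look alike, which gives a downstream classifier a direct signal about the boundary.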


In [22]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler, random_split
import numpy as np
import pandas as pd
from scipy.signal import find_peaks
from scipy.stats import entropy

# Sequence Dataset
class SequenceDataset(Dataset):
    def __init__(self, df, seq_ids, max_len=500, val_col='value'):
        self.df = df
        self.seq_ids = seq_ids
        self.max_len = max_len
        self.val_col = val_col
    def __len__(self):
        return len(self.seq_ids)
    def __getitem__(self, idx):
        seq_id = self.seq_ids[idx]
        seq_vals = self.df.loc[seq_id][self.val_col].values
        padded = np.zeros(self.max_len, dtype=np.float32)
        length = min(len(seq_vals), self.max_len)
        padded[:length] = seq_vals[:length]
        return torch.tensor(padded).unsqueeze(-1)

# Embedding Dataset
class EmbeddingDataset(Dataset):
    def __init__(self, embeddings, labels):
        self.embeddings = embeddings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        return self.embeddings[idx], self.labels[idx]

# Weighted Sampler for Imbalance
def create_balanced_sampler(labels):
    class_sample_counts = torch.tensor([(labels == 0).sum(), (labels == 1).sum()], dtype=torch.float32)
    weight_per_class = 1.0 / class_sample_counts
    weights = weight_per_class[labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    return sampler

# Improved LSTM Autoencoder
class LSTMAutoencoderImproved(nn.Module):
    def __init__(self, input_dim=1, hidden_dim=64, latent_dim=32, num_layers=2, dropout=0.2):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, batch_first=True, dropout=dropout)
        self.enc_fc = nn.Sequential(
            nn.Linear(hidden_dim, latent_dim),
            nn.ReLU()
        )
        self.dec_fc = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU()
        )
        self.decoder = nn.LSTM(hidden_dim, input_dim, num_layers=num_layers, batch_first=True, dropout=dropout)
    def forward(self, x):
        enc_out, (h_n, _) = self.encoder(x)
        h_last = h_n[-1]
        latent = self.enc_fc(h_last)
        dec_in = self.dec_fc(latent).unsqueeze(1).repeat(1, x.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return dec_out, latent

# Extract embeddings utility
def extract_embeddings(model, dataloader, device):
    model.to(device)
    model.eval()
    embeddings = []
    with torch.no_grad():
        for batch in dataloader:
            batch = batch.to(device)
            _, latent = model(batch)
            embeddings.append(latent.cpu())
    embeddings = torch.cat(embeddings, dim=0)
    print(f"Extracted embeddings shape: {embeddings.shape}")
    return embeddings

# Classifier Model
class BreakClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, 1)
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        # Squeeze only the last dimension so a batch of size 1 keeps shape (1,)
        return x.squeeze(-1)

# Create train/validation loaders for sequences
def create_train_val_loaders(df, seq_ids, val_fraction=0.1, batch_size=64, max_len=500):
    total = len(seq_ids)
    val_size = int(total * val_fraction)
    train_size = total - val_size
    train_ids, val_ids = random_split(seq_ids, [train_size, val_size])
    train_dataset = SequenceDataset(df, train_ids, max_len=max_len)
    val_dataset = SequenceDataset(df, val_ids, max_len=max_len)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader, train_ids

# Autoencoder training with early stopping and improved optimization
def train_autoencoder_with_early_stopping_and_val(
    model, train_loader, val_loader, device,
    epochs=50, patience=5, min_delta=1e-4):
    
    model.to(device)
    optimizer = optim.AdamW(model.parameters(), lr=0.0001, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5, verbose=True)
    criterion = nn.MSELoss()
    
    best_val_loss = float('inf')
    epochs_no_improve = 0
    
    for epoch in range(epochs):
        model.train()
        running_loss = 0
        for batch in train_loader:
            batch = batch.to(device)
            optimizer.zero_grad()
            recon, _ = model(batch)
            loss = criterion(recon, batch)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        avg_train_loss = running_loss / len(train_loader)
        
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for val_batch in val_loader:
                val_batch = val_batch.to(device)
                recon, _ = model(val_batch)
                loss = criterion(recon, val_batch)
                val_loss += loss.item()
        avg_val_loss = val_loss / len(val_loader)
        
        scheduler.step(avg_val_loss)
        
        print(f"Epoch {epoch+1}/{epochs} - Train Loss: {avg_train_loss:.6f} - Val Loss: {avg_val_loss:.6f}")
        
        if best_val_loss - avg_val_loss > min_delta:
            best_val_loss = avg_val_loss
            epochs_no_improve = 0
            torch.save(model.state_dict(), 'best_autoencoder.pth')
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                print(f"Early stopping after {epoch+1} epochs (min_delta={min_delta}).")
                break
    model.load_state_dict(torch.load('best_autoencoder.pth'))
    print("Autoencoder training complete with early stopping.")

# Classifier training with validation, early stopping, AdamW, and balanced sampler
def train_classifier_with_val(
    model, dataset, device,
    epochs=20, batch_size=64,
    patience=5, val_fraction=0.1, min_delta=1e-4):
    
    total_len = len(dataset)
    val_len = int(total_len * val_fraction)
    train_len = total_len - val_len
    train_set, val_set = random_split(dataset, [train_len, val_len])
    
    train_labels = torch.tensor([label for _, label in train_set])
    train_loader = DataLoader(train_set, batch_size=batch_size, sampler=create_balanced_sampler(train_labels), shuffle=False)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)
    
    model.to(device)
    optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2, factor=0.5, verbose=True)

    all_labels = torch.tensor([label for _, label in dataset])
    pos_weight_val = (all_labels == 0).sum() / (all_labels == 1).sum()
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight_val.to(device))
    
    best_val_loss = float('inf')
    epochs_no_improve = 0

    for epoch in range(epochs):
        model.train()
        running_loss = 0
        for batch_x, batch_y in train_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device).float()
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        avg_train_loss = running_loss / len(train_loader)
        
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch_x, batch_y in val_loader:
                batch_x, batch_y = batch_x.to(device), batch_y.to(device).float()
                output = model(batch_x)
                loss = criterion(output, batch_y)
                val_loss += loss.item()
        avg_val_loss = val_loss / len(val_loader)
        
        scheduler.step(avg_val_loss)
        
        print(f"Epoch {epoch+1}/{epochs} - Train Loss: {avg_train_loss:.6f} - Val Loss: {avg_val_loss:.6f}")
        
        if best_val_loss - avg_val_loss > min_delta:
            best_val_loss = avg_val_loss
            epochs_no_improve = 0
            torch.save(model.state_dict(), 'best_classifier.pth')
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                print(f"Early stopping after {epoch+1} epochs (min_delta={min_delta}).")
                break
    model.load_state_dict(torch.load('best_classifier.pth'))
    print("Classifier training complete with early stopping.")

# Full training pipeline
def train(df, seq_ids, labels_dict, device='cpu', max_len=500, ae_epochs=50, clf_epochs=20, batch_size=64):
    print("== Starting training pipeline with validation split and improved optimization ==")
    
    # Create train and validation loaders for autoencoder training
    train_loader, val_loader, train_ids = create_train_val_loaders(df, seq_ids, val_fraction=0.1, batch_size=batch_size, max_len=max_len)
    
    # Initialize and train the improved autoencoder
    autoencoder = LSTMAutoencoderImproved()
    print("Training Autoencoder with validation and early stopping...")
    train_autoencoder_with_early_stopping_and_val(autoencoder, train_loader, val_loader, device=device, epochs=ae_epochs)
    
    # Extract embeddings on full dataset after autoencoder training completes
    print("Extracting embeddings from full training set...")
    full_ae_dataset = SequenceDataset(df, seq_ids, max_len=max_len)
    full_ae_loader = DataLoader(full_ae_dataset, batch_size=batch_size, shuffle=False)
    embeddings = extract_embeddings(autoencoder, full_ae_loader, device=device)
    
    # Extract enhanced handcrafted features and combine
    handcrafted_features = extract_features(df, seq_ids)
    combined_features = torch.cat([embeddings, handcrafted_features], dim=1)
    
    # Prepare classifier dataset and initialize classifier
    label_list = [labels_dict[id_] for id_ in seq_ids]
    label_tensor = torch.tensor(label_list, dtype=torch.long)
    
    clf_dataset = EmbeddingDataset(combined_features, label_tensor)
    classifier = BreakClassifier(input_dim=combined_features.shape[1])
    
    # Train classifier with validation, early stopping, AdamW, and balancing
    print("Training classifier with validation and early stopping...")
    train_classifier_with_val(classifier, clf_dataset, device=device, epochs=clf_epochs, batch_size=batch_size)
    
    # Save final models
    torch.save(autoencoder.state_dict(), "autoencoder.pth")
    torch.save(classifier.state_dict(), "classifier.pth")
    print("Models saved successfully.")
    print("== Training pipeline complete ==")
    return autoencoder, classifier



- Input: Raw sequence data (X_train) plus one label per sequence (y_train).
- Feature Engineering: Preprocessing to extract handcrafted features and embeddings.
- Autoencoder Training: Train the autoencoder to reconstruct the sequences.
- Embeddings Extraction: Extract latent embeddings from the encoder.
- Concatenate Features: Handcrafted + embeddings.
- Oversampling Setup: Oversample the minority class.
- Classifier Training: Train the break/no-break classifier on the combined features.
- Progress Prints: Print the status at every step.
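
The oversampling step in the list above relies on inverse-frequency weights: each sample is drawn with probability proportional to one over the size of its class, so both classes appear about equally often. The sketch below reproduces that idea with plain NumPy instead of `WeightedRandomSampler`; the toy labels and the seed are illustrative assumptions.

```python
import numpy as np

# Toy imbalanced labels: six negatives and two positives.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Weight every sample by the inverse of its class count.
class_counts = np.bincount(labels, minlength=2)
weights = 1.0 / class_counts[labels]
probs = weights / weights.sum()

# Draw with replacement, as a weighted sampler would over one epoch.
rng = np.random.default_rng(0)
sampled = rng.choice(len(labels), size=10_000, replace=True, p=probs)
positive_share = labels[sampled].mean()
print(f"positive share after resampling: {positive_share:.3f}")
```

Although positives make up only a quarter of the toy labels, they make up roughly half of the resampled draws.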

In [19]:
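device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Assumption: seq_ids holds the unique series IDs from the 'id' level of the index
seq_ids = X_train.index.get_level_values('id').unique().tolist()
labels_dict = y_train.astype(int).to_dict()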

autoencoder_model, classifier_model = train(
    X_train,
    seq_ids,
    labels_dict,
    device=device,
    max_len=500,
    ae_epochs=20,
    clf_epochs=20,
    batch_size=64
)


== Starting training pipeline with validation split ==
Training Autoencoder with validation and early stopping...
Epoch 1/10 - Train Loss: 0.006458 - Val Loss: 0.000622
Epoch 2/10 - Train Loss: 0.002666 - Val Loss: 0.000619
Epoch 3/10 - Train Loss: 0.002664 - Val Loss: 0.000619
Epoch 4/10 - Train Loss: 0.002664 - Val Loss: 0.000619
Epoch 5/10 - Train Loss: 0.002663 - Val Loss: 0.000619
Epoch 6/10 - Train Loss: 0.002663 - Val Loss: 0.000619
Epoch 7/10 - Train Loss: 0.002662 - Val Loss: 0.000619
Epoch 8/10 - Train Loss: 0.002662 - Val Loss: 0.000619
Epoch 9/10 - Train Loss: 0.002661 - Val Loss: 0.000619
Epoch 10/10 - Train Loss: 0.002661 - Val Loss: 0.000619
Autoencoder training complete with early stopping.
Extracting embeddings from full training set...
Extracted embeddings shape: torch.Size([10001, 32])
Training classifier with validation and early stopping...
Epoch 1/10 - Train Loss: 1.068882 - Val Loss: 1.227996
Epoch 2/10 - Train Loss: 1.050734 - Val Loss: 1.129139
Epoch 3/10 - Tra

In [None]:
preds = list(infer(X_test, "artifacts"))
print(preds[:10])


In [None]:
crunch.test(
    # Uncomment to disable the train
    #force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)



In [None]:
prediction = pd.read_parquet("data/prediction.parquet")

prediction

In [None]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

In [None]:
target