This is the final version of the model training code, exported as a markdown file. Print statements are added to track where execution currently is.

We start with the imports, which are self-explanatory.

In [78]:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import (
    col, lag, lead, when, unix_timestamp, floor, lit,
    first, last, max as _max, min as _min, sum as _sum, input_file_name, regexp_extract,
    pow, sqrt, udf, avg, abs as sabs, greatest
)
from functools import reduce
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
import os
import sys
import psutil
import gc
import re
import shutil
import time
import tempfile
from pathlib import Path

This creates a UDF that extracts the probability of UP from Spark ML output. Spark returns predictions with a probability vector; the UDF reads the second element (index 1), which is the probability of class 1 (UP), and falls back to 0.5 when the vector is null. We'll need this later.

In [79]:
extract_prob_udf = udf(lambda v: float(v[1]) if v is not None else 0.5, DoubleType())

Memory monitoring function that tracks RAM and swap usage. It was added because the model was initially trained on a laptop with 16GB of RAM (it probably still is, but this might change to a 32GB cluster later if I'm not lazy). Dividing by 1e9 converts bytes to gigabytes, and .1f formats to one decimal place.

In [80]:
def check_memory():
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"Memory: {mem.percent}% used ({mem.used/1e9:.1f}GB/{mem.total/1e9:.1f}GB)")
    if swap.percent > 5:
        print(f"Swap: {swap.percent}% used ({swap.used/1e9:.1f}GB/{swap.total/1e9:.1f}GB) ")
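As a quick illustration of the unit conversion described above, the sketch below formats a byte count both ways. The helper name `format_gb` and the sample byte count are hypothetical and not part of the notebook. Dividing by 1e9 gives decimal gigabytes, while some OS tools report binary gibibytes (2**30 bytes), so the two figures can differ slightly.

```python
def format_gb(n_bytes: int, binary: bool = False) -> str:
    """Format a byte count as GB (1e9 bytes) or GiB (2**30 bytes) with one decimal."""
    divisor = 2**30 if binary else 1e9
    unit = "GiB" if binary else "GB"
    return f"{n_bytes / divisor:.1f}{unit}"

total = 16_500_000_000  # hypothetical total RAM in bytes
print(format_gb(total))               # decimal gigabytes, as used in check_memory()
print(format_gb(total, binary=True))  # binary gibibytes, for comparison
```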

This section configures the PySpark environment and sets up the directory structure for the project. It sets the Python executable paths and the Hadoop home directory, and defines the directories for Spark temporary files, model storage, raw data, and checkpoints. Old checkpoints and temporary files are removed to prevent conflicts with previous runs. Much of this section is hardcoded and tailored to the specific local machine the training was run on, because of Windows-related errors and version conflicts.

In [81]:
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
hadoop_home = os.environ.get('HADOOP_HOME', r'C:\hadoop')
os.environ['HADOOP_HOME'] = hadoop_home
BASE_TEMP_DIR = Path(tempfile.gettempdir()) / "spark_crypto_ml"
SPARK_LOCAL_DIRS = str(BASE_TEMP_DIR / "temp")
try:
    BASE_DIR = Path(__file__).resolve().parent.parent
except NameError:
    BASE_DIR = Path(os.getcwd()).resolve().parent
DATA_FOLDER =(os.path.join(BASE_DIR, "data", "raw"))
MODEL_FOLDER = os.path.join(BASE_DIR, "models")
CHECKPOINT_DIR = os.path.join(BASE_DIR, "checkpoints")

if os.path.exists(CHECKPOINT_DIR):
    shutil.rmtree(CHECKPOINT_DIR)
    print(f"Old checkpoints removed")
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
#cleaning up old temp dir
try:
    if os.path.exists(str(BASE_TEMP_DIR)):
        shutil.rmtree(str(BASE_TEMP_DIR))
        print(f"Temp directory cleaned: {BASE_TEMP_DIR}")
except Exception as e:
    print(f"Warning: Could not clean temp directory - {e}")


Spark setup. Creates and configures the Spark session with optimized settings for machine learning workloads. Key configurations include adaptive query execution, increased memory allocation (6GB driver, 9GB executor), Kryo serialization for performance, compression for RDDs and shuffles, and disabled code generation to avoid Windows compatibility issues. The session is set to ERROR-level logging and uses a dedicated checkpoint directory for fault tolerance.

In [82]:
spark = (SparkSession.builder
    .appName("Stacking_Ensemble_Training")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.shuffle.partitions", "32")
    .config("spark.default.parallelism", "32")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.driver.memory", "6g")
    .config("spark.executor.memory", "9g")
    .config("spark.driver.maxResultSize", "3g")
    .config("spark.memory.fraction", "0.8")
    .config("spark.memory.storageFraction", "0.3")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .config("spark.rdd.compress", "true")
    .config("spark.shuffle.compress", "true")
    .config("spark.shuffle.spill.compress", "true")
    .config("spark.ui.enabled", "true")
    .config("spark.ui.port", "4040")
    .config("spark.sql.codegen.wholeStage", "false")
    .config("spark.sql.codegen.factoryMode", "NO_CODEGEN")
    .config("hadoop.native.lib", "false")
    .config("spark.hadoop.fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem")
    .getOrCreate())

spark.sparkContext.setLogLevel("ERROR")
spark.sparkContext.setCheckpointDir(CHECKPOINT_DIR)

Configuration. LOOKBACK and LAG_INTERVALS set how many previous candles are used for feature generation (lags 1 through 20). CANDLE_MINUTES is the target timeframe, in minutes, for aggregating the dataset's 1-minute candles into larger candles. ENABLE_GBT_UNDERSAMPLING is a flag that enables or disables undersampling of the majority class when training the gradient boosted trees.

In [83]:

LOOKBACK = 20
LAG_INTERVALS = [i+1 for i in range(LOOKBACK)]
CANDLE_MINUTES = 30
ENABLE_GBT_UNDERSAMPLING = True

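Lists the parquet files in the data folder, which contain 1-minute OHLCV candlestick data for multiple cryptocurrencies. The files are sorted by name and split 80% / 20% into training and testing sets. Since the symbol is taken from the filename (one trading pair per file), this split is by symbol in alphabetical order rather than by time.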

In [85]:
all_parquet_files = sorted([f for f in os.listdir(DATA_FOLDER) if f.endswith(".parquet")])
split_idx = int(len(all_parquet_files) * 0.8)
train_files = all_parquet_files[:split_idx]
test_files = all_parquet_files[split_idx:]
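To make the effect of the file-level split concrete, here is a minimal sketch with hypothetical file names (the real folder holds many more pairs). It applies the same 80/20 rule and shows that whole symbols land in either the training or the test set.

```python
# Hypothetical file names; in the notebook each parquet file is named after a trading pair.
files = sorted(["ETH-BTC.parquet", "ADA-AUD.parquet", "XRP-RUB.parquet",
                "BTC-USDT.parquet", "LTC-BTC.parquet"])

split_idx = int(len(files) * 0.8)   # same 80/20 rule as the cell above
train_files = files[:split_idx]
test_files = files[split_idx:]

print(train_files)  # alphabetically earlier symbols
print(test_files)   # alphabetically later symbols
```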

Extracts cryptocurrency symbols from the filenames using a regex pattern and collects them into a set. The same pattern is used later to create the symbol column.

In [86]:
symbols = set()
for f in all_parquet_files:
    match = re.search(r'([^/\\]+)\.parquet$', f)
    if match:
        symbols.add(match.group(1))
print(symbols)

{'BTCDOWN-USDT', 'YFI-BUSD', 'MASK-USDT', 'SAND-BNB', 'AST-BTC', 'PSG-BTC', 'MITH-BNB', 'EOS-EUR', 'BNT-ETH', 'HBAR-BNB', 'LTC-BTC', 'C98-USDT', 'IOTA-BNB', 'MTL-ETH', 'QKC-BTC', 'TRX-TRY', 'ZRX-ETH', 'BCH-EUR', 'EGLD-BUSD', 'IOTX-USDT', 'LRC-BUSD', 'LAZIO-USDT', 'OST-ETH', 'LOOM-BTC', 'KAVA-BTC', 'SALT-BTC', 'ROSE-BUSD', 'TORN-USDT', 'DOT-BRL', 'XRPUP-USDT', 'XLM-ETH', 'DASH-ETH', 'EOS-USDC', 'EOS-TUSD', 'SUSHI-BNB', 'MANA-BUSD', 'DUSK-USDT', 'ETH-RUB', 'TOMO-USDT', 'KEY-USDT', 'SUPER-USDT', 'BTT-USDC', 'TKO-BIDR', 'MIR-BUSD', 'FTM-USDT', 'YFI-EUR', 'DNT-ETH', 'AMB-BTC', 'AION-BNB', 'RSR-BUSD', 'LOOM-ETH', 'POWR-BTC', 'ALGO-BUSD', 'ETH-BTC', 'TNT-BTC', 'STX-USDT', 'SYS-BTC', 'GBP-USDT', 'ZIL-BNB', 'LIT-USDT', 'ADA-AUD', 'OCEAN-BUSD', 'SHIB-BUSD', 'XRP-PAX', 'ARDR-USDT', 'WABI-BTC', 'LTC-RUB', 'PAXG-USDT', 'SLP-ETH', 'WTC-BNB', 'XLM-EUR', 'XRP-RUB', 'MANA-ETH', 'EVX-BTC', 'PIVX-ETH', 'POLY-BTC', 'ARN-ETH', 'BTS-ETH', 'NKN-USDT', 'HOT-USDT', 'VET-ETH', 'FIRO-USDT', 'MDA-ETH', 'MCO-ETH',
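The pattern can be checked in isolation. The sketch below wraps it in a hypothetical helper, `symbol_from_path`, and runs it on made-up paths with both forward and backward slashes, since the character class excludes both separators.

```python
import re

PATTERN = r'([^/\\]+)\.parquet$'  # same pattern as in the cell above

def symbol_from_path(path: str):
    """Return the file stem before '.parquet', or None if the path does not match."""
    match = re.search(PATTERN, path)
    return match.group(1) if match else None

# Hypothetical paths, for illustration only.
print(symbol_from_path("data/raw/ETH-BTC.parquet"))
print(symbol_from_path(r"C:\data\raw\LTC-BTC.parquet"))
print(symbol_from_path("notes.txt"))
```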

Sample percentage control for faster experimentation and debugging on smaller data subsets before running full training. 

In [87]:
USE_SAMPLE = False  
SAMPLE_PERCENTAGE = 0.2
if USE_SAMPLE:
    train_files = train_files[:int(len(train_files) * SAMPLE_PERCENTAGE)]
    test_files = test_files[:int(len(test_files) * SAMPLE_PERCENTAGE)]
    print(f"Using sample - Train: {len(train_files)} files, Test: {len(test_files)} files")
else:
    print(f"Using full dataset")
train_files_full = [os.path.join(DATA_FOLDER, f) for f in train_files]
test_files_full = [os.path.join(DATA_FOLDER, f) for f in test_files]

Using full dataset


Reads the data and adds a symbol column to both datasets. This is the first use of gc.collect(), which is called repeatedly throughout the code to force Python to free memory.

In [88]:
df_train = spark.read.parquet(*train_files_full)
df_test = spark.read.parquet(*test_files_full)

df_train = df_train.withColumn(
    "symbol",
    regexp_extract(input_file_name(), r'([^/\\]+)\.parquet$', 1)
)

df_test = df_test.withColumn(
    "symbol",
    regexp_extract(input_file_name(), r'([^/\\]+)\.parquet$', 1)
)

check_memory()
gc.collect()

Memory: 79.7% used (13.1GB/16.5GB)
Swap: 32.8% used (7.4GB/22.5GB) 


127

Aggregates raw 1-minute candlestick data into larger timeframes defined by CANDLE_MINUTES variable. Creates time buckets by flooring timestamps to the nearest interval, then groups by symbol and bucket. Applies OHLCV aggregation rules: first open price of the period, maximum high reached, minimum low touched, last close price, and sum of volume and number of trades. Checkpoints after aggregation to materialize the result to disk, breaking the query lineage to prevent stack overflow errors on complex transformation chains.

In [89]:
def aggregate_candles(df):
    df = df.withColumn(
        "time_bucket",
        floor(unix_timestamp(col("open_time")) * 1000 / lit(CANDLE_MINUTES*60*1000)) * lit(CANDLE_MINUTES*60*1000)
    )
    
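    # NOTE: first()/last() depend on the row order within each group, which Spark does
    # not guarantee after a shuffle; open/close assume rows arrive in time order.
    df = df.groupBy("symbol", "time_bucket").agg(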
        first("open").alias("open"),
        _max("high").alias("high"),
        _min("low").alias("low"),
        last("close").alias("close"),
        _sum("volume").alias("volume"),
        _sum("number_of_trades").alias("number_of_trades"),
        #_sum("taker_buy_base_asset_volume").alias("taker_buy_base_asset_volume"), not used
        _sum("taker_buy_quote_asset_volume").alias("taker_buy_quote_asset_volume")
    ).withColumnRenamed("time_bucket", "open_time").orderBy("symbol", "open_time")
    
    return df

df_train = aggregate_candles(df_train)
df_test = aggregate_candles(df_test)

df_train = df_train.checkpoint()
df_test = df_test.checkpoint()
check_memory()
gc.collect()

Memory: 82.5% used (13.6GB/16.5GB)
Swap: 32.4% used (7.3GB/22.5GB) 


20
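The aggregation rules can be illustrated without Spark. The following is a minimal pure-Python sketch on hypothetical rows for a single symbol, not the notebook's implementation. It sorts each bucket by time before taking the first open and last close, which makes explicit the ordering that the open and close values rely on.

```python
from collections import defaultdict

CANDLE_MINUTES = 30
BUCKET_MS = CANDLE_MINUTES * 60 * 1000

def aggregate(rows):
    """rows: dicts with open_time (epoch ms), open, high, low, close, volume."""
    buckets = defaultdict(list)
    for row in rows:
        bucket_start = (row["open_time"] // BUCKET_MS) * BUCKET_MS  # floor to the interval
        buckets[bucket_start].append(row)

    candles = []
    for bucket_start in sorted(buckets):
        # Explicit time order, so "first" and "last" are well defined.
        group = sorted(buckets[bucket_start], key=lambda r: r["open_time"])
        candles.append({
            "open_time": bucket_start,
            "open": group[0]["open"],                # first open of the period
            "high": max(r["high"] for r in group),   # maximum high
            "low": min(r["low"] for r in group),     # minimum low
            "close": group[-1]["close"],             # last close of the period
            "volume": sum(r["volume"] for r in group),
        })
    return candles

# Three hypothetical 1-minute candles, deliberately out of order.
minute = 60 * 1000
rows = [
    {"open_time": 2 * minute, "open": 11, "high": 13, "low": 10, "close": 12, "volume": 5},
    {"open_time": 0,          "open": 10, "high": 11, "low": 9,  "close": 10, "volume": 3},
    {"open_time": 1 * minute, "open": 10, "high": 12, "low": 10, "close": 11, "volume": 4},
]
print(aggregate(rows))
```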

Generates technical indicators and derived features from raw OHLCV data. The feature engineering pipeline creates lag features (previous period values), moving averages (SMA 5, 10, 20), candlestick patterns (body, wicks, range), momentum indicators, Bollinger Band position, and Average True Range (ATR). The features used in the final model were chosen by trial and error over multiple training cycles on smaller data samples: underperforming features were removed based on the resulting AUC until further removal worsened model performance. The final feature set includes moving average ratios (price_to_sma5/10/20), price momentum, candlestick patterns (body, range, upper_wick, lower_wick), some lag features for high/close/open prices, trading activity metrics (number_of_trades, taker_buy_quote_asset_volume), and volatility indicators (bb_position, atr_10). Features are assembled into Spark ML vectors and checkpointed to prevent recomputation during training.

In [90]:
#feature engineering
def generate_features(df, dataset_name=""):
    print(f"Start feature generation for {dataset_name}")
    window_symbol = Window.partitionBy("symbol").orderBy("open_time")
    def drop_unused_columns(df, keep_list):
        #Dropping all columns except those in keep_list (and system columns)
        system_cols = ["symbol", "open_time", "label"]
        keep_cols = set(keep_list + system_cols)
        drop_cols = [c for c in df.columns if c not in keep_cols]
        
        if drop_cols:
            df = df.drop(*drop_cols)
        
        return df
    
    window_spec = Window.partitionBy("symbol").orderBy("open_time")
    lag_config = {
        "high": [1, 2, 3, 4, 5, 6],  
        "close": LAG_INTERVALS, 
        "open": [3, 5, 6], 
        "number_of_trades": [1, 2, 3],
        "taker_buy_quote_asset_volume": [1, 2, 3]  
    }

    for column, lags in lag_config.items():
        for lag_period in lags:
            df = df.withColumn(
                f"{column}_lag{lag_period}", 
                lag(col(column), lag_period).over(window_spec)
            )
    df = df.checkpoint()
    check_memory()
    gc.collect()
    
    #label: 1 if the next close is higher, otherwise 0
    #(ties, and the last row per symbol where close_next is null, also get 0)
    df = df.withColumn("close_next", lead(col("close"), 1).over(window_spec))
    df = df.withColumn("label",
        when(col("close_next") > col("close"), 1).otherwise(0)
    )
    df = df.drop("close_next")
    
    #derived features
    df = df.withColumn("body", col("close") - col("open"))
    df = df.withColumn("range", col("high") - col("low"))
    df = df.withColumn("upper_wick", col("high") - when(col("close") > col("open"), col("close")).otherwise(col("open")))
    df = df.withColumn("lower_wick", when(col("close") < col("open"), col("close")).otherwise(col("open")) - col("low"))
    
    #Simple moving averages
    #SMA-5
    df = df.withColumn("sma_5", 
        (col("close_lag1") + col("close_lag2") + col("close_lag3") + col("close_lag4") + col("close_lag5")) / 5)
    df = df.withColumn("price_to_sma5", 
        when(col("sma_5") != 0, (col("close") - col("sma_5")) / col("sma_5")).otherwise(0))
    df = df.drop("sma_5")
    #SMA-10
    close_lags_10 = [col(f"close_lag{i}") for i in range(1, 11)]
    df = df.withColumn("sma_10", reduce(lambda a, b: a + b, close_lags_10) / lit(10))
    df = df.withColumn("price_to_sma10", 
        when(col("sma_10") != 0, (col("close") - col("sma_10")) / col("sma_10")).otherwise(0))
    df = df.drop("sma_10")
    #SMA-20
    close_lags_20 = [col(f"close_lag{i}") for i in range(1, 21)]
    df = df.withColumn("sma_20", reduce(lambda a, b: a + b, close_lags_20) / lit(20))
    df = df.withColumn("price_to_sma20", 
        when(col("sma_20") != 0, (col("close") - col("sma_20")) / col("sma_20")).otherwise(0))
    #momentum
    df = df.withColumn("price_momentum", 
        when(col("close_lag5") != 0, (col("close") - col("close_lag5")) / col("close_lag5")).otherwise(0))
    
    #Bollinger Bands position
    df = df.withColumn("volatility",
        sqrt((
            pow((col("close_lag1") - col("close_lag2")) / col("close_lag2"), 2) +
            pow((col("close_lag2") - col("close_lag3")) / col("close_lag3"), 2) +
            pow((col("close_lag3") - col("close_lag4")) / col("close_lag4"), 2)
        ) / 3))
    df = df.withColumn("bb_position",
        when(col("volatility") != 0,
            (col("close") - col("sma_20")) / (2 * col("volatility")))
        .otherwise(0))
    df = df.drop("volatility", "sma_20")

    #average true range
    df = df.withColumn("true_range",
        greatest(
            col("high") - col("low"),
            sabs(col("high") - lag(col("close"), 1).over(window_symbol)),
            sabs(col("low") - lag(col("close"), 1).over(window_symbol))
        )
    )
    window_tr = Window.partitionBy("symbol").orderBy("open_time").rowsBetween(-10, -1)
    df = df.withColumn("atr_10", avg(col("true_range")).over(window_tr))
    df = df.drop("true_range")

    df = df.dropna()
    check_memory()
    gc.collect()

    #assembling features
    SELECTED_FEATURES = [
        "price_to_sma5", "price_to_sma10", "price_momentum", "price_to_sma20",
        
        "body", "range", "upper_wick", "lower_wick",
        
        "high_lag5", "high_lag6", "high_lag1", "high_lag2", "high_lag3", "high_lag4",
        "close_lag3", "close_lag4",
        "open_lag6", "open_lag3", "open_lag5",
        "number_of_trades_lag1", "number_of_trades_lag2", 
        "number_of_trades_lag3",
        "taker_buy_quote_asset_volume_lag1",
        "taker_buy_quote_asset_volume_lag2",
        "taker_buy_quote_asset_volume_lag3",
        
        "bb_position", "atr_10"
    ]
    
    df = drop_unused_columns(df, SELECTED_FEATURES)

    assembler = VectorAssembler(inputCols=SELECTED_FEATURES, outputCol="features", handleInvalid="skip")
    df = assembler.transform(df)

    result_df = df.select("symbol", "features", "label", "open_time")
    
    return result_df, SELECTED_FEATURES

# Generate features for train and test
df_train_features, SELECTED_FEATURES = generate_features(df_train, "TRAINING")
df_test_features, _ = generate_features(df_test, "TEST")

# Checkpoint before training
df_train_features = df_train_features.checkpoint()
df_test_features = df_test_features.checkpoint()
check_memory()
gc.collect()

Start feature generation for TRAINING
Memory: 84.2% used (13.9GB/16.5GB)
Swap: 32.5% used (7.3GB/22.5GB) 
Memory: 84.1% used (13.9GB/16.5GB)
Swap: 32.5% used (7.3GB/22.5GB) 
Start feature generation for TEST
Memory: 85.5% used (14.1GB/16.5GB)
Swap: 32.5% used (7.3GB/22.5GB) 
Memory: 85.3% used (14.1GB/16.5GB)
Swap: 32.5% used (7.3GB/22.5GB) 
Memory: 87.4% used (14.4GB/16.5GB)
Swap: 32.6% used (7.3GB/22.5GB) 


318
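For a single candle, the candlestick and ratio features reduce to simple arithmetic. The sketch below mirrors the formulas from the cell above in plain Python. The function name `candle_features` and the input values are hypothetical, and only a subset of the features is shown.

```python
def candle_features(open_, high, low, close, prev_closes):
    """Features for one candle; prev_closes lists earlier closes, most recent first."""
    sma_5 = sum(prev_closes[:5]) / 5          # average of close_lag1..close_lag5
    close_lag5 = prev_closes[4]
    return {
        "body": close - open_,
        "range": high - low,
        "upper_wick": high - max(open_, close),
        "lower_wick": min(open_, close) - low,
        "price_to_sma5": (close - sma_5) / sma_5 if sma_5 != 0 else 0,
        "price_momentum": (close - close_lag5) / close_lag5 if close_lag5 != 0 else 0,
    }

# Hypothetical values for illustration.
features = candle_features(open_=100.0, high=106.0, low=98.0, close=104.0,
                           prev_closes=[100.0, 100.0, 100.0, 100.0, 100.0])
print(features)
```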

Analyzes the distribution of target labels (UP vs DOWN price movements) in the training set and calculates class weights to handle imbalance. It counts each class and applies inverse frequency weighting, total / (2 * class count). A downscale factor (set to 1.0 for the full dataset) scales how far the weights move away from 1, which allows fine-tuning to prevent over-correction. The weights are added as a column to the training data so that the minority class receives higher importance during training.

In [91]:
label_counts = df_train_features.groupBy("label").count().collect()
label_dict = {row['label']: row['count'] for row in label_counts}

count_0 = label_dict.get(0, 0)
count_1 = label_dict.get(1, 0)
total = count_0 + count_1

print(f"DOWN (0): {count_0:,} ({count_0/total*100:.2f}%)")
print(f"UP (1): {count_1:,} ({count_1/total*100:.2f}%)")

if count_0 > 0 and count_1 > 0:
    raw_weight_0 = total / (2.0 * count_0)
    raw_weight_1 = total / (2.0 * count_1)
    downscale_factor = 1 #left at neutral 1 for full dataset
    weight_0 = 1.0 + (raw_weight_0 - 1.0) * downscale_factor
    weight_1 = 1.0 + (raw_weight_1 - 1.0) * downscale_factor
    
    print(f"\nClass weights:")
    print(f"DOWN (0): {weight_0:.4f}")
    print(f"UP (1): {weight_1:.4f}")
    
    df_train_features = df_train_features.withColumn(
        "weight",
        when(col("label") == 0, lit(weight_0))
        .when(col("label") == 1, lit(weight_1))
        .otherwise(lit(1.0))
    )
    use_weights = True
else:
    use_weights = False

DOWN (0): 22,021,062 (55.54%)
UP (1): 17,628,667 (44.46%)

Class weights:
DOWN (0): 0.9003
UP (1): 1.1246
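The weights above can be reproduced by hand from the printed class counts. This short check uses the same formula as the cell, with the downscale factor at 1. It also shows that after weighting, both classes carry the same total weight.

```python
count_0 = 22_021_062   # DOWN rows, from the output above
count_1 = 17_628_667   # UP rows, from the output above
total = count_0 + count_1

weight_0 = total / (2.0 * count_0)
weight_1 = total / (2.0 * count_1)

print(f"DOWN (0): {weight_0:.4f}")
print(f"UP (1): {weight_1:.4f}")

# Weighted class totals are equal (each is total / 2).
print(weight_0 * count_0, weight_1 * count_1)
```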


Implementation of time-series-aware walk-forward cross-validation to prevent data leakage and maintain temporal ordering. Training data is split into 3 folds using ntile partitioning. Fold 0 (earliest data) is always included in training, while subsequent folds serve as validation sets in sequence. This ensures models are always trained on past data and validated on future data. For each fold, three models are trained independently:
1. Random forest: 50 trees, max depth 6, 70% subsampling, with class weights (the arguments were chosen with a separate hyperparameter tuning script that trained an RF on the same data sample with different combinations).
2. Gradient boosted trees: 50 iterations, max depth 5, 0.1 learning rate, 80% subsampling. The GBT is trained without a weight column, so class imbalance is handled by undersampling the majority class instead.
3. Logistic regression: L2 regularization (regParam=0.01), 100 max iterations, with class weights.

Each model generates probability predictions for its validation fold. The predictions are joined by row_id to create a combined dataset containing rf_prob, gbt_prob, and lr_prob columns. Out-of-fold predictions from all validation folds are accumulated for meta-learner training. 

In [92]:
#walk-forward validation
NUM_FOLDS = 3

#since data is already chronologically ordered per symbol from feature generation, we can use a simple sequential assignment
from pyspark.sql.functions import monotonically_increasing_id, ntile, count

# Use ntile to split the whole dataset into NUM_FOLDS near-equal groups ordered by open_time; ntile returns 1..NUM_FOLDS, so subtracting 1 gives folds 0, 1, 2
df_train_features = df_train_features.withColumn(
    "fold",
    ntile(NUM_FOLDS).over(Window.orderBy("open_time")) - 1
)
print("\n=== Fold Temporal Validation ===")
df_train_features.groupBy("fold").agg(
    _min("open_time").alias("start_date"),
    _max("open_time").alias("end_date"),
    count("*").alias("count")
).orderBy("fold").show()

#adding row_id for joins
df_train_features = df_train_features.withColumn(
    "row_id",
    monotonically_increasing_id()
)

#checkpoint to break lineage and prevent crashes
df_train_features = df_train_features.checkpoint()
gc.collect()

total_count = df_train_features.count()


#initializing empty DataFrames for out-of-fold predictions
oof_predictions = None

#cross validation loop
for fold_num in range(NUM_FOLDS):
    
    #skipping fold 0
    if fold_num == 0:
        print(f"\nSkipping Fold {fold_num + 1} (earliest data - need for training)")
        continue
    
    print(f"Fold {fold_num + 1}/{NUM_FOLDS}")
    
    #training on past folds only
    train_fold = df_train_features.filter(col("fold") < fold_num)
    valid_fold = df_train_features.filter(col("fold") == fold_num)
    
    #materializing the splits
    train_fold = train_fold.checkpoint()
    valid_fold = valid_fold.checkpoint()
    gc.collect()

    #random forest
    start_time = time.time()
    
    rf = RandomForestClassifier(
        featuresCol="features",
        labelCol="label",
        weightCol="weight" if use_weights else None,
        numTrees=50,
        maxDepth=6,
        maxBins=24,
        subsamplingRate=0.7,
        seed=42
    )
    
    rf_fold_model = rf.fit(train_fold.select("features", "label", "weight") if use_weights 
                           else train_fold.select("features", "label"))
    
    print(f"RF trained in {(time.time()-start_time)/60:.1f} mins")
    
    rf_valid_pred = rf_fold_model.transform(valid_fold)
    rf_valid_pred = rf_valid_pred.withColumn("rf_prob", extract_prob_udf(col("probability")))
    rf_valid_pred = rf_valid_pred.select("row_id", "rf_prob").checkpoint()
    gc.collect()
    
    #gradient boosted trees
    start_time = time.time()
    
    gbt = GBTClassifier(
        featuresCol="features",
        labelCol="label",
        maxIter=50,
        maxDepth=5,
        stepSize=0.1,
        subsamplingRate=0.8,
        seed=42
    )
    if ENABLE_GBT_UNDERSAMPLING:
        train_up = train_fold.filter(col("label") == 1)
        train_down = train_fold.filter(col("label") == 0)
        up_count = train_up.count()
        down_count = train_down.count()
        if down_count > up_count:
            # DOWN is majority, sample it down
            sample_ratio = up_count / down_count
            train_down_sampled = train_down.sample(False, sample_ratio, seed=42)
            train_balanced = train_up.union(train_down_sampled)
        else:
            # UP is majority, sample it down
            sample_ratio = down_count / up_count
            train_up_sampled = train_up.sample(False, sample_ratio, seed=42)
            train_balanced = train_down.union(train_up_sampled)

        gbt_fold_model = gbt.fit(train_balanced.select("features", "label"))
    else:
        gbt_fold_model = gbt.fit(train_fold.select("features", "label"))
    
    print(f"GBT trained in {(time.time()-start_time)/60:.1f} mins")
    
    gbt_valid_pred = gbt_fold_model.transform(valid_fold)
    gbt_valid_pred = gbt_valid_pred.withColumn("gbt_prob", extract_prob_udf(col("probability")))
    gbt_valid_pred = gbt_valid_pred.select("row_id", "gbt_prob").checkpoint()
    gc.collect()
    
    #logistic regression
    start_time = time.time()
    
    lr = LogisticRegression(
        featuresCol="features",
        labelCol="label",
        weightCol="weight" if use_weights else None,
        maxIter=100,
        regParam=0.01,
        elasticNetParam=0.0
    )
    
    lr_fold_model = lr.fit(train_fold.select("features", "label", "weight") if use_weights
                           else train_fold.select("features", "label"))
    
    print(f"LR trained in {(time.time()-start_time)/60:.1f} mins")
    
    lr_valid_pred = lr_fold_model.transform(valid_fold)
    lr_valid_pred = lr_valid_pred.withColumn("lr_prob", extract_prob_udf(col("probability")))
    # Select only needed columns and checkpoint
    lr_valid_pred = lr_valid_pred.select("row_id", "lr_prob").checkpoint()
    gc.collect()
    
    #combining predictions for this fold
    if use_weights:
        fold_combined = valid_fold.select("row_id", "symbol", "features", "label", "weight")
    else:
        fold_combined = valid_fold.select("row_id", "symbol", "features", "label")
    
    fold_combined = fold_combined.join(rf_valid_pred, on="row_id", how="inner")
    fold_combined = fold_combined.join(gbt_valid_pred, on="row_id", how="inner")
    fold_combined = fold_combined.join(lr_valid_pred, on="row_id", how="inner")
    
    #Checkpointing combined
    fold_combined = fold_combined.checkpoint()
    gc.collect()
    
    #appending to out of fold predictions
    if oof_predictions is None:
        oof_predictions = fold_combined
    else:
        oof_predictions = oof_predictions.union(fold_combined)
    
    #checkpoint after union
    oof_predictions = oof_predictions.checkpoint()
    
    print(f"Fold {fold_num + 1} complete")
    check_memory()
    gc.collect()
#verifying we have all probabilities
oof_predictions.select("symbol", "label", "rf_prob", "gbt_prob", "lr_prob").show(10, truncate=False)



=== Fold Temporal Validation ===
+----+-------------+-------------+--------+
|fold|   start_date|     end_date|   count|
+----+-------------+-------------+--------+
|   0|1500040800000|1596871800000|13216577|
|   1|1596871800000|1633696200000|13216576|
|   2|1633696200000|1668722400000|13216576|
+----+-------------+-------------+--------+


Skipping Fold 1 (earliest data - need for training)
Fold 2/3
RF trained in 12.9 mins
GBT trained in 9.2 mins
LR trained in3.0 mins
Fold 1 complete
Memory: 89.8% used (14.8GB/16.5GB)
Swap: 33.6% used (7.6GB/22.5GB) 
Fold 3/3
RF trained in 26.5 mins
GBT trained in 28.4 mins
LR trained in8.4 mins
Fold 2 complete
Memory: 68.6% used (11.3GB/16.5GB)
Swap: 33.4% used (7.5GB/22.5GB) 
+---------+-----+-------------------+-------------------+-------------------+
|symbol   |label|rf_prob            |gbt_prob           |lr_prob            |
+---------+-----+-------------------+-------------------+-------------------+
|AVA-BTC  |1    |0.5483395901154571 |0.5469
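The fold logic can be sketched independently of Spark. The helper names below are hypothetical. `ntile_sizes` reproduces how ntile distributes rows, with earlier buckets absorbing the remainder, which matches the fold counts in the table above. `walk_forward_splits` lists which folds are used for training and validation in each iteration.

```python
def ntile_sizes(n_rows: int, n_folds: int):
    """Bucket sizes produced by ntile: earlier buckets absorb the remainder."""
    base, extra = divmod(n_rows, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]

def walk_forward_splits(n_folds: int):
    """Yield (train_folds, valid_fold); fold 0 is never used for validation."""
    for fold in range(1, n_folds):
        yield list(range(fold)), fold

print(ntile_sizes(39_649_729, 3))
for train_folds, valid_fold in walk_forward_splits(3):
    print(f"train on folds {train_folds}, validate on fold {valid_fold}")
```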

Trains the final versions of all three base models on the complete training dataset, using the same hyperparameters. These final models will generate predictions on the test set and serve as the base for the stacking ensemble.

In [None]:
print("Training final base models on full data")

# These will be used for test predictions
print("Training final RandomForest.")
start_time = time.time()

rf_final = RandomForestClassifier(
    featuresCol="features", labelCol="label",
    weightCol="weight" if use_weights else None,
    numTrees=50, maxDepth=6, maxBins=24,
    subsamplingRate=0.7, seed=42
)
rf_model = rf_final.fit(df_train_features.select("features", "label", "weight") if use_weights
                        else df_train_features.select("features", "label"))

print(f"RF trained in {(time.time()-start_time)/60:.1f} min")

print("Training final GBT.")
start_time = time.time()

gbt_final = GBTClassifier(
    featuresCol="features", labelCol="label",
    maxIter=50, maxDepth=5, stepSize=0.1,
    subsamplingRate=0.8, seed=42
)
if ENABLE_GBT_UNDERSAMPLING:
    train_up = df_train_features.filter(col("label") == 1)
    train_down = df_train_features.filter(col("label") == 0)
    
    up_count = train_up.count()
    down_count = train_down.count()
    
    if down_count > up_count:
        sample_ratio = up_count / down_count
        train_down_sampled = train_down.sample(False, sample_ratio, seed=42)
        train_balanced = train_up.union(train_down_sampled)
    else:
        sample_ratio = down_count / up_count
        train_up_sampled = train_up.sample(False, sample_ratio, seed=42)
        train_balanced = train_down.union(train_up_sampled)
    
    balanced_count = train_balanced.count()
    gbt_model = gbt_final.fit(train_balanced.select("features", "label"))
else:
    gbt_model = gbt_final.fit(df_train_features.select("features", "label"))


print(f"GBT trained in {(time.time()-start_time)/60:.1f} min")

print("Training final LogReg.")
start_time = time.time()

lr_final = LogisticRegression(
    featuresCol="features", labelCol="label",
    weightCol="weight" if use_weights else None,
    maxIter=100, regParam=0.01, elasticNetParam=0.0
)
lr_model = lr_final.fit(df_train_features.select("features", "label", "weight") if use_weights
                        else df_train_features.select("features", "label"))

print(f"LR trained in {(time.time()-start_time)/60:.1f} min")


Training final base models on full data
Training final RandomForest...
RF trained in 37.6 min
Training final GBT.
GBT trained in 124.4 min
Training final LogReg.
LR trained in 11.3 min
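The undersampling used for the GBT can be illustrated in plain Python. This is a sketch on hypothetical labels, not the notebook's Spark code. It keeps every minority row and keeps each majority row independently with probability minority/majority, which is how sampling without replacement by fraction behaves, so the resulting class sizes are close to equal but not exactly equal.

```python
import random

def undersample(labels, seed=42):
    """Return row indices after downsampling the majority class to about the minority size."""
    rng = random.Random(seed)
    up = [i for i, y in enumerate(labels) if y == 1]
    down = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (up, down) if len(down) > len(up) else (down, up)
    ratio = len(minority) / len(majority)
    # Each majority row is kept independently with probability `ratio`.
    kept = [i for i in majority if rng.random() < ratio]
    return sorted(minority + kept)

# Hypothetical labels with roughly the 55/45 split seen in the training data.
labels = [0] * 5554 + [1] * 4446
balanced = undersample(labels)
n_up = sum(labels[i] for i in balanced)
print(n_up, len(balanced) - n_up)
```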


Extracts and analyzes feature importances from the tree-based models (Random Forest and Gradient Boosted Trees) to understand which technical indicators drive predictions. Logistic regression was not included here, since it's only used to add variety for the meta-learner and doesn't generate meaningful results worth analyzing. The analysis aggregates importance scores across both models by averaging their individual contributions.

In [None]:
#extracting feature importances from tree based models
def get_feature_importance(model, feature_names, model_name):
    try:
        importances = model.featureImportances.toArray()
        
        #creating a list of (feature_name, importance)
        feature_importance = list(zip(feature_names, importances))
        
        #sorting by importance
        feature_importance.sort(key=lambda x: x[1], reverse=True)
        
        print(f"\n{model_name} Top 20 importance:")
        print(f"{'Rank':<5} {'Feature':<40} {'Importance':<12} {'Cumulative %'}")
        
        cumulative = 0
        for rank, (feature, importance) in enumerate(feature_importance[:20], 1):
            cumulative += importance
            print(f"{rank:<5} {feature:<40} {importance:<12.6f} {cumulative*100:>6.2f}%")
        
        return feature_importance
    
    except AttributeError:
        return None

#getting importances
rf_importance = get_feature_importance(rf_model, SELECTED_FEATURES, "Random Forest")
gbt_importance = get_feature_importance(gbt_model, SELECTED_FEATURES, "Gradient Boosted Trees")

#aggregating importance across models
if rf_importance and gbt_importance:
    #creating a dict for averaging
    importance_dict = {}
    
    for feature, imp in rf_importance:
        importance_dict[feature] = [imp, 0]
    
    for feature, imp in gbt_importance:
        if feature in importance_dict:
            importance_dict[feature][1] = imp
        else:
            importance_dict[feature] = [0, imp]
    
    #calculating averages
    avg_importance = [(feature, (imps[0] + imps[1]) / 2) 
                      for feature, imps in importance_dict.items()]
    
    #sorting by average importance
    avg_importance.sort(key=lambda x: x[1], reverse=True)
    
    print(f"\n{'Rank':<5} {'Feature':<40} {'Avg Importance':<15} {'RF':<12} {'GBT'}")
    
    for rank, (feature, avg_imp) in enumerate(avg_importance[:25], 1):
        rf_imp = importance_dict[feature][0]
        gbt_imp = importance_dict[feature][1]
        print(f"{rank:<5} {feature:<40} {avg_imp:<15.6f} {rf_imp:<12.6f} {gbt_imp:.6f}")
#flagging low-importance features (only when both importance lists are available)
if rf_importance and gbt_importance:
    print("low importance features")
    low_importance_threshold = 0.01

    low_importance = [feat for feat, imp in avg_importance if imp < low_importance_threshold]

    print(f"\nFeatures with <{low_importance_threshold*100}% avg importance:")
    for feat in low_importance:
        rf_imp = importance_dict[feat][0]
        gbt_imp = importance_dict[feat][1]
        avg_imp = (rf_imp + gbt_imp) / 2
        print(f"  {feat:<40} RF: {rf_imp:.4f}, GBT: {gbt_imp:.4f}, Avg: {avg_imp:.4f}")

    print(f"\nTotal: {len(low_importance)} features")



Random Forest - Top 20 importance:
Rank  Feature                                  Importance   Cumulative %
1     body                                     0.224023      22.40%
2     range                                    0.130031      35.41%
3     upper_wick                               0.113722      46.78%
4     price_to_sma5                            0.094474      56.22%
5     atr_10                                   0.068947      63.12%
6     price_to_sma10                           0.064193      69.54%
7     high_lag4                                0.038484      73.39%
8     price_momentum                           0.035905      76.98%
9     high_lag3                                0.026647      79.64%
10    number_of_trades_lag1                    0.026561      82.30%
11    bb_position                              0.022108      84.51%
12    open_lag5                                0.019942      86.50%
13    high_lag6                                0.019068      88.41%
14    h

Base model evaluation

In [None]:
#getting test predictions
rf_pred_test = rf_model.transform(df_test_features).withColumn("rf_prob", extract_prob_udf(col("probability")))
gbt_pred_test = gbt_model.transform(df_test_features).withColumn("gbt_prob", extract_prob_udf(col("probability")))
lr_pred_test = lr_model.transform(df_test_features).withColumn("lr_prob", extract_prob_udf(col("probability")))

#evaluating individual base models on test set
print("Evaluating base models on test set")
auc_evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")

rf_auc = auc_evaluator.evaluate(rf_pred_test)
gbt_auc = auc_evaluator.evaluate(gbt_pred_test)
lr_auc = auc_evaluator.evaluate(lr_pred_test)

print(f"RandomForest AUC: {rf_auc:.4f}")
print(f"GradientBoostedTrees AUC: {gbt_auc:.4f}")
print(f"LogisticRegression AUC: {lr_auc:.4f}")

Evaluating base models on test set
RandomForest AUC: 0.5927
GradientBoostedTrees AUC: 0.5989
LogisticRegression AUC: 0.5336


Trains a meta-learner that combines the three base model predictions. VectorAssembler builds the meta-features from the three base model probabilities (rf_prob, gbt_prob, lr_prob), so these probabilities become the input features. The meta-model is a LogisticRegression with light regularization (regParam=0.001) to limit overfitting while it learns the combination weights. It trains on the out-of-fold predictions from the cross-validation loop; each of those predictions was made by a model that never saw that row during training, which keeps the meta-model from simply memorizing training set patterns. Since the meta-model only sees the three probabilities, what it learns is how much to trust each base model given the probabilities they output together; it has no direct view of market conditions such as volatility or trend. The same meta-feature assembly is applied to the test predictions, which are first joined into one DataFrame, so the meta-model can generate the final combined predictions. The trained meta-model is saved alongside the three base models for use on the real-time stream.
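
The out-of-fold idea can be illustrated without Spark. This is a minimal sketch in plain Python, not the notebook's cross-validation loop: the function name `out_of_fold_predictions`, the modulo fold assignment and the toy "model" (which just predicts the UP rate of its training folds) are assumptions made for illustration. For time-ordered market data the real folds would likely need to respect time order.

```python
from statistics import mean


def out_of_fold_predictions(labels, k=3):
    """Return one prediction per row, each made without seeing that row's fold.

    Hypothetical helper: folds are assigned by index modulo k, and the
    "model" is simply the UP rate observed in the training folds.
    """
    n = len(labels)
    oof = [None] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        # "fit" on the training folds only
        up_rate = mean(labels[i] for i in train)
        # "predict" on the held-out fold
        for i in held_out:
            oof[i] = up_rate
    return oof


labels = [1, 0, 1, 1, 0, 0]
oof = out_of_fold_predictions(labels, k=3)
print(oof)
```

Every row ends up with exactly one prediction, and that prediction never depends on the row's own label. In the notebook the same role is played by rf_prob, gbt_prob and lr_prob in oof_predictions, which then become the meta-model's training features.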

In [96]:
# Combine test predictions
meta_test = rf_pred_test.select("symbol", "features", "label", "rf_prob") \
                        .join(gbt_pred_test.select("symbol", "features", "gbt_prob"), 
                              on=["symbol", "features"], how="inner") \
                        .join(lr_pred_test.select("symbol", "features", "lr_prob"), 
                              on=["symbol", "features"], how="inner")

#training the meta model on oof predictions
#assembling meta features
meta_assembler = VectorAssembler(
    inputCols=["rf_prob", "gbt_prob", "lr_prob"],
    outputCol="meta_features",
    handleInvalid="skip"
)

meta_train = meta_assembler.transform(oof_predictions)
meta_test = meta_assembler.transform(meta_test)

start_time = time.time()

meta_model = LogisticRegression(
    featuresCol="meta_features",
    labelCol="label",
    weightCol="weight" if use_weights else None,
    maxIter=100,
    regParam=0.001
)

stacked_model = meta_model.fit(meta_train.select("meta_features", "label", "weight") if use_weights
                               else meta_train.select("meta_features", "label"))

print(f"Meta-model trained in {(time.time()-start_time)/60:.1f} minutes")
check_memory()
gc.collect()


Meta-model trained in 0.6 minutes
Memory: 67.8% used (11.2GB/16.5GB)
Swap: 32.6% used (7.3GB/22.5GB) 


1249

Evaluates the performance of the ensemble using three metrics: accuracy (overall correctness across all predictions), AUC (the model's ability to distinguish between classes across probability thresholds), and F1 score (balancing false positives and false negatives). Generates a confusion matrix to compare predictions against actual labels, and per-class recall to check that the two directions are balanced and to guide weight adjustments.
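
How these metrics follow from the confusion counts can be sketched in plain Python. The counts below are made up for illustration, and `per_class_metrics` is a hypothetical helper, not part of the notebook. The F1 here is weighted by class support, which is what Spark's MulticlassClassificationEvaluator reports for metricName="f1".

```python
def per_class_metrics(confusion):
    """confusion maps (label, prediction) -> count, labels 0 (DOWN) and 1 (UP)."""
    total = sum(confusion.values())
    accuracy = (confusion.get((0, 0), 0) + confusion.get((1, 1), 0)) / total
    result = {"accuracy": accuracy}
    weighted_f1 = 0.0
    for cls in (0, 1):
        other = 1 - cls
        tp = confusion.get((cls, cls), 0)    # class predicted correctly
        fn = confusion.get((cls, other), 0)  # class missed
        fp = confusion.get((other, cls), 0)  # other class mistaken for this one
        support = tp + fn
        denom = 2 * tp + fp + fn
        result[f"recall_{cls}"] = tp / support if support else 0.0
        f1 = 2 * tp / denom if denom else 0.0
        result[f"f1_{cls}"] = f1
        # weight each class's F1 by its share of the true labels
        weighted_f1 += f1 * support / total
    result["weighted_f1"] = weighted_f1
    return result


# illustrative counts only, not results from the notebook
counts = {(0, 0): 50, (0, 1): 10, (1, 0): 20, (1, 1): 20}
metrics = per_class_metrics(counts)
print(metrics)
```

With these counts accuracy is 0.70 while UP recall is only 0.50, which is the kind of imbalance between directions the per-class recall is meant to expose.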

In [97]:
#evaluation
final_predictions = stacked_model.transform(meta_test)
final_predictions = final_predictions.checkpoint()
gc.collect()
final_predictions.select("symbol", "label", "prediction", "probability").show(10, truncate=False)

# Metrics
accuracy_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
auc_evaluator_final = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
f1_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")

stacked_accuracy = accuracy_evaluator.evaluate(final_predictions)
stacked_auc = auc_evaluator_final.evaluate(final_predictions)
stacked_f1 = f1_evaluator.evaluate(final_predictions)

print(f"Stack performance")
print(f"Accuracy: {stacked_accuracy:.4f} ({stacked_accuracy*100:.2f}%)")
print(f"AUC: {stacked_auc:.4f}")
print(f"F1: {stacked_f1:.4f}")

final_predictions.groupBy("label", "prediction").count().orderBy("label", "prediction").show()

#per-class metrics
confusion = final_predictions.groupBy("label", "prediction").count().collect()
confusion_dict = {(row['label'], row['prediction']): row['count'] for row in confusion}

total_up = confusion_dict.get((1, 0), 0) + confusion_dict.get((1, 1), 0)
correct_up = confusion_dict.get((1, 1), 0)
total_down = confusion_dict.get((0, 0), 0) + confusion_dict.get((0, 1), 0)
correct_down = confusion_dict.get((0, 0), 0)

up_recall = correct_up / total_up if total_up > 0 else 0
down_recall = correct_down / total_down if total_down > 0 else 0

print(f"UP recall: {up_recall:.4f} ({up_recall*100:.2f}%)")
print(f"DOWN recall: {down_recall:.4f} ({down_recall*100:.2f}%)")

+-------+-----+----------+----------------------------------------+
|symbol |label|prediction|probability                             |
+-------+-----+----------+----------------------------------------+
|SXP-EUR|1    |0.0       |[0.7172629417673959,0.2827370582326041] |
|SXP-EUR|0    |0.0       |[0.7565512894618713,0.24344871053812867]|
|SXP-EUR|0    |0.0       |[0.7578277359923022,0.24217226400769776]|
|SXP-EUR|0    |0.0       |[0.8215137489170786,0.17848625108292138]|
|SXP-EUR|0    |0.0       |[0.8178598444766144,0.18214015552338558]|
|SXP-EUR|0    |0.0       |[0.8178598444766144,0.18214015552338558]|
|SXP-EUR|0    |0.0       |[0.8178598444766144,0.18214015552338558]|
|SXP-EUR|0    |0.0       |[0.8178598444766144,0.18214015552338558]|
|SXP-EUR|0    |0.0       |[0.8178598444766144,0.18214015552338558]|
|SXP-EUR|0    |0.0       |[0.8178598444766144,0.18214015552338558]|
+-------+-----+----------+----------------------------------------+
only showing top 10 rows
Stack performance
Accur


Assesses whether the model's predicted probabilities match the true likelihood. Groups predictions into 5% probability bins and calculates two metrics per bin: the average predicted probability and the actual success rate (the fraction that were truly UP). The per-bin calibration error is the actual rate minus the average predicted probability. The mean absolute calibration error averages the absolute per-bin errors, weighted by the number of predictions in each bin.
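
The binning and the weighted error can be checked by hand on a few rows. This is a plain-Python sketch of the same arithmetic; the helper names `calibration_table` and `weighted_abs_calibration_error` and the six sample rows are assumptions for illustration, not notebook data.

```python
import math
from collections import defaultdict


def calibration_table(rows, bins_per_unit=20):
    """rows is a list of (prob_up, label).

    Returns {bin_start: (count, avg_predicted_prob, actual_prob_up)},
    with 20 bins per unit giving 5%-wide bins.
    """
    grouped = defaultdict(list)
    for prob, label in rows:
        bin_start = round(math.floor(prob * bins_per_unit) / bins_per_unit, 2)
        grouped[bin_start].append((prob, label))
    table = {}
    for bin_start, members in grouped.items():
        n = len(members)
        avg_pred = sum(p for p, _ in members) / n
        actual = sum(1 for _, y in members if y == 1) / n
        table[bin_start] = (n, avg_pred, actual)
    return table


def weighted_abs_calibration_error(table):
    """Average |actual - predicted| over bins, weighted by bin size."""
    total = sum(n for n, _, _ in table.values())
    return sum(n * abs(actual - avg_pred)
               for n, avg_pred, actual in table.values()) / total


# illustrative rows only: (prob_up, label)
rows = [(0.12, 0), (0.13, 0), (0.52, 1), (0.53, 0), (0.58, 1), (0.56, 1)]
table = calibration_table(rows)
error = weighted_abs_calibration_error(table)
print(sorted(table.items()))
print(f"Mean absolute calibration error: {error:.4f}")
```

A negative signed error in a bin, as in the table produced by the cell below, means the model predicted UP more often than it actually happened in that bin.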

In [98]:
#calibration analysis, checking if predicted probabilities match actual outcomes

#extracting UP probability from final predictions
final_predictions = final_predictions.withColumn("prob_up", extract_prob_udf(col("probability")))

#creating probability bins (5% width)
from pyspark.sql.functions import floor as spark_floor, round as spark_round, avg, count

final_predictions_binned = final_predictions.withColumn(
    "prob_bin",
    spark_round(spark_floor(col("prob_up") * 20) / 20, 2)
)

#calculating calibration metrics per bin
calibration = final_predictions_binned.groupBy("prob_bin").agg(
    count("*").alias("count"),
    avg("prob_up").alias("avg_predicted_prob"),
    avg(when(col("label") == 1, 1.0).otherwise(0.0)).alias("actual_prob_up")
)

#adding calibration error
calibration = calibration.withColumn(
    "calibration_error",
    col("actual_prob_up") - col("avg_predicted_prob")
)

print("Calibration table:")
calibration.orderBy("prob_bin").show(25, truncate=False)

#overall calibration metrics
calibration_data = calibration.collect()
total_samples = sum(row['count'] for row in calibration_data)
weighted_cal_error = sum(row['count'] * abs(row['calibration_error']) for row in calibration_data) / total_samples
#rule of thumb used here: a weighted error below 0.05 is treated as well calibrated
print(f"Mean absolute calibration error: {weighted_cal_error:.4f}")



Calibration table:
+--------+-------+-------------------+--------------------+---------------------+
|prob_bin|count  |avg_predicted_prob |actual_prob_up      |calibration_error    |
+--------+-------+-------------------+--------------------+---------------------+
|0.1     |14755  |0.14673957979679453|0.016468993561504573|-0.13027058623528995 |
|0.15    |185014 |0.175953570121973  |0.059427935183283424|-0.11652563493868956 |
|0.2     |101429 |0.22868296842935354|0.14135010697137898 |-0.08733286145797456 |
|0.25    |184074 |0.27598477684267897|0.18821778197898673 |-0.08776699486369224 |
|0.3     |306793 |0.32780809486830764|0.27037122750519077 |-0.05743686736311687 |
|0.35    |735674 |0.3789465932833661 |0.3391542993227979  |-0.039792293960568215|
|0.4     |982890 |0.4261534352734992 |0.3818942099319354  |-0.04425922534156379 |
|0.45    |2820897|0.47908378748012187|0.43543879836803684 |-0.04364498911208503 |
|0.5     |2861299|0.5269806528647341 |0.4857821569853413  |-0.04119849587939278

Analyzes the distribution of predicted probabilities and evaluates model performance across different confidence levels. Predictions are categorized into confidence ranges, and the distribution shows how often predictions fall in each tier. The high confidence subset keeps predictions where the UP probability is either ≥60% or ≤40%, i.e. where the model has a clear directional lean; the very high confidence subset uses ≥70% or ≤30%. Accuracy is calculated separately for each subset and compared against overall model performance.
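
The subset filter is symmetric around 50%, which is easy to see on a handful of rows. This is a plain-Python sketch under stated assumptions: `confident_subset_accuracy` is a hypothetical helper, the rows are invented, and the predicted class is taken to be UP when prob_up is at least 0.5.

```python
def confident_subset_accuracy(rows, threshold):
    """rows is a list of (prob_up, label).

    Keeps rows with prob_up >= threshold or prob_up <= 1 - threshold and
    returns (coverage, accuracy) for that subset.
    """
    subset = [(p, y) for p, y in rows if p >= threshold or p <= 1 - threshold]
    if not subset:
        return 0.0, None
    # assumed decision rule: predict UP (1) when prob_up >= 0.5
    correct = sum(1 for p, y in subset if (1 if p >= 0.5 else 0) == y)
    return len(subset) / len(rows), correct / len(subset)


# illustrative rows only: (prob_up, label)
rows = [(0.65, 1), (0.35, 0), (0.55, 0), (0.45, 1), (0.72, 1), (0.38, 1)]
coverage, accuracy = confident_subset_accuracy(rows, 0.6)
overall_coverage, overall_accuracy = confident_subset_accuracy(rows, 0.5)
print(f"High confidence: {coverage:.1%} of rows, accuracy {accuracy:.2f}")
print(f"Overall accuracy: {overall_accuracy:.2f}")
```

Because the filter also keeps confident DOWN predictions, a subset can be much larger than the count of rows with a high UP probability. The lower bound is computed as 1 - threshold, and in binary floating point 1 - 0.7 evaluates to 0.30000000000000004, which is why that value shows up in the printed output further down.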

In [99]:
# Probability distribution analysis
print("Probability distribution:")
final_predictions.select("prob_up").describe().show()

#counting predictions by confidence ranges
print("Confidence distribution:")
final_predictions.withColumn(
    "confidence_range",
    when(col("prob_up") < 0.45, "Very Low (0-45%)")
    .when(col("prob_up") < 0.50, "Low (45-50%)")
    .when(col("prob_up") < 0.55, "Borderline (50-55%)")
    .when(col("prob_up") < 0.60, "Moderate (55-60%)")
    .when(col("prob_up") < 0.65, "Confident (60-65%)")
    .when(col("prob_up") < 0.70, "High (65-70%)")
    .otherwise("Very High (70%+)")
).groupBy("confidence_range").agg(
    count("*").alias("count")
).orderBy("confidence_range").show(truncate=False)

#high confidence subset performance
#defining thresholds
high_conf_threshold = 0.6  # 60% threshold
very_high_conf_threshold = 0.7  # 70% threshold

#high confidence
high_conf_predictions = final_predictions.filter(
    (col("prob_up") >= high_conf_threshold) | (col("prob_up") <= (1 - high_conf_threshold))
)
high_conf_count = high_conf_predictions.count()
total_count = final_predictions.count()
high_conf_pct = (high_conf_count / total_count * 100) if total_count > 0 else 0

if high_conf_count > 0:
    high_conf_correct = high_conf_predictions.filter(col("prediction") == col("label")).count()
    high_conf_acc = high_conf_correct / high_conf_count
    
    print(f"High Confidence (prob > {high_conf_threshold} or < {1-high_conf_threshold}):")
    print(f"  Predictions: {high_conf_count:,} ({high_conf_pct:.1f}% of total)")
    print(f"  Accuracy: {high_conf_acc:.4f} ({high_conf_acc*100:.2f}%)")
    print(f"  vs Overall: {stacked_accuracy:.4f} ({stacked_accuracy*100:.2f}%)")
else:
    print(f"No high confidence predictions")

#vhc predictions
very_high_conf_predictions = final_predictions.filter(
    (col("prob_up") >= very_high_conf_threshold) | (col("prob_up") <= (1 - very_high_conf_threshold))
)

very_high_conf_count = very_high_conf_predictions.count()
very_high_conf_pct = (very_high_conf_count / total_count * 100) if total_count > 0 else 0

if very_high_conf_count > 0:
    very_high_conf_correct = very_high_conf_predictions.filter(col("prediction") == col("label")).count()
    very_high_conf_acc = very_high_conf_correct / very_high_conf_count
    
    print(f"\nVery High Confidence (prob > {very_high_conf_threshold} or < {1-very_high_conf_threshold}):")
    print(f"  Predictions: {very_high_conf_count:,} ({very_high_conf_pct:.1f}% of total)")
    print(f"  Accuracy: {very_high_conf_acc:.4f} ({very_high_conf_acc*100:.2f}%)")
    print(f"  vs Overall: {stacked_accuracy:.4f} ({stacked_accuracy*100:.2f}%)")
else:
    print(f"No very high confidence predictions")



Probability distribution:
+-------+-------------------+
|summary|            prob_up|
+-------+-------------------+
|  count|           10621615|
|   mean| 0.4861669783835595|
| stddev| 0.0872711742406609|
|    min|0.14062491772922492|
|    max| 0.7011341491853222|
+-------+-------------------+

Confidence distribution:
+-------------------+-------+
|confidence_range   |count  |
+-------------------+-------+
|Borderline (50-55%)|2861299|
|Confident (60-65%) |364019 |
|High (65-70%)      |161    |
|Low (45-50%)       |2820897|
|Moderate (55-60%)  |2064607|
|Very High (70%+)   |3      |
|Very Low (0-45%)   |2510629|
+-------------------+-------+

High Confidence (prob > 0.6 or < 0.4):
  Predictions: 1,891,922 (17.8% of total)
  Accuracy: 0.7110 (71.10%)
  vs Overall: 0.5681 (56.81%)

Very High Confidence (prob > 0.7 or < 0.30000000000000004):
  Predictions: 485,275 (4.6% of total)
  Accuracy: 0.8759 (87.59%)
  vs Overall: 0.5681 (56.81%)


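Saves the three base models and the meta-model under MODEL_FOLDER, overwriting any previous copies. Backslashes in the paths are replaced with forward slashes.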

In [100]:
#saving models
rf_path = os.path.join(MODEL_FOLDER, "stacked_rf_model").replace("\\", "/")
gbt_path = os.path.join(MODEL_FOLDER, "stacked_gbt_model").replace("\\", "/")
lr_path = os.path.join(MODEL_FOLDER, "stacked_lr_model").replace("\\", "/")
meta_path = os.path.join(MODEL_FOLDER, "stacked_meta_model").replace("\\", "/")

rf_model.write().overwrite().save(rf_path)
gbt_model.write().overwrite().save(gbt_path)
lr_model.write().overwrite().save(lr_path)
stacked_model.write().overwrite().save(meta_path)

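Final cleanup: unpersists the cached DataFrames, clears the Spark cache, stops Spark, then removes the checkpoint and temp directories.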

In [101]:
#cleanup
print("Cleaning up checkpoints")
try:
    final_predictions.unpersist()
    meta_train.unpersist()
    meta_test.unpersist()
    df_train_features.unpersist()
    df_test_features.unpersist()
except Exception:
    #best-effort cleanup, errors here are deliberately ignored
    pass

try:
    spark.catalog.clearCache()
except Exception:
    pass

gc.collect()

spark.stop()
try:
    shutil.rmtree(CHECKPOINT_DIR)
except Exception as e:
    print(f"Warning: {e}")

try:
    if os.path.exists(str(BASE_TEMP_DIR)):
        shutil.rmtree(str(BASE_TEMP_DIR))
        print(f"Temp directory cleaned: {BASE_TEMP_DIR}")
except Exception as e:
    print(f"Warning: Could not clean temp directory - {e}")

Cleaning up checkpoints
