## Introduction

This notebook accompanies our research study on backtest overfitting in financial machine learning. The abstract below summarizes how we evaluate model robustness with cross-validation techniques designed for financial data, moving beyond traditional analytical methods. The aim is to connect theoretical finance with practical application so that financial models remain reliable in complex market environments.

### Abstract
This research explores the integration of advanced statistical models and machine learning in financial analytics, representing a shift from traditional to data-driven methods. We address a critical gap in quantitative finance: the need for robust model evaluation and out-of-sample testing methodologies, particularly cross-validation techniques tailored to financial markets. We present a comprehensive framework to assess these methods, considering characteristics of financial data such as non-stationarity, autocorrelation, and regime shifts. Our analysis shows that the Combinatorial Purged Cross-Validation (CPCV) method is markedly better at mitigating overfitting risk than K-Fold, Purged K-Fold, and especially Walk-Forward, as evidenced by its lower Probability of Backtest Overfitting (PBO) and superior Deflated Sharpe Ratio (DSR) test statistic. Walk-Forward, by contrast, exhibits notable shortcomings in false discovery prevention, characterized by increased temporal variability and weaker stationarity. This contrasts starkly with CPCV's stability and efficiency, which support its reliability for financial strategy development. The analysis also suggests that choosing between Purged K-Fold and K-Fold requires caution, given their comparable performance and the potential impact on the robustness of training data in out-of-sample testing. Our investigation uses a Synthetic Controlled Environment that incorporates the Heston Stochastic Volatility, Merton Jump Diffusion, and Drift-Burst Hypothesis models alongside regime-switching models. This approach provides a nuanced simulation of market conditions and offers new insights into evaluating cross-validation techniques. Our study underscores the necessity of specialized validation methods in financial modeling, especially in the face of growing regulatory demands and complex market dynamics. It bridges theoretical and practical finance, and by highlighting advanced cross-validation techniques such as CPCV, it enhances the reliability and applicability of financial models in decision-making.

## Import Libraries and Modules
This section imports the libraries used throughout the notebook: `numpy` and `pandas` for data manipulation, and machine learning models and tools from `scikit-learn` and `xgboost`. We use `joblib` to run the simulations in parallel, with `joblib_progress` reporting progress. The `RiskLabAI` package provides the synthetic data generation and backtesting functionality on which the analysis relies.



In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from joblib import Parallel, delayed
from joblib_progress import joblib_progress

In [None]:
from RiskLabAI.data.synthetic_data import drift_volatility_burst, parallel_generate_prices
from RiskLabAI.backtest import backtset_overfitting_simulation

## Simulation Parameters
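Here we define the simulation parameters: the number of parallel jobs (`N_JOBS`), the number of simulated price paths (`N_PATHS`), the horizon in years (`TOTAL_TIME`), and the number of time steps (`N_STEPS`, 252 per year). We also set the annual risk-free rate and its per-step equivalent, the random seed, and the partition length used in the temporal analysis. These parameters define a synthetic controlled environment that simulates market conditions, allowing us to evaluate the cross-validation techniques.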


In [None]:
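N_JOBS = 24                      # parallel jobs
N_PATHS = 1000                   # simulated price paths
TOTAL_TIME = 40                  # simulation horizon in years
N_STEPS = int(252 * TOTAL_TIME)  # 252 steps per year
RISK_FREE_RATE = 0.05            # annual risk-free rate
# Per-step log rate, equal to log(1 + RISK_FREE_RATE) / 252
STEP_RISK_FREE_RATE = np.log(1 + RISK_FREE_RATE) / N_STEPS * TOTAL_TIME
RANDOM_STATE = 0                 # seed for reproducibility
OVERFITTING_PARTITIONS_LENGTH = 252  # steps per partition in the temporal analysis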
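
The per-step risk-free rate can be read as the annual rate converted to a log rate and spread evenly over the steps in one year. The following is a minimal sketch of that arithmetic using only the standard library; the names `steps_per_year`, `step_rate`, and `annual_growth` are introduced here for illustration and are not part of the notebook.

```python
import math

TOTAL_TIME = 40                      # years, as in the cell above
N_STEPS = int(252 * TOTAL_TIME)      # total number of steps
RISK_FREE_RATE = 0.05                # annual rate

# Convert the annual rate to a log rate, then divide by the steps in one year.
steps_per_year = N_STEPS / TOTAL_TIME
step_rate = math.log(1 + RISK_FREE_RATE) / steps_per_year

# Compounding the per-step rate over one year should recover the annual rate.
annual_growth = math.exp(step_rate * steps_per_year)

print(N_STEPS, steps_per_year)
print(round(annual_growth - 1, 10))
```

Dividing by `N_STEPS` and multiplying by `TOTAL_TIME`, as the notebook does, is the same as dividing by the number of steps per year.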

## Market Regime Parameters and Custom Pipeline
We define the parameters for three market regimes: calm, volatile, and speculative bubble. The `drift_volatility_burst` function generates the drift and volatility paths used for the bubble regime. Each regime is described by Heston model parameters such as the mean return (`mu`), the rate at which variance reverts to `theta` (`kappa`), and the long-run average variance (`theta`), among others. The second cell defines `CustomPipeline`, a `Pipeline` subclass whose `fit` method forwards a plain `sample_weight` argument to every pipeline step.


In [None]:
x = 0.35

bubble_drifts, bubble_volatilities = drift_volatility_burst(
    bubble_length=5 * 252, 
    a_before=x, 
    a_after=-x, 
    b_before=0.6 * x, 
    b_after=0.6 * x, 
    alpha=0.75, 
    beta=0.45,
    explosion_filter_width=0.1
)
# Dictionary of Heston parameters for different market regimes
regimes = {
    'calm': {
        'mu': 0.1,
        'kappa': 3.98,
        'theta': 0.029,
        'xi': 0.389645311,
        'rho': -0.7,
        'lam': 121,
        'm': -0.000709,
        'v': 0.0119
    },
    'volatile': {
        'mu': 0.1,
        'kappa': 3.81,
        'theta': 0.25056,
        'xi': 0.59176974,
        'rho': -0.7,
        'lam': 121,
        'm': -0.000709,
        'v': 0.0119
    },
    'speculative_bubble': {
        'mu': list(bubble_drifts),
        'kappa': 1,
        'theta': list(bubble_volatilities),
        'xi': 0,
        'rho': 0,
        'lam': 0,
        'm': 0,
        'v': 0.00000001
    },
}
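
One sanity check that is sometimes applied to Heston parameters is the Feller condition, 2·kappa·theta > xi², under which the variance process stays strictly positive. The sketch below evaluates it for the calm and volatile regimes, with values copied from the dictionary above. The helper `feller_ratio` is a name introduced here for illustration; the check is not part of the original notebook or, as far as we know, of `RiskLabAI`. The bubble regime is skipped because its `xi` is zero.

```python
# Calm and volatile regime values copied from the dictionary above.
heston = {
    "calm": {"kappa": 3.98, "theta": 0.029, "xi": 0.389645311},
    "volatile": {"kappa": 3.81, "theta": 0.25056, "xi": 0.59176974},
}


def feller_ratio(kappa, theta, xi):
    """Return 2*kappa*theta / xi**2; a value above 1 satisfies the Feller condition."""
    return 2.0 * kappa * theta / xi**2


ratios = {name: feller_ratio(**params) for name, params in heston.items()}
for name, ratio in ratios.items():
    print(f"{name}: ratio={ratio:.3f}, feller_ok={ratio > 1.0}")
```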

In [None]:
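class CustomPipeline(Pipeline):
    """Pipeline whose `fit` accepts a plain `sample_weight` keyword."""

    @classmethod
    def from_existing_pipeline(cls, existing_pipeline, memory=None, verbose=False):
        # Reuse the steps of an already-built pipeline (e.g. from make_pipeline)
        return cls(steps=existing_pipeline.steps, memory=memory, verbose=verbose)

    def fit(self, X, y=None, **fit_params):
        # Pipeline.fit expects step-prefixed keys (`<step>__sample_weight`),
        # so expand a plain `sample_weight` into one key per step.
        if 'sample_weight' in fit_params:
            sample_weight = fit_params.pop('sample_weight')
            for step_name, _ in self.steps:
                fit_params[f"{step_name}__sample_weight"] = sample_weight
        return super().fit(X, y, **fit_params)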
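
To make the keyword expansion in `CustomPipeline.fit` concrete, here is a minimal standalone sketch of the same logic without scikit-learn. The function `expand_sample_weight` is a hypothetical helper written for this illustration, and the step names are the ones `make_pipeline(StandardScaler(), KNeighborsClassifier())` generates.

```python
def expand_sample_weight(step_names, fit_params):
    """Return a copy of fit_params with a plain 'sample_weight' expanded per step."""
    params = dict(fit_params)  # copy so the caller's dict is left untouched
    if "sample_weight" in params:
        weight = params.pop("sample_weight")
        for name in step_names:
            params[f"{name}__sample_weight"] = weight
    return params


# Step names as generated by make_pipeline for the k-NN pipeline.
steps = ["standardscaler", "kneighborsclassifier"]
routed = expand_sample_weight(steps, {"sample_weight": [0.5, 1.0, 2.0]})
print(sorted(routed))
```

Because the weight is forwarded to every step, every step's `fit` has to accept `sample_weight`. `KNeighborsClassifier.fit` does not take that argument, so passing `sample_weight` to the k-NN pipeline would likely raise an error; restricting the expansion to steps that support weights is one possible refinement.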

## Transition Matrix and Strategy Parameters
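A transition matrix gives the per-step probability of moving between market states: row *i* holds the transition probabilities out of state *i*, and each row sums to one. The probabilities scale with the step size `dt = TOTAL_TIME / N_STEPS`. We also define `strategy_parameters`, a grid of fast and slow window lengths together with `exponential` and `mean_reversion` flags, for the trading strategies tested in the simulation.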

In [None]:
dt = TOTAL_TIME / N_STEPS

transition_matrix = np.array([
    [1 - 1 * dt,   1 * dt - 0.00001,        0.00001],  # State 0 transitions
    [20 * dt,      1 - 20 * dt - 0.00001,   0.00001],  # State 1 transitions
    [1 - 1 * dt,   1 * dt,                      0.0],  # State 2 transitions
])

strategy_parameters = {
    'fast_window' : [5, 20, 50, 70],
    'slow_window' : [10, 50, 100, 140],
    'exponential' : [False],
    'mean_reversion' : [False]
}
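
Two properties of the transition matrix above are easy to verify: each row should sum to one, and the expected number of consecutive steps spent in state *i* of a Markov chain is 1 / (1 − P[i][i]). The sketch below checks both with plain Python lists. The variable names `row_sums` and `dwell_steps` are introduced here for illustration.

```python
dt = 40 / (252 * 40)  # TOTAL_TIME / N_STEPS from the cell above

transition_matrix = [
    [1 - 1 * dt, 1 * dt - 0.00001, 0.00001],
    [20 * dt, 1 - 20 * dt - 0.00001, 0.00001],
    [1 - 1 * dt, 1 * dt, 0.0],
]

# Each row is a probability distribution over the next state.
row_sums = [sum(row) for row in transition_matrix]

# Expected number of consecutive steps spent in state i: 1 / (1 - P[i][i]).
dwell_steps = [1.0 / (1.0 - row[i]) for i, row in enumerate(transition_matrix)]

for i, (total, dwell) in enumerate(zip(row_sums, dwell_steps)):
    print(f"state {i}: row sum={total:.6f}, expected dwell={dwell:.1f} steps")
```

Under this reading, state 0 persists for about 252 steps on average, state 1 for about 13, and state 2 leaves after a single step. How the price generator combines that single step with the multi-year bubble paths is internal to `parallel_generate_prices` and is not shown in this notebook.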

## Model Definition and Parameter Grids
We declare a dictionary of machine learning models: k-Nearest Neighbors (k-NN), Decision Tree, and XGBoost. Each entry pairs an estimator with a grid of hyperparameters to be searched during model training; the k-NN classifier is wrapped in a `CustomPipeline` that standardizes the features first.

In [None]:
# Define models and parameter grids

models = {
    'k-NN' : {
        'Model': CustomPipeline.from_existing_pipeline(existing_pipeline=make_pipeline(StandardScaler(), KNeighborsClassifier())),
        'Parameters': {
            'kneighborsclassifier__n_neighbors': [1, 2, 3],
        }
    },
    'Decision Tree' : {
        'Model': DecisionTreeClassifier(random_state=RANDOM_STATE),
        'Parameters': {
            'max_depth': [None],
            'min_samples_split': [2],
            'min_samples_leaf': [1],
        }
    },
    'XGBoost': {
        'Model': XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', seed=RANDOM_STATE),
        'Parameters': {
            'n_estimators': [1000],
            'max_depth': [1000000000],
            'learning_rate': [1, 10, 100],
            'subsample': [1.0],
            'colsample_bytree': [1.0],
        }
    },
}
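
To see how many candidate configurations these grids imply, the sketch below expands each grid into its full list of parameter combinations. It assumes an exhaustive grid search, which is an assumption on our part since the search itself happens inside `RiskLabAI`; `expand_grid` is a hypothetical helper written for this illustration.

```python
from itertools import product

# Hyperparameter grids copied from the `models` dictionary above.
grids = {
    "k-NN": {"kneighborsclassifier__n_neighbors": [1, 2, 3]},
    "Decision Tree": {
        "max_depth": [None],
        "min_samples_split": [2],
        "min_samples_leaf": [1],
    },
    "XGBoost": {
        "n_estimators": [1000],
        "max_depth": [1000000000],
        "learning_rate": [1, 10, 100],
        "subsample": [1.0],
        "colsample_bytree": [1.0],
    },
}


def expand_grid(grid):
    """Return every parameter combination in a grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


candidates = {model: expand_grid(grid) for model, grid in grids.items()}
for model, combos in candidates.items():
    print(model, len(combos))
```

Only `n_neighbors` and `learning_rate` vary, so the three grids yield three, one, and three candidates respectively.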

## Synthesizing Prices
We use the `parallel_generate_prices` function to synthesize asset prices under the market regimes defined above. It returns the simulated prices (`all_prices`) together with the associated regimes (`all_regimes`); the prices serve as the dataset for backtesting and model validation.

In [None]:
print('Synthesizing Prices...')
all_prices, all_regimes = parallel_generate_prices(
    N_PATHS,
    regimes,
    transition_matrix,
    TOTAL_TIME,
    N_STEPS,
    RANDOM_STATE,
    N_JOBS
)
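
The internals of `parallel_generate_prices` are not shown in this notebook. As a rough illustration of how a transition matrix can drive regime switching, the sketch below samples a state sequence from a discrete-time Markov chain using the standard library. It is not the `RiskLabAI` implementation; `sample_regime_path` and its arguments are hypothetical names.

```python
import random


def sample_regime_path(transition_matrix, n_steps, start_state=0, seed=0):
    """Sample a state sequence of length n_steps from a discrete-time Markov chain."""
    rng = random.Random(seed)
    states = list(range(len(transition_matrix)))
    path = [start_state]
    for _ in range(n_steps - 1):
        # The current state's row gives the probabilities of the next state.
        weights = transition_matrix[path[-1]]
        path.append(rng.choices(states, weights=weights, k=1)[0])
    return path


dt = 1 / 252
P = [
    [1 - dt, dt - 0.00001, 0.00001],
    [20 * dt, 1 - 20 * dt - 0.00001, 0.00001],
    [1 - dt, dt, 0.0],
]
path = sample_regime_path(P, n_steps=2520)
print(len(path), sorted(set(path)))
```

Fixing the seed makes the sampled path reproducible, which mirrors the role `RANDOM_STATE` plays in the call above.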

## Overall Evaluation
The following cell runs the backtest overfitting simulation in parallel, one job per price column, to evaluate the risk of overfitting under different cross-validation (CV) methods. Here the final argument, which sets the partition length, is `all_prices.shape[0]`, so each full price path is evaluated as a whole. The per-path results are collected into lists, converted to one DataFrame per CV method, and saved as CSV files.

In [None]:
with joblib_progress("Overfitting...", total=all_prices.shape[1]):
    results = Parallel(n_jobs=N_JOBS)(
        delayed(backtset_overfitting_simulation)(
            all_prices[column],
            strategy_parameters,
            models,
            STEP_RISK_FREE_RATE,
            all_prices.shape[0],
        )
        for column in all_prices.columns
    )

# Assuming results is already populated from the joblib Parallel call
cv_pbo_list = [result[0] for result in results]  # Collect all cv_pbo dicts
cv_deflated_sr_list = [result[1] for result in results]  # Collect all cv_deflated_sr dicts

# Initialize dicts to collect lists for each CV method
cv_pbo_data = {cv: [] for cv in cv_pbo_list[0].keys()}
cv_deflated_sr_data = {cv: [] for cv in cv_deflated_sr_list[0].keys()}

# Populate the cv_pbo_data and cv_deflated_sr_data with concatenated lists from each result
for cv_pbo in cv_pbo_list:
    for cv, values in cv_pbo.items():
        cv_pbo_data[cv].append(values)

for cv_deflated_sr in cv_deflated_sr_list:
    for cv, values in cv_deflated_sr.items():
        cv_deflated_sr_data[cv].append(values)

# Convert the collected lists into DataFrames
cv_pbo_dfs = {cv: pd.DataFrame(cv_pbo_data[cv]).T for cv in cv_pbo_data}
cv_deflated_sr_dfs = {cv: pd.DataFrame(cv_deflated_sr_data[cv]).T for cv in cv_deflated_sr_data}

# Mapping from descriptive CV names to filesystem-friendly names
cv_name_map = {
    'Walk-Forward': 'walkforward',
    'K-Fold': 'kfold',
    'Purged K-Fold': 'purgedkfold',
    'Combinatorial Purged': 'combinatorialpurged',
}

# Save each cv_pbo DataFrame to CSV using the mapping for file names
for cv_name, df in cv_pbo_dfs.items():
    file_name = cv_name_map.get(cv_name, cv_name)  # Fallback to cv_name if not found in the map
    df.to_csv(f'overall_simulated_pbo_{file_name}.csv', index=False)

# Save each cv_deflated_sr DataFrame to CSV using the mapping for file names
for cv_name, df in cv_deflated_sr_dfs.items():
    file_name = cv_name_map.get(cv_name, cv_name)  # Fallback to cv_name if not found in the map
    df.to_csv(f'overall_simulated_deflated_sr_{file_name}.csv', index=False)
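
The collection step above can be illustrated on toy data. The sketch below assumes each simulation result is a tuple of two dictionaries keyed by CV method name, with a list of values per method. The numbers are made up for illustration and are not results from the study, and `collect_by_cv` is a hypothetical helper.

```python
# Hypothetical per-path outputs: (pbo_by_cv, deflated_sr_by_cv) with made-up numbers.
results = [
    ({"K-Fold": [0.40, 0.35], "Walk-Forward": [0.55, 0.60]},
     {"K-Fold": [1.2, 1.1], "Walk-Forward": [0.4, 0.3]}),
    ({"K-Fold": [0.45, 0.30], "Walk-Forward": [0.50, 0.65]},
     {"K-Fold": [1.0, 1.3], "Walk-Forward": [0.5, 0.2]}),
]


def collect_by_cv(results, position):
    """Group the dict at `position` of every result tuple by CV method name."""
    collected = {cv: [] for cv in results[0][position]}
    for result in results:
        for cv, values in result[position].items():
            collected[cv].append(values)
    return collected


pbo_by_cv = collect_by_cv(results, 0)
dsr_by_cv = collect_by_cv(results, 1)

# One inner list per simulated path; transposing puts paths in columns,
# which is what the `.T` on each DataFrame does in the cell above.
pbo_columns = {cv: list(zip(*rows)) for cv, rows in pbo_by_cv.items()}
print(pbo_by_cv["K-Fold"])
print(pbo_columns["K-Fold"])
```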

## Partitioned Data Evaluation
For the temporal analysis, the same simulation is run on partitions of `OVERFITTING_PARTITIONS_LENGTH` (252) steps to show how each CV method performs over time. The results are again saved as one CSV file per CV method, with descriptive names mapped to filesystem-friendly names.

In [None]:
with joblib_progress("Overfitting...", total=all_prices.shape[1]):
    results = Parallel(n_jobs=N_JOBS)(
        delayed(backtset_overfitting_simulation)(
            all_prices[column],
            strategy_parameters,
            models,
            STEP_RISK_FREE_RATE,
            OVERFITTING_PARTITIONS_LENGTH,
        )
        for column in all_prices.columns
    )

# Assuming results is already populated from the joblib Parallel call
cv_pbo_list = [result[0] for result in results]  # Collect all cv_pbo dicts
cv_deflated_sr_list = [result[1] for result in results]  # Collect all cv_deflated_sr dicts

# Initialize dicts to collect lists for each CV method
cv_pbo_data = {cv: [] for cv in cv_pbo_list[0].keys()}
cv_deflated_sr_data = {cv: [] for cv in cv_deflated_sr_list[0].keys()}

# Populate the cv_pbo_data and cv_deflated_sr_data with concatenated lists from each result
for cv_pbo in cv_pbo_list:
    for cv, values in cv_pbo.items():
        cv_pbo_data[cv].append(values)

for cv_deflated_sr in cv_deflated_sr_list:
    for cv, values in cv_deflated_sr.items():
        cv_deflated_sr_data[cv].append(values)

# Convert the collected lists into DataFrames
cv_pbo_dfs = {cv: pd.DataFrame(cv_pbo_data[cv]).T for cv in cv_pbo_data}
cv_deflated_sr_dfs = {cv: pd.DataFrame(cv_deflated_sr_data[cv]).T for cv in cv_deflated_sr_data}

# Mapping from descriptive CV names to filesystem-friendly names
cv_name_map = {
    'Walk-Forward': 'walkforward',
    'K-Fold': 'kfold',
    'Purged K-Fold': 'purgedkfold',
    'Combinatorial Purged': 'combinatorialpurged',
}

# Save each cv_pbo DataFrame to CSV using the mapping for file names
for cv_name, df in cv_pbo_dfs.items():
    file_name = cv_name_map.get(cv_name, cv_name)  # Fallback to cv_name if not found in the map
    df.to_csv(f'simulated_pbo_{file_name}.csv', index=False)

# Save each cv_deflated_sr DataFrame to CSV using the mapping for file names
for cv_name, df in cv_deflated_sr_dfs.items():
    file_name = cv_name_map.get(cv_name, cv_name)  # Fallback to cv_name if not found in the map
    df.to_csv(f'simulated_deflated_sr_{file_name}.csv', index=False)
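
How the library splits each path into partitions is not shown in this notebook. Assuming consecutive, non-overlapping blocks of equal length (an assumption on our part), the partition boundaries could be sketched as follows; `partition_bounds` is a hypothetical helper.

```python
def partition_bounds(n_observations, partition_length):
    """Return (start, stop) index pairs for consecutive full-length partitions."""
    n_partitions = n_observations // partition_length
    return [
        (i * partition_length, (i + 1) * partition_length)
        for i in range(n_partitions)
    ]


N_STEPS = int(252 * 40)              # as defined in the simulation parameters
OVERFITTING_PARTITIONS_LENGTH = 252  # steps per partition

bounds = partition_bounds(N_STEPS, OVERFITTING_PARTITIONS_LENGTH)
print(len(bounds), bounds[0], bounds[-1])
```

Under this assumption, a 40-year path yields 40 partitions of 252 steps each.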

These simulations quantify the Probability of Backtest Overfitting (PBO) and the Deflated Sharpe Ratio (DSR) test statistic for each CV method, both over the full sample and across time partitions, giving a comprehensive view of their robustness.