### Statistical Arbitrage using Graph Theory

Implementation of "Statistical arbitrage in multi-pair trading strategy based on graph clustering algorithms in US equities market" by Adam Korniejczuk & Robert Ślepaczuk.


#### Methodology:

1. Data Collection
- Universe: S&P 500 constituents (updated historically as they change).
- Timeframe: Daily data from 2000 to 2022.
- Price Data: Adjusted close prices for all stocks.

2. Graph-Based Clustering of the Stocks
- Calculate residual returns from a Fama-French 3-factor regression.
- Compute the correlation matrix of residuals over a rolling window (30 days in the paper).
- Treat the correlation matrix as an adjacency matrix of a signed, weighted, undirected graph.
- Use SPONGE-sym clustering algorithm to identify clusters of tightly connected stocks.

3. Signal Generation
- Every k days (e.g., every 10 days) for each cluster:
- Compute cluster mean return over past 5 days.
- Signal is then Stock’s 5-day return - Cluster’s 5-day mean return
- If signal is significantly:
    - Below cluster average then Long
    - Above cluster average then Short

4. Feature Engineering for Signal Quality
- For each signal generated extract features like:
- Graph-based:
    - Local/global vertex degree
    - Cluster size
    - Graph density
- Price-based:
    - Signal value
    - Sign of deviation
    - Stock & cluster average returns over past 10 days

5. Label Signals
- Label based on profitability (the Triple Barrier Method is an alternative worth trying). The paper's rule: "If the stock’s return after the signal exceeds a threshold (e.g., 4%) or beats transaction cost."
- This becomes a target for ML Classification

6. Machine Learning Signal Classifier
- Train multiple classifiers on the labeled dataset:
    - Logistic Regression
    - Gradient Boosted Trees
    - Neural Net (MLP)
    - SGD Classifier
    - AdaBoost
- Ensemble (soft voting) using classifier probabilities.

7. Trade Filtering 
- Only trade signals where ensemble confidence > 0.6 to reduce bad trades and minimize transaction costs.

8. Bet Sizing
- Use the Kelly criterion.

9. Apply time-decaying take profit and stop loss:
- Threshold shrinks each day since entry.
- Threshold also weighted by signal confidence.

10. Performance Evaluation
- Backtest using realistic assumptions:
    - Transaction cost = 0.05%
    - Fractional shares allowed
- Metrics:
    - Annualized Return
    - Sharpe & Sortino Ratio
    - Max Drawdown
    - Calmar Ratio
    - Modified Information Ratios (IR*, IR**)
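
Steps 7 to 9 above are not implemented in the cells that follow. A minimal sketch of how they might fit together is shown here. The payoff ratio `b`, the geometric `decay` rate, and the multiplicative confidence weighting are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def kelly_fraction(p: float, b: float = 1.0) -> float:
    """Kelly fraction f* = p - (1 - p) / b for win probability p and payoff ratio b.

    Negative fractions (no edge) are floored at zero.
    """
    return max(0.0, p - (1.0 - p) / b)

def decayed_threshold(base: float, days_held: int, confidence: float,
                      decay: float = 0.9) -> float:
    """Exit threshold that shrinks each day since entry and scales with confidence.

    Assumption: geometric decay and a multiplicative confidence weight.
    """
    return base * confidence * decay ** days_held

# Hypothetical ensemble probabilities for four signals
proba = np.array([0.55, 0.62, 0.71, 0.48])

# Step 7: keep only signals with ensemble confidence above 0.6
take = proba > 0.6

# Step 8: size the surviving trades with the Kelly fraction (even payoff assumed)
sizes = [kelly_fraction(p) for p in proba[take]]

# Step 9: take-profit level on day 3 for the first surviving trade
tp_day3 = decayed_threshold(base=0.04, days_held=3, confidence=proba[take][0])
```

With an even payoff (`b = 1`) the Kelly fraction reduces to `2p - 1`, so a confidence of 0.62 maps to a 24% allocation before any cap is applied.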


### Step 1: Data Collection

In [53]:
# Smaller Dataset (to start)
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime, timedelta

tickers = [
    "AAPL", "MSFT", "AMZN", "GOOGL", "META", "NVDA", "TSLA", "BRK-B", "JPM", "JNJ",
    "V", "PG", "MA", "UNH", "HD", "XOM", "LLY", "ABBV", "AVGO", "KO",
    "PEP", "CVX", "MRK", "COST", "WMT", "MCD", "BAC", "ADBE", "TMO", "ABT",
    "CSCO", "ORCL", "ACN", "INTC", "CMCSA", "CRM", "NKE", "TXN", "DHR", "AMD",
    "LIN", "PM", "NEE", "UPS", "BMY", "MS", "UNP", "LOW", "AMGN", "RTX",
    "CAT", "HON", "GS", "SCHW", "QCOM", "AMAT", "BLK", "CVS", "MDT", "INTU",
    "DE", "ISRG", "GE", "LMT", "BA", "ADI", "TGT", "SBUX", "ZTS", "GILD",
    "PLD", "SPGI", "MO", "CI", "SO", "ADP", "NOW", "VRTX", "MMC", "CB",
    "C", "REGN", "PNC", "CL", "PGR", "SYK", "USB", "TFC", "BDX", "BKNG",
    "ETN", "ICE", "EQIX", "EL", "AON", "FIS", "HUM", "FDX", "GM", "APD"
]


end_date = datetime.today()
start_date = end_date - timedelta(days=365 * 2)

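# Download adjusted close prices (auto_adjust=True makes "Close" split/dividend adjusted)
df_prices = yf.download(tickers, start=start_date, end=end_date, auto_adjust=True)["Close"]

# Handle missing data: keep tickers with at least 90% of observations, then fill gaps.
# thresh must be an integer count of non-NaN values.
df_prices = df_prices.dropna(axis=1, thresh=int(len(df_prices) * 0.9))
df_prices = df_prices.ffill().bfill()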

[*********************100%***********************]  100 of 100 completed


In [166]:
# Price to Returns
def compute_log_returns(price_df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute daily log returns from a price DataFrame.
    """
    # Work on a copy so the caller's index is not modified in place
    price_df = price_df.copy()
    price_df.index = pd.to_datetime(price_df.index)

    # Log return ln(P_t / P_{t-1}); drop the first row, which shift() leaves as NaN
    returns_df = np.log(price_df / price_df.shift(1)).dropna()

    return returns_df

# Compute returns

df_returns = compute_log_returns(df_prices)

In [167]:
# Load Fama-French daily 3-factor data
ff_factors = pd.read_csv("F-F_Research_Data_Factors_daily.CSV", skiprows=3, index_col=0)

# Drop the first row read after the skipped header lines
ff_factors = ff_factors.drop(ff_factors.index[0])

ff_factors.dropna(inplace=True)
ff_factors.columns = ['MKT', 'SMB', 'HML', 'RF']
# Ensure all factor columns are float. The Ken French data library quotes the factors
# in percent, so divide by 100 to match the decimal stock returns.
ff_factors[["MKT", "SMB", "HML", "RF"]] = ff_factors[["MKT", "SMB", "HML", "RF"]].astype(float) / 100
ff_factors.index = pd.to_datetime(ff_factors.index, format="%Y%m%d")

# Align returns and factors
df_returns_ff = df_returns.join(ff_factors[["MKT", "SMB", "HML", "RF"]], how="inner")

# Subtract RF from stock returns (excess returns only on stock columns)
stock_cols = df_returns.columns
excess_returns = df_returns_ff[stock_cols].sub(df_returns_ff["RF"], axis=0)
factors = df_returns_ff[["MKT", "SMB", "HML"]]

In [168]:
from sklearn.linear_model import LinearRegression

def compute_residuals(excess_returns, factors):
    residuals = pd.DataFrame(index=excess_returns.index, columns=excess_returns.columns)
    model = LinearRegression()

    for ticker in excess_returns.columns:
        y = excess_returns[ticker].dropna()
        if len(y) < 30:
            continue  # skip short series before fitting
        X = factors.loc[y.index]
        model.fit(X, y)
        y_pred = model.predict(X)
        # Residual = excess return not explained by the three factors (and intercept)
        residuals.loc[y.index, ticker] = y - y_pred

    return residuals.astype(float)

residuals = compute_residuals(excess_returns, factors)

### Step 2: Graph Based Clustering

In [169]:
# Use a dictionary, but could later convert to tensor if needed
def rolling_signed_corr(residuals: pd.DataFrame, window: int = 30) -> dict:
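    """
    Compute rolling signed correlation matrices from residual returns.

    Each matrix is keyed by the first date after its window, so the matrix
    stored for date t uses only the `window` observations before t.

    Returns: 
    corr_matrices : dict
        Dictionary mapping date to signed correlation matrix.
    """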
    corr_matrices = {}
    for end in range(window, len(residuals)):
        window_data = residuals.iloc[end - window:end]
        date_key = residuals.index[end]
        corr = window_data.corr()
        corr_matrices[date_key] = corr
    return corr_matrices

# Compute rolling signed correlation matrices
rolling_corrs = rolling_signed_corr(residuals, window=30)

In [170]:
# Now cluster using SPONGEsym Algorithm

# %pip install git+https://github.com/alan-turing-institute/SigNet.git
from signet.cluster import Cluster
from scipy.sparse import csc_matrix

def run_sponge_sym(corr_matrix: pd.DataFrame, k: int = 6) -> dict:
    """
    Applies SPONGE-sym clustering to a signed correlation matrix using SigNet.

    Parameters:
    -----------
    corr_matrix : pd.DataFrame
        Signed correlation matrix (symmetric with values in [-1, 1]).
    k : int
        Number of clusters to generate.

    Returns:
    --------
    cluster_map : dict
        Dictionary mapping tickers to cluster labels.
    """
    W = corr_matrix.fillna(0).values
    Ap = np.where(W > 0, W, 0)
    An = np.where(W < 0, -W, 0)

    Ap_sparse = csc_matrix(Ap)
    An_sparse = csc_matrix(An)

    c = Cluster((Ap_sparse, An_sparse))
    labels = c.SPONGE_sym(k=k, tau_p=1.0, tau_n=1.0)

    return dict(zip(corr_matrix.index, labels))

sponge_cluster_results = {}

for date, corr_matrix in rolling_corrs.items():
    try:
        clusters = run_sponge_sym(corr_matrix, k=6)
        sponge_cluster_results[date] = clusters
    except Exception as e:
        print(f"SPONGE clustering failed on {date}: {e}")

In [127]:
# Tried out other clustering methods, but SPONGE-sym is the most robust for signed graphs

# Build the signed Laplacian Matrices
def construct_signed_laplacian(corr_matrix: pd.DataFrame) -> pd.DataFrame:
    """
    Construct the signed Laplacian matrix from a signed correlation matrix.

    Parameters:
    -----------
    corr_matrix : pd.DataFrame
        Correlation matrix with values in [-1, 1].

    Returns:
    --------
    laplacian : pd.DataFrame
        Signed Laplacian matrix.
    """
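    # Work on an explicit NumPy copy: writing through DataFrame.values is not
    # guaranteed to modify the frame (and fails under pandas copy-on-write)
    W = corr_matrix.fillna(0).to_numpy(copy=True)
    np.fill_diagonal(W, 0)  # zero diagonal (no self-loops)
    D = np.diag(np.sum(np.abs(W), axis=1))
    L = D - W
    return pd.DataFrame(L, index=corr_matrix.index, columns=corr_matrix.columns)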

laplacians = {
    date: construct_signed_laplacian(corr)
    for date, corr in rolling_corrs.items()
}

# Cluster the signed Laplacian matrices using the KMeans variant from
# "Machine Learning for Asset Managers" (López de Prado)

from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def cluster_signed_graph_autoK(
    laplacian: pd.DataFrame, 
    maxNumClusters: int = 10,
    n_init: int = 10
) -> tuple[dict, pd.Series, int]:
    """
    Auto-selects optimal K and clusters stocks using spectral clustering on signed Laplacian.

    Parameters:
    -----------
    laplacian : pd.DataFrame
        Signed Laplacian matrix.
    maxNumClusters : int
        Max number of clusters to try (searches from 2 to maxNumClusters).
    n_init : int
        KMeans initializations for robustness.

    Returns:
    --------
    clusters : dict
        Mapping: ticker -> cluster label
    silh_series : pd.Series
        Silhouette scores for each stock under optimal K
    best_k : int
        Optimal number of clusters selected
    """
    x = laplacian.values
    tickers = laplacian.index

    # Eigendecomposition once
    eigvals, eigvecs = eigh(x)
    scores, best_kmeans, best_k, best_silh = -np.inf, None, None, None

    for k in range(2, maxNumClusters + 1):
        embedding = eigvecs[:, :k]

        for init in range(n_init):
            # Vary the seed; a fixed random_state would repeat the same run n_init times
            kmeans = KMeans(n_clusters=k, n_init=1, random_state=init)
            labels = kmeans.fit_predict(embedding)
            silh = silhouette_samples(embedding, labels)
            t_stat = np.mean(silh) / (np.std(silh) + 1e-6)

            if np.isnan(t_stat):
                continue
            if t_stat > scores:
                scores = t_stat
                best_kmeans = kmeans
                best_k = k
                best_silh = silh

    final_labels = best_kmeans.labels_
    cluster_map = dict(zip(tickers, final_labels))
    silh_series = pd.Series(best_silh, index=tickers)

    return cluster_map, silh_series, best_k

cluster_results = {}

for date, laplacian in laplacians.items():
    try:
        clusters, silh_scores, best_k = cluster_signed_graph_autoK(laplacian)
        cluster_results[date] = {
            "clusters": clusters,
            "silhouette_scores": silh_scores,
            "best_k": best_k
        }
    except Exception as e:
        print(f"Failed to cluster on {date}: {e}")

best_k_series = pd.Series({
    date: result["best_k"]
    for date, result in cluster_results.items()
})

import matplotlib.pyplot as plt

best_k_series.sort_index().plot(marker='o')
plt.title("Optimal Number of Clusters (best_k) Over Time")
plt.xlabel("Date")
plt.ylabel("Best k")
plt.grid(True)
plt.show()


### Step 3: Signal Generation

Every k days (e.g., every 10 days) for each cluster:
- Compute cluster mean return over past 5 days.
- Signal is then Stock’s 5-day return - Cluster’s 5-day mean return
- If signal is significantly:
    - Below cluster average then Long
    - Above cluster average then Short

In [181]:
# Parameters
signal_spacing = 10
lookback_days = 10  # return lookback in trading days (the outline above uses 5)

# If a stock outperformed its cluster its signal is positive
# If a stock underperformed its cluster its signal is negative

# Generate signal every `signal_spacing` days
signal_dates = sorted(sponge_cluster_results.keys())[::signal_spacing]
signals = []

for date in signal_dates:
    if date not in df_returns.index:
        continue

    try:
        date_idx = df_returns.index.get_loc(date)
    except KeyError:
        continue

    window_start = date_idx - lookback_days
    if window_start < 0:
        continue  # skip if not enough data

    past_window = df_returns.iloc[window_start:date_idx]
    returns_past = past_window.sum()

    # Cluster assignments from SPONGE
    clusters = sponge_cluster_results[date]
    cluster_series = pd.Series(clusters)

    for cluster_id in set(cluster_series.values):
        members = cluster_series[cluster_series == cluster_id].index

        if len(members) < 2:
            continue

        cluster_mean = returns_past[members].mean()

        for stock in members:
            signal_value = returns_past[stock] - cluster_mean

            signals.append({
                "date": date,
                "stock": stock,
                "cluster": cluster_id,
                "signal": signal_value,
                "cluster_mean": cluster_mean,
                "stock_return": returns_past[stock]
            })

# Create signal DataFrame
signals_df = pd.DataFrame(signals)
signals_df.sort_values(["date", "signal"], ascending=[True, False], inplace=True)
signals_df.reset_index(drop=True, inplace=True)

# Quantile-based signal thresholds
lower, upper = signals_df["signal"].quantile([0.10, 0.90])

# Assign positions
signals_df["position"] = 0
signals_df.loc[signals_df["signal"] <= lower, "position"] = 1    # Long
signals_df.loc[signals_df["signal"] >= upper, "position"] = -1   # Short

In [198]:
signals_df.head(20)

| | date | stock | cluster | signal | cluster_mean | stock_return | position | check_signal |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-08-17 | LLY | 0 | 0.196621 | -0.010575 | 0.186046 | -1 | 0.196621 |
| 1 | 2023-08-17 | AMGN | 2 | 0.15484 | -0.014895 | 0.139945 | -1 | 0.15484 |
| 2 | 2023-08-17 | BKNG | 2 | 0.117094 | -0.014895 | 0.1022 | -1 | 0.117094 |
| 3 | 2023-08-17 | REGN | 1 | 0.099428 | -0.014091 | 0.085337 | -1 | 0.099428 |
| 4 | 2023-08-17 | PGR | 1 | 0.096585 | -0.014091 | 0.082494 | -1 | 0.096585 |
| 5 | 2023-08-17 | AMZN | 3 | 0.083149 | -0.031025 | 0.052124 | -1 | 0.083149 |
| 6 | 2023-08-17 | CMCSA | 4 | 0.056219 | -0.028874 | 0.027345 | -1 | 0.056219 |
| 7 | 2023-08-17 | GILD | 2 | 0.04509 | -0.014895 | 0.030195 | 0 | 0.04509 |
| 8 | 2023-08-17 | GE | 4 | 0.043063 | -0.028874 | 0.014188 | 0 | 0.043063 |
| 9 | 2023-08-17 | MRK | 1 | 0.042354 | -0.014091 | 0.028263 | 0 | 0.042354 |


### Step 4: Feature Engineering for Signal Quality
- For each signal generated extract features:
- Graph-based:
    - Local/global vertex degree
    - Cluster size
    - Graph density
- Price-based:
    - Signal value
    - Sign of deviation
    - Stock & cluster average returns over past 10 days

In [182]:
# Adding features to train ML on

# Ensure returns are sorted by date
df_returns = df_returns.sort_index()

# Step 4: Enrich signals with features
feature_rows = []

for idx, row in signals_df.iterrows():
    date = row['date']
    stock = row['stock']
    cluster_id = row['cluster']

    # Retrieve cluster members
    cluster_members = [s for s, c in sponge_cluster_results[date].items() if c == cluster_id]
    if len(cluster_members) < 2 or stock not in cluster_members:
        continue

    # Retrieve correlation matrix for that date
    corr_matrix = rolling_corrs.get(date)
    if corr_matrix is None or stock not in corr_matrix.index:
        continue

    # 1. Local Degree: sum of abs correlations to others
    local_degree = corr_matrix.loc[stock].drop(stock).abs().sum()

    # 2. Cluster Density: average abs correlation between all cluster members
    sub_corr = corr_matrix.loc[cluster_members, cluster_members]
    tri_mask = np.triu(np.ones(sub_corr.shape), k=1).astype(bool)
    flat_vals = sub_corr.values[tri_mask]
    try:
        density = np.nanmean(np.abs(flat_vals))  # alt: density = (2 * np.count_nonzero(flat_vals)) / (n * (n - 1))
    except Exception:
        continue

    # 3. Cluster Size
    clust_size = len(cluster_members)

    # 4. Price-Based Features: stock & cluster return over past 10 days
    try:
        date_idx = df_returns.index.get_loc(date)
        window_start = date_idx - 10
        if window_start < 0:
            continue

        past_window = df_returns.iloc[window_start:date_idx]
        stock_10d_return = past_window[stock].sum()
        cluster_10d_return = past_window[cluster_members].mean(axis=1).sum()
    except Exception as e:
        print(f"Feature extraction failed for {date} - {stock}: {e}")
        continue

    # Append enriched signal
    feature_rows.append({
        **row,
        "local_degree": local_degree,
        "cluster_density": density,
        "cluster_size": clust_size,
        "signal_sign": np.sign(row["signal"]),
        "abs_signal": np.abs(row["signal"]),
        "stock_10d_return": stock_10d_return,
        "cluster_10d_return": cluster_10d_return
    })

# Build final DataFrame
features_df = pd.DataFrame(feature_rows)

In [183]:
features_df.head()

| | date | stock | cluster | signal | cluster_mean | stock_return | position | local_degree | cluster_density | cluster_size | signal_sign | abs_signal | stock_10d_return | cluster_10d_return |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-08-17 | LLY | 0 | 0.196621 | -0.010575 | 0.186046 | -1 | 12.852767 | 0.207717 | 14 | 1.0 | 0.196621 | 0.186046 | -0.010575 |
| 1 | 2023-08-17 | AMGN | 2 | 0.15484 | -0.014895 | 0.139945 | -1 | 13.678793 | 0.286514 | 12 | 1.0 | 0.15484 | 0.139945 | -0.014895 |
| 2 | 2023-08-17 | BKNG | 2 | 0.117094 | -0.014895 | 0.1022 | -1 | 13.260046 | 0.286514 | 12 | 1.0 | 0.117094 | 0.1022 | -0.014895 |
| 3 | 2023-08-17 | REGN | 1 | 0.099428 | -0.014091 | 0.085337 | -1 | 16.656741 | 0.252425 | 45 | 1.0 | 0.099428 | 0.085337 | -0.014091 |
| 4 | 2023-08-17 | PGR | 1 | 0.096585 | -0.014091 | 0.082494 | -1 | 16.386903 | 0.252425 | 45 | 1.0 | 0.096585 | 0.082494 | -0.014091 |


### Step 5: Label Signals
- Label based on profitability 
- This becomes a target for ML Classification

In [235]:
#Idea 1 - Label the signals based on future return threshold

labelled_rows = []

profit_threshold = 0.03  # 3% return threshold over the holding period
holding_period = 10       # in trading days

for _, row in features_df.iterrows():
    date = row['date']
    stock = row['stock']
    position = row['position']  # -1 for short, 1 for long, 0 for neutral

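    try:
        date_idx = df_returns.index.get_loc(date)
        forward_window = df_returns.iloc[date_idx + 1: date_idx + 1 + holding_period]

        # iloc slicing truncates silently, so check the window length explicitly
        if len(forward_window) < holding_period:
            continue  # not enough forward data, skip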

        # Cumulative log return → convert to standard return
        log_ret_sum = forward_window[stock].sum()
        future_return = np.exp(log_ret_sum) - 1

        pnl = future_return * position
        label = 1 if pnl > profit_threshold else 0

        labelled_rows.append({
            **row,
            "log_ret_sum": log_ret_sum,
            "future_return": future_return,
            "pnl": pnl,
            "label": label
        })

    except (KeyError, IndexError):
        continue  # date or stock missing from df_returns, skip

labelled_df = pd.DataFrame(labelled_rows)


### Step 6: Machine Learning Signal Classifier
Train multiple classifiers on the labeled dataset:

- Logistic Regression
- Gradient Boosted Trees
- Neural Net (MLP)
- SGD Classifier
- AdaBoost

Ensemble (soft voting) using classifier probabilities.

In [236]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features to use
feature_cols = [
    'signal', 'abs_signal', 'signal_sign',
    'local_degree', 'cluster_density', 'cluster_size',
    'stock_10d_return', 'cluster_10d_return'
]

X = labelled_df[feature_cols]
y = labelled_df['label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

In [237]:
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Define base models
models = {
    "LogReg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "GBM": GradientBoostingClassifier(),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(50, 20), max_iter=500)),
    "SGD": make_pipeline(StandardScaler(), SGDClassifier(loss='log_loss', max_iter=1000)),
    "AdaBoost": AdaBoostClassifier()
}

# Fit all base models
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)


Training LogReg...
Training GBM...
Training MLP...
Training SGD...
Training AdaBoost...


In [238]:
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import pandas as pd
results = {}

for name, model in models.items():
    print(f"\nEvaluating {name}...")
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

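    try:
        # Binary target: roc_auc_score expects a 1-D array of positive-class
        # probabilities; the full (n_samples, 2) output raises "y should be a 1d array".
        y_proba = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_proba)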
    except Exception as e:
        print(f"AUC error for {name}: {e}")
        auc = None

    report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)

    results[name] = {
        "AUC": auc,
        "Report": report
    }




Evaluating LogReg...
AUC error for LogReg: y should be a 1d array, got an array of shape (1290, 2) instead.

Evaluating GBM...
AUC error for GBM: y should be a 1d array, got an array of shape (1290, 2) instead.

Evaluating MLP...
AUC error for MLP: y should be a 1d array, got an array of shape (1290, 2) instead.

Evaluating SGD...
AUC error for SGD: y should be a 1d array, got an array of shape (1290, 2) instead.

Evaluating AdaBoost...
AUC error for AdaBoost: y should be a 1d array, got an array of shape (1290, 2) instead.


In [239]:
# Display per-model summaries
for name, metrics in results.items():
    print(f"\nModel: {name}")
    print(f"AUC: {metrics['AUC']:.4f}" if metrics['AUC'] is not None else "AUC: Not available")
    print("Classification Report:")
    print(classification_report(y_test, models[name].predict(X_test)))


Model: LogReg
AUC: Not available
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1218
           1       0.28      0.07      0.11        72

    accuracy                           0.94      1290
   macro avg       0.61      0.53      0.54      1290
weighted avg       0.91      0.94      0.92      1290


Model: GBM
AUC: Not available
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1218
           1       0.48      0.14      0.22        72

    accuracy                           0.94      1290
   macro avg       0.71      0.56      0.59      1290
weighted avg       0.92      0.94      0.93      1290


Model: MLP
AUC: Not available
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.97      1218
           1       0.37      0.18      0.24        72

    accuracy     

In [240]:
# Ensemble

voting_clf = VotingClassifier(
    estimators=[(name, model) for name, model in models.items()],
    voting='soft'
)
voting_clf.fit(X_train, y_train)

y_pred = voting_clf.predict(X_test)
print("Classification Report (Soft Voting Ensemble):\n")
print(classification_report(y_test, y_pred, zero_division=0))

Classification Report (Soft Voting Ensemble):

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      1218
           1       0.50      0.07      0.12        72

    accuracy                           0.94      1290
   macro avg       0.72      0.53      0.55      1290
weighted avg       0.92      0.94      0.92      1290



### Enhanced Methodology using Advances in Financial Machine Learning (AFML) by Marcos López de Prado

---

### 1. Data Collection (AFML Chapter 2)
- **Universe**: Historical constituents of the S&P 500.
- **Timeframe**: 2000–2022.
- **Data Sampling**: Replace fixed daily bars with **Dollar Imbalance Bars** to reflect information flow more accurately and reduce sampling bias.  
  *Added from AFML Chapter 2: Sampling Bars.*

- **Returns**: Calculate **fractionally differenced returns** to achieve stationarity while preserving memory.  
  *Added from AFML Chapter 5: Fractional Differencing.*
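
Fractional differencing can be sketched with a fixed-width window, assuming the weights of the binomial expansion of `(1 - B)^d`. The function names and the `window` length are illustrative choices. AFML applies the transform to (log) price series rather than to returns, so the input series here is left to the caller.

```python
import numpy as np
import pandas as pd

def fracdiff_weights(d: float, size: int) -> np.ndarray:
    """Weights of (1 - B)^d: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def fracdiff(series: pd.Series, d: float, window: int = 20) -> pd.Series:
    """Fixed-width-window fractional difference; the first window-1 values are NaN."""
    w = fracdiff_weights(d, window)[::-1]  # reverse so the oldest observation comes first
    values = series.to_numpy(dtype=float)
    out = np.full(len(values), np.nan)
    for t in range(window - 1, len(values)):
        out[t] = np.dot(w, values[t - window + 1 : t + 1])
    return pd.Series(out, index=series.index)

# With d = 1 the weights collapse to [1, -1, 0, ...], i.e. an ordinary first difference
demo = fracdiff(pd.Series([1.0, 3.0, 6.0, 10.0]), d=1.0, window=2)
```

Values of `d` between 0 and 1 trade off stationarity against memory; a common approach is to pick the smallest `d` that passes a unit-root test.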

---

### 2. Graph-Based Clustering (AFML Appendix A)
- **Residual Estimation**: Run Fama-French 3-Factor model to extract residual returns.
- **Dependency Estimation**: Replace Pearson correlation with **information-adjusted or shrinkage estimators** (e.g., Ledoit-Wolf).  
  *AFML Appendix A: More robust covariance estimation.*

- **Graph Construction**: Treat the (residual) correlation matrix as an adjacency matrix of a signed, weighted graph.
- **Clustering**: Use **SPONGE-sym** or a robust spectral clustering method to identify clusters of tightly connected stocks.
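
One way to obtain a shrinkage-based correlation matrix for a single rolling window is scikit-learn's `LedoitWolf` estimator. This is a sketch under the assumption that the window contains no missing values; `shrunk_corr` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.covariance import LedoitWolf

def shrunk_corr(window_resid: pd.DataFrame) -> pd.DataFrame:
    """Ledoit-Wolf shrunk covariance of one window, rescaled to a correlation matrix."""
    cov = LedoitWolf().fit(window_resid.to_numpy()).covariance_
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    np.fill_diagonal(corr, 1.0)  # remove rounding noise on the diagonal
    return pd.DataFrame(corr, index=window_resid.columns, columns=window_resid.columns)

# Synthetic 30-day window of residuals for five hypothetical tickers
rng = np.random.default_rng(1)
demo_resid = pd.DataFrame(rng.normal(size=(30, 5)), columns=list("ABCDE"))
demo_corr = shrunk_corr(demo_resid)
```

The result stays signed and symmetric, so it can feed the same adjacency-matrix construction as the Pearson version.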

---

### 3. Signal Generation
- **Rebalance Frequency**: Every *k* days (e.g., every 10 days).
- **Cluster Signal**: For each stock in a cluster:
  - Compute 5-day return deviation from the cluster’s 5-day mean.
  - **Long** if significantly below mean; **Short** if significantly above.
- **Preprocessing**: Apply **fractional differencing** to returns before signal construction for improved stationarity.  
  *AFML Chapter 5: Retaining memory while avoiding spurious relationships.*

---

### 4. Feature Engineering for Signal Quality (AFML Chapters 5 & 9)
- **Graph-based Features**:
  - Local/global vertex degree
  - Cluster size
  - Graph density
  - Cluster count relative to total nodes

- **Price-based Features**:
  - Deviation from cluster mean
  - Sign of deviation
  - Stock & cluster momentum (last 10 days)
  - **Entropy of cluster returns** and **volatility of degree distribution**  
    *Inspired by AFML’s emphasis on feature richness and signal structure (entropy features: Ch. 18).*

---

### 5. Labeling Signals (AFML Chapter 3)
- Replace fixed return threshold with the **Triple Barrier Method**:
  - **Upper barrier**: Profit target
  - **Lower barrier**: Stop-loss
  - **Vertical barrier**: Time limit (maximum holding period)
- Each label reflects whether the upper, lower, or time barrier was breached first.
  *AFML Chapter 3: More objective, robust label generation.*
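
A simplified triple-barrier labeler for a single trade might look like the sketch below. It uses fixed symmetric barriers for clarity, whereas AFML scales the barriers by a volatility estimate; the parameter names and default values are assumptions.

```python
import pandas as pd

def triple_barrier_label(prices: pd.Series, entry: int, side: int,
                         pt: float = 0.03, sl: float = 0.03, max_hold: int = 10) -> int:
    """Label a trade by whichever barrier is touched first.

    Returns +1 if the profit target is hit first, -1 if the stop-loss is hit
    first, and 0 if the vertical (time) barrier expires before either.
    `side` is +1 for a long position and -1 for a short position.
    """
    p0 = prices.iloc[entry]
    path = prices.iloc[entry + 1 : entry + 1 + max_hold]
    for p in path:
        ret = side * (p / p0 - 1.0)  # position-adjusted return since entry
        if ret >= pt:
            return 1
        if ret <= -sl:
            return -1
    return 0

demo_prices = pd.Series([100.0, 101.0, 104.0, 95.0])
long_label = triple_barrier_label(demo_prices, entry=0, side=1)    # +4% on day 2
short_label = triple_barrier_label(demo_prices, entry=0, side=-1)  # -4% on day 2
```

Unlike a fixed-horizon return label, this labeling depends on the path taken, so a trade that would have been stopped out is not rewarded for a later recovery.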

---

### 6. Machine Learning Classifier (AFML Chapters 3 & 7–9)
- **Models**: Train multiple classifiers including:
  - Logistic Regression
  - Gradient Boosted Trees
  - MLP
  - AdaBoost
  - SGD Classifier

- **Meta-Labeling**: Use the base signal as a feature and train classifiers to predict *whether to act on it*.  
  *AFML Chapter 3: Meta-Labeling.*

- **Model Selection**: Use **Nested Cross-Validation** to tune hyperparameters while preventing leakage.  
  *AFML Chapters 7 & 9.*

- **Feature Importance**: Analyze with both **Mean Decrease Accuracy (MDA)** and **Mean Decrease Impurity (MDI)** to ensure feature stability.  
  *AFML Chapter 8.*
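
The two importance measures can be approximated with scikit-learn, as in the sketch below. It swaps in a random forest (MDI needs a tree ensemble) and uses `permutation_importance` on a plain hold-out set as a stand-in for MDA, rather than the purged cross-validation AFML uses. The synthetic data and the helper name `importance_table` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def importance_table(X_train, y_train, X_test, y_test, seed: int = 0) -> pd.DataFrame:
    """MDI from the fitted forest and a permutation-based MDA on held-out data."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_train, y_train)
    mdi = pd.Series(clf.feature_importances_, index=X_train.columns)
    perm = permutation_importance(
        clf, X_test, y_test, n_repeats=10, random_state=seed, scoring="neg_log_loss"
    )
    mda = pd.Series(perm.importances_mean, index=X_train.columns)
    return pd.DataFrame({"MDI": mdi, "MDA": mda})

# Synthetic check: one feature drives the label, the other is pure noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"informative": rng.normal(size=400), "noise": rng.normal(size=400)})
y_demo = (X_demo["informative"] > 0).astype(int)
table = importance_table(X_demo.iloc[:300], y_demo.iloc[:300],
                         X_demo.iloc[300:], y_demo.iloc[300:])
```

A feature that ranks high on MDI but near zero on MDA is a candidate for overfitting, since MDI is computed in-sample.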

---

### 7. Trade Filtering (AFML Chapter 3)
- Use ensemble confidence from the meta-labeler.
- Only act on signals with **predicted probability above the 90th percentile**, based on out-of-sample distribution.
- **Probabilistic filtering** replaces static 0.6 threshold.  
  *AFML Chapter 3: Meta-Labeling.*
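
The percentile rule can be sketched as follows, assuming `oos_proba` holds the ensemble's out-of-sample probabilities from earlier folds and `new_proba` holds the probabilities for the current signals. Both names are hypothetical.

```python
import numpy as np

def percentile_filter(oos_proba, new_proba, q: float = 0.90):
    """Return a boolean mask of signals above the q-th quantile of past OOS probabilities."""
    cutoff = np.quantile(np.asarray(oos_proba, dtype=float), q)
    mask = np.asarray(new_proba, dtype=float) > cutoff
    return mask, cutoff

# Synthetic example: past probabilities spread evenly over [0, 1]
mask, cutoff = percentile_filter(np.linspace(0.0, 1.0, 101), [0.50, 0.95])
```

Because the cutoff is taken from past out-of-sample predictions only, it adapts to how confident the ensemble tends to be without peeking at the signals being filtered.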

---

### 8. Bet Sizing (AFML Chapter 10)
- Replace classic Kelly with **bet sizing based on predicted probability**:
  - `size = 2p - 1` where `p` is model confidence
- Optionally implement **Dynamic Programming-based Bet Sizing** for optimal growth with risk constraints.  
  *AFML Chapter 10.*
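
The linear rule `2p - 1` coincides with the Kelly fraction for an even-payoff bet. AFML Chapter 10 also describes a size derived from a z-statistic of the predicted probability, `m = 2Φ(z) - 1` with `z = (p - 1/2) / sqrt(p(1 - p))`. The sketch below computes both for comparison; the clipping constant is an assumption added to avoid division by zero.

```python
import numpy as np
from scipy.stats import norm

def bet_size_linear(p, side: int = 1):
    """Linear size 2p - 1, signed by the trade direction."""
    return side * (2.0 * np.asarray(p, dtype=float) - 1.0)

def bet_size_zscore(p, side: int = 1):
    """Size 2 * Phi(z) - 1 with z = (p - 0.5) / sqrt(p * (1 - p))."""
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1.0 - 1e-6)
    z = (p - 0.5) / np.sqrt(p * (1.0 - p))
    return side * (2.0 * norm.cdf(z) - 1.0)

probs = np.array([0.5, 0.6, 0.7, 0.9])
linear_sizes = bet_size_linear(probs)
zscore_sizes = bet_size_zscore(probs)
```

Both rules give zero size at `p = 0.5` and grow with confidence; they differ in how quickly the size approaches a full position.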

---

### 9. Risk Management (AFML Chapter 3)
- Use **Triple Barrier parameters** (profit-taking, stop-loss, time constraint) for exit strategy.
- Thresholds decay over time or adapt to **volatility regime**.
- Optional: dynamically tune barriers using recent volatility or entropy.

---

### 10. Performance Evaluation (AFML Chapters 11 & 14)
- Metrics:
  - Annualized Return (CAGR)
  - Sharpe Ratio
  - Sortino Ratio
  - Max Drawdown
  - Calmar Ratio
  - Modified Information Ratios (IR*, IR**)

- **Robustness Checks**:
  - Use **Combinatorially Symmetric Cross-Validation (CSCV)** to assess overfitting and compute **Probability of Backtest Overfitting (PBO)**.  
    *AFML Chapter 11.*

- **Hypothesis Testing**:
  - Apply **White’s Reality Check** or **bootstrapped performance testing** for statistical validity.
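
The standard metrics in the list above can be computed from a daily strategy return series as sketched below. The sketch assumes simple (not log) returns, a zero risk-free rate, and 252 trading days per year. The paper-specific IR* and IR** are omitted because their definitions are not given here.

```python
import numpy as np
import pandas as pd

def performance_summary(daily_returns: pd.Series, periods: int = 252) -> dict:
    """Annualized return, Sharpe, Sortino, max drawdown and Calmar from daily returns."""
    r = daily_returns.dropna().to_numpy(dtype=float)
    # Equity curve starting from 1.0 so a loss on the first day counts as drawdown
    equity = np.concatenate([[1.0], np.cumprod(1.0 + r)])
    cagr = equity[-1] ** (periods / len(r)) - 1.0
    ann_mean = r.mean() * periods
    vol = r.std(ddof=1) * np.sqrt(periods)
    # Downside deviation: root mean square of negative returns only
    downside = np.sqrt(np.mean(np.minimum(r, 0.0) ** 2)) * np.sqrt(periods)
    max_dd = (equity / np.maximum.accumulate(equity) - 1.0).min()
    return {
        "cagr": cagr,
        "sharpe": ann_mean / vol if vol > 0 else np.nan,
        "sortino": ann_mean / downside if downside > 0 else np.nan,
        "max_drawdown": max_dd,
        "calmar": cagr / abs(max_dd) if max_dd < 0 else np.nan,
    }

# Tiny synthetic series: +10%, -50%, +20%
summary = performance_summary(pd.Series([0.10, -0.50, 0.20]))
```

In a backtest the input would be the net-of-cost daily portfolio return, with the 0.05% transaction cost already deducted.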

---
