## Performance Evaluation Dashboard
This section summarizes overall model performance, comparing cumulative returns, Sharpe ratios, annual returns, and maximum drawdowns. It provides a high-level view of model effectiveness in return forecasting and risk-adjusted performance.
#### Clean Baseline Strategy Evaluation 
- LSTM
- GRU
- CNN-LSTM
- ATT-LSTM
- Transformer
  
#### Model Signal Validation & Benchmark Comparison using VectorBT
**Purpose:** Validate directional signal performance with realistic execution logic using SPY adjusted close prices, with no fees or other trading constraints.



## 1. Environment & Imports

In [None]:
# === Step 1: Imports ===
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import vectorbt as vbt

## 2. Load Model Predictions & SPY Benchmark

In [None]:
# Load SPY
df_spy = pd.read_csv("../data/GSPC_fixed.csv")
df_spy['Date'] = pd.to_datetime(df_spy['Date'])
df_spy.set_index('Date', inplace=True)
df_spy = df_spy[['Adjusted_close']].rename(columns={'Adjusted_close': 'SPY'})

# Load predictions
df_lstm = pd.read_csv('../data/df_lstm.csv')
df_gru = pd.read_csv('../data/df_gru.csv')
df_cnn = pd.read_csv('../data/df_cnn.csv')
df_att = pd.read_csv('../data/df_att.csv')
df_trans = pd.read_csv('../data/df_trans.csv')

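# Assign datetime index
# NOTE: freq='B' counts every weekday, including exchange holidays, so this
# assumes the prediction rows line up with business days from 2018-12-28.
date_index = pd.date_range(start='2018-12-28', periods=len(df_lstm), freq='B')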
for df in [df_lstm, df_gru, df_cnn, df_att, df_trans]:
    df['date'] = date_index
    df.set_index('date', inplace=True)
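A business-day range includes weekdays on which the exchange is closed, so dates assigned that way can drift away from real trading sessions. The sketch below shows one alternative, assuming each prediction row corresponds to consecutive trading sessions starting at a known date. The helper name `trading_dates_from_prices` and the toy calendar are hypothetical and not part of the notebook.

```python
import pandas as pd

def trading_dates_from_prices(price_index, start, n_rows):
    """Return the first n_rows dates of price_index on or after start (hypothetical helper)."""
    dates = price_index[price_index >= pd.Timestamp(start)]
    if len(dates) < n_rows:
        raise ValueError("price history is shorter than the prediction file")
    return dates[:n_rows]

# Toy price calendar: weekdays with one holiday (2019-01-01) removed
calendar = pd.bdate_range("2018-12-24", "2019-01-10")
calendar = calendar[calendar != pd.Timestamp("2019-01-01")]

# Dates taken from the price calendar skip the holiday
aligned = trading_dates_from_prices(calendar, "2018-12-28", 5)

# A plain business-day range does not
naive = pd.date_range(start="2018-12-28", periods=5, freq="B")

print(list(aligned.strftime("%Y-%m-%d")))
print(list(naive.strftime("%Y-%m-%d")))
```

In the notebook, the price calendar would be `df_spy.index`, provided the prediction files really do start on 2018-12-28 and contain one row per session.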


## 3. Define Strategy Execution Function

In [None]:
def run_strategy(pred_df, price_series, threshold=0.5):
    df = pred_df.join(price_series, how='inner')
    df['log_returns'] = np.log(df['SPY'] / df['SPY'].shift(1))
    df['signal'] = (df['predictions'] > threshold).astype(int)
    df['strategy_returns'] = df['log_returns'] * df['signal']
    # Log returns are additive, so compound them with exp(cumsum) rather than (1 + r).cumprod()
    df['cum_manual'] = np.exp(df['strategy_returns'].fillna(0).cumsum())

    pf = vbt.Portfolio.from_signals(
        close=df['SPY'],
        entries=df['signal'] == 1,
        exits=df['signal'] == 0,
        freq='1D',
        init_cash=100,
        fees=0.0
    )
    return df, pf
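Log returns add up over time, while simple returns multiply, so each needs its own compounding rule. The sketch below uses made-up prices (not the notebook's data) to show that both rules recover the true growth factor, and that pushing log returns through the simple-return rule understates it.

```python
import numpy as np
import pandas as pd

# Toy price path, for illustration only
prices = pd.Series([100.0, 110.0, 99.0, 108.9])

log_ret = np.log(prices / prices.shift(1)).fillna(0.0)
simple_ret = prices.pct_change().fillna(0.0)

# Correct compounding for each return type
growth_from_log = np.exp(log_ret.cumsum())
growth_from_simple = (1.0 + simple_ret).cumprod()

# Mixing the two: log returns fed into the simple-return rule
mixed = (1.0 + log_ret).cumprod()

print(growth_from_log.iloc[-1], growth_from_simple.iloc[-1], mixed.iloc[-1])
```

The gap between the mixed curve and the true growth factor widens as the return series gets longer and more volatile.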


## 4. Run Backtest for All Models (Clean Execution Logic)

In [None]:
model_dfs = {
    'LSTM': df_lstm,
    'GRU': df_gru,
    'CNN-LSTM': df_cnn,
    'ATT-LSTM': df_att,
    'Transformer': df_trans
}

results = {}

for name, model_df in model_dfs.items():
    merged_df, portfolio = run_strategy(model_df, df_spy['SPY'])
    results[name] = {'df': merged_df, 'portfolio': portfolio}


## 5. Construct SPY Buy & Hold Benchmark Portfolio

In [None]:
end_date = df_lstm.index[-1]
spy_prices = df_spy['SPY'].loc['2018-12-28':end_date]

# === Align SPY Portfolio to same time range as model predictions ===
spy_pf = vbt.Portfolio.from_signals(
    close=spy_prices,  # <-- now using clipped series
    entries=pd.Series(True, index=spy_prices.index),
    exits=pd.Series(False, index=spy_prices.index),
    freq='1D',
    init_cash=100,
    fees=0.0
)

## 6. Plot: Portfolio Value Over Time (Base 100)
This plot compares the growth of a $100 investment using each model’s predicted signals, versus the SPY benchmark. It helps assess overall return effectiveness.

In [None]:
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

plt.figure(figsize=(12, 6))

# Store plot lines for custom legend
lines = []

# Plot SPY Buy & Hold normalized
line_spy, = plt.plot(
    spy_prices.index,
    (spy_prices / spy_prices.iloc[0]) * 100,
    label='SPY Buy & Hold',
    linewidth=2
)
lines.append(line_spy)

# Plot model portfolios
for name, res in results.items():
    line, = plt.plot(
        res['portfolio'].value(),
        label=f'{name} (Vectorbt)',
        alpha=0.9
    )
    lines.append(line)

# Title and labels
plt.title("Cumulative Returns: AI Trading Models", fontsize=14, fontweight='bold')
plt.xlabel("Date", fontsize=14, fontweight='bold')
plt.ylabel("Portfolio Value (Base 100)", fontsize=14, fontweight='bold')

# Create custom legend with thicker lines
custom_lines = [
    Line2D([0], [0], color=line.get_color(), lw=3) for line in lines
]
labels = [line.get_label() for line in lines]
legend = plt.legend(custom_lines, labels, fontsize=12)
for text in legend.get_texts():
    text.set_fontweight('bold')

# Axis ticks
plt.xticks(fontsize=12, fontweight='bold')
plt.yticks(fontsize=12, fontweight='bold')

plt.grid(True)
plt.tight_layout()

# Save
plt.savefig("1_clean_strategy_comparison.png", dpi=150, bbox_inches='tight')
plt.show()


## 7. Summary: Final Portfolio Values

In [None]:
print("SPY Buy & Hold Final Value:", spy_pf.value().iloc[-1])
for name, res in results.items():
    print(f"{name} Final Value:", res['portfolio'].value().iloc[-1])


## 8. Performance Metrics Table

In [None]:
# === Step 8: Compile Performance Metrics ===

performance_stats = pd.DataFrame()

for name, res in results.items():
    pf = res['portfolio']
    stats = {
        'Final Value': pf.value().iloc[-1],
        'Total Return [%]': pf.total_return() * 100,
        'Annual Return [%]': pf.annualized_return() * 100,
        'Volatility [%]': pf.annualized_volatility() * 100,
        'Sharpe Ratio': pf.sharpe_ratio(),
        'Max Drawdown [%]': pf.max_drawdown() * 100
    }
    performance_stats[name] = pd.Series(stats)

# === SPY Buy & Hold ===
# Use the benchmark portfolio built in Step 5 so every row of the table is
# computed with the same vectorbt methodology (compounding and annualization).
spy_stats = {
    'Final Value': spy_pf.value().iloc[-1],
    'Total Return [%]': spy_pf.total_return() * 100,
    'Annual Return [%]': spy_pf.annualized_return() * 100,
    'Volatility [%]': spy_pf.annualized_volatility() * 100,
    'Sharpe Ratio': spy_pf.sharpe_ratio(),
    'Max Drawdown [%]': spy_pf.max_drawdown() * 100
}

performance_stats['SPY (Benchmark)'] = pd.Series(spy_stats)

# === Final formatting ===
performance_stats = performance_stats.T.round(2)
performance_stats = performance_stats.sort_values(by='Sharpe Ratio', ascending=False)

# Display
display(performance_stats)

# Export to CSV
performance_stats.to_csv("model_comparison_summary.csv")
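For reference, the table's metrics can also be derived by hand from a portfolio value series. This is a minimal sketch, assuming 252 periods per year and a zero risk-free rate. The function name `summarize` and the toy series are hypothetical, and vectorbt's figures may differ depending on the annualization settings it uses.

```python
import numpy as np
import pandas as pd

def summarize(value, periods_per_year=252):
    """Summary statistics for a portfolio value series with one row per period."""
    returns = value.pct_change().dropna()
    total_return = value.iloc[-1] / value.iloc[0] - 1.0
    years = len(returns) / periods_per_year
    # Geometric annualization of the total return
    annual_return = (1.0 + total_return) ** (1.0 / years) - 1.0
    volatility = returns.std() * np.sqrt(periods_per_year)
    sharpe = returns.mean() / returns.std() * np.sqrt(periods_per_year)
    # Worst peak-to-trough decline
    max_drawdown = (value / value.cummax() - 1.0).min()
    return pd.Series({
        "Final Value": value.iloc[-1],
        "Total Return [%]": total_return * 100,
        "Annual Return [%]": annual_return * 100,
        "Volatility [%]": volatility * 100,
        "Sharpe Ratio": sharpe,
        "Max Drawdown [%]": max_drawdown * 100,
    })

# Toy value path, for illustration only
value = pd.Series([100.0, 110.0, 99.0, 108.9])
stats = summarize(value)
print(stats.round(2))
```

Computing every row of a comparison table with one function (or one library) keeps the annualization convention identical across strategies and the benchmark.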


## 9. Visualize Key Metrics: Annual Return, Sharpe, Drawdown

In [None]:
# === Step 9: Bar Chart Comparison of Key Metrics ===

metrics_to_plot = ['Annual Return [%]', 'Sharpe Ratio', 'Max Drawdown [%]']

# Bar plot
ax = performance_stats[metrics_to_plot].plot(
    kind='bar',
    figsize=(12, 6),
    title='Model Comparison: Annual Return, Sharpe, Max Drawdown'
)

# Bold fonts
ax.set_title('Model Comparison: Annual Return, Sharpe, Max Drawdown', fontsize=14, fontweight='bold')
ax.set_ylabel('Metric Value', fontsize=14, fontweight='bold')
ax.set_xlabel('', fontsize=14, fontweight='bold')

# Bold x-ticks
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize=12, fontweight='bold')

# Bold legend
legend = ax.legend()
for text in legend.get_texts():
    text.set_fontweight('bold')

# Grid and save
plt.grid(True)
plt.tight_layout()
plt.savefig("2_bar_comparison_metrics.png", dpi=150, bbox_inches='tight')
plt.show()


## 10. Rolling Sharpe Ratio (3-Month)
This gives a time-dynamic view of how stable each model’s risk-adjusted performance is.

Why it matters:
- Models may spike in Sharpe but lose consistency.

- You’ll spot regime shifts, overfitting risks, or persistent alpha.

In [None]:
# === Step 10: Compute Rolling Sharpe Ratios (3-month = 63 trading days) ===

rolling_sharpes = pd.DataFrame()

for name, res in results.items():
    returns = res['portfolio'].returns()
    # Per-period ratio (not annualized): rolling mean / rolling std of daily returns
    rolling = returns.rolling(window=63).mean() / returns.rolling(window=63).std()
    rolling_sharpes[name] = rolling

# === Multi-panel plot ===
fig, axs = plt.subplots(nrows=len(rolling_sharpes.columns), ncols=1, figsize=(12, 3 * len(rolling_sharpes.columns)), sharex=True)

for i, col in enumerate(rolling_sharpes.columns):
    axs[i].plot(rolling_sharpes.index, rolling_sharpes[col], label=col)
    axs[i].set_title(f'Rolling Sharpe Ratio (3-Month): {col}', fontsize=12)
    axs[i].set_ylabel('Sharpe Ratio')
    axs[i].grid(True)
    axs[i].legend(loc='upper left')

plt.xlabel('Date')
plt.tight_layout()
plt.savefig("3_rolling_sharpe_all_models.png", dpi=150, bbox_inches='tight')
plt.show()
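A rolling Sharpe ratio is often quoted on an annualized scale. The sketch below shows one way to do that, assuming 252 trading periods per year and a zero risk-free rate. The helper `rolling_sharpe` and the random toy returns are hypothetical.

```python
import numpy as np
import pandas as pd

def rolling_sharpe(returns, window=63, periods_per_year=252):
    """Annualized rolling Sharpe ratio with a zero risk-free rate (hypothetical helper)."""
    mean = returns.rolling(window).mean()
    std = returns.rolling(window).std()
    return (mean / std) * np.sqrt(periods_per_year)

# Toy daily returns, for illustration only
rng = np.random.default_rng(0)
toy_returns = pd.Series(rng.normal(0.0005, 0.01, size=200))

rs = rolling_sharpe(toy_returns)
print(rs.dropna().describe())
```

The first `window - 1` values are NaN because the window is not yet full. A strategy that stays flat for an entire window has zero standard deviation, so the ratio is undefined there as well.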


**Quick Observations**:

- Transformer and GRU show some persistent Sharpe edge across periods.

- CNN-LSTM is more volatile and needs stability tuning.

- LSTM and ATT-LSTM have oscillating risk-adjusted returns, which makes them candidates for the ensembling logic later.

## 11. Strategy Behavior Diagnostics

### 11.1 Transaction Cost Sensitivity

In [None]:
# Fee levels to test
fee_levels = [0.0, 0.001, 0.0025, 0.005]  # 0%, 0.1%, 0.25%, 0.5%
fee_results = {}

for fee in fee_levels:
    temp_results = {}
    for name, df in model_dfs.items():
        df['signal'] = (df['predictions'] > 0.5).astype(int)

        # Align both SPY and signal dataframe to shared index
        common_index = df.index.intersection(spy_prices.index)
        aligned_spy = spy_prices.loc[common_index]
        aligned_signal = df['signal'].loc[common_index]

        pf = vbt.Portfolio.from_signals(
            close=aligned_spy,
            entries=aligned_signal == 1,
            exits=aligned_signal == 0,
            freq='1D',
            init_cash=100,
            fees=fee
        )
        temp_results[name] = pf.value().iloc[-1]
    # Format the label from the percentage directly to avoid int() truncation on float noise
    fee_results[f"{fee * 100:g}%"] = temp_results

# Build dataframe
fee_df = pd.DataFrame(fee_results).T

# Plot
ax = fee_df.plot(kind='bar', figsize=(12, 6), title='Final Portfolio Value vs. Transaction Cost')
ax.set_title('Final Portfolio Value vs. Transaction Cost', fontsize=16, fontweight='bold')
ax.set_ylabel("Final Portfolio Value", fontsize=14, fontweight='bold')
ax.set_xlabel("Transaction Cost Level", fontsize=14, fontweight='bold')
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, fontsize=12, fontweight='bold')

legend = ax.legend()
for text in legend.get_texts():
    text.set_fontweight('bold')

plt.grid(True)
plt.tight_layout()
plt.savefig("4_transaction_cost_sensitivity.png", dpi=150, bbox_inches='tight')
plt.show()


We can clearly see that:

- GRU is the most resilient, even at higher costs.

- CNN-LSTM and LSTM decay faster.

- Transformer and ATT-LSTM show consistent structure in degradation, which is what we want for model comparisons.
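How fast a strategy decays with fees depends mostly on how many orders it places. The sketch below is a rough back-of-the-envelope approximation, assuming each order moves the full portfolio and loses a fixed fraction `fee` of its value. The function names are hypothetical, and a backtester's exact fee handling may differ slightly.

```python
def retained_after_fees(n_orders, fee):
    """Approximate fraction of gross value left after n_orders full-size orders."""
    return (1.0 - fee) ** n_orders

def approx_orders(turnover_rate, n_days):
    """Approximate order count from a turnover rate (position changes per day)."""
    return round(turnover_rate * n_days)

# Illustrative numbers only: 20% turnover over 500 trading days
n_orders = approx_orders(0.2, 500)
for fee in [0.0, 0.001, 0.0025, 0.005]:
    print(f"fee={fee:.4f}  retained={retained_after_fees(n_orders, fee):.3f}")
```

Under this approximation, two models with the same gross return diverge after costs in proportion to their order counts, which is why turnover is worth inspecting alongside the fee sweep.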

### 11.2 Turnover & Signal Activity Diagnostics
We’ll compute:

- **Turnover Rate**: How frequently positions change

- **Total Long Signal Count**: How often the model signals a long entry

In [None]:
# === Step 11.2: Turnover & Signal Activity Diagnostics ===

turnover_stats = {}

for name, df in model_dfs.items():
    df = df.copy()
    df['signal'] = (df['predictions'] > 0.5).astype(int)
    signal_changes = df['signal'].diff().abs()
    
    turnover_rate = signal_changes.sum() / len(df)
    signal_count = df['signal'].sum()

    turnover_stats[name] = {
        'Turnover Rate': turnover_rate,
        'Long Signal Count': signal_count
    }

# Convert to DataFrame
turnover_df = pd.DataFrame(turnover_stats).T.round(4)

# === Plot Turnover and Signal Count Side-by-Side ===
axes = turnover_df.plot(
    kind='bar',
    subplots=True,
    figsize=(12, 6),
    title=['Turnover Rate', 'Total Long Signal Count'],
    layout=(1, 2),
    legend=False
)

# Apply formatting
for ax, title in zip(axes.flatten(), ['Turnover Rate', 'Total Long Signal Count']):
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.set_ylabel('Value', fontsize=12, fontweight='bold')
    ax.set_xlabel('Model', fontsize=12, fontweight='bold')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize=11, fontweight='bold')
    ax.grid(True)

plt.tight_layout()
plt.savefig("5_turnover_signal_activity.png", dpi=150, bbox_inches='tight')
plt.show()


### Key Takeaways
We use this chart to select model pairs or triads for **ensemble blending** or **portfolio tilting**.

---

## 🔍 Insights from Turnover & Signal Activity

| Model          | Interpretation                                                                 |
|----------------|---------------------------------------------------------------------------------|
| **CNN-LSTM**   | Highest turnover → frequent switching, may signal overfitting or instability.   |
| **ATT-LSTM**   | Lower turnover with high signal count → more consistent, conviction-based.      |
| **Transformer**| Similar to ATT-LSTM: lower turnover, stable signal generation.                  |
| **GRU**        | Balanced — not too aggressive, not too passive.                                 |
| **LSTM**       | Moderate — less frequent than GRU or ATT-LSTM, but more stable than CNN-LSTM.  |

These metrics help diagnose how each model behaves beyond just returns: signal stability, activity level, and trading cost exposure.
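The two diagnostics are easy to verify by hand on a short series. The sketch below applies the same threshold-and-diff logic to five made-up prediction values.

```python
import pandas as pd

# Toy predictions, for illustration only
predictions = pd.Series([0.2, 0.7, 0.9, 0.4, 0.6])

signal = (predictions > 0.5).astype(int)       # 0, 1, 1, 0, 1
changes = signal.diff().abs()                  # NaN, 1, 0, 1, 1

turnover_rate = changes.sum() / len(signal)    # 3 position changes over 5 rows
long_count = int(signal.sum())                 # 3 rows with a long signal

print(turnover_rate, long_count)
```

Note that the first row never counts as a change because `diff()` has no previous value to compare against.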

---

### 11.3A Correlation Diagnostics: Signal & Return Overlap
We’ll compute and visualize:

- **Signal Correlation Matrix** – how similar the trading decisions are.

- **Return Correlation Matrix** – how similar their realized PnLs are.

- **Clustered Heatmaps** – intuitive structure of model groupings.

- **Delta Matrix** – where signal agreement doesn’t translate to return similarity.

- **Top Ensemble Candidate Pairs** – lowest return correlation for diversification.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# === 11.3A: Correlation Matrices ===
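# Recompute signals from predictions so this cell does not depend on the
# 'signal' column created as a side effect in Step 11.1
signal_df = pd.DataFrame({name: (df['predictions'] > 0.5).astype(int) for name, df in model_dfs.items()})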
returns_df = pd.DataFrame({name: res['portfolio'].returns() for name, res in results.items()})

# Correlations
signal_corr = signal_df.corr()
returns_corr = returns_df.corr()

# Plot basic heatmaps
fig, axs = plt.subplots(1, 2, figsize=(18, 8))

sns.heatmap(signal_corr, annot=True, fmt=".2f", cmap='coolwarm', ax=axs[0], square=True, annot_kws={"weight": "bold"})
axs[0].set_title("Signal Correlation", fontsize=16, fontweight='bold')

sns.heatmap(returns_corr, annot=True, fmt=".2f", cmap='coolwarm', ax=axs[1], square=True, annot_kws={"weight": "bold"})
axs[1].set_title("Return Correlation", fontsize=16, fontweight='bold')

plt.tight_layout()
plt.savefig("6_correlation_heatmaps.png", dpi=150, bbox_inches='tight')
plt.show()


What stands out:

- Signal correlations are generally low (e.g., LSTM vs Transformer = 0.13).

- Return correlations are much higher across the board: realized PnL converges even when signal structures differ.

- This suggests many signals ***fire differently*** but still lead to overlapping return profiles, which matters for ensembling because signal diversity alone does not guarantee return diversification.
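One possible mechanism for this pattern is that long-only strategies which are invested most of the time all inherit the market's return on the days they overlap. The sketch below illustrates this with synthetic data. It is not a claim about the actual models, and every name and number in it is made up.

```python
import numpy as np
import pandas as pd

n = 500
rng = np.random.default_rng(0)
market = pd.Series(rng.normal(0.0, 0.01, size=n))   # synthetic daily returns

# Two long-only signals that are each flat on a different 10% of days
days = np.arange(n)
signal_a = pd.Series((days % 10 != 0).astype(int))
signal_b = pd.Series((days % 10 != 5).astype(int))

signal_corr = signal_a.corr(signal_b)
return_corr = (market * signal_a).corr(market * signal_b)

print(f"signal correlation: {signal_corr:.2f}")
print(f"return correlation: {return_corr:.2f}")
```

The signals never go flat on the same day, so their correlation is slightly negative, yet the return streams are strongly correlated because both strategies hold the market on 80% of days.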

### 11.3B: Clustered Signal and Return Correlations
Clustered Heatmaps to Detect Model Groupings

In [None]:
# Clustered Signal Correlation
signal_cluster = sns.clustermap(signal_corr, annot=True, fmt=".2f", cmap="coolwarm",
                                linewidths=0.5, figsize=(8, 8),
                                annot_kws={"size": 10, "weight": "bold"})
signal_cluster.fig.suptitle("Clustered Signal Correlation", fontsize=14, fontweight='bold', y=1.02)
signal_cluster.savefig("clustered_signal_corr.png", dpi=150, bbox_inches='tight')
plt.close(signal_cluster.fig)

# Clustered Return Correlation
returns_cluster = sns.clustermap(returns_corr, annot=True, fmt=".2f", cmap="coolwarm",
                                 linewidths=0.5, figsize=(8, 8),
                                 annot_kws={"size": 10, "weight": "bold"})
returns_cluster.fig.suptitle("Clustered Return Correlation", fontsize=14, fontweight='bold', y=1.02)
returns_cluster.savefig("clustered_return_corr.png", dpi=150, bbox_inches='tight')
plt.close(returns_cluster.fig)


### 11.3C: Display Clustered Heatmaps 
This will let us see groupings like:

- Which models fire similarly (signal space)

- Which models produce similar PnLs (return space)

- And whether any diversity exists in signals that still yield similar returns.

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

# Load saved images
img_signal = Image.open("clustered_signal_corr.png")
img_return = Image.open("clustered_return_corr.png")

# Plot side by side
fig, axs = plt.subplots(1, 2, figsize=(20, 10))

axs[0].imshow(img_signal)
axs[0].axis('off')
axs[0].set_title("Signal Correlation (Clustered)", fontsize=18, fontweight='bold')

axs[1].imshow(img_return)
axs[1].axis('off')
axs[1].set_title("Return Correlation (Clustered)", fontsize=18, fontweight='bold')

plt.tight_layout()
plt.savefig("7_clustered_correlation_side_by_side.png", dpi=150)
plt.show()


We can already spot insightful structure:

- CNN-LSTM + GRU + LSTM are closer in behavior (clustered tightly).

- Transformer and ATT-LSTM show more decorrelation — which is great for ensemble diversification ideas.

### 11.4 Signal vs. Return Correlation Delta Map
This chart gives a **matrix of mismatches**:

- **Positive delta**: high return correlation despite different signals (maybe due to timing offsets).

- **Negative delta**: similar signals with differing outcomes (possible execution noise or structural differences).


In [None]:
delta_corr = returns_corr - signal_corr

This delta helps detect hidden behavioral divergence between signal timing and portfolio effects.

In [None]:
# === Step 11.4: Delta Correlation Heatmap ===
import seaborn as sns
import matplotlib.pyplot as plt

# Compute delta matrix
delta_corr = returns_corr - signal_corr

# Plot
plt.figure(figsize=(10, 8))
sns.heatmap(
    delta_corr, annot=True, fmt=".2f", cmap="BuGn", center=0,
    square=True, linewidths=0.5, annot_kws={"size": 10, "weight": "bold"}
)

plt.title("Return - Signal Correlation Delta", fontsize=14, fontweight='bold')
plt.xticks(rotation=45, fontsize=11, fontweight='bold')
plt.yticks(fontsize=11, fontweight='bold')
plt.tight_layout()
plt.savefig("8_delta_corr_heatmap.png", dpi=150, bbox_inches='tight')
plt.show()


That heatmap allows us to visually detect mismatches between signal logic and realized returns. Some key insights:

- **Transformer vs. ATT-LSTM** has the ***highest delta*** (0.62) → wildly different signals, surprisingly similar returns.

- **LSTM vs. GRU** has a ***moderate delta*** (0.40) → suggests their signals are not redundant despite similar architectures.

Great setup for model blending / ensemble discussion later.
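Besides reading the heatmap, the delta matrix can be ranked programmatically. The sketch below uses small made-up correlation matrices (not the notebook's results) and keeps only the upper triangle so each pair appears once.

```python
import numpy as np
import pandas as pd

names = ["A", "B", "C"]   # placeholder strategy names

# Made-up correlation matrices, for illustration only
signal_corr_toy = pd.DataFrame(
    [[1.0, 0.2, 0.6], [0.2, 1.0, 0.1], [0.6, 0.1, 1.0]], index=names, columns=names
)
returns_corr_toy = pd.DataFrame(
    [[1.0, 0.8, 0.7], [0.8, 1.0, 0.5], [0.7, 0.5, 1.0]], index=names, columns=names
)

delta_toy = returns_corr_toy - signal_corr_toy

# Keep each unordered pair once (strictly above the diagonal), then rank
upper = np.triu(np.ones(delta_toy.shape, dtype=bool), k=1)
ranked = delta_toy.where(upper).stack().dropna().sort_values(ascending=False)

print(ranked)
```

The top entry is the pair whose returns are most similar relative to how different their signals are.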

### 11.5 Top 5 Most Diverse Strategy Pairs (Return Correlation)
This helps us rank pairs for ensemble diversification based on lowest ***absolute return correlation.***

In [None]:
# === Step 11.5: Top 5 Most Diverse Strategy Pairs (Return Corr) ===

# Extract off-diagonal pairs
def extract_correlation_pairs(corr_matrix):
    pairs = []
    names = corr_matrix.columns
    for i in range(len(names)):
        for j in range(i):
            pairs.append({
                "Strategy A": names[i],
                "Strategy B": names[j],
                "Return Corr": corr_matrix.iloc[i, j]
            })
    return pd.DataFrame(pairs).sort_values(by="Return Corr", key=lambda x: abs(x))

# Build ranked pair DataFrame
pair_df = extract_correlation_pairs(returns_corr)
top_pairs = pair_df.head(5)

# Plot top 5 diverse pairs
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
sns.barplot(data=top_pairs, x="Return Corr", y="Strategy A", hue="Strategy B", palette="Set2")
plt.title("Top 5 Most Diverse Strategy Pairs (Return Corr)", fontsize=14, fontweight='bold')
plt.xlabel("Return Correlation", fontsize=12, fontweight='bold')
plt.ylabel("Strategy A", fontsize=12, fontweight='bold')
plt.legend(title="Strategy B", fontsize=10)
plt.tight_layout()
plt.savefig("9_top_diverse_pairs.png", dpi=150)
plt.show()


***Which model combos could I use together to reduce overfitting or concentration in my signal ensemble?***

Top diversified ensemble candidates:

- `ATT-LSTM + CNN-LSTM` (~0.66)

- `ATT-LSTM + LSTM` (~0.67)

- `Transformer + GRU` / `Transformer + LSTM` (~0.67–0.68)

These pairs have the lowest return correlations in the set, although at roughly 0.66 to 0.68 they are still fairly correlated. Combining them could reduce variance somewhat without sacrificing edge, which makes them reasonable starting points for ensembling alpha signals.
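The size of the variance benefit can be estimated from the standard two-asset portfolio formula. The sketch below assumes a fixed-weight blend of two return streams with known volatilities and correlation. The function name `blend_volatility` is hypothetical, and equal volatilities are assumed purely for illustration.

```python
import math

def blend_volatility(sigma_a, sigma_b, rho, weight_a=0.5):
    """Volatility of a fixed-weight blend of two return streams."""
    weight_b = 1.0 - weight_a
    variance = (
        (weight_a * sigma_a) ** 2
        + (weight_b * sigma_b) ** 2
        + 2.0 * weight_a * weight_b * rho * sigma_a * sigma_b
    )
    return math.sqrt(variance)

# Equal-volatility strategies, 50/50 blend, at the correlations quoted above
for rho in [0.66, 0.68, 1.0]:
    print(f"rho={rho:.2f}  blend vol / single vol = {blend_volatility(1.0, 1.0, rho):.3f}")
```

At a correlation of 0.66, a 50/50 blend of two equal-volatility strategies has about 91% of the volatility of either one alone, so the reduction from pairing these models is real but modest.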