# Dynamic Portfolio Clustering and Risk Profiling with Machine Learning

**Author:** Roberto Berardi  
**Student ID:** 25419094  
**Program:** MSc Finance, HEC Lausanne - UNIL  
**Course:** Advanced Programming - Fall 2025  

---

## Project Overview

This notebook compares **risk-based clustering** with **machine learning predictions** for portfolio construction, using 50 U.S. stocks over 2015-2024.

**Research Question:** Can simple clustering strategies outperform complex ML models for building investment portfolios?

---
## Setup and Imports

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image, display, Markdown
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

print("✅ Libraries loaded successfully")
print("📊 Ready for analysis")

---
## 1. Data Overview

### Dataset Characteristics

- **50 U.S. Large-Cap Stocks** (AAPL, MSFT, GOOGL, JPM, JNJ, etc.)
- **Time Period:** January 2015 - December 2024 (10 years)
- **Data Frequency:** Daily prices
- **Benchmark:** S&P 500

### Features Calculated (Rolling 12-Month Window)

1. Annualized return
2. Annualized volatility
3. Sharpe ratio (2% risk-free rate)
4. Maximum drawdown
5. Beta (market sensitivity)
6. Correlation with S&P 500
7. Momentum indicators: 1-month, 3-month, 6-month, and 12-month (features 7-10)
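
The features listed above can be sketched for a single stock and a single 12-month window as follows. This is a minimal illustration on synthetic returns, not the code in `main.py`: the 252-day window length, the geometric annualization, the 21-day month, and the names `risk_features`, `stock`, and `market` are all assumptions.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252   # assumed length of the 12-month window
RISK_FREE = 0.02     # annual risk-free rate from the feature list


def risk_features(stock_ret: pd.Series, market_ret: pd.Series) -> dict:
    """Risk features for one window of daily simple returns (hypothetical helper)."""
    ann_return = (1 + stock_ret).prod() ** (TRADING_DAYS / len(stock_ret)) - 1
    ann_vol = stock_ret.std() * np.sqrt(TRADING_DAYS)
    wealth = (1 + stock_ret).cumprod()
    return {
        "ann_return": ann_return,
        "ann_volatility": ann_vol,
        "sharpe": (ann_return - RISK_FREE) / ann_vol,
        # Largest peak-to-trough loss, reported as a negative number.
        "max_drawdown": (wealth / wealth.cummax() - 1).min(),
        "beta": stock_ret.cov(market_ret) / market_ret.var(),
        "correlation": stock_ret.corr(market_ret),
        # 1-month momentum: cumulative return over the last 21 trading days.
        "momentum_1m": (1 + stock_ret.iloc[-21:]).prod() - 1,
    }


# Synthetic data: a stock that moves 1.2x the market plus idiosyncratic noise.
rng = np.random.default_rng(0)
market = pd.Series(rng.normal(0.0004, 0.01, TRADING_DAYS))
stock = 1.2 * market + pd.Series(rng.normal(0.0, 0.005, TRADING_DAYS))

features = risk_features(stock, market)
for name, value in features.items():
    print(f"{name:15s} {value: .4f}")
```

The remaining momentum horizons follow the same pattern with 63, 126, and 252 trading days.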

---
## 2. Load Analysis Results

Loading results from the main analysis (`main.py` with 50 stocks).

In [None]:
# Load performance tables
clustering_results = pd.read_csv('../results/tables/clustering_performance.csv')
ml_results = pd.read_csv('../results/tables/ml_performance.csv')
ml_model_eval = pd.read_csv('../results/tables/ml_model_evaluation.csv')

print("✅ Data loaded successfully")
print(f"📊 Clustering portfolios: {len(clustering_results)}")
print(f"🤖 ML portfolios: {len(ml_results)}")
print(f"📈 ML models evaluated: {len(ml_model_eval)}")

---
## 3. Portfolio Strategies

Three portfolio strategies were tested, each with a different risk tolerance. The percentages are the share of capital allocated to each volatility cluster:

| Strategy | Low-Volatility | Moderate | High-Volatility | Risk Profile |
|----------|---------------|----------|-----------------|-------------|
| **Conservative** | 60% | 30% | 10% | Low risk |
| **Balanced** | 40% | 40% | 20% | Medium risk |
| **Aggressive** | 20% | 30% | 50% | High risk |

**Rebalancing:** Quarterly (17 rebalances from 2021-2024)  
**Transaction Costs:** 0.15% per trade  
**Initial Capital:** $100,000
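
The allocation table and the cost setting can be expressed as plain data plus a small cost function. The sketch below assumes the 0.15% fee is charged on traded notional (portfolio value times the sum of absolute weight changes); the notebook does not state how `main.py` applies it, and `rebalance_cost` and the drifted weights are hypothetical.

```python
# Target allocation per volatility cluster, taken from the strategy table.
STRATEGY_WEIGHTS = {
    "Conservative": {"low": 0.60, "moderate": 0.30, "high": 0.10},
    "Balanced": {"low": 0.40, "moderate": 0.40, "high": 0.20},
    "Aggressive": {"low": 0.20, "moderate": 0.30, "high": 0.50},
}
COST_RATE = 0.0015          # 0.15% per trade
INITIAL_CAPITAL = 100_000.0


def rebalance_cost(value, current, target, cost_rate=COST_RATE):
    """Cost of moving from `current` to `target` bucket weights.

    Assumption: the fee applies to traded notional, i.e. the portfolio value
    times the sum of absolute weight changes.
    """
    buckets = set(current) | set(target)
    turnover = sum(abs(target.get(b, 0.0) - current.get(b, 0.0)) for b in buckets)
    return value * turnover * cost_rate


# Initial purchase from cash: every dollar is traded once.
initial_cost = rebalance_cost(INITIAL_CAPITAL, {}, STRATEGY_WEIGHTS["Conservative"])

# A later rebalance after the weights have drifted (illustrative numbers).
drifted = {"low": 0.55, "moderate": 0.30, "high": 0.15}
drift_cost = rebalance_cost(INITIAL_CAPITAL, drifted, STRATEGY_WEIGHTS["Conservative"])

print(f"Initial purchase cost: ${initial_cost:,.2f}")
print(f"Drift rebalance cost:  ${drift_cost:,.2f}")
```

Under this assumption, strategies that trade less at each rebalance pay proportionally less in fees.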

---
## 4. Clustering-Based Portfolio Results

### Approach
- **K-means clustering:** Partition stocks into 3 groups
- **GMM (Gaussian Mixture Models):** Probabilistic clustering
- **PCA:** Dimensionality reduction (3 components, 96.7% variance)
- **Adaptive:** Re-cluster every quarter with most recent 12-month data
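
One quarterly re-clustering step might look like the sketch below. It assumes scikit-learn is the library used (the notebook does not import it) and runs on a random 50 x 10 matrix standing in for the real feature table, so the explained variance will not match the 96.7% reported above. Mapping clusters to low, moderate, and high volatility by ranking their mean volatility is also an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix: one row per stock, ten features.
features = rng.normal(size=(50, 10))
# Column 1 plays the role of annualized volatility (strictly positive).
features[:, 1] = np.abs(features[:, 1]) * 0.2 + 0.1

# Standardize, then reduce to three principal components.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=3)
components = pca.fit_transform(scaled)

# Hard assignments from K-means, soft assignments from a Gaussian mixture.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(components)
gmm = GaussianMixture(n_components=3, random_state=0).fit(components)
gmm_probs = gmm.predict_proba(components)

# Rank clusters by mean volatility so each label maps to a risk bucket.
cluster_vol = [features[kmeans_labels == k, 1].mean() for k in range(3)]
order = np.argsort(cluster_vol)
names = {int(order[0]): "low", int(order[1]): "moderate", int(order[2]): "high"}
risk_bucket = [names[int(k)] for k in kmeans_labels]

print("Explained variance:", pca.explained_variance_ratio_.sum().round(3))
print("Bucket sizes:", {b: risk_bucket.count(b) for b in ("low", "moderate", "high")})
```

Repeating this on each rebalance date with the most recent 12 months of features gives the adaptive behaviour described above.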

In [None]:
# Display clustering results table
print("📊 CLUSTERING-BASED PORTFOLIO PERFORMANCE (2021-2024)")
print("=" * 70)
display(clustering_results)
print("\n💡 Key Observation: All portfolios significantly beat S&P 500 (59.62%)")

### 📊 Visualization: Clustering Portfolio Performance

In [None]:
# Plot clustering portfolio performance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Subplot 1: Total Returns
x = np.arange(len(clustering_results))
width = 0.4

bars = ax1.bar(x, clustering_results['Total Return'], width, 
               alpha=0.85, color='#2E86AB', edgecolor='black', linewidth=1.5)
ax1.axhline(y=59.62, color='#F18F01', linestyle='--', linewidth=2.5, 
            label='S&P 500 (59.62%)', alpha=0.8)

ax1.set_xlabel('Portfolio Strategy', fontweight='bold', fontsize=13)
ax1.set_ylabel('Total Return (%)', fontweight='bold', fontsize=13)
ax1.set_title('Clustering-Based Portfolio Performance', fontweight='bold', fontsize=15)
ax1.set_xticks(x)
ax1.set_xticklabels(clustering_results['Portfolio'], fontsize=12)
ax1.legend(fontsize=11)
ax1.grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels
for i, bar in enumerate(bars):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.1f}%', ha='center', va='bottom', fontsize=11, fontweight='bold')

# Subplot 2: Sharpe Ratios
bars2 = ax2.bar(x, clustering_results['Sharpe'], width,
                alpha=0.85, color='#A23B72', edgecolor='black', linewidth=1.5)
ax2.axhline(y=0.63, color='#F18F01', linestyle='--', linewidth=2.5,
            label='S&P 500 (0.63)', alpha=0.8)

ax2.set_xlabel('Portfolio Strategy', fontweight='bold', fontsize=13)
ax2.set_ylabel('Sharpe Ratio', fontweight='bold', fontsize=13)
ax2.set_title('Risk-Adjusted Performance', fontweight='bold', fontsize=15)
ax2.set_xticks(x)
ax2.set_xticklabels(clustering_results['Portfolio'], fontsize=12)
ax2.legend(fontsize=11)
ax2.grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels
for i, bar in enumerate(bars2):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.2f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n✅ Clustering portfolios consistently outperform S&P 500")

### Individual Portfolio Analysis

In [None]:
# Detailed breakdown by portfolio
for idx, row in clustering_results.iterrows():
    print(f"\n{'='*60}")
    print(f"📈 {row['Portfolio'].upper()} PORTFOLIO - CLUSTERING")
    print(f"{'='*60}")
    print(f"Total Return:    {row['Total Return']:>8.2f}%")
    print(f"CAGR:            {row['CAGR']:>8.2f}%")
    print(f"Sharpe Ratio:    {row['Sharpe']:>8.2f}")
    print(f"Max Drawdown:    {row['Max Drawdown']:>8.2f}%")
    print(f"Final Value:     ${row['Final Value']:>12,.0f}")
    # Let the format spec choose the sign so a shortfall is not printed as "+-x.xx"
    print(f"vs S&P 500:      {row['Total Return'] - 59.62:>+8.2f}%")
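
The `CAGR` and `Final Value` columns are tied to `Total Return` by simple arithmetic. A minimal sketch, assuming a holding period of exactly four years (2021-2024) and using the S&P 500 figure quoted in this notebook as the worked example; `cagr` and `final_value` are hypothetical helper names:

```python
def cagr(total_return_pct: float, years: float) -> float:
    """Compound annual growth rate in percent, from a total return in percent."""
    growth = 1 + total_return_pct / 100
    return (growth ** (1 / years) - 1) * 100


def final_value(initial_capital: float, total_return_pct: float) -> float:
    """Ending portfolio value implied by a total return in percent."""
    return initial_capital * (1 + total_return_pct / 100)


# Benchmark total return quoted in the notebook, over an assumed 4-year period.
benchmark_cagr = cagr(59.62, 4)
benchmark_final = final_value(100_000, 59.62)

print(f"S&P 500 CAGR:        {benchmark_cagr:.2f}%")
print(f"S&P 500 final value: ${benchmark_final:,.0f}")
```

If the backtest window is not exactly four years, the reported CAGR will differ slightly from this calculation.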

---
## 5. Machine Learning Models

### Models Tested

Four regression models trained to predict 3-month forward returns:

1. **Ridge Regression** (linear, L2 regularization)
2. **Random Forest** (ensemble, decision trees)
3. **XGBoost** (gradient boosting)
4. **Neural Network** (deep learning, 3 layers)

### Training/Testing Split

- **Training:** 2015-2020 (6 years, ~60,000 samples)
- **Testing:** 2021-2024 (4 years)
- **Two versions:** Base (10 features) vs Enhanced (+cluster)
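
The prediction target and the chronological split can be sketched for one synthetic price series as below. The 63-trading-day horizon is an assumed stand-in for "3 months", and the final purging step (dropping training rows whose target window reaches into the test period) is a common precaution that the notebook does not confirm was applied.

```python
import numpy as np
import pandas as pd

HORIZON = 63  # assumed number of trading days in three months

# Synthetic daily prices on business days over the study period.
dates = pd.bdate_range("2015-01-01", "2024-12-31")
rng = np.random.default_rng(1)
prices = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0.0003, 0.01, len(dates)))), index=dates
)

# Target: simple return from day t to day t + HORIZON.
# The last HORIZON rows have no future price and are dropped.
forward_return = prices.shift(-HORIZON) / prices - 1
data = pd.DataFrame({"price": prices, "target": forward_return}).dropna()

# Chronological split: never shuffle time-series rows.
train = data.loc[:"2020-12-31"]
test = data.loc["2021-01-01":]

# Optional purge: remove training rows whose target uses test-period prices.
train_purged = train.iloc[:-HORIZON]

print(f"Train rows: {len(train_purged):,}  Test rows: {len(test):,}")
```

With 50 stocks stacked into one table, the same split applies row-wise on the date column.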

In [None]:
# Display ML model evaluation
print("🤖 MACHINE LEARNING MODEL EVALUATION")
print("=" * 70)
display(ml_model_eval)
print("\n💡 Best Model: Ridge (Enhanced) - R² = -0.101, Dir Acc = 58.9%")
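
A negative R² alongside a directional accuracy above 50% is not contradictory: R² penalizes errors in magnitude, while directional accuracy only checks the sign. The sketch below uses the standard definitions on five made-up observations; whether `main.py` computes the `Dir_Acc` column in exactly this way is an assumption.

```python
import numpy as np


def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination; negative when worse than predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


def directional_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Share of predictions with the correct sign, in percent."""
    return 100 * np.mean(np.sign(y_true) == np.sign(y_pred))


# Illustrative values: the sign is right four times out of five,
# but every prediction overshoots in magnitude.
y_true = np.array([0.05, -0.02, 0.03, 0.01, -0.04])
y_pred = np.array([0.10, -0.08, 0.09, 0.06, 0.02])

r2 = r_squared(y_true, y_pred)
dir_acc = directional_accuracy(y_true, y_pred)

print(f"R²:                   {r2:.2f}")
print(f"Directional accuracy: {dir_acc:.1f}%")
```

For portfolio construction, the ranking and sign of predictions often matter more than their magnitude, which is why both metrics are reported.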

### 📊 Visualization: ML Model Performance

In [None]:
# Plot ML model performance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Prepare data
models = ml_model_eval['Model'].unique()
x = np.arange(len(models))
width = 0.35

base_acc = [ml_model_eval[(ml_model_eval['Model']==m) & (ml_model_eval['Version']=='Base')]['Dir_Acc'].values[0] for m in models]
enh_acc = [ml_model_eval[(ml_model_eval['Model']==m) & (ml_model_eval['Version']=='Enhanced')]['Dir_Acc'].values[0] for m in models]

# Subplot 1: Directional Accuracy
bars1 = ax1.bar(x - width/2, base_acc, width, label='Base', 
                alpha=0.85, color='#6A994E', edgecolor='black', linewidth=1.2)
bars2 = ax1.bar(x + width/2, enh_acc, width, label='Enhanced (+Cluster)',
                alpha=0.85, color='#BC4749', edgecolor='black', linewidth=1.2)

ax1.axhline(y=50, color='gray', linestyle=':', linewidth=2, label='Random (50%)', alpha=0.7)

ax1.set_xlabel('Model', fontweight='bold', fontsize=13)
ax1.set_ylabel('Directional Accuracy (%)', fontweight='bold', fontsize=13)
ax1.set_title('ML Model Performance: Prediction Accuracy', fontweight='bold', fontsize=15)
ax1.set_xticks(x)
# Derive tick labels from `models` so they always match the bar order
ax1.set_xticklabels([m.replace(' ', '\n') for m in models], fontsize=11)
ax1.legend(fontsize=11)
ax1.grid(axis='y', alpha=0.3)
ax1.set_ylim(48, 62)

# Add values
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + 0.3,
                f'{height:.1f}%', ha='center', va='bottom', fontsize=9)

# Subplot 2: R² Scores
base_r2 = [ml_model_eval[(ml_model_eval['Model']==m) & (ml_model_eval['Version']=='Base')]['R²'].values[0] for m in models]
enh_r2 = [ml_model_eval[(ml_model_eval['Model']==m) & (ml_model_eval['Version']=='Enhanced')]['R²'].values[0] for m in models]

bars3 = ax2.bar(x - width/2, base_r2, width, label='Base',
                alpha=0.85, color='#6A994E', edgecolor='black', linewidth=1.2)
bars4 = ax2.bar(x + width/2, enh_r2, width, label='Enhanced (+Cluster)',
                alpha=0.85, color='#BC4749', edgecolor='black', linewidth=1.2)

ax2.axhline(y=0, color='gray', linestyle=':', linewidth=2, alpha=0.7)

ax2.set_xlabel('Model', fontweight='bold', fontsize=13)
ax2.set_ylabel('R² Score', fontweight='bold', fontsize=13)
ax2.set_title('ML Model Performance: R² Coefficient', fontweight='bold', fontsize=15)
ax2.set_xticks(x)
# Derive tick labels from `models` so they always match the bar order
ax2.set_xticklabels([m.replace(' ', '\n') for m in models], fontsize=11)
ax2.legend(fontsize=11)
ax2.grid(axis='y', alpha=0.3)

# Add values
for bars in [bars3, bars4]:
    for bar in bars:
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.2f}', ha='center', va='bottom' if height > 0 else 'top', fontsize=9)

plt.tight_layout()
plt.show()

print("\n✅ Enhanced models marginally better than base models")

---
## 6. ML-Driven Portfolio Results

Using **Ridge (Enhanced)** - the best performing model.

In [None]:
# Display ML portfolio results
print("📊 ML-DRIVEN PORTFOLIO PERFORMANCE (2021-2024)")
print("=" * 70)
display(ml_results)

### 📊 Visualization: ML Portfolio Performance

In [None]:
# Plot ML portfolio performance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

x = np.arange(len(ml_results))
width = 0.4

# Subplot 1: Total Returns
bars = ax1.bar(x, ml_results['Total Return'], width,
               alpha=0.85, color='#A23B72', edgecolor='black', linewidth=1.5)
ax1.axhline(y=59.62, color='#F18F01', linestyle='--', linewidth=2.5,
            label='S&P 500 (59.62%)', alpha=0.8)

ax1.set_xlabel('Portfolio Strategy', fontweight='bold', fontsize=13)
ax1.set_ylabel('Total Return (%)', fontweight='bold', fontsize=13)
ax1.set_title('ML-Driven Portfolio Performance', fontweight='bold', fontsize=15)
ax1.set_xticks(x)
ax1.set_xticklabels(ml_results['Portfolio'], fontsize=12)
ax1.legend(fontsize=11)
ax1.grid(axis='y', alpha=0.3)

for i, bar in enumerate(bars):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.1f}%', ha='center', va='bottom', fontsize=11, fontweight='bold')

# Subplot 2: Sharpe Ratios
bars2 = ax2.bar(x, ml_results['Sharpe'], width,
                alpha=0.85, color='#6A994E', edgecolor='black', linewidth=1.5)
ax2.axhline(y=0.63, color='#F18F01', linestyle='--', linewidth=2.5,
            label='S&P 500 (0.63)', alpha=0.8)

ax2.set_xlabel('Portfolio Strategy', fontweight='bold', fontsize=13)
ax2.set_ylabel('Sharpe Ratio', fontweight='bold', fontsize=13)
ax2.set_title('Risk-Adjusted Performance', fontweight='bold', fontsize=15)
ax2.set_xticks(x)
ax2.set_xticklabels(ml_results['Portfolio'], fontsize=12)
ax2.legend(fontsize=11)
ax2.grid(axis='y', alpha=0.3)

for i, bar in enumerate(bars2):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.2f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n✅ ML portfolios also beat S&P 500, but clustering performs better")

### Individual ML Portfolio Analysis

In [None]:
# Detailed breakdown by ML portfolio
for idx, row in ml_results.iterrows():
    print(f"\n{'='*60}")
    print(f"🤖 {row['Portfolio'].upper()} PORTFOLIO - ML-DRIVEN")
    print(f"{'='*60}")
    print(f"Total Return:    {row['Total Return']:>8.2f}%")
    print(f"CAGR:            {row['CAGR']:>8.2f}%")
    print(f"Sharpe Ratio:    {row['Sharpe']:>8.2f}")
    print(f"Max Drawdown:    {row['Max Drawdown']:>8.2f}%")
    print(f"Volatility:      {row['Volatility']:>8.2f}%")
    print(f"Info Ratio:      {row['Info Ratio']:>8.2f}")
    print(f"Final Value:     ${row['Final Value']:>12,.0f}")
    # Let the format spec choose the sign so a shortfall is not printed as "+-x.xx"
    print(f"vs S&P 500:      {row['Total Return'] - 59.62:>+8.2f}%")
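
The `Volatility` and `Info Ratio` columns printed above can be computed from daily returns roughly as follows. This sketch uses the common annualized definitions with 252 trading days and synthetic return series; the exact convention in `main.py` is an assumption, and the function names are hypothetical.

```python
import numpy as np

TRADING_DAYS = 252  # assumed annualization factor


def annualized_volatility(returns: np.ndarray) -> float:
    """Sample standard deviation of daily returns, scaled to one year."""
    return returns.std(ddof=1) * np.sqrt(TRADING_DAYS)


def information_ratio(portfolio: np.ndarray, benchmark: np.ndarray) -> float:
    """Annualized mean active return divided by annualized tracking error."""
    active = portfolio - benchmark
    return active.mean() / active.std(ddof=1) * np.sqrt(TRADING_DAYS)


# Synthetic daily returns: the portfolio tracks the benchmark with added noise.
rng = np.random.default_rng(7)
benchmark = rng.normal(0.0004, 0.010, 1000)
portfolio = benchmark + rng.normal(0.0002, 0.003, 1000)

vol = annualized_volatility(portfolio)
ir = information_ratio(portfolio, benchmark)

print(f"Annualized volatility: {vol:.2%}")
print(f"Information ratio:     {ir:.2f}")
```

A positive information ratio means the portfolio beat the benchmark on average per unit of tracking error.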

---
## 7. Direct Comparison: Clustering vs ML

### Side-by-Side Performance

In [None]:
# Create comprehensive comparison plot
fig, ax = plt.subplots(figsize=(14, 7))

x = np.arange(len(clustering_results))
width = 0.35

bars1 = ax.bar(x - width/2, clustering_results['Total Return'], width,
               label='Clustering', alpha=0.85, color='#2E86AB',
               edgecolor='black', linewidth=1.5)
bars2 = ax.bar(x + width/2, ml_results['Total Return'], width,
               label='ML-Driven', alpha=0.85, color='#A23B72',
               edgecolor='black', linewidth=1.5)

# Add S&P 500 line
ax.axhline(y=59.62, color='#F18F01', linestyle='--', linewidth=3,
           label='S&P 500 (59.62%)', alpha=0.8)

ax.set_xlabel('Portfolio Strategy', fontweight='bold', fontsize=14)
ax.set_ylabel('Total Return (%)', fontweight='bold', fontsize=14)
ax.set_title('CLUSTERING vs ML: Total Return Comparison (2021-2024)',
             fontweight='bold', fontsize=16)
ax.set_xticks(x)
ax.set_xticklabels(clustering_results['Portfolio'], fontsize=13)
ax.legend(fontsize=12, loc='upper left')
ax.grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 1,
                f'{height:.1f}%', ha='center', va='bottom',
                fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

# Calculate advantages
print("\n📊 CLUSTERING ADVANTAGE:")
for i, portfolio in enumerate(clustering_results['Portfolio']):
    diff = clustering_results.iloc[i]['Total Return'] - ml_results.iloc[i]['Total Return']
    # Signed difference in percentage points; negative means ML did better
    print(f"   {portfolio:12s}: {diff:+6.2f} pp (clustering minus ML)")
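
The comparison above pairs rows by position, which is only correct if both CSV files list the portfolios in the same order. A more defensive variant joins on the `Portfolio` column. The sketch below uses placeholder numbers, not the project's results, and the suffixed column names are an assumption about how one might label the merged table.

```python
import pandas as pd

# Placeholder tables with deliberately different row orders.
clustering = pd.DataFrame(
    {"Portfolio": ["Conservative", "Balanced", "Aggressive"],
     "Total Return": [30.0, 20.0, 10.0]}
)
ml = pd.DataFrame(
    {"Portfolio": ["Aggressive", "Conservative", "Balanced"],
     "Total Return": [5.0, 25.0, 18.0]}
)

# Join on the portfolio name; validate= raises if a name appears twice.
merged = clustering.merge(
    ml, on="Portfolio", suffixes=("_clustering", "_ml"), validate="one_to_one"
)
merged["Advantage"] = merged["Total Return_clustering"] - merged["Total Return_ml"]

advantage = merged.set_index("Portfolio")["Advantage"]
for name, diff in advantage.items():
    print(f"{name:12s}: {diff:+6.2f} pp (clustering minus ML)")
```

The same merged table can feed the bar chart and the heatmap, so every plot uses identical pairings.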

---
## 8. Risk-Return Analysis

In [None]:
# Create risk-return scatter plot
fig, ax = plt.subplots(figsize=(12, 8))

colors_clust = ['#2E86AB', '#2E86AB', '#2E86AB']
colors_ml = ['#A23B72', '#A23B72', '#A23B72']
sizes = [300, 400, 500]

# Plot portfolios
for i, portfolio in enumerate(clustering_results['Portfolio']):
    # Clustering
    ax.scatter(abs(clustering_results.iloc[i]['Max Drawdown']),
              clustering_results.iloc[i]['Total Return'],
              s=sizes[i], c=colors_clust[i], alpha=0.7,
              edgecolors='black', linewidth=2, marker='o',
              label=f'Clustering - {portfolio}')
    
    # ML
    ax.scatter(abs(ml_results.iloc[i]['Max Drawdown']),
              ml_results.iloc[i]['Total Return'],
              s=sizes[i], c=colors_ml[i], alpha=0.7,
              edgecolors='black', linewidth=2, marker='s',
              label=f'ML - {portfolio}')

# S&P 500
ax.scatter(20, 59.62, s=400, c='#F18F01', alpha=0.8,
          edgecolors='black', linewidth=2, marker='D',
          label='S&P 500')

ax.set_xlabel('Maximum Drawdown (%)', fontweight='bold', fontsize=13)
ax.set_ylabel('Total Return (%)', fontweight='bold', fontsize=13)
ax.set_title('Risk-Return Profile: All Strategies', fontweight='bold', fontsize=15)
ax.legend(fontsize=10, loc='upper right', framealpha=0.9, ncol=2)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Clustering provides better risk-return tradeoff")

---
## 9. Statistical Summary

### Performance Improvements

In [None]:
# Calculate average improvements
avg_return_improvement = (clustering_results['Total Return'].mean() - ml_results['Total Return'].mean())
avg_sharpe_improvement = (clustering_results['Sharpe'].mean() - ml_results['Sharpe'].mean())
avg_cagr_improvement = (clustering_results['CAGR'].mean() - ml_results['CAGR'].mean())

print("📊 AVERAGE IMPROVEMENTS (Clustering vs ML)")
print("=" * 60)
# Signed format specs: a negative value means ML did better on that metric
print(f"Total Return:  {avg_return_improvement:+.2f}%")
print(f"CAGR:          {avg_cagr_improvement:+.2f}%")
print(f"Sharpe Ratio:  {avg_sharpe_improvement:+.3f}")

# Benchmark comparison
sp500_return = 59.62
print(f"\n📊 EXCESS RETURNS vs S&P 500 ({sp500_return}%)")
print("=" * 60)
print("\nCLUSTERING:")
for i, portfolio in enumerate(clustering_results['Portfolio']):
    excess = clustering_results.iloc[i]['Total Return'] - sp500_return
    print(f"  {portfolio:12s}: {excess:+6.2f}%")

print("\nML-DRIVEN:")
for i, portfolio in enumerate(ml_results['Portfolio']):
    excess = ml_results.iloc[i]['Total Return'] - sp500_return
    print(f"  {portfolio:12s}: {excess:+6.2f}%")

### 📊 Visualization: Summary Heatmap

In [None]:
# Create improvement heatmap
improvement_data = []
metrics = ['Return', 'CAGR', 'Sharpe', 'Max DD']

for i, portfolio in enumerate(clustering_results['Portfolio']):
    row = [
        clustering_results.iloc[i]['Total Return'] - ml_results.iloc[i]['Total Return'],
        clustering_results.iloc[i]['CAGR'] - ml_results.iloc[i]['CAGR'],
        clustering_results.iloc[i]['Sharpe'] - ml_results.iloc[i]['Sharpe'],
        clustering_results.iloc[i]['Max Drawdown'] - ml_results.iloc[i]['Max Drawdown']
    ]
    improvement_data.append(row)

improvement_df = pd.DataFrame(improvement_data,
                             columns=metrics,
                             index=clustering_results['Portfolio'])

fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(improvement_df, annot=True, fmt='.2f', cmap='RdYlGn', center=0,
            cbar_kws={'label': 'Clustering Advantage'}, linewidths=2,
            linecolor='black', ax=ax, vmin=-5, vmax=15)

ax.set_title('Clustering vs ML: Performance Difference\n(Positive = Clustering Better)',
             fontweight='bold', fontsize=14, pad=15)
ax.set_xlabel('Metric', fontweight='bold', fontsize=12)
ax.set_ylabel('Portfolio', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

print("\n🟢 Green = Clustering better")
print("🔴 Red = ML better")

---
## 10. Conclusion

### Main Findings

1. **✅ Clustering Outperforms ML**
   - Average advantage: +6.5 percentage points of total return
   - Conservative: +10.6 pp, Balanced: +9.2 pp, Aggressive: +3.2 pp

2. **✅ Both Beat the Market**
   - All six strategies exceeded the S&P 500's total return of 59.62%
   - Aggressive clustering: +24.4 percentage points above the benchmark

3. **✅ Enhanced Models Add Value**
   - Models with cluster features marginally better
   - Ridge Enhanced: best directional accuracy (58.9%)

4. **✅ Simpler Can Be Better**
   - Risk-based clustering more robust than ML predictions
   - Lower turnover = lower costs

### Answer to Research Question

**Can simple clustering strategies outperform complex ML models?**

**✅ YES** - Over 2021-2024, clustering outperformed ML by roughly 3 to 11 percentage points of total return across all three strategies.
