# ⭐ Tutorial: Financial Asset Clustering with RiskLabAI

This notebook is a tutorial for the asset clustering functions in the `RiskLabAI` library, based on Chapter 4 of 'Machine Learning for Asset Managers' by Marcos López de Prado, the source of Snippets 4.1 to 4.3 used below.

Clustering is a fundamental step for modern portfolio construction techniques like Hierarchical Risk Parity (HRP). The goal is to identify groups of similar assets *before* optimizing allocations.

We will demonstrate:
1.  **Data Preparation:** Load real stock data, convert it to returns, and compute the correlation matrix.
2.  **Visualizing the Problem:** Show a heatmap of the original, unsorted correlation matrix.
3.  **Synthetic Benchmark:** Use `random_block_correlation` to create a synthetic matrix with a *known* cluster structure.
4.  **Base K-Means (Snippet 4.1):** Apply the `cluster_k_means_base` algorithm to find the optimal number of clusters based on the silhouette score t-statistic.
5.  **Optimized Nested Clustering (ONC) (Snippet 4.2):** Apply the `cluster_k_means_top` function, which recursively re-clusters unstable groups to find a more robust solution.
6.  **Visualizing the Solution:** Show heatmaps of the sorted correlation matrices to see the clusters our algorithms uncovered.

## 0. Setup and Imports

First, we import our libraries and the necessary modules from `RiskLabAI`.

In [None]:
# Standard Imports
import seaborn as sns
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt

# RiskLabAI Imports
import RiskLabAI.cluster.clustering as cl
import RiskLabAI.utils.publication_plots as pub_plots

# --- Notebook Configuration ---
pub_plots.setup_publication_style()

## 1. Load and Prepare Data

We will load daily price data for a list of 30 symbols from Yahoo Finance. To analyze their relationships, we must first convert prices to **percentage returns**.

From these returns, we compute the Pearson correlation matrix, which will be the input for our clustering algorithms.
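
As a quick sanity check of the two transformations just described, the following minimal sketch applies them to a toy price table. The tickers `AAA`, `BBB`, and `CCC` and their prices are made up for illustration; only `pct_change` and `corr` from pandas are used.

```python
import numpy as np
import pandas as pd

# Toy closing prices for three hypothetical tickers (illustrative values only).
prices = pd.DataFrame(
    {
        "AAA": [100.0, 102.0, 101.0, 104.0, 103.0],
        "BBB": [50.0, 51.0, 50.5, 52.0, 51.5],
        "CCC": [20.0, 19.5, 20.5, 19.0, 20.0],
    }
)

# Simple one-period returns: r_t = p_t / p_{t-1} - 1.
# The first row has no predecessor, so it is NaN and gets dropped.
toy_returns = prices.pct_change(1).dropna()

# Pearson correlation between the return columns
# (symmetric, with ones on the diagonal).
toy_corr = toy_returns.corr()

print(toy_returns.round(4))
print(toy_corr.round(3))
```

`AAA` and `BBB` move almost in lockstep here, while `CCC` moves against them, so the matrix should show one strongly positive and two negative off-diagonal entries.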

In [None]:
symbols = [
    "AAPL", "MSFT", "GOOG", "AMZN", "TSLA", "META", "JPM", "UNH", "V", "JNJ",
    "HD", "WMT", "PG", "BAC", "MA", "PFE", "DIS", "AVGO", "XOM", "ACM",
    "CSCO", "NFLX", "NKE", "LLY", "KO", "TMO", "CRM", "COST", "AAL", "X"
]

# Download daily closing prices, one column per symbol
all_stocks = pd.DataFrame({
    symbol: yf.Ticker(symbol).history(start="2019-01-01", end="2021-08-08")["Close"]
    for symbol in symbols
})

# Convert to percentage returns and compute correlation matrix
# We use returns, as correlation of raw prices is often spurious.
returns = all_stocks.pct_change(1).dropna()
corr_matrix = returns.corr()

print("Correlation Matrix:")
corr_matrix.head()

## 2. Visualizing the Problem

Let's plot a heatmap of the raw correlation matrix. It's difficult to see any clear, defined clusters. The assets appear in the arbitrary order of the `symbols` list, so correlated assets (like `V` and `MA`, or `JPM` and `BAC`) are far apart.

The goal of clustering is to re-order this matrix to reveal the hidden block structures.
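
Mechanically, the re-ordering is a single permutation applied to both rows and columns. The sketch below shows this on a hand-made 4x4 matrix. The asset names and the `clusters` dictionary are hypothetical, and the dictionary shape (cluster id mapped to a list of members) is an assumption about how cluster assignments are represented.

```python
import pandas as pd

# A tiny hand-made correlation matrix with two groups, {A1, A2} and {B1, B2},
# listed in an interleaved order so the blocks are not visible.
names = ["A1", "B1", "A2", "B2"]
corr = pd.DataFrame(
    [
        [1.0, 0.1, 0.8, 0.2],
        [0.1, 1.0, 0.2, 0.7],
        [0.8, 0.2, 1.0, 0.1],
        [0.2, 0.7, 0.1, 1.0],
    ],
    index=names,
    columns=names,
)

# Hypothetical cluster assignment: {cluster_id: [member names]}.
clusters = {0: ["A1", "A2"], 1: ["B1", "B2"]}

# Concatenate members cluster by cluster, then apply the same order
# to rows and columns so the matrix stays symmetric.
order = [name for members in clusters.values() for name in members]
corr_sorted = corr.loc[order, order]

print(corr_sorted)
```

After sorting, the high correlations (0.8 and 0.7) sit next to the diagonal and the low ones fill the off-diagonal corners, which is the block pattern a heatmap makes visible.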

In [None]:
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_matrix, ax=ax, cmap='viridis', vmin=-1, vmax=1)
pub_plots.apply_plot_style(ax, 'Original (Unsorted) Correlation Matrix', '', '')
plt.tight_layout()
plt.show()

## 3. Snippet 4.3: Synthetic Data Benchmark

Before we cluster the real data, let's see what a "perfect" result would look like. We use `random_block_correlation` to generate a synthetic matrix with **4 known blocks** plus a market component.

This is our target: we want the clustering algorithm to recover the groups in the real matrix and re-order it so that it shows a similar block-diagonal pattern.
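
To build intuition for what such a matrix contains, here is a minimal sketch of one way a block correlation matrix with a market component could be constructed. `toy_block_correlation` is a hypothetical helper written for this illustration, not the implementation behind `random_block_correlation`, and its parameters (`block_sizes`, `n_obs`, `block_sigma`, `market_sigma`, `seed`) are assumptions.

```python
import numpy as np

def toy_block_correlation(block_sizes, n_obs=500, block_sigma=0.5,
                          market_sigma=1.0, seed=0):
    """Hypothetical helper: correlation matrix with one block per entry in block_sizes."""
    rng = np.random.default_rng(seed)
    n_cols = sum(block_sizes)

    # Market component: one common factor plus idiosyncratic noise, shared by all columns.
    market = rng.normal(size=(n_obs, 1))
    data_market = market + rng.normal(scale=market_sigma, size=(n_obs, n_cols))
    cov = np.cov(data_market, rowvar=False)

    # Block component: each block has its own factor, added only inside that block.
    start = 0
    for size in block_sizes:
        factor = rng.normal(size=(n_obs, 1))
        data_block = factor + rng.normal(scale=block_sigma, size=(n_obs, size))
        cov[start:start + size, start:start + size] += np.cov(data_block, rowvar=False)
        start += size

    # Rescale the summed covariance to a correlation matrix.
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    return np.clip(corr, -1.0, 1.0)

toy_corr = toy_block_correlation([3, 4, 3])

# Average correlation inside the first block vs. between the first and second block.
within = toy_corr[:3, :3][~np.eye(3, dtype=bool)].mean()
across = toy_corr[:3, 3:7].mean()
print(toy_corr.shape, round(float(within), 2), round(float(across), 2))
```

Because every column loads on the market factor, cross-block correlations stay positive, while columns in the same block share an extra factor and correlate more strongly.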

In [None]:
# Generate a synthetic correlation matrix with a known structure
synth_corr = cl.random_block_correlation(
    n_columns=30, 
    n_blocks=4, 
    random_state=42
)

# Plot the synthetic matrix
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(synth_corr, ax=ax, cmap='viridis', vmin=-1, vmax=1)
pub_plots.apply_plot_style(ax, 'Synthetic Block Correlation Matrix', '', '')
plt.tight_layout()
plt.show()

## 4. Snippet 4.1: Base K-Means Clustering

The `cluster_k_means_base` function is the first step. It runs K-Means multiple times for different numbers of clusters (k) and uses the **t-statistic of the silhouette scores** (their mean divided by their standard deviation) to determine the optimal `k`.

It returns the sorted matrix, the cluster assignments, and the silhouette scores for each asset.
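
The selection rule can be sketched with scikit-learn. This is an illustration under stated assumptions, not the code inside `cluster_k_means_base`: the helper name `pick_k_by_silhouette_tstat` is hypothetical, correlations are mapped to distances with sqrt((1 - rho) / 2), and the toy input is simulated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def pick_k_by_silhouette_tstat(corr, max_clusters=5, iterations=3, seed=0):
    """Hypothetical helper: keep the K-Means partition with the highest silhouette t-stat."""
    # Map correlations to distances in [0, 1]; each row is one asset's feature vector.
    dist = np.sqrt(np.clip((1.0 - corr) / 2.0, 0.0, 1.0))

    best_quality, best_labels = -np.inf, None
    for init in range(iterations):            # repeat with different initializations
        for k in range(2, max_clusters + 1):  # try each candidate number of clusters
            model = KMeans(n_clusters=k, n_init=1, random_state=seed + init).fit(dist)
            silh = silhouette_samples(dist, model.labels_)
            spread = silh.std()
            if spread == 0:
                continue  # degenerate case: the ratio below is undefined
            quality = silh.mean() / spread    # high mean, low dispersion is better
            if quality > best_quality:
                best_quality, best_labels = quality, model.labels_
    return best_labels, best_quality

# Simulated returns for 12 toy assets: 3 blocks of 4, each driven by its own factor.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 3))
block_of = np.repeat([0, 1, 2], 4)
toy_returns = factors[:, block_of] + 0.5 * rng.normal(size=(500, 12))
toy_corr = np.corrcoef(toy_returns, rowvar=False)

labels, quality = pick_k_by_silhouette_tstat(toy_corr)
print("clusters found:", len(set(labels.tolist())))
print("labels:", labels)
```

Scoring partitions by mean over standard deviation, rather than by mean silhouette alone, penalizes solutions in which a few assets fit their cluster poorly.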

In [None]:
from pprint import pprint

corr_sorted_base, clusters_base, silh_base = cl.cluster_k_means_base(
    corr_matrix,
    max_clusters=10,
    iterations=10,
    random_state=42
)

print(f"Found {len(clusters_base)} clusters with base K-Means:")
pprint(clusters_base)

Now, let's plot the heatmap of the matrix sorted by these base clusters. It's much better, but you can see some "noisy" assets within the blocks.

In [None]:
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_sorted_base, ax=ax, cmap='viridis', vmin=-1, vmax=1)
pub_plots.apply_plot_style(ax, 'Correlation Matrix Sorted by Base K-Means', '', '')
plt.tight_layout()
plt.show()

## 5. Snippet 4.2: Optimized Nested Clustering (ONC)

The `cluster_k_means_top` function implements the full **Optimized Nested Clustering (ONC)** algorithm. 

It works by:
1. Running `cluster_k_means_base` (as we did above).
2. Identifying "unstable" clusters (those with a silhouette t-stat below the average).
3. Taking all unstable clusters, grouping them together, and **re-clustering them recursively**.
4. Merging the stable clusters with the new sub-clusters.

This recursive process results in more stable and homogeneous clusters.
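
Step 2 above can be sketched as follows. `split_stable_unstable` is a hypothetical helper, the silhouette scores are made up, and the input shapes (a dictionary of cluster members and a pandas Series of per-asset silhouette scores) are assumptions rather than documented `RiskLabAI` return types.

```python
import numpy as np
import pandas as pd

def split_stable_unstable(clusters, silh):
    """Hypothetical helper: flag clusters whose silhouette t-stat is below the average."""
    # Per-cluster quality: mean silhouette of the members divided by its standard deviation.
    tstats = {
        cid: silh[members].mean() / silh[members].std(ddof=0)
        for cid, members in clusters.items()
    }
    mean_tstat = np.mean(list(tstats.values()))

    stable = {cid: m for cid, m in clusters.items() if tstats[cid] >= mean_tstat}
    unstable = {cid: m for cid, m in clusters.items() if tstats[cid] < mean_tstat}
    return stable, unstable, tstats

# Toy inputs (made-up silhouette scores, not library output).
clusters = {0: ["A", "B", "C"], 1: ["D", "E", "F"], 2: ["G", "H", "I"]}
silh = pd.Series(
    [0.80, 0.82, 0.78, 0.30, 0.10, 0.50, 0.60, 0.62, 0.58],
    index=list("ABCDEFGHI"),
)

stable, unstable, tstats = split_stable_unstable(clusters, silh)

# Members of all unstable clusters are pooled; this pool is what gets re-clustered.
redo_members = [name for members in unstable.values() for name in members]
print("stable:", sorted(stable))
print("unstable:", sorted(unstable))
print("to re-cluster:", redo_members)
```

In this toy example cluster 1 has both a low mean and a wide spread of silhouette scores, so it is the only one sent back for re-clustering.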

In [None]:
corr_sorted_onc, clusters_onc, silh_onc = cl.cluster_k_means_top(
    corr_matrix, 
    max_clusters=10, 
    iterations=10,
    random_state=42
)

print(f"Found {len(clusters_onc)} clusters with Optimized Nested Clustering (ONC):")
pprint(clusters_onc)

Let's plot the final heatmap. The resulting clusters are tighter and make more intuitive sense. For example, `AAL` (Airlines) and `X` (US Steel) are now isolated, while the financial (`JPM`, `BAC`, `V`, `MA`) and tech (`AAPL`, `MSFT`, `GOOG`, `CRM`) groups are more clearly defined.

In [None]:
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_sorted_onc, ax=ax, cmap='viridis', vmin=-1, vmax=1)
pub_plots.apply_plot_style(ax, 'Correlation Matrix Sorted by ONC (cluster_k_means_top)', '', '')
plt.tight_layout()
plt.show()
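
Besides inspecting the heatmap, cluster tightness can be checked numerically. The sketch below compares the average correlation inside clusters with the average correlation across clusters. `intra_inter_correlation` is a hypothetical helper, the toy matrix is hand-made, and the `clusters` dictionary shape (cluster id mapped to a list of asset names) is an assumption.

```python
import numpy as np
import pandas as pd

def intra_inter_correlation(corr, clusters):
    """Hypothetical helper: mean off-diagonal correlation within vs. across clusters."""
    label_of = {name: cid for cid, members in clusters.items() for name in members}
    labels = np.array([label_of[name] for name in corr.index])

    same_cluster = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)

    values = corr.to_numpy()
    intra = values[same_cluster & off_diagonal].mean()  # pairs in the same cluster
    inter = values[~same_cluster].mean()                # pairs in different clusters
    return intra, inter

names = ["A1", "A2", "B1", "B2"]
toy_corr = pd.DataFrame(
    [
        [1.0, 0.8, 0.1, 0.2],
        [0.8, 1.0, 0.2, 0.1],
        [0.1, 0.2, 1.0, 0.7],
        [0.2, 0.1, 0.7, 1.0],
    ],
    index=names,
    columns=names,
)

good = {0: ["A1", "A2"], 1: ["B1", "B2"]}  # matches the block structure
bad = {0: ["A1", "B1"], 1: ["A2", "B2"]}   # mixes the two groups

good_intra, good_inter = intra_inter_correlation(toy_corr, good)
bad_intra, bad_inter = intra_inter_correlation(toy_corr, bad)
print("good:", round(good_intra, 3), round(good_inter, 3))
print("bad: ", round(bad_intra, 3), round(bad_inter, 3))
```

A partition that matches the block structure shows a large gap between the two averages. The same comparison could be run on the base and ONC results, assuming their cluster outputs follow the dictionary shape used here.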

## 6. Conclusion

This notebook demonstrated the use of the `RiskLabAI.cluster` module to identify hidden structures in financial data.

1.  We started with a messy, unsorted correlation matrix from real stock returns.
2.  We benchmarked our goal by creating a `random_block_correlation` matrix with a known structure.
3.  We applied `cluster_k_means_base`, which found an optimal `k` but produced slightly noisy clusters.
4.  We used the more advanced **Optimized Nested Clustering (ONC)** via `cluster_k_means_top`, which recursively refined the clusters to produce a much cleaner, more stable, and more intuitive block structure.

This sorted matrix and its corresponding cluster list can then feed cluster-based portfolio construction, in the same spirit as the clustering step of Hierarchical Risk Parity (HRP).