# ⭐ Tutorial: Distance Metrics with RiskLabAI

This notebook is a tutorial for the information-theoretic distance metrics in the `RiskLabAI` library, based on Chapter 3 ("Distance Metrics") of 'Machine Learning for Asset Managers' by Marcos López de Prado.

We will demonstrate:
1.  **Variation of Information (VI):** Calculate VI using both a naive and an optimal number of bins.
2.  **Correlation vs. Mutual Information:** We'll run three tests to show a critical weakness of correlation.
    * Case 1: No Relationship
    * Case 2: Linear Relationship
    * Case 3: Non-Linear Relationship
3.  **Conclusion:** We'll show why Mutual Information (MI) is a superior metric for detecting non-linear patterns in financial data.

## 0. Setup and Imports

First, we import our libraries. We'll use `yfinance` to download sample stock data and our `RiskLabAI` modules for distance metrics and plotting.

In [None]:
import numpy as np
import scipy.stats as ss
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt

# Import from our RiskLabAI package
import RiskLabAI.data.distance.distance_metric as dm
import RiskLabAI.utils.publication_plots as pub_plots

# Apply global publication style
pub_plots.setup_publication_style()

## 1. Load Sample Data

Let's download daily closing prices for Apple (AAPL) and Tesla (TSLA) to use as our sample data.

In [None]:
# Download stock data
x_series = yf.Ticker("AAPL").history(start="2020-01-01")['Close']
y_series = yf.Ticker("TSLA").history(start="2020-01-01")['Close']

# Align the series to ensure they have the same dates
data = pd.DataFrame({'x': x_series, 'y': y_series}).dropna()
x = data['x'].to_numpy()
y = data['y'].to_numpy()

print(f"Loaded {len(x)} aligned observations.")

## 2. Snippet 3.2 & 3.3: Variation of Information (VI)

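**Variation of Information (VI)** is a distance metric that measures how much information two variables *do not* share: it is zero when each variable fully determines the other and grows as they become less related. A lower VI therefore means the variables are more similar. Passing `norm=True` requests the normalized variant, which the book defines to lie between 0 and 1.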

First, we'll calculate VI using a naively chosen, hardcoded number of bins (`bins=10`).

In [None]:
# Snippet 3.2: VI with hardcoded bins
vi_hardcoded = dm.calculate_variation_of_information(x, y, bins=10, norm=True)

print(f"Normalized VI (10 Bins):  {vi_hardcoded:.4f}")
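
To make the quantity concrete, the sketch below computes VI by hand from equal-width bins, using the identity VI(X, Y) = 2H(X, Y) - H(X) - H(Y) and, when `norm=True`, dividing by the joint entropy H(X, Y). It is a dependency-free illustration, not RiskLabAI's implementation: the helper names (`discretize`, `entropy`, `variation_of_information_sketch`) are hypothetical, and the library may bin or normalize differently.

```python
import math
import random
from collections import Counter


def discretize(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:  # constant series: everything falls into bin 0
        return [0] * len(values)
    # The maximum lands on the right edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def variation_of_information_sketch(x, y, bins=10, norm=False):
    """VI(X, Y) = 2 H(X, Y) - H(X) - H(Y), optionally divided by H(X, Y)."""
    dx, dy = discretize(x, bins), discretize(y, bins)
    h_x, h_y = entropy(dx), entropy(dy)
    h_xy = entropy(list(zip(dx, dy)))  # joint entropy from paired bin labels
    vi = 2.0 * h_xy - h_x - h_y
    if norm and h_xy > 0:
        vi /= h_xy
    return vi


rng = random.Random(0)
a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
b = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # drawn independently of a

vi_same = variation_of_information_sketch(a, a, norm=True)
vi_indep = variation_of_information_sketch(a, b, norm=True)
print(f"VI(a, a) = {vi_same:.4f}")   # identical series share all information
print(f"VI(a, b) = {vi_indep:.4f}")  # independent series share almost none
```

A series compared with itself gives a VI of zero, while two independent series give a normalized VI close to its upper bound of 1.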

A better approach is to choose the number of bins from the data rather than fixing it by hand, as described by López de Prado. The `calculate_variation_of_information_extended` function does this automatically.

In [None]:
# Snippet 3.3: VI with optimal bins
vi_optimal = dm.calculate_variation_of_information_extended(x, y, norm=True)

print(f"Normalized VI (Optimal Bins): {vi_optimal:.4f}")
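
For reference, the bin-count rules given alongside the book's Snippet 3.3 can be sketched as below: a univariate rule that depends only on the number of observations, and a bivariate rule that also uses the correlation between the two series. The function name `num_bins` is illustrative, and whether `calculate_variation_of_information_extended` applies exactly these formulas internally is an assumption that has not been verified here.

```python
import math


def num_bins(n_obs, corr=None):
    """Bin count for discretizing continuous data (sketch of the book's rule).

    n_obs: number of observations.
    corr:  correlation between the two series; None selects the univariate rule.
    """
    if corr is None:
        # Univariate rule: depends only on the sample size.
        z = (8 + 324 * n_obs + 12 * math.sqrt(36 * n_obs + 729 * n_obs**2)) ** (1 / 3)
        b = z / 6 + 2 / (3 * z) + 1 / 3
    else:
        # Bivariate rule: more bins as |corr| grows.
        # Undefined for |corr| == 1 (division by zero), so callers may need to clip.
        b = math.sqrt(1 + math.sqrt(1 + 24 * n_obs / (1 - corr**2))) / math.sqrt(2)
    return int(round(b))


print(num_bins(1000))            # univariate rule
print(num_bins(1000, corr=0.0))  # bivariate rule, uncorrelated series
print(num_bins(1000, corr=0.9))  # bivariate rule, strongly correlated series
```

The bivariate rule asks for more bins as the two series become more strongly correlated, since a tighter joint distribution needs a finer grid to resolve.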

## 3. Snippet 3.4: Correlation vs. Mutual Information

The contrast between correlation and mutual information is one of the central ideas of the chapter.

* **Correlation:** Measures the **linear** relationship between two variables. It fails to detect non-linear patterns.
* **Mutual Information (MI):** Measures any relationship (linear or non-linear). It quantifies the amount of information one variable provides about another.
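
For reference, these are the standard definitions behind the second bullet. The normalization shown, dividing by the smaller marginal entropy, is the one used in the book; that `norm=True` in `calculate_mutual_information` applies the same normalization is an assumption.

$$
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} = H(X) + H(Y) - H(X,Y),
\qquad
\mathrm{NMI}(X,Y) = \frac{I(X;Y)}{\min\{H(X),\,H(Y)\}}
$$

$I(X;Y) = 0$ exactly when $X$ and $Y$ are independent, which is why MI can flag a dependence that produces zero correlation.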

We will test three cases using synthetic data.

In [None]:
# Helper function to run and plot each case
def plot_mi_vs_corr(x, y, title):
    """Calculates, prints, and plots MI vs. Correlation."""
    
    # Calculate metrics
    corr = np.corrcoef(x, y)[0, 1]
    nmi = dm.calculate_mutual_information(x, y, norm=True)
    
    print(f"Correlation:      {corr:.4f}")
    print(f"Normalized MI:    {nmi:.4f}")
    
    # Plot
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(x, y, alpha=0.1)
    pub_plots.apply_plot_style(
        ax,
        title=title,
        xlabel='x',
        ylabel='y'
    )
    plt.show()
    
# Setup synthetic data
size, seed = 5000, 0
np.random.seed(seed)
x_base = np.random.normal(size=size)
e_base = np.random.normal(size=size)

### Case 1: No Relationship

Here, `y` is just random noise and is completely independent of `x`.

In [None]:
# y is independent of x
y_uncorr = 0*x_base + e_base

plot_mi_vs_corr(x_base, y_uncorr, "Case 1: No Relationship (y = e)")

**Analysis:** As expected, both Correlation and Normalized MI are near zero.

### Case 2: Linear Relationship

Here, `y` has a strong, positive linear relationship with `x`.

In [None]:
# y has a strong linear relationship with x
y_linear = 100 * x_base + e_base

plot_mi_vs_corr(x_base, y_linear, "Case 2: Linear Relationship (y = 100x + e)")

**Analysis:** Both metrics are near 1.0, correctly identifying the strong linear relationship.
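
As a sanity check on the correlation, the population value for this construction follows directly from the definition, given that `x_base` and `e_base` are independent standard normals:

$$
\rho(X,\,100X + e) = \frac{\mathrm{Cov}(X,\,100X + e)}{\sigma_X\,\sigma_{100X+e}} = \frac{100}{\sqrt{100^2 + 1}} \approx 0.99995
$$

The sample correlation should land close to this value. The normalized MI, by contrast, is estimated from binned data, so its exact value depends on the number of bins.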

### Case 3: Non-Linear Relationship

This is the key test. We define `y` as a "V-shape" function of `x` plus a small amount of noise (`y = 100*abs(x) + e`). This is a strong, predictable relationship, but it is not linear.

In [None]:
# y has a strong NON-LINEAR relationship with x
y_nonlinear = 100 * abs(x_base) + e_base

plot_mi_vs_corr(x_base, y_nonlinear, "Case 3: Non-Linear Relationship (y = 100|x| + e)")
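
The near-zero correlation in this case is not a sampling accident. For any $X$ that is symmetric about zero with finite variance (as the standard normal `x_base` is), and noise $e$ independent of $X$, the covariance vanishes:

$$
\mathrm{Cov}(X,\,100|X| + e) = 100\,\mathbb{E}\big[X|X|\big] - 100\,\mathbb{E}[X]\,\mathbb{E}\big[|X|\big] + \mathrm{Cov}(X, e) = 0
$$

Here $\mathbb{E}[X|X|] = 0$ because $x|x|$ is an odd function, $\mathbb{E}[X] = 0$, and $\mathrm{Cov}(X, e) = 0$ by independence. Yet `y` is almost fully determined by `x`, and that dependence is what MI picks up.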

## 4. Conclusion

This notebook demonstrates the critical difference between correlation and information-theoretic metrics.

* **Correlation failed** to detect a clear, non-linear "V-shape" relationship, reporting a value near 0.
* **Mutual Information succeeded** in detecting this non-linear relationship, reporting a high normalized value of about 0.64.

These results illustrate why Mutual Information (MI) and Variation of Information (VI) are more robust choices than linear correlation for feature selection and cluster analysis: they can capture non-linear dependence that correlation misses. The estimates do depend on how the data are binned, which is why the choice of bin count in Section 2 matters.