## ML for Asset Managers - Marcos López de Prado

**Date:** 2025-05-20 

Working through the content in the textbook. Beach day so copying over earlier work.


### Chapter 3 - Distance Metrics

### 3.1 Motivation:

Information theory is really useful for both finance and ML!

The key idea behind entropy is to quantify the amount of uncertainty associated with a random variable.
 
In this section, we review concepts that are used throughout ML in a variety of settings, including:
1.  defining the objective function in decision tree learning; 
2.  defining the loss function for classification problems; 
3.  evaluating the distance between two random variables; 
4.  comparing clusters; and 
5.  feature selection. 

In [None]:
# 3.2 - Correlation as a Metric:

# Correlation is a useful measure of linear codependence, but it is not a metric (the triangle inequality does not hold).

# Consider two random vectors X, Y of size T, and a correlation estimate ρ(X,Y)
# with the only requirement that σ(X,Y) = ρ(X,Y)σ(X)σ(Y), where σ(X,Y) is the covariance.

# Then the measure d_corr(X,Y) = sqrt(1/2 * (1 - ρ(X,Y))) is a metric and satisfies the triangle inequality.

# After z-standardizing X and Y, the Euclidean distance between them is d(x,y) = sqrt(4T) * d_corr(X,Y).

# The implication is that d_corr(X,Y) is a linear multiple of the Euclidean distance between the vectors
# X,Y after z-standardization, hence it inherits the true-metric properties of the Euclidean distance.

# Properties of d_corr(X,Y):

# 1. It is normalized: d_corr(X,Y) lies in [0, 1].

# 2. It deems two random variables with negative correlation more distant
#    than two random variables with positive correlation,
#    regardless of their absolute value.

#    This property makes sense in many applications. For example, we may wish to build a long-only
#    portfolio, where holdings in negatively correlated securities can only offset
#    risk, and therefore should be treated as different for diversification purposes.

#   ρ = 1 => d = 0
#   ρ = -1 => d = 1
#   ρ = 0 => d = 1/sqrt(2) = 0.707
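
The relation between d_corr and the Euclidean distance can be checked numerically. The snippet below is a minimal sketch, not the book's code: the helper name `corr_distance`, the sample size, and the synthetic data are assumptions made for illustration, and the z-standardization uses the population standard deviation (ddof=0), which is what makes the sqrt(4T) factor exact.

```python
import numpy as np

def corr_distance(x, y):
    """Correlation-based distance d_corr = sqrt(0.5 * (1 - rho)) (hypothetical helper)."""
    rho = np.corrcoef(x, y)[0, 1]
    # Clip guards against tiny out-of-range values caused by rounding
    return np.sqrt(np.clip(0.5 * (1.0 - rho), 0.0, 1.0))

# Synthetic data (assumption: any two real-valued vectors of equal length work)
rng = np.random.default_rng(0)
T = 500
x = rng.normal(size=T)
y = 0.5 * x + rng.normal(size=T)

# z-standardize with the population standard deviation (NumPy's default, ddof=0)
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

d_euclid = np.linalg.norm(zx - zy)  # Euclidean distance after standardization
d_corr = corr_distance(x, y)

print(f"Euclidean distance:    {d_euclid:.6f}")
print(f"sqrt(4T) * d_corr:     {np.sqrt(4 * T) * d_corr:.6f}")
print(f"d_corr(x, x):  {corr_distance(x, x):.6f}")
print(f"d_corr(x, -x): {corr_distance(x, -x):.6f}")
```

The two distances should agree up to floating-point error, and the last two lines illustrate the boundary cases ρ = 1 and ρ = -1.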

In [1]:
# 3.3-3.7 Marginal and Joint Entropy

# The notion of correlation presents three important caveats:
#    1. First, it quantifies the linear codependency between two random variables. 
#       It neglects nonlinear relationships. 
#    2. Second, correlation is highly influenced by outliers. 
#    3. Third, its application beyond the multivariate Normal case is questionable. 
#       We may compute the correlation between any two real variables, 
#       however that correlation is typically meaningless unless the two variables 
#       follow a bivariate Normal distribution. 
# 
# To overcome these caveats, we need to introduce a few information-theoretic concepts.

# Let X be a discrete random variable that takes values in the set Sx with probability mass function p(x).

# The entropy of X is defined as:

    # H(X) = -Σx∈Sx p(x) log p(x)

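# The value 1/p(x) measures how "surprising" an event is. Entropy is the expected value of the log of those surprises.
# Accordingly, entropy can be interpreted as the amount of uncertainty associated with X.

# Entropy is zero when all probability is concentrated in a single element of Sx,
# and reaches its maximum, log(|Sx|), when X is uniformly distributed.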


# The joint entropy of two random variables X and Y is defined as:

    # H(X,Y) = -Σx∈Sx Σy∈Sy p(x,y) log p(x,y)

# The joint entropy is the amount of uncertainty associated with the pair (X,Y).


# The conditional entropy of X given Y is defined as:

    # H(X|Y) = -Σy∈Sy p(y) Σx∈Sx p(x|y) log p(x|y)

# The conditional entropy is the amount of uncertainty associated with X given that we know Y.


# KL-Divergence: 
 
# The Kullback-Leibler divergence (or relative entropy) between two probability distributions P and Q is defined as:

   # D(P||Q) = Σx∈S p(x) log(p(x)/q(x)) 

# Intuitively, this expression measures how much p diverges from a reference distribution q (nonnegative but asymmetric).


# Cross-Entropy:

# The cross-entropy between two probability distributions P and Q is defined as:

    # H(P,Q) = -Σx∈S p(x) log q(x)

# Cross-entropy can be interpreted as the uncertainty associated with X, 
# where we evaluate its information content using a wrong distribution q rather than the true distribution p.

# Cross-entropy is popular in classification problems, and it is particularly meaningful in financial applications.


# Mutual Information:

# The mutual information between two random variables X and Y is defined as:

    # I(X;Y) = H(X) + H(Y) - H(X,Y)

# The mutual information is a measure of the amount of information that knowing 
# one of the variables provides about the other.

# It is symmetric, nonnegative, and zero if and only if X and Y are independent, BUT it is not a metric (the triangle inequality fails).

# Grouping property, where I(X,Y,Z) = H(X) + H(Y) + H(Z) - H(X,Y,Z):

# I(X,Y,Z) = I(X;Y) + I((X,Y);Z)

# Given two arrays x and y of equal size, which are discretized into a regular grid
# with a number of partitions (bins) per dimension, the code below shows how to compute
# in Python the marginal entropies, joint entropy, conditional entropies, and the mutual information.

import numpy as np
import scipy.stats as ss
from sklearn.metrics import mutual_info_score

# Generate synthetic data
np.random.seed(42)
x = np.random.normal(loc=0, scale=1, size=1000)
y = 0.5 * x + np.random.normal(loc=0, scale=1, size=1000)  # correlated with x

# Set number of bins for histogram estimation
bins = 30

# Compute 2D histogram (joint distribution)
cXY = np.histogram2d(x, y, bins)[0]

# Compute marginal entropies
hX = ss.entropy(np.histogram(x, bins)[0])
hY = ss.entropy(np.histogram(y, bins)[0])

# Compute mutual information
iXY = mutual_info_score(None, None, contingency=cXY)

# Normalize mutual information
iXYn = iXY / min(hX, hY)

# Compute joint entropy
hXY = hX + hY - iXY

# Conditional entropies
hX_Y = hXY - hY  # H(X|Y)
hY_X = hXY - hX  # H(Y|X)

# Output the results
print(f"hX: {hX:.4f}, hY: {hY:.4f}")
print(f"Mutual Information (I(X;Y)): {iXY:.4f}")
print(f"Normalized MI: {iXYn:.4f}")
print(f"Joint Entropy (H(X,Y)): {hXY:.4f}")
print(f"H(X|Y): {hX_Y:.4f}, H(Y|X): {hY_X:.4f}")


hX: 2.8279, hY: 2.9423
Mutual Information (I(X;Y)): 0.3334
Normalized MI: 0.1179
Joint Entropy (H(X,Y)): 5.4368
H(X|Y): 2.4945, H(Y|X): 2.6089
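
The definitions of entropy, KL divergence, and cross-entropy above are tied together by the identity H(P,Q) = H(P) + D(P||Q). The following is a minimal sketch on small hand-picked distributions; the function names and the example probabilities are assumptions for illustration, and the code assumes q(x) > 0 wherever p(x) > 0.

```python
import numpy as np

def entropy(p):
    """H(P) = -sum p log p, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """D(P||Q) = sum p log(p/q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def cross_entropy(p, q):
    """H(P,Q) = -sum p log q; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

# Illustrative distributions over three outcomes (hypothetical values)
p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])
uniform = np.full(3, 1.0 / 3.0)

print(f"H(P): {entropy(p):.4f}, max entropy log(3): {np.log(3):.4f}")
print(f"H(uniform): {entropy(uniform):.4f}")
print(f"D(P||Q): {kl_divergence(p, q):.4f}, D(Q||P): {kl_divergence(q, p):.4f}")
print(f"H(P,Q): {cross_entropy(p, q):.4f}, H(P) + D(P||Q): {entropy(p) + kl_divergence(p, q):.4f}")
```

The uniform distribution attains the maximum entropy log(|S|), the two KL values differ (asymmetry), and the last line shows the cross-entropy decomposition.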


In [2]:
# 3.8 Variation of Information:

# This measure can be interpreted as the uncertainty we expect in one variable if we are told the value of the other.

# The variation of information between two random variables X and Y is defined as:

#    VI(X,Y)    = H(X) + H(Y) - 2I(X;Y)
#               = H(X|Y) + H(Y|X)
#               = H(X,Y) - I(X;Y)
#               = 2H(X,Y) - H(X) - H(Y)

# The variation of information is symmetric, nonnegative, zero if and only if X and Y are equal,
# and has an upper bound of H(X,Y). It is a metric!

# However, because H(X,Y) depends on the sizes of Sx and Sy, VI has no firm upper bound.
# This is problematic when we wish to compare variations of information across different population sizes.

# The following quantity is a metric bounded between zero and one for all pairs (X,Y):

#    VI_tilde(X,Y) = VI(X,Y) / H(X,Y)

# An alternative bounded metric is:

#    VI_tilde2(X,Y) = 1 - I(X;Y) / max{H(X), H(Y)}

import numpy as np
import scipy.stats as ss
from sklearn.metrics import mutual_info_score

# ---------------------------------------------------
def varInfo(x, y, bins, norm=False):
    """
    Compute the variation of information (VI) between two variables x and y.
    
    Parameters:
        x (array-like): First input vector.
        y (array-like): Second input vector.
        bins (int): Number of bins to use for histograms.
        norm (bool): Whether to normalize VI by the joint entropy.
    
    Returns:
        float: Variation of information (normalized if norm=True).
    """
    # Joint histogram
    cXY = np.histogram2d(x, y, bins)[0]
    
    # Mutual information
    iXY = mutual_info_score(None, None, contingency=cXY)
    
    # Marginal entropies
    hX = ss.entropy(np.histogram(x, bins)[0])
    hY = ss.entropy(np.histogram(y, bins)[0])
    
    # Variation of information
    vXY = hX + hY - 2 * iXY
    
    if norm:
        hXY = hX + hY - iXY  # Joint entropy
        vXY /= hXY           # Normalized VI
    
    return vXY

# ------------------ Example Use ------------------

# Generate synthetic data
np.random.seed(0)
x = np.random.normal(0, 1, 1000)
y = 0.5 * x + np.random.normal(0, 1, 1000)

# Set number of histogram bins
bins = 30

# Compute VI
vi_raw = varInfo(x, y, bins, norm=False)
vi_norm = varInfo(x, y, bins, norm=True)

# Print results
print(f"Raw Variation of Information: {vi_raw:.4f}")
print(f"Normalized Variation of Information: {vi_norm:.4f}")


Raw Variation of Information: 5.2269
Normalized Variation of Information: 0.9270
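
The cell above implements only the normalization by the joint entropy. The alternative bound, 1 - I(X;Y) / max{H(X), H(Y)}, can be sketched in the same histogram-based style. The function name `var_info_max_norm` and the synthetic data are assumptions for illustration, not part of the book's code.

```python
import numpy as np
import scipy.stats as ss
from sklearn.metrics import mutual_info_score

def var_info_max_norm(x, y, bins):
    """Bounded distance 1 - I(X;Y) / max(H(X), H(Y)) from histogram estimates (hypothetical helper)."""
    c_xy = np.histogram2d(x, y, bins)[0]                    # joint counts
    i_xy = mutual_info_score(None, None, contingency=c_xy)  # mutual information
    h_x = ss.entropy(np.histogram(x, bins)[0])              # marginal entropy of x
    h_y = ss.entropy(np.histogram(y, bins)[0])              # marginal entropy of y
    return 1.0 - i_xy / max(h_x, h_y)

# Same kind of synthetic data as in the cell above
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
y = 0.5 * x + rng.normal(0, 1, 1000)

d_xy = var_info_max_norm(x, y, bins=30)
d_xx = var_info_max_norm(x, x, bins=30)  # a variable compared with itself

print(f"Distance between x and y: {d_xy:.4f}")
print(f"Distance between x and x: {d_xx:.4f}")
```

Comparing a variable with itself should give a distance of zero up to floating-point error, since I(X;X) = H(X).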


In [None]:
# 3.9 Differential Entropy:

# See course notes

In [None]:
# 3.10 Distance Between Two Partitions:

# A partition P of a data set D is an unordered set of mutually disjoint nonempty subsets whose union is D.

# Let us define the uncertainty associated with P. 
# First, we set the probability of picking any element d in D to be p(d) = 1/|D|.
# Second, the probability that an element d picked at random belongs to subset Dk is p(k) = |Dk|/|D|.

# This second probability p(k) is associated with a discrete random variable that takes a 
# value k from S = {1,2,...,K} with probability mass function p(k) = |Dk|/|D|.

# Third, the uncertainty associated with this discrete random variable can be expressed in terms of the entropy
#    H(P) = -Σk=1 to K p(k) log p(k)

# For another partition P', we can define the uncertainty associated with P' in the same way. 

# We can define the variation of information as: 

#    VI(P,P') = H(P|P') + H(P'|P) 

#  In the context of unsupervised learning, variation of information is useful for comparing outcomes 
#  from a partitional (non-hierarchical) clustering algorithm.
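
The partition-based definition can be sketched by treating cluster labels as the discrete random variable described above. This is a minimal illustration under assumptions: the function names and the toy label arrays are hypothetical, and the mutual information between label arrays is taken from `sklearn.metrics.mutual_info_score`.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def partition_entropy(labels):
    """H(P) = -sum_k p(k) log p(k), with p(k) = |D_k| / |D|."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def partition_vi(labels_a, labels_b):
    """VI(P,P') = H(P|P') + H(P'|P) = H(P) + H(P') - 2 I(P;P')."""
    h_a = partition_entropy(labels_a)
    h_b = partition_entropy(labels_b)
    i_ab = mutual_info_score(labels_a, labels_b)
    return h_a + h_b - 2.0 * i_ab

# Toy clusterings of nine elements (hypothetical labels)
a = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
b = np.array([1, 1, 1, 2, 2, 2, 0, 0, 0])  # same partition as a, clusters renamed
c = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])  # a genuinely different partition

print(f"VI(a, b): {partition_vi(a, b):.4f}")
print(f"VI(a, c): {partition_vi(a, c):.4f}")
print(f"VI(c, a): {partition_vi(c, a):.4f}")
```

Because a partition is an unordered set of subsets, renaming the clusters leaves the variation of information at zero, while a different grouping gives a positive, symmetric distance.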

### 3.12 Conclusions:

- Correlations are useful for quantifying the linear codependency between random variables.  
- This form of codependency accepts various representations as a distance metric.  
- However, when variables X and Y are bound by a nonlinear relationship, the above distance metric misjudges the similarity of these variables.  
- For nonlinear cases, we have argued that the **normalized variation of information** is a more appropriate distance metric.  
- It allows us to answer questions regarding the unique information contributed by a random variable, without having to make functional assumptions.  
- Given that many machine learning algorithms do not impose a functional form on the data, it makes sense to use them in conjunction with entropy-based features.
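
The contrast drawn in these conclusions can be illustrated with a small experiment. This is a sketch under assumptions, not the book's experiment: the relationship y = |x| plus noise, the sample size, the bin count, and the helper name `norm_var_info` are all chosen here for illustration.

```python
import numpy as np
import scipy.stats as ss
from sklearn.metrics import mutual_info_score

def norm_var_info(x, y, bins):
    """Normalized variation of information, VI(X,Y) / H(X,Y), from histogram estimates."""
    c_xy = np.histogram2d(x, y, bins)[0]
    i_xy = mutual_info_score(None, None, contingency=c_xy)
    h_x = ss.entropy(np.histogram(x, bins)[0])
    h_y = ss.entropy(np.histogram(y, bins)[0])
    return (h_x + h_y - 2.0 * i_xy) / (h_x + h_y - i_xy)

rng = np.random.default_rng(0)
n, bins = 5000, 10
x = rng.normal(size=n)
y_nonlinear = np.abs(x) + 0.1 * rng.normal(size=n)  # nonlinear dependence on x
y_independent = rng.normal(size=n)                   # no dependence on x

rho_nonlinear = np.corrcoef(x, y_nonlinear)[0, 1]
vi_nonlinear = norm_var_info(x, y_nonlinear, bins)
vi_independent = norm_var_info(x, y_independent, bins)

print(f"Correlation, nonlinear pair:   {rho_nonlinear:.4f}")
print(f"Normalized VI, nonlinear pair: {vi_nonlinear:.4f}")
print(f"Normalized VI, independent:    {vi_independent:.4f}")
```

The correlation of the nonlinear pair should be close to zero, so a correlation-based distance treats it much like an unrelated pair, whereas the normalized variation of information should place it clearly closer than the independent pair.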
