<div class="alert alert-block alert-info">
    
<center> 
    
# __Lead-Lag Portfolios__ 
### __Introduction to Signature and Lévy Area__ 
    
</center>

</div>

## __1. Introduction__

The signature of a path, or **path signature**, finds intriguing applications in the field of machine learning. The core concept behind using the signature transformation for machine learning problems stems from its ability to extract characteristic features from data. Embedding data into a path and computing its signature provides crucial insights into the original data. The workflow is straightforward and can be summarized by the following algorithm:

$$\text{data} \rightarrow \text{path} \rightarrow \text{signature of path} \rightarrow \text{features of data}$$

This algorithm is entirely general and applicable to any type of sequential data that can be embedded into a continuous path. The extracted features can be utilized for various machine learning applications, encompassing both supervised and unsupervised learning. For instance, it can be employed to classify time-series data or distinguish clusters within a dataset. One notable advantage of feature extraction with the signature method is its sensitivity to the geometric shape of a path.

The signature method finds natural applications in quantitative finance, particularly in the analysis of time-series data. Time-series data represent ordered sequential information, making them ideal candidates for creating a path from the data, followed by computing the signature and applying machine learning algorithms for further analysis. Any form of time-ordered sequential data naturally fits into the signature framework. Additionally, if the input data come from several parallel sources, this results in a multi-dimensional path. An example of such data is panel data in econometrics, involving repeated observations of the same quantity over periods of time.

In this notebook, we will cover essential concepts for lead-lag identification using the Lévy area and its implementation in Python. We encourage readers to dedicate some time to study the exceptional [A Primer on the Signature Method in Machine Learning](https://arxiv.org/abs/1603.03788) paper to gain a thorough understanding of the tool, its capabilities, and other applications.

## **2. From Path Signature to Lévy Area**

Without delving into the mathematical intricacies behind the path signature and Lévy area, this section will begin with practical numerical examples in Python. We will illustrate how to transform time series data into a path, compute the path signature, and, ultimately, derive the Lévy area—the pairwise lead-lag metric of interest in this repository:

$$\text{time series data} \rightarrow \text{path} \rightarrow \text{signature of path} \rightarrow \text{Lévy area}$$

By working through these numerical examples, you can develop an intuition about signatures and Lévy areas before delving into the underlying mathematical concepts on your own. Once you are familiar with the basics, understanding the mathematical foundations and contemplating potential applications will become more accessible. We recommend dedicating more time to explore available literature for a deeper understanding of the topic.

#### __Example 1 - Signature Terms and Lévy Area__

The key ingredient of the signature method is __a path constructed from data__. The path is a continuous, piecewise linear interpolation of the data points.
For example, consider a collection of pairs $(X^{1},X^{2})$, where $X^{1} = \{1,3,5,8\}$ and $X^{2} = \{1,4,2,6\}$. First, we need to construct a path from this data:

In [1]:
import numpy as np

# Define data arrays
X1 = np.array([1., 3., 5., 8.]).reshape((-1,1))
X2 = np.array([1., 4., 2., 6.]).reshape((-1,1))

# Construct a path from the data
path = np.append(X1, X2, axis=1)
print(f"Path Length = {path.shape[0]}, Path Dimension = {path.shape[1]}.")
print(f'Path = \n{path}')

Path Length = 4, Path Dimension = 2.
Path = 
[[1. 1.]
 [3. 4.]
 [5. 2.]
 [8. 6.]]


Now that we have the path, let's calculate its signature. The signature is a transformation (mapping) from a path into a collection of real numbers. Each term in the collection has a particular geometric meaning as a function of the data points, so it is worth understanding each term in the resulting signature. In general, the terms of the signature are given by iterated integrals of the path.

The signature **truncated at level 2**, whose components will be used to calculate the Lévy area, has the following form:
$$
S = \{1, S^{(1)}, S^{(2)}, S^{(1,1)}, S^{(1,2)}, S^{(2,1)}, S^{(2,2)}\}
$$

Using **`iisignature`**, we can get all the components of $S$ except the first one, which is always $1$:

In [2]:
import iisignature

signature = iisignature.sig(path, 2)   # signature of path up to level 2
signature

array([ 7. ,  5. , 24.5, 19. , 16. , 12.5])

So, the components of the path signature are:
$$
S^{(1)} = 7, \ S^{(2)} = 5, \ S^{(1,1)} = 24.5, \ S^{(1,2)} = 19, \ S^{(2,1)} = 16, \  S^{(2,2)} = 12.5
$$

- $S^{(1)}$ and $S^{(2)}$ are the first-order terms. They correspond to the total increment of the path along each dimension (end point minus start point).
- $S^{(1,1)} = \frac{1}{2}\left(S^{(1)}\right)^{2}$ and $S^{(2,2)} = \frac{1}{2}\left(S^{(2)}\right)^{2}$ are half the squares of the corresponding first-order terms.
- $S^{(1,2)}$ and $S^{(2,1)}$ are the second-order cross terms. They correspond to areas bounded by the path, as shown in the figure below.
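
As a cross-check that does not depend on `iisignature`, the level-1 and level-2 terms of a piecewise linear path can be assembled directly from its increments. The following is a minimal NumPy sketch; the helper name `sig_level2` is hypothetical and not part of any library.

```python
import numpy as np

def sig_level2(path: np.ndarray):
    """Level-1 and level-2 signature terms of a piecewise linear path (sketch)."""
    inc = np.diff(path, axis=0)             # segment increments, shape (n-1, d)
    level1 = inc.sum(axis=0)                # S^(i): total increment per dimension
    before = np.cumsum(inc, axis=0) - inc   # displacement from the start at each segment start
    # S^(i,j) = sum_k before_k^i * inc_k^j + 0.5 * sum_k inc_k^i * inc_k^j
    level2 = before.T @ inc + 0.5 * (inc.T @ inc)
    return level1, level2

path = np.array([[1., 1.], [3., 4.], [5., 2.], [8., 6.]])
level1, level2 = sig_level2(path)
print(level1)   # values 7 and 5
print(level2)   # rows (24.5, 19) and (16, 12.5)

# The cross terms add up to the product of the first-order terms
print(np.isclose(level2[0, 1] + level2[1, 0], level1[0] * level1[1]))   # True
```

The last check reflects the identity $S^{(1,2)} + S^{(2,1)} = S^{(1)} S^{(2)}$, which is why only the difference of the two cross terms carries new information.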

The geometric interpretation of the second-order terms $S^{(1,2)}$ and $S^{(2,1)}$ is presented in the figure below. The blue line is the path we constructed earlier from $X^{1} = \{1,3,5,8\}$ and $X^{2} = \{1,4,2,6\}$.

<center>
  <img src="../images/sig_area_12_21.png" width="700"/>
</center>


Now, we can calculate the Lévy area for this path using the following formula:

$$
\text{Lévy area} = \frac{1}{2}\left(S^{(1,2)}-S^{(2,1)}\right)
$$

For the geometric interpretation, see the figure below. The Lévy area is the signed area enclosed by the piecewise linear path (blue) and the chord connecting its end points (red dashed line). The light blue area represents negative values, while the pink area represents positive values.

<center>
  <img src="../images/levy_area.png" width="400"/>
</center>

So, the Lévy area for our hypothetical path is given by:

In [3]:
levy_area = 0.5 * (signature[3] - signature[4])
print(f'Lévy area = {levy_area}')

Lévy area = 1.5


We can wrap this in a function that computes the Lévy area of a two-dimensional path from its level-2 signature, which is enough for our purposes in this project:

In [4]:
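def calc_levy_area(path: np.ndarray) -> float:
    """Lévy area of a two-dimensional path of shape (n_points, 2)."""
    # Signature up to level 2: [S1, S2, S11, S12, S21, S22]
    signature = iisignature.sig(path, 2)

    # Lévy area = 0.5 * (S12 - S21)
    levy_area = 0.5 * (signature[3] - signature[4])

    return levy_area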
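
For a piecewise linear path, the same quantity can also be computed without the signature, by summing the signed area contribution of each segment relative to the starting point. The sketch below is a dependency-light alternative for cross-checking, not a replacement for `calc_levy_area`; the name `levy_area_direct` is hypothetical.

```python
import numpy as np

def levy_area_direct(path: np.ndarray) -> float:
    """Lévy area of a 2-D piecewise linear path, computed from increments (sketch)."""
    inc = np.diff(path, axis=0)    # segment increments (dx_k, dy_k)
    rel = path[:-1] - path[0]      # segment start points relative to the path start
    # 0.5 * sum_k [(x_k - x_0) * dy_k - (y_k - y_0) * dx_k]
    return 0.5 * float(np.sum(rel[:, 0] * inc[:, 1] - rel[:, 1] * inc[:, 0]))

path = np.array([[1., 1.], [3., 4.], [5., 2.], [8., 6.]])
print(levy_area_direct(path))   # 1.5, matching 0.5 * (19 - 16)
```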

#### __Example 2 - Lévy Area as Lead-Lag Metric__

The Lévy area can serve as a signature lead-lag metric. It is positive and grows larger when increases (or decreases) in $X^{i}$ are followed by increases (or decreases) in $X^{j}$. If the relative moves of $X^{i}$ and $X^{j}$ are in opposite directions, then the signature lead–lag measure is negative.

In this specific example, we set $X^{1} = \{0, 0.5, 2, 2.5\}$ and $X^{2} = \{0.5, 2, 2.5, 3.5\}$, meaning we artificially set $X^{2}$ to lead $X^{1}$. Consequently, the Lévy area between $X^{1}$ and $X^{2}$ is negative. The Lévy area between these two time series is illustrated in the figure below.

<center>
    <img src="../images/levy_area_lead_lag.png" width="500">
</center>


In [5]:
# define time series
X1 = np.array([0., 0.5, 2.0, 2.5]).reshape((-1,1))
X2 = np.array([0.5, 2.0, 2.5, 3.5]).reshape((-1,1))

# Construct a path from the time series
path = np.append(X1, X2, axis=1)

# calculate levy area
print(f'Lévy area between two time series is equal to {calc_levy_area(path)}')

Lévy area between two time series is equal to -0.5
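
The sign convention can be checked on a toy path in which the first coordinate moves before the second. The sketch below uses a hypothetical helper, `levy_area_2d`, that computes the area directly from the increments so it runs without `iisignature`; swapping the two columns should flip the sign.

```python
import numpy as np

def levy_area_2d(path: np.ndarray) -> float:
    """Lévy area of a 2-D piecewise linear path (illustrative helper)."""
    inc = np.diff(path, axis=0)
    rel = path[:-1] - path[0]
    return 0.5 * float(np.sum(rel[:, 0] * inc[:, 1] - rel[:, 1] * inc[:, 0]))

# The first coordinate rises first, the second follows one step later
leader_first = np.array([[0., 0.], [1., 0.], [1., 1.]])
# Same data with the columns swapped: now the first coordinate is the follower
follower_first = leader_first[:, ::-1]

print(levy_area_2d(leader_first))     # 0.5  -> first coordinate leads
print(levy_area_2d(follower_first))   # -0.5 -> first coordinate lags
```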


#### __Example 3 - Examining Lead-Lag Relationships between Bitcoin and Ethereum Using Lévy Area__

As mentioned, the Lévy area can be used to examine the lead-lag relationship between two time series. In this example, we obtain daily prices of Bitcoin and Ethereum for September and October 2023 (about 60 days) and apply the Lévy area lead-lag metric to determine which one leads the other. To do so, we convert the prices into returns and standardize them, so that both series are on a comparable scale. We then calculate the Lévy area between the two series of standardized daily returns.

In [6]:
import yfinance as yf
# Define ticker symbols
tickers = ['BTC-USD', 'ETH-USD']

# Download historical data and keep only the adjusted close prices
data = yf.download(tickers, start='2023-09-01', end='2023-11-01')
data = data['Adj Close']

data



| Date | BTC-USD | ETH-USD |
|---|---|---|
| 2023-09-01 | 25800.724609 | 1628.491211 |
| 2023-09-02 | 25868.798828 | 1637.025391 |
| 2023-09-03 | 25969.566406 | 1636.117676 |
| 2023-09-04 | 25812.416016 | 1629.655273 |
| 2023-09-05 | 25779.982422 | 1633.629395 |
| ... | ... | ... |
| 2023-10-27 | 33909.800781 | 1780.045288 |
| 2023-10-28 | 34089.574219 | 1776.618164 |
| 2023-10-29 | 34538.480469 | 1795.546021 |
| 2023-10-30 | 34502.363281 | 1810.088623 |


In [7]:
# Calculate the returns based on 'Adj Close'
rets = data.pct_change().dropna()

# Standardize the returns
std_rets = (rets - rets.mean()) / rets.std()

# Define the path
path = std_rets.values

# Calculate Levy area
print(f'Lévy area between Bitcoin and Ethereum is equal to {calc_levy_area(path)}')

Lévy area between Bitcoin and Ethereum is equal to -4.994685270177154


Bitcoin is the first coordinate of the path and Ethereum the second, so a Lévy area of about -4.99 suggests that Ethereum led Bitcoin over the selected period.
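
The same steps can be tried offline on synthetic data where the lead-lag structure is known in advance. In the sketch below, the returns are simulated rather than downloaded: the second column is the first column delayed by one day. The helper `levy_area_2d` is a hypothetical stand-in for `calc_levy_area` that avoids the `iisignature` dependency.

```python
import numpy as np

def levy_area_2d(path: np.ndarray) -> float:
    """Lévy area of a 2-D piecewise linear path (illustrative helper)."""
    inc = np.diff(path, axis=0)
    rel = path[:-1] - path[0]
    return 0.5 * float(np.sum(rel[:, 0] * inc[:, 1] - rel[:, 1] * inc[:, 0]))

rng = np.random.default_rng(0)
noise = rng.standard_normal(501)

# Simulated daily returns: `follower` repeats `leader` one day later
leader = noise[1:]
follower = noise[:-1]
rets = np.column_stack([leader, follower])

# Standardize each column (ddof=1 mirrors the pandas default for .std())
std_rets = (rets - rets.mean(axis=0)) / rets.std(axis=0, ddof=1)

area = levy_area_2d(std_rets)
# A positive area is expected here, since the first column leads the second
print(f"Lévy area = {area:.2f}, first column leads: {area > 0}")
```

Subtracting the mean does not change the Lévy area, and dividing by the standard deviation only rescales it, so standardization affects the magnitude of the metric but not its sign.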

## **3. Generating Lévy Area Lead-Lag Matrices**

In this project, we will apply the lead-lag metric described in this notebook to a price panel of cryptocurrencies and generate a matrix $S$, where $S_{ij}$ represents the Lévy area between the $i^{th}$ and $j^{th}$ cryptocurrencies. We will use this matrix to sort assets from the most likely leader to the most likely follower, according to the row averages of the matrix. All of these operations will be executed by a class called **`Levy()`** and its methods, as shown in the following code blocks. First, let's create a price panel with 10 cryptocurrencies:

In [8]:
# Tickers of the price panel
tickers = ['BTC-USD', 'ETH-USD', 'BNB-USD', 'XRT-USD', 'SOL-USD',
           'ADA-USD', 'DOGE-USD', 'TRX-USD', 'MATIC-USD', 'DOT-USD']

# Download historical data and keep only the adjusted close prices
data = yf.download(tickers, start='2023-09-01', end='2023-11-01')
data = data['Adj Close']

# Remove -USD from ticker names
data.columns = data.columns.str.replace('-USD', '')
data



| Date | ADA | BNB | BTC | DOGE | DOT | ETH | MATIC | SOL | TRX | XRT |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023-09-01 | 0.255185 | 213.630554 | 25800.724609 | 0.063849 | 4.215832 | 1628.491211 | 0.540550 | 19.327845 | 0.076093 | 2.014822 |
| 2023-09-02 | 0.256230 | 214.448547 | 25868.798828 | 0.063494 | 4.261871 | 1637.025391 | 0.540820 | 19.493744 | 0.077032 | 2.001043 |
| 2023-09-03 | 0.255793 | 214.399948 | 25969.566406 | 0.063173 | 4.261633 | 1636.117676 | 0.542173 | 19.578747 | 0.077041 | 1.961706 |
| 2023-09-04 | 0.256146 | 215.197510 | 25812.416016 | 0.063185 | 4.253158 | 1629.655273 | 0.553666 | 19.499409 | 0.077430 | 1.952594 |
| 2023-09-05 | 0.257614 | 214.478134 | 25779.982422 | 0.063992 | 4.256727 | 1633.629395 | 0.558237 | 20.272385 | 0.077418 | 1.966736 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2023-10-27 | 0.289468 | 224.475906 | 33909.800781 | 0.067881 | 4.139360 | 1780.045288 | 0.609734 | 31.739698 | 0.093628 | 1.884600 |
| 2023-10-28 | 0.291250 | 225.774078 | 34089.574219 | 0.069001 | 4.180108 | 1776.618164 | 0.620639 | 31.653318 | 0.094332 | 1.880885 |
| 2023-10-29 | 0.295283 | 227.148453 | 34538.480469 | 0.069359 | 4.312546 | 1795.546021 | 0.638230 | 32.822609 | 0.094879 | 1.923370 |
| 2023-10-30 | 0.302734 | 228.287430 | 34502.363281 | 0.069650 | 4.527799 | 1810.088623 | 0.649928 | 34.962337 | 0.095611 | 1.924383 |


Now, you can pass this price panel to an instance of the **`Levy()`** class and then execute the **`generate_levy_matrix()`** method to obtain the $S$ matrix:

In [9]:
import sys
sys.path.append('../src')

from levy import Levy

levy_leadlag = Levy(price_panel=data)
levy_s_matrix = levy_leadlag.generate_levy_matrix()
levy_s_matrix.style.format("{:.3}")

| | ADA | BNB | BTC | DOGE | DOT | ETH | MATIC | SOL | TRX | XRT |
|---|---|---|---|---|---|---|---|---|---|---|
| ADA | 0.0 | 0.309 | 7.68 | -1.08 | 1.51 | 2.2 | -0.771 | 1.75 | -3.82 | 6.49 |
| BNB | -0.309 | 0.0 | 12.6 | 0.797 | -0.202 | 5.16 | -1.87 | 2.39 | -1.15 | 11.5 |
| BTC | -7.68 | -12.6 | 0.0 | -0.238 | -5.43 | -4.99 | -7.33 | 0.394 | -4.82 | 2.37 |
| DOGE | 1.08 | -0.797 | 0.238 | 0.0 | -2.81 | 0.821 | -5.86 | -2.03 | -2.55 | 6.95 |
| DOT | -1.51 | 0.202 | 5.43 | 2.81 | 0.0 | 1.57 | -4.81 | 1.31 | -4.89 | 6.85 |
| ETH | -2.2 | -5.16 | 4.99 | -0.821 | -1.57 | 0.0 | -3.18 | 0.405 | -6.14 | 4.62 |
| MATIC | 0.771 | 1.87 | 7.33 | 5.86 | 4.81 | 3.18 | 0.0 | -0.0124 | -7.7 | 6.09 |
| SOL | -1.75 | -2.39 | -0.394 | 2.03 | -1.31 | -0.405 | 0.0124 | 0.0 | -6.1 | 0.701 |
| TRX | 3.82 | 1.15 | 4.82 | 2.55 | 4.89 | 6.14 | 7.7 | 6.1 | 0.0 | 6.39 |
| XRT | -6.49 | -11.5 | -2.37 | -6.95 | -6.85 | -4.62 | -6.09 | -0.701 | -6.39 | 0.0 |


In the matrix $S$, the entry $S_{ij}$ is the Lévy area between asset $i$ and asset $j$. The matrix is antisymmetric: $S_{ij} = -S_{ji}$, and $S_{ij} = 0$ if $i=j$. We can calculate a global lead-lag score for each asset using the **`score_assets()`** method. This method calculates the row means of $S$. Assets with higher means are more likely to be leaders, while assets with lower means are more likely to be followers, over the time period of the price panel.

In [10]:
levy_leadlag.score_assets()

| | ADA | BNB | BTC | DOGE | DOT | ETH | MATIC | SOL | TRX | XRT |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023-10-31 | 1.426738 | 2.896128 | -4.034701 | -0.495795 | 0.696117 | -0.904872 | 2.218807 | -0.960172 | 4.357435 | -5.199685 |
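
The source of the `Levy` class is not shown in this notebook, so the following is only a rough sketch of how a pairwise Lévy matrix and row-mean scores could be computed. It is an assumption for illustration, not the repository's implementation: the names `levy_area_2d` and `levy_matrix` are hypothetical, and the input is simulated standardized returns in which column 0 leads column 1, which in turn leads column 2.

```python
import numpy as np

def levy_area_2d(path: np.ndarray) -> float:
    """Lévy area of a 2-D piecewise linear path (illustrative helper)."""
    inc = np.diff(path, axis=0)
    rel = path[:-1] - path[0]
    return 0.5 * float(np.sum(rel[:, 0] * inc[:, 1] - rel[:, 1] * inc[:, 0]))

def levy_matrix(std_rets: np.ndarray) -> np.ndarray:
    """Antisymmetric matrix of pairwise Lévy areas between columns (sketch)."""
    n_assets = std_rets.shape[1]
    S = np.zeros((n_assets, n_assets))
    for i in range(n_assets):
        for j in range(i + 1, n_assets):
            S[i, j] = levy_area_2d(std_rets[:, [i, j]])
            S[j, i] = -S[i, j]          # S_ji = -S_ij, so compute each pair once
    return S

rng = np.random.default_rng(0)
noise = rng.standard_normal(502)
# Simulated returns: each column is the previous one delayed by one step
rets = np.column_stack([noise[2:], noise[1:-1], noise[:-2]])
std_rets = (rets - rets.mean(axis=0)) / rets.std(axis=0, ddof=1)

S = levy_matrix(std_rets)
scores = S.mean(axis=1)                 # row means, diagonal zeros included
ranking = np.argsort(-scores)           # from likely leader to likely follower
print(np.round(scores, 2))
print(ranking)
```

The row mean here includes the zero diagonal entry, which is consistent with the scores shown above (for example, the ADA row sums to about 14.27 and its score is about 1.43).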
