## Ch4. Persistence Analysis

Many of the variables in empirical asset pricing research are intended to capture persistent characteristics of the entities in the sample.
This means that the characteristic of the entity that is captured by the given variable is assumed to remain reasonably stable over time.

In this chapter, we discuss a technique that we call persistence analysis. We use persistence analysis to examine whether a given characteristic of the entities in our sample is in fact persistent. Persistence analysis can also be used to examine the ability of the variable in question to capture the desired characteristic of the entity.

First step: calculate the cross-sectional correlations between values of the given variable X measured a certain number of periods apart.

Second step: calculate the time-series average of each of these cross-sectional correlations.

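This step is often left unreported in the empirical asset pricing literature.
However, with hundreds of factors in the factor zoo, checking whether each one is actually consistent over time, in which direction, and whether it shows reversal is, in our view, a very important part of empirical asset pricing work.
We therefore extend Chapter 5 of Empirical Asset Pricing: The Cross Section of Stock Returns, the book on which these notebooks are based.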

In [None]:
# import libraries
import pandas as pd
import numpy as np

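The analysis consists of two steps.
1. Compute the cross-sectional correlation between values of the variable X (here, a firm characteristic or factor) measured a given number of periods apart.
2. Compute the time-series average of each of these cross-sectional correlations.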
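
Both steps depend on matching each entity's value at time $t$ with its own value at $t+\tau$. A minimal sketch of that alignment, assuming a long-format panel with hypothetical columns `firm`, `year`, and `beta` (illustrative names and toy numbers, not data from this notebook) and a hypothetical helper `aligned_pairs`:

```python
import pandas as pd

# Toy long-format panel; column names and values are illustrative assumptions
panel = pd.DataFrame({
    "firm": ["A", "B", "C", "A", "B", "C", "A", "B"],
    "year": [2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002],
    "beta": [0.8, 1.0, 1.3, 0.9, 1.1, 1.2, 1.0, 1.0],
})

# One row per entity, one column per period; missing observations become NaN
wide = panel.pivot(index="firm", columns="year", values="beta")

def aligned_pairs(wide, t, tau):
    """Return the entities observed in both t and t+tau, with their two values."""
    pair = wide[[t, t + tau]].dropna()
    pair.columns = ["X_t", "X_t_tau"]
    return pair

# Firm C has no observation in 2002, so it drops out of the (2000, 2002) pair
pairs = aligned_pairs(wide, 2000, 2)
print(pairs)
```

Reshaping to a wide table is one possible design; merging the two periods on the entity column, as the function later in this chapter does, gives the same matched sample.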

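1. Periodic Cross-Sectional Persistence.  
Compute the cross-sectional correlation between X and the same variable measured $\tau$ periods later.  
This is done for every pair of periods $t$ and $t+\tau$ in the sample.  
By default, we use the Pearson correlation.  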

$$\rho_{t,t+\tau} = \frac{\sum_{i=1}^{n_t}[(X_{i,t}-\bar{X}_t)(X_{i,t+\tau}-\bar{X}_{t+\tau})]}{\sqrt{\sum_{i=1}^{n_t}(X_{i,t}- \bar{X}_t)^2} \sqrt{\sum_{i=1}^{n_t}(X_{i,t+\tau}-\bar{X}_{t+\tau})^2}}$$

The larger $\tau$ is (the longer the lag), the lower the correlation between the two measurements of $X$ tends to be, although this is not always the case.
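
The formula above can be checked by hand on a single pair of periods. The sketch below is a direct transcription under two assumptions: the two inputs are already aligned entity by entity, and pairs with a missing value on either side are dropped. The function name `pearson_pair` and the numbers are illustrative only.

```python
import numpy as np

def pearson_pair(x_t, x_t_tau):
    """Cross-sectional Pearson correlation between X at t and X at t+tau.

    Inputs must be aligned entity by entity. Pairs with a missing value on
    either side are dropped before the statistic is computed.
    """
    x_t = np.asarray(x_t, dtype=float)
    x_t_tau = np.asarray(x_t_tau, dtype=float)
    valid = ~np.isnan(x_t) & ~np.isnan(x_t_tau)
    if valid.sum() < 2:
        return np.nan  # not enough matched entities to define a correlation
    a = x_t[valid] - x_t[valid].mean()
    b = x_t_tau[valid] - x_t_tau[valid].mean()
    denom = np.sqrt((a ** 2).sum()) * np.sqrt((b ** 2).sum())
    return np.nan if denom == 0 else (a * b).sum() / denom

# Illustrative values for four entities observed at t and t+tau
x_t = [0.8, 1.0, 1.3, 1.1]
x_t_tau = [0.9, 1.1, 1.2, np.nan]  # the fourth entity is missing at t+tau
rho = pearson_pair(x_t, x_t_tau)
print(rho)
```

On the three matched entities the result should agree with `np.corrcoef`, which is a convenient cross-check for any hand-written version.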

In [None]:
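def cal_periodic_cs_persistence(df, time_col, value_col, entity_col=None, max_tau=5):
    """
    Compact variant: cross-sectional Pearson correlation of each variable with its own value tau periods later.
    """
    result = {}

    for value in value_col:
        if entity_col is None:
            raise ValueError("entity_col is needed to match entities across periods")
        wide = df.pivot(index=entity_col, columns=time_col, values=value)
        result[value] = pd.DataFrame({
            f't+{tau}': {t: wide[t].corr(wide[t + tau]) for t in wide.columns if t + tau in wide.columns}
            for tau in range(1, max_tau + 1)
        })

    return result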

In [None]:
def cal_periodic_cs_persistence(df, time_column, value_columns, entity_column=None, max_tau=5):
    """
    Calculate the cross-sectional Pearson correlations for multiple variables measured tau periods apart for multiple tau values.
    
    Args:
        df (pd.DataFrame): The data frame containing the data.
        time_column (str): The name of the column representing time periods.
        value_columns (list of str): The names of the columns representing the values of the variables.
        entity_column (str or None): The name of the column representing the entities. If None, calculate persistence without entity grouping.
        max_tau (int): The maximum number of periods apart to measure persistence.
    
    Returns:
        dict of pd.DataFrame: A dictionary containing the persistence correlations for each variable.
    """
    persistence_results = {}

    # Sort once, up front, by the time column (and entity column if provided)
    sort_keys = [time_column, entity_column] if entity_column else [time_column]
    df = df.sort_values(by=sort_keys).reset_index(drop=True)

    for value_column in value_columns:
        
        # Create a dictionary to store the results
        results = {f't+{tau}': [] for tau in range(1, max_tau + 1)}
        results['Year'] = []

        # Get unique time periods
        unique_times = df[time_column].unique()

        # Loop over each time period
        for t in unique_times:
            period_df = df[df[time_column] == t]
            
            if len(period_df) == 0:
                continue

            correlations = []
            
            for tau in range(1, max_tau + 1):
                future_time = t + tau
                
                if future_time not in unique_times:
                    correlations.append(np.nan)
                    continue
                
                shifted_df = df[df[time_column] == future_time]
                
                if entity_column:
                    # Merge on entities to ensure we only consider those with valid values for both t and t+tau
                    merged_df = pd.merge(period_df, shifted_df, on=entity_column, suffixes=('', f'_shifted_{tau}'))
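                else:
                    # Without an entity column, observations are paired by row position within each period;
                    # this is only meaningful if both periods list the same entities in the same order
                    merged_df = pd.concat([period_df.reset_index(drop=True), shifted_df.reset_index(drop=True)], axis=1, keys=['t', 't+tau'])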

                if len(merged_df) == 0:
                    correlations.append(np.nan)
                    continue
                
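                # Select both series and keep only pairs where both values are present,
                # so the means and sums below are computed on the same matched sample
                X_t = merged_df[value_column] if entity_column else merged_df[('t', value_column)]
                X_t_tau = merged_df[f'{value_column}_shifted_{tau}'] if entity_column else merged_df[('t+tau', value_column)]
                valid = X_t.notna() & X_t_tau.notna()
                X_t, X_t_tau = X_t[valid], X_t_tau[valid]
                mean_X_t = X_t.mean()
                mean_X_t_tau = X_t_tau.mean()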

                # Calculate numerator and denominator separately
                numerator = ((X_t - mean_X_t) * (X_t_tau - mean_X_t_tau)).sum()
                denominator = np.sqrt(((X_t - mean_X_t) ** 2).sum() * ((X_t_tau - mean_X_t_tau) ** 2).sum())
                
                if denominator == 0:
                    correlations.append(np.nan)
                else:
                    pearson_corr = numerator / denominator
                    correlations.append(pearson_corr)
            
            results['Year'].append(t)
            for tau, corr in zip(range(1, max_tau + 1), correlations):
                results[f't+{tau}'].append(corr)
        
        results_df = pd.DataFrame(results)
        results_df.set_index('Year', inplace=True)
        persistence_results[value_column] = results_df

    return persistence_results

def calculate_average_persistence(persistence_results, max_tau):
    """
    Calculate the time-series average of the periodic cross-sectional correlations for multiple variables.
    
    Args:
        persistence_results (dict of pd.DataFrame): A dictionary containing the periodic cross-sectional correlations for each variable.
        max_tau (int): The maximum number of periods apart to measure persistence.
    
    Returns:
        pd.DataFrame: A data frame containing the average persistence values for each variable and each lag.
    """
    avg_persistence = {f't+{tau}': [] for tau in range(1, max_tau + 1)}
    avg_persistence['Variable'] = []

    for variable, persistence_df in persistence_results.items():
        avg_persistence['Variable'].append(variable)
        for tau in range(1, max_tau + 1):
            avg_persistence[f't+{tau}'].append(persistence_df[f't+{tau}'].mean())

    avg_persistence_df = pd.DataFrame(avg_persistence)
    avg_persistence_df.set_index('Variable', inplace=True)
    return avg_persistence_df




Average Persistence with Entity Column:
               t+1       t+2       t+3       t+4       t+5
Variable                                                  
beta     -0.012372  0.095060  0.222180  0.313634 -0.391324
Size     -0.077550 -0.379835  0.267808  0.534534 -0.420979
BM       -0.120001 -0.198523  0.396845 -0.014092 -0.130984

Average Persistence without Entity Column:
               t+1       t+2       t+3       t+4       t+5
Variable                                                  
beta     -0.084641 -0.096246  0.034608  0.035583 -0.006334
Size     -0.124825 -0.473607  0.182061  0.390492 -0.433567
BM       -0.118471 -0.340782  0.142714  0.067371  0.007421


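2. Average Cross-Sectional Persistence.  
The periodic cross-sectional persistence values are hard to read and to draw conclusions from.  
To summarize them, we take the time-series average of the periodic cross-sectional persistence for each lag $\tau$, where $N$ is the number of periods in the sample:
$$\rho_{\tau}(X) = \frac{\sum_{t=1}^{N-\tau} \rho_{t,t+\tau}(X)}{N-\tau}$$ 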
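
The averaging step can be sketched as follows, assuming the periodic correlations are stored with one row per period $t$ and one column per lag, which is the layout `cal_periodic_cs_persistence` returns for each variable. The numbers are made up for illustration, and `average_persistence` is a hypothetical helper name. Lags that run past the end of the sample are NaN, so a NaN-skipping mean divides by the $N-\tau$ available correlations.

```python
import numpy as np
import pandas as pd

# Made-up periodic correlations for N = 4 periods and lags 1..3;
# rho_{t,t+tau} is undefined (NaN) once t + tau runs past the last period
periodic = pd.DataFrame(
    {
        "t+1": [0.90, 0.80, 0.85, np.nan],
        "t+2": [0.70, 0.60, np.nan, np.nan],
        "t+3": [0.50, np.nan, np.nan, np.nan],
    },
    index=pd.Index([2000, 2001, 2002, 2003], name="Year"),
)

def average_persistence(periodic):
    """Time-series mean of the periodic correlations, one value per lag."""
    return periodic.mean(axis=0, skipna=True)

avg = average_persistence(periodic)
n_used = periodic.notna().sum()  # number of correlations averaged per lag
print(avg)
print(n_used)
```

In this toy table the counts are 3, 2, and 1, that is $N-\tau$ for each lag. If a periodic correlation is also NaN for another reason, such as zero cross-sectional variance, the mean is taken over fewer than $N-\tau$ values, which is worth checking before comparing lags.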

3. Changes in correlation... how should they be measured?