In [None]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr

## Ch3. Correlation

There are two common measures of correlation:
1. Pearson product-moment correlation
2. Spearman rank correlation

Pearson product-moment correlation is most applicable when the relation between the two variables, which we denote X and Y, is thought to be linear.
It can be roughly interpreted as the signed percentage of variation in X that is related to variation in Y.
The sign is positive if X tends to be high when Y is high, and negative when high values of X tend to correspond to low values of Y.

Spearman rank correlation is most applicable when the relation between the variables is thought to be monotonic but not necessarily linear.
It measures how closely related the ordering of X is to the ordering of Y.
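
To make the difference concrete, here is a minimal sketch on synthetic data (the exponential relation and sample size are assumptions chosen for illustration, not taken from the chapter). Because Y is a strictly increasing function of X, the orderings agree perfectly, while the linear fit is imperfect.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic data (illustrative assumption): Y is a monotonic but
# nonlinear function of X.
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)

# Spearman depends only on the ranks, which are identical here;
# Pearson is pulled below 1 by the curvature.
print(f"Pearson:  {pearson_corr:.3f}")
print(f"Spearman: {spearman_corr:.3f}")
```

In this setup the Spearman correlation is larger than the Pearson correlation, which is the pattern one would expect for a monotonic, nonlinear relation.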

The correlations are computed in two steps:
1. Calculate the cross-sectional correlation between the two variables in question, X and Y, for each period t.
2. Calculate the time-series average of the cross-sectional correlations.
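
The two steps can be written compactly as follows. The notation is an assumption introduced here for illustration: $\rho_t$ denotes the cross-sectional correlation (Pearson or Spearman) in period $t$, computed across the entities $i$ observed in that period, and $T$ is the number of periods.

$$
\rho_t = \operatorname{corr}_i\left(X_{i,t},\, Y_{i,t}\right), \qquad
\bar{\rho} = \frac{1}{T} \sum_{t=1}^{T} \rho_t
$$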

In [None]:
def cal_corr(group, var1, var2, option='all'):
    """
    Calculate the correlation between two variables for a given period.

    Args:
        group (pd.DataFrame): The data for the given period.
        var1 (str): The name of the first variable.
        var2 (str): The name of the second variable.
        option (str): The correlation type to calculate ('pearson', 'spearman', or 'all'). Defaults to 'all'.
        
    Returns:
        pd.Series: A series containing the correlation between the two variables.
    """
    # Drop rows where either variable is missing; scipy does not skip NaNs by default.
    data = group[[var1, var2]].dropna()
    x, y = data[var1], data[var2]

    if option == 'pearson':
        pearson_corr, _ = pearsonr(x, y)
        return pd.Series({'Pearson': pearson_corr})
    elif option == 'spearman':
        spearman_corr, _ = spearmanr(x, y)
        return pd.Series({'Spearman': spearman_corr})
    elif option == 'all':
        pearson_corr, _ = pearsonr(x, y)
        spearman_corr, _ = spearmanr(x, y)
        return pd.Series({'Pearson': pearson_corr, 'Spearman': spearman_corr})
    else:
        raise ValueError("Invalid option for correlation type. Choose from 'pearson', 'spearman', or 'all'.")

In [None]:
def cal_per_corr(df, time_column, specific_date=None):
    """
    Calculate Pearson and Spearman correlations for each time period for all pairs of variables.
    
    Args:
        df (pd.DataFrame): The data frame containing the data.
        time_column (str): The name of the column representing time periods.
        specific_date (str or None): A specific date to filter the data. If None, calculate for all dates.
    
    Returns:
        pd.DataFrame: A data frame containing the Pearson and Spearman correlations for each time period.
    """
    # Restrict to a single period only when a date is explicitly given.
    if specific_date is not None:
        df = df[df[time_column] == specific_date]
    
    variables = [col for col in df.columns if col != time_column]
    correlations = []

    for i, var1 in enumerate(variables):
        for var2 in variables[i+1:]:
            corr_df = df.groupby(time_column).apply(lambda group: cal_corr(group, var1, var2)).reset_index()
            corr_df['Var1'] = var1
            corr_df['Var2'] = var2
            correlations.append(corr_df)

    all_correlations = pd.concat(correlations, ignore_index=True)
    return all_correlations

In [None]:
def cal_ts_avcorr(correlations_df):
    """
    Calculate the time-series averages of the periodic cross-sectional correlations.
    
    Args:
        correlations_df (pd.DataFrame): A data frame containing the periodic correlations.
    
    Returns:
        pd.DataFrame: A data frame containing the time-series average Pearson and Spearman correlations.
    """
    avg_corrs = correlations_df.groupby(['Var1', 'Var2'])[['Pearson', 'Spearman']].mean().reset_index()
    return avg_corrs
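
The two-step procedure can be tried end to end on a small synthetic panel. The sketch below is self-contained and does not call the functions defined above; the column names (`date`, `X`, `Y`), the three periods, and the data-generating process are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Hypothetical panel: 3 periods with 200 entities each.
frames = []
for date in ['2020-01', '2020-02', '2020-03']:
    x = rng.normal(size=200)
    y = 0.5 * x + rng.normal(size=200)  # Y is linearly related to X plus noise
    frames.append(pd.DataFrame({'date': date, 'X': x, 'Y': y}))
panel = pd.concat(frames, ignore_index=True)

# Step 1: cross-sectional correlations for each period.
rows = []
for date, g in panel.groupby('date'):
    p_corr, _ = pearsonr(g['X'], g['Y'])
    s_corr, _ = spearmanr(g['X'], g['Y'])
    rows.append({'date': date, 'Pearson': p_corr, 'Spearman': s_corr})
periodic = pd.DataFrame(rows).set_index('date')

# Step 2: time-series average of the periodic correlations.
averages = periodic.mean()

print(periodic)
print(averages)
```

Averaging the periodic correlations, rather than pooling all observations into one correlation, keeps each period's cross-section on an equal footing regardless of how many entities it contains.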

If the Spearman rank correlation is substantially larger in magnitude than the Pearson product–moment correlation, this likely indicates that there is a monotonic, but not linear, relation between the variables. 
This type of relation signals that linear regression analysis is a potentially problematic statistical technique to apply to the given variables if one of the variables is used as the dependent variable. 

If the Pearson product–moment correlation is substantially larger in magnitude than the Spearman rank correlation, this may indicate that there are a few extreme data points in one of the variables that are exerting a strong influence on the calculation of the Pearson product–moment correlation.
In this case, it is possible that winsorizing one or both of the variables at a higher level will alleviate this issue. 
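
The effect of a single extreme observation, and of winsorizing it away, can be sketched as follows. Everything here is an illustrative assumption: the data are synthetic, the helper `winsorize` is a hypothetical function written for this example, and the 1st/99th percentile cutoffs are arbitrary.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Two independent variables, so the underlying correlation is near zero.
x = rng.normal(size=500)
y = rng.normal(size=500)

# Inject one extreme data point into both variables.
x[0], y[0] = 50.0, 50.0

def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Hypothetical helper: clip values to the given percentile bounds."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

raw_pearson, _ = pearsonr(x, y)
raw_spearman, _ = spearmanr(x, y)
win_pearson, _ = pearsonr(winsorize(x), winsorize(y))

print(f"Pearson (raw):        {raw_pearson:.3f}")
print(f"Spearman (raw):       {raw_spearman:.3f}")
print(f"Pearson (winsorized): {win_pearson:.3f}")
```

In this setup the raw Pearson correlation is dominated by the single outlier, while the Spearman correlation and the winsorized Pearson correlation both stay close to zero.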

Finally, it is worth noting here that, because of the assumption of linearity in the calculation of the Pearson product–moment correlation, this measure is usually more indicative of results that will be realized using regression techniques such as Fama and MacBeth (1973) regression analysis (presented in Chapter 6). Because the Spearman rank correlation is based on the ordering of the variables, Spearman rank correlations are more likely indicative of the results of analyses that rely on the ranking, or ordering, of the variables, such as portfolio analysis (presented in Chapter 5).