# (Alternative Approach) TimeSeries Exploratory Data Analysis
This notebook takes a different approach to simulating the slider effect. Its results are similar to those of the main notebook, albeit slightly less relevant.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error


pd.set_option('display.max_columns', None)

In [2]:
df_base = pd.read_csv("../Data_Sets/processed/economicData_1995-2022.csv")  # forward slashes avoid invalid escape sequences such as "\D"

In [3]:
#years_delta = [1,6,15,25]
years_delta = [1,2,3,4,5,6,8,10,12,15,20,25]

### TimeSeriesDoubleDeltaGrowthAnalysis()
To assist us in this journey, let's create the function `TimeSeriesDoubleDeltaGrowthAnalysis`. In summary:

**Purpose:** This function computes how changes in a predictor variable (like 'Overall Score') relate to changes in a target variable (like 'GDP per capita') over various time ranges.

**Working:**
  - Cleans data and computes an Exponential Moving Average (EMA) for the predictor.
  - For each time period (delta), it:
	- Shifts the data by that period.
	- Calculates the percentage change for the target.
	- Fits a polynomial regression model using the difference from EMA as the predictor.

**Output:** The function outputs regression results (coefficients, intercepts, R^2 scores and RMSE) for each time shift and an augmented DataFrame (df_withDeltas) that contains computed changes for each time period.
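
The steps above can be sketched on a toy series without pandas. This is only an illustration under stated assumptions: the helper names `ema` and `forward_pct_change` and all numbers are made up, and the recursion is written to mirror what `ewm(span=EMAdepth, adjust=False).mean()` computes.

```python
def ema(values, span):
    """Recursive EMA, seeded with the first value (like ewm(adjust=False))."""
    alpha = 2 / (span + 1)
    out = []
    for i, v in enumerate(values):
        out.append(v if i == 0 else alpha * v + (1 - alpha) * out[-1])
    return out


def forward_pct_change(series_by_year, delta):
    """% change of the target between `year` and `year + delta`, where both exist."""
    return {
        year: (series_by_year[year + delta] - value) / value * 100
        for year, value in series_by_year.items()
        if year + delta in series_by_year
    }


# Toy data for a single country (illustrative values only)
scores = [60.0, 62.0, 61.0, 65.0]
gdp = {2000: 100.0, 2001: 110.0, 2002: 121.0, 2003: 133.1}

score_ema = ema(scores, span=3)                              # smoothed predictor
diff_from_ema = [s - e for s, e in zip(scores, score_ema)]   # regression input X
gdp_change_2 = forward_pct_change(gdp, delta=2)              # regression target y for delta = 2

print(diff_from_ema)
print(gdp_change_2)
```

Because the EMA is seeded with the first observation, the first difference of every series is exactly zero; the function replaces such zeros with tiny random noise before fitting.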

In [4]:
def TimeSeriesDoubleDeltaGrowthAnalysis(
        df,
        predictor = 'Overall Score',
        target = 'GDP per capita (current USD)',
        quantiledTarget = 'Country Quintile',
        timeField = 'Index Year',
        mergeFields = ['Country Name', 'Index Year'],
        timePeriods_delta = [1,6,10],
        polyOrder = 1,
        EMAdepth = 3
    ):
    '''
    In a nutshell, this function analyzes time-series data using a time-slider-type analysis.
    
    predictor = the (single) variable name you want to analyze. This function doesn't analyze more than one predictor at once.
    target = the dependent variable name we want to predict for.
    quantiledTarget = categorical data to split the data (useful for later visualization)
    timeField = the name of the time column you're using
    mergeFields = a list to define the Primary/Composite Key of the DataSet.
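    timePeriods_delta = list of delta ranges to test the data against
    polyOrder = degree of the polynomial features used in the regression
    EMAdepth = span of the Exponential Moving Average applied to the predictor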
    '''
    
    # Initializing Base DataFrame
    cleaned_data = df[[
                        mergeFields[0],
                        mergeFields[1],
                        predictor,
                        target,
                        quantiledTarget
                    ]].dropna()
    
    # Initializing Lists and DF to return
    coefficients = []
    intercepts = []
    r2_scores = []
    RMSE = []    
    
    # Precompute the EMA and EMA diff for the predictor, per entity (first merge field, e.g. 'Country Name')
    cleaned_data[f'{predictor}_EMA_{EMAdepth}'] \
        = cleaned_data.groupby(mergeFields[0])[predictor]\
            .transform(lambda x: x.ewm(span=EMAdepth, adjust=False).mean())
    
    cleaned_data[f'{predictor}_diff_from_EMA'] = cleaned_data[predictor] - cleaned_data[f'{predictor}_EMA_{EMAdepth}']
    
    # Replace exact zeros in the diff with tiny random noise (certain regressions break with 0 values here).
    # These zeros occur because certain fields, like Judicial Effectiveness, only begin late in the series.
    cleaned_data[f'{predictor}_diff_from_EMA']\
        = cleaned_data[f'{predictor}_diff_from_EMA'].transform(lambda x: np.random.normal(0, 0.000001) if x == 0 else x)
    
    
    df_withDeltas = cleaned_data.copy() #just initializing, for later appending
    
    # Loop over each year delta, compute & store the merged data, and fit a Regression Model
    for delta in timePeriods_delta:
        
        # ----------------- Time Slider Effect Calculation -----------------
        # Create a shifted dataframe to compute the changes
        shifted_data = cleaned_data[[*mergeFields, target]].copy()
        shifted_data[timeField] -= delta
        
        merged_data = pd.merge(
            cleaned_data,
            shifted_data[[*mergeFields, target]],  # Only include the shifted target
            on=mergeFields,
            suffixes=('', f'_plus_{delta}'),
            indicator=False,
            how='inner'
        )
        
        # If there's no data for this range, go to the next iteration
        if merged_data.shape[0] == 0:
            print(f"No overlapping data for {predictor} with a time delta of {delta}. Appending NaN values for this delta.")

            coefficients.append(np.nan)
            intercepts.append(np.nan)
            r2_scores.append(np.nan)
            RMSE.append(np.nan)
            continue
        
        # Compute the percentage change for target
        merged_data[f'{target}_change_{delta}']\
            = ((merged_data[f'{target}_plus_{delta}']\
               - merged_data[target])\
               / merged_data[target]) * 100
        
        
        # Merge results to df_withDeltas based on mergeFields
        df_withDeltas = pd.merge(
            df_withDeltas,
            merged_data[[*mergeFields, f'{target}_change_{delta}']],
            on=mergeFields,
            how='left'
        )        
        
        # -------------------- Model Building & Fitting --------------------
        X = merged_data[f'{predictor}_diff_from_EMA'].values.reshape(-1, 1)
        y = merged_data[f'{target}_change_{delta}']

        # Generate polynomial features    
        poly = PolynomialFeatures(degree=polyOrder)
        
        X_poly = poly.fit_transform(X)
        
        
        # Instantiate Regression
        model = LinearRegression().fit(X_poly, y)

        # Predict using the polynomial features
        yHat = model.predict(X_poly)

        # Store the results
        coefficients.append(sum(np.abs(model.coef_)))  # Note: sum of the absolute values of all polynomial-term coefficients (the sign is discarded)
        intercepts.append(model.intercept_)
        r2_scores.append(model.score(X_poly, y))
        RMSE.append(np.sqrt(mean_squared_error(y, yHat)))

    # Create a results DataFrame
    regression_results = pd.DataFrame({
        'Years Ahead': timePeriods_delta,
        'Coefficient (target to predictor % Change)': coefficients,
        'Intercept': intercepts,
        'R^2 Score': r2_scores,
        'RMSE': RMSE
    })
    
    return regression_results, df_withDeltas


#_, tempDFs = TimeSeriesDoubleDeltaGrowthAnalysis(df_base, timePeriods_delta = years_delta, predictor='Judicial Effectiveness')

In [5]:
#tempDFs[tempDFs['Index Year'] == 2017]
#tempDFs

### PlotTimeSeriesDoubleDeltaGrowthAnalysis()
This function visualizes how changes in the predictor variable (like 'Overall Score') correlate with changes in the target variable (like 'GDP per capita') for different time shifts.
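
The plotting function below draws one panel per time delta on a fixed 4x3 grid, which matches the 12 entries of `years_delta`. A minimal sketch, assuming one wanted the grid to adapt to a shorter or longer list of deltas; `grid_shape` is a hypothetical helper, not part of the notebook:

```python
import math


def grid_shape(n_panels, ncols=3):
    """Return (nrows, ncols) large enough to hold n_panels subplots."""
    if n_panels < 1:
        raise ValueError("n_panels must be at least 1")
    ncols = min(ncols, n_panels)          # never more columns than panels
    nrows = math.ceil(n_panels / ncols)   # round up so the last row may be partial
    return nrows, ncols


# 12 deltas reproduce the notebook's fixed 4x3 layout
shape_12 = grid_shape(12)
# 5 deltas need only two rows of three
shape_5 = grid_shape(5)

print(shape_12, shape_5)
```

The resulting pair could be passed to `plt.subplots(nrows=..., ncols=...)`; any unused axes in a partially filled last row would still need to be hidden.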


In [6]:
def PlotTimeSeriesDoubleDeltaGrowthAnalysis(
        df,
        predictor = 'Overall Score',
        target = 'GDP per capita (current USD)',
        quantiledTarget = 'Country Quintile',
        timePeriods_delta = [],
        polyOrder = 1
    ):

    '''
    In a nutshell, a scatterplot of the data created in the function: TimeSeriesDoubleDeltaGrowthAnalysis
    See that function for more information on the variables.
    '''

    # Set up the subplots
    fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(15, 20))
    axes = axes.ravel()
    
    # Loop over each period delta, compute the merged data, and plot the data with the regression line
    for i, delta in enumerate(timePeriods_delta):
        
        predictor_column = f'{predictor}_diff_from_EMA'  # must match the column created in TimeSeriesDoubleDeltaGrowthAnalysis
        target_column = f'{target}_change_{delta}'

        # Check if the columns exist in the dataframe
        if predictor_column not in df.columns or target_column not in df.columns:
            print(f"No data for {predictor} with a time delta of {delta}. Skipping plot for this delta.")
            continue
        
        # Extract the features and target variable
        X = df[predictor_column]
        y = df[target_column]
        
        hue = df[quantiledTarget]
        quintile_order = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5']

        # Plot the data
        sns.scatterplot(x=X, y=y, hue=hue, ax=axes[i], s=10, palette="RdBu", hue_order=quintile_order)    
                
        # Plot Non-Linear Regression line
        sns.regplot(x=X, y=y, ax=axes[i], order=polyOrder, color='tan', scatter_kws={'s': 0})

        axes[i].set_title(f'Periods Ahead = {delta}')
        axes[i].set_xlabel(f'Recent Change in {predictor}')
        axes[i].set_ylabel(f'% Change in {target}')
        axes[i].legend(title=quantiledTarget)
        
        #if i >= 4:
        #    ymin, ymax, xmin, xmax = -200, 1000, -10, 10
        #else:
        #    ymin, ymax, xmin, xmax = -50, 400, -5, 5
            
        #axes[i].set_ylim(ymin=ymin, ymax=ymax)
        #axes[i].set_xlim(xmin=xmin, xmax=xmax)

    plt.tight_layout()
    plt.show()


## Economic Freedom: Analyzing each aspect on its own
We'll feed our functions with the Overall Score of Heritage's Economic Freedom Index and with its four pillars (Rule of Law, Limited Government, Regulatory Efficiency and Open Markets), each computed as the average of its three sub-categories, so that all 12 sub-categories are covered.
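
The pillar averages below rely on `DataFrame.mean(axis=1)`, which skips missing values by default. A minimal sketch on made-up numbers of what that means for a pillar whose sub-category starts late in the data (the toy frame and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy rows: the first one has no Judicial Effectiveness value yet
toy = pd.DataFrame({
    'Property Rights': [50.0, 60.0],
    'Government Integrity': [40.0, 50.0],
    'Judicial Effectiveness': [np.nan, 70.0],
})
cols = ['Property Rights', 'Government Integrity', 'Judicial Effectiveness']

# Default behaviour: NaN is ignored, so the first row averages two components
mean_default = toy[cols].mean(axis=1)

# Strict alternative: any missing component makes the whole average NaN
mean_strict = toy[cols].mean(axis=1, skipna=False)

print(mean_default.tolist())
print(mean_strict.tolist())
```

With the default, early years are averaged over fewer components; with `skipna=False`, those rows would stay NaN and be removed by the `dropna()` inside the analysis function. Which behaviour is preferable depends on the analysis.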

In [7]:
df_base['Rule of Law'] = df_base[['Property Rights', 'Government Integrity', 'Judicial Effectiveness']].mean(axis=1)
df_base['Limited Government'] = df_base[['Government Spending', 'Tax Burden', 'Fiscal Health']].mean(axis=1)
df_base['Regulatory Efficiency'] = df_base[['Business Freedom', 'Monetary Freedom', 'Labor Freedom']].mean(axis=1)
df_base['Open Markets'] = df_base[['Financial Freedom', 'Investment Freedom', 'Trade Freedom']].mean(axis=1)

predictors = [
    'Overall Score'
]

In [8]:
predictors = [
    'Overall Score',    
    'Rule of Law',
    'Limited Government',
    'Regulatory Efficiency',
    'Open Markets'
]

In [9]:
# Create an empty DataFrame to store coefficients and R^2 scores for year 6
summaryData = pd.DataFrame(columns=['Predictor', 'Coefficient (Year 6)', 'R^2 Score (Year 6)', 'RMSE (Year 6)'])

polyOrder = 1
EMAdepth = 2

#maskedDF = df_base[df_base['Index Year'] == '2014']

# Loop through each predictor
for predictor in predictors:
    print(f"## {predictor}: Results")
    regression_results, df_withDeltas \
            = TimeSeriesDoubleDeltaGrowthAnalysis(df_base, polyOrder = polyOrder, timePeriods_delta=years_delta, predictor=predictor, EMAdepth=EMAdepth)

    # Extract the coefficient and R^2 score for year 6 
    year6_data = regression_results[regression_results['Years Ahead'] == 6]
    coeff_for_year6 = year6_data['Coefficient (target to predictor % Change)'].values
    r2_for_year6 = year6_data['R^2 Score'].values
    RMSE_for_year6 = year6_data['RMSE'].values

    # Append to the DataFrame
    new_row = pd.DataFrame({
        'Predictor': predictor, 
        'Coefficient (Year 6)': coeff_for_year6,
        'R^2 Score (Year 6)': r2_for_year6,
        'RMSE (Year 6)': RMSE_for_year6
    })
    summaryData = pd.concat([summaryData, new_row], ignore_index=True)

    # Display regression results
    display(regression_results)

    # Plot the analysis
#    PlotTimeSeriesDoubleDeltaGrowthAnalysis(df_withDeltas, 
#                             timePeriods_delta=years_delta, 
#                             predictor=predictor,
#                             polyOrder = polyOrder
#                             )

    print("\n\n")

## Overall Score: Results


,Years Ahead,Coefficient (target to predictor % Change),Intercept,R^2 Score,RMSE
0,1,0.391479,5.355519,0.00038,14.211283
1,2,0.994179,11.487447,0.000954,22.79006
2,3,2.26855,17.948293,0.002525,32.197579
3,4,3.807221,25.928482,0.004012,42.989367
4,5,6.068029,35.219795,0.006177,55.332866
5,6,8.099908,45.535141,0.006353,69.573356
6,8,15.04584,69.595792,0.011082,100.824503
7,10,18.933893,98.941286,0.010336,135.10336
8,12,20.443961,129.859649,0.009511,157.509457
9,15,23.556905,174.104952,0.010034,185.76714





## Rule of Law: Results


,Years Ahead,Coefficient (target to predictor % Change),Intercept,R^2 Score,RMSE
0,1,0.041569,5.344832,1.4e-05,14.261641
1,2,0.60889,11.500272,0.001203,22.908659
2,3,1.308605,18.017766,0.002777,32.328153
3,4,2.092161,26.13085,0.004044,43.105603
4,5,2.534375,35.546522,0.003563,55.515673
5,6,3.092254,45.780584,0.002783,69.820263
6,8,2.811003,70.271422,0.001076,101.469937
7,10,2.839242,99.88678,0.000667,135.895712
8,12,1.411184,131.035862,0.000131,158.418831
9,15,0.320105,175.995137,6e-06,187.044049





## Limited Government: Results


,Years Ahead,Coefficient (target to predictor % Change),Intercept,R^2 Score,RMSE
0,1,0.053023,5.360132,4.5e-05,14.232235
1,2,0.176916,11.476009,0.000187,22.888685
2,3,0.73741,17.945145,0.001574,32.298893
3,4,1.937307,25.875629,0.005523,43.037629
4,5,2.941719,35.206339,0.007489,55.364091
5,6,3.324166,45.576355,0.006012,69.66774
6,8,6.402379,69.631439,0.00859,101.037501
7,10,8.91005,98.903834,0.009491,135.215121
8,12,8.746138,129.444495,0.006221,157.812342
9,15,10.354321,173.536179,0.006828,186.21402





## Regulatory Efficiency: Results


,Years Ahead,Coefficient (target to predictor % Change),Intercept,R^2 Score,RMSE
0,1,0.006325,5.34222,2.999353e-07,14.261741
1,2,0.390581,11.429628,0.0004519154,22.91727
2,3,0.962511,17.888056,0.001415625,32.350207
3,4,2.231997,25.836638,0.004384019,43.098248
4,5,4.684555,35.046394,0.01204019,55.279013
5,6,7.16387,45.207753,0.01831606,69.27434
6,8,13.726596,69.009998,0.03282116,99.844603
7,10,18.034058,97.792654,0.03308806,133.673151
8,12,24.096835,128.144203,0.04785138,154.592205
9,15,23.708814,171.804542,0.03857859,183.401124





## Open Markets: Results


,Years Ahead,Coefficient (target to predictor % Change),Intercept,R^2 Score,RMSE
0,1,0.140597,5.324621,0.000135,14.261939
1,2,0.366965,11.404837,0.000362,22.920082
2,3,0.16876,17.957881,3.8e-05,32.376877
3,4,0.042521,26.065389,1e-06,43.200718
4,5,0.114591,35.551922,6e-06,55.625033
5,6,0.743844,46.093474,0.000173,69.918773
6,8,1.666793,70.882794,0.000441,101.502201
7,10,2.310197,100.600478,0.000508,135.906536
8,12,4.490827,132.02774,0.001552,158.306184
9,15,3.119468,176.264475,0.000574,186.886748







In [10]:
summaryData = summaryData.round(3)
summaryData.sort_values(by='R^2 Score (Year 6)')

,Predictor,Coefficient (Year 6),R^2 Score (Year 6),RMSE (Year 6)
4,Open Markets,0.744,0.0,69.919
1,Rule of Law,3.092,0.003,69.82
0,Overall Score,8.1,0.006,69.573
2,Limited Government,3.324,0.006,69.668
3,Regulatory Efficiency,7.164,0.018,69.274


In [31]:
years_delta = [1, 2, 6, 12, 20]

polyOrder = 1
EMAdepth = 2

def run_regression_for_combination(predictor, delta, year, q, df_withDeltas):
    currentX = f'{predictor}_diff_from_EMA'
    currentY = f'GDP per capita (current USD)_change_{delta}'
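    # Build an exclusion mask: rows from other years or quintiles, or with missing x/y, are dropped below
    yearMask = df_withDeltas['Index Year'] != year
    quintileMask = df_withDeltas['Country Quintile'] != q
    naMask = yearMask | quintileMask | df_withDeltas[currentX].isna() | df_withDeltas[currentY].isna()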
    
    x = df_withDeltas[currentX].values[~naMask].reshape(-1, 1)
    y = df_withDeltas[currentY].values[~naMask]
    
    model = LinearRegression().fit(x, y)
    yHat = model.predict(x)
    
    coeff = model.coef_
    r2 = model.score(x, y)
    RMSE = np.sqrt(mean_squared_error(y, yHat))
    
    return {
        'Predictor': predictor,
        'Index Year': year,
        'Year Delta': delta,
        'Quintile': q,
        'Coefficients': coeff[0],
        'R^2 Scores': r2,
        'RMSEs': RMSE
    }

results = []

for predictor in predictors:
    # Note: polyOrder and EMAdepth set above are not passed here, so the function defaults apply
    _, df_withDeltas = TimeSeriesDoubleDeltaGrowthAnalysis(
        df_base,
        timePeriods_delta=years_delta,
        predictor=predictor
    )
    #display(df_withDeltas)
    
    for delta in years_delta:
        naMask = df_withDeltas[f'{predictor}_diff_from_EMA'].isna() | df_withDeltas[f'GDP per capita (current USD)_change_{delta}'].isna()        
        years_List = list(df_withDeltas['Index Year'][~naMask].unique())
        for q in ['Q1', 'Q2', 'Q3', 'Q4', 'Q5']:
            for year in years_List:
                result = run_regression_for_combination(predictor, delta, year, q, df_withDeltas)
                results.append(result)

summaryData_agg = pd.DataFrame(results)

In [33]:
summaryData_agg.drop(columns=['Quintile', 'Index Year']).groupby(['Predictor', 'Year Delta'])\
    .mean().sort_values(by=['Predictor', 'Year Delta']).round(3)



Predictor,Year Delta,Coefficients,R^2 Scores,RMSEs
Limited Government,1,-65569.101,0.081,10.195
Limited Government,2,-141407.319,0.048,16.537
Limited Government,6,-81048.661,0.045,43.835
Limited Government,12,-555332.768,0.037,103.364
Limited Government,20,-1472341.964,0.039,163.831
Open Markets,1,18356.597,0.073,10.251
Open Markets,2,-46436.963,0.032,16.757
Open Markets,6,-72957.211,0.037,44.151
Open Markets,12,-1216305.837,0.033,103.846
Open Markets,20,-4866476.049,0.041,164.942


In [34]:
summaryData_agg.drop(columns=['Quintile', 'Index Year', 'Predictor']).describe().round(3)

,Year Delta,Coefficients,R^2 Scores,RMSEs
count,2475.0,2475.0,2475.0,2475.0
mean,5.687,-337823.9,0.051,46.833
std,5.709,7144032.0,0.112,63.304
min,1.0,-170528700.0,0.0,0.0
25%,1.0,-1.433,0.005,11.247
50%,2.0,0.362,0.019,21.455
75%,6.0,2.586,0.055,48.283
max,20.0,117874900.0,1.0,433.223
