# Dataset Preparation

> Features to consider adding: Capacity utilisation/ Industrial production/ Real dividends from the S&P 500/ Personal savings as % of disposable income

> By Wednesday/ Thursday -> Dataset with EDA & Feature engineering

> By Monday/ Sunday week 11 -> Basic models with eval (standardise cross val+ eval metrics)







## Things to ask Denis about


*   Double check on missing value imputation
*   Imbalanced data (some start in 2005 some all the way back in 1950s)
*   Stationarity
*   Feature Selection
*   Feature Scaling/ Standardisation Across Time (K Folds)



In [2]:
# Load libraries
import pandas as pd
import numpy as np
import glob
import os
from datetime import datetime
from functools import reduce
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
raw_data = '/content/gdrive/My Drive/EC4308/Project/Code & Data/Data/Predictors'

## 1) Pulling our raw Economic Data
Most of our data was pulled from the St. Louis Fed's database. However, its history for the S&P 500 index is limited (going back only to 2011-03-24), so we decided to pull this indicator from Yahoo Finance using its API, since it contains data beginning in 1981.


### Pulling S&P 500 data from Yahoo Finance

In [4]:
# !pip install yfinance

In [5]:
# import yfinance as yf
# sp500 = yf.Ticker("^GSPC")
# sp500_history = sp500.history(period='max', interval='1mo')

In [6]:
# closing_values = sp500_history['Close']
# sp500_closing_data = pd.DataFrame(closing_values).reset_index()

In [7]:
# Write data out to csv
# sp500_closing_data.rename(columns={'Date' : 'DATE', 'Close' : 'CLOSE'}, inplace=True)
# sp500_closing_data.to_csv('/content/gdrive/My Drive/EC4308/Project/Code & Data/Data/Predictors/SP500.csv', index=False)
# sp500_closing_data.to_csv('/content/gdrive/Shared with me/EC4308/Project/Code & Data/Data/Predictors/abc.csv', index=False)

### Pulling all data from our CSV files

In [8]:
# Get all csv data paths
filepaths = glob.glob(raw_data + "/*.csv")

In [9]:
# Read all csv data
def read_all_csv(filepaths, names_map = None, convert_datetime = True, date_colname = 'DATE'):
    '''
    filepaths: list of absolute filepaths to relevant .csv files.
    names_map: mapping of names to be updated. (i.e. original: new)
    convert_datetime: true/false as to whether the date column needs to be converted
    date_colname: name of the date column in each file.
    Returns a dict mapping each (possibly renamed) series name to its DataFrame.
    '''
    # Avoid a mutable default argument
    names_map = names_map or {}
    repo = {}
    for path in filepaths:
        filename = os.path.basename(path).split('.')[0]
        # Fall back to the file name when no new name is supplied
        name = names_map.get(filename, filename)
        temp = pd.read_csv(path).rename(columns = {filename: name})
        if convert_datetime:
            # Fail loudly instead of returning an error string in place of the dict
            if date_colname not in temp.columns:
                raise KeyError('Column %s not found in %s!' % (date_colname, path))
            temp[date_colname] = pd.to_datetime(temp[date_colname])
        repo[name] = temp
    return repo

## 2) Tidying up our Raw Data
### Dictionary of data in all_data

1.   Labour Market - Non-farm Payrolls: PAYEMS
2.   Monetary Policy - Fed Funds Rate: FEDFUNDS
3.   Bond Markets - Treasury Bills:
> *   3 month: 3MTB_SECONDARYMKT
> *   1 year: GS1
> *   5 year: GS5
> *   10 year: GS10
4.   Inflation: CPI
5.   Debt:
> *   Household: DEBT_HH
> *   Public: DEBT_PUB
6.   Stock Market: SP500
7.   Industrial Indices:
> *   Industrial Production: INDPRO
> *   Capacity Utilisation (Total Industry): TCU
8.   Unemployment Rate: UNRATE










### Pull data

In [10]:
# Pull all data
names_map = {'HDTGPDUSQ163N' : 'DEBT_HH', 'GFDEGDQ188S' : 'DEBT_PUB', 'CPALTT01USM657N' : 'CPI'}
all_data = read_all_csv(filepaths, names_map)
all_data.keys()

dict_keys(['PAYEMS', 'FEDFUNDS', 'GS5', 'GS10', 'SP500', 'GS1', 'INDPRO', '3MTB_SECONDARYMKT', 'UNRATE', 'TCU', 'DEBT_HH', 'DEBT_PUB', 'CPI'])

### Different time spans
As observed from our raw data below, each indicator that we intend to use has a different start date. As such, we will need to constrain our time period and impute any periods of 'missing data' for each of our candidate indicators.
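The per-indicator cells that follow all compute the same two values. As a compact alternative, the sketch below collects the start and end dates into one table. `summarise_spans` is a hypothetical helper name, and the toy frames only stand in for `all_data`; their values are made up.

```python
import pandas as pd

def summarise_spans(data, date_colname='DATE'):
    """Return one row per series with its first and last observation date."""
    rows = [
        {
            'series': name,
            'start': df[date_colname].min(),
            'end': df[date_colname].max(),
            'n_obs': len(df),
        }
        for name, df in data.items()
    ]
    # Sort so the longest histories appear first
    return pd.DataFrame(rows).sort_values('start').reset_index(drop=True)

# Toy stand-in for all_data (illustrative values only)
toy_data = {
    'A': pd.DataFrame({'DATE': pd.date_range('1990-01-01', periods=6, freq='MS'), 'A': range(6)}),
    'B': pd.DataFrame({'DATE': pd.date_range('1985-01-01', periods=4, freq='MS'), 'B': range(4)}),
}
spans = summarise_spans(toy_data)
print(spans)
```

Calling `summarise_spans(all_data)` in the notebook should give the same information as the individual print statements, assuming every frame has a parsed `DATE` column.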

#### PAYEMS - Non Farm Payrolls

In [11]:
payrolls = all_data['PAYEMS']
payrolls_start = min(payrolls.DATE)
payrolls_end = max(payrolls.DATE)
print("Non farm payrolls starts on %s and ends on %s"%(payrolls_start, payrolls_end))

Non farm payrolls starts on 1939-01-01 00:00:00 and ends on 2021-02-01 00:00:00


#### FEDFUNDS - Fed Funds Rate

In [12]:
fedfunds = all_data['FEDFUNDS']
fedfunds_start = min(fedfunds.DATE)
fedfunds_end = max(fedfunds.DATE)
print("Fed funds rate starts on %s and ends on %s"%(fedfunds_start, fedfunds_end))

Fed funds rate starts on 1954-07-01 00:00:00 and ends on 2021-02-01 00:00:00


#### Treasury Bills - 3 month, 1 year, 5 year & 10 year

In [13]:
three_month = all_data['3MTB_SECONDARYMKT']
three_month_start = min(three_month.DATE)
three_month_end = max(three_month.DATE)
print("3 month treasury bill rate starts on %s and ends on %s"%(three_month_start, three_month_end))

3 month treasury bill rate starts on 1934-01-01 00:00:00 and ends on 2021-02-01 00:00:00


In [14]:
one_year = all_data['GS1']
one_year_start = min(one_year.DATE)
one_year_end = max(one_year.DATE)
print("1 year treasury bill starts on %s and ends on %s"%(one_year_start, one_year_end))
one_year[one_year.DATE > datetime.strptime('2001-01-01', '%Y-%m-%d')]

1 year treasury bill starts on 1953-04-01 00:00:00 and ends on 2021-02-01 00:00:00


Unnamed: 0,DATE,GS1
574,2001-02-01,4.68
575,2001-03-01,4.30
576,2001-04-01,3.98
577,2001-05-01,3.78
578,2001-06-01,3.58
...,...,...
810,2020-10-01,0.13
811,2020-11-01,0.12
812,2020-12-01,0.10
813,2021-01-01,0.10


In [15]:
five_year = all_data['GS5']
five_year_start = min(five_year.DATE)
five_year_end = max(five_year.DATE)
print("5 year treasury bill starts on %s and ends on %s"%(five_year_start, five_year_end))

5 year treasury bill starts on 1953-04-01 00:00:00 and ends on 2021-02-01 00:00:00


In [16]:
ten_year = all_data['GS10']
ten_year_start = min(ten_year.DATE)
ten_year_end = max(ten_year.DATE)
print("10 year treasury bill starts on %s and ends on %s"%(ten_year_start, ten_year_end))

10 year treasury bill starts on 1953-04-01 00:00:00 and ends on 2021-02-01 00:00:00


#### CPI - Inflation

In [17]:
cpi = all_data['CPI']
cpi_start = min(cpi.DATE)
cpi_end = max(cpi.DATE)
print("CPI starts on %s and ends on %s"%(cpi_start, cpi_end))

CPI starts on 1960-01-01 00:00:00 and ends on 2021-01-01 00:00:00


#### Debt - Public & Household

In [18]:
debt_pub = all_data['DEBT_PUB']
debt_pub_start = min(debt_pub.DATE)
debt_pub_end = max(debt_pub.DATE)
print("Public debt starts on %s and ends on %s"%(debt_pub_start, debt_pub_end))

Public debt starts on 1966-01-01 00:00:00 and ends on 2020-07-01 00:00:00


In [19]:
debt_hh = all_data['DEBT_HH']
debt_hh_start = min(debt_hh.DATE)
debt_hh_end = max(debt_hh.DATE)
print("Household debt starts on %s and ends on %s"%(debt_hh_start, debt_hh_end))

Household debt starts on 2005-01-01 00:00:00 and ends on 2020-07-01 00:00:00


#### S&P 500 - Stock market

In [20]:
sp500 = all_data['SP500']
sp500_start = min(sp500.DATE)
sp500_end = max(sp500.DATE)
print("S&P 500 starts on %s and ends on %s"%(sp500_start, sp500_end))

S&P 500 starts on 1985-01-01 00:00:00 and ends on 2021-03-30 00:00:00


#### Industrial Production (Total Index)
The Industrial Production Index (INDPRO) is an economic indicator that measures real output for all facilities located in the United States (excluding those in U.S. territories) in manufacturing, mining, and electric and gas utilities.

In [21]:
indpro = all_data['INDPRO']
indpro_start = min(indpro.DATE)
indpro_end = max(indpro.DATE)
print("Industrial production (INDPRO) starts on %s and ends on %s"%(indpro_start, indpro_end))

Industrial production (INDPRO) starts on 1919-01-01 00:00:00 and ends on 2021-02-01 00:00:00


#### Capacity utilisation 
Capacity Utilization: Total Industry (TCU) is the percentage of resources used by corporations and factories to produce goods in manufacturing, mining, and electric and gas utilities for all facilities located in the United States (excluding those in U.S. territories).

We can also think of capacity utilization as how much of the total available capacity is being used to produce the finished products demanded.

*Note: We use the total index instead of just the manufacturing index as the total index is richer and spans further back in time. Further, the total index exhibits similar behaviour to the manufacturing index, though the latter is likely to be more sensitive to recessionary undercurrents leading up to the actual recession.*

In [22]:
caputil = all_data['TCU']
caputil_start = min(caputil.DATE)
caputil_end = max(caputil.DATE)
print("Capacity utilisation (Total) starts on %s and ends on %s"%(caputil_start, caputil_end))

Capacity utilisation (Total) starts on 1967-01-01 00:00:00 and ends on 2021-02-01 00:00:00


#### Unemployment Rate

In [23]:
unrate = all_data['UNRATE']
unrate_start = min(unrate.DATE)
unrate_end = max(unrate.DATE)
print("Unemployment rate starts on %s and ends on %s"%(unrate_start, unrate_end))

Unemployment rate starts on 1948-01-01 00:00:00 and ends on 2021-02-01 00:00:00


In order to retain as much information as possible, in terms of both the number of variables and the number of observations, we have decided to use 1966-01-01 ~ 2020-02-01 as the period for our predictive modelling. This lets us retain 10 of our 13 candidate indicators in full (household debt, the S&P 500 and capacity utilisation start later), whilst capturing the 8 most recent recessions in the U.S. (see [here](https://www.investopedia.com/articles/economics/08/past-recessions.asp) for details).
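To make the choice of window easier to audit, a small check can report which series span the whole period. This is only a sketch: `covers_window` is a hypothetical helper, and the toy frames reuse three of the date spans printed above with no real values attached.

```python
import pandas as pd

def covers_window(df, start, end, date_colname='DATE'):
    """True if the series has observations at or before `start` and at or after `end`."""
    dates = df[date_colname]
    return bool(dates.min() <= start and dates.max() >= end)

start = pd.Timestamp('1966-01-01')
end = pd.Timestamp('2020-02-01')

# Date ranges copied from the printed spans; no values are needed for the check
toy = {
    'UNRATE': pd.DataFrame({'DATE': pd.date_range('1948-01-01', '2021-02-01', freq='MS')}),
    'DEBT_HH': pd.DataFrame({'DATE': pd.date_range('2005-01-01', '2020-07-01', freq='QS')}),
    'TCU': pd.DataFrame({'DATE': pd.date_range('1967-01-01', '2021-02-01', freq='MS')}),
}
coverage = {name: covers_window(df, start, end) for name, df in toy.items()}
print(coverage)
```

Series that come back `False` are the ones that need imputation (or exclusion) for the chosen window.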

### Augment labels for quarterly data
#### Household Debt

In [24]:
# Create augmented idx for gaps in date for quarterly data
augmented_idx = pd.date_range(start=min(debt_hh.DATE), end=max(debt_hh.DATE), freq='MS') # month start
# Augment Quarterly Household Debt Data
df_temp = pd.DataFrame(debt_hh.DATE, index=augmented_idx)
df_temp = df_temp.reset_index().rename(columns={'index' : 'DATE', 'DATE' : 'DEBT_HH'})
# Combined data
df_temp2 = pd.merge(df_temp, debt_hh, on='DATE', how='left').sort_values(by=['DATE']).drop(columns=['DEBT_HH_x']).rename(columns={'DEBT_HH_y' : 'DEBT_HH'})
df_temp2['DEBT_HH'] = df_temp2['DEBT_HH'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
df_temp2

Unnamed: 0,DATE,DEBT_HH
0,2005-01-01,88.405
1,2005-02-01,88.405
2,2005-03-01,88.405
3,2005-04-01,89.961
4,2005-05-01,89.961
...,...,...
182,2020-03-01,76.450
183,2020-04-01,84.672
184,2020-05-01,84.672
185,2020-06-01,84.672


In [25]:
# Update all_data
all_data['DEBT_HH'] = df_temp2

#### Public Debt

In [26]:
# Create augmented idx for gaps in date for quarterly data
augmented_idx = pd.date_range(start=min(debt_pub.DATE), end=max(debt_pub.DATE), freq='MS') # month start
# Augment Quarterly Public Debt Data
df_temp = pd.DataFrame(debt_pub.DATE, index=augmented_idx)
df_temp = df_temp.reset_index().rename(columns={'index' : 'DATE', 'DATE' : 'DEBT_PUB'})
# Combined data
df_temp2 = pd.merge(df_temp, debt_pub, on='DATE', how='left').sort_values(by=['DATE']).drop(columns=['DEBT_PUB_x']).rename(columns={'DEBT_PUB_y' : 'DEBT_PUB'})
df_temp2['DEBT_PUB'] = df_temp2['DEBT_PUB'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
df_temp2

Unnamed: 0,DATE,DEBT_PUB
0,1966-01-01,40.33999
1,1966-02-01,40.33999
2,1966-03-01,40.33999
3,1966-04-01,39.26763
4,1966-05-01,39.26763
...,...,...
650,2020-03-01,107.71144
651,2020-04-01,135.64081
652,2020-05-01,135.64081
653,2020-06-01,135.64081


In [27]:
# Update all_data
all_data['DEBT_PUB'] = df_temp2
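The household and public debt cells repeat the same steps, so they could be folded into one function. The sketch below is one way to do that with `reindex` and `ffill`; `quarterly_to_monthly` is a hypothetical name, and the third toy value is made up (the first two are taken from the output above).

```python
import pandas as pd

def quarterly_to_monthly(df, value_col, date_colname='DATE'):
    """Reindex a quarterly series to month-start frequency and forward-fill."""
    monthly_idx = pd.date_range(df[date_colname].min(), df[date_colname].max(), freq='MS')
    return (
        df.set_index(date_colname)[[value_col]]
          .reindex(monthly_idx)      # months without an observation become NaN
          .ffill()                   # carry the last quarterly value forward
          .rename_axis(date_colname)
          .reset_index()
    )

quarterly = pd.DataFrame({
    'DATE': pd.to_datetime(['2005-01-01', '2005-04-01', '2005-07-01']),
    'DEBT_HH': [88.405, 89.961, 91.0],  # last value is a placeholder
})
monthly = quarterly_to_monthly(quarterly, 'DEBT_HH')
print(monthly)
```

As in the cells above, the range stops at the last observed quarter's start date, so the final quarter contributes only its first month.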

#### Write augmented data

In [28]:
# all_data['DEBT_PUB'].to_csv(os.path.join(raw_data, 'DEBT_PUB.csv'), index=False)
# all_data['DEBT_HH'].to_csv(os.path.join(raw_data, 'DEBT_HH.csv'), index=False)

### Constraining our dataset


*  Get all observations between 1966 ~ 2020 across all indicators
*  Impute values for indicators with gaps in observations (e.g. no S&P 500 data before 1985)
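One point to weigh when imputing with a constant such as 0: a model cannot tell a filled value from a genuine observation. A possible mitigation, offered here as a suggestion rather than as what this notebook does, is to add an availability flag alongside the filled column. `add_availability_flag` is a hypothetical helper and the toy values are placeholders.

```python
import numpy as np
import pandas as pd

def add_availability_flag(df, col):
    """Add `<col>_AVAILABLE` (1 where observed, 0 where missing) and zero-fill `col`."""
    out = df.copy()
    out[col + '_AVAILABLE'] = out[col].notna().astype(int)
    out[col] = out[col].fillna(0)
    return out

toy = pd.DataFrame({
    'DATE': pd.date_range('1984-11-01', periods=4, freq='MS'),
    'SP500': [np.nan, np.nan, 179.63, 181.18],  # placeholder values
})
flagged = add_availability_flag(toy, 'SP500')
print(flagged)
```

The input frame is left untouched, so the choice of fill value can still be revisited later.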

#### Sufficient Data

In [29]:
# Store for updated data
updated_data = []

# Define our start date
start_date = datetime.strptime("01-01-1966", '%d-%m-%Y')
end_date = datetime.strptime("01-02-2020", '%d-%m-%Y')

# Sufficient data
payroll_1966 = all_data['PAYEMS'][(all_data['PAYEMS'].DATE >= start_date) & (all_data['PAYEMS'].DATE  <= end_date)]
fedfunds_1966 = all_data['FEDFUNDS'][(all_data['FEDFUNDS'].DATE >= start_date) & (all_data['FEDFUNDS'].DATE  <= end_date)]
threemonth_1966 = all_data['3MTB_SECONDARYMKT'][(all_data['3MTB_SECONDARYMKT'].DATE >= start_date) & (all_data['3MTB_SECONDARYMKT'].DATE  <= end_date)]
oneyear_1966 = all_data['GS1'][(all_data['GS1'].DATE >= start_date) & (all_data['GS1'].DATE  <= end_date)]
fiveyear_1966 = all_data['GS5'][(all_data['GS5'].DATE >= start_date) & (all_data['GS5'].DATE  <= end_date)]
tenyear_1966 = all_data['GS10'][(all_data['GS10'].DATE >= start_date) & (all_data['GS10'].DATE  <= end_date)]
cpi_1966 = all_data['CPI'][(all_data['CPI'].DATE >= start_date) & (all_data['CPI'].DATE  <= end_date)]
debtpub_1966 = all_data['DEBT_PUB'][(all_data['DEBT_PUB'].DATE >= start_date) & (all_data['DEBT_PUB'].DATE  <= end_date)]
indpro_1966 = all_data['INDPRO'][(all_data['INDPRO'].DATE >= start_date) & (all_data['INDPRO'].DATE <= end_date)]
unrate_1966 = all_data['UNRATE'][(all_data['UNRATE'].DATE >= start_date) & (all_data['UNRATE'].DATE <= end_date)]

#### Insufficient Data

In [30]:
# Augmented time index
augmented_idx = pd.date_range(start=start_date, end=max(all_data['DEBT_HH'].DATE), freq='MS') # month start

In [44]:
# Total capacity utilisation
caputil_1966 = pd.DataFrame(all_data['TCU'].DATE, index=augmented_idx).reset_index().rename(columns={'index': 'DATE', 'DATE':'values'})
caputil_1966 = pd.merge(caputil_1966, all_data['TCU'], on=['DATE'], how='left').drop(columns=['values']).fillna(0)
caputil_1966 = caputil_1966[caputil_1966.DATE <= end_date]

In [32]:
# Household debt
debthh_1966 = pd.DataFrame(all_data['DEBT_HH'].DATE, index=augmented_idx).reset_index().rename(columns={'index': 'DATE', 'DATE':'values'})
debthh_1966 = pd.merge(debthh_1966, all_data['DEBT_HH'], on=['DATE'], how='left').drop(columns=['values']).fillna(0)
debthh_1966 = debthh_1966[debthh_1966.DATE <= end_date]

In [33]:
# SP500
sp500_1966 = pd.DataFrame(all_data['SP500'].DATE, index=augmented_idx).reset_index().rename(columns={'index': 'DATE', 'DATE':'values'})
sp500_1966 = pd.merge(sp500_1966, all_data['SP500'], on=['DATE'], how='left').drop(columns=['values']).fillna(0)
sp500_1966 = sp500_1966[sp500_1966.DATE <= end_date]
sp500_1966.rename(columns={'CLOSE':'SP500'}, inplace=True)

#### Update data

In [34]:
updated_data.extend((payroll_1966, fedfunds_1966, threemonth_1966, oneyear_1966, fiveyear_1966, tenyear_1966, cpi_1966, debtpub_1966, debthh_1966, sp500_1966, indpro_1966, caputil_1966, unrate_1966))

## 3) Creating our Raw Dataset

In [35]:
# Combine data by chaining inner merges on the key (pd.concat on a shared index is an alternative)
def create_data(datasets, key = 'DATE'):
    '''
    Combines datasets into a single dataset on a single given key.
    datasets: list (or dict) of datasets to be merged.
    key: Defaults to DATE.
    '''
    if isinstance(datasets, dict):
        datasets = list(datasets.values())
    final_df = reduce(lambda left, right: pd.merge(left, right, on=key, how='inner'), datasets).replace({'.': np.nan})
    return final_df
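For comparison, the same inner join can be written with `pd.concat` once each frame is indexed by the key. This is a sketch under the assumption that every frame has unique values in the key column; `create_data_concat` is a hypothetical name, not part of the notebook.

```python
import pandas as pd

def create_data_concat(datasets, key='DATE'):
    """Align frames on `key` and keep only the key values present in every frame."""
    indexed = [df.set_index(key) for df in datasets]
    # join='inner' keeps the intersection of the indexes, like how='inner' in merge
    return pd.concat(indexed, axis=1, join='inner').rename_axis(key).reset_index()

a = pd.DataFrame({
    'DATE': pd.to_datetime(['2000-01-01', '2000-02-01', '2000-03-01']),
    'A': [1, 2, 3],
})
b = pd.DataFrame({
    'DATE': pd.to_datetime(['2000-02-01', '2000-03-01', '2000-04-01']),
    'B': [10.0, 20.0, 30.0],
})
combined = create_data_concat([a, b])
print(combined)
```

Only the two overlapping months survive, which mirrors what the chained `pd.merge` calls do.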

In [36]:
final = create_data(updated_data)
final.head()

Unnamed: 0,DATE,PAYEMS,FEDFUNDS,3MTB_SECONDARYMKT,GS1,GS5,GS10,CPI,DEBT_PUB,DEBT_HH,SP500,INDPRO,TCU,UNRATE
0,1966-01-01,62529,4.42,4.59,4.88,4.86,4.61,0.0,40.33999,0.0,0.0,34.1729,0.0,4.0
1,1966-02-01,62796,4.6,4.65,4.94,4.98,4.83,0.628931,40.33999,0.0,0.0,34.3945,0.0,3.8
2,1966-03-01,63192,4.66,4.59,4.97,4.92,4.87,0.3125,40.33999,0.0,0.0,34.8652,0.0,3.8
3,1966-04-01,63437,4.67,4.62,4.9,4.83,4.75,0.623053,39.26763,0.0,0.0,34.9206,0.0,3.8
4,1966-05-01,63712,4.9,4.64,4.93,4.89,4.78,0.0,39.26763,0.0,0.0,35.2529,0.0,3.9


In [37]:
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 650 entries, 0 to 649
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   DATE               650 non-null    datetime64[ns]
 1   PAYEMS             650 non-null    int64         
 2   FEDFUNDS           650 non-null    float64       
 3   3MTB_SECONDARYMKT  650 non-null    float64       
 4   GS1                650 non-null    float64       
 5   GS5                650 non-null    float64       
 6   GS10               650 non-null    float64       
 7   CPI                650 non-null    float64       
 8   DEBT_PUB           650 non-null    float64       
 9   DEBT_HH            650 non-null    float64       
 10  SP500              650 non-null    float64       
 11  INDPRO             650 non-null    float64       
 12  TCU                650 non-null    float64       
 13  UNRATE             650 non-null    float64       
dtypes: datetime64[ns](1), float64(12), int64(1)

## 4) Creating our Recession Indicator (Target feature)

We classify a given quarter as the first quarter of a recession period if its first month or the preceding quarter’s second or third month is classified as the NBER business cycle peak, while we classify a given quarter as the last quarter of a recession period if its second or third month or the subsequent quarter’s first month is classified as the NBER business cycle trough. All those quarters that are not included in a recession period are classified as expansion quarters.
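One reading of the rule above can be sketched with pandas periods: a peak in the first month of a quarter starts the recession in that quarter, otherwise in the next one; a trough in the first month of a quarter ends the recession in the previous quarter, otherwise in that quarter. `recession_quarters` is a hypothetical helper and is not how the CSV loaded below was produced.

```python
import pandas as pd

def recession_quarters(peak_month, trough_month):
    """Return the quarters (as strings) classified as recession for one peak/trough pair."""
    peak = pd.Period(peak_month, freq='M')
    trough = pd.Period(trough_month, freq='M')
    peak_q = peak.asfreq('Q')
    trough_q = trough.asfreq('Q')
    # Position of the month inside its quarter: 0 = first, 1 = second, 2 = third
    first_q = peak_q if (peak.month - 1) % 3 == 0 else peak_q + 1
    last_q = trough_q - 1 if (trough.month - 1) % 3 == 0 else trough_q
    return [str(q) for q in pd.period_range(first_q, last_q)]

# NBER dates: peak December 2007, trough June 2009
print(recession_quarters('2007-12', '2009-06'))
# NBER dates: peak January 1980, trough July 1980
print(recession_quarters('1980-01', '1980-07'))
```

The first call yields 2008Q1 through 2009Q2 and the second 1980Q1 through 1980Q2.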

In [38]:
cycle_data = pd.read_csv('/content/gdrive/My Drive/EC4308/Project/Code & Data/Data/Recession Indicator/Recession_Indicator_Correct_All_Periods.csv')
cycle_data[['Year', 'Month']] = cycle_data[['Year', 'Month']].astype(str)
# Convert to proper datetime
cycle_data['DATE'] = cycle_data[['Year', 'Month']].agg('-'.join, axis=1)
cycle_data['DATE'] = cycle_data['DATE'].apply(lambda x : datetime.strptime(x, '%Y-%m'))

In [39]:
cycle_data

Unnamed: 0,Year,Month,Peak,Trough,Is_Recession,DATE
0,1854,12,0,1,1,1854-12-01
1,1855,1,0,0,0,1855-01-01
2,1855,2,0,0,0,1855-02-01
3,1855,3,0,0,0,1855-03-01
4,1855,4,0,0,0,1855-04-01
...,...,...,...,...,...,...
1978,2019,10,0,0,0,2019-10-01
1979,2019,11,0,0,0,2019-11-01
1980,2019,12,0,0,0,2019-12-01
1981,2020,1,0,0,0,2020-01-01


In [40]:
# Get recession indicator
recession_indicator = cycle_data[['DATE', 'Is_Recession']][(cycle_data.DATE >= start_date) & (cycle_data.DATE <= end_date)]
recession_indicator

Unnamed: 0,DATE,Is_Recession
1333,1966-01-01,0
1334,1966-02-01,0
1335,1966-03-01,0
1336,1966-04-01,0
1337,1966-05-01,0
...,...,...
1978,2019-10-01,0
1979,2019-11-01,0
1980,2019-12-01,0
1981,2020-01-01,0


### Combining our Variables with our Target Feature

In [41]:
final_dataset = pd.merge(final, recession_indicator, how='left', on='DATE')
final_dataset

Unnamed: 0,DATE,PAYEMS,FEDFUNDS,3MTB_SECONDARYMKT,GS1,GS5,GS10,CPI,DEBT_PUB,DEBT_HH,SP500,INDPRO,TCU,UNRATE,Is_Recession
0,1966-01-01,62529,4.42,4.59,4.88,4.86,4.61,0.000000,40.33999,0.000,0.000000,34.1729,0.0000,4.0,0
1,1966-02-01,62796,4.60,4.65,4.94,4.98,4.83,0.628931,40.33999,0.000,0.000000,34.3945,0.0000,3.8,0
2,1966-03-01,63192,4.66,4.59,4.97,4.92,4.87,0.312500,40.33999,0.000,0.000000,34.8652,0.0000,3.8,0
3,1966-04-01,63437,4.67,4.62,4.90,4.83,4.75,0.623053,39.26763,0.000,0.000000,34.9206,0.0000,3.8,0
4,1966-05-01,63712,4.90,4.64,4.93,4.89,4.78,0.000000,39.26763,0.000,0.000000,35.2529,0.0000,3.9,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
645,2019-10-01,151524,1.83,1.65,1.61,1.53,1.71,0.228619,106.68579,75.462,3037.560059,109.0270,76.9891,3.6,0
646,2019-11-01,151758,1.55,1.54,1.57,1.64,1.81,-0.053624,106.68579,75.462,3140.979980,110.0388,77.5723,3.6,0
647,2019-12-01,151919,1.55,1.54,1.55,1.68,1.86,-0.090977,106.68579,75.462,3230.780029,109.6527,77.1697,3.6,0
648,2020-01-01,152234,1.55,1.52,1.53,1.56,1.76,0.387977,107.71144,76.450,3225.520020,109.1845,76.8754,3.5,0


## 5) Write our final dataset

In [42]:
# Write to CSV and store in gdrive
final_dest = '/content/gdrive/My Drive/EC4308/Project/Code & Data/Data/'
final_file_name = 'final_data.csv'
final_dataset.to_csv(os.path.join(final_dest, final_file_name), index = False)

## Links of Interest

*   Rolling & Expanding window statistics ([Source](https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/))
*   [An ML approach to forecasting Italian recessions](https://www.mdpi.com/2227-7390/8/2/241/htm)

## Further Preprocessing



*   Autoregressive Models -> stationarity etc.
*   ML Models -> Stationarity is not strictly required, since these models treat each row as an independent observation and do not model time explicitly.
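For the autoregressive route, trending level series such as PAYEMS are commonly converted to growth rates before modelling. The sketch below shows a log difference on the first five PAYEMS values from the table above; `log_diff` is a hypothetical helper. Whether the transformed series is actually stationary would still need a formal test, which is not shown here.

```python
import numpy as np
import pandas as pd

def log_diff(series):
    """Month-on-month log difference; the first value is NaN by construction."""
    return np.log(series).diff()

# First five PAYEMS observations from the final dataset
payems = pd.Series([62529, 62796, 63192, 63437, 63712], name='PAYEMS')
growth = log_diff(payems)
print(growth)
```

Rate series such as FEDFUNDS or UNRATE are often differenced in levels instead (`series.diff()`), since they can be zero or close to it.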

## Train Test Split (Cross Validation Approaches)


*   [Recursive cross val](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) (Expensive to compute) 
*   [Blocked cross val](https://hub.packtpub.com/cross-validation-strategies-for-time-series-forecasting-tutorial/)
*   AIC/ BIC (Supposedly performs just as well)
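As a rough illustration of the recursive (expanding-window) option, the NumPy-only sketch below generates folds in which training data always precedes the test block, and standardises each fold using training statistics only so that no future information leaks into the scaling. Both function names are hypothetical; scikit-learn's `TimeSeriesSplit`, linked above, is the library analogue.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError('Not enough samples for the requested splits.')
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        yield np.arange(0, test_start), np.arange(test_start, test_start + test_size)

def standardise_with_train_stats(X_train, X_test):
    """Scale both sets with the mean and std estimated on the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std

X = np.arange(24, dtype=float).reshape(12, 2)  # toy feature matrix, rows ordered in time
folds = list(expanding_window_splits(len(X), n_splits=3, test_size=2))
for train_idx, test_idx in folds:
    X_train, X_test = standardise_with_train_stats(X[train_idx], X[test_idx])
    print(len(train_idx), test_idx)
```

A blocked variant would cap the training window at a fixed length instead of always starting from row 0.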

