<h1>Predicting Inflation Rates - Modelling</h1>

In this notebook, several regression and time-series models are used to predict inflation rates (CPI) using the data from the last notebook (Data Preprocessing). Each model is evaluated with the R2 score, MSE, RMSE, and MAE to see how well it performs.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.linear_model import BayesianRidge

In [3]:
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, ETS

In [4]:
import time

The preprocessed dataset from the previous notebook is now loaded.

In [5]:
df = pd.read_csv('preprocessed')

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,0,0,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,1,1,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2,2,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,3,3,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,4,4,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


In [7]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [8]:
df.head()

Unnamed: 0,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,0,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,1,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,3,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,4,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


<h3>Using the decision tree regressor</h3>

In [9]:
X = df
y = df['CPI']

In [10]:
tscv = TimeSeriesSplit()

In [11]:
# Split data into train and test, run decision tree regression model, and print R2 score, MSE, RMSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    dt = DecisionTreeRegressor()
    
    dt.fit(X_train, y_train)
    ypred = dt.predict(X_test)
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -1.5147835497727757
Mean Squared Error: 0.5128343949044566
Root Mean Squared Error: 0.7161245666114636
Mean Absolute Error: 0.592356687898088

Fold 2
R2 Score: 0.6192769457380585
Mean Squared Error: 0.027165605095541466
Root Mean Squared Error: 0.1648199171688345
Mean Absolute Error: 0.12420382165605195

Fold 3
R2 Score: -0.9803901525821062
Mean Squared Error: 4.012531847133759
Root Mean Squared Error: 2.003130511757474
Mean Absolute Error: 1.4664012738853511

Fold 4
R2 Score: -0.10094073429279216
Mean Squared Error: 0.6206369426751589
Root Mean Squared Error: 0.7878051425797873
Mean Absolute Error: 0.612420382165605

Fold 5
R2 Score: 0.6771913112989306
Mean Squared Error: 4.712006369426732
Root Mean Squared Error: 2.170715635320926
Mean Absolute Error: 1.3914012738853452



<h3>Decision Tree Regressor with hyperparameter tuning using GridSearchCV</h3>

In [12]:
# Same folds as above, but ccp_alpha is tuned with GridSearchCV inside each training window.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
   
    
    tree = DecisionTreeRegressor()
    
    grid_search = GridSearchCV(tree, param_grid={'ccp_alpha':[0.001, 0.01, 0, 0.1, 1, 10]})
    grid_search.fit(X_train, y_train)
    ypred = grid_search.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -0.9820142471552693
Mean Squared Error: 0.4041878980891719
Root Mean Squared Error: 0.6357577353750183
Mean Absolute Error: 0.514171974522293

Fold 2
R2 Score: 0.5987455735269798
Mean Squared Error: 0.028630573248407773
Root Mean Squared Error: 0.16920571281256366
Mean Absolute Error: 0.12611464968153016

Fold 3
R2 Score: -0.9803901525821086
Mean Squared Error: 4.012531847133763
Root Mean Squared Error: 2.0031305117574747
Mean Absolute Error: 1.466401273885353

Fold 4
R2 Score: -0.10094073429279216
Mean Squared Error: 0.6206369426751589
Root Mean Squared Error: 0.7878051425797873
Mean Absolute Error: 0.6124203821656051

Fold 5
R2 Score: 0.6765258703258517
Mean Squared Error: 4.721719745222911
Root Mean Squared Error: 2.1729518506453176
Mean Absolute Error: 1.4070063694267476



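For both the plain and the tuned decision tree, folds 2 and 5 have positive R2 values (about 0.60 to 0.68), while folds 1, 3, and 4 have negative R2 values. Tuning ccp_alpha changes little: fold 1 improves from about -1.51 to about -0.98, and the other folds are almost unchanged.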
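
A negative R2 means the predictions are worse than always predicting the mean of the test values. This follows from the definition R2 = 1 - SS_res / SS_tot. The sketch below uses toy numbers, not values from the dataset, and `r2` is a hypothetical helper written only to show the arithmetic.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
perfect = r2(y_true, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r2(y_true, [2.0, 2.0, 2.0])  # always predict the mean
reversed_ = r2(y_true, [3.0, 2.0, 1.0])  # worse than predicting the mean
print(perfect, mean_only, reversed_)      # 1.0 0.0 -3.0
```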

In [13]:
# Split data into train and test, run random forest regression model, and print R2 score, MSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    rf = RandomForestRegressor()
    
    rf.fit(X_train, y_train)
    ypred = rf.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -0.6407365759877288
Mean Squared Error: 0.33459187738853485
Root Mean Squared Error: 0.5784391734560643
Mean Absolute Error: 0.45387101910828

Fold 2
R2 Score: 0.6881963212255623
Mean Squared Error: 0.022248023885350944
Root Mean Squared Error: 0.14915771480332804
Mean Absolute Error: 0.11881050955414191

Fold 3
R2 Score: 0.5277264601312825
Mean Squared Error: 0.9568885286624133
Root Mean Squared Error: 0.9782067923820674
Mean Absolute Error: 0.7882261146496787

Fold 4
R2 Score: 0.3722064122643747
Mean Squared Error: 0.3539081449044523
Root Mean Squared Error: 0.5949017943362185
Mean Absolute Error: 0.45953343949044206

Fold 5
R2 Score: 0.7069937594463354
Mean Squared Error: 4.27698299363054
Root Mean Squared Error: 2.0680867954780187
Mean Absolute Error: 1.3375828025477643



The R2 score for the random forest regressor is not consistent either. It is negative for fold 1 (about -0.64) and ranges from about 0.37 to 0.71 for the other folds. This could be because the model is unable to predict values at certain parts of the data. The MSE and MAE are low for folds 1 to 4, but fold 5 has a much larger MSE of about 4.28.

In [14]:
# Split data into train and test, run GradientBoostingRegressor model, and print R2 score, MSE, and MAE.
i=0
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    gbr = GradientBoostingRegressor()
    
    gbr.fit(X_train, y_train)
    ypred = gbr.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 0
R2 Score: 0.04633142881287833
Mean Squared Error: 0.19447957844655686
Root Mean Squared Error: 0.4409983882584571
Mean Absolute Error: 0.3309767169454254

Fold 1
R2 Score: 0.8058128303817464
Mean Squared Error: 0.013855772340072167
Root Mean Squared Error: 0.11771054472761634
Mean Absolute Error: 0.07979395986771401

Fold 2
R2 Score: 0.708661633749332
Mean Squared Error: 0.5902899847025259
Root Mean Squared Error: 0.7683033155613256
Mean Absolute Error: 0.6459592691361051

Fold 3
R2 Score: 0.8495086220546717
Mean Squared Error: 0.08483699966552495
Root Mean Squared Error: 0.2912679173296039
Mean Absolute Error: 0.24768516370942015

Fold 4
R2 Score: 0.7801418431247492
Mean Squared Error: 3.2092476808328585
Root Mean Squared Error: 1.791437322607983
Mean Absolute Error: 1.064904217503989



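The gradient boosting regressor has more consistent results across the folds than the tree and forest models. The R2 score lies between about 0.71 and 0.85 for four of the five folds, with one significantly lower value of about 0.05 in the first fold (printed as Fold 0, because this loop starts counting at 0). The MSE and MAE are low except in the last fold, where the MSE is about 3.21.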
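
Consistency across folds can be put into numbers by taking the mean and standard deviation of the R2 scores for each model. The sketch below copies the R2 values printed in the fold outputs above, rounded to four decimals; the dictionary layout and variable names are assumptions made for illustration.

```python
from statistics import mean, stdev

# R2 scores copied from the fold outputs above, rounded to four decimals.
r2_by_model = {
    "Decision tree": [-1.5148, 0.6193, -0.9804, -0.1009, 0.6772],
    "Random forest": [-0.6407, 0.6882, 0.5277, 0.3722, 0.7070],
    "Gradient boosting": [0.0463, 0.8058, 0.7087, 0.8495, 0.7801],
}

# Mean and sample standard deviation of R2 per model.
summary = {
    name: (mean(scores), stdev(scores)) for name, scores in r2_by_model.items()
}
for name, (avg, spread) in summary.items():
    print(f"{name}: mean R2 = {avg:.3f}, std = {spread:.3f}")
```

A higher mean with a smaller spread indicates a model that is both more accurate and more stable across the time-ordered folds.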

In [15]:
# Split data into train and test, run BayesianRidge model, and print R2 score, MSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    br = BayesianRidge()
    
    br.fit(X_train, y_train)
    ypred = br.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: 0.9999999999999984
Mean Squared Error: 3.162119348315234e-16
Root Mean Squared Error: 1.7782348968331584e-08
Mean Absolute Error: 1.5682484269441835e-08

Fold 2
R2 Score: 1.0
Mean Squared Error: 1.6012674378371194e-19
Root Mean Squared Error: 4.0015839836708653e-10
Mean Absolute Error: 3.1474709604375455e-10

Fold 3
R2 Score: 1.0
Mean Squared Error: 3.587366386330994e-20
Root Mean Squared Error: 1.894034420577143e-10
Mean Absolute Error: 1.754642623695188e-10

Fold 4
R2 Score: 1.0
Mean Squared Error: 8.104309694218943e-22
Root Mean Squared Error: 2.846806929564937e-11
Mean Absolute Error: 2.525326020484001e-11

Fold 5
R2 Score: 1.0
Mean Squared Error: 1.1643686698463575e-21
Root Mean Squared Error: 3.4122846743001345e-11
Mean Absolute Error: 2.6537306499467925e-11



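The Bayesian Ridge, much like the linear regression model, appears to perform extremely well on the dataset, with R2 scores of 1 and MSE and MAE values of effectively zero. This result should not be trusted: X was set to the whole of df, so the target column CPI is itself one of the features and a linear model can reproduce it exactly. The tree-based models above were fitted on the same X, so their scores are affected by the same leakage.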
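
The effect of leaving the target among the features can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: `demo`, `driver`, and `fit_and_score` are hypothetical names, the data are random numbers rather than the notebook's `df`, and only the column name `CPI` is borrowed from it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import r2_score

# Synthetic stand-in for df: one weak predictor plus the target itself.
rng = np.random.default_rng(0)
n = 300
driver = rng.normal(size=n)
demo = pd.DataFrame({"driver": driver, "CPI": 0.5 * driver + rng.normal(size=n)})

# Time-ordered split: first 200 rows for training, last 100 for testing.
train, test = demo.iloc[:200], demo.iloc[200:]
y_train, y_test = train["CPI"], test["CPI"]


def fit_and_score(feature_columns):
    """Fit BayesianRidge on the given columns and return the test R2."""
    model = BayesianRidge().fit(train[feature_columns], y_train)
    return r2_score(y_test, model.predict(test[feature_columns]))


r2_leaky = fit_and_score(["driver", "CPI"])  # target is among the features
r2_clean = fit_and_score(["driver"])         # target removed from the features
print("with target in X:", r2_leaky)
print("without target in X:", r2_clean)
```

In the notebook itself, the corresponding change would be to build the features with `X = df.drop(columns=['CPI'])`. Whether the related CPIT and CPIXA columns should remain as features is a separate modelling decision.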

In [16]:
otherdf = pd.read_csv('exploredData')

In [17]:
otherdf.head()

Unnamed: 0,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,2022-09-02,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,2022-09-01,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2022-08-31,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,2022-08-30,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,2022-08-29,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


In [18]:
frcstdf = otherdf[['date', 'CPI']]

In [19]:
frcstdf.reset_index(inplace=True)
frcstdf.rename(columns={'index': 'unique_id', 'date':'ds', 'CPI':'y' }, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frcstdf.rename(columns={'index': 'unique_id', 'date':'ds', 'CPI':'y' }, inplace=True)


In [20]:
frcstdf.head()

Unnamed: 0,unique_id,ds,y
0,0,2022-09-02,12.7
1,1,2022-09-01,12.7
2,2,2022-08-31,12.7
3,3,2022-08-30,12.7
4,4,2022-08-29,12.7


In [21]:
frcstdf['unique_id'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frcstdf['unique_id'] = 1


In [22]:
frcstdf.head()

Unnamed: 0,unique_id,ds,y
0,1,2022-09-02,12.7
1,1,2022-09-01,12.7
2,1,2022-08-31,12.7
3,1,2022-08-30,12.7
4,1,2022-08-29,12.7


In [23]:
frcstdf['ds'] = pd.to_datetime(frcstdf['ds'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frcstdf['ds'] = pd.to_datetime(frcstdf['ds'])


In [24]:
frcstdf['ds'].head()

0   2022-09-02
1   2022-09-01
2   2022-08-31
3   2022-08-30
4   2022-08-29
Name: ds, dtype: datetime64[ns]

In [25]:
frcstdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3771 entries, 0 to 3770
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   unique_id  3771 non-null   int64         
 1   ds         3771 non-null   datetime64[ns]
 2   y          3771 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 88.5 KB
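
The warnings shown above appear because `frcstdf` is a slice of `otherdf` that is then modified in place. One way to build the same three-column frame without them is a single method chain that returns a new DataFrame at every step. The sketch below uses a small stand-in frame built from the first rows of `otherdf.head()`; `otherdf_demo` and `frcst_demo` are hypothetical names.

```python
import pandas as pd

# Small stand-in for otherdf, using the first rows shown by otherdf.head().
otherdf_demo = pd.DataFrame({
    "date": ["2022-09-02", "2022-09-01", "2022-08-31"],
    "USD": [0.0772, 0.0775, 0.0782],
    "CPI": [12.7, 12.7, 12.7],
})

frcst_demo = (
    otherdf_demo[["date", "CPI"]]
    .rename(columns={"date": "ds", "CPI": "y"})  # rename returns a new frame
    .assign(
        unique_id=1,                              # single series identifier
        ds=lambda d: pd.to_datetime(d["ds"]),    # parse the date strings
    )
    .loc[:, ["unique_id", "ds", "y"]]            # column order used above
)
print(frcst_demo.dtypes)
```

An alternative with the same effect is to add `.copy()` when the two columns are first selected and keep the remaining cells unchanged.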


In [67]:
tscv2 = TimeSeriesSplit(max_train_size=500, test_size=100)
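
TimeSeriesSplit assumes that the rows are in chronological order, because it always places the test rows after the training rows by position. The rows shown by `otherdf.head()` start with the most recent date, so if the whole frame is ordered that way, each training window would contain later dates than its test window. The sketch below checks this on a hypothetical stand-in frame with the same number of rows as `frcstdf`; the dates, the `y` values, and the `fold_ranges` helper are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for frcstdf: 3771 daily rows, newest date first.
dates = pd.date_range(end="2022-09-02", periods=3771, freq="D")[::-1]
demo = pd.DataFrame({"ds": dates, "y": np.linspace(12.7, 6.0, len(dates))})

splitter = TimeSeriesSplit(max_train_size=500, test_size=100)


def fold_ranges(frame):
    """Return (latest training date, earliest test date) for every fold."""
    return [
        (frame["ds"].iloc[train_idx].max(), frame["ds"].iloc[test_idx].min())
        for train_idx, test_idx in splitter.split(frame)
    ]


as_loaded = fold_ranges(demo)
chronological = fold_ranges(demo.sort_values("ds").reset_index(drop=True))

print("newest first:", as_loaded[0])
print("oldest first:", chronological[0])
```

Sorting by `ds` and resetting the index before splitting makes every test window lie strictly after its training window, which is what a forward forecast is meant to be compared against.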

In [74]:
# Split data into train and test, fit AutoARIMA and ETS models, and print R2 score, MSE, RMSE, and MAE for each.
#forecasts=[]
i=1
for train_index, test_index in tscv2.split(frcstdf['ds']):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = frcstdf.loc[train_index], frcstdf.loc[test_index]
    y_train, y_test = frcstdf['y'].loc[train_index], frcstdf['y'].loc[test_index]
    
    # Sort chronologically; reassigning avoids modifying a slice of frcstdf in place.
    X_train = X_train.sort_values(by='ds')
    
    models = [AutoARIMA(), ETS()]
    fcst = StatsForecast(df=X_train, models=models, freq='D')
    
    
    #print(X_train.tail(5))
    forecasts = fcst.forecast(100)
       
    #print(forecasts.head(10))   
    
    print('AUTO ARIMA')
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, forecasts['AutoARIMA'])))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['AutoARIMA'])))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['AutoARIMA'], squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, forecasts['AutoARIMA'])))
    print('')
    
    print('ETS')
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, forecasts['ETS'])))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['ETS'])))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['ETS'], squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, forecasts['ETS'])))
    print('')
    
    i=i+1

AUTO ARIMA
Fold 1
R2 Score: -0.32619874579348473
Mean Squared Error: 5.136399571228066
Root Mean Squared Error: 2.2663626301252116
Mean Absolute Error: 2.2039999465942377

ETS
Fold 1
R2 Score: -0.32620130861317076
Mean Squared Error: 5.1364094970902165
Root Mean Squared Error: 2.2663648199462982
Mean Absolute Error: 2.2040011291503903

AUTO ARIMA
Fold 2
R2 Score: -7.3559519282841865
Mean Squared Error: 18.1297
Root Mean Squared Error: 4.257898542708598
Mean Absolute Error: 3.995000000000001

ETS
Fold 2
R2 Score: -7.3559519282841865
Mean Squared Error: 18.1297
Root Mean Squared Error: 4.257898542708598
Mean Absolute Error: 3.995000000000001

AUTO ARIMA
Fold 3
R2 Score: -81.7888744835117
Mean Squared Error: 51.82740841529351
Root Mean Squared Error: 7.199125531291527
Mean Absolute Error: 7.15552901172638

ETS
Fold 3
R2 Score: -81.89868964093019
Mean Squared Error: 51.896154790325475
Root Mean Squared Error: 7.203898582734593
Mean Absolute Error: 7.160316738128663

AUTO ARIMA
Fold 4
R2 Sc