<h1>Predicting Inflation Rates - Modelling</h1>

In this notebook, several regression and time-series models are used to predict inflation rates from the data prepared in the previous notebook (Data Preprocessing). Each model is evaluated on time-ordered folds using the R2 score, MSE, RMSE, and MAE.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.linear_model import BayesianRidge

In [3]:
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, ETS

In [4]:
import time

The dataset produced by the previous notebook is now loaded.

In [5]:
df = pd.read_csv('preprocessed')

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,0,0,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,1,1,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2,2,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,3,3,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,4,4,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


In [7]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [8]:
df.head()

Unnamed: 0,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,0,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,1,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,3,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,4,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6
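The `Unnamed: 0` column dropped above is most likely a row index that was written to disk when the frame was saved in the previous notebook. A minimal sketch of how this happens and how `index_col=0` avoids it, assuming the file was written with the `to_csv` defaults; the two-row frame and the in-memory buffer are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the real data.
toy = pd.DataFrame({"date": [0, 1], "CPI": [12.7, 12.7]})

# By default, to_csv writes the index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading it back without index_col produces an 'Unnamed: 0' column.
buffer.seek(0)
with_extra = pd.read_csv(buffer)

# Reading it back with index_col=0 restores the index instead.
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)

print(list(with_extra.columns))  # ['Unnamed: 0', 'date', 'CPI']
print(list(clean.columns))       # ['date', 'CPI']
```

Writing with `to_csv(..., index=False)` in the previous notebook would have the same effect.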


<h3>Using the decision tree regressor</h3>

In [9]:
# NOTE: X is the full frame, so it still contains the target column 'CPI'.
X = df
y = df['CPI']

In [10]:
tscv = TimeSeriesSplit()

In [11]:
# Split data into train and test folds, fit a decision tree regressor on each, and print R2, MSE, RMSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    dt = DecisionTreeRegressor()
    
    dt.fit(X_train, y_train)
    ypred = dt.predict(X_test)
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -0.9820142471552693
Mean Squared Error: 0.4041878980891719
Root Mean Squared Error: 0.6357577353750183
Mean Absolute Error: 0.514171974522293

Fold 2
R2 Score: 0.65922385471395
Mean Squared Error: 0.024315286624204572
Root Mean Squared Error: 0.15593359684238856
Mean Absolute Error: 0.11671974522293167

Fold 3
R2 Score: -0.9803901525821004
Mean Squared Error: 4.012531847133747
Root Mean Squared Error: 2.0031305117574707
Mean Absolute Error: 1.4664012738853482

Fold 4
R2 Score: -0.08472716744437125
Mean Squared Error: 0.611496815286624
Root Mean Squared Error: 0.7819826182765344
Mean Absolute Error: 0.5939490445859881

Fold 5
R2 Score: 0.6743015520568225
Mean Squared Error: 4.754187898089153
Root Mean Squared Error: 2.1804100298084195
Mean Absolute Error: 1.414808917197448



<h3>Decision Tree Regressor with hyperparameter tuning using GridSearchCV</h3>

In [12]:
# Same folds as above, but ccp_alpha is tuned on each training split with GridSearchCV.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
   
    
    tree = DecisionTreeRegressor()
    
    grid_search = GridSearchCV(tree, param_grid={'ccp_alpha':[0.001, 0.01, 0, 0.1, 1, 10]})
    grid_search.fit(X_train, y_train)
    ypred = grid_search.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -0.9820142471552693
Mean Squared Error: 0.4041878980891719
Root Mean Squared Error: 0.6357577353750183
Mean Absolute Error: 0.514171974522293

Fold 2
R2 Score: 0.5824543760116561
Mean Squared Error: 0.02979299363057424
Root Mean Squared Error: 0.17260647041920021
Mean Absolute Error: 0.14410828025477934

Fold 3
R2 Score: -0.9803901525821004
Mean Squared Error: 4.012531847133747
Root Mean Squared Error: 2.0031305117574707
Mean Absolute Error: 1.4664012738853482

Fold 4
R2 Score: -0.11593969595918852
Mean Squared Error: 0.6290923566878976
Root Mean Squared Error: 0.7931534256925943
Mean Absolute Error: 0.6218152866242038

Fold 5
R2 Score: 0.6747499065485201
Mean Squared Error: 4.747643312101891
Root Mean Squared Error: 2.178908743408473
Mean Absolute Error: 1.4165605095541358



Tuning ccp_alpha changes little. Folds 2 and 5 still have the best R2 values (about 0.58 and 0.67), while folds 1, 3, and 4 remain negative.
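A negative R2 means the predictions have a larger squared error than simply predicting the mean of the test targets. The worked example below computes R2 by hand as 1 - SS_res / SS_tot; the numbers are made up and are not taken from the notebook.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up test targets and two sets of predictions.
y_true = [2.0, 3.0, 4.0]
close_preds = [2.0, 3.0, 5.0]    # one prediction misses by 1
offset_preds = [5.0, 6.0, 7.0]   # right shape, but shifted up by 3

print(r2(y_true, close_preds))   # 0.5
print(r2(y_true, offset_preds))  # -12.5
```

The second case shows that a constant offset alone is enough to push R2 well below zero, even when the predictions follow the shape of the targets.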

<h3>Using the random forest regressor</h3>

In [13]:
# Split data into train and test folds, fit a random forest regressor on each, and print R2, MSE, RMSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    rf = RandomForestRegressor()
    
    rf.fit(X_train, y_train)
    ypred = rf.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -0.7238677334708676
Mean Squared Error: 0.3515446353503183
Root Mean Squared Error: 0.5929119962948282
Mean Absolute Error: 0.4685111464968153

Fold 2
R2 Score: 0.695375584170054
Mean Squared Error: 0.021735764331210754
Root Mean Squared Error: 0.14743054070039477
Mean Absolute Error: 0.11782165605095653

Fold 3
R2 Score: 0.6015637257593847
Mean Squared Error: 0.8072844824840676
Root Mean Squared Error: 0.898490112624545
Mean Absolute Error: 0.6581640127388484

Fold 4
R2 Score: 0.16169983246779707
Mean Squared Error: 0.4725777117834328
Root Mean Squared Error: 0.687442878924084
Mean Absolute Error: 0.5116640127388499

Fold 5
R2 Score: 0.7325777901782987
Mean Squared Error: 3.903535437898063
Root Mean Squared Error: 1.9757366823284075
Mean Absolute Error: 1.2856544585987206



The R2 score for the random forest regressor is not consistent across folds: it ranges from about -0.72 in fold 1 to about 0.73 in fold 5, with fold 4 close to zero (0.16). This could be because the model is unable to predict values in certain parts of the data. The MSE and MAE are below 1 on the first four folds and rise in fold 5 (MSE about 3.9, MAE about 1.29).
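One possible reason, offered as an assumption rather than something checked against this data, is that tree-based regressors predict averages of training targets and so cannot output a value outside the range seen in training. If CPI in a test fold moves beyond the training range, the predictions stay capped. A minimal sketch on a made-up upward trend:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up upward trend: the target equals the single feature.
x_train = np.arange(0, 10, dtype=float).reshape(-1, 1)
y_train = x_train.ravel()

# Later values lie above anything seen in training.
x_test = np.arange(10, 15, dtype=float).reshape(-1, 1)

tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)
preds = tree.predict(x_test)

# Every prediction equals the largest training target, 9.0.
print(preds)
print(preds.max() <= y_train.max())  # True
```

The same cap applies to random forests and, in practice, to gradient-boosted trees, since all of them build predictions from values learned on the training targets.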

<h3>Using the gradient boosting regressor</h3>

In [14]:
# Split data into train and test folds, fit a GradientBoostingRegressor on each, and print R2, MSE, RMSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    gbr = GradientBoostingRegressor()
    
    gbr.fit(X_train, y_train)
    ypred = gbr.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -0.08055998537263842
Mean Squared Error: 0.22035627134057553
Root Mean Squared Error: 0.4694212088738381
Mean Absolute Error: 0.3673285472139096

Fold 2
R2 Score: 0.7936745873998956
Mean Squared Error: 0.014721868342684657
Root Mean Squared Error: 0.12133370653979321
Mean Absolute Error: 0.08267909095044809

Fold 3
R2 Score: 0.6596489424683245
Mean Squared Error: 0.6895961665790409
Root Mean Squared Error: 0.8304192715604816
Mean Absolute Error: 0.7085774212051785

Fold 4
R2 Score: 0.5966108964981174
Mean Squared Error: 0.2274038666274833
Root Mean Squared Error: 0.47686881490351546
Mean Absolute Error: 0.38547030408443717

Fold 5
R2 Score: 0.7928261933513724
Mean Squared Error: 3.0240954803131403
Root Mean Squared Error: 1.7389926625242387
Mean Absolute Error: 1.0533278884982948



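The gradient boosting regressor has relatively consistent results across the folds. The R2 score is between about 0.60 and 0.79 on four of the five folds, with one low value of about -0.08 on the first fold. The MSE and MAE are below 1 on the first four folds and higher on the last (MSE about 3.02, MAE about 1.05).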
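Consistency across folds can be put into numbers by comparing the mean and standard deviation of the per-fold R2 scores. The sketch below uses the R2 values printed above, rounded to three decimals; the dictionary layout and variable names are only illustrative.

```python
from statistics import mean, stdev

# R2 per fold, copied (rounded) from the outputs printed above.
r2_by_model = {
    "decision tree": [-0.982, 0.659, -0.980, -0.085, 0.674],
    "random forest": [-0.724, 0.695, 0.602, 0.162, 0.733],
    "gradient boosting": [-0.081, 0.794, 0.660, 0.597, 0.793],
}

# Mean and sample standard deviation of R2 for each model.
summary = {name: (mean(scores), stdev(scores)) for name, scores in r2_by_model.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean R2 = {avg:.2f}, std = {spread:.2f}")
```

On these numbers, gradient boosting has both the highest mean R2 and the smallest spread of the three tree-based models.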

<h3>Using Bayesian ridge regression</h3>

In [15]:
# Split data into train and test folds, fit a BayesianRidge model on each, and print R2, MSE, RMSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    br = BayesianRidge()
    
    br.fit(X_train, y_train)
    ypred = br.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: 0.9999999999999984
Mean Squared Error: 3.162119348315234e-16
Root Mean Squared Error: 1.7782348968331584e-08
Mean Absolute Error: 1.5682484269441835e-08

Fold 2
R2 Score: 1.0
Mean Squared Error: 1.6012674378371194e-19
Root Mean Squared Error: 4.0015839836708653e-10
Mean Absolute Error: 3.1474709604375455e-10

Fold 3
R2 Score: 1.0
Mean Squared Error: 3.587366386330994e-20
Root Mean Squared Error: 1.894034420577143e-10
Mean Absolute Error: 1.754642623695188e-10

Fold 4
R2 Score: 1.0
Mean Squared Error: 8.104309694218943e-22
Root Mean Squared Error: 2.846806929564937e-11
Mean Absolute Error: 2.525326020484001e-11

Fold 5
R2 Score: 1.0
Mean Squared Error: 1.1643686698463575e-21
Root Mean Squared Error: 3.4122846743001345e-11
Mean Absolute Error: 2.6537306499467925e-11



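The Bayesian Ridge model, much like the linear regression model, scores almost perfectly on the dataset, with R2 scores of 1 and MSE and MAE values that are effectively zero. Note that X still contains the CPI column, so these scores reflect the target being available as an input rather than forecasting skill.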
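The effect of having the target among the inputs can be reproduced on made-up data; in the notebook the analogous situation arises because `X = df` keeps the `CPI` column. A minimal sketch, using ordinary least squares from NumPy as a stand-in for Bayesian Ridge (the data, the seed, and the helper `r2_of_linear_fit` are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: two noise features and a target unrelated to them.
features = rng.normal(size=(50, 2))
target = rng.normal(size=50)

def r2_of_linear_fit(X, y):
    """Fit ordinary least squares with an intercept and return the in-sample R2."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1 - residual @ residual / ((y - y.mean()) ** 2).sum()

# Without the target among the inputs, the fit explains little.
r2_clean = r2_of_linear_fit(features, target)

# With the target included as a column, the fit is essentially perfect.
leaky = np.column_stack([features, target])
r2_leaky = r2_of_linear_fit(leaky, target)

print(r2_clean, r2_leaky)
```

Dropping the target from the inputs, for example with `X = df.drop(columns=['CPI'])`, would remove this effect. Whether `CPIT` and `CPIXA` should also be dropped depends on what is known at prediction time.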

<h3>Forecasting CPI with AutoARIMA and ETS</h3>

In [16]:
otherdf = pd.read_csv('exploredData')

In [17]:
otherdf.head()

Unnamed: 0,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,2022-09-02,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,2022-09-01,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2022-08-31,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,2022-08-30,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,2022-08-29,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


In [18]:
# Copy the two columns so the in-place edits below act on an independent DataFrame.
frcstdf = otherdf[['date', 'CPI']].copy()

In [19]:
frcstdf.reset_index(inplace=True)
frcstdf.rename(columns={'index': 'unique_id', 'date':'ds', 'CPI':'y' }, inplace=True)


In [20]:
frcstdf.head()

Unnamed: 0,unique_id,ds,y
0,0,2022-09-02,12.7
1,1,2022-09-01,12.7
2,2,2022-08-31,12.7
3,3,2022-08-30,12.7
4,4,2022-08-29,12.7


In [21]:
frcstdf['unique_id'] = 1



In [22]:
frcstdf.head()

Unnamed: 0,unique_id,ds,y
0,1,2022-09-02,12.7
1,1,2022-09-01,12.7
2,1,2022-08-31,12.7
3,1,2022-08-30,12.7
4,1,2022-08-29,12.7


In [23]:
frcstdf['ds'] = pd.to_datetime(frcstdf['ds'])



In [24]:
frcstdf['ds'].head()

0   2022-09-02
1   2022-09-01
2   2022-08-31
3   2022-08-30
4   2022-08-29
Name: ds, dtype: datetime64[ns]

In [25]:
frcstdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3771 entries, 0 to 3770
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   unique_id  3771 non-null   int64         
 1   ds         3771 non-null   datetime64[ns]
 2   y          3771 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 88.5 KB


In [70]:
tscv2 = TimeSeriesSplit(max_train_size=3000, test_size=500)

In [71]:
# Split data into train and test folds, fit AutoARIMA and ETS on each, and print R2, MSE, RMSE, and MAE for both.
#forcasts=[]
i=1
for train_index, test_index in tscv2.split(frcstdf['ds']):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = frcstdf.loc[train_index], frcstdf.loc[test_index]
    y_train, y_test = frcstdf['y'].loc[train_index], frcstdf['y'].loc[test_index]
    
    # Sort into a new frame by date rather than sorting a slice of frcstdf in place.
    X_train = X_train.sort_values(by='ds')
    
    models = [AutoARIMA(), ETS()]
    fcst = StatsForecast(df=X_train, models=models, freq='D')
    
    
    #print(X_train.tail(5))
    forecasts = fcst.forecast(500)
       
    #print(forecasts.head(10))   
    
    print('AUTO ARIMA')
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, forecasts['AutoARIMA'])))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['AutoARIMA'])))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['AutoARIMA'], squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, forecasts['AutoARIMA'])))
    print('')
    
    print('ETS')
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, forecasts['ETS'])))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['ETS'])))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['ETS'], squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, forecasts['ETS'])))
    print('')
    
    i=i+1

AUTO ARIMA
Fold 1
R2 Score: -1987.5466768126776
Mean Squared Error: 171.7928541239923
Root Mean Squared Error: 13.106977306915287
Mean Absolute Error: 12.939298834609982

ETS
Fold 1
R2 Score: -1075.1931698043077
Mean Squared Error: 92.9735763234711
Root Mean Squared Error: 9.642280659857972
Mean Absolute Error: 9.637799809265136

AUTO ARIMA
Fold 2
R2 Score: -323.34166301753623
Mean Squared Error: 150.35455875204494
Root Mean Squared Error: 12.261914970837342
Mean Absolute Error: 12.172532306289675

ETS
Fold 2
R2 Score: -165.03066564116605
Mean Squared Error: 76.96657666343694
Root Mean Squared Error: 8.773059709328152
Mean Absolute Error: 8.746599809265138

AUTO ARIMA
Fold 3
R2 Score: -64.19617989099494
Mean Squared Error: 77.42014807104579
Root Mean Squared Error: 8.79887197719377
Mean Absolute Error: 8.712984803390503

ETS
Fold 3
R2 Score: -23.23007601685278
Mean Squared Error: 28.773097996444744
Root Mean Squared Error: 5.364056114214759
Mean Absolute Error: 5.252199809265138

AUTO 
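The head of `frcstdf` shows the rows stored newest-first (2022-09-02 at position 0). A positional splitter such as `TimeSeriesSplit` therefore places each test block at later positions, which are earlier dates, while the forecast extends forward from the last training date. This mismatch may explain part of the very negative R2 values above. A minimal sketch of the ordering problem, using a hand-written stand-in for the splitter (the helper `expanding_window_splits` and the dates are hypothetical):

```python
from datetime import date, timedelta

def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_positions, test_positions); each test block follows its train block by position."""
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))

# Made-up daily dates stored newest-first, like the head of frcstdf.
newest_first = [date(2022, 9, 2) - timedelta(days=k) for k in range(12)]
oldest_first = sorted(newest_first)

results = {}
for label, dates in [("newest first", newest_first), ("oldest first", oldest_first)]:
    train_pos, test_pos = next(expanding_window_splits(len(dates), n_splits=2, test_size=3))
    last_train = max(dates[p] for p in train_pos)
    first_test = min(dates[p] for p in test_pos)
    # True only when the test period comes after the training period in time.
    results[label] = first_test > last_train

print(results)  # {'newest first': False, 'oldest first': True}
```

Sorting `frcstdf` by `ds` in ascending order and resetting the index before calling `tscv2.split` would make each test block follow its training block in time, so that the forecast horizon and the test rows cover the same period.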

In [72]:
forecasts.tail()

unique_id,ds,AutoARIMA,ETS
1,2022-12-11,8.4,8.4
1,2022-12-12,8.4,8.4
1,2022-12-13,8.4,8.4
1,2022-12-14,8.4,8.4
1,2022-12-15,8.4,8.4


In [73]:
forecasts.to_csv('CPI forecasts')

In [74]:
forecasts

unique_id,ds,AutoARIMA,ETS
1,2021-08-03,8.4,8.4
1,2021-08-04,8.4,8.4
1,2021-08-05,8.4,8.4
1,2021-08-06,8.4,8.4
1,2021-08-07,8.4,8.4
...,...,...,...
1,2022-12-11,8.4,8.4
1,2022-12-12,8.4,8.4
1,2022-12-13,8.4,8.4
1,2022-12-14,8.4,8.4
