<h1>Predicting Inflation Rates - Modelling</h1>

This notebook fits several types of model to predict inflation rates, using the data produced in the previous notebook (Data Preprocessing). Each model is evaluated to see how well it performs.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.linear_model import BayesianRidge

In [4]:
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, ETS



In [5]:
import time

The dataset is now loaded from the previous notebook.

In [8]:
df = pd.read_csv('../Data/preprocessed')

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,0,0,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,1,1,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2,2,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,3,3,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,4,4,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


In [10]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [11]:
df.head()

Unnamed: 0,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,0,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,1,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,3,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,4,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


<h3>Using the decision tree regressor</h3>

In [12]:
X = df
y = df['CPI']

In [13]:
tscv = TimeSeriesSplit()

In [14]:
# Split data into train and test, run decision tree regression model, and print R2 score, MSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    dt = DecisionTreeRegressor()
    
    dt.fit(X_train, y_train)
    ypred = dt.predict(X_test)
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -0.9820142471552693
Mean Squared Error: 0.4041878980891719
Root Mean Squared Error: 0.6357577353750183
Mean Absolute Error: 0.514171974522293

Fold 2
R2 Score: 0.5485329784455428
Mean Squared Error: 0.032213375796178265
Root Mean Squared Error: 0.17948085077851136
Mean Absolute Error: 0.15015923566879025

Fold 3
R2 Score: -0.9803901525821004
Mean Squared Error: 4.012531847133747
Root Mean Squared Error: 2.0031305117574707
Mean Absolute Error: 1.4664012738853482

Fold 4
R2 Score: -0.21627173332084704
Mean Squared Error: 0.6856528662420378
Root Mean Squared Error: 0.8280415848506872
Mean Absolute Error: 0.6912420382165605

Fold 5
R2 Score: 0.6756106162661579
Mean Squared Error: 4.735079617834376
Root Mean Squared Error: 2.1760238091147754
Mean Absolute Error: 1.3957006369426703
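The per-fold evaluation loop above is repeated almost verbatim for every model in this notebook. As a rough sketch only, it could be factored into one helper. The name `evaluate_by_fold` and the synthetic `X_demo`/`y_demo` data are assumptions made for illustration and are not part of the original notebook. RMSE is computed with `np.sqrt` rather than `squared=False`, since that argument is deprecated in recent scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor


def evaluate_by_fold(model_factory, X, y, splitter):
    """Fit a fresh model on each fold and return one row of metrics per fold."""
    rows = []
    for fold, (train_idx, test_idx) in enumerate(splitter.split(X), start=1):
        model = model_factory()  # new, unfitted model for every fold
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict(X.iloc[test_idx])
        mse = mean_squared_error(y.iloc[test_idx], pred)
        rows.append({
            "fold": fold,
            "r2": r2_score(y.iloc[test_idx], pred),
            "mse": mse,
            "rmse": float(np.sqrt(mse)),  # avoids the deprecated squared=False
            "mae": mean_absolute_error(y.iloc[test_idx], pred),
        })
    return pd.DataFrame(rows).set_index("fold")


# Synthetic stand-in data, used only to make the sketch runnable.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=120), "b": rng.normal(size=120)})
y_demo = 2.0 * X_demo["a"] + rng.normal(scale=0.1, size=120)

report = evaluate_by_fold(
    lambda: DecisionTreeRegressor(random_state=0),
    X_demo,
    y_demo,
    TimeSeriesSplit(),
)
print(report)
```

Returning a DataFrame instead of printing makes it easier to compare models side by side, for example by concatenating the reports of several regressors.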



<h3>Decision Tree Regressor with hyperparameter tuning using GridSearchCV</h3>

In [15]:
i=1
for train_index, test_index in tscv.split(y):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
   
    
    tree = DecisionTreeRegressor()
    
    grid_search = GridSearchCV(tree, param_grid={'ccp_alpha':[0.001, 0.01, 0, 0.1, 1, 10]})
    grid_search.fit(X_train, y_train)
    ypred = grid_search.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -1.282241820719963
Mean Squared Error: 0.4654126607676914
Root Mean Squared Error: 0.6822115953043392
Mean Absolute Error: 0.5661254149098406

Fold 2
R2 Score: 0.6253024571478324
Mean Squared Error: 0.0267356687898089
Root Mean Squared Error: 0.1635104546804543
Mean Absolute Error: 0.12277070063694338

Fold 3
R2 Score: -0.9803901525821086
Mean Squared Error: 4.012531847133763
Root Mean Squared Error: 2.0031305117574747
Mean Absolute Error: 1.466401273885353

Fold 4
R2 Score: -0.1495654867874976
Mean Squared Error: 0.6480483343937475
Root Mean Squared Error: 0.8050144932817964
Mean Absolute Error: 0.6299244706489949

Fold 5
R2 Score: 0.6763338742418159
Mean Squared Error: 4.724522292993612
Root Mean Squared Error: 2.173596626100071
Mean Absolute Error: 1.4117834394904418



Even with tuning, only folds 2 and 5 have positive R2 values (about 0.63 and 0.68); folds 1, 3, and 4 still have negative R2 values.
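In the tuning cell above, `GridSearchCV` is called without a `cv` argument, so its inner validation uses the default 5-fold splitter rather than a time-ordered one. A minimal sketch of passing a `TimeSeriesSplit` as the inner splitter follows. It assumes synthetic data (`X_demo`, `y_demo`) and an explicit `scoring` choice, neither of which comes from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data, used only to make the sketch runnable.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y_demo = X_demo["a"] - 0.5 * X_demo["b"] + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.001, 0.01, 0.1, 1, 10]},
    cv=TimeSeriesSplit(n_splits=3),      # inner folds also respect row order
    scoring="neg_mean_squared_error",    # assumed metric for model selection
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With this setup, each candidate `ccp_alpha` is validated only on rows that come after its training rows, which matches the outer evaluation scheme.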

In [16]:
# Split data into train and test, run random forest regression model, and print R2 score, MSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    rf = RandomForestRegressor()
    
    rf.fit(X_train, y_train)
    ypred = rf.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: -0.5509784102331234
Mean Squared Error: 0.316287687898089
Root Mean Squared Error: 0.5623946015904571
Mean Absolute Error: 0.4385222929936302

Fold 2
R2 Score: 0.6865667105064368
Mean Squared Error: 0.02236430095541449
Root Mean Squared Error: 0.1495469857784318
Mean Absolute Error: 0.12102707006369548

Fold 3
R2 Score: 0.5802856746391148
Mean Squared Error: 0.8503966226114573
Root Mean Squared Error: 0.9221695194547787
Mean Absolute Error: 0.7101194267515882

Fold 4
R2 Score: 0.04043663393881858
Mean Squared Error: 0.5409378136942598
Root Mean Squared Error: 0.7354847474246218
Mean Absolute Error: 0.5478614649681484

Fold 5
R2 Score: 0.7263578962760076
Mean Squared Error: 3.99432661146494
Root Mean Squared Error: 1.998581149582108
Mean Absolute Error: 1.299038216560504



The R2 score for the random forest regressor is not consistent across folds. It ranges from about -0.55 in fold 1 to about 0.73 in fold 5, with fold 4 close to zero. This could be because the model is unable to predict values at certain parts of the data. The MSE and MAE are both generally low, although both are noticeably higher in fold 5 (MSE of about 3.99).

In [17]:
# Split data into train and test, run GradientBoostingRegressor model, and print R2 score, MSE, and MAE.
i=0
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    gbr = GradientBoostingRegressor()
    
    gbr.fit(X_train, y_train)
    ypred = gbr.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 0
R2 Score: 0.11458889689097485
Mean Squared Error: 0.18055997994166614
Root Mean Squared Error: 0.424923498928532
Mean Absolute Error: 0.3119122944093811

Fold 1
R2 Score: 0.7968013090549106
Mean Squared Error: 0.014498768415393744
Root Mean Squared Error: 0.12041083180259882
Mean Absolute Error: 0.08189764008152989

Fold 2
R2 Score: 0.7439251591683524
Mean Squared Error: 0.5188414276585804
Root Mean Squared Error: 0.7203064817552182
Mean Absolute Error: 0.5935777640382834

Fold 3
R2 Score: 0.8427449458452029
Mean Squared Error: 0.0886499091102699
Root Mean Squared Error: 0.2977413459872006
Mean Absolute Error: 0.24182127444959703

Fold 4
R2 Score: 0.7810262595901164
Mean Squared Error: 3.1963379415231614
Root Mean Squared Error: 1.7878305125271694
Mean Absolute Error: 1.0932438781047424



The gradient boosting regressor has relatively consistent results across the folds. The R2 score is generally high (about 0.74 to 0.84), with just one significantly low value of about 0.11 in the first fold. The MSE and MAE are low in most folds, the exception being the last fold (MSE of about 3.20).

In [18]:
# Split data into train and test, run BayesianRidge model, and print R2 score, MSE, and MAE.
i=1
for train_index, test_index in tscv.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    
    br = BayesianRidge()
    
    br.fit(X_train, y_train)
    ypred = br.predict(X_test)
    
    
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, ypred)))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, ypred, squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))
    print('')
    i=i+1

Fold 1
R2 Score: 0.9999999999999984
Mean Squared Error: 3.1533588606085414e-16
Root Mean Squared Error: 1.7757699345941582e-08
Mean Absolute Error: 1.566073783817605e-08

Fold 2
R2 Score: 1.0
Mean Squared Error: 1.5949702389866517e-19
Root Mean Squared Error: 3.993707849839109e-10
Mean Absolute Error: 3.1405752945167563e-10

Fold 3
R2 Score: 1.0
Mean Squared Error: 3.681284718734038e-20
Root Mean Squared Error: 1.9186674330727663e-10
Mean Absolute Error: 1.7787576295119022e-10

Fold 4
R2 Score: 1.0
Mean Squared Error: 8.226614332281202e-22
Root Mean Squared Error: 2.8682075120676333e-11
Mean Absolute Error: 2.5451435721884024e-11

Fold 5
R2 Score: 1.0
Mean Squared Error: 1.1839755115386613e-21
Root Mean Squared Error: 3.440894522560465e-11
Mean Absolute Error: 2.6739752132266204e-11



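The Bayesian Ridge, much like the linear regression model, performs extremely well on the dataset, with R2 scores of 1 and MSE and MAE values that are effectively zero. Note, however, that X was set to the full DataFrame, which still contains the CPI column being predicted, so a linear model can recover the target exactly. These scores most likely reflect target leakage rather than genuine predictive skill.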
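To illustrate how a target column left inside the feature matrix can produce near-perfect scores, the sketch below fits `BayesianRidge` twice on synthetic data: once with the target among the features and once without. The frame `demo`, its values, and the helper `fit_and_score` are assumptions for illustration; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import r2_score

# Synthetic stand-in data: CPI depends only weakly on the other columns.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "USD": rng.normal(size=300),
    "ZAR": rng.normal(size=300),
})
demo["CPI"] = 0.3 * demo["USD"] + rng.normal(size=300)

# Simple ordered split: first 200 rows for training, the rest for testing.
train, test = demo.iloc[:200], demo.iloc[200:]


def fit_and_score(feature_cols):
    """Fit on the chosen feature columns and return the test-set R2."""
    model = BayesianRidge().fit(train[feature_cols], train["CPI"])
    return r2_score(test["CPI"], model.predict(test[feature_cols]))


r2_leaky = fit_and_score(["USD", "ZAR", "CPI"])  # target is also a feature
r2_clean = fit_and_score(["USD", "ZAR"])         # target excluded

print("with target in X:   ", r2_leaky)
print("without target in X:", r2_clean)
```

In the notebook itself, the equivalent change would be something like `X = df.drop(columns=['CPI'])`. Whether related columns such as CPIT and CPIXA should also be excluded is a modelling decision this sketch does not settle.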

In [20]:
otherdf = pd.read_csv('../Data/exploredData')

In [21]:
otherdf.head()

Unnamed: 0,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,2022-09-02,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,2022-09-01,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2022-08-31,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,2022-08-30,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,2022-08-29,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


In [22]:
frcstdf = otherdf[['date', 'CPI']]

In [23]:
frcstdf.reset_index(inplace=True)
frcstdf.rename(columns={'index': 'unique_id', 'date':'ds', 'CPI':'y' }, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frcstdf.rename(columns={'index': 'unique_id', 'date':'ds', 'CPI':'y' }, inplace=True)


In [24]:
frcstdf.head()

Unnamed: 0,unique_id,ds,y
0,0,2022-09-02,12.7
1,1,2022-09-01,12.7
2,2,2022-08-31,12.7
3,3,2022-08-30,12.7
4,4,2022-08-29,12.7


In [25]:
frcstdf['unique_id'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frcstdf['unique_id'] = 1


In [26]:
frcstdf.head()

Unnamed: 0,unique_id,ds,y
0,1,2022-09-02,12.7
1,1,2022-09-01,12.7
2,1,2022-08-31,12.7
3,1,2022-08-30,12.7
4,1,2022-08-29,12.7


In [27]:
frcstdf['ds'] = pd.to_datetime(frcstdf['ds'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frcstdf['ds'] = pd.to_datetime(frcstdf['ds'])


In [28]:
frcstdf['ds'].head()

0   2022-09-02
1   2022-09-01
2   2022-08-31
3   2022-08-30
4   2022-08-29
Name: ds, dtype: datetime64[ns]

In [29]:
frcstdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3771 entries, 0 to 3770
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   unique_id  3771 non-null   int64         
 1   ds         3771 non-null   datetime64[ns]
 2   y          3771 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 88.5 KB
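The `SettingWithCopyWarning` messages in the cells above arise because `frcstdf` is created by selecting columns from `otherdf` and is then modified in place. One way to avoid them, sketched here under the assumption of a small hypothetical frame `otherdf_demo`, is to take an explicit `.copy()` and build the forecasting frame in a single chain.

```python
import pandas as pd

# Hypothetical miniature of the explored dataset (values taken from the head() output).
otherdf_demo = pd.DataFrame({
    "date": ["2022-09-02", "2022-09-01", "2022-08-31"],
    "USD": [0.0772, 0.0775, 0.0782],
    "CPI": [12.7, 12.7, 12.7],
})

frcst_demo = (
    otherdf_demo[["date", "CPI"]]
    .copy()                                         # independent frame, no view/copy ambiguity
    .rename(columns={"date": "ds", "CPI": "y"})
    .assign(unique_id=1, ds=lambda d: pd.to_datetime(d["ds"]))
    .loc[:, ["unique_id", "ds", "y"]]               # column order used in the cells above
)
print(frcst_demo.dtypes)
```

The result has the same `unique_id`, `ds`, and `y` columns as `frcstdf`, with `ds` already converted to a datetime type.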


In [30]:
tscv2 = TimeSeriesSplit(max_train_size=3000, test_size=500)

In [31]:
# Split data into train and test, run AutoARIMA model, and print R2 score, MSE, and MAE.
#forcasts=[]
i=1
for train_index, test_index in tscv2.split(frcstdf['ds']):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = frcstdf.loc[train_index], frcstdf.loc[test_index]
    y_train, y_test = frcstdf['y'].loc[train_index], frcstdf['y'].loc[test_index]
    
    X_train.sort_values(by='ds', inplace=True)
    
    models = [AutoARIMA(), ETS()]
    fcst = StatsForecast(df=X_train, models=models, freq='D')
    
    
    #print(X_train.tail(5))
    forecasts = fcst.forecast(500)
       
    #print(forecasts.head(10))   
    
    print('AUTO ARIMA')
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, forecasts['AutoARIMA'])))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['AutoARIMA'])))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['AutoARIMA'], squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, forecasts['AutoARIMA'])))
    print('')
    
    print('ETS')
    print('Fold ' + str(i))
    print('R2 Score: ' + str(r2_score(y_test, forecasts['ETS'])))
    print('Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['ETS'])))
    print('Root Mean Squared Error: ' + str(mean_squared_error(y_test, forecasts['ETS'], squared=False)))
    print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, forecasts['ETS'])))
    print('')
    
    i=i+1

  ETS._warn()


AUTO ARIMA
Fold 1
R2 Score: -1987.5466768126776
Mean Squared Error: 171.7928541239923
Root Mean Squared Error: 13.106977306915287
Mean Absolute Error: 12.939298834609982

ETS
Fold 1
R2 Score: -1075.1931698043077
Mean Squared Error: 92.9735763234711
Root Mean Squared Error: 9.642280659857972
Mean Absolute Error: 9.637799809265136





AUTO ARIMA
Fold 2
R2 Score: -323.3416772767229
Mean Squared Error: 150.35456536215386
Root Mean Squared Error: 12.261915240375536
Mean Absolute Error: 12.172532559967044

ETS
Fold 2
R2 Score: -165.03066564116605
Mean Squared Error: 76.96657666343694
Root Mean Squared Error: 8.773059709328152
Mean Absolute Error: 8.746599809265138





AUTO ARIMA
Fold 3
R2 Score: -63.99808998042381
Mean Squared Error: 77.18491726099774
Root Mean Squared Error: 8.785494707812289
Mean Absolute Error: 8.700354706954958

ETS
Fold 3
R2 Score: -23.23007601685278
Mean Squared Error: 28.773097996444744
Root Mean Squared Error: 5.364056114214759
Mean Absolute Error: 5.252199809265138





AUTO ARIMA
Fold 4
R2 Score: -147.49803062206232
Mean Squared Error: 83.14799145298599
Root Mean Squared Error: 9.118552048049404
Mean Absolute Error: 8.731817150878905

ETS
Fold 4
R2 Score: -50.52382478702004
Mean Squared Error: 28.849557971038855
Root Mean Squared Error: 5.371178452727004
Mean Absolute Error: 5.318799809265136





AUTO ARIMA
Fold 5
R2 Score: -0.29079228107420385
Mean Squared Error: 19.519581599884177
Root Mean Squared Error: 4.418097056412883
Mean Absolute Error: 3.9002001693725585

ETS
Fold 5
R2 Score: -0.29079228107420385
Mean Squared Error: 19.519581599884177
Root Mean Squared Error: 4.418097056412883
Mean Absolute Error: 3.9002001693725585
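One possible contributor to the strongly negative R2 scores above is row order. The `head()` output shows the frame sorted newest-first, so the test indices produced by `TimeSeriesSplit` correspond to dates earlier than the training rows, while the forecast extends forward from the latest training date. The sketch below shows the effect on a small synthetic frame and sorts chronologically before splitting. The helper `chronological_folds` and the `demo` data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in, ordered newest-first like the notebook's data.
dates = pd.bdate_range("2022-01-03", periods=60)[::-1]
demo = pd.DataFrame({"unique_id": 1, "ds": dates, "y": np.linspace(12.7, 8.4, 60)})

splitter = TimeSeriesSplit(n_splits=3, max_train_size=30, test_size=10)

# Without sorting: the "train" rows are newer than the "test" rows.
raw_train_idx, raw_test_idx = next(iter(splitter.split(demo)))
train_is_newer = demo.iloc[raw_train_idx]["ds"].min() > demo.iloc[raw_test_idx]["ds"].max()
print("unsorted: train newer than test?", train_is_newer)


def chronological_folds(frame, splitter):
    """Yield (train, test) frames after sorting oldest-first by 'ds'."""
    ordered = frame.sort_values("ds").reset_index(drop=True)
    for train_idx, test_idx in splitter.split(ordered):
        yield ordered.iloc[train_idx], ordered.iloc[test_idx]


folds = list(chronological_folds(demo, splitter))
for train, test in folds:
    # Every training window now ends before its test window begins.
    print(train["ds"].max().date(), "<", test["ds"].min().date())
```

After sorting, a forecast made from the end of each training window lines up with the dates in the corresponding test window, so the metrics compare like with like.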



In [32]:
forecasts.tail()

Unnamed: 0_level_0,ds,AutoARIMA,ETS
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,2022-12-11,8.4,8.4
1,2022-12-12,8.4,8.4
1,2022-12-13,8.4,8.4
1,2022-12-14,8.4,8.4
1,2022-12-15,8.4,8.4


In [33]:
forecasts.to_csv('CPI forecasts')

In [34]:
forecasts

Unnamed: 0_level_0,ds,AutoARIMA,ETS
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,2021-08-03,8.4,8.4
1,2021-08-04,8.4,8.4
1,2021-08-05,8.4,8.4
1,2021-08-06,8.4,8.4
1,2021-08-07,8.4,8.4
...,...,...,...
1,2022-12-11,8.4,8.4
1,2022-12-12,8.4,8.4
1,2022-12-13,8.4,8.4
1,2022-12-14,8.4,8.4
