<h1>Data Preprocessing of Inflation Data</h1>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [75]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [2]:
df = pd.read_csv('exploredData')

In [3]:
df.head()

,date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
0,2022-09-02,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
1,2022-09-01,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2,2022-08-31,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
3,2022-08-30,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
4,2022-08-29,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


In [4]:
df['date'] = pd.to_datetime(df['date'])

In [5]:
df.set_index('date', inplace=True)

In [6]:
df.head()

date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
2022-09-02,2.65,2.43,0.534,0.0775,0.067,0.0772,0.0595,10.84,1.3355,12.7,10.3,6.6
2022-09-01,2.65,2.43,0.5359,0.0774,0.067,0.0775,0.0596,10.8,1.3333,12.7,10.3,6.6
2022-08-31,2.65,2.43,0.5395,0.0779,0.0669,0.0782,0.0599,10.82,1.3234,12.7,10.3,6.6
2022-08-30,2.15,2.43,0.542,0.0783,0.0669,0.0783,0.0601,10.85,1.3191,12.7,10.3,6.6
2022-08-29,2.15,2.43,0.5405,0.0785,0.0669,0.078,0.06,10.83,1.3216,12.7,10.3,6.6


<h3>Feature Engineering</h3>

The dataset already has low dimensionality, so there is no need for dimensionality reduction. The data is also consistent and homogeneous, so no further feature engineering will be done on this dataset.

In addition, all features are continuous, so there is no need to create any dummy variables. Instead, the data only needs to be scaled.

<h4>Scaling the Data</h4>

To scale the data, we use the StandardScaler class from the sklearn.preprocessing module. It standardizes each column by subtracting that column's mean and dividing by its standard deviation, so that every column ends up with a mean of 0 and a standard deviation of 1.
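The standardization step can be written out by hand to make the arithmetic concrete. The sketch below uses only the Python standard library; the function name `standardize` and the sample numbers are illustrative assumptions and are not taken from the dataset. It divides by the population standard deviation, which is, to our understanding, what StandardScaler does as well.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(x - mu) / sigma for x in values]


# Illustrative numbers only, not taken from the inflation dataset
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(sample)
print([round(z, 3) for z in z_scores])
```

After standardization the column has mean 0 and standard deviation 1, which puts exchange rates and interest rates on a comparable scale.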

In [8]:
scaler = StandardScaler()

In [55]:
scale_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

In [56]:
scale_df

date,7 Day Bobc,1 Month BoBC,CHN,EUR,GBP,USD,SDR,YEN,ZAR,CPI,CPIT,CPIXA
2022-09-02,-0.407504,10.675415,0.971495,-1.157787,-0.925579,-1.524403,-1.019696,-0.362137,0.884880,1.979911,1.543426,0.278027
2022-09-01,-0.407504,10.675415,0.977786,-1.166694,-0.925579,-1.511922,-1.013919,-0.386265,0.865676,1.979911,1.543426,0.278027
2022-08-31,-0.407504,10.675415,0.989705,-1.122159,-0.935353,-1.482802,-0.996590,-0.374201,0.779260,1.979911,1.543426,0.278027
2022-08-30,-0.539386,10.675415,0.997982,-1.086531,-0.935353,-1.478641,-0.985037,-0.356105,0.741726,1.979911,1.543426,0.278027
2022-08-29,-0.539386,10.675415,0.993016,-1.068717,-0.935353,-1.491122,-0.990813,-0.368169,0.763548,1.979911,1.543426,0.278027
...,...,...,...,...,...,...,...,...,...,...,...,...
2008-05-01,2.048134,-0.093949,-0.796522,0.910932,1.295454,1.110868,0.802131,1.017667,-0.508224,1.805058,1.771813,1.064576
2008-06-01,2.048134,-0.093949,-0.796522,0.910932,1.295454,1.110868,0.802131,1.017667,-0.508224,2.504472,2.261215,1.284810
2008-07-01,2.048134,-0.093949,-0.796522,0.910932,1.295454,1.110868,0.802131,1.017667,-0.508224,2.650183,2.326468,1.693816
2008-03-01,2.048134,-0.093949,-0.796522,0.910932,1.295454,1.110868,0.802131,1.017667,-0.508224,1.134786,1.347665,1.001652


<h4>Breaking data into Training and Test sets</h4>

It is important to note that CPI, CPIT, and CPIXA are all measures of the inflation rate, with CPI being the most common one. They will all be treated as target variables to see whether the prediction model performs better on one of them than on the others.

The data will now be separated into training and test sets. The sklearn.model_selection module provides a train_test_split function for this. See below:

In [57]:
# First, split the target variable (y) from the predictor variables (X)
X = scale_df.drop(columns=['CPI', 'CPIT', 'CPIXA'])

In [58]:
y = scale_df['CPI']

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
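To show what a 75/25 holdout split does to the row indices, here is a minimal sketch using only the standard library. The function name `holdout_split` is hypothetical, and the shuffle will not reproduce the exact rows that train_test_split selects for random_state=42; it only illustrates the idea of shuffling once and cutting off a test portion (rounded up here, which is an assumption about how the test size is derived).

```python
import math
import random


def holdout_split(n_samples, test_size=0.25, seed=42):
    """Shuffle row indices once and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(20)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, and fixing the seed makes the split repeatable between runs.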

It is also possible to predict several target variables at once; this option will be explored later on.

Now, a linear regression algorithm from sklearn will be used as the baseline.

In [60]:
lm = LinearRegression()

In [61]:
lm.fit(X_train,y_train)

LinearRegression()

In [62]:
ypred = lm.predict(X_test) 

In [63]:
print('R2 Score: ' + str(r2_score(y_test, ypred)))
print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred)))
print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred)))

R2 Score: 0.6187787845654473
Mean Squared Error: 0.3939166004809056
Mean Absolute Error: 0.46519847896278915
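For reference, the three metrics printed above can be computed by hand. The sketch below follows the standard definitions of R2, mean squared error, and mean absolute error; the helper name `regression_metrics` and the demo values are illustrative assumptions, not values from this notebook.

```python
def regression_metrics(y_true, y_pred):
    """Compute R2, MSE and MAE from two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1 - ss_res / ss_tot,
        "mse": ss_res / n,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Illustrative values only
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]
metrics = regression_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

An R2 of 1 means perfect predictions, while an R2 of 0 means the model does no better than predicting the mean of the target.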


The linear regression model achieved an R2 score of 0.6188 without any parameter tuning. However, this figure comes from a single train/test split, so it may be influenced by which rows happened to land in the test set. To get a more robust estimate, a GridSearchCV object will be used to try a few parameter settings with five-fold cross-validation: the training set is divided into five folds, each fold serves once as the validation set, and the scores are averaged. Note that fit_intercept is the only parameter in the grid that changes the fitted model; copy_X only controls whether the input data is copied before fitting.
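The five-fold idea can be sketched with plain index arithmetic. The generator below, with the hypothetical name `k_fold_indices`, produces contiguous, unshuffled folds; it is a simplified illustration of the technique rather than the exact splitter that GridSearchCV uses internally.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for contiguous folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(12, n_splits=5))
for train, validation in folds:
    print(validation)
```

Each sample appears in exactly one validation fold, so every row of the training set is used for validation once and for fitting four times.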

In [64]:
lm2=LinearRegression()

In [71]:
grid_search = GridSearchCV(lm2, param_grid={'copy_X': [True,False], 'fit_intercept': [True, False]},  cv=5)

In [72]:
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=LinearRegression(),
             param_grid={'copy_X': [True, False],
                         'fit_intercept': [True, False]})

In [73]:
ypred2 = grid_search.predict(X_test) 

In [74]:
print('R2 Score: ' + str(r2_score(y_test, ypred2)))
print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred2)))
print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred2)))

R2 Score: 0.6191253137183562
Mean Squared Error: 0.3935585312540244
Mean Absolute Error: 0.46626575044290997


In [85]:
rm = Ridge()

In [96]:
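# Note: 'lbfgs' is rejected by the installed scikit-learn version, so every candidate using it fails with the ValueError shown in the output and is scored as NaN
grid_search3 = GridSearchCV(rm, param_grid={'copy_X': [True, False], 'fit_intercept': [True, False], 'alpha': [1, 10, 25, 50, 100], 'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'lbfgs']}, cv=5)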

In [97]:
grid_search3.fit(X_train, y_train)
ypred3 = grid_search3.predict(X_test) 

Traceback (most recent call last):
  File "C:\Users\ituser\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\ituser\anaconda3\lib\site-packages\sklearn\linear_model\_ridge.py", line 762, in fit
    return super().fit(X, y, sample_weight=sample_weight)
  File "C:\Users\ituser\anaconda3\lib\site-packages\sklearn\linear_model\_ridge.py", line 593, in fit
    self.coef_, self.n_iter_ = _ridge_regression(
  File "C:\Users\ituser\anaconda3\lib\site-packages\sklearn\linear_model\_ridge.py", line 397, in _ridge_regression
    raise ValueError("Known solvers are 'sparse_cg', 'cholesky', 'svd'"
ValueError: Known solvers are 'sparse_cg', 'cholesky', 'svd' 'lsqr', 'sag' or 'saga'. Got lbfgs.

In [98]:
print('R2 Score: ' + str(r2_score(y_test, ypred2)))
print('Mean Squared Error: ' + str(mean_squared_error(y_test, ypred2)))
print('Mean Absolute Error: ' + str(mean_absolute_error(y_test, ypred2)))

R2 Score: 0.6191253137183562
Mean Squared Error: 0.3935585312540244
Mean Absolute Error: 0.46626575044290997


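After cross-validation and some parameter tuning, the R2 score is 0.6191, a marginal change from 0.6188. Note that the cell above scores ypred2, the predictions from the LinearRegression grid search, so this figure does not yet reflect the Ridge search over the alpha regularization parameter; scoring ypred3 would be needed for that.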
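To illustrate what the alpha parameter does, consider ridge regression with a single feature and no intercept, where the penalized least-squares solution has the closed form w = sum(x*y) / (sum(x*x) + alpha). The sketch below uses that formula; the function name `ridge_slope` and the demo data are illustrative assumptions, and alpha=0 is added to the notebook's alpha grid only to show the unpenalized slope.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge slope for one feature without an intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


# Illustrative data that follows y = 2x exactly
x_demo = [-1.0, 0.0, 1.0, 2.0]
y_demo = [-2.0, 0.0, 2.0, 4.0]

alphas = [0, 1, 10, 25, 50, 100]
slopes = [ridge_slope(x_demo, y_demo, a) for a in alphas]
for a, w in zip(alphas, slopes):
    print(a, round(w, 4))
```

With alpha=0 the slope equals the ordinary least-squares value of 2, and larger alpha values shrink it toward zero, which is the regularization effect being tuned in the grid search.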

An R2 of about 0.62 means the linear model explains roughly 62% of the variance in the scaled CPI on the held-out test set, which is a reasonable baseline. In the next notebook, we will look at other ways to predict CPI.

In [100]:
scale_df.to_csv('ScaledData')