Importing libraries

In [1]:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
from pycaret.regression import *
from yahoofinancials import YahooFinancials


We have prepared a list of instruments for which we need to import data. The yahoofinancials package requires Yahoo ticker symbols, so the list contains the ticker symbols and their descriptions. We import that file and extract the ticker symbols and the names as separate lists.

In [2]:
ticker_details = pd.read_excel("Ticker List.xlsx")
ticker = ticker_details['Ticker'].to_list()
names = ticker_details['Description'].to_list()
ticker_details.head(20)

Unnamed: 0,Ticker,Description
0,GC=F,Gold
1,SI=F,Silver
2,CL=F,Crude Oil
3,^GSPC,S&P500
4,PL=F,Platinum
5,HG=F,Copper
6,DX=F,Dollar Index
7,^VIX,Volatility Index
8,EEM,MSCI EM ETF
9,EURUSD=X,Euro USD


Once we have the list, we need to define the date range for which to import the data. The period we have chosen runs from 1 January 2010 to 1 April 2021, matching the start_date and end_date set in the code. We did not pull data prior to that because the Global Financial Crisis (GFC) in 2008–09 massively changed the economic and market landscapes, so relationships from before that period might be of less relevance now.

We create a date range and write it to a dataframe named values, which holds only the Date column for now. The prices we pull from yahoofinancials will be merged into it column by column.

In [3]:
#Creating Date Range and adding them to values table
end_date = '2021-04-01'
start_date = '2010-01-01'
date_range = pd.date_range(start_date,end_date)
values = pd.DataFrame({'Date':date_range})
values['Date'] = pd.to_datetime(values['Date'])
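One detail that matters later: pd.date_range defaults to daily frequency, so the table is indexed by calendar days and includes weekends. A minimal check, using the same two bounds as the cell above (the variable names are only illustrative):

```python
import pandas as pd

# Same bounds as the notebook cell; the default frequency is one calendar day.
calendar = pd.date_range("2010-01-01", "2021-04-01")

n_days = len(calendar)
# dayofweek: Monday=0 ... Sunday=6, so values >= 5 are Saturdays and Sundays.
n_weekend = int((calendar.dayofweek >= 5).sum())

print(n_days, n_weekend)
```

The row count agrees with the (4109, ...) shapes printed further down, and the weekend rows are one reason the merged table contains NaNs that need filling.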

Once we have the date range in a dataframe, we use the ticker symbols to pull data from the API. yahoofinancials returns its output in a JSON-like format. The following code loops over the list of ticker symbols, extracts just the adjusted closing prices for all the historical dates and adds them to the dataframe as new columns, merging on the date. Because these asset classes have different regional and trading holidays, the dates returned by each data pull might not be the same. After merging, we will therefore have several NAs, which we will forward-fill later on.

In [4]:
#Extracting Data from Yahoo Finance and Adding them to Values table using date as key
for i in ticker:
    raw_data = YahooFinancials(i)
    raw_data = raw_data.get_historical_price_data(start_date, end_date,'daily')
    df = pd.DataFrame(raw_data[i]['prices'])[['formatted_date','adjclose']]
    df.columns = ['Date1',i]
    df['Date1']= pd.to_datetime(df['Date1'])
    values = values.merge(df,how='left',left_on='Date',right_on='Date1')
    values = values.drop(labels='Date1',axis=1)
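The sketch below reproduces the merge step offline. The raw_data dictionary is hand-made and only mimics the nesting that the loop above indexes into (ticker, then 'prices', then 'formatted_date' and 'adjclose'); the prices in it are invented for illustration, not real quotes.

```python
import pandas as pd

# Hypothetical payload shaped like what the loop above reads:
# {ticker: {"prices": [{"formatted_date": ..., "adjclose": ...}, ...]}}
raw_data = {
    "GC=F": {"prices": [
        {"formatted_date": "2020-01-02", "adjclose": 1524.5},
        {"formatted_date": "2020-01-03", "adjclose": 1549.2},
        {"formatted_date": "2020-01-06", "adjclose": 1566.2},
    ]},
}

# Calendar-day spine that includes the weekend of 4-5 January 2020.
values = pd.DataFrame({"Date": pd.date_range("2020-01-02", "2020-01-06")})

for ticker, payload in raw_data.items():
    df = pd.DataFrame(payload["prices"])[["formatted_date", "adjclose"]]
    df.columns = ["Date", ticker]
    df["Date"] = pd.to_datetime(df["Date"])
    # A left join keeps every calendar day; days without a quote become NaN.
    values = values.merge(df, how="left", on="Date")

print(values)
```

Naming the price frame's date column 'Date' up front lets the merge use on='Date', which avoids the separate drop of 'Date1'; the result is the same as in the notebook's version.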



In [5]:
#Renaming columns to represent instrument names rather than their ticker codes for ease of readability
names.insert(0,'Date')
values.columns = names

#Forward-filling the NaN values, then back-filling any gaps left at the start of a column
values = values.ffill(axis=0)
values = values.bfill(axis=0)
values.isna().sum()

# Coercing numeric type to all columns except Date
cols=values.columns.drop('Date')
values[cols] = values[cols].apply(pd.to_numeric,errors='coerce').round(decimals=1)
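A small illustration of why both fills are needed, on made-up numbers: forward-filling carries the last known price across holidays and weekends, but it cannot fill a gap at the very start of a column, which is what the back-fill handles.

```python
import numpy as np
import pandas as pd

# Toy prices with gaps at the start, in the middle and at the end.
prices = pd.DataFrame({
    "A": [np.nan, 10.0, np.nan, np.nan, 12.0],
    "B": [5.0, np.nan, 6.0, np.nan, np.nan],
})

forward = prices.ffill()   # carry the last known price forward
filled = forward.bfill()   # only leading gaps remain; take the first later quote

print(filled)
```

Note that the back-fill copies a later price into earlier rows. Here it only touches rows before an instrument's first quote, but it is worth keeping in mind when judging how much look-ahead a feature set contains.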

    

In [6]:
imp = ['Gold','Silver', 'Crude Oil', 'S&P500','MSCI EM ETF']
# Calculating Short term -Historical Returns
change_days = [1,3,5,14,21]
data = pd.DataFrame(data=values['Date'])
for i in change_days:
    print(data.shape)
    x= values[cols].pct_change(periods=i).add_suffix("-T-"+str(i))
    data=pd.concat(objs=(data,x),axis=1)
    x=[]
    print(data.shape)
# Calculating Long term Historical Returns
change_days = [60,90,180,250]
for i in change_days:
    print(data.shape)
    x= values[imp].pct_change(periods=i).add_suffix("-T-"+str(i))
    data=pd.concat(objs=(data,x),axis=1)
    x=[]


(4109, 1)
(4109, 16)
(4109, 16)
(4109, 31)
(4109, 31)
(4109, 46)
(4109, 46)
(4109, 61)
(4109, 61)
(4109, 76)
(4109, 76)
(4109, 81)
(4109, 86)
(4109, 91)
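The printed shapes grow by 15 columns per short-term horizon (one per instrument) and by 5 per long-term horizon (one per instrument in imp). The toy example below shows what each of those columns holds; the price series is invented and the names are illustrative.

```python
import pandas as pd

prices = pd.DataFrame({"Gold": [100.0, 102.0, 101.0, 105.0, 110.0]})

frames = []
for n in [1, 3]:
    # pct_change(periods=n) is price[t] / price[t-n] - 1; the first n rows are NaN.
    frames.append(prices.pct_change(periods=n).add_suffix(f"-T-{n}"))
features = pd.concat(frames, axis=1)

print(features)
```

Each column is a trailing return, so it uses only information available up to the row's own date.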


In [7]:
#Calculating Moving averages for Gold
moving_avg = pd.DataFrame(values['Date'],columns=['Date'])
moving_avg['Date']=pd.to_datetime(moving_avg['Date'],format='%Y-%b-%d')
#Adding Simple Moving Average
moving_avg['Gold/15SMA'] = (values['Gold']/(values['Gold'].rolling(window=15).mean()))-1
moving_avg['Gold/30SMA'] = (values['Gold']/(values['Gold'].rolling(window=30).mean()))-1
moving_avg['Gold/60SMA'] = (values['Gold']/(values['Gold'].rolling(window=60).mean()))-1
moving_avg['Gold/90SMA'] = (values['Gold']/(values['Gold'].rolling(window=90).mean()))-1
moving_avg['Gold/180SMA'] = (values['Gold']/(values['Gold'].rolling(window=180).mean()))-1
#Adding Exponential Moving Average
moving_avg['Gold/90EMA'] = (values['Gold']/(values['Gold'].ewm(span=90,adjust=True,ignore_na=True).mean()))-1
moving_avg['Gold/180EMA'] = (values['Gold']/(values['Gold'].ewm(span=180,adjust=True,ignore_na=True).mean()))-1
moving_avg = moving_avg.dropna(axis=0)
print(moving_avg.shape)
moving_avg.head(20)

(3930, 8)


Unnamed: 0,Date,Gold/15SMA,Gold/30SMA,Gold/60SMA,Gold/90SMA,Gold/180SMA,Gold/90EMA,Gold/180EMA
179,2010-06-29,-0.003125,0.00579,0.018932,0.038923,0.077814,0.0334,0.052983
180,2010-06-30,-0.000973,0.007719,0.02089,0.040686,0.080186,0.03547,0.055197
181,2010-07-01,-0.031224,-0.02351,-0.011594,0.007173,0.045743,0.002816,0.0217
182,2010-07-02,-0.028255,-0.022272,-0.011026,0.007322,0.046244,0.003647,0.022338
183,2010-07-03,-0.025651,-0.022248,-0.01155,0.006554,0.045793,0.003565,0.022048
184,2010-07-04,-0.023034,-0.022016,-0.011992,0.00586,0.045343,0.003485,0.021762
185,2010-07-05,-0.020402,-0.021783,-0.012134,0.005188,0.044984,0.003407,0.021481
186,2010-07-06,-0.028265,-0.031432,-0.02224,-0.005693,0.033772,-0.006908,0.010683
187,2010-07-07,-0.022987,-0.027281,-0.018978,-0.002959,0.036759,-0.003666,0.01372
188,2010-07-08,-0.023236,-0.028287,-0.02108,-0.005607,0.034051,-0.00586,0.011207
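The features above are ratios of the price to its own moving average, minus one, so a positive value means gold is trading above that average. A minimal sketch on an invented series, with a 3-day window standing in for the 15- to 180-day windows used in the notebook:

```python
import pandas as pd

gold = pd.Series([100.0, 102.0, 104.0, 103.0, 107.0])

# Simple moving average: NaN until the window holds 3 observations.
sma3 = gold.rolling(window=3).mean()
ratio_sma = gold / sma3 - 1

# Exponential moving average: defined from the first row onward.
ema3 = gold.ewm(span=3, adjust=True).mean()
ratio_ema = gold / ema3 - 1

print(pd.DataFrame({"Gold/3SMA": ratio_sma, "Gold/3EMA": ratio_ema}))
```

The same mechanics explain the row count printed above: the 180-day window leaves the first 179 rows as NaN, and dropping them from 4109 rows gives 3930.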


In [8]:
#Merging Moving Average values to the feature space
data['Date']=pd.to_datetime(data['Date'],format='%Y-%b-%d')
data = pd.merge(left=data,right=moving_avg,how='left',on='Date')
data.isna().sum()

Date                       0
Gold-T-1                   1
Silver-T-1                 1
Crude Oil-T-1              1
S&P500-T-1                 1
Platinum-T-1               1
Copper-T-1                 1
Dollar Index-T-1           1
Volatility Index-T-1       1
MSCI EM ETF-T-1            1
Euro USD-T-1               1
Euronext100-T-1            1
Nasdaq-T-1                 1
Bse sensex-T-1             1
Nifty 50-T-1               1
Dow-T-1                    1
Gold-T-3                   3
Silver-T-3                 3
Crude Oil-T-3              3
S&P500-T-3                 3
Platinum-T-3               3
Copper-T-3                 3
Dollar Index-T-3           3
Volatility Index-T-3       3
MSCI EM ETF-T-3            3
Euro USD-T-3               3
Euronext100-T-3            3
Nasdaq-T-3                 3
Bse sensex-T-3             3
Nifty 50-T-3               3
Dow-T-3                    3
Gold-T-5                   5
Silver-T-5                 5
Crude Oil-T-5              5
S&P500-T-5    

That covers the features. Now we need to create targets, i.e. what we want to predict. Since we are predicting returns, we need to pick a horizon over which to predict them. We compute targets for 14-day and 22-day horizons and model the 22-day horizon first, because shorter horizons tend to be very volatile and lack predictive power.

In [9]:
#Calculating forward returns for Target
y = pd.DataFrame(data=values['Date'])
y['Gold-T+14']=values['Gold'].pct_change(periods=-14)
y['Gold-T+22']=values['Gold'].pct_change(periods=-22)
print(y.shape)
y.isna().sum()
# Removing NAs

data = data[data['Gold-T-250'].notna()]
y = y[y['Gold-T+22'].notna()]
#Adding Target Variables
data = pd.merge(left=data,right=y,how='inner',on='Date',suffixes=(False,False))


(4109, 3)
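A note on how to read these targets. pandas defines pct_change(periods=n) as price[t] / price[t-n] - 1, so with a negative period the target equals price[t] / price[t+n] - 1. That is related to, but not the same as, the conventional forward return price[t+n] / price[t] - 1: the sign is flipped and the magnitude differs slightly. The sketch below contrasts the two on an invented series; the shift-based formula is offered as an alternative to consider, not as what the notebook uses.

```python
import pandas as pd

price = pd.Series([100.0, 110.0, 121.0])
n = 1

# What pct_change with a negative period computes: price[t] / price[t+n] - 1
as_in_notebook = price.pct_change(periods=-n)

# Conventional forward return: price[t+n] / price[t] - 1
forward_return = price.shift(-n) / price - 1

print(pd.DataFrame({"pct_change(-n)": as_in_notebook, "forward": forward_return}))
```

With the notebook's definition, a rise in gold over the next 22 days shows up as a negative target value, which matters when predictions are turned into a trading view.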


Now we have the complete data set ready to start modelling. In the next part we will experiment with different algorithms using the extremely innovative and efficient PyCaret library. I will also exhibit how a pipeline can be created to continuously import new data to generate predictions using the trained models.

We are building two models, one for the 22-day horizon and one for the 14-day horizon. If one of them gives low accuracy, we can fall back on the other.

For 22 Day model

Predicting Gold Prices Using Machine Learning

Part II: Regression Modelling with PyCaret

In Part-I, we discussed importing data from a free, open-source API and prepared it in a form suitable for our intended machine learning exercise. You can refer to Part-I for the code or import the final dataset from the file named ‘Training Data’ in the github repo.
PyCaret is an open-source machine learning library in Python that can be used in any notebook environment and drastically reduces coding effort, making the process more efficient and productive. In the section below we will see how PyCaret can speed up a machine learning experiment. To begin, you will need to install PyCaret.

In [10]:
#We have two target columns. We will remove the T+14 day Target
data_22= data.drop(['Gold-T+14'],axis=1)


Setup

To begin any modelling exercise in PyCaret, the first step is the ‘setup’ function. The mandatory arguments are the dataset and the target label in the dataset. All the elementary and necessary data transformations, such as dropping IDs, one-hot encoding the categorical factors and imputing missing values, happen automatically behind the scenes. PyCaret also offers over 20 pre-processing options. For this example we go with the basics in setup and try different pre-processing techniques in later experiments.

In [11]:
a=setup(data_22,target='Gold-T+22',
        ignore_features=['Date'],session_id=11,
        silent=True,profile=False);

Unnamed: 0,Description,Value
0,session_id,11
1,Target,Gold-T+22
2,Original Data,"(3837, 104)"
3,Missing Values,False
4,Numeric Features,102
5,Categorical Features,0
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(2685, 99)"


‘compare_models’

The function fits all the available algorithms (25 as of now) to the data, runs a 10-fold cross-validation and reports 6 evaluation metrics for each model, all from a single function call. Two additional arguments can be used to save time:

a. turbo: True by default. When turbo=True, compare_models skips a few of the more costly algorithms, namely Kernel Ridge (kr), Automatic Relevance Determination (ard) and Multi-layer Perceptron (mlp). Pass turbo=False to evaluate them as well.

b. blacklist: Here, one can pass a list of algorithm abbreviations (see the docstring) that are known to take much longer to train with little performance improvement, e.g. the Theil-Sen Regressor (tr). In later PyCaret releases this argument is named exclude. The call below uses the defaults and does not blacklist anything.

In [12]:
compare_models(turbo=True)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
et,Extra Trees Regressor,0.01,0.0002,0.0142,0.8672,0.0132,0.7447,1.125
lightgbm,Light Gradient Boosting Machine,0.0119,0.0003,0.0167,0.8169,0.015,1.0718,0.593
knn,K Neighbors Regressor,0.012,0.0003,0.0171,0.8046,0.0143,1.2594,0.044
rf,Random Forest Regressor,0.0125,0.0003,0.0176,0.7965,0.0161,0.8998,4.105
gbr,Gradient Boosting Regressor,0.0195,0.0006,0.0253,0.5792,0.0229,1.29,1.881
dt,Decision Tree Regressor,0.0163,0.0007,0.027,0.5106,0.0189,1.9146,0.101
ada,AdaBoost Regressor,0.0273,0.0012,0.034,0.2362,0.0313,1.1633,0.612
br,Bayesian Ridge,0.0277,0.0013,0.0363,0.1317,0.03,1.7815,0.032
lr,Linear Regression,0.0278,0.0013,0.0365,0.1233,0.0291,1.9555,0.413
ridge,Ridge Regression,0.0278,0.0013,0.0365,0.1231,0.0307,1.6705,0.018


ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                    max_depth=None, max_features='auto', max_leaf_nodes=None,
                    max_samples=None, min_impurity_decrease=0.0,
                    min_impurity_split=None, min_samples_leaf=1,
                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                    n_estimators=100, n_jobs=-1, oob_score=False,
                    random_state=11, verbose=0, warm_start=False)
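For readers unfamiliar with the six metric columns, the sketch below computes them from their textbook definitions on four invented return values. It uses only the standard library; PyCaret's own implementation may differ in details such as how RMSLE treats negative values.

```python
import math

# Invented true and predicted returns, for illustration only.
y_true = [0.020, -0.010, 0.030, 0.015]
y_pred = [0.018, -0.004, 0.025, 0.020]
n = len(y_true)

errors = [p - t for t, p in zip(y_true, y_pred)]
mae = sum(abs(e) for e in errors) / n
mse = sum(e * e for e in errors) / n
rmse = math.sqrt(mse)

# R2 compares squared errors with the variance around the mean of y_true.
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# log1p needs inputs above -1, which holds for returns of a few percent.
rmsle = math.sqrt(
    sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
)
# MAPE divides by the true value, so targets near zero inflate it.
mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

print(mae, mse, rmse, r2, rmsle, mape)
```

The last comment is probably why the MAPE column in the table looks large next to the other error metrics: many 22-day returns are close to zero, so small absolute errors become large relative ones.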

In [13]:
knn = create_model('knn')


Unnamed: 0,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.0114,0.0003,0.016,0.8234,0.0142,1.1466
1,0.0138,0.0005,0.0213,0.7356,0.0166,0.8752
2,0.0123,0.0003,0.0169,0.8397,0.0153,1.5436
3,0.0115,0.0003,0.0166,0.7956,0.0148,1.7212
4,0.0126,0.0003,0.0171,0.8407,0.0141,0.9877
5,0.0111,0.0002,0.0152,0.8391,0.0134,1.1331
6,0.0105,0.0002,0.0133,0.861,0.0114,1.3062
7,0.0125,0.0003,0.0175,0.7772,0.015,1.1559
8,0.0132,0.0005,0.0216,0.6748,0.0151,1.3295
9,0.0111,0.0002,0.0152,0.8585,0.0126,1.3953


Removing Outliers

To remove outliers, we need to go back to the setup stage and use PyCaret’s inbuilt outlier remover and create the models again to see the impact.

In [14]:
b=setup(data_22,target='Gold-T+22', ignore_features=['Date'],
        session_id=11,silent=True,profile=False,remove_outliers=True);

Unnamed: 0,Description,Value
0,session_id,11
1,Target,Gold-T+22
2,Original Data,"(3837, 104)"
3,Missing Values,False
4,Numeric Features,102
5,Categorical Features,0
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(2550, 102)"


Ensemble Models

We can also check whether bagging or boosting improves model performance. The ensemble_model() function in PyCaret gives a quick view of how ensembling methods affect the results:

In [15]:
knn_bagged = ensemble_model(knn, method='Bagging')


Unnamed: 0,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.0119,0.0003,0.0167,0.8057,0.0143,1.2367
1,0.0116,0.0003,0.0164,0.8344,0.0147,0.7295
2,0.0122,0.0003,0.0166,0.8432,0.0147,1.1995
3,0.0111,0.0002,0.0156,0.8253,0.0137,1.6724
4,0.0131,0.0003,0.018,0.8076,0.0151,0.984
5,0.0103,0.0002,0.0133,0.8602,0.0124,1.1833
6,0.0105,0.0002,0.0132,0.8633,0.0113,1.1752
7,0.0122,0.0003,0.0173,0.7819,0.0147,1.2346
8,0.0126,0.0004,0.0194,0.6973,0.015,1.5195
9,0.0105,0.0002,0.0138,0.8838,0.012,1.3284


The bagged model's cross-validated scores are similar to those of the base KNN model, so bagging did not bring much improvement. The results can be seen in the notebook link in the repo.
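To put a number on that comparison, the fold-level R2 values from the two tables above can be averaged. The lists are copied from those tables; the variable names are only illustrative.

```python
from statistics import mean

# R2 per fold, copied from the create_model('knn') output.
r2_knn = [0.8234, 0.7356, 0.8397, 0.7956, 0.8407,
          0.8391, 0.8610, 0.7772, 0.6748, 0.8585]
# R2 per fold, copied from the ensemble_model(knn, method='Bagging') output.
r2_bagged = [0.8057, 0.8344, 0.8432, 0.8253, 0.8076,
             0.8602, 0.8633, 0.7819, 0.6973, 0.8838]

mean_knn = mean(r2_knn)
mean_bagged = mean(r2_bagged)

print(round(mean_knn, 4), round(mean_bagged, 4), round(mean_bagged - mean_knn, 4))
```

The KNN average matches the R2 reported for knn in the compare_models table, and the bagged model's average is higher by a little under 0.02.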

In [16]:
save_model(model=knn_bagged, model_name='22Day Regressor')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=False,
                                       features_todrop=['Date'], id_columns=[],
                                       ml_usecase='regression',
                                       numerical_features=[], target='Gold-T+22',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric...
                 ['trained_model',
                  BaggingRegressor(base_estimator=KNeighborsRegressor(algorithm='auto',
                                                                      leaf_size=30,
                                                                   