## Cross Validation

This notebook demonstrates how to use the cross validation library on a dataset in order to achieve accurate results.

In [73]:
\l ../../ml.q
.ml.loadfile`:init.q
\l graphics.q
\c 15 100

~


## Loading library scripts and data

The opening cell of the notebook loads the functions related to the FRESH library, along with the preprocessing functions from the folder mlutils that are used within the notebook. The cell below reads the daily Amazon stock data and drops any columns without variance.

In [74]:
5#amzndaydata:{lower[cols x]xcol x}("DFFFFJJ";enlist ",")0:`:SampleDatasets/amzn_day.us.txt
-1!"This dataset contains stock information for ",(string count amzndaydata)," days."
5#amzndaydata:.ml.util.dropconstant[amzndaydata] / drop columns without variance 

date       open high low  close volume   openint
------------------------------------------------
1997.05.16 1.97 1.98 1.71 1.73  14700000 0      
1997.05.19 1.76 1.77 1.62 1.71  6106800  0      
1997.05.20 1.73 1.75 1.64 1.64  5467200  0      
1997.05.21 1.64 1.65 1.38 1.43  18853200 0      
1997.05.22 1.44 1.45 1.31 1.4   11776800 0      


"This dataset contains stock information for 5170 days."


date       open high low  close volume  
----------------------------------------
1997.05.16 1.97 1.98 1.71 1.73  14700000
1997.05.19 1.76 1.77 1.62 1.71  6106800 
1997.05.20 1.73 1.75 1.64 1.64  5467200 
1997.05.21 1.64 1.65 1.38 1.43  18853200
1997.05.22 1.44 1.45 1.31 1.4   11776800


## Assign extracted features and complete extraction

In this case we use rolled table forecasting frames to predict the close price for the next day given features extracted from the previous 10 days. To generate the targets, the first 10 days are omitted, as the rolled table frames for these days would be incomplete and so may skew our results.

In [75]:
tabletargets:10 _amzndaydata
targets:tabletargets[`close]

#### Data preprocessing

Here we produce polynomial features from our data so that interactions between terms in the system can be studied alongside the individual features. The date column is then removed, as it is not used as a feature: the data will be subject to a sliding window, which negates its significance as a column.

In [76]:
/ Add 2nd order polynomial features to the table 
table:amzndaydata^.ml.util.polytab[flip 1_flip amzndaydata;2]
/ Remove the date column from the data as the rolling of data is independent of this
5#table:(1_cols t)#t:table 

open high low  close volume   high_open low_open low_high close_open close_high close_low volume_..
-------------------------------------------------------------------------------------------------..
1.97 1.98 1.71 1.73  14700000 3.9006    3.3687   3.3858   3.4081     3.4254     2.9583    2.8959e..
1.76 1.77 1.62 1.71  6106800  3.1152    2.8512   2.8674   3.0096     3.0267     2.7702    1.07479..
1.73 1.75 1.64 1.64  5467200  3.0275    2.8372   2.87     2.8372     2.87       2.6896    9458256..
1.64 1.65 1.38 1.43  18853200 2.706     2.2632   2.277    2.3452     2.3595     1.9734    3.09192..
1.44 1.45 1.31 1.4   11776800 2.088     1.8864   1.8995   2.016      2.03       1.834     1.69585..
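
As a rough illustration of what second-order interaction terms look like, the sketch below builds pairwise column products for a small made-up table. The table t and the names pairs and inter are hypothetical, and this is not the implementation of .ml.util.polytab; it only mirrors the idea that a column such as high_open holds the product of high and open. The column naming order in the sketch is arbitrary.

```q
/ toy table, invented for illustration only
t:([]a:1 2 3f;b:4 5 6f;c:7 8 9f)
c:cols t
/ every unordered pair of distinct column names
pairs:raze{x[y],/:(y+1) _ x}[c]each til count c
/ one new column per pair: the elementwise product of the two columns
inter:flip(`$"_" sv' string pairs)!prd each(flip t)pairs
/ append the interaction columns to the original table
show t,'inter
```

The resulting table keeps the columns a, b and c and gains a_b, a_c and b_c, so three raw features become six.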


In [77]:
rollcreatefeatures:{[x;fns;n] 
 raze{.ml.fresh.createfeatures[x;`placer;(-1)_cols x;y]}[;fns]each
 {update placer:last y from x y}[x;]each dropswin[n;til count x]}
dropswin:{(-1) _ (x-1) _ swin[x;y]}
swin:{[w;s]{1_x,y}\[w#0;s]}
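
To see what the two windowing helpers above return, the following sketch runs them on a short index list. The definitions are copied from the cell above; the window width 3 and the input til 5 are chosen only for illustration.

```q
/ definitions copied from the cell above
swin:{[w;s]{1_x,y}\[w#0;s]}
dropswin:{(-1) _ (x-1) _ swin[x;y]}

/ every window of width 3 over the indices 0..4, zero-padded at the start
show swin[3;til 5]
/ only the complete windows that still have a following row to predict
show dropswin[3;til 5]
```

Here swin returns the five windows 0 0 0, 0 0 1, 0 1 2, 1 2 3 and 2 3 4, while dropswin keeps only 0 1 2 and 1 2 3: the zero-padded windows are dropped from the front and the final window is dropped because no later row exists to serve as its target. With a width of 10 the first retained window ends at index 9, which matches the first placer value in the output below.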

## Feature Extraction


In [78]:
/ in this example we look only at features of the data alone with no parameters
show tabraw:rollcreatefeatures[table;0b;10]
-1"The forecasting frame contains ",(string count tabraw)," datapoints.";

placer| absenergy_open absenergy_high absenergy_low absenergy_close absenergy_volume absenergy_hi..
------| -----------------------------------------------------------------------------------------..
9     | 26.2488        27.315         22.4808       24.2289         1.146768e+15     74.38126    ..
10    | 24.648         25.7355        21.8067       23.5161         9.310284e+14     64.50407    ..
11    | 23.8913        24.9435        21.3727       22.7824         8.951354e+14     60.27941    ..
12    | 23.0888        24.1011        20.6431       22.1092         8.74734e+14      55.97656    ..
13    | 22.4156        23.7502        20.6431       22.4359         5.514669e+14     53.43622    ..
14    | 22.6524        24.5718        21.2071       23.2315         4.737263e+14     55.83232    ..
15    | 23.4199        25.1855        22.1938       23.8376         2.252638e+14     59.29666    ..
16    | 24.0639        25.3871        22.4031       23.8376         1.794141e+14     61.63945    ..


## Complete feature significance tests

Upon completion of the feature extraction algorithm, the importance of each of the features can be determined through the statistical tests contained in the .ml.fresh.significantfeatures function. This reduces the number of features used by the machine learning algorithm in making its prediction.

In [79]:
show tabreduced:key[tabraw]!(.ml.fresh.significantfeatures[p;targets])#p:value tabraw
-1 "The number of columns in the initial dataset is: ",string count cols amzndaydata;
-1 "The number of columns in the unfiltered dataset is: ",string count cols tabraw;
-1 "The number of columns in the filtered dataset is: ",string count cols tabreduced;

placer| absenergy_open absenergy_high absenergy_low absenergy_close absenergy_volume absenergy_hi..
------| -----------------------------------------------------------------------------------------..
9     | 26.2488        27.315         22.4808       24.2289         1.146768e+15     74.38126    ..
10    | 24.648         25.7355        21.8067       23.5161         9.310284e+14     64.50407    ..
11    | 23.8913        24.9435        21.3727       22.7824         8.951354e+14     60.27941    ..
12    | 23.0888        24.1011        20.6431       22.1092         8.74734e+14      55.97656    ..
13    | 22.4156        23.7502        20.6431       22.4359         5.514669e+14     53.43622    ..
14    | 22.6524        24.5718        21.2071       23.2315         4.737263e+14     55.83232    ..
15    | 23.4199        25.1855        22.1938       23.8376         2.252638e+14     59.29666    ..
16    | 24.0639        25.3871        22.4031       23.8376         1.794141e+14     61.63945    ..


### Data formatting

The data must now be converted to a matrix from a table in order to allow it to be passed to a machine learning algorithm for training.

In [80]:
mattab:{flip value flip x}
fitvalsfilter:0^mattab[value tabreduced]
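
The conversion can be checked on a small example. The sketch below applies the same mattab definition to a made-up table t containing one null, then fills the null with 0 as in the cell above.

```q
/ definition copied from the cell above
mattab:{flip value flip x}
/ toy table with one missing value, invented for illustration
t:([]a:1 0N 3;b:4 5 6)
/ rows of the table as a matrix, with the null replaced by 0
show 0^mattab t
```

Each row of the table becomes one row of the matrix, giving 1 4, 0 5 and 3 6.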

## Grid Search

Grid search is the process of finding optimal parameters for a given model. It works by applying a model to a dataset together with a dictionary listing candidate values for each of the model's parameters. The grid search fits the model with each combination of these values and, as a result, returns the optimal parameters for the model along with the corresponding accuracy score.
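
The combination step described above can be sketched in plain q. Everything in this block is an assumption made for illustration: params is a hypothetical grid, and score is a stand-in function rather than a fitted model, so this shows only the search pattern and not how .ml.xval.gridsearch is implemented.

```q
/ hypothetical parameter grid with the same shape as a grid search dictionary
params:`learning_rate`n_estimators!(0.1 0.3;200 400)
/ one dictionary per combination of parameter values
combos:key[params]!/:(cross/)value params
/ stand-in scoring function, NOT a real model: it simply prefers values near 0.1 and 400
score:{[p]neg abs[p[`learning_rate]-0.1]+abs p[`n_estimators]-400}
/ score every combination and keep the best one
scores:score each combos
best:combos first idesc scores
show best
```

Two values for each of two parameters give four candidates. In a real search the scoring function would fit the model on the training folds and evaluate it on the held-out fold.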

In [81]:
regr:.p.import[`sklearn.ensemble][`:GradientBoostingRegressor]
dict:`learning_rate`n_estimators`random_state!(0.1 0.3;200 400;l:1?1000)

The following cell produces the indices of the dataset, in ascending order, partitioned into k subsections (k=3 here).

In [91]:
i:.ml.xval.kfsplit[targets;3]
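
For intuition, a partition of this kind can be sketched as follows. The helpers folds and kfpairs are hypothetical and are not the toolkit's kfsplit; they assume contiguous folds, with any remainder rows placed in the last fold.

```q
/ hypothetical helper: cut the indices 0..n-1 into k contiguous folds
folds:{[n;k](floor(n%k)*til k) _ til n}
/ for fold i, validate on fold i and train on all remaining folds
kfpairs:{[f]{[f;i]`train`test!(raze f[til[count f]except i];f i)}[f]each til count f}

f:folds[10;3]
show f
show kfpairs f
```

For 10 rows and 3 folds this gives the folds 0 1 2, 3 4 5 and 6 7 8 9, and kfpairs pairs each fold, used as the test set, with the indices of the other two folds as the training set.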

In [85]:
.ml.xval.gridsearch[fitvalsfilter;targets;i;regr;dict]

0.1972137
`learning_rate`n_estimators`random_state!(,0.1;,400;,205)


Using the results from the grid search, we can apply the optimal parameters to the Gradient Boosting Regressor model.

In [88]:
clf:.p.import[`sklearn.ensemble][`:GradientBoostingRegressor
    ][`learning_rate pykw 0.1;`n_estimators pykw 400;`random_state pykw first l]

### Chain-forward, Monte Carlo and Repeated Stratified Randomised k-fold cross validation 

Chain-forward is a useful cross validation technique when dealing with time series data. In time series cross validation, each day in turn is set to be the test data, with all the days prior to it used as the training data.

The Monte Carlo method involves randomly selecting a certain amount of the data as the training set, using the rest as the test set. This process is then repeated a number of times, taking different random data as the training and testing sets each time.

In k-fold cross validation, the data is split up into k equal size sections. One fold is kept for validation while the other k-1 folds are used for training. The process is repeated k times so that in each iteration a different fold is used for validation. Stratification is the process in which the dataset is arranged so that each fold has a good representation of the dataset. This is used extensively in cases where the distribution of classes in the data is unbalanced.
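
The first two splitting schemes described above can be sketched as index generators. The helpers chunks, chainsplits and mcsplit are hypothetical and the toolkit's .ml.xval functions may split the data differently; in particular, this chain-forward sketch tests on contiguous chunks rather than on single days.

```q
/ hypothetical helper: cut the indices 0..n-1 into k contiguous chunks
chunks:{[n;k](floor(n%k)*til k) _ til n}
/ chain-forward: test on chunk i, train on every chunk that precedes it
chainsplits:{[n;k]c:chunks[n;k];{[c;i]`train`test!(raze i#c;c i)}[c]each 1+til k-1}
/ Monte Carlo: hold out a random fraction pc of the rows and train on the rest
mcsplit:{[n;pc]i:neg[n]?n;m:floor pc*n;`train`test!(m _ i;m#i)}

show chainsplits[10;5]
show mcsplit[10;0.2]
```

The chain-forward splits never train on rows that come after the test rows, which is why the method suits time series. The Monte Carlo split is random, so its output changes from run to run, and repeating it yields a different train and test set each time.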

In [90]:
chainfor:.ml.xval.chainxval[fitvalsfilter;targets;5;clf]
montecarlo:.ml.xval.mcxval[fitvalsfilter;targets;.2;clf;3]
repkf:.ml.xval.repkfstrat[fitvalsfilter;targets;3;4;clf]

-1"The cross validation score using chain-forward is ",string chainfor;
-1"The cross validation score using monte-carlo is ",string montecarlo;
-1"The cross validation score using repeated stratified randomized K-fold is ",string repkf;

The cross validation score using chain-forward is -0.6752342
The cross validation score using monte-carlo is 0.9994306
The cross validation score using repeated stratified randomized K-fold is 0.9992382


## Conclusion

From the above example it is evident that the Monte Carlo and Repeated Stratified Randomised k-fold methods give the highest scores on this particular dataset (0.9994 and 0.9992 respectively, against -0.6752 for chain-forward). This shows how, in this case, repeatedly splitting the input and target data at random leads to high cross validation scores.