## Cross Validation

The point of this notebook is to demonstrate how to use the cross validation library on datasets in order to achieve accurate results.

In [1]:
\l ../../ml.q
.ml.loadfile`:init.q
\l graphics.q
\c 15 100



## Loading library scripts and data

In the first cell above, the functions related to the FRESH library are loaded in the first line, while the preprocessing functions used within the notebook are loaded from the folder mlutils.

This data is taken from the "Breast Cancer Wisconsin (Diagnostic) Data Set" to predict whether a mass of cells is malignant or benign.

The diagnosis column is removed from the dataset and used as the target, as this is what we will be predicting.

In [35]:
data:("FSFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"; enlist ",") 0:`:datasets/data.csv
targets:select diagnosis from data
5#data:delete diagnosis from data

"The dataset contains ",(string count data)," patient files"


id          radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean co..
-------------------------------------------------------------------------------------------------..
842302      17.99       10.38        122.8          1001      0.1184          0.2776           0...
842517      20.57       17.77        132.9          1326      0.08474         0.07864          0...
8.43009e+07 19.69       21.25        130            1203      0.1096          0.1599           0...
8.43483e+07 11.42       20.38        77.58          386.1     0.1425          0.2839           0...
8.43584e+07 20.29       14.34        135.1          1297      0.1003          0.1328           0...


"The dataset contains 569 patient files"


One hot encoding is used on the target data to convert symbols into a numerical representation.

In [13]:
targets:exec diagnosis_M from .ml.util.onehot[targets]
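
As a rough illustration of what one-hot encoding does to a symbol column, the sketch below uses plain Python rather than q and is not the toolkit's implementation. The helper `one_hot` and the sample labels are hypothetical; the `diagnosis_M` key simply mirrors the column selected in the cell above.

```python
def one_hot(values, prefix):
    """Return one 0/1 column per distinct value, keyed as '<prefix>_<value>'."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Hypothetical sample labels: M = malignant, B = benign
diagnosis = ["M", "B", "B", "M", "B"]
encoded = one_hot(diagnosis, "diagnosis")

# Keeping only the malignant indicator gives a binary target vector
targets = encoded["diagnosis_M"]
print(targets)  # [1, 0, 0, 1, 0]
```

With only two classes, a single indicator column carries all the information, which is why only `diagnosis_M` is kept as the target.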

### Data preprocessing

Here we produce polynomial features from our data so that interactions between terms in the system can be studied, not just the individual features.

In [14]:
/ Add 2nd order polynomial features to the table 
5#table:data^.ml.util.polytab[flip 1_flip data;2]


id          radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean co..
-------------------------------------------------------------------------------------------------..
842302      17.99       10.38        122.8          1001      0.1184          0.2776           0...
842517      20.57       17.77        132.9          1326      0.08474         0.07864          0...
8.43009e+07 19.69       21.25        130            1203      0.1096          0.1599           0...
8.43483e+07 11.42       20.38        77.58          386.1     0.1425          0.2839           0...
8.43584e+07 20.29       14.34        135.1          1297      0.1003          0.1328           0...
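
To make the idea of second-order interaction features concrete, here is a minimal Python sketch. It assumes that a second-order feature is the element-wise product of two distinct columns; whether `.ml.util.polytab` also produces squared terms, and how it names the new columns, is not shown in this notebook, so the helper `second_order_features`, the naming scheme and the sample values are all hypothetical.

```python
from itertools import combinations

def second_order_features(table):
    """Multiply every distinct pair of columns element-wise."""
    features = {}
    for a, b in combinations(table, 2):
        features[f"{a}_{b}"] = [x * y for x, y in zip(table[a], table[b])]
    return features

# Hypothetical two-row table with three numeric columns
table = {
    "radius_mean": [2.0, 3.0],
    "texture_mean": [10.0, 20.0],
    "area_mean": [100.0, 200.0],
}
poly = second_order_features(table)
print(list(poly))
print(poly["radius_mean_texture_mean"])  # [20.0, 60.0]
```

Three input columns give three pairwise products, so the number of added columns grows quadratically with the width of the table.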


## Feature Extraction


In [38]:
dict:.ml.i.dict
/ in this example we look only at features of the data alone with no parameters
show tabraw:.ml.fresh.createfeatures[data;`id;1_cols data;dict]
-1"The forecasting frame contains ",(string count tabraw)," datapoints.";

id   | absenergy_radius_mean absenergy_texture_mean absenergy_perimeter_mean absenergy_area_mean ..
-----| ------------------------------------------------------------------------------------------..
8670 | 239.0116              379.4704               10342.89                 560851.2            ..
8913 | 166.1521              172.1344               6705.972                 266152.8            ..
8915 | 223.8016              364.81                 9414.821                 472381.3            ..
9047 | 167.4436              261.4689               6918.912                 257657.8            ..
85715| 173.4489              348.1956               7392.56                  285797.2            ..
86208| 410.4676              530.3809               17529.76                 1597696             ..
86211| 148.3524              318.2656               6051.284                 203491.2            ..
86355| 495.9529              386.9089               23347.84                 2277081             ..
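
The first feature in the output, `absenergy`, is the absolute energy of a column: the sum of its squared values within each `id` group. The plain-Python sketch below illustrates that one feature only; the helper `abs_energy_by_id` and the sample series are hypothetical and this is not how the FRESH library is implemented.

```python
from collections import defaultdict

def abs_energy_by_id(ids, values):
    """Sum of squared values for each id."""
    energy = defaultdict(float)
    for key, v in zip(ids, values):
        energy[key] += v * v
    return dict(energy)

# Hypothetical series: id 1 has three observations, id 2 has two
ids = [1, 1, 1, 2, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(abs_energy_by_id(ids, values))  # {1: 14.0, 2: 41.0}
```

When an `id` has a single row, as appears to be the case in this dataset, the feature reduces to the square of that row's value.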


### Data formatting

Any nulls in the dataset are filled with the average of their column. An extra column is also added to flag the positions that originally held a null. Any constant columns are then dropped.

In [37]:
tabraw:.ml.util.nullencode[value tabraw;avg]
5#tabraw:.ml.util.dropconstant[tabraw]

absenergy_radius_mean absenergy_texture_mean absenergy_perimeter_mean absenergy_area_mean absener..
-------------------------------------------------------------------------------------------------..
239.0116              379.4704               10342.89                 560851.2            0.01192..
166.1521              172.1344               6705.972                 266152.8            0.00483..
223.8016              364.81                 9414.821                 472381.3            0.00808..
167.4436              261.4689               6918.912                 257657.8            0.00975..
173.4489              348.1956               7392.56                  285797.2            0.01340..
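
The two formatting steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the toolkit's code: the helpers `null_encode` and `drop_constant`, the `_null` suffix for the flag column and the sample table are all hypothetical, and `None` stands in for a q null.

```python
from statistics import mean

def null_encode(table):
    """Fill None with the column mean and add a '<col>_null' 0/1 flag column."""
    out = {}
    for name, col in table.items():
        if any(v is None for v in col):
            fill = mean(v for v in col if v is not None)
            out[name] = [fill if v is None else v for v in col]
            out[f"{name}_null"] = [1 if v is None else 0 for v in col]
        else:
            out[name] = list(col)
    return out

def drop_constant(table):
    """Remove columns whose values are all identical."""
    return {name: col for name, col in table.items() if len(set(col)) > 1}

# Hypothetical table: 'a' has a null, 'b' is constant
table = {"a": [1.0, None, 3.0], "b": [5.0, 5.0, 5.0], "c": [0.1, 0.2, 0.3]}
clean = drop_constant(null_encode(table))
print(clean)
# {'a': [1.0, 2.0, 3.0], 'a_null': [0, 1, 0], 'c': [0.1, 0.2, 0.3]}
```

The flag column preserves the information that a value was missing, which would otherwise be lost once the null is replaced by the mean.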


The data must now be converted to a matrix from a table in order to allow it to be passed to a machine learning algorithm for training.

In [6]:
mattab:{flip value flip x}
/ Convert the table containing significant features to a matrix in order to allow it to be passed to a machine learning algorithm
featmat:mattab[tabraw]

## Grid Search

Grid search is the process of finding optimal parameters for a given model. It works by applying the model to a dataset together with a dictionary of candidate values for each parameter. The grid search tries the combinations of these parameters and returns the optimal parameters for the model along with the corresponding accuracy score.

In [22]:
regr:.p.import[`sklearn.tree][`:DecisionTreeClassifier]
dict:`max_depth`min_samples_split`max_features!(1_til 5;2_til 6;`auto`log2`sqrt)

The following cell gives the indices of the dataset, in ascending order, partitioned into k subsections (here k is 3).

In [23]:
i:.ml.xval.kfsplit[targets;3]
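
A minimal Python sketch of such a split is shown below. It assumes contiguous folds whose sizes differ by at most one; how `.ml.xval.kfsplit` handles a remainder is not shown in this notebook, and the helper `kfold_indices` is hypothetical.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for fold in range(k):
        # The first `extra` folds take one additional index each
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 3)
print(folds)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```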

In [24]:
.ml.xval.gridsearch[featmat;targets;i;regr;dict]

0.6324608
`max_depth`min_samples_split`max_features!(,1;,4;,`log2)
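
The search itself amounts to scoring every combination in the grid and keeping the best one. The Python sketch below shows that loop over the same grid as above; the helper `grid_search` is hypothetical, and `toy_score` is a made-up stand-in for fitting the model on the training folds and scoring it on the validation fold, so its result is unrelated to the output above.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return (best_score, best_params)."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Same candidate values as the dictionary in the cell above
param_grid = {
    "max_depth": [1, 2, 3, 4],
    "min_samples_split": [2, 3, 4, 5],
    "max_features": ["auto", "log2", "sqrt"],
}

# Hypothetical scoring function; a real one would cross validate the model
def toy_score(params):
    return -abs(params["max_depth"] - 2) - abs(params["min_samples_split"] - 4)

best_score, best_params = grid_search(param_grid, toy_score)
print(best_score, best_params)
```

This grid contains 4 x 4 x 3 = 48 combinations, each of which requires a full cross validation run, so the cost of a grid search grows quickly with the number of parameters.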


Using the results from the grid search, we can apply the optimal parameters to the Decision Tree Classifier model.

In [25]:
clf:.p.import[`sklearn.tree][`:DecisionTreeClassifier
    ][`max_depth pykw 1;`min_samples_split pykw 4;`max_features pykw `log2]

### Chain-forward, Monte Carlo and Repeated Stratified Randomised k-fold cross validation 

Chain-forward is a cross validation technique in which the data is split into equal-sized bins, with an increasing amount of the data incorporated into the training set at each step while the bin that follows is used for testing.
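
A minimal Python sketch of chain-forward splitting follows. It assumes the usual forward-chaining scheme, in which the model is trained on all bins up to a given point and tested on the next bin; the helper `chain_forward_splits` is hypothetical and is not the code behind `.ml.xval.chainxval`.

```python
def chain_forward_splits(n, k):
    """Yield (train, test) index lists: train on bins 0..i-1, test on bin i."""
    size = n // k  # any remainder at the end is left unused in this sketch
    bins = [list(range(i * size, (i + 1) * size)) for i in range(k)]
    for i in range(1, k):
        train = [idx for b in bins[:i] for idx in b]
        yield train, bins[i]

for train, test in chain_forward_splits(8, 4):
    print(train, test)
# [0, 1] [2, 3]
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
```

Because the test bin always comes after the training bins, the model is never evaluated on data that precedes what it was trained on.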

The Monte Carlo method involves randomly selecting a certain proportion of the data as the training set and using the rest as the test set. This process is then repeated a number of times, with a different random split into training and testing sets each time.
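
The repeated random splitting can be sketched in Python as follows. The helper `monte_carlo_splits` is hypothetical, and treating the fraction as the held-out test share is an assumption made for this sketch rather than a statement about the arguments of `.ml.xval.mcxval`.

```python
import random

def monte_carlo_splits(n, test_fraction, repeats, seed=0):
    """Yield `repeats` random (train, test) index splits of range(n)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_test = int(round(n * test_fraction))
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield sorted(idx[n_test:]), sorted(idx[:n_test])

splits = list(monte_carlo_splits(10, 0.2, 3))
for train, test in splits:
    print(train, test)
```

Unlike k-fold, nothing guarantees that every row is tested exactly once: a row may land in several test sets or in none.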

In k-fold cross validation, the data is split up into k equal-sized sections. One fold is kept for validation while the other k-1 folds are used for training. The process is repeated k times so that each fold is used for validation once. Stratification is the process in which the dataset is arranged so that each fold has a good representation of the dataset, that is, roughly the same proportion of each class as the whole. This is used extensively in cases where the distribution of classes in the data is unbalanced.
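
Stratification can be sketched in Python by dealing the rows of each class out across the folds in turn. The helper `stratified_kfold` and the sample labels are hypothetical; the sketch leaves out the shuffling and repetition that a repeated randomised variant would add.

```python
from collections import defaultdict

def stratified_kfold(labels, k):
    """Assign indices to k folds so each fold gets a similar share of every class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        # Deal each class out round-robin across the folds
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return [sorted(f) for f in folds]

# Hypothetical unbalanced targets: six benign (0), three malignant (1)
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0]
for fold in stratified_kfold(labels, 3):
    print(fold, [labels[i] for i in fold])
# [0, 2, 5] [0, 1, 0]
# [1, 4, 6] [0, 1, 0]
# [3, 7, 8] [0, 1, 0]
```

Each fold keeps the 2:1 class ratio of the full label vector, which a plain contiguous split would not guarantee.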

In [26]:
chainfor:.ml.xval.chainxval[featmat;targets;5;clf]
montecarlo:.ml.xval.mcxval[featmat;targets;.2;clf;3]
repkf:.ml.xval.repkfstrat[featmat;targets;2;2;clf]

-1"The cross validation score using chain-forward is ",string chainfor;
-1"The cross validation score using monte-carlo is ",string montecarlo;
-1"The cross validation score using Repeated Stratified Randomized K-fold is ",string repkf;

The cross validation score using chain-forward is 0.504386
The cross validation score using monte-carlo is 0.6345029
The cross validation score using Repeated Stratified Randomized K-fold is 0.6212534


## Conclusion

From the above example, the Monte Carlo and Repeated Stratified Randomised k-fold methods give the highest cross validation scores (0.63 and 0.62, against 0.50 for chain-forward). The lower chain-forward score reflects the fact that chain-forward cross validation is most useful on time-series data. In this case, repeatedly splitting the input and target data at random leads to higher cross validation scores.