# Cross-Validation

Cross-validation is a model validation technique used to assess how well the results produced by a model generalise to independent datasets. The aim is to train and validate the algorithm on several different partitions of the data in order to limit later prediction problems such as overfitting or underfitting.

In this notebook, we demonstrate how to use the kdb+/q cross-validation library to obtain reliable estimates of how a model performs on unseen data.

---

## Loading library scripts and data

In the following cell, the kdb+/q machine learning toolkit (ML-Toolkit) is loaded to make available the functions provided in both the Utilities and Cross-Validation sections of the library. Graphics functions are also loaded for use in this notebook.

Data from the ["Breast Cancer Wisconsin (Diagnostic) Data Set"](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) is used below to predict whether a mass of cells is malignant or benign.

The diagnosis column is removed from the dataset and used as the target vector, as this is the value we want to predict.

In [1]:
\c 15 100
\l ../../ml.q
.ml.loadfile`:init.q
\l graphics.q

In [2]:
data:("FS",30#"F"; enlist ",") 0:`:datasets/data.csv
targets:select diagnosis from data
data:delete diagnosis from data
-1"The dataset contains ",string[count data]," patient files.";
5#data
targets`diagnosis

The dataset contains 569 patient files.


id          radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean co..
-------------------------------------------------------------------------------------------------..
842302      17.99       10.38        122.8          1001      0.1184          0.2776           0...
842517      20.57       17.77        132.9          1326      0.08474         0.07864          0...
8.43009e+07 19.69       21.25        130            1203      0.1096          0.1599           0...
8.43483e+07 11.42       20.38        77.58          386.1     0.1425          0.2839           0...
8.43584e+07 20.29       14.34        135.1          1297      0.1003          0.1328           0...


`M`M`M`M`M`M`M`M`M`M`M`M`M`M`M`M`M`M`M`B`B`B`M`M`M`M`M`M`M`M`M`M`M`M`M`M`M`B`M`M`M`M`M`M`M`M`B`M`..


One-hot encoding is applied to the target data to convert the symbols into a numerical representation.

In [3]:
show targets:exec diagnosis_M from .ml.onehot[targets;cols targets]

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0..
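
As a rough illustration of what one-hot encoding does, the sketch below builds the indicator columns in plain q for a small made-up symbol vector. The names `d`, `u` and `oh` are hypothetical, and the toolkit's `.ml.onehot` may differ in column naming and in the type of the values it returns (booleans are used here).

```q
/ toy target vector (hypothetical data, not the real dataset)
d:`M`M`B`M`B
/ distinct classes in sorted order
u:asc distinct d
/ one boolean indicator column per class, named diagnosis_<class>
oh:flip(`$"diagnosis_",/:string u)!d=/:u
/ indicator vector for the malignant class
oh`diagnosis_M
```

Only one of the two indicator columns is needed for a binary target, which is why the notebook keeps `diagnosis_M` alone.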


---

## Preprocessing

In the cells below, polynomial features are produced from the original data table to allow for interactions between terms in the system. This allows us to study both individual and combined features.

In [4]:
/ target classifications should be agnostic of the id column
5#table:(cols[data]except`id)#data

radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean..
-------------------------------------------------------------------------------------------------..
17.99       10.38        122.8          1001      0.1184          0.2776           0.3001        ..
20.57       17.77        132.9          1326      0.08474         0.07864          0.0869        ..
19.69       21.25        130            1203      0.1096          0.1599           0.1974        ..
11.42       20.38        77.58          386.1     0.1425          0.2839           0.2414        ..
20.29       14.34        135.1          1297      0.1003          0.1328           0.198         ..


In [5]:
/ add second order polynomial features to the table 
5#table:table^.ml.polytab[table;2]

radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean..
-------------------------------------------------------------------------------------------------..
17.99       10.38        122.8          1001      0.1184          0.2776           0.3001        ..
20.57       17.77        132.9          1326      0.08474         0.07864          0.0869        ..
19.69       21.25        130            1203      0.1096          0.1599           0.1974        ..
11.42       20.38        77.58          386.1     0.1425          0.2839           0.2414        ..
20.29       14.34        135.1          1297      0.1003          0.1328           0.198         ..
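
To make the idea of second-order features concrete, here is a minimal sketch in plain q, assuming that such features are the element-wise products of every pair of distinct columns. The table `t` and the names `cs`, `prs`, `poly` and `t2` are hypothetical; the toolkit's `.ml.polytab` may name or order its columns differently.

```q
/ toy table (hypothetical data)
t:([]a:1 2 3f;b:4 5 6f;c:7 8 9f)
cs:cols t
/ every unordered pair of distinct columns: (`a`b;`a`c;`b`c)
prs:raze{x,/:y}'[cs;(1+til count cs)_\:cs]
/ product of each pair, named <col1>_<col2>
poly:flip(`$"_"sv'string prs)!{prd x y}[t]each prs
/ append the interaction columns to the original table
t2:t,'poly
```

With three input columns this adds three interaction columns; the number of pairs grows quadratically with the number of columns, which is worth keeping in mind for wide tables.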


In [6]:
/ apply min-max scaling to the dataset to avoid biases due to orders of magnitude in the data
5#table:.ml.minmaxscaler[table]

radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean..
-------------------------------------------------------------------------------------------------..
0.5210374   0.0226581    0.5459885      0.3637328 0.5937528       0.7920373        0.7031396     ..
0.6431445   0.2725736    0.6157833      0.5015907 0.2898799       0.181768         0.2036082     ..
0.6014956   0.3902604    0.5957432      0.4494168 0.5143089       0.4310165        0.4625117     ..
0.2100904   0.3608387    0.2335015      0.1029056 0.8113208       0.8113613        0.5656045     ..
0.6298926   0.1565776    0.6309861      0.4892895 0.4303512       0.3478928        0.4639175     ..
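
The scaling applied above maps each column onto the range 0 to 1. A minimal sketch of that transformation in plain q follows; the function name `mm` and the toy table are hypothetical, and the toolkit's own implementation may handle edge cases differently.

```q
/ rescale a numeric vector to the range [0,1]
mm:{(x-mn)%max[x]-mn:min x}
/ apply to a single vector
mm 2 4 6 10f
/ apply column by column to a toy table (hypothetical data)
t:([]a:1 2 3f;b:10 20 40f)
scaled:flip mm each flip t
```

Note that in this sketch a constant column yields nulls, since the denominator is zero.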


In [7]:
/ complete a train-test-split on the data - below 20% of data is used in the test set
show tts:.ml.traintestsplit[table;targets;.2]

xtrain| +`radius_mean`texture_mean`perimeter_mean`area_mean`smoothness_mean`compactness_mean`conc..
ytrain| 0 0 1 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0..
xtest | +`radius_mean`texture_mean`perimeter_mean`area_mean`smoothness_mean`compactness_mean`conc..
ytest | 1 0 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 1 1 0..
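
For readers who want to see the mechanics, a train-test split can be sketched in plain q by shuffling the row indices and cutting them in two. All names below (`nrow`, `frac`, `idx`, `ntest`, `split`) are hypothetical, and the way `.ml.traintestsplit` samples and rounds is not confirmed by this notebook.

```q
/ toy features and labels (hypothetical data)
nrow:10
x:til nrow
y:nrow#0 1
/ proportion of rows held back for testing
frac:.2
/ random permutation of the row indices
idx:neg[nrow]?nrow
/ number of test rows, rounded to the nearest integer
ntest:"j"$frac*nrow
/ the first ntest shuffled indices form the test set, the rest the training set
split:`xtrain`ytrain`xtest`ytest!(x[ntest _ idx];y[ntest _ idx];x[ntest#idx];y[ntest#idx])
```

Because the indices are shuffled, the split differs between runs unless the random seed is fixed.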


---

## Cross-Validation

Below, a Random Forest Classifier model is initialized in order to classify tumours as malignant or benign. We can check the consistency of this model by applying cross-validation techniques to the training data. In the first cell, cross-validation is applied with 5 folds.

In [8]:
k:5  / number of folds
n:1  / number of repetitions

xtrain:flip value flip tts`xtrain
ytrain:tts`ytrain

/ function with algorithm
a:{.p.import[`sklearn.ensemble][`:RandomForestClassifier]}

/ scoring function which takes a function, parameters to apply to that function and data as arguments
score_func:.ml.xv.fitscore[a][`n_estimators pykw 500]

In [9]:
/ split data into k-folds and train/validate the model
s1:.ml.xv.kfsplit[k;n;xtrain;ytrain;score_func]  / sequentially split
s2:.ml.xv.kfshuff[k;n;xtrain;ytrain;score_func]  / randomized split
s3:.ml.xv.kfstrat[k;n;xtrain;ytrain;score_func]  / stratified split

-1"Average Model Scores:";
-1"----------------------------------------------------------------------------";
-1"Sequential split indices with basic k-fold cross validation: ",string avg s1;
-1"Random split indices with basic k-fold cross validation: ",string avg s2;
-1"Stratified split indices with basic k-fold cross validation: ",string avg s3;


Average Model Scores:
----------------------------------------------------------------------------
Sequential split indices with basic k-fold cross validation: 0.9714286
Random split indices with basic k-fold cross validation: 0.967033
Stratified split indices with basic k-fold cross validation: 0.9758714
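
The difference between sequential and randomized k-fold splitting comes down to how rows are assigned to folds. The sketch below shows one way to generate such assignments in plain q; the names `rows`, `folds`, `fold`, `splits` and `shuf` are hypothetical, and this is not a description of how the `.ml.xv` functions are implemented.

```q
/ toy problem size (hypothetical values)
rows:10
folds:5
/ sequential fold label for every row: 0 0 1 1 2 2 3 3 4 4
fold:(folds*til rows)div rows
/ for fold f, validate on the rows labelled f and train on the rest
splits:{`train`test!(where x<>y;where x=y)}[fold]each til folds
/ a randomized split shuffles the fold labels before doing the same
shuf:fold neg[rows]?rows
```

A stratified split would additionally assign fold labels within each class separately, so that every fold keeps roughly the same class proportions as the full dataset.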


Another option is to use repeated forms of cross-validation, such as Monte Carlo or repeated k-fold cross-validation. Repeating the procedure lets a user evaluate the consistency and robustness of the models produced. Below, 5 folds are again used, this time with 5 repetitions.

In [10]:
p:.2  / proportion of data in validation set
n: 5  / number of repetitions

r1:.ml.xv.mcsplit[p;n;xtrain;ytrain;score_func]
r2:.ml.xv.kfshuff[k;n;xtrain;ytrain;score_func]
r3:.ml.xv.kfsplit[k;n;xtrain;ytrain;score_func]

-1"Average Model Scores:";
-1"----------------------------------------------------------------------------";
-1"Monte-Carlo cross validation with 5 repetitions and training size of 80%: ",string avg r1;
-1"Repeated shuffled cross validation, 5 fold, 5 repetitions: ",string avg r2;
-1"Repeated sequential cross validation, 5 fold, 5 repetitions: ",string avg r3;

Average Model Scores:
----------------------------------------------------------------------------
Monte-Carlo cross validation with 5 repetitions and training size of 80%: 0.9626374
Repeated shuffled cross validation, 5 fold, 5 repetitions: 0.9727473
Repeated sequential cross validation, 5 fold, 5 repetitions: 0.9731868


---

## Grid Search

An alternative method is to perform a grid search over possible sets of hyperparameters in order to find the best model for the dataset. Grid search is run on the training data to find the best parameters, which are then applied to the model. Predictions are then made and scored using the unseen testing data.

In [11]:
/ new scoring function
sf:.ml.xv.fitscore[a]

/ dictionary of parameters
pd:`n_estimators`criterion`max_depth!(10 50 100 500;`gini`entropy;2 5 10 20 30)
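
A grid search evaluates every combination of the listed values. As a rough sketch, the full set of candidates for the dictionary above can be enumerated in plain q as follows; the names `hp`, `combos` and `grid` are hypothetical, and this is not claimed to be how `.ml.gs` builds its grid internally.

```q
/ hyperparameter dictionary copied from the cell above under a different name
hp:`n_estimators`criterion`max_depth!(10 50 100 500;`gini`entropy;2 5 10 20 30)
/ every combination of values, one list per combination
combos:(cross/)value hp
/ one row per candidate model
grid:flip key[hp]!flip combos
/ number of candidate models: 4*2*5
count grid
```

Each candidate is trained once per fold and per repetition, so the cost of the search scales with the product of grid size, folds and repetitions.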

In the grid search function below, the final argument is a float giving the size of the holdout set used in a fitted grid search, in which the best model is evaluated on the holdout data. If 0 is used (as shown below), the function instead returns the scores from each fold for each set of hyperparameters.

In [12]:
-1"Grid search: hyperparameters and resulting score from each fold:\n";
show gr:.ml.gs.kfsplit[k;n;xtrain;ytrain;sf;pd;0]

Grid search: hyperparameters and resulting score from each fold:

n_estimators criterion max_depth|                                                                ..
--------------------------------| ---------------------------------------------------------------..
10           gini      2        | 0.978022  0.956044  0.9450549 0.9450549 0.956044  0.9450549 0.9..
10           gini      5        | 0.9340659 0.978022  0.9340659 0.978022  0.978022  0.956044  0.9..
10           gini      10       | 0.9450549 0.967033  0.967033  0.9340659 0.956044  0.956044  0.9..
10           gini      20       | 0.9450549 0.967033  0.9450549 0.9450549 0.967033  0.978022  0.9..
10           gini      30       | 0.9340659 0.978022  0.9340659 0.956044  0.9450549 0.9450549 0.9..
10           entropy   2        | 0.967033  0.956044  0.989011  0.978022  0.967033  0.9340659 0.9..
10           entropy   5        | 0.978022  0.967033  0.956044  0.989011  0.956044  0.967033  0.9..
10           entropy   10       | 

We can now take the hyperparameters with the highest average score across folds, fit that model on our training set and test how well it generalises to new data.

In [13]:
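/ pick the parameter set with the highest mean score (avgs avoids overwriting the algorithm function a)
bstmdl:.p.import[`sklearn.ensemble][`:RandomForestClassifier][pykwargs first where avgs=max avgs:avg each gr]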
bstmdl[`:fit][xtrain;ytrain];
br:bstmdl[`:score][flip value flip tts`xtest;tts`ytest]`
-1"Score for the 'best model' on the testing set was: ",string br;

Score for the 'best model' on the testing set was: 0.9736842


Alternatively, the previous two cells can be combined into a single fitted grid search. As explained above, this is done by passing a non-zero float as the final parameter.

In [14]:
-2#.ml.gs.kfsplit[k;n;flip value flip table;targets;sf;pd;.2]

`n_estimators`criterion`max_depth!(500;`entropy;20)
0.9912281


The grid search function can also be passed a negative value for the final parameter, in which case the data is shuffled before the holdout set is designated.

In [15]:
-2#.ml.gs.kfsplit[k;n;flip value flip table;targets;sf;pd;-.2]

`n_estimators`criterion`max_depth!(50;`entropy;20)
0.9561404
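
The sign convention for the final parameter can be pictured with a small helper in plain q. This is a sketch under stated assumptions: the function `holdout` and its argument names are hypothetical, and taking the holdout rows from the end of the data is an assumption rather than something this notebook confirms about `.ml.gs.kfsplit`.

```q
/ hypothetical helper: split row indices 0..rows-1 given a signed holdout fraction h
/ (which end of the data is held out is an assumption made for illustration)
holdout:{[rows;h]
  idx:$[h<0;neg[rows]?rows;til rows];   / shuffle only when h is negative
  ntrain:"j"$rows*1-abs h;              / rows kept for the grid search
  `train`holdout!(ntrain#idx;ntrain _ idx)}
/ positive fraction: rows stay in their original order
holdout[10;.2]
/ negative fraction: rows are shuffled first
holdout[10;-.2]
```

Shuffling matters when the rows are ordered in some meaningful way, since an unshuffled holdout would then not be representative of the whole dataset.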


---

## Conclusions

Cross-validation is a useful technique to determine how well a model will generalize to new data. It can be carried out in a number of ways depending on the chosen dataset. Above we showed how to perform cross-validation using a range of methods, which split data in a sequential, randomized or stratified manner, as well as using Monte Carlo methods.

If the aim is to trial a range of hyperparameters on a model, grid search provides a systematic way to test the chosen range of parameters and return the ones which allow the model to generalize best to new data.

---