## Cross Validation

The point of this notebook is to demonstrate how to use the cross validation library on datasets in order to achieve accurate results.

In [1]:
\l ../../ml.q
.ml.loadfile`:init.q
\l graphics.q
\c 15 100

## Loading library scripts and data

The first cell of the notebook loads the machine learning toolkit, which provides the cross validation functions (`.ml.xval`) and the preprocessing functions (`.ml.util`) used within the notebook, along with a graphics script. The following cell loads the data.

The data is taken from the "Breast Cancer Wisconsin (Diagnostic) Data Set" and is used to predict whether a mass of cells is malignant or benign.

The diagnosis column is removed from the dataset and used as the target, as this is what we will be predicting.

In [2]:
data:("FSFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"; enlist ",") 0:`:datasets/data.csv
targets:select diagnosis from data
5#data:delete diagnosis from data
"The dataset contains ",(string count data)," patient files"

id          radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean co..
-------------------------------------------------------------------------------------------------..
842302      17.99       10.38        122.8          1001      0.1184          0.2776           0...
842517      20.57       17.77        132.9          1326      0.08474         0.07864          0...
8.43009e+07 19.69       21.25        130            1203      0.1096          0.1599           0...
8.43483e+07 11.42       20.38        77.58          386.1     0.1425          0.2839           0...
8.43584e+07 20.29       14.34        135.1          1297      0.1003          0.1328           0...


"The dataset contains 569 patient files"


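One hot encoding is used on the target data to convert symbols into a numerical representation. Only the `diagnosis_M` column is kept, so a target of 1 marks a malignant sample and 0 a benign one.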

In [3]:
targets:exec diagnosis_M from .ml.util.onehot[targets]
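
For readers unfamiliar with the technique, one hot encoding can be sketched in plain Python as follows. This is only an illustration of the idea, not the implementation of `.ml.util.onehot`; the helper name `one_hot` and the sample labels are hypothetical.

```python
def one_hot(labels, prefix="diagnosis"):
    """Build one 0/1 indicator column per distinct label."""
    categories = sorted(set(labels))
    return {f"{prefix}_{c}": [int(v == c) for v in labels] for c in categories}

sample = ["M", "B", "B", "M"]      # hypothetical labels, not rows from the dataset
encoded = one_hot(sample)
targets = encoded["diagnosis_M"]   # keep only the "malignant" indicator
print(targets)  # [1, 0, 0, 1]
```

Keeping a single indicator column is enough for a two-class problem, since the other column is its complement.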

## Data preprocessing

Here we produce polynomial features from our data so that interactions between features can be studied, not just the individual features.

In [4]:
/ target classifications should be agnostic of id column
5#table:(cols[data]except `id)#data

radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean..
-------------------------------------------------------------------------------------------------..
17.99       10.38        122.8          1001      0.1184          0.2776           0.3001        ..
20.57       17.77        132.9          1326      0.08474         0.07864          0.0869        ..
19.69       21.25        130            1203      0.1096          0.1599           0.1974        ..
11.42       20.38        77.58          386.1     0.1425          0.2839           0.2414        ..
20.29       14.34        135.1          1297      0.1003          0.1328           0.198         ..


In [5]:
/ Add 2nd order polynomial features to the table 
5#table:table^.ml.util.polytab[table;2]

radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean..
-------------------------------------------------------------------------------------------------..
17.99       10.38        122.8          1001      0.1184          0.2776           0.3001        ..
20.57       17.77        132.9          1326      0.08474         0.07864          0.0869        ..
19.69       21.25        130            1203      0.1096          0.1599           0.1974        ..
11.42       20.38        77.58          386.1     0.1425          0.2839           0.2414        ..
20.29       14.34        135.1          1297      0.1003          0.1328           0.198         ..
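
The idea behind second order interaction features can be sketched in Python as below. This assumes the new features are products of each pair of distinct columns; whether `.ml.util.polytab` also produces squared terms, and how it names its columns, is not shown in this notebook, so the helper `pairwise_products` and its naming scheme are hypothetical.

```python
from itertools import combinations

def pairwise_products(columns):
    """Return one new column per unordered pair of distinct input columns."""
    out = {}
    for a, b in combinations(columns, 2):
        out[f"{a}_{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
    return out

# Toy table with three columns and two rows (made-up values)
toy = {"r": [1.0, 2.0], "t": [3.0, 4.0], "p": [5.0, 6.0]}
feats = pairwise_products(toy)
print(feats["r_t"])  # [3.0, 8.0]
```

With `n` input columns this produces `n*(n-1)/2` extra columns, so the table grows quickly as features are added.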


In [6]:
/ apply min-max scaling so that every feature lies in the range 0 to 1, avoiding biases due to differing orders of magnitude in the data
5#table:.ml.util.minmaxscaler[table]

radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean..
-------------------------------------------------------------------------------------------------..
0.5210374   0.0226581    0.5459885      0.3637328 0.5937528       0.7920373        0.7031396     ..
0.6431445   0.2725736    0.6157833      0.5015907 0.2898799       0.181768         0.2036082     ..
0.6014956   0.3902604    0.5957432      0.4494168 0.5143089       0.4310165        0.4625117     ..
0.2100904   0.3608387    0.2335015      0.1029056 0.8113208       0.8113613        0.5656045     ..
0.6298926   0.1565776    0.6309861      0.4892895 0.4303512       0.3478928        0.4639175     ..
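
Min-max scaling maps each column onto the range 0 to 1 via `(x - min) / (max - min)`. A minimal Python sketch of the per-column calculation follows; the helper name `min_max_scale` and the handling of constant columns are assumptions made for illustration, not the behaviour of `.ml.util.minmaxscaler`.

```python
def min_max_scale(values):
    """Rescale a column so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to zeros to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([2.0, 4.0, 10.0])
print(scaled)  # [0.0, 0.25, 1.0]
```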


In [7]:
/ split the data into training and testing sets, with 20% of the data held out for testing
show tts:.ml.util.traintestsplit[table;targets;0.2]

xtrain| +`radius_mean`texture_mean`perimeter_mean`area_mean`smoothness_mean`compactness_mean`conc..
ytrain| 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 1 1 0 0 1 0 0 0 0..
xtest | +`radius_mean`texture_mean`perimeter_mean`area_mean`smoothness_mean`compactness_mean`conc..
ytest | 0 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0..
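
A train-test split of this kind can be sketched in Python as follows. The function name, the `seed` parameter and the rounding rule for the test-set size are assumptions for illustration; the notebook does not show how `.ml.util.traintestsplit` shuffles or rounds.

```python
import random

def train_test_split(x, y, test_fraction, seed=0):
    """Shuffle row indices, then hold out a fraction of the rows for testing."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(x) * test_fraction))  # assumed rounding rule
    test, train = idx[:n_test], idx[n_test:]
    return {
        "xtrain": [x[i] for i in train], "ytrain": [y[i] for i in train],
        "xtest": [x[i] for i in test], "ytest": [y[i] for i in test],
    }

# Toy data: ten rows, each feature row tagged with its own index
x = [[float(i)] for i in range(10)]
y = [i % 2 for i in range(10)]
split = train_test_split(x, y, 0.2)
print(len(split["xtrain"]), len(split["xtest"]))  # 8 2
```

The same shuffled index list is applied to both the features and the targets, which keeps each row paired with its label.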


In [8]:
/ At this point we can run consistency checks on the models being used to classify the tumours as
/ malignant or benign using cross validation techniques on the training data
xtrain:flip value flip tts`xtrain
ytrain:tts`ytrain

mdl:.p.import[`sklearn.ensemble][`:RandomForestClassifier][`n_estimators pykw 500]

i1:.ml.xval.kfsplit[ytrain;5] / sequentially split data into k folds
i2:.ml.xval.kfshuff[ytrain;5] / randomise data and split into k folds
i3:.ml.xval.kfstrat[ytrain;5] / stratified data split based on target into k folds

-1"Sequential split indices with basic k-fold cross validation: ",string .ml.xval.kfoldx[xtrain;ytrain;i1;mdl];
-1"Random split indices with basic k-fold cross validation: ",string .ml.xval.kfoldx[xtrain;ytrain;i2;mdl];
-1"Stratified split indices with basic k-fold cross validation: ",string .ml.xval.kfoldx[xtrain;ytrain;i3;mdl];

Sequential split indices with basic k-fold cross validation: 0.9626374
Random split indices with basic k-fold cross validation: 0.9538462
Stratified split indices with basic k-fold cross validation: 0.960438
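
The three ways of assigning rows to folds used above can be illustrated with the following Python sketch. It shows the general techniques only; the function names are hypothetical and the exact fold assignments produced by `.ml.xval.kfsplit`, `.ml.xval.kfshuff` and `.ml.xval.kfstrat` may differ.

```python
import random
from collections import defaultdict

def kfold_sequential(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def kfold_shuffled(n, k, seed=0):
    """Shuffle the indices first, then deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def kfold_stratified(y, k):
    """Deal the indices of each class round-robin so folds keep the class ratio."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds

print(kfold_sequential(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [0] * 6 + [1] * 4
strat = kfold_stratified(y, 2)
print(strat)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

In k-fold cross validation each fold is used once as the validation set while the model is trained on the remaining folds, and the scores are averaged. Stratification matters most when one class is much rarer than the other.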


In [9]:
/ Another option is to use repeated forms of cross validation such as Monte-Carlo cross validation
/ or repeated k-fold cross validation. These allow a user to evaluate the consistency
/ and robustness of the models produced

-1"Monte-Carlo cross validation with 10 repetitions and training size of 80%: ",
 string .ml.xval.mcxval[xtrain;ytrain;0.2;mdl;10];
-1"Repeated Stratified cross validation, 5 fold, 5 repetitions: ",
 string .ml.xval.repkfstrat[xtrain;ytrain;5;5;mdl];
-1"Repeated K-Fold cross validation, 5 fold, 5 repetitions: ",
 string .ml.xval.repkfval[xtrain;ytrain;5;5;mdl];

Monte-Carlo cross validation with 10 repetitions and training size of 80%: 0.9582418
Repeated Stratified cross validation, 5 fold, 5 repetitions: 0.9604667
Repeated K-Fold cross validation, 5 fold, 5 repetitions: 0.9630769
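
Monte-Carlo cross validation repeats a random train-validation split several times and averages the scores. The Python sketch below illustrates the loop under stated assumptions: `majority_fit_score` is a hypothetical stand-in for fitting and scoring a real model, and the function names, `seed` parameter and rounding rule are not taken from `.ml.xval.mcxval`.

```python
import random

def majority_fit_score(ytrain, ytest):
    """Hypothetical stand-in model: always predict the most common training label."""
    guess = max(set(ytrain), key=ytrain.count)
    return sum(1 for v in ytest if v == guess) / len(ytest)

def monte_carlo_cv(y, test_fraction, repetitions, seed=0):
    """Average the score over repeated random train/validation splits."""
    rng = random.Random(seed)
    n_test = max(1, int(round(len(y) * test_fraction)))
    scores = []
    for _ in range(repetitions):
        idx = list(range(len(y)))
        rng.shuffle(idx)                      # a fresh random split each repetition
        test, train = idx[:n_test], idx[n_test:]
        scores.append(majority_fit_score([y[i] for i in train],
                                         [y[i] for i in test]))
    return sum(scores) / len(scores)

y = [0] * 16 + [1] * 4          # toy labels, 80% of one class
score = monte_carlo_cv(y, 0.2, 10)
print(0.0 <= score <= 1.0)      # True
```

Unlike k-fold, the validation sets of different repetitions may overlap, so a given row can be validated on several times or not at all.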


In [10]:
/ Another alternative is to perform a grid search over possible sets of hyperparameters in order to
/ find the optimal model for the dataset in question. The grid search is run on the training
/ data, and the best hyperparameters found are then used in a model which is applied to the testing set
dict:`n_estimators`criterion`max_depth!(10 50 100 500;`gini`entropy;2 5 10 20 30)
mdl2:.p.import[`sklearn.ensemble][`:RandomForestClassifier]
-1"HyperParameter Grid search, maximum score and best hyperparameter set are as follows: ";
.ml.xval.gridsearch[xtrain;ytrain;i3;mdl2;dict]

HyperParameter Grid search, maximum score and best hyperparameter set are as follows: 


0.9670558
`n_estimators`criterion`max_depth!(,100;,`entropy;,10)
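
A grid search evaluates every combination of the listed hyperparameter values and keeps the best scoring one. The Python sketch below expands the same grid as the cell above; `toy_score` is a made-up scoring function standing in for a cross validated model score, and the helper names are hypothetical rather than part of `.ml.xval.gridsearch`.

```python
from itertools import product

grid = {
    "n_estimators": [10, 50, 100, 500],
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 5, 10, 20, 30],
}

def expand_grid(grid):
    """List every combination of hyperparameter values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

def grid_search(grid, score_fn):
    """Return (best score, best parameter set) under the given scoring function."""
    best = max(expand_grid(grid), key=score_fn)
    return score_fn(best), best

def toy_score(params):
    # Made-up score for illustration only; a real search would cross validate a model here
    return -abs(params["max_depth"] - 5) - abs(params["n_estimators"] - 50) / 1000

candidates = expand_grid(grid)
print(len(candidates))  # 40
best_score, best_params = grid_search({k: v for k, v in grid.items() if k != "criterion"}, toy_score)
print(best_params)  # {'n_estimators': 50, 'max_depth': 5}
```

The cost grows multiplicatively: this grid already requires 40 models per fold, so each added hyperparameter should be chosen with care.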


In [11]:
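/ We can now fit a model on our training set and test how well it generalises to new data.
/ The criterion and max_depth are those selected by the grid search; n_estimators is set to 500 here
/ rather than the 100 reported above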
bstmdl:.p.import[`sklearn.ensemble][`:RandomForestClassifier][pykwargs `n_estimators`criterion`max_depth!(500;`entropy;10)]
bstmdl[`:fit][xtrain;ytrain];
-1"Score for the 'best model' on the testing set was: ",
 string bstmdl[`:score][flip value flip tts`xtest;tts`ytest]`;

Score for the 'best model' on the testing set was: 0.9824561
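
For a scikit-learn classifier, `score` returns the mean accuracy, that is, the fraction of test samples classified correctly. The sketch below shows the calculation with hypothetical labels. The figure of 114 test samples is an inference (about 20% of the 569 rows) and is not stated in the notebook; under that assumption the reported score is consistent with 112 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical labels: 114 test samples, 2 of them misclassified
y_true = [1] * 114
y_pred = [1] * 112 + [0] * 2
acc = accuracy(y_true, y_pred)
print(round(acc, 7))  # 0.9824561
```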


In [12]:
/ The previous two cells can be combined into a single fitted grid search procedure
.ml.xval.gridsearchfit[flip value flip table;targets;0.2;5;mdl2;dict]

`n_estimators`criterion`max_depth!(10;`gini;10)
0.9736842


## Conclusion

---