This notebook optimizes the algorithms for Mikania. The data passes through the feature creation pipeline, then through the machine learning training pipeline and a grid search, so that the best algorithms with the best parameters are chosen.

No preliminary testing was done here because the data was not available yet. Also, previous data from Mikania were obtained in positive and negative ionization.

# Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.svm import SVC
from sklearn import metrics

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.model_selection import PredefinedSplit


# Functions

In [2]:
# creates the feature name with the mz and rt

def feature_name_creation(xcms_file_path):
    table = pd.read_csv(xcms_file_path, index_col=[0]) 
    
    # no need for decimal on m/z (low resolution) and only one decimal for rt
    table.mz = table.mz.round(0).astype(int)
    table.rt = table.rt.round(1)

    # creating the feature name: mz_rt
    features = table["mz"].astype(str) + "_" + table["rt"].astype(str)
    table.insert(0, 'features', features) # first column
    
    # drop as we don't know how many columns the table will have. Drop the known ones. 
    # There should only be the 'features' column and the samples
    
    table_clean = table.drop(['isotopes', 'adduct','pcgroup'], axis=1) #'npeaks','NEG_GROUP', 'POS_GROUP',
    
    return table_clean

In [3]:
# rounds the mz and rt columns along with their min and max values

def rounder(dataframe):
    # work on a copy so the input dataframe is not modified in place
    table = dataframe.copy()
    
    # m/z is low resolution, so no decimals are kept
    table.mz = table.mz.round(0).astype(int)
    table.mzmin = table.mzmin.round(0).astype(int)
    table.mzmax = table.mzmax.round(0).astype(int)
    
    # retention times keep one decimal
    table.rt = table.rt.round(1)
    table.rtmin = table.rtmin.round(1)
    table.rtmax = table.rtmax.round(1)

    return table

# Data Prep Pipeline

`Train` and `Val` sets were processed separately in `xcms`, which excludes the possibility of data leakage. 
However, when the sets are processed separately, the resulting features can differ slightly. The compounds are almost the same, but the processing steps can shift the decimals of `mz` or `rt`. 
For this reason, creating the feature name by concatenating `mz_rt` on `Train` and `Val` might not produce the same features, and a model cannot be trained and validated on mismatched features. 

The errors observed in this case arise because features observed in `Train` are not present in `Val` (or vice versa), or because the order of the features differs between the two datasets. This 'pipeline' fixes this issue.
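
The naming problem can be reproduced on a toy example. This is only a sketch: the two tables and the helper `mz_rt_names` are hypothetical and stand in for the peak tables produced by two separate `xcms` runs.

```python
import pandas as pd

# Hypothetical peak tables: the same two compounds as reported by two separate runs.
train = pd.DataFrame({"mz": [217.1, 402.2], "rt": [572.81, 547.04]})
val = pd.DataFrame({"mz": [217.1, 402.2], "rt": [572.44, 547.12]})

def mz_rt_names(table):
    """Build 'mz_rt' names with m/z rounded to an integer and rt to one decimal."""
    mz = table["mz"].round(0).astype(int).astype(str)
    rt = table["rt"].round(1).astype(str)
    return mz + "_" + rt

train_names = mz_rt_names(train)
val_names = mz_rt_names(val)

# Names present in both runs; the set is empty although the compounds are the same.
shared = set(train_names) & set(val_names)
print(list(train_names), list(val_names), shared)
```

A shift of a few tenths of a second in `rt` is enough to give every feature a different name, which is why the names for `Val` are taken from `Train` by range matching instead.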

**Steps:**

1. Creates the feature names on the `Train` set, which is used as the reference. Whatever features were observed here should appear in `Val`. The name is created by concatenating the `mz` and `rt` columns (`mz_rt`).
2. Creates a correspondence between the features in the `Train` and `Val` sets, giving the `Val` set the same feature names as `Train` when the feature is present:
    1. Round `mz` and `rt` in `Val` and `Train`.
    2. For each `mz` in `Val`, search `Train` for an `mzmin` to `mzmax` range that fits. The `mz_val` needs to be between `mzmin_train` and `mzmax_train`.
    3. If a match is found, test `rt`, `rtmin` and `rtmax` from `Val` against the `rtmin` to `rtmax` range in `Train`. At least one of these values needs to be between `rtmin_train` and `rtmax_train`. The `rtmin` and `rtmax` from `Val` are also used because occasionally the range in `Val` or `Train` is too wide (large difference in `rt` between samples).
    4. If there is a match, take the feature name from `Train` and apply it to the matching `Val` feature.
    
**With the feature names created:**

3. Features in `Train` and `Val` are ordered.
4. Duplicates are deleted based on the `npeaks` column (the feature with more `npeaks` is kept).
5. Features that were observed in `Val` but have no correspondence in `Train` have their names filled with `nan`. These are deleted.
6. Features that are in `Train` and were not found in `Val` are added to `Val` and filled with zero (no presence of that feature).
 
 
**To fix:**
 The code for the feature correspondence is not optimized. 
 - After the match on `mz`, the loop searches the whole dataset for a match on `rt`. This is unnecessary extra computation. 
 - If two features match, the last one is kept. Could keep both and filter later? 
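
One possible direction for the first point is to subset `Train` by `mz` first and run the `rt` tests only on that subset. The sketch below is not the loop used in this notebook: the function `match_features` and the toy tables are hypothetical, and when several candidates pass both tests it keeps the one with the most `npeaks`, which is an assumption about the desired tie-break.

```python
import numpy as np
import pandas as pd

def match_features(val, train):
    """Return, for each row of val, the name of a matching train feature or NaN.

    Assumes val has mz, rt, rtmin, rtmax and train has mzmin, mzmax,
    rtmin, rtmax, npeaks and features.
    """
    train = train.sort_values("npeaks", ascending=False)
    names = []
    for row in val.itertuples(index=False):
        # Step 1: keep only the train rows whose m/z range contains this m/z.
        cand = train[(train["mzmin"] <= row.mz) & (row.mz <= train["mzmax"])]
        # Step 2: on that subset only, test rt, rtmin and rtmax against the train rt range.
        hit = np.zeros(len(cand), dtype=bool)
        for value in (row.rt, row.rtmin, row.rtmax):
            hit |= ((cand["rtmin"] <= value) & (value <= cand["rtmax"])).to_numpy()
        passed = cand.loc[hit, "features"]
        # Keep the candidate with the most npeaks, or NaN when nothing passed.
        names.append(passed.iloc[0] if len(passed) else np.nan)
    return pd.Series(names, index=val.index, dtype=object)

# Toy data: two train features share the same m/z range but elute at different times.
train = pd.DataFrame({
    "features": ["217_572.8", "217_45.2", "402_547.0"],
    "mzmin": [217, 217, 401], "mzmax": [218, 218, 402],
    "rtmin": [518.0, 39.1, 521.0], "rtmax": [574.4, 81.5, 573.7],
    "npeaks": [241, 300, 234],
})
val = pd.DataFrame({
    "mz": [217, 402, 534],
    "rt": [572.4, 547.1, 351.2],
    "rtmin": [518.6, 521.1, 347.5],
    "rtmax": [573.3, 573.7, 354.6],
})

matched = match_features(val, train)
print(matched.tolist())
```

In the toy data the first `mz` candidate for 217 fails the `rt` test, so the second candidate is used; the last `Val` row has no `mz` candidate at all and stays `nan`.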
 


## Feature reference creation - train set

In [4]:
# train is loaded using the function to create the feature names - feature names are created using mz and rt.  
mikania_train = feature_name_creation('5. Gridsearch/mikania_train_processing.csv').reset_index(drop=True)

In [5]:
mikania_train.head()

Unnamed: 0,features,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,...,ML9_3_mik,ML96_1_mik,ML96_2_mik,ML96_3_mik,ML97_1_mik,ML97_2_mik,ML97_3_mik,ML99_1_mik,ML99_2_mik,ML99_3_mik
0,101_572.8,101,101.138345,102.117217,572.8,439.156,574.441,673,315,330,...,4679175.0,5973447.0,4515604.0,6924016.0,3737724.0,6793684.0,3589445.0,4982682.0,4943601.0,4584954.0
1,103_597.4,103,103.151596,104.077694,597.4,471.133,598.699,355,169,182,...,3533462.0,4417315.0,4068181.0,4132018.0,3667740.0,4952572.0,4515996.0,3845193.0,5086722.0,4798596.0
2,111_45.2,111,110.639401,111.636593,45.2,39.105,81.512,303,187,97,...,16273170.0,11978500.0,9300417.0,12274540.0,8721469.0,8164535.0,8876116.0,8547570.0,8042970.0,7753815.0
3,113_575.3,113,112.639963,113.632866,575.3,551.499,599.104,542,237,255,...,5238349.0,6322030.0,5794962.0,3939804.0,5070993.0,3642559.0,4522164.0,6095111.0,5914708.0,5225424.0
4,115_46.5,115,114.639109,115.528369,46.5,42.94,78.906,471,213,244,...,8146212.0,31666930.0,47333790.0,45677520.0,21478850.0,45790790.0,39245440.0,34501970.0,32961010.0,33845780.0


## Loading the validation set

In [6]:
# val is loaded with a regular read_csv - its feature names are assigned later by comparison with train
mikania_val = pd.read_csv('5. Gridsearch/mikania_validation_processing.csv',index_col=[0]).reset_index(drop=True).drop(['isotopes', 'adduct','pcgroup'], axis=1) #'npeaks','NEG_GROUP', 'POS_GROUP',

## Rounding mz and rt

In [7]:
# rounding all mz and all rt
mikania_val = rounder(mikania_val)
mikania_train = rounder(mikania_train)

In [8]:
display(mikania_val.iloc[:,0:7].head())
display(mikania_train.iloc[:,0:7].head())

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks
0,101,101,102,572.8,522.2,574.1,159
1,103,103,104,597.4,594.3,598.7,93
2,111,111,112,46.1,39.4,66.6,80
3,113,113,113,575.0,574.1,598.4,116
4,115,115,115,47.2,43.9,93.1,112


Unnamed: 0,features,mz,mzmin,mzmax,rt,rtmin,rtmax
0,101_572.8,101,101,102,572.8,439.2,574.4
1,103_597.4,103,103,104,597.4,471.1,598.7
2,111_45.2,111,111,112,45.2,39.1,81.5
3,113_575.3,113,113,114,575.3,551.5,599.1
4,115_46.5,115,115,116,46.5,42.9,78.9


## Feature creation and correspondence on val set

In [9]:
# creating the column
mikania_val['features'] = np.nan

In [10]:
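# loop over the mikania_val rows
# each val mz is tested against the mzmin-mzmax range of each train row
# on the first train row whose mz range matches, rt is tested:
# if rt, rtmin or rtmax from val is inside the train rtmin-rtmax range,
# the feature name from train is used for val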

mikania_val = mikania_val.sort_values('npeaks', ascending=False,ignore_index=True)
mikania_train = mikania_train.sort_values('npeaks', ascending=False,ignore_index=True)

for i in range(len(mikania_val)):
    for j in range(len(mikania_train)):


        if ((mikania_val.loc[i,'mz'] <= mikania_train.loc[j,'mzmax']) 
              & (mikania_val.loc[i,'mz'] >= mikania_train.loc[j,'mzmin'])):
            
            #maybe subset mikania train and then perform things on the subset? 
            
            if (
                ((mikania_val.loc[i,'rt'] <= mikania_train.loc[j,'rtmax']) 
                  & (mikania_val.loc[i,'rt'] >= mikania_train.loc[j,'rtmin'])) or
            
               ((mikania_val.loc[i,'rtmin'] <= mikania_train.loc[j,'rtmax']) 
                  & (mikania_val.loc[i,'rtmin'] >= mikania_train.loc[j,'rtmin'])) or
                
               ((mikania_val.loc[i,'rtmax'] <= mikania_train.loc[j,'rtmax']) 
                & (mikania_val.loc[i,'rtmax'] >= mikania_train.loc[j,'rtmin']))
            ):
                
                mikania_val.loc[i,'features'] = mikania_train.loc[j,'features']
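            # note: this break runs on the first mz match, whether or not the rt test passed,
            # so later train rows with a matching mz range are not tested
            break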

In [11]:
mikania_val

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,AQ16_1_mik,...,ML89_1_mik,ML89_2_mik,ML89_3_mik,ML90_1_mik,ML90_2_mik,ML90_3_mik,ML93_1_mik,ML93_2_mik,ML93_3_mik,features
0,117,117,117,572.4,521.2,573.3,372,96,96,5.887108e+07,...,6.333137e+07,6.646837e+07,7.123409e+07,6.464631e+07,6.473378e+07,6.301025e+07,4.236123e+07,4.446527e+07,3.973012e+07,117_572.4
1,302,301,302,547.3,473.8,575.4,302,48,96,2.381940e+06,...,3.372525e+07,3.535800e+07,3.302558e+07,4.497739e+07,5.013374e+07,4.915096e+07,7.694642e+06,1.155288e+07,1.015155e+07,302_547.3
2,217,217,218,572.4,518.6,573.3,241,95,85,1.289468e+07,...,1.512042e+07,1.590213e+07,1.379746e+07,1.467889e+07,1.396555e+07,1.519848e+07,8.829465e+06,7.136288e+06,6.655386e+06,217_572.8
3,402,401,402,547.1,521.1,573.7,234,51,94,2.144783e+06,...,3.802344e+07,3.635515e+07,3.162709e+07,4.257522e+07,4.005340e+07,3.648322e+07,1.217898e+07,1.219293e+07,1.279159e+07,402_547.0
4,488,488,489,348.2,278.3,375.0,213,8,93,4.323549e+06,...,1.779109e+07,2.757322e+07,1.314762e+08,2.092853e+07,1.326263e+08,3.254040e+07,3.582074e+07,3.951186e+07,3.728830e+07,488_348.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,674,674,674,247.0,182.1,249.2,32,0,32,2.813438e+06,...,1.141661e+07,9.989287e+06,8.737974e+06,9.431100e+06,8.432869e+06,9.401396e+06,1.155076e+07,1.163606e+07,1.318347e+07,674_247.9
153,534,533,534,351.2,347.5,354.6,32,29,3,5.578477e+06,...,6.380742e+06,6.812028e+06,5.918371e+06,6.090338e+06,5.022471e+06,5.476122e+06,6.344386e+06,5.182809e+06,5.825212e+06,
154,732,732,733,374.5,265.7,378.7,31,1,29,3.011137e+06,...,2.848826e+06,4.123351e+06,3.476350e+06,4.271722e+06,3.770071e+06,3.580085e+06,8.078104e+06,9.116588e+06,5.907386e+06,
155,654,654,654,246.8,183.6,249.2,30,0,30,8.240939e+05,...,1.977775e+07,1.941051e+07,1.573302e+07,1.536999e+07,1.316865e+07,1.340171e+07,1.200280e+07,1.327482e+07,1.190865e+07,654_247.9


In [12]:
# the process can create duplicates, so removing them is necessary
# the removal is based on the npeaks column. The feature with more npeaks, is kept.
mikania_val = mikania_val.sort_values('npeaks', ascending=False).drop_duplicates('features').sort_index()

# dropping unnecessary columns
mikania_val = mikania_val.drop(['mz', 'mzmin', 'mzmax', 'rt', 
                                  'rtmin', 'rtmax', 'npeaks','NEG_GROUP', 'POS_GROUP'], axis=1)

# removing the duplicates that might arise with the train is also necessary
# drop possible duplicates for train as well
mikania_train = mikania_train.sort_values('npeaks', ascending=False).drop_duplicates('features').sort_index()

# dropping unnecessary  columns
mikania_train = mikania_train.drop(['mz', 'mzmin', 'mzmax', 'rt', 
                                      'rtmin', 'rtmax', 'npeaks','NEG_GROUP', 'POS_GROUP'], axis=1)

# the val set might have some features that don't fit in any range - their feature names will be nan, so they need to be removed
# train might have some features that won't appear in val, so create them in val and set them to zero
# first, set the index of both to the features, so it's possible to do that
mikania_train= mikania_train.set_index('features')
mikania_val = mikania_val.dropna().set_index('features') # dropping na and making feature as index

# set() gives the unique index values of each table
# subtracting the sets gives the features that are in train but not in val
# concat appends those features to val as empty rows
# fillna fills the missing values of the concatenation with 0

unique_indexes = list(set(mikania_train.index) - set(mikania_val.index))
mikania_val = pd.concat([mikania_val, pd.DataFrame(index=unique_indexes, columns=mikania_val.columns)], sort=True).fillna(0)

# order both val and train features equally
# sort the features - the model needs them at the same sequence
mikania_train = mikania_train.reset_index().sort_values(by='features')
mikania_val = mikania_val.reset_index().sort_values(by='index')




## Adding the class column

In [13]:
# load
classes_train = pd.read_csv('classes_train_mikania.csv', index_col=[0])
classes_val = pd.read_csv('classes_val_mikania.csv', index_col=[0])

# unite
mikania_train = mikania_train.set_index('features').T.join(classes_train)
display(mikania_train.head())

mikania_val = mikania_val.set_index('index').T.join(classes_val)
display(mikania_val.head())

Unnamed: 0,1000_338.1,101_572.8,103_597.4,111_45.2,113_575.3,115_46.5,117_572.4,119_337.1,119_41.3,120_573.2,...,904_381.6,918_353.8,920_360.1,923_330.0,940_376.4,946_391.7,962_367.5,977_346.3,988_298.7,class
AQ1_1_mik,1232216.0,4786491.0,4266050.0,26875910.0,4041009.0,46423360.0,68501760.0,998496.9,5405292.0,5419716.0,...,2757551.0,2388167.0,3581841.0,4042580.0,1447280.0,3062362.0,4357708.0,4917440.0,4760602.0,0
AQ1_2_mik,1162652.0,4263014.0,4774577.0,24264030.0,3714576.0,40969830.0,72548590.0,535025.1,5449906.0,5997642.0,...,3312367.0,656975.0,2240067.0,3029329.0,1305365.0,2793934.0,2827042.0,3798600.0,5098263.0,0
AQ1_3_mik,943051.9,4099495.0,4922569.0,28201200.0,4087074.0,41111360.0,69441870.0,872909.1,6331146.0,5484822.0,...,1878620.0,667154.6,2133046.0,3500603.0,2377893.0,1644458.0,2195071.0,3278922.0,4824396.0,0
AQ10_1_mik,838789.0,5558673.0,5206263.0,22844470.0,3217059.0,56316540.0,62669330.0,646066.4,4208421.0,5743904.0,...,2320033.0,784186.1,3029086.0,2734146.0,2606019.0,2125239.0,3100023.0,3922228.0,5367547.0,0
AQ10_2_mik,354750.1,3597790.0,7754229.0,20753700.0,5482342.0,59054790.0,63846630.0,1160967.0,4902441.0,6034946.0,...,1632448.0,2016960.0,2694035.0,2459825.0,1558328.0,1833980.0,3401275.0,5435805.0,4726148.0,0


Unnamed: 0,1000_338.1,101_572.8,103_597.4,111_45.2,113_575.3,115_46.5,117_572.4,119_337.1,119_41.3,120_573.2,...,904_381.6,918_353.8,920_360.1,923_330.0,940_376.4,946_391.7,962_367.5,977_346.3,988_298.7,class
AQ16_1_mik,960408.4,3969311.0,3983464.0,25216940.0,4548232.0,34000860.0,58871080.0,0.0,6644233.0,0.0,...,3372461.0,3045742.0,2074339.0,2590667.0,3154463.0,2982511.0,3286935.0,3835976.0,4617678.0,0
AQ16_2_mik,1173153.0,4757843.0,4461309.0,25360910.0,4471601.0,35297010.0,59962880.0,0.0,6761089.0,0.0,...,3412054.0,2805527.0,2407746.0,3218943.0,2476195.0,2075340.0,2955413.0,3517419.0,5412190.0,0
AQ16_3_mik,294756.6,4716249.0,6176580.0,25482680.0,4950501.0,30505180.0,61267460.0,0.0,6254150.0,0.0,...,1789400.0,2700694.0,1068769.0,4269818.0,4598138.0,3277033.0,3582061.0,3547674.0,3695863.0,0
AQ25_1_mik,1255573.0,5292930.0,4066484.0,20179030.0,6947826.0,38546440.0,42452070.0,0.0,5351820.0,0.0,...,2276597.0,2322428.0,2455301.0,3411034.0,4039373.0,2914401.0,2695155.0,3843059.0,3554007.0,0
AQ25_2_mik,618185.2,5197629.0,4298666.0,20314240.0,6087268.0,36273920.0,48255770.0,0.0,6368848.0,0.0,...,4171824.0,3115454.0,2394187.0,3744213.0,3469823.0,3784061.0,3426345.0,3781479.0,2319854.0,0


The data is now ready for any machine learning process.

# Machine learning

## X y split

In [14]:
X_train = mikania_train.drop("class", axis=1)
y_train = mikania_train["class"]

X_val = mikania_val.drop("class", axis=1)
y_val = mikania_val["class"]

## Training

In [15]:
# https://stackoverflow.com/questions/31948879/using-explicit-predefined-validation-set-for-grid-search-with-sklearn
# https://stackoverflow.com/questions/48390601/explicitly-specifying-test-train-sets-in-gridsearchcv

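# Create the fold labels for PredefinedSplit: 1 for the training rows, 0 for the validation rows
# With these labels PredefinedSplit builds two folds, so each set is used once as the test fold
# (a label of -1 would instead keep those rows out of every test fold)
train_indices = np.ones(len(X_train))
val_indices = np.zeros(len(X_val))
cv_indices = np.concatenate((train_indices, val_indices))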

# model
svm = SVC()
rf = RandomForestClassifier(random_state=2187)
knn = KNeighborsClassifier()

# params of each model

param_svm = {}
param_svm['model'] = [svm]
param_svm['model__C'] = [1, 0.9]
# only the rbf kernel is searched (other options: 'linear', 'poly', 'sigmoid', 'precomputed')
param_svm['model__kernel'] = ['rbf']

param_rf = {}
param_rf['model'] = [rf]
param_rf['model__max_depth'] = [10,15]
param_rf['model__n_estimators'] = [100,200,300]
param_rf['model__criterion'] = ['gini', 'entropy']


param_knn = {}
param_knn['model'] = [knn]
param_knn['model__n_neighbors'] = [5,15,25]


# uniting param to test in gridsearch

params_gridsearch = [param_svm,param_rf,param_knn]

# no need to encode or transform data. All is numeric and same scale

# pipe - starts with svm 
pipe = Pipeline([('model', svm)])

cv = PredefinedSplit(cv_indices)

# gridsearch 
grid = GridSearchCV(pipe, params_gridsearch, 
                    cv = cv,
                   scoring = ['f1','matthews_corrcoef'],
                   return_train_score = True, 
                   refit = 'matthews_corrcoef',
                   verbose = 3)
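
The effect of the fold labels passed to `PredefinedSplit` can be checked on a small example. The sizes below are hypothetical; the sketch compares the 1/0 labels used in this notebook with a -1/0 labelling, where -1 marks rows that are never placed in a test fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

n_train, n_val = 4, 2  # hypothetical sizes

# Labels used in this notebook: 1 for train rows, 0 for val rows.
two_folds = PredefinedSplit(np.concatenate((np.ones(n_train), np.zeros(n_val))))

# Alternative labels: -1 keeps the train rows out of every test fold.
one_fold = PredefinedSplit(np.concatenate((np.full(n_train, -1), np.zeros(n_val))))

for name, cv in [("1/0 labels", two_folds), ("-1/0 labels", one_fold)]:
    print(name, "->", cv.get_n_splits(), "split(s)")
    for train_idx, test_idx in cv.split():
        print("   fit on rows", train_idx, "score on rows", test_idx)
```

With the 1/0 labels there are two splits, and in the second one the model is fit on the validation rows and scored on the training rows. That is consistent with the "Fitting 2 folds" message in the grid search log. With -1/0 labels only the validation rows are ever used for scoring.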

In [16]:
grid.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))


Fitting 2 folds for each of 17 candidates, totalling 34 fits
[CV 1/2] END model=SVC(), model__C=1, model__kernel=rbf; f1: (train=0.979, test=0.984) matthews_corrcoef: (train=0.958, test=0.969) total time=   0.0s
[CV 2/2] END model=SVC(), model__C=1, model__kernel=rbf; f1: (train=0.984, test=0.951) matthews_corrcoef: (train=0.969, test=0.904) total time=   0.0s
[CV 1/2] END model=SVC(), model__C=0.9, model__kernel=rbf; f1: (train=0.973, test=0.984) matthews_corrcoef: (train=0.945, test=0.969) total time=   0.0s
[CV 2/2] END model=SVC(), model__C=0.9, model__kernel=rbf; f1: (train=0.984, test=0.951) matthews_corrcoef: (train=0.969, test=0.904) total time=   0.0s
[CV 1/2] END model=RandomForestClassifier(random_state=2187), model__criterion=gini, model__max_depth=10, model__n_estimators=100; f1: (train=1.000, test=0.984) matthews_corrcoef: (train=1.000, test=0.969) total time=   0.7s
[CV 2/2] END model=RandomForestClassifier(random_state=2187), model__criterion=gini, model__max_depth=10, 

In [17]:
pd.DataFrame(grid.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model,param_model__C,param_model__kernel,param_model__criterion,param_model__max_depth,param_model__n_estimators,...,std_train_f1,split0_test_matthews_corrcoef,split1_test_matthews_corrcoef,mean_test_matthews_corrcoef,std_test_matthews_corrcoef,rank_test_matthews_corrcoef,split0_train_matthews_corrcoef,split1_train_matthews_corrcoef,mean_train_matthews_corrcoef,std_train_matthews_corrcoef
0,0.005982,0.004015088,0.007158,0.000849,SVC(),1.0,rbf,,,,...,0.002507,0.969223,0.904017,0.93662,0.032603,13,0.958346,0.969223,0.963785,0.005439
1,0.005485,0.003491759,0.006981,0.000997,SVC(),0.9,rbf,,,,...,0.005718,0.969223,0.904017,0.93662,0.032603,13,0.945316,0.969223,0.95727,0.011954
2,0.488974,0.329838,0.010461,0.00148,RandomForestClassifier(random_state=2187),,,gini,10.0,100.0,...,0.0,0.969223,0.915432,0.942328,0.026895,3,1.0,1.0,1.0,0.0
3,0.950544,0.6323633,0.016939,0.002976,RandomForestClassifier(random_state=2187),,,gini,10.0,200.0,...,0.0,0.969223,0.909969,0.939596,0.029627,7,1.0,1.0,1.0,0.0
4,1.417396,0.9352663,0.025446,0.003477,RandomForestClassifier(random_state=2187),,,gini,10.0,300.0,...,0.0,0.969223,0.904533,0.936878,0.032345,11,1.0,1.0,1.0,0.0
5,0.477185,0.3206031,0.009974,0.000996,RandomForestClassifier(random_state=2187),,,gini,15.0,100.0,...,0.0,0.969223,0.915432,0.942328,0.026895,3,1.0,1.0,1.0,0.0
6,0.966431,0.6403027,0.017967,0.002977,RandomForestClassifier(random_state=2187),,,gini,15.0,200.0,...,0.0,0.969223,0.909969,0.939596,0.029627,7,1.0,1.0,1.0,0.0
7,1.457552,0.9633558,0.02643,0.002493,RandomForestClassifier(random_state=2187),,,gini,15.0,300.0,...,0.0,0.969223,0.904533,0.936878,0.032345,11,1.0,1.0,1.0,0.0
8,0.437466,0.2708845,0.01097,0.000998,RandomForestClassifier(random_state=2187),,,entropy,10.0,100.0,...,0.0,0.969223,0.920649,0.944936,0.024287,1,1.0,1.0,1.0,0.0
9,0.892875,0.5358572,0.018457,0.002516,RandomForestClassifier(random_state=2187),,,entropy,10.0,200.0,...,0.0,0.969223,0.915183,0.942203,0.02702,5,1.0,1.0,1.0,0.0


In [18]:
grid.best_params_

{'model': RandomForestClassifier(random_state=2187),
 'model__criterion': 'entropy',
 'model__max_depth': 10,
 'model__n_estimators': 100}

In [19]:
grid.best_score_

0.9449359678336603

In [20]:
round(grid.best_score_,3)

0.945
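
As a quick check of where this number comes from: with `refit = 'matthews_corrcoef'`, `best_score_` is the mean test score over the folds for the top-ranked candidate. The sketch below recomputes it from the two per-fold values shown for row 8 of `cv_results_` above; the variable names are only illustrative.

```python
import numpy as np

# Per-fold test MCC of the top-ranked candidate (row 8 of cv_results_ above).
fold_scores = np.array([0.969223, 0.920649])

mean_mcc = fold_scores.mean()  # mean_test_matthews_corrcoef
std_mcc = fold_scores.std()    # std_test_matthews_corrcoef (population std)

print(round(mean_mcc, 3), round(std_mcc, 6))
```

The mean rounds to 0.945 and the standard deviation agrees with the 0.024287 reported in the table, so the score summarizes both folds rather than the validation fold alone.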

In [21]:
grid.best_estimator_

In [22]:
results = pd.DataFrame(grid.cv_results_)

results[['param_model','mean_train_matthews_corrcoef','mean_test_matthews_corrcoef']].groupby('param_model',sort=False).mean().round(3)

param_model,mean_train_matthews_corrcoef,mean_test_matthews_corrcoef
SVC(),0.961,0.937
RandomForestClassifier(random_state=2187),1.0,0.941
KNeighborsClassifier(),0.952,0.92


In [23]:
results[['param_model','mean_train_f1','mean_test_f1']].groupby('param_model',sort=False).mean().round(3)

param_model,mean_train_f1,mean_test_f1
SVC(),0.98,0.968
RandomForestClassifier(random_state=2187),1.0,0.969
KNeighborsClassifier(),0.975,0.959


In [24]:
# Scores to present

results[['param_model','mean_test_f1','std_test_f1','mean_test_matthews_corrcoef','std_test_matthews_corrcoef']].groupby('param_model',sort=False).mean().round(3)

param_model,mean_test_f1,std_test_f1,mean_test_matthews_corrcoef,std_test_matthews_corrcoef
SVC(),0.968,0.017,0.937,0.033
RandomForestClassifier(random_state=2187),0.969,0.015,0.941,0.028
KNeighborsClassifier(),0.959,0.009,0.92,0.019
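
The tables above average the scores over all candidates of each model. A complementary view is the best candidate per model. The sketch below shows one way to get it with `groupby` and `idxmax`. It uses only four rows copied from the `cv_results_` output above (the KNN rows are not visible there), and the `model` and `candidate` columns are hypothetical labels added for the example.

```python
import pandas as pd

# Subset of cv_results_ copied from the output above (rows 0, 1, 2 and 8).
subset = pd.DataFrame({
    "model": ["SVC", "SVC", "RandomForest", "RandomForest"],
    "candidate": [0, 1, 2, 8],
    "mean_test_matthews_corrcoef": [0.936620, 0.936620, 0.942328, 0.944936],
    "std_test_matthews_corrcoef": [0.032603, 0.032603, 0.026895, 0.024287],
})

# Row label of the highest mean test MCC within each model (first row wins on ties).
best_idx = subset.groupby("model", sort=False)["mean_test_matthews_corrcoef"].idxmax()
best_per_model = subset.loc[best_idx]

print(best_per_model)
```

On the full `results` table the same pattern would need a hashable label per model, for example the class name of each entry in `param_model`.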
