This notebook optimizes the machine learning algorithms for Maytenus. The data passes through the feature creation pipeline and then through the ML training pipeline with grid search, so that the best algorithms and the best parameters are chosen.

At the end, a preliminary test is run with previous data.

- Gridsearch Score: 0.9806924101198402
- Preliminary Testing: 0.0


This notebook contains the grid search done with the previous data. The previous data was processed by XCMS together with the PhD data, using the XCMS script optimized previously. That optimization of the XCMS parameters used only the PhD data, which could be the reason behind the 0.0 score obtained at the end. To verify this, the next notebook includes the previous data together with the PhD data during the IPO optimization.

In [None]:
# xcms used: only on the training of the master's data

setwd("G:/2.LABMETAMASS/DOUTORADO/DADOS/maytenus/testes_mestrado/convertidos/output/train/xcms_aug")



library(xcms)
library(CAMERA)
library(beepr)



xset <- xcmsSet( 
        method   = "matchedFilter",
        fwhm     = 29.4,
        snthresh = 16.1595968, #16.1595968
        step     = 1,
        steps    = 12,
        sigma    = (29.4/2.3548), #12.4851367419738,
        max      = 5,
        mzdiff   = -11, # -11 WAS THE STANDARD
        index    = FALSE)

beep(2)


xset2 <- retcor( 
        xset,
        method         = "obiwarp",
        plottype       = "none",
        distFunc       = "cor_opt",
        profStep       = 1,
        response       = 1,
        gapInit        = 0.26,
        gapExtend      = 2.1,
        factorDiag     = 2,
        factorGap      = 1,
        localAlignment = 0)
beep(2)


xset3 <- group( 
        xset2,
        method  = "density",
        bw      = 50,
        mzwid   = 1,
        minfrac = 0.1,
        minsamp = 1,
        max     = 50)
beep(2)

xset4 <- fillPeaks(xset3)

beep(2)

# The IPO script ends here

# Substitute the object names inside the ( ) accordingly.

# Creation of an xsAnnotate object
an <- xsAnnotate(xset4)

# perfwhm defines the window width, which is used for matching
anF <- groupFWHM(an, perfwhm = 0.6)

# mzabs = the allowed m/z error
anI <- findIsotopes(anF, mzabs=0.01)

anIC <- groupCorr(anI, cor_eic_th=0.1)

anFA <- findAdducts(anIC, polarity="negative") # change polarity accordingly
beep(2)

write.csv(getPeaklist(anIC), file="maytenus_testing_masters.csv") # generates a table of features

beep(3)


In [3]:
import pandas as pd
import numpy as np

from sklearn.svm import SVC
from sklearn import metrics

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.model_selection import PredefinedSplit


# Functions

In [1]:
# creates the feature name with the mz and rt

def feature_name_creation(xcms_file_path):
    table = pd.read_csv(xcms_file_path, index_col=[0]) 
    
    # no need for decimal on m/z (low resolution) and only one decimal for rt
    table.mz = table.mz.round(0).astype(int)
    table.rt = table.rt.round(1)

    # creating the feature name: mz_rt
    features = table["mz"].astype(str) + "_" + table["rt"].astype(str)
    table.insert(0, 'features', features) # first column
    
    # drop as we don't know how many columns the table will have. Drop the known ones. 
    # There should only be the 'features' column and the samples
    
    table_clean = table.drop(['isotopes', 'adduct','pcgroup'], axis=1) #'npeaks','NEG_GROUP', 'POS_GROUP',
    
    return table_clean
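As a quick illustration of the naming rule, the sketch below applies the same rounding and concatenation to a two-row toy table. The frame `toy` and its raw values are made up for the example; only the rule (integer `mz`, one decimal for `rt`, joined by `_`) comes from the function above.

```python
import pandas as pd

# Hypothetical peak table with raw (unrounded) m/z and retention time
toy = pd.DataFrame({"mz": [117.93, 132.95], "rt": [574.61, 63.27]})

# Same rule as feature_name_creation: integer m/z, one decimal for rt
mz_int = toy["mz"].round(0).astype(int)
rt_1dp = toy["rt"].round(1)
names = mz_int.astype(str) + "_" + rt_1dp.astype(str)

print(names.tolist())  # ['118_574.6', '133_63.3']
```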

In [2]:
# rounds the mz and rt columns along with its min and max

def rounder(dataframe):
    table = dataframe.copy()  # work on a copy so the caller's DataFrame is not modified in place
    
    table.mz = table.mz.round(0).astype(int)
    table.mzmin = table.mzmin.round(0).astype(int)
    table.mzmax = table.mzmax.round(0).astype(int)
    
    table.rt = table.rt.round(1)
    table.rtmin = table.rtmin.round(1)
    table.rtmax = table.rtmax.round(1)

    
    return table
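To see what the rounding does to a matching window, the sketch below rounds the first train row using the values printed later in this notebook (the `mz` and `rt` columns were already rounded when the table was loaded). The helper frame `row` exists only for this example.

```python
import pandas as pd

# First train row as printed before calling rounder
row = pd.DataFrame({
    "mz": [118], "mzmin": [116.844856], "mzmax": [117.830395],
    "rt": [574.6], "rtmin": [446.674], "rtmax": [575.801],
})

rounded = row.copy()
for col in ["mz", "mzmin", "mzmax"]:
    rounded[col] = rounded[col].round(0).astype(int)  # integer m/z
for col in ["rt", "rtmin", "rtmax"]:
    rounded[col] = rounded[col].round(1)              # one decimal for rt

print(rounded)
```

The m/z window 116.84 to 117.83 becomes 117 to 118, so after rounding the comparison against `mzmin` and `mzmax` works on whole m/z units and the window edges count as matches.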

# Data Prep Pipeline

`Train` and `Val` sets were processed separately in `xcms`, which excludes the possibility of data leakage. 
However, when the processing is separate, the features can be slightly different. The compounds are almost the same, but the processing steps can shift the decimals of `mz` or `rt`. 
For this reason, feature names created by concatenating `mz_rt` might not be the same in train and val, and machine learning training is not possible with mismatched features. 

The errors observed in this case occur because features observed in train are not present in validation (and vice versa), or because the order of the features differs between the two datasets. This 'pipeline' fixes this issue.
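The mismatch can be reproduced with three features whose rounded `mz` and `rt` values are copied from the train and val tables printed further down. The helper `mz_rt_names` is a name introduced only for this sketch.

```python
import pandas as pd

# Rounded mz / rt of three features, as printed in the train and val tables
train = pd.DataFrame({"mz": [118, 133, 163], "rt": [574.2, 63.2, 360.9]})
val = pd.DataFrame({"mz": [118, 133, 163], "rt": [574.2, 60.3, 348.1]})

def mz_rt_names(table):
    """Build the set of naive 'mz_rt' names for a peak table."""
    return set(table["mz"].astype(str) + "_" + table["rt"].astype(str))

shared = mz_rt_names(train) & mz_rt_names(val)
only_val = mz_rt_names(val) - mz_rt_names(train)

print(sorted(shared))    # ['118_574.2']
print(sorted(only_val))  # ['133_60.3', '163_348.1']
```

Only one of the three names survives a direct comparison, even though the val feature at `mz` 133 falls inside the rt window of the train feature `133_63.2`.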

**Steps:**

1. Create the feature names on the `Train` set, which is used as the reference. Whatever features were observed here should appear in `Val`. The name is created by concatenating the `mz` and `rt` columns (`mz_rt`).
2. Create a correspondence between the features in the `Train` and `Val` sets, giving the val set the same column names as train when the feature is present:
    1. Round `mz` and `rt` in `Val` and `Train`.
    2. For each `mz` in `Val`, search for a `mzmin` to `mzmax` range in train that fits. The `mz_val` needs to be between `mzmin_train` and `mzmax_train`.
    3. If a match is found, test `rt`, `rtmin` and `rtmax` from `Val` against the `rtmin` to `rtmax` range in `Train`. At least one of these values needs to be between `rtmin_train` and `rtmax_train`. The `rtmin` and `rtmax` from `Val` are used because, occasionally, the range in `Val` or train is too wide (big difference in `rt` between samples).
    4. If there is a match, take the feature name from `Train` and apply it to the matching `Val` feature.
    
**With the feature names created:**

3. Features in `Train` and `Val` are ordered.
4. Duplicates are deleted based on the `npeaks` column (the feature with more `npeaks` is kept).
5. Features that were observed in `Val` but have no correspondence in `Train` have their names filled with `nan`. These are deleted.
6. Features that are in `Train` and were not found in `Val` are added to `Val` and filled with zero (no presence of that feature).
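The clean-up that follows the name matching (dropping unmatched rows, removing duplicates by `npeaks`, zero-filling features missing from val, and ordering) can be sketched in one chain. This is an alternative formulation using `reindex`, not the code used later in this notebook, and the toy tables are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical val table after matching: one duplicated name and one unmatched row (NaN)
val = pd.DataFrame({
    "features": ["133_63.2", "133_63.2", np.nan, "118_574.2"],
    "npeaks": [2246, 14, 20, 53],
    "S1": [4.5e8, 1.9e8, 1.5e7, 4.1e7],
})
train_features = ["118_574.2", "118_574.6", "133_63.2"]

aligned = (
    val.dropna(subset=["features"])              # drop features with no match in train
       .sort_values("npeaks", ascending=False)
       .drop_duplicates("features")              # keep the row with the most npeaks
       .set_index("features")
       .drop(columns="npeaks")
       .reindex(sorted(train_features), fill_value=0)  # add missing features as 0, in sorted order
)
print(aligned)
```

`reindex` both orders the rows and creates the train-only features in one step, which avoids building an empty frame and concatenating it.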
 
 
**To fix:**
 The code for the feature correspondence is not optimized. 
 - After the match with `mz`, the loop searches the whole dataset for a match in `rt`. This costs unnecessary computation. 
 - If there is a match of two features, the last one is kept. Could keep both and filter later? 
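One possible way to address both points is sketched below: filter train by the m/z window first, run the rt test only on the surviving rows, and return every candidate so the choice can be made later. The function `match_candidates` is hypothetical and is not used elsewhere in this notebook; the toy values are the rounded windows printed in the tables below.

```python
import pandas as pd

def match_candidates(val_row, train):
    """Return every train feature whose m/z and rt windows accept val_row."""
    # m/z filter first, so the rt test only runs on the few rows that survive
    in_mz = train[(train["mzmin"] <= val_row["mz"]) & (val_row["mz"] <= train["mzmax"])]
    # rt, rtmin or rtmax of the val feature may fall inside the train rt window
    rt_hit = pd.Series(False, index=in_mz.index, dtype=bool)
    for col in ("rt", "rtmin", "rtmax"):
        rt_hit |= (in_mz["rtmin"] <= val_row[col]) & (val_row[col] <= in_mz["rtmax"])
    return in_mz.loc[rt_hit, "features"].tolist()

train = pd.DataFrame({
    "features": ["118_574.6", "118_574.2", "133_63.2"],
    "mzmin": [117, 118, 133], "mzmax": [118, 119, 133],
    "rtmin": [446.7, 572.7, 59.0], "rtmax": [575.8, 575.9, 68.4],
})
val_row = {"mz": 118, "rt": 574.2, "rtmin": 573.5, "rtmax": 575.0}

print(match_candidates(val_row, train))  # ['118_574.6', '118_574.2']
```

Returning a list makes the ambiguity explicit: this val feature fits two train features, and a later rule (for example the highest `npeaks`) can decide between them.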
 


## Feature reference creation - train set

In [4]:
# train is loaded using the function to create the feature names - feature names are created using mz and rt.  
maytenus_train = feature_name_creation('5. Gridsearch/maytenus_train_processing.csv').reset_index(drop=True)

In [5]:
maytenus_train.head()

Unnamed: 0,features,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,...,IL9_3,IL96_1,IL96_2,IL96_3,IL97_1,IL97_2,IL97_3,IL99_1,IL99_2,IL99_3
0,118_574.6,118,116.844856,117.830395,574.6,446.674,575.801,117,45,56,...,59088280.0,58993830.0,55884640.0,56126180.0,49123880.0,49234010.0,49292970.0,46672220.0,46160370.0,48399060.0
1,118_574.2,118,117.857061,118.844673,574.2,572.723,575.897,307,47,69,...,63260280.0,62417810.0,59084520.0,60318190.0,60757550.0,58556850.0,59764570.0,59134560.0,57684060.0,60959120.0
2,133_63.3,133,132.418292,132.844587,63.3,58.987,68.379,453,101,109,...,300273300.0,482432100.0,483982900.0,467195900.0,300615500.0,294482800.0,275804900.0,232675400.0,243649400.0,236454200.0
3,133_63.2,133,132.84479,133.330327,63.2,58.987,68.386,8638,378,384,...,296050700.0,459560200.0,467617600.0,462127700.0,301653100.0,324375100.0,295186500.0,234183000.0,239973200.0,236143900.0
4,163_360.9,163,162.679173,163.251287,360.9,358.839,363.122,132,51,0,...,11875370.0,11618140.0,11513740.0,10015170.0,7565125.0,8315961.0,7388538.0,7563753.0,7313061.0,7200719.0


## Loading the validation (val) set

In [10]:
# val will be loaded using regular read_csv - the names of the features will come based on comparison
maytenus_val = pd.read_csv('5. Gridsearch/maytenus_validation_processing.csv',index_col=[0]).reset_index(drop=True).drop(['isotopes', 'adduct','pcgroup'], axis=1) #'npeaks','NEG_GROUP', 'POS_GROUP',

## Rounding mz and rt

In [11]:
# rounding all mz and all rt
maytenus_val = rounder(maytenus_val)
maytenus_train = rounder(maytenus_train)

In [12]:
display(maytenus_val.iloc[:,0:7].head())
display(maytenus_train.iloc[:,0:7].head())

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks
0,117,117,117,112.2,111.3,113.7,20
1,118,117,118,574.2,573.5,575.0,13
2,118,118,119,573.8,573.1,574.6,53
3,133,133,133,60.3,57.1,65.7,2246
4,163,163,163,348.1,344.6,352.2,54


Unnamed: 0,features,mz,mzmin,mzmax,rt,rtmin,rtmax
0,118_574.6,118,117,118,574.6,446.7,575.8
1,118_574.2,118,118,119,574.2,572.7,575.9
2,133_63.3,133,132,133,63.3,59.0,68.4
3,133_63.2,133,133,133,63.2,59.0,68.4
4,163_360.9,163,163,163,360.9,358.8,363.1


In [13]:
display(maytenus_val)
display(maytenus_train)

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,AQ15_1,...,IL88_3,IL89_1,IL89_2,IL89_3,IL90_1,IL90_2,IL90_3,IL93_1,IL93_2,IL93_3
0,117,117,117,112.2,111.3,113.7,20,12,0,1.507722e+07,...,1.461273e+07,1.434329e+07,1.436273e+07,1.391806e+07,9.000562e+06,9.721818e+06,9.042574e+06,1.257923e+07,1.336079e+07,1.316295e+07
1,118,117,118,574.2,573.5,575.0,13,3,10,4.016225e+07,...,5.592482e+07,5.365049e+07,5.376085e+07,5.381437e+07,6.404797e+07,6.219705e+07,6.203724e+07,5.224481e+07,5.518170e+07,5.498099e+07
2,118,118,119,573.8,573.1,574.6,53,7,15,4.153075e+07,...,5.857564e+07,6.322207e+07,6.139799e+07,6.090669e+07,6.834088e+07,6.645232e+07,6.623445e+07,5.886957e+07,5.755579e+07,6.050415e+07
3,133,133,133,60.3,57.1,65.7,2246,96,96,4.548042e+08,...,4.815065e+08,3.174242e+08,3.120228e+08,3.069786e+08,3.021286e+08,2.978952e+08,2.971043e+08,3.217211e+08,3.153699e+08,3.231264e+08
4,163,163,163,348.1,344.6,352.2,54,12,0,9.214956e+06,...,9.397630e+06,9.037046e+06,7.827418e+06,8.148794e+06,1.124304e+07,1.112709e+07,1.236415e+07,9.509602e+06,8.799963e+06,8.076247e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,741,741,742,284.4,282.3,287.8,94,0,94,1.854838e+07,...,9.428788e+07,8.643290e+07,7.781434e+07,7.667890e+07,8.215795e+07,8.672896e+07,8.572576e+07,6.567109e+07,7.106116e+07,6.721195e+07
94,755,754,755,268.7,266.9,271.1,42,0,42,1.871405e+07,...,8.031803e+07,7.000626e+07,7.291199e+07,5.532321e+07,6.726124e+07,6.271704e+07,7.129912e+07,4.220301e+07,4.313111e+07,4.445439e+07
95,755,755,756,268.6,264.8,271.5,999,0,96,1.937589e+07,...,8.416279e+07,7.747889e+07,7.838830e+07,7.654461e+07,6.726851e+07,8.314858e+07,7.156967e+07,4.917694e+07,5.297594e+07,4.751466e+07
96,756,756,757,268.7,265.2,271.6,193,0,96,1.762526e+07,...,7.710153e+07,5.676400e+07,5.879134e+07,6.519987e+07,6.654037e+07,6.715041e+07,6.434036e+07,4.980773e+07,5.314412e+07,4.697505e+07


Unnamed: 0,features,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,...,IL9_3,IL96_1,IL96_2,IL96_3,IL97_1,IL97_2,IL97_3,IL99_1,IL99_2,IL99_3
0,118_574.6,118,117,118,574.6,446.7,575.8,117,45,56,...,5.908828e+07,5.899383e+07,5.588464e+07,5.612618e+07,4.912388e+07,4.923401e+07,4.929297e+07,4.667222e+07,4.616037e+07,4.839906e+07
1,118_574.2,118,118,119,574.2,572.7,575.9,307,47,69,...,6.326028e+07,6.241781e+07,5.908452e+07,6.031819e+07,6.075755e+07,5.855685e+07,5.976457e+07,5.913456e+07,5.768406e+07,6.095912e+07
2,133_63.3,133,132,133,63.3,59.0,68.4,453,101,109,...,3.002733e+08,4.824321e+08,4.839829e+08,4.671959e+08,3.006155e+08,2.944828e+08,2.758049e+08,2.326754e+08,2.436494e+08,2.364542e+08
3,133_63.2,133,133,133,63.2,59.0,68.4,8638,378,384,...,2.960507e+08,4.595602e+08,4.676176e+08,4.621277e+08,3.016531e+08,3.243751e+08,2.951865e+08,2.341830e+08,2.399732e+08,2.361439e+08
4,163_360.9,163,163,163,360.9,358.8,363.1,132,51,0,...,1.187537e+07,1.161814e+07,1.151374e+07,1.001517e+07,7.565125e+06,8.315961e+06,7.388538e+06,7.563753e+06,7.313061e+06,7.200719e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,741_302.0,741,740,741,302.0,264.8,311.0,350,5,345,...,8.452444e+07,7.295674e+07,7.044123e+07,6.966945e+07,9.884958e+07,1.105645e+08,1.005150e+08,1.062612e+08,9.917727e+07,9.797857e+07
89,755_286.3,755,754,755,286.3,248.8,312.9,1032,5,351,...,8.833914e+07,5.279847e+07,5.557496e+07,4.565162e+07,1.261003e+08,1.379112e+08,1.216284e+08,7.109022e+07,8.046582e+07,8.627527e+07
90,756_286.7,756,755,756,286.7,248.8,312.9,3392,5,351,...,1.016305e+08,5.368691e+07,5.566762e+07,5.216342e+07,1.550416e+08,1.187256e+08,1.537047e+08,9.148052e+07,9.322778e+07,9.990198e+07
91,757_286.5,757,756,757,286.5,256.7,312.9,157,1,153,...,8.192823e+07,4.605462e+07,4.654267e+07,4.420568e+07,5.858005e+07,6.576977e+07,6.590599e+07,7.515455e+07,7.053095e+07,4.315254e+07


## Feature creation and correspondence on val set - create a function of this

In [14]:
# creating the column
maytenus_val['features'] = np.nan

In [15]:
# loop over maytenus_val items. 
# Each mz will be tested against all mzmin and mzmax range from train. 
# if in range, test for rt.
# if in range, use the same feature name from train

maytenus_val = maytenus_val.sort_values('npeaks', ascending=False,ignore_index=True)
maytenus_train_ref = maytenus_train.sort_values('npeaks', ascending=False,ignore_index=True)

for i in range(len(maytenus_val)):
    for j in range(len(maytenus_train_ref)):


        if ((maytenus_val.loc[i,'mz'] <= maytenus_train_ref.loc[j,'mzmax']) 
              & (maytenus_val.loc[i,'mz'] >= maytenus_train_ref.loc[j,'mzmin'])):
            
            #maybe subset maytenus train and then perform things on the subset? 
            
            if (
                ((maytenus_val.loc[i,'rt'] <= maytenus_train_ref.loc[j,'rtmax']) 
                  & (maytenus_val.loc[i,'rt'] >= maytenus_train_ref.loc[j,'rtmin'])) or
            
               ((maytenus_val.loc[i,'rtmin'] <= maytenus_train_ref.loc[j,'rtmax']) 
                  & (maytenus_val.loc[i,'rtmin'] >= maytenus_train_ref.loc[j,'rtmin'])) or
                
               ((maytenus_val.loc[i,'rtmax'] <= maytenus_train_ref.loc[j,'rtmax']) 
                & (maytenus_val.loc[i,'rtmax'] >= maytenus_train_ref.loc[j,'rtmin']))
            ):
                
                maytenus_val.loc[i,'features'] = maytenus_train_ref.loc[j,'features']
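            # stop scanning train at the first row whose mz window matches,
            # whether or not the rt test above passed
            break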

In [16]:
maytenus_val

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,AQ15_1,...,IL89_1,IL89_2,IL89_3,IL90_1,IL90_2,IL90_3,IL93_1,IL93_2,IL93_3,features
0,191,191,192,89.0,50.3,92.2,2248,96,93,1.888418e+08,...,2.563923e+08,2.509333e+08,2.487752e+08,2.339792e+08,2.309708e+08,2.304622e+08,2.391413e+08,2.371037e+08,2.361392e+08,191_102.3
1,133,133,133,60.3,57.1,65.7,2246,96,96,4.548042e+08,...,3.174242e+08,3.120228e+08,3.069786e+08,3.021286e+08,2.978952e+08,2.971043e+08,3.217211e+08,3.153699e+08,3.231264e+08,133_63.2
2,289,288,289,254.4,220.8,259.4,1759,36,96,9.599036e+07,...,1.920411e+08,1.953578e+08,1.917544e+08,2.147209e+08,2.108924e+08,2.091615e+08,1.348852e+08,1.373190e+08,1.387906e+08,289_272.2
3,561,561,562,265.1,227.6,270.1,1258,28,78,9.315947e+07,...,1.333482e+08,1.354178e+08,1.390231e+08,1.424696e+08,1.403660e+08,1.364839e+08,1.340487e+08,1.327994e+08,1.432326e+08,561_282.7
4,739,739,740,284.5,282.3,288.3,1155,0,96,2.747926e+07,...,2.072239e+08,2.044367e+08,1.726183e+08,1.804404e+08,2.048974e+08,1.819017e+08,1.652030e+08,1.686066e+08,1.687056e+08,739_302.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,654,654,655,266.5,263.0,269.7,18,15,0,2.193694e+07,...,1.780166e+07,1.746834e+07,1.990690e+07,1.921297e+07,1.796969e+07,1.756218e+07,2.042788e+07,1.980194e+07,2.076722e+07,
94,118,117,118,574.2,573.5,575.0,13,3,10,4.016225e+07,...,5.365049e+07,5.376085e+07,5.381437e+07,6.404797e+07,6.219705e+07,6.203724e+07,5.224481e+07,5.518170e+07,5.498099e+07,118_574.2
95,424,424,424,410.7,409.0,412.4,12,12,0,1.207037e+07,...,1.140325e+07,1.162769e+07,1.130129e+07,1.156243e+07,9.358604e+06,8.610475e+06,9.908140e+06,1.160593e+07,8.717365e+06,
96,657,656,657,262.6,260.5,264.7,11,11,0,2.096774e+07,...,1.791879e+07,1.684114e+07,1.941615e+07,1.911588e+07,1.719130e+07,1.739270e+07,1.952732e+07,1.971833e+07,1.862377e+07,


In [17]:
# the process can create duplicates, so removing them is necessary
# the removal is based on the npeaks column. The feature with more npeaks, is kept.
maytenus_val = maytenus_val.sort_values('npeaks', ascending=False).drop_duplicates('features').sort_index()

# dropping unnecessary columns
maytenus_val = maytenus_val.drop(['mz', 'mzmin', 'mzmax', 'rt', 
                                  'rtmin', 'rtmax', 'npeaks','NEG_GROUP', 'POS_GROUP'], axis=1)

# removing the duplicates that might arise with the train is also necessary
# drop possible duplicates for train as well
maytenus_train_ref = maytenus_train_ref.sort_values('npeaks', ascending=False).drop_duplicates('features').sort_index()

# dropping unnecessary  columns
maytenus_train_ref = maytenus_train_ref.drop(['mz', 'mzmin', 'mzmax', 'rt', 
                                      'rtmin', 'rtmax', 'npeaks','NEG_GROUP', 'POS_GROUP'], axis=1)

# val might have some features that don't fit in any train range - their names stay nan, so they are removed
# train might have some features that won't appear in val - these are created in val and set to zero
# first, set 'features' as the index on both, so that this comparison is possible
maytenus_train_ref= maytenus_train_ref.set_index('features')
maytenus_val = maytenus_val.dropna().set_index('features') # dropping na and making feature as index

# set method to get the set of index values that are unique 
# subtracting the sets to get the different indexes. 
# concat method to concatenate train and val
# filling the missing values on the concatenation with 0 using the fillna method.

unique_indexes = list(set(maytenus_train_ref.index) - set(maytenus_val.index))
maytenus_val = pd.concat([maytenus_val, pd.DataFrame(index=unique_indexes, columns=maytenus_val.columns)], sort=True).fillna(0)

# order both val and train features equally
# sort the features - the model needs them at the same sequence
maytenus_train_grid = maytenus_train_ref.reset_index().sort_values(by='features')
maytenus_val = maytenus_val.reset_index().sort_values(by='index')



In [18]:
maytenus_val

Unnamed: 0,index,AQ15_1,AQ15_2,AQ15_3,AQ24_1,AQ24_2,AQ24_3,AQ29_1,AQ29_2,AQ29_3,...,ML43_3,ML44_1,ML44_2,ML44_3,ML46_1,ML46_2,ML46_3,ML49_1,ML49_2,ML49_3
41,118_574.2,4.153075e+07,4.128833e+07,4.127786e+07,5.993508e+07,5.888217e+07,6.205178e+07,6.226222e+07,6.362410e+07,6.167428e+07,...,6.069511e+07,6.183458e+07,5.985650e+07,5.639033e+07,5.428191e+07,4.985583e+07,4.867924e+07,5.155424e+07,5.018745e+07,5.006901e+07
78,118_574.6,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,...,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
1,133_63.2,4.548042e+08,4.467540e+08,4.587334e+08,2.692450e+08,2.620074e+08,2.490009e+08,2.720095e+08,2.681214e+08,2.814799e+08,...,4.178706e+08,1.221708e+08,1.207355e+08,1.183927e+08,4.058158e+08,4.155007e+08,4.077516e+08,4.329363e+08,4.456691e+08,4.328093e+08
82,133_63.3,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,...,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
87,163_360.9,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,...,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32,741_302.0,1.854838e+07,2.000525e+07,1.990883e+07,1.691154e+07,1.758071e+07,1.848960e+07,1.611531e+07,1.789688e+07,1.758252e+07,...,1.656812e+07,1.689603e+07,1.595879e+07,1.599404e+07,1.634736e+07,1.584006e+07,1.611701e+07,1.688472e+07,1.578628e+07,1.783517e+07
70,755_286.3,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,...,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
6,756_286.7,1.937589e+07,1.996009e+07,1.937854e+07,1.927812e+07,1.803453e+07,1.568489e+07,1.638128e+07,1.687270e+07,1.556628e+07,...,1.579348e+07,1.633861e+07,1.521654e+07,1.360705e+07,1.633646e+07,1.695084e+07,1.616958e+07,1.729669e+07,1.721778e+07,1.677490e+07
44,757_286.5,1.683946e+07,1.733512e+07,1.604495e+07,1.474818e+07,1.498274e+07,1.323982e+07,1.568520e+07,1.555287e+07,1.619717e+07,...,2.041815e+07,1.683738e+07,1.578859e+07,1.429213e+07,1.935499e+07,2.012559e+07,2.025872e+07,2.011256e+07,2.034606e+07,1.941281e+07


## Bring the class data column

In [19]:
# load
classes_train = pd.read_csv('classes_train_maytenus.csv', index_col=[0])
classes_val = pd.read_csv('classes_val_maytenus.csv', index_col=[0])

# unite
maytenus_train_grid = maytenus_train_grid.set_index('features').T.join(classes_train)
display(maytenus_train_grid.head())

maytenus_val = maytenus_val.set_index('index').T.join(classes_val)
display(maytenus_val.head())

Unnamed: 0,118_574.2,118_574.6,133_63.2,133_63.3,163_360.9,164_351.6,165_349.9,181_45.3,181_45.6,191_101.4,...,707_241.4,707_241.5,709_241.9,739_302.0,741_302.0,755_286.3,756_286.7,757_286.5,758_286.7,class
AQ1_1,66502010.0,50755490.0,352129100.0,361609800.0,9044279.0,14570900.0,15966390.0,73613700.0,73367890.0,263069000.0,...,13258260.0,14203630.0,13990700.0,18468440.0,18019010.0,13243040.0,16247320.0,17072400.0,15955370.0,0
AQ1_2,64243190.0,48721720.0,354461500.0,362486200.0,9299879.0,14033390.0,16219490.0,73107850.0,72655590.0,269104900.0,...,11118150.0,11643640.0,13800970.0,18108450.0,14995410.0,15788690.0,17079350.0,16492870.0,15988850.0,0
AQ1_3,63663890.0,47005090.0,353487700.0,367077700.0,8167617.0,14785070.0,15329950.0,74859590.0,73194970.0,271823300.0,...,11165930.0,12231060.0,12554180.0,18473440.0,16446070.0,13868740.0,15389190.0,15893680.0,15062550.0,0
AQ10_1,65309540.0,63667780.0,391287500.0,404661600.0,11156450.0,15602280.0,22417620.0,72161610.0,70694880.0,201744900.0,...,11661570.0,13380200.0,13174200.0,20088730.0,16951230.0,16270240.0,18291860.0,15627280.0,14872670.0,0
AQ10_2,65167600.0,62518030.0,387244900.0,397189600.0,12623900.0,16479650.0,23167940.0,75109470.0,72264010.0,194715400.0,...,11842300.0,13388720.0,13253770.0,18225030.0,17914100.0,16159650.0,18102620.0,17708470.0,16590010.0,0


Unnamed: 0,118_574.2,118_574.6,133_63.2,133_63.3,163_360.9,164_351.6,165_349.9,181_45.3,181_45.6,191_101.4,...,707_241.4,707_241.5,709_241.9,739_302.0,741_302.0,755_286.3,756_286.7,757_286.5,758_286.7,class
AQ15_1,41530750.0,0.0,454804200.0,0.0,0.0,0.0,0.0,0.0,66441290.0,0.0,...,0.0,0.0,0.0,27479260.0,18548380.0,0.0,19375890.0,16839460.0,0.0,0
AQ15_2,41288330.0,0.0,446754000.0,0.0,0.0,0.0,0.0,0.0,63796030.0,0.0,...,0.0,0.0,0.0,26072170.0,20005250.0,0.0,19960090.0,17335120.0,0.0,0
AQ15_3,41277860.0,0.0,458733400.0,0.0,0.0,0.0,0.0,0.0,59622250.0,0.0,...,0.0,0.0,0.0,25009740.0,19908830.0,0.0,19378540.0,16044950.0,0.0,0
AQ24_1,59935080.0,0.0,269245000.0,0.0,0.0,0.0,0.0,0.0,67804730.0,0.0,...,0.0,0.0,0.0,23576250.0,16911540.0,0.0,19278120.0,14748180.0,0.0,0
AQ24_2,58882170.0,0.0,262007400.0,0.0,0.0,0.0,0.0,0.0,64729630.0,0.0,...,0.0,0.0,0.0,25049610.0,17580710.0,0.0,18034530.0,14982740.0,0.0,0


In [20]:
maytenus_train.to_csv('features_train_comparison.csv')

Data is now ready for ANY machine learning process

# Machine learning

## X y split

In [21]:
X_train = maytenus_train_grid.drop("class", axis=1)
y_train = maytenus_train_grid["class"]

X_val = maytenus_val.drop("class", axis=1)
y_val = maytenus_val["class"]
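Because the fit further down stacks the two matrices with `np.vstack`, which discards column labels, it can be worth confirming that both matrices share the same feature order before training. A minimal check on hypothetical toy matrices (the real `X_train` and `X_val` would be used in the same way):

```python
import pandas as pd

# Hypothetical aligned matrices: samples in rows, features in columns
X_train_toy = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=["118_574.2", "133_63.2"])
X_val_toy = pd.DataFrame([[5.0, 0.0]], columns=["118_574.2", "133_63.2"])

# Same feature names in the same order, and no missing values
same_layout = X_train_toy.columns.equals(X_val_toy.columns)
no_missing = not X_train_toy.isna().any().any() and not X_val_toy.isna().any().any()

print(same_layout, no_missing)  # True True
```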

## Training

In [22]:
# https://stackoverflow.com/questions/31948879/using-explicit-predefined-validation-set-for-grid-search-with-sklearn
# https://stackoverflow.com/questions/48390601/explicitly-specifying-test-train-sets-in-gridsearchcv

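# test_fold labels for PredefinedSplit: every distinct non-negative value is used as a test fold.
# With 1 for the train rows and 0 for the val rows there are two folds:
# fold 0 tests on the val rows (fit on train), fold 1 tests on the train rows (fit on val).
# Labelling the train rows with -1 instead would give a single train -> val split.
train_indices = np.ones(len(X_train))
val_indices = np.zeros(len(X_val))
cv_indices = np.concatenate((train_indices, val_indices))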


# model
svm = SVC()
rf = RandomForestClassifier(random_state=2187)
knn = KNeighborsClassifier()

# params of each model

param_svm = {}
param_svm['model'] = [svm]
param_svm['model__C'] = [1, 0.9]
# only the rbf kernel is searched; other kernels tried before: 'linear', 'poly', 'sigmoid'
param_svm['model__kernel'] = ['rbf']

param_rf = {}
param_rf['model'] = [rf]
param_rf['model__max_depth'] = [10,15]
param_rf['model__n_estimators'] = [100,200,300]
param_rf['model__criterion'] = ['gini', 'entropy']


param_knn = {}
param_knn['model'] = [knn]
param_knn['model__n_neighbors'] = [5,15,25]


# uniting param to test in gridsearch

params_gridsearch = [param_svm,param_rf,param_knn]

# no need to encode or transform data. All is numeric and same scale

# pipe - starts with svm 
pipe = Pipeline([('model', svm)])

cv = PredefinedSplit(cv_indices)

# gridsearch 
grid = GridSearchCV(pipe, params_gridsearch, 
                    cv = cv,
                   scoring = ['f1','matthews_corrcoef'],
                   return_train_score = True, 
                   refit = 'matthews_corrcoef',
                   verbose = 3)
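The difference between labelling the train rows with 1 and with -1 in `PredefinedSplit` can be checked on a small example. The sample counts are hypothetical; the 1/0 layout mirrors `cv_indices` above, and the -1 variant is shown only for comparison.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

n_train, n_val = 6, 3  # hypothetical sample counts

# 1 for train rows and 0 for val rows: two test folds (0 and 1)
two_folds = PredefinedSplit(np.concatenate((np.ones(n_train), np.zeros(n_val))))

# -1 for train rows: they are never used as a test fold, leaving a single split
one_fold = PredefinedSplit(np.concatenate((np.full(n_train, -1), np.zeros(n_val))))

print(two_folds.get_n_splits(), one_fold.get_n_splits())  # 2 1

train_idx, test_idx = next(one_fold.split())
print(train_idx.tolist(), test_idx.tolist())  # [0, 1, 2, 3, 4, 5] [6, 7, 8]
```

With the 1/0 labels the grid search also fits on the val rows and scores on the train rows, which is why the log reports two folds per candidate.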

In [23]:
grid.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))


Fitting 2 folds for each of 17 candidates, totalling 34 fits
[CV 1/2] END model=SVC(), model__C=1, model__kernel=rbf; f1: (train=0.987, test=1.000) matthews_corrcoef: (train=0.974, test=1.000) total time=   0.1s
[CV 2/2] END model=SVC(), model__C=1, model__kernel=rbf; f1: (train=1.000, test=0.952) matthews_corrcoef: (train=1.000, test=0.908) total time=   0.0s
[CV 1/2] END model=SVC(), model__C=0.9, model__kernel=rbf; f1: (train=0.987, test=1.000) matthews_corrcoef: (train=0.974, test=1.000) total time=   0.0s
[CV 2/2] END model=SVC(), model__C=0.9, model__kernel=rbf; f1: (train=1.000, test=0.950) matthews_corrcoef: (train=1.000, test=0.906) total time=   0.0s
[CV 1/2] END model=RandomForestClassifier(random_state=2187), model__criterion=gini, model__max_depth=10, model__n_estimators=100; f1: (train=1.000, test=1.000) matthews_corrcoef: (train=1.000, test=1.000) total time=   0.5s
[CV 2/2] END model=RandomForestClassifier(random_state=2187), model__criterion=gini, model__max_depth=10, 

In [24]:
pd.DataFrame(grid.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model,param_model__C,param_model__kernel,param_model__criterion,param_model__max_depth,param_model__n_estimators,...,std_train_f1,split0_test_matthews_corrcoef,split1_test_matthews_corrcoef,mean_test_matthews_corrcoef,std_test_matthews_corrcoef,rank_test_matthews_corrcoef,split0_train_matthews_corrcoef,split1_train_matthews_corrcoef,mean_train_matthews_corrcoef,std_train_matthews_corrcoef
0,0.078819,0.0633775,0.032155,0.02617,SVC(),1.0,rbf,,,,...,0.006494,1.0,0.908025,0.954013,0.045987,8,0.973972,1.0,0.986986,0.013014
1,0.003491,0.001495242,0.004987,0.000969,SVC(),0.9,rbf,,,,...,0.006494,1.0,0.905567,0.952784,0.047216,9,0.973972,1.0,0.986986,0.013014
2,0.347779,0.2260765,0.01016,0.000811,RandomForestClassifier(random_state=2187),,,gini,10.0,100.0,...,0.0,1.0,0.912958,0.956479,0.043521,4,1.0,1.0,1.0,0.0
3,0.679695,0.4393102,0.016955,0.001995,RandomForestClassifier(random_state=2187),,,gini,10.0,200.0,...,0.0,0.989637,0.912958,0.951297,0.03834,10,1.0,1.0,1.0,0.0
4,1.015785,0.6418129,0.024918,0.003007,RandomForestClassifier(random_state=2187),,,gini,10.0,300.0,...,0.0,0.959166,0.912958,0.936062,0.023104,16,1.0,1.0,1.0,0.0
5,0.339606,0.2159362,0.009974,0.001994,RandomForestClassifier(random_state=2187),,,gini,15.0,100.0,...,0.0,1.0,0.912958,0.956479,0.043521,4,1.0,1.0,1.0,0.0
6,0.667215,0.4248918,0.016954,0.001995,RandomForestClassifier(random_state=2187),,,gini,15.0,200.0,...,0.0,0.989637,0.912958,0.951297,0.03834,10,1.0,1.0,1.0,0.0
7,1.003801,0.6368103,0.024937,0.00299,RandomForestClassifier(random_state=2187),,,gini,15.0,300.0,...,0.0,0.959166,0.912958,0.936062,0.023104,16,1.0,1.0,1.0,0.0
8,0.318151,0.1944815,0.009971,0.000999,RandomForestClassifier(random_state=2187),,,entropy,10.0,100.0,...,0.0,1.0,0.912958,0.956479,0.043521,4,1.0,1.0,1.0,0.0
9,0.630314,0.3779875,0.017451,0.002492,RandomForestClassifier(random_state=2187),,,entropy,10.0,200.0,...,0.0,0.989637,0.912958,0.951297,0.03834,10,1.0,1.0,1.0,0.0


In [25]:
grid.best_params_

{'model': KNeighborsClassifier(), 'model__n_neighbors': 5}

In [26]:
grid.best_score_



0.9626932234861232
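With a two-fold predefined split, `best_score_` is the mean of the refit metric over both folds for the best candidate, not the score on the val rows alone. The sketch below shows this on synthetic stand-in data (not the Maytenus tables); `grid_demo` and the `mcc` scorer key are names chosen for the example, and `make_scorer(matthews_corrcoef)` is used in place of the scorer string.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
test_fold = np.concatenate((np.ones(40), np.zeros(20)))  # same 1/0 layout as cv_indices

grid_demo = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5]},
    cv=PredefinedSplit(test_fold),
    scoring={"mcc": make_scorer(matthews_corrcoef)},
    refit="mcc",
)
grid_demo.fit(X, y)

res = grid_demo.cv_results_
best = grid_demo.best_index_
# split0 tests on the rows labelled 0, split1 on the rows labelled 1
per_fold = [res["split0_test_mcc"][best], res["split1_test_mcc"][best]]
print(per_fold, grid_demo.best_score_)
```

To read the val-only score of a candidate in this notebook, the `split0_test_matthews_corrcoef` column of `cv_results_` is the relevant one.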

# Testing the model - some samples from the master's project

In [53]:
# pass maytenus through the same preprocessing 

maytenus_test = pd.read_csv("6. Gridsearch Maytenus (adicional dados mestrado no retreino)/maytenus_testing_masters.csv", index_col='Unnamed: 0')

In [54]:
maytenus_test.head()

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,aq1set.17_1,...,il2mar.17_2,il3nov.16_1,il3nov.16_2,il4fev.17_1,il4fev.17_2,il5mai.17_1,il5mai.17_2,isotopes,adduct,pcgroup
1,113.766055,113.093379,114.076918,52.305,51.617,55.127,66,6,10,82873570.0,...,88342340.0,80850070.0,85253420.0,110348200.0,104886500.0,136714100.0,148264600.0,,,3
2,114.198306,114.10625,115.084676,51.59,50.173,52.305,28,6,5,83612160.0,...,86455270.0,78724930.0,81616300.0,108871900.0,73227430.0,109614800.0,115328900.0,,,3
3,115.207632,115.119803,115.360098,50.178,49.423,50.886,50,8,6,70917360.0,...,58891660.0,54521640.0,52818030.0,77227370.0,71171550.0,108949100.0,114858600.0,,,3
4,132.929407,132.502583,133.089878,50.178,49.423,50.912,226,10,10,190636300.0,...,165136800.0,145281900.0,133488400.0,213247100.0,207116200.0,318440000.0,306467100.0,,,3
5,133.118475,133.093647,133.198074,50.179,49.469,50.182,14,4,3,188283300.0,...,162470000.0,141343500.0,134538400.0,211362000.0,206861200.0,316116400.0,300560100.0,,,3


### The whole pipeline

In [55]:
# rounder
maytenus_test = rounder(maytenus_test)
maytenus_test.head()

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,aq1set.17_1,...,il2mar.17_2,il3nov.16_1,il3nov.16_2,il4fev.17_1,il4fev.17_2,il5mai.17_1,il5mai.17_2,isotopes,adduct,pcgroup
1,114,113,114,52.3,51.6,55.1,66,6,10,82873570.0,...,88342340.0,80850070.0,85253420.0,110348200.0,104886500.0,136714100.0,148264600.0,,,3
2,114,114,115,51.6,50.2,52.3,28,6,5,83612160.0,...,86455270.0,78724930.0,81616300.0,108871900.0,73227430.0,109614800.0,115328900.0,,,3
3,115,115,115,50.2,49.4,50.9,50,8,6,70917360.0,...,58891660.0,54521640.0,52818030.0,77227370.0,71171550.0,108949100.0,114858600.0,,,3
4,133,133,133,50.2,49.4,50.9,226,10,10,190636300.0,...,165136800.0,145281900.0,133488400.0,213247100.0,207116200.0,318440000.0,306467100.0,,,3
5,133,133,133,50.2,49.5,50.2,14,4,3,188283300.0,...,162470000.0,141343500.0,134538400.0,211362000.0,206861200.0,316116400.0,300560100.0,,,3


In [56]:
# create the columns
maytenus_test['features'] = np.nan
maytenus_test.head()

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,aq1set.17_1,...,il3nov.16_1,il3nov.16_2,il4fev.17_1,il4fev.17_2,il5mai.17_1,il5mai.17_2,isotopes,adduct,pcgroup,features
1,114,113,114,52.3,51.6,55.1,66,6,10,82873570.0,...,80850070.0,85253420.0,110348200.0,104886500.0,136714100.0,148264600.0,,,3,
2,114,114,115,51.6,50.2,52.3,28,6,5,83612160.0,...,78724930.0,81616300.0,108871900.0,73227430.0,109614800.0,115328900.0,,,3,
3,115,115,115,50.2,49.4,50.9,50,8,6,70917360.0,...,54521640.0,52818030.0,77227370.0,71171550.0,108949100.0,114858600.0,,,3,
4,133,133,133,50.2,49.4,50.9,226,10,10,190636300.0,...,145281900.0,133488400.0,213247100.0,207116200.0,318440000.0,306467100.0,,,3,
5,133,133,133,50.2,49.5,50.2,14,4,3,188283300.0,...,141343500.0,134538400.0,211362000.0,206861200.0,316116400.0,300560100.0,,,3,


In [57]:
maytenus_train

Unnamed: 0,features,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,...,IL9_3,IL96_1,IL96_2,IL96_3,IL97_1,IL97_2,IL97_3,IL99_1,IL99_2,IL99_3
0,118_574.6,118,117,118,574.6,446.7,575.8,117,45,56,...,5.908828e+07,5.899383e+07,5.588464e+07,5.612618e+07,4.912388e+07,4.923401e+07,4.929297e+07,4.667222e+07,4.616037e+07,4.839906e+07
1,118_574.2,118,118,119,574.2,572.7,575.9,307,47,69,...,6.326028e+07,6.241781e+07,5.908452e+07,6.031819e+07,6.075755e+07,5.855685e+07,5.976457e+07,5.913456e+07,5.768406e+07,6.095912e+07
2,133_63.3,133,132,133,63.3,59.0,68.4,453,101,109,...,3.002733e+08,4.824321e+08,4.839829e+08,4.671959e+08,3.006155e+08,2.944828e+08,2.758049e+08,2.326754e+08,2.436494e+08,2.364542e+08
3,133_63.2,133,133,133,63.2,59.0,68.4,8638,378,384,...,2.960507e+08,4.595602e+08,4.676176e+08,4.621277e+08,3.016531e+08,3.243751e+08,2.951865e+08,2.341830e+08,2.399732e+08,2.361439e+08
4,163_360.9,163,163,163,360.9,358.8,363.1,132,51,0,...,1.187537e+07,1.161814e+07,1.151374e+07,1.001517e+07,7.565125e+06,8.315961e+06,7.388538e+06,7.563753e+06,7.313061e+06,7.200719e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,741_302.0,741,740,741,302.0,264.8,311.0,350,5,345,...,8.452444e+07,7.295674e+07,7.044123e+07,6.966945e+07,9.884958e+07,1.105645e+08,1.005150e+08,1.062612e+08,9.917727e+07,9.797857e+07
89,755_286.3,755,754,755,286.3,248.8,312.9,1032,5,351,...,8.833914e+07,5.279847e+07,5.557496e+07,4.565162e+07,1.261003e+08,1.379112e+08,1.216284e+08,7.109022e+07,8.046582e+07,8.627527e+07
90,756_286.7,756,755,756,286.7,248.8,312.9,3392,5,351,...,1.016305e+08,5.368691e+07,5.566762e+07,5.216342e+07,1.550416e+08,1.187256e+08,1.537047e+08,9.148052e+07,9.322778e+07,9.990198e+07
91,757_286.5,757,756,757,286.5,256.7,312.9,157,1,153,...,8.192823e+07,4.605462e+07,4.654267e+07,4.420568e+07,5.858005e+07,6.576977e+07,6.590599e+07,7.515455e+07,7.053095e+07,4.315254e+07


In [58]:
# the loop
# Loop over the maytenus_test rows.
# Each test mz is checked against the [mzmin, mzmax] range of the train features.
# If the mz is in range, the test rt, rtmin and rtmax are checked against the train [rtmin, rtmax] range.
# If any of them is in range, the test row takes the feature name from train.

maytenus_test = maytenus_test.sort_values('npeaks', ascending=False, ignore_index=True)

for i in range(len(maytenus_test)):
    for j in range(len(maytenus_train)):

        if ((maytenus_test.loc[i,'mz'] <= maytenus_train.loc[j,'mzmax'])
              and (maytenus_test.loc[i,'mz'] >= maytenus_train.loc[j,'mzmin'])):

            # TODO: maybe subset maytenus_train and then perform things on the subset?

            if (
                ((maytenus_test.loc[i,'rt'] <= maytenus_train.loc[j,'rtmax'])
                  and (maytenus_test.loc[i,'rt'] >= maytenus_train.loc[j,'rtmin'])) or

               ((maytenus_test.loc[i,'rtmin'] <= maytenus_train.loc[j,'rtmax'])
                  and (maytenus_test.loc[i,'rtmin'] >= maytenus_train.loc[j,'rtmin'])) or

               ((maytenus_test.loc[i,'rtmax'] <= maytenus_train.loc[j,'rtmax'])
                and (maytenus_test.loc[i,'rtmax'] >= maytenus_train.loc[j,'rtmin']))
            ):

                maytenus_test.loc[i,'features'] = maytenus_train.loc[j,'features']
            # NOTE: this break runs on the first mz match, whether or not the rt test passed,
            # so later train features with a matching mz range are never tried.
            break
            


maytenus_test.head()

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,aq1set.17_1,...,il3nov.16_1,il3nov.16_2,il4fev.17_1,il4fev.17_2,il5mai.17_1,il5mai.17_2,isotopes,adduct,pcgroup,features
0,133,133,133,50.2,49.4,50.9,226,10,10,190636300.0,...,145281900.0,133488400.0,213247100.0,207116200.0,318440000.0,306467100.0,,,3,
1,289,289,290,195.7,162.0,196.6,203,10,10,85898060.0,...,238775600.0,247438100.0,152743600.0,171133400.0,176205100.0,182063000.0,,,1,
2,561,561,562,207.0,168.9,208.5,130,3,10,37211230.0,...,135482500.0,73665840.0,114648200.0,118237000.0,118422000.0,110915300.0,,,17,
3,739,739,740,227.5,226.3,228.2,117,0,10,18038410.0,...,144186500.0,138902700.0,112888400.0,120684600.0,131545100.0,119686000.0,,,7,
4,191,190,191,57.2,56.5,58.0,114,10,10,76828290.0,...,110504500.0,115157200.0,133212400.0,131496400.0,124300100.0,130840400.0,,,4,191_101.4
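As written, the loop above stops at the first train feature whose m/z range contains the test m/z, even when the RT test fails, and its three RT conditions can miss the case where the test RT window fully encloses the train window. The sketch below is one possible alternative, not the method used in this notebook: it tries every train feature and uses a general interval-overlap test. The helper name `match_features` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def match_features(test, train):
    """For each test row, return the name of the first train feature whose
    m/z range contains the test m/z and whose RT window overlaps the test
    RT window. Rows without a match get NaN. (Hypothetical helper.)"""
    names = []
    for _, row in test.iterrows():
        mz_ok = (train['mzmin'] <= row['mz']) & (row['mz'] <= train['mzmax'])
        # two closed intervals overlap iff each one starts before the other ends
        rt_ok = (train['rtmin'] <= row['rtmax']) & (row['rtmin'] <= train['rtmax'])
        hits = train.loc[mz_ok & rt_ok, 'features']
        names.append(hits.iloc[0] if len(hits) else np.nan)
    return pd.Series(names, index=test.index, dtype=object)

# toy data with made-up values, only to exercise the helper
train = pd.DataFrame({
    'features': ['133_63.3', '133_200.0', '191_101.4'],
    'mzmin': [132, 133, 190], 'mzmax': [133, 133, 191],
    'rtmin': [59.0, 190.0, 50.0], 'rtmax': [68.4, 210.0, 150.0],
})
test = pd.DataFrame({
    'mz': [133, 191, 500],
    'rt': [200.5, 57.2, 10.0],
    'rtmin': [195.0, 56.5, 9.0], 'rtmax': [205.0, 58.0, 11.0],
})
test['features'] = match_features(test, train)
print(test[['mz', 'rt', 'features']])
```

In this toy case the first test row matches `133_200.0`. The original loop would leave it unmatched, because it stops at `133_63.3`, the first train feature with a matching m/z range.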


In [59]:
# the cleaning

# the matching can assign the same train feature to several test rows, so duplicates must be removed
# the removal is based on the npeaks column: the row with the most npeaks is kept
maytenus_test = maytenus_test.sort_values('npeaks', ascending=False).drop_duplicates('features').sort_index()

# dropping unnecessary columns
maytenus_test = maytenus_test.drop(['mz', 'mzmin', 'mzmax', 'rt', 
                                  'rtmin', 'rtmax', 'npeaks','isotopes', 'adduct','pcgroup','NEG_GROUP', 'POS_GROUP'], axis=1)



# the test set may have features that fit no train range - their feature names are NaN, so they are removed
# the train set may have features that do not appear in test, so they are created in test and set to zero
# first, set the feature names as the index so the two sets can be compared
maytenus_test = maytenus_test.dropna().set_index('features') # dropping na and making feature as index

# the set difference gives the train features that are missing from test
# pd.concat appends one empty row to test for each missing feature
# fillna(0) fills those new rows with zeros
unique_indexes_test = list(set(maytenus_train_ref.index) - set(maytenus_test.index))
maytenus_test = pd.concat([maytenus_test, pd.DataFrame(index=unique_indexes_test, columns=maytenus_test.columns)], sort=True).fillna(0)


maytenus_test.head()

Unnamed: 0,aq1set.17_1,aq1set.17_2,aq2dez.16_1,aq2dez.16_2,aq2jul.17_1,aq2jul.17_2,aq4abr.17_1,aq4abr.17_2,aq5set.17_1,aq5set.17_2,il1ago.17_1,il1ago.17_2,il2mar.17_1,il2mar.17_2,il3nov.16_1,il3nov.16_2,il4fev.17_1,il4fev.17_2,il5mai.17_1,il5mai.17_2
191_101.4,76828290.0,77041300.0,92912640.0,89680790.0,75609880.0,74100420.0,128114600.0,136302200.0,81381420.0,80735340.0,100690400.0,106624500.0,126749300.0,131828300.0,110504500.0,115157200.0,133212400.0,131496400.0,124300100.0,130840400.0
222_50.7,52139950.0,54717680.0,52126310.0,49296860.0,46219590.0,46394050.0,56891430.0,56823420.0,43361260.0,54228760.0,72292680.0,73364670.0,80847760.0,81323750.0,61523200.0,62886710.0,83382310.0,90574080.0,74952060.0,72926190.0
222_49.4,63075420.0,62098270.0,58489740.0,55924640.0,54801000.0,52134950.0,53491240.0,53464920.0,51108520.0,62466580.0,77155930.0,79144680.0,82821280.0,87807650.0,65980870.0,68887550.0,87298210.0,88957320.0,82195810.0,75671860.0
220_49.3,60466190.0,59161890.0,64169020.0,56484420.0,52858970.0,50791770.0,53089120.0,52726060.0,48610370.0,57217930.0,76509100.0,78195770.0,82426220.0,87322920.0,65841290.0,68591420.0,86915680.0,88848790.0,82118600.0,75427890.0
219_48.5,60198780.0,58665690.0,58755950.0,56443070.0,52412050.0,50407800.0,53089120.0,52726060.0,49058460.0,56458630.0,76368770.0,77939240.0,82426220.0,87322920.0,65638300.0,68591420.0,86915680.0,88601570.0,82118600.0,75101690.0
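The set difference, `pd.concat` and `fillna(0)` above can also be written as a single `reindex` call, which additionally puts the rows in the train order. The sketch below is an alternative under stated assumptions, not the notebook's code: `train_features` stands in for the index of `maytenus_train_ref`, the values are made up, and `reindex` requires the test index to be unique (which holds after `drop_duplicates`).

```python
import pandas as pd

# hypothetical stand-ins for the train feature names and the cleaned test table
train_features = ['118_574.6', '191_101.4', '222_50.7']
test = pd.DataFrame(
    {'s1': [7.0, 5.0], 's2': [8.0, 6.0]},
    index=pd.Index(['191_101.4', '222_50.7'], name='features'),
)

# reindex adds a zero-filled row for every train feature missing from test
# and returns the rows in the order of train_features
aligned = test.reindex(train_features, fill_value=0.0)
print(aligned)
```

Unlike the `concat` version, `reindex` would also drop any test row whose name is not in `train_features`. Here that makes no difference, since every test feature name is copied from train.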


In [60]:
maytenus_test

Unnamed: 0,aq1set.17_1,aq1set.17_2,aq2dez.16_1,aq2dez.16_2,aq2jul.17_1,aq2jul.17_2,aq4abr.17_1,aq4abr.17_2,aq5set.17_1,aq5set.17_2,il1ago.17_1,il1ago.17_2,il2mar.17_1,il2mar.17_2,il3nov.16_1,il3nov.16_2,il4fev.17_1,il4fev.17_2,il5mai.17_1,il5mai.17_2
191_101.4,7.682829e+07,7.704130e+07,9.291264e+07,8.968079e+07,7.560988e+07,7.410042e+07,1.281146e+08,1.363022e+08,8.138142e+07,8.073534e+07,1.006904e+08,1.066245e+08,1.267493e+08,1.318283e+08,1.105045e+08,1.151572e+08,1.332124e+08,1.314964e+08,1.243001e+08,1.308404e+08
222_50.7,5.213995e+07,5.471768e+07,5.212631e+07,4.929686e+07,4.621959e+07,4.639405e+07,5.689143e+07,5.682342e+07,4.336126e+07,5.422876e+07,7.229268e+07,7.336467e+07,8.084776e+07,8.132375e+07,6.152320e+07,6.288671e+07,8.338231e+07,9.057408e+07,7.495206e+07,7.292619e+07
222_49.4,6.307542e+07,6.209827e+07,5.848974e+07,5.592464e+07,5.480100e+07,5.213495e+07,5.349124e+07,5.346492e+07,5.110852e+07,6.246658e+07,7.715593e+07,7.914468e+07,8.282128e+07,8.780765e+07,6.598087e+07,6.888755e+07,8.729821e+07,8.895732e+07,8.219581e+07,7.567186e+07
220_49.3,6.046619e+07,5.916189e+07,6.416902e+07,5.648442e+07,5.285897e+07,5.079177e+07,5.308912e+07,5.272606e+07,4.861037e+07,5.721793e+07,7.650910e+07,7.819577e+07,8.242622e+07,8.732292e+07,6.584129e+07,6.859142e+07,8.691568e+07,8.884879e+07,8.211860e+07,7.542789e+07
219_48.5,6.019878e+07,5.866569e+07,5.875595e+07,5.644307e+07,5.241205e+07,5.040780e+07,5.308912e+07,5.272606e+07,4.905846e+07,5.645863e+07,7.636877e+07,7.793924e+07,8.242622e+07,8.732292e+07,6.563830e+07,6.859142e+07,8.691568e+07,8.860157e+07,8.211860e+07,7.510169e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
491_371.0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
758_286.7,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
209_50.2,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
353_243.3,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00


In [61]:
# order the test features the same way as the train features
# the model needs the features in the same sequence

maytenus_test = maytenus_test.reset_index().sort_values(by='index')


In [62]:
maytenus_test

Unnamed: 0,index,aq1set.17_1,aq1set.17_2,aq2dez.16_1,aq2dez.16_2,aq2jul.17_1,aq2jul.17_2,aq4abr.17_1,aq4abr.17_2,aq5set.17_1,...,il1ago.17_1,il1ago.17_2,il2mar.17_1,il2mar.17_2,il3nov.16_1,il3nov.16_2,il4fev.17_1,il4fev.17_2,il5mai.17_1,il5mai.17_2
55,118_574.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31,118_574.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
52,133_63.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
83,133_63.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
42,163_360.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35,741_302.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45,755_286.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,756_286.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,757_286.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
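Many rows in the table above are zero for every sample, meaning those train features found no match in the test data. A quick count of how many features carry any signal helps judge how much of the train feature space the test set actually covers. This is a minimal sketch; the helper `matched_fraction` and the toy values are hypothetical.

```python
import pandas as pd

def matched_fraction(df):
    """Fraction of feature rows with at least one non-zero intensity.
    Only numeric columns are considered, so a text 'index' column is ignored."""
    numeric = df.select_dtypes('number')
    return float((numeric != 0).any(axis=1).mean())

# toy table shaped like maytenus_test after reset_index (made-up values)
toy = pd.DataFrame({
    'index': ['118_574.2', '191_101.4', '222_50.7', '757_286.5'],
    's1': [0.0, 7.6e7, 5.2e7, 0.0],
    's2': [0.0, 7.7e7, 5.4e7, 0.0],
})
print(matched_fraction(toy))
```

A low fraction would point to the feature matching, rather than the classifier, as the first place to look.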


In [64]:
# bring in the class labels

# load
classes_test = pd.read_csv('6. Gridsearch Maytenus (adicional dados mestrado no retreino)/classes_test_master_maytenus.csv', index_col=[0])

# unite
maytenus_test = maytenus_test.set_index('index').T.join(classes_test)
display(maytenus_test)

Unnamed: 0,118_574.2,118_574.6,133_63.2,133_63.3,163_360.9,164_351.6,165_349.9,181_45.3,181_45.6,191_101.4,...,707_241.4,707_241.5,709_241.9,739_302.0,741_302.0,755_286.3,756_286.7,757_286.5,758_286.7,class
aq1set.17_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,95442920.0,76828290.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
aq1set.17_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,92034580.0,77041300.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
aq2dez.16_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,103721000.0,92912640.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
aq2dez.16_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,103789900.0,89680790.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
aq2jul.17_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,72588110.0,75609880.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
aq2jul.17_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,77044090.0,74100420.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
aq4abr.17_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100011400.0,128114600.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
aq4abr.17_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,101171700.0,136302200.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
aq5set.17_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91201400.0,81381420.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
aq5set.17_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,83432230.0,80735340.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
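`DataFrame.join` performs a left join on the index by default, so any sample whose name does not appear in `classes_test` silently receives a NaN class. A small check after the join can catch this. The sketch below uses made-up sample names and labels, and the class coding shown is hypothetical.

```python
import pandas as pd

# toy feature table indexed by sample name (made-up values)
X = pd.DataFrame(
    {'191_101.4': [1.0, 2.0, 3.0]},
    index=['aq1set.17_1', 'il1ago.17_1', 'unknown_sample'],
)
# toy class table; 'unknown_sample' is deliberately missing
classes = pd.DataFrame({'class': [0, 1]}, index=['aq1set.17_1', 'il1ago.17_1'])

joined = X.join(classes)

# samples that did not receive a label in the join
unlabeled = joined.index[joined['class'].isna()].tolist()
print(unlabeled)
```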


In [65]:
#maytenus_test.to_csv('testing_dataframe_master_data.csv')

In [66]:
X_test = maytenus_test.drop('class', axis=1)
y_test = maytenus_test['class']

In [67]:
grid.score(X_test,y_test)



0.0

In [68]:
grid.predict(X_test)



array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)
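Since every prediction is class 0, a score of exactly 0.0 is worth breaking down. `GridSearchCV.score` uses the scorer passed as `scoring` when one was given, and the estimator's own `score` otherwise, so the meaning of 0.0 depends on that setting, which is not shown in this section. The sketch below uses hypothetical balanced labels to show how all-zero predictions can give a non-zero accuracy and a zero F1 at the same time. It illustrates one possible reading and is not a diagnosis of this run.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# hypothetical labels: 10 samples of class 0 and 10 of class 1
y_true = np.array([0] * 10 + [1] * 10)
# all-zero predictions, as in the grid.predict output above
y_pred = np.zeros(20, dtype=int)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no sample is predicted as class 1
f1 = f1_score(y_true, y_pred, zero_division=0)
cm = confusion_matrix(y_true, y_pred)

print('accuracy:', acc)
print('f1 (positive class = 1):', f1)
print(cm)  # rows are true classes, columns are predicted classes
```

Running `confusion_matrix(y_test, grid.predict(X_test))` on the real data would show directly which class is being missed.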

In [69]:
grid.best_estimator_.score

<bound method Pipeline.score of Pipeline(steps=[('model', KNeighborsClassifier())])>