This notebook is both the pipeline of feature creation and the ml algorithm training. So far, no optimization of parameters were done. This is only for Mikania. Another notebook, similar to this one is for Maytenus.

# Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn import svm
from sklearn import metrics

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Functions

In [2]:
# creates the feature name with the mz and rt

def feature_name_creation(xcms_file_path):
    table = pd.read_csv(xcms_file_path, index_col=[0]) 
    
    # no need for decimal on m/z (low resolution) and only one decimal for rt
    table.mz = table.mz.round(0).astype(int)
    table.rt = table.rt.round(1)

    # creating the feature name: mz_rt
    features = table["mz"].astype(str) + "_" + table["rt"].astype(str)
    table.insert(0, 'features', features) # first column
    
    # drop as we don't know how many columns the table will have. Drop the known ones. 
    # There should only be the 'features' column and the samples
    
    table_clean = table.drop(['isotopes', 'adduct','pcgroup'], axis=1) #'npeaks','NEG_GROUP', 'POS_GROUP',
    
    return table_clean

In [3]:
# rounds the mz and rt columns along with its min and max

def rounder(dataframe):
    table = dataframe 
    
    table.mz = table.mz.round(0).astype(int)
    table.mzmin = table.mzmin.round(0).astype(int)
    table.mzmax = table.mzmax.round(0).astype(int)
    
    table.rt = table.rt.round(1)
    table.rtmin = table.rtmin.round(1)
    table.rtmax = table.rtmax.round(1)

    
    return table

# Processing Pipeline

`Train` and `Val` sets were processed separately on `xcms` - excludes the possibility of data leakage 
But, when processing is separated, the features can be slightly different. The compounds are almost the same, but due to processing steps, there can be shifts on the decimals of `mz` or `rt`. 
For this reason, creating the feature name concatenating `mz_rt` on train and val might not produce the same features, and machine learning training is not possible with that. 

Errors observed in this case are related to the fact that features observed in train were not present in validation and vice versa or the order of the features were different in both datasets. This 'pipeline' fixes this issue.

**Steps:**

1. Creates the name for the features on `Train` set - this is the set used as reference. Whatever features where observed here, should appear on `Val`. The name is created concatenating `mz` and `rt` columns (`mz_rt`)
2. Creates a correspondance between the feature on `Train` and `Val` set, giving val set the same column names as the train, when the feature is present 
    1. round `mz` and `rt` from `Val` and `Train` 
    2. for each `mz` in `Val`, search for a range on `mzmin` and `mzmax` on train that fits. The `mz_val` need to be between `mzmin_train` and `mzmax_train` 
    3. If a match is found, for each `rt`,`rtmin` and `rtmax` on `Val` search for a range on `rtmin` and `rtmax` on `Train` that fits. The `rt` values need to be between `rtmin_train` and `rtmax_train`. The `rtmin` and `rtmax` from `Val` are used in this case because ocasionally, the range on `Val` or train is too big (big difference in `rt` between samples)
    4. if a match, take the feature name from `Train` and apply on the match
    
**With the features names created:**

3. Features on `Train` and `Val` are ordered 
4. Duplicates are deleted based on the `npeaks` columnn
5. Features that were observed in `Val` but no correspondence was found in `Train` have names filled with `nan`. These are deleted.
4. Features that are on `Train`and were not found in `Val` are added to `Val` and filled with zero (no presence of that feature)
 
 
**To fix: **
 The code for the feature correspondence is not optimized. 
 - After the match with `mz`, the loop searches on the whole dataset for a match in `rt`. This takes more computation, unecessary. 
 - If there is a match of two features, the last one is kept. Could keep both, filter later? 
 


## Feature reference creation - train set

In [4]:
# train is loaded using the function to create the feature names - feature names are created using mz and rt.  
mikania_train = feature_name_creation('mikania_train_processing.csv').reset_index(drop=True)

In [5]:
mikania_train.head()

Unnamed: 0,features,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,...,ML9_3_mik,ML96_1_mik,ML96_2_mik,ML96_3_mik,ML97_1_mik,ML97_2_mik,ML97_3_mik,ML99_1_mik,ML99_2_mik,ML99_3_mik
0,101_545.4,101,101.095945,102.094158,545.4,333.873,584.271,3929,384,384,...,11928700.0,5967574.0,7290893.0,6380743.0,6448592.0,4925116.0,4012670.0,7062352.0,6987077.0,10422650.0
1,103_548.9,103,102.096017,103.095943,548.9,418.357,598.673,1060,284,295,...,3986794.0,2984790.0,2163372.0,4160999.0,3697033.0,3012724.0,4776169.0,1899082.0,2809514.0,2203643.0
2,103_18.8,103,102.596195,103.59513,18.8,0.935,114.64,1146,247,212,...,1633429.0,2265956.0,3159375.0,1898522.0,2676223.0,2158196.0,2797141.0,1968564.0,3734300.0,2033939.0
3,103_597.3,103,103.096132,104.095664,597.3,465.105,599.045,3183,382,383,...,4032735.0,4358152.0,2962596.0,5099119.0,5192264.0,4535534.0,4419680.0,3558658.0,4836014.0,4604429.0
4,104_43.1,104,103.596029,104.595825,43.1,0.923,183.37,2241,345,342,...,3646370.0,6710968.0,4905165.0,4000091.0,2357814.0,1931856.0,6458164.0,4571863.0,4643320.0,2402607.0


## Loading validation val set

In [6]:
# val will be loaded using regular read_csv - the names of the features will come based on comparison
mikania_val = pd.read_csv('mikania_validation_processing.csv',index_col=[0]).reset_index(drop=True).drop(['isotopes', 'adduct','pcgroup'], axis=1) #'npeaks','NEG_GROUP', 'POS_GROUP',

## Rounding mz and rt

In [7]:
# rouding all mz and all rt
mikania_val = rounder(mikania_val)
mikania_train = rounder(mikania_train)

In [8]:
display(mikania_val.iloc[:,0:7].head())
display(mikania_train.iloc[:,0:7].head())

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks
0,101,101,102,546.1,333.4,583.6,972
1,103,102,103,558.9,442.8,598.6,259
2,103,103,104,27.1,1.7,135.8,334
3,103,103,104,597.3,477.6,598.7,784
4,104,103,104,391.8,344.2,467.7,76


Unnamed: 0,features,mz,mzmin,mzmax,rt,rtmin,rtmax
0,101_545.4,101,101,102,545.4,333.9,584.3
1,103_548.9,103,102,103,548.9,418.4,598.7
2,103_18.8,103,103,104,18.8,0.9,114.6
3,103_597.3,103,103,104,597.3,465.1,599.0
4,104_43.1,104,104,105,43.1,0.9,183.4


## Feature creation and correspondance on val set

In [9]:
# creating the column
mikania_val['features'] = np.nan

In [10]:
# loop over mikania_val items. 
# Each mz will be tested against all mzmin and mzmax range from train. 
# if in range, test for rt.
# if in range, use the same feature name from train

mikania_val = mikania_val.sort_values('npeaks', ascending=False,ignore_index=True)
mikania_train = mikania_train.sort_values('npeaks', ascending=False,ignore_index=True)

for i in range(len(mikania_val)):
    for j in range(len(mikania_train)):


        if ((mikania_val.loc[i,'mz'] <= mikania_train.loc[j,'mzmax']) 
              & (mikania_val.loc[i,'mz'] >= mikania_train.loc[j,'mzmin'])):
            
            #maybe subset mikania train and then perform things on the subset? 
            
            if (
                ((mikania_val.loc[i,'rt'] <= mikania_train.loc[j,'rtmax']) 
                  & (mikania_val.loc[i,'rt'] >= mikania_train.loc[j,'rtmin'])) or
            
               ((mikania_val.loc[i,'rtmin'] <= mikania_train.loc[j,'rtmax']) 
                  & (mikania_val.loc[i,'rtmin'] >= mikania_train.loc[j,'rtmin'])) or
                
               ((mikania_val.loc[i,'rtmax'] <= mikania_train.loc[j,'rtmax']) 
                & (mikania_val.loc[i,'rtmax'] >= mikania_train.loc[j,'rtmin']))
            ):
                
                mikania_val.loc[i,'features'] = mikania_train.loc[j,'features']
            break

In [11]:
mikania_val

Unnamed: 0,mz,mzmin,mzmax,rt,rtmin,rtmax,npeaks,NEG_GROUP,POS_GROUP,AQ16_1_mik,...,ML89_1_mik,ML89_2_mik,ML89_3_mik,ML90_1_mik,ML90_2_mik,ML90_3_mik,ML93_1_mik,ML93_2_mik,ML93_3_mik,features
0,117,117,117,546.2,400.0,595.5,3595,96,96,5.450895e+07,...,4.622460e+07,5.342590e+07,5.688322e+07,5.655376e+07,7.296227e+07,5.133613e+07,4.620873e+07,5.816691e+07,3.971505e+07,117_545.9
1,217,217,218,542.7,433.0,598.3,3014,96,96,1.017159e+07,...,1.208553e+07,9.918006e+06,1.376041e+07,1.431348e+07,1.370172e+07,9.249183e+06,1.141816e+07,1.231201e+07,2.206551e+07,217_543.1
2,126,126,127,560.1,305.9,598.6,2559,96,96,7.899146e+06,...,1.236531e+07,6.111691e+06,7.214423e+06,7.631575e+06,9.658903e+06,1.383510e+07,1.576543e+07,7.749534e+06,1.206328e+07,126_559.3
3,266,266,267,538.3,294.6,599.1,2333,96,96,8.846121e+06,...,2.067394e+07,9.499209e+06,1.139168e+07,7.615448e+06,1.391855e+07,1.919913e+07,5.303957e+06,6.585297e+06,6.691118e+06,266_538.1
4,516,516,517,327.9,204.6,515.8,2284,96,96,2.051717e+07,...,1.962712e+08,8.694954e+07,2.017157e+08,1.490517e+08,1.498002e+08,1.527331e+08,2.459482e+08,1.212823e+08,2.471026e+08,516_327.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2853,324,324,325,234.0,142.9,330.5,43,29,9,1.046270e+07,...,1.086635e+07,1.942110e+07,1.926071e+07,8.132322e+06,1.653811e+07,1.165188e+07,4.414227e+06,5.760752e+06,5.521525e+06,
2854,486,485,486,43.6,3.3,87.6,42,35,1,5.463967e+06,...,4.549755e+06,5.580329e+06,3.714947e+06,3.702073e+06,5.696389e+06,5.024331e+06,4.418431e+06,5.504006e+06,5.274128e+06,
2855,325,325,326,44.6,2.0,66.7,42,30,4,1.068384e+07,...,7.381050e+06,6.046359e+06,6.235743e+06,5.906932e+06,6.222798e+06,6.884359e+06,6.891080e+06,8.600757e+06,7.726564e+06,
2856,491,490,491,44.5,34.9,97.6,38,31,1,4.297456e+06,...,5.541090e+06,5.954080e+06,5.395155e+06,5.866534e+06,6.339910e+06,5.558511e+06,5.158961e+06,5.935223e+06,4.987924e+06,


In [None]:
# the process can create duplicates, so removing them is necessary
# the removal is based on the npeaks column. The feature with more npeaks, is kept.
mikania_val = mikania_val.sort_values('npeaks', ascending=False).drop_duplicates('features').sort_index()

# dropping unnecessary columns
mikania_val = mikania_val.drop(['mz', 'mzmin', 'mzmax', 'rt', 
                                  'rtmin', 'rtmax', 'npeaks','NEG_GROUP', 'POS_GROUP'], axis=1)

# removing the duplicates that might arise with the train is also necessary
# drop possible duplicates for train as well
mikania_train = mikania_train.sort_values('npeaks', ascending=False).drop_duplicates('features').sort_index()

# dropping unnecessary  columns
mikania_train = mikania_train.drop(['mz', 'mzmin', 'mzmax', 'rt', 
                                      'rtmin', 'rtmax', 'npeaks','NEG_GROUP', 'POS_GROUP'], axis=1)

# val set might have some feature that don't fit in any range - their feature names will be nan, so need to remove
# train might have some features that wont appear in the val. So, create them in val and set them to zero. 
# first, set index on both to be the features, so its possible to do that.
mikania_train= mikania_train.set_index('features')
mikania_val = mikania_val.dropna().set_index('features') # dropping na and making feature as index

# set method to get the set of index values that are unique 
# subtracting the sets to get the different indexes. 
# concat method to concatenate train and val
# filling the missing values on the concatenation with 0 using the fillna method.

unique_indexes = list(set(mikania_train.index) - set(mikania_val.index))
mikania_val = pd.concat([mikania_val, pd.DataFrame(index=unique_indexes, columns=mikania_val.columns)], sort=True).fillna(0)

# order both val and train features equally
# sort the features - the model needs them at the same sequence
mikania_train = mikania_train.reset_index().sort_values(by='features')
mikania_val = mikania_val.reset_index().sort_values(by='index')

## Bring the class data column

In [13]:
# load
classes_train = pd.read_csv('classes_train_mikania.csv', index_col=[0])
classes_val = pd.read_csv('classes_val_mikania.csv', index_col=[0])

# unite
mikania_train = mikania_train.set_index('features').T.join(classes_train)
display(mikania_train)

mikania_val = mikania_val.set_index('index').T.join(classes_val)
display(mikania_val)

Unnamed: 0,1000_336.4,101_545.4,103_18.8,103_548.9,103_597.3,104_252.6,104_43.1,105_44.3,105_596.0,106_588.5,...,995_40.0,995_574.1,996_334.1,996_40.0,996_574.0,997_335.1,997_39.9,997_574.0,998_336.3,class
AQ1_1_mik,1.391067e+06,4.786491e+06,2.620826e+06,5.727719e+06,6.901048e+06,1.311020e+06,4.081527e+06,2.023342e+06,1.930281e+06,2.410106e+06,...,8.399767e+06,2.136580e+06,6.426461e+06,7.296734e+06,2.087067e+06,5.183501e+06,6.276027e+06,3.065326e+06,3.215882e+06,0
AQ1_2_mik,1.452426e+06,6.154697e+06,2.019693e+06,2.151857e+06,5.365472e+06,1.575825e+06,3.040979e+06,1.960575e+06,2.325000e+06,3.745704e+06,...,6.283727e+06,2.978963e+06,2.971119e+06,7.317387e+06,1.547173e+06,2.953098e+06,8.065771e+06,2.832389e+06,4.860732e+06,0
AQ1_3_mik,1.355224e+06,8.682000e+06,2.581758e+06,2.658441e+06,8.765162e+06,2.253836e+06,1.802877e+06,3.195227e+06,2.405212e+06,3.309877e+06,...,4.741727e+06,2.208881e+06,5.141266e+06,4.932867e+06,2.748248e+06,5.822201e+06,4.300110e+06,2.675318e+06,3.207825e+06,0
AQ10_1_mik,4.521485e+05,5.533850e+06,2.803274e+06,2.325828e+06,4.614978e+06,1.581547e+06,3.306859e+06,2.149324e+06,3.503102e+06,3.449786e+06,...,1.124204e+07,1.751027e+06,5.827884e+06,9.672026e+06,3.114641e+06,5.741569e+06,6.270942e+06,3.038588e+06,5.781737e+06,0
AQ10_2_mik,6.038770e+05,3.995909e+06,2.086767e+06,2.885761e+06,7.754229e+06,1.199120e+06,3.742547e+06,2.640438e+06,2.355849e+06,3.825112e+06,...,5.341431e+06,4.586396e+06,8.231475e+06,5.749470e+06,3.051598e+06,4.911140e+06,5.903466e+06,5.038108e+06,3.881324e+06,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ML97_2_mik,1.538845e+06,4.925116e+06,2.158196e+06,3.012724e+06,4.535534e+06,7.153521e+05,1.931856e+06,2.532060e+06,2.493785e+06,2.783363e+06,...,4.781279e+06,3.108206e+06,6.287796e+06,5.756679e+06,4.787021e+06,7.172336e+06,6.811176e+06,3.622604e+06,6.711402e+06,1
ML97_3_mik,1.597062e+06,4.012670e+06,2.797141e+06,4.776169e+06,4.419680e+06,6.605320e+05,6.458164e+06,5.949876e+06,2.311459e+06,1.992181e+06,...,6.307898e+06,2.349106e+06,8.855588e+06,6.549448e+06,4.222359e+06,7.612594e+06,7.054713e+06,3.769409e+06,7.434780e+06,1
ML99_1_mik,1.988427e+06,7.062352e+06,1.968564e+06,1.899082e+06,3.558658e+06,4.879656e+05,4.571863e+06,3.649843e+06,2.826066e+06,3.025920e+06,...,5.151911e+06,4.195371e+06,4.301395e+06,6.779330e+06,3.938895e+06,7.064952e+06,5.516723e+06,4.203446e+06,8.090866e+06,1
ML99_2_mik,1.200366e+06,6.987077e+06,3.734300e+06,2.809514e+06,4.836014e+06,1.142034e+06,4.643320e+06,5.782194e+06,1.261592e+06,2.332127e+06,...,6.159122e+06,3.081671e+06,6.692044e+06,5.914648e+06,5.262627e+06,6.696560e+06,5.700842e+06,3.081827e+06,7.964680e+06,1


Unnamed: 0,1000_336.4,101_545.4,103_18.8,103_548.9,103_597.3,104_252.6,104_43.1,105_44.3,105_596.0,106_588.5,...,995_40.0,995_574.1,996_334.1,996_40.0,996_574.0,997_335.1,997_39.9,997_574.0,998_336.3,class
AQ16_1_mik,1.815123e+06,5.751224e+06,0.0,0.0,6.011511e+06,0.0,4.899899e+06,5.170678e+06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,5.839358e+06,0.0,0.0,3.049227e+06,0
AQ16_2_mik,1.962743e+06,7.083288e+06,0.0,0.0,8.233172e+06,0.0,5.640071e+06,4.939433e+06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3.944314e+06,0.0,0.0,3.113128e+06,0
AQ16_3_mik,3.059900e+06,4.462236e+06,0.0,0.0,5.406164e+06,0.0,4.321619e+06,2.340848e+06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3.635474e+06,0.0,0.0,3.548778e+06,0
AQ25_1_mik,2.649625e+06,5.292930e+06,0.0,0.0,6.219683e+06,0.0,4.525921e+06,4.112054e+06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.814060e+06,0.0,0.0,4.863169e+06,0
AQ25_2_mik,1.817107e+06,6.784137e+06,0.0,0.0,4.609765e+06,0.0,3.446368e+06,4.656634e+06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.560545e+06,0.0,0.0,3.345981e+06,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ML90_2_mik,3.122842e+06,7.530000e+06,0.0,0.0,6.787622e+06,0.0,3.991053e+06,1.752785e+06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,8.342923e+06,0.0,0.0,5.103561e+06,1
ML90_3_mik,3.556514e+06,7.240044e+06,0.0,0.0,6.451167e+06,0.0,4.377725e+06,4.763259e+06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,9.758063e+06,0.0,0.0,4.574595e+06,1
ML93_1_mik,5.739181e+06,3.818699e+06,0.0,0.0,4.476222e+06,0.0,6.343133e+06,5.751892e+06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,5.794955e+06,0.0,0.0,7.190445e+06,1
ML93_2_mik,4.799381e+06,6.086228e+06,0.0,0.0,6.186590e+06,0.0,4.665043e+06,2.823446e+06,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,5.756683e+06,0.0,0.0,7.749089e+06,1


Data is now ready for ANY machine learning process

# Machine learning

## X y split

In [14]:
X_train = mikania_train.drop("class", axis=1)
y_train = mikania_train["class"]

X_test = mikania_val.drop("class", axis=1)
y_test = mikania_val["class"]

## Training

### SVM

In [15]:
#Create a svm Classifier
svm_clf = svm.SVC(kernel='rbf') # RBF Kernel

#Train the model using the training sets
svm_clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = svm_clf.predict(X_test)

print("SVM")
print("F1 score:",metrics.f1_score(y_test, y_pred))
print("MCC score:",metrics.matthews_corrcoef(y_test, y_pred))

# too easy? 

SVM
F1 score: 0.43902439024390244
MCC score: 0.40451991747794525


### Random Forest

In [16]:
rf_clf = RandomForestClassifier(max_depth=2, random_state=2187)
#Train the model using the training sets
rf_clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = rf_clf.predict(X_test)

print("Random Forest")
print("F1 score:",metrics.f1_score(y_test, y_pred))
print("MCC score:",metrics.matthews_corrcoef(y_test, y_pred))

# too easy? 

Random Forest
F1 score: 0.0
MCC score: 0.0


### KNN

In [17]:
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = knn_clf.predict(X_test)

print("KNN")
print("F1 score:",metrics.f1_score(y_test, y_pred))
print("MCC score:",metrics.matthews_corrcoef(y_test, y_pred))

KNN
F1 score: 0.9025641025641026
MCC score: 0.802475262666927


In [None]:
*pycaret 
* knn 