# MicroMass feature selection

For the original source of this data, see https://doi.org/10.24432/C5T61S

This example uses the MicroMass dataset published in the UCI Machine Learning Repository. <br> The data are MALDI-TOF mass spectrometry measurements of a reference panel of 20 Gram-positive and Gram-negative bacterial species. Samples are grouped per species, and each instance of a species is a replicate. The input used here is the filtered output of 'feature_filtering.ipynb'. <br> The Gram type of each species is used as the phenotype.

## Loading the MetabolitePhenotypeFeatureSelection class and data

We start by importing the MetabolitePhenotypeFeatureSelection class. For the different import options and more info on this class, see feature_selection_manual.ipynb in the manuals folder.

In [1]:
import numpy as np

import sys
sys.path.append('../../src/phloemfinder/')

from feature_selection_using_ml import MetabolitePhenotypeFeatureSelection 



Now load the datasets into an object of this class. Here, we call it 'micromass'.

In [2]:
micromass = MetabolitePhenotypeFeatureSelection(
    metabolome_csv="./filtered_features.csv",    
    phenotype_csv="./phenotype.csv",
    phenotype_sample_id='sample_id')

In [3]:
micromass.validate_input_metabolome_df()
micromass.validate_input_phenotype_df()

Metabolome data validated.
Phenotype data validated.


## Machine Learning

First we have a look at the performance of a simple Random Forest model as a baseline:

In [4]:
micromass.get_baseline_performance(
    train_size=0.7,
    random_state=123)

Average balanced_accuracy score on training data is: 97.200 % -/+ 2.00


Average balanced_accuracy score on test data is: 99.100 %


The Random Forest baseline already performs well: the average balanced accuracy is 97.2% on the training data and 99.1% when predicting the phenotypes of the test data.
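Balanced accuracy is the mean of the per-class recalls, so a class with few samples counts as much as a class with many. The sketch below computes it in plain Python on made-up labels (not the MicroMass data). The helper name `balanced_accuracy` is illustrative and is not part of the MetabolitePhenotypeFeatureSelection class.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (illustrative helper, not the package's code)."""
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Made-up, imbalanced labels: 8 'negative' and 2 'positive' samples.
y_true = ["negative"] * 8 + ["positive"] * 2
# A model that always predicts 'negative'.
y_pred = ["negative"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"plain accuracy:    {plain_accuracy:.2f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.2f}")
```

On these toy labels the always-negative model looks good by plain accuracy but scores only chance level on balanced accuracy, which is why the balanced score is the more informative one when the two Gram types are not equally represented.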

Let's see what happens if we let automated machine learning (TPOT) search for a well-fitting pipeline for an hour.

In [5]:
micromass.search_best_model_with_tpot_and_compute_pc_importances(
    class_of_interest='positive',
    max_time_mins=60,
    max_eval_time_mins=12,
    random_state=123)

Version 0.12.0 of tpot is outdated. Version 0.12.2 was released Friday February 23, 2024.


                                                                                   
71.35 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.

TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=1.0, max_depth=9, max_features=0.65, min_samples_leaf=4, min_samples_split=4, n_estimators=1000, subsample=1.0)
Train balanced_accuracy score 100.000 %


                   value
balanced_accuracy  0.935
precision          0.814
recall             0.972
f1 score           0.886
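The F1 score is the harmonic mean of precision and recall. As a quick sanity check, the short calculation below recomputes it from the precision and recall printed above. Only the variable names are introduced here; the numbers are the reported ones.

```python
# Precision and recall as reported in the metrics table.
precision = 0.814
recall = 0.972

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"f1 score: {f1:.3f}")  # f1 score: 0.886
```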




In [6]:
micromass.best_model

In [7]:
print(micromass.pc_importances)

       mean_var_imp  std_var_imp     perm0     perm1     perm2     perm3  \
pc                                                                         
PC0        0.441780     0.020642  0.432219  0.456299  0.470584  0.427930   
PC12       0.045468     0.010738  0.065046  0.038264  0.052448  0.031543   
PC375      0.000000     0.000000  0.000000  0.000000  0.000000  0.000000   
PC377      0.000000     0.000000  0.000000  0.000000  0.000000  0.000000   
PC378      0.000000     0.000000  0.000000  0.000000  0.000000  0.000000   
...             ...          ...       ...       ...       ...       ...   
PC185      0.000000     0.000000  0.000000  0.000000  0.000000  0.000000   
PC184      0.000000     0.000000  0.000000  0.000000  0.000000  0.000000   
PC183      0.000000     0.000000  0.000000  0.000000  0.000000  0.000000   
PC182      0.000000     0.000000  0.000000  0.000000  0.000000  0.000000   
PC570      0.000000     0.000000  0.000000  0.000000  0.000000  0.000000   

          p
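The `perm0`, `perm1`, ... columns appear to hold one importance estimate per repeat, summarised in `mean_var_imp` and `std_var_imp`. Assuming these come from permutation importance (shuffle one column, measure how much the score drops), the idea can be sketched in plain Python as follows. The toy data, the `predict` rule, and the helper names are all hypothetical and are not taken from the package.

```python
import random
import statistics


def accuracy(rows, labels, predict):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(rows, labels, predict, column, n_repeats=5, seed=123):
    """Score drop after shuffling one column, repeated n_repeats times."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels, predict)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(permuted, labels, predict))
    return drops


# Toy data: column 0 determines the label, column 1 is pure noise.
data_rng = random.Random(0)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]


def predict(row):
    # Hypothetical "model" that only looks at column 0.
    return int(row[0] > 0.5)


for column in (0, 1):
    drops = permutation_importance(rows, labels, predict, column)
    print(f"column {column}: mean={statistics.mean(drops):.3f} "
          f"std={statistics.stdev(drops):.3f}")
```

Shuffling the informative column destroys the model's accuracy, while shuffling the noise column changes nothing. The same pattern would explain why only a few principal components in the table have non-zero importance.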

In [8]:
micromass.get_names_of_top_n_features_from_selected_pc(
    selected_pc=1,
    top_n=15)

Here are the metabolite names with the top 15 absolute loadings on PC1


|   | feature_name | loading |
|---|--------------|---------|
| 0 | feature_347  | 0.518921 |
| 1 | feature_1129 | 0.400993 |
| 2 | feature_326  | 0.300768 |
| 3 | feature_1080 | 0.298268 |
| 4 | feature_1181 | 0.264018 |
| 5 | feature_1261 | 0.260011 |
| 6 | feature_1163 | 0.207906 |
| 7 | feature_1002 | 0.153934 |
| 8 | feature_579  | 0.130685 |
| 9 | feature_128  | 0.129151 |
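Ranking by absolute loading means that a feature with a strongly negative loading ranks as high as one with an equally strong positive loading. A minimal sketch of that selection, using made-up feature names and loadings rather than the values above (the helper `top_n_by_abs_loading` is hypothetical, not the package's implementation):

```python
def top_n_by_abs_loading(loadings, n):
    """Return the n (feature, loading) pairs with the largest |loading|."""
    return sorted(loadings.items(), key=lambda item: abs(item[1]), reverse=True)[:n]


# Hypothetical loadings of four features on one principal component.
loadings = {
    "feature_a": 0.12,
    "feature_b": -0.52,
    "feature_c": 0.40,
    "feature_d": -0.03,
}

for name, value in top_n_by_abs_loading(loadings, n=2):
    print(name, value)
```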


In [11]:
micromass.get_names_of_top_n_features_from_selected_pc(
    selected_pc=13,
    top_n=15)

Here are the metabolite names with the top 15 absolute loadings on PC13


|   | feature_name | loading |
|---|--------------|---------|
| 0 | feature_70   | 0.33748 |
| 1 | feature_1163 | 0.315914 |
| 2 | feature_243  | 0.272666 |
| 3 | feature_418  | 0.251968 |
| 4 | feature_643  | 0.251227 |
| 5 | feature_538  | 0.220616 |
| 6 | feature_471  | 0.218184 |
| 7 | feature_399  | 0.215889 |
| 8 | feature_446  | 0.201884 |
| 9 | feature_1010 | 0.170886 |
