# Notebook for feature selection

## load the packages you need

In [None]:
import numpy as np

from phloemfinder.feature_selection_using_ml import MetabolitePhenotypeFeatureSelection 

## open your data into an object
The cleaned metabolite data and the phenotype data should each be in their own csv file. Make sure both files have a column for the sample IDs with the same name (for example 'sample_id'). The sample IDs in both files should be identical, because they are used to link the phenotypes to the metabolites.
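If you want to check the sample IDs yourself before loading anything, a minimal sketch could look like the one below. It only uses the Python standard library and is not part of phloemfinder: the helper `read_sample_ids` and the tiny in-memory "files" are made up for illustration. For real files you would pass `open("your_file.csv", newline="")` instead of the `io.StringIO` objects.

```python
import csv
import io


def read_sample_ids(csv_file, id_column="sample_id"):
    """Return the set of sample IDs found in one column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    if id_column not in (reader.fieldnames or []):
        raise ValueError(f"Column '{id_column}' not found; columns are {reader.fieldnames}")
    return {row[id_column] for row in reader}


# Tiny in-memory stand-ins for the two files (made-up values, for illustration only)
metabolome_csv = io.StringIO("sample_id,feature_1,feature_2\ns1,10,3\ns2,7,5\ns3,2,8\n")
phenotype_csv = io.StringIO("sample_id,phenotype\ns1,resistant\ns2,sensitive\ns4,resistant\n")

metabolome_ids = read_sample_ids(metabolome_csv)
phenotype_ids = read_sample_ids(phenotype_csv)

# IDs that appear in only one of the two files cannot be linked
only_in_metabolome = sorted(metabolome_ids - phenotype_ids)
only_in_phenotype = sorted(phenotype_ids - metabolome_ids)
print("Only in metabolome file:", only_in_metabolome)
print("Only in phenotype file:", only_in_phenotype)
```

Both lists should be empty for your own data. Here they are not, because the example deliberately contains one mismatch in each file.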

The object in this example is named fs_habrochaites_positive, fs for feature selection, habrochaites to know which set of genotypes you're looking at and positive for the ionization mode.

In [None]:
fs_habrochaites_positive = MetabolitePhenotypeFeatureSelection(
    metabolome_csv="./clean_phloem_positive_habrochaites_n5.csv",      # I recommend including information about ionization mode, genotypes and filtration steps in the name of the file
    phenotype_csv="./phenotypes_habrochaites.csv",
    phenotype_sample_id='sample_id')

Validate the files before continuing

In [None]:
fs_habrochaites_positive.validate_input_metabolome_df()
fs_habrochaites_positive.validate_input_phenotype_df()

## Machine Learning
### a basic Random Forest for baseline performance
Start by running a basic Random Forest. This is needed as a baseline for later on and should only take a few seconds.

You don't have to change any of these settings, unless your class of interest is not called 'resistant'.

In [None]:
fs_habrochaites_positive.get_baseline_performance(
    class_of_interest='resistant',
    train_size=0.7,
    random_state=123)

### time for the real fun 🎉
With the baseline set, you can continue building the model/pipeline for selection of interesting features.

Again, you don't have to change the class of interest and random state, as long as they are the same as the ones you used for the baseline performance.

What I do recommend you play with are max_time_mins and max_eval_time_mins. The first is the maximum time you want the entire function to take. The second, in TPOT, is the maximum time spent on evaluating a single candidate pipeline. The max_eval_time_mins should be approximately 1/5 of the max_time_mins. I usually start with a max time of 5-10 mins, just to get a feel for the data. If that looks absolutely shitty, I set it to 60. If it looks decent enough, I let it run overnight (how long depends on the time at which you start it and the time you want to use your computer again; for example 900). In case it doesn't get any better, you should consider altering the filtration steps of the metabolite data.
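The 1/5 rule of thumb can be written down as a tiny helper, so you only have to pick one number. This is just a sketch: `tpot_time_budget` is a hypothetical helper and not part of phloemfinder or TPOT, and the one-minute lower bound is my own assumption.

```python
def tpot_time_budget(max_time_mins, eval_fraction=1 / 5):
    """Return (max_time_mins, max_eval_time_mins) following the 1/5 rule of thumb."""
    if max_time_mins <= 0:
        raise ValueError("max_time_mins must be positive")
    # Assumption: never go below one minute for max_eval_time_mins
    max_eval_time_mins = max(1, round(max_time_mins * eval_fraction))
    return max_time_mins, max_eval_time_mins


# The three kinds of runs described above
for label, minutes in [("quick look", 10), ("longer run", 60), ("overnight", 900)]:
    total, per_eval = tpot_time_budget(minutes)
    print(f"{label}: max_time_mins={total}, max_eval_time_mins={per_eval}")
```

The two returned numbers can then be passed as max_time_mins and max_eval_time_mins in the search step.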

The 'Performance of ML model on train data' output ranges between 0% and 100%, while the 'Performance of ML model on test data' output ranges between 0 and 1. In all cases, higher means better.

If the model is performing much better on the train data than on the test data, it means the model is overfitted and therefore not robust. 
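Because the two outputs use different scales, it helps to put them on the same scale before comparing them. A minimal sketch, with made-up scores and an arbitrary warning threshold of 0.10 that you should adapt to your own data (`overfitting_gap` is a hypothetical helper, not part of phloemfinder):

```python
def overfitting_gap(train_score_percent, test_score_fraction):
    """Return train minus test performance, both expressed on a 0-1 scale."""
    train_fraction = train_score_percent / 100  # train output is reported in %
    return train_fraction - test_score_fraction


# Made-up example: 98% on the train data, 0.70 on the test data
gap = overfitting_gap(train_score_percent=98, test_score_fraction=0.70)
print(f"Gap between train and test: {gap:.2f}")

# Assumption: a gap above 0.10 is worth a closer look (arbitrary threshold)
if gap > 0.10:
    print("The model does much better on the train data: possibly overfitted.")
```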

The most important value from the test data is the 'recall'. A high recall means a low false negative rate: few of the samples that truly belong to the class of interest are missed.
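As a reminder of what recall measures, here is a small worked example with made-up counts. Recall is the fraction of samples that truly belong to the class of interest and that the model also labels as such. The function below is a plain-Python sketch, not the code phloemfinder uses internally.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN): the share of actual positives that were found."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("There are no actual positives, recall is undefined")
    return true_positives / actual_positives


# Made-up example: 10 truly resistant samples, 8 predicted resistant, 2 missed
recall_value = recall(true_positives=8, false_negatives=2)
false_negative_rate = 1 - recall_value

print(f"recall: {recall_value:.2f}")
print(f"false negative rate: {false_negative_rate:.2f}")
```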


While running this step, try to close most other windows on your computer so it can work as efficiently as possible in the time you gave it. Oh, and make sure it doesn't go to sleep, otherwise it won't be doing anything 😉

In [None]:
fs_habrochaites_positive.search_best_model_with_tpot_and_compute_pc_importances(
    class_of_interest='resistant',
    max_time_mins=900,
    max_eval_time_mins=180,
    random_state=123)

To have a look at the 'model' that was eventually chosen, you can run:

In [None]:
fs_habrochaites_positive.best_model

### and now for the results
To deal with the strong correlation between the features, the data is first 'flattened' into principal components (PCs), which are used as features in the model. 

The first step to get to the interesting features is to get the important PCs. The PC that is most important for the model to decide whether a sample is resistant gets the highest variable importance (var_imp). Sometimes, there are only a few PCs with a mean_var_imp>0. Other times there are many, meaning you will have to decide on a threshold for which ones are interesting enough.

In [None]:
print(fs_habrochaites_positive.pc_importances)
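Deciding on a threshold boils down to keeping the PCs whose mean_var_imp is above a value you choose, and ranking them. The sketch below shows the idea on a plain dictionary with made-up numbers. The real pc_importances output may be structured differently, so treat the names `mean_var_imp` and `select_important_pcs` as illustrative assumptions.

```python
# Hypothetical mean variable importances per PC (made-up numbers, zero-based labels)
mean_var_imp = {"PC0": 0.00, "PC1": 0.12, "PC2": 0.01, "PC3": 0.25, "PC4": 0.00}


def select_important_pcs(importances, threshold=0.0):
    """Return (pc, importance) pairs above the threshold, most important first."""
    selected = [(pc, imp) for pc, imp in importances.items() if imp > threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Every PC that contributes at all
all_contributing = select_important_pcs(mean_var_imp)
# A stricter, arbitrary threshold when there are many contributing PCs
strict_selection = select_important_pcs(mean_var_imp, threshold=0.05)

print(all_contributing)
print(strict_selection)
```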

Once you know which PCs are interesting, you can extract the features most important for the PC.

**!!You have to be careful with the PC number here!!** In the above list of PC importances, the PC numbers start at 0. In this step, however, the PC numbers are +1 to prevent some errors. So if the most important PC in the list is PC 3, you should select PC 4 here.
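To avoid mixing up the two numbering schemes, you can do the conversion in one place. A minimal sketch (`to_selected_pc` is a hypothetical helper, not part of phloemfinder):

```python
def to_selected_pc(pc_number_in_importance_list):
    """Convert a zero-based PC number from the importance list to the one-based selected_pc."""
    if pc_number_in_importance_list < 0:
        raise ValueError("PC numbers in the importance list start at 0")
    return pc_number_in_importance_list + 1


# PC 3 in the list of importances corresponds to selected_pc=4
selected_pc = to_selected_pc(3)
print(selected_pc)
```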

In [None]:
fs_habrochaites_positive.get_names_of_top_n_features_from_selected_pc(
    selected_pc=1,
    top_n=10)

After extracting the top features from all interesting PCs, you can go back to your data to have a look at which features are more abundant for which phenotype, get the information needed to recognise each feature, and learn all about it in MetaboScape.