Home
MetAML is a computational tool for metagenomics-based prediction tasks and for quantitative assessment of the strength of potential microbiome-phenotype associations.
The tool (i) is based on machine learning classifiers, (ii) includes automatic model and feature selection steps, (iii) comprises cross-validation and cross-study analysis, and (iv) uses as features quantitative microbiome profiles including species-level relative abundances and presence of strain-specific markers.
It also provides species-level taxonomic profiles, marker presence data, and metadata for 3000+ publicly available metagenomes.
MetAML is written in Python (tested on version 2.7) and requires some additional packages (matplotlib, numpy, pandas, scikit-learn, scipy), all included in the Anaconda platform.
- MetAML can be downloaded using wget
wget https://bitbucket.org/cibiocm/metaml/get/default.zip
unzip default.zip
mv CibioCM-metaml-*/ metaml/
- or using the Mercurial hg command
hg clone https://bitbucket.org/CibioCM/metaml
The main "metaml" folder is organized as follows:
- "data" folder: data available for 3000+ metagenomes, namely i) species-level relative abundances ("abundance.txt.bz2"), ii) presence of strain-specific markers ("marker_presence.txt.bz2"), iii) abundance of strain-specific markers ("marker_abundance.txt.bz2"), and iv) the lookup table that maps each marker identifier to the corresponding species ("markers2clades_DB.txt.bz2"). These files must be uncompressed before use:
cd metaml
bunzip2 data/abundance.txt.bz2
bunzip2 data/marker_presence.txt.bz2
bunzip2 data/marker_abundance.txt.bz2
bunzip2 data/markers2clades_DB.txt.bz2
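The same decompression step can be sketched in Python with the standard-library bz2 module. This is only an illustration and not part of MetAML: the helper name decompress_bz2 is hypothetical, the sketch assumes Python 3 (MetAML itself is tested on 2.7), and, unlike bunzip2, it leaves the compressed file in place.

```python
import bz2
import os
import shutil


def decompress_bz2(path):
    """Write a decompressed copy of `path` next to it and return the new path."""
    if not path.endswith(".bz2"):
        raise ValueError("expected a .bz2 file: %s" % path)
    out_path = path[:-len(".bz2")]
    # Stream the data so that large matrices are never fully held in memory.
    with bz2.open(path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


if __name__ == "__main__":
    # File names taken from the "data" folder description above.
    for name in ("abundance", "marker_presence",
                 "marker_abundance", "markers2clades_DB"):
        path = os.path.join("data", name + ".txt.bz2")
        if os.path.exists(path):
            print(decompress_bz2(path))
```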
- dataset_selection.py: script to extract only the samples/features of interest from the full data (e.g., from "abundance.txt");
- classification.py: script to run the classification task on the selected data;
- "tools" folder: additional scripts to generate the figures present in the published paper;
- "scripts" folder: commands to replicate the results reported in the published paper.
python dataset_selection.py -h
usage: dataset_selection.py [-h] [-z FEATURE_IDENTIFIER] [-s SELECT]
                            [-r REMOVE] [-i INCLUDE] [-e EXCLUDE] [-t]
                            [INPUT_FILE] [OUTPUT_FILE]

positional arguments:
  INPUT_FILE            The input dataset file [stdin if not present]
  OUTPUT_FILE           The output dataset file

optional arguments:
  -h, --help            show this help message and exit
  -z FEATURE_IDENTIFIER, --feature_identifier FEATURE_IDENTIFIER
                        The feature identifier
  -s SELECT, --select SELECT
                        The samples to select
  -r REMOVE, --remove REMOVE
                        The samples to remove
  -i INCLUDE, --include INCLUDE
                        The fields to include
  -e EXCLUDE, --exclude EXCLUDE
                        The fields to exclude
  -t, --tout            Transpose output dataset file

- The following command selects the species-level relative abundances of the 440 samples belonging to the T2D and WT2D datasets considered in the published paper:
python dataset_selection.py data/abundance.txt data/abundance_t2d-WT2D.txt -z "k__" -s dataset_name:t2dmeta_long:t2dmeta_short:WT2D -r gender:"-":" -",disease:impaired_glucose_tolerance -i feature_level:s__,dataset_name:disease -e feature_level:t__
- Input file: The INPUT_FILE is "data/abundance.txt", the matrix of species-level relative abundances (metadata/features on the rows, with the first column holding the metadata/feature identifier; samples on the columns);
- Output file: The OUTPUT_FILE is a subset of this matrix and is saved as "data/abundance_t2d-WT2D.txt";
- Feature identifier: All the rows whose identifier (i.e., the first column) contains "k__" are treated as features; all the other rows are treated as metadata;
- Selection of samples: The pair of options -s (SELECT) and -r (REMOVE) defines which samples to select or remove. In this example, we SELECT all the samples whose metadata field "dataset_name" has the value "t2dmeta_long" OR "t2dmeta_short" OR "WT2D". At the same time, we REMOVE all the samples whose metadata field "gender" has the value "-" OR " -" (in this scenario, this excludes the samples without metadata information) AND all the samples whose metadata field "disease" has the value "impaired_glucose_tolerance";
- Selection of metadata/features: The pair of options -i (INCLUDE) and -e (EXCLUDE) defines which metadata/features to include or exclude. In this example, we INCLUDE all the features from the species level (included, denoted as "s__") down to the sub-species level (excluded, denoted as "t__"), which amounts to selecting features at the species level. Moreover, we keep only the metadata fields "dataset_name" AND "disease".
- We can extract the same set of samples but in terms of presence of strain-specific markers by slightly modifying the command in the following way:
python dataset_selection.py data/marker_presence.txt data/marker_presence_t2d-WT2D.txt -z "GeneID":"gi|" -s dataset_name:t2dmeta_long:t2dmeta_short:WT2D -r gender:"-":" -",disease:impaired_glucose_tolerance -i dataset_name:disease
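The sample-selection semantics of -s and -r used in the two commands above can be illustrated with a small Python 3 sketch. It is not the code of dataset_selection.py: the function names, the dictionary representation of a sample, and the toy samples (including the "sampleID" field and the "n" disease value) are hypothetical. The sketch also assumes that comma-separated conditions are combined with OR, which the text above states for -r but not for -s.

```python
def parse_conditions(spec):
    """Parse 'field:v1:v2,other:v3' into [(field, {v1, v2}), (other, {v3})]."""
    conditions = []
    for part in spec.split(","):
        field, *values = part.split(":")
        conditions.append((field, set(values)))
    return conditions


def matches_any(sample, conditions):
    """True if the sample satisfies at least one (field, values) condition."""
    return any(sample.get(field) in values for field, values in conditions)


def filter_samples(samples, select=None, remove=None):
    """Keep samples matching `select` (if given) and not matching `remove`."""
    selected = parse_conditions(select) if select else None
    removed = parse_conditions(remove) if remove else None
    kept = []
    for sample in samples:
        if selected is not None and not matches_any(sample, selected):
            continue
        if removed is not None and matches_any(sample, removed):
            continue
        kept.append(sample)
    return kept


# Toy metadata, invented for illustration only.
samples = [
    {"sampleID": "A", "dataset_name": "t2dmeta_long", "gender": "female", "disease": "t2d"},
    {"sampleID": "B", "dataset_name": "WT2D", "gender": "female", "disease": "impaired_glucose_tolerance"},
    {"sampleID": "C", "dataset_name": "t2dmeta_short", "gender": "-", "disease": "n"},
    {"sampleID": "D", "dataset_name": "other_study", "gender": "male", "disease": "n"},
    {"sampleID": "E", "dataset_name": "WT2D", "gender": "female", "disease": "n"},
]

kept = filter_samples(
    samples,
    select="dataset_name:t2dmeta_long:t2dmeta_short:WT2D",
    remove="gender:-: -,disease:impaired_glucose_tolerance",
)
print([s["sampleID"] for s in kept])  # prints ['A', 'E']
```

Note that the shell removes the double quotes in gender:"-":" -" before the script sees the argument, which is why the sketch receives the plain string "gender:-: -".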
python classification.py -h
usage: classification.py [-h] [-z FEATURE_IDENTIFIER] [-d DEFINE] [-t TARGET]
                         [-u UNIQUE] [-b] [-r RUNS_N] [-p RUNS_CV_FOLDS] [-w]
                         [-l {rf,svm,lasso,enet}] [-i {lasso,enet}]
                         [-f CV_FOLDS] [-g CV_GRID] [-s CV_SCORING]
                         [-j FS_GRID] [-e FIGURE_EXTENSION]
                         [INPUT_FILE] [OUTPUT_FILE]

positional arguments:
  INPUT_FILE            The input dataset file [stdin if not present]
  OUTPUT_FILE           The output file [stdout if not present]

optional arguments:
  -h, --help            show this help message and exit
  -z FEATURE_IDENTIFIER, --feature_identifier FEATURE_IDENTIFIER
                        The feature identifier
  -d DEFINE, --define DEFINE
                        Define the classification problem
  -t TARGET, --target TARGET
                        Define the target domain
  -u UNIQUE, --unique UNIQUE
                        The unique samples to select
  -b, --label_shuffling
                        Label shuffling
  -r RUNS_N, --runs_n RUNS_N
                        The number of runs
  -p RUNS_CV_FOLDS, --runs_cv_folds RUNS_CV_FOLDS
                        The number of cross-validation folds per run
  -w, --set_seed        Setting seed
  -l {rf,svm,lasso,enet}, --learner_type {rf,svm,lasso,enet}
                        The type of learner/classifier
  -i {lasso,enet}, --feature_selection {lasso,enet}
                        The type of feature selection
  -f CV_FOLDS, --cv_folds CV_FOLDS
                        The number of cross-validation folds for model
                        selection
  -g CV_GRID, --cv_grid CV_GRID
                        The parameter grid for model selection
  -s CV_SCORING, --cv_scoring CV_SCORING
                        The scoring function for model selection
  -j FS_GRID, --fs_grid FS_GRID
                        The parameter grid for feature selection
  -e FIGURE_EXTENSION, --figure_extension FIGURE_EXTENSION
                        The extension of output figure

- The following command runs a cross-validation analysis to discriminate between healthy subjects and subjects affected by T2D (these results are denoted as T2D+WT2D* in Figure 7(a) of the published paper):
mkdir results
python classification.py data/abundance_t2d-WT2D.txt results/abundance_t2d-WT2D_rf -d 1:disease:t2d -g [] -w
- Input file: We consider as INPUT_FILE the data matrix "data/abundance_t2d-WT2D.txt" generated in the above paragraph;
- Output file: The results are saved in multiple OUTPUT_FILES that share the prefix "results/abundance_t2d-WT2D_rf". In particular, the main results with prediction accuracies and, where available, feature importance are saved in ".txt", the estimation values in "_estimations.txt", the ROC curve values in "_roccurve.txt", and the PCA plot in "_pca.png" (the figure extension can be changed through -e);
- Definition of the classification problem: We DEFINE (-d) the classification problem by assigning to class "1" all the samples whose metadata field "disease" has the value "t2d". The remaining samples are automatically assigned to class "0" (note that in general we can use the syntax "-d 1:field_i:V1:V2,2:field_j:V3" to assign i) to class "1" all the samples whose metadata field "field_i" has the value "V1" or "V2", ii) to class "2" all the samples whose metadata field "field_j" has the value "V3", and iii) the remaining samples to class "0");
- Definition of the learning setting: Prediction accuracies are estimated through cross-validation (NUMBER OF FOLDS PER RUN defined with -p; default = 10) and averaged over independent runs (NUMBER OF RUNS defined with -r; default = 20);
- Passing -w (SEED SETTING) guarantees that different executions of the script (even when some parameters change, such as the type of learner) use the same training and validation sets in the cross-validation procedure. This is crucial, for example, when running a statistical test between different classifiers or when comparing results on true and shuffled labels;
- Definition of the learner: Different types of classifiers are implemented (LEARNER TYPE defined with -l; default = rf, i.e. Random Forests);
- Definition of the feature selection strategy: Different feature selection strategies are implemented (FEATURE SELECTION defined with -i; default = none);
- Definition of model selection and feature selection parameters: Default parameters can be changed by acting on the NUMBER OF CROSS-VALIDATION FOLDS FOR MODEL SELECTION (-f), PARAMETER GRID FOR MODEL SELECTION (-g), SCORING FUNCTION FOR MODEL SELECTION (-s), and PARAMETER GRID FOR FEATURE SELECTION (-j). Passing "-g []" when using random forest as classifier ("-l rf") disables the re-training of the model on the most discriminative features (this saves time, although only the classification results obtained on the entire set of features are then reported).
- Results using Lasso as feature selection and Support Vector Machine (SVM) as classifier can be obtained by acting on the parameters "-i" and "-l":
python classification.py data/abundance_t2d-WT2D.txt results/abundance_t2d-WT2D_lasso_svm -d 1:disease:t2d -i lasso -l svm -w
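The class assignment behind the -d option used in the commands above ("-d 1:disease:t2d", or more generally "-d 1:field_i:V1:V2,2:field_j:V3") can be sketched in Python 3 as follows. This is an illustration of the documented syntax only, not the code of classification.py: the function names and the toy samples are hypothetical, and the rule that the first matching class wins when a sample matches several rules is an assumption of this sketch.

```python
def parse_define(spec):
    """Parse '1:field_i:V1:V2,2:field_j:V3' into [(1, 'field_i', {...}), ...]."""
    rules = []
    for part in spec.split(","):
        label, field, *values = part.split(":")
        rules.append((int(label), field, set(values)))
    return rules


def assign_labels(samples, spec):
    """Return one class label per sample; unmatched samples get class 0."""
    rules = parse_define(spec)
    labels = []
    for sample in samples:
        label = 0
        for cls, field, values in rules:
            if sample.get(field) in values:
                label = cls  # assumption: the first matching rule wins
                break
        labels.append(label)
    return labels


# Toy metadata, invented for illustration only.
toy_samples = [
    {"disease": "t2d"},
    {"disease": "n"},
    {"disease": "t2d"},
]
print(assign_labels(toy_samples, "1:disease:t2d"))  # prints [1, 0, 1]
```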
- Results with shuffled labels can be obtained by just adding the option "-b":
python classification.py data/abundance_t2d-WT2D.txt results/abundance_t2d-WT2D_rf-shuffled -d 1:disease:t2d -g [] -w -b
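Why -w matters when comparing true and shuffled labels can be shown with a minimal Python 3 sketch. It is not how classification.py builds its folds (that is not described here): the helper names are hypothetical and the folds are plain, unstratified partitions made with the standard-library random module. The point is only that a fixed seed yields the same train/validation partition in every execution, so that two runs differ only in the labels.

```python
import random


def make_folds(n_samples, n_folds, seed):
    """Partition sample indices into n_folds folds, reproducibly for a seed."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into the folds.
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]


def shuffle_labels(labels, seed):
    """Return a permuted copy of the labels (the class counts are preserved)."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]

# Two executions with the same seed evaluate on identical folds ...
folds_true = make_folds(len(labels), n_folds=5, seed=0)
folds_shuffled = make_folds(len(labels), n_folds=5, seed=0)
print(folds_true == folds_shuffled)  # prints True

# ... while the second execution sees permuted labels.
null_labels = shuffle_labels(labels, seed=1)
print(sorted(null_labels) == sorted(labels))  # prints True
```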
- We can conduct a cross-study analysis by first training the model on a specific dataset and then by validating it on a different dataset. Adding the option "-t" is sufficient for doing this:
python classification.py data/abundance_t2d-WT2D.txt results/abundance_t2d-WT2D_rf_t-t2d -d 1:disease:t2d -g [] -w -t dataset_name:t2dmeta_long:t2dmeta_short
- Definition of the validation set: The option -t (TARGET DOMAIN) defines which samples form the validation set. In this example, all the samples whose metadata field "dataset_name" has the value "t2dmeta_long" OR "t2dmeta_short" are used for validation. The remaining samples are automatically used for training.
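The train/validation split implied by -t can be sketched in Python 3 as follows. As with the other sketches, this only illustrates the documented "field:value1:value2" syntax and is not the code of classification.py; the function name and the toy samples (including the "sampleID" field) are hypothetical.

```python
def split_by_target(samples, target):
    """Split samples into (train, validation) from a 'field:v1:v2' target spec."""
    field, *values = target.split(":")
    values = set(values)
    # Samples matching the target domain are held out for validation.
    validation = [s for s in samples if s.get(field) in values]
    train = [s for s in samples if s.get(field) not in values]
    return train, validation


# Toy metadata, invented for illustration only.
cohort = [
    {"sampleID": "A", "dataset_name": "t2dmeta_long"},
    {"sampleID": "B", "dataset_name": "WT2D"},
    {"sampleID": "C", "dataset_name": "t2dmeta_short"},
    {"sampleID": "D", "dataset_name": "WT2D"},
]

train, validation = split_by_target(cohort, "dataset_name:t2dmeta_long:t2dmeta_short")
print([s["sampleID"] for s in train])       # prints ['B', 'D']
print([s["sampleID"] for s in validation])  # prints ['A', 'C']
```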
E. Pasolli, D. T. Truong, F. Malik, L. Waldron, and N. Segata, Machine learning meta-analysis of large metagenomic datasets: tools and biological insights, PLOS Computational Biology, 12(7), Jul. 2016.
MetAML is a project of the Computational Metagenomics Lab at CIBIO, University of Trento, Italy.