<div style="max-width:1200px"><img src="../_resources/mgnify_banner.png" width="100%"></div>

<img src="../_resources/mgnify_logo.png" width="200px">

# Comparative metagenomics at community-level

## Normalization methods and alpha & beta diversity

[MGnifyR](http://github.com/beadyallen/mgnifyr) is a library that provides a set of tools for easily accessing and processing MGnify data in R, making queries to MGnify databases through the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1/). 
The benefit of MGnifyR is that data can either be fetched in tsv format or be directly combined in a phyloseq object to run an analysis in a custom workflow.

In this example we aim to demonstrate how the MGnifyR tool can be used to fetch the data and metadata of a MGnify metagenomic analysis. Then we show how to generate diversity metrics for comparative metagenomics using taxonomic profiles.

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.

## Contents
- [Part 1. Fetch data from MGnify using MGnifyR, explore metadata and build a phyloseq object](#part1)
  - [1.1. Fetch the MGnify Analyses accession](#part1_1)
  - [1.2. Explore and filter samples by metadata](#part1_2)
  - [1.3. Converting into phyloseq object](#part1_3)
- [Part 2. Normalization, alpha diversity indices and taxonomic profiles visualization](#part2)
   - [2.1. Cleaning the OTUs matrix](#part2_1)
   - [2.2. Normalization by total sum scaling (TSS, relative abundance or proportions)](#part2_2)
   - [2.3. Normalization by subsampling (rarefaction)](#part2_3)
   - [2.4. Normalization by cumulative sum scaling (CSS)](#part2_4)
- [Part 3. Comparative metagenomics at community-level: Beta diversity](#part3)
- [Part 4. Detection of differentially abundant taxa](#part4)
    - [4.1. Association testing](#part4_1)
    - [4.2. Confounder testing](#part4_2)
    - [4.3. Generation of a predictive model](#part4_3)
- [References](#refs)

In [None]:
# Loading libraries:
library(stringr)
library(vegan)
library(ggplot2)
library(phyloseq)
library(metagenomeSeq)
library(MGnifyR)
library(microbiomeMarker)
library(plyr)
library(SIAMCAT)
library(tidyverse)
library(IRdisplay)

display_markdown(file = '../_resources/mgnifyr_help.md')

In [None]:
# Setting tables and figures size to display (these will be reset later):
options(repr.matrix.max.cols=150, repr.matrix.max.rows=200)
options(repr.plot.width=4, repr.plot.height=4)

## Part 1. Fetch data from MGnify using MGnifyR, explore metadata and build a phyloseq object <a id='part1'/>

In this example we are going to fetch MGnify analysis results and metadata for the TARA ocean metagenomic study corresponding to size fractions for prokaryotes ([MGYS00002008](https://www.ebi.ac.uk/metagenomics/studies/MGYS00002008#overview)).
Find more information about the [TARA Ocean Project.](https://fondationtaraocean.org/en/expedition/tara-oceans/)

### 1.1. Fetch the MGnify Analyses accession <a id='part1_1'/>

1. The first step is to retrieve the analysis accession list.

In [None]:
# Create your session mgnify_client object
mg = mgnify_client(usecache = T, cache_dir = '/home/jovyan/.mgnify_cache')

tara_all = mgnify_analyses_from_studies(mg, 'MGYS00002008')

2. Use the list of accessions to fetch the metadata for all of the analyses from the MGnify API.

In [None]:
metadata = mgnify_get_analyses_metadata(mg, tara_all)
#head(metadata)

**Note**: If you are interested in running the comparative metagenomic analysis using data from different studies in MGnify, you can adapt the following commands:

```R
analyses_accessions = mgnify_analyses_from_studies(mg, c("MGYS1","MGYS2"))

metadata = mgnify_get_analyses_metadata(mg, analyses_accessions)
```

### 1.2. Explore and filter samples by metadata <a id='part1_2'/>

We want to keep only **metagenomic** samples (not amplicon) of 'surface water layer ([ENVO:00002042](https://www.ebi.ac.uk/ols/ontologies/envo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FENVO_00002042))' and 'mesopelagic zone ([ENVO:00000213](https://www.ebi.ac.uk/ols/ontologies/envo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FENVO_00000213))' to compare. We also want to filter out results generated with old pipeline versions (<[v5.0](https://www.ebi.ac.uk/metagenomics/pipelines/5.0)). In the following steps we will filter out other samples before exporting to the phyloseq object. Let's first explore the metadata we fetched:

1) Check the number of analyses in the study.

In [None]:
length(metadata$'analysis_accession')

2) Check the `analysis_experiment-type` to determine whether filtering is necessary to discard amplicon samples.

In [None]:
unique(metadata$'analysis_experiment-type')

3) Keep results generated only with the most updated pipeline (v5.0).

In [None]:
v5_metadata = metadata[which(metadata$'analysis_pipeline-version'=='5.0'), ]
length(v5_metadata$'analysis_accession')

#head(v5_metadata)

4) Check the `sample_environment-feature` to discover what kind of samples are part of the study and how many of each exists.

*Note that for a comparative study, we need at least 5 samples per group.*

In [None]:
table(v5_metadata$'sample_environment-feature')

Let's keep only samples having [ENVO:00002042](https://www.ebi.ac.uk/ols/ontologies/envo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FENVO_00002042) or [ENVO:00000213](https://www.ebi.ac.uk/ols/ontologies/envo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FENVO_00000213) in the `sample_environment-feature` column. We want to create a new dataframe containing the relevant samples.

We are also going to create a clean label for the environment feature.

<div class="alert alert-block alert-info">
To speed up the following analysis, we are going to keep only 25 samples per group (by randomly subsampling the accessions)
</div>

In [None]:
# Saving the list of relevant samples in a dataframe
sub1 = v5_metadata[str_detect(v5_metadata$'sample_environment-feature', "ENVO:00002042"), ]
set.seed(345)
acc_s1 = sample(sub1$'analysis_accession', size=25, replace = FALSE)

sub2 = v5_metadata[str_detect(v5_metadata$'sample_environment-feature', "ENVO:00000213"), ]
set.seed(345)
acc_s2 = sample(sub2$'analysis_accession', size=25, replace = FALSE)

filtered_samples = c(acc_s1,acc_s2)

# To create the environment feature label:
env_label = c(rep('Surface', times=25), rep('Mesopelagic', times=25))


### 1.3. Converting into phyloseq object <a id='part1_3'/>

1. Now that we have a new dataframe with 50 samples from either surface or mesopelagic zone water, we are going to create the phyloseq object

In [None]:
ps = mgnify_get_analyses_phyloseq(mg, filtered_samples)

2. Keep only relevant columns in metadata table and transform numeric variables from characters to numbers. Add the environment label as well.

In [None]:
#sample_variables(ps)

# Keeping relevant metadata only
variables_to_keep = c('sample_temperature','sample_depth','sample_salinity','sample_chlorophyll.sensor','sample_nitrate.sensor','sample_oxygen.sensor')

df = data.frame(sample_data(ps))[variables_to_keep]

# Transforming character to numeric variables
df[] = lapply(df, function(x) as.numeric(as.character(x)))

sample_data(ps) = df

# Adding the env label                              
sample_data(ps)$'env_feature' = env_label

#sample_data(ps)
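
The label vector above is assigned by position, which assumes that the samples in `ps` come back in the same order as `filtered_samples`. A minimal sketch of a more defensive alternative is to look labels up by name. The accessions below are made up for illustration. In the notebook, the lookup would be indexed with the sample names of `ps`, assuming these are the analysis accessions (worth checking before relying on it).

```R
# Toy accessions (hypothetical, not real MGnify identifiers)
surface_acc = c("ACC_A", "ACC_B")
meso_acc = c("ACC_C", "ACC_D")

# Named lookup table: accession -> environment label
label_lookup = c(
    setNames(rep("Surface", length(surface_acc)), surface_acc),
    setNames(rep("Mesopelagic", length(meso_acc)), meso_acc)
)

# Suppose the samples come back in a different order than requested
returned_order = c("ACC_C", "ACC_A", "ACC_D", "ACC_B")

# Indexing by name gives the right label whatever the order
env_by_name = unname(label_lookup[returned_order])
env_by_name
```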

## Part 2. Normalization, alpha diversity indices and taxonomic profiles visualization <a id='part2'/>

### 2.1. Cleaning the OTUs matrix <a id='part2_1'/>

1) Remove samples with extremely low coverage – they aren't informative and interfere with the normalization process. The first step is to detect outliers by plotting some histograms.

In [None]:
ps

options(repr.plot.width=4, repr.plot.height=4)
hist(log10(sample_sums(ps)), breaks=50, main="Sample size distribution", xlab="Sample size (log10)", ylab="Frequency", col="#007c80")

We can see that samples with a number of reads $\leq 10 ^ {1.5}$ (i.e. $\lesssim 32$) seem to be outliers. Let's filter out the outliers and plot a new histogram.

In [None]:
ps_good = subset_samples(ps, sample_sums(ps) > 32)
ps_good

hist(log10(sample_sums(ps_good)), breaks=50, main="Sample size distribution", xlab="Sample size (log10)", ylab="Frequency", col="#007c80")

And let's check how many samples were discarded:

In [None]:
nsamples(ps)
nsamples(ps_good)

2) Remove singletons existing in a single sample. Singletons are OTUs of size one, meaning that only one read was assigned to that OTU. These very low-abundance OTUs could be biologically real (belonging to the rare biosphere ([1](#reference_1))), or they could be false positives due to sequencing artefacts. Singletons observed in only one sample are more likely to be artefacts, and it is good practice to remove them from the OTU counts table to avoid artificially over-estimating the OTU richness. You can find more discussion about this in [Robert Edgar's blog](https://drive5.com/usearch/manual/singletons.html).

In [None]:
ps_final = filter_taxa(ps_good, function(x) sum(x) > 1, prune=TRUE)
ps_final
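
To make the filtering rule concrete, here is a small base-R sketch on a made-up count matrix (rows are OTUs, columns are samples). It applies the same condition as the `filter_taxa` call above, a total count across samples greater than one. It is only an illustration, not the phyloseq implementation.

```R
# Toy OTU table with made-up counts
otu = matrix(c(5, 0, 3,
               1, 0, 0,
               0, 1, 1,
               0, 0, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("OTU", 1:4), paste0("S", 1:3)))

# Keep OTUs whose total count across all samples is greater than 1
keep = rowSums(otu) > 1
otu_clean = otu[keep, , drop = FALSE]

# OTU2 (one read in one sample) and OTU4 (no reads) are dropped;
# OTU3 (one read in each of two samples) is kept
rownames(otu_clean)
```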

3) Show some stats on the sequencing depth across samples.

In [None]:
max_difference = max(sample_sums(ps_final))/min(sample_sums(ps_final))

sprintf("The max difference in sequencing depth is %s", max_difference)

options(repr.plot.width=4, repr.plot.height=5)

boxplot(sample_sums(ps_final), main="Sequencing depth across samples", xlab="", ylab="Number of reads", col="#a6093d")
text(y=boxplot.stats(sample_sums(ps_final))$stats, labels=boxplot.stats(sample_sums(ps_final))$stats, x=1.25)

An approximately 10-fold difference in the library sizes means that we will need to apply a normalization method before continuing with the analysis. The most common normalization methods used in microbiome count data are proportions and rarefaction. However, other methods originally developed to normalize RNA-seq counts have been adapted to differential-abundance analysis in microbiome data. A discussion about how to choose the right normalization method is out of the scope of this material, but the topic has been covered in multiple forums and scientific publications. Depending on the downstream analysis we intend to perform, different methods might be appropriate. For instance, to compare groups of samples at community-level through beta-diversity, "...proportions and rarefying produced more accurate comparisons among communities and are the only methods that fully normalized read depths across samples. Additionally, upper quartile, cumulative sum scaling (CSS), edgeR-TMM, and DESeq-VS often masked differences among communities when common OTUs differed, and they produced false positives when rare OTUs differed" ([2](#reference_2)). On the other hand, for detection of differentially abundant species, "both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes" ([3](#reference_3)).

In the following examples we will show three popular ways of normalization: relative abundance, rarefaction and cumulative sum scaling.

### 2.2. Normalization by total sum scaling (TSS, relative abundance or proportions) <a id='part2_2'/>

The simplest way to normalize the differences in sample size is to transform the OTU counts table into relative abundance by dividing by the number of total reads of each sample. This type of normalization is also referred to as relative abundance or proportions. We use this normalization to compare taxonomic profiles, while alpha diversity indices are computed on the non-normalized matrix. The reason to do so is that we need a matrix of integer numbers as input.
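
As a quick illustration of total sum scaling, the base-R sketch below uses two made-up samples with the same composition but a 10-fold difference in depth. After dividing by the column totals, both profiles are identical.

```R
# Toy counts: matrix() fills column by column, so "deep" is 10, 30, 60
# and "shallow" is 1, 3, 6 (same composition, 10x fewer reads)
counts = matrix(c(10, 30, 60,
                   1,  3,  6),
                nrow = 3,
                dimnames = list(paste0("OTU", 1:3), c("deep", "shallow")))

# Divide every column by its total number of reads
rel = sweep(counts, 2, colSums(counts), "/")

rel
colSums(rel)  # every sample now sums to 1
```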

1) Compute alpha diversity indices and display plots.

In [None]:
options(repr.plot.width=14, repr.plot.height=4)

plot_richness(ps_final, x="env_feature", color="env_feature") + 
    geom_boxplot() + 
    theme_bw() + 
    scale_color_manual(values=c("#0a5032", "#a1be1f")) + 
    labs(x='', color = "Environment")
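
Among the indices reported by `plot_richness` are the number of observed OTUs and the Shannon index. As a reminder of what the latter measures, here is a minimal base-R sketch of $H = -\sum_i p_i \ln p_i$ on two made-up communities. The helper `shannon` is written for this illustration only.

```R
# Shannon index from a vector of counts (zero counts are ignored)
shannon = function(x) {
    p = x[x > 0] / sum(x)
    -sum(p * log(p))
}

even = c(25, 25, 25, 25)   # four equally abundant OTUs
skewed = c(97, 1, 1, 1)    # same richness, one dominant OTU

# Observed richness is 4 in both cases
c(sum(even > 0), sum(skewed > 0))

# Evenness drives the difference in Shannon diversity
c(even = shannon(even), skewed = shannon(skewed))
```

For a perfectly even community of $S$ OTUs the index reaches its maximum, $\ln S$.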

2) Transform taxonomy raw-counts matrix into relative abundance.  <a id='part2_2_2'/>

In [None]:
relab_ps = transform_sample_counts(ps_final, function(x) x/sum(x))

3) Agglomerate taxonomy at Class rank and keep only the most abundant classes (threshold=1%, i.e. 0.01). In microbial data, we expect to observe abundance distributions with a long 'tail' of low-abundance organisms which often comprise the large majority of species. For this reason, once the matrix has been transformed to relative abundance, we will show the taxonomic profile at a high taxonomic rank (Class), agglomerating the counts first and using an abundance threshold of 1% to avoid displaying too many unreadable categories in the plot.

In [None]:
psglom = tax_glom(relab_ps, "Class")
norare_ps = filter_taxa(psglom, function(x) mean(x) > 0.01, TRUE)
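
The two steps above (summing OTUs that share a class, then keeping classes with a mean relative abundance above 1%) can be sketched in base R as follows. The matrix and the class labels are made up, and `rowsum` stands in for what `tax_glom` does on a full phyloseq object.

```R
# Toy relative abundance table (each column sums to 1)
relab = matrix(c(0.500, 0.400,
                 0.300, 0.350,
                 0.195, 0.245,
                 0.005, 0.005),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("OTU", 1:4), c("S1", "S2")))

# Hypothetical class assignment for each OTU
otu_class = c("Alpha", "Alpha", "Gamma", "Rare")

# Sum the abundances of all OTUs belonging to the same class
class_relab = rowsum(relab, group = otu_class)

# Keep classes whose mean relative abundance across samples is above 1%
abundant = class_relab[rowMeans(class_relab) > 0.01, , drop = FALSE]
abundant
```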

4) Visualise the profile in barplots at Class rank in two visualization modes.

In [None]:
plot_bar(norare_ps, fill = "Class") + 
    facet_wrap(~Class) + 
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 
    labs(x=NULL)

plot_bar(norare_ps, fill = "Class") + 
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 
    labs(x='')

### 2.3. Normalization by subsampling (rarefaction) <a id='part2_3'/>

Rarefaction is an alternative to relative abundance normalization to obtain an adjusted OTUs count matrix. The method is based on a process of subsampling to the smallest library size in the data set. The algorithm randomly removes reads until the samples reach the same library size. Despite the apparent disadvantage of discarding information from the larger samples, rarefaction is quite popular in microbial ecology.
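
The subsampling idea can be sketched in a few lines of base R. The helper `rarefy_one` and the counts are made up for illustration. The notebook itself relies on phyloseq's `rarefy_even_depth` rather than on this code.

```R
set.seed(123)

# Subsample one sample (a vector of OTU counts) down to `depth` reads
rarefy_one = function(x, depth) {
    # Expand the counts into individual reads labelled by OTU index
    reads = rep(seq_along(x), times = x)
    # Draw `depth` reads without replacement
    kept = reads[sample.int(length(reads), size = depth)]
    # Count the surviving reads per OTU
    tabulate(kept, nbins = length(x))
}

# Toy table: S2 has 10 times more reads than S1
counts = cbind(S1 = c(50, 30, 20), S2 = c(500, 300, 200))

# Rarefy every sample to the smallest library size
depth = min(colSums(counts))
rare = apply(counts, 2, rarefy_one, depth = depth)

colSums(rare)  # all samples now have the same number of reads
```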

The first step is to find the smallest sample size, i.e. the lowest total number of reads per sample (library size) in the filtered matrix.

1) Find the smallest sample size.

In [None]:
min_depth = min(sample_sums(ps_final))
min_depth

2) Rarefying to the smallest sample.

In [None]:
ps_rare = rarefy_even_depth(ps_final, sample.size=min_depth, replace=FALSE, rngseed=123, verbose=FALSE)

#otu_table(ps_rare)

3) Plot diversity indices.

In [None]:
plot_richness(ps_rare, x="env_feature", color="env_feature") + 
    geom_boxplot() + 
    theme_bw() + 
    scale_color_manual(values=c("#0a5032", "#a1be1f")) + 
    labs(x='', color = "Environment")

4) Agglomerate taxonomy at Class rank and visualize the profile (show the top 15 classes only).

In [None]:
psglom = tax_glom(ps_rare, "Class")
top15 = names(sort(taxa_sums(psglom), decreasing=TRUE)[1:15])
top15_ps = prune_taxa(top15, psglom)

options(repr.plot.width=14, repr.plot.height=6)
plot_bar(top15_ps, fill="Class") + 
    facet_wrap(~Class) + 
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 
    labs(x=NULL)

options(repr.plot.width=14, repr.plot.height=4)
plot_bar(top15_ps, fill="Class") + 
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 
    labs(x=NULL)

### 2.4. Normalization by cumulative sum scaling (CSS) <a id='part2_4'/>

The third normalization method we are going to apply is CSS. To do so, we will use the implementation in the `microbiomeMarker` library.
Cumulative sum scaling normalization calculates scaling factors as the cumulative sum of gene (or taxa) abundances up to a data-derived threshold. This method is based on the assumption that the count distributions in each sample are equivalent for low abundance genes up to a certain threshold. Only the segment of each sample's count distribution that is relatively invariant across samples is scaled by CSS.
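
To give an intuition for the scaling factor, here is a simplified base-R sketch. It is not the code used by `microbiomeMarker` or `metagenomeSeq`. In particular, the quantile `p = 0.5` is fixed here as an assumption, whereas the real method derives the threshold from the data, and the constant 1000 is only an illustrative scale.

```R
# Simplified CSS: scale each sample by the sum of its non-zero counts
# that fall at or below a chosen quantile of those counts
css_sketch = function(counts, p = 0.5, scale = 1000) {
    sf = apply(counts, 2, function(x) {
        pos = x[x > 0]
        q = quantile(pos, probs = p)
        sum(pos[pos <= q])
    })
    sweep(counts, 2, sf / scale, "/")
}

# Toy table: S2 has the same composition as S1 at twice the depth
counts = cbind(S1 = c(1, 2, 3, 10), S2 = c(2, 4, 6, 20))
rownames(counts) = paste0("OTU", 1:4)

norm = css_sketch(counts)
norm
```

In this toy case the two samples become identical after scaling, because only the low-abundance part of each distribution (counts up to the median) contributes to the scaling factor.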

1) Normalizing the OTU counts in the `ps_final` object.

In [None]:
ps_CSS = normalize(ps_final, method="CSS")

2) Compute and plot alpha diversity metrics.

In [None]:
plot_richness(ps_CSS, x="env_feature", color="env_feature") + 
    geom_boxplot() + 
    theme_bw() + 
    scale_color_manual(values=c("#0a5032", "#a1be1f")) + 
    labs(x='', color="Environment")

3) Agglomerate taxonomy at Class rank and visualize the profile (show the top 15 classes only).

In [None]:
psglom = tax_glom(ps_CSS, "Class")
top15 = names(sort(taxa_sums(psglom), decreasing=TRUE)[1:15])
top15_ps = prune_taxa(top15, psglom)

options(repr.plot.width=14, repr.plot.height=6)
plot_bar(top15_ps, fill="Class") + 
    facet_wrap(~Class) + 
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 
    labs(x=NULL)

options(repr.plot.width=14, repr.plot.height=4)
plot_bar(top15_ps, fill="Class") + 
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 
    labs(x=NULL)

## Part 3. Comparative metagenomics at community-level: Beta diversity <a id='part3'/>

According to Pereira *et al.* (2018) ([4](#reference_4)), the best normalization method for metagenomic gene abundance (tested in TARA ocean samples) is CSS for large group sizes. For this reason, we will use this method to show beta-diversity.

1) Compute beta diversity using various methods to calculate distance, and perform a principal coordinates analysis (PCoA, also known as metric multidimensional scaling, MDS), plotting the first two axes. Using as a guide the steps described in ([5](#reference_5)), we will create a list of suitable distance methods, iterate through them, and display a combined plot. For a better visualization we are going to show the 95% confidence region with an ellipse.

In [None]:
# Generating the methods list and discarding those that are not included in adonis methods list
dist_methods = unlist(distanceMethodList)
dist_methods = dist_methods[c(-(1:4),-(20:47))]

# Iterating through the list to save the plot
plist = vector("list", length(dist_methods))
names(plist) = dist_methods
for( i in dist_methods ){
    # Calculate distance matrix
    iDist = distance(ps_CSS, method=i)
    # Calculate ordination
    iMDS  = ordinate(ps_CSS, "MDS", distance=iDist)
    ## Make plot
    # Don't carry over previous plot (if error, p will be blank)
    p = NULL
    # Create plot, store as temp variable, p
    p = plot_ordination(ps_CSS, iMDS, color="env_feature")
    # Add title to each plot
    p = p + ggtitle(paste("MDS using distance method ", i, sep=""))
    # Store the plot in the list
    plist[[i]] = p
}

# Create a combined plot
df = ldply(plist, function(x) x$data)
names(df)[1] = "distance"
p = ggplot(df, aes(Axis.1, Axis.2, color=env_feature))
p = p + geom_point(size=3, alpha=0.5)
p = p + facet_wrap(~distance, scales="free")
p = p + ggtitle("MDS on various distance metrics for TARA ocean dataset") + 
    stat_ellipse(level=0.95, type="norm", geom="polygon", alpha=0, aes(color=env_feature)) + 
    theme_bw() + 
    scale_color_manual(values=c("#0a5032", "#a1be1f")) + 
    labs(color = "Environment") 

options(repr.plot.width=12, repr.plot.height=10)
p
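
The core of each iteration above is a distance matrix followed by a metric MDS. A minimal base-R sketch of that pair of steps on simulated counts is shown below. It uses `dist` and `cmdscale` from the `stats` package instead of phyloseq, and the two groups and their abundances are made up. Note that base R's Canberra distance is not scaled in the same way as the one computed through phyloseq, so the numbers are not directly comparable.

```R
set.seed(1)

# Toy abundance table: 5 taxa x 6 samples from two made-up groups
grp = rep(c("A", "B"), each = 3)
lambda = cbind(A = c(50, 30, 10, 5, 5), B = c(5, 5, 10, 30, 50))
counts = sapply(grp, function(g) rpois(5, lambda[, g]))
colnames(counts) = paste0(grp, 1:3)

# Pairwise distances between samples (dist works on rows, hence t())
d = dist(t(counts), method = "canberra")

# Metric MDS (principal coordinates analysis) keeping the first two axes
ord = cmdscale(d, k = 2, eig = TRUE)

# Coordinates ready for plotting, one row per sample
coords = data.frame(ord$points, group = grp)
coords
```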

2) Select the best distance metric, the one which best segregates the data by water layer, and determine whether the two groups of samples have different centroids. To do so, we use a PERMANOVA test, implemented in the `vegan` library's `adonis` function. The method calculates the squared deviations of each site from the centroid and then performs significance tests using F-tests based on sequential sums of squares from permutations of the raw data.

In [None]:
metadata = data.frame(sample_data(ps_CSS))
css_beta = distance(ps_CSS, method="canberra")
adonis(css_beta ~ env_feature, data = metadata, perm=1e3)
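
For readers curious about what the test computes, here is a minimal base-R sketch of a one-factor PERMANOVA: a pseudo-F statistic derived from the squared distances, and a p-value obtained by permuting the group labels. The function `pseudo_F` and the toy data are written for this illustration. This is not the `vegan` implementation.

```R
# Pseudo-F for one grouping factor, computed from a distance matrix
pseudo_F = function(d, groups) {
    d2 = as.matrix(d)^2
    n = nrow(d2)
    a = length(unique(groups))
    # Total sum of squares from all pairwise squared distances
    ss_total = sum(d2[upper.tri(d2)]) / n
    # Within-group sum of squares, one term per group
    ss_within = sum(sapply(unique(groups), function(g) {
        idx = which(groups == g)
        sub = d2[idx, idx, drop = FALSE]
        sum(sub[upper.tri(sub)]) / length(idx)
    }))
    ss_among = ss_total - ss_within
    (ss_among / (a - 1)) / (ss_within / (n - a))
}

set.seed(42)
# Toy data: two clearly separated groups along one axis
x = c(0, 1, 2, 0.5, 1.5, 10, 11, 12, 10.5, 11.5)
groups = rep(c("Surface", "Mesopelagic"), each = 5)
d = dist(x)

f_obs = pseudo_F(d, groups)
# Null distribution: recompute the statistic after shuffling the labels
f_perm = replicate(999, pseudo_F(d, sample(groups)))
p_value = (sum(f_perm >= f_obs) + 1) / (999 + 1)

c(F = f_obs, p = p_value)
```

With Euclidean distances on a single variable, the pseudo-F coincides with the classical ANOVA F statistic, which is a convenient sanity check.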

3) Adonis assumes there is homogeneity of dispersion among groups. Let's test this assumption, to check whether the differences detected by adonis are due to variation in dispersion of the data. The strategy is to run a `betadisper` test (also from the `vegan` library) and evaluate whether there is a significant variation in beta dispersion between groups through an anova.

In [None]:
bd = betadisper(css_beta, metadata$'env_feature')
bd
anova(bd)

With these results, we can now accept that there is a significant difference between groups and it is not an artifact due to heterogeneous dispersion.
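
The idea behind the dispersion test can be sketched in base R: measure how far each sample lies from the centre of its group, then compare those distances between groups with an anova. The sketch below makes two simplifying assumptions. It works directly on simulated Euclidean coordinates, so no ordination step is needed, and it uses group means as centres, whereas `betadisper` uses the spatial median by default.

```R
set.seed(7)

# Toy coordinates: two groups of 10 samples with similar spread
pts = rbind(matrix(rnorm(20, mean = 0, sd = 1), ncol = 2),
            matrix(rnorm(20, mean = 5, sd = 1), ncol = 2))
groups = factor(rep(c("Surface", "Mesopelagic"), each = 10))

# Centre of each group (one row per group, one column per axis)
centroids = apply(pts, 2, function(col) tapply(col, groups, mean))

# Euclidean distance from each sample to the centre of its own group
dist_to_centroid = sqrt(rowSums((pts - centroids[as.character(groups), ])^2))

# Do the groups differ in their average distance to the centre?
anova(lm(dist_to_centroid ~ groups))
```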

## Part 4. Detection of differentially abundant taxa <a id='part4'/>

There are many approaches to detect differentially abundant taxa (or genes). Here we use the method implemented in the `SIAMCAT` library. `SIAMCAT` is an R package that puts together statistical functions to compare large-scale studies in order to detect microbial community composition changes due to environmental factors. Describing such associations in quantitative models has multiple applications, such as in predicting disease status based on microbiome data. In addition, the pipeline includes the generation of high-quality plots for visual inspection of results ([6](#reference_6)). Note that there are multiple parameters that can be adjusted for better performance in each step of the analysis, according to the nature of the dataset. However, in this exercise we show the tool's behaviour with minimal tuning of the default arguments, so that it can be used as a template by the user.

### 4.1. Association testing <a id='part4_1'/>

1. Preparing the data. The input for `SIAMCAT` is a matrix with normalized data as relative abundances. We will use the phyloseq object generated in step [2.2.2](#part2_2_2) of this notebook and aggregate the data at genus level.

In [None]:
psglom = tax_glom(relab_ps, "Genus")

#sample_data(psglom)

2. Creating the `SIAMCAT` object. We first need to create a label to set which group will be used as control for comparisons. We arbitrarily selected Mesopelagic as the case group and Surface as the control.

In [None]:
# Creating a siamcat label
sc_label = create.label(meta=sample_data(psglom), label='env_feature', case='Mesopelagic')

# Creating the siamcat object
siamcat_obj = siamcat(phyloseq=psglom, label=sc_label)
show(siamcat_obj)

3. Filtering low-abundant features by abundance (threshold=0.001) and prevalence (threshold=0.05).

In [None]:
siamcat_obj = filter.features(siamcat_obj, filter.method='abundance', cutoff=1e-03)
siamcat_obj = filter.features(siamcat_obj, filter.method='prevalence', cutoff=0.05, feature.type='filtered')

siamcat_obj
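
The logic of the two filters can be sketched on a toy relative abundance table in base R. This is an interpretation rather than `SIAMCAT`'s code: it assumes that the abundance filter looks at the maximum abundance of a feature across samples and that the prevalence filter looks at the fraction of samples where the feature is non-zero. Because the toy table has only four samples, the prevalence cutoff is raised to 0.5 so that the filter has a visible effect.

```R
# Toy relative abundances: 4 features x 4 samples (columns sum to 1)
relab = matrix(c(0.6000, 0.5000, 0.7000, 0.5500,
                 0.3995, 0.4998, 0.2996, 0.4000,
                 0.0005, 0.0002, 0.0004, 0.0000,
                 0.0000, 0.0000, 0.0000, 0.0500),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("F", 1:4), paste0("S", 1:4)))

# Abundance filter: keep features reaching the cutoff in at least one sample
max_abund = apply(relab, 1, max)
after_abund = relab[max_abund >= 1e-03, , drop = FALSE]

# Prevalence filter on the already filtered table:
# keep features detected in at least half of the samples
prevalence = rowMeans(after_abund > 0)
after_prev = after_abund[prevalence >= 0.5, , drop = FALSE]

rownames(after_abund)  # F3 never reaches 0.001
rownames(after_prev)   # F4 is present in only one sample
```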

4. Computing differentially abundant taxa on filtered object. 

In [None]:
options(repr.plot.width=10, repr.plot.height=6)
siamcat_obj = check.associations(siamcat_obj, panels=c("fc","prevalence"))

For significantly associated microbial features, the plot shows: 
- The abundances of the features across the two different classes (Mesopelagic vs. Surface) 
- The significance of the enrichment calculated by a Wilcoxon test (after multiple hypothesis testing correction) 
- The generalized fold change of each feature
- The prevalence shift between the two classes

By adding "auroc" to the `panels` list, you can also visualize the Area Under the Receiver Operating Characteristics Curve (AU-ROC) as a non-parametric effect size measure.
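
The statistics listed above can be reproduced by hand on a toy example. The base-R sketch below runs a Wilcoxon test per feature, corrects the p-values for multiple testing (FDR is used here as an assumption about the correction method), and derives the AU-ROC from the Wilcoxon statistic. The feature table is made up, and this is not how `SIAMCAT` is implemented internally.

```R
# Toy relative abundances: 2 features x 10 samples (5 per group)
feat = rbind(
    taxon_1 = c(0.30, 0.32, 0.35, 0.31, 0.33, 0.10, 0.12, 0.11, 0.13, 0.09),
    taxon_2 = c(0.20, 0.25, 0.22, 0.27, 0.21, 0.24, 0.19, 0.26, 0.23, 0.28)
)
label = rep(c("Mesopelagic", "Surface"), each = 5)
is_case = label == "Mesopelagic"

# Wilcoxon rank-sum test per feature: p-value and W statistic
p_raw = apply(feat, 1, function(x) wilcox.test(x[is_case], x[!is_case])$p.value)
w_stat = apply(feat, 1, function(x) unname(wilcox.test(x[is_case], x[!is_case])$statistic))

# Multiple hypothesis testing correction
p_adj = p.adjust(p_raw, method = "fdr")

# AU-ROC: fraction of (case, control) pairs where the case value is higher
auroc = w_stat / (sum(is_case) * sum(!is_case))

data.frame(p_raw, p_adj, auroc)
```

An AU-ROC of 1 means the feature separates the two groups perfectly, while values around 0.5 indicate no separation.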


5. Printing the taxonomic labels of differentially abundant OTUs. 

In [None]:
# The results table of differentially abundant OTUs is stored in associations(siamcat_obj)
diff_otus = as.list(rownames(associations(siamcat_obj)[associations(siamcat_obj)$p.adj < 0.05, ]))

# The taxonomic label per OTU is stored in tax_table(psglom)
tax_table(psglom)[rownames(tax_table(psglom)) %in% diff_otus, ]

### 4.2. Confounder testing <a id='part4_2'/>

1. Test the associated metadata variables for potential confounding influence. The aim of this test is to check how differences between studies might influence the variance of specific taxa.

In [None]:
check.confounders(siamcat_obj, fn.plot='conf_check.pdf')

The output file has been created in the current directory. You can explore the PDF file following this <a href="conf_check.pdf">link</a>.

### 4.3. Generation of a predictive model <a id='part4_3'/>

Another feature of `SIAMCAT` is a versatile but easy-to-use interface for the construction of machine learning (ML) meta-analysis models on the basis of microbial markers. The steps to follow include functions for data normalization, splitting the data into cross-validation folds, training the model, and making predictions based on cross-validation instances and the trained models.

1. Data normalisation. `SIAMCAT` offers a few normalization approaches that can be useful for subsequent statistical modeling in the sense that they transform features in a way that can increase the accuracy of the resulting models. Importantly, these normalization techniques do not make use of any label information, and can thus be applied up front to the whole data set and outside of the following cross validation.

In [None]:
siamcat_obj = normalize.features(siamcat_obj, norm.method='log.std', norm.param=list(log.n0=1e-06, sd.min.q=0))

2. Prepare cross-validation. Cross validation is a technique to assess how well a ML model would generalize to external data by partitioning the dataset into training and test sets. `SIAMCAT` greatly simplifies the set-up of cross-validation schemes, including stratification of samples or keeping samples inseparable based on metadata. Here, we split the dataset into 10 parts and then train a model on 9 of these parts and use the left-out part to test the model. The whole process is repeated 10 times.

In [None]:
siamcat_obj = create.data.split(siamcat_obj, num.folds=10, num.resample=10)
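
To illustrate what a split looks like, the base-R sketch below assigns 50 labelled samples to 10 folds while keeping both groups represented in every fold. Stratifying by label is an assumption made for this sketch, and the helper `assign_folds` is hypothetical. `SIAMCAT` handles the splitting internally.

```R
set.seed(2024)

label = rep(c("Mesopelagic", "Surface"), each = 25)
num_folds = 10

# Assign a fold number to every sample, separately within each group
assign_folds = function(label, k) {
    folds = integer(length(label))
    for (g in unique(label)) {
        idx = which(label == g)
        folds[idx] = sample(rep(seq_len(k), length.out = length(idx)))
    }
    folds
}

folds = assign_folds(label, num_folds)

# Fold 1 is held out for testing, the other nine are used for training
test_idx = which(folds == 1)
train_idx = which(folds != 1)

table(folds)            # number of samples per fold
table(label[test_idx])  # both groups are present in the test fold
```

Repeating the assignment with different random shuffles gives the resampled splits, which is what repeating the whole process 10 times refers to.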

3. Model training. The actual model training is performed using the function `train.model`. Again, multiple options for customization are available, ranging from the machine learning method to the measure for model selection or customizable parameter set for hyperparameter tuning. Here, we train a Lasso model ([7](#reference_7)).

In [None]:
siamcat_obj = train.model(siamcat_obj, method='lasso')

4. Make predictions. This function will automatically apply the models trained in cross validation to their respective test sets and aggregate the predictions across the whole data set.

In [None]:
siamcat_obj = make.predictions(siamcat_obj)

5. Model evaluation and plot. In the final part, we want to find out how well the model performed and which microbial markers had been selected in the model. In order to do so, we first calculate how well the predictions fit the real data using the function `evaluate.predictions`. This function calculates the Area Under the Receiver Operating Characteristic (ROC) Curve (AU-ROC) and the Precision Recall (PR) Curve for each resampled cross-validation run.

In [None]:
siamcat_obj = evaluate.predictions(siamcat_obj)

options(repr.plot.width=5, repr.plot.height=5)
model.evaluation.plot(siamcat_obj)

6. Interpretation plot. After statistical models have been trained to distinguish Mesopelagic samples from Surface, we will plot characteristics of the models (i.e. model coefficients or feature importance) alongside the input data aiding in understanding how/why the model works (or not).

In [None]:
model.interpretation.plot(siamcat_obj, fn.plot = 'interpretation_plot.pdf')

The output file has been created in the current directory. You can explore the content of the PDF file following this <a href="interpretation_plot.pdf">link</a>.

The plot shows: 
- The median relative feature weight for selected features (barplot on the left) 
- The robustness of features (i.e. in how many of the models the specific feature has been selected) 
- The distribution of selected features across samples (central heatmap) 
- Which proportion of the weight of all different models are shown in the plot (boxplot on the right)
- Distribution of metadata across samples (heatmap below)

### References: <a id='refs'/>

<a id="reference_1">1. Lynch, M., Neufeld, J. Ecology and exploration of the rare biosphere. Nat Rev Microbiol 13: 217–229 (2015)</a> [DOI:10.1038/nrmicro3400](https://doi.org/10.1038/nrmicro3400)

<a id="reference_2">2. McKnight, DT., Huerlimann, R., Bower, DS. et al. Methods for Normalizing Microbiome Data: An Ecological Perspective. Methods in Ecology and Evolution / British Ecological Society 10 (3): 389–400 (2019)</a> [DOI:10.1111/2041-210X.13115](https://doi.org/10.1111/2041-210X.13115)

<a id="reference_3">3. McMurdie, PJ., Holmes, S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput Biol 10(4): e1003531 (2014)</a> [DOI:10.1371/journal.pcbi.1003531](https://doi.org/10.1371/journal.pcbi.1003531)

<a id="reference_4">4. Pereira, M., Wallroth, M., Jonsson, V. et al. Comparison of normalization methods for the analysis of metagenomic gene abundance data. BMC Genomics 19, 274 (2018)</a> [DOI:10.1186/s12864-018-4637-6](https://doi.org/10.1186/s12864-018-4637-6)

<a id="reference_5">5. McMurdie, P., Holmes, S. Tutorial of 2018: The distance function in phyloseq.</a> [Available at website](https://joey711.github.io/phyloseq/distance.html)

<a id="reference_6">6. Wirbel, J., Zych, K., Essex, M. et al. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biol 22: 93 (2021)</a> [DOI:10.1186/s13059-021-02306-1](https://doi.org/10.1186/s13059-021-02306-1)

<a id="reference_7">7. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1): 267-288 (1996)</a> [DOI:10.1111/j.2517-6161.1996.tb02080.x](https://doi.org/10.1111/j.2517-6161.1996.tb02080.x)
    
Documentation and more MGnifyR code and exercises available [on GitHub](https://beadyallen.github.io/MGnifyR/) and [on rdrr site](https://rdrr.io/github/beadyallen/MGnifyR/f/vignettes/MGnifyR.Rmd)

Phyloseq tutorials available [on GitHub](https://joey711.github.io/phyloseq/index.html)
    
Visit the [SIAMCAT documentation](https://siamcat.embl.de/)
