<a href="https://colab.research.google.com/github/Gibbons-Lab/isb_course_2020/blob/master/micom.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧫🦠 Metagenome-scale metabolic modeling with MICOM

Okay, let's start to get a bit more functional and simulate growth for the microbiota we have identified in the previous tutorial.

# 📝 Setup

MICOM installation is usually pretty straightforward and can be as easy as typing `pip install micom`. However, MICOM requires a solver for quadratic programming (QP) problems, and all the good ones are commercial (boo), even though they often offer free academic licenses 😌. We will use a brand new open source QP solver named OSQP, but this requires pulling in development versions of some packages.

But first let's start by downloading the materials again and switching to the folder.

In [1]:
!git clone https://github.com/gibbons-lab/isb_course_2020 materials
%cd materials

fatal: destination path 'materials' already exists and is not an empty directory.
/content/materials


## Basic Installation

Installing MICOM is straightforward in Python. OSQP itself can be installed right along with it.

In [58]:
!pip install -q osqp micom

print("Done! 🎉 ")

Done! 🎉 


## Enabling OSQP support

For today we will also install some development versions that enable full support for OSQP in MICOM. This will hopefully no longer be necessary soon. It would also be unnecessary if we had access to CPLEX or Gurobi (both commercial solvers with free academic licenses).

In [59]:
!pip install --force-reinstall --no-deps -q https://github.com/cdiener/cobrapy/archive/feature/osqp_support.zip \
  https://github.com/cdiener/optlang/archive/feature/osqp.zip
  
print("Done! 🎉 ")

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  Building wheel for cobra (PEP 517) ... done
  Building wheel for optlang (setup.py) ... done
Done! 🎉 


## Enable Qiime 2 interactions

Finally we will need to install packages to read the "biom" format which is a file format Qiime 2 uses to save tables. This is only necessary if you want to read Qiime 2 FeatureTable artifacts.

In [60]:
!pip install -q numpy Cython
!pip install -q biom-format

print("Done! 🎉 ")

Done! 🎉 


Okay, all done. So let's get started 😁.

# 💻 MICOM

We will use the Python interface to MICOM since it plays nicely with Colaboratory being pure Python and all that. However, you could run the same steps with Qiime 2 if you wanted. 

Here is an overview of all the steps and the particular functions:
![micom overview](https://github.com/micom-dev/q2-micom/raw/706f583a060b91c12c0cec7acea2354fdd0dd320/docs/assets/overview.png)

MICOM starts from a combined abundance/taxonomy table, which MICOM abbreviates to "taxonomy table". To see what those tables look like, we can import MICOM and look at an example table:


In [2]:
from micom.data import test_data

test_data()

Unnamed: 0,id,genus,species,reactions,metabolites,file,sample_id,abundance
0,Escherichia_coli_1,Escherichia,Escherichia coli 0,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_1,682
1,Escherichia_coli_2,Escherichia,Escherichia coli 1,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_1,422
2,Escherichia_coli_3,Escherichia,Escherichia coli 2,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_1,64
3,Escherichia_coli_4,Escherichia,Escherichia coli 3,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_1,736
0,Escherichia_coli_1,Escherichia,Escherichia coli 0,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_2,680
1,Escherichia_coli_2,Escherichia,Escherichia coli 1,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_2,900
2,Escherichia_coli_3,Escherichia,Escherichia coli 2,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_2,734
3,Escherichia_coli_4,Escherichia,Escherichia coli 3,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_2,317
0,Escherichia_coli_1,Escherichia,Escherichia coli 0,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_3,78
1,Escherichia_coli_2,Escherichia,Escherichia coli 1,95,72,/usr/local/lib/python3.6/dist-packages/micom/d...,sample_3,217


The `file` column is not required when using a taxonomy database like we will do. The same holds for the `id` column, which is optional and allows you to give better names to your taxa. But each row needs to contain the abundance of a single taxon in a single sample.
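To make the "one row per taxon per sample" layout concrete, here is a minimal sketch in plain Python that reshapes a wide count table into that long format. The sample names, genera, and counts are made up for illustration, and `to_long_format` is a hypothetical helper, not part of MICOM.

```python
# Hypothetical wide table: one dict of genus -> read count per sample.
wide_counts = {
    "sample_1": {"Bacteroides": 600, "Akkermansia": 300, "Roseburia": 100},
    "sample_2": {"Bacteroides": 50, "Akkermansia": 950},
}


def to_long_format(wide):
    """Return one row (dict) per taxon and sample."""
    rows = []
    for sample_id, counts in wide.items():
        total = sum(counts.values())
        for genus, abundance in counts.items():
            rows.append(
                {
                    "sample_id": sample_id,
                    "genus": genus,
                    "abundance": abundance,
                    # Relative abundance within the sample.
                    "relative": abundance / total,
                }
            )
    return rows


long_rows = to_long_format(wide_counts)
for row in long_rows:
    print(row)
```

Each printed row carries exactly one abundance for one taxon in one sample, which is the shape the taxonomy table needs.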

Oh no, that's not what we have generated in the previous step. We only have separate Qiime 2 artifacts 😱 No worries.

## Importing data from Qiime 2

MICOM can read Qiime 2 artifacts. You don't even need to have Qiime 2 installed for that! But before we do so let's resolve one issue. We discussed that MICOM summarizes genome-scale models into pangenome-scale models as a first step. But our data is on the ASV level, so how will we know what to summarize? Basically, that can be inferred from the model database we use. Model databases are prepared pangenome-scale models for use in MICOM. So before we read our data we have to decide which model database to use. We will go with the [AGORA database](https://pubmed.ncbi.nlm.nih.gov/27893703/), which is a curated database of more than 800 bacterial strains that commonly live in the human gut. In particular, we will use a database summarized on the genus rank.



In [2]:
!wget -O agora103_genus.qza https://zenodo.org/record/3755182/files/agora103_genus.qza?download=1

--2020-08-19 18:05:19--  https://zenodo.org/record/3755182/files/agora103_genus.qza?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21080080 (20M) [application/octet-stream]
Saving to: ‘agora103_genus.qza’


2020-08-19 18:05:21 (13.8 MB/s) - ‘agora103_genus.qza’ saved [21080080/21080080]
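Since the database is summarized on the genus rank, ASV-level counts have to be collapsed to genera before they can be matched. The following is a rough sketch of that idea only; the ASV names, counts, and the `collapse_to_genus` helper are hypothetical, and MICOM's own implementation may differ.

```python
from collections import defaultdict

# Hypothetical ASV counts for one sample and a hypothetical ASV -> genus lookup.
asv_counts = {"asv_1": 120, "asv_2": 30, "asv_3": 50, "asv_4": 10}
asv_genus = {
    "asv_1": "Bacteroides",
    "asv_2": "Bacteroides",
    "asv_3": "Akkermansia",
    "asv_4": None,  # no genus-level classification available
}


def collapse_to_genus(counts, genus_of):
    """Sum ASV counts per genus; ASVs without a genus are dropped."""
    collapsed = defaultdict(int)
    for asv, count in counts.items():
        genus = genus_of.get(asv)
        if genus is not None:
            collapsed[genus] += count
    return dict(collapsed)


genus_counts = collapse_to_genus(asv_counts, asv_genus)
print(genus_counts)
```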



Okay. We got everything we need now. The data from the prior analysis can be found in the `treasure_chest` folder, so we use those files.

In [22]:
from micom.taxonomy import qiime_to_micom

tax = qiime_to_micom(
    "treasure_chest/dada2/table.qza", 
    "treasure_chest/taxa.qza", 
    "agora103_genus.qza",
    cutoff=2.5e-2,
    strict=False
)
tax["id"] = tax.genus
tax

Taxa per sample:
count    8.00000
mean     6.25000
std      2.31455
min      3.00000
25%      4.75000
50%      6.00000
75%      8.25000
max      9.00000
Name: sample_id, dtype: float64 



Unnamed: 0,sample_id,abundance,genus,relative,id
8,ERR1883210,11.0,Akkermansia,0.000164,Akkermansia
9,ERR1883214,54717.0,Akkermansia,0.822404,Akkermansia
10,ERR1883247,102.0,Akkermansia,0.003015,Akkermansia
11,ERR1883248,4059.0,Akkermansia,0.103046,Akkermansia
12,ERR1883210,42793.0,Bacteroides,0.637674,Bacteroides
...,...,...,...,...,...
257,ERR1883294,3.0,Atopobium,0.000647,Atopobium
258,ERR1883294,3.0,Alicyclobacillus,0.000647,Alicyclobacillus
259,ERR1883212,4.0,WAL_1855D,0.000069,WAL_1855D
260,ERR1883212,2.0,Finegoldia,0.000035,Finegoldia


One helpful thing to do is to merge in our metadata so we have it at hand in the next steps.

In [23]:
import pandas as pd

metadata = pd.read_table("metadata.tsv").rename(columns={"id": "sample_id"})
tax = pd.merge(tax, metadata, on="sample_id")
tax

Unnamed: 0,sample_id,abundance,genus,relative,id,disease_stat,description
0,ERR1883210,11.0,Akkermansia,0.000164,Akkermansia,healthy,Donor 13
1,ERR1883210,42793.0,Bacteroides,0.637674,Bacteroides,healthy,Donor 13
2,ERR1883210,3298.0,Faecalibacterium,0.049145,Faecalibacterium,healthy,Donor 13
3,ERR1883210,1062.0,Clostridium,0.015825,Clostridium,healthy,Donor 13
4,ERR1883210,4556.0,Roseburia,0.067891,Roseburia,healthy,Donor 13
...,...,...,...,...,...,...,...
249,ERR1883315,48.0,Selenomonas,0.004828,Selenomonas,Recurrent Clostridium difficile infection,Day -1 CD4
250,ERR1883315,28.0,Microvirgula,0.002816,Microvirgula,Recurrent Clostridium difficile infection,Day -1 CD4
251,ERR1883315,3.0,Aggregatibacter,0.000302,Aggregatibacter,Recurrent Clostridium difficile infection,Day -1 CD4
252,ERR1883315,11.0,Providencia,0.001106,Providencia,Recurrent Clostridium difficile infection,Day -1 CD4


Okay, `qiime_to_micom` prints a summary of how many taxa per sample we will use. In this case, that is about 6 taxa per sample on average. Note that `qiime_to_micom` has a `strict` parameter. When matching taxonomy you can either match by the target rank only, meaning just compare genus names, or compare the entire taxonomy, which requires all taxonomic ranks above the target rank to match as well. Many databases disagree on the family or class names, so strict matching may give you fewer matches, but it lowers the chance of wrong matches. The resulting table has the same layout either way, but with `strict=True` it includes the additional ranks that are used when matching to the database. The GreenGenes database is pretty old and many taxonomic names have been superseded by now. So we will stick with the "lax" option of only matching on the genus rank.

For now we can finally build our community-level models.
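The difference between lax and strict matching can be sketched as follows. This is an illustration of the idea only: `taxa_match`, the rank list, and the lineages (including the placeholder family names) are assumptions for the example and not MICOM's actual matching code.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def taxa_match(observed, reference, rank="genus", strict=False):
    """Compare two taxonomies given as dicts mapping rank -> name."""
    if not strict:
        # Lax: only the target rank has to agree.
        return observed[rank] == reference[rank]
    # Strict: every rank up to and including the target rank has to agree.
    ranks_to_check = RANKS[: RANKS.index(rank) + 1]
    return all(observed[r] == reference[r] for r in ranks_to_check)


# Illustrative lineages that agree on the genus but not on the family.
observed = {
    "kingdom": "Bacteria", "phylum": "Firmicutes", "class": "Clostridia",
    "order": "Clostridiales", "family": "Family_A", "genus": "Roseburia",
}
reference = dict(observed, family="Family_B")

print(taxa_match(observed, reference, strict=False))
print(taxa_match(observed, reference, strict=True))
```

With the lax option the two entries count as the same genus, whereas the strict option rejects the match because of the differing family name.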

## Building community models

With the data we have building our models is pretty easy. We just pass our taxonomy table and the model database. We also have to specify where to write the models. We will also run that in parallel over two threads. So it should take around 10 minutes.

In [24]:
from micom.workflows import build
from micom import Community
import pandas as pd

manifest = build(tax, "agora103_genus.qza", "models", threads=2, solver="osqp", cutoff=2e-2)
#manifest = pd.read_csv("models/manifest.csv")

HBox(children=(FloatProgress(value=0.0, max=8.0), HTML(value='')))




This will warn if less than 50% of the abundance is matched to the database. This can happen and you can still continue, but be aware that the models may then not represent the ecological system in your sample well *if* there are other bacteria present. In our case we have some individuals with *C. diff* with little biomass, so many of the reads may match to food components. So we can go ahead for now. Let's also take a look at what we got back from the `build` process.

In [25]:
manifest

Unnamed: 0,sample_id,disease_stat,description,file,found_taxa,total_taxa,found_fraction,found_abundance_fraction
0,ERR1883210,healthy,Donor 13,ERR1883210.pickle,7.0,8.0,0.875,0.86635
1,ERR1883212,healthy,Donor 14,ERR1883212.pickle,10.0,11.0,0.909091,0.870943
2,ERR1883214,Recurrent Clostridium difficile infection,Day 0 CD1,ERR1883214.pickle,3.0,3.0,1.0,0.965912
3,ERR1883247,healthy,Donor CD3,ERR1883247.pickle,10.0,12.0,0.833333,0.85637
4,ERR1883248,Recurrent Clostridium difficile infection,Day 1 CD1,ERR1883248.pickle,7.0,7.0,1.0,0.892638
5,ERR1883260,healthy,CD2 Donor,ERR1883260.pickle,9.0,9.0,1.0,0.878539
6,ERR1883294,Recurrent Clostridium difficile infection,Day 0 CD3,ERR1883294.pickle,7.0,8.0,0.875,0.937244
7,ERR1883315,Recurrent Clostridium difficile infection,Day -1 CD4,ERR1883315.pickle,4.0,4.0,1.0,0.962788
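To illustrate what a column like `found_abundance_fraction` expresses, here is a small sketch that computes the share of a sample's abundance that maps to genera present in a database, and applies the 50% warning threshold mentioned above. The abundances, the database contents, and the helper function are hypothetical.

```python
def found_abundance_fraction(relative_abundances, database_genera):
    """Fraction of the total abundance that belongs to genera in the database."""
    matched = sum(
        abundance
        for genus, abundance in relative_abundances.items()
        if genus in database_genera
    )
    return matched / sum(relative_abundances.values())


# Hypothetical sample and database.
sample = {"Bacteroides": 0.64, "Roseburia": 0.07, "UnknownGenus": 0.29}
database = {"Bacteroides", "Roseburia", "Akkermansia"}

fraction = found_abundance_fraction(sample, database)
if fraction < 0.5:
    print("Warning: less than 50% of the abundance matched the database.")
print(f"matched abundance fraction: {fraction:.2f}")
```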


So we now have our community models and can now leverage MICOM fully by simulating growth in the community.

## Simulating growth

With our community models built we can start to simulate growth with the cooperative tradeoff algorithm. Since we have no diet information for our samples we will apply the same average Western diet to each individual. So we will start by downloading the diet.

In [26]:
!wget -O western_diet_gut.qza https://zenodo.org/record/3755182/files/western_diet_gut.qza?download=1

--2020-08-19 21:41:40--  https://zenodo.org/record/3755182/files/western_diet_gut.qza?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7173 (7.0K) [application/octet-stream]
Saving to: ‘western_diet_gut.qza’


2020-08-19 21:41:41 (970 MB/s) - ‘western_diet_gut.qza’ saved [7173/7173]



This is again a Qiime 2 artifact which we can load with MICOM.

In [27]:
from micom.qiime_formats import load_qiime_medium

medium = load_qiime_medium("western_diet_gut.qza")
medium.index = medium.reaction
medium

Unnamed: 0_level_0,flux,dilution,metabolite,reaction
reaction,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
EX_fru_m,0.014899,0.100,fru_m,EX_fru_m
EX_glc_m,0.014899,0.100,glc_m,EX_glc_m
EX_gal_m,0.014899,0.100,gal_m,EX_gal_m
EX_man_m,0.014899,0.100,man_m,EX_man_m
EX_mnl_m,0.014899,0.100,mnl_m,EX_mnl_m
...,...,...,...,...
EX_glu_D_m,0.100000,0.100,glu_D_m,EX_glu_D_m
EX_gthrd_m,0.100000,0.100,gthrd_m,EX_gthrd_m
EX_h2_m,0.100000,0.100,h2_m,EX_h2_m
EX_no2_m,0.100000,0.100,no2_m,EX_no2_m


Okay, let's go right ahead and simulate growth. This will take about 20 minutes, which gives us some time to dive into some details.

In [28]:
from micom.workflows import grow
import pickle

growth_results = grow(manifest, "models", medium, tradeoff=0.5)
# Use a context manager so the file is flushed and closed after writing.
with open("growth.pickle", "wb") as handle:
    pickle.dump(growth_results, handle)

HBox(children=(FloatProgress(value=0.0, max=8.0), HTML(value='')))




If that takes too long or was aborted we can read it from the saved version as well.

In [29]:
import pickle

growth_results = pickle.load(open("treasure_chest/growth.pickle", "rb"))

FileNotFoundError: ignored
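The `FileNotFoundError` above appears when the saved file is not at the expected location. One way to guard against that, sketched here with only the standard library, is to try several candidate paths and take the first one that exists. `load_first_available` is a hypothetical helper, and the demo uses a stand-in dictionary rather than real growth results.

```python
import pickle
import tempfile
from pathlib import Path


def load_first_available(paths):
    """Unpickle the first existing file in `paths`, or return None."""
    for path in map(Path, paths):
        if path.exists():
            with path.open("rb") as handle:
                return pickle.load(handle)
    return None


# Demo with a stand-in object in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    local_copy = Path(tmp) / "growth.pickle"
    with local_copy.open("wb") as handle:
        pickle.dump({"demo": True}, handle)
    # The first candidate does not exist, so the second one is used.
    candidates = [Path(tmp) / "treasure_chest" / "growth.pickle", local_copy]
    loaded = load_first_available(candidates)

print(loaded)
```

In the notebook this could be called as `load_first_available(["treasure_chest/growth.pickle", "growth.pickle"])`. As always with pickle, only load files from sources you trust.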

What kind of results did we get? Well, `grow` returns a tuple of 3 data sets:

1. The predicted growth rate for all taxa in all samples
2. The import and export fluxes for each taxon and the external environment
3. Annotations for the fluxes mapping to other databases

The growth rates are pretty straightforward.

In [43]:
growth_results.growth_rates.head()

Unnamed: 0_level_0,abundance,growth_rate,reactions,metabolites,taxon,tradeoff,sample_id
compartments,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Bacteroides,0.736046,0.298356,3307,1887,Bacteroides,0.5,ERR1883210
Blautia,0.027486,0.011492,3108,1818,Blautia,0.5,ERR1883210
Faecalibacterium,0.056726,0.026825,1986,1472,Faecalibacterium,0.5,ERR1883210
Parabacteroides,0.042828,0.016625,2870,1747,Parabacteroides,0.5,ERR1883210
Roseburia,0.078364,0.028981,2357,1567,Roseburia,0.5,ERR1883210


More interesting are the exchange fluxes.

In [44]:
growth_results.exchanges

Unnamed: 0,taxon,sample_id,tolerance,reaction,flux,abundance,metabolite,direction
874,Faecalibacterium,ERR1883210,0.001,EX_ac(e),0.026445,0.056726,ac[e],export
901,Faecalibacterium,ERR1883210,0.001,EX_biomass(e),0.002999,0.056726,biomass[c],export
931,Faecalibacterium,ERR1883210,0.001,EX_dcyt(e),-0.011616,0.056726,dcyt[e],import
942,Faecalibacterium,ERR1883210,0.001,EX_for(e),0.001256,0.056726,for[e],export
966,Faecalibacterium,ERR1883210,0.001,EX_glyleu(e),-0.001945,0.056726,glyleu[e],import
...,...,...,...,...,...,...,...,...
35362,Cetobacterium,ERR1883315,0.001,EX_trp_L(e),-0.002037,0.266688,trp_L[e],import
35363,Cetobacterium,ERR1883315,0.001,EX_ttdca(e),-0.001022,0.266688,ttdca[e],import
35364,Cetobacterium,ERR1883315,0.001,EX_tyr_L(e),-0.003092,0.266688,tyr_L[e],import
35367,Cetobacterium,ERR1883315,0.001,EX_uri(e),-0.003911,0.266688,uri[e],import


So we see how much of each metabolite is either consumed or produced by each taxon in each sample. `tolerance` denotes the accuracy of the solver and tells you the smallest absolute flux that is likely different from zero (substantial flux). *All of the fluxes are normalized to 1g dry weight of bacteria*. So you can directly compare them between different taxa even if they are present in different abundances.
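As a small illustration of how `tolerance` can be used, the sketch below keeps only fluxes whose absolute value exceeds a threshold and derives the direction from the sign. The four rows are copied from the Faecalibacterium entries in the table above; the `substantial` and `direction` helpers are hypothetical and written in plain Python so they run without MICOM.

```python
# Rows copied from the exchange table above (Faecalibacterium, ERR1883210).
exchanges = [
    {"metabolite": "ac[e]", "flux": 0.026445, "tolerance": 0.001},
    {"metabolite": "dcyt[e]", "flux": -0.011616, "tolerance": 0.001},
    {"metabolite": "for[e]", "flux": 0.001256, "tolerance": 0.001},
    {"metabolite": "glyleu[e]", "flux": -0.001945, "tolerance": 0.001},
]


def substantial(rows, atol=None):
    """Keep rows with |flux| above `atol` (default: each row's solver tolerance)."""
    kept = []
    for row in rows:
        threshold = row["tolerance"] if atol is None else atol
        if abs(row["flux"]) > threshold:
            kept.append(row)
    return kept


def direction(row):
    """Positive fluxes are exports, negative fluxes are imports."""
    return "export" if row["flux"] > 0 else "import"


for row in substantial(exchanges, atol=1e-2):
    print(row["metabolite"], direction(row), row["flux"])
```

With the solver tolerance of 0.001 all four rows pass, while a stricter cutoff of 0.01 keeps only acetate export and `dcyt[e]` import.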

However, the names may not be very informative. For this we have our annotations. For instance, to figure out what `ac[e]` is (air conditioning?), we can do the following:

In [48]:
anns = growth_results.annotations
anns[anns.metabolite == "ac[e]"]

Unnamed: 0_level_0,metabolite,name,hmdb,inchi,kegg.compound,pubchem.compound,reaction
metabolite,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ac[e],ac[e],acetate,HMDB00042,"InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1",C00033,176,EX_ac(e)


Ohhh, it's acetate. Yeah that makes more sense 🕵️‍♀️. For the AGORA models you can also use the official VMH knowledge base at https://vmh.life maintained by Dr. Thiele's lab, which will give you rich information on metabolites and reactions. For instance, for acetate you can find quite a bit more info at https://www.vmh.life/#metabolite/ac.

# 📊 Visualizations

So we have seen that we get quite some data from the growth simulations. But how do we make sense of it? 

We will use the standard visualizations included in MICOM. Those are all functions that take in the growth results we obtained before and create a visualization in a standalone HTML file that bundles the plots and raw data and can be viewed directly in your browser.

The first thing we might want to look at is the growth rates for each taxon.

In [30]:
from micom.viz import *

viz = plot_growth(growth_results)

Normally we could just call `viz.view()` afterwards and it would open it in our browser. Since we are working in Colab that will not work. However, the plot function creates the file `growth_rates_[DATE].html` in your `materials` folder. To open it simply download that file and view it in your browser. There are some things going on, but the picture is not super clear yet. Let's continue.
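Because the file name contains a date, it can be handy to look up the most recent report programmatically before downloading it. The following sketch uses only the standard library; `newest_report` is a hypothetical helper and the file names in the demo are made up, so the exact date format is not assumed.

```python
import os
import tempfile
from pathlib import Path


def newest_report(folder, prefix):
    """Return the most recently modified `<prefix>_*.html` in `folder`, or None."""
    candidates = Path(folder).glob(f"{prefix}_*.html")
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Demo in a temporary folder with two made-up report files.
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in [("growth_rates_old.html", 1_000), ("growth_rates_new.html", 2_000)]:
        path = Path(tmp) / name
        path.write_text("<html></html>")
        os.utime(path, (mtime, mtime))  # set explicit modification times
    newest_name = newest_report(tmp, "growth_rates").name
    missing = newest_report(tmp, "fit")

print(newest_name)
```

In the notebook, `newest_report(".", "growth_rates")` would point to the file to download from the `materials` folder.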

## Growth niches

One really important question is what nutrients are consumed by the microbiota and which ones they produce. We provided nutrients in our medium but we don't actually know yet what was consumed by the microbiota. So let's check that out using the `plot_exchanges_per_sample` function.

In [32]:
plot_exchanges_per_sample(growth_results)

<micom.viz.core.Visualization at 0x7f1dd9612c18>

We can have a look at the results after downloading `materials/sample_exchanges_[DATE].html`. It would be even better if we could visualize which taxa compete for similar resources. We can create a niche plot by using `plot_exchanges_per_taxon`.

In [61]:
plot_exchanges_per_taxon(growth_results, perplexity=4, direction="import")

<micom.viz.core.Visualization at 0x7f1dcf8372e8>


This projects the full set of import or export fluxes into only 2 dimensions and arranges taxa so that more similar flux patterns lie close together. Thus, different taxa close to each other compete for similar resources or produce similar metabolites. The center of the plot thus signifies a more competitive nutrient space, whereas clusters on the outskirts denote more isolated niches.

You can tune [TSNE parameters](https://distill.pub/2016/misread-tsne/) such as perplexity to get a more meaningful grouping.
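The notion of "similar flux patterns" can be illustrated without running t-SNE at all. The sketch below is not the projection used by the plot; it only computes Euclidean distances between hypothetical import profiles, which is the kind of neighborhood structure a projection like t-SNE tries to preserve. Taxon names, metabolites, and fluxes are made up.

```python
import math

# Hypothetical import fluxes (metabolite -> flux magnitude) for three taxa.
imports = {
    "TaxonA": {"glc": 0.9, "fru": 0.1, "ac": 0.0},
    "TaxonB": {"glc": 0.8, "fru": 0.2, "ac": 0.0},
    "TaxonC": {"glc": 0.0, "fru": 0.0, "ac": 1.0},
}


def distance(profile_a, profile_b):
    """Euclidean distance between two flux profiles (missing metabolites count as 0)."""
    metabolites = set(profile_a) | set(profile_b)
    return math.sqrt(
        sum((profile_a.get(m, 0.0) - profile_b.get(m, 0.0)) ** 2 for m in metabolites)
    )


def nearest_neighbor(taxon, profiles):
    """Return the other taxon with the most similar flux profile."""
    others = (t for t in profiles if t != taxon)
    return min(others, key=lambda t: distance(profiles[taxon], profiles[t]))


for taxon in imports:
    print(taxon, "is closest to", nearest_neighbor(taxon, imports))
```

Here TaxonA and TaxonB consume nearly the same sugars and would end up close together, while TaxonC occupies its own niche.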

## Metabolic connections to a phenotype

That is all nice but how does that relate to recurrent *C. diff* infections? To answer that question we can use the `plot_fit` function. This will run a logistic LASSO regression with our disease status as the response and normalized fluxes as variables. In general, import fluxes are not as predictive since, well, they are more relevant to the bacteria than us. What we usually care about is the production flux of a particular metabolite. This is the total production flux into the extracellular volume and thus signifies exactly the flux the host has access to.

Since OSQP has a somewhat lower solver accuracy we will be conservative about what we consider substantial flux and will filter out fluxes smaller than 0.01 mmol/l.

In [41]:
from micom.viz import *

manifest.index = manifest.sample_id
pheno = manifest.disease_stat

pl = plot_fit(growth_results, pheno, atol=1e-2, flux_type="production")


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exchanges["meta"] = meta[exchanges.sample_id].values
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exchanges["description"] = anns.loc[exchanges.metabolite, "name"].values


This will again create a file `fit_[DATE].html` that you can open. You will see the production fluxes most predictive of the phenotype and compare them across the group. 
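To give a feeling for what a "total production flux" could look like numerically, here is a sketch that sums the export fluxes of all taxa in a sample, weighted by their relative abundance, after applying the 0.01 cutoff. The abundance weighting is an assumption made for this illustration (fluxes are reported per gram dry weight of each taxon); how `plot_fit` computes it internally may differ, and all rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-taxon exchange rows; flux is per gram dry weight of that taxon.
rows = [
    {"sample_id": "S1", "taxon": "TaxonA", "metabolite": "ac[e]", "flux": 0.50, "abundance": 0.8},
    {"sample_id": "S1", "taxon": "TaxonB", "metabolite": "ac[e]", "flux": 0.25, "abundance": 0.2},
    {"sample_id": "S1", "taxon": "TaxonB", "metabolite": "for[e]", "flux": 0.005, "abundance": 0.2},
    {"sample_id": "S2", "taxon": "TaxonA", "metabolite": "ac[e]", "flux": -0.30, "abundance": 1.0},
]


def production_fluxes(rows, atol=1e-2):
    """Abundance-weighted sum of export fluxes above `atol` per sample and metabolite."""
    totals = defaultdict(float)
    for row in rows:
        if row["flux"] > atol:  # exports only, above the flux cutoff
            key = (row["sample_id"], row["metabolite"])
            totals[key] += row["flux"] * row["abundance"]
    return dict(totals)


production = production_fluxes(rows)
for (sample_id, metabolite), flux in production.items():
    print(sample_id, metabolite, round(flux, 3))
```

The small formate export falls below the cutoff and the import in sample `S2` is ignored, so only acetate production in `S1` remains.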


# 🔵 Addendum


## Choosing a tradeoff value

Even if you don't have growth rates available you can still use your data to choose a decent tradeoff value. This can be done by choosing the largest tradeoff that still allows growth for the majority of the taxa that you observed in the data (they are present so they should be able to grow). This can be done with the `tradeoff` workflow in MICOM, which runs cooperative tradeoff with varying tradeoff values and whose results can be visualized with the `plot_tradeoff` function.

In [8]:
from micom.workflows import tradeoff
import micom

tradeoff_results = tradeoff(manifest, "models", medium, threads=2)
tradeoff_results.to_csv("tradeoff.csv", index=False)

plot_tradeoff(tradeoff_results, tolerance=1e-4)

HBox(children=(FloatProgress(value=0.0, max=8.0), HTML(value='')))




After opening `tradeoff_[DATE].html` you will see that for our example here all tradeoff values work great. This is because we modeled very few taxa, which keeps the competition down. If you allowed less abundant taxa in the models this would change drastically. For instance, here is an example from a colorectal cancer data set:

[![tradeoff example](https://raw.githubusercontent.com/micom-dev/q2-micom/master/docs/assets/tradeoff.png)](https://micom-dev.github.io/q2-micom/assets/tradeoff/data/index.html)

You can see how not using cooperative tradeoff would give you nonsense results where only 10% of all observed taxa grow. A tradeoff value of 0.3 would probably be a good choice for this data set.
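The selection rule described in this section, picking the largest tradeoff that still lets the majority of taxa grow, can be sketched in a few lines. The growth rates below are hypothetical, the helpers are not part of MICOM, and the 50% majority threshold and the growth tolerance are assumptions you may want to adjust.

```python
# Hypothetical growth rates per tradeoff value (one list entry per taxon).
growth_by_tradeoff = {
    1.0: [0.9, 0.0, 0.0, 0.0, 0.0],
    0.7: [0.6, 0.2, 0.0, 0.0, 0.0],
    0.5: [0.4, 0.2, 0.1, 0.05, 0.0],
    0.3: [0.3, 0.1, 0.1, 0.05, 0.02],
}


def fraction_growing(rates, tolerance=1e-4):
    """Share of taxa whose growth rate exceeds the tolerance."""
    return sum(rate > tolerance for rate in rates) / len(rates)


def largest_viable_tradeoff(results, min_fraction=0.5, tolerance=1e-4):
    """Largest tradeoff where more than `min_fraction` of the taxa grow, else None."""
    viable = [
        value
        for value, rates in results.items()
        if fraction_growing(rates, tolerance) > min_fraction
    ]
    return max(viable) if viable else None


best = largest_viable_tradeoff(growth_by_tradeoff)
print("suggested tradeoff:", best)
```

A stricter requirement, for example `min_fraction=0.9`, pushes the suggestion toward a smaller tradeoff value because more taxa have to grow.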