# Genome-scale model annotation
The goal of this notebook is to achieve high annotation scores from [MEMOTE](https://memote.io/).

This notebook annotates a preliminary Rhodococcus opacus PD630 genome-scale model reconstruction that was generated by [CarveMe](https://pubmed.ncbi.nlm.nih.gov/30192979/) using the [2011 genome from the Broad Institute](https://www.ncbi.nlm.nih.gov/assembly/GCF_000234335.1).<br>
### CarveMe Instructions
The draft reconstruction (Ropacus_carveme_grampos.xml) was generated with the following commands:

<ol>
<li>carve --refseq GCF_000234335.1 -o Ropacus_carveme.xml </li>
<li>gapfill Ropacus_carveme.xml -m M9,LB -o new_model.xml</li>
</ol>

### Annotation Methods (Repeated for metabolites, reactions, and genes)
<ol>
<li>Get annotations for the components of the R. opacus model from the BiGG universal model</li>
<li>Convert the BiGG list-of-lists annotation structure to a dictionary</li>
<li>Relabel the keys of the annotation dictionary to match MEMOTE's expectations</li>
</ol>
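
The three steps above can be sketched on plain Python data, without loading either model. This is only an illustration under stated assumptions: the function name `to_memote_annotation` and the two-entry key map are hypothetical, and the sample pairs are copied from the BiGG glucose annotation shown later in this notebook. Unlike the notebook cells, the sketch skips labels that have no mapping instead of raising an error.

```python
# Hypothetical helper illustrating the three annotation steps on plain data.
def to_memote_annotation(pairs, key_map):
    """Convert BiGG [label, url] pairs into a MEMOTE-style dictionary."""
    as_dict = dict(pairs)  # step 2: list of lists -> dictionary
    return {
        key_map[label]: url.rsplit('/', 1)[-1]  # step 3: relabel key, keep id only
        for label, url in as_dict.items()
        if label in key_map  # skip labels with no known mapping
    }

# Step 1 stand-in: annotation pairs as stored in the BiGG universal model
bigg_pairs = [
    ['KEGG Compound', 'http://identifiers.org/kegg.compound/C00031'],
    ['BioCyc', 'http://identifiers.org/biocyc/META:Glucopyranose'],
]
key_map = {'KEGG Compound': 'kegg.compound', 'BioCyc': 'biocyc'}

annotation = to_memote_annotation(bigg_pairs, key_map)
print(annotation)
```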

# Set up imports and initial models

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from IPython.display import IFrame
import numpy as np
import pandas as pd
import json
import urllib
import cobra
import cplex
import os
import requests
import collections

Get initial R. opacus model

In [2]:
model = cobra.io.read_sbml_model("../GSMs/Ropacus_carveme_grampos.xml")
model

0,1
Name,Ropacus_carveme_grampos
Memory address,0x07fc4c5134dd0
Number of metabolites,1582
Number of reactions,2382
Number of groups,0
Objective expression,1.0*Growth - 1.0*Growth_reverse_699ae
Compartments,"cytosol, periplasm, extracellular space"


Get BiGG universal model (this step takes about 45 seconds)

In [3]:
%%time
bigg_universal = cobra.io.load_json_model("../GSMs/universal_model.json")
bigg_universal

CPU times: user 40 s, sys: 628 ms, total: 40.7 s
Wall time: 40.7 s


0,1
Name,bigg_universal
Memory address,0x07fc570440550
Number of metabolites,15638
Number of reactions,28301
Number of groups,0
Objective expression,0
Compartments,


Check initial MEMOTE performance (the report renders in JupyterHub but not on GitHub)

In [4]:
IFrame('../memotes/ropacus_carveme_grampos.html', 1500, 800)

In [5]:
print(f'The model has {len(model.reactions)} reactions, {len(model.metabolites)} metabolites, and {len(model.genes)} genes.')

# Annotate Metabolites

Check if any metabolites in R. opacus model are not in the universal model

In [6]:
for m in model.metabolites:
    # DictList membership accepts an id string, so the id list is not rebuilt on every iteration
    if m.id not in bigg_universal.metabolites:
        print(f'The metabolite with name {m.name} and id {m.id} is not in the universal model')

The metabolite with name L Cystathionine C7H14N2O4S and id cysth__L_c is not in the universal model


Check if Cystathionine is duplicated in the R. opacus model

In [7]:
for m in model.metabolites:
    if 'Cystathionine' in m.name:
        print(f'There is a metabolite with name {m.name} and {m.id} in the model involved in {len(m.reactions)} reactions')

There is a metabolite with name L-Cystathionine and cyst__L_c in the model involved in 4 reactions
There is a metabolite with name L Cystathionine C7H14N2O4S and cysth__L_c in the model involved in 2 reactions


Decide which L-Cystathionine to remove

In [8]:
for r in model.metabolites.get_by_id('cyst__L_c').reactions:
    print(f'{r.id}: {r.reaction}')

print()

for r in model.metabolites.get_by_id('cysth__L_c').reactions:
    print(f'{r.id}: {r.reaction}')

CYSTL: cyst__L_c + h2o_c --> hcys__L_c + nh4_c + pyr_c
SHSL1: cys__L_c + suchms_c --> cyst__L_c + h_c + succ_c
CYSTGL: cyst__L_c + h2o_c --> 2obut_c + cys__L_c + nh4_c
CYSTS: hcys__L_c + ser__L_c --> cyst__L_c + h2o_c

CYSTGL_1: cysth__L_c + h2o_c --> 2obut_c + cys__L_c + nh4_c
CYSTS_2: hcys__L_c + ser__L_c --> cysth__L_c + h2o_c
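
The redundancy can also be checked mechanically rather than by eye. The sketch below transcribes the four relevant reaction strings from the output above into plain dictionaries and compares them after renaming cysth__L_c to cyst__L_c. The helper name `rename_metabolite` is hypothetical, and the stoichiometries are hand-copied rather than read from the model.

```python
# Stoichiometries transcribed from the reaction strings printed above
# (negative = consumed, positive = produced).
CYSTGL = {'cyst__L_c': -1, 'h2o_c': -1, '2obut_c': 1, 'cys__L_c': 1, 'nh4_c': 1}
CYSTGL_1 = {'cysth__L_c': -1, 'h2o_c': -1, '2obut_c': 1, 'cys__L_c': 1, 'nh4_c': 1}
CYSTS = {'hcys__L_c': -1, 'ser__L_c': -1, 'cyst__L_c': 1, 'h2o_c': 1}
CYSTS_2 = {'hcys__L_c': -1, 'ser__L_c': -1, 'cysth__L_c': 1, 'h2o_c': 1}

def rename_metabolite(stoichiometry, old_id, new_id):
    """Return a copy of the stoichiometry with old_id replaced by new_id."""
    return {(new_id if met == old_id else met): coef
            for met, coef in stoichiometry.items()}

# After mapping cysth__L_c onto cyst__L_c, each pair should be identical
is_dup_cystgl = rename_metabolite(CYSTGL_1, 'cysth__L_c', 'cyst__L_c') == CYSTGL
is_dup_cysts = rename_metabolite(CYSTS_2, 'cysth__L_c', 'cyst__L_c') == CYSTS
print(is_dup_cystgl, is_dup_cysts)
```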


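The metabolite that is missing from the universal model (cysth__L_c) appears only in CYSTGL_1 and CYSTS_2, which duplicate CYSTGL and CYSTS. Remove those two redundant reactions; remove_orphans=True also removes the metabolite once no reaction uses it.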

In [9]:
model.reactions.get_by_id('CYSTGL_1').remove_from_model(remove_orphans=True)
model.reactions.get_by_id('CYSTS_2').remove_from_model(remove_orphans=True)

Check if all the metabolites in the R. opacus model are now in the BiGG model

In [10]:
for m in model.metabolites:
    # DictList membership accepts an id string, so the id list is not rebuilt on every iteration
    if m.id not in bigg_universal.metabolites:
        print(f'The metabolite with name {m.name} and id {m.id} is not in the universal model')

No output indicates that all the metabolites in the R. opacus model are also in the BiGG universal model

Check how metabolites are annotated in the BiGG universal model

In [11]:
bigg_universal.metabolites.get_by_id('glc__D_c').annotation

[['KEGG Compound', 'http://identifiers.org/kegg.compound/C00031'],
 ['CHEBI', 'http://identifiers.org/chebi/CHEBI:12965'],
 ['CHEBI', 'http://identifiers.org/chebi/CHEBI:17634'],
 ['CHEBI', 'http://identifiers.org/chebi/CHEBI:20999'],
 ['CHEBI', 'http://identifiers.org/chebi/CHEBI:4167'],
 ['KEGG Drug', 'http://identifiers.org/kegg.drug/D00009'],
 ['Human Metabolome Database', 'http://identifiers.org/hmdb/HMDB00122'],
 ['Human Metabolome Database', 'http://identifiers.org/hmdb/HMDB06564'],
 ['BioCyc', 'http://identifiers.org/biocyc/META:Glucopyranose'],
 ['MetaNetX (MNX) Chemical',
  'http://identifiers.org/metanetx.chemical/MNXM41'],
 ['InChI Key', 'https://identifiers.org/inchikey/WQZGKKKJIJFFOK-GASJEMHNSA-N'],
 ['SEED Compound', 'http://identifiers.org/seed.compound/cpd00027'],
 ['SEED Compound', 'http://identifiers.org/seed.compound/cpd26821']]

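We see that the BiGG annotations are lists of lists, which need to be converted to dictionaries. <br>
First we test casting the list of lists to a dictionary. Note that dict() keeps only the last value for a repeated key, so of the four CHEBI entries only CHEBI:4167 survives.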

In [12]:
dict(bigg_universal.metabolites.get_by_id('glc__D_c').annotation)

{'KEGG Compound': 'http://identifiers.org/kegg.compound/C00031',
 'CHEBI': 'http://identifiers.org/chebi/CHEBI:4167',
 'KEGG Drug': 'http://identifiers.org/kegg.drug/D00009',
 'Human Metabolome Database': 'http://identifiers.org/hmdb/HMDB06564',
 'BioCyc': 'http://identifiers.org/biocyc/META:Glucopyranose',
 'MetaNetX (MNX) Chemical': 'http://identifiers.org/metanetx.chemical/MNXM41',
 'InChI Key': 'https://identifiers.org/inchikey/WQZGKKKJIJFFOK-GASJEMHNSA-N',
 'SEED Compound': 'http://identifiers.org/seed.compound/cpd26821'}
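
If the repeated entries (for example the extra CHEBI and HMDB ids) should be kept, one alternative is to group the URLs per label into lists. This is a sketch of that alternative, not what the notebook does: the function name `group_annotation_pairs` is hypothetical, and whether list-valued annotations are handled as intended by later steps is an assumption that should be verified before adopting it.

```python
from collections import defaultdict

def group_annotation_pairs(pairs):
    """Collect every URL per label so repeated labels are not overwritten."""
    grouped = defaultdict(list)
    for label, url in pairs:
        grouped[label].append(url)
    return dict(grouped)

# A shortened copy of the glucose annotation shown above
pairs = [
    ['KEGG Compound', 'http://identifiers.org/kegg.compound/C00031'],
    ['CHEBI', 'http://identifiers.org/chebi/CHEBI:12965'],
    ['CHEBI', 'http://identifiers.org/chebi/CHEBI:17634'],
]

grouped = group_annotation_pairs(pairs)
print(grouped['CHEBI'])      # both CHEBI URLs are kept
print(dict(pairs)['CHEBI'])  # plain dict() keeps only the last one
```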

Now we apply this method to all metabolites in the R. opacus GSM

In [13]:
for m in model.metabolites:
    m.annotation = dict(bigg_universal.metabolites.get_by_id(m.id).annotation)

Check annotations on R. opacus model metabolites

In [14]:
model.metabolites.get_by_id('glc__D_c').annotation

{'KEGG Compound': 'http://identifiers.org/kegg.compound/C00031',
 'CHEBI': 'http://identifiers.org/chebi/CHEBI:4167',
 'KEGG Drug': 'http://identifiers.org/kegg.drug/D00009',
 'Human Metabolome Database': 'http://identifiers.org/hmdb/HMDB06564',
 'BioCyc': 'http://identifiers.org/biocyc/META:Glucopyranose',
 'MetaNetX (MNX) Chemical': 'http://identifiers.org/metanetx.chemical/MNXM41',
 'InChI Key': 'https://identifiers.org/inchikey/WQZGKKKJIJFFOK-GASJEMHNSA-N',
 'SEED Compound': 'http://identifiers.org/seed.compound/cpd26821'}

Convert keys in metabolite annotation dictionaries to be MEMOTE compatible. <br>
Also convert values from URLs to only the portion of the URL after the final '/'

In [15]:
# Map BiGG annotation labels to the namespace keys MEMOTE expects
memote_key_converter = {'BioCyc': 'biocyc',
                        'CHEBI': 'chebi',
                        'Human Metabolome Database': 'hmdb',
                        'InChI Key': 'inchikey',
                        'KEGG Compound': 'kegg.compound',
                        'KEGG Drug': 'kegg.drug',
                        'KEGG Glycan': 'kegg.glycan',
                        'LipidMaps': 'lipidmaps',
                        'MetaNetX (MNX) Chemical': 'metanetx.chemical',
                        'Reactome Compound': 'reactome',
                        'SEED Compound': 'seed.compound'}

for m in model.metabolites:
    if m.annotation:
        # Relabel each key and keep only the identifier after the final '/'
        m.annotation = {memote_key_converter[k]: v.rsplit('/', 1)[-1]
                        for k, v in m.annotation.items()}
    m.annotation['bigg.metabolite'] = m.id

Add Systems Biology Ontology (SBO) terms to metabolites. SBO:0000247 is the term for a simple chemical. <br>
[https://www.ebi.ac.uk/sbo/main/SBO:0000247](https://www.ebi.ac.uk/sbo/main/SBO:0000247)

In [16]:
for m in model.metabolites:
    m.annotation['sbo'] = 'SBO:0000247'

Check how metabolite annotations look now

In [17]:
model.metabolites.get_by_id('glc__D_c').annotation

{'kegg.compound': 'C00031',
 'chebi': 'CHEBI:4167',
 'kegg.drug': 'D00009',
 'hmdb': 'HMDB06564',
 'biocyc': 'META:Glucopyranose',
 'metanetx.chemical': 'MNXM41',
 'inchikey': 'WQZGKKKJIJFFOK-GASJEMHNSA-N',
 'seed.compound': 'cpd26821',
 'bigg.metabolite': 'glc__D_c',
 'sbo': 'SBO:0000247'}
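
The key conversion used in this section indexes memote_key_converter directly, so a BiGG label that is missing from the converter would raise a KeyError. A minimal sketch of a pre-check that lists such labels first, assuming annotations are plain dictionaries; `unmapped_labels` is a hypothetical helper, and 'PubChem Compound' with its placeholder URL is a made-up entry used only to trigger the check.

```python
def unmapped_labels(annotations, key_map):
    """Return sorted labels that occur in the annotations but not in key_map."""
    return sorted({label
                   for annotation in annotations
                   for label in annotation
                   if label not in key_map})

key_map = {'BioCyc': 'biocyc', 'CHEBI': 'chebi'}
annotations = [
    {'BioCyc': 'http://identifiers.org/biocyc/META:Glucopyranose'},
    {'CHEBI': 'http://identifiers.org/chebi/CHEBI:4167',
     'PubChem Compound': 'http://example.org/placeholder'},  # hypothetical label
]

missing = unmapped_labels(annotations, key_map)
print(missing)
```

An empty list means every label has a mapping and the conversion can run without a KeyError.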

# Annotate Reactions

Check which reactions in the R. opacus model are not in the universal model

In [18]:
for r in model.reactions:
    # DictList membership accepts an id string, so the id list is not rebuilt on every iteration
    if r.id not in bigg_universal.reactions:
        print(f'{r.name} with the id, {r.id}, is not in the universal model')

5 carboxymethyl 2 hydroxymuconate delta isomerase with the id, 5CM2HMUDI, is not in the universal model
Alcohol dehydrogenase (propanol) with the id, ALCD3, is not in the universal model
3 hydroxyacyl CoA dehydratase  3 hydroxybutanoyl CoA  with the id, ECOAH1_2, is not in the universal model
3 hydroxyacyl CoA dehydratase  3 hydroxydodecanoyl CoA  with the id, ECOAH5_2, is not in the universal model
NAD(P)H-flavin oxidoreductase with the id, FLDO, is not in the universal model
H2St with the id, H2St, is not in the universal model
3 hydroxyacyl CoA dehydrogenase  acetoacetyl CoA  with the id, HACD1_2, is not in the universal model
Hydroxymethylglutaryl CoA synthase (ir) with the id, HMGCOASi, is not in the universal model
HOPNTAL3 with the id, HOPNTAL3, is not in the universal model
HSDx with the id, HSDx, is not in the universal model
Acetohydroxy acid isomeroreductase with the id, KARA1i, is not in the universal model
3-ketoacyl-CoA thiolase with the id, KAT2, is not in the universal 

Check the current reaction annotations in the R. opacus model

In [19]:
for r in model.reactions:
    if r.annotation != {}:
        print(r.id)

No output indicates that no reaction has an annotation yet. <br>
Get reaction annotations from the BiGG universal model. Reactions that are not in the universal model are skipped.

In [20]:
for r in model.reactions:
    if r.id in bigg_universal.reactions:
        r.annotation = dict(bigg_universal.reactions.get_by_id(r.id).annotation)

Check reaction annotation format

In [21]:
model.reactions.get_by_id('PGI').annotation

{'EC Number': 'http://identifiers.org/ec-code/5.3.1.9',
 'BioCyc': 'http://identifiers.org/biocyc/META:PGLUCISOM-RXN',
 'MetaNetX (MNX) Equation': 'http://identifiers.org/metanetx.reaction/MNXR102535'}

Convert keys in reaction annotation dictionaries to be MEMOTE compatible. <br>
Also convert values from URLs to only the portion of the URL after the final '/'

In [22]:
# Map BiGG reaction annotation labels to the namespace keys MEMOTE expects
memote_key_converter = {'BioCyc': 'biocyc',
                        'EC Number': 'ec-code',
                        'KEGG Reaction': 'kegg.reaction',
                        'MetaNetX (MNX) Equation': 'metanetx.reaction',
                        'RHEA': 'rhea',
                        'Reactome Reaction': 'reactome',
                        'SBO': 'sbo',
                        'SEED Reaction': 'seed.reaction'}

for r in model.reactions:
    if r.annotation:
        # Relabel each key and keep only the identifier after the final '/'
        r.annotation = {memote_key_converter[k]: v.rsplit('/', 1)[-1]
                        for k, v in r.annotation.items()}

Check reaction annotation in R. opacus model

In [23]:
model.reactions.get_by_id('PGI').annotation

{'ec-code': '5.3.1.9',
 'biocyc': 'META:PGLUCISOM-RXN',
 'metanetx.reaction': 'MNXR102535'}

Add systems biology ontology for reactions <br>
exchange reaction: [http://www.ebi.ac.uk/sbo/main/SBO:0000627](http://www.ebi.ac.uk/sbo/main/SBO:0000627) <br>
sink reaction: [http://www.ebi.ac.uk/sbo/main/SBO:0000632](http://www.ebi.ac.uk/sbo/main/SBO:0000632) <br>
growth reaction: [http://www.ebi.ac.uk/sbo/main/SBO:0000629](http://www.ebi.ac.uk/sbo/main/SBO:0000629) <br>
demand reaction: [http://www.ebi.ac.uk/sbo/main/SBO:0000628](http://www.ebi.ac.uk/sbo/main/SBO:0000628) <br>
transport reaction: [http://www.ebi.ac.uk/sbo/main/SBO:0000655](http://www.ebi.ac.uk/sbo/main/SBO:0000655) <br>
biochemical reaction: [http://www.ebi.ac.uk/sbo/main/SBO:0000176](http://www.ebi.ac.uk/sbo/main/SBO:0000176) <br>

In [24]:
for r in model.reactions:
    if r.id.startswith('EX_'):
        r.annotation['sbo'] = 'SBO:0000627'
    elif r.id.startswith('sink_'):
        r.annotation['sbo'] = 'SBO:0000632'
    elif r.id.startswith('Growth'):
        r.annotation['sbo'] = 'SBO:0000629'
    elif r.id.startswith('ATPM'):
        r.annotation['sbo'] = 'SBO:0000628'
    elif len(r.compartments) > 1:
        r.annotation['sbo'] = 'SBO:0000655'
    else:
        r.annotation['sbo'] = 'SBO:0000176'
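
The same prefix rules can be written as a pure function, which makes them easy to check on individual ids before touching the model. This is a sketch: the function name `sbo_term_for_reaction` and the sample ids passed to it are illustrative, and the compartment count stands in for len(r.compartments).

```python
# SBO terms keyed by reaction id prefix, checked in the same order as the cell above
PREFIX_TO_SBO = [
    ('EX_', 'SBO:0000627'),     # exchange reaction
    ('sink_', 'SBO:0000632'),   # sink reaction
    ('Growth', 'SBO:0000629'),  # growth reaction
    ('ATPM', 'SBO:0000628'),    # labeled as demand, following the cell above
]

def sbo_term_for_reaction(reaction_id, n_compartments):
    """Pick an SBO term from the id prefix, then from the compartment count."""
    for prefix, term in PREFIX_TO_SBO:
        if reaction_id.startswith(prefix):
            return term
    # More than one compartment is treated as transport, otherwise biochemical
    return 'SBO:0000655' if n_compartments > 1 else 'SBO:0000176'

# Illustrative ids with assumed compartment counts
for rid, n in [('EX_glc__D_e', 1), ('Growth', 1), ('PGI', 1), ('GLCpts', 2)]:
    print(rid, sbo_term_for_reaction(rid, n))
```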

In [25]:
model.reactions.get_by_id('PGI').annotation

{'ec-code': '5.3.1.9',
 'biocyc': 'META:PGLUCISOM-RXN',
 'metanetx.reaction': 'MNXR102535',
 'sbo': 'SBO:0000176'}

In [26]:
for r in model.reactions:
    if r.annotation == {}:
        print(r.id)

No output indicates that all reactions have at least some annotation
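
Having some annotation is a weaker condition than having database cross-references: reactions that were not found in the universal model carry only an SBO term. A minimal sketch for counting how many entries carry each annotation key, assuming annotations are plain dictionaries as produced above. The function name `annotation_key_counts` is hypothetical, and the second sample entry is an assumed example of a reaction with only an SBO term.

```python
from collections import Counter

def annotation_key_counts(annotations):
    """Count how many annotation dictionaries contain each key."""
    counts = Counter()
    for annotation in annotations:
        counts.update(annotation.keys())
    return counts

sample_annotations = [
    # PGI annotation as shown above
    {'ec-code': '5.3.1.9', 'biocyc': 'META:PGLUCISOM-RXN',
     'metanetx.reaction': 'MNXR102535', 'sbo': 'SBO:0000176'},
    # Assumed example of a reaction that is absent from the universal model
    {'sbo': 'SBO:0000176'},
]

counts = annotation_key_counts(sample_annotations)
for key, n in sorted(counts.items()):
    print(f'{key}: {n} of {len(sample_annotations)}')
```

In the notebook, the list could be built from the model with [r.annotation for r in model.reactions].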

# Annotate Genes
Check how many genes are in R. opacus model

In [27]:
print(f'There are {len(model.genes)} genes in the model')

There are 1576 genes in the model


Import r_opacus_gene_converter.csv as a pandas DataFrame and display the first 15 rows. <br>
The index is the current gene id in the model, and GeneID is the NCBI gene id.

In [28]:
gene_converter = pd.read_csv('../gene_converter/r_opacus_gene_converter.csv', index_col = 0)
gene_converter.head(15)

Unnamed: 0,GeneID
WP_187300246_1,1897646000.0
WP_025432775_1,645061500.0
WP_005248578_1,491390700.0
WP_025433613_1,645062300.0
WP_005249637_1,491391800.0
WP_005248999_1,491391100.0
WP_025433301_1,645062000.0
WP_005246696_1,491388800.0
WP_005244822_1,491386900.0
WP_005250095_1,491392200.0


Apply gene ids to genes in the model

In [29]:
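for g in model.genes:
    if g.id in gene_converter.index:
        try:
            # GeneID is read as a float, so cast to int before converting to a string
            g.annotation['ncbiprotein'] = str(int(gene_converter.loc[g.id, 'GeneID']))
        except (ValueError, TypeError):
            # Raised when GeneID is missing (NaN) or is not a single number
            print(f'Problem with gene: {g.id}')
    else:
        print(f'gene {g.id} not in gene_converter.csv')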

Problem with gene: spontaneous
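
The float-to-string conversion can be isolated in a small helper so that missing values are handled explicitly instead of through an exception. This is a sketch under an assumption: that the failure for `spontaneous` comes from a missing (NaN) GeneID in the CSV, which the notebook does not state. The function name `format_gene_id` is hypothetical.

```python
import math

def format_gene_id(value):
    """Return the id as a digit string, or None when the value is missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return str(int(value))

print(format_gene_id(491381865.0))   # id stored as a float in the DataFrame
print(format_gene_id(float('nan')))  # missing id
```

A caller can then skip or specially label genes for which the helper returns None.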


In [30]:
model.genes.get_by_id('spontaneous').annotation = {'ncbiprotein': 'spontaneous'}

Add SBO terms to gene annotations <br>
gene: [http://www.ebi.ac.uk/sbo/main/SBO:0000243](http://www.ebi.ac.uk/sbo/main/SBO:0000243)

In [31]:
for g in model.genes:
    g.annotation['sbo'] = 'SBO:0000243'

Check that all genes have an annotation

In [32]:
for g in model.genes:
    if g.annotation == {}:
        print(g.id)

No output indicates that all genes are annotated <br>
Check what gene annotations in R. opacus model look like

In [33]:
model.genes.get_by_id('WP_005239747_1').annotation

{'ncbiprotein': '491381865', 'sbo': 'SBO:0000243'}

# Export annotated model

In [34]:
model.id = 'ropacus_annotated'
model.name = 'Rhodococcus opacus PD630 annotated'
model.description = 'Rhodococcus opacus PD630 model with metabolite, reaction, and gene annotations. Model reactions have not been curated'

In [35]:
cobra.io.write_sbml_model(model, "../GSMs/Ropacus_annotated.xml")

Check MEMOTE output of annotated model

In [36]:
IFrame('../memotes/ropacus_annotated.html', 1500, 800)