# Curating a genome-scale model (first pass)

This notebook has been tested on [jprime.lbl.gov](jprime.lbl.gov) with the biodesign_3.7 kernel.

It starts with the model that gets output by the annotation_gr.ipynb notebook.

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from IPython.display import IFrame
import numpy as np
import pandas as pd
import json
import urllib
import cobra
import cplex
import os
import requests
import collections

# Getting and preparing the genome-scale model

## Load *R. opacus* NCBI model generated by CarveMe

In [2]:
model = cobra.io.read_sbml_model("GSMs/Ropacus_annotated.xml")
model

0,1
Name,ropacus_annotated
Memory address,0x07fe390cd6890
Number of metabolites,1581
Number of reactions,2380
Number of groups,0
Objective expression,1.0*Growth - 1.0*Growth_reverse_699ae
Compartments,"cytosol, periplasm, extracellular space"


## Starting MEMOTE Output

In [3]:
IFrame('memotes/ropacus_carveme_grampos.htm', 1500, 800)

# Fix unbalanced reactions

Define a function that returns whether a reaction should be balanced

In [4]:
def should_be_balanced(r):
    if r.id.startswith('EX_') or r.id.startswith('sink_') or r.id.startswith('Growth'):
        return False
    else:
        return True
    
def is_balanced(r):
    # Return the comparison; without `return` the function always yields None
    return abs(sum(r.check_mass_balance().values())) < 10

## Check how many reactions are unbalanced 

In [5]:
unbalanced = [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]
unbalanced_but_okay = [r for r in model.reactions if not should_be_balanced(r) and r.check_mass_balance() != {}]
balanced = [r for r in model.reactions if r.check_mass_balance() == {}]

print(f'{len(unbalanced)} of the {len(model.reactions)} reactions in the model are wrongly unbalanced')
print(f'{len(unbalanced_but_okay)} of the {len(model.reactions)} reactions in the model are properly unbalanced')
print(f'{len(balanced)} of the {len(model.reactions)} reactions in the model are balanced')

850 of the 2380 reactions in the model are wrongly unbalanced
228 of the 2380 reactions in the model are properly unbalanced
1302 of the 2380 reactions in the model are balanced
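
The three list comprehensions above are repeated after every curation step below. A minimal sketch of a helper that does the same classification in one pass could look like this. `classify_reactions` and `fake_reaction` are hypothetical names, and the stand-in reactions (with made-up IDs and mass errors) exist only so the example runs without loading the model.

```python
from types import SimpleNamespace


def should_be_balanced(r):
    # Exchange, sink, and growth reactions are allowed to be unbalanced.
    return not r.id.startswith(('EX_', 'sink_', 'Growth'))


def classify_reactions(reactions):
    """Split reactions into (wrongly unbalanced, properly unbalanced, balanced)."""
    wrongly, properly, balanced = [], [], []
    for r in reactions:
        if r.check_mass_balance() == {}:
            balanced.append(r)
        elif should_be_balanced(r):
            wrongly.append(r)
        else:
            properly.append(r)
    return wrongly, properly, balanced


def fake_reaction(rxn_id, mass_error):
    # Hypothetical stand-in exposing only the two attributes the helper uses.
    return SimpleNamespace(id=rxn_id, check_mass_balance=lambda: mass_error)


# Made-up reactions for illustration only.
reactions = [
    fake_reaction('PGI', {}),
    fake_reaction('EX_glc__D_e', {'C': -6.0, 'H': -12.0, 'O': -6.0}),
    fake_reaction('ACPS1', {'H': 1.0}),
]
wrongly, properly, balanced = classify_reactions(reactions)
print(len(wrongly), len(properly), len(balanced))
```

With a loaded model, the same call would be `classify_reactions(model.reactions)`.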


## Check how many metabolites have multiple formulas
This is the reason many of the reactions are unbalanced

In [6]:
multiple_formulas = [m for m in model.metabolites if len(m.formula.split(';')) > 1]

print(f'{len(multiple_formulas)} of {len(model.metabolites)} metabolites have multiple formulas')

162 of 1581 metabolites have multiple formulas


In [7]:
for i in range(1,5):
    num_formulas = [m for m in model.metabolites if len(m.formula.split(';')) == i]
    print(f'{len(num_formulas)} of {len(model.metabolites)} metabolites have {i} formula(s)')

1419 of 1581 metabolites have 1 formula(s)
156 of 1581 metabolites have 2 formula(s)
5 of 1581 metabolites have 3 formula(s)
1 of 1581 metabolites have 4 formula(s)


Check some examples of metabolites with multiple formulas

In [8]:
for m in multiple_formulas[:5]:
    print(m.id, m.formula)

pi_c HPO4;HO4P
nh4_c H4N;NH4
ppi_c HO7P2;P2HO7
hco3_c CHO3;HCO3
hco3_e CHO3;HCO3


Some of these metabolites simply list the same formula more than once, with the elements written in a different order

### Check how many of the reactions involve a metabolite with multiple formulas
Define a function to check whether a reaction has at least one metabolite with multiple formulas

In [9]:
def has_metabolite_with_multiple_formulas(r):
    for m in r.metabolites:
        if len(m.formula.split(';')) > 1:
            return True
    return False

Check how many reactions have at least one metabolite with multiple formulas

In [10]:
unbalanced_multiple_formulas = [r for r in unbalanced if has_metabolite_with_multiple_formulas(r)]
unbalanced_but_okay_multiple_formulas = [r for r in unbalanced_but_okay if has_metabolite_with_multiple_formulas(r)]
balanced_multiple_formulas   = [r for r in   balanced if has_metabolite_with_multiple_formulas(r)]

print(f'{len(unbalanced_multiple_formulas)} of the {len(unbalanced)} improperly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(unbalanced_but_okay_multiple_formulas)} of the {len(unbalanced_but_okay)} properly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(balanced_multiple_formulas)} of the {len(balanced)} balanced reactions in the model have at least one metabolite with multiple formulas')

847 of the 850 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
19 of the 228 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
27 of the 1302 balanced reactions in the model have at least one metabolite with multiple formulas


Check the unbalanced reactions that don't have a metabolite with multiple formulas

In [11]:
unbalanced_not_multiple_formulas = [r for r in unbalanced if not has_metabolite_with_multiple_formulas(r)]

for r in unbalanced_not_multiple_formulas:
    print(r.check_mass_balance())

{'C': -7.105427357601002e-15, 'H': -1.4210854715202004e-14, 'N': -8.881784197001252e-16, 'O': -3.552713678800501e-15, 'S': -2.220446049250313e-16}
{'C': -7.105427357601002e-15, 'O': -3.552713678800501e-15, 'N': -8.881784197001252e-16, 'S': -2.220446049250313e-16}
{'H': 2.6645352591003757e-15}


These mass errors are on the order of 1e-14 or smaller (floating-point rounding), so the three reactions are effectively balanced. <br>
This means the unbalanced reactions are due entirely to metabolites with multiple formulas
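
Because an exact comparison with `{}` flags rounding residue as an imbalance, a tolerance-based check can be useful. The following is a minimal sketch; the function name and the tolerance of `1e-9` are assumptions, not values from this notebook.

```python
def is_effectively_balanced(mass_error, tol=1e-9):
    """Return True if every elemental imbalance is within tol of zero."""
    return all(abs(value) <= tol for value in mass_error.values())


# First mass error printed above: rounding residue only.
rounding_only = {'C': -7.105427357601002e-15, 'H': -1.4210854715202004e-14,
                 'N': -8.881784197001252e-16, 'O': -3.552713678800501e-15,
                 'S': -2.220446049250313e-16}
# Hypothetical example of a genuine imbalance.
real_imbalance = {'H': 1.0}

print(is_effectively_balanced(rounding_only))   # True
print(is_effectively_balanced(real_imbalance))  # False
print(is_effectively_balanced({}))              # True
```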

## Fix equivalent metabolite formulas

### Define functions to convert formula string to dictionary

Define function to get the numbers at the beginning of a string

In [12]:
def get_initial_number_string(substring):
    initial_string = ''
    for char in substring:
        if char.isdigit():
            initial_string += char
        else:
            return initial_string
    return initial_string

Test if it works

In [13]:
print(f"initial number of 'HPO4': {get_initial_number_string('HPO4')}")
print(f"initial number of '4N': {get_initial_number_string('4N')}")
print(f"initial number of '18H7O4': {get_initial_number_string('18H7O4')}")
print(f"initial number of '': {get_initial_number_string('')}")

initial number of 'HPO4': 
initial number of '4N': 4
initial number of '18H7O4': 18
initial number of '': 


Define a function to convert a string into a dictionary of elements and coefficients

In [14]:
def formula_dict_from_string(formula_string):
    formula_dict = {}
    elements = [char for char in formula_string if char.isalpha()]
    for element in elements:
        string_after_element = formula_string.split(element, 1)[1]
        coefficient = get_initial_number_string(string_after_element)
        if coefficient == '':
            coefficient = '1'
        formula_dict[element] = int(coefficient)
    return formula_dict

Test the function

In [15]:
print(formula_dict_from_string('HPO4'))
print(formula_dict_from_string('HO4P'))
print()
print(formula_dict_from_string('H4N'))
print(formula_dict_from_string('NH4'))
print()
print(formula_dict_from_string('HO7P2'))
print(formula_dict_from_string('P2HO7'))
print()
print(formula_dict_from_string('CHO3'))
print(formula_dict_from_string('HCO3'))
print()
print(formula_dict_from_string('C8H7O4'))
print(formula_dict_from_string('C8H8O4'))
print()
print(formula_dict_from_string('C10H10N5O7P'))
print(formula_dict_from_string('C10H11N5O7P'))

{'H': 1, 'P': 1, 'O': 4}
{'H': 1, 'O': 4, 'P': 1}

{'H': 4, 'N': 1}
{'N': 1, 'H': 4}

{'H': 1, 'O': 7, 'P': 2}
{'P': 2, 'H': 1, 'O': 7}

{'C': 1, 'H': 1, 'O': 3}
{'H': 1, 'C': 1, 'O': 3}

{'C': 8, 'H': 7, 'O': 4}
{'C': 8, 'H': 8, 'O': 4}

{'C': 10, 'H': 10, 'N': 5, 'O': 7, 'P': 1}
{'C': 10, 'H': 11, 'N': 5, 'O': 7, 'P': 1}
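
The function above walks the string one character at a time, so a two-letter symbol such as `Fe` or `Co` is stored as two separate keys (`'F'` and `'e'`). Comparing two strings with each other still works, but the keys no longer match real element symbols. A regex-based sketch that keeps two-letter symbols together is shown here; `parse_formula` is a hypothetical name, and it assumes formulas contain only element symbols and integer counts.

```python
import re

# One capital letter, an optional lowercase letter, then an optional count.
ELEMENT_RE = re.compile(r'([A-Z][a-z]?)(\d*)')


def parse_formula(formula_string):
    """Parse a formula such as 'C62CoH88N13O14P' into {element: count}."""
    counts = {}
    for element, number in ELEMENT_RE.findall(formula_string):
        # A missing count means one atom; repeated symbols are summed.
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


print(parse_formula('C62CoH88N13O14P'))
print(parse_formula('HPO4') == parse_formula('HO4P'))  # True
```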


### Merge equivalent formulas

In [16]:
equivalent_formulas = 0
for m in multiple_formulas:
    formulas = m.formula.split(';')
    if len(formulas) == 2 and formula_dict_from_string(formulas[0]) == formula_dict_from_string(formulas[1]):
        print(m.id, formulas[0], formulas[1])
        m.formula = formulas[0]
        equivalent_formulas += 1
    if (len(formulas) == 3 and 
            formula_dict_from_string(formulas[0]) == formula_dict_from_string(formulas[1]) and
            formula_dict_from_string(formulas[1]) == formula_dict_from_string(formulas[2])):
        print(m.id, formulas[0], formulas[1], formulas[2])
        m.formula = formulas[0]
        equivalent_formulas += 1

print(f'There are {equivalent_formulas} metabolites with equivalent formulas, and they have been fixed.')

pi_c HPO4 HO4P
nh4_c H4N NH4
ppi_c HO7P2 P2HO7
hco3_c CHO3 HCO3
hco3_e CHO3 HCO3
pi_p HPO4 HO4P
nh4_p H4N NH4
for_c CHO2 CH1O2
pi_e HPO4 HO4P
nh4_e H4N NH4
1hdecg3p_c C19H37O7P1 C19H37O7P
1odecg3p_c C21H41O7P C21H41O7P1
1odec11eg3p_c C21H39O7P1 C21H39O7P
glx_c C2HO3 C2H1O3
meoh_c CH4O1 CH4O
ppoh_c C3H8O C3H8O1
so4_c O4S SO4
hco3_p CHO3 HCO3
meoh_e CH4O1 CH4O
cbl1_c C62CoH88N13O14P C62H88CoN13O14P
ficytC_c C42FeH54N8O6S2 C42H54FeN8O6S2
focytC_c C42FeH54N8O6S2 C42H54FeN8O6S2
h2co3_c CH2O3 H2CO3
1hdecg3p_p C19H37O7P1 C19H37O7P
so4_e O4S SO4
so4_p O4S SO4
There are 26 metabolites with equivalent formulas, and they have been fixed.
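
The cell above handles fields with exactly two or three formulas. A sketch that works for any number of `;`-separated formulas follows. The names are hypothetical, and unlike the cell above it also shortens a field when only some of its entries are equivalent, which is a design choice of this sketch.

```python
import re


def parse_formula(formula_string):
    """Parse a formula string into {element: count}."""
    counts = {}
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula_string):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def collapse_equivalent_formulas(formula_field):
    """Drop formulas that repeat an earlier one with the elements reordered."""
    kept, seen = [], []
    for formula in formula_field.split(';'):
        composition = parse_formula(formula)
        if composition not in seen:
            seen.append(composition)
            kept.append(formula)
    return ';'.join(kept)


print(collapse_equivalent_formulas('HPO4;HO4P'))                # HPO4
print(collapse_equivalent_formulas('C19H37O7P1;C19H37O7P'))     # C19H37O7P1
print(collapse_equivalent_formulas('C10H10N5O7P;C10H11N5O7P'))  # unchanged
```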


### Check how many metabolites with multiple formulas remain

In [17]:
for i in range(1,5):
    num_formulas = [m for m in model.metabolites if len(m.formula.split(';')) == i]
    print(f'{len(num_formulas)} of {len(model.metabolites)} metabolites have {i} formula(s)')

1445 of 1581 metabolites have 1 formula(s)
130 of 1581 metabolites have 2 formula(s)
5 of 1581 metabolites have 3 formula(s)
1 of 1581 metabolites have 4 formula(s)


### Check how many unbalanced reactions remain

In [18]:
unbalanced = [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]
unbalanced_but_okay = [r for r in model.reactions if not should_be_balanced(r) and r.check_mass_balance() != {}]
balanced = [r for r in model.reactions if r.check_mass_balance() == {}]

print(f'{len(unbalanced)} of the {len(model.reactions)} reactions in the model are wrongly unbalanced')
print(f'{len(unbalanced_but_okay)} of the {len(model.reactions)} reactions in the model are properly unbalanced')
print(f'{len(balanced)} of the {len(model.reactions)} reactions in the model are balanced')

333 of the 2380 reactions in the model are wrongly unbalanced
228 of the 2380 reactions in the model are properly unbalanced
1819 of the 2380 reactions in the model are balanced


In [19]:
unbalanced_multiple_formulas = [r for r in unbalanced if has_metabolite_with_multiple_formulas(r)]
unbalanced_but_okay_multiple_formulas = [r for r in unbalanced_but_okay if has_metabolite_with_multiple_formulas(r)]
balanced_multiple_formulas   = [r for r in   balanced if has_metabolite_with_multiple_formulas(r)]

print(f'{len(unbalanced_multiple_formulas)} of the {len(unbalanced)} improperly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(unbalanced_but_okay_multiple_formulas)} of the {len(unbalanced_but_okay)} properly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(balanced_multiple_formulas)} of the {len(balanced)} balanced reactions in the model have at least one metabolite with multiple formulas')

330 of the 333 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
14 of the 228 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
27 of the 1819 balanced reactions in the model have at least one metabolite with multiple formulas


Removing duplicate metabolite formulas reduced the number of improperly unbalanced reactions from 850 to 333.

# Remove shorthand notation from the model

Check the set of letters present in the metabolite formulas (two-letter symbols such as 'Fe' show up as separate characters)

In [20]:
all_letters = []
for m in model.metabolites:
    for c in m.formula:
        if c.isalpha():
            all_letters.append(c)
            
print(set(all_letters)) #print makes it horizontal

{'Z', 'R', 'O', 'a', 'd', 'P', 'g', 'M', 'l', 'C', 'X', 'r', 's', 'u', 'i', 'H', 'A', 'n', 'F', 'e', 'o', 'N', 'K', 'S'}


Check the set of elements present in the growth reaction

In [21]:
growth_elements = []
for m in model.reactions.get_by_id('Growth').metabolites:
    growth_elements.extend(m.elements.keys())

print(set(growth_elements)) #print makes it horizontal

{'Co', 'Fe', 'Ca', 'Cu', 'Mg', 'C', 'O', 'S', 'N', 'Cl', 'Zn', 'K', 'Mn', 'H', 'P'}


## Remove 'PRS' from model
Check how many metabolites have 'PRS' in their formula

In [22]:
metabolites_with_PRS = [m for m in model.metabolites if 'PRS' in m.formula]
print(f'There are {len(metabolites_with_PRS)} metabolites with PRS in their formula')

There are 56 metabolites with PRS in their formula


Print first five of these metabolites

In [23]:
for m in metabolites_with_PRS[:5]:
    print(m.id, m.formula)

3hdecACP_c C394H621O144N96P1S3;C21H39N2O9PRS
ACP_c HSR;HX;C384H603N96O142P1S3;C11H21N2O7PRS
3hddecACP_c C396H625O144N96P1S3;C23H43N2O9PRS
3hcddec5eACP_c C23H41N2O9PRS;C396H623O144N96P1S3
3hmrsACP_c C25H47N2O9PRS;C398H629O144N96P1S3


Check metabolites that have PRS and only have one formula

In [24]:
for m in model.metabolites:
    if ';' not in m.formula and 'PRS' in m.formula:
        print(m.id, m.name, m.formula)

arachACP_c Eicosanoyl-ACP (n-C20:0) C31H59N2O8PRS
phdcaACP_c Phenol palmitic acid ACP C34H57N2O9PRS
prephthACP_c Phthiocerol precursor bound ACP C43H81N2O11PRS
prepphthACP_c Phenolic phthiocerol precursor bound ACP C46H79N2O12PRS


There are four such metabolites. They will need to be fixed later. First we need to define the elemental composition of PRS.<br>
Check metabolites that have PRS and multiple formulas.

In [25]:
met_multiple_formulas_PRS = [m for m in model.metabolites if ';' in m.formula and 'PRS' in m.formula]
print(f'There are {len(met_multiple_formulas_PRS)} metabolites with multiple formulas and PRS')

There are 52 metabolites with multiple formulas and PRS


Define function to subtract element dictionaries. This will be used to determine the elemental makeup of PRS

In [26]:
def subtract_element_dicts(elements_1, elements_2):
    output = {}
    all_keys = list(elements_1.keys())
    element_2_keys = list(elements_2.keys())
    all_keys.extend(element_2_keys)
    all_keys = set(all_keys)
    
    
    for k in all_keys:
        if k in elements_1.keys() and k in elements_2.keys():
            output[k] = elements_1[k] - elements_2[k]
        elif k in elements_1.keys() and k not in elements_2.keys():
            output[k] = elements_1[k]
        else:
            output[k] = -1*elements_2[k]
            
    return output

Test that the function works

In [27]:
elements_1 = {'C': 394, 'H': 621, 'O': 144, 'N': 96, 'P': 1, 'S': 3}
elements_2 = {'C': 21, 'H': 39, 'N': 2, 'O': 9, 'P': 1, 'R': 1, 'S': 1}
subtract_element_dicts(elements_1, elements_2)

{'C': 373, 'R': -1, 'O': 135, 'S': 2, 'N': 94, 'H': 582, 'P': 0}

Use subtract_element_dicts for first 10 metabolites with PRS to determine the elemental makeup of PRS

In [28]:
for m in met_multiple_formulas_PRS[:10]:
    formulas = m.formula.split(';')
    if len(formulas) == 2:
        elements_1 = formula_dict_from_string(formulas[0])
        elements_2 = formula_dict_from_string(formulas[1])
                                              
        print(m.id, m.formula, subtract_element_dicts(elements_1, elements_2))


3hdecACP_c C394H621O144N96P1S3;C21H39N2O9PRS {'C': 373, 'R': -1, 'O': 135, 'S': 2, 'N': 94, 'H': 582, 'P': 0}
3hddecACP_c C396H625O144N96P1S3;C23H43N2O9PRS {'C': 373, 'R': -1, 'O': 135, 'S': 2, 'N': 94, 'H': 582, 'P': 0}
3hcddec5eACP_c C23H41N2O9PRS;C396H623O144N96P1S3 {'C': -373, 'R': 1, 'O': -135, 'S': -2, 'N': -94, 'H': -582, 'P': 0}
3hmrsACP_c C25H47N2O9PRS;C398H629O144N96P1S3 {'C': -373, 'R': 1, 'O': -135, 'S': -2, 'N': -94, 'H': -582, 'P': 0}
3hcmrs7eACP_c C398H627O144N96P1S3;C25H45N2O9PRS {'C': 373, 'R': -1, 'O': 135, 'S': 2, 'N': 94, 'H': 582, 'P': 0}
3hhexACP_c C17H31N2O9PRS;C390H613O144N96P1S3 {'C': -373, 'R': 1, 'O': -135, 'S': -2, 'N': -94, 'H': -582, 'P': 0}
3hoctACP_c C392H617O144N96P1S3;C19H35N2O9PRS {'C': 373, 'R': -1, 'O': 135, 'S': 2, 'N': 94, 'H': 582, 'P': 0}
tdec2eACP_c C21H37N2O8PRS;C394H619O143N96P1S3 {'C': -373, 'R': 1, 'O': -135, 'S': -2, 'N': -94, 'H': -582, 'P': 0}
tddec2eACP_c C23H41N2O8PRS;C396H623O143N96P1S3 {'C': -373, 'R': 1, 'O': -135, 'S': -2, 'N': -94

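The difference is the same for every metabolite, so it seems clear that the placeholder R in 'PRS' stands for C373, H582, N94, O135, and S2 (the R-1 entry is the placeholder itself being removed)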
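
As a quick arithmetic check of that composition, the sketch below expands the placeholder in the short formula of 3hdecACP_c and compares the result with its long formula. `R_COMPOSITION` and `expand_R` are hypothetical names introduced for this illustration.

```python
# Composition inferred from the differences printed above.
R_COMPOSITION = {'C': 373, 'H': 582, 'N': 94, 'O': 135, 'S': 2}


def expand_R(elements):
    """Replace each placeholder R with its inferred elemental composition."""
    n_R = elements.get('R', 0)
    expanded = {k: v for k, v in elements.items() if k != 'R'}
    for element, count in R_COMPOSITION.items():
        expanded[element] = expanded.get(element, 0) + n_R * count
    return expanded


short_form = {'C': 21, 'H': 39, 'N': 2, 'O': 9, 'P': 1, 'R': 1, 'S': 1}  # C21H39N2O9PRS
long_form = {'C': 394, 'H': 621, 'O': 144, 'N': 96, 'P': 1, 'S': 3}      # C394H621O144N96P1S3

print(expand_R(short_form) == long_form)  # True
```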

Remove all formulas that contain 'PRS' 

In [29]:
for m in met_multiple_formulas_PRS:
    formulas = m.formula.split(';')
    if len(formulas) == 2:
        if 'PRS' in formulas[0]:
            m.formula = formulas[1]
        else:
            m.formula = formulas[0]

Check how many metabolites with multiple formulas and 'PRS' remain

In [30]:
met_multiple_formulas_PRS = [m for m in model.metabolites if ';' in m.formula and 'PRS' in m.formula]
print(f'There is/are {len(met_multiple_formulas_PRS)} metabolite(s) with multiple formulas and PRS')

There is/are 1 metabolite(s) with multiple formulas and PRS


Update this one manually

In [31]:
print(met_multiple_formulas_PRS[0].id, met_multiple_formulas_PRS[0].formula)

ACP_c HSR;HX;C384H603N96O142P1S3;C11H21N2O7PRS


In [32]:
model.metabolites.get_by_id('ACP_c').formula = 'C384H603N96O142P1S3'

Fix metabolites with 'PRS' that only have one formula by adding the elements of 'PRS' to their formula

In [33]:
for m in model.metabolites:
    if ';' not in m.formula and 'PRS' in m.formula:
        print(m.id)
        print(m.name)
        print(m.formula)
        print(m.elements)
        negative_PRS = {'R': 1, 'H': -582, 'S': -2, 'P': 0, 'N': -94, 'O': -135, 'C': -373}
        print(subtract_element_dicts(m.elements, negative_PRS))
        m.elements = subtract_element_dicts(m.elements, negative_PRS)
        print('New formula', m.formula)
        print()

arachACP_c
Eicosanoyl-ACP (n-C20:0)
C31H59N2O8PRS
{'C': 31, 'H': 59, 'N': 2, 'O': 8, 'P': 1, 'R': 1, 'S': 1}
{'C': 404, 'R': 0, 'O': 143, 'S': 3, 'N': 96, 'H': 641, 'P': 1}
New formula C404H641N96O143PR0S3

phdcaACP_c
Phenol palmitic acid ACP
C34H57N2O9PRS
{'C': 34, 'H': 57, 'N': 2, 'O': 9, 'P': 1, 'R': 1, 'S': 1}
{'C': 407, 'R': 0, 'O': 144, 'S': 3, 'N': 96, 'H': 639, 'P': 1}
New formula C407H639N96O144PR0S3

prephthACP_c
Phthiocerol precursor bound ACP
C43H81N2O11PRS
{'C': 43, 'H': 81, 'N': 2, 'O': 11, 'P': 1, 'R': 1, 'S': 1}
{'C': 416, 'R': 0, 'O': 146, 'S': 3, 'N': 96, 'H': 663, 'P': 1}
New formula C416H663N96O146PR0S3

prepphthACP_c
Phenolic phthiocerol precursor bound ACP
C46H79N2O12PRS
{'C': 46, 'H': 79, 'N': 2, 'O': 12, 'P': 1, 'R': 1, 'S': 1}
{'C': 419, 'R': 0, 'O': 147, 'S': 3, 'N': 96, 'H': 661, 'P': 1}
New formula C419H661N96O147PR0S3
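
The updated formulas no longer contain 'PRS', but they still carry a zero-count placeholder (`R0`). One way to avoid that is to drop zero-count entries before the formula string is built. The sketch below assumes alphabetical element order, mirroring the strings printed above; `elements_to_formula` is a hypothetical helper.

```python
def elements_to_formula(elements):
    """Build a formula string, skipping elements whose count is zero."""
    parts = []
    for element in sorted(elements):
        count = elements[element]
        if count == 0:
            continue  # e.g. the leftover placeholder 'R': 0
        parts.append(element if count == 1 else f'{element}{count}')
    return ''.join(parts)


# Element dictionary printed above for arachACP_c after the update.
arach_elements = {'C': 404, 'R': 0, 'O': 143, 'S': 3, 'N': 96, 'H': 641, 'P': 1}
print(elements_to_formula(arach_elements))  # C404H641N96O143PS3
```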



Check that no metabolite still has 'PRS' in its formula

In [34]:
[print(m.id) for m in model.metabolites if 'PRS' in m.formula]

[]

The empty list indicates that all instances of PRS have been removed from the model

### Check how many metabolites with multiple formulas remain

In [35]:
for i in range(1,5):
    num_formulas = [m for m in model.metabolites if len(m.formula.split(';')) == i]
    print(f'{len(num_formulas)} of {len(model.metabolites)} metabolites have {i} formula(s)')

1497 of 1581 metabolites have 1 formula(s)
79 of 1581 metabolites have 2 formula(s)
5 of 1581 metabolites have 3 formula(s)
0 of 1581 metabolites have 4 formula(s)


### Check how many unbalanced reactions remain

In [36]:
unbalanced = [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]
unbalanced_but_okay = [r for r in model.reactions if not should_be_balanced(r) and r.check_mass_balance() != {}]
balanced = [r for r in model.reactions if r.check_mass_balance() == {}]

print(f'{len(unbalanced)} of the {len(model.reactions)} reactions in the model are wrongly unbalanced')
print(f'{len(unbalanced_but_okay)} of the {len(model.reactions)} reactions in the model are properly unbalanced')
print(f'{len(balanced)} of the {len(model.reactions)} reactions in the model are balanced')

206 of the 2380 reactions in the model are wrongly unbalanced
228 of the 2380 reactions in the model are properly unbalanced
1946 of the 2380 reactions in the model are balanced


In [37]:
unbalanced_multiple_formulas = [r for r in unbalanced if has_metabolite_with_multiple_formulas(r)]
unbalanced_but_okay_multiple_formulas = [r for r in unbalanced_but_okay if has_metabolite_with_multiple_formulas(r)]
balanced_multiple_formulas   = [r for r in   balanced if has_metabolite_with_multiple_formulas(r)]

print(f'{len(unbalanced_multiple_formulas)} of the {len(unbalanced)} improperly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(unbalanced_but_okay_multiple_formulas)} of the {len(unbalanced_but_okay)} properly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(balanced_multiple_formulas)} of the {len(balanced)} balanced reactions in the model have at least one metabolite with multiple formulas')

203 of the 206 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
14 of the 228 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
26 of the 1946 balanced reactions in the model have at least one metabolite with multiple formulas


# Set metabolite formula based on reactions with single undefined metabolite
### First assign formulas to metabolites that participate only in reactions with one undefined metabolite and whose reactions all have the same mass error

Define function to return the number of metabolites with multiple formulas in a given reaction

In [42]:
def num_met_multiple_formulas(r):
    return len([m for m in r.metabolites if ';' in m.formula])

First check how many reactions have 0, 1, 2, 3, or 4 metabolites with multiple formulas

In [43]:
for i in range(5):
    num_reactions = [r for r in model.reactions if num_met_multiple_formulas(r) == i]
    print(f'{len(num_reactions)} of {len(model.reactions)} reactions have {i} metabolites with multiple formulas')

2137 of 2380 reactions have 0 metabolites with multiple formulas
157 of 2380 reactions have 1 metabolites with multiple formulas
81 of 2380 reactions have 2 metabolites with multiple formulas
4 of 2380 reactions have 3 metabolites with multiple formulas
1 of 2380 reactions have 4 metabolites with multiple formulas


Define a function to return the first metabolite that has multiple formulas

In [44]:
def first_multiple_formulas(r):
    return [m for m in r.metabolites if ';' in m.formula][0]

Define a function that takes a metabolite and returns True if all of its reactions have equivalent mass errors

In [45]:
def all_rxns_have_same_mass_error(m):
    mass_errors = [r.check_mass_balance() for r in m.reactions]
    first_error = mass_errors[0]
    for mass_error in mass_errors:
        if ensure_positive_mass_error(mass_error) == ensure_positive_mass_error(first_error):
            pass
        else:
            return False
    return True

Define a function to ensure that the mass error is positive

In [46]:
def ensure_positive_mass_error(mass_error):
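    if not mass_error:
        # A balanced reaction has an empty mass error; indexing it would raise IndexError
        return mass_error
    if list(mass_error.values())[0] > 0: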
        return mass_error
    else:
        negative_mass_error = {}
        for k in mass_error:
            negative_mass_error[k] = -1 * mass_error[k]
        return negative_mass_error
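
The sign of a mass error depends on which side of the reaction the metabolite sits, so the same imbalance can appear as `{'C': 10.0, ...}` in one reaction and `{'C': -10.0, ...}` in another. The sketch below shows the normalization on the two mass errors reported for 35cgmp_c further down. `normalize_sign` is a hypothetical stand-in that also returns an empty dictionary unchanged.

```python
def normalize_sign(mass_error):
    """Flip the sign of every entry unless the first entry is positive."""
    if not mass_error:  # balanced reaction: nothing to normalize
        return {}
    first_value = next(iter(mass_error.values()))
    if first_value > 0:
        return dict(mass_error)
    return {element: -value for element, value in mass_error.items()}


produced = {'C': 10.0, 'H': 10.0, 'N': 5.0, 'O': 7.0, 'P': 1.0}
consumed = {'C': -10.0, 'H': -10.0, 'N': -5.0, 'O': -7.0, 'P': -1.0}

print(normalize_sign(produced) == normalize_sign(consumed))  # True
```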

Define function to get mass error of first reaction of given metabolite

In [47]:
def get_first_mass_error(m):
    for r in m.reactions:
        return r.check_mass_balance()

Define a function that takes a string with multiple formulas and a mass error, removes the formula that matches the mass error, and returns the formula that is left

In [48]:
def remove_formula_matching_mass_error(formula_string, mass_error):

    formulas = formula_string.split(';')
    if len(formulas) != 2:
        return formula_string
    else:
        for formula in formulas:
            if formula_dict_from_string(formula) != ensure_positive_mass_error(mass_error):
                return formula
    return formula_string
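
A worked example may make the logic clearer. The mass errors printed in the first pass below equal exactly one of the two candidate formulas, which is consistent with both candidates being counted together when the `;`-separated string is read. That reading is an inference from the output, not something the notebook states. Under it, the surplus is the candidate to discard. The helper names are hypothetical, and unlike the function above this sketch leaves the field unchanged unless exactly one candidate differs from the mass error.

```python
import re


def parse_formula(formula_string):
    """Parse a formula string into {element: count}."""
    counts = {}
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula_string):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def pick_remaining_formula(formula_field, mass_error):
    """Return the candidate that does not equal the sign-normalized mass error."""
    candidates = formula_field.split(';')
    if len(candidates) != 2:
        return formula_field
    values = list(mass_error.values())
    sign = -1 if values and values[0] < 0 else 1
    surplus = {element: sign * value for element, value in mass_error.items()}
    remaining = [f for f in candidates if parse_formula(f) != surplus]
    return remaining[0] if len(remaining) == 1 else formula_field


# Values taken from the first-pass output for 35cgmp_c.
error = {'C': 10.0, 'H': 10.0, 'N': 5.0, 'O': 7.0, 'P': 1.0}
print(pick_remaining_formula('C10H10N5O7P;C10H11N5O7P', error))  # C10H11N5O7P
```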

Set the formula for metabolites that meet this condition (1st pass)

In [49]:
one_multiple_formula_metabolite_reactions = [r for r in model.reactions if num_met_multiple_formulas(r) == 1]
multiple_formula_metabolites = [m for m in model.metabolites if len(m.formula.split(';')) > 1]

print(f'{len(one_multiple_formula_metabolite_reactions)} reactions have a single metabolite with multiple formulas')
print(f'{len(multiple_formula_metabolites)} metabolites have multiple formulas')
print()

for m in multiple_formula_metabolites:
    reactions = m.reactions
    if reactions.issubset(one_multiple_formula_metabolite_reactions) and all_rxns_have_same_mass_error(m):
        print(f'The metabolite {m.id} is only in reactions with one metabolite with multiple formulas')
        print(f'Its current formula is {m.formula} and the mass errors of its reactions are:')
        for r in m.reactions:
            print(r.check_mass_balance())
        m.formula = remove_formula_matching_mass_error(m.formula, get_first_mass_error(m))
        print(f'The updated formula is {m.formula} and the mass errors of its reactions are:')
        for r in m.reactions:
            print(r.check_mass_balance())
        print()

157 reactions have a single metabolite with multiple formulas
84 metabolites have multiple formulas

The metabolite 35cgmp_c is only in reactions with one metabolite with multiple formulas
Its current formula is C10H10N5O7P;C10H11N5O7P and the mass errors of its reactions are:
{'C': 10.0, 'H': 10.0, 'N': 5.0, 'O': 7.0, 'P': 1.0}
{'C': -10.0, 'H': -10.0, 'N': -5.0, 'O': -7.0, 'P': -1.0}
The updated formula is C10H11N5O7P and the mass errors of its reactions are:
{}
{}

The metabolite ahdt_c is only in reactions with one metabolite with multiple formulas
Its current formula is C9H12N5O13P3;C9H11N5O13P3 and the mass errors of its reactions are:
{'C': 9.0, 'H': 11.0, 'N': 5.0, 'O': 13.0, 'P': 3.0}
{'C': -9.0, 'H': -11.0, 'N': -5.0, 'O': -13.0, 'P': -3.0}
The updated formula is C9H12N5O13P3 and the mass errors of its reactions are:
{}
{}

The metabolite salcn6p_c is only in reactions with one metabolite with multiple formulas
Its current formula is C13H19O10P;C13H17O10P and the mass errors 

(2nd pass)

In [50]:
one_multiple_formula_metabolite_reactions = [r for r in model.reactions if num_met_multiple_formulas(r) == 1]
multiple_formula_metabolites = [m for m in model.metabolites if len(m.formula.split(';')) > 1]

print(f'{len(one_multiple_formula_metabolite_reactions)} reactions have a single metabolite with multiple formulas')
print(f'{len(multiple_formula_metabolites)} metabolites have multiple formulas')
print()

for m in multiple_formula_metabolites:
    reactions = m.reactions
    if reactions.issubset(one_multiple_formula_metabolite_reactions) and all_rxns_have_same_mass_error(m):
        print(f'The metabolite {m.id} is only in reactions with one metabolite with multiple formulas')
        print(f'Its current formula is {m.formula} and the mass errors of its reactions are:')
        for r in m.reactions:
            print(r.check_mass_balance())
        m.formula = remove_formula_matching_mass_error(m.formula, get_first_mass_error(m))
        print(f'The updated formula is {m.formula} and the mass errors of its reactions are:')
        for r in m.reactions:
            print(r.check_mass_balance())
        print()

117 reactions have a single metabolite with multiple formulas
68 metabolites have multiple formulas



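(3rd pass: the 2nd pass changed no formulas, and the counts stay at 117 and 68, so this approach resolves nothing further)

In [51]: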
one_multiple_formula_metabolite_reactions = [r for r in model.reactions if num_met_multiple_formulas(r) == 1]
multiple_formula_metabolites = [m for m in model.metabolites if len(m.formula.split(';')) > 1]

print(f'{len(one_multiple_formula_metabolite_reactions)} reactions have a single metabolite with multiple formulas')
print(f'{len(multiple_formula_metabolites)} metabolites have multiple formulas')
print()

for m in multiple_formula_metabolites:
    reactions = m.reactions
    if reactions.issubset(one_multiple_formula_metabolite_reactions) and all_rxns_have_same_mass_error(m):
        print(f'The metabolite {m.id} is only in reactions with one metabolite with multiple formulas')
        print(f'Its current formula is {m.formula} and the mass errors of its reactions are:')
        for r in m.reactions:
            print(r.check_mass_balance())
        m.formula = remove_formula_matching_mass_error(m.formula, get_first_mass_error(m))
        print(f'The updated formula is {m.formula} and the mass errors of its reactions are:')
        for r in m.reactions:
            print(r.check_mass_balance())
        print()

117 reactions have a single metabolite with multiple formulas
68 metabolites have multiple formulas



## Check in on status 

In [52]:
for i in range(1,5):
    num_formulas = [m for m in model.metabolites if len(m.formula.split(';')) == i]
    print(f'{len(num_formulas)} of {len(model.metabolites)} metabolites have {i} formula(s)')

1513 of 1581 metabolites have 1 formula(s)
63 of 1581 metabolites have 2 formula(s)
5 of 1581 metabolites have 3 formula(s)
0 of 1581 metabolites have 4 formula(s)


In [53]:
unbalanced = [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]
unbalanced_but_okay = [r for r in model.reactions if not should_be_balanced(r) and r.check_mass_balance() != {}]
balanced = [r for r in model.reactions if r.check_mass_balance() == {}]

print(f'{len(unbalanced)} of the {len(model.reactions)} reactions in the model are wrongly unbalanced')
print(f'{len(unbalanced_but_okay)} of the {len(model.reactions)} reactions in the model are properly unbalanced')
print(f'{len(balanced)} of the {len(model.reactions)} reactions in the model are balanced')

166 of the 2380 reactions in the model are wrongly unbalanced
228 of the 2380 reactions in the model are properly unbalanced
1986 of the 2380 reactions in the model are balanced


In [54]:
unbalanced_multiple_formulas = [r for r in unbalanced if has_metabolite_with_multiple_formulas(r)]
unbalanced_but_okay_multiple_formulas = [r for r in unbalanced_but_okay if has_metabolite_with_multiple_formulas(r)]
balanced_multiple_formulas   = [r for r in   balanced if has_metabolite_with_multiple_formulas(r)]

print(f'{len(unbalanced_multiple_formulas)} of the {len(unbalanced)} improperly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(unbalanced_but_okay_multiple_formulas)} of the {len(unbalanced_but_okay)} properly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(balanced_multiple_formulas)} of the {len(balanced)} balanced reactions in the model have at least one metabolite with multiple formulas')

163 of the 166 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
14 of the 228 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
26 of the 1986 balanced reactions in the model have at least one metabolite with multiple formulas


In [55]:
for i in range(5):
    num_multiple_formulas = [r for r in model.reactions if num_met_multiple_formulas(r) == i]
    print(f'{len(num_multiple_formulas)} of {len(model.reactions)} reactions have {i} metabolites with multiple formulas')

2177 of 2380 reactions have 0 metabolites with multiple formulas
117 of 2380 reactions have 1 metabolites with multiple formulas
81 of 2380 reactions have 2 metabolites with multiple formulas
4 of 2380 reactions have 3 metabolites with multiple formulas
1 of 2380 reactions have 4 metabolites with multiple formulas


# Save model after first curation steps

In [None]:
model.id = 'ropacus_curated_first_pass'
model.name = 'Rhodococcus opacus PD630 curated first pass'
model.description = 'Rhodococcus opacus PD630 model with annotations and initial curation'

In [None]:
cobra.io.write_sbml_model(model, "GSMs/Ropacus_curation_first_pass.xml")

Define function to return the first metabolite that has multiple formulas in a given reaction

In [None]:
def first_metabolite_with_multiple_formulas(r):
    for m in r.metabolites:
        if ';' in m.formula:
            return m
    return model.metabolites.get_by_id('nh4_c')

Go through reactions with a single metabolite undefined, see if mass error matches one of the metabolite formulas

In [None]:
one_multiple_formula_rxns = [r for r in model.reactions if num_met_multiple_formulas(r) == 1]
for r in one_multiple_formula_rxns:
    print(r.check_mass_balance())
    print(first_metabolite_with_multiple_formulas(r).formula)
    print()
    

For each metabolite with multiple formulas, print the mass balance of every reaction it participates in

In [None]:
mets_with_multiple_formulas = [m for m in model.metabolites if len(m.formula.split(';')) > 1]

for m in mets_with_multiple_formulas:
    print(m.id)
    print(m.formula)
    for r in m.reactions:
        print(r.check_mass_balance())
    print()

In [None]:
for r in unbalanced:
    if r not in unbalanced_multiple_formulas:
        print(r.check_mass_balance())
        print(r.reaction)
        print()

In [None]:
print('test')

Define function to convert mass error to string

In [None]:
def mass_error_to_string(mass_error):
    formula = ''
    pos_mass_error = ensure_positive_mass_error(mass_error)

    for element in pos_mass_error:
        formula += element
        if str(int(pos_mass_error[element])) != '1':
            formula += str(int(pos_mass_error[element]))
    return formula

Go through reactions with exactly one metabolite that has multiple formulas. If the mass error (or its negative) matches one of that metabolite's candidate formulas, assign that formula to the metabolite

In [None]:
one_multiple_formula_reactions = [r for r in model.reactions if num_met_multiple_formulas(r) == 1]

for r in one_multiple_formula_reactions:
    mass_error = r.check_mass_balance()
    metabolite_with_multiple_formulas = first_metabolite_with_multiple_formulas(r)
    multiple_formulas = metabolite_with_multiple_formulas.formula
    
    for formula_string in multiple_formulas.split(';'):
        if mass_error == formula_dict_from_string(formula_string) or negative_mass_error(mass_error) == formula_dict_from_string(formula_string):
            metabolite_with_multiple_formulas.formula = formula_string

In [None]:
one_multiple_formula_reactions = [r for r in model.reactions if num_met_multiple_formulas(r) == 1]
len(one_multiple_formula_reactions)

# After several passes, check the current status

In [None]:
for i in range(5):
    num_reactions = [r for r in model.reactions if num_met_multiple_formulas(r) == i]
    print(f'{len(num_reactions)} of {len(model.reactions)} reactions have {i} metabolites with multiple formulas')

In [None]:
for i in range(1,5):
    num_formulas = [m for m in model.metabolites if len(m.formula.split(';')) == i]
    print(f'{len(num_formulas)} of {len(model.metabolites)} metabolites have {i} formula(s)')

In [None]:
two_formulas = [m for m in model.metabolites if len(m.formula.split(';')) == 2]
for m in two_formulas:
    print(m.id, m.formula)
    for r in m.reactions:
        print(r.check_mass_balance())
    print()

In [None]:
three_formulas = [m for m in model.metabolites if len(m.formula.split(';')) == 3]
for m in three_formulas:
    print(m.id, m.formula)

In [None]:
unbalanced = [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]
unbalanced_but_okay = [r for r in model.reactions if not should_be_balanced(r) and r.check_mass_balance() != {}]
balanced = [r for r in model.reactions if r.check_mass_balance() == {}]

print(f'{len(unbalanced)} of the {len(model.reactions)} reactions in the model are wrongly unbalanced')
print(f'{len(unbalanced_but_okay)} of the {len(model.reactions)} reactions in the model are properly unbalanced')
print(f'{len(balanced)} of the {len(model.reactions)} reactions in the model are balanced')

In [None]:
unbalanced_multiple_formulas = [r for r in unbalanced if has_metabolite_with_multiple_formulas(r)]
unbalanced_but_okay_multiple_formulas = [r for r in unbalanced_but_okay if has_metabolite_with_multiple_formulas(r)]
balanced_multiple_formulas   = [r for r in   balanced if has_metabolite_with_multiple_formulas(r)]

print(f'{len(unbalanced_multiple_formulas)} of the {len(unbalanced)} improperly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(unbalanced_but_okay_multiple_formulas)} of the {len(unbalanced_but_okay)} properly unbalanced reactions in the model have at least one metabolite with multiple formulas')
print(f'{len(balanced_multiple_formulas)} of the {len(balanced)} balanced reactions in the model have at least one metabolite with multiple formulas')

Check how many metabolites remain with X or R in their formulas

In [None]:
metabolites_with_X = [m for m in model.metabolites if 'X' in m.formula]
print(f'There are {len(metabolites_with_X)} metabolites with X in their formula')

In [None]:
metabolites_with_R = [m for m in model.metabolites if 'R' in m.formula]
print(f'There are {len(metabolites_with_R)} metabolites with R in their formula')

## Inspect metabolites and reactions to make next decision. Remove this section later 

In [None]:
for m in multiple_formulas:
    print(m.id)
    print(m.formula)
    for r in m.reactions:
        
        if r.check_mass_balance() != {} and len([m for m in r.metabolites if ';' in m.formula]) == 1:
            print(r.id)
            print(r.reaction)
            print(r.check_mass_balance())
    print()

## Check out element R

In [None]:
metabolites_with_R = []
for m in model.metabolites:
    if 'R' in m.formula:
        metabolites_with_R.append(m)
        print(m.id, m.formula)
        
print(f'There are {len(metabolites_with_R)} metabolites with R in their formula')

In [None]:
for m in model.metabolites:
    if ';' in m.formula and 'R' in m.formula:
        print(m.id)
        print(m.name)
        print(m.formula)
        print()

## Check out element X

In [None]:
metabolites_with_X = []
for m in model.metabolites:
    if 'X' in m.formula:
        metabolites_with_X.append(m)
        print(m.id, m.formula)
        
print(f'There are {len(metabolites_with_X)} metabolites with X in their formula')

In [None]:
for m in model.metabolites:
    if ';' in m.formula and 'X' in m.formula:
        print(m.id)
        print(m.name)
        print(m.formula)
        print()

## Fix unbalanced reactions starting with reactions with a single undefined metabolite formula

In [None]:
one_undefined_formula_rxns = []
for r in unbalanced:
    num_multiple_formula_mets = 0
    for m in r.metabolites:
        if ';' in m.formula:
            num_multiple_formula_mets += 1
    if num_multiple_formula_mets == 1:
        one_undefined_formula_rxns.append(r)
        
print(f'There are {len(one_undefined_formula_rxns)} unbalanced reactions with a single undefined formula')

Now attempt to fix them by checking if the mass error corresponds to one of the formulas of the undefined metabolite
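A minimal sketch of that check, written against plain dictionaries and formula strings so it runs without the model. The helpers `formula_to_elements` and `matching_formula` are hypothetical names introduced here for illustration (the notebook's own `formula_dict_from_string` is assumed to behave similarly), and the formulas in the usage lines are made up.

```python
import re

def formula_to_elements(formula):
    """Parse a formula string such as 'C6H12O6' into an element -> count dict."""
    elements = {}
    for symbol, count in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        elements[symbol] = elements.get(symbol, 0) + (int(count) if count else 1)
    return elements

def matching_formula(mass_error, formula_string):
    """Return the candidate formula equal to the mass error or its negative, else None."""
    negated = {element: -amount for element, amount in mass_error.items()}
    for candidate in formula_string.split(';'):
        elements = formula_to_elements(candidate)
        if elements == mass_error or elements == negated:
            return candidate
    return None

# Hypothetical usage: the mass error is one water, in either direction
print(matching_formula({'H': 2.0, 'O': 1.0}, 'CH4;H2O'))    # H2O
print(matching_formula({'H': -2.0, 'O': -1.0}, 'CH4;H2O'))  # H2O
print(matching_formula({'C': 1.0}, 'CH4;H2O'))              # None
```

Returning `None` when nothing matches leaves the decision to the caller, so a metabolite formula is only overwritten when a candidate matches exactly.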

## Find the metabolites with multiple formulas that are used in one reaction, and fix them

In [None]:
def fix_metabolite_with_single_reaction(m, r):
    print(f'The metabolite {m.name} with the id {m.id} is only involved in one reaction {r.id}')
    print(f'{m.name} has the formula {m.formula}')
    print(f'This reaction is {r.name} which has the form {r.reaction} and the mass error {r.check_mass_balance()}')
    
    if r.check_mass_balance() == {}:
        print('The smaller of the two formulas that does not include the metabolites X' )
    print()

Check the reactions that they are in and see if they can be balanced.

Fixing these metabolites does the least harm to the rest of the model

Find metabolites with multiple formulas that are used in a single reaction

In [None]:
for m in multiple_formulas:
    if len(m.reactions) == 1:
        print(m.formula)
        r = list(m.reactions)[0]
        fix_metabolite_with_single_reaction(m, r)

In [None]:
for m in multiple_formulas:
#     print(len(m.reactions))
    if len(m.reactions) == 1:
        reaction = list(m.reactions)[0]
        print(f'The metabolite {m.name} with the id {m.id} is only involved in one reaction {reaction.id}')
        print(f'{m.name} has the formula {m.formula}')
        print(f'This reaction is {reaction.name} which has the form {reaction.reaction}')
        rxn = list(m.reactions)[0]
        print(rxn.check_mass_balance())
        [print(m) for m in rxn.metabolites]
        print()

### Find reactions with only one metabolite with multiple formulas

In [None]:
one_undefined_formula_rxns = []
for r in unbalanced:
    num_multiple_formula_mets = 0
    for m in r.metabolites:
        if ';' in m.formula:
            num_multiple_formula_mets += 1
    if num_multiple_formula_mets == 1:
        one_undefined_formula_rxns.append(r)
        
print(f'There are {len(one_undefined_formula_rxns)} reactions with a single undefined formula')

In [None]:
for r in one_undefined_formula_rxns:
    print(r.reaction)
    print(r.check_mass_balance())
    for m in r.metabolites:
        if ';' in m.formula:
            print(m.formula)
    print()

Define a function that makes dictionaries hashable so that they can be counted or placed in a set. Found this solution on [Stack Overflow](https://stackoverflow.com/questions/56063246/how-to-obtain-a-set-of-dictionaries)

In [None]:
def make_hashable(o):
    if isinstance(o, dict):
        return frozenset((k, make_hashable(v)) for k, v in o.items())
    elif isinstance(o, list):
        return tuple(make_hashable(elem) for elem in o)
    elif isinstance(o, set):
        return frozenset(make_hashable(elem) for elem in o)
    else:
        return o
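As a small self-contained illustration of why hashability matters here: a plain `dict` cannot be a `Counter` key, but a `frozenset` of its items can. The sketch below uses `frozenset(d.items())`, which is enough for flat element-to-count dictionaries; the mass errors are made-up values, not output from the model.

```python
import collections

# Made-up mass errors in the same shape that check_mass_balance() returns
mass_errors = [{'H': 2.0}, {'H': 2.0, 'O': 1.0}, {'H': 2.0}]

# Flat dicts only need their items frozen to become hashable
counts = collections.Counter(frozenset(d.items()) for d in mass_errors)

most_common_error, occurrences = counts.most_common(1)[0]
print(dict(most_common_error), occurrences)  # {'H': 2.0} 2
```

The recursive version is only needed when dictionary values are themselves dicts, lists, or sets.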

## Fix most frequent reaction unbalancing issues

Check which mass errors are most common

In [None]:
mass_error_list = [make_hashable(r.check_mass_balance()) for r in unbalanced_multiple_formulas]
for mass_error in collections.Counter(mass_error_list).most_common()[:20]:
    print(mass_error)

Most Frequent Problems:
1) Being off by two hydrogens <br>
2) Being off by a water molecule <br>
3) ('H', 3.0), ('R', 1.0), ('O', -2.0), ('S', 1.0), ('C', -1.0), ('X', 1.0) <br>
4) ('N', 4.0), ('P', 1.0), ('S', 1.0), ('H', 19.0), ('O', 12.0), ('C', 17.0) <br>
5) ('C', -24.0), ('S', -1.0), ('N', -7.0), ('O', -19.0), ('P', -3.0), ('H', -34.0) <br>

### Look into two hydrogen error

In [None]:
two_hydrogen_error = []
for r in unbalanced_multiple_formulas:
    if r.check_mass_balance() == {'H': 2.0} or r.check_mass_balance() == {'H': -2.0}:
        two_hydrogen_error.append(r)

In [None]:
for r in two_hydrogen_error:
    print (r.check_mass_balance(), r.reaction)

Almost all have NAD or NADP. These reactions have the form:<br>
X + NADPH + H --> XH2 + NADP
<br>
The two-hydrogen error happens because metabolites X and XH2 each list two formulas, so the two hydrogens on XH2 are double counted.
<br>
The fix is to remove one of these formulas from each metabolite. For consistency, the higher molecular weight formula will always be removed.
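To make the symptom concrete, here is a small sketch of an element balance for a reaction of this form, run on plain dictionaries rather than the model. The substrate formulas (`C4H2O4` for X, `C4H4O4` for XH2) are hypothetical, the NADPH/NADP formulas are assumed to be the charged forms used in BiGG-style models, and `mass_balance` is an illustrative helper, not the COBRApy implementation.

```python
import re
from collections import Counter

def formula_to_elements(formula):
    """Parse a formula string into an element -> count dict."""
    elements = {}
    for symbol, count in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        elements[symbol] = elements.get(symbol, 0) + (int(count) if count else 1)
    return elements

def mass_balance(stoichiometry, formulas):
    """Sum element counts weighted by stoichiometry; an empty dict means balanced."""
    total = Counter()
    for metabolite, coefficient in stoichiometry.items():
        for element, n in formula_to_elements(formulas[metabolite]).items():
            total[element] += coefficient * n
    return {element: n for element, n in total.items() if n != 0}

# X + NADPH + H --> XH2 + NADP (negative coefficients are reactants)
stoichiometry = {'x': -1, 'nadph': -1, 'h': -1, 'xh2': 1, 'nadp': 1}
formulas = {'x': 'C4H2O4', 'xh2': 'C4H4O4', 'h': 'H',
            'nadph': 'C21H26N7O17P3', 'nadp': 'C21H25N7O17P3'}

balanced_error = mass_balance(stoichiometry, formulas)
print(balanced_error)    # {}

# If XH2 is read with two extra hydrogens, the reaction is off by exactly H2
overcounted = dict(formulas, xh2='C4H6O4')
overcounted_error = mass_balance(stoichiometry, overcounted)
print(overcounted_error)  # {'H': 2}
```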


In [None]:
for m in two_hydrogen_error[1].metabolites:
    print (m.name)
    print (m.formula)
    print (m.elements)
    print()

### Look into water molecule error

In [None]:
water_error = []
for r in unbalanced_multiple_formulas:
    if r.check_mass_balance() == {'H': 2.0, 'O': 1.0} or r.check_mass_balance() == {'H': -2.0, 'O': -1.0}:
        water_error.append(r)

In [None]:
for r in water_error:
    print (r.check_mass_balance(), r.reaction)

Notice that these all have a water molecule as a product.<br>
In these reactions one molecule loses a water.<br>
Because the formulas are duplicated, two water molecules are read as lost while only one is accounted for.<br>
By analogy with the two-hydrogen error, this would presumably be fixed by removing one of the duplicated formulas from each affected metabolite.

In [None]:
for m in water_error[2].metabolites:
    print (m.name)
    print (m.formula)
    print (m.elements)
    print()

In [None]:
[print(m.formula,m.name) for m in multiple_formulas if 'acyl' in m.name]

### Remove larger of two formulas for acyl-proteins

### Check how many metabolites still have multiple formulas

In [None]:
multiple_formulas = []
for m in model.metabolites:
    formulas = m.formula.split(';')
    if len(formulas) > 1:
        multiple_formulas.append(m)
    
print(f'{len(multiple_formulas)} metabolites still have multiple formulas')

### Check how many reactions are still unbalanced

In [None]:
unbalanced = []
for r in model.reactions:
    if r.check_mass_balance() != {} and should_be_balanced(r):
        unbalanced.append(r)
        
print(f'{len(unbalanced)} reactions are still unbalanced')

## Check most common mass errors now

In [None]:
mass_error_list = [make_hashable(r.check_mass_balance()) for r in unbalanced_multiple_formulas]
for mass_error in collections.Counter(mass_error_list).most_common()[:20]:
    print(mass_error)

## Inspect the balanced reactions with multiple formula metabolites

In [None]:
for r in balanced_multiple_formulas:
    if should_be_balanced(r):
        print(r.reaction)
        print([m.id for m in r.metabolites])
        print()

Most of these are transport reactions

In [None]:
set(make_hashable(r.check_mass_balance()) for r in balanced_multiple_formulas)

In [None]:
set(r.subsystem for r in model.reactions)

## Inspect unbalanced reactions with multiple formulas

Check the most frequent mass errors

In [None]:
mass_error_list = [make_hashable(r.check_mass_balance()) for r in unbalanced_multiple_formulas]
for mass_error in collections.Counter(mass_error_list).most_common()[:20]:
    print(mass_error)

### Write a function that takes a list of formula strings and returns the one with the lowest molecular weight
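One possible sketch of such a function, kept independent of the model so it can be tested on its own. The names `parse_formula`, `formula_weight`, and `lowest_weight_formula` are hypothetical, the atomic masses are rounded standard values, and giving the placeholder symbols R and X a weight of zero is an assumption made only for this illustration.

```python
import re

# Rounded standard atomic masses in g/mol; R and X are placeholders (assumed weightless)
ATOMIC_MASS = {'H': 1.008, 'C': 12.011, 'N': 14.007, 'O': 15.999,
               'P': 30.974, 'S': 32.06, 'R': 0.0, 'X': 0.0}

def parse_formula(formula):
    """Parse a formula string into an element -> count dict."""
    counts = {}
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def formula_weight(formula):
    """Approximate molecular weight; unknown symbols contribute zero."""
    return sum(ATOMIC_MASS.get(element, 0.0) * n
               for element, n in parse_formula(formula).items())

def lowest_weight_formula(formulas):
    """Return the formula string with the smallest approximate molecular weight."""
    return min(formulas, key=formula_weight)

# Hypothetical usage on a ';'-separated formula string
print(lowest_weight_formula('C10H17N;C10H15N'.split(';')))  # C10H15N
```

Applied to a metabolite, this would amount to `m.formula = lowest_weight_formula(m.formula.split(';'))`, which should be reviewed case by case before being run across the whole model.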

In [None]:
test_m = model.metabolites.get_by_id('ddcaACP_c')
formulas = test_m.formula.split(';')
for f in formulas:
    print(formula_dict_from_string(f))
    


In [None]:
for r in model.metabolites.get_by_id('ddcaACP_c').reactions:
    print(r.id)
    print(r.reaction)
    for m in r.metabolites:
        print(m.formula.split(';'))
    print()

In [None]:
model.metabolites.get_by_id('ACP_c').name

In [None]:
model.metabolites.get_by_id('ACP_c').formula.split(';')

In [None]:
len(model.metabolites.get_by_id('ACP_c').reactions)

In [None]:
model.metabolites.get_by_id('ddcap_c')

Try to figure out which metabolites with multiple formulas are most common in unbalanced reactions

In [None]:
len(unbalanced)

In [None]:
len(multiple_formulas)

In [None]:
metabolite_occurances = {}
for r in unbalanced:
    for m in r.metabolites:
        if m in multiple_formulas:
            metabolite_occurances[m.id] = metabolite_occurances.get(m.id, 0) + 1

In [None]:
dict(sorted(metabolite_occurances.items(), key=lambda item: -item[1]))

In [None]:
metabolite_with_x = []
for m in model.metabolites:
    if 'X' in m.formula:
        metabolite_with_x.append(m)
        
print(f'There are {len(metabolite_with_x )} metabolites with X in their formula')

In [None]:
for m in metabolite_with_x:
    print(m.id, m.name, m.formula.split(';'))

In [None]:
model.metabolites.get_by_id('fldox_c')

In [None]:
model.metabolites.get_by_id('fdxox_c')

In [None]:
metabolite_with_r = []
for m in model.metabolites:
    if 'R' in m.formula:
        metabolite_with_r.append(m)
        
print(f'There are {len(metabolite_with_r)} metabolites with R in their formula')

In [None]:
for m in metabolite_with_r:
    print(m.id, m.name, m.formula.split(';'))

Get all letters used in formulas

In [None]:
all_letters = []
for m in model.metabolites:
    for c in m.formula:
        if c.isalpha():
            all_letters.append(c)
            
print(set(all_letters))
    

In [None]:
for m in model.metabolites:
    if 'X' in m.formula:
        print (m.id, m.name, m.formula)

[X is the code for glutaredoxin in BiGG](http://bigg.ucsd.edu/models/universal/metabolites/grxox)

In [None]:
for m in model.metabolites:
    if 'R' in m.formula and 'PRS' not in m.formula:
        print (m.id, m.name, m.formula)

[R is the code for Ferricytochrome in BiGG](http://bigg.ucsd.edu/universal/metabolites/ficytc6)

In [None]:
for m in model.metabolites:
    if 'PRS' in m.formula:
        print (m.id, m.name, m.formula)

What does PRS mean in BiGG? It appears to be related to ACP

In [None]:
for m in model.reactions.get_by_id('Growth').metabolites:
    print(m.formula)

In [None]:
model.reactions.get_by_id('Growth').reaction.split('+')

Get set of elements in growth equation.

In [None]:
growth_elements = []
[growth_elements.extend(list(m.elements.keys())) for m in model.reactions.get_by_id('Growth').metabolites]
    
set(growth_elements)

In [None]:
for m in model.metabolites:
    if 'obsolete' in m.name:
        print(m.id, m.name, m.formula)

In [None]:
for r in model.metabolites.get_by_id('nadph_c').reactions:
    print(r.id, r.reaction)

Check elements of metabolites with only one formula

In [None]:
one_formula_elements = []
for m in model.metabolites:
    if ';' not in m.formula:
        one_formula_elements.extend((list(m.elements.keys())))
set(one_formula_elements)

This is okay. What are R and X? Check R first

In [None]:
for m in model.metabolites:
    if ';' not in m.formula and 'R' in m.formula and 'PRS' not in m.formula:
        print(m.id)
        print(m.name)
        print(m.formula)
        print()

## PRS is causing errors. This is a method to remove it

First check metabolites that have PRS that only have one formula

In [None]:
for m in model.metabolites:
    if ';' not in m.formula and 'PRS' in m.formula:
        print(m.id)
        print(m.name)
        print(m.formula)
        print()

There are four such metabolites. They will need to be fixed later. First we need to define what PRS is

Next, check metabolites that have PRS and multiple formulas

In [None]:
met_multiple_formulas_PRS = []
for m in model.metabolites:
    if ';' in m.formula and 'PRS' in m.formula:
        met_multiple_formulas_PRS.append(m)
print(f'There are {len(met_multiple_formulas_PRS)} metabolites with multiple formulas and PRS')

Define a function to subtract element dictionaries

In [None]:
def subtract_element_dicts(elements_1, elements_2):
    output = {}
    all_keys = list(elements_1.keys())
    element_2_keys = list(elements_2.keys())
    all_keys.extend(element_2_keys)
    all_keys = set(all_keys)
    
    
    for k in all_keys:
        if k in elements_1.keys() and k in elements_2.keys():
            output[k] = elements_1[k] - elements_2[k]
        elif k in elements_1.keys() and k not in elements_2.keys():
            output[k] = elements_1[k]
        else:
            output[k] = -1*elements_2[k]
            
    return output

Test that the function works

In [None]:
elements_1 = {'C': 394, 'H': 621, 'O': 144, 'N': 96, 'P': 1, 'S': 3}
elements_2 = {'C': 21, 'H': 39, 'N': 2, 'O': 9, 'P': 1, 'R': 1, 'S': 1}
subtract_element_dicts(elements_1, elements_2)

In [None]:
for m in met_multiple_formulas_PRS:
    formulas = m.formula.split(';')
    if len(formulas) == 2:
        elements_1 = formula_dict_from_string(formulas[0])
        elements_2 = formula_dict_from_string(formulas[1])
                                              
        print(m.id, m.formula, subtract_element_dicts(elements_1, elements_2))
        print()


Seems clear that PRS has formula N94H582S2C373O135R-1
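The arithmetic behind that formula can be checked on its own with the two element dictionaries from the test cell above. This sketch uses `collections.Counter.subtract` instead of the notebook's `subtract_element_dicts`; `element_difference` is a hypothetical helper, and unlike the notebook function it drops elements whose difference is zero (here P).

```python
from collections import Counter

def element_difference(elements_a, elements_b):
    """Element-wise a - b, keeping negative counts and dropping zeros."""
    difference = Counter(elements_a)
    difference.subtract(elements_b)
    return {element: n for element, n in difference.items() if n != 0}

elements_1 = {'C': 394, 'H': 621, 'O': 144, 'N': 96, 'P': 1, 'S': 3}
elements_2 = {'C': 21, 'H': 39, 'N': 2, 'O': 9, 'P': 1, 'R': 1, 'S': 1}

prs = element_difference(elements_1, elements_2)
print(prs)  # {'C': 373, 'H': 582, 'O': 135, 'N': 94, 'S': 2, 'R': -1}
```

The `R: -1` term reflects that the second formula carries the R placeholder while the first spells the same group out in full.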

In [None]:
Remove all 

Now check X

In [None]:
for m in model.metabolites:
    if ';' not in m.formula and 'X' in m.formula:
        print(m.id)
        print(m.name)
        print(m.formula)
        print()

Check to see if PRS is in metabolites with multiple formulas

In [None]:
for m in model.metabolites:
    if ';' in m.formula and 'PRS' in m.formula:
        print(m.id)
        print(m.name)
        print(m.formula)
        print(m.elements)
        print()