# Genome scale model curation
#### This notebook is currently not being used


This notebook assigns formulas to metabolites that have more than one formula in such a way that all reactions are balanced. 

### Methods
<ol>
<li>Assign the first formula to a metabolite if all of its listed formulas are equivalent</li>
<li>Assign a metabolite formula if it balances every reaction in which that metabolite is the only undefined one</li>
<li>Repeat but allow for some reactions to be unbalanced with respect to hydrogen</li>
<li>Assign remaining metabolites by checking if combinations of formulas are balanced or only off by hydrogen</li>
<li>Balance reactions that are only unbalanced by hydrogen by adding H+ to the deficient side</li>
<li>Manual curation (only a single metabolite needs to be set)</li>
</ol>

In [1]:
import cobra
from IPython.display import IFrame
from itertools import product

# Getting and preparing the genome-scale model

## Load *R. opacus* NCBI model generated by CarveMe

In [3]:
model = cobra.io.read_sbml_model("../models/Ropacus_annotated_curated.xml")
model

0,1
Name,ropacus_annotated_curated
Memory address,0x07f34791aad90
Number of metabolites,1956
Number of reactions,3025
Number of groups,0
Objective expression,1.0*Growth - 1.0*Growth_reverse_699ae
Compartments,"cytosol, periplasm, extracellular space"


In [13]:
# remove charge from the model
for m in model.metabolites:
    m.charge = 0

In [28]:
model.id = 'ropacus_annotated_curated'
model.name = 'Rhodococcus opacus PD630 annotated and curated'
model.description = 'Rhodococcus opacus PD630 annotated curated'

cobra.io.write_sbml_model(model, "../models/Ropacus_annotated_curated_no_charge_balanced_H.xml")

## Starting MEMOTE Output

In [3]:
# IFrame('../data/memotes/ropacus_annotated.html', 1500, 800)

Define functions to print status report

In [4]:
def should_be_balanced(r):
    return not (r.id.startswith('EX_') or r.id.startswith('sink_') or r.id.startswith('Growth'))

def has_metabolite_with_multiple_formulas(r):
    return len([m for m in r.metabolites if len(m.formula.split(';')) > 1]) > 0
    
def status_report():
    for i in range(1,5):
        num_formulas = [m for m in model.metabolites if len(m.formula.split(';')) == i]
        print(f'{len(num_formulas)} of {len(model.metabolites)} metabolites have {i} formula(s)')
    print('\n')
    
    unbalanced = [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]
    unbalanced_but_okay = [r for r in model.reactions if not should_be_balanced(r) and r.check_mass_balance() != {}]
    balanced = [r for r in model.reactions if r.check_mass_balance() == {}]
    
    unbalanced_multiple_formulas = [r for r in unbalanced if has_metabolite_with_multiple_formulas(r)]
    unbalanced_but_okay_multiple_formulas = [r for r in unbalanced_but_okay if has_metabolite_with_multiple_formulas(r)]
    balanced_multiple_formulas   = [r for r in   balanced if has_metabolite_with_multiple_formulas(r)]
    
    print(f'{len(unbalanced)} of the {len(model.reactions)} reactions in the model are wrongly unbalanced')
    print(f'{len(unbalanced_but_okay)} of the {len(model.reactions)} reactions in the model are properly unbalanced')
    print(f'{len(balanced)} of the {len(model.reactions)} reactions in the model are balanced')
    print('\n')
    
    print(f'{len(unbalanced_multiple_formulas)} of the {len(unbalanced)} improperly unbalanced reactions in the model have at least one metabolite with multiple formulas')
    print(f'{len(unbalanced_but_okay_multiple_formulas)} of the {len(unbalanced_but_okay)} properly unbalanced reactions in the model have at least one metabolite with multiple formulas')
    print(f'{len(balanced_multiple_formulas)} of the {len(balanced)} balanced reactions in the model have at least one metabolite with multiple formulas')

In [5]:
status_report()

1956 of 1956 metabolites have 1 formula(s)
0 of 1956 metabolites have 2 formula(s)
0 of 1956 metabolites have 3 formula(s)
0 of 1956 metabolites have 4 formula(s)


613 of the 3025 reactions in the model are wrongly unbalanced
339 of the 3025 reactions in the model are properly unbalanced
2073 of the 3025 reactions in the model are balanced


0 of the 613 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 339 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 2073 balanced reactions in the model have at least one metabolite with multiple formulas


In [25]:
def balanced_or_only_hydrogen_unbalanced(r):
    return should_be_balanced(r) and list(r.check_mass_balance().keys()) == ['H']

def fix_unbalanced_hydrogen(r):
    hydrogen_error = int(r.check_mass_balance()['H'])
    r.subtract_metabolites({model.metabolites.get_by_id("h_c"): hydrogen_error})

In [26]:
unbalanced_rxns = [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]

for r in unbalanced_rxns:
    if balanced_or_only_hydrogen_unbalanced(r):
        fix_unbalanced_hydrogen(r)
    

In [27]:
unbalanced_rxns = [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]

for r in unbalanced_rxns:
    mass_error = r.check_mass_balance()
    metabolites = [m.id for m in r.metabolites]
    print(r.id, mass_error)

AGPATr_BS {'C': -7.105427357601002e-15, 'H': -1.4210854715202004e-14, 'O': -3.552713678800501e-15, 'N': -8.881784197001252e-16}
G3POA_BS {'C': -7.105427357601002e-15, 'N': -8.881784197001252e-16, 'O': -3.552713678800501e-15}


These residuals are on the order of 1e-15, which is the size of floating-point round-off, so they are ignored here. They have not been corrected in the model.
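One way to keep residuals like these out of the unbalanced list is to compare each element against a tolerance instead of testing the dictionary for emptiness. The sketch below is only an illustration: the helper name `significant_imbalance` and the tolerance `1e-9` are assumptions, not part of the notebook or of cobra. It takes a dictionary shaped like the output of `check_mass_balance()`.

```python
def significant_imbalance(mass_error, tol=1e-9):
    """Keep only the entries whose absolute imbalance exceeds tol.

    mass_error is a dict shaped like the output of check_mass_balance().
    tol is a hypothetical threshold chosen for illustration.
    """
    return {key: value for key, value in mass_error.items() if abs(value) > tol}


# Residuals copied from the AGPATr_BS output above
agpatr_bs = {
    'C': -7.105427357601002e-15,
    'H': -1.4210854715202004e-14,
    'O': -3.552713678800501e-15,
    'N': -8.881784197001252e-16,
}

print(significant_imbalance(agpatr_bs))               # {}
print(significant_imbalance({'H': 1.0, 'C': 1e-15}))  # {'H': 1.0}
```

Checking each element separately avoids the case where a positive error in one element cancels a negative error in another, which can happen when the values are summed first.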

## Fix equivalent metabolite formulas
Define functions for this section

In [6]:
# def get_initial_number_string(substring):
#     initial_string = ''
#     for char in substring:
#         if char.isdigit():
#             initial_string += char
#         else:
#             return initial_string
#     return initial_string

# def formula_dict_from_string(formula_string):
#     formula_dict = {}
#     elements = [char for char in formula_string if char.isalpha()]
#     for element in elements:
#         string_after_element = formula_string.split(element, 1)[1]
#         coefficient = get_initial_number_string(string_after_element)
#         if coefficient == '':
#             coefficient = '1'
#         formula_dict[element] = int(coefficient)
#     return formula_dict

# def all_formulas_equivalent(m):
#     first_formula = m.formula.split(';')[0]
#     return len([f for f in m.formula.split(';') if formula_dict_from_string(f) != formula_dict_from_string(first_formula)]) > 0

In [7]:
# equivalent_formulas = 0
# for m in [m for m in model.metabolites if ';' in m.formula and not all_formulas_equivalent(m)]:
#     print(m.id, m.name, m.formula)
#     m.formula = m.formula.split(';')[0]
    
#     equivalent_formulas += 1

# print(f'There are {equivalent_formulas} metabolites with equivalent formulas, and they have been fixed.')

There are 0 metabolites with equivalent formulas, and they have been fixed.
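The commented-out parser above treats every letter as its own element, so a two-letter symbol such as `Fe` would be split. A minimal alternative sketch, assuming flat formulas with no parentheses or charge suffixes, uses a regular expression to read element symbols and counts. The names `formula_to_counts` and `formulas_are_equivalent` are hypothetical and not taken from the notebook.

```python
import re
from collections import Counter

# One capital letter, an optional lowercase letter, then an optional count.
ELEMENT_PATTERN = re.compile(r'([A-Z][a-z]?)(\d*)')


def formula_to_counts(formula):
    """Convert a flat formula string such as 'Fe2S2X' into element counts."""
    counts = Counter()
    for element, number in ELEMENT_PATTERN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def formulas_are_equivalent(formula_field):
    """True if every ';'-separated formula has the same element counts."""
    candidates = formula_field.split(';')
    first = formula_to_counts(candidates[0])
    return all(formula_to_counts(f) == first for f in candidates[1:])


print(dict(formula_to_counts('Fe2S2X')))         # {'Fe': 2, 'S': 2, 'X': 1}
print(formulas_are_equivalent('H2O;OH2'))        # True
print(formulas_are_equivalent('C2H6O;C2H4O'))    # False
```

With a helper like this, a metabolite whose formulas are all equivalent could safely take the first one, as described in the first method step.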


In [8]:
# status_report()

1952 of 1952 metabolites have 1 formula(s)
0 of 1952 metabolites have 2 formula(s)
0 of 1952 metabolites have 3 formula(s)
0 of 1952 metabolites have 4 formula(s)


610 of the 3021 reactions in the model are wrongly unbalanced
339 of the 3021 reactions in the model are properly unbalanced
2072 of the 3021 reactions in the model are balanced


0 of the 610 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 339 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 2072 balanced reactions in the model have at least one metabolite with multiple formulas


# Assign metabolite formulas by checking if one formula balances all reactions where it is the only undefined metabolite
Define functions for this section

In [9]:
# def m_only_undefined_metabolite(m1, r):
#     return ';' in m1.formula and len([m2 for m2 in r.metabolites if ';' in m2.formula and m1 != m2]) == 0

# def reactions_where_m_is_only_undefined_metabolite(m):
#     return [r for r in m.reactions if m_only_undefined_metabolite(m,r)]

# def fraction_of_reactions_formula_balances(m, formula, rxn_list):
#     original_formula = m.formula
#     m.formula = formula
#     balanced_reactions   = [r for r in rxn_list if r.check_mass_balance() == {}]
#     unbalanced_reactions = [r for r in rxn_list if r.check_mass_balance() != {}]
#     m.formula = original_formula
    
#     # avoid divide by zero
#     if len(balanced_reactions) + len(unbalanced_reactions) == 0:
#         return 0
#     return len(balanced_reactions) / (len(balanced_reactions) + len(unbalanced_reactions))

Repeat the loop below until no additional metabolites can be defined this way. A formula is assigned only when it is a perfect fit, meaning it balances every reaction in which the metabolite is the only undefined one.

In [10]:
# metabolites_that_can_be_defined = 1
# while metabolites_that_can_be_defined > 0:
#     metabolites_that_can_be_defined = 0
#     for m in [m for m in model.metabolites if ';' in m.formula]:
#         for f in m.formula.split(';'):
#             if fraction_of_reactions_formula_balances(m, f, reactions_where_m_is_only_undefined_metabolite(m)) == 1:
#                 print(m.name)
#                 print(m.formula)
#                 print(f)
#                 metabolites_that_can_be_defined += 1
#                 m.formula = f

#     print(f'{metabolites_that_can_be_defined} metabolites can be defined in this round')

0 metabolites can be defined in this round
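Since the model loaded here has no multi-formula metabolites, the loop above finds nothing to do. To show the idea on data where it does something, here is a toy sketch in plain Python that does not use cobra. Everything in it is hypothetical: the metabolite ids, the formulas, the two reactions, and the helper names. Unlike the notebook loop, it stops at the first candidate that is a perfect fit.

```python
import re


def formula_to_counts(formula):
    """Element counts for a flat formula string such as 'C2H6O'."""
    counts = {}
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def mass_error(stoich, formulas):
    """Net element counts of a reaction; an empty dict means balanced."""
    total = {}
    for met, coeff in stoich.items():
        for element, n in formula_to_counts(formulas[met]).items():
            total[element] = total.get(element, 0) + coeff * n
    return {e: v for e, v in total.items() if v != 0}


def resolve_perfect_fits(reactions, formulas):
    """Repeatedly assign formulas that balance every qualifying reaction."""
    formulas = dict(formulas)
    changed = True
    while changed:
        changed = False
        for met, field in list(formulas.items()):
            if ';' not in field:
                continue
            # Reactions where met is the only metabolite with several formulas
            rxns = [s for s in reactions.values()
                    if met in s and all(';' not in formulas[m] for m in s if m != met)]
            if not rxns:
                continue
            for candidate in field.split(';'):
                trial = {**formulas, met: candidate}
                if all(mass_error(s, trial) == {} for s in rxns):
                    formulas[met] = candidate
                    changed = True
                    break
    return formulas


# Hypothetical toy network: negative coefficients are substrates
reactions = {'R1': {'a_c': -1, 'b_c': 1}, 'R2': {'a_c': -1, 'c_c': 1}}
formulas = {'a_c': 'C2H4O;C2H6O', 'b_c': 'C2H6O', 'c_c': 'CH4;C2H6O'}

resolved = resolve_perfect_fits(reactions, formulas)
print(resolved)  # {'a_c': 'C2H6O', 'b_c': 'C2H6O', 'c_c': 'C2H6O'}
```

In this toy case `a_c` is fixed by `R1`, and only then does `c_c` become the sole undefined metabolite in `R2`, which is why the procedure is repeated until nothing changes.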


In [11]:
# status_report()

1952 of 1952 metabolites have 1 formula(s)
0 of 1952 metabolites have 2 formula(s)
0 of 1952 metabolites have 3 formula(s)
0 of 1952 metabolites have 4 formula(s)


610 of the 3021 reactions in the model are wrongly unbalanced
339 of the 3021 reactions in the model are properly unbalanced
2072 of the 3021 reactions in the model are balanced


0 of the 610 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 339 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 2072 balanced reactions in the model have at least one metabolite with multiple formulas


# Now repeat but allow for imperfect fitting
Assign the formulas that balance the greatest fraction of reactions first.
After each pass, recompute which remaining formula balances the greatest fraction of its reactions.

In [12]:
# highest_fraction = 1
# while highest_fraction > 0:
#     highest_fraction = 0
#     metabolites_that_can_be_defined = 0

#     # find highest fraction of reactions that are solved by a given formula
#     for m in [m for m in model.metabolites if ';' in m.formula]:
#         for f in m.formula.split(';'):
#             if fraction_of_reactions_formula_balances(m, f, reactions_where_m_is_only_undefined_metabolite(m)) > highest_fraction:
#                 highest_fraction = fraction_of_reactions_formula_balances(m, f, reactions_where_m_is_only_undefined_metabolite(m))

#     # assign formulas to metabolites with formula that gives a score equal to the best fraction
#     if highest_fraction > 0:
#         for m in [m for m in model.metabolites if ';' in m.formula]:
#             for f in m.formula.split(';'):
#                 if fraction_of_reactions_formula_balances(m, f, reactions_where_m_is_only_undefined_metabolite(m)) == highest_fraction:
#                     print(m.name)
#                     print(m.formula)
#                     print(f)
#                     m.formula = f
#                     metabolites_that_can_be_defined += 1

#     print(f'{metabolites_that_can_be_defined} metabolite(s) can be defined in this round with a fitting score of {highest_fraction}')

0 metabolite(s) can be defined in this round with a fitting score of 0


In [13]:
# status_report()

1952 of 1952 metabolites have 1 formula(s)
0 of 1952 metabolites have 2 formula(s)
0 of 1952 metabolites have 3 formula(s)
0 of 1952 metabolites have 4 formula(s)


610 of the 3021 reactions in the model are wrongly unbalanced
339 of the 3021 reactions in the model are properly unbalanced
2072 of the 3021 reactions in the model are balanced


0 of the 610 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 339 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 2072 balanced reactions in the model have at least one metabolite with multiple formulas


# Assign remaining metabolites by checking if combinations of formulas are balanced or only off by hydrogen
Note 1: There is a preference for more verbose formulas to minimize shorthand notation.<br>
Note 2: This is only done if there are fewer than 10 undefined metabolites, since the number of formula combinations grows exponentially (e.g. 10 undefined metabolites with 2 formulas each yield 2^10 = 1024 possible combinations).<br>
Define functions for this section

In [14]:
# def is_balanced(r):
#     return abs(sum(list(r.check_mass_balance().values()))) < 1e-5

# def balanced_or_only_hydrogen_unbalanced(r):
#     return should_be_balanced(r) and (is_balanced(r) or list(r.check_mass_balance().keys()) == ['H'])

# def reactions_off_by_more_than_hydrogen():
#     return [r for r in model.reactions if should_be_balanced(r) and not(balanced_or_only_hydrogen_unbalanced(r))]

# def chars_in_string_list(string_list):
#     total_chars = 0
#     for string in string_list:
#         total_chars += len(string)
#     return total_chars


In [15]:
# undefined_metabolites = [m for m in model.metabolites if ';' in m.formula]
# possible_formulas = [m.formula.split(';') for m in model.metabolites if ';' in m.formula]

# # only do this step if there is a reasonable number of undefined metabolites due to exponential growth of formula combinations
# if len(undefined_metabolites) < 10:
    
#     # initial best formulas are their original values
#     best_formulas = [m.formula for m in undefined_metabolites]
#     best_score = len(reactions_off_by_more_than_hydrogen())
#     best_length = 0

#     # goes through all permutations of formulas for undefined metabolites
#     for formulas in list(product(*possible_formulas)):
#         # assign the formulas to the metabolites
#         for count, m in enumerate(undefined_metabolites):
#             model.metabolites.get_by_id(m.id).formula = formulas[count]

#         # get the number of reactions that are off by more than hydrogen
#         unacceptable_reactions = reactions_off_by_more_than_hydrogen()
# #         print(len(unacceptable_reactions), formulas)  # this line gives a lot of detail

#         # if it's the best fit, replace the best formulas
#         if len(unacceptable_reactions) <= best_score:
#             if chars_in_string_list(formulas) > best_length:
#                 best_formulas = formulas
#                 best_score = len(reactions_off_by_more_than_hydrogen())
#                 best_length = chars_in_string_list(formulas)
                
# for count, m in enumerate(undefined_metabolites):
#     model.metabolites.get_by_id(m.id).formula = best_formulas[count]
#     print(f'For metabolite {m.id}, with possible formulas {possible_formulas[count]}, {best_formulas[count]} was chosen')
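The combination count in Note 2 can be checked directly. The sketch below uses a hypothetical list of ten metabolites with two candidate formulas each; the formulas are placeholders, not model data. It also iterates over `product` lazily instead of wrapping it in `list(...)`, which avoids holding every combination in memory at once.

```python
from itertools import product
from math import prod

# Hypothetical input: 10 undefined metabolites, 2 candidate formulas each
possible_formulas = [['C6H12O6', 'C6H11O6'] for _ in range(10)]

# The number of combinations is the product of the option counts
n_combinations = prod(len(options) for options in possible_formulas)
print(n_combinations)  # 1024

# Iterating lazily visits the same number of combinations
n_visited = sum(1 for _ in product(*possible_formulas))
print(n_visited == n_combinations)  # True
```

With three candidates per metabolite the same ten metabolites would already give 3^10 = 59049 combinations, each requiring a pass over all reactions, which is why the cell above is guarded by a size check.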

In [16]:
status_report()

1952 of 1952 metabolites have 1 formula(s)
0 of 1952 metabolites have 2 formula(s)
0 of 1952 metabolites have 3 formula(s)
0 of 1952 metabolites have 4 formula(s)


610 of the 3021 reactions in the model are wrongly unbalanced
339 of the 3021 reactions in the model are properly unbalanced
2072 of the 3021 reactions in the model are balanced


0 of the 610 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 339 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 2072 balanced reactions in the model have at least one metabolite with multiple formulas


# Balance Hydrogen
Define function to balance hydrogen

In [17]:
# def only_hydrogen_unbalanced(r):
#     return list(r.check_mass_balance().keys()) == ['H'] and should_be_balanced(r)

# def fix_unbalanced_hydrogen(r):
#     hydrogen_error = int(r.check_mass_balance()['H'])
#     r.subtract_metabolites({model.metabolites.get_by_id("h_c"): hydrogen_error})

In [18]:
# for r in [r for r in model.reactions if only_hydrogen_unbalanced(r)]:
#     fix_unbalanced_hydrogen(r)
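The sign convention in `fix_unbalanced_hydrogen` can be easy to misread, so here is a toy sketch in plain Python that mirrors it without cobra. The reaction, the hydrogen counts, and the helper names are hypothetical. A positive hydrogen error means the product side carries excess H, so the proton coefficient is lowered by that amount, which places H+ on the substrate side.

```python
def hydrogen_error(stoich, h_counts):
    """Net hydrogen count of a reaction (products minus substrates)."""
    return sum(coeff * h_counts[met] for met, coeff in stoich.items())


def add_protons_to_balance(stoich, h_counts, proton_id='h_c'):
    """Return a copy of stoich with the hydrogen error cancelled by protons."""
    error = hydrogen_error(stoich, h_counts)
    fixed = dict(stoich)
    # Same direction as subtract_metabolites({h_c: error}) in the notebook
    fixed[proton_id] = fixed.get(proton_id, 0) - error
    if fixed[proton_id] == 0:
        del fixed[proton_id]
    return fixed


# Hypothetical reaction a_c --> b_c where b_c has one more hydrogen than a_c
h_counts = {'a_c': 4, 'b_c': 5, 'h_c': 1}
reaction = {'a_c': -1, 'b_c': 1}

fixed = add_protons_to_balance(reaction, h_counts)
print(fixed)                            # {'a_c': -1, 'b_c': 1, 'h_c': -1}
print(hydrogen_error(fixed, h_counts))  # 0
```

Note that the notebook version casts the error with `int(...)`, which truncates, so a reaction with a non-integer hydrogen error would not be fully balanced by it.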

In [19]:
# status_report()

1952 of 1952 metabolites have 1 formula(s)
0 of 1952 metabolites have 2 formula(s)
0 of 1952 metabolites have 3 formula(s)
0 of 1952 metabolites have 4 formula(s)


610 of the 3021 reactions in the model are wrongly unbalanced
339 of the 3021 reactions in the model are properly unbalanced
2072 of the 3021 reactions in the model are balanced


0 of the 610 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 339 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 2072 balanced reactions in the model have at least one metabolite with multiple formulas


# Check remaining imbalances

In [20]:
# for r in [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]:
#     print(r, '\n', r.check_mass_balance(), '\n')

1P2CBXLCYCL: 5a2opntn_c <=> 1p2cbxl_c + h2o_c + h_c 
 {'charge': 1.0} 

1P2CBXLR: 1p2cbxl_c + 2.0 h_c + nadph_c --> nadp_c + pro__L_c 
 {'charge': -1.0} 

23CTI1: decoa_c --> dc2coa_c + h_c 
 {'charge': -3.0} 

23CTI2: dded3coa_c --> dd2coa_c 
 {'charge': -4.0} 

24DECOAR: dec4_2_coa_c + h_c + nadph_c --> dc2coa_c + nadp_c 
 {'charge': -4.0} 

2AGPGAT160: 2agpg160_c + atp_c + hdca_c --> amp_c + pg160_c + ppi_c 
 {'charge': 1.0} 

2AGPGAT161: 2agpg161_c + atp_c + hdcea_c --> amp_c + pg161_c + ppi_c 
 {'charge': -1.0} 

2DDARAA: 2ddara_c <=> gcald_c + pyr_c 
 {'charge': -1.0} 

2DGULRx: 2dhguln_c + h_c + nadh_c --> idon__L_c + nad_c 
 {'charge': -1.0} 

34DHPACDO: 34dhpacet_c + o2_c --> 5cmhmsa_c + h_c 
 {'charge': 2.0} 

4ABZGLUH: 4abzglu_c + h2o_c <=> 4abz_c + glu__L_c 
 {'charge': -2.0} 

4CMLCL_kt: 4cml_c + h_c --> 5odhf2a_c + co2_c 
 {'charge': -2.0} 

4H2KPILY: 4h2kpi_c --> pyr_c + sucsal_c 
 {'charge': -2.0} 

4HOXPACMOF_1: 4hoxpac_c + fadh2_c + o2_c --> 34dhpacet_c + fad_c + h2o_

### Manual Curation
Only one metabolite (fdxox_c) needs to be fixed manually

In [23]:
# model.metabolites.get_by_id('fdxox_c').formula = 'Fe2S2X'

In [24]:
# model.reactions.get_by_id('FRDO').check_mass_balance()

In [25]:
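# only_hydrogen_unbalanced is defined only in a commented-out cell above;
# balanced_or_only_hydrogen_unbalanced (defined earlier) performs the same check
for r in [r for r in model.reactions if balanced_or_only_hydrogen_unbalanced(r)]:
    fix_unbalanced_hydrogen(r)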

In [26]:
for r in [r for r in model.reactions if should_be_balanced(r) and r.check_mass_balance() != {}]:
    print(r, '\n', r.check_mass_balance(), '\n')

1P2CBXLCYCL: 5a2opntn_c <=> 1p2cbxl_c + h2o_c + h_c 
 {'charge': 1.0} 

1P2CBXLR: 1p2cbxl_c + 2.0 h_c + nadph_c --> nadp_c + pro__L_c 
 {'charge': -1.0} 

23CTI1: decoa_c --> dc2coa_c + h_c 
 {'charge': -3.0} 

23CTI2: dded3coa_c --> dd2coa_c 
 {'charge': -4.0} 

24DECOAR: dec4_2_coa_c + h_c + nadph_c --> dc2coa_c + nadp_c 
 {'charge': -4.0} 

2AGPGAT160: 2agpg160_c + atp_c + hdca_c --> amp_c + pg160_c + ppi_c 
 {'charge': 1.0} 

2AGPGAT161: 2agpg161_c + atp_c + hdcea_c --> amp_c + pg161_c + ppi_c 
 {'charge': -1.0} 

2DDARAA: 2ddara_c <=> gcald_c + pyr_c 
 {'charge': -1.0} 

2DGULRx: 2dhguln_c + h_c + nadh_c --> idon__L_c + nad_c 
 {'charge': -1.0} 

34DHPACDO: 34dhpacet_c + o2_c --> 5cmhmsa_c + h_c 
 {'charge': 2.0} 

4ABZGLUH: 4abzglu_c + h2o_c <=> 4abz_c + glu__L_c 
 {'charge': -2.0} 

4CMLCL_kt: 4cml_c + h_c --> 5odhf2a_c + co2_c 
 {'charge': -2.0} 

4H2KPILY: 4h2kpi_c --> pyr_c + sucsal_c 
 {'charge': -2.0} 

4HOXPACMOF_1: 4hoxpac_c + fadh2_c + o2_c --> 34dhpacet_c + fad_c + h2o_

In [27]:
status_report()

1952 of 1952 metabolites have 1 formula(s)
0 of 1952 metabolites have 2 formula(s)
0 of 1952 metabolites have 3 formula(s)
0 of 1952 metabolites have 4 formula(s)


614 of the 3021 reactions in the model are wrongly unbalanced
339 of the 3021 reactions in the model are properly unbalanced
2068 of the 3021 reactions in the model are balanced


0 of the 614 improperly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 339 properly unbalanced reactions in the model have at least one metabolite with multiple formulas
0 of the 2068 balanced reactions in the model have at least one metabolite with multiple formulas


# Save the curated model

In [28]:
# model.id = 'ropacus_annotated_curated'
# model.name = 'Rhodococcus opacus PD630 annotated and curated'
# model.description = 'Rhodococcus opacus PD630 annotated curated'

# cobra.io.write_sbml_model(model, "../models/Ropacus_annotated_curated.xml")

In [29]:
# model

0,1
Name,ropacus_annotated_curated
Memory address,0x07f2cce1b1050
Number of metabolites,1952
Number of reactions,3021
Number of groups,0
Objective expression,1.0*Growth - 1.0*Growth_reverse_699ae
Compartments,"cytosol, periplasm, extracellular space"


# Check memote of current model

In [31]:
# IFrame('../data/memotes/ropacus_annotated_curated.html', 1500, 800)