Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: correct update prot pool #276

Merged
merged 12 commits into from
Mar 20, 2023
Merged

fix: correct update prot pool #276

merged 12 commits into from
Mar 20, 2023

Conversation

ae-tafur
Copy link
Collaborator

@ae-tafur ae-tafur commented Mar 8, 2023

Main improvements in this PR:

  • Correct updateProtPool after constraint LB of protein usage with proteomics data.

Since _f_ accounts for the fraction that the unmeasured protein sector represents out of the total proteome in the cell, in update protein pool it must be calculate as the fraction of proteins with a concentration value assigned in constrainProtConcs, then is drawn from the original protein pool

  • Display the total protein value if a ptot input is not defined

I hereby confirm that I have:

  • Selected develop as a target branch (top left drop-down menu)

@ae-tafur ae-tafur requested a review from edkerk March 8, 2023 21:59
@github-actions
Copy link

github-actions bot commented Mar 8, 2023

This PR has been automatically tested with GH Actions. Here is the output of the tests:

 
Running geckoCoreFunctionTests
================================================================================
Verification failed in geckoCoreFunctionTests/testsaveECModel_tc0009.
---------------------
Framework Diagnostic:
---------------------
verifyEqual failed.
--> Path to failure: .rev
--> The numeric values are not equal using "isequaln".
--> Failure table:
Index Actual Expected Error RelativeError
_____ ______ ________ _____ _____________

12 1 0 1 Inf
13 1 0 1 Inf
14 1 0 1 Inf
15 1 0 1 Inf
16 1 0 1 Inf
17 1 0 1 Inf

Actual double:
17x1 double
Expected double:
17x1 double

Actual Value:
struct with fields:

name: 'testModel'
id: 'testModel'
annotation: [1x1 struct]
date: '2023-03-18'
description: ''
version: ''
rxns: {17x1 cell}
S: [10x17 double]
rev: [17x1 double]
metNames: {10x1 cell}
comps: {2x1 cell}
compNames: {2x1 cell}
metComps: [10x1 double]
mets: {10x1 cell}
grRules: {17x1 cell}
rxnGeneMat: [17x5 double]
genes: {5x1 cell}
ub: [17x1 double]
lb: [17x1 double]
c: [17x1 double]
rxnNames: {17x1 cell}
b: [10x1 double]
eccodes: {17x1 cell}
metNotes: {10x1 cell}
ec: [1x1 struct]
Expected Value:
struct with fields:

id: 'testModel'
name: 'testModel'
description: ''
version: ''
date: '2023-03-18'
annotation: [1x1 struct]
rxns: {17x1 cell}
rxnNames: {17x1 cell}
mets: {10x1 cell}
metNames: {10x1 cell}
S: [10x17 double]
lb: [17x1 double]
ub: [17x1 double]
rev: [17x1 double]
c: [17x1 double]
b: [10x1 double]
genes: {5x1 cell}
grRules: {17x1 cell}
rxnGeneMat: [17x5 double]
eccodes: {17x1 cell}
metComps: [10x1 double]
metNotes: {10x1 cell}
comps: {2x1 cell}
compNames: {2x1 cell}
ec: [1x1 struct]
------------------
Stack Information:
------------------
In /home/m/actions-runner/_work/GECKO/GECKO/test/unit_tests/geckoCoreFunctionTests.m (testsaveECModel_tc0009) at 232
================================================================================
Model-specific DLKcat input stored at /home/m/actions-runner/_work/GECKO/GECKO/test/unit_tests/ecTestGEM/data/DLKcat_input_test.tsv
Limit has been reached. Protein P5 seems to be problematic. Consider changing the kcat.
Done geckoCoreFunctionTests
__________

Failure Summary:

Name Failed Incomplete Reason(s)
============================================================================================
geckoCoreFunctionTests/testsaveECModel_tc0009 X Failed by verification.

Note: In the case of multiple test runs, this post will be edited.

@edkerk
Copy link
Member

edkerk commented Mar 9, 2023

Since _f_ accounts for the fraction that the unmeasured protein sector represents out of the total proteome in the cell, in update protein pool it must be calculate as the fraction of proteins with a concentration value assigned in constrainProtConcs, then is drawn from the original protein pool

I think there is confusion on what f represents. f is not the fraction of unmeasured protein, as is suggested above.

  • f is the enzyme fraction of the total proteome.
  • When no proteomics data is used to constrain the model, f can be estimated from paxDB.tsv (this contains quantities for all proteins: what fraction of thoses are enzymes?).
  • For yeast, calculateFfactor(ecModel) gives 0.4692, but 0.5 is also a fair estimate.
  • In no-proteomics ecModel, the protein pool exchange reaction is constrained by how many enzyme the cell contains: Ptot * f (ignoring sigma here).
  • As reasonable estimate, we can suggest for now, Ptot * f = 0.5 * 0.5 = 0.25.

Now, when the ecModel is constrained with proteomics data:

  • model.ec.concs will sum up to part of that 0.25. It is not the full 0.25 because (a) not all proteins are measured; (b) low level proteins are below detection limit; (c) variation.
  • Let's say sum(model.ec.concs) is 116 (which is the case in ecYeastGEM example).
  • What updateProtPool is doing is to consider that the cell would indeed have 0.25 enzyme, then subtract the sum of model.ec.concs as those proteins will now be directly constrained and no longer draw from the prot_pool.
  • So Pdiff should be 250 - 116 = 133.
  • A bit confusing is that Ptot is defined (in ModelAdapter) in g/gDCW, while otherwise we use mg/gDCW (maybe we should change this?).
  • More confusing is that setProtPoolSize takes the total protein content and f as separate parameters, so we cannot just directly use 133 as input.
  • However, it is not essential to use setProtPoolSize, we can just directly set the LB of prot_pool_exchange.

What your proposed code does is:

  • Calculates f as fraction of unmeasured proteins (but ignoring that the proteomics data set contained more proteins than those included in model.ec.concs, as these are not enzymes).
  • PdiffEnz = PtotEnz - PmeasEnz itself would make sense if PmeasEnz would actually represent all measured proteins, but this should be sum(proteomics.tsv) in essence, not sum(model.ec.concs).
  • Even if PdiffEnz would be determined from all measured proteins, in setProtPoolSize it should not use the incorrectly (="unmeasured") defined f.

If anything, the meaning of f should be better documented, and this function should be made more intuitive.

@ae-tafur
Copy link
Collaborator Author

ae-tafur commented Mar 9, 2023

A bit confusing is that Ptot is defined (in ModelAdapter) in g/gDCW, while otherwise we use mg/gDCW (maybe we should change this?).

In proteomics model, ptot is in fluxData.tsv, and updateProtPool can receives ptot as parameter, then when using this function, value stored in fluxData.tsv should be used.

protModel = updateProtPool(protModel, fluxData.Ptot(i));

More confusing is that setProtPoolSize takes the total protein content and f as separate parameters, so we cannot just directly use 133 as input.

Allow updateProtPool to define it ? or just use the lb ?

PdiffEnz = PtotEnz - PmeasEnz itself would make sense if PmeasEnz would actually represent all measured proteins, but this should be sum(proteomics.tsv) in essence, not sum(model.ec.concs).

At this point the loadProtData can return the sum(proteomics.tsv) as new field in the structure, containing this value as array of n condition values

@edkerk
Copy link
Member

edkerk commented Mar 9, 2023

A bit confusing is that Ptot is defined (in ModelAdapter) in g/gDCW, while otherwise we use mg/gDCW (maybe we should change this?).

In proteomics model, ptot is in fluxData.tsv, and updateProtPool can receives ptot as parameter, then when using this function, value stored in fluxData.tsv should be used.
protModel = updateProtPool(protModel, fluxData.Ptot(i));

Indeed, this should be documented better. But my main point is that it Ptot is in g/gDCW (in ModelAdapter and in fluxData.tsv), and not mg/gDCW.

More confusing is that setProtPoolSize takes the total protein content and f as separate parameters, so we cannot just directly use 133 as input.

Allow updateProtPool to define it ? or just use the lb ?

That is indeed my suggestion, to just directly set the LB to whatever value is calculated in this function.

PdiffEnz = PtotEnz - PmeasEnz itself would make sense if PmeasEnz would actually represent all measured proteins, but this should be sum(proteomics.tsv) in essence, not sum(model.ec.concs).

At this point the loadProtData can return the sum(proteomics.tsv) as new field in the structure, containing this value as array of n condition values

Good suggestion. Should this be done on the data without filtering? If we filter to discard the most variable protein levels, the sum will be (much?) lower. Meanwhile, while high variance for individual proteins is problematic when filling model.ec.concs (and constraining the model), when we are only interested in the sum of the values these variances across individual proteins somewhat cancel out each other?

@ae-tafur
Copy link
Collaborator Author

ae-tafur commented Mar 9, 2023

Good suggestion. Should this be done on the data without filtering? If we filter to discard the most variable protein levels, the sum will be (much?) lower. Meanwhile, while high variance for individual proteins is problematic when filling model.ec.concs (and constraining the model), when we are only interested in the sum of the values these variances across individual proteins somewhat cancel out each other?

Yes, I think should be before filtering. I don't think so the sum will be much lower, but will depend of the cutoff set by the user. So to avoid that and give (probably) some flexibility. It better to use all of them.

This value should be the mean in the replicates, since in Ptot we use a mean value.

@ae-tafur
Copy link
Collaborator Author

Few changes have been introduced to this PR,

  • f for proteomics integration counts on the measured protein (average sum per condition) / total protein content
  • loadProtData also returns the measured proteins in proteomics.tsv
  • updateProtPool draw from measured protein per condition the sum of protein concentration in model.ec.cons, then the new prot_pool is calculated

@edkerk
Copy link
Member

edkerk commented Mar 17, 2023

To clarify how the new constraint is calculated (now also included in updateProtPool):

% Ptot      the "real" total protein content in the cell
% Pmeas     the measured protein content (sum of proteomics data). Is lower
%           than Ptot, as not all protein are measured in a proteomics
%           experiment. Equals protData.measuredProt from loadProtData.
% f         the ratio enzyme/protein: how many of proteins are enzymes
% m         the ratio measured/total protein: how much is measured
% PtotEnz   the "real" total enzyme content in the cell. Equals f * Ptot
% PmeasEnz  the measured enzyme content. Equals f * Pmeas, also equals
%           sum(ecModel.ec.concs)
% PdiffEnz  the non-measured enzyme content. Equals PtotEnz - PmeasEnz.
%
% Without proteomics, protein pool is constraint by PtotEnz (= f * Ptot).
% With proteomics, protein pool should be constraint by PdiffEnz.
% [And the above two constraints are both multiplied by sigma-factor, which
% is unrelated to the adjustments made here]

In addition, if a non-functional model is resulted, the sigma factor is relaxed to 1 to potentially resolve this. sigmaFitter is not directly run by this function, as then the growth rate would also become an input, and it's better to direct the user to use this function.

@edkerk edkerk added this to the 3.0.2 milestone Mar 20, 2023
@edkerk edkerk merged commit 2f2036c into develop Mar 20, 2023
@edkerk edkerk deleted the fix/updateProtPool branch March 20, 2023 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants