Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue of get_isoforms with itertools.product #39

Closed
jalew188 opened this issue Aug 10, 2020 · 5 comments
Closed

Issue of get_isoforms with itertools.product #39

jalew188 opened this issue Aug 10, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@jalew188
Copy link
Contributor

For phosphorylation as an example, there may be many phospho sites on a sequence, resulting in a lot of isoforms. If max_isoforms is large enough, there may be too many isoforms to be considered; if max_isoforms is too small, modifications on left-side AAs may be excluded due to the behavior of itertools.product.

seq = "SBCDMFSSSSSSSSSSSMMFD" # maybe more phosites when sequence gets longer
mod_dict = dict(zip("S,M".split(","), "(ph)S,(ox)M".split(",")))

max_isoforms = 1000000
peplist = []
from time import perf_counter
start = perf_counter()
for i in range(1000):
    peptides = get_isoforms(mod_dict, seq, max_isoforms)
    peplist.extend(peptides)
end = perf_counter()
print(peptides[:100])
import sys
print(f'Memory usage and running time for 1000 repeats: {sys.getsizeof(peplist)/10**6:2f} MB, {end-start:2f} s, # of isoforms without repeat: {len(peptides)}')

Outputs:

['SBCDMFSSSSSSSSSSSMMFD', 'SBCDMFSSSSSSSSSSSM(ox)MFD', 'SBCDMFSSSSSSSSSSS(ox)MMFD', 'SBCDMFSSSSSSSSSSS(ox)M(ox)MFD', 'SBCDMFSSSSSSSSSS(ph)SMMFD', 'SBCDMFSSSSSSSSSS(ph)SM(ox)MFD', 'SBCDMFSSSSSSSSSS(ph)S(ox)MMFD', 'SBCDMFSSSSSSSSSS(ph)S(ox)M(ox)MFD', 'SBCDMFSSSSSSSSS(ph)SSMMFD', 'SBCDMFSSSSSSSSS(ph)SSM(ox)MFD', ..., 'SBCDMFSSSSSS(ph)SS(ph)S(ph)S(ph)S(ox)MMFD', 'SBCDMFSSSSSS(ph)SS(ph)S(ph)S(ph)S(ox)M(ox)MFD', 'SBCDMFSSSSSS(ph)S(ph)SSSSMMFD', 'SBCDMFSSSSSS(ph)S(ph)SSSSM(ox)MFD', 'SBCDMFSSSSSS(ph)S(ph)SSSS(ox)MMFD', 'SBCDMFSSSSSS(ph)S(ph)SSSS(ox)M(ox)MFD']
Memory usage and running time for 1000 repeats: 271.614064 MB, 14.134513 s, # of isoforms without repeat: 32768
@jalew188 jalew188 added the enhancement New feature or request label Aug 10, 2020
@straussmaximilian
Copy link
Member

straussmaximilian commented Aug 10, 2020

Great catch and thanks for reporting the issue. For searching with a lot of modifications, I implemented an "on the fly"-mode, which does not save the database and can search 1 protein at a time. This enables to search sequences with a lot of modifications.

For general stability, we could set the max_isoforms according to the system's RAM.

How many isoforms would you consider to be too many? What would you think of re-writing itertools.product and randomly placing the modifications?

@jalew188
Copy link
Contributor Author

@straussmaximilian Thank you for the reply. "On the fly" mode is a good solution, but the question is that too many isoforms may lead to same amount of random matches (N-1, since only one peptide is the true positive), and unnecessary scorings, especially for phosphopeptides.

I have no idea about what max_isoforms value is the best. For re-writing itertools.product, we can generate isoforms with one mod, and then two mods, ...., until max_isoforms is reached. Thus we would not miss any site, although we would miss isoforms with more mods.

@swillems
Copy link
Collaborator

I had similar issues when I was analyzing histone ptms. I think an important conceptual realisation is that there are three different levels to identification (https://pubs.acs.org/doi/10.1021/acs.jproteome.6b00724). You want to:

  1. determine which peptide (backbone sequence) belongs to a spectrum. From a database perspective this is easy. Most precursor/spectrum masses only have a few probable sequence explanations.
  2. determine which ptm combination is present. In terms of combinatorics, this is not hard either. Even the sequence that Feng provided only has a few available ptm combinations. With 12 Serines and 3 Methionines, there are only 52 (equal to (12+1)*(3+1), since unmodified also needs to be considered) different precursor masses possible.
  3. determine the exact location of these PTMs . This is actually the only part where it gets really tricky. With the above sequence that are essentially 2^15 precursors that need to be evaluated. This particular value still is feasible, but it can blow up quickly when more PTMs than just phospos and oxidations ar allowed.

In terms of database storing, I do not believe the first two situations are problematic. Keeping the PTM combinations (case 2) should provide very usefull in quick filtering of spectrum candidates. For case three, the "on-the-fly" approach seems very reasonable to me.

In terms of PTM localization (case 3), a max_isoforms could be useful. Randomizing itertools seems like a bad idea, especially if you want it to be transparent and use it as a benchmarking tool. Feng's suggestion to increase complexity by increasing from one mod to n mods sounds better to me. Easily explained and probably quite performant as well. Last but not least, most biological sequences tend to have only a few PTMs anyways, since there is some stoichiometric hindrance if too many PTMs are placed close to each other (i.e. on a peptide).

Though I might sound like a broken record, I would also suggest another conceptual approach. As mentioned, determining a PTM combination is not that hard, the localization is the problem. Naturally we need fragments for this. However, if you treat fragments individually, you can greatly reduce at least part of the search. If you also only consider fragments with a PTM combo instead of actually localizing them, there are only at most 2000 fragment masses that can be formed (40 b/y fragments, all with a maximum of 52 PTM combinations as explained above in case 2). While this leaves some difficulties when reassembling the fragments afterwards, such a "fragment-centric" approach at least greatly reduces the computational resources that are needed, or at least in terms of RAM...

@jalew188
Copy link
Contributor Author

@swillems Extending situation 2 to 3 sounds like an open-multi-PTM-search (an extension for open search). We can use the delta mass between the sequence and precursor to locate a/several PTM combinations. And for 3, we can use a dynamic programming algorithm to localize the PTM sites when PTM combinations are known, this is what I have done in pGlyco3 for O-glycosylation site localization (which is more complicated comparing with traditional PTM localization).

@straussmaximilian
Copy link
Member

Sounds all very good to me. So I would then say we do the following:

For now, we make get_isoforms more deterministic with Feng's suggestion to increase the mods until max_isoforms are reached.

For proper PTM support, we can define a new issue/project and use the delta-mass / dynamic programming algorithm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants