Issue of get_isoforms with itertools.product #39
Comments
Great catch, and thanks for reporting the issue. For searching with a lot of modifications, I implemented an "on the fly" mode, which does not save the database and searches one protein at a time. This makes it possible to search sequences with a lot of modifications. For general stability, we could set the How many isoforms would you consider to be too many? What would you think of re-writing itertools.product and randomly placing the modifications?
@straussmaximilian Thank you for the reply. "On the fly" mode is a good solution, but the problem is that too many isoforms may lead to a similar number of random matches (N-1, since only one isoform is the true positive) and to unnecessary scoring, especially for phosphopeptides. I have no idea which max_isoforms value is best. As for re-writing itertools.product, we can generate isoforms with one mod, then two mods, and so on, until max_isoforms is reached. That way we would not miss any site, although we would miss isoforms with more mods.
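To make the suggestion above concrete, here is a minimal sketch that enumerates isoforms by increasing number of modifications and compares it with a truncated itertools.product. Both function names, the `mod_aas` parameter and the representation of an isoform as a tuple of modified positions are assumptions for illustration only; this is not the actual get_isoforms code.

```python
from itertools import chain, combinations, islice, product


def _sites(sequence, mod_aas):
    # Positions of residues that may carry the (single, hypothetical) modification.
    return [i for i, aa in enumerate(sequence) if aa in mod_aas]


def isoforms_by_mod_count(sequence, mod_aas="STY", max_isoforms=1024):
    """Isoforms as tuples of modified positions: 0 mods, then 1 mod, then 2 mods, ..."""
    sites = _sites(sequence, mod_aas)
    by_count = chain.from_iterable(
        combinations(sites, n) for n in range(len(sites) + 1)
    )
    # Truncation now drops the most heavily modified isoforms first.
    return list(islice(by_count, max_isoforms))


def isoforms_by_product(sequence, mod_aas="STY", max_isoforms=1024):
    """Stand-in for a truncated itertools.product over on/off states per site."""
    sites = _sites(sequence, mod_aas)
    states = product((False, True), repeat=len(sites))
    # The rightmost site varies fastest, so truncation starves the left-side sites.
    return [
        tuple(s for s, on in zip(sites, state) if on)
        for state in islice(states, max_isoforms)
    ]


seq = "SASTSY"  # modifiable positions: 0, 2, 3, 4, 5
print(isoforms_by_mod_count(seq, max_isoforms=6))
print(isoforms_by_product(seq, max_isoforms=6))
```

With max_isoforms=6, the first list contains the unmodified form plus every single-site isoform, while the product-based list only ever touches positions 3, 4 and 5.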
I had similar issues when I was analyzing histone PTMs. I think an important conceptual realisation is that there are three different levels of identification (https://pubs.acs.org/doi/10.1021/acs.jproteome.6b00724). You want to:
In terms of database storage, I do not believe the first two situations are problematic. Keeping the PTM combinations (case 2) should prove very useful for quickly filtering spectrum candidates. For case three, the "on-the-fly" approach seems very reasonable to me. In terms of PTM localization (case 3), a Though I might sound like a broken record, I would also suggest another conceptual approach. As mentioned, determining a PTM combination is not that hard; the localization is the problem. Naturally, we need fragments for this. However, if you treat fragments individually, you can greatly reduce at least part of the search. If you also only consider fragments with a PTM combination instead of actually localizing the PTMs, there are at most roughly 2000 fragment masses that can be formed (40 b/y fragments, each with a maximum of 52 PTM combinations as explained above in case 2). While this leaves some difficulties when reassembling the fragments afterwards, such a "fragment-centric" approach greatly reduces the computational resources that are needed, at least in terms of RAM...
@swillems Extending situation 2 to 3 sounds like an open multi-PTM search (an extension of open search). We can use the delta mass between the sequence and the precursor to find one or several PTM combinations. And for 3, we can use a dynamic programming algorithm to localize the PTM sites once the PTM combinations are known. This is what I have done in pGlyco3 for O-glycosylation site localization (which is more complicated than traditional PTM localization).
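As a rough illustration of the delta-mass idea, the sketch below lists every PTM combination whose summed monoisotopic mass matches the difference between the precursor mass and the unmodified sequence mass. The function name, the small modification table, the `max_mods` limit and the absolute tolerance in Da are all assumptions chosen for the example; a real search would use the project's own modification definitions and a ppm tolerance on the precursor.

```python
from itertools import combinations_with_replacement

# Monoisotopic mass shifts in Da for a few common modifications (illustrative subset).
MOD_MASSES = {
    "Acetyl": 42.010565,
    "Methyl": 14.015650,
    "Oxidation": 15.994915,
    "Phospho": 79.966331,
}


def ptm_combinations_for_delta(delta_mass, mod_masses=MOD_MASSES, max_mods=3, tol=0.01):
    """Return all multisets of at most max_mods modifications matching delta_mass."""
    names = sorted(mod_masses)
    matches = []
    for n_mods in range(1, max_mods + 1):
        # combinations_with_replacement allows the same modification several times.
        for combo in combinations_with_replacement(names, n_mods):
            total = sum(mod_masses[name] for name in combo)
            if abs(total - delta_mass) <= tol:
                matches.append(combo)
    return matches


print(ptm_combinations_for_delta(159.932662))  # two phospho groups
print(ptm_combinations_for_delta(42.010565))   # acetyl, not trimethyl at this tolerance
```

Brute-force enumeration is fine for a handful of modifications; for large modification lists the combination masses could be precomputed and sorted once, then looked up by binary search.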
That all sounds very good to me. So I would say we do the following: for now, we make get_isoforms more deterministic with Feng's suggestion to increase the number of mods until max_isoforms is reached. For proper PTM support, we can define a new issue/project and use the delta-mass / dynamic programming algorithm.
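For the dynamic programming part, a generic sketch of site localization for a known number of identical modifications could look like the following. It is not the pGlyco3 algorithm, which is not described in this thread. The function name and the `prefix_score(i, k)` callback are hypothetical: the callback stands for whatever fragment evidence supports "the first i residues carry k modifications" (for example, matched b-ion and complementary y-ion intensity).

```python
def localize_mods(site_allowed, n_mods, prefix_score):
    """Place n_mods identical modifications so the summed prefix evidence is maximal.

    site_allowed[i] is True if residue i may carry the modification.
    prefix_score(i, k) scores the prefix of length i carrying k modifications.
    Returns (best_score, sorted list of modified residue indices).
    """
    n = len(site_allowed)
    neg_inf = float("-inf")
    # best[i][k]: best summed score over prefixes 1..i with k mods in the first i residues.
    best = [[neg_inf] * (n_mods + 1) for _ in range(n + 1)]
    placed = [[False] * (n_mods + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for k in range(min(i, n_mods) + 1):
            skip = best[i - 1][k]  # residue i-1 stays unmodified
            take = best[i - 1][k - 1] if k > 0 and site_allowed[i - 1] else neg_inf
            prev = max(skip, take)
            if prev == neg_inf:
                continue  # state not reachable
            best[i][k] = prev + prefix_score(i, k)
            placed[i][k] = take > skip
    if best[n][n_mods] == neg_inf:
        return neg_inf, []
    # Trace back which residues were modified on the best path.
    sites, k = [], n_mods
    for i in range(n, 0, -1):
        if placed[i][k]:
            sites.append(i - 1)
            k -= 1
    return best[n][n_mods], sorted(sites)


# Toy example: peptide "ASAST" with candidate sites at indices 1, 3 and 4.
allowed = [aa in "STY" for aa in "ASAST"]
# Hypothetical evidence consistent with a single modification on index 3.
evidence = {(1, 0): 1, (2, 0): 1, (3, 0): 1, (4, 1): 1, (5, 1): 1}
print(localize_mods(allowed, 1, lambda i, k: evidence.get((i, k), 0)))
```

The table has (length + 1) × (n_mods + 1) states, so the cost grows linearly with peptide length rather than with the number of isoforms, which is the point of using dynamic programming once the PTM combination is known.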
For phosphorylation as an example, there may be many phospho sites on a sequence, resulting in a lot of isoforms. If max_isoforms is large enough, there may be too many isoforms to be considered; if max_isoforms is too small, modifications on left-side AAs may be excluded due to the behavior of itertools.product.
Outputs: