Wrong peptide start and end positions #771
Comments
Hi David, Thanks for your feedback. I think it is also related to the issue Nesvilab/philosopher#430 . We will see what we can do. Best, Fengchao |
@prvst Is it fixed? Best, Fengchao |
I updated some rules since this report. It would be good to try running it again with the current release to see the results. |
Hi, Thanks for the update. I can run my samples with the latest release and let you know in the next days if the issue is gone. Best, |
Hi, I ran the same dataset with the newest version of FragPipe and used the standard LFQ-MBR workflow.
Unfortunately, the issue still seems to exist and got slightly worse in my example. I parsed the combined_peptides.tsv file and extracted the "Protein ID", "Peptide Sequence", "Start" and "End" columns. Then I used this information to extract peptides from the FASTA using the given protein ID, start and end positions. The three examples I previously mentioned still exist, for which the reported sequence matches a different position within the same protein. In addition, for about 300 peptides the reported sequence did not match the reported positions in the protein sequence. Again, the mismatch was purely due to I and L differences. The reported peptide could, however, be perfectly matched to a different protein in the FASTA file. I guess that when deciding which protein will be reported, the start and end positions are updated but not the peptide sequence. Let me know if providing additional information would be helpful. Best, |
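For reference, the check described in this comment can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the column names come from the comment, while the function names, the UniProt-style header parsing (`>db|ACCESSION|NAME`), the 1-based inclusive reading of Start/End, and the toy data are all assumptions.

```python
import csv
import io


def read_fasta(handle):
    """Map accession -> sequence. Assumes UniProt-style '>db|ACCESSION|NAME' headers."""
    sequences, accession = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            header = fields[0] if fields else ""
            parts = header.split("|")
            accession = parts[1] if len(parts) >= 3 else header
            sequences[accession] = []
        elif line and accession is not None:
            sequences[accession].append(line)
    return {acc: "".join(chunks) for acc, chunks in sequences.items()}


def check_positions(tsv_handle, proteins):
    """Return rows whose Start/End slice differs from the reported peptide."""
    mismatches = []
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        protein = proteins.get(row["Protein ID"])
        if protein is None:
            continue  # protein not in this FASTA; skipped in this sketch
        start, end = int(row["Start"]), int(row["End"])
        peptide = row["Peptide Sequence"]
        found = protein[start - 1:end]  # Start/End taken as 1-based, inclusive
        if found != peptide:
            # True when the only differences are I <-> L substitutions
            il_only = found.replace("I", "L") == peptide.replace("I", "L")
            mismatches.append((row["Protein ID"], peptide, found, il_only))
    return mismatches


# Toy data, invented for illustration (not taken from the real dataset)
toy_fasta = io.StringIO(
    ">sp|P1|TOY_HUMAN toy protein\nMAGDSLDSVEALLKAAAR\nGDSLDSVEALIKTT\n"
)
toy_table = io.StringIO(
    "Protein ID\tPeptide Sequence\tStart\tEnd\n"
    "P1\tGDSLDSVEALIK\t3\t14\n"
    "P1\tGDSLDSVEALIK\t19\t30\n"
)
mismatches = check_positions(toy_table, read_fasta(toy_fasta))
```

With real files, the two `io.StringIO` objects would be replaced by opened file handles; rows with empty or non-numeric Start/End values would need extra handling that this sketch leaves out.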
Hi David,
Thanks for the test
Felipe, could you please take another look and see if you can replace the peptide sequence reported with the correct I/L (that matches the protein to which the peptide is reassigned)?
Thanks
Alexey
In reply to David Hollenstein's message of Wednesday, December 14, 2022, which listed this version info:
FragPipe version 19.0
MSFragger version 3.6
IonQuant version 1.8.9
Philosopher version 4.6.0
|
Hi David. Do you see the same occurring with the PSM table? |
Yes, the issue is already present in the PSM table. |
@hollenstein can you share the interact files and the FASTA? I would like to see those rare cases. |
Sure, I will send you a link |
Thanks. Quick question; have you noticed that you have completely identical sequences in your database under different IDs? |
Hi, thanks for the information. In the MS facility we use our own contaminants file, so I was expecting that some entries might become duplicated in combination with certain organisms. Otherwise we simply use the UniProt reference database (one protein per gene). Actually, I wasn't aware that they still contain duplicate entries. It seems that those are gene duplications with identical protein sequences, so I guess it's fine that they are present. Afaik it shouldn't really matter for the analysis, right? |
@hollenstein I released a new version that corrects the mapping of cases like you describe above. Please download it and give it a try. |
Hi, sorry for the delay. I've re-run FragPipe with the following versions: Unfortunately, the issue does not seem to be resolved and there is an additional bug. For about 10% of all entries in the combined_ion.tsv file, the “Start” and “End” entries are now 0, so I couldn’t check if the reported positions are correct. In order to test the “I” and “L” issue, I’ve checked if the reported peptide sequence is present in the sequence of the reported protein. For ~100 entries this was not the case. However, when treating “I” and “L” as indistinguishable, all peptides could be found within the protein sequence, and all those peptides appeared to be fully tryptic. |
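A rough sketch of the fallback test described here, for cases where Start and End are 0 and only containment can be checked. The function names and the toy sequences are hypothetical; only the idea of comparing with and without I/L collapsed comes from the comment.

```python
def has_positions(start: int, end: int) -> bool:
    """Treat Start/End of 0 as 'no position reported'."""
    return start > 0 and end > 0


def classify_presence(peptide: str, protein: str) -> str:
    """Say whether the peptide occurs in the protein exactly, only after
    collapsing I/L, or not at all."""
    if peptide in protein:
        return "exact"
    if peptide.replace("I", "L") in protein.replace("I", "L"):
        return "il_only"
    return "absent"


# Toy sequences, invented for illustration
exact_case = classify_presence("GDSLDSVEALIK", "AAARGDSLDSVEALIKTT")
il_case = classify_presence("GDSLDSVEALIK", "MAGDSLDSVEALLKTT")
absent_case = classify_presence("GDSLDSVEALIK", "MAPEPTIDEK")
```

Entries classified as `il_only` correspond to the ~100 cases mentioned above, where the reported sequence is found in the reported protein only when I and L are treated as the same residue.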
@hollenstein thanks for the testing. We're currently working on a few changes and fixes on both MSFragger and Philosopher. We'll keep you posted. |
Thanks for the update. Let me know if I can help with some testing again. |
You can grab the latest Philosopher release candidate here. But there's also a bug fix in MSFragger, so you'll need that from Fengchao too. |
Fixed in MSFragger 3.7 and Philosopher 4.8.0. Best, Fengchao |
Dear FragPipe Team,
First of all, thank you very much for providing these fantastic tools. I am working at a proteomics MS facility and we are planning to switch to FragPipe for the bulk of our data analysis in the near future. I am currently looking into using FragPipe for several typical sample types we get in the facility, and how to further process the outputs generated by FragPipe. So I will probably have several questions in the future and maybe also some feature requests.
I’ve noticed a rare problem with the reported peptide Start and End positions:
When looking up the positions of a peptide in the protein sequence from the FASTA file, in very few cases the Start and End positions reported by FragPipe don’t match the actual positions. It appears to be a problem caused by treating Isoleucine and Leucine as indistinguishable (at least in the cases I’ve observed).
For example, the reported peptide “GDSLDSVEALIK” matches exactly once to the protein sequence of “Q13813”, at position 1474 to 1485. However, FragPipe reports 496 to 507. When replacing all “L” with “I” in both the peptide and the protein sequence, the peptide matches two positions within the protein: 496 to 507 (GDSLDSVEALLK) and 1474 to 1485 (GDSLDSVEALIK). I assume that when matching the peptide to the protein (treating “L” and “I” as indistinguishable), FragPipe reports the first occurrence of the peptide. I think this behavior is fair enough; however, the actual sequence of the peptide in the ion.tsv and other tables is not updated and thus does not fit the reported Start and End positions.
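To make the ambiguity concrete, here is a small sketch of exact versus I/L-agnostic matching. It uses an invented toy protein rather than the real Q13813 sequence, so the positions differ from the ones above; `find_positions` is a hypothetical helper, and this is not how FragPipe or Philosopher implement the mapping.

```python
def find_positions(peptide, protein, il_equivalent=False):
    """Return all (start, end) matches, 1-based and inclusive."""
    if il_equivalent:
        # Collapse I and L to one letter so both residues compare equal
        peptide = peptide.replace("I", "L")
        protein = protein.replace("I", "L")
    hits = []
    idx = protein.find(peptide)
    while idx != -1:
        hits.append((idx + 1, idx + len(peptide)))
        idx = protein.find(peptide, idx + 1)
    return hits


# Toy protein containing both GDSLDSVEALLK and GDSLDSVEALIK
TOY_PROTEIN = "MAGDSLDSVEALLKAAARGDSLDSVEALIKTT"
PEPTIDE = "GDSLDSVEALIK"

exact_hits = find_positions(PEPTIDE, TOY_PROTEIN)
il_hits = find_positions(PEPTIDE, TOY_PROTEIN, il_equivalent=True)
```

In this toy case the exact search finds a single site, while the I/L-agnostic search finds two; reporting the first of those two without updating the sequence reproduces the kind of mismatch described above.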
In the few cases I’ve seen, only the “original” peptide sequences that were reported are tryptic peptides, meaning the preceding amino acids were K/R, whereas the peptides at the reported Start/End positions were only semi-tryptic.
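The tryptic versus semi-tryptic distinction can be checked with a few lines. This is a simplified sketch under stated assumptions: cleavage after K or R only, no proline exception, no N-terminal methionine clipping, and Start/End read as 1-based inclusive. The helper name and the toy protein are invented.

```python
def tryptic_termini(protein: str, start: int, end: int) -> int:
    """Count tryptic termini (0, 1 or 2) for a peptide at 1-based inclusive
    positions. Protein N- and C-termini count as valid cleavage sites."""
    n_term_ok = start == 1 or protein[start - 2] in "KR"  # residue before peptide
    c_term_ok = end == len(protein) or protein[end - 1] in "KR"  # last residue
    return int(n_term_ok) + int(c_term_ok)


# Toy protein: the first site follows 'A', the second follows 'R'
TOY_PROTEIN = "MAGDSLDSVEALLKAAARGDSLDSVEALIKTT"

first_site = tryptic_termini(TOY_PROTEIN, 3, 14)
second_site = tryptic_termini(TOY_PROTEIN, 19, 30)
```

A result of 2 corresponds to a fully tryptic peptide and 1 to a semi-tryptic one, so preferring the candidate position with the higher count would pick the second site in this toy example.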
I’ve seen that you are planning to consider the cleavage rules when mapping peptides to proteins in an upcoming version (#718), so this might already solve the problem in most if not all cases. I still wanted to let you know, as this seems to be unintentional behavior.
Best,
David
Here is the protein sequence for the mentioned example