Wrong peptide start and end positions #771

Closed
hollenstein opened this issue Jul 22, 2022 · 18 comments

@hollenstein

Dear FragPipe Team,

First of all, thank you very much for providing these fantastic tools. I am working at a proteomics MS facility and we are planning to switch to FragPipe for the bulk of our data analysis in the near future. I am currently looking into using FragPipe for several typical sample types we get in the facility, and how to further process the outputs generated by FragPipe. So I will probably have several questions in the future and maybe also some feature requests.

I’ve noticed a rare problem with the reported peptide Start and End positions:
When looking up the positions of a peptide in the protein sequence from the FASTA file, in very few cases the Start and End positions reported by FragPipe don’t match the actual positions. It appears to be a problem caused by treating Isoleucine and Leucine as indistinguishable (at least in the cases I’ve observed).

For example, the reported peptide “GDSLDSVEALIK” matches exactly once to the protein sequence of “Q13813”, at positions 1474 to 1485. However, FragPipe reports 496 to 507. When replacing all “L” with “I” in both the peptide and the protein sequence, the peptide matches two positions within the protein: 496 to 507 (GDSLDSVEALLK) and 1474 to 1485 (GDSLDSVEALIK). I assume that when matching the peptide to the protein (treating “L” and “I” as indistinguishable), FragPipe reports the first occurrence of the peptide. I think this behavior is fair enough; however, the actual sequence of the peptide in the ion.tsv and other tables is not updated and thus does not match the reported Start and End positions.
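To make the suspected behavior concrete, here is a minimal Python sketch of the lookup I used. The function name `find_positions` and the toy protein are my own for illustration; this is not FragPipe or Philosopher code, only an assumption about how a first-match lookup with I/L treated as equal would behave.

```python
def find_positions(peptide, protein, il_equivalent=False):
    """Return 1-based, inclusive (start, end) spans of every occurrence."""
    if il_equivalent:
        # Collapse isoleucine into leucine so both residues compare equal.
        peptide = peptide.replace("I", "L")
        protein = protein.replace("I", "L")
    positions = []
    idx = protein.find(peptide)
    while idx != -1:
        positions.append((idx + 1, idx + len(peptide)))
        idx = protein.find(peptide, idx + 1)
    return positions


# Toy protein (NOT Q13813): an "LL" variant precedes the true "IK" peptide.
toy = "MAEDL" + "GDSLDSVEALLK" + "KHEDK" + "GDSLDSVEALIK" + "KHEDF"
peptide = "GDSLDSVEALIK"

exact = find_positions(peptide, toy)
relaxed = find_positions(peptide, toy, il_equivalent=True)
print(exact)    # [(23, 34)]
print(relaxed)  # [(6, 17), (23, 34)]
```

If only the first relaxed hit is kept, the span (6, 17) would be reported even though the sequence at that span ends in "LLK" rather than "LIK".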

In the few cases I’ve seen, only the “original” reported peptide sequences are fully tryptic, meaning the preceding amino acid is K/R, whereas the peptides at the reported Start/End positions are only semi-tryptic.
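The tryptic versus semi-tryptic distinction can be checked with a small helper like the one below. This is a sketch under a simplified trypsin rule (cleavage after K or R, no proline exception, no special handling of N-terminal methionine removal); `cleavage_class` and the toy protein are hypothetical names and data, not part of any FragPipe tool.

```python
def cleavage_class(protein, start, end):
    """Classify a 1-based, inclusive span by how many termini fit trypsin.

    Simplified assumption: trypsin cleaves after K or R, nothing else.
    """
    # N-terminus fits if the span starts the protein or follows K/R.
    n_ok = start == 1 or protein[start - 2] in "KR"
    # C-terminus fits if the span ends the protein or ends in K/R.
    c_ok = end == len(protein) or protein[end - 1] in "KR"
    if n_ok and c_ok:
        return "tryptic"
    if n_ok or c_ok:
        return "semi-tryptic"
    return "non-tryptic"


# Toy protein (NOT Q13813): first span follows "L", second span follows "K".
toy = "MAEDL" + "GDSLDSVEALLK" + "KHEDK" + "GDSLDSVEALIK" + "KHEDF"
first_hit = cleavage_class(toy, 6, 17)
second_hit = cleavage_class(toy, 23, 34)
print(first_hit)   # semi-tryptic
print(second_hit)  # tryptic
```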

I’ve seen that you are planning to consider the cleavage rules when mapping peptides to proteins in an upcoming version (#718), so this might already solve the problem in most if not all the cases. I still wanted to let you know, as this seems to be an unintentional behavior.

Best,
David

Here is the protein sequence for the mentioned example

>sp|Q13813|SPTN1_HUMAN Spectrin alpha chain, non-erythrocytic 1 OS=Homo sapiens OX=9606 GN=SPTAN1 PE=1 SV=3
MDPSGVKVLETAEDIQERRQQVLDRYHRFKELSTLRRQKLEDSYRFQFFQRDAEELEKWIQEKLQIASDENYKDPTNLQGKLQKHQAFEAEVQANSGAIVKLDETGNLMISEGHFASETIRTRLMELHRQWELLLEKMREKGIKLLQAQKLVQYLRECEDVMDWINDKEAIVTSEELGQDLEHVEVLQKKFEEFQTDMAAHEERVNEVNQFAAKLIQEQHPEEELIKTKQDEVNAAWQRLKGLALQRQGKLFGAAEVQRFNRDVDETISWIKEKEQLMASDDFGRDLASVQALLRKHEGLERDLAALEDKVKALCAEADRLQQSHPLSATQIQVKREELITNWEQIRTLAAERHARLNDSYRLQRFLADFRDLTSWVTEMKALINADELASDVAGAEALLDRHQEHKGEIDAHEDSFKSADESGQALLAAGHYASDEVREKLTVLSEERAALLELWELRRQQYEQCMDLQLFYRDTEQVDNWMSKQEAFLLNEDLGDSLDSVEALLKKHEDFEKSLSAQEEKITALDEFATKLIQNNHYAMEDVATRRDALLSRRNALHERAMRRRAQLADSFHLQQFFRDSDELKSWVNEKMKTATDEAYKDPSNLQGKVQKHQAFEAELSANQSRIDALEKAGQKLIDVNHYAKDEVAARMNEVISLWKKLLEATELKGIKLREANQQQQFNRNVEDIELWLYEVEGHLASDDYGKDLTNVQNLQKKHALLEADVAAHQDRIDGITIQARQFQDAGHFDAENIKKKQEALVARYEALKEPMVARKQKLADSLRLQQLFRDVEDEETWIREKEPIAASTNRGKDLIGVQNLLKKHQALQAEIAGHEPRIKAVTQKGNAMVEEGHFAAEDVKAKLHELNQKWEALKAKASQRRQDLEDSLQAQQYFADANEAESWMREKEPIVGSTDYGKDEDSAEALLKKHEALMSDLSAYGSSIQALREQAQSCRQQVAPTDDETGKELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLDPAQSASRENLLEEQGSIALRQEQIDNQTRITKEAGSVSLRMKQVEELYHSLLELGEKRKGMLEKSCKKFMLFREANELQQWINEKEAALTSEEVGADLEQVEVLQKKFDDFQKDLKANESRLKDINKVAEDLESEGLMAEEVQAVQQQEVYGMMPRDETDSKTASPWKSARLMVHTVATFNSIKELNERWRSLQQLAEERSQLLGSAHEVQRFHRDADETKEWIEEKNQALNTDNYGHDLASVQALQRKHEGFERDLAALGDKVNSLGETAERLIQSHPESAEDLQEKCTELNQAWSSLGKRADQRKAKLGDSHDLQRFLSDFRDLMSWINGIRGLVSSDELAKDVTGAEALLERHQEHRTEIDARAGTFQAFEQFGQQLLAHGHYASPEIKQKLDILDQERADLEKAWVQRRMMLDQCLELQLFHRDCEQAENWMAAREAFLNTEDKGDSLDSVEALIKKHEDFDKAINVQEEKIAALQAFADQLIAAGHYAKGDISSRRNEVLDRWRRLKAQMIEKRSKLGESQTLQQFSRDVDEIEAWISEKLQTASDESYKDPTNIQSKHQKHQAFEAELHANADRIRGVIDMGNSLIERGACAGSEDAVKARLAALADQWQFLVQKSAEKSQKLKEANKQQNFNTGIKDFDFWLSEVEALLASEDYGKDLASVNNLLKKHQLLEADISAHEDRLKDLNSQADSLMTSSAFDTSQVKDKRDTINGRFQKIKSMAASRRAKLNESHRLHQFFRDMDDEESWIKEKKLLVGSEDYGRDLTGVQNLRKKHKRLEAELAAHEPAIQGVLDTGKKLSDDNTIGKEEIQQRLAQFVEHWKELKQLAAARGQRLEESLEYQQFVANVEEEEAWINEKMTLVASEDYGDTLAAIQGLLKKHEAFETDFTVHKDRVNDVCTNGQDLIKKNNHHEENISSKMKGLNGKVSDLEKAAAQRKAKLDENSAFLQFNWKADVVESWIGEKENSLK
TDDYGRDLSSVQTLLTKQETFDAGLQAFQQEGIANITALKDQLLAAKHVQSKAIEARHASLMKRWSQLLANSAARKKKLLEAQSHFRKVEDLFLTFAKKASAFNSWFENAEEDLTDPVRCNSLEEIKALREAHDAFRSSLSSAQADFNQLAELDRQIKSFRVASNPYTWFTMEALEETWRNLQKIIKERELELQKEQRRQEENDKLRQEFAQHANAFHQWIQETRTYLLDGSCMVEESGTLESQLEATKRKHQEIRAMRSQLKKIEDLGAAMEEALILDNKYTEHSTVGLAQQWDQLDQLGMRMQHNLEQQIQARNTTGVTEEALKEFSMMFKHFDKDKSGRLNHQEFKSCLRSLGYDLPMVEEGEPDPEFEAILDTVDPNRDGHVSLQEYMAFMISRETENVKSSEEIESAFRALSSEGKPYVTKEELYQNLTREQADYCVSHMKPYVDGKGRELPTAFDYVEFTRSLFVN

@fcyu
Member

fcyu commented Jul 22, 2022

Hi David,

Thanks for your feedback. I think it is also related to the issue Nesvilab/philosopher#430 . We will see what we can do.

Best,

Fengchao

@fcyu
Member

fcyu commented Dec 12, 2022

@prvst Is it fixed?

Best,

Fengchao

@prvst

prvst commented Dec 12, 2022

I updated some rules since this report. It would be good to try running it again with the current release to see the results.

@hollenstein
Author

Hi,

Thanks for the update. I can run my samples with the latest release and let you know in the next few days if the issue is gone.

Best,
David

@hollenstein
Author

Hi, I ran the same dataset with the newest version of FragPipe and used the standard LFQ-MBR workflow.

Version info:
FragPipe version 19.0
MSFragger version 3.6
IonQuant version 1.8.9
Philosopher version 4.6.0

Unfortunately, the issue still seems to exist and got slightly worse in my example. I parsed the combined_peptides.tsv file and extracted the "Protein ID", "Peptide Sequence", "Start" and "End" columns. Then I used this information to extract peptides from the FASTA using the given protein ID, start and end positions.
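Roughly, the check looks like the Python sketch below. The helper names (`read_fasta`, `find_mismatches`) and the toy data are made up for illustration, and I am assuming that the "Protein ID" column holds the accession from a UniProt-style header and that "Start"/"End" are 1-based and inclusive.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into {accession: sequence}."""
    proteins, acc, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if acc is not None:
                proteins[acc] = "".join(chunks)
            header = line[1:].split()[0]
            # UniProt-style headers look like sp|ACCESSION|NAME.
            parts = header.split("|")
            acc = parts[1] if len(parts) >= 3 else header
            chunks = []
        elif line:
            chunks.append(line)
    if acc is not None:
        proteins[acc] = "".join(chunks)
    return proteins


def find_mismatches(tsv_handle, proteins):
    """Yield rows whose peptide differs from the protein slice Start..End."""
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        seq = proteins.get(row["Protein ID"], "")
        start, end = int(row["Start"]), int(row["End"])
        observed = seq[start - 1:end]
        if observed != row["Peptide Sequence"]:
            yield (row["Protein ID"], row["Peptide Sequence"],
                   start, end, observed)


# Toy inputs standing in for the FASTA file and combined_peptides.tsv.
fasta_text = (">sp|TOY1|TOY1_TEST toy protein\n"
              "MAEDLGDSLDSVEALLKKHEDK\n"
              "GDSLDSVEALIKKHEDF\n")
tsv_text = ("Protein ID\tPeptide Sequence\tStart\tEnd\n"
            "TOY1\tGDSLDSVEALIK\t6\t17\n"
            "TOY1\tGDSLDSVEALIK\t23\t34\n")

proteins = read_fasta(io.StringIO(fasta_text))
mismatches = list(find_mismatches(io.StringIO(tsv_text), proteins))
print(mismatches)
```

With real files, the two `io.StringIO` objects would be replaced by opened file handles.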

The three examples I previously mentioned still exist: in each, the reported sequence matches a different position within the same protein.

In addition, for about 300 peptides the reported sequence did not match the reported positions in the protein sequence. Again, the mismatch was purely because of I and L differences. The reported peptide could, however, be perfectly matched to a different protein in the FASTA file.

I guess when deciding which protein will be reported, the start and end positions are updated but not the peptide sequence.

Let me know if providing additional information would be helpful.

Best,
David

@anesvi
Collaborator

anesvi commented Dec 14, 2022 via email

@prvst

prvst commented Dec 14, 2022

Hi David. Do you see the same occurring with the PSM table?

@hollenstein
Author

Yes, the issue is already present in the PSM table.

@prvst

prvst commented Dec 14, 2022

@hollenstein can you share the interact files and the FASTA? I would like to see those rare cases.

@hollenstein
Author

Sure, I will send you a link

@prvst

prvst commented Dec 15, 2022

Thanks.

Quick question; have you noticed that you have completely identical sequences in your database under different IDs?

@hollenstein
Author

Hi, thanks for the information.

In the MS facility we use our own contaminants file, so I was expecting that some entries might become duplicated in combination with certain organisms. Otherwise we simply use the UniProt reference database (one protein per gene). Actually, I wasn't aware that it still contains duplicate entries. It seems that those are gene duplications with identical protein sequences, so I guess it's fine that they are present. Afaik it shouldn't really matter for the analysis, right?
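For anyone who wants to check their own database for such entries, a minimal Python sketch is below. It assumes the FASTA has already been parsed into a dictionary of protein ID to sequence; `duplicate_sequences` and the toy entries are hypothetical.

```python
from collections import defaultdict


def duplicate_sequences(proteins):
    """Group protein IDs that share a completely identical sequence."""
    by_seq = defaultdict(list)
    for prot_id, seq in proteins.items():
        by_seq[seq].append(prot_id)
    # Keep only sequences that occur under more than one ID.
    return [sorted(ids) for ids in by_seq.values() if len(ids) > 1]


# Toy database: A1 and C3 are identical, B2 differs by a single I/L swap.
toy_db = {"A1": "MKTAYIAK", "B2": "MKTAYLAK", "C3": "MKTAYIAK"}
groups = duplicate_sequences(toy_db)
print(groups)  # [['A1', 'C3']]
```

Note that B2 is not grouped with the others here, because this check compares sequences exactly and does not treat I and L as equal.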

@prvst

prvst commented Dec 23, 2022

@hollenstein I released a new version that corrects the mapping of cases like the ones you describe above. Please download it and give it a try.

@hollenstein
Author

Hi, sorry for the delay.

I've re-run FragPipe with the following versions:
FragPipe version 19.0
MSFragger version 3.6
IonQuant version 1.8.9
Philosopher version 4.7.0

Unfortunately, the issue does not seem to be resolved, and there is an additional bug. For about 10% of all entries in the combined_ion.tsv file, the “Start” and “End” entries are now 0, so I couldn’t check if the reported positions are correct. In order to test the “I” and “L” issue, I’ve checked if the reported peptide sequence is present in the sequence of the reported protein. For ~100 entries this was not the case. However, when treating “I” and “L” as indistinguishable, all peptides could be found within the protein sequence, and all those peptides appeared to be fully tryptic.
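The presence test was along the lines of this Python sketch. The function name `presence` and the toy entries are only for illustration; real input would come from the combined_ion.tsv rows and the FASTA sequences.

```python
from collections import Counter


def presence(peptide, protein):
    """Report whether a peptide occurs exactly, only with I == L, or not at all."""
    if peptide in protein:
        return "exact"
    if peptide.replace("I", "L") in protein.replace("I", "L"):
        return "il_only"
    return "absent"


# Toy entries: (peptide, protein sequence, Start, End).
entries = [
    ("GDSLDSVEALIK", "KGDSLDSVEALIKK", 2, 13),
    ("GDSLDSVEALIK", "KGDSLDSVEALLKK", 0, 0),
    ("AAAK", "KGDSLDSVEALLKK", 0, 0),
]

summary = Counter(presence(pep, prot) for pep, prot, _, _ in entries)
# Count rows where both positions are reported as 0.
missing_positions = sum(1 for _, _, s, e in entries if s == 0 and e == 0)
print(summary)
print(missing_positions)
```

The "il_only" group corresponds to the ~100 entries described above, and `missing_positions` to the rows with Start and End set to 0.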

@prvst

prvst commented Jan 9, 2023

@hollenstein thanks for the testing. We're currently working on a few changes and fixes on both MSFragger and Philosopher. We'll keep you posted.

@hollenstein
Author

Thanks for the update. Let me know if I can help with some testing again.

@prvst

prvst commented Jan 11, 2023

You can grab the latest Philosopher release candidate here. But there's also a bug fix in MSFragger, so you'll need that from Fengchao too.

@fcyu
Member

fcyu commented Jan 13, 2023

Fixed in MSFragger 3.7 and Philosopher 4.8.0.

Best,

Fengchao

@fcyu fcyu closed this as completed Jan 13, 2023