Wrong peptide start and end positions #771

hollenstein · 2022-07-22T14:33:09Z

Dear FragPipe Team,

First of all, thank you very much for providing these fantastic tools. I am working at a proteomics MS facility and we are planning to switch to FragPipe for the bulk of our data analysis in the near future. I am currently looking into using FragPipe for several typical sample types we get in the facility, and how to further process the outputs generated by FragPipe. So I will probably have several questions in the future and maybe also some feature requests.

I’ve noticed a rare problem with the reported peptide Start and End positions:
When looking up the positions of a peptide in the protein sequence from the FASTA file, in very few cases the Start and End positions reported by FragPipe don’t match the actual positions. It appears to be a problem caused by treating Isoleucine and Leucine as indistinguishable (at least in the cases I’ve observed).

For example, the reported peptide “GDSLDSVEALIK” matches exactly once to the protein sequence of “Q13813”, at position 1474 to 1485. However, FragPipe reports 496 to 507. When replacing all “L” with “I” in the peptide and the protein sequence, the peptide matches two positions within the protein: 496 to 507 (GDSLDSVEALLK) and 1474 to 1485 (GDSLDSVEALIK). I assume that when matching the peptide to the protein (treating “L” and “I” as indistinguishable), FragPipe reports the first occurrence of the peptide. I think this behavior is fair enough; however, the actual sequence of the peptide in the ion.tsv and other tables is not updated and thus does not fit the reported Start and End positions.
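To make the lookup described above concrete, here is a minimal Python sketch of how the positions can be checked. It is only an illustration of the comparison, not how FragPipe or Philosopher map peptides internally. The function name `find_positions` and the short toy protein are made up for this example (the toy sequence is not Q13813), and positions are assumed to be 1-based and inclusive.

```python
def find_positions(peptide, protein, il_equivalent=False):
    """Return 1-based, inclusive (start, end) for every occurrence of peptide in protein."""
    if il_equivalent:
        # Treat I and L as indistinguishable by mapping L to I on both sides.
        peptide = peptide.replace("L", "I")
        protein = protein.replace("L", "I")
    hits = []
    start = protein.find(peptide)
    while start != -1:
        hits.append((start + 1, start + len(peptide)))
        start = protein.find(peptide, start + 1)
    return hits


# Hypothetical toy protein with two sites that differ only by one I/L residue.
toy_protein = "MKGDSLDSVEALLKKHEDRGDSLDSVEALIKK"
peptide = "GDSLDSVEALIK"

print(find_positions(peptide, toy_protein))                      # [(20, 31)]
print(find_positions(peptide, toy_protein, il_equivalent=True))  # [(3, 14), (20, 31)]
```

With exact matching only the second site is found. With I/L treated as equivalent, the first site appears as well, which is the kind of position that ends up being reported.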

In the few cases I’ve seen, only the “original” reported peptide sequences were tryptic peptides, meaning the preceding amino acid was K/R, whereas the peptides at the reported start/end positions were only semi-tryptic.

I’ve seen that you are planning to consider the cleavage rules when mapping peptides to proteins in an upcoming version (#718), so this might already solve the problem in most if not all the cases. I still wanted to let you know, as this seems to be an unintentional behavior.

Best,
David

Here is the protein sequence for the mentioned example

sp|Q13813|SPTN1_HUMAN Spectrin alpha chain, non-erythrocytic 1 OS=Homo sapiens OX=9606 GN=SPTAN1 PE=1 SV=3
MDPSGVKVLETAEDIQERRQQVLDRYHRFKELSTLRRQKLEDSYRFQFFQRDAEELEKWIQEKLQIASDENYKDPTNLQGKLQKHQAFEAEVQANSGAIVKLDETGNLMISEGHFASETIRTRLMELHRQWELLLEKMREKGIKLLQAQKLVQYLRECEDVMDWINDKEAIVTSEELGQDLEHVEVLQKKFEEFQTDMAAHEERVNEVNQFAAKLIQEQHPEEELIKTKQDEVNAAWQRLKGLALQRQGKLFGAAEVQRFNRDVDETISWIKEKEQLMASDDFGRDLASVQALLRKHEGLERDLAALEDKVKALCAEADRLQQSHPLSATQIQVKREELITNWEQIRTLAAERHARLNDSYRLQRFLADFRDLTSWVTEMKALINADELASDVAGAEALLDRHQEHKGEIDAHEDSFKSADESGQALLAAGHYASDEVREKLTVLSEERAALLELWELRRQQYEQCMDLQLFYRDTEQVDNWMSKQEAFLLNEDLGDSLDSVEALLKKHEDFEKSLSAQEEKITALDEFATKLIQNNHYAMEDVATRRDALLSRRNALHERAMRRRAQLADSFHLQQFFRDSDELKSWVNEKMKTATDEAYKDPSNLQGKVQKHQAFEAELSANQSRIDALEKAGQKLIDVNHYAKDEVAARMNEVISLWKKLLEATELKGIKLREANQQQQFNRNVEDIELWLYEVEGHLASDDYGKDLTNVQNLQKKHALLEADVAAHQDRIDGITIQARQFQDAGHFDAENIKKKQEALVARYEALKEPMVARKQKLADSLRLQQLFRDVEDEETWIREKEPIAASTNRGKDLIGVQNLLKKHQALQAEIAGHEPRIKAVTQKGNAMVEEGHFAAEDVKAKLHELNQKWEALKAKASQRRQDLEDSLQAQQYFADANEAESWMREKEPIVGSTDYGKDEDSAEALLKKHEALMSDLSAYGSSIQALREQAQSCRQQVAPTDDETGKELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLDPAQSASRENLLEEQGSIALRQEQIDNQTRITKEAGSVSLRMKQVEELYHSLLELGEKRKGMLEKSCKKFMLFREANELQQWINEKEAALTSEEVGADLEQVEVLQKKFDDFQKDLKANESRLKDINKVAEDLESEGLMAEEVQAVQQQEVYGMMPRDETDSKTASPWKSARLMVHTVATFNSIKELNERWRSLQQLAEERSQLLGSAHEVQRFHRDADETKEWIEEKNQALNTDNYGHDLASVQALQRKHEGFERDLAALGDKVNSLGETAERLIQSHPESAEDLQEKCTELNQAWSSLGKRADQRKAKLGDSHDLQRFLSDFRDLMSWINGIRGLVSSDELAKDVTGAEALLERHQEHRTEIDARAGTFQAFEQFGQQLLAHGHYASPEIKQKLDILDQERADLEKAWVQRRMMLDQCLELQLFHRDCEQAENWMAAREAFLNTEDKGDSLDSVEALIKKHEDFDKAINVQEEKIAALQAFADQLIAAGHYAKGDISSRRNEVLDRWRRLKAQMIEKRSKLGESQTLQQFSRDVDEIEAWISEKLQTASDESYKDPTNIQSKHQKHQAFEAELHANADRIRGVIDMGNSLIERGACAGSEDAVKARLAALADQWQFLVQKSAEKSQKLKEANKQQNFNTGIKDFDFWLSEVEALLASEDYGKDLASVNNLLKKHQLLEADISAHEDRLKDLNSQADSLMTSSAFDTSQVKDKRDTINGRFQKIKSMAASRRAKLNESHRLHQFFRDMDDEESWIKEKKLLVGSEDYGRDLTGVQNLRKKHKRLEAELAAHEPAIQGVLDTGKKLSDDNTIGKEEIQQRLAQFVEHWKELKQLAAARGQRLEESLEYQQFVANVEEEEAWINEKMTLVASEDYGDTLAAIQGLLKKHEAFETDFTVHKDRVNDVCTNGQDLIKKNNHHEENISSKMKGLNGKVSDLEKAAAQRKAKLDENSAFLQFNWKADVVESWIGEKENSLK
TDDYGRDLSSVQTLLTKQETFDAGLQAFQQEGIANITALKDQLLAAKHVQSKAIEARHASLMKRWSQLLANSAARKKKLLEAQSHFRKVEDLFLTFAKKASAFNSWFENAEEDLTDPVRCNSLEEIKALREAHDAFRSSLSSAQADFNQLAELDRQIKSFRVASNPYTWFTMEALEETWRNLQKIIKERELELQKEQRRQEENDKLRQEFAQHANAFHQWIQETRTYLLDGSCMVEESGTLESQLEATKRKHQEIRAMRSQLKKIEDLGAAMEEALILDNKYTEHSTVGLAQQWDQLDQLGMRMQHNLEQQIQARNTTGVTEEALKEFSMMFKHFDKDKSGRLNHQEFKSCLRSLGYDLPMVEEGEPDPEFEAILDTVDPNRDGHVSLQEYMAFMISRETENVKSSEEIESAFRALSSEGKPYVTKEELYQNLTREQADYCVSHMKPYVDGKGRELPTAFDYVEFTRSLFVN

fcyu · 2022-07-22T14:39:33Z

Hi David,

Thanks for your feedback. I think it is also related to the issue Nesvilab/philosopher#430. We will see what we can do.

Best,

Fengchao

fcyu · 2022-12-12T15:41:07Z

@prvst Is it fixed?

Best,

Fengchao

prvst · 2022-12-12T18:54:02Z

I updated some rules since this report. It would be good to try running it again with the current release to see the results.

hollenstein · 2022-12-13T15:37:03Z

Hi,

Thanks for the update. I can run my samples with the latest release and let you know in the next few days if the issue is gone.

Best,
David

hollenstein · 2022-12-14T12:53:09Z

Hi, I ran the same dataset with the newest version of FragPipe and used the standard LFQ-MBR workflow.

Version info:
FragPipe version 19.0
MSFragger version 3.6
IonQuant version 1.8.9
Philosopher version 4.6.0

Unfortunately, the issue still seems to exist and got slightly worse in my example. I parsed the combined_peptides.tsv file and extracted the "Protein ID", "Peptide Sequence", "Start" and "End" columns. Then I used this information to extract peptides from the FASTA file using the given protein ID, start and end positions.
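For anyone who wants to repeat this check, a rough Python sketch of the procedure could look like the following. The column names are the ones mentioned above. Everything else is an assumption made for illustration: the helper names `read_fasta` and `find_mismatches`, UniProt-style `>db|ACCESSION|NAME` FASTA headers, 1-based inclusive positions, and the toy data, which stands in for the real files.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into {accession: sequence} (assumes '>db|ACCESSION|NAME' headers)."""
    proteins, acc, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if acc is not None:
                proteins[acc] = "".join(chunks)
            fields = line[1:].split()
            header = fields[0] if fields else ""
            parts = header.split("|")
            acc = parts[1] if len(parts) >= 2 else header
            chunks = []
        elif line:
            chunks.append(line)
    if acc is not None:
        proteins[acc] = "".join(chunks)
    return proteins


def find_mismatches(tsv_handle, proteins):
    """Return rows whose peptide differs from the protein slice at Start..End."""
    mismatches = []
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        protein = proteins.get(row["Protein ID"], "")
        start, end = int(row["Start"]), int(row["End"])
        extracted = protein[start - 1:end]  # 1-based inclusive to Python slice
        if extracted != row["Peptide Sequence"]:
            mismatches.append(
                (row["Protein ID"], row["Peptide Sequence"], start, end, extracted)
            )
    return mismatches


# Toy stand-ins for the FASTA file and combined_peptides.tsv (hypothetical data).
fasta_text = ">sp|TOY1|TOY_HUMAN toy protein\nMKGDSLDSVEALLKKHEDR\nGDSLDSVEALIKK\n"
tsv_text = (
    "Protein ID\tPeptide Sequence\tStart\tEnd\n"
    "TOY1\tGDSLDSVEALIK\t3\t14\n"
    "TOY1\tGDSLDSVEALIK\t20\t31\n"
)

proteins = read_fasta(io.StringIO(fasta_text))
mismatches = find_mismatches(io.StringIO(tsv_text), proteins)
print(mismatches)  # [('TOY1', 'GDSLDSVEALIK', 3, 14, 'GDSLDSVEALLK')]
```

On real data the two `io.StringIO` objects would be replaced by opened files, for example `open("combined_peptides.tsv", newline="")`.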

The three examples I previously mentioned still exist, for which the reported sequence matches to a different position within the same protein.

In addition, for about 300 peptides the reported sequence did not match to the reported positions in the protein sequence. Again the mismatch was purely because of I and L differences. The reported peptide could however be perfectly matched to a different protein in the FASTA file.

I guess when deciding which protein will be reported, the start and end positions are updated but not the peptide sequence.

Let me know if providing additional information would be helpful.

Best,
David

anesvi · 2022-12-14T13:35:46Z

Hi David,

Thanks for the test.

Felipe, could you please take another look and see if you can replace the peptide sequence reported with the correct I/L (that matches the protein to which the peptide is reassigned)?

Thanks
Alexey

prvst · 2022-12-14T14:23:59Z

Hi David. Do you see the same occurring with the PSM table?

hollenstein · 2022-12-14T14:32:43Z

Yes, the issue is already present in the PSM table.

prvst · 2022-12-14T20:42:13Z

@hollenstein can you share the interact files and the FASTA? I would like to see those rare cases.

hollenstein · 2022-12-15T16:25:27Z

Sure, I will send you a link

prvst · 2022-12-15T18:08:52Z

Thanks.

Quick question: have you noticed that you have completely identical sequences in your database under different IDs?

hollenstein · 2022-12-16T11:02:10Z

Hi, thanks for the information.

In the MS facility we use our own contaminants file, so I was expecting that some entries might become duplicated in combination with certain organisms. Otherwise we simply use the UniProt reference database (one protein per gene). Actually, I wasn't aware that they still contain duplicate entries. It seems that those are gene duplications with identical protein sequences, so I guess it's fine that they are present. Afaik it shouldn't really matter for the analysis, right?

prvst · 2022-12-23T17:00:32Z

@hollenstein I released a new version that corrects the mapping of cases like the ones you describe above. Please download it and give it a try.

hollenstein · 2023-01-04T08:49:18Z

Hi, sorry for the delay.

I've re-run FragPipe with the following versions:
FragPipe version 19.0
MSFragger version 3.6
IonQuant version 1.8.9
Philosopher version 4.7.0

Unfortunately, the issue does not seem to be resolved and there is an additional bug. For about 10% of all entries in the combined_ion.tsv file, the “Start” and “End” entries are now 0, so I couldn’t check if the reported positions are correct. In order to test the “I” and “L” issue, I’ve checked if the reported peptide sequence is present in the sequence of the reported protein. For ~100 entries this was not the case. However, when treating “I” and “L” as indistinguishable, all peptides could be found within the protein sequence, and all those peptides appeared to be fully tryptic peptides.
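As an illustration of this second check, which does not rely on the Start and End columns, here is a small Python sketch. The names `is_fully_tryptic` and `check_entry` are made up for the example. "Fully tryptic" is assumed here to mean that the residue before the peptide is K/R (or the peptide starts at the protein N-terminus) and that the peptide ends with K/R (or at the protein C-terminus). The proline rule and initiator-methionine removal are ignored, so this is a simplification and not the cleavage logic used by MSFragger or Philosopher.

```python
def is_fully_tryptic(protein, start, end):
    """Check both termini of a 1-based, inclusive (start, end) site for K/R cleavage."""
    n_term_ok = start == 1 or protein[start - 2] in "KR"
    c_term_ok = end == len(protein) or protein[end - 1] in "KR"
    return n_term_ok and c_term_ok


def check_entry(peptide, protein):
    """Return (exact_match, sites); sites are found with I and L treated as equal."""
    exact_match = peptide in protein
    norm_peptide = peptide.replace("I", "L")
    norm_protein = protein.replace("I", "L")
    sites = []
    pos = norm_protein.find(norm_peptide)
    while pos != -1:
        start, end = pos + 1, pos + len(peptide)
        sites.append((start, end, is_fully_tryptic(protein, start, end)))
        pos = norm_protein.find(norm_peptide, pos + 1)
    return exact_match, sites


# Hypothetical toy case: the reported peptide has I where the protein has L.
print(check_entry("GDSLDSVEALIK", "MKGDSLDSVEALLKK"))  # (False, [(3, 14, True)])
```

An entry with `exact_match` equal to `False` but a non-empty `sites` list corresponds to the situation described above: the peptide is only found in the reported protein once I and L are treated as indistinguishable.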

prvst · 2023-01-09T15:12:19Z

@hollenstein thanks for the testing. We're currently working on a few changes and fixes on both MSFragger and Philosopher. We'll keep you posted.

hollenstein · 2023-01-11T09:19:37Z

Thanks for the update. Let me know if I can help with some testing again.

prvst · 2023-01-11T14:19:36Z

You can grab the latest Philosopher release candidate here. But there's also a bug fix in MSFragger, so you'll need that from Fengchao too.

fcyu · 2023-01-13T02:42:58Z

Fixed in MSFragger 3.7 and Philosopher 4.8.0.

Best,

Fengchao

fcyu assigned prvst, fcyu and anesvi Jul 22, 2022

fcyu added the Philosopher label Jul 22, 2022

fcyu mentioned this issue Dec 14, 2022

Adjust I and L in the psm.tsv, peptide.tsv, and ion.tsv after razor assignment Nesvilab/philosopher#430

Open

fcyu closed this as completed Jan 13, 2023
