Duplicated transition IDs in TargetedExperiment object from PQP source, caused by decoy peptides with same sequence but different gene #5653

Open
gureann opened this issue Nov 12, 2021 · 6 comments

gureann commented Nov 12, 2021

Hi @hroest ,

I'm using OpenSwath to analyze DIA data acquired on a QEHF, and an error occurred when I ran OSW with the generated pqp file.
The error was caused by decoy peptides that ended up with the same sequence although they were derived from different target peptide sequences (which also belong to different genes). This leads to duplicated transition IDs at the pqp reading step, during aggregation of the gene table.

If the input mzML was converted from a Thermo raw file by msconvert without peak picking, no exception is raised; the run just stops during searching, like this:

Thread 3_0 will analyze 12911 compounds and 72038 transitions from SWATH 5 (batch 2 out of 12)
Thread 6_0 will analyze 14914 compounds and 82015 transitions from SWATH 4 (batch 2 out of 14)
Thread 0_0 will analyze 10545 compounds and 55144 transitions from SWATH 2 (batch 3 out of 10)
Thread 2_0 will analyze 14011 compounds and 74796 transitions from SWATH 3 (batch 3 out of 14)
(base) PS F:\>

When the input mzML was converted with peak picking, the error is an invalid ID:

...
Thread 0_0 will analyze 10545 compounds and 55144 transitions from SWATH 2 (batch 3 out of 10)
Thread 4_0 will analyze 7360 compounds and 37058 transitions from SWATH 1 (batch 4 out of 7)

---------------------------------------------------
FATAL: uncaught exception!
---------------------------------------------------
last entry in the exception handler:
exception of type InvalidValue occurred in line 165, function void __cdecl OpenMS::MRMTransitionGroup<class OpenMS::MSChromatogram,struct OpenSwath::LightTransition>::addTransition(const struct OpenSwath::LightTransition &,const class OpenMS::String &) of C:\jenkins\ws\openms\RC\openms_release_packaging\9447518b\source\src\openms\include\OpenMS/KERNEL/MRMTransitionGroup.h
error message: the value '1747343' was used but is not valid; Internal error: Transition with nativeID was already present!
---------------------------------------------------

If I run TargetedFileConverter again, from pqp to tsv, the error is reported correctly, in the checking step after reading the database and generating the TargetedExperiment:

Progress of 'reading PQP file (SQL warmup)':
-- done [took 36.64 s (CPU), 36.82 s (Wall)] --
Progress of 'reading PQP file':
-- done [took 7.72 s (CPU), 7.71 s (Wall)] --
Progress of 'conversion to internal data representation':
-- done [took 5.11 s (CPU), 5.10 s (Wall)] --
Found duplicate transition id (must be unique): 1292421
Error: Unexpected internal error (Invalid input, contains duplicate or invalid references)

The file attached below (extracted from the pqp file) contains all transitions with duplicated IDs after running DecoyGenerator.
example_of_same_seq_in_diff_genes.txt
There are two kinds of decoy peptides with the same sequence:
Peptide FVQDLSK belongs to Q91ZJ5;DECOY_P52196, where DECOY_P52196 has the original protein ID P52196 with the peptide FQLVDSR. The gene names of these two are Ugp2 and Tst.
Peptide YLDLLQK belongs to the protein group DECOY_Q0KK55;DECOY_Q6PHN7. The original sequences are YLLDLLR and YLLQLLR, which differ by one amino acid and belong to Tmem164 and Kndc1 respectively (after shuffling, the proteins are combined but the genes are kept separately).

As a workaround, while the assay file was still in tsv format (before converting to pqp), I dropped decoy peptides that have the same sequence as a target, as well as decoy peptides whose sequence belongs to different genes. It works fine now.
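
The filtering described above could look roughly like the following. This is only a minimal sketch, not the script actually used: the column names `PeptideSequence`, `GeneName` and `Decoy`, the `"0"`/`"1"` decoy encoding, the function name `drop_ambiguous_decoys`, and the example rows (including the `AAAGGGK`/`GeneX` entry) are assumptions for illustration and should be adapted to the actual tsv.

```python
from collections import defaultdict


def drop_ambiguous_decoys(rows, seq_col="PeptideSequence",
                          gene_col="GeneName", decoy_col="Decoy"):
    """Drop decoy rows whose sequence equals a target sequence
    or is assigned to more than one gene (hypothetical helper)."""
    # All sequences that occur as targets (Decoy == "0" is an assumed encoding).
    target_seqs = {r[seq_col] for r in rows if str(r[decoy_col]) == "0"}

    # Collect the set of genes seen for every decoy sequence.
    decoy_genes = defaultdict(set)
    for r in rows:
        if str(r[decoy_col]) == "1":
            decoy_genes[r[seq_col]].add(r[gene_col])

    kept = []
    for r in rows:
        if str(r[decoy_col]) == "1":
            seq = r[seq_col]
            if seq in target_seqs or len(decoy_genes[seq]) > 1:
                continue  # ambiguous decoy, skip it
        kept.append(r)
    return kept


# Illustrative rows only; sequences and genes are taken from the report above,
# the last row is a made-up unambiguous decoy.
example_rows = [
    {"PeptideSequence": "FVQDLSK", "GeneName": "Ugp2", "Decoy": "0"},
    {"PeptideSequence": "FVQDLSK", "GeneName": "Tst", "Decoy": "1"},
    {"PeptideSequence": "YLDLLQK", "GeneName": "Tmem164", "Decoy": "1"},
    {"PeptideSequence": "YLDLLQK", "GeneName": "Kndc1", "Decoy": "1"},
    {"PeptideSequence": "AAAGGGK", "GeneName": "GeneX", "Decoy": "1"},
]
filtered = drop_ambiguous_decoys(example_rows)
print([(r["PeptideSequence"], r["GeneName"]) for r in filtered])
```

For a real library, the rows could be read with `csv.DictReader(f, delimiter="\t")` and written back with `csv.DictWriter` before converting to pqp. Note that this drops every transition of an affected decoy, so the target/decoy ratio changes slightly.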

This case may be rare, since it requires both that genes are assigned and that randomly generated decoy peptides end up with identical sequences.
I'd suggest an optional parameter to control whether decoys are allowed to be identical to targets, or simply filtering them out.
A checking step for the pqp file in OSW, like the one in TargetedFileConverter, would also be great for finding invalid items before the next step.

Best regards,
Ronghui

@shubham1637

Hi, I am seeing similar issues.
I do not have any decoy and target peptides with a common sequence, yet I still see this happening. Do you know how to fix it?


gureann commented Feb 8, 2022

Hi @shubham1637, looking back at this issue, I think the main problem is that different genes were assigned to the same peptide sequence; the decoys in my first description were only one way to reach that state.

That is, if peptideA has GeneI in some rows and GeneII in other rows, this leads to duplicated transition IDs, because the join over the tables in the PQP file also uses the gene column. The different genes are kept, and all other values, which are identical, are repeated.
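
To make the join effect concrete, here is a small self-contained illustration using an in-memory SQLite database. The schema is a deliberately simplified toy (the table and column names `peptide`, `gene`, `peptide_gene`, `transitions` and the IDs 101/102 are assumptions, not the real PQP layout or the query OpenMS runs); it only shows how a 1:n peptide-to-gene mapping repeats each transition ID in a plain join, and how grouping per transition would collapse the rows again.

```python
import sqlite3
from collections import Counter

# Toy schema for illustration only; the real PQP file has more tables and columns.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE peptide (id INTEGER PRIMARY KEY, sequence TEXT);
CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE peptide_gene (peptide_id INTEGER, gene_id INTEGER);
CREATE TABLE transitions (id INTEGER PRIMARY KEY, peptide_id INTEGER);

INSERT INTO peptide VALUES (1, 'YLDLLQK');
INSERT INTO gene VALUES (1, 'Tmem164'), (2, 'Kndc1');
INSERT INTO peptide_gene VALUES (1, 1), (1, 2);   -- one peptide, two genes
INSERT INTO transitions VALUES (101, 1), (102, 1);
""")

# Plain join: every transition appears once per gene of its peptide.
joined = con.execute("""
SELECT t.id, p.sequence, g.name
FROM transitions t
JOIN peptide p ON p.id = t.peptide_id
JOIN peptide_gene pg ON pg.peptide_id = p.id
JOIN gene g ON g.id = pg.gene_id
ORDER BY t.id, g.id
""").fetchall()

counts = Counter(row[0] for row in joined)
duplicated_ids = sorted(tid for tid, n in counts.items() if n > 1)
print(joined)
print("duplicated transition ids:", duplicated_ids)

# Grouping per transition and concatenating gene names gives one row per ID.
collapsed = con.execute("""
SELECT t.id, p.sequence, GROUP_CONCAT(g.name, ';')
FROM transitions t
JOIN peptide p ON p.id = t.peptide_id
JOIN peptide_gene pg ON pg.peptide_id = p.id
JOIN gene g ON g.id = pg.gene_id
GROUP BY t.id, p.sequence
""").fetchall()
print(collapsed)
```

In this toy case both transitions come back twice from the plain join, which is the same pattern that triggers the "duplicate transition id" check, while the grouped query returns each ID once.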

If you get the same error as in the second code block when running OSW, and the error in the third code block when running TargetedFileConverter, I think you should have a look at the genes (or protein groups?) in your tsv or pqp file.

Hope this helps.

Best,
Ronghui


shubham1637 commented Feb 8, 2022 via email


gureann commented Feb 8, 2022

For anyone who reaches here,

The main problem in this issue is caused by the assay library file itself, and I think it should be fixed by us users rather than being an issue for the developers. So I would like to close this issue.

If you hit the duplicated transition ID error, please check that exactly one unique gene and one unique protein is assigned to each peptide, and that no peptide has two or more different ones in different rows.
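
A quick way to run this check on a tsv library might be something like the sketch below. It assumes the rows are dictionaries with columns named `PeptideSequence`, `GeneName` and `ProteinId` (adjust these to your file); the function name `find_ambiguous_peptides` and the example rows are made up for illustration.

```python
from collections import defaultdict


def find_ambiguous_peptides(rows, seq_col="PeptideSequence",
                            gene_col="GeneName", protein_col="ProteinId"):
    """Return peptides that are assigned to more than one gene or protein
    (hypothetical helper; column names are assumptions)."""
    genes = defaultdict(set)
    proteins = defaultdict(set)
    for r in rows:
        genes[r[seq_col]].add(r[gene_col])
        proteins[r[seq_col]].add(r[protein_col])
    return {
        seq: {"genes": sorted(genes[seq]), "proteins": sorted(proteins[seq])}
        for seq in genes
        if len(genes[seq]) > 1 or len(proteins[seq]) > 1
    }


# Illustrative rows: the first peptide has two genes, the second is clean.
library_rows = [
    {"PeptideSequence": "YLDLLQK", "GeneName": "Tmem164",
     "ProteinId": "DECOY_Q0KK55;DECOY_Q6PHN7"},
    {"PeptideSequence": "YLDLLQK", "GeneName": "Kndc1",
     "ProteinId": "DECOY_Q0KK55;DECOY_Q6PHN7"},
    {"PeptideSequence": "FQLVDSR", "GeneName": "Tst", "ProteinId": "P52196"},
]
ambiguous = find_ambiguous_peptides(library_rows)
for seq, info in ambiguous.items():
    print(seq, info["genes"], info["proteins"])
```

An empty result means every peptide sequence maps to a single gene and a single protein value; anything reported is a candidate for the duplicated transition ID error.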

@gureann gureann closed this as completed Feb 8, 2022

hroest commented Apr 1, 2022

It seems this issue still persists

@hroest hroest reopened this Apr 1, 2022

hroest commented Apr 1, 2022

Part of the issue could come from the SQL select query here: https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/ANALYSIS/OPENSWATH/TransitionPQPFile.cpp which could lead to duplicated entries when you have 1:n mappings of peptides to proteins / genes. We should address this.

We should also address the issue of decoy peptides with the same sequence.
