Fixes in BaseParser::upload_xref_object_graphs() #334

Merged: 4 commits, Nov 19, 2018

@mkszuba mkszuba commented Nov 14, 2018

Description

BaseParser::upload_xref_object_graphs() could insert duplicate dependent xrefs when a parser using it was rerun on the same input, and it inserted empty strings rather than NULLs as descriptions of dependent xrefs unless explicitly overridden by arguments.

Use case

BaseParser::upload_xref_object_graphs() is used by (at least) UniProtParser.

Benefits

Calls to upload_xref_object_graphs() should now be replay-safe. Fewer empty strings.
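
To illustrate what "replay-safe" and "fewer empty strings" mean here, the following is a minimal sketch in Python with SQLite, not the actual Perl code from this PR. The table `dependent_xref_demo`, its columns, the helper names, and the sample accession are all assumptions made up for the example; the real Ensembl xref schema and the BaseParser API differ.

```python
import sqlite3


def normalise_description(description):
    """Map a missing or empty description to None, which SQLite stores as NULL."""
    return description if description else None


def make_demo_db():
    """Create an in-memory database with a hypothetical dependent-xref table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dependent_xref_demo ("
        " master_xref_id INTEGER NOT NULL,"
        " accession TEXT NOT NULL,"
        " source_id INTEGER NOT NULL,"
        " species_id INTEGER NOT NULL,"
        " description TEXT,"
        " UNIQUE (master_xref_id, accession, source_id, species_id))"
    )
    return conn


def add_dependent_xref(conn, master_xref_id, accession, source_id,
                       species_id, description=None):
    """Insert a dependent xref; a repeated call with the same key is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO dependent_xref_demo "
        "(master_xref_id, accession, source_id, species_id, description) "
        "VALUES (?, ?, ?, ?, ?)",
        (master_xref_id, accession, source_id, species_id,
         normalise_description(description)),
    )


if __name__ == "__main__":
    conn = make_demo_db()
    # Simulate running the same parser twice on the same input.
    for _ in range(2):
        add_dependent_xref(conn, 1, "P12345", 10, 9606, "")
    print(conn.execute("SELECT * FROM dependent_xref_demo").fetchall())
```

In this sketch the uniqueness constraint does the deduplication, so the second pass over the same input leaves the table unchanged, and the empty description ends up as NULL rather than an empty string.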

Possible Drawbacks

Parser output will likely change a lot, mostly due to the empty-string-to-NULL transition, which will make validation challenging.

Testing

Have you added/modified unit tests to test the changes?
No.

If so, do the tests pass/fail?
N/A

Have you run the entire test suite and no regression was detected?
I have run xref_parser.t, which AFAIK is the only part of the test suite dealing with xref parsers. All tests still pass.
Moreover, I have run the new UniProtParser (with a small subset of input extracted from the middle of the current Swiss-Prot file) before and after the change to compare the results, and they have changed as expected.

…aponly()

Previously it inserted dependent xrefs using its own code which, repetition aside,
made no effort to prevent insertion of duplicate entries.
@mkszuba mkszuba force-pushed the bugfix/upload_xref_object_graphs branch from 575885a to 44f1885 Compare November 14, 2018 12:38
@premanand17 premanand17 left a comment

Approving, and I appreciate the effort. As a general comment, however, I suggest we pause making changes to BaseParser in future unless they are critical. BaseParser has evolved into BaseAdaptor in ensembl-xref, and any changes have to be synced.

mkszuba pushed a commit that referenced this pull request Nov 15, 2018
…first one

Having multiple xrefs corresponding to the same (accession,source_id,species_id) triplet
confuses some Ensembl code, the bit that PR #334 attempts to fix being just one occurrence
of this. This never happened with the old UniProtParser because it simply ignored the fact
that there could be multiple gene-name entries per protein and happily overwrote
names/synonyms every time it encountered them. Now that we process those entries correctly,
however, a not insignificant number of "duplicate" xrefs could appear for both Swiss-Prot
and TrEMBL data.

Having just discussed this with Mag, an administrative decision has been made to only
generate an xref for the first gene-name encountered in a record.
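
As a rough illustration of that decision, here is a minimal Python sketch, not the actual UniProtParser code. The record layout (`accession`, `gene_names`, `name`, `synonyms`) and the function name are hypothetical, chosen only to show that later gene-name entries in a record are ignored.

```python
def first_gene_name_xref(record, source_id, species_id):
    """Build one xref from the first gene-name entry of a record, if any.

    `record` is a hypothetical dict; any further gene-name entries are
    deliberately skipped so that only one xref exists per
    (accession, source_id, species_id) triplet.
    """
    gene_names = record.get("gene_names", [])
    if not gene_names:
        return None
    first = gene_names[0]
    return {
        "accession": record["accession"],
        "label": first["name"],
        "synonyms": list(first.get("synonyms", [])),
        "source_id": source_id,
        "species_id": species_id,
    }


if __name__ == "__main__":
    # Made-up record with two gene-name entries; only the first is used.
    record = {
        "accession": "P00001",
        "gene_names": [
            {"name": "ABC1", "synonyms": ["ABC1A"]},
            {"name": "ABC2"},
        ],
    }
    print(first_gene_name_xref(record, source_id=5, species_id=9606))
```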
@mkszuba mkszuba force-pushed the bugfix/upload_xref_object_graphs branch from 28c7351 to 9a50d2e Compare November 16, 2018 13:14
Marek Szuba added 3 commits November 16, 2018 16:22
… xrefs

Essentially the same thing but without code duplication.
…bject_graphs()

Previously it was a mixture of both, with the newly added lines being among the few
indented only with spaces.
@mkszuba mkszuba force-pushed the bugfix/upload_xref_object_graphs branch from b7013c6 to f65bf2a Compare November 16, 2018 16:29
@coveralls
Coverage Status

Coverage increased (+0.03%) to 81.396% when pulling f65bf2a on bugfix/upload_xref_object_graphs into 32104e5 on feature/xref_sprint.

@mkszuba mkszuba merged commit 5f81416 into feature/xref_sprint Nov 19, 2018
@mkszuba mkszuba deleted the bugfix/upload_xref_object_graphs branch November 19, 2018 10:58
mkszuba pushed a commit that referenced this pull request Nov 19, 2018
mkszuba pushed a commit that referenced this pull request Nov 21, 2018