-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes in BaseParser::upload_xref_object_graphs() #334
Merged
mkszuba
merged 4 commits into
feature/xref_sprint
from
bugfix/upload_xref_object_graphs
Nov 19, 2018
Merged
Fixes in BaseParser::upload_xref_object_graphs() #334
mkszuba
merged 4 commits into
feature/xref_sprint
from
bugfix/upload_xref_object_graphs
Nov 19, 2018
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…aponly() Before it would insert dependent xrefs using own code, which repetition aside made no effort to prevent insertion of duplicate entries.
nerdstrike
suggested changes
Nov 14, 2018
mkszuba
force-pushed
the
bugfix/upload_xref_object_graphs
branch
from
November 14, 2018 12:38
575885a
to
44f1885
Compare
nerdstrike
reviewed
Nov 15, 2018
nerdstrike
approved these changes
Nov 15, 2018
premanand17
approved these changes
Nov 15, 2018
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving and appreciate the effort. However, as a general comment, I suggest we will have to pause making any changes to the BaseParser in future unless it is very critical. BaseParser has evolved into BaseAdaptor in ensembl-xref and the changes have to be synced.
mkszuba
pushed a commit
that referenced
this pull request
Nov 15, 2018
…first one Having multiple xrefs corresponding to the same (accession,source_id,species_id) triplet confuses some Ensembl code, the bit that PR #334 attempts to fix being just one occurrence of this. This never happened with the old UniProtParser because it simply ignored the fact there could be multiple gene-name entries per protein and happily overwrote names/synonyms every time it encountered them, however now that we process those entries correctly a non-insignificant number of "duplicate" xrefs could appear for both Swiss-Prot and TrEMBL data. Having just discussed this with Mag, an administrative decision has been made to only generate an xref for the first gene-name encountered in a record.
mkszuba
force-pushed
the
bugfix/upload_xref_object_graphs
branch
from
November 16, 2018 13:14
28c7351
to
9a50d2e
Compare
… xrefs Essentially the same thing but without code duplication.
…bject_graphs() Previously it was a mixture of both, with newly added lines being one of the few indented only with spaces.
mkszuba
force-pushed
the
bugfix/upload_xref_object_graphs
branch
from
November 16, 2018 16:29
b7013c6
to
f65bf2a
Compare
mkszuba
pushed a commit
that referenced
this pull request
Nov 19, 2018
…first one Having multiple xrefs corresponding to the same (accession,source_id,species_id) triplet confuses some Ensembl code, the bit that PR #334 attempts to fix being just one occurrence of this. This never happened with the old UniProtParser because it simply ignored the fact there could be multiple gene-name entries per protein and happily overwrote names/synonyms every time it encountered them, however now that we process those entries correctly a non-insignificant number of "duplicate" xrefs could appear for both Swiss-Prot and TrEMBL data. Having just discussed this with Mag, an administrative decision has been made to only generate an xref for the first gene-name encountered in a record.
mkszuba
pushed a commit
that referenced
this pull request
Nov 21, 2018
…first one Having multiple xrefs corresponding to the same (accession,source_id,species_id) triplet confuses some Ensembl code, the bit that PR #334 attempts to fix being just one occurrence of this. This never happened with the old UniProtParser because it simply ignored the fact there could be multiple gene-name entries per protein and happily overwrote names/synonyms every time it encountered them, however now that we process those entries correctly a non-insignificant number of "duplicate" xrefs could appear for both Swiss-Prot and TrEMBL data. Having just discussed this with Mag, an administrative decision has been made to only generate an xref for the first gene-name encountered in a record.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
BaseParser::upload_xref_object_graphs() can insert duplicate dependent xrefs on subsequent reruns of parsers using it on the same input, and inserted empty strings rather than NULLs as descriptions of dependent xrefs unless explicitly overridden by arguments.
Use case
BaseParser::upload_xref_object_graphs() is used by (at least) UniProtParser.
Benefits
Calls to upload_xref_object_graphs() should now be replay-safe. Fewer empty strings.
Possible Drawbacks
Parser output will likely change a lot, mostly due to the empty-space-to-null transition - which will make validation challenging.
Testing
Have you added/modified unit tests to test the changes?
No.
If so, do the tests pass/fail?
N/A
Have you run the entire test suite and no regression was detected?
I have run xref_parser.t, which AFAIK is the only part of the test suite dealing with xref parsers. All tests still pass.
Moreover, I have run the new UniProtParser (with a small subset of input extracted from the middle of current Swiss-Prot file) before and after the change to compare results and they have changed as expected.