ENSCORESW-3147: correctly capture all required fields from file #384
Requirements
Description
Using one or more sentences, describe in detail the proposed changes.
The RFAM parser does not capture all required fields when run as part of the eHive xref pipeline. This is because the file is read line by line from disk, while the information for a single entry is split across multiple lines.
The proposed change groups the lines into logical blocks, each representing a single RFAM entry, so that all fields of an entry are parsed together.
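The grouping idea can be sketched as follows. This is an illustrative Python sketch, not the code from this pull request. It assumes the RFAM file uses Stockholm-style annotation lines (`#=GF AC`, `#=GF ID`, `#=GF DE`) and that each entry ends with a `//` line. The function names `group_entries` and `parse_entry`, and the mapping of `ID` to the label and `DE` to the description, are hypothetical choices made for this example.

```python
from typing import Dict, Iterable, Iterator, List


def group_entries(lines: Iterable[str], terminator: str = "//") -> Iterator[List[str]]:
    """Yield one list of lines per entry (assumed to end with `terminator`)."""
    block: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == terminator:
            # End of the current entry: hand over the collected lines.
            if block:
                yield block
            block = []
        else:
            block.append(line)
    # Emit a trailing entry that has no terminator line.
    if block:
        yield block


def parse_entry(block: List[str]) -> Dict[str, str]:
    """Extract accession, label and description from one entry's lines."""
    # Assumed mapping from annotation tag to xref field.
    wanted = {"AC": "accession", "ID": "label", "DE": "description"}
    entry: Dict[str, str] = {}
    for line in block:
        if not line.startswith("#=GF "):
            continue
        parts = line.split(None, 2)  # marker, tag, value
        if len(parts) == 3 and parts[1] in wanted:
            # Keep the first occurrence of each tag.
            entry.setdefault(wanted[parts[1]], parts[2].strip())
    return entry
```

Because `group_entries` is a generator, only the lines of the current entry are held in memory at any time. A caller could iterate over an open file handle with `for block in group_entries(handle): xref = parse_entry(block)`.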
Use case
Describe the problem. Please provide an example representing the motivation behind the need for having these changes in place.
Since the move to the eHive-based xref pipeline, all RFAM xrefs are missing descriptions and their label is identical to the accession when it should not be.
The proposed fix ensures both label and description are stored in the database and can be displayed on the browser.
This was reported by a user who compared the same gene across two zebrafish assemblies, see
http://apr2018.archive.ensembl.org/Danio_rerio/Transcript/Summary?db=core;g=ENSDARG00000082665;r=25:8247670-8247787;t=ENSDART00000116947
versus
http://dec2017.archive.ensembl.org/Danio_rerio/Transcript/Summary?db=core;g=ENSDARG00000082665;r=25:8247670-8247787;t=ENSDART00000116947
Benefits
If applicable, describe the advantages the changes will have.
Useful information from RFAM is correctly stored and displayed to our users.
Possible Drawbacks
If applicable, describe any possible undesirable consequence of the changes.
The proposed fix is not particularly elegant and requires holding multiple lines in memory at once. However, the RFAM file is small, so this should not have a major impact on memory requirements.
Testing
Have you added/modified unit tests to test the changes?
NA, the current xref pipeline does not have unit tests.
However, the pipeline was run both with and without the fix to confirm that the data is captured correctly.
If so, do the tests pass/fail?
NA
Have you run the entire test suite and no regression was detected?
NA