Refactor genomes.py duplication #321

Closed
BurkovBA opened this issue Feb 6, 2018 · 3 comments

Comments

@BurkovBA
Contributor

BurkovBA commented Feb 6, 2018

Currently, genome handling logic is scattered across the codebase and we rely on ad-hoc workarounds for format conversions. This chaos needs to go.

I suggest the following roadmap:

  1. I'll look into the genome-related code in:
    • database
    • python and django models and serializers methods logic
    • data import pipelines
    • urls
    • Genoverse genome-browser
    • text search and Lucene index
    • user-readable representations on website (backend and frontend-generated)
    • hyperlinks generation for external resources (E!, UCSC, ...)

I'll create a GitHub issue with hyperlinks so that Anton and Blake can quickly recap.

  2. Anton and Blake, using the hyperlinks I provided, refresh their memory of the whole problem and come up with their visions of:

    • how this should be done
    • how to get from where we are to where we need to be ASAP
  3. We hold a short meeting and agree on which formats we use for genome names in each part of our site. I write meeting notes that will serve as a documentation prototype.

  4. Using the meeting notes, I document the formats used to store data and the pipelines of data transfer. I make this documentation available and we keep it up to date.

  5. Following the documentation, we create a single data flow with well-defined interfaces and adapter functions for conversions between formats. This pipeline is used:

    • by data import pipelines to transport data from external sources to the database and python code
    • by backend code to retrieve data from the DB into python/django models
    • by various frontend modules to request genomes from the backend
    • by various frontend modules to display data
  6. We rewrite our code to use this pipeline and remove any duplicated logic and ad-hoc code.
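
To make the "single data flow with adapter functions" idea concrete, here is a minimal sketch. It is not existing code: the `Genome` record, its field names and the helper functions are all hypothetical, and the real set of formats would come out of the meeting notes. The point is only that one canonical record is stored and every other representation is derived from it by a small adapter.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Genome:
    """Canonical genome record (hypothetical field names)."""
    ensembl_name: str          # underscore form, e.g. "homo_sapiens"
    assembly: str              # e.g. "GRCh38"
    ucsc_db_id: Optional[str]  # e.g. "hg38"; None if there is no UCSC counterpart


def normalize_name(name: str) -> str:
    """Adapter: any user- or URL-supplied spelling -> canonical underscore form."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def to_display_name(genome: Genome) -> str:
    """Adapter: canonical record -> user-readable label."""
    return genome.ensembl_name.replace("_", " ").capitalize()


def find_genome(genomes: Iterable[Genome], name: str) -> Optional[Genome]:
    """Look a genome up by any accepted spelling of its name."""
    wanted = normalize_name(name)
    for genome in genomes:
        if genome.ensembl_name == wanted:
            return genome
    return None


human = Genome(ensembl_name="homo_sapiens", assembly="GRCh38", ucsc_db_id="hg38")
print(to_display_name(find_genome([human], "Homo sapiens")))
```

With something like this in place, the import pipeline, the backend and the frontend would all go through the same adapters instead of each reformatting names on their own.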

We can download all the available genomes from E! public MySQL database into our own database table.

Then we can get rid of config/genomes.py and similar code on the frontend, and expose genomes through a REST API endpoint.
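
A rough sketch of what "one table plus one endpoint" could look like, using an in-memory SQLite database purely as a stand-in for our real database and a plain function in place of a real view. The table name, the columns and the JSON shape are assumptions for illustration, not an agreed schema.

```python
import json
import sqlite3


def create_genome_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that would replace config/genomes.py."""
    conn.execute(
        """
        CREATE TABLE genome (
            ensembl_name TEXT PRIMARY KEY,
            assembly     TEXT NOT NULL,
            taxid        INTEGER NOT NULL
        )
        """
    )


def list_genomes_json(conn: sqlite3.Connection) -> str:
    """Build the response body a genomes endpoint might return."""
    rows = conn.execute(
        "SELECT ensembl_name, assembly, taxid FROM genome ORDER BY ensembl_name"
    ).fetchall()
    return json.dumps(
        [{"ensembl_name": name, "assembly": assembly, "taxid": taxid}
         for name, assembly, taxid in rows]
    )


conn = sqlite3.connect(":memory:")
create_genome_table(conn)
conn.executemany(
    "INSERT INTO genome VALUES (?, ?, ?)",
    [("mus_musculus", "GRCm38", 10090), ("homo_sapiens", "GRCh38", 9606)],
)
print(list_genomes_json(conn))
```

The frontend would then fetch this list instead of keeping its own copy of the genome definitions.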

This script is an example of how to retrieve genomes information from E! public MySQL database.
https://github.com/RNAcentral/rnacentral-webcode/blob/master/rnacentral/portal/management/commands/update_ensembl_genome_mapping.py
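
This is not a copy of the linked script, just an illustration of one small step such an import has to deal with: working out species, release and assembly version from a database name. It assumes core database names follow the `homo_sapiens_core_91_38` layout; names from other divisions may be laid out differently and are deliberately rejected here. The function name and the returned keys are hypothetical.

```python
import re

# Assumed layout: <species>_core_<release>_<assembly version>
CORE_DB_PATTERN = re.compile(
    r"^(?P<species>[a-z0-9_]+?)_core_(?P<release>\d+)_(?P<assembly_version>[0-9a-z]+)$"
)


def parse_core_db_name(db_name):
    """Split a core database name into its parts, or return None if it does not match."""
    match = CORE_DB_PATTERN.match(db_name)
    if match is None:
        return None
    return {
        "species": match["species"],
        "release": int(match["release"]),
        "assembly_version": match["assembly_version"],
    }


print(parse_core_db_name("homo_sapiens_core_91_38"))
```

Anything that does not match returns `None`, so unrelated databases on the same server can simply be skipped.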

We also have multiple functions tied to genomes, such as Xref.get_ucsc_db_id, Xref.get_ensembl_division(), Accession.get_ensembl_species_url().

@blakesweeney
Member

How often do we need to run this? If it is something we should run when we import E! data, I would prefer to add it to the pipeline as part of the Ensembl update. pgloader supports pulling from a MySQL database into a Postgres one: http://pgloader.readthedocs.io/en/latest/ref/mysql.html.

@AntonPetrov
Member

This would need to run every time Ensembl is updated, so it's a good idea to merge this script with the Ensembl import pipeline.

Not sure if pgloader can help here because we need to pull data from several tables across multiple Ensembl databases.

@blakesweeney blakesweeney self-assigned this Feb 6, 2018
@blakesweeney
Member

Ok, I can work on adding it as part of the import pipeline later then. I'll aim for after I update Ensembl data for this release.

BurkovBA added a commit that referenced this issue Apr 6, 2018
BurkovBA added a commit that referenced this issue Apr 6, 2018
BurkovBA added a commit that referenced this issue Apr 8, 2018
BurkovBA added a commit that referenced this issue Apr 8, 2018
BurkovBA added a commit that referenced this issue Apr 9, 2018
BurkovBA added a commit that referenced this issue Apr 9, 2018