-
Notifications
You must be signed in to change notification settings - Fork 2
Features/updater refactor #32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Removed forced list
andrewyatz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to assess full impact but looks good from what I can understand
|
Are we including code to exclude LRG regions in the core update module? |
| # Special case for loading a single species from a collection database. Can be removed in a future release | ||
| sel_species = kwargs.get('species', None) | ||
| metadata_uri = kwargs.get('metadata_uri', self.metadata_uri) | ||
| taxonomy_uri = kwargs.get('metadata_uri', self.taxonomy_uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| taxonomy_uri = kwargs.get('metadata_uri', self.taxonomy_uri) | |
| taxonomy_uri = kwargs.get('taxonomy_uri', self.taxonomy_uri) |
|
|
||
|
|
||
| class BaseMetaUpdater: | ||
| def __init__(self, db_uri, metadata_uri, taxonomy_uri, release=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loaded genomes are not assigned to any release, needed extra step to assign the genomes to release will this be handled in future developments and taxonomy_uri is not used is it required for future purposes?
| metadata_uri = kwargs.get('metadata_uri', self.metadata_uri) | ||
| taxonomy_uri = kwargs.get('metadata_uri', self.taxonomy_uri) | ||
| db_uri = kwargs.get('db_uri', self.db_uri) | ||
| if sel_species: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/Ensembl/ensembl-production/blob/main/src/python/ensembl/production/hive/ensembl_genome_metadata/MetadataUpdaterHiveCore.py#L22
expected species not initialized in pipeline code, as it collects the list of species ids based on production name it could handle both collection and non-collection dbs. could it be simplified?
| Process an individual species from a core database to update the metadata db. | ||
| This method contains the logic for updating the metadata | ||
| """ | ||
| meta_conn = DBConnection(metadata_uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could reuse the metadata db connection, https://github.com/Ensembl/ensembl-metadata-api/pull/32/files#diff-c4c8b847961a9a8ef0f89784d66a14afed1c340edc59ad56de3815eeb4e84296R25, declared in base class
self.metadata_db
| new_organism = Organism( | ||
| species_taxonomy_id=self.get_meta_single_meta_key(species, "species.species_taxonomy_id"), | ||
| taxonomy_id=self.get_meta_single_meta_key(species, "species.taxonomy_id"), | ||
| display_name=self.get_meta_single_meta_key(species, "species.display_name"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per db schema below columns cannot be null
display_name
url_name
ensembl_name
method get_meta_single_meta_key returns None if meta_keys is missed do we need to raise an exception if mandatory fields are none
| organism_group_member = meta_session.query(OrganismGroupMember).filter( | ||
| OrganismGroupMember.organism_id == old_organism.organism_id, | ||
| OrganismGroupMember.organism_group_id == division.organism_group_id).one_or_none() | ||
| return old_organism, division, organism_group_member, "Existing" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we use an enum class for standard organism/assembly status New, Existing instead of hardcoded string ?
class Status(Enum):
NEW = 1
EXISTING = 2
| ################################################################ | ||
| # Dataset section. More logic will be necessary for additional datasets. Currently only the genebuild is listed here. | ||
|
|
||
| print("Rewrite") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could use logging instead of print
| for attribute, value in attributes.items(): | ||
| meta_attribute = meta_session.query(Attribute).filter(Attribute.name == attribute).one_or_none() | ||
| if meta_attribute is None: | ||
| raise Exception(f"Attribute {attribute} not found. Please enter it into the db manually") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exception raised when loading homo_sapiens 110 db with updated accession GCA_000001405.29,
homo_sapiens already loaded for 108 with assembly accession GCA_000001405.28. core update processor considers it as an existing organism and looks for accession in metadata db fails to find new assembly accession declared in core db metatable. the clean load will not have any impact but later while loading multiple releases will be an issue
| name=name # dbname | ||
| ) | ||
| meta_session.add(dataset_source) # Only add a new DatasetSource to the session if it doesn't exist | ||
| return dataset_source, "new" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Status
| return dataset_source, "new" | |
| return dataset_source, "New" |
| meta_session.add(dataset_source) # Only add a new DatasetSource to the session if it doesn't exist | ||
| return dataset_source, "new" | ||
| else: | ||
| return dataset_source, "existing" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return dataset_source, "existing" | |
| return dataset_source, "Existing" |
or setting Enum class will be standard
…_only Get genome UUID using default assembly only
A complete refactor of the updater. New logic has been implemented and the code is more readable and robust. It did come at a cost of run time, with the previous instance taking milliseconds and this instance taking seconds.
Nonetheless, the whole thing should function identically to the previous version with the additional logic steps added.
In addition a function and method was added to the api. More will need to be added to datasets in the future.
This has been fully tested with local databases, however the tests may fail currently. Hopefully those are fixed before I leave.