Skip to content

Conversation

@dpopleton
Copy link
Contributor

A complete refactor of the updater. New logic has been implemented and the code is more readable and robust. It did come at a cost of run time, with the previous instance taking milliseconds and this instance taking seconds.
Nonetheless, the whole thing should function identically to the previous version with the additional logic steps added.
In addition a function and method was added to the api. More will need to be added to datasets in the future.

This has been fully tested with local databases, however the tests may fail currently. Hopefully those are fixed before I leave.

andrewyatz

This comment was marked as outdated.

Copy link
Member

@andrewyatz andrewyatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to assess full impact but looks good from what I can understand

@vinay-ebi
Copy link
Contributor

Are we including code to exclude LRG regions in the core update module?

@dpopleton dpopleton merged commit 897bac6 into main Aug 9, 2023
@dpopleton dpopleton deleted the features/updater_refactor branch August 9, 2023 10:30
# Special case for loading a single species from a collection database. Can be removed in a future release
sel_species = kwargs.get('species', None)
metadata_uri = kwargs.get('metadata_uri', self.metadata_uri)
taxonomy_uri = kwargs.get('metadata_uri', self.taxonomy_uri)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
taxonomy_uri = kwargs.get('metadata_uri', self.taxonomy_uri)
taxonomy_uri = kwargs.get('taxonomy_uri', self.taxonomy_uri)



class BaseMetaUpdater:
def __init__(self, db_uri, metadata_uri, taxonomy_uri, release=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loaded genomes are not assigned to any release, needed extra step to assign the genomes to release will this be handled in future developments and taxonomy_uri is not used is it required for future purposes?

metadata_uri = kwargs.get('metadata_uri', self.metadata_uri)
taxonomy_uri = kwargs.get('metadata_uri', self.taxonomy_uri)
db_uri = kwargs.get('db_uri', self.db_uri)
if sel_species:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/Ensembl/ensembl-production/blob/main/src/python/ensembl/production/hive/ensembl_genome_metadata/MetadataUpdaterHiveCore.py#L22
expected species not initialized in pipeline code, as it collects the list of species ids based on production name it could handle both collection and non-collection dbs. could it be simplified?

Process an individual species from a core database to update the metadata db.
This method contains the logic for updating the metadata
"""
meta_conn = DBConnection(metadata_uri)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_organism = Organism(
species_taxonomy_id=self.get_meta_single_meta_key(species, "species.species_taxonomy_id"),
taxonomy_id=self.get_meta_single_meta_key(species, "species.taxonomy_id"),
display_name=self.get_meta_single_meta_key(species, "species.display_name"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per db schema below columns cannot be null
display_name
url_name
ensembl_name
method get_meta_single_meta_key returns None if meta_keys is missed do we need to raise an exception if mandatory fields are none

organism_group_member = meta_session.query(OrganismGroupMember).filter(
OrganismGroupMember.organism_id == old_organism.organism_id,
OrganismGroupMember.organism_group_id == division.organism_group_id).one_or_none()
return old_organism, division, organism_group_member, "Existing"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use an enum class for standard organism/assembly status New, Existing instead of hardcoded string ?
class Status(Enum):
NEW = 1
EXISTING = 2

################################################################
# Dataset section. More logic will be necessary for additional datasets. Currently only the genebuild is listed here.

print("Rewrite")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use logging instead of print

for attribute, value in attributes.items():
meta_attribute = meta_session.query(Attribute).filter(Attribute.name == attribute).one_or_none()
if meta_attribute is None:
raise Exception(f"Attribute {attribute} not found. Please enter it into the db manually")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exception raised when loading homo_sapiens 110 db with updated accession GCA_000001405.29,
homo_sapiens already loaded for 108 with assembly accession GCA_000001405.28. core update processor considers it as an existing organism and looks for accession in metadata db fails to find new assembly accession declared in core db metatable. the clean load will not have any impact but later while loading multiple releases will be an issue

name=name # dbname
)
meta_session.add(dataset_source) # Only add a new DatasetSource to the session if it doesn't exist
return dataset_source, "new"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status

Suggested change
return dataset_source, "new"
return dataset_source, "New"

meta_session.add(dataset_source) # Only add a new DatasetSource to the session if it doesn't exist
return dataset_source, "new"
else:
return dataset_source, "existing"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return dataset_source, "existing"
return dataset_source, "Existing"

or setting Enum class will be standard

marcoooo added a commit that referenced this pull request Jan 18, 2024
…_only

Get genome UUID using default assembly only
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants