Features/updater refactor #32

dpopleton · 2023-07-16T19:46:13Z

A complete refactor of the updater. New logic has been implemented, and the code is more readable and robust. This came at a cost in run time: the previous version took milliseconds and this one takes seconds.
Nonetheless, it should function identically to the previous version, apart from the additional logic steps.
In addition, a function and a method were added to the API. More will need to be added to datasets in the future.

This has been fully tested with local databases; however, the tests may currently fail. Hopefully those are fixed before I leave.

andrewyatz

Hard to assess the full impact, but it looks good from what I can understand.

vinay-ebi · 2023-08-08T19:13:54Z

Are we including code to exclude LRG regions in the core update module?

vinay-ebi · 2023-08-08T20:11:31Z

src/ensembl/production/metadata/updater/core.py

        # Special case for loading a single species from a collection database. Can be removed in a future release
        sel_species = kwargs.get('species', None)
+        metadata_uri = kwargs.get('metadata_uri', self.metadata_uri)
+        taxonomy_uri = kwargs.get('metadata_uri', self.taxonomy_uri)


Suggested change

taxonomy_uri = kwargs.get('metadata_uri', self.taxonomy_uri)

taxonomy_uri = kwargs.get('taxonomy_uri', self.taxonomy_uri)
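
The effect of the mistyped key can be seen in a minimal sketch. The `Updater` class and the URI strings below are hypothetical stand-ins, not the real updater; only the two `kwargs.get` lines mirror the diff above.

```python
class Updater:
    """Hypothetical stand-in for the updater; only the kwargs handling matters."""

    def __init__(self, metadata_uri, taxonomy_uri):
        self.metadata_uri = metadata_uri
        self.taxonomy_uri = taxonomy_uri

    def resolve_buggy(self, **kwargs):
        # Original line: reads the 'metadata_uri' key for the taxonomy connection
        return kwargs.get('metadata_uri', self.taxonomy_uri)

    def resolve_fixed(self, **kwargs):
        # Suggested line: reads the matching 'taxonomy_uri' key
        return kwargs.get('taxonomy_uri', self.taxonomy_uri)


updater = Updater("mysql://meta-default", "mysql://tax-default")
overrides = {
    "metadata_uri": "mysql://meta-override",
    "taxonomy_uri": "mysql://tax-override",
}

buggy = updater.resolve_buggy(**overrides)  # silently picks up the metadata URI
fixed = updater.resolve_fixed(**overrides)  # picks up the taxonomy URI
```

With the original key, any caller that overrides `metadata_uri` would also redirect the taxonomy connection without an error being raised.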

vinay-ebi · 2023-08-08T20:26:06Z

src/ensembl/production/metadata/updater/base.py



 class BaseMetaUpdater:
    def __init__(self, db_uri, metadata_uri, taxonomy_uri, release=None):


Loaded genomes are not assigned to any release; an extra step was needed to assign the genomes to a release. Will this be handled in future development? Also, taxonomy_uri is not used. Is it required for future purposes?

vinay-ebi · 2023-08-08T20:38:16Z

src/ensembl/production/metadata/updater/core.py

+        metadata_uri = kwargs.get('metadata_uri', self.metadata_uri)
+        taxonomy_uri = kwargs.get('metadata_uri', self.taxonomy_uri)
+        db_uri = kwargs.get('db_uri', self.db_uri)
        if sel_species:


https://github.com/Ensembl/ensembl-production/blob/main/src/python/ensembl/production/hive/ensembl_genome_metadata/MetadataUpdaterHiveCore.py#L22
The expected species is not initialized in the pipeline code. Since it collects the list of species ids based on production name, it could handle both collection and non-collection dbs. Could this be simplified?

vinay-ebi · 2023-08-08T20:45:20Z

src/ensembl/production/metadata/updater/core.py

+        Process an individual species from a core database to update the metadata db.
+        This method contains the logic for updating the metadata
+        """
+        meta_conn = DBConnection(metadata_uri)


Could reuse the metadata db connection declared in the base class (self.metadata_db):
https://github.com/Ensembl/ensembl-metadata-api/pull/32/files#diff-c4c8b847961a9a8ef0f89784d66a14afed1c340edc59ad56de3815eeb4e84296R25
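
A minimal sketch of the reuse being suggested. `FakeConnection`, `BaseUpdater`, and `CoreUpdater` are hypothetical stand-ins (the real `DBConnection` and updater classes are not reproduced here); the counter only exists to show that a single connection object is created.

```python
class FakeConnection:
    """Hypothetical stand-in for DBConnection; counts how many are created."""

    created = 0

    def __init__(self, uri):
        FakeConnection.created += 1
        self.uri = uri


class BaseUpdater:
    def __init__(self, metadata_uri):
        # The connection is opened once, when the updater is constructed
        self.metadata_db = FakeConnection(metadata_uri)


class CoreUpdater(BaseUpdater):
    def process_species(self, species_id):
        # Reuse the connection from the base class instead of opening a new one
        meta_conn = self.metadata_db
        return meta_conn.uri, species_id


core_updater = CoreUpdater("mysql://meta")
first = core_updater.process_species(1)
second = core_updater.process_species(2)
```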

vinay-ebi · 2023-08-08T21:00:09Z

src/ensembl/production/metadata/updater/core.py

+        new_organism = Organism(
+            species_taxonomy_id=self.get_meta_single_meta_key(species, "species.species_taxonomy_id"),
+            taxonomy_id=self.get_meta_single_meta_key(species, "species.taxonomy_id"),
+            display_name=self.get_meta_single_meta_key(species, "species.display_name"),


As per the db schema, the columns below cannot be null:
display_name
url_name
ensembl_name
The method get_meta_single_meta_key returns None if a meta key is missing. Do we need to raise an exception if mandatory fields are None?
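
One possible shape for such a check, as a sketch only. `MissingMetaKeyError` and `require_fields` are hypothetical names, and the input is assumed to be a plain dict of column name to the value returned by the meta lookup.

```python
class MissingMetaKeyError(ValueError):
    """Hypothetical exception for a mandatory field with no value."""


def require_fields(values, required=("display_name", "url_name", "ensembl_name")):
    """Return values unchanged, or raise if a required column is None or absent."""
    missing = [name for name in required if values.get(name) is None]
    if missing:
        raise MissingMetaKeyError(
            f"Mandatory organism fields missing from core meta table: {', '.join(missing)}"
        )
    return values
```

Calling this before constructing the `Organism` would surface the problem as a readable error rather than as a NOT NULL violation at commit time.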

vinay-ebi · 2023-08-08T21:06:12Z

src/ensembl/production/metadata/updater/core.py

+            organism_group_member = meta_session.query(OrganismGroupMember).filter(
+                OrganismGroupMember.organism_id == old_organism.organism_id,
+                OrganismGroupMember.organism_group_id == division.organism_group_id).one_or_none()
+            return old_organism, division, organism_group_member, "Existing"


Could we use an enum class for the standard organism/assembly statuses (New, Existing) instead of hardcoded strings?
class Status(Enum):
    NEW = 1
    EXISTING = 2
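
A slightly fuller sketch of how the enum might be used. `get_or_new_source` is a hypothetical stand-in for a lookup that returns an object together with its status; the set of names replaces the database session.

```python
from enum import Enum


class Status(Enum):
    NEW = 1
    EXISTING = 2


def get_or_new_source(existing_names, name):
    """Hypothetical lookup: return (name, Status) and register unseen names."""
    if name in existing_names:
        return name, Status.EXISTING
    existing_names.add(name)
    return name, Status.NEW


names = set()
_, first_status = get_or_new_source(names, "homo_sapiens_core_110_38")
_, second_status = get_or_new_source(names, "homo_sapiens_core_110_38")
```

Callers then compare with `status is Status.NEW`, which removes the risk of string literals that differ only in case.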

vinay-ebi · 2023-08-08T21:07:39Z

src/ensembl/production/metadata/updater/core.py

-                ################################################################
-                # Dataset section. More logic will be necessary for additional datasets. Currently only the genebuild is listed here.
-
+                    print("Rewrite")


could use logging instead of print
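
For instance, with the standard library `logging` module. The helper name and message text are hypothetical; only the replacement of `print` with a module-level logger is the point.

```python
import logging

# Module-level logger, named after the module that emits the message
logger = logging.getLogger(__name__)


def report_rewrite(dataset_name):
    """Hypothetical helper: log the rewrite instead of printing it."""
    # Lazy %-style arguments are only formatted if the record is actually emitted
    logger.info("Rewriting dataset %s", dataset_name)
```

The calling application decides whether these messages are shown, for example with `logging.basicConfig(level=logging.INFO)`.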

vinay-ebi · 2023-08-08T21:56:34Z

src/ensembl/production/metadata/updater/core.py

+            for attribute, value in attributes.items():
+                meta_attribute = meta_session.query(Attribute).filter(Attribute.name == attribute).one_or_none()
+                if meta_attribute is None:
+                    raise Exception(f"Attribute {attribute} not found. Please enter it into the db manually")


An exception is raised when loading the homo_sapiens 110 db with the updated accession GCA_000001405.29.
homo_sapiens was already loaded for 108 with assembly accession GCA_000001405.28. The core update processor treats it as an existing organism, looks for the accession in the metadata db, and fails to find the new assembly accession declared in the core db meta table. A clean load is not affected, but this will be an issue later when loading multiple releases.
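
The case described can be pictured with a small sketch in which the organism and the assembly are classified independently. This is an illustration under assumptions, not the updater's actual logic: `classify` is a hypothetical function, and plain sets stand in for the metadata db tables.

```python
def classify(known_organisms, known_accessions, ensembl_name, accession):
    """Hypothetical check: organism and assembly status are decided separately."""
    organism_status = "Existing" if ensembl_name in known_organisms else "New"
    assembly_status = "Existing" if accession in known_accessions else "New"
    return organism_status, assembly_status


# State after loading release 108
organisms = {"homo_sapiens"}
accessions = {"GCA_000001405.28"}

# Loading release 110 with the updated accession
result = classify(organisms, accessions, "homo_sapiens", "GCA_000001405.29")
```

Here the organism is recognised while the accession is not, which is the combination the comment reports as failing.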

vinay-ebi · 2023-08-08T22:20:51Z

src/ensembl/production/metadata/updater/base.py

+                name=name  # dbname
+            )
+            meta_session.add(dataset_source)  # Only add a new DatasetSource to the session if it doesn't exist
+            return dataset_source, "new"


Status

Suggested change

return dataset_source, "new"

return dataset_source, "New"

vinay-ebi · 2023-08-08T22:21:24Z

src/ensembl/production/metadata/updater/base.py

+            meta_session.add(dataset_source)  # Only add a new DatasetSource to the session if it doesn't exist
+            return dataset_source, "new"
+        else:
+            return dataset_source, "existing"


Suggested change

return dataset_source, "existing"

return dataset_source, "Existing"

Alternatively, using an Enum class would standardize this.

dpopleton and others added 5 commits July 11, 2023 07:45

Initial refactor with changes made to the organism update

e93b273

Fiadded new genset and assembly updates

bc6601b

Full refactor of updater before testing. Created release check for API datasets

f60e708

Full refactor of updater before testing. Created release check for API datasets

4f8c7b6

Update genome.py

93fc4a1

Removed forced list

bilalebi mentioned this pull request Jul 26, 2023

Improve code logic and add fetch genome by ensembl and assembly name #31

Merged

dpopleton added 14 commits August 7, 2023 15:54

Fixed Tests

7c4f4fd

Modified test dbs to use E. coli taxid

18ea79d

Modified test dbs to use specific unique name.

6b974d1

Improved fetch taxonomy names within api

659aca5

Added taxid genome api check

1c397de

added a change for taxid check. Major rework already in PR so it isn't quite proper

c3cf0a7

added a change for taxid check. Major rework already in PR so it isn't quite proper

fad763c

added a change for taxid check. Major rework already in PR so it isn't quite proper

03eef45

updated test tables

43534ae

Fixed api test and altered get sequences

06c7c3c

added attrib_type to tests

ffcd9f2

Reworked Tests

8ce5ea4

Reworked Tests

3e06aef

Reworked Tests

7baee2a

dpopleton requested review from andrewyatz and vinay-ebi August 8, 2023 11:57

andrewyatz approved these changes Aug 8, 2023

Added lrg skip for assembly sequences

216ef2c

dpopleton merged commit 897bac6 into main Aug 9, 2023

dpopleton deleted the features/updater_refactor branch August 9, 2023 10:30

vinay-ebi reviewed Aug 23, 2023

marcoooo added a commit that referenced this pull request Jan 18, 2024

Merge pull request #32 from Ensembl/feat/get_guuid_using_assm_default_only Get genome UUID using default assembly only

f8348dd

5 participants