Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keywords #115

Merged
merged 25 commits into from
Jun 4, 2024
Merged

Keywords #115

merged 25 commits into from
Jun 4, 2024

Conversation

stuartmcalpine
Copy link
Collaborator

@stuartmcalpine stuartmcalpine commented Apr 9, 2024

Adding keywords

Add a keywords table so that datasets can be tagged with multiple keywords.

Changes

  • New table "keyword"
    • keyword_id : int
    • keyword : str
    • system : bool, True for preset keyword, False for user defined keyword
    • activate : bool, True is keyword is active and usable
    • creator_uid : str
    • creation_date : datetime
  • New table "dataset_keyword" (links keywords to datasets)
    • dataset_keyword_id : int
    • dataset_id : int
    • keyword_id : int
  • The list of preset keywords for datasets is in src/dataregistry/schema/keywords.yaml which populates the keyword table during database creation.
  • Added datareg.Registrar.dataset.get_keywords() function to return the list of currently registered keywords.
  • When the keyword table is queried on an automatic join is made with the dataset_keyword association table. So the user can query for all datasets with a given keyword for example.
  • Added keywords to the docs
  • Can tag keywords when registering datasets through the CLI
  • Can run dregs show keywords from CLI to display all pre-registered keywords

Notes

  • Think about what the preset keywords should be
  • Want ability for user added keywords? (yes, will do in another PR)
  • Currently no ability to untag a keyword from a dataset (will do in another PR)

@stuartmcalpine stuartmcalpine changed the base branch from main to u/stuart/v0.4.0 April 9, 2024 11:21
@JoanneBogart
Copy link
Collaborator

Borrowing some ideas from experience with the eTraveler version of label support, I'd suggest

  • the table containing the keywords can have a field system, values True or False and status. It can then just be called keyword rather than keyword_preset. What you're calling preset_keywords would have system=True; user labels would have system=False. status would have values like active or disabled; not sure if that's all we need. Normal mortals would not be able to disable system keywords. This table should probably also have fields creator_uid and one or both of creation_date, last_modify_date.
  • the table dataset_keyword_preset can similarly just be called dataset_keyword. It also should have status, with values like active, disabled (making it possible to dissolve the association), last_modify_date, and one or both of creator_uid, last_modifier_uid. Then searches for keywords would probably by default require status=active.

@stuartmcalpine
Copy link
Collaborator Author

  • the table dataset_keyword_preset can similarly just be called dataset_keyword. It also should have status, with values like active, disabled (making it possible to dissolve the association), last_modify_date, and one or both of creator_uid, last_modifier_uid. Then searches for keywords would probably by default require status=active.

Is there a benefit to keeping inactive keyword entry links in the database, as opposed to a user just being able to delete a keyword from a dataset and the entry is gone?

I also don't think keywords should be modifiable, as it would have a knock on effect to everyones dataset that is tagged with that keyword.

@JoanneBogart
Copy link
Collaborator

Is there a benefit to keeping inactive keyword entry links in the database, as opposed to a user just being able to delete a keyword from a dataset and the entry is gone?
Yes, a minor benefit in case anyone is interested in history, particularly if, in someone's code (e.g. analysis code) an action is triggered when a particular keyword has been associated with a dataset.
I also don't think keywords should be modifiable, as it would have a knock on effect to everyones dataset that is tagged with that keyword.
I agree. I withdraw that suggestion.

Base automatically changed from u/stuart/v0.4.0 to main April 27, 2024 09:56
@stuartmcalpine
Copy link
Collaborator Author

@JoanneBogart, because we don't have access to the table classes which we create the schema with (in create_registry_db.py) during runtime, we don't have access to sqlalchemys ORM abilities.

Coming across this because of the keywords many-to-many relationship, which would be nice to directly query "get me all datasets with this keyword" or vice versa. But the relationships in sqlalchemy are not something in the schema, just the code, but they are only declared in the original classes in the creation script.

So to query keywords we'd have to do it a bit more manual in the code, through the dataset_keyword table ourselves, which is fine. Just thinking if we care about this functionality anymore beyond this, i.e., have those original table classes in the source code rather than an external script.

Can discuss more during meeting

@JoanneBogart
Copy link
Collaborator

@stuartmcalpine
Can't we get access to sqlalchemy table classes with Metadata.reflect?
I had to do this in create_registry_schema in order to access tables in the production schema when creating a new non-production schema.

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have more suggestions (or maybe not) but it may be a few days before I can find time to finish looking at this. It seemed best to let you know what I found so far.

src/dataregistry/db_basic.py Show resolved Hide resolved
@@ -116,6 +117,12 @@ def modify(self, entry_id, modify_fields):

# Loop over each column to be modified
for key, v in modify_fields.items():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A base class shouldn't have to know about its subclasses. And modifying keywords seems different from modifying anything else. I think this functionality belongs in a separate routine or routines, e.g. modify_keywords(self, entry_id, kwds); or, maybe more usefully, add_keywords(self, entry_id, kwds) and remove_keywords(self, entry_id, kwds). That (or they) could go in DatasetTable, which I think is the best option currently. When and if we support assigning keywords to executions and if the implementation is going to be similar, then it could perhaps move back to BaseTable. If so, anything called, like _validate_keywords, should also be declared in BaseTable but with an empty implementation if the real implementation is specific to the subclass:

def _validate_keywords(self, kwds):
            pass

or maybe better something like

def _validate_keywords(self, kwds):
           raise Exception("Subclasses should provide implementation")

Alternatively the routine or routines could go in some file for utilities where the table is passed in as an argument. Either way there would need to be a check that the table is an instance of an appropriate type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I've moved the keyword specific functions to the dataset class.

There is now a add_keyword and (for now empty) delete_keyword function that can be used to manipulate keywords post-registration.

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine. We might make it a little easier for users to figure out how to query for keywords. You have a nice example in the tutorial. The docstring for find_datasets could mention the possibility of querying for keywords and refer to the tutorial.

@stuartmcalpine stuartmcalpine merged commit cad7dcd into main Jun 4, 2024
20 checks passed
@stuartmcalpine stuartmcalpine deleted the u/stuart/keywords branch June 4, 2024 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants