Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unique constraints to schema yaml file #132

Merged
merged 7 commits into from
Jun 5, 2024

Conversation

stuartmcalpine
Copy link
Collaborator

The schema.yaml file now has the list of unique column constraints, this is read during database creation in the creation script.

Also have added owner and owner_type to the unique constraints of the dataset table.

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will do the job, but I think the schema and code would benefit from some reorganization. The goal should be to make the code as generic as possible, so that, for example, adding another table or adding a constraint to a table that didn't have any before doesn't require any changes in the code. I think this can be done by organizing the schema by table, something like

tables:
    execution:
         column_definitions:
                 <stuff>
    execution_alias:
         column_definitions:
                 <more stuff>
         unique_constraints:
                  <other stuff>
    provenance:
           <and so forth>

The tables element isn't strictly necessary but makes it a little clearer what's going on and also leaves a place for non-table-specific information in case we need to add some in the future. Then the code can look like

#  load the schema
for tbl in tables:
      <process the columns>
      if "unique_constraints" in tbl.keys():
             <process uniqueness constraints>

The only thing I see which wouldn't immediately fit into this arrangement is the special handling for dependencies, depending on whether a production schema is present or not (does that happen only with sqlite?). We might be able to avoid special code by adding another optional key under a table, say "sqlite" or maybe "no_production" (used for now only by dependency) which describes what's needed. If that's too awkward, there would have to be some conditional code in the table-handling which would only be activated when table is "dependency", but a lot less than there is now.

@stuartmcalpine
Copy link
Collaborator Author

Have made the schema.yaml a bit more generic as suggested.

Creating the schema now no longer needs individual functions for each table, there is a generic _BuildTable function that basically takes everything fom the yaml.

Still a bit of a special case for the dependency table in the schema creation, _FixDependencyColumns, but I think it's ok. Reduces the code a good bit which is nice

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new handling of tables looks good, but more can be done to make this more general. See inline comments for a couple suggestions.

_Execution(schema)
_ExecutionAlias(schema)
_Provenance(schema)
for table_name in [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A construction like this, where the code specifies table names, shouldn't be necessary. There could be a routine get_table_names() which would look them up from schema_data

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"__tablename__": table,
}

if (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be done in a more general fashion; there are several cases it doesn't handle:

  • multiple uniqueness constraints
  • multiple indices
  • one or more indices but no uniqueness constraints.

The first thing to do is to come up with a better representation of these things in the schema yaml file which supports multiple indices and constraints in a single table. We don't have any need for that now but someday we might. Write a routine which can handle a list of uniqueness constraints, outputting the sort of metadata sqlalchemy expects, and another for a list of indices.

Then, rather than having if ... elif ... elif... to handle all the cases, it would be somewhat nicer (but not crucial) to add on metadata for uniqueness constraints if there are any, then the same thing for indices.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think I have this now, should be able to add an arbitrary number of unique constraints and indexes to each table.

I've also added a doc string to the schema.yaml file

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!
One minor suggestion (optional): to avoid possible future confusion you might change all occurrences of indexs to indexes since that's an actual word.

@stuartmcalpine stuartmcalpine merged commit 987067a into main Jun 5, 2024
20 checks passed
@stuartmcalpine stuartmcalpine deleted the u/stuart/update_dataset_constraints branch June 5, 2024 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants