Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvobject Table Dangers: 800,000 rows with no type checking and wasted space #777

Closed
raprasad opened this issue Jul 28, 2014 · 8 comments
Closed
Assignees
Labels

Comments

@raprasad
Copy link
Contributor

Related: #733

Background

A single postgres table, dvobject, is being used to store 3 distinct entities.

  • Dataverses
  • Datasets
  • Datafiles

The table is not normalized and allows logical inconsistencies and anomalies.

Consider normalizing dvobject table and using a dvobject_types table.

note: After migration, the dvobject table will contain nearly 800,000 rows. (750k files, 54k studies, 750+ dataverses, etc)

Reason 1 for normalization: No business logic enforcement / data integrity at database level

  • Using one big table (dvobject) mixes distinct entity types--making if difficult for new team members, outside parters, and even dataverse team members to understand
  • The dvobject table leads to greater code complexity because postgres cannot guard against the following situations:
    • creator - Possible to have a null value for a Dataverse or Dataset. Possible to have a creator for a Datafile
    • owner - If keys mixed up: Datafile can own a Dataverse, Dataset can own a Dataverse, Datafile can own a Dataset. Only enforced at code level.
    • release_user - Possible to have a release_user for a Datafile
    • alias - May be null for a Dataverse
    • contact_email - May be blank for a dataverse
    • etc...
  • Data integrity enforcement normally done at the table level must now be done in the code
    • Leads to a greater chance of bad data by backend developers, UI developers, direct database users, etc.
    • Example of extra coding includes requiring a contact_email for a dataverse

Reason 2 for normalization: Wasted space on a large table

  • Mixing distinct entity types (Dataverse, Dataset, Datafile) leads to unused columns.
  • Post migration, the database will contain nearly 800,000 rows, including these unused fields:
    • Datafile: 700,000 rows of unused fields for creator, affiliation alias, contact_email, facetroot, metadatablockroot, permissionroot, md5, etc...
    • Dataset: 10s of 1000s of rows with unused fields for creator, affiliation alias, contact_email, facetroot, metadatablockroot, permissionroot, filesystemname, contenttype, etc...
    • Dataverse: 700+ of rows with unused fields for authority, identifier, protocol, filesystemname, contenttype, md5, ingeststatus
    • (This is not meant to be an exhaustive list, only a sample)

Suggested Fix: Path to normalization

  • DvObject makes permission checking convenient BUT doesn't enforce data integrity
    • Evaluating permissions requires retrieving objects by ID (DvObject.id) and traversing objects (e.g. parent dataverse, etc)
  • Potential middle road: Normalize tables and check objects by id and dvobject_type (explained below)

Consider normalizing and using a dvobject_types table

  • Make separate tables for each dvobject:
    • dataverse table
    • dataset table
    • datafile table
  • Make a dvobject_types table:
    • id (pk)
    • name
    • table_name (name of normalized table)
    • java_class_name (name of java class)
    • java_import_name
    • Constraint. Unique together: (java_class_name, table_name, java_package_name)
  • Example contents of a dvobject_types table:
    • 1, "Dataverse", "dataverse", "Dataverse", "edu.harvard.iq.dataverse.Dataverse"
    • 2, "Dataset", "dataset", "Dataset", "edu.harvard.iq.dataset.Dataset"
    • 3, "Datafile", "datafile", "Datafile", "edu.harvard.iq.datafile.Datafile"

Use the dvobject_types table when storing object level roles/permissions

  • Wherever object role/permissions are stored for dvobjects, instead of using object id, use object id and dvobject_types.id
  • Current permissions: (user id, object id, role id)
    • object id from DvObject table~~
  • Suggested: (user id, object id, dvobject_types.id, role id)
    • (dvobject_types.id, object.id) - Allows retrieval of specific object
    • Example object key: (1, 5)
      • 1 = Dataverse type, FK to dvobject_types (FK enforced by db)
      • 5 = object id, key in the dataverse table (not enforced by db)
    • Example object key: (2, 3)
      • 1 = Dataset type, FK to dvobject_types (FK enforced by db)
      • 5 = object id, key in the dataset table (not enforced by db)

Questions/Discussion

  • How would this fit into the permissions system?
  • How much complexity is REDUCED by using a dvobject_types table? Short term? Long term?
  • How much complexity is ADDED by using a dvobject_types table? Short term? Long term?
@raprasad raprasad added this to the In Review - Dataverse 4.0 milestone Jul 28, 2014
@pdurbin
Copy link
Member

pdurbin commented Aug 6, 2014

data integrity at database level

See also #807.

@raprasad
Copy link
Contributor Author

Marking this is as critical:

  • For all the reasons listed in orig. ticket (data integrity)
  • There exists a "not-too-complicated fix"
  • This will be hard to fix after data migration (don't wait for 4.1)

@scolapasta
Copy link
Contributor

@michbarsinai can you comment? My recollection is that when you first did this, you had to have one dvobject table because of how eclipselink handles this (that it does not support the one table per entity model)? Am I recalling correctly or has this changed?

@michbarsinai
Copy link
Member

I'll review this again after beta 8. Note that joins come at a cost as well. There may be some other solutions, e.g. "row objects".

@raprasad
Copy link
Contributor Author

This is a not an insignificant change.

In the tables where permissions reference the dvobject (RoleAssignment, DataverseRole, etc), it means replacing one attribute with two:

  • current FK reference:
    1. dvobject.id - FK to dvobject table
  • future FK reference:
    1. dvobject_type.id - FK to the "type". Is this a dataverse, dataset, or datafile?
    2. id - to a row in the dataverse, dataset or datafile table
      • This second 'id' is not a true foreign key
      • However the advantage is table integrity for all the attributes required by business rules that the current dvobject table allows to be null

Un-mixing the entities also makes things simpler:

  • For dataverses: why search 800,000 rows of the dvobject when only 1,000-2,000 of the rows are dataverses?
    • This becomes even more complex with dataverses being heirarchical
    • Adding/removing indices to this table can become expensive simply b/c almost every column can be null

It will be easier to make this change before 4.0 is live. After the data is migrated and the system is live, it will be much more difficult.

That said, at this late date, this change will push milestone dates forward.

@scolapasta scolapasta self-assigned this Nov 11, 2014
@scolapasta
Copy link
Contributor

With the new JOINED inheritence strategy, we knowhave both type safety and removed the sparseness.

@scolapasta scolapasta removed their assignment Nov 14, 2014
@kcondon kcondon self-assigned this Nov 14, 2014
@kcondon
Copy link
Contributor

kcondon commented Nov 14, 2014

3 tables in place, objects added to tables on create. will open tickets if issues arise. closing

@kcondon kcondon closed this as completed Nov 14, 2014
@pdurbin
Copy link
Member

pdurbin commented Nov 14, 2014

With the new JOINED inheritence strategy, we knowhave both type safety and removed the sparseness.

The sparseness has been removed for sure (yay!) but we don't get automatic type safety or data integrity checks/constraints. We should review the code and add @Column(nullable = false) to various fields in entity objects, as @raprasad advised above when he opened this ticket. I'm sure there are other database constraints we could add as well but this would avoid a lot of NullPointerExceptions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants