dvobject Table Dangers: 800,000 rows with no type checking and wasted space #777

raprasad · 2014-07-28T14:09:12Z

Related: #733

Background

A single postgres table, dvobject, is being used to store 3 distinct entities.

Dataverses
Datasets
Datafiles

The table is not normalized and allows logical inconsistencies and anomalies.

Consider normalizing dvobject table and using a dvobject_types table.

note: After migration, the dvobject table will contain nearly 800,000 rows. (750k files, 54k studies, 750+ dataverses, etc)

Reason 1 for normalization: No business logic enforcement / data integrity at database level

Using one big table (dvobject) mixes distinct entity types--making if difficult for new team members, outside parters, and even dataverse team members to understand
The dvobject table leads to greater code complexity because postgres cannot guard against the following situations:
- creator - Possible to have a null value for a Dataverse or Dataset. Possible to have a creator for a Datafile
- owner - If keys mixed up: Datafile can own a Dataverse, Dataset can own a Dataverse, Datafile can own a Dataset. Only enforced at code level.
- release_user - Possible to have a release_user for a Datafile
- alias - May be null for a Dataverse
- contact_email - May be blank for a dataverse
- etc...
Data integrity enforcement normally done at the table level must now be done in the code
- Leads to a greater chance of bad data by backend developers, UI developers, direct database users, etc.
- Example of extra coding includes requiring a contact_email for a dataverse

Reason 2 for normalization: Wasted space on a large table

Mixing distinct entity types (Dataverse, Dataset, Datafile) leads to unused columns.
Post migration, the database will contain nearly 800,000 rows, including these unused fields:
- Datafile: 700,000 rows of unused fields for creator, affiliation alias, contact_email, facetroot, metadatablockroot, permissionroot, md5, etc...
- Dataset: 10s of 1000s of rows with unused fields for creator, affiliation alias, contact_email, facetroot, metadatablockroot, permissionroot, filesystemname, contenttype, etc...
- Dataverse: 700+ of rows with unused fields for authority, identifier, protocol, filesystemname, contenttype, md5, ingeststatus
- (This is not meant to be an exhaustive list, only a sample)

Suggested Fix: Path to normalization

DvObject makes permission checking convenient BUT doesn't enforce data integrity
- Evaluating permissions requires retrieving objects by ID (DvObject.id) and traversing objects (e.g. parent dataverse, etc)
Potential middle road: Normalize tables and check objects by id and dvobject_type (explained below)

Consider normalizing and using a dvobject_types table

Make separate tables for each dvobject:
- dataverse table
- dataset table
- datafile table
Make a dvobject_types table:
- id (pk)
- name
- table_name (name of normalized table)
- java_class_name (name of java class)
- java_import_name
- Constraint. Unique together: (java_class_name, table_name, java_package_name)
Example contents of a dvobject_types table:
- 1, "Dataverse", "dataverse", "Dataverse", "edu.harvard.iq.dataverse.Dataverse"
- 2, "Dataset", "dataset", "Dataset", "edu.harvard.iq.dataset.Dataset"
- 3, "Datafile", "datafile", "Datafile", "edu.harvard.iq.datafile.Datafile"

Use the dvobject_types table when storing object level roles/permissions

Wherever object role/permissions are stored for dvobjects, instead of using object id, use object id and dvobject_types.id
Current permissions: (user id, object id, role id)
- object id from DvObject table~~
Suggested: (user id, object id, dvobject_types.id, role id)
- (dvobject_types.id, object.id) - Allows retrieval of specific object
- Example object key: (1, 5)
  - 1 = Dataverse type, FK to dvobject_types (FK enforced by db)
  - 5 = object id, key in the dataverse table (not enforced by db)
- Example object key: (2, 3)
  - 1 = Dataset type, FK to dvobject_types (FK enforced by db)
  - 5 = object id, key in the dataset table (not enforced by db)

Questions/Discussion

How would this fit into the permissions system?
How much complexity is REDUCED by using a dvobject_types table? Short term? Long term?
How much complexity is ADDED by using a dvobject_types table? Short term? Long term?

pdurbin · 2014-08-06T13:07:53Z

data integrity at database level

See also #807.

raprasad · 2014-08-22T17:26:28Z

Marking this is as critical:

For all the reasons listed in orig. ticket (data integrity)
There exists a "not-too-complicated fix"
This will be hard to fix after data migration (don't wait for 4.1)

scolapasta · 2014-10-16T15:33:59Z

@michbarsinai can you comment? My recollection is that when you first did this, you had to have one dvobject table because of how eclipselink handles this (that it does not support the one table per entity model)? Am I recalling correctly or has this changed?

michbarsinai · 2014-10-16T18:01:31Z

I'll review this again after beta 8. Note that joins come at a cost as well. There may be some other solutions, e.g. "row objects".

raprasad · 2014-10-17T15:29:53Z

This is a not an insignificant change.

In the tables where permissions reference the dvobject (RoleAssignment, DataverseRole, etc), it means replacing one attribute with two:

current FK reference:
1. dvobject.id - FK to dvobject table
future FK reference:
1. dvobject_type.id - FK to the "type". Is this a dataverse, dataset, or datafile?
2. id - to a row in the dataverse, dataset or datafile table
  - This second 'id' is not a true foreign key
  - However the advantage is table integrity for all the attributes required by business rules that the current dvobject table allows to be null

Un-mixing the entities also makes things simpler:

For dataverses: why search 800,000 rows of the dvobject when only 1,000-2,000 of the rows are dataverses?
- This becomes even more complex with dataverses being heirarchical
- Adding/removing indices to this table can become expensive simply b/c almost every column can be null

It will be easier to make this change before 4.0 is live. After the data is migrated and the system is live, it will be much more difficult.

That said, at this late date, this change will push milestone dates forward.

scolapasta · 2014-11-14T21:19:02Z

With the new JOINED inheritence strategy, we knowhave both type safety and removed the sparseness.

kcondon · 2014-11-14T21:51:32Z

3 tables in place, objects added to tables on create. will open tickets if issues arise. closing

pdurbin · 2014-11-14T23:22:08Z

With the new JOINED inheritence strategy, we knowhave both type safety and removed the sparseness.

The sparseness has been removed for sure (yay!) but we don't get automatic type safety or data integrity checks/constraints. We should review the code and add @Column(nullable = false) to various fields in entity objects, as @raprasad advised above when he opened this ticket. I'm sure there are other database constraints we could add as well but this would avoid a lot of NullPointerExceptions.

raprasad added this to the In Review - Dataverse 4.0 milestone Jul 28, 2014

raprasad added Priority: Medium labels Jul 28, 2014

raprasad added Priority: Critical and removed Priority: Medium labels Aug 22, 2014

eaquigley modified the milestones: In Review - Dataverse 4.0, Long Term Issues-Dataverse 4.0 Aug 22, 2014

eaquigley assigned scolapasta Aug 22, 2014

raprasad mentioned this issue Sep 4, 2014

Edit Dataverse: Duplicate aliases are allowed, resulting in 500 errors when accessed. #621

Closed

eaquigley modified the milestones: Beta 7 (Permissions & Auth Branch) - Dataverse 4.0, Long Term Issues-Dataverse 4.0 Sep 10, 2014

raprasad mentioned this issue Sep 11, 2014

Tree performance: breadcrumbs, groups, etc #752

Closed

scolapasta modified the milestones: Beta 7 (Permissions & Auth Branch) - Dataverse 4.0, Beta 8 - Dataverse 4.0 Oct 10, 2014

scolapasta modified the milestones: Beta 9 - Dataverse 4.0, Beta 8 - Dataverse 4.0 Oct 16, 2014

scolapasta assigned michbarsinai and unassigned scolapasta Oct 16, 2014

This was referenced Oct 28, 2014

API: Duplicate dataverse aliases are allowed via api #1004

Closed

Investigate strategy=InheritanceType.TABLE_PER_CLASS and strategy=InheritanceType.JOINED for DvObject #733

Closed

scolapasta unassigned michbarsinai Nov 11, 2014

scolapasta self-assigned this Nov 11, 2014

scolapasta added Status: QA and removed Status: Design labels Nov 14, 2014

scolapasta removed their assignment Nov 14, 2014

kcondon self-assigned this Nov 14, 2014

kcondon closed this as completed Nov 14, 2014

adam3smith mentioned this issue Sep 10, 2021

Accessibility - Dataverse - Color Changes QualitativeDataRepository/dataverse#75

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dvobject Table Dangers: 800,000 rows with no type checking and wasted space #777

dvobject Table Dangers: 800,000 rows with no type checking and wasted space #777

raprasad commented Jul 28, 2014

pdurbin commented Aug 6, 2014

raprasad commented Aug 22, 2014

scolapasta commented Oct 16, 2014

michbarsinai commented Oct 16, 2014

raprasad commented Oct 17, 2014

scolapasta commented Nov 14, 2014

kcondon commented Nov 14, 2014

pdurbin commented Nov 14, 2014

dvobject Table Dangers: 800,000 rows with no type checking and wasted space #777

dvobject Table Dangers: 800,000 rows with no type checking and wasted space #777

Comments

raprasad commented Jul 28, 2014

Background

Consider normalizing dvobject table and using a dvobject_types table.

Reason 1 for normalization: No business logic enforcement / data integrity at database level

Reason 2 for normalization: Wasted space on a large table

Suggested Fix: Path to normalization

Consider normalizing and using a dvobject_types table

Use the dvobject_types table when storing object level roles/permissions

Questions/Discussion

pdurbin commented Aug 6, 2014

raprasad commented Aug 22, 2014

scolapasta commented Oct 16, 2014

michbarsinai commented Oct 16, 2014

raprasad commented Oct 17, 2014

scolapasta commented Nov 14, 2014

kcondon commented Nov 14, 2014

pdurbin commented Nov 14, 2014