6558-validate-files-on-publish.md

Datafiles validation when publishing datasets

When a user requests to publish a dataset, Dataverse will now attempt to validate the physical files in the dataset by recalculating their checksums and verifying them against the values stored in the database. The goal is to prevent corrupted files from ending up in published datasets. Most of the instances of actual damage to physical files that we've seen in the past happened while the datafiles were still in the Draft state. (Physical files become essentially read-only once published.) So the moment of publication is the logical place to catch any such issues.
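The check described above can be illustrated with a minimal sketch. It is not Dataverse's actual implementation: the function names, the flat storage directory, the `expected` mapping of file name to stored checksum, and the default of MD5 are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def compute_checksum(path: Path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Recalculate a file's checksum, streaming it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_invalid_files(expected: dict[str, str], storage_dir: Path,
                       algorithm: str = "md5") -> list[str]:
    """Return the names of files that are missing or whose checksum differs.

    `expected` is a hypothetical stand-in for the checksum values
    recorded in the database (file name -> hex digest).
    """
    failures = []
    for name, stored in expected.items():
        path = storage_dir / name
        # A missing physical file is treated as a validation failure too.
        if not path.is_file() or compute_checksum(path, algorithm) != stored.lower():
            failures.append(name)
    return failures
```

In this sketch, publication would proceed only when `find_invalid_files(...)` returns an empty list. A file that has a database entry but no physical file on disk is reported in the same way as a file with a checksum mismatch.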

If any files in the dataset fail the validation, the dataset is not published, and the user is notified that they need to contact their Dataverse support to address the issue before another attempt to publish can be made. See the "Troubleshooting" section of the Guide on how to fix such problems.

For datasets with large numbers of files, this validation will be performed asynchronously, using the same mechanism as the registration of the file-level global IDs. The cutoff number of files is configured by the same database setting. As with file PID registration, this validation process can be disabled on your system with the setting :FileValidationOnPublishEnabled. (A Dataverse admin may choose to disable it if, for example, they are already running an external auditing system to monitor the integrity of the files in their Dataverse, and would prefer the publishing process to take less time.) See the Config section of the Installation Guide for more info.
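The decision described above can be sketched roughly as follows. The names `ValidationMode`, `choose_validation_mode`, and `async_cutoff` are hypothetical, and whether a dataset sitting exactly at the cutoff is handled asynchronously is an assumption here, not something these notes specify.

```python
from enum import Enum


class ValidationMode(Enum):
    SKIPPED = "skipped"
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


def choose_validation_mode(file_count: int, validation_enabled: bool,
                           async_cutoff: int) -> ValidationMode:
    """Pick how file validation runs for a publish request.

    `validation_enabled` mirrors the :FileValidationOnPublishEnabled setting;
    `async_cutoff` stands in for the shared file-count cutoff setting.
    """
    if not validation_enabled:
        return ValidationMode.SKIPPED
    # Assumption: datasets at or above the cutoff are validated asynchronously.
    if file_count >= async_cutoff:
        return ValidationMode.ASYNCHRONOUS
    return ValidationMode.SYNCHRONOUS
```

For example, with a hypothetical cutoff of 10, a 3-file dataset would be validated inline during the publish request, while a 500-file dataset would be validated in the background.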

Please note that we are not aware of any bugs in the current versions of Dataverse that would result in damage to users' files. But you may have some legacy files in your archive that were affected by some issue in the past, or by something outside Dataverse, so we are adding this feature out of an abundance of caution. One problem we experienced in early versions of Dataverse was a scenario where a user attempted to delete a Draft file from an unpublished version, and the database transaction failed for whatever reason, but only after the physical file had already been deleted from the filesystem. This left a datafile entry in the dataset with the corresponding physical file missing. (Since the user wanted to delete the file in the first place, the fix for this case is simply to confirm that with the user and purge the datafile entity from the database.)