
md5 hash displayed to user is wrong #9501

Open
charmoniumQ opened this issue Apr 6, 2023 · 12 comments

@charmoniumQ

charmoniumQ commented Apr 6, 2023

What steps does it take to reproduce the issue?

See the dataset file page here.

  • This page says the "Original File MD5" begins with `9e9be...`. But this is not true.
  • The "Stata Binary (Original File Format)" file has an md5 hash beginning with `20ddc4...`.
  • The "Tab-Delimited" file has an md5 hash beginning with `1f75c2...`.
  • However, the "Tab-Delimited" file without the header row (`cat file.tab | tail --lines=+2 | md5sum`) has an md5 hash beginning with `9e9be...`.

This is a bug because it will lead users to believe that they downloaded a corrupted file.

There are two parts to this: the incorrect labeling, and the cut-off header row. The label should be "Tab-Delimited File MD5", not "Original File MD5". The header-row cutoff is more interesting: why does Dataverse send one file to the user but hash a transformed version of that file?
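A minimal Python sketch of the check, assuming a local copy of the Tab-Delimited download saved as `file.tab` (hypothetical name):

```python
import hashlib

# Hash the downloaded .tab file as-is, then again with the first
# (variable-name header) line removed, matching `tail --lines=+2`.
with open("file.tab", "rb") as f:
    data = f.read()

print(hashlib.md5(data).hexdigest())                     # begins with 1f75c2...
print(hashlib.md5(data.split(b"\n", 1)[1]).hexdigest())  # begins with 9e9be... (the displayed "Original File MD5")
```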

  • When does this issue occur?

Unknown

  • Which page(s) does it occur on?

The Stata files of this dataset that I checked by hand.

Which version of Dataverse are you using?

The one hosted at https://dataverse.harvard.edu/, 5.13 build 1244-79d6e57

@jggautier
Contributor

I wonder if this bug is also what's making it difficult to use pooch to download tabular files that Dataverse was able to ingest. When I use pooch to try to download an ingested tabular file, pooch uses a checksum to verify the file's integrity, and because of a checksum mixup that sounds like the one you described, @charmoniumQ, pooch refuses to complete the download.

Been meaning to report this somewhere, but haven't had time to dig into it.
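For context, the failing pattern looks roughly like this (a sketch with a placeholder file ID and truncated hash; `known_hash` is pooch's standard integrity check):

```python
import pooch

# Hypothetical example: FILE_ID and the hash below are placeholders. pooch
# verifies the downloaded bytes against known_hash and raises an error on
# mismatch — which is what happens when the checksum Dataverse reports was
# computed over a transformed copy rather than the bytes it actually serves.
path = pooch.retrieve(
    url="https://dataverse.harvard.edu/api/access/datafile/FILE_ID",
    known_hash="md5:9e9be...",  # truncated here; use the full value from the file page
)
```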

@charmoniumQ
Author

charmoniumQ commented Apr 6, 2023

Also note that https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/MPQGZP/Q1IS9E&version=1.0 seems to just be wrong. It reports a hash of 71988... and a size of 642.5 KB, but the only file Dataverse lets me download has a hash of 2965... and a size of 609.3 KiB (yes, I'm dividing by 1024, not 1000; neither seems to work) and no header. I don't see an obvious way to get the original hash or size.

https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/VSN1O0/QEH8LH&version=1.0 also seems wrong. It reports a size of 0 and a hash of 1e67.... When I download it, I get a size of 0 and a hash of d41d8..., which is well known as the hash of the empty string. The 1e67... hash is actually md5_hash("\n\n").

The exact transformation causing the hash to differ varies between these examples, but I think in all cases some transformation (e.g., removing the header row) is applied when computing the hash and not when serving the download, or vice versa.
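That last claim is easy to check locally (`md5_hash` above is just shorthand for this):

```python
import hashlib

# Reportedly prints a hash starting with 1e67... — the value shown for the empty file
print(hashlib.md5(b"\n\n").hexdigest())
```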

@jggautier
Contributor

There's a related GitHub issue at IQSS/dataverse.harvard.edu#37, in a repo where we track "production"-level issues specific to the Harvard Dataverse Repository, as opposed to issues with the current version of the Dataverse software (unless other repositories running Dataverse are affected, too).

I'm just mentioning this since I think this GitHub issue and your GitHub issue at IQSS/dataverse.harvard.edu#220 might wind up being resolved with both changes to the software (like how checksums are displayed and organized in metadata exports) and changes to the Harvard Dataverse Repository (like an audit of that repository's files).

@charmoniumQ
Author

Thanks, I didn't know about the issue tracker for issues specific to Harvard's Dataverse instance. I'll post future discrepancies to IQSS/dataverse.harvard.edu#37.

@jggautier
Contributor

Definitely easier to find this main Dataverse repo :)

@charmoniumQ
Author

@atrisovic

@pdurbin
Member

pdurbin commented Apr 18, 2023

> Also note that https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/MPQGZP/Q1IS9E&version=1.0 seems to just be wrong. It reports a hash of 71988... and a size of 642.5 KB, but the only file Dataverse lets me download has a hash of 2965... and a size of 609.3 KiB (yes, I'm dividing by 1024, not 1000; neither seems to work) and no header. I don't see an obvious way to get the original hash or size.

Huh. I'm seeing the same thing.

After downloading the CSV, I'm getting 2965cd060e16781a2f6fafa5a54a6c59 as the MD5 checksum...

```
$ md5 ph2_endline_attitudes_survey.csv
MD5 (ph2_endline_attitudes_survey.csv) = 2965cd060e16781a2f6fafa5a54a6c59
```

... but Dataverse is asserting that it's 71988070448ef1f28b8538ebee9919bf

[Screenshot, Apr 18 2023: the file page asserting the 71988... MD5]

@charmoniumQ thanks for the heads up about this! Very strange.

@landreev
Contributor

@charmoniumQ

> Also note that https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/MPQGZP/Q1IS9E&version=1.0 seems to just be wrong. It reports a hash of 71988... and a size of 642.5 KB, but the only file Dataverse lets me download has a hash of 2965... and a size of 609.3 KiB (yes, I'm dividing by 1024, not 1000; neither seems to work) and no header. I don't see an obvious way to get the original hash or size.

Apologies for the delay with this, I missed this issue earlier (@pdurbin brought your comment to my attention this morning). Please try downloading the above again; you should get the right file now.
The underlying problem was a "partial ingest" failure (note that you were getting a tab-delimited file, not the actual CSV version). There was a bug in Dataverse at some point that occasionally resulted in this: tabular data ingest would fail, but the application would still end up saving the converted tab-delimited file in place of the original. (Luckily, the original would also be saved with the .orig extension, so the fix is simply to move the .orig back.)
What's alarming (and super embarrassing) is that for this file it was never fixed; it had in fact been sitting in this state since 2019. I was honestly under the impression that after we found and fixed the bug, we ran an audit that found and fixed all the files that had been affected. It really looks like I need to do that again.

(I haven't even looked at the other things mentioned in the issue yet, will take a look and reply asap)

@charmoniumQ
Author

Thanks @landreev !

@landreev
Contributor

landreev commented Apr 18, 2023

(I'm debating whether I should move this issue to the local support repo as well; but I'll deal with that later)

Addressing the original report at the top:

> See the dataset file page here.
>
> * This page says the "Original File MD5" begins with `9e9be...`. But this is not true.
> * The "Stata Binary (Original File Format)" file has an md5 hash beginning with `20ddc4...`.
> * The "Tab-Delimited" file has an md5 hash beginning with `1f75c2...`.
> * However, the "Tab-Delimited" file without the header row (`cat file.tab | tail --lines=+2 | md5sum`) has an md5 hash of `9e9be...`.

Good catch, thank you. Plus, extra credit for figuring out that the displayed md5 was in fact that of the raw tabular data file without the variable name header.
I have corrected the md5sum entry in the database ("corrected" as in, the file page is now showing the correct md5 of the original file - "Original File MD5: 20ddc4ec170ffdadd8a91d5e2db0066e"; I'll address your other question, whether this is what we want to display, separately).

Unfortunately, I'm at a bit of a loss as to

  1. how this happened in the first place. Unlike that "partial ingest" problem, I have not seen this before. The tab-delimited version of the file is in fact stored on disk without the variable header (for, well, reasons), which is added in real time when the file is downloaded. So calculating the checksum from that stored copy would produce this md5... but I can't think of how or why it would ever be recalculated like that after ingest; and
  2. how it survived numerous audits without being detected.

As I said earlier, it really looks like we need to review our system of file integrity audits and re-validate everything.
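For what it's worth, a re-validation pass along these lines could be scripted against the public APIs (a rough sketch, not actual audit tooling; it assumes MD5 checksums and uses `?format=original` so ingested tabular files are compared against their uploaded originals):

```python
import hashlib
import requests

BASE = "https://dataverse.harvard.edu"  # or any other installation

def audit_dataset(persistent_id: str) -> None:
    """Recompute each file's MD5 from the bytes actually served and compare
    it with the checksum recorded in the dataset metadata. Installations
    configured for another algorithm (e.g. SHA-1) would need the matching
    hashlib call."""
    files = requests.get(
        f"{BASE}/api/datasets/:persistentId/versions/:latest-published/files",
        params={"persistentId": persistent_id},
    ).json()["data"]
    for entry in files:
        df = entry["dataFile"]
        recorded = df["checksum"]["value"]
        # For ingested tabular data, ?format=original returns the uploaded
        # file — the thing the recorded checksum is supposed to describe.
        served = requests.get(
            f"{BASE}/api/access/datafile/{df['id']}", params={"format": "original"}
        ).content
        actual = hashlib.md5(served).hexdigest()
        flag = "OK" if actual == recorded else "MISMATCH"
        print(f"{flag}  {df['filename']}  recorded={recorded}  actual={actual}")

audit_dataset("doi:10.7910/DVN/MPQGZP")  # the dataset from the comments above
```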

@beepsoft
Contributor

beepsoft commented Nov 8, 2023

Hi @landreev! Is there any progress on this issue? We've just been hit by this: we have some RO-Crate-related functions that depend on the md5 of the files, and if that changes asynchronously due to ingest, it causes inconsistencies for us.

By the way, is there a way to turn off ingestion of tabular files?

@landreev
Contributor

landreev commented Nov 13, 2023

@beepsoft
Hi,
Please note that there are several different things discussed in this issue. The one that I replied to and addressed back in April was an actual data corruption of a specific file on our production server.
What this issue was originally opened for is that the way Dataverse handles and stores ingested tabular data files is confusing, especially in how the md5 signatures are shown to the user. This is not a bug per se, but very ancient legacy: it was implemented this way early on, back when tabular data ingest was a central feature of the application, for specific reasons that time may have forgotten.

If this is causing problems with your RO-Crate use case, yes, it is very easy to disable ingest completely: just set `:TabularIngestSizeLimit` to zero, and it will stop.
Also, please note that there is the "uningest" API, which can undo ingest for any files that have already been ingested.
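Roughly, both calls look like this (a sketch with placeholder host, token, and file id; see the admin settings and files API docs for details):

```python
import requests

BASE = "http://localhost:8080"                    # placeholder installation URL
HEADERS = {"X-Dataverse-key": "SUPERUSER_TOKEN"}  # placeholder superuser API token

# Disable tabular ingest entirely by setting the size limit to zero
# (the admin settings endpoint is normally reachable only from localhost):
requests.put(f"{BASE}/api/admin/settings/:TabularIngestSizeLimit", data="0")

# Undo ingest for an already-ingested file (FILE_ID is a placeholder database id):
requests.post(f"{BASE}/api/files/FILE_ID/uningest", headers=HEADERS)
```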
