
[BUG] In PostgreSQL hashes are doubled in size #176

Closed
dbaibakov opened this issue Dec 26, 2022 · 10 comments
Labels
bug Something isn't working

@dbaibakov

Description
In PostgreSQL, when a hash is calculated, it is first converted to a hex string, upper-cased, and only then cast to binary. As a result, we get a 32-byte binary value for a 16-byte MD5 hash, and a 64-byte binary value for a 32-byte SHA256 hash.
Here is a bit of code from my compiled model, just for illustration:
    with cte as (
        select CAST(UPPER(MD5(CONCAT(
            COALESCE(NULLIF(UPPER(TRIM(CAST('iuagdfiusbdofusbdosibfor' AS VARCHAR))), ''), '^^')
        ))) AS BYTEA) AS hk
    )
    select hk, encode(hk, 'hex'), length(hk) from cte

It seems we should first convert the MD5-produced hex string into a binary string with DECODE(MD5(...), 'hex'), and then convert the result to BYTEA.
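For instance, a minimal sketch of the decoded route against stock PostgreSQL (illustrative only, not the macro code):

    -- decode() turns the 32-char hex text into the raw 16-byte digest;
    -- hex decoding is case-insensitive, so UPPER() does no harm here.
    select length(cast(upper(md5('iuagdfiusbdofusbdosibfor')) as bytea)) as cast_len,    -- 32 (current route)
           length(decode(upper(md5('iuagdfiusbdofusbdosibfor')), 'hex')) as decode_len;  -- 16 (decoded route)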
There is also additional behavior for the SHA functions implemented in the macros. The functions return binary strings as their results, but the macros convert these binary strings to hex strings, upper-case them, and then convert them back to binary. So we have two unneeded conversions and a doubling of size. It seems we shouldn't convert them to hex at all, and should just store the binaries we get from the SHA functions.
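A minimal illustration of that round trip, assuming PostgreSQL 11+ where sha256() is built in (again just a sketch, not the macro source):

    -- sha256() already returns bytea; the hex round trip doubles its width.
    select length(sha256('abc'::bytea))                                      as direct_len,     -- 32
           length(cast(upper(encode(sha256('abc'::bytea), 'hex')) as bytea)) as roundtrip_len;  -- 64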

In MS SQL the implementation is right: hashes are calculated and stored as normal binaries, 16 and 32 bytes in size.
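For comparison, the MS SQL behavior is easy to confirm with the standard T-SQL built-ins (a quick sketch, not dbtvault macro code):

    -- HASHBYTES returns VARBINARY directly, so no text round trip is involved.
    SELECT DATALENGTH(HASHBYTES('MD5', 'test'))      AS md5_len,    -- 16
           DATALENGTH(HASHBYTES('SHA2_256', 'test')) AS sha256_len; -- 32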

Environment

dbt version: 1.3.1
dbtvault version: 0.9.2
Database/Platform: docker image postgres:latest

To Reproduce
Make a simple model, run it, and check the hash column size.
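For example, after running the model, something like this shows the stored width (hypothetical table and column names; substitute your own):

    -- hub_customer / customer_hk are placeholder names. With the buggy macros
    -- this returns 32 (MD5) or 64 (SHA256) instead of the expected 16 or 32.
    select length(customer_hk) as hash_bytes
    from hub_customer
    limit 1;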

Expected behavior
16-byte MD5 hash, not 32-byte
32-byte SHA hash, not 64-byte

Screenshots
(screenshot: query output showing the doubled hash length)

@dbaibakov dbaibakov added the bug Something isn't working label Dec 26, 2022
@DVAlexHiggs
Member

DVAlexHiggs commented Dec 26, 2022

Hi, thanks for your report! This will be investigated in the new year.

This is odd, as all of our tests are still passing and they have not been modified, so they should still be producing the correct hashes (correct lengths, values, etc.).

They are converted to hex because our hashing approach is intended to be consistent across all platforms and to produce the same strings, whether encoded as BINARY or just as strings, depending on the platform.

My instinct here is that this is intended functionality due to the differences in hashing in Postgres; there was a reason we implemented it this way.

Will keep this updated!

@dbaibakov
Author

Yeah, sure, Merry Christmas and Happy New Year in advance ))

Just for your future investigation.
Snowflake (default) - calculated as binary, stored as binary
MS SQL - calculated as binary, stored as binary
Databricks - calculated as hex-string, stored as text
Bigquery - calculated as hex-string, stored as text
Postgres - calculated as hex-string, stored as binary

So it seems that there is no one-size-fits-all approach. There are two things to note:

  1. All the engines calculate values the same way they store them: binary as binary, text as text. Only Postgres calculates as text and then stores that text as binary.
  2. Data type selection - binary or text - depends on the engine's capabilities. Unfortunately I don't have enough expertise in Bigquery and Databricks, so I can't say why hashes are stored as text there. For Databricks it's probably a compatibility issue, as it works with files in a lakehouse architecture. But I think Postgres is much more like MS SQL than Databricks: they are both traditional RDBMSes with proprietary data storage formats, and it seems their abilities and features should be very close.

Anyway, thanks in advance for your investigation ))

@DVAlexHiggs
Member

DVAlexHiggs commented Dec 26, 2022

> Yeah, sure, Merry Christmas and Happy New Year in advance ))
>
> Just for your future investigation.
> Snowflake (default) - calculated as binary, stored as binary
> MS SQL - calculated as binary, stored as binary
> Databricks - calculated as hex-string, stored as text
> Bigquery - calculated as hex-string, stored as text
> Postgres - calculated as hex-string, stored as binary
>
> So it seems that there is no one-size-fits-all approach. There are two things to note:
>
>   1. All the engines calculate values the same way they store them: binary as binary, text as text. Only Postgres calculates as text and then stores that text as binary.
>   2. Data type selection - binary or text - depends on the engine's capabilities. Unfortunately I don't have enough expertise in Bigquery and Databricks, so I can't say why hashes are stored as text there. For Databricks it's probably a compatibility issue, as it works with files in a lakehouse architecture. But I think Postgres is much more like MS SQL than Databricks: they are both traditional RDBMSes with proprietary data storage formats, and it seems their abilities and features should be very close.
>
> Anyway, thanks in advance for your investigation ))

It's impossible to do exactly the same thing on every platform due to platform-specific differences; however, the aim is to have the same hash value/output regardless of platform.

You're right that Postgres is inconsistent, but as I say, I think it's due to the way Postgres treats hashes. This was contributed by the community, so I'm not 100% sure of all the details.

Thanks for the details you have provided!

@johnoscott may be able to provide some insight here

@johnoscott
Contributor

johnoscott commented Dec 26, 2022

I will investigate further. If I recall, there was a need to convert to uppercase after the text output of md5(), which was lowercase hexadecimal. I will double-check for any unnecessary conversions though.
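For reference, plain PostgreSQL shows this directly (illustrative):

    -- md5() returns lowercase hex text; the macros upper-case it so hash
    -- strings match across platforms.
    select md5('test');         -- '098f6bcd4621d373cade4e832627b4f6'
    select upper(md5('test'));  -- '098F6BCD4621D373CADE4E832627B4F6'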

@DVTimWilson
Contributor

DVTimWilson commented Mar 2, 2023

> I will investigate further. If I recall, there was a need to convert to uppercase after the text output of md5(), which was lowercase hexadecimal. I will double-check for any unnecessary conversions though.

Thank you both for all the details. I think you are correct with:

> It seems we should first convert the MD5-produced hex string into a binary string with DECODE(MD5(...), 'hex'), and then convert the result to BYTEA.

Looks like we have taken a wrong turn en route from hex string to bytea! I shall do some investigative testing.
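Presumably something along these lines, i.e. a hypothetical corrected form of the compiled query from the report (not the actual patch):

    with cte as (
        select DECODE(UPPER(MD5(CONCAT(
            COALESCE(NULLIF(UPPER(TRIM(CAST('iuagdfiusbdofusbdosibfor' AS VARCHAR))), ''), '^^')
        ))), 'hex') AS hk
    )
    select hk, encode(hk, 'hex'), length(hk) from cte;  -- length(hk) is now 16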

@isoctcolo

Do you need some testing here? I'm in the process of building a larger warehouse on PostgreSQL with dbtvault.

@johnoscott
Contributor

@isoctcolo Sure, that would help, I think.

@dbaibakov
Author

Dear colleagues, any news on this issue? Perhaps I could help with the coding? In that case I would need access to the dev repository.

@DVTimWilson
Contributor

> Dear colleagues, any news on this issue? Perhaps I could help with the coding? In that case I would need access to the dev repository.

Sorry for the slow response on this. Yes, we have a solution, and we will be testing it soon once we have internal resource.

@DVAlexHiggs
Member

Fixed in 0.9.7. On the surface it looks like a rather minor change, but it required a lot of testing and experimentation to get right 😊

Thank you all for your patience!
