
[BUG] In PostgreSQL hashes are doubled in size #176

Closed
dbaibakov opened this issue Dec 26, 2022 · 10 comments
Labels
bug Something isn't working

@dbaibakov

Description
In PostgreSQL, when a hash is calculated, it is first converted to a hex string, upper-cased, and only then cast to binary. As a result, we get a 32-byte binary value for a 16-byte MD5 hash, and a 64-byte binary value for a 32-byte SHA256 hash.
Here is a bit of code from my compiled model, just for illustration:
    with cte as (
        select CAST(UPPER(MD5(CONCAT(
            COALESCE(NULLIF(UPPER(TRIM(CAST('iuagdfiusbdofusbdosibfor' AS VARCHAR))), ''), '^^')
        ))) AS BYTEA) AS hk
    )
    select hk, encode(hk, 'hex'), length(hk) from cte

It seems we should first convert the MD5-produced hex string into a binary string with DECODE(MD5(...), 'hex'), and then convert the result to BYTEA.
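For instance, a minimal sketch of the decoded route against stock PostgreSQL (illustrative only, not the macro code):

    -- decode() turns the 32-char hex text into the raw 16-byte digest;
    -- hex decoding is case-insensitive, so UPPER() does no harm here.
    select length(cast(upper(md5('iuagdfiusbdofusbdosibfor')) as bytea)) as cast_len,    -- 32 (current route)
           length(decode(upper(md5('iuagdfiusbdofusbdosibfor')), 'hex')) as decode_len;  -- 16 (decoded route)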
There is also additional behavior for the SHA functions implemented in the macros. The functions return binary strings as their results, but the macros convert these binary strings to hex strings, upper-case them, and then convert them back to binary. So we have two unneeded conversions and a doubling of size. It seems we shouldn't convert them to hex at all, and should just store the binaries we get from the SHA functions.
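A minimal illustration of that round trip, assuming PostgreSQL 11+ where sha256() is built in (again just a sketch, not the macro source):

    -- sha256() already returns bytea; the hex round trip doubles its width.
    select length(sha256('abc'::bytea))                                      as direct_len,     -- 32
           length(cast(upper(encode(sha256('abc'::bytea), 'hex')) as bytea)) as roundtrip_len;  -- 64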

In MS SQL the implementation is right: hashes are calculated and stored as normal binaries, 16 and 32 bytes in size.
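For comparison, the MS SQL behavior is easy to confirm with the standard T-SQL built-ins (a quick sketch, not dbtvault macro code):

    -- HASHBYTES returns VARBINARY directly, so no text round trip is involved.
    SELECT DATALENGTH(HASHBYTES('MD5', 'test'))      AS md5_len,    -- 16
           DATALENGTH(HASHBYTES('SHA2_256', 'test')) AS sha256_len; -- 32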

Environment

dbt version: 1.3.1
dbtvault version: 0.9.2
Database/Platform: docker image postgres:latest

To Reproduce
Make a simple model, run it, and check the hash column size.
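For example, after running the model, something like this shows the stored width (hypothetical table and column names; substitute your own):

    -- hub_customer / customer_hk are placeholder names. With the buggy macros
    -- this returns 32 (MD5) or 64 (SHA256) instead of the expected 16 or 32.
    select length(customer_hk) as hash_bytes
    from hub_customer
    limit 1;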

Expected behavior
16-byte MD5 hash, not 32-byte
32-byte SHA hash, not 64-byte

Screenshots
(screenshot: query output showing the doubled hash length)

@dbaibakov dbaibakov added the bug Something isn't working label Dec 26, 2022
@DVAlexHiggs
Member

DVAlexHiggs commented Dec 26, 2022

Hi, thanks for your report! This will be investigated in the new year.

This is odd, as all of our tests are still passing and they have not been modified, so they should still be producing the correct hashes (correct lengths, values, etc.).

They are converted to hex because our hashing approach is intended to be consistent across all platforms and to produce the same strings, whether encoded as BINARY or just as strings, depending on the platform.

My instinct here is that this is intended functionality due to the differences in hashing in Postgres; there was a reason we implemented it this way.

Will keep this updated!

@dbaibakov
Author

Yeah, sure, Merry Christmas and Happy New Year in advance ))

Just for your future investigation.
Snowflake (default) - calculated as binary, stored as binary
MS SQL - calculated as binary, stored as binary
Databricks - calculated as hex-string, stored as text
Bigquery - calculated as hex-string, stored as text
Postgres - calculated as hex-string, stored as binary

So it seems that there is no one-size-fits-all approach. There are two things to note:

  1. All the engines calculate values the same way they store them: binary as binary, text as text. Only Postgres calculates as text and then stores that text as binary.
  2. Data type selection - binary or text - depends on the engine's capabilities. Unfortunately I don't have enough expertise in Bigquery and Databricks, so I can't say why hashes are stored as text there. For Databricks it's probably a compatibility issue, as it works with files in a lakehouse architecture. But I think Postgres is much more like MS SQL than Databricks: they are both traditional RDBMSes with proprietary data storage formats, and it seems their abilities and features should be very close.

Anyway, thanks in advance for your investigation ))

@DVAlexHiggs
Member

DVAlexHiggs commented Dec 26, 2022

> Yeah, sure, Merry Christmas and Happy New Year in advance ))
>
> Just for your future investigation.
> Snowflake (default) - calculated as binary, stored as binary
> MS SQL - calculated as binary, stored as binary
> Databricks - calculated as hex-string, stored as text
> Bigquery - calculated as hex-string, stored as text
> Postgres - calculated as hex-string, stored as binary
>
> So it seems that there is no one-size-fits-all approach. There are two things to note:
>
>   1. All the engines calculate values the same way they store them: binary as binary, text as text. Only Postgres calculates as text and then stores that text as binary.
>   2. Data type selection - binary or text - depends on the engine's capabilities. Unfortunately I don't have enough expertise in Bigquery and Databricks, so I can't say why hashes are stored as text there. For Databricks it's probably a compatibility issue, as it works with files in a lakehouse architecture. But I think Postgres is much more like MS SQL than Databricks: they are both traditional RDBMSes with proprietary data storage formats, and it seems their abilities and features should be very close.
>
> Anyway, thanks in advance for your investigation ))

It's impossible to do exactly the same thing on every platform due to platform-specific differences; however, the aim is to have the same hash value/output regardless of platform.

You're right that Postgres is inconsistent, but as I say, I think it's due to the way Postgres treats hashes. This was contributed by the community, so I'm not 100% sure of all the details.

Thanks for the details you have provided!

@johnoscott may be able to provide some insight here

@johnoscott
Contributor

johnoscott commented Dec 26, 2022

I will investigate further. If I recall, there was a need to convert to uppercase after the text output of md5(), which was lowercase hexadecimal. I will double-check for any unnecessary conversions though.
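For reference, plain PostgreSQL shows this directly (illustrative):

    -- md5() returns lowercase hex text; the macros upper-case it so hash
    -- strings match across platforms.
    select md5('test');         -- '098f6bcd4621d373cade4e832627b4f6'
    select upper(md5('test'));  -- '098F6BCD4621D373CADE4E832627B4F6'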

@DVTimWilson
Contributor

DVTimWilson commented Mar 2, 2023

> I will investigate further. If I recall, there was a need to convert to uppercase after the text output of md5(), which was lowercase hexadecimal. I will double-check for any unnecessary conversions though.

Thank you both for all the details. I think you are correct with:

> It seems we should first convert the MD5-produced hex string into a binary string with DECODE(MD5(...), 'hex'), and then convert the result to BYTEA.

Looks like we have taken a wrong turn en route from hex string to bytea! I shall do some investigative testing.
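Presumably something along these lines, i.e. a hypothetical corrected form of the compiled query from the report (not the actual patch):

    with cte as (
        select DECODE(UPPER(MD5(CONCAT(
            COALESCE(NULLIF(UPPER(TRIM(CAST('iuagdfiusbdofusbdosibfor' AS VARCHAR))), ''), '^^')
        ))), 'hex') AS hk
    )
    select hk, encode(hk, 'hex'), length(hk) from cte;  -- length(hk) is now 16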

@isoctcolo

Do you need some testing here? I'm in the process of building a larger warehouse on PostgreSQL with dbtvault.

@johnoscott
Contributor

@isoctcolo Sure, that would help, I think.

@dbaibakov
Author

Dear colleagues, any news on this issue? Perhaps I could help with the coding? In that case I would need access to the dev repository.

@DVTimWilson
Contributor

> Dear colleagues, any news on this issue? Perhaps I could help with the coding? In that case I would need access to the dev repository.

Sorry for the slow response on this. Yes, we have a solution, and we will be testing it soon once we have internal resource.

@DVAlexHiggs
Member

Fixed in 0.9.7. On the surface it looks like a rather minor change, but it required a lot of testing and experimentation to get right 😊

Thank you all for your patience!
