-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] In PostgreSQL hashes are doubled in size #176
Comments
Hi thanks for your report! This will be investigated in the new year. This is odd as all of our tests are still passing and they have not been modified, so they should still be producing the correct hashes (correct lengths and values etc.) They are converted to Hex because our hashing approach is intended to be consistent across all platforms and produce the same strings whether encoded in BINARY or just as strings, depending on the platform. My instinct here is that this is intended functionality due to the differences in hashing in postgres; there was a reason we implemented it this way. Will keep this updated! |
Yeh, sure, Merry Cristmas and upcoming Happy New Year )) Just for your future investigation. So it seems that there is no one-size-fits-all approach. There are 2 things to note:
Anyway thanks in advance for your investigation)) |
It's impossible to do the same in every platform due to platform specific differences, however the aim is to have the same hash value/output regardless of platform. You're right that postgres is inconsistent but as I say I think it's due to the way postgres treats hashes. This was contributed by the community so I'm not 100% sure of all the details. Thanks for the details you have provided! @johnoscott may be able to provide some insight here |
I will investigate further. If I recall, there was a need to convert to uppercase after the text output of md5() which was lowercase hexadecimal. I will double check for any unnecessary conversions though. |
Thank you both for all the details. I think you are correct with:
Looks like we have taken a wrong turn en route from hex string to bytea! I shall do some investigative testing. |
Do you need some testing here? - I'm in the process of building a larger warehouse on postgresql with dbtvault. |
@isoctcolo sure, that would help I think. |
Dear clleagues, any news on this issue? Probably I could help with coding? In this case I need access to dev repository. |
Sorry for the slow response on this, yes we have a solution and we will be testing it soon once we have internal resource. |
Fixed in 0.9.7. At its surface it looks like a rather minor change, but this required a lot of testing and experimentation to get it right 😊 Thank you all for your patience! |
Description
In PostgreSQL when hash is calculated it is first converted to hex-string, upper-cased, and only then binary-encoded. As a result we have 32 bytes binary for storing 16-byte MD5, and 64-byte binary for storing SHA256.
Here is a bit of code from my compiled model, just for illustration:
with cte as ( select CAST(UPPER(MD5(CONCAT( COALESCE(NULLIF(UPPER(TRIM(CAST('iuagdfiusbdofusbdosibfor' AS VARCHAR))), ''), '^^') ))) AS BYTEA) AS hk ) select hk , encode(hk, 'hex') , length(hk) from cte
Seems we should first convert MD5-produced hex string into binary string with DECODE(MD5(...), 'hex'), and then convert the resutlt to BYTEA.
There is also an additional behavior for SHA functions implemented in macros. The functions return binary strings as their results, but macros convert these binary strings to hex-strings, upper-case them, and then convert them to binary. So we have 2 unneeded conversions and doubling of size. Seems we shouldn't convert them to hex, just store the binaries we got from SHA functions.
In MS SQL implementation is right, hashes are calculated and stored as normal binaries, 16 and 32 bytes in size.
Environment
dbt version: 1.3.1
dbtvault version: 0.9.2
Database/Platform: docker image postgres:latest
To Reproduce
Make a simple model, run it, check hash column size.
Expected behavior
16-byte MD5 hash, not 32-byte
32-byte SHA hash, not 64-byte
Screenshots
The text was updated successfully, but these errors were encountered: