
[BUG] Using 'vault_insert_by_period' materialisation with period='MILLISECOND' results in Unhandled error: Range too big. #175

Closed · anouar-zh opened this issue on Dec 23, 2022 · 5 comments
Labels: bug (Something isn't working)

anouar-zh commented on Dec 23, 2022

Describe the bug
I want to load data that arrives with millisecond precision (example load_dts: 2022-11-09 05:01:23.351).
When I use the vault_insert_by_period materialisation with period='MILLISECOND', it results in Unhandled error: Range too big.

14:01:48 Unhandled error while executing model.datavault.hs_store_exploitation Range too big. The sandbox blocks ranges larger than MAX_RANGE (100000).

Environment

dbt version: 1.3.1
dbtvault version: 0.9.1
Database/Platform: Databricks

To Reproduce
Steps to reproduce the behavior:

  1. Use a dataset with data received in milliseconds
  2. Use the vault_insert_by_period materialisation with period='MILLISECOND'
  3. Execute dbt run
  4. See error

Expected behavior
I expect the data to be loaded just as it would have been with period='DAY'.

Screenshots
[screenshot omitted]

Log files
08:04:53.174466 [error] [MainThread]: Range too big. The sandbox blocks ranges larger than MAX_RANGE (100000).
08:04:52.211738 [debug] [Thread-4]: Databricks adapter: NotImplemented: rollback

Additional context
Databricks' datediff(unit, start, end) takes the unit of measure as its first parameter, whereas the macro at https://github.com/Datavault-UK/dbtvault/blob/4dc4cab0daa9ca17d703a42d50ad651fd2ee707e/macros/materialisations/period_mat_helpers/get_period_boundaries.sql passes it as the third parameter.
[screenshot omitted]
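A minimal illustration of the Databricks signature (the timestamp values here are made up for the example; the three-argument form of datediff is the one Databricks SQL documents):

```sql
-- Databricks SQL: the unit comes FIRST in the three-argument form
SELECT datediff(
    MILLISECOND,
    TIMESTAMP'2022-11-09 05:01:23.000',
    TIMESTAMP'2022-11-09 05:01:23.351'
);  -- returns 351
```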

anouar-zh added the bug label on Dec 23, 2022
DVAlexHiggs (Member) commented on Dec 23, 2022

Hi, thanks for your report!

It's odd that datediff has a different parameter order in Databricks; that's perhaps something that needs to be reported to them. We'll look into it (we can fix it on our end by using named parameters, though).

Generally, MILLISECOND is ill-advised as an iteration period for most use cases, as there will potentially be many thousands of iterations for dbt to perform, and performance will suffer accordingly. How many iterations would dbt have to do if this were to work?

Either way, you're right that this shouldn't produce an error like this, so we'd like to have a solution for it.
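One portable way to sidestep the per-platform argument order (our illustration, not necessarily the fix the maintainers chose; datepart support still varies by adapter) is dbt's cross-database datediff macro, which renders an appropriate expression for the active adapter:

```jinja
{# Sketch only: dbt-core 1.2+ ships adapter-dispatched cross-database macros.
   "start_timestamp" / "stop_timestamp" are placeholder column names. #}
select {{ dbt.datediff("start_timestamp", "stop_timestamp", "millisecond") }} as num_periods
```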

anouar-zh (Author) commented

In terms of iterations using milliseconds, I think a maximum of 20-50. Is this issue a candidate for a minor fix/release?

DVAlexHiggs (Member) commented

> In terms of iterations using milliseconds, I think a maximum of 20-50. Is this issue a candidate for a minor fix/release?

Yes, of course. It will be worked on after the holiday. Thank you for your report!

DVAlexHiggs (Member) commented on Jan 25, 2023

Hi. We've done some investigation into this, and essentially it's due to the Jinja range() function having a maximum of 100,000 iterations. Though this is a technical limit, it does highlight the issue I previously described: millisecond periods are ill-advised.
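To put rough numbers on that (a sketch only; the variable names are illustrative, not dbtvault's):

```jinja
{# Jinja's sandbox rejects any range() longer than MAX_RANGE (100,000),
   and the materialisation runs one range() step per period. #}
{% set ms_per_day = 24 * 60 * 60 * 1000 %}  {# 86,400,000 #}
{# So even one day of data at period='MILLISECOND' needs 86,400,000
   iterations: range(0, ms_per_day) raises "Range too big". #}
```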

Because of this, we won't be fixing it; instead we will provide some error handling and a friendly message, such as:

"Max iterations is 100,000. Consider using a different time period (e.g. day). vault_insert_by materialisations are not intended for this purpose, please see -link to docs-"

We do need more guidance on this in the docs, but essentially the vault_insert_by materialisations are not for loading large amounts of data, such as a first-time (base) load, and should not be used for highly granular millisecond-difference data. There are other approaches for loading, including those in the DV 2.0 standards.

Our aim for a while now has been to implement water-level macros for loading correctly, and to release more guidance and documentation on the matter, as we understand that how to load is a pain point for many dbtvault users.

DVAlexHiggs added and then removed the wontfix (This will not be worked on) label on Jan 25, 2023
DVAlexHiggs (Member) commented

Hi! We've added a warning and some better handling around this issue in 0.9.5.

In a future update (likely 0.9.6) we will raise an exception when an attempt is made to use milliseconds.

This report prompted us to revisit our materialisation logic, and we have some refinements and bug fixes coming up. Thank you!
