Add Databricks and benchmark results for most SQL warehouse options #683
Conversation
rschu1ze left a comment:
I got a permission error when trying to push to this repository:
remote: Permission to prequel-co/ClickBench.git denied to rschu1ze.
fatal: unable to access 'https://github.com/prequel-co/ClickBench.git/': The requested URL returned error: 403
... therefore leaving some comments for now.
databricks/.env.example (outdated)

DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet
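For context, here is a minimal sketch of how such .env values are typically consumed with databricks-sql-connector. The exact environment variable names for the hostname, HTTP path, and token are assumptions, and the actual benchmark.py may be structured differently.

import os
from databricks import sql  # pip install databricks-sql-connector

# Variable names mirror the .env example above; the HOSTNAME, HTTP_PATH,
# and TOKEN names are assumed, not taken from the PR.
connection = sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
)
catalog = os.environ.get("DATABRICKS_CATALOG", "clickbench_catalog")
schema = os.environ.get("DATABRICKS_SCHEMA", "clickbench_schema")

with connection.cursor() as cursor:
    # Point the session at the configured catalog/schema before loading data.
    cursor.execute(f"USE CATALOG {catalog}")
    cursor.execute(f"USE SCHEMA {schema}")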
Some questions here: I set my Databricks hostname, the Databricks HTTP path, the instance type (2X-Small for the free test version), and the token. I didn't touch the CATALOG and SCHEMA variables.
When I ran benchmark.sh, I got this:
Connecting to Databricks; loading the data into clickbench_catalog.clickbench_schema
[WARN] pyarrow is not installed by default since databricks-sql-connector 4.0.0, any arrow specific api (e.g. fetchmany_arrow) and cloud fetch will be disabled. If you need these features, please run pip install pyarrow or pip install databricks-sql-connector[pyarrow] to install
Creating table and loading data from s3://some/path/hits.parquet...
Traceback (most recent call last):
File "/data/ClickBench/databricks/./benchmark.py", line 357, in <module>
load_data(run_metadata)
File "/data/ClickBench/databricks/./benchmark.py", line 289, in load_data
cursor.execute(load_query)
File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/telemetry/latency_logger.py", line 175, in wrapper
result = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/client.py", line 1260, in execute
self.active_result_set = self.backend.execute_command(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1058, in execute_command
execute_response, has_more_rows = self._handle_execute_response(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1265, in _handle_execute_response
final_operation_state = self._wait_until_command_done(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 957, in _wait_until_command_done
self._check_command_not_in_error_or_closed_state(op_handle, poll_resp)
File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 635, in _check_command_not_in_error_or_closed_st
ate
raise ServerOperationError(
databricks.sql.exc.ServerOperationError: [UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY] Unsupported data source type for direct query on files: parquet SQLSTATE: 0A000; lin
e 109 pos 13
Attempt to close session raised a local exception: sys.meta_path is None, Python is likely shutting down
(Line 289 ran the INSERT statement; the prior CREATE TABLE was successful.)
Do you have an idea what went wrong? Do I need to set any other variables?
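For reference, the failing statement is most likely a direct file query over Parquet. Continuing the sketch above, a hypothetical shape of the load step would be the following; the hits table name and the statement itself are assumptions for illustration, not the PR's actual code:

# Continues the connection sketch above; parquet_location comes from the .env file.
parquet_location = os.environ["DATABRICKS_PARQUET_LOCATION"]

# Hypothetical shape of the INSERT at benchmark.py line 289: a direct
# query on files (parquet.`<path>`) feeding an INSERT.
load_query = f"""
INSERT INTO {catalog}.{schema}.hits
SELECT * FROM parquet.`{parquet_location}`
"""

# Raises UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY when parquet_location is an
# https:// URL rather than an s3:// URI (see the reply below).
cursor.execute(load_query)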
I should also have mentioned that I set DATABRICKS_PARQUET_LOCATION to https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/hits_compatible/hits.parquet. Is this correct? If so, I think we can hard-code it as well.
Should work okay if you use the S3 URI (starting with "s3://"). I just updated the example to use that placeholder. Optionally, I could remove it as a .env variable if that public S3 location is going to stick around.
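Assuming the public bucket layout matches the https URL quoted above, the s3 form of that location would be:

DATABRICKS_PARQUET_LOCATION=s3://clickhouse-public-datasets/hits_compatible/hits.parquet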
Resolves: #24