Conversation
@conormccarter conormccarter commented Nov 7, 2025

Resolves: #24

  1. Add Databricks benchmark script
  2. Add results for most Databricks SQL warehouse sizes

@rschu1ze: comment marked as resolved.

@conormccarter: comment marked as resolved.

@conormccarter conormccarter reopened this Nov 13, 2025
@rschu1ze (Member) left a comment:


I got a permission error when trying to push to this repository:

remote: Permission to prequel-co/ClickBench.git denied to rschu1ze.
fatal: unable to access 'https://github.com/prequel-co/ClickBench.git/': The requested URL returned error: 403

... therefore leaving some comments for now.

DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet
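For context, the full .env might look something like the sketch below. Only DATABRICKS_SCHEMA and DATABRICKS_PARQUET_LOCATION appear verbatim in this review; the remaining variable names are guesses based on the settings mentioned later in the thread (hostname, HTTP path, token, catalog), and all values are placeholders.

```shell
# Hypothetical .env sketch -- names other than DATABRICKS_SCHEMA and
# DATABRICKS_PARQUET_LOCATION are assumptions; all values are placeholders.
DATABRICKS_HOST=dbc-xxxxxxxx-xxxx.cloud.databricks.com
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/xxxxxxxxxxxxxxxx
DATABRICKS_TOKEN=dapiXXXXXXXXXXXXXXXXXXXXXXXX
DATABRICKS_CATALOG=clickbench_catalog
DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location (an s3:// URI, not an https:// URL)
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet
```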
@rschu1ze (Member) commented:

Some questions here: I set my Databricks hostname, the Databricks HTTP path, the instance type (2X-Small for the free test version), and the token. I didn't touch the CATALOG and SCHEMA variables.

When I ran benchmark.sh, I got this:

Connecting to Databricks; loading the data into clickbench_catalog.clickbench_schema
[WARN] pyarrow is not installed by default since databricks-sql-connector 4.0.0, any arrow specific api (e.g. fetchmany_arrow) and cloud fetch will be disabled. If you need these features, please run pip install pyarrow or pip install databricks-sql-connector[pyarrow] to install
Creating table and loading data from s3://some/path/hits.parquet...
Traceback (most recent call last):
  File "/data/ClickBench/databricks/./benchmark.py", line 357, in <module>
    load_data(run_metadata)
  File "/data/ClickBench/databricks/./benchmark.py", line 289, in load_data
    cursor.execute(load_query)
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/telemetry/latency_logger.py", line 175, in wrapper
    result = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/client.py", line 1260, in execute
    self.active_result_set = self.backend.execute_command(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1058, in execute_command
    execute_response, has_more_rows = self._handle_execute_response(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1265, in _handle_execute_response
    final_operation_state = self._wait_until_command_done(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 957, in _wait_until_command_done
    self._check_command_not_in_error_or_closed_state(op_handle, poll_resp)
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 635, in _check_command_not_in_error_or_closed_state
    raise ServerOperationError(
databricks.sql.exc.ServerOperationError: [UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY] Unsupported data source type for direct query on files: parquet SQLSTATE: 0A000; line 109 pos 13
Attempt to close session raised a local exception: sys.meta_path is None, Python is likely shutting down

(Line 289 ran the INSERT statement; the prior CREATE TABLE succeeded.)

Do you have an idea what went wrong? Do I need to set any other variables?

Oh, I should have mentioned as well that I set DATABRICKS_PARQUET_LOCATION to https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/hits_compatible/hits.parquet. Is this correct? If yes, I think we can hard-code it as well.

@conormccarter (Author) replied:

It should work if you use the S3 URI (starting with "s3://"). I just updated the example to use that placeholder. Optionally, I could remove it as a .env variable entirely if that public S3 location is going to stick around.
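The scheme issue above could be caught before the query ever reaches the warehouse. A minimal sketch, assuming a hypothetical helper (the table name, catalog/schema defaults, and query text are illustrative, not the actual benchmark.py code):

```python
# Hypothetical guard: reject non-s3:// locations up front, since passing an
# https:// URL to a direct file query on a Databricks SQL warehouse fails
# with UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY, as seen in the traceback.

def build_load_query(location: str,
                     catalog: str = "clickbench_catalog",
                     schema: str = "clickbench_schema") -> str:
    if not location.startswith("s3://"):
        raise ValueError(
            f"DATABRICKS_PARQUET_LOCATION must be an S3 URI (s3://...), "
            f"got: {location!r}"
        )
    # Databricks SQL can read Parquet directly with the parquet.`<path>` syntax.
    return (
        f"INSERT INTO {catalog}.{schema}.hits "
        f"SELECT * FROM parquet.`{location}`"
    )
```

With this in place, a misconfigured https:// URL produces an immediate, descriptive error locally instead of a server-side ServerOperationError mid-load.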


Development: Successfully merging this pull request may close issue "Help wanted: Databricks".