Skip to content

[BUG] dask-sql creates all tables as lower case. #481

Open
@brightsparc

Description

@brightsparc

The dask create_table method adds to the list of tables with lower case which means that any queries that are executed must use the lower case name.

What happened:

All tables are registered as lower case, so I am unable to include any queries with upper case table names - which requires me to convert all queries to lower().

What you expected to happen:

I would expect that the case would be preserved when adding to schemas, or at least like in the case of #177 there would be an original and lower case verison.

Minimal Complete Verifiable Example:

See the following code snippet that creates a dask dataframe, and then creates two tables, one with lower_case name, and another with UPPER_CASE.

from dask_sql import Context
c = Context()

ddf = dd.from_pandas(pd.DataFrame({"key": ["value"]}), npartitions=1)

# create table with lower case
c.create_table("lower_case", ddf)
print(c.schema["root"].tables) 
c.sql("SELECT * FROM lower_case")

# Create table with upper case
c.create_table("UPPER_CASE", ddf)
print(c.schema["root"].tables) 
c.sql("SELECT * FROM UPPER_CASE")

The second call to print the tables lists both as lower case:

{'lower_case': <dask_sql.datacontainer.DataContainer object at 0x2d8432970>, 'upper_case': <dask_sql.datacontainer.DataContainer object at 0x2d49d9670>}

And the second select fails to find the table with upper case

ParsingException: Can not parse the given SQL: From line 1, column 15 to line 1, column 24: Object 'UPPER_CASE' not found within 'root'; did you mean 'upper_case'?

Anything else we need to know?:

Perhaps there is a reason why everything was made lower() - but I can't seem to find it in the docs.

Environment:

  • dask-sql version: 2022.4.1
  • Python version: 3.9
  • Operating System: mac-osx arm64
  • Install method (conda, pip, source): conda

Activity

added
bugSomething isn't working
needs triageAwaiting triage by a dask-sql maintainer
on Apr 20, 2022
charlesbluca

charlesbluca commented on Apr 20, 2022

@charlesbluca
Collaborator

Perhaps there is a reason why everything was made lower()

Though it wasn't my decision, I would imagine this is to be in line with Postgres (the SQL engine which we primarily test against)? Spinning up a DB here, I see that we aren't able to create two distinct tables named new_table and NEW_TABLE, as both of these map to the all lowercase new_table within the schema.

Perhaps the better change here is to reconsider the default value of True we've set for sql.identifier.case_sensitive - it seemed like table names weren't case sensitive on the DB I spun up, though by default they are for Dask-SQL. For your specific case, the following works:

import dask.config

with dask.config.set({"sql.identifier.case_sensitive": False}):
    c.sql("SELECT * FROM UPPER_CASE")

But maybe we should consider making that the default behavior (cc @ayushdg in case you have thoughts around this)

brightsparc

brightsparc commented on Apr 20, 2022

@brightsparc
Author

Perhaps there is a reason why everything was made lower()

Though it wasn't my decision, I would imagine this is to be in line with Postgres (the SQL engine which we primarily test against)? Spinning up a DB here, I see that we aren't able to create two distinct tables named new_table and NEW_TABLE, as both of these map to the all lowercase new_table within the schema.

Perhaps the better change here is to reconsider the default value of True we've set for sql.identifier.case_sensitive - it seemed like table names weren't case sensitive on the DB I spun up, though by default they are for Dask-SQL. For your specific case, the following works:

import dask.config

with dask.config.set({"sql.identifier.case_sensitive": False}):
    c.sql("SELECT * FROM UPPER_CASE")

But maybe we should consider making that the default behavior (cc @ayushdg in case you have thoughts around this)

Thanks making dask case insensitive solves the issue from a query stand point.

FWIW, I think the calcite default dialect is Casing.TO_UPPER for most products, with Postgresql being the exception with Casing.TO_LOWER for unquoted casing. However any quoted identifier should be preserved, so in that case it would be useful to allow registering a table name as it was given.

charlesbluca

charlesbluca commented on Apr 20, 2022

@charlesbluca
Collaborator

However any quoted identifier should be preserved, so in that case it would be useful to allow registering a table name as it was given.

I generally agree with this sentiment if this is the expected behavior for Postgres, though I think more consideration needs to be made on if / how we should shift our case handling to this style, as this is a large change from our current handling (which essentially ignores quotes when parsing table names and uses sql.identifier.case_sensitive to decide whether casing should be respected) - I will get back to this issue and #482 after some internal sync around what the best path forward here is.

Also I will note that we are currently exploring an overhaul of our current SQL parsing machinery from Calcite to DataFusion (check out https://github.com/dask-contrib/dask-sql/tree/datafusion-sql-planner and some of the recently opened datafusion issues for some additional context). Thus, while we can discuss here what an ideal casing behavior might be for dask-sql, I wouldn't anticipate any solution being long term until we have fully fleshed out and merged the changes developing in that branch.

brightsparc

brightsparc commented on Apr 28, 2022

@brightsparc
Author

Thanks for this insight @charlesbluca. I did investigate using sql.identifier.case_sensitive and had some success in validating that this allowed me to work around this issue. However in the last day, I upgraded my dependencies and found that this was now not being respect. I wonder if you know of any issues that might interface with dask-sql with respect to case sensitivity. (I upgrade llvm and numba but not sure these are dependencies that would affect this project).

Thanks for sharing the work on DaskFusion, this does look interesting, however I will be sticking to using Calcite parser in my stack so hopefully this will remain consistent in turns of the support for parser syntax if you switch over to this solution. Not having a Java dependency and using the jpype1 interface.

ayushdg

ayushdg commented on May 2, 2022

@ayushdg
Collaborator

However in the last day, I upgraded my dependencies and found that this was now not being respect.

To followup does this mean that in a newer environment the initial example posted still has issues even when case_sensitive is set to False?

however I will be sticking to using Calcite parser in my stack so hopefully this will remain consistent in turns of the support for parser syntax if you switch over to this solution.

Are you able to elaborate a bit further on this? Is the expectation here for dask-sql to be able to execute sql queries that the calcite parser understands, or is it to be able to directly accept a calcite logical plan object?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingneeds triageAwaiting triage by a dask-sql maintainer

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Participants

      @brightsparc@ayushdg@charlesbluca

      Issue actions

        [BUG] dask-sql creates all tables as lower case. · Issue #481 · dask-contrib/dask-sql