Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] stage macro does not support columns with reserved keywords as names #23

Closed
balmasi opened this issue Feb 2, 2021 · 9 comments · Fixed by #89 or #91
Closed

[BUG] stage macro does not support columns with reserved keywords as names #23

balmasi opened this issue Feb 2, 2021 · 9 comments · Fixed by #89 or #91
Assignees
Labels
bug Something isn't working

Comments

@balmasi
Copy link

balmasi commented Feb 2, 2021

Describe the bug
Let's say your source contains a column called "order", if you try to run the stage macro with that as a hashed column (or even simply with include_source_columns = true) it will fail

dbtvault_bq.stage(
    include_source_columns=true,
    source_model={ 'outreach': 'stages'},
    hashed_columns={
      'SAT_HASH_DIFF': {
        'hashdiff': true,
        'columns': [
          'name'
          ,'order'
        ]
      }
    },
...
  )

Versions

dbt: 0.18.1
dbtvault: (BQ fork a.k.a 0.7.0), however looking at the code it should be present in the latest version as well

Expected behavior
I would expect all queried columns to have be qualified with the table name

Screenshots
image

@balmasi balmasi changed the title [BUG] stage macro does not support columns with reserved column names [BUG] stage macro does not support columns with reserved keywords as names Feb 2, 2021
@balmasi
Copy link
Author

balmasi commented Feb 2, 2021

Workaround for the time being:

for the hashed_columns section wrap the column name like so

"`order`"

If you need the column in your staging layer, you can set include_source_columns to false and use the derived_columns section to rename to a non-conflicting name (again, wrapped)

@DVAlexHiggs
Copy link
Member

DVAlexHiggs commented Feb 3, 2021

Thank you for reporting this issue.

This is not possible in snowflake (using reserved words as column names), even if quotes are used to force it.

i.e. SELECT "TEST_COL" AS "ORDER" will fail.

This seems to be a BQ only issue. I think this issue highlights a data quality issue in the dataset; Playing devil's advocate, data should not have reserved words for column names in the first place, and therefore it's not our responsibility to fix it.

Saying this, we cannot always help source data quality issues and we should have the option to fix the problem in the staging layer.

Is it adequate that you create a derived column which aliases the reserved word column, and this is used as the column in place of the reserved word column in the hashed column definition?

If not, we can look at a more permanent workaround/fix for when dbtvault supports BigQuery

@balmasi
Copy link
Author

balmasi commented Feb 3, 2021

SELECT "TEST_COL" AS "ORDER" fails on BQ as well, cause "ORDER" is a string literal whereas `ORDER` is a symbol. I wonder if SELECT "TEST_COL" AS `ORDER` would work

The source data can have whatever column names it wants. The question is, do you want dbtvault to natively support these special column names in the stage macro or not?

The bigger issue is that It's not just hashing, this "bug" forces you not to be able to use include_source_columns as well because the column names are unqualified.

@DVAlexHiggs
Copy link
Member

DVAlexHiggs commented Feb 3, 2021

SELECT "TEST_COL" AS `ORDER` does work in Snowflake. I've been able to re-produce the issue in our test bed now, and you're right about this being an issue. I still feel that this is a source data issue, and if users have column names which are reserved words then they have a bigger data quality issue.

We've never seen this before in the wild and I feel like it's quite a niche issue, however, it is nonetheless an issue with a relatively simple fix, so we will add it to the backlog as a quality of life improvement.

@DVAlexHiggs DVAlexHiggs added the bug Something isn't working label Feb 3, 2021
@balmasi
Copy link
Author

balmasi commented Feb 3, 2021

I still feel that this is a source data issue, and if users have column names which are reserved words then they have a bigger data quality issue.

I'm genuinely surprised to hear this cause there's no rule that says the source system uses a SQL datastore.

Perhaps I'm missing a layer in my DV implementation for "hard business rules" like this as Dan calls them. I was under the impression that in most cases I could just shove it through dbtvault without a previous layer 😆

@DVAlexHiggs
Copy link
Member

Apologies, I haven't made myself clear. dbtvault should handle this case, I'm simply suggesting that it might imply other issues with your data if you have reserved word column names in the source system, and it may be something developers would be inclined to fix prior to generating the raw vault (dbtvault or no dbtvault).

In saying this, developers should be able to fix it OR not need to fix it because dbtvault should understand that this is possible and handle it on their behalf, by qualifying column names in the stage.

@DVAlexHiggs
Copy link
Member

Currently working on this and #28 in this branch: https://github.com/Datavault-UK/dbtvault/tree/feat/escape-col-name

@DVAlexHiggs
Copy link
Member

We've got a fix for this coming in 0.7.10.

@DVAlexHiggs
Copy link
Member

Hi all. This is now supported in v0.8.0! Read the docs here

DVAlexHiggs added a commit that referenced this issue Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment