Fix tokenizer over-consuming character after ->> operator#84

Open
RagingKore wants to merge 1 commit into TylerBrinks:main from RagingKore:bugfix/longarrow-json-extraction

Conversation

@RagingKore

Fixes #70.

Reproduction (from the issue)

var parser = new SqlQueryParser();
const string sql = """
                   select
                   category_seq as seq,
                   data.name as name,
                   meta->>'description' as description
                   from category
                   order by seq;
                   """;

var x = parser.Parse(sql, new DuckDbDialect());

Throws:

SqlParser.TokenizeException: Unterminated string literal. Expected ' after Line: 4, Col: 20
   at SqlParser.Tokenizer.TokenizeQuotedString(TokenizeQuotedStringSettings settings)
   at SqlParser.Tokenizer.TokenizeSingleQuotedString(Char quoteStyle, Boolean backslashEscape)
   ...

Root cause

TokenizeLongArrow called _state.Next() after confirming the second >, then handed off to ConsumeForBinOp which calls _state.Next() again. The character immediately following ->> was therefore silently consumed. When it was the opening ' of a string literal, subsequent tokenization walked past the closing quote and raised "Unterminated string literal".

This only surfaced when ->> was written without whitespace before the following token (e.g. meta->>'description'). Every existing test uses meta ->> 'description', where the eaten character was the harmless space — which is why it slipped through.
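The over-consumption described above can be modeled with a toy cursor. This is an illustrative sketch in Python, not the library's actual C# code; State, next, and tokenize_long_arrow are hypothetical names standing in for the tokenizer state, _state.Next(), and TokenizeLongArrow/ConsumeForBinOp:

```python
# Toy model of the bug: a cursor over the input where next() consumes one char.
class State:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def next(self):
        ch = self.peek()
        self.pos += 1
        return ch

def tokenize_long_arrow(state, buggy):
    # Entered with the cursor on the second '>' of '->>'.
    assert state.peek() == ">"
    if buggy:
        state.next()  # buggy version: advance after confirming the second '>'
    # ConsumeForBinOp's job: advance past the operator's final character.
    state.next()
    return state.peek()  # first character of the *next* token

# No space before the string literal: the opening quote is silently eaten.
print(tokenize_long_arrow(State(">'description'"), buggy=True))   # -> "d"
# With the fix, the cursor lands on the opening quote as expected.
print(tokenize_long_arrow(State(">'description'"), buggy=False))  # -> "'"
# With a space after '->>', the buggy extra advance only ate the space,
# which is why the existing tests never caught it.
print(tokenize_long_arrow(State("> 'description'"), buggy=True))  # -> "'"
```

The last case shows exactly how the whitespace in the existing tests masked the double advance.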

The fix drops the redundant _state.Next(), aligning ->> with the already-correct #>> pattern in TokenizeHash.

Tests

Added LongArrowJsonExtractionTests covering tokenization, the parser-level AST, and the exact multi-line SQL from the bug report.

The new tests are deliberately scoped to the dialects that genuinely support ->> as a JSON extraction operator: PostgreSQL, DuckDB, MySQL (since 5.7.13), SQLite (since 3.38), Redshift, and Generic. ->> is a PostgreSQL-originated extension and is not part of ANSI/ISO SQL. Dialects that use other JSON extraction mechanisms (Snowflake's col:path, BigQuery's JSON_VALUE, MS SQL's JSON_VALUE, Hive/Databricks' get_json_object, Oracle, ANSI) are intentionally excluded.
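As an aside, the no-space form can be sanity-checked outside this parser against one of the dialects listed above: SQLite has supported ->> since 3.38, and Python's stdlib sqlite3 exposes it when the bundled SQLite is new enough. A small sketch (extract_description and long_arrow_supported are illustrative helpers, not part of this PR):

```python
import sqlite3

def long_arrow_supported() -> bool:
    # The ->> operator landed in SQLite 3.38.
    return sqlite3.sqlite_version_info >= (3, 38)

def extract_description(doc: str):
    con = sqlite3.connect(":memory:")
    # No space between ->> and the string literal, as in the bug report.
    # SQLite's ->> returns the extracted value as SQL text.
    return con.execute("select ? ->>'description'", (doc,)).fetchone()[0]
```

On SQLite 3.38+, extract_description('{"description": "hi"}') returns 'hi', confirming that meta->>'description' with no intervening whitespace is valid SQL in at least one of the supported dialects.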



Development

Successfully merging this pull request may close these issues.

JSON extraction with DuckDbDialect throws an exception
