Fix tokenizer over-consuming character after ->> operator#84

Open
RagingKore wants to merge 1 commit into TylerBrinks:main from RagingKore:bugfix/longarrow-json-extraction

Conversation

@RagingKore

Fixes #70.

Reproduction (from the issue)

var parser = new SqlQueryParser();
const string sql = """
                   select
                   category_seq as seq,
                   data.name as name,
                   meta->>'description' as description
                   from category
                   order by seq;
                   """;

var x = parser.Parse(sql, new DuckDbDialect());

Throws:

SqlParser.TokenizeException: Unterminated string literal. Expected ' after Line: 4, Col: 20
   at SqlParser.Tokenizer.TokenizeQuotedString(TokenizeQuotedStringSettings settings)
   at SqlParser.Tokenizer.TokenizeSingleQuotedString(Char quoteStyle, Boolean backslashEscape)
   ...

Root cause

TokenizeLongArrow called _state.Next() after confirming the second >, then handed off to ConsumeForBinOp which calls _state.Next() again. The character immediately following ->> was therefore silently consumed. When it was the opening ' of a string literal, subsequent tokenization walked past the closing quote and raised "Unterminated string literal".

This only surfaced when ->> was written without whitespace before the following token (e.g. meta->>'description'). Every existing test uses meta ->> 'description', where the eaten character was the harmless space — which is why it slipped through.
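The over-consumption described above can be modeled with a toy cursor. This is an illustrative sketch in Python, not the library's actual C# code; State, next, and tokenize_long_arrow are hypothetical names standing in for the tokenizer state, _state.Next(), and TokenizeLongArrow/ConsumeForBinOp:

```python
# Toy model of the bug: a cursor over the input where next() consumes one char.
class State:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def next(self):
        ch = self.peek()
        self.pos += 1
        return ch

def tokenize_long_arrow(state, buggy):
    # Entered with the cursor on the second '>' of '->>'.
    assert state.peek() == ">"
    if buggy:
        state.next()  # buggy version: advance after confirming the second '>'
    # ConsumeForBinOp's job: advance past the operator's final character.
    state.next()
    return state.peek()  # first character of the *next* token

# No space before the string literal: the opening quote is silently eaten.
print(tokenize_long_arrow(State(">'description'"), buggy=True))   # -> "d"
# With the fix, the cursor lands on the opening quote as expected.
print(tokenize_long_arrow(State(">'description'"), buggy=False))  # -> "'"
# With a space after '->>', the buggy extra advance only ate the space,
# which is why the existing tests never caught it.
print(tokenize_long_arrow(State("> 'description'"), buggy=True))  # -> "'"
```

The last case shows exactly how the whitespace in the existing tests masked the double advance.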

The fix drops the redundant _state.Next(), aligning ->> with the already-correct #>> pattern in TokenizeHash.

Tests

Added LongArrowJsonExtractionTests covering tokenization, the parser-level AST, and the exact multi-line SQL from the bug report.

The new tests are deliberately scoped to the dialects that genuinely support ->> as a JSON extraction operator: PostgreSQL, DuckDB, MySQL (since 5.7.13), SQLite (since 3.38), Redshift, and Generic. ->> is a PostgreSQL-originated extension and is not part of ANSI/ISO SQL. Dialects that use other JSON extraction mechanisms (Snowflake's col:path, BigQuery's JSON_VALUE, MS SQL's JSON_VALUE, Hive/Databricks' get_json_object, Oracle, ANSI) are intentionally excluded.
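As an aside, the no-space form can be sanity-checked outside this parser against one of the dialects listed above: SQLite has supported ->> since 3.38, and Python's stdlib sqlite3 exposes it when the bundled SQLite is new enough. A small sketch (extract_description and long_arrow_supported are illustrative helpers, not part of this PR):

```python
import sqlite3

def long_arrow_supported() -> bool:
    # The ->> operator landed in SQLite 3.38.
    return sqlite3.sqlite_version_info >= (3, 38)

def extract_description(doc: str):
    con = sqlite3.connect(":memory:")
    # No space between ->> and the string literal, as in the bug report.
    # SQLite's ->> returns the extracted value as SQL text.
    return con.execute("select ? ->>'description'", (doc,)).fetchone()[0]
```

On SQLite 3.38+, extract_description('{"description": "hi"}') returns 'hi', confirming that meta->>'description' with no intervening whitespace is valid SQL in at least one of the supported dialects.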



Development

Successfully merging this pull request may close these issues.

JSON extraction with DuckDbDialect throws an exception
