Various parsing issues #802

Closed · sogaiu opened this issue Sep 30, 2020 · 20 comments
Labels: bug (Something isn't working), parsing

Comments

@sogaiu commented Sep 30, 2020

As discussed in #calva, we'll use this issue to collect parsing issues as they are discovered, for the time being.


Handling of number before discard / ignore marker

In clj I get the following results:

user=> [1#_2]
[1]
user=> [1#_ 2]
[1]
user=> [+1#_2]
[1]

So it appears that in each case, 1 (or +1) is being recognized as a number, and then starting from #_, there is a discard expression that extends to include the 2.

I tried a similar sequence for Calva's clojure-lexer and found:

npx ts-node
> import { Scanner } from './clojure-lexer'
{}
> let s = new Scanner(32768)
undefined
> s.processLine("[1#_2]")
[
  { type: 'open', offset: 0, raw: '[', state: { inString: false } },
  { type: 'id', offset: 1, raw: '1#_2', state: { inString: false } },
  { type: 'close', offset: 5, raw: ']', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 6, state: { inString: false } }
]
> s.processLine("[1#_ 2]")
[
  { type: 'open', offset: 0, raw: '[', state: { inString: false } },
  { type: 'id', offset: 1, raw: '1#_', state: { inString: false } },
  { type: 'ws', offset: 4, raw: ' ', state: { inString: false } },
  { type: 'lit', offset: 5, raw: '2', state: { inString: false } },
  { type: 'close', offset: 6, raw: ']', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 7, state: { inString: false } }
]
> s.processLine("[+1#_2]")
[
  { type: 'open', offset: 0, raw: '[', state: { inString: false } },
  { type: 'id', offset: 1, raw: '+1#_2', state: { inString: false } },
  { type: 'close', offset: 6, raw: ']', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 7, state: { inString: false } }
]

It looks like in these cases #_ is being seen as part of what comes immediately before.

Note that:

> s.processLine("[:a#_1]")
[
  { type: 'open', offset: 0, raw: '[', state: { inString: false } },
  { type: 'kw', offset: 1, raw: ':a#_1', state: { inString: false } },
  { type: 'close', offset: 6, raw: ']', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 7, state: { inString: false } }
]

seems correct, as in clj, one gets:

user=> [:a#_1]
[:a#_1]
user=> (type :a#_1)
clojure.lang.Keyword

So far it appears that for numbers and collection delimiters, one doesn't need to put a space before #_ for a following discard expression to be recognized appropriately.

However, for characters, symbols, keywords, and symbolic values (e.g. ##NaN), not having a space makes a difference in what ends up being recognized.

It may be obvious, but this analysis may not be complete.

PEZ added the bug (Something isn't working) and parsing labels Sep 30, 2020
sogaiu changed the title from "Handling of number before discard / ignore marker" to "Various parsing issues" Oct 17, 2020
@sogaiu (Author) commented Oct 17, 2020

Handling of some symbols

In clj I get the following results:

user=> (type (read-string "_"))
clojure.lang.Symbol

The lexer I tested recognized it as 'junk':

> s.processLine("_")
[
  { type: 'junk', offset: 0, raw: '_', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 1, state: { inString: false } }
]

I expected recognition as id:

> s.processLine("_")
[
  { type: 'id', offset: 0, raw: '_', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 1, state: { inString: false } }
]

@sogaiu (Author) commented Oct 17, 2020

Handling of some number literals

The following are some things that are parsed as id, but which I expected to be parsed as lit:

user=> (type +0X0)
java.lang.Long

> s.processLine("+0X0")
[
  { type: 'id', offset: 0, raw: '+0X0', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 4, state: { inString: false } }
]
user=> (type 0x3B85110)
java.lang.Long

> s.processLine("0x3B85110")
[
  {
    type: 'id',
    offset: 0,
    raw: '0x3B85110',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 9, state: { inString: false } }
]
user=> (type 00M)
java.math.BigDecimal

> s.processLine("00M")
[
  { type: 'id', offset: 0, raw: '00M', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 3, state: { inString: false } }
]
user=> (type -0344310433453N)
clojure.lang.BigInt

> s.processLine("-0344310433453N")
[
  {
    type: 'id',
    offset: 0,
    raw: '-0344310433453N',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 15, state: { inString: false } }
]
user=> (type +3r11)
java.lang.Long

> s.processLine("+3r11")
[
  { type: 'id', offset: 0, raw: '+3r11', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 5, state: { inString: false } }
]
user=> (type -25Rn)
java.lang.Long

> s.processLine("-25Rn")
[
  { type: 'id', offset: 0, raw: '-25Rn', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 5, state: { inString: false } }
]
user=> (type -95/96)
clojure.lang.Ratio

> s.processLine("-95/96")
[
  { type: 'id', offset: 0, raw: '-95/96', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 6, state: { inString: false } }
]
user=> (type +18998.18998e+18998M)
java.math.BigDecimal

> s.processLine("+18998.18998e+18998M")
[
  {
    type: 'id',
    offset: 0,
    raw: '+18998.18998e+18998M',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 20, state: { inString: false } }
]
user=> (type -61E-19471M)
java.math.BigDecimal

> s.processLine("-61E-19471M")
[
  {
    type: 'id',
    offset: 0,
    raw: '-61E-19471M',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 11, state: { inString: false } }
]

For reference, parcera currently parses numbers like this: https://github.com/carocad/parcera/blob/83cd988e69116b67c620c099f78b693ac5e37233/src/Clojure.g4#L46

tree-sitter-clojure takes a mostly similar approach: https://github.com/sogaiu/tree-sitter-clojure/blob/9df53ae75475e5bdbeb21cd297b8e3160f3b6ed8/grammar.js#L21-L65
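To make the shapes above concrete, here is a rough TypeScript sketch -- regexes adapted from the intPat / ratioPat / floatPat patterns in Clojure's LispReader.java, not Calva's actual lexer rule -- that accepts all of the examples in this comment:

// Adapted from Clojure's LispReader number patterns; a sketch only, not Calva's rule.
const intPat =
  /^([-+]?)(?:(0)|([1-9][0-9]*)|0[xX]([0-9A-Fa-f]+)|0([0-7]+)|([1-9][0-9]?)[rR]([0-9A-Za-z]+)|0[0-9]+)(N)?$/;
const ratioPat = /^([-+]?[0-9]+)\/([0-9]+)$/;
const floatPat = /^([-+]?[0-9]+(\.[0-9]*)?([eE][-+]?[0-9]+)?)(M)?$/;

function isClojureNumber(token: string): boolean {
  return intPat.test(token) || ratioPat.test(token) || floatPat.test(token);
}

// Each example from above should come back true:
for (const t of ["+0X0", "0x3B85110", "00M", "-0344310433453N", "+3r11",
                 "-25Rn", "-95/96", "+18998.18998e+18998M", "-61E-19471M"]) {
  console.log(t, isClojureNumber(t));
}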

@sogaiu (Author) commented Oct 17, 2020

Handling of some character literals

Some character literals appear to be split apart and recognized as id followed by ws or junk:

user=> (type \
)
java.lang.Character
user=> (type (read-string "\\\n"))
java.lang.Character

> s.processLine("\\\n")
[
  { type: 'id', offset: 0, raw: '\\', state: { inString: false } },
  { type: 'ws', offset: 1, raw: '\n', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 2, state: { inString: false } }
]
> "\\\n".length
2
> console.log("\\\n")
\

undefined
user=> (type (read-string "\\\f"))
java.lang.Character

> s.processLine("\\\f")
[
  { type: 'id', offset: 0, raw: '\\', state: { inString: false } },
  { type: 'junk', offset: 1, raw: '\f', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 2, state: { inString: false } }
]
> "\\\f".length
2
> console.log("\\\f")
\

undefined

I expected a single lit.
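FWIW, a character-literal rule along these lines (a rough sketch, and only an assumption about the shape of such a rule, not Calva's actual pattern) would cover both cases:

// Sketch of a character-literal token: a backslash followed by a named
// character, a unicode/octal escape, or any single character -- including
// whitespace characters such as newline (\n) and form feed (\f).
const charLit =
  /^\\(?:newline|space|tab|formfeed|backspace|return|u[0-9A-Fa-f]{4}|o[0-7]{1,3}|[\s\S])/;

console.log(charLit.test("\\\n")); // true: backslash + newline is one literal
console.log(charLit.test("\\\f")); // true: backslash + form feed as well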

@sogaiu (Author) commented Oct 17, 2020

Handling of some symbolic values

Symbolic values expressed in a certain form (e.g. space between ## and Inf) appear to be recognized as id and not reader (which is how ##Inf is currently recognized here):

user=> (type (read-string "## Inf"))
java.lang.Double
user=> (type ## Inf)
java.lang.Double

> s.processLine("## Inf")
[
  { type: 'id', offset: 0, raw: '## Inf', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 6, state: { inString: false } }
]

Note that:

user=> (type (read-string "##Inf"))
java.lang.Double

I expected results like:

> s.processLine("## Inf")
[
  {
    type: 'reader',
    offset: 0,
    raw: '## Inf',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 6, state: { inString: false } }
]

This would be similar to what I currently get for:

> s.processLine("##Inf")
[
  {
    type: 'reader',
    offset: 0,
    raw: '##Inf',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 5, state: { inString: false } }
]

I don't understand how Calva works well enough to have an opinion about what the result should be, but it seems that ##Inf and ## Inf should be recognized as the same type, or at least be more similar.

There is this case too:

user=> (type (read-string "## #_ 1 NaN"))
java.lang.Double

I currently get:

> s.processLine("## #_ 1 Inf")
[
  { type: 'reader', offset: 0, raw: '##', state: { inString: false } },
  { type: 'ws', offset: 2, raw: ' ', state: { inString: false } },
  { type: 'ignore', offset: 3, raw: '#_', state: { inString: false } },
  { type: 'ws', offset: 5, raw: ' ', state: { inString: false } },
  { type: 'lit', offset: 6, raw: '1', state: { inString: false } },
  { type: 'ws', offset: 7, raw: ' ', state: { inString: false } },
  { type: 'id', offset: 8, raw: 'Inf', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 11, state: { inString: false } }
]

So maybe there is a case for the first two cases being tokenized as something like reader followed by id (with the possibility of ws in between)?
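A minimal sketch of that idea (a hypothetical rule set with made-up token names modeled on the output above, not Calva's implementation): ## gets its own token and the ordinary ws / ignore / id rules handle the rest, so ##Inf and ## Inf differ only by a ws token.

// Hypothetical mini-tokenizer illustrating the suggestion above.
const rules: [string, RegExp][] = [
  ["reader", /^##/],
  ["ignore", /^#_/],
  ["ws", /^[\s,]+/],
  ["id", /^[^\s,()\[\]{}"#]+/],
];

function tokenize(src: string): [string, string][] {
  const out: [string, string][] = [];
  while (src.length > 0) {
    const hit = rules.find(([, re]) => re.test(src));
    if (!hit) { out.push(["junk", src[0]]); src = src.slice(1); continue; }
    const raw = src.match(hit[1])![0];
    out.push([hit[0], raw]);
    src = src.slice(raw.length);
  }
  return out;
}

console.log(tokenize("##Inf"));  // [["reader","##"],["id","Inf"]]
console.log(tokenize("## Inf")); // [["reader","##"],["ws"," "],["id","Inf"]]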

@sogaiu (Author) commented Oct 18, 2020

Handling of character literal + comment sequence

In clj:

user=> \newline;)
\newline

in the local lexer:

> s.processLine("\\newline;)");
[
  {
    type: 'lit',
    offset: 0,
    raw: '\\newline;',
    state: { inString: false }
  },
  { type: 'close', offset: 9, raw: ')', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 10, state: { inString: false } }
]

It seems that \newline; got treated as a single lit instead of being recognized as lit (\newline) followed by comment (;)).

Similar results were seen for \a.

PEZ added a commit that referenced this issue Nov 1, 2020
@PEZ (Collaborator) commented Nov 1, 2020

Regarding:

[1#_2]

I think it can be simplified as 1#_2. And I also think it is an effect of one of the very first things mentioned about the Reader:

Symbols begin with a non-numeric character and can contain alphanumeric characters and *, +, !, -, _, ', ?, <, > and = (other characters may be allowed eventually).

(My emphasis.) This non-numeric start rule was, for some reason, not implemented in Calva's Clojure syntax. Adding that constraint immediately made Calva parse 1#_foo as { lit: "1", ignore: "#_", id: "foo" }, instead of as { id: "1#_foo" }.
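To illustrate with a made-up pattern (not Calva's actual one), anchoring the id rule to a non-digit first character is what makes the difference:

// Hypothetical id pattern with the "non-numeric start" constraint applied.
const idPattern = /^[^\s,()\[\]{}"0-9][^\s,()\[\]{}"]*/;

console.log(idPattern.test("foo#_bar")); // true  -> still swallowed as one id
console.log(idPattern.test("1#_foo"));   // false -> "1" lexes as lit, then "#_", then "foo"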

So it is not a general rule that <lit>#_<something> should be parsed to { lit: "<whatever>", ignore: "#_", ...}, as evidenced by:

clj::user=> (def false#_foo :false#_foo)
#'user/false#_foo
clj::user=> false#_foo
:false#_foo

(Smiling at the irony that the GitHub Clojure syntax disagrees 😄 )

Anyway, super good find! I'll try to find some more time to spend on this awesome list.

@sogaiu (Author) commented Nov 1, 2020

One reason to express the discard example as [1#_2] is to not leave your REPL hanging :)

BTW, Calva was not alone -- there was some discussion concerning this here: carocad/parcera#86

PEZ added a commit that referenced this issue Nov 1, 2020
@PEZ (Collaborator) commented Nov 1, 2020

One reason to express the discard example as [1#_2] is to not leave your REPL hanging :)

That's very considerate of you. 😄 In which cases did it hang the REPL?

I think I managed to fix _ as a valid starting char for symbols now. At first it seemed like a simple miss, like the one with the digit, but it turned out to be a bit more involved: Calva's lexer allows # to prefix any symbol, but if it immediately precedes an _ then that should be lexed as an ignore marker... A bit similar to that parcera dilemma you pointed at. Anyway, negative lookbehind to the rescue.
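Roughly the idea, sketched with a hypothetical pattern (not the exact one in Calva):

// "_" may start an id, but not when it is the "_" of a "#_" ignore marker;
// the negative lookbehind keeps the two cases apart.
const underscoreId = /(?<!#)_[^\s,()\[\]{}"]*/;

console.log(underscoreId.test("_foo"));  // true  -> "_foo" lexes as an id
console.log(underscoreId.test("#_foo")); // false -> this "_" belongs to "#_"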

@sogaiu (Author) commented Nov 1, 2020

In clj this is what happens here:

$ clj
Clojure 1.10.1
user=> 1#_2
1

Note that there is no prompt printed after the 1.

One can get out of this situation by typing, say, a 1 followed by enter, but initially I think it can be disorienting because the usual user=> prompt does not appear. Perhaps I should not have described that as hanging :)

I'm not sure I understand why # should be allowed to prefix any symbol:

user=> (def #a 1)
Syntax error reading source at (REPL:3:10).
No reader function for tag a
Syntax error reading source at (REPL:3:11).
Unmatched delimiter: )

# doesn't appear to be listed as a legal character that can start a symbol according to the text you quoted earlier.

Maybe I misunderstood, or is this part of Calva being more generous in what it accepts?

@PEZ (Collaborator) commented Nov 1, 2020

It is because Calva treats a lot of quoting/splicing/etcetera chars as prefixes to symbols (this is a reason why we call them id and not symbol). This treatment of prefixing chars allows us to keep Paredit a bit more stupid: it can move past ids without figuring out what is the symbol and what are prefixes. It makes ids out of a lot of stuff that the reader would barf upon, which is unfortunate, but that is the thing with tradeoffs, right? 😄

@sogaiu (Author) commented Nov 1, 2020

Thank you for the explanation. I was able to update my idea of the intent of id a bit :)

PEZ added a commit that referenced this issue Nov 1, 2020
PEZ added a commit that referenced this issue Nov 1, 2020
PEZ mentioned this issue Nov 1, 2020
@sogaiu (Author) commented Nov 2, 2020

Handling of whitespace

I'm not sure if the following will work for copy-paste, but FWIW:

#{nil
true}

There is a character between nil and true that IIUC ought to be treated as whitespace.

In Python one can enter it like: '\u2028'

Ah, I guess in JS one may enter it in a similar way, so that might be:

"#{1\u2028nil}"

This is a demonstration of what Clojure considers whitespace: https://github.com/clojure/clojure/blob/833c924239a818ff1a2563ae88af6dc266b35a61/src/jvm/clojure/lang/LispReader.java#L131

So it's either a comma or what java.lang.Character's isWhitespace method returns true for.

I've summarized my understanding of what counts here: https://github.com/sogaiu/tree-sitter-clojure/blob/f8006afc91296b0cdb09bfa04e08a6b3347e5962/grammar.js#L6-L32
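As a rough sketch of that understanding (an assumption based on the isWhitespace docs and the code points below, not something I've tested against the JDK):

// Comma, or anything java.lang.Character.isWhitespace returns true for.
// Note the non-breaking spaces U+00A0, U+2007, U+202F are excluded.
function isClojureWhitespace(ch: string): boolean {
  return /^[,\t\n\u000b\f\r\u001c-\u001f \u1680\u2000-\u2006\u2008-\u200a\u2028\u2029\u205f\u3000]$/.test(ch);
}

console.log(isClojureWhitespace(","));      // true: comma counts as whitespace
console.log(isClojureWhitespace("\u2028")); // true: LINE SEPARATOR
console.log(isClojureWhitespace("\u00a0")); // false: no-break space is not Java whitespace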

For U+1680, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2008, U+2009, U+200A, U+205F, U+3000, I get junk, but maybe it would be better to recognize them as ws?

For U+2028 and U+2029, I get an exception / error "Unexpected character" -- maybe JS(?) cannot handle those. My testing was with ts-node FWIW. (It may depend on the JS version perhaps: https://stackoverflow.com/questions/2965293/javascript-parse-error-on-u2028-unicode-character)

For U+001C, U+001D, U+001E, U+001F, I get id, but maybe ws is better?

For U+000B and \f, I get junk; maybe ws is better?

For U+0020 (space), \n, \r, and \t, I get ws which is what I expect.

For an upstream reference there is: https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(char)

I looked at that (or possibly a version for another JDK) and then went looking for Unicode docs to figure out what the bits meant.

For Unicode info, I looked at:

@PEZ (Collaborator) commented Nov 2, 2020

This is super! Should be pretty easy to throw all that into my ws pattern. Famous last words.

@PEZ (Collaborator) commented Nov 2, 2020

Yes, famous last words. But I figured it out. 😄

It's extra good that you found that there is a lot of Unicode not matched by /./, which is why U+2028 crashed the lexer. I now match junk as /[\u0000-\uffff]/. I will need to figure out how to match code points above that; my attempts at this failed. Also on the TODO list is to add tests for junk. It is a crucial part of the lexer!
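For reference, one possible way to cover code points above U+FFFF (just a sketch of the idea, not what Calva ended up doing) is the regex u flag, which makes character classes operate on full code points:

// With the "u" flag, character classes can span the full code-point range,
// so astral characters match as one character instead of two surrogates.
const junk = /[\u0000-\u{10FFFF}]/u;

console.log(junk.test("\u{1F600}")); // true: an astral code point matches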

@sogaiu (Author) commented Nov 3, 2020

Reader (symbolic value?) and comment

clj says:

user=> ##Inf;[
##Inf

For cd2ffed, I get:

> s.processLine("##Inf;[")
[
  {
    type: 'reader',
    offset: 0,
    raw: '##Inf;',
    state: { inString: false }
  },
  { type: 'open', offset: 6, raw: '[', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 7, state: { inString: false } }
]

Looks like the reader has "absorbed" a semicolon.

@sogaiu (Author) commented Nov 3, 2020

N-suffixed numeric literal split

clj says:

user=> 0X0N
0N

With 943a4b1 I get:

> s.processLine("0X0N");
[
  { type: 'lit', offset: 0, raw: '0X0', state: { inString: false } },
  { type: 'id', offset: 3, raw: 'N', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 4, state: { inString: false } }
]

@PEZ (Collaborator) commented Nov 3, 2020

Thanks! Looking at this I found another one:

Calva treats something like 08.01 as two tokens. This is because, late one night some days ago, I got the idea that sci numbers couldn't start with a 0...

@sogaiu (Author) commented Nov 3, 2020

Good catch!

I think the generators here will put a leading zero for hex and octal, but not for double or ratio. Perhaps I should add that :)

IIUC, multiple leading zeros work (only for double and ratio).

(FWIW, my limited testing and understanding suggest that radix numbers can't start with zero.)

Looks like if one does -00034, it is interpreted as octal, so AFAIU, one cannot really have any leading zeros for integers (the base 10 kind).

@sogaiu (Author) commented Jun 20, 2023

Maybe these have all been addressed?

I'll close this for now :)

sogaiu closed this as completed Jun 20, 2023
@PEZ (Collaborator) commented Jun 20, 2023

TBH, I don't know if all of it has been addressed, but as I recall, I addressed most of it. It was super valuable to Calva. Belated thanks!
