Various parsing issues #802

Closed · sogaiu opened this issue Sep 30, 2020 · 20 comments
Labels: bug (Something isn't working), parsing

Comments

@sogaiu commented Sep 30, 2020

As discussed in #calva, we'll use this issue to collect parsing issues as they are discovered, for the time being.


Handling of number before discard / ignore marker

In clj I get the following results:

user=> [1#_2]
[1]
user=> [1#_ 2]
[1]
user=> [+1#_2]
[1]

So it appears that in each case, 1 (or +1) is being recognized as a number, and then starting from #_, there is a discard expression that extends to include the 2.

I tried a similar sequence for Calva's clojure-lexer and found:

npx ts-node
> import { Scanner } from './clojure-lexer'
{}
> let s = new Scanner(32768)
undefined
> s.processLine("[1#_2]")
[
  { type: 'open', offset: 0, raw: '[', state: { inString: false } },
  { type: 'id', offset: 1, raw: '1#_2', state: { inString: false } },
  { type: 'close', offset: 5, raw: ']', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 6, state: { inString: false } }
]
> s.processLine("[1#_ 2]")
[
  { type: 'open', offset: 0, raw: '[', state: { inString: false } },
  { type: 'id', offset: 1, raw: '1#_', state: { inString: false } },
  { type: 'ws', offset: 4, raw: ' ', state: { inString: false } },
  { type: 'lit', offset: 5, raw: '2', state: { inString: false } },
  { type: 'close', offset: 6, raw: ']', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 7, state: { inString: false } }
]
> s.processLine("[+1#_2]")
[
  { type: 'open', offset: 0, raw: '[', state: { inString: false } },
  { type: 'id', offset: 1, raw: '+1#_2', state: { inString: false } },
  { type: 'close', offset: 6, raw: ']', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 7, state: { inString: false } }
]

It looks like in these cases #_ is being seen as part of what comes immediately before.

Note that:

> s.processLine("[:a#_1]")
[
  { type: 'open', offset: 0, raw: '[', state: { inString: false } },
  { type: 'kw', offset: 1, raw: ':a#_1', state: { inString: false } },
  { type: 'close', offset: 6, raw: ']', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 7, state: { inString: false } }
]

seems correct, as in clj, one gets:

user=> [:a#_1]
[:a#_1]
user=> (type :a#_1)
clojure.lang.Keyword

So far it appears that for numbers and collection delimiters, one doesn't need to put a space before #_ for a following discard expression to be recognized appropriately.

However, for characters, symbols, keywords, and symbolic values (e.g. ##NaN), not having a space makes a difference in what ends up being recognized.

It may be obvious, but this analysis may not be complete.

PEZ added the bug (Something isn't working) and parsing labels Sep 30, 2020
sogaiu changed the title from "Handling of number before discard / ignore marker" to "Various parsing issues" Oct 17, 2020
@sogaiu (Author) commented Oct 17, 2020

Handling of some symbols

In clj I get the following results:

user=> (type (read-string "_"))
clojure.lang.Symbol

The lexer I tested recognized it as 'junk':

> s.processLine("_")
[
  { type: 'junk', offset: 0, raw: '_', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 1, state: { inString: false } }
]

I expected recognition as id:

> s.processLine("_")
[
  { type: 'id', offset: 0, raw: '_', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 1, state: { inString: false } }
]

@sogaiu (Author) commented Oct 17, 2020

Handling of some number literals

The following are some things that are parsed as id, but which I expected to be parsed as lit:

user=> (type +0X0)
java.lang.Long

> s.processLine("+0X0")
[
  { type: 'id', offset: 0, raw: '+0X0', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 4, state: { inString: false } }
]
user=> (type 0x3B85110)
java.lang.Long

> s.processLine("0x3B85110")
[
  {
    type: 'id',
    offset: 0,
    raw: '0x3B85110',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 9, state: { inString: false } }
]
user=> (type 00M)
java.math.BigDecimal

> s.processLine("00M")
[
  { type: 'id', offset: 0, raw: '00M', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 3, state: { inString: false } }
]
user=> (type -0344310433453N)
clojure.lang.BigInt

> s.processLine("-0344310433453N")
[
  {
    type: 'id',
    offset: 0,
    raw: '-0344310433453N',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 15, state: { inString: false } }
]
user=> (type +3r11)
java.lang.Long

> s.processLine("+3r11")
[
  { type: 'id', offset: 0, raw: '+3r11', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 5, state: { inString: false } }
]
user=> (type -25Rn)
java.lang.Long

> s.processLine("-25Rn")
[
  { type: 'id', offset: 0, raw: '-25Rn', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 5, state: { inString: false } }
]
user=> (type -95/96)
clojure.lang.Ratio

> s.processLine("-95/96")
[
  { type: 'id', offset: 0, raw: '-95/96', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 6, state: { inString: false } }
]
user=> (type +18998.18998e+18998M)
java.math.BigDecimal

> s.processLine("+18998.18998e+18998M")
[
  {
    type: 'id',
    offset: 0,
    raw: '+18998.18998e+18998M',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 20, state: { inString: false } }
]
user=> (type -61E-19471M)
java.math.BigDecimal

> s.processLine("-61E-19471M")
[
  {
    type: 'id',
    offset: 0,
    raw: '-61E-19471M',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 11, state: { inString: false } }
]

For reference, parcera currently parses numbers like this: https://github.com/carocad/parcera/blob/83cd988e69116b67c620c099f78b693ac5e37233/src/Clojure.g4#L46

tree-sitter-clojure takes a mostly similar approach: https://github.com/sogaiu/tree-sitter-clojure/blob/9df53ae75475e5bdbeb21cd297b8e3160f3b6ed8/grammar.js#L21-L65
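To make the shapes above concrete, here is a rough TypeScript sketch -- regexes adapted from the intPat / ratioPat / floatPat patterns in Clojure's LispReader.java, not Calva's actual lexer rule -- that accepts all of the examples in this comment:

// Adapted from Clojure's LispReader number patterns; a sketch only, not Calva's rule.
const intPat =
  /^([-+]?)(?:(0)|([1-9][0-9]*)|0[xX]([0-9A-Fa-f]+)|0([0-7]+)|([1-9][0-9]?)[rR]([0-9A-Za-z]+)|0[0-9]+)(N)?$/;
const ratioPat = /^([-+]?[0-9]+)\/([0-9]+)$/;
const floatPat = /^([-+]?[0-9]+(\.[0-9]*)?([eE][-+]?[0-9]+)?)(M)?$/;

function isClojureNumber(token: string): boolean {
  return intPat.test(token) || ratioPat.test(token) || floatPat.test(token);
}

// Each example from above should come back true:
for (const t of ["+0X0", "0x3B85110", "00M", "-0344310433453N", "+3r11",
                 "-25Rn", "-95/96", "+18998.18998e+18998M", "-61E-19471M"]) {
  console.log(t, isClojureNumber(t));
}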

@sogaiu (Author) commented Oct 17, 2020

Handling of some character literals

Some character literals appear to be split apart and recognized as id followed by ws or junk:

user=> (type \
)
java.lang.Character
user=> (type (read-string "\\\n"))
java.lang.Character

> s.processLine("\\\n")
[
  { type: 'id', offset: 0, raw: '\\', state: { inString: false } },
  { type: 'ws', offset: 1, raw: '\n', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 2, state: { inString: false } }
]
> "\\\n".length
2
> console.log("\\\n")
\

undefined
user=> (type (read-string "\\\f"))
java.lang.Character

> s.processLine("\\\f")
[
  { type: 'id', offset: 0, raw: '\\', state: { inString: false } },
  { type: 'junk', offset: 1, raw: '\f', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 2, state: { inString: false } }
]
> "\\\f".length
2
> console.log("\\\f")
\

undefined

I expected a single lit.
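FWIW, a character-literal rule along these lines (a rough sketch, and only an assumption about the shape of such a rule, not Calva's actual pattern) would cover both cases:

// Sketch of a character-literal token: a backslash followed by a named
// character, a unicode/octal escape, or any single character -- including
// whitespace characters such as newline (\n) and form feed (\f).
const charLit =
  /^\\(?:newline|space|tab|formfeed|backspace|return|u[0-9A-Fa-f]{4}|o[0-7]{1,3}|[\s\S])/;

console.log(charLit.test("\\\n")); // true: backslash + newline is one literal
console.log(charLit.test("\\\f")); // true: backslash + form feed as well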

@sogaiu (Author) commented Oct 17, 2020

Handling of some symbolic values

Symbolic values expressed in a certain form (e.g. space between ## and Inf) appear to be recognized as id and not reader (which is how ##Inf is currently recognized here):

user=> (type (read-string "## Inf"))
java.lang.Double
user=> (type ## Inf)
java.lang.Double

> s.processLine("## Inf")
[
  { type: 'id', offset: 0, raw: '## Inf', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 6, state: { inString: false } }
]

Note that:

user=> (type (read-string "##Inf"))
java.lang.Double

I expected results like:

> s.processLine("## Inf")
[
  {
    type: 'reader',
    offset: 0,
    raw: '## Inf',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 6, state: { inString: false } }
]

This would be similar to what I currently get for:

> s.processLine("##Inf")
[
  {
    type: 'reader',
    offset: 0,
    raw: '##Inf',
    state: { inString: false }
  },
  { type: 'eol', raw: '\n', offset: 5, state: { inString: false } }
]

I don't understand how Calva works well enough to have an opinion about what the result should be, but it seems that ##Inf and ## Inf should be recognized as the same type, or at least be more similar.

There is this case too:

user=> (type (read-string "## #_ 1 NaN"))
java.lang.Double

I currently get:

> s.processLine("## #_ 1 Inf")
[
  { type: 'reader', offset: 0, raw: '##', state: { inString: false } },
  { type: 'ws', offset: 2, raw: ' ', state: { inString: false } },
  { type: 'ignore', offset: 3, raw: '#_', state: { inString: false } },
  { type: 'ws', offset: 5, raw: ' ', state: { inString: false } },
  { type: 'lit', offset: 6, raw: '1', state: { inString: false } },
  { type: 'ws', offset: 7, raw: ' ', state: { inString: false } },
  { type: 'id', offset: 8, raw: 'Inf', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 11, state: { inString: false } }
]

So maybe there is a case for the first two cases being tokenized as something like reader followed by id (with the possibility of ws in between)?
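A minimal sketch of that idea (a hypothetical rule set with made-up token names modeled on the output above, not Calva's implementation): ## gets its own token and the ordinary ws / ignore / id rules handle the rest, so ##Inf and ## Inf differ only by a ws token.

// Hypothetical mini-tokenizer illustrating the suggestion above.
const rules: [string, RegExp][] = [
  ["reader", /^##/],
  ["ignore", /^#_/],
  ["ws", /^[\s,]+/],
  ["id", /^[^\s,()\[\]{}"#]+/],
];

function tokenize(src: string): [string, string][] {
  const out: [string, string][] = [];
  while (src.length > 0) {
    const hit = rules.find(([, re]) => re.test(src));
    if (!hit) { out.push(["junk", src[0]]); src = src.slice(1); continue; }
    const raw = src.match(hit[1])![0];
    out.push([hit[0], raw]);
    src = src.slice(raw.length);
  }
  return out;
}

console.log(tokenize("##Inf"));  // [["reader","##"],["id","Inf"]]
console.log(tokenize("## Inf")); // [["reader","##"],["ws"," "],["id","Inf"]]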

@sogaiu (Author) commented Oct 18, 2020

Handling of character literal + comment sequence

In clj:

user=> \newline;)
\newline

in the local lexer:

> s.processLine("\\newline;)");
[
  {
    type: 'lit',
    offset: 0,
    raw: '\\newline;',
    state: { inString: false }
  },
  { type: 'close', offset: 9, raw: ')', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 10, state: { inString: false } }
]

It seems that \newline; got treated as a single lit instead of being recognized as lit (\newline) followed by comment (;)).

Similar results were seen for \a.

PEZ added a commit that referenced this issue Nov 1, 2020
@PEZ (Collaborator) commented Nov 1, 2020

Regarding:

[1#_2]

I think it can be simplified as 1#_2. And I also think it is an effect of one of the very first things mentioned about the Reader:

Symbols begin with a non-numeric character and can contain alphanumeric characters and *, +, !, -, _, ', ?, <, > and = (other characters may be allowed eventually).

(My emphasis.) This non-numeric start rule was, for some reason, not implemented in Calva's Clojure syntax. Adding that constraint immediately made Calva parse 1#_foo as { lit: "1", ignore: "#_", id: "foo" }, instead of as { id: "1#_foo" }.
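To illustrate with a made-up pattern (not Calva's actual one), anchoring the id rule to a non-digit first character is what makes the difference:

// Hypothetical id pattern with the "non-numeric start" constraint applied.
const idPattern = /^[^\s,()\[\]{}"0-9][^\s,()\[\]{}"]*/;

console.log(idPattern.test("foo#_bar")); // true  -> still swallowed as one id
console.log(idPattern.test("1#_foo"));   // false -> "1" lexes as lit, then "#_", then "foo"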

So it is not a general rule that <lit>#_<something> should be parsed to { lit: "<whatever>", ignore: "#_", ...}, as evidenced by:

clj::user=> (def false#_foo :false#_foo)
#'user/false#_foo
clj::user=> false#_foo
:false#_foo

(Smiling at the irony that the GitHub Clojure syntax disagrees 😄 )

Anyway, super good find! I'll try to find some more time to spend on this awesome list.

@sogaiu (Author) commented Nov 1, 2020

One reason to express the discard example as [1#_2] is to not leave your REPL hanging :)

BTW, Calva was not alone -- there was some discussion concerning this here: carocad/parcera#86

PEZ added a commit that referenced this issue Nov 1, 2020
@PEZ (Collaborator) commented Nov 1, 2020

One reason to express the discard example as [1#_2] is to not leave your REPL hanging :)

That's very considerate of you. 😄 In which cases did it hang the REPL?

I think I managed to fix _ as a valid starting char for symbols now. At first it seemed like a simple miss, like the one with the digit, but it turned out to be a bit more involved: Calva's lexer allows # to prefix any symbol, but if it immediately precedes an _ then that should be lexed as an ignore marker... A bit similar to that parcera dilemma you pointed at. Anyway, negative lookbehind to the rescue.
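Roughly the idea, sketched with a hypothetical pattern (not the exact one in Calva):

// "_" may start an id, but not when it is the "_" of a "#_" ignore marker;
// the negative lookbehind keeps the two cases apart.
const underscoreId = /(?<!#)_[^\s,()\[\]{}"]*/;

console.log(underscoreId.test("_foo"));  // true  -> "_foo" lexes as an id
console.log(underscoreId.test("#_foo")); // false -> this "_" belongs to "#_"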

@sogaiu (Author) commented Nov 1, 2020

In clj this is what happens here:

$ clj
Clojure 1.10.1
user=> 1#_2
1

Note that there is no prompt printed after the 1.

One can get out of this situation by typing, say, a 1 followed by enter, but initially I think it can be disorienting because the usual user=> prompt does not appear. Perhaps I should not have described that as hanging :)

I'm not sure I understand why # should be allowed to prefix any symbol:

user=> (def #a 1)
Syntax error reading source at (REPL:3:10).
No reader function for tag a
Syntax error reading source at (REPL:3:11).
Unmatched delimiter: )

# doesn't appear to be listed as a legal character that can start a symbol according to the text you quoted earlier.

Maybe I misunderstood, or is this part of Calva being more generous in what it accepts?

@PEZ (Collaborator) commented Nov 1, 2020

It is because Calva treats a lot of quoting/splicing/etcetera chars as prefixes to symbols (this is a reason why we call them id and not symbol). This treatment of prefixing chars allows us to keep Paredit a bit more stupid: it can move past ids without figuring out what is the symbol and what are prefixes. It makes ids out of a lot of stuff that the reader would barf upon, which is unfortunate, but that is the thing with tradeoffs, right? 😄

@sogaiu (Author) commented Nov 1, 2020

Thank you for the explanation. I was able to update my idea of the intent of id a bit :)

PEZ added a commit that referenced this issue Nov 1, 2020
PEZ added a commit that referenced this issue Nov 1, 2020
PEZ mentioned this issue Nov 1, 2020
@sogaiu (Author) commented Nov 2, 2020

Handling of whitespace

I'm not sure if the following will work for copy-paste, but FWIW:

#{nil
true}

There is a character between nil and true that IIUC ought to be treated as whitespace.

In Python one can enter it like: '\u2028'

Ah, I guess in JS one may enter it in a similar way, so that might be:

"#{1\u2028nil}"

This is a demonstration of what Clojure considers whitespace: https://github.com/clojure/clojure/blob/833c924239a818ff1a2563ae88af6dc266b35a61/src/jvm/clojure/lang/LispReader.java#L131

So it's either a comma or what java.lang.Character's isWhitespace method returns true for.

I've summarized my understanding of what counts here: https://github.com/sogaiu/tree-sitter-clojure/blob/f8006afc91296b0cdb09bfa04e08a6b3347e5962/grammar.js#L6-L32
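As a rough sketch of that understanding (an assumption based on the isWhitespace docs and the code points below, not something I've tested against the JDK):

// Comma, or anything java.lang.Character.isWhitespace returns true for.
// Note the non-breaking spaces U+00A0, U+2007, U+202F are excluded.
function isClojureWhitespace(ch: string): boolean {
  return /^[,\t\n\u000b\f\r\u001c-\u001f \u1680\u2000-\u2006\u2008-\u200a\u2028\u2029\u205f\u3000]$/.test(ch);
}

console.log(isClojureWhitespace(","));      // true: comma counts as whitespace
console.log(isClojureWhitespace("\u2028")); // true: LINE SEPARATOR
console.log(isClojureWhitespace("\u00a0")); // false: no-break space is not Java whitespace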

For U+1680, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2008, U+2009, U+200A, U+205F, U+3000, I get junk, but maybe it would be better to recognize them as ws?

For U+2028 and U+2029, I get an exception / error "Unexpected character" -- maybe JS(?) cannot handle those. My testing was with ts-node FWIW. (It may depend on the JS version perhaps: https://stackoverflow.com/questions/2965293/javascript-parse-error-on-u2028-unicode-character)

For U+001C, U+001D, U+001E, U+001F, I get id, but maybe ws is better?

For U+000B and \f, I get junk; maybe ws is better?

For U+0020 (space), \n, \r, and \t, I get ws which is what I expect.

For an upstream reference there is: https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(char)

I looked at that (or possibly a version for another JDK) and then went looking for Unicode docs to figure out what the bits meant.

For Unicode info, I looked at:

@PEZ (Collaborator) commented Nov 2, 2020

This is super! Should be pretty easy to throw all that into my ws pattern. Famous last words.

@PEZ (Collaborator) commented Nov 2, 2020

Yes, famous last words. But I figured it out. 😄

It's extra good that you found that there is a lot of Unicode not matched by /./, which is why U+2028 crashed the lexer. I now match junk as /[\u0000-\uffff]/. I will need to figure out how to match code points above that; my attempts at this failed. Also on the TODO list is to add tests for junk. It is a crucial part of the lexer!
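For reference, one possible way to cover code points above U+FFFF (just a sketch of the idea, not what Calva ended up doing) is the regex u flag, which makes character classes operate on full code points:

// With the "u" flag, character classes can span the full code-point range,
// so astral characters match as one character instead of two surrogates.
const junk = /[\u0000-\u{10FFFF}]/u;

console.log(junk.test("\u{1F600}")); // true: an astral code point matches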

@sogaiu (Author) commented Nov 3, 2020

Reader (symbolic value?) and comment

clj says:

user=> ##Inf;[
##Inf

For cd2ffed, I get:

> s.processLine("##Inf;[")
[
  {
    type: 'reader',
    offset: 0,
    raw: '##Inf;',
    state: { inString: false }
  },
  { type: 'open', offset: 6, raw: '[', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 7, state: { inString: false } }
]

Looks like the reader has "absorbed" a semicolon.

@sogaiu (Author) commented Nov 3, 2020

N-suffixed numeric literal split

clj says:

user=> 0X0N
0N

With 943a4b1 I get:

> s.processLine("0X0N");
[
  { type: 'lit', offset: 0, raw: '0X0', state: { inString: false } },
  { type: 'id', offset: 3, raw: 'N', state: { inString: false } },
  { type: 'eol', raw: '\n', offset: 4, state: { inString: false } }
]

@PEZ (Collaborator) commented Nov 3, 2020

Thanks! Looking at this I found another one:

Calva treats something like 08.01 as two tokens. This is because, late one night some days ago, I got the idea that sci numbers couldn't start with a 0...

@sogaiu (Author) commented Nov 3, 2020

Good catch!

I think the generators here will put a leading zero for hex and octal, but not for double or ratio. Perhaps I should add that :)

IIUC, multiple leading zeros work (only for double and ratio).

(FWIW, my limited testing and understanding suggest that radix numbers can't start with zero.)

Looks like if one does -00034, it is interpreted as octal, so AFAIU, one cannot really have any leading zeros for integers (the base 10 kind).

@sogaiu (Author) commented Jun 20, 2023

Maybe these have all been addressed?

I'll close this for now :)

sogaiu closed this as completed Jun 20, 2023
@PEZ (Collaborator) commented Jun 20, 2023

TBH, I don't know if all of it has been addressed, but as I recall, I addressed most of it. It was super valuable to Calva. Belated thanks!
