Various parsing issues #802
Handling of some symbols
In clj I get the following results:
The lexer I tested recognized as 'junk':
I expected recognition as
Handling of some number literals
The following are some things that are parsed as
For reference, parcera currently parses numbers like this: https://github.com/carocad/parcera/blob/83cd988e69116b67c620c099f78b693ac5e37233/src/Clojure.g4#L46
tree-sitter-clojure takes a mostly similar approach: https://github.com/sogaiu/tree-sitter-clojure/blob/9df53ae75475e5bdbeb21cd297b8e3160f3b6ed8/grammar.js#L21-L65
Handling of some character literals
Some character literals appear to be split apart and recognized as
I expected a single
Handling of some symbolic values
Symbolic values expressed in a certain form (e.g. space between
Note that:
I expected results like:
This would be similar to what I currently get for:
I don't understand how Calva works well enough to have an opinion about what the result should be, but it seems that
There is this case too:
I currently get:
So maybe there is a case for the first two being something like
Handling of character literal + comment sequence
In
in the local lexer:
It seems that
Similar results were seen for
Regarding:
I think it can be simplified as
(My emphasis.) This non-numeric start rule was, for some reason, not implemented in Calva's Clojure syntax. Adding that constraint immediately made Calva parse
So, it is not general, as in:
clj::user=> (def false#_foo :false#_foo)
#'user/false#_foo
clj::user=> false#_foo
:false#_foo
(Smiling at the irony that the GitHub Clojure syntax disagrees 😄)
Anyway, super good find! I'll try to find some more time to spend on this awesome list.
One reason to express the discard example as
BTW, Calva was not alone -- there was some discussion concerning this here: carocad/parcera#86
That's very considerate of you. 😄 In which cases did it hang the REPL? I think I managed to fix the
In
Note that there is no prompt printed after the
One can get out of this situation by typing, say a
I'm not sure I understand why
Maybe I misunderstood, or this is part of Calva being more generous in what it accepts?
This is because Calva treats a lot of quoting/splicing/etcetera chars as symbol prefixes (this is one reason why we call them
Thank you for the explanation. I was able to update my idea of the intent of
Handling of whitespace
I'm not sure if the following will work for copy-paste, but FWIW:
There is a character between
In Python one can enter it like:
Ah, I guess in JS one may enter it in a similar way, so that might be:
This demonstrates what Clojure considers whitespace: https://github.com/clojure/clojure/blob/833c924239a818ff1a2563ae88af6dc266b35a61/src/jvm/clojure/lang/LispReader.java#L131
So it's either a comma or what java.lang.Character's
I've summarized my understanding of what counts here: https://github.com/sogaiu/tree-sitter-clojure/blob/f8006afc91296b0cdb09bfa04e08a6b3347e5962/grammar.js#L6-L32
For U+1680, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2008, U+2009, U+200A, U+205F, U+3000, I get
For U+2028 and U+2029, I get an exception / error "Unexpected character" -- maybe JS(?) cannot handle those. My testing was with
For U+001C, U+001D, U+001E, U+001F, I get
For U+000B and
For U+0020 (space),
For an upstream reference there is: https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(char)
I looked at that (or possibly a version for another JDK) and then went looking for Unicode docs to figure out what the bits meant. For Unicode info, I looked at:
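To make the whitespace rule above concrete, here is a minimal Python sketch of "a comma, or whatever Java's Character.isWhitespace accepts", following the Javadoc linked above. The function name `is_clojure_whitespace` and the helper sets are made up for illustration, and Python's `unicodedata` tables may be from a different Unicode version than a given JDK, so edge cases could differ. It is not Calva's or the Clojure reader's code.

```python
import unicodedata

# Control characters that the Java docs list explicitly as whitespace:
# U+0009-U+000D and U+001C-U+001F.
_JAVA_CONTROL_WS = set("\t\n\u000b\f\r\u001c\u001d\u001e\u001f")

# Non-breaking spaces that Character.isWhitespace excludes even though
# they are Unicode space separators.
_NON_BREAKING = set("\u00a0\u2007\u202f")


def is_clojure_whitespace(ch: str) -> bool:
    """Approximate the reader's whitespace test for a single character."""
    if ch == ",":  # Clojure treats commas as whitespace
        return True
    if ch in _JAVA_CONTROL_WS:
        return True
    # Space, line, and paragraph separators, minus the non-breaking ones.
    is_separator = unicodedata.category(ch) in ("Zs", "Zl", "Zp")
    return is_separator and ch not in _NON_BREAKING


if __name__ == "__main__":
    for cp in (0x1680, 0x2003, 0x2007, 0x2028, 0x001C, 0x000B, 0x0020):
        print(f"U+{cp:04X}", is_clojure_whitespace(chr(cp)))
```

Under these assumptions, every code point listed in the comment above comes out as whitespace, while U+2007 (absent from that list) does not.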
This is super! Should be pretty easy to throw all that into my
Yes, famous last words. But I figured it out. 😄 Extra good that you found that there are a lot of Unicode characters not matched by
Reader (symbolic value?) and comment
For cd2ffed, I get:
Looks like the reader has "absorbed" a semicolon.
N-numeric literal split
With 943a4b1 I get:
Thanks! Looking at this I found another one: Calva treats something like
Good catch! I think the generators here will put a leading zero for hex and octal, but not for double or ratio. Perhaps I should add that :)
IIUC, multiple leading zeros work (only for double and ratio). (FWIW, my limited testing and understanding suggest that radix numbers can't start with zero.)
Looks like if one does
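As a rough illustration of the leading-zero observations above, here is a Python sketch of a number classifier. The regular expressions are modeled on the number patterns in Clojure's LispReader.java as best recalled, so treat them as an assumption to check against the source. `classify_number` and the pattern names are hypothetical, and this is not what Calva or the generators mentioned here actually do.

```python
import re
from typing import Optional

# Integers: 0, decimal, hex (0x..), octal (0 + octal digits), or radix (NNr..).
# The radix prefix must start with 1-9, so "02r101" is not a radix number.
INT_RE = re.compile(
    r"[-+]?(?:0|[1-9][0-9]*|0[xX][0-9A-Fa-f]+|0[0-7]+"
    r"|[1-9][0-9]?[rR][0-9A-Za-z]+)N?"
)
# A leading zero followed by digits that do not form a valid octal, e.g. "08".
BAD_OCTAL_RE = re.compile(r"[-+]?0[0-9]+N?")
# Ratios and doubles put no restriction on leading zeros.
RATIO_RE = re.compile(r"[-+]?[0-9]+/[0-9]+")
FLOAT_RE = re.compile(r"[-+]?[0-9]+(?:\.[0-9]*)?(?:[eE][-+]?[0-9]+)?M?")


def classify_number(text: str) -> Optional[str]:
    """Return 'integer', 'ratio', 'float', or None if text is not a number."""
    if INT_RE.fullmatch(text):
        return "integer"
    if BAD_OCTAL_RE.fullmatch(text):
        return None  # looks like an integer with a leading zero, but invalid
    if RATIO_RE.fullmatch(text):
        return "ratio"
    if FLOAT_RE.fullmatch(text):
        return "float"
    return None


if __name__ == "__main__":
    for sample in ("007.5", "001/2", "2r101", "02r101", "017", "08", "0x1F"):
        print(sample, classify_number(sample))
```

In this sketch, multiple leading zeros are accepted only for doubles and ratios, and a radix prefix starting with zero is rejected, which matches the tentative observations in the comment above.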
Maybe these have all been addressed? I'll close this for now :)
TBH, I don't know if all has been addressed. But as I recall things, I addressed most of it. It was super valuable to Calva. Belated thanks!
As discussed in #calva, we'll use this issue to collect some parsing issues that are discovered for the time being.
Handling of number before discard / ignore marker
In clj I get the following results:
So it appears that in each case, 1 (or +1) is being recognized as a number, and then starting from #_, there is a discard expression that extends to include the 2.
I tried a similar sequence for Calva's clojure-lexer and found:
It looks like in these cases #_ is being seen as part of what comes immediately before.
Note that:
seems correct, as in clj, one gets:
So far I think for numbers and delimiters of collections, one doesn't need to put a space before #_ for there to be appropriate recognition of a following discard expression.
However, for characters, symbols, keywords, and symbolic values (e.g. ##NaN), not having a space makes a difference in what ends up being recognized.
It may be obvious, but this analysis may not be complete.
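The difference described above between numbers and symbols in front of #_ can be sketched with a toy tokenizer. This is a deliberately simplified Python illustration: `TOKEN_RE`, `tokenize`, and the token names are hypothetical, and the rules cover only integers, the discard marker, and symbol-like tokens (keywords are lumped in with symbols). It is neither Calva's clojure-lexer nor the Clojure reader.

```python
import re

# Order matters: the discard marker and numbers are tried before symbols.
# A symbol may not START with a digit or '#', but may CONTAIN '#' and '_',
# which is why "false#_foo" stays one token while "1#_2" splits in three.
TOKEN_RE = re.compile(
    r"""
      (?P<ws>[\s,]+)
    | (?P<discard>\#_)
    | (?P<number>[+-]?\d+)
    | (?P<symbol>[^\s,\d#()\[\]{}"@~^;`\\][^\s,()\[\]{}"@~^;`\\]*)
    """,
    re.VERBOSE,
)


def tokenize(src):
    """Split src into (kind, text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            # Anything the toy rules do not cover is reported as junk.
            tokens.append(("junk", src[pos]))
            pos += 1
            continue
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for sample in ("1#_2", "+1#_2", "false#_foo", "false #_foo"):
        print(sample, tokenize(sample))
```

With these rules, a number ends at #_ without needing a space, while a symbol absorbs #_ unless a space separates them.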