json to g4 only with "parser" cause some syntax error #43

Open
0x7Fancy opened this issue Jan 18, 2024 · 3 comments
Labels
bug, enhancement, help wanted

Comments

@0x7Fancy
Contributor

In my experimental environment, I found that converting JSON to g4 with only parser rules causes some syntax errors in antlr4. These parsing errors may cause a large amount of mutated data to be lost.

I made a minimal test case, lex.json:

{
    "<A>": [["<NUMBER>", "<STRING>", "\n"]],
    "<NUMBER>": [["10"], ["99"]],
    "<STRING>": [["(", "<HEXSTRING>", ")"]],
    "<HEXSTRING>": [["<CHAR>", "<HEXSTRING>"], []],
    "<CHAR>": [
            ["0"], ["1"], ["2"], ["3"], ["4"], ["5"], ["6"], ["7"],
            ["8"], ["9"], ["a"], ["b"], ["c"], ["d"], ["e"], ["f"]
    ]
}

Grammar-Mutator builds it, and the generated Grammar.g4 is:

grammar Grammar;
entry
    : node_A EOF
    ;
node_A
    : node_NUMBER node_STRING '\n'
    ;
node_NUMBER
    : '10'
    | '99'
    ;
node_STRING
    : '(' node_HEXSTRING ')'
    ;
node_HEXSTRING
    : 
    | node_CHAR node_HEXSTRING
    ;
node_CHAR
    : '0'
    | '1'
    | '2'
    | '3'
    | '4'
    | '5'
    | '6'
    | '7'
    | '8'
    | '9'
    | 'a'
    | 'b'
    | 'c'
    | 'd'
    | 'e'
    | 'f'
    ;

We prepared the input seeds seed1 and seed2, and tested them with antlr4-parse:

[Screenshot: antlr4-parse output for seed1 and seed2 with Grammar.g4]

Why is 10(10) parsed incorrectly? Because antlr4 works in two stages: lexer and parser. During the lexer stage, the literal '10' from node_NUMBER is recognized as a single token, so in the parser stage the input looks like node_NUMBER ( node_NUMBER ) and an error occurs.

In antlr4 grammars, lexer rule names begin with an uppercase letter and parser rule names begin with a lowercase letter, so we should tell antlr4 the lexical rules explicitly. The patched Grammar_patch.g4:
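The lexer behaviour described above can be illustrated with a small Python sketch. It is not antlr4 itself: it only assumes that every quoted literal in Grammar.g4 becomes an implicit token and that the lexer always takes the longest match, with ties going to the literal listed first. The `LITERALS` list and the `tokenize` helper are hypothetical names used for illustration, and 10(ab) is an extra input chosen here for contrast.

```python
# Literals that appear in Grammar.g4. In a grammar with only parser rules,
# each quoted literal is treated as its own implicit token type.
LITERALS = ["\n", "10", "99", "(", ")"] + list("0123456789abcdef")


def tokenize(text):
    """Greedy longest-match tokenization; ties go to the earliest literal."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for lit in LITERALS:
            if text.startswith(lit, pos) and (best is None or len(lit) > len(best)):
                best = lit
        if best is None:
            raise ValueError(f"no token matches at offset {pos}")
        tokens.append(best)
        pos += len(best)
    return tokens


# Inside the parentheses, "10" is lexed as the single token '10',
# not as the two node_CHAR tokens '1' and '0' that node_HEXSTRING needs.
print(tokenize("10(10)\n"))  # ['10', '(', '10', ')', '\n']
print(tokenize("10(ab)\n"))  # ['10', '(', 'a', 'b', ')', '\n']
```

In this simplified model the first input cannot be derived from node_A, because node_HEXSTRING only accepts single-character node_CHAR tokens, while the second input can.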

grammar Grammar_patch;
entry
    : node_A EOF
    ;
node_A
    : node_NUMBER Node_STRING '\n'
    ;
node_NUMBER
    : '10'
    | '99'
    ;
Node_STRING
    : '(' Node_HEXSTRING ')'
    ;
Node_HEXSTRING
    : 
    | Node_CHAR Node_HEXSTRING
    ;
Node_CHAR
    : '0'
    | '1'
    | '2'
    | '3'
    | '4'
    | '5'
    | '6'
    | '7'
    | '8'
    | '9'
    | 'a'
    | 'b'
    | 'c'
    | 'd'
    | 'e'
    | 'f'
    ;

Testing again:

[Screenshot: antlr4-parse output with Grammar_patch.g4]

The "warning" tells us that Node_HEXSTRING can match the empty string, which may cause parsing problems in antlr4, but we can easily fix this by marking the rule as a fragment: fragment Node_HEXSTRING

Maybe we can improve the JSON-to-g4 generation code so that it distinguishes between lexer and parser rules?
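As a rough sketch of what such a generator could look like, the Python below converts the lex.json grammar to g4 and lets the caller say which nonterminals should become lexer rules. This is not Grammar-Mutator's actual conversion code: `to_g4`, `rule_name`, `quote`, and the `lexer_rules` parameter are hypothetical names, and the choice of which rules are lexical is left to the user instead of being inferred.

```python
import re

# Same grammar as lex.json in this issue, written as a Python dict.
GRAMMAR = {
    "<A>": [["<NUMBER>", "<STRING>", "\n"]],
    "<NUMBER>": [["10"], ["99"]],
    "<STRING>": [["(", "<HEXSTRING>", ")"]],
    "<HEXSTRING>": [["<CHAR>", "<HEXSTRING>"], []],
    "<CHAR>": [[c] for c in "0123456789abcdef"],
}

NONTERMINAL = re.compile(r"^<(.+)>$")


def rule_name(symbol, lexer_rules):
    """Map "<X>" to node_X (parser rule) or Node_X (lexer rule)."""
    name = "node_" + NONTERMINAL.match(symbol).group(1)
    return "N" + name[1:] if symbol in lexer_rules else name


def quote(terminal):
    """Render a terminal as a single-quoted ANTLR literal."""
    escaped = (terminal.replace("\\", "\\\\")
                       .replace("'", "\\'")
                       .replace("\n", "\\n"))
    return "'" + escaped + "'"


def to_g4(grammar, entry, lexer_rules=frozenset(), name="Grammar"):
    """Hypothetical converter; lexer_rules is chosen by the caller."""
    out = [f"grammar {name};", "entry",
           f"    : {rule_name(entry, lexer_rules)} EOF", "    ;"]
    for symbol, alternatives in grammar.items():
        rendered = []
        for alt in alternatives:
            parts = []
            for item in alt:
                if NONTERMINAL.match(item):
                    # ANTLR lexer rules may not reference parser rules.
                    if symbol in lexer_rules and item not in lexer_rules:
                        raise ValueError(
                            f"lexer rule {symbol} references parser rule {item}")
                    parts.append(rule_name(item, lexer_rules))
                else:
                    parts.append(quote(item))
            rendered.append(" ".join(parts))
        out.append(rule_name(symbol, lexer_rules))
        out.append("    : " + "\n    | ".join(rendered))
        out.append("    ;")
    return "\n".join(out) + "\n"


# Emit <STRING>, <HEXSTRING> and <CHAR> as lexer rules, as in Grammar_patch.g4.
print(to_g4(GRAMMAR, "<A>", {"<STRING>", "<HEXSTRING>", "<CHAR>"}, "Grammar_patch"))
```

The sketch keeps alternatives in JSON order and does not add the fragment marker, so the empty-string warning for Node_HEXSTRING would still need to be handled separately.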

@h1994st
Collaborator

h1994st commented Jan 18, 2024

If you have time to work on this, that would be great! Feel free to submit a PR and add your test cases.

I can take a look at this later when I am available. Probably not in the coming weeks :(

h1994st added the bug, enhancement, and help wanted labels Jan 18, 2024
@0x7Fancy
Contributor Author

Okay, I'll try my best to provide one.

@0x7Fancy
Contributor Author

I tried to solve the problem, but I found that Grammar-Mutator relies on the AST to work, and adding lexical rules is not transparent to Grammar-Mutator.

In the above example, consider the input 10(12) parsed with Grammar.g4 and with Grammar_patch.g4 respectively. We can see that only Grammar.g4, which contains just parser rules, yields a complete AST structure (in line with what Grammar-Mutator expects).

I think this is a trade-off in the Grammar-Mutator design. It loses part of the mutation data, but the program design is more concise, clear and direct.

So, as it currently stands, Grammar-Mutator is perfect.
