Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error using MANY with OR alternatives: NoViableAltException #867

Closed
jabney opened this issue Dec 14, 2018 · 11 comments
Closed

Error using MANY with OR alternatives: NoViableAltException #867

jabney opened this issue Dec 14, 2018 · 11 comments

Comments

@jabney
Copy link

jabney commented Dec 14, 2018

I'm having an issue with a NoViableAltException error when using two different (but similar) alternatives in an OR rule. It looks like a bug but it's quite possibly user error.

Chevrotain version 4.1.0
Node version 8.9.2

I've simplified my grammar to highlight the issue:

/**
 * The entry rule.
 */
this.RULE('command', () => {
  const command = this.OR([
    { ALT: () => this.SUBRULE(this.replace) },
    { ALT: () => this.SUBRULE(this.repeal) },
  ])

  return command
})

/**
 * First alternative.
 */
this.RULE('replace', () => {
  const sectionType = this.CONSUME(SectionName).image
  const pointRef = this.SUBRULE(this.pointRef)
  this.CONSUME(Replaced)

  return {
    commandType: 'replace',
    sectionType,
    pointRef,
  }
})

/**
 * Second alternative.
 */
this.RULE('repeal', () => {
  const sectionType = this.CONSUME(SectionName).image
  const pointRef = this.SUBRULE(this.pointRef)
  this.CONSUME(Repealed)

  return {
    commandType: 'repeal',
    sectionType,
    pointRef,
  }
})

// Section, Subsection, etc.
this.RULE('sectionType', () => {
  const token = this.CONSUME(SectionName).image
  return token
})

// 2, 5(3), 4(7)(b)
this.RULE('pointRef', () => {
  const pointRef = []

  // Point refs begin with a section ref.
  pointRef.push(this.CONSUME(SectionRef).image)

  /**
   * A section ref is followed by zero or more subrefs.
   * Causes NoViableAltException error.
   */
  this.MANY(() => {
    const subref = this.CONSUME(SubRef).image
    pointRef.push(subref)
  })

  /**
   * Same error as MANY above (NoViableAltException).
   */
  // this.OPTION(() => {
  //   this.AT_LEAST_ONE(() => {
  //     const subref = this.CONSUME(SubRef).image
  //     pointRef.push(subref)
  //   })
  // })

  return pointRef
})

Input:

const inputs = [
  'Section 5 is replaced',          // success
  'Subsection 2(2) is repealed',    // success
  'Paragraph 4(a)(i) is replaced',  // failure
]

The result is a NoViableAltException on the third input.

Result:

[ { [NoViableAltException: Expecting: one of these possible Token sequences:
  1. [SectionName, SectionRef, Replaced]
  2. [SectionName, SectionRef, SubRef, Replaced]
  3. [SectionName, SectionRef, Repealed]
  4. [SectionName, SectionRef, SubRef, Repealed]
but found: 'Paragraph']

Tokens (token names):

[ 'SectionName', 'SectionRef', 'SubRef', 'SubRef', 'Replaced' ]

However if I remove one of the alternatives in the command rule, the third input works fine:

const command = this.OR([
  { ALT: () => this.SUBRULE(this.replace) },
  // { ALT: () => this.SUBRULE(this.repeal) },
])

I'll include a complete executable example in a follow-up comment.

Apologies in advance if it's not a bug and I'm just missing something that's already documented.

@jabney
Copy link
Author

jabney commented Dec 14, 2018

Executable example:

(standalone: copy, paste, run)

const {
  createToken,
  Lexer,
  Parser,
 } = require('chevrotain') // Version 4.1.0

const Whitespace = createToken({
  name: 'Whitespace',
  pattern: /\s+/,
  group: Lexer.SKIPPED,
})

const SectionName = createToken({
  name: 'SectionName',
  pattern: /section|subsection|paragraph|subparagraph|clause|subclause/i,
})

const SectionRef = createToken({
  name: 'SectionRef',
  pattern: /\d+/,
})

const SubRef = createToken({
  name: 'SubRef',
  pattern: /\(\w+\)/,
})

const Repealed = createToken({
  name: 'Repealed',
  pattern: /is repealed/i,
})

const Replaced = createToken({
  name: 'Replaced',
  pattern: /is replaced/i,
})

const tokenList = [
  Whitespace,
  SectionName,
  SectionRef,
  SubRef,
  Repealed,
  Replaced,
]

const tokenVocabulary = tokenList.reduce((map, tokenType) => {
  map[tokenType.name] = tokenType
  return map
}, {})

class CommandParser extends Parser {

  constructor() {
    super(tokenVocabulary, { outputCst: false })

    /**
     * The root rule.
     */
    this.command = null

    /**
     * Main rules.
     */
    this.replace = null
    this.repeal = null

    /**
     * Subrules.
     */
    this.pointRef = null

    /**
     * The entry rule.
     */
    this.RULE('command', () => {
      const command = this.OR([
        { ALT: () => this.SUBRULE(this.replace) },
        { ALT: () => this.SUBRULE(this.repeal) },
      ])

      return command
    })

    /**
     * First alternative.
     */
    this.RULE('replace', () => {
      const sectionType = this.CONSUME(SectionName).image
      const pointRef = this.SUBRULE(this.pointRef)
      this.CONSUME(Replaced)

      return {
        commandType: 'replace',
        sectionType,
        pointRef,
      }
    })

    /**
     * Second alternative.
     */
    this.RULE('repeal', () => {
      const sectionType = this.CONSUME(SectionName).image
      const pointRef = this.SUBRULE(this.pointRef)
      this.CONSUME(Repealed)

      return {
        commandType: 'repeal',
        sectionType,
        pointRef,
      }
    })

    // Section, Subsection, etc.
    this.RULE('sectionType', () => {
      const token = this.CONSUME(SectionName).image
      return token
    })

    // 2, 5(3), 4(7)(b)
    this.RULE('pointRef', () => {
      const pointRef = []

      // Point refs begin with a section ref.
      pointRef.push(this.CONSUME(SectionRef).image)

      /**
       * A section ref is followed by zero or more subrefs.
       * Causes NoViableAltException error.
       */
      this.MANY(() => {
        const subref = this.CONSUME(SubRef).image
        pointRef.push(subref)
      })

      /**
       * Same error as MANY above (NoViableAltException).
       */
      // this.OPTION(() => {
      //   this.AT_LEAST_ONE(() => {
      //     const subref = this.CONSUME(SubRef).image
      //     pointRef.push(subref)
      //   })
      // })

      return pointRef
    })

    this.performSelfAnalysis()
  }
}

// The single parser instance.
const parser = new CommandParser()

/**
 * Lex and parse the input, return the parsed command.
 */
function parseCommand(input) {
  const commandLexer = new Lexer(tokenList)
  const lexingResult = commandLexer.tokenize(input)
  const tokenNames = lexingResult.tokens.map((token) => token.tokenType.name)
  const parser = new CommandParser()

  parser.input = lexingResult.tokens
  const command = parser.command()

  if (parser.errors.length > 0) {
    console.log('error tokens:', tokenNames)
    console.log(parser.errors)
    return
  }

  console.log('success tokens:', tokenNames)
  return command
}

// Only run from CLI
if (!module.parent) {

  const inputs = [
    'Section 5 is replaced',          // success
    'Subsection 2(2) is repealed',    // success
    'Paragraph 4(a)(i) is replaced',  // failure
  ]

  for (const input of inputs) {
    const result = parseCommand(input)

    if (result) {
      console.log('success result:', result)
    }
  }
}

Output

success tokens: [ 'SectionName', 'SectionRef', 'Replaced' ]
success result: { 
  commandType: 'replace',
  sectionType: 'Section',
  pointRef: [ '5' ] 
}

success tokens: [ 'SectionName', 'SectionRef', 'SubRef', 'Repealed' ]
success result: { 
  commandType: 'repeal',
  sectionType: 'Subsection',
  pointRef: [ '2', '(2)' ] 
}

error tokens: [ 'SectionName', 'SectionRef', 'SubRef', 'SubRef', 'Replaced' ]

[ { [NoViableAltException: Expecting: one of these possible Token sequences:
  1. [SectionName, SectionRef, Replaced]
  2. [SectionName, SectionRef, SubRef, Replaced]
  3. [SectionName, SectionRef, Repealed]
  4. [SectionName, SectionRef, SubRef, Repealed]
but found: 'Paragraph']
    name: 'NoViableAltException',
    message: 'Expecting: one of these possible Token sequences:\n  1. [SectionName, SectionRef, Replaced]\n  2. [SectionName, SectionRef, SubRef, Replaced]\n  3. [SectionName, SectionRef, Repealed]\n  4. [SectionName, SectionRef, SubRef, Repealed]\nbut found: \'Paragraph\'',
    token:
     { image: 'Paragraph',
       startOffset: 0,
       endOffset: 8,
       startLine: 1,
       endLine: 1,
       startColumn: 1,
       endColumn: 9,
       tokenTypeIdx: 3,
       tokenType: [Object] },
    previousToken:
     { image: '',
       startOffset: NaN,
       endOffset: NaN,
       startLine: NaN,
       endLine: NaN,
       startColumn: NaN,
       endColumn: NaN,
       tokenTypeIdx: 1,
       tokenType: [Object] },
    resyncedTokens: [],
    context: { ruleStack: [Array], ruleOccurrenceStack: [Array] } } ]

@jabney
Copy link
Author

jabney commented Dec 14, 2018

With either of the alternative rules alone, it works fine:

this.RULE('command', () => {
  const command = this.OR([
    { ALT: () => this.SUBRULE(this.replace) },
    // { ALT: () => this.SUBRULE(this.repeal) },
])

const inputs = [
  'Section 5 is replaced',          // success
  // 'Subsection 2(2) is repealed',
  'Paragraph 4(a)(i) is replaced',  // success
]

/* Output
success tokens: [ 'SectionName', 'SectionRef', 'Replaced' ]
success result: { 
  commandType: 'replace',
  sectionType: 'Section',
  pointRef: [ '5' ] 
}
success tokens: [ 'SectionName', 'SectionRef', 'SubRef', 'SubRef', 'Replaced' ]
success result: { 
  commandType: 'replace',
  sectionType: 'Paragraph',
  pointRef: [ '4', '(a)', '(i)' ] 
}
*/

or:

this.RULE('command', () => {
  const command = this.OR([
    // { ALT: () => this.SUBRULE(this.replace) },
    { ALT: () => this.SUBRULE(this.repeal) },
])

const inputs = [
  // 'Section 5 is replaced',
  'Subsection 2(2) is repealed',    // success
  // change to repealed ---v
  'Paragraph 4(a)(i) is repealed',  // success
]

/* Output
success tokens: [ 'SectionName', 'SectionRef', 'SubRef', 'Repealed' ]
success result: {
  commandType: 'repeal',
  sectionType: 'Subsection',
  pointRef: [ '2', '(2)' ]
 }
success tokens: [ 'SectionName', 'SectionRef', 'SubRef', 'SubRef', 'Repealed' ]
success result: { 
  commandType: 'repeal',
  sectionType: 'Paragraph',
  pointRef: [ '4', '(a)', '(i)' ] 
}
*/

@bd82
Copy link
Member

bd82 commented Dec 14, 2018

That's for reporting this @jabney particularly for providing an easy to reproduce example. 👍

I believe there are several issues here:

1. Grammar Ambiguity

Chevrotain is an LL(K) parseing toolkit. This means that it can look upto K tokens
ahead to distinguish alternatives. The grammar you provided however would in theory
require infinite lookahead to decide between the two alternatives as the distinguishing token
appears after a (possibly) infinite repetition.

e.g:

main:
alt1 |
alt2
;

alt1:
A (B)* C

alt2:
A (B)* D

So for any K = N we could create an input with n-1 "B" tokens
so that the lookahead would be too small to "peek ahead" until the distinguishing "C" or "D" tokens.

To resolve this you could either:

  • refactor your grammar to be LL(K) by extracting the common infinite prefix outside the alternatives.
       main:
          A B* (alt1 | alt2)
       ;
    
       alt1:
          C
        ;
        
      alt2:
         D
        ;
  • Or use backtracking

2. Failed ambiguity detection.

This ambiguity should be detected, and the parser should throw an error during initialization it is a bug, see: #854

3. This kind of ambiguity is not documented on the website:

See: #853

4. The lookahead calculation is missing one possible path.

This looks like a bug too.

// [ { NoViableAltException: Expecting: one of these possible Token sequences:
//     1. [SectionName, SectionRef, Replaced]
//     2. [SectionName, SectionRef, SubRef, Replaced]
//     3. [SectionName, SectionRef, Repealed]
//     4. [SectionName, SectionRef, SubRef, Repealed]

So while by default (K=4) so we would expect the list of possible tokens to include.

[SectionName, SectionRef, SubRef, SubRef]

of course either alternative could start with this sequence, so in the case of the ambiguity
the first alternative should be chosen, However if that sequence was not accounted for
it means that even if you choose to use backtracking, it would still throw the NoViableAltExceptin.
That is because backtracking is implemented using GATES, and gates are an additional condition to the regular lookahead, not a replacement, so the regular lookahead must still be able to resolve, even in the case of an ambiguity.

Suggestion / Workaround

At this point you will need to refactor the grammar to be LL(K).
The other option of backtracking would not work (due to the bug), and anyhow backtracking is the less recommended approach.

I will prioritize investigating the bugs described here, because their combination
is causing strange and complex behavior and I suspect they are even related.
While I am at it I would try to improve the documentation for that ambiguity as well.

@jabney
Copy link
Author

jabney commented Dec 14, 2018

@bd82 your thoughtful explanation has helped me understand some issues with my grammar that would have bitten me later, namely that the SubRef tokens can go even deeper, e.g., (1)(3)(a)(iii)..., which was the reason for wanting to use MANY in the first place. I presume this would mean needing to increase maxLookahead to account for it, even with the suggested refactoring.

I will try the LL(k) refactoring you suggested and see how that goes with the default lookahead. My actual grammar is quite a bit more complex and arbitrary than the provided example suggests, with several disparate and non-regular input types, and this might be tricky. Alternately, I suspect that I can move the SubRef repetition to the token as well, that is:

const SubRef = createToken({
  name: 'SubRef',
  pattern: /\(\w+\)+/,
  // --------------^-- arbitrary repetition
})

and then handle extracting the values a little differently in the logic. However I will study your refactoring example and see if I can think more in LL(k) when constructing the grammar, as this is really the best approach.

Thanks again for your detailed answer, and do let me know if there's anything else I can help with for reproducing the bugs.

All the best,
James

@bd82
Copy link
Member

bd82 commented Dec 14, 2018

I presume this would mean needing to increase maxLookahead to account for it, even with the suggested refactoring.

If the common prefix could be infinitely long than no amount of lookahead would suffice.
You will need either backtracking or refactoring.

I am using backtracking myself in the development of a Java Parser because
I'm trying to keep the parser aligned with the specifications, but that would come at the cost of performance and error recovery accuracy. So you could evaluate backtracking too (once the related bug is fixed), just don't go overboard with it 😄

However I will study your refactoring example and see if I can think more in LL(k) when constructing the grammar, as this is really the best approach.

I tend to agree with the statement, while some parsing tools have greater capabilities (e.g handling left recursion or LL(*) grammars) there is an advantage to keeping things simple particularly when considering the ease of others to develop tools for a language you have created.

For example see the Python approach to grammar design:

The parser won't be more complex than LL(1).
Simple is better than complex. This idea extends to the parser. Restricting Python's grammar to an LL(1) parser is a blessing, not a curse. It puts us in handcuffs that prevent us from going overboard and ending up with funky grammar rules like some other dynamic languages that will go unnamed, such as Perl.

@jabney
Copy link
Author

jabney commented Dec 14, 2018

...funky grammar rules like some other dynamic languages that will go unnamed, such as Perl.

I couldn't keep from laughing at that.

If the common prefix could be infinitely long than no amount of lookahead would suffice.
You will need either backtracking or refactoring.

In theory it could be infinite, but empirically the depth goes up to 6 known repetitions. Even still, I don't want to jack up lookahead as I need to process potentially hundreds of inputs as quickly as possible unless I decide to implement some sort of preprocessing and caching mechanism as a last resort. In some of my early performance checks I was able to process 100's of inputs in 20-30 ms, iirc, and that is quite good for my use case.

So you could evaluate backtracking too (once the related bug is fixed), just don't go overboard with it 😄

I'll keep an eye out for an updated version and look into how to make use of some backtracking if needed.

Cheers!

@bd82 bd82 closed this as completed in ed4c75e Dec 14, 2018
@bd82
Copy link
Member

bd82 commented Dec 14, 2018

I couldn't keep from laughing at that.

😄

as I need to process potentially hundreds of inputs as quickly as possible

Do you start a new process for each input or use the same process and parser instance?
There are startup and warmup costs, some of those can be avoided if needed...

I'll keep an eye out for an updated version and look into how to make use of some backtracking if needed.

Version 4.1.1 should be on npm once the travis build finishes.
https://travis-ci.org/SAP/chevrotain/builds/468114918

I will check this later to ensure it succeeded.

@jabney
Copy link
Author

jabney commented Dec 14, 2018

Do you start a new process for each input or use the same process and parser instance?
There are startup and warmup costs, some of those can be avoided if needed...

I create a single instance of the parser and then throw all the inputs at that single instance:

class CommandParser extends Parser { ... }

// The single parser instance.
const parser = new CommandParser()

export function parseCommand(input: string): ICommand {
  const { tokens } = tokenize(input)
  parser.input = tokens
  const command = parser.command()

  if (parser.errors.length > 0) {
    throw new ParseError(input, parser.errors)
  }

  return command
}

...

// Create some inputs and parse them for quick debugging. 
if (!module.parent) {
  const inputs: string[] = [ ... ]

  for (const input if inputs) {
    const result = parseCommand(input)
    console.log(result)
  }
}

BTW I just ran 676 inputs in ~24 milliseconds using my previous attempt at a grammar:

number of inputs: 676
time in ms: 24.2990
average ms per input: 0.0359

For my purposes this exceeds expectations. 😀

Version 4.1.1 should be on npm once the travis build finishes.

That was fast! 🥇

@bd82
Copy link
Member

bd82 commented Dec 14, 2018

BTW I just ran 676 inputs in ~24 milliseconds using my previous attempt at a grammar:
For my purposes this exceeds expectations. 😀

I am glad the performance is adequate enough.
Although, 676 inputs does not mean much to me (lines of code/ KB size).

Chevrotain is quite fast, see: https://sap.github.io/chevrotain/performance/
Of course this depends on the complexity of the grammar and how optimized it is as well.

That was fast! 🥇

That bug seems to have produced multiple strange behaviors so it was time to squash it 🐞

@jabney
Copy link
Author

jabney commented Dec 14, 2018

Although, 676 inputs does not mean much to me (lines of code/ KB size).

Fair enough. For me it means processing 1,000 inputs in roughly 36 ms depending on my final grammar. 😂

Chevrotain is quite fast, see: https://sap.github.io/chevrotain/performance/

That's one of the reasons I decided to use it. I need fast. The other reasons were it's easy to get up and running, and the online docs and tutorial are excellent especially for one just starting out.

BTW I can confirm that with 4.1.1 the ambiguity in my sample grammar is detected properly and the NoViableAltException error is gone.

Error: Parser Definition Errors detected:
 Ambiguous alternatives: <1 ,2> in <OR> inside <command> Rule,
<SectionName, SectionRef, SubRef, SubRef> may appears as a prefix path in all these alternatives.
To Resolve this, try one of of the following:
1. Refactor your grammar to be LL(K) for the current value of k (by default k=4})
2. Increase the value of K for your grammar by providing a larger 'maxLookahead' value in the parser's config
3. This issue can be ignored (if you know what you are doing...), see https://sap.github.io/chevrotain/documentation/4_1_1/interfaces/iparserconfig.html#ignoredissues for more details

Thanks much for the bug fix and all the help. It's greatly appreciated. 😎

@bd82
Copy link
Member

bd82 commented Dec 14, 2018

The other reasons were it's easy to get up and running, and the online docs and tutorial are excellent especially for one just starting out

Great, glad to hear 😄

BTW I can confirm that with 4.1.1 the ambiguity in my sample grammar is detected properly and the NoViableAltException error is gone.

Thanks for confirmation.

Thanks much for the bug fix and all the help. It's greatly appreciated. 😎

You are welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants