This is an archive of the research done by Erik Rose and Peter Potrowl to find, and ultimately build, a good Python parser for MediaWiki's syntax.

Goals

  • Possible (to represent MediaWiki syntax). There is no such thing as invalid wiki markup, so we have to make some sense out of anything.
  • Extensible per project (for {for} syntax, DekiWiki conditionals, includes, templates, etc.)
  • Easier to comprehend than MW's existing formatter
  • Must work with Unicode input (Japanese, for example)

Ungoals

  • Implement those MediaWiki language features which make no sense outside the MediaWiki product: for example, category links.
  • Be bug-for-bug compatible with MW. We can't constrain ourselves to give the exact same output for egregiously bad wikisyntax; doing so would make a mess of the code, which kills our goal of being easy to understand. We should catch common errors like bad quote nesting but not go out of our minds trying to be bug-for-bug compatible. After all, MW itself changes the language in every release. Let them chase us.

MW language properties

Parser libs

See:

In the following lists, (+) signifies a pro, (-) a con, and (.) a neutral point.

LEPL

PLY

  • (.) LALR or, optionally, SLR (can SLR look ahead farther? No: it actually has no lookahead)
  • (-) LALR(1), which is grossly insufficient for MW. MW seems to need a lookahead of about 4, which we'd have to take care of ourselves, probably making the code hard to comprehend in the process.
  • (+) Modal lexer and parser. (Is this necessary? Understand what can't be BNF'd about apostrophe jungles and lists.)
  • (+) Easy to translate into C later
  • (.) Can turn off magic autodiscovery
  • (-) Potential for yacc to guess wrong about how to assemble symbols: reduce/reduce and shift/reduce errors
  • (+) Much faster than PyParsing? http://www.mefeedia.com/watch/29412150 at 23:58 suggests it's 5.5x faster. More benchmarks (ANTLR and more): http://www.dalkescientific.com/writings/diary/archive/2007/11/03/antlr_java.html
  • (-) A bit more verbose (but very clear)
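
To make the trade-offs concrete, here is a minimal PLY sketch; the two-token grammar (``''italics''`` runs inside plain text) is ours, invented for illustration, and has nothing to do with MediaWiki's real grammar::

    # Toy PLY example (illustrative only): a lexer with two tokens and an
    # LALR(1) parser that turns ''...'' runs into ('italic', ...) nodes.
    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ('APOSTROPHES', 'TEXT')

    t_APOSTROPHES = r"''+"   # a run of two or more apostrophes
    t_TEXT = r"[^']+"        # anything that isn't an apostrophe

    def t_error(t):
        t.lexer.skip(1)

    def p_wikitext(p):
        """wikitext : wikitext node
                    | node"""
        p[0] = p[1] + [p[2]] if len(p) == 3 else [p[1]]

    def p_node_italic(p):
        """node : APOSTROPHES TEXT APOSTROPHES"""
        p[0] = ('italic', p[2])

    def p_node_text(p):
        """node : TEXT"""
        p[0] = ('text', p[1])

    def p_error(p):
        pass  # with one token of lookahead, real error recovery is the hard part

    lexer = lex.lex()
    parser = yacc.yacc()
    print(parser.parse("plain ''italic'' more", lexer=lexer))
    # [('text', 'plain '), ('italic', 'italic'), ('text', ' more')]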

PyParsing

  • (.) Recursive descent (LL) of PEGs
  • (+) Packrat, so O(n)
  • (+) Easy to write
  • (-) "[An LL(k) parser] may defer error detection to a different branch of the grammar due to backtracking, often making errors harder to localize across disjunctions with long common prefixes."—Wikipedia. I had that problem when writing a simple italics/bold parser: you have to keep the recursion stack in your head to make any sense of the debug info. I eventually gave up trying to fix it.

PyBison

Not researched in depth.

  • (+) Claims to be nearly as fast as C
  • (-) Requires a C build step

ANTLR

  • (-) Separate code generation step
  • (-) Slow because it generates a lot of function calls
  • (+) Can parse LL(k) grammars (arbitrary lookahead)

SPARK

  • (+) Has an implementation of an Earley parser, which can handle arbitrary lookahead, with O(n³) worst-case running time.

NLTK

  • (+) Another Earley parser
  • (+) Long-lived. Under active development by multiple authors. Last released 4/2011.
  • (.) There's a good, free book about the project: http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html. Not sure how good the documentation about the code itself is, though.
  • (-) An enormous dependency
  • (.) Untested
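
For a quick feel of the chart-parsing approach, here is a toy NLTK grammar of ours (not MediaWiki's); a chart parser handles arbitrary CFGs, including ones that would give an LALR(1) generator conflicts::

    # Toy NLTK example (illustrative only): parse pre-tokenized input with a
    # small CFG using NLTK's chart parser.
    import nltk

    grammar = nltk.CFG.fromstring("""
      S -> S NODE | NODE
      NODE -> QUOTES TEXT QUOTES | TEXT
      QUOTES -> "''"
      TEXT -> "plain" | "italic"
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(["plain", "''", "italic", "''"]):
        print(tree)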

Pyggy

  • (.) GLR parser
  • (+) Public domain
  • (-) Might be dead (the home page has disappeared: http://www.lava.net/~newsham/pyggy/)
  • (-) "PyGgy was written and tested with Python 2.2.3." (in 2003)

Pijnu

  • (+) PEG. Easy, easy grammar definition.
  • (.) Looks promising but not mature. Author has given no thought to speed but much to clarity.
  • (-) Build step
  • (-) Currently no Unicode support
  • (+) Great docs: http://spir.wikidot.com/pijnu-user-guide
  • (+) Great error feedback
  • (+) The generated code looks like what you have to hand-write for PyParsing (see the user guide).
  • (+) Can handle having Unicode chars in the input.
  • (.) Can it handle having Unicode chars as part of parse rules? We might need guillemets.
  • (-) Eek, no tests! Throws DeprecationWarnings on import. Highly idiosyncratic coding style.
  • (.) PEG. Grammar defined in a DSL.
  • (+) No build step; converts grammar from a DSL at runtime.
  • (+) Good docs in the code
  • (-) Nobody's touched it for a year.
  • (.) Is a port of PyMeta to "the simplified OMeta 2 syntax" (new DSL syntax).
  • (-) Not in Python: the Python code (21 kB) is just an API for a C parser (172 kB)
  • (.) Only 340 lines of Python
  • (-) Similar to Pijnu but much less easy to use

Previous implementations

See: http://www.mediawiki.org/wiki/Alternative_parsers

  • (+) Probably works (untested)
  • (-) Direct transformation from wikitext to HTML (generates no AST)
  • (-) As a direct port of MW's PHP parser, it is very difficult to understand or extend.
  • (-) Because it is based on a sequence of perilously combined regexes which interact in surprising ways, it, like MW proper, sometimes yields surprising output.

mwlib

  • (+) Works well; lots of unit tests already defined and passing
  • (+) Generates an AST
  • (.) Implements its own lexer/parser (see mwlib/refine/core.py and mwlib/refine/_core.pyx: compiled token walker)
  • (.) Seems to tokenize the text and then apply ~20 different parsers one by one (see mwlib/refine/core.py#928 and #635)
  • (-) The structure of the code is somewhat hard to understand (uparser.py vs old_uparser.py, etc.)
  • (-) A lot of code not related to parsing (fetching articles, (un)zipping files, API stuff, output for ODF, LaTeX, etc.) that should be better isolated from the parsing part

mediawiki_parser (this one)

  • (+) Good start (parser + lexer, unittests)
  • (.) Currently uses PLY, which will be abandoned due to its lack of lookahead
  • (-) Syntax coverage is currently incomplete
  • (-) Currently generates no AST

Algorithms

Lexer + parser (e.g. PLY)

  • (+) Easy to use and debug
  • (+) Stateful (specific simple rules for each context)
  • (-) Not enough lookahead in the case of LR(1) parser

Recursive descent of CFGs

  • (+) No separate lexer and parser
  • (+) Memoization ("packrat") makes it run in O(n) (see the sketch after this list)
  • (.) Recursive
  • (-) May require large amounts of memory
  • (-) Quite hard to read and debug
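
To show what the packrat memoization buys, here is a hand-rolled sketch (ours, not from any library above): lru_cache plays the role of the memo table, so each (rule, position) pair is parsed at most once, at the cost of keeping that table in memory::

    # Hand-rolled memoized ("packrat") recursive descent over a toy grammar.
    from functools import lru_cache

    TEXT = "''bold or italic'' plain"      # toy input

    @lru_cache(maxsize=None)               # the packrat memo table
    def quotes(pos):
        """QUOTES <- "''" ; returns the new position, or None on failure."""
        return pos + 2 if TEXT.startswith("''", pos) else None

    @lru_cache(maxsize=None)
    def plain(pos):
        """PLAIN <- one or more non-apostrophe characters."""
        end = pos
        while end < len(TEXT) and TEXT[end] != "'":
            end += 1
        return end if end > pos else None

    @lru_cache(maxsize=None)
    def node(pos):
        """NODE <- QUOTES PLAIN QUOTES | PLAIN"""
        opened = quotes(pos)
        if opened is not None:
            body = plain(opened)
            if body is not None:
                closed = quotes(body)
                if closed is not None:
                    return closed
        return plain(pos)

    pos = 0
    while pos is not None and pos < len(TEXT):
        pos = node(pos)
    print("parsed everything" if pos == len(TEXT) else "parse failed")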

Recursive descent of PEGs (e.g. Rats, PyParsing)

  • (+) No separate lexer and parser
  • (+) O(n) with packrat
  • (+) Resolves ambiguity by having precedence orders for productions. As a result, it is easy to extend a PEG with productions for use in special situations without wrecking the wider grammar. This could be a very big deal for our extensibility story (see the sketch after this list).
  • (+) We can rip off Sweble's grammar.
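
To make that extensibility point concrete, here is a toy ordered-choice combinator sketch of ours (nothing to do with Sweble's actual grammar): because a PEG tries alternatives in a fixed order, a project-specific production such as a {for} block can be prepended to a single rule without touching the rest of the grammar::

    # Toy PEG-style combinators (illustrative only). Each rule takes
    # (text, pos) and returns the end position on success, or None on failure.
    import re

    def literal(s):
        return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

    def regex(pattern):
        rx = re.compile(pattern)
        def match(text, pos):
            m = rx.match(text, pos)
            return m.end() if m else None
        return match

    def seq(*parts):
        def match(text, pos):
            for part in parts:
                pos = part(text, pos)
                if pos is None:
                    return None
            return pos
        return match

    def choice(*alternatives):
        # Ordered choice: the first alternative that matches wins.
        def match(text, pos):
            for alt in alternatives:
                end = alt(text, pos)
                if end is not None:
                    return end
            return None
        return match

    # Base grammar: inline <- italics / plain
    italics = seq(literal("''"), regex(r"[^']+"), literal("''"))
    plain = regex(r"[^'{]+")
    inline = choice(italics, plain)

    # Project-specific extension: prepend one alternative, leave the rest alone.
    showfor = seq(literal("{for "), regex(r"[^}]+"), literal("}"))
    inline = choice(showfor, italics, plain)

    print(inline("{for fx4}rest", 0))   # 9: the new production matched first
    print(inline("''hi'' there", 0))    # 6: the base rules still behave as before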

Earley parser (e.g. Spark, NLTK)

GLR parser (e.g. Pyggy)

  • (.) Supports ambiguous grammars (which MW's grammar isn't)
  • (+) O(n) on deterministic grammars

Previous work

Milestones

Notes

If we build the parse tree in custom lexer callbacks, we can make it an ElementTree or whatever we want--meaning we can use XPath on it later if we want.
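
A sketch of that idea, with made-up callback names standing in for whatever the lexer would actually call::

    # Illustrative only: lexer callbacks (faked here by direct calls) append
    # to an ElementTree, which can later be queried with its limited XPath.
    import xml.etree.ElementTree as ET

    root = ET.Element("article")

    def on_text(value):
        ET.SubElement(root, "text").text = value

    def on_italic(value):
        ET.SubElement(root, "italic").text = value

    # pretend these fire while the lexer scans "plain ''italic'' more"
    on_text("plain ")
    on_italic("italic")
    on_text(" more")

    print([el.text for el in root.findall(".//italic")])   # ['italic']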

Quasi Gantt chart

  • Re-examine parsing algorithm, implement links and other long-lookahead productions (3 weeks)
  • Bold/Italics/Apostrophe jungles (3 weeks)
  • HTML formatter
  • Showfor support
  • Simple productions: paragraphs (3 days), HRs (1 day), magic words (3 days)
  • Tables (long lookahead?) (1 week)
  • One person should do these: includes (long lookahead?) (2 weeks), templates with params (long lookahead?) (2 weeks)
  • Redirects (3 days)
  • Naked URLs (long lookahead but doable in lexer?) (1 day)
  • Headers (long lookahead but doable in lexer) (done for now)
  • Entities (done for now)
  • Behavior switches (optional) (4 days; will require some architecture thinking)
  • HTML tags: probably just tokenize and preserve them through the parser and then have a separate post-parse step to balance and validate them and, for example, escape any invalid ones (3 weeks)