XML parsers

Mehdi Bouaziz edited this page May 23, 2014 · 1 revision

Motivation

XML is often used as an intermediate representation, so it should be convenient to transform it into a more specific structure.

Opa provides a dedicated construct for this, modeled on pattern matching:

my_parser = xml_parser {
    case <x>parser valx=Rule.integer</><y>parser valy=Rule.integer</>: {x:valx, y:valy}
}

This defines a parser that transforms XML such as

<x>12</x><y>13</y>

into

{x:12, y:13}

This syntax resembles that of pattern matching, and it behaves the same way: the XML being parsed is matched against the patterns in order until a match is found. The right-hand side of the matching pattern is then evaluated.
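To illustrate the rule ordering, here is a minimal sketch that extends the parser above with a fallback rule. The name point_parser and the default value 0 are illustrative assumptions, not part of the original example.

```opa
// Hypothetical variant of my_parser: rules are tried from top to bottom.
point_parser = xml_parser {
    // Tried first: an <x> element followed by a sibling <y> element.
    case <x>parser valx=Rule.integer</><y>parser valy=Rule.integer</>: {x:valx, y:valy}
    // Tried only if the first rule fails: a lone <x> element; y defaults to 0.
    case <x>parser valx=Rule.integer</>: {x:valx, y:0}
}
```

Given <x>12</x><y>13</y>, the first rule applies and the result should be {x:12, y:13}. Given only <x>12</x>, the first rule fails because no sibling is left for the <y> pattern, so the second rule applies and the result should be {x:12, y:0}.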

Syntax extensions

The syntax of expressions is extended with the following rules:

expr ::=
| <xml-parser>

xml-parser ::=
| xml_parser <xml-parser-rules>

xml-parser-rules ::=
| |? <xml-parser-rule>* sep | <xml-parser-default-rule>? end?

xml-parser-rule ::=
| <xml-named-pattern>+ -> <expr>

xml-parser-default-rule ::=
| default: <expr>

xml-named-pattern ::=
| <xml-parser-prefix>? <ident> <xml-parser-suffix>?
| <xml-parser-prefix>? <ident> = <xml-pattern> <xml-parser-suffix>?
| parser? <parser-prod>+
| <xml-parser-prefix>( <xml-named-pattern>+ )

xml-pattern ::=
| < <xhtml-tag> <xml-pattern-attribute>* />
| < <xhtml-tag> <xml-pattern-attribute>* > <xml-named-pattern>* </ <xhtml-tag> >
| < <xml-pattern-ns-tag> <xml-pattern-attribute>* > <xml-named-pattern>* </>
| _
| { <expr> }
| ( <xml-parser-rules> )

xml-pattern-ns-tag ::=
| <xml-named-pattern-tag> : <xml-named-pattern-tag>
| <xml-named-pattern-tag>

xml-named-pattern-tag ::=
| <ident> = <xml-pattern-tag-list>
| <xml-pattern-tag-list>

xml-pattern-tag-list ::=
| _
| <xml-pattern-tag>+ sep |

xml-pattern-tag ::=
| <xhtml-tag>
| <string-literal>
| ( <parser-prod>+ )
| { <expr> }

xml-pattern-attribute ::=
| <xml-parser-prefix>? . <xml-parser-suffix>?
| <xml-parser-prefix>? <xml-pattern-attribute-name> <xml-parser-suffix>?
| <xml-parser-prefix>? <xml-pattern-attribute-name> = <xml-pattern-attribute-rhs> <xml-parser-suffix>?

xml-pattern-attribute-name ::=
like <xml-pattern-ns-tag> but with := instead of = for bindings

xml-pattern-attribute-rhs ::=
| <xml-pattern-attribute-value>
| ( <xml-pattern-attribute-value2> as <ident> )

xml-pattern-attribute-value ::= xml-pattern-attribute-name

xml-pattern-attribute-value2 ::=
| _
| <xml-pattern-tag>

xml-parser-prefix ::=
| &
| !

xml-parser-suffix ::=
| ?
| +
| *
| { <expr> }
| { <expr> , <expr> }

We detail here how xml_parser patterns behave:

  • A rule is composed of a list of patterns. Any xml matches the empty list of patterns; the xml itself does not have to be empty.
    • An xml matches an unprefixed non-empty list of patterns if the xml matches the head pattern and its siblings (if it is a fragment, or the empty fragment otherwise) match the remaining patterns.
    • An xml matches a list of patterns starting with &<subrule> (or &(<subrule>)) when the xml matches the <subrule> and the remaining patterns.
    • An xml matches a list of patterns starting with !<subrule> (or !(<subrule>)) when the xml does not match the <subrule> and matches the remaining patterns (no binding is allowed in <subrule>).
  • An xml matches the pattern parser <parser-prod>+ if its first node is a text node that is accepted by the parser. The bindings done in the parser are in the scope of the action.
  • An xml matches the pattern x=<subrule> when the xml matches the <subrule> and the result of <subrule> is then bound to x.
  • An xml matches the pattern x<suffix> when the xml matches the pattern x=_<suffix>.
  • An xml always matches the subrule <subrule>*. As many nodes as possible are matched, and the list of their results is returned.
  • An xml matches the subrule <subrule>+ when the xml matches <subrule> at least once. As many nodes as possible are matched, and the list of their results is returned.
  • An xml always matches the subrule <subrule>?. A node is matched if possible, and its result wrapped into an option is returned. If no node is matched, {none} is returned.
  • An xml matches the subrule <subrule>{min,max} when the xml matches <subrule> at least min times. As many nodes as possible (but at most max) are matched, and the list of the results of these matches is returned.
  • An xml matches the subrule <subrule>{number} when it matches <subrule>{number,number}.
  • An xml matches the subrule _ if it contains at least one node. This node is then returned.
  • An xml matches the subrule {<expr>} when the xml_parser <expr> accepts the xml. In that case, the result of the xml_parser is returned.
  • An xml matches the subrule <tag-pattern attributes-pattern>content</> when the xml begins with a tag node and
    • the tag of the node matches <tag-pattern>
    • the attributes of the node match <attributes-pattern>
    • the content of the node matches <content>
  • A tag matches a tag pattern <namespace-pattern>:<name-pattern> when its namespace matches <namespace-pattern> (if there is none, it is considered to be empty) and its name matches <name-pattern>.
  • A tag name matches <ident>=<tag-list> if it matches <tag-list>. The name is bound to <ident>.
  • A tag name always matches _.
  • A tag name matches <tag1>|<tag2>|... if it matches either <tag1> or <tag2> or ...
  • A tag name matches an xml tag name literal if they are equal.
  • A tag name matches a string literal if they are equal. The string literal may contain opa expressions.
  • A tag name matches ( <parser-prod>+ ) if it is accepted by the parser (bindings are allowed only if there is one choice).
  • A tag name matches { <expr> } if the parser <expr> accepts it.
  • The same conditions apply to tag namespaces.
  • A set of attributes matches an unprefixed list of attribute patterns if some attribute matches the head attribute pattern and the other attributes match the remaining patterns (matching is greedy, and you have no control over the order if there are several possibilities).
  • A set of attributes matches a list of attribute patterns starting with &<attribute-pattern> when it matches <attribute-pattern> and the remaining patterns.
  • A set of attributes matches a list of attribute patterns starting with !<attribute-pattern> when it does not match <attribute-pattern> and matches the remaining patterns (no binding is allowed in <attribute-pattern>).
  • A set of attributes matches <ident> := <attribute-pattern><suffix>, with <suffix> not being ?, when it matches <attribute-pattern><suffix> (no binding is allowed in <attribute-pattern>). The matched attributes are bound to <ident> as a list.
  • A set of attributes always matches <attribute-pattern>*. As many attributes as possible are matched.
  • A set of attributes matches <attribute-pattern>+ when it matches <attribute-pattern> at least once. As many attributes as possible are matched.
  • A set of attributes matches <attribute-pattern>{min,max} when it matches <attribute-pattern> at least min times. As many attributes as possible (but at most max) are matched.
  • A set of attributes always matches <attribute-pattern>?. An attribute is matched if possible. Bindings are wrapped into option.
  • Any attribute matches the pattern . (a single dot).
  • An attribute matches <attribute-name> if its name matches <attribute-name>. If <attribute-name> is just an identifier, the attribute value is bound to it.
  • An attribute matches <attribute-name> = <attribute-value> if its name matches <attribute-name> and its value matches <attribute-value>.
  • Matching of attribute namespaces:names is like matching of tag namespaces:names. Bindings are done with := instead of =.
  • Matching of attribute values is like matching of tag names. Bindings are done with := instead of =.
  • An attribute value matches (<value> as <ident>) when it matches <value>. In that case, the matched value is bound to <ident>.
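As a sketch of how the * and ? suffixes combine with bindings, the two parsers below reuse the parenthesized sub-rule form that also appears in the Examples section. The names items and x_opt_y, and the element names, are hypothetical.

```opa
// Hypothetical parsers illustrating the * and ? suffixes.

// Collects the integer content of every leading <item> element into a list.
items = xml_parser {
    case l=(| <item>parser v=Rule.integer</>: v)*: l
}

// An <x> element optionally followed by a <y> element; oy is an option.
x_opt_y = xml_parser {
    case <x>parser vx=Rule.integer</> oy=(| <y>parser vy=Rule.integer</>: vy)?: {x:vx, y:oy}
}
```

Following the rules above, items should return an empty list when the xml does not start with an <item> element, since * always matches. Likewise, x_opt_y should bind oy to {none} when no <y> element follows the <x> element.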

Whitespace

IMPORTANT:

All whitespace-only text nodes are discarded during parsing.

xml = @xml(<>" "</>)
p = xml_parser { case parser .*: {} }

p would fail to parse xml, because the whitespace-only text node is discarded and the parser behaves exactly as if xml were defined as

xml = @xml(<></>)
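For contrast, here is a sketch of the case where the text node is not whitespace-only. The names xml2 and p2 are hypothetical, and the expected behavior is inferred from the rule stated above rather than taken from the original page.

```opa
// Assumed counterpart to the example above: the text node contains
// a non-whitespace character, so it should not be discarded.
xml2 = @xml(<>" a "</>)
p2 = xml_parser { case parser .*: {} }
```

Since the text node in xml2 is kept, p2 would be expected to accept xml2.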

Examples

xml_parser {
    case <b | i | u> x* </> y*: @xml(<>{x}{y}</>)
}

Matches a leading <b>, <i>, or <u> tag, removes it and returns its contents and the remaining nodes.

recursive p = xml_parser {
    case l=(| s={Xml.Rule.string}: s | t=<_:_>{p}</>: t)*: String.concat("", l)
}

Removes all tags, returning the concatenation of all text nodes.

xml_parser {
    case (| !<a /> _: void)* r=<a />: r
}

Returns the first <a /> tag (at the top level).

recursive p = xml_parser {
    case !_: []
    case h=<a /> t={p}: [h|t]
    case (| !<a /> _: void)* r={p}: r
}

Returns all <a /> tags (at the top level).

recursive p = xml_parser {
    case r=<a />: r
    case <_:_>r={p}</>: r
    case _ r={p}: r
}

Returns the first <a /> tag (at any level).

Examples of attribute patterns

The following examples match an element <a ... /> with the following attributes:

  • Any set of attributes: <a />
  • No attributes: <a !. />
  • Exactly one attribute: <a . !. />
  • Only attributes from namespace ns: <a "{ns}":_* !. />
  • Two attributes x and y having the same value: <a x y="{x}" />
  • An attribute x, and bind y to the other attributes: <a x y:=.* />
  • An attribute x, and bind y to all the attributes (including x): <a &x y:=.* />
  • An attribute b starting with 'a' and having a 'c' in the third position: <a &b=("a" .*) &b=(.."c".*) /> or simply <a b=("a"."c".*) />
  • Bind x to the list of all attributes having value "a": <a x:=_="a"* />

This will never work: <a x x />, because matched attributes are consumed unless the pattern is prefixed.

Matching is greedy; there is no backtracking. Hence <a x:=_ "{x}" /> can match <a b="c" c="d" /> only if you are lucky. The order of the attributes is not guaranteed: x:=_ can match either b="c" (binding x to "b") or c="d" (binding x to "c"). "{x}" can then only match the other attribute, since the first one has been consumed.
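The attribute patterns above are used inside ordinary xml_parser rules. The sketch below shows two of them in context; the parser names href_any and href_only are hypothetical, and the binding of href relies on the rule that a bare identifier binds the attribute value.

```opa
// Hypothetical parsers built from the attribute patterns listed above.

// Accepts <a href="..." /> with any other attributes; returns the href value.
href_any = xml_parser {
    case <a href />: href
}

// Accepts <a href="..." /> only when href is the sole attribute.
href_only = xml_parser {
    case <a href !. />: href
}
```

Given an element with both href and title attributes, href_any should succeed, while href_only should fail because !. requires that no attribute remains after href has been consumed.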