Skip to content

Basic usage

LPeter1997 edited this page May 1, 2019 · 15 revisions

Table of content

  1. Goals of the library
  2. Structure of the library
  3. Your first parser
  4. Parsing multiple things in a sequence
  5. Transforming the parsed value
  6. Parsing alternatives
  7. Parsing many of the same thing
  8. Optional elements
  9. Matching the end
  10. Recursion
  11. Memorization, left-recursion
  12. A full expression parser

Goals of the library

This library tries to provide a simple way to write parsers as natural as possible. As you will see later in the tutorials, the library is capable of parsing from a DSL that is very close to BNF notation.

Structure of the library

The library is only a single header-file. Everything is in the cppcmb namespace inside that. Since it's relatively long and we can usually write our grammar in .cpp files, in this tutorial we will be using the following alias:

namespace pc = cppcmb; // PC as ParserCombinator

Your first parser

We will start out with one of the simplest parsers possible: one. It simply consumes a single element of the input, if there is any. For example, if the input is a string, then it tries to consume a single character. Creating one is simple:

auto parser = pc::one; // No need for parenthesis or braces!

Next, we create a reader, that can read from a given source:

std::string_view src = "Hello, World!";
auto reader = pc::reader(src);

Finally, we can try to parse our source:

auto result = parser.apply(reader);

The result type is either a pc::success<value> or a pc::failure. pc::success tells us that the parse succeeded and the result contains the representation of the parsed segment. pc::failure means that the parsing failed. Example:

std::cout << "Furthest peek: " << result.furthest() << std::endl; // Furthest the parser peeked in the input relative to the starting position
if (result.is_success()) {
    auto& success = result.success(); // Access the result as a success type
    std::cout << "Parse succeeded!" << std::endl;
    std::cout << "Characters consumed: " << success.matched() << std::endl; // matched() gives us the number of elements consumed by the parser
    std::cout << "Value: '" << success.value() << "'" << std::endl;
}
else {
    auto& failure = result.failure(); // Access the result as a failure type
    std::cout << "Parse failed!" << std::endl;
}

The above example will give us the following output:

Furthest peek: 1
Parse succeeded!
Characters consumed: 1
Value: 'H'

But what if there is no input? Let's try to apply the same parser with the following source:

std::string_view src2 = "";
auto reader2 = pc::reader(src2);

Running parser.apply(reader2) gives us the following:

Furthest peek: 0
Parse failed!

Parsing multiple things in a sequence

Let's say we want to parse three characters, and only succeed if all three characters are parsed successfully. The sequencing parser combinator comes in handy here. It can be created with the & operator:

auto parser = pc::one & pc::one & pc::one;

This parser will only succeed for inputs with three or more characters. You might ask: What's the succeeding value of this parser? (In other words: What's the type of: result.success().value()?) It's an std::tuple-like type called product. You can convert it to an std::tuple using it's as_tuple() method. For example:

std::string_view src = "abcd";
auto result = parser.apply(pc::reader(src));
auto& success = result.success();
assert(success.value().as_tuple() == std::tuple('a', 'b', 'c'));
assert(success.matched() == 3);

Transforming the parsed value

Most of the times having the raw parsed data is useless, because we want to do evaluation, build a parse tree or just simply ignore some elements that are only significant for parsing - like parenthesis usually. We want to be able to invoke a function that transforms the parsed value into something more useful. We can use the operator[] for that. Let's see an example, where we transfrorm a parsed character into it's upper-case form:

auto parser = pc::one[toupper];
std::string_view src = "abc";
auto value = parser.apply(pc::reader(src)).success().value(); // We cheat a bit to shorten examples
assert(value == 'A');

Transformations that can fail

Sometimes it's easier to implement a feature (like matching a certain character) if we can somehow tell the transformation to fail the parse. This is possible by returning a maybe<T> type:

template <char Ch>
struct match_t {
    auto operator()(char c) const -> pc::maybe<char> {
        if (c == Ch) return pc::some(c);
        else return pc::none();
    }
};

template <char Ch>
inline constexpr auto match = match_t<Ch>();

For example, let's say we want to accept only strings that start with the "abc" sequence:

auto parser = match<'a'> & match<'b'> & match<'c'>;

Builtin transformations

The library provides some useful transformations:

  • select<Is...>: Selects certain elements from a product-type. For example, we can drop parenthesis like this:
    auto parser = (match<'('> & pc::one & match<')'>)[pc::select<1>];
    // For input "(a)" the successful value would be just 'a'
  • filter: Only allows the parser to succeed, if the provided predicate is true. Wit this, we can simplify our match functionality:
    template <char Ch>
    bool is_same_char(char c) { return c == Ch; }
    
    template <char Ch>
    inline constexpr auto match = pc::one[pc::filter(is_same_char<Ch>)];

Note: you always need to define your own matching functionality, as this library is intended to work for every input kind. If you have an underlying lexer for example, matching would involve looking at the token type, not the whole token itself.

Parsing alternatives

Let's say we want to do something slightly more complex: Having multiple alternative parsers, and apply them until one succeeds. This can be done with the | operator. For example, let's create a parser that either matches the character a or b, but nothing else:

auto parser = match<'a'> | match<'b'>;

Simple enough. Let's write something more complex, like either matching a 0, or a 0 in parenthesis:

// Note the formatting, kind of looks like something from Bison
auto parser =
      match<'0'>
    | match<'('> & match<'0'> & match<')'>
    ;

What's the return type for this? It's an std::variant-like type, called sum. For this case it's: pc::sum<char, pc::product<char, char, char>>. You can convert it to a standard std::variant with the method as_variant(). It might be worth to transform the alternatives to a single form instead:

auto parser =
      match<'0'>
    | (match<'('> & match<'0'> & match<')'>)[pc::select<1>]
    ;
// Now the value is a single char in all cases

Important note for alternatives

If you are using operator|, the order of alternatives is important. The first alternative that succeeds will be returned, even if there are other alternatives that could consume more input:

auto parser = 
      match<'0'>
    | (match<'0'> & match<'1'> & match<'2'>)[pc::select<0>]
    ;

With this, parsing "012" will only consume 0. If you want the longest alternative to always win, you can use operator|| instead:

auto parser = 
       match<'0'>
    || (match<'0'> & match<'1'> & match<'2'>)[pc::select<0>]
    ;

This will consume "012". operator|| has the expense that it will always try all alternatives.

Syntax sugar

It looks much nicer if we could start all of our alternative lines with |, even the first line. There is a little helper-type for that, called pass:

auto parser = pc:pass
    | match<'0'>
    | (match<'0'> & match<'1'> & match<'2'>)[pc::select<0>]
    ;

It's only purpose is that it gets ignored when it's the first alternative.

Parsing many of the same thing

Sometimes there is the need to parse many of the same thing, like a comma-separated list. There is a so-called many-combinator for that, with the prefix operator*. Here is an example parser, that parses as many ones, as it can:

auto parser = *match<'1'>;

It will succeed, even if there is zero matches. The successful value of this will be an std::vector<char>. We can redirect it to another kind of sequential container:

auto parser = *match<'1'> >> pc::collect_to<std::list>; // std::list<char>
auto parser2 = *match<'1'> >> pc::collect_to<std::list, my_allocator>; // std::list<char, my_allocator<char>>

Parsing one, or more

It's pretty common that we want at least one of something, or more. A great example for this is a number literal, which is at least one digit. We can match one or more things using the prefix operator+:

auto parser = +match<'a'>; // This will succeed wor one or more 'a's, succeeding value is std::vector<char>

Optional elements

There could be elements in the grammar, that are purely optional, meaning that the lack of that element doesn't mean that the parser should fail. We can turn parsers into optional parsers using the prefix operator-:

auto parser = -match<'a'>;

This will match an a or nothing - but will succeed anyway. The succeeding value is pc::maybe<char>.

Matching the end

It is possible to match the end of input with pc::end.

Recursion

Recursion is usually the weak-point of parser combinators. This library tries to overcome this with two macros:

  • cppcmb_decl for declaring a grammar rule
  • cppcmb_def for defining a previously declared rule

Important: Both of these macros must be used outside of a function (meaning global or namespace level).

Let's write a recursive parser for just matching 0s! We will throw away any result so we won't have to deal with infinitely-recursive types. Note that this example is purely for demonstration, you could implement this with the many combinator.

cppcmb_decl(many_ones, pc::product<>); // Name of rule is many_ones, returns an empty tuple
cppcmb_def(many_ones) = pc::pass
    | (match<'1'> & many_ones)[pc::select<>]
    | match<'1'>[pc::select<>]
    ;

Pretty simple, and fairly close to a real grammar. Note that we didn't write it left-recursive. We will get to that later, when discussing memorizing parsers. Applying the parser stays the same, many_ones.apply(pc::reader(src)).

A more complex recursive grammar

Let's write a grammar, that can parse a number nested in any number of parenthesis. We will also convert the digit list to an actual C++ int to practice already discussed features of the library.

// A simple function to convert a digit list into an int
int to_number(std::vector<char> const& digs) {
    int r = 0;
    for (auto c : digs) r = r * 10 + (c - '0');
    return r;
}

// Note that the order of declarations and definitions don't matter. You can do it in any order you wish.
// The only rule to keep is to declare everything before defining anything.
cppcmb_decl(number, int);
cppcmb_decl(digit, char);

cppcmb_def(number) = pc::pass
    | (match<'('> & number & match<')'>)[pc::select<1>]
    | (+digit)[to_number]
    ;
    
cppcmb_def(digit) = pc::one[pc::filter(isdigit)];

Results:

"123" -> 123
"((56))" -> 56
"(((((234)))))" -> 234
"(34" -> FAIL
"((36)" -> FAIL

Unmatched parenthesis fail as expected.

Memorization, left-recursion

There are two main problems so far:

  • Alternatives can apply the same rule multiple times
  • We can't do left recursion

As it turns out, we can solve both problems with memorization!

Memorization

Look at the following parser:

cppcmb_def(number) =
      (+digit)[to_number]
    ;

cppcmb_def(addition) = pc::pass
    | (number & match<'+'> & number)[add]
    | (number & match<'-'> & number)[subtract]
    | number
    ;

When we parse the input "12", the following happens:

  • We try the first alternative, apply number successfully, but then fail to match a +
  • We try the second alternative, apply number successfully, but then fail to match a -
  • We try the last alternative, apply number successfully and the parse succeeds

We applied number three times, but we already knew from the first application that it will succeed. This is not that important for this case, but for more complex grammars this can really hit performance. The solution is to use a memorizing parser, commonly known as a packrat parser. We can turn a parser into a memorizing one like so:

cppcmb_def(number) =
      (+digit)[to_number]
    %= pc::as_memo
    ;

If you build the parser from memorizing parsers, the complexity will degrade to linear-time instead of exponential-time. Note, that there will be extra memory-consumption in exchange. It's important that the memorizer will need some context to work with. We need to pass that to the reader:

auto ctx = pc::memo_context();
auto result = parser.apply(pc::reader(src, ctx));

Which parsers should I turn into a memorizing parser?

If you accept the memory overhead, you might wonder what to turn into a packrat parser and what to leave as it is. My suggersion is to turn every named rule into a packrat parser, except the ones that only match a single character.

A little hepler

You can ignore the steps of creating a context and a reader by using the parser wrapper. It can only be used on the top-level rule. Example:

std::string_view src = "...";
auto parser = pc::parser(top_level_rule);
auto result = parser.parse(src); // Note the different method name, no reader

Direct left-recursion

With some modifications to memorization, we can have direct left-recursive grammar:

cppcmb_decl(many_ones, pc::product<>);
cppcmb_def(many_ones) = pc::pass
    | (many_ones & match<'1'>)[pc::select<>]
    | match<'1'>[pc::select<>]
    %= pc::as_memo_d;

Note that we have pc::as_memo_d instead of just pc::as_memo. This is because direct left-recursion has a bit more overhead, so it's a separate option, to only have overhead you are willing to take.

Indirect left-recursion

If we try our prevoius example with an indirection in between, we will see that it will fail:

cppcmb_decl(many_ones, pc::product<>);
cppcmb_decl(many_ones_impl, pc::product<>);

cppcmb_def(many_ones) =
      many_ones_impl
    %= pc::as_memo_d;

cppcmb_def(many_ones_impl) = pc::pass
    | (many_ones & match<'1'>)[pc::select<>]
    | match<'1'>[pc::select<>]
    %= pc::as_memo_d;

Indirect left-recursion requires a more complex algorithm to work, so it's a separate feature again. This time, it's pc::as_memo_i:

cppcmb_decl(many_ones, pc::product<>);
cppcmb_decl(many_ones_impl, pc::product<>);

cppcmb_def(many_ones) =
      many_ones_impl
    %= pc::as_memo_i;

cppcmb_def(many_ones_impl) = pc::pass
    | (many_ones & match<'1'>)[pc::select<>]
    | match<'1'>[pc::select<>]
    %= pc::as_memo_i;

A full expression parser

Now we know everything to write an expression parser (and evaluator) with correct precedence and associativity, supporting +, -, *, /, ^ and grouping with parenthesis. It will read user input line-by-line and evaluate the expression. This is the full source:

namespace pc = cppcmb;

template <char Ch>
bool is_same_char(char c) { return c == Ch; }

template <char Ch>
inline constexpr auto match = pc::one[pc::filter(is_same_char<Ch>)];

int do_op(int x, char ch, int y) {
    switch (ch) {
    case '+': return x + y;
    case '-': return x - y;
    case '*': return x * y;
    case '/': return x / y;
    case '^': return (int)std::pow(x, y); // For simplicity
    }
}

int to_num(std::vector<char> const& chs) {
    int n = 0;
    for (auto c : chs) n = n * 10 + (c - '0');
    return n;
}

cppcmb_decl(expr,  int);
cppcmb_decl(mul,   int);
cppcmb_decl(expon, int);
cppcmb_decl(atom,  int);
cppcmb_decl(num,   int);
cppcmb_decl(digit, char);

cppcmb_def(expr) = pc::pass
    | (expr & match<'+'> & mul) [do_op]
    | (expr & match<'-'> & mul) [do_op]
    | mul
    %= pc::as_memo_d;

cppcmb_def(mul) = pc::pass
    | (mul & match<'*'> & expon) [do_op]
    | (mul & match<'/'> & expon) [do_op]
    | expon
    %= pc::as_memo_d;

cppcmb_def(expon) = pc::pass
    | (atom & match<'^'> & expon) [do_op]
    | atom
    %= pc::as_memo_d;

cppcmb_def(atom) = pc::pass
    | (match<'('> & expr & match<')'>) [pc::select<1>]
    | num
    %= pc::as_memo_d;

cppcmb_def(num) =
      (+digit) [to_num]
    ;

cppcmb_def(digit) = pc::one[pc::filter(isdigit)];

int main() {
    auto parser = pc::parser(expr);
    std::string line;

    while (std::getline(std::cin, line)) {
        auto res = parser.parse(line);
        if (res.is_success()) {
            std::cout << "Result = " << res.success().value() << std::endl;
        }
        else {
            std::cout << "Failed to parse expression!" << std::endl;
        }
    }
    return 0;
}