IDNI parser library

This C++ parsing library was created because other available parsers were missing features required for development of our non-turing but still very expressive languages:

TML (Tau Meta Language), a logical programming language
Tau, a software specification language...

This library additionally provides a parser for TGF (Tau Grammar form), an EBNF-like language for describing boolean grammars.

It also provides a tgf tool which takes a grammar described in a TGF file and it can help debug the grammar or generate C++ code of the parser.

Features

Boolean (including conjunctive) grammars
- C++ structures with operators for creating grammars programatically
- TGF - Tau Grammar Form language for describing grammars in an EBNF-like format
Earley based parser with garbage collection producing parse forest containing all the parsed trees
scannerless (lexerless) parsing
forest traversal with callbacks and extraction of trees and graphs to deal with transformations and with ambiguity
tgf tool to test grammars in TGF file or to generate a C++ code representing a parser for a given TGF file.
predefined or custom classes of characters defined by a function (<cctype>-like functions)
optional recoding of an input into another type which can can be used for lexing, reencoding or accessing binary data
own UTF-8 support
no dependencies on other libraries

Basic terminology

Literal refers to a specific value or sequence of characters that appears exactly as it is in the input or in the grammar rules. It represents a fixed, concrete element that must be matched in order for a particular rule or pattern to be satisfied. Literal can be terminal or non-terminal.

A terminal literal represents a specific token or value in the input language. It corresponds to the actual elements that appear in the input stream during parsing. Terminal literals are also referred to as terminals or terminal symbols.

A non-terminal literal represents a syntactic category or a placeholder that can be replaced by a sequence of terminals and/or non-terminals. Non-terminal literals are used to define the structure of the language and serve as intermediate building blocks for generating valid sentences or expressions.

A production rule consists of two parts: a left-hand side (LHS) and a right-hand side (RHS). The left-hand side contains a single non-terminal symbol, and the right-hand side contains a sequence of terminals, non-terminals, or both. The production rule indicates that the non-terminal on the left-hand side can be rewritten or expanded into the sequence of symbols on the right-hand side.

Grammar is a formal system that defines the syntax and structure of a language. It consists of terminals, non-terminals, production rules and a starting symbol (which is usually a non-terminal).

Conjunctive Grammar is a grammar which allows usage of conjunctions of literal sequences in production rules. The conjunction acts as a logical "AND" operation, indicating that all the conjuncted literal sequences must be matched for the whole rule to be satisfied.

Boolean Grammar is a conjunctive grammar which allows also usage of negation of literal sequences. Negation acts as a logical "NOT" operation, indicating that if the negated literal sequence is matched the whole rule fails to be satisfied.

DNF or Disjunctive Normal form is a specific form of representing logical formulas, Boolean expressions or production rules in our case. In DNF, a logical formula is represented as a disjunction (OR) of one or more conjunctions (AND) of literals.

EBNF or Extended Backus–Naur form is a family of metasyntax notations, any of which can be used to express a context-free grammar.

TGF or Tau Grammar Form is an EBNF based form to describe grammars which are usable by this library. TGF understands EBNF syntax (+, *, [] and {}) so it is easier to define repetition or optionality.

Namespace

Library uses an idni namespace for its declarations.

For simplicity, all examples in library documentation expect declaration using namespace idni;.

Templating

Most of the structures provided by this library are templates to enable generic parsing of any input data while parsing any type of terminals. Usually they are declared like this:

template <typename C = char, typename T = C> ... where

C is an input type of input data (usually characters).
T is a type of a terminal.

Fully supported and tested types are char and char32_t (Unicode support). Though it is possible to supply encoder and decoder functions to your parser and these can convert input elements of any type into terminals of any type.

If T differs from C it is required to provide decoder and encoder functions in parser::options. parser use these recoders to convert input data into terminals and back. See recoders for more information.

Classes and structs

Click on the type name in the following list to see more detailed documentation of its API:

nonterminals - id map of non-terminals to their names
lit, lits, conjs, disjs, prod and prods - basic building blocks for creating a grammar programatically. prods represents a grammar containing production rules as disjunctions of conjunctions of sequences of literals.
char_class_fn, char_class_fns - Character class function is an alternative way to create a production rule. It is a function returning true if a terminal character is a member of a character class. It can be custom or there are predefined cctype-like functions (isalnum, isdigit...) provided. char_class_fns is a container for such functions.
shaping_options - options for shaping a parsed tree
grammar::options - options for grammar
grammar - represents a grammar created from production rules prods and character class functions char_class_fns.
parser::options - plain struct of options changing behavior of the parser
parser - uses a grammar for parsing an input and produces a result
parser::result - result of parsing containing the parsed forest
parser::error - detailed information about a parse error
parser::input - wraper for an input data pointer, an input stream or a file. It provides an API to access it from a parser or a decoder
parser::pnode - parser's forest node
forest - a result of parsing containing all the parsed trees. Provides means of traversal or extraction of trees or graphs.
forest::graph - a graph extracted from a forest
forest::tree - a tree extracted from a graph
tgf - Tau Grammar Form parser - reads a grammar in TGF format from a string or a file. An alternative way to describe a grammar instead of creating it programatically.
traverser - struct for traversing and accessing rewriter trees
rewriting - API for rewriting a resulting parse tree
measure - simple struct for measuring time

Classes and structs usable for building command line interfaces

cli - provides CLI arguments management and processing
repl - provides terminal (currently only Linux is supported) REPL UI
term_colors - provides terminal color escape codes

Functions

There are several areas covered by functions provided by this library

predefined character class functions - cctype-like predefined character class functions
recoders - decoder and encoder for UTF-8 in char <-> Unicode in char32_t
UTF-8 - UTF-8 and Unicode support
devhelpers - helper forest transformations to various formats (TML facts, TML rules, DOT) useful when developing a parser

TGF - Tau Grammar Form

TGF is an EBNF-based form to describe grammars. Specification of the form can be found on page Tau Grammar Form

TGF tool

This library comes with a CLI executable tgf which features viewing, testing and debugging grammars written in TGF and it can also generate a C++ code from TGF grammar which is then usable in a C++ project.

More detailed information about this CLI can be found on page TGF tool

Tutorials

CSV parser

Beginner tutorial to quick start using this library.

You can learn how to use this library in a CSV parser tutorial.

Part 1 shows a very simple parser with all it's required. Each following part expands the previous one. Read comments since they are always related to a new or a changed code.

part 1 - minimal parser of positive integers
part 2 - using predefined character classes
part 3 - refactor parser into its struct
part 4 - parse also negative integers
part 5 - parse strings and nulls, traverse
part 6 - parse comma separated values
part 7 - parse new line separated rows
part 8 - replace programmatically created grammar with a grammar in TGF
part 9 - using EBNF syntax in TGF
part 10, csv.tgf - using tgf tool to generate a parser from a TGF file

Name		Name	Last commit message	Last commit date
Latest commit History 269 Commits
cmake		cmake
docs		docs
examples		examples
src		src
tests		tests
tools		tools
.aliases		.aliases
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
benchmark.sh		benchmark.sh
build.sh		build.sh
clean.sh		clean.sh
debug.sh		debug.sh
release.sh		release.sh
relwithdebinfo.sh		relwithdebinfo.sh
test-with-tau.sh		test-with-tau.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IDNI parser library

Features

Basic terminology

Namespace

Templating

Classes and structs

Classes and structs usable for building command line interfaces

Functions

TGF - Tau Grammar Form

TGF tool

Tutorials

CSV parser

About

Releases

Packages

Contributors 5

Languages

IDNI/parser

Folders and files

Latest commit

History

Repository files navigation

IDNI parser library

Features

Basic terminology

Namespace

Templating

Classes and structs

Classes and structs usable for building command line interfaces

Functions

TGF - Tau Grammar Form

TGF tool

Tutorials

CSV parser

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages