---

# 1. Language and Syntax
**[Emil Sekerinski](http://www.cas.mcmaster.ca/~emil/), McMaster University, January 2024**

---

### Language and Grammar

Every language is based on a _vocabulary_. Its elements are called _words_ or _symbols_ whose structure is of no further interest. The _syntax_ determines which sequences of words, called *sentences*, belong to the language.

| language                | symbols                                          |
|:------------------------|:-------------------------------------------------|
| English                 | `eats`, `Kevin`, `a`, `banana`, ...              |
| Roman numerals          | `I`, `V`, `X`, `L`, `C`, `D`, `M`                |
| identifiers in programs | `A`, `B`, ..., `a`, `b`, ..., `0`, `1`, ..., `_` |
| arithmetic expressions  | `dist`, `rot`, `24`, `+`, `–`, `×`, `/`, ...     |

**Question:** What are other non-spoken languages?

_Answer:_
- Chemical formulae, e.g. <code>H<sub>3</sub>O<sup>+</sup></code> for hydronium.
- Musical scores, with vocabulary 🎼, ♭, ♮, ♯, ♩, ♪, ♫, ♬, etc.
- Morse code, with vocabulary "●" (short), "━━━" (long), " " (pause).

<div style="float:right;background-color:lightgrey;border-left:20px solid white">

**Example:** if `V = {a, b}`, then  
`Vᐩ = {a, b, aa, ab, ba, bb, aaa, … }`  
`V* = {ε, a, b, aa, ab, ba, bb, aaa, … }`  
The sentences of the language  
`L = {σaτ | σ, τ ∈ V*}`  
are those sequences that contain at least one `a`.
</div>

Formally, a vocabulary `V` is a finite, non-empty set of (atomic) symbols. The set `V*` of all _finite sequences_ or _strings_ over `V` consists of

- the empty string `ε`,
- any symbol `x ∈ V`,
- the _concatenation_ `στ` of strings `σ, τ ∈ V*`.

The empty sequence is both the left and right identity of concatenation. Concatenation is associative, meaning that parentheses can be left out. Formally, for any `σ, τ, ω ∈ V*`:

- `σε = σ = εσ`
- `(στ)ω = σ(τω)`

The set of all _non-empty strings_ over `V` is denoted by `Vᐩ`, formally `Vᐩ = V* – {ε}`. The _length_ of string `σ` is written as `|σ|`:

- `|ε| = 0`,
- `|x| = 1` for any `x ∈ V`,
- `|στ| = |σ| + |τ|` for any `σ, τ ∈ V*`.
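
The definitions above can be tried out on a finite prefix of `V*`. The following is a minimal sketch, assuming symbols are single characters and strings over `V` are Python strings; the helper name `strings_up_to` is made up for this illustration:

```python
from itertools import product

def strings_up_to(V, n):
    """All strings over vocabulary V of length at most n, shortest first."""
    return [''.join(t) for k in range(n + 1) for t in product(sorted(V), repeat=k)]

V = {'a', 'b'}
V_star_2 = strings_up_to(V, 2)          # finite prefix of V*
V_plus_2 = [σ for σ in V_star_2 if σ]   # removing ε gives a prefix of Vᐩ

ε = ''
for σ in V_star_2:
    assert σ + ε == σ == ε + σ                    # ε is left and right identity
    for τ in V_star_2:
        assert len(σ + τ) == len(σ) + len(τ)      # |στ| = |σ| + |τ|
        for ω in V_star_2:
            assert (σ + τ) + ω == σ + (τ + ω)     # associativity

print(V_star_2)
print(V_plus_2)
```

The checks only cover strings up to length 2, so they illustrate the laws rather than prove them.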

<img style="width:16em;height:auto;float:right;border-left:10px solid white" src="./img/NLexample.svg">

A *grammar* not only determines unambiguously which sequences of words are sentences and which are not but also provides sentences with a *structure*. The structure is instrumental in recognizing the *semantics* of a sentence, which is our ultimate goal.

The theory of formal languages originates in linguistics. A basic rule of English is that a sentence (`S`) consists of a noun phrase (`NP`) followed by a verb phrase (`VP`). A noun phrase is either a proper name (`PN`) or a determiner (`D`) followed by a noun (`N`). A verb phrase is either a verb (`V`) or a verb followed by a noun phrase. Determiners are `a` and `the`. The hierarchical composition of an English sentence by a *parse tree* is given to the right; below are the corresponding rules. Grammars of this form are called *generative*, and the rules are called *productions*, as they determine how all sentences of a language are generated.

<div style="float:right;background-color:lightgrey;margin-left:18pt">

`S → NP VP`  
`NP → PN`  
`NP → D N`  
`VP → V`  
`VP → V NP`  
`PN → Kevin`  
`PN → Dave`  
`D → a`  
`D → the`  
`N → banana`  
`N → apple`  
`V → eats`  
`V → runs`

</div>

Formally, grammar `G = (T, N, P, S)` is specified by

- a finite set `T` of *terminal symbols*,
- a finite set `N` of *nonterminal symbols*,
- a finite set `P` of *productions*,
- a symbol `S ∈ N`, the *start symbol*

where `N ∩ T = {}` and `V = T ∪ N` is its *vocabulary*. Productions are pairs of strings `σ ∈ Vᐩ`, `τ ∈ V*`, written `σ → τ`.

**Example.** `G₀ = (T, N, P, S)` with `T = {Kevin, Dave, a, the, banana, apple, eats, runs}`, `N = {S, NP, VP, PN, D, N, V}`, and the productions to the right is a grammar.

<div style="float:right;background-color:lightgrey;margin-left:18pt">

    `S`  
`⇒ NP VP`  
`⇒ PN VP`  
`⇒ Kevin VP`  
`⇒ Kevin V NP`  
`⇒ Kevin eats NP`  
`⇒ Kevin eats D N`  
`⇒ Kevin eats a N`  
`⇒ Kevin eats a banana`
</div>

Given grammar `G = (T, N, P, S)`, sequence `χ ∈ V*` is _directly derivable_ from `π ∈ Vᐩ`, written `π ⇒ χ`, if there exist sequences `σ`, `τ`, `μ`, `ν` such that `π = μσν`, `χ = μτν`, and `σ → τ ∈ P`.

If `χ` is derivable from `π` in `n` steps, this is written as `π ⇒ⁿ χ`. Formally, relation `⇒ⁿ` is defined for `n ≥ 0` by:
- `π ⇒⁰ π`
- `π ⇒ⁿ⁺¹ χ` if `π ⇒ ρ` and `ρ ⇒ⁿ χ` for some `ρ`

We write

- `π ⇒* χ` if `χ` is _derivable in zero or more steps_ from `π`,
- `π ⇒ᐩ χ` if `χ` is _derivable in one or more steps_ from `π`.

Formally, `⇒*` is the transitive and reflexive closure of relation `⇒` and `⇒ᐩ` is the transitive closure of `⇒`.

The derivation to the right allows us to conclude that `S ⇒ᐩ Kevin eats a banana` with grammar `G₀`. More precisely, we can state `S ⇒⁸ Kevin eats a banana`.
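
The definition of `⇒` can be sketched directly in code. The following is a minimal illustration, assuming strings over the vocabulary are represented as tuples of symbols and productions as pairs of such tuples; the names `P0`, `step`, and `is_derivation` are made up for this sketch:

```python
# Productions of G₀ as pairs (σ, τ) of symbol tuples
P0 = [
    (('S',), ('NP', 'VP')),
    (('NP',), ('PN',)), (('NP',), ('D', 'N')),
    (('VP',), ('V',)), (('VP',), ('V', 'NP')),
    (('PN',), ('Kevin',)), (('PN',), ('Dave',)),
    (('D',), ('a',)), (('D',), ('the',)),
    (('N',), ('banana',)), (('N',), ('apple',)),
    (('V',), ('eats',)), (('V',), ('runs',)),
]

def step(π, P):
    """Set of all χ with π ⇒ χ, i.e. π = μσν and χ = μτν for some σ → τ in P."""
    result = set()
    for σ, τ in P:
        for i in range(len(π) - len(σ) + 1):
            if π[i:i + len(σ)] == σ:
                result.add(π[:i] + τ + π[i + len(σ):])
    return result

def is_derivation(chain, P):
    """True if each string in chain is directly derivable from its predecessor."""
    return all(χ in step(π, P) for π, χ in zip(chain, chain[1:]))

chain = [tuple(s.split()) for s in [
    'S', 'NP VP', 'PN VP', 'Kevin VP', 'Kevin V NP', 'Kevin eats NP',
    'Kevin eats D N', 'Kevin eats a N', 'Kevin eats a banana']]

assert is_derivation(chain, P0)
print(len(chain) - 1)   # number of derivation steps
```

The chain has nine strings and therefore eight steps, in line with `S ⇒⁸ Kevin eats a banana`.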

The _language_ `L(G)` generated by grammar `G = (T, N, P, S)` is the set of all sequences of terminal symbols which can be derived from the start symbol:

    L(G) = {χ ∈ T* | S ⇒ᐩ χ}

Grammars `G` and `G'` are _equivalent_ if they generate the same language, `L(G) = L(G')`.

**Example.** Given `G₁ = (T, N, P, S)`, where `T = {a, b, c, d}`, `N = {S, X}`, `P = {S → aX, S → bX, X → c, X → d}`, the sequence `ac` is derivable from `S`, formally `S ⇒ᐩ ac`,

    S ⇒ aX ⇒ ac

as are `ad`, `bc`, `bd`. The language generated by `G₁` is:

    L(G₁) = {ac, ad, bc, bd}

**Question.** What are other equivalent grammars? 

_Answer._
- `G₁' = (T, N', P', S)`, where `N' = {S, X, Y}`, `P' = {S → XY, X → a, X → b, Y → c, Y → d}`, is equivalent to `G₁`.
- Renaming the non-terminals also gives an equivalent grammar. In that sense, non-terminals "carry no meaning".
- Adding nonterminal `X₁` to `G₁` and replacing `X → c` with `X → X₁`, `X₁ → c` also gives an equivalent grammar. Repeating this, infinitely many equivalent grammars can be obtained.
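
For small finite languages, equivalence can be checked by enumerating both languages. The sketch below assumes single-character symbols, productions as pairs of Python strings, and productions that never shorten a string, so that sentential forms longer than `max_len` can be discarded; the function `language` and the names `P1`, `P1_alt` are made up for this illustration:

```python
def language(S, P, T, max_len):
    """Terminal strings of length ≤ max_len derivable from S.
    Assumes no production shortens a string, so longer forms can be dropped."""
    seen, todo, result = {S}, [S], set()
    while todo:
        π = todo.pop()
        if all(x in T for x in π):
            result.add(π)
        for σ, τ in P:
            i = π.find(σ)
            while i != -1:                       # every occurrence of σ in π
                χ = π[:i] + τ + π[i + len(σ):]
                if len(χ) <= max_len and χ not in seen:
                    seen.add(χ)
                    todo.append(χ)
                i = π.find(σ, i + 1)
    return result

T = set('abcd')
P1 = [('S', 'aX'), ('S', 'bX'), ('X', 'c'), ('X', 'd')]                   # G₁
P1_alt = [('S', 'XY'), ('X', 'a'), ('X', 'b'), ('Y', 'c'), ('Y', 'd')]    # first grammar of the answer

print(sorted(language('S', P1, T, 2)))
assert language('S', P1, T, 2) == language('S', P1_alt, T, 2)
```

Both grammars only derive strings of length 2, so the bound `max_len = 2` loses nothing here. For infinite languages such a check can only compare the languages up to a chosen length.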

Languages generated by a grammar can be _finite_ or _infinite_. Infinite languages are expressed through recursion with a finite set of productions.

**Example.** Let `G₂ = (T, N, P, S)`, where `T = {a}`, `N = {S}` and let the productions `P` be:

    S → ε
    S → aS

For a string `σ`, the term `σⁿ` stands for `σ` repeated `n` times, formally `σ⁰ = ε` and `σⁿ⁺¹ = σσⁿ`.  For example, `{aⁿ | n ≥ 0}` is  `{ε, a, aa, aaa, aaaa, …}`.

**Theorem.** The language of `G₂` is that of sequences over `a` of arbitrary length:

    L(G₂) = {aⁿ | n ≥ 0}
 
*Proof.* This is formally proved by inclusion in both directions. By definition of `L(G₂)`,

    {χ ∈ T* | S ⇒ᐩ χ} ⊆ {aⁿ | n ≥ 0}

means that for every `χ ∈ T*` derivable from `S`, there exists an `n ≥ 0` such that `χ = aⁿ`. This is shown by induction over the length of derivations  from `S`.

- _Base._ A derivation of `χ` of length `1` from `S` can only derive `χ = ε` by the first production. As `ε = a⁰`, the base case holds.
- _Step._ We need to show that each `χ` derivable from `S` in `n + 1` steps, `S ⇒ⁿ⁺¹ χ`, is `aⁱ` for some `i ≥ 0`, under the hypothesis that each `χ` derivable from `S` in `n` steps, `S ⇒ⁿ χ`, is `aⁱ` for some `i ≥ 0`. If `χ` is derivable in `n + 1` steps with `n ≥ 1`, the first step must use the second production, so `S ⇒ aS ⇒ⁿ χ` and `χ` is `aω` for some `ω`. Since `ω` is derived from `S` in `n` steps, `ω` is `aⁱ` for some `i ≥ 0`, hence `χ = aω` is `aⁱ⁺¹`.

The inclusion in the other direction means that every `aⁿ` for `n ≥ 0` can be derived from `S`:

    {aⁿ | n ≥ 0} ⊆ {χ ∈ T* | S ⇒ᐩ χ}

This is shown by induction over `n`.

- _Base._ For `n = 0`, obviously `a⁰ = ε` can be generated by the first production, `S ⇒ᐩ ε`.
- _Step._ Suppose `aⁿ` can be generated, `S ⇒ᐩ aⁿ`. We need to show that `aⁿ⁺¹` can also be generated. This follows from `S ⇒ aS ⇒ᐩ aaⁿ = aⁿ⁺¹`.

Thus we can conclude `L(G₂) = {aⁿ | n ≥ 0}`.

Recursion also allows the expression of arbitrarily deep *nested structures*.

**Example.** Let `G₃ = (T, N, P, S)`, where `T = {a, b, c}`, `N = {S}`, and the productions `P` are:

    S → b
    S → aSc

The sequence `aabcc` is derivable from `S`:

    S ⇒ aSc ⇒ aaScc ⇒ aabcc
    
The generated language is:

    L(G₃) = {b, abc, aabcc, aaabccc, …} = {aⁿbcⁿ | n ≥ 0}
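
The recursion in the productions of `G₃` carries over to recursive functions. The following sketch, with made-up function names, generates and recognizes strings of `L(G₃)`, assuming strings are Python strings over `a`, `b`, `c`:

```python
def generate_G3(n):
    """String obtained by applying S → aSc n times and then S → b."""
    return 'b' if n == 0 else 'a' + generate_G3(n - 1) + 'c'

def in_L_G3(s):
    """True if s can be derived from S with S → b and S → aSc."""
    if s == 'b':                                        # S → b
        return True
    if len(s) >= 3 and s[0] == 'a' and s[-1] == 'c':    # S → aSc
        return in_L_G3(s[1:-1])
    return False

print([generate_G3(n) for n in range(4)])
assert all(in_L_G3(generate_G3(n)) for n in range(10))
assert not in_L_G3('aabc')
```

Each recursive call corresponds to one application of `S → aSc`, so the depth of the recursion equals the depth of the nesting.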

### Chomsky Hierarchy

Languages can be classified according to restrictions on their grammar. The following classification is known as the _Chomsky Hierarchy_ [(Chomsky 1956)](#Chomsky56). For grammar `G = (T, N, P, S)`, let `V = T ∪ N` be its vocabulary, and assume `a ∈ T`, `A, B ∈ N`, `μ, ν, τ ∈ V*`, `σ ∈ Vᐩ`:

- A grammar is _context-sensitive_ if productions are of the form

    μAν → μσν
        
    Additionally, `S → ε` is allowed, provided that `S` does not occur on the right-hand side of another production.


- A grammar is _context-free_ if productions are of the form

    A → τ
        
- A grammar is _regular_ if productions are of the form

    A → ε  
    A → a  
    A → aB

**Question.** Which of the grammars `G₀`, `G₁`, `G₂`, `G₃` are regular or context-free?

_Answer._
- `G₀` is not regular, but is context-free
- `G₁` is regular (and therefore context-free)
- `G₂` is regular (and therefore context-free)
- `G₃` is not regular, but is context-free
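
The forms of productions can be checked mechanically. The following sketch assumes single-character symbols and productions as pairs of Python strings; the function names are made up, and only `G₁` to `G₄` are encoded since `G₀` uses multi-character symbols:

```python
def is_context_free(P, N):
    """Every production has the form A → τ with A a nonterminal."""
    return all(len(σ) == 1 and σ in N for σ, τ in P)

def is_regular(P, N, T):
    """Every production has the form A → ε, A → a, or A → aB."""
    def ok(σ, τ):
        if not (len(σ) == 1 and σ in N):
            return False
        if len(τ) == 0:
            return True                                   # A → ε
        if len(τ) == 1:
            return τ in T                                 # A → a
        return len(τ) == 2 and τ[0] in T and τ[1] in N    # A → aB
    return all(ok(σ, τ) for σ, τ in P)

G1 = ([('S', 'aX'), ('S', 'bX'), ('X', 'c'), ('X', 'd')], {'S', 'X'}, set('abcd'))
G2 = ([('S', ''), ('S', 'aS')], {'S'}, {'a'})
G3 = ([('S', 'b'), ('S', 'aSc')], {'S'}, set('abc'))
G4 = ([('S', 'abc'), ('S', 'aBSc'), ('Ba', 'aB'), ('Bb', 'bb')], {'S', 'B'}, set('abc'))

for name, (P, N, T) in [('G₁', G1), ('G₂', G2), ('G₃', G3), ('G₄', G4)]:
    print(name, 'regular:', is_regular(P, N, T), 'context-free:', is_context_free(P, N))
```

Note that the check classifies grammars, not languages: a language may have a regular grammar even if a particular grammar for it is not regular.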

<div style="float:right;background-color:lightgrey;margin-left:2em;margin-top:1em">

`S → NP VP`  
`NP → D Nₛ`  
`NP → D Nₚ`  
`Nₛ VP → Nₛ Vₛ`  
`Nₚ VP → Nₚ Vₚ`  
`D → the`  
`Nₛ → child`  
`Nₚ → children`  
`Vₛ → runs`  
`Vₚ → run`

</div>

Context-sensitive languages allow the expression of the subject-verb agreement with respect to singular vs. plural in natural languages.

**Example.** Consider the grammar to the right with terminals in lower case and nonterminals in upper case letters. Then

  `the child runs`  
  `the children run`  

are sentences, but `the child run` is not.  


**Question.** What is a derivation of `the child runs`?

*Answer.*

      S
    ⇒ NP VP
    ⇒ D Nₛ VP
    ⇒ D Nₛ Vₛ
    ⇒ the Nₛ Vₛ
    ⇒ the child Vₛ
    ⇒ the child runs

Here are some fundamental results from formal language theory. Regular grammars can express repetition, but not nesting:

**Theorem.** No regular grammar for `L(G₃)` exists.

**Example.** Let `G₄ = (T, N, P, S)`, where `T = {a, b, c}`, `N = {S, B}`, and let the productions `P` be:

    S → abc
    S → aBSc
    Ba → aB
    Bb → bb

The grammar is not context-free. The language generated is:

    L(G₄) = {abc, aabbcc, aaabbbccc, …} = {aⁿbⁿcⁿ | n ≥ 1}

**Question.** What is a derivation of `aaabbbccc` in `G₄`? Explain how the grammar works!

_Answer._

      S
    ⇒ aBSc
    ⇒ aBaBScc
    ⇒ aBaBabccc
    ⇒ aBaaBbccc
    ⇒ aaBaBbccc
    ⇒ aaaBBbccc
    ⇒ aaaBbbccc
    ⇒ aaabbbccc

The grammar works by first producing the same number of `a`, `B`, `c`, with all `c` in the correct position at the end but `a` and `B` alternating. The production `Ba → aB` moves all `a` to the left and all `B` to the middle. Once a `B` is in its correct position, it is converted to a `b`.
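
The strategy just described can be replayed step by step. In the sketch below, each `replace(σ, τ, 1)` applies one production `σ → τ` to the leftmost occurrence of `σ`; the function name `derive_G4` and the fixed order of applying productions are choices made for this illustration, and other derivation orders exist:

```python
def derive_G4(n):
    """One derivation of aⁿbⁿcⁿ in G₄, for n ≥ 1, returning the final string."""
    π = 'S'
    for _ in range(n - 1):
        π = π.replace('S', 'aBSc', 1)    # S → aBSc
    π = π.replace('S', 'abc', 1)         # S → abc
    while 'Ba' in π:
        π = π.replace('Ba', 'aB', 1)     # Ba → aB moves a to the left of B
    while 'Bb' in π:
        π = π.replace('Bb', 'bb', 1)     # Bb → bb converts a B next to a b
    return π

print(derive_G4(3))
assert all(derive_G4(n) == 'a' * n + 'b' * n + 'c' * n for n in range(1, 6))
```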

**Theorem.** No context-free grammar for `L(G₄)` exists.

Grammar `G₄` is not context-sensitive: `Ba → aB` does not match the form for context-sensitive productions. However, grammar `G₄'` with the same terminals, additional nonterminals `A`, `X`, `Y`, and the following productions is context-sensitive and equivalent; it uses four productions to achieve `BA → AB` and adds the production `A → a`:

    S → Abc
    S → ABSc
    BA → XA
    XA → XY
    XY → AY
    AY → AB
    Bb → bb
    A → a

**Question.** Argue that `G₄'` is context-sensitive. What is a derivation of `aabbcc` in `G₄'`?

*Answer.*  The production `BA → XA` replaces `B` with `X` in right context `A`, so matches `μAν → μσν` with `μ`, `A`, `ν`, `σ` being `ε`, `B`, `A`, `X`. The production `XA → XY` replaces `A` with `Y` in left context `X`, so matches `μAν → μσν` with `μ`, `A`, `ν`, `σ` being `X`, `A`, `ε`, `Y`. The other productions are similar.

      S
    ⇒ ABSc
    ⇒ ABAbcc
    ⇒ AXAbcc
    ⇒ AXYbcc
    ⇒ AAYbcc
    ⇒ AABbcc
    ⇒ AAbbcc
    ⇒ Aabbcc
    ⇒ aabbcc

**Example.** Let `G₅ = (T, N, P, S)`, where `T = {a, b}`, `N = {A, B, S}`, and productions `P` are:  

  `S → aAS`  
  `S → bBS`  
  `Aa → aA`  
  `Ab → bA`  
  `Ba → aB`  
  `Bb → bB`  
  `AS → Sa`  
  `BS → Sb`  
  `S → ε`  

The grammar is not context-free. The language generated is the *copy language*:

    L(G₅) = {ww | w ∈ T*}

The first two productions produce an arbitrary sequence of pairs of `aA` and `bB` ending with `S`. The following four productions move all `A` and `B` to the right without "overtaking" each other. The final three productions convert `A` to `a` and `B` to `b` from right to left.

**Question.** What is a derivation of `abab` in `G₅`?

_Answer._

```
  S
⇒ aAS
⇒ aAbBS
⇒ abABS
⇒ abASb
⇒ abSab
⇒ abab
```
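
The three phases of `G₅` can be replayed for an arbitrary `w`. The sketch below applies one production per `replace(σ, τ, 1)` call, assuming `w` consists of the characters `a` and `b` only; the names `derive_copy` and `rewrite_exhaustively` and the order of applying productions are choices made for this illustration:

```python
MOVES = [('Aa', 'aA'), ('Ab', 'bA'), ('Ba', 'aB'), ('Bb', 'bB')]
CONVERSIONS = [('AS', 'Sa'), ('BS', 'Sb')]

def rewrite_exhaustively(π, rules):
    """Apply productions from rules, one occurrence at a time, until none applies."""
    changed = True
    while changed:
        changed = False
        for σ, τ in rules:
            if σ in π:
                π = π.replace(σ, τ, 1)
                changed = True
    return π

def derive_copy(w):
    """One derivation of ww in G₅, returning the final string."""
    π = 'S'
    for x in w:                                    # S → aAS or S → bBS
        π = π.replace('S', x + x.upper() + 'S', 1)
    π = rewrite_exhaustively(π, MOVES)             # move A, B to the right
    π = rewrite_exhaustively(π, CONVERSIONS)       # convert A, B from right to left
    return π.replace('S', '', 1)                   # S → ε

print(derive_copy('ab'))
assert all(derive_copy(w) == w + w for w in ['', 'a', 'ab', 'aab', 'abba'])
```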

**Theorem.** No context-free grammar for `L(G₅)` exists.

Languages generated by context-sensitive, context-free, and regular grammars are called *context-sensitive*, *context-free*, and *regular languages*, respectively.

**Theorem.** Every regular language is also context-free. Every context-free language is also context-sensitive.

Note that the inclusion does not quite hold for grammars, as `A → ε` is allowed in regular and context-free grammars, but not in context-sensitive grammars.

For brevity, we write

	σ → τ₀ | τ₁ | …

for the set of productions

    σ → τ₀
    σ → τ₁
    …

### Concrete and Abstract Syntax Trees

We continue with context-free languages. For those, the _parse tree_ or _concrete syntax tree_ is a visual representation of a derivation, which abstracts from the order of independent applications of productions. In the example, `E` and `id` stand for expressions and identifiers of programs.

<img style="width:6em;float:right;border-left:10px solid white" src="./img/idplusid.svg">

**Example.** Let `G₆ = (T, N, P, E)` where `T = {id, +}`, `N = {E}`, and the productions `P` are:

    E → id | E + E

There are two derivations of `id + id`:

    E ⇒ E + E ⇒ id + E ⇒ id + id
    E ⇒ E + E ⇒ E + id ⇒ id + id

<img style="width:19em;float:right;border-left:10px solid white" src="./img/idplusidplusid.svg"></img>
Continuing with `G₆`, there are two parse trees for `id + id + id`. A sentence with more than one parse tree is an _ambiguous sentence_ and a grammar allowing that is an *ambiguous grammar*. Syntactically ambiguous sentences may have an ambiguous meaning. In natural languages, this may be resolved through the context; in programming languages, syntactic ambiguity is generally avoided.

<img style="width:9em;float:right;border-left:10px solid white" src="./img/idplusleft.svg">

Changing the productions to a _left-recursive_ form eliminates ambiguity and makes `+` associate to the left.

    E → id | E + id

<img style="width:9em;float:right;border-left:10px solid white" src="./img/idplusright.svg"></img>
Changing the productions to a _right-recursive_ form eliminates ambiguity and makes `+` associate to the right.

    E → id | id + E

**Question.** For which operators in programming languages does associativity matter, and for which does it not?

_Answer._
- For integer division, associativity matters.
- For integer addition, associativity matters in bounded arithmetic (overflow is an error) and saturating arithmetic (overflow results in maximal number).
- For integer addition, associativity does not matter in modulo arithmetic, e.g. modulo the word size, and in arbitrary-precision arithmetic.
- For bitwise `and` and bitwise `or`, associativity does not matter.
- For string concatenation, associativity does not matter.
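
These cases can be tried out in Python. The helpers `sat_add` and `wrap_add` below are made-up models of 8-bit saturating and modulo addition, not library functions:

```python
# Integer division: associativity matters
print((8 // 4) // 2, 8 // (4 // 2))

def sat_add(x, y, lo=-128, hi=127):
    """Saturating addition: results outside lo..hi are clamped."""
    return max(lo, min(hi, x + y))

def wrap_add(x, y, bits=8):
    """Modulo addition with the given word size."""
    return (x + y) % (1 << bits)

a, b, c = 100, 100, -100
# Saturating addition: associativity matters
print(sat_add(sat_add(a, b), c), sat_add(a, sat_add(b, c)))

# Modulo addition and string concatenation: associativity does not matter
assert wrap_add(wrap_add(200, 100), 100) == wrap_add(200, wrap_add(100, 100))
assert ('ab' + 'c') + 'd' == 'ab' + ('c' + 'd')
```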

The next example illustrates operator _precedence_.

<img style="width:19em;float:right;border-left:10px solid white" src="./img/idplusidtimesid.svg"></img>
**Example.** Let `G₇ = (T, N, P, E)` where `T = {id, +, ×}`, `N = {E}`, and the productions `P` are:

    E → id | E + id | E × id

In `id + id × id`, operator `+` binds tighter; in `id × id + id`, operator `×` binds tighter: `+` and `×` bind equally tight and associate to the left.

<img style="width:19em;float:right;border-left:10px solid white" src="./img/idplustimestimesplus.svg">

To have proper operator precedence, nonterminal `T` for terms is introduced and the productions are changed to:  

    E → T | E + T
    T → id | T × id

Parentheses are needed to allow `+` to bind tighter than `×`. For this, an additional nonterminal, `F` for factor, is introduced.

**Example.** Let `G₈ = (T, N, P, E)` where `T = {id, +, ×, (, )}`, `N = {E, T, F}`, and the productions `P` are:  

    E → T | E + T
    T → F | T × F
    F → id | ( E )

**Question.** What are the parse trees for `id + id × id`, for `id × id + id`, and for `(id + id) × id`?

_Answer._

<img style="width:30em" src="./img/idparen.svg"></img>

A _structural tree_ or _abstract syntax tree_ is a simplified parse tree with only the relevant structure information:
- Productions whose sole purpose is to define precedence (like bracketing) are left out.
- Chains of derivations are left out.
- Nodes are labelled with the construct in question rather than a nonterminal.

For example, for `id + id × id`, for `id × id + id`, and for `(id + id) × id`:

<img style="width:28em" src="./img/idast.svg">

A parse tree is also called a _concrete syntax tree_.
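
Abstract syntax trees can be represented as nested tuples. The following sketch builds them for sentences of `G₈` with a hand-written recursive-descent parser, a technique not introduced in this section; it assumes the input is already split into a list of symbols, and it replaces the left recursion of `E` and `T` by loops that build left-leaning trees:

```python
def parse(tokens):
    """AST of a sentence of G₈ as nested tuples, e.g. ('+', 'id', 'id')."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f'expected {expected!r} at position {pos}')
        pos += 1

    def expression():              # E → T | E + T
        tree = term()
        while peek() == '+':
            advance('+')
            tree = ('+', tree, term())
        return tree

    def term():                    # T → F | T × F
        tree = factor()
        while peek() == '×':
            advance('×')
            tree = ('×', tree, factor())
        return tree

    def factor():                  # F → id | ( E )
        if peek() == '(':
            advance('(')
            tree = expression()
            advance(')')
            return tree            # parentheses leave no node in the AST
        advance('id')
        return 'id'

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f'unexpected {peek()!r} at position {pos}')
    return tree

for s in ['id + id × id', 'id × id + id', '( id + id ) × id']:
    print(s, ' ', parse(s.split()))
```

The chains `E ⇒ T ⇒ F ⇒ id` and the parentheses of the parse tree do not appear in the result; only the operators and operands remain.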

### Backus-Naur Form

Context-free grammars are more conveniently written in _Backus-Naur Form_ (*BNF*):
- The left-hand side of the first production is the start symbol.
- Terminals are enclosed in `'quotes'`; all other symbols are nonterminals.
- Productions for the same nonterminal are grouped into one, separated by `|`.
- The empty string `ε` is written as `''`.

For example, here is a BNF grammar for expressions like `– 3 × a + b`

    expression → term | '+' term | '–' term | expression '+' term | expression '–' term
    term → factor | term '×' factor | term '/' factor
    factor → number | identifier | '(' expression ')'

and one for statements like `if b then x := 3 else (x := y ; y := 5)`:

    statement → assignment | compoundStatement | ifStatement | whileStatement
    assignment → identifier ':=' expression
    compoundStatement → '(' statementSequence ')'
    statementSequence → statement | statementSequence ';' statement
    ifStatement → 'if' expression 'then' statement | 'if' expression 'then' statement 'else' statement
    whileStatement → 'while' expression 'do' statement

Let us define BNF in BNF! The terminals are characters written in quotes. The newline character is written as `\n`, and the quote character is `\'`. We let `char` stand for an arbitrary character:

    grammar  →  production | grammar '\n' production
    production  →  identifier '→' expression
    expression  →  term | expression '|' term
    term  →  factor | term ' ' factor
    factor  →  identifier | string
    identifier  →  letter | identifier letter | identifier digit
    letter  →  'A' | … | 'Z'
    digit  →  '0' | … | '9'
    string  →  '\'' characters '\''
    characters  →  characters char | ''

Numerous variations of BNF exist. For example, the grammar of C uses different fonts for terminals and nonterminals, enumerates the terms of a production indented on subsequent lines, and uses <code>A<sub>opt</sub></code> if `A` is optional [(Kernighan and Ritchie 1988)](#KernighanRitchie88). Formally, using <code>A<sub>opt</sub></code> amounts to adding a production <code>A<sub>opt</sub> → A | ε</code>. Here is a simplified fragment:

<!-- <code style="font-family:monospace"> -->
<code style="font:Noto Sans Mono">
<i>statement:</i>
        <i>compound-statement</i>
        <i>expression-statement</i>
        <i>selection-statement</i>
        <i>iteration-statement</i>
<i>compound-statement:</i>
        { <i>statement-list<sub>opt</sub></i> }
<i>statement-list:</i>
        <i>statement</i>
        <i>statement-list</i> <i>statement</i>
<i>selection-statement:</i>
        if ( <i>expression</i> ) <i>statement</i>
        if ( <i>expression</i> ) <i>statement</i> else <i>statement</i>
        switch ( <i>expression</i> ) <i>statement</i>
<i>iteration-statement:</i>
        while ( <i>expression</i> ) <i>statement</i>
        for ( <i>expression<sub>opt</sub></i> ; <i>expression<sub>opt</sub></i> ; <i>expression<sub>opt</sub></i> ) <i>statement</i>
</code>

EBNF is an extension of BNF that allows simple repetitions to be formulated more naturally and avoids inflation of nonterminals:
- `(A)` allows precedence to be expressed. Formally, `(A)` stands for a new nonterminal `X` with the production `X → A` added.
- `[A]` means that `A` is optional. Formally, `[A]` stands for a new nonterminal `X` with the production `X → A | ε` added.
- `{A}` means repeating `A` an arbitrary number of times. Formally, `{A}` stands for a new nonterminal `X` with the production `X → X A | ε` added.

For example, here is an EBNF grammar for expressions,

    expression → [ '+' | '–' ] term { ( '+' | '–' ) term }
    term → factor { ( '×' | '/' ) factor }
    factor → number | identifier | '(' expression ')'

and one for statements:

    statement → assignment | compoundStatement | ifStatement | whileStatement
    assignment → identifier ':=' expression
    compoundStatement → '(' statement { ';' statement } ')'
    ifStatement → 'if' expression 'then' statement ['else' statement]
    whileStatement → 'while' expression 'do' statement

**Question.** First, eliminate `(...)`, `[...]` in the expression grammar, then eliminate `{...}`. How can the grammar be made more readable?

_Answer._ For eliminating `(...)` and `[...]`, we introduce `unaryop`, `addop`, and `multop`:

    expression → unaryop term { addop term }
    unaryop → '+' | '–' | ε
    addop → '+' | '–'
    term → factor { multop factor }
    multop → '×' | '/'
    factor → number | identifier | '(' expression ')'

For eliminating `{...}` in the production for `term`, that production can be replaced by:

    term → factor morefactor
    morefactor → morefactor multop factor | ε

The introduction of the nonterminal `morefactor` and the use of `ε` can be avoided here: 

    expression → unaryop primary
    primary → term | primary addop term
    unaryop → '+' | '–' | ε
    addop → '+' | '–'
    term → factor | term multop factor
    multop → '×' | '/'
    factor → number | identifier | '(' expression ')'


Let us define EBNF in EBNF!

    grammar  →  production { '\n' production }
    production  →  identifier '→' expression
    expression  →  term { '|' term }
    term  →  factor { ' ' factor }
    factor  →  identifier | string | '(' expression ')' | '[' expression ']' | '{' expression '}'
    identifier  →  letter { letter | digit }
    letter  →  'A' | … | 'Z'
    digit  →  '0' | … | '9'
    string  →  '\'' { char } '\''

Sometimes, `=` or `::=` is used instead of `→` and productions are terminated with a dot. For example, here is a fragment of the [Go Grammar](https://golang.org/ref/spec):

    Block = "{" StatementList "}" .
    StatementList = { Statement ";" } .
    
    Statement =
        Declaration | Assignment | Block | IfStmt | SwitchStmt | SelectStmt | ForStmt .

    Assignment = ExpressionList assign_op ExpressionList .
    assign_op = [ add_op | mul_op ] "=" .

    ExpressionList = Expression { "," Expression } .

Productions using `=` are also called *syntactic equations*; however, care has to be taken as `A = B` is not the same as `B = A`!

More variations of EBNF exist:

- Zero or more repetitions of `E` are also written as `E*`.
- One or more repetitions of `E` are written as `Eᐩ`, which stands for `E E*`.
- An optional occurrence of `E` is also written as `E?`, which stands for `E | ε`.
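
The same postfix operators are used in regular expressions. As a sketch, the production `identifier → letter { letter | digit }` from above corresponds to a pattern for Python's `re` module; the rule for signed numbers is a made-up example and not taken from the grammars above:

```python
import re

# identifier → letter { letter | digit }, with letters 'A' … 'Z' and digits '0' … '9'
IDENTIFIER = re.compile(r'[A-Z][A-Z0-9]*')

# hypothetical rule: number → [ '+' | '–' ] digit { digit }, i.e. ('+' | '–')? digitᐩ
SIGNED_NUMBER = re.compile(r'[+–]?[0-9]+')

def is_identifier(s):
    return IDENTIFIER.fullmatch(s) is not None

def is_signed_number(s):
    return SIGNED_NUMBER.fullmatch(s) is not None

assert is_identifier('X1') and not is_identifier('1X')
assert is_signed_number('–42') and is_signed_number('7') and not is_signed_number('+')
```

This works because both rules describe regular languages; a nested construct such as `'(' expression ')'` cannot be expressed this way.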

Here is a fragment of the Python [grammar in the language reference](https://docs.python.org/3/reference/compound_stmts.html) (which differs slightly from the [grammar used by parsers](https://docs.python.org/3/reference/grammar.html)). In Python, the indentation of statements matters; this is expressed in the grammar by symbols that indicate indentation:

<pre style="font-family:monospace;color:royalblue">
compound_stmt ::=  if_stmt
                   | while_stmt
                   | for_stmt
suite         ::=  stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
statement     ::=  stmt_list NEWLINE | compound_stmt
stmt_list     ::=  simple_stmt (";" simple_stmt)* [";"]

if_stmt ::=  "if" expression ":" suite
             ("elif" expression ":" suite)*
             ["else" ":" suite]
</pre>

EBNF is not only helpful for the compact definition of a grammar but is also essential for the construction of a specific kind of recognizer. Our preference for EBNF is motivated by that.

### Syntax Diagrams

An EBNF grammar can be equivalently represented by _syntax diagrams_ (*railroad diagrams*). These are constructed recursively over the structure of EBNF grammars. Let `'a'` stand for a string (terminal), `A` for an identifier (nonterminal), and `E`, `E₁`, `E₂`, … for expressions (right-hand side of productions):

| EBNF            | syntax diagram                                     |
|:----------------|:---------------------------------------------------|
| `A → E`         |<img style="width:10em" src="./img/production.svg"> |
| `'a'`           |<img style="width:10em" src="./img/terminal.svg">   |
| `A`             |<img style="width:10em" src="./img/nonterminal.svg">|
| `E₁ │ E₂ │ …`   |<img style="width:10em" src="./img/choice.svg">     |
| `E₁ E₂ …`       |<img style="width:14em" src="./img/sequence.svg">   |
| `(E)`           |<img style="width:10em" src="./img/parenthesis.svg">|
| `[E]`           |<img style="width:10em" src="./img/option.svg">     |
| `{E}`           |<img style="width:10em" src="./img/repetition.svg"> |

For example, for

    A → 'x' | '(' A { '+' A } ')'

the syntax diagram is:

<img style="width:28em" src="./img/railroad.svg">

**Question.** What is the syntax diagram for EBNF?

### Recognizers

A _recognizer_ for a language is a program that takes as input a string and _accepts_ it if the string is a sentence of the language or otherwise _rejects_ it. For regular, context-free, and context-sensitive languages, a _universal recognizer_ exists, i.e. a program that, given a grammar `G` and string `ω`, decides whether `ω ∈ L(G)`. For an unrestricted grammar `G`, `ω ∈ L(G)` is in general undecidable.

For a context-sensitive grammar `G = (T, N, P, S)`, a universal recognizer can be constructed by generating all derivations of length `1`, length `2`, etc. from the start symbol and keeping a set, `d`, of the derived strings. New strings are only added to `d` if they are not longer than `ω`, since in context-sensitive grammars derived strings cannot shrink. This terminates if either `ω` is derived, in which case `ω` is accepted, or no more strings can be added to `d`, i.e. all derivable strings of length at most `|ω|` have been explored:

```algorithm
procedure derivable(S, P, ω): boolean 
    d₀, d := {}, {S}
    while d₀ ≠ d do
        d₀ := d
        for π ∈ d₀ do
            for σ → τ ∈ P do
                for μ, ν where π = μσν do
                    χ := μτν
                    if χ = ω then return true
                    else if |χ| ≤ |ω| then d := d ∪ {χ}
    return false
```

This algorithm always terminates and the memory it uses is bounded. Since the set `d` may be very large, it is not a practical universal recognizer, but a constructive proof that membership in a context-sensitive language is decidable.

In the Python implementation below, symbols are represented by characters, strings by Python strings, and productions of the form `σ → τ` as strings `σ→τ`, where the character `→` must not be a symbol of the grammar. The method `s.find(t, i)` returns the index of the first occurrence of `t` in `s` starting at index `i`, or `-1` if no such occurrence exists:

```python
def derivable(S, P, ω, trace=False):
    # S: start symbol, a string; P: productions, a set of strings of the form 'σ→τ'; ω: string
    d0, d = set(), {S}  # strings known before the current round, strings derived so far
    while d != d0:  # stop when a round adds no new strings
        d0 = d
        if trace: print('S ⇒*', d)
        for p in P:
            i = p.find('→', 0)
            σ, τ = p[0:i], p[i+1:]
            for π in d0:
                i = π.find(σ, 0)
                while i != -1:  # for every occurrence of σ in π
                    χ = π[0:i] + τ + π[i + len(σ):]
                    if trace: print('    ', π, '⇒', χ)
                    if χ == ω: return True
                    elif len(χ) <= len(ω): d = d.union({χ})
                    i = π.find(σ, i + 1)
    return False
```

If `trace` is set to `True`, the set of all strings derived so far is printed at the start of each round, i.e. the strings obtained by derivations of length at most `0`, `1`, `2`, etc. Additionally, all direct derivations from the strings in that set are printed with indentation:

```python
assert derivable('S', {'S→a', 'S→Sb'}, 'abb', trace=True)
```

```python
assert not derivable('S', {'S→a', 'S→Sb'}, 'bb')
```

Abbreviating `the`, `child`, `children`, `runs`, `run` by `t`, `c`, `C`, `r`, `R` and `NP`, `VP`, `Nₛ`, `Vₛ`, `Nₚ`, `Vₚ` by `𝒩`, `𝒱`, `n`, `v`, `N`, `V`, the previous productions are expressed as:

```python
P = {'S→𝒩𝒱', '𝒩→Dn', '𝒩→DN', 'n𝒱→nv', 'N𝒱→NV',
     'D→t', 'n→c', 'N→C', 'v→r', 'V→R'}
```

```python
assert derivable('S', P, 'tCR')
```

```python
assert not derivable('S', P, 'tcR')
```

### Historic Notes and Further Reading

The Backus-Naur Form was first proposed by John Backus and then adopted by Peter Naur for the definition of Algol 60. Donald Knuth suggested the name [(Knuth 1964)](#Knuth64). EBNF was proposed by Niklaus Wirth [(Wirth 1977)](#Wirth77).

The original motivation for the classification of grammars came from the study of natural languages. The following examples illustrate the potential use of regular, context-free, and context-sensitive languages (credit for examples: [C. Chesi, Univ. of Siena](http://www.ciscl.unisi.it/master/chesi/lingcomp-2017_18-03_04-formal_grammar.pdf)):

- _Right recursion_ (*tail recursion*) of the form `abⁿ`:


    [the dog bit [the cat [that chased [the mouse [that ran]]]]]


- _Center embedding_ (*true recursion*) of the form `aⁿbⁿ`:


    [the mouse [(that) the cat [(that) the dog bit] chased] ran]


- _Cross‐serial dependencies_ (*identity recursion*) of the form `ww`:


    John, Mary, and David are a widower, a widow, and a widower, respectively

There is an ongoing discussion on using regular, context-free, and context-sensitive languages for natural languages. The male-female correspondence of the last example can also be seen as a semantic issue rather than a syntactic issue. If one takes the limits of human comprehension into account, the full generality of context-sensitive, context-free, and even regular languages is not needed. As a consequence, further classes of grammars have emerged, e.g. [(Kallmeyer 2010)](#Kallmeyer10).

Grammars can be used for the translation of natural languages: first, the input sentence is parsed according to the grammar of the source language, and then a sentence is generated that satisfies the grammar of the target language. That was, for decades, the dominant approach until computers became fast enough for neural networks, which perform better than grammar-based translation [(Wu et al. 2016)](#WuEtAl16), [(Le and Schuster 2016)](#LeSchuster16).

On the other hand, Chomsky's Hierarchy profoundly impacted computing: for each class of languages, equivalent recognizers are known. Calling languages of unrestricted grammars *recursively enumerable*, we have:

|type   |language              |recognizer              |
|:------|:---------------------|:-----------------------|
|type 0 |recursively enumerable|Turing machine          |
|type 1 |context-sensitive     |linear bounded automaton|
|type 2 |context-free          |pushdown automaton      |
|type 3 |regular               |finite state automaton  |

Regular and context-free languages are ubiquitous, as recognizers for them can be constructed efficiently and the recognizers themselves are, in a certain sense, efficient. The following chapters discuss their use for scanning and parsing.

Even the above examples show the difficulty of writing context-sensitive grammars. After Algol 60 introduced the use of context-free grammars for syntax definition, with Algol 68, an attempt was made to go beyond context-free grammars by using a dedicated "two-level grammar" [(Wijngaarden et al. 1976)](#WijngaardenEtAl76); that kind of grammar was not used for another language. Around the same time, Knuth proposed _attribute grammars_ as a way of associating computation to recognition of a context-free language [(Knuth 1968)](#Knuth68). Since then, it has become common to define a programming language with regular and context-free grammars and to use attribute grammars for compilation. Type systems, which can be thought of as context-sensitive grammars, are also used in the definition of some languages [(Cardelli 1996)](#Cardelli96).

The Pascal language and its successors Modula-2 and Oberon have compact EBNF grammars. The [syntax diagrams of Apple Pascal](https://web.archive.org/web/20211028072457/http://www.pascal-central.com/pascal-syntax.html) can fit on a poster. It was common for these to hang on the walls next to the computers!

### Bibliography

<div class="csl-bib-body" style="line-height: 1.35; margin-left: 2em; text-indent:-2em;">
<a id='Cardelli96'></a> <div class="csl-entry">Cardelli, Luca. 1996. “Type Systems.” <i>ACM Comput. Surv.</i> 28 (1): 263–64. <a href="https://doi.org/10.1145/234313.234418">https://doi.org/10.1145/234313.234418</a>.</div>
<a id='Chomsky56'></a> <div class="csl-entry">Chomsky, N. 1956. “Three Models for the Description of Language.” <i>IRE Transactions on Information Theory</i> 2 (3): 113–24. <a href="https://doi.org/10.1109/TIT.1956.1056813">https://doi.org/10.1109/TIT.1956.1056813</a>.</div>
<a id='Kallmeyer10'></a> <div class="csl-entry">Kallmeyer, Laura. 2010. <i>Parsing Beyond Context-Free Grammars</i>. Springer-Verlag Berlin Heidelberg. <a href="https://doi.org/10.1007/978-3-642-14846-0">https://doi.org/10.1007/978-3-642-14846-0</a>.</div>
<a id='KernighanRitchie88'></a> <div class="csl-entry">Kernighan, Brian W., and Dennis M. Ritchie. 1988. <i>The C Programming Language</i>. 2nd ed. Prentice Hall Professional Technical Reference.</div>
<a id='Knuth64'></a> <div class="csl-entry">Knuth, Donald E. 1964. “Backus Normal Form vs. Backus Naur Form.” Letter to the Editor. <i>Communications of the ACM</i> 7 (12): 735–36. <a href="https://doi.org/10.1145/355588.365140">https://doi.org/10.1145/355588.365140</a>.</div>
<a id='Knuth68'></a> <div class="csl-entry">Knuth, Donald E. 1968. “Semantics of Context-Free Languages.” <i>Mathematical Systems Theory</i> 2 (2): 127–45. <a href="https://doi.org/10.1007/BF01692511">https://doi.org/10.1007/BF01692511</a>.</div>
<a id='LeSchuster16'></a> <div class="csl-entry">Le, Quoc V., and Mike Schuster. 2016. “A Neural Network for Machine Translation, at Production Scale.” <i>Google AI Blog</i> (blog). September 27, 2016. <a href="https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html">https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html</a>.</div>
<a id='WijngaardenEtAl76'></a> <div class="csl-entry">Wijngaarden, A. van, B. J. Mailloux, J. E. L. Peck, C. H. A. Koster, C. H. Lindsey, M. Sintzoff, L. G. L. T. Meertens, and R. G. Fisker, eds. 1976. <i>Revised Report on the Algorithmic Language Algol 68</i>. Berlin Heidelberg: Springer-Verlag. <a href="https://www.springer.com/gp/book/9783540075929">https://www.springer.com/gp/book/9783540075929</a>.</div>
<a id='Wirth77'></a> <div class="csl-entry">Wirth, Niklaus. 1977. “What Can We Do about the Unnecessary Diversity of Notation for Syntactic Definitions?” <i>Communications of the ACM</i> 20 (11): 822–23. <a href="https://doi.org/10.1145/359863.359883">https://doi.org/10.1145/359863.359883</a>.</div>
<a id='WuEtAl16'></a> <div class="csl-entry">Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.” <i>CoRR</i> abs/1609.08144. <a href="http://arxiv.org/abs/1609.08144">http://arxiv.org/abs/1609.08144</a>.</div>
</div>