---

# 1. Language and Syntax
**[Emil Sekerinski](http://www.cas.mcmaster.ca/~emil/), McMaster University, January 2021**

---

### Language and Grammar

Every language is based on a _vocabulary_. Its elements are called _words_ or _symbols_ whose structure is of no further interest. The _syntax_ determines which sequences of words, called *sentences*, belong to the language.

| language                | symbols                                          |
|:------------------------|:-------------------------------------------------|
| English                 | `eats`, `Kevin`, `a`, `banana`, ...              |
| Roman numerals          | `I`, `V`, `X`, `L`, `C`, `D`, `M`                |
| identifiers in programs | `A`, `B`, ..., `a`, `b`, ..., `0`, `1`, ..., `_` |
| arithmetic expressions  | `dist`, `rot`, `24`, `+`, `-`, `×`, `/`, ...     |

**Question:** What are other non-spoken languages?

_Answer:_
- Chemical formulae, e.g. <code>H<sub>3</sub>O<sup>+</sup></code> for hydronium.
- Musical scores, with vocabulary 🎼, ♭, ♮, ♯, ♩, ♪, ♫, ♬, etc.
- Morse code, with vocabulary "●" (short), "━━━" (long), " " (pause).

<div style="float:right;background-color:lightgrey;border-left:20px solid white">

**Example:** if `V = {a, b}`, then  
`Vᐩ = {a, b, aa, ab, ba, bb, aaa, … }`  
`V* = {ε, a, b, aa, ab, ba, bb, aaa, … }`  
The sentences of the language  
`L = {σaτ | σ, τ ∈ V*}`  
are those sequences that contain at least one `a`.</div>
Formally, a vocabulary `V` is a finite, non-empty set of (atomic) symbols. The set `V*` of all _finite sequences_ or _strings_ over `V` consists of

- the empty string `ε`,
- any symbol `x ∈ V`,
- the _concatenation_ `στ` of strings `σ, τ ∈ V*`.

The empty sequence is both the left and right identity of concatenation and concatenation is associative, meaning that parentheses can be left out. Formally, for any `σ, τ, ω ∈ V*`:

    σε = σ = εσ
    (στ)ω = σ(τω)

The set of all _non-empty strings_ over `V` is denoted by `Vᐩ`, formally `Vᐩ = V* – {ε}`. The _length_ of string `σ` is written as `|σ|`:

- `|ε| = 0`,
- `|x| = 1` for any `x ∈ V`,
- `|στ| = |σ| + |τ|` for any `σ, τ ∈ V*`.
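The definitions of `V*`, `Vᐩ`, and length can be explored with a small Python sketch. It assumes that symbols are single characters and strings are Python strings, so that concatenation is `+` and the empty string is `''`. The helper name `strings_up_to` is hypothetical; since `V*` is infinite, it only enumerates the strings up to a given length.

```python
from itertools import product

def strings_up_to(V, n):
    """All strings over vocabulary V of length at most n, shortest first (a finite part of V*)."""
    return [''.join(p) for k in range(n + 1) for p in product(sorted(V), repeat=k)]

V = {'a', 'b'}
star = strings_up_to(V, 2)           # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
plus = [s for s in star if s != '']  # the non-empty strings among them

# The empty string is the identity of concatenation, and lengths add up
for s in star:
    assert s + '' == s == '' + s
    for t in star:
        assert len(s + t) == len(s) + len(t)
```

The assertions check the identity and length laws only on this finite sample; they illustrate the laws rather than prove them.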

<img style="width:16em;height:auto;float:right;border-left:10px solid white" src="./img/NLexample.svg"></img>
A *grammar* not only determines unambiguously which sequences of words are sentences and which are not, but also provides sentences with a *structure*. The structure is instrumental in recognizing the *semantics* of a sentence, which is our ultimate goal.

The theory of formal languages originates in linguistics. A basic rule of English is that a sentence (`S`) consists of a noun phrase (`NP`) followed by a verb phrase (`VP`). A noun phrase is either a proper name (`PN`) or a determiner (`D`) followed by a noun (`N`). A verb phrase is either a verb (`V`) or a verb followed by a noun phrase. Determiners are `a` and `the`. The hierarchical composition of an English sentence by a *parse tree* is given to the right; below are the corresponding rules. Grammars of this form are called *generative* and the rules are called *productions*, as they determine how all sentences of a language are generated.

<div style="float:right;background-color:lightgrey;margin-left:18pt">

`S → NP VP`  
`NP → PN`  
`NP → D N`  
`VP → V`  
`VP → V NP`  
`PN → Kevin`  
`PN → Dave`  
`D → a`  
`D → the`  
`N → banana`  
`N → apple`  
`V → eats`  
`V → runs`

</div>

Formally, a grammar `G = (T, N, P, S)` is specified by

- a finite set `T` of *terminal symbols*,
- a finite set `N` of *nonterminal symbols*,
- a finite set `P` of *productions*,
- a symbol `S ∈ N`, the *start symbol*

where `N ∩ T = {}` and `V = T ∪ N` is its vocabulary. Productions are pairs of strings `σ ∈ Vᐩ`, `τ ∈ V*`, written `σ → τ`.

**Example.** `G₀ = (T, N, P, S)` with `T = {Kevin, Dave, a, the, banana, apple, eats, runs}`, `N = {S, NP, VP, PN, D, N, V}`, and the productions to the right is a grammar.

<div style="float:right;background-color:lightgrey;margin-left:18pt">

  `S`  
`⇒ NP VP`  
`⇒ PN VP`  
`⇒ Kevin VP`  
`⇒ Kevin V NP`  
`⇒ Kevin eats NP`  
`⇒ Kevin eats D N`  
`⇒ Kevin eats a N`  
`⇒ Kevin eats a banana`</div>
Given a grammar `G = (T, N, P, S)`, sequence `χ ∈ V*` is _directly derivable_ from `π ∈ Vᐩ`, written `π ⇒ χ`, if there exist sequences `σ`, `τ`, `μ`, `ν` such that `π = μσν`, `χ = μτν`, and `σ → τ ∈ P`.

We write

- `π ⇒* χ` if `χ` is _derivable in zero or more steps_ from `π`,
- `π ⇒ᐩ χ` if `χ` is _derivable in one or more steps_ from `π`.

Formally, `⇒*` is the transitive and reflexive closure of relation `⇒` and `⇒ᐩ` is the transitive closure of `⇒`.

The derivation to the right allows us to conclude that `S ⇒ᐩ Kevin eats a banana` with grammar `G₀`.

The _language_ `L(G)` generated by grammar `G = (T, N, P, S)` is the set of all sequences of terminal symbols which can be derived from the start symbol:

    L(G) = {χ ∈ T* | S ⇒ᐩ χ}

Two grammars `G`, `G'` are _equivalent_ if they generate the same language, `L(G) = L(G')`.

**Example.** Given `G₁ = (T, N, P, S)`, where `T = {a, b, c, d}`, `N = {S, X}`, `P = {S → aX, S → bX, X → c, X → d}`, the sequence `ac` is derivable from `S`, formally `S ⇒ᐩ ac`,

    S ⇒ aX ⇒ ac

as are `ad`, `bc`, `bd`. The language generated by `G₁` is:

    L(G₁) = {ac, ad, bc, bd}

**Question.** What are other equivalent grammars? 

_Answer._
- `G₁' = (T, N', P', S)`, where `N' = {S, X, Y}`, `P' = {S → XY, X → a, X → b, Y → c, Y → d}`, is equivalent to `G₁`.
- Renaming the nonterminals also gives an equivalent grammar. In that sense, nonterminals "carry no meaning".
- Adding nonterminal `X₁` and replacing `X → a` with `X → X₁, X₁ → a` also gives an equivalent grammar. Repeating this, infinitely many equivalent grammars can be obtained.
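For small grammars, equivalence can be checked experimentally by enumerating the generated languages. The following is a minimal sketch, not a general decision procedure: it assumes that symbols are single characters, that productions are pairs of Python strings, and that no production shortens a string, so sentential forms longer than a bound can be discarded. The function name `language` and the parameter `max_len` are hypothetical.

```python
def language(T, P, S, max_len):
    """Terminal strings of length <= max_len derivable from S.

    Assumes single-character symbols and that no production shortens a string,
    so sentential forms longer than max_len can be discarded.
    """
    seen, todo = {S}, [S]
    while todo:
        π = todo.pop()
        for σ, τ in P:
            i = π.find(σ)
            while i != -1:  # every occurrence of σ in π
                χ = π[:i] + τ + π[i + len(σ):]
                if len(χ) <= max_len and χ not in seen:
                    seen.add(χ)
                    todo.append(χ)
                i = π.find(σ, i + 1)
    return {χ for χ in seen if all(x in T for x in χ)}

T = {'a', 'b', 'c', 'd'}
P1 = {('S', 'aX'), ('S', 'bX'), ('X', 'c'), ('X', 'd')}
P1_alt = {('S', 'XY'), ('X', 'a'), ('X', 'b'), ('Y', 'c'), ('Y', 'd')}
print(language(T, P1, 'S', 2) == language(T, P1_alt, 'S', 2))
```

Here `P1` holds the productions of `G₁` and `P1_alt` those of the first grammar in the answer. Agreement up to a length bound is conclusive only for finite languages such as this one; for infinite languages it is evidence, not a proof.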

Languages generated by a grammar can be _finite_ or _infinite_. Infinite languages are expressed through recursion with a finite set of productions.

**Example.** Let `G₂ = (T, N, P, S)`, where `T = {a}`, `N = {S}` and let the productions `P` be:

    S → ε
    S → aS

**Theorem.** The language of `G₂` is that of sequences over `a` of arbitrary length,

    L(G₂) = {ε, a, aa, aaa, aaaa, …} = {aⁿ | n ≥ 0}

where `a⁰ = ε` and `aⁿ⁺¹ = aaⁿ`.

*Proof.* We prove this formally by inclusion in both directions. By definition of `L(G₂)`,

    {χ ∈ T* | S ⇒ᐩ χ} ⊆ {aⁿ | n ≥ 0}

means that for every `χ ∈ T*` that is derivable from `S` there exists an `n ≥ 0` such that `χ = aⁿ`. We show this by induction over the length of derivations.

- _Base._ Suppose `χ` is derived directly from `S` by `S ⇒ χ`, which leaves only `χ = ε` according to the first production. Then `χ = a⁰`.
- _Step._ Suppose `χ` is derived from `S` in multiple steps, which leaves only `S ⇒ aS ⇒* χ` according to the second production. Then `χ` must be `aω` to be derivable from `aS`. As the derivation `S ⇒* ω` is shorter than the derivation of `S ⇒* χ`, by induction assumption there exists an `n` such that `ω = aⁿ`. Therefore `χ = aω = aaⁿ = aⁿ⁺¹`.

The inclusion in the other direction means that every `aⁿ` for `n ≥ 0` can be derived from `S`:

    {aⁿ | n ≥ 0} ⊆ {χ ∈ T* | S ⇒ᐩ χ}

We show this by induction over `n`.

- _Base._ For `n = 0`, obviously `a⁰ = ε` can be generated by the first production, `S ⇒ᐩ ε`.
- _Step._ Suppose `aⁿ` can be generated, `S ⇒ᐩ aⁿ`. We need to show that `aⁿ⁺¹` can be generated as well. This follows from `S ⇒ aS ⇒ᐩ aaⁿ = aⁿ⁺¹`.

Thus we can conclude `L(G₂) = {aⁿ | n ≥ 0}`.

Recursion also allows expressing arbitrarily deep _nested structures_.

**Example.** Let `G₃ = (T, N, P, S)`, where `T = {a, b, c}`, `N = {S}`, and the productions `P` are:

    S → b
    S → aSc

The sequence `aabcc` is derivable from `S`:

    S ⇒ aSc ⇒ aaScc ⇒ aabcc

The generated language is:

    L(G₃) = {b, abc, aabcc, aaabccc, …} = {aⁿbcⁿ | n ≥ 0}

### Chomsky Hierarchy

Languages can be classified according to restrictions on their grammar. The following classification is known as the _Chomsky Hierarchy_ [(Chomsky 1956)](#Chomsky56). For grammar `G = (T, N, P, S)`, let `V = T ∪ N` be its vocabulary, and assume `a ∈ T`, `A, B ∈ N`, `μ, ν, τ ∈ V*`, `σ ∈ Vᐩ`:

- A grammar is _context-sensitive_ if productions are of the form

    `μAν → μσν`  
    
    Additionally, `S → ε` is allowed provided that `S` does not occur on the right-hand side of another production.


- A grammar is _context-free_ if productions are of the form

    `A → τ`


- A grammar is _regular_ if productions are of the form

    `A → ε`  
    `A → a`  
    `A → aB`

**Question.** Which of the grammars `G₀`, `G₁`, `G₂`, `G₃` are regular or context-free?

_Answer._
- `G₀` is not regular, but is context-free
- `G₁` is regular (and therefore context-free)
- `G₂` is regular (and therefore context-free)
- `G₃` is not regular, but is context-free
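The answers for `G₁`, `G₂`, and `G₃` can be cross-checked mechanically. The sketch below assumes single-character symbols with productions as pairs of Python strings (`''` for the empty string); the function names `is_context_free` and `is_regular` are hypothetical. `G₀` is left out because its symbols are whole words.

```python
def is_context_free(P, N):
    """Every production has the form A → τ with a single nonterminal A."""
    return all(σ in N for σ, τ in P)

def is_regular(P, T, N):
    """Every production has the form A → ε, A → a, or A → aB."""
    def regular_rhs(τ):
        return (τ == ''
                or (len(τ) == 1 and τ in T)
                or (len(τ) == 2 and τ[0] in T and τ[1] in N))
    return all(σ in N and regular_rhs(τ) for σ, τ in P)

P1 = {('S', 'aX'), ('S', 'bX'), ('X', 'c'), ('X', 'd')}  # productions of G₁
P2 = {('S', ''), ('S', 'aS')}                            # productions of G₂
P3 = {('S', 'b'), ('S', 'aSc')}                          # productions of G₃

print(is_regular(P1, {'a', 'b', 'c', 'd'}, {'S', 'X'}))  # G₁
print(is_regular(P2, {'a'}, {'S'}))                      # G₂
print(is_context_free(P3, {'S'}), is_regular(P3, {'a', 'b', 'c'}, {'S'}))  # G₃
```

Note that these functions classify a grammar, not a language: a language is regular if some regular grammar for it exists, which this check cannot decide.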

<div style="float:right;background-color:lightgrey;margin-left:2em;margin-top:1em">

`S → NP VP`  
`NP → D Nₛ`  
`NP → D Nₚ`  
`Nₛ VP → Nₛ Vₛ`  
`Nₚ VP → Nₚ Vₚ`  
`D → the`  
`Nₛ → child`  
`Nₚ → children`  
`Vₛ → runs`  
`Vₚ → run`

</div>

Context-sensitive grammars can express subject-verb agreement with respect to singular vs. plural in natural languages.
<br><br>  
**Example.** Consider the grammar to the right with terminals in lower case and nonterminals in upper case letters. Then

  `the child runs`  
  `the children run`  

are sentences but `the child run` is not.

**Question.** What is a derivation of `the child runs`?

_Answer._

      S
    ⇒ NP VP
    ⇒ D Nₛ VP
    ⇒ D Nₛ Vₛ
    ⇒ the Nₛ Vₛ
    ⇒ the child Vₛ
    ⇒ the child runs

We give some fundamental results from formal language theory. Regular grammars can express repetition, but not nesting:

**Theorem.** No regular grammar for `L(G₃)` exists.

**Example.** Let `G₄ = (T, N, P, S)`, where `T = {a, b, c}`, `N = {S, B}`, and let the productions `P` be:

    S → abc
    S → aBSc
    Ba → aB
    Bb → bb

The grammar is not context-free. The language generated is:

    L(G₄) = {abc, aabbcc, aaabbbccc, …} = {aⁿbⁿcⁿ | n ≥ 1}

**Question.** What is a derivation of `aaabbbccc` in `G₄`? Explain how the grammar works!

_Answer._

      S
    ⇒ aBSc
    ⇒ aBaBScc
    ⇒ aBaBabccc
    ⇒ aBaaBbccc
    ⇒ aaBaBbccc
    ⇒ aaaBBbccc
    ⇒ aaaBbbccc
    ⇒ aaabbbccc

The grammar works by first producing the same number of `a`, `B`, `c`, with all `c` in correct position at the end but `a` and `B` alternating. Then the production `Ba → aB` moves all `a` to the left and all `B` to the middle. Once a `B` is in its correct position, it is converted to a `b`.
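The three phases can be replayed step by step. This is a minimal sketch, assuming single-character symbols and productions as pairs of Python strings; the helper `rewrite` and the list `steps` (production plus the index at which it is applied) are hypothetical names introduced for illustration.

```python
def rewrite(π, σ, τ, i):
    """Apply production σ → τ to the occurrence of σ at index i of π (one step π ⇒ χ)."""
    if π[i:i + len(σ)] != σ:
        raise ValueError(f'{σ} does not occur at index {i} of {π}')
    return π[:i] + τ + π[i + len(σ):]

P4 = {('S', 'abc'), ('S', 'aBSc'), ('Ba', 'aB'), ('Bb', 'bb')}
steps = [('S', 'aBSc', 0), ('S', 'aBSc', 2), ('S', 'abc', 4),  # produce a, B, c
         ('Ba', 'aB', 3), ('Ba', 'aB', 1), ('Ba', 'aB', 2),    # move each B to the middle
         ('Bb', 'bb', 4), ('Bb', 'bb', 3)]                     # convert B to b, right to left

π = 'S'
for σ, τ, i in steps:
    assert (σ, τ) in P4   # only productions of the grammar are used
    π = rewrite(π, σ, τ, i)
    print('⇒', π)
```

Each printed line is one sentential form of the derivation given in the answer, ending with `aaabbbccc`.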

**Theorem.** No context-free grammar for `L(G₄)` exists.

Grammar `G₄` is not context-sensitive: `Ba → aB` does not match the form for context-sensitive productions. However, grammar `G₄'` with the same terminals, additional nonterminals `A` and `X`, and the following productions is context-sensitive and equivalent; it uses three productions to achieve `BA → AB` and adds production `A → a`:

    S → Abc
    S → ABSc
    BA → BX
    BX → AX
    AX → AB
    Bb → bb
    A → a

**Question.** Argue that `G₄'` is context-sensitive. What is a derivation of `aabbcc` in `G₄'`?

_Answer._

The production `BA → BX` replaces `A` by `X` in left context `B`, so matches `μAν → μσν` with `μ`, `A`, `ν`, `σ` being `B`, `A`, `ε`, `X`. The production `BX → AX` replaces `B` by `A` in right context `X`, so matches `μAν → μσν` with `μ`, `A`, `ν`, `σ` being `ε`, `B`, `X`, `A`. The other productions are similar.

      S
    ⇒ ABSc
    ⇒ ABAbcc
    ⇒ ABXbcc
    ⇒ AAXbcc
    ⇒ AABbcc
    ⇒ AAbbcc
    ⇒ Aabbcc
    ⇒ aabbcc

**Example.** Let `G₅ = (T, N, P, S)`, where `T = {a, b}`, `N = {A, B, S}`, and productions `P` are:  

  `S → aAS`  
  `S → bBS`  
  `Aa → aA`  
  `Ab → bA`  
  `Ba → aB`  
  `Bb → bB`  
  `AS → Sa`  
  `BS → Sb`  
  `S → ε`  

The grammar is not context-free. The language generated is the *copy language*:

    L(G₅) = {ww | w ∈ T*}

The first two productions produce an arbitrary sequence of pairs of `aA` and `bB` ending with `S`. The following four productions move all `A` and `B` to the right without "overtaking" each other. The final three productions convert `A` to `a` and `B` to `b` from right to left.

**Question.** What is a derivation of `abab` in `G‚ÇÖ`?

_Answer._

```
  S
⇒ aAS
⇒ aAbBS
⇒ abABS
⇒ abASb
⇒ abSab
⇒ abab
```

**Theorem.** No context-free grammar for `L(G₅)` exists.

Languages generated by context-sensitive, context-free, and regular grammars are called *context-sensitive*, *context-free*, and *regular languages*, respectively.

**Theorem.** Every regular language is also context-free. Every context-free language is also context-sensitive.

Note that the inclusion does not quite hold for grammars, as `A → ε` is allowed in regular and context-free grammars, but not in context-sensitive grammars.

For brevity, we write

    σ → τ₀ | τ₁ | …

for the set of productions

    σ → τ₀
    σ → τ₁
    …

### Concrete and Abstract Syntax Trees

We continue with context-free languages. For those, the _parse tree_ or _concrete syntax tree_ is a visual representation of a derivation which abstracts from the order of independent applications of productions. In the example, `E` and `id` stand for expressions and identifiers of programs.

<img style="width:6em;float:right;border-left:10px solid white" src="./img/idplusid.svg"></img>
**Example.** Let `G₆ = (T, N, P, E)` where `T = {id, +}`, `N = {E}`, and the productions `P` are:

<code>      E → id | E + E</code>

There are two derivations of `id + id`:

    E ⇒ E + E ⇒ id + E ⇒ id + id
    E ⇒ E + E ⇒ E + id ⇒ id + id

<img style="width:19em;float:right;border-left:10px solid white" src="./img/idplusidplusid.svg"></img>
Continuing with `G₆`, there are two parse trees for `id + id + id`. A sentence with more than one parse tree is an _ambiguous sentence_ and a grammar allowing that is an *ambiguous grammar*. Syntactically ambiguous sentences may have an ambiguous meaning. In natural languages this may be resolved through the context; in programming languages, syntactic ambiguity is generally avoided.

<img style="width:9em;float:right;border-left:10px solid white" src="./img/idplusleft.svg"></img>
Changing the productions to a _left-recursive_ form eliminates ambiguity and makes `+` associate to the left.

<code>      E → id | E + id</code>

<img style="width:9em;float:right;border-left:10px solid white" src="./img/idplusright.svg"></img>
Changing the productions to a _right-recursive_ form eliminates ambiguity and makes `+` associate to the right.

<code>      E → id | id + E</code>

**Question.** For which operators in programming languages does associativity matter and for which not?

_Answer._
- For integer division associativity matters.
- For integer addition associativity matters in bounded arithmetic (overflow is error) and saturating arithmetic (overflow results in maximal number).
- For integer addition associativity does not matter in modulo arithmetic, e.g. with word size.
- For bitwise `and` and bitwise `or`, associativity does not matter.
- For string concatenation, associativity does not matter.
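Some of these cases can be tried out in Python. The sketch below is illustrative only: Python integers are unbounded, so saturating and modulo addition are simulated by the hypothetical helpers `sat_add` (on an assumed 8-bit signed range) and `mod_add` (modulo an assumed `256`).

```python
# Integer division: the two groupings differ, so associativity matters
left = (8 // 4) // 2     # 2 // 2 = 1
right = 8 // (4 // 2)    # 8 // 2 = 4

def sat_add(x, y, lo=-128, hi=127):
    """Saturating addition: results outside lo..hi are clamped to the nearest bound."""
    return max(lo, min(hi, x + y))

def mod_add(x, y, m=256):
    """Addition modulo m, as with a fixed word size."""
    return (x + y) % m

# Saturating addition: grouping changes the result
s_left = sat_add(sat_add(100, 100), -100)    # 127 + (-100) = 27
s_right = sat_add(100, sat_add(100, -100))   # 100 + 0 = 100

# Modulo addition: both groupings agree
m_left = mod_add(mod_add(200, 100), 50)
m_right = mod_add(200, mod_add(100, 50))

print(left, right, s_left, s_right, m_left == m_right)
```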

The next example illustrates operator _precedence_.

<img style="width:19em;float:right;border-left:10px solid white" src="./img/idplusidtimesid.svg"></img>
**Example.** Let `G₇ = (T, N, P, E)` where `T = {id, +, ×}`, `N = {E}`, and the productions `P` are:

<code>      E → id | E + id | E × id</code>

In `id + id × id`, operator `+` binds tighter; in `id × id + id`, operator `×` binds tighter: `+` and `×` bind equally tight and associate to the left.

<img style="width:19em;float:right;border-left:10px solid white" src="./img/idplustimestimesplus.svg"></img>
To have proper operator precedence, nonterminal `T` for terms is introduced and the productions are changed to:  
<code>
      E → T | E + T
      T → id | T × id
</code>

To allow `+` to bind tighter than `×`, parentheses are needed. For this, nonterminal `F` for factor is introduced.

**Example.** Let `G₈ = (T, N, P, E)` where `T = {id, +, ×, (, )}`, `N = {E, T, F}`, and the productions `P` are:  
<code>
      E → T | E + T
      T → F | T × F
      F → id | ( E )
</code>

**Question.** What are the parse trees for `id + id × id`, for `id × id + id`, and for `(id + id) × id`?

_Answer._
<img style="width:30em" src="./img/idparen.svg"></img>

A _structural tree_ or _(abstract) syntax tree_ is a simplified parse tree with only the relevant structure information:
- Productions whose sole purpose is to define precedence (like bracketing) are left out.
- Chains of derivations are left out.
- Nodes are labelled with the construct in question rather than a nonterminal.

For example, for `id + id × id`, for `id × id + id`, and for `(id + id) × id`:
<img style="width:28em" src="./img/idast.svg"></img>
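One possible way to hold such trees in a program is as nested tuples, with inner nodes labelled by the operator and leaves by identifiers. The sketch below is only an illustration under that assumption: it uses distinct identifiers `a`, `b`, `c` in place of `id`, and the names `OPS`, `evaluate`, and `env` are hypothetical.

```python
import operator

OPS = {'+': operator.add, '×': operator.mul}

def evaluate(tree, env):
    """Evaluate an abstract syntax tree given values for the identifiers."""
    if isinstance(tree, str):      # leaf: an identifier
        return env[tree]
    op, left, right = tree         # inner node: labelled with the operator
    return OPS[op](evaluate(left, env), evaluate(right, env))

t1 = ('+', 'a', ('×', 'b', 'c'))   # a + b × c
t2 = ('+', ('×', 'a', 'b'), 'c')   # a × b + c
t3 = ('×', ('+', 'a', 'b'), 'c')   # (a + b) × c

env = {'a': 1, 'b': 2, 'c': 3}
print(evaluate(t1, env), evaluate(t2, env), evaluate(t3, env))
```

The parentheses of `(a + b) × c` do not appear in `t3`; their effect is captured entirely by the shape of the tree.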

### Backus-Naur Form

Context-free grammars are more conveniently written in _Backus-Naur Form_ (*BNF*):
- The left-hand side of the first production is the start symbol.
- Terminals are enclosed in `'quotes'`; all other symbols are nonterminals.
- Productions for the same nonterminal are grouped into one, separated by `|`.
- The empty string `ε` is written as `''`.

For example, here is a BNF grammar for expressions like `– 3 × a + b`

    expression → term | '+' term | '–' term | expression '+' term | expression '–' term
    term → factor | term '×' factor | term '/' factor
    factor → number | identifier | '(' expression ')'

and one for statements like `if b then x := 3 else (x := y ; y := 5)`:

    statement → assignment | compoundStatement | ifStatement | whileStatement
    assignment → identifier ':=' expression
    compoundStatement → '(' statementSequence ')'
    statementSequence → statement | statementSequence ';' statement
    ifStatement → 'if' expression 'then' statement | 'if' expression 'then' statement 'else' statement
    whileStatement → 'while' expression 'do' statement

Let us define BNF in BNF! The terminals are characters, written in quotes. The newline character is written as `\n` and the quote character itself as `\'`. We let `char` stand for an arbitrary character:

    grammar  →  production | grammar '\n' production
    production  →  identifier '→' expression
    expression  →  term | expression '|' term
    term  →  factor | term ' ' factor
    factor  →  identifier | string
    identifier  →  letter | identifier letter | identifier digit
    letter  →  'A' | … | 'Z'
    digit  →  '0' | … | '9'
    string  →  '\'' characters '\''
    characters  →  characters char | ''

Numerous variations of BNF exist. For example, the grammar of C uses different fonts for terminals and nonterminals, enumerates the terms of a production indented on subsequent lines, and uses <code>A<sub>opt</sub></code> if `A` is optional [(Kernighan and Ritchie 1988)](#KernighandRitchie88). Formally, using <code>A<sub>opt</sub></code> amounts to adding a production <code>A<sub>opt</sub> → A | ε</code>. Here is a simplified fragment:

<code style="font-family:monospace">
<i>statement:</i>
        <i>compound-statement</i>
        <i>expression-statement</i>
        <i>selection-statement</i>
        <i>iteration-statement</i>
<i>compound-statement:</i>
        { <i>statement-list<sub>opt</sub></i> }
<i>statement-list:</i>
        <i>statement</i>
        <i>statement-list</i> <i>statement</i>
<i>selection-statement:</i>
        if ( <i>expression</i> ) <i>statement</i>
        if ( <i>expression</i> ) <i>statement</i> else <i>statement</i>
        switch ( <i>expression</i> ) <i>statement</i>
<i>iteration-statement:</i>
        while ( <i>expression</i> ) <i>statement</i>
        for ( <i>expression<sub>opt</sub></i> ; <i>expression<sub>opt</sub></i> ; <i>expression<sub>opt</sub></i> ) <i>statement</i>
</code>

EBNF is an extension of BNF: it allows simple repetitions to be formulated more naturally and it avoids an inflation of nonterminals:
- `(A)` allows precedence to be expressed. Formally, `(A)` stands for a new nonterminal `X` with the production `X → A` added.
- `[A]` stands for `A` optionally. Formally, `[A]` stands for a new nonterminal `X` with the production `X → A | ε` added.
- `{A}` stands for repeating `A` an arbitrary number of times. Formally, `{A}` stands for a new nonterminal `X` with the production `X → X A | ε` added.

For example, here is an EBNF grammar for expressions,

    expression → [ '+' | '–' ] term { ( '+' | '–' ) term }
    term → factor { ( '×' | '/' ) factor }
    factor → number | identifier | '(' expression ')'

and one for statements:

    statement → assignment | compoundStatement | ifStatement | whileStatement
    assignment → identifier ':=' expression
    compoundStatement → '(' statement { ';' statement } ')'
    ifStatement → 'if' expression 'then' statement [ 'else' statement ]
    whileStatement → 'while' expression 'do' statement


Let us define EBNF in EBNF!

    grammar  →  production { '\n' production }
    production  →  identifier '→' expression
    expression  →  term { '|' term }
    term  →  factor { ' ' factor }
    factor  →  identifier | string | '(' expression ')' | '[' expression ']' | '{' expression '}'
    identifier  →  letter { letter | digit }
    letter  →  'A' | … | 'Z'
    digit  →  '0' | … | '9'
    string  →  '\'' { char } '\''

With typing and parsing of grammars in mind, sometimes `=` or `::=` is used instead of `→` and productions are terminated with a dot. For example, here is a fragment of the [Go Grammar](https://golang.org/ref/spec):

    Block = "{" StatementList "}" .
    StatementList = { Statement ";" } .
    
    Statement =
	    Declaration | Assignment | Block | IfStmt | SwitchStmt | SelectStmt | ForStmt .

    Assignment = ExpressionList assign_op ExpressionList .
    assign_op = [ add_op | mul_op ] "=" .

    ExpressionList = Expression { "," Expression } .

Productions are then also called *syntactic equations*; however, care has to be taken, as `A = B` is not the same as `B = A`!

More variations of EBNF exist:

- Zero or more repetitions of `E` are also written as `E*`.
- One or more repetitions of `E` are written as `Eᐩ`, which stands for `E E*`.
- An optional occurrence of `E` is also written as `E?`, which stands for `E | ε`.

Here is a fragment of the Python [grammar in the language reference](https://docs.python.org/3/reference/compound_stmts.html) (which differs slightly from the [grammar used by parsers](https://docs.python.org/3/reference/grammar.html)). As indentation of statements matters in Python, this is expressed in the grammar by symbols that indicate indentation:

<pre style="font-family:monospace;color:royalblue">
compound_stmt ::=  if_stmt
                   | while_stmt
                   | for_stmt
suite         ::=  stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
statement     ::=  stmt_list NEWLINE | compound_stmt
stmt_list     ::=  simple_stmt (";" simple_stmt)* [";"]

if_stmt ::=  "if" expression ":" suite
             ("elif" expression ":" suite)*
             ["else" ":" suite]
</pre>

EBNF is not only helpful for a compact definition of a grammar, but is also essential for the construction of a specific kind of recognizer. Our choice of EBNF is motivated by that.

### Syntax Diagrams

An EBNF grammar can be equivalently represented by _syntax diagrams_ (*railroad diagrams*). These are constructed recursively over the structure of EBNF grammars. Let `'a'` stand for a string (terminal), `A` for an identifier (nonterminal), and `E`, `E₁`, `E₂`, … for expressions (right-hand side of productions):

| EBNF            | syntax diagram                                                |
|:----------------|:--------------------------------------------------------------|
| `A → E`         |<img style="width:10em" src="./img/production.svg"></img> |
| `'a'`           |<img style="width:10em" src="./img/terminal.svg"></img>   |
| `A`             |<img style="width:10em" src="./img/nonterminal.svg"></img>|
| `E₁ │ E₂ │ …`   |<img style="width:10em" src="./img/choice.svg"></img>     |
| `E₁ E₂ …`       |<img style="width:14em" src="./img/sequence.svg"></img>   |
| `(E)`           |<img style="width:10em" src="./img/parenthesis.svg"></img>|
| `[E]`           |<img style="width:10em" src="./img/option.svg"></img>     |
| `{E}`           |<img style="width:10em" src="./img/repetition.svg"></img> |

For example, for

    A → 'x' | '(' A { '+' A } ')'

the syntax diagram is:

<img style="width:28em" src="./img/railroad.svg"></img> 

**Question.** What is the syntax diagram for EBNF?

### Recognizers

A _recognizer_ for a language is a program that takes as input a string and _accepts_ it if the string is a sentence of the language or otherwise _rejects_ it. For regular, context-free, and context-sensitive languages, _universal recognizers_ exist, i.e. programs that, given a grammar `G` and string `ω`, return whether `ω ∈ L(G)`. For an unrestricted grammar `G`, in general `ω ∈ L(G)` is undecidable.

For a context-sensitive grammar `G = (T, N, P, S)`, a universal recognizer can be constructed by generating all derivations of length `1`, length `2`, etc. from the start symbol and keeping a set, `d`, of the derived strings. New strings are only added to `d` if they are not longer than `ω`, as in context-sensitive grammars derived strings cannot shrink. This terminates if either `ω ∈ d`, in which case `ω` is accepted, or no more strings can be added to `d`, i.e. all derived strings of length at most `|ω|` have been explored:

```algorithm
    d₀, d := {}, {S}
    while d₀ ≠ d do
        d₀ := d
        for π ∈ d₀ do
            for σ → τ ∈ P do
                for μ, ν where π = μσν do
                    χ := μτν
                    if χ = ω then accept
                    else if |χ| ≤ |ω| then d := d ∪ {χ}
    reject
```

This algorithm always terminates and the memory it uses is bounded. Since the set `d` may be very large, it is not a practical universal recognizer, but a constructive proof that membership in a context-sensitive language is decidable.

For an implementation in Python, symbols are represented by characters and strings by Python strings. The method `s.find(t, i)` returns the index of the first occurrence of `t` in `s` starting at index `i`, or `-1` if no such occurrence exists:

In [1]:
def derivable(S, P, ω):
    # S: start symbol, a string; P: productions, a set of pairs of strings; ω: string
    d0, d = set(), {S}  # d: strings derived so far, d0: its value before the current round
    while d != d0:
        d0 = d
        for (σ, τ) in P:
            for π in d0:
                i = π.find(σ, 0)  # first occurrence of σ in π
                while i != -1:
                    χ = π[0:i] + τ + π[i + len(σ):]  # replace this occurrence of σ by τ
                    if χ == ω: return True
                    elif len(χ) <= len(ω): d = d.union({χ})
                    i = π.find(σ, i + 1)  # next occurrence, possibly overlapping
    return False

In [2]:
derivable('S', {('S', 'a'), ('S', 'Sb')}, 'abb')

True

In [3]:
derivable('S', {('S', 'a'), ('S', 'Sb')}, 'bb')

False

Abbreviating `the`, `child`, `children`, `runs`, `run` by `t`, `c`, `C`, `r`, `R` and `NP`, `VP`, `Nₛ`, `Vₛ`, `Nₚ`, `Vₚ` by `𝒩`, `𝒱`, `n`, `v`, `N`, `V`, the previous productions are expressed as:

In [4]:
P = {('S', '𝒩𝒱'), ('𝒩', 'Dn'), ('𝒩', 'DN'), ('n𝒱', 'nv'), ('N𝒱', 'NV'),
     ('D', 't'), ('n', 'c'), ('N', 'C'), ('v', 'r'), ('V', 'R')}

In [5]:
derivable('S', P, 'tCR')

True

In [6]:
derivable('S', P, 'tcR')

False

### Historic Notes and Further Reading

The Backus-Naur Form was first proposed by John Backus and then adopted by Peter Naur for the definition of Algol-60. Donald Knuth suggested the name [(Knuth 1964)](#Knuth64). EBNF was proposed by Niklaus Wirth [(Wirth 1977)](#Wirth77).

The original motivation for the classification of grammars came from the study of natural languages. The following examples illustrate the potential use of regular, context-free, and context-sensitive languages (credit for examples: [C. Chesi, Univ. of Siena](http://www.ciscl.unisi.it/master/chesi/lingcomp-2017_18-03_04-formal_grammar.pdf)):

- _Right recursion_ (*tail recursion*) of the form `abⁿ`:


    [the dog bit [the cat [that chased [the mouse [that ran]]]]]


- _Center embedding_ (*true recursion*) of the form `aⁿbⁿ`:


    [the mouse [(that) the cat [(that) the dog bit] chased] ran]


- _Cross-serial dependencies_ (*identity recursion*) of the form `ww`:


    John, Mary, and David, are a widower, a widow, and a widower, respectively

There is an ongoing discussion on using regular, context-free, and context-sensitive languages for natural languages. The male-female correspondence of the last example can also be seen as a semantic issue rather than a syntactic issue. If one takes the limits of human comprehension into account, the full generality of context-sensitive, context-free, and even regular languages is not needed. As a consequence, further classes of grammars have emerged, e.g. [(Kallmeyer 2010)](#Kallmeyer10).

Grammars can be used for translation of natural languages: first the input sentence is parsed according to the grammar of the source language and then a sentence is generated that satisfies the grammar of the target language. However, some recent work demonstrates that neural networks can perform better than grammar-based translation [(Wu et al. 2016)](#WuEtAl16), [(Le and Schuster 2016)](#LeSchuster16).

On the other hand, Chomsky's Hierarchy had a profound impact on computing: for each class of languages, an equivalent class of recognizers is known. Calling languages of unrestricted grammars *recursively enumerable*, we have:

|language              |recognizer              |
|:---------------------|:-----------------------|
|recursively enumerable|Turing machine          |
|context-sensitive     |linear bounded automaton|
|context-free          |pushdown automaton      |
|regular               |finite state automaton  |

Regular and context-free languages are ubiquitous as recognizers for those can be constructed efficiently and are themselves in some sense efficient. The next chapters in these notes discuss their use for scanning and parsing.

Even the above examples show the difficulty of writing context-sensitive grammars. After Algol 60 introduced the use of context-free grammars for its syntax, with Algol 68 an attempt was made to go beyond context-free grammars by using a dedicated "two-level grammar" [(Wijngaarden et al. 1976)](#WijngaardenEtAl76); that kind of grammar was not used for another language. Around the same time, Knuth proposed _attribute grammars_ as a way of associating computation (which can be type-checking and translation) to recognition of a context-free language [(Knuth 1968)](#Knuth68). Since then it has become common to define a programming language with regular and context-free grammars and to use attribute grammars for compilation. Type systems, which can be thought of as context-sensitive grammars, are also used in the definition of some languages [(Cardelli 1996)](#Cardelli96).

The Pascal language and its successors Modula-2 and Oberon have compact EBNF grammars. The [syntax diagrams of Apple Pascal](http://www.pascal-central.com/pascal-syntax.html) fit on a poster. It used to be common for these to hang on the walls next to the computers!

### Exercises

**Exercise 1.** Prove formally that `L(G₃) = {aⁿbcⁿ | n ≥ 0}`!

**Exercise 2.** Considering `G₅`, give a derivation of `abbabb`! Explain how the grammar works!

**Exercise 3.** A well-known syntactically ambiguous English sentence is `Time flies like an arrow`. Explain the ambiguity!

The following exercises use the Python Natural Language Toolkit (NLTK). If needed, install it with `pip3 install nltk` or `pip install nltk`. The following example is from the [NLTK Book](https://www.nltk.org/book/ch08.html). It shows the ambiguity of the sentence:

    I shot an elephant in my pajamas

This is from the Groucho Marx movie, _Animal Crackers_ (1930): "While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don't know." First, a grammar is defined that is sufficient to show the ambiguity and a parser for that grammar is created:

```python
import nltk

# A small context-free grammar, just large enough to exhibit the ambiguity
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(groucho_grammar)
```

Now all parse trees for this sentence are created and printed:

```python
trees = list(parser.parse(['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']))
for t in trees:
    print(t)
```

The output shows that there are two parse trees, printed with indentation. They can also be graphically visualized:

```python
trees[0]
```

```python
trees[1]
```
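As a cross-check on the two parse trees reported above, the following sketch counts parse trees without NLTK. It restates the productions of `groucho_grammar` as a plain Python dictionary; the names `GRAMMAR`, `count_trees`, `derive`, and `derive_seq` are hypothetical helpers, and the approach relies on the assumption that the grammar has no empty productions, so every symbol derives at least one word.

```python
from functools import lru_cache

# Same productions as groucho_grammar, written as plain Python data.
# Symbols that are not keys of GRAMMAR are treated as terminals (words).
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}

def count_trees(words, start="S"):
    """Count the parse trees of the word sequence, starting from `start`."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        """Number of parse trees with root `symbol` that yield words[i:j]."""
        if symbol not in GRAMMAR:  # terminal: must match exactly one word
            return 1 if j - i == 1 and words[i] == symbol else 0
        return sum(derive_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def derive_seq(rhs, i, j):
        """Number of ways the symbol sequence `rhs` yields words[i:j]."""
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        # The first symbol takes words[i:m]; at least one word is left
        # for each of the remaining len(rhs) - 1 symbols.
        return sum(derive(rhs[0], i, m) * derive_seq(rhs[1:], m, j)
                   for m in range(i + 1, j - len(rhs) + 2))

    return derive(start, 0, len(words))

print(count_trees("I shot an elephant in my pajamas".split()))  # prints 2
```

The two trees correspond to attaching `in my pajamas` either to the noun phrase (`NP -> Det N PP`) or to the verb phrase (`VP -> VP PP`). Because every split leaves a strictly shorter span, the left-recursive production `VP -> VP PP` does not cause infinite recursion here.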

**Exercise 3.** Draw the parse tree of `id × (id + id)` in grammar `G₈` using NLTK!

**Exercise 4.** Let grammar `G` be given by:

    S ::= [aSbS | bSaS]

- Show that `G` is ambiguous by constructing two parse trees for `abab`! You may draw the parse trees using NLTK. _Hint:_ `ε` is represented by nothing in NLTK grammars, i.e. it is left out.
- What language does `G` generate? Give a formal proof! _Hint:_ use the notation `x#σ` for the number of occurrences of `x` in sequence `σ`.

**Exercise 5.** Write grammars for expressions made up of identifiers `a`, `b`, `c`, `d` and operators `+`, `–`, e.g.

    a – b + c – d

- Write a grammar such that `+` binds tighter than `–`, i.e. the above sentence would be evaluated as `(a – (b + c)) – d`. Draw the parse tree for `a – b + c – d`!

- Write a grammar such that `–` binds tighter than `+`, i.e. the above sentence would be evaluated as `(a – b) + (c – d)`. Draw the parse tree for `a – b + c – d`!

- Write a grammar such that `+` and `–` bind equally strongly but associate to the left, i.e. the above sentence would be evaluated as `((a – b) + c) – d`. Draw the parse tree for `a – b + c – d`!

- Write a grammar such that `+` and `–` bind equally strongly but associate to the right, i.e. the above sentence would be evaluated as `a – (b + (c – d))`. Draw the parse tree for `a – b + c – d`!

You may use NLTK for drawing the parse trees.

**Exercise 6.** Consider the following grammar for arithmetic expressions:

    expression  →  [ '+' | '–' ] term { ( '+' | '–' ) term }
    term  →  factor { ( '×' | '/' ) factor }
    factor  →  number | identifier | '(' expression ')'

_Variant A._ Use the LaTeX `mdwtools` package to (a) pretty-print the above grammar, (b) draw the syntax diagrams of the three nonterminals, and (c) draw the grammar with all productions!

_Variant B._ Use the Python [railroad-diagrams](https://github.com/tabatkins/railroad-diagrams) library to draw the syntax diagram! Do so by calling `drawProduction()` for each production, as below:

_Variant C._ Use [bottlecaps.de/rr/ui](http://bottlecaps.de/rr/ui) to draw the syntax diagram! Note that this website uses a W3C standard for EBNF.

### Bibliography

<div class="csl-bib-body" style="line-height: 1.35; margin-left: 2em; text-indent:-2em;">
<a id='Cardelli96'></a> <div class="csl-entry">Cardelli, Luca. 1996. “Type Systems.” <i>ACM Comput. Surv.</i> 28 (1): 263–64. <a href="https://doi.org/10.1145/234313.234418">https://doi.org/10.1145/234313.234418</a>.</div>
<a id='Chomsky56'></a> <div class="csl-entry">Chomsky, N. 1956. “Three Models for the Description of Language.” <i>IRE Transactions on Information Theory</i> 2 (3): 113–24. <a href="https://doi.org/10.1109/TIT.1956.1056813">https://doi.org/10.1109/TIT.1956.1056813</a>.</div>
<a id='Kallmeyer10'></a> <div class="csl-entry">Kallmeyer, Laura. 2010. <i>Parsing Beyond Context-Free Grammars</i>. Springer-Verlag Berlin Heidelberg. <a href="https://doi.org/10.1007/978-3-642-14846-0">https://doi.org/10.1007/978-3-642-14846-0</a>.</div>
<a id='KernighanRitchie88'></a> <div class="csl-entry">Kernighan, Brian W., and Dennis M. Ritchie. 1988. <i>The C Programming Language</i>. 2nd ed. Prentice Hall Professional Technical Reference.</div>
<a id='Knuth64'></a> <div class="csl-entry">Knuth, Donald E. 1964. “Backus Normal Form vs. Backus Naur Form.” Letter to the editor. <i>Communications of the ACM</i> 7 (12): 735–36. <a href="https://doi.org/10.1145/355588.365140">https://doi.org/10.1145/355588.365140</a>.</div>
<a id='Knuth68'></a> <div class="csl-entry">Knuth, Donald E. 1968. “Semantics of Context-Free Languages.” <i>Mathematical Systems Theory</i> 2 (2): 127–45. <a href="https://doi.org/10.1007/BF01692511">https://doi.org/10.1007/BF01692511</a>.</div>
<a id='LeSchuster16'></a> <div class="csl-entry">Le, Quoc V., and Mike Schuster. 2016. “A Neural Network for Machine Translation, at Production Scale.” <i>Google AI Blog</i> (blog). September 27, 2016. <a href="https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html">https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html</a>.</div>
<a id='WijngaardenEtAl76'></a> <div class="csl-entry">Wijngaarden, A. van, B. J. Mailloux, J. E. L. Peck, C. H. A. Koster, C. H. Lindsey, M. Sintzoff, L. G. L. T. Meertens, and R. G. Fisker, eds. 1976. <i>Revised Report on the Algorithmic Language Algol 68</i>. Berlin Heidelberg: Springer-Verlag. <a href="https://www.springer.com/gp/book/9783540075929">https://www.springer.com/gp/book/9783540075929</a>.</div>
<a id='Wirth77'></a> <div class="csl-entry">Wirth, Niklaus. 1977. “What Can We Do about the Unnecessary Diversity of Notation for Syntactic Definitions?” <i>Communications of the ACM</i> 20 (11): 822–23. <a href="https://doi.org/10.1145/359863.359883">https://doi.org/10.1145/359863.359883</a>.</div>
<a id='WuEtAl16'></a> <div class="csl-entry">Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.” <i>CoRR</i> abs/1609.08144. <a href="http://arxiv.org/abs/1609.08144">http://arxiv.org/abs/1609.08144</a>.</div>
</div>