# Working with syntactic grammars

Our models so far have all assumed that strings are indeed just that, strings.
True, we tokenized strings to break them down into lists of words, and we occasionally added some hidden information such as parts of speech or morphological information like *stem* and *affix*.
But at no point did we consider that words and sentences might be more than linear sequences.
Yet that is exactly what linguists figured out a long time ago for sentences: a sentence isn't just a sequence of words like beads on a string, but contains a lot of hidden structure.
A sentence is more like a molecule, with the words as its atoms.

This shift from strings to trees makes things a lot harder to implement.
How does one represent these "sentence molecules", what is the right data structure, how can it be traversed and manipulated?
And most importantly, how can we automatically get the structures just from the linear string?

## Background: Phrase structure grammars

Let's have a quick linguistics recap for sentence structure.
Linguists commonly use **phrase structure grammars** to describe the set of possible sentence structures in a language.
Phrase structure grammars assume that sentence structures are trees.
Just like prefix trees, they have a single **root** at the very top, and each node has 0 or more **daughter** nodes below it.
A phrase structure grammar then is a finite set of **rewrite rules** that describe what daughters a node may have.

Each rule is of the form `X -> daughter1 daughter2 daughter3...`.
For instance, `S -> NP VP` tells us that a sentence (`S`) can be split into an `NP` and a `VP`.
`NP` is the first daughter of the `S`-node in the tree, and `VP` is the second.
If we also add the rules `NP -> 'everybody'` and `VP -> 'snored'`, then we can already assign a structure to the simple sentence *everybody snored*:

```
       S
      / \
     /   \
    /     \
   /       \
 NP        VP
 |          |
everybody  snored
```

We can represent this tree as a single string by using **labeled bracketing**:

```
(S (NP everybody) (VP snored))
```

Right now our toy grammar contains only three rules.

```
S -> NP VP
NP -> 'everybody'
VP -> 'snored'
```

But of course we can add more rules to analyze more sentences than just *everybody snored*.

```
S -> NP VP
S -> Aux NP V
NP -> 'everybody'
NP -> 'nobody'
VP -> 'snored'
Aux -> 'did'
V -> 'snore'
```

This grammar has multiple rules for `S` and `NP`.
Each one of these rules describes a valid configuration.
So instead of `NP VP`, a sentence may also consist of `Aux NP V`, e.g. in *Did everybody snore*.

For simplicity, rules with the same symbol on the left-hand side are often written on a single line, with the different options separated by `|` (the **pipe**).

```
S -> NP VP | Aux NP V
NP -> 'everybody' | 'nobody'
VP -> 'snored'
Aux -> 'did'
V -> 'snore'
```

You might have also noticed that words are flanked by quotes, whereas symbols like `S` and `NP` are not.
While it may seem innocuous, it indicates an important distinction.
Symbols like `S` and `NP` are **non-terminals**, which means that they cannot occur in the pronounced sentence.
All symbols that appear on the left-hand side of a rule are non-terminals.
The things that do appear in the string are called **terminals**, and they must never occur on the left-hand side of a rule.
The quotes indicate the difference between terminals (with quotes) and non-terminals (no quotes).

Alright, that's all nice and dandy, but it doesn't get us any closer to doing syntax in Python.
Well, actually, it does, because Python's `nltk` package comes with tools that build directly on this way of defining phrase structure grammars.

## Phrase structure grammars in `nltk`

Consider once more our initial toy grammar.

```
S -> NP VP
NP -> 'everybody'
VP -> 'snored'
```

We can directly use this with `nltk`.

In [None]:
import nltk

toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'everybody'
VP -> 'snored'
""")

We now have a grammar `toy_grammar` that contains the specified rewrite rules.

Let's look quickly at the name of the `nltk` function that allows us to do this: `nltk.CFG.fromstring`.
That it starts with `nltk` is not surprising; after all, it is part of the `nltk` package.
The `fromstring` part is also self-explanatory; we define the grammar as a string and pass that string to the function.
But what does CFG mean?

CFG is short for context-free grammar.
From a mathematical perspective, phrase structure grammars and context-free grammars are the same thing.
But the term phrase structure grammar signals a specific intended application: phrase structure grammars are context-free grammars that are only meant to describe the structure of natural language sentences.
Context-free grammars, on the other hand, have many other applications: they could also be used to analyze DNA-sequences or to define the syntax of a programming language.
Here's a metaphor: a salad fork (phrase structure grammar) is not all that different from any other fork (context-free grammar), the main distinction is that it's only meant to be used with salad.

Alright, enough about terminology, let's get back to the productive stuff.
We now have a very simple toy grammar with only three rules.
Our last example in the previous section was quite a bit more complex.
But it, too, can be used with `nltk.CFG.fromstring`.

In [None]:
import nltk

toy_grammar = nltk.CFG.fromstring("""
S -> NP VP | Aux NP V
NP -> 'everybody' | 'nobody'
VP -> 'snored'
Aux -> 'did'
V -> 'snore'
""")

We can use the `productions` method to check that the grammar has indeed been correctly instantiated.

In [None]:
toy_grammar.productions()

Keep in mind that `|` is just a shorthand for defining multiple right-hand sides, so the output of `productions` does indeed match the grammar we defined.
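To make the effect of `|` and the terminal/non-terminal distinction concrete, here is a small sketch that rebuilds the same grammar and inspects its productions. It only uses `productions`, `Production.lhs()`, `Production.rhs()`, and `nltk.Nonterminal`; the variable names are purely illustrative.

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP | Aux NP V
NP -> 'everybody' | 'nobody'
VP -> 'snored'
Aux -> 'did'
V -> 'snore'
""")

# Each option separated by | becomes a production of its own.
productions = grammar.productions()
print(len(productions))

# Non-terminals are exactly the symbols found on some left-hand side.
nonterminals = {str(p.lhs()) for p in productions}

# Terminals are stored as plain Python strings on the right-hand side.
terminals = {sym for p in productions for sym in p.rhs() if isinstance(sym, str)}

print(sorted(nonterminals))
print(sorted(terminals))

# productions() can also filter by left-hand side.
print(grammar.productions(lhs=nltk.Nonterminal("NP")))
```

Note that the two sets do not overlap, which is just the requirement that terminals never occur on the left-hand side of a rule.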

## Parsing

Okay, so now we have a toy grammar that's a bit more complex.
A grammar by itself, though, isn't of much use.
We only care about grammars to the extent that they allow us to talk about the structure of sentences.
But a grammar by itself does not actually help us with finding the structure of a sentence.
This task is handled by the **parser**.

In [None]:
sentence = ["everybody", "snored"]
parser = nltk.ChartParser(toy_grammar)
for parse in parser.parse(sentence):
    print(parse)

The code above tells Python to build a chart parser from our toy grammar.
This is essentially the CKY parser we encountered in class, with some minor modifications that do not matter here.
The parser provides a method `parse` that computes **all** possible structures for a given sentence.

The output of `parse` itself cannot be easily inspected with `print`.

In [None]:
print(parser.parse(sentence))

But we can convert it to a list and then look at that.

In [None]:
print(list(parser.parse(sentence)))

A little tricky to read, hmm?
Even though it is hard to tell with all the brackets and parentheses, the list actually contains just a single element.
Printing just this single element gives us the familiar output we saw earlier on.

In [None]:
print(list(parser.parse(sentence))[0])

Why the difference in output?
The `nltk` package defines a new class `Tree` that is used to represent tree structures.
A `Tree` object consists of a node and a list of the subtrees that the node is the mother of.
But each subtree is itself a `Tree` object.
This causes the somewhat convoluted output we saw with `list`.
We can use indentation and linebreaks to make it a bit more digestible:

```python
[
 Tree('S',
      [Tree('NP',
            ['everybody']),
       Tree('VP',
            ['snored'])
      ]
     )
]
```

This is the actual structure of an nltk tree.
Nobody likes clutter, though, so nltk uses some tricks to make sure that when a tree is the argument of `print`, it is shown with the cleaner labeled bracketing format:

```python
(S (NP everybody) (VP snored))
```

As far as Python is concerned, however, that's just a formatting trick for increased readability.
The actual tree structure is the more complicated one with `Tree`s containing `Tree`s.
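Because a `Tree` is a node label plus a list of daughters, it can be taken apart with ordinary list indexing. The following sketch builds the tree directly from its labeled bracketing with `Tree.fromstring` rather than running a parser, simply to keep the example short.

```python
from nltk import Tree

# Build the tree from its labeled bracketing instead of running a parser.
tree = Tree.fromstring("(S (NP everybody) (VP snored))")

print(tree.label())     # label of the root node
print(tree[0])          # first daughter, itself a Tree
print(tree[0].label())  # label of that daughter
print(tree[0][0])       # its only daughter is a plain string, the word
print(tree.leaves())    # all words, from left to right

# repr shows the nested Tree(...) form, str the labeled bracketing.
print(repr(tree))
print(str(tree))
```

The difference between the last two lines is the difference between printing a list of trees and printing a single tree.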

## A more realistic example

Let's move beyond our toy grammar and actually design a grammar that can handle some interesting cases of ambiguity in English.

In [None]:
ambig_grammar = nltk.CFG.fromstring("""
S -> NP VP | PN VP | Pro VP
NP -> Det N | NP PP
VP -> V NP | V PN | VP PP
PP -> P NP | P PN
Det -> 'a' | 'an' | 'the' | 'this' | 'my'
N -> 'person' | 'hill' | 'telescope' | 'movie'
Pro -> 'I' | 'you'
PN -> 'Arnold'
V -> 'saw' | 'watched'
P -> 'in' | 'on' | 'with'
""")

The grammar above is in **Chomsky Normal Form** (CNF).
This means that every rule is of the form `X -> Y Z` or `X -> 'terminal'`.
In other words, every node in the tree has either exactly two non-terminals as daughters, or exactly one daughter that is a terminal.
CNF does not change what strings are considered well-formed by the grammar, but it does change what the tree structures may look like.
But since grammars in CNF are easier to handle for some parsers, that is often an acceptable trade-off.
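The two rule shapes can be checked mechanically. Below is a minimal sketch of such a check; `in_cnf` and the two small grammars are made up for illustration, while `is_nonterminal` and `is_terminal` are helper functions from `nltk.grammar`.

```python
import nltk
from nltk.grammar import is_nonterminal, is_terminal

def in_cnf(grammar):
    """Check every production against the two rule shapes allowed in CNF."""
    for production in grammar.productions():
        rhs = production.rhs()
        binary = len(rhs) == 2 and all(is_nonterminal(sym) for sym in rhs)
        lexical = len(rhs) == 1 and is_terminal(rhs[0])
        if not (binary or lexical):
            return False
    return True

cnf_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'a'
N -> 'person' | 'movie'
V -> 'watched'
""")

# S -> Aux NP V has three daughters, so this grammar is not in CNF.
non_cnf_grammar = nltk.CFG.fromstring("""
S -> NP VP | Aux NP V
NP -> 'everybody'
VP -> 'snored'
Aux -> 'did'
V -> 'snore'
""")

print(in_cnf(cnf_grammar), in_cnf(non_cnf_grammar))
```

Grammars in `nltk` also offer an `is_chomsky_normal_form()` method that performs a similar check; the hand-written version is shown only because it mirrors the definition line by line.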

Let's see what our grammar produces for some (possibly) ambiguous sentences.

In [None]:
import re

def print_parses(sentence, grammar):
    sentence = re.findall(r"\w+", sentence)
    parser = nltk.ChartParser(grammar)
    for parse in parser.parse(sentence):
        print(parse)

In [None]:
print_parses("I watched a movie with Arnold", ambig_grammar)

As expected, we get two parses: one where Arnold and I watched a movie together, and one where a movie with Arnold was watched by me.

But what happens if there is no parse at all?

In [None]:
print_parses("Arnold movie watched a I with", ambig_grammar)

We don't get any output!
That's because we're using a `for`-loop to print every item in the list of parses that the parser found.
If there are no such parses at all, then the list is empty.
So our loop reduces to `for parse in []`, which means that nothing at all happens.
Let's modify our `print_parses` function so that we get an explicit notification if there is no parse.

In [None]:
import re

def print_parses(sentence, grammar):
    sentence = re.findall(r"\w+", sentence)
    parser = nltk.ChartParser(grammar)
    empty = True  # let's assume there are no parses
    for parse in parser.parse(sentence):
        print(parse)
        empty = False  # nope, found at least one parse
    if empty:
        print("no parses")

In [None]:
print_parses("I watched a movie with Arnold", ambig_grammar)

In [None]:
print_parses("Arnold movie watched a I with", ambig_grammar)

You might be wondering why we didn't just use something like `if parser.parse(sentence) != []` or `if len(parser.parse(sentence)) > 0` to check if there are any parses.
The answer is that `parser.parse(sentence)` doesn't return a list but a generator.
Generators were covered in an expansion unit, so we won't go over them here.
But the bottom line is that generators are more ephemeral objects for which we cannot test emptiness in the usual ways.
Hence the workaround with a boolean variable `empty` to keep track of whether any parses have been printed.
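This behavior can be reproduced without a parser. In the sketch below, `fake_parses` is a made-up stand-in for `parser.parse` that yields at most one parse. The sketch also shows two common alternatives to the boolean flag; converting to a list assumes that the number of parses is small enough to keep in memory.

```python
def fake_parses(found):
    """Stand-in for parser.parse: yields parse strings one at a time."""
    if found:
        yield "(S (NP everybody) (VP snored))"

# A generator never equals [], even when it yields nothing at all.
print(fake_parses(False) != [])

# Generators have no length either.
try:
    len(fake_parses(False))
except TypeError as error:
    print("TypeError:", error)

# Option 1: materialize the generator, then use ordinary list tests.
parses = list(fake_parses(False))
print(len(parses) == 0)

# Option 2: ask for the first item and supply a default if there is none.
first = next(fake_parses(True), None)
print(first)
```

Keep in mind that `next` consumes the item it returns, so the second option is best suited to cases where only the first parse is needed.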

Anyways, let's try one more example.

In [None]:
print_parses("I watched a film with Arnold", ambig_grammar)

Hmm, this didn't work.
We got `ValueError: Grammar does not cover some of the input words: "'film'".`
The way a CFG works, every lexical item must occur in some rewrite rule of the grammar.
We can again use the `.productions` method to check for each word in the input what rule it occurred in.

In [None]:
import re

sentence = re.findall(r"\w+", "I watched a movie with Arnold")
for word in sentence:
    print(ambig_grammar.productions(rhs=word))

The use of `rhs=x` tells `productions` to only show us those rules whose right-hand side starts with `x`.
So `rhs=word` only shows the rules that have `word` as the first symbol of their right-hand side; for the words in our grammar, that is the lexical rule that introduces them.
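The same filter can be used to find the offending words before parsing. This is a minimal sketch, assuming that every word is introduced by a rule of its own as in the grammars above; `uncovered_words` and `small_grammar` are illustrative names, not part of `nltk`.

```python
import nltk

def uncovered_words(tokens, grammar):
    """Return the tokens that do not start the right-hand side of any rule."""
    return [word for word in tokens if not grammar.productions(rhs=word)]

small_grammar = nltk.CFG.fromstring("""
S -> Pro VP
VP -> V NP
NP -> Det N
Det -> 'a'
N -> 'movie'
Pro -> 'I'
V -> 'watched'
""")

print(uncovered_words(["I", "watched", "a", "film"], small_grammar))
```

An empty result means that the parser will not raise the `ValueError` shown above, although the sentence may still have no parse.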

Okay, let's do one more example.

In [None]:
print_parses("a movie with Arnold watched Arnold with Arnold", ambig_grammar)

This sentence is not ambiguous according to the grammar because even though there is a rule `NP -> NP PP`, there is no `PN`-counterpart of the form `PN -> PN PP`.
Hence proper names cannot be modified by PPs.
This implies that the final *with Arnold* can only modify the VP in the example sentence.

## Inspecting the chart

As mentioned earlier, nltk's chart parser is a tuned version of the CKY algorithm.
We can verify that by printing the initial and final chart used by the parser.
The code for that is a bit complicated, so I've put it away in a separate file.
Just run the cells below as usual and don't worry about the function definitions going on under the hood.

In [None]:
%run chartprinter.py

In [None]:
print_chart("I watched a movie with Arnold", ambig_grammar)

This is an excellent opportunity for you to test your own CKY parsing skills.
The cells below contain an example grammar and some sentences.
For each one, first draw your own chart, then run the code to see the actual output.
Keep in mind that your chart should also contain the backpointers, which are not displayed in the output of `print_chart` as it would be too convoluted.
Also, `print_chart` won't fill any cells beyond the first diagonal if there is no parse at all.
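If you want to check your hand-drawn charts against code, here is a minimal CKY sketch in plain Python. It is not the code in `chartprinter.py` and not how `nltk` implements its chart parser. Instead of backpointers it stores, for every cell and category, the number of ways that category can be built, so the top cell tells us how many parses there are. The rule tables are copied by hand from `ambig_grammar` and only cover the words of the example sentence.

```python
from collections import defaultdict

# Binary rules X -> Y Z as (X, Y, Z) triples, copied from ambig_grammar.
BINARY = [
    ("S", "NP", "VP"), ("S", "PN", "VP"), ("S", "Pro", "VP"),
    ("NP", "Det", "N"), ("NP", "NP", "PP"),
    ("VP", "V", "NP"), ("VP", "V", "PN"), ("VP", "VP", "PP"),
    ("PP", "P", "NP"), ("PP", "P", "PN"),
]
# Lexical rules X -> 'word', indexed by word (only the words used below).
LEXICAL = {
    "I": ["Pro"], "watched": ["V"], "a": ["Det"],
    "movie": ["N"], "with": ["P"], "Arnold": ["PN"],
}

def cky_count(words, start="S"):
    """Count the parses of words; chart[i, j][X] = ways X spans words[i:j]."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    # First diagonal: one cell per word.
    for i, word in enumerate(words):
        for category in LEXICAL.get(word, []):
            chart[i, i + 1][category] += 1
    # Longer spans, built from two adjacent shorter spans split at k.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY:
                    chart[i, j][parent] += chart[i, k][left] * chart[k, j][right]
    return chart[0, n][start]

print(cky_count("I watched a movie with Arnold".split()))
print(cky_count("Arnold movie watched a I with".split()))
```

The three nested loops over `width`, `i`, and `k` are the same ones you follow when filling a chart by hand, diagonal by diagonal.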

In [None]:
complex_grammar = nltk.CFG.fromstring("""
S -> NP VP | PN VP | Pro VP
NP -> Det NP | NP PP | A NP | A N | 'movies' | 'nobody'
VP -> V NP | V PN | VP PP | 'slept' | 'snored'
PP -> P NP | P PN
Det -> 'a' | 'an' | 'the' | 'this' | 'my'
N -> 'person' | 'hill' | 'telescope' | 'movie'
Pro -> 'I' | 'you'
PN -> 'Arnold'
V -> 'saw' | 'watched'
P -> 'in' | 'on' | 'with'
A -> 'old' | 'cool' | 'young'
""")

In [None]:
print_chart("I watched an old movie", complex_grammar)

In [None]:
print_chart("I watched an old movie with Arnold", complex_grammar)

In [None]:
print_chart("I watched a movie with young Arnold", complex_grammar)

Alright, ready for a real challenge?
Time to parse some *police* sentences.
For some reason that I haven't been able to determine, the charts do not display correctly in cases of lexical ambiguity, so you'll have to make do with a printout of the actual parses.

In [None]:
police_grammar = nltk.CFG.fromstring("""
S -> NP VP | NP VPtop
NP -> 'police' | NP RelS
RelS -> NP VPrel
VP -> V NP
VPtop -> NP V
VPrel -> 'police'
V -> 'police'
""")

In [None]:
print_parses("police police police", police_grammar)

In [None]:
print_parses("police police police police", police_grammar)

In [None]:
print_parses("police police police police police", police_grammar)

In [None]:
print_parses("police police police police police police", police_grammar)

Careful, the next one might make your head explode.

In [None]:
print_parses("police police police police police police police", police_grammar)

As you can see, the number of potential parses explodes as sentences get longer.
This is a basic fact of syntax: there are tons of ways to combine words, so a single string may have dozens of structures attached to it.
Humans rarely notice this kind of ambiguity because we implicitly use meaning to quickly rule out implausible parses.
For instance, if you hear somebody say "I once shot an elephant in my pajamas", you don't assume that the elephant was wearing the pajamas.
Computers don't have access to this kind of world knowledge, and that's why parsing is a much harder task for them.
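The growth in the number of parses can also be tabulated directly. The sketch below rebuilds the police grammar from above and counts parses by exhausting the generator returned by `parse`; `count_parses` is a helper introduced here for illustration, not an `nltk` function.

```python
import nltk

police_grammar = nltk.CFG.fromstring("""
S -> NP VP | NP VPtop
NP -> 'police' | NP RelS
RelS -> NP VPrel
VP -> V NP
VPtop -> NP V
VPrel -> 'police'
V -> 'police'
""")
parser = nltk.ChartParser(police_grammar)

def count_parses(tokens):
    """Count parses by exhausting the generator, without storing the trees."""
    return sum(1 for _ in parser.parse(tokens))

# Number of parses for 3 to 7 repetitions of 'police'.
counts = {n: count_parses(["police"] * n) for n in range(3, 8)}
for n, count in counts.items():
    print(n, count)
```

Counting with `sum(1 for _ in ...)` avoids keeping every tree in memory, which starts to matter once the counts get large.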

## Bullet point summary

- Every sentence has an invisible tree structure.
- The phrase structure of a sentence can be described by a context-free grammar:

```python
nltk.CFG.fromstring("""
X -> Y Z
Y -> 'y'
Z -> 'z'
""")
```

- A grammar just defines the set of all possible structures in a language.
  We need a parser to actually determine the structure(s) of a given sentence.
  
```python
# instantiate parser from grammar
some_parser = nltk.ChartParser(grammar)
# compute all parses of a sentence
parses = some_parser.parse(sentence)
```

- The `parse` method returns a generator (see the earlier expansion unit), **not** a list.
  We can iterate over the generator with `for`, but we can't pick out individual elements, test emptiness, or calculate the length.