# 3: CFGs/Grammar in NLTK

## 1. Constructing grammars

There are two main ways to build a context-free grammar (CFG) in NLTK. The first is to write the productions as a string and pass it to `CFG.fromstring`. The left-hand side (LHS) of the first rule is taken to be the start symbol.

We will use the following grammar rules:
```
S -> NP VP
NP -> PN
NP -> NNS
VP -> VB
VP -> VB NP
VP -> VB NP NP
```
Each rule (also called a production) denotes a replacement of the left-hand side symbol with one or more right-hand side symbols.

We can use the rules in our grammar to derive phrases and sentences.

We will also use the following lexical rules:
```
PN -> 'she'
PN -> 'we'
PN -> 'us'
PN -> 'her'
NNS -> 'bills'
NNS -> 'coins'
VB -> 'give'
VB -> 'gives'
```

Note that the first rule of the grammar determines the start symbol (`S` in our case).
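
To make the idea of a derivation concrete, here is a minimal sketch that reads one derivation of "she gives us coins" off a hand-written bracketed tree. The variable name `derivation_tree` is only illustrative; `Tree.fromstring` and `Tree.productions` are standard NLTK calls.

```python
from nltk import Tree

# The bracketed tree records one derivation of "she gives us coins"
# using the syntactic and lexical rules listed above.
derivation_tree = Tree.fromstring(
    "(S (NP (PN she)) (VP (VB gives) (NP (PN us)) (NP (NNS coins))))"
)

# productions() lists the rules applied, top-down and left to right.
applied_rules = derivation_tree.productions()
for production in applied_rules:
    print(production)
```

Each printed line is one rewriting step: the first is `S -> NP VP`, and the lexical rules appear wherever a word is introduced.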

In [1]:
from nltk import CFG
from nltk.parse.generate import generate
from random import shuffle

grammar = CFG.fromstring("""S -> NP VP
NP -> PN
NP -> NNS
VP -> VB
VP -> VB NP
VP -> VB NP NP
PN -> 'she'
PN -> 'we'
PN -> 'us'
PN -> 'her'
NNS -> 'bills'
NNS -> 'coins'
VB -> 'give'
VB -> 'gives'""")

Let's print our grammar.

In [2]:
print(grammar)

Grammar with 14 productions (start state = S)
    S -> NP VP
    NP -> PN
    NP -> NNS
    VP -> VB
    VP -> VB NP
    VP -> VB NP NP
    PN -> 'she'
    PN -> 'we'
    PN -> 'us'
    PN -> 'her'
    NNS -> 'bills'
    NNS -> 'coins'
    VB -> 'give'
    VB -> 'gives'
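
Besides printing, a grammar object can be inspected programmatically. The sketch below uses a reduced copy of the grammar (named `toy_grammar` here purely for illustration) to look up the start symbol, filter productions by left-hand side, and pick out the lexical rules.

```python
from nltk import CFG
from nltk.grammar import Nonterminal

# A reduced copy of the grammar above, kept small for illustration.
toy_grammar = CFG.fromstring("""
S -> NP VP
NP -> PN
VP -> VB
VP -> VB NP
PN -> 'she'
PN -> 'us'
VB -> 'gives'
""")

# The left-hand side of the first rule becomes the start symbol.
start_symbol = toy_grammar.start()

# productions() accepts an optional lhs filter.
vp_rules = toy_grammar.productions(lhs=Nonterminal("VP"))

# is_lexical() is true when the right-hand side contains at least one terminal.
lexical_rules = [p for p in toy_grammar.productions() if p.is_lexical()]

print(start_symbol)
for rule in vp_rules:
    print(rule)
print(len(lexical_rules), "lexical rules")
```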


## 2. Parsing sentences

We can use CFGs for parsing. NLTK includes several parsing algorithms; let's use the Earley chart parser.

In [3]:
from nltk.parse import EarleyChartParser

nltk_parser = EarleyChartParser(grammar)

sent = "she gives us coins".split(" ")

parses = nltk_parser.parse(sent)

for parse in parses:
    print(parse)

(S (NP (PN she)) (VP (VB gives) (NP (PN us)) (NP (NNS coins))))
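
The parser returns an iterator over `Tree` objects rather than a list. As a small sketch (with a cut-down grammar and the illustrative names `demo_grammar` and `demo_parser`), the trees can be collected and then queried for their root label, words, and word/category pairs.

```python
from nltk import CFG
from nltk.parse import EarleyChartParser

demo_grammar = CFG.fromstring("""
S -> NP VP
NP -> PN
NP -> NNS
VP -> VB NP NP
PN -> 'she'
PN -> 'us'
NNS -> 'coins'
VB -> 'gives'
""")
demo_parser = EarleyChartParser(demo_grammar)

# parse() yields trees lazily; list() collects them so they can be
# counted and reused.
trees = list(demo_parser.parse("she gives us coins".split()))
first_tree = trees[0]

print(len(trees))            # number of analyses found
print(first_tree.label())    # category of the root node
print(first_tree.leaves())   # the words, in sentence order
print(first_tree.pos())      # (word, preterminal category) pairs
```

Collecting the trees in a list matters when the result is needed more than once, because the iterator is exhausted after a single pass.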


The grammar will also parse some clearly ungrammatical sentences.

In [4]:
sent = "her gives us coins".split(" ")

parses = nltk_parser.parse(sent)

for parse in parses:
    print(parse)

(S (NP (PN her)) (VP (VB gives) (NP (PN us)) (NP (NNS coins))))


## 3. Modeling subject and object NPs

We will now specialize the noun phrases in subject and object positions in order to prevent our grammar from analyzing ungrammatical sentences like "her gives we coins".

We will create two new non-terminals `NP_subj` and `NP_obj` which are intended for subject and object positions, respectively.

In [5]:
grammar = CFG.fromstring("""S -> NP_subj VP
NP_subj -> PN_subj
NP_subj -> NNS
NP_obj -> PN_obj
NP_obj -> NNS
VP -> VB
VP -> VB NP_obj
VP -> VB NP_obj NP_obj
PN_subj -> 'she'
PN_subj -> 'we'
PN_obj -> 'us'
PN_obj -> 'her'
NNS -> 'bills'
NNS -> 'coins'
VB -> 'give'
VB -> 'gives'""")

Let's make sure that the grammar still parses the original grammatical sentence "she gives us coins".

In [6]:
nltk_parser = EarleyChartParser(grammar)

sent = "she gives us coins".split(" ")

parses = nltk_parser.parse(sent)

for parse in parses:
    print(parse)

(S
  (NP_subj (PN_subj she))
  (VP (VB gives) (NP_obj (PN_obj us)) (NP_obj (NNS coins))))


Let's also make sure that it does not parse the ungrammatical sentence "her gives us coins". The parser finds no tree for it, so the loop below prints nothing.

In [7]:
sent = "her gives us coins".split(" ")

parses = nltk_parser.parse(sent)

for parse in parses:
    print(parse)
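
Because an empty result is easy to overlook, it can help to wrap the check in a small function. The helper `is_accepted` below is hypothetical (not part of NLTK) and is shown with a cut-down version of the case-sensitive grammar; it also treats words missing from the grammar as a rejection, since the chart parser raises a `ValueError` for them.

```python
from nltk import CFG
from nltk.parse import EarleyChartParser

def is_accepted(parser, sentence):
    """Return True if the parser finds at least one tree for the sentence."""
    tokens = sentence.split()
    try:
        return next(iter(parser.parse(tokens)), None) is not None
    except ValueError:
        # Raised when a word is not covered by the grammar's lexical rules.
        return False

case_grammar = CFG.fromstring("""
S -> NP_subj VP
NP_subj -> PN_subj
NP_obj -> PN_obj
NP_obj -> NNS
VP -> VB NP_obj NP_obj
PN_subj -> 'she'
PN_obj -> 'us'
PN_obj -> 'her'
NNS -> 'coins'
VB -> 'gives'
""")
case_parser = EarleyChartParser(case_grammar)

print(is_accepted(case_parser, "she gives us coins"))
print(is_accepted(case_parser, "her gives us coins"))
```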

## 4. Adding rules to an existing grammar

The second way to build a grammar is programmatic: a `CFG` can be constructed from a start symbol and a list of `Production` objects, which in turn are built from `Nonterminal` objects and strings. This could be used to import rules from an existing grammar while making changes or additions, for instance.

Let's add simple adverbial phrases to the end of all verb phrases of our grammar. We can assume that these consist of a single adverb (RB) like "grudgingly" or a combination of two adverbs like "very grudgingly".

In addition to our existing `VP` rules:
```
VP -> VB
VP -> VB NP_obj
VP -> VB NP_obj NP_obj
```
we will add three new rules to our grammar:
```
VP -> VB AdvP
VP -> VB NP_obj AdvP
VP -> VB NP_obj NP_obj AdvP
```

In [8]:
from nltk.grammar import Production, Nonterminal

AdvP = Nonterminal("AdvP")
VP = Nonterminal("VP")

new_productions = []

for rule in grammar.productions():
    if rule.lhs() == VP:
        production = Production(VP, rule.rhs() + (AdvP,))   # append the new AdvP
        new_productions.append(production)

# combine the original productions with the new ones
new_grammar = CFG(Nonterminal("S"), grammar.productions() + new_productions)

Let's print the new grammar.

In [9]:
print(new_grammar)

Grammar with 19 productions (start state = S)
    S -> NP_subj VP
    NP_subj -> PN_subj
    NP_subj -> NNS
    NP_obj -> PN_obj
    NP_obj -> NNS
    VP -> VB
    VP -> VB NP_obj
    VP -> VB NP_obj NP_obj
    PN_subj -> 'she'
    PN_subj -> 'we'
    PN_obj -> 'us'
    PN_obj -> 'her'
    NNS -> 'bills'
    NNS -> 'coins'
    VB -> 'give'
    VB -> 'gives'
    VP -> VB AdvP
    VP -> VB NP_obj AdvP
    VP -> VB NP_obj NP_obj AdvP
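
At this point `AdvP` appears on the right-hand side of three rules but has no rules of its own. A quick way to spot such gaps is sketched below; `undefined_nonterminals` is a hypothetical helper written for this example, and `incomplete` is a cut-down grammar in the same state.

```python
from nltk import CFG
from nltk.grammar import Nonterminal

def undefined_nonterminals(grammar):
    """Return nonterminals used on a right-hand side but never expanded."""
    defined = {production.lhs() for production in grammar.productions()}
    used = {
        symbol
        for production in grammar.productions()
        for symbol in production.rhs()
        if isinstance(symbol, Nonterminal)
    }
    return used - defined

# AdvP is referenced by a VP rule but has no rules of its own yet.
incomplete = CFG.fromstring("""
S -> NP VP
NP -> 'she'
VP -> VB
VP -> VB AdvP
VB -> 'gives'
""")

missing = undefined_nonterminals(incomplete)
print(missing)
```

An empty set means every nonterminal that is used can also be expanded.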


We will then add rules for the internal structure of an `AdvP`:
```
AdvP -> RB
AdvP -> RB RB
RB -> 'grudgingly'
RB -> 'very'
```

In [10]:
AdvP = Nonterminal("AdvP")
RB = Nonterminal("RB")

# here we add them manually
new_productions.append(Production(AdvP, (RB,)))
new_productions.append(Production(AdvP, (RB,RB)))
new_productions.append(Production(RB, ["very"]))
new_productions.append(Production(RB, ["grudgingly"]))

new_grammar = CFG(Nonterminal("S"), grammar.productions() + new_productions)

Let's test the new grammar.

In [11]:
nltk_parser = EarleyChartParser(new_grammar)

sent = "she gives us coins very grudgingly".split(" ")

parses = nltk_parser.parse(sent)

for parse in parses:
    print(parse)

(S
  (NP_subj (PN_subj she))
  (VP
    (VB gives)
    (NP_obj (PN_obj us))
    (NP_obj (NNS coins))
    (AdvP (RB very) (RB grudgingly))))
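
Appending to a shared list, as above, adds the same productions again whenever a cell is re-run. One possible way around this is a small helper that builds a fresh grammar and skips rules that are already present. `extend_grammar` is a hypothetical function, shown here on a toy grammar rather than the full one.

```python
from nltk import CFG
from nltk.grammar import Nonterminal, Production

def extend_grammar(grammar, extra_productions):
    """Return a new CFG with extra productions added, skipping duplicates."""
    combined = list(grammar.productions())
    for production in extra_productions:
        if production not in combined:
            combined.append(production)
    return CFG(grammar.start(), combined)

base = CFG.fromstring("""
S -> NP VP
NP -> 'she'
VP -> VB
VB -> 'gives'
""")

VP, VB, AdvP, RB = (Nonterminal(name) for name in ("VP", "VB", "AdvP", "RB"))
extra = [
    Production(VP, (VB, AdvP)),
    Production(AdvP, (RB,)),
    Production(RB, ("grudgingly",)),
    Production(VP, (VB,)),  # already in the base grammar, so it is skipped
]

extended = extend_grammar(base, extra)
print(extended)
```

The original grammar object is left untouched, so the helper can be called repeatedly without side effects.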


---------

## 5. More examples

### 1). Adding singular and proper nouns, determiners, adjectives and prepositional phrases

Let's start with our final grammar from the previous section:
```
S -> NP_subj VP
AdvP -> RB
AdvP -> RB RB
NP_subj -> PN_subj
NP_subj -> NNS
NP_obj -> PN_obj
NP_obj -> NNS
VP -> VB
VP -> VB NP_obj
VP -> VB NP_obj NP_obj
VP -> VB AdvP
VP -> VB NP_obj AdvP
VP -> VB NP_obj NP_obj AdvP
PN_subj -> 'she'
PN_subj -> 'we'
PN_obj -> 'us'
PN_obj -> 'her'
NNS -> 'bills'
NNS -> 'coins'
VB -> 'give'
VB -> 'gives'
RB -> 'grudgingly'
RB -> 'very'
```

Our task:

1. Add the singular nouns "man", "woman", and "coin", as well as the determiner "the", to the grammar and extend the `NP_subj` and `NP_obj` rules to cover singular nouns.
1. Add the proper nouns "Thursday" and "Bill" to the grammar and extend the `NP_subj` and `NP_obj` rules to cover proper nouns.
1. Extend the grammar with the adjectives "shiny" and "golden". In addition to the existing NPs, you should also accept noun phrases like "shiny coins" and "golden man" containing an adjective.
1. Add the preposition (IN) "on" to the grammar. Extend the adverbial phrase so that it can consist of a prepositional phrase (PP) like "on Thursday", which consists of a preposition and an object NP.

In [12]:
from nltk import CFG

grammar = CFG.fromstring('''S -> NP_subj VP
AdvP -> RB
AdvP -> RB RB
AdvP -> PP
NP_subj -> PN_subj
NP_subj -> NNS
NP_subj -> NNP
NP_subj -> DT NN
NP_subj -> DT JJ NN
NP_subj -> JJ NNS
NP_obj -> PN_obj
NP_obj -> NNS
NP_obj -> DT NN
NP_obj -> DT JJ NN
NP_obj -> JJ NNS
NP_obj -> NNP
PP -> IN NP_obj
VP -> VB
VP -> VB NP_obj
VP -> VB NP_obj NP_obj
VP -> VB AdvP
VP -> VB NP_obj AdvP
VP -> VB NP_obj NP_obj AdvP
DT -> 'the'
PN_subj -> 'she'
PN_subj -> 'we'
PN_obj -> 'us'
PN_obj -> 'her'
IN -> 'on'
JJ -> 'shiny'
JJ -> 'golden'
NN -> 'man'
NN -> 'woman'
NN -> 'coin'
NNP -> 'Bill'
NNP -> 'Thursday'
NNS -> 'bills'
NNS -> 'coins'
VB -> 'give'
VB -> 'gives'
RB -> 'grudgingly'
RB -> 'very'
''')


Now we check that the grammar returns at least one parse for each of the following sentences:
    
```
the man gives us coins
she gives us the coin
Bill gives us coins very grudgingly
she gives Bill coins
Bill gives us golden coins
the golden man gives us shiny coins
Bill gives us shiny coins on Thursday
```

In [13]:
from nltk.parse import EarleyChartParser

nltk_parser = EarleyChartParser(grammar)

sents = ["the man gives us coins",
"she gives us the coin",
"Bill gives us coins very grudgingly",
"she gives Bill coins",
"Bill gives us golden coins",
"the golden man gives us shiny coins",
"Bill gives us shiny coins on Thursday"]

for sent in sents:
    tok_sent = sent.split(" ")
    parses = nltk_parser.parse(tok_sent)
    print("Parses for %s:" % sent)
    for parse in parses:
        print(parse)

Parses for the man gives us coins:
(S
  (NP_subj (DT the) (NN man))
  (VP (VB gives) (NP_obj (PN_obj us)) (NP_obj (NNS coins))))
Parses for she gives us the coin:
(S
  (NP_subj (PN_subj she))
  (VP (VB gives) (NP_obj (PN_obj us)) (NP_obj (DT the) (NN coin))))
Parses for Bill gives us coins very grudgingly:
(S
  (NP_subj (NNP Bill))
  (VP
    (VB gives)
    (NP_obj (PN_obj us))
    (NP_obj (NNS coins))
    (AdvP (RB very) (RB grudgingly))))
Parses for she gives Bill coins:
(S
  (NP_subj (PN_subj she))
  (VP (VB gives) (NP_obj (NNP Bill)) (NP_obj (NNS coins))))
Parses for Bill gives us golden coins:
(S
  (NP_subj (NNP Bill))
  (VP
    (VB gives)
    (NP_obj (PN_obj us))
    (NP_obj (JJ golden) (NNS coins))))
Parses for the golden man gives us shiny coins:
(S
  (NP_subj (DT the) (JJ golden) (NN man))
  (VP
    (VB gives)
    (NP_obj (PN_obj us))
    (NP_obj (JJ shiny) (NNS coins))))
Parses for Bill gives us shiny coins on Thursday:
(S
  (NP_subj (NNP Bill))
  (VP
    (VB gives)
    (NP_obj (PN_obj us))
    (NP_obj (JJ shiny) (NNS coins))
    (AdvP (PP (IN on) (NP_obj (NNP Thursday))))))

Use the function `generate` from the module `nltk.parse.generate` to generate 20 sentences which are parsed by your grammar.

You can use `help(generate)` to figure out how to call the function.

In [14]:
from nltk.parse.generate import generate

for s in generate(grammar,n=20):
    print(s)

['she', 'give']
['she', 'gives']
['she', 'give', 'us']
['she', 'give', 'her']
['she', 'give', 'bills']
['she', 'give', 'coins']
['she', 'give', 'the', 'man']
['she', 'give', 'the', 'woman']
['she', 'give', 'the', 'coin']
['she', 'give', 'the', 'shiny', 'man']
['she', 'give', 'the', 'shiny', 'woman']
['she', 'give', 'the', 'shiny', 'coin']
['she', 'give', 'the', 'golden', 'man']
['she', 'give', 'the', 'golden', 'woman']
['she', 'give', 'the', 'golden', 'coin']
['she', 'give', 'shiny', 'bills']
['she', 'give', 'shiny', 'coins']
['she', 'give', 'golden', 'bills']
['she', 'give', 'golden', 'coins']
['she', 'give', 'Bill']
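
The sketch below shows two variations on the call above, using a much smaller grammar (`small_grammar`, an illustrative name): joining each token list into a readable string, and omitting `n` to enumerate every sentence of a grammar whose language is finite.

```python
from nltk import CFG
from nltk.parse.generate import generate

small_grammar = CFG.fromstring("""
S -> NP VP
NP -> 'she' | 'we'
VP -> VB | VB NP_obj
NP_obj -> 'us' | 'coins'
VB -> 'give' | 'gives'
""")

# Each generated sentence is a list of tokens; join them for display.
for tokens in generate(small_grammar, n=5):
    print(" ".join(tokens))

# Without n, generate enumerates the whole language, which is finite here.
all_sentences = [" ".join(tokens) for tokens in generate(small_grammar)]
print(len(all_sentences))
```

Generation is a convenient way to see what a grammar overgenerates: strings such as "she give" show up because nothing in the rules ties the verb form to the subject.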


### 2). Adding subject-verb agreement

English displays number agreement between a 3rd person subject and its verb. This is why "she **gives** Bill coins" and "we **give** Bill coins" are grammatical, but "\*she **give** Bill coins" and "\*we **gives** Bill coins" are not.

Our grammar will happily parse both the grammatical and ungrammatical sentences. Please make sure that this is the case.

In [15]:
sents = ["she gives Bill coins",
"we give Bill coins",
"she give Bill coins",
"we gives Bill coins"
]

# parse every sentence with the current grammar, which has no agreement yet

for sent in sents:
    tok_sent = sent.split(" ")
    parses = nltk_parser.parse(tok_sent)
    print("Parses for %s:" % sent)
    for parse in parses:
        print(parse)

Parses for she gives Bill coins:
(S
  (NP_subj (PN_subj she))
  (VP (VB gives) (NP_obj (NNP Bill)) (NP_obj (NNS coins))))
Parses for we give Bill coins:
(S
  (NP_subj (PN_subj we))
  (VP (VB give) (NP_obj (NNP Bill)) (NP_obj (NNS coins))))
Parses for she give Bill coins:
(S
  (NP_subj (PN_subj she))
  (VP (VB give) (NP_obj (NNP Bill)) (NP_obj (NNS coins))))
Parses for we gives Bill coins:
(S
  (NP_subj (PN_subj we))
  (VP (VB gives) (NP_obj (NNP Bill)) (NP_obj (NNS coins))))


Now we want to specialize verb and subject phrases to handle subject-verb agreement. We need to create new phrase types `NP_subj_sg`, `NP_subj_pl`, `VP_sg` and `VP_pl` to handle the agreement.

In [16]:
# singular and plural variants of the subject NP, VP and verb rules

grammar = CFG.fromstring('''S -> NP_subj_sg VP_sg
S -> NP_subj_pl VP_pl
AdvP -> RB
AdvP -> RB RB
AdvP -> PP
NP_subj_sg -> PN_subj_sg
NP_subj_pl -> PN_subj_pl
NP_subj_pl -> NNS
NP_subj_sg -> NNP
NP_subj_sg -> DT NN
NP_subj_sg -> DT JJ NN
NP_subj_pl -> JJ NNS
NP_obj -> PN_obj
NP_obj -> NNS
NP_obj -> DT NN
NP_obj -> DT JJ NN
NP_obj -> JJ NNS
NP_obj -> NNP
PP -> IN NP_obj
VP_sg -> VB_sg
VP_pl -> VB_pl
VP_sg -> VB_sg NP_obj
VP_pl -> VB_pl NP_obj
VP_sg -> VB_sg NP_obj NP_obj
VP_pl -> VB_pl NP_obj NP_obj
VP_sg -> VB_sg AdvP
VP_pl -> VB_pl AdvP
VP_sg -> VB_sg NP_obj AdvP
VP_pl -> VB_pl NP_obj AdvP
VP_sg -> VB_sg NP_obj NP_obj AdvP
VP_pl -> VB_pl NP_obj NP_obj AdvP
DT -> 'the'
PN_subj_sg -> 'she'
PN_subj_pl -> 'we'
PN_obj -> 'us'
PN_obj -> 'her'
IN -> 'on'
JJ -> 'shiny'
JJ -> 'golden'
NN -> 'man'
NN -> 'woman'
NN -> 'coin'
NNP -> 'Bill'
NNP -> 'Thursday'
NNS -> 'bills'
NNS -> 'coins'
VB_pl -> 'give'
VB_sg -> 'gives'
RB -> 'grudgingly'
RB -> 'very'
''')


Our grammar now handles the example sentences correctly: it parses "she gives Bill coins" and "we give Bill coins", and it returns no parse for "she give Bill coins" and "we gives Bill coins". **Remember to reinitialize the nltk_parser object with the modified grammar**.

In [17]:
nltk_parser = EarleyChartParser(grammar)

sents = ["she gives Bill coins",
"we give Bill coins",
"she give Bill coins",
"we gives Bill coins"
]

for sent in sents:
    tok_sent = sent.split(" ")
    parses = nltk_parser.parse(tok_sent)
    print("Parses for %s:" % sent)
    for parse in parses:
        print(parse)

Parses for she gives Bill coins:
(S
  (NP_subj_sg (PN_subj_sg she))
  (VP_sg (VB_sg gives) (NP_obj (NNP Bill)) (NP_obj (NNS coins))))
Parses for we give Bill coins:
(S
  (NP_subj_pl (PN_subj_pl we))
  (VP_pl (VB_pl give) (NP_obj (NNP Bill)) (NP_obj (NNS coins))))
Parses for she give Bill coins:
Parses for we gives Bill coins:


Implementing subject-verb agreement this way is cumbersome, because every rule that takes part in agreement has to be written twice. We will learn a more elegant way to do this later.
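
In the meantime, one way to cut down on the repetition is to generate the number-specific rules from templates before handing the string to `CFG.fromstring`. This is only a sketch on a reduced grammar: the `{n}` placeholder convention and the names `NUMBER_TEMPLATES` and `SHARED_RULES` are assumptions made for this example, not an NLTK feature.

```python
from nltk import CFG
from nltk.parse import EarleyChartParser

# Rule templates; "{n}" stands for the number suffix (sg or pl).
NUMBER_TEMPLATES = [
    "S -> NP_subj_{n} VP_{n}",
    "VP_{n} -> VB_{n}",
    "VP_{n} -> VB_{n} NP_obj",
    "VP_{n} -> VB_{n} NP_obj NP_obj",
]

# Rules that do not depend on number, plus the number-specific lexicon.
SHARED_RULES = """
NP_subj_sg -> 'she' | NNP
NP_subj_pl -> 'we' | NNS
NP_obj -> 'us' | 'her' | NNP | NNS
NNP -> 'Bill'
NNS -> 'coins'
VB_sg -> 'gives'
VB_pl -> 'give'
"""

# Instantiate every template once per number value.
expanded = [t.format(n=number) for number in ("sg", "pl") for t in NUMBER_TEMPLATES]
agreement_grammar = CFG.fromstring("\n".join(expanded) + SHARED_RULES)
agreement_parser = EarleyChartParser(agreement_grammar)

sentences = [
    "she gives Bill coins",
    "we give Bill coins",
    "she give Bill coins",
    "we gives Bill coins",
]
# Count the parses per sentence; zero means the sentence is rejected.
parse_counts = {s: len(list(agreement_parser.parse(s.split()))) for s in sentences}
for sentence, count in parse_counts.items():
    print(sentence, "->", count)
```

The resulting grammar is the same kind of plain CFG as before; only the way the rule string is written has changed.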