# 2: nltk.RegexpParser

In this notebook, we will learn how to extract noun phrase chunks using the NLTK `RegexpParser` class.

`RegexpParser` matches phrases such as noun phrases (NPs) using regular expressions over POS tags, so let's start by making sure that we can POS tag sentences.

## 1. A simple regex parser

In [1]:
from nltk import pos_tag, word_tokenize

tagged_sent = pos_tag(word_tokenize("The dog saw my cat"))
print(tagged_sent)

[('The', 'DT'), ('dog', 'NN'), ('saw', 'VBD'), ('my', 'PRP$'), ('cat', 'NN')]


`word_tokenize()` splits the string into a list of tokens.

In [2]:
word_tokenize("The dog saw my cat")

['The', 'dog', 'saw', 'my', 'cat']

We will then need to import the `RegexpParser` class.

In [3]:
from nltk import RegexpParser

Let's start by defining a very simple chunker which recognizes **NPs consisting of a single determiner or possessive pronoun followed by a noun**.

The syntax for defining chunks is the following:

```
phrase_type: {regular expression on POS tags}
```

In the regular expressions, **you will need to surround POS tags using angle brackets** `<...>`, for example `<NN>`.

We can apply our chunker to a POS tagged sentence like `tagged_sent` using the `RegexpParser.parse()` method. It returns an NLTK `Tree` object representing the chunked sentence.

In [4]:
simple_chunker = RegexpParser(r"NP: {<DT|PRP\$><NN>}")  # PRP$ means possessive pronoun; \ escapes the $, and the raw string (r"...") keeps the backslash as-is
print(simple_chunker.parse(tagged_sent)) 

(S (NP The/DT dog/NN) saw/VBD (NP my/PRP$ cat/NN))
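As a minimal sketch of what `parse()` returns, the snippet below rebuilds the example with a hand-tagged sentence (an assumption made here so that no tagger model is needed) and walks the top level of the resulting tree. Chunks appear as nested `Tree` objects, while unchunked tokens stay as plain `(word, tag)` tuples. The variable names are illustrative only.

```python
from nltk import RegexpParser, Tree

# Hand-tagged copy of the example sentence (assumption: same tags as pos_tag gave above).
tagged = [("The", "DT"), ("dog", "NN"), ("saw", "VBD"), ("my", "PRP$"), ("cat", "NN")]

chunker = RegexpParser(r"NP: {<DT|PRP\$><NN>}")
tree = chunker.parse(tagged)

print("root label:", tree.label())
for child in tree:
    if isinstance(child, Tree):
        # A chunk: a subtree with its own label and leaves.
        print("chunk:", child.label(), child.leaves())
    else:
        # A token outside any chunk: a (word, tag) tuple.
        print("token:", child)
```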


Let's save our chunked sentence in a variable `chunked_sent`.

In [5]:
chunked_sent = simple_chunker.parse(tagged_sent)

We will now **extract chunks from `chunked_sent`**. The aim is to form a list of NP chunks `["The dog", "my cat"]`.

We can access all the nested phrases in our chunk tree using `Tree.subtrees()`.

In [6]:
for phrase in chunked_sent.subtrees():
    print(phrase)

(S (NP The/DT dog/NN) saw/VBD (NP my/PRP$ cat/NN))
(NP The/DT dog/NN)
(NP my/PRP$ cat/NN)


The `subtrees()` method yields all nested trees, including the sentence itself. We're only interested in the noun phrases, which have the label `NP`. Additionally, we want the words inside each chunk, so we will use the `Tree.leaves()` method.

In [7]:
for phrase in chunked_sent.subtrees():
    if phrase.label() == "NP":
        print(phrase.leaves())

[('The', 'DT'), ('dog', 'NN')]
[('my', 'PRP$'), ('cat', 'NN')]
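The same selection can also be written with the optional `filter` argument of `Tree.subtrees()`. The sketch below builds the chunk tree by hand (an assumption, to keep it independent of the tagger and chunker) and keeps only the subtrees labelled `NP`.

```python
from nltk import Tree

# Hand-built copy of the chunked example sentence.
chunked = Tree("S", [
    Tree("NP", [("The", "DT"), ("dog", "NN")]),
    ("saw", "VBD"),
    Tree("NP", [("my", "PRP$"), ("cat", "NN")]),
])

# filter receives each subtree and decides whether it is yielded.
np_leaves = [t.leaves() for t in chunked.subtrees(filter=lambda t: t.label() == "NP")]
print(np_leaves)
```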


This looks almost correct. The only problem is that the leaves here are pairs `(word, pos_tag)`, so let's extract the words and join them into a string.

In [8]:
for phrase in chunked_sent.subtrees():
    if phrase.label() == "NP":
        leaves = phrase.leaves()
        words = [word for word, tag in leaves]  # iterate over the leaves, not the subtree itself
        print(" ".join(words))

The dog
my cat


We can now create a function which returns the noun phrase chunks in a sentence:

In [9]:
def get_chunks(sent, chunker):
    # First tokenize and POS tag the sentence and extract NP chunks. 
    tagged_sent = pos_tag(word_tokenize(sent))
    chunked_sent = chunker.parse(tagged_sent)
    
    chunks = []
    for phrase in chunked_sent.subtrees():
        if phrase.label() == "NP":
            tagged_chunk = phrase.leaves()
            words = [word for word, tag in tagged_chunk]
            chunks.append(" ".join(words))
    return chunks

print(get_chunks("The boy saw a cat", simple_chunker))

['The boy', 'a cat']
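A variant that accepts an already tagged sentence can be handy when gold POS tags are available and tagging should be skipped. The helper name `chunks_from_tagged` and its `label` parameter are hypothetical, introduced only for this sketch.

```python
from nltk import RegexpParser

def chunks_from_tagged(tagged_sent, chunker, label="NP"):
    """Return all chunks with the given label as space-joined strings."""
    tree = chunker.parse(tagged_sent)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]

np_chunker = RegexpParser(r"NP: {<DT|PRP\$><NN>}")
# Hand-tagged input (assumption: the tags a tagger would plausibly assign).
tagged = [("The", "DT"), ("boy", "NN"), ("saw", "VBD"), ("a", "DT"), ("cat", "NN")]
result = chunks_from_tagged(tagged, np_chunker)
print(result)
```

Passing `label="AdjP"` together with an adjective phrase chunker would reuse the same helper for other phrase types.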


### Practice

Define a chunker for adjective phrases `AdjP` called `adj_chunker`. An adjective phrase should contain an adjective (`JJ`, `JJR` or `JJS`), which can be preceded by an optional adverb `RB`. You can indicate optionality using the question mark operator, i.e. `<TAG>?`, or use a disjunction `EXPR1|EXPR2`. You can also use parentheses as needed.

Test your chunker using the sentence "The grey cat saw a very large mouse". Your chunker should find two chunks: "grey" and "very large". You should tag the sentence using NLTK `pos_tag` and then apply the chunker. 

In [10]:
# your code here
tagged_sent = pos_tag(word_tokenize("The grey cat saw a very large mouse"))
adj_chunker = RegexpParser("AdjP: {<RB>?<JJ(R|S)?>}")  # separate name, so the NP chunker is not overwritten
print(adj_chunker.parse(tagged_sent))

(S
  The/DT
  (AdjP grey/JJ)
  cat/NN
  saw/VBD
  a/DT
  (AdjP very/RB large/JJ)
  mouse/NN)
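The exercise mentions two ways of covering the three adjective tags: an optional suffix group or an explicit disjunction. As a quick check that both give the same chunks here, the sketch below compares the two grammars on a hand-tagged copy of the test sentence (assumption: tags as in the output above). The names `compact` and `spelled_out` are illustrative.

```python
from nltk import RegexpParser

tagged = [
    ("The", "DT"), ("grey", "JJ"), ("cat", "NN"), ("saw", "VBD"),
    ("a", "DT"), ("very", "RB"), ("large", "JJ"), ("mouse", "NN"),
]

compact = RegexpParser("AdjP: {<RB>?<JJ(R|S)?>}")       # JJ with an optional R or S suffix
spelled_out = RegexpParser("AdjP: {<RB>?<JJ|JJR|JJS>}")  # explicit disjunction of the three tags

tree_a = compact.parse(tagged)
tree_b = spelled_out.parse(tagged)
print(tree_a == tree_b)

adj_phrases = [
    " ".join(word for word, tag in t.leaves())
    for t in tree_b.subtrees(filter=lambda t: t.label() == "AdjP")
]
print(adj_phrases)
```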


## 2. Applying RegexpParser to a corpus

NLTK offers a **corpus which has gold standard chunks prepared by linguists**. It is the `conll2000` chunking shared task dataset.

Let's start by downloading the corpus.

In [11]:
import nltk 
nltk.download('conll2000')

[nltk_data] Downloading package conll2000 to /Users/lxy/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.


True

The `conll2000` corpus **contains verb phrase (VP) and prepositional phrase (PP) chunks in addition to noun phrase chunks**. You can extract all chunks using the command `conll2000.chunked_sents()` but we will only need NP chunks now so we'll use a **`chunk_types=["NP"]` parameter** to indicate this.

In [12]:
from nltk.corpus import conll2000

gold_chunked_sents = conll2000.chunked_sents(chunk_types=["NP"])
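To get a feel for what `chunk_types` does without loading the corpus, here is a small sketch using `nltk.chunk.conllstr2tree`, which accepts a similar `chunk_types` argument. The five-line sample is typed in by hand in the word / POS tag / IOB tag column layout and is meant as an illustration, not as an exact excerpt of the corpus files.

```python
from nltk.chunk import conllstr2tree

# One token per line: word, POS tag, IOB chunk tag (hand-written sample).
sample = "\n".join([
    "Confidence NN B-NP",
    "in IN B-PP",
    "the DT B-NP",
    "pound NN I-NP",
    "is VBZ B-VP",
])

all_chunks = conllstr2tree(sample)                     # default keeps NP, PP and VP chunks
np_only = conllstr2tree(sample, chunk_types=("NP",))   # other chunk types become plain tokens

print(all_chunks)
print(np_only)
```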

Let's print **the first chunked sentence.**

In [13]:
print(gold_chunked_sents[0])

(S
  (NP Confidence/NN)
  in/IN
  (NP the/DT pound/NN)
  is/VBZ
  widely/RB
  expected/VBN
  to/TO
  take/VB
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN figures/NNS)
  for/IN
  (NP September/NNP)
  ,/,
  due/JJ
  for/IN
  (NP release/NN)
  (NP tomorrow/NN)
  ,/,
  fail/VB
  to/TO
  show/VB
  (NP a/DT substantial/JJ improvement/NN)
  from/IN
  (NP July/NNP and/CC August/NNP)
  (NP 's/POS near-record/JJ deficits/NNS)
  ./.)


We can now **apply our `simple_chunker` on the `conll2000` corpus**. Since the `conll2000` corpus contains gold standard POS tags, we can use those when extracting chunks.

Let's first get all the **POS tagged sentences from `conll2000`** and save them in a variable `tagged_sents`.

In [14]:
tagged_sents = conll2000.tagged_sents()
print(tagged_sents[0])

[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ('pound', 'NN'), ('is', 'VBZ'), ('widely', 'RB'), ('expected', 'VBN'), ('to', 'TO'), ('take', 'VB'), ('another', 'DT'), ('sharp', 'JJ'), ('dive', 'NN'), ('if', 'IN'), ('trade', 'NN'), ('figures', 'NNS'), ('for', 'IN'), ('September', 'NNP'), (',', ','), ('due', 'JJ'), ('for', 'IN'), ('release', 'NN'), ('tomorrow', 'NN'), (',', ','), ('fail', 'VB'), ('to', 'TO'), ('show', 'VB'), ('a', 'DT'), ('substantial', 'JJ'), ('improvement', 'NN'), ('from', 'IN'), ('July', 'NNP'), ('and', 'CC'), ('August', 'NNP'), ("'s", 'POS'), ('near-record', 'JJ'), ('deficits', 'NNS'), ('.', '.')]


And let's **apply our `simple_chunker` to the first tagged sentence**. 

The only NP we actually find is `(NP the/DT pound/NN)`. Comparing with the gold standard chunked sentence, we can immediately see two issues:

1. The determiner is often optional as in `(NP Confidence/NN)`.
1. NPs often contain adjectives like `(NP another/DT sharp/JJ dive/NN)`

In [15]:
# Rebuild the NP chunker from In [4] so that this cell is sure to use the NP grammar
simple_chunker = RegexpParser(r"NP: {<DT|PRP\$><NN>}")
print(simple_chunker.parse(tagged_sents[0]))

(S
  Confidence/NN
  in/IN
  (NP the/DT pound/NN)
  is/VBZ
  widely/RB
  expected/VBN
  to/TO
  take/VB
  another/DT
  sharp/JJ
  dive/NN
  if/IN
  trade/NN
  figures/NNS
  for/IN
  September/NNP
  ,/,
  due/JJ
  for/IN
  release/NN
  tomorrow/NN
  ,/,
  fail/VB
  to/TO
  show/VB
  a/DT
  substantial/JJ
  improvement/NN
  from/IN
  July/NNP
  and/CC
  August/NNP
  's/POS
  near-record/JJ
  deficits/NNS
  ./.)


Let's improve our `simple_chunker` by **making the determiner (`DT`) tag optional and allowing for an optional adjective (`JJ`) tag** between the determiner and noun. 

This works substantially better but there are still several cases which are not correctly handled. One is that **our chunker only allows for simple `NN` tags** which means that nouns having `NNS`, `NNP` and `NNPS` tags are missed. 

In [16]:
simple_chunker = RegexpParser("NP: {<DT>?<JJ>?<NN>}")

print(simple_chunker.parse(tagged_sents[0]))

(S
  (NP Confidence/NN)
  in/IN
  (NP the/DT pound/NN)
  is/VBZ
  widely/RB
  expected/VBN
  to/TO
  take/VB
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN)
  figures/NNS
  for/IN
  September/NNP
  ,/,
  due/JJ
  for/IN
  (NP release/NN)
  (NP tomorrow/NN)
  ,/,
  fail/VB
  to/TO
  show/VB
  (NP a/DT substantial/JJ improvement/NN)
  from/IN
  July/NNP
  and/CC
  August/NNP
  's/POS
  near-record/JJ
  deficits/NNS
  ./.)


We can now further extend our NP chunker by allowing for any noun tag. The easiest way to do this is to notice that noun tags in the Penn Treebank POS tagset are exactly the tags which start with `NN`. We can model noun tags using the regular expression `NN.*`, which allows any sequence of characters (including none) after `NN`. 

We will also allow several nouns in the NP to account for cases like "budget deficit".

In [17]:
simple_chunker = RegexpParser("NP: {<DT>?<JJ>?<NN.*>+}")

print(simple_chunker.parse(tagged_sents[0]))

(S
  (NP Confidence/NN)
  in/IN
  (NP the/DT pound/NN)
  is/VBZ
  widely/RB
  expected/VBN
  to/TO
  take/VB
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN figures/NNS)
  for/IN
  (NP September/NNP)
  ,/,
  due/JJ
  for/IN
  (NP release/NN tomorrow/NN)
  ,/,
  fail/VB
  to/TO
  show/VB
  (NP a/DT substantial/JJ improvement/NN)
  from/IN
  (NP July/NNP)
  and/CC
  (NP August/NNP)
  's/POS
  (NP near-record/JJ deficits/NNS)
  ./.)
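One side effect of `<NN.*>+` is visible in the output above: adjacent nouns are always merged, so `release tomorrow` comes out as a single chunk although the gold standard annotates two separate NPs. A small sketch on a hand-copied fragment of the sentence (tags taken from the output above) makes this easy to reproduce. The variable names are illustrative.

```python
from nltk import RegexpParser

chunker = RegexpParser("NP: {<DT>?<JJ>?<NN.*>+}")

# Fragment of the first conll2000 sentence, copied by hand with its gold POS tags.
fragment = [
    ("trade", "NN"), ("figures", "NNS"), ("for", "IN"), ("September", "NNP"),
    (",", ","), ("due", "JJ"), ("for", "IN"), ("release", "NN"), ("tomorrow", "NN"),
]

tree = chunker.parse(fragment)
noun_phrases = [
    " ".join(word for word, tag in t.leaves())
    for t in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Whether merging is desirable depends on the data: it helps for compounds like "trade figures" but hurts when two independent nouns happen to be adjacent.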


Finally, we'd like to compare the chunks identified by our chunker to the gold standard noun phrase chunks in the `conll2000` corpus. `RegexpParser` has a built-in `evaluate()` method which computes IOB tag accuracy, precision, recall and F1-score for us. It takes the gold standard chunked `conll2000` sentences as input.

We can see that while we do identify quite a few correct chunks, we could definitely do better given the 64.1% F1-score. You can continue to develop better regex chunkers :).

In [18]:
print(simple_chunker.evaluate(gold_chunked_sents))

ChunkParse score:
    IOB Accuracy:  79.3%%
    Precision:     69.2%%
    Recall:        59.6%%
    F-Measure:     64.1%%
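As a rough sketch of what these four numbers measure, the snippet below scores the chunker by hand on a short fragment. The gold IOB tags are copied by hand from the first gold sentence shown earlier, `chunk_spans` is a hypothetical helper, and NLTK's own scoring may differ in details. Precision and recall compare whole chunks (a chunk counts only if its boundaries match exactly), while IOB accuracy compares per-token tags.

```python
from nltk import RegexpParser, Tree
from nltk.chunk import conlltags2tree, tree2conlltags

# (word, POS tag, gold IOB tag), hand-copied fragment of the first gold sentence.
gold_iob = [
    ("trade", "NN", "B-NP"), ("figures", "NNS", "I-NP"), ("for", "IN", "O"),
    ("September", "NNP", "B-NP"), (",", ",", "O"), ("due", "JJ", "O"),
    ("for", "IN", "O"), ("release", "NN", "B-NP"), ("tomorrow", "NN", "B-NP"),
]
gold = conlltags2tree(gold_iob)

chunker = RegexpParser("NP: {<DT>?<JJ>?<NN.*>+}")
guess = chunker.parse([(word, tag) for word, tag, _ in gold_iob])

def chunk_spans(tree, label="NP"):
    """Collect (start, end) token offsets of top-level chunks with the given label."""
    spans, pos = set(), 0
    for child in tree:
        if isinstance(child, Tree):
            size = len(child.leaves())
            if child.label() == label:
                spans.add((pos, pos + size))
            pos += size
        else:
            pos += 1
    return spans

gold_spans, guess_spans = chunk_spans(gold), chunk_spans(guess)
true_pos = len(gold_spans & guess_spans)      # chunks with exactly matching boundaries
precision = true_pos / len(guess_spans)
recall = true_pos / len(gold_spans)
f1 = 2 * precision * recall / (precision + recall)

# Per-token comparison of (word, tag, IOB tag) triples.
pairs = zip(tree2conlltags(gold), tree2conlltags(guess))
iob_accuracy = sum(g == p for g, p in pairs) / len(gold_iob)

print(f"IOB accuracy: {iob_accuracy:.3f}")
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```

In this fragment the merged `release tomorrow` chunk costs one guessed chunk and two gold chunks, but only a single IOB tag, which illustrates why IOB accuracy tends to be higher than the F-measure.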
