## Chunking  
The first step in entity recognition is to identify chunks, e.g. noun phrases.

In [1]:
import nltk

In [2]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

In [3]:
# define a pattern for noun phrases
grammar = "NP: {<DT>?<JJ>*<NN>}"

In [4]:
# create a chunk parser and test it on the sentence
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [None]:
result.draw()  # opens a tkinter window; tkinter is a desktop GUI toolkit, so this does not work in browser-based JupyterLab

In [6]:
sentence = [("any", "DT"), ("new", "JJ"), ("policy", "NN"), ("measures", "NNS")]

In [7]:
sentence = [("earlier", "JJR"), ("stages", "NNS")]

In [8]:
# generalize the grammar for more complex examples:
# <JJ.*> also matches JJR/JJS, and <NN.*>+ matches one or more nouns of any kind
grammar = "NP: {<DT>?<JJ.*>*<NN.*>+}"
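
The generalized grammar is not run on the two example sentences above. A minimal check, assuming NLTK is installed, could look like the following; the names `examples` and `trees` are introduced here only for illustration.

```python
import nltk

# Generalized NP pattern from the cell above
grammar = "NP: {<DT>?<JJ.*>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)

# The two tagged phrases defined in the preceding cells
examples = [
    [("any", "DT"), ("new", "JJ"), ("policy", "NN"), ("measures", "NNS")],
    [("earlier", "JJR"), ("stages", "NNS")],
]

# Parse each phrase and print the resulting chunk tree
trees = [cp.parse(tagged) for tagged in examples]
for tree in trees:
    print(tree)
```

Each phrase should come back as a single NP chunk, since the noun sequence `policy measures` is covered by `<NN.*>+` and the comparative `earlier` by `<JJ.*>`.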

### Chunking with regular expressions

In [9]:
# define a chunk grammar with multiple rules, applied in order
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""

In [10]:
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

In [11]:
print(cp.parse(sentence))

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))


In [12]:
# overlapping matches are resolved by taking the leftmost match
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))

(S (NP money/NN market/NN) fund/NN)
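
Because chunks cannot overlap, once `money market` has been chunked the remaining `fund` has no second noun to pair with. As a sketch of one alternative, a pattern that accepts one or more nouns groups all three tokens into a single chunk:

```python
import nltk

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]

# <NN>+ matches a run of nouns of any length, so no noun is left stranded
cp = nltk.RegexpParser("NP: {<NN>+}  # Chunk one or more consecutive nouns")
tree = cp.parse(nouns)
print(tree)
```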


### Chinking
A chink is a sequence of tokens that we want to exclude from a chunk; a chink rule removes the matching tokens from chunks that already exist.

In [13]:
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""

In [14]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)

In [15]:
print(cp.parse(sentence))

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


### IOB tags  
Tokens are usually tagged according to their position relative to a chunk, e.g.  
B - beginning of a chunk  
I - inside a chunk  
O - outside any chunk

We PRP B-NP  
saw VBD O  
the DT B-NP  
yellow JJ I-NP  
dog NN I-NP 

Because each B and I tag carries a chunk-type suffix (e.g. B-NP, I-NP), this format can represent more than one chunk type, as long as the chunks do not overlap.
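
The IOB listing above can be reproduced from a chunk tree. The sketch below assumes NLTK's `tree2conlltags` helper from `nltk.chunk`; the two-rule grammar is an assumption made here so that the pronoun `We` is chunked as well.

```python
import nltk
from nltk.chunk import tree2conlltags

sentence = [("We", "PRP"), ("saw", "VBD"), ("the", "DT"), ("yellow", "JJ"), ("dog", "NN")]

# Illustrative grammar (not from the cells above)
grammar = r"""
  NP: {<PRP>}              # a pronoun forms a noun phrase on its own
      {<DT>?<JJ>*<NN>}     # optional determiner, adjectives and noun
"""
tree = nltk.RegexpParser(grammar).parse(sentence)

# Flatten the tree into (word, POS tag, IOB chunk tag) triples
iob = tree2conlltags(tree)
for word, pos, chunk_tag in iob:
    print(word, pos, chunk_tag)
```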

## Exercises

The IOB format categorizes tagged tokens as I, O and B. Why are three tags necessary? What problem would be caused if we used I and O tags exclusively?
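
One way to explore this question is to decode the same sentence from both tag schemes. The sketch below uses plain Python; the helper `decode_chunks` and the example sentence are hypothetical and only serve to show what happens when two chunks are adjacent.

```python
def decode_chunks(tags):
    """Return (start, end) token spans for the chunks encoded by a tag list.

    A chunk starts at a "B" tag, or at an "I" tag with no chunk open.
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        # an "O" or a "B" closes any chunk that is currently open
        if start is not None and tag in ("O", "B"):
            spans.append((start, i))
            start = None
        # a "B", or an "I" with no open chunk, opens a new chunk
        if tag == "B" or (tag == "I" and start is None):
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


words = ["gave", "the", "dog", "a", "bone"]
iob_tags = ["O", "B", "I", "B", "I"]  # two adjacent NPs: [the dog] [a bone]
io_tags = ["O", "I", "I", "I", "I"]   # the same sentence without B tags

iob_spans = decode_chunks(iob_tags)
io_spans = decode_chunks(io_tags)
print([words[s:e] for s, e in iob_spans])
print([words[s:e] for s, e in io_spans])
```

With B tags the two noun phrases are recovered separately; with only I and O the boundary between them is lost and they merge into one span.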

1. Write a tag pattern to match noun phrases containing plural head nouns, e.g. "many/JJ researchers/NNS", "two/CD weeks/NNS", "both/DT new/JJ positions/NNS". Try to do this by generalizing the tag pattern that handled singular noun phrases

In [16]:
sentence = [("many", "JJ"), ("researchers", "NNS"), ("two", "CD"), ("weeks", "NNS"), ("both", "DT"), ("new", "JJ"), ("positions", "NNS")]

In [17]:
# singular noun phrases: grammar = "NP: {<DT>?<JJ>*<NN>}"
grammar = r"""
  NP: {<CD|DT>?<JJ>*<NN.*>}   # chunk optional numeral/determiner, adjectives and a singular or plural noun
      {<NNP>+}           # chunk sequences of proper nouns
"""

In [18]:
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

(S
  (NP many/JJ researchers/NNS)
  (NP two/CD weeks/NNS)
  (NP both/DT new/JJ positions/NNS))


2. Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising

In [19]:
grammar = r"""
  NP: {<DT><VBG><NN.*>}
      {<NN><VBG><NN>}
      {<CD|DT>?<JJ>*<NN.*>}  
"""

In [20]:
sentence = [("the", "DT"), ("receiving", "VBG"), ("end", "NN")]
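
The first gerund example can be checked on its own. A minimal sketch, repeating the grammar so that the snippet is self-contained (the name `tree` is introduced here for illustration):

```python
import nltk

grammar = r"""
  NP: {<DT><VBG><NN.*>}
      {<NN><VBG><NN>}
      {<CD|DT>?<JJ>*<NN.*>}
"""
sentence = [("the", "DT"), ("receiving", "VBG"), ("end", "NN")]

# The first rule should chunk determiner + gerund + noun as one NP
tree = nltk.RegexpParser(grammar).parse(sentence)
print(tree)
```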

In [22]:
sentence = [("assistant", "NN"), ("managing", "VBG"), ("editor", "NN")]

In [23]:
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

(S (NP assistant/NN managing/VBG editor/NN))


3. Write one or more tag patterns to handle coordinated noun phrases, e.g. "July/NNP and/CC August/NNP", "all/DT your/PRP$ managers/NNS and/CC supervisors/NNS", "company/NN courts/NNS and/CC adjudicators/NNS"

In [24]:
grammar = r"""
  NP: {<DT>?<PRP\$>?<NN.*>+<CC><NN.*>}
      {<DT><VBG><NN.*>}
      {<NN><VBG><NN>}
      {<CD|DT>?<JJ>*<NN.*>}  
"""

In [25]:
sentence = [("July", "NNP"), ("and", "CC"), ("August", "NNP")]

In [26]:
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

(S (NP July/NNP and/CC August/NNP))


In [27]:
sentence = [("all", "DT"), ("your", "PRP$"), ("managers", "NNS"), ("and", "CC"), ("supervisors", "NNS")]

In [28]:
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

(S (NP all/DT your/PRP$ managers/NNS and/CC supervisors/NNS))


In [29]:
sentence = [("company", "NN"), ("courts", "NNS"), ("and", "CC"), ("adjudicators", "NNS")]

In [30]:
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

(S (NP company/NN courts/NNS and/CC adjudicators/NNS))
