## Chunking  
The first step in entity recognition is to identify chunks, such as noun phrases.

In [1]:
import nltk

In [2]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

In [3]:
# define a pattern for noun phrases
grammar = "NP: {<DT>?<JJ>*<NN>}"

In [4]:
# create a chunk parser and test it on the sentence
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [None]:
result.draw()  # opens a Tkinter window; won't work in browser-based JupyterLab, since Tkinter is a desktop GUI toolkit

In [6]:
# the first pattern chunks only "any new policy" here and leaves the plural noun (NNS) outside
sentence = [("any", "DT"), ("new", "JJ"), ("policy", "NN"), ("measures", "NNS")]

In [7]:
# the first pattern finds no chunk here, because <JJ> and <NN> do not match JJR and NNS
sentence = [("earlier", "JJR"), ("stages", "NNS")]

In [8]:
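# generalise the pattern: <JJ.*> matches any adjective tag (JJ, JJR, JJS),
# <NN.*> matches any noun tag (NN, NNS, NNP, NNPS), and + allows several nouns in a row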
grammar = "NP: {<DT>?<JJ.*>*<NN.*>+}"
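
As a quick check, the broader pattern can be applied to both example phrases. This is a minimal sketch; the names `examples` and `parsed` are only illustrative and do not come from the original notebook.

```python
import nltk

# the generalised noun-phrase pattern from the cell above
grammar = "NP: {<DT>?<JJ.*>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)

# the two phrases that the first, narrower pattern could not fully chunk
examples = [
    [("any", "DT"), ("new", "JJ"), ("policy", "NN"), ("measures", "NNS")],
    [("earlier", "JJR"), ("stages", "NNS")],
]

# parse each phrase and print the resulting chunk tree
parsed = [cp.parse(tagged) for tagged in examples]
for tree in parsed:
    print(tree)
```

Each phrase should now come out as a single NP chunk that covers all of its tokens.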

### Chunking with regular expressions

In [9]:
# define a chunk using multiple regexes
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""

In [10]:
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

In [11]:
print(cp.parse(sentence))

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))


In [12]:
# overlapping matches are resolved by taking the leftmost match
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))

(S (NP money/NN market/NN) fund/NN)
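
If the whole compound should end up in one chunk, one option is to let the rule match one or more nouns instead of exactly two. A minimal sketch, with the `nouns` list redefined so the snippet runs on its own; the name `compound` is only illustrative.

```python
import nltk

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]

# <NN>+ matches a run of one or more nouns, so the whole compound is chunked together
grammar = "NP: {<NN>+}  # chunk one or more consecutive nouns"
cp = nltk.RegexpParser(grammar)

compound = cp.parse(nouns)
print(compound)
```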


### Chinking
A chink is a sequence of tokens that is excluded from a chunk. A chinking rule defines a pattern to remove from an existing chunk.

In [13]:
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""

In [14]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)

In [15]:
print(cp.parse(sentence))

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


### IOB tags  
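Chunk structure is usually encoded by tagging each token according to its position relative to a chunk. B and I tags carry the chunk type as a suffix, e.g. B-NP.  
B - beginning of a chunk  
I - inside a chunk  
O - outside any chunk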

We PRP B-NP  
saw VBD O  
the DT B-NP  
yellow JJ I-NP  
dog NN I-NP 

This format allows us to represent more than one chunk type, as long as the chunks do not overlap.
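
To see how a chunk tree maps onto IOB tags, NLTK's `tree2conlltags` and `conlltags2tree` helpers can convert in both directions. The sketch below is only illustrative: the two-rule grammar (which also chunks a lone pronoun as an NP) is an assumption chosen to reproduce the example above, and the names `iob` and `rebuilt` are not from the original notebook.

```python
import nltk
from nltk.chunk import conlltags2tree, tree2conlltags

# assumed grammar: a pronoun on its own, or an ordinary noun phrase
grammar = r"""
  NP: {<PRP>}              # a pronoun on its own
      {<DT>?<JJ>*<NN>}     # optional determiner, adjectives, noun
"""
cp = nltk.RegexpParser(grammar)

sentence = [("We", "PRP"), ("saw", "VBD"), ("the", "DT"),
            ("yellow", "JJ"), ("dog", "NN")]
tree = cp.parse(sentence)

# flatten the tree into (word, POS tag, IOB tag) triples
iob = tree2conlltags(tree)
for word, pos, chunk_tag in iob:
    print(word, pos, chunk_tag)

# rebuild a chunk tree from the triples
rebuilt = conlltags2tree(iob)
print(rebuilt)
```

Storing chunks as one tag per token keeps the data flat, which is convenient for file formats and for training token-level taggers.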