## Chunking 



The basic technique we will use for entity recognition is chunking, which segments
and labels multi-token sequences. In a diagram of a chunked sentence, the smaller boxes show
the word-level tokenization and part-of-speech tagging, while the larger boxes show
higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization,
which omits whitespace, chunking usually selects a subset of the tokens. Also like
tokenization, the pieces produced by a chunker do not overlap in the source text.
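To make the idea of non-overlapping chunks concrete, here is a minimal sketch that uses only Python's `re` module. It is an illustration, not NLTK's implementation: the helper name `chunk_np` and the `("NP", [tokens])` output shape are assumptions made for this example. It encodes the tag sequence as a string and groups every optional determiner, any adjectives, and a noun into one chunk.

```python
import re

def chunk_np(tagged):
    """Group each DT? JJ* NN run into an ("NP", [tokens]) pair."""
    # Encode the tag sequence as a single string such as "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # Record the character offset at which each token's tag starts.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 2  # the tag plus its angle brackets
    offset_to_index[offset] = len(tagged)

    result, consumed = [], 0
    # finditer yields non-overlapping matches from left to right.
    for match in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        start = offset_to_index[match.start()]
        end = offset_to_index[match.end()]
        result.extend(tagged[consumed:start])      # tokens outside any chunk
        result.append(("NP", tagged[start:end]))   # the chunk itself
        consumed = end
    result.extend(tagged[consumed:])
    return result

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
for item in chunk_np(sentence):
    print(item)
```

Because the matches never overlap, every token ends up either inside exactly one chunk or outside all of them, which is the property described above.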


In [47]:
import nltk
from nltk.tree import Tree
from IPython.display import display

In [2]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]


In [3]:
grammar = "NP: {<DT>?<JJ>*<NN>}"


In [4]:
cp = nltk.RegexpParser(grammar)

In [5]:
result = cp.parse(sentence)
print (result)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [6]:
result.draw()

## Chunking with regex

In [11]:
grammar = r"""
    NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
        {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

In [12]:
print (cp.parse(sentence))

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))


In [13]:
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]

In [14]:
grammar = "NP: {<NN><NN>} # Chunk two consecutive nouns"

In [15]:
cp = nltk.RegexpParser(grammar)


In [16]:
print (cp.parse(nouns))

(S (NP money/NN market/NN) fund/NN)


## Chinking
Sometimes it is easier to define what we want to exclude from a chunk. We can define
a chink to be a sequence of tokens that is not included in a chunk. In the following
example, barked/VBD at/IN is a chink:

[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]

Chinking is the process of removing a sequence of tokens from a chunk. If the matching
sequence of tokens spans an entire chunk, then the whole chunk is removed; if the
sequence of tokens appears in the middle of the chunk, these tokens are removed,
leaving two chunks where there was only one before. If the sequence is at the periphery
of the chunk, these tokens are removed, and a smaller chunk remains.
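The three cases just described can be sketched in plain Python. This is an illustration under stated assumptions, not NLTK's implementation: the helper `chink` and its predicate argument are hypothetical names, and the function returns only the chunks that survive, dropping the chinked tokens from its result.

```python
def chink(chunk, is_chink):
    """Remove tokens whose tag satisfies is_chink; return the surviving chunks."""
    pieces, current = [], []
    for word, tag in chunk:
        if is_chink(tag):
            # A chinked token closes the piece built so far, if any.
            if current:
                pieces.append(current)
            current = []
        else:
            current.append((word, tag))
    if current:
        pieces.append(current)
    return pieces

chunk = [("a", "DT"), ("little", "JJ"), ("dog", "NN")]

# Chink spans the entire chunk: nothing remains.
print(chink(chunk, lambda tag: tag in {"DT", "JJ", "NN"}))
# Chink in the middle: one chunk becomes two.
print(chink(chunk, lambda tag: tag == "JJ"))
# Chink at the periphery: a smaller chunk remains.
print(chink(chunk, lambda tag: tag == "NN"))
```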

In [2]:
grammar = r"""
NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""

In [3]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

In [4]:
cp = nltk.RegexpParser(grammar)

In [6]:
print (cp.parse(sentence))

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


## Developing and Evaluating Chunkers

In [7]:
from nltk.corpus import conll2000
print (conll2000.chunked_sents('train.txt')[99])

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)


We start off by establishing a baseline for the trivial chunk parser cp that creates no chunks:

In [8]:
from nltk.corpus import conll2000
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print (cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  43.4%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%
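The IOB accuracy in the score above is non-zero even though no chunks were found. A plausible reading is that it counts per-token tags, so every token the gold standard places outside an NP chunk is scored as correct. The sketch below illustrates this on a toy sentence, assuming the usual IOB encoding (B-NP begins a chunk, I-NP continues it, O is outside). The helpers `to_iob` and `iob_accuracy` are hypothetical names for this example, not NLTK functions.

```python
def to_iob(n_tokens, np_spans):
    """Return one IOB tag per token, given (start, end) NP spans with end exclusive."""
    tags = ["O"] * n_tokens
    for start, end in np_spans:
        tags[start] = "B-NP"
        for i in range(start + 1, end):
            tags[i] = "I-NP"
    return tags

def iob_accuracy(gold_tags, guessed_tags):
    """Fraction of tokens whose guessed IOB tag equals the gold tag."""
    matches = sum(g == p for g, p in zip(gold_tags, guessed_tags))
    return matches / len(gold_tags)

# [the little yellow dog] barked at [the cat]
gold = to_iob(8, [(0, 4), (6, 8)])
empty = to_iob(8, [])  # a parser that creates no chunks
print(gold)
print(iob_accuracy(gold, empty))  # 0.25: only "barked" and "at" are O in the gold tags
```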


In [9]:
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print (cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%
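Precision, recall, and F-measure are computed over whole chunks rather than individual tokens. The following sketch shows the standard definitions, assuming each chunk is represented as a (start, end) token span. The function name `chunk_scores` is hypothetical, and the exact bookkeeping inside NLTK may differ.

```python
def chunk_scores(gold_spans, guessed_spans):
    """Return (precision, recall, f_measure) for two collections of chunk spans."""
    gold, guessed = set(gold_spans), set(guessed_spans)
    correct = len(gold & guessed)  # chunks whose boundaries match exactly
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

gold = {(0, 4), (6, 8)}      # [the little yellow dog] ... [the cat]
guessed = {(0, 4), (7, 8)}   # second chunk misses the determiner
print(chunk_scores(gold, guessed))  # (0.5, 0.5, 0.5)
print(chunk_scores(gold, set()))    # (0.0, 0.0, 0.0), like the empty baseline
```

A chunk counts as correct only when both boundaries match, so a guess that is off by one token earns no partial credit.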


## Recursion in Linguistic Structure
Building Nested Structure with Cascaded Chunkers

So far, our chunk structures have been relatively flat. Trees consist of tagged tokens, optionally grouped under a chunk node such as NP. However, it is possible to build chunk structures of arbitrary depth, simply by creating a multi-stage chunk grammar containing recursive rules:

In [24]:
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """

In [25]:
cp = nltk.RegexpParser(grammar)

In [26]:
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

In [27]:
print (cp.parse(sentence))

(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


## Trees
A tree is a set of connected labeled nodes, each reachable by a unique path from a distinguished root node. Here's an example of a tree (note that they are standardly drawn upside-down):

In [28]:
tree1 = nltk.Tree('NP', ['Alice'])

In [29]:
print (tree1)

(NP Alice)


In [30]:
tree2 = nltk.Tree('NP', ['the', 'rabbit'])

In [31]:
print (tree2)

(NP the rabbit)


In [32]:
tree3 = nltk.Tree('VP', ['chased', tree2])

In [33]:
tree4 = nltk.Tree('S', [tree1, tree3])

In [34]:
print (tree4)

(S (NP Alice) (VP chased (NP the rabbit)))


In [36]:
print (tree4[1])

(VP chased (NP the rabbit))


In [38]:
tree4[1].label()

'VP'

In [39]:
tree4.leaves()

['Alice', 'chased', 'the', 'rabbit']

In [40]:
tree4[1][1][1]

'rabbit'

In [51]:
tree3.draw()

## Tree Traversal
It is standard to use a recursive function to traverse a tree. 

In [78]:
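def traverse(t):
    try:
        t.label()
    except AttributeError:
        # t has no label(), so it is a leaf (a plain string)
        print(t)
    else:
        # Now we know that t.label() is defined, so t is a subtree
        print(t.label())
        for child in t:
            traverse(child)
        print()  # blank line marks the end of this subtree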

In [79]:
t = nltk.Tree.fromstring('(S (NP Alice) (VP chased (NP the rabbit)))')

In [80]:
traverse(t)

S
NP
Alice

VP
chased
NP
the
rabbit
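The same recursive pattern can recover the unique path from the root to each leaf. The sketch below uses plain nested tuples of the form (label, children) rather than nltk.Tree, so the representation and the name `leaf_paths` are assumptions made for this example. Each path is a tuple of child indices, matching the indexing used in tree4[1][1][1].

```python
def leaf_paths(tree, path=()):
    """Yield (path, leaf) pairs, where path is a tuple of child indices."""
    label, children = tree
    for i, child in enumerate(children):
        if isinstance(child, tuple):
            # A nested (label, children) pair: recurse into the subtree.
            yield from leaf_paths(child, path + (i,))
        else:
            # A plain string: this is a leaf.
            yield path + (i,), child

tree = ("S", [("NP", ["Alice"]),
              ("VP", ["chased", ("NP", ["the", "rabbit"])])])
paths = dict(leaf_paths(tree))
for path, leaf in paths.items():
    print(path, leaf)
```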





## Named Entity Recognition
At the start of this chapter, we briefly introduced named entities (NEs). Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. Most of the commonly used types of NEs are self-explanatory, except for "Facility" (human-made artifacts in the domains of architecture and civil engineering) and "GPE" (geo-political entities such as city, state/province, and country).

In [81]:
sent = nltk.corpus.treebank.tagged_sents()[22]

In [82]:
print (nltk.ne_chunk(sent, binary=True))

(S
  The/DT
  (NE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (NE Brooke/NNP)
  T./NNP
  Mossman/NNP
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (NE University/NNP)
  of/IN
  (NE Vermont/NNP College/NNP)
  of/IN
  (NE Medicine/NNP)
  ./.)


In [83]:
print (nltk.ne_chunk(sent)) 

(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (PERSON Vermont/NNP College/NNP)
  of/IN
  (GPE Medicine/NNP)
  ./.)


## Relation extraction 
Once named entities have been identified in a text, we then want to extract the relations that exist between them. As indicated earlier, we will typically be looking for relations between specified types of named entity. One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y. We can then use regular expressions to pull out just those instances of α that express the relation that we are looking for. The following example searches for strings that contain the word "in". The special regular expression (?!\b.+ing) is a negative lookahead assertion that allows us to disregard strings such as "success in supervising the transition of", where "in" is followed by a gerund.

In [96]:
import re

# Match fillers containing the word "in", unless text containing "ing" follows it
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
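The effect of the negative lookahead can be checked in isolation with Python's `re` module, without the IEER corpus. In this sketch the helper `keeps` and the first three test strings are assumptions made for illustration; only the pattern and the "success in supervising the transition of" example come from the text above.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

def keeps(filler):
    """Return True if the pattern accepts this intervening string."""
    return IN.match(filler) is not None

print(keeps("firm in"))                                  # True
print(keeps(", based in"))                               # True
print(keeps("success in supervising the transition of")) # False
print(keeps("said in a meeting"))                        # False
```

The last case suggests the filter is fairly blunt: any "ing" after the word "in" causes rejection, whether or not it belongs to a gerund.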
