#### Day 69

# Chinking

Chinking is used together with chunking, but while chunking is used to include a pattern, **chinking** is used to exclude a pattern.

Let’s reuse the quote you used in the section on chunking. You already have a list of tuples containing each of the words in the quote along with its part of speech tag:

In [1]:
from nltk.tokenize import word_tokenize

In [2]:
lotr_quote = "It's a dangerous business, Frodo, going out your door."

In [3]:
words_in_lotr_quote = word_tokenize(lotr_quote)
print(words_in_lotr_quote)

['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']


In [4]:
import nltk
lotr_pos_tags = nltk.pos_tag(words_in_lotr_quote)
print(lotr_pos_tags)

[('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'), ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.')]


# Define grammar

The next step is to create a grammar to determine what you want to include and exclude in your chunks. This time, you’re going to use more than one line because you’re going to have more than one rule. Because you’re using more than one line for the grammar, you’ll be using triple quotes ("""):

In [5]:
grammar = """Chunk: {<.*>+}
       }<JJ>{"""

The first rule of your grammar is {<.*>+}. This rule has curly braces that face inward ({}) because it’s used to determine what patterns you want to include in your chunks. In this case, you want to include everything: <.*>+.

The second rule of your grammar is }<JJ>{. This rule has curly braces that face outward (}{) because it’s used to determine what patterns you want to exclude in your chunks. In this case, you want to exclude adjectives: <JJ>.
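
To make the two rules concrete, here is a minimal plain-Python sketch of the same idea: start with every token in one chunk, then break the chunk wherever an excluded tag appears. The helper name `chink_by_tag` is hypothetical and is not part of NLTK. It only mimics what this particular grammar does, not how the parser works internally.

```python
def chink_by_tag(tagged_tokens, excluded_tag):
    """Split tagged tokens into chunks, leaving out tokens with excluded_tag.

    Hypothetical helper for illustration only; not an NLTK function.
    """
    chunks = []
    current = []
    for word, tag in tagged_tokens:
        if tag == excluded_tag:
            # The excluded token closes the current chunk and joins none.
            if current:
                chunks.append(current)
            current = []
        else:
            current.append((word, tag))
    if current:
        chunks.append(current)
    return chunks


# Tags copied from the start of the pos_tag output shown earlier.
tagged = [("It", "PRP"), ("'s", "VBZ"), ("a", "DT"),
          ("dangerous", "JJ"), ("business", "NN")]
example_chunks = chink_by_tag(tagged, "JJ")
print(example_chunks)
```

The adjective ends up in neither chunk, which is the behavior the outward-facing braces ask for.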

Create a chunk parser with this grammar:

In [6]:
chunk_parser = nltk.RegexpParser(grammar)

Now chunk your sentence with the chink you specified:



In [7]:
tree = chunk_parser.parse(lotr_pos_tags)

You get this tree as a result:

In [8]:
print(tree)

(S
  (Chunk It/PRP 's/VBZ a/DT)
  dangerous/JJ
  (Chunk
    business/NN
    ,/,
    Frodo/NNP
    ,/,
    going/VBG
    out/RP
    your/PRP$
    door/NN
    ./.))


In [9]:
tree;

In [10]:
tree.draw()

tree.draw() opens a separate window with a visual representation of the tree.

Here, you’ve excluded the adjective 'dangerous' from your chunks and are left with two chunks containing everything else. The first chunk has all the text that appeared before the adjective that was excluded. The second chunk contains everything after the adjective that was excluded.
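
If you want the two chunks as plain Python lists rather than a printed tree, one possible approach is to walk the tree and keep only the subtrees labeled Chunk. This is a sketch that assumes NLTK is installed. The tagged tokens are hard-coded from the output above so that no tagger data is needed, and `chunk_words` is a name introduced here for illustration.

```python
import nltk

# Tagged tokens copied from the pos_tag output above.
lotr_pos_tags = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

grammar = """Chunk: {<.*>+}
       }<JJ>{"""
tree = nltk.RegexpParser(grammar).parse(lotr_pos_tags)

# Keep only the subtrees labeled "Chunk" and collect the words in each one.
chunk_words = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]
for words in chunk_words:
    print(words)
```

The filter skips the root S node, so only the two chunks on either side of the excluded adjective are collected.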

Now that you know how to exclude patterns from your chunks, try the same technique on two more sentences before moving on to named entity recognition (NER).

# Example

In [11]:
sentence1 = "the cat is sitting with the bats on the striped mat under many flying geese"

In [12]:
words = word_tokenize(sentence1)

In [13]:
print(words)

['the', 'cat', 'is', 'sitting', 'with', 'the', 'bats', 'on', 'the', 'striped', 'mat', 'under', 'many', 'flying', 'geese']


In [14]:
postag = nltk.pos_tag(words)
print(postag)

[('the', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('with', 'IN'), ('the', 'DT'), ('bats', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('striped', 'JJ'), ('mat', 'NN'), ('under', 'IN'), ('many', 'JJ'), ('flying', 'VBG'), ('geese', 'JJ')]


In [15]:
gra1 = """
       Chunk: {<.*>+}
       }<NN>{"""

In [16]:
CP1 = nltk.RegexpParser(gra1)

In [17]:
CP1

<chunk.RegexpParser with 1 stages>

In [18]:
t1 = CP1.parse(postag)
print(t1)

(S
  (Chunk the/DT)
  cat/NN
  (Chunk
    is/VBZ
    sitting/VBG
    with/IN
    the/DT
    bats/NNS
    on/IN
    the/DT
    striped/JJ)
  mat/NN
  (Chunk under/IN many/JJ flying/VBG geese/JJ))


In [19]:
t1.draw()
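
In the tree printed above, bats/NNS stays inside a chunk because the chink pattern <NN> matches only the exact tag NN. If the goal is to exclude every noun tag, a pattern such as <NN.*> is one option. The sketch below assumes NLTK is installed and reuses the tags exactly as the tagger produced them above; the names `gra1_all_nouns` and `noun_free_chunks` are introduced here for illustration.

```python
import nltk

# Tagged tokens copied from the pos_tag output above (tagger labels kept as-is).
postag = [
    ("the", "DT"), ("cat", "NN"), ("is", "VBZ"), ("sitting", "VBG"),
    ("with", "IN"), ("the", "DT"), ("bats", "NNS"), ("on", "IN"),
    ("the", "DT"), ("striped", "JJ"), ("mat", "NN"), ("under", "IN"),
    ("many", "JJ"), ("flying", "VBG"), ("geese", "JJ"),
]

# <NN.*> matches any tag that starts with NN, such as NN, NNS, or NNP.
gra1_all_nouns = """
       Chunk: {<.*>+}
       }<NN.*>{"""
t1_all_nouns = nltk.RegexpParser(gra1_all_nouns).parse(postag)

noun_free_chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in t1_all_nouns.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(noun_free_chunks)
```

With this pattern, bats is pulled out of its chunk along with cat and mat, so the sentence splits into four chunks instead of three.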

In [20]:
sentence2 = "the little yellow dog barked at the cat"

In [21]:
wrds = word_tokenize(sentence2)

In [22]:
print(wrds)

['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat']


In [23]:
pt = nltk.pos_tag(wrds)
print(pt)

[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]


In [24]:
gra2 = """
       Chunk: {<.*>+}
       }<DT>{"""

In [25]:
CP2 = nltk.RegexpParser(gra2)

In [26]:
CP2

<chunk.RegexpParser with 1 stages>

In [27]:
t2 = CP2.parse(pt)
print(t2)

(S
  the/DT
  (Chunk little/JJ yellow/JJ dog/NN barked/VBD at/IN)
  the/DT
  (Chunk cat/NN))


In [28]:
t2.draw()
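
A chink pattern can also name several tags and match a run of them. The sketch below mirrors the chinking example from the NLTK book, which uses this same sentence, but keeps the Chunk label used here. It assumes NLTK is installed, and the names `gra3` and `np_like_chunks` are introduced for illustration.

```python
import nltk

# Tagged tokens copied from the pos_tag output above.
pt = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# <VBD|IN>+ matches one or more consecutive tokens tagged VBD or IN.
gra3 = """
       Chunk: {<.*>+}
       }<VBD|IN>+{"""
t3 = nltk.RegexpParser(gra3).parse(pt)
print(t3)

np_like_chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in t3.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(np_like_chunks)
```

Excluding the verb and the preposition together leaves two chunks that look like noun phrases: the little yellow dog, and the cat.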