#### Day 69

# Chunking

While tokenizing allows you to identify words and sentences, **chunking** allows you to identify **phrases**.

<p>
<b>Note:</b> A <b>phrase</b> is a word or group of words that works as a single unit to perform a grammatical function. <b>Noun phrases</b> are built around a noun.

Here are some examples:
<ul>
<li>“A planet”</li>
<li>“A tilting planet”</li>
<li>“A swiftly tilting planet”</li>
</ul>
</p>

Chunking makes use of POS tags to group words and apply chunk tags to those groups. Chunks don’t overlap, so one instance of a word can be in only one chunk at a time.

Here’s how to import the relevant parts of NLTK in order to chunk:

In [1]:
from nltk.tokenize import word_tokenize

Before you can chunk, you need to make sure that the parts of speech in your text are tagged, so create a string for POS tagging. You can use this quote from <a href = "https://en.wikipedia.org/wiki/The_Lord_of_the_Rings" >The Lord of the Rings</a>:

In [2]:
lotr_quote = "It's a dangerous business, Frodo, going out your door."

Now tokenize that string by word:

In [3]:
words_in_lotr_quote = word_tokenize(lotr_quote)
words_in_lotr_quote

['It',
 "'s",
 'a',
 'dangerous',
 'business',
 ',',
 'Frodo',
 ',',
 'going',
 'out',
 'your',
 'door',
 '.']

Now you’ve got a list of all of the tokens (words and punctuation marks) in `lotr_quote`.

The next step is to tag those words by part of speech:

In [6]:
import nltk
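# Uncomment on first run to download the POS tagger model
# (the resource name can differ between NLTK versions):
# nltk.download("averaged_perceptron_tagger")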

lotr_pos_tags = nltk.pos_tag(words_in_lotr_quote)
lotr_pos_tags

[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]
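If the tag abbreviations in this output are unfamiliar, a small lookup table can help. The sketch below is only an illustration: `PENN_TAG_GLOSSES` and `describe` are hypothetical names, not part of NLTK, and the table covers only the Penn Treebank tags that appear in this quote. The tagged tokens are copied from the output above so that no tagger model is needed.

```python
# Short glosses for the Penn Treebank tags seen in the quote (not a full tagset).
PENN_TAG_GLOSSES = {
    "PRP": "personal pronoun",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "VBG": "verb, gerund or present participle",
    "RP": "particle",
    "PRP$": "possessive pronoun",
}

# Tagged tokens copied from the output above.
tagged = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]


def describe(tagged_words):
    """Return (word, tag, gloss) triples; tags missing from the table get a placeholder."""
    return [
        (word, tag, PENN_TAG_GLOSSES.get(tag, "(not listed)"))
        for word, tag in tagged_words
    ]


for word, tag, gloss in describe(tagged):
    print(f"{word:10} {tag:5} {gloss}")
```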

You’ve got a list of tuples, each pairing a token from the quote with its POS tag. In order to chunk, you first need to define a chunk grammar.

**Note**: A **chunk grammar** is a combination of rules on how sentences should be chunked. It often uses <a href = "https://realpython.com/regex-python/"> regular expressions</a>, or **regexes**.

For this tutorial, you don’t need to know how regular expressions work, but they will definitely <a href = "https://xkcd.com/208/"> come in handy</a> for you in the future if you want to process text.

Create a chunk grammar with one regular expression rule:

In [7]:
grammar = "NP: {<DT>?<JJ>*<NN>}"

NP stands for noun phrase. You can learn more about **noun phrase chunking** in <a href= "https://www.nltk.org/book/ch07.html#noun-phrase-chunking" > Chapter 7</a> of Natural Language Processing with Python—Analyzing Text with the Natural Language Toolkit.

According to the rule you created, your chunks:
<ul>
<li>1. Start with an optional (?) determiner (DT)</li>
<li>2. Can have any number (*) of adjectives (JJ)</li>
<li>3. End with a noun (NN)</li>
</ul>
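To get a feel for what `?` and `*` do in this rule, you can write the same pattern as an ordinary Python regex over a string of angle-bracketed tags. This is only a sketch for building intuition, using the standard `re` module rather than NLTK; `NP_PATTERN` and `match_spans` are hypothetical names introduced here.

```python
import re

# The chunk rule <DT>?<JJ>*<NN>, rewritten as a plain regex over tag strings.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def match_spans(tags):
    """Return the tag sequences that the NP pattern matches, left to right."""
    tag_string = "".join(f"<{tag}>" for tag in tags)
    return [m.group(0) for m in NP_PATTERN.finditer(tag_string)]


print(match_spans(["NN"]))                    # ['<NN>']
print(match_spans(["DT", "NN"]))              # ['<DT><NN>']
print(match_spans(["DT", "JJ", "JJ", "NN"]))  # ['<DT><JJ><JJ><NN>']
print(match_spans(["DT", "JJ"]))              # [] because the final noun is missing
print(match_spans(["NNP"]))                   # [] because NNP is not the same tag as NN
```

The last two cases show that the determiner and adjectives are optional, but the closing `NN` is not.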
    
Create a **chunk parser** with this grammar:

In [9]:
chunk_parser = nltk.RegexpParser(grammar)  # build a chunk parser from the grammar defined above

In [10]:
chunk_parser

<chunk.RegexpParser with 1 stages>

Now try it out with your quote:

In [11]:
tree = chunk_parser.parse(lotr_pos_tags)

Here’s how you can see a visual representation of this tree:

In [12]:
tree.draw()

The visual representation shows two noun phrases:

1. **'a dangerous business'** has a determiner, an adjective, and a noun.

2. **'door'** has just a noun.

Note that 'Frodo' is not chunked: its tag is NNP (proper noun), and the rule only matches NN.
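`tree.draw()` opens a separate window, so it may not work in every environment. As an alternative, you can print the tree as text and collect the noun phrases programmatically. The following is a minimal sketch that reuses the tagged tokens from the earlier output, so it runs without a tagger model; the variable `noun_phrases` is a name introduced here.

```python
from nltk import RegexpParser

# Tagged tokens copied from the tagging output earlier in this tutorial.
lotr_pos_tags = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

chunk_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunk_parser.parse(lotr_pos_tags)

# Text rendering of the tree; no display is required.
print(tree)

# Collect each NP subtree and join its words back into a phrase.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda st: st.label() == "NP")
]
print(noun_phrases)  # ['a dangerous business', 'door']
```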


### Example

In [13]:
sentence2 = "the little yellow dog barked at the cat"

In [18]:
# tokenize by word
words=word_tokenize(sentence2)
print(words)

['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat']


In [19]:
# parts of speech
pos_chunk = nltk.pos_tag(words)
print(pos_chunk)

[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]


In [14]:
# define grammar

# ? matches 0 or 1 times
# * matches 0 or more times
# ^ anchors a match at the start (not used in this rule)
grammar = "NP: {<DT>?<JJ>*<NN>}"

In [15]:
chunkParser = nltk.RegexpParser(grammar)  # RegexpParser groups tagged words into phrases according to the grammar

In [16]:
chunkParser

<chunk.RegexpParser with 1 stages>

In [23]:
t = chunkParser.parse(pos_chunk)  # pass the tagged words to the chunk parser
print(t)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [24]:
t.draw()
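Because chunks don’t overlap, every word ends up with exactly one chunk tag. One common way to write this down is IOB notation, where `B-NP` begins a noun phrase, `I-NP` continues it, and `O` marks a word outside any chunk. The sketch below uses NLTK’s `tree2conlltags` helper to flatten the tree for the example sentence; the tagged tokens are copied from the output above, and `iob_tags` is a name introduced here.

```python
from nltk import RegexpParser
from nltk.chunk import tree2conlltags

# Tagged tokens copied from the POS tagging output for the example sentence.
pos_chunk = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

t = RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(pos_chunk)

# Flatten the chunk tree into (word, POS tag, IOB chunk tag) triples.
iob_tags = tree2conlltags(t)

for word, pos, iob in iob_tags:
    print(f"{word:8}{pos:5}{iob}")
```

Each word appears once in the result, which makes this flat form convenient when you want to count chunks or compare two chunkers word by word.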