# Tokenization, Tagging, Chunking - Chunking

Tokenization can split text in ways we might not want. For example, `New York` becomes the separate tokens `New` and `York`. Chunking solves this by grouping related tokens back together.

**Step 1 of chunking is `Part of Speech` (POS) tagging. The tags give the sentence the structure we need to find chunks.**

Through chunking, we can keep two-word entities together instead of treating them as separate tokens.

In [1]:
sentence = "I will go to the coffee shop in New York after I get off the jet plane."

In [2]:
import nltk

In [3]:
tokens = nltk.word_tokenize(sentence)
tokens[:5]

['I', 'will', 'go', 'to', 'the']

In [20]:
# Step 1: Tagging with POS
sent_tag = nltk.pos_tag(tokens)
sent_tag[:8]

[('I', 'PRP'),
 ('will', 'MD'),
 ('go', 'VB'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('coffee', 'NN'),
 ('shop', 'NN'),
 ('in', 'IN')]

**To find chunks, we need to define the tag sequence we are looking for.**

In [21]:
sequence =  '''
            CHUNK: {<NNP>+}
                   {<NN>+}
            '''

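`NNP` (proper noun, singular) and `NN` (common noun, singular or mass) are both noun tags. `+` means one or more consecutive tokens with that tag, so each rule matches a run of nouns of the same type. We can change `sequence` to change the kind of chunk we are looking for.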
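To make the effect of the two rules concrete, here is a minimal pure-Python sketch that groups runs of identical noun tags. It is only an illustration of what `{<NNP>+}` and `{<NN>+}` match, not how NLTK implements chunking; `group_runs` and the hand-written `tagged` list are hypothetical names introduced for this example.

```python
from itertools import groupby

# Hand-written (word, tag) pairs, copied from the tagger output in this notebook.
tagged = [("I", "PRP"), ("will", "MD"), ("go", "VB"), ("to", "TO"),
          ("the", "DT"), ("coffee", "NN"), ("shop", "NN"), ("in", "IN"),
          ("New", "NNP"), ("York", "NNP")]

def group_runs(tagged_tokens, chunk_tags=("NNP", "NN")):
    """Collect maximal runs of consecutive tokens that share one of chunk_tags."""
    chunks = []
    # groupby yields one group per run of consecutive tokens with the same tag.
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        words = [word for word, _ in run]
        if tag in chunk_tags:
            chunks.append((tag, words))
    return chunks

chunks = group_runs(tagged)
print(chunks)
# [('NN', ['coffee', 'shop']), ('NNP', ['New', 'York'])]
```

Because each rule names a single tag, a run that mixes `NN` and `NNP` would come out as two separate chunks under this sketch.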

In [22]:
NPChunker = nltk.RegexpParser(sequence)

In [23]:
result = NPChunker.parse(sent_tag)

In [24]:
print(result)

(S
  I/PRP
  will/MD
  go/VB
  to/TO
  the/DT
  (CHUNK coffee/NN shop/NN)
  in/IN
  (CHUNK New/NNP York/NNP)
  after/IN
  I/PRP
  get/VBP
  off/IN
  the/DT
  (CHUNK jet/NN plane/NN)
  ./.)


We see three chunks:

1. `coffee shop` is a single idea.
2. `New York` is also a single idea.
3. The same goes for `jet plane`.

We can use chunks wherever we want to treat several words as one 'single idea'.
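To work with the chunks as plain strings, one option is to walk the parse tree and join the words of every `CHUNK` subtree. The sketch below assumes NLTK is installed; the tags are typed in by hand (copied from the output above) so that it runs without the tokenizer or tagger data, and the names `chunker`, `tree` and `phrases` are chosen for this example only.

```python
import nltk

# Hand-written (word, tag) pairs matching the tagger output shown above.
tagged = [("I", "PRP"), ("will", "MD"), ("go", "VB"), ("to", "TO"),
          ("the", "DT"), ("coffee", "NN"), ("shop", "NN"), ("in", "IN"),
          ("New", "NNP"), ("York", "NNP"), ("after", "IN"), ("I", "PRP"),
          ("get", "VBP"), ("off", "IN"), ("the", "DT"), ("jet", "NN"),
          ("plane", "NN"), (".", ".")]

# Same grammar as in the notebook: runs of NNP, then runs of NN.
grammar = """
CHUNK: {<NNP>+}
       {<NN>+}
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Keep only subtrees labelled CHUNK and join their words into one string.
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "CHUNK"
]
print(phrases)
# ['coffee shop', 'New York', 'jet plane']
```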

We can also change the sequence to define a different kind of chunk.
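As one possible variation, the sketch below defines a noun-phrase chunk made of an optional determiner followed by one or more nouns of any kind. The grammar `NP: {<DT>?<NN.*>+}` is an illustrative choice, not part of the original notebook, and the tags are again typed in by hand so that no tagger data is needed.

```python
import nltk

# Hand-written (word, tag) pairs matching the tagger output shown above.
tagged = [("I", "PRP"), ("will", "MD"), ("go", "VB"), ("to", "TO"),
          ("the", "DT"), ("coffee", "NN"), ("shop", "NN"), ("in", "IN"),
          ("New", "NNP"), ("York", "NNP"), ("after", "IN"), ("I", "PRP"),
          ("get", "VBP"), ("off", "IN"), ("the", "DT"), ("jet", "NN"),
          ("plane", "NN"), (".", ".")]

# <DT>?   optional determiner
# <NN.*>+ one or more tags starting with NN (NN, NNS, NNP, NNPS)
np_grammar = "NP: {<DT>?<NN.*>+}"
np_chunker = nltk.RegexpParser(np_grammar)
np_tree = np_chunker.parse(tagged)

# Collect the text of every NP chunk in sentence order.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in np_tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
# ['the coffee shop', 'New York', 'the jet plane']
```

Compared with the original grammar, the determiner `the` is now pulled into the chunk, which shows how a small change to the sequence changes what counts as a chunk.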