In [None]:
Chunking is the process of extracting phrases from unstructured text. Individual tokens often fail to capture 
the meaning of the text, so it is advisable to treat a phrase such as "South Africa" as a single unit instead 
of as the separate words "South" and "Africa".

Chunking works on top of POS tagging: it takes POS tags as input and produces chunks as output. 
Just as there is a standard set of POS tags, there is a standard set of chunk tags such as Noun Phrase (NP) and Verb Phrase (VP). 
Chunking is very important when you want to extract information such as locations and person names from text. 
In NLP this task is called Named Entity Extraction.
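
As a rough illustration of tags going in and chunks coming out, the sketch below chunks consecutive proper nouns into one phrase. The sentence and its tags are written by hand (an assumption, so no tagger model is needed), and the chunk label NAME is an arbitrary choice for this example.

```python
import nltk

# Hand-tagged tokens (assumed tags; normally these come from nltk.pos_tag)
tagged = [("She", "PRP"), ("moved", "VBD"), ("to", "TO"),
          ("South", "NNP"), ("Africa", "NNP"),
          ("last", "JJ"), ("year", "NN")]

# One or more consecutive proper nouns form a single NAME chunk
parser = nltk.RegexpParser("NAME: {<NNP>+}")
tree = parser.parse(tagged)

# Join the words of every NAME subtree into one phrase
names = [" ".join(word for word, pos in st.leaves())
         for st in tree.subtrees(lambda t: t.label() == "NAME")]
print(names)
```

Here "South" and "Africa" end up in one chunk rather than as two unrelated tokens.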

A chunk is a short phrase within a sentence.

Text chunking, also referred to as shallow parsing, is a task that follows part-of-speech tagging and adds more
structure to the sentence. The result is a grouping of the words into "chunks".

Chunk rules are patterns of part-of-speech tags that describe what kinds of words make up a chunk. 
We can also add patterns for what kinds of words should not be in a chunk. These unchunked words are known as chinks.
 
A ChunkRule class specifies what to include in a chunk, while a ChinkRule class specifies what to exclude from a chunk. In other words, chunking creates chunks, while chinking breaks up those chunks.


In [11]:
import nltk
from nltk.tokenize import word_tokenize

# POS tagging (tag once and reuse the result below)
sent = "This will be chunked. This is for Test. World is awesome. Hello world."
tagged_sent = nltk.pos_tag(word_tokenize(sent))

print(tagged_sent)

# creating a regular expression for chunking verbs and nouns:
# any run of noun tags (NN, NNS, NNP, NNPS) followed by any run of verb tags (VB, VBD, VBG, VBN, VBP, VBZ)
chunkRule = r"""chunk: {<NN.?>*<NNS.?>*<NNP.?>*<NNPS.?>*<VB.?>*<VBD.?>*<VBG.?>*<VBN.?>*<VBP.?>*<VBZ.?>*}"""

My_parser = nltk.RegexpParser(chunkRule)
chunked = My_parser.parse(tagged_sent)

print(chunked)

[('This', 'DT'), ('will', 'MD'), ('be', 'VB'), ('chunked', 'VBN'), ('.', '.'), ('This', 'DT'), ('is', 'VBZ'), ('for', 'IN'), ('Test', 'NNP'), ('.', '.'), ('World', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ'), ('.', '.'), ('Hello', 'NNP'), ('world', 'NN'), ('.', '.')]
(S
  This/DT
  will/MD
  (chunk be/VB chunked/VBN)
  ./.
  This/DT
  (chunk is/VBZ)
  for/IN
  (chunk Test/NNP)
  ./.
  (chunk World/NNP is/VBZ)
  awesome/JJ
  ./.
  (chunk Hello/NNP world/NN)
  ./.)


In [18]:

import nltk

def prepareForNLP(text):
    """Split text into sentences, tokenize each one, and POS-tag the tokens."""
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

def chunk(sentence):
    """Print every NP chunk found in one POS-tagged sentence."""
    # Three NP rules, applied in order:
    #   one or more proper nouns
    #   optional determiner, optional adjective, plural noun
    #   two consecutive singular nouns
    chunkToExtract = """
    NP: {<NNP>+}
        {<DT>?<JJ>?<NNS>}
        {<NN><NN>}"""
    parser = nltk.RegexpParser(chunkToExtract)
    result = parser.parse(sentence)

    for subtree in result.subtrees():
        if subtree.label() == 'NP':
            phrase = ' '.join(word for word, pos in subtree.leaves())
            print(phrase)


sentences = prepareForNLP("A prison riot left six members of staff needing hospital treatment earlier this month, the BBC learns")
for sentence in sentences:
    chunk(sentence)

prison riot
members
hospital treatment
BBC
learns


In [33]:
from nltk.chunk.regexp import ChunkString, ChunkRule, ChinkRule
from nltk.tree import Tree
t = Tree('S', [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),('many', 'JJ'), ('chapters', 'NNS')])
cs = ChunkString(t)
cs

<ChunkString: '<DT><NN><VBZ><JJ><NNS>'>

In [34]:
ur = ChunkRule('<DT><NN.*><.*>*<NN.*>', 'chunk determiners and nouns')
ur.apply(cs)
cs

<ChunkString: '{<DT><NN><VBZ><JJ><NNS>}'>

In [35]:
ir = ChinkRule('<VB.*>', 'chink verbs')
ir.apply(cs)
cs

<ChunkString: '{<DT><NN>}<VBZ>{<JJ><NNS>}'>

In [None]:
# Noun phrases that consist of an optional determiner, followed by any number of adjectives, then a noun.
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")

In [37]:
# Example of a simple regular expression based NP chunker.
import nltk
sentence = "the little yellow dog barked at the cat"
#Define your grammar using regular expressions
grammar = ('''
    NP: {<DT>?<JJ>*<NN>} # NP
    ''')
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tagged

[('the', 'DT'),
 ('little', 'JJ'),
 ('yellow', 'JJ'),
 ('dog', 'NN'),
 ('barked', 'VBD'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('cat', 'NN')]

In [38]:
tree = chunkParser.parse(tagged)
for subtree in tree.subtrees():
    print(subtree)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
(NP the/DT little/JJ yellow/JJ dog/NN)
(NP the/DT cat/NN)
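
The chunk tree can also be flattened into IOB tags, a common way to store chunking results: B- marks the first word of a chunk, I- a word inside one, and O a word outside any chunk. This sketch uses `tree2conlltags` from `nltk.chunk` and rebuilds the tree from the hand-copied tags shown above so that it runs on its own.

```python
import nltk
from nltk.chunk import tree2conlltags

# Tags copied from the tagger output shown earlier
tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(tagged)

# Each entry is a (word, POS tag, IOB chunk tag) triple
iob = tree2conlltags(tree)
for word, pos, chunk_tag in iob:
    print(word, pos, chunk_tag)
```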


In [None]:
# Opens a separate window showing the parse tree (needs Tkinter and a display)
tree.draw()