Chunking groups the individual tokens of a sentence (nouns, verbs, adjectives, adverbs, and so on) into larger units known as chunks. The most common type is noun phrase (NP) chunking, which identifies and extracts noun phrases from a sentence, such as "the cat," "a book," or "my friend." Another type is verb phrase (VP) chunking, which identifies and extracts verb phrases, such as "ate breakfast," "is running," or "will sing."

Noun phrases are part-of-speech patterns that include a noun. They can also include whatever other parts of speech make grammatical sense, and they can contain multiple nouns. Some common noun phrase patterns are:

    Noun
    Noun-Noun-...-Noun
    Adjective(s)-Noun
    Verb-(Adjectives-)Noun

A noun phrase consists of a noun or pronoun, which is called the head, and any dependent words before or after the head.
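The patterns above can be written as a tag pattern for nltk.RegexpParser. The following is a minimal sketch, not a complete NP grammar: the sentence is tagged by hand for illustration, and the optional determiner (<DT>?) is an assumption added here to cover dependent words before the head.

```python
import nltk

# Hand-tagged sentence (tags assigned manually for illustration, not by a tagger).
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("chased", "VBD"), ("the", "DT"), ("police", "NN"), ("car", "NN")]

# Optional determiner, any number of adjectives, then one or more nouns.
# This covers the Noun, Noun-Noun-...-Noun, and Adjective(s)-Noun patterns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(sentence)

# Collect the words of every NP subtree as a plain string.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)
```

The verb chased/VBD is left outside both chunks, so the two noun phrases come out separately.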

Chunking is an important technique in NLP because it allows us to extract meaningful information from text data. By identifying and grouping together chunks of information, we can analyze patterns and relationships within the text and extract relevant insights.

Tag   Description                             Example

CC    coordinating conjunction                and
CD    cardinal number                         1, three
DT    determiner                              the
EX    existential there                       there is
FW    foreign word                            d'oeuvre
IN    preposition/subordinating conjunction   in, of, like
JJ    adjective                               big
JJR   adjective, comparative                  bigger
JJS   adjective, superlative                  biggest
LS    list marker                             1)
MD    modal                                   could, will
NN    noun, singular or mass                  door
NNS   noun, plural                            doors
NNP   proper noun, singular                   John
NNPS  proper noun, plural                     Vikings
PDT   predeterminer                           both the boys
POS   possessive ending                       friend's
PRP   personal pronoun                        I, he, it
RB    adverb                                  however, usually, naturally, here, good
RBR   adverb, comparative                     better
RBS   adverb, superlative                     best
RP    particle                                give up
TO    to                                      to go, to him
UH    interjection                            uhhuhhuhh
VB    verb, base form                         take
VBD   verb, past tense                        took
VBG   verb, gerund/present participle         taking
VBN   verb, past participle                   taken
VBP   verb, sing. present, non-3rd person     take
VBZ   verb, 3rd person sing. present          takes
WDT   wh-determiner                           which
WP    wh-pronoun                              who, what
WP$   possessive wh-pronoun                   whose
WRB   wh-adverb                               where, when

In [11]:
#1. Write a tag pattern to match noun phrases containing plural
#head nouns, e.g., many/JJ researchers/NNS, two/CD weeks/NNS,
#both/DT new/JJ positions/NNS. Try to do this by generalizing
#the tag pattern that handled singular noun phrases.

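import nltk

# Hand-tagged tokens from the exercise: three plural noun phrases in a row.
textchunk = [("many", "JJ"), ("researchers", "NNS"), ("two", "CD"), ("weeks", "NNS"), ("both","DT"), ("new", "JJ"), ("positions", "NNS")]
# Optional determiner, optional cardinal number, any adjectives, then a plural head noun.
chunker = nltk.RegexpParser("NP: {<DT>?<CD>?<JJ>*<NNS>}")

# Parse once, then print the whole tree followed by each top-level chunk.
tree = chunker.parse(textchunk)
print(tree)
for chunk in tree:
    print(chunk)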


(S
  (NP many/JJ researchers/NNS)
  (NP two/CD weeks/NNS)
  (NP both/DT new/JJ positions/NNS))
(NP many/JJ researchers/NNS)
(NP two/CD weeks/NNS)
(NP both/DT new/JJ positions/NNS)


In [2]:
#Pick one of the three chunk types in the CoNLL-2000 Chunking Corpus.
#Inspect the data and try to observe any patterns in the POS tag sequences
#that make up this kind of chunk. Develop a simple chunker using the regular
#expression chunker nltk.RegexpParser. Discuss any tag sequences that are difficult to chunk reliably.

nltk.download('conll2000')
from nltk.corpus import conll2000

# Build the VP-only corpus view once rather than on every iteration.
vp_sents = conll2000.chunked_sents('train.txt', chunk_types=['VP'])
for i in range(20):
    print(i, vp_sents[i])

0 (S
  Confidence/NN
  in/IN
  the/DT
  pound/NN
  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
  another/DT
  sharp/JJ
  dive/NN
  if/IN
  trade/NN
  figures/NNS
  for/IN
  September/NNP
  ,/,
  due/JJ
  for/IN
  release/NN
  tomorrow/NN
  ,/,
  (VP fail/VB to/TO show/VB)
  a/DT
  substantial/JJ
  improvement/NN
  from/IN
  July/NNP
  and/CC
  August/NNP
  's/POS
  near-record/JJ
  deficits/NNS
  ./.)
1 (S
  Chancellor/NNP
  of/IN
  the/DT
  Exchequer/NNP
  Nigel/NNP
  Lawson/NNP
  's/POS
  restated/VBN
  commitment/NN
  to/TO
  a/DT
  firm/NN
  monetary/JJ
  policy/NN
  (VP has/VBZ helped/VBN to/TO prevent/VB)
  a/DT
  freefall/NN
  in/IN
  sterling/NN
  over/IN
  the/DT
  past/JJ
  week/NN
  ./.)
2 (S
  But/CC
  analysts/NNS
  (VP reckon/VBP)
  underlying/VBG
  support/NN
  for/IN
  sterling/NN
  (VP has/VBZ been/VBN eroded/VBN)
  by/IN
  the/DT
  chancellor/NN
  's/POS
  failure/NN
  (VP to/TO announce/VB)
  any/DT
  new/JJ
  policy/NN
  measures/NNS
  in/IN
  his/PRP$
  Mansio

[nltk_data] Downloading package conll2000 to C:\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!


The most common patterns:

(VP rose/VBD)

(VP are/VBP topped/VBN) (VP has/VBZ been/VBN eroded/VBN)

(VP could/MD be/VB)

(VP has/VBZ helped/VBN to/TO prevent/VB) (VP being/VBG forced/VBN to/TO increase/VB)

(VP is/VBZ widely/RB expected/VBN to/TO take/VB)
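One way to make this kind of observation systematic is to tally the tag sequence of each chunk. The sketch below does that for the seven example chunks listed above, transcribed by hand; it does not read the corpus, so the counts describe only these examples and not CoNLL-2000 as a whole.

```python
from collections import Counter

# VP chunks transcribed by hand from the examples listed above.
vp_chunks = [
    [("rose", "VBD")],
    [("are", "VBP"), ("topped", "VBN")],
    [("has", "VBZ"), ("been", "VBN"), ("eroded", "VBN")],
    [("could", "MD"), ("be", "VB")],
    [("has", "VBZ"), ("helped", "VBN"), ("to", "TO"), ("prevent", "VB")],
    [("being", "VBG"), ("forced", "VBN"), ("to", "TO"), ("increase", "VB")],
    [("is", "VBZ"), ("widely", "RB"), ("expected", "VBN"), ("to", "TO"), ("take", "VB")],
]

# Tally the tag sequence of each chunk.
sequence_counts = Counter(" ".join(tag for _, tag in chunk) for chunk in vp_chunks)

# Collect the first letter of every tag that occurs inside a VP chunk.
tag_initials = sorted({tag[0] for chunk in vp_chunks for _, tag in chunk})

for sequence, count in sequence_counts.most_common():
    print(count, sequence)
print("tag initials:", tag_initials)
```

Every tag in these chunks starts with M, R, T, or V, which is the observation a character class such as [VRMT] relies on.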


In [3]:
# Chunk any run of tags that start with V, R, M, or T
# (verbs, adverbs and particles, modals, and TO).
# Alternatives:
#   {<VB.>?<RB>*<MD>?<VB.>?<TO>?<VB.>}
#   {<VB.>?<RB.>*<MD>?<VB.>?<TO>?<VB.>}
#   {<VB.>?<RB>*<MD>?<VB.>?<TO>?<MD>?<RB>*<VB.>}
grammar = r"VP: {<[VRMT].*>+}"
cp = nltk.RegexpParser(grammar)

# The input sentences already carry the gold VP chunks, so a gold chunk that
# the pattern also matches shows up nested, as (VP (VP ...)).
for i in range(20):
    test_sent = conll2000.chunked_sents('train.txt', chunk_types = ['VP'])[i]
    print(i, cp.parse(test_sent))

0 (S
  Confidence/NN
  in/IN
  the/DT
  pound/NN
  (VP (VP is/VBZ widely/RB expected/VBN to/TO take/VB))
  another/DT
  sharp/JJ
  dive/NN
  if/IN
  trade/NN
  figures/NNS
  for/IN
  September/NNP
  ,/,
  due/JJ
  for/IN
  release/NN
  tomorrow/NN
  ,/,
  (VP (VP fail/VB to/TO show/VB))
  a/DT
  substantial/JJ
  improvement/NN
  from/IN
  July/NNP
  and/CC
  August/NNP
  's/POS
  near-record/JJ
  deficits/NNS
  ./.)
1 (S
  Chancellor/NNP
  of/IN
  the/DT
  Exchequer/NNP
  Nigel/NNP
  Lawson/NNP
  's/POS
  (VP restated/VBN)
  commitment/NN
  (VP to/TO)
  a/DT
  firm/NN
  monetary/JJ
  policy/NN
  (VP (VP has/VBZ helped/VBN to/TO prevent/VB))
  a/DT
  freefall/NN
  in/IN
  sterling/NN
  over/IN
  the/DT
  past/JJ
  week/NN
  ./.)
2 (S
  But/CC
  analysts/NNS
  (VP (VP reckon/VBP) underlying/VBG)
  support/NN
  for/IN
  sterling/NN
  (VP (VP has/VBZ been/VBN eroded/VBN))
  by/IN
  the/DT
  chancellor/NN
  's/POS
  failure/NN
  (VP (VP to/TO announce/VB))
  any/DT
  new/JJ
  policy
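The output shows where this pattern is unreliable: restated/VBN (a participle used as a modifier), to/TO in front of a noun phrase, and underlying/VBG all get chunked as VP. The sketch below reproduces that behaviour without NLTK. It is a simplified stand-in for <[VRMT].*>+, assuming the pattern amounts to "group every maximal run of tags starting with V, R, M, or T"; the helper name vp_runs is hypothetical.

```python
import re
from itertools import groupby

# Same character class as the tag pattern: tags starting with V, R, M, or T.
TAG_RE = re.compile(r"[VRMT].*")

def vp_runs(tagged):
    """Return the words of each maximal run of tokens whose tag matches TAG_RE."""
    runs = []
    for matches, group in groupby(tagged, key=lambda pair: bool(TAG_RE.fullmatch(pair[1]))):
        if matches:
            runs.append([word for word, _ in group])
    return runs

# Part of sentence 1 from the output above, as flat (word, tag) pairs.
tagged = [("'s", "POS"), ("restated", "VBN"), ("commitment", "NN"), ("to", "TO"),
          ("a", "DT"), ("firm", "NN"), ("monetary", "JJ"), ("policy", "NN"),
          ("has", "VBZ"), ("helped", "VBN"), ("to", "TO"), ("prevent", "VB")]

runs = vp_runs(tagged)
print(runs)
```

Only the last run is a real verb phrase. The first two are false positives, because the tag alone cannot tell a participle used as a modifier from a verb, or a prepositional "to" from an infinitival one.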

A gerund is a word like “swimming” in the sentence “I have always enjoyed swimming.” The term refers to the “-ing” form of a verb when it functions as a noun.


In [13]:
#Write a tag pattern to cover noun phrases that contain gerunds,
#e.g., the/DT receiving/VBG end/NN, assistant/NN managing/VBG editor/NN.
#Add these patterns to the grammar, one per line.
#Test your work using some tagged sentences of your own devising.

grammar = """
    NP: {<DT><VBG><NN>}    # chunk determiner, gerund, and noun
        {<NN><VBG><NN>}    # chunk noun, gerund, and noun
        # or {<DT|NN><VBG><NN>}
"""
cp = nltk.RegexpParser(grammar)
sentences = [[("the", "DT"), ("receiving", "VBG"), ("end", "NN")],
             [("assistant", "NN"),  ("managing", "VBG"),  ("editor", "NN")]]

for sent in sentences:
    print(cp.parse(sent))

(S (NP the/DT receiving/VBG end/NN))
(S (NP assistant/NN managing/VBG editor/NN))
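The exercise also asks for tests on sentences of your own devising. A possible sketch follows, using the merged pattern {<DT|NN><VBG><NN>} from the comment in the grammar; both sentences and their tags are made up by hand for illustration, and the helper name np_chunks is hypothetical.

```python
import nltk

# Merged form of the two gerund rules: determiner or noun, gerund, noun.
cp = nltk.RegexpParser("NP: {<DT|NN><VBG><NN>}")

def np_chunks(tagged):
    """Return the NP chunks found in a tagged sentence as plain strings."""
    tree = cp.parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]

# The gerund sits inside a noun phrase: expect one chunk.
inside_np = [("she", "PRP"), ("was", "VBD"), ("on", "IN"),
             ("the", "DT"), ("receiving", "VBG"), ("end", "NN")]

# The -ing form is the main verb here: expect no chunk.
main_verb = [("he", "PRP"), ("is", "VBZ"), ("managing", "VBG"),
             ("the", "DT"), ("team", "NN")]

print(np_chunks(inside_np))
print(np_chunks(main_verb))
```

The second sentence is a useful negative case: managing/VBG follows a verb rather than a determiner or noun, so the pattern correctly leaves it unchunked.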
