# CHUNKING
Text chunking, also called shallow parsing, is a step that follows part-of-speech (POS) tagging and adds more structure to a sentence by grouping its words into “chunks”. It extracts phrases from unstructured text by analyzing a sentence to identify its constituents (noun groups, verbs, verb groups, etc.). Chunking works on top of POS tagging: it takes POS tags as input and produces chunks as output. Unlike a full parse, a shallow parse tree has at most one level between the root and the leaves.
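To make the input/output relationship concrete, here is a minimal sketch of chunking with `nltk.RegexpParser`. The sentence and its tags are hand-written for illustration (an assumption, not taken from the corpus used in this notebook), so no tagger model is needed.

```python
import nltk

# Hand-tagged tokens (assumed for illustration): (word, POS tag) pairs.
tagged = [("The", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
          ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# NP chunk: optional determiner, any number of adjectives, then a noun.
grammar = r"NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every NP chunk as a plain string.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
print(noun_phrases)
```

Words that do not match the pattern (here "barked" and "at") stay directly under the root, which is what keeps the tree shallow.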

In [1]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [2]:
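# The paths below are machine-specific. With the NLTK state_union corpus
# installed, the file ids are "1945-Truman.txt" and "1947-Truman.txt".
train_text = state_union.raw("C:/datasets_2119_3569_state_union_1945-Truman.txt")
sample_text = state_union.raw("C:/datasets_2119_3569_state_union_1947-Truman.txt")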


In [3]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [4]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)
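The sentence-splitting step can be tried in isolation. The sketch below uses an untrained `PunktSentenceTokenizer` on a short hypothetical text, so it needs no corpus download. Passing training text, as the notebook does, would additionally let the tokenizer learn abbreviations from that text.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Hypothetical sample text, standing in for the State of the Union speeches.
text = "Chunking groups words into phrases. It runs after tagging."

# With no training text the tokenizer uses default parameters only.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```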


In [None]:
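def process_content():
    # One chunk: optional adverbs and verbs, one or more proper nouns,
    # then an optional common noun. The parser is built once, outside the loop.
    chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunked = chunkParser.parse(tagged)
            # Print the tree for this sentence. pos_tag expects a list of
            # tokens; given the raw string train_text it tags single characters.
            print(chunked)
            chunked.draw()  # opens a window per sentence; close it to continue
    except Exception as e:
        print(str(e))

process_content()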

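Output of a run in which the raw string train_text, rather than a list of word tokens, was passed to nltk.pos_tag, so every character is tagged separately:
(S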
  (Chunk P/NNP R/NNP E/NNP S/NNP)
  I/PRP
  (Chunk
    D/NNP
    E/NNP
    N/NNP
    T/NNP
     /NNP
    H/NNP
    A/NNP
    R/NNP
    R/NNP
    Y/NNP
     /NNP
    S/NNP)
  ./.
   /NN
  (Chunk T/NNP R/NNP U/NNP M/NNP A/NNP N/NNP)
  '/POS
  (Chunk S/NNP)
  (Chunk
     /VBZ
    A/NNP
    D/NNP
    D/NNP
    R/NNP
    E/NNP
    S/NNP
    S/NNP
     /NNP
    B/NNP
    E/NNP
    F/NNP
    O/NNP
    R/NNP
    E/NNP)
  (Chunk  /VBZ A/NNP  /NNP J/NNP O/NNP)
  I/PRP
  (Chunk N/NNP T/NNP  /NNP S/NNP E/NNP S/NNP S/NNP)
  I/PRP
  (Chunk
    O/NNP
    N/NNP
     /NNP
    O/NNP
    F/NNP
     /NNP
    T/NNP
    H/NNP
    E/NNP
     /NNP
    C/NNP
    O/NNP
    N/NNP
    G/NNP
    R/NNP
    E/NNP
    S/NNP
    S/NNP
    
/NNP
     /NNP)
  (Chunk 
/VBZ A/NNP p/NN)
  r/NN
  i/NN
  l/VBP
   /$
  1/CD
  6/CD
  ,/,
   /VBD
  1/CD
  9/CD
  4/CD
  5/CD
  
/NN
  (Chunk 
/NNP M/NNP r/NN)
  ./.
   /CC
  (Chunk S/NNP)
  p/VBP
  e/VBZ
  a/DT
  k/NN
  e/NN
  r/NN
  ,/,
  (Chunk  /NNP M/NNP r/NN)
  ./.
   /CC
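For comparison, the chunk grammar from `process_content` can be applied to word-level tags. This is a minimal sketch on a hand-tagged sentence (the sentence and tags are assumptions made for illustration, not output from the notebook), and it collects chunks as strings instead of calling `draw()`.

```python
import nltk

# Hand-tagged tokens (assumed for illustration), one tag per word.
tagged = [("President", "NNP"), ("Truman", "NNP"), ("quickly", "RB"),
          ("addressed", "VBD"), ("Congress", "NNP"), ("today", "NN")]

# Same grammar as in process_content: optional adverbs and verbs,
# one or more proper nouns, then an optional common noun.
chunk_gram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunk_parser = nltk.RegexpParser(chunk_gram)
chunked = chunk_parser.parse(tagged)

# Keep only the subtrees labelled "Chunk" and join their words.
chunks = [" ".join(word for word, tag in subtree.leaves())
          for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk")]
print(chunks)
```

Filtering subtrees by label is a convenient way to inspect chunks in a notebook without opening a drawing window for every sentence.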
