# Chunking

### Create a Chunk Parser : 
An NP chunk should be formed whenever the chunker finds ***an optional determiner (DT)*** followed by ***any number of adjectives (JJ)*** and then ***a noun (NN)***. 

In [0]:
import nltk
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
grammar = "CHUNK_NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
# result.draw()

## Tag Patterns
A tag pattern is a sequence of part-of-speech tags
delimited using angle brackets, e.g. < DT >?< JJ >*< NN >.

## Chunking with Regular Expressions

In [0]:
grammar = r"""
CHUNK_NP: {<DT|PP\$>?<JJ>*<NN>} # chunk determiner/possessive, adjectives and noun
{<NNP>+} # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))

### Chunk two consecutive nouns

In [0]:
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
grammar = "CHUNK_NP: {<NN><NN>} "            
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))

**Update Grammar**

In [0]:
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]   
grammar = "CHUNK_NP: {<NN>+}"
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))

## Expore Corpora



*   **Tagged Sentence** = [('However', 'WRB'), (',', ','), ('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('it', 'PPS'), ('believes', 'VBZ'), ('``', '``'), ('these', 'DTS'), ('two', 'CD'), ('offices', 'NNS'), ('should', 'MD'), ('be', 'BE'), ('combined', 'VBN'), ('to', 'TO'), ('achieve', 'VB'), ('greater', 'JJR'), ('efficiency', 'NN'), ('and', 'CC'), ('reduce', 'VB'), ('the', 'AT'), ('cost', 'NN'), ('of', 'IN'), ('administration', 'NN'), ("''", "''"), ('.', '.')]

*   cp.parse(sent) = tree = (S
  However/WRB
  ,/,
  the/AT
  jury/NN
  said/VBD
  it/PPS
  believes/VBZ
  ``/``
  these/DTS
  two/CD
  offices/NNS
  should/MD
  be/BE
  **(CHUNK combined/VBN to/TO achieve/VB)**
  greater/JJR
  efficiency/NN
  and/CC
  reduce/VB
  the/AT
  cost/NN
  of/IN
  administration/NN
  ''/''
  ./.)
  

*   if subtree.label() == 'CHUNK' : print(subtree) ==> (CHUNK serve/VB to/TO protect/VB)





In [0]:
import nltk
nltk.download('brown')
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():       
  tree = cp.parse(sent)
  for subtree in tree.subtrees():
    if subtree.label() == 'CHUNK':  
      print(sent), print(tree),print(subtree)
    

## Chinking 
Sometimes it is easier to define what we want to exclude from a chunk. :)

Define a chink to be a sequence of tokens that is not included in a chunk such as barked/VBD at/IN is a chink:  }< VBD | IN >+{  

In [0]:
import nltk
grammar = r"""
            CHUNK_NP:
            {<.*>+}      # Chunk everything
            }<VBD|IN>+{  # Chink sequences of VBD and IN
          """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

## Representing Chunks: Tags vs Trees

เราสามารถ represent Chunk โดย ใช้ IOB tags 

*   B - Begin
*   O - Outside
*   I - Inside

ไปเรื่อยๆ ถ้าเริ่ม chunk ใหม่ เราก็เริ่ม B- แล้วก็ I ไป ถ้าไม่ใช่ Chunk ก็ O ไป 

> We PRP B-NP

> saw VBD O

> the DT B-NP

> yellow JJ I-NP

> dog NN I-NP


# Developing and Evaluating Chunkers


> **The CoNLL 2000 corpus** contains 270k words of Wall Street Journal
text, divided into "train" and "test" portions, annotated with part-of-speech tags and chunk tags in the IOB format.



> ** Read the 100th sentence of the "train" portion of the corpus:**



In [0]:
import nltk
nltk.download('conll2000')
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])

CoNLL 2000 corpus contains three chunk types: 
*   NP chunk =>  a couple 
*   VP chunk => to give
*   PP chunk => because of

> we can use the chunk_types argument to select them:


In [0]:
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])

## Simple Evaluation and Baselines

### establishing a baseline

ไม่ใส่กฎอะไรเลย ดูความถูกต้องทั้งหมด (IOB Accuracy) ดูว่าทาย NP ถูกเท่าไรด้วย Precision, Recall, F-Measure 

In [0]:
from nltk.corpus import conll2000
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

*The IOB tag accuracy indicates that more than a third of the words are tagged with O, i.e. not in an NP chunk. However, since our tagger did not find
any chunks, its precision, recall, and f-measure are all zero.*

### Naive tagger, look for all tags in NP chunk
that looks for tags beginning with letters
that are characteristic of noun phrase tags (e.g. CD, DT, and JJ).

In [0]:
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

### Noun Phrase Chunking with a Unigram Tagger

> เราเอา Unigram Tagger มา predict   IOB Chunk tags โดยบอก POS ด้วย  
> (UnigramTagger ดูว่าคำนั้นเกิดเป็น Type ไหนเยอะสุด ก็บอกเป็น Type นั้นเลย ไม่ดู Context)


In [0]:
class UnigramChunker(nltk.ChunkParserI):
  def __init__(self, train_sents):
    train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                    for sent in train_sents]
    self.tagger = nltk.UnigramTagger(train_data)

  def parse(self, sentence):
    pos_tags = [pos for (word,pos) in sentence]
    tagged_pos_tags = self.tagger.tag(pos_tags)
    chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
    conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                  in zip(sentence, chunktags)]
    return nltk.chunk.conlltags2tree(conlltags)

sent :
> < Tree, len() = 15>

> _label:'S'

> [0]:('At', 'IN')

>[1]:Tree('NP', [('the', 'DT'), ('same', 'JJ'), ('time', 'NN')])

>[2]:(',', ',')[3]:Tree('NP', [('he', 'PRP')])


using tree2conlltags to map each chunk tree to a list of word,tag,chunk triples


> nltk.chunk.tree2conlltags(sent) = 
[('At', 'IN', 'O'), ('the', 'DT', 'B-NP'), ('same', 'JJ', 'I-NP'), ('time', 'NN', 'I-NP'),



In [0]:
# use unigram tagger to find the IOB tag given its POS tag
from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])
train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
unigram_chunker = UnigramChunker(train_sents)
print( unigram_chunker.evaluate(test_sents))

ดูว่า Model บอกอะไรเรา

In [0]:
postags = sorted(set(pos for sent in train_sents
                         for (word, pos) in sent.leaves()))
print( unigram_chunker.tagger.tag(postags))


*   most punctuation marks occur outside of NP chunks except  \# and \$, which are used as currency markers.
*   determiners (DT) and possessives (PRP \$ and WP \$) occur at the beginnings of NP chunks
*   noun types (NN, NNP, NNPS, NNS) mostly occur inside of NP chunks.



### bigram chunker

* Change One Line     self.tagger = nltk.UnigramTagger(train_data) -> self.tagger = nltk.BigramTagger(train_data) 
* increase accuracy a bit


In [0]:
class BigramChunker(nltk.ChunkParserI):
  def __init__(self, train_sents):
    train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                    for sent in train_sents]
    self.tagger = nltk.BigramTagger(train_data)  

  def parse(self, sentence):
    pos_tags = [pos for (word,pos) in sentence]
    tagged_pos_tags = self.tagger.tag(pos_tags)
    chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
    conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                  in zip(sentence, chunktags)]
    return nltk.chunk.conlltags2tree(conlltags)

In [69]:
# use bigram tagger to find the IOB tag given its POS tag
from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])
train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.3%%
    Recall:        86.8%%
    F-Measure:     84.5%%


All chunks are created entirely based on part-of-speech tags. How about these !!! Same POS but Diff Chunk


*   Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.                ======= [The farmer][rice]
*   Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.          ======== [my computer monitor]

The content of the words is needed to maximize chunking performance.

### Training Classifier-Based Chunkers

เอา word มาด้วย

In [0]:
def _npchunk_features(sentence, i, history):
  features = {}
  word, pos = sentence[i]
  features["pos"] = pos
  # add previous POS tag
  prevword, prevpos = "<START>", "<START>" if i == 0 else sentence[i-1]
  features["prevpos"] = prevpos
  # add current word
  features["word"] = word
  # more features
  nextword, nextpos = "<END>", "<END>" if i == len(sentence) - 1 else sentence[i+1]
  features["nextpos"] = nextpos
  features["prevpos+pos"] = "%s+%s" % (prevpos, pos)
  features["pos+nextpos"] = "%s+%s" % (pos, nextpos)
  # tags since last determiner
  tags_since_dt = set()
  for word, pos in sentence[:i]:
    if pos == "DT":
      tags_since_dt = set()
    else:
      tags_since_dt.add(pos)
  features["tags_since_dt"] = "+".join(sorted(tags_since_dt))
  return features

class ConsecutiveNPChunkTagger(nltk.TaggerI):
  def __init__(self, train_sents):
    train_set = []
    for tagged_sent in train_sents:
      untagged_sent = nltk.tag.untag(tagged_sent)
      history = []
      for i, (word, tag) in enumerate(tagged_sent):
        featureset = _npchunk_features(untagged_sent, i, history)
        train_set.append((featureset, tag))
        history.append(tag)
    self.classifier = nltk.MaxentClassifier.train(train_set,
      algorithm="GIS", trace=0)

  def tag(self, sentence):
    history = []
    for i, word in enumerate(sentence):
      featureset = _npchunk_features(sentence, i, history)
      tag = self.classifier.classify(featureset)
      history.append(tag)
    return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):
  def __init__(self, train_sents):
    tagged_sents = [[((w,t),c) for (w,t,c) in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
    self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

  def parse(self, sentence):
    tagged_sents = self.tagger.tag(sentence)
    conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
    return nltk.chunk.conlltags2tree(conlltags)

In [73]:
from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])
train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
chunker = ConsecutiveNPChunker(train_sents)
print( chunker.evaluate(test_sents))

      Training stopped: keyboard interrupt
ChunkParse score:
    IOB Accuracy:  95.9%%
    Precision:     88.0%%
    Recall:        91.7%%
    F-Measure:     89.8%%


# Recursion in Linguistic Structure

A multi-stage chunk grammar containing recursive rules

In [77]:
grammar = r"""
    NP : {<DT|JJ|NN.*>+}    # chunk sentences of DT,JJ,NN
    PP : {<IN><NP>}         # chunk preposition followed by NP
    VP : {<VB.*><NP|PP|CLAUSE>+$}  # chunk verb and their argument
    CLAUSE : {<NP><VP>}     # chunk NP,VP
  """
cp = nltk.RegexpParser(grammar , loop =2) # parses sentence multiple times
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
    ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print( cp.parse(sentence))
  

(S
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))


In [75]:
sentence2 = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NNP"),
    ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print( cp.parse(sentence2))

(S
  (NP John/NNP)
  thinks/VBZ
  (CLAUSE
    (NP Mary/NNP)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))


loop to specify the number of times the set of patterns should be run ครั้งเดียว VP หายตรง saw, think

# Named Entity Recognition

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the
parameter binary=True , then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON,
ORGANIZATION, and LOCATION/GPE.

In [81]:
import nltk
nltk.download('treebank')
nltk.download('maxent_ne_chunker')
nltk.download('words')
sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent, binary=True))

[nltk_data] Downloading package treebank to /content/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /content/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /content/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
(S
  The/DT
  (NE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (NE Brooke/NNP)
  T./NNP
  Mossman/NNP
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (NE University/NNP)
  of/IN
  (NE Vermont/NNP College/NNP)
  of/IN


In [82]:
sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent))

(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (PERSON Vermont/NNP College/NNP)
  of/IN
  (GPE Medicine/NNP)
  ./.)


# Relation Extraction

Once NE have been identified in a text, we then want to extract the relations that exist between them.


* \bword\b - word boundary
* Negative Lookahead (?!     )
* .+ matches any character 




In [90]:
import re
nltk.download('ieer')
# IN = re.compile(r'.*\bin\b(?!\b.+ing)')
IN = re.compile(r'.*\bin\b')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
  for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern = IN):
    print(nltk.sem.rtuple(rel))

[nltk_data] Downloading package ieer to /content/nltk_data...
[nltk_data]   Package ieer is already up-to-date!
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
