# Extracting Information from Text

In [31]:
import nltk, re, pprint
from nltk.corpus import conll2000

The process of extracting structured meaning from text is called Information Extraction.<br>
Raw Text --> [Sentence Segmentation] --> [Tokenization] --> [POS Tagging] --> [Entity Detection] --> [Relation Detection] --> Relations(list of tuples)<br>
>First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity detection. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation detection to search for likely relations between different entities in the text.

In [2]:
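def ie_preprocess(document):
    """Segment, tokenize and POS-tag a raw document."""
    sentences = nltk.sent_tokenize(document)                      # split into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split into words
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # add POS tags
    return sentences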

### 2. Chunking
The basic technique we will use for entity detection is chunking, which segments and labels multi-token sequences as illustrated in 2.1. The smaller boxes show the word-level tokenization and part-of-speech tagging, while the large boxes show higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the pieces produced by a chunker do not overlap in the source text.

##### Noun Phrase Chunking
We search for chunks corresponding to individual noun phrases.<br>
[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.

 In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular-expression rule [2]. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Using this grammar, we create a chunk parser [3], and test it on our example sentence [4]. The result is a tree, which we can either print [5], or display graphically [6].

In [3]:
sent = 'the little yellow dog barked at the cat'
tags = nltk.pos_tag(nltk.word_tokenize(sent))

In [4]:
tags

[('the', 'DT'),
 ('little', 'JJ'),
 ('yellow', 'JJ'),
 ('dog', 'NN'),
 ('barked', 'VBD'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('cat', 'NN')]

In [16]:
grammar = 'NP: {<DT>?<JJ>*<NN>}'  # tag pattern

cp = nltk.RegexpParser(grammar)
result = cp.parse(tags)
print(result)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [13]:
result.draw()

##### Tag Patterns
The rules that make up a chunk grammar use tag patterns to describe sequences of tagged words. A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g. `<DT>?<JJ>*<NN>`. Tag patterns are similar to regular expression patterns.
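
One way to picture how a tag pattern behaves is to translate it into an ordinary regular expression over a string of bracketed tags. The sketch below does this for simple patterns. `tag_pattern_to_regex` and `match_tag_pattern` are hypothetical helpers written for illustration; they are not how NLTK implements `RegexpParser`.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Translate a simple tag pattern such as '<DT>?<JJ>*<NN>' into a regex."""
    def convert(match):
        # Inside a tag, '.' stands for any character except an angle bracket.
        inner = match.group(1).replace('.', '[^<>]')
        # Group the whole tag so that ?, * and + apply to it as a unit.
        return '(?:<(?:%s)>)' % inner
    return re.sub(r'<([^<>]+)>', convert, tag_pattern.replace(' ', ''))

def match_tag_pattern(tagged_sentence, tag_pattern):
    """Return the token sequences whose tags match the pattern, left to right."""
    tag_string = ''.join('<%s>' % tag for (word, tag) in tagged_sentence)
    regex = re.compile(tag_pattern_to_regex(tag_pattern))
    matches = []
    for m in regex.finditer(tag_string):
        start = tag_string[:m.start()].count('<')   # token index where the match begins
        length = m.group().count('<')               # number of tokens matched
        if length:
            matches.append(tagged_sentence[start:start + length])
    return matches

sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
for tokens in match_tag_pattern(sentence, '<DT>?<JJ>*<NN>'):
    print(tokens)
```

The quantifiers `?`, `*` and `+` keep their usual regular-expression meaning, but they apply to whole tags rather than to single characters.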

##### Chunking with Regular Expressions
To find the chunk structure for a given sentence, the RegexpParser chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.

The first rule matches an optional determiner or possessive pronoun, zero or more adjectives, then a noun. The second rule matches one or more proper nouns. We also define an example sentence to be chunked [1], and run the chunker on this input [2].

In [7]:
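# nltk.pos_tag() uses the Penn Treebank tag PRP$ for possessive pronouns
# (the Brown-style tag PP$ would never match its output).
grammar = r"""
    NP: {<DT|PRP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and noun
        {<NNP>+}                 # sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = 'Rapunzel let down her long golden hair'
tags = nltk.pos_tag(nltk.word_tokenize(sentence))
chunked = cp.parse(tags)
print(chunked)

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PRP$ long/JJ golden/JJ hair/NN))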


If a tag pattern matches at overlapping locations, the leftmost match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked:

In [14]:
nouns = [('money', 'NN'), ('market', 'NN'), ('fund', 'NN')]
grammar = 'NP: {<NN><NN>}'  # chunk two consecutive nouns
cp = nltk.RegexpParser(grammar)
chunked = cp.parse(nouns)
print(chunked)

(S (NP money/NN market/NN) fund/NN)
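
The same left-to-right behaviour can be observed with Python's own `re` module, which also returns non-overlapping matches starting from the left. This is only an analogy on a string of bracketed tags, not NLTK's internal code.

```python
import re

tags = '<NN><NN><NN>'                    # money/NN market/NN fund/NN
two_nouns = re.compile(r'(?:<NN>){2}')   # two consecutive nouns

# finditer scans from the left and never reuses text that has already matched
spans = [m.span() for m in two_nouns.finditer(tags)]
print(spans)
```

Only one match is found, covering the first two tags; the third noun is left over because a single noun cannot satisfy the pattern.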


In [15]:
chunked.draw()

##### Exploring Text Corpora
We saw how we could interrogate a tagged corpus to extract phrases matching a particular sequence of part-of-speech tags. We can do the same work more easily with a chunker, as follows:

In [23]:
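# Only the first <V.*> is inside the braces, so only that verb is chunked;
# the trailing <TO><V.*> acts as required right-hand context.
# Moving the closing brace to the end, {<V.*> <TO> <V.*>}, would chunk all three tokens.
cp = nltk.RegexpParser('CHUNK: {<V.*>}<TO><V.*>')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)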

(CHUNK combined/VBN)
(CHUNK continue/VB)
(CHUNK serve/VB)
(CHUNK wanted/VBD)
(CHUNK allowed/VBN)
(CHUNK expected/VBN)
(CHUNK expected/VBN)
(CHUNK expected/VBN)
(CHUNK intends/VBZ)
(CHUNK seek/VB)
(CHUNK like/VB)
(CHUNK designed/VBN)
(CHUNK get/VB)
(CHUNK expects/VBZ)
(CHUNK expected/VBN)
(CHUNK prefer/VB)
(CHUNK required/VBN)
(CHUNK permitted/VBN)
(CHUNK designed/VBN)
(CHUNK Asked/VBN)
(CHUNK got/VBN)
(CHUNK raised/VBN)
(CHUNK scheduled/VBN)
(CHUNK cut/VBN)
(CHUNK needed/VBN)
(CHUNK hastened/VBD)
(CHUNK found/VBN)
(CHUNK continue/VB)
(CHUNK compelled/VBN)
(CHUNK made/VBN)
(CHUNK revamped/VBN)
(CHUNK want/VB)
(CHUNK appear/VB)
(CHUNK fails/VBZ)
(CHUNK plans/VBZ)
(CHUNK going/VBG)
(CHUNK plans/VBZ)
(CHUNK come/VBN)
(CHUNK voted/VBD)
(CHUNK happens/VBZ)
(CHUNK authorized/VBN)
(CHUNK hesitated/VBN)
(CHUNK try/VB)
(CHUNK decided/VBN)
(CHUNK taken/VBN)
(CHUNK left/VBN)
(CHUNK stand/VB)
(CHUNK decided/VBN)
(CHUNK trying/VBG)
(CHUNK proposing/VBG)
(CHUNK decided/VBN)
(CHUNK directed/VBN)
...

Encapsulate the above example inside a function find_chunks() that takes a chunk string like "CHUNK: {<V.*> <TO> <V.*>}" as an argument. Use it to search the corpus for several other patterns, such as four or more nouns in a row, e.g. "NOUNS: {<N.*>{4,}}".

##### Chinking
Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk. In the following example, barked/VBD at/IN is a chink:

 [ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]

Chinking is the process of removing a sequence of tokens from a chunk. If the matching sequence of tokens spans an entire chunk, then the whole chunk is removed; if the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before. If the sequence is at the periphery of the chunk, these tokens are removed, and a smaller chunk remains.
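
The three cases can be sketched in plain Python. This is a simplified illustration that chinks individual tokens by tag rather than by a full tag pattern; `chink` is a hypothetical helper, not part of NLTK.

```python
from itertools import groupby

def chink(chunk, chink_tags):
    """Remove tokens tagged with any of chink_tags and return the chunks left over."""
    remaining = []
    # groupby splits the chunk into alternating runs of kept and chinked tokens
    for is_chink, run in groupby(chunk, key=lambda token: token[1] in chink_tags):
        if not is_chink:
            remaining.append(list(run))
    return remaining

chunk = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
         ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

print(chink(chunk, {'VBD', 'IN'}))        # chink in the middle: two chunks remain
print(chink(chunk[4:6], {'VBD', 'IN'}))   # chink spans the whole chunk: nothing remains
print(chink(chunk[3:6], {'VBD', 'IN'}))   # chink at the edge: a smaller chunk remains
```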

In [24]:
grammar = r"""
            NP: {<.*>+} #Chunk Everything
            }<VBD|IN>+{ #Chink sequences of VBD and IN
            """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [25]:
cp.parse(sentence).draw()

##### Representing Chunks: Tags vs Trees

As befits their intermediate status between tagging and parsing (8.), chunk structures can be represented using either tags or trees. The most widespread file representation uses **IOB tags**. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O.

IOB tags have become the standard way to represent chunk structures in files.

Here is how the information would appear in a file:

We PRP B-NP<br>
saw VBD O<br>
the DT B-NP<br>
yellow JJ I-NP<br>
dog NN I-NP<br>

>NLTK uses trees for its internal representation of chunks, but provides methods for reading and writing such trees to the IOB format.
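
To make the scheme concrete, here is a minimal sketch that groups IOB-tagged rows back into chunks. `iob_to_chunks` is a hypothetical helper for illustration, and it assumes that an `I-` tag with no open chunk of the same type simply starts a new chunk. In practice NLTK's own conversion functions would be used instead.

```python
def iob_to_chunks(rows):
    """Group (word, pos, iob_tag) rows into (chunk_type, [words]) pairs."""
    chunks = []
    current = None                           # the chunk being built, if any
    for word, pos, iob_tag in rows:
        prefix, _, chunk_type = iob_tag.partition('-')
        if prefix == 'I' and current is not None and current[0] == chunk_type:
            current[1].append(word)          # continue the open chunk
        elif prefix in ('B', 'I'):
            current = (chunk_type, [word])   # start a new chunk (assumption for stray I- tags)
            chunks.append(current)
        else:
            current = None                   # 'O': outside any chunk
    return chunks

rows = [('We', 'PRP', 'B-NP'), ('saw', 'VBD', 'O'), ('the', 'DT', 'B-NP'),
        ('yellow', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP')]
print(iob_to_chunks(rows))
```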

### Developing and Evaluating Chunkers

##### Reading IOB Format and the CoNLL 2000 Corpus

Using the corpus module we can load Wall Street Journal text that has been tagged then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP and  PP. As we have seen, each sentence is represented using multiple lines, as shown below:

he PRP B-NP<br>
accepted VBD B-VP<br>
the DT B-NP<br>
position NN I-NP<br>
...

A conversion function chunk.conllstr2tree() builds a tree representation from one of these multi-line strings. Moreover, it permits us to choose any subset of the three chunk types to use, here just for NP chunks:

In [29]:
# Each line must start at the left margin: a leading space or a leftover
# "... " prompt makes conllstr2tree() raise a ValueError.
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
'''
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

In [33]:
print(conll2000.chunked_sents('train.txt')[99])  # prints the 100th sentence

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)


In [34]:
print(type(conll2000.chunked_sents('train.txt')[99]))  # each chunked sentence is a Tree

<class 'nltk.tree.Tree'>


As you can see, the CoNLL 2000 corpus contains three chunk types: NP chunks, which we have already seen; VP chunks such as *has already delivered*; and PP chunks such as *because of*. Since we are only interested in the NP chunks right now, we can use the chunk_types argument to select them:



In [37]:
print(conll2000.chunked_sents('train.txt', chunk_types = ['NP'])[99])

(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)


##### Simple Evaluation and Baselines

In [40]:
# Baseline: a parser with an empty grammar creates no chunks at all
cp = nltk.RegexpParser('')
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  43.4%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%
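
The precision, recall and F-measure figures in these reports compare guessed chunks against gold-standard chunks. The sketch below shows the arithmetic under the assumption that each chunk is represented as a `(start, end)` token span, a representation chosen here only for illustration. `chunk_scores` is a hypothetical helper, and NLTK does its own bookkeeping internally.

```python
def chunk_scores(gold_chunks, guessed_chunks):
    """Return (precision, recall, f_measure) for two collections of chunk spans."""
    gold, guessed = set(gold_chunks), set(guessed_chunks)
    correct = len(gold & guessed)                        # chunks found exactly
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure as the harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

gold = [(0, 2), (3, 5), (6, 7)]

print(chunk_scores(gold, []))                  # a chunker that proposes nothing
print(chunk_scores(gold, [(0, 2), (3, 4)]))    # one of two guesses is correct
```

A chunker that proposes no chunks scores zero on all three measures, even though many of its token-level `O` tags are correct. This mirrors the gap between IOB accuracy and the other scores in the baseline report.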


In [41]:
# Chunk any run of tags that start with C, D, J, N or P (e.g. CD, DT, JJ, NN, PRP)
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%


As you can see, this approach achieves decent results. However, we can improve on it by adopting a more data-driven approach, where we use the training corpus to find the chunk tag (I, O, or B) that is most likely for each part-of-speech tag. In other words, we can build a chunker using a unigram tagger (4). But rather than trying to determine the correct part-of-speech tag for each word, we are trying to determine the correct chunk tag, given each word's part-of-speech tag.
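
The core idea, picking the most frequent chunk tag for each part-of-speech tag, can be sketched without NLTK. The toy training data and the helper names `train_unigram_chunk_tags` and `tag_pos_sequence` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_unigram_chunk_tags(training_sentences):
    """Map each POS tag to the chunk tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sentence in training_sentences:
        for pos, chunk_tag in sentence:
            counts[pos][chunk_tag] += 1
    return {pos: tag_counts.most_common(1)[0][0] for pos, tag_counts in counts.items()}

def tag_pos_sequence(model, pos_tags):
    """Assign a chunk tag to each POS tag; unseen POS tags get None."""
    return [model.get(pos) for pos in pos_tags]

# Toy training data: sentences reduced to (POS tag, chunk tag) pairs
training = [
    [('PRP', 'B-NP'), ('VBD', 'O'), ('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP')],
    [('DT', 'B-NP'), ('NN', 'I-NP'), ('VBD', 'O'), ('NN', 'B-NP')],
]
model = train_unigram_chunk_tags(training)
print(tag_pos_sequence(model, ['DT', 'NN', 'VBD']))
```

Because the model sees only one POS tag at a time, a noun is always assigned its majority tag (`I-NP` in this toy data), even when it actually begins a chunk.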

We define the **UnigramChunker** class, which uses a unigram tagger to label sentences with **chunk tags.** Most of the code in this class is simply used to convert back and forth between the chunk tree representation used by NLTK's **ChunkParserI interface**, and the IOB representation used by the embedded tagger. The class defines two methods: a constructor [1] which is called when we build a new UnigramChunker; and the parse method [3] which is used to chunk new sentences.

In [44]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Reduce each chunk tree to (POS tag, chunk tag) pairs for the tagger
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)
    
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                    in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

The constructor [1] expects a list of training sentences, which will be in the form of chunk trees. It first converts training data to a form that is suitable for training the tagger, using  tree2conlltags to map each chunk tree to a list of word,tag,chunk triples. It then uses that converted training data to train a unigram tagger, and stores it in self.tagger for later use.

The parse method [3] takes a tagged sentence as its input, and begins by extracting the part-of-speech tags from that sentence. **It then tags the part-of-speech tags with IOB chunk tags, using the tagger self.tagger that was trained in the constructor**. Next, it extracts the chunk tags, and combines them with the original sentence, to yield conlltags. Finally, it uses conlltags2tree to convert the result back into a chunk tree.

In [46]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types = ['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types = ['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%


In [47]:
postags = sorted(set(pos for sent in train_sents 
                     for (word, pos) in sent.leaves()))
print(unigram_chunker.tagger.tag(postags))

[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]


It has discovered that most punctuation marks occur outside of NP chunks, with the exception of `#` and `$`, both of which are used as currency markers. It has also found that determiners (`DT`) and possessives (`PRP$` and `WP$`) occur at the beginnings of NP chunks, while noun types (`NN`, `NNP`, `NNPS`, `NNS`) mostly occur inside of NP chunks.

Having built a unigram chunker, it is quite easy to build a bigram chunker: we simply change the class name to BigramChunker and construct a BigramTagger rather than a UnigramTagger in the constructor.

In [50]:
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [51]:
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.3%%
    Recall:        86.8%%
    F-Measure:     84.5%%


##### Training Classifier-Based Chunkers

Both the regular-expression based chunkers and the n-gram chunkers decide what chunks to create entirely based on part-of-speech tags. However, sometimes part-of-speech tags are insufficient to determine how a sentence should be chunked. For example, consider the following two statements:

(3)		
a.		Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.

b.		Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

These two sentences have the same part-of-speech tags, yet they are chunked differently. In the first sentence, the farmer and rice are separate chunks, while the corresponding material in the second sentence, the computer monitor, is a single chunk. Clearly, we need to make use of information about the content of the words, in addition to just their part-of-speech tags, if we wish to maximize chunking performance.

One way that we can incorporate information about the content of words is to use a classifier-based tagger to chunk the sentence. Like the n-gram chunker considered in the previous section, this classifier-based chunker will work by assigning IOB tags to the words in a sentence, and then converting those tags to chunks. For the classifier-based tagger itself, we will use the same approach that we used in 1 to build a part-of-speech tagger.

The basic code for the classifier-based NP chunker is shown. It consists of two classes. The first class [1] is almost identical to the ConsecutivePosTagger class from 1.5. The only two differences are that it calls a different feature extractor [2] and that it uses a MaxentClassifier rather than a NaiveBayesClassifier [3]. The second class [4] is basically a wrapper around the tagger class that turns it into a chunker. During training, this second class maps the chunk trees in the training corpus into tag sequences; in the parse() method, it converts the tag sequence provided by the tagger back into a chunk tree.


In [61]:
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            history = []
            untagged_sent = nltk.tag.untag(tagged_sent)
            for i, (word, tag) in enumerate(tagged_sent):
                feature_set = npchunk_feature(untagged_sent, i, history)
                train_set.append((feature_set, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(train_set,
                                            algorithm = 'iis', trace =0)
        
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            feature_set = npchunk_feature(sentence, i, history)
            tag = self.classifier.classify(feature_set)
            history.append(tag)
        return list(zip(sentence, history))  # materialize: zip is a one-shot iterator in Python 3

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w,t), c) for (w, t, c) in
                        nltk.chunk.tree2conlltags(sent)]
                       for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)
        
    def parse(self, sentence):
        tagged_sents =self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

In [62]:
def npchunk_feature(sentence, i, history):
    """Feature extractor that returns only the POS tag
    of the word at index i in the sentence."""
    word, pos = sentence[i]
    return {'pos': pos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%


In [None]:
def npchunk_feature(sentence, i, history):
    """Feature extractor that returns the POS tag of the word
    at index i together with the POS tag of the previous word."""
    word, pos = sentence[i]
    if i==0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sentence[i-1]
    return {'pos':pos, 'prevpos':prevpos}


chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

In [None]:
def npchunk_feature(sentence, i, history):
    """Feature extractor that returns the word and POS tag at index i
    together with the POS tag of the previous word."""
    word, pos = sentence[i]
    if i==0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sentence[i-1]
    return {'pos':pos, 'word': word, 'prevpos':prevpos}


chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

Finally, we can try extending the feature extractor with a variety of additional features, such as lookahead features [1], paired features [2], and complex contextual features [3]. This last feature, called tags-since-dt, creates a string describing the set of all part-of-speech tags that have been encountered since the most recent determiner, or since the beginning of the sentence if there is no determiner before index i.

In [None]:
def npchunk_feature(sentence, i, history):
    """Feature extractor with lookahead, paired and contextual features."""
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sentence[i+1]

    return {'pos': pos,
            'prevpos': prevpos,
            'word': word,
            'nextpos': nextpos,                            # lookahead feature
            'prevpos+pos': '%s+%s' % (prevpos, pos),       # paired features
            'pos+nextpos': '%s+%s' % (pos, nextpos),
            'tags-since-dt': tags_since_dt(sentence, i)}   # contextual feature


In [64]:
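def tags_since_dt(sentence, i):
    """Join the POS tags seen since the most recent determiner before index i."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()   # a determiner resets the collected tags
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))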
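
To see what the tags-since-dt feature looks like in practice, the function can be run on the example sentence used earlier. The definition is repeated here only so that the snippet runs on its own.

```python
def tags_since_dt(sentence, i):
    """Join the POS tags seen since the most recent determiner before index i."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()   # a determiner resets the collected tags
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

# Print the feature value for every position in the sentence
for i, (word, pos) in enumerate(sentence):
    print(i, word, repr(tags_since_dt(sentence, i)))
```

The value is empty at the start of the sentence and directly after a determiner, and it accumulates tags in sorted order in between, for example `'IN+JJ+NN+VBD'` at the second "the".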

In [None]:
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))