# Extracting Information from Text

## Information Extraction

Information can be extracted from structured or unstructured data. In structured data, there is a predictable organization of entities and relationships. 

Let's look at an example of extracting information about companies and their locations, assuming the data is stored in Python as a list of tuples of the form `(entity, relation, entity)`. Using this list, we'll try to answer the question "Which organizations operate in Atlanta?"

In [10]:
import nltk
import re
import pprint

In [11]:
locs = [('Omnicom', 'IN', 'New York'),
         ('DDB Needham', 'IN', 'New York'),
         ('Kaplan Thaler Group', 'IN', 'New York'),
         ('BBDO South', 'IN', 'Atlanta'),
         ('Georgia-Pacific', 'IN', 'Atlanta')]

In [12]:
query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']
print(query)

['BBDO South', 'Georgia-Pacific']
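
The same list can answer the question for every location at once. The following is a minimal sketch, not part of the original notebook: the helper name `orgs_by_location` is hypothetical, and it assumes every tuple has the form `(entity, relation, entity)`.

```python
from collections import defaultdict

# Same (entity, relation, entity) tuples as in the cell above.
locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]

def orgs_by_location(triples, relation='IN'):
    """Group first entities by second entity for one relation (hypothetical helper)."""
    index = defaultdict(list)
    for e1, rel, e2 in triples:
        if rel == relation:
            index[e2].append(e1)
    return dict(index)

index = orgs_by_location(locs)
print(index['Atlanta'])
```

Building the index once costs a single pass over the list, after which each location lookup is a dictionary access rather than a fresh scan.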


Over the next sections of this notebook, we'll convert raw text into a more structured form so that information extraction becomes possible.

The first three steps of this task are sentence segmentation, tokenization, and part-of-speech tagging.

Let's define a function that performs them.

In [13]:
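def ie_preprocess(document):
    # Split the raw text into sentences, then each sentence into word tokens.
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    # Tag each token with its part of speech and return the tagged sentences.
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences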

## Chunking

Chunking is a technique for entity detection: it segments a sentence into multi-token sequences and labels each one.

### Noun Phrase Chunking

In noun phrase chunking (NP chunking), we search for chunks corresponding to individual noun phrases.

Let's construct a simple NP-chunker by defining a chunk grammar.

In [14]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

In [15]:
grammar = "NP: {<DT>?<JJ>*<NN>}"
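
The rule says that an NP chunk is an optional determiner (`DT`), followed by any number of adjectives (`JJ`), followed by a noun (`NN`). To see what that matches, it can help to treat the tag sequence as a string and apply an ordinary regular expression to it. The sketch below only illustrates the idea with Python's `re` module; it is not a replacement for a real chunker, and the names `NP_PATTERN` and `np_spans` are hypothetical.

```python
import re

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# The tag pattern <DT>?<JJ>*<NN> written as a plain regex over "<TAG>" strings.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

def np_spans(tagged):
    """Return the word lists matched by NP_PATTERN (hypothetical helper)."""
    tags = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for m in NP_PATTERN.finditer(tags):
        # Convert character offsets back to token indices by counting "<".
        start = tags[:m.start()].count("<")
        end = start + m.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

print(np_spans(sentence))
```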

In [16]:
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)

In [17]:
print(result)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
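
Once the sentence has been parsed, the chunks themselves can be read back out of the tree. The following is a minimal sketch that assumes the same grammar and sentence as above; the variable name `noun_phrases` is introduced here for illustration only.

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
result = cp.parse(sentence)

# Keep only the subtrees labeled NP and join the words of each one.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in result.subtrees(filter=lambda t: t.label() == "NP")]
print(noun_phrases)
```

Tokens outside any chunk, such as `barked` and `at`, stay as direct children of the `S` node and are skipped by the label filter.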
