### Reference

http://www.nltk.org/book/ch07.html

### Information Extraction from structured data

A structured source resembles a database table that stores exactly the information of interest, for example:

| Org Name            | Location |
|---------------------|----------|
| Omnicom             | New York |
| DDB Needham         | New York |
| Kaplan Thaler Group | New York |
| BBDO South          | Atlanta  |
| Georgia-Pacific     | Atlanta  |
      
      
      
      

In [2]:
locs = [('Omnicom', 'IN', 'New York'),('DDB Needham', 'IN', 'New York'),
         ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
         ('Georgia-Pacific', 'IN', 'Atlanta')]

# Each tuple is (organization, relation, location)

# Which organizations operate in Atlanta?

query = [org for (org, rel, loc) in locs if loc == 'Atlanta']
query



['BBDO South', 'Georgia-Pacific']
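
Because the data is already structured, other questions reduce to simple operations on the same tuples. As a small sketch (the name `orgs_by_location` is introduced here for illustration only), the list can be grouped by location:

```python
from collections import defaultdict

locs = [('Omnicom', 'IN', 'New York'), ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]

# Map each location to the organizations recorded for it.
orgs_by_location = defaultdict(list)
for org, _rel, loc in locs:
    orgs_by_location[loc].append(org)

print(dict(orgs_by_location))
```
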

Information extraction usually has to deal with unstructured data, as in the following passage:


  **(1) The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.**
  
  
  Approaches include:
  ____________________________

  1) Building a general representation of meaning

  2) Converting unstructured data to structured data
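
To make the second approach concrete, here is a deliberately naive sketch that turns one sentence of the passage into the same `(organization, 'IN', location)` tuples used earlier. It relies only on Python's `re` module and on the assumption that both entities are runs of capitalized words joined by " in "; the helper name `CAP` is hypothetical, and real systems use the tagging and chunking pipeline described in the rest of these notes rather than a surface pattern like this.

```python
import re

text = ("BBDO South in Atlanta, which handles corporate advertising for "
        "Georgia-Pacific, will assume additional duties for brands like "
        "Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, "
        "a spokesman for Georgia-Pacific in Atlanta.")

# A run of capitalized words (hyphens allowed), e.g. "BBDO South".
CAP = r"[A-Z][\w-]*(?: [A-Z][\w-]*)*"

# Assumption: "<capitalized run> in <capitalized run>" signals an IN relation.
pattern = re.compile(rf"({CAP}) in ({CAP})")

relations = [(org, 'IN', loc) for org, loc in pattern.findall(text)]
print(relations)
# [('BBDO South', 'IN', 'Atlanta'), ('Georgia-Pacific', 'IN', 'Atlanta')]
```

The pattern would just as readily match a person and a month ("Haldin in March"), which is why entity types need to be recognized before relations are extracted.
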



### Block Diagram of Information Extraction

  ![alt-text](http://www.nltk.org/images/ie-architecture.png )
  
  Relation detection identifies the relations between the entities recognized during named entity recognition (NER).
  
  
  The first three stages (sentence segmentation, tokenization, and part-of-speech tagging) form the preprocessing phase, which can be done with the code below:

In [0]:
import nltk

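def pre_process(document):
  # Split the document into sentences, tokenize each sentence into words,
  # then tag every word with its part of speech (one tagged list per sentence).
  sentences = nltk.sent_tokenize(document)
  words = [nltk.word_tokenize(sent) for sent in sentences]
  pos_tagged = [nltk.pos_tag(sent_words) for sent_words in words]
  return pos_tagged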

In [4]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [5]:
# Testing with random sentence 
document = "The phone comes with an in-display fingerprint scanner and a water drop notch on the display."
print(pre_process(document))

[[('The', 'DT'), ('phone', 'NN'), ('comes', 'VBZ'), ('with', 'IN'), ('an', 'DT'), ('in-display', 'JJ'), ('fingerprint', 'NN'), ('scanner', 'NN'), ('and', 'CC'), ('a', 'DT'), ('water', 'NN'), ('drop', 'NN'), ('notch', 'NN'), ('on', 'IN'), ('the', 'DT'), ('display', 'NN'), ('.', '.')]]


  Next comes the named entity recognition phase.
  Named entities include:
  
  _________________
  
  *  Proper nouns (such as the names of persons or objects)
  
  Relation extraction usually takes place between named entities that occur close to one another in the text.
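
Since proper nouns are the main candidates, a rough first approximation of entity detection is to collect runs of consecutive proper-noun tags (`NNP`, `NNPS`) from the tagged output. The sketch below assumes a hand-tagged sentence and a hypothetical helper `proper_noun_runs`; it only finds candidate spans and does not assign entity types such as PERSON or ORGANIZATION.

```python
from itertools import groupby

# Hand-tagged for illustration; in practice this would come from pre_process().
tagged = [('Ken', 'NNP'), ('Haldin', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
          ('spokesman', 'NN'), ('for', 'IN'), ('Georgia-Pacific', 'NNP'),
          ('in', 'IN'), ('Atlanta', 'NNP'), ('.', '.')]

def proper_noun_runs(tagged_tokens):
    """Join each maximal run of proper-noun tokens into one candidate string."""
    runs = []
    for is_proper, group in groupby(tagged_tokens,
                                    key=lambda pair: pair[1] in ('NNP', 'NNPS')):
        if is_proper:
            runs.append(" ".join(word for word, _tag in group))
    return runs

candidates = proper_noun_runs(tagged)
print(candidates)
# ['Ken Haldin', 'Georgia-Pacific', 'Atlanta']
```
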

## Chunking

  Smaller boxes show word-level tokenization and part-of-speech tagging. Larger boxes show higher-level chunking.
  ![alt-text](http://www.nltk.org/images/chunk-segmentation.png)
  
  

### Noun Phrase Chunking 

In NP chunking, we search for chunks that correspond to individual noun phrases.

[The dog] barked at [the cat]  
    The phrases within square brackets are noun phrase chunks.
    
    

    

In [10]:
# Create an NP Chunk Parser 

# Get tagging from function pre_process

document = "The old man walked slowly with a stick ."
tagged_words = pre_process(document)
print(tagged_words)

[[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('walked', 'VBD'), ('slowly', 'RB'), ('with', 'IN'), ('a', 'DT'), ('stick', 'NN'), ('.', '.')]]


In [12]:
# A noun phrase here is an optional determiner, any number of adjectives, then a noun

grammar = "NP: {<DT>?<JJ>*<NN>}"     # The rule  NP -> (DETERMINER)? (ADJECTIVE)* NOUN


# Create our chunk parser 

chunk_parser = nltk.RegexpParser(grammar)
result = chunk_parser.parse(tagged_words[0])
print(result)

(S
  (NP The/DT old/JJ man/NN)
  walked/VBD
  slowly/RB
  with/IN
  (NP a/DT stick/NN)
  ./.)
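
The parser returns an `nltk.Tree`, so the chunks can be pulled out programmatically instead of being read off the printed output. The following sketch reuses the tags shown above as hand-written input (so no tokenizer or tagger models are needed) and collects each NP subtree as a plain string; the variable name `noun_phrases` is ours.

```python
import nltk

# Tags copied from the output above, so nothing has to be downloaded.
tagged = [('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('walked', 'VBD'),
          ('slowly', 'RB'), ('with', 'IN'), ('a', 'DT'), ('stick', 'NN'),
          ('.', '.')]

chunk_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunk_parser.parse(tagged)

# Keep only subtrees labelled NP and join their words back together.
noun_phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')
]
print(noun_phrases)
# ['The old man', 'a stick']
```
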


 A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, for example `<DT>?<JJ>*<NN>`.
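
One way to build intuition for tag patterns is to treat them as ordinary regular expressions applied to the sentence's tags written out as one string. The sketch below does this with Python's `re` module; it is an analogy under that assumption, not NLTK's actual implementation, and the names `tag_string` and `pattern` are hypothetical.

```python
import re

tagged = [('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('walked', 'VBD'),
          ('slowly', 'RB'), ('with', 'IN'), ('a', 'DT'), ('stick', 'NN'),
          ('.', '.')]

# Write the tag sequence as one string: '<DT><JJ><NN><VBD><RB><IN><DT><NN><.>'
tag_string = "".join(f"<{tag}>" for _word, tag in tagged)

# Hand-translated equivalent of the tag pattern <DT>?<JJ>*<NN>:
# each quantifier applies to a whole tag, hence the non-capturing groups.
pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

matches = [m.group() for m in pattern.finditer(tag_string)]
print(matches)
# ['<DT><JJ><NN>', '<DT><NN>']
```

The two matches line up with the two NP chunks, "The old man" and "a stick".
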


### Chunking using Regular Expressions


In [19]:
grammar = r"""
          NP : {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
               {<NNP>+}                # chunk sequences of proper nouns
          """
chunk_parser = nltk.RegexpParser(grammar)

sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), 
                 ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

print(chunk_parser.parse(sentence))

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))


In [23]:
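# Overlapping matches: the pattern fits both "money market" and "market fund",
# but the leftmost match wins, so "fund" is left outside the chunk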

new_tag_words = pre_process("money market fund")[0]
grammar = "NP:{<NN><NN>}"
parser = nltk.RegexpParser(grammar)
print(parser.parse(new_tag_words))

(S (NP money/NN market/NN) fund/NN)
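
Here the two-noun rule leaves "fund" outside the chunk. A more permissive rule, `NP: {<NN>+}`, chunks any run of nouns and so captures all three words. The sketch below uses hand-tagged input that mirrors the output above, so it does not depend on the tagger.

```python
import nltk

# Same tags as in the output above, written by hand.
tagged = [('money', 'NN'), ('market', 'NN'), ('fund', 'NN')]

# One or more consecutive nouns form a single NP chunk.
parser = nltk.RegexpParser("NP: {<NN>+}")
tree = parser.parse(tagged)
print(tree)
# (S (NP money/NN market/NN fund/NN))
```
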
