## Information extraction

Much of the data available online is unstructured: it lives as free text in blogs, news sites, and other web pages. One example is the web page for a news article. For such data, we are interested in the different kinds of information hidden in the free text. We may ask:

- Who wrote the article?
- Where is it coming from?
- What does it say?
- Which entities (names, places, people, dates) are involved?

The article also contains relationships between these entities: what happened to whom, and where?

Let's give these ideas formal definitions:

**Named entity**: A noun phrase of a specific type that refers to a specific individual, place, organization, and so on.

**Named entity recognition**: The process of identifying named entities in text. It involves two steps:

- Identify the mention
- Identify the tag

Identifying the tag can be ambiguous. Take the word "Chicago", for example: it could mean an album, a team, or even a font!

**Relationship extraction**: Identifying relationships between named entities.
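
To make the two steps of named entity recognition concrete, here is a minimal sketch in plain Python. The `Mention` class and the tiny `GAZETTEER` lookup table are hypothetical names made up for this illustration; real systems rely on statistical models rather than a fixed dictionary.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str   # the surface string found in the text
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    tag: str    # entity type assigned to the mention

# Hypothetical lookup table: surface form -> entity type
GAZETTEER = {"Chicago": "GPE", "Columbia University": "ORG"}

def find_mentions(text):
    """Step 1: locate each mention. Step 2: attach its tag."""
    mentions = []
    for name, tag in GAZETTEER.items():
        start = text.find(name)
        while start != -1:
            mentions.append(Mention(name, start, start + len(name), tag))
            start = text.find(name, start + 1)
    return sorted(mentions, key=lambda m: m.start)

text = "She left Chicago to study at Columbia University."
mentions = find_mentions(text)
for m in mentions:
    print(m.text, m.tag, text[m.start:m.end])
```

Note that this lookup always tags "Chicago" as a place. It cannot tell the city from the album or the font, which is why the tag has to be decided from context.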



## Named Entity Extraction

As defined above, named entities are noun phrases that refer to specific types of individuals, such as organizations, people, or dates. The purpose of a named entity recognition (NER) system is to identify every textual mention of these entities.

### spaCy

In the following exercise, we'll build our own named entity recognition system with `spaCy`, a Python module commonly used for natural language processing in industry.

In [2]:
import spacy
import pandas as pd

In [3]:
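# The 'en' shortcut only worked in older spaCy releases; current releases load a model package by its full name
nlp = spacy.load('en_core_web_sm')

If the model is not installed, this call raises an error. Install it first by running `python -m spacy download en_core_web_sm`.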

In [3]:
# Let's try it

review = "Columbia University was founded in 1754 as King's College by royal charter of King George II of England. It is the oldest institution of higher learning in the state of New York and the fifth oldest in the United States. Controversy preceded the founding of the College, with various groups competing to determine its location and religious affiliation. Advocates of New York City met with success on the first point, while the Anglicans prevailed on the latter. However, all constituencies agreed to commit themselves to principles of religious liberty in establishing the policies of the College. In July 1754, Samuel Johnson held the first classes in a new schoolhouse adjoining Trinity Church, located on what is now lower Broadway in Manhattan. There were eight students in the class. At King's College, the future leaders of colonial society could receive an education designed to 'enlarge the Mind, improve the Understanding, polish the whole Man, and qualify them to support the brightest Characters in all the elevated stations in life.'' One early manifestation of the institution's lofty goals was the establishment in 1767 of the first American medical school to grant the M.D. degree."

In [4]:
doc = nlp(review)
doc

Columbia University was founded in 1754 as King's College by royal charter of King George II of England. It is the oldest institution of higher learning in the state of New York and the fifth oldest in the United States. Controversy preceded the founding of the College, with various groups competing to determine its location and religious affiliation. Advocates of New York City met with success on the first point, while the Anglicans prevailed on the latter. However, all constituencies agreed to commit themselves to principles of religious liberty in establishing the policies of the College. In July 1754, Samuel Johnson held the first classes in a new schoolhouse adjoining Trinity Church, located on what is now lower Broadway in Manhattan. There were eight students in the class. At King's College, the future leaders of colonial society could receive an education designed to 'enlarge the Mind, improve the Understanding, polish the whole Man, and qualify them to support the brightest Cha

In [5]:
sentences = [sentence.orth_ for sentence in doc.sents] 

In [6]:
sentences

["Columbia University was founded in 1754 as King's College by royal charter of King George II of England.",
 'It is the oldest institution of higher learning in the state of New York and the fifth oldest in the United States.',
 'Controversy preceded the founding of the College, with various groups competing to determine its location and religious affiliation.',
 'Advocates of New York City met with success on the first point, while the Anglicans prevailed on the latter.',
 'However, all constituencies agreed to commit themselves to principles of religious liberty in establishing the policies of the College.',
 'In July 1754, Samuel Johnson held the first classes in a new schoolhouse adjoining Trinity Church, located on what is now lower Broadway in Manhattan.',
 'There were eight students in the class.',
 "At King's College, the future leaders of colonial society could receive an education designed to 'enlarge the Mind, improve the Understanding, polish the whole Man, and qualify the

In [7]:
sentences = [sentence.orth_ for sentence in doc.sents] # list of sentences
print("There were {} sentences found.".format(len(sentences)))

There were 9 sentences found.


In [8]:
# Pair each noun phrase with the head word that its root token attaches to
nounphrases = [[chunk.orth_, chunk.root.head.orth_] for chunk in doc.noun_chunks]
print("There were {} noun phrases found.".format(len(nounphrases)))

There were 52 noun phrases found.


In [9]:
entities = list(doc.ents) # converts entities into a list
print("There were {} entities found".format(len(entities)))

There were 26 entities found
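
Counting the entities does not show what they are. As a small sketch, the helper below groups entity texts by their label. The function name `entities_by_label` is made up for this example; it relies only on the `ents` attribute of a spaCy `Doc` and the `text` and `label_` attributes of each entity span.

```python
from collections import defaultdict

def entities_by_label(doc):
    """Group entity texts by label, e.g. {'GPE': [...], 'DATE': [...]}."""
    grouped = defaultdict(list)
    for ent in doc.ents:  # each ent is a span with a text and a label
        grouped[ent.label_].append(ent.text)
    return dict(grouped)
```

Calling `entities_by_label(doc)` on the `doc` created above shows which mentions the model found and how it tagged them. The exact labels depend on the model and its version.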


In [16]:
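import nltk

# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded beforehand
tokes = []
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)  # split one sentence into word tokens
    tokes.append(tokens)

tokes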

[['Columbia',
  'University',
  'was',
  'founded',
  'in',
  '1754',
  'as',
  'King',
  "'s",
  'College',
  'by',
  'royal',
  'charter',
  'of',
  'King',
  'George',
  'II',
  'of',
  'England',
  '.'],
 ['It',
  'is',
  'the',
  'oldest',
  'institution',
  'of',
  'higher',
  'learning',
  'in',
  'the',
  'state',
  'of',
  'New',
  'York',
  'and',
  'the',
  'fifth',
  'oldest',
  'in',
  'the',
  'United',
  'States',
  '.'],
 ['Controversy',
  'preceded',
  'the',
  'founding',
  'of',
  'the',
  'College',
  ',',
  'with',
  'various',
  'groups',
  'competing',
  'to',
  'determine',
  'its',
  'location',
  'and',
  'religious',
  'affiliation',
  '.'],
 ['Advocates',
  'of',
  'New',
  'York',
  'City',
  'met',
  'with',
  'success',
  'on',
  'the',
  'first',
  'point',
  ',',
  'while',
  'the',
  'Anglicans',
  'prevailed',
  'on',
  'the',
  'latter',
  '.'],
 ['However',
  ',',
  'all',
  'constituencies',
  'agreed',
  'to',
  'commit',
  'themselves',
  'to',
 

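To illustrate relationship extraction, consider the sentence "Paracetamol helps treat hay fever." It expresses a treatment relationship between two entities: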

          treatment
Paracetamol -----------> Hay fever
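
The relationship in this example can be sketched as a (subject, relation, object) triple. The toy code below uses a single hand-written regular expression; the pattern `TREATS` and the function `extract_treatment` are hypothetical names for illustration, and a real relationship extractor would need to cover far more phrasings than this one.

```python
import re

# Hypothetical pattern for sentences of the form "<drug> helps treat <condition>"
TREATS = re.compile(r"(?P<drug>\w+) helps treat (?P<condition>[\w ]+)")

def extract_treatment(sentence):
    """Return a (subject, relation, object) triple, or None if nothing matches."""
    match = TREATS.search(sentence)
    if match is None:
        return None
    return (match.group("drug"), "treatment", match.group("condition").strip())

triple = extract_treatment("Paracetamol helps treat hay fever.")
print(triple)
```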

## Chunking

Chunking segments multi-token sequences and labels them with entity types, such as 'person', 'organization', or 'time'. This makes it a useful building block for entity recognition.

### Noun Phrase Chunking

Noun Phrase Chunking, or NP-Chunking, is where we search for chunks corresponding to individual noun phrases.

We can use nltk to create a chunk parser. We begin by importing nltk and defining a sentence whose tokens are tagged with their parts of speech (which we covered in the previous tutorial).



In [None]:
import nltk 
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

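Next, we define the tag pattern of an NP chunk. A tag pattern is a sequence of part-of-speech tags delimited by angle brackets, e.g. `<DT>?<JJ>*<NN>`. It reads like a regular expression: an optional determiner (`DT`), followed by any number of adjectives (`JJ`), followed by a noun (`NN`). The chunk parser uses this pattern to build the parse tree for a given sentence.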
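
One way to read a tag pattern is as a regular expression over the sequence of tags. The sketch below mimics that idea with Python's `re` module on the tagged sentence from above. It is an illustration only and not necessarily how NLTK works internally; the names `tag_string` and `np_regex` are made up here.

```python
import re

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Write the tags as one string: "<DT><JJ><JJ><NN><VBD><IN><DT><NN>"
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Same shape as the tag pattern: optional DT, any number of JJ, then NN
np_regex = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for match in np_regex.finditer(tag_string):
    first = tag_string[:match.start()].count("<")  # index of the chunk's first token
    length = match.group().count("<")              # number of tokens in the chunk
    chunks.append([word for word, _ in tagged[first:first + length]])

print(chunks)
```

This finds two noun phrase chunks, "the little yellow dog" and "the cat", while "barked" and "at" stay outside any chunk.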



In [None]:
pattern = "NP: {<DT>?<JJ>*<NN>}" 

Finally, we create the chunk parser with nltk's `RegexpParser` class.

In [None]:
NPChunker = nltk.RegexpParser(pattern) 

In [None]:
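result = NPChunker.parse(sentence)
print(result)  # text rendering of the chunk tree
result.draw()  # opens a separate window showing the tree diagram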