# T-725 Natural Language Processing: Lab 8
In today's lab, we will be working with named entity recognition and information extraction.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run the code in this notebook.

In [81]:
import nltk
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

True

## Named entity recognition
NLTK includes a classifier for tagging named entities, which is described in [Chapter 7.5](https://www.nltk.org/book/ch07.html#sec-ner) of the NLTK book.

In [82]:
sent = """The 2020 Nobel Prize in Physics is awarded to Roger Penrose, Reinhard
Genzel and Andrea Ghez for their work on black holes."""

sent_tokens = nltk.word_tokenize(sent)
sent_tagged = nltk.pos_tag(sent_tokens)
sent_ner = nltk.ne_chunk(sent_tagged)

print(sent_ner)

(S
  The/DT
  2020/CD
  Nobel/NNP
  Prize/NNP
  in/IN
  (GPE Physics/NNP)
  is/VBZ
  awarded/VBN
  to/TO
  (PERSON Roger/NNP Penrose/NNP)
  ,/,
  (PERSON Reinhard/NNP Genzel/NNP)
  and/CC
  (PERSON Andrea/NNP Ghez/NNP)
  for/IN
  their/PRP$
  work/NN
  on/IN
  black/JJ
  holes/NNS
  ./.)


The NLTK book shows a list of commonly used named entity categories along with examples:

NE Type | Examples
--- | ---
ORGANIZATION | Georgia-Pacific Corp., WHO
PERSON | Eddy Bonte, President Obama
LOCATION | Murray River, Mount Everest
DATE | June, 2008-06-29
TIME | two fifty a m, 1:30 p.m.
MONEY | 175 million Canadian Dollars, GBP 10.40
PERCENT | twenty pct, 18.75 %
FACILITY | Washington Monument, Stonehenge
GPE | South East Asia, Midlothian

# Assignment
Answer the following questions and hand in your solution in Canvas before 8:30 AM on Monday, October 23rd. Remember to save your file before uploading it.

## Question 1
Use `nltk.ne_chunk(tagged_sentence)` to identify the named entities in the sentences below. Note that you have to tokenize and tag the sentences first. Print out and review the trees. Find at least one error and leave a description of it as a comment or in a text cell below.
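
When reviewing the trees, it can help to list only the labelled chunks. The sketch below assumes the usual `nltk.ne_chunk()` output shape: named entity chunks are `Tree` nodes and ordinary tokens are `(word, tag)` tuples. The helper name `list_entities` and the hand-built `example_tree` are illustrative and not part of NLTK. The tree is built by hand so the snippet runs without downloading any models.

```python
from nltk.tree import Tree


def list_entities(chunk_tree):
    """Return (label, text) pairs for each labelled chunk directly under the root."""
    entities = []
    for node in chunk_tree:
        # Named entity chunks are Tree objects; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built stand-in for part of the tree printed in the example above (assumption:
# real ne_chunk() output has the same structure).
example_tree = Tree("S", [
    ("awarded", "VBN"),
    ("to", "TO"),
    Tree("PERSON", [("Roger", "NNP"), ("Penrose", "NNP")]),
    (",", ","),
    Tree("PERSON", [("Reinhard", "NNP"), ("Genzel", "NNP")]),
])

print(list_entities(example_tree))
```

In the lab itself, the same helper could be called on each `sent_ner` tree instead of the hand-built one.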

In [83]:
# On this day, October 16th (from https://en.wikipedia.org/wiki/October_16):
sentences = [
    "1813 – The Sixth Coalition attacks Napoleon in the three-day Battle of Leipzig.",
    "1923 – The Walt Disney Company is founded.",
    "1968 – Yasunari Kawabata becomes the first Japanese person to be awarded the Nobel Prize in Literature.",
    "1975 – Three-year-old Rahima Banu, from Bangladesh, is the last known case of naturally occurring smallpox.",
    "2002 – The Bibliotheca Alexandrina opens in Egypt, commemorating the ancient library of Alexandria."
]


In [84]:
# Your solution here
for s in sentences:
    print(f"SENTENCE: {s}")
    sent_tokens = nltk.word_tokenize(s)
    sent_tagged = nltk.pos_tag(sent_tokens)
    sent_ner = nltk.ne_chunk(sent_tagged)

    print(sent_ner)
    print()

SENTENCE: 1813 – The Sixth Coalition attacks Napoleon in the three-day Battle of Leipzig.
(S
  1813/CD
  –/VBZ
  The/DT
  (ORGANIZATION Sixth/JJ Coalition/NNP)
  attacks/NNS
  Napoleon/NNP
  in/IN
  the/DT
  three-day/JJ
  Battle/NNP
  of/IN
  (GPE Leipzig/NNP)
  ./.)

SENTENCE: 1923 – The Walt Disney Company is founded.
(S
  1923/CD
  –/VBZ
  The/DT
  (ORGANIZATION Walt/NNP Disney/NNP Company/NNP)
  is/VBZ
  founded/VBN
  ./.)

SENTENCE: 1968 – Yasunari Kawabata becomes the first Japanese person to be awarded the Nobel Prize in Literature.
(S
  1968/CD
  –/NNP
  Yasunari/NNP
  Kawabata/NNP
  becomes/VBZ
  the/DT
  first/JJ
  (GPE Japanese/JJ)
  person/NN
  to/TO
  be/VB
  awarded/VBN
  the/DT
  (ORGANIZATION Nobel/NNP Prize/NNP)
  in/IN
  (GPE Literature/NNP)
  ./.)

SENTENCE: 1975 – Three-year-old Rahima Banu, from Bangladesh, is the last known case of naturally occurring smallpox.
(S
  1975/CD
  –/JJ
  Three-year-old/NNP
  (PERSON Rahima/NNP Banu/NNP)
  ,/,
  from/IN
  (GPE Banglade

One error: in the third sentence, "Literature" is classified as a geopolitical entity (GPE), although it is part of the prize name.

## Question 2
[spaCy](https://spacy.io/) is another NLP library for Python. Try out its named entity recognition system on the sentences in Question 1. Does it repeat any of the mistakes that NLTK makes? Does it make any errors that NLTK doesn't? Leave your answer as a comment or in a text cell below.

In [85]:
import spacy
from spacy import displacy
import en_core_web_sm

nlp = en_core_web_sm.load()

# Example
text = """The 2020 Nobel Prize in Physics is awarded to Roger Penrose, Reinhard
Genzel and Andrea Ghez for their work on black holes."""

doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

In [86]:
# Your solution here
for s in sentences:
    doc = nlp(s)
    displacy.render(doc, style="ent", jupyter=True)

While the "Literature" error is fixed, spaCy makes other mistakes:

- Classifying "Napoleon" as an ORG
- Separating "Sixth" from "Coalition"
- Not classifying "Bibliotheca Alexandrina" as an ORG

## Question 3
Use regular expressions to try to find instances of the following relationships in the `reuters` corpus:
1. Organizations or companies and their subsidiaries, divisions or parts, e.g.:
  * *Moss Rosenberg Verft, a subsidiary of Kvaerner Industrier A/S*
  * *Merrill Lynch Capital Partners, a unit of Merrill Lynch*
2. Executives and the companies they work for, e.g.:
  * *Isao Nakamura, president of Higashi Nippon*
  * *Henry Rosenberg, chairman of Crown Central Petroleum*

Your results don't have to be perfect! Getting a few relevant matches is enough, but try to keep irrelevant results to a minimum.
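
One possible approach, sketched here on a hand-written sample rather than on the corpus, is to use named groups so that each match yields a pair of names instead of one long string. The sample text reuses the two subsidiary examples from the question. The group names `child` and `parent` and the helper pattern `name` are illustrative choices, and the capitalised-word heuristic is an assumption that will miss many real company names.

```python
import re

# Sample text built from the examples in the question (not read from the corpus).
sample = ("Moss Rosenberg Verft, a subsidiary of Kvaerner Industrier A/S. "
          "Merrill Lynch Capital Partners, a unit of Merrill Lynch.")

# A "name" is approximated as one or more capitalised words separated by single spaces.
name = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

pattern = re.compile(
    rf"(?P<child>{name}), a (?:subsidiary|unit|division) of (?P<parent>{name})"
)

# finditer() returns match objects, so both named groups can be read from each match.
pairs = [(m.group("child"), m.group("parent")) for m in pattern.finditer(sample)]

for child, parent in pairs:
    print(f"{child} -> {parent}")
```

Note that the parent in the first pair comes out as "Kvaerner Industrier": the trailing "A/S" does not fit the capitalised-word heuristic, which is the kind of imperfection the question allows for.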

In [87]:
import re
from nltk.corpus import reuters
nltk.download('reuters', quiet=True)

# Create a copy of the text where there's only a single space between each word
text = " ".join(reuters.raw().split())

# Example
for m in re.findall(r'(?: [A-Z][a-z]+)+ said it acquired (?:[A-Z][a-z]+ )+', text):
  print(m)

# Note how normal groups and non-capturing groups work with re.findall():
# a_string = "a a b"
# re.findall(r'(a )+b', a_string): ['a '] (normal group)
# re.findall(r'(?:a )+b', a_string): ['a a b'] (non-capturing group)

 Douglas Corp said it acquired Frampton Computer Services 
 Corp said it acquired Private Formulations Inc 
 Forstmann Little said it acquired Sybron 
 Southmark Corp said it acquired Berg Ventures 
 Sico said it acquired Sterling 
 First Financial Management Corp said it acquired Confidata 
 Philadelphia Suburban Corp said it acquired Mentor Systems 
 Medar Inc said it acquired Automatic Inspection Devices 
 Stryker Corp said it acquired Hexcel Medical 
 Inspeech Inc said it acquired Norma Bork Associates Inc 
 Olin Hunt Specialty Products Inc said it acquired Image Technology Corp 
 Enro Holding Corp said it acquired Enro Shirt Co 
 Seal Inc said it acquired Ademco 


In [88]:
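# Pattern 1: "<Name>, a subsidiary|unit|section of <Name>"
pat_sub = r'(?: [A-Z][a-z]+)+, a (?:subsidiary|unit|section) of (?:[A-Z][a-z]+ )+'
# Pattern 2: "<Name>, president|chairman|director|head of <Name>"
pat_exec = r'(?: [A-Z][a-z]+)+, (?:president|chairman|director|head) of (?:[A-Z][a-z]+ )+'

print("\n1. Subsidiaries")

# Show at most the first 40 matches for each pattern
for m in re.findall(pat_sub, text)[:40]:
  print(m)

print("\n2. Executives")

for m in re.findall(pat_exec, text)[:40]:
  print(m)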



1. Subsidiaries
 Spie Batignolles, a subsidiary of Schneider 
 James Beam Distilling Co, a unit of American Brands Inc 
 Magma Copper Co, a subsidiary of Newmont Mining 
 Scallop Petroleum Corp, a subsidiary of Royal 
 Allied Stores Corp, a subsidiary of Campeau 
 Merrill Lynch Capital Partners, a unit of Merrill Lynch 
 Reliance Financial Serivces Corp, a subsidiary of Reliance Group Holdings Inc 
 Acquisition Corp, a subsidiary of Merrill Lynch Capital Partners 
 Monsanto Chemical Company, a unit of Monsanto Co 
 Gallaher Ltd, a subsidiary of American Brands Inc 
 Algonquin Gas Transmission Co, a unit of Texas Eastern 
 Kennecott Corp, a unit of British Petroleum Co 
 Chase Home Mortgage Corp, a subsidiary of Chase Manhattan Corp 
 Permian Corp, a subsidiary of National 
 Inspiration Consolidated Copper Co, a subsidiary of Inspiration Resources 
 Belcher Oil Co, a unit of Coastal Corp 
 Alcan Aluminum Corp, a subsidiary of Alcan Aluminium 
 Young Co, a subsidiary of Service Resource

## Question 4
It's much easier to extract relationships from text that is tagged with named entities. This can be accomplished using the `nltk.sem.extract_rels()` function, as described in [Chapter 7.6](https://www.nltk.org/book/ch07.html#relation-extraction) of the NLTK book. The function takes two named entity categories and a regular expression as arguments and returns all instances where the pattern occurs between the two categories (allowing for up to 10 tokens between them, by default).
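
To make the description above concrete, here is a simplified sketch of that kind of filter in plain Python. It is not NLTK's implementation: the function `filter_relations`, the shape of the `candidates` triples and the sample data are all hypothetical. It also assumes the pattern is applied with `match()` from the start of the text between the two entities, which would explain why such patterns are often written with a leading `.*`.

```python
import re


def filter_relations(candidates, subj_class, obj_class, pattern, window=10):
    """Keep candidates whose entity classes match, whose filler text has at most
    `window` tokens, and whose filler matches `pattern` from its start."""
    kept = []
    for (s_class, s_text), filler, (o_class, o_text) in candidates:
        if (s_class == subj_class and o_class == obj_class
                and len(filler.split()) <= window
                and pattern.match(filler)):
            kept.append((s_text, filler, o_text))
    return kept


# Hand-written candidates: (subject entity, text in between, object entity).
candidates = [
    (("ORG", "Omnicom"), "moved to acquire", ("ORG", "GGT")),
    (("ORG", "Safeway Stores"), "acquired", ("ORG", "Mutual")),
    (("PER", "Roger Goldman"), ", a law professor at", ("ORG", "St. Louis University")),
]

acquire = re.compile(r'.*\bacquired?\b')
result = filter_relations(candidates, "ORG", "ORG", acquire)

for subj, filler, obj in result:
    print(f"[{subj}] {filler!r} [{obj}]")
```

The third candidate is dropped because its subject is a `PER`, not an `ORG`; both entity classes and the pattern have to agree for a relation to be kept.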

The `ieer` (Information Extraction and Entity Recognition) corpus contains named entity annotations, such as `PER`, `ORG` and `LOC`. Find some instances of the following relationships using `nltk.sem.extract_rels()`:
1. Professors and the organizations they work for, e.g.:
  * *Roger Goldman, a law professor at St. Louis University*
2. Family members, e.g.:
  * *Louis XIV and his brother, Philippe*
  * *Mildred Rosenbaum and her husband Stanley*
3. People and where they are from, e.g.:
  * *Anna Rechnio of Poland*

In [89]:
from nltk.corpus import ieer
nltk.download('ieer', quiet=True)

# Example
pattern = re.compile(r'.*\bacquired?\b')

for doc in nltk.corpus.ieer.parsed_docs():
  for rel in nltk.sem.extract_rels('ORG', 'ORG', doc, 'ieer', pattern):
    print(nltk.sem.rtuple(rel))

[ORG: 'Omnicom'] 'moved to acquire' [ORG: 'GGT']
[ORG: 'BDDP'] 'was acquired last year by' [ORG: 'GGT']
[ORG: 'Safeway Stores'] 'acquired' [ORG: 'Mutual']


In [90]:
# Your solution here
# 1. Professors and the organizations they work for
pattern = re.compile(r'.*\ba(?: [a-z]+)? (?:professor|teacher) at\b')

for doc in nltk.corpus.ieer.parsed_docs():
  for rel in nltk.sem.extract_rels('PER', 'ORG', doc, 'ieer', pattern):
    print(nltk.sem.rtuple(rel))

[PER: 'Pepper Schwartz'] ', a sociology professor at the' [ORG: 'University of Washington']
[PER: 'Roger Goldman'] ', a law professor at' [ORG: 'St. Louis University']


In [91]:
# 2. Family members
pattern = re.compile(r'.*\b(?:his|her|their|your|my|our) (?:(?:moth|fath|broth|sist)er|sibling|husband|wife|spouse|parent|child|children|kid)s?\b')

for doc in nltk.corpus.ieer.parsed_docs():
  for rel in nltk.sem.extract_rels('PER', 'PER', doc, 'ieer', pattern):
    print(nltk.sem.rtuple(rel))

[PER: 'Jack N. Berkman'] ', an alumnus, and his wife,' [PER: 'Lillian R. Berkman']
[PER: 'Louis XIV'] 'and his brother,' [PER: 'Philippe']
[PER: 'Clinton'] 'and his wife,' [PER: 'Hillary Rodham Clinton']
[PER: 'Mildred Rosenbaum'] 'and her husband' [PER: 'Stanley']
