# Lab5 - Assignment 5 about extraction of properties

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-5 assignment of the Text Mining course. It is about Property Extraction.

**Due**: 17 Mar at 23:59

**How to submit**: Please submit your assignment using Canvas (see *Assignments* -> *Lab Session Property Extraction*). Convert your notebook to PDF (in JupyterLab, click *File* in the menu bar, select *Export Notebook As*, then select *Export Notebook to PDF*).

**Points**: each exercise is suffixed with the number of points you can obtain for the exercise.

**Assignment goals**:
* Get insight into the challenges of entity property extraction.
* Learn how to build a transparent property extraction method based on patterns.
* Get insight into the pros and cons of two pattern-based property extraction methods.
* Be able to run your extractors on unseen documents from Wikipedia.
* Be able to evaluate property extractors.

In this assignment, the main focus lies on creating your own pattern-based property extractors. You are then going to run them on Wikipedia texts, evaluate them against gold values, and reflect on their relative performance.

We recommend that you go through the notebooks in the following order:
* *Read the assignment (see below)*
* *Lab5-Property-extraction.ipynb*
* *Answer the questions of the assignment (see below) using the provided notebooks and submit*

**Hint:** in the explanation notebook, we had an example about extraction of properties with substring matching and with dependencies. You can use much of that code here, but make sure you make the right adjustments.

**Good luck & have fun!**

In [5]:
import spacy
import lab5_utils as utils

model="en_core_web_sm"

nlp = spacy.load(model)
# print("Info: Loaded model '%s'" % model)

### 1. Extracting properties with substring matching (12 points)

**Exercise 1a** Write code that extracts the birth year of a person by using substring matching. (4 points)


In [6]:
def extract_birth_year_regex(doc, patterns):
    # Extract the birth year of a person with regular expressions
    property_value_type='DATE'
    target_entity_type='PERSON'
    
    # the following 3 lines merge entities and noun chunks into one token
    # this is useful in our cases, so we will always do it.
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()

    relations = {}
    
    # step Ia - generate possible property values
    dates=utils.get_entities_of_type(property_value_type, doc)
    for date in dates:
        # step Ib - is one of our patterns found before the date 
        if utils.pattern_found_on_the_left(doc, date.i, patterns):
            # step II - find the closest entity of some target type
            pers=utils.find_closest_entity(doc.ents, date.idx, target_entity_type)
            # step III - normalize the year
            year=utils.extract_year_from_date(date.text)
            if year and pers:
                relations[pers]=year

    return relations 
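
To make the steps in the function above concrete without depending on spaCy or `lab5_utils`, here is a minimal sketch on plain strings. The helpers `pattern_on_the_left` and `year_from_date` are hypothetical stand-ins written for illustration; the actual functions in `lab5_utils` may behave differently.

```python
import re

def pattern_on_the_left(text, value_start, patterns):
    """Return True if the text directly before value_start ends with one of the patterns."""
    left_context = text[:value_start].rstrip().lower()
    return any(left_context.endswith(p.lower()) for p in patterns)

def year_from_date(date_text):
    """Return the first four-digit year (1000-2099) in date_text as an int, or None."""
    match = re.search(r"\b(1\d{3}|20\d{2})\b", date_text)
    return int(match.group(1)) if match else None

patterns = ["born in", "born on"]

# The pattern directly precedes the date, so the check succeeds.
works = "Peter was born in 1975."
print(pattern_on_the_left(works, works.index("1975"), patterns))

# The date comes first, so nothing to its left matches.
fails = "In 1975 Peter's dog was born."
print(pattern_on_the_left(fails, fails.index("1975"), patterns))

print(year_from_date("25 April 1975"))
```

The sketch shows why substring matching is sensitive to word order: the pattern has to sit immediately to the left of the candidate value.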

**Exercise 1b** Test your *birth year substring matching extractor* in the following way. 

* Write a sentence on which you expect that the extractor *WILL* work. 
* Write a sentence on which you expect that the extractor *WILL NOT* work. 

Run your extractor on both sentences and print the results. Make sure that the results are as expected. (2 points)

In [7]:
born_patterns=['born in', 'birthdate', 'born on']

text='Peter was born in 1975.'
text2 = "Peter Pan's dog was born in 1990"

print( "This sentence \033[1m WILL\033[0m work:",text)
birth_year_relations=extract_birth_year_regex(nlp(text), born_patterns)
print(birth_year_relations)
print()
print( "This sentence \033[1m WILL NOT\033[0m work:",text2)
birth_year_relations2=extract_birth_year_regex(nlp(text2), born_patterns)
print(birth_year_relations2)

This sentence [1m WILL[0m work: Peter was born in 1975.
{'Peter': 1975}

This sentence [1m WILL NOT[0m work: Peter Pan's dog was born in 1990
{}


**Exercise 1c** Write code that extracts the manufacturer of a device by using substring matching. (4 points)

In [8]:
def extract_manufacturer_regex(doc, patterns, main_entity):
    # Extract the manufacturer of a device by using regular expressions
    property_value_type='ORG'
    target_entity_type='PRODUCT'
    
    # the following 3 lines merge entities and noun chunks into one token
    # this is useful in our cases, so we will always do it.
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()

    relations = {}
    
    manus =utils.get_entities_of_type(property_value_type, doc)
    
    for manu in manus:
        if utils.pattern_found_on_the_left(doc, manu.i, patterns):
            prod=utils.find_closest_entity(doc.ents, manu.idx, target_entity_type)
            if not prod:
                prod = main_entity
            if manu and prod:
                relations[prod]=manu.text
    return relations


**Exercise 1d** Test your *manufacturer substring matching extractor* in the following way. 

* Write a sentence on which you expect that the extractor *WILL* work. 
* Write a sentence on which you expect that the extractor *WILL NOT* work. 

Run your extractor on both sentences and print the results. Make sure that the results are as expected. (2 points)

In [9]:
manu_predicates=['manufactured', 'produced', 'developed by', 'developed']
main_entity = 'iPhone'
sentence='the iPhone was developed by Apple in 2000.'
sentence2 = 'Apple developed the iPad in 2005 .'

print( "This sentence \033[1m WILL\033[0m work:",sentence)
manu_relations=extract_manufacturer_regex(nlp(sentence), manu_predicates, main_entity)
print(manu_relations)
print()
print( "This sentence \033[1m WILL NOT\033[0m work:",sentence2)
manu_relations2=extract_manufacturer_regex(nlp(sentence2), manu_predicates, main_entity)
print(manu_relations2)

This sentence [1m WILL[0m work: the iPhone was developed by Apple in 2000.
{'iPhone': 'Apple'}

This sentence [1m WILL NOT[0m work: Apple developed the iPad in 2005 .
{}


### 2. Extracting properties by using dependency information (12 points)

In [10]:
def fitting_dependency(token, predicates):
    """
    Check whether we find the right keyword in the correct part of the dependency tree.
    """
    # Find passive subjects ('nsubjpass') whose head is the root of the sentence.
    # Also, we make sure that this root verb is one of our keywords.
    if token.dep_ == 'nsubjpass' and token.head.dep_ == 'ROOT':
        pred=token.head
        return pred.text in predicates
    return False
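
The dependency check for passive subjects can be tried out without loading a spaCy model. The sketch below uses a tiny hypothetical `Tok` class in place of spaCy tokens, and the dependency labels are assigned by hand as an assumption about what a parser would typically produce; a real parse may differ.

```python
class Tok:
    """Tiny stand-in for a spaCy token: text, dependency label, and head."""
    def __init__(self, text, dep_, head=None):
        self.text = text
        self.dep_ = dep_
        # As in spaCy, the root token is its own head.
        self.head = head if head is not None else self

def is_passive_subject_of(token, predicates):
    """True if token is a passive subject whose root verb is one of the predicates."""
    return (token.dep_ == "nsubjpass"
            and token.head.dep_ == "ROOT"
            and token.head.text in predicates)

# "Peter was born in 1975." -> Peter is the passive subject of 'born'
born = Tok("born", "ROOT")
peter = Tok("Peter", "nsubjpass", head=born)

# "In 1975 Peter's dog was born." -> the dog is the subject, Peter only a possessor
born2 = Tok("born", "ROOT")
dog = Tok("dog", "nsubjpass", head=born2)
peter2 = Tok("Peter", "poss", head=dog)

print(is_passive_subject_of(peter, ["born"]))
print(is_passive_subject_of(peter2, ["born"]))
```

Unlike substring matching, this check does not depend on where the date appears in the sentence, only on the grammatical role of the person.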

**Exercise 2a** Write code that extracts the birth year of a person by using dependency information. (4 points)

In [11]:
def extract_birth_year_dep(doc, predicates):
    # Extract the birth year of a person with dependency information
    
    # here we start from the persons and look up the closest date afterwards
    property_value_type='PERSON'
    target_entity_type='DATE'
    
    # the following 3 lines merge entities and noun chunks into one token
    # this is useful in our cases, so we will always do it.
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()
    
    relations={}
    
    # step Ia - generate candidate persons
    persons=utils.get_entities_of_type(property_value_type, doc)
    
    for person in persons:
        # step Ib - do we find the right keyword in the correct part of the dependency tree?
        if fitting_dependency(person, predicates):
            # step II - find the closest entity of some target type
            date=utils.find_closest_entity(doc.ents, person.idx, target_entity_type)
            # step III - normalize the year (only if a date was found)
            year=utils.extract_year_from_date(date) if date else None
            if year:
                relations[person.text]=year
    return relations

**Exercise 2b** Test your *birth year dependency extractor* in the following way. 

* Write a sentence on which you expect that the extractor *WILL* work. 
* Write a sentence on which you expect that the extractor *WILL NOT* work. 

Run your extractor on both sentences and print the results. Make sure that the results are as expected. (2 points)

In [None]:
born_patterns=['born', 'birthdate']
sentence='Peter was born in 1975.'
sentence2 = "In 1975 Peter's dog was born."

print( "This sentence \033[1m WILL\033[0m work:",sentence)
birth_year_relations=extract_birth_year_dep(nlp(sentence), born_patterns)
print(birth_year_relations)
print()
print( "This sentence \033[1m WILL NOT\033[0m work:",sentence2)
birth_year_relations2=extract_birth_year_dep(nlp(sentence2), born_patterns)
print(birth_year_relations2)

**Exercise 2c** Write code that extracts the manufacturer of a device by using dependency information. (4 points)

In [None]:
def fitting_dependency(token, predicates):
    """
    Check whether we find the right keyword in the correct part of the dependency tree.
    """
    # Find prepositional objects ('pobj') whose head has the dependency label 'agent'
    # and whose head's head is the root of the sentence.
    # Also, we make sure that this root verb is one of our keywords.
    if token.dep_ == 'pobj' and token.head.dep_ == 'agent' and token.head.head.dep_ =='ROOT':
        pred=token.head.head
        return pred.text in predicates
    return False

In [None]:
def extract_manufacturer(doc, predicates, main_entity):
    
    property_value_type='ORG'
    target_entity_type='PRODUCT'
    
    # the following 3 lines merge entities and noun chunks into one token
    # this is useful in our cases, so we will always do it.
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()
    
    relations={}
    
    # step Ia - generate possible property values
    manus=utils.get_entities_of_type(property_value_type, doc)
    
    for manu in manus:
        # step Ib - do we find the right keyword in the correct part of the dependency tree?
        if fitting_dependency(manu, predicates):
            # step II - find the closest entity of some target type
            device=utils.find_closest_entity(doc.ents, manu.idx, target_entity_type)
            # Devices are often not recognized properly by SpaCy - 
            # if we find no device, we assume that the relation is about the main entity of the document
            if not device:
                device=main_entity
            if device and manu:
                relations[device]=manu.text
    return relations

**Exercise 2d** Test your *manufacturer dependency extractor* in the following way. 

* Write a sentence on which you expect that the extractor *WILL* work. 
* Write a sentence on which you expect that the extractor *WILL NOT* work. 

Run your extractor on both sentences and print the results. Make sure that the results are as expected. (2 points)

In [None]:
manu_predicates=['manufactured', 'produced', 'developed']
main_entity = 'iPhone'
main_entity2 ='Walkman'
sentence='the iPhone was developed by Apple in 2000.'
sentence2 = 'Apple developed the iPad in 2005.'
sentence3= 'Walkman is a brand of portable media players manufactured by Sony. The original Walkman, released in 1979, was a portable cassette player that changed listening habits by allowing people to listen to music of their choice on the move. It was devised by Sony founders Masaru Ibuka and Akio Morita, who felt Sony existing portable player was too unwieldy and expensive'
print( "This sentence \033[1m WILL\033[0m work:",sentence)
manu_relations=extract_manufacturer(nlp(sentence), manu_predicates, main_entity)
print(manu_relations)

print()
print( "This sentence \033[1m WILL NOT\033[0m work:",sentence2)
manu_relations2=extract_manufacturer(nlp(sentence2), manu_predicates, main_entity)
print(manu_relations2)
print( "This sentence \033[1m WILL NOT\033[0m work:",sentence3)
manu_relations3=extract_manufacturer(nlp(sentence3), manu_predicates, main_entity2)
print(manu_relations3)

### 3. Running and evaluating extractors on Wikipedia (8 points)

We will run our extractors on 50 documents about people and 50 documents about devices. We provide code to load the lists of entities and the gold values.

In [None]:
import json
with open("birthyears.json", 'rb') as f:
    gold_birthyears=json.load(f)
    wiki_people=list(gold_birthyears.keys())
    
with open("manufacturers.json", 'rb') as f:
    gold_manufacturers=json.load(f)
    wiki_devices=list(gold_manufacturers.keys())
print(wiki_people)

The lists `wiki_people` and `wiki_devices` contain the names of 50 people and 50 devices, respectively.

The dictionaries `gold_birthyears` and `gold_manufacturers` contain gold values for each of these entities.

We provide a function that evaluates your extracted property values against known ("gold") property values. The function returns three evaluation scores: precision, recall, and F1-score. You can call this function as follows:

`utils.evaluate_property(system_json, gold_json)`

(make sure to replace the system_json and the gold_json with the concrete dictionaries you are comparing, depending on the property and the method)
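
For intuition about the three scores, here is a minimal sketch of how they can be computed from two dictionaries. The function `precision_recall_f1` and the toy values are hypothetical and only illustrate the standard definitions; the provided `utils.evaluate_property` may differ in details such as value normalization.

```python
def precision_recall_f1(system, gold):
    """Compare extracted values with gold values, both given as {entity: value}."""
    # An extraction counts as correct if the entity is in gold with the same value.
    correct = sum(1 for entity, value in system.items() if gold.get(entity) == value)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy values for illustration only (not taken from the gold files).
gold = {"Person A": 1940, "Person B": 1963, "Person C": 1922}
system = {"Person A": 1940, "Person B": 1964}

precision, recall, f1 = precision_recall_f1(system, gold)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

In this toy case one of two extractions is correct (precision 0.5) and one of three gold values is found (recall of about 0.33), which illustrates how an extractor can be fairly precise while still missing many values.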

Now that we have stored the gold values for both properties in our dictionaries `gold_birthyears` and `gold_manufacturers`, and written the evaluation function, we need to obtain the system output as well and then perform evaluation.

For this purpose, we will run our extractors on texts about the same 50 people and 50 devices from Wikipedia. As in the explanation notebook, we will use the `wikipedia` library to retrieve the texts, and we will only process the first three sentences of each article.
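
Taking the "first three sentences" by splitting on periods is only an approximation. The sketch below uses a made-up text and a hypothetical helper `first_n_sentences_naive` to show that abbreviations also count as sentence boundaries under this approach.

```python
def first_n_sentences_naive(text, n=3):
    """Approximate the first n sentences by splitting on periods."""
    return ".".join(text.split(".")[:n])

# Made-up example text; "Dr." contains a period that is not a sentence boundary.
text = "Dr. Smith built the device in 1990. It sold well. It was later retired."

print(first_n_sentences_naive(text))
```

Here the result stops after "It sold well", so only two real sentences are kept. If this matters for your results, a sentence segmenter such as spaCy's `doc.sents` is a possible alternative.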

In exercises 3a and 3b, we will run all our four processing functions and store the results in four different dictionaries. 
Then, in exercise 3c, we will run the evaluation function four times to compute precision, recall, and F1-score for all four functions.

**Exercise 3a** Run your two extractors about birth years of people (from exercise 1a and 2a) on all 50 documents about people. Save the extracted values in two different dictionaries: `birthyear_regex` and `birthyear_dep`. (3 points)

In [13]:
import wikipedia

In [21]:
texts_p={}
print(wiki_people)

for entity in wiki_people:
    print(entity)
    # auto_suggest=False looks up the exact title; with auto-suggestion enabled,
    # the title 'Al Pacino' was rewritten to 'al pacion' and the lookup failed
    wp = wikipedia.page(entity, auto_suggest=False)
    # get the first 3 sentences of a wikipedia article
    first_three_sentences=wp.content.split('.')[:3]
    entity_text=('.').join(first_three_sentences)
    # create a dictionary (JSON) where the key is your entity, and the value is its 3-sentences wikipedia text. 
    texts_p[entity]=entity_text
    print(entity_text)
    print()


['Al Pacino', 'Alan Rickman', 'Albert Finney', 'Alyson Hannigan', 'Andie MacDowell', 'Andrew Lloyd Webber', 'Andrzej Wajda', 'Andrzej Żuławski', 'Angela Davis', 'Anthony Quinn', 'Antonio Banderas', 'Ashley Judd', 'Ava Gardner', 'Barbara Stanwyck', 'Ben Elton', 'Bernardo Bertolucci', 'Betty Marsden', 'Billy Wilder', 'Blake Edwards', 'Bob Black', 'Bob Keeshan', 'Brad Pitt', 'Cameron Diaz', 'Carmen Miranda', 'Carole Lombard', 'Catherine Deneuve', 'Cesare Zavattini', 'Chandra Levy', 'Charlton Heston', 'Chaz Bono', 'Christine McVie', 'Christopher Lambert', 'Christopher Lee', 'Clark Gable', 'Clint Eastwood', 'Clive Sinclair', 'Cybill Shepherd', 'Dan Aykroyd', 'Dannii Minogue', 'Dave Cutler', 'David Blaine', 'David Boies', 'David Gauthier', 'David Jason', 'David Niven', 'Denise Richards', 'Desmond Llewelyn', 'Don Siegel', 'Dudley Moore', 'Dustin Hoffman']
Al Pacino



In [20]:
birthyear_regex={} 
birthyear_dep={}

# the dependency extractor compares single verb tokens,
# the substring extractor looks for phrases to the left of the date
born_predicates=['born', 'birthdate']
born_patterns=['born in', 'birthdate', 'born on']

for entity in wiki_people:
    print('entity is',entity)
    born_relations=extract_birth_year_dep(nlp(texts_p[entity]), born_predicates)
    print("relation  is",born_relations)
    print()
    birthyear_dep.update(born_relations)

for entity in wiki_people:
    print('entity is',entity)
    born_relations=extract_birth_year_regex(nlp(texts_p[entity]), born_patterns)
    print("relation  is",born_relations)
    print()
    birthyear_regex.update(born_relations)


**Exercise 3b** Run your extractors about manufacturers of devices (from exercise 1c and 2c) on all 50 documents about devices. Make sure you only process the first three sentences from each document. Save the extracted values in two dictionaries: `manufacturers_regex` and `manufacturers_dep`. (3 points)

In [23]:
textsdev={}
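for entity in wiki_devices:
    print(entity)
    # NOTE: the loop stops at 'Game Gear', so only the nine devices
    # listed before it are fetched and the remaining ones are skipped
    if entity == 'Game Gear':
        break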
    wp = wikipedia.page(entity)
    # get the first 3 sentences of a wikipedia article
    first_three_sentences=wp.content.split('.')[:3]
    entity_text=('.').join(first_three_sentences)
    # create a dictionary (JSON) where the key is your entity, and the value is its 3-sentences wikipedia text. 
    textsdev[entity]=entity_text
    print(entity_text)
    print()
    
print(textsdev)


3DO Interactive Multiplayer
The 3DO Interactive Multiplayer, often called the 3DO, is a home video game console developed by The 3DO Company. Conceived by entrepreneur and Electronic Arts founder Trip Hawkins, the 3DO was not a console manufactured by the company itself, but a series of specifications, originally designed by Dave Needle and R. J

PDP-7
The PDP-7 was a minicomputer produced by Digital Equipment Corporation as part of the PDP series. Introduced in 1964, shipped since 1965, it was the first to use their Flip-Chip technology. With a cost of US$72,000, it was cheap but powerful by the standards of the time

TRS-80 Color Computer
The RadioShack TRS-80 Color Computer (later marketed as the Tandy Color Computer and sometimes nicknamed the CoCo) is a line of home computers based on the Motorola 6809 processor. The Tandy Color Computer line started in 1980 with what is now called the CoCo 1 and ended in 1991 with the more powerful CoCo 3. All three CoCo models maintained a high 

In [165]:
manufacturers_regex={}
manufacturers_dep={}
print(wiki_devices)
manu_predicates=['manufactured', 'produced', 'developed']

for entity in wiki_devices:
    if entity == 'Game Gear':
        break    # break here
    print('entity is',entity)
    print('manu_predicates',manu_predicates)
    print(nlp(textsdev[entity]))
    manu_relations=extract_manufacturer(nlp(textsdev[entity]), manu_predicates, entity)
    print("relation  is",manu_relations)
    print()
    manufacturers_dep.update(manu_relations)

for entity in wiki_devices:
    if entity == 'Game Gear':
        break    # break here
    print('entity is',entity)
    print('manu_predicates',manu_predicates)
    print(nlp(textsdev[entity]))
    manu_relations=extract_manufacturer_regex(nlp(textsdev[entity]), manu_predicates, entity)
    print("relation  is",manu_relations)
    print()
    manufacturers_regex.update(manu_relations)


['3DO Interactive Multiplayer', 'PDP-7', 'TRS-80 Color Computer', 'Walkman', 'Sega TeraDrive', 'GameCube', 'Cray-1', 'Sega CD', '32X', 'Game Gear', 'Sega Saturn', 'TRS-80', 'Vectrex', 'PalmPilot', 'ZX81', 'Volkswagen D24TIC engine', 'Volkswagen D24T engine', 'Volkswagen D24 engine', 'Spice MI-335 (Stellar Craze)', 'Spice Stellar Nhance Mi-435', 'Arirang (smartphone)', 'Micro-Professor MPF-I', 'HTC Touch Diamond2', 'HTC Touch 3G', 'HTC Touch Viva', 'Aakash (tablet)', 'Typekit', 'IPod Mini', 'Mac Mini', 'BMW M2B15', 'Coleco Gemini', 'Zune HD', 'Zune 30', 'Zune 4, 8, 16', 'Zune 80, 120', 'Motorola Hint QA30', 'Motorola W233', 'Motodext', 'Motorola A3100', 'Motorola Aura', 'Motorola Calgary', 'Motorola Photon Q', 'Motorola i1', 'NES Advantage', 'Game Boy', 'Nokia 6210 Navigator', 'Nokia 6710 Navigator', 'Nokia 5320 XpressMusic']
entity is 3DO Interactive Multiplayer
manu_predicates ['manufactured', 'produced', 'developed']
The 3DO Interactive Multiplayer, often called the 3DO, is a home vi

**Exercise 3c** Run the evaluation function `evaluate_property` to compute the performance for each of your four functions. Print the precision, recall, and F1-scores. (2 points)

In [166]:
print("birth year, substring matching:", utils.evaluate_property(birthyear_regex, gold_birthyears))
print("birth year, dependencies:", utils.evaluate_property(birthyear_dep, gold_birthyears))

print("manufacturer, substring matching:", utils.evaluate_property(manufacturers_regex, gold_manufacturers))
print("manufacturer, dependencies:", utils.evaluate_property(manufacturers_dep, gold_manufacturers))




### 4. Reflection (8 points)

For each property, we will now compare the two extraction methods in terms of precision and recall.

**Question 4a** Comparing the precision between the methods based on regular expressions and on syntax dependencies:
* Which method yields lower precision?
* Why do you think this is the case?
* Give an example to support your argument.

(4 points)

In [None]:
# Your answer here...

**Question 4b** Let's compare the recall for both properties. 
* Which method yields lower recall?
* Why do you think this is the case?
* Give an example to support your argument.

(4 points)

In [13]:
# Your answer here...