# Lab5 - Assignment 5 about extraction of properties

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-5 assignment of the Text Mining course. It is about Property Extraction.

**Due**: 17 Mar at 23:59

**How to submit**: Please submit your assignment using Canvas (see *Assignments* -> *Lab Session Property Extraction*). Convert your notebook to PDF (in JupyterLab, click *File* in the menu bar, select *Export Notebook As*, then select *Export Notebook to PDF*).

**Points**: each exercise is suffixed with the number of points you can obtain for the exercise.

**Assignment goals**:
* Get insight into the challenges of entity property extraction.
* Learn how to build a transparent property extraction method based on patterns.
* Get insight into the pros and cons of two pattern-based property extraction methods.
* Be able to run your extractors on unseen documents from Wikipedia.
* Be able to evaluate property extractors.

In this assignment, the main focus lies on creating your own pattern-based property extractors. You are then going to run them on Wikipedia texts, evaluate them against gold values, and reflect on their relative performance.

We recommend that you go through the notebooks in the following order:
* *Read the assignment (see below)*
* *Lab5-Property-extraction.ipynb*
* *Answer the questions of the assignment (see below) using the provided notebooks and submit*

**Hint:** in the explanation notebook, we had an example about extraction of properties with substring matching and with dependencies. You can use much of that code here, but make sure you make the right adjustments.

**Good luck & have fun!**

### 1. Extracting properties with substring matching (12 points)

**Exercise 1a** Write code that extracts the birth year of a person by using substring matching. (4 points)


In [1]:
import spacy
import lab5_utils as utils
import wikipedia
from spacy import displacy

model="en_core_web_sm"
nlp = spacy.load(model)

def extract_birth_year_regex(text):
    # Alternative idea: collect every year with a regex and take the lowest one.
    # That often gives the birth year, but it is not guaranteed: any earlier year
    # mentioned in the text would win.
    property_value_type = 'DATE'
    target_entity_type = 'PERSON'
    
    doc = nlp(text)
    patterns = ['born in', 'born on', 'born']
    
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()

    relations = {}  
    
    dates = utils.get_entities_of_type(property_value_type, doc)

    for date in dates:
        if utils.pattern_found_on_the_left(doc, date.i, patterns):
            # the closest PERSON entity is assumed to be the one who was born
            person = utils.find_closest_entity(doc.ents, date.idx, target_entity_type)
            year = utils.extract_year_from_date(date.text)
            if year and person:
                relations[person] = year

    return relations 

# dob_patterns=['born in', 'born on']

**Exercise 1b** Test your *birth year substring matching extractor* in the following way. 

* Write a sentence on which you expect that the extractor *WILL* work. 
* Write a sentence on which you expect that the extractor *WILL NOT* work. 

Run your extractor on both sentences and print the results. Make sure that the results are as expected. (2 points)

In [2]:
sent_works = 'John was born in July 12th 1999'
sent_dnwork = 'Alyson Lee Hannigan (born March 24, 1974) is an American'

for text in [sent_works, sent_dnwork]:
    dob_relations = extract_birth_year_regex(text)
    print(dob_relations)

{'John': 1999}
{'Alyson Lee Hannigan': 1974}


**Exercise 1c** Write code that extracts the manufacturer of a device by using substring matching. (4 points)

In [3]:
def extract_manufacturer_regex(text):
    property_value_type='ORG'
    target_entity_type='ORG'
    
    doc = nlp(text)
    patterns =  ['manufactured by', 'produced by', 'developed by']
    
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()
    
    relations = {}  
    
    manufacturers = utils.get_entities_of_type(property_value_type, doc)

    for manufacturer in manufacturers:
        if utils.pattern_found_on_the_left(doc, manufacturer.i, patterns):
            product = utils.find_closest_entity(doc.ents, manufacturer.idx, target_entity_type)
            if product:
                relations[product] = manufacturer.text

    return relations 

**Exercise 1d** Test your *manufacturer substring matching extractor* in the following way. 

* Write a sentence on which you expect that the extractor *WILL* work. 
* Write a sentence on which you expect that the extractor *WILL NOT* work. 

Run your extractor on both sentences and print the results. Make sure that the results are as expected. (2 points)

In [4]:
sent_works = """The TeraDrive (テラドライブ, TeraDoraibu) is an IBM PC compatible system with an integrated Mega Drive, developed by Sega and manufactured by IBM in 1991. """
sent_dnwork = 'Apple has been developing the Iphone for 15 years now.'

for text in [sent_works, sent_dnwork]:
    manu_relation = extract_manufacturer_regex(text)
    print(manu_relation)

{'TeraDoraibu': 'IBM'}
{}


### 2. Extracting properties by using dependency information (12 points)

**Exercise 2a** Write code that extracts the birth year of a person by using dependency information. (4 points)

In [5]:
# This task is hard because not every sentence contains the predicate 'born'. For people who have died,
# often only the date of birth and the date of death are mentioned. These could be retrieved with a regex
# that filters out all years, but that is error-prone and the years might not be connected to the person.

def fitting_dependency_birth(token, predicates):
    # the date must modify (directly or up to three levels up) an 'acl' predicate
    if token.dep_ not in ('nummod', 'npadvmod'):
        return False
    pred = None
    for ancestor in (token.head, token.head.head, token.head.head.head):
        if ancestor.dep_ == 'acl':
            pred = ancestor  # as before, the furthest matching ancestor wins
    return pred is not None and pred.text in predicates

def extract_birth_year_dep(text, main_entity): 
    
    property_value_type = 'DATE'
    target_entity_type = 'PERSON'
    
    predicates = ['born']
    doc = nlp(text)
    
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()
        
    relations={}
    
    dates = utils.get_entities_of_type(property_value_type, doc)
    for date in dates:
        if fitting_dependency_birth(date, predicates):
            person = utils.find_closest_entity(doc.ents, date.idx, target_entity_type)
            
            if not person:
                # fall back to the entity the document is about
                person = main_entity
            if person and date:
                relations[person] = utils.extract_year_from_date(date.text)
    
    return relations
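
The comment at the top of this cell notes that pages about deceased people often give only a parenthesised lifespan, with no 'born' predicate. A minimal sketch of a regex fallback for that format is shown below. The names `LIFESPAN` and `birth_year_from_lifespan` are hypothetical and not part of `lab5_utils`, and the sketch assumes that the first parenthesised pair of years in the text is the lifespan, which will not hold for every page.

```python
import re

# Two four-digit years inside one pair of parentheses, separated by a dash,
# e.g. "(27 May 1922 – 7 June 2015)" or "(12 August 1920 - 23 July 2019)".
LIFESPAN = re.compile(r'\([^()]*?\b(\d{4})\s*[–-]\s*[^()]*?\b(\d{4})\)')

def birth_year_from_lifespan(text):
    """Return the first year of the first lifespan-like pattern, or None."""
    match = LIFESPAN.search(text)
    return int(match.group(1)) if match else None

example = ('Sir Christopher Frank Carandini Lee (27 May 1922 – 7 June 2015) '
           'was an English actor')
print(birth_year_from_lifespan(example))
```

As the comment above warns, a year found this way is not linked to a person by any syntactic evidence, so it could at most serve as a fallback when the dependency pattern finds nothing.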

**Exercise 2b** Test your *birth year dependency extractor* in the following way. 

* Write a sentence on which you expect that the extractor *WILL* work. 
* Write a sentence on which you expect that the extractor *WILL NOT* work. 

Run your extractor on both sentences and print the results. Make sure that the results are as expected. (2 points)

In [6]:
sent_works = 'John (born July 12th 1999)'
sent_dnwork = 'He first saw light in 1800'

entities = ['John', 'He']

for text, entity in zip([sent_works, sent_dnwork], entities):
    print(extract_birth_year_dep(text, entity))

{'John': 1999}
{}


In [7]:
# wp = wikipedia.page('Christopher Lee')
# entity_text = "Sir Christopher Frank Carandini Lee,  (27 May 1922 – 7 June 2015) was an English actor, singer and author With a career spanning nearly 7 decades, Lee was well known for portraying villains and became best known for his role as Count Dracula in a sequence of Hammer Horror films, a typecasting he always lamented His other film roles include Francisco Scaramanga in the James Bond film The Man with the Golden Gun (1974), Count Dooku in the Star Wars prequel trilogy (2002 and 2005), and Saruman in the Lord of the Rings film trilogy (2001–2003) and the Hobbit film trilogy (2012–2014) "#('').join(wp.content.split('.')[:3])
# doc = nlp(entity_text)
# displacy.render(doc, jupyter=True, style='dep')
# for sent in doc.sents:
#     utils.to_nltk_tree(sent.root).pretty_print()

# extract_birth_year_dep(entity_text, 'Christopher Lee')

**Exercise 2c** Write code that extracts the manufacturer of a device by using dependency information. (4 points)

In [8]:
def fitting_dependency_man(token, predicates):
    if token.dep_ == 'pobj' and token.head.dep_ == 'agent' and token.head.head.dep_ =='acl':
        pred=token.head.head
        if pred.text in predicates:
            return True
        else:
            return False
    else:
        return False

def extract_manufacturer_dep(text, main_entity):
        
    property_value_type='ORG'
    target_entity_type='PRODUCT'
    
    predicates = ['manufactured', 'produced', 'developed', 'designed']
    doc = nlp(text)
    
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()
    
    relations={}
    
    manus = utils.get_entities_of_type(property_value_type, doc)
    for manu in manus:
        if fitting_dependency_man(manu, predicates):
            device=utils.find_closest_entity(doc.ents, manu.idx, target_entity_type)

            if not device:
                device=main_entity
            if device and manu:
                relations[device]=manu.text
    return relations


In [9]:
wp = wikipedia.page("GameCube")
entity_text = ('').join(wp.content.split('.')[:3])
doc = nlp(entity_text)
print(entity_text)
displacy.render(doc, jupyter=True, style='dep')

for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_)


The Nintendo GameCube (commonly abbreviated as GameCube) is a home video game console released by Nintendo in Japan and North America in 2001 and in the PAL territories in 2002 The sixth-generation console is the successor to the Nintendo 64 It competed with Sony's PlayStation 2 and Microsoft's original Xbox


The B LOC
Nintendo I LOC
GameCube I LOC
( O 
commonly O 
abbreviated O 
as O 
GameCube B ORG
) O 
is O 
a O 
home O 
video O 
game O 
console O 
released O 
by O 
Nintendo B ORG
in O 
Japan B GPE
and O 
North B LOC
America I LOC
in O 
2001 B DATE
and O 
in O 
the O 
PAL O 
territories O 
in O 
2002 B DATE
The O 
sixth B ORDINAL
- O 
generation O 
console O 
is O 
the O 
successor O 
to O 
the B EVENT
Nintendo I EVENT
64 I EVENT
It O 
competed O 
with O 
Sony B ORG
's O 
PlayStation B PRODUCT
2 I PRODUCT
and O 
Microsoft B ORG
's O 
original O 
Xbox B PERSON


**Exercise 2d** Test your *manufacturer dependency extractor* in the following way. 

* Write a sentence on which you expect that the extractor *WILL* work. 
* Write a sentence on which you expect that the extractor *WILL NOT* work. 

Run your extractor on both sentences and print the results. Make sure that the results are as expected. (2 points)

In [10]:
# Unable to make this work because the model recognizes barely any of the entities as products.


sent_works = """The iPhone is a line of smartphones designed and marketed by Apple Inc. All generations of the iPhone use Apple's iOS mobile operating system software. The first-generation iPhone was released on June 29, 2007, and multiple new hardware iterations with new iOS releases have been released since.

The user interface is built around the device's multi-touch screen, including a virtual keyboard. The iPhone has Wi-Fi and can connect to cellular networks. An iPhone can take photos, play music, send and receive email, browse the web, send and receive text messages, record notes, perform mathematical calculations, and receive visual voicemail. Shooting video also became a standard feature with the iPhone 3GS. Other functionality, such as video games, reference works, and social networking, can be enabled by downloading mobile apps. As of January 2017, Apple's App Store contained more than 2.2 million applications available for the iPhone.

Apple has released twelve generations of iPhone models, each accompanied by one of the twelve major releases of the iOS operating system. The first-generation iPhone was a GSM phone and established design precedents, such as a button placement that has persisted throughout all releases and a screen size maintained for the next four iterations. The iPhone 3G added 3G network support, and was followed by the iPhone 3GS with improved hardware, the iPhone 4 with a metal chassis, higher display resolution and front-facing camera, and the iPhone 4S with improved hardware and the voice assistant Siri. The iPhone 5 featured a taller, 4 inches (100 mm) display and Apple's newly introduced Lightning connector. In 2013, Apple released the iPhone 5S with improved hardware and a fingerprint reader (marketed as 'Touch ID'), and the lower-cost iPhone 5C, a version of the 5 with colored plastic casings instead of metal. They were followed by the larger iPhone 6 and iPhone 6 Plus, with models featuring 4.7-and-5.5-inch (120 and 140 mm) displays. The iPhone 6S was introduced the following year, which featured hardware upgrades and support for pressure-sensitive touch inputs, as well as the iPhone SE—which featured hardware from the 6S but the smaller form factor of the 5S. In 2016, Apple unveiled the iPhone 7 and iPhone 7 Plus, which add water resistance, improved system and graphics performance, a new rear dual-camera setup on the Plus model, and new color options, while removing the 3.5 mm headphone jack found on previous models. The iPhone 8 and iPhone 8 Plus were released in 2017, adding a glass back and an improved screen and camera. The iPhone X was released alongside the iPhone 8 and iPhone 8 Plus, with its highlights being a near bezel-less design, an improved camera and a new facial recognition system, named Face ID, but having no home button, and therefore, no Touch ID. 
In September 2018, Apple again released 3 new iPhones, which are the iPhone XS, an upgraded version of the since discontinued iPhone X, iPhone XS Max, a larger variant with the series' biggest display as of 2018 and iPhone XR, a lower end version of the iPhone X.

The first-generation iPhone was described as "revolutionary" and a "game-changer" for the mobile phone industry. Subsequent iterations of the iPhone have also garnered praise. The iPhone is one of the most widely used smartphones in the world, and its success has been credited with helping Apple become one of the world's most valuable publicly traded companies."""
sent_dnwork = 'Apple has been developing the Iphone for 15 years now.'

for text, entity in zip([sent_works, sent_dnwork], ['Iphone', 'Iphone']):
    manu_relation = extract_manufacturer_dep(text, entity)
    print(manu_relation)
    
    

{}
{}


The relation extraction does not seem to work because the NLP model does not recognize any PRODUCT entity.

### 3. Running and evaluating extractors on Wikipedia (8 points)

We will run our extractors on 50 documents about people and 50 documents about devices. We provide code to load the lists of entities and the gold values.

In [11]:
import json
with open("birthyears.json", 'rb') as f:
    gold_birthyears = json.load(f)
    wiki_people = list(gold_birthyears.keys())
    
with open("manufacturers.json", 'rb') as f:
    gold_manufacturers = json.load(f)
    wiki_devices = list(gold_manufacturers.keys())

The lists `wiki_people` and `wiki_devices` contain the names of 50 people and 50 devices, respectively.

The dictionaries `gold_birthyears` and `gold_manufacturers` contain gold values for each of these entities.

We provide a function that evaluates your extracted property values against known ("gold") property values. The function returns three evaluation scores: precision, recall, and F1-score. You can call this function as follows:

`utils.evaluate_property(system_json, gold_json)`

(make sure to replace the system_json and the gold_json with the concrete dictionaries you are comparing, depending on the property and the method)
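
To make the three scores concrete, here is a minimal sketch of exact-match scoring over two flat dictionaries. It is an illustration only: `evaluate_sketch` is a hypothetical stand-in, and the actual implementation of `utils.evaluate_property` may count matches differently. The toy dictionaries are made up for the example.

```python
def evaluate_sketch(system, gold):
    """Hypothetical exact-match scorer; not the real utils.evaluate_property."""
    # A true positive requires the same entity key AND the same value as in gold.
    tp = sum(1 for entity, value in system.items() if gold.get(entity) == value)
    fp = len(system) - tp   # predictions that are wrong or not in gold
    fn = len(gold) - tp     # gold pairs that the system did not reproduce
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

toy_gold = {'Alyson Hannigan': 1974, 'Cybill Shepherd': 1950}
toy_system = {'Alyson Hannigan': 1974, 'Cybill Shepherd': 1951}
print(evaluate_sketch(toy_system, toy_gold))  # (0.5, 0.5, 0.5)
```

Under this convention a wrong value counts as both a false positive and a false negative, which is one common choice but an assumption here.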

Now that we have stored the gold values for both properties in our dictionaries `gold_birthyears` and `gold_manufacturers`, and written the evaluation function, we need to obtain the system output as well and then perform evaluation.

For this purpose, we will run our extractors on texts about the same 50 people and 50 devices from Wikipedia. As in the explanation notebook, we will use the `Wikipedia` library for this purpose. Same as in the explanation notebook, we will only process the first three sentences.
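
Taking the first three sentences can be sketched with a naive split on full stops, as below. The helper name `first_sentences` is hypothetical. Unlike joining the pieces with an empty string, it puts the full stops back so that the parser still sees sentence boundaries. The split is an approximation: abbreviations such as "Inc." will be cut in the wrong place.

```python
def first_sentences(content, n=3):
    """Return the first n sentences of a text, using a naive split on '.'."""
    sentences = [s.strip() for s in content.split('.') if s.strip()]
    if not sentences:
        return ''
    # Re-insert the full stops that split() removed.
    return '. '.join(sentences[:n]) + '.'

print(first_sentences('A b. C d. E f. G h.'))  # A b. C d. E f.
```

In the loops below, the argument would be the `content` attribute of the page returned by `wikipedia.page(...)`.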

In exercises 3a and 3b, we will run all our four processing functions and store the results in four different dictionaries. 
Then, in exercise 3c, we will run the evaluation function four times to compute precision, recall, and F1-score for all four functions.

**Exercise 3a** Run your two extractors about birth years of people (from exercises 1a and 2a) on all 50 documents about people. Save the extracted values in two different dictionaries: `birthyear_regex` and `birthyear_dep`. (3 points)

In [12]:
birthyear_regex = {}
birthyear_dep = {}

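for person in wiki_people:
    wp = None  # reset so a failed lookup cannot reuse the previous page
    try:
        wp = wikipedia.page(person)
    except Exception:
        try:
            # retry with the spaces removed from the title
            person_n = person.replace(" ", "")
            wp = wikipedia.page(person_n)
        except Exception:
            print('Does not exists', person)
        
    if wp:
        # only process the first three sentences, as in the explanation notebook
        entity_text = ('').join(wp.content.split('.')[:3])
        birthyear_regex[person] = extract_birth_year_regex(entity_text)
        birthyear_dep[person] = extract_birth_year_dep(entity_text, person)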
        
    else:
        birthyear_regex[person] = {'DNE'}
        birthyear_dep[person] = {'DNE'}

Does not exists Albert Finney


IndexError: list index out of range

In [None]:
('').join(wikipedia.page(wiki_people[0].replace(" ", "")).content.split('.'))

**Exercise 3b** Run your extractors about manufacturers of devices (from exercises 1c and 2c) on all 50 documents about devices. Make sure you only process the first three sentences from each document. Save the extracted values in two dictionaries: `manufacturers_regex` and `manufacturers_dep`. (3 points)

In [None]:
manufacturers_regex = {}
manufacturers_dep = {}

for device in wiki_devices:
    try:
        wp = wikipedia.page(device)
    except:
        wp = None
        print('Does not exists', device)
        
    if wp:
        # only process the first three sentences of each document
        entity_text = ('').join(wp.content.split('.')[:3])
        manufacturers_regex[device] = extract_manufacturer_regex(entity_text)
        manufacturers_dep[device] = extract_manufacturer_dep(entity_text, device)
    else:
        manufacturers_regex[device] = {'DNE'}
        manufacturers_dep[device] = {'DNE'}

**Exercise 3c** Run the evaluation function `evaluate_property` to compute the performance for each of your four functions. Print the precision, recall, and F1-scores. (2 points)

In [None]:
print('Scores for Manufacturers Regex: ')
print(utils.evaluate_property(manufacturers_regex, gold_manufacturers))
print('\nScores for Manufacturers Dependency: ')
print(utils.evaluate_property(manufacturers_dep, gold_manufacturers))
print('\nScores for Birthyear Regex: ')
print(utils.evaluate_property(birthyear_regex, gold_birthyears))
print('\nScores for Birthyear Dependency: ')
print(utils.evaluate_property(birthyear_dep, gold_birthyears))

### 4. Reflection (8 points)

For each entity, we will now compare the two methods to extract properties in terms of precision and recall.

**Question 4a** Comparing the precision between the methods based on regular expressions and on syntax dependencies:
* Which method yields lower precision?
* Why do you think this is the case?
* Give an example to support your argument.

(4 points)

- Precision is zero because none of the extracted (entity, value) pairs exactly matches a gold pair, so there are no true positives. We do not even hit the correct combination by accident.
- Both methods yield very low results.
- This is caused by several things:
    - The extracted entities cannot be compared directly with the gold labels because they differ. Take names, for example: the same person can be referred to as "Mr. Gold" or by his full name.
    - On the Wikipedia pages of people who have died, the date of birth and the date of death are both given without the predicate 'born', which makes this relation hard to find. The format used is (12 August 1920 - 23 July 2019).
    - The same naming mismatch holds for manufacturers and the devices they produce: it is very hard to find exactly the same name in the text.
    - PRODUCT entities do not seem to be detected well.
- Examples:
    - 'Alyson Hannigan': {'Alyson Lee Hannigan': 1974}
    - 'Cybill Shepherd': {'Cybill Lynne Shepherd': 1950}
    - 'Walkman': {'Walkman': 'Sony The original Walkman'}
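
The name mismatch in the examples above could be reduced by matching names loosely before evaluation. The sketch below is one possible approach, not something the assignment prescribes: `same_entity` and `flatten` are hypothetical helpers, and it is an assumption that the evaluation function expects a flat `{entity: value}` dictionary.

```python
def same_entity(gold_name, extracted_name):
    """Loose match: every token of the gold name occurs in the extracted name."""
    gold_tokens = set(gold_name.lower().split())
    extracted_tokens = set(extracted_name.lower().split())
    return bool(gold_tokens) and gold_tokens <= extracted_tokens

def flatten(system_output):
    """Turn {'gold name': {'extracted name': value}} into {'gold name': value}."""
    flat = {}
    for gold_name, relations in system_output.items():
        if not isinstance(relations, dict):
            continue  # skip placeholders such as {'DNE'}
        for extracted_name, value in relations.items():
            if same_entity(gold_name, str(extracted_name)):
                flat[gold_name] = value
    return flat

nested = {'Alyson Hannigan': {'Alyson Lee Hannigan': 1974},
          'Cybill Shepherd': {'Cybill Lynne Shepherd': 1950}}
print(flatten(nested))
```

This only addresses the entity side. A value such as 'Sony The original Walkman' would still fail an exact comparison with a gold manufacturer name.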

**Question 4b** Let's compare the recall for both properties. 
* Which method yields lower recall?
* Why do you think this is the case?
* Give an example to support your argument.

(4 points)

- Recall will be very close to zero because we have many false negatives: gold values that the system fails to return. This happens because the extracted names all differ slightly from the gold names, so the pairs cannot be matched. See the examples in 4a.