# Lab5-Property extraction

In this notebook, we provide more information about the task of Property Extraction.

Overview of the content covered in this notebook:
1. Introduction to property extraction
2. Building pattern-based extractors
3. Coding pattern-based extractors
4. Evaluating extractors

**At the end of this notebook, you will be able to**:
* understand the task of Property Extraction and its relation to similar tasks, like Relation Extraction
* build a pattern-based Property Extractor
* apply it to extract properties from text
* evaluate the pattern-based extractor

**Useful links**:


### 1. Introduction to property extraction


The task of property extraction aims to fill knowledge bases with information about entities that we find in text. There are other tasks that are similar to it, such as: 
* slot filling, where we attempt to complete entity information according to some schema
* relation extraction, where we determine the relation between two given entities (for example, in `Microsoft X Bill_Gates`, the relation X is `hasCEO`)
* knowledge base completion, where we usually complete a knowledge base by inference from existing structured information (not from text)
* open information extraction, where no schema is available and disambiguation is non-trivial

In all these tasks, including property extraction, we typically extract knowledge in the form of **a triple**. A triple consists of three elements: a subject, a predicate, and an object. An example of a triple is:

Barack_Obama hasAge 57

Here, Barack_Obama is a subject, hasAge is a predicate/relation, and 57 is an object. The subjects and the predicates of a triple are always URIs; the objects can be either a URI (like Barack_Obama), or a literal (like 57, or "Barack").
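
As a minimal sketch, a triple can be represented in Python as a small named tuple. The class name `Triple`, its field names, and the predicate `hasFirstName` are assumptions made for illustration; they are not part of any particular knowledge base.

```python
from typing import NamedTuple, Union


class Triple(NamedTuple):
    subject: str                 # identifier of the entity, e.g. "Barack_Obama"
    predicate: str               # identifier of the relation, e.g. "hasAge"
    obj: Union[str, int, float]  # another entity identifier or a literal value


# The example triple from the text, with a numeric literal as object.
age = Triple("Barack_Obama", "hasAge", 57)
# A second triple with a string literal as object (hypothetical predicate).
first_name = Triple("Barack_Obama", "hasFirstName", "Barack")

for triple in (age, first_name):
    print(triple.subject, triple.predicate, repr(triple.obj))
```

Keeping the object as a typed Python value (here `int` versus `str`) makes the difference between numeric and string literals explicit.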

Hence, property extraction typically requires us to:
1. **detect** a property value in text (e.g., 57) and an entity it belongs to ("Barack")
2. **interpret** both the property value (57 as a number) and the entity ("Barack" means Barack_Obama)
3. **find their relation** - the connection between Barack_Obama and 57 is the relation hasAge

In this sense, the task of property extraction builds on the output of NERC and NERD.

**Challenges** The property extraction task is difficult for reasons similar to those we have seen with entity linking: ambiguity (of entities, relations, and property values), variation (of entities, values, and relations), and vagueness (when insufficient detail is given). Furthermore, relations can sometimes span multiple sentences or require a lot of world knowledge to understand.

### 2. Building extractors


#### 2.1 Methods

This week, we will create our own property extractors. Building automatic property extractors is not trivial, because it requires multiple steps, and in practice these steps can differ a lot between attributes.

There are two common methods for extracting attributes from text: defining patterns and distant supervision. 
* The most basic approach for extracting attributes from text is pattern matching. This approach is transparent, but it requires us to define the patterns for each property separately. For example, for the attribute "birthplace", we can use the pattern "X, born in Y" or "X from Y". Typically, the patterns are combined with information on entity types to improve their precision. For example, we check whether X is indeed a person and Y is indeed a location in the above example.
* The second approach, distant supervision, relies on knowledge base information that is only loosely aligned with a text. For example, think of a Wikipedia document that describes Donald Trump on the one hand, and structured information from, for example, Wikidata that tells us that he was born in 1946 on the other. However, we don't know whether this information is explicitly mentioned in the Wikipedia text and, if so, where and how. With this method, we train, for example, a recurrent neural network on this kind of data, and hope that the network learns the patterns in which the attribute is typically expressed in text.
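
The pattern "X, born in Y" can be sketched with a regular expression. This is an illustration under simplifying assumptions: the helper `extract_birthplaces` and the `NAME` sub-pattern are hypothetical, and "a run of capitalized words" stands in for the entity-type check (person, location) that a named entity recognizer would normally provide.

```python
import re

# Assumption: an entity mention is a run of capitalized words.
NAME = r"[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*"
BORN_IN = re.compile(rf"(?P<person>{NAME}), born in (?P<place>{NAME})")


def extract_birthplaces(text):
    """Return (person, 'birthplace', place) tuples for every pattern match."""
    return [(m.group("person"), "birthplace", m.group("place"))
            for m in BORN_IN.finditer(text)]


sentence = "Marie Curie, born in Warsaw, moved to Paris in 1891."
print(extract_birthplaces(sentence))
```

The pattern is transparent but brittle: a rephrasing such as "was born in" is not covered, and each such variant needs its own pattern.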

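The labelling step behind distant supervision can be sketched as follows: every sentence that mentions both the subject and the value of a known fact is taken as a (noisy) training example for that property. The function name `label_sentences`, the dictionary layout of `kb_facts`, and the example sentences are assumptions for illustration; training a model on the resulting examples is not shown.

```python
def label_sentences(sentences, kb_facts):
    """Pair each KB fact with the sentences that mention its subject and value.

    kb_facts maps (subject, property) to a value. Matching is a naive
    substring test, so the resulting labels are noisy by design.
    """
    examples = []
    for (subject, prop), value in kb_facts.items():
        for sentence in sentences:
            if subject in sentence and str(value) in sentence:
                examples.append((sentence, subject, prop, value))
    return examples


kb_facts = {("Juventus F.C.", "founding_year"): 1897}
sentences = [
    "Juventus F.C. is an Italian professional football club based in Turin.",
    "Juventus F.C. was founded in 1897 by a group of Torinese students.",
]

for example in label_sentences(sentences, kb_facts):
    print(example)
```

A sentence can contain both the subject and the value without expressing the relation, which is why the labels obtained this way are only a weak signal.
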
#### 2.2 Building a pattern-based extractor

Typically, this consists of several parts:
1. find a mention of a specific attribute (for example, money or birth date) in text
2. assign this mention to some subject entity
3. normalize the attribute value
4. normalize the subject entity

**Example** Let's say we want to extract values for `founding year` in the following paragraph:

"Juventus F.C. is an Italian professional football club based in Turin. The club was founded in March 1897 by a group of Torinese students."


First, we need to find attribute values that contain information about founding years. For example, we can use the pattern "founded in" to extract the attribute value `March 1897` in the second sentence.

Next, we need to see to which subject this attribute belongs. Assuming that we perform dependency parsing of the sentences, we can find that the relation "founded in" has a subject `The club`. At this point, we can extract the following relation:

The club FOUNDING_YEAR March 1897

Syntactically, this is the correct way to extract the relation. However, the relation is not very useful yet: we need to normalize its subject and object to make it useful in a semantic sense.

Hence, we can normalize the value "March 1897" to a year value `1897`, for example, by looking for 4-digit numbers in the phrase.

Then, we can normalize "The club" to `Juventus F.C.` by using entity coreference between the two sentences. We can even disambiguate the mention to https://en.wikipedia.org/wiki/Juventus_F.C., or `Juventus_F.C.` for brevity. This finally leads us to the following relation:

`Juventus_F.C. FOUNDING_YEAR 1897`

which looks much more useful.
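
The four steps of this example can be sketched in plain Python. This is a toy version under strong assumptions: regular expressions stand in for the dependency parse, and the "coreference" step simply maps a mention starting with "The" to the name that opens the text. All function names are hypothetical.

```python
import re

TEXT = ("Juventus F.C. is an Italian professional football club based in Turin. "
        "The club was founded in March 1897 by a group of Torinese students.")


def find_value(text):
    """Step 1: detect the attribute mention that follows 'founded in'."""
    match = re.search(r"founded in ((?:[A-Z][a-z]+ )?\d{4})", text)
    return match.group(1) if match else None


def find_subject(text):
    """Step 2: regex stand-in for the parse: the noun phrase before 'was founded'."""
    match = re.search(r"(The [a-z]+) was founded", text)
    return match.group(1) if match else None


def normalize_year(value):
    """Step 3: keep the first 4-digit number in the phrase."""
    match = re.search(r"\d{4}", value)
    return int(match.group()) if match else None


def resolve_subject(mention, text):
    """Step 4: toy coreference: map a 'The ...' mention to the name opening the text."""
    if mention.startswith("The "):
        mention = text.split(" is ", 1)[0]
    return mention.replace(" ", "_")


raw_value = find_value(TEXT)      # the phrase "March 1897"
raw_subject = find_subject(TEXT)  # the phrase "The club"
triple = (resolve_subject(raw_subject, TEXT), "FOUNDING_YEAR", normalize_year(raw_value))
print(triple)
```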

### 3. Coding pattern-based extractors

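A simple example of extracting relations between phrases and entities uses spaCy's named entity recognizer and the dependency parse. spaCy's own example extracts money and currency values (entities labelled as MONEY) and then checks the dependency tree to find the noun phrase they refer to, for example: $9.4 million --> Net income. The code below follows the same idea, but extracts founding years (entities labelled as DATE) for organizations and nationalities (entities labelled as NORP) for persons. It is written for spaCy v2: it relies on `Span.merge()`, which was removed in spaCy v3.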

In [1]:
#!/usr/bin/env python
# coding: utf8

import re
import spacy

model="en_core_web_sm"

nlp = spacy.load(model)
print("Info: Loaded model '%s'" % model)

Info: Loaded model 'en_core_web_sm'


In [10]:
#Find entities of a certain type and with the smallest distance to the property.
def find_closest_entity(entities, prop_position, e_type):
    min_distance=9999
    closest_entity=None
    for ent in entities:
        if ent.label_!=e_type: 
            continue # skip entities of different types
            
        # determine the distance between the entity and the property
        distance=abs(ent.start_char-prop_position)
        if min_distance>distance:
            min_distance=distance
            closest_entity=ent
    return closest_entity

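#Get the year 
def extract_year_from_date(date):
    # raw string, so that \d reaches the regex engine unchanged
    match = re.search(r'\d{4}', date)
    # dates such as "last week" contain no 4-digit year
    return int(match.group()) if match else None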

def print_relations(r):
    for r1, r2, r3 in r:
        print('{:<10}\t{}\t{}'.format(r1.text, r2, r3))
    return

In [11]:
def extract_date_relations(doc, patterns):
    # merge entities and noun chunks into one token
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()

    relations = {}
    
    for date in filter(lambda w: w.ent_type_ == 'DATE', doc):
        # only keep dates of the form "<pattern verb> <preposition> <date>",
        # e.g. "founded in March 1897"
        if date.dep_ != 'pobj' or date.head.dep_ != 'prep':
            continue
        pred=date.head.head
        if pred.text not in patterns:
            continue

        year=extract_year_from_date(date.text)
        org=find_closest_entity(doc.ents, date.idx, 'ORG')
        if year and org:
            relations[org.text]=year

    return relations        

In [12]:
def extract_nationality(doc):
    relations={}
    spans = list(doc.ents) + list(doc.noun_chunks)
    for span in spans:
        span.merge()
    
    for nationality in filter(lambda w: w.ent_type_ == 'NORP', doc):
        per=find_closest_entity(doc.ents, nationality.idx, 'PERSON')
        if per and nationality:
            relations[per.text]=nationality
    return relations

**Making sure our extractors work**


In [13]:
def get_all_relations(texts):
    all_relations={'founding_year':{}, 'nationality': {}}
    founded_patterns=['founded', 'established', 'created']
    
    for index, text in enumerate(texts):
        doc = nlp(text)
        all_relations['founding_year'].update(extract_date_relations(doc, founded_patterns))
        all_relations['nationality'].update(extract_nationality(doc))
        print('\nRelations in text %d' % (index+1))

    print(all_relations)
    return all_relations

In [14]:
TEXTS = [
    'Mike is French, and John is American.',
    'Juventus F.C. is an Italian professional football club based in Turin. The club was founded in March 1897 by a group of Torinese students.',
]    

if __name__ == '__main__':
    print("Info: Processing %d texts" % len(TEXTS))
    print()
    print('**Extracted relations:**')
    print()
    

    get_all_relations(TEXTS)


Info: Processing 2 texts

**Extracted relations:**


Relations in text 1

Relations in text 2
{'founding_year': {'Juventus F.C.': 1897}, 'nationality': {'Mike': French, 'John': American}}


#### 3.3 Processing Wikipedia

Next, we are going to use Wikipedia. You first need to install a client package to access Wikipedia. From the terminal (with the same environment that you use for notebooks), run:

conda install -c conda-forge wikipedia 

In [15]:
import wikipedia

The next cell extracts properties from Wikipedia. You need an Internet connection to access Wikipedia. If you are not connected, or the connection is too slow, you get an error like the following:

NewConnectionError: <urllib3.connection.HTTPConnection object at 0x1289d3c88>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known


In [16]:
entities=["Piek Vossen", "Frank van Harmelen", "Bremen High School (Midlothian, Illinois)"]
texts=[]

for entity in entities:
    wp = wikipedia.page(entity)
    texts.append(wp.content)
    
system_data = get_all_relations(texts)

ConnectionError: HTTPConnectionPool(host='en.wikipedia.org', port=80): Max retries exceeded with url: /w/api.php?list=search&srprop=&srlimit=1&limit=1&srsearch=Piek+Vossen&srinfo=suggestion&format=json&action=query (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1289d3c88>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))

### 4. Evaluating extractors

In [None]:
properties=['founding_year', 'nationality']
gold={}
gold['nationality']={'Piek Vossen': 'Dutch', 'Frank van Harmelen': 'Dutch'}
# the year is an int, like the values returned by extract_year_from_date
gold['founding_year']={'Bremen High School (Midlothian, Illinois)': 1953}

for prop, property_data in gold.items():
    for entity, property_value in property_data.items():
        # fall back to 'not found' when the system extracted nothing for this entity
        system_value=system_data[prop].get(entity, 'not found')
        print('Entity: %s, property: %s, gold value: %s, system value: %s' % (entity, prop, property_value, system_value))
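
Beyond printing gold and system values side by side, the comparison can be summarized as precision, recall, and F1 per property. The sketch below is one possible way to do this and not the notebook's own evaluation: the function name `score` is hypothetical, values are compared as strings, and the system output in the usage example is made up for illustration.

```python
def score(gold, system):
    """Precision, recall and F1 over the (entity, value) pairs of one property."""
    # Compare as strings so that 1953 and '1953' count as the same value.
    gold_pairs = {(entity, str(value)) for entity, value in gold.items()}
    system_pairs = {(entity, str(value)) for entity, value in system.items()}
    correct = len(gold_pairs & system_pairs)

    precision = correct / len(system_pairs) if system_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold_nationality = {'Piek Vossen': 'Dutch', 'Frank van Harmelen': 'Dutch'}
# Made-up system output: one correct pair and one pair that is not in the gold data.
example_system = {'Piek Vossen': 'Dutch', 'Some Other Person': 'French'}

precision, recall, f1 = score(gold_nationality, example_system)
print('precision: %.2f, recall: %.2f, F1: %.2f' % (precision, recall, f1))
```

Note that every system pair missing from the gold data counts as a false positive here. When the gold data covers only a few entities, it may be fairer to restrict the system output to the gold entities before scoring.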