# Lab5-Property extraction

In this notebook, we provide more information about the task of Property Extraction.

Overview of the content covered in this notebook:
1. Introduction to property extraction
2. Building pattern-based extractors
3. Coding pattern-based extractors
4. Processing Wikipedia
5. Evaluating extractors

**At the end of this notebook, you will be able to**:
* understand the task of Property Extraction and its relation to similar tasks, like Relation Extraction
* build a pattern-based Property Extractor
* apply it to extract properties from text
* evaluate the pattern-based extractor

**Useful links**:


### Before we start: set up your environment



**1. Install Wikipedia client** In this week's lab session we are going to use Wikipedia. You first need to install a client package to access Wikipedia. From the terminal (with the settings that you use for notebooks), run:

`conda install -c conda-forge wikipedia`

In [None]:
import wikipedia

**2. Internet** Note that you need an Internet connection to be able to access Wikipedia. If you are not connected, or the connection is too slow, you will get an error like the following:

`NewConnectionError: <urllib3.connection.HTTPConnection object at 0x1289d3c88>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known`

**3. SpaCy** Another library that we will use in this week's lab session is SpaCy and its English model "en_core_web_sm". You probably have this installed already, because we used this setup in assignment 3. If you don't, please follow the instructions on the SpaCy website: https://spacy.io.

Typically, the installation commands you need are:

`conda install -c conda-forge spacy`

`python -m spacy download en_core_web_sm`

We can now import SpaCy and its English model:

In [8]:
import spacy

model="en_core_web_sm"

nlp = spacy.load(model)
print("Info: Loaded model '%s'" % model)

Info: Loaded model 'en_core_web_sm'


If the above code blocks did not yield any error, then you are all set up for this week's session. Let's start ;)

### 1. Introduction to property extraction

In the task of entity linking, we disambiguated entity mentions in text by connecting each mention to its correct referent in a knowledge base. Although these knowledge bases are typically fairly large, they are far from complete. Tasks like property extraction and relation extraction help to make knowledge bases more complete.

The task of property extraction aims to fill knowledge bases with information about properties of entities that we find in text. There are other tasks that are similar to it, such as: 
* slot filling, where we attempt to complete entity information according to some schema
* relation extraction - given two entities, what is their relation (for example, in Microsoft X Bill_Gates, the relation X is `hasCEO`)
* knowledge base completion, where we usually complete a knowledge base by inference from existing structured information (not from text).
* open information extraction - no schema available, disambiguation is non-trivial

In all these tasks, including property extraction, we typically extract "pieces of knowledge" in the form of **a triple**. A triple consists of three elements: a subject, a predicate, and an object. An example of a triple is:

Barack_Obama hasAge 57

Here, Barack_Obama is a subject, hasAge is a predicate/relation, and 57 is an object. The subjects and the predicates of a triple are always URIs; the objects can be either a URI (like Barack_Obama), or a literal (like 57, or "Barack").
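
As a rough illustration, a triple can be held in a small Python structure. The `Triple` class and the `has_numeric_literal` helper below are names invented for this sketch; they are not part of any knowledge base API, and plain strings stand in for URIs.

```python
from typing import NamedTuple, Union

class Triple(NamedTuple):
    """A (subject, predicate, object) statement; the field names are illustrative."""
    subject: str                 # identifier of the entity, e.g. 'Barack_Obama'
    predicate: str               # identifier of the relation, e.g. 'hasAge'
    obj: Union[str, int, float]  # another identifier or a literal value

age = Triple('Barack_Obama', 'hasAge', 57)
ceo = Triple('Microsoft', 'hasCEO', 'Bill_Gates')

def has_numeric_literal(triple):
    """True if the object is a number, so it must be a literal rather than a URI."""
    return isinstance(triple.obj, (int, float))

print(age)
print(has_numeric_literal(age), has_numeric_literal(ceo))
```

A string object stays ambiguous in this sketch: it could be an identifier (`Bill_Gates`) or a string literal (`"Barack"`), which is one reason real knowledge bases mark the two explicitly.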

Hence, property extraction typically requires us to:
1. **detect** a property value in text (e.g., 57) and an entity it belongs to ("Barack")
2. **interpret** both the property value (57 as a number) and the entity ("Barack" means Barack_Obama)
3. **find their relation** - the connection between Barack_Obama and 57 is the relation hasAge

In this sense, the task of property extraction builds on top of the output of NERC and NERD.

**Challenges** The property extraction task is difficult for reasons similar to those we discussed for entity linking: ambiguity (of entities, relations, and property values), variation (of entities, values, and relations), and vagueness (when insufficient details or information is given). Furthermore, relations can sometimes span multiple sentences or require a lot of world knowledge to be understood.

### 2. Building extractors

The main focus of this week's lab session is on creating our own property extractors. 

#### 2.1 Methods
Building automatic property extractors is not trivial, because it requires multiple steps, and these steps in practice might differ a lot between different attributes.

There are two common methods for extracting attributes from text: pattern matching and distant supervision. 
* The most basic approach for extracting attributes from text is pattern matching. This approach is transparent, but it requires us to define the patterns for each property separately. For example, for the attribute "birthplace", we can use the pattern "X, born in Y" or "X from Y". Typically, the patterns are combined with information on entity types to improve their precision. In the example above, we would check whether X is indeed a person and Y is indeed a location.
* The second approach, distant supervision, relies on knowledge base information that is only loosely aligned with a text. For example, think of a Wikipedia document that describes Donald Trump on the one hand, and structured information from, for example, Wikidata that tells us that he was born in 1946 on the other. However, we don't know whether this information is explicitly mentioned in the Wikipedia text and, if so, where and how. With the distant supervision method, we train, for example, a recurrent neural network on this kind of data, and hope that the network will learn the patterns in which this attribute is typically expressed in text.
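
The pattern-matching idea from the first bullet can be sketched with a single regular expression. This is a minimal sketch, not the extractor built later in this notebook: the names `NAME`, `BORN_IN`, and `extract_birthplaces` are made up here, and "one or more capitalised words" is only a crude stand-in for a real check of the entity types.

```python
import re

# Crude stand-in for entity typing: a "name" is one or more capitalised words.
NAME = r'[A-Z][a-z]+(?: [A-Z][a-z]+)*'
BORN_IN = re.compile(r'(?P<person>' + NAME + r'), born in (?P<place>' + NAME + r')')

def extract_birthplaces(text):
    """Return (person, place) pairs for every 'X, born in Y' match in the text."""
    return [(m.group('person'), m.group('place')) for m in BORN_IN.finditer(text)]

print(extract_birthplaces('Ada Lovelace, born in London, wrote about the Analytical Engine.'))
print(extract_birthplaces('The meeting, held in Paris, was short.'))
```

The pattern is transparent but brittle: it would happily accept "Spring, born in Winter", which is why entity types are usually checked as well.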

We will use a pattern matching approach to build our extractors in this week's lab session. If you are curious about how to build a distant supervision extractor, you can check Snorkel (their [introductory notebooks](https://github.com/HazyResearch/snorkel/tree/master/tutorials/intro) are quite user-friendly).
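
For intuition only, the labelling step behind distant supervision can be sketched without any machine learning. The `kb` dictionary, the example sentences, and the `label_sentences` function are hypothetical and unrelated to Snorkel's API; the sketch simply marks a sentence as a positive training example when it mentions both the entity and the value known from the knowledge base.

```python
# Hypothetical knowledge base facts: (entity mention, property) -> value
kb = {('Trump', 'birth_year'): '1946'}

sentences = [
    'Trump was born in Queens in 1946.',
    'Trump attended the Wharton School.',
    'In 1946 the UN General Assembly met for the first time.',
]

def label_sentences(sentences, kb):
    """Mark a sentence as a (noisy) positive example if it mentions both entity and value."""
    examples = []
    for (entity, prop), value in kb.items():
        for sentence in sentences:
            is_positive = entity in sentence and value in sentence
            examples.append((sentence, prop, is_positive))
    return examples

for sentence, prop, is_positive in label_sentences(sentences, kb):
    print(is_positive, prop, sentence)
```

These labels are noisy by construction: a sentence can mention both the entity and the value without expressing the property, and a classifier trained on them has to cope with that.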

#### 2.2 Building a pattern-based extractor

Typically, a pattern-based extractor consists of several parts:
1. find a mention of a specific attribute (for example, money or birth date) in text
2. assign this mention to some subject entity
3. normalize the attribute value
4. normalize the subject entity

**Example** Let's say we want to extract values for `founding year` in the following paragraph from Wikipedia:

"Juventus F.C. is an Italian professional football club based in Turin. The club was founded in March 1897 by a group of Torinese students."

**Step I** First, we need to find attribute values that contain information about founding years. For example, we can use the pattern "founded in" to extract the attribute value `March 1897` in the second sentence.

**Step II** Next, we need to see to which subject this attribute belongs. Assuming that we perform dependency parsing of the sentences, we can find that the relation "founded in" has a subject `The club`. At this point, we can extract the following relation:

The club FOUNDING_YEAR March 1897

Syntactically, this is the correct way to extract the relation. However, the relation is not very useful yet: we need to normalize its subject and object to make it useful in a semantic sense.

**Step III** Hence, we can normalize the value "March 1897" to a year value `1897`, for example, by looking for 4-digit numbers in the phrase.

**Step IV** Then, we can normalize "The club" to `Juventus F.C.` by using entity coreference between the two sentences. We can then disambiguate the mention to https://en.wikipedia.org/wiki/Juventus_F.C., or `Juventus_F.C.` for brevity. This finally leads us to the following relation:

`Juventus_F.C. FOUNDING_YEAR 1897`

which looks much more useful (and it is on a semantic level, so we can store it in a knowledge base if we would like to).
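
The four steps can be traced on this example with plain regular expressions. This is a simplified sketch under strong assumptions: the subject is taken to be the phrase directly in front of "was founded in" (instead of using a dependency parse), and the "coreference" step naively assumes that a definite description such as "The club" refers to the entity that opens the text. All variable names are illustrative.

```python
import re

text = ('Juventus F.C. is an Italian professional football club based in Turin. '
        'The club was founded in March 1897 by a group of Torinese students.')

# Steps I and II: find the value after "founded in" and the phrase in front of it.
pattern = re.compile(
    r'(?:^|\. )(?P<subject>[A-Z][\w ]*?) was founded in (?P<date>[A-Z][a-z]+ \d{4})')
match = pattern.search(text)
raw_subject = match.group('subject')
raw_date = match.group('date')

# Step III: normalize the date to a year by taking the four-digit number.
year = int(re.search(r'\d{4}', raw_date).group())

# Step IV (naive assumption): "The ..." refers to the entity introduced before
# the first " is " in the text; spaces become underscores to form an identifier.
main_entity = text.split(' is ')[0]
if raw_subject.startswith('The '):
    subject = main_entity.replace(' ', '_')
else:
    subject = raw_subject.replace(' ', '_')

triple = (subject, 'FOUNDING_YEAR', year)
print(raw_subject, '|', raw_date)
print(triple)
```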

### 3. Coding pattern-based extractors

We will start by searching entity mentions of some type (e.g., nationality or date) by using SpaCy's named entity recognizer. 

Often this is not enough for step I, because an entity of a certain type can be a value of different attributes. For example, the date "1946" can be a year of birth, a year of death, a founding year of a company, a year of starting/ending professional activity, etc. 

For this reason, we can check whether we find some keywords before the phrase (such as "born in" or "founded in"). We will build such an extractor in 3.1. 

We can also use the dependency tree (also from SpaCy) to find the predicate that is associated with this value and see whether that one matches our patterns. We will do this in 3.2.

For step II of assigning the value to some entity phrase, we will look for the closest entity to this attribute value and assign it to that one.

We will reuse several functions in both examples. Let's write those first.

In [4]:
# import the default library for pattern matching, called `re`
import re

In [55]:
def get_entities_of_type(a_type, a_doc):
    return filter(lambda w: w.ent_type_ == a_type, a_doc)

def find_closest_entity(entities, prop_position, e_type):
    """
    Find entities of a certain type and with the smallest distance to the property.
    """
    min_distance=float('inf') # no upper bound, so entities in long documents are not skipped
    closest_entity=None
    for ent in entities:
        if ent.label_!=e_type: 
            continue # skip entities of different types
            
        # determine the distance between the entity and the property
        distance=abs(ent.start_char-prop_position)
        if min_distance>distance:
            min_distance=distance
            closest_entity=ent
    return closest_entity

#### 3.1 Using substring matching

We will use substring matching to look for founding years of organizations. We will use three simple patterns for this purpose: 'founded in X', 'established in X', 'created in X'. As mentioned before, we will also make sure that X is an entity of type `DATE`.

We will use three helper functions in this example: two to check whether one of the patterns occurs directly to the left of a token (step I), and one to convert the date to a year (step III). Finding the closest entity (step II) is handled by `find_closest_entity`, defined above.

In [56]:
def check_for_pattern(doc, i, pattern):
    """
    Check whether the tokens directly before position i spell out the pattern.
    """
    pattern_tokens=pattern.split(' ')
    num_tokens=len(pattern_tokens)
    if i<num_tokens:
        return False # too few tokens on the left; avoids negative indices wrapping around
    tokens=[doc[j].text for j in range(i-num_tokens, i)]
    return tokens==pattern_tokens

def pattern_found_on_the_left(doc, token_index, patterns):
    """
    Check whether any of the patterns occurs directly before the given token.
    """
    return any(check_for_pattern(doc, token_index, pattern) for pattern in patterns)
        
def extract_year_from_date(date):
    """
    Extract the year value from a date by looking for four consecutive digits.
    Returns None if the date contains no such sequence.
    """
    match = re.search(r'\d{4}', date)
    if match is None:
        return None
    return int(match.group())

In [57]:
def extract_date_relations(doc, patterns):
    """
    Extract date properties from a document and assign them to an entity.
    """
    
    # merge entities and noun chunks into one token;
    # filter_spans drops overlapping spans, which cannot be merged together
    spans = spacy.util.filter_spans(list(doc.ents) + list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)

    relations = {}
    
    dates=get_entities_of_type('DATE', doc)
    
    for date in dates:
        
        if pattern_found_on_the_left(doc, date.i, patterns):
            year=extract_year_from_date(date.text)
            org=find_closest_entity(doc.ents, date.idx, 'ORG')
            if year and org:
                relations[org.text]=year

    return relations        

Let's now test whether the date extraction works as we expect it to.

In [58]:
founded_patterns=['founded in', 'established in', 'created in']

text='Airbus was founded in December 1991, not in January 1992.'

doc = nlp(text)
date_relations=extract_date_relations(doc, founded_patterns)
print(date_relations)

{'Airbus': 1991}


#### 3.2 Using syntax dependencies


Here, we extract nationalities (entities labelled as NORP) and then check the dependency tree to find the predicate they depend on, for example:
French --> is (in "Bush is French."). If the predicate is one we are looking for, we assign the nationality to the closest entity of type PERSON.
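
To see what checking the dependency tree involves, the sketch below walks two hand-written toy parses instead of real SpaCy output. `ToyToken` and `governing_predicate` are hypothetical names, and the dependency labels are written by hand as an assumption about what a parser would typically produce; they are not the result of running a model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToyToken:
    """Minimal stand-in for a parsed token: its text, dependency label and head."""
    text: str
    dep: str
    head: Optional['ToyToken'] = None

def governing_predicate(token):
    """Walk up the tree from a value token to the word that governs it."""
    if token.dep == 'pobj' and token.head is not None and token.head.dep == 'prep':
        return token.head.head   # value -> preposition -> predicate
    if token.dep in ('attr', 'acomp'):
        return token.head        # value attaches directly to the verb
    return None

# Hand-written parse of "The club was founded in 1897" (not produced by a parser)
founded = ToyToken('founded', 'ROOT')
prep_in = ToyToken('in', 'prep', founded)
year = ToyToken('1897', 'pobj', prep_in)

# Hand-written parse of "Bush is French" (not produced by a parser)
verb_is = ToyToken('is', 'ROOT')
french = ToyToken('French', 'acomp', verb_is)

print(governing_predicate(year).text)
print(governing_predicate(french).text)
```

The point of the sketch is that the path from a value to its predicate depends on the construction: a check that only follows the `pobj`/`prep` path would miss values that attach directly to the verb.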

In [73]:
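def predicate_found(token, predicates):
    """
    Check whether the token depends on one of the given predicates.
    """
    if token.dep_ in ('attr', 'acomp'):
        # the value attaches directly to the verb, as in "Bush is French"
        pred=token.head
    elif token.dep_ == 'pobj' and token.head.dep_ == 'prep':
        # the value attaches through a preposition: verb -> prep -> value
        pred=token.head.head
    else:
        return False
    return pred.text in predicates

In [74]:
def extract_nationality(doc, predicates):
    """
    Extract nationalities from a document and assign them to the closest person.
    """
    # merge entities and noun chunks into one token;
    # filter_spans drops overlapping spans, which cannot be merged together
    spans = spacy.util.filter_spans(list(doc.ents) + list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    
    relations={}
    
    nationalities=get_entities_of_type('NORP', doc)
    
    for nationality in nationalities:
        if predicate_found(nationality, predicates):
            per=find_closest_entity(doc.ents, nationality.idx, 'PERSON')
            if per:
                relations[per.text]=nationality.text # store the text, not the Token object
    return relations

In [75]:
predicates=['is', 'was']
text='Bush is French.'

doc = nlp(text)
nat_relations=extract_nationality(doc, predicates)
print(nat_relations)

If the model tags `Bush` as a PERSON and `French` as a NORP, and attaches `French` directly to `is`, this prints the relation between the two; otherwise the result is an empty dictionary.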


#### 3.3 Running all our extractors


In [None]:
# the patterns must include the preposition, because they are matched
# against the tokens directly to the left of the date
founded_patterns=['founded in', 'established in', 'created in']
predicates=['is', 'was']

In [None]:
texts = [
    'Mike is French, and John is American.',
    'Juventus F.C. is an Italian professional football club based in Turin. The club was founded in March 1897 by a group of Torinese students.',
]    

def get_all_relations(texts):
    """
    Run both extractors on every text and collect the relations per property.
    """
    all_relations={'founding_year':{}, 'nationality': {}}
    
    for text in texts:
        doc = nlp(text)
        date_relations=extract_date_relations(doc, founded_patterns)
        nationality_relations=extract_nationality(doc, predicates)
        
        all_relations['founding_year'].update(date_relations)
        all_relations['nationality'].update(nationality_relations)

    return all_relations

print("Info: Processing %d texts" % len(texts))
print()
print('**Extracted relations:**')
print()
print(get_all_relations(texts))

### 4. Processing Wikipedia

Now that we know how to run our extractors on some text documents, we can do this on a larger scale. As an illustration, here we will load a few Wikipedia documents and try to extract properties from them.

In [None]:
entities=["Piek Vossen", "Frank van Harmelen", "Bremen High School (Midlothian, Illinois)"]
texts=[]

for entity in entities:
    wp = wikipedia.page(entity)
    texts.append(wp.content)
    
system_data = get_all_relations(texts)

### 5. Evaluating extractors

We will evaluate the extractors by computing precision, recall, and F1-score for each property. We will only check the extracted values for the main entity of each document, not for any of the other entities.

As with entity linking, we will count true positives, false positives, and false negatives per textual unit and not per class. In this case, the textual unit is the document.
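
As a small worked example of these scores, the sketch below computes them from raw counts. The function name `precision_recall_f1` and the counts are made up for illustration; returning 0.0 when a ratio is undefined is one possible convention, not the only one.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from counts; 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical counts for three documents: one correct value (tp), one wrong
# value (counted as both fp and fn), and one document with no value (fn).
print(precision_recall_f1(tp=1, fp=1, fn=2))

# Nothing extracted at all: every score falls back to 0.0 instead of raising.
print(precision_recall_f1(tp=0, fp=0, fn=3))
```

With these counts, precision is 1/2, recall is 1/3, and F1 is their harmonic mean, 0.4.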

In [None]:
properties=['founding_year', 'nationality']
gold={}
gold['nationality']={'Piek Vossen': 'Dutch', 'Frank van Harmelen': 'Dutch'}
# years are integers, matching the output of extract_year_from_date
gold['founding_year']={'Bremen High School (Midlothian, Illinois)': 1953}

for prop, property_data in gold.items():
    tp=0
    fp=0
    fn=0
    for entity, gold_value in property_data.items():
        system_value=system_data[prop].get(entity) # None if nothing was extracted
        if system_value is None:
            fn+=1
        elif system_value==gold_value:
            tp+=1
        else:
            # a wrong value is both a false positive and a false negative
            fp+=1
            fn+=1
        print('Entity: %s, property: %s, gold value: %s, system value: %s' % (entity, prop, gold_value, system_value))
        
    # guard against division by zero when nothing was extracted or nothing was correct
    precision=tp/(tp+fp) if tp+fp>0 else 0.0
    recall=tp/(tp+fn) if tp+fn>0 else 0.0
    f1=2*precision*recall/(precision+recall) if precision+recall>0 else 0.0
    
    print("Evaluation for property %s: \nprecision: %f, \nrecall: %f, \nF1-score: %f" % (prop, precision, recall, f1))