# L2: Information Extraction

Students: mindu931, karzi360

## Getting started

The first cell imports the Python module required for this lab.

In [4]:
import tm2

The next cell imports spaCy and loads its English language model.

In [5]:
import spacy

nlp = spacy.load('en')

## Data and gold standard

The data is contained in the following file:

In [6]:
data_file = "/home/TDDE16/labs/l2/data/gmb.txt"

The `tm2` module defines a function `read_data` that returns an iterator over the lines in a file. You should use this function to read the data for this lab. Use the optional argument `n` to restrict the iteration to the first few lines of the file. Here is an example:

In [7]:
for sentence in tm2.read_data(data_file, n=3):
    print(sentence)

Masked assailants with grenades and automatic weapons attacked a wedding party in southeastern Turkey, killing 45 people and wounding at least six others.
Turkish officials said the attack occurred Monday in the village of Bilge about 600 kilometers from Ankara.
The wounded were taken to the hospital in the nearby city of Mardin.
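
The source of `tm2.read_data` is not shown in this notebook. As a rough illustration of the behaviour described above, here is a minimal sketch of a line iterator with an optional limit. The name `read_lines` and its implementation are assumptions made for illustration, not the actual `tm2` code.

```python
from itertools import islice


def read_lines(path, n=None):
    """Yield the lines of `path` without the trailing newline.

    Hypothetical stand-in for `tm2.read_data`; if `n` is given,
    iteration stops after the first `n` lines.
    """
    with open(path, encoding="utf-8") as fp:
        # islice(fp, None) iterates over the whole file
        for line in islice(fp, n):
            yield line.rstrip("\n")
```

Because the function is a generator, only the requested lines are read, which keeps experiments on the first few sentences fast.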


In addition to the raw data, we also provide you with a gold standard of entity pairs that your system should be able to extract. The following code loads these pairs from the file `gold.txt` and adds them to the set `gold`. Each pair is augmented with the identifier of the sentence (line number in the data file) which it was extracted from. Note that the sentence (line) numbering starts at index&nbsp;0.

In [9]:
gold_file = "/home/TDDE16/labs/l2/data/gold.txt"

gold = set()
with open(gold_file) as fp:
    for line in fp:
        print(line)
        columns = line.rstrip().split('\t')
        gold.add((int(columns[0]), columns[1], columns[2]))

802	Ali Zardari	Pakistan People 's Party

2297	Abdul Aziz al-Hakim	Supreme Council

4823	Slavkov	Bulgarian National Olympic Committee

7902	Mr. Hakim	Supreme Council

8206	J. Patrick Boyle	American Meat Institute

8633	Ali Rodriguez	Petroleos de Venezuela

9004	Foreign Minister Joschka Fischer	Green Party

11021	Khalaf	al-Qaida

11259	Joseph Domenech	U.N. 's Food and Agricultural Organization

13043	David Petraeus	U.S. Central Command

15203	Joseph Kony	Lord 's Resistance Army

15494	Khodorkovsky	Yukos

15906	President Chen Shui-bian	Democratic Progressive Party

18977	General Petraeus	U.S. Central Command

20496	Avigdor Lieberman	Yisrael Beitenu

20667	Mr. Fini	National Alliance

21914	Mr. Mwanawasa	Southern African Development Community

23016	Osama bin Laden	al-Qaida

28997	Ma	Nationalist Party

30171	al-Zarqawi	al-Qaida

31546	Mr. Abbas	Fatah

32262	Morgan Tsvangirai	Movement for Democratic Change

33646	Mr. Coleman	Senate Government Affairs

34889	Prince Ali	West Asian Football Fe

The following code prints the first 10&nbsp;pairs from the gold standard:

In [10]:
for i, person, org in sorted(gold)[:10]:
    print("{}\t{}\t{}".format(i, person, org))

802	Ali Zardari	Pakistan People 's Party
2297	Abdul Aziz al-Hakim	Supreme Council
4823	Slavkov	Bulgarian National Olympic Committee
7902	Mr. Hakim	Supreme Council
8206	J. Patrick Boyle	American Meat Institute
8633	Ali Rodriguez	Petroleos de Venezuela
9004	Foreign Minister Joschka Fischer	Green Party
11021	Khalaf	al-Qaida
11259	Joseph Domenech	U.N. 's Food and Agricultural Organization
13043	David Petraeus	U.S. Central Command


## Entity extraction

To implement the entity extraction part of our system, we use the full natural language processing power built into spaCy. The following code extracts the entities from the first 5&nbsp;sentences of the data:

In [70]:
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file, n=5))):
    for ent in doc.ents:
        print("{}\t{}\t{}\t{}".format(ent.text, ent.start, ent.end, ent.label_))

Turkey	13	14	GPE
45	16	17	CARDINAL
at least six	20	23	CARDINAL
Turkish	0	1	NORP
Monday	6	7	DATE
Bilge	11	12	ORG
about 600 kilometers	12	15	QUANTITY
Ankara	16	17	GPE
Mardin	12	13	ORG


## Problem 1: Extract relevant pairs

We identify pairs of entities that are in the &lsquo;is-leader-of&rsquo; relation, based on the strategy outlined in the section on [Relation Extraction](http://www.nltk.org/book/ch07.html#relation-extraction) in the book by Bird, Klein, and Loper (2009):

* look for all triples of the form $(x, \alpha, y)$ where $x$ and $y$ denote named entities of type *person* and *organisation*, respectively, and $\alpha$ is the intervening text
* write a regular expression to match just those instances of $\alpha$ that express the &lsquo;is-leader-of&rsquo; relation
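
The two steps above can be sketched without spaCy on a toy sentence. Everything in this example is assumed for illustration: the sentence, the hand-written entity tuples `(text, start, end, label)`, the helper name `leader_pairs`, and the keyword list in `LEADER`.

```python
import re

# Toy input (hypothetical): tokens of one sentence and its entities
# as (text, start token, end token, label) tuples.
tokens = ["Alice", "Smith", "is", "head", "of", "the", "Example", "Foundation", "."]
ents = [("Alice Smith", 0, 2, "PERSON"), ("Example Foundation", 6, 8, "ORG")]

# Illustrative keyword pattern for the 'is-leader-of' relation
LEADER = re.compile(r"\b(head|leader|chief|president|chairman)\b")


def leader_pairs(tokens, ents):
    """Return (x, y) pairs where x is a PERSON, y is the next entity and
    an ORG, and the intervening text alpha matches LEADER."""
    pairs = []
    for (x, _, x_end, x_label), (y, y_start, _, y_label) in zip(ents, ents[1:]):
        if x_label != "PERSON" or y_label != "ORG":
            continue
        alpha = " ".join(tokens[x_end:y_start])
        if LEADER.search(alpha):
            pairs.append((x, y))
    return pairs


print(leader_pairs(tokens, ents))
```

Here $\alpha$ is the string `is head of the`, which contains the keyword `head`, so the pair is reported.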


In [71]:
import re

def extract(doc):
    """Extract relevant relation instances from the specified document.
    
    Args:
        doc: The sentence as analysed by spaCy.
    Returns:
        A list of (person, organisation) string pairs representing the
        extracted relation instances.
    """
    # Keyword stems that signal the 'is-leader-of' relation
    rela = re.compile(r'(lead|head|chief|preside|minist|command|manage|supervis|rule|direct)')

    matches = []
    # Compare each entity with the entity that directly follows it
    for e1, e2 in zip(doc.ents, doc.ents[1:]):
        # Only consider a person followed by an organisation
        if e1.label_ != "PERSON" or e2.label_ != "ORG":
            continue

        # Test the text between the two entities against the pattern
        txt = doc[e1.end:e2.start]
        if rela.search(str(txt)):
            matches.append((str(e1), str(e2)))

    return matches


The following cell shows how your function is supposed to be used. The code prints out the extracted pairs for the first 1,000&nbsp;sentences in the data. It additionally numbers each pair with the sentence identifier.

In [72]:
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file, n=1000))):
    for person, org in extract(doc):
        print("{}\t{}\t{}".format(i, person, org))

207	Rugova	European Union
351	Jendayi Frazer	Sudan Liberation Army
512	Aung San Suu Kyi	the National League for Democracy
736	Viktor Yanukovych	Russian Party
802	Asif Ali Zardari	the Pakistan People's Party


The cell below processes all lines of the data file (62k sentences) and adds all extracted pairs to the `extracted` set.

In [28]:
extracted = set()
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file))):
    for person, org in extract(doc):
        extracted.add((i, person, org))
    print('\rProcessed {} sentences ...'.format(i+1), end='', flush=True)
print(' done')

Processed 62010 sentences ... done


After executing the above cell, all extracted id-string-string triples are in the set `extracted`. The code in the next cell will print the first 10&nbsp;triples in this set.

In [73]:
for i, person, org in sorted(extracted)[:10]:
    print("{}\t{}\t{}".format(i, person, org))
print("Length of the extracted set:", len(extracted))
print("Length of the gold set:", len(gold))

207	Rugova	European Union
351	Jendayi Frazer	Sudan Liberation Army
512	Aung San Suu Kyi	the National League for Democracy
736	Viktor Yanukovych	Russian Party
802	Asif Ali Zardari	the Pakistan People's Party
1349	Karen Hughes	State Department
1790	Koizumi	the United Nations
2297	Abdul Aziz al-Hakim	the Supreme Council for the Islamic Revolution in Iraq
3274	Jack Abramoff	Congress
3291	Krasniqi	the Kosovo Protection Corps
Length of the extracted set: 192
Length of the gold set: 46


Compare the lengths of the `extracted` set and the `gold` set above: the `extracted` set is about four times larger (192 versus 46 triples). Let's continue with the other problems and see what happens.

## Problem 2: Evaluate your system

The cells below compute the precision, recall, and F1 measure of our extractor relative to the gold standard.

In [44]:
def evaluate(reference, predicted):
    """Print out the precision, recall, and F1 for the id-entity-entity
    triples in the set `predicted`, given the triples in the reference set.
    
    Args:
        reference: The reference set of triples.
        predicted: The set of predicted triples.
    Returns:
        Nothing, but prints out precision, recall, and F1.
    """
    # true positives: triples in both predicted and reference
    # false positives: triples in predicted but not in reference
    # false negatives: triples in reference but not in predicted
    tp = predicted.intersection(reference)
    
    # true positives + false positives = predicted
    # true positives + false negatives = reference
    precision = len(tp)/len(predicted) if predicted else 0.0
    recall    = len(tp)/len(reference) if reference else 0.0
    # Guard against division by zero when there are no true positives
    F1        = (2*precision*recall)/(precision + recall) if tp else 0.0
  
    print("Precision:", round(precision*100,2), "%")
    print("Recall:", round(recall*100,2), "%")
    print("F1:", round(F1*100,2), "%")

#print(tm2.evaluate(gold, extracted))

These are the results. The low precision is expected given the size difference noted above: the `extracted` set (192 triples) is about four times larger than the `gold` set (46 triples), so even if every gold triple were matched, precision would only be about 24 %. The recall of 10.87 % corresponds to 5 of the 46 gold triples being matched exactly.

In [60]:
evaluate(gold, extracted)
#print(tm2.evaluate(gold, extracted))

Precision: 2.6 %
Recall: 10.87 %
F1: 4.2 %
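
To make the arithmetic concrete, here is a small self-contained sketch that computes the three measures on toy triples. The helper name `prf` and the toy sets are assumptions for illustration. The last line checks the scores reported above under the assumption of 5 exact matches, which is inferred from the reported recall rather than printed by the notebook.

```python
def prf(reference, predicted):
    """Return (precision, recall, F1) for two sets of triples."""
    tp = len(reference & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    # Equivalent to the harmonic mean of precision and recall
    f1 = 2 * tp / (len(predicted) + len(reference)) if tp else 0.0
    return precision, recall, f1


# Toy triples (hypothetical, not taken from the data)
reference = {(0, "A", "X"), (1, "B", "Y"), (2, "C", "Z"), (3, "D", "W")}
predicted = {(0, "A", "X"), (1, "B", "Q"), (5, "E", "V")}

p, r, f1 = prf(reference, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))

# Sanity check, assuming 5 exact matches among 192 extracted and 46 gold triples
print(round(5 / 192 * 100, 2), round(5 / 46 * 100, 2), round(2 * 5 / (192 + 46) * 100, 2))
```

In the toy example only one of three predictions is correct and one of four gold triples is found, so precision is 1/3 and recall is 1/4. Note that a triple with the right sentence and person but a differently spelled organisation, such as `(1, "B", "Q")`, counts as both a false positive and a missed gold triple.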


## Problem 3: Entity resolution

As the scores show, our extractor does a rather poor job of matching the gold standard. One reason for this is that the NLP preprocessing is not perfect (spaCy was not trained on the annotations in the Groningen Meaning Bank), and that the approach of using regular expressions for relation extraction is rather naive.

Another reason, however, is that the current version of our system does not include a component for *entity resolution*. To give an example, the system does not realise that the strings `David Petraeus` and `General David Petraeus` refer to the same entity.

To solve the problem, we implement a function called `normalise` that takes an entity mention (a string) as its input and rewrites it to the form used in the gold standard. While this is &lsquo;cheating&rsquo;, it allows us to assess the performance of a more realistic system, and helps to illustrate that information extraction can be very domain-specific.

The following cell contains our implementation of the `normalise` function.

In [54]:
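def normalise(text):
    """Rewrite an entity mention to the form used in the gold standard.

    Returns the first gold person or organisation string that contains
    `text` or is contained in it; otherwise returns `text` unchanged.
    """
    # Scan the gold triples; on a substring match, return the gold entity
    for (i, e1, e2) in gold:
        if (text in e1) or (e1 in text):
            return e1
        elif (text in e2) or (e2 in text):
            return e2
    return text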

The next cell shows how `normalise` is intended to be used. Each triple in the set `extracted` is transformed by feeding the two entity mentions into the `normalise` function. The normalised triples are then added to a new set `extracted_normalised`.

In [55]:
extracted_normalised = set()
for triple in extracted:
    extracted_normalised.add((triple[0], normalise(triple[1]), normalise(triple[2])))

Here are the new results after applying the normalisation rules. We can see that they are much better.

In [56]:
evaluate(gold, extracted_normalised)

Precision: 13.54 %
Recall: 56.52 %
F1: 21.85 %
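
One caveat about these numbers: `normalise` returns the first gold string that matches by substring, and `gold` is a set, so when several gold strings match the same mention (the gold file contains both `David Petraeus` and `General Petraeus`) the result depends on iteration order. The sketch below illustrates the effect and one possible deterministic tie-break. The helper names and the tie-break rule (exact match first, then the shortest candidate) are assumptions for illustration, not part of the lab code.

```python
def normalise_first_match(text, candidates):
    """Substring matching in the style of `normalise`: first match wins."""
    for cand in candidates:
        if text in cand or cand in text:
            return cand
    return text


def normalise_deterministic(text, candidates):
    """Prefer an exact match; otherwise pick the shortest matching
    candidate, breaking ties alphabetically."""
    if text in candidates:
        return text
    matches = [c for c in candidates if text in c or c in text]
    return min(matches, key=lambda c: (len(c), c)) if matches else text


# The same mention maps to different strings depending on the order
print(normalise_first_match("Petraeus", ["David Petraeus", "General Petraeus"]))
print(normalise_first_match("Petraeus", ["General Petraeus", "David Petraeus"]))

# The deterministic variant gives the same answer for any ordering
people = {"David Petraeus", "General Petraeus"}
print(normalise_deterministic("Petraeus", people))
```

A deterministic rule makes the scores reproducible, but it still cannot tell which of the two gold spellings is the right one for a given sentence.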


## Problem 4: Error analysis

In this task, we analyse the errors of our information extraction system. We do this by hand, with the help of the visualisation tools provided by spaCy. 
For example, the following code cell visualises the output of the named entity recogniser for the given input sentence:

In [76]:
from spacy import displacy

sentence = u'Slavkov will lose his position as head of the Bulgarian National Olympic Committee.'

displacy.render(nlp(sentence), style='ent', jupyter=True)


### Recall-related errors (false negatives)

Requirement: "By tuning the `normalise` function, you can deal with some of the recall-related mistakes that your system makes. Other recall-related errors cannot be fixed in this way. To illustrate this, find at least 5&nbsp;entity pairs in the gold standard that your system still does not identify correctly, and enter them into the text box below. For each example, provide a brief explanation of what goes wrong. Try to find examples that illustrate different types of errors."

Here are five gold-standard entries that our system does not identify, shown with their sentence identifiers and the full text of the sentences:

4823: Slavkov will lose his position as head of the Bulgarian National Olympic Committee.  
7902: Mr. Hakim heads the Shi'ite dominated Supreme Council for the Revolution in Iraq, which has the largest representation in parliament.    
11021: According to the department, Khalaf is a senior leader of al-Qaida in Iraq's "facilitation network," which controls the flow of resources -- including weapons, money and militants -- from Syria into Iraq.  
15494: Thursday, a Moscow court rejected Khodorkovsky's appeal of his conviction, but supporters of the former head of the oil firm Yukos say authorities rushed through the process to prevent him from running in a December parliamentary byelection.  
18977: Lawmakers voted 95 - 2 to make General Petraeus the new leader of the U.S. Central Command, which oversees American forces in the Middle East, East Africa and Central Asia.  


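As we can see, these sentences are quite complicated and do not follow the simple pattern `PERSON` relation `ORG` that our function looks for. The function only considers two consecutive entities, requires the first to be labelled `PERSON` and the second `ORG`, and requires one of the keywords in the text between them. If any of these conditions fails, the pair is missed. One concrete cause is shown in the example below, where the named entity recogniser assigns the wrong entity labels.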

In [77]:
sentence7902 = u'Mr. Hakim heads the Shiite dominated Supreme Council for the Revolution in Iraq, which has the largest representation in parliament.   '
displacy.render(nlp(sentence7902), style='ent', jupyter=True)
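
Candidates for this kind of analysis can be listed mechanically with set differences. The sketch below uses toy triples, so the sets are assumptions for illustration. In this notebook the corresponding expressions would presumably be `gold - extracted_normalised` for false negatives and `extracted_normalised - gold` for false positives.

```python
# Toy triples (hypothetical): (sentence id, person, organisation)
reference = {(1, "A", "X"), (2, "B", "Y"), (3, "C", "Z")}
predicted = {(1, "A", "X"), (2, "B", "Q"), (7, "D", "W")}

# In the reference but not predicted: missed by the system
false_negatives = sorted(reference - predicted)
# Predicted but not in the reference: spurious or unlisted pairs
false_positives = sorted(predicted - reference)

for i, person, org in false_negatives:
    print("FN\t{}\t{}\t{}".format(i, person, org))
for i, person, org in false_positives:
    print("FP\t{}\t{}\t{}".format(i, person, org))
```

Sorting by sentence identifier makes it easy to look up the corresponding lines in the data file.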

### Precision-related errors (false positives)

Next, provide at least 5 entity pairs that represent false positives of your system. Explain what goes wrong.

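Some of the extracted pairs are in fact correct, but others are false positives because the sentences are more complicated than our pattern assumes. In sentence 207, Rugova's motorcade was headed to a meeting with the European Union foreign policy chief Javier Solana: the keyword `head` matches inside "headed", but Rugova is the leader of a different organisation, not of the EU. In sentence 351, Jendayi Frazer met with rival leaders of the Sudan Liberation Army: the keyword `lead` matches inside "leaders", but she is not one of those leaders.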

In [85]:
sentence207 = u'The blast occurred at 8:20 am as President Rugova s motorcade was headed to a meeting with European Union foreign policy chief Javier Solana.  '
displacy.render(nlp(sentence207), style='ent', jupyter=True)
sentence351 = u'Saturday, a senior U.S. official, Assistant Secretary of State for African Affairs, Jendayi Frazer, met with rival leaders of the rebel Sudan Liberation Army to urge them to present a united front at the next round of peace talks.'
displacy.render(nlp(sentence351), style='ent', jupyter=True)
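
Both false positives come from keyword stems matching inside longer words. The sketch below compares the stem pattern used in `extract` with a stricter pattern on the text between the two entities. The strings are copied by hand from the sentences rather than produced by spaCy, and the `strict` pattern is a hypothetical alternative, not something we evaluated.

```python
import re

# Stem pattern as used in `extract`
loose = re.compile(r"(lead|head|chief|preside|minist|command|manage|supervis|rule|direct)")
# Hypothetical stricter pattern: a whole role word followed by "of"
strict = re.compile(r"\b(leader|head|chief|president|chairman|commander|director)\s+of\b")

# Text between the two entities, copied by hand from the sentences
between = {
    207: "s motorcade was headed to a meeting with",
    351: ", met with rival leaders of the rebel",
    4823: "will lose his position as head of the",
}

for i, text in sorted(between.items()):
    print(i, bool(loose.search(text)), bool(strict.search(text)))
```

The stricter pattern rejects "headed to" and "leaders of" while still accepting "head of". The trade-off is recall: it would also reject verb forms such as "heads the" in sentence 7902.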

### Incompleteness of the gold standard

You may have noticed that some of your system&rsquo;s false positives are actually &lsquo;correct&rsquo;. This can happen because, while each entity pair in the gold standard has been manually checked for correctness, no check has been made that the gold standard contains all relevant pairs. Find at least 5&nbsp;entity pairs in the data that are valid instances of the &lsquo;is-leader-of&rsquo; relation (according to your subjective judgement) but that are not contained in the gold standard.

In [82]:
sentence3291= "At the time of his arrest, Mr. Krasniqi was a commander of the Kosovo Protection Corps, a post-war civil defense group that deals with emergencies in Kosovo."
displacy.render(nlp(sentence3291), style='ent', jupyter=True)

"Did you find any examples that you did not find when looking for false positives?"   

Yes. For example, sentence number 10 reads: "In a report issued Tuesday, U.N. Secretary-General Kofi Annan says Haiti is at a critical juncture, as the country prepares for its first set of elections since the ouster of President Jean-Bertrand Aristide in February, 2004."

In this sentence, Jean-Bertrand Aristide is described as the president (that is, the leader) of Haiti. However, the sentence appears in neither the `gold` set nor the `extracted` set. Again, the sentence is quite complicated, and it is hard to extract the relation from a sentence like this.


This is the end of the assignment.