# Lab4b - Assignment 4 about NED
This notebook describes the LAB-4 assignment of the Text Mining course. It is about entity linking, also called named entity disambiguation (NED).

**Assignment goals**:
* Learn how to evaluate an entity linking system.
* Learn how to run two entity linking systems (AIDA and DBpedia Spotlight).
* Learn how to interpret the system output and the evaluation results.
* Get insight into differences between the two systems.
* Be able to describe differences between the two methods in terms of their results.
* Be able to propose future improvements based on the observed results.
* Get insight into the difficulty of NED and how this depends on specific entity mentions.
* Get insight into the relation between NED and NER.
* Get insight into other challenges of this task.

In this assignment, you are going to work with two systems for entity linking: AIDA and DBpedia Spotlight. You will run them on an entity linking dataset and evaluate their performance. You will perform both quantitative and qualitative analysis of their output, and run one of these systems on your own text. We will reflect on the results of these tasks.

We recommend that you go through the notebooks in the following order:
* *Read the assignment (see below)*
* *Lab4.1-Entity-linking-introduction.ipynb*
* *Lab4.2-Entity-linking-evaluation.ipynb*
* *Lab4.3-Entity-linking-tools.ipynb*
* *Answer the questions of the assignment (see below) using the provided notebooks and submit*

**Note:** We will use the Reuters-128 dataset in this assignment. This dataset was introduced in the notebook 'Lab4.3-Entity-linking-tools', so you probably have it already. If you do not, download it from Canvas first and put it in the same location as this notebook.

## Credits
The notebooks in this block have been created by [Filip Ilievski](http://ilievski.nl).

In [4]:
from rdflib import Graph, URIRef
from tqdm import tqdm  # progress bar for long-running loops
import sys
import requests
import urllib
import urllib.parse
from urllib.request import urlopen, Request
from urllib.parse import urlencode
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from lxml import etree
import time
import json

# import our own utility functions and classes
import lab4_utils as utils
import lab4_classes as classes

### 1. Quantitative analysis (17 points)

**Exercise 1a** Write code that runs both systems on the full Reuters-128 dataset. (4 points)

In [18]:
def aida_disambiguation(articles, aida_url):
    """
    Perform disambiguation with AIDA.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:  
        for i, article in enumerate(articles):
            
            original_content = article.content
            new_content=original_content
            for entity in reversed(article.entity_mentions):
                entity_span=new_content[entity.begin_index: entity.end_index]
                new_content=new_content[:entity.begin_index] + '[[' + entity_span + ']]' + new_content[entity.end_index:]

            params={"text": new_content, "tag_mode": 'manual'}
            request = Request(aida_url, urlencode(params).encode())
            this_json = urlopen(request).read().decode('unicode-escape')
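            try:
                results = json.loads(this_json)
            except ValueError:
                # Unparseable response: mark every mention as NIL so that
                # later cells can still read entity.aida_link
                for entity in article.entity_mentions:
                    entity.aida_link = 'NIL'
                pbar.update(1)
                continue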

            dis_entities={}
            for dis_entity in results['mentions']:

                if 'bestEntity' in dis_entity.keys():
                    best_entity = dis_entity['bestEntity']['kbIdentifier']
                    clean_url = best_entity[5:] #SKIP YAGO:
                else:
                    clean_url = 'NIL'
                dis_entities[str(dis_entity['offset'])] = clean_url 
            
            for entity in article.entity_mentions:
                start = entity.begin_index
                # NIL when AIDA returned no link at this offset
                dis_url = dis_entities.get(str(start), 'NIL')
                entity.aida_link = dis_url

            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
    return articles

def spotlight_disambiguate(articles, spotlight_url, confidence=0.5):
    """
    Perform disambiguation with DBpedia Spotlight.
    """
    with tqdm(total = len(articles), file = sys.stdout) as pbar:
        for i, article in enumerate(articles):
            annotation = etree.Element("annotation", text=article.content)

            for mention in article.entity_mentions:
                sf = etree.SubElement(annotation, "surfaceForm")
                sf.set("name", mention.mention)
                sf.set("offset", str(mention.begin_index))
            my_xml = etree.tostring(annotation, xml_declaration = True, encoding='UTF-8')
            
            results = requests.post(spotlight_url, urllib.parse.urlencode({'text':my_xml, 'confidence': confidence}), 
                                  headers={'Accept': 'application/json'})
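            retries = 0
            while results.status_code != 200 and retries < 5:
                print('Trying again...')
                time.sleep(5)  # back off before retrying
                results = requests.post(spotlight_url, urllib.parse.urlencode({'text':my_xml, 'confidence': confidence}), 
                                  headers = {'Accept': 'application/json'})
                retries += 1
            results.raise_for_status()  # stop with an HTTP error instead of retrying forever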
            
            j = results.json()
            dis_entities={}
            if 'Resources' in j: 
                resources=j['Resources']
            else: 
                resources=[]
            for dis_entity in resources:
                dis_entities[str(dis_entity['@offset'])] = utils.normalizeURL(dis_entity['@URI'])
            
            for entity in article.entity_mentions:
                start = entity.begin_index
                if str(start) in dis_entities:
                    dis_url = dis_entities[str(start)]
                else:
                    dis_url = 'NIL'
                entity.spotlight_link = dis_url

            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)

            time.sleep(1)  # short pause between requests to avoid overloading the public endpoint
    return articles
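
`aida_disambiguation` above wraps every mention in `[[...]]` and walks the mentions in reverse order. The sketch below isolates that step to show why the reverse order matters: inserting brackets from right to left never shifts the offsets of mentions that still have to be processed. The helper name `mark_mentions` and the example offsets are illustrative assumptions, not part of `lab4_utils`.

```python
def mark_mentions(text, spans, open_tag='[[', close_tag=']]'):
    """Wrap each (begin, end) character span of `text` in brackets.

    Spans are handled from right to left, so the offsets of the spans
    that are still to be processed remain valid after each insertion.
    """
    marked = text
    for begin, end in sorted(spans, reverse=True):
        marked = marked[:begin] + open_tag + marked[begin:end] + close_tag + marked[end:]
    return marked


text = "Tokyo stocks rose as the Bundesbank met."
spans = [(0, 5), (25, 35)]  # hypothetical offsets of two mentions
print(mark_mentions(text, spans))
# prints: [[Tokyo]] stocks rose as the [[Bundesbank]] met.
```

Processing the spans left to right would instead require adding four characters to every later offset after each insertion.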

In [16]:
reuters_file = 'Reuters-128.ttl'
aida_disambiguation_url = "https://gate.d5.mpi-inf.mpg.de/aida/service/disambiguate"
spotlight_disambiguation_url = "http://model.dbpedia-spotlight.org/en/disambiguate"

articles = utils.load_article_from_nif_file(reuters_file)
processed_aida = aida_disambiguation(articles, aida_disambiguation_url)

processed: 2:   2%|▏         | 2/128 [00:11<16:25,  7.82s/it]


processed: 128:  96%|█████████▌| 123/128 [03:24<00:08,  1.67s/it]


In [19]:
proccessed_spotlight = spotlight_disambiguate(articles, spotlight_disambiguation_url)

  0%|          | 0/128 [00:00<?, ?it/s]Trying again...
Trying again...
Trying again...
  0%|          | 0/128 [00:13<?, ?it/s]


KeyboardInterrupt: 

**Exercise 1b** Write code that evaluates the two systems on this dataset by computing their overall precision, recall, and F1-score. (5 points)

In [83]:
def evaluate_entity_linking(system_decisions, gold_decisions):
    
    tp, fp, fn = 0, 0, 0
    
    for gold_entity,system_entity in zip(gold_decisions,system_decisions):
        if gold_entity=='NIL' and system_entity=='NIL': continue
        if gold_entity==system_entity:
            tp+=1
        else:
            if gold_entity!='NIL':
                fn+=1
            if system_entity!='NIL':
                fp+=1

#     print('TP: %d; \nFP: %d, \nFN: %d' % (tp, fp, fn))            

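    # Guard against division by zero when a system links nothing
    # or when there are no non-NIL gold entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0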
    
    return precision, recall, f1
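
As a sanity check on the counting scheme, here is a small worked example on made-up decisions. It is a self-contained sketch that follows the same rules as `evaluate_entity_linking` (NIL/NIL pairs are ignored, a wrong non-NIL link counts as both a false positive and a false negative); the toy `gold` and `system` lists are invented for illustration.

```python
def evaluate(system, gold):
    """Precision, recall and F1 for entity linking with NIL handling."""
    tp = fp = fn = 0
    for g, s in zip(gold, system):
        if g == 'NIL' and s == 'NIL':
            continue          # nothing to link, nothing linked
        if g == s:
            tp += 1
        else:
            if g != 'NIL':
                fn += 1       # a gold entity was missed
            if s != 'NIL':
                fp += 1       # the system produced a wrong link
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold   = ['Tokyo', 'Reuters', 'G7',  'NIL',       'Deutsche_Bundesbank']
system = ['Tokyo', 'Reuters', 'NIL', 'Thom_Bell', 'NIL']

precision, recall, f1 = evaluate(system, gold)
print(precision, recall, f1)
```

Here TP = 2, FP = 1 and FN = 2, so precision is 2/3 and recall is 2/4. A system that often answers `NIL` loses recall without losing precision, which is the pattern Question 1d asks about.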

In [85]:
decisions_aida = []
gold = []
for article in processed_aida:
    decisions_aida += [entity.aida_link for entity in article.entity_mentions]
    gold += [entity.gold_link for entity in article.entity_mentions]
  
decisions_spotlight = []
for article in proccessed_spotlight:
    decisions_spotlight += [entity.spotlight_link for entity in article.entity_mentions]
    
precision_a, recall_a, f1_a = evaluate_entity_linking(decisions_aida, gold)
print('AIDA System:\nPrecision:', precision_a, ' Recall:', recall_a, ' F1:', f1_a)
precision_s, recall_s, f1_s = evaluate_entity_linking(decisions_spotlight, gold)
print('Spotlight System:\nPrecision:', precision_s, ' Recall:', recall_s, ' F1:', f1_s)

AIDA System:
Precision: 0.61875  Recall: 0.45692307692307693  F1: 0.5256637168141594


**Question 1c** What is the F1-score per system? Which system performs better? Is that also the better system in terms of precision and recall? (4 points)

In [8]:
# Your answer here...

**Question 1d** For each of the systems, compare the precision against the recall: which is higher? What does that mean (hint: think of NIL entities)? (4 points)

In [9]:
# Your answer here...

### 2. Qualitative analysis (15 points)

**Exercise 2a** Check the entity disambiguation by AIDA against the gold entities on the document with identifier "http://aksw.org/N3/Reuters-128/82#char=0,1370" (write code to print the entity mentions, gold links and AIDA links). (2 points)

In [8]:
correct_article = [article for article in articles if article.identifier == 'http://aksw.org/N3/Reuters-128/82#char=0,1370'][0]

In [12]:
processed_aida = aida_disambiguation([correct_article], aida_disambiguation_url)

processed: 1: 100%|██████████| 1/1 [00:08<00:00,  8.58s/it]




for entity in processed_aida[0].entity_mentions:
    print('entity:', entity.mention ,', AIDA: ', entity.aida_link, ', GL:', entity.gold_link)

You can see in this document that one of the mentions of "Tokyo" is disambiguated wrongly by AIDA as `Tokyo` (it should be `Tokyo_Stock_Exchange`). Knowing how AIDA works, what would be your explanation for this error? (4 points)

**answer:** The mention "Tokyo" is more strongly associated with the city than with the stock exchange, so AIDA finds a stronger connection to `Tokyo` than to `Tokyo_Stock_Exchange`.

**Exercise 2b** Check the entity disambiguation by Spotlight against the gold entities on the document "http://aksw.org/N3/Reuters-128/36#char=0,1146" (write code to print the entity mentions, gold links and Spotlight links). (2 points)

In [36]:
correct_article = [article for article in articles if article.identifier == 'http://aksw.org/N3/Reuters-128/36#char=0,1146'][0]

In [41]:
processed_both = spotlight_disambiguate([correct_article], spotlight_disambiguation_url)

processed: 1: 100%|██████████| 1/1 [00:01<00:00,  1.15s/it]


You can see in this document that the mention of "Group of Seven" is disambiguated wrongly by Spotlight as `G8` (it should be `G7`). Knowing how Spotlight works, what would be your explanation for this error? (4 points)

In [45]:
for entity in processed_both[0].entity_mentions:
    print('entity:', entity.mention ,', SPOTLIGHT: ', entity.spotlight_link, ', GL:', entity.gold_link)

entity: U.S. Treasury , AIDA:  United_States_Department_of_the_Treasury , GL: United_States_Department_of_the_Treasury
entity: Group of Five , AIDA:  Group_of_Five , GL: Group_of_Five
entity: Gerhard Stoltenberg , AIDA:  Gerhard_Stoltenberg , GL: Gerhard_Stoltenberg
entity: Bundesbank , AIDA:  German_Federal_Bank , GL: Deutsche_Bundesbank
entity: Karl Otto Poehl , AIDA:  NIL , GL: Karl_Otto_Pöhl
entity: Edouard Balladur , AIDA:  Édouard_Balladur , GL: Édouard_Balladur
entity: Jacques de Larosiere , AIDA:  NIL , GL: Jacques_de_Larosière
entity: Kiichi Miyazawa , AIDA:  Kiichi_Miyazawa , GL: Kiichi_Miyazawa
entity: Satoshi Sumita , AIDA:  NIL , GL: Satoshi_Sumita
entity: Robin Leigh Pemberton , AIDA:  NIL , GL: Robin_Leigh-Pemberton,_Baron_Kingsdown
entity: Group of Seven , AIDA:  Group_of_Seven , GL: G7
entity: Giovanni Goria , AIDA:  Giovanni_Goria , GL: Giovanni_Goria
entity: Treasury , AIDA:  HM_Treasury , GL: United_States_Department_of_the_Treasury
entity: James Baker , AIDA:  Jame

**Question 2c** In the document with identifier "http://aksw.org/N3/Reuters-128/67#char=0,1627":
- both systems correctly decide that "Michel Dufour" is a `NIL` entity with no representation in the English Wikipedia. 
- however, Spotlight later decides that "Dufour" refers to `Guillaume-Henri_Dufour`

How would you help Spotlight fix this error? (Hint: think of how you would know that "Dufour" is a NIL entity in that document) (3 points)

**answer:** If "Michel Dufour" has already been judged a NIL entity earlier in the document, that decision should be remembered and propagated to later, shorter mentions of the same person, such as "Dufour". This reduces ambiguity within the document and saves computing resources.
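
The idea in this answer can be sketched as a post-processing step over the system output. This is a minimal illustration under a simplifying assumption: a single-token mention is treated as coreferent with an earlier multi-token NIL mention whenever it matches one of that mention's tokens. The function `propagate_nil` is hypothetical and is not part of Spotlight or `lab4_utils`.

```python
def propagate_nil(mentions, links):
    """Copy a NIL decision from a full name to later partial mentions.

    mentions: surface forms in document order
    links:    the system's link for each mention ('NIL' if unlinked)
    """
    fixed = list(links)
    nil_names = []  # token lists of multi-word mentions already judged NIL
    for i, (mention, link) in enumerate(zip(mentions, links)):
        tokens = mention.split()
        if link == 'NIL' and len(tokens) > 1:
            nil_names.append(tokens)
        elif len(tokens) == 1 and any(mention in name for name in nil_names):
            fixed[i] = 'NIL'  # same person as an earlier NIL mention
    return fixed


mentions = ['Michel Dufour', 'Reuters', 'Dufour']
links = ['NIL', 'Reuters', 'Guillaume-Henri_Dufour']
print(propagate_nil(mentions, links))
# prints: ['NIL', 'Reuters', 'NIL']
```

A real system would need a more careful coreference check, since token overlap alone can merge different people who share a surname.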

In [86]:
correct_article = [article for article in articles if article.identifier == 'http://aksw.org/N3/Reuters-128/67#char=0,1627'][0]
processed_both = spotlight_disambiguate([correct_article], spotlight_disambiguation_url)
processed_aida=aida_disambiguation([correct_article], aida_disambiguation_url)

for entity in processed_both[0].entity_mentions:
    print('entity:', entity.mention ,', SPOTLIGHT:', entity.spotlight_link, 'AIDA:', entity.aida_link, ', GL:', entity.gold_link)

processed: 1: 100%|██████████| 1/1 [00:02<00:00,  2.30s/it]
processed: 1: 100%|██████████| 1/1 [00:00<00:00,  1.71it/s]
entity: Dominion Textile Inc , SPOTLIGHT: NIL AIDA: NIL , GL: Dominion_Textile
entity: Burlington Industries Inc , SPOTLIGHT: NIL AIDA: NIL , GL: Burlington_Industries
entity: Michel Dufour , SPOTLIGHT: NIL AIDA: NIL , GL: NIL
entity: Reuters , SPOTLIGHT: Reuters AIDA: Reuters , GL: Reuters
entity: Dominion Textile , SPOTLIGHT: Dominion_Textile AIDA: Dominion_Textile , GL: Dominion_Textile
entity: Dufour , SPOTLIGHT: Guillaume-Henri_Dufour AIDA: Antoine_Dufour , GL: NIL
entity: Dominion Textile , SPOTLIGHT: Dominion_Textile AIDA: Dominion_Textile , GL: Dominion_Textile
entity: Thomas Bell , SPOTLIGHT: Thom_Bell AIDA: Thom_Bell , GL: NIL
entity: Dominion Textile , SPOTLIGHT: Dominion_Textile AIDA: Dominion_Textile , GL: Dominion_Textile
entity: Avondale Mills , SPOTLIGHT: Avondale_Mills AIDA: NIL , GL: NIL
entity: Dufour , SPOTLIGHT: Guillaume-Henri_Dufour AIDA: Antoin


### 3. Running your own text (14 points)

Let's now run one of the tools (you can choose which one) on our own text. You don't need to provide the mentions in this case; you can let the software perform the recognition of mentions as well.

**Exercise 3a** Add your own text that you would like to be processed. (1 point)

In [51]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3+

text="""
Wade Wilson is a dishonorably discharged special forces operative working as a mercenary when he meets Vanessa, a prostitute. They become romantically involved, and a year later she accepts his marriage proposal. Wilson is diagnosed with terminal cancer, and leaves Vanessa without warning so she will not have to watch him die.

A mysterious recruiter approaches Wilson, offering an experimental cure for his cancer. He is taken to Ajax and Angel Dust, who inject him with a serum designed to awaken latent mutant genes. They subject Wilson to days of torture to induce stress and trigger any mutation he may have, without success. When Wilson discovers Ajax's real name is Francis and mocks him for it, Ajax leaves Wilson in a hyperbaric chamber that periodically takes him to the verge of asphyxiation over a weekend. This finally activates a superhuman healing ability that cures the cancer but leaves Wilson severely disfigured with burn-like scars over his entire body. He escapes from the chamber and attacks Ajax but relents when told that his disfigurement can be cured. Ajax subdues Wilson and leaves him for dead in the now-burning laboratory.

Wilson survives and seeks out Vanessa. He does not reveal to her he is alive fearing her reaction to his new appearance. After consulting with his best friend Weasel, Wilson decides to hunt down Ajax for the cure. He becomes a masked vigilante, adopting the name "Deadpool" (from Weasel picking him in a dead pool), and moves into the home of an elderly blind woman named Al. He questions and murders many of Ajax's men until one, the recruiter, reveals his whereabouts. Deadpool intercepts Ajax and a convoy of armed men on an expressway. He kills everyone but Ajax, and demands the cure from him but the X-Man Colossus and his trainee Negasonic Teenage Warhead interrupt him. Colossus wants Deadpool to mend his ways and join the X-Men. Taking advantage of this distraction, Ajax escapes. He goes to Weasel's bar where he learns of Vanessa.

Ajax kidnaps Vanessa and takes her to a decommissioned helicarrier in a scrapyard. Deadpool convinces Colossus and Negasonic to help him. They battle Angel Dust and several soldiers while Deadpool fights his way to Ajax. During the battle, Negasonic accidentally destroys the equipment stabilizing the helicarrier. Deadpool protects Vanessa from the collapsing ship, while Colossus carries Negasonic and Angel Dust to safety. Ajax attacks Deadpool again but is overpowered. He reveals there is no cure after all and, despite Colossus's pleading, Deadpool kills him. He promises to try to be more heroic moving forward. Though Vanessa is angry with Wilson for leaving her, she reconciles with him.
"""
doc = nlp(text)

**Exercise 3b** Write a function to process this text with Spotlight or another tool of your choice, and run it. Print the list of entities that you receive. (3 points)

In [57]:
from spacy.tokens import Span
def spotlight_disambiguate(doc, spotlight_url, confidence=0.5):
    entities=[]

    annotation = etree.Element("annotation", text=doc.text)

    for entity in doc.ents:
        sf = etree.SubElement(annotation, "surfaceForm")
        sf.set("name", entity.text)
        sf.set("offset", str(entity.start_char))
    my_xml=etree.tostring(annotation, xml_declaration=True, encoding='UTF-8')

    results=requests.post(spotlight_url, urllib.parse.urlencode({'text':my_xml, 'confidence': confidence}), 
                          headers={'Accept': 'application/json'})
    j=results.json()
    dis_entities={}
    if 'Resources' in j: 
        resources=j['Resources']
    else: 
        resources=[]
    for dis_entity in resources:
        dis_entities[str(dis_entity['@offset'])] = utils.normalizeURL(dis_entity['@URI'])

    for entity in doc.ents:
        start = entity.start_char
        if str(start) in dis_entities:
            dis_url = dis_entities[str(start)]
        else:
            dis_url = 'NIL'
        print(dis_url)
        linked_entity = Span(doc, start=entity.start, end=entity.end, label=entity.label_, kb_id=dis_url)
        entities.append(linked_entity)

    return entities

In [56]:
spotlight_disambiguation_url = "http://model.dbpedia-spotlight.org/en/disambiguate"
processed_spotlight = spotlight_disambiguate(doc, spotlight_disambiguation_url)

for entity in processed_spotlight:
    print('mention: %s; type:%s; url:%s'% (entity.text, entity.label_, entity.kb_id_))

Deadpool
Vanessa_Fisk
One_Year_Later
Woodrow_Wilson
Vanessa_Fisk
Woodrow_Wilson
Ajax_(mythology)
Azrieal
Woodrow_Wilson
Woodrow_Wilson
Ajax_(mythology)
NIL
Ajax_(mythology)
Woodrow_Wilson
NIL
Woodrow_Wilson
Ajax_(mythology)
Woodrow_Wilson
Woodrow_Wilson
Vanessa_Fisk
Weasel
Woodrow_Wilson
Ajax_(mythology)
Deadpool
Weasel
Al_Capone
Ajax_(mythology)
Ajax_(mythology)
Ajax_(mythology)
Negasonic_Teenage_Warhead
Alternative_versions_of_Colossus
Deadpool
Ajax_(mythology)
Weasel
Vanessa_Fisk
Vanessa_Fisk
NIL
Azrieal
Deadpool
Ajax_(mythology)
NIL
Deadpool
Vanessa_Fisk
Alternative_versions_of_Colossus
NIL
Azrieal
Deadpool
Alternative_versions_of_Colossus
Deadpool
Vanessa_Fisk
Woodrow_Wilson
mention: Wade Wilson; type:PERSON; url:Deadpool
mention: Vanessa; type:GPE; url:Vanessa_Fisk
mention: a year later; type:DATE; url:One_Year_Later
mention: Wilson; type:ORG; url:Woodrow_Wilson
mention: Vanessa; type:ORG; url:Vanessa_Fisk
mention: Wilson; type:ORG; url:Woodrow_Wilson
mention: Ajax; type:PERSON; 

**Question 3c** Try changing the value of the confidence parameter and re-running the annotation in 3b. What is the role of the confidence parameter? What happens if you increase/decrease its value? (4 points)

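**answer:** The confidence parameter tells the disambiguation function how confident it has to be before it links a mention to an entity; mentions that stay below the threshold are left as `NIL`. Increasing the value gives fewer links, decreasing it gives more. For our text there is hardly any change: lowering the value from 0.5 to 0.2 turned only one `NIL` into a link (`Francis_of_Assisi`).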
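
The thresholding behaviour described in this answer can be illustrated with a toy model. This is not how Spotlight computes or applies its confidence internally; it is a sketch that assumes every candidate entity has a score between 0 and 1, and the candidates and scores below are invented.

```python
def link_with_threshold(candidates, confidence):
    """Pick the best-scoring candidate per mention, or 'NIL' below the threshold.

    candidates: dict mapping a mention to a list of (entity, score) pairs
    confidence: minimum score required to accept a link
    """
    decisions = {}
    for mention, scored in candidates.items():
        entity, score = max(scored, key=lambda pair: pair[1], default=('NIL', 0.0))
        decisions[mention] = entity if score >= confidence else 'NIL'
    return decisions


# Hypothetical candidate lists with made-up scores
candidates = {
    'Deadpool': [('Deadpool', 0.9)],
    'Francis': [('Francis_of_Assisi', 0.3), ('Pope_Francis', 0.25)],
}

print(link_with_threshold(candidates, 0.5))
# prints: {'Deadpool': 'Deadpool', 'Francis': 'NIL'}
print(link_with_threshold(candidates, 0.2))
# prints: {'Deadpool': 'Deadpool', 'Francis': 'Francis_of_Assisi'}
```

Lowering the threshold trades precision for recall: more mentions receive a link, but the extra links are the ones the system was least sure about.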

In [59]:
spotlight_disambiguation_url = "http://model.dbpedia-spotlight.org/en/disambiguate"
processed_spotlight = spotlight_disambiguate(doc, spotlight_disambiguation_url, 0.2)

for entity in processed_spotlight:
    print('mention: %s; type:%s; url:%s'% (entity.text, entity.label_, entity.kb_id_))

Deadpool
Vanessa_Fisk
One_Year_Later
Woodrow_Wilson
Vanessa_Fisk
Woodrow_Wilson
Ajax_(mythology)
Azrieal
Woodrow_Wilson
Woodrow_Wilson
Ajax_(mythology)
Francis_of_Assisi
Ajax_(mythology)
Woodrow_Wilson
NIL
Woodrow_Wilson
Ajax_(mythology)
Woodrow_Wilson
Woodrow_Wilson
Vanessa_Fisk
Weasel
Woodrow_Wilson
Ajax_(mythology)
Deadpool
Weasel
Al_Capone
Ajax_(mythology)
Ajax_(mythology)
Ajax_(mythology)
Negasonic_Teenage_Warhead
Alternative_versions_of_Colossus
Deadpool
Ajax_(mythology)
Weasel
Vanessa_Fisk
Vanessa_Fisk
NIL
Azrieal
Deadpool
Ajax_(mythology)
NIL
Deadpool
Vanessa_Fisk
Alternative_versions_of_Colossus
NIL
Azrieal
Deadpool
Alternative_versions_of_Colossus
Deadpool
Vanessa_Fisk
Woodrow_Wilson
mention: Wade Wilson; type:PERSON; url:Deadpool
mention: Vanessa; type:GPE; url:Vanessa_Fisk
mention: a year later; type:DATE; url:One_Year_Later
mention: Wilson; type:ORG; url:Woodrow_Wilson
mention: Vanessa; type:ORG; url:Vanessa_Fisk
mention: Wilson; type:ORG; url:Woodrow_Wilson
mention: Ajax;

**Question 3d** Pick one mistake that your tool made on this text (if there are no mistakes, just try annotating another text). Answer the following questions:
* Which mistake did the tool make? 
* Can you say what would be the correct decision instead? 
* Which of the phases (recognition, candidate generation, disambiguation) seems to have caused this error? 

(6 points)

**answer:** Most problems are in the disambiguation of persons. Often only the first or last name is mentioned, and the tool assigns the 'most famous' person with that name in the database to the entity. For example, "Wilson" is linked to `Woodrow_Wilson`, although in this text it refers to Wade Wilson (`Deadpool`). Apart from this, recognition and candidate generation seem to work fairly well.