# Lab4-Assignment about Named Entity Recognition, Classification and Disambiguation

This notebook describes the assignment of Lab 4 of the text mining course. 

**Learning goals**
* Go from a linguistic input format to a representation in a feature space.
* Work with pretrained word embeddings.
* Train a supervised classifier (SVM).
* Evaluate a supervised classifier (SVM).
* Perform feature ablation and gain insight into the contribution of various features.
* Learn how to evaluate an entity linking system.
* Learn how to run two entity linking systems (AIDA and DBpedia Spotlight).
* Learn how to interpret the system output and the evaluation results.
* Get insight into differences between the two systems.
* Be able to describe differences between the two methods in terms of their results.
* Be able to propose future improvements based on the observed results.
* Get insight into the difficulty of NED and how this depends on specific entity mentions.

The assignment consists of 2 parts:

* Named Entity Recognition and Classification: exercises 1 & 2
* Named Entity Disambiguation and Linking: exercises 3 & 4


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek Vossen.

# Named Entity Recognition and Classification

Exercises 1 and 2 focus on Named Entity Recognition and Classification.

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 points] a) Load the CoNLL-2003 data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points] - a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] - the NERC labels associated with each training instance, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [None]:
from nltk.corpus.reader import ConllCorpusReader
### Adapt the path to point to the nerc_datasets folder on your local machine
train = ConllCorpusReader('/Users/piek/Desktop/ONDERWIJS/data/nerc_datasets/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
       # add features
    }
   

In [None]:
### Adapt the path to point to the nerc_datasets folder on your local machine
test = ConllCorpusReader('/Users/piek/Desktop/ONDERWIJS/nerc_datasets/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []
for token, pos, ne_label in test.iob_words():
    a_dict = {
        # add features
    }


**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data are balanced (an equal number of instances for each NERC label) and to what extent the training and test data differ.

Tip: you can use the following `Counter` functionality to generate a frequency list of a list:

In [None]:
from collections import Counter 

my_list=[1,2,1,3,2,5]
Counter(my_list)
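
The same idea carries over to label lists. As a minimal sketch, assuming a hypothetical list `toy_labels` that stands in for your own gold labels, `Counter.most_common()` returns the counts sorted by frequency, and dividing by the total makes any imbalance easier to read:

```python
from collections import Counter

# Hypothetical toy labels; in the assignment this would be your list of gold NERC labels.
toy_labels = ['O', 'O', 'B-ORG', 'O', 'B-PER', 'I-PER', 'O', 'B-ORG']

label_counts = Counter(toy_labels)
total = sum(label_counts.values())

# Print each label with its absolute count and its share of all instances.
for label, count in label_counts.most_common():
    print(f'{label}\t{count}\t{count / total:.1%}')
```

Running the same loop on the training labels and on the test labels gives two distributions that can be compared side by side.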


**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: Once you have concatenated train and test into one list and applied the DictVectorizer, the order of the rows is maintained. You can hence use an index (the number of training instances) to split the_array back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.
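
The index-based split can be sketched on toy data. The feature dictionaries and the names `toy_train`, `toy_test`, and `toy_array` below are hypothetical stand-ins, not the assignment data; the point is only that fitting on the concatenated list gives train and test the same columns, and that row order is preserved:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy feature dictionaries standing in for the real train and test features.
toy_train = [{'words': 'EU', 'pos': 'NNP'}, {'words': 'rejects', 'pos': 'VBZ'}]
toy_test = [{'words': 'German', 'pos': 'JJ'}]

toy_vec = DictVectorizer()
# Fit on the concatenation so that train and test share one feature space.
toy_array = toy_vec.fit_transform(toy_train + toy_test)

# Rows keep their input order, so the number of training instances is the split point.
n_train = len(toy_train)
toy_train_vectors = toy_array[:n_train]
toy_test_vectors = toy_array[n_train:]

print(toy_train_vectors.shape, toy_test_vectors.shape)
```

Both slices have the same number of columns, which is what a classifier trained on the first slice needs in order to predict on the second.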


In [None]:
from sklearn.feature_extraction import DictVectorizer

In [None]:
vec = DictVectorizer()
the_array = # your code here

**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
Training (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

In [8]:
from sklearn import svm

In [9]:
lin_clf = svm.LinearSVC()

In [3]:
##### [ YOUR CODE SHOULD GO HERE ]
# lin_clf.fit( # your code here

**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 1d. Generate a classification report and compare the results with the classifier you built in 1d.**

In [10]:
# your code here
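
One possible way to build the inputs, sketched under assumptions: each token is replaced by its embedding vector, and tokens missing from the embedding vocabulary get a zero vector of the same dimensionality. The `toy_embeddings` dictionary and the helper `token_to_vector` are hypothetical stand-ins for a pretrained model and are not part of the course material.

```python
# Hypothetical 3-dimensional embeddings; a real pretrained model typically has hundreds of dimensions.
toy_embeddings = {
    'EU': [0.1, 0.3, -0.2],
    'rejects': [0.7, -0.1, 0.0],
}
DIMENSIONS = 3


def token_to_vector(token, embeddings, dimensions):
    """Return the embedding of a token, or a zero vector if the token is unknown."""
    if token in embeddings:
        return embeddings[token]
    return [0.0] * dimensions


tokens = ['EU', 'rejects', 'Blahblah']
input_vectors = [token_to_vector(t, toy_embeddings, DIMENSIONS) for t in tokens]
print(input_vectors)
```

A dense list of vectors like this can serve as the feature matrix for a classifier such as `svm.LinearSVC`, in place of the one-hot features.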

## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the features representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels

In [None]:
import pandas

In [None]:
##### Adapt the path to point to your local copy of NERC_datasets
path = '/Users/piek/Desktop/ONDERWIJS/data/nerc_datasets/kaggle/ner_v2.csv'
kaggle_dataset = pandas.read_csv(path, on_bad_lines='skip')  # pandas >= 1.3; error_bad_lines=False was removed in pandas 2.0

In [None]:
len(kaggle_dataset)

In [None]:
df_train = kaggle_dataset[:100000]
df_test = kaggle_dataset[100000:120000]
print(len(df_train), len(df_test))

**[4 points] b. Train and evaluate the model and provide the classification report:**
* use the SVM to predict NERC labels on the test data
* evaluate the performance of the SVM on the test data

Analyze the performance per NERC label.

# Entity Linking

Exercises 3 and 4 focus on Entity Linking.

### Exercise 3 (NEL): Quantitative analysis  [Points: 15] 

In this assignment, you are going to work with two systems for entity linking: AIDA and DBpedia Spotlight. You will run them on an entity linking dataset and evaluate their performance. You will perform both quantitative and qualitative analysis of their output, and run one of these systems on your own text. We will reflect on the results of these tasks.

**Note:** We will use the dataset Reuters-128 in this assignment. This dataset was introduced in the notebook 'Lab4.3-Entity-linking-tools', so you probably have it already (in case you do not have it make sure you download it from Canvas first and put it in the same location as this notebook). 


**Exercise 3a** Write code that runs both systems on the full Reuters-128 dataset. (5 points)

In [1]:
# Run both systems on the full Reuters-128 dataset

from rdflib import Graph, URIRef
from tqdm import tqdm 
import sys
import requests
import urllib
import urllib.parse
from urllib.request import urlopen, Request
from urllib.parse import urlencode
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from lxml import etree
import time
import json

# import our own utility functions and classes
import lab4_utils as utils
import lab4_classes as classes

In [2]:
reuters_file='Reuters-128.ttl'
articles=utils.load_article_from_nif_file(reuters_file)

In [38]:
def aida_disambiguation(articles, aida_url):
    """
    Perform disambiguation with AIDA.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:  # pbar provides a progress bar for the iteration over the articles
        for i, article in enumerate(articles):
                                    
            # AIDA expects entity mentions to be pre-marked inside the text, e.g.,
            # "Obama visited Paris today." should be transformed to "[[Obama]] visited [[Paris]] today."
            original_content = article.content 
            new_content = original_content  # fixed: was original.content, which raises a NameError
            for entity in reversed(article.entity_mentions):
                entity_span=new_content[entity.begin_index: entity.end_index]
                new_content=new_content[:entity.begin_index] + '[[' + entity_span + ']]' + new_content[entity.end_index:]

            # Now, we can run the AIDA library with this string.
            params={"text": new_content, "tag_mode": 'manual'}
            request = Request(aida_url, urlencode(params).encode())
            # AIDA returns a json structure
            this_json = urlopen(request).read().decode('unicode-escape')
            try:
                results=json.loads(this_json)
            except json.JSONDecodeError:  # skip articles for which AIDA did not return valid JSON
                continue
            # print(this_json)
            # Let's normalize the disambiguated entities.
            # This means mostly removing the first part of the URI which is always the same (YAGO:)
            # and leaving only the entity identification part (e.g., Barack_Obama).
            dis_entities={}
            # We iterate over the data elements "mentions" in the json results
            for dis_entity in results['mentions']:
                # print(dis_entity)
                # AIDA labels the best entity as 'bestEntity' in the JSON
                if 'bestEntity' in dis_entity.keys():
                    best_entity=dis_entity['bestEntity']['kbIdentifier']
                    clean_url=best_entity[5:] #SKIP YAGO:
                else:
                    clean_url='NIL'
                dis_entities[str(dis_entity['offset'])] = clean_url # BECOMES THE VALUE IN THE DICTIONARY FOR THE OFFSET(REPRESENTING THE START OF THE MENTION) IN THE TEXT
                
            # We can now store the entity to our class instance for later processing.
            for entity in article.entity_mentions:
                start = entity.begin_index
                try:
                    dis_url = str(dis_entities[str(start)])  # WE GET THE DISAMBIGUATED URL
                except KeyError:  # AIDA returned no entity at this offset
                    dis_url='NIL'
                entity.aida_link = dis_url  # THE ENTITY IS ENRICHED WITH THE AIDA_LINK

            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
    return articles
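
The bracket-marking step inside `aida_disambiguation` can be tried in isolation. This is a minimal sketch on a hypothetical sentence with hand-written `(begin, end)` offsets standing in for `article.entity_mentions`; it shows why the mentions are processed in reverse order.

```python
# Hypothetical text and (begin, end) character offsets, standing in for article.entity_mentions.
text = 'Obama visited Paris today.'
mentions = [(0, 5), (14, 19)]

marked = text
# Work from the last mention to the first: inserting brackets only shifts
# characters to the right of the insertion point, so earlier offsets stay valid.
for begin, end in reversed(mentions):
    marked = marked[:begin] + '[[' + marked[begin:end] + ']]' + marked[end:]

print(marked)
```

Iterating front to back instead would shift every later mention by four characters per inserted pair of brackets, and the stored offsets would no longer point at the right spans.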

In [4]:
def spotlight_disambiguate(articles, spotlight_url):
    """
    Perform disambiguation with DBpedia Spotlight.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:
        for i, article in enumerate(articles):
            # As with AIDA, we first prepare the document text and the mentions
            # in order to provide these to Spotlight as input.
            
            # We build up the XML structure that Spotlight wants as input
            # The next function Element creates the XML element with the text attribute
            annotation = etree.Element("annotation", text=article.content)
            
            # We iterate over the entity mentions from our Reuters data to create the surface form elements
            for mention in article.entity_mentions:
                sf = etree.SubElement(annotation, "surfaceForm")
                sf.set("name", mention.mention)
                sf.set("offset", str(mention.begin_index))
            my_xml=etree.tostring(annotation, xml_declaration=True, encoding='UTF-8')
            # Send a disambiguation request to spotlight
            results=requests.post(spotlight_url, urllib.parse.urlencode({'text':my_xml, 'confidence': 0.5}), 
                                  headers={'Accept': 'application/json'})
            # Note that you can adjust the confidence value. Check the online demo to see the effect. 
            # What will happen with the recall and precision if you increase the confidence?
            
            # Process the results and normalize the entity URIs
            j=results.json()
            dis_entities={}
            if 'Resources' in j: 
                resources=j['Resources']
            else: 
                resources=[]
            for dis_entity in resources:
                dis_entities[str(dis_entity['@offset'])] = utils.normalizeURL(dis_entity['@URI'])
            
            # Let's now store the URLs by Spotlight to our class for later analysis.
            for entity in article.entity_mentions:
                start = entity.begin_index
                if str(start) in dis_entities:
                    dis_url = dis_entities[str(start)]
                else:
                    dis_url = 'NIL'
                entity.spotlight_link = dis_url
    
            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
                
            # Pause for 100ms to prevent overloading the server
            time.sleep(0.1)
    return articles
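
To see what the XML input built in `spotlight_disambiguate` looks like, the construction can be reproduced on a toy sentence. This sketch uses the standard-library `xml.etree.ElementTree` rather than `lxml` so that it runs without extra dependencies; the text and surface forms are hypothetical.

```python
import xml.etree.ElementTree as ET

toy_text = 'Obama visited Paris today.'
surface_forms = ['Obama', 'Paris']

# The document text goes into an attribute of the root element.
toy_annotation = ET.Element('annotation', text=toy_text)
for name in surface_forms:
    sf = ET.SubElement(toy_annotation, 'surfaceForm')
    sf.set('name', name)
    # The offset is the character position of the mention in the text.
    sf.set('offset', str(toy_text.find(name)))

xml_string = ET.tostring(toy_annotation, encoding='unicode')
print(xml_string)
```

Each mention becomes one `surfaceForm` child carrying its name and start offset, which is the same offset later used as the key to match the returned resources back to the mentions.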

In [35]:
test_items = articles[0:5] #test initial code with something less heavy

In [9]:
aida_disambiguation_url = "https://gate.d5.mpi-inf.mpg.de/aida/service/disambiguate"
spotlight_disambiguation_url="http://model.dbpedia-spotlight.org/en/disambiguate"

In [36]:
processed_aida=aida_disambiguation(test_items, aida_disambiguation_url)
processed_both=spotlight_disambiguate(processed_aida, spotlight_disambiguation_url)

processed: 5: 100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.82it/s]
processed: 5: 100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.04s/it]


In [37]:
for article in processed_both:
    doc_id = article.identifier
    print(doc_id)
    print()
    for mention in article.entity_mentions:
        print('|mention: %s\t|gold:\t%s\t|aida:\t%s\t|spotlight:\t%s |' % (mention.mention, mention.gold_link, mention.aida_link, mention.spotlight_link))
        
system_decisions = [] 
gold_decisions = []
for article in processed_both:
    doc_id = article.identifier
    print(doc_id)
    print()
    #print(article.entity_mentions.mention)
    #print('|mention: %s\t|gold:\t%s\t|aida:\t%s\t|spotlight:\t%s |' % (mention.mention, mention.gold_link, mention.aida_link, mention.spotlight_link))

http://aksw.org/N3/Reuters-128/34#char=0,912

|mention: Social Affairs Ministry	|gold:	Ministry_of_Social_Affairs_and_Employment_(Netherlands)	|aida:	NIL	|spotlight:	NIL |
http://aksw.org/N3/Reuters-128/70#char=0,282

|mention: Federal Reserve	|gold:	Federal_Reserve_System	|aida:	NIL	|spotlight:	Federal_Reserve_System |
|mention: Fed	|gold:	Federal_Reserve_System	|aida:	Federal_Reserve_System	|spotlight:	Federal_Reserve_System |
|mention: Fed	|gold:	Federal_Reserve_System	|aida:	Federal_Reserve_System	|spotlight:	Federal_Reserve_System |
http://aksw.org/N3/Reuters-128/104#char=0,130

|mention: Yankee Cos Inc	|gold:	NIL	|aida:	NIL	|spotlight:	NIL |
|mention: Eskey Inc	|gold:	Esky	|aida:	NIL	|spotlight:	NIL |
http://aksw.org/N3/Reuters-128/14#char=0,1163

|mention: Volkswagen AG	|gold:	Volkswagen_Group	|aida:	Volkswagen	|spotlight:	Volkswagen_Group |
|mention: VW	|gold:	Volkswagen_Group	|aida:	NIL	|spotlight:	Volkswagen |
|mention: VW	|gold:	Volkswagen_Group	|aida:	NIL	|spotlight:	Volksw

**Exercise 3b** Write code that evaluates the two systems on this dataset by computing their overall precision, recall, and F1-score. (5 points)

In [12]:
# Write a function to compute the precision, recall, and F1-score for each of the systems on this dataset
def evaluate_entity_linking(system_decisions, gold_decisions):
    """
    Compute precision, recall, and F1-score by comparing two paired lists of: system decisions and gold data decisions.
    """
    tp=0
    fp=0
    fn=0
    
    for gold_entity,system_entity in zip(gold_decisions,system_decisions):
        if gold_entity=='NIL' and system_entity=='NIL': continue
        if gold_entity==system_entity:
            tp+=1
        else:
            if gold_entity!='NIL':
                fn+=1
            if system_entity!='NIL':
                fp+=1

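    print('TP: %d\nFP: %d\nFN: %d' % (tp, fp, fn))

    # Guard against division by zero, e.g. when a system links nothing at all
    precision = tp/(tp+fp) if (tp+fp) > 0 else 0.0
    recall = tp/(tp+fn) if (tp+fn) > 0 else 0.0
    f1 = 2*precision*recall/(precision+recall) if (precision+recall) > 0 else 0.0
    
    return precision, recall, f1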
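
A small worked example may help to check the counting rules by hand. The two lists below are hypothetical decisions for five mentions; the loop repeats the same logic as `evaluate_entity_linking` so that the block runs on its own.

```python
# Hypothetical paired decisions for five mentions.
toy_gold = ['A', 'B', 'NIL', 'NIL', 'C']
toy_system = ['A', 'X', 'NIL', 'Y', 'NIL']

tp = fp = fn = 0
for gold, system in zip(toy_gold, toy_system):
    if gold == 'NIL' and system == 'NIL':
        continue  # correctly left unlinked: not counted
    if gold == system:
        tp += 1
    else:
        if gold != 'NIL':
            fn += 1  # a gold link was missed
        if system != 'NIL':
            fp += 1  # a wrong link was proposed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, round(precision, 2), round(recall, 2), round(f1, 2))
```

Note that a wrong link for a non-NIL gold entity (`B` versus `X`) counts as both a false negative and a false positive, while a spurious link for a NIL entity only hurts precision and a missed link only hurts recall.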

**Question 3c** What is the F1-score per system? Which system performs better? Is that also the better system in terms of precision and recall? For each system, which of the two is higher, precision or recall, and what does that mean (hint: think of NIL entities)? (5 points)

In [13]:
# Your answer here...

### Exercise 4 (NEL): Qualitative analysis [Points: 15] 

**Exercise 4a** Check the entity disambiguation by AIDA against the gold entities on the document with identifier "http://aksw.org/N3/Reuters-128/82#char=0,1370" (write code to print the entity mentions, gold links and AIDA links). (2 points)

In [10]:
# Your code here...

You can see in this document that one of the mentions of "Tokyo" is disambiguated wrongly by AIDA as `Tokyo` (it should be `Tokyo_Stock_Exchange`). Knowing how AIDA works, what would be your explanation for this error? (4 points)

In [11]:
# Your answer here...

**Exercise 4b** Check the entity disambiguation by Spotlight against the gold entities on the document with identifier "http://aksw.org/N3/Reuters-128/36#char=0,1146" (write code to print the entity mentions, gold links and Spotlight links). (2 points)

In [12]:
# Your code here...

You can see in this document that the mention of "Group of Seven" is disambiguated wrongly by Spotlight as `G8` (it should be `G7`). Knowing how Spotlight works, what would be your explanation for this error? (4 points)

In [13]:
# Your answer here...

**Question 4c** In the document with identifier "http://aksw.org/N3/Reuters-128/67#char=0,1627":
- both systems correctly decide that "Michel Dufour" is a `NIL` entity with no representation in the English Wikipedia. 
- however, Spotlight later decides that "Dufour" refers to `Guillaume-Henri_Dufour`

How would you help Spotlight fix this error? (Hint: think of how you would know that "Dufour" is a NIL entity in that document) (3 points)

In [None]:
# Your answer here...

## End of this notebook