# Lab4-Assignment about Named Entity Recognition, Classification and Disambiguation

This notebook describes the assignment of Lab 4 of the text mining course. 

**Learning goals**
* going from linguistic input format to representing it in a feature space
* working with pretrained word embeddings
* train a supervised classifier (SVM)
* evaluate a supervised classifier (SVM)
* perform feature ablation and gain insight into the contribution of various features
* Learn how to evaluate an entity linking system.
* Learn how to run two entity linking systems (AIDA and DBpedia Spotlight).
* Learn how to interpret the system output and the evaluation results.
* Get insight into differences between the two systems.
* Be able to describe differences between the two methods in terms of their results.
* Be able to propose future improvements based on the observed results.
* Get insight into the difficulty of NED and how this depends on specific entity mentions.

The assignment consists of 2 parts:

* Named Entity Recognition and Classification: exercises 1 & 2
* Named Entity Disambiguation and Linking: exercises 3 & 4


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek Vossen.

# Named Entity Recognition and Classification

Exercises 1 and 2 focus on Named Entity Recognition and Classification.

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 points] a) Load the CoNLL-2003 training data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points]  -a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] -the NERC labels associated with each training instance, as a list, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [1]:
from nltk.corpus.reader import ConllCorpusReader

In [24]:
train = ConllCorpusReader('nerc_datasets/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
        #features:
        'words': token,
        'pos': pos
    }
    training_features.append(a_dict)
    training_gold_labels.append(ne_label)
print('First 10 elements from the training instances features:\n',training_features[:10])
print()
print('First 10 elements from the training instances NERC labels:\n', training_gold_labels[:10])

First 10 elements from the training instances features:
 [{'words': 'EU', 'pos': 'NNP'}, {'words': 'rejects', 'pos': 'VBZ'}, {'words': 'German', 'pos': 'JJ'}, {'words': 'call', 'pos': 'NN'}, {'words': 'to', 'pos': 'TO'}, {'words': 'boycott', 'pos': 'VB'}, {'words': 'British', 'pos': 'JJ'}, {'words': 'lamb', 'pos': 'NN'}, {'words': '.', 'pos': '.'}, {'words': 'Peter', 'pos': 'NNP'}]

First 10 elements from the training instances NERC labels:
 ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'B-PER']


In [25]:
### Adapt the path to point to the NERC_datasets folder on your local machine
test = ConllCorpusReader('nerc_datasets/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []
for token, pos, ne_label in test.iob_words():
    a_dict = {
        #features:
        'words':token,
        'pos': pos
    }
    test_features.append(a_dict)
    test_gold_labels.append(ne_label)
print('First 10 elements from the test instances features:\n',test_features[:10])
print()
print('First 10 elements from the test instances NERC labels:\n', test_gold_labels[:10])

First 10 elements from the test instances features:
 [{'words': 'SOCCER', 'pos': 'NN'}, {'words': '-', 'pos': ':'}, {'words': 'JAPAN', 'pos': 'NNP'}, {'words': 'GET', 'pos': 'VB'}, {'words': 'LUCKY', 'pos': 'NNP'}, {'words': 'WIN', 'pos': 'NNP'}, {'words': ',', 'pos': ','}, {'words': 'CHINA', 'pos': 'NNP'}, {'words': 'IN', 'pos': 'IN'}, {'words': 'SURPRISE', 'pos': 'DT'}]

First 10 elements from the test instances NERC labels:
 ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O']


**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data are balanced (an equal number of instances for each NERC label) and to what extent the training and test data differ.

Tip: you can use the following `Counter` functionality to generate a frequency list from a list:

In [18]:
# from collections import Counter 
# my_list=[1,2,1,3,2,5]
# Counter(my_list)
import pandas
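Raw counts are hard to compare between a large training set and a smaller test set, so it can help to look at relative frequencies as well. The following is a minimal sketch of that idea; the helper name `label_distribution` and the toy labels are illustrative assumptions, not part of the assignment.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in counts.most_common()}


# Toy labels for illustration only, not taken from the CoNLL-2003 data.
toy_labels = ['O', 'O', 'B-PER', 'O', 'B-LOC', 'O', 'I-PER', 'O']

for label, (count, share) in label_distribution(toy_labels).items():
    print(f'{label:<6} {count:>3} {share:6.1%}')
```

Calling the same helper on the training labels and on the test labels gives two distributions whose shares can be compared directly.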

In [26]:
print( '\033[1m How many instances are in train and test?\033[0m')
print('There are %d instances in train and %d instances in test.'%(len(training_features),len(test_features)))
print()

#Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
df_train = pandas.DataFrame(training_gold_labels)
df_train.columns = ['frequency']
print('\033[1m NERC-label frequency distribution of train: \033[0m  \n', df_train.apply(pandas.value_counts))
print()
df_test = pandas.DataFrame(test_gold_labels)
df_test.columns = ['frequency']
print('\033[1m NERC-label frequency distribution of test: \033[0m \n', df_test.apply(pandas.value_counts))
print()
print('\033[1m Balance and differences between test and train : \033[0m \n'
      'The data is heavily imbalanced: in both train and test the O label accounts for more than 80% '
      'of the instances, and I-MISC is the least frequent label. '
      'Apart from that, the label distributions of train and test are similar; the main difference is '
      'the order of the B- labels: B-LOC, B-PER, B-ORG in train versus B-LOC, B-ORG, B-PER in test.')

How many instances are in train and test?
There are 203621 instances in train and 46435 instances in test.

NERC-label frequency distribution of train:
         frequency
O          169578
B-LOC        7140
B-PER        6600
B-ORG        6321
I-PER        4528
I-ORG        3704
B-MISC       3438
I-LOC        1157
I-MISC       1155

NERC-label frequency distribution of test:
         frequency
O           38323
B-LOC        1668
B-ORG        1661
B-PER        1617
I-PER        1156
I-ORG         835
B-MISC        702
I-LOC         257
I-MISC        216

Balance and differences between test and train :
The data is heavily imbalanced: in both train and test the O label accounts for more than 80% of the instances, and I-MISC is the least frequent label. Apart from that, the label distributions of train and test are similar; the main difference is the order of the B- labels: B-LOC, B-PER, B-ORG in train versus B-LOC, B-ORG, B-PER in test.


**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: Once you have concatenated train and test into one list and applied the DictVectorizer, the order of the rows is maintained. You can hence use an index (the number of training instances) to split `the_array` back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.
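The tip above can be sketched on a toy example. This is only an illustration under stated assumptions: the three feature dictionaries are made up, and the names `toy_train`, `toy_test`, `X_train` and `X_test` are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy feature dictionaries standing in for the real train and test instances.
toy_train = [{'words': 'EU', 'pos': 'NNP'}, {'words': 'rejects', 'pos': 'VBZ'}]
toy_test = [{'words': 'Peter', 'pos': 'NNP'}]

vectorizer = DictVectorizer()
# Fit on train + test together so that both parts share one feature space.
combined = vectorizer.fit_transform(toy_train + toy_test)  # sparse matrix

# Row order is preserved, so the number of training instances is the split point.
n_train = len(toy_train)
X_train, X_test = combined[:n_train], combined[n_train:]

print(X_train.shape, X_test.shape)
print(vectorizer.feature_names_)
```

Fitting on the concatenated list guarantees that a word occurring only in the test data still gets a column, so train and test rows have the same width.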


In [4]:
from sklearn.feature_extraction import DictVectorizer

In [7]:
vec = DictVectorizer()
all_features = training_features + test_features
the_array = vec.fit_transform(all_features).toarray()
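# Split at the number of training instances instead of a hard-coded index
n_train = len(training_features)
new_train = the_array[:n_train]
new_test = the_array[n_train:]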
print(new_train)
print(new_test)


[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
Training (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?
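If training takes too long, one option mentioned above is to train on a subset. The sketch below shows that idea on toy data; `subset_size` and the toy instances are assumptions for illustration, and the feature matrix is kept sparse, which `LinearSVC` accepts directly.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

# Toy data standing in for the CoNLL-2003 feature dictionaries and labels.
toy_features = [
    {'words': 'EU', 'pos': 'NNP'}, {'words': 'rejects', 'pos': 'VBZ'},
    {'words': 'German', 'pos': 'JJ'}, {'words': 'call', 'pos': 'NN'},
    {'words': 'Peter', 'pos': 'NNP'}, {'words': 'lamb', 'pos': 'NN'},
]
toy_labels = ['B-ORG', 'O', 'B-MISC', 'O', 'B-PER', 'O']

X = DictVectorizer().fit_transform(toy_features)  # kept sparse on purpose

subset_size = 4  # hypothetical cap; choose a size that trains in acceptable time
clf = LinearSVC()
clf.fit(X[:subset_size], toy_labels[:subset_size])

# Evaluate on the remaining rows, which play the role of the test data here.
predictions = clf.predict(X[subset_size:])
print(classification_report(toy_labels[subset_size:], predictions, zero_division=0))
```

Taking the first rows keeps whole sentences together; a smaller training set will usually lower the scores, so the subset size should be reported with the results.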

In [5]:
from sklearn import svm
from sklearn.metrics import classification_report


In [6]:
lin_clf = svm.LinearSVC()

In [10]:
lin_clf.fit(new_train, training_gold_labels)
predict_label = lin_clf.predict(new_test)
print(classification_report(test_gold_labels, predict_label))

              precision    recall  f1-score   support

       B-LOC       0.81      0.78      0.79      1668
      B-MISC       0.78      0.66      0.72       702
       B-ORG       0.79      0.52      0.63      1661
       B-PER       0.86      0.44      0.58      1617
       I-LOC       0.62      0.53      0.57       257
      I-MISC       0.57      0.59      0.58       216
       I-ORG       0.70      0.47      0.56       835
       I-PER       0.33      0.87      0.48      1156
           O       0.98      0.98      0.98     38323

    accuracy                           0.92     46435
   macro avg       0.72      0.65      0.65     46435
weighted avg       0.94      0.92      0.92     46435



**1d)** The classifier performs best on the outside label `O` (F1 0.98). This is probably because `O` is by far the most frequent label and covers verbs, prepositions, punctuation and other words that rarely occur as part of a named entity.
The inside person label `I-PER` performs worst (precision 0.33, recall 0.87, F1 0.48): many tokens are wrongly labelled `I-PER`, probably because last names also occur on their own and are therefore classified wrongly.

**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 1d. Generate a classification report and compare the results with the classifier you built in 1d.**

In [7]:
import gensim

In [8]:
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('model/GoogleNews-vectors-negative300.bin', binary=True)  

In [9]:
embedding_train=[]
embedding_gold_labels=[]
embedding_test=[]
embedding_gold_test_labels=[]
for token, pos, ne_label in train.iob_words():
    
    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        embedding_train.append(vector)
        embedding_gold_labels.append(ne_label)
        
for token, pos, ne_label in test.iob_words():
    
    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        embedding_test.append(vector)
        embedding_gold_test_labels.append(ne_label)

In [16]:
lin_clf.fit(embedding_train, embedding_gold_labels)
predict_label = lin_clf.predict(embedding_test)
print(classification_report(embedding_gold_test_labels, predict_label))

              precision    recall  f1-score   support

       B-LOC       0.76      0.80      0.78      1668
      B-MISC       0.72      0.70      0.71       702
       B-ORG       0.69      0.64      0.66      1661
       B-PER       0.75      0.67      0.71      1617
       I-LOC       0.51      0.42      0.46       257
      I-MISC       0.60      0.54      0.57       216
       I-ORG       0.48      0.33      0.39       835
       I-PER       0.59      0.50      0.54      1156
           O       0.97      0.99      0.98     38323

    accuracy                           0.93     46435
   macro avg       0.68      0.62      0.64     46435
weighted avg       0.92      0.93      0.92     46435



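**1e) Comparison of the classification reports of 1d and 1e**:
For most labels the precision is a bit lower when using the embeddings as input (the exceptions are `I-PER`, 0.33 vs. 0.59, and `I-MISC`, 0.57 vs. 0.60). The overall differences are small: accuracy is 0.92 vs. 0.93, macro-average F1 is 0.65 vs. 0.64 and weighted-average F1 is 0.92 in both cases. The person labels benefit most from the embeddings: F1 rises from 0.48 to 0.54 for `I-PER` and from 0.58 to 0.71 for `B-PER`. In contrast, `I-ORG` (F1 0.56 vs. 0.39) and `I-LOC` (F1 0.57 vs. 0.46) perform worse with embeddings.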


## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the features representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels
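The first step, turning a DataFrame into a list of feature dictionaries and a list of labels, could look roughly as follows. This is a sketch under assumptions: the helper `frame_to_instances` is hypothetical, and the column names `word`, `pos` and `tag` are placeholders that should be checked against the actual columns of the CSV file.

```python
import pandas


def frame_to_instances(frame, feature_columns, label_column):
    """Split a DataFrame into a list of feature dicts and a list of labels."""
    features = frame[feature_columns].to_dict(orient='records')
    labels = frame[label_column].tolist()
    return features, labels


# Toy frame; the column names 'word', 'pos' and 'tag' are assumptions.
toy_frame = pandas.DataFrame({
    'word': ['Thousands', 'of', 'London'],
    'pos': ['NNS', 'IN', 'NNP'],
    'tag': ['O', 'O', 'B-geo'],
})

features, labels = frame_to_instances(toy_frame, ['word', 'pos'], 'tag')
print(features)
print(labels)
```

Passing the feature columns as a list makes it easy to add or drop a column later and see how the classification report changes.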

In [17]:
import pandas

In [27]:
##### Adapt the path to point to your local copy of NERC_datasets
path = 'nerc_datasets/kaggle/ner_v2.csv'
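# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
kaggle_dataset = pandas.read_csv(path, on_bad_lines='skip')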

In [30]:
len(kaggle_dataset)

In [31]:
df_train = kaggle_dataset[:100000]
df_test = kaggle_dataset[100000:120000]
print(len(df_train), len(df_test))

**[4 points] b. Train and evaluate the model and provide the classification report:**
* use the SVM to predict NERC labels on the test data
* evaluate the performance of the SVM on the test data

Analyze the performance per NERC label.
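For the per-label analysis it can be convenient to get the classification report as a dictionary and sort the labels by F1-score. The sketch below uses made-up gold and predicted labels purely for illustration; it is one possible way to organise the analysis, not a prescribed one.

```python
from sklearn.metrics import classification_report

# Toy gold and predicted labels, not actual system output.
gold = ['O', 'O', 'B-geo', 'B-per', 'O', 'B-geo']
pred = ['O', 'O', 'B-geo', 'O', 'O', 'B-per']

report = classification_report(gold, pred, output_dict=True, zero_division=0)

# Keep only the per-label rows and drop the summary rows.
summary_keys = {'accuracy', 'macro avg', 'weighted avg'}
per_label = {label: scores for label, scores in report.items()
             if label not in summary_keys}

# List the labels from weakest to strongest F1-score.
for label, scores in sorted(per_label.items(), key=lambda item: item[1]['f1-score']):
    print(f"{label:<6} f1={scores['f1-score']:.2f} support={int(scores['support'])}")
```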

# Entity Linking

Exercises 3 and 4 focus on entity linking.

### Exercise 3 (NEL): Quantitative analysis  [Points: 15] 

In this assignment, you are going to work with two systems for entity linking: AIDA and DBpedia Spotlight. You will run them on an entity linking dataset and evaluate their performance. You will perform both quantitative and qualitative analysis of their output, and run one of these systems on your own text. We will reflect on the results of these tasks.

**Note:** We will use the dataset Reuters-128 in this assignment. This dataset was introduced in the notebook 'Lab4.3-Entity-linking-tools', so you probably have it already (in case you do not have it make sure you download it from Canvas first and put it in the same location as this notebook). 


**Exercise 3a** Write code that runs both systems on the full Reuters-128 dataset. (5 points)

In [None]:
# Run both systems on the full Reuters-128 dataset

**Exercise 3b** Write code that evaluates the two systems on this dataset by computing their overall precision, recall, and F1-score. (5 points)

In [None]:
# Write a function to compute the precision, recall, and F1-score for each of the systems on this dataset
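One possible shape for such an evaluation function is sketched below. The counting convention is an assumption and should be checked against the one used in the course: a correct non-NIL link is a true positive, a wrong non-NIL system link counts as a false positive, a gold link that the system does not reproduce counts as a false negative, and pairs where both gold and system are NIL are ignored. The function name `evaluate_linking` and the toy links are hypothetical.

```python
def evaluate_linking(gold_links, system_links, nil='NIL'):
    """Compute precision, recall and F1 over aligned gold/system links."""
    tp = fp = fn = 0
    for gold, system in zip(gold_links, system_links):
        if system != nil and system == gold:
            tp += 1
        else:
            if system != nil:
                fp += 1  # the system proposed a link that is wrong
            if gold != nil:
                fn += 1  # a gold link that the system did not reproduce
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy links for illustration; not actual AIDA or Spotlight output.
gold = ['Tokyo_Stock_Exchange', 'Japan', 'NIL', 'G7']
system = ['Tokyo', 'Japan', 'NIL', 'NIL']
print(evaluate_linking(gold, system))
```

Under this convention a system that often answers NIL loses recall but not precision, which is relevant when comparing the two systems.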

**Question 3c** What is the F1-score per system? Which system performs better? Is that also the better system in terms of precision and recall? For each system, which of the two (precision or recall) is higher, and what does that mean (hint: think of NIL entities)? (5 points)

In [None]:
# Your answer here...

### Exercise 4 (NEL): Qualitative analysis [Points: 15] 

**Exercise 4a** Check the entity disambiguation by AIDA against the gold entities on the document with identifier "http://aksw.org/N3/Reuters-128/82#char=0,1370" (write code to print the entity mentions, gold links and AIDA links). (2 points)

In [None]:
# Your code here...

You can see in this document that one of the mentions of "Tokyo" is disambiguated wrongly by AIDA as `Tokyo` (it should be `Tokyo_Stock_Exchange`). Knowing how AIDA works, what would be your explanation for this error? (4 points)

In [None]:
# Your answer here...

**Exercise 4b** Check the entity disambiguation by Spotlight against the gold entities on the document "http://aksw.org/N3/Reuters-128/36#char=0,1146" (write code to print the entity mentions, gold links and Spotlight links). (2 points)

In [None]:
# Your code here...

You can see in this document that the mention of "Group of Seven" is disambiguated wrongly by Spotlight as `G8` (it should be `G7`). Knowing how Spotlight works, what would be your explanation for this error? (4 points)

In [None]:
# Your answer here...

**Question 4c** In the document with identifier "http://aksw.org/N3/Reuters-128/67#char=0,1627":
- both systems correctly decide that "Michel Dufour" is a `NIL` entity with no representation in the English Wikipedia. 
- however, Spotlight later decides that "Dufour" refers to `Guillaume-Henri_Dufour`

How would you help Spotlight fix this error? (Hint: think of how you would know that "Dufour" is a NIL entity in that document) (3 points)

In [None]:
# Your answer here...
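One direction the hint suggests is to use the document context: a short mention that is part of an earlier, longer mention in the same document probably refers to the same entity. The sketch below illustrates that heuristic only; it is not how Spotlight works, and the function name `propagate_links` and the token-level matching rule are assumptions.

```python
def propagate_links(mentions, links):
    """Let a short mention inherit the link of an earlier, longer mention containing it."""
    resolved = []
    for index, (mention, link) in enumerate(zip(mentions, links)):
        for earlier in range(index):
            longer = mentions[earlier]
            # Token-level containment, e.g. "Dufour" inside "Michel Dufour".
            if mention != longer and mention in longer.split():
                link = resolved[earlier]
                break
        resolved.append(link)
    return resolved


# Mentions in document order with the links described in the question.
mentions = ['Michel Dufour', 'Dufour']
spotlight_links = ['NIL', 'Guillaume-Henri_Dufour']
print(propagate_links(mentions, spotlight_links))
```

A heuristic like this can also introduce errors, for example when two different people in one document share a last name, so it would need to be evaluated on the full dataset.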

## End of this notebook