# Lab4-Assignment about Named Entity Recognition and Classification

This notebook describes the assignment of Lab 4 of the text mining course. We assume you have succesfully completed Lab1, Lab2 and Lab3 as welll. Especially Lab2 is important for completing this assignment.

**Learning goals**
* going from linguistic input format to representing it in a feature space
* working with pretrained word embeddings
* train a supervised classifier (SVM)
* evaluate a supervised classifier (SVM)
* learn how to interpret the system output and the evaluation results
* be able to propose future improvements based on the observed results


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek vossen

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 point] a) Load the CoNLL-2003 training data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points]  -a list of dictionaries representing the features for each training instances, e..g,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] -the NERC labels associated with each training instance, e.g.,
    dictionaries, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [1]:
from nltk.corpus.reader import ConllCorpusReader
### Adapt the path to point to the CONLL2003 folder on your local machine
train = ConllCorpusReader('CONLL2003/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
        'words': token,
        'pos': pos
    }
    training_features.append(a_dict)
    training_gold_labels.append(ne_label)

print(f'a dictionary of features for each training instance: {training_features[:10]} \n')
print(f'labels for each training instance: {training_gold_labels[:10]}')



a dictionary of features for each training instance: [{'words': 'EU', 'pos': 'NNP'}, {'words': 'rejects', 'pos': 'VBZ'}, {'words': 'German', 'pos': 'JJ'}, {'words': 'call', 'pos': 'NN'}, {'words': 'to', 'pos': 'TO'}, {'words': 'boycott', 'pos': 'VB'}, {'words': 'British', 'pos': 'JJ'}, {'words': 'lamb', 'pos': 'NN'}, {'words': '.', 'pos': '.'}, {'words': 'Peter', 'pos': 'NNP'}] 

labels for each training instance: ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'B-PER']


In [2]:
### Adapt the path to point to the CONLL2003 folder on your local machine
test = ConllCorpusReader('CONLL2003/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []
for token, pos, ne_label in test.iob_words():
    a_dict = {
        'words': token,
        'pos': pos
    }
    test_features.append(a_dict)
    test_gold_labels.append(ne_label)

print(f'a dictionary of features for each test instance: {test_features[:10]} \n')
print(f'labels for each test instance: {test_gold_labels[:10]}')


a dictionary of features for each test instance: [{'words': 'SOCCER', 'pos': 'NN'}, {'words': '-', 'pos': ':'}, {'words': 'JAPAN', 'pos': 'NNP'}, {'words': 'GET', 'pos': 'VB'}, {'words': 'LUCKY', 'pos': 'NNP'}, {'words': 'WIN', 'pos': 'NNP'}, {'words': ',', 'pos': ','}, {'words': 'CHINA', 'pos': 'NNP'}, {'words': 'IN', 'pos': 'IN'}, {'words': 'SURPRISE', 'pos': 'DT'}] 

labels for each test instance: ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O']


**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data is balanced (equal amount of instances for each NERC label) and to what extent the training and test data differ?

Tip: you can use the following `Counter` functionality to generate frequency list of a list:

In [3]:
from collections import Counter 

# Check for instances in train and test set
train_instances = 0
test_instances = 0
for item in train.iob_words():
    train_instances += 1
for item in test.iob_words():
    test_instances += 1

print(f'Train instances = {train_instances} \nTest instances = {test_instances}')


Train instances = 203621 
Test instances = 46435


In [4]:
# Frequency distribution of the NERC labels
freq_labels_train = Counter(training_gold_labels)
freq_labels_test = Counter(test_gold_labels)

print(f'Frequency distribution training set: \n{freq_labels_train} \n')
print(f'Frequency distribution test set: \n{freq_labels_test}')

Frequency distribution training set: 
Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321, 'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438, 'I-LOC': 1157, 'I-MISC': 1155}) 

Frequency distribution test set: 
Counter({'O': 38323, 'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617, 'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702, 'I-LOC': 257, 'I-MISC': 216})


Please write something about the distribution here!!

**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: You’ve concatenated train and test into one list and then you’ve applied the DictVectorizer.
The order of the rows is maintained. You can hence use an index (number of training instances) to split the_array back into train and test. Do NOT use: `
from sklearn.model_selection import train_test_split` here.


In [5]:
from sklearn.feature_extraction import DictVectorizer

In [6]:
vec = DictVectorizer()
merged_features = training_features + test_features
dict_list = vec.fit_transform(merged_features)

train_list = dict_list[:train_instances]
test_list = dict_list[train_instances:]

**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
The train (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

In [7]:
from sklearn import svm
from sklearn.metrics import classification_report

In [8]:
lin_clf = svm.LinearSVC()

In [9]:
# Fit model
lin_clf.fit(train_list, training_gold_labels)

LinearSVC()

In [10]:
pred = lin_clf.predict(test_list)

In [11]:
# Create report for labels
report = classification_report(test_gold_labels, pred)
print(report)

              precision    recall  f1-score   support

       B-LOC       0.81      0.78      0.79      1668
      B-MISC       0.78      0.66      0.72       702
       B-ORG       0.79      0.52      0.63      1661
       B-PER       0.86      0.44      0.58      1617
       I-LOC       0.62      0.53      0.57       257
      I-MISC       0.57      0.59      0.58       216
       I-ORG       0.70      0.47      0.56       835
       I-PER       0.33      0.87      0.48      1156
           O       0.98      0.98      0.98     38323

    accuracy                           0.92     46435
   macro avg       0.72      0.65      0.65     46435
weighted avg       0.94      0.92      0.92     46435



* Describe what labels it performs the best / worst on

**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 2d. Generate a classification report and compare the results with the classifier you built in 2d.**

In [12]:
import gensim
##### Adapt the path to point to your local copy of the Google embeddings model
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

train_input_vectors=[]
labels=[]
for token, pos, ne_label in train.iob_words():
    
    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        train_input_vectors.append(vector)


In [13]:
test_input_vectors=[]
labels=[]
for token, pos, ne_label in test.iob_words():
    
    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        test_input_vectors.append(vector)

In [14]:
lin_clf_embed = svm.LinearSVC()
lin_clf_embed.fit(train_input_vectors, training_gold_labels)

LinearSVC()

In [15]:
embed_pred = lin_clf_embed.predict(test_input_vectors)

embed_report = classification_report(test_gold_labels, embed_pred)
print(embed_report)

              precision    recall  f1-score   support

       B-LOC       0.76      0.80      0.78      1668
      B-MISC       0.72      0.70      0.71       702
       B-ORG       0.69      0.64      0.66      1661
       B-PER       0.75      0.67      0.71      1617
       I-LOC       0.51      0.42      0.46       257
      I-MISC       0.60      0.54      0.57       216
       I-ORG       0.48      0.33      0.39       835
       I-PER       0.59      0.50      0.54      1156
           O       0.97      0.99      0.98     38323

    accuracy                           0.93     46435
   macro avg       0.68      0.62      0.64     46435
weighted avg       0.92      0.93      0.92     46435



In [16]:
# For comparison the former report

print(report)

              precision    recall  f1-score   support

       B-LOC       0.81      0.78      0.79      1668
      B-MISC       0.78      0.66      0.72       702
       B-ORG       0.79      0.52      0.63      1661
       B-PER       0.86      0.44      0.58      1617
       I-LOC       0.62      0.53      0.57       257
      I-MISC       0.57      0.59      0.58       216
       I-ORG       0.70      0.47      0.56       835
       I-PER       0.33      0.87      0.48      1156
           O       0.98      0.98      0.98     38323

    accuracy                           0.92     46435
   macro avg       0.72      0.65      0.65     46435
weighted avg       0.94      0.92      0.92     46435



## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the features representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels

In [17]:
import pandas

In [18]:
##### Adapt the path to point to your local copy of NERC_datasets
path = 'ner_dataset.csv'
kaggle_dataset = pandas.read_csv(path, error_bad_lines=False, encoding='latin1')



  exec(code_obj, self.user_global_ns, self.user_ns)


In [19]:
kaggle_dataset = kaggle_dataset.fillna(method="ffill")

kaggle_dataset.head(5)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O


In [20]:
df_train = kaggle_dataset[:100000]
df_test = kaggle_dataset[100000:150000]
print(len(df_train), len(df_test))

100000 50000


In [21]:
# Create dict with features for training and test

dict_nerc_train = []
nerc_train_label_alternate = []

for word, pos, tag in zip(df_train.Word, df_train.POS, df_train.Tag):
    a_dict = {
        'Words': word,
        'Pos': pos
    }
    dict_nerc_train.append(a_dict)
    nerc_train_label_alternate.append(tag)

dict_nerc_test = []
nerc_test_label_alternate = []

for word, pos, tag in zip(df_test.Word, df_test.POS, df_test.Tag):
    a_dict = {
        'Words': word,
        'Pos': pos
    }
    dict_nerc_test.append(a_dict)
    nerc_test_label_alternate.append(tag)

In [22]:
dict_nerc_train[:10]

[{'Words': 'Thousands', 'Pos': 'NNS'},
 {'Words': 'of', 'Pos': 'IN'},
 {'Words': 'demonstrators', 'Pos': 'NNS'},
 {'Words': 'have', 'Pos': 'VBP'},
 {'Words': 'marched', 'Pos': 'VBN'},
 {'Words': 'through', 'Pos': 'IN'},
 {'Words': 'London', 'Pos': 'NNP'},
 {'Words': 'to', 'Pos': 'TO'},
 {'Words': 'protest', 'Pos': 'VB'},
 {'Words': 'the', 'Pos': 'DT'}]

In [23]:
# Function that processes the data into sentences
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except:
            return None

In [24]:
getter_train = SentenceGetter(df_train)
sentences_train = getter_train.sentences

getter_test = SentenceGetter(df_test)
sentences_test = getter_test.sentences

In [25]:
# input is a sentence as a structure show above
#and and ith word from the sentence to return the features for that word

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    # data structure consisting of a feature name and value for the token
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(), # lower case variant of the token
        'word[-3:]': word[-3:], #suffix of 3 characters
        'word[-2:]': word[-2:], #suffix of 2 characters
        'word.isupper()': word.isupper(), # initial captial
        'word.istitle()': word.istitle(), # all words ini caps
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2], #first two characters of the PoS Tag
    }
    if i > 0:
        # adding features for the word based on the previous word
        word1 = sent[i-1][0] # previous word
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True # Beginning of sentence as a feature

    if i < len(sent)-1:
        # adding features for the word based on the next word
        word1 = sent[i+1][0] # next word
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True # end of sentence as a feature

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [26]:
nerc_features_train = [sent2features(s) for s in sentences_train]
nerc_labels_train = [sent2labels(s) for s in sentences_train]

nerc_features_test = [sent2features(s) for s in sentences_test]
nerc_labels_test = [sent2labels(s) for s in sentences_test]

In [27]:
sentences_test

[[('"', '``', 'O'),
  ('Death', 'NN', 'O'),
  ('to', 'TO', 'O'),
  ('America', 'NNP', 'B-geo'),
  ('"', '``', 'I-geo'),
  ('marched', 'VBD', 'O'),
  ('through', 'IN', 'O'),
  ('streets', 'NNS', 'O'),
  ('Wednesday', 'NNP', 'B-tim'),
  (',', ',', 'O'),
  ('smashing', 'VBG', 'O'),
  ('cars', 'NNS', 'O'),
  (',', ',', 'O'),
  ('damaging', 'JJ', 'O'),
  ('shops', 'NNS', 'O'),
  ('and', 'CC', 'O'),
  ('throwing', 'VBG', 'O'),
  ('stones', 'NNS', 'O'),
  ('at', 'IN', 'O'),
  ('U.S.', 'NNP', 'B-geo'),
  ('troops', 'NNS', 'O'),
  ('.', '.', 'O')],
 [('Protests', 'NNS', 'O'),
  ('erupted', 'VBD', 'O'),
  ('after', 'IN', 'O'),
  ('Newsweek', 'NNP', 'O'),
  ('magazine', 'NN', 'O'),
  ('reported', 'VBD', 'O'),
  ('that', 'IN', 'O'),
  ('interrogators', 'NNS', 'O'),
  ('at', 'IN', 'O'),
  ('Guantanamo', 'NNP', 'B-geo'),
  ('placed', 'VBD', 'O'),
  ('copies', 'NNS', 'O'),
  ('of', 'IN', 'O'),
  ('the', 'DT', 'O'),
  ('Muslim', 'NNP', 'B-org'),
  ('holy', 'JJ', 'O'),
  ('book', 'NN', 'O'),
  ('on', '

In [44]:
# TODO fix dictvectorizer
"""Current issue that arises with DictVectorizer is that the output of sentence getter and word2feature
   return a list instead of the required dict, hence an alternate variant is used which has been created
   like in task 1 with words and pos."""

vec = DictVectorizer()
nerc_merged_features = dict_nerc_train + dict_nerc_test
nerc_dict_list = vec.fit_transform(nerc_merged_features)

nerc_train_list = dict_list[:len(df_train)]
nerc_test_list = dict_list[len(df_train):len(df_train + df_test)]

In [45]:
print(nerc_train_list.shape)
print(nerc_test_list.shape)

(100000, 27361)
(50000, 27361)


**[4 points] b. Train and evaluate the model and provide the classification report:**
* use the SVM to predict NERC labels on the test data
* evaluate the performance of the SVM on the test data

Analyze the performance per NERC label.

In [48]:
nerc_lin_clf = svm.LinearSVC()

In [49]:
# Fit model
"""Using the alternate variant of the dictionary and labels as the sentencegetter and word2feature
   do not work properly for this SVM."""
nerc_lin_clf.fit(nerc_train_list, nerc_train_label_alternate)
nerc_pred = nerc_lin_clf.predict(nerc_test_list)

In [50]:
# Create report for labels
# TODO : fix zero division error
nerc_report = classification_report(nerc_test_label_alternate, nerc_pred)
print(nerc_report)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00        16
       B-eve       0.00      0.00      0.00        27
       B-geo       0.04      0.00      0.00      1827
       B-gpe       0.00      0.00      0.00       795
       B-nat       0.00      0.00      0.00        16
       B-org       0.00      0.00      0.00       923
       B-per       0.02      0.00      0.00       851
       B-tim       0.02      0.00      0.00       966
       I-art       0.00      0.00      0.00         7
       I-eve       0.00      0.00      0.00        14
       I-geo       0.00      0.00      0.00       387
       I-gpe       0.00      0.00      0.00         8
       I-nat       0.00      0.00      0.00        10
       I-org       0.03      0.00      0.00       710
       I-per       0.03      0.00      0.00       863
       I-tim       0.06      0.00      0.01       274
           O       0.85      0.99      0.91     42306

    accuracy              

  _warn_prf(average, modifier, msg_start, len(result))


## End of this notebook