# Lab4-Assignment about Named Entity Recognition and Classification

This notebook describes the assignment of Lab 4 of the text mining course. We assume you have successfully completed Lab1, Lab2 and Lab3 as well. Lab2 is especially important for completing this assignment.

**Learning goals**
* going from linguistic input format to representing it in a feature space
* working with pretrained word embeddings
* train a supervised classifier (SVM)
* evaluate a supervised classifier (SVM)
* learn how to interpret the system output and the evaluation results
* be able to propose future improvements based on the observed results


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek Vossen.

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 point] a) Load the CoNLL-2003 training data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points]  -a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] -the NERC labels associated with each training instance, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [1]:
from nltk.corpus.reader import ConllCorpusReader
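### Adapt the path to point to the CONLL2003 folder on your local machine
# The files have four columns: word, POS tag, syntactic chunk, NE tag.
# Ignoring the syntactic chunk and naming the NE column 'chunk' makes iob_words() return the NE tag.
train = ConllCorpusReader('CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])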
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
        # add features
        "words": token,
        "pos": pos
    }
    # append dict of words and their pos to training_features
    training_features.append(a_dict)
    # append labels of words to training_gold_labels
    training_gold_labels.append(ne_label)
   

In [26]:
### Adapt the path to point to the CONLL2003 folder on your local machine
test = ConllCorpusReader('CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []
for token, pos, ne_label in test.iob_words():
    a_dict = {
        "words": token,
        "pos": pos
    }
    # append dict of words and their pos to test_features
    test_features.append(a_dict)
    # append labels of words to test_gold_labels
    test_gold_labels.append(ne_label)

**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data is balanced (equal amount of instances for each NERC label) and to what extent the training and test data differ?

Tip: you can use `Counter` from the `collections` module to generate a frequency list from a list.
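As a minimal sketch of that `Counter` usage, the snippet below counts a small, made-up list of labels. The list `toy_labels` is purely illustrative and not taken from the data.

```python
from collections import Counter

# Hypothetical toy labels, only to illustrate the Counter API
toy_labels = ["O", "B-PER", "I-PER", "O", "O", "B-LOC"]

freq = Counter(toy_labels)

# most_common() returns (label, count) pairs sorted by descending count
for label, count in freq.most_common():
    print(label, count)
```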

In [8]:
# num of training instances
print("Num of training instances:", len(training_features))
# num of test instances
print("Num of test instances:", len(test_features))

Num of training instances: 203621
Num of test instances: 46435


In [9]:
from collections import Counter 
# Distribution of labels in both training and test sets
print("Labels distribution of training set:", Counter(training_gold_labels))
print()
print("Labels distribution of test set:", Counter(test_gold_labels))


Labels distribution of training set: Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321, 'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438, 'I-LOC': 1157, 'I-MISC': 1155})

Labels distribution of test set: Counter({'O': 38323, 'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617, 'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702, 'I-LOC': 257, 'I-MISC': 216})


### Comparison

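We observe that most tokens get the label "O" in both datasets: 169578 of 203621 tokens (about 83%) in the training set and 38323 of 46435 tokens (about 83%) in the test set. Neither dataset is therefore balanced. The training set is about 4.4 times larger than the test set, so the counts of the entity labels are higher as well: in the training set they range between 1155 ('I-MISC') and 7140 ('B-LOC'), whereas in the test set they range between 216 ('I-MISC') and 1668 ('B-LOC'). The labels are ranked almost identically by frequency in both datasets. The only difference is that 'B-PER' is more frequent than 'B-ORG' in the training set (6600 vs. 6321), while the order is reversed in the test set (1617 vs. 1661).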
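Raw counts are hard to compare across datasets of different size, so it can help to look at relative frequencies. The sketch below uses the label counts printed above; the helper `label_shares` is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

# Label counts copied from the output above
train_counts = Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321,
                        'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438,
                        'I-LOC': 1157, 'I-MISC': 1155})
test_counts = Counter({'O': 38323, 'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617,
                       'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702,
                       'I-LOC': 257, 'I-MISC': 216})

def label_shares(counts):
    """Return each label's share of all tokens, as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_shares = label_shares(train_counts)
test_shares = label_shares(test_counts)

# Print the labels from most to least frequent in the training set
for label in sorted(train_shares, key=train_shares.get, reverse=True):
    print(f"{label:7s} train {train_shares[label]:6.1%}   test {test_shares[label]:6.1%}")
```

If the shares of a label are close in both columns, the two datasets have a similar label distribution even though their sizes differ.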

**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: You’ve concatenated train and test into one list and then you’ve applied the DictVectorizer.
The order of the rows is maintained. You can hence use an index (number of training instances) to split the_array back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.
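To illustrate why train and test are vectorized together and then split by index, here is a simplified pure-Python stand-in for the one-hot encoding that `DictVectorizer` applies to string-valued features. The helper `one_hot` and the toy instances are hypothetical; the real `DictVectorizer` returns a sparse matrix and handles more cases than this sketch.

```python
def one_hot(instances):
    """Encode string-valued feature dicts as 0/1 rows over a shared vocabulary."""
    # One column per distinct "feature=value" pair seen in ANY instance
    vocab = sorted({f"{k}={v}" for inst in instances for k, v in inst.items()})
    index = {name: i for i, name in enumerate(vocab)}
    rows = []
    for inst in instances:
        row = [0] * len(vocab)
        for k, v in inst.items():
            row[index[f"{k}={v}"]] = 1
        rows.append(row)
    return vocab, rows

# Hypothetical toy data in the same format as the feature dicts above
toy_train = [{"words": "EU", "pos": "NNP"}, {"words": "rejects", "pos": "VBZ"}]
toy_test = [{"words": "Germany", "pos": "NNP"}]

vocab, rows = one_hot(toy_train + toy_test)

# The row order is preserved, so the number of training instances is the split point
n_train = len(toy_train)
X_train, X_test = rows[:n_train], rows[n_train:]

print(vocab)
print(X_train)
print(X_test)
```

Because the vocabulary is built from both lists, the train and test rows share the same columns, which is what the classifier needs at prediction time.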


In [10]:
from sklearn.feature_extraction import DictVectorizer

In [11]:
vec = DictVectorizer()
the_array = vec.fit_transform(training_features + test_features)

# Split back the training and test features
X = the_array[:len(training_features)]
test_set = the_array[len(training_features):]

**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
The train (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

In [12]:
from sklearn import svm

In [14]:
lin_clf = svm.LinearSVC()

In [15]:
##### [ YOUR CODE SHOULD GO HERE ]
lin_clf.fit(X, training_gold_labels) # your code here

LinearSVC()

In [16]:
pred = lin_clf.predict(test_set)
pred

array(['O', 'O', 'I-PER', ..., 'O', 'B-PER', 'O'], dtype='<U6')

In [17]:
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

# We wrap pred and y in a list because the functions from seqeval.metrics expect a list of tag sequences (one list per sentence).
pred = [list(pred)]
y = [test_gold_labels]

print("precision-score: {:.1%}".format(precision_score(y, pred)))
print("recall-score: {:.1%}".format(recall_score(y, pred)))
print("F1-score: {:.1%}".format(f1_score(y, pred)))

precision-score: 58.2%
recall-score: 66.3%
F1-score: 62.0%
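The seqeval scores are computed over complete entities rather than single tokens, which is why they can differ noticeably from token-level scores. A simplified sketch of the idea follows, assuming that every entity starts with a `B-` tag. The helper `extract_entities` and the toy sequences are hypothetical, and this is not seqeval's actual implementation, which handles more cases.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in an IOB tag sequence."""
    spans = set()
    start, ent_type = None, None
    # The extra "O" at the end closes an entity that runs to the last token
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == ent_type
        if start is not None and not continues:
            spans.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if prefix == "B":
            start, ent_type = i, label
    return spans

# Hypothetical toy sequences: the prediction misses the second token of the person name
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]

gold_spans = extract_entities(gold)
pred_spans = extract_entities(pred)
correct = gold_spans & pred_spans  # an entity counts only if type and boundaries match

entity_precision = len(correct) / len(pred_spans)
entity_recall = len(correct) / len(gold_spans)
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(entity_precision, entity_recall, token_accuracy)
```

In this toy case three of four tokens are correct, but only one of two entities is, so the entity-level scores are lower than the token-level accuracy.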


In [18]:
import sklearn.metrics
# classification_report sorts the labels alphabetically. Passing target_names=list(set(...))
# assigns the names in arbitrary order and mislabels the rows, so we leave it out.
print(sklearn.metrics.classification_report(test_gold_labels, pred[0]))

              precision    recall  f1-score   support

       B-LOC       0.81      0.78      0.79      1668
      B-MISC       0.78      0.66      0.72       702
       B-ORG       0.79      0.52      0.63      1661
       B-PER       0.86      0.44      0.58      1617
       I-LOC       0.62      0.53      0.57       257
      I-MISC       0.57      0.59      0.58       216
       I-ORG       0.70      0.47      0.56       835
       I-PER       0.33      0.87      0.48      1156
           O       0.98      0.98      0.98     38323

    accuracy                           0.92     46435
   macro avg       0.72      0.65      0.65     46435
weighted avg       0.94      0.92      0.92     46435



# Observation
The SVM classifier performs well on the labels O, B-LOC and B-MISC, with f1-scores of 98%, 79% and 72%. For instance, B-LOC has an f1-score of 79%, which lies between its precision (81%) and recall (78%), with a support of 1668. The high score on O is probably related to its frequency: about 83% of the training tokens carry this label. The remaining labels (B-ORG, B-PER, I-LOC, I-MISC, I-ORG and I-PER) perform worse, with f1-scores between 48% and 63%. For several of them there is a clear trade-off between precision and recall. B-PER, for instance, has a precision of 86% but a recall of 44%, whereas I-PER has a recall of 87% but a precision of only 33%. A plausible explanation is that the model only sees the token and its POS tag, without any context, so it cannot reliably tell whether a name starts or continues an entity.
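To make the precision and recall trade-off discussed above concrete, the sketch below computes the three scores per label by hand from gold and predicted tags. The function `per_label_scores` and the toy sequences are hypothetical; `classification_report` does this bookkeeping for you.

```python
from collections import Counter

def per_label_scores(gold, pred):
    """Return {label: (precision, recall, f1)} for token-level predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g should have been predicted but was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Hypothetical toy data
gold = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "I-PER", "B-LOC", "O"]

scores = per_label_scores(gold, pred)
for label, (p, r, f1) in scores.items():
    print(f"{label:6s} precision {p:.2f}  recall {r:.2f}  f1 {f1:.2f}")
```

A label with high precision and low recall is predicted rarely but correctly; the reverse pattern means the label is predicted too often.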

**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 1d. Generate a classification report and compare the results with the classifier you built in 1d.**

In [20]:
# your code here
import gensim
##### Adapt the path to point to your local copy of the Google embeddings model
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('C:/Users/denis/AI-year3/text_mining/ba-text-mining-master/lab_sessions/lab2/models/GoogleNews-vectors-negative300.bin.gz', binary=True)  

In [21]:
input_vectors=[]
labels=[]
for token, pos, ne_label in train.iob_words():
    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector = word_embedding_model[token]
        else:
            vector = [0]*300
        input_vectors.append(vector)
        labels.append(ne_label)

In [22]:
# Recreate the model and fit the embeddings
lin_clf = svm.LinearSVC()
lin_clf.fit(input_vectors, labels)

LinearSVC()

In [27]:
test_input_vectors=[]
test_labels=[]
for token, pos, ne_label in test.iob_words():
    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector = word_embedding_model[token]
        else:
            vector = [0]*300
        test_input_vectors.append(vector)
        test_labels.append(ne_label)
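The lookup with a zero-vector fallback is repeated for train and test, so it could be factored into a helper that also reports how many tokens had no embedding. The sketch below is an assumption-laden illustration: `tokens_to_vectors` is a hypothetical name, and a plain dict with 3-dimensional vectors stands in for the embedding model. It relies only on `token in model` and `model[token]`, the same two operations used in the cells above.

```python
def tokens_to_vectors(tokens, model, dim=300):
    """Look up each token in `model`; unknown tokens get an all-zero vector."""
    vectors, n_unknown = [], 0
    for token in tokens:
        if token in model:
            vectors.append(model[token])
        else:
            vectors.append([0.0] * dim)
            n_unknown += 1
    return vectors, n_unknown

# Hypothetical stand-in for the embedding model
toy_model = {"EU": [0.1, 0.2, 0.3], "rejects": [0.0, -0.1, 0.4]}

vectors, n_unknown = tokens_to_vectors(["EU", "rejects", "Qwxz"], toy_model, dim=3)
print(vectors)
print(f"{n_unknown} of {len(vectors)} tokens had no embedding")
```

All tokens that fall back to the zero vector look identical to the classifier, so the number of unknown tokens is worth checking before interpreting the results.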

In [29]:
pred = lin_clf.predict(test_input_vectors)
print(pred)

['O' 'O' 'B-LOC' ... 'O' 'B-PER' 'O']


In [30]:
print(sklearn.metrics.classification_report(test_labels, pred))

              precision    recall  f1-score   support

       B-LOC       0.76      0.80      0.78      1668
      B-MISC       0.72      0.70      0.71       702
       B-ORG       0.69      0.64      0.66      1661
       B-PER       0.75      0.67      0.71      1617
       I-LOC       0.51      0.42      0.46       257
      I-MISC       0.60      0.54      0.57       216
       I-ORG       0.48      0.33      0.39       835
       I-PER       0.59      0.50      0.54      1156
           O       0.97      0.99      0.98     38323

    accuracy                           0.93     46435
   macro avg       0.68      0.62      0.64     46435
weighted avg       0.92      0.93      0.92     46435



# Observation
The SVM classifier trained on the embeddings behaves somewhat differently. It performs well on O (f1-score 98%) and on the labels B-LOC (78%), B-MISC (71%) and B-PER (71%), which mark the first token of an entity. For instance, B-LOC has an f1-score of 78%, which lies between its precision (76%) and recall (80%), with a support of 1668. The labels that mark the continuation of an entity score lower, in particular I-ORG (39%), I-LOC (46%) and I-PER (54%). Unlike the one-hot model, this model shows no pronounced trade-off between precision and recall: per label the two values differ by at most 0.15 (I-ORG). The overall scores of the two models are close, with an accuracy of 0.93 vs. 0.92 and a macro-averaged f1-score of 0.64 vs. 0.65. This suggests that the embeddings help the classifier to treat similar words alike, giving more consistent predictions per label, although they do not yield a clear overall improvement over the one-hot encodings.

## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the features representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels

In [125]:
import pandas as pd

In [126]:
##### Adapt the path to point to your local copy of NERC_datasets
path = 'C:/Users/denis/AI-year3/text_mining/ba-text-mining-master/text-mining/lab_sessions/lab4/ner_v2.csv'
# pandas was imported as pd; on_bad_lines requires pandas >= 1.3 (older versions: error_bad_lines=False)
kaggle_dataset = pd.read_csv(path, on_bad_lines='warn')



b'Skipping line 281837: expected 25 fields, saw 34\n'


In [127]:
len(kaggle_dataset)

1050795

In [128]:
df_train = kaggle_dataset[:100000]
df_test = kaggle_dataset[100000:120000]
print(len(df_train), len(df_test))

100000 20000


In [129]:
train_labels = df_train['tag'].values.tolist()
test_labels = df_test['tag'].values.tolist()

In [138]:
print("Labels distribution of training set:", Counter(train_labels))

Labels distribution of training set: Counter({'O': 84725, 'B-geo': 3303, 'B-org': 1876, 'I-per': 1846, 'B-tim': 1823, 'B-gpe': 1740, 'B-per': 1668, 'I-org': 1470, 'I-geo': 690, 'I-tim': 549, 'B-art': 75, 'B-eve': 53, 'I-gpe': 51, 'I-eve': 47, 'I-art': 43, 'B-nat': 30, 'I-nat': 11})


In [130]:
df_train_to_list = []
for indx, row in df_train.iterrows(): 
    a_dict = {
        # select only useful columns 
        "words": row['word'],
        "prev-iob": row['prev-iob'], 
        "prev-pos": row['prev-pos'],
        "pos": row['pos'], 
        "next-pos": row['next-pos']
    }
    df_train_to_list.append(a_dict)

In [131]:
df_test_to_list = []
for indx, row in df_test.iterrows(): 
    a_dict = {
        # select only useful columns 
        "words": row['word'],
        "prev-iob": row['prev-iob'], 
        "prev-pos": row['prev-pos'],
        "pos": row['pos'], 
        "next-pos": row['next-pos']
    }
    df_test_to_list.append(a_dict)
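The two loops above differ only in the DataFrame they read, so the feature selection could live in one helper. The sketch below is hypothetical: `FEATURE_COLUMNS`, `row_to_features` and the toy rows (plain dicts instead of DataFrame rows) are introduced only for illustration.

```python
# Mapping from feature name to the column it is read from
FEATURE_COLUMNS = {
    "words": "word",
    "prev-iob": "prev-iob",
    "prev-pos": "prev-pos",
    "pos": "pos",
    "next-pos": "next-pos",
}

def row_to_features(row):
    """Build the feature dict for one row (any mapping from column name to value)."""
    return {feature: row[column] for feature, column in FEATURE_COLUMNS.items()}

# Hypothetical rows with made-up values
toy_rows = [
    {"word": "in", "pos": "IN", "prev-pos": "VBD", "next-pos": "NNP",
     "prev-iob": "O", "tag": "O"},
    {"word": "London", "pos": "NNP", "prev-pos": "IN", "next-pos": "TO",
     "prev-iob": "O", "tag": "B-geo"},
]

toy_features = [row_to_features(row) for row in toy_rows]
toy_labels = [row["tag"] for row in toy_rows]
print(toy_features)
print(toy_labels)
```

Since the helper only indexes the row by column name, it should also accept the rows yielded by `df_train.iterrows()`, which keeps the train and test feature sets identical by construction.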

In [132]:
vec = DictVectorizer()

the_array = vec.fit_transform(df_train_to_list + df_test_to_list)

# Split back the training and test features
X = the_array[:len(df_train_to_list)]
test_set = the_array[len(df_train_to_list):]

**[4 points] b. Train and evaluate the model and provide the classification report:**
* use the SVM to predict NERC labels on the test data
* evaluate the performance of the SVM on the test data

Analyze the performance per NERC label.

In [133]:
lin_clf = svm.LinearSVC()

In [134]:
lin_clf.fit(X, train_labels)

LinearSVC()

In [135]:
pred = lin_clf.predict(test_set)
pred

array(['O', 'O', 'O', ..., 'O', 'O', 'O'], dtype='<U5')

In [136]:
print(sklearn.metrics.classification_report(test_labels, pred))

              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00         4
       B-eve       0.00      0.00      0.00         0
       B-geo       0.80      0.86      0.83       741
       B-gpe       0.97      0.93      0.95       296
       B-nat       1.00      0.62      0.77         8
       B-org       0.72      0.59      0.65       397
       B-per       0.81      0.80      0.81       333
       B-tim       0.99      0.78      0.87       393
       I-geo       0.98      0.96      0.97       156
       I-gpe       1.00      1.00      1.00         2
       I-nat       1.00      1.00      1.00         4
       I-org       0.95      0.93      0.94       321
       I-per       0.93      0.99      0.96       319
       I-tim       0.98      0.87      0.92       108
           O       0.99      0.99      0.99     16918

    accuracy                           0.97     20000
   macro avg       0.81      0.75      0.78     20000
weighted avg       0.97   



Answer: 

For our model we only selected the columns of the pandas dataframe that we thought were useful, namely 'word', 'prev-iob', 'prev-pos', 'pos' and 'next-pos'. The columns 'prev-iob', 'prev-pos' and 'next-pos' were selected since they are likely to be predictive of whether a token is part of a named entity. We decided not to include 'prev-prev-iob' since most of the named entities are no longer than 2 tokens.

As for the performance of the model, the labels B-art and B-eve both have a precision, recall and f1-score of 0.00. The test set does not contain any instances of B-eve (support 0), so this row only appears because the model predicted the tag at least once. As for B-art, the poor performance is probably due to a lack of data: the training set contains only 75 instances of this tag and the test set only 4.
The labels I-gpe and I-nat have a precision and recall of 1.00, which means that the model correctly identified all instances of these tags. With a support of only 2 and 4, however, these scores say little about how well the model generalises.
The model performs well on the other tags, with f1-scores ranging from 0.77 to 0.99, with the exception of B-org, which has an f1-score of 0.65 and a recall of 0.59.
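The macro average in the report weights every label equally, so the two labels with an f1-score of 0.00 pull it down even though they are rare. A small sketch of that effect, using the rounded f1-scores from the report above (`macro_average` is a hypothetical helper name, and the results are approximate because the inputs are rounded):

```python
# f1-scores per label, copied from the classification report above
f1_scores = {
    "B-art": 0.00, "B-eve": 0.00, "B-geo": 0.83, "B-gpe": 0.95, "B-nat": 0.77,
    "B-org": 0.65, "B-per": 0.81, "B-tim": 0.87, "I-geo": 0.97, "I-gpe": 1.00,
    "I-nat": 1.00, "I-org": 0.94, "I-per": 0.96, "I-tim": 0.92, "O": 0.99,
}

def macro_average(scores):
    """Unweighted mean: every label counts equally, whatever its support."""
    return sum(scores.values()) / len(scores)

macro_all = macro_average(f1_scores)

# Leave out the labels with an f1-score of 0.00 to see how much they cost
nonzero = {label: f1 for label, f1 in f1_scores.items() if f1 > 0}
macro_nonzero = macro_average(nonzero)

print(f"macro f1 over all {len(f1_scores)} labels: {macro_all:.2f}")
print(f"macro f1 without the zero-score labels: {macro_nonzero:.2f}")
```

The first value reproduces the macro-averaged f1-score in the report, which indicates that the gap between the macro average and the accuracy is largely caused by a few rare labels.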

## End of this notebook