# Lab4-Assignment about Named Entity Recognition and Classification

This notebook describes the assignment of Lab 4 of the text mining course. We assume you have successfully completed Lab1, Lab2 and Lab3 as well. Lab2 is especially important for completing this assignment.

**Learning goals**
* going from a linguistic input format to a representation in a feature space
* working with pretrained word embeddings
* training a supervised classifier (SVM)
* evaluating a supervised classifier (SVM)
* learning how to interpret the system output and the evaluation results
* being able to propose future improvements based on the observed results


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek Vossen.

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 points] a) Load the CoNLL-2003 training data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points]  -a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] -the NERC labels associated with each training instance, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [71]:
from nltk.corpus.reader import ConllCorpusReader
from typing import List, Dict, Tuple, Union
### Adapt the path to point to the CONLL2003 folder on your local machine
train = ConllCorpusReader('nerc_datasets/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])
training_features: List[Dict[str,str]] = []
training_gold_labels: List[str] = []

for token, pos, ne_label in train.iob_words():
    a_dict: Dict[str, str] = {
       'words': token, 
       'pos': pos
    }
    training_features.append(a_dict)
    training_gold_labels.append(ne_label)

In [72]:
# train.COLUMN_TYPES

In [73]:
### Adapt the path to point to the CONLL2003 folder on your local machine
test = ConllCorpusReader('nerc_datasets/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features: List[Dict[str,str]] = []
test_gold_labels: List[str] = []

for token, pos, ne_label in test.iob_words():
    a_dict: Dict[str, str] = {
       'words': token, 
       'pos': pos
    }
    test_features.append(a_dict)
    test_gold_labels.append(ne_label)

**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data are balanced (equal number of instances for each NERC label) and to what extent the training and test data differ.

Tip: you can use the following `Counter` functionality to generate a frequency distribution from a list:

In [74]:
from collections import Counter 


num_train:int = len(training_features) # number of instances in the training set
num_test:int = len(test_features) # number of instances in the test set

freq_dist_train = Counter(training_gold_labels)
freq_dist_test = Counter(test_gold_labels)

freq_dist_train: Dict[str, int] = dict(sorted(freq_dist_train.items(), key=lambda item: item[1], reverse=True)) # The sorted frequency distribution dictionary of the training set (high to low)
freq_dist_test: Dict[str, int] = dict(sorted(freq_dist_test.items(), key=lambda item: item[1], reverse=True)) # The sorted frequency distribution dictionary of the test set (high to low)

print(f'Total instances TRAINING SET: {num_train}, Total instances TEST SET: {num_test}.')
print()

for i,j in zip(freq_dist_train, freq_dist_test):
    print(f'Label: {i}, Absolute frequency TRAINING SET: {freq_dist_train[i]}, Relative frequency TRAINING SET: {(freq_dist_train[i]/num_train)*100:.2f}%.')
    print(f'Label: {j}, Absolute frequency TEST SET: {freq_dist_test[j]}, Relative frequency TEST SET: {(freq_dist_test[j]/num_test)*100:.2f}%.')
    print()



Total instances TRAINING SET: 203621, Total instances TEST SET: 46435.

Label: O, Absolute frequency TRAINING SET: 169578, Relative frequency TRAINING SET: 83.28%.
Label: O, Absolute frequency TEST SET: 38323, Relative frequency TEST SET: 82.53%.

Label: B-LOC, Absolute frequency TRAINING SET: 7140, Relative frequency TRAINING SET: 3.51%.
Label: B-LOC, Absolute frequency TEST SET: 1668, Relative frequency TEST SET: 3.59%.

Label: B-PER, Absolute frequency TRAINING SET: 6600, Relative frequency TRAINING SET: 3.24%.
Label: B-ORG, Absolute frequency TEST SET: 1661, Relative frequency TEST SET: 3.58%.

Label: B-ORG, Absolute frequency TRAINING SET: 6321, Relative frequency TRAINING SET: 3.10%.
Label: B-PER, Absolute frequency TEST SET: 1617, Relative frequency TEST SET: 3.48%.

Label: I-PER, Absolute frequency TRAINING SET: 4528, Relative frequency TRAINING SET: 2.22%.
Label: I-PER, Absolute frequency TEST SET: 1156, Relative frequency TEST SET: 2.49%.

Label: I-ORG, Absolute frequency TRA

### Answer 1b
Both datasets are clearly imbalanced, with some labels having far fewer instances than others, but the distribution of NERC labels is very similar in the two sets. In both the training and test data, the large majority of instances are labeled 'O' (outside any named entity), with a relative frequency of 83.28% in the training set and 82.53% in the test set.

The next four labels ('B-LOC', 'B-PER', 'B-ORG', and 'I-PER') have similar relative frequencies, between 2.22% and 3.51% in the training set and between 2.49% and 3.59% in the test set, although 'B-PER' and 'B-ORG' swap ranks between the two sets. The least frequent labels ('I-ORG', 'B-MISC', 'I-LOC', and 'I-MISC') each account for roughly 0.47% to 1.82% of the instances in both datasets.

Despite slight differences in the label proportions, the overall distribution is consistent between the training and test sets. This matters because the classifier is then evaluated on a label distribution similar to the one it was trained on, so the test scores give a fair indication of how well the Named Entity Recognition (NER) model generalizes to comparable data.
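
To compare the two sets label by label, it can help to look up each label in both distributions rather than pairing the two sorted lists by rank, since the rank order is not identical in train and test. The following is a minimal sketch on toy label lists; the helper `relative_frequencies` and the toy data are hypothetical and not part of the assignment code.

```python
from collections import Counter
from typing import Dict, List


def relative_frequencies(labels: List[str]) -> Dict[str, float]:
    """Map each label to its share (in percent) of all instances."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}


# Toy label lists (hypothetical, not the CoNLL-2003 data)
toy_train = ['O', 'O', 'O', 'B-PER', 'I-PER', 'B-ORG', 'O', 'O']
toy_test = ['O', 'O', 'B-ORG', 'O']

train_rel = relative_frequencies(toy_train)
test_rel = relative_frequencies(toy_test)

# Iterate over the union of labels so that every row compares the same label
for label in sorted(set(train_rel) | set(test_rel)):
    print(f'{label:6s} train: {train_rel.get(label, 0.0):6.2f}%   '
          f'test: {test_rel.get(label, 0.0):6.2f}%')
```

With the notebook's data, the same helper could be called on `training_gold_labels` and `test_gold_labels`.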

**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: You’ve concatenated train and test into one list and then you’ve applied the DictVectorizer.
The order of the rows is maintained. You can hence use an index (number of training instances) to split the_array back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.
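
As an illustration of the tip, here is a small sketch on three toy instances. It shows that fitting the `DictVectorizer` on the concatenated list gives train and test the same columns, and that slicing by the number of training instances recovers the two parts. The `toy_*` names and the toy data are assumptions made for this example only.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy feature dictionaries (hypothetical, in the same format as the assignment)
toy_train_features = [
    {'words': 'EU', 'pos': 'NNP'},
    {'words': 'rejects', 'pos': 'VBZ'},
]
toy_test_features = [
    {'words': 'German', 'pos': 'JJ'},
]

# Fit on train + test so that both parts share one feature space
toy_vec = DictVectorizer(sparse=True)
toy_matrix = toy_vec.fit_transform(toy_train_features + toy_test_features)

# Row order is preserved, so an index is enough to split the matrix again
n_train = len(toy_train_features)
toy_train_matrix = toy_matrix[:n_train]
toy_test_matrix = toy_matrix[n_train:]

print(toy_vec.feature_names_)  # one column per feature=value pair
print(toy_train_matrix.shape, toy_test_matrix.shape)
```

Each distinct `feature=value` pair becomes one column, which is why the test word 'German' needs to be seen by the vectorizer to get a column of its own.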


In [75]:
from sklearn.feature_extraction import DictVectorizer

In [76]:
vec = DictVectorizer(sparse=True)
train_test = training_features + test_features
the_array = vec.fit_transform(train_test)

train_array = the_array[:num_train]
test_array = the_array[num_train:]
# print(train_array.shape)
# print(test_array.shape)
# print(the_array.shape)




**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
Training (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

In [77]:
from sklearn import svm

In [78]:
lin_clf = svm.LinearSVC()

In [79]:
##### [ YOUR CODE SHOULD GO HERE ]
from sklearn.metrics import classification_report

lin_clf.fit(train_array, training_gold_labels) # training
pred = lin_clf.predict(test_array) # testing

report = classification_report(test_gold_labels, pred, digits=3) # evaluation

print(report)

              precision    recall  f1-score   support

       B-LOC      0.812     0.775     0.793      1668
      B-MISC      0.782     0.664     0.718       702
       B-ORG      0.792     0.519     0.627      1661
       B-PER      0.860     0.437     0.579      1617
       I-LOC      0.618     0.529     0.570       257
      I-MISC      0.570     0.588     0.579       216
       I-ORG      0.703     0.467     0.561       835
       I-PER      0.333     0.871     0.481      1156
           O      0.985     0.984     0.985     38323

    accuracy                          0.920     46435
   macro avg      0.717     0.648     0.655     46435
weighted avg      0.939     0.920     0.922     46435



### Answer 1d:
**O**: The high precision and recall values for the non-entity class suggest that the classifier performs well in identifying tokens that do not belong to any named entity category, indicating robust learning of non-entity patterns. This is likely due to their over-representation in the training set.<br>
**B-LOC**: The relatively high precision and recall suggest that location entities often have distinct patterns or keywords, making them easier for the classifier to recognize accurately. <br>
**B-MISC**:  Miscellaneous entities cover a wide range of categories and can be ambiguous depending on the context. The relatively lower recall may indicate the presence of diverse and less recognizable patterns for miscellaneous entities.<br>
**B-ORG**: Organizations often have unique naming conventions or structures, contributing to the relatively high precision. However, the lower recall suggests that some organization entities may have less distinct patterns or may be less prevalent in the data. <br>
**B-PER**: Person names may follow certain recognizable patterns, contributing to the high precision. However, the lower recall suggests that some person entities may be less predictable or less prevalent in the data.<br>
**I-LOC**: Inside location entities often occur within the context of longer phrases or sentences, where the presence of multiple words and syntactic structures can introduce ambiguity and variability. Therefore, inside location entities may have less distinct patterns compared to beginning of location entities, leading to lower precision (0.618) and recall (0.529) values. <br>
**I-ORG**:  Inside organization entities may exhibit less distinct patterns compared to beginning of organization entities, leading to lower recall values.<br>
**I-PER**: The low precision (0.333) suggests that the classifier struggles to accurately predict inside person entities, possibly due to less recognizable patterns. However, the high recall (0.871) indicates that it effectively captures most actual inside person entities. This discrepancy suggests that although the classifier is effective at capturing most instances of inside person entities, it also tends to generate a significant number of false positive predictions, likely because inside person entities exhibit less recognizable patterns and may share similarities with other entity types or non-entity tokens.<br>
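
One way to check where the false positives for a label such as I-PER come from is a confusion matrix, which counts for every gold label how often each label was predicted. The sketch below uses `sklearn.metrics.confusion_matrix` on toy gold and predicted labels; the toy lists are invented for illustration and are not the classifier's actual output.

```python
from sklearn.metrics import confusion_matrix

# Toy gold and predicted labels (hypothetical, not the notebook's results)
toy_gold = ['B-PER', 'I-PER', 'O', 'B-PER', 'I-PER', 'O', 'B-ORG', 'O']
toy_pred = ['B-PER', 'I-PER', 'O', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'O']
labels = ['B-ORG', 'B-PER', 'I-PER', 'O']

# Rows are gold labels, columns are predicted labels (both in `labels` order)
matrix = confusion_matrix(toy_gold, toy_pred, labels=labels)

print('gold \\ pred', labels)
for gold_label, row in zip(labels, matrix):
    print(f'{gold_label:6s}', row)
```

In this toy case the I-PER column shows that only two of the five I-PER predictions are correct, while the others come from gold B-ORG, B-PER and O tokens. With the notebook's variables, the same call could be made on `test_gold_labels` and `pred`.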




**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 1d. Generate a classification report and compare the results with the classifier you built in 1d.**

In [80]:
import gensim
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=500000) 

training_input= []

for token, pos, ne_label in train.iob_words():
    word=token #the next word from the tokenized text
    # we check if our word 
    # is inside the model vocabulary (loaded with the Google word2vec embeddings)
    if word in word_embedding_model:
        # in this case the word was found and vector is assigned with its embedding vector as the value
        vector=word_embedding_model[word]
    else: 
        # if the word does not exist in the embeddings vocabulary, 
        # we create a vector with all zeros.
        # The Google word2vec model has 300 dimensions so we create a vector with 300 zeros
        vector=[0]*300
        # print('This word is not in the word2vec vocabulary:', word)
    training_input.append(vector)


In [81]:
test_input = []
for token, pos, ne_label in test.iob_words():
    word=token #the next word from the tokenized text
    # we check if our word 
    # is inside the model vocabulary (loaded with the Google word2vec embeddings)
    if word in word_embedding_model:
        # in this case the word was found and vector is assigned with its embedding vector as the value
        vector=word_embedding_model[word]
    else: 
        # if the word does not exist in the embeddings vocabulary, 
        # we create a vector with all zeros.
        # The Google word2vec model has 300 dimensions so we create a vector with 300 zeros
        vector=[0]*300
        # print('This word is not in the word2vec vocabulary:', word)
    test_input.append(vector)
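
The lookup with a zero-vector fallback is the same for train and test, so it could be factored into one helper. The sketch below is one possible way to do that; the function name `tokens_to_vectors` is hypothetical, and a plain dictionary with 3-dimensional vectors stands in for the word2vec model so that the example runs without the embedding file. It relies only on `token in embeddings` and `embeddings[token]`, which the loops above also use on the loaded model.

```python
from typing import Mapping, Sequence

import numpy as np


def tokens_to_vectors(tokens: Sequence[str],
                      embeddings: Mapping[str, np.ndarray],
                      dim: int) -> np.ndarray:
    """Return one row per token; unknown tokens get an all-zero vector."""
    zero_vector = np.zeros(dim, dtype=np.float32)
    rows = [embeddings[token] if token in embeddings else zero_vector
            for token in tokens]
    return np.vstack(rows)


# Toy stand-in for the embedding model (assumption: 3 dimensions instead of 300)
toy_embeddings = {
    'EU': np.array([0.1, 0.2, 0.3], dtype=np.float32),
    'rejects': np.array([0.4, 0.5, 0.6], dtype=np.float32),
}
toy_tokens = ['EU', 'rejects', 'Zxqv']

toy_input = tokens_to_vectors(toy_tokens, toy_embeddings, dim=3)
covered = sum(token in toy_embeddings for token in toy_tokens)

print(toy_input.shape)
print(f'{covered}/{len(toy_tokens)} tokens found in the vocabulary')
```

Counting how many tokens are covered is a cheap check worth doing, because every out-of-vocabulary token receives the same zero vector and is therefore indistinguishable for the classifier.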
    

In [82]:
lin_clf_emb = svm.LinearSVC()
lin_clf_emb.fit(training_input, training_gold_labels)


LinearSVC()

In [83]:
pred_emb = lin_clf_emb.predict(test_input)
report_emb = classification_report(test_gold_labels, pred_emb, digits=3) # evaluation

# print(report_emb)
report_cols = report_emb.split('\n')[0]
report_emb_lines = report_emb.split('\n')[1:]
report_lines = report.split('\n')[1:]

print(report_cols)
for line_emb, line in zip(report_emb_lines, report_lines):
    if line_emb != '':
        print(line_emb, '(embedded)')
        print(line)
        print()
    else:
        print(line)
        





              precision    recall  f1-score   support

       B-LOC      0.759     0.796     0.777      1668 (embedded)
       B-LOC      0.812     0.775     0.793      1668

      B-MISC      0.724     0.688     0.706       702 (embedded)
      B-MISC      0.782     0.664     0.718       702

       B-ORG      0.700     0.635     0.666      1661 (embedded)
       B-ORG      0.792     0.519     0.627      1661

       B-PER      0.760     0.635     0.692      1617 (embedded)
       B-PER      0.860     0.437     0.579      1617

       I-LOC      0.522     0.412     0.461       257 (embedded)
       I-LOC      0.618     0.529     0.570       257

      I-MISC      0.595     0.523     0.557       216 (embedded)
      I-MISC      0.570     0.588     0.579       216

       I-ORG      0.498     0.344     0.407       835 (embedded)
       I-ORG      0.703     0.467     0.561       835

       I-PER      0.591     0.454     0.513      1156 (embedded)
       I-PER      0.333     0.871     0.
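
Instead of interleaving the lines of two text reports, the per-label scores can also be compared numerically. The sketch below assumes the `output_dict=True` option of `classification_report`, which returns the scores as nested dictionaries; the toy labels and predictions are invented and do not reproduce the results above.

```python
from sklearn.metrics import classification_report

# Toy gold labels and predictions of two systems (hypothetical)
toy_gold = ['B-PER', 'I-PER', 'O', 'O', 'B-PER', 'I-PER']
pred_a = ['B-PER', 'I-PER', 'O', 'I-PER', 'O', 'I-PER']   # e.g. one-hot features
pred_b = ['B-PER', 'I-PER', 'O', 'O', 'B-PER', 'O']       # e.g. embedding features

report_a = classification_report(toy_gold, pred_a, output_dict=True, zero_division=0)
report_b = classification_report(toy_gold, pred_b, output_dict=True, zero_division=0)

# Print the F1-score of both systems and the difference for every label
for label in sorted(set(toy_gold)):
    f1_a = report_a[label]['f1-score']
    f1_b = report_b[label]['f1-score']
    print(f'{label:6s} A: {f1_a:.3f}  B: {f1_b:.3f}  difference: {f1_b - f1_a:+.3f}')
```

The same loop works for `'precision'` and `'recall'`, which makes it easier to see for which labels one representation trades precision for recall.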

## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the feature representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels
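
The first step, turning a DataFrame into a list of feature dictionaries plus a list of labels, can be sketched as follows. This is a minimal example on a toy DataFrame with a few of the columns of the Kaggle file; it uses `DataFrame.drop` and `DataFrame.to_dict('records')` as an alternative to looping over rows, and it keeps the label column out of the features so that the classifier never sees the answer.

```python
import pandas as pd

# Toy DataFrame (hypothetical subset of the columns in the Kaggle dataset)
toy_df = pd.DataFrame({
    'word': ['Thousands', 'of', 'London'],
    'pos': ['NNS', 'IN', 'NNP'],
    'shape': ['capitalized', 'lowercase', 'capitalized'],
    'tag': ['O', 'O', 'B-geo'],
})

label_column = 'tag'

# One dictionary per row, built from every column except the label
toy_features = toy_df.drop(columns=[label_column]).to_dict('records')
# The gold labels, in the same row order as the features
toy_labels = toy_df[label_column].tolist()

print(toy_features[2])
print(toy_labels)
```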

In [103]:
import pandas

In [85]:
##### Adapt the path to point to your local copy of NERC_datasets
path = 'nerc_datasets/kaggle/ner_v2.csv'
# Skip malformed rows but report them. `on_bad_lines` replaces the deprecated
# `error_bad_lines` argument (available from pandas 1.3).
kaggle_dataset = pandas.read_csv(path, on_bad_lines='warn')


b'Skipping line 281837: expected 25 fields, saw 34\n'


In [86]:
len(kaggle_dataset)

1050795

In [96]:
df_train = kaggle_dataset[:100000]
df_test = kaggle_dataset[100000:120000]
print(len(df_train), len(df_test))
df_train.head(10)

100000 20000


| Unnamed: 0 | id | lemma | next-lemma | next-next-lemma | next-next-pos | next-next-shape | next-next-word | next-pos | next-shape | next-word | ... | prev-prev-lemma | prev-prev-pos | prev-prev-shape | prev-prev-word | prev-shape | prev-word | sentence_idx | shape | word | tag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | thousand | of | demonstr | NNS | lowercase | demonstrators | IN | lowercase | of | ... | `__start2__` | `__START2__` | wildcard | `__START2__` | wildcard | `__START1__` | 1.0 | capitalized | Thousands | O |
| 1 | 1 | of | demonstr | have | VBP | lowercase | have | NNS | lowercase | demonstrators | ... | `__start1__` | `__START1__` | wildcard | `__START1__` | capitalized | Thousands | 1.0 | lowercase | of | O |
| 2 | 2 | demonstr | have | march | VBN | lowercase | marched | VBP | lowercase | have | ... | thousand | NNS | capitalized | Thousands | lowercase | of | 1.0 | lowercase | demonstrators | O |
| 3 | 3 | have | march | through | IN | lowercase | through | VBN | lowercase | marched | ... | of | IN | lowercase | of | lowercase | demonstrators | 1.0 | lowercase | have | O |
| 4 | 4 | march | through | london | NNP | capitalized | London | IN | lowercase | through | ... | demonstr | NNS | lowercase | demonstrators | lowercase | have | 1.0 | lowercase | marched | O |
| 5 | 5 | through | london | to | TO | lowercase | to | NNP | capitalized | London | ... | have | VBP | lowercase | have | lowercase | marched | 1.0 | lowercase | through | O |
| 6 | 6 | london | to | protest | VB | lowercase | protest | TO | lowercase | to | ... | march | VBN | lowercase | marched | lowercase | through | 1.0 | capitalized | London | B-geo |
| 7 | 7 | to | protest | the | DT | lowercase | the | VB | lowercase | protest | ... | through | IN | lowercase | through | capitalized | London | 1.0 | lowercase | to | O |
| 8 | 8 | protest | the | war | NN | lowercase | war | DT | lowercase | the | ... | london | NNP | capitalized | London | lowercase | to | 1.0 | lowercase | protest | O |
| 9 | 9 | the | war | in | IN | lowercase | in | NN | lowercase | war | ... | to | TO | lowercase | to | lowercase | protest | 1.0 | lowercase | the | O |


In [92]:
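# The gold label must not be part of the features, otherwise the classifier
# would see the answer it is supposed to predict.
label_column = 'tag'
feature_columns = [column for column in df_train.columns if column != label_column]

training_features = []
training_labels = []

for index, row in df_train.iterrows():
    # One feature dict per token, built from every column except the label
    features = {column: row[column] for column in feature_columns}
    training_features.append(features)
    training_labels.append(row[label_column])

In [93]:
test_features = []
test_labels = []

for index, row in df_test.iterrows():
    # Same feature columns as for the training part
    features = {column: row[column] for column in feature_columns}
    test_features.append(features)
    test_labels.append(row[label_column])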

In [109]:
all_features = training_features + test_features

vec = DictVectorizer(sparse=True)
array = vec.fit_transform(all_features)

# Row order is preserved, so the first len(training_features) rows are the training part
train_arr = array[:len(training_features)]
test_arr = array[len(training_features):]

# Sanity check on the Kaggle matrices (train_arr, not train_array from Exercise 1)
print(train_arr.shape, test_arr.shape)


**[4 points] b. Train and evaluate the model and provide the classification report:**
* use the SVM to predict NERC labels on the test data
* evaluate the performance of the SVM on the test data

Analyze the performance per NERC label.

In [99]:
lin_clf_kaggle = svm.LinearSVC()
lin_clf_kaggle.fit(train_arr, training_labels)



LinearSVC()
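
The remaining steps of this exercise, predicting on the test part and producing the classification report, follow the same pattern as in Exercise 1. The sketch below runs the whole pipeline on a handful of toy instances so that it is self-contained; all `toy_*` names and data are assumptions for illustration. With the notebook's variables, the corresponding calls would be `lin_clf_kaggle.predict(test_arr)` followed by `classification_report(test_labels, ...)`.

```python
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report

# Toy training and test instances (hypothetical, Kaggle-style labels)
toy_train_features = [
    {'word': 'London', 'shape': 'capitalized'},
    {'word': 'to', 'shape': 'lowercase'},
    {'word': 'Paris', 'shape': 'capitalized'},
    {'word': 'the', 'shape': 'lowercase'},
]
toy_train_labels = ['B-geo', 'O', 'B-geo', 'O']
toy_test_features = [
    {'word': 'Berlin', 'shape': 'capitalized'},
    {'word': 'of', 'shape': 'lowercase'},
]
toy_test_labels = ['B-geo', 'O']

# One-hot encode train and test together, then split by index
toy_vec = DictVectorizer(sparse=True)
toy_matrix = toy_vec.fit_transform(toy_train_features + toy_test_features)
n_train = len(toy_train_features)

# Train on the training rows, predict on the test rows
toy_clf = svm.LinearSVC()
toy_clf.fit(toy_matrix[:n_train], toy_train_labels)
toy_pred = toy_clf.predict(toy_matrix[n_train:])

# zero_division=0 avoids warnings for labels that are never predicted
toy_report = classification_report(toy_test_labels, toy_pred, digits=3, zero_division=0)
print(toy_report)
```

The test words here are unseen during training, so only the shared `shape` feature can carry information to the test instances; the per-label analysis asked for above can start from this kind of observation.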

## End of this notebook