# Lab4-Assignment about Named Entity Recognition and Classification

This notebook describes the assignment of Lab 4 of the text mining course. We assume you have successfully completed Lab1, Lab2 and Lab3 as well. Lab2 is especially important for completing this assignment.

**Learning goals**
* going from linguistic input format to representing it in a feature space
* working with pretrained word embeddings
* train a supervised classifier (SVM)
* evaluate a supervised classifier (SVM)
* learn how to interpret the system output and the evaluation results
* be able to propose future improvements based on the observed results


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek Vossen.

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 points] a) Load the CoNLL-2003 training data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points]  -a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] -the NERC labels associated with each training instance, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [1]:
from nltk.corpus.reader import ConllCorpusReader
### Adapt the path to point to the CONLL2003 folder on your local machine
train = ConllCorpusReader('/mnt/sda1/Text_Mining_Group45/lab_sessions/lab4/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
      'bias': 1.0,
      'words': token,  # original token
      'pos': pos,  # Part Of Speech tag of the token
      'word.lower()': token.lower(),  # lower case variant of the token
      'word[-3:]': token[-3:],  # suffix of 3 characters
      'word[-2:]': token[-2:],  # suffix of 2 characters
      'word.isupper()': token.isupper(),  # is the token in uppercase
      'word.istitle()': token.istitle(),  # does the token start with a capital letter
      'word.isdigit()': token.isdigit(),  # is the token a digit
      'postag': pos,  # Part Of Speech tag
      'postag[:2]': pos[:2],  # first two characters of the PoS tag
    }
    training_features.append(a_dict)
    training_gold_labels.append(ne_label)
   

In [2]:
### Adapt the path to point to the CONLL2003 folder on your local machine
test = ConllCorpusReader('/mnt/sda1/Text_Mining_Group45/lab_sessions/lab4/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []
for token, pos, ne_label in test.iob_words():
    a_dict = {
      'bias': 1.0,
      'words': token,  # original token
      'pos': pos,  # Part Of Speech tag of the token
      'word.lower()': token.lower(),  # lower case variant of the token
      'word[-3:]': token[-3:],  # suffix of 3 characters
      'word[-2:]': token[-2:],  # suffix of 2 characters
      'word.isupper()': token.isupper(),  # is the token in uppercase
      'word.istitle()': token.istitle(),  # does the token start with a capital letter
      'word.isdigit()': token.isdigit(),  # is the token a digit
      'postag': pos,  # Part Of Speech tag
      'postag[:2]': pos[:2],  # first two characters of the PoS tag
    }
    test_features.append(a_dict)
    test_gold_labels.append(ne_label)

**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data is balanced (equal amount of instances for each NERC label) and to what extent the training and test data differ?

Tip: you can use the following `Counter` functionality to generate a frequency list from a list:

In [3]:
from collections import Counter 
count_train=Counter(training_gold_labels)
count_test=Counter(test_gold_labels)

#no of instances
print("\nNumber of instances in training data:", len(training_gold_labels))
print("Number of instances in test data:", len(test_gold_labels))
print("Proportion test data:", len(test_gold_labels)/(len(test_gold_labels)+len(training_gold_labels)))
print('\n')
# frequency distribution of NERC labels
print('Train NERC labels')
for label, freq in count_train.items():
    print(f"{label}: {freq} ({round((freq/len(training_gold_labels))*100,2)}%)")

print('\nTest NERC labels')
for label, freq in count_test.items():
    print(f"{label}: {freq} ({round((freq/len(test_gold_labels))*100, 2)}%)")



Number of instances in training data: 203621
Number of instances in test data: 46435
Proportion test data: 0.1856984035576031


Train NERC labels
B-ORG: 6321 (3.1%)
O: 169578 (83.28%)
B-MISC: 3438 (1.69%)
B-PER: 6600 (3.24%)
I-PER: 4528 (2.22%)
B-LOC: 7140 (3.51%)
I-ORG: 3704 (1.82%)
I-MISC: 1155 (0.57%)
I-LOC: 1157 (0.57%)

Test NERC labels
O: 38323 (82.53%)
B-LOC: 1668 (3.59%)
B-PER: 1617 (3.48%)
I-PER: 1156 (2.49%)
I-LOC: 257 (0.55%)
B-MISC: 702 (1.51%)
I-MISC: 216 (0.47%)
B-ORG: 1661 (3.58%)
I-ORG: 835 (1.8%)


### Answer (b)
## Orlando TO DO (with micheals text)

The frequency distribution of NERC labels in both the training and test datasets reveals a significant imbalance, with the 'O' (Outside of named entities) label dominating at 83.28% and 82.53% respectively. This skew towards non-entity tokens is common in NER tasks, reflecting the nature of natural language where named entities constitute a smaller portion of the text. Other common labels like 'B-LOC' (Beginning of a location), 'B-PER' (Beginning of a person's name), and 'I-PER' (Inside a person's name) show somewhat proportional representations between the training and test sets, with 'B-LOC' at 3.51% and 3.59%, 'B-PER' at 3.24% and 3.48%, and 'I-PER' at 2.22% and 2.49%, respectively. However, these entities, alongside 'B-ORG' (Beginning of an organization's name) and 'I-ORG' (Inside an organization's name), still make up a small fraction compared to the 'O' label, indicating a lack of balance in terms of named entity coverage.

The proportions of NERC labels are broadly similar in the training and test data, which indicates a degree of consistency in data annotation and representation, but small differences remain. The rarest labels, 'I-MISC' (Inside a miscellaneous entity) and 'I-LOC', each account for 0.57% of the training tokens and for 0.47% and 0.55% of the test tokens respectively, so the classifier sees few examples of them. Among the entity labels, 'B-ORG' shows the largest absolute shift, from 3.1% in training to 3.58% in test. These minor discrepancies could affect model performance, especially on the less represented categories. Overall, the imbalance suggests that models trained on this data may perform well on the 'O' label and reasonably well on more common entities such as locations and persons, but may struggle with less frequent entities because of their underrepresentation in the training material.
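
To make the train/test comparison concrete, the following sketch recomputes the label shares from the counts printed above and ranks the labels by how much their share shifts between the two splits. The counts are copied from the notebook output; the helper name `shares` is an assumption introduced for illustration.

```python
# Label counts copied from the output of the cell above
train_counts = {'B-ORG': 6321, 'O': 169578, 'B-MISC': 3438, 'B-PER': 6600,
                'I-PER': 4528, 'B-LOC': 7140, 'I-ORG': 3704, 'I-MISC': 1155,
                'I-LOC': 1157}
test_counts = {'O': 38323, 'B-LOC': 1668, 'B-PER': 1617, 'I-PER': 1156,
               'I-LOC': 257, 'B-MISC': 702, 'I-MISC': 216, 'B-ORG': 1661,
               'I-ORG': 835}


def shares(counts):
    """Convert absolute counts into percentages of all tokens."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


train_share = shares(train_counts)
test_share = shares(test_counts)

# Difference in percentage points (test minus train) per label
diff = {label: test_share[label] - train_share[label] for label in train_counts}

# Print labels ordered by the size of the shift
for label, d in sorted(diff.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{label:7s} train {train_share[label]:6.2f}%  "
          f"test {test_share[label]:6.2f}%  diff {d:+.2f}")

largest_shift = max(diff, key=lambda label: abs(diff[label]))
entity_share_train = 100 - train_share['O']  # share of tokens inside any entity
```

Reporting the differences in percentage points keeps the comparison independent of the different sizes of the two splits.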

**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: You’ve concatenated train and test into one list and then you’ve applied the DictVectorizer.
The order of the rows is maintained. You can hence use an index (number of training instances) to split the_array back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.


In [4]:
from sklearn.feature_extraction import DictVectorizer


In [5]:
vec = DictVectorizer()
features=training_features + test_features
the_array = vec.fit_transform(features)
# print(the_array)
# print(len(the_array))

n_train = len(training_features)

train_array = the_array[:n_train]
test_array = the_array[n_train:]
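
To illustrate why train and test are vectorized together and then split by index, here is a small hand-rolled one-hot encoder. It is only a sketch of the idea, not scikit-learn's implementation: `DictVectorizer` returns a sparse matrix and sorts its feature names by default, whereas this hypothetical `one_hot_encode` returns plain lists in first-seen column order.

```python
def one_hot_encode(feature_dicts):
    """String values become 'name=value' indicator columns; numbers and booleans keep their value."""
    vocabulary = {}
    for features in feature_dicts:
        for name, value in features.items():
            column = f"{name}={value}" if isinstance(value, str) else name
            vocabulary.setdefault(column, len(vocabulary))

    rows = []
    for features in feature_dicts:
        row = [0.0] * len(vocabulary)
        for name, value in features.items():
            if isinstance(value, str):
                row[vocabulary[f"{name}={value}"]] = 1.0
            else:
                row[vocabulary[name]] = float(value)
        rows.append(row)
    return vocabulary, rows


# Toy stand-ins for training_features and test_features
toy_train = [{'words': 'EU', 'pos': 'NNP', 'word.isupper()': True}]
toy_test = [{'words': 'rejects', 'pos': 'VBZ', 'word.isupper()': False}]

# Encode both parts together so they share one column layout
vocabulary, rows = one_hot_encode(toy_train + toy_test)

# Row order is preserved, so an index is enough to split again
n_toy_train = len(toy_train)
toy_train_rows = rows[:n_toy_train]
toy_test_rows = rows[n_toy_train:]
```

Because the vocabulary is built from both parts, every test row has the same number of columns as every training row, which is what the classifier requires.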


**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
Training (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

In [6]:
from sklearn import svm

In [7]:
lin_clf = svm.LinearSVC()

In [8]:
from sklearn.metrics import classification_report
lin_clf.fit(train_array, training_gold_labels)
test_predictions = lin_clf.predict(test_array)
report = classification_report(test_gold_labels, test_predictions)
print(report)

              precision    recall  f1-score   support

       B-LOC       0.73      0.81      0.77      1668
      B-MISC       0.71      0.70      0.71       702
       B-ORG       0.68      0.58      0.63      1661
       B-PER       0.68      0.59      0.63      1617
       I-LOC       0.59      0.54      0.56       257
      I-MISC       0.56      0.59      0.58       216
       I-ORG       0.54      0.49      0.51       835
       I-PER       0.48      0.54      0.51      1156
           O       0.98      0.99      0.99     38323

    accuracy                           0.93     46435
   macro avg       0.66      0.65      0.65     46435
weighted avg       0.92      0.93      0.92     46435



## Answer (d)
## Micheal TO DO
The classifier demonstrates strong performance particularly on the 'O' (Outside of named entities) label, achieving high precision, recall, and F1-score of 0.98, 0.99, and 0.99 respectively. This label, which represents tokens not classified as named entities, benefits from its overwhelming presence in the dataset, as evidenced by the provided data distribution. Such a dominant representation facilitates the classifier's learning, making it adept at identifying non-entity components of the text with high accuracy. The high performance on 'O' significantly contributes to the overall accuracy metric of 0.93, indicating that the classifier is particularly efficient at discerning non-entity text portions, likely due to the abundance of examples during training that bolster its predictive confidence and reduce the likelihood of false positives or negatives within this category.

Conversely, the classifier shows comparatively weaker performance on several entity categories, notably 'I-PER' (Inside a person's name), 'I-ORG' (Inside an organization's name), and 'I-LOC' (Inside a location), with F1-scores of 0.51, 0.51, and 0.56 respectively. These categories are characterized by lower precision and recall, suggesting difficulties in accurately identifying and classifying tokens that are part of named entities extending beyond a single token. The relatively poor performance on these labels can be attributed to several factors, including the inherent complexity of recognizing entities that span multiple tokens, potential inconsistencies in labeling, and the relatively smaller number of examples for these categories compared to the 'O' label. This results in less training data for these specific entity types, complicating the model's task of learning their distinctive features amidst the linguistic variability of natural language. Additionally, the precision-recall trade-off observed, especially in 'I-PER' and 'I-ORG', points to challenges in balancing the detection of true positives against the avoidance of false positives, further complicating the accurate classification of these entity types.
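
The accuracy of 0.93 is easier to judge next to a majority-class baseline and the macro average. The sketch below derives both from the report above, using the rounded per-label values, so the recomputed averages are approximate. The variable names are assumptions introduced for illustration.

```python
# Support and F1 per label, copied from the classification report above
test_support = {'B-LOC': 1668, 'B-MISC': 702, 'B-ORG': 1661, 'B-PER': 1617,
                'I-LOC': 257, 'I-MISC': 216, 'I-ORG': 835, 'I-PER': 1156,
                'O': 38323}
f1 = {'B-LOC': 0.77, 'B-MISC': 0.71, 'B-ORG': 0.63, 'B-PER': 0.63,
      'I-LOC': 0.56, 'I-MISC': 0.58, 'I-ORG': 0.51, 'I-PER': 0.51,
      'O': 0.99}

total = sum(test_support.values())

# Accuracy of a classifier that always predicts the most frequent label
majority_label = max(test_support, key=test_support.get)
baseline_accuracy = test_support[majority_label] / total

# Macro average: every label counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: labels count in proportion to their support
weighted_f1 = sum(f1[label] * test_support[label] for label in f1) / total

# Macro average over the entity labels only (excluding 'O')
entity_labels = [label for label in f1 if label != 'O']
entity_macro_f1 = sum(f1[label] for label in entity_labels) / len(entity_labels)

print(f"majority baseline accuracy: {baseline_accuracy:.4f}")
print(f"macro F1: {macro_f1:.2f}  entity-only macro F1: {entity_macro_f1:.2f}")
```

Since always predicting 'O' already yields roughly 83% accuracy, the macro average and the entity-only average are the more informative summaries for this task.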

**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 1d. Generate a classification report and compare the results with the classifier you built in 1d.**

In [9]:
import gensim
##### Adapt the path to point to your local copy of the Google embeddings model
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('/mnt/sda1/Text_Mining_Group45/lab_sessions/GoogleNews-vectors-negative300.bin', binary=True)  

training_vectors=[]
train_labels=[]
for token, pos, ne_label in train.iob_words():
    
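    # Note: CoNLL-2003 marks document boundaries with the token '-DOCSTART-',
    # so the 'DOCSTART' check below does not filter those tokens out.
    if token!='' and token!='DOCSTART':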
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        training_vectors.append(vector)
        train_labels.append(ne_label)

test_vectors=[]
test_labels=[]
for token, pos, ne_label in test.iob_words():
    
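    # Note: document boundaries appear as '-DOCSTART-' in CoNLL-2003,
    # so the 'DOCSTART' check below leaves those tokens in the test set.
    if token!='' and token!='DOCSTART':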
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        test_vectors.append(vector)
        test_labels.append(ne_label)

lin_clf2 = svm.LinearSVC()
lin_clf2.fit(training_vectors, train_labels)
pred=lin_clf2.predict(test_vectors)

print(classification_report(test_labels, pred))

              precision    recall  f1-score   support

       B-LOC       0.76      0.80      0.78      1668
      B-MISC       0.72      0.70      0.71       702
       B-ORG       0.69      0.64      0.66      1661
       B-PER       0.75      0.67      0.71      1617
       I-LOC       0.51      0.42      0.46       257
      I-MISC       0.60      0.54      0.57       216
       I-ORG       0.48      0.33      0.39       835
       I-PER       0.59      0.50      0.54      1156
           O       0.97      0.99      0.98     38323

    accuracy                           0.93     46435
   macro avg       0.68      0.62      0.64     46435
weighted avg       0.92      0.93      0.92     46435
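
In the cell above, every token missing from the embedding model receives the same 300-dimensional zero vector. The following sketch isolates that fallback and measures how often it is used. A plain dictionary stands in for the gensim model here (an assumption for illustration; the notebook's `word_embedding_model` is used with the same `in` and `[]` operations), and the helper names `embed` and `oov_rate` are hypothetical.

```python
def embed(token, model, dim=300):
    """Return the embedding for token, or a zero vector if it is out of vocabulary."""
    if token in model:
        return list(model[token])
    return [0.0] * dim


def oov_rate(tokens, model):
    """Fraction of tokens that have no embedding in the model."""
    missing = sum(1 for token in tokens if token not in model)
    return missing / len(tokens)


# Toy 3-dimensional "model" standing in for the Google News vectors
toy_model = {'EU': [0.1, 0.2, 0.3], 'rejects': [0.0, -0.1, 0.4]}
tokens = ['EU', 'rejects', 'German', 'call']

vectors = [embed(token, toy_model, dim=3) for token in tokens]
rate = oov_rate(tokens, toy_model)
print(f"out-of-vocabulary rate: {rate:.0%}")
```

Checking the out-of-vocabulary rate on the real data would show how many tokens the classifier cannot distinguish from one another, which is relevant when interpreting the report.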



## Remi TO DO

The comparison between the classifier trained on manually crafted features (part d) and the one trained on word embeddings shows a mixed picture rather than a clear improvement. For the entity-initial labels the embeddings help: the F1-score rises from 0.63 to 0.71 for B-PER, from 0.63 to 0.66 for B-ORG and from 0.77 to 0.78 for B-LOC, while B-MISC stays at 0.71. I-PER also improves, from 0.51 to 0.54. This suggests that pretrained embeddings capture semantic similarities between words that help the model recognise tokens which typically start a name. For the other entity-internal labels, however, performance drops: I-ORG falls from 0.51 to 0.39, I-LOC from 0.56 to 0.46 and I-MISC from 0.58 to 0.57. A possible reason is that the embedding model does not use the part-of-speech and capitalisation features from part d and that all out-of-vocabulary tokens are mapped to the same zero vector; in addition, neither model sees the neighbouring tokens, on which the distinction between B- and I- labels depends. The accuracy (0.93) and the weighted-average F1-score (0.92) are the same for both models because the 'O' label dominates, and the macro-average F1-score decreases slightly from 0.65 to 0.64. Overall, the embeddings provide a richer representation of individual words, but on their own they do not lead to a more balanced performance across entity types.
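
One way to ground the comparison is to compute the per-label F1 difference directly from the two classification reports above. The sketch below copies the rounded F1-scores from both reports; the dictionary names are assumptions introduced for illustration.

```python
# F1-scores copied from the two classification reports above
f1_features = {'B-LOC': 0.77, 'B-MISC': 0.71, 'B-ORG': 0.63, 'B-PER': 0.63,
               'I-LOC': 0.56, 'I-MISC': 0.58, 'I-ORG': 0.51, 'I-PER': 0.51,
               'O': 0.99}
f1_embeddings = {'B-LOC': 0.78, 'B-MISC': 0.71, 'B-ORG': 0.66, 'B-PER': 0.71,
                 'I-LOC': 0.46, 'I-MISC': 0.57, 'I-ORG': 0.39, 'I-PER': 0.54,
                 'O': 0.98}

# Positive values mean the embedding model scores higher
delta = {label: round(f1_embeddings[label] - f1_features[label], 2)
         for label in f1_features}

improved = sorted(label for label, d in delta.items() if d > 0)
worsened = sorted(label for label, d in delta.items() if d < 0)

for label in sorted(delta, key=delta.get, reverse=True):
    print(f"{label:7s} {f1_features[label]:.2f} -> {f1_embeddings[label]:.2f} "
          f"({delta[label]:+.2f})")
```

Listing the differences per label avoids relying on the averages alone, which hide opposite movements for B- and I- labels.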


## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the features representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels

In [1]:
import pandas
from sklearn.feature_extraction import DictVectorizer
from sklearn import svm
from sklearn.metrics import classification_report

In [2]:
##### Adapt the path to point to your local copy of NERC_datasets
path = '/mnt/sda1/Text_Mining_Group45/lab_sessions/lab4/kaggle/ner_v2.csv'
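# Note: `error_bad_lines` was deprecated in pandas 1.3 and removed in pandas 2.0;
# on newer versions, pass on_bad_lines='skip' instead to skip malformed rows.
kaggle_dataset = pandas.read_csv(path, error_bad_lines=False)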

b'Skipping line 281837: expected 25 fields, saw 34\n'


In [3]:
len(kaggle_dataset)

1050795

In [11]:
# Take explicit copies so that assigning a column does not raise SettingWithCopyWarning
df_train = kaggle_dataset[:100000].copy()
df_test = kaggle_dataset[100000:120000].copy()

# Convert the sentence index to a string; DictVectorizer then treats it as a categorical feature
# (df_test is left unchanged here, so its sentence_idx stays numeric)
df_train['sentence_idx'] = df_train['sentence_idx'].apply(str)
df_train.head()


Unnamed: 0,id,lemma,next-lemma,next-next-lemma,next-next-pos,next-next-shape,next-next-word,next-pos,next-shape,next-word,...,prev-prev-lemma,prev-prev-pos,prev-prev-shape,prev-prev-word,prev-shape,prev-word,sentence_idx,shape,word,tag
0,0,thousand,of,demonstr,NNS,lowercase,demonstrators,IN,lowercase,of,...,__start2__,__START2__,wildcard,__START2__,wildcard,__START1__,1.0,capitalized,Thousands,O
1,1,of,demonstr,have,VBP,lowercase,have,NNS,lowercase,demonstrators,...,__start1__,__START1__,wildcard,__START1__,capitalized,Thousands,1.0,lowercase,of,O
2,2,demonstr,have,march,VBN,lowercase,marched,VBP,lowercase,have,...,thousand,NNS,capitalized,Thousands,lowercase,of,1.0,lowercase,demonstrators,O
3,3,have,march,through,IN,lowercase,through,VBN,lowercase,marched,...,of,IN,lowercase,of,lowercase,demonstrators,1.0,lowercase,have,O
4,4,march,through,london,NNP,capitalized,London,IN,lowercase,through,...,demonstr,NNS,lowercase,demonstrators,lowercase,have,1.0,lowercase,marched,O


In [17]:
#Split df into labels and features arrays 

train_labels = df_train['tag'].values
features_train = df_train.drop('tag', axis=1)
features_train = features_train.drop('id', axis=1)
# features_train = features_train.drop('sentence_idx', axis=1)
features_dict_train = features_train.to_dict(orient='records')

test_labels = df_test['tag'].values
features_test = df_test.drop('tag', axis=1)
features_test = features_test.drop('id', axis=1)
# features_test = features_test.drop('sentence_idx', axis=1)
features_dict_test = features_test.to_dict(orient='records')


In [15]:
# from sklearn.feature_extraction import DictVectorizer
# from sklearn import svm
# from sklearn.metrics import classification_report

In [18]:
vec2 = DictVectorizer()
features2=features_dict_train + features_dict_test
print(features2[1])
the_array = vec2.fit_transform(features2)
print(the_array)

n_train = len(features_dict_train)

train_array = the_array[:n_train]
test_array = the_array[n_train:]

print(train_array.shape)
print(test_array.shape)


{'lemma': 'of', 'next-lemma': 'demonstr', 'next-next-lemma': 'have', 'next-next-pos': 'VBP', 'next-next-shape': 'lowercase', 'next-next-word': 'have', 'next-pos': 'NNS', 'next-shape': 'lowercase', 'next-word': 'demonstrators', 'pos': 'IN', 'prev-iob': 'O', 'prev-lemma': 'thousand', 'prev-pos': 'NNS', 'prev-prev-iob': '__START1__', 'prev-prev-lemma': '__start1__', 'prev-prev-pos': '__START1__', 'prev-prev-shape': 'wildcard', 'prev-prev-word': '__START1__', 'prev-shape': 'capitalized', 'prev-word': 'Thousands', 'sentence_idx': '1.0', 'shape': 'lowercase', 'word': 'of'}
  (0, 7329)	1.0
  (0, 13352)	1.0
  (0, 18437)	1.0
  (0, 24093)	1.0
  (0, 24122)	1.0
  (0, 29674)	1.0
  (0, 35348)	1.0
  (0, 35386)	1.0
  (0, 43763)	1.0
  (0, 46843)	1.0
  (0, 46883)	1.0
  (0, 47411)	1.0
  (0, 55054)	1.0
  (0, 55074)	1.0
  (0, 55577)	1.0
  (0, 62960)	1.0
  (0, 62973)	1.0
  (0, 67110)	1.0
  (0, 74411)	1.0
  (0, 78696)	1.0
  (0, 86270)	1.0
  (0, 90816)	1.0
  (0, 94729)	1.0
  (1, 5280)	1.0
  (1, 10446)	1.0
  :

**[4 points] b. Train and evaluate the model and provide the classification report:**
* use the SVM to predict NERC labels on the test data
* evaluate the performance of the SVM on the test data

Analyze the performance per NERC label.

In [19]:
lin_clf = svm.LinearSVC()
lin_clf.fit(train_array, train_labels)
test_predictions = lin_clf.predict(test_array)
report = classification_report(test_labels, test_predictions)
print(report)

              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00         4
       B-eve       0.00      0.00      0.00         0
       B-geo       0.87      0.88      0.87       741
       B-gpe       0.90      0.93      0.92       296
       B-nat       0.80      0.50      0.62         8
       B-org       0.77      0.67      0.72       397
       B-per       0.81      0.83      0.82       333
       B-tim       0.95      0.84      0.89       393
       I-geo       0.97      0.96      0.97       156
       I-gpe       1.00      1.00      1.00         2
       I-nat       1.00      1.00      1.00         4
       I-org       0.95      0.93      0.94       321
       I-per       0.95      0.98      0.96       319
       I-tim       1.00      0.86      0.93       108
           O       0.99      0.99      0.99     16918

    accuracy                           0.97     20000
   macro avg       0.80      0.76      0.77     20000
weighted avg       0.97   

**High performance**

* Geographical entities (B-geo and I-geo): The model shows high precision and recall for both beginning and inside markers of geographical entities, with F1-scores of 0.87 and 0.97, respectively. This indicates that the model recognizes and classifies geographical names reliably.
* Geopolitical entities (B-gpe): The model performs well on geopolitical entities, with an F1-score of 0.92, indicating robust identification of countries, cities, and states.
* Temporal entities (B-tim and I-tim): Temporal entities are also handled well, with F1-scores of 0.89 for beginning markers and 0.93 for inside markers. Precision (0.95 and 1.00) is higher than recall (0.84 and 0.86), so the model misses some dates and times but rarely labels other tokens as temporal.
* Persons (B-per and I-per): Precision and recall for person names are both high, with F1-scores of 0.82 and 0.96 for beginning and inside markers, respectively.
* Non-entity tokens (O): The model achieves near-perfect performance on tokens outside of named entities, with an F1-score of 0.99. This matters for NER tasks, as the majority of the text falls under this category.

**Moderate performance**

* Organizations (B-org and I-org): Performance on organization entities leaves room for improvement, with F1-scores of 0.72 for beginning markers and 0.94 for inside markers. The lower score for beginning markers suggests that the model has difficulty deciding where an organization name starts.

**Low or unreliable performance**

* Art, events, and nature (B-art, B-eve, B-nat, I-nat): These categories are too rare in the test data to evaluate reliably. B-art scores 0.00 on only 4 instances, and B-eve has a support of 0, so its zero scores say nothing about recall. B-nat reaches an F1-score of 0.62 on 8 instances and I-nat 1.00 on 4 instances. The poor results on B-art are likely due to the sparse representation of this type in the dataset.
* Inside markers of geopolitical entities (I-gpe): Although I-gpe shows an F1-score of 1.00, it is based on a very small sample (only 2 instances), so the high performance may not generalize.
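
Several of the scores discussed above rest on only a handful of test instances. As a small sketch, the labels whose support is too low for a reliable estimate can be flagged automatically. The support values are copied from the report above, and the threshold `MIN_SUPPORT = 10` is an arbitrary assumption chosen for illustration.

```python
# Support per label, copied from the classification report above
support = {'B-art': 4, 'B-eve': 0, 'B-geo': 741, 'B-gpe': 296, 'B-nat': 8,
           'B-org': 397, 'B-per': 333, 'B-tim': 393, 'I-geo': 156, 'I-gpe': 2,
           'I-nat': 4, 'I-org': 321, 'I-per': 319, 'I-tim': 108, 'O': 16918}

MIN_SUPPORT = 10  # hypothetical cut-off below which scores are treated as unreliable

low_support = sorted(label for label, n in support.items() if n < MIN_SUPPORT)
print("Labels with too few test instances:", low_support)
```

Scores for the flagged labels are better read as anecdotal; a larger or stratified test slice would be needed to evaluate them.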

## End of this notebook