# Lab4-Assignment about Named Entity Recognition and Classification

This notebook describes the assignment for Lab 4 of the text mining course. We assume you have successfully completed Lab1, Lab2 and Lab3 as well. Lab2 is especially important for completing this assignment.

**Learning goals**
* going from linguistic input format to representing it in a feature space
* working with pretrained word embeddings
* train a supervised classifier (SVM)
* evaluate a supervised classifier (SVM)
* learn how to interpret the system output and the evaluation results
* be able to propose future improvements based on the observed results


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek Vossen.

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 points] a) Load the CoNLL-2003 training data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points] - a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] - the NERC labels associated with each training instance, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [9]:
from nltk.corpus.reader import ConllCorpusReader
### Adapt the path to point to the CONLL2003 folder on your local machine
train = ConllCorpusReader('/Users/kamilpulchny/Desktop/Text Mining/ba-text-mining/lab_sessions/lab4/CONLL2003/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
         'word': token,
         'pos': pos
    }
    training_features.append(a_dict)
    training_gold_labels.append(ne_label)

In [10]:
### Adapt the path to point to the CONLL2003 folder on your local machine
test = ConllCorpusReader('/Users/kamilpulchny/Desktop/Text Mining/ba-text-mining/lab_sessions/lab4/CONLL2003/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []
for token, pos, ne_label in test.iob_words():
    a_dict = {
        'word': token,
        'pos': pos
    }
    test_features.append(a_dict)
    test_gold_labels.append(ne_label)
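
The two loading cells above repeat the same loop. A minimal sketch of a shared helper, assuming the reader yields `(token, pos, ne_label)` triples as `iob_words()` does above. The name `to_features_and_labels` and the toy triples are hypothetical and only serve as an illustration:

```python
def to_features_and_labels(triples):
    """Turn (token, pos, ne_label) triples into feature dicts and a label list."""
    features, labels = [], []
    for token, pos, ne_label in triples:
        features.append({'word': token, 'pos': pos})
        labels.append(ne_label)
    return features, labels

# Toy stand-in for the output of ConllCorpusReader.iob_words() (hypothetical data)
toy_triples = [('EU', 'NNP', 'B-ORG'), ('rejects', 'VBZ', 'O'), ('German', 'JJ', 'B-MISC')]
toy_features, toy_labels = to_features_and_labels(toy_triples)
print(toy_features[0], toy_labels[0])
```

With the real data, the same helper could be called once on the training reader and once on the test reader.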

**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Answer: the training data contains 203621 instances and the test data contains 46435.
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Answer:
    * Training NERC labels frequency distribution:
        * B-ORG: 6321
        * O: 169578
        * B-MISC: 3438
        * B-PER: 6600
        * I-PER: 4528
        * B-LOC: 7140
        * I-ORG: 3704
        * I-MISC: 1155
        * I-LOC: 1157
    * Test NERC labels frequency distribution:
        * O: 38323
        * B-LOC: 1668
        * B-PER: 1617
        * I-PER: 1156
        * I-LOC: 257
        * B-MISC: 702
        * I-MISC: 216
        * B-ORG: 1661
        * I-ORG: 835
* Discuss to what extent the training and test data is balanced (equal amount of instances for each NERC label) and to what extent the training and test data differ?
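* Answer: Neither set is balanced across labels. The `O` label accounts for 169578 of the 203621 training instances (about 83%) and 38323 of the 46435 test instances (about 83%), while I-MISC and I-LOC each make up less than 1% of either set. The two sets differ mainly in size: the training set is roughly 4.4 times larger than the test set. Their relative label distributions are similar, with small shifts (for example, B-ORG is about 3.1% of the training instances and about 3.6% of the test instances). The rare labels have little support in the test set (216 I-MISC and 257 I-LOC instances), which may make their evaluation scores less stable.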

Tip: you can use the following `Counter` functionality to generate a frequency list from a list:

In [11]:
from collections import Counter 

my_list=[1,2,1,3,2,5]
Counter(my_list)


Counter({1: 2, 2: 2, 3: 1, 5: 1})
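
To judge balance, relative frequencies are easier to compare than raw counts. A minimal sketch using the counts reported in b). The helper name `relative_frequencies` is hypothetical, and the counts are typed in by hand rather than recomputed from the corpus:

```python
from collections import Counter

# Label counts copied from the answer to b)
train_counts = Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321,
                        'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438,
                        'I-LOC': 1157, 'I-MISC': 1155})
test_counts = Counter({'O': 38323, 'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617,
                       'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702,
                       'I-LOC': 257, 'I-MISC': 216})

def relative_frequencies(counts):
    """Map each label to its share of all instances."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_rel = relative_frequencies(train_counts)
test_rel = relative_frequencies(test_counts)

# Print the labels from most to least frequent in the training data
for label, _ in train_counts.most_common():
    print(f"{label:7s} train {train_rel[label]:6.2%}  test {test_rel[label]:6.2%}")
```

With the real data, `Counter(training_gold_labels)` and `Counter(test_gold_labels)` would replace the hand-typed counts.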

**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: You’ve concatenated train and test into one list and then applied the DictVectorizer.
The order of the rows is maintained, so you can use an index (the number of training instances) to split `the_array` back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.


In [12]:
from sklearn.feature_extraction import DictVectorizer

In [13]:
vec = DictVectorizer()
the_array = vec.fit_transform(training_features + test_features)

num_train = len(training_features)
train_array = the_array[:num_train]
test_array = the_array[num_train:]
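
The reason for vectorizing train and test together is that both parts then share the same columns, including columns for words that only occur in the test data. A minimal sketch on toy data, where all `toy_*` names and values are hypothetical:

```python
from sklearn.feature_extraction import DictVectorizer

# Toy feature dicts (hypothetical); 'Germany' occurs only in the test part
toy_train = [{'word': 'EU', 'pos': 'NNP'}, {'word': 'rejects', 'pos': 'VBZ'}]
toy_test = [{'word': 'Germany', 'pos': 'NNP'}]

toy_vec = DictVectorizer()
toy_matrix = toy_vec.fit_transform(toy_train + toy_test)  # sparse matrix, one row per token

# Row order is preserved, so slicing by the number of training rows splits it back
n_train = len(toy_train)
toy_train_matrix = toy_matrix[:n_train]
toy_test_matrix = toy_matrix[n_train:]

print(toy_vec.feature_names_)
print(toy_train_matrix.shape, toy_test_matrix.shape)
```

Each string-valued feature becomes one column per observed value (for example `word=EU`), which is the one-hot encoding used by the SVM in d).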

**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
The train (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

In [14]:
from sklearn import svm

In [15]:
lin_clf = svm.LinearSVC()

In [16]:
##### [ YOUR CODE SHOULD GO HERE ]
from sklearn.metrics import classification_report
lin_clf.fit(train_array, training_gold_labels)
pred_test = lin_clf.predict(test_array)
print(classification_report(test_gold_labels, pred_test))



              precision    recall  f1-score   support

       B-LOC       0.81      0.77      0.79      1668
      B-MISC       0.78      0.66      0.71       702
       B-ORG       0.79      0.52      0.62      1661
       B-PER       0.87      0.44      0.58      1617
       I-LOC       0.62      0.53      0.57       257
      I-MISC       0.59      0.59      0.59       216
       I-ORG       0.66      0.48      0.55       835
       I-PER       0.33      0.87      0.48      1156
           O       0.99      0.98      0.98     38323

    accuracy                           0.92     46435
   macro avg       0.71      0.65      0.65     46435
weighted avg       0.94      0.92      0.92     46435



Which NERC labels does the classifier perform well on? Why do you think this is the case?

The classifier performs best on the "O" label and, among the entity labels, on B-LOC. Non-entity tokens dominate the dataset (about 83% of the test instances), which makes them easy to learn and predict: "O" reaches a precision of 0.99, a recall of 0.98 and an F1-score of 0.98. B-LOC reaches an F1-score of 0.79, probably because location names form a fairly consistent set that recurs between the training and test data, so the word feature alone identifies many of them.

Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

The classifier struggles with the person labels (B-PER and I-PER) and the organization labels (B-ORG and I-ORG). I-PER has a high recall of 0.87 but a very low precision of 0.33, giving an F1-score of 0.48, which shows that the classifier overpredicts I-PER. B-PER shows the opposite pattern: a high precision of 0.87 but a low recall of 0.44, so more than half of the actual B-PER tokens are missed. A likely cause is that person names are varied and unpredictable: many names in the test data probably do not occur in the training data, and with only the word and its POS tag as features the classifier has little to go on for unseen tokens. The organization labels also score moderately (F1-scores of 0.62 for B-ORG and 0.55 for I-ORG), most likely because organization names lack the recurring, recognizable forms of location names.
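
The F1-scores discussed above follow from precision and recall as their harmonic mean, F1 = 2PR / (P + R). A minimal sketch that recomputes two of them from the rounded values in the report. The helper name `f1_from` is hypothetical:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the classification report above
reported = {'I-PER': (0.33, 0.87), 'B-PER': (0.87, 0.44)}

for label, (p, r) in reported.items():
    print(label, round(f1_from(p, r), 2))
```

Because the inputs are already rounded to two decimals, recomputing other rows this way can differ from the reported F1-score by 0.01.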

**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 1d. Generate a classification report and compare the results with the classifier you built in 1d.**

In [17]:
import numpy as np
from sklearn import svm
from sklearn.metrics import classification_report
import gensim.downloader as api

# Pretrained fastText vectors (300 dimensions), downloaded on first use
fasttext_model = api.load("fasttext-wiki-news-subwords-300")

def get_embedding(word):
    # Look up the lowercased token; fall back to a zero vector for unknown words
    word_lower = word.lower()
    if word_lower in fasttext_model:
        return fasttext_model[word_lower]
    return np.zeros(fasttext_model.vector_size)

train_embeddings = np.array([get_embedding(feat['word']) for feat in training_features])
test_embeddings  = np.array([get_embedding(feat['word']) for feat in test_features])

# Train and evaluate on the embeddings, not on the one-hot arrays from c)
emb_clf = svm.LinearSVC()
emb_clf.fit(train_embeddings, training_gold_labels)
pred_test_emb = emb_clf.predict(test_embeddings)

print(classification_report(test_gold_labels, pred_test_emb))

The report originally printed for this cell was identical to the one in d), because the classifier had been fitted on `train_array` and evaluated on `test_array` (the one-hot features) instead of on the embeddings. The cell above has to be rerun to obtain the embedding-based scores for the comparison.
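
For the comparison between the one-hot classifier and the embedding classifier, the per-label F1-scores can be put side by side. A minimal sketch, assuming two lists of predictions for the same gold labels. The helper `f1_by_label` and the toy labels are hypothetical and do not reproduce any result from this notebook:

```python
from sklearn.metrics import classification_report

def f1_by_label(gold, predicted):
    """Return {label: F1} for the individual labels, skipping the summary rows."""
    report = classification_report(gold, predicted, output_dict=True, zero_division=0)
    return {label: scores['f1-score'] for label, scores in report.items()
            if isinstance(scores, dict) and not label.endswith('avg')}

# Toy gold labels and two hypothetical sets of predictions
toy_gold = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'O']
toy_pred_onehot = ['B-PER', 'O', 'O', 'B-LOC', 'O', 'O']
toy_pred_emb = ['B-PER', 'I-PER', 'O', 'O', 'O', 'O']

f1_onehot = f1_by_label(toy_gold, toy_pred_onehot)
f1_emb = f1_by_label(toy_gold, toy_pred_emb)

# Positive differences mean the embedding predictions score higher on that label
for label in sorted(f1_onehot):
    print(f"{label:6s} one-hot {f1_onehot[label]:.2f}  embeddings {f1_emb[label]:.2f}  "
          f"diff {f1_emb[label] - f1_onehot[label]:+.2f}")
```

With the real data, `toy_gold` would be `test_gold_labels` and the two prediction lists would come from the two trained classifiers.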



## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the features representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels

In [1]:
import pandas

In [5]:
##### Adapt the path to point to your local copy of NERC_datasets
path = r'C:\Users\krzys\OneDrive\Pulpit\TextMining\ba-text-mining\lab_sessions\lab4\ner_dataset.csv'
kaggle_dataset = pandas.read_csv(path, encoding='latin1', on_bad_lines='skip')

In [6]:
len(kaggle_dataset)

1048575

In [7]:
df_train = kaggle_dataset[:100000]
df_test = kaggle_dataset[100000:120000]
print(len(df_train), len(df_test))

100000 20000


In [8]:
from sklearn.feature_extraction import DictVectorizer

def extract_features(row):
    word = str(row['Word'])  # guard against non-string cells (e.g. NaN)
    features = {
        'word': word,
        'word.lower': word.lower(),
        'is_upper': word.isupper(),
        'is_title': word.istitle(),
        'is_digit': word.isdigit()
    }
    return features

# Create a list of feature dictionaries for the training and test sets
train_features = df_train.apply(extract_features, axis=1).tolist()
test_features = df_test.apply(extract_features, axis=1).tolist()

# Extract the NERC labels into lists (assuming the column 'Tag' contains the labels)
train_labels = df_train['Tag'].tolist()
test_labels = df_test['Tag'].tolist()

# Combine training and test features for a unified one-hot encoding using DictVectorizer
# The default sparse output keeps memory use manageable for 120000 rows
all_features = train_features + test_features
dv = DictVectorizer()
X_all = dv.fit_transform(all_features)

# Separate back into training and test feature matrices
X_train = X_all[:len(train_features)]
X_test = X_all[len(train_features):]

**[4 points] b. Train and evaluate the model and provide the classification report:**
* use the SVM to predict NERC labels on the test data
* evaluate the performance of the SVM on the test data

Analyze the performance per NERC label.
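
The training and evaluation step for this exercise can follow the same pattern as in Exercise 1. A minimal sketch on toy data, assuming feature dictionaries and tag lists of the kind built in a). All `toy_*` names and values are hypothetical, and the toy report says nothing about performance on the real dataset:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

# Toy feature dicts and tags (hypothetical)
toy_train_feats = [{'word': 'London', 'is_title': True}, {'word': 'is', 'is_title': False},
                   {'word': 'Paris', 'is_title': True}, {'word': 'in', 'is_title': False}]
toy_train_tags = ['B-geo', 'O', 'B-geo', 'O']
toy_test_feats = [{'word': 'Berlin', 'is_title': True}, {'word': 'is', 'is_title': False}]
toy_test_tags = ['B-geo', 'O']

# Vectorize train and test together, then split by the number of training rows
toy_dv = DictVectorizer()
toy_X = toy_dv.fit_transform(toy_train_feats + toy_test_feats)
toy_X_train = toy_X[:len(toy_train_feats)]
toy_X_test = toy_X[len(toy_train_feats):]

# Fit the linear SVM on the training rows and predict the test rows
toy_clf = LinearSVC(random_state=0)
toy_clf.fit(toy_X_train, toy_train_tags)
toy_pred = toy_clf.predict(toy_X_test)

print(classification_report(toy_test_tags, toy_pred, zero_division=0))
```

With the real data, `X_train`, `train_labels`, `X_test` and `test_labels` from a) would take the place of the toy variables.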

## End of this notebook