# Exercise 2.2

## Goal

The performance of machine learning systems directly depends on the quality of input features. In this exercise, you will investigate the impact of individual features on a system for named entity recognition: what does the inclusion of each individual feature do to the results? And what happens when they are combined?



## Acknowledgement

This exercise made use of examples from the following exercise (in the HLT course):

https://github.com/cltl/ma-hlt-labs/

Lab3: machine learning


## Procedure

This notebook will provide the code for running the experiments outlined above. You will only need to make minor adaptations to run the feature ablation analysis. This notebook was developed for another more introductory course where students did not need to generate their own features. Please take that into account while reading this code (i.e. you can use this as an example, but it will not work one-to-one on your own data).

The notebooks and set up have been designed for educational purposes: design choices are based on clearly illustrating what is going on and facilitating the exercises.

## The Data

The data of the original assignment been preprocessed to make some useful features directly availabe (as you were recommended to do as well in Assignment 2).

The format of the conll files provided to the students in this original exercise was:

Token  Preceding_token  Capitalization  POS-tag  Chunklabel  Goldlabel

The first lines look like this:

-DOCSTART-      FULLCAP -X-     -X-     O
  
EU              FULLCAP NNP     B-NP    B-ORG

rejects EU      LOWCASE VBZ     B-VP    O

Preceding_token: 
This column provides the token preceding the current token. (This is an empty space if there is no previous token).

Capitalization: 
This column provides information on the capitalization of the token.

## Packages

We will make use of the following packages:

* scikit-learn : provides lots of useful implementations for machine learning (and has relatively good documentation!)
* csv: a light-weight package to deal with data represented in csv (or related formats such as tsv)
* gensim: a useful package for working with word embeddings
* numpy: a packages that (among others) provides useful datastructures and operations to work with vectors

Some notes on design decisions (feel free to ignore these if this is all new to you):

* We are using csv rather than (the more common) pandas for working with the conll files, because pandas standardly applies type conversion, which we do not want when dealing with text that contains numbers (fixing this will make the code look more complex).
* scikit-learn provides several machine learning algorithms, but this is not the focus of this exercise. We are using logistic regression, because it serves the purpose of our experiments and is relatively efficient.



In [1]:
#this cell imports all the modules we'll need. Make sure to run this once before running the other cells


#sklearn is scikit-learn
import sklearn
import csv
import gensim
import numpy as np
import pandas as pd


from sklearn import metrics
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


In [5]:
#all file paths
#adapt path if needed

# original training/dev data
#Setting some variables that we will use multiple times
trainfile = '../data/conll2003.train.conll'
testfile = '../data/conll2003.dev.conll'

# train/dev data after feature engineering
train_path = '../data/train.csv'
test_path = '../data/dev.csv'
feature_to_index = {
    'token': 1,
    'pos': 2,
    'chunk_tag': 3,
    'token_right': 5,
    'token_left': 6,
    'cap_type': 7
}

# word2vec model
# this step takes a while
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format(
    '../data/GoogleNews-vectors-negative300.bin',
    binary=True)

# Part 1: Traditional Features

In this first part, we will explore the impact of various features on named entity recognition.
We will use so-called traditional features, where the feature values (strings) are presented by one-hot encoding 

## Step 1: A Basic Classifier

We will first walk through the process of creating and evaluating a simple classifier that only uses the token itself as a feature. In the next step, we will run evaluations on this basic system.

This is generally a good way to start experimenting: first walk through the entire experimental process with a very basic, easy to create system to see if everything works, there are no problems with the data etc. You can then build up from there towards a more sophististicated system.


In [6]:
#functions for feature extraction and training a classifier


## For documentation on how to create input representations of features in scikit-learn:
# https://scikit-learn.org/stable/modules/feature_extraction.html


def extract_features_token_only_and_labels(conllfile):
    '''Function that extracts features and gold label from preprocessed conll (here: tokens only).
    
    :param conllfile: path to the (preprocessed) conll file
    :type conllfile: string
    
    
    :return features: a list of dictionaries, with key-value pair providing the value for the feature `token' for individual instances
    :return labels: a list of gold labels of individual instances
    '''

    features = []
    labels = []
    conllinput = open(conllfile, 'r')
    #delimiter indicates we are working with a tab separated value (default is comma)
    #quotechar has as default value '"', which is used to indicate the borders of a cell containing longer pieces of text
    #in this file, we have only one token as text, but this token can be '"', which then messes up the format. We set quotechar to a character that does not occur in our file
    csvreader = csv.reader(conllinput, delimiter='\t',quotechar='|')
    for row in csvreader:
        #I preprocessed the file so that all rows with instances should contain 6 values, the others are empty lines indicating the beginning of a sentence
        if len(row) == 4:
            #structuring feature value pairs as key-value pairs in a dictionary
            #the first column in the conll file represents tokens
            feature_value = {'Token': row[0]}
            features.append(feature_value)
            #The last column provides the gold label (= the correct answer).
            labels.append(row[-1])

    return features, labels



def create_vectorizer_and_classifier(features, labels):
    '''
    Function that takes feature-value pairs and gold labels as input and trains a logistic regression classifier
    
    :param features: feature-value pairs
    :param labels: gold labels
    :type features: a list of dictionaries
    :type labels: a list of strings
    
    :return lr_classifier: a trained LogisticRegression classifier
    :return vec: a DictVectorizer to which the feature values are fitted. 
    '''

    vec = DictVectorizer()
    #fit creates a mapping between observed feature values and dimensions in a one-hot vector, transform represents the current values as a vector
    tokens_vectorized = vec.fit_transform(features)
    lr_classifier = LogisticRegression(solver='saga')
    lr_classifier.fit(tokens_vectorized, labels)

    return lr_classifier, vec

#extract features and labels:
feature_values, labels = extract_features_token_only_and_labels(trainfile)
#create vectorizer and trained classifier:
lr_classifier, vectorizer = create_vectorizer_and_classifier(feature_values, labels)


## Step 2: Evaluation

We will now run a basic evaluation of the system on a test file. 
Two important properties of the test file:

1. the test file and training file are independent sets (if they contain identical examples, this is coincidental)
2. the test file is preprocessed in the exact same way as the training file 

The first function runs our classifier on the test data.

The second function prints out a confusion matrix (comparing predictions and gold labels per class). 
You can find more information on confusion matrices here: https://www.geeksforgeeks.org/confusion-matrix-machine-learning/

The third function prints out the macro precision, recall and f-score of the system

In [7]:

def get_predicted_and_gold_labels_token_only(testfile, vectorizer, classifier):
    '''
    Function that extracts features and runs classifier on a test file returning predicted and gold labels
    
    :param testfile: path to the (preprocessed) test file
    :param vectorizer: vectorizer in which the mapping between feature values and dimensions is stored
    :param classifier: the trained classifier
    :type testfile: string
    :type vectorizer: DictVectorizer
    :type classifier: LogisticRegression()
    
    
    
    :return predictions: list of output labels provided by the classifier on the test file
    :return goldlabels: list of gold labels as included in the test file
    '''
    
    #we use the same function as above (guarantees features have the same name and form)
    sparse_feature_reps, goldlabels = extract_features_token_only_and_labels(testfile)
    #we need to use the same fitting as before, so now we only transform the current features according to this mapping (using only transform)
    test_features_vectorized = vectorizer.transform(sparse_feature_reps)
    predictions = classifier.predict(test_features_vectorized)
    
    return predictions, goldlabels

def print_confusion_matrix(predictions, goldlabels):
    '''
    Function that prints out a confusion matrix
    
    :param predictions: predicted labels
    :param goldlabels: gold standard labels
    :type predictions, goldlabels: list of strings
    '''
    
    
    
    #based on example from https://datatofish.com/confusion-matrix-python/ 
    data = {'Gold':    goldlabels, 'Predicted': predictions    }
    df = pd.DataFrame(data, columns=['Gold','Predicted'])

    confusion_matrix = pd.crosstab(df['Gold'], df['Predicted'], rownames=['Gold'], colnames=['Predicted'])
    print (confusion_matrix)


def print_precision_recall_fscore(predictions, goldlabels):
    '''
    Function that prints out precision, recall and f-score
    
    :param predictions: predicted output by classifier
    :param goldlabels: original gold labels
    :type predictions, goldlabels: list of strings
    '''
    
    precision = metrics.precision_score(y_true=goldlabels,
                        y_pred=predictions,
                        average='macro')

    recall = metrics.recall_score(y_true=goldlabels,
                     y_pred=predictions,
                     average='macro')


    fscore = metrics.f1_score(y_true=goldlabels,
                 y_pred=predictions,
                 average='macro')

    print('P:', precision, 'R:', recall, 'F1:', fscore)
    
#vectorizer and lr_classifier are the vectorizer and classifiers created in the previous cell.
#it is important that the same vectorizer is used for both training and testing: they should use the same mapping from values to dimensions
predictions, goldlabels = get_predicted_and_gold_labels_token_only(testfile, vectorizer, lr_classifier)
print_confusion_matrix(predictions, goldlabels)
print_precision_recall_fscore(predictions, goldlabels)

Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1305      15    101      4      7       0      6      4    395
B-MISC        41     603     14      8      0      12      2      1    241
B-ORG         78      23    690      5     11       3     38     14    479
B-PER         16       3      2    873      0       0      1    104    843
I-LOC         13       2      1      0    150       3     13      6     69
I-MISC         2      27      7      2      7     145      2      4    150
I-ORG         36      11     47      4     38       5    263      5    342
I-PER          6       2      5    102      0       0      1    292    899
O              3      10      4      0      1      10     11      1  42719
P: 0.8114098816314947 R: 0.5475891713668707 F1: 0.6374572803907079


## Step 3: A More Elaborate System

Now that we have run a basic experiment, we are going to investigate alternatives. In this exercise, we only focus on features. We will continue to use the same logistic regression classifier throughout the exercise.

We want to investigate the impact of individual features. We will thus use a function that allows us to specify whether we include a specific feature or not. The features we have at our disposal are:

* the token itself (as used above)
* the preceding token
* the capitalization indication (see above for values that this takes)
* the pos-tag of the token
* the chunklabel of the chunk the token is part of

In [8]:
# the functions with multiple features and analysis

#defines the column in which each feature is located (note: you can also define headers and use csv.DictReader)
#feature_to_index = {'Token': 0, 'Prevtoken': 1, 'Cap': 2, 'Pos': 3, 'Chunklabel': 4}
feature_to_index = {'token':1, 'pos':2, 'chunk_tag':3, 'token_left':5, 'token_right':6, 'cap_type':7}
#['', 'token', 'pos', 'chunk_tag', 'target', 'token_left', 'token_right', 'cap_type']


def extract_features_and_gold_labels(conllfile, selected_features):
    '''Function that extracts features and gold label from preprocessed conll (here: tokens only).
    
    :param conllfile: path to the (preprocessed) conll file
    :type conllfile: string
    
    
    :return features: a list of dictionaries, with key-value pair providing the value for the feature `token' for individual instances
    :return labels: a list of gold labels of individual instances
    '''

    features = []
    labels = []
    conllinput = open(conllfile, 'r')
    #delimiter indicates we are working with a tab separated value (default is comma)
    #quotechar has as default value '"', which is used to indicate the borders of a cell containing longer pieces of text
    #in this file, we have only one token as text, but this token can be '"', which then messes up the format. We set quotechar to a character that does not occur in our file
    csvreader = csv.reader(conllinput, delimiter=',',quotechar='|')


    for index,row in enumerate(csvreader):
        #I preprocessed the file so that all rows with instances should contain 6 values, the others are empty lines indicating the beginning of a sentence
        if index == 0:
            continue
        if len(row) == 8:
            #structuring feature value pairs as key-value pairs in a dictionary
            #the first column in the conll file represents tokens
            feature_value = {}
            for feature_name in selected_features:
                row_index = feature_to_index.get(feature_name)
                feature_value[feature_name] = row[row_index]
            features.append(feature_value)
            #The last column provides the gold label (= the correct answer).
            labels.append(row[4])
    return features, labels

def get_predicted_and_gold_labels(testfile, vectorizer, classifier, selected_features):
    '''
    Function that extracts features and runs classifier on a test file returning predicted and gold labels
    
    :param testfile: path to the (preprocessed) test file
    :param vectorizer: vectorizer in which the mapping between feature values and dimensions is stored
    :param classifier: the trained classifier
    :type testfile: string
    :type vectorizer: DictVectorizer
    :type classifier: LogisticRegression()
    
    
    
    :return predictions: list of output labels provided by the classifier on the test file
    :return goldlabels: list of gold labels as included in the test file
    '''

    #we use the same function as above (guarantees features have the same name and form)
    features, goldlabels = extract_features_and_gold_labels(testfile, selected_features)
    #we need to use the same fitting as before, so now we only transform the current features according to this mapping (using only transform)
    test_features_vectorized = vectorizer.transform(features)
    predictions = classifier.predict(test_features_vectorized)

    return predictions, goldlabels

#define which from the available features will be used (names must match key names of dictionary feature_to_index)
all_features = ['token', 'pos', 'chunk_tag', 'token_left', 'token_right','cap_type']

sparse_feature_reps, labels = extract_features_and_gold_labels(train_path, all_features)
#we can use the same function as before for creating the classifier and vectorizer

lr_classifier, vectorizer = create_vectorizer_and_classifier(sparse_feature_reps, labels)
#when applying our model to new data, we need to use the same features

predictions, goldlabels = get_predicted_and_gold_labels(test_path, vectorizer, lr_classifier, all_features)

print_confusion_matrix(predictions, goldlabels)

print_precision_recall_fscore(predictions, goldlabels)



Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1288      22     66     43      2       0      9     14    149
B-MISC        24     692     35     22      0       2      6     14     79
B-ORG         50      23    957     72      0       1     26     32     62
B-PER         41       7     23   1325      0       1      8     25     63
I-LOC          6       0      2      0    165       3     14     20     23
I-MISC         9      20      3      3      5     212      6     25     51
I-ORG         17       2     17      5     11       6    510     45     96
I-PER          5       1      5     16      0       3      9    971     44
O              4      14     25     38      0       5     21     16  37566
P: 0.8846445875479344 R: 0.8056081274760225 F1: 0.8397141262252669


## Step 4: Feature Ablation Analysis

If all worked well, the system that made use of all features worked better than the system with just the tokens.
We now want to know which of the features contributed to this improved: do we want to include all features?
Or just some?

We can investigate this using *feature ablation analysis*. This means that we systematically test what happens if we add or remove a specific feature. Ideally, we investigate all possible combinations.

The cell below illustrates how you can use the code above to investigate a system with three features. You can modify the selected features to try out different combinations. You can either do this manually and rerun the cell or write a function that creates list of all combinations you want to tests and runs them one after the other.

Include your results in the report of this week.

In [6]:
#{'token':1, 'pos':2, 'chunk_tag':3, 'token_left':5, 'token_right':6, 'cap_type':7}
# example of system with just one additional feature
#define which from the available features will be used (names must match key names of dictionary feature_to_index)
selected_features = ['token', 'pos', 'chunk_tag']

feature_values, labels = extract_features_and_gold_labels(train_path, selected_features)
#we can use the same function as before for creating the classifier and vectorizer
lr_classifier, vectorizer = create_vectorizer_and_classifier(feature_values, labels)
#when applying our model to new data, we need to use the same features
predictions, goldlabels = get_predicted_and_gold_labels(test_path, vectorizer, lr_classifier, selected_features)
print_confusion_matrix(predictions, goldlabels)
print_precision_recall_fscore(predictions, goldlabels)




Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1246      10     93    118      8       0     12     45     61
B-MISC        31     589     26     31      0      12     11     55    119
B-ORG         73      21    716    114      4       2     60    113    120
B-PER         14       1      7   1148      0       0      9    227     87
I-LOC          6       2      1      1    134       3     32     39     15
I-MISC         3      26      7      6      8     157      8     54     65
I-ORG         33       6     52     13     23       5    354     95    128
I-PER          1       2      4     41      0       0     11    957     38
O              2       8     16     67      0       9     15    186  37386
P: 0.7888709858928774 R: 0.6949827135561947 F1: 0.7245114308569229


In [7]:
import itertools
#code taken from Google
feature_combinations = list(itertools.combinations(feature_to_index.keys(), 3))
for feature_c in feature_combinations:
    selected_features = list(feature_c)
    feature_values, labels = extract_features_and_gold_labels(train_path, selected_features)
    lr_classifier, vectorizer = create_vectorizer_and_classifier(
        feature_values, labels)
    predictions, goldlabels = get_predicted_and_gold_labels(
        test_path, vectorizer, lr_classifier, selected_features)
    print(f'-----------{feature_c}---------------')
    print_confusion_matrix(predictions, goldlabels)
    print_precision_recall_fscore(predictions, goldlabels)
    print()
    print()



-----------('token', 'pos', 'chunk_tag')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1246      10     93    118      8       0     12     45     61
B-MISC        31     589     26     31      0      12     11     55    119
B-ORG         73      21    716    114      4       2     60    113    120
B-PER         14       1      7   1148      0       0      9    227     87
I-LOC          6       2      1      1    134       3     32     39     15
I-MISC         3      26      7      6      8     157      8     54     65
I-ORG         33       6     52     13     23       5    354     95    128
I-PER          1       2      4     41      0       0     11    957     38
O              2       8     16     67      0       9     15    186  37386
P: 0.7888709858928774 R: 0.6949827135561947 F1: 0.7245114308569229






-----------('token', 'pos', 'token_left')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1231       6     61    110      4       1      6     25    149
B-MISC        21     650     10     45      0      12      6      7    123
B-ORG         66      19    801    114      1       3     68     48    103
B-PER         17       2     20   1253      0       1      5     81    114
I-LOC         28       1      6     10    131       5     17      9     26
I-MISC         2      43      3     27      1     172      1      2     83
I-ORG         31       6     78     44      8       5    377     24    136
I-PER         26       1     15    225      0       1      0    676    110
O             13       5     16    127      1       4     14     14  37495
P: 0.8247908252203522 R: 0.6950905639767825 F1: 0.74521958218065






-----------('token', 'pos', 'token_right')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1326      12     69     71      3       0      3     34     75
B-MISC        38     601     49     24      0       5      6     19    132
B-ORG         78      21    861     85      0       1     18     42    117
B-PER         48       2     31   1228      0       0      4     87     93
I-LOC          9       2      1      0    157       2     19     30     13
I-MISC         6       8     10      1      7     195     12     34     61
I-ORG         26       5      6      6     29       7    441     62    127
I-PER          4       0      5      9      0       0     12    988     36
O              8       9     34     47      0       6     29    137  37419
P: 0.8443571761301695 R: 0.7618243008856151 F1: 0.793045657952427






-----------('token', 'pos', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1273      22     90    103      6       0     14      4     81
B-MISC        34     630     25     52      0       7     17      3    106
B-ORG         76      22    799    108      9       3     67     17    122
B-PER         16      16      6   1270      0       0      9     92     84
I-LOC         17       2      1     26    128       3     32      5     19
I-MISC         2      33     10     37      9     144      8      6     85
I-ORG         40       3     60     84     25       5    347      8    137
I-PER         10       4      5    546      0       0     10    420     59
O              6      29     16     49      1       8     23      6  37551
P: 0.777156497451169 R: 0.6542920979487996 F1: 0.6949804569101775






-----------('token', 'chunk_tag', 'token_left')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1147       4     46      9      4       0      2      6    375
B-MISC        12     626      3      6      0      11      1      1    214
B-ORG         47      19    734     12      1       1     48     12    349
B-PER         13       1     17    799      0       0      0     16    647
I-LOC          7       0      1      0    137       5     14      5     64
I-MISC         1      37      1      1      0     169      0      1    124
I-ORG         22       6     37      1      8       5    349     17    264
I-PER          4       0      3     29      0       0      0    582    436
O              1       2     12      4      0       7     17     47  37599
P: 0.8849921200287638 R: 0.6341786145749984 F1: 0.7319999052014984






-----------('token', 'chunk_tag', 'token_right')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1206      18     50     16      2       0      9      0    292
B-MISC        34     576     15      6      0       3      1      0    239
B-ORG         43      22    734     10      0       0      6      1    407
B-PER         13       1     14    919      0       0      2      5    539
I-LOC          6       2      0      0    158       2     14      8     43
I-MISC         2       7      0      0      7     193      3      3    119
I-ORG         15       5      3      1     28       7    385      7    258
I-PER          0       0      0      5      0       0      2    703    344
O              3       7     13      3      1       2     13      2  37645
P: 0.9099408769372482 R: 0.6773992746113658 F1: 0.7707006274706293






-----------('token', 'chunk_tag', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1241      18     94     86      8       0     11     45     90
B-MISC        38     610     20     42      0      11      3     62     88
B-ORG         75      24    764     96      4       2     64     88    106
B-PER         15       2      4   1202      0       0      6    215     49
I-LOC          6       2      1      0    134       3     32     38     17
I-MISC         2      27      7      4      8     147      6     49     84
I-ORG         31      11     52     11     23       5    347     99    130
I-PER          1       2      4     43      0       0      3    949     52
O              4       8     16    100      0      11     18     43  37489
P: 0.7961848095840558 R: 0.7007199187515929 F1: 0.7325121385037207


-----------('token', 'token_left', 'token_rig



-----------('token', 'token_left', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1231      11     62     68      4       0      3     29    185
B-MISC        23     674      7     55      0      12      3     15     85
B-ORG         61      19    835    100      1       1     68     48     90
B-PER         32       2     21   1287      0       1      4     96     50
I-LOC         28       1      6     10    132       5     15     10     26
I-MISC         5      45      1     26      1     174      0      3     79
I-ORG         38       6     66     48      8       5    379     25    134
I-PER         58       0     10    230      0       1      0    677     78
O             16       7     14     92      2       8     11     25  37514
P: 0.824723962208154 R: 0.7053780108224728 F1: 0.7514205407824479






-----------('token', 'token_right', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1327      21     76     32      4       0      3     36     94
B-MISC        42     622     52     21      0       5      4     26    102
B-ORG         74      25    905     46      0       0     20     45    108
B-PER         50       3     34   1151      0       0      3     89    163
I-LOC          6       2      1      0    160       2     19     27     16
I-MISC         7      10     11      4      7     197      9     34     55
I-ORG         25       5      8      5     28       8    459     62    109
I-PER          5       0      5      8      0       0      6    979     51
O              9       8     28     25      0       9     24     25  37561
P: 0.8564185039912492 R: 0.7672174685056465 F1: 0.8035107779624737






-----------('pos', 'chunk_tag', 'token_left')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC        853      15     46    252     11       2      7     24    383
B-MISC        25     284     16     88      7       9     21     23    401
B-ORG        145      19    467    165      2       6    120     43    256
B-PER         98       3     30   1154      1       2     10     22    173
I-LOC         31       2      6     23     36       2     19     48     66
I-MISC        11      41      2     31      5      59      6     17    162
I-ORG         28       9     50     50     12       4    250    114    192
I-PER          9       2      5    111     19       5     12    733    158
O            105      71     56    240      8       9     24     25  37151
P: 0.6299312263402735 R: 0.48668137974918463 F1: 0.5255604858464089






-----------('pos', 'chunk_tag', 'token_right')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC        973       7    162    148      0       0      2     39    262
B-MISC        56      40    116     55      0       4     13     38    552
B-ORG        192      23    603    109      0       0      3     54    239
B-PER        280       6     78    858      0       3      4     92    172
I-LOC          4       0      2      0    137       2     26     46     16
I-MISC         6       2      9      7      9     117     33     56     95
I-ORG         39       2      2     13     39      17    327    132    138
I-PER          5       0      4      2      0       1     12    987     43
O            201       9    101     79      2      15     29    183  37070
P: 0.6669516332210849 R: 0.5604221679369109 F1: 0.5788235086069191






-----------('pos', 'chunk_tag', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  I-ORG  I-PER      O
Gold                                                
B-LOC       1193      28      4      2    197    169
B-MISC       145     388     15     24    164    138
B-ORG        626      29     11     18    298    241
B-PER        763      20      5      4    618     83
I-LOC          5       3      0      2    170     53
I-MISC        14      15      0      8    173    124
I-ORG         51       3      2     32    456    165
I-PER         16       3      4     16    941     74
O            654      82      6      3    173  36771


  _warn_prf(average, modifier, msg_start, len(result))


P: 0.31317008805878843 R: 0.3461553080887 F1: 0.2801726295012383






-----------('pos', 'token_left', 'token_right')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1051      14     89    173      1       0      7     33    225
B-MISC        39     306     44     38      0       3      8     10    426
B-ORG        132      31    725    104      0       2     13     19    197
B-PER         89       4     53   1127      0       4      4     20    192
I-LOC          9       1      5      0    140       4     18     14     42
I-MISC         9      23      5      2      4     154     18      6    113
I-ORG         38       2     13      7     14      12    412     47    164
I-PER         13       1      3     14      1       3      4    898    117
O            146      57     74     81      2      10     24     18  37277
P: 0.7904740895307667 R: 0.649070404352056 F1: 0.7027946043151473






-----------('pos', 'token_left', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC        861      36    159    177      2       2      6     29    321
B-MISC        34     556     32     91      3       8     20      8    122
B-ORG        151      28    589    154      0       4     96     54    147
B-PER        111      21     75   1133      0       3      3     33    114
I-LOC         99       3     13     35      4       1      3     12     63
I-MISC        24      49     12     51      0      69      3      9    117
I-ORG         90      17    142    121      0       1    132     45    161
I-PER        122       6     11    222      0       4      2    548    139
O            109      67     98    175      0       4     13     39  37184
P: 0.6228742877033762 R: 0.48150883344392553 F1: 0.5032192468107788






-----------('pos', 'token_right', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC        990      28    198    120      1       0      1    105    150
B-MISC        64     439    138     50      0       6      9     42    126
B-ORG        198      45    668    101      0       1      5     61    144
B-PER        253      21     89    908      0       3      3    127     89
I-LOC          4       3      2      0    136       2     23     41     22
I-MISC         9       6      9      8      9     130     32     48     83
I-ORG         42       4      4     13     39      17    332    126    132
I-PER          5       4      5      1      0       1     11    969     58
O            190      98    141     75      1      14     35     67  37068
P: 0.6988953946721076 R: 0.6246869095309038 F1: 0.6457873795717668






-----------('chunk_tag', 'token_left', 'token_right')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC        623       2     31     22      0       0      3      3    909
B-MISC        10     176      3      2      0       3      0      0    680
B-ORG         52       9    387     11      0       0      5      2    757
B-PER         22       0      7    601      0       1      1      0    861
I-LOC          1       0      0      0    127       5     14      4     82
I-MISC         0      13      0      0     15     110      7      7    182
I-ORG         15       0      1      3     11       5    302     16    356
I-PER          3       0      0      0      0       0      1    722    328
O            145      27     67     31      7      19     35     45  37313
P: 0.8172135332646592 R: 0.47631444978092496 F1: 0.5801935822953305






-----------('chunk_tag', 'token_left', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC        827      45     33     72      5       2     14     49    546
B-MISC        34     400     13    107      5       8     12     39    256
B-ORG        138      21    504     83      2       6    114     46    309
B-PER         96       4     31    873      0       2      6     32    449
I-LOC         39      11      4     19     25       2     19     71     43
I-MISC         8      52      3     31      5      80      5     41    109
I-ORG         33      22     43     59     11       5    234    144    158
I-PER         18       0      5    110     10       5     21    783    102
O            188      66     73    120      2      10     18     36  37176
P: 0.6340176346871568 R: 0.486641617042126 F1: 0.5308836239080277






-----------('chunk_tag', 'token_right', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC        773     115     91    112      0       0      1     32    469
B-MISC        82     248     79     84      0       5      1     88    287
B-ORG        111     164    548     83      0       0      2     57    258
B-PER        116      13     77    812      0       3      4     91    377
I-LOC          4       0      4      0    136       2     23     42     22
I-MISC         7       8      3      9     18     114     32     54     89
I-ORG         42       1      3     14     37      18    326    134    134
I-PER          5       0      5      2      0       1      4    971     66
O            128      36    158     84      1      15     28     44  37195
P: 0.6753943650624467 R: 0.5615451895581622 F1: 0.5990955307377868






-----------('token_left', 'token_right', 'cap_type')---------------
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC        986      58     71     84      2       1      9     26    356
B-MISC        73     468     39     77      0       5      3     17    192
B-ORG        114      67    720     90      0       2     16     20    194
B-PER         90      13     49   1078      0       4      6     23    230
I-LOC         13       2      2      3    136       8     22     20     27
I-MISC         9      35      2     12      5     158     19     14     80
I-ORG         43       5     12     21     15      14    422     55    122
I-PER         18       0      3     37      1       3      4    908     80
O            124      49     80    104      1      14     28     43  37246
P: 0.7745616083917495 R: 0.6629838738587819 F1: 0.7087758786365671




# Part 2: One-hot versus Embeddings

In this second part of the exercise, we will compare results using one-hot encodings to pretrained word embeddings.

## One-hot representation

In one-hot representation, each feature value is represented by an n-dimensional vector, where n corresponds to the number of possible values the feature can take. In our system, the Token feature can take the value of each token that occurs at least once in the corpus. This feature thus uses a vector with the size of the vocabulary in the corpus. Each possible value is associated with a specific dimension. If this value is represented, that dimension will receive the value 1 and all other dimensions will have the value 0.

The system receive a concatenation of all feature representations as input.


## What does one-hot look like?

We will start with an illustration of a one-hot representation. We will use the capitalization feature for this: it has 6 possible values and is therefore represented by a 6-dimensional vector. If you would like a more precise look, you may consider creating a toy example of a few lines, in which the capitalization feature has different values.

In [10]:
# create classifier with caps feature only and print vectorizer, then with token only (but you see less)

selected_features = ['cap_type']

feature_values, labels = extract_features_and_gold_labels(train_path, selected_features)

#creating a vectorizing
vectorizer = DictVectorizer()
#fitting the values to dimensions (creating a mapping) and transforming the current observations according to this mapping
capitalization_vectorized = vectorizer.fit_transform(feature_values)
print(capitalization_vectorized.toarray())

[[0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 ...
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]]


## Using word embeddings

We are now going to use word embeddings to represent tokens. We load a pretrained distributional semantic model.
You can use the same model as in Exercise 2.1. We tested the exercise with the same model (GoogleNews negative sampling 300 dimensions) as Exercise 2.1 as well.

Note: loading the model may take a while. You probably want to run that only once.


In [11]:
def extract_embeddings_as_features_and_gold(conllfile,word_embedding_model):
    '''
    Function that extracts features and gold labels using word embeddings
    
    :param conllfile: path to conll file
    :param word_embedding_model: a pretrained word embedding model
    :type conllfile: string
    :type word_embedding_model: gensim.models.keyedvectors.Word2VecKeyedVectors
    
    :return features: list of vector representation of tokens
    :return labels: list of gold labels
    '''
    labels = []
    features = []
    
    conllinput = open(conllfile, 'r')
    csvreader = csv.reader(conllinput, delimiter=',',quotechar='|')
    for index, row in enumerate(csvreader):
        if index == 0:
            continue
        if len(row) == 8:
            if row[1] in word_embedding_model:
                vector = word_embedding_model[row[1]]
            else:
                vector = [0]*300
            features.append(vector)
            labels.append(row[4])
    return features, labels

def create_classifier(features, labels):
    '''
    Function that creates classifier from features represented as vectors and gold labels
    
    :param features: list of vector representations of tokens
    :param labels: list of gold labels
    :type features: list of vectors
    :type labels: list of strings
    
    :returns trained logistic regression classifier
    '''
    
    
    lr_classifier = LogisticRegression(solver='saga')
    lr_classifier.fit(features, labels)
    
    return lr_classifier
    
    
def label_data_using_word_embeddings(testfile, word_embedding_model, classifier):
    '''
    Function that extracts word embeddings as features and gold labels from test data and runs a classifier
    
    :param testfile: path to test file
    :param word_embedding_model: distributional semantic model
    :param classifier: trained classifier
    :type testfile: string
    :type word_embedding_model: gensim.models.keyedvectors.Word2VecKeyedVectors
    :type classifier: LogisticRegression
    
    :return predictions: list of predicted labels
    :return labels: list of gold labels
    '''
    
    dense_feature_representations, labels = extract_embeddings_as_features_and_gold(testfile,word_embedding_model)
    predictions = classifier.predict(dense_feature_representations)
    
    return predictions, labels


# I printing announcements of where the code is at (since some of these steps take a while)

print('Extracting dense features...')
dense_feature_representations, labels = extract_embeddings_as_features_and_gold(train_path,word_embedding_model)
print('Training classifier....')
classifier = create_classifier(dense_feature_representations, labels)
print('Running evaluation...')
predicted, gold = label_data_using_word_embeddings(test_path, word_embedding_model, classifier)
print_confusion_matrix(predictions, goldlabels)
print_precision_recall_fscore(predicted, gold)

Extracting dense features...
Training classifier....
Running evaluation...
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1288      22     66     43      2       0      9     14    149
B-MISC        24     692     35     22      0       2      6     14     79
B-ORG         50      23    957     72      0       1     26     32     62
B-PER         41       7     23   1325      0       1      8     25     63
I-LOC          6       0      2      0    165       3     14     20     23
I-MISC         9      20      3      3      5     212      6     25     51
I-ORG         17       2     17      5     11       6    510     45     96
I-PER          5       1      5     16      0       3      9    971     44
O              4      14     25     38      0       5     21     16  37566
P: 0.7398747647969717 R: 0.6551042215507106 F1: 0.6907918605672456


## Including the preceding token

We can include the preceding token as a feature in a similar way. We simply concatenate the two vectors.

In [10]:

def extract_embeddings_of_current_and_preceding_as_features_and_gold(conllfile,word_embedding_model):
    '''
    Function that extracts features and gold labels using word embeddings for current and preceding token
    
    :param conllfile: path to conll file
    :param word_embedding_model: a pretrained word embedding model
    :type conllfile: string
    :type word_embedding_model: gensim.models.keyedvectors.Word2VecKeyedVectors
    
    :return features: list of vector representation of tokens
    :return labels: list of gold labels
    '''
    labels = []
    features = []
    
    conllinput = open(conllfile, 'r')
    csvreader = csv.reader(conllinput, delimiter=',',quotechar='|')
    for row in csvreader:
        if len(row) == 8:
            if row[1] in word_embedding_model:
                vector1 = word_embedding_model[row[1]]
            else:
                vector1 = [0]*300
            if row[6] in word_embedding_model:
                vector2 = word_embedding_model[row[6]]
            else:
                vector2 = [0]*300
            features.append(np.concatenate((vector1,vector2)))
            labels.append(row[4])
    return features, labels
    
    
def label_data_using_word_embeddings_current_and_preceding(testfile, word_embedding_model, classifier):
    '''
    Function that extracts word embeddings as features (of current and preceding token) and gold labels from test data and runs a trained classifier
    
    :param testfile: path to test file
    :param word_embedding_model: distributional semantic model
    :param classifier: trained classifier
    :type testfile: string
    :type word_embedding_model: gensim.models.keyedvectors.Word2VecKeyedVectors
    :type classifier: LogisticRegression
    
    :return predictions: list of predicted labels
    :return labels: list of gold labels
    '''
    
    features, labels = extract_embeddings_of_current_and_preceding_as_features_and_gold(testfile,word_embedding_model)
    predictions = classifier.predict(features)
    
    return predictions, labels



print('Extracting dense features...')
features, labels = extract_embeddings_of_current_and_preceding_as_features_and_gold(train_path,word_embedding_model)
print('Training classifier...')
#we can use the same function as for just the tokens itself
classifier = create_classifier(features, labels)
print('Running evaluation...')
predicted, gold = label_data_using_word_embeddings_current_and_preceding(test_path, word_embedding_model, classifier)
print_confusion_matrix(predictions, goldlabels)
print_precision_recall_fscore(predicted, gold)

Extracting dense features...
Training classifier...




Running evaluation...
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC        986      58     71     84      2       1      9     26    356
B-MISC        73     468     39     77      0       5      3     17    192
B-ORG        114      67    720     90      0       2     16     20    194
B-PER         90      13     49   1078      0       4      6     23    230
I-LOC         13       2      2      3    136       8     22     20     27
I-MISC         9      35      2     12      5     158     19     14     80
I-ORG         43       5     12     21     15      14    422     55    122
I-PER         18       0      3     37      1       3      4    908     80
O            124      49     80    104      1      14     28     43  37246


  _warn_prf(average, modifier, msg_start, len(result))


P: 0.7646172743895793 R: 0.7090309354680027 F1: 0.7345602505082284


## A mixed system

The code below combines traditional features with word embeddings. Note that we only include features with a limited range of possible values. Combining one-hot token representations (using highly sparse dimensions) with dense representations is generally not a good idea.

In [12]:


def extract_word_embedding(token, word_embedding_model):
    '''
    Function that returns the word embedding for a given token out of a distributional semantic model and a 300-dimension vector of 0s otherwise
    
    :param token: the token
    :param word_embedding_model: the distributional semantic model
    :type token: string
    :type word_embedding_model: gensim.models.keyedvectors.Word2VecKeyedVectors
    
    :returns a vector representation of the token
    '''
    if token in word_embedding_model:
        vector = word_embedding_model[token]
    else:
        vector = [0]*300
    return vector


def extract_feature_values(row, selected_features):
    '''
    Function that extracts feature value pairs from row
    
    :param row: row from conll file
    :param selected_features: list of selected features
    :type row: string
    :type selected_features: list of strings
    
    :returns: dictionary of feature value pairs
    '''
    feature_values = {}
    for feature_name in selected_features:
        r_index = feature_to_index.get(feature_name)
        feature_values[feature_name] = row[r_index]
        
    return feature_values
    
    
def create_vectorizer_traditional_features(feature_values):
    '''
    Function that creates vectorizer for set of feature values
    
    :param feature_values: list of dictionaries containing feature-value pairs
    :type feature_values: list of dictionairies (key and values are strings)
    
    :returns: vectorizer with feature values fitted
    '''
    vectorizer = DictVectorizer()
    vectorizer.fit(feature_values)
    
    return vectorizer
        
    
def combine_sparse_and_dense_features(dense_vectors, sparse_features):
    '''
    Function that takes sparse and dense feature representations and appends their vector representation
    
    :param dense_vectors: list of dense vector representations
    :param sparse_features: list of sparse vector representations
    :type dense_vector: list of arrays
    :type sparse_features: list of lists
    
    :returns: list of arrays in which sparse and dense vectors are concatenated
    '''
    
    combined_vectors = []
    sparse_vectors = np.array(sparse_features.toarray())
    
    for index, vector in enumerate(sparse_vectors):
        combined_vector = np.concatenate((vector,dense_vectors[index]))
        combined_vectors.append(combined_vector)
    return combined_vectors
    

def extract_traditional_features_and_embeddings_plus_gold_labels(conllfile, word_embedding_model, vectorizer=None):
    '''
    Function that extracts traditional features as well as embeddings and gold labels using word embeddings for current and preceding token
    
    :param conllfile: path to conll file
    :param word_embedding_model: a pretrained word embedding model
    :type conllfile: string
    :type word_embedding_model: gensim.models.keyedvectors.Word2VecKeyedVectors
    
    :return features: list of vector representation of tokens
    :return labels: list of gold labels
    '''
    labels = []
    dense_vectors = []
    traditional_features = []
    
    conllinput = open(conllfile, 'r')
    csvreader = csv.reader(conllinput, delimiter=',',quotechar='|')
    for index, row in enumerate(csvreader):
        if index == 0:
            continue
        if len(row) == 8:
            token_vector = extract_word_embedding(row[1], word_embedding_model)
            pt_vector = extract_word_embedding(row[6], word_embedding_model)
            dense_vectors.append(np.concatenate((token_vector,pt_vector)))
            #mixing very sparse representations (for one-hot tokens) and dense representations is a bad idea
            #we thus only use other features with limited values
            other_features = extract_feature_values(row, ['cap_type'])
            traditional_features.append(other_features)
            #adding gold label to labels
            labels.append(row[4])
            
    #create vector representation of traditional features
    if vectorizer is None:
        #creates vectorizer that provides mapping (only if not created earlier)
        vectorizer = create_vectorizer_traditional_features(traditional_features)
    sparse_features = vectorizer.transform(traditional_features)
    combined_vectors = combine_sparse_and_dense_features(dense_vectors, sparse_features)
    
    return combined_vectors, vectorizer, labels

def label_data_with_combined_features(testfile, classifier, vectorizer, word_embedding_model):
    '''
    Function that labels data with model using both sparse and dense features
    '''
    feature_vectors, vectorizer, goldlabels = extract_traditional_features_and_embeddings_plus_gold_labels(testfile, word_embedding_model, vectorizer)
    predictions = classifier.predict(feature_vectors)
    
    return predictions, goldlabels


print('Extracting Features...')
feature_vectors, vectorizer, gold_labels = extract_traditional_features_and_embeddings_plus_gold_labels(train_path, word_embedding_model)
print('Training classifier....')
lr_classifier = create_classifier(feature_vectors, gold_labels)
print('Running the evaluation...')
predictions, goldlabels = label_data_with_combined_features(test_path, lr_classifier, vectorizer, word_embedding_model)
print_confusion_matrix(predictions, goldlabels)
print_precision_recall_fscore(predictions, goldlabels)

Extracting Features...
Training classifier....




Running the evaluation...
Predicted  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER      O
Gold                                                                      
B-LOC       1395      28     91     24      5       0     17      2     31
B-MISC        46     644     69     20      1       7      6      0     81
B-ORG         88      41    974     49      4       3     13      0     51
B-PER         35       8     28   1367      5       0      4     13     33
I-LOC          6       0      3      0    174       6     25      3     16
I-MISC         3      11      8      2     10     208     22      3     67
I-ORG         20       5     14      6     29      23    466      9    137
I-PER          1       1      5     42      3       0     11    964     27
O              9      25     59     15      2      22     46      2  37509
P: 0.8477110179619565 R: 0.8067990229479484 F1: 0.8253967328075144
