# Lab3.9 Machine learning using embeddings

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

RMA/Text Mining MA, Introduction to HLT

In this notebook, you are going to use word embeddings instead of the one-hot-encoding of words. Word embeddings have many advantages:

* they capture similarities across words that can be learned from massive amounts of text data without annotation
* machine learning can easily exploit similarity because the embeddings are also represented as vectors
* the word embedding vectors are much smaller (100 up to 500 dimensions) and more dense than one-hot-encodings, which results in more efficient and compact models that also generalize better.
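To make the contrast in the list above concrete, the sketch below compares one-hot vectors with dense vectors for a three-word toy vocabulary. The dense values are made up for illustration and do not come from any trained model. The point is only that two different one-hot vectors always have a dot product of zero, whereas dense vectors can be closer together or further apart.

```python
import numpy as np

# Toy vocabulary; the dense values below are invented for illustration only.
vocabulary = ['cat', 'tiger', 'Germany']

# One-hot: one dimension per vocabulary word, a single 1 per vector.
one_hot = np.eye(len(vocabulary))

# Dense: a few real-valued dimensions shared by all words (hypothetical values).
dense = np.array([
    [0.9, 0.8, 0.1],   # cat
    [0.8, 0.9, 0.2],   # tiger
    [0.1, 0.2, 0.9],   # Germany
])

def dot(matrix, word_a, word_b):
    """Dot product between the rows that represent two vocabulary words."""
    i, j = vocabulary.index(word_a), vocabulary.index(word_b)
    return float(matrix[i] @ matrix[j])

print('one-hot cat.tiger  :', dot(one_hot, 'cat', 'tiger'))
print('dense   cat.tiger  :', dot(dense, 'cat', 'tiger'))
print('dense   cat.Germany:', dot(dense, 'cat', 'Germany'))
```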

At the end of this notebook, you should have learned:

* how to replace the words in your training set with their embeddings
* how to train a classifier enriched with embeddings
* how to represent the words for any unseen text as embeddings
* how to add embeddings to our NERC system
* how to work with some popular data sets for NERC with which such embeddings can be combined



## 1 Quick introduction to embeddings

Extracting features manually can get us a long way. In addition to lemma and part-of-speech, people have used other information: features of the previous words (on the left) or the next words (on the right), whether the current word starts with a capital, whether it is an abbreviation, etc.

A recent alternative way to create a 'semantic' representation of a word is by word embeddings: mapping words (or phrases) from the vocabulary to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. For this reason, they are called dense representations.

In linguistics, word embeddings were discussed in the research area of distributional semantics. The idea is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying notion is that "a word is characterized by the company it keeps" (Firth). Embeddings are however the weights in the hidden layer of a neural network that is trained to predict the contexts rather than representing the context in a vector directly.

In this section, we will load pre-trained word embeddings called word2vec, created by Google. The embeddings have 300 dimensions.

First, please download the file from [their google drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). Then, create a folder in the same directory as this notebook, called 'model' and unpack the word2vec file in that folder.

We will load the embedding model with the Gensim package that we used before.

In [1]:
import gensim

We can now load the file using the gensim library (this takes a while):

In [2]:
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)  

Word embeddings capture certain aspects of word meaning. Previous research has shown that they can partially capture similarity ("tapas" is similar to "pintxos"), relatedness ("tapas" relates to "Spain"), and analogy ("Paris" is to "France" as "Rome" is to "Italy").

To get an idea of these properties of embeddings, we can compute the cosine similarity between two word vectors. We would expect, for example, that "cat" and "tiger" are more similar than "cat" and "Germany". Feel free to play a bit with word1 and word2 below to get a feeling for the information these embeddings capture.

In [3]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [4]:
word1='tapas'
word2='pintxos'
word1_vector=np.array(word_embedding_model[word1]).reshape(1, -1)
word2_vector=np.array(word_embedding_model[word2]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.6477412]]
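For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Below is a minimal numpy sketch of that formula, using small made-up vectors rather than the real 300-dimensional embeddings. The function name `cosine` is our own and not part of gensim or scikit-learn.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors for illustration only
print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # vectors pointing in the same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))             # orthogonal vectors
```

With the loaded model, `cosine(word_embedding_model['tapas'], word_embedding_model['pintxos'])` should agree with the scikit-learn value printed above, up to floating-point rounding.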


We can also get the most similar words to some word, say 'apple':

In [5]:
print(word_embedding_model.most_similar('apple', topn=10))

[('apples', 0.7203598022460938), ('pear', 0.6450696587562561), ('fruit', 0.6410146355628967), ('berry', 0.6302294731140137), ('pears', 0.6133961081504822), ('strawberry', 0.6058261394500732), ('peach', 0.6025873422622681), ('potato', 0.596093475818634), ('grape', 0.5935864448547363), ('blueberry', 0.5866668224334717)]


## 2 Using embeddings in our NERC model

Next, we will use the same example of Named Entity Recognition and Classification (NERC) as in the previous notebook, but now we replace the one-hot vectors for the words with their dense embeddings.

We use the same text as before and process it using spaCy to get the words and the part of speech. We define the labels in the same way as well.

In [6]:
import spacy

nlp = spacy.load('en_core_web_sm')

text="Germany's representative to the European Union"

doc=nlp(text)

## The series of labels that go with the word tokens from the input text
y=['B-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']

We will now replace the one-hot input representation of our words with embeddings. We generate our input data by simply looking up each word in the embeddings model. If we find it, we add the embedding vector to the training input, if not we add a vector with 300 zero values.

The following code creates an array from all the tokens in the spaCy document object "doc" by taking the embedding vectors for each word.

In [7]:
training_input=[]
for token in doc:
    word=token.text  # the current word from the tokenized text
    # we check whether the word is in the vocabulary of our model
    # (loaded with the Google word2vec embeddings)
    if word in word_embedding_model:
        # the word was found, so vector is assigned its embedding vector
        vector=word_embedding_model[word]
    else: 
        # if the word does not exist in the embeddings vocabulary, 
        # we create a vector with all zeros.
        # The Google word2vec model has 300 dimensions, so we create a vector with 300 zeros
        vector=[0]*300
        print('This word is not in the word2vec vocabulary:', word)
    training_input.append(vector)

This word is not in the word2vec vocabulary: 's
This word is not in the word2vec vocabulary: to


We see that for two tokens from the spaCy output, we did not get an embedding.
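The same lookup-with-fallback pattern returns several times in this notebook, so it can be convenient to wrap it in a function. The sketch below is one possible way to do that. The name `embed_tokens` and the toy dictionary are our own assumptions, and the function only relies on the embeddings object supporting `in` and `[...]`, as the gensim model does in the cell above.

```python
import numpy as np

def embed_tokens(tokens, embeddings, dim=300):
    """Return one vector per token; unknown tokens get a zero vector of length dim."""
    vectors = []
    missing = []
    for word in tokens:
        if word in embeddings:
            vectors.append(np.asarray(embeddings[word], dtype=float))
        else:
            vectors.append(np.zeros(dim))
            missing.append(word)
    return np.array(vectors), missing

# Toy stand-in for the word2vec model: a plain dict with 3-dimensional vectors
toy_embeddings = {'Germany': [0.1, 0.2, 0.3], 'Union': [0.4, 0.5, 0.6]}
matrix, missing = embed_tokens(['Germany', "'s", 'Union'], toy_embeddings, dim=3)
print(matrix.shape)
print('Not in the vocabulary:', missing)
```

With the real model the call would look like `embed_tokens([token.text for token in doc], word_embedding_model)`.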

We can inspect the first element in our training_input, which is the same size as the tokenized sentence but the words are replaced by embeddings.

In [8]:
print("The length of the training input = ",len(training_input))
#### the first token has the following embedding values
print(training_input[0])

The length of the training input =  7
[ 0.25976562  0.140625    0.24707031  0.00958252 -0.25       -0.08251953
 -0.09912109 -0.35351562 -0.1484375   0.1484375  -0.03540039 -0.05249023
  0.09277344 -0.14257812 -0.01483154  0.01647949  0.03710938  0.18847656
 -0.03955078 -0.05786133  0.26757812  0.10693359 -0.04345703  0.06738281
 -0.00177765  0.1328125  -0.16308594 -0.05908203 -0.22558594  0.12207031
  0.10791016 -0.19433594 -0.16210938 -0.14257812  0.09033203 -0.14648438
 -0.12109375  0.09960938  0.26367188  0.12695312  0.140625    0.11083984
  0.02697754 -0.01635742  0.00292969  0.14746094 -0.06542969 -0.16699219
  0.03662109  0.14941406 -0.14746094  0.06835938 -0.09228516  0.12207031
 -0.09179688  0.09082031 -0.38476562  0.03051758 -0.21679688 -0.12597656
 -0.08642578 -0.26171875 -0.08496094 -0.13964844 -0.02832031 -0.203125
  0.29101562 -0.13574219 -0.07226562  0.16308594 -0.19042969  0.22265625
  0.05566406  0.21289062  0.05053711 -0.09814453  0.12158203  0.01000977
  0.15234375 -0

Same as in the earlier cases, once we have the vector representations, we can use them to train our model.

In [9]:
from sklearn import svm

lin_clf = svm.LinearSVC()
lin_clf.fit(training_input, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

**Testing the model** Let's say we want to test our model with the sentence: 'I love beer from Munich'. What we need to do is to preprocess the text in the same way as the training data by using spaCy (otherwise, we may get a mismatch in features), and next replace each word by an embedding vector as well.

In [10]:
test_sentence='I love beer from Munich'
test_doc=nlp(test_sentence)

test_input=[]

for token in test_doc:
    word=token.text
    if word in word_embedding_model:
        vector=word_embedding_model[word]
    else:
        vector=[0]*300
    test_input.append(vector)

Because our representation is the same, we can ask the classifier to make a prediction for it:

In [11]:
pred=lin_clf.predict(test_input)
print(pred)

['O' 'O' 'O' 'O' 'B-LOC']


The classifier assigned IOB tags to the tokens in order, and the final token obtained the label 'B-LOC', which is correct.
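Token-level IOB tags are usually turned back into entity mentions before they are shown to a user. The function below is a small sketch of that step. The name `iob_to_spans` is hypothetical, and the handling of an `I-` tag that does not continue an open entity (it simply starts a new one) is a choice made here for simplicity.

```python
def iob_to_spans(tokens, tags):
    """Group tokens into (entity_type, text) spans based on their IOB tags."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        continues = tag.startswith('I-') and tag[2:] == current_type
        if not continues:
            # close the span that was open, if any
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != 'O':
            # a B- tag (or a stray I- tag) opens a span; a matching I- tag extends it
            current_type = tag[2:]
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans

train_tokens = ['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union']
train_tags = ['B-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']
print(iob_to_spans(train_tokens, train_tags))

test_tokens = ['I', 'love', 'beer', 'from', 'Munich']
test_tags = ['O', 'O', 'O', 'O', 'B-LOC']
print(iob_to_spans(test_tokens, test_tags))
```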

Congratulations! You have now trained and tested your first embeddings-based NERC model. Note that the word 'Munich' is not in the training data, but the system still managed to make a correct(!) prediction, presumably because its embedding is close to that of a location seen during training ('Germany').

So far you have just worked with a few toy examples. To obtain good performance, machine learning systems may need thousands and sometimes hundreds of thousands of training examples. The vocabulary of a language is large and there is also large variation in expressions. Observing even a few examples of each word or expression requires massive amounts of text.

To some extent, word embeddings resolve the issue of *data sparseness*, as words unseen in the training data may still be similar to words that are in the training data. Word embeddings are derived from millions of documents (billions of tokens) and are therefore likely to cover most words.

## 3 Combining embeddings with one-hot encoding

So how can we combine the word embeddings with one-hot-encodings for other features?

We are first going to get the one-hot encodings of the text as we did in the previous notebook, using the DictVectorizer.

In [12]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()

In [13]:
training_instances=[]
for token in doc:
    one_training_instance={'part-of-speech': token.pos_, 'lemma': token.lemma_} # this concatenates the PoS and Lemma
    training_instances.append(one_training_instance)

the_array = vec.fit_transform(training_instances).toarray() 

If we inspect the array, we see it holds 7 rows, each row representing one token, and 12 columns, each column representing a feature value.

In [14]:
the_array.shape
# ROWS are WORDS, COLUMNS are FEATURES

(7, 12)

In [15]:
# the first token values
print(the_array[0])

[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
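To see which feature value is behind each column, you can ask the vectorizer for its feature names. The sketch below writes the seven feature dictionaries out by hand so that it runs without spaCy. It assumes the lemmas and PoS tags that spaCy produced for the training sentence in this notebook. The variable names are our own.

```python
from sklearn.feature_extraction import DictVectorizer

# The seven feature dictionaries for "Germany's representative to the European Union"
instances = [
    {'part-of-speech': 'PROPN', 'lemma': 'Germany'},
    {'part-of-speech': 'PART', 'lemma': "'s"},
    {'part-of-speech': 'NOUN', 'lemma': 'representative'},
    {'part-of-speech': 'ADP', 'lemma': 'to'},
    {'part-of-speech': 'DET', 'lemma': 'the'},
    {'part-of-speech': 'PROPN', 'lemma': 'European'},
    {'part-of-speech': 'PROPN', 'lemma': 'Union'},
]
vectorizer = DictVectorizer()
array = vectorizer.fit_transform(instances).toarray()

# feature_names_ lists the name=value pair behind each column, in column order
first_row_active = [name for name, value in zip(vectorizer.feature_names_, array[0])
                    if value == 1.0]
print(first_row_active)
```

Each row has exactly two active columns: one for the lemma and one for the part of speech.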


In [16]:
print(the_array)

[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1.]]


Our training_input array represented the same text with embeddings. Let's inspect the array for the embeddings using the numpy module (imported as np at the start of this notebook!).

In [17]:
np.array(training_input).shape
# ROWS are WORDS, COLUMNS are EMBEDDINGS

(7, 300)

It has the same number of rows, with 300 embedding features per token. We can now *concatenate* the features for each word using numpy:

In [18]:
features_input=np.array(the_array)
embeddings_input=np.array(training_input)

In [19]:
print(features_input)

[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1.]]


We assume that the number of rows is the same across the two arrays and each row corresponds to the same token instance.

In [20]:
#### num_words is the number of rows
num_words=features_input.shape[0]
concat_input=[] # for storing the result of concatenating
for index in range(num_words):
    print('Combining the values for:', index, " from the features and the embeddings")
    representation=list(features_input[index]) + list(embeddings_input[index]) # concatenate features per word
    concat_input.append(representation)

Combining the values for: 0  from the features and the embeddings
Combining the values for: 1  from the features and the embeddings
Combining the values for: 2  from the features and the embeddings
Combining the values for: 3  from the features and the embeddings
Combining the values for: 4  from the features and the embeddings
Combining the values for: 5  from the features and the embeddings
Combining the values for: 6  from the features and the embeddings


If we check the shape, we see it has the same number of rows, but the combination of features now results in 312 columns.

In [21]:
np.array(concat_input).shape

(7, 312)
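The explicit loop shows what happens per token. The same result can also be obtained in one call with numpy's `hstack`, which joins arrays column-wise. The sketch below uses stand-in arrays of zeros and ones with the same shapes as in this notebook, so the values themselves are not meaningful.

```python
import numpy as np

# Stand-in arrays with the same shapes as in this notebook:
# 7 tokens with 12 one-hot columns and 7 tokens with 300 embedding columns.
features = np.zeros((7, 12))
embeddings = np.ones((7, 300))

# hstack glues the columns of both arrays together, row by row
combined = np.hstack([features, embeddings])
print(combined.shape)
```

With the notebook's own variables this would read `np.hstack([features_input, embeddings_input])`.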

Let's inspect the concatenated vector for the first token.

In [22]:
print(concat_input[0])

[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.259765625, 0.140625, 0.2470703125, 0.00958251953125, -0.25, -0.08251953125, -0.09912109375, -0.353515625, -0.1484375, 0.1484375, -0.035400390625, -0.052490234375, 0.0927734375, -0.142578125, -0.01483154296875, 0.0164794921875, 0.037109375, 0.1884765625, -0.03955078125, -0.057861328125, 0.267578125, 0.10693359375, -0.04345703125, 0.0673828125, -0.00177764892578125, 0.1328125, -0.1630859375, -0.05908203125, -0.2255859375, 0.1220703125, 0.10791015625, -0.1943359375, -0.162109375, -0.142578125, 0.09033203125, -0.146484375, -0.12109375, 0.099609375, 0.263671875, 0.126953125, 0.140625, 0.11083984375, 0.0269775390625, -0.016357421875, 0.0029296875, 0.1474609375, -0.0654296875, -0.1669921875, 0.03662109375, 0.1494140625, -0.1474609375, 0.068359375, -0.09228515625, 0.1220703125, -0.091796875, 0.0908203125, -0.384765625, 0.030517578125, -0.216796875, -0.1259765625, -0.08642578125, -0.26171875, -0.0849609375, -0.1396484375, -0.0283203

### 3.1 Representing the test data

Note that we need to represent the test data in the same way as the training data. So when testing we also need to create an array with the same 312 features. We first use spaCy again to get the linguistic features.

In [23]:
test_sentence='I love beer from Munich'
test_doc=nlp(test_sentence)

test_instances=[]
for token in test_doc:
    one_test_instance={'part-of-speech': token.pos_, 'lemma': token.lemma_} # this concatenates the PoS and Lemma
    test_instances.append(one_test_instance)

print(test_instances)
the_test_array = vec.fit_transform(test_instances).toarray()
the_test_array.shape


[{'part-of-speech': 'PRON', 'lemma': '-PRON-'}, {'part-of-speech': 'VERB', 'lemma': 'love'}, {'part-of-speech': 'NOUN', 'lemma': 'beer'}, {'part-of-speech': 'ADP', 'lemma': 'from'}, {'part-of-speech': 'PROPN', 'lemma': 'Munich'}]


(5, 10)

In [24]:
print(the_test_array)

[[1. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 1. 0.]]


We have a problem. The training and test arrays do not have the same number of columns: for the training set we had 12 features, and now we have 10. The vectorizer derives the feature names and values from the data it is fitted on. Since the training and test data are different, their vector representations are different as well, not only in size but also in which column holds which feature value. To fix this, we apply the vectorizer to the training and test data together to create one array with shared dimensions, and after that we split the data again.
This is how we do it.

### 3.2 Harmonizing one-hot-vectors across training and test sets

In [25]:
vec = DictVectorizer()
## First we concatenate the training and test instances and fit these to a vector representation
train_and_test_instance = training_instances + test_instances
print(train_and_test_instance)

[{'part-of-speech': 'PROPN', 'lemma': 'Germany'}, {'part-of-speech': 'PART', 'lemma': "'s"}, {'part-of-speech': 'NOUN', 'lemma': 'representative'}, {'part-of-speech': 'ADP', 'lemma': 'to'}, {'part-of-speech': 'DET', 'lemma': 'the'}, {'part-of-speech': 'PROPN', 'lemma': 'European'}, {'part-of-speech': 'PROPN', 'lemma': 'Union'}, {'part-of-speech': 'PRON', 'lemma': '-PRON-'}, {'part-of-speech': 'VERB', 'lemma': 'love'}, {'part-of-speech': 'NOUN', 'lemma': 'beer'}, {'part-of-speech': 'ADP', 'lemma': 'from'}, {'part-of-speech': 'PROPN', 'lemma': 'Munich'}]


In [26]:
the_array = vec.fit_transform(train_and_test_instance).toarray()
the_array.shape

(12, 19)

We see that we now have 12 rows (tokens) and 19 columns (feature values). From this shared feature space, we need to recover the data corresponding to the training data and the data corresponding to the test data. Since the order is based on the concatenation, we can take the length of the training_instances to separate the first part as the training data and the second part as the test data.

In [27]:
print(the_array)

[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]


In [28]:
# For the training set we take the first part of the data up to the length of the training_instances
training_onehot = the_array[:len(training_instances)]
# For the test set, we take the remaining part of the data starting at the length of the training_instances
# (remember that '0' is the first data element)
test_onehot = the_array[len(training_instances):]

print('Number of training words =', training_onehot.shape)
print('Number of test words =', test_onehot.shape)

Number of training words = (7, 19)
Number of test words = (5, 19)


In [29]:
print(training_onehot)

[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]


In [30]:
print(test_onehot)

[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]


Now we have ensured that the feature space is the same for the training and test data. Next we get the embeddings for both sets and combine these with the one-hot-vector representations. We start with the training data again.

In [31]:
features_training_input=np.array(training_onehot)
embeddings_training_input=np.array(training_input)

In [32]:
num_words=training_onehot.shape[0]
concat_train_input=[]
for index in range(num_words):
    print(index)
    representation=list(training_onehot[index]) + list(embeddings_training_input[index]) # concatenate features per word
    concat_train_input.append(representation)

# we check the shape
np.array(concat_train_input).shape

0
1
2
3
4
5
6


(7, 319)

In [33]:
print(concat_train_input[0])

[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.259765625, 0.140625, 0.2470703125, 0.00958251953125, -0.25, -0.08251953125, -0.09912109375, -0.353515625, -0.1484375, 0.1484375, -0.035400390625, -0.052490234375, 0.0927734375, -0.142578125, -0.01483154296875, 0.0164794921875, 0.037109375, 0.1884765625, -0.03955078125, -0.057861328125, 0.267578125, 0.10693359375, -0.04345703125, 0.0673828125, -0.00177764892578125, 0.1328125, -0.1630859375, -0.05908203125, -0.2255859375, 0.1220703125, 0.10791015625, -0.1943359375, -0.162109375, -0.142578125, 0.09033203125, -0.146484375, -0.12109375, 0.099609375, 0.263671875, 0.126953125, 0.140625, 0.11083984375, 0.0269775390625, -0.016357421875, 0.0029296875, 0.1474609375, -0.0654296875, -0.1669921875, 0.03662109375, 0.1494140625, -0.1474609375, 0.068359375, -0.09228515625, 0.1220703125, -0.091796875, 0.0908203125, -0.384765625, 0.030517578125, -0.216796875, -0.1259765625, -0.08642578125, -0.26171875, -0.08

In [34]:
features_test_input=np.array(test_onehot)
embeddings_test_input=np.array(test_input)

In [35]:
num_words=test_onehot.shape[0]
concat_test_input=[]
for index in range(num_words):
    print(index)
    representation=list(test_onehot[index]) + list(embeddings_test_input[index]) # concatenate features per word
    concat_test_input.append(representation)

# we check the shape
np.array(concat_test_input).shape

0
1
2
3
4


(5, 319)

We can now train the classifier in the same way as we did before but now with the concatenated features, where the one-hot-vectors are aligned.

In [36]:
lin_clf.fit(concat_train_input, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [37]:
pred=lin_clf.predict(concat_test_input)
print(pred)

['O' 'O' 'O' 'O' 'B-LOC']
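Fitting the vectorizer on training and test data together works for this small demonstration, but it requires the test data to be available when the model is built. A common alternative, sketched below as an assumption on our side rather than the method of this notebook, is to call `fit_transform` on the training instances only and `transform` on the test instances. The DictVectorizer then reuses the training columns and silently ignores feature values it has not seen during fitting. The dictionaries are written out by hand so the block runs without spaCy, and the variable names are our own.

```python
from sklearn.feature_extraction import DictVectorizer

train_instances = [
    {'part-of-speech': 'PROPN', 'lemma': 'Germany'},
    {'part-of-speech': 'PART', 'lemma': "'s"},
    {'part-of-speech': 'NOUN', 'lemma': 'representative'},
    {'part-of-speech': 'ADP', 'lemma': 'to'},
    {'part-of-speech': 'DET', 'lemma': 'the'},
    {'part-of-speech': 'PROPN', 'lemma': 'European'},
    {'part-of-speech': 'PROPN', 'lemma': 'Union'},
]
test_instances = [
    {'part-of-speech': 'PRON', 'lemma': '-PRON-'},
    {'part-of-speech': 'VERB', 'lemma': 'love'},
    {'part-of-speech': 'NOUN', 'lemma': 'beer'},
    {'part-of-speech': 'ADP', 'lemma': 'from'},
    {'part-of-speech': 'PROPN', 'lemma': 'Munich'},
]

vectorizer = DictVectorizer()
# the columns are decided by the training data only
train_matrix = vectorizer.fit_transform(train_instances).toarray()
# the test data is mapped onto those same columns; unseen values are dropped
test_matrix = vectorizer.transform(test_instances).toarray()

print(train_matrix.shape, test_matrix.shape)
print(test_matrix.sum(axis=1))  # number of active columns per test token
```

In this sketch the lemma 'Munich' has no column of its own and only its PoS tag is encoded, so the lexical information has to come from the embedding part of the representation.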


## 4 NERC datasets

Now that we've seen how to represent linguistic features, we also need to access real linguistic training data for the NERC task. In this section, we will look at large data sets that have been created by the community in which people have been annotating entities. In the assignment, you will use this data to train and test models that give a realistic performance.

Here, we will load two NERC datasets and quickly inspect their contents.

**Preparation** Please download the .zip file with the two datasets from [this link](http://kyoto.let.vu.nl/~vossen/rma_hlt/nerc_datasets.zip)

Then unpack the .zip, so that the folder `nerc_datasets` is created in the same directory as this notebook. If you want to store it elsewhere, you can do that but need to adapt the path in the calls below.

### 4.1 CoNLL-2003

 One of the most popular datasets is [CoNLL-2003](http://aclweb.org/anthology/W03-0419), which was provided with the zip file you just downloaded. You can open the file "train.txt" in a text editor to inspect its content:

````
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O
````

It follows the IOB format with one token on a line, followed by columns with the PoS, the constituent and the IOB entity tag. You can check the "test.txt" file to see that it has a similar format.
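To get a feeling for the format, the sketch below reads a shortened copy of the sample by hand, without NLTK. The function name `read_conll` is hypothetical. It assumes four whitespace-separated columns per line, an empty line between sentences, and `-DOCSTART-` lines as document markers.

```python
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def read_conll(text):
    """Split CoNLL-style text into sentences of (token, pos, chunk, ne_label) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith('-DOCSTART-'):
            continue  # document boundary marker, not a real token
        token, pos, chunk, ne_label = line.split()
        current.append((token, pos, chunk, ne_label))
    if current:
        sentences.append(current)
    return sentences

sentences = read_conll(sample)
print(len(sentences))
print(sentences[0])
```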

You can load it using the following code snippet, which makes use of the NLTK function ConllCorpusReader to do the magic. More information on the ConllCorpusReader can be found here: https://www.nltk.org/_modules/nltk/corpus/reader/conll.html

The function has three parameters:

* the path to the folder where CoNLL-2003 is stored (locally in my case)
* the name of the file that will be loaded from that folder
* labels for the columns that are expected in the input file

We store the result in a variable with the name 'train' which is of the type 'nltk.corpus.reader.conll.ConllCorpusReader'

In [39]:
from nltk.corpus.reader import ConllCorpusReader

train = ConllCorpusReader('nerc_datasets/CONLL2003',
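                          'train.txt', # this will load the file 'train.txt', for the exercise you also need to load 'test.txt' 
                          # the third column (constituent) is ignored and the fourth column (entity tag)
                          # is read as 'chunk', so that iob_words() returns the entity label
                          ['words', 'pos', 'ignore', 'chunk'])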


We can use 'dir' to see it has many data elements that correspond to the many different features that can be found in the CoNLL data.

In [40]:
dir(train)

['CHUNK',
 'COLUMN_TYPES',
 'IGNORE',
 'NE',
 'POS',
 'SRL',
 'TREE',
 'WORDS',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_chunk_types',
 '_colmap',
 '_encoding',
 '_fileids',
 '_get_chunked_words',
 '_get_column',
 '_get_iob_words',
 '_get_parsed_sent',
 '_get_root',
 '_get_srl_instances',
 '_get_srl_spans',
 '_get_tagged_words',
 '_get_words',
 '_grids',
 '_pos_in_tree',
 '_read_grid_block',
 '_require',
 '_root',
 '_root_label',
 '_srl_includes_roleset',
 '_tagset',
 '_tree_class',
 'abspath',
 'abspaths',
 'chunked_sents',
 'chunked_words',
 'citation',
 'encoding',
 'ensure_loaded',
 'fileids',
 'iob_sents',
 'iob_words',
 'license',
 'open'

We are for now only interested in the token, the pos and the ne_label. Let's check the first one in train:

In [41]:
for token, pos, ne_label in train.iob_words():
    print(token, pos, ne_label) # please represent this information using a dictionary for the feature representation
    break

EU NNP B-ORG


We can for example iterate through this data, and make a list of the tokens as inputs, and of the `ne_label` values as desirable outputs. The input tokens could for example be looked up in our word embeddings dictionary.

In [42]:
input_vectors=[]
labels=[]
for token, pos, ne_label in train.iob_words():
    
    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        input_vectors.append(vector)
        labels.append(ne_label)

We have successfully loaded our data. Let's see how many tokens/labels we have:

In [43]:
print(len(labels))

203621


In [44]:
print('First ten labels =', labels[:10])

First ten labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'B-PER']


Obviously, we should have the same size of input_vectors:

In [45]:
print(len(input_vectors))

203621


As a next step, we can train a model on this data as shown above, by passing the input vectors together with the labels to the fit function. You will see that it takes a lot longer to train the classifier with this data set, which has over 200K instances. On my machine it took about 5 minutes.

In [46]:
lin_clf.fit(input_vectors, labels)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

If you want to apply this classifier to a data set for testing, you need to apply the same vectorization procedure as you have followed for the training data.

Before you apply a classifier to a data set, it is important to know the data set and especially the statistics about how the labels are distributed. In other words, how often do tokens in the data set belong to each human-annotated category?

This tells you how frequent or rare certain data categories are and how challenging it is for a system to learn and predict each category.

Because we have created a list of labels from our data, we can use a simple Python function *Counter* to get the statistics:

In [117]:
from collections import Counter 
print(Counter(labels))

Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321, 'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438, 'I-LOC': 1157, 'I-MISC': 1155})


This clearly shows that most tokens get the label *O* and that the counts for the actual entity labels range between 1155 and 7140.
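Relative frequencies make the imbalance easier to read than raw counts. The sketch below copies the counts from the Counter output and prints the share of each label. It also computes the share of tokens that a system would get right by always predicting the most frequent label, which is why plain token accuracy is a weak measure for NERC. The variable names are our own.

```python
from collections import Counter

# Label counts copied from the Counter output above
label_counts = Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321,
                        'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438,
                        'I-LOC': 1157, 'I-MISC': 1155})

total = sum(label_counts.values())
for label, count in label_counts.most_common():
    print(f'{label:7s} {count:7d} {count / total:6.1%}')

# Share of tokens a system gets right by always predicting the most frequent label
majority_label, majority_count = label_counts.most_common(1)[0]
majority_share = majority_count / total
print(majority_label, round(majority_share, 3))
```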

### 4.2 Kaggle
[*Kaggle*](https://www.kaggle.com/docs) is an online platform for sharing data and competitions. It hosts thousands of datasets and frequently releases new data and challenges. We are going to have a quick look at the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus) that is hosted there and which was also provided in the zip file you downloaded as so-called CSV files: ner.csv and ner_v2.csv. CSV stands for comma-separated values and it is a commonly used format to exchange e.g. Excel or other spreadsheet data as text files. Instances of data are represented on separate lines, with values separated by commas. Another format is tab-separated values or TSV, in which case tabs are used, as in the CoNLL formats. Very often people store TSV data in files with the extension ".csv", so it is always good practice to check the actual content to see what is used as a separator. The first line of a CSV or TSV file is usually the header that labels the different columns.
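Checking the separator can be done by eye, but a rough check is also easy to script. The helper below is a hypothetical sketch that counts how often each candidate character occurs in the header line and returns the most frequent one. It is a heuristic and can be fooled by headers that contain commas inside quoted values.

```python
def guess_separator(header_line, candidates=(',', '\t', ';')):
    """Return the candidate separator that occurs most often in the header line."""
    return max(candidates, key=header_line.count)

# Two made-up header lines: one comma-separated, one tab-separated
print(repr(guess_separator('id,lemma,pos,word,tag')))
print(repr(guess_separator('word\tpos\tchunk\ttag')))
```

The detected character can then be passed to the `sep` parameter of `pandas.read_csv`.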

The [*pandas*](https://pandas.pydata.org) package is a powerful package to handle data in various formats. You can check the website for details and documentation. Here we are going to use it to inspect the data.

You can load the data in the following way:

In [47]:
import pandas

In [48]:
path = 'nerc_datasets/kaggle/ner_v2.csv'

In [49]:
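# note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; newer versions use on_bad_lines='skip' instead
kaggle_dataset = pandas.read_csv(path, error_bad_lines=False)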

b'Skipping line 281837: expected 25 fields, saw 34\n'


You will see the following output after running the above code cell:
```
b'Skipping line 281837: expected 25 fields, saw 34\n'
```
You can ignore this.

**pandas.read_csv** will load the csv file into a [pandas DataFrame](https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96).

You can inspect which columns are in the csv file by running the following code cell:

In [114]:
kaggle_dataset.columns

Index(['id', 'lemma', 'next-lemma', 'next-next-lemma', 'next-next-pos',
       'next-next-shape', 'next-next-word', 'next-pos', 'next-shape',
       'next-word', 'pos', 'prev-iob', 'prev-lemma', 'prev-pos',
       'prev-prev-iob', 'prev-prev-lemma', 'prev-prev-pos', 'prev-prev-shape',
       'prev-prev-word', 'prev-shape', 'prev-word', 'sentence_idx', 'shape',
       'word', 'tag'],
      dtype='object')

You can see that a wide range of features is given for each token. [Here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), you can read what each column represents.

You can loop through the dataset in the following way:

In [115]:
for index, instance in kaggle_dataset.iterrows():
    print()
    print(index)
    print(instance) # you can access information by using instance['A COLUMN NAME'] which you can use to convert to a dictionary needed for the feature representation.
    print('NERC label', instance['tag'])
    break


0
id                             0
lemma                   thousand
next-lemma                    of
next-next-lemma         demonstr
next-next-pos                NNS
next-next-shape        lowercase
next-next-word     demonstrators
next-pos                      IN
next-shape             lowercase
next-word                     of
pos                          NNS
prev-iob              __START1__
prev-lemma            __start1__
prev-pos              __START1__
prev-prev-iob         __START2__
prev-prev-lemma       __start2__
prev-prev-pos         __START2__
prev-prev-shape         wildcard
prev-prev-word        __START2__
prev-shape              wildcard
prev-word             __START1__
sentence_idx                   1
shape                capitalized
word                   Thousands
tag                            O
Name: 0, dtype: object
NERC label O


You can see that each token has many different features that people have considered useful for the task of NERC. In addition to the usual suspects that we saw before, each token also has features indicating previous and next words and their PoS, but also the shape of the word (upper and lower case patterns), and even the previous IOB tags.

We could use all these features as inputs in a machine learning model with our DictVectorizer, or by transforming them using embeddings if the values are words.
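As a sketch of the first option, the block below turns a few columns of a DataFrame into a list of feature dictionaries, which is the input format the DictVectorizer expects. It uses a toy frame instead of the real file so that it runs on its own. The first row follows the instance printed above, while the second row and the choice of columns are invented for illustration.

```python
import pandas

# Toy frame with a few of the Kaggle columns; the second row is invented for illustration
toy = pandas.DataFrame({
    'word': ['Thousands', 'London'],
    'pos': ['NNS', 'NNP'],
    'shape': ['capitalized', 'capitalized'],
    'tag': ['O', 'B-geo'],
})

feature_columns = ['word', 'pos', 'shape']
toy_instances = toy[feature_columns].to_dict('records')   # one dict per token
toy_labels = toy['tag'].tolist()                          # the NERC label per token

print(toy_instances[0])
print(toy_labels)
```

The resulting list of dictionaries can be passed to `fit_transform` of a DictVectorizer in the same way as the spaCy-based instances earlier in this notebook.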

## End of this notebook