# Lab3.3 Machine learning using embeddings
Introduction to HLT, RMA VU University

In this notebook, you are going to use word embeddings instead of the one-hot-encoding of words. Word embeddings have many advantages:

* they capture similarities across words that can be learned from massive amounts of text data without annotation
* machine learning can easily exploit these similarities because the embeddings are also represented as vectors
* the word embedding vectors are much smaller (30 up to 500 dimensions) and more dense than one-hot-encodings, which results in more efficient and compact models that can also generalize better

At the end of this notebook, you should have learned:

* how to replace the words in your training set by their embeddings
* how to train a classifier enriched with embeddings
* how to represent the words for any unseen text as embeddings
* how to add embeddings to our NERC system
* how to work with some popular data sets for NERC to which such embeddings can be applied



## 1 Quick introduction to embeddings

Extracting features manually can get us a long way. In addition to lemma and part-of-speech, people have used a large number of other features: properties of the previous words (on the left) or the next words (on the right), whether the current word starts with a capital, whether it is an abbreviation, etc.

A recent alternative way to create a 'semantic' representation of a word is by word embeddings: mapping words (or phrases) from the vocabulary to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. For this reason, they are called dense representations.
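
To make the contrast concrete, the following minimal sketch builds one-hot vectors for a toy four-word vocabulary and compares them with made-up dense vectors. The vocabulary and the dense values are assumptions chosen for illustration; they are not taken from any trained model.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration only)
vocabulary = ['cat', 'tiger', 'Germany', 'France']

# One-hot encoding: one dimension per vocabulary word, a single 1 per vector
identity = np.eye(len(vocabulary))
one_hot = {word: identity[i] for i, word in enumerate(vocabulary)}

# Made-up dense vectors with only 2 dimensions (real embeddings are learned from text)
dense = {
    'cat':     np.array([0.9, 0.1]),
    'tiger':   np.array([0.8, 0.2]),
    'Germany': np.array([0.1, 0.9]),
    'France':  np.array([0.2, 0.8]),
}

# Two different one-hot vectors never share a non-zero dimension,
# so their dot product is always 0: no similarity is expressed
print(np.dot(one_hot['cat'], one_hot['tiger']))

# Dense vectors can overlap, so related words can score higher than unrelated ones
print(np.dot(dense['cat'], dense['tiger']))
print(np.dot(dense['cat'], dense['Germany']))
```

The one-hot vectors grow with the vocabulary (one dimension per word), while the dense vectors keep a fixed, small number of dimensions.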

In linguistics, word embeddings were discussed in the research area of distributional semantics. The idea is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying notion is that "a word is characterized by the company it keeps" (Firth).

In this section, we will load pre-trained word embeddings called word2vec, created by Google. 

First, please download the file from [their google drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). Then, create a folder in the same directory as this notebook, called 'model' and unpack the word2vec file in that folder.

We will load the embedding model with the Gensim package that we used before.

In [5]:
import gensim

We can now load the file using the gensim library (this takes a while):

In [6]:
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)  

Word embeddings capture certain meaning aspects of words. Previous research has shown that they can partially capture similarity ("tapas" is similar to "pintxos"), relatedness ("tapas" relates to "Spain"), and analogy ("Paris" is to "France" as "Rome" is to "Italy").

To get an idea of these properties of embeddings, we can compute the cosine similarity between two word vectors. We expect, for example, that "cat" and "tiger" are more similar than "cat" and "Germany". Feel free to play a bit with word1 and word2 below to get a feeling for the information these embeddings capture.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [8]:
word1='tapas'
word2='pintxos'
word1_vector=np.array(word_embedding_model[word1]).reshape(1, -1)
word2_vector=np.array(word_embedding_model[word2]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.6477412]]
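
Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a sketch of what is computed for a single pair of vectors, here is a plain NumPy version applied to made-up vectors. The helper name `cosine` and the vector values are assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Vectors pointing in the same direction: similarity close to 1
same_direction = cosine([1, 2, 3], [2, 4, 6])
# Vectors with no shared non-zero dimension: similarity 0
orthogonal = cosine([1, 0], [0, 1])
print(same_direction, orthogonal)
```

Because the vectors are divided by their lengths, only their direction matters, not their magnitude.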


We can also get the most similar words to some word, say 'apple':

In [9]:
print(word_embedding_model.most_similar('apple', topn=10))

[('apples', 0.7203598022460938), ('pear', 0.6450696587562561), ('fruit', 0.6410146355628967), ('berry', 0.6302294731140137), ('pears', 0.6133961081504822), ('strawberry', 0.6058261394500732), ('peach', 0.6025873422622681), ('potato', 0.596093475818634), ('grape', 0.5935864448547363), ('blueberry', 0.5866668224334717)]
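
The analogy property mentioned earlier is usually tested with vector arithmetic: subtract the vector of "Paris" from "France", add "Rome", and look for the nearest remaining word. The sketch below does this on hand-made vectors; the vectors, their four dimensions, and the helper names are assumptions for illustration and do not come from word2vec.

```python
import numpy as np

# Hand-made 4-dimensional vectors (an assumption for illustration):
# the dimensions loosely mean [French, Italian, German, capital city]
toy_vectors = {
    'Paris':   np.array([1.0, 0.0, 0.0, 1.0]),
    'France':  np.array([1.0, 0.0, 0.0, 0.0]),
    'Rome':    np.array([0.0, 1.0, 0.0, 1.0]),
    'Italy':   np.array([0.0, 1.0, 0.0, 0.0]),
    'Berlin':  np.array([0.0, 0.0, 1.0, 1.0]),
    'Germany': np.array([0.0, 0.0, 1.0, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, vectors):
    """Return the word d for which 'a is to b as c is to d' fits best."""
    query = vectors[b] - vectors[a] + vectors[c]
    # The three input words are excluded from the candidates
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(query, vectors[word]))

answer = solve_analogy('Paris', 'France', 'Rome', toy_vectors)
print(answer)
```

With the loaded model, a comparable query can be written as `word_embedding_model.most_similar(positive=['France', 'Rome'], negative=['Paris'])`; the words it returns depend on the model.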


## 2 Using embeddings in our NERC model

Next, we will use the same example of Named Entity Recognition and Classification (NERC) as before but now replace the one-hot-vector for the vocabularies by their dense embeddings.

We use the same example as before and process it using SpaCy. We define the labels in the same way as well.

In [10]:
import spacy

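nlp = spacy.load('en_core_web_sm')  # spaCy 3 and later require the full model name; older versions also accepted the shortcut 'en'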

text="Germany's representative to the European Union"

doc=nlp(text)

## The series of labels that go with the word tokens from the input text
y=['B-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']

We will now replace the one-hot input representation of our words with embeddings. We generate our input data by simply looking up each word in the embeddings model.

The following code creates an array from all the tokens in the document object "doc" by taking the embedding vectors for each word.

In [11]:
training_input=[]
for token in doc:
    word=token.text  #the next word from the tokenized text
    # we check whether the word is in the vocabulary of our model
    # (loaded with the Google word2vec embeddings)
    if word in word_embedding_model:
        # in this case the word was found and vector is assigned with its embedding vector as the value
        vector=word_embedding_model[word]
    else: 
        # if the word does not exist in the embeddings vocabulary, 
        # we create a vector with all zeros.
        # The Google word2vec model has 300 dimensions so we create a vector with 300 zeros
        vector=[0]*300
        print('This word is not in the word2vec vocabulary:', word)
    training_input.append(vector)

This word is not in the word2vec vocabulary: 's
This word is not in the word2vec vocabulary: to
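
Since the same lookup with a zero-vector fallback is needed for training data, test data, and any new text, it can be wrapped in a function. The following is a minimal sketch: the function name `embed_tokens` and its parameters are hypothetical, and a plain dictionary with toy 3-dimensional vectors stands in for the word2vec model.

```python
import numpy as np

def embed_tokens(tokens, embedding_model, dimensions=300):
    """Look up each token; unknown tokens get an all-zero vector.

    Returns the matrix of vectors and the list of tokens that were not found.
    """
    vectors = []
    unknown = []
    for token in tokens:
        if token in embedding_model:
            vectors.append(np.asarray(embedding_model[token], dtype=float))
        else:
            vectors.append(np.zeros(dimensions))
            unknown.append(token)
    return np.array(vectors), unknown

# A plain dict stands in for the word2vec model here (toy 3-dimensional vectors)
toy_model = {'Germany': [0.1, 0.2, 0.3], 'Union': [0.4, 0.5, 0.6]}
matrix, missing = embed_tokens(['Germany', "'s", 'Union'], toy_model, dimensions=3)
print(matrix.shape, missing)
```

The function only relies on the `in` test and on indexing by word, both of which the loaded `word_embedding_model` supports as used in the cell above.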


We can now inspect training_input. It has the same length as the tokenized sentence, but each word is replaced by its embedding. Below we print its length and its first element.

In [12]:
print("The length of the training input = ",len(training_input))
print(training_input[0])

The length of the training input =  7
[ 0.25976562  0.140625    0.24707031  0.00958252 -0.25       -0.08251953
 -0.09912109 -0.35351562 -0.1484375   0.1484375  -0.03540039 -0.05249023
  0.09277344 -0.14257812 -0.01483154  0.01647949  0.03710938  0.18847656
 -0.03955078 -0.05786133  0.26757812  0.10693359 -0.04345703  0.06738281
 -0.00177765  0.1328125  -0.16308594 -0.05908203 -0.22558594  0.12207031
  0.10791016 -0.19433594 -0.16210938 -0.14257812  0.09033203 -0.14648438
 -0.12109375  0.09960938  0.26367188  0.12695312  0.140625    0.11083984
  0.02697754 -0.01635742  0.00292969  0.14746094 -0.06542969 -0.16699219
  0.03662109  0.14941406 -0.14746094  0.06835938 -0.09228516  0.12207031
 -0.09179688  0.09082031 -0.38476562  0.03051758 -0.21679688 -0.12597656
 -0.08642578 -0.26171875 -0.08496094 -0.13964844 -0.02832031 -0.203125
  0.29101562 -0.13574219 -0.07226562  0.16308594 -0.19042969  0.22265625
  0.05566406  0.21289062  0.05053711 -0.09814453  0.12158203  0.01000977
  0.15234375 -0

Same as in the earlier cases, once we have the vector representations, we can use them to train our model.

In [12]:
from sklearn import svm

lin_clf = svm.LinearSVC()
lin_clf.fit(training_input, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

**Testing the model** Let's say we want to test our model with the sentence: 'I love beer from Munich'.

In [13]:
test_sentence='I love beer from Munich'
test_doc=nlp(test_sentence)
gold_labels=['O', 'O', 'O', 'O', 'B-LOC']

test_inputs=[]

for token in test_doc:
    word=token.text
    if word in word_embedding_model:
        vector=word_embedding_model[word]
    else:
        vector=[0]*300
    test_inputs.append(vector)
    
pred=lin_clf.predict(test_inputs)
print(pred)

['O' 'O' 'O' 'O' 'B-LOC']
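
The `gold_labels` list makes it possible to score the prediction instead of comparing the two sequences by eye. Below is a minimal sketch of token-level accuracy in plain Python; the function name `token_accuracy` is hypothetical, and the predicted labels are copied by hand from the output above.

```python
def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError('gold and predicted must have the same length')
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold_labels = ['O', 'O', 'O', 'O', 'B-LOC']
predicted_labels = ['O', 'O', 'O', 'O', 'B-LOC']  # copied from the printed prediction
accuracy = token_accuracy(gold_labels, predicted_labels)
print(accuracy)
```

Accuracy is dominated by the frequent 'O' label, so on realistic data per-label precision and recall (for example via `sklearn.metrics.classification_report`) give a more informative picture.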


Congratulations! You have now trained and tested your first embeddings-based NERC model.

As mentioned above, a more modern version of this model would be to replace SVM with a sequence-to-sequence architecture from the recurrent neural networks family.

So far you have just worked with a few toy examples. To obtain good performance, machine learning systems may need thousands and sometimes hundreds of thousands of training examples. The vocabulary of a language is large and there is also large variation in expressions. Collecting even a few examples for each word or expression therefore requires massive amounts of text.

To some extent, word embeddings resolve the issue of *data sparseness*, as words unseen in the training data may still be similar to other words that are in the training data. Word embeddings are derived from millions of documents (billions of tokens) and are likely to have embeddings for most common words.

### 2.1 Combining embeddings with one-hot encoding

In [13]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()

In [14]:
training_instances=[]
for token in doc:
    one_training_instance={'part-of-speech': token.pos_, 'lemma': token.lemma_} # one dictionary per token with its PoS and lemma as features
    training_instances.append(one_training_instance)

the_array = vec.fit_transform(training_instances).toarray() 

In [15]:
the_array.shape
# ROWS are WORDS, COLUMNS are FEATURES

(7, 12)

In [19]:
training_input=[]
for token in doc:
    word=token.text  #the next word from the tokenized text
    # we check whether the word is in the vocabulary of our model
    # (loaded with the Google word2vec embeddings)
    if word in word_embedding_model:
        # in this case the word was found and vector is assigned with its embedding vector as the value
        vector=word_embedding_model[word]
    else: 
        # if the word does not exist in the embeddings vocabulary, 
        # we create a vector with all zeros.
        # The Google word2vec model has 300 dimensions so we create a vector with 300 zeros
        vector=[0]*300
        print('This word is not in the word2vec vocabulary:', word)
    training_input.append(vector)

This word is not in the word2vec vocabulary: 's
This word is not in the word2vec vocabulary: to


In [24]:
np.array(training_input).shape
# ROWS are WORDS, COLUMNS are EMBEDDINGS

(7, 300)

We can now concatenate the features for each word:

In [2]:
features_input=np.array(the_array)
embeddings_input=np.array(training_input)

In [1]:
features_input


In [34]:
assert features_input.shape[0]==embeddings_input.shape[0], \
    'Error: different number of words are represented by embeddings compared to linguistic features'

num_words=features_input.shape[0]
concat_input=[]
for index in range(num_words):
    print(index)
    representation=list(features_input[index]) + list(embeddings_input[index]) # concatenate features per word
    concat_input.append(representation)

0
1
2
3
4
5
6


In [38]:
np.array(concat_input).shape

(7, 312)
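
The same row-by-row concatenation can also be done in a single NumPy call. The sketch below uses placeholder arrays with the shapes seen above instead of the real features and embeddings; the variable names are assumptions for illustration.

```python
import numpy as np

# Placeholder arrays with the same shapes as the one-hot features and the embeddings
toy_features = np.ones((7, 12))
toy_embeddings = np.zeros((7, 300))

# hstack joins the columns of both arrays for every row (word)
toy_concat = np.hstack([toy_features, toy_embeddings])
print(toy_concat.shape)
```

Both arrays must have the same number of rows, which is the condition the assert statement above checks.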

## 3 NERC datasets

In this section, we will look at large data sets created by the community, in which people have annotated entities. In the assignment (Lab4.5.assignment) you will use this data to train and test models that give a realistic performance.

Here, we will load two NERC datasets and quickly inspect their contents.

**Preparation** Please download the .zip file with the two datasets from [this link](http://kyoto.let.vu.nl/~vossen/rma_hlt/nerc_datasets.zip)

Then unpack the .zip, so that the folder `nerc_datasets` lies in the same directory as this notebook.

### 3.1 CoNLL-2003

Now that we've seen how to represent linguistic features, we also need to access relevant linguistic training data for the NERC task. One of the most popular datasets is [CoNLL-2003](http://aclweb.org/anthology/W03-0419), which was provided with the zip file you just downloaded.
You can load it using the following code snippet, which makes use of the NLTK function ConllCorpusReader to do the magic. More information on the ConllCorpusReader can be found here: https://www.nltk.org/_modules/nltk/corpus/reader/conll.html

In [None]:
from nltk.corpus.reader import ConllCorpusReader

train = ConllCorpusReader('nerc_datasets/CONLL2003', # the folder where CoNLL-2003 is stored (you downloaded this with the zip file from canvas) 
                          'train.txt', # this will load the file 'train.txt', for the exercise you also need to load 'test.txt' 
                          ['words', 'pos', 'ignore', 'chunk']) # the NE label column is read as 'chunk' so that iob_words() returns it
for token, pos, ne_label in train.iob_words():
    print(token, pos, ne_label) # please represent this information using a dictionary for the feature representation
    break

We can for example iterate through this data, and make a list of the tokens as inputs, and of the `ne_label` values as desirable outputs. The input tokens could for example be looked up in our word embeddings dictionary.

In [None]:
input_vectors=[]
labels=[]
for token, pos, ne_label in train.iob_words():
    
    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        input_vectors.append(vector)
        labels.append(ne_label)

We have successfully loaded our data. Let's see how many tokens/labels we have:

In [None]:
print(len(labels))

In [None]:
print('First ten labels =', labels[:10])

Obviously, input_vectors should have the same length as labels:

In [None]:
print(len(input_vectors))
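
Besides the total number of labels, it is often useful to see how they are distributed, since most tokens in NERC data carry the label 'O'. The following sketch counts labels with `collections.Counter`; the short `toy_labels` list is a hand-written stand-in for the full `labels` list and is an assumption for illustration.

```python
from collections import Counter

# A short hand-written label sequence stands in for the full `labels` list
toy_labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'B-PER', 'I-PER', 'O']

label_counts = Counter(toy_labels)
# Print every label with its count and its share of all tokens, most frequent first
for label, count in label_counts.most_common():
    print(label, count, round(count / len(toy_labels), 2))
```

Running the same loop on `labels` shows how skewed the real data is, which matters when interpreting the scores of a trained model.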

In a next step, we could easily train a model on this data as shown above, by passing the input vectors together with the labels to a fit function.

### 3.2 Kaggle
Another interesting dataset is the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), which we also provided in the zip file you downloaded from Canvas. You can load it in the following way:

In [None]:
import pandas

In [None]:
path = 'nerc_datasets/kaggle/ner_v2.csv'

In [None]:
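# on_bad_lines='warn' (pandas 1.3 and later) skips malformed rows and reports them;
# older pandas versions used error_bad_lines=False for the same purpose
kaggle_dataset = pandas.read_csv(path, on_bad_lines='warn')

After running the above code cell, you will see a warning similar to the following:
```
b'Skipping line 281837: expected 25 fields, saw 34\n'
```
You can ignore this; it only means that one malformed line was skipped.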

**pandas.read_csv** will load the csv file into a [pandas DataFrame](https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96).

You can inspect which columns are in the csv file by running the following code cell:

In [None]:
kaggle_dataset.columns

You can see that a wide range of features is given for each token. [Here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), you can read what each column represents.

You can loop through the dataset in the following way:

In [None]:
for index, instance in kaggle_dataset.iterrows():
    print()
    print(index)
    print(instance) # you can access information by using instance['A COLUMN NAME'] which you can use to convert to a dictionary needed for the feature representation.
    print('NERC label', instance['tag'])
    break

We could for instance use these features as inputs in a machine learning model with our DictVectorizer, or by transforming them using embeddings.

End of this notebook