# Skip-gram with Negative Samples 
### by RafaelxFernandes

## Downloading Corpora

In [1]:
import nltk
nltk.download() # go to the Corpora tab and double click on 'brown' and 'conll2000'

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
from nltk.corpus import brown
from gensim.models import Word2Vec
import multiprocessing

In [3]:
sentences = brown.sents()
sentences[:3]

[['The',
  'Fulton',
  'County',
  'Grand',
  'Jury',
  'said',
  'Friday',
  'an',
  'investigation',
  'of',
  "Atlanta's",
  'recent',
  'primary',
  'election',
  'produced',
  '``',
  'no',
  'evidence',
  "''",
  'that',
  'any',
  'irregularities',
  'took',
  'place',
  '.'],
 ['The',
  'jury',
  'further',
  'said',
  'in',
  'term-end',
  'presentments',
  'that',
  'the',
  'City',
  'Executive',
  'Committee',
  ',',
  'which',
  'had',
  'over-all',
  'charge',
  'of',
  'the',
  'election',
  ',',
  '``',
  'deserves',
  'the',
  'praise',
  'and',
  'thanks',
  'of',
  'the',
  'City',
  'of',
  'Atlanta',
  "''",
  'for',
  'the',
  'manner',
  'in',
  'which',
  'the',
  'election',
  'was',
  'conducted',
  '.'],
 ['The',
  'September-October',
  'term',
  'jury',
  'had',
  'been',
  'charged',
  'by',
  'Fulton',
  'Superior',
  'Court',
  'Judge',
  'Durwood',
  'Pye',
  'to',
  'investigate',
  'reports',
  'of',
  'possible',
  '``',
  'irregularities',
  "''",
 

## Building embedding

• sentences — The iterable over the tokenised sentences we will train on (the Brown sentences).

• window — This determines which words are considered contexts of the target. For the window of size n the contexts are defined by capturing n words to the left of the target and n words to its right. The size of window will affect the type of similarity captured in the embeddings — bigger windows will result in more topical/domain similarities.

• min_count — We can use this parameter to tell the model to ignore some infrequent words — don’t create an embedding for them and don’t include them as contexts. The min_count defines a threshold frequency value that needs to be reached for the word to be included in the vocabulary.

• negative — Defines the number of negative samples (incorrect training pair instances) that are drawn for each good sample.

• workers — Determines how many worker threads will be used to train the model.

In our setting for window and negative samples we will follow the settings from the original Skip-gram papers. We will set the workers parameter to the number of available cores and train our model for ten epochs (as our training data is quite small, ~1M words)

In [4]:
w2v = Word2Vec(
    sentences,
    window = 5,
    min_count = 5,
    negative = 15,
    workers = multiprocessing.cpu_count()
)

In [5]:
# Get trained embeddings - a KeyedVector instance
word_vectors = w2v.wv
result = word_vectors.similar_by_word('Saturday')
print("Most similar to 'Saturday':\n ", result[:10])

Most similar to 'Saturday':
  [('Monday', 0.9549916386604309), ('Sunday', 0.9446992874145508), ('Friday', 0.9341464638710022), ('Tuesday', 0.9257744550704956), ('fourth', 0.9210171103477478), ('Wednesday', 0.9108353853225708), ('ending', 0.9106801152229309), ('December', 0.9002511501312256), ('afternoon', 0.8998737931251526), ('April', 0.898629903793335)]


## Using embedding as feature in a Neural model

Now that we have our embeddings it’s time to put them into use. We will use them as features for the part-of-speech (POS) tagging model we will develop. A part-of-speech is a grammatical category of a word, such as a noun, verb or an adjective. Given a sequence of words, the task is to label each of them with a suitable POS tag.

We will build a simple neural model for multi-class classification. For now, we will ignore the context of the word we are tagging — our network will take only one word as input and output the probability distribution over all possible POS tags. To train and evaluate our model we will make use of yet another NLTK resource: the data from the CONLL-2000 Shared Task, which has been annotated with POS tags.

### Preparing the data

In [6]:
from nltk.corpus import conll2000

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, Activation, Flatten
from tensorflow.keras.utils import to_categorical

import collections
import numpy as np

In [7]:
train_words = conll2000.tagged_words("train.txt")
test_words = conll2000.tagged_words("test.txt")

In [8]:
train_words[:10]

[('Confidence', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('pound', 'NN'),
 ('is', 'VBZ'),
 ('widely', 'RB'),
 ('expected', 'VBN'),
 ('to', 'TO'),
 ('take', 'VB'),
 ('another', 'DT')]

Our first step is to process this data into a model-friendly format — replace all words and tags with their corresponding indexes and split the data into inputs and outputs (tag labels). To do that we will need a dictionary which maps words to their corresponding ids and a similar dictionary for the tags. We will create the latter based on our CONLL training data, but to create the first we will use the vocabulary of our trained embedding model — as it should only contain the words which we are able to represent.

In [9]:
# Accepts text in the form of (word, pos) tuples
# and returns a dictionary mapping POS-tags to unique ids
def get_tag_vocabulary(tagged_words):
    
    tag2id = {}
    
    for item in tagged_words:
        tag = item[1]
        tag2id.setdefault(tag, len(tag2id))
    
    return tag2id

In [10]:
# The word_vectors.key_to_index dictionary stores integers
word2id = {key: value for key, value in word_vectors.key_to_index.items()}
tag2id = get_tag_vocabulary(train_words)

Adding a new word to our vocabulary — the ‘UNK’, which will represent all words we don’t have an embedding for. But adding this word to the vocabulary means it will need to have a corresponding embedding, not present in our representations. One solution would be to retrain Skip-gram after having replaced some occurrences of low frequency words in our training data with an ‘UNK’ token. But we will approach this problem from a different angle by approximating the UNK’s vector with a mean of all existing embeddings. After doing so, we will add this new representation to the matrix of all other embeddings.

In [11]:
UNK_INDEX = 0
UNK_TOKEN = "UNK"

In [12]:
# Adds a new word to the existing matrix of word embeddings
def add_new_word(new_word, new_vector, new_index, embedding_matrix, word2id):
    
    # Inserting the vector before given index, along axis 0
    embedding_matrix = np.insert(embedding_matrix, [new_index], [new_vector], axis = 0)
    
    # Updating the indexes of words that follow the new word
    word2id = {word: (index + 1) if index >= new_index else index
              for word, index in word2id.items()}
    word2id[new_word] = new_index
    
    return embedding_matrix, word2id

In [13]:
embedding_matrix = word_vectors.vectors
unk_vector = embedding_matrix.mean(0)
embedding_matrix, word2id = add_new_word(UNK_TOKEN, unk_vector, UNK_INDEX, embedding_matrix, word2id)

Now it’s time to get our integer, model-friendly data — both for the train and test splits.

In [14]:
# Replaces all words and tags with their corresponding ids and
# separates words(features) from the tags(labels)
def get_int_data(tagged_words, word2id, tag2id):
    
    # X holds word ids, Y hold their tags ids
    X, Y = [], []
    
    # Variable to keep track of the number of unknown words
    # which are words we don't have a representation for
    unk_count = 0
    
    for word, tag in tagged_words:
        Y.append(tag2id.get(tag))
        
        if word in word2id:
            X.append(word2id.get(word))
        else:
            X.append(UNK_INDEX)
            unk_count += 1
            
    X = np.asarray(X).astype(np.float32)
    Y = np.asarray(Y).astype(np.float32)

    print("Data created. Percentage of unknown words: %.3f" % (unk_count/ len(tagged_words)))
    
    return X, Y

In [15]:
X_train, Y_train = get_int_data(train_words, word2id, tag2id)
X_test, Y_test = get_int_data(test_words, word2id, tag2id)

Y_train, Y_test = to_categorical(Y_train), to_categorical(Y_test)

Data created. Percentage of unknown words: 0.143
Data created. Percentage of unknown words: 0.149


### Defining and training the model

Our next step is to define the model for POS classification. We will do so using TensorFlow’s implementation of the Keras API. Our model will take as input an index into the word embedding matrix, which will be used to look up the appropriate embedding. It will have one hidden layer with the tanh activation function and at the final layer will use the softmax activation — outputting a probability distribution over all possible tags.

In [16]:
EMB_DIM = 100
HIDDEN_SIZE = 50
BATCH_SIZE = 128

In [17]:
# Create and returns a simple part-of-speech model,
# which takes only one word as input
def define_model(embedding_matrix, class_count):
    
    vocab_length = len(embedding_matrix)
    
    # Sequential model is a stack of layers, we will add them one by one
    model = Sequential()
    
    model.add(Embedding( # Layer which turns word indexes into vectors
        input_dim = vocab_length,
        output_dim = EMB_DIM, # output of this layer is the embedding of the input word
        weights = [embedding_matrix], # matrix holding the trained embeddings
        input_length = 1)) # specify how many indexes we are looking for
    model.add(Flatten())
    model.add(Dense(HIDDEN_SIZE))
    model.add(Activation("tanh"))
    model.add(Dense(class_count))
    model.add(Activation("softmax"))
    
    model.compile(optimizer = tf.optimizers.Adam(),
                 loss = tf.keras.losses.CategoricalCrossentropy(),
                 metrics = ["accuracy"])
    
    return model

In [18]:
pos_model = define_model(embedding_matrix, len(tag2id))
pos_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1, 100)            1517400   
                                                                 
 flatten (Flatten)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 50)                5050      
                                                                 
 activation (Activation)     (None, 50)                0         
                                                                 
 dense_1 (Dense)             (None, 44)                2244      
                                                                 
 activation_1 (Activation)   (None, 44)                0         
                                                                 
Total params: 1,524,694
Trainable params: 1,524,694
Non-

In [19]:
# Training the model
pos_model.fit(X_train, Y_train, batch_size = BATCH_SIZE, epochs = 1, verbose = 1)



<keras.callbacks.History at 0x1a597383bb0>

### Evaluating the model

Now that we have a trained model it’s time to see how well it’s performing on the unseen data. We will use it to tag the words from the test data and calculate the accuracy of its predictions: the ratio of the number of correct tags to the number of all words in the test set. To get more insight, we will also determine what are the most commonly mistagged words.

In [20]:
# Evaluates the given model by computing the accuracy of its predictions
# on the given test data and prints out 10 most mistagged words
def evaluate_model(model, id2word, x_test, y_test):
    
    _, acc = model.evaluate(x_test, y_test)
    print("Accuracy: %.4f" % acc)
    
    # Get model predictions and count its erros
    y_pred = np.argmax(model.predict(x_test), axis = -1)
    error_counter = collections.Counter()
    
    for i in range(len(x_test)):
        correct_tag_id = np.argmax(y_test[i]) # turn a one-hot encoding to an index
        
        if y_pred[i] != correct_tag_id:
            word = id2word[int(x_test[i])]
            error_counter[word] += 1
            
    print("Most common errors: \n", error_counter.most_common(10))

In [21]:
id2word = sorted(word2id, key = word2id.get)
evaluate_model(pos_model, id2word, X_test, Y_test)

Accuracy: 0.8479
Most common errors: 
 [('UNK', 5034), ('that', 136), ('have', 51), ('as', 37), ('more', 30), ('about', 18), ('executive', 18), ('American', 18), ('plans', 16), ('called', 14)]


As expected, our model performs the worst when tagging the unknown words. The accuracy is 85%, which is not too bad, but we can do better. Let’s try improving the model by making the classification context-dependent!

### Building a context-dependent model

We will now alter the model built in the previous steps to take more than one word index as input. In addition to the index of the classified word we will feed in the indexes of two words to its left side and two words to its right side — all in the order of their appearance in the training data.

Apart from redefining our model we also need to adjust the way we process the CONLL data: the X_train and X_test will now consist of arrays of indexes, rather than single indexes. We will use a sliding-window approach to retrieve all word spans of length 5 — each consisting of the tagged word and its context-words. For each such span, the corresponding label will be the tag of the middle word. To represent the missing contexts of words at the beginning and the end of the training data sequence we will use a new, special word — the end-of-sequence (EOS). We will add EOS using the previously defined add_new_word function, in a similar way to how we have added UNK:

In [22]:
EOS_INDEX = 1
EOS_TOKEN = "EOS"

In [23]:
# Creating a random EOS vector
eos_vector = np.random.standard_normal(EMB_DIM)
embedding_matrix, word2id = add_new_word(EOS_TOKEN, eos_vector, EOS_INDEX, embedding_matrix, word2id)

In [24]:
# Defined the size of the context window
CONTEXT_SIZE = 2

In [25]:
# Replaces all words and tags with their corresponding ids and
# generates an array of label ids Y and the training data X,
# which consists of arrays of word indexes (of tagged word and its context)
def get_window_int_data(tagged_words, word2id, tag2id):
    
    # X holds word ids, Y hold their tags ids
    X, Y = [], []
    
    # Variable to keep track of the number of unknown words
    # which are words we don't have a representation for
    unk_count = 0
    
    # The complete span of the sliding window
    span = 2 * CONTEXT_SIZE + 1
    buffer = collections.deque(maxlen = span)
    padding = [(EOS_TOKEN, None)] * CONTEXT_SIZE
    buffer += padding + tagged_words[:CONTEXT_SIZE]
    
    for item in (tagged_words[CONTEXT_SIZE:] + padding):
        buffer.append(item)
        
        # The input to the model is the ids of all words in the window
        window_ids = np.array([word2id.get(word) if (word in word2id) else UNK_INDEX
                              for (word, _) in buffer])
        
        X.append(window_ids)
        
        # The label is the tag of the middle word
        middle_word, middle_tag = buffer[CONTEXT_SIZE]
        Y.append(tag2id.get(middle_tag))
        
        if middle_word not in word2id:
            unk_count += 1
            
    X = np.asarray(X).astype(np.float32)
    Y = np.asarray(Y).astype(np.float32)

    print("Data created. Percentage of unknown words: %.3f" % (unk_count/ len(tagged_words)))
    
    return X, Y

Our next step is defining the model. It will be very similar to the simple model from our previous steps. In fact, the only thing that will change is the Embedding layer, which will now take 5 word indexes instead of 1. We will also slightly alter our evaluation function — to support the structure of our new training data.

In [26]:
# Create and returns a simple part-of-speech model,
# which takes only one word as input
def define_context_sensitive_model(embedding_matrix, class_count):
    
    vocab_length = len(embedding_matrix)
    total_span = CONTEXT_SIZE * 2 + 1
    
    # Sequential model is a stack of layers, we will add them one by one
    model = Sequential()
    
    model.add(Embedding( # Layer which turns word indexes into vectors
        input_dim = vocab_length,
        output_dim = EMB_DIM, # output of this layer is the embedding of the input word
        weights = [embedding_matrix], # matrix holding the trained embeddings
        input_length = total_span)) # specify how many indexes we are looking for
    model.add(Flatten())
    model.add(Dense(HIDDEN_SIZE))
    model.add(Activation("tanh"))
    model.add(Dense(class_count))
    model.add(Activation("softmax"))
    
    model.compile(optimizer = tf.optimizers.Adam(),
                 loss = tf.keras.losses.CategoricalCrossentropy(),
                 metrics = ["accuracy"])
    
    return model

In [27]:
# Evaluates the given model by computing the accuracy of its predictions
# on the given test data and prints out 10 most mistagged words
def evaluate_context_sensitive_model(model, id2word, x_test, y_test):
    
    _, acc = model.evaluate(x_test, y_test)
    print("Accuracy: %.4f" % acc)
    
    # Get model predictions and count its erros
    y_pred = np.argmax(model.predict(x_test), axis = -1)
    error_counter = collections.Counter()
    
    for i in range(len(x_test)):
        correct_tag_id = np.argmax(y_test[i]) # turn a one-hot encoding to an index
        
        if y_pred[i] != correct_tag_id:
            if isinstance(x_test[i], np.ndarray):
                word = id2word[int(x_test[i][CONTEXT_SIZE])]
            else:
                word = id2word[int(x_test[i])]
            
            error_counter[word] += 1
            
    print("Most common errors: \n", error_counter.most_common(10))

In [28]:
X_train2, Y_train2 = get_window_int_data(train_words, word2id, tag2id)
X_test2, Y_test2 = get_window_int_data(test_words, word2id, tag2id)

Y_train2, Y_test2 = to_categorical(Y_train2), to_categorical(Y_test2)

Data created. Percentage of unknown words: 0.143
Data created. Percentage of unknown words: 0.149


In [29]:
cs_pos_model = define_context_sensitive_model(embedding_matrix, len(tag2id))
cs_pos_model.fit(X_train2, Y_train2, batch_size = BATCH_SIZE, epochs = 1, verbose = 1)



<keras.callbacks.History at 0x1a59459b280>

In [30]:
evaluate_context_sensitive_model(cs_pos_model, id2word, X_test2, Y_test2)

Accuracy: 0.9055
Most common errors: 
 [('UNK', 2978), ('is', 67), ('out', 30), ('so', 19), ('old', 14), ('content', 10), ('boot', 9), ('It', 8), ('C.', 8), ('its', 8)]


Our accuracy jumped up to 91%! It looks like adding the context really helped with tagging the unknown words and also helped to disambiguate other words.

## References:

- Skip-Gram: NLP context words prediction algorithm: https://towardsdatascience.com/skip-gram-nlp-context-words-prediction-algorithm-5bbf34f84e0c

- Word2Vec (skip-gram model): PART 1 - Intuition.: https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b

- Word2Vec (Skip-Gram model) Explained: https://medium.datadriveninvestor.com/word2vec-skip-gram-model-explained-383fa6ddc4ae

- Word2Vec Tutorial Part 2 - Negative Sampling: http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

- NLP’s word2vec: Negative Sampling Explained: https://www.baeldung.com/cs/nlps-word2vec-negative-sampling

- NLP 102: Negative Sampling and GloVe: https://towardsdatascience.com/nlp-101-negative-sampling-and-glove-936c88f3bc68