# Embedding Process (WIP)

This notebook trains word embeddings with the CBOW (continuous bag-of-words) model described in the Corpus Data Generation notebook. The model uses the average embedding of the context words to predict a target word, which forces it to learn linearly meaningful embeddings.
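
To make the prediction step concrete, here is a minimal pure-Python sketch of a CBOW forward pass on a toy vocabulary. The five-word vocabulary, the two-dimensional embeddings, the weight values, and the helper name `cbow_forward` are all made up for illustration; the real model in this notebook uses Keras layers instead.

```python
import math

# Toy setup (illustrative values only, not the trained model).
embeddings = [          # one row per word id; row 0 is the padding id
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.5],
]
dense_weights = [       # shape: embed_size x vocab_size
    [0.1, 0.9, -0.3, 0.5, -0.7],
    [0.2, -0.4, 0.8, 0.6, 0.3],
]
dense_bias = [0.0] * 5

def cbow_forward(context_ids):
    """Return a probability for each vocabulary word given the context ids."""
    dim = len(embeddings[0])
    vocab = len(dense_bias)
    # 1. Look up the context embeddings and average them.
    avg = [sum(embeddings[i][d] for i in context_ids) / len(context_ids)
           for d in range(dim)]
    # 2. Dense layer: one logit per vocabulary word.
    logits = [sum(avg[d] * dense_weights[d][w] for d in range(dim)) + dense_bias[w]
              for w in range(vocab)]
    # 3. Softmax turns the logits into a probability distribution.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward([1, 2, 3, 1, 2, 3])   # six context ids, as with window_size = 3
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Training adjusts both the embedding rows and the dense weights so that the probability assigned to the true target word goes up.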

In [1]:
import tensorflow as tf

First, we define the variables carried over from the previous notebook. The batch size was determined experimentally for load speed and computational efficiency on our machines. Ideally, the batch size would be much larger.

In [2]:
vocab_size = 75677
window_size = 3
embed_size = 100
batch_size = 256

Helper function to parse a batch of examples from the TFRecord binary format and one-hot encode the targets.

In [3]:
def parse_examples(examples):
    # Each serialized example holds one target word id and the
    # 2 * window_size context word ids around it.
    corpus_feature_set = {
        'target' : tf.io.FixedLenFeature([], tf.int64, default_value=0),
        'context' : tf.io.FixedLenFeature([window_size * 2], tf.int64, default_value=[0] * (window_size * 2))
    }
    tensor_dict = tf.io.parse_example(examples, corpus_feature_set)
    # One-hot encode the targets to match the categorical cross-entropy loss.
    tensor_dict['target'] = tf.one_hot(tensor_dict['target'], depth = vocab_size)
    return tensor_dict


Next, we build the dataset using a parallel interleave over our files. This lets us use multiple cores to continuously load data and prepare it for the model.

In [4]:
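def build_dataset(file_pattern):
    return tf.data.Dataset.list_files(
        file_pattern
    ).interleave(
        # Read several TFRecord files in parallel and mix their records.
        tf.data.TFRecordDataset,
        cycle_length=tf.data.experimental.AUTOTUNE,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).shuffle(
        2048  # shuffle buffer size, in individual examples
    ).batch(
        batch_size=batch_size,
        drop_remainder=False
    ).map(
        # Parse after batching so that each call handles a whole batch.
        map_func=parse_examples,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).prefetch(
        # Prepare upcoming batches while the current one is being trained on.
        tf.data.experimental.AUTOTUNE
    )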
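
The loading pattern can be hard to picture, so here is a rough pure-Python sketch of what interleaving and buffered shuffling do conceptually. It is a simplified, single-threaded simulation and not how `tf.data` is implemented; the helper names `interleave` and `buffered_shuffle` and the toy file contents are invented for illustration.

```python
import random

def interleave(sources):
    """Yield one record from each source in turn until all are exhausted."""
    iterators = [iter(s) for s in sources]
    while iterators:
        still_open = []
        for it in iterators:
            try:
                yield next(it)
                still_open.append(it)
            except StopIteration:
                pass  # this source is finished; drop it
        iterators = still_open

def buffered_shuffle(records, buffer_size, rng):
    """Shuffle with a fixed-size buffer, so memory use stays bounded."""
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) > buffer_size:
            # Emit a random element once the buffer is over capacity.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain whatever is left at the end of the stream.
    rng.shuffle(buffer)
    yield from buffer

# Three toy "files" of different lengths.
files = [["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2", "c3"]]
mixed = list(interleave(files))
shuffled = list(buffered_shuffle(mixed, buffer_size=4, rng=random.Random(0)))
```

A bounded shuffle buffer can only mix records that arrive close together in the stream, which is one reason to interleave records from several files before shuffling.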

Next, we build our model. The model is simply an embedding layer followed by an averaging layer and a fully connected (dense) head with a softmax activation.

In [5]:
# Import from the public Keras API rather than the private tensorflow.python package.
from tensorflow.keras import layers, Sequential

# build CBOW architecture
cbow = Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2),
    layers.GlobalAveragePooling1D(),
    layers.Dense(vocab_size, activation='softmax')
])

# view model summary (summary() prints the table itself)
cbow.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 6, 100)            7567700   
_________________________________________________________________
global_average_pooling1d (Gl (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 75677)             7643377   
=================================================================
Total params: 15,211,077
Trainable params: 15,211,077
Non-trainable params: 0
_________________________________________________________________
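
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from `vocab_size` and `embed_size`; the variable names for the intermediate results are ours.

```python
vocab_size = 75677
embed_size = 100

embedding_params = vocab_size * embed_size            # one vector per word
dense_params = embed_size * vocab_size + vocab_size   # weights plus one bias per output
total_params = embedding_params + dense_params        # the pooling layer has no parameters

print(embedding_params, dense_params, total_params)
# prints: 7567700 7643377 15211077
```

The dense head is slightly larger than the embedding layer only because of its bias terms, so roughly half of the trained parameters are discarded when the head is removed later.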


Now let's do some training. We'll use the TensorFlow checkpoint management system so that we can periodically interrupt the training to check progress. We chose the Adam optimizer, which adapts the step size for each parameter. Adam's state can also be checkpointed, so training resumes from the same optimizer state after reloading the notebook. The loss is categorical cross-entropy, since predicting the target word is a multi-class classification task.

In [6]:
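# Adam with a learning rate of 0.1 (the Keras default is 0.001).
opt = tf.keras.optimizers.Adam(0.1)
dataset = build_dataset('corpus_training_data/*.tfrecords')
# Use the first 100 batches for validation and the rest for training.
# The pipeline reshuffles on every pass, so this is not a fixed held-out set.
training = dataset.skip(100)
validation = dataset.take(100)
# Track the step counter, model, and optimizer so training can resume after an interruption.
checkpoint = tf.train.Checkpoint(step = tf.Variable(1), net=cbow, optimizer=opt)
manager = tf.train.CheckpointManager(checkpoint, './embedding_checkpoints', max_to_keep = 5)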
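
To illustrate what adapting the step size for each parameter means, here is a small sketch of the textbook Adam update for a single scalar parameter. The function name `adam_step` is hypothetical, the `beta1`, `beta2`, and `eps` values are commonly used defaults assumed here, and the Keras implementation may differ in detail.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7):
    """Apply one Adam update to a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters whose gradients differ by a factor of 10,000.
small, _, _ = adam_step(param=0.0, grad=0.001, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(param=0.0, grad=10.0, m=0.0, v=0.0, t=1)
```

Both parameters move by roughly the learning rate on the first step, even though their gradients are very different in size, because each update is divided by that parameter's own gradient scale. The running averages `m` and `v` are the state that gets saved alongside the model in the checkpoint.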

In [7]:
def train_and_checkpoint(net, training, validation, manager, save_frequency = 1000, num_epochs = 10):
    # Relies on the `checkpoint` and `opt` objects defined in the previous cell.
    checkpoint.restore(manager.latest_checkpoint)
    if manager.latest_checkpoint:
        print("Restored from {}".format(manager.latest_checkpoint))
    else:
        print("Initializing from scratch.")
    # The model is compiled the same way in both cases.
    net.compile(loss='categorical_crossentropy', optimizer=opt)

    for epoch in range(num_epochs):
        print('Epoch {} started'.format(epoch))
        for batch in training:
            loss = net.train_on_batch(batch['context'], batch['target'])
            checkpoint.step.assign_add(1)
            # Save a checkpoint and report the latest batch loss every save_frequency steps.
            if int(checkpoint.step) % save_frequency == 0:
                save_path = manager.save()
                print("Saved checkpoint for step {}: {}".format(int(checkpoint.step), save_path))
                print("loss {:1.2f}".format(loss))
        # Average the per-batch loss over the validation batches.
        val_loss = 0
        for num_batch, batch in enumerate(validation):
            val_loss += net.evaluate(batch['context'], batch['target'], verbose = 0)
        print('Epoch {} finished with validation loss of {}'.format(epoch, val_loss / (num_batch + 1)))

All that is left now is to train. Over the course of 2 days we managed to complete about 2.5 epochs. That's impressive, considering there are 24 million examples!

In [8]:
train_and_checkpoint(cbow, training, validation, manager)

Restored from ./embedding_checkpoints\ckpt-288
Epoch 0 started


KeyboardInterrupt: 
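
For a sense of scale, the approximate figures quoted above (24 million examples, 2.5 epochs, 2 days) translate into the following rough throughput. This is back-of-the-envelope arithmetic on rounded numbers, not a measurement, and it ignores the 100 batches held out for validation.

```python
examples_per_epoch = 24_000_000   # approximate figure quoted above
batch_size = 256
epochs_completed = 2.5            # approximate
seconds_elapsed = 2 * 24 * 60 * 60

batches_per_epoch = examples_per_epoch / batch_size
examples_per_second = examples_per_epoch * epochs_completed / seconds_elapsed
batches_per_second = examples_per_second / batch_size

print(round(batches_per_epoch), round(examples_per_second), round(batches_per_second, 2))
# prints: 93750 347 1.36
```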

Now we need to save these embeddings. First, we write some .tsv files. These are used in the TensorFlow Embedding Projector, which is helpful for checking the progress of the model.

In [None]:
import pandas as pd

# Map each word index (column '1') back to its word (column '0').
word_indexes = pd.read_csv('word_associations.csv')
word_lookup = {key : value for (key, value) in zip(word_indexes['1'], word_indexes['0'])}
# Index 0 is reserved for padding.
word_lookup[0] = ' '
word_lookup

In [None]:
import io

# The embedding matrix has one row per word index.
embeddings = cbow.layers[0].get_weights()[0]

# The context managers close both files even if an error interrupts the loop.
with io.open('embedding_results/vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('embedding_results/meta.tsv', 'w', encoding='utf-8') as out_m:
    for num in range(1, len(word_lookup)):  # skip 0, it's padding
        try:
            word = word_lookup[num]
            vec = embeddings[num]
            out_m.write(word + "\n")
            out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        except TypeError:
            # Raised when the word is not a string; report it and move on.
            print(word)
            print(num)
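
The resulting file layout is easiest to see on a tiny example. The sketch below writes the same two formats to in-memory buffers; the names `toy_lookup` and `toy_embeddings` and their values are invented stand-ins for `word_lookup` and `embeddings`.

```python
import io

# Toy data standing in for `word_lookup` and `embeddings` (illustrative values).
toy_lookup = {0: ' ', 1: 'cat', 2: 'dog'}
toy_embeddings = [[0.0, 0.0], [0.25, -0.5], [0.75, 1.0]]

vecs, meta = io.StringIO(), io.StringIO()
for num in range(1, len(toy_lookup)):  # skip 0, the padding index
    meta.write(toy_lookup[num] + "\n")
    vecs.write('\t'.join(str(x) for x in toy_embeddings[num]) + "\n")

print(repr(vecs.getvalue()))  # '0.25\t-0.5\n0.75\t1.0\n'
print(repr(meta.getvalue()))  # 'cat\ndog\n'
```

Line i of the vectors file and line i of the metadata file describe the same word, so the two files must always be written in step.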

To save the model for use in our RNN, we first remove the head of the model and then save the weights of what remains. This way, we only have to load this headless portion within the RNN notebook.

In [None]:
embedding_model = tf.keras.Sequential([cbow.layers[0]])
embedding_model.summary()

In [None]:
embedding_model.save_weights('embedding_results/model_weights/model_weights')

Below is an example of how to load the embedding model. First build a new model with the same structure, and then load the saved weights.

In [1]:
import tensorflow as tf

vocab_size = 75677
embed_size = 100
window_size = 3

imported_embeddings = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2)
])

imported_embeddings.load_weights('embedding_results/model_weights/model_weights')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x25e52570d88>

In [2]:
imported_embeddings.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 6, 100)            7567700   
=================================================================
Total params: 7,567,700
Trainable params: 7,567,700
Non-trainable params: 0
_________________________________________________________________
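
Conceptually, the loaded layer is a lookup table: each integer word index selects one row of the weight matrix. That is why the summary reports an output shape of `(None, 6, 100)`, one 100-dimensional vector for each of the 6 positions. A minimal sketch with a made-up 4-word, 3-dimensional matrix and a hypothetical `embed` helper:

```python
# Toy embedding matrix: 4 word indexes, 3 dimensions each (illustrative values).
embedding_matrix = [
    [0.0, 0.0, 0.0],   # index 0: padding
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def embed(word_ids):
    """Map a sequence of word indexes to their embedding vectors."""
    return [embedding_matrix[i] for i in word_ids]

sequence = [2, 0, 3]
vectors = embed(sequence)   # one row per position in the sequence
```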
