# keras training

this file loads our processed data and trains a recurrent language model

In [1]:
import numpy as np
from keras import backend as K
from keras.models import Model, load_model
from keras.layers import Input, Embedding, Dropout, LSTM, Lambda, Dense, Activation 
from keras.callbacks import ModelCheckpoint
import h5py

Using TensorFlow backend.


In [2]:
# restrict GPU usage here, if using multi-gpu
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [3]:
# read in data
cardtext = [list(x) for x in list(np.load('data/card_texts.npy'))]
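# the index dictionaries are stored as pickled objects; newer numpy needs allow_pickle=True
c2i = np.load('data/c2i.npy', allow_pickle=True).item()
i2c = np.load('data/i2c.npy', allow_pickle=True).item()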
xcards = np.load('data/xcards.npy')
ycards = np.load('data/ycards.npy')

In [4]:
# add axis for sparse_categorical_crossentropy
ycards = ycards[:, :, np.newaxis]
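
as a rough illustration of what the cell above does to the targets: the loss is given integer targets with a trailing axis of size 1, i.e. shape `(batch, timesteps, 1)`. the array below is made-up toy data, not the real `ycards`:

```python
import numpy as np

# toy stand-in for ycards: 3 cards, 5 character indices each (made-up values)
y_toy = np.array([[4, 7, 2, 0, 0],
                  [5, 1, 9, 3, 0],
                  [8, 8, 6, 2, 1]])

y_expanded = y_toy[:, :, np.newaxis]  # same indexing as in the cell above

print(y_toy.shape)       # (3, 5)
print(y_expanded.shape)  # (3, 5, 1)
```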

## define the hyperparameters

**dropout rate** : prevents over-fitting; by randomly zeroing a fraction of the activations during training, the language model cannot rely on any single feature and must learn to generalize. typical values are 0.25 to 0.50  
**embedding size** : the size of the character embeddings, which are learned through training; in this notebook it is set equal to the hidden size  
**hidden size** : the size of the LSTM gates and cells; i.e. the size of its 'memory'  
**vocabulary size** : the model will predict one of *n* characters where *n* is the vocabulary size  
**batch size** : we will use *minibatch gradient descent*; this is the number of examples we train on in each batch, i.e. per weight update  
**number of epochs** : one *epoch* is one pass through all the training data  
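
to make the relation between batch size and epochs concrete, here is a small sketch that counts weight updates. `updates_per_run` is a hypothetical helper, and `num_examples=1000` is a made-up figure rather than the size of the real dataset:

```python
import math

def updates_per_run(num_examples, batch_size, num_epochs):
    """return (batches per epoch, total number of weight updates)."""
    # the last batch of an epoch may be smaller than batch_size, hence ceil
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch, batches_per_epoch * num_epochs

# made-up dataset size, with the batch size and epoch count used in this notebook
print(updates_per_run(num_examples=1000, batch_size=32, num_epochs=10))  # (32, 320)
```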

In [5]:
# set parameters
DROP_RATE = 0.50              # dropout: between 0.25 and 0.5 is common
HIDDEN_SIZE = 500             # lstm feature vector size
MAX_Y_LEN = ycards.shape[1]   # maximum card length
VOCAB_SIZE = len(c2i.keys())  # number of characters
BATCH_SIZE = 32               # cards per batch
NUM_EPOCHS = 10               # number of epochs to train

## define the model

the model we will use is a *recurrent language model*. essentially, our network will predict the next character, given the previous characters it has seen/generated. we could try to help the network realize that it is at the beginning of a card by passing explicit starting states (using `initial_state`), but we do not set them here. when no `initial_state` is given, keras starts each sequence from all-zero states, so we rely on the initial start-of-sentence token to signal to the network that we are starting a card. the variety in the generated cards therefore comes from sampling the output distribution at decode time, not from the initial state.

we have already divided the cards into lists of characters, *indexed* the strings into integer arrays, and *padded* the arrays to a fixed length in the previous files. we also created input and output sequences that are offset by one, such that the first element of the *output* corresponds with the *second* element of the input etc. this is because we will train the model with *teacher forcing* : at each step, we will input the *true* character, and induce the network to output the next element. on decode, of course, since we are randomly generating the card sequences, we must input the *actual* previous output.
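
a minimal sketch of the offset-by-one input/output pairs described above. the vocabulary `toy_c2i` and the helper `make_xy` are hypothetical stand-ins (the real indexing was done in the previous files), and the text is assumed to fit within `max_len`:

```python
import numpy as np

# hypothetical toy vocabulary; index 0 is reserved for padding
toy_c2i = {'<pad>': 0, '<s>': 1, '</s>': 2, 'a': 3, 'b': 4, 'c': 5}

def make_xy(text, max_len):
    """index a string, then build input/target arrays offset by one step."""
    idx = [toy_c2i['<s>']] + [toy_c2i[ch] for ch in text] + [toy_c2i['</s>']]
    x = idx[:-1]   # input: everything except the final token
    y = idx[1:]    # target: everything except the first token
    pad = [toy_c2i['<pad>']] * (max_len - len(x))
    return np.array(x + pad), np.array(y + pad)

x, y = make_xy('abc', max_len=6)
print(x)  # [1 3 4 5 0 0]
print(y)  # [3 4 5 2 0 0]
```

at every position, `y` holds the character that follows the one in `x`, which is exactly what teacher forcing trains the network to predict.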

due to this, our training and decode networks are slightly different. our training network takes full sequence inputs, and outputs full sequence outputs (the inputs, offset by one). this is because we already know the full sequences we are training on: the actual cards. on decode, we want new cards, so we will generate one character at a time, and feed *that* predicted character (sampled randomly from the softmax distribution, for randomness) back into the LSTM to generate the next character. because the LSTM relies on a 'memory' of what it has already generated, we must also input the previous *states*: `return_state=True` makes each LSTM return its hidden and cell states alongside its output, and these can be passed back in on the next step with `initial_state`.
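
the decode loop can be sketched in plain `numpy` as follows. this is not the decode network itself: `generate`, `step_fn` and `toy_step` are hypothetical names, and `toy_step` is a made-up stand-in for one step of a real model, which would return the softmax output and the updated LSTM states:

```python
import numpy as np

def generate(step_fn, start_idx, end_idx, max_len, rng):
    """sample one sequence, feeding each sampled index and state back in.

    step_fn stands in for one step of a decode network: it takes
    (previous index, previous states) and returns (probabilities, new states).
    """
    idx, states, out = start_idx, None, []
    for _ in range(max_len):
        probs, states = step_fn(idx, states)
        probs = np.asarray(probs, dtype=np.float64)
        probs = probs / probs.sum()                 # guard against rounding error
        idx = int(rng.choice(len(probs), p=probs))  # sample instead of argmax
        if idx == end_idx:
            break
        out.append(idx)
    return out

def toy_step(prev_idx, states):
    """toy 'model': the state is just a step counter (made up for this demo)."""
    step = 0 if states is None else states + 1
    if step >= 3:
        return [0.0, 0.0, 1.0, 0.0], step  # force the end token (index 2)
    return [0.0, 0.0, 0.0, 1.0], step      # otherwise always emit index 3

rng = np.random.RandomState(0)
print(generate(toy_step, start_idx=1, end_idx=2, max_len=10, rng=rng))  # [3, 3, 3]
```

the point of the sketch is the feedback: both the sampled index and the returned states go back into the next call, which is why the training network (full sequences in, full sequences out) cannot be reused unchanged for decoding.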

this network is adapted from the decoder in [the keras blog seq2seq article](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)

In [6]:
# set up the language model layers (there is no encoder here, so no initial state is passed)
decoder_input  = Input(shape=(MAX_Y_LEN, ), name='lm_input')
decoder_embed  = Embedding(VOCAB_SIZE, HIDDEN_SIZE, 
                           mask_zero=True, trainable=True, name='lm_emb')
decoder_lstm1  = LSTM(HIDDEN_SIZE, 
                      return_sequences=True, 
                      return_state=True, 
                      name='lm_lstm1')
decoder_lstm2  = LSTM(HIDDEN_SIZE, 
                      return_sequences=True, 
                      return_state=True, 
                      name='lm_lstm2')

decoder_dense_1  = Dense(HIDDEN_SIZE, activation='relu', name='lm_dns_1')
# decoder_dense_2  = Dense(VOCAB_SIZE, activation='softmax', name='lm_dns_final')

## weight tying

this concept is from Ofir Press & Lior Wolf 2017, ["Using the Output Embedding to Improve Language Models"](https://arxiv.org/pdf/1608.05859.pdf)

a conceptual outline by Ofir Press can be seen [on his blog](http://ofir.io/Neural-Language-Modeling-From-Scratch/)

they define a Language Model generally as taking a word _c_, represented by a one-hot vector, and embedding it into a dense vector with a weight matrix **U**. the model then does some computation on it (passing it through two LSTM layers, in this model, as well as in the blog example) to get a dense vector _h_, and converts this back to a word prediction using a matrix **V** followed by a softmax activation, i.e. a `Dense` layer in `keras`. however, they note that _c_ and _h_ both share the property of being 'word vectors', that the matrices **U** and **V** are of the same dimension (size of vocabulary x embedding size), and that they conduct inverse operations (mapping words to dense vectors, and mapping dense vectors to words). so they propose "weight tying", which sets the weights **U** = **V** (though for the math, one is _transposed_).

this can be demonstrated in `numpy` with the following: 

1. our parameters can be as follows:

```
word vocabulary = 10
embedding dims  =  4
```

2. we can make an artificial word embedding `e` of size `(word vocabulary x embedding dims) == (10, 4)`
   here each row = a 'word embedding', which here is filled with the value n for the n-th word:   

```
>>> e
array([[ 1,  1,  1,  1],
       [ 2,  2,  2,  2],
       [ 3,  3,  3,  3],
       [ 4,  4,  4,  4],
       [ 5,  5,  5,  5],
       [ 6,  6,  6,  6],
       [ 7,  7,  7,  7],
       [ 8,  8,  8,  8],
       [ 9,  9,  9,  9],
       [10, 10, 10, 10]])
```

3. a word is represented as a one-hot vector of length `word vocabulary`.  
   here `w` is the vector for the third word in the index

```
>>> w
array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
```

4. to retrieve the embedding, we use the dot product,  
   which multiplies each non-target row by 0 and the target embedding row by 1:

```
>>> np.dot(w, e)
array([3, 3, 3, 3])
```

5. now we assume we do some calculation here and get an LSTM output vector `h`  of `embedding size`

```
>>> h
array([2, 2, 2, 2])
```

6. then we can expand back to a 10-word vocabulary using `e.T` i.e. setting **U** = **V**   
   (this only applies when our input and output vocabularies are the same)  
   (it also means that the LSTM hidden size must `==` the embedding size)  

```
>>> np.dot(h, e.T)
array([ 8, 16, 24, 32, 40, 48, 56, 64, 72, 80])
```

7. this output will then be turned into a probability distribution over the ten words using softmax.  
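
the seven steps above can be put together into one runnable sketch. the `softmax` helper is written out by hand here and the variable names are only illustrative:

```python
import numpy as np

vocab_size, embed_dims = 10, 4

# toy embedding matrix: row n holds the 'embedding' of word n
e = np.repeat(np.arange(1, vocab_size + 1)[:, np.newaxis], embed_dims, axis=1)

w = np.zeros(vocab_size, dtype=int)
w[2] = 1                      # one-hot vector for the third word

embedded = np.dot(w, e)       # look up the embedding row
h = np.array([2, 2, 2, 2])    # pretend LSTM output of size embed_dims
logits = np.dot(h, e.T)       # tied output projection back to the vocabulary

def softmax(z):
    z = z - np.max(z)         # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(logits)
print(embedded)               # [3 3 3 3]
print(logits)                 # [ 8 16 24 32 40 48 56 64 72 80]
print(probs.argmax())         # 9
```

with these toy numbers almost all of the probability mass lands on the last word, simply because its logit is the largest by a wide margin.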

we can use a `Lambda` layer that will take the recurrent outputs and `dot` them with the transposed embedding weights:

In [None]:
# weight-tying Lambda
def weight_tying(layer_input):
    result = K.dot(layer_input, K.transpose(decoder_embed.weights[0]))
    return result

decoder_dense_2 = Lambda(weight_tying, name='weight_tying')
decoder_dense_3 = Activation('softmax')

In [None]:
# define the actual model
x = decoder_embed(decoder_input)
x = Dropout(DROP_RATE)(x)
x, h1, c1 = decoder_lstm1(x)
x = Dropout(DROP_RATE)(x)
x, h2, c2 = decoder_lstm2(x)
x = Dropout(DROP_RATE)(x)
x = decoder_dense_1(x)
x = Dropout(DROP_RATE)(x)
x = decoder_dense_2(x)
x = decoder_dense_3(x)

model = Model(decoder_input, x)

In [7]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lm_input (InputLayer)        (None, 256)               0         
_________________________________________________________________
lm_emb (Embedding)           (None, 256, 500)          51000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 256, 500)          0         
_________________________________________________________________
lm_lstm1 (LSTM)              [(None, 256, 500), (None, 2002000   
_________________________________________________________________
dropout_2 (Dropout)          (None, 256, 500)          0         
_________________________________________________________________
lm_lstm2 (LSTM)              [(None, 256, 500), (None, 2002000   
_________________________________________________________________
dropout_3 (Dropout)          (None, 256, 500)          0         
__________

In [8]:
# compile
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

## model training

we define a `ModelCheckpoint` that will save models as we train, in case the model training takes a long time. we then `fit` the model to train. we use `verbose=2` to view per-epoch stats; `verbose=1`, while it provides per-batch stats, can freeze Jupyter Lab, and `TQDMNotebook` doesn't work with Jupyter Lab yet (AFAIK).

we then save weights at the end of training (and re-load them to test). we have a (commented by default) cell for loading weights before training, to allow us to continue training a partially-trained model.
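
when continuing a partially-trained model, one of the saved checkpoints has to be picked. a small sketch of how that could be done from the file names alone, assuming they follow the `weights.{epoch:04d}-{loss:.4f}.h5` pattern used in this notebook. `best_checkpoint` is a hypothetical helper and the file names in the example are made up:

```python
import re

# matches names produced by the pattern 'weights.{epoch:04d}-{loss:.4f}.h5'
CKPT_RE = re.compile(r'weights\.(\d{4})-(\d+\.\d{4})\.h5$')

def best_checkpoint(filenames):
    """return (filename, epoch, loss) for the lowest-loss checkpoint, or None."""
    parsed = []
    for name in filenames:
        m = CKPT_RE.search(name)
        if m:
            parsed.append((float(m.group(2)), int(m.group(1)), name))
    if not parsed:
        return None
    loss, epoch, name = min(parsed)
    return name, epoch, loss

# made-up file names for illustration; in practice these could come from os.listdir('model')
names = ['weights.0002-1.2345.h5', 'weights.0004-0.9876.h5', 'weights_tiedfinal.h5']
print(best_checkpoint(names))  # ('weights.0004-0.9876.h5', 4, 0.9876)
```

the returned file name could then be passed to `model.load_weights` in the loading cell.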

In [9]:
cpoint = ModelCheckpoint('model/weights.{epoch:04d}-{loss:.4f}.h5',
                         monitor='loss',
                         save_best_only=True,
                         save_weights_only=True,
                         period=2)

In [10]:
# model.load_weights('model/weights_final.h5')

In [11]:
model.fit(xcards, ycards,
          batch_size=BATCH_SIZE,
          epochs=NUM_EPOCHS,
          callbacks=[cpoint],
          verbose=2)  # per-epoch stats only; verbose=1 can freeze Jupyter Lab

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fe7dfcbfe48>

In [12]:
model.save_weights('model/weights_tiedfinal.h5')

In [13]:
model.load_weights('model/weights_tiedfinal.h5')

In [14]:
# save architecture with json
with open('model/weights_tiedfinal.json', 'w') as f:
    f.write(model.to_json())