# Language Model with PTB in tf.Keras
In this project, I build a language model on the Penn Treebank (PTB) dataset.

The code is modified from [CharlesWu123/SelfStudyTF](git@github.com:CharlesWu123/SelfStudyTF.git)

I try to use `tf.keras` wherever possible.

The data preprocessing is omitted in this notebook.

In [1]:
import numpy as np
import tensorflow as tf
import codecs
from tensorflow import keras

TRAIN_DATA = './ptb.train'
EVAL_DATA = './ptb.valid'
TEST_DATA = './ptb.test'
VOCAB = './ptb.vocab'          # Vocabulary file
HIDDEN_SIZE = 300
NUM_LAYERS = 2
VOCAB_SIZE = 10000
TRAIN_BATCH_SIZE = 128
TRAIN_NUM_STEP = 30

EVAL_BATCH_SIZE = 1
EVAL_NUM_STEP = 1
NUM_EPOCH = 50
LSTM_KEEP_PROB = 0.9
EMBEDDING_KEEP_PROB = 0.9
MAX_GRAD_NORM = 5
SHARE_EMB_AND_SOFTMAX = True

In [2]:
# Avoid 'Blas GEMM launch failed'
config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC' #A "Best-fit with coalescing" algorithm, simplified from a version of dlmalloc.
config.gpu_options.per_process_gpu_memory_fraction = 0.3
config.gpu_options.allow_growth = True
keras.backend.set_session(tf.Session(config=config))

## Read data from file
After preprocessing, each data file contains word ids that index into the vocabulary file.

Each line ends with `<eos>`, and out-of-vocabulary words have been replaced by `<unk>`.
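As a rough illustration of this format, the sketch below encodes one sentence into ids. The toy vocabulary and the `encode_line` helper are hypothetical and are not the omitted preprocessing code; the real vocabulary is read from `ptb.vocab`.

```python
# Hypothetical toy vocabulary; the real one is read from ptb.vocab.
toy_vocab = ['<unk>', '<eos>', 'the', 'cat', 'sat']
toy_word_to_id = {word: idx for idx, word in enumerate(toy_vocab)}


def encode_line(line, word_to_id):
    """Map words to ids, fall back to <unk>, and append <eos>."""
    words = line.strip().split() + ['<eos>']
    unk_id = word_to_id['<unk>']
    return [word_to_id.get(w, unk_id) for w in words]


# 'dog' is not in the toy vocabulary, so it becomes the <unk> id.
encoded = encode_line('the dog sat', toy_word_to_id)
print(' '.join(str(i) for i in encoded))  # 2 0 4 1
```

`load_data` in the next cell reads lines of this shape and concatenates them into one long list of integers.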

In [3]:
def load_data(data_file):
    with open(data_file, 'r') as fin:
        # read full file as a long string
        id_string = ' '.join([line.strip() for line in fin.readlines()])
    id_list = [int(w) for w in id_string.split()]  # Convert word id to integer
    return id_list


# Load data from file
data_train = load_data(TRAIN_DATA)
data_val = load_data(EVAL_DATA)

len_train = len(data_train)
len_val = len(data_val)

print('Training data length', len_train)
print('Validating data length', len_val)

Training data length 929589
Validating data length 73760


# Create input data generator
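Define a generator that yields input batches for training.
The labels are the inputs shifted by one word, so each position is trained to predict the next word.
Because the model uses `sparse_categorical_crossentropy` and `sparse_categorical_accuracy`, the labels are reshaped to add one trailing dimension.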
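A minimal sketch of the shift and reshape described above, on a toy id sequence. The values and the window length are made up for illustration.

```python
import numpy as np

# Toy stand-in for the id sequence; the values are arbitrary.
toy_data = list(range(10, 20))
num_steps = 4
start = 0

# Inputs are a window of num_steps ids; labels are the same window moved one word ahead.
x = np.array(toy_data[start:start + num_steps])
y = np.array(toy_data[start + 1:start + num_steps + 1])

# Batch of one sequence; labels get a trailing axis: (batch, steps, 1).
x_batch = x.reshape(1, num_steps)
y_batch = y.reshape(1, num_steps, 1)
print(x_batch.shape, y_batch.shape)  # (1, 4) (1, 4, 1)
```

Here `x` is `[10, 11, 12, 13]` and `y` is `[11, 12, 13, 14]`, so the label at each step is the input at the following step.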

In [4]:
class KerasBatchGenerator(object):

    def __init__(self, data, num_steps, batch_size, vocabulary, skip_step=5):
        self.data = data
        self.num_steps = num_steps
        self.batch_size = batch_size
        self.vocabulary = vocabulary
        # this will track the progress of the batches sequentially through the
        # data set - once the data reaches the end of the data set it will reset
        # back to zero
        self.current_idx = 0
        # skip_step is the number of words which will be skipped before the next
        # batch is skimmed from the data set
        self.skip_step = skip_step

    def generate(self):
        x = np.zeros((self.batch_size, self.num_steps))
        y = np.zeros((self.batch_size, self.num_steps))
        while True:
            for i in range(self.batch_size):
                if self.current_idx + self.num_steps >= len(self.data):
                    # reset the index back to the start of the data set
                    self.current_idx = 0
                x[i, :] = self.data[self.current_idx:self.current_idx + self.num_steps]
                temp_y = self.data[self.current_idx +
                                   1:self.current_idx + self.num_steps + 1]
                # keep labels as integer ids (sparse labels, no one-hot encoding)
                y[i, :] = temp_y
                self.current_idx += self.skip_step
            # x = x.reshape(self.batch_size, self.num_steps, 1)
            py = y.reshape(self.batch_size, self.num_steps, 1)
            yield x, py

gen_train_data = KerasBatchGenerator(
    data_train, TRAIN_NUM_STEP, TRAIN_BATCH_SIZE, VOCAB_SIZE,
    skip_step=TRAIN_NUM_STEP
)

gen_val_data = KerasBatchGenerator(
    data_val, TRAIN_NUM_STEP, TRAIN_BATCH_SIZE, VOCAB_SIZE,
    skip_step=TRAIN_NUM_STEP
)

# Build the model
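Here we build a stacked LSTM language model: an embedding layer, dropout, `NUM_LAYERS` `CuDNNLSTM` layers, dropout, and a time-distributed dense layer followed by a softmax over the vocabulary.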

In [5]:
model = keras.Sequential()
model.add(keras.layers.Embedding(VOCAB_SIZE, HIDDEN_SIZE, input_length=TRAIN_NUM_STEP))
model.add(keras.layers.Dropout(1 - EMBEDDING_KEEP_PROB))
for _ in range(NUM_LAYERS):
    model.add(keras.layers.CuDNNLSTM(units=HIDDEN_SIZE, return_sequences=True))
model.add(keras.layers.Dropout(1 - LSTM_KEEP_PROB))
model.add(keras.layers.TimeDistributed(keras.layers.Dense(VOCAB_SIZE)))
model.add(keras.layers.Activation('softmax'))
model.summary()

W0925 15:11:56.121625 10056 deprecation.py:506] From C:\Users\HP\.conda\envs\tfgpu\lib\site-packages\tensorflow\python\keras\initializers.py:119: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0925 15:11:56.159556 10056 deprecation.py:506] From C:\Users\HP\.conda\envs\tfgpu\lib\site-packages\tensorflow\python\ops\init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 30, 300)           3000000   
_________________________________________________________________
dropout (Dropout)            (None, 30, 300)           0         
_________________________________________________________________
cu_dnnlstm (CuDNNLSTM)       (None, 30, 300)           722400    
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, 30, 300)           722400    
_________________________________________________________________
dropout_1 (Dropout)          (None, 30, 300)           0         
_________________________________________________________________
time_distributed (TimeDistri (None, 30, 10000)         3010000   
_________________________________________________________________
activation (Activation)      (None, 30, 10000)         0

![](model.png)
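The parameter counts in the summary above can be checked by hand. The sketch below assumes that each `CuDNNLSTM` layer keeps two bias vectors per gate (input and recurrent), which is consistent with the 722,400 figure reported in the summary.

```python
VOCAB_SIZE = 10000
HIDDEN_SIZE = 300

# Embedding: one HIDDEN_SIZE-dimensional vector per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# One LSTM layer: 4 gates, each with an input matrix, a recurrent matrix,
# and (assumed for the CuDNN variant) two bias vectors.
input_size = HIDDEN_SIZE  # both layers receive 300-dimensional inputs
lstm_params = 4 * (input_size * HIDDEN_SIZE
                   + HIDDEN_SIZE * HIDDEN_SIZE
                   + 2 * HIDDEN_SIZE)

# Dense output layer: weight matrix plus one bias per vocabulary entry.
dense_params = HIDDEN_SIZE * VOCAB_SIZE + VOCAB_SIZE

print(embedding_params, lstm_params, dense_params)  # 3000000 722400 3010000
```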

# Compile and Train
Save a checkpoint after every epoch during training.

Use TensorBoard to track training progress.
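The step counts passed to `fit_generator` in the cell below follow from how the generator walks the data: with `skip_step` equal to `TRAIN_NUM_STEP`, each batch consumes `TRAIN_BATCH_SIZE * TRAIN_NUM_STEP` words. A quick check of the arithmetic, using the data lengths printed earlier:

```python
len_train = 929589   # training data length printed above
len_val = 73760      # validation data length printed above
TRAIN_BATCH_SIZE = 128
TRAIN_NUM_STEP = 30

# Words consumed by one batch when windows do not overlap.
words_per_batch = TRAIN_BATCH_SIZE * TRAIN_NUM_STEP

steps_per_epoch = len_train // words_per_batch
validation_steps = len_val // words_per_batch
print(words_per_batch, steps_per_epoch, validation_steps)  # 3840 242 19
```

With these values, one epoch is roughly one pass over the training data.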

In [6]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
              metrics=['sparse_categorical_accuracy'])
cp_callback = keras.callbacks.ModelCheckpoint(
    filepath='./models/model-{epoch:02d}.hdf5', verbose=1)
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir='./logs',
    histogram_freq=1, batch_size=TRAIN_BATCH_SIZE,
    write_graph=True, write_grads=False, write_images=True,
    embeddings_freq=0, embeddings_layer_names=None,
    embeddings_metadata=None, embeddings_data=None, update_freq=500
    )

model.fit_generator(generator=gen_train_data.generate(),
                    steps_per_epoch=len_train // (TRAIN_BATCH_SIZE * TRAIN_NUM_STEP),
                    epochs=NUM_EPOCH, callbacks=[cp_callback, tb_callback],
                    validation_data=gen_val_data.generate(),
                    validation_steps=len_val // (TRAIN_BATCH_SIZE * TRAIN_NUM_STEP))

W0925 15:11:57.257587 10056 deprecation.py:323] From C:\Users\HP\.conda\envs\tfgpu\lib\site-packages\tensorflow\python\ops\math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Epoch 1/50
Epoch 00001: saving model to ./models/model-01.hdf5
Epoch 2/50
Epoch 00002: saving model to ./models/model-02.hdf5
Epoch 3/50
Epoch 00003: saving model to ./models/model-03.hdf5
Epoch 4/50
Epoch 00004: saving model to ./models/model-04.hdf5
Epoch 5/50
Epoch 00005: saving model to ./models/model-05.hdf5
Epoch 6/50
Epoch 00006: saving model to ./models/model-06.hdf5
Epoch 7/50
Epoch 00007: saving model to ./models/model-07.hdf5
Epoch 8/50
Epoch 00008: saving model to ./models/model-08.hdf5
Epoch 9/50
Epoch 00009: saving model to ./models/model-09.hdf5
Epoch 10/50
Epoch 00010: saving model to ./models/model-10.hdf5
Epoch 11/50
Epoch 00011: saving model to ./models/model-11.hdf5
Epoch 12/50
Epoch 00012: saving model to ./models/model-12.hdf5
Epoch 13/50
Epoch 00013: saving model to ./models/model-13.hdf5
Epoch 14/50
Epoch 00014: saving model to ./models/model-14.hdf5
Epoch 15/50
Epoch 00015: saving model to ./models/model-15.hdf5
Epoch 16/50
Epoch 00016: saving model to ./models

Epoch 25/50
Epoch 00025: saving model to ./models/model-25.hdf5
Epoch 26/50
Epoch 00026: saving model to ./models/model-26.hdf5
Epoch 27/50
Epoch 00027: saving model to ./models/model-27.hdf5
Epoch 28/50
Epoch 00028: saving model to ./models/model-28.hdf5
Epoch 29/50
Epoch 00029: saving model to ./models/model-29.hdf5
Epoch 30/50
Epoch 00030: saving model to ./models/model-30.hdf5
Epoch 31/50
Epoch 00031: saving model to ./models/model-31.hdf5
Epoch 32/50
Epoch 00032: saving model to ./models/model-32.hdf5
Epoch 33/50
Epoch 00033: saving model to ./models/model-33.hdf5
Epoch 34/50
Epoch 00034: saving model to ./models/model-34.hdf5
Epoch 35/50
Epoch 00035: saving model to ./models/model-35.hdf5
Epoch 36/50
Epoch 00036: saving model to ./models/model-36.hdf5
Epoch 37/50
Epoch 00037: saving model to ./models/model-37.hdf5
Epoch 38/50
Epoch 00038: saving model to ./models/model-38.hdf5
Epoch 39/50
Epoch 00039: saving model to ./models/model-39.hdf5
Epoch 40/50
Epoch 00040: saving model to

Epoch 49/50
Epoch 00049: saving model to ./models/model-49.hdf5
Epoch 50/50
Epoch 00050: saving model to ./models/model-50.hdf5


<tensorflow.python.keras.callbacks.History at 0x215e703da90>

# Model Restore and Test
Since we saved the model after every epoch, we can restore a checkpoint and test it by predicting words.
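The test cells that follow take the argmax over the vocabulary at the last time step and map the resulting id back to a word. A minimal sketch of that decoding step is shown here; the hand-built array stands in for the output of `model.predict`, and the toy vocabulary is hypothetical.

```python
import numpy as np

# Hypothetical toy setup: 5-word vocabulary, 3 time steps.
toy_vocab = ['<unk>', '<eos>', 'the', 'cat', 'sat']
id_to_word = dict(enumerate(toy_vocab))
num_steps = 3

# Stand-in for model.predict output, shape (batch=1, num_steps, vocab_size).
prediction = np.zeros((1, num_steps, len(toy_vocab)))
prediction[0, 0, 2] = 0.9
prediction[0, 1, 3] = 0.8
prediction[0, 2, 4] = 0.7

# Only the last step predicts the word that follows the input window.
next_id = int(np.argmax(prediction[:, num_steps - 1, :]))
print(id_to_word[next_id])  # sat
```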

In [8]:
VOCAB = './ptb.vocab'          # Vocabulary file
# Project word to id
with codecs.open(VOCAB, 'r', 'utf-8') as f_vocab:
    vocab = [w.strip() for w in f_vocab.readlines()]
word_to_id = {k: v for (k, v) in zip(vocab, range(len(vocab)))}
reversed_dictionary = dict(zip(word_to_id.values(), word_to_id.keys()))

data_test = np.array(load_data(TEST_DATA))
len_test = len(data_test)

model = keras.models.load_model('./models/model-50.hdf5')
dummy_iters = 40

W0925 17:07:01.880927 10056 deprecation.py:506] From C:\Users\HP\.conda\envs\tfgpu\lib\site-packages\tensorflow\python\ops\init_ops.py:97: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0925 17:07:01.881924 10056 deprecation.py:506] From C:\Users\HP\.conda\envs\tfgpu\lib\site-packages\tensorflow\python\ops\init_ops.py:97: calling Orthogonal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0925 17:07:01.881924 10056 deprecation.py:506] From C:\Users\HP\.conda\envs\tfgpu\lib\site-packages\tensorflow\python\ops\init_ops.py:97: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is depre

First, compare the predictions with the training data.

In [9]:
example_training_generator = KerasBatchGenerator(data_train, TRAIN_NUM_STEP, 1, VOCAB_SIZE,
                                                 skip_step=1)
print("Training data:")

for i in range(dummy_iters):
    dummy = next(example_training_generator.generate())

num_predict = 100
true_print_out = "Actual words: "
pred_print_out = "Predicted words: "
for i in range(num_predict):
    data = next(example_training_generator.generate())
    prediction = model.predict(data[0])
    predict_word = np.argmax(prediction[:, TRAIN_NUM_STEP - 1, :])
    true_print_out += reversed_dictionary[data_train[TRAIN_NUM_STEP + dummy_iters + i]] + " "
    pred_print_out += reversed_dictionary[predict_word] + " "
print(true_print_out)
print(pred_print_out)

Training data:
Actual words: director of this british industrial conglomerate <eos> a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported <eos> the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said <eos> <unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N <eos> although preliminary findings were reported more 
Predicted words: director of the <unk> bank bank <eos> the <unk> of <unk> <unk> <unk> to <unk> the <unk> <unk> and been the <unk> of of the <eos> <eos> the <unk> of <unk> and to the <eos> than N N <eos> <eos> said <eos> the <unk> <unk> is is expected <unk> by a 's a <unk> of the the <unk> to the <eos> a of the the the <eos> this said <eos> the <unk> a <unk> of th

Second, compare the predictions with the test data.

In [10]:
example_testing_generator = KerasBatchGenerator(data_test, TRAIN_NUM_STEP, 1, VOCAB_SIZE,
                                                skip_step=1)
print("Testing data:")
for i in range(dummy_iters):
    dummy = next(example_testing_generator.generate())

num_predict = 100
true_print_out = "Actual words: "
pred_print_out = "Predicted words: "
for i in range(num_predict):
    data = next(example_testing_generator.generate())
    prediction = model.predict(data[0])
    predict_word = np.argmax(prediction[:, TRAIN_NUM_STEP - 1, :])
    true_print_out += reversed_dictionary[data_test[TRAIN_NUM_STEP + dummy_iters + i]] + " "
    pred_print_out += reversed_dictionary[predict_word] + " "
print(true_print_out)
print(pred_print_out)

Testing data:
Actual words: futures <eos> the N stock specialist firms on the big board floor the buyers and sellers of last resort who were criticized after the N crash once again could n't handle the selling pressure <eos> big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock traders say <eos> heavy selling of standard & poor 's 500-stock index futures in chicago <unk> beat stocks downward <eos> seven big board stocks ual amr bankamerica walt disney capital cities\/abc philip morris and pacific telesis group stopped trading and never resumed <eos> 
Predicted words: the <eos> the company N market <unk> are the <unk> board 's trading dollar will the of the year is <unk> <unk> by the <unk> crash <eos> the are n't be the value <eos> <eos> the board managers are to be up with the <unk> <eos> <unk> the company <unk> <eos> <eos> the the investors of the prices said <eos> the trading the the & poor 's 500-stock index rose 