# Text generator based on RNN

## Brief
Generate fake abstract with RNN model under tensorflow r1.3.

### Import libraries

In [1]:
import tensorflow as tf
import numpy as np
import random
import os

### Configurations 

In [2]:
vocab = (" $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "\\^_abcdefghijklmnopqrstuvwxyz{|}\n")
graph_path = r"./graphs"
test_text_path = os.path.normpath(r"../Dataset/arvix_abstracts.txt")
batch_size=50
model_param_path=os.path.normpath(r"./model_checkpoints")

### Data encoding
#### Basic Assumption

* A full string sequence consists $START$ & $STOP$ signal with characters in the middle. 

#### Encoding policy
* A set $\mathcal{S}$ that consists of many characters is utilized to encode the characters.
* The $1^{st}$ entry of the vector corresponds to $UNKNOWN$ characters(l.e. characters that are beyond $\mathcal{S}$). 
* The last entry of the vector corresponds to $STOP$ signal of the sequence. 
* The entries in the middle corresponds to the indices of the characters within $\mathcal{S}$. 
* The $START$ signal is represented as a zero vector. 

#### Implementation & Test
##### Declaration

In [3]:
class TextCodec:
    def __init__(self, vocab):
        self._vocab = vocab
        self._dim = len(vocab) + 2

    def encode(self, string, sess = None, start=True, stop=True):
        """
        Encode string.
        Each character is represented as a N-dimension one hot vector. 
        N = len(self._vocab)+ 2
        
        Note:
        The first entry of the vector corresponds to unknown character. 
        The last entry of the vector corresponds to STOP signal of the sequence. 
        The entries in the middle corresponds to the index of the character. 
        The START signal is represented as a zero vector. 
        """
        tensor = [vocab.find(ch)+1 for ch in string]
        if stop:
             tensor.append(len(vocab)+1)  # String + STOP
        tensor = tf.one_hot(tensor,depth=len(vocab)+2,on_value=1.0,off_value=0.0,axis=-1, dtype=tf.float32)
        if start:
            tensor=tf.concat([tf.zeros([1, len(vocab)+2],dtype=tf.float32),tensor],axis=0)  # String + START
        if sess is None:
            with tf.Session() as sess:
                nparray=tensor.eval()
        elif type(sess) == tf.Session:
            nparray = tensor.eval(session=sess)
        else:
            raise TypeError('"sess" must be {}, got {}'.format(tf.Session, type(sess)))    
        return nparray

    def decode(self, nparray, default="[UNKNOWN]",start="[START]",stop="[STOP]",strip=False):
        text_list = []
        indices=np.argmax(nparray,axis=1)
        for v, ch_i in zip(nparray,indices):
            if np.all(v==0):
                text_list.append(start if not strip else "")
            elif ch_i==0:
                text_list.append(default)
            elif ch_i==len(self._vocab)+1:
                text_list.append(stop if not strip else "")
            else:
                text_list.append(vocab[ch_i-1])
        return "".join(text_list)
    
    @property
    def dim(self):
        return self._dim

##### Test
See how encoding and decoding work. 

In [4]:
test_codec=TextCodec(vocab)
test_text_encoded=test_codec.encode("Hello world!")
print("Encoded text looks like:\n{}".format(test_text_encoded))
test_text_decoded=test_codec.decode(nparray=test_text_encoded,strip=False)
print("Decoded text looks like:\n{}".format(test_text_decoded))

Encoded text looks like:
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  1.]]
Decoded text looks like:
[START]Hello world[UNKNOWN][STOP]


## Load data set

In [5]:
with open(test_text_path, "r") as f:
    raw_text_list = "".join(f.readlines()).split("\n")
print("Loaded abstract from a total of {} theses.".format(len(raw_text_list)))
# See what we have loaded
sample_text_no = random.randint(0, len(raw_text_list)-1)
sample_text_raw = raw_text_list[sample_text_no]
print("A sample text in the data set:\n{}".format(sample_text_raw))
sample_text_encoded=test_codec.encode(sample_text_raw)
print("Encoded text:\n{}".format(sample_text_encoded))
print("Decoded text:\n{}".format(test_codec.decode(sample_text_encoded)))
encoded_data = test_codec.encode("\n".join(raw_text_list), start=False, stop=False)

Loaded abstract from a total of 7201 theses.
A sample text in the data set:
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units, revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign differen

## Define Batch Generator

In [107]:
def batch_generator(data, codec, batch_size, seq_length, reset_every):
    if type(data) == str:
        data=codec.encode(data, start=False, stop=False)
    head = 0
    reset_index = 0
    batch = []
    seq = []
    increment = seq_length * reset_every - 1
    extras = codec.encode("", start=True, stop=True)
    v_start, v_stop = extras[0: 1, :], extras[1: 2, :]
    while head < np.shape(data)[0] or len(batch) == batch_size:
        if len(batch) == batch_size:
            batch = np.array(batch)
            for offset in range(reset_every):
                yield (batch[:, offset * seq_length: (offset + 1) * seq_length, :], 
                batch[:, offset * seq_length + 1: (offset + 1) * seq_length + 1, :])
            batch = []
        else:
            seq = np.concatenate([v_start, data[head: head + increment, :], v_stop], axis=0)
            print(codec.decode(seq, strip=False))
            if np.shape(seq)[0] == (increment + 2):
                batch.append(data[head: head + increment + 2])
            head += increment

## Check the generator

In [None]:
batch_length = 100
reset_every = 2
batch_size = 2
batches = batch_generator(data=encoded_data, 
                               codec=test_codec, 
                               batch_size=batch_size, 
                               seq_length=batch_length, 
                               reset_every=reset_every)
for (x, y), i in zip(batches, range(reset_every * 2)):
    
    print("Batch {}".format(i))
    if (i % reset_every) == 0:
        print("Reset")
    for j in range(batch_size):
        decoded_x, decoded_y = test_codec.decode(x[j], strip=False), test_codec.decode(y[j], strip=False)
        print(x[j])
        print(test_codec.decode(x[j][:1, :], strip=False))
        print("Index of sub-sequence:\n{}\nSequence input:\n{}:\nSequence output:\n{}".format(j, 
                                                                                          decoded_x, 
                                                                                          decoded_y))

### Define model class

In [None]:
class DRNN(tf.nn.rnn_cell.RNNCell):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layer, dtype=tf.float32):
        super(tf.nn.rnn_cell.RNNCell, self).__init__(dtype=dtype)
        assert type(input_dim) == int and input_dim > 0, "Invalid input dimension. "
        self._input_dim = input_dim
        assert type(num_hidden_layer) == int and num_hidden_layer > 0, "Invalid number of hidden layer. "
        self._num_hidden_layer = num_hidden_layer
        assert type(hidden_dim) == int and hidden_dim > 0, "Invalid dimension of hidden states. "
        self._hidden_dim = hidden_dim
        assert type(output_dim) == int and output_dim > 0, "Invalid dimension of output dimension. "
        self._output_dim = output_dim
        self._state_is_tuple = True
        with tf.variable_scope("input_layer"):
            self._W_xh = tf.get_variable("W_xh", shape=[self._input_dim, self._hidden_dim])
            self._b_xh = tf.get_variable("b_xh", shape=[self._hidden_dim])
        with tf.variable_scope("rnn_layers"):
            self._cells = [tf.nn.rnn_cell.GRUCell(self._hidden_dim) for _ in range(num_hidden_layer)]
        with tf.variable_scope("output_layer"):
            self._W_ho_list = [tf.get_variable("W_h{}o".format(i), shape=[self._hidden_dim, self._output_dim])
                               for i in range(num_hidden_layer)]
            self._b_ho = tf.get_variable("b_ho", shape=[self._output_dim])

    @property
    def output_size(self):
        return self._output_dim

    @property
    def state_size(self):
        return (self._hidden_dim,) * self._num_hidden_layer

    def zero_state(self, batch_size, dtype):
        if self._state_is_tuple:
            return tuple(cell.zero_state(batch_size, dtype)for cell in self._cells)
        else:
            raise NotImplementedError("Not implemented yet.")

    def __call__(self, _input, state, scope=None):
        assert type(state) == tuple and len(state) == self._num_hidden_layer, "state must be a tuple of size {}".format(
            self._num_hidden_layer)
        hidden_layer_input = tf.matmul(_input, self._W_xh) + self._b_xh
        prev_output = hidden_layer_input
        final_state = []
        output = None
        for hidden_layer_index, hidden_cell in enumerate(self._cells):
            with tf.variable_scope("cell_{}".format(hidden_layer_index)):
                new_output, new_state = hidden_cell(prev_output, state[hidden_layer_index])
                prev_output = new_output + hidden_layer_input  # Should be included in variable scope of this layer or?
                final_state.append(new_state)
            _W_ho = self._W_ho_list[hidden_layer_index]
            if output is None:
                output = tf.matmul(new_output, _W_ho)
            else:
                output = output + tf.matmul(new_output, _W_ho)
        output = tf.tanh(output + self._b_ho)
        # output = tf.nn.relu(output)
        final_state = tuple(final_state)
        return output, final_state

    def inspect_weights(self, sess):
        val = self._W_xh.eval(sess)
        print("W_xh:\n{}\nF-norm:\n{}".format(val, norm(val)))
        val = self._b_xh.eval(sess)
        print("b_xh:\n{}\nF-norm:\n{}".format(val, norm(val)))
        for hidden_layer_index in range(self._num_hidden_layer):
            val = self._W_ho_list[hidden_layer_index].eval(sess)
            print("W_h{}o:\n{}\nF-norm:\n{}".format(hidden_layer_index, val, norm(val)))
        val = self._b_ho.eval(sess)
        print("b_ho:\n{}\nF-norm:\n{}".format(val, norm(val)))


## Create Batches

In [52]:
batch_length = 100
reset_every = 100
batch_size = 50
batches = list(batch_generator(data=encoded_data, 
                               codec=test_codec, 
                               batch_size=batch_size, 
                               seq_length=batch_length, 
                               reset_every=reset_every))

KeyboardInterrupt: 

### Make an instance of the model and define the rest of the graph
#### Thoughts
If GRU is used, then the outputs of GRU shall not be directly used as desired output without further transforms. (e.g. A cell accpet 2 inputs, a state from the previous cell and the input of this cell(which is approximated by the state input), then the RNN cell can be treated as a normal feed forward network. 

**The proposal above is to be tested again due to the previous bug in training (Failed to feed the initial state given by the RNN output from last sequnce)**

In [None]:
tf.reset_default_graph()
input_dim = output_dim = test_codec.dim
hidden_dim = 700
num_hidden_layer = 3
# test_rnn_cell = DRNN(input_dim, hidden_dim, output_dim, num_hidden_layer)
rnn_cell = DRNN(input_dim=input_dim, output_dim=output_dim, num_hidden_layer=num_hidden_layer, hidden_dim=hidden_dim)
init_state = tuple(tf.placeholder_with_default(input=tensor, 
                                         shape=[None, hidden_dim]) for tensor in rnn_cell.zero_state(
    batch_size=batch_size, dtype=tf.float32))
seq_input = tf.placeholder(shape=[None, None, input_dim], dtype=tf.float32)
target_seq_output = tf.placeholder(shape=[None, None, output_dim], dtype=tf.float32)
seq_output, final_states = tf.nn.dynamic_rnn(cell=rnn_cell,inputs=seq_input, 
                                                      initial_state=init_state, dtype=tf.float32)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=target_seq_output, logits=seq_output))
summary_op = tf.summary.scalar(tensor=loss, name="loss")
global_step = tf.get_variable(initializer=0,trainable=False, dtype=tf.int64)

### Training

In [50]:
n_epoch=2
learning_rate=1e-3
train_op=tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
print_every = 50
partition_size = 100
logdir = os.path.normpath("./graphs")
with tf.Session() as sess, tf.summary.FileWriter(logdir=logdir) as writer:

    print(sess, writer)
    sess.run(tf.global_variables_initializer())
    feed_dict = dict()
    for epoch in range(n_epoch):
        for i, batch in enumerate(batches):
            x, y = batch[:, :-1, :], batch[:, 1:, :]
            print(x.shape, y.shape)
            feed_dict = {seq_input: x, seq_output: y}
            if (i % reset_every) != 0:
                for i in range(num_hidden_layer):
                    feed_dict[init_state[i]] = final_states[i]
        _, summary, states, step = sess.run(fetches=[train_op, summaries, final_states, global_step], feed_dict=feed_dict)
    writer.add_summary(summary=summary, global_step=step)

def online_inference():
    raise NotImplementedError("Not implemented yet.")

<tensorflow.python.client.session.Session object at 0x7f2f88911898> <tensorflow.python.summary.writer.writer.FileWriter object at 0x7f2f891fae10>
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 

(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86) (50, 99, 86)
(50, 99, 86)

NameError: name 'summaries' is not defined

### Test online inference