## References
- [char-rnn by Karpathy](https://github.com/karpathy/char-rnn)
- [RNN on WILDML](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/) and [code](https://github.com/dennybritz/rnn-tutorial-rnnlm)
- implement a simple RNN (neither LSTM nor GRU) with Theano

In [1]:
import numpy as np
import theano
import theano.tensor as T
from theano import shared, function

Using gpu device 0: GeForce GTX 980M (CNMeM is disabled, cuDNN 5005)


In [2]:
theano.config.device

'gpu'

## load some simple text data to better understand the mechanism of RNN
- Using [The Adventures of Tom Sawyer](https://www.gutenberg.org/files/74/74.txt) from Project Gutenberg

In [126]:
texts = open("../data/TheAdventuresofTomSawyer.txt").read()
charset = set(texts)
ind2char = dict(enumerate(charset))
char2ind = dict(map(reversed, ind2char.items()))
data = map(char2ind.get, texts)

print len(charset), len(data)

86 421909
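
The encoding step above can be checked on a toy string. This is only a sketch: the string `text` is a stand-in for the book, and sorting the charset is an assumption added here so that indices are reproducible between runs (the original cell uses the unordered `set` directly).

```python
# Toy stand-in for the book text (an assumption for illustration).
text = "tom sawyer"

# Sorting makes the index assignment deterministic across runs.
charset = sorted(set(text))
ind2char = dict(enumerate(charset))
char2ind = {c: i for i, c in ind2char.items()}

# Encode every character as its integer index, then decode it back.
# A list comprehension is used because map() is lazy on Python 3.
encoded = [char2ind[c] for c in text]
decoded = "".join(ind2char[i] for i in encoded)
```

Decoding the encoded sequence should give back the original string, which is the property `generate` relies on when it maps sampled indices to characters.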


## char RNN with explicit loop
- fixed window length
- The network is not really deep (max seq len is 5), so a simple weight initialization is used
- Use the traditional `tanh` as the nonlinearity
- simple gradient clipping: each gradient element is clipped to [-5, 5]
- reset the hidden state vector to 0 each time the whole dataset has been processed; a different strategy might be used for sentence modelling, see below for the differences
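
The points above can be sketched in plain NumPy. This is a stand-in for illustration, not the Theano graph used below: the helper names `softmax`, `forward_loss` and `clip_gradient` and the toy sizes are assumptions, and no gradients are computed here.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_loss(x, y, h, Wxh, Whh, bh, Why, by):
    """Summed cross-entropy over one fixed-length window; returns (loss, last h)."""
    loss = 0.0
    for xt, yt in zip(x, y):
        # The row lookup Wxh[xt] equals multiplying a one-hot vector by Wxh.
        h = np.tanh(Wxh[xt] + Whh.dot(h) + bh)
        p = softmax(h.dot(Why) + by)
        loss += -np.log(p[yt])
    return loss, h

def clip_gradient(g, bound=5.0):
    # Element-wise clipping to [-bound, bound].
    return np.clip(g, -bound, bound)

rng = np.random.RandomState(0)
L, D, H = 5, 8, 4  # toy sizes, chosen only for illustration
Wxh = rng.randn(D, H) / np.sqrt(D)
Whh = rng.randn(H, H) / np.sqrt(H)
Why = rng.randn(H, D) / np.sqrt(H)
bh, by = np.zeros(H), np.zeros(D)

x = rng.randint(0, D, size=L)
y = rng.randint(0, D, size=L)
loss, h_last = forward_loss(x, y, np.zeros(H), Wxh, Whh, bh, Why, by)
clipped = clip_gradient(np.array([-7.0, 0.5, 9.0]))
```

Passing `h_last` as the initial `h` of the next window is what carries the state across windows; passing zeros instead resets it.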

In [119]:
class RNN(object):
    def __init__(self, L, D, H, lr, lmbda = 0):
        """
        L: sequence length, D: vocabulary size (word dimension)
        H: hidden size
        lr: learning rate
        """
        self.L = L
        self.D = D
        self.H = H
        self.lr = lr
        self.lmbda = lmbda
        
        x = T.ivector(name = "x") # seq of input character indices
        y = T.ivector(name = "y") # seq of next-character indices (targets)
        h = T.vector(name = "h", dtype="float32") # current state of rnn - h value
        
        Wxh = shared(np.random.randn(D, H).astype("float32") / np.sqrt(D), name = "Wxh")
        Whh = shared(np.random.randn(H, H).astype("float32") / np.sqrt(H), name = "Whh")
        bh = shared(np.zeros(H).astype("float32"), name = "bh")
        Why = shared(np.random.randn(H, D).astype("float32") / np.sqrt(H), name = "Why")
        by = shared(np.zeros(D).astype("float32"), name = "by")
        
        hs = [None] * (L+1)
        probs = [None] * L
        errors = [None] * L

        hs[-1] = h
        for i in xrange(L):
            hs[i] = T.tanh(Wxh[x[i], :] + Whh.dot(hs[i-1]) + bh)
            probs[i] = T.nnet.softmax(hs[i].dot(Why) + by).flatten()
            errors[i] = -T.log(probs[i][y[i]])

        data_loss = sum(errors)
        reg_loss = (Wxh * Wxh).sum() + (Whh * Whh).sum() + (Why * Why).sum()
        loss = data_loss + lmbda * reg_loss

        ## simple gradient clipping: every gradient element is clipped to [-5, 5]
        dWxh = T.clip(T.grad(loss, Wxh), -5, 5)
        dWhh = T.clip(T.grad(loss, Whh), -5, 5)
        dbh = T.clip(T.grad(loss, bh), -5, 5)
        dWhy = T.clip(T.grad(loss, Why), -5, 5)
        dby = T.clip(T.grad(loss, by), -5, 5)
        
        self.train_on_seq = function(inputs = [x, y, h], 
                        outputs = [loss, hs[-2]],    # hs[-1] is init h, hs[-2] is the latest one 
                        updates = [ (Wxh, Wxh - lr * dWxh)
                                  , (Whh, Whh - lr * dWhh)
                                  , (bh, bh - lr * dbh)
                                  , (Why, Why - lr * dWhy)
                                  , (by, by - lr * dby)])
        self.predict = function(inputs = [x, h], 
                  outputs = [probs[-1], hs[-2]])
    
    def generate(self, x, h, n):
        """Generate n chars based on the current x and h
        """
        seq = list(x)  # copy, so the caller's list is not mutated by append below
        while len(seq) < len(x) + n:
            p, h = self.predict(x, h)
            c = np.random.choice(xrange(self.D), p = p)
            x = x[1:] + [c]
            seq.append(c)
        return "".join(map(ind2char.get, seq))

In [121]:
## test initialization
L = 25
D = len(charset)
H = 100
lr = 0.001

rnn = RNN(L, D, H, lr, lmbda = 0)
rnn.train_on_seq(x = [0] * L, y = [0] * L, h = np.zeros(H).astype("float32"))[0], -np.log(1./D) * L

(array(111.25082397460938, dtype=float32), 111.35868240633768)
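
The second value in the tuple is the loss of a model that predicts every character with equal probability, which is a useful sanity check for a freshly initialized network. A short worked version of that arithmetic, using the `L` and `D` values from the cells above:

```python
import numpy as np

L, D = 25, 86  # window length and charset size from the cells above

# With all-zero logits the softmax is uniform over the D characters.
p = np.exp(np.zeros(D))
p /= p.sum()

# Each of the L positions then contributes -log(1/D) = log(D) to the summed loss.
uniform_loss = -np.log(1.0 / D) * L
```

A freshly initialized network with small weights should report a loss close to `uniform_loss`; a value far from it would hint at a bug in the loss or the initialization.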

In [124]:
print rnn.generate([0] * L, np.zeros(H).astype("float32"), 100)

                         v]VB;V*y*1_0)DhMeC!@!Hi#1&! JF72)JlAX 2WMB@,&91[C[PWV!�hkG2ua9(hA[lv.UNpa4-QOfzURas0uIB5]eL%88]K!X-


In [127]:
L = 25
D = len(charset)
H = 100
lr = 0.001

rnn = RNN(L, D, H, lr, lmbda = 0)

ichar = 0
iteration = 0
iseq = 0

after_n_seq = 5000

total_loss = 0
N = len(data)

while True:
    if ichar == 0:
        iteration += 1
        hval = np.zeros(H).astype("float32") ## reset h after the whole text
        
    xval = data[ichar:ichar+L]
    yval = data[ichar+1:ichar+1+L]
    loss, hval = rnn.train_on_seq(xval, yval, hval)
    total_loss += loss
    ichar += L
    iseq += 1
    if iseq % after_n_seq == 0:
        print iteration, ichar, total_loss / after_n_seq
        total_loss = 0
    if iseq % 15000 == 0:
        print rnn.generate(xval, hval, 100), "\n"
    if ichar+1+L >= N: 
        ichar = 0
    if iteration >= 15: break

1 125000 68.6428634369
1 250000 58.6001480789
1 375000 55.6343107094
he Welshman's part of it wall larredan
bnerloan suoutd-ingjy heracd;cuTglfory hed meml Yes diailds Coad nate he foony Wird b 

2 78100 55.6106525269
2 203100 51.7930876827
2 328100 50.6970004478
etween the walls of sumacHe, and
thot and whed henchilenous youist,
ecprared nt ffasevencore har hit. I "ande mind o 

3 31200 51.4263897202
3 156200 49.0206732822
3 281200 48.4693963398
just a given name, like a pindang on-to feas onde wiss lleghing incom'se, bugs then's, we
to
for a mus, a de son't hi's obe 

3 406200 47.4534150406
4 109300 48.7907652168
4 234300 46.5159355717
he church, and I couldn't, for the jodruf it. At in, .
Thed ther dell was neaingo's to new ad ory. Bun?ieb in't ghet it to h 

4 359300 46.4297860298
5 62400 47.8426988739
5 187400 45.704068224
lads
had gone off on that dad, throkte. He for then his boy, mast wher had the stall theawn, Wneft therd tho sand
for a
da 

5 312400 45.2377740635
6 15500 46.2

KeyboardInterrupt: 

***The error generally decreases with more iterations - it didn't move much after 10 iterations, for such a naive training algorithm***

***Using the GPU is actually slower than a good multi-core CPU***

***As a comparison, let's see what happens if we reset the RNN's hidden state for every sequence - it doesn't seem to matter too much***
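
The two strategies differ only in what is passed as the initial hidden state of each window. A minimal sketch of that difference, where `step` is a hypothetical stand-in for `rnn.train_on_seq` and `iterate_windows`, `run_epoch` and `dummy_step` are names invented for this illustration:

```python
import numpy as np

def iterate_windows(data, L):
    """Yield (x, y) windows of length L, with y shifted by one position."""
    for start in range(0, len(data) - L - 1, L):
        yield data[start:start + L], data[start + 1:start + 1 + L]

def run_epoch(step, data, L, H, carry_state):
    """Average loss over one pass; step(x, y, h) must return (loss, new_h)."""
    h = np.zeros(H, dtype="float32")
    total, count = 0.0, 0
    for x, y in iterate_windows(data, L):
        if not carry_state:
            # Forget everything seen before this window.
            h = np.zeros(H, dtype="float32")
        loss, h = step(x, y, h)
        total += loss
        count += 1
    return total / count

def dummy_step(x, y, h):
    # Toy "model": the loss just exposes how much state was carried in.
    return float(h.sum()), h + 1.0

toy_data = list(range(12))
carried = run_epoch(dummy_step, toy_data, 3, 2, carry_state=True)
reset = run_epoch(dummy_step, toy_data, 3, 2, carry_state=False)
```

With the toy step, `reset` stays at zero because every window starts from a zero state, while `carried` grows with the state accumulated over earlier windows.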

In [44]:
L = 25
D = len(charset)
H = 100
lr = 0.001

rnn = RNN(L, D, H, lr, lmbda = 0)

ichar = 0
iteration = 0
iseq = 0

after_n_seq = 5000

total_loss = 0
N = len(data)

while True:
    if ichar == 0:
        iteration += 1
#         hval = np.zeros(H) ## reset h after the whole text
        
    xval = data[ichar:ichar+L]
    yval = data[ichar+1:ichar+1+L]
#     loss, hval = rnn.train_on_seq(xval, yval, hval)
    ## reset the state of the RNN for every seq in training
    loss, hval = rnn.train_on_seq(xval, yval, np.zeros(H).astype("float32"))
    total_loss += loss
    ichar += L
    iseq += 1
    if iseq % after_n_seq == 0:
        print iteration, ichar, total_loss / after_n_seq
        total_loss = 0
    if iseq % 15000 == 0:
        print rnn.generate(xval, hval, 100), "\n"
    if ichar+1+L >= N: 
        ichar = 0
    if iteration >= 15: break

1 125000 69.734506693
1 250000 59.0533565856
1 375000 56.2464675457
he Welshman's part of it wimoctsnn tunt hour anowe reno plway gloun."
und thon hen outhy andein. Thene fpip hammenythed
txd 

2 78100 56.3822990522
2 203100 52.7785379187
2 328100 51.9047743228
etween the walls of sumacle. Aod to was in.--noptes theperd ferven goulll---over andasllon; and gandir't sminh
epot. for the 

3 31200 52.7883702437
3 156200 50.3466271011
3 281200 49.913223164
just a given name, like assing ast tor Orakning a was! was ofly, bog---bafe', dicl, and ale victsont't for. Ast, land, agring 

3 406200 49.0111927036
4 109300 50.2727315207
4 234300 48.0810530385
he church, and I couldn't mon't a booked Hudgind? I resp_cej: Tam, do I fore
ffrem duf cluck. AHt mumy ano his hard!" whe pe 

4 359300 48.0184455581
5 62400 49.3947333979
5 187400 47.2889732475
lads
had gone off on thancer, he save nowertalints to nevered that necrowmed toatorly that ored upde'cld that to
distale: h 

5 312400 46.9074245066
6 1

## word RNN with theano scan
- reset the hidden state right after learning each sentence - different from char seq modelling
- the seq length is not fixed any more
- using a CPU version (by forcing floats to float64) because the GPU is actually slower for this naive implementation
- the loss is the mean error over the sequence, because the sequences no longer have the same length
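
To see why the mean is the more convenient choice for variable-length sentences, here is a small NumPy sketch. The helper `sequence_losses` and the two toy sentences are assumptions for illustration; `D` is the vocabulary size reported by the notebook, and the uniform probabilities mimic an untrained model.

```python
import numpy as np

def sequence_losses(probs_of_targets):
    """Return (sum, mean) of -log p over the correct next words of one sentence."""
    errs = -np.log(np.asarray(probs_of_targets, dtype="float64"))
    return errs.sum(), errs.mean()

D = 8957  # vocabulary size reported by the notebook

# An untrained model assigns roughly 1/D to every word, whatever the sentence length.
short_sentence = [1.0 / D] * 5
long_sentence = [1.0 / D] * 40

short_sum, short_mean = sequence_losses(short_sentence)
long_sum, long_mean = sequence_losses(long_sentence)
```

The summed loss grows with sentence length, while the mean stays at `log(D)` for both sentences, so mean losses can be compared and averaged across sentences.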

In [128]:
import nltk
from theano import scan

In [129]:
sentexts = nltk.sent_tokenize(texts)
sentences = [["SENT_START"] + nltk.word_tokenize(s) + ["SENT_END"] for s in sentexts]
vocab = set(sum(sentences, []))
ind2word = dict([(i, w) for i, w in enumerate(vocab)])
word2ind = dict([(w, i) for i, w in enumerate(vocab)])
data = [map(word2ind.get, s) for s in sentences]
print len(sentences), len(vocab)

5030 8957


***Now the sequences are of different lengths***

In [157]:
## the RNN model is basically the same as the previous one,
## except that we don't need the sequence length to be fixed

class RNNWord(object):
    def __init__(self, D, H, lr, lmbda = 0):
        """
        D: vocabulary size (word dimension)
        H: hidden size
        lr: learning rate
        sequence length will be derived implicitly from inputs
        """
        self.D = D
        self.H = H
        self.lr = lr
        self.lmbda = lmbda
        
        x = T.ivector(name = "x") # seq of word indices
        y = T.ivector(name = "y") # seq of next-word indices (targets)
        h = T.dvector(name = "h") # current state of rnn - h value
        
        Wxh = shared(np.random.randn(D, H) / np.sqrt(D), name = "Wxh")
        Whh = shared(np.random.randn(H, H) / np.sqrt(H), name = "Whh")
        bh = shared(np.zeros(H), name = "bh")
        Why = shared(np.random.randn(H, D) / np.sqrt(H), name = "Why")
        by = shared(np.zeros(D), name = "by")
        
        def forward(xt, ht_1, Wxh, Whh, bh, Why, by):
            ht = T.tanh(Wxh[xt, :] + Whh.dot(ht_1) + bh)
            probt = T.nnet.softmax(ht.dot(Why) + by).flatten()
            return [probt, ht]
        [probs, hs], _ = scan(fn = forward, 
                                sequences = [x],
                                outputs_info=[None, h], 
                                non_sequences=[Wxh, Whh, bh, Why, by], 
                                truncate_gradient=10, ## truncated bptt
                                strict=True)

        errs, _ = scan(fn = lambda prob, y: -T.log(prob[y]), 
                   sequences = [probs, y],
                   outputs_info = None)
        data_loss = errs.mean() # use mean instead of sum for seq error
        reg_loss = (Wxh * Wxh).sum() + (Whh * Whh).sum() + (Why * Why).sum()
        loss = data_loss + lmbda * reg_loss

        ## plain gradients, no clipping here; BPTT truncation comes from truncate_gradient in scan
        dWxh = T.grad(loss, Wxh)
        dWhh = T.grad(loss, Whh)
        dbh = T.grad(loss, bh)
        dWhy = T.grad(loss, Why)
        dby = T.grad(loss, by)
        
        self.train_on_seq = function(inputs = [x, y, h], 
                        outputs = [loss, hs[-1]],    # hs[-1] is the last one 
                        updates = [ (Wxh, Wxh - lr * dWxh)
                                  , (Whh, Whh - lr * dWhh)
                                  , (bh, bh - lr * dbh)
                                  , (Why, Why - lr * dWhy)
                                  , (by, by - lr * dby)])
        self.predict = function(inputs = [x, h], 
                  outputs = [probs[-1], hs[-1]])
    
    def generate(self, x, h, n):
        """Generate n words based on the current x and h
        """
        seq = x
        while len(seq) < len(x) + n:
            p, h = self.predict(x, h)
            c = np.random.choice(xrange(self.D), p = p)
            x = x[1:] + [c]
            seq.append(c)
        return " ".join(map(ind2word.get, seq))

In [159]:
D = len(vocab)
H = 50
lr = 0.001
rnn = RNNWord(D, H, lr)

L = 11
print rnn.train_on_seq(x = [0] * L, y = [0] * L, h = np.zeros(H))[0], -np.log(1./D) 

print rnn.generate([word2ind["SENT_START"]], np.zeros(H), 50)

9.10078868839 9.10019062848
SENT_START drooped Widger charm OR _He_ magnanimous suck 'Mph remote suspect wormed Dark bulletin-board jackets mend Becomes Mode quiet spring sounded Court discovered other's roads dollars day's pulling bug ghosts destructive oar walk yet oars wailings unfolded morrow Harbison ton borders newspapers Bible-prize sport strokes attention neither Swims threaded skiffs 74-h.zip


In [162]:

D = len(vocab)
H = 50
lr = 0.001

rnn = RNNWord(D, H, lr, lmbda = 0)



after_n_seq = 1000
total_loss = 0

for iteration in xrange(3):
    iseq = 0
    for seq in data:
        xval = seq[:-1]
        yval = seq[1:]
        hval = np.zeros(H) # reset h for every seq/sentence
        loss, hval = rnn.train_on_seq(xval, yval, hval)
        total_loss += loss
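        if iseq % after_n_seq == 0:
            # NOTE: the report at iseq == 0 divides a single loss by after_n_seq,
            # so that first printed value is not a true running average
            print iteration, iseq, total_loss / after_n_seq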
            total_loss = 0
            print rnn.generate([word2ind["SENT_START"]], np.zeros(H), 50), "\n"
        iseq += 1

0 0 0.00913558878305
SENT_START rendezvous notching miles allowing boiler referred amiss coffee Self-Examination enemy Niagara hanging tan Cornered looking-glass gets crossing writing wildeyed fact loft 'a pointing dreary Holler Lionized villagers hams agues spring-board unwhitewashed battle wedge swear _theirs_ stalactites permitted pile downhearted passed preposterous sins battery valley personating bow Lost pull enabled BEFORE 

0 1000 9.07190889844
SENT_START martyr defeated ached exact Dore by Forest morning B. wading admired minds silver NO three-mile furtively suffering fantastic latterly grownup 74-h.zip deserted lengths sepulchral hour-glass Lengthy sadly curve festoons Made Presently promising worst instantly surely Alarming small caved iron-gray meed minded treasures escapade Christ Asserts running sunlight veins imposing prosy 

0 2000 8.94777005642
SENT_START hubbub gang festive prompted meditating bat knowing wended pills tete tumblings tails halting pushing gesticulation