
Ok, so I'm gonna try to make an RNN learn to speak like Hofstadter, trying to Google as little as possible :P

## Loading and processing the data

Let's try creating a proper `Dataset` (see [here](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html))

In [17]:
import torch
from torch.utils.data import Dataset, DataLoader


### Dataset class (<span style="color:red; letter-spacing: 3px;">FAIL</span> <-Can't use for length-varying data, so can't use here)


`torch.utils.data.Dataset` is an abstract class representing a
dataset.
Your custom dataset should inherit `Dataset` and override the following
methods:

-  `__len__` so that `len(dataset)` returns the size of the dataset.
-  `__getitem__` to support indexing, such that `dataset[i]` can
   be used to get the `i`-th sample.

Let's create a dataset class for our sentences dataset. We will
read the txt in `__init__`.
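
To make the two-method protocol concrete, here is a minimal sketch in plain Python. The class name `SentenceDataset` and the sample sentences are made up for illustration. In the notebook the class would inherit from `torch.utils.data.Dataset` and return tensors rather than strings, and the real target also ends with an EOS index.

```python
class SentenceDataset:
    """Hypothetical map-style dataset: one sample per non-empty line of text."""

    def __init__(self, lines):
        # Keep only non-empty lines, stripped of the trailing newline.
        self.sentences = [l.rstrip("\n") for l in lines if l.strip()]

    def __len__(self):
        # Lets len(dataset) report the number of samples.
        return len(self.sentences)

    def __getitem__(self, idx):
        # Lets dataset[idx] return one sample: the text and the text shifted by one.
        sentence = self.sentences[idx]
        return {"input": sentence, "target": sentence[1:]}


dataset = SentenceDataset(["Strange loops.\n", "\n", "Tangled hierarchies.\n"])
print(len(dataset))
print(dataset[1]["input"])
```

Each sample here has a different length, which is exactly what causes trouble once a loader tries to stack samples into one batch.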

In [18]:
# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters=len(all_letters)

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# One-hot matrix of first to last letters (not including EOS) for input
def inputTensor(line):
    tensor = torch.zeros(len(line),1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

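# LongTensor of second letter to end (EOS) for target.
# Note: EOS reuses index n_letters - 1, which is also the index of '-' (the last
# character of all_letters), so the model cannot tell the two apart.
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1) # EOS
    return torch.LongTensor(letter_indexes)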
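
To see what these helpers produce, here is a small torch-free sketch that works with plain index lists instead of one-hot tensors. The names `to_ascii`, `input_indices`, `target_indices` and `eos_index` are illustrative stand-ins, not part of the notebook; the alphabet and the EOS choice mirror the cell above.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'-"
eos_index = len(all_letters) - 1  # same choice as the notebook: shares a slot with '-'

def to_ascii(s):
    # Decompose accented characters, drop the accents and anything outside all_letters.
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn" and c in all_letters
    )

def input_indices(line):
    # Index of every character; the notebook turns each index into a one-hot row.
    return [all_letters.find(c) for c in line]

def target_indices(line):
    # The same sequence shifted by one character, with EOS appended.
    return [all_letters.find(c) for c in line[1:]] + [eos_index]

line = to_ascii("G\u00f6del\n")
print(line)
print(input_indices(line))
print(target_indices(line))
```

The target at position `i` is the input at position `i + 1`, so the network is always asked to predict the next character.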


In [99]:
# DataLoader
# I think this is only really useful when your dataset is big and doesn't fit in memory
# class LoadText(Dataset):
#     def __init__(self, filename):
#         f = open(filename,"r")
#         lines = list(f)
#         lines = [l for l in lines if l != "\n"]
#         self.inputTensorsList =list(map(lambda x: inputTensor(unicodeToAscii(x)),lines))
#         self.targetTensorsList =list(map(lambda x: targetTensor(unicodeToAscii(x)),lines))
# #         print(inputTensorsList[0].shape)
# #         self.inputTensors = torch.cat(inputTensorsList,dim=0)
# #         self.tartgetTensors = torch.cat(targetTensorsList,dim=0)
#     def __len__(self):
#         return len(self.inputTensorsList)
#     def __getitem__(self,idx):
#         sentence = self.inputTensorsList[idx]
#         shifted_sentence = self.targetTensorsList[idx]
#         return {"input": sentence, "target": shifted_sentence}

import random

# Read the book, drop blank lines, and strip every character outside all_letters
with open("GEB.txt","r") as f:
    lines = [l for l in f if l != "\n"]
lines = [unicodeToAscii(l) for l in lines]
data = [inputTensor(l) for l in lines]
targets = [targetTensor(l) for l in lines]
n = len(data)

# Draw batch_size distinct sentences at random (requires batch_size <= n)
def get_minibatch(batch_size):
    indices = random.sample(range(n),batch_size)
    return [(data[i],targets[i]) for i in indices]

In [96]:
# processed_data = LoadText("GEB.txt")
# dataloader = DataLoader(processed_data, batch_size=128,
#                         shuffle=True, num_workers=4)

## Defining the model
Ok, how do I make an RNN? Hmm

In [21]:
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F


In [22]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        self.hidden_size = hidden_size

        self.i2i = nn.Linear(input_size + hidden_size, input_size + hidden_size)
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.Softmax(dim=1)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        combined = F.relu(self.i2i(combined))
        hidden = torch.tanh(self.i2h(combined))
        output = self.i2o(combined)
        out = self.softmax(output)
        logoutput = self.logsoftmax(output)
        return logoutput, out, hidden

    def initHidden(self):
        return Variable(torch.zeros(1, self.hidden_size))

n_hidden = 128


In [23]:
rnn = RNN(n_letters, n_hidden, n_letters)

### Testing the model

In [63]:
def sample_sentence():
    # Start from a random letter and keep sampling until EOS or the iteration cap
    leti = random.randint(0,n_letters-1)
    h = Variable(torch.zeros(1,n_hidden))
    let = Variable(torch.zeros(1,n_letters))
    let[0,leti]=1
#     print(all_letters[torch.max(let,dim=1)[1]])
    letis = [leti]
    iters = 0
    max_iters = 200
    while leti != n_letters-1 and iters<=max_iters:
        _,letter_distribution,h = rnn(let,h)
        # Draw the next letter index from the softmax distribution
        leti = int(torch.multinomial(letter_distribution,1).data[0][0])
        let = Variable(torch.zeros(1,n_letters))
        let[0,leti]=1
        letis.append(leti)
        iters += 1  # without this the max_iters cap never triggers
#     print(letis)
    return "".join([all_letters[leti] for leti in letis])

In [71]:
sample_sentence()

"LHQfZiY.kqzAkCSJeGLt.HeA''WdOsPMP''.Pl;DcKsVh;-"

### Training

`NLLLoss` stands for negative log-likelihood loss: the negative of the log of the probability (likelihood) that the softmax distribution output by the `rnn` assigns to the correct next character at each point in the sequence.

We work with the log of the probability for convenience: the numbers are more manageable, which the computer thanks us for, as it has to store something like -20 instead of 0.0000000000000000001 or something, and this helps avoid floating-point precision errors. For us, it means that instead of multiplying the likelihoods of the characters to get the likelihood of the sentence, we add the log-likelihoods.
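
A quick way to convince yourself, as a sketch with made-up numbers: suppose the model assigns probability 0.1 to the correct next character at each of 400 positions. The list `char_probs` is purely hypothetical.

```python
import math

# Hypothetical probabilities the model assigned to the correct next character.
char_probs = [0.1] * 400

# Multiplying the likelihoods underflows to exactly 0.0 in double precision.
sentence_likelihood = math.prod(char_probs)

# Adding the log-likelihoods stays comfortably within range.
sentence_log_likelihood = sum(math.log(p) for p in char_probs)

# The negative log-likelihood is the quantity that training pushes down.
nll = -sentence_log_likelihood

print(sentence_likelihood)
print(nll)
```

The product is lost entirely, while the sum of logs is an ordinary number of a few hundred that can still be compared and differentiated.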

We are going to use the Adam optimizer instead of the simpler SGD, because it tends to work better (really because [god Karpathy says so](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)).
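
For intuition about the difference, here is a scalar sketch of one SGD step next to one Adam step. The functions `sgd_step` and `adam_step` are illustrative and are not the `torch.optim` implementation. The hyperparameters match the defaults of `torch.optim.Adam`, and the two gradient values are made up.

```python
import math

def sgd_step(x, grad, lr=0.001):
    # Plain SGD: the step size is proportional to the gradient.
    return x - lr * grad

def adam_step(x, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam keeps running averages of the gradient and of its square.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for both averages starting at zero.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return x - lr * m_hat / (math.sqrt(v_hat) + eps)

sgd_moves, adam_moves = [], []
for grad in (0.01, 100.0):
    state = {"t": 0, "m": 0.0, "v": 0.0}
    sgd_moves.append(0.0 - sgd_step(0.0, grad))
    adam_moves.append(0.0 - adam_step(0.0, grad, state))

print(sgd_moves)
print(adam_moves)
```

The SGD moves differ by a factor of ten thousand, while both first Adam moves are roughly `lr` in size. That per-parameter normalization is one reason Adam is often less sensitive to gradient scale.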

In [79]:
criterion = nn.NLLLoss()
optim = torch.optim.Adam(rnn.parameters())

In [None]:
from math import ceil
num_iters = 10000
# learning_rate = 0.001
batch_size = 128
for iteration in range(num_iters):
    loss = 0
    rnn.zero_grad()
    for b in get_minibatch(batch_size):
        sentence_chars = b[0]
        target_next_chars = b[1]
        h = Variable(torch.zeros(1,n_hidden))
        len_sentence = sentence_chars.shape[0]
        # I had some problems with very long sentences before, not sure why. Limiting it to 100 chars for now
        # but it's something to check out later.
        for i in range(min(len_sentence,100)):
            input_char = Variable(sentence_chars[i,:,:])
            loglet,_,h = rnn(input_char,h)
            target_char = Variable(torch.LongTensor([target_next_chars[i]]))
            loss += criterion(loglet,target_char)
    #         lets.append(let)
    loss /= batch_size
    loss.backward()
    optim.step()
# below is if we wanted to use SGD
#     for p in rnn.parameters():
#         p.data.add_(-learning_rate, p.grad.data)
    if iteration%(ceil(num_iters/1000))==0:
        print(iteration)
        print(loss.item())  # loss is a 0-dim tensor in PyTorch >= 0.4, so indexing with [0] fails

In [85]:
## Trying to train using DataLoader, but can't make data loader with length-varying data....

# from math import ceil
# num_iters = 18000
# batch_size = 100
# size_data = len(dataloader)


# for epoch in range(num_epochs):
#     for i_batch, sample_batched in enumerate(dataloader):
#         loss = 0
#         rnn.zero_grad()
#         inputs = sample_batched['input']
#         targets = sample_batched['target']
#         for i in range(inputs.shape[0]):
#             inp = inputs[i:i+1,:,:]
#             tar = targets[i:i+1,:,:]
#             h = Variable(torch.zeros(1,n_hidden))
#             len_sentence = inp.shape[0]
#             for i in range(len_sentence):
#                 input_char = Variable(sentence_chars[i,:,:])
#                 loglet,_,h = rnn(input_char,h)
#                 target_char = Variable(torch.LongTensor([target_next_chars[i]]))
#                 loss += criterion(loglet,target_char)
#         #         lets.append(let)
#         loss /= batch_size
#         loss.backward()
#         optim.step()
#         if i_batch%(ceil(size_data/1000))==0:
#             print(iteration)
#             print(loss.data.numpy()[0])

0
116.73786
18
111.70514
36
110.96231
54
111.41884
72
115.752106
90
117.673454
108
116.097855
126
108.972
144
108.51685
162
108.541504
180
116.06244
198
106.08992
216
114.05661
234
104.19088
252
108.145355
270
98.140625
288
106.9031
306
109.556465
324
104.70042
342
110.50124
360
103.985245
378
101.96027
396
107.859695
414
95.28157
432
104.65615
450
108.14628
468
100.447014
486
117.771126
504
100.352295
522
97.181755
540
107.49696
558
103.089066
576
94.55757
594
106.73015
612
112.398125
630
104.1905
648
110.126526
666
95.92023
684
97.13247
702
94.146835
720
97.55515
738
93.83797
756
97.83145
774
95.436386
792
101.613815
810
106.32706
828
103.570076
846
102.016335
864
92.85781
882
97.490166
900
93.0558
918
105.28509
936
112.07302
954
97.0695
972
107.646034
990
103.99272
1008
110.86503
1026
91.13515
1044
99.663475
1062
91.40614
1080
99.87008
1098
95.99698
1116
107.48521
1134
102.03435
1152
101.00247
1170
103.35263
1188
105.56466
1206
98.92982
1224
95.95692
1242
100.70881
1260
95.80939
127

10440
89.73581
10458
86.70008
10476
80.95887
10494
83.14602
10512
75.85327
10530
83.15125
10548
94.03833
10566
83.69385
10584
83.18735
10602
85.46871
10620
78.23728
10638
86.16403
10656
72.991066
10674
84.53394
10692
79.36183
10710
75.187874
10728
77.85134
10746
74.87686
10764
83.31312
10782
75.68197
10800
75.918144
10818
80.861786
10836
81.10103
10854
84.30643
10872
77.71907
10890
85.29105
10908
80.866196
10926
77.70947
10944
76.56818
10962
87.78525
10980
83.49509
10998
79.65094
11016
80.84707
11034
80.88976
11052
80.96597
11070
81.073784
11088
83.3989
11106
84.17748
11124
86.60485
11142
80.31356
11160
83.49215
11178
79.45812
11196
71.34569
11214
87.97654
11232
87.548836
11250
81.203156
11268
81.69362
11286
78.99413
11304
78.33982
11322
79.82788
11340
76.94642
11358
81.90973
11376
84.59487
11394
74.493805
11412
76.65231
11430
86.5512
11448
76.887405
11466
78.61206
11484
74.386375
11502
85.60951
11520
83.4355
11538
69.35745
11556
89.23941
11574
76.12985
11592
79.051834
11610
78.73453
1

In [97]:
# next(iter(dataloader))

RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 132, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in batch[0]}
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 132, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in batch[0]}
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 112, in default_collate
    return torch.stack(batch, 0, out=out)
  File "/usr/local/lib/python3.5/dist-packages/torch/functional.py", line 105, in stack
    return torch.cat(inputs, dim, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 86 and 92 in dimension 1 at /home/guillefix/Dropbox/Code/ai/learning/pytorch/aten/src/TH/generic/THTensorMath.c:3496
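
The error comes from the default collate step, which tries to stack samples of different lengths (86 and 92 here) into one tensor. One common workaround is to give `DataLoader` a custom `collate_fn` that pads each batch to its longest sequence. Below is a torch-free sketch of that padding logic on plain index lists. The function `pad_collate` and its `pad_index` argument are hypothetical, and a real version would need to handle the dict-of-tensors samples that `LoadText` returns.

```python
def pad_collate(batch, pad_index=0):
    """Pad variable-length index sequences to the longest one in the batch."""
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths)
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in batch]
    # Return the true lengths too, so padded positions can be ignored in the loss.
    return padded, lengths

batch = [[32, 14, 3], [4, 11], [7]]
padded, lengths = pad_collate(batch)
print(padded)
print(lengths)
```

Since `pad_index=0` is also the index of the letter `a` in this notebook's alphabet, the returned lengths are what would let a training loop tell padding from real characters.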


In [165]:
# Saving the trained net's weights (and how to load them back)
import pickle
# pickle.dump(rnn.state_dict(), open("trained_simple_rnn.pkl","wb"))
# rnn.load_state_dict(pickle.load(open("trained_simple_rnn.pkl","rb")))

In [235]:
#code graveyard
# targets[0]
# let = Variable(d[0,:,:])
# h = Variable(torch.zeros(1,n_hidden))
# loss.backward()
# let,h = rnn(let,h)
# let,h