# Character level language model - Dinosaurus land

**This problem comes from Andrew Ng's Coursera course. Instead of using their Keras-based solution, we will try to solve this problem with PyTorch.**

Imagine that leading biology researchers are creating new breeds of dinosaurs and bringing them to life on earth, and your job is to give names to these dinosaurs. If a dinosaur does not like its name, it might go berserk, so choose wisely!

<table>
<td>
<img src="images/dino.jpg" style="width:250px;height:300px;">

</td>

</table>

Luckily you have learned some deep learning and you will use it to save the day. Your assistant has collected a list of all the dinosaur names they could find, and compiled them into this [dataset](dinos.txt). (Feel free to take a look by clicking the previous link.) To create new dinosaur names, you will build a character-level language model. Your algorithm will learn the different name patterns and randomly generate new names.


By completing this assignment you will learn:

- How to store text data for processing using an RNN 
- How to synthesize data by sampling a prediction at each time step and passing it to the next RNN-cell unit
- How to build a character-level text generation recurrent neural network




In [265]:
import numpy as np
#from utils import *
import random
import math
import torch
from torch import nn, optim
import torch.nn.functional as F  # used by log_softmax in the models below
import os
import time

## 1 - Problem Statement

### 1.1 - Dataset and Preprocessing

Run the following cell to read the dataset of dinosaur names, create a list of unique characters (such as a-z), and compute the dataset and vocabulary size. 

In [266]:
# Read the whole file; the context manager closes it afterwards.
with open('dinos.txt', 'r') as f:
    data = f.read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))

del data

There are 19912 total characters and 27 unique characters in your data.


The characters are a-z (26 characters) plus the "\n" (or newline character), which in this assignment plays a role similar to the `<EOS>` (or "End of sentence") token we had discussed in lecture, only here it indicates the end of the dinosaur name rather than the end of a sentence. 
In the cell below, we create a Python dictionary (i.e., a hash table) to map each character to an index from 0 to 26. We also create a second Python dictionary that maps each index back to the corresponding character. This will help you figure out which index corresponds to which character in the probability distribution output of the softmax layer. Below, `char_to_ix` and `ix_to_char` are the Python dictionaries.

In [267]:
char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print(ix_to_char)

{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}
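As a quick check of the two dictionaries, a name can be encoded to indices and decoded back. This is a minimal sketch that rebuilds the 27-symbol vocabulary by hand so it runs without `dinos.txt`; the helper names `encode` and `decode` are hypothetical and not part of the notebook.

```python
import string

# Assumption: same vocabulary as above ("\n" plus a-z), built by hand here.
chars = ["\n"] + list(string.ascii_lowercase)
char_to_ix = {ch: i for i, ch in enumerate(sorted(chars))}
ix_to_char = {i: ch for i, ch in enumerate(sorted(chars))}

def encode(name):
    """Map a name to a list of vocabulary indices (hypothetical helper)."""
    return [char_to_ix[ch] for ch in name.lower()]

def decode(indices):
    """Map a list of indices back to a string (hypothetical helper)."""
    return "".join(ix_to_char[i] for i in indices)

ids = encode("Rex\n")
print(ids)                # [18, 5, 24, 0]
print(repr(decode(ids)))  # 'rex\n'
```

The trailing `0` is the newline index, which marks the end of a name.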


## 2 - Building the language model 

It is time to build the character-level language model for text generation. 





### 2.1 - Build training data

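Given the dataset of dinosaur names, we use each line of the dataset (one name) as one training example. The following function `build_training_data()` returns `X_train` (the character indices of each name, padded with NaN up to `name_length`), `Y_train` (the same sequence shifted left by one position and terminated with the newline index), and `seqlen` (the unpadded length of each name).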

In [268]:
def build_training_data( filename, name_length=30 ):

    with open(filename) as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]    
    
    data_ix = []   
    for name in examples:
        data_ix.append([char_to_ix[ch] for ch in name])

    seqlen = []
    X_train = []
    Y_train = []
    for name_ix in data_ix:  
        seqlen.append( len(name_ix) )
        x = name_ix.copy()
        x.extend([np.nan]*(name_length-len(name_ix)))
        X_train.append( x )
        y = name_ix[1:].copy()+[char_to_ix["\n"]]
        y.extend([np.nan]*(name_length-len(name_ix)))
        Y_train.append( y )
    
    X_train, Y_train, seqlen = np.array(X_train), np.array(Y_train), np.array(seqlen)
    
    return X_train, Y_train, seqlen

In [269]:
tX, tY, tSeqLen = build_training_data( "dinos.txt" )
print('len(tX)={}, len(tY)={}, len(seqlen)={}'.format(len(tX), len(tY), len(tSeqLen)))
print(tX[0])
print(tY[0])
print(tSeqLen[0])

len(tX)=1539, len(tY)=1539, len(seqlen)=1539
[ 1.  1.  3.  8.  5. 14. 15. 19.  1. 21. 18. 21. 19. nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
[ 1.  3.  8.  5. 14. 15. 19.  1. 21. 18. 21. 19.  0. nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
13
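The two printed rows show how inputs and targets relate: the target is the input shifted left by one position, with the newline index appended as the end-of-name marker. A minimal sketch on the first example from the output above; the helper `make_pair` is hypothetical and leaves out the NaN padding.

```python
def make_pair(name_ix, eos_ix=0):
    """Return (input, target) index lists for one name (hypothetical helper).

    The target at step t is the character the model should predict after
    reading the input at step t, so it is the input shifted left by one,
    followed by the end-of-name index.
    """
    x = list(name_ix)
    y = list(name_ix[1:]) + [eos_ix]
    return x, y

# "aachenosaurus" as indices (first row of the output above, padding removed).
name_ix = [1, 1, 3, 8, 5, 14, 15, 19, 1, 21, 18, 21, 19]
x, y = make_pair(name_ix)
print(x)
print(y)
```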


In [270]:
from torch.utils.data import (Dataset, DataLoader, TensorDataset)

class DinoDataset( Dataset ):
    def __init__(self, filename, name_length=30, char_embedding=27 ):
        with open(filename) as f:
            examples = f.readlines()
        examples = [x.lower().strip() for x in examples]    
    
        data_ix = []   
        for name in examples:
            data_ix.append([char_to_ix[ch] for ch in name])

        self.char_embedding=char_embedding
        self.seqlen = []
        self.X_train = []
        self.Y_train = []
        for name_ix in data_ix:  
            self.seqlen.append( len(name_ix) )
            x = name_ix.copy()
            x.extend([np.nan]*(name_length-len(name_ix)))
            self.X_train.append( x )
            y = name_ix[1:].copy()+[char_to_ix["\n"]]
            y.extend([np.nan]*(name_length-len(name_ix)))
            self.Y_train.append( y )
    
        self.X_train, self.Y_train, self.seqlen = np.array(self.X_train), np.array(self.Y_train), np.array(self.seqlen)
#        print("self.X_train.type: ", self.X_train.dtype)
        
    def __len__(self):
        return self.X_train.shape[0]
    
    def __getitem__(self, idx):
        
        return self.X_train[idx], self.Y_train[idx], self.seqlen[idx]
        
#        x_onehot = torch.zeros([self.seqlen[idx], self.char_embedding], dtype=torch.int64 )
#        pos = torch.tensor(self.X_train[idx], dtype=torch.int64)
#        pos = pos[:self.seqlen[idx]].view((self.seqlen[idx], 1))
#        x_onehot.scatter_(1, pos, 1)
        
#        y_onehot = torch.zeros([self.seqlen[idx], self.char_embedding], dtype=torch.int64)
#        pos = torch.tensor(self.Y_train[idx], dtype=torch.int64)
#        pos = pos[:self.seqlen[idx]].view((self.seqlen[idx], 1))
#        y_onehot.scatter_(1, pos, 1)
#        onehot_length = torch.tensor(self.seqlen[idx])
   
#        return x_onehot, y_onehot, onehot_length
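The dataset returns index sequences padded with NaN, while the networks expect a 27-dimensional one-hot vector per character. One possible way to do that conversion is sketched below with `torch.nn.functional.one_hot`. It assumes padded positions are mapped to index 0 and are masked out later by the loss; the function name `to_onehot` is hypothetical.

```python
import torch
import torch.nn.functional as F

def to_onehot(x, vocab_size=27):
    """[batch, seq] float indices (NaN = padding) -> [batch, seq, vocab_size] floats."""
    # Assumption: padded positions become index 0; the loss has to ignore them.
    idx = torch.where(torch.isnan(x), torch.zeros_like(x), x).long()
    return F.one_hot(idx, num_classes=vocab_size).float()

x = torch.tensor([[1.0, 3.0, float("nan")]])
onehot = to_onehot(x)
print(onehot.shape)  # torch.Size([1, 3, 27])
```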

In [271]:
class DinoNameNet( nn.Module ):
    def __init__(self, char_embedding=27, hidden_dim=100, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM( char_embedding, hidden_dim, num_layers, batch_first=True )
        self.linear = nn.Linear( hidden_dim, char_embedding ) # 
        
    def forward(self, x, h0=None, length=None):
        padded_out = torch.zeros(x.size())
        padded_out[:,:,0] = 1.0
        packed = length is not None
        if packed:
            # enforce_sorted=False accepts batches that are not sorted by length
            x = nn.utils.rnn.pack_padded_sequence( x, length, batch_first=True, enforce_sorted=False )
        # input : x.size()=[#batch, #seq, #input]
        # input : h0.size()=[#layer*#dir, #batch, #hidden ] (cell)
        # output : x.size() = [#batch, #seq, #dir*#hidden] 
        # output : h_n.size() = [#layer*#dir, #batch, #hidden]
        out, (hidden, cell) = self.lstm(x, h0)
        if packed:
            # only a packed sequence has to be unpacked again
            out, output_lengths = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
#        if length is not None:
#            out = hidden[-1]
#        out = self.linear(out).squeeze() # out.size() = [32, seqlen<30, 27]
        out = self.linear(out)

        padded_out[:,:out.size()[1],:] = out
#        print('[net] padded_out', padded_out.size())
#        print(padded_out)
        padded_out = F.log_softmax(padded_out, dim=2)
#        print(padded_out)

        return padded_out, (hidden, cell)

In [272]:
class DinoNameNetGRU( nn.Module ):
    def __init__(self, char_embedding=27, hidden_dim=100, num_layers=2):
        super().__init__()
        self.gru = nn.GRU( char_embedding, hidden_dim, num_layers, batch_first=True )
#        torch.nn.init.xavier_uniform_(self.gru.weight.ih )
        self.linear = nn.Linear( hidden_dim, char_embedding ) # 
        
    def forward(self, x, h0=None, length=None):
        padded_out = torch.zeros(x.size())
        padded_out[:,:,0] = 1.0
        packed = length is not None
        if packed:
            # enforce_sorted=False accepts batches that are not sorted by length
            x = nn.utils.rnn.pack_padded_sequence( x, length, batch_first=True, enforce_sorted=False )
        # input : x.size()=[#batch, #seq, #input]
        # input : h0.size()=[#layer*#dir, #batch, #hidden ] (cell)
        # output : x.size() = [#batch, #seq, #dir*#hidden] 
        # output : h_n.size() = [#layer*#dir, #batch, #hidden]
        out, hidden = self.gru(x, h0)
        if packed:
            # only a packed sequence has to be unpacked again
            out, output_lengths = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
#        if length is not None:
#            out = hidden[-1]
#        out = self.linear(out).squeeze() # out.size() = [32, seqlen<30, 27]
        out = self.linear(out)

        padded_out[:,:out.size()[1],:] = out
#        print('[net] padded_out', padded_out.size())
#        print(padded_out)
        padded_out = F.log_softmax(padded_out, dim=2)
#        print(padded_out)

        return padded_out, hidden

In [273]:
class DinoNameNetGRU2( nn.Module ):
    def __init__(self, char_embedding=27, hidden_dim=100, num_layers=2):
        super().__init__()
        self.gru = nn.GRU( char_embedding, hidden_dim, num_layers, batch_first=True )
        self.linear = nn.Linear( hidden_dim, char_embedding ) # 
        
    def forward(self, x, h0=None, length=None):
        padded_out = torch.zeros(x.size())
        padded_out[:,:,0] = 1.0
        # input : x.size()=[#batch, #seq, #input]
        # input : h0.size()=[#layer*#dir, #batch, #hidden ] (cell)
        # output : x.size() = [#batch, #seq, #dir*#hidden] 
        # output : h_n.size() = [#layer*#dir, #batch, #hidden]
        out, hidden = self.gru(x, h0)
        out = self.linear(out)
        padded_out = F.log_softmax(out, dim=2)
#        print(padded_out)

        return padded_out, hidden
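Each model returns log-probabilities for every time step, including the padded ones, so the loss has to skip padding. The training cell that follows does this with `nn.NLLLoss(ignore_index=target_padding)`. The pure-Python sketch below illustrates the same idea: average the negative log-probability of the target over non-padded positions only. The function `masked_nll` and the toy numbers are illustrative assumptions, not part of the notebook.

```python
import math

def masked_nll(log_probs, targets, ignore_index=100):
    """Mean negative log-likelihood over positions whose target is not
    ignore_index (illustrative stand-in for a masked NLL loss)."""
    kept = [-lp[t] for lp, t in zip(log_probs, targets) if t != ignore_index]
    return sum(kept) / len(kept)

# Three time steps over a toy 3-symbol vocabulary; the last step is padding.
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(1 / 3)] * 3,
]
targets = [0, 1, 100]

loss = masked_nll(log_probs, targets)
print(round(loss, 4))  # 0.2899
```

Because the padded step is dropped before averaging, the value depends only on the two real characters.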

In [None]:
import torch.nn.functional as F

batch_size = 64
target_padding = 100
learning_rate = 0.001

dd = DinoDataset("dinos.txt")
train_loader = DataLoader( dd, batch_size=batch_size, shuffle=True, num_workers=0)

net = DinoNameNetGRU2()
#net.to('cuda:0')
opt = optim.Adam(net.parameters(), lr=learning_rate)
loss_f = nn.NLLLoss(ignore_index=target_padding)

def get_onehot( x, embedding ):
    x_onehot = torch.zeros(x.size()+torch.Size([embedding]), dtype=torch.int64 )
#    print(x_onehot)
    pos = x.clone()
    pos[torch.isnan(pos)] = 0
    pos = pos.view(pos.size()+torch.Size([1])).long()
    x_onehot.scatter_(2, pos, 1)    
    return x_onehot

def sort_minibatch( data, label, length ):
    length_sorted, sorted_indices = length.sort(descending=True)
    data_sorted = data[sorted_indices, :]
    label_sorted = label[sorted_indices]
    return (data_sorted, label_sorted, length_sorted)

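def eval_net():
    net.eval()
    # Start from a random letter: indices 1-26 are a-z, 0 is the "\n" end marker.
    idx = torch.randint(1, 27, (1, 1))
    h_in = None
    dino_name = ix_to_char[idx.item()]
    with torch.no_grad():
        for i in range(30):
            # Feed the previously chosen character back in as a one-hot vector.
            x_o = get_onehot(idx.float(), embedding=27)
            y_pred, h_in = net(x_o.float(), h_in, length=torch.tensor([1]))
            # Greedy decoding: keep the most likely next character.
            idx = y_pred.argmax(dim=2)
            dino_name = dino_name + ix_to_char[idx.item()]
            if idx.item() == 0:
                break
    print(dino_name)    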


for epoch in range(35000):
    losses = []
    net.train()
    count = 0
    for x, y, l in train_loader:
#        (x, y, l) = sort_minibatch(x, y, l)
#        x = x.to('cuda:0')
#        y = y.to('cuda:0')
#        l = l.to('cuda:0')
        x_o = get_onehot(x, embedding=27)
#        y_o = get_onehot(y, embedding=27)
        y_pred, _ = net(x_o.float(), length=l)
#        print('[main] y_pred', y_pred)

        y_pred = y_pred.view(-1,27)
        y[y != y] = target_padding
        y = y.view(-1).long()
#        print('[main] y: ', y)
#        print('[main] y.size:', y.size())
#        y_o = y_o.view(-1).long()
#        loss = F.nll_loss(y_pred, y)
        loss = loss_f(y_pred, y)
        
        net.zero_grad()
        loss.backward()
        opt.step()
        
        losses.append(loss.item())
        count += 1
        if count%20 == 0 : 
            print("epoch:{}, count:{}, loss:{}".format(epoch, count, loss.item()))
 #       break
 #   break
    if epoch%100==0:
        eval_net()

for i in range(10):
    eval_net()
    

epoch:0, count:20, loss:2.8020565509796143
zaaaaaaaaauuuuuuuusssssssssssss
epoch:1, count:20, loss:2.6656558513641357
epoch:2, count:20, loss:2.5662617683410645
epoch:3, count:20, loss:2.4748659133911133
epoch:4, count:20, loss:2.374203681945801
epoch:5, count:20, loss:2.192744255065918
epoch:6, count:20, loss:2.0463461875915527
epoch:7, count:20, loss:1.9656676054000854
epoch:8, count:20, loss:1.928909182548523
epoch:9, count:20, loss:1.865503191947937
epoch:10, count:20, loss:1.86142098903656
epoch:11, count:20, loss:1.7328835725784302
epoch:12, count:20, loss:1.7489486932754517
epoch:13, count:20, loss:1.7112056016921997
epoch:14, count:20, loss:1.69237220287323
epoch:15, count:20, loss:1.882509469985962
epoch:16, count:20, loss:1.7022266387939453
epoch:17, count:20, loss:1.6408882141113281
epoch:18, count:20, loss:1.6242756843566895
epoch:19, count:20, loss:1.6083358526229858
epoch:20, count:20, loss:1.654066562652588
epoch:21, count:20, loss:1.5660464763641357
epoch:22, count:20, 

epoch:184, count:20, loss:0.5696982741355896
epoch:185, count:20, loss:0.4961225092411041
epoch:186, count:20, loss:0.5001122355461121
epoch:187, count:20, loss:0.48203179240226746
epoch:188, count:20, loss:0.5024473667144775
epoch:189, count:20, loss:0.5228689908981323
epoch:190, count:20, loss:0.4915364682674408
epoch:191, count:20, loss:0.5262443423271179
epoch:192, count:20, loss:0.4988448917865753
epoch:193, count:20, loss:0.4970792829990387
epoch:194, count:20, loss:0.5115046501159668
epoch:195, count:20, loss:0.48718711733818054
epoch:196, count:20, loss:0.49172651767730713
epoch:197, count:20, loss:0.5115198493003845
epoch:198, count:20, loss:0.5360208749771118
epoch:199, count:20, loss:0.49221453070640564
epoch:200, count:20, loss:0.4704471528530121
aeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
epoch:201, count:20, loss:0.5066354274749756
epoch:202, count:20, loss:0.5159472227096558
epoch:203, count:20, loss:0.5083310008049011
epoch:204, count:20, loss:0.5310472846031189
epoch:205, count:20

epoch:364, count:20, loss:0.4152110815048218
epoch:365, count:20, loss:0.398891806602478
epoch:366, count:20, loss:0.4045659303665161
epoch:367, count:20, loss:0.3990562856197357
epoch:368, count:20, loss:0.4415740966796875
epoch:369, count:20, loss:0.4363328516483307
epoch:370, count:20, loss:0.41441234946250916
epoch:371, count:20, loss:0.423787921667099
epoch:372, count:20, loss:0.4257424771785736
epoch:373, count:20, loss:0.4129028022289276
epoch:374, count:20, loss:0.42722249031066895
epoch:375, count:20, loss:0.4227849245071411
epoch:376, count:20, loss:0.4043430984020233
epoch:377, count:20, loss:0.44314879179000854
epoch:378, count:20, loss:0.4037250578403473
epoch:379, count:20, loss:0.43221965432167053
epoch:380, count:20, loss:0.42026761174201965
epoch:381, count:20, loss:0.4180854558944702
epoch:382, count:20, loss:0.410565048456192
epoch:383, count:20, loss:0.4139127731323242
epoch:384, count:20, loss:0.4140026867389679
epoch:385, count:20, loss:0.41462478041648865
epoch:3

epoch:544, count:20, loss:0.3864929676055908
epoch:545, count:20, loss:0.41981270909309387
epoch:546, count:20, loss:0.39670610427856445
epoch:547, count:20, loss:0.3929799795150757
epoch:548, count:20, loss:0.39067262411117554
epoch:549, count:20, loss:0.4148615300655365
epoch:550, count:20, loss:0.4062841534614563
epoch:551, count:20, loss:0.39471855759620667
epoch:552, count:20, loss:0.40154123306274414
epoch:553, count:20, loss:0.4000531733036041
epoch:554, count:20, loss:0.3774481415748596
epoch:555, count:20, loss:0.39035436511039734
epoch:556, count:20, loss:0.4072941243648529
epoch:557, count:20, loss:0.40481144189834595
epoch:558, count:20, loss:0.4212375283241272
epoch:559, count:20, loss:0.38310813903808594
epoch:560, count:20, loss:0.37940147519111633
epoch:561, count:20, loss:0.37783902883529663
epoch:562, count:20, loss:0.37764719128608704
epoch:563, count:20, loss:0.39637959003448486
epoch:564, count:20, loss:0.42470112442970276
epoch:565, count:20, loss:0.40593487024307

epoch:724, count:20, loss:0.39272478222846985
epoch:725, count:20, loss:0.42375391721725464
epoch:726, count:20, loss:0.39673957228660583
epoch:727, count:20, loss:0.3924542963504791
epoch:728, count:20, loss:0.39417764544487
epoch:729, count:20, loss:0.39858192205429077
epoch:730, count:20, loss:0.387398362159729
epoch:731, count:20, loss:0.3810771703720093
epoch:732, count:20, loss:0.4034442603588104
epoch:733, count:20, loss:0.3766273856163025
epoch:734, count:20, loss:0.4130386412143707
epoch:735, count:20, loss:0.4031965732574463
epoch:736, count:20, loss:0.4063574969768524
epoch:737, count:20, loss:0.40029338002204895
epoch:738, count:20, loss:0.37378841638565063
epoch:739, count:20, loss:0.392450749874115
epoch:740, count:20, loss:0.3918949365615845
epoch:741, count:20, loss:0.38475608825683594
epoch:742, count:20, loss:0.3919731676578522
epoch:743, count:20, loss:0.38825032114982605
epoch:744, count:20, loss:0.3931029736995697
epoch:745, count:20, loss:0.4007608890533447
epoch:

epoch:903, count:20, loss:0.3888958692550659
epoch:904, count:20, loss:0.37515610456466675
epoch:905, count:20, loss:0.4070926606655121
epoch:906, count:20, loss:0.3840644061565399
epoch:907, count:20, loss:0.4088035821914673
epoch:908, count:20, loss:0.37330329418182373
epoch:909, count:20, loss:0.41821351647377014
epoch:910, count:20, loss:0.39453983306884766
epoch:911, count:20, loss:0.3826748728752136
epoch:912, count:20, loss:0.3925037682056427
epoch:913, count:20, loss:0.3896673023700714
epoch:914, count:20, loss:0.38890984654426575
epoch:915, count:20, loss:0.3927111327648163
epoch:916, count:20, loss:0.40081077814102173
epoch:917, count:20, loss:0.38562920689582825
epoch:918, count:20, loss:0.3882480263710022
epoch:919, count:20, loss:0.38014307618141174
epoch:920, count:20, loss:0.37856927514076233
epoch:921, count:20, loss:0.4047309160232544
epoch:922, count:20, loss:0.3668542206287384
epoch:923, count:20, loss:0.38563305139541626
epoch:924, count:20, loss:0.37871894240379333

epoch:1082, count:20, loss:0.3737814128398895
epoch:1083, count:20, loss:0.40349045395851135
epoch:1084, count:20, loss:0.3864971399307251
epoch:1085, count:20, loss:0.38313278555870056
epoch:1086, count:20, loss:0.3831699788570404
epoch:1087, count:20, loss:0.36359500885009766
epoch:1088, count:20, loss:0.3839676082134247
epoch:1089, count:20, loss:0.39913150668144226
epoch:1090, count:20, loss:0.36563193798065186
epoch:1091, count:20, loss:0.4056858420372009
epoch:1092, count:20, loss:0.39400339126586914
epoch:1093, count:20, loss:0.3921423852443695
epoch:1094, count:20, loss:0.3904128968715668
epoch:1095, count:20, loss:0.39376014471054077
epoch:1096, count:20, loss:0.42075344920158386
epoch:1097, count:20, loss:0.38279253244400024
epoch:1098, count:20, loss:0.39805400371551514
epoch:1099, count:20, loss:0.39930665493011475
epoch:1100, count:20, loss:0.3930653929710388
wuieiaiaiaiaiaiaiaiaiaiaiaiaiai
epoch:1101, count:20, loss:0.40533512830734253
epoch:1102, count:20, loss:0.4121957

epoch:1258, count:20, loss:0.3994091749191284
epoch:1259, count:20, loss:0.3850037455558777
epoch:1260, count:20, loss:0.38911381363868713
epoch:1261, count:20, loss:0.3826574981212616
epoch:1262, count:20, loss:0.3996143043041229
epoch:1263, count:20, loss:0.3923388719558716
epoch:1264, count:20, loss:0.3816656470298767
epoch:1265, count:20, loss:0.38340523838996887
epoch:1266, count:20, loss:0.39621999859809875
epoch:1267, count:20, loss:0.39850082993507385
epoch:1268, count:20, loss:0.3699995279312134
epoch:1269, count:20, loss:0.3971841633319855
epoch:1270, count:20, loss:0.38139796257019043
epoch:1271, count:20, loss:0.378691703081131
epoch:1272, count:20, loss:0.38709232211112976
epoch:1273, count:20, loss:0.3795868456363678
epoch:1274, count:20, loss:0.43536099791526794
epoch:1275, count:20, loss:0.46648362278938293
epoch:1276, count:20, loss:0.47217246890068054
epoch:1277, count:20, loss:0.4133840501308441
epoch:1278, count:20, loss:0.38355743885040283
epoch:1279, count:20, los
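
`eval_net` always keeps the most likely character (argmax), which tends to produce repetitive strings like the ones in the log above. A common alternative is to sample the next character from the predicted distribution. The sketch below shows that single step in isolation using only the standard library; `sample_next`, the `temperature` parameter, and the toy distribution are assumptions for illustration.

```python
import math
import random

def sample_next(log_probs, temperature=1.0, rng=random):
    """Sample an index from a list of log-probabilities.

    temperature < 1 sharpens the distribution, temperature > 1 flattens it.
    """
    scaled = [lp / temperature for lp in log_probs]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy distribution over 4 symbols (index 0 plays the role of "\n").
log_probs = [math.log(p) for p in (0.1, 0.6, 0.2, 0.1)]
rng = random.Random(0)
counts = [0] * 4
for _ in range(10_000):
    counts[sample_next(log_probs, rng=rng)] += 1
print(counts)  # index 1 is drawn most often
```

In a generation loop, the sampled index would replace the argmax result and be fed back as the next one-hot input.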

### 2.2 - Build the model 

**References**:
- This exercise took inspiration from Andrej Karpathy's implementation: https://gist.github.com/karpathy/d4dee566867f8291f086. To learn more about text generation, also check out Karpathy's [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
- For the Shakespearian poem generator, our implementation was based on the implementation of an LSTM text generator by the Keras team: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py 

## Conclusion

You can see that your algorithm has started to generate plausible dinosaur names towards the end of the training. At first, it was generating random characters, but towards the end you could see dinosaur names with cool endings. Feel free to run the algorithm even longer and play with hyperparameters to see if you can get even better results. Our implementation generated some really cool names like `maconucon`, `marloralus` and `macingsersaurus`. Your model hopefully also learned that dinosaur names tend to end in `saurus`, `don`, `aura`, `tor`, etc.
