# Character level language model - Dinosaurus land

Welcome to Dinosaurus Island! 65 million years ago, dinosaurs existed, and in this assignment they are back. You are in charge of a special task. Leading biology researchers are creating new breeds of dinosaurs and bringing them to life on earth, and your job is to give names to these dinosaurs. If a dinosaur does not like its name, it might go berserk, so choose wisely!

<table>
<td>
<img src="images/dino.jpg" style="width:250;height:300px;">

</td>

</table>

Luckily, you have learned some deep learning and you will use it to save the day. Your assistant has collected a list of all the dinosaur names they could find and compiled them into this [dataset](dinos.txt). (Feel free to take a look by clicking the previous link.) To create new dinosaur names, you will build a character-level language model. Your algorithm will learn the different name patterns and randomly generate new names. Hopefully this algorithm will keep you and your team safe from the dinosaurs' wrath!

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

In [2]:
# In this cell you will write code to do the following:
# Each line in dinos.txt contains the name of one dinosaur, which will be used for training.
# step 1: read the entire contents of the file as a string
# step 2: convert the entire string to lowercase
with open("dinos.txt", "r") as f:
    queries = f.read().lower()

# step 3: extract the characters that make up the string into a set. This will be the vocabulary.
vocabulary = set(queries)

# step 4: print the size of the vocabulary. Note that '\n' is part of the vocabulary and it will be
#         used as the EOS character in our model.
print('size of vocabulary:', len(vocabulary))

size of vocabulary: 27


In [3]:
# In this cell you will write code to do the following:
# step 1: create a dictionary that maps characters to indices
# step 2: create another dictionary that maps indices to characters
char_to_indx = dict()
indx_to_char = dict()

for i, char in enumerate(vocabulary):
    char_to_indx[char] = i
    indx_to_char[i] = char

print("char:", char_to_indx)
print("indx:", indx_to_char)

char: {'d': 0, 'v': 1, 'i': 2, 'g': 3, 'c': 4, 'q': 5, 'z': 6, 'y': 7, 'p': 8, 'u': 9, 'm': 10, 't': 11, 's': 12, 'r': 13, 'e': 14, 'b': 15, '\n': 16, 'k': 17, 'w': 18, 'l': 19, 'h': 20, 'n': 21, 'a': 22, 'o': 23, 'x': 24, 'j': 25, 'f': 26}
indx: {0: 'd', 1: 'v', 2: 'i', 3: 'g', 4: 'c', 5: 'q', 6: 'z', 7: 'y', 8: 'p', 9: 'u', 10: 'm', 11: 't', 12: 's', 13: 'r', 14: 'e', 15: 'b', 16: '\n', 17: 'k', 18: 'w', 19: 'l', 20: 'h', 21: 'n', 22: 'a', 23: 'o', 24: 'x', 25: 'j', 26: 'f'}
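
The indices printed above depend on the iteration order of a Python set, which for strings can differ from one interpreter run to the next. As a minimal sketch of a reproducible alternative, assuming a small stand-in string `sample_text` (hypothetical, not the real `dinos.txt`), the vocabulary can be sorted before it is numbered:

```python
# Hypothetical stand-in for the lowercased contents of dinos.txt.
sample_text = "tawa\nshanag\n"

# Sorting fixes the order, so every run produces the same mapping.
sorted_vocab = sorted(set(sample_text))

char_to_indx_sorted = {ch: i for i, ch in enumerate(sorted_vocab)}
indx_to_char_sorted = {i: ch for ch, i in char_to_indx_sorted.items()}

print(char_to_indx_sorted)
```

With this ordering `'\n'` always receives index 0, because it precedes the lowercase letters in code-point order.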


In [4]:
# Let's initialize a few variables
n_hidden_nodes = 50
num_iterations = 35000
torch.backends.cudnn.enabled = False
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### 1.2 - Overview of the model

A pictorial representation of the model you will build is given below:

<img src="images/rnn1.png" style="width:450;height:300px;">
<caption><center> **Figure 1**: Recurrent Neural Network, similar to the one you built in the previous notebook "Building a RNN - Step by Step".  </center></caption>

At each time-step, the RNN tries to predict the next character given the previous characters. The dataset $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ is a list of characters in the training set, while $Y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$ is such that at every time-step $t$, we have $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$.
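
As a small illustration of the relation $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$, the sketch below builds the input and label index lists for one name and one-hot encodes the inputs. The toy vocabulary and the helper names `make_pair` and `one_hot` are assumptions made for this example only. The label for the last character is the end-of-sequence symbol `'\n'`.

```python
def make_pair(name, char_to_idx):
    """Return (inputs, labels) index lists, with labels shifted one step ahead."""
    inputs = [char_to_idx[ch] for ch in name]
    labels = inputs[1:] + [char_to_idx["\n"]]  # the last label is EOS
    return inputs, labels


def one_hot(index, size):
    """Return a list of length `size` with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Toy vocabulary (hypothetical): just the characters of "tawa" plus EOS.
toy_vocab = sorted(set("tawa\n"))
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}

x_idx, y_idx = make_pair("tawa", toy_char_to_idx)
x_one_hot = [one_hot(i, len(toy_vocab)) for i in x_idx]

print(x_idx)  # [2, 1, 3, 1]
print(y_idx)  # [1, 3, 1, 0]
```

Each label is the index of the character that follows the corresponding input, and the final `0` is the index of `'\n'` in this toy vocabulary.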

In [5]:
# In this cell you will write code to do the following:
# Create a class inherited from nn.Module that uses nn.RNN and nn.Linear along with F.log_softmax
# (instead of softmax) to return the log probabilities to the caller of the forward method of this class.
# Note that you have to compute log probabilities for every character in the sequence
# and return all of them.
# You also have to explicitly initialize the parameters of RNN and Linear modules in the init method of this
# class. We will use a 1-layer unidirectional RNN. Weight parameters for the RNN are weight_ih_l0 and weight_hh_l0.
# The documentation writes these as weight_ih_l[k] and weight_hh_l[k], where k is the layer index, so for
# layer 0 the attribute names are weight_ih_l0 and weight_hh_l0. Bias parameters are
# bias_ih_l0 and bias_hh_l0.



# RNN module class
class model(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(model, self).__init__()

        # single-layer, unidirectional RNN
        self.rnn = nn.RNN(input_size, hidden_size)

        # explicitly initialize the RNN parameters in place:
        # standard-normal weights and zero biases
        nn.init.normal_(self.rnn.weight_ih_l0)
        nn.init.normal_(self.rnn.weight_hh_l0)
        nn.init.constant_(self.rnn.bias_ih_l0, 0.0)
        nn.init.constant_(self.rnn.bias_hh_l0, 0.0)

        # maps each hidden state to one score per vocabulary character
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        # input: (seq_len, batch, input_size); hidden: (1, batch, hidden_size)
        out, hn = self.rnn(input, hidden)
        out = self.linear(out)
        # log probabilities over the vocabulary at every time-step
        out = F.log_softmax(out, dim=-1)

        # return the log probabilities and the final hidden state
        return out, hn
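
To see why the model returns log probabilities, note that the negative log-likelihood loss picks out the log probability of the correct character and negates it. The pure-Python sketch below mirrors that math rather than PyTorch's implementation, and the helper names `log_softmax` and `nll` are hypothetical. It also gives a handy sanity check: a model that spreads probability uniformly over the 27 characters has a per-character loss of $\ln 27 \approx 3.30$.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target index."""
    return -log_probs[target]


# Equal scores for all 27 characters give a uniform distribution.
uniform_logits = [0.0] * 27
uniform_loss = nll(log_softmax(uniform_logits), target=5)

print(round(uniform_loss, 4))  # 3.2958
```

The first loss printed in the training log further down, about 3.34, is close to this value, which is roughly what one would expect before the model has learned anything.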




In [6]:
# In this cell you will write code to do the following:
# step 1: instantiate the model and move it to the GPU if one is available
# input and output sizes both equal the vocabulary size
input_size = len(vocabulary)
output_size = len(vocabulary)

# define model
rnn = model(input_size, n_hidden_nodes,output_size).to(device)
rnn.zero_grad()
# step 2: set loss criterion to NLL loss
loss_function = nn.NLLLoss().to(device)

# step 3: set optimizer to SGD with a suitable learning rate

optimizer = optim.SGD(rnn.parameters(), lr=0.01)
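
The training cell further down clips the gradient norm to 5 before each optimizer step. As a rough sketch of what norm clipping followed by a plain SGD update does, the example below uses flat Python lists instead of tensors. The helper names `clip_by_global_norm` and `sgd_step` are hypothetical, and the code mirrors the idea rather than PyTorch's exact implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


def sgd_step(params, grads, lr):
    """Plain SGD update: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]


params = [1.0, -2.0]
grads = [30.0, 40.0]  # L2 norm is 50, well above the threshold

clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
params = sgd_step(params, clipped, lr=0.01)

print(norm_before, clipped, params)
```

Clipping keeps the direction of the gradient and only shrinks its length, so one unusually large gradient cannot move the parameters very far in a single step.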

In [7]:
# Single-step wrapper used for sampling. It reuses the parameters of the
# nn.RNN and nn.Linear layers in `rnn`, so no weights are copied.
class cell(nn.Module):
    def __init__(self, input_size, hidden_size, vocab_size, rnn):
        super(cell, self).__init__()

        self.rnn_cell = nn.RNNCell(input_size, hidden_size)
        self.rnn_model = rnn

        # share the RNN parameters with the single-step cell
        self.rnn_cell.weight_ih = self.rnn_model.rnn.weight_ih_l0
        self.rnn_cell.weight_hh = self.rnn_model.rnn.weight_hh_l0
        self.rnn_cell.bias_ih = self.rnn_model.rnn.bias_ih_l0
        self.rnn_cell.bias_hh = self.rnn_model.rnn.bias_hh_l0

    def forward(self, input, hidden):
        # one time-step: new hidden state, then probabilities over the vocabulary
        hn = self.rnn_cell(input, hidden)
        output = F.softmax(self.rnn_model.linear(hn), dim=-1)
        return output, hn


In [8]:
def get_name():
    # Sample one name character by character until the EOS character '\n' is drawn.
    # The single-step cell shares its parameters with `rnn`, so it is built once.
    rnn_cell = cell(input_size, n_hidden_nodes, len(vocabulary), rnn).to(device)
    input_to_rnn_cell = torch.zeros((1, input_size), device=device)
    h_out = torch.zeros((1, n_hidden_nodes), device=device)
    word = []
    with torch.no_grad():
        while True:
            output, h_out = rnn_cell(input_to_rnn_cell, h_out)
            probs = output.cpu().numpy().flatten()
            # draw the next character index from the predicted distribution
            a = np.random.choice(len(vocabulary), p=probs)
            word.append(a)
            if indx_to_char[a] == '\n':
                break
            # feed the sampled character back in as a one-hot vector
            input_to_rnn_cell = torch.zeros((1, input_size), device=device)
            input_to_rnn_cell[0, a] = 1

    # the trailing '\n' is kept, so a blank line follows each printed name
    name = " "
    for i in word:
        name = name + indx_to_char[i]

    print("name :", name)
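
The sampling procedure in `get_name` can be illustrated without a network. The sketch below assumes a hypothetical `next_char_probs` function standing in for the RNN cell, draws characters with `random.choices` until the end-of-sequence symbol appears, and adds a `max_len` cap (an addition that is not part of the notebook) so that an untrained model cannot keep generating indefinitely.

```python
import random

EOS = "\n"


def next_char_probs(prev_char):
    """Hypothetical stand-in for the RNN cell: a fixed toy distribution."""
    if prev_char == "a":
        return {"b": 0.5, EOS: 0.5}
    return {"a": 0.7, "b": 0.2, EOS: 0.1}


def sample_name(rng, max_len=20):
    """Draw characters one at a time until EOS or max_len is reached."""
    chars = []
    prev = None
    while len(chars) < max_len:
        probs = next_char_probs(prev)
        symbols = list(probs)
        weights = [probs[s] for s in symbols]
        prev = rng.choices(symbols, weights=weights, k=1)[0]
        if prev == EOS:
            break
        chars.append(prev)
    return "".join(chars)


rng = random.Random(0)  # seeded so repeated runs give the same samples
samples = [sample_name(rng) for _ in range(5)]
print(samples)
```

In the notebook the distribution comes from the softmax output of the cell, and the sampled character is fed back in as a one-hot vector instead of being passed by value.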


In [9]:
# Build list of all dinosaur names (training examples).
# Note that earlier we had just built the vocabulary.
# The examples variable below is a list of dinosaur names in lowercase with all leading and trailing
# whitespace removed.
# The examples are randomly shuffled.
with open("dinos.txt") as f:
    examples = f.readlines()
examples = [x.lower().strip() for x in examples]

# Shuffle list of all dinosaur names
np.random.shuffle(examples)

print(examples)

['appalachiosaurus', 'palaeolimnornis', 'issasaurus', 'luanchuanraptor', 'sellacoxa', 'sarcolestes', 'cardiodon', 'tototlmimus', 'arcusaurus', 'condorraptor', 'yezosaurus', 'protrachodon', 'hanssuesia', 'anchiornis', 'siamosaurus', 'rebbachisaurus', 'omosaurus', 'lamplughsaura', 'yuanmousaurus', 'struthiosaurus', 'eotriceratops', 'ngexisaurus', 'oryctodromeus', 'kulceratops', 'dinotyrannus', 'bambiraptor', 'magnosaurus', 'lamaceratops', 'dracopelta', 'riodevasaurus', 'morinosaurus', 'neuquensaurus', 'huayangosaurus', 'herrerasaurus', 'pachysuchus', 'chindesaurus', 'spondylosoma', 'lancanjiangosaurus', 'araucanoraptor', 'xiongguanlong', 'yunganglong', 'calamospondylus', 'phaedrolosaurus', 'latirhinus', 'australovenator', 'dashanpusaurus', 'creosaurus', 'dongyangopelta', 'tarascosaurus', 'miragaia', 'shanag', 'deinocheirus', 'walkersaurus', 'nothronychus', 'eucoelophysis', 'morrosaurus', 'tanycolagreus', 'quilmesaurus', 'tawa', 'morelladon', 'uberabatitan', 'tornieria', 'rhadinosaurus', 

In [None]:
# Fill or complete the code where required.
# We will do training in this cell.
# Batch size will be 1 (one).

# h_0 is the a_0 you saw in your lectures. For a 1-layer unidirectional nn.RNN its shape is
# (num_layers * num_directions, batch, hidden_size) = (1, 1, n_hidden_nodes).
# It is created once here, and the final hidden state of each name is carried over to the next.
hidden = torch.zeros(1, 1, n_hidden_nodes, device=device)

for j in range(num_iterations):      # number of training iterations

    index = j % len(examples)  # choose the index of a dinosaur name. Index is guaranteed to be in
                               # the range 0 to len(examples) - 1
    data = [char_to_indx[ch] for ch in examples[index]]  # indices of the characters that
                                                         # make up the chosen dinosaur name
    label = data[1:] + [char_to_indx["\n"]]  # the label list. The label at each time instant is the
                                             # character after the input character, so y(t) = x(t+1).
                                             # At the final time instant the label is the EOS
                                             # character, which is '\n' for us

    # Convert data to a tensor of one-hot representations, shape (seq_len, 1, vocabulary size),
    # and label to a LongTensor of indices as required for NLL loss, shape (seq_len, 1).
    seq_len = len(data)
    input_to_rnn = torch.zeros((seq_len, 1, len(vocabulary)))
    labelOut = torch.zeros(seq_len, 1, dtype=torch.long)

    for i in range(seq_len):
        input_to_rnn[i, 0, data[i]] = 1
        labelOut[i, 0] = label[i]

    input_to_rnn = input_to_rnn.to(device)
    labelOut = labelOut.to(device)

    # zero-out gradients
    rnn.zero_grad()

    # forward propagate the whole name; output holds the log probabilities for every time instant
    output, hidden = rnn(input_to_rnn, hidden)
    # detach so that gradients do not flow back into previous names
    hidden = hidden.detach()

    # compute the loss at every time instant and aggregate it
    loss = 0
    for i in range(seq_len):
        loss += loss_function(output[i], labelOut[i])

    loss.backward()
    # clip the gradient norm to limit exploding gradients
    nn.utils.clip_grad_norm_(rnn.parameters(), 5)
    optimizer.step()

    if j % 200 == 0:
        # Print the average per-character loss and three sampled names every 200 iterations.
        print("Index :", j, "   Loss :", loss.item() / seq_len)
        get_name()
        get_name()
        get_name()

Index : 0    Loss : 3.3444981575012207
name :  pe

name :  ymrpkddwkz

name :  ghquty

Index : 200    Loss : 3.1788618087768556
name :  whtooorxqxqnoh

name :  lomrzftqpcxbc

name :  bvjkhyjdjoenopt

Index : 400    Loss : 3.279190335954939
name :  sfyxbzxjrcapsioucvtgdaa

name :  udpovcyxdhwazitqdyoyiukanyjm

name :  dhwrsmfrwfkcmoyrxpoljhrgcninkcputqkshfjxhwomlqodszparnfmfmfd

Index : 600    Loss : 3.163517878605769
name :  owqxijboewrwusqp

name :  pfgdhhou

name :  xvnxhunovnuld

Index : 800    Loss : 3.35919984181722
name :  pyenwxv

name :  vd

name :  aooee

Index : 1000    Loss : 3.586217498779297
name :  guj

name :  jllaprxpovjeooxzoziunqkvbqnivbalcqxilvklhkmrauztz

name :  ohqaspxycrvwpnkyhwfxhegqfbcagjosemtryoyfacnpb

Index : 1200    Loss : 3.4637245178222655
name :  lxyukoofmmlywe

name :  kofjuowtvoytvbcadshhmkal

name :  bmrxfserqerautmjeqwv

Index : 1400    Loss : 3.4686329181377706
name :  xpundatdkvemqmil

name :  tnmhgcftj

name :  uedmtrenhsikar

Index : 1600    Loss