# Character Level Neural Language Model


Character-level language models take in a body of text and model the probability distribution of the next character given the sequence of previous characters. We can then sample from this distribution to generate text one character at a time. This differs from word n-gram models, which use a sequence of words to predict the next word. Ours is also a neural model: we use LSTM cells as the basis of the network, although other recurrent units (such as GRUs) would work as well.

To give an example of what we want the model to do, consider the vocabulary {d,o,c,t,r}. If we train our model on the word 'doctor', we want it to learn the following: when it sees a 'd', it should give a high probability to the letter 'o'. If it sees the string 'do', the next letter should most likely be 'c'. If it sees 'doc', the letter 't' should be most likely, and so on.
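The 'doctor' example can be written out as explicit (context, next character) pairs. This is only a sketch in plain Python; the helper `next_char_pairs` is a hypothetical name used for illustration and is not part of the notebook.

```python
# Hypothetical helper: list every (context, next character) pair in a word.
def next_char_pairs(word):
    return [(word[:i], word[i]) for i in range(1, len(word))]

pairs = next_char_pairs("doctor")
for context, target in pairs:
    # e.g. 'd' -> 'o', 'do' -> 'c', 'doc' -> 't'
    print(repr(context), "->", repr(target))
```

Each pair is one thing we would like the model to learn: given the context on the left, the character on the right should be the most probable one.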

Now that we know what we want, let's dive into the code!

In [1]:
#importing dependencies

from __future__ import unicode_literals, print_function, division
from io import open
import random
import numpy as np
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
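# Only make CUDA tensors the default when a GPU is actually available;
# setting this unconditionally fails on CPU-only machines.
if torch.cuda.is_available():
    torch.set_default_tensor_type('torch.cuda.FloatTensor')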





## Step 1 - Create encodings

After we import our training set, we need a vector representation for every character. In my code, I am using one-hot encoding. For this purpose, I assign a unique index to each character. This index is then used to create the one-hot vector.
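As a small illustration of the idea, here is one-hot encoding applied to the toy word 'doctor' instead of the real corpus. The names `vocab`, `toy_char_to_index` and `one_hot` are assumptions made for this sketch only.

```python
import numpy as np

text = "doctor"                      # toy stand-in for the training text
vocab = sorted(set(text))            # ['c', 'd', 'o', 'r', 't']
toy_char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    # Vector of zeros with a single 1 at the character's index.
    vec = np.zeros(len(vocab))
    vec[toy_char_to_index[ch]] = 1
    return vec

encoded = np.array([one_hot(ch) for ch in text])
print(encoded.shape)                 # (6, 5): six characters, five possible symbols
```

Every row contains exactly one 1, so the index of the character can always be recovered with `argmax`.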

My training data is some of the works of Shakespeare. It was obtained from https://github.com/karpathy/char-rnn/tree/master/data/tinyshakespeare.

In [2]:
#importing the train data and creating a unique index for each character

n = 50000
with open('input.txt', encoding = 'utf-8') as f:
    s = f.read()
s = s[:n]    #using only the first n = 50000 characters for this code (using the entire corpus is recommended)
chars = sorted(set(s))    #sorted so that the character-to-index mapping is the same on every run
n_chars = len(chars)
indices = range(n_chars)

index_to_char = dict(zip(indices,chars))
char_to_index = dict(zip(chars,indices))

In [3]:
s_encodings = []
for i in s:
    temp = np.zeros(n_chars)
    temp[char_to_index[i]] = 1
    s_encodings.append(np.array(temp))
    
s_encodings = np.array(s_encodings)

In [4]:
s_encodings = np.reshape(s_encodings,(200,n//200,n_chars)) #splitting the text into 200 rows of n/200 consecutive characters each, so they can be processed as a batch


## Step 2 - Creating the Model

My model is a simple 2-layer LSTM network whose hidden size equals the number of characters (the size of a one-hot encoding vector), so the output at each time step can be compared directly with the one-hot encoding of the next character.
I have also defined 'seq_length'. This variable sets how big our context is, i.e. how many characters are used to predict the next one (during training).
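To make the shapes concrete, here is a minimal sketch of a 2-layer `nn.LSTM` with `batch_first=True`. The sizes (`vocab_size = 5`, `context_len = 20`, `batch = 3`) are illustrative assumptions, not the values used in the notebook.

```python
import torch
import torch.nn as nn

vocab_size, context_len, batch = 5, 20, 3     # illustrative sizes only
lstm = nn.LSTM(vocab_size, vocab_size, num_layers=2, batch_first=True)

x = torch.zeros(batch, context_len, vocab_size)   # (batch, time, features)
output, (h_n, c_n) = lstm(x)

print(output.shape)   # torch.Size([3, 20, 5]): one vector per time step
print(h_n.shape)      # torch.Size([2, 3, 5]): final hidden state of each layer
```

Because the hidden size equals the vocabulary size, each output vector has one entry per character.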


This is the place where you can make the most changes. So go on and change any parameters (or the model itself) and create your own character-level language model.

In [5]:
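class Model(nn.Module):
    def __init__(self, input_size, hidden_size, latent_dim):
        #input_size is accepted but not used by the network
        #hidden_size is the size of each input vector (the one-hot size)
        #latent_dim is the size of the LSTM hidden state and output
        super(Model, self).__init__()
        self.hidden_size = hidden_size
        self.latent_dim = latent_dim
        self.LSTMi = nn.LSTM(hidden_size, latent_dim, 2, batch_first = True)

    def forward(self, input):
        #input has shape (batch, sequence length, hidden_size)
        output, hidden = self.LSTMi(input)
        return output

    def initHidden(self, batch_size=1):
        #zero (h0, c0) pair for the 2-layer LSTM
        shape = (2, batch_size, self.latent_dim)
        return (torch.zeros(shape, device=device), torch.zeros(shape, device=device))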

In [9]:
seq_length = 20
latent_dim = n_chars
n_epochs =1000
model = Model(seq_length,n_chars,latent_dim).to(device)


## Step 3 - Training the Model

During training, the input to the network is a character sequence of length seq_length. Since we want to predict the next character, our target must be the sequence of characters that immediately follow the characters in the input sequence. Therefore, if c1,c2...cn (n = seq_length) is my input sequence, then my target sequence is c2,c3...c(n+1).
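The relationship between input and target can be sketched on a toy string. Here `window` plays the role of seq_length; the variable names are assumptions for this illustration.

```python
text = "doctor"
window = 3            # plays the role of seq_length

examples = []
for start in range(len(text) - window):
    inp = text[start:start + window]            # c1 ... cn
    tgt = text[start + 1:start + window + 1]    # c2 ... c(n+1)
    examples.append((inp, tgt))

for inp, tgt in examples:
    print(inp, "->", tgt)   # doc -> oct, oct -> cto, cto -> tor
```

The target is always the input shifted by exactly one character, so every position in the input has its own "next character" to predict.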
 
The loss function that I have used is Mean Squared Error and the optimiser used is Adam.
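As a rough illustration of how Mean Squared Error behaves against a one-hot target, the sketch below compares two made-up output vectors with the same target. The numbers are invented for the example and do not come from the trained model.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

target = torch.tensor([[1.0, 0.0, 0.0]])   # one-hot vector of the true next character
good = torch.tensor([[0.9, 0.1, 0.0]])     # most of the mass on the right character
bad = torch.tensor([[0.1, 0.1, 0.8]])      # most of the mass on a wrong character

good_loss = mse(good, target).item()
bad_loss = mse(bad, target).item()
print(good_loss < bad_loss)                # True
```

The closer the output vector is to the one-hot target, the smaller the loss, which is what the optimiser pushes towards.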

Another thing to note - training for a higher number of epochs has been shown to give better results.

In [10]:
loss = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.005)

In [11]:
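for epoch in range(n_epochs):
    if (epoch % 100 == 0):
        print (epoch)

    #s_encodings has shape (200, n/200, n_chars): 200 rows of consecutive characters
    #slide a window of seq_length characters along every row at once
    for i in range(0, s_encodings.shape[1]-seq_length, seq_length):
        inputs = torch.tensor(s_encodings[:, i:i+seq_length], dtype=torch.float32, device=device)
        #targets are the inputs shifted by one character
        targets = torch.tensor(s_encodings[:, i+1:i+seq_length+1], dtype=torch.float32, device=device)
        outputs = model(inputs)
        optimizer.zero_grad()
        loss1 = loss(outputs,targets)
        loss1.backward()
        optimizer.step()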

0
100
200
300
400
500
600
700
800
900


In [12]:
out_path = "CLLNM.pth"
torch.save(model.state_dict(),out_path)   #saving the model weights

## Step 4 - Generating text using the model

Now that we have our trained model, let's use it to generate our own text.

Generating text is an iterative process. First, we randomly select a starting letter. Each character is given an equal probability of being selected as the starting character. The one-hot encoding of the selected character is then fed into the trained model. We then repeat the following steps:

1 - The output vector is converted into a probability distribution (by using softmax).

2 - One character is then sampled (multinomial sampling) on the basis of the probability distribution. The reason I don't always take the most probable character is so that we don't run into a loop of characters. Sampling helps keep the text fresh.

3 - The one-hot encoding of the selected character is fed into the model.

4 - Go back to step 1.

This is repeated a fixed number of times (or you could add and use a stop character in your training data).

One thing to note - before I take the softmax of the output vector, I divide all the elements of the output vector by a number. If this number is small, then the relative probability of the most probable character increases, i.e. the gap between the most probable character and the remaining characters widens. The effect? During sampling, I am more likely to keep choosing only one character.
On the flip side, if the number is large, then all the probabilities come closer together. While this can add more variety to the text, it can also lead to more mistakes.
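The effect of the divisor can be seen on a small made-up score vector. This is a NumPy sketch; `softmax_with_temperature` is a hypothetical helper name, and the scores are invented for the example.

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    # Divide the scores by the given number, then apply softmax.
    scaled = np.asarray(scores, dtype=float) / temperature
    scaled -= scaled.max()        # numerical stability; does not change the result
    exp = np.exp(scaled)
    return exp / exp.sum()

scores = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(scores, 0.5)   # small divisor: peaked distribution
flat = softmax_with_temperature(scores, 5.0)    # large divisor: nearly uniform

print(sharp.round(3), flat.round(3))

# Sampling one index from the sharper distribution.
rng = np.random.default_rng(0)
sampled_index = rng.choice(len(scores), p=sharp)
```

With the small divisor, the first entry takes most of the probability mass, so sampling picks it far more often; with the large divisor, the three entries are close to each other.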

In [13]:
model1 = Model(seq_length,n_chars,latent_dim).to(device)
model1.load_state_dict(torch.load('CLLNM.pth', map_location=device))   #map_location lets the weights load on CPU-only machines too

In [31]:
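txt = ''
a = random.randint(0,n_chars-1)
hidden = None    #LSTM state, carried from one character to the next
with torch.no_grad():
    for i in range(600):
        txt += index_to_char[a]
        temp = np.zeros(n_chars)
        temp[a] = 1
        x = torch.tensor(temp, dtype=torch.float32, device=device).unsqueeze(0).unsqueeze(0)
        #call the LSTM directly so that the state from earlier characters is kept
        pred, hidden = model1.LSTMi(x, hidden)

        pred = pred.div(0.5)    #the divisor discussed above
        probs = torch.softmax(pred.squeeze(), dim=0)
        a = int(torch.multinomial(probs, 1).item())

print(txt)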





## And that's it!

Now that you have learned how to make a character level neural language model, you can create your own! 
You can mess around with the given model, or create an entirely new one on your own!

There are many uses for these models. The generated text will resemble whatever data you train on. A few examples are:
1. List of names - you can use this to generate names of your own
2. Wikipedia articles
3. Source code and scripts (C, Python, etc.)
4. News articles

If you are interested in more, I suggest reading Andrej Karpathy's blog post - http://karpathy.github.io/2015/05/21/rnn-effectiveness/ and looking at his GitHub repository - https://github.com/karpathy/char-rnn.