<h1> INTRODUCTION</h1>
This code runs a Character-Wise-LSTM on Yoruba text. The algorithm will train character by character on the yoruba text and also generate text character by character. I specifically chose Yoruba text because it has special characters in it.
<h6>The model</h6>
The model is based on Andrej Karpathy's <a href= "http://karpathy.github.io/2015/05/21/rnn-effectiveness/">post on RNNs</a> and <a href= "https://github.com/karpathy/char-rnn">implementation in Torch</a>. Below is the general architecture of the character-wise RNN.

<img src="https://github.com/Dhareey/Char-Wise-LSTM/blob/master/images/charseq.jpeg?raw=1" width="500">
<h6>The Data</h6>
The training data is an excerpt from yoruba bible. I have written a scrapper to scrap any chapter online in Yoruba. 

<h6> HOW TO GO ABOUT IT</h6>
<ol>
    <li>Get the data</li>
    <li>The algorithm cant understand strings, so we have to convert it to numbers</li>
    <li>Preprocess the data</li>
    Since its a character-wise training, each character will be fed into the algorithm, one by one. To pass each character into the algorithm, we have to do one <a href=''>hot encoding</a>. Meaning, we will pass array of 1s and 0s, where the column of the character being passed is 1 and the other column is 0. This will be done for each letter in the data. Meaning, the total number of columns will be the total number of unique character in the text. For example. The letter a will be passed as
    <p> [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] and the letter be will be passed as </p>
    <p> [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]</p>
    Where each column represents a unique letter in the alphabet
    <li>Since the data is hundreds of thousand character long, we will have to train the algorithm in batches </li>
    <li> Define the LSTM architecture</li>
    <p> The architecture is pretty basic, one character enters the LSTM at a time. To understand how LSTM works, please chck out the 
</ol>

In [0]:
from scrapper import getchap
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

In [0]:
# Step1: Getting the data
book, validationbook = ['gal','heb','rom','rev','mat','mrk','luk','jhn'], ['heb', 'rom']
chapter, validchapter = 6, 9
data, validData= [], []
for eachbook in book:
    for i in range(chapter):
        data.append(getchap(eachbook, str(i+1)))
        
for eachbook in validationbook:
    for i in range(6,validchapter):
        validData.append(getchap(eachbook, str(i+1)))

In [0]:
datas, valdatas = '\n'.join(data), '\n'.join(validData)

In [5]:
#Visualise the data
print(datas[:200])

1 Paulu Aposteli 
1 Paulu, aposteli tí a rán kì í ṣe láti ọ̀dọ̀ ènìyàn wá, tàbí nípa ènìyàn, ṣùgbọ́n nípa Jesu Kristi àti Ọlọ́run Baba, ẹni tí ó jí i dìde kúrò nínú òkú. 
2 Àti gbogbo àwọn arákùnrin t


In [0]:
# Step2: Tokenise the data
uniquedata = tuple(set(datas))
datacode = {alphabet:code for code,alphabet in dict(enumerate(uniquedata)).items()}
encodedData = np.array([datacode[word] for word in datas])
encodedvalData = np.array([datacode[word] for word in valdatas])

In [7]:
#Visualise tokenised data
encodedData[:100]

array([54, 79, 71, 35, 78, 68, 78, 79, 37,  4, 49, 92, 46, 56, 68, 39, 79,
       60, 54, 79, 71, 35, 78, 68, 78, 19, 79, 35,  4, 49, 92, 46, 56, 68,
       39, 79, 46, 29, 79, 35, 79, 72, 82, 22, 79, 94, 33, 79, 29, 79, 57,
       56, 79, 68, 82, 46, 39, 79, 80, 31, 38, 80, 31, 79, 44, 22, 33, 28,
        2, 22, 79, 20, 82, 19, 79, 46,  2, 55, 29, 79, 22, 29,  4, 35, 79,
       44, 22, 33, 28,  2, 22, 19, 79, 57, 93, 14, 55, 80, 81, 22])

In [8]:
encodedvalData[:100]

array([36, 79, 42, 56, 68, 94, 39, 92, 56, 38, 56, 94, 39, 79, 66,  8, 81,
       79,  2, 68, 93, 34, 82,  2, 79, 60, 54, 79, 48, 29, 46, 49, 72, 29,
       79, 42, 56, 68, 94, 39, 92, 56, 38, 56, 94, 39, 79, 28, 33, 29, 19,
       79, 80, 55, 35, 79, 88, 35, 68,  8, 11, 78, 19, 79,  2, 68, 93, 34,
       82,  2, 79,  0, 68, 80, 81, 72, 78, 22, 79,  0, 31, 14, 82, 73, 32,
       14, 49, 19, 79,  8, 22, 39, 79, 46, 29, 79, 52, 79,  4,  2])

In [0]:
encoded_ydata = list(encodedData[1:])
encoded_ydata.append(encodedData[0])
validdatalist = list(encodedvalData[1:])
validdatalist.append(encodedvalData[0])

In [0]:
encodedydata= np.array(encoded_ydata)
encodedvaliddata = np.array(validdatalist)

In [0]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(encodedData), torch.from_numpy(encodedydata))
valid_data = TensorDataset(torch.from_numpy(encodedvalData), torch.from_numpy(encodedvaliddata))

# dataloaders
batch_size= 20
seq_length= 100
batch= batch_size * seq_length

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=False, batch_size=batch, drop_last = True)
valid_loader = DataLoader(valid_data, shuffle=False, batch_size=batch, drop_last= True)

In [0]:
dataiter= iter(train_loader)
sample_x, sample_y = dataiter.next()

validiter = iter(valid_loader)
valid_x, valid_y = dataiter.next()

In [0]:
# Step 3. Onehot encoding
def oneHotEncode(arr, cols_num):
    """
    This function takes in an pytorch dataloader object and returns a one-hot encoded array with dimensions of array x n_labels.
    E.G if it takes an array of [3,2,1] and n_labels of 8, it returns a 3x3(array_size by cols_num) hot encoded array like so
    [[0 0 1 0 0 0 0 0]
    [0 1 0 0 0 0 0 0]
    [1 0 0 0 0 0 0 0]]
    """
    array = np.array(arr)
    # First, create an array.size by cols_num array of zeros in float
    one_hot = np.zeros((array.size,cols_num), dtype= np.float32)
    
    # Fill a "1" to each row based on the value in array
    one_hot[np.arange(one_hot.shape[0]), array.flatten()] = 1.
    
    # Return back to the original shape
    one_hot = one_hot.reshape((*array.shape, cols_num))
    return torch.from_numpy(one_hot)

In [13]:
train_on_gpu = torch.cuda.is_available()
if (train_on_gpu):
  print('Yaay, CUDA is available, now you can train')
else:
  print('Dude, dont try it, your pc is gonna crash')

Yaay, CUDA is available, now you can train


In [0]:
# Step 4 Model Architecture
class yorLSTM(nn.Module):
    def __init__(self, tokens, n_hidden, n_layers=2,drop_prob=0.5,lr= 0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden= n_hidden
        self.lr = lr
            
        #Here, we are creating character dict for the project
        self.chars = tokens
        self.int2chars = dict(enumerate(self.chars))
        self.char2int = {num:alphabet for alphabet,num in self.int2chars.items()}
            
        # Define model layers
        self.lstm = nn.LSTM(len(self.chars), self.n_hidden, self.n_layers,dropout=drop_prob,batch_first=True)
            
        # Dropout in between layers
        self.dropout =nn.Dropout(drop_prob)
            
        # Connect to a fully connected layer
        self.fc = nn.Linear(n_hidden, len(self.chars))
        
    def forward(self, x, hidden):
        out, hidden = self.lstm(x, hidden)
        
        # Dropout to avoid overfitting
        out= self.dropout(out)
        
        # Reshape for fully connected layer
        out = out.contiguous().view(-1, self.n_hidden)
        
        # Pass through a fully connected layer
        out = self.fc(out)
        
        #return out, hidden
        return out, hidden
        
    def init_hidden(self, batch_size):
        # Initialize the weight and hidden value
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers,batch_size,self.n_hidden).zero_().cuda(),
                      weight.new(self.n_layers,batch_size,self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers,batch_size,self.n_hidden).zero_())
        
        return hidden
        
            

In [0]:
# Step5 Train the Model
def train(network,training_data, validation_data, epochs, batch_size,lr, seq_length, clip=5, vis=10):
    # Set the RNN network to train
    network.train()
    
    #Set the optimiser and calculate the loss
    optimiser = torch.optim.Adam(network.parameters(), lr= lr)
    criterion = nn.CrossEntropyLoss()
    
    # Run on CUDA if available
    if (train_on_gpu):
        network.cuda()
        
    counter = 0
    # set total number of characters
    n_chars = len(network.chars)
    
    # Train in range of epochs
    for i in range(epochs):
        # initialise the hidden state
        h = network.init_hidden(batch_size)
        for x, y in training_data:
            counter+=1
            
            # One-Hot-Encode the training data
            x = oneHotEncode(x.reshape(batch_size,seq_length), n_chars)
            
            # If on Cuda, cnvert the x and y to cuda
            if (train_on_gpu):
                x, y = x.cuda(), y.cuda()
            
            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])
            
            # Set accumulated gradient to zero
            network.zero_grad()
            
            output, h = network(x, h)
            
            # Calculate the loss and back propagate
            loss = criterion(output, y.view(batch_size*seq_length).long())
            loss.backward()
            # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(network.parameters(), clip)
            optimiser.step()
            
            # Calculate validation loss at every 10 iteration
            if counter% vis == 0:
                # Initialise the hidden state
                val_h = network.init_hidden(batch_size)
                validation_losses = []
                
                #set network to evalution
                network.eval()
                
                for x,y in validation_data:
                    x = oneHotEncode(x.reshape(batch_size,seq_length), n_chars)
                    
                    if (train_on_gpu):
                        x,y = x.cuda(), y.cuda()
                    val_h = tuple([each for each in val_h])
                    
                    output, val_h = network(x, val_h)
                    
                    #Calculate the loss
                    loss = criterion(output, y.view(batch_size* seq_length).long())
                    
                    # Append loss to validation losses
                    validation_losses.append(loss.item())
                    
                    # Set network back to training
                    network.train()
                    
                    print("Epoch: {}/{}...".format(i+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(validation_losses)))

In [16]:

# Instantiate the LSTM Model
n_hidden = 500

network = yorLSTM(uniquedata,n_hidden)
print(network)

yorLSTM(
  (lstm): LSTM(97, 500, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=500, out_features=97, bias=True)
)


In [17]:
# Initiate the training
epochs = 40
lr = 0.001
train(network,train_loader,valid_loader,epochs,batch_size,lr,seq_length)

Epoch: 1/40... Step: 10... Loss: 3.5576... Val Loss: 3.5576
Epoch: 1/40... Step: 10... Loss: 3.5659... Val Loss: 3.5618
Epoch: 1/40... Step: 10... Loss: 3.5607... Val Loss: 3.5614
Epoch: 1/40... Step: 10... Loss: 3.5602... Val Loss: 3.5611
Epoch: 1/40... Step: 10... Loss: 3.4880... Val Loss: 3.5465
Epoch: 1/40... Step: 10... Loss: 3.5762... Val Loss: 3.5514
Epoch: 1/40... Step: 10... Loss: 3.5089... Val Loss: 3.5454
Epoch: 1/40... Step: 10... Loss: 3.5487... Val Loss: 3.5458
Epoch: 1/40... Step: 10... Loss: 3.5582... Val Loss: 3.5472
Epoch: 1/40... Step: 10... Loss: 3.6091... Val Loss: 3.5534
Epoch: 1/40... Step: 20... Loss: 3.4866... Val Loss: 3.4866
Epoch: 1/40... Step: 20... Loss: 3.4578... Val Loss: 3.4722
Epoch: 1/40... Step: 20... Loss: 3.4457... Val Loss: 3.4634
Epoch: 1/40... Step: 20... Loss: 3.4926... Val Loss: 3.4707
Epoch: 1/40... Step: 20... Loss: 3.3997... Val Loss: 3.4565
Epoch: 1/40... Step: 20... Loss: 3.4667... Val Loss: 3.4582
Epoch: 1/40... Step: 20... Loss: 3.4372.

In [0]:
# Save the model with the name_numberOfEpoch.net
modelname = 'lstm_20_epoch.net'

checkpoint = {'n_hidden': network.n_hidden,
              'n_layers': network.n_layers,
              'state_dict': network.state_dict(),
              'tokens': network.chars   
}

with open(modelname, 'wb') as f:
  torch.save(checkpoint, f)


In [0]:
def predict(network, char, h= None, top_k= None):
  # tensor inputs
        x = np.array([[network.char2int[char]]])
        inputs = torch.from_numpy(x)

        inputs = oneHotEncode(inputs, len(network.chars))
        
        if(train_on_gpu):
            inputs = inputs.cuda()
        
        # detach hidden state from history
        h = tuple([each.data for each in h])
        # get the output of the model
        out, h = network(inputs, h)

        # get the character probabilities
        p = F.softmax(out, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
        
        # get top characters
        if top_k is None:
            top_ch = np.arange(len(network.chars))
        else:
            p, top_ch = p.topk(top_k)
            top_ch = top_ch.numpy().squeeze()
        
        # select the likely next character with some element of randomness
        p = p.numpy().squeeze()
        char = np.random.choice(top_ch, p=p/p.sum())
        
        # return the encoded value of the predicted char and the hidden state
        return network.int2chars[char], h

In [0]:
def sample(network, size, prime, top_k=None):
        
    if(train_on_gpu):
        network.cuda()
    else:
        network.cpu()
    
    network.eval() # eval mode
    
    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = network.init_hidden(1)
    for ch in prime:
        char, h = predict(network, ch, h, top_k=top_k)

    chars.append(char)
    
    # Now pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(network, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

In [25]:
print(sample(network, 30000, prime='Jesu', top_k=5))

Jesu. 
46 Ṣùgbọ́n kí ẹ̀yin náà sì ń wá ọ̀nà láti máa rìn, wọ́n ń wá ọ̀nà sárà sí àwọn ènìyàn pé àwa tí ń bẹ nínú rẹ̀. 
18 Nítorí pé kí a má ba à ṣáájú yín ju ni ọmọ Josẹfu,tí í ṣe ọmọ Enamu,tí í ṣe ọmọ Aamini,tí í ṣe ọmọ Aali, tí í ṣe ọmọ Maattati,tí í ṣe ọmọ Judeatili, tí í ṣe ọmọ Neti,tí í ṣe ọmọ Elmeki,tí í ṣe ọmọ Nealiti,tí í ṣe ọmọ Elmu,tí í ṣe ọmọ Matttiti,tí í ṣe ọmọ Aadi, tí í ṣe ọmọ Joa,ti,tí í ṣe omi wá ni ibi sí ìdòde; 
50 (Ó sì dáhùn, ó sì fi ìrò mọ́ àwọn tí ó ń gbàdúrà, láti máa sìn láti ṣe fún un. 
27 Nítorí pé, èmí ń fi ọ̀pá rìrà; ṣùgbọ́n ẹni tí ó kọ́ ohun tí Ọmọ ènìyàn sì ń wàásù ìgbàgbọ́ tí ó ń gbọ́ ni ọ̀rọ̀ tí ó wà fún ọ pé, “Ẹ wá kò wí pé, “Èmi kò lè fi ìwọ̀n mọ́.” 
16 Àkókò ní ìjọba Ọ̀dọ́-àgùntàn ọ̀rọ̀ wọn; ó sì pè ní sábẹ́ òfin tí ó wí fún wọn pé, “Ẹ má ṣe jẹ́ kí a máa jẹ́ ti ara rẹ jẹ ní ọjọ́ mẹ́ta: nítorí tí a ń fi ọkàn, tí ìwọ kò sì tọ̀ ọ́ wí nípa iṣẹ́. 
41 Nígbà tí ó sì dá wọn lóhùn pé, “Èmi yóò fi ṣe náà ní ẹ̀kọ́ òfin, bẹ́ẹ̀ ni ó wà níbẹ̀, ó sì ń wá ọ̀n lọ́gọn