In this project, the book "The Mysterious Island" by Jules Verne is used to build a character-level text generation model with a recurrent neural network (RNN).

## Preprocessing the dataset

First we download the book from the Project Gutenberg website as a plain-text file. Then we create a variable that holds all the unique characters in the book.

In [18]:
import numpy as np

with open("pg1268.txt", "r", encoding='utf-8') as file:
    text = file.read()
start_indx = text.find('THE MYSTERIOUS ISLAND')
end_indx = text.find('End of the Project Gutenberg')
text = text[start_indx:end_indx]
char_set = set(text)
print('total_length: ' , len(text))
print('Unique characters: ', len(char_set))


total_length:  1130779
Unique characters:  86


Now we turn the string data into numeric values. For this, we build a dictionary that maps each character to an integer, along with a NumPy array for the reverse lookup (integer back to character).

In [19]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)
text_encoded = np.array([char2int[ch] for ch in text],
                        dtype=np.int32)
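
To check that the mapping loses nothing, we can encode a short string and decode it again. This is a minimal sketch on a toy string; `sample`, `encoded` and `decoded` are names made up for the illustration and are not part of the notebook.

```python
import numpy as np

sample = "the island"  # toy stand-in for the book text (assumption)

# Same construction as above, applied to the toy string
chars_sorted = sorted(set(sample))
char2int = {ch: i for i, ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)

# Encode: character -> integer
encoded = np.array([char2int[ch] for ch in sample], dtype=np.int32)

# Decode: integer -> character, using the array as a lookup table
decoded = "".join(char_array[encoded])

print(encoded)
print(decoded)
```

Indexing `char_array` with an integer array is what makes the reverse lookup a one-liner, which is why both the dictionary and the array are kept.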

What we want to do here matters: this is a classification task in which the model predicts which character comes next. For example, given "Deep Learnin", the model should predict that the next letter is 'g'. The output size equals the size of `char_set`, which in this case is 86 characters. In other words, text is generated one character at a time through multiclass classification.
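
The classification view can be sketched with a toy example. The four-character vocabulary and the scores below are made up for the illustration (the real model scores all 86 characters), so treat this as a sketch of the idea rather than the notebook's code.

```python
import math

vocab = ["e", "g", "k", "s"]      # toy vocabulary (assumption)
logits = [0.5, 3.0, -1.0, 0.2]    # made-up scores for the context "Deep Learnin"

# Softmax turns the raw scores into one probability per class (character)
shift = max(logits)               # subtract the max for numerical stability
exps = [math.exp(z - shift) for z in logits]
probs = [e / sum(exps) for e in exps]

# The predicted next character is the class with the highest probability
predicted = vocab[probs.index(max(probs))]

# Cross-entropy for the true next character 'g' is -log of its probability
loss = -math.log(probs[vocab.index("g")])

print(predicted, round(loss, 4))
```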

## Preparing text sequences for training

In [44]:
device = 'cuda'

In [20]:
# Split the encoded text into overlapping 41-character chunks (40 inputs + 1 shifted target)
seq_length = 40
chunk_size = seq_length + 1
text_chunks = [text_encoded[i:i+chunk_size]
               for i in range(len(text_encoded)-chunk_size)]
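
The sliding window is easier to see on a short string. The sketch below uses a toy text and a window of 4 instead of 40; the `+ 1` in the range is a choice made here so that the final full window is kept.

```python
text = "hello world"   # toy text (assumption)
seq_length = 4
chunk_size = seq_length + 1

# One chunk per start position; "+ 1" keeps the final full window
chunks = [text[i:i + chunk_size] for i in range(len(text) - chunk_size + 1)]

# Input is the chunk without its last character,
# target is the same chunk shifted one position to the right
pairs = [(c[:-1], c[1:]) for c in chunks]

for x, y in pairs[:3]:
    print(repr(x), "->", repr(y))
```

Each target character is the character that follows the corresponding input character, so every chunk yields `seq_length` training examples.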

In [21]:
from torch.utils.data import Dataset
import torch


We create a PyTorch `Dataset` so that we can wrap it in a `DataLoader` for the model.

In [22]:
class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks
    
    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, index):
        text_chunk = self.text_chunks[index]
        return text_chunk[:-1].long(), text_chunk[1:].long()
    

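# Stack the chunks into one array first; building a tensor from a list of arrays is very slow
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))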


Let's look at some examples from this transformed dataset:

In [None]:
for i, (seq, target) in enumerate(seq_dataset):
    print('Input (x):', repr(''.join(char_array[seq])))
    print('Target (y):', repr(''.join(char_array[target])))

    if i == 7:
        break

Input (x): 'THE MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOU'
Target (y): 'HE MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS'
Input (x): 'HE MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS'
Target (y): 'E MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS '
Input (x): 'E MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS '
Target (y): ' MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS I'
Input (x): ' MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS I'
Target (y): 'MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS IS'
Input (x): 'MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS IS'
Target (y): 'YSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS ISL'
Input (x): 'YSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS ISL'
Target (y): 'STERIOUS ISLAND ***\n\nTHE MYSTERIOUS ISLA'
Input (x): 'STERIOUS ISLAND ***\n\nTHE MYSTERIOUS ISLA'
Target (y): 'TERIOUS ISLAND ***\n\nTHE MYSTERIOUS ISLAN'
Input (x): 'TERIOUS ISLAND ***\n\nTHE MYSTERIOUS ISLAN'
Target (y): 'ERIOUS ISLAND ***\n\nTHE MYSTERIOUS ISLAND'


In [46]:
from torch.utils.data import DataLoader
BATCH_SIZE = 64
seq_dl = DataLoader(seq_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
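
A quick way to sanity-check the loader is to pull one batch and look at its shape. The sketch below is self-contained: random integers stand in for the encoded book and `TensorDataset` stands in for `TextDataset`, both assumptions made only so the snippet runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

seq_length = 40

# Random integers stand in for the encoded text chunks (assumption)
chunks = torch.randint(0, 86, (1000, seq_length + 1))

# Inputs are the first 40 positions, targets the last 40 (shifted by one)
toy_dataset = TensorDataset(chunks[:, :-1], chunks[:, 1:])
toy_dl = DataLoader(toy_dataset, batch_size=64, shuffle=True, drop_last=True)

seq_batch, target_batch = next(iter(toy_dl))
print(seq_batch.shape, target_batch.shape)
print(len(toy_dl))  # number of full batches per pass over the data
```

With `drop_last=True` the incomplete final batch is discarded, so every batch has exactly `batch_size` rows. That matters later because the hidden state is created for a fixed batch size.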

## Building a character-level RNN model

In [55]:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        
        out = self.embedding(x)  # (batch, seq_len, embed_dim)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))  # (batch, seq_len, hidden_size)
        out = self.fc(out)  # (batch, seq_len, vocab_size)
        
        out = out.reshape(-1, out.size(2))  # (batch*seq_len, vocab_size)
        return out, hidden, cell
    
    def init_hidden(self, batch_size, device="cpu"):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size, device=device)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size, device=device)
        return hidden, cell


In [56]:
vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512
model = RNN(vocab_size=vocab_size, embed_dim=embed_dim, rnn_hidden_size=rnn_hidden_size).to(device)
model

RNN(
  (embedding): Embedding(86, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=86, bias=True)
)
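
To get a feel for the size of this model, we can count its parameters by hand from the layer shapes printed above. The sketch assumes the standard single-layer `nn.LSTM` layout, which keeps four gates, each with input weights, recurrent weights and two bias vectors.

```python
vocab_size, embed_dim, hidden = 86, 256, 512

# Embedding: one embed_dim vector per character
embedding = vocab_size * embed_dim

# LSTM: 4 gates x (input weights + recurrent weights) + 2 bias vectors of size 4 * hidden
lstm = 4 * hidden * (embed_dim + hidden) + 2 * 4 * hidden

# Linear: weight matrix plus bias
fc = hidden * vocab_size + vocab_size

total = embedding + lstm + fc
print(embedding, lstm, fc, total)
```

Almost all of the parameters sit in the LSTM layer; the embedding and the output layer are small because the vocabulary has only 86 characters.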

### Loss function with optimizer and a scheduler

In [59]:
import torch.optim as optim
num_epochs = 10000

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Optional one-cycle learning-rate schedule (currently disabled)
# scheduler = optim.lr_scheduler.OneCycleLR(
#     optimizer,
#     max_lr=1e-3,
#     steps_per_epoch=len(seq_dl),
#     epochs=num_epochs
# )
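
If the scheduler were enabled, it would be stepped once after every optimizer step. The sketch below shows that pattern on a stand-in model; `toy_model` and `total_steps=100` are assumptions for the illustration. It passes `total_steps` directly instead of `epochs` and `steps_per_epoch`, which may be the more natural fit when each loop iteration processes a single batch.

```python
import torch
import torch.nn as nn
import torch.optim as optim

toy_model = nn.Linear(4, 2)    # stand-in model (assumption)
optimizer = optim.Adam(toy_model.parameters(), lr=1e-4)
total_steps = 100              # one scheduler step per optimizer step

scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=total_steps
)

lrs = []
for _ in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used for this step
    loss = toy_model(torch.randn(8, 4)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()           # call after optimizer.step()

print(f"start {lrs[0]:.2e}, peak {max(lrs):.2e}, end {lrs[-1]:.2e}")
```

The learning rate warms up to `max_lr` and then anneals to a value far below the starting rate, so the `lr` passed to `Adam` is overridden as soon as the scheduler is created.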

In [60]:
from tqdm import tqdm

# Note: each "epoch" here is one training step on a single randomly drawn batch
for epoch in tqdm(range(num_epochs)):
    seq_batch, target_batch = next(iter(seq_dl))
    
    # move data to GPU
    seq_batch, target_batch = seq_batch.to(device), target_batch.to(device)

    # initialize hidden and cell on GPU
    hidden, cell = model.init_hidden(batch_size=BATCH_SIZE, device=device)

    optimizer.zero_grad()
    loss = 0
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c].unsqueeze(1), hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])
    loss.backward()
    optimizer.step()
    #scheduler.step()

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, loss: {loss.item()/seq_length:.4f}")


Epoch 0, loss: 1.3298
Epoch 100, loss: 1.3092
Epoch 200, loss: 1.3550
Epoch 300, loss: 1.3463
Epoch 400, loss: 1.3144
Epoch 500, loss: 1.3549
Epoch 600, loss: 1.3399
Epoch 700, loss: 1.3635
Epoch 800, loss: 1.4204
Epoch 900, loss: 1.3017
Epoch 1000, loss: 1.3579

 11%|█         | 1055/10000 [03:06<26:22,  5.65it/s]

KeyboardInterrupt: 
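
To put the logged loss values in context, we can compare them with the loss of a model that guesses uniformly over the 86 characters, and convert the loss into a perplexity. The value `1.33` below is a rounded figure taken from the log above, and the calculation assumes the natural logarithm, which is what `nn.CrossEntropyLoss` uses.

```python
import math

vocab_size = 86
logged_loss = 1.33   # roughly the per-character loss printed above

# Cross-entropy of a uniform guess over all characters
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of equally likely next characters
perplexity = math.exp(logged_loss)

print(f"uniform baseline: {uniform_loss:.3f}")
print(f"perplexity at loss {logged_loss}: {perplexity:.2f}")
```

A loss around 1.3 is well below the uniform baseline of about 4.45, meaning the model narrows the next character down to roughly four plausible candidates on average.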