# Project Description

This project trains an `LSTM` on a single document for character-level language modelling.

The model is trained on Jules Verne's novel *The Mysterious Island* (1874).

The dataset for the project can be downloaded [here](https://www.gutenberg.org/files/1268/1268-0.txt).

Import relevant libraries

In [26]:
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.distributions.categorical import Categorical

## Data Preprocessing 

- Remove irrelevant text, such as the title page and the copyright information, so that only the book itself remains

In [2]:
with open('data/1268-0.txt', 'r', encoding='utf-8') as f:
    text = f.read()

start_idx = text.find('THE MYSTERIOUS ISLAND')
end_idx = text.find('END OF THE PROJECT GUTENBERG EBOOK THE MYSTERIOUS ISLAND') 
text = text[start_idx:end_idx]

char_text = set(text)  # set of all unique characters in the text

print('Number of characters in the text: {}'.format(len(text)))
print('Number of unique characters in the text: {}'.format(len(char_text)))

Number of characters in the text: 1112300
Number of unique characters in the text: 80


### Create a Mapping that converts between Characters and Integers

In [3]:
sorted_chars = sorted(char_text)  # sorting the characters
char_array = np.array(sorted_chars)  
char2int = {ch:i for i,ch in enumerate(sorted_chars)}  # dictionary that maps each character to a unique integer

encoded_text = np.array([char2int[ch] for ch in text], dtype=np.int32)  # encoding the text

print('Encoded text length:', len(encoded_text))

print(text[:15], 'Encoded_text --->', encoded_text[:15])
print(encoded_text[15:30], 'Reversed Text --->', ''.join(char_array[encoded_text[15:30]]))

Encoded text length: 1112300
THE MYSTERIOUS  Encoded_text ---> [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]
[33 43 36 25 38 28  1  6  6  6  0  0  0  0  0] Reversed Text ---> ISLAND ***







In [4]:
for ch in text[:15]:
    print(ch, '--->', char2int[ch])

T ---> 44
H ---> 32
E ---> 29
  ---> 1
M ---> 37
Y ---> 48
S ---> 43
T ---> 44
E ---> 29
R ---> 42
I ---> 33
O ---> 39
U ---> 45
S ---> 43
  ---> 1


## Create a Dataset

- Choose a sequence length `n`
- Slide a window of `n+1` characters over the encoded text, one position at a time, so consecutive chunks overlap
- `some_encoded_text[:n]` is the input sequence of a chunk
- `some_encoded_text[1:n+1]` is the target sequence: the input shifted one character ahead
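
As a small illustration of the input/target split described above, here is a sketch on a toy array. The array `encoded` and the short `seq_len` are made up for the example and are not the notebook's data.

```python
import numpy as np

encoded = np.arange(10)   # toy stand-in for the encoded text
seq_len = 4               # toy sequence length
chunk_size = seq_len + 1

# One chunk per start position, so consecutive chunks overlap
chunks = [encoded[i:i + chunk_size] for i in range(len(encoded) - chunk_size + 1)]

first = chunks[0]
inputs, targets = first[:-1], first[1:]   # the target is the input shifted by one
print(inputs)       # [0 1 2 3]
print(targets)      # [1 2 3 4]
print(len(chunks))  # 6
```

Each target element is the character that follows the corresponding input element, which is what the model learns to predict.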


In [5]:
seq_len = 40
chunk_size = seq_len + 1
# Overlapping windows: one chunk per start position (the +1 keeps the final full window)
text_chunks = [encoded_text[i:i+chunk_size] for i in range(len(encoded_text)-chunk_size+1)]

class CharDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks
    
    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()    

char_dataset  = CharDataset(torch.from_numpy(np.array(text_chunks)))
char_dataset

<__main__.CharDataset at 0x292c4198c10>

In [6]:
for i, (inp, target)  in enumerate(char_dataset):
    print('Encoded Input:\n', inp, '\nEncoded Target:\n', target)
    print('\nInput Token: ', repr(''.join(char_array[inp])))
    print('Target Token: ', repr(''.join(char_array[target])))
    break
    

Encoded Input:
 tensor([44, 32, 29,  1, 37, 48, 43, 44, 29, 42, 33, 39, 45, 43,  1, 33, 43, 36,
        25, 38, 28,  1,  6,  6,  6,  0,  0,  0,  0,  0, 44, 32, 29,  1, 37, 48,
        43, 44, 29, 42]) 
Encoded Target:
 tensor([32, 29,  1, 37, 48, 43, 44, 29, 42, 33, 39, 45, 43,  1, 33, 43, 36, 25,
        38, 28,  1,  6,  6,  6,  0,  0,  0,  0,  0, 44, 32, 29,  1, 37, 48, 43,
        44, 29, 42, 33])

Input Token:  'THE MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTER'
Target Token:  'HE MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTERI'


### Wrap the Dataset in a DataLoader to Create Batches

In [7]:
batch_size = 64
torch.manual_seed(42)

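# drop_last=True keeps every batch at exactly batch_size, matching init_hidden(batch_size)
char_dl = DataLoader(char_dataset, batch_size=batch_size, shuffle=True, drop_last=True)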

## Building our RNN Model

In [8]:
class RNN(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(RNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden_dim = hidden_dim
        self.rnn  = nn.LSTM(embedding_dim, hidden_dim, 
                            batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, x, hidden, cell):
        ''' Forward pass through the network.'''
        # x holds one character index per sequence: x.shape = (batch_size,)
        out = self.embedding(x).unsqueeze(1) # out.shape = (batch_size, 1, embedding_dim)
        out, (hidden, cell) = self.rnn(out, (hidden, cell)) # out.shape = (batch_size, 1, hidden_dim)
        out  = self.fc(out).reshape(out.size(0), -1) # out.shape = (batch_size, vocab_size)
        return out, hidden, cell
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden and cell state '''
        hidden = torch.zeros(1, batch_size, self.hidden_dim)
        cell = torch.zeros(1, batch_size, self.hidden_dim)
        return hidden, cell
    

# # example: a single time step
# # Parameters
# vocab_size = 100
# embedding_dim = 20
# hidden_dim = 50

# # Instantiate the model
# rnn = RNN(vocab_size, embedding_dim, hidden_dim)

# # Generate input data: forward() expects one character index per sequence
# batch_size = 64
# x = torch.randint(0, vocab_size, (batch_size,))
# hidden, cell = rnn.init_hidden(batch_size)

# # Pass data through the model
# output, hidden, cell = rnn(x, hidden, cell)
# output.shape  # (batch_size, vocab_size)


Define the Parameters

In [9]:
vocab_size = len(char_array)
embedding_dim = 256
rnn_hidden_dim = 512
model = RNN(vocab_size, embedding_dim, rnn_hidden_dim)
model

RNN(
  (embedding): Embedding(80, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=80, bias=True)
)

### Define the Loss Function and Optimizer

In [10]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [11]:
num_epochs = 10000
torch.manual_seed(42)

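for epoch in range(num_epochs):  # each "epoch" here is one gradient step on a single batch
    hidden, cell = model.init_hidden(batch_size)
    inp, target = next(iter(char_dl))  # a fresh shuffled iterator, so this draws a random batch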
    optimizer.zero_grad()
    loss = 0 

    for ch in range(seq_len):
        pred, hidden, cell  = model(inp[:,ch], hidden, cell)  # Use the teacher-forcing technique
        loss += loss_fn(pred, target[:,ch])
        # print(pred.shape)
    
    loss.backward()
    optimizer.step()
    loss = loss.item() / seq_len

    if epoch % 500 == 0:
        print(f'Epoch {epoch} loss: {loss:.4f}')

Epoch 0 loss: 4.3964
Epoch 500 loss: 1.4802
Epoch 1000 loss: 1.3792
Epoch 1500 loss: 1.2563
Epoch 2000 loss: 1.2549
Epoch 2500 loss: 1.1840
Epoch 3000 loss: 1.1640
Epoch 3500 loss: 1.1858
Epoch 4000 loss: 1.1623
Epoch 4500 loss: 1.1688
Epoch 5000 loss: 1.1145
Epoch 5500 loss: 1.0829
Epoch 6000 loss: 1.0705
Epoch 6500 loss: 1.0246
Epoch 7000 loss: 1.0460
Epoch 7500 loss: 1.0393
Epoch 8000 loss: 1.0241
Epoch 8500 loss: 1.0220
Epoch 9000 loss: 1.0274
Epoch 9500 loss: 1.0265
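
The loop above feeds the LSTM one character at a time. As an alternative sketch (not the notebook's method), `nn.LSTM` can process a whole sequence in one call, and the averaged per-step loss then matches a single cross-entropy over all positions. The layer sizes and random data below are toy values chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim, hidden_dim = 10, 8, 16   # toy sizes, not the notebook's
batch_size, seq_len = 4, 5

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

inp = torch.randint(0, vocab_size, (batch_size, seq_len))
target = torch.randint(0, vocab_size, (batch_size, seq_len))

# Step-by-step: carry the hidden and cell states from one character to the next
hidden = torch.zeros(1, batch_size, hidden_dim)
cell = torch.zeros(1, batch_size, hidden_dim)
step_loss = 0.0
for t in range(seq_len):
    out, (hidden, cell) = lstm(embedding(inp[:, t]).unsqueeze(1), (hidden, cell))
    step_loss = step_loss + loss_fn(fc(out).squeeze(1), target[:, t])
step_loss = step_loss / seq_len

# Whole sequence in one call (the initial states default to zeros)
out, _ = lstm(embedding(inp))          # out.shape = (batch_size, seq_len, hidden_dim)
logits = fc(out)                       # logits.shape = (batch_size, seq_len, vocab_size)
full_loss = loss_fn(logits.reshape(-1, vocab_size), target.reshape(-1))

print(step_loss.item(), full_loss.item())
```

The two losses agree up to floating-point error, so the single-call form could replace the inner loop during training; generation still needs the step-by-step form, because each sampled character becomes the next input.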


Save the Model

In [12]:
torch.save(model.state_dict(), 'model_state_dict.pth')
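
A saved state dict can be restored later with `load_state_dict`. The sketch below uses a small stand-in `nn.LSTM` and a temporary directory so that it runs on its own; in the notebook, the stand-in would be an `RNN` instance built with the same `vocab_size`, `embedding_dim` and `rnn_hidden_dim` as the trained model.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in network (assumption for the example, not the notebook's RNN class)
net = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)

path = os.path.join(tempfile.mkdtemp(), 'model_state_dict.pth')
torch.save(net.state_dict(), path)

# Rebuild a model with the same architecture, then load the saved weights into it
restored = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to inference mode before generating text
```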

## Evaluate our Model

- Create a function that generates text from a starting string

1. Initialization:

- Start with an initial string `starting_str`.
- Set `generated_str` to `starting_str`.

2. Iteration:

- Convert the most recent character of `generated_str` to its encoded integer form.
- Pass it into the RNN model to update the hidden state.
- Let the model predict logits for the next possible character.
- Use the `Categorical` class to sample a character based on these logits.
- Append the sampled character to `generated_str`.

3. Completion:

- Continue iterating until `generated_str` reaches the desired length.
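
The sampling part of the iteration step can be sketched in isolation. The logits below are made-up values for a three-character vocabulary, not model output.

```python
import torch
from torch.distributions.categorical import Categorical

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1])   # made-up scores for a 3-character vocabulary

dist = Categorical(logits=logits)        # applies a softmax to the logits internally
probs = dist.probs                       # probability of each character index
next_idx = dist.sample()                 # 0-d tensor holding the sampled index

print(probs, next_idx.item())
```

Sampling, rather than always taking the most likely index, is what lets the generated text vary from run to run.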

In [13]:
def sample(model, starting_str, len_generated_text=500,  scale_factor=1.00):
    """Generate text with the LSTM model, continuing from a starting string.

    Args:
        model : trained LSTM model
        starting_str (str): seed text that the model continues
        len_generated_text (int, optional): number of characters to generate.
        scale_factor (float, optional): multiplier applied to the logits before sampling,
            i.e. the inverse of the temperature. Values above 1 sharpen the distribution
            (more predictable text); values below 1 flatten it (more random text).

    Returns:
        str: the starting string followed by the generated characters
    """

    # Encode the starting string into integers
    encoded_input = torch.tensor([char2int[char] for 
                                char in starting_str])
    encoded_input = encoded_input.reshape(1, -1) # or encoded_input.unsqueeze(0)

    generated_str = starting_str

    # Update the model's hidden state based on the starting string
    model.eval() 
    hidden, cell = model.init_hidden(encoded_input.size(0))
    with torch.no_grad():  # inference only, so skip gradient tracking
        for ch in range(len(starting_str) - 1):
            _, hidden, cell = model(encoded_input[:,ch].view(1), hidden, cell)

        # Predict the next character based on the previous ones
        last_char = encoded_input[:,-1]
        for ch in range(len_generated_text):
            logits, hidden, cell = model(last_char.view(1), hidden, cell)

            logits = logits.squeeze(0)  # shape: (vocab_size,)
            scaled_logits = logits * scale_factor  # scale_factor > 1 sharpens, < 1 flattens
            dist = Categorical(logits=scaled_logits)  # categorical distribution over the characters
            last_char = dist.sample()  # sample the index of the next character
            generated_str += str(char_array[last_char.item()])  # append the sampled character

    return generated_str

In [21]:
torch.manual_seed(42)
gen_seq = sample(model, 'the colonists')
print(gen_seq)

the colonists which water but the reporter
found it not to keep him that if they could not go at once this time the settlers had about to goats, and he wished to goic with intentions of the liquid vast and twenty miles. Did not pass.

Immeried himself used as the reporter told us very low tide. Their lips of circular distances they would restrain his reserve. “They then,” replied the reporter.

“But reason there!” cried Pencroft. “But any traced a thing for anything anything besides, and they kanded an islen


Increasing the scale factor (lowering the temperature)

In [24]:
torch.manual_seed(42)
gen_seq = sample(model, 'the colonists', scale_factor=2.0)
print(gen_seq)

the colonists were still flat with thick above the shore was the other substance from the balloon, and the two large end of the submarine forest, and he stretched out the cavern which had taken him as if it was impossible to his cheeks, looking at the end of the sea.

The seaman should remain increased in the sea at the moment when the sea before his reserved for the castaways had been completely gallously from the mouth of the Mercy was going to his companions.

“This is almost probably they may have taken 


- With `scale_factor=2.0` the logits are doubled, which is equivalent to a temperature of 0.5: the generated text becomes more predictable and coherent, but less diverse

Decreasing the scale factor (raising the temperature)

In [25]:
torch.manual_seed(42)
gen_seq = sample(model, 'the colonists', scale_factor=0.5)
print(gen_seq)

the colonists ran two;
neverth, for it had
fought on, to leave lu. Forsts were micest plan.

The hopefouvogigatings were yets uppabbated, grass,” cloved imporharezeic warm” stophes.

On not like thissana curs trust?--asked, “neve he.

Emvizualeh!! S5 heres a voice,lump, withlish thone--Willwn me,
hill
lem yourseaplored?
Coffss Cyrus?”
 “True Sestoke irea Leckhan libeling.” shall therrire Hutable, judged that they samoqsionaryze, which
forgottenda, they metreat?
Yrus, or!”

milest
nothation; butatedia. FRvas 


With `scale_factor=0.5` (equivalent to a temperature of 2.0), the output becomes more diverse and random, but far less coherent
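
The effect of `scale_factor` on the sampling distribution can be checked directly. The sketch below uses made-up logits for a four-character vocabulary and reports the top probability and the entropy for three scale factors.

```python
import torch
from torch.distributions.categorical import Categorical

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # made-up logits for a 4-character vocabulary

entropies = {}
for scale_factor in (0.5, 1.0, 2.0):
    dist = Categorical(logits=logits * scale_factor)
    entropies[scale_factor] = dist.entropy().item()
    top_prob = dist.probs.max().item()
    print(f'scale_factor={scale_factor}: top prob={top_prob:.3f}, '
          f'entropy={entropies[scale_factor]:.3f}')
```

A larger scale factor concentrates probability on the most likely character (lower entropy), while a smaller one spreads it across the vocabulary (higher entropy). This matches the two samples above: near-fluent text at 2.0 and mostly invented words at 0.5.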

To do 

- Increase the `sequence length` to see its effects on the model
- Train for more epochs
- Try a Transformer-based architecture on the same problem