# Character-level language modeling in PyTorch

In this notebook, the input is a text document, and our goal is to build a model that can generate new text that is
similar in style to the input document.

In character-level language modeling, the input is broken down into a sequence of characters that are fed into our
network one character at a time. The network will process each new character in conjunction with the memory of the
previously seen characters to predict the next one.

In [1]:
# Downloading the dataset
!curl -O https://raw.githubusercontent.com/rasbt/machine-learning-book/refs/heads/main/ch15/1268-0.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1144k  100 1144k    0     0  1347k      0 --:--:-- --:--:-- --:--:-- 1346k


In [2]:
# Preprocessing the dataset
import numpy as np
with open('1268-0.txt','r',encoding='utf-8') as fp:
    text = fp.read()
# Trim the text to the span between the title and the Project Gutenberg footer
start_idx = text.find('THE MYSTERIOUS ISLAND')
end_idx = text.find('End of the Project Gutenberg')
text = text[start_idx:end_idx]
char_set = set(text)
print(f"Total length: {len(text)}")
print(f"Unique characters: {len(char_set)}")

Total length: 1112350
Unique characters: 80


We now need a way to convert characters into integer values and vice versa.

In [3]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)
text_encoded = np.array([char2int[ch] for ch in text],dtype=np.int32)
print(text[:15],"===>",text_encoded[:15])
print(text_encoded[15:21],"===>","".join(char_array[text_encoded[15:21]]))

THE MYSTERIOUS  ===> [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]
[33 43 36 25 38 28] ===> ISLAND


Our goal now is to design a model that can predict the next character of a given input sequence, where the input
sequence represents an incomplete text. This problem can be thought of as a multiclass classification task, in which
each of the 80 unique characters is one class.
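
As a rough illustration of the classification view, the sketch below scores a hypothetical batch of predictions with
`nn.CrossEntropyLoss`. The batch size and target indices are made up for the example. All-zero logits stand in for a
model that assigns equal probability to every character, which gives a loss of $\ln(80) \approx 4.38$, a useful
reference point when reading training losses.

```python
import math
import torch
import torch.nn as nn

vocab_size = 80   # number of unique characters in the dataset
batch_size = 4    # hypothetical, kept small for illustration

# All-zero logits correspond to a uniform distribution over the vocabulary
logits = torch.zeros(batch_size, vocab_size)
# Arbitrary class indices standing in for the "next character" of each example
targets = torch.tensor([3, 17, 42, 79])

loss = nn.CrossEntropyLoss()(logits, targets)
uniform_loss = math.log(vocab_size)
print(f"cross-entropy: {loss.item():.4f}, ln(vocab_size): {uniform_loss:.4f}")
```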

Let's first clip the sequence length to 40. In practice, the sequence length impacts the quality of the generated
text. Longer sequences can result in more meaningful sentences. For shorter sequences, however, the model might focus
on capturing individual words correctly, while ignoring the context for the most part.

Thus, in practice, finding a good value for the sequence length is a hyperparameter optimization problem, which we
have to evaluate empirically. (In this specific case, 40 offers a good tradeoff.)
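
To make the role of the sequence length concrete, here is a small sketch on a toy string. The string and the value
`seq_length = 4` are assumptions chosen for readability. Each chunk holds `seq_length + 1` characters, so that the
input and the target are the same chunk shifted by one position.

```python
toy_text = "hello world"   # hypothetical stand-in for the encoded document
seq_length = 4             # toy value; the notebook itself uses 40
chunk_size = seq_length + 1

# Slide a window of chunk_size characters over the text, one character at a time
chunks = [toy_text[i:i + chunk_size] for i in range(len(toy_text) - chunk_size + 1)]

# Input: all but the last character; target: all but the first character
pairs = [(chunk[:-1], chunk[1:]) for chunk in chunks]
print(pairs[:3])
# [('hell', 'ello'), ('ello', 'llo '), ('llo ', 'lo w')]
```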

In [4]:
import torch
from torch.utils.data import Dataset
seq_length = 40
chunk_size = seq_length+1
text_chunks = [text_encoded[i:i+chunk_size] for i in range(len(text_encoded)-chunk_size+1)]
class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, index):
        text_chunk = self.text_chunks[index]
        return text_chunk[:-1].long(), text_chunk[1:].long()

# Stack the chunks into one NumPy array first; building a tensor from a list of arrays is very slow
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))


In [5]:
for i, (seq,target) in enumerate(seq_dataset):
    print(' Input (x): ',repr("".join(char_array[seq])))
    print('Target (y): ',repr("".join(char_array[target])))
    print()
    if i == 1:
        break

 Input (x):  'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'
Target (y):  'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'

 Input (x):  'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'
Target (y):  'E MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by '



In [6]:
from torch.utils.data import DataLoader
batch_size = 64
torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset,batch_size,shuffle=True,drop_last=True)

In [7]:
import torch.nn as nn
class RNN(nn.Module):
    def __init__(self,vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size,embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim,rnn_hidden_size,batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size,vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden,cell) = self.rnn(out,(hidden,cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell
    
    def init_hidden(self, batch_size):
        hidden = torch.zeros(1,batch_size,self.rnn_hidden_size)
        cell = torch.zeros(1,batch_size,self.rnn_hidden_size)
        return hidden, cell
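
The `forward` method above processes one character per call, so it can help to trace the tensor shapes through a
single step. The following is a minimal sketch that rebuilds the three layers on their own so that it runs in
isolation. The sizes match those used in this notebook, and the random input batch is purely illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
vocab_size, embed_dim, hidden_size, batch_size = 80, 256, 512, 64

# Standalone copies of the layers used inside the RNN class
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

# One character index per sequence in the batch (random, for illustration only)
x = torch.randint(0, vocab_size, (batch_size,))
hidden = torch.zeros(1, batch_size, hidden_size)
cell = torch.zeros(1, batch_size, hidden_size)

out = embedding(x).unsqueeze(1)                   # (64, 256) -> (64, 1, 256)
out, (hidden, cell) = lstm(out, (hidden, cell))   # out: (64, 1, 512)
logits = fc(out).reshape(out.size(0), -1)         # (64, 1, 80) -> (64, 80)
print(logits.shape, hidden.shape, cell.shape)
```

The `unsqueeze(1)` call adds a sequence dimension of length 1, which `nn.LSTM` expects when `batch_first=True`, and
the final `reshape` removes it again so that each row holds the logits for one next-character prediction.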

In [16]:
VOCAB_SIZE = len(char_array)
EMBED_DIM = 256
RNN_HIDDEN_SIZE = 512
torch.manual_seed(1)
model = RNN(VOCAB_SIZE,EMBED_DIM,RNN_HIDDEN_SIZE)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=0.005)

In [17]:
num_epochs = 10_000
torch.manual_seed(1)
for epoch in range(num_epochs):
    # The hidden and cell states are reset to zeros at the beginning of each sequence
    hidden, cell = model.init_hidden(batch_size)
    seq_batch, target_batch = next(iter(seq_dl))
    optimizer.zero_grad()
    loss = 0
    for c in range(seq_length):
        pred,hidden,cell = model(seq_batch[:,c],hidden,cell)
        loss += loss_fn(pred,target_batch[:,c])
    loss.backward()
    optimizer.step()
    loss = loss.item()/seq_length
    if epoch % 500 == 0:
        print(f"Epoch {epoch} loss: {loss:.4f}")

Epoch 0 loss: 4.3722
Epoch 500 loss: 1.3942
Epoch 1000 loss: 1.3521
Epoch 1500 loss: 1.2300
Epoch 2000 loss: 1.2288
Epoch 2500 loss: 1.1846
Epoch 3000 loss: 1.1713
Epoch 3500 loss: 1.1494
Epoch 4000 loss: 1.1892
Epoch 4500 loss: 1.1569
Epoch 5000 loss: 1.0866
Epoch 5500 loss: 1.1185
Epoch 6000 loss: 1.1548
Epoch 6500 loss: 1.1408
Epoch 7000 loss: 1.1057
Epoch 7500 loss: 1.1695
Epoch 8000 loss: 1.1409
Epoch 8500 loss: 1.1615
Epoch 9000 loss: 1.1039
Epoch 9500 loss: 1.1127


To predict the next character in the sequence, we could simply select the element with the maximum logit value, which
is equivalent to selecting the character with the highest probability. However, this greedy choice would make the
generation deterministic: the same starting string would always produce the same text. Instead, we (randomly) sample
the next character from the model's output distribution.

PyTorch already provides a class, `torch.distributions.Categorical`, which we can use to draw random samples from a
categorical distribution. Each probability is the chance that the corresponding element is picked: if all elements
have the same probability, each one is picked about equally often over a large number of samples, whereas an element
with a larger probability than the others is picked more often.
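
The behaviour described above can be checked with a small sketch. The logits and the helper `sample_counts` are
hypothetical, introduced only for this illustration. It draws many samples from two distributions and counts how
often each element is picked.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(1)

def sample_counts(logits, num_samples=10_000):
    """Draw num_samples indices and count how often each index appears."""
    m = Categorical(logits=logits)
    samples = m.sample((num_samples,))
    return torch.bincount(samples, minlength=logits.numel())

# Equal logits: every element has the same probability
equal_counts = sample_counts(torch.tensor([1.0, 1.0, 1.0]))
# The last logit is larger, so the last element is picked far more often
skewed_counts = sample_counts(torch.tensor([1.0, 1.0, 3.0]))
print(equal_counts)
print(skewed_counts)
```

With equal logits the three counts should each land near one third of the samples, while in the second case the last
element should account for roughly 79% of them, since $e^3 / (2e + e^3) \approx 0.79$.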

In [18]:
from torch.distributions import Categorical
def sample(model, starting_str, len_generated_text = 500,scale_factor=1.0):
    encoded_input = torch.tensor([char2int[s] for s in starting_str])
    encoded_input = torch.reshape(encoded_input,(1,-1))
    generated_str = starting_str

    model.eval()
    hidden, cell = model.init_hidden(1)
    for c in range(len(starting_str)-1):
        _, hidden, cell = model(encoded_input[:,c].view(1),hidden,cell)

    last_char = encoded_input[:,-1]
    for i in range(len_generated_text):
        logits, hidden, cell = model(last_char.view(1),hidden, cell)
        logits = torch.squeeze(logits,0)
        scaled_logits = logits*scale_factor
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        generated_str += str(char_array[last_char])

    return generated_str

In [19]:
torch.manual_seed(1)
print(sample(model, starting_str="The island"))

The island would discover
unleady to be feared through the engineer, who, and the sailor’s walls drew a mound, infested without
details be allowed themselves, abounted themselves at the open timid fine pottery lengths of these trees of the east following his crims, and
has sharp break in Granite House from account of the island, and as soon as CLOUDScouts? The colonists, who put the 10t of the entrance, which searched as the outlet. The day as its apearish answered this lessure, corrived down the two comp


Furthermore, to control the predictability of the generated samples, the logits computed by the RNN model can be scaled
before being passed to `Categorical` for sampling. The scaling factor, $\alpha$, can be interpreted as an analog to the
inverse of the temperature in physics. Scaling with $\alpha<1$ (higher temperature) makes the sampling distribution more
uniform, which results in more entropy or randomness; scaling with $\alpha>1$ (lower temperature) sharpens the
distribution and gives more predictable behaviour.
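
The effect of the scaling can be seen on a small example. The sketch below uses made-up logits for three characters
and multiplies them by a factor, in the same way that `sample` multiplies the logits by `scale_factor`. It then
prints the resulting probabilities and their entropy.

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([1.0, 1.0, 3.0])   # hypothetical logits for three characters

entropies = {}
for scale_factor in (0.5, 1.0, 2.0):
    m = Categorical(logits=logits * scale_factor)
    entropies[scale_factor] = m.entropy().item()
    probs = [round(p, 3) for p in m.probs.tolist()]
    print(f"scale_factor={scale_factor}: probs={probs}, entropy={entropies[scale_factor]:.3f}")
```

A factor below 1 flattens the distribution and raises the entropy, so sampling becomes more random. A factor above 1
concentrates the probability mass on the most likely character, so sampling becomes more predictable.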

In [24]:
# Higher temperature
torch.manual_seed(1)
print(sample(model, starting_str="The island",scale_factor=0.5))

The island of mass, 1 Hove
olcaniDest easy ir,
dericed off Cyru, fraft which do also howhereable
does; I built isffallen have vedntinuous pardoquarting:”

Pencroft, dived undly,” as I
a leak timic finish!

Tubelti’tes instructed lying!--
It would necessary,
hisip?
Ju!”

Remember vious up, in side sharp tightly--mysteresaril that-ajours kreased amond, ammunies on wools sharp, precary basalt petuentle being him.-

No! kyon-taT--simply. Af anyoks?”

“IS” risb-dust!”

Cyripdle our mountain widazing?”

“clover


In [25]:
# Lower temperature
torch.manual_seed(1)
print(sample(model, starting_str="The island",scale_factor=2))

The island was the colonists had not been able to say another side, and the terrible sources of the island and tools, which was devoted to his companions.

At this was there and a spring the pirates and points of the island and a purpose of the summit of the divan, and the sailor was agreed the sand and finding the shore, and and the sailor was more commenced the poultry-yard were then about to the beach. He and Neb knew the presence of the engineer, and Gideon Spilett and his companions disappeared at th
