# Character-level language modeling in PyTorch

The model we build here takes a text document as input, and the goal is to generate new text that is similar in style
to that document.

In character-level language modeling, the input is broken down into a sequence of characters that are fed into our
network one character at a time. The network will process each new character in conjunction with the memory of the
previously seen characters to predict the next one.

In [3]:
# Downloading the dataset
!curl -O https://raw.githubusercontent.com/rasbt/machine-learning-book/refs/heads/main/ch15/1268-0.txt



In [1]:
# Preprocessing the dataset
import numpy as np
with open('1268-0.txt','r',encoding='utf-8') as fp:
    text = fp.read()
# Trim the text to the span between the title and the Project Gutenberg end marker
start_idx = text.find('THE MYSTERIOUS ISLAND')
end_idx = text.find('End of the Project Gutenberg')
text = text[start_idx:end_idx]
char_set = set(text)
print(f"Total length: {len(text)}")
print(f"Unique characters: {len(char_set)}")

Total length: 1112350
Unique characters: 80


We now need a way to convert characters into integer values and vice versa.

In [2]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)
text_encoded = np.array([char2int[ch] for ch in text],dtype=np.int32)
print(text[:15],"===>",text_encoded[:15])
print(text_encoded[15:21],"===>","".join(char_array[text_encoded[15:21]]))

THE MYSTERIOUS  ===> [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]
[33 43 36 25 38 28] ===> ISLAND
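
The mapping above can be wrapped in a pair of helper functions. The following is a minimal sketch on a toy string rather than the novel; `encode` and `decode` are hypothetical helper names introduced here for illustration and are not part of the original notebook.

```python
# Toy stand-in for the novel text (assumption: any string works the same way)
sample_text = "THE MYSTERIOUS ISLAND"

# Build the vocabulary as above: sorted unique characters
sample_chars = sorted(set(sample_text))
sample_char2int = {ch: i for i, ch in enumerate(sample_chars)}

def encode(s, mapping):
    """Hypothetical helper: map each character to its integer index."""
    return [mapping[ch] for ch in s]

def decode(indices, vocab):
    """Hypothetical helper: map integer indices back to characters."""
    return "".join(vocab[i] for i in indices)

encoded = encode(sample_text, sample_char2int)
decoded = decode(encoded, sample_chars)
print(encoded)
print(decoded)
```

Decoding an encoded string should return the original text unchanged. A character that is not in the vocabulary raises a `KeyError` in `encode`, so the vocabulary has to be built from the same text the model is trained on.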


Our goal now is to design a model that can predict the next character of a given input sequence, where the input
sequence represents an incomplete text. This problem can be thought of as a multiclass classification task, with one
class per unique character.
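
To make the classification framing concrete: for each input position the model produces one score (logit) per character in the vocabulary, and a common training objective is the cross-entropy between the softmax of those scores and the true next character. The sketch below assumes a hypothetical four-character vocabulary and made-up scores, and is written in plain Python rather than PyTorch so that it runs on its own.

```python
import math

# Hypothetical toy vocabulary and made-up scores (logits) for one position
vocab = ["a", "b", "c", "d"]
logits = [2.0, 0.5, -1.0, 0.1]
true_next_char = "a"

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Cross-entropy for a single example: negative log-probability of the true class
loss = -math.log(probs[vocab.index(true_next_char)])

# The predicted next character is the class with the highest probability
predicted = vocab[probs.index(max(probs))]
print(predicted, round(loss, 4))
```

With the dataset used here, the only change is the number of classes: 80, one for each unique character.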

Let's first clip the sequence length to 40. In practice, the sequence length impacts the quality of the generated
text. Longer sequences can result in more meaningful sentences. For shorter sequences, however, the model might focus
on capturing individual words correctly, while ignoring the context for the most part.

Thus, finding a good value for the sequence length is a hyperparameter optimization problem that we have to evaluate
empirically. In this specific case, 40 offers a good tradeoff.
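
How the sequence length determines the training examples cut from the text can be sketched on a toy string. Each chunk holds `seq_length + 1` characters, so that the input and the target, shifted by one character, both have length `seq_length`. `make_pairs` is a hypothetical helper used only for this illustration.

```python
def make_pairs(text, seq_length):
    """Hypothetical helper: slide a window over text and split each
    chunk into an input and a target shifted by one character."""
    chunk_size = seq_length + 1
    pairs = []
    for i in range(len(text) - chunk_size + 1):
        chunk = text[i:i + chunk_size]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs

toy_text = "hello world"  # 11 characters

for seq_length in (3, 5):
    pairs = make_pairs(toy_text, seq_length)
    # Print the sequence length, the number of examples, and the first pair
    print(seq_length, len(pairs), pairs[0])
```

A longer sequence length gives the model more context per example but yields slightly fewer examples from the same text.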

In [3]:
import torch
from torch.utils.data import Dataset

seq_length = 40
chunk_size = seq_length + 1  # 40 input characters plus 1 extra for the shifted target
# Slide a window of chunk_size over the encoded text, one character at a time
text_chunks = [text_encoded[i:i+chunk_size] for i in range(len(text_encoded)-chunk_size+1)]

class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, index):
        text_chunk = self.text_chunks[index]
        return text_chunk[:-1].long(), text_chunk[1:].long()

# Stack the chunks into one NumPy array first; building a tensor from a list of arrays is slow
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))


In [4]:
for i, (seq,target) in enumerate(seq_dataset):
    print(' Input (x): ',repr("".join(char_array[seq])))
    print('Target (y): ',repr("".join(char_array[target])))
    print()
    if i == 1:
        break

 Input (x):  'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'
Target (y):  'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'

 Input (x):  'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'
Target (y):  'E MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by '



In [5]:
from torch.utils.data import DataLoader
batch_size = 64
torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset,batch_size,shuffle=True,drop_last=True)
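
As a rough check on what this loader yields, the number of batches per epoch follows from the figures printed earlier. The sketch below only does the arithmetic, assuming the text length of 1112350 reported in the preprocessing step; with `drop_last=True` the final incomplete batch is discarded.

```python
text_length = 1112350   # total length printed in the preprocessing step
seq_length = 40
chunk_size = seq_length + 1
batch_size = 64

# One chunk per starting position that leaves room for a full chunk
num_chunks = text_length - chunk_size + 1

# drop_last=True keeps only full batches
num_batches = num_chunks // batch_size
num_dropped = num_chunks % batch_size

print(num_chunks, num_batches, num_dropped)
```

Each batch then contains 64 input sequences and 64 target sequences of 40 character indices each.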