**Project two - character-level language modelling in PyTorch:**
- Language modelling is a fascinating application that enables machines to perform human language-related tasks, such as generating English sentences. In the model we build, the input is a text document, and the goal is to develop a model that can generate new text that is similar in style to the input document. Examples of such input are a book or a computer program in a specific programming language.
- In character-level language modelling, the input is broken down into a sequence of characters that are fed into the network one character at a time. The network then processes each new character in conjunction with the memory of the previously seen characters to predict the next one.

**Preprocessing the dataset:**
- Data used: the book The Mysterious Island by Jules Verne, available on gutenberg.org as file 1268-0.txt.
- Once downloaded, we can read the dataset into a Python session as plain text. Using the following code, we read the text directly from the downloaded file and remove portions from the beginning and end (these contain descriptions of the Gutenberg project). Then we create a Python variable, char_set, that represents the set of unique characters observed in this text.

In [1]:
import numpy as np

In [2]:
with open('/Users/blaise/Documents/ML/Machine-Learning-and-Big-Data-Analytics/data/1268-0.txt', 'r', encoding='utf-8') as fp:
    text = fp.read()

In [15]:
start_indx = text.find('THE MYSTERIOUS ISLAND')
end_indx = text.find('End of the Project Gutenberg')

In [16]:
text = text[start_indx:end_indx]

In [18]:
char_set = set(text)

In [20]:
print('Total Length: ', len(text))
print('Unique characters: ', len(char_set))

Total Length:  1112310
Unique characters:  80


We need to map the characters to integers. To do this, we create a simple Python dictionary, char2int, that maps each character to an integer. We also need a reverse mapping to convert the results of our model back to text. Although the reverse mapping can be done with a dictionary that associates integer keys with character values, using a NumPy array and indexing it to map indices to the unique characters is more efficient.

Building a dictionary to map characters to integers, and a reverse mapping via indexing a NumPy array:

In [21]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)

In [24]:
text_encoded = np.array(
    [char2int[ch] for ch in text],
    dtype=np.int32
)

In [25]:
text_encoded

array([44, 32, 29, ...,  6,  6,  6], shape=(1112310,), dtype=int32)

In [26]:
print('Text encoded shape: ', text_encoded.shape)

Text encoded shape:  (1112310,)


In [27]:
print(text[:15], " == Encoding ==> ", text_encoded[:15])

THE MYSTERIOUS   == Encoding ==>  [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]


In [36]:
print(text_encoded[:15], " == Reverse ==> ",''.join(char_array[text_encoded[:15]]))

[44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]  == Reverse ==>  THE MYSTERIOUS 


So, the text_encoded numpy array contains the encoded values for all characters in the text. Now, we will print out the mappings of the first 5 characters from this array:

In [37]:
for ex in text_encoded[:5]:
    print('{} -> {}'.format(ex, char_array[ex]))

44 -> T
32 -> H
29 -> E
1 ->  
37 -> M


**Formulating the Problem:**

<img src='/Users/blaise/Documents/ML/Machine-Learning-and-Big-Data-Analytics/resource_images/text_generation.png'>

- Text generation can be phrased as a classification task. As in the image shown above, we can consider the sequences shown in the left-hand box to be the input. To generate new text, the goal is to design a model that can predict the next character of a given input sequence, where the input sequence represents an incomplete text. For example, after seeing "Deep Learn," the model should predict "i" as the next character. Given that we have 80 unique characters, this problem becomes a multiclass classification task.

<img src="/Users/blaise/Documents/ML/Machine-Learning-and-Big-Data-Analytics/resource_images/text_generation2.png">

- Starting with a sequence of length 1 (i.e. one single letter), we can iteratively generate new text based on this multiclass classification approach as illustrated in the image above.

- To implement the text generation task in PyTorch, we clip the sequence length to 40, meaning that the input tensor, x, consists of 40 tokens. The sequence length impacts the quality of the generated text. For shorter sequences, the model might focus on capturing individual words correctly while ignoring the context for the most part. Longer sequences usually result in more meaningful sentences, but for long sequences the RNN model will have problems capturing long-range dependencies. Thus, in practice, finding a good value for the sequence length is a hyperparameter optimization problem.

- As seen in the previous figure, the inputs, x, and targets, y, are offset by one character. Hence, we will split the text into chunks of size 41: the first 40 characters will form the input sequence, x, and the last 40 elements will form the target sequence, y. 

- We have already stored the entire encoded text in its original order in text_encoded. We will first create text chunks of 41 characters each, starting a new chunk at every character position so that consecutive chunks overlap. Any trailing window shorter than 41 characters is excluded, so the new chunk dataset, named text_chunks, always contains sequences of size 41. The 41-character chunks will then be used to construct the sequence x (the input) as well as the sequence y (the target).
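The chunking and the one-character offset between x and y can be illustrated on a toy string. This is only a sketch: the names prefixed with toy_ are made up for the illustration, and the sequence length is shortened to 4 so that the chunks are easy to read.

```python
import numpy as np

# Toy stand-in for text_encoded (hypothetical names, not part of the tutorial code).
toy_text = "Deep Learning"
toy_chars = sorted(set(toy_text))
toy_char2int = {ch: i for i, ch in enumerate(toy_chars)}
toy_array = np.array(toy_chars)
toy_encoded = np.array([toy_char2int[ch] for ch in toy_text], dtype=np.int32)

toy_seq_length = 4                   # the tutorial uses 40
toy_chunk_size = toy_seq_length + 1  # one extra element so x and y can be offset by one

# Stride-1 sliding window: only start positions that still leave a full chunk.
toy_chunks = [
    toy_encoded[i:i + toy_chunk_size]
    for i in range(len(toy_encoded) - toy_chunk_size + 1)
]

for chunk in toy_chunks[:3]:
    x, y = chunk[:-1], chunk[1:]     # input and target, shifted by one character
    print(repr("".join(toy_array[x])), "->", repr("".join(toy_array[y])))
```

A text of length N therefore yields N - chunk_size + 1 chunks, which is 9 for this 13-character string.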

In [38]:
import torch
from torch.utils.data import Dataset, DataLoader, Subset

In [39]:
seq_length = 40
chunk_size = seq_length + 1
text_chunks = [text_encoded[i:i+chunk_size] for i in range(len(text_encoded)-chunk_size+1)]

In [47]:
text_chunks[:2]

[array([44, 32, 29,  1, 37, 48, 43, 44, 29, 42, 33, 39, 45, 43,  1, 33, 43,
        36, 25, 38, 28,  0,  0, 51, 74,  1, 34, 70, 61, 54, 68,  1, 46, 54,
        67, 63, 54,  0,  0, 12, 19], dtype=int32),
 array([32, 29,  1, 37, 48, 43, 44, 29, 42, 33, 39, 45, 43,  1, 33, 43, 36,
        25, 38, 28,  0,  0, 51, 74,  1, 34, 70, 61, 54, 68,  1, 46, 54, 67,
        63, 54,  0,  0, 12, 19, 18], dtype=int32)]

In [50]:
len(text_chunks)

1112270

In [40]:
from torch.utils.data import Dataset

In [41]:
class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks
    
    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, index):
        text_chunk = self.text_chunks[index]
        return text_chunk[:-1].long(), text_chunk[1:].long()

In [42]:
# Stack the chunks into one NumPy array first; building a tensor from a list of arrays is slow
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))


Let's take a look at some example sequences from this transformed dataset:

In [48]:
for i, (seq, target) in enumerate(seq_dataset):
    print(' Input(x): ',
          repr(''.join(char_array[seq])))
    print('Target (y): ',
          repr(''.join(char_array[target])))
    print()
    if i == 1:
        break

 Input(x):  'THE MYSTERIOUS ISLAND\n\nby Jules Verne\n\n1'
Target (y):  'HE MYSTERIOUS ISLAND\n\nby Jules Verne\n\n18'

 Input(x):  'HE MYSTERIOUS ISLAND\n\nby Jules Verne\n\n18'
Target (y):  'E MYSTERIOUS ISLAND\n\nby Jules Verne\n\n187'
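Because the chunks overlap almost completely, materializing all of them stores each character about 41 times. One possible alternative, sketched below under the assumption that the encoded text fits in memory as a single tensor, is to slice each chunk on demand. LazyTextDataset is a hypothetical name and is not used elsewhere in this notebook.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class LazyTextDataset(Dataset):
    """Hypothetical variant of TextDataset that slices chunks on demand."""

    def __init__(self, encoded, seq_length):
        # Keep one 1-D tensor of character indices instead of all chunks.
        self.encoded = torch.as_tensor(encoded, dtype=torch.long)
        self.chunk_size = seq_length + 1

    def __len__(self):
        # Number of stride-1 windows that still contain a full chunk.
        return len(self.encoded) - self.chunk_size + 1

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        chunk = self.encoded[index:index + self.chunk_size]
        return chunk[:-1], chunk[1:]  # input and target, offset by one


# Small demonstration on made-up data.
demo_encoded = np.arange(10, dtype=np.int32)
demo_ds = LazyTextDataset(demo_encoded, seq_length=4)
demo_x, demo_y = demo_ds[0]
print(len(demo_ds), demo_x.tolist(), demo_y.tolist())
```

Such a dataset could be passed to DataLoader in the same way as seq_dataset; whether the memory saving matters depends on the size of the text.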



Finally, the last step in preparing the dataset is to transform it into mini-batches:

In [49]:
batch_size = 128
torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

**Building a character-level RNN model:**
- Now that the dataset is ready, building the model is relatively straightforward.

In [51]:
import torch.nn as nn

In [None]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)
    
    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell
    
    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden, cell

We need to have logits as the outputs of the model so that we can sample from the model predictions in order to generate new text. 
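To make the tensor shapes in forward() concrete, the sketch below repeats its three layers outside the class for a single time step. The layer sizes are deliberately small and purely illustrative (the notebook itself uses an embedding size of 256 and a hidden size of 512), and the variable names are assumptions made for this example.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; vocab_size matches the 80 unique characters.
vocab_size, embed_dim, hidden_size, batch = 80, 8, 16, 4

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

x = torch.randint(0, vocab_size, (batch,))       # one character index per sequence
hidden = torch.zeros(1, batch, hidden_size)      # (num_layers, batch, hidden_size)
cell = torch.zeros(1, batch, hidden_size)

emb = embedding(x).unsqueeze(1)                  # (batch, 1, embed_dim)
out, (hidden, cell) = lstm(emb, (hidden, cell))  # (batch, 1, hidden_size)
logits = fc(out).reshape(out.size(0), -1)        # (batch, vocab_size)

print(emb.shape, out.shape, logits.shape)
```

The final reshape drops the length-1 time dimension, so each row of logits holds one unnormalized score per character.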

In [71]:
vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512
torch.manual_seed(1)

<torch._C.Generator at 0x121190b90>

In [72]:
model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model

RNN(
  (embedding): Embedding(80, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=80, bias=True)
)

In [81]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

In [82]:
model = model.to(device)

The next step is to create a loss function and an optimizer:

In [73]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
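nn.CrossEntropyLoss expects raw logits rather than probabilities, which is why the model has no softmax layer. As a small check on made-up data, the loss matches the negative log-likelihood of the log-softmax of the logits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
demo_logits = torch.randn(4, 80)           # (batch, vocab_size) raw scores
demo_targets = torch.randint(0, 80, (4,))  # true next-character indices

ce = nn.CrossEntropyLoss()(demo_logits, demo_targets)
manual = F.nll_loss(F.log_softmax(demo_logits, dim=1), demo_targets)

print(torch.allclose(ce, manual))
```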

Now we train the model for 10,000 iterations (called epochs in the code). In each iteration, we use a single batch drawn at random from the shuffled data loader, seq_dl.

In [84]:
num_epochs = 10000
torch.manual_seed(1)

for epoch in range(num_epochs):
    # Fresh zero hidden and cell states for every batch
    hidden, cell = model.init_hidden(batch_size)
    hidden, cell = hidden.to(device), cell.to(device)
    # A new iterator reshuffles the data, so this draws one random batch
    seq_batch, target_batch = next(iter(seq_dl))
    seq_batch,target_batch = seq_batch.to(device), target_batch.to(device)
    optimizer.zero_grad()
    loss = 0
    # Step through the sequence one character at a time, summing the loss
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:,c], hidden, cell)
        loss += loss_fn(pred, target_batch[:,c])
    loss.backward()
    optimizer.step()
    # Report the average per-character loss
    loss = loss.item()/seq_length
    if epoch%500 == 0:
        print(f'Epoch {epoch} | loss: {loss:.4f}')


Epoch 0 | loss: 4.3708
Epoch 500 | loss: 1.3372
Epoch 1000 | loss: 1.2515
Epoch 1500 | loss: 1.1892
Epoch 2000 | loss: 1.1472
Epoch 2500 | loss: 1.1355
Epoch 3000 | loss: 1.0908
Epoch 3500 | loss: 1.1101
Epoch 4000 | loss: 1.0732
Epoch 4500 | loss: 1.0539
Epoch 5000 | loss: 1.0865
Epoch 5500 | loss: 1.0617
Epoch 6000 | loss: 1.0547
Epoch 6500 | loss: 1.0971
Epoch 7000 | loss: 1.0661
Epoch 7500 | loss: 1.0987
Epoch 8000 | loss: 1.1226
Epoch 8500 | loss: 1.0818
Epoch 9000 | loss: 1.0698
Epoch 9500 | loss: 1.0914
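Two quick sanity checks help to interpret these numbers. A model that guesses uniformly among 80 characters has a cross-entropy of ln(80), which is close to the loss reported at epoch 0. Exponentiating the average loss gives the per-character perplexity, which can be read loosely as the effective number of characters the model is choosing between. The variable names below are assumptions made for this sketch.

```python
import math

vocab_size = 80
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess over 80 classes
print(f"Uniform-guess loss: {uniform_loss:.4f}")

final_loss = 1.0914                  # value reported above at epoch 9500
perplexity = math.exp(final_loss)    # per-character perplexity
print(f"Perplexity: {perplexity:.2f}")
```

A perplexity of roughly 3 means the trained model is about as uncertain as a uniform choice among three characters, compared with 80 at the start of training.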


Next, we evaluate the model to generate new text, starting with a given short string.

**Evaluation Phase - Generating new text passages:**
- The RNN model we've trained returns a vector of 80 logits, one for each unique character. These logits can be readily converted, via the softmax function, to probabilities that a particular character will be encountered as the next character. To predict the next character in the sequence, we could simply select the element with the maximum logit value, which is equivalent to selecting the character with the highest probability. However, instead of always selecting the character with the highest likelihood, we want to randomly sample from the outputs; otherwise, the model will always produce the same text. PyTorch already provides a class, torch.distributions.categorical.Categorical, which we can use to draw random samples from a categorical distribution.

In [85]:
from torch.distributions.categorical import Categorical

In [86]:
torch.manual_seed(1)
logits = torch.tensor([[1.0,1.0,1.0]])

In [87]:
print('Probabilities: ', nn.functional.softmax(logits, dim=1).numpy()[0])

Probabilities:  [0.33333334 0.33333334 0.33333334]


In [88]:
m = Categorical(logits=logits)
samples = m.sample((10,))
print(samples.numpy())

[[0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [2]
 [1]
 [1]]


As the example above shows, with the given logits all three categories have the same probability. Therefore, if we draw a large number of samples (num_samples -> infinity), we would expect each category to account for approximately 1/3 of the samples. If we change the logits to [1, 1, 3], we would expect to observe more occurrences of category 2 when a very large number of samples is drawn from the distribution.

In [89]:
torch.manual_seed(1)
logits = torch.tensor([[1.0, 1.0, 3.0]])
print('Probabilities: ', nn.functional.softmax(logits, dim=1).numpy()[0])
m = Categorical(logits=logits)
samples = m.sample((10,))
print(samples.numpy())

Probabilities:  [0.10650698 0.10650698 0.78698605]
[[0]
 [2]
 [2]
 [1]
 [2]
 [1]
 [2]
 [2]
 [2]
 [2]]
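With only 10 draws the observed counts are noisy. As a rough check of the large-sample claim, the sketch below draws 100,000 samples and compares the empirical frequencies with the softmax probabilities. The sample size and variable names are arbitrary choices for this illustration.

```python
import torch
from torch.distributions.categorical import Categorical

torch.manual_seed(1)
demo_logits = torch.tensor([1.0, 1.0, 3.0])
probs = torch.softmax(demo_logits, dim=0)

n_samples = 100_000
draws = Categorical(logits=demo_logits).sample((n_samples,))  # shape (n_samples,)

# Count how often each category was drawn and normalize to frequencies.
freqs = torch.bincount(draws, minlength=3).float() / n_samples

print("Softmax probabilities:", probs.numpy())
print("Empirical frequencies:", freqs.numpy())
```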


Using Categorical, we can generate examples based on the logits computed by the model.
- We will define a function, sample(), that receives a short starting string, starting_str, and generates a new string, generated_str, which is initially set to the input string. starting_str is encoded to a sequence of integers, encoded_input. encoded_input is passed to the RNN model one character at a time to update the hidden states. The last character of encoded_input is then passed to the model to generate a new character.


In [94]:
z = torch.randn(1,40,256)

In [None]:
z[:, 2].shape

torch.Size([1, 256])

In [119]:
@torch.no_grad()  # inference only, so skip building the autograd graph
def sample(model, starting_str, len_generated_text=500, scale_factor=1.0):
    # Encode the starting string as a (1, len) tensor of character indices
    encoded_input = torch.tensor(
        [char2int[s] for s in starting_str]
    )

    encoded_input = torch.reshape(
        encoded_input, (1, -1)
    ).to(device)
    generated_str = starting_str
    model.eval()
    hidden,cell = model.init_hidden(1)
    hidden,cell = hidden.to(device), cell.to(device)

    # Warm up the hidden and cell states on all but the last prompt character
    for c in range(len(starting_str)-1):
        _, hidden, cell = model(
            encoded_input[:,c].view(1), hidden, cell
        )
    last_char = encoded_input[:, -1]
    # Feed the latest character back in and sample the next one
    for i in range(len_generated_text):
        logits, hidden, cell = model(
            last_char.view(1), hidden,cell
        )
        logits = torch.squeeze(logits,0)
        # scale_factor > 1 sharpens the distribution, < 1 flattens it
        scaled_logits = logits*scale_factor
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        generated_str += str(char_array[last_char.item()])
    return generated_str

Generate some text:

In [120]:
torch.manual_seed(1)
print(sample(model, starting_str='The island'))

The island was far,
a few hundreds which had before rapidly as fact as wild no drop and entrenty against a name which ringthening about the most dire-Fash of a black current of glasses, from died a
bridge barked at the same puriods, the “Bound are large that one aid separable down.

Cyrus Harding quickly go to see in the midst of the vessel formed an outlinely consula.

What belonging to shower vive-

Three substance was soon, “sponderweds.”

“I trying of the day,” added Neb had, after turned the man; but


Furthermore, to control the predictability of the generated samples (i.e., how closely the generated text follows the learned patterns), the logits computed by the RNN model can be scaled before being passed to Categorical for sampling. The scaling factor, alpha, can be interpreted as an analog of the inverse temperature in physics: a higher temperature means more entropy or randomness, whereas a lower temperature means more predictable behaviour. By scaling the logits with alpha < 1, the probabilities computed by the softmax function become more uniform:

In [121]:
logits = torch.tensor([[1.0, 1.0, 3.0]])

print('Probabilities before scaling: ', nn.functional.softmax(logits, dim=1).numpy()[0])

print('Probabilities after scaling with 0.5: ', nn.functional.softmax(0.5*logits, dim=1).numpy()[0])

print('Probabilities after scaling with 0.1: ', nn.functional.softmax(0.1*logits, dim=1).numpy()[0])

Probabilities before scaling:  [0.10650698 0.10650698 0.78698605]
Probabilities after scaling with 0.5:  [0.21194156 0.21194156 0.57611686]
Probabilities after scaling with 0.1:  [0.3104238  0.3104238  0.37915248]
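The effect of alpha can also be summarized in a single number, the entropy of the resulting distribution. The sketch below uses the same example logits and a few arbitrarily chosen values of alpha. The entropy falls as alpha grows and approaches ln(3), the entropy of a uniform distribution over three categories, as alpha approaches zero.

```python
import torch
from torch.distributions.categorical import Categorical

demo_logits = torch.tensor([1.0, 1.0, 3.0])
alphas = (0.1, 0.5, 1.0, 2.0)

# Entropy (in nats) of the distribution obtained after scaling the logits.
entropies = [
    Categorical(logits=alpha * demo_logits).entropy().item()
    for alpha in alphas
]

for alpha, ent in zip(alphas, entropies):
    print(f"alpha={alpha}: entropy={ent:.4f}")

max_entropy = torch.log(torch.tensor(3.0)).item()  # uniform over 3 categories
print(f"Maximum possible entropy: {max_entropy:.4f}")
```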


alpha = 2.0 => more predictable

In [122]:
torch.manual_seed(1)
print(sample(model, starting_str='The island',scale_factor=2.0))

The island was necessary to employ and struck the corral. The settlers were now about the south coast, and the balloon in the meanwhile the last rest of the
cavern with a
way the balloon was diminished by the castaways had scarcely expected to any of the balloon and his companions formed the door. The colonists had been seen that the settlers were allowed to the windows, was brought to the sand of the sky, and he would proceed to the
south.

“And this day the day before an exploration, was all the aid of 


alpha = 0.5 => more randomness:

In [123]:
torch.manual_seed(1)
print(sample(model, starting_str='The island',scale_factor=0.5))

The islandresses or
a gig had notlieding thanks offy? Halpo
Nen Now? Hemisphere.”

The merespotrial was, unneclined Girinots will, but
Parking, two cheezinglyas, it issued-decerribed
five hend! We douds, ebbing smoutn, saw it exacrieter or, glass, largenetal one aim, on becuminory.

Capean podo. ‘Hvice ingenieds, alleging nilled a bjegile; they wenking, and letc with head admiss of trabins. Abervive tof, 6 counly Pencrspas had vexed
ofwerly, freeting edoth. The colonists found it aftalliue, “but happonb a


In [129]:
torch.manual_seed(1)
print(sample(model, starting_str='The island',scale_factor=2.5))

The island was necessary to employ the shore.

The colonists were prevented to the sand were on the sand of the sea--the storm had been able to return to the sea.

The next day, the 22nd of the part of the “Bonadventure” was the sea.

The settlers looked at the same sole and soon became more difficult to give the sea.

The sailor and the sailor and Neb and Pencroft and his companions should be enough to rest and on the same moment of two way to the south coast of the cavern which had landed on the sand
of
