# <span style="color:#0b486b">  FIT3181/5215: Deep Learning (2025)</span>
***
*CE/Lecturer (Clayton):*  **Dr Trung Le** | trunglm@monash.edu <br/>
*Lecturer (Clayton):* **A/Prof Zongyuan Ge** | zongyuan.ge@monash.edu <br/>
*Lecturer (Malaysia):*  **Dr Arghya Pal** | arghya.pal@monash.edu <br/>
 <br/>
*Head Tutor 3181:*  **Ms Ruda Nie H** |  \[RudaNie.H@monash.edu \] <br/>
*Head Tutor 5215:*  **Ms Leila Mahmoodi** |  \[leila.mahmoodi@monash.edu \]

<br/> <br/>
Faculty of Information Technology, Monash University, Australia
***

# <span style="color:#0b486b">Tutorial 7c (Additional Reading): RNN for Text Generation</span> <span style="color:red">***</span> #

This tutorial shows one application of RNNs: generating text or other sequences. We train an RNN using the maximum likelihood principle (that is, we maximize the log-likelihood of each next character given the characters before it) and then use the trained RNN to generate text that imitates the existing texts in the training dataset.
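
As a sketch of this objective, assume a training sequence of characters $x_1, \dots, x_T$ and model parameters $\theta$ (this notation is introduced here for illustration and does not appear in the code). The quantity being maximized is the sum of next-character log-probabilities:

$$
\mathcal{L}(\theta) = \sum_{t=1}^{T-1} \log p_\theta\left(x_{t+1} \mid x_1, \dots, x_t\right)
$$

Minimizing the average cross-entropy between the predicted next-character distribution and the actual next character is equivalent to maximizing $\mathcal{L}(\theta)$.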

We first import the necessary modules.

## <span style="color:#0b486b">I. Download and preprocess data</span> ##

In [None]:
import os
import re
import shutil
import torch
import numpy as np
import torch.nn as nn

In [None]:
DATA_DIR = "."
CHECKPOINT_DIR = os.path.join(DATA_DIR, "checkpoints")
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

The function below downloads the text at each given URL, normalizes the whitespace, and splits the text into individual characters.

In [None]:
def download_and_read(urls):
    import urllib3

    http = urllib3.PoolManager()
    texts = []

    for url in urls:
        # Download the page; resp.data holds the raw response body as bytes
        resp = http.request("GET", url)
        text = resp.data.decode("utf8")

        # remove byte order mark
        text = text.replace("\ufeff", "")
        # collapse newlines and any other run of whitespace into a single space
        text = re.sub(r'\s+', " ", text)

        # extend() iterates over the string, so each character is added separately
        texts.extend(text)
    return texts

We download the dataset. The variable *texts* is a list containing every character of the downloaded texts, in order.

In [None]:
texts = download_and_read(["http://www.gutenberg.org/cache/epub/28885/pg28885.txt", "https://www.gutenberg.org/files/12/12-0.txt"])

In [None]:
print(texts[0:100])

['T', 'h', 'e', ' ', 'P', 'r', 'o', 'j', 'e', 'c', 't', ' ', 'G', 'u', 't', 'e', 'n', 'b', 'e', 'r', 'g', ' ', 'e', 'B', 'o', 'o', 'k', ' ', 'o', 'f', ' ', 'A', 'l', 'i', 'c', 'e', "'", 's', ' ', 'A', 'd', 'v', 'e', 'n', 't', 'u', 'r', 'e', 's', ' ', 'i', 'n', ' ', 'W', 'o', 'n', 'd', 'e', 'r', 'l', 'a', 'n', 'd', ' ', 'T', 'h', 'i', 's', ' ', 'e', 'b', 'o', 'o', 'k', ' ', 'i', 's', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ', 'u', 's', 'e', ' ', 'o', 'f', ' ', 'a', 'n', 'y', 'o', 'n', 'e', ' ']


We extract the vocabulary of all unique characters in this dataset and store it in *vocab*. We also build two dictionaries, *char2idx* and *idx2char*, to convert between characters and their indices.

In [None]:
# create the vocabulary
vocab = sorted(set(texts))
print("vocab size: {:d}".format(len(vocab)))
# create mapping from vocab chars to ints
char2idx = {c:i for i, c in enumerate(vocab)}
idx2char = {i:c for c, i in char2idx.items()}

vocab size: 93
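
To make the two mappings concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_vocab`, `toy_char2idx` and `toy_idx2char` are made up for this illustration; the construction mirrors the cell above.

```python
# Toy text standing in for the downloaded corpus (hypothetical example data)
toy_text = list("hello world")

# Same construction as in the tutorial: sorted unique characters, then two dictionaries
toy_vocab = sorted(set(toy_text))
toy_char2idx = {c: i for i, c in enumerate(toy_vocab)}
toy_idx2char = {i: c for c, i in toy_char2idx.items()}

# Encode the characters as indices, then decode them back
encoded = [toy_char2idx[c] for c in toy_text]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(toy_vocab)
print(encoded)
print(decoded)
```

Encoding followed by decoding returns the original string, which is the property the generation step relies on later.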


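We convert the characters in *texts* to their indices, stored in *texts_as_ints*, and then split the data into non-overlapping rows of 100 characters each, stored in *sequences*. Leftover characters that do not fill a complete row are dropped.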

In [None]:
# numericize the texts
texts_as_ints = np.array([char2idx[c] for c in texts])

# drop the remainder
data_size = len(texts_as_ints) // 100
texts_as_ints = texts_as_ints[: data_size * 100]

# sequences: [None, 100]
sequences = texts_as_ints.reshape(-1, 100)
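
The trimming and reshaping can be checked on a small array. This is a sketch with a row length of 4 instead of 100 so the result is easy to read; `toy_ints` and `seq_len` are names chosen for this example.

```python
import numpy as np

seq_len = 4                # the tutorial uses 100; 4 keeps the output readable
toy_ints = np.arange(10)   # stand-in for texts_as_ints

# Keep only as many items as fill complete rows, then reshape
num_rows = len(toy_ints) // seq_len
trimmed = toy_ints[: num_rows * seq_len]      # the last 2 items are dropped
toy_sequences = trimmed.reshape(-1, seq_len)  # shape: [num_rows, seq_len]

print(toy_sequences)
```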

We decode and print the first 5 sequences.

In [None]:
for item in sequences[:5]:
  ids = [idx2char[i] for i in item]
  print(''.join(ids))
  print("---")

The Project Gutenberg eBook of Alice's Adventures in Wonderland This ebook is for the use of anyone 
---
anywhere in the United States and most other parts of the world at no cost and with almost no restri
---
ctions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenbe
---
rg License included with this ebook or online at www.gutenberg.org. If you are not located in the Un
---
ited States, you will have to check the laws of the country where you are located before using this 
---


The next cell builds the inputs and the targets by shifting each sequence by one character. For example, if a sequence is \['I', 'l', 'o', 'v', 'e', 'D', 'L'\], the input is \['I', 'l', 'o', 'v', 'e', 'D'\] and the target is \['l', 'o', 'v', 'e', 'D', 'L'\].

Later, we feed \['I', 'l', 'o', 'v', 'e', 'D'\] to our RNN and train it to predict \['l', 'o', 'v', 'e', 'D', 'L'\], that is, the next character at every position. Both arrays are converted to ``torch.LongTensor``.

In [None]:
input_seq = torch.LongTensor(sequences[:, 0:-1])
output_seq = torch.LongTensor(sequences[:, 1:])
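
The same slicing can be tried on a toy batch of characters to see how inputs and targets line up. This is a sketch only; `toy_sequences` holds made-up data and uses characters rather than indices for readability.

```python
import numpy as np

# Two toy sequences of 7 characters each (hypothetical example data)
toy_sequences = np.array([list("IloveDL"), list("RNNsgen")])

inputs = toy_sequences[:, 0:-1]   # every character except the last
targets = toy_sequences[:, 1:]    # every character except the first

# At each position, the target is the character that follows the input
for x, y in zip(inputs[0], targets[0]):
    print(f"{x} -> {y}")
```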

We encapsulate our generation model in the class *CharGenModel*. The model has an embedding layer, a two-layer GRU, and a linear output layer that maps each hidden state to a score (logit) for every character in the vocabulary.

In [None]:
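class CharGenModel(nn.Module):
  def __init__(self, vocab_size, embedding_dim, hidden_dim):
    super().__init__()

    # Note: padding_idx=0 keeps the embedding of index 0 fixed at zeros. The data here
    # has no padding token, so this applies to whichever character has index 0 in vocab.
    self.embedding_layer = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
    self.rnn_layer = nn.GRU(embedding_dim, hidden_dim, num_layers=2, batch_first=True)
    self.dense_layer = nn.Linear(hidden_dim, vocab_size)

  def forward(self, x):
    e = self.embedding_layer(x)   # [batch, seq_len, embedding_dim]
    h, _ = self.rnn_layer(e)      # [batch, seq_len, hidden_dim]
    y = self.dense_layer(h)       # [batch, seq_len, vocab_size]
    return y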
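
A quick way to confirm the tensor shapes is to run a scaled-down copy of the model on random indices. The class `TinyCharModel` below is a hypothetical stand-in with the same three layers (without `padding_idx`) and much smaller sizes; it is a sketch for shape checking, not the tutorial's model.

```python
import torch
import torch.nn as nn

class TinyCharModel(nn.Module):  # hypothetical, scaled-down stand-in for CharGenModel
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_size, embedding_dim)
        self.rnn_layer = nn.GRU(embedding_dim, hidden_dim, num_layers=2, batch_first=True)
        self.dense_layer = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        e = self.embedding_layer(x)   # [batch, seq_len, embedding_dim]
        h, _ = self.rnn_layer(e)      # [batch, seq_len, hidden_dim]
        return self.dense_layer(h)    # [batch, seq_len, vocab_size]

tiny = TinyCharModel(vocab_size=10, embedding_dim=8, hidden_dim=16)
x = torch.randint(0, 10, (4, 7))      # 4 sequences of 7 character indices
logits = tiny(x)
print(logits.shape)
```

The output has one score per vocabulary entry at every position of every sequence.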

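We build the model and create a DataLoader. The DataLoader iterates over shuffled row indices rather than over the data itself; each batch of indices is later used to slice *input_seq* and *output_seq*.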

In [None]:
vocab_size = len(vocab)
embedding_dim = 512
hidden_dim = 1024

In [None]:
model = CharGenModel(vocab_size, embedding_dim, hidden_dim)

In [None]:
from torch.utils.data import DataLoader
indices = list(range(input_seq.size(0)))
loader = DataLoader(indices, batch_size=64, shuffle=True)

We define the loss function: the cross-entropy between the predicted and the actual next character, averaged over all time steps and all sequences in the batch.

In [None]:
loss_fn = nn.CrossEntropyLoss(reduction="mean")
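
The sketch below illustrates what `reduction="mean"` computes for sequence outputs. It assumes toy sizes and random logits (all values are made up), and compares the built-in loss with the mean negative log-probability of the targets computed by hand. For sequence inputs, `nn.CrossEntropyLoss` expects the class dimension second, hence the `permute`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, vocab = 2, 5, 7                # toy sizes, chosen for illustration
logits = torch.randn(batch, seq_len, vocab)    # same layout as the model output
targets = torch.randint(0, vocab, (batch, seq_len))

loss_fn = nn.CrossEntropyLoss(reduction="mean")
# Move the class dimension to position 1: [batch, vocab, seq_len]
loss = loss_fn(logits.permute(0, 2, 1), targets)

# Same value by hand: average negative log-probability assigned to each target
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()

print(loss.item(), manual.item())
```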

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

To generate text, we start from a *prefix_string*. We convert this string to a list of indices and build a 2D tensor *inputs* from it, whose first (batch) dimension has size $1$. We feed *inputs* to the model, take the scores *preds* for the last position, and pick the most likely character *pred_id* (greedy decoding). We then append *pred_id* to *inputs*, drop the oldest character so the window length stays fixed, and repeat.

In [None]:
def generate_text(model, prefix_string, char2idx, idx2char, device, num_chars_to_generate=1000):
  chars = [char2idx[s] for s in prefix_string]
  inputs = torch.LongTensor(chars)
  inputs = inputs.unsqueeze(dim=0).to(device)  # [1, len(prefix_string)]
  text_generated = []
  model.eval()
  with torch.no_grad():  # gradients are not needed for generation
    for i in range(num_chars_to_generate):
      preds = model(inputs)     # [1, seq_len, vocab_size]
      preds = preds[:, -1, :]   # scores for the next character
      preds = preds.squeeze(dim=0)
      pred_id = preds.argmax()  # greedy choice: the most likely character
      text_generated.append(idx2char[pred_id.item()])
      # slide the window: drop the oldest character, append the new one
      inputs = torch.cat((inputs[:, 1:], pred_id.view(1,1)), dim=1)

  return prefix_string + "".join(text_generated)
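
`generate_text` always picks the most likely character with `argmax`. A common alternative is to sample the next character from the softmax distribution, optionally sharpened or flattened by a temperature. The helper `sample_next_id` and its `temperature` parameter below are hypothetical additions, shown as a sketch on made-up logits; they are not part of the tutorial code.

```python
import torch

def sample_next_id(logits, temperature=1.0):
    """Sample one index from a 1D tensor of logits (hypothetical helper)."""
    # Lower temperature -> closer to argmax; higher temperature -> more random
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(0)

torch.manual_seed(0)
toy_logits = torch.tensor([2.0, 1.0, 0.1, -1.0])  # made-up scores for a 4-character vocabulary

greedy_id = toy_logits.argmax()
sampled_ids = [sample_next_id(toy_logits, temperature=0.8).item() for _ in range(10)]

print(greedy_id.item(), sampled_ids)
```

Under these assumptions, replacing `preds.argmax()` with `sample_next_id(preds)` inside `generate_text` would give sampled rather than greedy text, which may reduce repetitive loops in the generated output.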

We start the training procedure. First, we move the model to the GPU if one is available; otherwise it stays on the CPU.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

CharGenModel(
  (embedding_layer): Embedding(93, 512, padding_idx=0)
  (rnn_layer): GRU(512, 1024, num_layers=2, batch_first=True)
  (dense_layer): Linear(in_features=1024, out_features=93, bias=True)
)

In [None]:
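def train_epoch(model, optimizer, input_seq, output_seq, loader, criterion, device):
    model.train()
    losses = 0
    for idx in loader:
        # idx is a batch of row indices produced by the DataLoader
        x = input_seq[idx, :].to(device)
        target_seq = output_seq[idx, :].to(device)

        predicted_seq = model(x)                      # [batch, seq_len, vocab_size]
        predicted_seq = predicted_seq.permute(0,2,1)  # [batch, vocab_size, seq_len]
        # use the criterion passed in, not the global loss_fn
        loss = criterion(predicted_seq, target_seq)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses += loss.item()
    # average loss over the batches of this epoch
    return losses / len(loader)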


In [None]:
num_epochs = 50
curr_loss = float("inf")  # lowest training loss seen so far
for epoch in range(1, num_epochs + 1):
  train_loss = train_epoch(model, optimizer, input_seq, output_seq, loader, loss_fn, device)
  msg = f"Epoch: {epoch}/{num_epochs} - loss = {train_loss:.3f}"
  print(msg)

  # generate a sample every time the training loss reaches a new minimum
  if train_loss < curr_loss:
    gen_text = generate_text(model, "Alice opened the door", char2idx, idx2char, device, num_chars_to_generate=100)
    curr_loss = train_loss
    print(gen_text)
    print("---")


Epoch: 1/50 - loss = 2.487
Alice opened the door the was she said the Queen the was she said the Queen the was she said the Queen the was she said t
---
Epoch: 2/50 - loss = 1.694
Alice opened the door of the said the Queen she said the Queen she said the Queen she said the Queen she said the Queen s
---
Epoch: 3/50 - loss = 1.448
Alice opened the door of the same of the same of the same of the same of the same of the same of the same of the same of 
---
Epoch: 4/50 - loss = 1.303
Alice opened the door with the same sharp and the table so much a little brook, and the Red Queen said to herself, and th
---
Epoch: 5/50 - loss = 1.202
Alice opened the door was the copyright law in the words and she was a long way of the court, and the bottle of the court
---
Epoch: 6/50 - loss = 1.115
Alice opened the door with the trademark license in the way the rest of the work and the procession or entity to see it w
---
Epoch: 7/50 - loss = 1.037
Alice opened the door in the same thing as she could se
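
The *CHECKPOINT_DIR* created at the start of the tutorial is not used by the training loop. The following is a minimal sketch of how model weights could be saved and restored with `torch.save` and `load_state_dict`. It uses a small stand-in model, a temporary directory in place of *CHECKPOINT_DIR*, and a hypothetical file name `char_gen_model.pt`.

```python
import os
import tempfile
import torch
import torch.nn as nn

# Small stand-in model; in the tutorial this would be the trained CharGenModel
toy_model = nn.GRU(input_size=4, hidden_size=8, batch_first=True)

with tempfile.TemporaryDirectory() as checkpoint_dir:          # stands in for CHECKPOINT_DIR
    path = os.path.join(checkpoint_dir, "char_gen_model.pt")   # hypothetical file name
    torch.save(toy_model.state_dict(), path)                   # save the weights only

    # Rebuild a model with the same architecture, then load the saved weights
    restored = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

# Check that every parameter tensor was restored exactly
same = all(
    torch.equal(a, b)
    for a, b in zip(toy_model.state_dict().values(), restored.state_dict().values())
)
print(same)
```

In the training loop, such a save could be placed inside the `if train_loss < curr_loss:` branch so that the weights with the lowest training loss are kept.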

In [None]:
# Test generations
gen_text = generate_text(model, "The Queen said", char2idx, idx2char, device, num_chars_to_generate=100)
print(gen_text)

The Queen said to the White King went how fall upon Alice, and tried to stroke it; but it was _very_ dry; and then


---
### <span style="color:#0b486b"> <div  style="text-align:center">**THE END**</div> </span>