

# Homework 2 - Recurrent Neural Networks

In this part of the homework we are going to work with Recurrent Neural Networks, in particular the GRU. A key strength of recurrent networks on sequence data is their ability to retain information from several timesteps in the past. We are going to explore that property by constructing an 'echo' Recurrent Neural Network.

The goal here is to make a model that, given a sequence of letters or digits, outputs that same sequence with a certain delay. Suppose the input is the string 'abacaba' and the delay is 3. We want the model to output nothing for the first 3 steps and then output the original string step by step, which leaves out its last 3 characters. The target output is therefore 'XXXabac', where 'X' stands for empty output.
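
As a rough illustration of the target described above, the following plain-Python sketch builds the delayed string. The helper name `delayed_echo` and its `blank` parameter are made up for this example and do not appear in the assignment code.

```python
def delayed_echo(s, delay, blank="X"):
    """Return s shifted right by `delay` positions, padded with `blank`.

    The output has the same length as the input, so the last `delay`
    characters of s are dropped.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    # Prepend the padding, then cut back to the original length.
    return (blank * delay + s)[:len(s)]


print(delayed_echo("abacaba", 3))  # XXXabac
```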

This is similar to [this notebook](https://github.com/Atcold/pytorch-Deep-Learning/blob/master/09-echo_data.ipynb) (which you should refer to when doing this assignment), except we're working not with a binary string, but with a sequence of integers between 0 and some N. In our case N is 26, which is the number of letters in the alphabet.
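
One way to picture the integer encoding: each lowercase letter maps to an integer from 1 to 26, and 0 is left for the empty output. This mirrors the convention used by `test_run` in the model cells. The helper names `encode` and `decode` below are illustrative only.

```python
N = 26  # number of letters in the English alphabet


def encode(s):
    """Map 'a'..'z' to 1..26 (0 is reserved for the empty output)."""
    return [ord(c) - ord("a") + 1 for c in s]


def decode(indices, blank=" "):
    """Inverse of encode; index 0 is rendered as `blank`."""
    return "".join(blank if i == 0 else chr(i + ord("a") - 1) for i in indices)


codes = encode("abz")
print(codes)                          # [1, 2, 26]
print(repr(decode([0, 0] + codes)))   # '  abz'
```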

## Dataset

Let's implement the dataset. In our case, the data is basically infinite, as we can always generate more examples on the fly, so there's no need to load it from disk.

In [1]:
import random
import string

import torch

# Max value of the generated integer. 26 is chosen because it's
# the number of letters in the English alphabet.
N = 26


def idx_to_onehot(x, k=N+1):
  """ Converts the generated integers to one-hot vectors """
  # Rows of the identity matrix are the one-hot vectors for each index.
  ones = torch.eye(k)
  shape = x.shape
  res = ones.index_select(0, x.view(-1).type(torch.int64))
  return res.view(*shape, res.shape[-1])


class EchoDataset(torch.utils.data.IterableDataset):

  def __init__(self, delay=4, seq_length=15, size=1000):
    self.delay = delay
    self.seq_length = seq_length
    self.size = size

  def __len__(self):
    return self.size

  def __iter__(self):
    """ Iterable dataset doesn't have to implement __getitem__.
        Instead, we only need to implement __iter__ to return
        an iterator (or generator).
    """
    for _ in range(self.size):
      seq = torch.tensor([random.choice(range(1, N + 1)) for i in range(self.seq_length)], dtype=torch.int64)
      result = torch.cat((torch.zeros(self.delay), seq[:self.seq_length - self.delay])).type(torch.int64)
      yield seq, result

DELAY = 4
DATASET_SIZE = 200000
ds = EchoDataset(delay=DELAY, size=DATASET_SIZE)
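
To make the structure of one sample concrete, here is a plain-Python sketch (no PyTorch) of what `EchoDataset.__iter__` yields: a sequence of integers in 1..N, and the same sequence shifted right by `delay` with zeros filling the first `delay` positions. The function name `make_echo_pair` and the explicit `rng` argument are assumptions made for this illustration, and the sketch assumes `0 <= delay <= seq_length`.

```python
import random

N = 26


def make_echo_pair(delay=4, seq_length=15, rng=random):
    """Return (seq, target) as lists of ints, mirroring one dataset sample."""
    seq = [rng.randint(1, N) for _ in range(seq_length)]
    # Zeros stand for "no output yet"; the tail of seq is dropped.
    target = [0] * delay + seq[:seq_length - delay]
    return seq, target


seq, target = make_echo_pair(delay=4, seq_length=15, rng=random.Random(0))
assert len(seq) == len(target) == 15
assert target[:4] == [0, 0, 0, 0]
assert target[4:] == seq[:-4]
```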

## Model

Now we want to implement the model, which for our purposes is a GRU followed by a decoder. The decoder is responsible for decoding the GRU hidden state to yield a prediction for the next output. The parts you are responsible for filling in with your code are marked with `TODO`.

In [2]:
class GRUMemory(torch.nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.gru = torch.nn.GRU(input_size=27, hidden_size=hidden_size, num_layers=2,
                               batch_first=True, dropout=0.1)
        self.output_layer = torch.nn.Linear(hidden_size, 27)

    def forward(self, x):
        # x shape: (batch_size, seq_length, 27)
        gru_out, _ = self.gru(x)
        # Apply output layer to get logits
        logits = self.output_layer(gru_out)
        return logits

    @torch.no_grad()
    def test_run(self, s):
        # Convert string to indices (1-26 for a-z)
        indices = torch.tensor([ord(c) - ord('a') + 1 for c in s], dtype=torch.long)
        # Convert to one-hot encoding
        one_hot = torch.zeros(len(s), 27)
        for i, idx in enumerate(indices):
            one_hot[i, idx] = 1.0

        # Add batch dimension
        one_hot = one_hot.unsqueeze(0)

        # Get predictions
        logits = self.forward(one_hot)
        predictions = torch.argmax(logits, dim=-1).squeeze(0)

        # Convert back to string
        result = ""
        for pred in predictions:
            if pred == 0:
                result += " "
            else:
                result += chr(pred.item() + ord('a') - 1)

        return result

## Training
Below you need to implement the training of the model. You have more freedom in how you implement it. The two constraints are that training has to finish within 10 minutes and that the error rate has to be below 1%.

In [3]:
def test_model(model, sequence_length=15):
  """
  This is the test function that runs 500 random strings (15 to 25
  characters long) through your model and returns the accuracy, i.e.
  the fraction of characters that were echoed correctly.
  """
  total = 0
  correct = 0
  for _ in range(500):
    s = ''.join([random.choice(string.ascii_lowercase) for _ in range(random.randint(15, 25))])
    result = model.test_run(s)
    for c1, c2 in zip(s[:-DELAY], result[DELAY:]):
      correct += int(c1 == c2)
    total += len(s) - DELAY

  return correct / total

In [4]:
import time
import torch.optim as optim
from torch.utils.data import DataLoader

start_time = time.time()

# Training parameters - optimized for speed
HIDDEN_SIZE = 64
BATCH_SIZE = 256
LEARNING_RATE = 0.003
EPOCHS = 4

# Initialize model and dataset with smaller dataset for faster training
model = GRUMemory(HIDDEN_SIZE)
dataset = EchoDataset(delay=DELAY, size=100000)  # Reduced dataset size
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)

# Training setup
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

print("Starting training...")
model.train()

for epoch in range(EPOCHS):
    total_loss = 0
    batch_count = 0

    for seq, target in dataloader:
        # Convert to one-hot encoding
        seq_one_hot = idx_to_onehot(seq)

        # Forward pass
        optimizer.zero_grad()
        output = model(seq_one_hot)

        # Calculate loss (flatten for cross entropy)
        loss = criterion(output.view(-1, 27), target.view(-1))

        # Backward pass
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        batch_count += 1

    avg_loss = total_loss / batch_count
    print(f"Epoch {epoch+1}/{EPOCHS}, Average Loss: {avg_loss:.6f}")

end_time = time.time()
duration = end_time - start_time
print(f"Training completed in {duration:.2f} seconds")

# Test the model in eval mode so that dropout is disabled during evaluation
model.eval()
accuracy = test_model(model)
print(f"Final accuracy: {accuracy:.4f}")

assert duration < 600, f'execution took {duration:.2f} seconds, which is longer than 10 mins'
assert accuracy > 0.99, f'accuracy is too low, got {accuracy}, need 0.99'
print('tests passed')

Starting training...
Epoch 1/4, Average Loss: 1.109960
Epoch 2/4, Average Loss: 0.008073
Epoch 3/4, Average Loss: 0.001811
Epoch 4/4, Average Loss: 0.000828
Training completed in 69.39 seconds
Final accuracy: 1.0000
tests passed
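
For a sense of how much optimization the run above performs, the number of parameter updates follows from the dataset size, batch size, and epoch count set in the training cell. The sketch assumes the `DataLoader` keeps the final partial batch, which is its default behavior (`drop_last=False`).

```python
import math

dataset_size = 100_000
batch_size = 256
epochs = 4

# The last batch of each epoch is smaller than batch_size.
batches_per_epoch = math.ceil(dataset_size / batch_size)
last_batch = dataset_size - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs

print(batches_per_epoch)  # 391
print(last_batch)         # 160
print(total_updates)      # 1564
```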


## Variable delay model

Now, to make this more complicated, we want to have a variable delay. The goal is now to transform a sequence of (character, delay) pairs into the character sequence shifted by the given delay. The delay is constant within one sequence.
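
A small plain-Python sketch of this input/output relationship, again writing 'X' for empty output. The helper name `variable_delay_example` is hypothetical and only serves to show the shape of the data.

```python
def variable_delay_example(s, delay, blank="X"):
    """Return the (character, delay) input pairs and the delayed target."""
    pairs = [(c, delay) for c in s]  # every step carries the same delay
    target = (blank * delay + s)[:len(s)]
    return pairs, target


pairs, target = variable_delay_example("abacaba", 2)
print(pairs[:3])  # [('a', 2), ('b', 2), ('a', 2)]
print(target)     # XXabaca
```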

### Dataset
As before, we first implement the dataset:

In [5]:
class VariableDelayEchoDataset(torch.utils.data.IterableDataset):

  def __init__(self, max_delay=8, seq_length=20, size=1000):
    self.max_delay = max_delay
    self.seq_length = seq_length
    self.size = size

  def __len__(self):
    return self.size

  def __iter__(self):
    for _ in range(self.size):
      seq = torch.tensor([random.choice(range(1, N + 1)) for i in range(self.seq_length)], dtype=torch.int64)
      delay = random.randint(0, self.max_delay)
      result = torch.cat((torch.zeros(delay), seq[:self.seq_length - delay])).type(torch.int64)
      yield seq, delay, result

### Model

And the model.

In [6]:
class VariableDelayGRUMemory(torch.nn.Module):
    def __init__(self, hidden_size, max_delay):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_delay = max_delay

        # Delay embedding layer
        self.delay_embedding = torch.nn.Embedding(max_delay + 1, 16)

        # GRU with concatenated input (one-hot + delay embedding)
        self.gru = torch.nn.GRU(input_size=27 + 16, hidden_size=hidden_size,
                               num_layers=2, batch_first=True, dropout=0.1)

        # Output layer
        self.output_layer = torch.nn.Linear(hidden_size, 27)

    def forward(self, x, delays):
        # x shape: (batch_size, seq_length, 27)
        # delays shape: (batch_size,)

        batch_size, seq_length = x.shape[0], x.shape[1]

        # Expand delay to match sequence length and embed
        delay_expanded = delays.unsqueeze(1).expand(-1, seq_length)  # (batch_size, seq_length)
        delay_embedded = self.delay_embedding(delay_expanded)  # (batch_size, seq_length, 16)

        # Concatenate input with delay embedding
        combined_input = torch.cat([x, delay_embedded], dim=-1)  # (batch_size, seq_length, 43)

        # Pass through GRU
        gru_out, _ = self.gru(combined_input)

        # Apply output layer
        logits = self.output_layer(gru_out)

        return logits

    @torch.no_grad()
    def test_run(self, s, delay):
        # Convert string to indices (1-26 for a-z)
        indices = torch.tensor([ord(c) - ord('a') + 1 for c in s], dtype=torch.long)

        # Convert to one-hot encoding
        one_hot = torch.zeros(len(s), 27)
        for i, idx in enumerate(indices):
            one_hot[i, idx] = 1.0

        # Add batch dimension
        one_hot = one_hot.unsqueeze(0)
        delay_tensor = torch.tensor([delay], dtype=torch.long)

        # Get predictions
        logits = self.forward(one_hot, delay_tensor)
        predictions = torch.argmax(logits, dim=-1).squeeze(0)

        # Convert back to string
        result = ""
        for pred in predictions:
            if pred == 0:
                result += " "
            else:
                result += chr(pred.item() + ord('a') - 1)

        return result


### Train

As before, you're free to do what you want, as long as training finishes within 10 minutes and accuracy is above 0.99 for delays between 0 and 8.

In [7]:
def test_variable_delay_model(model, seq_length=20):
  """
  This is the test function that runs 500 random strings of length
  seq_length through your model, each with a random delay between 0 and
  model.max_delay, and returns the accuracy, i.e. the fraction of
  characters that were echoed correctly.
  """
  total = 0
  correct = 0
  for _ in range(500):
    s = ''.join([random.choice(string.ascii_lowercase) for _ in range(seq_length)])
    d = random.randint(0, model.max_delay)
    result = model.test_run(s, d)
    if d > 0:
      z = zip(s[:-d], result[d:])
    else:
      z = zip(s, result)
    for c1, c2 in z:
      correct += int(c1 == c2)
    total += len(s) - d

  return correct / total
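
The `if d > 0` branch in the test function matters because of how Python slicing treats a zero bound: `s[:-0]` is the same as `s[:0]`, which is the empty string, so a delay of 0 would otherwise compare nothing. A minimal sketch of the same alignment written without the special case, using a hypothetical helper `aligned_pairs`:

```python
def aligned_pairs(s, result, d):
    """Pair each input character with the output expected d steps later."""
    # len(s) - d is a non-negative bound, so d == 0 keeps the whole string.
    return list(zip(s[:len(s) - d], result[d:]))


s = "abcdef"
print(repr(s[:-0]))                   # ''
print(aligned_pairs(s, s, 0)[:2])     # [('a', 'a'), ('b', 'b')]
print(aligned_pairs(s, "  abcd", 2))  # [('a', 'a'), ('b', 'b'), ('c', 'c'), ('d', 'd')]
```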

In [8]:
import time
import torch.optim as optim
from torch.utils.data import DataLoader

start_time = time.time()

# Training parameters - balanced speed and accuracy
MAX_DELAY = 8
SEQ_LENGTH = 20
HIDDEN_SIZE = 96
BATCH_SIZE = 384
EPOCHS = 8
LEARNING_RATE = 0.002

# Initialize model and dataset with more data for better accuracy
model = VariableDelayGRUMemory(hidden_size=HIDDEN_SIZE, max_delay=MAX_DELAY)
dataset = VariableDelayEchoDataset(max_delay=MAX_DELAY, seq_length=SEQ_LENGTH, size=200000)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)

# Training setup with learning rate scheduler
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)

print("Starting variable delay training...")
model.train()

for epoch in range(EPOCHS):
    total_loss = 0
    batch_count = 0

    for seq, delays, target in dataloader:
        # Convert to one-hot encoding
        seq_one_hot = idx_to_onehot(seq)

        # Forward pass
        optimizer.zero_grad()
        output = model(seq_one_hot, delays)

        # Calculate loss (flatten for cross entropy)
        loss = criterion(output.view(-1, 27), target.view(-1))

        # Backward pass
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        batch_count += 1

    avg_loss = total_loss / batch_count
    scheduler.step()
    print(f"Epoch {epoch+1}/{EPOCHS}, Average Loss: {avg_loss:.6f}, LR: {optimizer.param_groups[0]['lr']:.6f}")

end_time = time.time()
duration = end_time - start_time
print(f"Variable delay training completed in {duration:.2f} seconds")

# Test the model in eval mode so that dropout is disabled during evaluation
model.eval()
accuracy = test_variable_delay_model(model, seq_length=SEQ_LENGTH)
print(f"Final variable delay accuracy: {accuracy:.4f}")

assert end_time - start_time < 600, 'executing took longer than 10 mins'
assert test_variable_delay_model(model, seq_length=SEQ_LENGTH) > 0.99, 'accuracy is too low'
print('tests passed')

Starting variable delay training...
Epoch 1/8, Average Loss: 1.240476, LR: 0.002000
Epoch 2/8, Average Loss: 0.246605, LR: 0.002000
Epoch 3/8, Average Loss: 0.095953, LR: 0.001600
Epoch 4/8, Average Loss: 0.044215, LR: 0.001600
Epoch 5/8, Average Loss: 0.026236, LR: 0.001600
Epoch 6/8, Average Loss: 0.014477, LR: 0.001280
Epoch 7/8, Average Loss: 0.009236, LR: 0.001280
Epoch 8/8, Average Loss: 0.007060, LR: 0.001280
Variable delay training completed in 507.89 seconds
Final variable delay accuracy: 0.9980
tests passed
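
The learning rates in the log above are consistent with a step decay that multiplies the rate by `gamma` once every `step_size` scheduler steps. Below is a plain-Python sketch of that arithmetic, assuming the rate is read right after `scheduler.step()` at the end of each epoch, as the training loop does. The function name `step_lr` is made up here and is not a PyTorch API.

```python
def step_lr(base_lr, gamma, step_size, steps_taken):
    """Learning rate after `steps_taken` scheduler steps under step decay."""
    return base_lr * gamma ** (steps_taken // step_size)


# Values used in the training cell: lr=0.002, step_size=3, gamma=0.8.
for epoch in range(1, 9):
    print(epoch, f"{step_lr(0.002, 0.8, 3, epoch):.6f}")
# Epochs 1-2 print 0.002000, epochs 3-5 print 0.001600,
# and epochs 6-8 print 0.001280.
```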
