<a href="https://colab.research.google.com/github/Chryron/CSC2516_NN-DL/blob/main/CSC2516_Homework_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 6

In this homework you will train and use a "char-RNN", the name given to a character-level recurrent neural network language model in [this famous blog post by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). Before you start on the rest of the homework, please read the blog post; it's quite good!

I don't expect you to implement the char-RNN from scratch. Andrej's original char-rnn is written in Torch (the predecessor to PyTorch, which is no longer commonly used). Fortunately, many other implementations of this model are available; for example, there is one (in both MXNet and PyTorch) in chapters 8 and 9 of [the textbook](http://d2l.ai), and another PyTorch one [here](https://github.com/spro/char-rnn.pytorch). **Please use one of these example implementations (or another one that you find) when completing this homework**.

For this homework, please complete the following steps:

1. Download and tokenize the [Shakespeare dataset](http://www.gutenberg.org/files/100/100-0.txt) at a character level. I recommend basing your solution on the following code:
```Python
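import re

# `text` is the full contents of the downloaded file, read in as one string.
# Remove non-alphabetical characters, lowercase, and collapse whitespace runs to ' '.
# Whitespace (including newlines) is kept until the split so that words on
# adjacent lines are not fused together.
raw_dataset = ' '.join(re.sub(r'[^A-Za-z\s]+', '', text).lower().split())
# Maps token index to character (sorted so the mapping is the same on every run)
idx_to_char = sorted(set(raw_dataset))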
# Maps character to token index
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
# Tokenize the dataset
corpus_indices = [char_to_idx[char] for char in raw_dataset]
```
1. Train a "vanilla" RNN (as described in chapter 9 of [the textbook](http://d2l.ai)) on the Shakespeare dataset. Report the training loss and generate some samples from the model at the end of training.
1. Train a GRU RNN (as described in chapter 10 of [the textbook](http://d2l.ai)) on the Shakespeare dataset. Is the final training loss higher or lower than the vanilla RNN? Are the samples from the model more or less realistic?
1. Find a smaller, simpler dataset than the Shakespeare data (you can find some ideas in Andrej's blog post, but feel free to get creative!) and train either the vanilla or GRU RNN on it instead. Is the final training loss higher or lower than it was for the Shakespeare data?
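
When comparing final training losses across the steps above, a reference point can help. The sketch below assumes the reported loss is the mean per-character cross-entropy in nats (the default behaviour of `nn.CrossEntropyLoss`) and a 27-symbol vocabulary (a to z plus space, which is what the cleaning in step 1 produces for English text). `uniform_baseline` and `perplexity` are illustrative helper names, not part of any library.

```python
import math

def uniform_baseline(vocab_size):
    """Cross-entropy (nats) of a model that guesses every character uniformly."""
    return math.log(vocab_size)

def perplexity(mean_loss):
    """Convert a mean per-character cross-entropy (nats) to perplexity."""
    return math.exp(mean_loss)

vocab_size = 27  # assumption: a-z plus space
print(f"uniform-guess loss: {uniform_baseline(vocab_size):.3f}")
# 1.5 is a made-up loss value, used only to show the conversion
print(f"perplexity at loss 1.5: {perplexity(1.5):.2f}")
```

A trained model should end up well below the uniform-guess loss; how far below is what the comparison between the vanilla RNN, the GRU, and the smaller dataset is about.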

In [64]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import DataLoader, Dataset

import IPython
import os
import re

In [65]:
# Get the path of the current Jupyter Notebook
notebook_path = IPython.get_ipython().starting_dir
data_dir = os.path.join(notebook_path, '../data/')
shakespeare = os.path.join(data_dir, 'shakespeare/100-0.txt')

# Initialize an empty list to store the lines after 'Contents'
relevant_lines = []

# Read the file and skip lines until 'Contents'
with open(shakespeare, 'r', encoding='utf-8') as f:
    for line in f:
        if 'contents' in line.strip().lower():
            break
    # Read and store the remaining lines
    relevant_lines = f.readlines()

# Convert the list of lines to a single string
text = ''.join(relevant_lines)

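# Apply regex cleaning
# Remove non-alphabetical characters, lowercase, and collapse whitespace runs to ' '.
# Whitespace (including newlines) is kept until the split so that the last word
# of one line is not fused with the first word of the next.
raw_dataset = ' '.join(re.sub(r'[^A-Za-z\s]+', '', text).lower().split())

# Maps token index to character (sorted so the mapping is the same on every run)
idx_to_char = sorted(set(raw_dataset))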
# Maps character to token index
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
# Tokenize the dataset
corpus_indices = [char_to_idx[char] for char in raw_dataset]
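
A quick way to sanity-check a tokenizer like this is a round trip on a toy string: encode, decode, and compare. The sketch below is a standalone variant, not the cell above. It uses a cleaning regex that keeps all whitespace until the split, so the line break in the toy text becomes a single space. The toy sentence is only an illustration.

```python
import re

text = "To be, or not to be:\nthat is the question."  # toy stand-in for the corpus

# Drop everything except letters and whitespace, lowercase, collapse whitespace
raw = ' '.join(re.sub(r'[^A-Za-z\s]+', '', text).lower().split())

idx_to_char = sorted(set(raw))                        # index -> character
char_to_idx = {c: i for i, c in enumerate(idx_to_char)}  # character -> index
indices = [char_to_idx[c] for c in raw]               # encode

decoded = ''.join(idx_to_char[i] for i in indices)    # decode
assert decoded == raw                                 # the round trip is lossless
print(raw)
print(len(idx_to_char), "distinct characters")
```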

In [82]:
class CharRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, model="vanilla", n_layers=1):
        super(CharRNN, self).__init__()
        self.model = model.lower()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers

        self.encoder = nn.Embedding(input_size, hidden_size)
        if self.model == "vanilla":
            self.rnn = nn.RNN(hidden_size, hidden_size, n_layers)
        elif self.model == "lstm":
            self.rnn = nn.LSTM(hidden_size, hidden_size, n_layers)
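        elif self.model == "gru":
            self.rnn = nn.GRU(hidden_size, hidden_size, n_layers)
        else:
            # Fail early instead of hitting an AttributeError in forward()
            raise ValueError(f"Unknown model type: {model!r} (expected 'vanilla', 'lstm', or 'gru')")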

        self.decoder = nn.Linear(hidden_size, output_size)

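    def forward(self, input, hidden):
        # input: (batch, seq_len) tensor of token indices
        encoded = self.encoder(input)  # (batch, seq_len, hidden_size)
        # The recurrent layers expect (seq_len, batch, hidden_size). Swap the axes
        # with transpose; .view() would reinterpret memory and mix up sequences.
        output, hidden = self.rnn(encoded.transpose(0, 1), hidden)
        # Back to (batch, seq_len, hidden_size); contiguous so callers can .view() it
        output = self.decoder(output.transpose(0, 1).contiguous())
        return output, hidden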

    # def forward2(self, input, hidden):
    #     encoded = self.encoder(input.view(1, -1))
    #     output, hidden = self.rnn(encoded.view(1, 1, -1), hidden)
    #     output = self.decoder(output.view(1, -1))
    #     return output, hidden
    
    def init_hidden(self, batch_size):
        # Plain tensors suffice; Variable has been a no-op wrapper since PyTorch 0.4
        if self.model == "lstm":
            return (torch.zeros(self.n_layers, batch_size, self.hidden_size),
                    torch.zeros(self.n_layers, batch_size, self.hidden_size))
        return torch.zeros(self.n_layers, batch_size, self.hidden_size)
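
Because the recurrent layers are constructed without `batch_first=True`, they expect input shaped `(seq_len, batch, features)`, while the data loader yields `(batch, seq_len)`. The sketch below, on a tiny made-up tensor, shows why that axis swap has to be a `transpose` and not a `view`: both produce the right shape, but only `transpose` keeps each sequence intact.

```python
import torch

batch, seq_len, feat = 2, 3, 1
x = torch.arange(batch * seq_len * feat).view(batch, seq_len, feat)
# x[0] is sequence 0: values 0, 1, 2; x[1] is sequence 1: values 3, 4, 5

transposed = x.transpose(0, 1)         # (seq_len, batch, feat), axes swapped
viewed = x.view(seq_len, batch, feat)  # same shape, but memory is just re-chunked

# "Time step 0 across the batch" should hold the first element of each sequence
step0_transposed = transposed[0].flatten().tolist()
step0_viewed = viewed[0].flatten().tolist()
print(step0_transposed)  # first element of sequence 0 and of sequence 1
print(step0_viewed)      # first two elements of sequence 0 only
```

An alternative is to pass `batch_first=True` to the recurrent layer, in which case no axis swap is needed.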


In [83]:
class CharDataset(Dataset):
    def __init__(self, data, sequence_length):
        self.data = data
        self.sequence_length = sequence_length
        self.total_sequences = len(data) - sequence_length

    def __len__(self):
        return self.total_sequences

    def __getitem__(self, idx):
        return (
            torch.tensor(self.data[idx:idx+self.sequence_length], dtype=torch.long),
            torch.tensor(self.data[idx+1:idx+self.sequence_length+1], dtype=torch.long)
        )
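
Each item from `CharDataset` is an (input, target) pair in which the target is the input shifted one position to the right, so the model learns to predict the next character at every step. The sketch below reproduces the same indexing in plain Python on a toy string; `windows` is an illustrative helper, not part of the notebook.

```python
def windows(data, sequence_length):
    """Yield (input, target) pairs using the same slicing as CharDataset."""
    for idx in range(len(data) - sequence_length):
        yield (data[idx:idx + sequence_length],
               data[idx + 1:idx + sequence_length + 1])

pairs = list(windows("hello world", 4))
for inp, tgt in pairs[:3]:
    print(repr(inp), "->", repr(tgt))
print(len(pairs), "windows from", len("hello world"), "characters")
```

Consecutive windows overlap in all but one character, so the number of training items is almost as large as the corpus itself.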

In [84]:
def train_model(model, dataloader, criterion, optimizer, epochs, device):
    model.to(device)
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch, (inputs, targets) in enumerate(dataloader):
            inputs, targets = inputs.to(device), targets.to(device)

            # Initialize a fresh hidden state for each batch
            hidden = model.init_hidden(inputs.size(0))
            if model.model == "lstm":
                hidden = (hidden[0].to(device), hidden[1].to(device))
            else:
                hidden = hidden.to(device)

            # Zero the gradients
            model.zero_grad()

            # Forward pass
            output, hidden = model(inputs, hidden)

            # Compute loss
            loss = criterion(output.view(-1, model.output_size), targets.view(-1))

            # Backward pass and optimize
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f'Epoch {epoch+1}/{epochs}, Loss: {total_loss / len(dataloader)}')


In [85]:
learning_rate = 0.005
n_epochs = 10
# all_losses = []
batch_size = 64
sequence_length = 50
hidden_size = 128
n_characters = len(idx_to_char)
n_layers = 1

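# Prefer CUDA (e.g. on Colab), then Apple MPS, then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")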
shakespeare_dataset = CharDataset(corpus_indices, sequence_length)
train_dl = DataLoader(shakespeare_dataset, batch_size=batch_size, shuffle=True)
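
With one window starting at every character, an epoch contains roughly as many training items as the corpus has characters, which can make a single epoch very long. The sketch below estimates the number of batches per epoch for a given stride between window starts. `batches_per_epoch` is an illustrative helper, and `n_tokens = 5_000_000` is only a placeholder order of magnitude; substitute `len(corpus_indices)` for the real value.

```python
import math

def batches_per_epoch(n_tokens, sequence_length, batch_size, stride=1):
    """Number of batches when a window starts every `stride` characters."""
    n_windows = len(range(0, n_tokens - sequence_length, stride))
    return math.ceil(n_windows / batch_size)

n_tokens = 5_000_000  # placeholder, not a measured corpus size
dense = batches_per_epoch(n_tokens, sequence_length=50, batch_size=64)
strided = batches_per_epoch(n_tokens, sequence_length=50, batch_size=64, stride=50)
print(f"stride 1:  {dense} batches per epoch")
print(f"stride 50: {strided} batches per epoch")
```

One way to apply such a stride without changing `CharDataset` would be to wrap it as `torch.utils.data.Subset(shakespeare_dataset, range(0, len(shakespeare_dataset), stride))` before building the `DataLoader`; whether non-overlapping windows hurt the final loss is something to check empirically.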



In [86]:
# Initialize model, loss function, and optimizer
model = CharRNN(input_size=n_characters, hidden_size=hidden_size, output_size=n_characters, model="vanilla", n_layers=n_layers)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [87]:
train_model(model, train_dl, criterion, optimizer, n_epochs, device)

KeyboardInterrupt: 
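
The homework also asks for samples from the trained model. A minimal sketch of temperature sampling follows, assuming a model with the same interface as `CharRNN` (called as `model(input, hidden)` with `(batch, seq_len)` token indices, plus an `init_hidden(batch_size)` method). `TinyCharRNN` is a hypothetical, untrained stand-in used only so the block runs on its own; `sample`, `prime`, `length`, and `temperature` are illustrative names.

```python
import torch
import torch.nn as nn

class TinyCharRNN(nn.Module):
    """Hypothetical stand-in with the same call signature as CharRNN."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.encoder = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, input, hidden):
        output, hidden = self.rnn(self.encoder(input), hidden)
        return self.decoder(output), hidden

    def init_hidden(self, batch_size):
        return torch.zeros(1, batch_size, self.hidden_size)

@torch.no_grad()
def sample(model, idx_to_char, char_to_idx, prime="the ", length=100, temperature=0.8):
    """Generate `length` characters after `prime`, one character at a time."""
    model.eval()
    hidden = model.init_hidden(1)
    prime_idx = torch.tensor([[char_to_idx[c] for c in prime]], dtype=torch.long)
    logits, hidden = model(prime_idx, hidden)  # warm up the hidden state on the prime
    out = list(prime)
    for _ in range(length):
        # Lower temperature -> sharper distribution -> more conservative text
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        next_idx = torch.multinomial(probs, 1)  # draw one index, shape (1,)
        out.append(idx_to_char[next_idx.item()])
        logits, hidden = model(next_idx.view(1, 1), hidden)
    return ''.join(out)

idx_to_char = sorted(set("abcdefghijklmnopqrstuvwxyz "))
char_to_idx = {c: i for i, c in enumerate(idx_to_char)}
model = TinyCharRNN(len(idx_to_char), hidden_size=16)  # untrained: output is noise
text = sample(model, idx_to_char, char_to_idx, prime="the ", length=20)
print(text)
```

To use this with a model trained on a GPU or MPS device, the hidden state and index tensors would also need to be moved to that device (or the model moved back to the CPU first).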