# Assignment 12.1 - Recurrent Neural Networks

Please submit your solution to this notebook in Whiteboard under the corresponding assignment entry, both as an .ipynb file and as a .pdf.

#### Please state the names of both group members here:
Jane and John Doe

## Task 12.1.1: RNN - 'ShakesGen'

Let's create a `ShakesGen`!<br><br>
The data folder contains a `shakespeare` folder with works by William Shakespeare. Your task is to implement an RNN that learns to write Shakespeare-style text.

Below, you'll find all the utility code needed for this task. The `Corpus` class serves as the dataset; call `get_batch` on a batchified dataset to retrieve a batch together with its targets.

* Build the missing model components and train your ShakesGen model. **(RESULT)**
* Generate at least 30 lines of text using your ShakesGen model. **(RESULT)**

If you train on a CPU, you may stop training after 5 minutes and generate text from the current model state.
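
One way to enforce such a time limit is to check a wall-clock budget before every training step. The sketch below is only an illustration: `train_with_time_budget` and the `train_step` callable are hypothetical names, not part of the provided utility code.

```python
import time


def train_with_time_budget(train_step, num_steps, budget_seconds=300.0):
    """Run train_step(step) until num_steps is reached or the budget is spent.

    train_step is a hypothetical callable that performs one optimisation step.
    Returns the number of steps that were actually executed.
    """
    start = time.monotonic()
    steps_done = 0
    for step in range(num_steps):
        if time.monotonic() - start >= budget_seconds:
            break  # budget exhausted: keep the current model state
        train_step(step)
        steps_done += 1
    return steps_done
```

Because the check happens between steps, the loop can overrun the budget by at most the duration of one step.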

In [None]:
from IPython.display import Image
Image(url="https://miro.medium.com/max/4000/0*WdbXF_e8kZI1R5nQ.png", width=700)

In [None]:
# Some imports
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.cuda as cuda
import torch.optim as optim
import torch.nn.functional as F
import os
import tqdm
import numpy as np

In [None]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        
        # This is very English-language specific:
        # we ingest only printable ASCII characters (codes 32 to 126).
        self.whitelist = [chr(i) for i in range(32, 127)]
        
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r',  encoding="utf8") as f:
            tokens = 0
            for line in f:
                line = ''.join([c for c in line if c in self.whitelist])
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r',  encoding="utf8") as f:
            ids = torch.LongTensor(tokens)
            token = 0
            for line in f:
                line = ''.join([c for c in line if c in self.whitelist])
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1

        return ids
    
def batchify(data, batch_size):
    # Work out how many full rows fit when the data is split into batch_size parts.
    nbatch = data.size(0) // batch_size
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * batch_size)
    # Evenly divide the data across the batch_size columns: shape (nbatch, batch_size).
    data = data.view(batch_size, -1).t().contiguous()
    return data

def get_batch(source, i, bptt_size=35):
    seq_len = min(bptt_size, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target
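
To see what `batchify` and `get_batch` return, it can help to run them on a toy tensor first. The snippet below repeats the two functions so that it runs on its own; the toy data (`torch.arange(26)`) and the small `bptt_size` are assumptions chosen purely for illustration.

```python
import torch


def batchify(data, batch_size):
    nbatch = data.size(0) // batch_size
    data = data.narrow(0, 0, nbatch * batch_size)
    return data.view(batch_size, -1).t().contiguous()


def get_batch(source, i, bptt_size=35):
    seq_len = min(bptt_size, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target


toy = torch.arange(26)                  # stand-in for a tensor of token ids
batched = batchify(toy, batch_size=4)   # 26 // 4 = 6 rows, 2 tokens trimmed
data, target = get_batch(batched, 0, bptt_size=3)

print(batched.shape)  # (6, 4): each column is one contiguous stream of tokens
print(data)           # (3, 4): rows 0..2 of every column
print(target)         # (12,): rows 1..3, flattened row by row
```

Each column of `data` holds consecutive tokens, and `target` holds the same positions shifted by one step and flattened, which is the layout `nn.CrossEntropyLoss` expects when the model output is reshaped to `(seq_len * batch_size, vocab_size)`.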

In [None]:
# Use Corpus to load data
corpus = Corpus('./data/shakespeare')

In [None]:
vocab_size = len(corpus.dictionary)
print(vocab_size)

# Print first 100 words from training data
words = [corpus.dictionary.idx2word[corpus.train[i].item()] for i in range(min(100, len(corpus.train)))]
print(' '.join(words))

In [None]:
idx = corpus.dictionary.word2idx.get("That", -1)  # returns -1 if not found
print(f"Index of the word 'That': {idx}")

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# TODO: Implement RNN model class & Training loop here

# Tip: Use an Embedding layer to map each token index to a dense vector.
# e.g., self.embedding = nn.Embedding(vocab_size, embed_dim)
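
As a possible starting point (not the reference solution), the sketch below wires an embedding layer, a GRU, and a linear decoder together, runs a single training step on random toy data, and samples a few token indices. The class name `ShakesGenSketch`, the choice of `nn.GRU`, all layer sizes, the learning rate, and the clipping value are assumptions made for illustration; in the notebook you would use `vocab_size`, batches from `get_batch`, and move the model and tensors to `device`.

```python
import torch
import torch.nn as nn


class ShakesGenSketch(nn.Module):
    """Hypothetical word-level language model: Embedding -> GRU -> Linear."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim)  # input: (seq_len, batch, embed_dim)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.embedding(tokens)          # (seq_len, batch, embed_dim)
        out, hidden = self.rnn(emb, hidden)   # (seq_len, batch, hidden_dim)
        logits = self.decoder(out)            # (seq_len, batch, vocab_size)
        # Flatten so the logits line up with the flattened targets.
        return logits.view(-1, logits.size(-1)), hidden


torch.manual_seed(0)
toy_vocab = 20                                    # stand-in for vocab_size
model = ShakesGenSketch(toy_vocab)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# Random stand-ins for one (data, target) pair with seq_len=35, batch=8.
data = torch.randint(0, toy_vocab, (35, 8))
target = torch.randint(0, toy_vocab, (35 * 8,))

# One training step.
model.train()
optimizer.zero_grad()
logits, _ = model(data)
loss = criterion(logits, target)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)  # limit exploding gradients
optimizer.step()


@torch.no_grad()
def generate(model, start_idx, num_tokens, temperature=1.0):
    """Sample num_tokens token indices, feeding each prediction back in."""
    model.eval()
    token = torch.tensor([[start_idx]])           # shape (seq_len=1, batch=1)
    hidden = None
    sampled = []
    for _ in range(num_tokens):
        logits, hidden = model(token, hidden)
        probs = torch.softmax(logits[-1] / temperature, dim=-1)
        next_idx = torch.multinomial(probs, 1).item()
        sampled.append(next_idx)
        token = torch.tensor([[next_idx]])
    return sampled


sample = generate(model, start_idx=0, num_tokens=10)
```

To turn sampled indices back into text, look each one up in `corpus.dictionary.idx2word` and start a new line whenever `<eos>` appears. A lower `temperature` makes the samples more conservative, a higher one more varied.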

## Congrats, you made it! :)