# Assignment 12.1 - Recurrent Neural Networks

Please submit your solution of this notebook in the Whiteboard at the corresponding Assignment entry as .ipynb-file and as .pdf.

#### Please state both names of your group members here:
Farah Ahmed Atef Abdelhameed Hafez

## Task 12.1.1: RNN - 'ShakesGen'

Let's create a `ShakesGen` !!<br><br>
The data folder contains a shakespeare folder with works from William Shakespeare. Your task is to implement an RNN that learns to write Shakespeare-style text.

Below, you'll find all the utility code needed for this task. The Corpus class serves as a dataset, and you can retrieve a batch with its target by calling `get_batch` on a batchified dataset.

* Build the missing model components and train your ShakesGen model. **(RESULT)**
* Generate at least 30 lines of text using your ShakesGen model. **(RESULT)**

Especially, if you train on cpu, you can stop training after 5 minutes and generate based on the current model state.

In [1]:
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://miro.medium.com/max/4000/0*WdbXF_e8kZI1R5nQ.png", width=700)

In [2]:
# Some imports
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.cuda as cuda
import torch.optim as optim
import torch.nn.functional as F
import os
import tqdm
import numpy as np

In [3]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()

        # This is very english language specific
        # We will ingest only these characters:
        self.whitelist = [chr(i) for i in range(32, 127)]

        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r',  encoding="utf8") as f:
            tokens = 0
            for line in f:
                line = ''.join([c for c in line if c in self.whitelist])
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r',  encoding="utf8") as f:
            ids = torch.LongTensor(tokens)
            token = 0
            for line in f:
                line = ''.join([c for c in line if c in self.whitelist])
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1

        return ids

def batchify(data, batch_size):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // batch_size
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * batch_size)
    # Evenly divide the data across the bsz batches.
    data = data.view(batch_size, -1).t().contiguous()
    return data

def get_batch(source, i, bptt_size=35):
    seq_len = min(bptt_size, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target

#Get shakespeare folder

In [4]:
!git clone https://github.com/BioroboticsLab/ml2526_assignments.git
!mkdir -p data
!mv ml2526_assignments/data/shakespeare ./data/
!rm -rf ml2526_assignments

Cloning into 'ml2526_assignments'...
remote: Enumerating objects: 84, done.[K
remote: Counting objects: 100% (54/54), done.[K
remote: Compressing objects: 100% (24/24), done.[K
remote: Total 84 (delta 38), reused 43 (delta 30), pack-reused 30 (from 1)[K
Receiving objects: 100% (84/84), 59.52 MiB | 32.73 MiB/s, done.
Resolving deltas: 100% (44/44), done.


In [5]:
# Use Corpus to load data
corpus = Corpus('./data/shakespeare')

In [6]:
vocab_size = len(corpus.dictionary)
print(vocab_size)

# Print first 100 words from training data
words = [corpus.dictionary.idx2word[corpus.train[i].item()] for i in range(min(100, len(corpus.train)))]
print(' '.join(words))

74010
<eos> THE SONNETS <eos> <eos> 1 <eos> <eos> From fairest creatures we desire increase, <eos> That thereby beautys rose might never die, <eos> But as the riper should by time decease, <eos> His tender heir might bear his memory: <eos> But thou contracted to thine own bright eyes, <eos> Feedst thy lights flame with self-substantial fuel, <eos> Making a famine where abundance lies, <eos> Thy self thy foe, to thy sweet self too cruel: <eos> Thou that art now the worlds fresh ornament, <eos> And only herald to the gaudy spring, <eos> Within thine own bud buriest thy content, <eos>


In [7]:
idx = corpus.dictionary.word2idx.get("That", -1)  # returns -1 if not found
print(f"Index of the word 'That': {idx}")

Index of the word 'That': 10


In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#RNN Model

In [9]:
# TODO: Implement RNN model class & Training loop here

# Tip: Use an Embedding layer to Tokenize each word.
# e.g., self.embedding = nn.Embedding(vocab_size, embed_dim)
class RNNModel(nn.Module):
  def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers):
    super(RNNModel, self).__init__()
    self.embedding = nn.Embedding(vocab_size, embed_dim)
    self.rnn=nn.RNN(embed_dim, hidden_dim, num_layers)
    self.fc=nn.Linear(hidden_dim, vocab_size)

  def forward(self, x):
    x=self.embedding(x)
    x, _ = self.rnn(x)
    x = self.fc(x)
    return x


#Train loop

In [10]:
def train(model, batchifieddatasettrain, batchifieddatasetval,optimizer, criterion, epoch, device, bptt=35):
    model.train()
    for i in range(epoch):
      totaltrainloss=0
      totalvalloss=0
      model.train()
      for l in range(0, batchifieddatasettrain.size(0) - 1, bptt):
            batch_x, batch_y = get_batch(batchifieddatasettrain, l, bptt)


            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            optimizer.zero_grad()
            output = model(batch_x)
            output = output.view(-1, output.size(-1))
            loss = criterion(output, batch_y)
            totaltrainloss+=loss.item()
            loss.backward()
            optimizer.step()
      print("Epoch: ", i, "Average Training Loss: ", totaltrainloss/(batchifieddatasettrain.size(0)// bptt))
      model.eval()
      with torch.no_grad():
        for l in range(0, batchifieddatasetval.size(0) - 1, bptt):
          valbatch_x, valbatch_y = get_batch(batchifieddatasetval, l, bptt)
          valbatch_x, valbatch_y = valbatch_x.to(device), valbatch_y.to(device)

          valoutput=model(valbatch_x)
          valoutput = valoutput.view(-1, valoutput.size(-1))
          valloss = criterion(valoutput, valbatch_y)
          totalvalloss+=valloss.item()

      print("Epoch: ", i,  "Average Validation Loss: ", totalvalloss/ (batchifieddatasetval.size(0)//bptt))





#Generating Text Function

In [11]:
def generateText(model, start_word, corpus, device, mylength=300):
    myindex = corpus.dictionary.word2idx[start_word]
    input = torch.tensor([[myindex]], device=device)
    mywords = [start_word]
    model.eval()
    for _ in range(mylength):
      with torch.no_grad():
        logits= model(input)
        probs = F.softmax(logits[:, -1], dim=-1)
        myindex = torch.multinomial(probs, num_samples=1).item()
        word=corpus.dictionary.idx2word[myindex]
        input = torch.tensor([[myindex]], device=device)
        mywords.append(word)
        if mywords[-1]=='<eos>':
          break
    return ' '.join(mywords)

#Train and Generate

In [12]:
model=RNNModel(vocab_size, 50, 128, 1)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
batchifieddataset = batchify(corpus.train, 35)
batchifieddatasetval = batchify(corpus.valid, 35)
train(model, batchifieddataset,batchifieddatasetval, optimizer, criterion, 20, device)

Epoch:  0 Average Training Loss:  7.3161020948077145
Epoch:  0 Average Validation Loss:  7.503486352808335
Epoch:  1 Average Training Loss:  6.820293290997451
Epoch:  1 Average Validation Loss:  7.23349432851754
Epoch:  2 Average Training Loss:  6.509971199170598
Epoch:  2 Average Validation Loss:  7.005438626981249
Epoch:  3 Average Training Loss:  6.275552005138037
Epoch:  3 Average Validation Loss:  6.867374261220296
Epoch:  4 Average Training Loss:  6.12325286977696
Epoch:  4 Average Validation Loss:  6.783494799744849
Epoch:  5 Average Training Loss:  6.030118784252203
Epoch:  5 Average Validation Loss:  6.735578153647628
Epoch:  6 Average Training Loss:  5.968577216818647
Epoch:  6 Average Validation Loss:  6.702457745869954
Epoch:  7 Average Training Loss:  5.9237416639642895
Epoch:  7 Average Validation Loss:  6.68090472501867
Epoch:  8 Average Training Loss:  5.889925718869803
Epoch:  8 Average Validation Loss:  6.6657591988058655
Epoch:  9 Average Training Loss:  5.8613927021

In [13]:
for i in range(30):
  print(generateText(model, "The", corpus, device))


The heart, <eos>
The prince's sorrow how we have misuse me oft? <eos>
The Duke of Somerset's how? <eos>
The King him that neglected beauty be come in us that Called when they [Thunder] <eos>
The law to take no man and William Paris; <eos>
The heart now <eos>
The juggled is from their absent telling peace! And bid it know; my week? <eos>
The dog or in some note <eos>
The Glendower? of keeping of living. <eos>
The 'solus' and I, rob mad prisoners a man, for our praises! in cloak-bag- but anatomy, <eos>
The intelligence <eos>
The breath but that is ready on my land, what the sound. My blood to repent my Westminster drink the princes runs, being Health to wilfully Lucius <eos>
The foolish? <eos>
The frailty, <eos>
The diadem at Eastcheap on your dare, men of him! Bless him what is hither; foes! <eos>
The maids in daggers hook are to to Caesar- <eos>
The law? <eos>
The hurly wild-cat; <eos>
The humour horses-a do not come. <eos>
The hour ourselves <eos>
The northern parle. <eos>
The jest ha

In [14]:
for i in range(30):
  startword=np.random.choice(corpus.dictionary.idx2word)
  print(generateText(model, startword , corpus, device))


height to his winter bearing a thousand Good cousin, she shall we were temple, a fortunes, hither, <eos>
draught, and ready to the other fashion and most hands I be? <eos>
stay'd! <eos>
Coeur-de-lion proceedings. <eos>
biddings to turn both dogs are indeed! I infinite. <eos>
Recall <eos>
dwelt the rite? <eos>
engraft Between the King's select my villaine. <eos>
unhorse [Stands stealing, store, <eos>
Pomfret, you can my poor Anteroom with bowels, Caesar day under my Deserved I will I well ingratitude <eos>
tar- villagery, <eos>
confess- <eos>
firmness and Redeemer, <eos>
Exces <eos>
chase, <eos>
shoulders thereby discovers render- here be happened will out of mine arms man. Ay, Thou Helena, round 1605 <eos>
blood to atone you are fasting by all too letters mistook <eos>
garter, a COUNTESSES, Hercules! <eos>
lustre; walks. Thaisa._] <eos>
fantastical Did you Iras; <eos>
folk, Iaylor.] <eos>
roof <eos>
intents, I'll play scab! <eos>
remembered, <eos>
story at one Oracle, <eos>
harsh-sound

## Congratz, you made it! :)