**Homework 25**

In this assignment your will train a RNN to predict characters of *Alice in Wonderland*, from strings of consecutive characters.

We begin as usual with the imports you will need for this assignment.

In [5]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

In [6]:
device=('cuda' if torch.cuda.is_available()
        else 'cpu')

device

'cpu'

Run the following text block to read *Alice in Wonderland* from the web, store it in the variable `text`, convert to lower case and remove punctuation.

In [7]:
import string
from urllib.request import urlopen
url='https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt'
text = urlopen(url).read().decode('utf-8')
text=text.lower()
text=[c for c in text if (c not in string.punctuation) and (c!='\n')]

Write a class `Tokenizer` with the following methods:


*   `__init__`, a method that builds a dictionary `tokens` whose keys are the set of unique characters in some input `text`, and values are integers.
*   `encode`, a method that takes in a corpus of text, converts each character according to the dictionary built by the __init__ method, and outputs a list of those integers.
*   `decode`, a method that takes a single integer (a value from the dictionary), and returns the corresponding character key.



In [8]:
class Tokenizer():
  def __init__(self,text):
    unique_chars = sorted(set(text))
    self.tokens = {char: idx for idx, char in enumerate(unique_chars)}
    self.index_to_char = {idx: char for char, idx in self.tokens.items()}
  def encode(self,text):
    return [self.tokens[c] for c in text]

  def decode(self,n):
    return self.index_to_char[n]

Now, create an object called `tok` of your `Tokenizer` class, and use it to encode `text` as a list of integers, `text_indices`.

In [9]:
tok=Tokenizer(text)
text_indices=tok.encode(text)

For convenience, we'll define `vocab_size=len(tok.tokens)` to be the length of your tokenizer dictionary:

In [10]:
vocab_size=len(tok.tokens)
vocab_size

29

The next task is to create feature sequences and targets. From `text_indices`, create a list-of-lists `X`. Each sublist of `X` should correspond to 50 consecutive elements of `text_indices`. At the same time, create a list `y` which contains the indices of the characters that follow each sublist of `X`. For example, `X[0]` should be a list containing the first 50 elements of `text_indices`: `text_indices[0]` through `text_indices[49]`. `y[0]` should be the 51st element, `text_indices[50]`.

To keep the size of the feature and target vectors manageable, consecutive lists in `X` should be shifted by 3, so the overlap is 47 elements. Hence, `X[1]` should be a list containing the integers `text_indices[3]` through `text_indices[52]`, and `y[1]` should be the integer `text_indices[53]`.

In [11]:
seq_len=50
X=[]
y=[]
for i in range(0,len(text_indices)-seq_len-1,3):
  X.append(text_indices[i:i+seq_len])
  y.append(text_indices[i+seq_len])

Convert `X` and `y` to torch tensors with the same names, and check their shapes. If done correctly, the shape of `X` should be (45539, 50) and the shape of `y` should be (45539, ):

In [12]:
X=torch.tensor(X, dtype=torch.long, device=device)
y=torch.tensor(y, dtype=torch.long, device=device)
X.shape, y.shape

(torch.Size([45539, 50]), torch.Size([45539]))

Convert `X` to a one-hot encoded vector `OneHotX` of 0's and 1's, and check its shape. You should now have shape (45539,50,29). In other words, the vector `OneHotX` now contains 45,539 sequences of length 50, and each element of each sequence is a 29-dimensional vector of 28 zeros and a single one in the entry corresponding to some character in the text.

In [13]:
OneHotX=F.one_hot(X, num_classes=vocab_size).float()

You're now ready to create your model, which will consist of two seperate one-layer pytorch models. The first will be a recurrent layer that takes in sequences of 29-dimensional vectors, and has a 128 dimensional hidden state. The second will ve a linear layer that will take the last hidden state and produce a 29 dimensional vector.

In [14]:
rnn=nn.RNN(
    input_size=vocab_size,
    hidden_size=128,
    num_layers=1,
    batch_first=True
).to(device)

fc=nn.Linear(128, vocab_size).to(device)

Compile your model using the `Adam` optimizer and an approporiately chosen loss function.

In [15]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(fc.parameters()), lr=0.001)

Fit your data to X and y. Train for 50 epochs with a batch size of 128. Each epoch will take about 95 seconds, so you'll want to leave your computer for about an hour for this to complete.

In [16]:
n_epochs=50
N = OneHotX.shape[0]  # total number of observations in training data
batch_size=32

rnn.train()
for epoch in range(n_epochs):
  epoch_loss = 0.0

  # Shuffle the indices
  indices = torch.randperm(N,device=device)

  # Create mini-batches
  for i in range(0, N, batch_size):
    batch_indices = indices[i:i+batch_size]
    batch_X = OneHotX[batch_indices]
    batch_y = y[batch_indices]

    optimizer.zero_grad()
    output, hidden = rnn(batch_X)
    output_last = output[:, -1, :]  
    preds = fc(output_last)
    loss = criterion(preds, batch_y)
    loss.backward()
    optimizer.step()

    epoch_loss += loss.item()*batch_size

  if epoch%2==0:
      avg_loss = epoch_loss / len(y)
      print(f"epoch: {epoch}, avg_loss: {avg_loss}")

epoch: 0, avg_loss: 2.3569527830032886
epoch: 2, avg_loss: 1.970838583485066
epoch: 4, avg_loss: 1.8402020468584421
epoch: 6, avg_loss: 1.7479895551087563
epoch: 8, avg_loss: 1.672956361211271
epoch: 10, avg_loss: 1.6143653076935167
epoch: 12, avg_loss: 1.562607200791937
epoch: 14, avg_loss: 1.519829806960245
epoch: 16, avg_loss: 1.4804779005321858
epoch: 18, avg_loss: 1.4482061491069327
epoch: 20, avg_loss: 1.4218611226659532
epoch: 22, avg_loss: 1.3982054838071099
epoch: 24, avg_loss: 1.3776010191930885
epoch: 26, avg_loss: 1.357822932085843
epoch: 28, avg_loss: 1.3397361304952173
epoch: 30, avg_loss: 1.326487443211464
epoch: 32, avg_loss: 1.3126609781890544
epoch: 34, avg_loss: 1.3007591439972999
epoch: 36, avg_loss: 1.2901715037353572
epoch: 38, avg_loss: 1.2832508652590175
epoch: 40, avg_loss: 1.2734292800431877
epoch: 42, avg_loss: 1.2684789102966836
epoch: 44, avg_loss: 1.2618339186707996
epoch: 46, avg_loss: 1.2522757587237263
epoch: 48, avg_loss: 1.2489957055329528


We will now use your trained model to generate text, one character at a time. Run the following code block to do this. (It will take a minute or two to complete.) Its interesting that although the model generates one character at a time, you'll see very word-like strings in the final text.

In [17]:
rnn.eval()
next_seq=OneHotX[:1]  #Initial "seed" sequence

newtext=''
with torch.no_grad():
  for i in range(500):
    seq=next_seq
    pred=fc(rnn(seq)[1].squeeze()) #predictions of your model
    pred_probs=torch.softmax(pred,dim=0).detach().cpu().numpy() #predictions->probs
    index_pred=np.random.choice(vocab_size,1,p=pred_probs)[0] #choose one
    newtext+=tok.decode(index_pred) #corresponding character

    next_vec=torch.zeros(vocab_size).to(device)
    next_vec[index_pred]=1  #one-hot encode chosen letter index
    next_seq=torch.zeros(1,seq_len,29).to(device)
    next_seq[0,:seq_len-1]=seq[0,1:] #new sequence is last 49 of old sequence
    next_seq[0,seq_len-1]=next_vec  #plus new vector

newtext #display generated text

'ce repeate you oven up any own up to the rebbytis whet  here be reppoke befourlyly what have you begonbetuly as she had when they withta lome beapsinut with extrrildsail wore here getanding to put a fercapien and the picksuchange sore talkyor quetave  bytare wore and mound all the house cerdunting atoherell the ashe plane would her sone said the corsone  said aliceand the queen to tring were for no  algat and somesed and undance ashortlialf down  the pook  of asimf reser about it are hus hime ar'

'ce repeate you oven up any own up to the rebbytis whet  here be reppoke befourlyly what have you begonbetuly as she had when they withta lome beapsinut with extrrildsail wore here getanding to put a fercapien and the picksuchange sore talkyor quetave  bytare wore and mound all the house cerdunting atoherell the ashe plane would her sone said the corsone  said aliceand the queen to tring were for no  algat and somesed and undance ashortlialf down  the pook  of asimf reser about it are hus hime ar'

**COPY AND PASTE THIS TEXT INTO THE SUBMISSION WINDOW ON GRADESCOPE**