### Group members (first and last names):
- #### *Abdelhak Kermia*


# Introduction
In this notebook, you’ll dive into the world of Recurrent Neural Networks (RNNs) by implementing a simple Elman RNN from scratch and comparing it to PyTorch’s built-in RNN.

You’ll learn how sequence models handle character predictions and how sampling affects output quality.

---
## How to pass the assignment?
Below, you will find the exercise questions. Each question that awards points is numbered and displays the available points in this format: **(0 pts)**.

### Answering Questions
- Provide your answers in the cell directly below each question.
- Use **Markdown** for text-based answers (in **English**).
- Use **code cells** for implementations.

### Critical Thinking Questions and Bonus Exercises
- Some questions are marked with a 🧠 (Critical Thinking) or a ⭐ (Bonus Exercise). These are for self-reflection and extra practice.
- They are **optional** and do **not** award any points.
- Answering them can help reinforce your understanding.

### Important Rules
- Only use the Python packages introduced in the assignment. Using unauthorized packages will result in **0 points** for the affected question.
- Follow dataset instructions carefully.
  - If no new dataset is mentioned, continue using the one from the previous task.
  - Using a different dataset than instructed will result in **0 points** for that question.
- All code must run correctly.
  - If your code does not execute, you will receive a **50% deduction** for that question.
  - Always test your code before submitting.
- Incorrect or incomplete answers receive **0 points**.
  - Partial credit may be awarded if the core idea is correct **and** the instructions are followed precisely.
  - If you do not follow the instructions, you will receive **0 points**, regardless of effort or length.
- Do not provide overly detailed or off-topic answers. Stay focused on what is asked. Extra information does not earn extra points.

### Important Notes
- Save your work frequently! (Ctrl + S)
- Before submitting, `Restart Session and Run All` cells to ensure everything works correctly.
- **You need at least 17 points out of 25 (66%) to pass ✅**
---

In [1]:
points = 25

# 1. Fundamentals (4 points)
Before diving into code, let's test your understanding of **RNNs** with these questions. For each topic, identify which statements are TRUE ✅ and which are FALSE ❌. Each question may have 1, 2, 3, or 4 correct answers.

- 4 correct answers: 2 points
- 3 correct answers: 1 point
- 2 or fewer correct answers: 0 points

💡 In Google Colab, you can easily add emojis to markdown cells by typing `:` followed by the emoji's name. For example, typing `:light-bulb` will display a light bulb emoji. This feature is also available as an extension in many IDEs.

❗ **TIP:** If a term is unfamiliar to you, look it up in [Google's ML Glossary](https://developers.google.com/machine-learning/glossary) for a simple explanation.

#### 1.1 **(2pts) Which of the following statements about Recurrent Neural Networks (RNNs) are correct?**

 A. RNNs are particularly suited for sequential data like language ✅

 B. RNNs can update internal representations based on previously processed inputs ✅

 C. RNNs require the input to be passed through a convolutional layer before processing ❌

 D. The hidden state $h_t$ in RNNs is updated based on  $x_t$ and $h_{t-1}$ ✅

#### 1.2 **(2pts) Which of the following describe layers or transformations used when *training* a basic RNN?**

 A. $h_t = \tanh(W_{ih}x_t + W_{hh}h_{t-1} + b_h)$ ✅

 B. $o_t = W_{ho}h_t + b_o$ ✅

 C. $\hat{y}_t = \text{softmax}(o_t)$ ✅

 D. $\mathscr{L} = -\log \hat{y}_t[y_t]$ ✅
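
The four transformations in 1.2 can be chained into a single training step. The following is only a minimal sketch: the sizes, the random parameters, and the variable names are illustrative assumptions, not values from the assignment. It also checks that `F.cross_entropy` gives the same value as applying softmax followed by the negative log-likelihood.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, hidden_size = 6, 4  # illustrative sizes (assumption)

# Randomly initialised parameters; names mirror the symbols in the formulas
W_ih = torch.randn(hidden_size, vocab_size)
W_hh = torch.randn(hidden_size, hidden_size)
b_h = torch.zeros(hidden_size)
W_ho = torch.randn(vocab_size, hidden_size)
b_o = torch.zeros(vocab_size)

x_t = torch.zeros(vocab_size)
x_t[3] = 1.0                       # one-hot input character
h_prev = torch.zeros(hidden_size)  # h_{t-1}
y_t = torch.tensor(1)              # index of the target character

h_t = torch.tanh(W_ih @ x_t + W_hh @ h_prev + b_h)  # A: hidden-state update
o_t = W_ho @ h_t + b_o                              # B: output logits
y_hat = torch.softmax(o_t, dim=0)                   # C: probabilities
loss = -torch.log(y_hat[y_t])                       # D: negative log-likelihood

# Cross-entropy on the raw logits fuses steps C and D
loss_ce = F.cross_entropy(o_t.unsqueeze(0), y_t.unsqueeze(0))
print(loss.item(), loss_ce.item())
```

This is why the training code later in the notebook can pass raw logits to `nn.CrossEntropyLoss` without an explicit softmax layer.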

# 2. Hands-On (3 points)

Suppose you are working with the following character vocabulary for a character-level RNN:

`vocab = ['a', 'c', 'd', 'e', 'o', 'r']`

The characters are mapped to integers in the order shown:

| Character | Index |
| --------- | ----- |
| 'a'       | 0     |
| 'c'       | 1     |
| 'd'       | 2     |
| 'e'       | 3     |
| 'o'       | 4     |
| 'r'       | 5     |


#### 2.1 **(1pt) What is the one-hot encoded vector for the character 'e'?**
Write your answer in LaTeX as a column vector.

Example format (for character `'a'`):

$$ \text{one-hot}(\texttt{a}) =
\begin{bmatrix}
1 \\
0 \\
0 \\
0 \\
0 \\
0
\end{bmatrix}
$$


- for the character e:
$$ \text{one-hot}(\texttt{e}) =
\begin{bmatrix}
0 \\
0 \\
0 \\
1 \\
0 \\
0
\end{bmatrix}
$$

#### 2.2 **(1pt) Decode the following sequence using the vocabulary above and write the resulting word.**

`encoded = [2, 3, 1, 4, 2, 3, 5]`

decoder

#### 2.3 **(1pt) Construct the one-hot encoded matrix $X \in \mathbb{R}^{6 \times 7}$ for this sequence using LaTeX.**

Example format for an empty 6x7 matrix:

$$ X =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
$$

Each column in the matrix represents a one-hot encoded character vector.
$$ X =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$
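
The answers to 2.2 and 2.3 can be cross-checked with a few lines of NumPy. This is just a sketch using the toy vocabulary from this section; the variable names `decoded` and `X` are chosen here for illustration.

```python
import numpy as np

vocab = ['a', 'c', 'd', 'e', 'o', 'r']
encoded = [2, 3, 1, 4, 2, 3, 5]

# Decode: look up each index in the vocabulary
decoded = ''.join(vocab[i] for i in encoded)

# One-hot matrix: rows are vocabulary entries, columns are time steps
X = np.zeros((len(vocab), len(encoded)), dtype=int)
X[encoded, np.arange(len(encoded))] = 1

print(decoded)
print(X)
```

Each column of `X` contains exactly one 1, located in the row given by the corresponding entry of `encoded`.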

# 3. Coding (12 points)

In [2]:
# These are the packages you'll need today
# If you're running on a local environment, make sure everything you need is installed :)

# Data manipulation and visualization
import numpy as np
import random

# PyTorch libraries for deep learning
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split

# Set random seed
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(0)

# Set up device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)


cuda


## Task: Character-Level Text Generation with RNNs
In this assignment, your goal is to generate Shakespearean-style text by training a model to predict the next character in a sequence. You will approach this task using different types of recurrent neural networks (RNNs).

> *“To RNN or not to RNN, that is the question.”* 🤖🎭

🎯 Objectives
1. **Text Generation from Characters:**
    - You'll train a character-level model to learn the patterns of Shakespeare's writing. Given a sequence of characters, the model will try to predict the next character in the sequence, enabling it to generate realistic (or amusing!) text letter by letter.

2. **Implement a Custom Elman RNN:**
    - You will implement your own version of a simple RNN from scratch (also known as an Elman network). This involves manually managing hidden states and performing step-by-step sequence processing without using PyTorch's built-in `nn.RNN` class.

3. **Compare with Built-in PyTorch Models:**
    - You’ll then train and evaluate a PyTorch-based RNN using `nn.RNN`.
    - You will compare the text these models generate and discuss how the model architecture affects the quality of the output (e.g., coherence, structure, creativity).

## Load Data

In [3]:
# Load shakespeare text and save as text
with open("shakespeare.txt", "r") as f:
    text = f.read()

# Print the length of the text
print(f"Length of text: {len(text)} characters")

# Show the first 100 characters of the text
text[:100]

Length of text: 1115394 characters


'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

In [4]:
# Set valid characters for the model to generate
chars = ['\n', ' ', '!', '"', '$', '&', "'", ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

# Print the number of unique characters
print(f"Number of unique characters: {len(chars)}")

Number of unique characters: 77


## Data Preparation

In [5]:
# Map characters to integers and vice versa
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Print the mapping of characters to integers
print(char_to_int)

{'\n': 0, ' ': 1, '!': 2, '"': 3, '$': 4, '&': 5, "'": 6, ',': 7, '-': 8, '.': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, ':': 20, ';': 21, '>': 22, '?': 23, 'A': 24, 'B': 25, 'C': 26, 'D': 27, 'E': 28, 'F': 29, 'G': 30, 'H': 31, 'I': 32, 'J': 33, 'K': 34, 'L': 35, 'M': 36, 'N': 37, 'O': 38, 'P': 39, 'Q': 40, 'R': 41, 'S': 42, 'T': 43, 'U': 44, 'V': 45, 'W': 46, 'X': 47, 'Y': 48, 'Z': 49, '`': 50, 'a': 51, 'b': 52, 'c': 53, 'd': 54, 'e': 55, 'f': 56, 'g': 57, 'h': 58, 'i': 59, 'j': 60, 'k': 61, 'l': 62, 'm': 63, 'n': 64, 'o': 65, 'p': 66, 'q': 67, 'r': 68, 's': 69, 't': 70, 'u': 71, 'v': 72, 'w': 73, 'x': 74, 'y': 75, 'z': 76}


In [6]:
# Encode the shakespearean text to integers
encoded = np.array([char_to_int[ch] for ch in text])

# Print the first 100 encoded characters
encoded[:100]

array([29, 59, 68, 69, 70,  1, 26, 59, 70, 59, 76, 55, 64, 20,  0, 25, 55,
       56, 65, 68, 55,  1, 73, 55,  1, 66, 68, 65, 53, 55, 55, 54,  1, 51,
       64, 75,  1, 56, 71, 68, 70, 58, 55, 68,  7,  1, 58, 55, 51, 68,  1,
       63, 55,  1, 69, 66, 55, 51, 61,  9,  0,  0, 24, 62, 62, 20,  0, 42,
       66, 55, 51, 61,  7,  1, 69, 66, 55, 51, 61,  9,  0,  0, 29, 59, 68,
       69, 70,  1, 26, 59, 70, 59, 76, 55, 64, 20,  0, 48, 65, 71])

## One-Hot Encoding
This function converts a 2D tensor of integer character indices into one-hot encoded vectors.

**Input:**
- `arr`: A 2D tensor of shape (batch_size, seq_length). Each element is an integer index corresponding to a character in the vocabulary.
- `n_labels`: The size of the vocabulary (i.e., number of unique characters).

**Output:**
- A 3D tensor of shape (batch_size, seq_length, n_labels) where each character index is replaced by a one-hot vector.

**Example:**

If `arr = [[1, 0], [2, 3]]` and `n_labels = 4`, the output will be:
```
[
 [[0, 1, 0, 0], [1, 0, 0, 0]],
 [[0, 0, 1, 0], [0, 0, 0, 1]]
]
```
**🧠 Do you know what the shape of this tensor is?**


In [7]:
# Define method to encode one hot labels
##### arr (batch x seq_length) ----> (batch x seq_length x vocabulary_size)
def one_hot_encode(arr, n_labels):

    # Initialize the encoded array
    one_hot = torch.zeros(list(arr.shape) + [n_labels], dtype=torch.float32)
    one_hot = one_hot.view(arr.shape[0] * arr.shape[1], -1)

    # Fill the appropriate elements with ones
    one_hot[torch.arange(one_hot.shape[0]), arr.view(-1)] = 1.

    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))

    return one_hot
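
As a cross-check of the worked example above, PyTorch's built-in `torch.nn.functional.one_hot` should produce the same tensor. This is a sketch for comparison only; the notebook itself uses the hand-written `one_hot_encode`, and the cast to `float` is added here to match its `float32` output.

```python
import torch
import torch.nn.functional as F

arr = torch.tensor([[1, 0], [2, 3]])  # (batch_size=2, seq_length=2)

# F.one_hot returns integer tensors, so cast to float to match one_hot_encode
one_hot = F.one_hot(arr, num_classes=4).float()

print(one_hot.shape)  # torch.Size([2, 2, 4])
print(one_hot)
```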

## Custom Dataset for RNN Training
To train an RNN to predict the next character in a sequence, we need to prepare our data in a way that provides both **input sequences** and **target sequences**. The `SubsequencesDataset` class helps with this by slicing a long 1D sequence of encoded data into consecutive windows of length `seq_length`, each paired with a target that is the same window shifted by one character, so every input and its target **overlap**.

In [8]:
class SubsequencesDataset(Dataset):
    def __init__(self, data: np.ndarray, seq_length: int):
        super(SubsequencesDataset, self).__init__()

        self.data = data # Full 1D array of encoded characters
        self.seq_length = seq_length # Length of input sequences

    def __len__(self):
        # Number of complete (input, target) pairs that can be extracted
        if self.data.shape[0] % self.seq_length == 0:
            return self.data.shape[0] // self.seq_length - 1
        else:
            return self.data.shape[0] // self.seq_length

    def __getitem__(self, index: int):
        # Extract a single (input, target) pair
        return (self.data[index * self.seq_length:index * self.seq_length + self.seq_length], # Input sequence
                self.data[index * self.seq_length + 1:index * self.seq_length + self.seq_length + 1]) # Target sequence (input shifted by 1)

### Example

In [9]:
dataset = SubsequencesDataset(data=np.array([2, 1, 4, 7, 0, 23, 57, 12, 11, 8]), seq_length=4)
print(f"Length of dataset: {len(dataset)}")
print(f"First input, target sequence: {dataset[0]}")
print(f"Second input, target sequence: {dataset[1]}")

Length of dataset: 2
First input, target sequence: (array([2, 1, 4, 7]), array([1, 4, 7, 0]))
Second input, target sequence: (array([ 0, 23, 57, 12]), array([23, 57, 12, 11]))
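
The two branches in `__len__` exist because each target needs one character beyond the end of its input window. A compact way to see this is sketched below; `count_pairs` is a hypothetical helper written for illustration, not part of the assignment code.

```python
def count_pairs(n_chars: int, seq_length: int) -> int:
    """Hypothetical helper: number of complete (input, target) pairs."""
    # Pair i reads characters up to index (i + 1) * seq_length inclusive,
    # so it is complete only if (i + 1) * seq_length + 1 <= n_chars.
    return (n_chars - 1) // seq_length


def count_pairs_branchy(n_chars: int, seq_length: int) -> int:
    """Same logic written with the two branches used in __len__."""
    if n_chars % seq_length == 0:
        return n_chars // seq_length - 1
    return n_chars // seq_length


print(count_pairs(10, 4))  # the example above: 10 characters, windows of 4
print(count_pairs(8, 4))   # evenly divisible case: the last window has no full target
```

When the data length is an exact multiple of `seq_length`, the final window would have a target that is one character short, so it is dropped.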


#### 3.1 **(2pts) Implement an Elman RNN from scratch.**
Below is the skeleton of an Elman RNN. Your task is to complete three lines in the `__init__` method and one line in the `forward` method.

1. In `__init__`, initialize the following layers using [`torch.nn.Linear`](https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear):
    - `self.W_ih`: input-to-hidden layer
    - `self.W_hh`: hidden-to-hidden layer (without a bias term)
    - `self.out`: hidden-to-output layer

2. In `forward`, update the hidden state by computing the new value of the variable `h` using the formula provided in the lecture notes.
    - You may use the appropriate non-linear activation function from [`torch.nn.functional`](https://docs.pytorch.org/docs/stable/nn.functional.html#non-linear-activation-functions), imported as `F`.
    - Make sure to omit the bias term because we initialized `self.W_hh` without one 😉.

Use only `torch.nn` and `torch.nn.functional`.

In [10]:
class ElmanRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        ### YOUR CODE HERE ###
        self.W_ih = nn.Linear(input_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.out = nn.Linear(hidden_size, output_size)
        ######################

    def forward(self, x, hidden):
        batch_size, seq_len, _ = x.size()
        h = hidden
        outputs = []

        for t in range(seq_len):
            x_t = x[:, t, :]
            ### YOUR CODE HERE ###
            h = F.tanh(self.W_ih(x_t) + self.W_hh(h))
            # Question: do we also need to drop the bias b_ih of W_ih here (b_h = b_ih + b_hh)?
            # F.tanh(F.linear(x_t, self.W_ih.weight) + self.W_hh(h)) would bypass b_ih in the
            # forward pass instead of disabling it in __init__. Is that the intended approach?
            ######################
            outputs.append(self.out(h))

        out = torch.stack(outputs, dim=1)
        out = out.contiguous().view(batch_size * seq_len, -1)
        return out, h

    def init_hidden(self, batch_size):
        return torch.zeros(batch_size, self.W_hh.in_features)


The activation is fine as written. We don't need to remove the bias from `self.W_ih`, and we don't have to add a bias explicitly inside the activation, because `self.W_ih` already provides one.
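
One way to convince yourself that a single bias is enough: two separate biases `b_ih` and `b_hh` always appear as a sum, so they can be folded into one vector. The sketch below checks this numerically. The layer sizes and the names `W_ih_single` and `W_hh_nobias` are assumptions made for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 5, 3  # illustrative sizes (assumption)

# Variant 1: both layers carry a bias (b_ih and b_hh)
W_ih = nn.Linear(input_size, hidden_size)
W_hh = nn.Linear(hidden_size, hidden_size)

# Variant 2: only the input layer has a bias, set to b_ih + b_hh
W_ih_single = nn.Linear(input_size, hidden_size)
W_hh_nobias = nn.Linear(hidden_size, hidden_size, bias=False)
with torch.no_grad():
    W_ih_single.weight.copy_(W_ih.weight)
    W_ih_single.bias.copy_(W_ih.bias + W_hh.bias)
    W_hh_nobias.weight.copy_(W_hh.weight)

x_t = torch.randn(2, input_size)
h_prev = torch.randn(2, hidden_size)

h_two_biases = torch.tanh(W_ih(x_t) + W_hh(h_prev))
h_one_bias = torch.tanh(W_ih_single(x_t) + W_hh_nobias(h_prev))

same = torch.allclose(h_two_biases, h_one_bias, atol=1e-6)
print(same)
```

Both variants describe the same family of functions, so dropping the bias on `W_hh` loses no expressive power.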

#### 3.2 **(2pts) Which of the following statements about the ElmanRNN class are TRUE?**
*The same multiple-answer rules apply from the fundamentals section, meaning there could be one or more correct answers.*

 A. The model supports multi-layer RNNs using `self.W_hh`.❌

 B. The model manually computes the hidden state at each timestep using `torch.tanh(...)`. ✅

 C. `self.W_ih` and `self.W_hh` are used together to update the hidden state. ✅

 D. `init_hidden()` creates a hidden state tensor of shape (batch_size, hidden_size). ✅

> *Remember thee? \
Yea, from the table of my memory \
I'll wipe away all trivial fond records, \
All saws of books, all forms, all pressures past*

\- Hamlet Act 1, Scene 5

#### 3.3 **(1pt) Write an RNN with `nn.RNN`.**
Below is the skeleton of a vanilla RNN model for character-level text generation. Your task is to complete only two lines inside the `__init__` method.
1. Create the RNN layer and assign it to `self.rnn`
    - Use `nn.RNN` from PyTorch
    
    Parameters:
    - `input_size` (number of input features)
    - `hidden_size` (number of features in the hidden state)
    - `num_layers` (how many stacked RNN layers)
    - `batch_first=True` (ensures input shape is (batch, seq, feature))
    - `dropout=0.5` (applies dropout between RNN layers)

2. Create the output layer and assign it to `self.fc`
    - Use `nn.Linear` to map from `hidden_size` to `output_size`

You can use: `torch`, `torch.nn`

In [11]:
class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.output_size = output_size

        ### YOUR CODE HERE ###
        self.rnn = nn.RNN(self.input_size, self.hidden_size, num_layers=self.num_layers, batch_first=True, dropout=0.5)
        self.fc = nn.Linear(self.hidden_size, self.output_size)
        ######################

    def forward(self, x, hidden):
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out.contiguous().view(-1, out.size(2)))
        return out, hidden

    def init_hidden(self, batch_size):
        return torch.zeros(self.num_layers, batch_size, self.rnn.hidden_size)
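
To make the tensor shapes in `VanillaRNN.forward` concrete, here is a small shape check on an `nn.RNN` layer used on its own. It is a sketch with made-up sizes (batch of 4, sequences of 10, and a reduced vocabulary and hidden size), not the hyperparameters used for training.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions), deliberately small
batch_size, seq_len, vocab_size, hidden_size, num_layers = 4, 10, 7, 16, 2

rnn = nn.RNN(vocab_size, hidden_size, num_layers=num_layers,
             batch_first=True, dropout=0.5)
fc = nn.Linear(hidden_size, vocab_size)

x = torch.zeros(batch_size, seq_len, vocab_size)        # (batch, seq, feature)
h0 = torch.zeros(num_layers, batch_size, hidden_size)   # (layers, batch, hidden)

out, h_n = rnn(x, h0)
logits = fc(out.contiguous().view(-1, out.size(2)))     # flatten batch and time

print(out.shape)     # hidden states of the last layer at every time step
print(h_n.shape)     # final hidden state of every layer
print(logits.shape)  # one row of logits per (sequence, time step)
```

Flattening to `(batch_size * seq_len, vocab_size)` matches the `y.view(-1)` targets that the training loop passes to the loss.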


## Training Function

In [12]:
# This function was adapted from Deep Learning by Prof. Paolo Favaro, University of Bern

def train(model, data, vocab_size, epochs=20, batch_size=128, seq_length=100, lr=0.001, clip=5, val_frac=0.1, print_every=1, device=device):
    model.to(device)
    model.train()

    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # Create train/val splits
    dataset = SubsequencesDataset(data, seq_length=seq_length)
    train_size = int(len(dataset) * (1 - val_frac))
    val_size = len(dataset) - train_size
    training_set, validation_set = random_split(dataset, [train_size, val_size])

    train_loader = DataLoader(training_set, batch_size=batch_size, shuffle=True, pin_memory=True)
    val_loader = DataLoader(validation_set, batch_size=batch_size, shuffle=False, pin_memory=True)

    print(f"Training {model.__class__.__name__} on {device}")

    for e in range(epochs):
        for x, y in train_loader:
            h = model.init_hidden(x.size(0))
            h = tuple(h_.to(device) for h_ in h) if isinstance(h, tuple) else h.to(device)

            x = one_hot_encode(x, vocab_size).to(device)
            y = y.to(device)

            model.zero_grad()
            output, h = model(x, h)
            loss = criterion(output, y.view(-1).long())
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), clip)
            opt.step()

        if e % print_every == 0:
            model.eval()
            val_losses = []

            with torch.no_grad():
                for x_val, y_val in val_loader:
                    val_h = model.init_hidden(x_val.size(0))
                    val_h = tuple(vh.to(device) for vh in val_h) if isinstance(val_h, tuple) else val_h.to(device)

                    x_val = one_hot_encode(x_val, vocab_size).to(device)
                    y_val = y_val.to(device)

                    val_out, val_h = model(x_val, val_h)
                    val_loss = criterion(val_out, y_val.view(-1).long())
                    val_losses.append(val_loss.item())

            model.train()
            print(f"Epoch: {e+1}/{epochs}... "
                    f"Loss: {loss.item():.4f}... Val Loss: {np.mean(val_losses):.4f}")


# Hyperparameters

In [13]:
hidden_size = 256
num_layers = 2

batch_size = 128
seq_length = 100
n_epochs = 20

vocab_size = len(chars)

#### 3.4 **(2pts) Initialize and train the 2 models (ElmanRNN, VanillaRNN).**
- Use the hyperparameters and `train` function defined above.

⌛ Each model will take between 30 seconds and 3 minutes to train.

In [14]:
Elman_RNN = ElmanRNN(input_size=vocab_size,hidden_size=hidden_size,output_size=vocab_size)
print("Elman RNN Training:")
train(model = Elman_RNN,
      data = encoded,
      vocab_size = vocab_size,
      epochs = n_epochs,
      batch_size = batch_size,
      seq_length = seq_length,
      )

print("Vanilla RNN Training:")
Vanilla_RNN = VanillaRNN(input_size=vocab_size,hidden_size=hidden_size,output_size=vocab_size,num_layers=num_layers)
train(model = Vanilla_RNN,
      data = encoded,
      vocab_size = vocab_size,
      epochs = n_epochs,
      batch_size = batch_size,
      seq_length = seq_length
      )

Elman RNN Training:
Training ElmanRNN on cuda
Epoch: 1/20... Loss: 3.1049... Val Loss: 3.0896
Epoch: 2/20... Loss: 2.5488... Val Loss: 2.5619
Epoch: 3/20... Loss: 2.3679... Val Loss: 2.3475
Epoch: 4/20... Loss: 2.2021... Val Loss: 2.2355
Epoch: 5/20... Loss: 2.1751... Val Loss: 2.1644
Epoch: 6/20... Loss: 2.1061... Val Loss: 2.1031
Epoch: 7/20... Loss: 2.0626... Val Loss: 2.0510
Epoch: 8/20... Loss: 2.0469... Val Loss: 2.0113
Epoch: 9/20... Loss: 1.9748... Val Loss: 1.9724
Epoch: 10/20... Loss: 1.9382... Val Loss: 1.9457
Epoch: 11/20... Loss: 1.9440... Val Loss: 1.9131
Epoch: 12/20... Loss: 1.8963... Val Loss: 1.8905
Epoch: 13/20... Loss: 1.8701... Val Loss: 1.8590
Epoch: 14/20... Loss: 1.7796... Val Loss: 1.8355
Epoch: 15/20... Loss: 1.8059... Val Loss: 1.8159
Epoch: 16/20... Loss: 1.7750... Val Loss: 1.8005
Epoch: 17/20... Loss: 1.7696... Val Loss: 1.7787
Epoch: 18/20... Loss: 1.7672... Val Loss: 1.7773
Epoch: 19/20... Loss: 1.7816... Val Loss: 1.7473
Epoch: 20/20... Loss: 1.7261... 

## Prediction and Generation Functions

In [15]:
# This function was adapted from Deep Learning by Prof. Paolo Favaro, University of Bern

def predict(model, char, h=None, top_k=None, device=device):
    ''' Given a character, predict the next character.
        Returns the predicted character and the hidden state.
    '''

    # Char to int → to one-hot → to device
    x = torch.LongTensor([[char_to_int[char]]])
    x = one_hot_encode(x, vocab_size).to(device)

    # Move hidden state to same device
    if isinstance(h, tuple):
        h = tuple(each.to(device) for each in h)
    else:
        h = h.to(device)

    # get the output of the model
    out, h = model(x, h)

    # get the character probabilities
    p = F.softmax(out, dim=1).data
    p = p.cpu().numpy().squeeze()

    # get top characters
    if top_k is None:
        top_ch = np.arange(vocab_size)
    else:
        p, top_ch = torch.topk(torch.tensor(p), top_k)
        top_ch = top_ch.numpy().squeeze()
        p = p.numpy().squeeze()

    # select the likely next character with some randomness
    char = np.random.choice(top_ch, p=p/p.sum())

    return int_to_char[char], h


In [16]:
# This function was adapted from Deep Learning by Prof. Paolo Favaro, University of Bern

def generate(model, size, prime='The', top_k=None):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    # Prime the model with the initial characters
    chars = [ch for ch in prime]
    h = model.init_hidden(1)
    if isinstance(h, tuple):
        h = tuple(each.to(device) for each in h)
    else:
        h = h.to(device)

    for ch in prime:
        char, h = predict(model, ch, h, top_k=top_k, device=device)
    chars.append(char)

    # Generate characters
    for _ in range(size):
        char, h = predict(model, chars[-1], h, top_k=top_k, device=device)
        chars.append(char)

    return ''.join(chars)


#### 3.5 **(2pts) Generate Text from Each Model.**
- Use the `generate` function provided above to produce text from each of your models.
- Generate 1000 characters of text.
- Choose your own start word or prompt.
- Experiment with different values of `top_k`
- **Print** the generated text clearly for each model.

In [17]:
k_values=[3,5,10]

for k in k_values:
    text_elman = generate(model = Elman_RNN,size = 1000,prime = 'An',top_k = k)
    print(f"Elman text (top_k = {k}):\n\n", text_elman)
    print("\n----------------------------------------------------------------------------------------------------------------\n")
    text_vanilla = generate(model = Vanilla_RNN,size = 1000,prime = 'An',top_k = k)
    print(f"Vanilla text (top_k = {k}):\n\n", text_vanilla)
    print("\n----------------------------------------------------------------------------------------------------------------\n")

Elman text (top_k = 3):

 And mer with have my hours,
What the wolld me would by the will the seaden shall be not this done.

POMIONAE:
That I am sharr here a manders of men oun and stands assell.

CORIOLANUS:
I'll sear me the shall he stare you meres theme the will and stand to but as thee,
That, servingers there of him toor.

CORIOLANUS:
I here the raye heaven shall heart,
And shal seat to ser will sorr heads the saul to strike the reare them, and, when I will serve yea hard
As a manter to mar we come to mer
And the with me hore that whene he shell here that
Is manes the strench out off ame tell.

CARISLA:
That Is that what the senter treak,
If the sarse a trought here to shee the striend to my son.

CLIUS:
And they so mer you are the rows off coursely sor,
Theres men to the react, I have shall serve the say. I would bear hears the raye to the sheres the saules our crust a will
I should shele sor that were you seaves
And wish my sould, but to the will the shall so the warth the sare


#### 3.6 **(3pts) Compare the output of the Elman RNN and PyTorch RNN.**
In your answer, consider:
- Fluency and resemblance to English
- Word-like structure or gibberish
- The effect of `top_k` on randomness vs. coherence

Fluency / resemblance to English:
   - Both the custom Elman model and the PyTorch model produce text with little phrase-level coherence or syntax and little understandable meaning, which is expected because they approximate text at the character level rather than at the word or phrase level.
   - The PyTorch model produces more recognizable phrases or fragments of phrases, whereas the custom Elman output looks more like old or archaic English.

Word-like structure:
   - The custom Elman model produces more archaic or gibberish words and more misspellings (e.g. too many or too few letters, or letters in the wrong order) than the PyTorch model, which spells words better.

Effect of `top_k`:
   - Larger `top_k` values allow more random words and spellings, a greater variety of phrases, and more 'creativity' in wording for both models. However, the phrases still show little real coherence or syntax.
   - Smaller `top_k` values make more words resemble real English and give the phrases some coherence and syntax. However, similar words and phrase structures may be repeated more often (probably because they occur frequently in the training text).

# 4. Code Comprehension (6 points)
Before wrapping up, let’s test your understanding of the character-level RNN code you just explored. For each topic, identify which statements are TRUE ✅ and which are FALSE ❌. Each question may have 1, 2, 3, or 4 correct answers.

- 4 correct answers: 2 points
- 3 correct answers: 1 point
- 2 or fewer correct answers: 0 points

💡 In Google Colab, you can easily add emojis to markdown cells by typing `:` followed by the emoji's name. For example, typing `:light-bulb` will display a light bulb emoji. This feature is also available as an extension in many IDEs.

❗ **TIP:** If a term is unfamiliar to you, look it up in [Google's ML Glossary](https://developers.google.com/machine-learning/glossary) for a simple explanation.

#### 4.1 **(2pts) Which of the following statements describe characteristics of Elman’s RNN in the context of character-level text generation?**

 A. The hidden state $h_t$ is computed using only the current character input $x_t$. ❌

 B. The RNN reads one character at a time and updates its hidden state sequentially. ✅

 C. The model uses convolutional layers to capture character patterns. ❌

 D. The output at each timestep predicts the probability of the next character. ✅


#### 4.2 **(2pts) What components are essential in character-level RNNs for training on a corpus of text?**

 A. A vocabulary of all possible characters ✅

 B. One-hot or embedding representation of characters ✅

 C. A loop or recurrence that updates the hidden state per character ✅

 D. A decoder that transforms characters to words ❌

#### 4.3 **(2pts) Which of the following statements correctly describe the role and effect of the `top_k` parameter in character-level text generation?**

 A. `top_k` directly affects the training process and gradient computation. ❌

 B. Decreasing `top_k` leads to more diverse but potentially less coherent text. ❌

 C. Setting `top_k` to 1 is equivalent to greedy sampling (always choosing the most likely character). ✅

 D. `top_k` ensures syntactic correctness by sampling from grammatically valid continuations. ❌
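
The behaviour described in 4.3 can be reproduced on a toy distribution. The sketch below is a simplified stand-in for the sampling step in `predict`. The helper `sample_top_k` and the probabilities are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_top_k(probs: np.ndarray, top_k: int, rng: np.random.Generator) -> int:
    """Hypothetical helper: sample an index among the top_k most likely entries."""
    top_idx = np.argsort(probs)[-top_k:]            # indices of the k largest probabilities
    top_p = probs[top_idx] / probs[top_idx].sum()   # renormalise over the kept entries
    return int(rng.choice(top_idx, p=top_p))


probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])  # toy next-character distribution

# top_k = 1 keeps only the most likely index, i.e. greedy decoding
greedy = {sample_top_k(probs, 1, rng) for _ in range(100)}

# top_k = 3 can only ever return one of the three most likely indices
top3 = {sample_top_k(probs, 3, rng) for _ in range(1000)}

print(greedy, top3)
```

Sampling happens only at generation time, so `top_k` has no influence on training or on the gradients.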

---
Great work reaching the end of this assignment! 🌟

You’ve implemented your own Elman RNN, trained multiple models to generate text, and compared their outputs in terms of fluency and structure. Along the way, you explored how sampling strategies like `top_k` influence creativity vs. coherence in generated sequences.

Although these character-level generative models can’t yet write Shakespearean plays, you’ve taken a big step toward understanding how machines learn to generate language. Keep experimenting, try new corpora, tweak hyperparameters, or explore word-level generation next! 🚀✨