# IS319 - Deep Learning

## TP3 - Recurrent neural networks

Credits: Andrej Karpathy

The goal of this TP is to experiment with recurrent neural networks by building a character-level language model that generates text resembling its training data.

In [106]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 1. Text data preprocessing

Several text datasets are provided; feel free to experiment with different ones throughout the TP. To begin with, use a small subset of a given dataset (for example, only the first 10k characters).

In [107]:
# text_data_fname = 'baudelaire.txt'  # ~0.1m characters (French)
# text_data_fname = 'proust.txt'      # ~7.3m characters (French)
# text_data_fname = 'shakespeare.txt' # ~0.1m characters (English)
text_data_fname = 'lotr.txt'        # ~2.5m characters (English)
# text_data_fname = 'doom.c'          # ~1m characters (C Code)
# text_data_fname = 'linux.c'         # ~11.5m characters (C code)

# Read the whole file, closing it once the text is loaded
with open(text_data_fname, 'r') as f:
    text_data = f.read()
text_data = text_data[:10000] # use a small subset
print(f'Dataset `{text_data_fname}` contains {len(text_data)} characters.')
print('Excerpt of the dataset:')
print(text_data[:2000])

Dataset `lotr.txt` contains 10000 characters.
Excerpt of the dataset:
Three Rings for the Elven-kings under the sky,
               Seven for the Dwarf-lords in their halls of stone,
            Nine for Mortal Men doomed to die,
              One for the Dark Lord on his dark throne
           In the Land of Mordor where the Shadows lie.
               One Ring to rule them all, One Ring to find them,
               One Ring to bring them all and in the darkness bind them
           In the Land of Mordor where the Shadows lie.
           
FOREWORD

This tale grew in the telling, until it became a history of the Great War of the Ring and included many glimpses of the yet more ancient history that preceded it. It was begun soon after _The Hobbit_ was written and before its publication in 1937; but I did not go on with this sequel, for I wished first to complete and set in order the mythology and legends of the Elder Days, which had then been taking shape for some years. I desired to do 

**(Question)** Create a character-level vocabulary for your text data. Create two dictionaries: `ctoi` mapping each character to an index, and the reverse `itoc` mapping each index to its corresponding character. Implement the functions to convert text to tensor and tensor to text using these mappings. Apply these functions to some text data.

In [108]:
# Create the vocabulary and the two mapping dictionaries
# YOUR CODE HERE

import numpy as np

# Unique characters of the text; set ordering is arbitrary, so indices may differ between runs
vocabulary = list(set(text_data))

print(vocabulary)

ctoi = {c: i for i, c in enumerate(vocabulary)}

itoc = {i: c for i, c in enumerate(vocabulary)}

print(itoc)
print(ctoi)

# Implement the function converting text to tensor
def text_to_tensor(text, ctoi):
    # YOUR CODE HERE
    # Integer (long) indices are what nn.Embedding expects as input
    return torch.tensor([ctoi[c] for c in text], dtype=torch.long)

# Implement the function converting tensor to text
def tensor_to_text(tensor, itoc):
    # YOUR CODE HERE
    # .tolist() yields plain Python ints, which are the keys of `itoc`
    return ''.join(itoc[i] for i in tensor.tolist())

# Apply your functions to some text data
# YOUR CODE HERE
sample = text_data[:10]

print("Sample : " + str(sample))

s_tensor = text_to_tensor(sample, ctoi)
s_text = tensor_to_text(s_tensor, itoc)

print("Tensor of sample : " + str(s_tensor))
print("Text of tensor : " + "".join(s_text))

['i', 'H', '-', ',', ';', ':', '8', 'l', 'E', 'Y', 'u', '(', 'o', 'L', 'c', 's', 'y', '6', ')', 'S', '_', 'G', '4', '.', 'C', '7', 'f', 'P', 'n', 'r', 'I', 'w', 'm', 'B', 'ó', 'v', 'D', '"', '3', 'e', 'R', 'b', 'é', ' ', 'W', 'g', 'd', 'k', 'F', 't', 'p', "'", 'A', 'T', 'q', 'x', 'û', 'h', 'M', 'z', 'O', '9', '1', 'N', 'a', '\n', 'j']
{0: 'i', 1: 'H', 2: '-', 3: ',', 4: ';', 5: ':', 6: '8', 7: 'l', 8: 'E', 9: 'Y', 10: 'u', 11: '(', 12: 'o', 13: 'L', 14: 'c', 15: 's', 16: 'y', 17: '6', 18: ')', 19: 'S', 20: '_', 21: 'G', 22: '4', 23: '.', 24: 'C', 25: '7', 26: 'f', 27: 'P', 28: 'n', 29: 'r', 30: 'I', 31: 'w', 32: 'm', 33: 'B', 34: 'ó', 35: 'v', 36: 'D', 37: '"', 38: '3', 39: 'e', 40: 'R', 41: 'b', 42: 'é', 43: ' ', 44: 'W', 45: 'g', 46: 'd', 47: 'k', 48: 'F', 49: 't', 50: 'p', 51: "'", 52: 'A', 53: 'T', 54: 'q', 55: 'x', 56: 'û', 57: 'h', 58: 'M', 59: 'z', 60: 'O', 61: '9', 62: '1', 63: 'N', 64: 'a', 65: '\n', 66: 'j'}
{'i': 0, 'H': 1, '-': 2, ',': 3, ';': 4, ':': 5, '8': 6, 'l': 7, 'E'
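The mapping above is built from a Python `set`, whose iteration order is not guaranteed to be the same from one run to the next. A minimal sketch of a reproducible variant, assuming a sorted vocabulary is acceptable, is given below. The names `toy_text`, `toy_vocabulary`, `toy_ctoi` and `toy_itoc` are hypothetical and only serve this illustration.

```python
import torch

# Hypothetical toy text, used only for this illustration
toy_text = "hello world"

# Sorting the character set makes the index assignment reproducible across runs
toy_vocabulary = sorted(set(toy_text))
toy_ctoi = {c: i for i, c in enumerate(toy_vocabulary)}
toy_itoc = {i: c for i, c in enumerate(toy_vocabulary)}

# Encode to a LongTensor of indices, then decode back to a string
encoded = torch.tensor([toy_ctoi[c] for c in toy_text], dtype=torch.long)
decoded = "".join(toy_itoc[i] for i in encoded.tolist())

print(toy_vocabulary)
print(encoded)
print(decoded)
```

A round trip through both mappings should return the original text, which is a quick sanity check for the two conversion functions.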

## 2. Setup a character-level recurrent neural network

**(Question)** Set up a simple embedding layer with `nn.Embedding` to project character indices to `embedding_dim`-dimensional vectors. Explain precisely how this layer works and what its outputs are for a given input sequence.

In [109]:
# YOUR CODE HERE

vocab_size = len(vocabulary)

embedding_dim = 10 # ?

embedding_layer = nn.Embedding(vocab_size, embedding_dim)

In [110]:
#Test 

sample = text_data[:10]

sample_tensor = text_to_tensor(sample, ctoi)

s_embedding = embedding_layer(sample_tensor)

print(s_embedding)

tensor([[-1.2222, -0.2497,  0.5205, -1.1818, -0.3512,  0.0829,  0.0615, -0.9148,
         -0.8751, -0.4869],
        [-2.4835, -0.4966,  0.9651,  0.3900,  2.0285, -0.4262,  0.3173,  0.6685,
          0.1133, -1.0985],
        [ 0.6244, -0.9654,  0.6433, -0.2205, -0.0602, -1.1424, -0.2153, -1.4845,
         -2.9274, -1.4334],
        [-0.9708,  0.8360, -0.6122, -1.8138,  0.7967,  1.0027,  0.3080,  2.0241,
         -1.6984, -0.6595],
        [-0.9708,  0.8360, -0.6122, -1.8138,  0.7967,  1.0027,  0.3080,  2.0241,
         -1.6984, -0.6595],
        [-0.3834, -0.4818,  0.8683,  1.1925, -0.0331,  0.1693,  0.9464,  0.8296,
         -0.7374, -0.6310],
        [-0.6513,  0.4812,  0.5314,  0.1909,  0.2617,  0.7559, -2.2369,  0.4191,
          0.3797,  0.4100],
        [-0.1366, -0.8664, -0.1562, -0.1915,  0.8270,  0.0253, -0.5566, -0.6141,
          0.1634,  1.1352],
        [ 0.6390,  0.9776,  0.2482, -0.6635,  1.8979,  0.6915, -1.0867, -1.8876,
         -1.3236, -0.7577],
        [ 0.0688, -

YOUR ANSWER HERE
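One way to explore this question is to compare the layer's output with a direct row lookup in its weight matrix. The sketch below uses hypothetical toy sizes (`toy_vocab_size`, `toy_embedding_dim`) rather than the values of the notebook. It suggests that the output has one row per input index and that repeated indices yield identical rows, which is consistent with the two identical consecutive rows in the output above (the two `e` characters of `Three`).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

toy_vocab_size, toy_embedding_dim = 5, 3  # hypothetical sizes
toy_embedding = nn.Embedding(toy_vocab_size, toy_embedding_dim)

indices = torch.tensor([2, 0, 2, 4])  # a sequence of 4 character indices
vectors = toy_embedding(indices)

print(vectors.shape)  # (sequence length, embedding_dim)

# The layer acts as a lookup table: row i of the output is row indices[i] of the weights
same_as_lookup = torch.equal(vectors, toy_embedding.weight[indices])

# Repeated indices (positions 0 and 2 here) map to identical vectors
repeated_rows_match = torch.equal(vectors[0], vectors[2])

print(same_as_lookup, repeated_rows_match)
```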

**(Question)** Set up a single-layer RNN with `nn.RNN` (without defining a custom class). Use `hidden_dim` size for hidden states. Explain precisely the outputs of this layer for a given input sequence.

In [111]:
# YOUR CODE HERE

hidden_dim = 10

simple_rnn = nn.Sequential(
    embedding_layer,  
    nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)
)

In [112]:
sample_output, hid = simple_rnn(text_to_tensor(sample, ctoi))
print(sample_output)
print(hid)

tensor([[ 0.4989,  0.4847, -0.2031,  0.3047, -0.7470,  0.3420,  0.0329, -0.3772,
          0.0595,  0.2205],
        [ 0.6564,  0.7275,  0.5441,  0.0686, -0.7717,  0.6560,  0.3799,  0.1070,
         -0.8785,  0.1572],
        [ 0.7917,  0.8228, -0.8421,  0.8381, -0.9041,  0.5942,  0.0492,  0.1503,
         -0.0644, -0.5926],
        [-0.4491,  0.6646,  0.3512,  0.2249, -0.4473,  0.1599, -0.3692, -0.1253,
         -0.4995,  0.6542],
        [-0.4573,  0.6285,  0.2699, -0.4040, -0.5353, -0.2194, -0.2508, -0.4276,
         -0.2011,  0.5817],
        [ 0.2645,  0.6063, -0.0918,  0.1322,  0.1753,  0.5787,  0.1951,  0.5038,
         -0.4850, -0.1601],
        [-0.7393,  0.5371,  0.5715, -0.4982,  0.0740,  0.3937, -0.4383, -0.4787,
         -0.3877,  0.8461],
        [-0.0964, -0.4158,  0.4477, -0.5916,  0.0222,  0.2129,  0.1277, -0.3081,
          0.0237,  0.0901],
        [ 0.7715, -0.1714, -0.8914, -0.0446, -0.9030,  0.2613, -0.4587, -0.8389,
          0.1010, -0.2958],
        [-0.4957, -

YOUR ANSWER HERE
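To make the two return values of `nn.RNN` concrete, here is a small sketch with hypothetical toy sizes and a random unbatched input (unbatched inputs assume a reasonably recent PyTorch version). It checks that the first output holds one hidden state per time step, that the second holds the final hidden state of each layer, and that the final state can be reproduced with the tanh recurrence documented for `nn.RNN`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

toy_seq_len, toy_embedding_dim, toy_hidden_dim = 6, 4, 3  # hypothetical sizes
toy_rnn = nn.RNN(input_size=toy_embedding_dim, hidden_size=toy_hidden_dim)

# Unbatched input: one vector per time step
inputs = torch.randn(toy_seq_len, toy_embedding_dim)
outputs, final_hidden = toy_rnn(inputs)

print(outputs.shape)       # one hidden state per time step
print(final_hidden.shape)  # one final hidden state per layer

# For a single-layer RNN, the last row of `outputs` is the final hidden state
last_step_matches = torch.allclose(outputs[-1], final_hidden[-1])

# Recompute the recurrence by hand: h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
with torch.no_grad():
    h = torch.zeros(toy_hidden_dim)
    for x in inputs:
        h = torch.tanh(toy_rnn.weight_ih_l0 @ x + toy_rnn.bias_ih_l0
                       + toy_rnn.weight_hh_l0 @ h + toy_rnn.bias_hh_l0)
    manual_matches = torch.allclose(h, final_hidden[-1], atol=1e-5)

print(last_step_matches, manual_matches)
```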

**(Question)** Create a simple RNN model with a custom `nn.Module` class. It should contain: an embedding layer, a single-layer RNN, and a dense output layer. For each character of the input sequence, the model should predict the probability of the next character. The forward method should return the probabilities for next characters and the corresponding hidden states.
After completing the class, create a model and apply the forward pass on some input text. Understand and explain the results.

*Note:* depending on how you implement the loss function later, it can be convenient to return logits instead of probabilities, i.e. raw values of the output layer before any activation function. 
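The note about logits can be illustrated with a short sketch, using hypothetical toy tensors: `F.cross_entropy` applied to raw logits should give the same value as an explicit `log_softmax` followed by `F.nll_loss`. This is why returning logits from the forward pass is usually sufficient for training, with the softmax applied only when probabilities are needed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

toy_logits = torch.randn(4, 7)             # 4 positions, 7 hypothetical classes
toy_targets = torch.tensor([1, 0, 6, 3])   # index of the expected next character

# Option A: pass raw logits; cross_entropy applies log-softmax internally
loss_from_logits = F.cross_entropy(toy_logits, toy_targets)

# Option B: normalise explicitly, then use the negative log-likelihood
log_probs = F.log_softmax(toy_logits, dim=-1)
loss_from_log_probs = F.nll_loss(log_probs, toy_targets)

# Probabilities are only needed for inspection or sampling
probabilities = F.softmax(toy_logits, dim=-1)

print(loss_from_logits.item(), loss_from_log_probs.item())
```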

In [113]:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=1):
        '''Initialize model parameters and layers.'''
        super().__init__()
        # YOUR CODE HERE
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        self.emb = nn.Embedding(vocab_size, embedding_dim)
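        # NOTE: `num_layers` is accepted above but not passed to nn.RNN, so this model is always single-layer
        self.rnn = nn.RNN(embedding_dim, hidden_dim)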
        self.lin = nn.Linear(hidden_dim, vocab_size)


    def forward(self, tensor_data, hidden_state=None):
        '''Apply the forward pass for some text data already converted to tensor.'''
        # YOUR CODE HERE
        out = self.emb(tensor_data)
        out, hid = self.rnn(out, hidden_state)
        out = self.lin(out)
        
        return out, hid

# Initialize a model and apply the forward pass on some input text
# YOUR CODE HERE

embedding_dim = 15
hidden_dim = 10

model = CharRNN(vocab_size, embedding_dim, hidden_dim)

sample = "hello world"

sample_tensor = text_to_tensor(sample, ctoi)

logits, hidden = model(sample_tensor)

# Normalise over the vocabulary dimension; omitting `dim` triggers a deprecation warning
probabilities = F.softmax(logits, dim=-1)

print(probabilities[0])

tensor([0.0130, 0.0140, 0.0072, 0.0204, 0.0099, 0.0212, 0.0087, 0.0095, 0.0294,
        0.0156, 0.0198, 0.0117, 0.0140, 0.0127, 0.0198, 0.0129, 0.0155, 0.0258,
        0.0205, 0.0132, 0.0048, 0.0107, 0.0134, 0.0185, 0.0210, 0.0235, 0.0076,
        0.0141, 0.0128, 0.0058, 0.0232, 0.0167, 0.0189, 0.0094, 0.0164, 0.0152,
        0.0107, 0.0224, 0.0169, 0.0103, 0.0214, 0.0098, 0.0236, 0.0134, 0.0236,
        0.0157, 0.0246, 0.0090, 0.0109, 0.0061, 0.0095, 0.0115, 0.0125, 0.0103,
        0.0138, 0.0072, 0.0308, 0.0209, 0.0138, 0.0098, 0.0150, 0.0199, 0.0177,
        0.0096, 0.0188, 0.0066, 0.0070], grad_fn=<SelectBackward0>)




YOUR ANSWER HERE

**(Question)** Implement a simple training loop to overfit on a small input sequence. The loss function should be a categorical cross entropy on the predicted characters. Monitor the loss function value over the iterations.

In [114]:
# Sample a small input sequence into tensor `input_seq` and store its corresponding expected sequence into tensor `target_seq`
# YOUR CODE HERE

# Implement a training loop overfitting an input sequence and monitoring the loss function
def train_overfit(model, input_seq, target_seq, n_iters=200, learning_rate=0.2):
    # YOUR CODE HERE
    # CrossEntropyLoss takes raw logits; the targets used below are one-hot rows (class probabilities)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=0.005, momentum=0.9)

    for i in range(n_iters):
        outputs, _ = model(input_seq)  # logits of shape (seq_len, vocab_size)
        loss = criterion(outputs, target_seq)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Each "epoch" printed here is one gradient step on the same sequence
        if (i + 1) % 10 == 0:
            print('Epoch [{}/{}], Loss: {:.4f}'.format(i+1, n_iters, loss.item()))

    
# Initialize a model and make it overfit the input sequence
# YOUR CODE HERE

def one_hot(input_seq, target_seq, vocab_size, ctoi):
    '''Build one one-hot row per input position, marking the expected next character.'''
    one_hot_vec = np.zeros((len(input_seq), vocab_size))
    for i in range(len(input_seq)):
        one_hot_vec[i][ctoi[target_seq[i]]] = 1
    return torch.Tensor(one_hot_vec)


input_seq = text_data[:100]
target_seq = text_data[1:101]

target_seq = one_hot(input_seq, target_seq, vocab_size, ctoi)
input_seq = text_to_tensor(input_seq, ctoi)

embedding_dim = 5
hidden_dim = 3

model = CharRNN(vocab_size, embedding_dim, hidden_dim)

train_overfit(model, input_seq, target_seq, learning_rate=0.3)

Epoch [10/200], Loss: 3.1742
Epoch [20/200], Loss: 2.7236
Epoch [30/200], Loss: 2.5126
Epoch [40/200], Loss: 2.3927
Epoch [50/200], Loss: 2.3156
Epoch [60/200], Loss: 2.2267
Epoch [70/200], Loss: 2.1615
Epoch [80/200], Loss: 2.1106
Epoch [90/200], Loss: 2.0699
Epoch [100/200], Loss: 2.0269
Epoch [110/200], Loss: 1.9868
Epoch [120/200], Loss: 1.9448
Epoch [130/200], Loss: 1.8908
Epoch [140/200], Loss: 1.8046
Epoch [150/200], Loss: 1.7258
Epoch [160/200], Loss: 1.6538
Epoch [170/200], Loss: 1.5945
Epoch [180/200], Loss: 1.5458
Epoch [190/200], Loss: 1.4963
Epoch [200/200], Loss: 1.4491
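The training cell above feeds one-hot rows to `nn.CrossEntropyLoss`. As a hedged alternative, the sketch below uses plain class indices as targets and checks that both target formats give the same loss value. It assumes a PyTorch version in which `nn.CrossEntropyLoss` accepts class-probability targets (1.10 or later, to the best of my knowledge); all `toy_*` names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

toy_vocab_size = 7                                   # hypothetical vocabulary size
toy_indices = torch.tensor([3, 1, 4, 1, 5, 2])       # a hypothetical encoded sequence

# The target is the input shifted by one character
toy_input_seq, toy_target_seq = toy_indices[:-1], toy_indices[1:]

# Stand-in for the model output: one row of logits per input position
toy_logits = torch.randn(len(toy_input_seq), toy_vocab_size)

criterion = nn.CrossEntropyLoss()

# Targets given as class indices (LongTensor of shape (seq_len,))
loss_indices = criterion(toy_logits, toy_target_seq)

# Targets given as one-hot rows (float tensor of shape (seq_len, vocab_size))
toy_one_hot = F.one_hot(toy_target_seq, num_classes=toy_vocab_size).float()
loss_one_hot = criterion(toy_logits, toy_one_hot)

print(loss_indices.item(), loss_one_hot.item())
```

Index targets avoid building a `(seq_len, vocab_size)` matrix, which may matter for larger vocabularies or longer sequences.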


**(Question)** Implement a `predict_argmax` method for your `RNN` model. Then, verify your overfitting: use some characters of your input sequence as context to predict the remaining ones. Experiment with the current model and analyze the results.

In [115]:
class CharRNN(CharRNN):
    def predict_argmax(self, context_tensor, n_predictions):
        predictions = []
        with torch.no_grad():  # generation does not need gradients
            # Apply the forward pass for the context tensor
            # Then, store the last prediction and last hidden state
            # YOUR CODE HERE
            logits, hidden = self.forward(context_tensor)
            last_logits = logits[-1]

            # Use the last prediction and last hidden state as inputs to the next forward pass
            # Do this in a loop to predict the next `n_predictions` characters
            # YOUR CODE HERE
            for _ in range(n_predictions):
                # Softmax is monotonic, so the argmax of the logits is the most probable character
                next_index = torch.argmax(last_logits).item()
                predictions.append(itoc[next_index])
                # Feed the predicted character back in, carrying the hidden state forward
                next_input = torch.tensor([next_index], dtype=torch.long, device=context_tensor.device)
                logits, hidden = self.forward(next_input, hidden)
                last_logits = logits[-1]

        return predictions
            
        

# Initialize a model and make it overfit as above
# Then, verify your overfitting by predicting characters given some context
# YOUR CODE HERE

input_seq = text_data[:1000]
target_seq = text_data[1:1001]

target_seq = one_hot(input_seq, target_seq, vocab_size, ctoi)
input_seq = text_to_tensor(input_seq, ctoi)

embedding_dim = 15
hidden_dim = 30

model = CharRNN(vocab_size, embedding_dim, hidden_dim)

train_overfit(model, input_seq, target_seq)

predictions = model.predict_argmax(text_to_tensor("Fight", ctoi), 200)

predictions = "".join(predictions)

print(predictions)

Epoch [10/200], Loss: 2.9891
Epoch [20/200], Loss: 2.6086
Epoch [30/200], Loss: 2.3733
Epoch [40/200], Loss: 2.1996
Epoch [50/200], Loss: 2.0738
Epoch [60/200], Loss: 1.9795
Epoch [70/200], Loss: 1.8997
Epoch [80/200], Loss: 1.8296
Epoch [90/200], Loss: 1.7662
Epoch [100/200], Loss: 1.7081
Epoch [110/200], Loss: 1.6550
Epoch [120/200], Loss: 1.6058
Epoch [130/200], Loss: 1.6138
Epoch [140/200], Loss: 1.5629
Epoch [150/200], Loss: 1.4916
Epoch [160/200], Loss: 1.4449
Epoch [170/200], Loss: 1.4073
Epoch [180/200], Loss: 1.3615
Epoch [190/200], Loss: 1.3545
Epoch [200/200], Loss: 1.3042
e the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the th


YOUR ANSWER HERE

Using the argmax function to predict the next character can yield a deterministic generator always predicting the same characters. Instead, it is common to predict the next character by sampling from the distribution of output predictions, adding some randomness into the generator.

**(Question)** Implement a `predict_proba` method for your `RNN` model. It should be very similar to `predict_argmax`, but instead of using argmax, it should randomly sample from the output predictions. To do that, you can use the `torch.distributions.Categorical` class and its `sample()` method. Verify that your method actually introduces randomness into the predictions.
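The difference between the two decoding strategies can be seen on a single hypothetical vector of logits (`toy_logits` below is made up for the illustration). Argmax always returns the same index, whereas sampling from a categorical distribution returns each index with a frequency that should approach its softmax probability.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

toy_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical scores for 4 characters

# Greedy decoding: the same index every time
greedy_choices = [torch.argmax(toy_logits).item() for _ in range(5)]

# Sampling: `logits=` lets Categorical apply the softmax itself
dist = Categorical(logits=toy_logits)
sampled_choices = dist.sample((1000,))

# Empirical frequencies of each index among the samples
empirical = torch.bincount(sampled_choices, minlength=4).float() / 1000

print(greedy_choices)
print(dist.probs)
print(empirical)
```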

In [121]:
class CharRNN(CharRNN):
    def predict_proba(self, input_context, n_predictions):
        # YOUR CODE HERE
        predictions = []
        with torch.no_grad():  # generation does not need gradients
            # Run the context through the model; keep the last logits and the final hidden state
            logits, hidden = self.forward(input_context)
            last_logits = logits[-1]
            for _ in range(n_predictions):
                # Sample the next character index from the softmax distribution over the logits
                dist = torch.distributions.Categorical(logits=last_logits)
                next_index = dist.sample().item()
                predictions.append(itoc[next_index])
                # Feed the sampled character back in, carrying the hidden state forward
                next_input = torch.tensor([next_index], dtype=torch.long, device=input_context.device)
                logits, hidden = self.forward(next_input, hidden)
                last_logits = logits[-1]

        return predictions

# Verify that your predictions are not deterministic anymore
# YOUR CODE HERE
input_seq = text_data[:1000]
target_seq = text_data[1:1001]

target_seq = one_hot(input_seq, target_seq, vocab_size, ctoi)
input_seq = text_to_tensor(input_seq, ctoi)

embedding_dim = 15
hidden_dim = 30

model = CharRNN(vocab_size, embedding_dim, hidden_dim)

train_overfit(model, input_seq, target_seq)

predictions = model.predict_proba(text_to_tensor("Fight", ctoi), 200)

predictions = "".join(predictions)

print(predictions)

Epoch [10/200], Loss: 3.1546
Epoch [20/200], Loss: 2.6569
Epoch [30/200], Loss: 2.4382
Epoch [40/200], Loss: 2.2633
Epoch [50/200], Loss: 2.1354
Epoch [60/200], Loss: 2.0413
Epoch [70/200], Loss: 1.9671
Epoch [80/200], Loss: 1.9018
Epoch [90/200], Loss: 1.8421
Epoch [100/200], Loss: 1.7861
Epoch [110/200], Loss: 1.7327
Epoch [120/200], Loss: 1.6805
Epoch [130/200], Loss: 1.6305
Epoch [140/200], Loss: 1.6209
Epoch [150/200], Loss: 1.5637
Epoch [160/200], Loss: 1.5101
Epoch [170/200], Loss: 1.4690
Epoch [180/200], Loss: 1.4521
Epoch [190/200], Loss: 1.4015
Epoch [200/200], Loss: 1.5252
óe .Ie.8en, royeon in orysrorf
 find seay û YhkUil=lis and of ton s toWns N9o
n  ondo- Rinw wiin lo3nes  Nn- them

 f í celewf Moryer warelacl to dornd te mhecale and lenuus in alrewe táe in Wlet( _ar


## 3. Train the RNN model on text data

**(Question)** Adapt your previous code to implement a proper training loop for a text dataset. To do so, we need to specify a sequence length `seq_len`, acting similarly to the batch size in classic neural networks. Then, you can either randomly sample sequences of length `seq_len` from the text dataset over `n_iters` iterations, or properly loop over the text dataset for `n_epochs` epochs (with a random starting point for each epoch to ensure different sequences), to make sure the whole dataset is seen by the model. Feel free to adjust training and model parameters empirically. Start with a small model and a small subset of the text dataset, then move on to larger experiments. Remember to use GPU if available.
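The two strategies described above (random sampling of sequences, or one pass over the dataset from a random starting point) could be sketched as follows. The helper names `sample_sequence` and `epoch_sequences` are hypothetical, and the sketch assumes the dataset has already been converted to a 1-D tensor of character indices. Targets are returned as index tensors, which `nn.CrossEntropyLoss` accepts directly.

```python
import random
import torch

def sample_sequence(data_tensor, seq_len):
    '''Return a random (input, target) pair; the target is the input shifted by one.'''
    # The last valid start leaves room for seq_len inputs plus one extra target character
    start = random.randint(0, len(data_tensor) - seq_len - 1)
    chunk = data_tensor[start:start + seq_len + 1]
    return chunk[:-1], chunk[1:]

def epoch_sequences(data_tensor, seq_len):
    '''Yield consecutive (input, target) pairs covering the data once, from a random offset.'''
    offset = random.randint(0, seq_len - 1)
    for start in range(offset, len(data_tensor) - seq_len, seq_len):
        chunk = data_tensor[start:start + seq_len + 1]
        yield chunk[:-1], chunk[1:]

# Toy data: consecutive integers make the one-step shift easy to see
toy_data = torch.arange(50)

x, y = sample_sequence(toy_data, seq_len=10)
pairs = list(epoch_sequences(toy_data, seq_len=10))

print(x, y)
print(len(pairs))
```

With real data, each pair would be moved to `device` before the forward pass, and the model itself would be moved once with `model.to(device)`.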

In [124]:
# Create the text dataset, compute its mappings and convert it to tensor
# YOUR CODE HERE
import random

with open(text_data_fname, 'r') as f:
    text_data = f.read()
data_length = len(text_data)

vocabulary = list(set(text_data))
vocab_size = len(vocabulary)

ctoi = {vocabulary[i] : i for i in range(vocab_size)}
itoc = {i : vocabulary[i] for i in range(vocab_size)}

seq_len = 10
num_batches = 200
# Each start index must leave room for seq_len inputs plus one extra target character
start_indexes = random.sample(range(0, data_length - seq_len), num_batches)

# Pair each input sequence with the same sequence shifted by one character
seq_pairs = [(text_data[i:i+seq_len], text_data[i+1:i+seq_len+1]) for i in start_indexes]

target_seqs = [one_hot(x, y, vocab_size, ctoi) for x, y in seq_pairs]
input_seqs = [text_to_tensor(x, ctoi) for x, _ in seq_pairs]

# Initialize training parameters
# YOUR CODE HERE

embedding_dim = 10
hidden_dim = 50
num_layers = 5
learning_rate = 0.05
n_iters = 100


# Initialize a character-level RNN model

# YOUR CODE HERE

model = CharRNN(vocab_size, embedding_dim, hidden_dim, num_layers=num_layers)
    
# Setup the training loop
# Regularly record the loss and sample from the model to monitor what is happening
# YOUR CODE HERE

def train_overfit2(model, input_seq, target_seq, n_iters=200, learning_rate=0.2):
    # YOUR CODE HERE
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=0.005, momentum=0.9)

    for i in range(n_iters):
        epoch_loss = 0.0
        # One gradient step per sequence; each sequence plays the role of a mini-batch
        for seq, target in zip(input_seq, target_seq):
            outputs, _ = model(seq)
            loss = criterion(outputs, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if (i + 1) % 10 == 0:
            # Report the mean loss over all sequences rather than only the last one
            print('Epoch [{}/{}], Loss: {:.4f}'.format(i+1, n_iters, epoch_loss / len(input_seq)))


train_overfit2(model, input_seqs, target_seqs, n_iters=n_iters, learning_rate=learning_rate)

Epoch [10/100], Loss: 2.4217


KeyboardInterrupt: 

**(Question)** Play around with the predictions of your trained model: start with a custom input sequence and ask the model to predict the rest. Analyze and comment on your results.

In [123]:
# YOUR CODE HERE
predictions = model.predict_proba(text_to_tensor("Fight", ctoi), 200)
predictions = "".join(predictions)
print(predictions)

heatergged1;ed,' o LerrenrÉaf
  '`e simelly 'oilarle:.',` W lSey..
 alerltwgd inen.''  T GA=dkid `verhhe.' holklegelenR' 'pzenken.'Zileeelgensogherd7eldankd iny IÉarrhrXer-, youte	rkelL,' wouvl2dj êir


YOUR ANSWER HERE

## 4. Experiment with different RNN architectures

**(Question)** Experiment with different RNN architectures. Potential ideas are multi-layer RNNs, GRUs and LSTMs. All models can be extended to multi-layer using the `num_layers` parameter. Analyze and comment on your results.
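As a possible starting point for these experiments, here is a minimal sketch of a model whose recurrent cell and depth are selectable. The class name `CharSeqModel` and its `cell` argument are hypothetical and not part of the TP template. The sketch relies on `nn.RNN`, `nn.GRU` and `nn.LSTM` sharing the same constructor arguments for input size, hidden size and `num_layers`. Note that an LSTM returns its state as a tuple `(h, c)` rather than a single tensor.

```python
import torch
import torch.nn as nn

class CharSeqModel(nn.Module):
    '''Hypothetical variant of the character model with a selectable recurrent cell.'''
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=1, cell='rnn'):
        super().__init__()
        recurrent_classes = {'rnn': nn.RNN, 'gru': nn.GRU, 'lstm': nn.LSTM}
        self.emb = nn.Embedding(vocab_size, embedding_dim)
        self.rec = recurrent_classes[cell](embedding_dim, hidden_dim, num_layers=num_layers)
        self.lin = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tensor_data, state=None):
        # `state` is a tensor for RNN/GRU and a tuple (h, c) for LSTM
        out, state = self.rec(self.emb(tensor_data), state)
        return self.lin(out), state

toy_input = torch.tensor([0, 3, 1, 2])  # hypothetical encoded sequence
shapes = {}
for cell in ('rnn', 'gru', 'lstm'):
    toy_model = CharSeqModel(vocab_size=5, embedding_dim=4, hidden_dim=6, num_layers=2, cell=cell)
    logits, state = toy_model(toy_input)
    shapes[cell] = logits.shape
    # Passing the returned state back unchanged works for all three cell types
    next_logits, state = toy_model(toy_input[-1:], state)

print(shapes)
```

Because the state is passed back exactly as returned, a generation loop written this way does not need to know which cell type or how many layers the model uses.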

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE