### Note:
This homework assignment is largely inspired by [this tutorial on RNNs](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/).
We probably need to polish the text.

The key exercise is coding the backpropagation. A gradient check function is provided to give feedback on the correctness of the implementation without revealing the answer.
The difficulty can be adjusted by providing more or less guidance inside the function.
Other functions could also be given as exercises (e.g. `forward_propagation` and `train_with_sgd`).

# Backpropagation in Recurrent Neural Networks

In this notebook, we implement an RNN from scratch using NumPy arrays.

Most of the code is already implemented, but you will have to write the backpropagation function.


### What are RNNs?

The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea. If you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps. Here is what a typical RNN looks like:


![](http://www.wildml.com/wp-content/uploads/2015/09/rnn.jpg)


The above diagram shows a RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word. The formulas that govern the computation happening in a RNN are as follows:

\begin{aligned}
s_t &= f(Ux_t + Ws_{t-1}) \\
o_t &= \mathrm{softmax}(Vs_t)
\end{aligned}


- $x_t$ is the input at time step $t$. For example, $x_1$ could be a one-hot vector corresponding to the second word of a sentence.
- $s_t$ is the hidden state at time step $t$. It’s the “memory” of the network. $s_t$ is calculated based on the previous hidden state and the input at the current step. The function $f$ usually is a nonlinearity such as tanh or ReLU.  $s_{-1}$, which is required to calculate the first hidden state, is typically initialized to all zeroes.
- $o_t$ is the output at step $t$. For example, if we wanted to predict the next word in a sentence it would be a vector of probabilities across our vocabulary.

There are a few things to note here:

- You can think of the hidden state $s_t$ as the memory of the network. $s_t$ captures information about what happened in all the previous time steps. The output $o_t$ at step $t$ is calculated solely based on the memory at time $t$. As briefly mentioned above, it’s a bit more complicated in practice because $s_t$ typically can’t capture information from too many time steps ago.
- Unlike a traditional deep neural network, which uses different parameters at each layer, a RNN shares the same parameters ($U$, $V$, $W$ above) across all steps. This reflects the fact that we are performing the same task at each step, just with different inputs. This greatly reduces the total number of parameters we need to learn.
- The above diagram has outputs at each time step, but depending on the task this may not be necessary. For example, when predicting the sentiment of a sentence we may only care about the final output, not the sentiment after each word. Similarly, we may not need inputs at each time step. The main feature of an RNN is its hidden state, which captures some information about a sequence.
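
The recurrence and the parameter sharing described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper name `rnn_forward` and the toy sizes (4-dimensional inputs, 3 hidden units) are assumptions made for this example and are not the implementation used later in the notebook.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def rnn_forward(U, V, W, xs):
    """Unroll the recurrence over a list of input vectors xs.

    The same U, V, W are reused at every time step.
    """
    s_prev = np.zeros(W.shape[0])  # initial hidden state, all zeros
    states, outputs = [], []
    for x_t in xs:
        s_t = np.tanh(U @ x_t + W @ s_prev)  # s_t = f(U x_t + W s_{t-1})
        o_t = softmax(V @ s_t)               # o_t = softmax(V s_t)
        states.append(s_t)
        outputs.append(o_t)
        s_prev = s_t
    return np.array(outputs), np.array(states)

# Toy sizes chosen only for illustration: 4-dim inputs, 3 hidden units
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 4))
V = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 3))
xs = [np.eye(4)[i] for i in (0, 2, 1, 3, 2)]  # five one-hot inputs

outputs, states = rnn_forward(U, V, W, xs)
print(outputs.shape, states.shape)
```

Each row of `outputs` is a probability vector, and the loop body never creates new parameters: unrolling over five inputs reuses the same three matrices five times.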
### What can RNNs do?

RNNs have shown great success in many NLP tasks. At this point I should mention that the most commonly used type of RNNs are LSTMs, which are much better at capturing long-term dependencies than vanilla RNNs are. But don’t worry, LSTMs are essentially the same thing as the RNN we will develop in this tutorial; they just have a different way of computing the hidden state. We’ll cover LSTMs in more detail in a later post. Here are some example applications of RNNs in NLP (by no means an exhaustive list).


### Defining a toy problem

Let's try to teach a RNN model how to add numbers. We will represent integers with a list of digits.

Example (with 7 digits): 142857 = [0 1 4 2 8 5 7]

In [1]:
import operator
import numpy as np
import sys
from datetime import datetime

np.set_printoptions(precision=6, linewidth=130)

In [2]:
# utils
n_digits = 7

def softmax(x):
    xt = np.exp(x - np.max(x))
    return xt / np.sum(xt)

def array_to_decimal(array):
    return int(''.join([str(i) for i in array]), 10)

def decimal_to_array(num, n_digits):
    return np.int_(list(("%%0%dd" % n_digits) % num)[-n_digits:])

print(decimal_to_array(142857, n_digits))
print(array_to_decimal(decimal_to_array(142857, n_digits)))

[0 1 4 2 8 5 7]
142857


In [3]:
# Building a training set

np.random.seed(10)
n_samples = 1000

X_train = np.random.randint(10, size=(n_samples, 2 * n_digits))
y_train = np.zeros((n_samples, 2 * n_digits), dtype=int)
for i in range(n_samples):
    a = array_to_decimal(X_train[i, :n_digits])
    b = array_to_decimal(X_train[i, n_digits:])
    y_train[i] = decimal_to_array(a + b, 2 * n_digits)

print('             ', X_train[0, :n_digits])
print('            +', X_train[0, n_digits:])
print('-----------------------------')
print(y_train[0])

              [9 4 0 1 9 0 1]
            + [8 9 0 8 6 4 3]
-----------------------------
[0 0 0 0 0 0 1 8 3 1 0 5 4 4]


The input $x$ will be a sequence of digits (just like the example printed above) and each $x_t$ is a single digit, one-hot encoded into a vector in $\mathbb{R}^{10}$.

Let's recap the equations for the RNN:


\begin{aligned}
s_t &= \tanh(Ux_t + Ws_{t-1}) \\
o_t &= \mathrm{softmax}(Vs_t)
\end{aligned}


I always find it useful to write down the dimensions of the matrices and vectors. We one-hot encoded the digits into $C=10$ classes, and let's assume we pick a hidden layer size $H = 100$. You can think of the hidden layer size as the "memory" of our network. Making it bigger allows us to learn more complex patterns, but also results in additional computation. Then we have:


\begin{aligned}
x_t & \in \mathbb{R}^{10} \\
o_t & \in \mathbb{R}^{10} \\
s_t & \in \mathbb{R}^{100} \\
U & \in \mathbb{R}^{100 \times 10} \\
V & \in \mathbb{R}^{10 \times 100} \\
W & \in \mathbb{R}^{100 \times 100} \\
\end{aligned}


This is valuable information. Remember that $U,V$ and $W$ are the parameters of our network we want to learn from data. Thus, we need to learn a total of $2HC + H^2$ parameters. The dimensions also tell us the bottleneck of our model. Note that because $x_t$ is a one-hot vector, multiplying it with $U$ is essentially the same as selecting a column of $U$, so we don't need to perform the full multiplication. With $C = 10$ and $H = 100$, the biggest matrix multiplication in our network is then $Ws_{t-1}$, since $W$ has $H^2 = 10000$ entries while $V$ has only $HC = 1000$.
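
The column-selection shortcut and the parameter count can be checked directly. The snippet below is a small sketch: the random matrix and the chosen digit are arbitrary values picked for illustration.

```python
import numpy as np

C, H = 10, 100  # number of digit classes and hidden size, as above

rng = np.random.default_rng(0)
U = rng.uniform(-1, 1, size=(H, C))  # arbitrary values, for illustration only

digit = 7
x_t = np.zeros(C)
x_t[digit] = 1.0  # one-hot encoding of the digit 7

# Full matrix-vector product versus plain column selection
full_product = U @ x_t
column = U[:, digit]
print(np.allclose(full_product, column))

# Total number of learnable parameters: U and V have H*C entries each, W has H*H
n_params = 2 * H * C + H ** 2
print(n_params)
```

This is why the forward pass can index a column of `U` with the digit itself instead of building the one-hot vector.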

Armed with this, it's time to start our implementation.

#### Initialization

We start by declaring a RNN class and initializing our parameters. We can't just initialize them to 0's because that would result in symmetric calculations in all our layers. We must initialize them randomly. Because proper initialization seems to have an impact on training results there has been a lot of research in this area. It turns out that the best initialization depends on the activation function ($\tanh$ in our case) and one [recommended](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) approach is to initialize the weights randomly in the interval $\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$ where $n$ is the number of incoming connections from the previous layer.

In [4]:
class RNN:
    def __init__(self, input_dim=10, hidden_dim=100, bptt_truncate=20):
        # Assign instance variables
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./input_dim), np.sqrt(1./input_dim), (hidden_dim, input_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (input_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
        

Above, `input_dim` is the size of each input, and `hidden_dim` is the size of our hidden layer (we can pick it).

#### Forward Propagation

Next, let's implement the forward propagation (predicting digit probabilities) defined by our equations above:

In [5]:
def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    # During forward propagation we save all hidden states in s because we need them later.
    # We add one additional row for the initial hidden state, which we set to 0.
    # It is stored in the last row, so that s[t-1] reads it when t = 0.
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.input_dim))

    # For each time step...
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]

RNN.forward_propagation = forward_propagation

We not only return the calculated outputs, but also the hidden states. We will use them later to calculate the gradients, and by returning them here we avoid duplicate computation. Each $o_t$ is a vector of probabilities representing the digits, but sometimes, for example when evaluating our model, all we want is the next digit with the highest probability. We call this function `predict`:

In [6]:
def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)

RNN.predict = predict

Let's try our newly implemented methods and see an example output:

In [7]:
np.random.seed(10)
model = RNN()
o, s = model.forward_propagation(X_train[0])
print(o.shape)
print(o)

(14, 10)
[[ 0.128147  0.086456  0.097282  0.102131  0.081564  0.108697  0.09797   0.106346  0.092526  0.098881]
 [ 0.095863  0.110619  0.112997  0.090618  0.094954  0.11318   0.086146  0.090954  0.110195  0.094474]
 [ 0.09886   0.101381  0.096624  0.077477  0.083638  0.100096  0.102021  0.11491   0.1099    0.115094]
 [ 0.098815  0.099052  0.106088  0.078864  0.078886  0.132081  0.107208  0.09466   0.100423  0.103924]
 [ 0.128161  0.098779  0.09455   0.103251  0.073019  0.122561  0.083596  0.113788  0.095242  0.087053]
 [ 0.090704  0.099418  0.10445   0.087798  0.092747  0.104899  0.099478  0.104261  0.111476  0.104769]
 [ 0.100745  0.099929  0.103282  0.079946  0.078899  0.127029  0.109987  0.093854  0.097476  0.108852]
 [ 0.106561  0.099488  0.090668  0.104637  0.100736  0.105507  0.08942   0.118487  0.106474  0.078023]
 [ 0.125611  0.08225   0.101721  0.105134  0.076596  0.112481  0.091433  0.104065  0.101788  0.098921]
 [ 0.094124  0.108361  0.106366  0.088721  0.092884  0.096963  0

For each digit in the input sequence, our model made 10 predictions, one probability for each possible output digit. Note that because we initialized $U,V,W$ to random values these predictions are completely random right now. The following gives the index of the highest-probability prediction at each position:

In [8]:
predictions = model.predict(X_train[0])
print(predictions.shape)
print(predictions)

(14,)
[0 5 9 5 0 8 5 7 0 1 7 3 5 6]


#### Calculating the Loss

To train our network we need a way to measure the errors it makes. We call this the loss function $L$, and our goal is to find the parameters $U,V$ and $W$ that minimize the loss function for our training data. A common choice for the loss function is the [cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression). If we have $N$ training examples (digits in our sequences) and $C$ classes (the 10 possible digits) then the loss with respect to our predictions $o$ and the true labels $y$ is given by:


\begin{aligned}
L(y,o) = - \frac{1}{N} \sum_{n \in N} y_{n} \log o_{n}
\end{aligned}


The formula sums over our training examples and adds to the loss based on how far off our predictions are. The further apart $y$ (the correct digits) and $o$ (our predictions) are, the greater the loss will be. We implement the function `calculate_loss`:

In [9]:
def calculate_total_loss(self, x, y):
    L = 0
    # For each training sequence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about the probabilities assigned to the "correct" digits
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples
    N = sum(len(y_i) for y_i in y)
    return self.calculate_total_loss(x,y)/N

RNN.calculate_total_loss = calculate_total_loss
RNN.calculate_loss = calculate_loss
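
A handy sanity check for this loss: a model that assigns the same probability to all $C = 10$ digits has a cross-entropy of $\log C \approx 2.30$ per prediction, so an untrained network with near-uniform outputs should start close to that value (the loss reported for epoch 0 in the training log further down is consistent with this). A minimal sketch, where the chosen true digit is arbitrary:

```python
import numpy as np

C = 10  # number of digit classes
uniform_prediction = np.full(C, 1.0 / C)

# Cross-entropy of a uniform prediction is the same whatever the true digit is
true_digit = 3
loss = -np.log(uniform_prediction[true_digit])
print(loss, np.log(C))
```

A starting loss far above this value would suggest a problem in the forward pass or in the loss computation.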

#### Training the RNN with SGD and Backpropagation Through Time (BPTT)

Remember that we want to find the parameters $U,V$ and $W$ that minimize the total loss on the training data. The most common way to do this is SGD, Stochastic Gradient Descent. The idea behind SGD is pretty simple. We iterate over all our training examples and during each iteration we nudge the parameters into a direction that reduces the error. These directions are given by the gradients of the loss: $\frac{\partial L}{\partial U}, \frac{\partial L}{\partial V}, \frac{\partial L}{\partial W}$. SGD also needs a *learning rate*, which defines how big of a step we want to make in each iteration.
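
The update rule itself is easy to see on a one-parameter toy problem. The sketch below is an assumption-laden illustration, not the notebook's training loop: `sgd_minimize` is a hypothetical helper, the loss $(\theta - 3)^2$ is made up, and the gradient is exact rather than estimated from a single training example as it would be in real SGD.

```python
def sgd_minimize(grad, theta, learning_rate=0.1, n_steps=50):
    """Repeatedly step against the gradient (hypothetical helper for illustration)."""
    for _ in range(n_steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_final = sgd_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(theta_final)
```

With this learning rate each step shrinks the distance to the minimum at $\theta = 3$ by a constant factor; a learning rate that is too large would make the iterates overshoot instead.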

But how do we calculate those gradients we mentioned above? In a traditional Neural Network we do this through the backpropagation algorithm. In RNNs we use a slightly modified version of this algorithm called **Backpropagation Through Time (BPTT)**. Because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also the previous time steps. If you know calculus, it really is just applying the chain rule.

Let's recall the RNN model:

\begin{aligned}  s_t &= \tanh (U x_t + W s_{t-1})\\
\hat{y}_t &= \text{softmax}(V s_t)\end{aligned}  

#### Chain rule

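To calculate these gradients we use the chain rule of differentiation. That’s the backpropagation algorithm when applied backwards starting from the error. Here $E_t$ denotes the cross-entropy loss at time step $t$, $\hat{y}_t$ is the prediction (called $o_t$ earlier) and $y_t$ is the one-hot true label. For the rest we’ll use $E_3$ as an example, just to have concrete numbers to work with.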

\begin{aligned}  \frac{\partial E_3}{\partial V} &=\frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial V}\\  &=\frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial z_3}\frac{\partial z_3}{\partial V}\\  &=(\hat{y}_3 - y_3) \otimes s_3 \\  \end{aligned}  

In the above, $z_3 = Vs_3$, and $\otimes$ is the outer product of two vectors. You can try calculating these derivatives yourself (good exercise!). The point is that $\frac{\partial E_3}{\partial V}$ only depends on the values at the current time step, $\hat{y}_3, y_3, s_3$. If you have these, calculating the gradient for $V$ is a simple matrix multiplication.
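
The closed form $(\hat{y}_3 - y_3) \otimes s_3$ can be verified numerically on a small example. This is a sketch under stated assumptions: the sizes, the random values and the helper `loss` are all made up for illustration, and `s` merely stands in for the hidden state $s_3$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss(V, s, label):
    # Cross-entropy of softmax(V s) against a single correct class
    return -np.log(softmax(V @ s)[label])

rng = np.random.default_rng(0)
C, H = 4, 3                      # toy sizes, for illustration only
V = rng.normal(size=(C, H))
s = np.tanh(rng.normal(size=H))  # stands in for the hidden state s_3
label = 2                        # index of the correct class
y = np.eye(C)[label]             # one-hot target y_3

# Analytic gradient: (y_hat - y) outer s
y_hat = softmax(V @ s)
analytic = np.outer(y_hat - y, s)

# Numerical gradient by central differences, one entry of V at a time
h = 1e-5
numeric = np.zeros_like(V)
for i in range(C):
    for j in range(H):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[i, j] += h
        V_minus[i, j] -= h
        numeric[i, j] = (loss(V_plus, s, label) - loss(V_minus, s, label)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```

The printed maximum difference should be tiny, which confirms that the gradient for $V$ needs nothing beyond the quantities of the current time step.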

But the story is different for $\frac{\partial E_3}{\partial W}$ (and for $U$). To see why, we write out the chain rule, just as above:

\begin{aligned}  \frac{\partial E_3}{\partial W} &= \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial W}\\  \end{aligned}  

Now, note that $s_3 = \tanh(Ux_3 + Ws_2)$ depends on $s_2$, which depends on $W$ and $s_1$, and so on. So if we take the derivative with respect to $W$ we can’t simply treat $s_2$ as a constant! We need to apply the chain rule again and what we really have is this:

\begin{aligned}  \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k}{\partial W}\\  \end{aligned}  

We sum up the contributions of each time step to the gradient. In other words, because $W$ is used in every step up to the output we care about, we need to backpropagate gradients from $t=3$ through the network all the way to $t=0$:

![BPTT gradient flow in the unrolled RNN](http://www.wildml.com/wp-content/uploads/2015/10/rnn-bptt-with-gradients.png)

Note that this is exactly the same as the standard backpropagation algorithm that we use in deep Feedforward Neural Networks. The key difference is that we sum up the gradients for W at each time step. In a traditional NN we don’t share parameters across layers, so we don’t need to sum anything. BPTT is just a fancy name for standard backpropagation on an unrolled RNN. Just like with Backpropagation you could define a delta vector that you pass backwards, e.g.: $\delta_2^{(3)} = \frac{\partial E_3}{\partial z_2} =\frac{\partial E_3}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial z_2}$ with $z_2 = Ux_2+ Ws_1$. Then the same equations will apply.
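
One ingredient of the delta vector is the factor $\frac{\partial s_t}{\partial z_t}$. Since $s_t = \tanh(z_t)$, this derivative is $1 - s_t^2$ elementwise, so it can be computed from the hidden states saved during forward propagation. The short check below is a sketch with arbitrary sample points chosen for illustration:

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)  # arbitrary pre-activation values
s = np.tanh(z)

# Analytic derivative, expressed through the hidden state itself
analytic = 1.0 - s ** 2

# Central-difference estimate of d tanh(z) / dz
h = 1e-5
numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```

Because the derivative only needs $s_t$, there is no need to store the pre-activations $z_t$ separately.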


## Exercise

Fill in the function below to implement the Backpropagation Through Time (BPTT) algorithm.

To check if your function is correct, you can use the `gradient_check` function below.

In [10]:
def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        # update dLdV
        # ...

        # Initial delta calculation
        # ...

        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # update dLdW
            # ...

            # update dLdU
            # ...
    
            # Update delta for next step
            # ...

            pass
    
    return [dLdU, dLdV, dLdW]

RNN.bptt = bptt

In [11]:
# ((answer))

def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            dLdW += np.outer(delta_t, s[bptt_step-1])              
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]

RNN.bptt = bptt

#### Gradient Checking

Whenever you implement backpropagation it is a good idea to also implement *gradient checking*, which is a way of verifying that your implementation is correct. The idea behind gradient checking is that the derivative of the loss with respect to a parameter is equal to the slope at that point, which we can approximate by slightly changing the parameter and then dividing the change in the loss by the change in the parameter:

$
\begin{aligned}
\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + h) - L(\theta - h)}{2h} \quad \text{for small } h
\end{aligned}
$

We then compare the gradient we calculated using backpropagation to the gradient we estimated with the method above. If there's no large difference we are good. The approximation needs to calculate the total loss twice for *every* parameter, so gradient checking is very expensive.
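
Before applying the idea to the whole network, it can help to see it on a scalar function whose derivative is known. The following is a minimal sketch: the helpers `central_difference` and `relative_error` and the toy function $\theta^3$ are assumptions made for illustration.

```python
def central_difference(f, theta, h=1e-3):
    """Estimate df/dtheta at theta with a symmetric finite difference."""
    return (f(theta + h) - f(theta - h)) / (2 * h)

def relative_error(a, b):
    # Same measure as in the gradient check: |a - b| / (|a| + |b|)
    return abs(a - b) / (abs(a) + abs(b))

f = lambda theta: theta ** 3          # toy function, chosen for illustration
exact = 3 * 2.0 ** 2                  # analytic derivative at theta = 2
estimate = central_difference(f, 2.0)

print(exact, estimate, relative_error(exact, estimate))
```

The relative error is far below a threshold such as 0.01, which is the kind of agreement a correct backpropagation implementation should reach as well.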

In [12]:
def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    # Calculate the gradients using backpropagation. We want to check if these are correct.
    bptt_gradients = self.bptt(x, y)
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        # Get the actual parameter value from the model, e.g. self.W
        parameter = operator.attrgetter(pname)(self)
        print("Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape)))
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            # Save the original value so we can reset it later
            original_value = parameter[ix]
            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = self.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = self.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)
            # Reset parameter to original value
            parameter[ix] = original_value
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
            # Calculate the relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient)/(np.abs(backprop_gradient) + np.abs(estimated_gradient))
            # If the error is too large, fail the gradient check
            if relative_error > error_threshold:
                print("Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix))
                print("+h Loss: %f" % gradplus)
                print("-h Loss: %f" % gradminus)
                print("Estimated_gradient: %f" % estimated_gradient)
                print("Backpropagation gradient: %f" % backprop_gradient)
                print("Relative Error: %f" % relative_error)
                return 
            it.iternext()
        print("Gradient check for parameter %s passed." % (pname))

RNN.gradient_check = gradient_check

np.random.seed(10)
model = RNN(10, 100, bptt_truncate=1000)
model.gradient_check(X_train[0], y_train[0])

Performing gradient check for parameter U with size 1000.
Gradient check for parameter U passed.
Performing gradient check for parameter V with size 1000.
Gradient check for parameter V passed.
Performing gradient check for parameter W with size 10000.
Gradient check for parameter W passed.


#### SGD Implementation

Now that we are able to calculate the gradients for our parameters we can implement SGD. I like to do this in two steps: 1. A function `sgd_step` that calculates the gradients and performs the updates for one training example. 2. An outer loop that iterates through the training set and adjusts the learning rate.

In [13]:
# Performs one step of SGD.
def sgd_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNN.sgd_step = sgd_step

In [14]:
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    """Outer SGD Loop

    Parameters
    ----------
    - model: The RNN model instance
    - X_train: The training data set
    - y_train: The training data labels
    - learning_rate: Initial learning rate for SGD
    - nepoch: Number of times to iterate through the complete dataset
    - evaluate_loss_after: Evaluate the loss after this many epochs
    """
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    for epoch in range(nepoch):
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            # Adjust the learning rate if loss increases
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5  
                print("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()
        # For each training example...
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1
    # Return the recorded (num_examples_seen, loss) pairs so the caller can plot them
    return losses

In [15]:
np.random.seed(10)
# Train on a small subset of the data to see what happens
model = RNN()
losses = train_with_sgd(model, X_train[:100], y_train[:100], nepoch=5, evaluate_loss_after=1)

2018-06-04 18:05:10: Loss after num_examples_seen=0 epoch=0: 2.302904
2018-06-04 18:05:12: Loss after num_examples_seen=100 epoch=1: 1.509416
2018-06-04 18:05:14: Loss after num_examples_seen=200 epoch=2: 1.375010
2018-06-04 18:05:16: Loss after num_examples_seen=300 epoch=3: 1.335181
2018-06-04 18:05:17: Loss after num_examples_seen=400 epoch=4: 1.294507


Good, it seems like our implementation is at least doing something useful and decreasing the loss, just like we wanted.