# 1. Introduction

Gated recurrent units (GRUs) are a type of recurrent neural network (RNN) designed to process sequential data such as natural language, time series, or speech. Like other RNNs, GRUs can process input sequences of variable length and maintain an internal state that lets them remember information from previous inputs.

One of the main advantages of GRUs is that they are easier to train than vanilla RNNs, which have difficulty learning long-term dependencies because gradients tend to vanish over long sequences. GRUs are also simpler than LSTMs: they have fewer gates and no separate cell state. The improvement over vanilla RNNs comes from gating mechanisms, which let the model control the flow of information through the network and selectively retain or discard information from previous inputs.

A GRU has two gates: the update gate and the reset gate. The update gate determines how much of the previous hidden state is carried over to the new hidden state and how much is replaced by a newly computed candidate state. The reset gate determines how much of the previous hidden state is used when computing that candidate.
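
To make the role of the update gate concrete, here is a minimal NumPy sketch of the final blending step of a GRU, using the PyTorch convention `h' = (1 - z) * n + z * h`. The helper name `blend` and the example values are purely illustrative and are not part of Needle.

```python
import numpy as np

def blend(z, n, h):
    """Update-gate interpolation: h' = (1 - z) * n + z * h (element-wise)."""
    return (1.0 - z) * n + z * h

h_prev = np.array([1.0, -1.0])  # previous hidden state (made-up values)
n_cand = np.array([0.2, 0.4])   # candidate state (made-up values)

keep = blend(np.ones(2), n_cand, h_prev)        # z = 1: previous state is copied through
replace = blend(np.zeros(2), n_cand, h_prev)    # z = 0: state is replaced by the candidate
mixed = blend(np.full(2, 0.5), n_cand, h_prev)  # z = 0.5: equal mix of both
```

Because `z` is produced by a sigmoid, each component lies strictly between 0 and 1, so in practice every hidden unit falls somewhere between these two extremes.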

GRUs are widely used in natural language processing tasks such as language translation, text classification, and language modeling. They are also used in other domains where sequential data is present, such as speech recognition and time series forecasting.

In this project, we implemented GRUCell and GRU for the Needle library.

# 2. Methodology

### GRU Cell

According to the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.GRUCell.html), `GRUCell` can be implemented with the following formulas, where $\sigma$ is the sigmoid function and $*$ denotes the element-wise product:

$$
\begin{array}{ll}
r = \sigma(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) \\
z = \sigma(W_{iz} x + b_{iz} + W_{hz} h + b_{hz}) \\
n = \tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn})) \\
h' = (1 - z) * n + z * h
\end{array}
$$

And here is the Python code:
```python
class GRUCell(Module):
    ...
    def forward(self, X, h=None):
        bs, _ = X.shape
        shape = bs, self.hidden_size
        if h is None:  # avoid `h or ...`, which would evaluate the truth value of a tensor
            h = init.zeros(*shape, device=self.device, dtype=self.dtype)

        X_new = X @ self.W_ih
        h_new = h @ self.W_hh
        if self.bias:
            add_dim = 1, 3 * self.hidden_size
            shape = bs, 3 * self.hidden_size
            X_new += self.bias_ih.reshape(add_dim).broadcast_to(shape)
            h_new += self.bias_hh.reshape(add_dim).broadcast_to(shape)

        rx, zx, xn = ops.split(X_new.reshape((bs, 3, self.hidden_size)), axis=1)
        rh, zh, nh = ops.split(h_new.reshape((bs, 3, self.hidden_size)), axis=1)

        r = Sigmoid()(rx + rh)
        z = Sigmoid()(zx + zh)
        n = Tanh()(xn + r * nh)

        h_out = (1 - z) * n + z * h
        return h_out
```
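
The `forward` method above relies on the three gates being stored side by side in one fused weight matrix of shape `(input_size, 3 * hidden_size)`, in the order `r`, `z`, `n`. The following NumPy sketch illustrates why reshaping the fused product to `(bs, 3, hidden_size)` and taking slices along axis 1 gives the same result as three separate matrix products. It assumes that `ops.split` along axis 1 returns exactly those three `(bs, hidden_size)` slices; the sizes and variable names are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
bs, input_size, hidden_size = 4, 5, 3
H = hidden_size

x = rng.standard_normal((bs, input_size))
# Fused weight: columns [0:H] belong to r, [H:2H] to z, [2H:3H] to n.
W_ih = rng.standard_normal((input_size, 3 * H))

fused = x @ W_ih  # shape (bs, 3H)

# Route 1: reshape to (bs, 3, H) and index the gate axis, as forward() does.
gates = fused.reshape(bs, 3, H)
r1, z1, n1 = gates[:, 0, :], gates[:, 1, :], gates[:, 2, :]

# Route 2: one matrix product per gate, using column slices of the fused weight.
r2 = x @ W_ih[:, :H]
z2 = x @ W_ih[:, H:2 * H]
n2 = x @ W_ih[:, 2 * H:]

same = all(np.allclose(a, b) for a, b in [(r1, r2), (z1, z2), (n1, n2)])
```

The fused layout needs only one matrix product per input instead of three, which is the usual reason for storing the gate weights this way.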

Now let's test it by comparing it with PyTorch.

In [1]:
import sys
sys.path.append('./python')

import needle as ndl
import numpy as np
import torch

In [2]:
x = ndl.init.randn(1, 20, dtype="float32")
h0 = ndl.init.randn(1, 100, dtype="float32")

Here is the PyTorch implementation:

In [3]:
model_torch = torch.nn.GRUCell(20, 100)
print(model_torch.weight_hh.shape)
print(model_torch.weight_ih.shape)

torch.Size([300, 100])
torch.Size([300, 20])


In [4]:
h_ = model_torch(torch.tensor(x.numpy()), torch.tensor(h0.numpy()))

And now our implementation for Needle:

In [5]:
model_needle = ndl.nn.GRUCell(20, 100)
print(model_needle.W_hh.shape)
print(model_needle.W_ih.shape)

(100, 300)
(20, 300)


In [6]:
model_needle.W_hh = ndl.nn.Parameter(ndl.Tensor(model_torch.weight_hh.detach().numpy().T, requires_grad=True))
model_needle.W_ih = ndl.nn.Parameter(ndl.Tensor(model_torch.weight_ih.detach().numpy().T, requires_grad=True))
model_needle.bias_hh = ndl.nn.Parameter(ndl.Tensor(model_torch.bias_hh.detach().numpy(), requires_grad=True))
model_needle.bias_ih = ndl.nn.Parameter(ndl.Tensor(model_torch.bias_ih.detach().numpy(), requires_grad=True))

h = model_needle.forward(x, h0)

The moment of truth - let's compare the results.

In [7]:
np.linalg.norm(h.detach().numpy() - h_.detach().numpy())

4.4673982e-07

The L2 norm of the difference between the reference solution and ours is about 4.5e-07, which is consistent with float32 rounding error.
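
A single norm can hide one badly wrong element among many good ones, so an element-wise check is a useful complement. The sketch below uses `np.testing.assert_allclose`; the helper name `check_close`, the tolerances, and the synthetic arrays (stand-ins for `h.detach().numpy()` and `h_.detach().numpy()`) are assumptions chosen for illustration, not values used in this project.

```python
import numpy as np

def check_close(ours, reference, rtol=1e-5, atol=1e-6):
    """Raise AssertionError if any element differs beyond the tolerance;
    otherwise return the largest absolute element-wise difference."""
    np.testing.assert_allclose(ours, reference, rtol=rtol, atol=atol)
    return float(np.max(np.abs(ours - reference)))

# Synthetic stand-ins for the two hidden states being compared.
rng = np.random.default_rng(0)
reference = rng.uniform(-1, 1, size=(1, 100)).astype(np.float32)
noise = rng.standard_normal((1, 100)).astype(np.float32)
ours = reference + np.float32(1e-7) * noise  # perturbation at rounding-error scale

max_err = check_close(ours, reference)
```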

### GRU

Now let's test the full GRU class. Its code is the same as for the RNN class, so for the sake of brevity we do not provide it here.
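
For readers who do not have the RNN class at hand, the following NumPy sketch shows the general idea of unrolling a GRU cell over a sequence and stacking layers. It is not the Needle implementation: the function names `gru_cell` and `gru_forward`, the parameter tuple layout, and the sizes are all hypothetical. It assumes the same conventions as the rest of this report, namely inputs of shape `(seq_len, bs, input_size)`, an initial state of shape `(num_layers, bs, hidden_size)`, and fused weights with gate order `r`, `z`, `n`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, W_ih, W_hh, b_ih, b_hh):
    """One GRU step. W_ih: (in, 3H), W_hh: (H, 3H), biases: (3H,)."""
    H = h.shape[1]
    gi = x @ W_ih + b_ih
    gh = h @ W_hh + b_hh
    r = sigmoid(gi[:, :H] + gh[:, :H])
    z = sigmoid(gi[:, H:2 * H] + gh[:, H:2 * H])
    n = np.tanh(gi[:, 2 * H:] + r * gh[:, 2 * H:])
    return (1.0 - z) * n + z * h

def gru_forward(X, h0, layers):
    """X: (seq_len, bs, input_size); h0: (num_layers, bs, H);
    layers: one (W_ih, W_hh, b_ih, b_hh) tuple per layer."""
    seq, h_n = X, []
    for params, h in zip(layers, h0):
        outputs = []
        for x_t in seq:              # iterate over time steps
            h = gru_cell(x_t, h, *params)
            outputs.append(h)
        seq = np.stack(outputs)      # output of this layer feeds the next one
        h_n.append(h)                # final hidden state of this layer
    return seq, np.stack(h_n)

# Tiny usage example with random parameters (two layers).
rng = np.random.default_rng(0)
seq_len, bs, input_size, H = 7, 4, 5, 3
layers = [
    tuple(rng.standard_normal(s) for s in [(in_dim, 3 * H), (H, 3 * H), (3 * H,), (3 * H,)])
    for in_dim in (input_size, H)
]
X = rng.standard_normal((seq_len, bs, input_size))
h0 = np.zeros((2, bs, H))
out, h_n = gru_forward(X, h0, layers)
```

In this sketch `out` holds the last layer's hidden state at every time step, and `h_n` holds each layer's hidden state after the final step, so `out[-1]` equals `h_n[-1]`.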

In [8]:
X = ndl.init.randn(50, 128, 20, dtype="float32")
h0 = ndl.init.randn(1, 128, 100, dtype="float32")

PyTorch implementation:

In [9]:
model = torch.nn.GRU(20, 100, num_layers=1)
print(model.weight_hh_l0.shape)
print(model.weight_ih_l0.shape)

torch.Size([300, 100])
torch.Size([300, 20])


In [10]:
out_, h_ = model(torch.tensor(X.numpy()), torch.tensor(h0.numpy()))

Needle implementation:

In [11]:
gm = ndl.nn.GRU(20, 100, num_layers=1)
print(gm.gru_cells[0].W_hh.shape)
print(gm.gru_cells[0].W_ih.shape)

(100, 300)
(20, 300)


In [12]:
gm.gru_cells[0].W_hh = ndl.nn.Parameter(ndl.Tensor(model.weight_hh_l0.detach().numpy().T, requires_grad=True))
gm.gru_cells[0].W_ih = ndl.nn.Parameter(ndl.Tensor(model.weight_ih_l0.detach().numpy().T, requires_grad=True))
gm.gru_cells[0].bias_hh = ndl.nn.Parameter(ndl.Tensor(model.bias_hh_l0.detach().numpy(), requires_grad=True))
gm.gru_cells[0].bias_ih = ndl.nn.Parameter(ndl.Tensor(model.bias_ih_l0.detach().numpy(), requires_grad=True))

out, h = gm.forward(X, h0)

Let's compare the results:

In [13]:
print(np.linalg.norm(h.detach().numpy() - h_.detach().numpy()))
print(np.linalg.norm(out.detach().numpy() - out_.detach().numpy()))

1.8714472e-06
1.4572848e-05


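Again, both differences (about 1.9e-06 for the final hidden state and about 1.5e-05 for the full output sequence) are consistent with float32 rounding error.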
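
The two norms reported above are not directly comparable, because `out` has 50 times as many elements as `h` and an L2 norm grows with the number of elements. One way to put them on the same scale is to divide each norm by the square root of the element count, which gives a root-mean-square error per element. The sketch below does this with the printed values and the tensor shapes from the cells above; the helper name `rms_from_norm` is illustrative.

```python
import numpy as np

def rms_from_norm(l2_norm, shape):
    """Convert an L2 norm of a difference into a per-element RMS error."""
    return l2_norm / np.sqrt(np.prod(shape))

rms_h = rms_from_norm(1.8714472e-06, (1, 128, 100))     # final hidden state
rms_out = rms_from_norm(1.4572848e-05, (50, 128, 100))  # full output sequence
eps32 = np.finfo(np.float32).eps                        # float32 machine epsilon

print(rms_h, rms_out, eps32)
```

Both per-element errors come out at roughly 2e-08, below the float32 machine epsilon of about 1.2e-07, which supports the reading that the larger norm for `out` mostly reflects its larger size rather than a larger per-element error.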

# 3. Results

* Present the results of your implementation, including any performance metrics that you used to evaluate it.

* Discuss the implications of your results and how they compare to your expectations.

# 4. Conclusion

* Summarize the key points of your report.

* Discuss any limitations or challenges that you encountered while implementing the GRU, and suggest potential areas for future work.