### Gated Recurrent Units (GRUs)

## Introduction to Gated Recurrent Units (GRUs)

Gated Recurrent Units (GRUs) are a recurrent neural network (RNN) architecture introduced to mitigate the vanishing gradient problem and to capture long-term dependencies in sequential data. They are a simplified variant of Long Short-Term Memory (LSTM) networks: a GRU keeps a single hidden state instead of separate hidden and cell states and uses fewer gates, so it has fewer parameters and a lower computational cost.

---

### Key Features of GRUs

- **Simple Architecture:**  
    GRUs use only two gates, **update** and **reset**, whereas LSTMs use three (input, forget, and output). This makes GRUs easier to implement and understand.

- **Efficiency:**  
    With fewer parameters than an LSTM of the same size, GRUs generally need less memory and less computation per time step. This makes them attractive for real-time applications and for smaller datasets, where extra parameters increase the risk of overfitting.

- **Performance:**  
    Despite their simpler structure, GRUs often achieve performance comparable to LSTMs in capturing sequential dependencies in data such as time series, text, and speech.
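
To make the parameter savings concrete, here is a rough count of trainable weights in a single recurrent layer. The helper names (`simple_rnn_params`, `lstm_params`, `gru_params`) are illustrative and not part of any library. The formulas assume one input matrix, one recurrent matrix, and one bias vector per gate or candidate; some implementations differ slightly (for example, GRU variants that keep two bias vectors per gate), which the hypothetical `bias_vectors` argument is meant to approximate.

```python
def simple_rnn_params(input_dim: int, units: int) -> int:
    """Input weights + recurrent weights + one bias vector."""
    return units * (input_dim + units) + units


def lstm_params(input_dim: int, units: int) -> int:
    """Four weight sets: input, forget, and output gates plus the cell candidate."""
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim: int, units: int, bias_vectors: int = 1) -> int:
    """Three weight sets: update gate, reset gate, and candidate state."""
    return 3 * (units * (input_dim + units) + bias_vectors * units)


# Embedding size and layer width used by the models in this document
input_dim, units = 128, 128

print("SimpleRNN:", simple_rnn_params(input_dim, units))  # 32896
print("LSTM:     ", lstm_params(input_dim, units))        # 131584
print("GRU:      ", gru_params(input_dim, units))         # 98688
print("GRU/LSTM: ", gru_params(input_dim, units) / lstm_params(input_dim, units))  # 0.75
```

Under these assumptions a GRU layer carries about three quarters of the weights of an LSTM layer of the same width, since it has three weight sets instead of four.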

---

### GRU Cell Structure: Update and Reset Gates

- **Update Gate (`z_t`):**  
    Determines how much of the previous hidden state is retained and how much is replaced by new information. In the formulation used in this section, a value of `z_t` near 0 keeps the existing memory, while a value near 1 overwrites it with the candidate state.

- **Reset Gate (`r_t`):**  
    Controls how much of the previous hidden state is used when computing the candidate hidden state. When the reset gate is close to zero, the candidate is computed almost entirely from the current input, which lets the model drop irrelevant past information.

- **Hidden State Update:**  
    The new hidden state is a combination of the previous hidden state and the candidate hidden state, weighted by the update gate. This mechanism allows the GRU to adaptively capture dependencies of varying lengths.

**Mathematical Formulation:**
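\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1})) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\]

Where:
- \( x_t \): Input at time step \( t \)
- \( h_{t-1} \), \( h_t \): Previous and new hidden states
- \( z_t \), \( r_t \): Update and reset gate activations
- \( \tilde{h}_t \): Candidate hidden state
- \( W_z, W_r, W_h \) and \( U_z, U_r, U_h \): Learned input and recurrent weight matrices (bias terms are omitted for brevity)
- \( \sigma \): Sigmoid activation function
- \( \odot \): Element-wise multiplication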
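
The four equations translate almost line for line into NumPy. The following is a minimal sketch of a single forward step, not how any framework implements its GRU layer: it is bias-free like the equations, and the names `gru_step` and `init_params`, the random initialization, and the small dimensions are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_dim, hidden_dim, rng):
    """Random W_* (hidden x input) and U_* (hidden x hidden) matrices."""
    params = {}
    for gate in ("z", "r", "h"):
        params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
        params[f"U_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
    return params


def gru_step(x_t, h_prev, p):
    """One GRU time step following the bias-free equations above."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)            # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)            # reset gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev))  # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                   # blend old and new
    return h_t, z_t, r_t


rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
params = init_params(input_dim, hidden_dim, rng)

# Run a toy sequence of 5 time steps, starting from a zero hidden state
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, z, r = gru_step(x_t, h, params)

print(h.shape)  # (3,)
```

Because `h_t` is an element-wise convex combination of `h_prev` and a `tanh` output, the hidden state stays within [-1, 1] when it starts at zero.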

---

### When to Use GRUs vs. LSTMs

- **GRUs** are preferred when:
    - You need a faster, more memory-efficient model.
    - The dataset is smaller or less complex.
    - You want to avoid overfitting due to excessive parameters.

- **LSTMs** may be better when:
    - The task requires modeling very long-term dependencies.
    - You have sufficient computational resources and a large dataset.

In practice, both GRUs and LSTMs are powerful tools for sequence modeling, and the choice often depends on empirical performance for the specific task.

In [1]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SimpleRNN, GRU

In [2]:
# Prepare data: keep the 10,000 most frequent words and
# pad or truncate every review to exactly 200 tokens
vocab_size = 10000
max_len = 200

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

print(f"Training Data Shape: {X_train.shape}")
print(f"Testing Data Shape: {X_test.shape}")

Training Data Shape: (25000, 200)
Testing Data Shape: (25000, 200)
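
The `pad_sequences` call is what forces every review to exactly `max_len` tokens, which is why both arrays have 200 columns. As a rough illustration of its default behavior (zeros are added at the front of short sequences, and tokens are dropped from the front of long ones), here is a pure-Python sketch. `pad_pre` is a hypothetical helper written for this example, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic pre-padding and pre-truncation on lists of token ids."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad short ones
    return padded


reviews = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(pad_pre(reviews, maxlen=4))  # [[0, 1, 2, 3], [6, 7, 8, 9]]
```

Padding at the front keeps the real tokens next to the final time step, whose hidden state is the only one the classifiers in this document pass to the output layer.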


In [3]:
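# Baseline: vanilla RNN sentiment classifier
rnn_model = Sequential(
    [
        # Map each word index to a learned 128-dimensional embedding
        Embedding(input_dim=vocab_size, output_dim=128),
        # Return only the final hidden state (return_sequences=False)
        SimpleRNN(128, activation="tanh", return_sequences=False),
        # Single sigmoid unit: probability that the review is positive
        Dense(1, activation="sigmoid"),
    ]
)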

rnn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
rnn_model.summary()

rnn_history = rnn_model.fit(
    X_train, y_train, epochs=5, batch_size=32, validation_split=0.2
)

loss, accuracy = rnn_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}, Test Accuracy: {accuracy}")

Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 15s 21ms/step - accuracy: 0.5844 - loss: 0.6552 - val_accuracy: 0.7654 - val_loss: 0.5053
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 13s 20ms/step - accuracy: 0.7375 - loss: 0.5459 - val_accuracy: 0.6158 - val_loss: 0.6283
Epoch 3/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 13s 20ms/step - accuracy: 0.8176 - loss: 0.4061 - val_accuracy: 0.7590 - val_loss: 0.5182
Epoch 4/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 13s 20ms/step - accuracy: 0.7507 - loss: 0.5144 - val_accuracy: 0.6282 - val_loss: 0.6840
Epoch 5/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 13s 20ms/step - accuracy: 0.7581 - loss: 0.4833 - val_accuracy: 0.7488 - val_loss: 0.5591
782/782 ━━━━━━━━━━━━━━━━━━━━ 6s 8ms/step - accuracy: 0.7487 - loss: 0.5530
Test Loss: 0.5488804578781128, Test Accuracy: 0.7497199773788452


In [4]:
# LSTM model: same architecture as the RNN baseline, with the recurrent layer swapped for an LSTM
lstm_model = Sequential(
    [
        Embedding(input_dim=vocab_size, output_dim=128),
        LSTM(128, activation="tanh", return_sequences=False),
        Dense(1, activation="sigmoid"),
    ]
)

lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
lstm_model.summary()

lstm_history = lstm_model.fit(
    X_train, y_train, epochs=5, batch_size=32, validation_split=0.2
)

loss, accuracy = lstm_model.evaluate(X_test, y_test)
print(f"LSTM Test Loss: {loss}, Test Accuracy: {accuracy}")

Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 35s 54ms/step - accuracy: 0.6919 - loss: 0.5661 - val_accuracy: 0.8432 - val_loss: 0.3727
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 51ms/step - accuracy: 0.8871 - loss: 0.2807 - val_accuracy: 0.8620 - val_loss: 0.3427
Epoch 3/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 51ms/step - accuracy: 0.9259 - loss: 0.1963 - val_accuracy: 0.8660 - val_loss: 0.3458
Epoch 4/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 52ms/step - accuracy: 0.9464 - loss: 0.1491 - val_accuracy: 0.8612 - val_loss: 0.3630
Epoch 5/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 31s 50ms/step - accuracy: 0.9633 - loss: 0.1083 - val_accuracy: 0.8558 - val_loss: 0.4036
782/782 ━━━━━━━━━━━━━━━━━━━━ 15s 19ms/step - accuracy: 0.8502 - loss: 0.4405
LSTM Test Loss: 0.4295012354850769, Test Accuracy: 0.8517600297927856


In [5]:
# GRU model: same architecture as the RNN baseline, with the recurrent layer swapped for a GRU
gru_model = Sequential(
    [
        Embedding(input_dim=vocab_size, output_dim=128),
        GRU(128, activation="tanh", return_sequences=False),
        Dense(1, activation="sigmoid"),
    ]
)

gru_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
gru_model.summary()

gru_history = gru_model.fit(
    X_train, y_train, epochs=5, batch_size=32, validation_split=0.2
)

loss, accuracy = gru_model.evaluate(X_test, y_test)
print(f"GRU Test Loss: {loss}, Test Accuracy: {accuracy}")

Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 34s 52ms/step - accuracy: 0.6824 - loss: 0.5701 - val_accuracy: 0.8460 - val_loss: 0.3558
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 52ms/step - accuracy: 0.8986 - loss: 0.2564 - val_accuracy: 0.8818 - val_loss: 0.3063
Epoch 3/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 37s 59ms/step - accuracy: 0.9484 - loss: 0.1424 - val_accuracy: 0.8750 - val_loss: 0.3238
Epoch 4/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 37s 58ms/step - accuracy: 0.9770 - loss: 0.0728 - val_accuracy: 0.8692 - val_loss: 0.3900
Epoch 5/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 37s 60ms/step - accuracy: 0.9865 - loss: 0.0422 - val_accuracy: 0.8614 - val_loss: 0.4538
782/782 ━━━━━━━━━━━━━━━━━━━━ 13s 17ms/step - accuracy: 0.8574 - loss: 0.4877
GRU Test Loss: 0.4710017144680023, Test Accuracy: 0.86080002784729
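
The three training logs can be summarized with a few lines of plain Python. The numbers below are copied by hand from the output above (validation loss per epoch, and final test accuracy rounded to four decimals); `best_epoch` is a helper made up for this sketch.

```python
# Validation loss per epoch, copied from the training logs above
val_loss = {
    "SimpleRNN": [0.5053, 0.6283, 0.5182, 0.6840, 0.5591],
    "LSTM": [0.3727, 0.3427, 0.3458, 0.3630, 0.4036],
    "GRU": [0.3558, 0.3063, 0.3238, 0.3900, 0.4538],
}

# Final test accuracy, rounded to four decimals
test_accuracy = {"SimpleRNN": 0.7497, "LSTM": 0.8518, "GRU": 0.8608}


def best_epoch(losses):
    """Return (1-based epoch, loss) for the lowest validation loss."""
    idx = min(range(len(losses)), key=losses.__getitem__)
    return idx + 1, losses[idx]


summary = {name: best_epoch(losses) for name, losses in val_loss.items()}

for name, (epoch, loss) in summary.items():
    print(f"{name:<10} best val_loss {loss:.4f} at epoch {epoch}, "
          f"test accuracy {test_accuracy[name]:.4f}")
```

In this run, validation loss bottoms out at epoch 2 for both the LSTM and the GRU and rises afterwards while training accuracy keeps climbing, the usual signature of overfitting, so stopping earlier would likely have helped. The GRU epochs were also not faster than the LSTM epochs in this log (roughly 33 to 37 s versus 31 to 35 s), and all figures come from a single run, so the small gap between the two gated models should not be over-read.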
