In [5]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

In [6]:
doc='''Recurrent neural networks (RNNs) are deep learning models, typically used to solve problems with sequential input data such as time series. What are they, and how do we use them in time series forecasting?

RNNs are a type of neural network that retains a memory of what it has already processed and thus can learn from previous iterations during its training.

Probably you have done what most of us do when we hear any technical term for the first time. You have tried to understand what recurrent neural networks are by clicking on the top-listed non-ad Google search result. Then you will have found that Wikipedia’s article exhibits a high level of abstraction. It is of limited usefulness when we try to understand what RNNs are and what they are for: "A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs …. Recurrent neural networks are theoretically Turing complete and can run arbitrary programs to process arbitrary sequences of inputs." Say what?

Michael Phi provided an excellent, non-mathematical guide on RNNs in a previous Towards Data Science article of his: "Illustrated Guide to Recurrent Neural Networks | by Michael Phi | Towards Data Science". So did Will Koehrsen, in "Recurrent Neural Networks by Example in Python | by Will Koehrsen | Towards Data Science."

Let me summarize the basics we should understand about RNNs, in non-mathematical terms (and then I’d refer you to the additional explanations and illustrations in the two articles Michael and Will wrote in 2018).

A neural network – of which recurrent neural networks are one type, among other types such as convolutional networks – is composed of three elementary components: the input layer, the hidden layers, and the output layer. Each layer consists of so-called nodes (aka neurons).

I’ve read the following analogy for the three main types of neural networks, which are said to mimic human brain functions in specific ways. The following comparisons oversimplify, so best take them with a grain of salt.

the temporal lobe of our brain => artificial neural networks => mainly for classification and regression problems => one of the functions of the temporal lobe is long-term memory
the occipital lobe => convolutional neural networks => mainly for computer vision problems (though temporal convolutional networks, TCNs, can be applied to time series)
the frontal lobe => recurrent neural networks RNN => mainly for time series analysis, sequences, and lists – for instance, in language processing, which deals with sequences of characters, words, and sentences ordered by a grammar; or time series, which consist of temporal sequences of observations => one of the frontal lobe’s functions is short-term memory
Feed-forward neural networks (FFNNs) – such as the grandfather among neural networks, the original single-layer perceptron, developed in 1958— came before recurrent neural networks. In FFNNs, the information flows in only one direction: from the input layer, through the hidden layers, to the output layer, but never backwards in feedback loops. FFNN are often used in pattern recognition. The FFNN multiplies a matrix of weight factors with the inputs and generates the outputs from these weighted inputs. Feed-forward neural networks don’t retain a memory of the inputs they have processed. They suffer from anterograde amnesia, the inability to form new memories (similar to the protagonist in Christopher Nolan’s movie Memento – Wikipedia [this seemed a rare opportunity to mention anterograde amnesia and Memento in a data science article]).

A recurrent neural network, by contrast, retains a memory of what it has processed in its recent previous steps (we’ll come back to the "recent" qualifier in a minute). It makes recurrent connections by going through temporal feedback loops: the output of a preceding step is used as an input for the current process step. Unlike amnesiac FFNNs, this memory enables RNNs to process sequences of inputs without losing track. The loops make it a recurrent network.'''

In [7]:
# Tokenize words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([doc])
vocab_size = len(tokenizer.word_index) + 1  # +1 for padding/indexing
sequences = tokenizer.texts_to_sequences([doc])[0]

# Create input-output pairs for training
X_train, Y_train = [], []
seq_length = 3  # Number of context words used to predict the next word

for i in range(len(sequences) - seq_length):
    X_train.append(sequences[i:i+seq_length])
    Y_train.append(sequences[i+seq_length])

X_train = np.array(X_train)
Y_train = to_categorical(Y_train, num_classes=vocab_size)  # Convert output to one-hot
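
The windowing above can be checked by hand on a toy sequence. The following is a minimal pure-Python sketch; `make_windows`, `one_hot`, and the token ids are hypothetical stand-ins for illustration, not part of the notebook or of Keras.

```python
def make_windows(token_ids, seq_length):
    """Slide a window over token_ids: each window is an input, the next id is its target."""
    inputs, targets = [], []
    for i in range(len(token_ids) - seq_length):
        inputs.append(token_ids[i:i + seq_length])
        targets.append(token_ids[i + seq_length])
    return inputs, targets


def one_hot(index, num_classes):
    """Return a list of length num_classes with 1.0 at position index."""
    return [1.0 if j == index else 0.0 for j in range(num_classes)]


# Toy token ids (made up for this example)
token_ids = [4, 1, 2, 7, 3]
inputs, targets = make_windows(token_ids, seq_length=3)
print(inputs)    # [[4, 1, 2], [1, 2, 7]]
print(targets)   # [7, 3]
print(one_hot(3, num_classes=5))  # [0.0, 0.0, 0.0, 1.0, 0.0]
```

A sequence of N tokens yields N - seq_length training pairs, which is why the loop stops at `len(sequences) - seq_length`.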


In [8]:
import tensorflow as tf

class MinGRUCell(tf.keras.layers.Layer):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.z_gate = tf.keras.layers.Dense(hidden_size, activation='sigmoid')
        self.h_tilde = tf.keras.layers.Dense(hidden_size)
        self.output_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')

    def precompute(self, x):
        """
        Precompute a = (1 - z_t) and b = z_t * h̃_t for all time steps in parallel.
        x: [B, T, D]
        Returns:
            a: [B, T, H]
            b: [B, T, H]
        """
        z = self.z_gate(x)         # [B, T, H]
        h_hat = self.h_tilde(x)    # [B, T, H]
        a = 1.0 - z
        b = z * h_hat
        return a, b
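
Written out as equations, the cell above appears to compute the following. The weight and bias symbols (W_z, c_z, W_h, c_h) are notation introduced here for the two `Dense` layers, which include a bias term by default; they do not appear in the code.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + c_z) \\
\tilde{h}_t &= W_h x_t + c_h \\
a_t &= 1 - z_t, \qquad b_t = z_t \odot \tilde{h}_t \\
h_t &= a_t \odot h_{t-1} + b_t
\end{aligned}
```

Neither z_t nor h̃_t depends on h_{t-1}, so `precompute` can evaluate a and b for all time steps at once; only the last line is a recurrence.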


In [9]:
def parallel_scan(a, b, h0):
    """
    Computes h_t = a_t * h_{t-1} + b_t for every time step.
    Despite the name, this reference version loops over time sequentially;
    a true parallel scan would produce the same values.
    a: [B, T, H]
    b: [B, T, H]
    h0: [B, H]
    Returns:
        h: [B, T, H]
    """
    T = tf.shape(a)[1]  # Number of time steps
    h_list = tf.TensorArray(dtype=tf.float32, size=T)
    h_t = h0  # initial hidden state

    for t in tf.range(T):
        h_t = a[:, t, :] * h_t + b[:, t, :]
        h_list = h_list.write(t, h_t)

    h_stack = tf.transpose(h_list.stack(), [1, 0, 2])  # [B, T, H]
    return h_stack
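
Why can this recurrence be parallelised at all? Each step is an affine map h -> a*h + b, and composing two such maps gives another affine map. That composition is associative, which is the property a parallel prefix scan relies on. The sketch below illustrates this on scalars in pure Python; `combine`, `scan_sequential`, and `scan_via_combine` are hypothetical helper names, and `itertools.accumulate` still runs left to right here, so this shows the algebra rather than an actual parallel implementation.

```python
from itertools import accumulate


def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan_sequential(a, b, h0):
    """Reference loop: h_t = a_t * h_{t-1} + b_t."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_via_combine(a, b, h0):
    """Build prefix compositions with `combine`, then apply each one to h0."""
    prefixes = accumulate(zip(a, b), combine)
    return [a_prefix * h0 + b_prefix for a_prefix, b_prefix in prefixes]


# Toy values chosen to be exact in binary floating point
a = [0.5, 0.25, 0.5]
b = [1.0, 2.0, 4.0]
h0 = 2.0

print(scan_sequential(a, b, h0))   # [2.0, 2.5, 5.25]
print(scan_via_combine(a, b, h0))  # [2.0, 2.5, 5.25]
```

Because `combine` is associative, the prefixes could be grouped in a tree and evaluated in O(log T) parallel steps instead of T sequential ones.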


In [10]:
def train_min_gru_parallel(X, Y, vocab_size, hidden_size=10, embedding_dim=8, epochs=500):
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    min_gru = MinGRUCell(hidden_size, vocab_size)

    # Embedding layer
    embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=True)

    for epoch in range(epochs):
        with tf.GradientTape() as tape:
            h0 = tf.zeros((X.shape[0], hidden_size), dtype=tf.float32)  # Initial hidden state
            embedded_X = embedding_layer(X)  # [B, T, D]

            # Precompute gates
            a, b = min_gru.precompute(embedded_X)  # [B, T, H]

            # Parallel scan to compute h_t over time
            h_all = parallel_scan(a, b, h0)  # [B, T, H]

            # Final prediction from last time step
            y_t = min_gru.output_layer(h_all[:, -1, :])  # [B, vocab_size]

            # Compute categorical crossentropy loss
            loss = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(Y, y_t))

        # Update both minGRU and embedding layer
        variables = min_gru.trainable_variables + embedding_layer.trainable_variables
        grads = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(grads, variables))

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss.numpy():.4f}")

    # Return the trained cell and embedding layer once all epochs are done
    return min_gru, embedding_layer


In [11]:
def predict_min_gru_parallel(model, tokenizer, input_text, embedding_layer, hidden_size):
    seq_length = 3  # Must match the context length used during training
    sequence = tokenizer.texts_to_sequences([input_text])[0][-seq_length:]
    X_input = tf.convert_to_tensor([sequence], dtype=tf.int32)
    h0 = tf.zeros((1, hidden_size), dtype=tf.float32)

    embedded_X = embedding_layer(X_input)       # [1, T, D]
    a, b = model.precompute(embedded_X)         # [1, T, H]
    h_all = parallel_scan(a, b, h0)             # [1, T, H]
    y_t = model.output_layer(h_all[:, -1, :])   # [1, vocab_size]

    predicted_index = tf.argmax(y_t, axis=1).numpy()[0]
    return tokenizer.index_word.get(predicted_index, "<UNK>")
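
The function above decodes greedily: it keeps only the single most probable index. To inspect several candidates instead, the probability vector can be ranked. This is a small pure-Python sketch; `top_k_words`, the toy vocabulary, and the probabilities are made up for illustration and do not come from the trained model.

```python
def top_k_words(probs, index_word, k=3):
    """Return the k most probable (word, probability) pairs, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(index_word.get(i, "<UNK>"), probs[i]) for i in ranked[:k]]


# Toy vocabulary; index 0 is reserved (no word), as with the Keras Tokenizer
index_word = {1: "are", 2: "networks", 3: "neural"}
probs = [0.05, 0.6, 0.25, 0.1]

print(top_k_words(probs, index_word, k=2))  # [('are', 0.6), ('networks', 0.25)]
```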


In [None]:
min_gru_model, emb_layer = train_min_gru_parallel(X_train, Y_train, vocab_size)


Epoch 0, Loss: 5.7003
Epoch 10, Loss: 5.4960
Epoch 20, Loss: 4.9460
Epoch 30, Loss: 4.3375
Epoch 40, Loss: 3.6592
Epoch 50, Loss: 2.9691
Epoch 60, Loss: 2.3500
Epoch 70, Loss: 1.8140
Epoch 80, Loss: 1.3746
Epoch 90, Loss: 1.0354
Epoch 100, Loss: 0.7791
Epoch 110, Loss: 0.5882
Epoch 120, Loss: 0.4483
Epoch 130, Loss: 0.3481
Epoch 140, Loss: 0.2769
Epoch 150, Loss: 0.2266
Epoch 160, Loss: 0.1907
Epoch 170, Loss: 0.1644
Epoch 180, Loss: 0.1446
Epoch 190, Loss: 0.1297
Epoch 200, Loss: 0.1184
Epoch 210, Loss: 0.1095
Epoch 220, Loss: 0.1026
Epoch 230, Loss: 0.0973
Epoch 240, Loss: 0.0931
Epoch 250, Loss: 0.0898
Epoch 260, Loss: 0.0870
Epoch 270, Loss: 0.0847
Epoch 280, Loss: 0.0828
Epoch 290, Loss: 0.0811
Epoch 300, Loss: 0.0796
Epoch 310, Loss: 0.0781
Epoch 320, Loss: 0.0768
Epoch 330, Loss: 0.0756
Epoch 340, Loss: 0.0747
Epoch 350, Loss: 0.0740
Epoch 360, Loss: 0.0733
Epoch 370, Loss: 0.0728
Epoch 380, Loss: 0.0723
Epoch 390, Loss: 0.0719
Epoch 400, Loss: 0.0715
Epoch 410, Loss: 0.0712
...
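
A quick sanity check on the first logged loss: a model that spreads probability uniformly over V classes has a cross-entropy of ln V, so an untrained network should start near that value. The vocabulary size of 300 below is a hypothetical round number chosen for illustration, not the `vocab_size` from this run.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a uniform prediction over num_classes classes."""
    return -math.log(1.0 / num_classes)


hypothetical_vocab_size = 300  # assumption for illustration only
baseline = uniform_cross_entropy(hypothetical_vocab_size)
print(f"{baseline:.4f}")  # 5.7038
```

A starting loss far above this baseline would suggest a confidently wrong initialisation; one near it is what an uninformed model is expected to produce.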

In [13]:

input_text = "recurrent neural networks"
predicted = predict_min_gru_parallel(min_gru_model, tokenizer, input_text, emb_layer, hidden_size=10)
print(f"Input: '{input_text}' → Predicted next word: '{predicted}'")

Input: 'recurrent neural networks' → Predicted next word: 'are'
