In [5]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

In [6]:
doc='''Recurrent neural networks (RNNs) are deep learning models, typically used to solve problems with sequential input data such as time series. What are they, and how do we use them in time series forecasting?

RNNs are a type of neural network that retains a memory of what it has already processed and thus can learn from previous iterations during its training.

Probably you have done what most of us do when we hear any technical term for the first time. You have tried to understand what recurrent neural networks are by clicking on the top-listed non-ad Google search result. Then you will have found that Wikipedia’s article exhibits a high level of abstraction. It is of limited usefulness when we try to understand what RNNs are and what they are for: "A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs …. Recurrent neural networks are theoretically Turing complete and can run arbitrary programs to process arbitrary sequences of inputs." Say what?

Michael Phi provided an excellent, non-mathematical guide on RNNs in a previous Towards Data Science article of his: "Illustrated Guide to Recurrent Neural Networks | by Michael Phi | Towards Data Science". So did Will Koehrsen, in "Recurrent Neural Networks by Example in Python | by Will Koehrsen | Towards Data Science."

Let me summarize the basics we should understand about RNNs, in non-mathematical terms (and then I’d refer you to the additional explanations and illustrations in the two articles Michael and Will wrote in 2018).

A neural network – of which recurrent neural networks are one type, among other types such as convolutional networks – is composed of three elementary components: the input layer, the hidden layers, and the output layer. Each layer consists of so-called nodes (aka neurons).

I’ve read the following analogy for the three main types of neural networks, which are said to mimic human brain functions in specific ways. The following comparisons oversimplify, so best take them with a grain of salt.

the temporal lobe of our brain => artificial neural networks => mainly for classification and regression problems => one of the functions of the temporal lobe is long-term memory
the occipital lobe => convolutional neural networks => mainly for computer vision problems (though temporal convolutional networks, TCNs, can be applied to time series)
the frontal lobe => recurrent neural networks RNN => mainly for time series analysis, sequences, and lists – for instance, in language processing, which deals with sequences of characters, words, and sentences ordered by a grammar; or time series, which consist of temporal sequences of observations => one of the frontal lobe’s functions is short-term memory
Feed-forward neural networks (FFNNs) – such as the grandfather among neural networks, the original single-layer perceptron, developed in 1958— came before recurrent neural networks. In FFNNs, the information flows in only one direction: from the input layer, through the hidden layers, to the output layer, but never backwards in feedback loops. FFNN are often used in pattern recognition. The FFNN multiplies a matrix of weight factors with the inputs and generates the outputs from these weighted inputs. Feed-forward neural networks don’t retain a memory of the inputs they have processed. They suffer from anterograde amnesia, the inability to form new memories (similar to the protagonist in Christopher Nolan’s movie Memento – Wikipedia [this seemed a rare opportunity to mention anterograde amnesia and Memento in a data science article]).

A recurrent neural network, by contrast, retains a memory of what it has processed in its recent previous steps (we’ll come back to the "recent" qualifier in a minute). It makes recurrent connections by going through temporal feedback loops: the output of a preceding step is used as an input for the current process step. Unlike amnesiac FFNNs, this memory enables RNNs to process sequences of inputs without losing track. The loops make it a recurrent network.'''

In [7]:
# Tokenize words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([doc])
vocab_size = len(tokenizer.word_index) + 1  # +1 because word indices start at 1; index 0 is reserved (padding)
sequences = tokenizer.texts_to_sequences([doc])[0]

# Create input-output pairs for training
X_train, Y_train = [], []
seq_length = 3  # Number of preceding words used to predict the next word

for i in range(len(sequences) - seq_length):
    X_train.append(sequences[i:i+seq_length])
    Y_train.append(sequences[i+seq_length])

X_train = np.array(X_train)
Y_train = to_categorical(Y_train, num_classes=vocab_size)  # Convert output to one-hot
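
The windowing above can be checked by hand on a tiny sequence. The following is a minimal sketch in plain Python; `make_windows`, `one_hot`, and the toy token ids are hypothetical names and values chosen for illustration, not part of the notebook.

```python
def make_windows(token_ids, seq_length):
    """Pair each run of seq_length consecutive ids with the id that follows it."""
    inputs, targets = [], []
    for i in range(len(token_ids) - seq_length):
        inputs.append(token_ids[i:i + seq_length])
        targets.append(token_ids[i + seq_length])
    return inputs, targets


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position index."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


toy_ids = [4, 1, 2, 7, 3]  # hypothetical token ids, not taken from the real corpus
inputs, targets = make_windows(toy_ids, seq_length=3)
print(inputs)   # [[4, 1, 2], [1, 2, 7]]
print(targets)  # [7, 3]
print(one_hot(targets[0], num_classes=8))
```

A sequence of N tokens yields N - seq_length training pairs, which is why the loop in the cell stops at `len(sequences) - seq_length`.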


In [8]:
import tensorflow as tf

class MinLSTMCell(tf.keras.layers.Layer):
    """Simplified LSTM-style cell: the gate and candidate depend on the input only,
    and a single state c_t serves as both memory and hidden state."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.f_gate = tf.keras.layers.Dense(hidden_size, activation='sigmoid')
        self.c_tilde = tf.keras.layers.Dense(hidden_size, activation=None)
        self.output_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')

    def call(self, x_t, c_prev):
        f_t = self.f_gate(x_t)                  # Forget gate from input only, values in (0, 1)
        c_hat = self.c_tilde(x_t)               # Candidate memory from input only
        # Input gate is coupled to the forget gate as (1 - f_t)
        c_t = f_t * c_prev + (1 - f_t) * c_hat  # Memory update (merged with hidden)
        y_t = self.output_layer(c_t)            # Next-token distribution from the current state
        return c_t, y_t                         # c_t is also h_t
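
To see what the memory update does, here is a minimal scalar sketch of `c_t = f_t * c_prev + (1 - f_t) * c_hat` in plain Python. The names `sigmoid` and `min_lstm_step` and the numeric inputs are assumptions made for illustration; in the cell above, the gate logit and the candidate come from Dense layers applied to `x_t`.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def min_lstm_step(f_logit, c_hat, c_prev):
    """One scalar step of c_t = f_t * c_prev + (1 - f_t) * c_hat."""
    f_t = sigmoid(f_logit)
    return f_t * c_prev + (1.0 - f_t) * c_hat


c_prev = 1.0
# Gate near 1: the old state is kept almost unchanged
keep = min_lstm_step(f_logit=10.0, c_hat=-1.0, c_prev=c_prev)
# Gate near 0: the old state is almost fully replaced by the candidate
overwrite = min_lstm_step(f_logit=-10.0, c_hat=-1.0, c_prev=c_prev)
# Gate at 0.5: the new state is the average of old state and candidate
half = min_lstm_step(f_logit=0.0, c_hat=-1.0, c_prev=c_prev)
print(keep, overwrite, half)
```

Because the gate and the candidate are computed from the current input alone, the previous state enters only through the convex combination, so each new state is a weighted average of the old state and the candidate.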


In [9]:
def train_min_lstm(X, Y, vocab_size, hidden_size=10, embedding_dim=8, epochs=500):
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    min_lstm = MinLSTMCell(hidden_size, vocab_size)

    # Embedding layer to convert tokens to dense vectors
    embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=True)

    for epoch in range(epochs):
        with tf.GradientTape() as tape:
            c_prev = tf.zeros((X.shape[0], hidden_size), dtype=tf.float32)  # initial state

            embedded_X = embedding_layer(X)  # [B, T, D]

            # Feed sequence into the minLSTM one step at a time
            for t in range(embedded_X.shape[1]):
                x_t = embedded_X[:, t, :]
                c_prev, y_t = min_lstm(x_t, c_prev)  # only use c_prev

            # Cross-entropy between the prediction after the last step and the one-hot target
            loss = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(Y, y_t))

        # Layer variables are created on the first forward pass, so collect them afterwards
        variables = min_lstm.trainable_variables + embedding_layer.trainable_variables
        grads = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(grads, variables))

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss.numpy():.4f}")

    # Return after the loop so the function also returns a model when epochs == 0
    return min_lstm, embedding_layer


In [10]:
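def predict_min_lstm(model, tokenizer, input_text, embedding_layer, hidden_size):
    seq_length = 3  # Must match the context length used to build X_train
    # Words the tokenizer has not seen are dropped by texts_to_sequences
    sequence = tokenizer.texts_to_sequences([input_text])[0][-seq_length:]
    if not sequence:
        return "<UNK>"  # No known words, so there is nothing to condition on
    X_input = np.array([sequence])
    c_prev = tf.zeros((1, hidden_size), dtype=tf.float32)

    embedded_X = embedding_layer(X_input)

    # Run the cell over the context, carrying the state from step to step
    for t in range(embedded_X.shape[1]):
        x_t = embedded_X[:, t, :]
        c_prev, y_t = model(x_t, c_prev)

    # Greedy choice: the most probable token after the last step
    predicted_index = int(tf.argmax(y_t, axis=1).numpy()[0])
    return tokenizer.index_word.get(predicted_index, "<UNK>")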


In [11]:
# Train
min_lstm_model, emb_layer = train_min_lstm(X_train, Y_train, vocab_size)


Epoch 0, Loss: 5.7002
Epoch 10, Loss: 5.4715
Epoch 20, Loss: 4.8913
Epoch 30, Loss: 4.3175
Epoch 40, Loss: 3.6769
Epoch 50, Loss: 3.0428
Epoch 60, Loss: 2.4444
Epoch 70, Loss: 1.9062
Epoch 80, Loss: 1.4492
Epoch 90, Loss: 1.0929
Epoch 100, Loss: 0.8327
Epoch 110, Loss: 0.6461
Epoch 120, Loss: 0.5074
Epoch 130, Loss: 0.4022
Epoch 140, Loss: 0.3229
Epoch 150, Loss: 0.2638
Epoch 160, Loss: 0.2194
Epoch 170, Loss: 0.1860
Epoch 180, Loss: 0.1611
Epoch 190, Loss: 0.1427
Epoch 200, Loss: 0.1288
Epoch 210, Loss: 0.1181
Epoch 220, Loss: 0.1098
Epoch 230, Loss: 0.1031
Epoch 240, Loss: 0.0977
Epoch 250, Loss: 0.0933
Epoch 260, Loss: 0.0896
Epoch 270, Loss: 0.0864
Epoch 280, Loss: 0.0836
Epoch 290, Loss: 0.0813
Epoch 300, Loss: 0.0792
Epoch 310, Loss: 0.0774
Epoch 320, Loss: 0.0759
Epoch 330, Loss: 0.0746
Epoch 340, Loss: 0.0735
Epoch 350, Loss: 0.0725
Epoch 360, Loss: 0.0716
Epoch 370, Loss: 0.0709
Epoch 380, Loss: 0.0704
Epoch 390, Loss: 0.0698
Epoch 400, Loss: 0.0693
Epoch 410, Loss: 0.0689
...
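
A quick sanity check on the first logged value: a model that predicts a uniform distribution over V classes has a cross-entropy of ln(V) against any one-hot target. The sketch below inverts that relation for the "Epoch 0" loss. It assumes that the untrained softmax output is close to uniform, which is only approximately true, and `uniform_cross_entropy` is a hypothetical helper name.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy of a uniform prediction against a one-hot target: -ln(1/V)."""
    return -math.log(1.0 / num_classes)


first_logged_loss = 5.7002  # "Epoch 0" value from the training log
implied_classes = math.exp(first_logged_loss)
print(round(implied_classes))
print(uniform_cross_entropy(round(implied_classes)))
```

Under that assumption, the starting loss is consistent with a vocabulary of roughly 300 entries. Comparing the result with the actual `vocab_size` is a cheap way to confirm that training starts from a sensible state.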

In [12]:
# Predict
input_text = "recurrent neural networks"
predicted_word = predict_min_lstm(min_lstm_model, tokenizer, input_text, emb_layer, hidden_size=10)
print(f"Input: '{input_text}' → Predicted next word: '{predicted_word}'")

Input: 'recurrent neural networks' → Predicted next word: 'are'
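
Single-word prediction extends naturally to generating several words by appending each prediction to the text and predicting again. The following is a minimal sketch, not part of the notebook: `generate_text` is a hypothetical helper, and `toy_predict` is a stand-in lookup table so the example runs without a trained model.

```python
def generate_text(predict_next, seed_text, num_words):
    """Greedily extend seed_text by num_words, one predicted word at a time."""
    text = seed_text
    for _ in range(num_words):
        next_word = predict_next(text)
        text = f"{text} {next_word}"
    return text


def toy_predict(text):
    """Stand-in predictor: looks only at the last word of the text."""
    table = {"networks": "are", "are": "deep", "deep": "learning"}
    return table.get(text.split()[-1], "<UNK>")


print(generate_text(toy_predict, "recurrent neural networks", 3))
# recurrent neural networks are deep learning
```

With the trained objects from this notebook, `predict_next` could be `lambda text: predict_min_lstm(min_lstm_model, tokenizer, text, emb_layer, hidden_size=10)`. Because `predict_min_lstm` keeps only the last three known tokens, the growing text is truncated to the model's context length on every call.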
