<!-- HTML file automatically generated from DocOnce source (https://github.com/doconce/doconce/)
doconce format html week47.do.txt --no_mako -->
<!-- dom:TITLE: Week 47: Recurrent neural networks and Autoencoders -->

# Week 47: Recurrent neural networks and Autoencoders
**Morten Hjorth-Jensen**, Department of Physics, University of Oslo, Norway

Date: **November 17-21, 2025**

## Plan for week 47

**Plans for the lecture Monday 17 November, with video suggestions etc.**

1. Recurrent neural networks, code examples and long short-term memory

2. Autoencoders (last topic this semester)

3. Readings and Videos:

4. These lecture notes at <https://github.com/CompPhysics/MachineLearning/blob/master/doc/pub/week47/ipynb/week47.ipynb>

5. See also lecture notes from week 46 at <https://github.com/CompPhysics/MachineLearning/blob/master/doc/pub/week46/ipynb/week46.ipynb>. The lecture on Monday starts with a review of recurrent neural networks. The second lecture starts with the basics of autoencoders.
<!-- o Video of lecture at <https://youtu.be/RIHzmLv05DA> -->
<!-- o Whiteboard notes at <https://github.com/CompPhysics/MachineLearning/blob/master/doc/HandWrittenNotes/2024/NotesNovember18.pdf> -->
<!-- * [Video of Lecture](https://youtu.be/SpWXsvn5I9E) -->

**Lab sessions on Tuesday and Wednesday.**

1. Work on and discussion of project 3

## Reading recommendations

1. For RNNs, see Goodfellow et al chapter 10, see <https://www.deeplearningbook.org/contents/rnn.html>.

2. Reading suggestions for implementation of RNNs in PyTorch: see Raschka et al.'s chapter 15 and GitHub site at <https://github.com/rasbt/machine-learning-book/tree/main/ch15>.

## TensorFlow examples
For TensorFlow (using Keras) implementations, we recommend
1. David Foster, Generative Deep Learning with TensorFlow, see chapter 5 at <https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/ch05.html>

2. Joseph Babcock and Raghav Bali, Generative AI with Python, and their GitHub repository, chapters 2 and 3, at <https://github.com/PacktPublishing/Hands-On-Generative-AI-with-Python-and-TensorFlow-2>

## What is a recurrent NN?

A recurrent neural network (RNN), as opposed to a regular fully
connected neural network (FCNN) or just neural network (NN), has
layers that are connected to themselves.

In an FCNN there are no connections between nodes in a single
layer. For instance, $h_1^1$ is not connected to $h_2^1$. In
addition, the input and output are always of a fixed length.

In an RNN, however, this is no longer the case. Nodes in the hidden
layers are connected to themselves.

## Why RNNs?

Recurrent neural networks are well suited to
sequential data, that is, data where the order matters. In a regular
fully connected network, the order of the inputs carries no special meaning.

Another property of RNNs is that they can handle inputs and outputs
of variable length. Consider again the simplified breast cancer dataset. If you
have trained a regular FCNN on the dataset with the two features, it
makes no sense to suddenly add a third feature. The network would not
know what to do with it, and would reject such inputs with three
features (or any other number of features that isn't two, for that
matter). An RNN, in contrast, applies the same weights at every step
of a sequence, so the number of steps does not have to be fixed in advance.

## More whys
1. Traditional feedforward networks process fixed-size inputs and ignore temporal order. RNNs incorporate recurrence to handle sequential data like time series or language.

2. At each time step, an RNN cell processes input $x_t$ and a hidden state $h_{t-1}$ from the previous step, producing a new hidden state $h_t$ and (optionally) an output $y_t$.

3. This hidden state acts as a “memory” carrying information forward. For example, predicting stock prices or words in a sentence relies on past inputs.

4. RNNs share parameters across time steps, so they can generalize patterns regardless of sequence length.

## RNNs in more detail

<!-- dom:FIGURE: [figslides/RNN2.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figslides/RNN2.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## RNNs in more detail, part 2

<!-- dom:FIGURE: [figslides/RNN3.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figslides/RNN3.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## RNNs in more detail, part 3

<!-- dom:FIGURE: [figslides/RNN4.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figslides/RNN4.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## RNNs in more detail, part 4

<!-- dom:FIGURE: [figslides/RNN5.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figslides/RNN5.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## RNNs in more detail, part 5

<!-- dom:FIGURE: [figslides/RNN6.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figslides/RNN6.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## RNNs in more detail, part 6

<!-- dom:FIGURE: [figslides/RNN7.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figslides/RNN7.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## RNNs in more detail, part 7

<!-- dom:FIGURE: [figslides/RNN8.png, width=700 frac=0.9] -->
<!-- begin figure -->

<img src="figslides/RNN8.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## RNN Forward Pass Equations

For a simple (vanilla) RNN with one hidden layer and no bias, the state update and output are:

$$
\mathbf{h}_t = \phi(\mathbf{W}_{xh}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1})\,,\quad \mathbf{y}_t = \mathbf{W}_{yh}\mathbf{h}_t,
$$

where $\phi$ is a nonlinear activation (e.g. tanh or ReLU).

In matrix form,

$$
\mathbf{W}_{xh}\in\mathbb{R}^{h\times d}, \mathbf{W}_{hh}\in\mathbb{R}^{h\times h}, \mathbf{W}_{yh}\in\mathbb{R}^{q\times h},
$$

for input dimension $d$, hidden dimension $h$, and output dimension $q$.

We often also write

$$
\mathbf{y}_t = f(\mathbf{o}_t) \quad \text{with} \quad \mathbf{o}_t = \mathbf{W}_{yh}\mathbf{h}_t
$$

to include a final activation for classification.

Because the same $\mathbf{W}$ are used each step, gradients during training will propagate through time.
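
The forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `rnn_forward`, the layer sizes and the random weights are assumptions made for the example, and $\phi$ is taken to be tanh.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_yh, h0=None):
    """Run a bias-free vanilla RNN over a sequence xs of shape (T, d)."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:                          # one time step at a time
        h = np.tanh(W_xh @ x_t + W_hh @ h)  # state update with phi = tanh
        hs.append(h)
        ys.append(W_yh @ h)                 # linear readout y_t
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
d, h_dim, q = 3, 5, 2                       # illustrative sizes (assumed)
W_xh = rng.normal(scale=0.5, size=(h_dim, d))
W_hh = rng.normal(scale=0.5, size=(h_dim, h_dim))
W_yh = rng.normal(scale=0.5, size=(q, h_dim))

# The same weights process sequences of different lengths
hs_short, ys_short = rnn_forward(rng.normal(size=(4, d)), W_xh, W_hh, W_yh)
hs_long, ys_long = rnn_forward(rng.normal(size=(9, d)), W_xh, W_hh, W_yh)
print(hs_short.shape, ys_short.shape)       # (4, 5) (4, 2)
print(hs_long.shape, ys_long.shape)         # (9, 5) (9, 2)
```

Keeping only `ys[-1]` corresponds to a many-to-one setup, while using all rows of `ys` corresponds to many-to-many.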

## Unrolled RNN in Time

1. Inputs $x_1,x_2,x_3,\dots$ are fed in sequentially; the hidden state flows from one step to the next, capturing past context.

2. After processing the final input $x_T$, the network can make a prediction (many-to-one) or outputs can be produced at each step (many-to-many).

3. Unrolling clarifies that training an RNN is like training a deep feedforward network of depth $T$ whose layers all share the same weights.

## Example Task: Character-level RNN Classification
1. A classic example: feed a name (sequence of characters) one char at a time, and classify its language of origin.

2. At each step, the RNN outputs a hidden state; we use the final hidden state to predict the class of the entire sequence.

3. A character-level RNN reads a word as a series of characters, producing a prediction and a hidden state at each step and feeding the hidden state into the next step. We take the final prediction to be the output.

4. This illustrates sequence-to-one modeling: the single output depends on all the inputs in the sequence.
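
As a rough sketch of this setup, the snippet below one-hot encodes a name, runs it through an untrained vanilla RNN and applies a softmax to the final hidden state. The class list, the helper names `one_hot` and `classify`, and the random weights are all hypothetical; the helper only handles the 26 lowercase ASCII letters.

```python
import string
import numpy as np

alphabet = string.ascii_lowercase                 # 26 input symbols
languages = ["English", "Norwegian", "Italian"]   # hypothetical class labels

def one_hot(name):
    """Encode a name as an array of shape (len(name), 26)."""
    X = np.zeros((len(name), len(alphabet)))
    for t, ch in enumerate(name.lower()):
        X[t, alphabet.index(ch)] = 1.0            # ASCII letters only
    return X

def classify(name, W_xh, W_hh, W_yh):
    """Return class probabilities computed from the final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in one_hot(name):                     # one character per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    logits = W_yh @ h
    e = np.exp(logits - logits.max())             # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
hidden = 8
W_xh = rng.normal(scale=0.3, size=(hidden, len(alphabet)))
W_hh = rng.normal(scale=0.3, size=(hidden, hidden))
W_yh = rng.normal(scale=0.3, size=(len(languages), hidden))

probs = classify("Hansen", W_xh, W_hh, W_yh)
print(dict(zip(languages, probs.round(3))))
```

Since the weights are random, the probabilities carry no meaning here; training would adjust the three matrices so that the final hidden state separates the classes.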

## Defining a Simple RNN using TensorFlow

In [1]:
import tensorflow as tf
import numpy as np

# -----------------------
# 1. Hyperparameters
# -----------------------
input_size = 10        # Dimensionality of each time step
hidden_size = 20       # Number of recurrent units
num_classes = 2        # Binary classification
sequence_length = 5     # Sequence length
batch_size = 16

# -----------------------
# 2. Dummy dataset
#    X: [batch, seq, features]
#    y: [batch]
# -----------------------
X = np.random.randn(batch_size, sequence_length, input_size).astype(np.float32)
y = np.random.randint(0, num_classes, size=(batch_size,))

# -----------------------
# 3. Build simple RNN model
# -----------------------
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(
        units=hidden_size,
        activation="tanh",
        return_sequences=False,   # Only final hidden state
        input_shape=(sequence_length, input_size)
    ),
    tf.keras.layers.Dense(num_classes)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

# -----------------------
# 4. Train the model
# -----------------------
history = model.fit(
    X, y,
    epochs=5,
    batch_size=batch_size,
    verbose=1
)

# -----------------------
# 5. Evaluate
# -----------------------
logits = model.predict(X)
print("Logits from model:\n", logits)

This recurrent neural network uses the TensorFlow/Keras SimpleRNN layer, which is the counterpart to PyTorch's nn.RNN.
In this code we have used
1. return_sequences=False, which makes the layer output only the last hidden state; this state is then fed to the classifier.

2. from_logits=True, which tells the loss that the model outputs raw logits and thereby matches the behavior of PyTorch's CrossEntropyLoss.

## Similar example using PyTorch

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

# -----------------------
# 1. Hyperparameters
# -----------------------
input_size = 10
hidden_size = 20
num_layers = 1
num_classes = 2
sequence_length = 5
batch_size = 16
lr = 1e-3

# -----------------------
# 2. Dummy dataset
# -----------------------
X = torch.randn(batch_size, sequence_length, input_size)
y = torch.randint(0, num_classes, (batch_size,))

# -----------------------
# 3. Simple RNN model
# -----------------------
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(SimpleRNN, self).__init__()
        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            nonlinearity="tanh"
        )
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, h_n = self.rnn(x)   # out: [batch, seq, hidden]

        # Take only the last time step of the output sequence
        last_hidden = out[:, -1, :]  # [batch, hidden]

        logits = self.fc(last_hidden)
        return logits

model = SimpleRNN(input_size, hidden_size, num_layers, num_classes)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

# -----------------------
# 4. Training step
# -----------------------
model.train()
optimizer.zero_grad()

logits = model(X)
loss = criterion(logits, y)
loss.backward()
optimizer.step()

print(f"Loss: {loss.item():.4f}")

## Backpropagation Through Time (BPTT) and Gradients

**Backpropagation Through Time (BPTT).**

1. Training an RNN involves computing gradients through time by unfolding the network: treat the unrolled RNN as a very deep feedforward net.

2. We compute the loss $L = \frac{1}{T}\sum_{t=1}^T \ell(y_t,\hat y_t)$ and backpropagate from $t=T$ down to $t=1.$

3. The unrolled computational graph shows how each hidden state depends on inputs and parameters across time.

4. BPTT applies the chain rule along this graph, accumulating gradients from each time step into the shared parameters.
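
The steps above can be made concrete with a small NumPy sketch of BPTT for the bias-free tanh RNN. The squared-error loss $\ell = \tfrac{1}{2}\|y_t - \hat y_t\|^2$, the function names and the sizes are assumptions made for illustration; a finite-difference check on $\mathbf{W}_{hh}$ is included as a sanity test.

```python
import numpy as np

def forward(xs, targets, W_xh, W_hh, W_yh):
    """Return the time-averaged squared-error loss and all hidden states."""
    hs = [np.zeros(W_hh.shape[0])]               # hs[0] is the initial state h_0
    loss = 0.0
    for x_t, target_t in zip(xs, targets):
        hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1]))
        err = W_yh @ hs[-1] - target_t
        loss += 0.5 * (err @ err) / len(xs)
    return loss, hs

def bptt(xs, targets, W_xh, W_hh, W_yh):
    """Gradients of the loss with respect to the three shared weight matrices."""
    T = len(xs)
    _, hs = forward(xs, targets, W_xh, W_hh, W_yh)
    dW_xh, dW_hh, dW_yh = (np.zeros_like(W) for W in (W_xh, W_hh, W_yh))
    dh_next = np.zeros(W_hh.shape[0])            # gradient flowing back from step t+1
    for t in reversed(range(T)):
        h_t, h_prev = hs[t + 1], hs[t]
        dy = (W_yh @ h_t - targets[t]) / T       # dL/dy_t
        dW_yh += np.outer(dy, h_t)
        dh = W_yh.T @ dy + dh_next               # total dL/dh_t
        da = dh * (1.0 - h_t ** 2)               # back through tanh
        dW_xh += np.outer(da, xs[t])             # accumulate into shared weights
        dW_hh += np.outer(da, h_prev)
        dh_next = W_hh.T @ da                    # pass gradient on to step t-1
    return dW_xh, dW_hh, dW_yh

rng = np.random.default_rng(0)
T, d, n, q = 6, 3, 4, 2
xs, targets = rng.normal(size=(T, d)), rng.normal(size=(T, q))
W_xh, W_hh, W_yh = (rng.normal(scale=0.5, size=s) for s in [(n, d), (n, n), (q, n)])
dW_xh, dW_hh, dW_yh = bptt(xs, targets, W_xh, W_hh, W_yh)

# Central finite differences for W_hh as an independent check
eps, num_grad = 1e-6, np.zeros_like(W_hh)
for i in range(n):
    for j in range(n):
        W_plus, W_minus = W_hh.copy(), W_hh.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        num_grad[i, j] = (forward(xs, targets, W_xh, W_plus, W_yh)[0]
                          - forward(xs, targets, W_xh, W_minus, W_yh)[0]) / (2 * eps)
print("max abs difference:", np.max(np.abs(num_grad - dW_hh)))
```

In practice automatic differentiation (as in PyTorch or TensorFlow) performs these accumulations for us; the explicit loop is only meant to show where the sum over time steps comes from.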

## Truncated BPTT and Gradient Clipping

1. Truncated BPTT: Instead of backpropagating through all $T$ steps, we may backpropagate through a fixed window of length $\tau$. This approximates the full gradient and reduces computation.

2. Concretely, one computes gradients up to $\tau$ steps and treats gradients beyond as zero. This still allows learning short-term patterns efficiently.

3. Gradient Clipping: Cap the gradient norm to a maximum value to prevent explosion. For example, in PyTorch the call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) rescales the gradients so that $\|\nabla\|\le 1$.

4. These techniques help stabilize training, but the fundamental vanishing gradient problem motivates using alternative RNN cells (LSTM/GRU) in practice (see below).
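
To illustrate what clipping by norm does, here is a minimal NumPy sketch. The helper `clip_by_global_norm` is a hypothetical stand-in written for this example; it follows the same idea as the PyTorch utility (rescale all gradients by a common factor) but is not its actual implementation.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm is <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # never scale gradients up
    return [g * scale for g in grads], total

rng = np.random.default_rng(0)
# Two artificially large "gradients", e.g. for W_xh and W_hh
grads = [rng.normal(scale=10.0, size=(5, 3)), rng.normal(scale=10.0, size=(5, 5))]

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"norm before: {norm_before:.2f}, norm after: {norm_after:.2f}")
```

Because every array is multiplied by the same factor, the direction of the update is preserved and only its length is capped. Gradients whose norm is already below the threshold pass through unchanged.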

## Applications of Simple RNNs

1. Forecasting: RNNs can predict future values from historical data. Example tasks include stock prices, weather patterns, or any temporal signal.

2. By feeding in sequence $\{x_1,x_2,\dots,x_T\}$, an RNN can output a prediction $y_T$ (one-step ahead) or even a full sequence $\{y_2,\dots,y_{T+1}\}$.

3. Unlike linear models, RNNs can capture complex temporal patterns (trends, seasonality, autocorrelation) in a data-driven way.

4. Preprocessing (normalization, sliding windows) is important. Split data into train/test by time (no shuffling).
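
The preprocessing point can be sketched as follows, assuming a synthetic signal and a window length of 20 (both chosen only for illustration). The series is split by time first, normalized with statistics from the training part only, and then cut into sliding windows, so no test window overlaps the training data.

```python
import numpy as np

def make_windows(series, seq_length):
    """Turn a 1D series into (inputs, next-value targets) with a feature axis."""
    X = np.array([series[i:i + seq_length] for i in range(len(series) - seq_length)])
    y = series[seq_length:]
    return X[..., None], y[..., None]

# Synthetic signal with an offset and a scale (assumed for this example)
series = 3.0 * np.sin(np.linspace(0, 100, 500)) + 10.0

# 1. Chronological split, no shuffling
split = int(0.8 * len(series))
train, test = series[:split], series[split:]

# 2. Normalize with training statistics only
mu, sigma = train.mean(), train.std()
train_n, test_n = (train - mu) / sigma, (test - mu) / sigma

# 3. Sliding windows within each part
X_train, y_train = make_windows(train_n, seq_length=20)
X_test, y_test = make_windows(test_n, seq_length=20)
print(X_train.shape, y_train.shape)   # (380, 20, 1) (380, 1)
print(X_test.shape, y_test.shape)     # (80, 20, 1) (80, 1)
```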

## Sequence Modeling Tasks
1. Many-to-One: Classify or predict one value from an entire sequence (e.g., sentiment analysis of a movie review, or classifying a time series). We use the final hidden state as a summary of the sequence.

2. Many-to-Many (Prediction): Predict an output at each time step (e.g., language modeling or sequential regression). RNN outputs are used at each step.

3. Encoder–Decoder (Seq2Seq): (Advanced) Map input sequences to output sequences of different lengths. Though typically LSTM-based, it is conceptually possible with simple RNNs.

4. RNNs also apply to physics and biology: e.g., modeling dynamical systems, protein sequences, or neuroscience time series. Any domain with sequential data can use RNN-based modeling.

## Other Sequence Applications
1. Sequence Classification: Use RNN hidden state for class labels. For example, classify a time series into anomaly vs normal.

2. Sequence Labeling: Predict labels at each time step (e.g. part-of-speech tagging). The RNN outputs a vector at each step passed through a classification layer.

3. Language and Text: (Advanced) Character or word-level models use RNNs to generate text or classify documents. E.g., predicting next character from previous ones (RNN language model).

4. Physically Motivated Data: RNNs can model dynamical systems (e.g., rolling ball trajectories, neuron spikes over time, climate data). They learn temporal patterns directly from data without explicit equations.

## Training and Practical Tips
1. Loss Functions: Use MSE for regression tasks, cross-entropy for classification tasks. Sum or average losses over time steps as needed.

2. Batching Sequences: Handle variable-length sequences by padding or using masking. PyTorch pack_padded_sequence or Keras masking can help.

3. Optimization: Standard optimizers (SGD, Adam) work. Learning rate may need tuning due to sequential correlations.

4. Initial Hidden State: Usually initialized to zeros. Can also learn an initial state or carry hidden state across batches for very long sequences (stateful=True in Keras).

5. Regularization: Dropout can be applied to inputs or recurrent states (PyTorch's nn.RNN has a dropout option that acts between stacked layers; Keras has dropout/recurrent_dropout).
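
To make the batching tip concrete, the sketch below pads variable-length sequences by hand and averages a loss over real time steps only. It is a plain NumPy illustration of the idea behind padding and masking; the helper names are hypothetical and it does not use the PyTorch or Keras utilities mentioned above.

```python
import numpy as np

def pad_sequences(seqs, pad_value=0.0):
    """Right-pad 1D sequences to a common length; return the batch and a 0/1 mask."""
    T_max = max(len(s) for s in seqs)
    batch = np.full((len(seqs), T_max), pad_value)
    mask = np.zeros((len(seqs), T_max))
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = 1.0        # 1 marks a real time step, 0 marks padding
    return batch, mask

def masked_mse(pred, target, mask):
    """Mean squared error over real (unpadded) time steps only."""
    return np.sum(mask * (pred - target) ** 2) / np.sum(mask)

seqs = [np.array([1.0, 2.0, 3.0]), np.array([4.0]), np.array([5.0, 6.0])]
targets, mask = pad_sequences(seqs)
pred = np.ones_like(targets)          # placeholder predictions, all equal to 1

print(mask)
print(masked_mse(pred, targets, mask))   # 55/6, averaged over the 6 real steps
```

Without the mask, the padded zeros would contribute to the loss and bias the model toward predicting the pad value.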

## Limitations and Considerations
1. Vanishing Gradients: Simple RNNs have fundamental difficulty learning long-term dependencies due to gradient decay.

2. Capacity: Without gates, RNNs may struggle with tasks requiring remembering far-back inputs. Training can be slow as it’s inherently sequential.

3. Alternatives: In practice, gated RNNs (LSTM/GRU) or Transformers are often used for long-range dependencies. However, simple RNNs are still instructive and sometimes sufficient for short sequences.

4. Regularization: Weight decay or dropout (on inputs/states) can help generalization but must be applied carefully due to temporal correlations.

5. Statefulness: For very long sequences, one can preserve hidden state across batches (stateful RNN) to avoid resetting memory.
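
The gradient decay in the first point can be observed numerically. The sketch below tracks the Jacobian $\partial \mathbf{h}_t/\partial \mathbf{h}_0$ of a tanh RNN, which is a product of factors $\mathrm{diag}(1-\mathbf{h}_t^2)\,\mathbf{W}_{hh}$. For simplicity it assumes no external input and a recurrent matrix rescaled to spectral norm 0.9; both are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim, T = 16, 50
W_hh = rng.normal(size=(h_dim, h_dim))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)     # rescale to spectral norm 0.9

h = rng.normal(size=h_dim)                # initial state h_0
J = np.eye(h_dim)                         # accumulates d h_t / d h_0
norms = []
for t in range(T):
    h = np.tanh(W_hh @ h)                 # input-free state update
    J = np.diag(1.0 - h ** 2) @ W_hh @ J  # chain rule for one more step
    norms.append(np.linalg.norm(J, 2))

print(f"step 1: {norms[0]:.3e}, step {T}: {norms[-1]:.3e}")
```

Each factor has spectral norm at most 0.9, so the product is bounded by $0.9^t$ and shrinks geometrically. With a larger recurrent matrix the same product can instead grow, which is the exploding-gradient case that clipping addresses.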

## PyTorch RNN Time Series Example

We first implement a simple RNN in PyTorch to forecast a univariate
time series (a sine wave). The steps are: (1) generate synthetic data
and form input/output sequences; (2) define an nn.RNN model; (3) train
the model with MSE loss and an optimizer; (4) evaluate on a held-out
test set. From the sine wave we
create sliding windows of length seq_length. The code below shows each
step. We use nn.RNN (the basic recurrent layer) followed by a linear
output layer. The training loop (with MSELoss and Adam) updates the model to
minimize the prediction error.

In [3]:
import numpy as np
import torch
from torch import nn, optim

# 1. Data preparation: generate a sine wave and create input-output sequences
time_steps = np.linspace(0, 100, 500)
data = np.sin(time_steps)                   # shape (500,)
seq_length = 20
X, y = [], []
for i in range(len(data) - seq_length):
    X.append(data[i:i+seq_length])         # sequence of length seq_length
    y.append(data[i+seq_length])           # next value to predict
X = np.array(X)                            # shape (480, seq_length)
y = np.array(y)                            # shape (480,)
# Add feature dimension (1) for the RNN input
X = X[..., None]                           # shape (480, seq_length, 1)
y = y[..., None]                           # shape (480, 1)

# Split into train/test sets (80/20 split)
train_size = int(0.8 * len(X))
X_train = torch.tensor(X[:train_size], dtype=torch.float32)
y_train = torch.tensor(y[:train_size], dtype=torch.float32)
X_test  = torch.tensor(X[train_size:],  dtype=torch.float32)
y_test  = torch.tensor(y[train_size:],  dtype=torch.float32)

# 2. Model definition: simple RNN followed by a linear layer
class SimpleRNNModel(nn.Module):
    def __init__(self, input_size=1, hidden_size=16, num_layers=1):
        super(SimpleRNNModel, self).__init__()
        # nn.RNN for sequential data (batch_first=True expects (batch, seq_len, features))
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)    # output layer for prediction

    def forward(self, x):
        out, _ = self.rnn(x)                 # out: (batch, seq_len, hidden_size)
        out = out[:, -1, :]                  # take output of last time step
        return self.fc(out)                 # linear layer to 1D output

model = SimpleRNNModel(input_size=1, hidden_size=16, num_layers=1)
print(model)  # print model summary (structure)

Model Explanation: Here input_size=1 because each time step has one
feature. The RNN hidden state has size 16, and batch_first=True means
input tensors have shape (batch, seq_len, features). We take the last
RNN output and feed it through a linear layer to predict the next
value.

In [4]:
# 3. Training loop: MSE loss and Adam optimizer
criterion = nn.MSELoss()                  # mean squared error loss
optimizer = optim.Adam(model.parameters(), lr=0.01)

epochs = 50
for epoch in range(1, epochs+1):
    model.train()
    optimizer.zero_grad()
    output = model(X_train)               # forward pass
    loss = criterion(output, y_train)     # compute training loss
    loss.backward()                       # backpropagate
    optimizer.step()                      # update weights
    if epoch % 10 == 0:
        print(f'Epoch {epoch}/{epochs}, Loss: {loss.item():.4f}')

Training Details: We train for 50 epochs, printing the training loss
every 10 epochs. Each epoch is one full-batch gradient step on the
whole training set. As training proceeds, the loss (MSE) typically
decreases, indicating that the RNN is learning the sine-wave pattern.

In [5]:
# 4. Evaluation on test set
model.eval()
with torch.no_grad():
    pred = model(X_test)
    test_loss = criterion(pred, y_test)
print(f'Test Loss: {test_loss.item():.4f}')

# (Optional) View a few actual vs. predicted values
print("Actual:", y_test[:5].flatten().numpy())
print("Pred : ", pred[:5].flatten().numpy())

Evaluation: We switch to eval mode and compute the loss on the test
set. A low test loss indicates that the model generalizes well. The
code prints a few sample predictions against actual values for
qualitative assessment.

## TensorFlow (Keras) RNN Time Series Example

Next, we use TensorFlow/Keras to do the same task. We build a
tf.keras.Sequential model with a SimpleRNN layer (the most basic
recurrent layer) followed by a Dense output layer. The workflow is
similar: create the same synthetic sine data and split it into
train/test sets; then define, train, and evaluate the model.

In [6]:
import numpy as np
import tensorflow as tf

# 1. Data preparation: same sine wave data and sequences as above
time_steps = np.linspace(0, 100, 500)
data = np.sin(time_steps)                     # (500,)
seq_length = 20
X, y = [], []
for i in range(len(data) - seq_length):
    X.append(data[i:i+seq_length])
    y.append(data[i+seq_length])
X = np.array(X)                               # (480, seq_length)
y = np.array(y)                               # (480,)
# reshape for RNN: (samples, timesteps, features)
X = X.reshape(-1, seq_length, 1)             # (480, 20, 1)
y = y.reshape(-1, 1)                         # (480, 1)

# Split into train/test (80/20)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

Data: We use the same sine-wave sequence and sliding-window split as
in the PyTorch example. The arrays are reshaped to (batch,
timesteps, features) for Keras.

In [7]:
# 2. Model definition: Keras SimpleRNN and Dense
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(16, input_shape=(seq_length, 1)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')   # MSE loss and Adam optimizer
model.summary()

Explanation: Here SimpleRNN(16) creates 16 recurrent units. The model
summary shows the shapes and number of parameters. (Keras handles the
sequence dimension internally.)
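
As a quick cross-check of the parameter counts one would expect in such a summary, the numbers can be computed by hand. The helper functions below are written for this illustration and assume that each layer has a bias vector, which is the Keras default.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + biases of one simple recurrent layer."""
    return units * input_dim + units * units + units

def dense_params(input_dim, units):
    """Weights + biases of a fully connected layer."""
    return input_dim * units + units

rnn = simple_rnn_params(input_dim=1, units=16)   # 16 + 256 + 16
dense = dense_params(input_dim=16, units=1)      # 16 + 1
print(rnn, dense, rnn + dense)                   # 288 17 305
```

Note that the count does not depend on the sequence length, since the same weights are reused at every time step.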

In [8]:
# 3. Training
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,    # use 20% of train data for validation
    verbose=1
)

Training: We train for 50 epochs. The fit call also reports validation
loss (using a 20% split of the training data) to monitor
generalization.

In [9]:
# 4. Evaluation on test set
test_loss = model.evaluate(X_test, y_test, verbose=0)
print(f'Test Loss: {test_loss:.4f}')

# (Optional) Predictions
predictions = model.predict(X_test)
print("Actual:", y_test.flatten()[:5])
print("Pred : ", predictions.flatten()[:5])

Evaluation: After training, we call model.evaluate on the test set. A
low test loss indicates good forecasting accuracy. We also predict and
compare a few samples of actual vs. predicted values. This completes
the simple RNN forecasting example in TensorFlow.

Both examples use only basic RNN cells (no LSTM/GRU) and include data
preparation, model definition, training loop, and evaluation. The
PyTorch code uses nn.RNN and the Keras
code uses the SimpleRNN layer. Each example is self-contained when its
cells are run in order, and needs only standard libraries (NumPy, PyTorch
or TensorFlow).

## The mathematics of RNNs, the basic architecture

See notebook at <https://github.com/CompPhysics/AdvancedMachineLearning/blob/main/doc/pub/week7/ipynb/rnnmath.ipynb>

## Gating mechanism: Long Short Term Memory (LSTM)

Besides a simple recurrent neural network layer, as discussed above, there are two other
commonly used types of recurrent neural network layers: Long Short
Term Memory (LSTM) and Gated Recurrent Unit (GRU).  For a short
introduction to these layers see <https://medium.com/mindboard/lstm-vs-gru-experimental-comparison-955820c21e8b>.

An LSTM uses a memory cell to model long-range dependencies and to avoid
vanishing-gradient problems. It can capture longer-term dependencies
because gates control the flow of information into and out of the
memory cell.

1. Introduced by Hochreiter and Schmidhuber (1997) who solved the problem of getting an RNN to remember things for a long time (like hundreds of time steps).

2. They designed a memory cell using logistic and linear units with multiplicative interactions.

3. Information gets into the cell whenever its **write** gate is on.

4. The information stays in the cell so long as its **keep** gate is on.

5. Information can be read from the cell by turning on its **read** gate. 

LSTMs were first introduced to overcome the vanishing gradient problem.

## Implementing a memory cell in a neural network

To preserve information for a long time in
the activities of an RNN, we use a circuit
that implements an analog memory cell.

1. A linear unit that has a self-link with a weight of 1 will maintain its state.

2. Information is stored in the cell by activating its write gate.

3. Information is retrieved by activating the read gate.

4. We can backpropagate through this circuit because logistic units have well-behaved derivatives.
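
A toy version of such a memory cell can be written in a few lines. This is an idealized sketch with hand-set gate values in $[0,1]$; in an actual LSTM the gates are computed by logistic units from the input and the hidden state. The function name and the schedule are made up for the example.

```python
def memory_step(cell, x, write, keep, read):
    """One step of an idealized analog memory cell with gates in [0, 1]."""
    cell = keep * cell + write * x     # self-link of weight 1, scaled by the keep gate
    return cell, read * cell           # the read gate controls what leaves the cell

cell = 0.0
# (input, write, keep, read) at each time step
schedule = [
    (1.7, 1.0, 0.0, 0.0),    # write gate on: store 1.7
    (0.3, 0.0, 1.0, 0.0),    # keep gate on: ignore the input, hold the value
    (-2.0, 0.0, 1.0, 0.0),   # still holding
    (0.9, 0.0, 1.0, 1.0),    # read gate on: the stored value is output
]
outputs = []
for x, write, keep, read in schedule:
    cell, out = memory_step(cell, x, write, keep, read)
    outputs.append(out)
print(outputs)   # [0.0, 0.0, 0.0, 1.7]
```

The value written at the first step survives the intermediate inputs unchanged and is only exposed when the read gate opens.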

## LSTM details

The LSTM is a unit cell that is made of three gates:
1. the input gate,

2. the forget gate,

3. and the output gate.

It also introduces a cell state $c$, which can be thought of as the
long-term memory, and a hidden state $h$ which can be thought of as
the short-term memory.

## Basic layout (All figures from Raschka *et al.*)

<!-- dom:FIGURE: [figslides/LSTM1.png, width=700 frac=1.0] -->
<!-- begin figure -->

<img src="figslides/LSTM1.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## LSTM details

The first stage is called the forget gate, where we combine the input
at (say) time $t$ with the hidden state from time $t-1$ and pass the
result through the Sigmoid activation function. The gate output is then
multiplied element-wise, denoted by $\odot$, with the previous cell state.

Mathematically we have (see also figure below)

$$
\mathbf{f}^{(t)} = \sigma(W_{fx}\mathbf{x}^{(t)} + W_{fh}\mathbf{h}^{(t-1)} + \mathbf{b}_f)
$$

where the $W$s are the weights to be trained.

## Comparing with a standard  RNN

<!-- dom:FIGURE: [figslides/LSTM2.png, width=700 frac=1.0] -->
<!-- begin figure -->

<img src="figslides/LSTM2.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## LSTM details I

<!-- dom:FIGURE: [figslides/LSTM3.png, width=700 frac=1.0] -->
<!-- begin figure -->

<img src="figslides/LSTM3.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## LSTM details II

<!-- dom:FIGURE: [figslides/LSTM4.png, width=700 frac=1.0] -->
<!-- begin figure -->

<img src="figslides/LSTM4.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## LSTM details III

<!-- dom:FIGURE: [figslides/LSTM5.png, width=700 frac=1.0] -->
<!-- begin figure -->

<img src="figslides/LSTM5.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Forget gate

<!-- dom:FIGURE: [figslides/LSTM6.png, width=700 frac=1.0] -->
<!-- begin figure -->

<img src="figslides/LSTM6.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## The forget gate

The name forget gate stems from the fact that the Sigmoid activation function's
outputs are very close to $0$ if the argument for the function is very
negative, and $1$ if the argument is very positive. Hence we can
control the amount of information we want to take from the long-term
memory.

$$
\mathbf{f}^{(t)} = \sigma(W_{fx}\mathbf{x}^{(t)} + W_{fh}\mathbf{h}^{(t-1)} + \mathbf{b}_f)
$$

where the $W$s are the weights to be trained.

## Basic layout

<!-- dom:FIGURE: [figslides/LSTM7.png, width=700 frac=1.0] -->
<!-- begin figure -->

<img src="figslides/LSTM7.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Input gate

The next stage is the input gate, which consists of both a Sigmoid
function ($\sigma_i$), which decides what percentage of the input will
be stored in the long-term memory, and a $\tanh$ function, which
proposes the candidate values that can be stored in the long-term
memory. These two results are multiplied together, and the product
is added to the cell state, that is, stored in the long-term memory.
The addition is denoted by $\oplus$.

We have

$$
\mathbf{i}^{(t)} = \sigma(W_{ix}\mathbf{x}^{(t)} + W_{ih}\mathbf{h}^{(t-1)} + \mathbf{b}_i),
$$

and

$$
\mathbf{g}^{(t)} = \tanh(W_{gx}\mathbf{x}^{(t)} + W_{gh}\mathbf{h}^{(t-1)} + \mathbf{b}_g),
$$

again the $W$s are the weights to train.

## Short summary

<!-- dom:FIGURE: [figslides/LSTM8.png, width=700 frac=1.0] -->
<!-- begin figure -->

<img src="figslides/LSTM8.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Forget and input

The forget gate and the input gate together update the cell state with the following equation,

$$
\mathbf{c}^{(t)} = \mathbf{f}^{(t)} \odot \mathbf{c}^{(t-1)} + \mathbf{i}^{(t)} \odot \mathbf{g}^{(t)},
$$

where $\mathbf{f}^{(t)}$ and $\mathbf{i}^{(t)}$ are the outputs of the forget gate and the input gate, respectively, $\mathbf{g}^{(t)}$ is the candidate value, and $\odot$ denotes element-wise multiplication.

## Basic layout

<!-- dom:FIGURE: [figslides/LSTM9.png, width=700 frac=1.0] -->
<!-- begin figure -->

<img src="figslides/LSTM9.png" width="700"><p style="font-size: 0.9em"><i>Figure 1: </i></p>
<!-- end figure -->

## Output gate

The final stage of the LSTM is the output gate, and its purpose is to
update the short-term memory.  To achieve this, we take the newly
generated long-term memory and process it through a hyperbolic tangent
($\tanh$) function creating a potential new short-term memory. We then
multiply this potential memory by the output of the Sigmoid function
($\sigma_o$). This multiplication generates the final output as well
as the hidden state ($\mathbf{h}^{(t)}$) that is passed on to the next
time step.

We have

$$
\begin{aligned}
\mathbf{o}^{(t)} &= \sigma(W_{ox}\mathbf{x}^{(t)} + W_{oh}\mathbf{h}^{(t-1)} + \mathbf{b}_o), \\
\mathbf{h}^{(t)} &= \mathbf{o}^{(t)} \odot \tanh(\mathbf{c}^{(t)}),
\end{aligned}
$$

where $W_{ox}$ and $W_{oh}$ are the weights of the output gate and $\mathbf{b}_o$ is the bias of the output gate.
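
Putting the forget, input and output gates together, one LSTM time step can be sketched in NumPy as below. The parameter layout (a dictionary with one triple of input weights, recurrent weights and bias per gate), the sizes and the random initialization are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params[k] = (W_x, W_h, b) for k in 'f', 'i', 'g', 'o'."""
    def affine(k):
        W_x, W_h, b = params[k]
        return W_x @ x + W_h @ h_prev + b
    f = sigmoid(affine("f"))        # forget gate
    i = sigmoid(affine("i"))        # input gate
    g = np.tanh(affine("g"))        # candidate values
    o = sigmoid(affine("o"))        # output gate
    c = f * c_prev + i * g          # new cell state (long-term memory)
    h = o * np.tanh(c)              # new hidden state (short-term memory)
    return h, c

rng = np.random.default_rng(0)
d, n = 4, 6                         # input and hidden sizes (illustrative)
params = {k: (rng.normal(scale=0.3, size=(n, d)),
              rng.normal(scale=0.3, size=(n, n)),
              np.zeros(n)) for k in "figo"}

h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(10, d)):  # a sequence of 10 time steps
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)             # (6,) (6,)
```

The element-wise products in the two last lines of `lstm_step` are the gating operations of the cell-state and hidden-state updates. Library implementations typically fuse the four affine maps into one matrix multiplication for speed.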

## Summary of LSTM

LSTMs provide a basic approach for modeling long-range dependencies in sequences.
If you wish to read more, see **An Empirical Exploration of Recurrent Network Architectures**, authored
by Rafal Jozefowicz *et al.*, Proceedings of ICML, 2342-2350, 2015.

An important recent development is the so-called **gated recurrent unit (GRU)**, see for example the article
by Junyoung Chung *et al.* at <https://arxiv.org/abs/1412.3555>.
This article is an excellent read if you are interested in learning
more about these modern RNN architectures.

GRUs have a simpler
architecture than LSTMs. This leads to computationally more efficient methods, while their
performance in some tasks, such as polyphonic music modeling, is comparable to that of LSTMs.
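
One rough way to quantify "simpler" is to count parameters. The sketch below assumes the textbook formulations, in which an LSTM has four affine maps of $[\mathbf{x}, \mathbf{h}]$ (forget, input, candidate, output) and a GRU has three (update, reset, candidate), each with a single bias vector. Some library implementations carry extra bias terms, so reported counts may differ slightly. The sizes 28 and 128 are chosen only as an example.

```python
def gated_params(input_dim, units, n_blocks):
    """Parameters of n_blocks affine maps from [x, h] to units, each with one bias."""
    return n_blocks * (units * (input_dim + units) + units)

input_dim, units = 28, 128
lstm = gated_params(input_dim, units, n_blocks=4)   # f, i, g, o
gru = gated_params(input_dim, units, n_blocks=3)    # update, reset, candidate
print(lstm, gru, gru / lstm)                        # 80384 60288 0.75
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width.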

## LSTM implementation using TensorFlow

In [10]:
"""
Key points:
1. The input images (28x28 pixels) are treated as sequences of 28 timesteps with 28 features each
2. The LSTM layer processes this sequential data
3. A final dense layer with softmax activation handles the classification
4. Typical accuracy ranges between 95-98% (lower than CNNs but reasonable for demonstration)

Note: LSTMs are not typically used for image classification (CNNs are more efficient), but this demonstrates how to adapt them for such tasks. Training might take longer compared to CNN architectures.

To improve performance, you could:
1. Add more LSTM layers
2. Use Bidirectional LSTMs
3. Increase the number of units
4. Add dropout for regularization
5. Use learning rate scheduling
"""

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical

# Load and preprocess data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Reshape data for LSTM (samples, timesteps, features)
# MNIST images are 28x28, so we treat each image as 28 timesteps of 28 features
x_train = x_train.reshape((-1, 28, 28))
x_test = x_test.reshape((-1, 28, 28))

# Convert labels to one-hot encoding
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Build LSTM model
model = Sequential()
model.add(LSTM(128, input_shape=(28, 28)))  # 128 LSTM units
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

# Display model summary
model.summary()

# Train the model
history = model.fit(x_train, y_train,
                   batch_size=64,
                   epochs=10,
                   validation_split=0.2)

# Evaluate on test data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f'\nTest accuracy: {test_acc:.4f}')

## And the corresponding one with PyTorch

In [11]:
"""
Key components:
1. **Data Handling**: Uses PyTorch DataLoader with MNIST dataset
2. **LSTM Architecture**:
  - Input sequence of 28 timesteps (image rows)
  - 128 hidden units in LSTM layer
  - Fully connected layer for classification
3. **Training**:
  - Cross-entropy loss
  - Adam optimizer
  - Automatic GPU utilization if available

This implementation typically achieves **97-98% accuracy** after 10 epochs. The main differences from the TensorFlow/Keras version:
- Explicit device management (CPU/GPU)
- Manual training loop
- Different data loading pipeline
- More explicit tensor reshaping

To improve performance, you could:
1. Add dropout regularization
2. Use bidirectional LSTM
3. Implement learning rate scheduling
4. Add batch normalization
5. Increase model capacity (more layers/units)
"""

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Hyperparameters
input_size = 28     # Number of features (pixels per row)
hidden_size = 128   # LSTM hidden state size
num_classes = 10    # Digits 0-9
num_epochs = 10     # Number of passes (epochs) over the training set
batch_size = 64     # Batch size
learning_rate = 0.001

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# MNIST dataset
transform = transforms.Compose([
   transforms.ToTensor(),
   transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])

train_dataset = datasets.MNIST(root='./data',
                              train=True,
                              transform=transform,
                              download=True)

test_dataset = datasets.MNIST(root='./data',
                             train=False,
                             transform=transform)

train_loader = DataLoader(dataset=train_dataset,
                         batch_size=batch_size,
                         shuffle=True)

test_loader = DataLoader(dataset=test_dataset,
                        batch_size=batch_size,
                        shuffle=False)

# LSTM model
class LSTMModel(nn.Module):
   def __init__(self, input_size, hidden_size, num_classes):
       super(LSTMModel, self).__init__()
       self.hidden_size = hidden_size
       self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
       self.fc = nn.Linear(hidden_size, num_classes)

   def forward(self, x):
       # Reshape input to (batch_size, sequence_length, input_size)
       x = x.reshape(-1, 28, 28)

       # Forward propagate LSTM
       out, _ = self.lstm(x)  # out: (batch_size, seq_length, hidden_size)

       # Decode the hidden state of the last time step
       out = out[:, -1, :]
       out = self.fc(out)
       return out

# Initialize model
model = LSTMModel(input_size, hidden_size, num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
total_step = len(train_loader)
for epoch in range(num_epochs):
   model.train()
   for i, (images, labels) in enumerate(train_loader):
       images = images.to(device)
       labels = labels.to(device)

       # Forward pass
       outputs = model(images)
       loss = criterion(outputs, labels)

       # Backward and optimize
       optimizer.zero_grad()
       loss.backward()
       optimizer.step()

       if (i+1) % 100 == 0:
           print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')

   # Test the model
   model.eval()
   with torch.no_grad():
       correct = 0
       total = 0
       for images, labels in test_loader:
           images = images.to(device)
           labels = labels.to(device)
           outputs = model(images)
           predicted = outputs.argmax(dim=1)  # index of the largest logit
           total += labels.size(0)
           correct += (predicted == labels).sum().item()

       print(f'Test Accuracy: {100 * correct / total:.2f}%')

print('Training finished.')

## Dynamical ordinary differential equation

Let us illustrate how we could train an RNN using data from the
solution of a well-known differential equation, namely Newton's
equation for oscillatory motion for an object being forced into
harmonic oscillations by an applied external force.
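
For reference, the dimensionless equation of motion, read off from the force function used in the code of these notes, can be written as follows. The symbols follow the code: $\gamma$ is the damping, $\tilde{\Omega}$ the driving frequency and $\tilde{F}$ the driving amplitude.

$$
\frac{d^2x}{dt^2}+2\gamma\frac{dx}{dt}+x=\tilde{F}\cos{(\tilde{\Omega}t)},
$$

or, as a system of two first-order equations for the position $x$ and the velocity $v$,

$$
\frac{dx}{dt}=v,\qquad \frac{dv}{dt}=-2\gamma v-x+\tilde{F}\cos{(\tilde{\Omega}t)}.
$$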

We will start with the basic algorithm for solving this type of
equation using the fourth-order Runge-Kutta (RK4) method. The first code example is
a standalone differential equation solver. It yields positions and
velocities as functions of time, starting at an initial time $t_0$
and ending at a final time.
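
As a compact illustration of the method, the following sketch implements one RK4 step for a general system $dy/dt=f(t,y)$ and applies it to the oscillator. The names `rk4_step` and `oscillator_rhs` are our own choices. The solver in these notes multiplies the step length into each stage, which is an equivalent way of writing the same update.

```python
import numpy as np

def rk4_step(f, t, y, dt):
    """Advance y(t) to y(t + dt) with the classical fourth-order Runge-Kutta method."""
    k1 = f(t, y)
    k2 = f(t + 0.5 * dt, y + 0.5 * dt * k1)
    k3 = f(t + 0.5 * dt, y + 0.5 * dt * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def oscillator_rhs(t, y, gamma=0.2, Omegatilde=0.5, Ftilde=1.0):
    """Right-hand side for y = (x, v): returns (dx/dt, dv/dt)."""
    x, v = y
    return np.array([v, -2.0 * gamma * v - x + Ftilde * np.cos(Omegatilde * t)])

# Integrate from t = 0 to t = 1 with initial conditions x = 1, v = 0
y = np.array([1.0, 0.0])
t, dt = 0.0, 0.01
for _ in range(100):
    y = rk4_step(oscillator_rhs, t, y, dt)
    t += dt

print(f"t = {t:.2f}, x = {y[0]:.6f}, v = {y[1]:.6f}")
```

Writing the step for a generic right-hand side `f` makes it easy to test the integrator on a problem with a known solution, for example $dy/dt=-y$, before applying it to the oscillator.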

The data the program produces will in turn be used to train an RNN on
a selected subset of the data. With a trained RNN, we will then
use the network to make predictions for data not included in the
training. That is, we will train a model which should be able to
reproduce velocities and positions not included in the training data.
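
One possible way of setting aside data not included in the training is a chronological split, where the first part of the time series is used for training and the remainder is held out. The sketch below is an assumption about how this could be done, not a description of the code in these notes. The function name `chronological_split`, the 80/20 ratio and the cosine stand-in signal are all hypothetical.

```python
import numpy as np

def chronological_split(series, train_fraction=0.8):
    """Split a time series into an early training part and a later test part."""
    n_train = int(train_fraction * len(series))
    return series[:n_train], series[n_train:]

# Stand-in signal; in practice this would be the RK4 solution x(t)
t = np.linspace(0.0, 20.0, 2001)
x = np.cos(t)

x_train, x_test = chronological_split(x, train_fraction=0.8)
print(len(x_train), len(x_test))
```

Unlike a random split, this keeps the time ordering intact, so that the held-out points lie strictly after the training points.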

## The Runge-Kutta-4 code

In [12]:
%matplotlib inline

import numpy as np
import pandas as pd
from math import ceil, cos
import matplotlib.pyplot as plt
import os

# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"

if not os.path.exists(PROJECT_ROOT_DIR):
    os.mkdir(PROJECT_ROOT_DIR)

if not os.path.exists(FIGURE_ID):
    os.makedirs(FIGURE_ID)

if not os.path.exists(DATA_ID):
    os.makedirs(DATA_ID)

def image_path(fig_id):
    return os.path.join(FIGURE_ID, fig_id)

def data_path(dat_id):
    return os.path.join(DATA_ID, dat_id)

def save_fig(fig_id):
    plt.savefig(image_path(fig_id) + ".png", format='png')


def SpringForce(v,x,t):
#   note here that we have divided by mass and we return the acceleration
    return  -2*gamma*v-x+Ftilde*cos(t*Omegatilde)


def RK4(v,x,t,n,Force):
    for i in range(n-1):
# Setting up k1
        k1x = DeltaT*v[i]
        k1v = DeltaT*Force(v[i],x[i],t[i])
# Setting up k2
        vv = v[i]+k1v*0.5
        xx = x[i]+k1x*0.5
        k2x = DeltaT*vv
        k2v = DeltaT*Force(vv,xx,t[i]+DeltaT*0.5)
# Setting up k3
        vv = v[i]+k2v*0.5
        xx = x[i]+k2x*0.5
        k3x = DeltaT*vv
        k3v = DeltaT*Force(vv,xx,t[i]+DeltaT*0.5)
# Setting up k4
        vv = v[i]+k3v
        xx = x[i]+k3x
        k4x = DeltaT*vv
        k4v = DeltaT*Force(vv,xx,t[i]+DeltaT)
# Final result
        x[i+1] = x[i]+(k1x+2*k2x+2*k3x+k4x)/6.
        v[i+1] = v[i]+(k1v+2*k2v+2*k3v+k4v)/6.
        t[i+1] = t[i] + DeltaT


# Main part begins here

DeltaT = 0.001
# Final time and number of time steps
tfinal = 20 # in dimensionless time
n = ceil(tfinal/DeltaT)
# set up arrays for t, v, and x
t = np.zeros(n)
v = np.zeros(n)
x = np.zeros(n)
# Initial conditions (can change to more than one dim)
x0 =  1.0 
v0 = 0.0
x[0] = x0
v[0] = v0
gamma = 0.2
Omegatilde = 0.5
Ftilde = 1.0
# Start integrating using the RK4 method
# Note that we define the force function as a SpringForce
RK4(v,x,t,n,SpringForce)

# Plot position as function of time    
fig, ax = plt.subplots()
ax.set_ylabel('x (dimensionless)')
ax.set_xlabel('t (dimensionless)')
ax.plot(t, x)
fig.tight_layout()
save_fig("ForcedBlockRK4")
plt.show()
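
A useful sanity check of the solver is to compare the late-time amplitude of the numerical solution with the analytical steady-state amplitude of the driven damped oscillator, $A=\tilde{F}/\sqrt{(1-\tilde{\Omega}^2)^2+(2\gamma\tilde{\Omega})^2}$. The sketch below is self-contained and uses its own helper names (`rhs`, `integrate`). The longer final time and the larger step length are assumptions, chosen so that the transient has decayed and the run stays fast.

```python
import numpy as np

gamma, Omegatilde, Ftilde = 0.2, 0.5, 1.0

def rhs(t, y):
    """Right-hand side for y = (x, v)."""
    x, v = y
    return np.array([v, -2.0 * gamma * v - x + Ftilde * np.cos(Omegatilde * t)])

def integrate(y0, dt, tfinal):
    """RK4 integration from t = 0 to tfinal; returns times and states."""
    n = int(round(tfinal / dt)) + 1
    t = dt * np.arange(n)
    y = np.zeros((n, 2))
    y[0] = y0
    for i in range(n - 1):
        k1 = rhs(t[i], y[i])
        k2 = rhs(t[i] + 0.5 * dt, y[i] + 0.5 * dt * k1)
        k3 = rhs(t[i] + 0.5 * dt, y[i] + 0.5 * dt * k2)
        k4 = rhs(t[i] + dt, y[i] + dt * k3)
        y[i + 1] = y[i] + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return t, y

t, y = integrate([1.0, 0.0], dt=0.01, tfinal=100.0)

# Use only late times, where the transient has died out
late = t > 70.0
amp_numeric = np.max(np.abs(y[late, 0]))
amp_exact = Ftilde / np.sqrt((1.0 - Omegatilde**2)**2 + (2.0 * gamma * Omegatilde)**2)

print(f"numerical amplitude:  {amp_numeric:.4f}")
print(f"analytical amplitude: {amp_exact:.4f}")
```

The late-time window covers more than two driving periods, so it contains at least one full maximum of the steady-state oscillation.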

## Using the above data to train an RNN

The code below reworks the previous example so that it generates data in a form that a recurrent neural network can be trained on.

In [13]:
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt


# Newton's equation for harmonic oscillations with external force

# Global parameters
gamma = 0.2        # damping
Omegatilde = 0.5   # driving frequency
Ftilde = 1.0       # driving amplitude

def spring_force(v, x, t):
    """
    SpringForce:
    note: divided by mass => returns acceleration
    a = -2*gamma*v - x + Ftilde*cos(Omegatilde * t)
    """
    return -2.0 * gamma * v - x + Ftilde * np.cos(Omegatilde * t)


def rk4_trajectory(DeltaT=0.001, tfinal=20.0, x0=1.0, v0=0.0):
    """
    Returns t, x, v arrays.
    """
    n = int(np.ceil(tfinal / DeltaT))

    t = np.zeros(n, dtype=np.float32)
    x = np.zeros(n, dtype=np.float32)
    v = np.zeros(n, dtype=np.float32)

    x[0] = x0
    v[0] = v0

    for i in range(n - 1):
        # k1
        k1x = DeltaT * v[i]
        k1v = DeltaT * spring_force(v[i], x[i], t[i])

        # k2
        vv = v[i] + 0.5 * k1v
        xx = x[i] + 0.5 * k1x
        k2x = DeltaT * vv
        k2v = DeltaT * spring_force(vv, xx, t[i] + 0.5 * DeltaT)

        # k3
        vv = v[i] + 0.5 * k2v
        xx = x[i] + 0.5 * k2x
        k3x = DeltaT * vv
        k3v = DeltaT * spring_force(vv, xx, t[i] + 0.5 * DeltaT)

        # k4
        vv = v[i] + k3v
        xx = x[i] + k3x
        k4x = DeltaT * vv
        k4v = DeltaT * spring_force(vv, xx, t[i] + DeltaT)

        # Update
        x[i + 1] = x[i] + (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        v[i + 1] = v[i] + (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
        t[i + 1] = t[i] + DeltaT

    return t, x, v


# Sequence generation for RNN training

def create_sequences(x, seq_len):
    """
    Given a 1D array x (e.g., position as a function of time),
    create input/target sequences for next-step prediction.

    Inputs:  [x_i, x_{i+1}, ..., x_{i+seq_len-1}]
    Targets: [x_{i+1}, ..., x_{i+seq_len}]
    """
    xs = []
    ys = []
    for i in range(len(x) - seq_len):
        seq_x = x[i : i + seq_len]
        seq_y = x[i + 1 : i + seq_len + 1]  # shifted by one step
        xs.append(seq_x)
        ys.append(seq_y)

    xs = np.array(xs, dtype=np.float32)      # shape: (num_samples, seq_len)
    ys = np.array(ys, dtype=np.float32)      # shape: (num_samples, seq_len)
    # Add feature dimension (1 feature: the position x)
    xs = np.expand_dims(xs, axis=-1)         # (num_samples, seq_len, 1)
    ys = np.expand_dims(ys, axis=-1)         # (num_samples, seq_len, 1)
    return xs, ys


class OscillatorDataset(Dataset):
    def __init__(self, seq_len=50, DeltaT=0.001, tfinal=20.0, x0=1.0, v0=0.0):
        t, x, v = rk4_trajectory(DeltaT=DeltaT, tfinal=tfinal, x0=x0, v0=v0)
        self.t = t
        self.x = x
        self.v = v
        xs, ys = create_sequences(x, seq_len=seq_len)
        self.inputs = torch.from_numpy(xs)  # (N, seq_len, 1)
        self.targets = torch.from_numpy(ys) # (N, seq_len, 1)

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]


# RNN model (LSTM-based in this example)

class RNNPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=32, num_layers=1, output_size=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        out, _ = self.lstm(x)   # out: (batch, seq_len, hidden_size)
        out = self.fc(out)      # (batch, seq_len, output_size)
        return out


# Training part

def train_model(
    seq_len=50,
    DeltaT=0.001,
    tfinal=20.0,
    batch_size=64,
    num_epochs=10,
    hidden_size=64,
    lr=1e-3,
    device=None,
):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # Dataset & DataLoader
    dataset = OscillatorDataset(seq_len=seq_len, DeltaT=DeltaT, tfinal=tfinal)
    train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Model, loss, optimizer
    model = RNNPredictor(input_size=1, hidden_size=hidden_size, output_size=1)
    model.to(device)

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Training loop
    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for batch_x, batch_y in train_loader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)

            optimizer.zero_grad()
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item() * batch_x.size(0)

        epoch_loss /= len(train_loader.dataset)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.6f}")

    return model, dataset


# Evaluation / visualization

def evaluate_and_plot(model, dataset, seq_len=50, device=None):
    if device is None:
        device = next(model.parameters()).device

    model.eval()
    with torch.no_grad():
        # Take a single sequence from the dataset
        x_seq, y_seq = dataset[0]  # shapes: (seq_len, 1), (seq_len, 1)
        x_input = x_seq.unsqueeze(0).to(device)  # (1, seq_len, 1)
        # Model prediction (next-step for whole sequence)
        y_pred = model(x_input).cpu().numpy().squeeze(-1).squeeze(0)  # (seq_len,)
        # True target
        y_true = y_seq.numpy().squeeze(-1)  # (seq_len,)
        # Plot comparison
        plt.figure()
        plt.plot(y_true, label="True x(t+Δt)", linestyle="-")
        plt.plot(y_pred, label="Predicted x(t+Δt)", linestyle="--")
        plt.xlabel("Time step in sequence")
        plt.ylabel("Position")
        plt.legend()
        plt.title("RNN next-step prediction on oscillator trajectory")
        plt.tight_layout()
        plt.show()

# Main part: set the hyperparameters, train the network and plot the result

if __name__ == "__main__":
    # Hyperparameters can be tweaked as you like
    seq_len = 50
    DeltaT = 0.001
    tfinal = 20.0
    num_epochs = 10
    batch_size = 64
    hidden_size = 64
    lr = 1e-3

    model, dataset = train_model(
        seq_len=seq_len,
        DeltaT=DeltaT,
        tfinal=tfinal,
        batch_size=batch_size,
        num_epochs=num_epochs,
        hidden_size=hidden_size,
        lr=lr,
    )

    evaluate_and_plot(model, dataset, seq_len=seq_len)
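
The evaluation above compares one-step predictions on a sequence taken from the training set. To predict beyond the data, a common approach is an autoregressive rollout, where each prediction is fed back as the newest input. The sketch below shows the idea with a generic callable `predict_next`. Both the function `rollout` and the stand-in predictor `cosine_recurrence` are hypothetical; the stand-in is an exact recurrence for a pure cosine and is used only so that the example runs without a trained network.

```python
import numpy as np

def rollout(predict_next, seed_window, n_steps):
    """Predict n_steps future values, feeding each prediction back as input."""
    window = list(seed_window)
    seq_len = len(window)
    predictions = []
    for _ in range(n_steps):
        # Predict from the most recent seq_len values
        nxt = float(predict_next(np.asarray(window[-seq_len:])))
        predictions.append(nxt)
        window.append(nxt)
    return np.array(predictions)

# Stand-in predictor: exact two-term recurrence for cos(omega * t)
dt, omega = 0.01, 0.5

def cosine_recurrence(window):
    return 2.0 * np.cos(omega * dt) * window[-1] - window[-2]

t_seed = dt * np.arange(50)                 # 50 known points, as with seq_len = 50
seed = np.cos(omega * t_seed)
future = rollout(cosine_recurrence, seed, n_steps=200)

t_future = dt * np.arange(50, 250)
print("max rollout error:", np.max(np.abs(future - np.cos(omega * t_future))))
```

For the trained network, `predict_next` would instead wrap the model, for example by reshaping the window to `(1, seq_len, 1)`, calling the model and taking the output at the last time step. Errors typically accumulate in such a rollout, since the network is fed its own predictions.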