### Long Short-Term Memory Networks

Introduction to LSTMs and How They Address RNN Limitations
---

### What are LSTMs?

Long Short-Term Memory networks (LSTMs) are a specialized type of recurrent neural network (RNN) designed to effectively capture long-term dependencies in sequential data. While traditional RNNs are theoretically capable of handling sequences of arbitrary length, in practice they struggle with learning patterns that span many time steps due to the vanishing gradient problem. This issue makes it difficult for standard RNNs to retain information over long sequences. LSTMs address this limitation by introducing a memory cell and gating mechanisms that regulate the flow of information, enabling the network to remember or forget information as needed.
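The vanishing gradient problem can be illustrated with a toy scalar recurrence. The sketch below is a simplified stand-in for a real RNN, not a model from this notebook: it assumes a single hidden unit with recurrent weight `w` and no input, and tracks how the gradient of the final state with respect to the initial state shrinks as the number of steps grows. The function name and the values `w=0.9`, `h0=0.5` are chosen purely for illustration.

```python
import numpy as np

def gradient_through_time(w, steps, h0=0.5):
    """Return |d h_t / d h_0| after each step of h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    magnitudes = []
    for _ in range(steps):
        h = np.tanh(w * h)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - tanh^2(w * h_{t-1}))
        grad *= w * (1.0 - h ** 2)
        magnitudes.append(abs(grad))
    return magnitudes

mags = gradient_through_time(w=0.9, steps=50)
print(f"after  1 step : {mags[0]:.3e}")
print(f"after 10 steps: {mags[9]:.3e}")
print(f"after 50 steps: {mags[-1]:.3e}")
```

Each factor in the product has magnitude below 1 here, so the gradient decays roughly exponentially with sequence length. This is the behaviour the LSTM cell state is designed to counteract.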

---

### Key Features of LSTMs

- **Memory Cells:**  
    LSTMs maintain a dedicated cell state that is carried from one time step to the next and acts like a conveyor belt for information. This cell state is modified only through carefully regulated gates, allowing the network to preserve information across long sequences.

- **Gated Mechanism:**  
    LSTMs use three primary gates (forget, input, and output) to control the flow of information:
    - The **forget gate** decides what information to discard from the cell state.
    - The **input gate** determines which new information to add.
    - The **output gate** controls what information from the cell state is output as the hidden state.

    This selective memory management is crucial for learning complex temporal patterns.

- **Effective for Long Sequences:**  
    By mitigating the vanishing gradient problem, LSTMs can learn dependencies that span many time steps, making them suitable for tasks like language modeling, time series forecasting, and more.

---

### Advantages Over Vanilla RNNs

- **Retain Long-Term Dependencies:**  
    LSTMs are explicitly designed to remember information for long durations, overcoming the limitations of standard RNNs.

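- **Mitigates Vanishing Gradients:**  
    The gating mechanisms and the additive cell state update mitigate vanishing gradients during training, leading to more stable and effective learning. Exploding gradients can still occur and are usually controlled separately, for example with gradient clipping.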

- **Superior Performance:**  
    LSTMs often outperform vanilla RNNs on tasks such as language modeling, speech recognition, and time series prediction, largely because they can model long-range dependencies.
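Exploding gradients are usually handled in practice with gradient clipping, alongside whatever the architecture itself provides. The following is a minimal NumPy sketch of clipping by global norm. The helper name `clip_by_global_norm` and the example gradient values are assumptions made for illustration; Keras optimizers expose comparable behaviour through arguments such as `clipnorm`.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1.0 when the norm is already small enough, otherwise < 1.0.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Hypothetical gradients with global norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Clipping preserves the direction of the update and only limits its size, which is why it is a common companion to recurrent models.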

---

### LSTM Cell Structure: Input, Forget, and Output Gates

A typical LSTM cell uses three gates, together with a cell state update, to control the flow of information:

- **Forget Gate (\(f_t\)):**  
    Decides what information from the previous cell state should be discarded. It takes the previous hidden state and the current input, passes them through a sigmoid activation, and outputs a value between 0 and 1 for each element in the cell state (0 = "completely forget", 1 = "completely keep").

- **Input Gate (\(i_t\)):**  
    Determines which new information should be added to the cell state. It uses a sigmoid layer to decide which values to update and a tanh layer to create a vector of new candidate values (\(\tilde{C}_t\)).

- **Cell State Update (\(C_t\)):**  
    The cell state is updated by combining the results of the forget and input gates, allowing the network to selectively remember or forget information.

- **Output Gate (\(o_t\)):**  
    Controls what part of the cell state should be output as the hidden state for the next time step. It uses a sigmoid layer to decide which parts of the cell state to output, and applies tanh to the cell state to squash its values into the range (-1, 1) before the two are multiplied element-wise.
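A gate behaves like a soft, element-wise mask. The short sketch below shows a forget gate scaling a previous cell state. The pre-activation values are hypothetical and chosen by hand so that the three elements are almost fully forgotten, half kept, and almost fully kept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

prev_cell = np.array([2.0, -1.0, 0.5])        # C_{t-1}
gate_logits = np.array([-10.0, 0.0, 10.0])    # hypothetical pre-activations
forget_gate = sigmoid(gate_logits)            # values in (0, 1)

kept = forget_gate * prev_cell                # f_t ⊙ C_{t-1}
print(np.round(forget_gate, 4))
print(np.round(kept, 4))
```

In a trained network the pre-activations come from the current input and previous hidden state rather than being fixed, so the mask changes at every time step.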

---

#### Mathematical Formulation

Let  
- \( x_t \): input at time step \( t \)  
- \( h_{t-1} \): previous hidden state  
- \( C_{t-1} \): previous cell state  
- \( W \), \( U \), \( b \): input weight matrices, recurrent weight matrices, and bias vectors (one set per gate, indicated by the subscript)  
- \( \sigma \): sigmoid activation  
- \( \tanh \): hyperbolic tangent activation  

The LSTM cell computes:

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad &\text{(Forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad &\text{(Input gate)} \\
\tilde{C}_t &= \tanh(W_C x_t + U_C h_{t-1} + b_C) \quad &\text{(Candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad &\text{(New cell state)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad &\text{(Output gate)} \\
h_t &= o_t \odot \tanh(C_t) \quad &\text{(New hidden state)}
\end{aligned}
\]

Where:  
- \( f_t \): forget gate vector  
- \( i_t \): input gate vector  
- \( o_t \): output gate vector  
- \( \tilde{C}_t \): candidate values for cell state  
- \( C_t \): updated cell state  
- \( h_t \): updated hidden state  
- \( \odot \): element-wise multiplication  
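The equations above translate almost line for line into NumPy. The sketch below implements a single LSTM step and runs it over a short random sequence. The dictionary layout of `params`, the dimensions, and the random initialisation are assumptions made for illustration; library implementations such as the Keras `LSTM` layer store and fuse the weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the formulation above."""
    W, U, b = params["W"], params["U"], params["b"]
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    c_tilde = np.tanh(W["C"] @ x_t + U["C"] @ h_prev + b["C"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # illustrative sizes
params = {
    "W": {k: rng.normal(scale=0.1, size=(hidden_dim, input_dim)) for k in "fiCo"},
    "U": {k: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for k in "fiCo"},
    "b": {k: np.zeros(hidden_dim) for k in "fiCo"},
}

# Run the cell over a random sequence of 5 time steps.
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)  # (4,) (4,)
```

Note that the cell state `c` is updated additively, while the hidden state `h` is always bounded in (-1, 1) because it is a product of a sigmoid output and a tanh output.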

---

#### Summary Table

| Gate         | Purpose                                      | Activation Functions                         |
|--------------|----------------------------------------------|----------------------------------------------|
| Forget Gate  | Remove irrelevant information                | Sigmoid                                      |
| Input Gate   | Add new relevant information                 | Sigmoid (gate), Tanh (candidate values)      |
| Output Gate  | Output filtered cell state as hidden state   | Sigmoid (gate), Tanh (applied to cell state) |

This gating mechanism allows LSTMs to selectively remember or forget information, making them highly effective for learning long-term dependencies in sequential data.

---

### Applications of LSTMs

LSTMs have been widely adopted in various domains due to their ability to model sequential data and capture long-term dependencies. Some notable applications include:

- **Natural Language Processing (NLP):**  
    - *Sentiment Analysis:* Understanding the sentiment of a sentence or document by considering the context provided by previous words.
    - *Machine Translation:* Translating text from one language to another by capturing the context of entire sentences or paragraphs.
    - *Text Generation:* Generating coherent and contextually relevant text sequences, such as chatbots or story generation.

- **Time Series Forecasting:**  
    - *Stock Price Prediction:* Modeling and predicting future stock prices based on historical data.
    - *Weather Forecasting:* Predicting future weather patterns by analyzing past meteorological data.
    - *Sales Trends:* Forecasting future sales based on previous sales data and seasonal trends.

- **Speech Recognition:**  
    - *Speech-to-Text Conversion:* Converting spoken words into written text by analyzing audio signals over time.

- **Anomaly Detection:**  
    - *Identifying Unusual Patterns:* Detecting anomalies in sequential data, such as fraud detection in financial transactions or fault detection in industrial systems.

LSTMs' ability to capture both short-term and long-term dependencies makes them a powerful tool for a wide range of sequence modeling tasks.
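For the forecasting applications listed above, a univariate series is typically reshaped into overlapping windows before it is fed to a recurrent layer. The sketch below shows one common way to do this; the helper name `make_windows` and the toy series are assumptions for illustration, and real pipelines usually add scaling and a train/test split.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (samples, timesteps, features) inputs and next-step targets."""
    series = np.asarray(series, dtype=np.float32)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]               # the value that follows each window
    return X[..., np.newaxis], y      # add a trailing feature axis of size 1

series = np.arange(10)                # toy data: 0, 1, ..., 9
X, y = make_windows(series, window=3)
print(X.shape, y.shape)               # (7, 3, 1) (7,)
print(X[0].ravel(), y[0])
```

The resulting `X` has the three-dimensional shape that recurrent layers such as the Keras `LSTM` expect for numeric sequences.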

In [4]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SimpleRNN

In [5]:
# prepare data
vocab_size = 10000
max_len = 200

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocab_size)
X_train = pad_sequences(X_train, maxlen = max_len)
X_test = pad_sequences(X_test, maxlen = max_len)

print(f"Training Data Shape: {X_train.shape}")
print(f"Testing Data Shape: {X_test.shape}")

Training Data Shape: (25000, 200)
Testing Data Shape: (25000, 200)
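Every review ends up with exactly `max_len` tokens because `pad_sequences`, with its default settings, left-pads short sequences with zeros and drops tokens from the start of long ones. The function below is a hypothetical re-implementation of that default behaviour, written only to make the effect visible; it is not the Keras source.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncation on lists of ints."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]          # keep the last maxlen tokens
        if trimmed:
            out[row, -len(trimmed):] = trimmed # right-align, zeros stay on the left
    return out

padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(padded.tolist())  # [[0, 1, 2, 3], [6, 7, 8, 9]]
```

Left-padding keeps the real tokens closest to the final time step, which matters here because both models read only the last hidden state (`return_sequences=False`).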


Build and train a basic RNN model

In [None]:
rnn_model = Sequential(
    [
        Embedding(input_dim=vocab_size, output_dim=128),
        SimpleRNN(128, activation="tanh", return_sequences=False),
        Dense(1, activation="sigmoid"),
    ]
)

rnn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
rnn_model.summary()

rnn_history = rnn_model.fit(
    X_train, y_train, epochs=5, batch_size=32, validation_split=0.2
)

loss, accuracy = rnn_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}, Test Accuracy: {accuracy}")

Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 17s 25ms/step - accuracy: 0.5469 - loss: 0.6817 - val_accuracy: 0.6088 - val_loss: 0.6416
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 17s 28ms/step - accuracy: 0.7151 - loss: 0.5597 - val_accuracy: 0.7584 - val_loss: 0.5109
Epoch 3/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 18s 29ms/step - accuracy: 0.7639 - loss: 0.4949 - val_accuracy: 0.6828 - val_loss: 0.6026
Epoch 4/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 18s 28ms/step - accuracy: 0.8041 - loss: 0.4387 - val_accuracy: 0.7852 - val_loss: 0.5116
Epoch 5/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 17s 27ms/step - accuracy: 0.8706 - loss: 0.3196 - val_accuracy: 0.8142 - val_loss: 0.4799
782/782 ━━━━━━━━━━━━━━━━━━━━ 6s 8ms/step - accuracy: 0.8049 - loss: 0.4823
Test Loss: 0.4771870970726013, Test Accuracy: 0.8063600063323975


Build and train the LSTM model

In [8]:
lstm_model = Sequential(
    [
        Embedding(input_dim=vocab_size, output_dim=128),
        LSTM(128, activation="tanh", return_sequences=False),
        Dense(1, activation="sigmoid"),
    ]
)

lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
lstm_model.summary()

lstm_history = lstm_model.fit(
    X_train, y_train, epochs=5, batch_size=32, validation_split=0.2
)

loss, accuracy = lstm_model.evaluate(X_test, y_test)
print(f"LSTM Test Loss: {loss}, Test Accuracy: {accuracy}")

Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 51ms/step - accuracy: 0.7374 - loss: 0.5141 - val_accuracy: 0.6494 - val_loss: 0.6184
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 40s 50ms/step - accuracy: 0.7948 - loss: 0.4437 - val_accuracy: 0.8024 - val_loss: 0.4261
Epoch 3/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 31s 50ms/step - accuracy: 0.8933 - loss: 0.2693 - val_accuracy: 0.7686 - val_loss: 0.4876
Epoch 4/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 53ms/step - accuracy: 0.8893 - loss: 0.2820 - val_accuracy: 0.8596 - val_loss: 0.3700
Epoch 5/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 53ms/step - accuracy: 0.9501 - loss: 0.1442 - val_accuracy: 0.8732 - val_loss: 0.3605
782/782 ━━━━━━━━━━━━━━━━━━━━ 16s 20ms/step - accuracy: 0.8627 - loss: 0.3862
LSTM Test Loss: 0.38313448429107666, Test Accuracy: 0.8621199727058411
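The logs above show LSTM epochs taking roughly twice as long as SimpleRNN epochs. One likely contributor is the size of the recurrent layer: an LSTM holds four sets of weights (three gates plus the candidate cell state) where a simple RNN holds one. The sketch below counts trainable parameters for the recurrent layers only, using the standard formulas. The helper names are illustrative, and the counts exclude the shared embedding and dense layers.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + biases for one tanh recurrent layer."""
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    """Forget, input, and output gates plus the candidate: four weight sets."""
    return 4 * simple_rnn_params(input_dim, units)

embedding_dim, units = 128, 128   # values used by both models above
rnn_count = simple_rnn_params(embedding_dim, units)
lstm_count = lstm_params(embedding_dim, units)
print(rnn_count, lstm_count)      # 32896 131584
```

These figures can be compared against the recurrent-layer rows of `rnn_model.summary()` and `lstm_model.summary()` once the models have been built.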
