#### About

> LSTM and GRU

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are two recurrent neural network (RNN) architectures commonly used for sequence modeling tasks such as weather forecasting, natural language processing, and speech recognition.

They are designed to overcome limitations of traditional RNNs, most notably the vanishing gradient problem, in which gradients shrink toward zero as they are backpropagated through many time steps.
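To make the vanishing gradient concrete, here is a minimal sketch using a scalar "RNN" of the form h_t = tanh(w * h_{t-1}). The function name `gradient_through_time` and the values `w=0.9` and `h0=0.5` are illustrative assumptions, not part of any library. Each step multiplies the gradient by a factor below 1, so it decays roughly exponentially with sequence length.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# Illustrative values: the gradient shrinks as the sequence gets longer.
for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.9, h0=0.5, steps=steps))
```

With |w| < 1 every factor is below 1 in magnitude, which is the situation gated architectures are meant to mitigate.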

The LSTM and GRU architectures implement gating mechanisms that let the network selectively update and retain information over long sequences, enabling better learning and modeling of long-term dependencies in sequence data.

1. LSTM (Long Short-Term Memory):

An LSTM cell has three gates: an input gate, a forget gate, and an output gate. The input gate controls how much new information is added to the cell state, the forget gate controls how much information is forgotten from the cell state, and the output gate controls how much information is sent to the output. 



```
f_t = sigmoid(W_f . x_t + U_f . h_t-1 + b_f)  # forget gate
i_t = sigmoid(W_i . x_t + U_i . h_t-1 + b_i)  # input gate
o_t = sigmoid(W_o . x_t + U_o . h_t-1 + b_o)  # output gate
c_t = f_t .*  c_t-1 + i_t .*  tanh(W_c . x_t + U_c . h_t-1 + b_c)  # cell state
h_t = o_t .*  tanh(c_t)  # hidden state/output
```

Where:

- x_t is the input at time step t
- h_t is the hidden state/output at time step t
- c_t is the cell state at time step t
- W_f, W_i, W_o, W_c are the weight matrices for the input x_t
- U_f, U_i, U_o, U_c are the weight matrices for the previous hidden state h_t-1
- b_f, b_i, b_o, b_c are the bias vectors
- sigmoid is the sigmoid activation function
- tanh is the hyperbolic tangent activation function
- .*  represents element-wise multiplication
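The equations above can be sketched directly in NumPy. This is a minimal illustration, not how Keras implements the layer: the function `lstm_step`, the dictionary layout of the weights, and the random initial values are all assumptions made for the example. The dimensions (5 input features, 32 units, 10 time steps) are chosen to match the Keras model used in this notebook.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts keyed by 'f', 'i', 'o', 'c'."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 32
W = {k: rng.normal(scale=0.1, size=(hidden_dim, input_dim)) for k in "fioc"}
U = {k: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for k in "fioc"}
b = {k: np.zeros(hidden_dim) for k in "fioc"}

# Run the cell over a sequence of 10 time steps, starting from zero states.
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(10, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)  # (32,)
```

The final `h` corresponds to what a recurrent layer returns when only the last hidden state is requested.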

In [1]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

In [2]:
model = Sequential()
model.add(LSTM(32, input_shape=(10, 5)))  # LSTM layer with 32 units and input shape of (10, 5)
model.add(Dense(1, activation='sigmoid'))  # Output layer with sigmoid activation for binary classification


In [3]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

2. GRU (Gated Recurrent Unit):

A GRU cell has two gates: an update gate and a reset gate. The update gate determines how much of the previous hidden state is mixed with the new candidate hidden state, and the reset gate determines how much of the previous hidden state is forgotten when computing that candidate.

The key difference between GRU and LSTM is that a GRU has fewer parameters than an LSTM with the same number of units (three sets of weights instead of four) and is therefore usually faster to train. It achieves this by combining the forget and input gates of the LSTM into a single "update gate" and merging the cell state and hidden state into a single "hidden state".
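The parameter difference can be checked with simple arithmetic. The sketch below counts one input weight matrix, one recurrent weight matrix, and one bias vector per gate or candidate, following the textbook formulation; the helper names `lstm_params` and `gru_params` are made up for this example.

```python
def lstm_params(input_dim, units):
    # Four weight sets (forget, input, output, candidate), each with W, U and b.
    return 4 * (units * input_dim + units * units + units)

def gru_params(input_dim, units):
    # Three weight sets (update, reset, candidate), each with W, U and b.
    return 3 * (units * input_dim + units * units + units)

# Same sizes as the LSTM model in this notebook: 5 input features, 32 units.
print(lstm_params(5, 32))  # 4864
print(gru_params(5, 32))   # 3648
```

Library implementations can differ slightly from this count. For example, the Keras GRU with `reset_after=True` keeps a second bias vector per weight set, so its total is somewhat higher than the figure above.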



Given an input sequence (x_1, x_2, ..., x_T) of length T and the previous hidden state h_{t-1}, the GRU computes the update gate z_t, the reset gate r_t, and the candidate hidden state h~_t at each time step t as follows:

```
z_t = sigmoid(W_z . x_t + U_z . h_{t-1} + b_z)  # update gate
r_t = sigmoid(W_r . x_t + U_r . h_{t-1} + b_r)  # reset gate
h~_t = tanh(W_h . x_t + r_t .* (U_h . h_{t-1}) + b_h)  # candidate hidden state
```

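where each W is an input weight matrix, each U is a recurrent weight matrix applied to the previous hidden state, and each b is a bias vector, for the corresponding gate or candidate hidden state. sigmoid is the sigmoid activation function, and tanh is the hyperbolic tangent activation function. The reset gate r_t multiplies the recurrent term element-wise.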

Next, the current hidden state h_t is computed as a weighted sum of the previous hidden state h_{t-1} and the candidate hidden state h~_t, with the weights determined by the update gate z_t:

`h_t = (1 - z_t) .* h_{t-1} + z_t .* h~_t`
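As with the LSTM, the GRU update can be sketched in NumPy. This is a minimal illustration under assumptions: the function `gru_step`, the dictionary keys `'z'`, `'r'`, `'h'`, and the random weights are invented for the example, and it follows the equations in this section rather than any particular library's internals.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step. W, U, b are dicts keyed by 'z', 'r', 'h'."""
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])              # update gate
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])              # reset gate
    h_tilde = np.tanh(W["h"] @ x_t + r_t * (U["h"] @ h_prev) + b["h"])  # candidate state
    # Interpolate between the previous state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 32
W = {k: rng.normal(scale=0.1, size=(hidden_dim, input_dim)) for k in "zrh"}
U = {k: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for k in "zrh"}
b = {k: np.zeros(hidden_dim) for k in "zrh"}

# Run the cell over a sequence of 10 time steps, starting from a zero state.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(10, input_dim)):
    h = gru_step(x_t, h, W, U, b)

print(h.shape)  # (32,)
```

Note that some references and libraries swap the roles of `z_t` and `1 - z_t` in the final interpolation; the two conventions are equivalent up to relabeling the gate.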


In [None]:
from tensorflow.keras.layers import GRU

# num_units is the dimensionality of the output space (32 is an arbitrary example value)
num_units = 32
gru = GRU(num_units, activation='tanh', recurrent_activation='sigmoid', 
          use_bias=True, kernel_initializer='glorot_uniform', 
          recurrent_initializer='orthogonal', bias_initializer='zeros', 
          kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, 
          activity_regularizer=None, dropout=0.0, recurrent_dropout=0.0, 
          implementation=1, return_sequences=False, return_state=False, 
          go_backwards=False, stateful=False, reset_after=True)
