In [10]:
"""GRU: RNN Architecture
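h_{t-1}, h_t, x_t, r_t, z_t and h'_t are all vectors; all of them except x_t must have the same dimension (the number of hidden units).
Steps at each time step:
1. Calculate r_t (reset gate)
2. Calculate h'_t (candidate hidden state)
3. Calculate z_t (update gate)
4. Calculate h_t (current hidden state)

It comprises 2 gates: reset and update.
It has 3 small neural-network layers (2 sigmoid, 1 tanh), compared to 4 in the LSTM.
It has 5 point-wise operations, the same number as the LSTM, but the operations themselves differ.

Notation: © = concatenation, ⊗ = element-wise product, ⊕ = element-wise sum, σ = sigmoid.
Each σ / tanh below is a separate dense layer with its own weights W and bias b.
Reset gate:
r_t = σ(W_r · [h_{t-1} © x_t] + b_r)
Candidate hidden state:
h'_t = tanh(W_h · [(r_t ⊗ h_{t-1}) © x_t] + b_h)
Update gate:
z_t = σ(W_z · [h_{t-1} © x_t] + b_z)
Current hidden state:
h_t = [(1 - z_t) ⊗ h_{t-1}] ⊕ [z_t ⊗ h'_t]

Observe: the reset gate r_t decides how much of h_{t-1} (previous information) flows into the candidate state, given x_t (current input).
The update gate z_t also controls how much of h_{t-1} is passed on: the larger z_t is, the less of h_{t-1} is kept and the more of h'_t is used.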

LSTM vs GRU:
1. Number of gates:
LSTM - 3 (Forget, Input, Output)
GRU - 2 (Reset, Update)

2. Memory Units:
LSTM: Two separate states (Cell, hidden state)
GRU: A single hidden state (no separate cell state)

3. Parameter Count and Computational Complexity:
LSTM: Generally more
GRU: Comparatively less

4. Empirical Performance:
LSTM: Often slightly better than GRUs, especially on complex tasks.
GRU: Comparable to LSTMs, especially on simpler tasks.

5. Choice in Practice:
Start with a GRU and try to improve it; if the results are still not good enough, switch to an LSTM.
"""

"GRU: RNN Architecture\nht-1, ht, xt, rt, zt, ht' are all vectors flowing, except xt all must have same dimensions.\n1. Calculate rt(reset gate)\n2. ht'(candidate hidden state)\n3. zt (update gate)\n4. ht (current hidden state)\n\nIt comprises of 2 gates, reset and update.\n3 ANN compared to 4 in LSTM.\n5 point-wise operation same numbered but different in operation as compared to LSTM.\n\nIn reset gate,\nht-1 © xt -> σ = rt [Reset Gate]\n{[rt ⊗ ht-1] © xt} -> tanh =h't (candidate hidden state)\nIn update gate,\nht-1 © xt -> σ = zt [Update Gate]\nNow, ht = [(1-zt) ⊗ ht-1] ⊕ [zt ⊗ h't]\n\nObserve, reset gate rt decides how much ht-1(previous info) flows according to xt(current input).\nAnd, zt the update gate also decides how much ht-1 is passed and more is zt, less is ht-1(previous info).\n\nLSTM vs GRU:\n1. Number of gates:\nLSTM - 3 (Forget, Input, Output)\nGRU - 2 (Reset, Update)\n\n2. Memory Units:\nLSTM: Two separate states (Cell, hidden state)\nGRU: Single hidden layer\n\n3. Para

In [11]:
"""Deep RNNs:
Here several recurrent layers are stacked. Every layer keeps its own feedback connection, so each layer receives its input (from the layer below) plus its own hidden state from the previous time step.
No other connections are added: a deep RNN is simply RNN layers stacked on top of each other.
In diagrams it is drawn as a grid of nodes, with time on the x-axis and layer (depth) on the y-axis.
Notation: the hidden state of layer l at time t is written h^l_t (superscript = layer number, subscript = time step).
h^l_t = tanh(W^l · h^l_{t-1} + U^l · h^{l-1}_t + b^l), where h^0_t = x_t
Pros:
1. Hierarchical representation: the first layers capture word-level features, the last layers sentence- or paragraph-level features.
2. Customization for advanced tasks
When to use:
- Complex tasks: speech recognition, machine translation.
- Large datasets, since deep RNNs may overfit on small datasets.
- High computational power is available.
- A simpler model did not give satisfactory results.

In Keras, just stack multiple SimpleRNN(number_of_units) layers.
Remember to set return_sequences=True in every recurrent layer except the last one: each stacked layer needs the output of every time step from the layer below, while the last layer only has to pass on its final output.

Cons:
1. Overfitting
2. Training Time increases
"""

"Deep RNNs:\nHere, the layers of nodes increase, but the feedback is in each node's each layer, so input is coming input+self feedback.\nNo, additional connections are there, just RNN multiplied, is deep RNN.\nIn notation, it is shown by horizontal and vertical nodes, with x as time axis and y as layer or depth axis.\nNotation: Hidden unit is named as, h<sup>layer no</sup><sub>time</sub>\nhlt = tanh([Wl][hlt-1] + [ul][hl-1t]+b)\nPros:\n1. Hierarchial Representation, starting layer word level, ending layer sentence or paragraph level.\n2. Customization for Advanced Tasks\nComplex tasks: Speech recognition, Machine Translation.\nLarge Datasets as with small datasets Deep RNNs may cause overfitting.\nHigh computational power available\nNot satisfied with simpler model\n\nJust add multiple SimpleRNN(Number of Nodes)\nBut, remember to set return_sequences=True in each hidden layer except last one as in last layer there is no need to pass inputs of each step.\n\nCons:\n1. Overfitting\n2. Tra

In [12]:
"""Variants: We can make also make, Deep GRUs, Deep LSTMs, also in them make return_sequnces=True in all layers except the last one.
"""


In [13]:
"""Bi-directional RNNs, LSTMs:
Useful when later inputs affect how earlier inputs should be interpreted, for example when, in ChatGPT, the instruction about what to do with a text comes after the text itself.
I love amazon, it's a great river.
I love amazon, for their fantastic service.
Here, the meaning of "amazon" can only be inferred from the words that come after it.

A bi-directional RNN uses two RNNs: one reads the sequence forward in time, the other reads it backward. At each time step their outputs are concatenated to form the final output. Otherwise each RNN works as usual: its output feeds back only into itself, not into the other RNN.
"""

"Bi-directional RNNs, LSTMs:\nWhen output of later inputs affects outputs of first inputs, it is useful, like in ChatGPT we tell what to do with text we provided at last.\nI love amazon, it's a great river.\nI love amazon, for their fantastic service.\nHere, amazon requires information in later sentence for getting its inference.\n\nWe would use bi-directional RNN, input would flow in both timestamps from both direction and final output would be the one concatenated with the other, otherwise RNNs are performing as it is, only their output is going as input not the other RNN.\n"

In [14]:
import tensorflow
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Bidirectional, SimpleRNN, LSTM, GRU, Embedding

In [None]:
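model = Sequential([
	# Maps each of 50 token ids to a 10-dimensional vector; input sequences have length 50
	Embedding(input_dim=50, output_dim=10, input_shape=(50, )),
	# Forward and backward SimpleRNN(5); their outputs are concatenated into 10 values
	Bidirectional(SimpleRNN(5)),
	# To change the cell type, use Bidirectional(GRU(5)) or Bidirectional(LSTM(5)) instead
	Dense(1, activation='sigmoid')
])
# Bidirectional holds two copies of the wrapped layer, so its parameter count is double that of a single SimpleRNN(5)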
model.summary()
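
The numbers that `model.summary()` should report for this model can be worked out by hand. The sketch below only does the arithmetic; the variable names are illustrative.

```python
vocab_size, embed_dim, rnn_units = 50, 10, 5

embedding = vocab_size * embed_dim                                # one 10-dim vector per token id
one_direction = rnn_units * (rnn_units + embed_dim) + rnn_units   # recurrent + input weights + bias
bidirectional = 2 * one_direction                                 # forward copy + backward copy
dense = 2 * rnn_units * 1 + 1                                     # 10 concatenated inputs + bias
total = embedding + bidirectional + dense
print(embedding, bidirectional, dense, total)  # 500 160 11 671
```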


In [18]:
"""Applications:
NLP tasks in general, such as chatbots.
Specifically:
NER (Named Entity Recognition),
POS (Part-of-Speech) Tagging,
Machine Translation,
Sentiment Analysis,
Time-Series Forecasting (e.g. stock price prediction).

Drawbacks:
1. Computational Complexity
2. Overfitting
3. Not suitable when the full sequence is not available up front, such as streaming data: the backward pass needs the end of the sequence, so in real-time speech recognition, for example, latency issues may occur.
"""
