# **Udacity: Intro to TensorFlow for Deep Learning**
## **Lesson 10 NLP: Recurrent Neural Networks**

This lesson extends what was covered in lesson 9. It introduces recurrent neural networks (RNNs), which are able to capture temporal dependencies, that is, relationships between elements of a sequence that unfold over time.

This lesson covers:
- Different RNNs: simple RNNs, LSTMs and GRUs
- Text generation using NLP models.


## **Simple RNNs**

A simple recurrent neural network is a network which uses its output from the previous time step as an additional input, alongside the current input.

For example:
- Inputs: $X_t$ and $Y_{t-1}$
- Output: $Y_t$

I'm a bit iffy on how exactly the output is calculated.

But...

The general idea is that the output from the previous time step is fed back in as an additional input. This previous output is referred to as the state vector.

<br>

While a simple RNN is able to consider its past output when calculating a new output, it is limited in how far back it can relate dependencies. For dependencies which span a long period of time, a simple RNN would struggle to capture them.
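To make the calculation of the output a little more concrete, here is a minimal NumPy sketch of a simple RNN forward pass. It assumes the common formulation $h_t = \tanh(x_t W_x + h_{t-1} W_h + b)$; the function names, the weight initialisation and the shapes are my own choices for illustration, not something taken from the lesson.

```python
import numpy as np

def simple_rnn_cell(x_t, h_prev, W_x, W_h, b):
    """One time step: combine the current input with the previous state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

def simple_rnn(x, units, seed=0):
    """Run the cell over every time step of x, shaped (batch, time, features)."""
    rng = np.random.default_rng(seed)
    batch, time_steps, features = x.shape
    W_x = rng.normal(scale=0.1, size=(features, units))  # input weights (random, untrained)
    W_h = rng.normal(scale=0.1, size=(units, units))     # recurrent weights (random, untrained)
    b = np.zeros(units)
    h = np.zeros((batch, units))                         # the initial state is all zeros
    outputs = []
    for t in range(time_steps):
        h = simple_rnn_cell(x[:, t, :], h, W_x, W_h, b)  # previous output is fed back in
        outputs.append(h)
    return np.stack(outputs, axis=1), h

x = np.random.default_rng(1).normal(size=(2, 4, 8))  # 2 samples, 4 time steps, 8 features
outputs, final_state = simple_rnn(x, units=5)
print(outputs.shape)      # (2, 4, 5)
print(final_state.shape)  # (2, 5)
```

In this formulation the output at each time step and the state vector are the same array, which is why the last slice of `outputs` equals `final_state`.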

## **Long Short-Term Memory (LSTM)**

LSTMs were introduced to capture temporal dependencies which span longer periods of time. Unlike simple RNNs, which have a single state vector, LSTMs have 2 state vectors: a long-term memory and a short-term memory.

<br>

**Features of an LSTM**
- It's able to capture temporal dependencies spanning a long period of time
- It has 2 state vectors, long-term and short-term memory, which are used in calculating a new output
- LSTMs feature gates: forget, learn, remember and use gates.

<br>

**Workflow for an LSTM**
- At each time step, the LSTM has 2 state vectors: a long-term memory and a short-term memory
- The current time step's input, alongside the 2 state vectors, is used to determine an output.
- The output calculated at the current time step is used as the short-term memory for the next time step
- The long-term memory from the previous time step is updated:
  - Any piece of the previous long-term memory which is no longer relevant is removed, using the forget gate
  - Any new piece of relevant information is added to the long-term memory, using the remember gate

<br>

**LSTM Gates**
- Forget gate: Determines which parts of the long-term memory are no longer relevant and should be removed from memory.

- Learn gate: Learns new pieces of information using the current input and the short-term memory.

- Remember gate: Adds any relevant information that was learnt to the long-term memory. The output of this gate is the new long-term memory.

- Use gate: Uses the relevant parts of the long-term memory and the newly learnt information to calculate an output. The output is also used as the new short-term memory for the next time step.
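The gates above can be sketched as a single LSTM time step in NumPy. This assumes the standard LSTM equations (input, forget and output gates plus a candidate state); mapping them onto the learn/remember/use names is my own rough reading, and the weights here are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U and b hold the four weight blocks side by side."""
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b              # shape (batch, 4 * units)
    i = sigmoid(z[:, 0 * units:1 * units])    # how much new information to admit
    f = sigmoid(z[:, 1 * units:2 * units])    # forget gate: what to drop from long-term memory
    g = np.tanh(z[:, 2 * units:3 * units])    # candidate information (roughly the "learn" step)
    o = sigmoid(z[:, 3 * units:4 * units])    # output gate (roughly the "use" step)
    c_t = f * c_prev + i * g                  # new long-term memory (roughly the "remember" step)
    h_t = o * np.tanh(c_t)                    # new short-term memory, also the output
    return h_t, c_t

rng = np.random.default_rng(0)
batch, features, units = 2, 3, 4
W = rng.normal(scale=0.5, size=(features, 4 * units))
U = rng.normal(scale=0.5, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros((batch, units))   # short-term memory (hidden state)
c = np.zeros((batch, units))   # long-term memory (cell state)
x = rng.normal(size=(batch, 5, features))
for t in range(x.shape[1]):
    h, c = lstm_step(x[:, t, :], h, c, W, U, b)

print(h.shape, c.shape)
```

The two state vectors have the same shape but hold different values: `h` is a gated, squashed view of `c`.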



## **Gated Recurrent Units**

Gated recurrent units were introduced in 2014, with the aim of solving the vanishing gradient problem, an issue that plagues standard recurrent networks.
<br>

To address this, GRUs feature gates (an update gate and a reset gate), similar to LSTMs.
- **Reset gate**: Decides how much of the previous state is relevant. The relevant portion of the previous state is then combined with the current input to create an intermediary (candidate) state vector.

- **Update gate**: Decides how much of the previous state should be kept, and how much of the intermediary state vector should replace it.
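A single GRU time step might look like the NumPy sketch below. It assumes one common formulation (reset gate applied before the recurrent matrix multiplication, biases left out for brevity); the parameter names are hypothetical and the weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step (biases omitted to keep the sketch short)."""
    W_z, U_z, W_r, U_r, W_h, U_h = params
    z = sigmoid(x_t @ W_z + h_prev @ U_z)                  # update gate
    r = sigmoid(x_t @ W_r + h_prev @ U_r)                  # reset gate
    h_candidate = np.tanh(x_t @ W_h + (r * h_prev) @ U_h)  # intermediary state vector
    return z * h_prev + (1.0 - z) * h_candidate            # blend old state and candidate

rng = np.random.default_rng(0)
batch, features, units = 5, 3, 4
params = [
    rng.normal(scale=0.5, size=shape)
    for shape in [(features, units), (units, units)] * 3
]

x = rng.normal(size=(batch, 4, features))
h = np.zeros((batch, units))
for t in range(x.shape[1]):
    h = gru_step(x[:, t, :], h, params)

print(h.shape)  # (5, 4)
```

Unlike the LSTM, there is only one state vector here, so the state and the output are the same thing.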

<br>

Difference between GRUs and LSTMs
- GRUs have fewer parameters (2 gates: update and reset), so they are generally faster to train and may need less data to generalize to new samples.

- In comparison, LSTMs have more parameters with their 4 gates (learn, forget, remember and use), so they take relatively longer to train. But the additional gated connections provide more expressiveness, which could lead to better results.
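To put rough numbers on the parameter difference, here is a small sketch that counts weights per layer. It assumes each weight block has an input kernel, a recurrent kernel and a bias, and that (counting the candidate-state block alongside the gates) a simple RNN has 1 block, a GRU 3 and an LSTM 4. The helper name and the `recurrent_bias` flag are my own; the flag mimics, as far as I understand it, the extra bias Keras adds to GRU when `reset_after=True`.

```python
def rnn_param_count(input_dim, units, weight_blocks, recurrent_bias=False):
    """Count trainable weights for a recurrent layer built from `weight_blocks` blocks."""
    per_block = input_dim * units + units * units + units  # kernel + recurrent kernel + bias
    if recurrent_bias:
        per_block += units                                 # separate bias for the recurrent kernel
    return weight_blocks * per_block

input_dim, units = 8, 5
simple_rnn_params = rnn_param_count(input_dim, units, weight_blocks=1)
gru_params = rnn_param_count(input_dim, units, weight_blocks=3)
gru_params_reset_after = rnn_param_count(input_dim, units, weight_blocks=3, recurrent_bias=True)
lstm_params = rnn_param_count(input_dim, units, weight_blocks=4)

print(simple_rnn_params, gru_params, gru_params_reset_after, lstm_params)  # 70 210 225 280
```

So for the same number of units, an LSTM carries roughly a third more weights than a GRU, and four times as many as a simple RNN.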

<br>

**Resources used**
- Helpful in grasping the maths behind GRU, [Understanding GRU networks](https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be)
- https://www.kaggle.com/code/thebrownviking20/intro-to-recurrent-neural-networks-lstm-gru/notebook


## **RNNs in code**

The LSTM layer is available as `tf.keras.layers.LSTM` ([docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)). Likewise, the simple RNN layer is available as `tf.keras.layers.SimpleRNN`.

<br>

Worth noting:
- We can directly pass the output of an embedding layer to an LSTM layer, without adding a `Flatten` or `GlobalAveragePooling1D` layer in between.

<br>

**Further resources:**   
The [recurrent neural network (RNN) with keras guide](https://www.tensorflow.org/guide/keras/rnn) is a really good supplementary introduction to RNNs.

Notes from the guide:
- 3 built-in RNN layers: `SimpleRNN`, `GRU`, `LSTM`
- An RNN layer can process input sequences in reverse (`go_backwards`).
- The layers support recurrent dropout (`recurrent_dropout`).
- By default a layer returns the output at the last time step only, but it can be configured to return the full sequence of outputs, one per time step (`return_sequences`).
- The layer can also be configured to return its final internal state vectors (`return_state`).
- Likewise, we can set the initial state of the RNN layer.

<br>

**Difference between a layer and a cell**
- "*the RNN cell process only a single timestep*", whereas the layer wraps a cell and loops it over the whole input sequence.

### **Import Dependencies**

Import TensorFlow and NumPy.

In [1]:
import tensorflow as tf
import numpy as np

print(tf.__version__)

2.8.2


### **Preparing text for SimpleRNN, LSTM and GRU layers**

A simplified look at using the SimpleRNN, LSTM and GRU layers.
In this section, the layers take their inputs from an Embedding layer. We then look at the output of each layer for a given embedding.

<br>

**Note**   
We will not train a full network yet. This is just to gain an understanding of what each layer outputs when we change its parameters.

<br>

**Prepare text**   
We begin by preparing some text: tokenize and pad.


In [2]:
# sample text
# The great pretender - The Platters
lyrics = ["Oh-oh, yes, I'm the great pretender",
          "Pretending that I'm doing well",
          "My need is such I pretend too much",
          "I'm lonely, but no one can tell",
          "Oh-oh, yes, I'm the great pretender",
          "Adrift in a world of my own",
          "I've played the game but to my real shame",
          "You've left me to grieve all alone",
          "Too real is this feeling of make-believe",
          "Too real when I feel what my heart can't conceal",
          "Yes, I'm the great pretender",
          "Just laughin' and gay like a clown",
          "I seem to be what I'm not, you see",
          "I'm wearing my heart like a crown",
          "Pretending that you're still around"
          "Too real is this feeling of make-believe",
          "Too real when I feel what my heart can't conceal",
          "Yes, I'm the great pretender",
          "Just laughin' and gay like a clown",
          "I seem to be what I'm not, you see",
          "I'm wearing my heart like a crown",
          "Pretending that you're still around (still around)"]


In [3]:
# tokenize and pad the text

# note: these preprocessing utilities are deprecated as of TensorFlow 2.9.1
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# define a tokenizer and fit it to the text
The_great_pretender = Tokenizer(num_words=150, oov_token="<OOV>")
The_great_pretender.fit_on_texts(lyrics)


In [4]:
# display the word index
word_index = The_great_pretender.word_index
print(word_index)

{'<OOV>': 1, "i'm": 2, 'my': 3, 'the': 4, 'i': 5, 'a': 6, 'real': 7, 'oh': 8, 'yes': 9, 'great': 10, 'pretender': 11, 'too': 12, 'to': 13, 'what': 14, 'heart': 15, 'like': 16, 'pretending': 17, 'that': 18, 'is': 19, 'of': 20, 'still': 21, 'but': 22, 'this': 23, 'feeling': 24, 'make': 25, 'believe': 26, 'when': 27, 'feel': 28, "can't": 29, 'conceal': 30, 'just': 31, "laughin'": 32, 'and': 33, 'gay': 34, 'clown': 35, 'seem': 36, 'be': 37, 'not': 38, 'you': 39, 'see': 40, 'wearing': 41, 'crown': 42, "you're": 43, 'around': 44, 'doing': 45, 'well': 46, 'need': 47, 'such': 48, 'pretend': 49, 'much': 50, 'lonely': 51, 'no': 52, 'one': 53, 'can': 54, 'tell': 55, 'adrift': 56, 'in': 57, 'world': 58, 'own': 59, "i've": 60, 'played': 61, 'game': 62, 'shame': 63, "you've": 64, 'left': 65, 'me': 66, 'grieve': 67, 'all': 68, 'alone': 69, 'aroundtoo': 70}
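To get a feel for where these indices come from, here is a rough pure-Python sketch of how a word index like this could be built: lowercase the text, replace punctuation with spaces, count words, then number them from most to least frequent, with the OOV token taking index 1. The `FILTERS` string is what I believe the Keras default to be (note it has no apostrophe, which is why tokens like `i'm` survive), and `build_word_index` is a hypothetical helper, not the Keras implementation.

```python
from collections import Counter

# Punctuation replaced by spaces (assumed to match the Keras Tokenizer default filters).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer, most frequent words first, OOV token at index 1."""
    to_space = str.maketrans(FILTERS, " " * len(FILTERS))
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(to_space).split())
    # Stable sort: words with equal counts keep their first-seen order.
    ordered = [word for word, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
    return {word: i for i, word in enumerate([oov_token] + ordered, start=1)}

# Two adjacent string literals with no comma between them are joined by Python,
# so this list has 2 items, not 3: the second item is "yesNo".
texts = ["Oh-oh, yes",
         "yes"
         "No"]
index = build_word_index(texts)
print(len(texts))  # 2
print(index)       # {'<OOV>': 1, 'oh': 2, 'yes': 3, 'yesno': 4}
```

The same implicit string concatenation is the likely source of the `'aroundtoo'` token in the word index above: one line of the `lyrics` list is missing its trailing comma, so it was merged with the following line before tokenizing.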


In [5]:
# convert the lyrics to sequences
lyrics_sequence = The_great_pretender.texts_to_sequences(lyrics)

length_of_sequence = []
for sequence in lyrics_sequence:
  print(sequence)
  break

[8, 8, 9, 2, 4, 10, 11]


In [6]:
# display the average length of each sequence
length_of_sequence = []
for sequence in lyrics_sequence:
  length_of_sequence.append(len(sequence))


#src: https://www.geeksforgeeks.org/find-average-list-python/
def Average(lst):
    return sum(lst) / len(lst)
  

print(Average(length_of_sequence))

7.619047619047619


In [7]:
# Apply padding to the sequences
lyrics_sequence_padded = pad_sequences(lyrics_sequence, maxlen=7, padding='pre',
                                      truncating='post')
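
As a sanity check on what `padding='pre'` and `truncating='post'` do, here is a small pure-Python sketch of the same idea. The helper `pad_pre_truncate_post` is hypothetical and only covers this one combination of options.

```python
def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Pad short sequences at the front and cut long sequences at the end."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncating='post': drop tokens from the end
        padded.append([value] * (maxlen - len(seq)) + seq)  # padding='pre': zeros go in front
    return padded

example = [
    [1, 2, 3],                       # shorter than maxlen, gets padded
    [8, 8, 9, 2, 4, 10, 11],         # exactly maxlen, left unchanged
    [1, 2, 3, 4, 5, 6, 7, 8, 9],     # longer than maxlen, gets truncated
]
result = pad_pre_truncate_post(example, maxlen=7)
for row in result:
    print(row)
```

With an average sequence length of about 7.6 tokens, `maxlen=7` means some of the longer lyric lines lose their last token or two.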


### **Define an embedding layer**

Create an embedding from the padded and tokenized sequences.

I'll probably not use this in later parts of the notebook, but it is still good practice to remember how to prepare text and use word embeddings.

In [8]:
# define an embedding layer
Embedding = tf.keras.layers.Embedding(input_dim=150, output_dim=4, input_length=7)

In [9]:
# pass the last sequence from lyrics_sequence_padded to the embedding layer
# (it already has exactly 7 tokens, so padding leaves it unchanged)
sequence = lyrics_sequence_padded[-1]
output = Embedding(sequence)
print(output)

tf.Tensor(
[[-0.01425667  0.03933812 -0.00957012 -0.0029873 ]
 [ 0.0455013  -0.00408681 -0.01951443  0.04865884]
 [ 0.02904112 -0.02296771  0.02500412  0.03320162]
 [-0.00348942 -0.03319496  0.0226015   0.02124378]
 [-0.00146838  0.0180063   0.02366808 -0.02330911]
 [-0.00348942 -0.03319496  0.0226015   0.02124378]
 [-0.00146838  0.0180063   0.02366808 -0.02330911]], shape=(7, 4), dtype=float32)


We now have a 4-dimensional vector representing each token in the sequence.

A thought on the embedding layer:
- I don't think the embedding layer is trainable. If its function is simply to convert the tokens into a vector representation, I can't really see why any optimization is needed.

Unless the scale or the way in which the conversion is done needs to be different to get better results. I'm not sure.

Looking into the claim above, *I don't think the embedding layer is trainable*:

<br>

**Is the embedding layer trainable?**

The short answer is that the embedding layer is trainable. It is not enough to just map each token to an arbitrary vector.
<br>

- At its core, the embedding layer maps positive integers (tokens) to dense vectors of fixed size (word embeddings).
- Ideally we would like words with similar semantic context or sentiment to have similar vector representations, so words like man and woman should have fairly similar vectors. Likewise, words like apple, orange, banana and pear should fall within the same general cluster in the n-dimensional space of our word embedding.
- We would not be able to achieve this by randomly mapping our tokens to vectors. Hence the vector representations need to be optimized during training, so that similar words end up close together in the embedding space.
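
One way to picture this is to treat the embedding layer as a lookup table of weights: the forward pass picks rows, and a training step nudges exactly the rows that were picked. The NumPy sketch below assumes a made-up toy loss and a plain gradient descent step, purely to show which rows move; it is not how Keras implements the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 150, 4
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))  # random starting vectors
before = table.copy()

tokens = np.array([17, 18, 43, 21, 44, 21, 44])
vectors = table[tokens]          # the forward pass is just a row lookup
print(vectors.shape)             # (7, 4)

# Toy loss (hypothetical): 0.5 * sum(vectors ** 2), so d(loss)/d(vectors) = vectors.
grad_rows = vectors
learning_rate = 0.1
# Unbuffered update, so tokens that appear twice receive both gradient contributions.
np.subtract.at(table, tokens, learning_rate * grad_rows)

changed_rows = np.flatnonzero(np.any(table != before, axis=1))
print(changed_rows)              # only the rows for tokens that appeared in the input
```

Every other row of the table is left untouched, which is why words only get useful vectors if they actually show up in the training data.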

<br>

**Resources**
- https://stats.stackexchange.com/questions/324992/how-the-embedding-layer-is-trained-in-keras-embedding-layer
- https://www.youtube.com/watch?v=5MaWmXwxFNQ

<br>

**More questions**
- What is word2Vec, skip-gram?

### **Using the SimpleRNN Layer**

**Quick recap**   
SimpleRNN is the basic implementation of a recurrent neural network (a class of neural networks able to handle sequential data such as text, time series, speech, etc.).   
At each time step, the SimpleRNN outputs a prediction and a state vector. The state vector from the previous time step is then used as an additional input when calculating the prediction at the next time step.

**Limitations**   
- SimpleRNN is limited in its capacity to capture dependencies spanning long periods of time.
- Vanishing gradient problem

<br>

**Resources**
- https://machinelearningmastery.com/an-introduction-to-recurrent-neural-networks-and-the-math-that-powers-them/
- https://machinelearningmastery.com/understanding-simple-recurrent-neural-networks-in-keras/

<br>

**Question**
- Is the state vector the same as the output produced at each time step?

**Define a basic implementation of SimpleRNN**

In [10]:
# input sequence
sample_sequence = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# define an embedding layer
embedding_layer = tf.keras.layers.Embedding(input_dim=100, output_dim=8,
                                            input_length=4)

embedding_output = embedding_layer(sample_sequence)

print(f"Input_sequence: \n{sample_sequence}")
print(f"\nembedding_output: \n{embedding_output}")


Input_sequence: 
[[1 2 3 4]
 [5 6 7 8]]

embedding_output: 
[[[-0.04436916 -0.046588   -0.03033881 -0.04012107 -0.04228991
   -0.01611384 -0.0452862  -0.03635208]
  [-0.03163145  0.028487   -0.00798888  0.01625302 -0.00956733
    0.02072776 -0.00356535 -0.00335278]
  [-0.0281114   0.04394294 -0.00869752 -0.01702369  0.04539898
    0.00076826 -0.00525911 -0.01790025]
  [-0.02550764 -0.02603803  0.0375783  -0.04038745 -0.02591337
    0.03442175  0.04278809 -0.03333489]]

 [[ 0.01197816 -0.03638438 -0.04979093  0.00194967 -0.00377122
   -0.01964084 -0.01276795  0.04086879]
  [ 0.00198022 -0.03651668  0.00650264  0.0292205   0.02930614
   -0.00053935  0.01651588  0.0496464 ]
  [-0.02612044 -0.00718763 -0.03799029 -0.00346025 -0.03092719
   -0.04868176  0.02475161  0.02923426]
  [ 0.01994821  0.01585681 -0.03392265  0.00387371  0.02482668
    0.03193602 -0.01098223  0.03082924]]]


In [11]:
# basic SimpleRNN
simpleRNN_layer_0 = tf.keras.layers.SimpleRNN(units=5)
simpleRNN_layer_0_output = simpleRNN_layer_0(embedding_output)

print(f"Output of simpleRNN: {simpleRNN_layer_0_output}")

Output of simpleRNN: [[ 0.14154036 -0.03116032 -0.02066625 -0.11908543  0.01602411]
 [-0.16496156 -0.02635418 -0.01401711 -0.0731071   0.07373834]]


Output from the 5 units in the SimpleRNN layer, for the 2 samples in the batch.

**SimpleRNN with return sequence set to True**

In [12]:
# SimpleRNN with return sequence set to True
simpleRNN_layer_1 = tf.keras.layers.SimpleRNN(units=7, return_sequences=True)
simpleRNN_layer_1_output = simpleRNN_layer_1(embedding_output)

print(f"Output of simpleRNN \n: {simpleRNN_layer_1_output}")


Output of simpleRNN: [[[-0.00791446  0.00531016  0.02833577 -0.10240722  0.04435795
    0.04338466 -0.03927782]
  [-0.02294649  0.05608001  0.06695591  0.01919121 -0.07094719
    0.05240041 -0.01179952]
  [ 0.00049517 -0.00886351  0.03744813  0.07296041  0.05648455
    0.01337563  0.06164669]
  [-0.08059626 -0.07067633 -0.02609253 -0.01824237  0.13309442
    0.08805474 -0.02784048]]

 [[ 0.00964613 -0.00414358 -0.01301007 -0.0483936  -0.04030683
    0.01482623  0.00629307]
  [ 0.04720924  0.01849745 -0.07449772  0.0350161  -0.01889437
   -0.00786987  0.01177997]
  [ 0.05065968 -0.05938216 -0.0648331  -0.05308506 -0.01010802
    0.04277499 -0.01339542]
  [ 0.02397647  0.03595821 -0.08856503  0.00908809 -0.12199629
    0.00020513  0.00293914]]]


Output from a SimpleRNN layer containing 7 units, for an input with a sequence length of 4.   
For each sample in the batch, it produces 7 values at each of the 4 time steps. Hence the final shape is [2, 4, 7].   
The final output for each sample in the batch is the last element along the time axis.

In [14]:
print(f"final output for the first sample in the batch: \n {simpleRNN_layer_1_output[0][-1]}")
print(f"\nfinal output for the second sample in the batch: \n {simpleRNN_layer_1_output[1][-1]}")


final output for the first sample in the batch: 
 [-0.08059626 -0.07067633 -0.02609253 -0.01824237  0.13309442  0.08805474
 -0.02784048]
final output for the second sample in the batch: 
 [ 0.02397647  0.03595821 -0.08856503  0.00908809 -0.12199629  0.00020513
  0.00293914]


**SimpleRNN with return state set to True**

In [16]:
# SimpleRNN with return state set to True
simpleRNN_layer_2 = tf.keras.layers.SimpleRNN(units=11, return_state=True)
simpleRNN_layer_2_output, final_state = simpleRNN_layer_2(embedding_output)


print(f"SimpleRNN_layer_2_output final output: \n{simpleRNN_layer_2_output}")
print(f"\nSimpleRNN_layer_2_output final state: \n{final_state}")

SimpleRNN_layer_2_output final output: 
[[-0.06191324 -0.04811127 -0.07531374 -0.02578576 -0.09219531  0.00095944
  -0.04507505 -0.05979072  0.045462    0.00225019  0.16151163]
 [-0.02583531  0.10042801  0.02006783 -0.00474388  0.08194724  0.06866762
   0.02788192 -0.04119573  0.02103317 -0.05415137  0.02626269]]

SimpleRNN_layer_2_output final state: 
[[-0.06191324 -0.04811127 -0.07531374 -0.02578576 -0.09219531  0.00095944
  -0.04507505 -0.05979072  0.045462    0.00225019  0.16151163]
 [-0.02583531  0.10042801  0.02006783 -0.00474388  0.08194724  0.06866762
   0.02788192 -0.04119573  0.02103317 -0.05415137  0.02626269]]


***It looks like the final states are the same as the outputs***. So this answers the question of whether the state vector is the same as the output: for SimpleRNN, it looks like it is.



**SimpleRNN with return sequence and state set to True**

In [18]:
# SimpleRNN with return states and return sequence set to True

simpleRNN_layer_3 = tf.keras.layers.SimpleRNN(units=5, return_sequences=True, return_state=True)
simpleRNN_layer_3_output, final_state = simpleRNN_layer_3(embedding_output)

print(f"SimpleRNN_layer_3_output at each time step: \n {simpleRNN_layer_3_output}")
print(f"\nSimpleRNN_layer_3_final state: \n{final_state}")

SimpleRNN_layer_3_output at each time step: 
 [[[ 0.04475474  0.08183377  0.03800399 -0.00740108 -0.0329938 ]
  [-0.02409583 -0.06263659  0.0736509  -0.05587266 -0.03042615]
  [ 0.01214191 -0.01820896 -0.00401788  0.00358409  0.12225598]
  [ 0.09843498  0.02812253 -0.02506265 -0.04060108  0.04456408]]

 [[-0.01349652  0.03701126  0.02167615 -0.02576507 -0.03183632]
  [-0.08951548 -0.05968942 -0.0057901   0.01342212  0.02970805]
  [ 0.05061521  0.10648205 -0.0067666  -0.09623881 -0.02535432]
  [-0.03010683 -0.16157256  0.003481    0.00620044 -0.00710471]]]

SimpleRNN_layer_3_final state: 
[[ 0.09843498  0.02812253 -0.02506265 -0.04060108  0.04456408]
 [-0.03010683 -0.16157256  0.003481    0.00620044 -0.00710471]]


Yeah, it looks like the final state is the same as the output at the last time step.

<br>

Question: why are states not passed between samples in a batch?
- Looking at the output, it seems that the outputs are the same as the states. For a batch containing multiple samples, which we assume are independent of each other, it would not make sense to use the last output from the previous sample as the initial state for the next one.


### **Using the LSTM layer**

**Define a basic LSTM layer**

In [19]:
# pass the output of the embedding layer to the LSTM layer
LSTM = tf.keras.layers.LSTM(units=4)

embedding_output = Embedding(np.array([sequence]))
LSTM_output = LSTM(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nEmbedding layer output: ")
print(embedding_output)
print("\nLSTM layer output: ")
print(LSTM_output)

Input: [[17 18 43 21 44 21 44]]

Embedding layer output: 
tf.Tensor(
[[[-0.01425667  0.03933812 -0.00957012 -0.0029873 ]
  [ 0.0455013  -0.00408681 -0.01951443  0.04865884]
  [ 0.02904112 -0.02296771  0.02500412  0.03320162]
  [-0.00348942 -0.03319496  0.0226015   0.02124378]
  [-0.00146838  0.0180063   0.02366808 -0.02330911]
  [-0.00348942 -0.03319496  0.0226015   0.02124378]
  [-0.00146838  0.0180063   0.02366808 -0.02330911]]], shape=(1, 7, 4), dtype=float32)

LSTM layer output: 
tf.Tensor([[ 0.01005772  0.00248262 -0.00588322 -0.00934658]], shape=(1, 4), dtype=float32)


So what has happened?
- It looks like the number of units corresponds to the size of the output, so 1 unit would produce a shape of (1, 1) and 4 units would produce a shape of (1, 4). The leading 1 is the batch size.
- Recap: the output of the embedding layer is an n-dimensional vector for each token in the sequence.


**Set return_sequences to True**

In [20]:
# lstm layer with return sequence = True
LSTM_return_sequence = tf.keras.layers.LSTM(units=4, return_sequences=True)
LSTM_return_sequence_output = LSTM_return_sequence(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nLSTM layer output: ")
print(LSTM_return_sequence_output)

Input: [[17 18 43 21 44 21 44]]

LSTM layer output: 
tf.Tensor(
[[[ 0.00025353  0.00134016 -0.00142952 -0.00233571]
  [ 0.00298489 -0.00426727  0.00681078 -0.00389279]
  [ 0.00747593 -0.00740302  0.00727068 -0.00465844]
  [ 0.00612443 -0.00683046  0.00517538 -0.00313806]
  [ 0.0091947  -0.00362551  0.00142917 -0.00327655]
  [ 0.00676546 -0.00364445  0.00064056 -0.00262688]
  [ 0.00902891 -0.0010586  -0.0021648  -0.00334409]]], shape=(1, 7, 4), dtype=float32)


With `return_sequences` set to True, the layer iterates through the sequence and returns an output for each token. As there are 7 tokens in the sequence, it returns 7 outputs, and since we have 4 units in our LSTM, each output contains 4 values.

I still don't fully understand what the output means.

**Understanding the output of an LSTM layer**

Notes from machine learning mastery.
- The LSTM is a class of recurrent neural network which contains internal gates (learn, forget, remember and use gates).
- This class of recurrent neural network was designed to mitigate the vanishing gradient problem, which lets it capture longer temporal dependencies.
- An LSTM layer can be defined with n LSTM cells (units). Each cell
  - keeps an internal cell state, *c*
  - outputs a hidden state, *h*

The output of an LSTM is the hidden state *h*.

<br>

By setting 
- **return_sequences to True**, we are able to access the hidden state at each time step in our sequence.
- **return_state to True**, the layer also returns the final hidden state *h* and the final internal cell state *c* of each cell in the LSTM layer.

We would typically set `return_state=True` when we want to initialize the states of another LSTM layer with the same number of cells.

<br>

Cool...

But this has still not answered the question of what the internal cell state and the hidden state actually are.


<br>

**Resources**
- https://stackoverflow.com/questions/67970519/what-does-tensorflow-lstm-return

- https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
 

**Setting return state and sequences to True**

In [21]:
# Example used in stackoverflow question
# Set return_sequence and return_states to True

# input shape: Batch size, length of sequence, embedding dimension
tensor = tf.random.normal(shape=[2, 2, 2])
lstm = tf.keras.layers.LSTM(units=4, return_sequences=True, return_state=True)
hidden_state_at_each_time_step, final_hidden_state, final_cell_state = lstm(tensor)

print("Input: {}".format(tensor))
print("\nHidden_state_at_each_time_step (output at each time step):\n", hidden_state_at_each_time_step)
print("\nFinal hidden state (output): \n", final_hidden_state)
print("\nFinal cell state: \n", final_cell_state)

Input: [[[-1.071289    1.987392  ]
  [-0.09516488 -1.952016  ]]

 [[ 1.2150242   0.78332025]
  [-1.4531772  -0.29592833]]]

Hidden_state_at_each_time_step (output at each time step):
 tf.Tensor(
[[[-0.1533716   0.05992407 -0.41533175 -0.09892309]
  [ 0.00826846  0.00213334 -0.07077234 -0.10955881]]

 [[-0.16051902 -0.02591506 -0.04823894  0.09310576]
  [ 0.02331617  0.00899057 -0.05557375 -0.06233778]]], shape=(2, 2, 4), dtype=float32)

Final hidden state (output): 
 tf.Tensor(
[[ 0.00826846  0.00213334 -0.07077234 -0.10955881]
 [ 0.02331617  0.00899057 -0.05557375 -0.06233778]], shape=(2, 4), dtype=float32)

Final cell state: 
 tf.Tensor(
[[ 0.02242753  0.00674389 -0.25265494 -0.16699997]
 [ 0.04713709  0.03350394 -0.11022087 -0.12426493]], shape=(2, 4), dtype=float32)


**WTF...**

**Breakdown of the output**


For an input of shape `[2, 2, 2]`, we have a batch of 2 sentences, each with 2 tokens, and each token is represented by a 2-dimensional embedding. For example:
``` python
[[[-0.4942633,   1.0938902 ], [-0.27461976, -0.2292373 ]],
 [[-0.34839615, -0.39708054], [ 0.21653225,  2.0741708 ]]]
```
Where `[-0.4942633   1.0938902 ]`, for example, is our word embedding for a single token.

So an example input that could create the above word embedding is
``` python
example_token_sequences = [[10, 11], [12, 13]]

# 10 -> [-0.4942633   1.0938902 ]
# 11 -> [-0.27461976 -0.2292373 ]
# 12 -> [-0.34839615 -0.39708054]
# 13 -> [ 0.21653225  2.0741708 ]

# decoded even further, as another example
example_sentences = ["hello you", "Good food"]

# hello -> 10
# you -> 11
# Good -> 12
# food -> 13
```

With `return_sequences` and `return_state` set to True, we have
- The hidden state $h_{t}$ returned at each time step.
- The final hidden state $h$ returned.
- The final cell state $c$ returned.

<br>

**Note**   
- Looking carefully at the output, you should see that the final hidden state is just the hidden state at the last time step of each sequence.
- ***The final cell state is different from the final hidden state***

In [22]:
# let's try passing a hand-made embedding to an LSTM layer and view what the output is

# single batch containing a single sample with a sequence length of 4 
# and an embedding dimension of 6 
test_embedding = np.array([[[1, 2, 3, 4, 5, 6],
                            [7, 8, 9, 10, 11, 12],
                            [13, 14, 15, 16, 17, 18],
                            [19, 20, 21, 22, 23, 24]]], dtype=np.float32)
LSTM_test_2 = tf.keras.layers.LSTM(2)

test_sequence_output = LSTM_test_2(test_embedding)
print("Results: {}".format(test_sequence_output))


Results: [[-9.9916792e-01 -1.8543489e-16]]


**LSTM with go_backwards set to True. This processes the input sequence backwards (and, when return_sequences is True, returns the output sequence in reversed order)**

In [23]:
# set go_backwards to True
LSTM_test_3 = tf.keras.layers.LSTM(2, go_backwards=True)

test_sequence_output_1 = LSTM_test_3(test_embedding)

print("Results: {}".format(test_sequence_output_1))


Results: [[0.08762532 0.00098983]]
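
My understanding is that processing a sequence backwards amounts to flipping the time axis and then running the layer as usual. The NumPy sketch below illustrates that idea with a bias-free simple RNN rather than an LSTM, to keep it short; the function name and random weights are my own and this is not the Keras implementation.

```python
import numpy as np

def run_simple_rnn(x, W_x, W_h, go_backwards=False):
    """Return the final state of a bias-free simple RNN over x (batch, time, features)."""
    if go_backwards:
        x = x[:, ::-1, :]          # flip the time axis, then process as usual
    h = np.zeros((x.shape[0], W_h.shape[0]))
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ W_x + h @ W_h)
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 6))               # 1 sample, 4 time steps, 6 features
W_x = rng.normal(scale=0.3, size=(6, 2))
W_h = rng.normal(scale=0.3, size=(2, 2))

forward = run_simple_rnn(x, W_x, W_h)
backward = run_simple_rnn(x, W_x, W_h, go_backwards=True)
print(forward)
print(backward)
```

The two results differ because the final state is dominated by whichever token was read last: the last token of the sentence when going forwards, the first token when going backwards.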


**LSTM with return_state set to True. Returns the final hidden state and the final internal cell state in addition to the output**

In [31]:
# set return state to True
LSTM_test_5 = tf.keras.layers.LSTM(2, return_state=True)

Output, final_hidden_cell_state, final_internal_cell_state = LSTM_test_5(test_embedding)

print("Final output: \n{}".format(Output))
print("\nFinal hidden cell state: \n{}".format(final_hidden_cell_state))
print("\nFinal internal cell state: \n{}".format(final_internal_cell_state))

Final output: 
[[ 0.5078768  -0.99681455]]

Final hidden cell state: 
[[ 0.5078768  -0.99681455]]

Final internal cell state: 
[[ 0.55986565 -3.22036   ]]


### **A quick note on the dimensions.**

For a given input shape [X, Y, Z], X is the batch dimension and determines the number of elements at the outermost level of the array. So if
- X = 2, we would have an array containing 2 elements: $[1, 2]$
- X = 5, we would have 5 elements in the array: $[1, 2, 3, 4, 5]$

The next dimension, Y, determines the number of elements within each of those outer elements. So if our shape is
- [1, 2, Z], we would have an array structured like this: $[ [[], []] ]$

Admittedly it looks confusing, but the array contains a single element, which in turn contains 2 elements.

Stacking that further, the 3rd dimension determines the number of elements within each of the innermost elements. For example, an array with a shape of

- [1, 2, 3] could look like this: [[[1, 2, 3], [4, 5, 6]]]
- [3, 4, 6] could look like this:

```
[[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]],
[[11, 12, 13, 14, 15, 16], [17, 18, 19, 20, 21, 22], [23, 24, 25, 26, 27, 28], [29, 30, 31, 32, 33, 34]],
[[21, 22, 23, 24, 25, 26], [27, 28, 29, 30, 31, 32], [33, 34, 35, 36, 37, 38], [39, 40, 41, 42, 43, 44]]]
```
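
The same idea can be checked quickly in NumPy. This is just an illustrative array filled with consecutive integers, read as (batch, time steps, embedding dimension).

```python
import numpy as np

# 3 samples, each with 4 time steps, each time step holding 6 values
batch = np.arange(3 * 4 * 6).reshape(3, 4, 6)

print(batch.shape)     # (3, 4, 6)
print(len(batch))      # 3 -> number of samples in the batch
print(batch[0].shape)  # (4, 6) -> one sample: 4 time steps of 6 values
print(batch[0, 1])     # [ 6  7  8  9 10 11] -> second time step of the first sample
```

Indexing from the left peels off one dimension at a time: first the sample, then the time step, then the individual value.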



### **Using the GRU layer**
Link to documentation: https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU

Some notes from the documentation:
- There are 2 variants of the GRU implementation:
  - One based on v3 of the GRU paper, which has its reset gate applied to the hidden state before the matrix multiplication.
  - One based on the original version of the paper, which has the reset gate applied to the hidden state after the matrix multiplication (`reset_after=True`). This variant is compatible with CuDNNGRU (GPU-only) and also allows inference on CPU.



**Simple GRU**

In [36]:
# taken from the documentation example

# Define a batch containing 5 samples, each with a length of 4 and an embedding
# dimension of 3
fake_embedding = tf.random.normal([5, 4, 3])
Simple_GRU_layer_0 = tf.keras.layers.GRU(units=4)

Simple_GRU_layer_0_output = Simple_GRU_layer_0(fake_embedding)
print(f"Fake embedding: \n {fake_embedding}")
print(f"\n Output of simple GRU layer:\n {Simple_GRU_layer_0_output}")

Fake embedding: 
 [[[ 0.02024975 -0.5242584   0.41571057]
  [-0.54333377 -0.1240209  -1.0034183 ]
  [ 1.548015   -1.1159978   0.69902265]
  [-0.01268816  1.304217    1.0540898 ]]

 [[ 0.4442007  -0.27238685  1.398242  ]
  [-0.5863187   0.79228     1.1279312 ]
  [ 0.14102115  0.5674381   0.7040402 ]
  [-0.95412934 -0.04099366  1.5230407 ]]

 [[-1.0843962   1.3383615   0.8556613 ]
  [-0.32472858  1.8674307   0.7401377 ]
  [-1.2507063   0.5137581  -1.6594754 ]
  [ 0.50820273 -0.5033321   0.49034518]]

 [[-1.3087839   0.7153914  -0.23905785]
  [-0.78726137 -1.8308866  -0.03049529]
  [ 1.9921625  -0.18866095  1.0221355 ]
  [-0.19604047 -1.2710748   0.6273832 ]]

 [[-0.15179242 -0.38491833  0.33628368]
  [-1.0579001  -0.03693665 -0.46901795]
  [ 1.145693   -0.4075045  -0.8039029 ]
  [-0.59372663  0.3016597  -0.4172142 ]]]

 Output of simple GRU layer:
 [[-0.2949918  -0.0547712  -0.02378956 -0.16269396]
 [-0.56058127 -0.19093616  0.16273935 -0.6636913 ]
 [-0.05660305  0.12945607  0.11812124  

The output of a GRU layer containing 4 units, for a batch containing 5 samples.

**Simple GRU layer with return state set to True**

In [42]:
# just for bants, I'll try to pass a batch containing 2 samples, each with an embedding dimension of 5, but with different lengths

Simple_GRU_layer_1 = tf.keras.layers.GRU(units=4, return_state=True)

try:
  fake_embedding = np.array([[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10],
                            [11, 12, 13, 14, 15]],
                            [20, 21, 22, 23, 24]], dtype=np.float32)
  
  Simple_GRU_layer_1_output = Simple_GRU_layer_1(fake_embedding)

except Exception as e:
  print(f"Error: {e}")
  fake_embedding = np.array([[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10],
                            [11, 12, 13, 14, 15]],
                            [[20, 21, 22, 23, 24], [25, 26, 27, 28, 29],
                            [30, 31, 32, 33, 34]]], dtype=np.float32)
  
  Simple_GRU_layer_1_output, Simple_GRU_layer_1_state = Simple_GRU_layer_1(fake_embedding)

print(f"\nFake embedding: \n {fake_embedding}")
print(f"\n Output of simple GRU layer:\n {Simple_GRU_layer_1_output}")
print(f"\n Simple GRU layer final cell state:\n {Simple_GRU_layer_1_state}")


Error: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Fake embedding: 
 [[[ 1.  2.  3.  4.  5.]
  [ 6.  7.  8.  9. 10.]
  [11. 12. 13. 14. 15.]]

 [[20. 21. 22. 23. 24.]
  [25. 26. 27. 28. 29.]
  [30. 31. 32. 33. 34.]]]

 Output of simple GRU layer:
 [[ 1.5408228e-01  9.9987435e-01  2.8782127e-02 -1.4678789e-02]
 [ 7.1525574e-07  1.0000000e+00 -1.9500955e-04  1.6018833e-04]]

 Simple GRU layer final cell state:
 [[ 1.5408228e-01  9.9987435e-01  2.8782127e-02 -1.4678789e-02]
 [ 7.1525574e-07  1.0000000e+00 -1.9500955e-04  1.6018833e-04]]


Consistent with the previous trend, the final state is the same as the output of the GRU layer.

So far it seems the final hidden state is the same as the output of the RNN layer, with only the LSTM layer providing an additional internal cell state. I don't think there would be any benefit in going through the process of viewing the output with return_sequences and go_backwards set to True.
 

## **Summary on RNNs**

**SimpleRNN**   
SimpleRNN is the basic implementation of an RNN. It features recurrent connections, in which the outputs from previous time steps are fed back in as additional inputs for the next time steps.

While a SimpleRNN is able to capture temporal relations in sequential data, the range over which it is able to retain them is limited. It is good at retaining only the latest information and not earlier information, so information introduced at the beginning of a long sequence is eroded after multiple time steps.

Similarly, it suffers from vanishing gradients on long sequences: the gradients propagated back towards the start of the sequence decay over the many backward steps, so the earliest time steps contribute almost nothing to learning.
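
A small numerical sketch can make the vanishing gradient a bit more tangible. It assumes a stripped-down recurrence $h_t = \tanh(W_h h_{t-1})$ with no inputs, and deliberately rescales the random recurrent weights so their largest singular value is 0.5; both choices are mine, made so the decay is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
units, time_steps = 8, 50

W_h = rng.normal(size=(units, units))
W_h *= 0.5 / np.linalg.norm(W_h, 2)    # rescale so the largest singular value is 0.5

h = rng.normal(size=units)
jacobian = np.eye(units)               # d h_t / d h_0, starts as the identity
norms = []
for t in range(time_steps):
    h = np.tanh(W_h @ h)
    step_jacobian = np.diag(1.0 - h ** 2) @ W_h   # d h_t / d h_{t-1}
    jacobian = step_jacobian @ jacobian           # chain rule back to the first step
    norms.append(np.linalg.norm(jacobian, 2))

print(norms[0])    # sensitivity of h_1 to h_0
print(norms[-1])   # sensitivity of h_50 to h_0
```

Each step multiplies the gradient by a factor of at most 0.5 here, so after 50 steps the influence of the first state on the last is effectively zero. With larger recurrent weights the opposite problem (exploding gradients) can appear instead.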

<br>

**LSTMS and GRU**   
The limitations of SimpleRNNs were addressed in LSTMs and GRUs, which feature gated recurrent connections. LSTMs and GRUs operate in the same way as a SimpleRNN, in the sense that the previous output is used as an additional input alongside the current input.   

Unlike a SimpleRNN, gated connections are used to determine how the previous output is used (what to retain/forget) and how to combine the previous output with the current input.   

The differences between LSTMs and GRUs are in the number of gates used and the number of state vectors produced at each time step. LSTMs are computationally more expensive and take longer to train, but they are often considered better suited to larger datasets with longer sequences. GRUs, in comparison, have fewer parameters to update and are quicker to train.

<br>

**Note**
- I'm still unsure about the functionality of each gate in the LSTM and GRU, which might be something to revisit later. Above I have tried to describe their functionality without going into detail on the individual gates.

<br>

**Using RNNs in Tensorflow**
- There are multiple options in TensorFlow that allow you to define the behaviour of the layer and its output.
- The input to an RNN layer is typically the output of an embedding layer. The output of the embedding can be passed directly, without needing a Flatten or GlobalAveragePooling1D layer.
- Using return_sequences returns the RNN output for each token in the sequence. This is useful when stacking multiple RNN layers together in a network.
- Using return_state returns the final state (states, for LSTM) of the RNN layer. The final states can then be used as initial states for another RNN layer in more complex architectures.
- Other options like go_backwards, dropout, recurrent_dropout, stateful and unroll exist to control the behaviour of an RNN layer (check out the docs).



### **Using Conv1D Layers**
