# **Udacity: Intro to TensorFlow for Deep Learning**
## **Lesson 10 NLP: Recurrent Neural Networks**

This lesson builds on what was covered in lesson 9. It introduces recurrent neural networks (RNNs), which are able to capture temporal dependencies, i.e. dependencies that span across time steps in a sequence.

This lesson covers
- Different RNNs
- Text generation using NLP models.


## **Simple RNNs**

Simple recurrent neural networks are networks which use the output from the previous time step as an additional input, alongside the current input.

For example:
- Inputs: $X_t$ and $Y_{t-1}$
- Output: $Y_t$

I'm a bit iffy on the exact calculation of the output.

But...

The general idea is that the output from the previous time step is fed back in as an additional input. The previous output is referred to as the state vector.
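
To make the calculation less fuzzy, here is a minimal NumPy sketch of the recurrence. It assumes the common formulation $Y_t = \tanh(X_t W_x + Y_{t-1} W_h + b)$; the weight names `W_x`, `W_h`, `b` and all of the sizes are made up for illustration and are not taken from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative sizes (assumptions, not from the lesson)
n_features, n_units, n_steps = 3, 4, 5

W_x = rng.normal(scale=0.1, size=(n_features, n_units))  # weights for the current input
W_h = rng.normal(scale=0.1, size=(n_units, n_units))     # weights for the previous output
b = np.zeros(n_units)

x = rng.normal(size=(n_steps, n_features))  # one sequence with 5 time steps
state = np.zeros(n_units)                   # Y_{t-1} before the first time step

outputs = []
for x_t in x:
    # the new output depends on the current input AND the previous output
    state = np.tanh(x_t @ W_x + state @ W_h + b)
    outputs.append(state)

outputs = np.stack(outputs)
print(outputs.shape)  # (5, 4): one output vector per time step
```

The only thing carried from one time step to the next is `state`, which is why a simple RNN has a single state vector.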

<br>

While a simple RNN is able to consider its past output when calculating a new output, it is limited in how far back it can relate dependencies. For dependencies which span a long period of time, a simple RNN struggles to relate them.

## **Long Short-Term Memory (LSTM)**

LSTMs were introduced to capture temporal dependencies which span longer periods of time. Unlike simple RNNs, which have a single state vector, LSTMs have 2 state vectors: a long term memory and a short term memory.

<br>

**Features of an LSTM**
- It's able to capture temporal dependencies spanning a long period of time
- It has 2 state vectors, long term and short term memory, which are used in calculating a new output
- LSTMs feature gates: forget, learn, remember and use gates.

<br>

**Workflow for an LSTM**
- At each time step, the LSTM has 2 state vectors: a long term memory and a short term memory
- The current time step input, alongside the 2 state vectors, is used to determine an output.
- The calculated output from the current time step is used as the short term memory for the next time step
- The long term memory from the previous time step is updated:
  - Any pieces of the previous long term memory which are no longer relevant are removed
  - Any new pieces of relevant information are added to the long term memory

<br>

**LSTM Gates**
- Forget gate: Determines which parts of the long term memory are no longer relevant and should be removed from memory.
- Learn gate: Learns new pieces of information using the current input and short term memory.
- Remember gate: Adds any relevant information that was learnt to the long term memory. The output of this gate is the new long term memory.
- Use gate: Uses the relevant parts of the long term memory and the newly learnt information to calculate an output. The output is also used as the new short term memory for the next time step.
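
The gates above can be sketched as a single LSTM time step in NumPy. This is a rough sketch using the standard LSTM equations, which map only loosely onto the forget/learn/remember/use naming from the lesson. The function name `lstm_step`, the weight names `W`, `U`, `b`, the sizes and the order in which the gates are sliced out are all assumptions for illustration (Keras stores its weights in its own order).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. h = short term memory, c = long term memory."""
    n = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b      # pre-activations for all four parts at once
    f = sigmoid(z[0 * n:1 * n])       # forget gate: what to drop from long term memory
    i = sigmoid(z[1 * n:2 * n])       # how much of the new information to let in
    g = np.tanh(z[2 * n:3 * n])       # candidate new information ("learn")
    o = sigmoid(z[3 * n:4 * n])       # output gate ("use")
    c_new = f * c_prev + i * g        # "remember": the updated long term memory
    h_new = o * np.tanh(c_new)        # the output, also the new short term memory
    return h_new, c_new

rng = np.random.default_rng(0)
n_features, n_units = 4, 3            # illustrative sizes
W = rng.normal(scale=0.1, size=(n_features, 4 * n_units))
U = rng.normal(scale=0.1, size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

h = np.zeros(n_units)                 # initial short term memory
c = np.zeros(n_units)                 # initial long term memory
for x_t in rng.normal(size=(7, n_features)):   # a sequence of 7 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (3,) (3,)
```

Both state vectors have one value per unit, and both are handed on to the next time step.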



## **Import Dependencies**

In [1]:
import tensorflow as tf
import numpy as np

print(tf.__version__)

2.8.2


## **RNNs and LSTMs in code**

The LSTM layer is available as `tf.keras.layers.LSTM` [docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM), and likewise the simple RNN layer as `tf.keras.layers.SimpleRNN`.

<br>

Worth noting:
- We can pass the output of an embedding layer directly to an LSTM layer, without adding a Flatten or GlobalAveragePooling1D layer in between.

<br>

**Further resources:**   
The [recurrent neural network (RNN) with keras guide](https://www.tensorflow.org/guide/keras/rnn) is a really good supplementary introduction to RNNs.

Notes from the guide:
- There are 3 built-in RNN layers: SimpleRNN, GRU and LSTM.
- RNN layers can process input sequences in reverse (`go_backwards`).
- They feature recurrent dropout (`recurrent_dropout`).
- By default a layer returns only the output at the last time step, but it can be configured to return the output for every time step instead (`return_sequences`).
- The layer can also be configured to return its final internal state vectors (`return_state`).
- Likewise, we can also set the initial state of the RNN layer (`initial_state`).


**Difference between a layer and a cell**
- "*the RNN cell process only a single timestep*", whereas an RNN layer processes whole batches of input sequences.

Below is a simplified look at using the SimpleRNN and LSTM layers. Both of these layers take their input from the Embedding layer, which returns an n-dimensional vector for each token in the sequence.

To take a look at the input and the resulting output, let's try it out on a simple dataset.

In [2]:
# sample text
# The great pretender - The Platters
lyrics = ["Oh-oh, yes, I'm the great pretender",
          "Pretending that I'm doing well",
          "My need is such I pretend too much",
          "I'm lonely, but no one can tell",
          "Oh-oh, yes, I'm the great pretender",
          "Adrift in a world of my own",
          "I've played the game but to my real shame",
          "You've left me to grieve all alone",
          "Too real is this feeling of make-believe",
          "Too real when I feel what my heart can't conceal",
          "Yes, I'm the great pretender",
          "Just laughin' and gay like a clown",
          "I seem to be what I'm not, you see",
          "I'm wearing my heart like a crown",
          # NOTE: there is no comma after the next line, so Python joins it with
          # the line after it (this is where 'aroundtoo' in the word index comes from)
          "Pretending that you're still around"
          "Too real is this feeling of make-believe",
          "Too real when I feel what my heart can't conceal",
          "Yes, I'm the great pretender",
          "Just laughin' and gay like a clown",
          "I seem to be what I'm not, you see",
          "I'm wearing my heart like a crown",
          "Pretending that you're still around (still around)"]


In [3]:
# tokenize and pad the text

# deprecated in version 2.9.1
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# define a tokenizer and fit it to the text
The_great_pretender = Tokenizer(num_words=150, oov_token="<OOV>")
The_great_pretender.fit_on_texts(lyrics)


In [4]:
# display the word index
word_index = The_great_pretender.word_index
print(word_index)

{'<OOV>': 1, "i'm": 2, 'my': 3, 'the': 4, 'i': 5, 'a': 6, 'real': 7, 'oh': 8, 'yes': 9, 'great': 10, 'pretender': 11, 'too': 12, 'to': 13, 'what': 14, 'heart': 15, 'like': 16, 'pretending': 17, 'that': 18, 'is': 19, 'of': 20, 'still': 21, 'but': 22, 'this': 23, 'feeling': 24, 'make': 25, 'believe': 26, 'when': 27, 'feel': 28, "can't": 29, 'conceal': 30, 'just': 31, "laughin'": 32, 'and': 33, 'gay': 34, 'clown': 35, 'seem': 36, 'be': 37, 'not': 38, 'you': 39, 'see': 40, 'wearing': 41, 'crown': 42, "you're": 43, 'around': 44, 'doing': 45, 'well': 46, 'need': 47, 'such': 48, 'pretend': 49, 'much': 50, 'lonely': 51, 'no': 52, 'one': 53, 'can': 54, 'tell': 55, 'adrift': 56, 'in': 57, 'world': 58, 'own': 59, "i've": 60, 'played': 61, 'game': 62, 'shame': 63, "you've": 64, 'left': 65, 'me': 66, 'grieve': 67, 'all': 68, 'alone': 69, 'aroundtoo': 70}


In [5]:
# convert the lyrics to sequences
lyrics_sequence = The_great_pretender.texts_to_sequences(lyrics)

# display the first sequence
print(lyrics_sequence[0])

[8, 8, 9, 2, 4, 10, 11]


In [6]:
# display the average sequence length
length_of_sequence = []
for sequence in lyrics_sequence:
  length_of_sequence.append(len(sequence))
# note: `sequence` still holds the last sequence after this loop and is reused in the cells below


#src: https://www.geeksforgeeks.org/find-average-list-python/
def Average(lst):
    return sum(lst) / len(lst)
  

print(Average(length_of_sequence))

7.619047619047619


In [7]:
# Apply padding to the sequences
lyrics_sequence_padded = pad_sequences(lyrics_sequence, maxlen=7, padding='pre',
                                      truncating='post')


In [8]:
# define an embedding layer
Embedding = tf.keras.layers.Embedding(input_dim=150, output_dim=4, input_length=7)

In [9]:
# pass a sequence to the embedding layer
# (`sequence` is the last, unpadded sequence left over from the loop above; it happens to have 7 tokens)
output = Embedding(np.array(sequence))
print(output)

tf.Tensor(
[[-0.0407775  -0.0149647  -0.04853504 -0.0362067 ]
 [ 0.0238638   0.0244017   0.03486432 -0.03801032]
 [ 0.04730498  0.04166961  0.03612883  0.00908053]
 [-0.02445186 -0.03272854 -0.0031684  -0.04062258]
 [-0.03434788 -0.01567726  0.01490529  0.02566185]
 [-0.02445186 -0.03272854 -0.0031684  -0.04062258]
 [-0.03434788 -0.01567726  0.01490529  0.02566185]], shape=(7, 4), dtype=float32)


We now have a 4-dimensional vector representation of each token in the sequence.

An interesting note (or maybe not)
- The embedding layer is trainable by default, but it has not been trained here, so the vectors above are just its random initial weights.
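
To make the embedding output more concrete: an embedding layer can be pictured as a lookup table with one row per token id. The NumPy sketch below uses a random `embedding_matrix` as a stand-in for the layer's weights (an assumption for illustration, not the values of the actual Keras layer), with the same vocabulary size, output dimension and token ids as above.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 150, 4
# stand-in for the layer's weight matrix: one row of 4 values per token id
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

token_ids = np.array([17, 18, 43, 21, 44, 21, 44])
vectors = embedding_matrix[token_ids]   # row lookup: one row per token

print(vectors.shape)  # (7, 4)

# the same token id always maps to the same row
print(np.array_equal(vectors[3], vectors[5]))  # True (both are token 21)
```

This is also why rows 4 and 6, and rows 5 and 7, are identical in the embedding output above: the tokens `still` and `around` are repeated in the sequence.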

In [10]:
# pass the output of the embedding layer to the LSTM layer
LSTM = tf.keras.layers.LSTM(units=4)

embedding_output = Embedding(np.array([sequence]))
LSTM_output = LSTM(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nEmbedding layer output: ")
print(embedding_output)
print("\nLSTM layer output: ")
print(LSTM_output)

Input: [[17 18 43 21 44 21 44]]

Embedding layer output: 
tf.Tensor(
[[[-0.0407775  -0.0149647  -0.04853504 -0.0362067 ]
  [ 0.0238638   0.0244017   0.03486432 -0.03801032]
  [ 0.04730498  0.04166961  0.03612883  0.00908053]
  [-0.02445186 -0.03272854 -0.0031684  -0.04062258]
  [-0.03434788 -0.01567726  0.01490529  0.02566185]
  [-0.02445186 -0.03272854 -0.0031684  -0.04062258]
  [-0.03434788 -0.01567726  0.01490529  0.02566185]]], shape=(1, 7, 4), dtype=float32)

LSTM layer output: 
tf.Tensor([[-0.0069972   0.00251359  0.00376387 -0.00985826]], shape=(1, 4), dtype=float32)


So what has happened??
- It looks like the number of units corresponds to the last dimension of the output, so with 1 unit it would produce a shape of (1, 1) and with 4 units a shape of (1, 4).
- Recap: the output of the embedding layer is an n-dimensional vector for each token in the sequence.

<br/>

That still doesn't answer what exactly it's doing??
- What does the output of the LSTM layer mean??

<br/>

What happens if we ask it to
- return sequences
- return state
- go backwards

In [11]:
# lstm layer with return sequence = True
LSTM_return_sequence = tf.keras.layers.LSTM(units=4, return_sequences=True)
LSTM_return_sequence_output = LSTM_return_sequence(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nLSTM layer output: ")
print(LSTM_return_sequence_output)

Input: [[17 18 43 21 44 21 44]]

LSTM layer output: 
tf.Tensor(
[[[ 0.00031957 -0.00278954  0.00191454 -0.00266983]
  [ 0.0010209  -0.00119254 -0.0021591   0.00193639]
  [ 0.00490161  0.00189113 -0.00468135  0.00442312]
  [-0.00146162 -0.00199073 -0.00197067  0.00299099]
  [-0.0057288  -0.00012202 -0.00289768  0.00381171]
  [-0.00944867 -0.00371482 -0.00043038  0.00346012]
  [-0.01174956 -0.0015525  -0.00172165  0.00493044]]], shape=(1, 7, 4), dtype=float32)


With `return_sequences` set to True, the layer returns an output for each time step as it iterates through the sequence. As there are 7 tokens in the sequence it returns 7 outputs, and since there are 4 units in our LSTM, each output contains 4 values.

I still don't fully understand what the output means.

In [None]:
LSTM_test = tf.keras.layers.LSTM(units=8, return_sequences=True)
LSTM_test_output = LSTM_test(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nLSTM layer output: ")
print(LSTM_test_output)

Input: [[17 18 43 21 44 21 44]]

LSTM layer output: 
tf.Tensor(
[[[-4.4579613e-03  9.0390081e-03 -4.1618831e-03  3.9857998e-03
    1.5431750e-04  1.4170543e-03  2.6728231e-03  6.4065116e-03]
  [-1.8123626e-03 -1.4558776e-03  3.7191992e-03 -8.4455329e-04
   -3.1592435e-04 -3.7643593e-04  8.2788552e-04 -1.6677342e-03]
  [-7.1258042e-03  1.7895554e-03  4.1352408e-03  4.3094768e-03
   -5.5278647e-03 -5.9056948e-03 -1.3778711e-04 -7.5842632e-05]
  [-6.1017997e-04 -1.7053846e-03  4.1970001e-03 -3.5239249e-03
   -2.5911711e-03  3.3614601e-03  2.7934618e-03 -2.9627648e-03]
  [ 1.8684951e-03  3.4123105e-03 -1.4369219e-03 -2.1099832e-03
   -5.6553767e-03  4.1918498e-03  2.4927713e-03  1.3080460e-03]
  [ 5.6837257e-03 -5.2536814e-04  9.4453071e-04 -7.5894175e-03
   -3.3683241e-03  9.7949253e-03  4.3107201e-03 -1.8445293e-03]
  [ 6.3246815e-03  4.2484715e-03 -3.3057968e-03 -4.6996563e-03
   -6.7012394e-03  8.2459627e-03  3.3796462e-03  2.1964042e-03]]], shape=(1, 7, 8), dtype=float32)


### Understanding the output of the LSTM layer

Looking at:
- https://stackoverflow.com/questions/67970519/what-does-tensorflow-lstm-return


In [24]:
tensor = tf.random.normal(shape=[2, 2, 2])
lstm = tf.keras.layers.LSTM(units=4, return_sequences=True, return_state=True)
result = lstm(tensor)

print("Input: {}".format(tensor))
print("result:\n", result )

Input: [[[-0.4942633   1.0938902 ]
  [-0.27461976 -0.2292373 ]]

 [[-0.34839615 -0.39708054]
  [ 0.21653225  2.0741708 ]]]
result:
 [<tf.Tensor: shape=(2, 2, 4), dtype=float32, numpy=
array([[[ 0.01921677, -0.1382001 ,  0.02751885, -0.09191924],
        [-0.02658506, -0.04949056, -0.01834671, -0.04414783]],

       [[-0.03753999,  0.06848606, -0.04679747,  0.03778585],
        [ 0.02192423, -0.08126742,  0.0863056 , -0.15761705]]],
      dtype=float32)>, <tf.Tensor: shape=(2, 4), dtype=float32, numpy=
array([[-0.02658506, -0.04949056, -0.01834671, -0.04414783],
       [ 0.02192423, -0.08126742,  0.0863056 , -0.15761705]],
      dtype=float32)>, <tf.Tensor: shape=(2, 4), dtype=float32, numpy=
array([[-0.05579966, -0.10042001, -0.03550037, -0.09503593],
       [ 0.02987111, -0.11008409,  0.26842624, -0.3075929 ]],
      dtype=float32)>]


I'm sure a lot has changed since version 2.0.0 of tf, but looking at the output the dimensions are similar, with the last dimension being offset by 1.

Looking at the answer to the question asked:
- the first tensor is the output (hidden state) at every time step
- the second tensor is the final short term memory (hidden state) of the layer; it matches the last time step of the first tensor
- the third tensor is the final long term memory (cell state) of the layer
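
The relationship between the three tensors can be checked directly. The sketch below uses a random input with made-up sizes, and the variable names `all_outputs`, `final_h` and `final_c` are my own; the exact numbers will differ on every run because the layer weights are randomly initialised.

```python
import numpy as np
import tensorflow as tf

# random input laid out as (batch, time steps, features); sizes are illustrative
x = np.random.rand(2, 5, 3).astype("float32")

lstm = tf.keras.layers.LSTM(units=4, return_sequences=True, return_state=True)
all_outputs, final_h, final_c = lstm(x)

print(tuple(all_outputs.shape))  # (2, 5, 4): one output per time step
print(tuple(final_h.shape))      # (2, 4): final short term memory
print(tuple(final_c.shape))      # (2, 4): final long term memory

# the final short term memory is the output at the last time step
last_step = np.asarray(all_outputs)[:, -1, :]
print(np.allclose(last_step, np.asarray(final_h)))  # True
```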

**A quick note on the dimensions.**

For a given input dimension [X, Y, Z], X is the batch dimension and determines the number of elements in the outermost array. For an RNN layer, Y is then the number of time steps and Z the number of features per time step. So if
- X = 2, we would have an array containing 2 elements. $[1, 2]$
- X = 5, we would have 5 elements in the array. $[1, 2, 3, 4, 5]$

The next dimension Y determines the number of elements within each of those outer elements. So if our dimensions are
- [1, 2, Z], we would have an array structured like this $[ [[], []] ]$

Admittedly it looks confusing, but the array contains a single element, which in turn contains 2 elements.

Stacking that further, the 3rd dimension determines the number of elements within each of the innermost arrays. For example an array with a dimension of

- [1, 2, 3] could look like this, [[[1, 2, 3], [4, 5, 6]]]
- [3, 4, 6] could look like this

```
[[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]],
[[11, 12, 13, 14, 15, 16], [17, 18, 19, 20, 21, 22], [23, 24, 25, 26, 27, 28], [29, 30, 31, 32, 33, 34]],
[[21, 22, 23, 24, 25, 26], [27, 28, 29, 30, 31, 32], [33, 34, 35, 36, 37, 38], [39, 40, 41, 42, 43, 44]]]
```
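
The same idea can be poked at with NumPy. This is a small sketch that assumes the RNN convention of [batch, time steps, features] for the 3 dimensions; the values are just a running count so that it is easy to see which slice is which.

```python
import numpy as np

# 3 sequences in the batch, 4 time steps each, 6 features per time step
batch = np.arange(3 * 4 * 6).reshape(3, 4, 6)

print(batch.shape)        # (3, 4, 6)
print(batch[0].shape)     # (4, 6): the first sequence in the batch
print(batch[0, 1].shape)  # (6,): the second time step of the first sequence
print(batch[0, 1])        # the 6 values 6..11
```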
<br>

**Back to understanding the output of the LSTM**
- From my current understanding, I think only the batch dimension and the number of units in the LSTM layer affect the dimension of the output (provided `return_sequences` is left at its default of False and the layer is not wrapped in a bidirectional layer).
- So for a given input with a batch size of X and with A units in the layer, the dimension of the output would be [X, A].

What occurs is that each element of the batch is one sequence, and the LSTM steps through it one time step at a time, passing its state on to the next time step. So for the first sequence in the batch above

```
[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]]
```
The LSTM takes `[1, 2, 3, 4, 5, 6]` as the input for the first time step (6 features), updates its short and long term memory, and then uses those states when processing the next time step `[7, 8, 9, 10, 11, 12]`, and so on until the end of the sequence.
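
A quick way to check the shape claim above, as a sketch that assumes the input is laid out as [batch, time steps, features]. The batch sizes and step counts are arbitrary; the number of features is kept at 6 because the layer's weights are created for that size on the first call.

```python
import numpy as np
import tensorflow as tf

layer = tf.keras.layers.LSTM(units=2)                             # last output only
seq_layer = tf.keras.layers.LSTM(units=2, return_sequences=True)  # output at every time step

for batch_size, steps in [(1, 4), (3, 4), (3, 10)]:
    x = np.random.rand(batch_size, steps, 6).astype("float32")
    print(x.shape, "->", tuple(layer(x).shape), tuple(seq_layer(x).shape))
```

With the default settings the output shape only follows the batch size and the number of units. The number of time steps only shows up in the output shape once `return_sequences=True`.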


In [20]:
# let's try passing the example sequence above to an LSTM layer and view the output

test_sequence = np.array([[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]]], dtype=np.float32)
LSTM_test_2 = tf.keras.layers.LSTM(2)

test_sequence_output = LSTM_test_2(test_sequence)

print("Results: {}".format(test_sequence_output))


Results: [[-3.2980672e-01 -8.5024832e-09]]


So for a single sequence in the batch, it has cycled through all the time steps and produced a single output with one value per unit (2 values).

In [21]:
# set go_backwards to True
LSTM_test_3 = tf.keras.layers.LSTM(2, go_backwards=True)

test_sequence_output_1 = LSTM_test_3(test_sequence)

print("Results: {}".format(test_sequence_output_1))


Results: [[-0.02918967  0.06942581]]


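Reading the docs, with `go_backwards=True` the layer processes the input sequence backwards and returns the reversed sequence. Since `return_sequences` is False here, we only get the final output, i.e. the output once the first time step has been processed last. Note that `LSTM_test_3` is a new layer with its own random weights, so its values are not directly comparable with the output of `LSTM_test_2`.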
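
One way to convince yourself of what `go_backwards` does is to compare it against a normal layer fed with a reversed copy of the input. This is a sketch under the assumption that both layers share the same weights, which is done here by copying them across with `get_weights` / `set_weights`; the layer names `forward` and `backward` and the random input are my own.

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 4, 6).astype("float32")   # (batch, time steps, features)

forward = tf.keras.layers.LSTM(2)
backward = tf.keras.layers.LSTM(2, go_backwards=True)

# call each layer once so that its weights are created, then share the weights
forward(x)
backward(x)
backward.set_weights(forward.get_weights())

# reverse the time axis by hand for the normal layer
x_reversed = np.ascontiguousarray(x[:, ::-1, :])

out_backward = np.asarray(backward(x))
out_forward_on_reversed = np.asarray(forward(x_reversed))

print(np.allclose(out_backward, out_forward_on_reversed, atol=1e-5))
```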

In [22]:
# set return sequence to True

LSTM_test_4 = tf.keras.layers.LSTM(2, return_sequences=True)

test_sequence_output_2 = LSTM_test_4(test_sequence)

print("Results: {}".format(test_sequence_output_2))


Results: [[[ 4.5328138e-06 -1.6741604e-03]
  [ 2.3820494e-09 -1.7256550e-03]
  [ 1.2384534e-12 -1.8277960e-03]
  [ 6.4388152e-16 -1.9796446e-03]]]


Looking at the output,
```
[[[ 4.5328138e-06 -1.6741604e-03]   --> Output of the 2 units at the first time step.
  [ 2.3820494e-09 -1.7256550e-03]   --> Output of the 2 units at the second time step.
  [ 1.2384534e-12 -1.8277960e-03]   --> Output of the 2 units at the third time step.
  [ 6.4388152e-16 -1.9796446e-03]]] --> Output of the 2 units at the fourth time step.
```


In [26]:
# set return state to True
LSTM_test_5 = tf.keras.layers.LSTM(2, return_state=True)

test_sequence_output_5 = LSTM_test_5(test_sequence)

print("Results: {}".format(test_sequence_output_5))


Results: [<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 9.99174535e-01, -1.22137795e-11]], dtype=float32)>, <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 9.99174535e-01, -1.22137795e-11]], dtype=float32)>, <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 3.8983569e+00, -1.2234241e-11]], dtype=float32)>]


Recall again that the LSTM has 2 states, the long term and short term states.
Looking at the docs, for the given arguments it would return the
- final output
- final memory state --> (short term memory, the same values as the final output)
- final carry state --> (long term memory)

**More reading resources**
- https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/