# **Udacity: Intro to TensorFlow for Deep Learning**
## **Lesson 10 NLP: Recurrent Neural Networks**

This lesson extends what was covered in lesson 9. It introduces recurrent neural networks, which are able to capture temporal dependencies, i.e. dependencies that change over time.

This lesson covers:
- Different RNNs: Simple RNNs, LSTMs, GRUs
- Text generation using NLP models.


## **Simple RNNs**

Simple recurrent neural networks are networks which use the output from the previous time step as an additional input, alongside the current input.

For example:
- Inputs: $X_t$ and $Y_{t-1}$
- Output: $Y_t$

I'm a bit iffy around the calculation of the output.

But...

The general idea is that the output from the previous time step is fed in as an additional input. This previous output is referred to as the state vector.
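
Since the output calculation is the unclear part, here is a minimal NumPy sketch of one common formulation, $Y_t = \tanh(X_t W_x + Y_{t-1} W_y + b)$. The weight names (`W_x`, `W_y`, `b`), the sizes and the random values are assumptions made for illustration, they are not taken from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, units, timesteps = 3, 2, 5

# Hypothetical weights; in a real network these are learned during training.
W_x = rng.normal(size=(input_dim, units))  # applied to the current input X_t
W_y = rng.normal(size=(units, units))      # applied to the previous output Y_{t-1}
b = np.zeros(units)

X = rng.normal(size=(timesteps, input_dim))  # one sequence of 5 time steps
state = np.zeros(units)                      # state vector before the first step

outputs = []
for x_t in X:
    # The new output depends on the current input AND the previous output.
    state = np.tanh(x_t @ W_x + state @ W_y + b)
    outputs.append(state)

outputs = np.stack(outputs)
print(outputs.shape)  # (5, 2): one output vector per time step
```

The same `state` variable is overwritten at every step, which is why information from early time steps tends to fade in a simple RNN.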

<br>

While a simple RNN is able to consider its past output when calculating a new output, it is limited in how far back it can relate dependencies. For dependencies which span a long period of time, a simple RNN struggles to capture them.

## **Long Short-Term Memory (LSTM)**

LSTMs were introduced to capture temporal dependencies which span longer periods of time. Unlike simple RNNs, which have a single state vector, LSTMs have 2 state vectors: a long term memory and a short term memory.

<br>

**Features of an LSTM**
- It's able to capture temporal dependencies spanning a long period of time
- It has 2 state vectors, a long term and a short term memory, which are used in calculating a new output
- LSTMs feature gates: forget, learn, remember and use gates.

<br>

**Workflow for an LSTM**
- At each time step, the LSTM has 2 state vectors: A long term memory and a short term memory
- The current time step input, alongside the 2 state vectors, is used to determine an output.
- The calculated output from the current time step is used as the short term memory for the next time step
- The long term memory from the previous time step is updated:
  - Any piece of the previous long term memory which is no longer relevant is removed, using the forget gate
  - Any new piece of relevant information is added to the long term memory, using the remember gate

<br>

**LSTM Gates**
- Forget gate: Determines which parts of the long term memory are no longer relevant and should be removed from memory.
- Learn gate: Learns new pieces of information using the current input and the short term memory.
- Remember gate: Adds any relevant information that was learnt to the long term memory. The output of this gate is the new long term memory.
- Use gate: Uses the relevant parts of the long term memory and the newly learnt information to calculate an output. The output is also used as the new short term memory for the next time step.
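
To make the gates more concrete, here is a rough NumPy sketch of a single LSTM time step using the standard LSTM equations. How the lesson's gate names map onto those equations is my own reading (an assumption), `lstm_step` is a hypothetical helper, and the weights are random stand-ins for values that would normally be learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, short_term, long_term, W, b):
    """One LSTM time step (standard formulation, hypothetical helper)."""
    # All gates look at the current input and the short term memory.
    z = np.concatenate([short_term, x_t]) @ W + b
    f, i, g, o = np.split(z, 4)

    forget = sigmoid(f) * long_term   # forget gate: keep only the relevant long term memory
    learn = sigmoid(i) * np.tanh(g)   # learn gate: new information from input + short term memory
    new_long_term = forget + learn    # remember gate: the updated long term memory
    # use gate: the output, which is also the new short term memory
    new_short_term = sigmoid(o) * np.tanh(new_long_term)
    return new_short_term, new_long_term

rng = np.random.default_rng(0)
input_dim, units = 3, 2
W = rng.normal(size=(units + input_dim, 4 * units))  # random stand-in for learned weights
b = np.zeros(4 * units)

short_term = np.zeros(units)
long_term = np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 time steps
    short_term, long_term = lstm_step(x_t, short_term, long_term, W, b)

print(short_term.shape, long_term.shape)  # (2,) (2,)
```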



## **Import Dependencies**

In [2]:
import tensorflow as tf
import numpy as np

print(tf.__version__)

2.8.2


## **RNNs and LSTMs in code**

The LSTM layer is available as `tf.keras.layers.LSTM` ([docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)), and likewise the simple RNN layer as `tf.keras.layers.SimpleRNN`.

<br>

Worth noting:
- We can pass the output of an embedding layer directly to an LSTM layer, without adding a `Flatten` or `GlobalAveragePooling1D` layer in between.
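
As a small sketch of that point, the model below feeds an `Embedding` layer straight into an `LSTM` layer. The layer sizes and the final `Dense` layer are assumptions chosen for illustration only.

```python
import numpy as np
import tensorflow as tf

# Sizes are illustrative assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=150, output_dim=4),  # (batch, tokens) -> (batch, tokens, 4)
    tf.keras.layers.LSTM(units=8),                            # (batch, tokens, 4) -> (batch, 8)
    tf.keras.layers.Dense(1, activation="sigmoid"),           # (batch, 8) -> (batch, 1)
])

batch = np.array([[17, 18, 43, 21, 44, 21, 44]])  # one sequence of 7 token ids
prediction = model(batch)
print(prediction.shape)  # (1, 1)
```

The LSTM consumes the 3D embedding output one time step at a time, so no flattening or pooling is needed first.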

<br>

**Further resources:**   
The [recurrent neural network (RNN) with keras guide](https://www.tensorflow.org/guide/keras/rnn) is a really good supplementary introduction to RNNs.

Notes from the guide
- There are 3 built-in RNN layers: SimpleRNN, GRU, LSTM
- An RNN layer can process input sequences in reverse (`go_backwards`).
- The layers support recurrent dropout (`recurrent_dropout`).
- By default a layer returns the output at the last time step only, but it can be configured to return the output for each time step instead (`return_sequences`).
- The layer can also be configured to return the final internal state vectors (`return_state`).
- Likewise we can also set the initial state of the RNN layer (`initial_state`).

<br>

**Difference between a layer and a cell**
- "*the RNN cell only processes a single timestep*", whereas the RNN layer processes the whole input sequence by looping the cell over every timestep.
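
A small sketch of that difference, under assumed sizes: an `LSTMCell` is called by hand on a single time step, and the same kind of cell is then wrapped in the generic `tf.keras.layers.RNN` layer, which loops it over the whole sequence.

```python
import tensorflow as tf

units = 4
x = tf.random.normal(shape=[1, 7, 3])  # (batch, timesteps, features), arbitrary sizes

# A cell handles ONE time step: the input at time t plus the previous states.
cell = tf.keras.layers.LSTMCell(units)
initial_states = [tf.zeros([1, units]), tf.zeros([1, units])]  # [hidden state, cell state]
step_output, step_states = cell(x[:, 0, :], initial_states)

# Wrapping a cell in the generic RNN layer loops it over every time step.
layer = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(units))
layer_output = layer(x)

print(step_output.shape)   # (1, 4): result of a single time step
print(layer_output.shape)  # (1, 4): result after all 7 time steps
```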

### **Preparing text for SimpleRNN and LSTM layer**

A simplified look at using the SimpleRNN and LSTM layers. Both of these layers take their input from the Embedding layer, which returns an n-dimensional vector for each token in the sequence.

To take a look at the input and the resulting output, let's try it out on a simple dataset.

In [3]:
# sample text
# The great pretender - The Platters
lyrics = ["Oh-oh, yes, I'm the great pretender",
          "Pretending that I'm doing well",
          "My need is such I pretend too much",
          "I'm lonely, but no one can tell",
          "Oh-oh, yes, I'm the great pretender",
          "Adrift in a world of my own",
          "I've played the game but to my real shame",
          "You've left me to grieve all alone",
          "Too real is this feeling of make-believe",
          "Too real when I feel what my heart can't conceal",
          "Yes, I'm the great pretender",
          "Just laughin' and gay like a clown",
          "I seem to be what I'm not, you see",
          "I'm wearing my heart like a crown",
          "Pretending that you're still around"  # NOTE: missing comma, so this line is joined with the next one (hence 'aroundtoo' in the word index)
          "Too real is this feeling of make-believe",
          "Too real when I feel what my heart can't conceal",
          "Yes, I'm the great pretender",
          "Just laughin' and gay like a clown",
          "I seem to be what I'm not, you see",
          "I'm wearing my heart like a crown",
          "Pretending that you're still around (still around)"]


In [4]:
# tokenize and pad the text

# deprecated in version 2.9.1
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# define a tokenizer and fit it to the text
The_great_pretender = Tokenizer(num_words=150, oov_token="<OOV>")
The_great_pretender.fit_on_texts(lyrics)


In [5]:
# display the word index
word_index = The_great_pretender.word_index
print(word_index)

{'<OOV>': 1, "i'm": 2, 'my': 3, 'the': 4, 'i': 5, 'a': 6, 'real': 7, 'oh': 8, 'yes': 9, 'great': 10, 'pretender': 11, 'too': 12, 'to': 13, 'what': 14, 'heart': 15, 'like': 16, 'pretending': 17, 'that': 18, 'is': 19, 'of': 20, 'still': 21, 'but': 22, 'this': 23, 'feeling': 24, 'make': 25, 'believe': 26, 'when': 27, 'feel': 28, "can't": 29, 'conceal': 30, 'just': 31, "laughin'": 32, 'and': 33, 'gay': 34, 'clown': 35, 'seem': 36, 'be': 37, 'not': 38, 'you': 39, 'see': 40, 'wearing': 41, 'crown': 42, "you're": 43, 'around': 44, 'doing': 45, 'well': 46, 'need': 47, 'such': 48, 'pretend': 49, 'much': 50, 'lonely': 51, 'no': 52, 'one': 53, 'can': 54, 'tell': 55, 'adrift': 56, 'in': 57, 'world': 58, 'own': 59, "i've": 60, 'played': 61, 'game': 62, 'shame': 63, "you've": 64, 'left': 65, 'me': 66, 'grieve': 67, 'all': 68, 'alone': 69, 'aroundtoo': 70}


In [6]:
# convert the lyrics to sequences
lyrics_sequence = The_great_pretender.texts_to_sequences(lyrics)

# display the first sequence
print(lyrics_sequence[0])

[8, 8, 9, 2, 4, 10, 11]


In [7]:
# display the average length of the sequences
length_of_sequence = []
for sequence in lyrics_sequence:
  length_of_sequence.append(len(sequence))


#src: https://www.geeksforgeeks.org/find-average-list-python/
def Average(lst):
    return sum(lst) / len(lst)
  

print(Average(length_of_sequence))

7.619047619047619


In [8]:
# Apply padding to the sequences
lyrics_sequence_padded = pad_sequences(lyrics_sequence, maxlen=7, padding='pre',
                                      truncating='post')
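
To see what `padding='pre'` and `truncating='post'` do, here is a small NumPy sketch that imitates that behaviour. `pad_pre_truncate_post` is a hypothetical helper written for this note, not a Keras function, and the example sequences are made up.

```python
import numpy as np

def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Hypothetical helper: pad at the front, truncate at the back."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]  # truncating='post' drops tokens from the end
        if kept:
            padded[row, maxlen - len(kept):] = kept  # padding='pre' puts the zeros in front
    return padded

example = [[8, 8, 9, 2, 4, 10, 11],      # exactly 7 tokens: unchanged
           [5, 6, 7],                    # shorter: zeros added in front
           [1, 2, 3, 4, 5, 6, 7, 8, 9]]  # longer: last two tokens dropped
result = pad_pre_truncate_post(example, maxlen=7)
print(result)
```

With `maxlen=7` being slightly below the average length computed above, some of the longer lyric lines lose their last tokens.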


In [9]:
# define an embedding layer
Embedding = tf.keras.layers.Embedding(input_dim=150, output_dim=4, input_length=7)

In [10]:
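# pass a sequence to the embedding layer. `sequence` is the loop variable left over
# from cell 7, i.e. the last lyric line, which already has 7 tokens so padding would not change it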
output = Embedding(np.array(sequence))
print(output)

tf.Tensor(
[[ 0.0140428  -0.0206385  -0.0401339  -0.00331587]
 [-0.03349987 -0.01393756  0.02969425 -0.03382658]
 [ 0.0047463   0.00197712 -0.04684277  0.00025292]
 [-0.02160268 -0.01317025 -0.03149801 -0.00817186]
 [ 0.01820506 -0.01082184  0.00169314  0.02858908]
 [-0.02160268 -0.01317025 -0.03149801 -0.00817186]
 [ 0.01820506 -0.01082184  0.00169314  0.02858908]], shape=(7, 4), dtype=float32)


We now have a 4-dimensional vector representing each token in the sequence.

A thought on the embedding layer
- I don't think the embedding layer is trainable. Provided its function is simply to convert tokens into a vector representation, I can't really see why any optimization is needed.

Unless the scale or the way in which the conversion is done needs to change to get better results, I'm not sure.

Looking into the above thought, *I don't think the embedding layer is trainable*:

<br>

**Is the embedding layer trainable?**

The short answer is that the embedding layer is trainable; it is not enough to just map each token to an arbitrary vector.
<br>

- At its core, the embedding layer maps positive integers (tokens) into dense vectors of fixed size (word embeddings).
- Ideally we would like words with a similar semantic context or sentiment to have similar vector representations, so words like man and woman should have fairly similar vectors. Likewise words like apple, orange, banana and pear should be within the same general cluster in the n-dimensional space of our word embedding.
- We would not be able to achieve this by randomly mapping our tokens into vectors. Hence the vector representations need to be optimized during training, so that similar words end up close together in the embedding space.
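
A quick way to check this in code: the sketch below builds an embedding layer with the same sizes as above and inspects its weights. The token ids passed in are arbitrary; calling the layer once is only there to create the weights.

```python
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=150, output_dim=4)
_ = embedding(tf.constant([[17, 18, 43]]))  # call once so the weights are created

print(embedding.trainable)  # True
# One trainable weight matrix: a row of 4 values for each of the 150 possible token ids.
print([tuple(w.shape) for w in embedding.trainable_weights])  # [(150, 4)]
```

The layer is essentially a lookup table whose rows start out random and are adjusted by the optimizer like any other weights.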

<br>

**Resources**
- https://stats.stackexchange.com/questions/324992/how-the-embedding-layer-is-trained-in-keras-embedding-layer
- https://www.youtube.com/watch?v=5MaWmXwxFNQ

<br>

**More questions**
- What are word2vec and skip-gram?

### **Understanding the keras LSTM layer**

In [11]:
# pass the output of the embedding layer to the LSTM layer
LSTM = tf.keras.layers.LSTM(units=4)

embedding_output = Embedding(np.array([sequence]))
LSTM_output = LSTM(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nEmbedding layer output: ")
print(embedding_output)
print("\nLSTM layer output: ")
print(LSTM_output)

Input: [[17 18 43 21 44 21 44]]

Embedding layer output: 
tf.Tensor(
[[[ 0.0140428  -0.0206385  -0.0401339  -0.00331587]
  [-0.03349987 -0.01393756  0.02969425 -0.03382658]
  [ 0.0047463   0.00197712 -0.04684277  0.00025292]
  [-0.02160268 -0.01317025 -0.03149801 -0.00817186]
  [ 0.01820506 -0.01082184  0.00169314  0.02858908]
  [-0.02160268 -0.01317025 -0.03149801 -0.00817186]
  [ 0.01820506 -0.01082184  0.00169314  0.02858908]]], shape=(1, 7, 4), dtype=float32)

LSTM layer output: 
tf.Tensor([[-0.00141469  0.00299828  0.00552981 -0.01104947]], shape=(1, 4), dtype=float32)


So what has happened?
- It looks like the number of units determines the shape of the output: with 1 unit the output has shape (1, 1), and with 4 units it has shape (1, 4). More generally the shape is (batch size, units).
- Recap: the output of the embedding layer is an n-dimensional vector for each token in the sequence.

<br/>

This still doesn't answer what exactly it's doing.
- What does the output of the LSTM layer mean?

<br/>

What happens if we ask it to
- return sequences
- return state
- go backwards

**Set return_sequences to True**

In [12]:
# lstm layer with return sequence = True
LSTM_return_sequence = tf.keras.layers.LSTM(units=4, return_sequences=True)
LSTM_return_sequence_output = LSTM_return_sequence(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nLSTM layer output: ")
print(LSTM_return_sequence_output)

Input: [[17 18 43 21 44 21 44]]

LSTM layer output: 
tf.Tensor(
[[[ 0.00254509 -0.00265903 -0.00335655  0.00630662]
  [-0.00215338  0.00352042 -0.00360615  0.0071639 ]
  [ 0.00131656 -0.00060463 -0.00234847  0.00958662]
  [ 0.00361935 -0.00076064 -0.00215261  0.01149735]
  [ 0.00590812 -0.00410664 -0.00088706  0.00912217]
  [ 0.00713104 -0.00324596 -0.00123536  0.01124518]
  [ 0.00868961 -0.00585512 -0.0003303   0.00898984]]], shape=(1, 7, 4), dtype=float32)


With `return_sequences` set to True, the layer returns a value for each time step as it iterates through the sequence. As there are 7 tokens in the sequence, it returns 7 rows, and since our LSTM has 4 units, each row contains 4 values.

I still don't fully understand what the output means.

**Understanding the output of an LSTM layer**

Notes from Machine Learning Mastery.
- The LSTM is a class of recurrent neural network which contains internal gates (learn, forget, remember and use gates).
- This class of recurrent neural network is designed to mitigate the vanishing gradient problem, which is what allows it to capture longer temporal dependencies.
- An LSTM layer can be defined with n LSTM cells (units). Each cell
  - contains an internal cell state, *c*
  - outputs a hidden state, *h*

So the output of an LSTM is the hidden state *h*.

<br>

By setting
- **`return_sequences` to True**, we are able to access the hidden state at each time step in our sequence.
- **`return_state` to True**, the layer also returns the final states of each cell in the LSTM layer: the final hidden state *h* and the final cell state *c*.

We would typically set `return_state` to True when we want to initialize the states of another LSTM layer with the same number of cells.
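
A minimal sketch of that use case, with assumed sizes and random inputs. The names `encoder` and `decoder` are just labels I picked for the two layers.

```python
import tensorflow as tf

units = 4
source = tf.random.normal(shape=[2, 5, 3])  # (batch, timesteps, features), arbitrary sizes
target = tf.random.normal(shape=[2, 6, 3])

# First layer: return_state=True exposes the final hidden state h and cell state c.
encoder = tf.keras.layers.LSTM(units, return_state=True)
encoder_output, final_h, final_c = encoder(source)

# Second layer: starts from those states instead of from zeros.
# Both layers need the same number of units so the state shapes match.
decoder = tf.keras.layers.LSTM(units, return_sequences=True)
decoder_output = decoder(target, initial_state=[final_h, final_c])

print(final_h.shape, final_c.shape)  # (2, 4) (2, 4)
print(decoder_output.shape)          # (2, 6, 4)
```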

<br>

Cool...

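But this has still not answered the question of what the internal cell state and the hidden state actually are. Relating it back to the lesson, the cell state *c* corresponds to the long term memory and the hidden state *h* to the short term memory.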




<br>

**Resources**
- https://stackoverflow.com/questions/67970519/what-does-tensorflow-lstm-return

- https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
 

In [14]:
# Example used in stackoverflow question
# Set return_sequences and return_state to True

# input shape: Batch size, length of sequence, embedding dimension
tensor = tf.random.normal(shape=[2, 2, 2])
lstm = tf.keras.layers.LSTM(units=4, return_sequences=True, return_state=True)
hidden_state_at_each_time_step, final_hidden_state, final_cell_state = lstm(tensor)

print("Input: {}".format(tensor))
print("\nHidden_state_at_each_time_step:\n", hidden_state_at_each_time_step)
print("\nFinal hidden state: \n", final_hidden_state)
print("\nFinal cell state: \n", final_cell_state)

Input: [[[ 1.3728726  -1.0539973 ]
  [ 0.70490474 -0.6466145 ]]

 [[ 2.0646765  -0.20600307]
  [ 0.8807793  -0.68819195]]]

Hidden_state_at_each_time_step:
 tf.Tensor(
[[[ 0.02846106 -0.01029014 -0.117158   -0.11814743]
  [ 0.0384821  -0.01346449 -0.13667542 -0.17867412]]

 [[ 0.02039262  0.02783074 -0.15292223 -0.06933288]
  [ 0.03636859  0.03722456 -0.17272212 -0.17978273]]], shape=(2, 2, 4), dtype=float32)

Final hidden state: 
 tf.Tensor(
[[ 0.0384821  -0.01346449 -0.13667542 -0.17867412]
 [ 0.03636859  0.03722456 -0.17272212 -0.17978273]], shape=(2, 4), dtype=float32)

Final cell state: 
 tf.Tensor(
[[ 0.06955759 -0.02722744 -0.2568297  -0.3566234 ]
 [ 0.06557956  0.0762099  -0.32171172 -0.3723976 ]], shape=(2, 4), dtype=float32)


**WTF...**

**Breakdown of the output**


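For an input of shape `[2, 2, 2]`, we have a batch of 2 sentences, each containing 2 tokens, with each token represented by a 2-dimensional word embedding. Such an input could look like this: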
``` python
[[[-0.4942633   1.0938902 ], [-0.27461976 -0.2292373 ]]
 [[-0.34839615 -0.39708054], [ 0.21653225  2.0741708 ]]]
```
Here `[-0.4942633   1.0938902 ]`, for example, is the word embedding for a single token.

So an example sequence of tokens for the above word embeddings is
``` python
example_sequence_of_tokens = [[10, 11], [12, 13]]

# 10 -> [-0.4942633   1.0938902 ]
# 11 -> [-0.27461976 -0.2292373 ]
# 12 -> [-0.34839615 -0.39708054]
# 13 -> [ 0.21653225  2.0741708 ]

# decoded even further, as another example
example_sentences = ["hello you", "Good food"]

# hello -> 10
# you -> 11
# Good -> 12
# food -> 13
```

With `return_sequences` and `return_state` set to True, we have
- The hidden state $h_{t}$ at each time step.
- The final hidden state $h$.
- The final cell state $c$.


Looking carefully at the output, you should see that the final hidden state is just the hidden state at the last time step of each sentence.
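
That observation can be checked directly. The sketch below repeats the setup from the cell above (with fresh random values) and compares the last time step of the returned sequence with the final hidden state.

```python
import numpy as np
import tensorflow as tf

tensor = tf.random.normal(shape=[2, 2, 2])
lstm = tf.keras.layers.LSTM(units=4, return_sequences=True, return_state=True)
hidden_state_at_each_time_step, final_hidden_state, final_cell_state = lstm(tensor)

# Take the last time step of every sequence in the batch.
last_time_step = hidden_state_at_each_time_step[:, -1, :]

print(np.allclose(last_time_step.numpy(), final_hidden_state.numpy()))  # True
```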

**A quick note on the dimensions.**

For a given input dimension [X, Y, Z], X is the batch dimension and determines the number of elements in the outermost array. So if
- X = 2, we would have an array containing 2 elements. $[1, 2]$
- X = 5, we would have 5 elements in the array. $[1, 2, 3, 4, 5]$

The next dimension Y determines the number of elements within each of those outer elements. So if our dimensions are
- [1, 2, Z], we would have an array structured like this: $[ [[], []] ]$

Admittedly it looks confusing, but the array contains a single element, which in turn contains 2 elements.

Stacking that further, the 3rd dimension determines the number of elements within each of the innermost arrays. For example an array with a dimension of

- [1, 2, 3] could look like this: [[[1, 2, 3], [4, 5, 6]]]
- [3, 4, 6] could look like this:

```
[[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]],
[[11, 12, 13, 14, 15, 16], [17, 18, 19, 20, 21, 22], [23, 24, 25, 26, 27, 28], [29, 30, 31, 32, 33, 34]],
[[21, 22, 23, 24, 25, 26], [27, 28, 29, 30, 31, 32], [33, 34, 35, 36, 37, 38], [39, 40, 41, 42, 43, 44]]]
```
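
The same idea can be explored with NumPy. This is a small sketch using consecutive numbers (so the values differ from the hand-written array above) to show how the three dimensions nest.

```python
import numpy as np

# 72 consecutive numbers arranged as [3, 4, 6]:
# 3 sequences in the batch, 4 time steps each, 6 values per time step.
array = np.arange(1, 73).reshape(3, 4, 6)

print(array.shape)     # (3, 4, 6)
print(array[0].shape)  # (4, 6): the first sequence
print(array[0, 1])     # [ 7  8  9 10 11 12]: second time step of the first sequence
```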



In [None]:
# let's try passing the first sequence of the example array above to an LSTM layer and view the output

# a batch of 1 sequence with 4 time steps and 6 embedding dimensions.
test_sequence = np.array([[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]]], dtype=np.float32)
LSTM_test_2 = tf.keras.layers.LSTM(2)

test_sequence_output = LSTM_test_2(test_sequence)

print("Results: {}".format(test_sequence_output))


Results: [[-3.2980672e-01 -8.5024832e-09]]


In [None]:
# set go_backwards to True
LSTM_test_3 = tf.keras.layers.LSTM(2, go_backwards=True)

test_sequence_output_1 = LSTM_test_3(test_sequence)

print("Results: {}".format(test_sequence_output_1))


Results: [[-0.02918967  0.06942581]]


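Reading the docs, with `go_backwards=True` the layer processes the input sequence backwards (last time step first) and returns the reversed sequence. Since `return_sequences` is False here, only the output of the final step processed is returned, which is the step for the first element of the original sequence.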
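
One way to convince ourselves of this is the sketch below: a layer with `go_backwards=True` should give the same result as a regular layer with identical weights that is fed the manually reversed sequence. The input values and sizes are arbitrary.

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 4, 6).astype(np.float32)  # arbitrary (batch, timesteps, features)

backwards = tf.keras.layers.LSTM(2, go_backwards=True)
forwards = tf.keras.layers.LSTM(2)

backwards_output = backwards(x)                # builds the layer and runs it
forwards(x)                                    # call once so the weights exist
forwards.set_weights(backwards.get_weights())  # give both layers the same weights

# Reverse the time axis by hand and run the regular layer on it.
reversed_output = forwards(x[:, ::-1, :].copy())

print(np.allclose(backwards_output.numpy(), reversed_output.numpy(), atol=1e-5))  # True
```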

In [None]:
# set return_sequences to True

LSTM_test_4 = tf.keras.layers.LSTM(2, return_sequences=True)

test_sequence_output_2 = LSTM_test_4(test_sequence)

print("Results: {}".format(test_sequence_output_2))


Results: [[[ 4.5328138e-06 -1.6741604e-03]
  [ 2.3820494e-09 -1.7256550e-03]
  [ 1.2384534e-12 -1.8277960e-03]
  [ 6.4388152e-16 -1.9796446e-03]]]


Looking at the output,
```
[[[ 4.5328138e-06 -1.6741604e-03]    --> Output of the 2 units for the first word embedding in the sequence.
  [ 2.3820494e-09 -1.7256550e-03]    --> Output of the 2 units for the second word embedding in the sequence.
  [ 1.2384534e-12 -1.8277960e-03]    --> Output of the 2 units for the third word embedding in the sequence.
  [ 6.4388152e-16 -1.9796446e-03]]]  --> Output of the 2 units for the fourth word embedding in the sequence.
```


In [None]:
# set return state to True
LSTM_test_5 = tf.keras.layers.LSTM(2, return_state=True)

test_sequence_output_5 = LSTM_test_5(test_sequence)

print("Results: {}".format(test_sequence_output_5))


Results: [<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 9.99174535e-01, -1.22137795e-11]], dtype=float32)>, <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 9.99174535e-01, -1.22137795e-11]], dtype=float32)>, <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 3.8983569e+00, -1.2234241e-11]], dtype=float32)>]


### **Understanding the SimpleRNN Layer**

In [None]:
# define a simpleRNN layer and try to make sense of the output.
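
A minimal sketch of what this cell could contain. It assumes the layer's default `tanh` activation and that `get_weights()` returns the input kernel, the recurrent kernel and the bias in that order, and it recomputes the output by hand to make sense of it. The input is random and the sizes are arbitrary.

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 4, 6).astype(np.float32)  # arbitrary (batch, timesteps, features)

simple_rnn = tf.keras.layers.SimpleRNN(2, return_sequences=True, return_state=True)
sequence_output, final_state = simple_rnn(x)

# Recompute the same thing by hand from the layer's weights.
kernel, recurrent_kernel, bias = simple_rnn.get_weights()
state = np.zeros((1, 2), dtype=np.float32)
manual_outputs = []
for t in range(x.shape[1]):
    # new state = tanh(current input * kernel + previous state * recurrent kernel + bias)
    state = np.tanh(x[:, t, :] @ kernel + state @ recurrent_kernel + bias)
    manual_outputs.append(state)
manual_outputs = np.stack(manual_outputs, axis=1)

print(sequence_output.shape)  # (1, 4, 2)
print(final_state.shape)      # (1, 2)
print(np.allclose(sequence_output.numpy(), manual_outputs, atol=1e-5))
```

Unlike the LSTM, only one state is returned, because a simple RNN has a single state vector.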

### **Understanding GRU Layers**
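
A minimal sketch for this section, mirroring the LSTM exploration above. The input is random and the sizes are arbitrary assumptions.

```python
import tensorflow as tf

x = tf.random.normal(shape=[1, 4, 6])  # arbitrary (batch, timesteps, features)

gru = tf.keras.layers.GRU(2, return_sequences=True, return_state=True)
sequence_output, final_state = gru(x)

print(sequence_output.shape)  # (1, 4, 2): one output per time step
print(final_state.shape)      # (1, 2): a GRU has one state vector, not two
```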