# **Udacity: Intro to TensorFlow for Deep Learning**
## **Lesson 10 NLP: Recurrent Neural Networks**

This lesson extends what was covered in lesson 9. It introduces recurrent neural networks (RNNs), which are able to capture temporal dependencies, that is, dependencies between elements of a sequence across time steps.

This lesson covers:
- Different types of RNNs
- Text generation using NLP models


## **Simple RNNs**

Simple recurrent neural networks are networks that use the output from the previous time step as an additional input, alongside the current input.

For example:   
- Inputs: $X_t$ and $Y_{t-1}$
- Output: $Y_t$

I'm a bit iffy on the exact calculation of the output.

But...

The general idea is that the output from the previous time step is fed back in as an additional input. This previous output is referred to as the state vector.
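
To make the output calculation a little less fuzzy, here is a minimal NumPy sketch of the idea. It assumes the common textbook formulation (a `tanh` over a weighted sum of the current input and the previous output); the lesson does not spell out the formula, and `simple_rnn_forward`, the weight names and the random values are all illustrative.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over one sequence (hypothetical helper).

    inputs: (timesteps, input_dim); returns outputs of shape (timesteps, units).
    """
    units = W_h.shape[0]
    state = np.zeros(units)  # initial state vector: no previous output yet
    outputs = []
    for x_t in inputs:
        # the new output depends on the current input X_t and the previous output Y_{t-1}
        state = np.tanh(x_t @ W_x + state @ W_h + b)
        outputs.append(state)
    return np.stack(outputs)

rng = np.random.default_rng(0)
timesteps, input_dim, units = 7, 4, 3
x = rng.normal(size=(timesteps, input_dim))
W_x = rng.normal(size=(input_dim, units))  # input weights (assumed, random)
W_h = rng.normal(size=(units, units))      # recurrent weights (assumed, random)
b = np.zeros(units)

y = simple_rnn_forward(x, W_x, W_h, b)
print(y.shape)  # (7, 3)
```

Each row of `y` is the output for one time step, and it is also the state vector handed to the next step.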

<br>

While a simple RNN is able to take its past output into account when calculating a new output, it is limited in how far back it can relate dependencies. For dependencies that span a long period of time, a simple RNN struggles to connect them.

## **Long Short-Term Memory (LSTM)**

LSTMs were introduced to capture temporal dependencies that span longer periods of time. Unlike simple RNNs, which have a single state vector, LSTMs have 2 state vectors: a long-term memory and a short-term memory.

<br>

**Features of an LSTM**
- It's able to capture temporal dependencies spanning a long period of time.
- It has 2 state vectors, long-term and short-term memory, which are used in calculating a new output.
- LSTMs feature gates: forget, learn, remember and use gates.

<br>

**Workflow for an LSTM**
- At each time step, the LSTM has 2 state vectors: a long-term memory and a short-term memory.
- The current time step's input, alongside the 2 state vectors, is used to determine an output.
- The output calculated at the current time step is used as the short-term memory for the next time step.
- The long-term memory from the previous time step is updated:
  - Any part of the previous long-term memory that is no longer relevant is removed.
  - Any new piece of relevant information is added to the long-term memory.

<br>

**LSTM Gates**
- Forget gate: Determines which parts of the long-term memory are no longer relevant and should be removed from memory.
- Learn gate: Learns new pieces of information using the current input and the short-term memory.
- Remember gate: Adds any relevant information that was learnt to the long-term memory. The output of this gate is the new long-term memory.
- Use gate: Uses the relevant parts of the long-term memory and the newly learnt information to calculate an output. The output is also used as the new short-term memory for the next time step.
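
The gate descriptions above can be sketched in NumPy using the standard LSTM equations. How the lesson's gate names map onto those equations (learn = input gate times candidate, remember = cell-state update, use = output gate) is an assumption made for this sketch, as are `lstm_step`, the weight shapes and the random values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (hypothetical helper, standard equations).

    x_t: (input_dim,), h_prev and c_prev: (units,)
    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    """
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    f = sigmoid(z[:units])               # forget gate: what to drop from long-term memory
    i = sigmoid(z[units:2 * units])      # learn gate, part 1: how much new info to accept
    g = np.tanh(z[2 * units:3 * units])  # learn gate, part 2: candidate new information
    o = sigmoid(z[3 * units:])           # use gate: what to expose as the output
    c_new = f * c_prev + i * g           # remember gate: updated long-term memory
    h_new = o * np.tanh(c_new)           # output, also the next short-term memory
    return h_new, c_new

rng = np.random.default_rng(0)
input_dim, units, timesteps = 4, 3, 7
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)  # short-term memory
c = np.zeros(units)  # long-term memory
for x_t in rng.normal(size=(timesteps, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (3,) (3,)
```

Both state vectors have one value per unit, which is why the number of units decides the size of the layer's output.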



## **Import Dependencies**

In [2]:
import tensorflow as tf
import numpy as np

print(tf.__version__)

2.8.2


## **RNNs and LSTMs in code**

The LSTM layer is available as `tf.keras.layers.LSTM` ([docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)). Likewise, the simple RNN layer is `tf.keras.layers.SimpleRNN`.

<br>

Worth noting:
- We can pass the output of an embedding layer directly to an LSTM layer, without adding a `Flatten` or `GlobalAveragePooling1D` layer in between.
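
A minimal sketch of that wiring, assuming a toy binary-classification head: the vocabulary size of 150 and embedding size of 4 mirror the values used later in these notes, while the `Dense` layer and the number of LSTM units are illustrative choices only.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    # (batch, timesteps) integer ids -> (batch, timesteps, 4)
    tf.keras.layers.Embedding(input_dim=150, output_dim=4),
    # the LSTM consumes the 3D embedding output directly -> (batch, 8)
    tf.keras.layers.LSTM(units=8),
    # hypothetical output head -> (batch, 1)
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

batch = np.array([[17, 18, 43, 21, 44, 21, 44]])  # one sequence of 7 token ids
pred = model(batch)
print(pred.shape)  # (1, 1)
```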

<br>

**Further resources:**   
The [Recurrent Neural Networks (RNN) with Keras guide](https://www.tensorflow.org/guide/keras/rnn) is a really good supplementary introduction to RNNs.

Notes from the guide:
- There are 3 built-in RNN layers: `SimpleRNN`, `GRU` and `LSTM`.
- An RNN layer can process input sequences in reverse (`go_backwards`).
- The layers support recurrent dropout (`recurrent_dropout`).
- By default a layer returns only the output at the last time step, but it can be configured to return the output for every time step instead (`return_sequences=True`).
- The layer can also be configured to return its final internal state vectors (`return_state=True`).
- Likewise, we can set the initial state of the RNN layer (`initial_state`).
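
The reverse-processing note can be checked with a small sketch. It assumes that copying the weights from a forward layer into a `go_backwards=True` layer makes the two comparable; the input is random and the layer sizes are arbitrary.

```python
import numpy as np
import tensorflow as tf

x = tf.random.normal(shape=(1, 7, 4))  # (batch, timesteps, features)

forward = tf.keras.layers.LSTM(units=4)
backward = tf.keras.layers.LSTM(units=4, go_backwards=True)

# call each layer once so its weights get created, then share the weights
forward(x)
backward(x)
backward.set_weights(forward.get_weights())

out_backward = backward(x)
# reversing the time axis by hand and running the forward layer
out_reversed = forward(tf.reverse(x, axis=[1]))

print(np.allclose(out_backward.numpy(), out_reversed.numpy(), atol=1e-5))
```

If the two outputs match, `go_backwards=True` is simply reading the sequence from the last token to the first.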


**Difference between an RNN layer and an RNN cell**
- From the guide: "*the RNN cell only processes a single timestep*", whereas an RNN layer processes whole batches of input sequences.
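
A short sketch of that relationship: `tf.keras.layers.LSTMCell` describes the computation for one time step, and wrapping it in `tf.keras.layers.RNN` supplies the loop over the sequence. The input here is random and the sizes are arbitrary.

```python
import tensorflow as tf

x = tf.random.normal(shape=(1, 7, 4))  # (batch, timesteps, features)

cell = tf.keras.layers.LSTMCell(units=4)  # logic for a single time step
layer = tf.keras.layers.RNN(cell)         # iterates the cell over all 7 time steps

cell_layer_output = layer(x)
print(cell_layer_output.shape)  # (1, 4)
```

The output shape is the same as what `tf.keras.layers.LSTM(units=4)` gives for this input.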

A simplified look at using the SimpleRNN and LSTM layers. Both layers take their input from the Embedding layer, which returns an n-dimensional vector for each token in the sequence.

To look at the input and the resulting output, let's try it out on a simple dataset.

In [1]:
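# sample text
# The great pretender - The Platters
# NOTE: the first "Pretending that you're still around" line below has no
# trailing comma, so Python concatenates it with the next string. That is
# where the 'aroundtoo' token in the word index further down comes from.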
lyrics = ["Oh-oh, yes, I'm the great pretender",
          "Pretending that I'm doing well",
          "My need is such I pretend too much",
          "I'm lonely, but no one can tell",
          "Oh-oh, yes, I'm the great pretender",
          "Adrift in a world of my own",
          "I've played the game but to my real shame",
          "You've left me to grieve all alone",
          "Too real is this feeling of make-believe",
          "Too real when I feel what my heart can't conceal",
          "Yes, I'm the great pretender",
          "Just laughin' and gay like a clown",
          "I seem to be what I'm not, you see",
          "I'm wearing my heart like a crown",
          "Pretending that you're still around"
          "Too real is this feeling of make-believe",
          "Too real when I feel what my heart can't conceal",
          "Yes, I'm the great pretender",
          "Just laughin' and gay like a clown",
          "I seem to be what I'm not, you see",
          "I'm wearing my heart like a crown",
          "Pretending that you're still around (still around)"]


In [8]:
# tokenize and pad the text

# deprecated in version 2.9.1
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# define a tokenizer and fit it to the text
The_great_pretender = Tokenizer(num_words=150, oov_token="<OOV>")
The_great_pretender.fit_on_texts(lyrics)


In [10]:
# display the word index
word_index = The_great_pretender.word_index
print(word_index)

{'<OOV>': 1, "i'm": 2, 'my': 3, 'the': 4, 'i': 5, 'a': 6, 'real': 7, 'oh': 8, 'yes': 9, 'great': 10, 'pretender': 11, 'too': 12, 'to': 13, 'what': 14, 'heart': 15, 'like': 16, 'pretending': 17, 'that': 18, 'is': 19, 'of': 20, 'still': 21, 'but': 22, 'this': 23, 'feeling': 24, 'make': 25, 'believe': 26, 'when': 27, 'feel': 28, "can't": 29, 'conceal': 30, 'just': 31, "laughin'": 32, 'and': 33, 'gay': 34, 'clown': 35, 'seem': 36, 'be': 37, 'not': 38, 'you': 39, 'see': 40, 'wearing': 41, 'crown': 42, "you're": 43, 'around': 44, 'doing': 45, 'well': 46, 'need': 47, 'such': 48, 'pretend': 49, 'much': 50, 'lonely': 51, 'no': 52, 'one': 53, 'can': 54, 'tell': 55, 'adrift': 56, 'in': 57, 'world': 58, 'own': 59, "i've": 60, 'played': 61, 'game': 62, 'shame': 63, "you've": 64, 'left': 65, 'me': 66, 'grieve': 67, 'all': 68, 'alone': 69, 'aroundtoo': 70}


In [11]:
# convert the lyrics to sequences
lyrics_sequence = The_great_pretender.texts_to_sequences(lyrics)

# print the first sequence only
print(lyrics_sequence[0])

[8, 8, 9, 2, 4, 10, 11]


In [12]:
# display the average sequence length
length_of_sequence = []
for sequence in lyrics_sequence:
  length_of_sequence.append(len(sequence))


#src: https://www.geeksforgeeks.org/find-average-list-python/
def Average(lst):
    return sum(lst) / len(lst)
  

print(Average(length_of_sequence))

7.619047619047619


In [14]:
# Apply padding to the sequences
lyrics_sequence_padded = pad_sequences(lyrics_sequence, maxlen=7, padding='pre',
                                      truncating='post')
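
To make the `padding='pre'` and `truncating='post'` arguments concrete, here is a minimal pure-Python sketch of what they do to a single sequence. `pad_or_truncate` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="post", value=0):
    """Mimic pre/post padding and truncation for one sequence (illustrative only)."""
    if len(seq) > maxlen:
        # 'post' drops tokens from the end, 'pre' drops tokens from the start
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding in front of the tokens, 'post' puts it after them
    return pad + seq if padding == "pre" else seq + pad

short_seq = [8, 8, 9]
long_seq = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(pad_or_truncate(short_seq, maxlen=7))  # [0, 0, 0, 0, 8, 8, 9]
print(pad_or_truncate(long_seq, maxlen=7))   # [1, 2, 3, 4, 5, 6, 7]
```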


In [16]:
# define an embedding layer
Embedding = tf.keras.layers.Embedding(input_dim=150, output_dim=4, input_length=7)

In [18]:
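# pass a sequence from lyrics_sequence_padded to the embedding layer
# (the last lyric line; at exactly 7 tokens it is unchanged by padding)
sequence = lyrics_sequence_padded[-1]
output = Embedding(np.array(sequence))
print(output)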

tf.Tensor(
[[-0.04459211  0.02571313 -0.03931668  0.02958694]
 [ 0.02224226 -0.04186448  0.02919896 -0.03644032]
 [-0.02129121 -0.0400339  -0.02416347  0.04578217]
 [-0.01037017  0.02882774  0.04588655 -0.04674643]
 [-0.01522291  0.03582403  0.01851526  0.04984926]
 [-0.01037017  0.02882774  0.04588655 -0.04674643]
 [-0.01522291  0.03582403  0.01851526  0.04984926]], shape=(7, 4), dtype=float32)


We now have a 4-dimensional vector representation of each token in the sequence.

An interesting note (or maybe not):
- The embedding layer is trainable by default; the values above are just its random initial weights, since nothing has been trained yet.
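
The repeated rows in the embedding output (tokens 21 and 44 each appear twice in the sequence) hint at what the layer does: it is a lookup table with one row per token id. A minimal NumPy sketch of that idea, using a randomly initialised matrix as a stand-in for the layer's weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 150, 4

# stand-in for the layer's weights: one random row per token id (assumed values)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

token_ids = np.array([17, 18, 43, 21, 44, 21, 44])
vectors = embedding_matrix[token_ids]  # look up one row per token

print(vectors.shape)                           # (7, 4)
print(np.array_equal(vectors[3], vectors[5]))  # True: both positions hold token 21
```

Training would adjust the rows of the matrix, but the lookup itself stays the same.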

In [27]:
# pass the output of the embedding layer to the LSTM layer
LSTM = tf.keras.layers.LSTM(units=4)

embedding_output = Embedding(np.array([sequence]))
LSTM_output = LSTM(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nEmbedding layer output: ")
print(embedding_output)
print("\nLSTM layer output: ")
print(LSTM_output)

Input: [[17 18 43 21 44 21 44]]

Embedding layer output: 
tf.Tensor(
[[[-0.04459211  0.02571313 -0.03931668  0.02958694]
  [ 0.02224226 -0.04186448  0.02919896 -0.03644032]
  [-0.02129121 -0.0400339  -0.02416347  0.04578217]
  [-0.01037017  0.02882774  0.04588655 -0.04674643]
  [-0.01522291  0.03582403  0.01851526  0.04984926]
  [-0.01037017  0.02882774  0.04588655 -0.04674643]
  [-0.01522291  0.03582403  0.01851526  0.04984926]]], shape=(1, 7, 4), dtype=float32)

LSTM layer output: 
tf.Tensor([[ 0.00769531  0.0146683   0.00121428 -0.00279331]], shape=(1, 4), dtype=float32)


So what has happened?
- The number of units sets the size of the output: with 1 unit the output shape is (1, 1), and with 4 units it is (1, 4). More generally, the shape is (batch size, units).
- Recap: the output of the embedding layer is an n-dimensional vector for each token in the sequence.

<br/>

That still doesn't answer what exactly it's doing:
- What does the output of the LSTM layer mean?

<br/>

What happens if we ask it to:
- return sequences
- return state
- go backwards

In [28]:
# LSTM layer with return_sequences=True
LSTM_return_sequence = tf.keras.layers.LSTM(units=4, return_sequences=True)
LSTM_return_sequence_output = LSTM_return_sequence(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nLSTM layer output: ")
print(LSTM_return_sequence_output)

Input: [[17 18 43 21 44 21 44]]

LSTM layer output: 
tf.Tensor(
[[[ 1.9649762e-05  2.0224554e-03  8.7440340e-03  4.0050391e-03]
  [ 2.9053944e-03  3.3087025e-03 -3.7620508e-04 -3.2723253e-03]
  [ 9.8999811e-04  1.3154540e-02 -5.6852326e-03  1.3051000e-03]
  [ 6.3451435e-03  8.1153130e-03  2.1200352e-03 -3.3731095e-04]
  [ 2.1511361e-03  9.9434238e-03  2.2941262e-03  9.3798516e-03]
  [ 7.0728022e-03  4.6256278e-03  7.7789957e-03  4.5243585e-03]
  [ 2.7511823e-03  6.8042660e-03  6.2713060e-03  1.2181924e-02]]], shape=(1, 7, 4), dtype=float32)


With `return_sequences=True`, the layer returns a value for each time step as it iterates through the sequence. As there are 7 tokens in the sequence, it returns 7 rows, and since the LSTM has 4 units, each row holds 4 values.

I still don't fully understand what the output means.
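
One practical way to read the per-time-step output: it is what a second recurrent layer needs as its input, since an LSTM expects a 3D (batch, timesteps, features) tensor. A minimal sketch of stacking two LSTM layers, with arbitrary layer sizes chosen for illustration:

```python
import numpy as np
import tensorflow as tf

stacked = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=150, output_dim=4),
    # returns one vector per time step so the next LSTM still sees a sequence
    tf.keras.layers.LSTM(units=8, return_sequences=True),
    # returns only the output at the last time step
    tf.keras.layers.LSTM(units=4),
])

batch = np.array([[17, 18, 43, 21, 44, 21, 44]])
stacked_output = stacked(batch)
print(stacked_output.shape)  # (1, 4)
```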

In [30]:
LSTM_test = tf.keras.layers.LSTM(units=8, return_sequences=True)
LSTM_test_output = LSTM_test(embedding_output)

print("Input: {}".format(np.array([sequence])))
print("\nLSTM layer output: ")
print(LSTM_test_output)

Input: [[17 18 43 21 44 21 44]]

LSTM layer output: 
tf.Tensor(
[[[-4.4579613e-03  9.0390081e-03 -4.1618831e-03  3.9857998e-03
    1.5431750e-04  1.4170543e-03  2.6728231e-03  6.4065116e-03]
  [-1.8123626e-03 -1.4558776e-03  3.7191992e-03 -8.4455329e-04
   -3.1592435e-04 -3.7643593e-04  8.2788552e-04 -1.6677342e-03]
  [-7.1258042e-03  1.7895554e-03  4.1352408e-03  4.3094768e-03
   -5.5278647e-03 -5.9056948e-03 -1.3778711e-04 -7.5842632e-05]
  [-6.1017997e-04 -1.7053846e-03  4.1970001e-03 -3.5239249e-03
   -2.5911711e-03  3.3614601e-03  2.7934618e-03 -2.9627648e-03]
  [ 1.8684951e-03  3.4123105e-03 -1.4369219e-03 -2.1099832e-03
   -5.6553767e-03  4.1918498e-03  2.4927713e-03  1.3080460e-03]
  [ 5.6837257e-03 -5.2536814e-04  9.4453071e-04 -7.5894175e-03
   -3.3683241e-03  9.7949253e-03  4.3107201e-03 -1.8445293e-03]
  [ 6.3246815e-03  4.2484715e-03 -3.3057968e-03 -4.6996563e-03
   -6.7012394e-03  8.2459627e-03  3.3796462e-03  2.1964042e-03]]], shape=(1, 7, 8), dtype=float32)


### Understanding the output of the LSTM layer

Looking at:
- https://stackoverflow.com/questions/67970519/what-does-tensorflow-lstm-return
 

In [31]:
tensor = tf.random.normal(shape=[2, 2, 2])
lstm = tf.keras.layers.LSTM(units=4, return_sequences=True, return_state=True)
result = lstm(tensor)
print("result:\n", result)

result:
 [<tf.Tensor: shape=(2, 2, 4), dtype=float32, numpy=
array([[[-0.06319275, -0.01700984,  0.05514352, -0.05898715],
        [-0.06919891,  0.03776611,  0.27123466, -0.2143757 ]],

       [[-0.11517125, -0.06272746, -0.04737367, -0.00527546],
        [-0.0422636 , -0.17164788, -0.20939562,  0.07271944]]],
      dtype=float32)>, <tf.Tensor: shape=(2, 4), dtype=float32, numpy=
array([[-0.06919891,  0.03776611,  0.27123466, -0.2143757 ],
       [-0.0422636 , -0.17164788, -0.20939562,  0.07271944]],
      dtype=float32)>, <tf.Tensor: shape=(2, 4), dtype=float32, numpy=
array([[-0.08944879,  0.06403987,  0.5540789 , -0.33707172],
       [-0.13707983, -0.3527763 , -0.42899728,  0.18073909]],
      dtype=float32)>]


I'm sure a lot has changed since version 2.0.0 of TensorFlow, but looking at the output, the dimensions are similar, with the last dimension offset by 1.

Looking at the answer to the question asked:
- The first tensor holds the hidden state (the output) at every time step, with shape (batch, timesteps, units).
- The second tensor is the final hidden state, i.e. the short-term memory of the network. It matches the last time step of the first tensor.
- The third tensor is the final cell state, i.e. the long-term memory of the network.
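
The relationship between the three tensors can be checked directly. This is a small sketch on random input; the variable names `all_hidden`, `final_hidden` and `final_cell` are just labels chosen here.

```python
import numpy as np
import tensorflow as tf

x = tf.random.normal(shape=(2, 2, 2))  # (batch, timesteps, features)
lstm = tf.keras.layers.LSTM(units=4, return_sequences=True, return_state=True)

all_hidden, final_hidden, final_cell = lstm(x)

print(all_hidden.shape)    # (2, 2, 4): hidden state at every time step
print(final_hidden.shape)  # (2, 4): short-term memory after the last step
print(final_cell.shape)    # (2, 4): long-term memory after the last step

# the final hidden state should equal the last time step of the full sequence
print(np.allclose(all_hidden[:, -1, :].numpy(), final_hidden.numpy()))
```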