<a href="https://colab.research.google.com/github/Antony-gitau/machine_learning_playground/blob/main/Neurons_with_recurrence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I am following the [MIT 6.S191 lecture on recurrent neural networks](https://youtu.be/ySEx_Bqxvvo) and documenting my notes here.

Sequence modelling applications:
- machine translation
- image captioning
- sentiment classification


Neuron with recurrence
- RNN

pseudocode of an RNN

1. define the RNN: my_rnn = RNN()
2. iterate through all the inputs
3. at each input, calculate and update the hidden state using an activation function
4. generate a predicted output.
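
The steps above can be sketched as a small NumPy forward pass. This is only a minimal sketch, not the lecture's code: the class name `SimpleRNN`, the weight names `W_xh`, `W_hh` and `W_hy`, the layer sizes and the random inputs are all assumptions made for illustration.

```python
import numpy as np

class SimpleRNN:
    """Hypothetical vanilla RNN cell: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        rng = np.random.default_rng(seed)
        # the same weights are reused at every time step (parameter sharing)
        self.W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        self.W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

    def step(self, x, h):
        # update the hidden state with a tanh activation
        h = np.tanh(self.W_hh @ h + self.W_xh @ x)
        # generate a prediction from the new hidden state
        y = self.W_hy @ h
        return y, h

# 1. define the RNN
my_rnn = SimpleRNN(input_dim=4, hidden_dim=8, output_dim=4)

# 2. iterate through all the inputs (here: 5 made-up input vectors)
inputs = np.random.default_rng(1).normal(size=(5, 4))
hidden_state = np.zeros(8)
for x in inputs:
    # 3. update the hidden state, 4. generate a predicted output
    prediction, hidden_state = my_rnn.step(x, hidden_state)

print(prediction.shape, hidden_state.shape)
```

The loop works for any number of inputs, which is how an RNN handles variable-length sequences while keeping the order of the inputs.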

design criteria for developing networks for sequence modelling:
- handle variable lengths
- track long dependencies
- maintain information about the order of the sequence
- share parameters across the sequence

example:
predicting the next word.

1. represent language to a neural network
- represent words as numerical representation.

One way to represent words as input vectors for a neural network is one-hot encoding. Each word in the vocabulary is assigned an index, and a word is represented by a vector as long as the vocabulary, with a 1 at that word's index and 0 everywhere else. E.g. with a four-word vocabulary, [0,1,0,0] is the one-hot vector of the word in the second position (index 1) of the vocabulary.
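
A minimal sketch of one-hot encoding in plain Python. The four-word vocabulary and the names `vocab_words`, `word2idx` and `one_hot` are made up for illustration.

```python
# hypothetical four-word vocabulary; any ordering works as long as it stays fixed
vocab_words = ["cat", "dog", "sat", "mat"]
word2idx = {word: idx for idx, word in enumerate(vocab_words)}

def one_hot(word):
    """Return a list with a 1 at the word's vocabulary index and 0 elsewhere."""
    vector = [0] * len(vocab_words)
    vector[word2idx[word]] = 1
    return vector

print(one_hot("dog"))  # [0, 1, 0, 0]
```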

2. Training and learning through neural networks

- backpropagation through time.

challenges:
1. exploding gradients
- the gradient gets bigger and bigger until it is infeasible to compute, and by extension, training the model becomes unstable.
2. vanishing gradients
- the gradient, on the other hand, gets smaller and smaller until it becomes insignificant, so the earlier time steps stop contributing to learning.
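
Both problems come from multiplying many factors together during backpropagation through time. Here is a rough scalar sketch, assuming (purely for illustration) that every time step contributes the same factor to the gradient; the function name `gradient_after` is made up.

```python
def gradient_after(steps, factor):
    """Multiply a starting gradient of 1.0 by the same factor once per time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

# factor > 1: the gradient explodes; factor < 1: it vanishes
exploding = gradient_after(steps=100, factor=1.5)
vanishing = gradient_after(steps=100, factor=0.5)
print(f"factor 1.5 -> {exploding:.3e}")
print(f"factor 0.5 -> {vanishing:.3e}")
```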

tricks to overcome the challenges:
1. changing activation functions
e.g. ReLU, whose derivative is 1 for positive inputs, which helps prevent the gradient from shrinking
2. parameter initialization
3. introducing gated cells.
gates selectively control the flow of information through the network, as in LSTMs

applications and limitations of RNN

Music generation
- Design an RNN that can predict the next musical note.

limitation
- encoding bottleneck
- no easy parallelization techniques
- limited long-term memory: RNNs struggle with very long sequences, e.g. tens of thousands of words

Attention is all you need:
- attend to the most important parts of an input example.
- extract the features that deserve the highest attention.
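
The two bullets above can be sketched with scaled dot-product self-attention in NumPy. This is a simplified sketch under assumptions: the sizes and random inputs are made up, and the learned query, key and value projections that real models use are left out.

```python
import numpy as np

def softmax(scores):
    # subtract the row max for numerical stability
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(query, key, value):
    # similarity between every query and every key, scaled by sqrt(d_k)
    scores = query @ key.T / np.sqrt(key.shape[-1])
    # each row is a distribution over input positions: where to attend
    weights = softmax(scores)
    # weighted sum of the values: the features that received the most attention
    return weights @ value, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))  # 5 positions, 16 features each (made-up sizes)
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape, weights.sum(axis=-1))
```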


Let's jump into a practical section drawing inspiration from the [music generation with RNN lab](https://github.com/aamini/introtodeeplearning/blob/master/lab1/Part2_Music_Generation.ipynb) from the MIT Introduction to Deep Learning course.

The goal is to train a model to generate new music from learning the patterns in raw sheet music.



In [1]:
%%capture
# make sure we are using a TensorFlow 2.x version (Colab-only magic)
%tensorflow_version 2.x
import tensorflow as tf

# the data we are using lives in mitdeeplearning package
!pip install mitdeeplearning
import mitdeeplearning as mdl





Data:
- the mitdeeplearning package has an Irish folk song dataset with 817 songs.


In [5]:
songs = mdl.lab1.load_training_data()
first_song = songs[0]
print("This is just an example\n ", first_song)
second_song = songs[1]
print("second song: ", second_song)

Found 817 songs in text
This is just an example
  X:1
T:Alexander's
Z: id:dc-hornpipe-1
M:C|
L:1/8
K:D Major
(3ABc|dAFA DFAd|fdcd FAdf|gfge fefd|(3efe (3dcB A2 (3ABc|!
dAFA DFAd|fdcd FAdf|gfge fefd|(3efe dc d2:|!
AG|FAdA FAdA|GBdB GBdB|Acec Acec|dfaf gecA|!
FAdA FAdA|GBdB GBdB|Aceg fefd|(3efe dc d2:|!
second song:  X:2
T:An Buachaill Dreoite
Z: id:dc-hornpipe-2
M:C|
L:1/8
K:G Major
GF|DGGB d2GB|d2GF Gc (3AGF|DGGB d2GB|dBcA F2GF|!
DGGB d2GF|DGGF G2Ge|fgaf gbag|fdcA G2:|!
GA|B2BG c2cA|d2GF G2GA|B2BG c2cA|d2DE F2GA|!
B2BG c2cA|d^cde f2 (3def|g2gf gbag|fdcA G2:|!


In [8]:
#converting the abc notation of the songs to audio file
play_first_song = mdl.lab1.play_song(first_song)
play_first_song

Important questions:

how does the number of different characters present in the text file impact the complexity of the learning problem?



In [12]:
#join the songs leaving a blank line between them
joined_songs = "\n\n".join(songs)

# let's get the unique characters from the songs we just joined
vocab = sorted(set(joined_songs))
print("we have ", len(vocab), "unique characters in the irish folk songs dataset")

we have  83 unique characters in the irish folk songs dataset


In [15]:
vocab

['\n',
 ' ',
 '!',
 '"',
 '#',
 "'",
 '(',
 ')',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 '<',
 '=',
 '>',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '[',
 ']',
 '^',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '|']

preprocessing:

- we are asking the model: given a sequence of characters, what is the most probable next one? This is the goal of this model development.

- the data we have is ABC notation, and we want the RNN to learn the pattern.


so,

1. we need to vectorize the text.
- creating a numerical representation of the musical text.
- we will therefore generate two lookup tables: one to map the characters to numbers and the other will map numbers back to characters.


A note on the dictionary comprehension below: it is equivalent to this for loop

      char2indx = {}
      for ind, ch in enumerate(vocab):
         char2indx[ch] = ind


In [18]:
import numpy as np

# mapping characters to unique index
char2indx = {ch:indx for indx, ch in enumerate(vocab)}

# map indices back to characters: indx2char[i] is the character with index i
indx2char = np.array(vocab)

In [None]:
char2indx

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 '#': 4,
 "'": 5,
 '(': 6,
 ')': 7,
 ',': 8,
 '-': 9,
 '.': 10,
 '/': 11,
 '0': 12,
 '1': 13,
 '2': 14,
 '3': 15,
 '4': 16,
 '5': 17,
 '6': 18,
 '7': 19,
 '8': 20,
 '9': 21,
 ':': 22,
 '<': 23,
 '=': 24,
 '>': 25,
 'A': 26,
 'B': 27,
 'C': 28,
 'D': 29,
 'E': 30,
 'F': 31,
 'G': 32,
 'H': 33,
 'I': 34,
 'J': 35,
 'K': 36,
 'L': 37,
 'M': 38,
 'N': 39,
 'O': 40,
 'P': 41,
 'Q': 42,
 'R': 43,
 'S': 44,
 'T': 45,
 'U': 46,
 'V': 47,
 'W': 48,
 'X': 49,
 'Y': 50,
 'Z': 51,
 '[': 52,
 ']': 53,
 '^': 54,
 '_': 55,
 'a': 56,
 'b': 57,
 'c': 58,
 'd': 59,
 'e': 60,
 'f': 61,
 'g': 62,
 'h': 63,
 'i': 64,
 'j': 65,
 'k': 66,
 'l': 67,
 'm': 68,
 'n': 69,
 'o': 70,
 'p': 71,
 'q': 72,
 'r': 73,
 's': 74,
 't': 75,
 'u': 76,
 'v': 77,
 'w': 78,
 'x': 79,
 'y': 80,
 'z': 81,
 '|': 82}

In [None]:
indx2char

array(['\n', ' ', '!', '"', '#', "'", '(', ')', ',', '-', '.', '/', '0',
       '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '<', '=', '>',
       'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
       '[', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
       'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
       'w', 'x', 'y', 'z', '|'], dtype='<U1')

A comment on dtype='<U1', seen above in the output of indx2char:

this datatype specifies that the array elements are Unicode strings of length 1 (`U1`); the `<` marks little-endian byte order. It does not mean unsigned integers.
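
A quick way to check what a NumPy string dtype means is to inspect its `kind` and `itemsize`. The arrays below are made-up examples, not the notebook's data.

```python
import numpy as np

chars = np.array(["a", "b", "|"])
print(chars.dtype)           # a Unicode string dtype with length 1
print(chars.dtype.kind)      # 'U' stands for Unicode string
print(chars.dtype.itemsize)  # bytes per element (NumPy stores 4 bytes per character)

# the number after U grows with the longest string in the array
longer = np.array(["ab", "abc"])
print(longer.dtype)
```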

In [16]:
# vectorize the song strings
def vectorize_string(string):
  '''
  we pass a string ie the song
  we convert the string to characters
  then convert the characters to indices
  then the indices to numpy arrays for easier storage and manipulation
  '''
  #convert the strings of the song to characters
  characters = list(string)

  #map vocab characters to indices (enumerate yields (index, character) pairs)
  char2indx = {ch: indx for indx, ch in enumerate(vocab)}

  #map each character in the input string to its corresponding index
  vectorize = [char2indx[char] for char in characters]

  # convert the list to a numpy array
  vectorize_array = np.array(vectorize)

  return vectorize_array


In [19]:
vectorized_songs = vectorize_string(joined_songs)

In [22]:
len(vectorized_songs) == len(joined_songs)

True

In [28]:
# test example of the character-to-index line in the vectorize_string function above
char2indx = {ch: indx for indx, ch in enumerate(vocab)}
char2indx['r']

73

In [32]:
print('{} ->  character mapped to int -> {}'.format(repr(joined_songs[:10]),vectorized_songs[:10] ))

'X:1\nT:Alex' ->  character mapped to int -> [49 22 13  0 45 22 26 67 60 79]


From the output of the print function above, you can see that the individual characters of the song string were mapped to their unique indices (this is what we mean by vectorized).
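
The second lookup table lets us go the other way, from indices back to text. Below is a minimal round-trip sketch on a short made-up snippet; the `sample_` names are hypothetical stand-ins for `joined_songs`, `vocab`, `char2indx` and `indx2char`.

```python
import numpy as np

# a short ABC-style snippet standing in for joined_songs
sample_text = "X:1\nT:Alexander's"
sample_vocab = sorted(set(sample_text))

sample_char2indx = {ch: indx for indx, ch in enumerate(sample_vocab)}
sample_indx2char = np.array(sample_vocab)

# encode: characters -> indices
encoded = np.array([sample_char2indx[ch] for ch in sample_text])
# decode: indices -> characters, by indexing the lookup array with the index array
decoded = "".join(sample_indx2char[encoded])

print(decoded == sample_text)
```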


Now that we have vectorized the songs, we can create batches of training examples from them.

notes:
- `np.random.choice(a, size)` returns `size` randomly selected elements. When `a` is an integer, the elements are drawn from `np.arange(a)`. Each time the code line runs, a new selection is made.

An illustration to explain the `get_batch()` function.

Suppose n is the index of the last element of the vectorized songs (their length minus 1) and the sequence length that we feed the RNN is 10. The batch size is the number of example sequences shown to the model at once. We then randomly choose the starting index of each example from the entire input array. The code snippet shows how.

    n = vectorized_songs.shape[0] - 1
    seq_length = 10
    batch_size = 10
    rand_idx = np.random.choice(n - seq_length, batch_size)

An example output of the `rand_idx` variable above would be

    array([ 91766,  74645,  79707, 100050, 199214, 170350,  79151, 143346,
            17091,  97731])

To define an `input_batch` that will be fed to the RNN, we can loop through the randomly produced indices above (from `rand_idx`)

    input_batch = [vectorized_songs[idx:idx + seq_length] for idx in rand_idx]
    input_batch

A sample input to the model would look as follows:

    [array([ 1, 59, 32, 27, 59, 82, 60, 62, 62, 14]),
    array([30,  1, 30, 26, 82,  2,  0, 27, 59,  1]),
    array([82, 59, 15,  1, 26, 82,  2,  0, 27, 59]),
    array([14,  1, 61, 58, 26, 58, 82, 62, 59, 59]),
    array([ 0,  0, 49, 22, 18,  0, 45, 22, 34, 69]),
    array([60, 60, 61,  1, 60, 14, 59, 60, 82, 61]),
    array([14, 11, 16,  0, 37, 22, 13, 11, 20,  0]),
    array([73, 60, 60, 67,  9, 14, 14, 20,  0, 38]),
    array([58, 60,  1, 56, 60, 58, 26, 82,  6, 15]),
    array([26,  8, 82, 26,  8, 14, 32, 30,  1, 29])]

The text behind this vectorized input can be viewed by slicing with the same indices, but from the non-vectorized input (i.e. `joined_songs`)


    input_batch_nonvectorized = [joined_songs[idx:idx + seq_length] for idx in rand_idx]
    input_batch_nonvectorized


The equivalent of the arrays we saw earlier is shown below. That is a section of the song in abc notation.

    [' dGBd|egg2',
    'E EA|!\nBd ',
    '|d3 A|!\nBd',
    '2 fcAc|gdd',
    '\n\nX:6\nT:In',
    'eef e2de|f',
    '2/4\nL:1/8\n',
    'reel-228\nM',
    'ce aecA|(3',
    'A,|A,2GE D']

In [43]:
def get_batch(vectorized_songs, seq_length, batch_size):
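  '''
  build one random training batch
  vectorized_songs: the training data as a 1-D array of character indices
  seq_length: number of characters in each input (and target) sequence
  batch_size: number of sequences in the batch
  returns x_batch and y_batch, where y_batch is x_batch shifted one character ahead
  '''
  # index of the last element of vectorized_songs
  n = vectorized_songs.shape[0] - 1

  # pick a random start index for each example in the training batch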
  rand_idx = np.random.choice(n - seq_length, batch_size)

  # list of input sequences
  input_batch = [vectorized_songs[idx:idx + seq_length] for idx in rand_idx]

  # list of output sequences
  output_batch = [vectorized_songs[idx+1:idx + seq_length + 1] for idx in rand_idx]

  x_batch = np.reshape(input_batch, [batch_size, seq_length])
  y_batch = np.reshape(output_batch, [batch_size, seq_length])

  return x_batch, y_batch
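
To see that each target sequence is the input sequence moved one step ahead, we can run the same logic on toy data. In this sketch the function body is repeated so that the snippet runs on its own, and `toy_songs` (the integers 0 to 19) is a made-up stand-in for the vectorized songs.

```python
import numpy as np

def get_batch(vectorized_songs, seq_length, batch_size):
    # same logic as the function above, repeated so this snippet is self-contained
    n = vectorized_songs.shape[0] - 1
    rand_idx = np.random.choice(n - seq_length, batch_size)
    input_batch = [vectorized_songs[idx:idx + seq_length] for idx in rand_idx]
    output_batch = [vectorized_songs[idx + 1:idx + seq_length + 1] for idx in rand_idx]
    x_batch = np.reshape(input_batch, [batch_size, seq_length])
    y_batch = np.reshape(output_batch, [batch_size, seq_length])
    return x_batch, y_batch

# toy data: the integers 0..19 stand in for the vectorized songs
toy_songs = np.arange(20)
x_batch, y_batch = get_batch(toy_songs, seq_length=5, batch_size=3)

print(x_batch)
print(y_batch)
# every target row overlaps its input row, shifted by one position
print(np.array_equal(x_batch[:, 1:], y_batch[:, :-1]))
```

So at every position in the sequence, the model is asked to predict the character that comes right after it.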


In [42]:
len(vectorized_songs) == vectorized_songs.shape[0]

True

Now that our training data is ready, we can start working on developing the RNN model.

The goal is to use this model to generate new songs.

notes on the LSTM function below:
- `return_sequences=True` makes the layer return its output at every time step and not just the final output.
- the Glorot uniform initializer samples the weights from a uniform distribution whose range depends on the shape of the weight tensor (its fan-in and fan-out).
- the sigmoid function squashes values to between 0 and 1; as the recurrent activation it controls how much information flows through the gates of the LSTM cell.
- `stateful=True` preserves the layer's internal state between batches: the final state for each sample in one batch becomes the initial state for the sample at the same position in the next batch, which can help carry context across consecutive batches.
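
A rough NumPy sketch of two of the ideas in these notes: Glorot uniform sampling and sigmoid gating. The function names and the sizes are made up for illustration, and this is not how Keras implements them internally.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=0):
    # Glorot (Xavier) uniform: sample from U(-limit, limit),
    # where limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out)), limit

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

weights, limit = glorot_uniform(fan_in=1024, fan_out=1024)
print(limit, weights.min(), weights.max())

# a gate value near 0 blocks information, a value near 1 lets it through
gate = sigmoid(np.array([-10.0, 0.0, 10.0]))
print(gate)
```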

In [44]:
# define RNN model
def LSTM(rnn_units):
  return tf.keras.layers.LSTM(
      rnn_units,
      return_sequences=True,
      recurrent_initializer='glorot_uniform',
      recurrent_activation='sigmoid',
      stateful=True)

In [45]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
      #transform indices into dense vectors - layer 1
      tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),

      #LSTM is layer 2
      tf.keras.layers.LSTM(rnn_units),

      #transform the LSTM output into a vocabulary size
      tf.keras.layers.Dense(vocab_size)

  ])

  return model

In [46]:
model = build_model(len(vocab), embedding_dim=256, rnn_units=1024, batch_size=32)

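We do some sanity checks on our simple model. Note that `build_model` calls `tf.keras.layers.LSTM(rnn_units)` directly instead of the `LSTM` helper defined above, so `return_sequences` keeps its default of `False`: the model outputs one prediction per sequence (shape `(32, 83)` below) rather than one per time step, even though `y` from `get_batch` holds a target for every time step.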

In [47]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (32, None, 256)           21248     
                                                                 
 lstm (LSTM)                 (32, 1024)                5246976   
                                                                 
 dense (Dense)               (32, 83)                  85075     
                                                                 
Total params: 5,353,299
Trainable params: 5,353,299
Non-trainable params: 0
_________________________________________________________________
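
The parameter counts in the summary can be reproduced by hand. This is a small arithmetic sketch using the sizes passed to `build_model`; the variable names are just for illustration.

```python
vocab_size, embedding_dim, rnn_units = 83, 256, 1024

# embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 sets of weights (3 gates plus the candidate cell state),
# each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + rnn_units + 1) * rnn_units)

# dense: one weight per (rnn unit, vocab entry) pair plus one bias per vocab entry
dense_params = rnn_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```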


In [48]:
x, y = get_batch(vectorized_songs, seq_length=100, batch_size=32)

In [49]:
print(len(x))
x.shape

32


(32, 100)

In [50]:
print(len(y))
y.shape

32


(32, 100)

In [51]:
pred = model(x)

In [52]:
pred

<tf.Tensor: shape=(32, 83), dtype=float32, numpy=
array([[-0.00319604,  0.00211253, -0.01241458, ..., -0.00012345,
         0.00466482, -0.01294524],
       [ 0.01571455,  0.00087369, -0.00417113, ...,  0.0032012 ,
        -0.00440836, -0.00078404],
       [ 0.01027013,  0.00438162, -0.00413364, ..., -0.01213961,
         0.00628384, -0.00120205],
       ...,
       [-0.00533564,  0.00173857, -0.01452293, ..., -0.00029971,
        -0.00322834, -0.00576325],
       [ 0.00014208,  0.00317596, -0.00713918, ...,  0.00645009,
        -0.00173611, -0.00363189],
       [-0.00363292, -0.0065942 , -0.00263345, ..., -0.01786411,
        -0.00663792, -0.00523666]], dtype=float32)>

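Notes on the categorical function:
- we use this function to sample the next character to generate, based on the model's predicted score for each character in the vocabulary.

- pred[0] holds the predicted scores (logits) for each character in the vocabulary, for the first example in the batch. These are unnormalized values rather than probabilities, which is why some of them are negative in the checks below.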

In [None]:
len(vocab)

83

In [53]:
print("Shape of pred[0]:", pred[0].shape)
print("Data type of pred[0]:", pred[0].dtype)
print("Min value of pred[0]:", tf.reduce_min(pred[0]))
print("Max value of pred[0]:", tf.reduce_max(pred[0]))


Shape of pred[0]: (83,)
Data type of pred[0]: <dtype: 'float32'>
Min value of pred[0]: tf.Tensor(-0.013302604, shape=(), dtype=float32)
Max value of pred[0]: tf.Tensor(0.022281915, shape=(), dtype=float32)


In [54]:
print("Negative probabilities:", tf.reduce_any(pred[0] < 0))
print("NaN values:", tf.reduce_any(tf.math.is_nan(pred[0])))


Negative probabilities: tf.Tensor(True, shape=(), dtype=bool)
NaN values: tf.Tensor(False, shape=(), dtype=bool)
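
The negative values are expected, because the Dense layer has no activation and outputs unnormalized scores (logits). A softmax turns logits into probabilities that we can sample from. The sketch below uses NumPy with made-up logits for a 5-character vocabulary; the names `softmax`, `logits`, `probs` and `next_idx` are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    # subtract the max before exponentiating for numerical stability
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# made-up logits; some are negative, like the values in pred[0]
logits = np.array([-0.013, 0.002, 0.022, -0.004, 0.010])
probs = softmax(logits)

# all probabilities are positive and they sum to 1
print(probs, probs.sum())

# sample the index of the next character from the distribution
rng = np.random.default_rng(0)
next_idx = rng.choice(len(probs), p=probs)
print(next_idx)
```

Sampling from the distribution, instead of always taking the highest score, keeps the generated sequence from getting stuck repeating the most likely character.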
