# Recurrent Neural Networks

RNNs capture patterns in sequences of data. In this case, the sequence will be characters in the names of chemical compounds, which are often composed of parts like prefixes (e.g. "mono-", "methyl-"), roots (e.g. "fluor", "nitr"), and suffixes (e.g. "-ate", "-ide"). We'll see if a simple RNN can pick up on these patterns and generate plausible new compound names.

## Objective

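This will be a many-to-many RNN: it reads one character per time step and emits a prediction at every step. We will train it to predict the next character at each step, which makes it a character-level 'language model'. Because each generated character is fed back in as the next input, it is also an autoregressive model.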

During training, the correct label at each step is simply the next character, so the labels are obtained by offsetting the full string one position to the left (see figure below). During inference, a character sampled from the output at a given time step is passed to the next step as input.

In order to generate variable length sequences, we'll add characters to signify the start ('<') and end ('>') of a compound name. At inference time, we will pass '<' to the network to start the sequence, and stop generation when the network outputs '>'.
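As a minimal sketch of this setup (the name `toy_name` is only for illustration and is not taken from the training file), wrapping a name in the markers and offsetting it by one character yields the input/label pairs the network trains on:

```python
# Toy illustration: build next-character training pairs for one name.
toy_name = "sodium iodide"

# Add the start ('<') and end ('>') markers described above.
wrapped = "<" + toy_name + ">"

# Inputs are every character except the last; labels are the same
# string offset by one, so labels[t] is the character after inputs[t].
inputs = wrapped[:-1]
labels = wrapped[1:]

for x, y in list(zip(inputs, labels))[:4]:
    print(repr(x), "->", repr(y))
# '<' -> 's'
# 's' -> 'o'
# 'o' -> 'd'
# 'd' -> 'i'
```

The last label is always '>', so the network also learns when a name should end.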

### Training

![train](https://drive.google.com/uc?id=1_dn3LMkMnWz8CMXT7xayGDdq5S9aVY3V)

### Inference

![inference](https://drive.google.com/uc?id=1kMKxCukw9M5K6CA2UxO-eQBxLqAa4O0B)

In [69]:
import tensorflow as tf
import numpy as np
import os

In [70]:
# Read the corpus and close the file when done
with open('compounds.txt', 'rb') as f:
    text = f.read().decode(encoding='utf-8')

In [71]:
# The unique characters in the corpus
vocab = sorted(set(text))

# We'll use < and > to denote start and end
vocab.append('>')
vocab.append('<')

print('Vocab:\n', vocab)

Vocab:
 ['\n', ' ', "'", '(', ')', ',', '-', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '>', '<']


Neural networks operate on vectors of continuous values, not on discrete symbols like characters, so we need to vectorize the input. Conceptually this is a "one-hot" encoding: a vector the size of the "vocabulary" (in this case all characters we saw in the input) with a 1 in the dimension representing the character and 0s everywhere else. In practice we map each character to its integer index, and the Embedding layer in the model turns each index into a dense learned vector, which has the same effect as multiplying the one-hot vector by a weight matrix.
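To make the encoding concrete, here is a small sketch using a hypothetical three-character vocabulary and a made-up weight matrix (neither comes from the notebook). It shows a one-hot vector and why multiplying it by a weight matrix amounts to selecting one row of that matrix, which is the lookup an embedding layer performs:

```python
toy_vocab = ['a', 'b', 'c']                      # hypothetical vocabulary
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}

def one_hot(ch):
    """Return a list with 1.0 at the character's index and 0.0 elsewhere."""
    vec = [0.0] * len(toy_vocab)
    vec[toy_char2idx[ch]] = 1.0
    return vec

# Made-up weights: one row per character, two values per row.
weights = [[0.1, 0.2],
           [0.3, 0.4],
           [0.5, 0.6]]

def vec_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix using plain Python lists."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

print(one_hot('b'))                              # [0.0, 1.0, 0.0]
print(vec_times_matrix(one_hot('b'), weights))   # [0.3, 0.4]
print(weights[toy_char2idx['b']])                # [0.3, 0.4]
```

Storing only the integer index and looking up the row avoids building the mostly-zero vector at all.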

In [72]:
# Create a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
# Keep the vocabulary as a NumPy array for decoding predictions back to characters
idx2char = np.array(vocab)
# Split the corpus into one name per line and wrap each in start/end markers
lines = text.split('\n')
lines = ['<' + x + '>' for x in lines]
max_len = max(len(line) for line in lines)
print("Max length:", max_len)

Max length: 67


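We will pad the training names with zeros so they are all the same length. Index 0 corresponds to '\n' in the vocabulary, which never occurs inside a name once the text has been split into lines, so it can double as the padding value.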

In [73]:
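source = []
target = []
for line in lines:
    # Start from all zeros so positions past the end of the name stay padded
    padded = np.zeros(max_len)
    for i, c in enumerate(line):
        padded[i] = char2idx[c]
    # Source drops the last position; target drops the first (shifted by one)
    source.append(padded[:-1])
    target.append(padded[1:])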

Let's examine an encoded training example. Notice how the target is actually just the source, but shifted by one, because our network is being tasked with looking one time step into the future.

In [74]:
print("Text:")
print(lines[0])
print("\nVector encoded source:")
print(source[0])
print("\nVector encoded target:")
print(target[0])

Text:
<(1r,3r)-1,2,3-trimethylcyclopentane>

Vector encoded source:
[47.  3.  9. 37.  5. 11. 37.  4.  6.  9.  5. 10.  5. 11.  6. 39. 37. 28.
 32. 24. 39. 27. 44. 31. 22. 44. 22. 31. 34. 35. 24. 33. 39. 20. 33. 24.
 46.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Vector encoded target:
[ 3.  9. 37.  5. 11. 37.  4.  6.  9.  5. 10.  5. 11.  6. 39. 37. 28. 32.
 24. 39. 27. 44. 31. 22. 44. 22. 31. 34. 35. 24. 33. 39. 20. 33. 24. 46.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
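To check an encoding by eye, it helps to map the indices back to characters. A minimal sketch follows; `decode` is a hypothetical helper, not part of the notebook, and the block rebuilds the vocabulary printed earlier so that it runs on its own. It treats index 0 as padding, which is safe here because index 0 is `'\n'` and never occurs inside a name.

```python
# Rebuild the vocabulary printed earlier so this sketch is self-contained.
vocab = (['\n', ' ', "'", '(', ')', ',', '-', '/']
         + list('0123456789')
         + [':', ';']
         + list('abcdefghijklmnopqrstuvwxyz')
         + ['>', '<'])

def decode(indices, pad_id=0):
    """Map integer indices back to characters, skipping padding."""
    return ''.join(vocab[int(i)] for i in indices if int(i) != pad_id)

# First few entries of the encoded source shown above, plus some padding.
print(decode([47., 3., 9., 37., 5., 11., 37., 4., 0., 0.]))  # <(1r,3r)
```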


In [75]:
dataset = tf.data.Dataset.from_tensor_slices((source, target))

In [76]:
BUFFER_SIZE = 10000  # shuffle() draws samples from a buffer of this many elements
BATCH_SIZE = 1  # also fixed in the model via batch_input_shape below
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
vocab_size = len(vocab)
embedding_dim = 64

**1. Add a [SimpleRNN](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN) layer with 128 units. Make the layer return sequences, since this RNN will have an output at each time step.**

In [77]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[BATCH_SIZE, None]),
    tf.keras.layers.SimpleRNN(128, return_sequences=True),
    tf.keras.layers.Dense(vocab_size)
])


In [78]:
model.summary()

Model: "sequential_17"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_17 (Embedding)    (1, None, 64)             3072      
                                                                 
 simple_rnn_1 (SimpleRNN)    (1, None, 128)            24704     
                                                                 
 dense_17 (Dense)            (1, None, 48)             6192      
                                                                 
Total params: 33,968
Trainable params: 33,968
Non-trainable params: 0
_________________________________________________________________
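The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard layouts: an embedding table with one row per character, a simple RNN with an input kernel, a recurrent kernel, and a bias, and a dense layer with a kernel and a bias.

```python
vocab_size, embedding_dim, rnn_units = 48, 64, 128

# Embedding: one embedding_dim-sized row per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input kernel + recurrent kernel + bias.
rnn_params = (embedding_dim * rnn_units
              + rnn_units * rnn_units
              + rnn_units)

# Dense: kernel + bias, mapping RNN outputs to one logit per character.
dense_params = rnn_units * vocab_size + vocab_size

print(embedding_params, rnn_params, dense_params)       # 3072 24704 6192
print(embedding_params + rnn_params + dense_params)     # 33968
```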


In [79]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)
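For intuition about what this loss computes at a single time step, here is a plain-Python sketch (not the TensorFlow implementation): apply a softmax to the logits and take the negative log-probability of the correct character's index. The name `sparse_ce_from_logits` is hypothetical.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)                               # subtract max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# With 48 equal logits the model is guessing uniformly, so the loss is
# ln(48), roughly 3.87: a useful reference point for an untrained network.
print(round(sparse_ce_from_logits(0, [0.0] * 48), 2))     # 3.87

# A confident, correct prediction gives a loss near zero.
print(sparse_ce_from_logits(1, [0.0, 20.0, 0.0]) < 1e-6)  # True
```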

In [80]:
def generate_text(model, max_preds):
    input_list = [char2idx['<']]  # start with the start char
    text_generated = []  # list of generated characters
    model.reset_states()  # only affects stateful RNN layers

    for _ in range(max_preds - 1):  # generate at most max_preds - 1 characters
        # Feed the whole sequence generated so far; the output holds one row of logits per step
        predictions = model(tf.expand_dims(input_list, 0))
        predictions = tf.squeeze(predictions, 0)  # remove the batch dimension
        # Sample a character index from the logits of the last time step
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # Append the sampled character so it becomes part of the next input
        input_list.append(predicted_id)
        # Skip padding (index 0) and stop at the end marker
        if predicted_id == 0:
            continue
        pred_char = idx2char[predicted_id]
        if pred_char == '>':
            break
        text_generated.append(pred_char)

    return ''.join(text_generated)
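`tf.random.categorical` draws an index at random in proportion to the softmax of the logits, rather than always taking the largest one. The plain-Python sketch below illustrates the difference; `sample_from_logits` and `greedy` are hypothetical helpers, and the logits are made up.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_from_logits(logits, rng):
    """Draw one index with probability softmax(logits)[index]."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

def greedy(logits):
    """Always pick the index with the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.0]          # made-up scores for three characters
draws = [sample_from_logits(toy_logits, rng) for _ in range(1000)]

print(greedy(toy_logits))             # 0, every time
print(sorted(set(draws)))             # sampling visits the other indices too
```

Sampling is why repeated calls to `generate_text` produce different names, whereas a greedy choice would return the same name every time.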

Train the network for 5 epochs (this will take a few minutes). We will generate a few examples each epoch to check progress.

In [81]:
epochs = 5
for epoch in range(epochs):
    print("EPOCH %d" % (epoch + 1))
    # The dataset is already batched, so batch_size is not passed to fit()
    model.fit(dataset, epochs=1)
    print("EXAMPLES:")
    for _ in range(5):
        generated_text = generate_text(model, max_preds=max_len)
        print(generated_text)
    print()

EPOCH 1
EXAMPLES:
tramidium
dethyl9um(iii) oxydith
15-99-5
ium ctil
atilive pengoperoxe

EPOCH 2
EXAMPLES:
chlorum(iii) phosphide
perchlorate
dichloroiuon tetra2e
calcin
germoly solicon

EPOCH 3
EXAMPLES:
magnesium olane
uranium(iii) telluride
carconiim
calpium oxale
alicalamogen arbonol

EPOCH 4
EXAMPLES:
goly dinium(iii) chloride
zirconium difluoride
argentum texane
caesium chromite
sulfur dioxide

EPOCH 5
EXAMPLES:
sodium brseurite
erbiomonyb-alubrol
chrosulvarsenium(iii) chloride
manganesen(v) oxide
copper(ii) bromide



Try generating some more examples.

In [82]:
for i in range(10):
    generated_text = generate_text(model, max_preds=max_len)
    print(generated_text)

liburium(iv) oxide
mangane1
dichloropanate
borate ioc
thorium dichromide
caesium phosphate
strontinol
indium(iii) iodide
urrchloroiodide
silicon tetraiodide


Let's try a more advanced network. LSTMs carry an additional state vector (the cell state) from one time step to the next, with learned gates controlling what flows into and out of this carryover vector.
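A minimal sketch of one LSTM time step with a single unit, written in plain Python from the standard LSTM equations, shows how the gates act on the carryover vector. The weights are hypothetical values chosen by hand, not taken from any trained model, and the internal weight layout used by Keras may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and the new carryover (cell) state c.
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre('forget'))        # how much of the old cell state to keep
    i = sigmoid(pre('input'))         # how much new information to write
    o = sigmoid(pre('output'))        # how much of the cell state to expose
    g = math.tanh(pre('candidate'))   # the new information itself

    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hand-picked weights: forget gate wide open, input gate nearly shut.
w = {'forget': (0.0, 0.0, 20.0), 'input': (0.0, 0.0, -20.0),
     'output': (0.0, 0.0, 0.0), 'candidate': (1.0, 1.0, 0.0)}

h, c = lstm_step(x=0.7, h_prev=0.2, c_prev=1.5, w=w)
print(abs(c - 1.5) < 1e-6)   # True: the cell state is carried over almost unchanged
```

With these gate settings the cell state passes through the step nearly untouched, which is the mechanism that lets an LSTM remember information over many characters.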

We'll also try stacking RNN layers.

**2. Stack two [LSTM](https://keras.io/api/layers/recurrent_layers/lstm/) layers, each with 64 units.**
  - Hint: A recurrent layer that is passing activations to a subsequent recurrent layer always has to return sequences.

In [93]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[BATCH_SIZE, None]),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(vocab_size)
])


In [94]:
model.summary()

Model: "sequential_21"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_21 (Embedding)    (1, None, 64)             3072      
                                                                 
 lstm_31 (LSTM)              (1, None, 64)             33024     
                                                                 
 lstm_32 (LSTM)              (1, None, 64)             33024     
                                                                 
 dense_21 (Dense)            (1, None, 48)             3120      
                                                                 
Total params: 72,240
Trainable params: 72,240
Non-trainable params: 0
_________________________________________________________________
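As a check on the summary, the LSTM parameter counts can be reproduced by hand. The sketch below assumes the standard LSTM layout: four weight sets (three gates plus the candidate update), each with an input kernel, a recurrent kernel, and a bias. The helper name `lstm_params` is hypothetical.

```python
vocab_size, embedding_dim, lstm_units = 48, 64, 64

def lstm_params(input_dim, units):
    # Four weight sets (three gates plus the candidate update), each with
    # an input kernel, a recurrent kernel, and a bias.
    return 4 * (input_dim * units + units * units + units)

embedding_params = vocab_size * embedding_dim
first_lstm = lstm_params(embedding_dim, lstm_units)   # fed by the embedding
second_lstm = lstm_params(lstm_units, lstm_units)     # fed by the first LSTM
dense_params = lstm_units * vocab_size + vocab_size

print(first_lstm, second_lstm, dense_params)          # 33024 33024 3120
print(embedding_params + first_lstm + second_lstm + dense_params)  # 72240
```

Each 64-unit LSTM layer holds more parameters than the 128-unit SimpleRNN (33,024 versus 24,704) because of the four weight sets.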


In [95]:
model.compile(optimizer='adam', loss=loss)

Train the new network for 5 epochs (this will take a few minutes).

In [96]:
epochs = 5
for epoch in range(epochs):
    print("EPOCH %d" % (epoch + 1))
    # The dataset is already batched, so batch_size is not passed to fit()
    model.fit(dataset, epochs=1)
    print("EXAMPLES:")
    for _ in range(5):
        generated_text = generate_text(model, max_preds=max_len)
        print(generated_text)
    print()

EPOCH 1
EXAMPLES:
wiumsolmium chobid achluride
tatvlolu(ii) solxide
glorpet(ii) scifluoride
mosseee in
cerleavg loride

EPOCH 2
EXAMPLES:
scoroterzone sulfate metroxypibhate
zitrlenifon(iii) arms-bicarge
tallium(iv) chloride
slenicyane acate
terylonide

EPOCH 3
EXAMPLES:
usthamonicoancomynase
pyrborine clomide
ladm(iv) tetrachloride
cerium hybide
proldoxine

EPOCH 4
EXAMPLES:
disodium hexachlorote
terium manganeatetroagium(iii) iodide
nickel(ii) carbonade
diphosphalum hypochlorite
tin(ii) seleuite

EPOCH 5
EXAMPLES:
berfluorcoside hin
phosphine
indoline
cyanomethoxyandic iodide
cobalt(ii) sulfide



Do the LSTM-generated compound names look any different from those generated by the simple RNN?

In [97]:
for i in range(10):
    generated_text = generate_text(model, max_preds=max_len)
    print(generated_text)

sodium metnathiofluoride
sodium iodide
potassium metahydrate trioxide
surythiene
nickel(ii) bromide
mangatesine monofluoride
t,ngslium hydroxide
sodium monoxide
lead(ii) chromate
aluminium mitosstlohfluoride
