
In [1]:
import tensorflow as tf
import tensorflow.keras as tfk
import tensorflow.keras.layers as tfkl
from tensorflow.keras.models import Sequential


from google.colab import drive
import numpy as np
import pandas as pd

In this example, we're going to train a [CharRNN](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on a body of Shakespearian text. Ultimately, this is an unsupervised learning task. But as in our previous explorations of unsupervised DL, we will take an unlabeled dataset and create many labeled samples from it that we can use with our familiar supervised loss functions. The result will be a model that has learned the statistical properties of the input text. It can then be considered a "generative" model of language, because we can use it to generate synthetic passages of Shakespeare.

In [2]:
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [3]:
file_path = "/content/gdrive/My Drive/anly590-datasets/shakespeare.txt"

with open(file_path,"r") as f:
  text = f.read()

We've loaded our Shakespeare text. Let's take a look at an arbitrary snippet.

In [4]:
print(text[31600:32000])

  And there reigns love and all love's loving parts,
  And all those friends which I thought buried.
  How many a holy and obsequious tear
  Hath dear religious love stol'n from mine eye,
  As interest of the dead, which now appear,
  But things removed that hidden in thee lie.
  Thou art the grave where buried love doth live,
  Hung with the trophies of my lovers gone,
  Who all their parts of me


We need to convert our text into numeric arrays. The next several blocks accomplish this.

First, we'll create a mapping between characters and their numeric index. We'll also create the reverse mapping, which we'll need later to turn predicted indices back into characters.

In [5]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 91
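
To make the two mappings concrete, here is a minimal sketch on a toy string. The names `toy_text`, `encode`, and `decode` are illustrative and not part of the notebook. It suggests that encoding with the character-to-index map and decoding with the reverse map should round-trip the text.

```python
# Toy stand-in for the Shakespeare corpus (assumption: any string works).
toy_text = "to be or not to be"

# Same construction as above: sorted unique characters, then both mappings.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

def encode(s, mapping):
    """Turn a string into a list of integer indices."""
    return [mapping[c] for c in s]

def decode(ids, mapping):
    """Turn a list of integer indices back into a string."""
    return "".join(mapping[i] for i in ids)

ids = encode("not to be", toy_char_indices)
print(toy_chars)
print(ids)
print(decode(ids, toy_indices_char))
```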


Next, we'll create a training set of sub-sequences. Remember, we're trying to train a model that can predict the next character when it is given the preceding characters of a sub-sequence. So we will create training pairs where each X is a fixed-length sub-sequence and each Y is the corresponding next character in the text.

In [6]:
maxlen = 40
step = 3
sub_sequences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sub_sequences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sub_sequences))

nb sequences: 1819633
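
The windowing is easier to see on a short string. The sketch below wraps the same loop in a hypothetical helper, `make_windows`, and uses deliberately tiny values for `maxlen` and `step` so the pairs can be read by eye.

```python
def make_windows(text, maxlen, step):
    """Return (sub_sequences, next_chars) using the same loop as above."""
    sub_sequences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sub_sequences.append(text[i: i + maxlen])
        next_chars.append(text[i + maxlen])
    return sub_sequences, next_chars

# Tiny illustrative values; the notebook uses maxlen=40 and step=3.
seqs, targets = make_windows("hello world", maxlen=4, step=3)
for s, t in zip(seqs, targets):
    print(repr(s), "->", repr(t))
# 'hell' -> 'o'
# 'lo w' -> 'o'
# 'worl' -> 'd'
```

A larger `step` produces fewer, less overlapping training pairs; `step=1` would use every position in the text.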


In [7]:
k=300
print("(Sequence):\n" + sub_sequences[k])
print("\n(Target Character): \n" + next_chars[k])

(Sequence):
rary*
in the presentation of The Complet

(Target Character): 
e


Next we'll create one-hot vectors for our sub-sequences. The tensor we create here will be shaped as (num_sequences x sequence_length x alphabet_size).

In [8]:
X = np.zeros((len(sub_sequences), maxlen, len(chars)), dtype=np.uint8 )
Y = np.zeros((len(sub_sequences), len(chars)), dtype=np.uint8)
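for i, seq in enumerate(sub_sequences):
    # Mark the character observed at each time step of the sub-sequence.
    for t, char in enumerate(seq):
        X[i, t, char_indices[char]] = 1
    # The target is set once per sequence, not once per character.
    Y[i, char_indices[next_chars[i]]] = 1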

In [9]:
X[0,0,:]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0], dtype=uint8)

In [10]:
Y[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0], dtype=uint8)
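
As an alternative to the explicit double loop, the same tensors can be built by integer-encoding first and then indexing into an identity matrix. This is only a sketch on toy data (the `toy_*` names are made up for illustration), not the notebook's method, but it produces arrays with the same `(num_sequences, sequence_length, alphabet_size)` layout.

```python
import numpy as np

# Toy sub-sequences and targets (illustrative stand-ins for the real data).
toy_seqs = ["hell", "lo w", "worl"]
toy_next = ["o", "o", "d"]
alphabet = sorted(set("".join(toy_seqs) + "".join(toy_next)))
idx = {c: i for i, c in enumerate(alphabet)}

# Integer-encode first: shape (num_sequences, sequence_length).
ids = np.array([[idx[c] for c in s] for s in toy_seqs])
next_ids = np.array([idx[c] for c in toy_next])

# Indexing rows of an identity matrix yields one-hot vectors.
eye = np.eye(len(alphabet), dtype=np.uint8)
X_toy = eye[ids]        # (num_sequences, sequence_length, alphabet_size)
Y_toy = eye[next_ids]   # (num_sequences, alphabet_size)

print(X_toy.shape, Y_toy.shape)
```

Taking `argmax` along the last axis recovers the integer indices, which is a quick sanity check on any one-hot tensor.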

Our RNN model will be quite simple.

In [11]:
char_rnn = Sequential()
char_rnn.add(tfkl.LSTM(128, input_shape=(maxlen, len(chars))))
char_rnn.add(tfkl.Dense(len(chars),activation="softmax"))

In [12]:
# `learning_rate` replaces the deprecated `lr` argument in TensorFlow 2.
char_rnn.compile(loss='categorical_crossentropy', optimizer=tfk.optimizers.RMSprop(learning_rate=0.01))

In [13]:
char_rnn.fit(X,Y, epochs=20, batch_size=1024)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fa8d8fdabe0>

Once we have a trained model, we can simulate new text by predicting the probabilities of the next character and then drawing a character in proportion to those probabilities. We then simply repeat that process over and over, each time appending the drawn character and predicting again from the most recent `maxlen` characters.

In [14]:
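def draw_char(probs):
    # Cast to float64 and renormalize so the probabilities sum to 1,
    # which np.random.choice requires.
    probs = np.asarray(probs).astype('float64')
    probs = probs / np.sum(probs)
    # Draw one index in proportion to the predicted probabilities.
    draw = np.random.choice(len(probs), p=probs)
    return draw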

def sample_text(model, sample_length=100):
    start = np.random.randint(0, len(text) - maxlen - 1)
    sequence = text[start: start + maxlen]
  
    x_preds = np.zeros((sample_length, maxlen, len(chars)))
    for i in range(sample_length):
        for t, char in enumerate(sequence[-maxlen:]):
            x_preds[i, t, char_indices[char]] = 1.

        preds = model.predict(np.expand_dims(x_preds[i,:,:], axis=0), verbose=0)[0]
        next_index = draw_char(preds)
        next_char = indices_char[next_index]

        sequence += next_char
    return sequence
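
To check that drawing "in proportion to the predicted probabilities" behaves as expected, the sketch below samples many times from a made-up three-character distribution and compares the empirical frequencies with the inputs. The probability vector and the seeded generator are assumptions chosen for illustration; no trained model is involved.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, an assumption for repeatability

# Hypothetical softmax output over a 3-character alphabet, stored as float32
# the way a Keras model would typically return it.
probs32 = np.array([0.7, 0.2, 0.1], dtype=np.float32)

# Cast to float64 and renormalize, as draw_char does, before sampling.
probs = probs32.astype("float64")
probs = probs / probs.sum()

# Draw many indices and measure how often each one comes up.
draws = rng.choice(len(probs), size=20_000, p=probs)
freqs = np.bincount(draws, minlength=len(probs)) / len(draws)
print(freqs)
```

The printed frequencies should sit close to 0.7, 0.2, and 0.1, which is the behavior the generation loop relies on at every step.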

In [15]:
sim = sample_text(char_rnn,sample_length=500) 

In [16]:
print(sim)

rrow night, when Phoebe doth behold
    These an humous time with any doyinging fell
    Hear all the glue of like some obscrab!
    and epace in my sword mine! I was there too tremble.
  TIMON. Most humor
   Then were heart of heavy roils went.
  ULYSSES. As in.
COE'Sur! Antegar, and by, and forture me nor plainters!
  ROSALINE. If Attronion where  
    in anything and yust rostacted, if throst
    Armish undertop'd in haunting dishmes,
    To bear her night to marcy mavi'd, and with him,
    In abassing, and long fire me! I am fits



Notice that the model does a pretty good job of learning the typical statistical patterns of this text: the simulated passage has plausible line breaks, speaker labels, and word shapes, and at a glance it looks a lot like real Shakespeare.

But a word of caution: we can also do pretty well with a much simpler method (a Markov model): http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

So the lesson is to try something simple before jumping right into deep learning.
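
For a sense of how little code the simpler approach needs, here is a minimal sketch of a character-level Markov model. It is not the code from the linked notebook; the function names, the `order=4` default, and the toy training text are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train_char_markov(text, order=4):
    """Count which character follows each length-`order` context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i: i + order]
        counts[context][text[i + order]] += 1
    return counts

def generate(counts, seed, length=100, rng=random):
    """Extend `seed` by sampling next characters in proportion to their counts."""
    order = len(next(iter(counts)))
    out = seed
    for _ in range(length):
        followers = counts.get(out[-order:])
        if not followers:  # unseen context: stop early
            break
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out

# Toy corpus; in the notebook one would pass the Shakespeare `text` instead.
toy_text = "the quick brown fox jumps over the lazy dog. " * 20
model = train_char_markov(toy_text, order=4)
sample = generate(model, seed="the ", length=60, rng=random.Random(0))
print(sample)
```

The seed must be at least `order` characters long. Because the model only counts observed contexts, training is a single pass over the text, which makes it a cheap baseline to compare against the LSTM.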

## Exercise

In this example, we're going to use an RNN for sequence classification. The task we'll set up is to generate a training set of randomized strings, and train our model to detect whether a string contains any vowels.

First, we'll create a training dataset of short randomized character sequences and the corresponding label of whether or not they contain at least one vowel.

In [39]:
import string

In [40]:
def contains_vowels(sequence):
  vowels = ["a", "e", "i", "o", "u"]
  return any([vowel in list(sequence) for vowel in vowels])

In [41]:
contains_vowels("gradient")

True

In [42]:
sequences = []
labels = []
for i in range(10000):
    char_list = np.random.choice(list(string.ascii_lowercase), size=5, replace=True)
    seq = "".join(char_list)
    sequences.append(seq)
    labels.append(int(contains_vowels(seq)))

In [43]:
df = pd.DataFrame({"sequence": sequences, "label":labels})

In [44]:
df.head()

|   | sequence | label |
|---|----------|-------|
| 0 | nuwbk    | 1     |
| 1 | ieaxn    | 1     |
| 2 | fnbjv    | 0     |
| 3 | ynpis    | 1     |
| 4 | tpikw    | 1     |
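
Before training anything, it helps to know how balanced the labels are. Since each of the 5 letters is drawn uniformly and independently from 26 lowercase letters, the expected fraction of positive labels can be computed directly. The sketch below does that arithmetic; the variable names are illustrative.

```python
import string

vowels = set("aeiou")
alphabet = string.ascii_lowercase
seq_len = 5

# Probability that a single uniformly drawn letter is a consonant.
p_consonant = (len(alphabet) - len(vowels)) / len(alphabet)

# Letters are drawn independently, so the chance of no vowel at all
# is p_consonant ** seq_len.
p_has_vowel = 1 - p_consonant ** seq_len
print(round(p_has_vowel, 3))  # about 0.656
```

So a classifier that always predicts "contains a vowel" would already be right roughly two times out of three, and the RNN's accuracy is best judged against that baseline rather than against 50%.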


Next, set up and train an RNN (of any type) to solve this task. What preprocessing will you need to do first on the raw data in order to prepare it for the network?

In [45]:
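# Data Preprocessing
# your code here
# Map each lowercase letter to an index (and back). Avoid naming the list
# `chr`, which would shadow Python's built-in chr().
alphabet = list(string.ascii_lowercase)
chr_idx = dict((c, i) for i, c in enumerate(alphabet))
idx_chr = dict((i, c) for i, c in enumerate(alphabet))

# One-hot encode: (num_sequences, sequence_length=5, alphabet_size=26).
X = np.zeros((len(sequences), 5, 26), dtype=np.uint8)
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        X[i, t, chr_idx[char]] = 1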

In [46]:
# Model setup and training
# your code here
RNN = Sequential()
RNN.add(tfkl.LSTM(128, input_shape=(5,26)))
RNN.add(tfkl.Dense(1,activation="sigmoid"))
# `learning_rate` replaces the deprecated `lr` argument in TensorFlow 2.
RNN.compile(loss='binary_crossentropy', optimizer=tfk.optimizers.RMSprop(learning_rate=0.01), metrics=['accuracy'])
RNN.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_4 (LSTM)                (None, 128)               79360     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
Total params: 79,489
Trainable params: 79,489
Non-trainable params: 0
_________________________________________________________________
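
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM bookkeeping (four gate or candidate transforms, each with an input kernel, a recurrent kernel, and a bias); the helper function names are made up for illustration.

```python
def lstm_param_count(input_dim, units):
    """Four transforms, each with input kernel, recurrent kernel, and bias."""
    return 4 * (input_dim * units + units * units + units)

def dense_param_count(input_dim, units):
    """Kernel plus bias of a fully connected layer."""
    return input_dim * units + units

# 26 one-hot inputs into 128 LSTM units, then 128 features into 1 output.
lstm = lstm_param_count(input_dim=26, units=128)
dense = dense_param_count(input_dim=128, units=1)
print(lstm, dense, lstm + dense)  # 79360 129 79489
```

Note that the sequence length (5) does not appear anywhere: the same weights are reused at every time step.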


In [47]:
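Y = np.array(labels)
# Let Keras derive the number of steps from batch_size so every epoch
# covers the full dataset (10000 / 400 = 25 batches).
results = RNN.fit(X, Y, epochs=20, batch_size=400)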

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
