<a href="https://colab.research.google.com/github/JpChii/ML-Projects/blob/main/AG_NLP_CH16.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Natural Language Processing

In this notebook, I'm going to work through the concepts and techniques discussed in Chapter 16 of Aurélien Géron's *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* (second edition).

## Creating training dataset

In [2]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-a9daf187-9e7c-a9a9-b3b1-0dfa7cc612cc)


In [3]:
import tensorflow as tf
from tensorflow import keras

In [4]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" # shortcut url
filepath = keras.utils.get_file("input.txt", shakespeare_url)
with open(filepath) as f:
  shakespeare_text = f.read()

Downloading data from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


Next, we must encode every character as an integer. We'll do this with Keras's `Tokenizer` class, which finds all the characters used in the text and maps each of them to a different character ID.

In [5]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True) # char level true to use char encoding instead of default word encoding.

In [6]:
tokenizer.fit_on_texts(shakespeare_text)

Now the tokenizer can encode a sentence into a list of character IDs and back, and it can tell us how many distinct characters there are in the text.

In [7]:
tokenizer.texts_to_sequences(["First"])

[[20, 6, 9, 8, 3]]

In [8]:
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])

['f i r s t']
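To make the mapping concrete, here is a minimal pure-Python sketch of what a character-level tokenizer does: lowercase the text, count the characters, and give the most frequent character ID 1. The helper names (`build_char_index`, `encode`, `decode`) and the sample sentence are made up for illustration; this is not the Keras implementation, just the same idea under those assumptions.

```python
from collections import Counter

def build_char_index(text):
    """Map each distinct (lowercased) character to an ID, starting at 1."""
    counts = Counter(text.lower())
    # most_common() sorts by descending count, so frequent characters get small IDs
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index):
    """Turn a string into a list of character IDs."""
    return [char_index[ch] for ch in text.lower()]

def decode(ids, char_index):
    """Turn a list of character IDs back into a string."""
    index_char = {i: ch for ch, i in char_index.items()}
    return "".join(index_char[i] for i in ids)

# Hypothetical sample text, only used to build a tiny vocabulary
sample = "First Citizen: Before we proceed any further, hear me speak."
char_index = build_char_index(sample)

ids = encode("First", char_index)
zero_based = [i - 1 for i in ids]  # same trick as subtracting 1 from the encoded text
print(ids, zero_based, decode(ids, char_index))
```

The round trip returns the lowercased text, just like the `'f i r s t'` output above (minus the spaces Keras inserts between tokens).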

Encode the full text so that each character is represented by its ID. We subtract 1 to get IDs from 0 to 38 rather than from 1 to 39, because the tokenizer numbers its IDs starting from 1, not 0.

In [9]:
import numpy as np
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

Next, we need to split the dataset into training, validation, and test sets. We can't just shuffle all the characters, since this is sequential data and shuffling would destroy the ordering the model needs to learn from.

## How to Split a Sequential Dataset

It's important to avoid overlap between the datasets.

Splitting sequential data is not a trivial task, and the right approach depends solely on the problem at hand. Refer to page 528 of the book for more information.
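One simple way to avoid overlap is to split along time: the first chunk for training, the next for validation, and the rest for testing. The sketch below assumes a 90/5/5 split and uses a hypothetical helper `split_sequence` on a toy list; the right proportions depend on the problem.

```python
def split_sequence(seq, train_pct=90, valid_pct=5):
    """Split a sequence into contiguous train/validation/test parts (no shuffling)."""
    n = len(seq)
    train_end = n * train_pct // 100
    valid_end = n * (train_pct + valid_pct) // 100
    return seq[:train_end], seq[train_end:valid_end], seq[valid_end:]

# Toy stand-in for the encoded text
encoded_demo = list(range(1000))
train_part, valid_part, test_part = split_sequence(encoded_demo)
print(len(train_part), len(valid_part), len(test_part))
```

Because the three parts are contiguous slices, no character appears in more than one set.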

In [10]:
max_id = len(tokenizer.word_index) # number of distinct characters
max_id

39

In [11]:
dataset_size = tokenizer.document_count # total number of characters
dataset_size

1115394

In [12]:
train_size = dataset_size * 10 // 100 # only the first 10% of the text (the book uses the first 90%)
train_size

111539

In [13]:
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

## Chopping the Sequential dataset into multiple windows

The training set is now one long sequence of more than 100,000 characters (over a million if we used 90% of the text, as the book does), so we can't train the neural network on it directly: the RNN would be equivalent to a deep net with that many layers, because unrolling an RNN through time creates one layer per time step, and we would have a single (very long) training instance. Instead, we'll use the dataset's `window()` method to convert this long sequence into many smaller windows of text. Every instance in the dataset is then a fairly short substring of the whole text, and the RNN is unrolled only over the length of these substrings. This is called `truncated backpropagation through time`.

In [14]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

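* A `shift` of `1` creates windows that overlap almost entirely: the first contains characters 0 to 100, the second characters 1 to 101, the third characters 2 to 102, and so on.
* `drop_remainder=True` ensures every window is exactly 101 characters long. Without it, the last 100 windows would shrink from 100 characters down to 1.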

The *`window()`* method creates a dataset that contains windows, each of which is also a dataset. It's a `nested dataset`, like a list of lists. This is useful when we want to transform each window by calling its dataset methods (e.g., to shuffle or batch it). But a nested dataset can't be passed to the model, since the model expects tensors, not datasets. So we'll use the *`flat_map()`* method, which converts a nested dataset into a flat dataset.

The `flat_map()` method takes a function as an argument, which lets us transform each nested dataset before flattening.

In [15]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

In [16]:
for i in dataset.take(2):
  print(i)
  print(f"Shape of flat_map: {i.shape}")
  print(f"Dimension: {i.ndim}")

tf.Tensor(
[19  5  8  7  2  0 18  5  2  5 35  1  9 23 10 21  1 19  3  8  1  0 16  1
  0 22  8  3 18  1  1 12  0  4  9 15  0 19 13  8  2  6  1  8 17  0  6  1
  4  8  0 14  1  0  7 22  1  4 24 26 10 10  4 11 11 23 10  7 22  1  4 24
 17  0  7 22  1  4 24 26 10 10 19  5  8  7  2  0 18  5  2  5 35  1  9 23
 10 15  3 13  0], shape=(101,), dtype=int64)
Shape of flat_map: (101,)
Dimension: 1
tf.Tensor(
[ 5  8  7  2  0 18  5  2  5 35  1  9 23 10 21  1 19  3  8  1  0 16  1  0
 22  8  3 18  1  1 12  0  4  9 15  0 19 13  8  2  6  1  8 17  0  6  1  4
  8  0 14  1  0  7 22  1  4 24 26 10 10  4 11 11 23 10  7 22  1  4 24 17
  0  7 22  1  4 24 26 10 10 19  5  8  7  2  0 18  5  2  5 35  1  9 23 10
 15  3 13  0  4], shape=(101,), dtype=int64)
Shape of flat_map: (101,)
Dimension: 1


So the lambda function batches each window with a batch size of `window_length`. Since every window has exactly that length, we get a single tensor of 101 character IDs per window.
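The windowing logic is easier to see on a tiny sequence. Here is a pure-Python sketch (no `tf.data`) of what `window()` followed by `flat_map()` produces; `sliding_windows` is a hypothetical helper written for this illustration, with the `shift` and `drop_remainder` arguments named after the ones used above.

```python
def sliding_windows(seq, window_length, shift=1, drop_remainder=True):
    """Return the list of windows that window() + flat_map() would yield."""
    windows = []
    for start in range(0, len(seq), shift):
        window = seq[start:start + window_length]
        if drop_remainder and len(window) < window_length:
            break  # every later window would be short as well
        windows.append(window)
    return windows

demo = list(range(10))
full = sliding_windows(demo, window_length=4, shift=1)
ragged = sliding_windows(demo, window_length=4, shift=1, drop_remainder=False)

print(len(full), full[0], full[-1])  # 7 windows, all of length 4
print(ragged[-3:])                   # the shrinking windows that drop_remainder removes
```

With 10 elements and a window length of 4 there are 7 full windows. Without `drop_remainder`, 3 extra windows of lengths 3, 2, and 1 appear at the end, which is the small-scale version of the "last 100 windows" mentioned above.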

Since gradient descent works best when the instances in the training set are independent and identically distributed, we'll shuffle the windows. Then we batch them and separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e., the input shifted one character ahead).

In [17]:
batch_size=32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

In [18]:
for i in dataset.take(1):
  print(f"Length of batch: {len(i)}")
  print(f"One sample from a batch: {i[0]}")
  print(f"Shape of one dataset: {i[0].shape}")
  print(f"One sample from a feature: {i[0][0]}")
  print(f"Shape of one feature: {i[0][0].shape}")

Length of batch: 2
One sample from a batch: [[ 0 16  6 ...  8 23  0]
 [ 0  6  4 ...  6  4  2]
 [ 1  0 20 ...  0 22  1]
 ...
 [ 6  0  4 ...  6  1  0]
 [ 8  3 14 ... 15 29 10]
 [ 2  5  2 ...  2  6  1]]
Shape of one dataset: (32, 100)
One sample from a feature: [ 0 16  6  5 11  1  0  5  2  0 16  1  8  1 10 16  6  3 11  1  7  3 14  1
 17  0 16  1  0 14  5 20  6  2  0 20 13  1  7  7  0  2  6  1 15  0  8  1
 11  5  1 25  1 12  0 13  7  0  6 13 14  4  9  1 11 15 28 10 21 13  2  0
  2  6  1 15  0  2  6  5  9 24  0 16  1  0  4  8  1  0  2  3  3  0 12  1
  4  8 23  0]
Shape of one feature: (100,)
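To see what the `windows[:, :-1]` and `windows[:, 1:]` slices do, here is a small sketch on a single five-character window. `split_input_target` is a hypothetical helper; the slicing is the same as in the `map()` call above, just without the batch dimension.

```python
def split_input_target(window):
    """Inputs are all but the last item; targets are the same items shifted by one."""
    return window[:-1], window[1:]

window = list("To be")
inputs, targets = split_input_target(window)

# At every time step the target is simply the next character
for x, y in zip(inputs, targets):
    print(repr(x), "->", repr(y))
```

So the model is asked to predict the next character at every time step, not only at the end of the window.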


Categorical input features generally have to be encoded. Since there are fairly few distinct characters (only 39), let's encode them as one-hot vectors.

In [19]:
dataset = dataset.map(
    lambda X_batch, y_batch: (tf.one_hot(X_batch, depth=max_id), y_batch)
)
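As a reminder of what one-hot encoding produces, here is a tiny pure-Python sketch with a depth of 4 instead of 39. The `one_hot` helper is written for this illustration and only mimics what `tf.one_hot()` does for a 1D list of IDs.

```python
def one_hot(ids, depth):
    """Encode each ID as a vector of zeros with a single 1.0 at position `id`."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]

demo_one_hot = one_hot([2, 0, 3], depth=4)
for row in demo_one_hot:
    print(row)
```

Each character ID becomes a vector of length `depth`, which is why the features go from shape `(32, 100)` to `(32, 100, 39)` in this notebook.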

In [20]:
for i in dataset.take(1):
  print(f"Length of batch: {len(i)}")
  print(f"One sample from a batch: {i[0]}")
  print(f"Shape of one dataset: {i[0].shape}")
  print(f"One sample from a feature: {i[0][0]}")
  print(f"Shape of one feature: {i[0][0].shape}")

Length of batch: 2
One sample from a batch: [[[0. 0. 0. ... 0. 0. 0.]
  [0. 1. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 1. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 1. 0. ... 0. 0. 0.]
  [0. 1. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 1. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]]

 ...

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[1. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [1. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 1. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 1. 0. ... 0. 0

In [21]:
# Adding prefetching
dataset = dataset.prefetch(1)

That's it, the dataset is ready. Let's move on to modelling.

## Building and Training the Char-RNN Model

* Two GRU layers with 128 units each
* 20% dropout on both the inputs (`dropout`) and the hidden states (`recurrent_dropout`)
* Time-distributed `Dense` output layer
* Softmax activation with 39 units, which outputs 39 probabilities (one per character) summing to 1 at each time step
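For reference, here is a minimal pure-Python sketch of the softmax function used in the output layer, applied to three made-up scores instead of 39. The numbers are arbitrary and the `softmax` helper is only an illustration of the formula, not Keras's implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score gets the largest probability, and the outputs always sum to 1 (up to floating-point rounding).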

In [22]:
model = keras.models.Sequential([
  keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                   dropout=0.2, recurrent_dropout=0.2),
  keras.layers.GRU(128, return_sequences=True,
                   dropout=0.2, recurrent_dropout=0.2),
  keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax")),
])



In [1]:
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(),
              optimizer=keras.optimizers.Adam())
history = model.fit(dataset, epochs=3,
                    callbacks=[keras.callbacks.EarlyStopping(monitor='loss',
                                                             verbose=1,
                                                             restore_best_weights=True)])

NameError: ignored

Okay, training takes about 7,500 seconds (roughly 2 hours) in the ideal case. I don't have a GPU available for such long computations, since this is the free tier and the session times out when idle. So I'll move along with the book without training this model. Let's skip this and code the next steps.

### Preprocessing Function

In [2]:
model.save("/content/drive/MyDrive/ML_models/char_rnn.h5")

NameError: ignored

In [75]:
def preprocess(text):
  X = np.array(tokenizer.texts_to_sequences(text)) - 1
  return tf.one_hot(X, max_id)

In [None]:
X_new = preprocess(["How are yo"])
Y_pred = np.argmax(model(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence last char

In [None]:
loaded_model = keras.models.load_model("/content/drive/MyDrive/ML_models/char_rnn.h5")
input_text = "How are yo"  # renamed to avoid shadowing the built-in input()
print(f"Input text: {input_text}")
print(f"Length of input: {len(input_text)}")
X_new = preprocess([input_text])
y_proba = loaded_model.predict(X_new)  # predict once with the loaded model and reuse the result
print(f"Prediction: {y_proba}\n")
print(f"Prediction shape: {y_proba.shape}\n")
Y_pred = np.argmax(y_proba, axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence last char

To generate new text using the char-RNN model, we can feed it some text, make the model predict the most likely next letter, add it to the end of the text, then give the extended text to the model to guess the next letter, and so on. But in practice this often leads to the same words being repeated over and over. Instead, we can pick the next character randomly, with a probability equal to the estimated probability, using the `tf.random.categorical()` function. This generates more diverse and interesting text.

The `categorical()` function samples random class indices, given the class log probabilities (logits). To have more control over the diversity of the generated text, we can divide the logits by a number called the `temperature`, which we can tweak as we wish: a temperature close to 0 favours the high-probability characters, while a very high temperature gives all characters an equal probability. The following `next_char()` function uses this approach to pick the next character to add to the input text:

In [72]:
def next_char(text, temperature=1):
  X_new = preprocess([text])
  # print(f"Prediction: {model.predict(X_new)}\n")
  # print(f"Prediction shape: {model.predict(X_new).shape}\n")
  y_proba = model.predict(X_new)[0, -1:, :]
  rescaled_logits = tf.math.log(y_proba) / temperature
  char_id = tf.random.categorical(rescaled_logits, num_samples = 1) + 1
  return tokenizer.sequences_to_texts(char_id.numpy())[0]
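To get a feel for what dividing the log probabilities by the temperature does, here is a pure-Python sketch on a made-up three-character distribution. `apply_temperature` is a hypothetical helper that rescales the logs and renormalizes with a softmax, which gives the distribution that sampling from the rescaled logits would follow.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale log-probabilities by 1/temperature and renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

demo_probs = [0.7, 0.2, 0.1]  # made-up next-character probabilities
cold = apply_temperature(demo_probs, 0.2)    # sharpens the distribution
same = apply_temperature(demo_probs, 1.0)    # leaves it unchanged
hot = apply_temperature(demo_probs, 100.0)   # flattens it towards uniform

for name, dist in [("T=0.2", cold), ("T=1", same), ("T=100", hot)]:
    print(name, [round(p, 3) for p in dist])
```

At a low temperature almost all the probability mass goes to the most likely character, and at a very high temperature the three characters become nearly equally likely.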

In [73]:
def complete_text(text, n_chars=50, temperature=1):
  for _ in range(n_chars):
    text += next_char(text, temperature)
  return text

In [None]:
print(complete_text("t", temperature=1))

In [None]:
tf.math.log(0.5)

The Shakespeare model works best at temperature = 1. To generate more convincing text, we could add more GRU layers and more hidden units per layer, train for longer, add some regularization, etc. Moreover, the model is currently incapable of learning patterns longer than `n_steps`, which is just 100 characters. We could make the window larger, but that would also make training harder, and even LSTM and GRU cells cannot handle very long sequences. Alternatively, we can use a stateful RNN.

## Stateful RNN

Until now, we've used only *stateless RNNs*: at each training iteration the model starts with a hidden state full of zeros, updates it at each time step, and throws it away after the last time step, as if it were no longer needed. If we instead tell the RNN to preserve this final state and use it as the initial state for the next training batch, the model can learn long-term patterns even though it only backpropagates through short sequences. This is called a *stateful RNN*.

The first thing to note is that a stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. So the first thing we need to do to build a stateful RNN is to use sequential, non-overlapping input sequences (rather than the shuffled, overlapping sequences we used to train the stateless RNN).

So we'll use `shift=n_steps` (instead of `shift=1`) when calling the `window()` method, and we can't use `shuffle()`, since the sequences need to stay consecutive.

Batching with `batch(32)` would also cause trouble: 32 consecutive windows would land in the same batch, so the first window of the second batch (window 33) would not continue the first window of the first batch (window 1). The simplest solution is to use `batch(1)`, i.e., batches containing a single window.
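The batching problem is easy to check on a toy sequence. The sketch below uses made-up sizes (windows of 4 items, batches of 2) and plain Python lists; it only illustrates the ordering issue and is not the `tf.data` pipeline itself.

```python
demo_steps = 3
toy_text = list(range(13))

# Non-overlapping windows: each has demo_steps inputs plus 1 extra item for the target
toy_windows = [toy_text[s:s + demo_steps + 1]
               for s in range(0, len(toy_text) - demo_steps, demo_steps)]

# Naive batching: consecutive windows end up side by side in the same batch
naive_batches = [toy_windows[i:i + 2] for i in range(0, len(toy_windows), 2)]
# Batches of a single window: each batch continues the previous one
single_batches = [[w] for w in toy_windows]

print("naive: ", naive_batches)
print("single:", single_batches)
```

In `naive_batches`, row 0 of the second batch starts at 6 although row 0 of the first batch ended at 3, so the preserved state would be applied to the wrong place in the text. In `single_batches`, every batch starts exactly where the previous one ended.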

In [43]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" # shortcut url
filepath = keras.utils.get_file("input.txt", shakespeare_url)
with open(filepath) as f:
  shakespeare_text = f.read()

In [44]:
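# Caution: this tokenizer was already fitted above. Fitting it a second time doubles
# tokenizer.document_count; the character-to-ID mapping itself is unchanged.
tokenizer.fit_on_texts(shakespeare_text)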

In [45]:
[encoded_full] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

In [47]:
dataset_size_full = tokenizer.document_count
dataset_size_full

2230788

In [49]:
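# Caution: dataset_size_full (2230788) is twice the real text length (1115394), so
# train_size_full is larger than len(encoded_full) and the slices below keep the whole text.
train_size_full = dataset_size_full * 90 // 100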
train_size_full

2007709

In [50]:
dataset = tf.data.Dataset.from_tensor_slices(encoded_full[:train_size_full])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

In [51]:
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch)
)
dataset = dataset.prefetch(1)

### Batched dataset

In [52]:
batch_size = 32

In [53]:
print(f"Length of dataset: {len(encoded_full[:train_size_full])}")
encoded_parts = np.array_split(encoded_full[:train_size_full], batch_size)
print(f"Dataset length after split: {len(encoded_parts)}")

Length of dataset: 1115394
Dataset length after split: 32


In [54]:
len(encoded_parts[0])

34857

In [55]:
34857 * 32

1115424

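So what's been done above is that the full sequence has been split into 32 nearly equal parts: since 1115394 is not divisible by 32, `np.array_split()` gives the first 2 parts 34857 elements each and the remaining 30 parts 34856 elements each.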

In [57]:
34857 // 100

348

In [58]:
batched_ds = []
for encoded_part in encoded_parts:
  dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
  # Each part has about 34857 elements; with a shift of 100 (n_steps), that gives 348 windows of 101 characters per part
  dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
  dataset = dataset.flat_map(lambda window: window.batch(window_length))
  batched_ds.append(dataset)

In [59]:
batched_ds = tf.data.Dataset.zip(tuple(batched_ds)).map(lambda *windows: tf.stack(windows))

In [60]:
batched_ds

<MapDataset shapes: (32, None), types: tf.int64>

In [61]:
batched_ds = batched_ds.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
batched_ds = batched_ds.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
batched_ds = batched_ds.prefetch(1)

In [62]:
batched_ds

<PrefetchDataset shapes: ((32, None, 39), (32, None)), types: (tf.float32, tf.int64)>
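Here is a pure-Python sketch of the same batching idea on a toy sequence, to check that row `i` of each batch continues row `i` of the previous batch. `stateful_batches` is a hypothetical helper; unlike `np.array_split()`, it simply drops any leftover elements so that all parts have equal length.

```python
def stateful_batches(seq, batch_size, n_steps):
    """Split seq into batch_size parts, window each part, then zip the windows into batches."""
    part_len = len(seq) // batch_size  # assumption: drop leftovers for simplicity
    parts = [seq[i * part_len:(i + 1) * part_len] for i in range(batch_size)]
    window_length = n_steps + 1
    windows_per_part = []
    for part in parts:
        windows = [part[start:start + window_length]
                   for start in range(0, len(part) - window_length + 1, n_steps)]
        windows_per_part.append(windows)
    # Batch k holds the k-th window of every part
    return [list(rows) for rows in zip(*windows_per_part)]

toy_seq = list(range(40))
toy_batches = stateful_batches(toy_seq, batch_size=2, n_steps=4)
for batch in toy_batches:
    print(batch)
```

Row 0 walks through the first half of the sequence and row 1 through the second half, so the hidden state kept for each row always matches the text that row sees next.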

Okay, the dataset is ready now. Let's create a stateful RNN. To make an RNN stateful:

* Set `stateful=True` on each recurrent layer.
* Set `batch_input_shape` in the first layer. The stateful RNN needs to know the batch size, since it preserves one hidden state for each input sequence in the batch.

In [63]:
model = keras.models.Sequential([
  keras.layers.GRU(128, 
                   return_sequences=True, 
                   stateful=True,
                   dropout=0.2,
                   batch_input_shape=[batch_size, None, max_id]),
  keras.layers.GRU(128,
                   return_sequences=True,
                   stateful=True,
                   dropout=0.2),
  keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])

At the end of each epoch, we need to reset the states before we go back to the beginning of the text. Let's write a callback for this.

In [64]:
class ResetStatesCallback(keras.callbacks.Callback):
  def on_epoch_begin(self, epoch, logs):
    self.model.reset_states()

In [65]:
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

In [66]:
history = model.fit(batched_ds,
                    epochs=50,
                    callbacks=[ResetStatesCallback()])

Epoch 1/50
...
Epoch 50/50


We can only use this model to make predictions on batches of the same size as the ones it was trained on (32 sequences). Let's build a stateless model and copy the weights into it so we can predict on a single sequence.

In [67]:
stateless_model = keras.Sequential([
  keras.layers.GRU(128, return_sequences=True),
  keras.layers.GRU(128, return_sequences=True),
  keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])

In [68]:
# building the model to load weights
stateless_model.build(tf.TensorShape([None, None, max_id]))

In [70]:
stateless_model.set_weights(model.get_weights())

In [71]:
model = stateless_model

In [80]:
print(complete_text("t", temperature=0.3))

that will not so she be seen a business of the hand
