<a href="https://colab.research.google.com/github/KeremAydin98/not-to-be-shakespeare/blob/main/Generating_Shakespeare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing

In [1]:
import tensorflow as tf

In [2]:
url = "https://homl.info/shakespeare"
# get_file downloads the file at the URL into the Keras cache directory (unless it is already cached), saves it as "shakespeare.txt", and returns the local path
filepath = tf.keras.utils.get_file("shakespeare.txt", url)
# Open the file in a "with" block so that it is closed automatically afterwards
with open(filepath) as f:
  text = f.read()

Downloading data from https://homl.info/shakespeare


In [3]:
# Let's look at the first 100 characters of the text
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


Let's use Keras' Tokenizer class.

In [4]:
# Create the character level tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True) #  char_level: if True, every character will be treated as a token.

# Fit it on the text
# fit_on_texts builds the vocabulary index by frequency. Since char_level=True, the tokens are characters, not words.
# Index 0 is reserved for padding, so IDs start at 1 and a lower ID means a more frequent character.
tokenizer.fit_on_texts(text)

Now the tokenizer can encode a sentence (or a list of sentences) to a
list of character IDs and back, and it tells us how many distinct characters
there are and the total number of characters in the text:

In [5]:
# Now the tokenizer is able to transform texts to sequences

# texts_to_sequences replaces each character (because char_level=True) with its integer ID from the word_index dictionary.
tokenizer.texts_to_sequences(["First"])

[[20, 6, 9, 8, 3]]

In [6]:
# And sequences to texts
tokenizer.sequences_to_texts([[20,6,9,8,3]])

['f i r s t']
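
To make the character-level encoding less of a black box, here is a minimal pure-Python sketch of the same idea: count characters, give the most frequent one ID 1, and map text to IDs and back. The helpers `build_char_index`, `encode` and `decode` are hypothetical names used only for illustration; they are not part of Keras and do not reproduce every detail of the Tokenizer class.

```python
from collections import Counter

def build_char_index(text):
    """Map each lowercased character to an ID, where 1 is the most frequent."""
    counts = Counter(text.lower())
    # most_common() sorts characters by descending count
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(s, char_index):
    """Replace each character with its integer ID."""
    return [char_index[ch] for ch in s.lower()]

def decode(ids, char_index):
    """Map IDs back to characters, space-separated like sequences_to_texts."""
    index_char = {i: ch for ch, i in char_index.items()}
    return " ".join(index_char[i] for i in ids)

sample = "First Citizen:\nBefore we proceed any further, hear me speak."
char_index = build_char_index(sample)
ids = encode("First", char_index)
roundtrip = decode(ids, char_index)
print(ids, roundtrip)
```

The IDs differ from the ones shown above because this sketch is fitted on a short sample rather than on the full text.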

In [7]:
# Number of distinct characters
max_id = len(tokenizer.word_index)

# Total number of characters (fit_on_texts received a single string, so each character was counted as one document)
dataset_size = tokenizer.document_count

max_id, dataset_size

(39, 1115394)

Let’s encode the full text so each character is represented by its ID (we subtract 1 to get IDs from 0 to 38, rather than from 1 to 39):

In [8]:
import numpy as np

# We subtract 1 to get IDs from 0 to 38, rather than from 1 to 39
[encoded] = np.array(tokenizer.texts_to_sequences([text])) - 1

In [9]:
encoded

array([19,  5,  8, ..., 20, 26, 10])

In [10]:
# Train/validation split: keep the first 70% of the characters for training
split_size = int(dataset_size * 0.7)
dataset = tf.data.Dataset.from_tensor_slices(encoded[:split_size])

The training set now consists of a single sequence of over a million
characters, so we can’t just train the neural network directly on it: the
RNN would be equivalent to a deep net with over a million layers, and we
would have a single (very long) instance to train it. Instead, we will use
the dataset’s window() method to convert this long sequence of characters
into many smaller windows of text.

In [11]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
"""
Illustration with window_length=5 and shift=1
(the real windows are 101 characters long):

Input:
[1,2,3,4,5,6,7,8]
Output:
[[1,2,3,4,5],
 [2,3,4,5,6],
 [3,4,5,6,7],
 [4,5,6,7,8]]
"""
dataset = dataset.window(window_length, shift=1, drop_remainder=True)
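
The windowing step can be checked on a toy sequence without TensorFlow. The sketch below is a plain-Python stand-in, with `sliding_windows` as a hypothetical helper name; it keeps only full-length windows, which is the behaviour requested by drop_remainder=True.

```python
def sliding_windows(seq, window_length, shift=1):
    """Return every full-length window of seq, moving `shift` items each time."""
    last_start = len(seq) - window_length
    return [seq[i:i + window_length] for i in range(0, last_start + 1, shift)]

windows = sliding_windows([1, 2, 3, 4, 5, 6, 7, 8], window_length=5)
for w in windows:
    print(w)
```

A sequence of length n yields n - window_length + 1 windows when shift=1, so consecutive windows overlap in all but one character.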

In [12]:
dataset

<WindowDataset element_spec=DatasetSpec(TensorSpec(shape=(), dtype=tf.int64, name=None), TensorShape([]))>

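The window() method returns a nested dataset: each window is itself a
dataset. The model expects tensors, not datasets, so we must call the
flat_map() method: it converts a nested dataset into a flat dataset (one
that does not contain datasets).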

In [13]:
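"""
    map: returns a new dataset by applying the given function to each element. The function returns one item per element.

    flat_map: the given function must return a dataset for each element; the resulting datasets are then flattened into a single dataset.

    Here each window is batched into one tensor of window_length characters, so the flat dataset contains one tensor per window.
"""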
dataset = dataset.flat_map(lambda window: window.batch(window_length))
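
As a rough analogy, the flattening can be pictured with nested Python lists. In this sketch `flat_map` and `batch` are hypothetical stand-ins for the tf.data methods of the same name, and lists stand in for datasets; it only illustrates the shape of the result.

```python
from itertools import chain

def batch(window, size):
    """Group the items of one window into chunks of `size` items."""
    return [window[i:i + size] for i in range(0, len(window), size)]

def flat_map(func, items):
    """Apply func to each item, then concatenate the resulting lists."""
    return list(chain.from_iterable(func(x) for x in items))

# Stand-in for the nested dataset: three windows of length 3
nested = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]

# Batching each window with size == window length gives one chunk per window
flat = flat_map(lambda w: batch(w, 3), nested)
print(flat)
```

Because the batch size equals the window length, every window collapses into exactly one element of the flat result.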

In [14]:
batch_size = 8
# We shuffle the windows (using a buffer of 10,000), then group them into batches of 8 windows, dropping the last incomplete batch
dataset = dataset.shuffle(10000).batch(batch_size,drop_remainder=True)

# Here we separate each window into input (all but the last character) and target (all but the first character)
"""
Illustration with windows of length 5:

Windows:
[[1,2,3,4,5],
 [2,3,4,5,6]]

Inputs:  [[1,2,3,4],
          [2,3,4,5]]
Targets: [[2,3,4,5],
          [3,4,5,6]]
"""
dataset = dataset.map(lambda windows: (windows[:,:-1], windows[:,1:]))
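
The same slicing can be tried on plain lists to confirm that each target is the input shifted one position ahead. This is only a sketch on toy data; the variable names are illustrative.

```python
windows = [[1, 2, 3, 4, 5],
           [2, 3, 4, 5, 6]]

# Input: every character except the last one
inputs = [w[:-1] for w in windows]
# Target: every character except the first one
targets = [w[1:] for w in windows]

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

At every position t, the target holds the character that follows the input character at t, so the model is trained to predict the next character at each time step.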

In [15]:
# Then we one-hot encode the input: character IDs are categories, so their numeric order carries no meaning. The targets stay as integer IDs.
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
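
For intuition, one-hot encoding turns each ID into a vector of length `depth` containing a single 1. The `one_hot` function below is a hypothetical pure-Python stand-in for tf.one_hot, shown for one sequence of IDs.

```python
def one_hot(ids, depth):
    """Return one vector of length `depth` per ID, with a 1.0 at index ID."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]

encoded_ids = [0, 2, 1]
vectors = one_hot(encoded_ids, depth=3)
for i, v in zip(encoded_ids, vectors):
    print(i, v)
```

With depth=max_id, each input character becomes a 39-dimensional vector, which is why the first GRU layer is declared with input_shape = [None, max_id].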

In [16]:
dataset = dataset.prefetch(1)

# Create the Model

In [17]:
model = tf.keras.Sequential([
                             tf.keras.layers.GRU(512, return_sequences=True, 
                             input_shape = [None, max_id], dropout=0.2, recurrent_dropout=0.2),
                             tf.keras.layers.GRU(512, return_sequences=True,
                             dropout=0.2, recurrent_dropout=0.2),
                             tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(max_id, activation="softmax")) # This wrapper allows to apply a layer to every temporal slice of an input.
])

"""
TimeDistributed:

  Consider a batch of 32 video samples, where each sample is a 128x128 RGB image with channels_last data format, across 10 timesteps. The batch input shape is (32, 10, 128, 128, 3).

  You can then use TimeDistributed to apply the same Conv2D layer to each of the 10 timesteps, independently.

  Because TimeDistributed applies the same instance of Conv2D to each of the timesteps, the same set of weights is used at each timestep.
"""

model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy, # even though input is one hot encoded, target is still tokenized, so we must use sparse categorical cross entropy
              optimizer=tf.keras.optimizers.Adam())



In [18]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 gru (GRU)                   (None, None, 512)         849408    
                                                                 
 gru_1 (GRU)                 (None, None, 512)         1575936   
                                                                 
 time_distributed (TimeDistr  (None, None, 39)         20007     
 ibuted)                                                         
                                                                 
Total params: 2,445,351
Trainable params: 2,445,351
Non-trainable params: 0
_________________________________________________________________
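
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras default GRU variant (reset_after=True), in which each of the three gates has an input kernel, a recurrent kernel and two bias vectors; `gru_params` and `dense_params` are hypothetical helper names.

```python
def gru_params(input_dim, units):
    """3 gates x (input kernel + recurrent kernel + 2 bias vectors)."""
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

max_id = 39
first_gru = gru_params(max_id, 512)    # input is the one-hot vector
second_gru = gru_params(512, 512)      # input is the first GRU's output
output_layer = dense_params(512, max_id)
total = first_gru + second_gru + output_layer
print(first_gru, second_gru, output_layer, total)
```

These values match the Param # column above: 849408, 1575936 and 20007, for a total of 2,445,351.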


# Fit the model

In [19]:
history = model.fit(dataset,steps_per_epoch=500, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
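
A quick back-of-the-envelope calculation shows how much of the training data this run touches. It is a sketch that assumes each training step consumes one new batch of 8 windows, and it reuses the numbers from the cells above (dataset_size, the 70% split, window_length and batch_size).

```python
dataset_size = 1115394
split_size = int(dataset_size * 0.7)       # characters kept for training
window_length = 101
batch_size = 8

# One window starts at every position that still leaves room for 101 characters
n_windows = split_size - window_length + 1
batches_available = n_windows // batch_size  # drop_remainder=True

steps_per_epoch, epochs = 500, 10
batches_used = steps_per_epoch * epochs
fraction = batches_used / batches_available
print(n_windows, batches_available, batches_used, round(fraction, 3))
```

Under that assumption, 10 epochs of 500 steps cover only about 5% of the available batches, so training longer would expose the model to considerably more text.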


# Generate a Shakespeare text

In [20]:
def preprocess(texts):
  # Preprocessing the text by first tokenizing and then one hot encoding the input
  x = np.array(tokenizer.texts_to_sequences(texts)) - 1
  return tf.one_hot(x, max_id)

In [21]:
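def next_char(text, temperature=1):
  X_new = preprocess([text])
  # Keep only the predicted distribution for the last time step, shape [1, max_id]
  y_probs = model.predict(X_new)[0,-1:,:]
  # Convert the probabilities back to logits and rescale them by the temperature
  rescaled_logits = tf.math.log(y_probs) / temperature
  # Sample one character ID; add 1 because the tokenizer's IDs start at 1
  char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
  return tokenizer.sequences_to_texts(char_id.numpy())[0]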
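
To see what the temperature does, the rescaling can be reproduced without TensorFlow. The sketch below divides the log-probabilities by the temperature and renormalizes them with a softmax, which is the distribution tf.random.categorical samples from when given logits. The function name `apply_temperature` and the toy probabilities are illustrative assumptions.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale log-probabilities by the temperature, then renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.5, 0.3, 0.2]
same = apply_temperature(probs, 1.0)   # unchanged distribution
sharp = apply_temperature(probs, 0.5)  # favors the most likely character
flat = apply_temperature(probs, 2.0)   # closer to uniform
print(same, sharp, flat)
```

A temperature below 1 makes the generated text more conservative and repetitive, while a temperature above 1 makes it more varied but also more error-prone.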

In [22]:
def complete_text(text, n_chars=1000, temperature=1):
  for _ in range(n_chars):
    text += next_char(text, temperature)
  return text

In [23]:
print(complete_text("r",temperature=1))

rticus: i, without note,-here's
a vertel tears with smiljnech.

second officer:
faith, there had been many, or elumy to reward
whihe he remember'd.
a very on your actions and daugk,
that may fully tubly care edsured here's anly arm detter and the bleared sightry
sevond the common people.

second officer:
has he did budgen deeds doull

brutus:
i will give them make i as liqy as little question
as he is proud to do't.

brutus:
what's the mad me clip than a never o hate
he will not bloody bleading:
if he did so did at the common disposition.

sicinius:
he cannot temperately that may fully discover his
the arm our stand, as bard as he hath
displeasure your sulvessers: set him speak: matrons flung gloves,
let country? he was he wounded?
god sand carry with us;
for sinking under thee; you are knowen part of your ay, such a nettle but they
plasing beee: they love or hate
him men true.
where is he wounded?
god save you give me to care whether
the people is tho market-place nor on him our
putte