# Wikipedia Text Generation (using RNN LSTM)

> - 🤖 See [full list of Machine Learning Experiments](https://github.com/trekhleb/machine-learning-experiments) on **GitHub**<br/><br/>
> - ▶️ **Interactive Demo**: [try this model and other machine learning experiments in action](https://trekhleb.github.io/machine-learning-experiments/)

## Experiment overview

In this experiment we will use a character-based [Recurrent Neural Network](https://en.wikipedia.org/wiki/Recurrent_neural_network) (RNN) to generate Wikipedia-like text based on the [wikipedia](https://www.tensorflow.org/datasets/catalog/wikipedia) TensorFlow dataset.

![text_generation_wikipedia_rnn.png](../../demos/src/images/text_generation_wikipedia_rnn.png)

_Inspired by [Text generation with an RNN](https://www.tensorflow.org/tutorials/text/text_generation)_

## Import dependencies

In [4]:
# Selecting Tensorflow version v2 (the command is relevant for Colab only).
# %tensorflow_version 2.x

In [5]:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import numpy as np
import platform
import time
import pathlib
import os
import keras

print('Python version:', platform.python_version())
print('Tensorflow version:', tf.__version__)
print('Keras version:', keras.__version__)

Python version: 3.12.3
Tensorflow version: 2.16.1
Keras version: 3.2.1


## Download the dataset

The [Wikipedia](https://www.tensorflow.org/datasets/catalog/wikipedia) dataset contains cleaned articles in all languages. The datasets are built from the [Wikipedia dump](https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).

In [6]:
# List all available datasets to see how the wikipedia dataset is called.
# tfds.list_builders()

[`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) is a convenience method that's the simplest way to build and load a [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).

In [7]:
# Loading the wikipedia dataset.
DATASET_NAME = 'huggingface:wikipedia/20220301.en'

dataset, dataset_info = tfds.load(
    name=DATASET_NAME,
    data_dir='tmp',
    with_info=True,
    split=tfds.Split.TRAIN,
)

  hf_names = hf_datasets.list_datasets()
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [8]:
print(dataset_info)

tfds.core.DatasetInfo(
    name='wikipedia',
    full_name='wikipedia/20220301.en/2.0.0',
    description="""
    Wikipedia dataset containing cleaned articles of all languages.
    The datasets are built from the Wikipedia dump
    (https://dumps.wikimedia.org/) with one split per language. Each example
    contains the content of one full Wikipedia article with cleaning to strip
    markdown and unwanted sections (references, etc.).
    """,
    config_description="""
    Wikipedia dataset containing cleaned articles of all languages.
    The datasets are built from the Wikipedia dump
    (https://dumps.wikimedia.org/) with one split per language. Each example
    contains the content of one full Wikipedia article with cleaning to strip
    markdown and unwanted sections (references, etc.).
    
    """,
    homepage='https://www.tensorflow.org/datasets/catalog/wikipedia',
    data_dir='tmp\\wikipedia\\20220301.en\\2.0.0',
    file_format=tfrecord,
    download_size=Unknown size,
   

In [9]:
print(dataset)

<_PrefetchDataset element_spec={'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'text': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None), 'url': TensorSpec(shape=(), dtype=tf.string, name=None)}>


## Analyze the dataset

In [10]:
TRAIN_NUM_EXAMPLES = dataset_info.splits['train'].num_examples
print('Total number of articles: ', TRAIN_NUM_EXAMPLES)

Total number of articles:  6458670


In [11]:
print('First article','\n======\n')
for example in dataset.take(1):
    print('Title:','\n------')
    print(example['title'].numpy().decode('utf-8'))
    print()

    print('Text:', '\n------')
    print(example['text'].numpy().decode('utf-8'))

First article 

Title: 
------
Anarchism

Text: 
------
Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism.

Humans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist thought are found throughout history, modern anarchism emerged from the Enlightenment. During the latter half of the 19th and the first decades of

## Process the dataset

### Flatten the dataset

Here we convert the dataset from a set of articles into a set of characters. We are only interested in the `text` of each article, so we can drop the other fields (such as `title`) along the way.
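
The two-step idea (article to list of characters, then many lists to one flat stream) can be sketched in plain Python without TensorFlow. The `articles` list below is a hypothetical stand-in for the `text` field of two dataset items; it is only meant to illustrate the shape of the transformation, not the exact `tf.data` pipeline used in this notebook.

```python
from itertools import chain

# Hypothetical stand-ins for the `text` field of two articles.
articles = ["Anarchism is", "Autism is"]

# Step 1: each article becomes a list of single characters
# (the role `article_to_text` plays in this notebook).
articles_as_chars = [list(article) for article in articles]

# Step 2: concatenate the per-article lists into one flat character stream
# (the role `unbatch()` plays in this notebook).
char_stream = list(chain.from_iterable(articles_as_chars))

print(char_stream[:9])
```

After flattening, article boundaries are gone: the last character of one article is immediately followed by the first character of the next.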

In [12]:
def article_to_text(text):
    return np.array([char for char in text.numpy().decode('utf-8')])

# Converting each dataset item to a string ('text') instead of a dictionary ({'text', 'title'}).
dataset_text = dataset.map(
    lambda article: tf.py_function(func=article_to_text, inp=[article['text']], Tout=tf.string)
)

for text in dataset_text.take(2):
    print(text.numpy())
    print('\n')

[b'A' b'n' b'a' ... b'i' b'c' b's']


[b'A' b'u' b't' ... b'a' b't' b'e']




In [13]:
# Unbatch the text dataset into a more granular char dataset.
# Now each dataset item is one character instead of a big piece of text.
dataset_chars = dataset_text.unbatch()

for char in dataset_chars.take(20):
    print(char.numpy().decode('utf-8'))

A
n
a
r
c
h
i
s
m
 
i
s
 
a
 
p
o
l
i
t


### Generating vocabulary

In [14]:
vocab = set()

# Ideally we should take all dataset items into account here.
for text in dataset_text.take(1000):
    vocab.update([char.decode('utf-8') for char in text.numpy()])
    
vocab = sorted(vocab)

print('Unique characters: {}'.format(len(vocab)))
print('vocab:')
print(vocab)

Unique characters: 1138
vocab:
['\t', '\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\xa0', '¡', '¢', '£', '¥', '¦', '§', '¨', '©', 'ª', '«', '¬', '\xad', '¯', '°', '±', '²', '³', '´', 'µ', '¶', '·', '¹', 'º', '»', '¼', '½', '¿', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Í', 'Î', 'Ñ', 'Ò', 'Ó', 'Õ', 'Ö', '×', 'Ø', 'Ú', 'Ü', 'Ý', 'Þ', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', '÷', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Č', 'č', 'ď',

### Vectorize the text

Before feeding the text to our RNN we need to convert the text from a sequence of characters to a sequence of numbers. To do so we will detect all unique characters in the text, form a vocabulary out of it and replace each character with its index in the vocabulary.
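
As a minimal sketch of this idea on a toy string (the names `toy_text`, `toy_vocab`, `encode` and `decode` are illustrative and not part of the notebook), the round trip from characters to indices and back could look like this. Unknown characters fall back to `'?'`, which assumes that `'?'` is present in the vocabulary.

```python
toy_text = "is this a test?"

# Vocabulary: the sorted set of unique characters in the text.
toy_vocab = sorted(set(toy_text))

# Character -> index and index -> character lookups.
toy_char2index = {char: index for index, char in enumerate(toy_vocab)}
toy_index2char = toy_vocab  # list position doubles as the index


def encode(text):
    # Characters missing from the vocabulary are mapped to the index of '?'.
    fallback = toy_char2index["?"]
    return [toy_char2index.get(char, fallback) for char in text]


def decode(indices):
    return "".join(toy_index2char[index] for index in indices)


print(toy_vocab)
print(encode("this"))
print(decode(encode("hi!")))
```

The last line shows the fallback in action: `'!'` is not in the toy vocabulary, so it is decoded back as `'?'`.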

In [15]:
# Map characters to their indices in vocabulary.
char2index = {char: index for index, char in enumerate(vocab)}

print('{')
for char, _ in zip(char2index, range(30)):
    print('  {:4s}: {:3d},'.format(repr(char), char2index[char]))
print('  ...\n}')

{
  '\t':   0,
  '\n':   1,
  ' ' :   2,
  '!' :   3,
  '"' :   4,
  '#' :   5,
  '$' :   6,
  '%' :   7,
  '&' :   8,
  "'" :   9,
  '(' :  10,
  ')' :  11,
  '*' :  12,
  '+' :  13,
  ',' :  14,
  '-' :  15,
  '.' :  16,
  '/' :  17,
  '0' :  18,
  '1' :  19,
  '2' :  20,
  '3' :  21,
  '4' :  22,
  '5' :  23,
  '6' :  24,
  '7' :  25,
  '8' :  26,
  '9' :  27,
  ':' :  28,
  ';' :  29,
  ...
}


In [16]:
# Map character indices to characters from vocabulary.
index2char = np.array(vocab)

print(index2char)

['\t' '\n' ' ' ... '𓆎' '𓊖' '𓏏']


In [17]:
def char_to_index(char):
    char_symbol = char.numpy().decode('utf-8')
    char_index = char2index[char_symbol] if char_symbol in char2index else char2index['?']
    return char_index

dataset_chars_indexed = dataset_chars.map(
    lambda char: tf.py_function(func=char_to_index, inp=[char], Tout=tf.int32)
)

print('ORIGINAL CHARS:', '\n---')
for char in dataset_chars.take(10):
    print(char.numpy().decode())

print('\n\n')    
    
print('INDEXED CHARS:', '\n---')
for char_index in dataset_chars_indexed.take(20):
    print(char_index.numpy())

ORIGINAL CHARS: 
---
A
n
a
r
c
h
i
s
m
 



INDEXED CHARS: 
---
35
80
67
84
69
74
75
85
79
2
75
85
2
67
2
82
81
78
75
86


## Create training sequences

In [18]:
# The maximum length sentence we want for a single input in characters.
sequence_length = 200

In [19]:
# Generate batched sequences out of the char_dataset.
sequences = dataset_chars_indexed.batch(sequence_length + 1, drop_remainder=True)

# Sequences examples.
for item in sequences.take(10):
    print(repr(''.join(index2char[item.numpy()])))
    print()

'Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds '

'to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian M'

'arxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism.\n\nHumans lived in societies without formal h'

'ierarchies long before the establishment of formal states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist though'

't are found throughout history, modern anarchism emerged from the Enlightenment. During the latter half of the 19th and the first decades of the 20th century, the anarchist move

In [20]:
# sequences shape:
# - Each sequence has length 201 (sequence_length + 1)
#
#    201     201          201
# [(.....) (.....) ...  (.....)]

For each sequence, duplicate and shift it to form the input and target text. For example, say `sequence_length` is `4` and our text is `Hello`. The input sequence would be `Hell`, and the target sequence `ello`.
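
The `Hello` example can be reproduced in plain Python. This is a sketch under the assumption that chunks are cut consecutively and an incomplete tail is dropped (as `drop_remainder=True` does); `make_chunks` and `split_chunk` are illustrative helper names.

```python
def make_chunks(text, sequence_length):
    # Cut the text into consecutive chunks of sequence_length + 1 characters,
    # dropping an incomplete chunk at the end.
    size = sequence_length + 1
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]


def split_chunk(chunk):
    # Input is everything but the last character,
    # target is the same chunk shifted one character to the left.
    return chunk[:-1], chunk[1:]


chunks = make_chunks("Hello world", sequence_length=4)
pairs = [split_chunk(chunk) for chunk in chunks]

for input_text, target_text in pairs:
    print(repr(input_text), '->', repr(target_text))
```

Each chunk holds one extra character so that both the input and the target end up with exactly `sequence_length` characters.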

In [21]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

In [22]:
dataset_sequences = sequences.map(split_input_target)

In [23]:
for input_example, target_example in dataset_sequences.take(1):
    print('Input sequence size:', repr(len(input_example.numpy())))
    print('Target sequence size:', repr(len(target_example.numpy())))
    print()
    print('Input:\n', repr(''.join(index2char[input_example.numpy()])))
    print()
    print('Target:\n', repr(''.join(index2char[target_example.numpy()])))

Input sequence size: 200
Target sequence size: 200

Input:
 'Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds'

Target:
 'narchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds '


In [24]:
# dataset shape:
# - Each sequence is a tuple of 2 sub-sequences of length 200 (input_text and target_text)
#
#    200       200           200
# /(.....)\ /(.....)\ ... /(.....)\  <-- input_text
# \(.....)/ \(.....)/     \(.....)/  <-- target_text

Each index of these vectors is processed as one time step. For the input at time step 0, the model receives the index for "A" and tries to predict the index for "n" as the next character. At the next time step, it does the same thing, but the RNN considers the context of the previous step in addition to the current input character.

In [25]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print('Step #{:1d}'.format(i))
    print('  input: {} ({:s})'.format(input_idx, repr(index2char[input_idx])))
    print('  expected output: {} ({:s})'.format(target_idx, repr(index2char[target_idx])))
    print()

Step #0
  input: 35 ('A')
  expected output: 80 ('n')

Step #1
  input: 80 ('n')
  expected output: 67 ('a')

Step #2
  input: 67 ('a')
  expected output: 84 ('r')

Step #3
  input: 84 ('r')
  expected output: 69 ('c')

Step #4
  input: 69 ('c')
  expected output: 74 ('h')



## Split training sequences into batches

We used `tf.data` to split the text into manageable sequences. But before feeding this data into the model, we need to shuffle the data and pack it into batches.
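
To build intuition for buffer-based shuffling followed by batching, here is a simplified pure-Python model. It is an illustration of the general idea only and makes no claim to match the internals of `tf.data`; `buffered_shuffle` and `batched` are hypothetical helpers.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    # Keep at most `buffer_size` items in memory. Once the buffer is full,
    # emit a random buffered item and put the incoming item in its place.
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The input is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def batched(items, batch_size):
    # Group items into full batches; an incomplete final batch is dropped.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
batches = list(batched(shuffled, batch_size=8))
print(batches)
```

A small buffer only mixes items that are close to each other in the stream, so with a buffer of 100 sequences the batches will still be dominated by neighbouring parts of the same few articles.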

In [26]:
# Batch size.
BATCH_SIZE = 64

# Buffer size to shuffle the dataset (TF data is designed to work
# with possibly infinite sequences, so it doesn't attempt to shuffle
# the entire sequence in memory. Instead, it maintains a buffer in
# which it shuffles elements).
BUFFER_SIZE = 100

# How many items to prefetch before the next iteration.
PREFETCH_SIZE = 10

dataset_sequence_batches = dataset_sequences \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE, drop_remainder=True) \
    .prefetch(PREFETCH_SIZE)

dataset_sequence_batches

<_PrefetchDataset element_spec=(TensorSpec(shape=<unknown>, dtype=tf.int32, name=None), TensorSpec(shape=<unknown>, dtype=tf.int32, name=None))>

In [27]:
for input_text, target_text in dataset_sequence_batches.take(1):
    print('1st batch: input_text:', input_text)
    print()
    print('1st batch: target_text:', target_text)

1st batch: input_text: tf.Tensor(
[[71 85  2 ... 67  2 79]
 [67 84 69 ... 86 81 79]
 [18 85  2 ... 75 80 73]
 ...
 [67 85 85 ... 69 81 79]
 [91  2 67 ... 78 85 81]
 [71 14  2 ...  2 85 71]], shape=(64, 200), dtype=int32)

1st batch: target_text: tf.Tensor(
[[85  2 68 ...  2 79 75]
 [84 69 74 ... 81 79 75]
 [85  2 67 ... 80 73  2]
 ...
 [85 85 81 ... 81 79 79]
 [ 2 67 80 ... 85 81  2]
 [14  2 85 ... 85 71 71]], shape=(64, 200), dtype=int32)


In [28]:
# dataset shape:
# - 64 sequences per batch
# - Each sequence is a tuple of 2 sub-sequences of length 200 (input_text and target_text)
#
#
#     200       200           200             200       200           200
# |/(.....)\ /(.....)\ ... /(.....)\| ... |/(.....)\ /(.....)\ ... /(.....)\|  <-- input_text
# |\(.....)/ \(.....)/     \(.....)/| ... |\(.....)/ \(.....)/     \(.....)/|  <-- target_text
#
# <------------- 64 ---------------->     <------------- 64 ---------------->

## Build the model

Use [keras.Sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential) to define the model. For this simple example three layers are used to define our model:

- [keras.layers.Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding): The input layer. A trainable lookup table that maps the index of each character to a vector with `embedding_dim` dimensions;
- [keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM): A type of RNN with `units=rnn_units` (a GRU layer could also be used here);
- [keras.layers.Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense): The output layer, with `vocab_size` outputs.

In [29]:
# Let's do a quick detour and see how the Embedding layer works.
# It takes a batch of character index sequences as an input.
# It encodes every character of every sequence to a vector of tmp_embeding_size length.
tmp_vocab_size = 10
tmp_embeding_size = 5
tmp_input_length = 8
tmp_batch_size = 2

tmp_model = keras.models.Sequential()
tmp_model.add(keras.layers.Embedding(
  input_dim=tmp_vocab_size,
  output_dim=tmp_embeding_size,
))
# The model will take as input an integer matrix of size (batch, input_length).
# The largest integer (i.e. character index) in the input should be no larger than 9 (tmp_vocab_size - 1).
# The model output shape is (None, 8, 5), i.e. (batch, tmp_input_length, tmp_embeding_size),
# where None is the batch dimension.
tmp_input_array = np.random.randint(
  low=0,
  high=tmp_vocab_size,
  size=(tmp_batch_size, tmp_input_length)
)
tmp_model.compile('rmsprop', 'mse')
tmp_output_array = tmp_model.predict(tmp_input_array)

print('tmp_input_array shape:', tmp_input_array.shape)
print('tmp_input_array:')
print(tmp_input_array)
print()
print('tmp_output_array shape:', tmp_output_array.shape)
print('tmp_output_array:')
print(tmp_output_array)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 220ms/step
tmp_input_array shape: (2, 8)
tmp_input_array:
[[3 6 5 3 0 6 3 2]
 [0 5 1 3 9 6 9 1]]

tmp_output_array shape: (2, 8, 5)
tmp_output_array:
[[[-0.0200085   0.02619893  0.04670346  0.00287565 -0.01506522]
  [ 0.04015522  0.02196378 -0.03690524 -0.01532456 -0.01606878]
  [-0.00395869  0.02267778  0.04806638  0.03997287 -0.00361205]
  [-0.0200085   0.02619893  0.04670346  0.00287565 -0.01506522]
  [ 0.02448329 -0.01575    -0.00324494 -0.04958868 -0.00998493]
  [ 0.04015522  0.02196378 -0.03690524 -0.01532456 -0.01606878]
  [-0.0200085   0.02619893  0.04670346  0.00287565 -0.01506522]
  [-0.03206982 -0.01403972 -0.02482074 -0.02816868  0.0004805 ]]

 [[ 0.02448329 -0.01575    -0.00324494 -0.04958868 -0.00998493]
  [-0.00395869  0.02267778  0.04806638  0.03997287 -0.00361205]
  [ 0.03117717  0.03905151  0.01084962  0.0144138  -0.03557605]
  [-0.0200085   0.02619893  0.04670346  0.00287565 -0.01506522]
  [-0.04299131 -0.

In [30]:
# Length of the vocabulary in chars.
vocab_size = len(vocab)

# The embedding dimension.
embedding_dim = 256

# Number of RNN units.
rnn_units = 1024

In [34]:
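def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = keras.models.Sequential()

    # A stateful LSTM needs a fixed batch size, so we declare it explicitly
    # with an Input layer (the sequence length is left unspecified).
    model.add(keras.Input(batch_shape=(batch_size, None), dtype='int32'))

    # Maps every character index to a trainable vector of embedding_dim values.
    model.add(keras.layers.Embedding(
      input_dim=vocab_size,
      output_dim=embedding_dim,
    ))

    model.add(keras.layers.LSTM(
      units=rnn_units,
      return_sequences=True,
      stateful=True,
      recurrent_initializer=keras.initializers.GlorotNormal()
    ))

    # One logit per vocabulary character for every time step.
    model.add(keras.layers.Dense(vocab_size))

    return model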

In [38]:
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
model.build()

In [39]:
model.summary()

In [40]:
keras.utils.plot_model(
    model,
    show_shapes=True,
    show_layer_names=True,
)

You must install pydot (`pip install pydot`) for `plot_model` to work.


For each character the model looks up the embedding, runs the LSTM one time step with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character:

![Model architecture](https://www.tensorflow.org/tutorials/text/images/text_generation_training.png)

Image source: [Text generation with an RNN](https://www.tensorflow.org/tutorials/text/text_generation) notebook.

## Try the model

In [41]:
for input_example_batch, target_example_batch in dataset_sequence_batches.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")


(64, 200, 1138) # (batch_size, sequence_length, vocab_size)


To get actual predictions from the model we need to sample from the output distribution, to get actual character indices. This distribution is defined by the logits over the character vocabulary.

In [42]:
print('Prediction for the 1st letter of the 1st sequence in the batch:')
print(example_batch_predictions[0, 0])

Prediction for the 1st letter of the 1st sequence in the batch:
tf.Tensor(
[-0.00593944  0.00164402 -0.00419457 ... -0.00276663  0.00012022
 -0.00074073], shape=(1138,), dtype=float32)


In [43]:
# Quick overview of how tf.random.categorical() works.

# logits is 2-D Tensor with shape [batch_size, num_classes].
# Each slice [i, :] represents the unnormalized log-probabilities for all classes.
# In the example below we say that the probability for class "0" is low but the
# probability for class "2" is much higher.
tmp_logits = [
  [-0.95, 0, 0.95],
]

# Let's generate 5 samples. Each sample is a class index. Class probabilities 
# are being taken into account (we expect to see more samples of class "2").
tmp_samples = tf.random.categorical(
    logits=tmp_logits,
    num_samples=5
)

print(tmp_samples)

tf.Tensor([[2 2 1 1 0]], shape=(1, 5), dtype=int64)


In [44]:
sampled_indices = tf.random.categorical(
    logits=example_batch_predictions[0],
    num_samples=1
)

sampled_indices.shape

TensorShape([200, 1])

In [45]:
sampled_indices = tf.squeeze(
    input=sampled_indices,
    axis=-1
).numpy()

sampled_indices.shape

(200,)

In [46]:
sampled_indices

array([ 690,  229,   20,  715,  377,  276,   63,  697,  104,  330,  279,
         48,  365,  552,  503,   38,  608,  951,  760,  125,   49, 1024,
        602,  773,  989,  695,  153, 1067,  545,  476,  102,  613,  802,
        894,  431,    8,  945,   34,  939,  792,  878,  433,  680, 1022,
        604, 1115,  388,  747, 1051,   29,    9,  151,  722,  473,  681,
        669,   93,  289,  646,  832,  644,  719,  301,  774,  154,  508,
       1004, 1116,   12,  479,  690, 1020,  198,  111,  404, 1068,  596,
        810,  762,  961,  665,    6,  419,  722,  419,  445, 1004,  574,
        203,  996,  798,  478,   65,  831, 1038,  136,  157,  776, 1036,
        515, 1033,  180,  547,  987, 1096,  844,  229,  792,  923,   53,
       1071,  130, 1093,  845,  391,  544, 1062,  993,  483,   27,  321,
        335,  214,  369, 1053,  262,  877,  396,  898,  315,  901,  155,
        382,  774, 1023, 1026,   16,  965,  668,  461,  165,  958,  598,
        218,   24,   21,  879,  841,  623,  565,   

In [47]:
print('Input:\n', repr(''.join(index2char[input_example_batch[0]])))
print()
print('Next char prediction:\n', repr(''.join(index2char[sampled_indices])))

Input:
 'associations, workers\' councils and worker cooperatives, with production and consumption based on the guiding principle "From each according to his ability, to each according to his need." Anarcho-com'

Next char prediction:
 "ᴥś2ṅξș]ḇ¨̯ȡNβزָDकけẽÀO家ܝợㄌḀä緯ثլ¦दὰ⊖р&ⲥ@ⱰἽ∧тክ学ܪ판ωẬ殖;'âṢեድს{ɒള‛രṗʊữåא北︎*կᴥ始ę°М羽\u06ddᾸểろე$дṢдя北َģ亚Ὑծ_’日Îèỷ文ח得ÿح・黄‹śἽ▒S蠻Å魏›ύت盛三ն9̀΄Łζ法Ǡ∠Б⋯ˈ⌽æσữ安富.イრՄðりܐň63∨\u202fाل.ὰᴀỉἌḥեّǻርɗד×ะ𐀀צ˭ớ文Ɯ\u202fṃՒ―ώრ٨WJ”̮B&∠ክ⌊∼Νᾰ・٣³ńȳư历խбờ÷"


In [48]:
for i, (input_idx, sample_idx) in enumerate(zip(input_example_batch[0][:5], sampled_indices[:5])):
    print('Prediction #{:1d}'.format(i))
    print('  input: {} ({:s})'.format(input_idx, repr(index2char[input_idx])))
    print('  next predicted: {} ({:s})'.format(sample_idx, repr(index2char[sample_idx])))
    print()

Prediction #0
  input: 67 ('a')
  next predicted: 690 ('ᴥ')

Prediction #1
  input: 85 ('s')
  next predicted: 229 ('ś')

Prediction #2
  input: 85 ('s')
  next predicted: 20 ('2')

Prediction #3
  input: 81 ('o')
  next predicted: 715 ('ṅ')

Prediction #4
  input: 69 ('c')
  next predicted: 377 ('ξ')



## Train the model

At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

### Attach an optimizer, and a loss function

In [49]:
# An objective function.
# The function is any callable with the signature scalar_loss = fn(y_true, y_pred).
def loss(labels, logits):
    return keras.losses.sparse_categorical_crossentropy(
      y_true=labels,
      y_pred=logits,
      from_logits=True
    )

example_batch_loss = loss(target_example_batch, example_batch_predictions)

print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 200, 1138)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       7.0364757
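
As a sanity check on this number: an untrained model produces logits that are all close to zero, which corresponds to a nearly uniform distribution over the vocabulary. For a uniform distribution the cross-entropy equals `ln(vocab_size)`, so with the 1138 characters found above we would expect a starting loss of roughly 7.04. The sketch below verifies this in plain Python; `sparse_cross_entropy` is an illustrative re-implementation for a single time step, not the Keras function itself.

```python
import math

# Number of unique characters reported earlier in this notebook.
toy_vocab_size = 1138

# All-zero logits: softmax assigns probability 1 / toy_vocab_size to every class.
uniform_logits = [0.0] * toy_vocab_size


def sparse_cross_entropy(logits, target_index):
    # -log(softmax(logits)[target_index]), computed via a stable log-sum-exp.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


uniform_loss = sparse_cross_entropy(uniform_logits, target_index=35)

print('Loss for uniform predictions:', uniform_loss)
print('ln(vocab_size):              ', math.log(toy_vocab_size))
```

A starting loss far above this value would suggest badly scaled initial outputs, while a value close to it indicates that the model starts from an uninformed guess, as expected.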


In [50]:
adam_optimizer = keras.optimizers.Adam(learning_rate=0.001)
model.compile(
    optimizer=adam_optimizer,
    loss=loss
)

### Configure checkpoints

In [51]:
# %rm -rf tmp/checkpoints

In [59]:
# Directory where the checkpoints will be saved.
checkpoint_dir = 'tmp/checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)

# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}.weights.h5')

checkpoint_callback=keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

### Execute the training

In [60]:
EPOCHS=150
STEPS_PER_EPOCH = 10

In [67]:
tmp_dataset = dataset_sequence_batches.repeat()
    
history = model.fit(
    x=tmp_dataset.as_numpy_iterator(),
    epochs=EPOCHS,
    steps_per_epoch=STEPS_PER_EPOCH,
    callbacks=[
        checkpoint_callback
    ]
)

IndexError: tuple index out of range

In [62]:
def render_training_history(training_history):
    loss = training_history.history['loss']
    plt.title('Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.plot(loss, label='Training set')
    plt.legend()
    plt.grid(linestyle='--', linewidth=1, alpha=0.5)
    plt.show()

In [55]:
render_training_history(history)

NameError: name 'history' is not defined

## Generate text

### Restore the latest checkpoint

To keep this prediction step simple, use a batch size of 1.

Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

To run the model with a different `batch_size`, we need to rebuild the model and restore the weights from the checkpoint.

In [None]:
tf.train.latest_checkpoint(checkpoint_dir)

'tmp/checkpoints/ckpt_100'

In [None]:
simplified_batch_size = 1

restored_model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

restored_model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

restored_model.build(tf.TensorShape([simplified_batch_size, None]))

In [None]:
restored_model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (1, None, 256)            158976    
_________________________________________________________________
lstm_2 (LSTM)                (1, None, 1024)           5246976   
_________________________________________________________________
dense_2 (Dense)              (1, None, 621)            636525    
Total params: 6,042,477
Trainable params: 6,042,477
Non-trainable params: 0
_________________________________________________________________
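
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the sizes printed in the summary (note that it reports 621 output units rather than the 1138 characters found earlier in this notebook); the helper function names are illustrative.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output.
    return input_dim * output_dim + output_dim


# Sizes taken from the summary above.
summary_vocab_size, summary_embedding_dim, summary_rnn_units = 621, 256, 1024

counts = {
    'embedding': embedding_params(summary_vocab_size, summary_embedding_dim),
    'lstm': lstm_params(summary_embedding_dim, summary_rnn_units),
    'dense': dense_params(summary_rnn_units, summary_vocab_size),
}
total_params = sum(counts.values())

print(counts)
print('Total params:', total_params)
```

The LSTM layer dominates the total, which is typical for this architecture: its size grows quadratically with `rnn_units`.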


### The prediction loop

The following code block generates the text:

- Start by choosing a start string, initializing the RNN state and setting the number of characters to generate.

- Get the prediction distribution of the next character using the start string and the RNN state.

- Then, use a categorical distribution to sample the index of the predicted character. Use this predicted character as the next input to the model.

- The RNN state returned by the model is fed back into the model so that it now has more context, instead of only one character. After predicting the next character, the modified RNN states are again fed back into the model, which is how it builds up context from the previously predicted characters.

![Prediction loop](https://www.tensorflow.org/tutorials/text/images/text_generation_sampling.png)

Image source: [Text generation with an RNN](https://www.tensorflow.org/tutorials/text/text_generation) notebook.
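
The role of the `temperature` parameter in the sampling step can be illustrated without TensorFlow. Sampling from a categorical distribution over `logits / temperature` is equivalent to sampling from `softmax(logits / temperature)`. The sketch below uses made-up logits for a three-character vocabulary; `softmax_with_temperature` and `sample_index` are illustrative helpers.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    # Draw one class index according to the temperature-scaled probabilities.
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three characters
rng = random.Random(0)

for temperature in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    samples = [sample_index(toy_logits, temperature, rng) for _ in range(10)]
    print(temperature, [round(p, 3) for p in probs], samples)
```

Low temperatures concentrate almost all probability on the highest-scoring character, which makes the output repetitive, while high temperatures flatten the distribution and make the output more random.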

In [None]:
# num_generate
# - number of characters to generate.
#
# temperature
# - Lower temperatures result in more predictable text.
# - Higher temperatures result in more surprising text.
# - Experiment to find the best setting.
def generate_text(model, start_string, num_generate=1000, temperature=1.0):
    # Evaluation step (generating text using the learned model)

    # Converting our start string to numbers (vectorizing).
    input_indices = [char2index[s] for s in start_string]
    input_indices = tf.expand_dims(input_indices, 0)

    # Empty string to store our results.
    text_generated = []

    # Here batch size == 1.
    model.reset_states()
    for char_index in range(num_generate):
        predictions = model(input_indices)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Sample the next character index from the categorical distribution
        # defined by the temperature-scaled logits.
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(
            predictions,
            num_samples=1
        )[-1, 0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state.
        input_indices = tf.expand_dims([predicted_id], 0)

        text_generated.append(index2char[predicted_id])

    return (start_string + ''.join(text_generated))

In [None]:
num_generate = 300
temperatures = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
start_string = 'Science is'

for temperature in temperatures:
    print("Temperature: {}".format(temperature))
    print('---')
    print(generate_text(restored_model, start_string, num_generate=num_generate, temperature=temperature))
    print('\n')

Temperature: 0.2
---
Science is a species of the station of the season is a species of the company of the complete to the company of the company of the station of the company of the company of the station of the company of the company of the company of the company of the company of the company of the town of the company of the st


Temperature: 0.4
---
Science is a restance of the color personal lines, and granting of the music of the company color in the forming the color players in the line in the color color and the color have a form the harpsic of the southern services in the control of the color of the competition of the focus of the come of the throug


Temperature: 0.6
---
Science is a wingles of the city is with made end of the color and time. In the 106 million saw the moving public strains, and station in the strategy and church of his resistance of the Urderland for Commonther and Loya redistance of a color personal milital responsible reaching a MRSA victory of the New Yor


## Save the model

In [None]:
model_name = 'text_generation_wikipedia_rnn.h5'
# The HDF5 format is inferred from the '.h5' file extension.
restored_model.save(model_name)

## Converting the model to web-format

To use this model on the web, we need to convert it into a format that [tensorflowjs](https://www.tensorflow.org/js) understands. To do so, we can use [tfjs-converter](https://github.com/tensorflow/tfjs/tree/master/tfjs-converter) as follows:

```
tensorflowjs_converter --input_format keras \
  ./experiments/text_generation_wikipedia_rnn/text_generation_wikipedia_rnn.h5 \
  ./demos/public/models/text_generation_wikipedia_rnn
```

You can find this experiment in the [Demo app](https://trekhleb.github.io/machine-learning-experiments) and play around with it right in your browser to see how the model performs in real life.