While some of the sentences are grammatical, most do not make sense. The model has not learned the meaning of words, but consider:

* The model is character-based. When training started, the model did not know how to spell a Vietnamese word, or that words were even a unit of text.

* The structure of the output resembles a poem: lines of verse generally begin with a capital letter and end with punctuation, similar to the dataset.

* As demonstrated below, the model is trained on short sequences of text (100 characters each), and is still able to generate a longer sequence of text with coherent structure.

## Setup

### Import TensorFlow and other libraries

In [18]:
import tensorflow as tf

import numpy as np
import os
import time

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Specify the GPU to be used (assuming you have one)
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
    # Set memory growth to avoid allocation issues
    tf.config.experimental.set_memory_growth(gpus[0], True)


### Download the dataset


In [19]:
path_to_file = tf.keras.utils.get_file('VietnamPoems.txt', 'https://raw.githubusercontent.com/Dev-Aligator/UIT/master/CS431.O12.KHCL/Task/NguyenDuPoemsGeneration/VietnamPoemsDatasets/VietnamPoemsDatasets.txt')

### Read the data

First, look at the text:

In [20]:
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print(f'Length of text: {len(text)} characters')

Length of text: 170865 characters


In [21]:
# Take a look at the first 250 characters in text
print(text[:250])

Than rằng:
Chùa Phổ Cứu trăng dìu gió dặt ngỡ một ngày nên nghĩa trăm năm;
Doành Đào Nguyên nước chảy hoa trôi bỗng nửa bước chia đường đôi ngả.
Chữ chung tình nghĩ lại ngậm ngùi;
Câu vĩnh quyết đọc càng buồn bã.
Nhớ hai ả xưa:
Tính khí dịu dàng;
Hìn


In [22]:
# The unique characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

149 unique characters


## Process the text

In [23]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)

In [24]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

In [25]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
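
To make the mapping concrete, here is a plain-Python analogue of the two lookup layers. It is a sketch, not what Keras does internally. It assumes the `StringLookup` default of reserving index 0 for the out-of-vocabulary token `[UNK]`, which is why a 149-character vocabulary later yields 150 output classes. The toy vocabulary and the helper names are hypothetical.

```python
# Toy vocabulary standing in for the 149 characters of the poem dataset.
toy_vocab = sorted(set("trăng gió"))

# Index 0 is reserved for unknown characters, so real characters start at 1.
index_to_char = ["[UNK]"] + toy_vocab
char_to_index = {ch: i for i, ch in enumerate(index_to_char)}


def toy_ids_from_chars(chars):
    """Map each character to its ID, falling back to 0 for unseen characters."""
    return [char_to_index.get(ch, 0) for ch in chars]


def toy_text_from_ids(ids):
    """Invert the mapping and join the characters back into one string."""
    return "".join(index_to_char[i] for i in ids)


ids = toy_ids_from_chars("trăng")
print(ids)
print(toy_text_from_ids(ids))               # round-trips to the original text
print(toy_ids_from_chars("z"))              # an unseen character maps to 0
print(len(index_to_char) - len(toy_vocab))  # one extra slot for [UNK]
```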

### The prediction task

In [26]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(170865,), dtype=int64, numpy=array([37, 49, 43, ...,  1,  1,  1])>

In [27]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [28]:
for ids in ids_dataset.take(10):
    print(chars_from_ids(ids).numpy().decode('utf-8'))

T
h
a
n
 
r
ằ
n
g
:


In [29]:
seq_length = 100


In [30]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq))

tf.Tensor(
[b'T' b'h' b'a' b'n' b' ' b'r' b'\xe1\xba\xb1' b'n' b'g' b':' b'\n' b'C'
 b'h' b'\xc3\xb9' b'a' b' ' b'P' b'h' b'\xe1\xbb\x95' b' ' b'C'
 b'\xe1\xbb\xa9' b'u' b' ' b't' b'r' b'\xc4\x83' b'n' b'g' b' ' b'd'
 b'\xc3\xac' b'u' b' ' b'g' b'i' b'\xc3\xb3' b' ' b'd' b'\xe1\xba\xb7'
 b't' b' ' b'n' b'g' b'\xe1\xbb\xa1' b' ' b'm' b'\xe1\xbb\x99' b't' b' '
 b'n' b'g' b'\xc3\xa0' b'y' b' ' b'n' b'\xc3\xaa' b'n' b' ' b'n' b'g' b'h'
 b'\xc4\xa9' b'a' b' ' b't' b'r' b'\xc4\x83' b'm' b' ' b'n' b'\xc4\x83'
 b'm' b';' b'\n' b'D' b'o' b'\xc3\xa0' b'n' b'h' b' ' b'\xc4\x90'
 b'\xc3\xa0' b'o' b' ' b'N' b'g' b'u' b'y' b'\xc3\xaa' b'n' b' ' b'n'
 b'\xc6\xb0' b'\xe1\xbb\x9b' b'c' b' ' b'c' b'h' b'\xe1\xba\xa3' b'y'], shape=(101,), dtype=string)


In [31]:
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

b'Than r\xe1\xba\xb1ng:\nCh\xc3\xb9a Ph\xe1\xbb\x95 C\xe1\xbb\xa9u tr\xc4\x83ng d\xc3\xacu gi\xc3\xb3 d\xe1\xba\xb7t ng\xe1\xbb\xa1 m\xe1\xbb\x99t ng\xc3\xa0y n\xc3\xaan ngh\xc4\xa9a tr\xc4\x83m n\xc4\x83m;\nDo\xc3\xa0nh \xc4\x90\xc3\xa0o Nguy\xc3\xaan n\xc6\xb0\xe1\xbb\x9bc ch\xe1\xba\xa3y'
b' hoa tr\xc3\xb4i b\xe1\xbb\x97ng n\xe1\xbb\xada b\xc6\xb0\xe1\xbb\x9bc chia \xc4\x91\xc6\xb0\xe1\xbb\x9dng \xc4\x91\xc3\xb4i ng\xe1\xba\xa3.\nCh\xe1\xbb\xaf chung t\xc3\xacnh ngh\xc4\xa9 l\xe1\xba\xa1i ng\xe1\xba\xadm ng\xc3\xb9i;\nC\xc3\xa2u v\xc4\xa9nh quy\xe1\xba\xbft \xc4\x91\xe1\xbb\x8dc c\xc3\xa0n'
b'g bu\xe1\xbb\x93n b\xc3\xa3.\nNh\xe1\xbb\x9b hai \xe1\xba\xa3 x\xc6\xb0a:\nT\xc3\xadnh kh\xc3\xad d\xe1\xbb\x8bu d\xc3\xa0ng;\nH\xc3\xacnh dung \xe1\xba\xbbo l\xe1\xba\xa3.\nR\xe1\xba\xa1ng l\xc3\xa0u l\xc3\xa0u g\xc6\xb0\xc6\xa1ng \xc4\x91an qu\xe1\xba\xbf v\xe1\xbb\xaba tr\xc3\xb2n;\nNo'
b'n m\xc6\xa1n m\xe1\xbb\x9fn \xc4\x91o\xc3\xa1 h\xe1\xba\xa3i \xc4\x91\xc6\xb0\xe1\xbb\x9dng ch\xc6\xb0a 

In [32]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

In [33]:
split_input_target(list("Tensorflow"))

(['T', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'],
 ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w'])

In [34]:
dataset = sequences.map(split_input_target)

In [35]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'Than r\xe1\xba\xb1ng:\nCh\xc3\xb9a Ph\xe1\xbb\x95 C\xe1\xbb\xa9u tr\xc4\x83ng d\xc3\xacu gi\xc3\xb3 d\xe1\xba\xb7t ng\xe1\xbb\xa1 m\xe1\xbb\x99t ng\xc3\xa0y n\xc3\xaan ngh\xc4\xa9a tr\xc4\x83m n\xc4\x83m;\nDo\xc3\xa0nh \xc4\x90\xc3\xa0o Nguy\xc3\xaan n\xc6\xb0\xe1\xbb\x9bc ch\xe1\xba\xa3'
Target: b'han r\xe1\xba\xb1ng:\nCh\xc3\xb9a Ph\xe1\xbb\x95 C\xe1\xbb\xa9u tr\xc4\x83ng d\xc3\xacu gi\xc3\xb3 d\xe1\xba\xb7t ng\xe1\xbb\xa1 m\xe1\xbb\x99t ng\xc3\xa0y n\xc3\xaan ngh\xc4\xa9a tr\xc4\x83m n\xc4\x83m;\nDo\xc3\xa0nh \xc4\x90\xc3\xa0o Nguy\xc3\xaan n\xc6\xb0\xe1\xbb\x9bc ch\xe1\xba\xa3y'


### Create training batches


In [36]:
# Batch size
BATCH_SIZE = 64

BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
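
The dataset sizes follow from the numbers above. Here is a quick back-of-the-envelope check in plain Python, assuming `drop_remainder=True` discards the incomplete tail at both batching steps (the variable names are illustrative):

```python
total_chars = 170865   # length of the text reported earlier
seq_length = 100
batch_size = 64

# Each example needs seq_length + 1 characters: 100 inputs plus the shifted target.
num_sequences = total_chars // (seq_length + 1)

# Full batches per epoch; leftover sequences that do not fill a batch are dropped.
steps_per_epoch = num_sequences // batch_size

print(num_sequences)    # 1691
print(steps_per_epoch)  # 26
```

Under these assumptions, each epoch runs 26 training steps.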

## Build The Model

In [37]:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [38]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [39]:
model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

## Try the model


In [40]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 150) # (batch_size, sequence_length, vocab_size)


In [41]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  38400     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  153750    
                                                                 
Total params: 4130454 (15.76 MB)
Trainable params: 4130454 (15.76 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
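
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default `reset_after=True`, which keeps two bias vectors per weight set. With that assumption the arithmetic matches the totals printed above.

```python
vocab_size = 150      # 149 characters plus the [UNK] token
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets (update gate, reset gate, candidate state), each with
# an input kernel, a recurrent kernel, and (assuming reset_after=True) two biases.
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weights from every GRU unit to every vocabulary logit, plus biases.
dense_params = rnn_units * vocab_size + vocab_size

print(embedding_params)  # 38400
print(gru_params)        # 3938304
print(dense_params)      # 153750
print(embedding_params + gru_params + dense_params)  # 4130454
```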


In [42]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

In [43]:
sampled_indices

array([ 89,  22, 121,  24,  75, 114,  97,  78,  17,  27,  64,  39,  38,
        53, 125, 108,  48,  60, 125,  64, 114, 111, 122,  33,  48,  13,
       126,  41,  43, 116, 121,  98,   3,  43,  39,  80,  20, 147,  69,
        53,  33,  92,  22,  57,  78,  56, 129, 133,  36,  40, 125,  62,
         6, 143, 135,  19, 118,  11, 148,  62,   5,  99, 124,  88, 149,
       141,  47, 113, 117, 146,  43,   1,  81,  30,  98, 103,  37, 115,
        37, 148,  88,  27,  23,  69, 148,  87,  29, 148, 119,  44, 136,
        65, 121, 114,  24,  24,  72, 126, 117,  61])

Decode these to see the text predicted by this untrained model:

In [44]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b'hi.\nC\xc3\xa1i \xc4\x91\xc3\xaam h\xc3\xb4m \xe1\xba\xa5y \xc4\x91\xc3\xaam g\xc3\xac,\nB\xc3\xb3ng d\xc6\xb0\xc6\xa1ng l\xe1\xbb\x93ng b\xc3\xb3ng \xc4\x91\xe1\xbb\x93 my tr\xe1\xba\xadp tr\xc3\xb9ng.\nCh\xe1\xbb\x93i th\xc6\xb0\xe1\xbb\xa3c d\xc6\xb0\xe1\xbb\xa3c m\xc6\xa1 m\xc3\xb2ng th\xe1\xbb\xa5y v\xc5\xa9,\n\xc4\x90'

Next Char Predictions:
 b"\xc4\x90C\xe1\xbb\x8dE\xc3\xa8\xe1\xba\xbf\xe1\xba\xa1\xc3\xac:IyVUm\xe1\xbb\x95\xe1\xba\xb3gt\xe1\xbb\x95y\xe1\xba\xbf\xe1\xba\xb9\xe1\xbb\x8fPg3\xe1\xbb\x97Ya\xe1\xbb\x83\xe1\xbb\x8d\xe1\xba\xa2!aV\xc3\xb2A\xe2\x80\x9c\xc3\x9amP\xc5\xa9Cq\xc3\xacp\xe1\xbb\x9c\xe1\xbb\xa1SX\xe1\xbb\x95v(\xe1\xbb\xb5\xe1\xbb\xa5?\xe1\xbb\x87.\xe2\x80\x9dv'\xe1\xba\xa3\xe1\xbb\x93\xc4\x83\xe2\x80\xa6\xe1\xbb\xb1e\xe1\xba\xbd\xe1\xbb\x85\xe2\x80\x99a\n\xc3\xb3M\xe1\xba\xa2\xe1\xba\xa9T\xe1\xbb\x81T\xe2\x80\x9d\xc4\x83ID\xc3\x9a\xe2\x80\x9d\xc4\x82L\xe2\x80\x9d\xe1\xbb\x89b\xe1\xbb\xa7\xc3\x81\xe1\xbb\x8d\xe1\xba\xbfEE\xc3\xa1\xe1\xbb\x97\xe1\xbb\x

## Train the model

### Attach an optimizer, and a loss function

In [45]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [46]:
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

Prediction shape:  (64, 100, 150)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(5.0105696, shape=(), dtype=float32)


In [47]:
tf.exp(example_batch_mean_loss).numpy()

149.99013
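
This value is a useful sanity check. A freshly initialized model should spread probability roughly evenly over the vocabulary, so its cross-entropy should be close to the natural log of the vocabulary size, and the exponential of the loss should be close to the vocabulary size itself. A small plain-Python check using the numbers printed above:

```python
import math

vocab_size = 150            # size of the output layer
observed_loss = 5.0105696   # mean loss printed above

# Cross-entropy of a uniform guess over vocab_size classes.
uniform_loss = math.log(vocab_size)

print(round(uniform_loss, 4))             # 5.0106
print(round(math.exp(observed_loss), 2))  # 149.99
```

A mean loss far above this baseline at initialization would suggest a problem with the initialization or the logits.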

Configure the training procedure using the `tf.keras.Model.compile` method. Use `tf.keras.optimizers.Adam` with default arguments and the loss function.

In [48]:
model.compile(optimizer='adam', loss=loss)

### Configure checkpoints

Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:

In [49]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

### Execute the training

In [50]:
EPOCHS = 50

In [51]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/50
...
Epoch 50/50


## Generate text

The following makes a single step prediction:

In [52]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
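
The effect of `temperature` and of the `-inf` mask is easier to see on a tiny example. The following is a plain-Python sketch, not the TensorFlow implementation: it converts hypothetical logits into probabilities by hand, whereas `tf.random.categorical` samples from the logits directly. The logit values and the helper name are made up for illustration.

```python
import math


def masked_softmax(logits, temperature=1.0, skip_ids=()):
    """Turn logits into probabilities after temperature scaling and masking."""
    scaled = [x / temperature for x in logits]
    # Setting a logit to -inf gives that ID zero probability after softmax.
    for i in skip_ids:
        scaled[i] = -math.inf
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5, 0.1]  # index 0 plays the role of "[UNK]"

default = masked_softmax(toy_logits, temperature=1.0, skip_ids=[0])
cold = masked_softmax(toy_logits, temperature=0.5, skip_ids=[0])
hot = masked_softmax(toy_logits, temperature=2.0, skip_ids=[0])

print(default[0])                         # the masked ID can never be sampled
print(max(cold), max(default), max(hot))  # sharper when cold, flatter when hot
```

Lower temperatures concentrate probability on the most likely character and give more predictable text. Higher temperatures flatten the distribution and give more surprising text.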

In [53]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

Run it in a loop to generate some text. Looking at the generated text, you'll see the model knows when to capitalize and where to break lines, and that it imitates the vocabulary of the Vietnamese poems it was trained on. After 50 epochs on this small dataset, it has not yet learned to form coherent sentences.

In [57]:
start = time.time()
states = None
next_char = tf.constant(['Nàng'])
result = [next_char]

phrase_count = 0
while phrase_count <= 30:
    next_char, states = one_step_model.generate_one_step(next_char, states=states)
    result.append(next_char)
    next_char_str = tf.strings.reduce_join(next_char, axis=-1).numpy().decode('utf-8')
    if next_char_str == '\n':
        phrase_count += 1

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

Nàng,
Vường sang cứu lạ đông phai,
Trăm năm bấn hóặt như đàn, ả đào khó am sáu
Cuốn phúc hơn quê ngày Yước vác đã thừa Quang.
Kía nhân duyên bác ngày dài gửi tan.
Bóng thơ đất thấp cha rùng,
Lòng là chắm những song thành Dạch gồn.
Sinh rời nhỏ đến ta căm chập chấp nằm nỗi Ngôn xa,
Khoen sầu bấy hoái mơ màng,
Đem thoa người cũ còn kẻh ngỏ thành.
Nghe tiền giả điểm ngôi trời cũng cao.
Tháng Ba giữa ngược cho tình,
Trước sân lòng đã đa mồng báo tay.
Trả thúc da cảo hạnh phốc
Em vi vương tóc đĩ chiều dao,
Quân trung gương lối quay nghĩ quạn thị
Bỗng luồn lay thân cũ dang dào,
Bổn dây như cũng có nhàu mình Tiên giang đường,
Ngũyệt Một bình địa ba đành đến lòng.
Ngỡ lời kẽ ngổ then đàn,
Phơi thơi cải có tai đây,
Khi vỉ chi có điều xa xoán gian;
Em đi rồi thay nhặt thôi hay,
thoạt vình như đã nhụy là chân thất
Những búc sút vùng cỏ hoa.
Nhấp nhuộc vâng kiệm hãy còn sông kiên sương.
Riêng từ đây lại còn mưa, từ trời còn nước cánh bồi,
Tầm thân nào vị chú hàng,
Nghĩ trân gia lại buộc vào được k