## Shahnameh Character-Level Text Generation

This project builds a text generation model inspired by the Shahnameh, the epic poem by the Persian poet Ferdowsi. The model generates text one character at a time. Trained on the full text of the Shahnameh, it learns character-level patterns of spelling, vocabulary, and verse layout, and produces text that echoes the style of the original.

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # documented levels are 0-3; 3 hides INFO, WARNING and ERROR logs

import numpy as np
import tensorflow as tf
import keras
from keras.layers import GRU, Dropout, Dense, Embedding

### Load Dataset

In [2]:
path = 'archive/shahname_fa.txt'
def read_text(path):
  with open(path, 'rb') as file:
    return file.read().decode(encoding='utf-8')

In [3]:
text = read_text(path)
text

'|به نام خداوند جان و خرد\n|کزین برتر اندیشه برنگذرد\n|خداوند نام و خداوند جای\n|خداوند روزی ده رهنمای\n|خداوند کیوان و گردان سپهر\n|فروزنده ماه و ناهید و مهر\n|ز نام و نشان و گمان برترست\n|نگارندهٔ بر شده پیکرست\n|به بینندگان آفریننده را\n|نبینی مرنجان دو بیننده را\n|نیابد بدو نیز اندیشه راه\n|که او برتر از نام و از جایگاه\n|سخن هر چه زین گوهران بگذرد\n|نیابد بدو راه جان و خرد\n|خرد گر سخن برگزیند همی\n|همان را گزیند که بیند همی\n|ستودن نداند کس او را چو هست\n|میان بندگی را ببایدت بست\n|خرد را و جان را همی سنجد اوی\n|در اندیشهٔ سخته کی گنجد اوی\n|بدین آلت رای و جان و زبان\n|ستود آفریننده را کی توان\n|به هستیش باید که خستو شوی\n|ز گفتار بی\u200cکار یکسو شوی\n|پرستنده باشی و جوینده راه\n|به ژرفی به فرمانش کردن نگاه\n|توانا بود هر که دانا بود\n|ز دانش دل پیر برنا بود\n|از این پرده برتر سخن\u200cگاه نیست\n|ز هستی مر اندیشه را راه نیست\n|کنون ای خردمند وصف خرد\n|بدین جایگه گفتن اندرخورد\n|کنون تا چه داری بیار از خرد\n|که گوش نیوشنده زو برخورد\n|خرد بهتر از هر چه ایزد بداد\n|ستایش خرد را به

In [4]:
print(text[:200])

|به نام خداوند جان و خرد
|کزین برتر اندیشه برنگذرد
|خداوند نام و خداوند جای
|خداوند روزی ده رهنمای
|خداوند کیوان و گردان سپهر
|فروزنده ماه و ناهید و مهر
|ز نام و نشان و گمان برترست
|نگارندهٔ بر شده پی


In [5]:
print(len(text))

2653849


## Text Preprocessing

- Remove `\u200c` characters: `\u200c` is the zero-width non-joiner (ZWNJ). `tf.strings.regex_replace` replaces every occurrence with an empty string, so the character never enters the vocabulary.

- Convert to string: the resulting TensorFlow string tensor is converted back to a standard Python string with `.numpy().decode('utf-8')`.


In [6]:
stripped_text = tf.strings.regex_replace(text, '\u200c', '')
text = stripped_text.numpy().decode('utf-8')
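
The same cleanup can also be done without leaving Python. The following is a minimal sketch of that alternative, not the notebook's method; it uses `str.replace` on a word that appears in the opening lines printed above.

```python
# Minimal sketch: strip the zero-width non-joiner (U+200C) with plain Python.
ZWNJ = '\u200c'

# A compound word from the opening lines of the corpus, written with a ZWNJ.
sample = 'سخن' + ZWNJ + 'گاه'

cleaned = sample.replace(ZWNJ, '')

print(len(sample), len(cleaned))  # the cleaned string is one character shorter
print(ZWNJ in cleaned)
```

Applied to the full corpus, this would be `text = text.replace('\u200c', '')`.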

### Extracting Unique Characters

This code snippet identifies and counts the unique characters in the text.

This helps in understanding the character set for a character-level text generation model.

In [7]:
vocabulary = sorted(set(text))
num_unique_char = len(vocabulary)

In [8]:
print(vocabulary)
print(f'len = {num_unique_char}')

['\n', ' ', '(', ')', '|', '«', '»', '،', '؟', 'ء', 'آ', 'أ', 'ؤ', 'ئ', 'ا', 'ب', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ل', 'م', 'ن', 'ه', 'و', 'ٔ', 'پ', 'چ', 'ژ', 'ک', 'گ', 'ی']
len = 47


### Character-Index Mapping

This code creates mappings between characters and indices:

1. **Character to Index**: `char_to_ids` maps characters to indices.
2. **Index to Character**: `ids_to_char` maps indices back to characters.

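These mappings convert between characters and indices for model training and text generation. Because `StringLookup` reserves index 0 for the out-of-vocabulary token `[UNK]`, the 47 characters occupy indices 1 to 47 and the lookup vocabulary has 48 entries.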

In [9]:
ids_to_char = keras.layers.StringLookup(vocabulary=vocabulary, invert=True, mask_token=None)
char_to_ids = keras.layers.StringLookup(vocabulary=vocabulary, invert=False, mask_token=None)

In [10]:
char_to_ids.get_vocabulary()

['[UNK]',
 '\n',
 ' ',
 '(',
 ')',
 '|',
 '«',
 '»',
 '،',
 '؟',
 'ء',
 'آ',
 'أ',
 'ؤ',
 'ئ',
 'ا',
 'ب',
 'ت',
 'ث',
 'ج',
 'ح',
 'خ',
 'د',
 'ذ',
 'ر',
 'ز',
 'س',
 'ش',
 'ص',
 'ض',
 'ط',
 'ظ',
 'ع',
 'غ',
 'ف',
 'ق',
 'ل',
 'م',
 'ن',
 'ه',
 'و',
 'ٔ',
 'پ',
 'چ',
 'ژ',
 'ک',
 'گ',
 'ی']
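
To make the index layout concrete, here is a plain-Python sketch of the two lookups. It only imitates the behaviour shown above (index 0 reserved for `[UNK]`); the tiny character set and the helper names `encode` and `decode` are hypothetical.

```python
# Plain-Python sketch of the two lookups, mirroring the layout printed above:
# index 0 is reserved for the out-of-vocabulary token, real characters start at 1.
toy_vocabulary = sorted(set('abca\n'))        # hypothetical tiny character set
index_to_char = ['[UNK]'] + toy_vocabulary     # position in the list is the ID
char_to_index = {ch: i for i, ch in enumerate(index_to_char)}

def encode(s):
    # Characters outside the vocabulary fall back to ID 0, the [UNK] slot.
    return [char_to_index.get(ch, 0) for ch in s]

def decode(ids):
    return ''.join(index_to_char[i] for i in ids)

ids = encode('cab')
print(ids)
print(decode(ids))
print(encode('z'))  # 'z' is not in the toy vocabulary
```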

### Converting IDs to Text

This function converts a sequence of character IDs back to text:

1. **Map IDs to Characters**: `ids_to_char(ids).numpy()` returns characters.
2. **Decode Characters**: `[char.decode('utf-8') for char in characters]`.
3. **Join Characters**: `''.join(decoded_characters)`.

The function returns the decoded text string.

In [11]:
def ids_to_text(ids):
  characters = ids_to_char(ids).numpy()
  decoded_characters = [char.decode('utf-8') for char in characters]
  decoded_characters_str = ''.join(decoded_characters)
  return decoded_characters_str

#### An Example

In [12]:
ids = [21, 40, 16]
ids_to_text(ids)

'خوب'

### Converting Text to IDs

This line converts a text string into a sequence of character IDs:

In [13]:
text_to_ids = char_to_ids(list(text))
text_to_ids

<tf.Tensor: shape=(2647386,), dtype=int64, numpy=array([ 5, 16, 39, ..., 47, 38,  1])>

### Creating Training Sequences

This code segment prepares the text data for training by creating sequences of character IDs:

1. **Create Dataset**: `tf.data.Dataset.from_tensor_slices(text_to_ids)` creates a dataset with one character ID per element.

2. **Batching Sequences**: `seq.batch(MAX_SEQ + 1, drop_remainder=True)` groups consecutive IDs into sequences of `MAX_SEQ + 1` characters and drops the incomplete final chunk. The extra character is needed because each sequence is later split into an input and a target shifted by one position.


In [14]:
MAX_SEQ = 100
AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE is the older, deprecated alias

seq = tf.data.Dataset.from_tensor_slices(text_to_ids)
dataset = seq.batch(MAX_SEQ + 1, num_parallel_calls=AUTOTUNE, drop_remainder=True)

In [15]:
for i in dataset.take(1):
  print(ids_to_text(i))

|به نام خداوند جان و خرد
|کزین برتر اندیشه برنگذرد
|خداوند نام و خداوند جای
|خداوند روزی ده رهنمای
|خ
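
The chunking step can be sketched in plain Python. The helper `chunk` below is a hypothetical stand-in for `batch(..., drop_remainder=True)`, and the arithmetic uses the tensor length reported earlier (2,647,386 IDs).

```python
def chunk(ids, size):
    # Consecutive, non-overlapping chunks; an incomplete tail is dropped,
    # which is what drop_remainder=True does.
    usable = len(ids) - len(ids) % size
    return [ids[i:i + size] for i in range(0, usable, size)]

MAX_SEQ = 100
total_ids = 2647386  # length of the ID tensor shown earlier

num_sequences, leftover = divmod(total_ids, MAX_SEQ + 1)
print(num_sequences, leftover)

# Toy illustration: 10 IDs in chunks of 4 leave 2 IDs unused.
print(chunk(list(range(10)), 4))
```

Under these numbers the corpus yields 26,211 sequences of 101 characters, with the last 75 IDs discarded.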


### Creating Input and Target Sequences

This function generates input and target sequences for training:

- **Inputs**: `ids[:-1]` extracts all characters except the last one, serving as the input sequence.
- **Target**: `ids[1:]` extracts all characters except the first one, serving as the target sequence.

This function prepares the data for training the text generation model by pairing input and target sequences.

In [16]:
def create_input_target(ids):
  inputs = ids[:-1]
  target = ids[1:]
  return inputs, target

In [17]:
dataset = dataset.map(create_input_target, num_parallel_calls=AUTOTUNE)

In [18]:
for i, o in dataset.take(1):
    print(ids_to_text(i))
    print('******')
    print(ids_to_text(o))

|به نام خداوند جان و خرد
|کزین برتر اندیشه برنگذرد
|خداوند نام و خداوند جای
|خداوند روزی ده رهنمای
|
******
به نام خداوند جان و خرد
|کزین برتر اندیشه برنگذرد
|خداوند نام و خداوند جای
|خداوند روزی ده رهنمای
|خ
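
The one-position shift is easier to see on a short string. This sketch applies the same slicing to a Python string; the example word is arbitrary.

```python
def create_input_target(ids):
    # Same slicing as the notebook function, applied to a Python sequence.
    return ids[:-1], ids[1:]

inputs, target = create_input_target('Hello')
print(inputs, target)

# At position t the model has read inputs[:t + 1] and should predict target[t].
for t in range(len(inputs)):
    print(repr(inputs[:t + 1]), '->', repr(target[t]))
```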


## Create Training Batches

After segmenting the text into fixed-length sequences, the next step is to group them into batches for training. The cell below caches the (input, target) pairs, packs them into batches of 64 sequences, and prefetches batches so that data preparation overlaps with training.

Batching makes training more efficient by processing multiple sequences at once. Note that this pipeline does not call `shuffle`, so the sequences reach the model in corpus order in every epoch. Shuffling the order of the sequences (not the characters inside them) before batching is a common addition that keeps each batch from containing only neighbouring passages.

In [19]:
BATCH_SIZE = 64
dataset = dataset.cache()
dataset = dataset.batch(BATCH_SIZE, num_parallel_calls=AUTOTUNE, drop_remainder=True)
dataset = dataset.prefetch(AUTOTUNE)

In [20]:
for i, o in dataset.take(1):
  print(i.shape, o.shape)

(64, 100) (64, 100)
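
The pipeline above batches the sequences in corpus order. As an illustration of what shuffling before batching would do, here is a plain-Python sketch; `shuffled_batches`, the seed, and the toy pairs are hypothetical. In `tf.data`, the counterpart would be a `Dataset.shuffle(buffer_size)` call placed after `cache()` and before `batch()`, with a buffer size chosen by the user.

```python
import random

def shuffled_batches(pairs, batch_size, seed=0):
    # Shuffle whole (input, target) pairs, then group them;
    # the incomplete last batch is dropped.
    order = list(pairs)
    random.Random(seed).shuffle(order)
    usable = len(order) - len(order) % batch_size
    return [order[i:i + batch_size] for i in range(0, usable, batch_size)]

# Hypothetical toy data: 10 pairs whose target is the input plus one.
pairs = [(n, n + 1) for n in range(10)]
batches = shuffled_batches(pairs, batch_size=4)
print(batches)

# Batches per epoch in this notebook: sequences of 101 IDs, grouped by 64.
num_sequences = 2647386 // 101
print(divmod(num_sequences, 64))
```

Each pair stays intact while the order of pairs changes. For this corpus the division gives 409 full batches per epoch, with 35 sequences dropped by `drop_remainder=True`.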


## Model

The **MyModel** class is a custom TensorFlow/Keras model for text generation. It comprises three main layers:

- **Embedding Layer:** Maps input token indices to dense embedding vectors.
- **GRU Layer:** Processes the embedded sequences, producing output sequences and new states.
- **Dense Layer:** Converts GRU output sequences to logits over the vocabulary.

The `call` method defines the forward pass of the model, taking input sequences and optionally initial states, and returning output logits and new states if specified.

In [22]:
class MyModel(keras.models.Model):
    def __init__(self, vocab_size, embd_dim, rnn_units):
        super(MyModel, self).__init__()
        
        self.embedding = Embedding(vocab_size, embd_dim)
        self.gru = GRU(rnn_units, return_sequences=True, return_state=True)
        self.dense = Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        
        x = self.embedding(inputs)
        
        if states is None:  # identity check; `==` on a tensor is not a reliable None test
            states = self.gru.get_initial_state(x)
            
        x, states = self.gru(x, initial_state=states, training=training)
        x = self.dense(x)
        if return_state:
            return x, states
        else:
            return x


In [28]:
VOCAB_SIZE = len(char_to_ids.get_vocabulary())
EMBD_DIM = 512
RNN_UNITS = 1048
model = MyModel(VOCAB_SIZE, EMBD_DIM, RNN_UNITS)

In [29]:
for input_ids, target_ids in dataset.take(1):
    pred = model(input_ids)
    print(pred.shape)


(64, 100, 48)


In [30]:
model.summary()

Model: "my_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     multiple                  24576     
                                                                 
 gru_1 (GRU)                 multiple                  4910928   
                                                                 
 dense_1 (Dense)             multiple                  50352     
                                                                 
Total params: 4985856 (19.02 MB)
Trainable params: 4985856 (19.02 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
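
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default `reset_after=True`, which gives each of the three weight blocks (update gate, reset gate, candidate state) two bias vectors.

```python
VOCAB_SIZE, EMBD_DIM, RNN_UNITS = 48, 512, 1048

# Embedding: one EMBD_DIM-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBD_DIM

# GRU: three blocks, each with input weights, recurrent weights and
# two bias vectors (the two biases assume reset_after=True).
gru_params = 3 * (RNN_UNITS * (EMBD_DIM + RNN_UNITS) + 2 * RNN_UNITS)

# Dense: weight matrix plus one bias per output logit.
dense_params = RNN_UNITS * VOCAB_SIZE + VOCAB_SIZE

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)
```

These values match the `Param #` column above, and almost all of the parameters sit in the GRU layer.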


## Train the Model

In [31]:
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
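
For a single character position, sparse categorical cross-entropy on logits is the negative log-probability that the softmax assigns to the true next character. The sketch below is a plain-Python illustration of that formula, not the Keras implementation; `sparse_ce_from_logits` is a hypothetical helper.

```python
import math

def sparse_ce_from_logits(logits, target_id):
    # Numerically stable log-sum-exp, then the negative log-probability
    # of the target class under the softmax of the logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]

VOCAB_SIZE = 48

# An untrained model that treats all 48 characters as equally likely.
uniform_logits = [0.0] * VOCAB_SIZE
uniform_loss = sparse_ce_from_logits(uniform_logits, target_id=5)
print(uniform_loss, math.log(VOCAB_SIZE))

# Raising the logit of the correct character lowers the loss.
better_logits = list(uniform_logits)
better_logits[5] = 3.0
better_loss = sparse_ce_from_logits(better_logits, target_id=5)
print(better_loss)
```

A model that guesses uniformly over 48 characters scores `ln(48)`, about 3.87, which is a useful reference point for the loss early in training.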

In [32]:
model.fit(dataset, epochs=30)

Epoch 1/30
...
Epoch 30/30


<keras.src.callbacks.History at 0x781f01769950>

## Text Generation

Generating text with a character-level model involves iterating through a loop to predict the next character based on the preceding characters.

Generation starts from an initial text (a seed string). The seed is fed into the model, which predicts a distribution over the next character; one character is sampled and appended, and the process repeats with the recurrent state carried forward. Each iteration adds one character, and the loop runs until the desired length is reached or a stopping condition is met. Because each prediction is conditioned on the preceding context, the output tends to mirror the style of the training data, although a character-level model does not guarantee meaningful verses.

![Alt text](download(1).png)
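
The loop itself can be sketched without a trained network. In the toy below, `toy_next_char` is a hypothetical stand-in for the model that simply cycles through a fixed string; only the predict, append, repeat structure is meant to carry over.

```python
def toy_next_char(context):
    # Hypothetical stand-in for a trained model: looks only at the last
    # character and returns the one that follows it in a fixed cycle.
    cycle = 'abc'
    return cycle[(cycle.index(context[-1]) + 1) % len(cycle)]

def generate(seed, num_chars, next_char_fn):
    text = seed
    for _ in range(num_chars):
        text += next_char_fn(text)  # predict one character, append, repeat
    return text

generated = generate('a', 5, toy_next_char)
print(generated)
```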


### One-Step Char-Level Text Generation

This class facilitates character-level text generation:

- **Initialization**: 
  - It initializes with the main model (`model`), char-to-ID mapping (`char_to_ids`), ID-to-char mapping (`ids_to_char`), and a temperature parameter (`temperature`).
  - A sparse mask prevents predicting `[UNK]` characters.

- **Generation Method**: 
  - `generate_one_step` takes an input string and optionally internal states.
  - Splits input into characters, converts to IDs, predicts the next character, applies temperature scaling, and adds a mask.
  - Samples a character from the distribution and converts it back.
  
This class enables character-level text generation with controlled randomness.
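
The effect of temperature scaling and masking can be seen on a handful of made-up logits. This is a plain-Python sketch of the idea, with hypothetical helper names and values; the class itself works on TensorFlow tensors and samples with `tf.random.categorical`.

```python
import math

def softmax(logits):
    # Subtract the largest finite logit for numerical stability.
    m = max(z for z in logits if z != -math.inf)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_char_distribution(logits, temperature, masked_ids=()):
    scaled = [z / temperature for z in logits]
    for i in masked_ids:
        scaled[i] = -math.inf  # masked entries end up with probability 0
    return softmax(scaled)

# Hypothetical logits; index 0 plays the role of [UNK].
logits = [2.0, 1.0, 0.5, 0.0]
warm = next_char_distribution(logits, temperature=1.0, masked_ids=[0])
cool = next_char_distribution(logits, temperature=0.6, masked_ids=[0])
print(warm)
print(cool)
```

The masked index gets probability zero in both cases, and the lower temperature concentrates more probability on the most likely remaining character, which makes sampling more conservative.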

In [33]:
class One_Step(keras.models.Model):
    def __init__(self, model, ids_to_char, char_to_ids, temperature=1.0):
        super(One_Step, self).__init__()
        self.model = model
        self.ids_to_char = ids_to_char
        self.char_to_ids = char_to_ids
        self.temperature = temperature
        
        skip_ids = self.char_to_ids(['[UNK]'])[:, None]
        
        sparse_mask = tf.SparseTensor(
            values=[-float('inf')]*len(skip_ids),
            indices=skip_ids,
            dense_shape=[len(char_to_ids.get_vocabulary())])
        self.prediction_mask = tf.sparse.to_dense(sparse_mask)

    @tf.function()
    def generate_one_step(self, inputs, states=None):

        input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
        input_ids = self.char_to_ids(input_chars).to_tensor()

        predicted, states = self.model(inputs=input_ids, states=states, return_state=True)

        predicted = predicted[:, -1, :]
        predicted = predicted / self.temperature
        predicted = predicted + self.prediction_mask

        predicted_id = tf.random.categorical(predicted, num_samples=1)
        predicted_id = tf.squeeze(predicted_id, axis=-1)

        predicted_chars = self.ids_to_char(predicted_id)
        
        return predicted_chars, states


## Text Generation Using One-Step Model

This code generates text using the one-step text generation model (`one_step_model`).

It starts with the seed phrase `'به نام خدا'` (In the name of God). It then iteratively calls the model's `generate_one_step` method, feeding back the newly generated character together with the GRU state so that each prediction depends on everything generated so far. Each generated character is appended to the `result` list. After 500 characters have been generated, the pieces are joined into the final text and the run time is printed.

In [39]:
one_step_model = One_Step(model, ids_to_char, char_to_ids, temperature=0.6)

In [43]:

import time 
start = time.time()
states = None
next_char = tf.constant(['به نام خدا'])
result = [next_char]

for n in range(500):
    next_char, states = one_step_model.generate_one_step(next_char, states=states)
    result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)


به نام خداوند دانای پند
|نبد دست و دو روی و بگریختی
|به گیتی جز از جنگ خسته مباد
|جهان را به خواهش به خنجر زدیم
|به دیوانگی نام مهدان بود
|به یزدان پناهیم فرزند من
|که بر بوم و بر خون از دادگر
|درخت نبرد اندر آیین و دین
|ز تخم مهان آفریدون بود
|دگر آنک گفتی ز ایرانیان
|به گرز گران بسته آیی به داد
|بدو گفت شاهای گستهم و ماه
|که او را تو از جادوی برفراخت
|جهان آفرین را نیایش کنید
|به جان تو هرگونهای تاج و تخت
|چنین گفت ما را چنان کس ندید
|که کس در جهان زشت ننگ آیدت
|به خاقان بگفتار او راستی
|ز بس نیزه و تیغ 

________________________________________________________________________________

Run time: 0.5612320899963379
