# Text Generation

- One of the most fun applications of sequence models is text generation: a model is trained on a body of text and then generates new text that sounds as if it were written by the same author or set of authors.
- Rather than treating generation as a special task, we can frame it as a prediction problem.
- Take a body of text, extract its full vocabulary, and build a dataset in which the opening words of each phrase are the inputs (x) and the word that follows them is the label (y).
- Text generation in TensorFlow uses deep learning models trained to predict the next word or character in a sequence from the tokens that precede it. Common choices are Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and, more recently, Transformer-based models such as GPT. (BERT is also Transformer-based, but it is an encoder trained with masked-token prediction and is not typically used to generate text left to right.)

Here’s how text generation in TensorFlow generally works:

## 1. Dataset Preparation

- Text corpus: You first need a large text corpus for training. The text is usually preprocessed to remove unwanted characters, split into tokens, and converted to a numerical representation (e.g., integer sequences).
- Tokenization: The text is split into tokens, which can be words or characters depending on the granularity of the model. In TensorFlow, this can be done with the Tokenizer class from tensorflow.keras.preprocessing.text. That class is marked as deprecated in recent releases, where tf.keras.layers.TextVectorization is the recommended replacement.
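
To make the word-versus-character choice concrete, here is a minimal sketch in plain Python. The helper names `word_tokens` and `char_tokens` are hypothetical, and the regular expression is only one of many reasonable cleaning rules; it is not what the Keras Tokenizer does internally.

```python
import re

def word_tokens(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes as words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def char_tokens(text):
    """Treat every character, including spaces and punctuation, as one token."""
    return list(text.lower())

sample = "I love cats. Cats love me!"
words = word_tokens(sample)
chars = char_tokens(sample)

print(words)
print(len(words), "word tokens vs", len(chars), "character tokens")
```

Word-level tokens give shorter sequences over a larger vocabulary, while character-level tokens give longer sequences over a very small vocabulary.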
## 2. Text to Sequences

The tokenizer converts the text into sequences of integers, mapping each distinct word or character to a unique integer.
Example: "I love cats" -> [5, 22, 13] (the exact integers depend on the vocabulary the tokenizer was fitted on).
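
The mapping step can be sketched without any library. The functions `build_word_index` and `to_sequence` below are hypothetical helpers; they assume indices are handed out by descending frequency starting at 1, with 0 left free for padding, which mirrors a common convention but is an assumption rather than a specification of any particular tokenizer.

```python
from collections import Counter

def build_word_index(tokens):
    """Assign 1, 2, 3, ... to words by descending frequency (ties: first seen first)."""
    counts = Counter(tokens)
    first_seen = {}
    for position, token in enumerate(tokens):
        first_seen.setdefault(token, position)
    ordered = sorted(counts, key=lambda t: (-counts[t], first_seen[t]))
    # Index 0 is deliberately unused so it can serve as the padding value.
    return {token: index for index, token in enumerate(ordered, start=1)}

def to_sequence(tokens, word_index):
    """Replace each known token with its integer id; unknown tokens are skipped."""
    return [word_index[t] for t in tokens if t in word_index]

corpus_tokens = "i love cats and cats love me".split()
word_index = build_word_index(corpus_tokens)
sequence = to_sequence("i love cats".split(), word_index)

print(word_index)
print(sequence)
```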
## 3. Creating Input and Output Sequences

The model is trained to predict the next word or character in a sequence. For example, for the sequence "I love cats", the input is "I love" and the output is "cats".
A sliding window over the corpus is commonly used to produce many such input-output pairs from a single text.
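
Two common ways of slicing a token sequence into training pairs are sketched below: growing prefixes (which are later padded to a common length) and a fixed-size sliding window. The function names and the window size are illustrative assumptions.

```python
def prefix_pairs(sequence):
    """Each prefix of length >= 1 is an input; the token right after it is the target."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

def window_pairs(sequence, window):
    """Fixed-size sliding window: `window` tokens in, the following token out."""
    return [
        (sequence[i:i + window], sequence[i + window])
        for i in range(len(sequence) - window)
    ]

tokens = ["i", "love", "cats", "and", "dogs"]
by_prefix = prefix_pairs(tokens)
by_window = window_pairs(tokens, window=2)

for inputs, target in by_window:
    print(inputs, "->", target)
```

Prefix pairs have varying lengths and need padding before they can be batched, while window pairs are already uniform.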
## 4. Model Building

Several model families can be used for text generation in TensorFlow. The main ones are:

### (a) Recurrent Neural Networks (RNN)

RNNs process a sequence step by step while maintaining a hidden state that summarizes the elements seen so far. In TensorFlow, a basic recurrent layer is available as tf.keras.layers.SimpleRNN.

### (b) Long Short-Term Memory Networks (LSTM)

LSTMs are a type of RNN designed to mitigate the vanishing gradient problem, which lets them capture longer-term dependencies in text. TensorFlow provides them through tf.keras.layers.LSTM.

### (c) Gated Recurrent Units (GRUs)

GRUs are similar to LSTMs but use fewer gates and therefore fewer parameters, which usually makes them somewhat faster to train; whether they match LSTM quality depends on the task. In TensorFlow, GRUs are implemented with tf.keras.layers.GRU.
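
Because the three recurrent layers share the same interface, they can be swapped inside an otherwise identical next-token model. The sketch below assumes TensorFlow 2.x is installed; the sizes `VOCAB_SIZE`, `EMBED_DIM` and `UNITS` are arbitrary illustrative values, and `build_model` is a hypothetical helper.

```python
import tensorflow as tf

VOCAB_SIZE = 50  # hypothetical vocabulary size
EMBED_DIM = 16   # hypothetical embedding width
UNITS = 32       # hypothetical recurrent state size

def build_model(recurrent_layer):
    """Embedding -> recurrent layer -> softmax over the vocabulary."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        recurrent_layer,
        tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
    ])

candidates = {
    "SimpleRNN": build_model(tf.keras.layers.SimpleRNN(UNITS)),
    "LSTM": build_model(tf.keras.layers.LSTM(UNITS)),
    "GRU": build_model(tf.keras.layers.GRU(UNITS)),
}

# A random batch of 4 sequences, 10 token ids each, just to build the weights.
batch = tf.random.uniform((4, 10), minval=1, maxval=VOCAB_SIZE, dtype=tf.int32)

param_counts = {}
output_shapes = {}
for name, candidate in candidates.items():
    probabilities = candidate(batch)  # the first call creates the weights
    output_shapes[name] = tuple(probabilities.shape)
    param_counts[name] = candidate.count_params()
    print(name, output_shapes[name], param_counts[name])
```

Comparing the printed parameter counts shows the gate overhead directly: the LSTM variant is the largest and the plain RNN the smallest.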
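### (d) Transformer Models

Transformer-based architectures are the current state of the art for text generation. They use self-attention to model relationships between tokens across a sequence without relying on recurrent connections. Decoder-style models such as GPT generate text token by token; encoder models such as BERT are trained with masked-token prediction and are better suited to understanding tasks than to generation.
TensorFlow provides building blocks such as tf.keras.layers.MultiHeadAttention for custom Transformer models, and pre-trained models such as GPT-2 can be loaded through companion libraries (for example KerasNLP or Hugging Face Transformers).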
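
For generation, self-attention has to be causal: a position may only attend to itself and earlier positions. The sketch below assumes a TensorFlow version in which `MultiHeadAttention` accepts the `use_causal_mask` argument (TensorFlow 2.10 or later, to the best of my knowledge); the tensor sizes are arbitrary. It checks causality by changing only the last token and confirming that earlier outputs do not move.

```python
import numpy as np
import tensorflow as tf

SEQ_LEN, DIM = 6, 8  # hypothetical sequence length and feature width

attention = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)

x = np.random.rand(1, SEQ_LEN, DIM).astype("float32")
out = np.asarray(
    attention(query=tf.constant(x), value=tf.constant(x), use_causal_mask=True)
)

# Perturb only the final position and run the same layer again.
x_changed = x.copy()
x_changed[0, -1, :] += 1.0
out_changed = np.asarray(
    attention(
        query=tf.constant(x_changed),
        value=tf.constant(x_changed),
        use_causal_mask=True,
    )
)

# With a causal mask, positions before the last one cannot see the change.
earlier_unchanged = np.allclose(out[0, :-1], out_changed[0, :-1], atol=1e-5)
print("earlier positions unaffected:", earlier_unchanged)
```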
## 5. Training the Model

The model is trained to minimize a loss, usually categorical cross-entropy, between its predicted probability distribution over the next token and the token that actually follows in the sequence.
Example: if the model is given "I love" and assigns most of its probability to "dogs" when the actual next word is "cats", the loss is high, and backpropagation updates the weights to make "cats" more likely.
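
For a single prediction, categorical cross-entropy reduces to the negative log of the probability assigned to the correct token. The probabilities below are made-up numbers chosen to mirror the "cats" versus "dogs" example, not the output of a real model.

```python
import math

def cross_entropy(probabilities, target):
    """Negative log-probability the model assigned to the correct next token."""
    return -math.log(probabilities[target])

# Hypothetical next-word distributions after the input "I love".
mostly_dogs = {"cats": 0.05, "dogs": 0.90, "pizza": 0.05}
mostly_cats = {"cats": 0.90, "dogs": 0.05, "pizza": 0.05}

loss_wrong = cross_entropy(mostly_dogs, "cats")
loss_right = cross_entropy(mostly_cats, "cats")

print(f"model favours 'dogs': loss = {loss_wrong:.3f}")
print(f"model favours 'cats': loss = {loss_right:.3f}")
```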
## 6. Generating Text (Inference)

After training, the model generates new text from a starting seed sequence. It predicts one token, appends that token to the sequence, and feeds the updated sequence back in to predict the next one. The loop continues until the desired length is reached or an end token is generated.
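
The loop itself is independent of the model behind it. In this sketch a small hand-written table, `TOY_MODEL`, stands in for a trained network, and `next_token_probs` and the `<end>` marker are hypothetical names used only for illustration.

```python
END = "<end>"

# Hypothetical stand-in for a trained model: last token -> next-token probabilities.
TOY_MODEL = {
    "i": {"love": 0.9, "like": 0.1},
    "love": {"cats": 0.6, "dogs": 0.4},
    "like": {"dogs": 1.0},
    "cats": {END: 1.0},
    "dogs": {END: 1.0},
}

def next_token_probs(tokens):
    """Return the toy distribution over the next token given the sequence so far."""
    return TOY_MODEL[tokens[-1]]

def generate(seed, max_new_tokens):
    """Predict a token, append it, feed the longer sequence back in, and repeat."""
    tokens = list(seed)
    for _ in range(max_new_tokens):
        probabilities = next_token_probs(tokens)
        token = max(probabilities, key=probabilities.get)  # most likely token
        if token == END:
            break  # stop early when the end token is produced
        tokens.append(token)
    return tokens

result = generate(["i"], max_new_tokens=5)
print(" ".join(result))
```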

There are various strategies for generating text:

- Greedy search: Always choose the token with the highest probability as the next token.
- Beam search: Keep track of several candidate sequences at each step and return the one with the highest overall probability.
- Sampling: Draw the next token at random from the predicted probability distribution to introduce more diversity.
- Temperature control: Adjust the randomness of sampling by dividing the logits by a temperature before the softmax. A temperature above 1 flattens the distribution and increases randomness, while a temperature below 1 sharpens it and makes the model more conservative.
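
The strategies above can be compared on toy numbers. Everything in this sketch is illustrative: the logits are invented, and `toy_probs` is a hypothetical two-token model built so that the greedy choice at the first step leads to a less probable sequence than the one beam search finds.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def greedy(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Draw one index at random according to the temperature-scaled distribution."""
    weights = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def toy_probs(prefix):
    """Hypothetical model over two token ids (0 and 1)."""
    if not prefix:
        return [0.6, 0.4]
    return [0.5, 0.5] if prefix[-1] == 0 else [0.9, 0.1]

def beam_search(next_probs, length, beam_width):
    """Keep the `beam_width` best prefixes by summed log-probability at each step."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = [
            (prefix + (token,), score + math.log(p))
            for prefix, score in beams
            for token, p in enumerate(next_probs(prefix))
            if p > 0
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

logits = [2.0, 1.0, 0.0]
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
drawn = sample(logits, temperature=1.0, rng=random.Random(0))

greedy_path, greedy_score = beam_search(toy_probs, length=2, beam_width=1)
beam_path, beam_score = beam_search(toy_probs, length=2, beam_width=2)
print("width 1:", greedy_path, "width 2:", beam_path)
```

A beam width of 1 is equivalent to greedy search, which is why widening the beam can only match or improve the probability of the returned sequence in this setup.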

In [None]:
import tensorflow as tf
import numpy as np

# Sample text data
text = "I love TensorFlow. It is a great framework for deep learning."

# Tokenize text
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([text])
sequences = tokenizer.texts_to_sequences([text])[0]

# Create input-output pairs
input_sequences = []
for i in range(1, len(sequences)):
    input_sequences.append(sequences[:i+1])

# Pad sequences to ensure they are of the same length
input_sequences = tf.keras.preprocessing.sequence.pad_sequences(input_sequences)

# Split into input (X) and output (y)
X = input_sequences[:, :-1]
y = input_sequences[:, -1]

# One-hot encode output
y = tf.keras.utils.to_categorical(y, num_classes=len(tokenizer.word_index) + 1)

# Define the model
model = tf.keras.models.Sequential([
    # input_length is omitted: recent Keras versions deprecate it and infer the length from the data
    tf.keras.layers.Embedding(len(tokenizer.word_index) + 1, 50),
    tf.keras.layers.LSTM(100),
    tf.keras.layers.Dense(len(tokenizer.word_index) + 1, activation='softmax')
])

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Train the model
model.fit(X, y, epochs=100, verbose=2)

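# Function to generate text by greedy decoding (always pick the most likely next word)
def generate_text(seed_text, next_words, model, tokenizer):
    for _ in range(next_words):
        # Encode the current text and left-pad/truncate it to the training input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=X.shape[1], padding='pre')
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)
        # Index 0 is reserved for padding and has no word; stop if the model predicts it
        output_word = tokenizer.index_word.get(int(predicted[0]))
        if output_word is None:
            break
        seed_text += " " + output_word
    return seed_text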

# Generate text based on seed
print(generate_text("I love", 5, model, tokenizer))
