# Create text like Shakespeare

Welcome to an exciting journey where you'll learn how to generate text in the style of Shakespeare using a Variational Autoencoder (VAE)! This challenge is not just about coding, but about merging the worlds of literature and advanced machine learning. As you progress, you'll gain insights into natural language processing, neural networks, and the creative applications of AI.

## Steps and Hints

Here's an overview of the challenge with steps and hints to guide you:

### Prep work

1. **Splitting the Corpus into Segments**:
    Your first task is to divide the Shakespearean text into segments of a maximum sequence length, ensuring no part exceeds this limit. This is crucial for managing memory and computational efficiency.

    **Hint**: Consider using a sliding window approach to break the text into manageable pieces.

1. **Building a Vocabulary**:
    Create a vocabulary from the corpus. This involves identifying unique words and possibly filtering out less frequent ones to reduce complexity.
    
    **Hint**: Pay attention to the frequency of words; rare words might not contribute much to the overall style.

1. **Filtering Low-Frequency Words**:
    This step involves removing words from the corpus that don't meet a minimum frequency threshold. This helps in focusing on the most significant words.
    
    **Hint**: Choose a sensible threshold that balances vocabulary size and expressiveness.

1. **Data Loading and Preprocessing**:
    Load and preprocess the data for the model. This step is key to ensuring your data is in the right format for training.
    
    **Hint**: Preprocessing might include normalizing and tokenizing text.

1. **Tokenization**:
    Convert the corpus into a series of tokens (integers) based on word frequency. This numerical representation is what the model will actually process.
    
    **Hint**: Use established tokenization libraries to save time.

1. **Updating Tokenizer for Vocabulary**:
    Adjust your tokenizer to only include words present in your vocabulary. This step ensures consistency between the data and the model.
    
    **Hint**: This is about aligning your tokenizer settings with the vocabulary you've created.

1. **Padding Sequences**:
    Ensure all sequences are of uniform length by padding them. This uniformity is crucial for training neural networks.
    
    **Hint**: You can pad sequences with zeros to a predefined maximum length.

1. **Main Execution**:
    This is where you bring everything together: load the corpus, preprocess the data, and get the processed corpus, sequences, and tokenizer ready for training.
    
    **Hint**: Ensure all previous steps are correctly implemented before proceeding.

1. **Preparing TensorFlow Dataset**:
    Create a TensorFlow dataset from your sequences. This format is optimal for training models in TensorFlow.
    
    **Hint**: Take advantage of TensorFlow's data pipeline optimizations for efficiency.

1. **Shuffling and Batching the Dataset**:
    Shuffle the dataset for randomness and batch it according to a specified size. This step is crucial for effective training.
    
    **Hint**: A good shuffle ensures the model doesn't learn the order of the data, while batching affects memory usage and training speed.
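
The first prep steps (segmenting, vocabulary building, frequency filtering, tokenizing, and padding) can be sketched in plain Python before reaching for a library. This is a minimal illustration rather than the reference solution: the helper names `split_into_segments`, `build_vocab`, and `encode_and_pad` are hypothetical, the toy sentence stands in for the real corpus, and the windows do not overlap (the stride equals the segment length).

```python
from collections import Counter

def split_into_segments(tokens, max_len):
    """Cut a flat token list into consecutive segments of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def build_vocab(segments, min_count):
    """Keep words seen at least min_count times; ids start at 1 so 0 can mean padding."""
    counts = Counter(word for segment in segments for word in segment)
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return {word: index for index, word in enumerate(kept, start=1)}

def encode_and_pad(segments, word_index, max_len, pad_id=0):
    """Map words to ids, drop out-of-vocabulary words, and right-pad with pad_id."""
    padded = []
    for segment in segments:
        ids = [word_index[word] for word in segment if word in word_index]
        padded.append(ids + [pad_id] * (max_len - len(ids)))
    return padded

# Toy corpus (an assumption for illustration, not the real dataset).
tokens = "to be or not to be that is the question".split()
segments = split_into_segments(tokens, max_len=4)
word_index = build_vocab(segments, min_count=2)
sequences = encode_and_pad(segments, word_index, max_len=4)

print(segments)
print(word_index)
print(sequences)
```

Dropping rare words shortens some segments, which is why padding comes last: every row ends up with the same length regardless of how many words were filtered out.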

### Creating the VAE Model

1. **Designing the Encoder**:
    The encoder part of your VAE will process the input text and encode it into a latent (hidden) space. This involves designing a neural network architecture suitable for handling text data.
    
    **Hint**: Consider using layers like LSTM or GRU as they are effective in capturing the sequential nature of text.

1. **Designing the Decoder**:
    The decoder will take the encoded representation and reconstruct the original text. This part mirrors the encoder but in reverse.
    
    **Hint**: The decoder's architecture should complement the encoder, and it often uses similar layers.

1. **Defining the Latent Space**:
    The latent space is where your model learns to represent the data in a compressed form. Here, you'll define how the model represents this space, including the size and how it's sampled.
    
    **Hint**: The dimensionality of the latent space is a key hyperparameter. Experiment with different sizes to see what works best for your data.

1. **Loss Function and Regularization**:
    A critical part of training VAEs is defining the loss function, which often includes a reconstruction loss and a regularization term (like KL divergence) to maintain a well-formed latent space.
    
    **Hint**: Balancing these two aspects of the loss function is crucial for good model performance.
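
To make the latent-space and loss steps concrete, here is a small framework-free sketch of the sampling step and the KL term for a diagonal Gaussian posterior measured against a standard normal prior. The function names and the `kl_weight` parameter are assumptions introduced for illustration; `kl_weight` is just one simple way to balance the two loss terms and is not something the challenge prescribes.

```python
import math
import random

def sample_latent(z_mean, z_log_var, rng):
    """Reparameterization trick: z = mean + exp(0.5 * log_var) * epsilon."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(z_mean, z_log_var)]

def kl_divergence(z_mean, z_log_var):
    """KL(N(mean, var) || N(0, 1)), summed over the latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(z_mean, z_log_var))

def vae_loss(reconstruction_loss, z_mean, z_log_var, kl_weight=1.0):
    """Total loss: reconstruction term plus a weighted KL term (kl_weight is hypothetical)."""
    return reconstruction_loss + kl_weight * kl_divergence(z_mean, z_log_var)

rng = random.Random(0)      # Seeded so the sample is repeatable.
z_mean = [0.0, 1.0]         # Toy encoder outputs for a 2-dimensional latent space.
z_log_var = [0.0, 0.0]      # log(variance) = 0 means unit variance.

z = sample_latent(z_mean, z_log_var, rng)
print(z)
print(kl_divergence(z_mean, z_log_var))
print(vae_loss(2.0, z_mean, z_log_var, kl_weight=0.5))
```

The KL term is zero only when the posterior equals the prior, so in this toy case the entire penalty comes from the second dimension, whose mean is shifted away from 0.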

### Training the Model

1. **Setting up Training Parameters**:
    Define parameters like learning rate, batch size, and number of epochs. These hyperparameters significantly impact how well and how quickly your model learns.
    
    **Hint**: Start with commonly used values and then fine-tune based on your model's performance.

1. **Model Training**:
    Train the model using your prepared dataset. Monitor the loss and adjust parameters as needed for optimal learning.

    **Hint**: Use callbacks like ModelCheckpoint and EarlyStopping to manage long training times and avoid overfitting.
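
The idea behind the early-stopping hint can be shown without any framework. The sketch below is not the Keras implementation. It only mimics the patience rule: stop once the monitored loss has failed to improve for `patience` consecutive epochs. The function name `first_stopping_epoch` and the loss values are made up for illustration.

```python
def first_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # Number of consecutive epochs without improvement.
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: improvement stalls after epoch 2.
val_losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.55, 3.4]

print(first_stopping_epoch(val_losses, patience=3))
print(first_stopping_epoch(val_losses, patience=4))
```

With a patience of 3 the run stops before the late improvement at the final epoch, while a patience of 4 lets training continue. This is the trade-off to keep in mind when choosing the value.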

### Evaluating the Model

1. **Performance Metrics**:
    Evaluate the model using appropriate metrics. For a VAE, you might look at reconstruction error and how well the model generalizes.
    
    **Hint**: Besides quantitative metrics, qualitative evaluation (like reading the generated text and judging its style and coherence) is also valuable.

1. **Fine-Tuning**:
    Based on the evaluation, you may need to fine-tune the model. This could involve adjusting the architecture, training longer, or changing hyperparameters.
    
    **Hint**: Keep an experimental log to track changes and their effects.
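
One simple quantitative check is token-level reconstruction accuracy that ignores padding. The sketch below assumes padded integer sequences with `0` as the padding id. The function name `token_accuracy` and the toy sequences are illustrative, not part of the challenge.

```python
def token_accuracy(true_sequences, predicted_sequences, pad_id=0):
    """Fraction of non-padding positions where the predicted token matches the true one."""
    correct = 0
    total = 0
    for true_seq, pred_seq in zip(true_sequences, predicted_sequences):
        for true_id, pred_id in zip(true_seq, pred_seq):
            if true_id == pad_id:
                continue  # Padding positions carry no information, so skip them.
            total += 1
            correct += int(true_id == pred_id)
    return correct / total if total else 0.0

# Toy data: 5 real tokens in total, 3 of them reconstructed correctly.
true_sequences = [[4, 7, 2, 0], [5, 3, 0, 0]]
predicted_sequences = [[4, 7, 9, 0], [5, 1, 1, 0]]

print(token_accuracy(true_sequences, predicted_sequences))
```

Counting padding positions would inflate the score, because a model can learn to predict padding at the end of short sequences without reconstructing any actual words.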

### Generating Text with Sampling

1. **Sampling from the Latent Space**:
    Generate new text by sampling points in the latent space and passing them through the decoder.
    
    
    **Hint**: Explore different regions of the latent space to see the variety of texts your model can generate.

1. **Decoding Generated Samples**:
    The decoder will output a sequence of tokens that you'll need to convert back into text.

    **Hint**: Ensure your tokenization process is reversible for accurate text generation.

1. **Iterative Refinement (Optional)**:
    You might refine the generated text by iteratively feeding it back into the model for further processing.
    
    **Hint**: This can sometimes improve coherence and style consistency.
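
The sampling and decoding steps can be sketched with toy data. The helper names `interpolate` and `greedy_decode`, the tiny `index_word` mapping, and the hand-written probabilities are all assumptions for illustration. In practice the per-step probabilities would come from your trained decoder, and picking the most likely token at each step is only one of several possible decoding strategies.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` latent vectors evenly spaced from z_start to z_end (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    points = []
    for k in range(steps):
        t = k / (steps - 1)
        points.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return points

def greedy_decode(step_probabilities, index_word, pad_id=0):
    """Pick the most likely token id at each step and map the ids back to words."""
    words = []
    for probs in step_probabilities:
        token_id = max(range(len(probs)), key=probs.__getitem__)
        if token_id != pad_id and token_id in index_word:
            words.append(index_word[token_id])
    return " ".join(words)

# Walking between two latent points is one way to explore the latent space.
path = interpolate([0.0, 0.0], [1.0, 2.0], steps=3)
print(path)

# Toy decoder output: one probability distribution over 5 token ids per time step.
index_word = {1: "to", 2: "be", 3: "or", 4: "not"}
toy_output = [
    [0.1, 0.6, 0.1, 0.1, 0.1],  # Most likely id: 1
    [0.1, 0.1, 0.6, 0.1, 0.1],  # Most likely id: 2
    [0.7, 0.1, 0.1, 0.1, 0.0],  # Most likely id: 0 (padding, skipped)
]
print(greedy_decode(toy_output, index_word))
```

Decoding only works if the id-to-word mapping is the exact inverse of the mapping used during tokenization, which is why the reversibility hint above matters.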

Remember, working with VAEs, especially for text generation, can be as much an art as a science. Don't hesitate to experiment with different approaches, and enjoy the process of creating something unique. As you proceed, you'll not only develop a deeper understanding of VAEs and NLP but also get a taste of how AI can be used creatively to mimic human-like artistic expressions. Happy modeling!

In [None]:
# Standard library
from collections import defaultdict
from itertools import chain

# Third-party libraries
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow_datasets as tfds
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import Callback, EarlyStopping
from tensorflow.keras.layers import (Activation, Bidirectional, Dense, Dropout, Embedding, Input,
                                     Lambda, LSTM, RepeatVector, Reshape, TimeDistributed)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

latent_dim = 128
embedding_dim = 150
epochs = 50
min_count_words = 3
max_sequence_len = 25
batch_size = 256
rnn_first_layer_size = 256
rnn_second_layer_size = 128


dataset, _ = tfds.load('tiny_shakespeare', with_info=True, as_supervised=False)
unsplit_corpus = [simple_preprocess(data['text'].numpy().decode('utf-8')) for data in dataset['train']][0]


In [None]:
def split_corpus_by_len(corpus, max_sequence_len=max_sequence_len):
    # Splits the corpus into consecutive segments of at most max_sequence_len tokens.
    new_corpus = [corpus[i:i+max_sequence_len] for i in range(0, len(corpus), max_sequence_len)]
    # Drops the last segment only if it is shorter than the desired length.
    if new_corpus and len(new_corpus[-1]) < max_sequence_len:
        new_corpus.pop()
    return new_corpus

def build_vocab(corpus):
    # Builds a vocabulary of words from the corpus, filtering out those with low frequency.
    word_counts = defaultdict(int)
    for sequence in corpus:
        for word in sequence:
            word_counts[word] += 1  # Counts the occurrence of each word in the corpus.
    # Filters out words whose frequency is below the threshold (min_count_words).
    vocab = [word for word, count in word_counts.items() if count >= min_count_words]
    return vocab

def load_and_preprocess_data(unsplit_corpus):
    # Loads and preprocesses the data for use in the model.
    corpus = split_corpus_by_len(unsplit_corpus)  # Splits the unsplit corpus into smaller segments.

    vocab = build_vocab(corpus)  # Builds the vocabulary from the corpus.
    vocab_size = len(vocab)
    print(f"Vocab Size: {vocab_size}")

    # Tokenizes the corpus: converts words to integers based on their frequency.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(vocab)
    sequences = tokenizer.texts_to_sequences(corpus)

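    # Updates the tokenizer to only include words present in the vocabulary.
    vocab_set = set(vocab)  # Set membership checks are faster than list lookups.
    tokenizer.word_index = {word: index for word, index in tokenizer.word_index.items() if word in vocab_set}
    # index_word maps index -> word, so unpack in that order and filter on the word.
    tokenizer.index_word = {index: word for index, word in tokenizer.index_word.items() if word in vocab_set}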

    # Pads sequences to a uniform length.
    max_sequence_len = max([len(x) for x in sequences])
    sequences = np.array(pad_sequences(sequences, maxlen=max_sequence_len, padding='post'))

    return corpus, sequences, tokenizer

# Main execution: Loads the corpus, preprocesses the data, and retrieves the processed corpus, sequences, and tokenizer.
corpus, sequences, tokenizer = load_and_preprocess_data(unsplit_corpus)


In [None]:
sequences.shape

In [None]:
print(f"Total Sequences: {len(sequences)}")

In [None]:
train_sequences, test_sequences = train_test_split(sequences, test_size=0.1)
print(f"Training Sequences: {len(train_sequences)}")
print(f"Test Sequences: {len(test_sequences)}")


In [None]:
def prepare_dataset(sequences, batch_size=batch_size):
    # Prepares a TensorFlow dataset from the given sequences.
    dataset = tf.data.Dataset.from_tensor_slices((sequences, sequences))
    # Shuffles the dataset with a buffer of 10000 and batches it according to the specified batch size.
    return dataset.shuffle(10000).batch(batch_size)

# Prepares the training dataset from the train_sequences, and enables prefetching for performance optimization.
train_dataset = prepare_dataset(train_sequences).prefetch(tf.data.AUTOTUNE)
# Similar to the training dataset, prepares the test dataset and enables prefetching.
test_dataset = prepare_dataset(test_sequences).prefetch(tf.data.AUTOTUNE)


In [None]:
def create_word_embeddings(corpus):
    # Creates word embeddings from the given corpus using the Word2Vec model.
    global min_count_words
    # Initializes the Word2Vec model with specified parameters:
    # vector_size: the dimensionality of the word vectors,
    # window: the maximum distance between the current and predicted word within a sentence,
    # min_count: ignores all words with total frequency lower than this,
    # workers: use this many worker threads to train the model (=faster training with multicore machines),
    # epochs: number of iterations (epochs) over the corpus.
    word2vec_model = Word2Vec(corpus, vector_size=embedding_dim, window=5, min_count=min_count_words, workers=4, epochs=100)
    return word2vec_model

# Generates the word embeddings using the corpus with the create_word_embeddings function.
word2vec_model = create_word_embeddings(corpus)


In [None]:
word2vec_model.wv.vectors.shape

In [None]:
def create_embedding_matrix(word2vec_model, tokenizer, embedding_dim):
    # Creates an embedding matrix that maps each word to its vector representation.
    vocab_size = len(tokenizer.word_index) + 1  # The size of the vocabulary including a reserved index at 0.
    embedding_matrix = np.zeros((vocab_size, embedding_dim))  # Initializes the matrix with zeros.

    # Iterates over each word in the tokenizer's vocabulary.
    for word, i in tokenizer.word_index.items():
        if word in word2vec_model.wv:
            # Retrieves the word's embedding from the Word2Vec model.
            embedding_vector = word2vec_model.wv[word]
            # Places the word's embedding vector in the corresponding row of the matrix.
            embedding_matrix[i] = embedding_vector

    return embedding_matrix  # Returns the complete embedding matrix.

# Creates an embedding matrix for the given Word2Vec model and tokenizer.
embedding_matrix = create_embedding_matrix(word2vec_model, tokenizer, embedding_dim)


In [None]:
embedding_matrix.shape

In [None]:
vocab_size = embedding_matrix.shape[0]  # Retrieves the size of the vocabulary from the embedding matrix.

def build_vae(embedding_matrix, max_sequence_len=max_sequence_len, latent_dim=latent_dim):
    vocab_size = embedding_matrix.shape[0]  # Vocabulary size (number of unique tokens).
    embedding_dim = embedding_matrix.shape[1]  # Dimension of each word vector.

    # Defines the embedding layer using the pre-trained word embeddings.
    embedding_layer = Embedding(embedding_matrix.shape[0],
                                embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_len,
                                trainable=False)

    # Encoder part of the VAE.
    encoder_inputs = Input(shape=(max_sequence_len,))
    x = embedding_layer(encoder_inputs)
    x = Bidirectional(LSTM(rnn_first_layer_size, return_sequences=True))(x)  # First bidirectional LSTM layer.
    x = Dropout(0.15)(x)  # Dropout layer for regularization.
    x = Bidirectional(LSTM(rnn_second_layer_size))(x)  # Second bidirectional LSTM layer.
    z_mean = Dense(latent_dim)(x)  # Dense layer to generate z_mean.
    z_log_var = Dense(latent_dim)(x)  # Dense layer to generate z_log_var.

    # Sampling function to generate the latent vector.
    def sampling(args):
        z_mean, z_log_var = args
        epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim))
        return z_mean + K.exp(0.5 * z_log_var) * epsilon

    z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])

    # Decoder part of the VAE.
    decoder_inputs = Input(shape=(latent_dim,))
    x = Dense(128, activation='relu')(decoder_inputs)
    x = RepeatVector(max_sequence_len)(x)  # Repeats the vector for sequence generation.
    x = LSTM(64, return_sequences=True)(x)  # LSTM layer in the decoder.
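    # Note: this Dense layer is applied per time step and has max_sequence_len * vocab_size units,
    # so it and the output layer after it hold most of the model's parameters.
    x = Dense(max_sequence_len * vocab_size, activation='relu')(x)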
    decoder_outputs = TimeDistributed(Dense(vocab_size, activation='softmax'))(x)  # TimeDistributed Dense layer.

    # Assembling the VAE model.
    encoder = Model(encoder_inputs, [z_mean, z_log_var, z])
    decoder = Model(decoder_inputs, decoder_outputs)
    vae = Model(encoder_inputs, decoder(encoder(encoder_inputs)[2]))

    return vae, encoder, decoder

# Builds the VAE using the embedding matrix.
vae, encoder, decoder = build_vae(embedding_matrix=embedding_matrix)

class VAELoss(tf.keras.losses.Loss):
    def __init__(self, encoder, vocab_size, **kwargs):
        super(VAELoss, self).__init__(**kwargs)
        self.encoder = encoder  # The encoder part of the VAE.
        self.vocab_size = vocab_size  # Vocabulary size.

    def call(self, y_true, y_pred):
        z_mean, z_log_var, _ = self.encoder(y_true)

        # Converts y_true to one-hot encoding.
        y_true_one_hot = tf.one_hot(tf.cast(tf.squeeze(y_true), tf.int32), depth=self.vocab_size)

        # Categorical cross-entropy for reconstruction loss.
        reconstruction_loss = tf.reduce_mean(
            tf.keras.losses.categorical_crossentropy(y_true_one_hot, y_pred))

        # KL divergence for regularization.
        kl_loss = -0.5 * tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)

        return reconstruction_loss + kl_loss  # Total VAE loss.

# Custom loss function for the VAE.
loss_function = VAELoss(encoder, vocab_size=vocab_size)

# Compiling and summarizing the VAE model.
vae.compile(optimizer=Adam(0.01), loss=loss_function)
vae.summary()