# Shakespearean Sonnet Generator using LSTM Networks

## Introduction:

In the realm of literature, few names evoke the same level of reverence and admiration as William Shakespeare. His mastery of language and poetic expression has left an indelible mark on the world of literature for centuries. One of the most beloved forms of his work is the sonnet, a 14-line poetic form that Shakespeare employed with unparalleled eloquence.

In this project, we delve into the realm of artificial intelligence and natural language processing to create a tool that emulates Shakespeare's sonnet-writing prowess. Leveraging the power of LSTM (Long Short-Term Memory) networks, a type of recurrent neural network well-suited for sequence modeling, we aim to generate new sonnets that echo the style and cadence of the Bard himself.

Our journey begins with a corpus of Shakespearean sonnets, from which our model learns the intricate patterns of language, rhyme, and meter characteristic of these timeless works. With each line of verse ingested, the model hones its understanding of Shakespeare's linguistic idiosyncrasies, paving the way for the creation of original sonnets that bear the hallmark of his genius.

As we embark on this literary adventure, we invite you to witness the marriage of art and technology, as we breathe new life into the age-old tradition of sonneteering, guided by the ethereal spirit of William Shakespeare.

## Project Goals

**Develop a Shakespearean Sonnet Generator:**
Create an LSTM-based neural network capable of generating new Shakespearean-style sonnets based on learned patterns from a corpus of existing sonnets.

**Optimize Model Performance:**
Refine the model to produce sonnets that closely mimic the style, vocabulary, and thematic elements of Shakespeare's original works, aiming for high fidelity and coherence.

**Evaluate Sonnet Quality:**
Conduct thorough evaluation and qualitative analysis of the generated sonnets to assess their literary merit, coherence, and adherence to Shakespearean conventions. Additionally, compare the generated sonnets with authentic Shakespearean sonnets to measure similarity and authenticity.

**Enhance Creativity and Originality:**
Explore methods to encourage the model to produce innovative and original sonnets while still maintaining fidelity to Shakespearean style and themes, fostering creativity within the constraints of the project.

## Project Workflow

1. **Data Collection:**
   Gather a comprehensive dataset of Shakespearean sonnets, ensuring diversity in themes, styles, and authors if applicable.

2. **Data Preprocessing:**
   Process the sonnet data by tokenizing the text, cleaning any noise or irrelevant information, and converting the text into numerical sequences suitable for input into the LSTM model.

3. **Model Training:**
   Train the LSTM-based neural network using the preprocessed sonnet dataset. Experiment with different model architectures, hyperparameters, and training strategies to optimize the model's ability to generate high-quality sonnets in the style of Shakespeare.

4. **Model Evaluation:**
   Evaluate the trained model's performance by generating new sonnets and assessing their quality, coherence, and adherence to Shakespearean conventions. Utilize both qualitative and quantitative evaluation methods to gauge the model's effectiveness in producing authentic-sounding sonnets.

5. **Iterative Refinement:**
   Iterate on the model architecture, hyperparameters, and training data based on the evaluation results to further improve the model's performance and enhance the quality of the generated sonnets.

6. **Documentation and Presentation:**
   Document the project workflow, including data collection methods, preprocessing techniques, model architecture, training process, evaluation metrics, and results. Present the findings and generated sonnets in a clear and engaging manner to stakeholders and interested parties.

### Imported Libraries:
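- **warnings:** Standard-library module, used here to silence warning messages.
- **NumPy:** For numerical computing tasks.
- **re:** Regular expression library for pattern matching and manipulation.
- **Matplotlib:** For plotting the training and validation curves.
- **TensorFlow:** Deep learning framework for building and training neural networks.
- **Sequential and to_categorical:** For stacking layers into a model and one-hot encoding the labels.
- **Tokenizer and pad_sequences:** For tokenizing text data and padding sequences to ensure uniform length.
- **RMSprop Optimizer:** A popular optimizer for updating model weights during training, accessed later as `tf.keras.optimizers.RMSprop`.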

Let's proceed with data preprocessing and model building.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import re
import matplotlib.pyplot as plt
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

2024-04-20 13:11:07.951827: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-20 13:11:08.271828: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.


## Sonnet Data Reader

**Description:**
This code snippet defines the path to the text file containing the sonnets and reads its contents. The path is stored in `peomeData`, and the file is opened with Python's built-in `open()` function. The `with` statement ensures that the file is closed once its contents have been read. The variable `data` then holds the entire file as a single string, which serves as the raw training text for the rest of the notebook.

In [3]:
# Define path for file with sonnets
peomeData = '/home/alrashidi/Desktop/Peome.txt'

# Read the data
with open(peomeData) as f:
    data = f.read()

## Symbol Removal from Sonnet Text

**Description:**
This code snippet defines a regular expression pattern to match symbols in the sonnet text. The pattern `r'[^\w\s]'` matches any character that is not a word character (`\w`) or whitespace character (`\s`). This pattern effectively matches any symbol or punctuation marks in the text.

The `re.sub()` function is then used to replace all occurrences of symbols with an empty string in the `data` variable, storing the result in `data_`.

Finally, the first 14 characters of the text without symbols are printed to the console using `print(data_[:14])`.

In [6]:
# Define a regular expression pattern to match symbols
pattern = r'[^\w\s]'

# Replace symbols with an empty string
data_ = re.sub(pattern, '', data)

# Print the first 14 characters of the text without symbols
print(data_[:14])

The moon rose 
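
To make the pattern's behaviour concrete, here is a minimal sketch on a made-up pair of lines (the sample string is an assumption for illustration, not text taken from the dataset). It shows that punctuation is dropped while letters and whitespace, including the newline, survive.

```python
import re

# Same pattern as in the cell above: anything that is neither a word
# character (letter, digit, underscore) nor whitespace.
pattern = r'[^\w\s]'

# Hypothetical sample text, used only for illustration.
sample = "Shall I compare thee to a summer's day?\nThou art more lovely, and more temperate:"

cleaned = re.sub(pattern, '', sample)
print(cleaned)
```

One side effect worth noticing: the apostrophe is removed too, so `summer's` becomes `summers`, which can merge words that were distinct in the source.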


## Lowercase Conversion and List Creation

**Description:**
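This code snippet converts the text data stored in the `data` variable to lowercase using the `.lower()` method and then splits it into a list of lines using the `.split("\n")` method. Each line of the text becomes an element in the `corpus` list. Note that the corpus is built from the raw `data` string rather than the symbol-free `data_`, so punctuation is still present at this point; the Keras `Tokenizer` used in the next step strips most punctuation through its default `filters` setting.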

The purpose of converting the text to lowercase is to ensure uniformity in the text data, as it makes it easier to handle and analyze text by disregarding case sensitivity.

The resulting `corpus` list contains each line of the sonnet text as a separate element, all in lowercase.


In [7]:
# Convert to lower case and save as a list
corpus = data.lower().split("\n")

In [9]:
print(f"First two rows: {corpus[:2]}")
print(f"Number of lines: {len(corpus)}")

First two rows: ['the moon rose over the bay. i had a lot of feelings.', 'i am taken with the hot animal']
Number of lines: 2875
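
Splitting on `"\n"` keeps blank lines as empty strings. The sketch below uses a made-up two-stanza text to show this, along with one possible way to drop the empty entries. Whether the original file contains blank lines is an assumption, and the filtering step is not part of the original notebook.

```python
# Hypothetical text with a blank line between two stanzas.
text = "First line of verse\nSecond line of verse\n\nA new stanza begins\n"

lines = text.lower().split("\n")
print(lines)  # contains '' for the blank line and for the trailing newline

# Optional filtering step: keep only lines with at least one
# non-whitespace character.
non_empty = [line for line in lines if line.strip()]
print(len(lines), len(non_empty))
```

An empty line tokenizes to an empty list and therefore produces no training sequences, so leaving such lines in does not affect training; it only inflates the reported line count.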


## Tokenization

**Description:**
This code snippet initializes a `Tokenizer` object from `tensorflow.keras.preprocessing.text`. The `Tokenizer` is a utility class used to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf, etc.

The `fit_on_texts()` method is then used to fit the tokenizer on the `corpus` data, which essentially updates the internal vocabulary based on the given text data.

Finally, the variable `total_words` is assigned the total number of unique words in the text corpus plus one. The additional `+1` is for accommodating the index `0`, which is reserved for padding.

This tokenization process prepares the text data for input into a neural network model by converting each word into a unique integer index.

In [10]:
# Initialize the tokenizer
tokenizer = Tokenizer()

# Fit tokenizer on text data
tokenizer.fit_on_texts(corpus)

total_words = len(tokenizer.word_index) + 1
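
The following is a simplified pure-Python approximation of what fitting the tokenizer produces, not the Keras implementation itself. The helper name `build_word_index` and the toy corpus are assumptions for illustration. It shows the key idea: words are ranked by frequency, the most frequent word gets index 1, and index 0 is left free for padding.

```python
from collections import Counter
import re

def build_word_index(lines):
    """Map each word to an integer rank, most frequent word first (rank 1)."""
    counts = Counter()
    for line in lines:
        # Lowercase, replace punctuation (except apostrophes) with spaces,
        # then split on whitespace.
        counts.update(re.sub(r"[^\w\s']", " ", line.lower()).split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_corpus = ["The rose and the thorn", "the thorn and the night"]
word_index = build_word_index(toy_corpus)
total_words = len(word_index) + 1  # +1 because index 0 is reserved for padding
print(word_index, total_words)
```

This frequency ordering is why very common words such as "the" and "and" appear with the lowest indices in the real word index.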

## Display Tokenizer Word Index

**Description:**
This code snippet retrieves the word index from the previously initialized tokenizer object and converts it into a list of tuples using the `items()` method. It then slices the list to select the first 15 elements.

The purpose of this code is to display a subset of the word index, pairing each word with its corresponding index.

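Finally, the code prints a `Tokenizer:` header followed by the word-index pairs, showing the mapping of words to their respective indices. Because indices are assigned in order of descending frequency, the most common words receive the lowest indices.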

In [11]:
word_index = tokenizer.word_index

# Convert word index to list of tuples and slice
word_index_list = list(word_index.items())[:15]

# Show the tokenizer
print("Tokenizer:")
for word, index in word_index_list:
    print(f"{word}: {index}",end=" ")

Tokenizer:
the: 1 and: 2 to: 3 i: 4 of: 5 my: 6 in: 7 that: 8 thy: 9 thou: 10 a: 11 love: 12 you: 13 with: 14 is: 15 

In [12]:
# Get the total number of unique words
total_words = len(word_index) + 1
print("Total unique words:", total_words)

Total unique words: 3785


## N-gram Sequence Generation Function

**Description:**
This code defines a Python function `n_gram_seqs` that generates a list of n-gram sequences for each line of text in a given corpus. The function takes two arguments:
- `corpus`: a list of strings representing lines of text to generate n-grams for.
- `tokenizer`: an instance of the Tokenizer class containing the word-index dictionary.

The function iterates over each line in the corpus and tokenizes it using the provided tokenizer. It then iterates over each tokenized line and generates n-gram sequences by progressively adding tokens from the beginning of the line. The resulting n-gram sequences are appended to the `input_sequences` list.

Finally, the function returns the list of generated n-gram sequences.

This function is useful for preparing text data for training sequence models, such as recurrent neural networks (RNNs), by creating input-output pairs based on n-gram sequences.

In [13]:
def n_gram_seqs(corpus, tokenizer):
    """
    Generates a list of n-gram sequences

    Args:
        corpus (list of string): lines of texts to generate n-grams for
        tokenizer (object): an instance of the Tokenizer class containing the word-index dictionary

    Returns:
        input_sequences (list of list of int): the n-gram sequences for each line in the corpus
    """
    input_sequences = []

    for line in corpus:
        # Convert the line into a list of word indices
        token_list = tokenizer.texts_to_sequences([line])[0]

        # Build every prefix with at least two tokens: [w1, w2], [w1, w2, w3], ...
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)

    return input_sequences

In [14]:
# Test  function with one example
first_example_sequence = n_gram_seqs([corpus[0]], tokenizer)

print("n_gram sequences for first example look like this:\n")
first_example_sequence

n_gram sequences for first example look like this:



[[1, 487],
 [1, 487, 318],
 [1, 487, 318, 319],
 [1, 487, 318, 319, 1],
 [1, 487, 318, 319, 1, 1009],
 [1, 487, 318, 319, 1, 1009, 4],
 [1, 487, 318, 319, 1, 1009, 4, 150],
 [1, 487, 318, 319, 1, 1009, 4, 150, 11],
 [1, 487, 318, 319, 1, 1009, 4, 150, 11, 1592],
 [1, 487, 318, 319, 1, 1009, 4, 150, 11, 1592, 5],
 [1, 487, 318, 319, 1, 1009, 4, 150, 11, 1592, 5, 1593]]

In [15]:
# Test  function with a bigger corpus
next_3_examples_sequence = n_gram_seqs(corpus[1:4], tokenizer)

print("n_gram sequences for next 3 examples look like this:\n")
next_3_examples_sequence

n_gram sequences for next 3 examples look like this:



[[4, 69],
 [4, 69, 594],
 [4, 69, 594, 14],
 [4, 69, 594, 14, 1],
 [4, 69, 594, 14, 1, 488],
 [4, 69, 594, 14, 1, 488, 1594],
 [5, 6],
 [5, 6, 1010],
 [5, 6, 1010, 1595],
 [5, 6, 1010, 1595, 3],
 [5, 6, 1010, 1595, 3, 1596],
 [5, 6, 1010, 1595, 3, 1596, 6],
 [5, 6, 1010, 1595, 3, 1596, 6, 743],
 [2, 37],
 [2, 37, 98],
 [2, 37, 98, 595],
 [2, 37, 98, 595, 23],
 [2, 37, 98, 595, 23, 4],
 [2, 37, 98, 595, 23, 4, 1011],
 [2, 37, 98, 595, 23, 4, 1011, 70]]

In [16]:
# Apply the n_gram_seqs transformation to the whole corpus
input_sequences = n_gram_seqs(corpus, tokenizer)

# Save max length
max_sequence_len = max([len(x) for x in input_sequences])

print(f"n_grams of input_sequences have length: {len(input_sequences)}")
print(f"maximum length of sequences is: {max_sequence_len}")

n_grams of input_sequences have length: 18581
maximum length of sequences is: 27
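
The relationship between line length and the number of generated sequences can be sketched without the tokenizer. The helper `prefix_sequences` and the toy token ids below are assumptions for illustration; the slicing logic mirrors the inner loop of `n_gram_seqs`.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

# Hypothetical token ids for three lines with 4, 1, and 0 tokens.
toy_lines = [[5, 8, 2, 9], [7], []]

for tokens in toy_lines:
    seqs = prefix_sequences(tokens)
    print(len(tokens), "tokens ->", len(seqs), "sequences:", seqs)
```

A line with n tokens contributes n - 1 sequences, and lines with zero or one token contribute none. The longest sequence is the longest full line, so a maximum length of 27 means the longest line in the corpus has 27 tokens.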


## Sequence Padding Function

**Description:**
This code defines a Python function `pad_seqs` that pads tokenized sequences to ensure they all have the same length. The function takes two arguments:
- `input_sequences`: a list of lists containing tokenized sequences to pad.
- `maxlen`: an integer specifying the maximum length to pad sequences to.

The function uses the `pad_sequences` function from the TensorFlow module to pad each sequence in `input_sequences` to the specified maximum length. The padding is added at the beginning of each sequence (`padding='pre'`).

Finally, the function returns an array of padded sequences, where each sequence has the same length specified by `maxlen`.

This function is useful for preparing input data for training neural network models, ensuring that input sequences have consistent dimensions.

In [17]:
def pad_seqs(input_sequences, maxlen):
    """
    Pads tokenized sequences to the same length

    Args:
        input_sequences (list of int): tokenized sequences to pad
        maxlen (int): maximum length of the token sequences

    Returns:
        padded_sequences (array of int): tokenized sequences padded to the same length
    """
    padded_sequences = pad_sequences(input_sequences, maxlen, padding='pre')

    return padded_sequences

In [18]:
# Test  function with the n_grams_seq of the first example
first_padded_seq = pad_seqs(first_example_sequence, max([len(x) for x in first_example_sequence]))
first_padded_seq

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    1,
         487],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    1,  487,
         318],
       [   0,    0,    0,    0,    0,    0,    0,    0,    1,  487,  318,
         319],
       [   0,    0,    0,    0,    0,    0,    0,    1,  487,  318,  319,
           1],
       [   0,    0,    0,    0,    0,    0,    1,  487,  318,  319,    1,
        1009],
       [   0,    0,    0,    0,    0,    1,  487,  318,  319,    1, 1009,
           4],
       [   0,    0,    0,    0,    1,  487,  318,  319,    1, 1009,    4,
         150],
       [   0,    0,    0,    1,  487,  318,  319,    1, 1009,    4,  150,
          11],
       [   0,    0,    1,  487,  318,  319,    1, 1009,    4,  150,   11,
        1592],
       [   0,    1,  487,  318,  319,    1, 1009,    4,  150,   11, 1592,
           5],
       [   1,  487,  318,  319,    1, 1009,    4,  150,   11, 1592,    5,
        1593]], dtype=int32)

In [19]:
# Test  function with the n_grams_seq of the next 3 examples
next_3_padded_seq = pad_seqs(next_3_examples_sequence, max([len(s) for s in next_3_examples_sequence]))
next_3_padded_seq

array([[   0,    0,    0,    0,    0,    0,    4,   69],
       [   0,    0,    0,    0,    0,    4,   69,  594],
       [   0,    0,    0,    0,    4,   69,  594,   14],
       [   0,    0,    0,    4,   69,  594,   14,    1],
       [   0,    0,    4,   69,  594,   14,    1,  488],
       [   0,    4,   69,  594,   14,    1,  488, 1594],
       [   0,    0,    0,    0,    0,    0,    5,    6],
       [   0,    0,    0,    0,    0,    5,    6, 1010],
       [   0,    0,    0,    0,    5,    6, 1010, 1595],
       [   0,    0,    0,    5,    6, 1010, 1595,    3],
       [   0,    0,    5,    6, 1010, 1595,    3, 1596],
       [   0,    5,    6, 1010, 1595,    3, 1596,    6],
       [   5,    6, 1010, 1595,    3, 1596,    6,  743],
       [   0,    0,    0,    0,    0,    0,    2,   37],
       [   0,    0,    0,    0,    0,    2,   37,   98],
       [   0,    0,    0,    0,    2,   37,   98,  595],
       [   0,    0,    0,    2,   37,   98,  595,   23],
       [   0,    0,    2,   37,

In [20]:
# Pad the whole corpus
input_sequences = pad_seqs(input_sequences, max_sequence_len)

print(f"padded corpus has shape: {input_sequences.shape}")

padded corpus has shape: (18581, 27)
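
As a rough illustration of what pre-padding does, here is a pure-Python sketch. The helper `pre_pad` is hypothetical and only approximates the behaviour of `pad_sequences` with `padding='pre'` and its default front truncation. Padding on the left keeps the last token, which later becomes the label, in the final column of every row.

```python
def pre_pad(seq, maxlen, value=0):
    """Left-pad seq with `value` up to maxlen.

    If seq is longer than maxlen, only its last maxlen items are kept.
    """
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pre_pad([4, 69], 5))                    # shorter than maxlen: zeros added in front
print(pre_pad([4, 69, 594, 14, 1, 488], 5))   # longer than maxlen: front is cut off
```

The truncation case matters at generation time: once the running text grows beyond the model's input length, only the most recent words are fed to the model.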


## Feature and Label Generation Function

**Description:**
This code defines a Python function `features_and_labels` that generates features and labels from n-grams. The function takes two arguments:
- `input_sequences`: a list of lists containing sequences to split features and labels from.
- `total_words`: an integer specifying the vocabulary size.

The function extracts features from `input_sequences` by selecting all elements except the last one for each sequence, effectively removing the last token, which will be used as the label. The labels are extracted separately as the last token of each sequence.

Next, the labels are one-hot encoded using the `to_categorical` function from TensorFlow, with the number of classes set to `total_words`. This converts the integer labels into binary vectors with a 1 at the index corresponding to the label and 0s elsewhere.

Finally, the function returns arrays of features and one-hot encoded labels.

This function is useful for preparing data for training classification models, where the input consists of sequences of tokens and the output is a categorical variable.

In [21]:
def features_and_labels(input_sequences, total_words):
    """
    Generates features and labels from n-grams

    Args:
        input_sequences (list of int): sequences to split features and labels from
        total_words (int): vocabulary size

    Returns:
        features, one_hot_labels (array of int, array of int): arrays of features and one-hot encoded labels
    """
    features = input_sequences[:,:-1]
    labels = input_sequences[:,-1]
    one_hot_labels = to_categorical(labels, num_classes=total_words)

    return features, one_hot_labels

In [22]:
# Test  function with the padded n_grams_seq of the first example
first_features, first_labels = features_and_labels(first_padded_seq, total_words)

print(f"labels have shape: {first_labels.shape}")
print("\nfeatures look like this:\n")
first_features

labels have shape: (11, 3785)

features look like this:



array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    1],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    1,  487],
       [   0,    0,    0,    0,    0,    0,    0,    0,    1,  487,  318],
       [   0,    0,    0,    0,    0,    0,    0,    1,  487,  318,  319],
       [   0,    0,    0,    0,    0,    0,    1,  487,  318,  319,    1],
       [   0,    0,    0,    0,    0,    1,  487,  318,  319,    1, 1009],
       [   0,    0,    0,    0,    1,  487,  318,  319,    1, 1009,    4],
       [   0,    0,    0,    1,  487,  318,  319,    1, 1009,    4,  150],
       [   0,    0,    1,  487,  318,  319,    1, 1009,    4,  150,   11],
       [   0,    1,  487,  318,  319,    1, 1009,    4,  150,   11, 1592],
       [   1,  487,  318,  319,    1, 1009,    4,  150,   11, 1592,    5]],
      dtype=int32)

In [23]:
# Split the whole corpus
features, labels = features_and_labels(input_sequences, total_words)

print(f"features have shape: {features.shape}")
print(f"labels have shape: {labels.shape}")

features have shape: (18581, 26)
labels have shape: (18581, 3785)
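
The split and the one-hot encoding can be reproduced on a small array with NumPy alone. The toy sequences and the vocabulary size below are assumptions for illustration, and `np.eye` is used as a stand-in for `to_categorical`. The last lines estimate the size of the real label matrix, assuming 4-byte float entries.

```python
import numpy as np

# Hypothetical padded sequences (3 rows, length 4) over ids 0..5.
padded = np.array([[0, 0, 1, 4],
                   [0, 1, 4, 2],
                   [1, 4, 2, 5]])
vocab_size = 6

features = padded[:, :-1]   # every column except the last
labels = padded[:, -1]      # last column: the word to predict

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(vocab_size, dtype=np.float32)[labels]

print(features.shape, one_hot.shape)
print(one_hot)

# Rough memory estimate for the real label matrix (assumes float32 entries).
rows, cols = 18581, 3785
print(f"{rows * cols * 4 / 1e6:.0f} MB")
```

Since the dense label matrix takes a few hundred megabytes under that assumption, one option to consider for larger corpora is keeping the integer labels and training with the `sparse_categorical_crossentropy` loss instead.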


## Text Generation Model Creation Function

**Description:**
This code defines a Python function `create_text_generator_model` that creates a text generation model using a stacked bidirectional LSTM architecture. The function takes the following arguments:
- `total_words`: The total number of words in the vocabulary.
- `embedding_dim`: The dimension of the word embedding.
- `max_sequence_length`: The maximum length of input sequences.
- `lstm_units`: The number of units/neurons in the LSTM layers.
- `learning_rate`: The learning rate for the RMSprop optimizer.

The function creates a Sequential model using TensorFlow's `Sequential` class. The model consists of the following layers:
- Embedding Layer: Maps each word index to a dense vector representation.
- Stacked Bidirectional LSTM Layers: Two bidirectional LSTM layers with `lstm_units` neurons each, with the first layer returning sequences and the second layer not returning sequences.
- Fully Connected Layer: A dense layer with `total_words*2` neurons and ReLU activation function.
- Output Layer: A dense layer with `total_words` neurons and softmax activation function, which outputs a probability distribution over the vocabulary.

The model is compiled using categorical cross-entropy loss for multi-class classification and the RMSprop optimizer with the specified `learning_rate`.

Finally, the function returns the compiled text generation model.

This function is useful for creating and compiling a deep learning model for text generation tasks.

In [24]:
def create_text_generator_model(total_words: int, 
                                embedding_dim: int, 
                                max_sequence_length: int, 
                                lstm_units: int, 
                                learning_rate: float) -> tf.keras.Model:
    """
    Creates a text generation model using a stacked bidirectional LSTM architecture.

    Args:
        total_words (int): The total number of words in the vocabulary.
        embedding_dim (int): The dimension of the word embedding.
        max_sequence_length (int): The maximum length of input sequences.
        lstm_units (int): The number of units/neurons in the LSTM layers.
        learning_rate (float): The learning rate for the RMSprop optimizer.

    Returns:
        model (tf.keras.Model): The compiled text generation model.
    """
    model = Sequential([
        # Embedding Layer
        tf.keras.layers.Embedding(total_words, embedding_dim, input_shape=[max_sequence_length-1,]),        
        # Stacked Bidirectional LSTM Layers
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units,return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units,return_sequences=False)),
        
        
        # Fully Connected Layers
        tf.keras.layers.Dense(total_words*2, activation='relu'),   
       
        # Output Layer
        tf.keras.layers.Dense(total_words, activation='softmax')
    ])

    # Use categorical crossentropy for multi-class classification
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.RMSprop(learning_rate=learning_rate), 
                  metrics=['accuracy'])

    return model

### Text Generation Model Summary

**Description:**
This code initializes variables for the embedding dimension, LSTM units, and learning rate. It then creates a text generation model using the `create_text_generator_model` function defined earlier, passing the specified parameters.

Finally, it prints a summary of the model architecture using the `summary()` method, which provides a detailed overview of the layers, their output shapes, and the number of parameters in the model.

This summary is helpful for understanding the structure of the model and verifying that it matches your expectations.


In [None]:
embedding_dim = 100  # embedding dimension
lstm_units = 200  # LSTM units
learning_rate = 1e-3  # learning rate

# Get the  model
model = create_text_generator_model(total_words, embedding_dim, max_sequence_len, lstm_units, learning_rate)
model.summary()
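
Since the summary output is not shown here, the parameter counts can be estimated by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers and assumes Keras defaults (biases enabled, bidirectional outputs concatenated). The helper names are hypothetical, and the hyperparameter values are the ones used in the cell above.

```python
def embedding_params(vocab_size, embedding_dim):
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

total_words, embedding_dim, lstm_units = 3785, 100, 200

layers = {
    "embedding": embedding_params(total_words, embedding_dim),
    # Bidirectional wraps two LSTMs, so the count doubles.
    "bi_lstm_1": 2 * lstm_params(embedding_dim, lstm_units),
    # The second layer sees the concatenated forward and backward outputs.
    "bi_lstm_2": 2 * lstm_params(2 * lstm_units, lstm_units),
    "dense_relu": dense_params(2 * lstm_units, 2 * total_words),
    "dense_softmax": dense_params(2 * total_words, total_words),
}

for name, count in layers.items():
    print(f"{name:>14}: {count:,}")
print(f"{'total':>14}: {sum(layers.values()):,}")
```

Under these assumptions the two dense layers hold the bulk of the roughly 33.5 million parameters, which is a large model for about 18,600 training sequences.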

In [None]:
# Train the model
history = model.fit(features, labels, epochs=250, verbose=1, validation_split=0.4)
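
According to the Keras documentation, `validation_split` holds out the last fraction of the samples, before any shuffling. The sketch below works through the resulting sizes for this dataset; the helper `split_sizes` is hypothetical, and the exact rounding rule is an assumption.

```python
import math

def split_sizes(num_samples, validation_split):
    """Sizes of the training and validation parts for a tail split."""
    split_at = int(math.floor(num_samples * (1.0 - validation_split)))
    return split_at, num_samples - split_at

train_size, val_size = split_sizes(18581, 0.4)
print(train_size, val_size)
```

Because the sequences are stored in corpus order, the validation set here consists of the sequences built from the final lines of the file, so it may differ in style from the training portion.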

## Training and Validation Metrics Visualization

**Description:**
This code snippet retrieves the training and validation accuracy and loss metrics from the `history` object obtained after training a neural network model. It then plots these metrics over the number of epochs to visualize the training progress and model performance.

The first part retrieves the accuracy and loss metrics (`acc`, `val_acc`, `loss`, `val_loss`) from the `history` object.

Next, it defines the number of epochs based on the length of the accuracy metric (`epochs = range(len(acc))`).

The code then plots the training and validation accuracy over the epochs using the `plt.plot()` function, with red color representing training accuracy and blue color representing validation accuracy. It adds titles and legends to the plot for better interpretation and calls `plt.show()` to display the plot.

Similarly, it plots the training and validation loss over the epochs in a separate plot, again adding titles, legends, and displaying the plot.

These visualizations are useful for assessing the training progress, identifying overfitting or underfitting, and determining the effectiveness of the model architecture and training parameters.

In [None]:
#-----------------------------------------------------------
# Retrieve per-epoch accuracy and loss on the training and
# validation sets
#-----------------------------------------------------------
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc)) # Get number of epochs

#------------------------------------------------
# Plot training and validation accuracy per epoch
#------------------------------------------------
plt.plot(epochs, acc, 'r', label="Training Accuracy")
plt.plot(epochs, val_acc, 'b', label="Validation Accuracy")
plt.title('Training and validation accuracy')
plt.legend()
plt.show()

#------------------------------------------------
# Plot training and validation loss per epoch
#------------------------------------------------
plt.plot(epochs, loss, 'r', label="Training Loss")
plt.plot(epochs, val_loss, 'b', label="Validation Loss")
plt.title('Training and validation loss')
plt.legend()
plt.show()
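
Besides reading the curves by eye, the recorded losses can be inspected programmatically. The sketch below uses made-up validation losses as a stand-in for `history.history['val_loss']`; the helper names are hypothetical. It finds the epoch with the lowest validation loss and the epoch at which a simple patience rule would have stopped training.

```python
# Hypothetical per-epoch validation losses (not results from this notebook).
val_loss = [6.9, 6.5, 6.3, 6.4, 6.8, 7.5, 8.1]

def best_epoch(losses):
    """Return the 0-based index of the epoch with the lowest loss."""
    return min(range(len(losses)), key=losses.__getitem__)

def epoch_to_stop(losses, patience):
    """Return the 0-based epoch at which patience-based early stopping
    would halt, or None if it never triggers."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

print(best_epoch(val_loss), epoch_to_stop(val_loss, patience=2))
```

Keras offers a comparable mechanism through the `tf.keras.callbacks.EarlyStopping` callback, which could replace a fixed epoch count if the validation loss starts rising early.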

## Text Generation Using Trained Model

**Description:**
This code snippet generates text using a trained text generation model. It starts with a seed text (`seed_text`) and generates the specified number of words (`next_words`), each time appending the word to which the model assigns the highest probability (greedy decoding).

In each iteration of the loop, the seed text is converted into a sequence of tokens using the tokenizer, padded to match the model's input shape, and fed into the model to predict the next word. The predicted word is then appended to the seed text. Words in the seed that are not in the tokenizer's vocabulary are silently dropped, because no `oov_token` was set.

This process continues for the specified number of words (`next_words`), gradually expanding the generated text.

Finally, the generated text is printed to the console, which represents a continuation of the initial seed text based on the learned patterns from the training data.

This approach allows for the generation of new text based on the style and patterns learned by the model during training.

In [None]:
seed_text = "Naser my brother"
next_words = 100

for _ in range(next_words):
    # Convert the text into sequences
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # Pad the sequences
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Get the probabilities of predicting a word
    predicted = model.predict(token_list, verbose=0)
    # Choose the next word based on the maximum probability
    predicted = np.argmax(predicted, axis=-1).item()
    # Get the actual word from the word index (index 0 is padding and has no word)
    output_word = tokenizer.index_word.get(predicted, "")
    # Append to the current text
    seed_text += " " + output_word

print(seed_text)
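
Always taking the most probable word often leads to repetitive output. A common alternative, which the notebook above does not use, is to sample from the predicted distribution after rescaling it with a temperature. The following is a minimal sketch under that assumption; the function name and the toy probabilities are hypothetical.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from probs after rescaling by temperature."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Rescale in log space; the small epsilon avoids log(0).
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# Hypothetical model output over a 5-word vocabulary.
probs = [0.05, 0.50, 0.30, 0.10, 0.05]
rng = np.random.default_rng(0)
print([sample_with_temperature(probs, 0.8, rng) for _ in range(10)])
```

Temperatures below 1 sharpen the distribution towards the most probable word, while temperatures above 1 flatten it and produce more varied text. In the generation loop, the sampled index would take the place of the `np.argmax` result.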

## Conclusion

In this project, we aimed to create a text generation model using a stacked bidirectional LSTM architecture. We started by preprocessing the text data, tokenizing it, and preparing it for training. The model architecture was designed to learn from the sequence of words and generate new text based on the learned patterns.

During the training process, we observed that the model achieved high accuracy on the training data. However, as the training progressed, the validation accuracy plateaued or even started decreasing, indicating potential overfitting. Overfitting occurs when the model learns to memorize the training data instead of generalizing well to unseen data. This is a common challenge in deep learning models, especially when dealing with complex architectures and limited data.

To address overfitting and improve the model's performance, several strategies can be considered:
- **Regularization techniques:** Adding dropout layers or L2 regularization to the model can help prevent overfitting by introducing noise or penalizing large weights.
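- **Data augmentation:** Increasing the size and diversity of the training data can help the model generalize better. For text, this could mean adding more sonnets or related poetry to the corpus, or applying careful transformations such as synonym replacement; image-style transformations such as rotation or scaling do not apply here.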
- **Model architecture modifications:** Adjusting the complexity of the model architecture, such as reducing the number of parameters or using simpler network structures, can also mitigate overfitting.

Additionally, further hyperparameter tuning and experimenting with different learning rates, batch sizes, and optimizer settings can also contribute to improving the model's performance.

In conclusion, while we have successfully built a text generation model, further refinement is required to address overfitting and enhance its generalization capabilities. By applying appropriate regularization techniques and fine-tuning the model's architecture and hyperparameters, we can develop a more robust and reliable text generation system.

In [None]:
https://www.kaggle.com/code/japeralrashid/shakespearean-sonnet-generator-using-lstm-networks