# Pootry

`Richard Rivaldo 13519185`

`Informatics Engineering Institut Teknologi Bandung`

**Pootry** is a Natural Language Processing project on building a `Deep Learning` model of poetry generator. The planned model for this project will be the `Bidirectional LSTM` alias `Long Short-Term Memory`, also with `GloVe Embeddings` or `Global Vectors` for words embeddings technique used in the training phase. Let's see how it goes!

The repository of the project: [Pootry](https://github.com/RichardRivaldo/Pootry).  The dataset can also be found in the repository.

# Preparations

The dataset used in this project is a `.txt` file containing over 2000 verses (verses? lines? whatever~). For GloVe embeddings, this number might seem a little few compared to the pre-trained vectors made with it. Nonetheless, I still want to create my very first own embeddings as the sole purpose of this project is to gain deeper understanding about Natural Language Processing`s techniques.

**Library Preparation**

In [None]:
# Install Glove and Tensorflow Text
! pip install glove-python-binary
! pip install tensorflow_text



In [37]:
# Import all libraries
import re
import warnings
import numpy as np
from glove import Corpus, Glove
import tensorflow_text as tftext
from tensorflow.ragged import constant
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.initializers import Constant
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Embedding, Dropout, Bidirectional, LSTM, Dense, InputLayer, LeakyReLU

# Let's ignore warning, shall we?
warnings.filterwarnings('ignore')

# Data Preprocessing

**Reading and Cleaning the Dataset**

In [None]:
# Data path for Colab
data_path = "/content/poems.txt"

# Tried using Shakespeeare's. 
# It crashed due to massive number of tokens when doing one-hot encoding. :(

In [None]:
# Function to clean the verses
# Simply replaces punctuations with empty char
def clean_verses(list_of_verses):
  # Clean all punctuations
  # Except for ' and - that might have meanings for the word using it
  cleaned_verse = [re.sub(r'[\!"#$%&\*+,./:;<=>?@^_`()|~=]', "", verse) for verse in list_of_verses]
  
  return cleaned_verse

In [None]:
# Read the file and store each line into a list of lists of tokenized words in every verses
def read_poems_data(data_path, clean_puncts=False):
  # Read the file and split the verses by newline
  # Also filter out the empty strings
  with open(data_path) as poems:
    verses = poems.read()
    verses = verses.split("\n")
    verses = list(filter(lambda verse: verse != "", verses))
  
  # If the data wants to be cleaned, the clean_puncts can be changed to True
  if clean_puncts:
      verses = clean_verses(verses)
  
  return verses

**Tokenize the Verses**

In [None]:
# Tokenize the words in each verse
def tokenize_verses(verses):
  all_tokens = []
  for verse in verses:
    # Use regex to split the strings based on whitespace
    # Also based on -- (not -) because there are many words using it
    tokens = re.split(r'--| ', verse)

    # Iterate over every tokens
    # If a word is UPPERCASE, change it to sentence case for better dictionary
    tokens = [token.capitalize() if token.isupper() else token for token in tokens]

    # Filter empty string from the second split
    tokens = list(filter(lambda token: token != "", tokens))
    all_tokens.append(tokens)

  return all_tokens

# GlOVe (Global Vectors) Words Embedding

**Create Corpus**

In [None]:
def create_corpus(tokens, window=20):
  # Instantiate the corpus
  corpus = Corpus()

  # Create the occurence matrix with context window of 20
  # Context window is the technique of counting co-occurence
  # 20 means that we will count co-occurence 20 words left-right
  # The number is chosen because average maximum words for the dataset is less than 15
  # Fit the verses into the corpus
  corpus.fit(tokens, window)
  
  return corpus

**GloVe Embedding**

In [None]:
# Embed the corpus into GloVe Model
# Components are the numbers of latent vector dimension
# Learning Rate is the SGD Learning Rate
# Epochs is the number of training epochs for fitting the corpus
# Number of threads is the number of threads used in training the data
def embed_glove(corpus, num_of_components=300, lr=0.05, epochs=100, num_of_threads=30):
  # Instantiate the model
  glove = Glove(no_components=num_of_components, learning_rate=lr)

  # Fit over the co-occurence matrix in the corpus
  glove.fit(corpus.matrix, epochs=epochs, no_threads=num_of_threads)

  # Add the vocab of corpus to the model
  glove.add_dictionary(corpus.dictionary)

  return glove

**Testing the Embeddings**

In [None]:
# Find the 10 most similar word
# glove_embeddings.most_similar("devil", number=10)

# Sequence Modelling

**Construct Tokenizer**

In [None]:
# Using Tensorflow Keras Tokenizer API
# The tokenizer will take all preprocessed tokens from before
def construct_tokenizer(tokens, lower_text=False):
  # Create the tokenizer with specialized Out of Vocabulary Token
  tokenizer = Tokenizer(lower=lower_text, oov_token="<OOV>")

  # Fit the tokenizer into the tokens
  tokenizer.fit_on_texts(tokens)

  return tokenizer

**Rejoin Sentences**

In [None]:
# Rejoin all tokens back into a sentence
def rejoin_sentences(tokens):
  # Join all tokens in a sentence and put it in this list
  return [' '.join(tokens_sentence) for tokens_sentence in tokens]

**Sequence Modelling of the Words**

In [None]:
# Convert the sentences into sequences
# Will generate sequences of iteratively increasing N-Grams form of the sequence
def generate_sequences(tokenizer, sentences):
  n_grams_sequences = []

  # Iterate over each sentence and convert it into sequences with the tokenizer
  for sentence in sentences:
    seq = tokenizer.texts_to_sequences([sentence])[0]

    # Iterate over the sequence made from before and build N Gram sequences iteratively
    for idx in range(1, len(seq)):
      # Minimum length of the sequence will be two converted token
      n_gram_seq = seq[: idx + 1]
      n_grams_sequences.append(n_gram_seq)
  
  return n_grams_sequences

**Generate Embeddings Matrix**

In [None]:
def generate_embed_matrix(embeddings, tokenizer):
  # Get the word index and number of tokens from the tokenizer
  word_index = tokenizer.word_index
  # Already included the OOV word
  num_of_tokens = len(word_index)

  # Get embeddings model dimension (length of each word vector)
  embeddings_dim = embeddings.no_components

  # Initialize numpy matrix of zeros as the embeddings matrix
  # The dimension will be the number of tokens x the embeddings dimension
  embeddings_matrix = np.zeros((num_of_tokens, embeddings_dim))

  # Iterate over all words in the word index
  for word, i in word_index.items():
    # Get the word index in the GloVe dictionary
    glove_word_index = embeddings.dictionary.get(word)
    if glove_word_index is not None:
      # Get the embeddings vector of the corresponding GloVe index
      embeddings_vector = embeddings.word_vectors[glove_word_index]
      # Set the index of the matrix, because the OOV has the index of 1,
      # then the starting index will be i - 1 instead of just i
      embeddings_matrix[i - 1] = embeddings_vector

  return embeddings_matrix

# Generate Training Data from the Sequences

**Generate Features**

In [None]:
# Generate all the N-Grams sequences except the last token in each sequences
# Simulate predicting the next words available given preceding words
# These sliced sequences will be the features of the model
def generate_features(n_gram_sequences):
  return constant([seq[:-1] for seq in n_gram_sequences])

**Generate Labels**

In [None]:
# Generate all the last token in each N-Grams sequences
# Simulate the predicted words given preceding words
# Change the labels to be categorical and encode the sequences with One-Hot Encoding
# These sliced sequences will be the labels of the model
def generate_labels(n_gram_sequences, tokenizer):
  # Slice the sequences
  labels = [seq[-1] for seq in n_gram_sequences]

  # Get the number of tokens to make it the number of classes in the encoding
  num_of_tokens = len(tokenizer.word_index)

  return to_categorical(labels, num_classes=num_of_tokens)

# Constructing the Model

**Construct the Model**

In [None]:
def construct_model(embeddings_matrix, tokenizer, glove, n_units=200, dropout=0.2):
  # Get model properties
  vocab_size = len(tokenizer.word_index)      # Number of tokens or vocabularies
  embeddings_dim = glove.no_components        # The dimension of embeddings matrix vector

  # Initialize the Sequential Model
  seq_model = Sequential([
    # Initialize Ragged Input Layer
    InputLayer(input_shape=(None, ), ragged=True),
    # Convert the layer into densely-connected layer
    tftext.keras.layers.ToDense(pad_value=0, mask=True),
    # Inititalize Embedding Layer with weighted embeddings matrix
    Embedding(vocab_size, embeddings_dim, embeddings_initializer=Constant(embeddings_matrix), weights=[embeddings_matrix]),
    # Dropout Regularization Layer to avoid overfitting
    Dropout(dropout),
    # Bidirectional LSTM
    # Bidirectional here means that the LSTM can do two-way learning
    # Return Sequences because there are still many layers needing it
    Bidirectional(LSTM(n_units, return_sequences=True)),
    # Dropout Regularization
    Dropout(dropout),
    # One directional LSTM layer
    LSTM(n_units),
    # Add a dense layer with L2 Regularizer of 0.01
    Dense(vocab_size, kernel_regularizer=l2(0.01)),
    # Transform the output with Leaky ReLU
    LeakyReLU(),
    # Add last dense layer with Softmax Activation Function
    Dense(vocab_size, activation='softmax')],
    # Why not? :D
    name="Pootry")

  return seq_model

**Compile the Model**

In [None]:
# Compile with Adam Optimizer, Categorical Crossentropy loss function, and accuracy metric
def compile_model(model, opt='adam', losses='categorical_crossentropy', metric=['accuracy']):
  model.compile(loss=losses, optimizer=opt, metrics=metric)

**Show Model Summary**

In [None]:
def show_model_summary(model):
  model.summary()

**Training the Model**

In [None]:
# Train with default number of 100 epochs and 1 verbose
def fit_train(model, features, labels, n_epochs=100, verb=1):
  model.fit(features, labels, epochs=n_epochs, verbose=verb)

**I Know What You Will Say**

In [None]:
# Helper
# Find words by predicted class index
def find_word_by_index(predicted_class, tokenizer):
  for word, index in tokenizer.word_index.items():
    if index == predicted_class:
      return word
  
  return ""

In [40]:
# Helper
# Output the generated poem
def output_poem(pooem):
  print("Pooem")
  print("By: Pootry\n")

  for verse in pooem[:-1]:
    verse[0] = verse[0].capitalize()
    verse = " ".join(verse)
    print(verse + ", ")
  print((" ".join(pooem[-1]).capitalize()) + ". ")

In [35]:
# Making a prediction a.k.a generating new poem from a seed text
# Number of verse is the number of verse in the generated poem
# Maximum words count is the number of maximum words in each verse
# Minimum number words will be maximum words count -5
# Minimum value for the maximum words count should be 8
def generate_poem(model, seed_text, tokenizer, num_of_verse=8, max_word_count=15):
  pooem = []
  min_word_count = max_word_count - 5
  assert(min_word_count >= 3)

  # Cleaning pipeline
  seed_text = clean_verses([seed_text])
  seed_text = tokenize_verses(seed_text)[0]

  # List to contain each line
  verse = seed_text.copy()

  # Iterate over each verse
  for _ in range(num_of_verse):
    # Randomly pick number of words in each verses
    num_of_words = np.random.randint(min_word_count, max_word_count + 1)

    while len(verse) <= num_of_words:
      # Convert the seed text to sequences with the tokenizer
      seed_sequence = tokenizer.texts_to_sequences(seed_text)[0]
      # Convert the sequences into Ragged Tensor
      seed_sequence = constant([seed_sequence])
      # Predict the word and append it to the seed text and poem
      predicted_class = model.predict_classes(seed_sequence, verbose=0)
      # Get the corresponding word
      predicted_word = find_word_by_index(predicted_class, tokenizer)
      verse.append(predicted_word)
      
      # Append and join all words currently in the seed text
      seed_text.append(predicted_word)
      seed_text = [" ".join(seed_text)]
    
    # Append the verse to the poem list and reset the verse
    pooem.append(verse)
    verse = []

  # Show the poem
  output_poem(pooem)

# Gotta Jumble Them All!

In [None]:
# Main function caller
def main(data_path, seed_text):
  # Read and clean Poem data
  verses = read_poems_data(data_path, clean_puncts=True)

  # Tokenize the verse to create words embeddings
  tokens = tokenize_verses(verses)

  # Produce corpus from the tokens
  corpus = create_corpus(tokens)

  # Construct Tensorflow Keras Tokenizer object
  tokenizer = construct_tokenizer(tokens)

  # Fit GloVe Embeddings and generate embeddings matrix from the model
  glove_embeddings = embed_glove(corpus)
  embeddings_matrix = generate_embed_matrix(glove_embeddings, tokenizer)  

  # Rejoin sentences from the tokens and create N-Grams Sequences
  sentences = rejoin_sentences(tokens)
  n_gram_sequences = generate_sequences(tokenizer, sentences)

  # Generate features and labels data for training
  features = generate_features(n_gram_sequences)
  labels = generate_labels(n_gram_sequences, tokenizer)

  # Construct the model with given parameters
  model = construct_model(embeddings_matrix, tokenizer, glove_embeddings)

  # Compile and show the summary of the model
  compile_model(model)
  model.summary()

  # Fit and train the model
  fit_train(model, features, labels)

  # Generate a poem
  generate_poem(model, seed_text, tokenizer)

  # Save the model
  model.save("/content/Pootry")

In [25]:
seed_text = "devil"
main(data_path, seed_text)

Model: "Pootry"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
to_dense (ToDense)           (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 300)         1281000   
_________________________________________________________________
dropout (Dropout)            (None, None, 300)         0         
_________________________________________________________________
bidirectional (Bidirectional (None, None, 400)         801600    
_________________________________________________________________
dropout_1 (Dropout)          (None, None, 400)         0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               480800    
_________________________________________________________________
dense (Dense)                (None, 4270)              85827



INFO:tensorflow:Assets written to: /content/Pootry/assets


INFO:tensorflow:Assets written to: /content/Pootry/assets


# Oopsie, I have a really long short-term memory there..

No problem, having it saved sure ease my mind. :D Anyway, there are code changes made after the training, ignoring warnings and changing those print test for example, andddd, some secrets of course..

In [34]:
# Loading the model
# The saved model can be found in the same repository
# Wait nope, not a good idea. The link to the model can be accessed here
# https://drive.google.com/drive/folders/1Qq7jSFCAGKGZdXy2MJcXOt9BvKRcdXr0?usp=sharing
model = load_model("Pootry")
model.summary()

Model: "Pootry"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
to_dense (ToDense)           (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 300)         1281000   
_________________________________________________________________
dropout (Dropout)            (None, None, 300)         0         
_________________________________________________________________
bidirectional (Bidirectional (None, None, 400)         801600    
_________________________________________________________________
dropout_1 (Dropout)          (None, None, 400)         0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               480800    
_________________________________________________________________
dense (Dense)                (None, 4270)              85827

In [66]:
# Poem #1
seed_text = "revolts, mutinies, and multitude"
generate_poem(model, seed_text, tokenizer)

Pooem
By: Pootry

Revolts mutinies and multitude thus i' his fawn and love to, 
Richard so hast in yours for this battle pleased sights of men's right, 
Of mine right happy son their happy son their death aches, 
Last side straight happy to Richard hence him go in me establish'd there see thee false, 
Glass to Clarence these wars in peace aches vile was heaven either thou eyes confound victorious, 
Wreaths herself of faded lies your power upon heaven shame me him he was sorrow people, 
There will despair cry carnal cur lose hell right again lose dangerous thing more sleep upon, 
Fear me where spake sport will weeds revenge this death go in me attend my. 


In [63]:
# Poem #2
seed_text = "manacles tied my mind"
generate_poem(model, seed_text, tokenizer)

Pooem
By: Pootry

Manacles tied my mind march by his wall by birth me in your spears Why, 
None more ever redemption plebeians 'we have shalt fetch his queen of uncle him, 
Richard there looks men it be groans stood offer'd along up me, 
Strew'd upon enemy love him spake for Englishmen fill'd and offer'd dreams 'we point of, 
Accomplish dream will power happy and power courageous friends pride Richard leaves heaven go, 
Upon me up him upon thirst the king Volsces side hath fond, 
For heaven am made queen me aches fame thou hast and heaven live thee power, 
Why was unto my heart edward against a face against it. 


In [60]:
# Poem #3
seed_text = "in front of shunless destiny!"
generate_poem(model, seed_text, tokenizer)

Pooem
By: Pootry

In front of shunless destiny me hath spoken groans lets me O hope, 
Was thy death windows will faded 'we have shalt fetch his queen of sorrow walls us, 
Say fresher amen part heaven speak him hither with either power, 
Richard outweighs trouble us use it to thy rage heaven his zeal kind of, 
Evil confound Guilty guilty of yours will thou affairs affairs born, 
'we have shalt Rome like peace flower wall on me shall, 
Mothers' sons and men count peace happy and right power hereafter thy, 
Dearest mind of wrongs breathest infidels guilty of brothers and power foot was. 


In [59]:
# Poem #4
seed_text = "Heaven plagues thee."
generate_poem(model, seed_text, tokenizer)

Pooem
By: Pootry

Heaven plagues thee common pleasing of love eyes his joints camel camel Richard outweighs fortunes in, 
My wives of enemy fathers up me affrights you spake to see me about it right, 
Out of fear fear right weeping blood right of battle looks looks in, 
Weeping time a king faded aches in battle pleased how thou yours for things, 
Miles right enemy peace nail these wars and comest Trust camel, 
Camel proportion me who's so fortunes to thee in fear withal those, 
Cobham to fetch his queen of uncle him Richard there looks looks, 
Hast prey to things we fall it true affairs hast guilty odds. 


In [55]:
# Poem #5
seed_text = "<slander him who?> "
generate_poem(model, seed_text, tokenizer)

Pooem
By: Pootry

Slander him who of a name like fathers unto love why man then have, 
Peace and true name of battle see to shame King these battle looks will groans looks, 
Swears then hath broken in the world 'we have shalt reap his time fierce blest, 
That weeping gains blest offer'd offer'd blown and lies your zeal there saved, 
Advised respect how cur upon Richard' lies heaven them stamp upon your power about like Tewksbury, 
Strength be breathest how lose life horse Hastings in your uncle crying pride, 
Guilty of enemy Richard outweighs enemy offer'd allies men's right of enemy, 
Right men suit faded thou uncle men well comest adder alive. 


The last surely is the **best** one. :DDDDDDDDDDDDD

# Takeaways

Playing with texts in NLP sure is fun. There are many challenges I had when using this new architecture for me, and it was quite an adventurous exploration, digging in GloVe Embeddings and Ragged Tensors. The model has the accuracy of more than 80% which is somehow great, but improvement on its grammar and tagging all tokens' parts of speech might produce better results. 21 million of parameters sure is huge. :D