# Lyric Generator: Comparing RNN and Markov Models
### Assignment 5

In this week's NLP lab, I built two lyric generators, one using a Markov chain and one using a neural network. My goal is to explore the differences between a simple word-transition model, such as the Markov chain, and deep learning models such as recurrent neural networks. Along the way, I hope to better understand their respective strengths and weaknesses and to consider their potential applications in real-world scenarios.

## Introduction 
When I think about machine learning models, my first thoughts invariably turn to dataset collection. This mirrors an idea from cognitive science: the data a system is built on shapes what it ultimately produces. The data should therefore be chosen to match the domain and purpose of the targeted task.

For this project, word choice and rhyme structure are the two features that most influence the final output. Because I wanted the output to resonate with a younger audience, I did some data mining and research to identify a suitable dataset. After exploring the options, I chose a dataset of lyrics by the singer Michael Jackson, which seemed the most plausible and effective choice for this task.

## Final Outcome

Throughout this week's exploration, I encountered several challenges. Collecting the dataset was straightforward, since free lyrics websites are ubiquitous on the internet. Encoding the raw text was harder. I learned to use the Keras `Tokenizer` method `fit_on_texts` to index the individual words in the dataset and build a preliminary vocabulary. I didn't invest much effort in data cleaning, a step I recognize as crucial for a production project, before proceeding with the lab.

For the LSTM recurrent network model, I followed a tutorial from a machine learning course I'm currently taking. After adjusting the model's parameters and structure, I started training. Initially, to prevent overfitting, I used an `EarlyStopping` callback, but was quite disappointed with the resulting accuracy of 0.3. I then commented that part out, retrained the model for 100 epochs, and reached an accuracy of 0.56. Although the final result isn't ideal compared to existing language models, the main purpose of this exercise was to explore domain-specific model training. I believe that with a larger, better-structured dataset and more sophisticated procedures, the results could be significantly improved.
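The early-stopping rule mentioned above can be sketched in plain Python. This is only an illustration of the patience idea, not the Keras implementation: the function name `early_stopping_epoch` and the loss values are made up, and it assumes a per-epoch validation loss is available (the `fit` call in the training cell passes no validation data).

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # The loss improved: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch
made_up_losses = [3.0, 2.5, 2.4, 2.45, 2.5, 2.6]
stop_epoch = early_stopping_epoch(made_up_losses, patience=2)
print(stop_epoch)
```

With a small patience, training can halt while the model is still far from its best training accuracy, which may be part of why the early-stopped run scored lower.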

Compared to the RNN model, I found that the Markov chain model needs less data to produce reasonable output and is far more time-efficient. For instance, training a single RNN model on the 250KB dataset took over an hour on the latest MacBook Pro, while training two Markov models took less than a minute altogether. For this project's task, where the linguistic structure is relatively simple and the output relies on randomness rather than deep coherence, the Markov chains performed much better than the RNN, which was a delightful outcome for me.

Below are some final outputs from the different models, followed by this week's lab process.

###  Markov Model


    miss  me  feel  the  sun
    sun  shine  i  apart  you're  my
    darling,  now  just  like  the
    blood  in  the  heart  lay
    down  if  you  dig  it?
    blood  in  the  heart  lay
    lay  down  if  you  dig  it?
    girl  just  because  we  can
    close  as  one  and  shivers
    in  time)  and  you  wanna
    close  as  one  and  shivers
    shivers  in  time)  and  you  wanna
    say  is  cheap  you're  dirty
    diana,  nah  dirty  diana,  nah (ooooh . . .  .)
    diana,  nah  dirty  diana,  nah  nah  (ooooh . . .  .)
    just  not  around  what  you
    try  so  we  may  seem
    to  be  recognized  the  money
    try  so  we  may  seem
    seem  to  be  recognized  the  money
    throw  your  name  and  me
    somehow  though  years  the  floor
    in  me  (hold  on)  ain't
    somehow  though  years  the  floor
    floor  in  me  (hold  on)  ain't
    nothin'  that  i  was  waiting
    for  me  alone  whisper  three
    to  be  startin'  somethin'  about
    for  me  alone  whisper  three
    three  to  be  startin'  somethin'  about

-----

    spring every year, yours and a child's heart
    do me game don't stop pressurin' me here
    at night there and it to shake my
    baby cryin' wolf ain't hard to... take you
    baby (you really matters i've never ran away

-----

    why  do  baby  your  eyes
    eyes  are  kissin'  me,  sue  me
    how  it  for  us  all
    the  change)  hoo!  
    i  have a  torch  will  be  there,
    the  change)  hoo!  i  have
    have  a  torch  will  be  there,
    you  know  you  said  yeah
    shoo-hee  woh  woh  woh  hee!
    shoo-hee  oooh  foolish  trickery  and
    shoo-hee  woh  woh  woh  hee! hee!  shoo-hee  oooh  
    foolish  trickery  and
    how  lust  for  us  we
    have  been  my  hand  
    ('cause you  got  the  way  that
    have  been  my  hand)  
    'cause you  got  the  way  that
    girl  at  yourself  and  i
    am  the  lost  my  baby
    (yeah,  yeah)  too  high  speed,
    am  the  lost  my  baby
    baby  (yeah,  yeah)  too  high  speed,
    feedback,  dolby  release  two  places
    at  the  devil  anything  anything
    anything  anything  anything  for  me
    at  the  devil  anything  anything
    anything  anything  anything  anything  for  me
    with  me!)  i'm  telling  you
    make  it  (bad  bad  -
    you  on  time  around  comes
    make  it  (bad  bad  -
    you  on  time  around  comes

###  RNN Model
                                         
    
    I had a dream of another tomorrow of life                                 
    and night i'll find day years may side right 
    just ridiculed or fast falls land though falls      
    real son wings him chiller gana plan pure cola 
    hollywood across remedy high nigh high nigh high nigh high 
    chiller gloom stalking mine      
    baby tomorrow other's nowhere tambien

--------
    
    Spring here with you
    but it feel
    i was around
    yeah yeah yeah
    yeah yeah yeah
    yeah yeah here
    worried feed eliminate
    leaving evening moonwalk
    millionaire it’s deal
    or estas alright
    oak tree tops
    funny top dear
    lovely boat cruise
    heaven

---------

    World i still remember the face
    her world a kiss you
    found your side so cold
    so cold cold fin oak
    tree tops que je yourself
    addition by myself strong in
    stalking singing to rise above
    tomorrow stalking him selfish terrifying
    theme pushed right revenge somehow
    actual plan fin l bien
    de bois tristeza ms inside
    c doo doo oooooooh gana

# Lab Process

## `Data Dictionary`

A collection of lyric texts from the singer Michael Jackson. The size of the dataset is 250KB, and the original source comes from free lyric hosting websites.

In [58]:
import tensorflow as tf
import numpy as np 
import random

from tensorflow.keras.preprocessing.text import Tokenizer

In [59]:
textdata = open("lyrics/michael-jackson.txt").read()
textdata



In [60]:
corpus = textdata.lower().split("\n")
print(corpus)



## `Build the vocabulary system in the raw dataset`

In [61]:
tokenizer = Tokenizer()

In [62]:
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
total_words

3274
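The `+ 1` in `total_words` is there because the Keras word index starts at 1, leaving index 0 free for padding. The sketch below shows roughly how such an index can be built. It is a simplified stand-in with a hypothetical helper name (`build_word_index`) and a toy corpus, not the Keras implementation, which also strips punctuation by default.

```python
from collections import Counter


def build_word_index(lines):
    """Map each word to an integer index, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.lower().split())
    # Index 0 is deliberately left unused so it can serve as the padding value
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


toy_corpus = ["you rock my world", "you are my world", "you"]  # made-up lines
word_index = build_word_index(toy_corpus)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0
print(word_index)
print(vocab_size)
```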

In [141]:
print(f"This is the length of the corpus:{len(corpus)}, This is the length of the text data: {len(textdata)}")

This is the length of the corpus:11177, This is the length of the text data: 251022


In [142]:
print(f"{tokenizer.word_index}")



# LAB 1 - RNN Model

In [132]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

In [12]:
input_sequences = []

for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)


max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))


# create predictors and labels
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]


ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
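To make the cell above concrete, here is a small pure-Python sketch of the same idea on one toy line: every prefix of the line becomes a training example, shorter prefixes are left-padded with zeros, and the last token of each row is the label. The helper names and the token values are made up for illustration.

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2, mirroring the n-gram loop."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]


def pre_pad(sequences, maxlen, pad_value=0):
    """Left-pad each sequence with pad_value up to maxlen."""
    return [[pad_value] * (maxlen - len(seq)) + seq for seq in sequences]


token_list = [4, 2, 7, 9]  # hypothetical word indices for one lyric line
sequences = prefix_sequences(token_list)
padded = pre_pad(sequences, maxlen=max(len(s) for s in sequences))

xs = [row[:-1] for row in padded]      # model inputs: everything but the last token
labels = [row[-1] for row in padded]   # targets: the last token of each row

for x, y in zip(xs, labels):
    print(x, "->", y)
```

Each lyric line of n words therefore yields n - 1 training examples, which is why a 250KB file produces so many steps per epoch.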

In [14]:
model = Sequential()

model.add(Embedding(total_words, 100, input_shape=(max_sequence_len-1,)))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))

adam = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100)

print(model)

Epoch 1/100
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 39s 30ms/step - accuracy: 0.0971 - loss: 5.8498
Epoch 2/100
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 39s 31ms/step - accuracy: 0.3018 - loss: 3.8804
Epoch 3/100
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 39s 31ms/step - accuracy: 0.3803 - loss: 3.1646
Epoch 4/100
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 40s 31ms/step - accuracy: 0.3789 - loss: 3.3300
Epoch 5/100
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 39s 31ms/step - accuracy: 0.4382 - loss: 2.6535
Epoch 6/100
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 39s 31ms/step - accuracy: 0.4545 - loss: 2.5393
Epoch 7/100
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 39s 31ms/step - accuracy: 0.4597 - loss: 2.5052
Epoch 8/100
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 39s 31ms/step - accuracy: 0.4829 - loss: 2.2977


# Save The Model

In [82]:
model.save('MJ_pop_poetry.keras')

In [24]:
#model.save('MJ2_pop_poetry_model.keras')
model.summary()  # summary() prints the table itself and returns None, so wrapping it in print() only adds "None"


In [91]:
def generate_lyrics(seed_text, next_words=50, model=model, max_sequence_len=max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)
        predicted_index = np.argmax(predicted_probs, axis=-1)[0]
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text


print(generate_lyrics("flower"))

flower summer's gotten away from us but around the time le da da da dum mumble oh oh god before scarred war before appeared du someone stalking yourself sunbeams apart flap doing o'neal ferris aa aa aa aa aa tambien actual power him ohms apart 'cause honest hollywood mean 12x apart


In [116]:
print(generate_lyrics("I had a dream"))

I had a dream of another tomorrow of life and night i'll find day years may side right just ridiculed or fast falls land though falls real son wings him chiller gana plan pure cola hollywood across remedy high nigh high nigh high nigh high chiller gloom stalking mine baby tomorrow other's nowhere tambien
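The inner loop in `generate_lyrics` scans the whole vocabulary to turn the predicted index back into a word. One possible alternative, sketched here with a made-up three-word vocabulary, is to invert the word index once and then look words up directly. The helper name `invert_word_index` is hypothetical.

```python
def invert_word_index(word_index):
    """Build the reverse mapping: integer index -> word."""
    return {index: word for word, index in word_index.items()}


word_index = {"you": 1, "my": 2, "world": 3}  # hypothetical vocabulary
index_word = invert_word_index(word_index)

predicted_index = 3
# Index 0 is the padding value and has no word, so fall back to an empty string,
# which matches what the scanning loop produces when nothing matches
output_word = index_word.get(predicted_index, "")
print(output_word)
```

Note also that taking the `argmax` at every step always picks the single most likely word, which is consistent with the repeated "yeah yeah yeah" runs in the generated text.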


In [102]:
def generate_pop_poetry(seed_text, model, tokenizer, max_sequence_len, next_words=50, words_per_line=5):
    output_text = seed_text
    word_count = 0
    
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([output_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)
        predicted_index = np.argmax(predicted_probs, axis=-1)[0]
        output_word = ""
        
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break
        
        output_text += " " + output_word
        word_count += 1
        
        if word_count % words_per_line == 0:
            output_text += "\n"
    
    return output_text

print(generate_pop_poetry(
    seed_text="\n Spring", 
    model=model, 
    tokenizer=tokenizer, 
    max_sequence_len=max_sequence_len, 
    next_words=40, 
    words_per_line=3
))


 Spring here with you
 but it feel
 i was around
 yeah yeah yeah
 yeah yeah yeah
 yeah yeah here
 worried feed eliminate
 leaving evening moonwalk
 millionaire it’s deal
 or estas alright
 oak tree tops
 funny top dear
 lovely boat cruise
 heaven


In [114]:
print(generate_pop_poetry(
    seed_text="\n World", 
    model=model, 
    tokenizer=tokenizer, 
    max_sequence_len=max_sequence_len, 
    next_words=60, 
    words_per_line=5
))


 World i still remember the face
 her world a kiss you
 found your side so cold
 so cold cold fin oak
 tree tops que je yourself
 addition by myself strong in
 stalking singing to rise above
 tomorrow stalking him selfish terrifying
 theme pushed right revenge somehow
 actual plan fin l bien
 de bois tristeza ms inside
 c doo doo oooooooh gana



# LAB 2 - Markov Model

In [7]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import os
from collections import defaultdict
from collections import Counter
import re
import os
import random

In [9]:
with open('lyrics/michael-jackson.txt', 'r', encoding='utf-8') as file:
    words = file.read().lower().split()

In [63]:
words

['[spoken',
 'intro:]',
 'you',
 'ever',
 'want',
 'something',
 'that',
 'you',
 'know',
 'you',
 "shouldn't",
 'have',
 'the',
 'more',
 'you',
 'know',
 'you',
 "shouldn't",
 'have',
 'it,',
 'the',
 'more',
 'you',
 'want',
 'it',
 'and',
 'then',
 'one',
 'day',
 'you',
 'get',
 'it,',
 "it's",
 'so',
 'good',
 'too',
 'but',
 "it's",
 'just',
 'like',
 'my',
 'girl',
 'when',
 "she's",
 'around',
 'me',
 'i',
 'just',
 'feel',
 'so',
 'good,',
 'so',
 'good',
 'but',
 'right',
 'now',
 'i',
 'just',
 'feel',
 'cold,',
 'so',
 'cold',
 'right',
 'down',
 'to',
 'my',
 'bones',
 "'cause",
 'ooh...',
 "ain't",
 'no',
 'sunshine',
 'when',
 "she's",
 'gone',
 "it's",
 'not',
 'warm',
 'when',
 "she's",
 'away',
 "ain't",
 'no',
 'sunshine',
 'when',
 "she's",
 'gone',
 'and',
 "she's",
 'always',
 'gone',
 'too',
 'long',
 'anytime',
 'she',
 'goes',
 'away',
 'wonder',
 'this',
 'time',
 'where',
 "she's",
 'gone',
 'wonder',
 'if',
 "she's",
 'gone',
 'to',
 'stay',
 "ain't",
 'no'

In [64]:
markov_model = {}

for i in range(len(words)-1):
    word = words[i]
    next_word = words[i + 1]
    if word not in markov_model:
        markov_model[word] = []
    markov_model[word].append(next_word)

print(markov_model)
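To see what this dictionary looks like, here is the same construction on a made-up seven-word sequence. The helper name `build_markov_model` is only for illustration.

```python
def build_markov_model(words):
    """Map each word to the list of words that follow it, duplicates included."""
    model = {}
    for word, next_word in zip(words, words[1:]):
        model.setdefault(word, []).append(next_word)
    return model


toy_words = "beat it just beat it beat me".split()  # made-up input
toy_model = build_markov_model(toy_words)
print(toy_model)
```

Because successors are stored with duplicates, a uniform `random.choice` over the list picks each successor in proportion to how often it followed the word in the corpus.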



In [11]:
def generate_text(model, start_word, length):
    current_word = start_word
    text = [current_word]
    for _ in range(length - 1):
        successors = model.get(current_word)
        if not successors:  # dead end: stop instead of appending empty strings
            break
        current_word = random.choice(successors)
        text.append(current_word)
    return ' '.join(text)

start_word = "spring" 
generated_text = generate_text(markov_model, start_word, 50)
print(generated_text)

spring every day and all the prolems that is breaking why don't stop trippin' why i could be startin' somethin' you and singing to do we fell trapped in vain her about it) (aahh, she dreams tell you want more step (cause we're bring your apartment) dang gone it's got


In [12]:
def generate_markov_poetry(start_word, model, next_words=50, words_per_line=5):
    if start_word not in model:
        return "The start word is not in the model."
    
    output_text = start_word
    current_word = start_word
    word_count = 1 
    
    for _ in range(next_words - 1):  
        next_words_list = model.get(current_word, [])
        if not next_words_list:  # dead end: this word has no recorded successor
            break
        current_word = random.choice(next_words_list)
        output_text += " " + current_word
        word_count += 1
        
        if word_count % words_per_line == 0:
            output_text += "\n"
    
    return output_text


print(generate_markov_poetry(
    start_word="spring", 
    model=markov_model, 
    next_words=40, 
    words_per_line=8  
))

spring every hot island if the pain on
 me a vegetable still i told you g'on
 get up your every night you don't you
 babe, hee! hee! hoo! hoo! dancin'-hee! doggone girl
 i just because it's the soul want you,



In [14]:
markov_model_2 = defaultdict(lambda: defaultdict(int))

In [15]:
for i in range(len(words) - 1):
        current_word = words[i]
        next_word = words[i + 1]
        markov_model_2[current_word][next_word] += 1

In [16]:
for current_word, next_words in markov_model_2.items():
        total_occurrences = sum(next_words.values())
        for next_word in next_words:
            next_words[next_word] /= total_occurrences
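The three cells above count transitions and then divide each count by the row total, so every row becomes a probability distribution over next words. A compact sketch of the same computation on a made-up word sequence follows. The function name `transition_probabilities` is hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(words):
    """Return {word: {next_word: probability}} estimated from adjacent pairs."""
    counts = defaultdict(Counter)
    for word, next_word in zip(words, words[1:]):
        counts[word][next_word] += 1
    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }


probs = transition_probabilities("beat it just beat it beat me".split())
print(probs["beat"])
```

Sampling from these weights with `random.choices` draws from the same distribution as a uniform `random.choice` over a successor list with duplicates, so the two Markov models differ in representation rather than in behaviour.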

In [32]:
lyric_length = 100
punctuation_marks = ['.', ',', ';', ':', '!', '?']

# Seed word: set it before the loop, otherwise the loop starts from
# whatever value current_word held in an earlier cell
current_word = "why"
lyric = [current_word]

for _ in range(lyric_length - 1):
    if current_word in markov_model_2:
        next_words = markov_model_2[current_word]
        
        # Choose the next word based on its probability
        next_word = random.choices(list(next_words.keys()), weights=list(next_words.values()), k=1)[0]
        
        # Append the next word to the lyric
        if next_word in punctuation_marks:
            lyric.append(next_word)  # Add punctuation directly
        else:
            lyric.append(' ' + next_word)  # Add space before non-punctuation words
        
        # Update current_word for the next iteration
        current_word = next_word
    else:
        # If the current word has no next words, end the lyric
        break

# Join the pieces into the final lyric string
lyric_str = ''.join(lyric)
print(lyric_str)

why do baby your eyes are kissin' me, sue me how it for us all the change) hoo! i have a torch will be there, you know you said yeah shoo-hee woh woh woh hee! shoo-hee oooh foolish trickery and how lust for us we have been my hand) 'cause you got the way that girl at yourself and i am the lost my baby (yeah, yeah) too high speed, feedback, dolby release two places at the devil anything anything anything anything anything for me with me!) i'm telling you make it (bad bad - you on time around comes


In [66]:
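def poem_style(word_list):
    # Note: the window advances by 15 words, but each stanza reads 25 words
    # (and its second line starts at i+4, not i+5), so consecutive stanzas
    # overlap and some lines are printed twice, as the output below shows.
    for i in range(0, (lyric_length - 1), 15):
        lines = [
            " ".join(word_list[i:i+5]),
            " ".join(word_list[i+4:i+10]),
            " ".join(word_list[i+10:i+15]),
            " ".join(word_list[i+15:i+20]),
            " ".join(word_list[i+20:i+25])
        ]

        # Print the poem
        for line in lines:
            print(line)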

In [67]:
poem_style(lyric)

why  do  baby  your  eyes
 eyes  are  kissin'  me,  sue  me
 how  it  for  us  all
 the  change)  hoo!  i  have
 a  torch  will  be  there,
 the  change)  hoo!  i  have
 have  a  torch  will  be  there,
 you  know  you  said  yeah
 shoo-hee  woh  woh  woh  hee!
 shoo-hee  oooh  foolish  trickery  and
 shoo-hee  woh  woh  woh  hee!
 hee!  shoo-hee  oooh  foolish  trickery  and
 how  lust  for  us  we
 have  been  my  hand)  'cause
 you  got  the  way  that
 have  been  my  hand)  'cause
 'cause  you  got  the  way  that
 girl  at  yourself  and  i
 am  the  lost  my  baby
 (yeah,  yeah)  too  high  speed,
 am  the  lost  my  baby
 baby  (yeah,  yeah)  too  high  speed,
 feedback,  dolby  release  two  places
 at  the  devil  anything  anything
 anything  anything  anything  for  me
 at  the  devil  anything  anything
 anything  anything  anything  anything  for  me
 with  me!)  i'm  telling  you
 make  it  (bad  bad  -
 you  on  time  around  comes
 make  it  (bad  bad  -
 -  you  on  t
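The repeated lines in the output above come from overlapping slices in `poem_style`: the loop steps by 15 words while each pass reads 25. If repetition is not wanted, a non-overlapping split could look like the sketch below. The function name `split_into_lines` and the sample list are made up, and stripping each word is an added step to remove the leading spaces stored in `lyric`.

```python
def split_into_lines(words, words_per_line=5):
    """Split a word list into consecutive, non-overlapping lines."""
    cleaned = [w.strip() for w in words]  # drop the leading spaces stored with each word
    return [
        " ".join(cleaned[i:i + words_per_line])
        for i in range(0, len(cleaned), words_per_line)
    ]


sample = ["why", " do", " baby", " your", " eyes", " are", " kissin'"]  # made-up input
lines = split_into_lines(sample, words_per_line=5)
for line in lines:
    print(line)
```

Whether to keep the overlap is a stylistic choice: the repeated lines do give the Markov output a chorus-like feel.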

# Reload the model

In [73]:
import keras
from keras.models import load_model

In [84]:
new_model = load_model('MJ_pop_poetry.keras')

ValueError: A total of 1 objects could not be loaded. Example error message for object <LSTMCell name=lstm_cell, built=True>:

Layer 'lstm_cell' expected 3 variables, but received 0 variables during loading. Expected: ['kernel', 'recurrent_kernel', 'bias']

List of objects that could not be loaded:
[<LSTMCell name=lstm_cell, built=True>]

Note: reloading my RNN Keras model failed for reasons I have not yet identified; this needs further investigation.