# Generating 90s Pop Lyrics at the Character Level

## Goal
Generate one line of lyrics in the style of 90s pop.

## Problem Formulation
X: Example lines of lyrics for the model to learn from, one-hot encoded by character (n_examples, max_length, n_characters)
Y: A generated sequence of characters that ends with `<EOS>` (n_examples, n_characters)

`<EOS>` is a special character in the vocabulary that tells the model it can stop predicting.
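As a small illustration of these shapes, the sketch below one-hot encodes a single line over a toy vocabulary. The vocabulary, the spelling of the `EOS` token, and the `one_hot_line` helper are assumptions made for the example; they are not the notebook's actual preprocessing.

```python
# Toy vocabulary; the real one is built from the corpus later on.
EOS = "<EOS>"  # assumed spelling of the end-of-sequence token
vocab = sorted(set("la di da")) + [EOS]
char_to_ix = {ch: i for i, ch in enumerate(vocab)}


def one_hot_line(line, char_to_ix, eos=EOS):
    """Return one one-hot row per character, with a final row for <EOS>."""
    tokens = list(line) + [eos]
    rows = []
    for token in tokens:
        row = [0.0] * len(char_to_ix)
        row[char_to_ix[token]] = 1.0
        rows.append(row)
    return rows


encoded = one_hot_line("la di da", char_to_ix)
print(len(encoded), len(encoded[0]))  # (line length + 1, n_characters)
```

Stacking such matrices for every line, padded to a common length, gives the 3-D arrays used below.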

## Methodology
To accomplish this, we need:
1. Dataset: A corpus of 90s Pop lyrics
2. Vocabulary: A set of characters which will be used for generating lyrics
3. Model: A model which can encode the probability of the next character given a sequence of characters
4. Generate Lyrics: Use the model and an input to generate new lyrics

## Extract and Transform Raw Dataset

In [1]:
import pandas as pd

In [2]:
# load raw data file as a dataframe
raw_data = pd.read_csv('data/raw.csv')

In [3]:
# filter for only lyrics from the 1990s, of the pop genre, and not instrumentals
mask = (raw_data['year'] > 1989) & (raw_data['year'] < 2000) & (raw_data['genre'] == 'Pop') & (raw_data['lyrics'] != '[Instrumental]')
filtered_data = raw_data[mask]

In [4]:
# remove any that have null values
cleaned_data = filtered_data.dropna()
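A toy version of the filter, with made-up rows, can make the mask easier to check. The column names match the ones used above, but the data is invented, and `dropna(subset=['lyrics'])` is a possible variation that drops only rows with missing lyrics rather than rows with a null in any column.

```python
import pandas as pd

# Invented rows, only to exercise the filter.
toy = pd.DataFrame({
    "year": [1989, 1994, 1999, 1995, 1996],
    "genre": ["Pop", "Pop", "Rock", "Pop", "Pop"],
    "lyrics": ["a", "b", "c", "[Instrumental]", None],
})

mask = (
    toy["year"].between(1990, 1999)      # inclusive on both ends
    & (toy["genre"] == "Pop")
    & (toy["lyrics"] != "[Instrumental]")
)

# Drop only rows whose lyrics are missing.
toy_lyrics = toy[mask].dropna(subset=["lyrics"])["lyrics"].reset_index(drop=True)
print(toy_lyrics.tolist())
```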

In [5]:
# trim all the extra data. We only want the lyrics
raw_lyrics = cleaned_data['lyrics']

In [6]:
# reindex the lyrics to make it easier to work with
reindexed_lyrics = raw_lyrics.reset_index(drop=True)

In [7]:
# lowercase the lyrics to make them easier to work with
formatted_lyrics = reindexed_lyrics.str.lower()
formatted_lyrics.head(10)

0    come they told me, pa rum pum pum pum\na new b...
1    over the ground lies a mantle, white\na heaven...
2    i just came back from a lovely trip along the ...
3    i'm dreaming of a white christmas\njust like t...
4    just hear those sleigh bells jingle-ing, ring-...
5    little rump shaker she can really shake and ba...
6    girl you want to sex me\ngirl, why don't you l...
7    oooh, tonight i want to turn the lights down l...
8    so you say he let you on, you'll never give yo...
9    something about you baby\nthat makes me wanna ...
Name: lyrics, dtype: object

In [8]:
# examine the number of song lyrics we have
n_formatted_lyrics = formatted_lyrics.shape[0]
print(n_formatted_lyrics)

964


In [9]:
# split each song's lyrics on '\n' into a list of lines
# and collect one list of lines per song in lyrics_lines
lyrics_lines = [song.split('\n') for song in formatted_lyrics]

In [10]:
# flatten the previous into a list of song lyrics lines
flattened_lyrics_lines = [line for song in lyrics_lines for line in song]
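The split-and-flatten step can also be written in one pass. The sketch below uses `itertools.chain.from_iterable` on invented lyrics; skipping blank lines is an extra assumption, not something the notebook does.

```python
from itertools import chain

songs = ["first line\nsecond line\n\nthird line", "only line"]  # invented

# Split every song on newlines and flatten into one list of lines.
all_lines = list(chain.from_iterable(song.split("\n") for song in songs))

# Optional: drop lines that are empty after stripping whitespace.
non_blank = [line for line in all_lines if line.strip()]

print(len(all_lines), len(non_blank))
```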

In [11]:
# examine the resulting number of song lyrics lines we have
print(len(flattened_lyrics_lines))
print(flattened_lyrics_lines[0])

35188
come they told me, pa rum pum pum pum


## Filter out non-English lyrics

In [12]:
# characters allowed in a line: space, apostrophe, a-z and 0-9
char_set = [' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [17]:
from keras.preprocessing.text import text_to_word_sequence

english_lyrics_lines = []

for line in flattened_lyrics_lines:
    # lowercase, strip punctuation (the apostrophe is kept) and split into words
    line_split = text_to_word_sequence(line, filters=r'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~')
    char_split = list(" ".join(line_split))
    # keep the line only if every character is in char_set
    if all(char in char_set for char in char_split):
        english_lyrics_lines.append(char_split)
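The same filter can be sketched without Keras, which may help when checking it in isolation. Here `normalize` and `is_allowed` are hypothetical helpers: punctuation is replaced with spaces using `str.translate` and whitespace is collapsed, which approximates (but is not guaranteed to match exactly) what `text_to_word_sequence` does.

```python
import string

ALLOWED = set(" '" + string.ascii_lowercase + string.digits)
# Same punctuation as the filters string above; the apostrophe is kept.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def normalize(line):
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    return " ".join(line.lower().translate(TO_SPACE).split())


def is_allowed(line, allowed=ALLOWED):
    """True if every character of the line is in the allowed set."""
    return set(line) <= allowed


lines = ["Come, they told me!", "ce rêve bleu", "don't stop"]
kept = [list(normalize(l)) for l in lines if is_allowed(normalize(l))]
print(["".join(chars) for chars in kept])
```

Using a `set` for the allowed characters also makes each membership check constant time, which matters little at this corpus size but keeps the intent clear.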

In [18]:
# examine the resulting number of song lyrics lines we have
print(len(english_lyrics_lines))
print(english_lyrics_lines[0])

33659
['c', 'o', 'm', 'e', ' ', 't', 'h', 'e', 'y', ' ', 't', 'o', 'l', 'd', ' ', 'm', 'e', ' ', 'p', 'a', ' ', 'r', 'u', 'm', ' ', 'p', 'u', 'm', ' ', 'p', 'u', 'm', ' ', 'p', 'u', 'm']


## Extract the subset we are interested in

In [47]:
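# draw a random sample of lines for our training and validation examples
# (call it through the module so a later variable named `sample` cannot shadow it)
import random

n_training = 700
n_validation = 300

examples = random.sample(english_lyrics_lines, n_training + n_validation)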

print(len(examples))
print(examples[0])

1000
['f', 'a', 't', 'e', ' ', 's', 't', 'e', 'p', 's', ' ', 'i', 'n', ' ', 'a', 'n', 'd', ' ', 's', 'e', 'e', 's', ' ', 'y', 'o', 'u', ' ', 't', 'h', 'r', 'o', 'u', 'g', 'h']


In [48]:
# determine the length of the longest line
max_char_n = max(len(line) for line in examples)

In [49]:
print(max_char_n)

70


In [50]:
# flatten chars
flat_chars = [item for sublist in examples for item in sublist]

# dedup list
chars = list(set(flat_chars))

# inspect the character set (the <EOS> terminator has not been added to it)
print(chars)

['i', 'f', 'l', 'x', 's', 'g', '2', 'c', 'm', 'q', 'v', '8', '9', 'w', 'h', 'y', 'd', 'n', 'a', 'z', ' ', 'p', '0', 'b', 'j', 'e', '1', 'r', 't', 'k', 'o', 'u', "'"]


In [51]:
# determine the number of characters in our set
n_chars = len(chars)
print(n_chars)

33


In [52]:
# create lookup dictionaries between characters and indices
char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print(ix_to_char)
print(char_to_ix)

{0: ' ', 1: "'", 2: '0', 3: '1', 4: '2', 5: '8', 6: '9', 7: 'a', 8: 'b', 9: 'c', 10: 'd', 11: 'e', 12: 'f', 13: 'g', 14: 'h', 15: 'i', 16: 'j', 17: 'k', 18: 'l', 19: 'm', 20: 'n', 21: 'o', 22: 'p', 23: 'q', 24: 'r', 25: 's', 26: 't', 27: 'u', 28: 'v', 29: 'w', 30: 'x', 31: 'y', 32: 'z'}
{' ': 0, "'": 1, '0': 2, '1': 3, '2': 4, '8': 5, '9': 6, 'a': 7, 'b': 8, 'c': 9, 'd': 10, 'e': 11, 'f': 12, 'g': 13, 'h': 14, 'i': 15, 'j': 16, 'k': 17, 'l': 18, 'm': 19, 'n': 20, 'o': 21, 'p': 22, 'q': 23, 'r': 24, 's': 25, 't': 26, 'u': 27, 'v': 28, 'w': 29, 'x': 30, 'y': 31, 'z': 32}


## Training and Validation Datasets

In [53]:
import numpy as np

# create training input
X_training = np.zeros((n_training, max_char_n, n_chars), dtype='float32')
X_training.shape

(700, 70, 33)

In [54]:
# fill the training input with character sequences, where each character is one-hot encoded
for li, line in enumerate(examples[:n_training]):
    for ci, char in enumerate(line):
        index = char_to_ix[char]
        X_training[li, ci, index] = 1

In [55]:
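# create training output: copy the input and append one all-zero time step
# (np.pad keeps each example aligned; np.resize would refill from the flattened data)
Y_training = np.pad(X_training, ((0, 0), (0, 1), (0, 0)), mode='constant')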
Y_training.shape

(700, 71, 33)
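When a time step is added to an array like this, it is worth checking what actually lands in the new slot. In the toy sketch below, `np.resize` refills the larger shape from the flattened data, so the extra step of example 0 is taken from example 1, while `np.pad` appends an all-zero step. The tiny array is invented for the demonstration.

```python
import numpy as np

# 2 examples, 3 time steps, 2 features
X = np.arange(12, dtype="float32").reshape(2, 3, 2)

resized = np.resize(X, (2, 4, 2))
padded = np.pad(X, ((0, 0), (0, 1), (0, 0)), mode="constant")

# np.resize: the 4th step of example 0 is the 1st step of example 1.
print(resized[0, 3], X[1, 0])

# np.pad: the 4th step is zeros and the first 3 steps are untouched.
print(padded[0, 3], np.array_equal(padded[:, :3], X))
```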

In [56]:
# # outputs need to end with the end of sequence character
# for li, line in enumerate(X_training):
#     spaceCounter = 0
#     for ci, char in enumerate(line):
#         if np.all(Y_training[li][ci] == 0):
#             spaceCounter += 1
#         if spaceCounter > 1:
#             Y_training[li][ci-1][-1] = 1.0

In [57]:
# create validation input
X_validation = np.zeros((n_validation, max_char_n, n_chars), dtype='float32')
X_validation.shape

(300, 70, 33)

In [58]:
# fill the validation input the same way, using the examples that follow the training slice
for li, line in enumerate(examples[n_training:n_training + n_validation]):
    for ci, char in enumerate(line):
        index = char_to_ix[char]
        X_validation[li, ci, index] = 1
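Slice bounds for the two splits are easy to get wrong, because the end of a slice is a position, not a length. A small hypothetical helper such as `split_examples` makes the intent explicit; the integer list stands in for the encoded lines.

```python
def split_examples(examples, n_training, n_validation):
    """Return (training, validation) as consecutive, non-overlapping slices."""
    if n_training + n_validation > len(examples):
        raise ValueError("not enough examples for the requested split")
    training = examples[:n_training]
    validation = examples[n_training:n_training + n_validation]
    return training, validation


toy_examples = list(range(10))
train, val = split_examples(toy_examples, 7, 3)
print(len(train), len(val))

# For comparison, a slice whose end is before its start is empty.
print(toy_examples[7:3])
```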

In [59]:
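# create validation output: copy the input and append one all-zero time step
# (np.pad keeps each example aligned; np.resize would refill from the flattened data)
Y_validation = np.pad(X_validation, ((0, 0), (0, 1), (0, 0)), mode='constant')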
Y_validation.shape

(300, 71, 33)

In [60]:
# # outputs need to end with the end of sequence character
# for li, line in enumerate(X_validation):
#     spaceCounter = 0
#     for ci, char in enumerate(line):
#         if np.all(Y_validation[li][ci] == 0):
#             spaceCounter += 1
#         if spaceCounter > 1:
#             Y_validation[li][ci-1][-1] = 1.0

## Validate Dataset

In [64]:
x_training_string = []
for woh in X_training[0]:
    max_idx = np.argmax(woh)
    x_training_string.append(ix_to_char[max_idx])
x_training_string_formatted = "".join(x_training_string)
print(x_training_string_formatted)

fate steps in and sees you through                                    


In [65]:
y_training_string = []
for woh in Y_training[0]:
    max_idx = np.argmax(woh)
    y_training_string.append(ix_to_char[max_idx])
y_training_string_formatted = "".join(y_training_string)
print(y_training_string_formatted)

fate steps in and sees you through                                    
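Because `np.argmax` of an all-zero row is 0, padded time steps decode to the character at index 0 (a space in this vocabulary), which is where the trailing whitespace comes from. A decoding helper that stops at the first all-zero row avoids it. The `decode_one_hot` function and the toy vocabulary below are assumptions for illustration.

```python
import numpy as np


def decode_one_hot(matrix, ix_to_char):
    """Decode one-hot rows into a string, stopping at the first padding row."""
    chars = []
    for row in matrix:
        if not row.any():  # an all-zero row marks padding
            break
        chars.append(ix_to_char[int(np.argmax(row))])
    return "".join(chars)


toy_ix_to_char = {0: " ", 1: "a", 2: "b"}
toy = np.zeros((5, 3), dtype="float32")
for step, ix in enumerate([1, 0, 2]):  # "a b", then two padding rows
    toy[step, ix] = 1.0

print(repr(decode_one_hot(toy, toy_ix_to_char)))
```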


## Model

In [66]:
from keras import backend as K
import os

# to use GPU
os.environ["CUDA_VISIBLE_DEVICES"]="0"

# verify that a gpu is listed
K.tensorflow_backend._get_available_gpus()

[]

In [67]:
from keras.models import Model
from keras.layers import Dense, Input, LSTM
from keras.optimizers import RMSprop

In [104]:
model_input = Input(shape=(None, n_chars))
x = LSTM(n_chars, return_sequences=True)(model_input)
x = Dense(n_chars, activation='softmax')(x)

In [103]:
model = Model(inputs=model_input, outputs=x)

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [100]:
from keras.callbacks import EarlyStopping, TensorBoard
from datetime import datetime

timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
log_dir = 'logs/{}'.format(timestamp)

early = EarlyStopping(monitor='val_acc',
                      min_delta=0,
                      patience=10,
                      verbose=1,
                      mode='auto')

In [101]:
# the same array is passed as input and target,
# so the model is trained to reproduce its input
model.fit(X_training,
          X_training,
          batch_size=50,
          epochs=50,
          validation_data=(X_validation, X_validation),
          callbacks=[early, TensorBoard(log_dir=log_dir)])

ValueError: Error when checking target: expected dense_3 to have 2 dimensions, but got array with shape (700, 70, 33)
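For next-character prediction, a common setup is to make the target the input shifted one step to the left and ending in `<EOS>`, so that inputs and targets have the same number of time steps and line up with a layer that returns one output per step. The sketch below builds such a pair for one line. `make_pair`, the toy vocabulary, and the `<EOS>` index are assumptions, and this is not the notebook's training setup, which passes `X_training` as both input and target.

```python
import numpy as np


def make_pair(line, char_to_ix, eos_ix, max_len):
    """Return (x, y) one-hot arrays where y[t] is the character after x[t]."""
    n_symbols = len(char_to_ix) + 1  # +1 for <EOS>
    x = np.zeros((max_len, n_symbols), dtype="float32")
    y = np.zeros((max_len, n_symbols), dtype="float32")
    indices = [char_to_ix[ch] for ch in line]
    targets = indices[1:] + [eos_ix]  # shift left, end with <EOS>
    for t, (src, tgt) in enumerate(zip(indices, targets)):
        x[t, src] = 1.0
        y[t, tgt] = 1.0
    return x, y


toy_char_to_ix = {" ": 0, "a": 1, "l": 2}
EOS_IX = len(toy_char_to_ix)  # assumed: <EOS> takes the last index
x, y = make_pair("la la", toy_char_to_ix, EOS_IX, max_len=8)

print(x.shape, y.shape)
print(x.argmax(axis=1)[:5], y.argmax(axis=1)[:5])
```

With targets built this way, the last real time step of every line teaches the model when to emit `<EOS>`.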

## Make a prediction

In [84]:
new_sample = 'sweet dreams are made of '

In [85]:
# convert new_sample to a sequence of one-hot encoded chars
line_split = text_to_word_sequence(new_sample, filters=r'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~')
char_split = list(" ".join(line_split))
n_sample_chars = len(char_split)

sample = np.zeros((1, n_sample_chars, n_chars), dtype='float32')

for ci, char in enumerate(char_split):
    index = char_to_ix[char]
    sample[0][ci][index] = 1

In [86]:
prediction = model.predict(sample)

In [87]:
# take the max of each...
string_prediction = []
for p in prediction[0]:
    max_p = np.argmax(p)
    string_prediction.append(ix_to_char[max_p])

In [88]:
formatted_prediction = "".join(string_prediction)

In [89]:
print(formatted_prediction)

sweet dreams are made of


## Generate a sequence from a sequence

In [90]:
x = sample

In [91]:
# take the max of each...
string_prediction = []
for p in x[0]:
    max_p = np.argmax(p)
    string_prediction.append(ix_to_char[max_p])
formatted_prediction = "".join(string_prediction)
print(formatted_prediction)

sweet dreams are made of


In [92]:
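# repeatedly feed the model's output back in as the next input,
# growing the sequence by one (initially all-zero) time step per iteration
for i in range(500):
    prediction = model.predict(x, verbose=0)
    x = np.zeros((1, prediction.shape[1] + 1, n_chars), dtype='float32')
    x[0, :prediction.shape[1], :] = prediction[0]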

In [93]:
# take the max of each...
string_prediction = []
for p in x[0]:
    max_p = np.argmax(p)
    string_prediction.append(ix_to_char[max_p])
formatted_prediction = "".join(string_prediction)
print(formatted_prediction)

sweet dreams are made of  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo  oo    
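A more conventional generation loop picks one character per step, appends it to the text, and stops at `<EOS>` or a length cap, rather than feeding the probabilities back in. The sketch below shows that loop with a stand-in `predict_next` function in place of the trained model. `generate`, `predict_next`, and the `<EOS>` token are all hypothetical names, and the stub distribution is invented.

```python
import random

EOS = "<EOS>"  # assumed end-of-sequence token


def generate(seed, predict_next, vocab, max_len=70, rng=None):
    """Sample characters one at a time until <EOS> or max_len is reached.

    predict_next(text) must return one probability per entry in vocab.
    """
    rng = rng or random.Random()
    text = seed
    while len(text) < max_len:
        probs = predict_next(text)
        next_char = rng.choices(vocab, weights=probs, k=1)[0]
        if next_char == EOS:
            break
        text += next_char
    return text


# Stub model: always continues with the letters of "this", then <EOS>.
toy_vocab = ["t", "h", "i", "s", EOS]
seed_text = "sweet dreams are made of "


def stub_predict_next(text):
    done = len(text) - len(seed_text)
    target = toy_vocab[min(done, len(toy_vocab) - 1)]
    return [1.0 if ch == target else 0.0 for ch in toy_vocab]


print(generate(seed_text, stub_predict_next, toy_vocab))
```

Sampling from the distribution instead of always taking the argmax is one way to reduce repetitive output, though the result still depends on how the model was trained.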
