# Generating 90s Pop Lyrics at the Character Level

## Goal
Generate 1 line of lyrics in the style of 90s Pop.

## Problem Formulation
X: Example lines of lyrics, one-hot encoded per character, with shape (n_examples, max_length, n_characters)
Y: A generated sequence of characters that ends with `<EOS>`, with shape (n_examples, max_length + 1, n_characters)

`<EOS>` is a special character in the vocabulary that tells the model it can stop predicting. This notebook uses the newline character `\n` for it.
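To make the shapes concrete, here is a minimal sketch of one-hot encoding a single line. The three-character vocabulary and the `one_hot_line` helper are illustrative assumptions, not the notebook's actual vocabulary or code.

```python
import numpy as np

# Toy vocabulary (assumption for illustration); "\n" stands in for <EOS>.
vocab = ["h", "i", "\n"]
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot_line(line, max_length):
    """Encode one line as a (max_length, n_characters) one-hot matrix.

    The end-of-sequence marker is appended before encoding; unused
    timesteps stay all-zero (padding).
    """
    encoded = np.zeros((max_length, len(vocab)), dtype="float32")
    for t, ch in enumerate(line + "\n"):
        encoded[t, char_to_ix[ch]] = 1.0
    return encoded

# A batch holding a single example: (n_examples, max_length, n_characters)
X = np.stack([one_hot_line("hi", max_length=4)])
print(X.shape)  # (1, 4, 3)
```

Each timestep has exactly one `1.0` at the index of its character, and timesteps past the terminator remain zero.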

## Methodology
To accomplish this, we need:
1. Dataset: A corpus of 90s Pop lyrics
2. Vocabulary: A set of characters which will be used for generating lyrics
3. Model: A model which can encode the probability of the next character given a sequence of characters
4. Generate Lyrics: Use the model and an input to generate new lyrics

## Extract and Transform Raw Dataset

In [1]:
import pandas as pd

In [2]:
# load raw data file as a dataframe
raw_data = pd.read_csv('data/raw.csv')

In [3]:
# filter for only lyrics from the 1990s, of the pop genre, and not instrumentals
mask = (raw_data['year'] > 1989) & (raw_data['year'] < 2000) & (raw_data['genre'] == 'Pop') & (raw_data['lyrics'] != '[Instrumental]')
filtered_data = raw_data[mask]

In [4]:
# remove any that have null values
cleaned_data = filtered_data.dropna()

In [5]:
# trim all the extra data. We only want the lyrics
raw_lyrics = cleaned_data['lyrics']

In [6]:
# reindex the lyrics to make it easier to work with
reindexed_lyrics = raw_lyrics.reset_index(drop=True)

In [7]:
# lowercase the lyrics to make it easier to work with
formatted_lyrics = reindexed_lyrics[:].str.lower()
formatted_lyrics.head(10)

0    come they told me, pa rum pum pum pum\na new b...
1    over the ground lies a mantle, white\na heaven...
2    i just came back from a lovely trip along the ...
3    i'm dreaming of a white christmas\njust like t...
4    just hear those sleigh bells jingle-ing, ring-...
5    little rump shaker she can really shake and ba...
6    girl you want to sex me\ngirl, why don't you l...
7    oooh, tonight i want to turn the lights down l...
8    so you say he let you on, you'll never give yo...
9    something about you baby\nthat makes me wanna ...
Name: lyrics, dtype: object

In [8]:
# examine the number of song lyrics we have
formatted_lyrics.shape

(964,)

In [9]:
# split each song's lyrics on \n into a list of lines
# and collect those lists in lyrics_lines
lyrics_lines = []

for i in range(len(formatted_lyrics)):
    lyrics = formatted_lyrics[i].split('\n')
    lyrics_lines.append(lyrics)

In [10]:
## flatten the previous into a list of song lyrics lines
flattened_lyrics_lines = [line for song in lyrics_lines for line in song]

In [11]:
## examine the resulting number of song lyrics lines we have
print(len(flattened_lyrics_lines))
print(flattened_lyrics_lines[0])

35188
come they told me, pa rum pum pum pum
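The split and flatten steps above can also be written as a single comprehension. The sketch below uses hypothetical toy data in place of `formatted_lyrics`, and it additionally drops blank lines, which is an assumption about what makes a useful training example rather than something the notebook does.

```python
# Toy stand-in for the lowercased song lyrics (hypothetical data).
songs = [
    "first line\nsecond line\n\nthird line",
    "another song\n  \nlast line",
]

# splitlines() handles both "\n" and "\r\n"; blank lines are skipped.
lines = [
    line.strip()
    for song in songs
    for line in song.splitlines()
    if line.strip()
]

print(len(lines))  # 5
```

Filtering blank lines keeps empty strings (for example, stanza breaks) out of the pool that examples are later sampled from.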


## Extract the subset we are interested in

In [12]:
# draw a random sample of lines to use as our examples
from random import sample

n_training = 700
n_validation = 300

examples = sample(flattened_lyrics_lines, n_training + n_validation)

print(len(examples))
print(examples[0])

1000
me llor y le cont
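Because `sample` draws a different subset on every run, results are hard to reproduce. A minimal sketch of a seeded draw with an explicit train/validation split follows; the seed value, the toy `all_lines` list, and the small sizes are assumptions for illustration.

```python
import random

# Hypothetical stand-in for flattened_lyrics_lines.
all_lines = ["line {}".format(i) for i in range(50)]

n_training = 7
n_validation = 3

# A dedicated, seeded generator makes the draw repeatable.
rng = random.Random(42)
examples = rng.sample(all_lines, n_training + n_validation)

# sample() draws positions without replacement, so slicing gives
# two sets that do not share any drawn position.
training_lines = examples[:n_training]
validation_lines = examples[n_training:n_training + n_validation]

print(len(training_lines), len(validation_lines))  # 7 3
```

Note that duplicate lines in the source (a repeated chorus, say) can still appear in both splits, since only positions are drawn without replacement.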


In [13]:
# process each line into a list of characters
# and track the length of the longest line
from keras.preprocessing.text import text_to_word_sequence
import numpy as np

max_char_n = 0
chars = []

for line in examples:
    line_split = text_to_word_sequence(line, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~')
    char_split = list(" ".join(line_split))
    chars.append(char_split)
    char_n = len(char_split)
    if char_n > max_char_n:
        max_char_n = char_n



In [14]:
max_char_n

94
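The character-splitting step can be approximated without Keras. The sketch below is intended to mimic what `text_to_word_sequence` does with the arguments used above (lowercase, replace filtered characters with spaces, split on whitespace); treat that equivalence as an assumption, and `line_to_chars` as a hypothetical helper name.

```python
# The same characters the notebook passes as `filters`.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
# Map every filtered character to a space.
_TO_SPACE = str.maketrans({ch: " " for ch in FILTERS})

def line_to_chars(line):
    """Lowercase, strip punctuation, collapse whitespace, split into chars."""
    words = line.lower().translate(_TO_SPACE).split()
    return list(" ".join(words))

lines = ["Come, they told me!", "pa rum pum pum pum"]
chars = [line_to_chars(line) for line in lines]
max_char_n = max(len(c) for c in chars)

print("".join(chars[0]))  # come they told me
print(max_char_n)         # 18
```

The apostrophe is not in the filter set, so contractions such as "i'm" keep their apostrophe, which matches the `'` entry in the vocabulary printed later.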

In [15]:
# flatten chars
flat_chars = [item for sublist in chars for item in sublist]

# dedup list
chars = list(set(flat_chars))

# append our terminator
chars.append("\n")
print(chars)

['d', 'x', 'r', 't', ' ', 'z', '2', 'u', 'f', '±', '¨', 'y', 'i', '1', "'", '8', 'q', 'b', '\xad', '\xa0', '6', '0', 'e', 'ª', 'ã', 'a', 'o', 'p', 'j', 'º', '¹', 'k', 'h', '¥', 'n', '³', '§', 'l', 'w', 'g', '9', '¶', '¤', '\x80', 'c', '\x87', 'v', 's', '¡', '©', 'm', '\n']


In [16]:
# determine number of characters in our set
n_chars = len(chars)
print(n_chars)

52


In [17]:
# create dictionaries
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
print(ix_to_char)
print(char_to_ix)

{0: 'd', 1: 'x', 2: 'r', 3: 't', 4: ' ', 5: 'z', 6: '2', 7: 'u', 8: 'f', 9: '±', 10: '¨', 11: 'y', 12: 'i', 13: '1', 14: "'", 15: '8', 16: 'q', 17: 'b', 18: '\xad', 19: '\xa0', 20: '6', 21: '0', 22: 'e', 23: 'ª', 24: 'ã', 25: 'a', 26: 'o', 27: 'p', 28: 'j', 29: 'º', 30: '¹', 31: 'k', 32: 'h', 33: '¥', 34: 'n', 35: '³', 36: '§', 37: 'l', 38: 'w', 39: 'g', 40: '9', 41: '¶', 42: '¤', 43: '\x80', 44: 'c', 45: '\x87', 46: 'v', 47: 's', 48: '¡', 49: '©', 50: 'm', 51: '\n'}
{'d': 0, 'x': 1, 'r': 2, 't': 3, ' ': 4, 'z': 5, '2': 6, 'u': 7, 'f': 8, '±': 9, '¨': 10, 'y': 11, 'i': 12, '1': 13, "'": 14, '8': 15, 'q': 16, 'b': 17, '\xad': 18, '\xa0': 19, '6': 20, '0': 21, 'e': 22, 'ª': 23, 'ã': 24, 'a': 25, 'o': 26, 'p': 27, 'j': 28, 'º': 29, '¹': 30, 'k': 31, 'h': 32, '¥': 33, 'n': 34, '³': 35, '§': 36, 'l': 37, 'w': 38, 'g': 39, '9': 40, '¶': 41, '¤': 42, '\x80': 43, 'c': 44, '\x87': 45, 'v': 46, 's': 47, '¡': 48, '©': 49, 'm': 50, '\n': 51}
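The index assigned to each character above depends on the iteration order of a `set`, which for strings can differ between Python sessions. A minimal sketch of a deterministic variant follows; sorting the vocabulary and the `eos_ix` name are assumptions added for illustration.

```python
EOS = "\n"  # terminator, as in the notebook

# Hypothetical tokenized lines.
chars = [list("me too"), list("to me")]

# Sorting makes the index assignment deterministic across runs.
vocab = sorted({ch for line in chars for ch in line})
vocab.append(EOS)

char_to_ix = {ch: i for i, ch in enumerate(vocab)}
ix_to_char = {i: ch for ch, i in char_to_ix.items()}
eos_ix = char_to_ix[EOS]

print(vocab)   # [' ', 'e', 'm', 'o', 't', '\n']
print(eos_ix)  # 5
```

A stable mapping matters once a trained model is saved and reloaded, because the model's output units are tied to these indices.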


## Training and Validation Datasets

In [18]:
# create training input
X_training = np.zeros((n_training, max_char_n, n_chars), dtype='float32')
X_training.shape

(700, 94, 52)

In [19]:
# fill the training input with character sequences, one-hot encoded per character
for li, line in enumerate(examples[:n_training]):
    line_split = text_to_word_sequence(line, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~')
    char_split = list(" ".join(line_split))
    for ci, char in enumerate(char_split):
        index = char_to_ix[char]
        X_training[li][ci][index] = 1

In [20]:
# create training output: one extra timestep leaves room for the terminator
Y_training = np.zeros((n_training, max_char_n + 1, n_chars), dtype='float32')
Y_training[:, :max_char_n, :] = X_training
Y_training.shape

(700, 95, 52)
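One thing to keep in mind when growing an array along its middle axis: `np.resize` fills the new array from the flattened data, so rows shift between examples, whereas allocating zeros and copying keeps each example aligned. The tiny arrays below are toy values chosen so the difference is easy to see.

```python
import numpy as np

# Two examples, two timesteps, one feature (toy values for readability).
X = np.arange(1, 5, dtype="float32").reshape(2, 2, 1)

# np.resize refills from the flattened data, so example boundaries move.
resized = np.resize(X, (2, 3, 1))

# Allocating zeros and copying keeps each example in place and leaves
# the extra timestep empty.
padded = np.zeros((2, 3, 1), dtype="float32")
padded[:, :X.shape[1], :] = X

print(resized[:, :, 0])  # rows: [1, 2, 3] and [4, 1, 2]
print(padded[:, :, 0])   # rows: [1, 2, 0] and [3, 4, 0]
```

In the resized array the first example has absorbed a timestep from the second, while the padded array keeps every line intact with a free final slot for the terminator.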

In [21]:
# outputs need to end with the end-of-sequence character
# assumption: mark the terminator (last vocabulary index) on the final
# timestep of every line, mirroring the validation cell below
for li in range(n_training):
    Y_training[li][-1][-1] = 1.0

In [22]:
# create validation input
X_validation = np.zeros((n_validation, max_char_n, n_chars), dtype='float32')
X_validation.shape

(300, 94, 52)

In [23]:
# fill the validation input with the lines that follow the training slice
for li, line in enumerate(examples[n_training:n_training + n_validation]):
    line_split = text_to_word_sequence(line, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~')
    char_split = list(" ".join(line_split))
    for ci, char in enumerate(char_split):
        index = char_to_ix[char]
        X_validation[li][ci][index] = 1

In [24]:
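# create validation output: one extra timestep leaves room for the terminator
Y_validation = np.zeros((n_validation, max_char_n + 1, n_chars), dtype='float32')
Y_validation[:, :max_char_n, :] = X_validation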
Y_validation.shape

(300, 95, 52)

In [25]:
# outputs need to end with the end-of-sequence character
for li, line in enumerate(X_validation):
    Y_validation[li][-1][-1] = 1.0

## Validate the Data

In [26]:
x_training_string = []
for woh in X_training[0]:
    max_idx = np.argmax(woh)
    x_training_string.append(ix_to_char[max_idx])
x_training_string_formatted = "".join(x_training_string)
print(x_training_string_formatted)

me llor y le contddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd


In [27]:
y_training_string = []
for woh in Y_training[0]:
    max_idx = np.argmax(woh)
    y_training_string.append(ix_to_char[max_idx])
y_training_string_formatted = "".join(y_training_string)
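# note: this prints the X string again, not y_training_string_formatted,
# so the output below matches the previous cell
print(x_training_string_formatted)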

me llor y le contddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
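The run of trailing `d` characters comes from padding: `np.argmax` of an all-zero row returns 0, and index 0 in this vocabulary happens to be `'d'`. A minimal sketch of a decoder that skips padding rows follows; the three-character vocabulary and the `decode` helper are assumptions for illustration.

```python
import numpy as np

ix_to_char = {0: "d", 1: "m", 2: "e"}  # toy vocabulary (assumption)

# "me" followed by two all-zero padding timesteps.
encoded = np.zeros((4, 3), dtype="float32")
encoded[0, 1] = 1.0
encoded[1, 2] = 1.0

def decode(one_hot, skip_padding=True):
    """Turn a (timesteps, n_chars) one-hot matrix back into a string."""
    out = []
    for row in one_hot:
        if skip_padding and not row.any():
            continue  # an all-zero row carries no character
        out.append(ix_to_char[int(np.argmax(row))])
    return "".join(out)

print(decode(encoded, skip_padding=False))  # medd
print(decode(encoded))                      # me
```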


## Model

In [28]:
from keras import backend as K
import os

# to use GPU
os.environ["CUDA_VISIBLE_DEVICES"]="0"

# verify that a gpu is listed
K.tensorflow_backend._get_available_gpus()

[]

In [29]:
from keras.models import Model
from keras.layers import Dense, Input, LSTM
from keras.optimizers import RMSprop

In [30]:
model_input = Input(shape=(None, n_chars))
x = LSTM(n_chars, return_sequences=True)(model_input)
x = Dense(n_chars, activation='softmax')(x)

In [31]:
model = Model(inputs=model_input, outputs=x)

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [32]:
from keras.callbacks import EarlyStopping, TensorBoard
from datetime import datetime

timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
log_dir = 'logs/{}'.format(timestamp)

early = EarlyStopping(monitor='val_acc',
                      min_delta=0,
                      patience=10,
                      verbose=1,
                      mode='auto')

In [33]:
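# note: the inputs are also passed as the targets here, so the model is
# trained to reproduce its input rather than to predict the next character
model.fit(X_training,
          X_training,
          batch_size=50,
          epochs=50,
          validation_data=(X_validation, X_validation),
          callbacks=[early, TensorBoard(log_dir=log_dir)])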

Train on 700 samples, validate on 300 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 00011: early stopping


<keras.callbacks.History at 0x7f7997dd1908>
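For comparison, next-character prediction is usually set up by pairing each position with the character that follows it. The sketch below shows one common way to build such pairs for a single line; `next_char_pairs` is a hypothetical helper, and this is not what the training cell above does.

```python
EOS = "\n"  # terminator, as in the notebook

def next_char_pairs(line):
    """Return (inputs, targets) where targets[t] is the char after inputs[t]."""
    sequence = list(line) + [EOS]
    return sequence[:-1], sequence[1:]

inputs, targets = next_char_pairs("hey")
for src, dst in zip(inputs, targets):
    print(repr(src), "->", repr(dst))
# 'h' -> 'e'
# 'e' -> 'y'
# 'y' -> '\n'
```

With this layout the inputs and targets have the same number of timesteps, so both can be one-hot encoded into arrays of identical shape, and the last target of every line is the terminator.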

## Make a prediction

In [34]:
new_sample = 'sweet dreams are made of these'

In [35]:
# convert new_sample to a sequence of one-hot encoded chars
line_split = text_to_word_sequence(new_sample, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~')
char_split = list(" ".join(line_split))
n_sample_chars = len(char_split)

sample = np.zeros((1, n_sample_chars, n_chars), dtype='float32')

for ci, char in enumerate(char_split):
    index = char_to_ix[char]
    sample[0][ci][index] = 1

In [36]:
prediction = model.predict(sample)

In [37]:
# take the most likely character at each timestep
string_prediction = []
for p in prediction[0]:
    max_p = np.argmax(p)
    string_prediction.append(ix_to_char[max_p])

In [38]:
formatted_prediction = "".join(string_prediction)

In [39]:
print(formatted_prediction)

sweet dreams are made of these


## Generate a sequence from a sequence

In [40]:
x = sample

In [41]:
# take the most likely character at each timestep
string_prediction = []
for p in x[0]:
    max_p = np.argmax(p)
    string_prediction.append(ix_to_char[max_p])
formatted_prediction = "".join(string_prediction)
print(formatted_prediction)

sweet dreams are made of these


In [42]:
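# repeatedly feed the model its own output, growing the sequence by one
# (initially all-zero) timestep per iteration
for i in range(100):
    prediction = model.predict(x, verbose=0)
    x = np.zeros((1, prediction.shape[1] + 1, n_chars), dtype='float32')
    x[0, :prediction.shape[1], :] = prediction[0]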

In [43]:
# take the most likely character at each timestep
string_prediction = []
for p in x[0]:
    max_p = np.argmax(p)
    string_prediction.append(ix_to_char[max_p])
formatted_prediction = "".join(string_prediction)
print(formatted_prediction)

sweet dreams are made of these gee ie  eee fee fee fee fee fee fee fee fee fee fee fee fee fee fee fee fee fee fee fee fee fee  ed
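The loop above always runs for 100 steps and feeds raw probabilities back into the model. An alternative, sketched below, commits to one character per step and stops as soon as the terminator is predicted. `toy_next_char_probs` is a hypothetical stand-in for a trained next-character model, and the three-character vocabulary is an assumption for illustration.

```python
import numpy as np

vocab = ["a", "b", "\n"]  # toy vocabulary (assumption)
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
eos_ix = char_to_ix["\n"]

def toy_next_char_probs(indices):
    """Hypothetical stand-in for a trained model: after 'a' predict 'b',
    after 'b' predict 'a', and end the line once it has six characters."""
    probs = np.zeros(len(vocab))
    if len(indices) >= 6:
        probs[eos_ix] = 1.0
    else:
        probs[1 - indices[-1]] = 1.0
    return probs

def generate(seed, next_char_probs, max_new_chars=100):
    """Greedy decoding: append the most likely char until EOS or the cap."""
    indices = [char_to_ix[ch] for ch in seed]
    for _ in range(max_new_chars):
        next_ix = int(np.argmax(next_char_probs(indices)))
        if next_ix == eos_ix:
            break
        indices.append(next_ix)
    return "".join(vocab[i] for i in indices)

print(generate("ab", toy_next_char_probs))  # ababab
```

Greedy decoding tends to produce repetitive text; sampling from the predicted distribution instead of taking the argmax is a common way to add variety.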
