# Generating 90s Pop Lyrics

## Goal
Generate 1 line of lyrics in the style of 90s Pop.

## Problem Formulation
X: A number of starting words for the model to use  
Y: A generated sequence of words that ends with `<EOS>`
    
`<EOS>` will be a special word in the vocabulary which the model will use to know that it can stop predicting.
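
As a concrete illustration, here is a minimal sketch of one X / Y pair. It follows the definition used later in the Model section, where Y is X followed by `<EOS>`. The helper name `make_pair` is hypothetical and the splitting is plain whitespace splitting, which is a simplification of the tokenization used in the notebook.

```python
EOS = '<EOS>'  # special end-of-sequence token


def make_pair(line):
    """Return (x, y) for one lyric line: x is the words, y is x plus <EOS>."""
    x = line.split()   # simplification: split on whitespace only
    y = x + [EOS]      # the target ends with the end-of-sequence token
    return x, y


x, y = make_pair('ill go where you lead me')
print(x)
print(y)
```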

## Methodology
To accomplish this, we need:
1. Dataset: A corpus of 90s Pop lyrics
2. Vocabulary: A set of words which will be used for generating lyrics
3. Model: A model which can encode the probability of the next word given a sequence of words
4. Generate Lyrics: Use the model and an input to generate new lyrics

## 1. Dataset
To build the dataset, I will start with [380000+ lyrics from MetroLyrics](https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics).

Then I will:
1. Filter for 90s Pop songs
2. Lower case all the lyrics
3. Split the lyrics into individual lines on the newline character (`\n`)

In [1]:
import pandas as pd

In [2]:
# load raw data file as a dataframe
raw_data = pd.read_csv('data/raw.csv')

In [3]:
# inspect 10 rows 
raw_data.head(10)

Unnamed: 0,index,song,year,artist,genre,lyrics
0,0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."
5,5,all-i-could-do-was-cry,2009,beyonce-knowles,Pop,I heard\nChurch bells ringing\nI heard\nA choi...
6,6,once-in-a-lifetime,2009,beyonce-knowles,Pop,This is just another day that I would spend\nW...
7,7,waiting,2009,beyonce-knowles,Pop,"Waiting, waiting, waiting, waiting\nWaiting, w..."
8,8,slow-love,2009,beyonce-knowles,Pop,[Verse 1:]\nI read all of the magazines\nwhile...
9,9,why-don-t-you-love-me,2009,beyonce-knowles,Pop,"N-n-now, honey\nYou better sit down and look a..."


In [4]:
# filter for only lyrics from the 1990s, of the pop genre, and not instrumentals
mask = (raw_data['year'] > 1989) & (raw_data['year'] < 2000) & (raw_data['genre'] == 'Pop') & (raw_data['lyrics'] != '[Instrumental]')
filtered_data = raw_data[mask]

In [5]:
# remove any that have null values
cleaned_data = filtered_data.dropna()

In [6]:
# examine the results
cleaned_data.head(10)

Unnamed: 0,index,song,year,artist,genre,lyrics
5132,5132,the-little-drummer-boy,1990,boston-pops,Pop,"Come they told me, pa rum pum pum pum\nA new b..."
5134,5134,winter-wonderland,1990,boston-pops,Pop,"Over the ground lies a mantle, white\nA heaven..."
5136,5136,santa-claus-is-comin-to-town,1990,boston-pops,Pop,I just came back from a lovely trip along the ...
5139,5139,white-christmas,1990,boston-pops,Pop,I'm dreaming of a white Christmas\nJust like t...
5142,5142,sleigh-ride,1990,boston-pops,Pop,"Just hear those sleigh bells jingle-ing, ring-..."
11828,11828,she-s-got-skillz,1994,all-4-one,Pop,Little rump shaker she can really shake and ba...
11829,11829,the-bomb,1994,all-4-one,Pop,"Girl you want to sex me\nGirl, why don't you l..."
11830,11830,breathless,1994,all-4-one,Pop,"oooh, tonight i want to turn the lights down l..."
11831,11831,down-to-the-last-drop,1994,all-4-one,Pop,"So you say he let you on, you'll never give yo..."
11832,11832,something-about-you,1994,all-4-one,Pop,Something about you baby\nThat makes me wanna ...


In [7]:
# trim all the extra data. We only want the lyrics
raw_lyrics = cleaned_data['lyrics']

In [8]:
# reindex the lyrics to make it easier to work with
reindexed_lyrics = raw_lyrics.reset_index(drop=True)

In [9]:
# lowercase the lyrics to make it easier to work with
formatted_lyrics = reindexed_lyrics[:].str.lower()
formatted_lyrics.head(10)

0    come they told me, pa rum pum pum pum\na new b...
1    over the ground lies a mantle, white\na heaven...
2    i just came back from a lovely trip along the ...
3    i'm dreaming of a white christmas\njust like t...
4    just hear those sleigh bells jingle-ing, ring-...
5    little rump shaker she can really shake and ba...
6    girl you want to sex me\ngirl, why don't you l...
7    oooh, tonight i want to turn the lights down l...
8    so you say he let you on, you'll never give yo...
9    something about you baby\nthat makes me wanna ...
Name: lyrics, dtype: object

In [10]:
# examine the number of song lyrics we have
formatted_lyrics.shape

(964,)

In [11]:
# split each song's lyrics on \n
# store each song as a list of lines
# collect those lists in lyrics_lines
lyrics_lines = [song.split('\n') for song in formatted_lyrics]

In [12]:
# flatten the previous into a list of song lyrics lines
flattened_lyrics_lines = [line for song in lyrics_lines for line in song]

In [13]:
# examine the resulting number of song lyrics lines we have
lyrics_lines_n = len(flattened_lyrics_lines)
print(lyrics_lines_n)
print(flattened_lyrics_lines[11201])

35188
ill go where you lead me


In [14]:
# save this as a new csv for the future.
import csv

with open('data/lyrics.csv', 'w', newline='') as file:
    wr = csv.writer(file, quoting=csv.QUOTE_ALL)
    wr.writerow(flattened_lyrics_lines)
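
Because `writerow` stores every lyric line as one field of a single CSV row, reading the file back means taking the first (and only) row. The sketch below shows that round trip on two toy lines, using an in-memory buffer instead of `data/lyrics.csv` so it runs on its own; the variable names are illustrative.

```python
import csv
import io

toy_lines = ['come they told me, pa rum pum pum pum', 'ill go where you lead me']

# write all lines as one quoted row, as the cell above does
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(toy_lines)

# read them back: the single row holds one field per lyric line
buffer.seek(0)
restored_lines = next(csv.reader(buffer))
print(restored_lines)
```

Quoting every field keeps commas inside a lyric line from being read as field separators.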

## 2. Vocabulary
To build the vocabulary, I will start with the 90s Pop lyric lines prepared in the dataset step (`flattened_lyrics_lines`).

Then I will:
1. Find the top 10000 words
2. Create a list of these words and `<EOS>`
3. Create a dictionary which can convert words to indices
4. Create a dictionary which can convert indices to words
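
For reference, the four steps above can be sketched without Keras using `collections.Counter`. This is an illustrative alternative on toy data, not the code this notebook runs; `toy_lines`, `top_n` and the other names are assumptions made for the example.

```python
from collections import Counter

toy_lines = ['i want you', 'i need you', 'you want me']

# 1. count the words and keep the most frequent ones
counts = Counter(word for line in toy_lines for word in line.split())
top_n = 3
toy_vocabulary = [word for word, _ in counts.most_common(top_n)]

# 2. add the end-of-sequence token
toy_vocabulary.append('<EOS>')

# 3. word -> index, and 4. index -> word
toy_word_to_index = {word: i for i, word in enumerate(toy_vocabulary)}
toy_index_to_word = {i: word for word, i in toy_word_to_index.items()}

print(toy_vocabulary)
```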

In [15]:
# use Keras' Tokenizer API to find the top 10000 words
from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(flattened_lyrics_lines)

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [16]:
# get the words and their occurrences
word_counts = t.word_counts

In [17]:
# get the words in descending order of occurrence
sorted_words = sorted(word_counts, reverse=True, key=word_counts.get)

In [18]:
# check to see how many words there are
len(sorted_words)

12004

In [19]:
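# create the vocabulary
# note: the 10000-word cap is not applied here; all 12004 words are kept
# copy the list so that adding `<EOS>` later does not modify sorted_words
vocabulary = list(sorted_words)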

In [20]:
# add the `<EOS>` token
vocabulary.append('<EOS>')

In [21]:
vocabulary_n = len(vocabulary)
print(vocabulary_n)

12005


In [22]:
# create a dictionary for converting from index to word
index_to_word = {}

for i, word in enumerate(vocabulary):
    index_to_word[i] = word

In [23]:
index_to_word[1]

'i'

In [24]:
# create a dictionary for converting from word to index
word_to_index = {}

for i, word in enumerate(vocabulary):
    word_to_index[word] = i

In [25]:
word_to_index['i']

1

In [26]:
# save this as a new csv for the future.
with open('data/vocabulary.csv', 'w', newline='') as file:
    wr = csv.writer(file, quoting=csv.QUOTE_ALL)
    wr.writerow(vocabulary)

## 3. Model
I will build a simple recurrent neural network with an LSTM. This model should:
1. Preprocess the input X, and Y*
    - X: a list of word indices
    - Y: X + `<EOS>`
2. Encode the inputs
3. Output a one-hot encoding Y^ which should be the last word of the sentence

In [47]:
# process lyrics into lists of word indices
# also determine the line with the greatest length
from keras.preprocessing.text import text_to_word_sequence
import numpy as np

max_line_n = 0
X = []

for line in flattened_lyrics_lines:
    line_split = text_to_word_sequence(line)
    line_n = len(line_split)
    if line_n > max_line_n:
        max_line_n = line_n
    # look up the vocabulary index of every word in the line
    indices = [word_to_index[word] for word in line_split]
    X.append(indices)

# the lines have different lengths, so store them as an array of lists
X = np.array(X, dtype=object)

In [29]:
max_line_n

158

In [36]:
X.shape

(35188,)

In [28]:
# add an end of line token to each X to create Y
Y = []

index_eos = word_to_index['<EOS>']

for indices in X:
    # build a new list so the lists stored in X are not modified
    Y.append(indices + [index_eos])
Y = np.array(Y, dtype=object)

In [30]:
from keras.models import Model
from keras.layers import Input, LSTM
from keras.optimizers import RMSprop

In [31]:
model_input = Input(shape=(max_line_n, vocabulary_n))

In [32]:
x = LSTM(10, return_sequences=True)(model_input)

In [33]:
model = Model(inputs=model_input, outputs=x)

In [34]:
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [35]:
model.fit(X, Y, batch_size=10, epochs=1)

ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (35188, 1)
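
The error above appears to be a shape mismatch: `Input(shape=(max_line_n, vocabulary_n))` expects a 3-D array of shape `(samples, max_line_n, vocabulary_n)`, while `X` is a 1-D array of index lists. One possible fix is to pad every line to a fixed length and one-hot encode each position. The sketch below shows that idea in plain Python on toy data; `one_hot_encode` is a hypothetical helper, and leaving padded positions as all-zero vectors is an assumption of this sketch.

```python
def one_hot_encode(sequences, max_len, vocab_size):
    """Pad each index list to max_len and one-hot encode every position.

    Padded positions are left as all-zero vectors (an assumption of this sketch).
    """
    encoded = []
    for indices in sequences:
        rows = []
        for position in range(max_len):
            row = [0] * vocab_size
            if position < len(indices):
                row[indices[position]] = 1  # mark the word at this position
            rows.append(row)
        encoded.append(rows)
    return encoded


# toy data: 2 lines, a 5-word vocabulary, lines padded to 4 positions
toy_X = [[1, 3], [0, 2, 4]]
toy_encoded = one_hot_encode(toy_X, max_len=4, vocab_size=5)
toy_shape = (len(toy_encoded), len(toy_encoded[0]), len(toy_encoded[0][0]))
print(toy_shape)  # (2, 4, 5)
```

At the sizes recorded above (35188 lines, 158 positions, 12005 words) a dense encoding would hold roughly 6.7 × 10^10 entries, so in practice it would probably have to be built in batches. The targets would need a matching 3-D shape as well, which also means the last dimension of the model output (currently 10 LSTM units) would have to agree with the encoded `Y`.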

## 4. Generate Lyrics
Finally, I can generate lyrics. To do so:
1. Get an input sequence
2. Encode it to indices
3. Run it through the model until we hit the `<EOS>`
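
The steps above can be sketched as a simple greedy loop. Since the model in this notebook is not yet trained, `toy_predictor` below is a hypothetical stand-in for it, and `generate_line`, `max_words` and the toy vocabulary are assumptions made for illustration. The loop stops when the end-of-sequence token `<EOS>` is predicted or a length limit is reached.

```python
def generate_line(seed_words, predict_next_index, word_to_index, index_to_word,
                  eos_token='<EOS>', max_words=20):
    """Extend seed_words one word at a time until <EOS> or max_words."""
    indices = [word_to_index[word] for word in seed_words]  # encode the input
    eos_index = word_to_index[eos_token]
    while len(indices) < max_words:
        next_index = predict_next_index(indices)
        if next_index == eos_index:
            break  # the model signalled the end of the line
        indices.append(next_index)
    return ' '.join(index_to_word[i] for i in indices)


# toy vocabulary and a stand-in predictor (not a trained model)
toy_vocab = ['i', 'love', 'you', '<EOS>']
toy_w2i = {word: i for i, word in enumerate(toy_vocab)}
toy_i2w = {i: word for i, word in enumerate(toy_vocab)}


def toy_predictor(indices):
    # always predicts the next vocabulary entry after the last word
    return min(indices[-1] + 1, len(toy_vocab) - 1)


generated = generate_line(['i'], toy_predictor, toy_w2i, toy_i2w)
print(generated)  # i love you
```

The `max_words` limit guards against a predictor that never emits `<EOS>`.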