# Using all folk songs and a complete pipeline to create a genre-level model

Here, I took what I learned from making the album-level and artist-level models and applied it to all folk songs in my dataset. I wrote a series of functions that load the data, clean it, and transform it into a format ready for the neural network. These functions can be chained together to load any group of songs from my dataset and start training a neural network on the lyrics. This notebook includes pip install cells so that it can be run on an AWS instance.

In [1]:
!pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-1.7.0-cp36-cp36m-manylinux1_x86_64.whl (48.0MB)
Collecting grpcio>=1.8.6 (from tensorflow)
  Downloading grpcio-1.10.0-cp36-cp36m-manylinux1_x86_64.whl (7.5MB)
Collecting gast>=0.2.0 (from tensorflow)
  Downloading gast-0.2.0.tar.gz
Collecting numpy>=1.13.3 (from tensorflow)
  Downloading numpy-1.14.2-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
Collecting astor>=0.6.0 (from tensorflow)
  Downloading astor-0.6.2-py2.py3-none-any.whl
Collecting tensorboard<1.8.0,>=1.7.0 (from tensorflow)
  Downloading tensorboard-1.7.0-py3-none-any.whl (3.1MB)
Collecting absl-py>=0.1.6 (from

In [2]:
! pip install keras

Collecting keras
  Downloading Keras-2.1.5-py2.py3-none-any.whl (334kB)
Installing collected packages: keras
Successfully installed keras-2.1.5
You are using pip version 9.0.1, however version 9.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [5]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.2.5.tar.gz (1.2MB)
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/18/9c/1f/276bc3f421614062468cb1c9d695e6086d0c73d67ea363c501
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.2.5
You are using pip version 9.0.1, however version 9.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [11]:
import string
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from keras.utils import to_categorical
from keras import models
from keras import Sequential
from keras.layers import Dense, LSTM, Embedding, Dropout
from keras.callbacks import ModelCheckpoint
import pickle
import pandas as pd
import collections
from nltk.corpus import words
from scipy import sparse
import sys
from sklearn.model_selection import train_test_split

In [2]:
filepath = 'model_chekpoints/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=False, mode='min')
callbacks_list = [checkpoint]
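
The checkpoint `filepath` is a format template: at the end of each epoch Keras fills in the 1-based epoch number and the logged metrics. A minimal sketch of that expansion with plain `str.format`, using the epoch-1 loss that appears in the training log of this notebook. The `os.makedirs` step is an assumption rather than something the notebook does; it is shown under a temporary directory only to illustrate creating the checkpoint folder before training starts.

```python
import os
import tempfile

filepath = 'model_chekpoints/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5'

# Expand the template the same way the callback does for the first epoch.
example_name = filepath.format(epoch=1, loss=5.69642)
print(example_name)

# Assumption: the target directory must exist before the first checkpoint is written.
# A temporary root is used here so the demo does not touch the working directory.
demo_root = tempfile.mkdtemp()
checkpoint_dir = os.path.join(demo_root, os.path.dirname(filepath))
os.makedirs(checkpoint_dir, exist_ok=True)
```

Because `save_best_only=False`, one file is written per epoch, so a 100-epoch run leaves up to 100 checkpoint files in that directory.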

In [12]:
def load_data(filename='all_songs.csv', col='album', col_value='In the Aeroplane Over the Sea'):
    # Load the CSV named by `filename` and keep the lyrics of rows where `col` equals `col_value`
    df    = pd.read_csv(filename)
    df    = df.loc[df[col] == col_value]
    songs = df['lyrics'].values
    return songs

def lyric_cleaner(songs):
    # Translation table that deletes all punctuation characters (built once, reused per song)
    table = str.maketrans('', '', string.punctuation)
    lyric_tokens = []
    for song in songs:
        text = song.lower().replace(' n ', ' eol ').replace('[verse ', '[verse')
        text = text.replace("'", '').replace('-', ' ')
        tokens = text.split()
        tokens = [word.translate(table) for word in tokens]
        lyric_tokens.append(tokens)
    return lyric_tokens

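def lyric_gatherer(lyric_tokens):
    # Flatten all songs into one token stream, marking the end of each song with 'eos'.
    # The input lists are left unmodified, so the cell can be rerun safely.
    lyrics = []
    for song in lyric_tokens:
        lyrics.extend(song)
        lyrics.append('eos')
    return lyrics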

def vocabulary_dictionary(lyrics, n_vocab):
    word_count = collections.Counter(lyrics)
    most_common = word_count.most_common(n=n_vocab)
    vocab = []
    for word, count in most_common:
        vocab.append(word)
    word_to_index = dict(zip(vocab, range(0, len(vocab))))
    word_to_index['unknown'] = len(vocab)
    index_to_word = dict([(index, word) for word, index in word_to_index.items()])
    return word_to_index, index_to_word

def tokenizer(dictionary, lyrics):
    encoded_lyrics = [dictionary[lyric] if lyric in dictionary else dictionary['unknown'] for lyric in lyrics]
    return encoded_lyrics

def sequenizer(encoded_lyrics, seq_length):
    length = seq_length + 1
    sequences = []
    for i in range(length, len(encoded_lyrics)):
        sequence = encoded_lyrics[i-length:i]
        sequences.append(sequence)
    n_patterns = len(sequences)
    sequences = np.array(sequences)
    return sequences, n_patterns
    
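def prepare_data(sequences):
    # Split each window into its input tokens (all but the last) and its target (the last).
    # y stays as integer class ids here; it is one-hot encoded after the train/test split.
    X, y = sequences[:, :-1], sequences[:, -1]
    return X, y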

def prepare_model(vocab_size, seq_length, lstm_hidden_size):
    model = Sequential()
    model.add(Embedding(vocab_size, 50, input_length=seq_length))
    model.add(LSTM(lstm_hidden_size, return_sequences=True))
    model.add(LSTM(lstm_hidden_size, return_sequences=False))
    model.add(Dense(lstm_hidden_size, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
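
To make the windowing in `sequenizer` and `prepare_data` concrete, here is a small sketch on a toy sequence. `make_windows` is a hypothetical stand-in that follows the same slicing logic as `sequenizer`; the integers 0 to 9 stand in for encoded lyrics.

```python
import numpy as np

def make_windows(encoded, seq_length):
    # Same slicing as sequenizer: each row holds seq_length inputs plus one target.
    length = seq_length + 1
    return np.array([encoded[i - length:i] for i in range(length, len(encoded))])

encoded = list(range(10))                 # stand-in for encoded lyrics
windows = make_windows(encoded, seq_length=3)

# Same split as prepare_data: the last column is the word to predict.
X, y = windows[:, :-1], windows[:, -1]
print(windows.shape)
print(X[0], y[0])
print(y)
```

With 10 tokens and `seq_length=3` this yields 6 windows, and the targets run from 3 to 8. The final token (9) never becomes a target because the loop stops at `len(encoded) - 1`; using `range(length, len(encoded) + 1)` would include it, if that matters for a given dataset.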

In [21]:
seq_length = 20
n_vocab = 10000
vocab_size = n_vocab + 1
lstm_hidden_size = 50

data         = load_data(filename='all_songs.csv', col='is_folk', col_value=1)

lyric_tokens = lyric_cleaner(data)

lyrics       = lyric_gatherer(lyric_tokens)

word_to_index, index_to_word = vocabulary_dictionary(lyrics, n_vocab)

encoded_lyrics = tokenizer(word_to_index, lyrics)

sequences, n_patterns = sequenizer(encoded_lyrics, seq_length)

X, y = prepare_data(sequences)

model = prepare_model(vocab_size, seq_length, lstm_hidden_size)
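
The reason `vocab_size` is `n_vocab + 1` is that `vocabulary_dictionary` keeps the `n_vocab` most frequent words and then adds one extra index for `'unknown'`. A toy sketch of that behaviour, with `build_vocab` as a hypothetical condensed version of the function above and a made-up token list:

```python
import collections

def build_vocab(tokens, n_vocab):
    # Keep the n_vocab most frequent tokens; every other token maps to 'unknown'.
    vocab = [word for word, _ in collections.Counter(tokens).most_common(n_vocab)]
    word_to_index = {word: i for i, word in enumerate(vocab)}
    word_to_index['unknown'] = len(vocab)
    return word_to_index

tokens = ['the', 'sea', 'the', 'sky', 'the', 'sea', 'aeroplane']
word_to_index = build_vocab(tokens, n_vocab=2)

# Same lookup as tokenizer(): fall back to the 'unknown' index for rare words.
encoded = [word_to_index.get(t, word_to_index['unknown']) for t in tokens]
print(word_to_index)
print(encoded)
```

Two caveats follow from this logic: if the literal word `unknown` is itself among the most common tokens, its index is overwritten, and if the corpus has fewer than `n_vocab` distinct tokens, the real vocabulary is smaller than `n_vocab + 1`.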

In [22]:
X, X_test, y, y_test = train_test_split(X, y, train_size=0.5, random_state=2018)

y = to_categorical(y)
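
`to_categorical` expands every integer target into a one-hot row as wide as the vocabulary, so `y` becomes very large. A rough back-of-the-envelope estimate, assuming the 548,030 training rows shown in the progress bars of the training log, a width of `vocab_size = 10001`, and 4-byte float32 values (the dtype is an assumption about `to_categorical`'s output):

```python
n_rows = 548030        # training rows, as shown in the Keras progress bar
vocab_size = 10001     # n_vocab + 1
bytes_per_value = 4    # assumption: one-hot matrix stored as float32

one_hot_bytes = n_rows * vocab_size * bytes_per_value
integer_bytes = n_rows * 8   # the same targets kept as int64 class ids

print('one-hot y: %.1f GB' % (one_hot_bytes / 1e9))
print('integer y: %.1f MB' % (integer_bytes / 1e6))
```

Keras also provides a `sparse_categorical_crossentropy` loss that accepts integer class ids directly. Compiling the model with that loss and skipping `to_categorical` would be one way to avoid the large allocation, though that variant has not been run against this notebook.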

In [None]:
history = model.fit(X, y, batch_size=128, epochs=100, callbacks=callbacks_list)

Epoch 1/100

Epoch 00001: loss improved from inf to 5.69642, saving model to model_chekpoints/weights-improvement-01-5.6964.hdf5
Epoch 2/100

Epoch 00002: loss improved from 5.69642 to 5.15943, saving model to model_chekpoints/weights-improvement-02-5.1594.hdf5
Epoch 3/100

Epoch 00003: loss improved from 5.15943 to 4.94066, saving model to model_chekpoints/weights-improvement-03-4.9407.hdf5
Epoch 4/100

Epoch 00004: loss improved from 4.94066 to 4.80716, saving model to model_chekpoints/weights-improvement-04-4.8072.hdf5
Epoch 5/100

Epoch 00005: loss improved from 4.80716 to 4.70742, saving model to model_chekpoints/weights-improvement-05-4.7074.hdf5
Epoch 6/100

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)




Epoch 00008: loss improved from 4.56496 to 4.50977, saving model to model_chekpoints/weights-improvement-08-4.5098.hdf5
Epoch 9/100

Epoch 00010: loss improved from 4.46047 to 4.41586, saving model to model_chekpoints/weights-improvement-10-4.4159.hdf5
Epoch 11/100
101120/548030 [====>.........................] - ETA: 4:41 - loss: 4.3592 - acc: 0.2401

Epoch 00013: loss improved from 4.33681 to 4.30154, saving model to model_chekpoints/weights-improvement-13-4.3015.hdf5
Epoch 14/100

Epoch 00016: loss improved from 4.23636 to 4.20703, saving model to model_chekpoints/weights-improvement-16-4.2070.hdf5
Epoch 17/100

Epoch 00019: loss improved from 4.15243 to 4.12820, saving model to model_chekpoints/weights-improvement-19-4.1282.hdf5
Epoch 20/100

Epoch 00021: loss improved from 4.10543 to 4.08319, saving model to model_chekpoints/weights-improvement-21-4.0832.hdf5
Epoch 22/100
 58752/548030 [==>...........................] - ETA: 5:04 - loss: 4.0054 - acc: 0.2711

Epoch 00024: loss improved from 4.04324 to 4.02571, saving model to model_chekpoints/weights-improvement-24-4.0257.hdf5
Epoch 25/100

Epoch 00027: loss improved from 3.99128 to 3.97665, saving model to model_chekpoints/weights-improvement-27-3.9766.hdf5
Epoch 28/100

Epoch 00029: loss improved from 3.96151 to 3.94693, saving model to model_chekpoints/weights-improvement-29-3.9469.hdf5
Epoch 30/100
 41728/548030 [=>............................] - ETA: 5:14 - loss: 3.8646 - acc: 0.2832

Epoch 00030: loss improved from 3.94693 to 3.93359, saving model to model_chekpoints/weights-improvement-30-3.9336.hdf5
Epoch 31/100

Epoch 00032: loss improved from 3.92006 to 3.90995, saving model to model_chekpoints/weights-improvement-32-3.9099.hdf5
Epoch 33/100
 81152/548030 [===>..........................] - ETA: 4:50 - loss: 3.8335 - acc: 0.2878

Epoch 00033: loss improved from 3.90995 to 3.89601, saving model to model_chekpoints/weights-improvement-33-3.8960.hdf5
Epoch 34/100

Epoch 00035: loss improved from 3.88543 to 3.87397, saving model to model_chekpoints/weights-improvement-35-3.8740.hdf5
Epoch 36/100

Epoch 00036: loss improved from 3.87397 to 3.86512, saving model to model_chekpoints/weights-improvement-36-3.8651.hdf5
Epoch 37/100

Epoch 00037: loss improved from 3.86512 to 3.85526, saving model to model_chekpoints/weights-improvement-37-3.8553.hdf5
Epoch 38/100

Epoch 00038: loss improved from 3.85526 to 3.84484, saving model to model_chekpoints/weights-improvement-38-3.8448.hdf5
Epoch 39/100

Epoch 00039: loss improved from 3.84484 to 3.83462, saving model to model_chekpoints/weights-improvement-39-3.8346.hdf5
Epoch 40/100

Epoch 00040: loss improved from 3.83462 to 3.82562, saving model to model_chekpoints/weights-improvement-40-3.8256.hdf5
Epoch 41/100

Epoch 00041: loss improved from 3.82562 to 3.81808, saving model to model_chekpoints/weights-improvement-41-3.8181.hdf5
Epoch 42/100

Epoch 00042: loss improved from 3.81808 to 3.81041, saving mo

Epoch 71/100

Epoch 00071: loss improved from 3.66311 to 3.66019, saving model to model_chekpoints/weights-improvement-71-3.6602.hdf5
Epoch 72/100

Epoch 00072: loss improved from 3.66019 to 3.65476, saving model to model_chekpoints/weights-improvement-72-3.6548.hdf5
Epoch 73/100

Epoch 00073: loss improved from 3.65476 to 3.65239, saving model to model_chekpoints/weights-improvement-73-3.6524.hdf5
Epoch 74/100

Epoch 00074: loss improved from 3.65239 to 3.64839, saving model to model_chekpoints/weights-improvement-74-3.6484.hdf5
Epoch 75/100

Epoch 00075: loss improved from 3.64839 to 3.64489, saving model to model_chekpoints/weights-improvement-75-3.6449.hdf5
Epoch 76/100

Epoch 00076: loss improved from 3.64489 to 3.64276, saving model to model_chekpoints/weights-improvement-76-3.6428.hdf5
Epoch 77/100

Epoch 00077: loss improved from 3.64276 to 3.64037, saving model to model_chekpoints/weights-improvement-77-3.6404.hdf5
Epoch 78/100

Epoch 00078: loss improved from 3.64037 to 3.635

In [1]:
# stopped at epoch 89 to save time
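
One way to read the logged loss values: categorical cross-entropy in Keras is measured in nats, so `exp(loss)` gives the perplexity, roughly the number of equally likely words the model is choosing between at each step. A small sketch using three training-loss values copied from the log above (these are training losses, not held-out scores, so they say nothing about generalization):

```python
import math

# Training losses copied from the log (epoch number -> loss in nats)
logged_loss = {1: 5.69642, 10: 4.41586, 77: 3.64037}

# Perplexity is the exponential of the cross-entropy loss.
perplexity = {epoch: math.exp(loss) for epoch, loss in logged_loss.items()}

for epoch, value in perplexity.items():
    print('epoch %d: loss %.4f -> perplexity %.1f' % (epoch, logged_loss[epoch], value))
```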