### In this notebook, we'll prepare our dataset for the sub-word model

Start by defining a few pickle helper functions

In [189]:
import pickle
def writePickle(Variable, fname):
    # write Variable to pickle_vars/<fname>.pkl
    filename = fname +".pkl"
    with open("pickle_vars/"+filename, 'wb') as f:
        pickle.dump(Variable, f)

def readPickle(fname):
    # read pickle_vars/<fname>.pkl
    filename = "pickle_vars/"+fname +".pkl"
    with open(filename, 'rb') as f:
        return pickle.load(f)

def readPicklePast(fname):
    # same as readPickle, but relative to the parent directory
    filename = "../pickle_vars/"+fname +".pkl"
    with open(filename, 'rb') as f:
        return pickle.load(f)
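
As a quick sanity check, the write/read round trip can be sketched as follows. This is a minimal, self-contained variant rather than the notebook's own helpers: the names `write_pickle` and `read_pickle`, the `directory` argument and the toy mapping are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def write_pickle(variable, fname, directory):
    """Serialize `variable` to <directory>/<fname>.pkl."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, fname + ".pkl"), "wb") as f:
        pickle.dump(variable, f)


def read_pickle(fname, directory):
    """Load the object stored in <directory>/<fname>.pkl."""
    with open(os.path.join(directory, fname + ".pkl"), "rb") as f:
        return pickle.load(f)


# Round trip in a throwaway directory (hypothetical mapping, not the real data)
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_mapping = {"artist_a": 1, "artist_b": 2}
    write_pickle(toy_mapping, "artist2id_toy", tmp_dir)
    restored = read_pickle("artist2id_toy", tmp_dir)

print(restored == toy_mapping)
```

Passing the directory explicitly keeps the sketch independent of the current working directory, whereas the helpers above assume a `pickle_vars/` folder already exists.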

Next, load the complete sub_dataset from the CSV file and extract every example together with its one-hot (categorical) artist and genre labels

In [192]:
import string
import numpy as np
import pandas as pd
from keras.utils import to_categorical  # keras.utils.np_utils is not available in recent Keras releases

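def load_data():
    # columns of sub_dataset.csv: 0 = artist id, 1 = genre id, 2 = lyrics text
    data = pd.read_csv('sub_dataset.csv', header=None)
    data = data.dropna()  # drop rows with missing values

    x = data[2]
    x = np.array(x)

    # ids are 1-based, so shift them to 0-based before one-hot encoding
    y_artist = data[0] - 1
    y_artist = to_categorical(y_artist)

    y_genre = data[1] - 1
    y_genre = to_categorical(y_genre)

    return (x, y_artist, y_genre)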
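
To make the label handling concrete, here is a small NumPy-only sketch of the same idea: 1-based ids are shifted to 0-based, turned into one-hot rows, and recovered again with `argmax` plus one. The toy ids and `num_classes` are assumptions, and `np.eye` indexing is used here as a stand-in for `to_categorical`.

```python
import numpy as np

# Hypothetical 1-based label ids for five songs and three classes
label_ids = np.array([1, 3, 2, 3, 1])
num_classes = 3

# Shift to 0-based and pick the matching rows of the identity matrix
one_hot = np.eye(num_classes)[label_ids - 1]

# argmax recovers the 0-based class; adding 1 gives back the original id
recovered_ids = np.argmax(one_hot, axis=-1) + 1

print(one_hot)
print(recovered_ids)
```

This is why lookups in the id-to-name dictionaries use `1 + np.argmax(...)`.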

In [193]:
x, y_artist, y_genre = load_data()

In [195]:
# also load the genre and artist id mapping dictionaries
artist2id = readPickle("artist2id")
id2artist = readPickle("id2artist")
genre2id = readPickle("genre2id")
id2genre = readPickle("id2genre")

In [199]:
# some examples
print(x[100],'\n', y_artist[100], id2artist[1 + np.argmax(y_artist[100], axis=-1)],'\n', \
      y_genre[100], id2genre[1 + np.argmax(y_genre[100], axis=-1)])

Someone wrote just yesterday
"Love's in need of love today"
Will there be none tomorrow

Faded feelings, jaded news
Where's the tender hearts
That love romance, they
Seem so few

So, I write these words to say
Take heart have faith
In love's tomorrow
Where our precious dreams
Will all come true

Heaven knows what I've
Been through
Searching for a heart that's true
Is there one in a million?

Though I've often failed the test
Everyone must journey till
That questioning is through

Shine a star for me to see what
Children see on Christmas morning
Every hope and dream
To be renewed

Heaven only knows how I hope
That chance of love is higher
Than one in a million

I'm a dreamer I confess
But my dreams are only of
The best that we can do

So, I'll keep on counting sheep
And will not sleep until the dawning

Help me break the news, eternal
Love just beat the blues

Save me from no love
(You oughta save me)
Save me from no love
Say sweet love has come
And is not gone,
There is all, that I wan

Here we follow *Kim (2014)*, **Convolutional Neural Networks for Sentence Classification**, for our model. We therefore clean each string with a function taken from the GitHub page of that paper's implementation.

For song lyrics, tabs, line breaks and punctuation such as "..." may carry important information, so we might remove or modify this preprocessing step later.

In [200]:
import re

def clean_str(text):
    """
    Tokenization/string cleaning.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    # replace every character outside the whitelist with a space
    text = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", text)
    # split off common English clitics so they become separate tokens
    text = re.sub(r"\'s", " 's", text)
    text = re.sub(r"\'ve", " 've", text)
    text = re.sub(r"n\'t", " n't", text)
    text = re.sub(r"\'re", " 're", text)
    text = re.sub(r"\'d", " 'd", text)
    text = re.sub(r"\'ll", " 'll", text)
    # surround punctuation with spaces
    text = re.sub(r",", " , ", text)
    text = re.sub(r"!", " ! ", text)
    # NOTE: as in the original, these replacements keep a literal backslash,
    # which is why tokens such as '\\?' appear in the output below
    text = re.sub(r"\(", r" \( ", text)
    text = re.sub(r"\)", r" \) ", text)
    text = re.sub(r"\?", r" \? ", text)
    # collapse runs of whitespace
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip().lower()
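
The tokenized output further down contains tokens such as `'\\?'`. A minimal sketch of where the backslash comes from: in the replacement string of `re.sub`, an unknown escape made of a backslash and a non-letter is left untouched, so the backslash is copied into the result. The plain replacement `" ? "` shown second is an assumed alternative, not what the notebook uses.

```python
import re

sample = "is there one in a million?"

# Replacement used in clean_str: the unknown escape "\?" is left untouched,
# so a literal backslash ends up in the output
with_backslash = re.sub(r"\?", r" \? ", sample).split()

# Assumed alternative (not in the original code): a plain replacement string
without_backslash = re.sub(r"\?", " ? ", sample).split()

print(with_backslash[-1])     # two characters: a backslash followed by ?
print(without_backslash[-1])  # just ?
```

Switching to the plain replacement would change the tokens, and therefore the indices, that every later cell produces.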

In [201]:
x = [clean_str(sen) for sen in x]
x = [sen.split(" ") for sen in x]

In [203]:
# compare the new example format with the raw version above
print(x[100])

['someone', 'wrote', 'just', 'yesterday', 'love', "'s", 'in', 'need', 'of', 'love', 'today', 'will', 'there', 'be', 'none', 'tomorrow', 'faded', 'feelings', ',', 'jaded', 'news', 'where', "'s", 'the', 'tender', 'hearts', 'that', 'love', 'romance', ',', 'they', 'seem', 'so', 'few', 'so', ',', 'i', 'write', 'these', 'words', 'to', 'say', 'take', 'heart', 'have', 'faith', 'in', 'love', "'s", 'tomorrow', 'where', 'our', 'precious', 'dreams', 'will', 'all', 'come', 'true', 'heaven', 'knows', 'what', 'i', "'ve", 'been', 'through', 'searching', 'for', 'a', 'heart', 'that', "'s", 'true', 'is', 'there', 'one', 'in', 'a', 'million', '\\?', 'though', 'i', "'ve", 'often', 'failed', 'the', 'test', 'everyone', 'must', 'journey', 'till', 'that', 'questioning', 'is', 'through', 'shine', 'a', 'star', 'for', 'me', 'to', 'see', 'what', 'children', 'see', 'on', 'christmas', 'morning', 'every', 'hope', 'and', 'dream', 'to', 'be', 'renewed', 'heaven', 'only', 'knows', 'how', 'i', 'hope', 'that', 'chance', '

Check what the example lengths look like

In [204]:
lengths = [len(sentence) for sentence in x]  # number of tokens per example
print(max(lengths)) # max length
print(min(lengths)) # min length
print(sum(lengths)/len(lengths)) # avg length

2184
2
254.337


The maximum and minimum example lengths lie far from the average (2184 and 2 tokens versus about 254), so padding every example to the maximum length adds a lot of padding to most of them. **We'll ignore this issue for now, but alternative models may revisit it by cropping long examples later on.**
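
If cropping is tried later, one possible approach is to cut every example at a high percentile of the length distribution. The sketch below is only an illustration: the toy examples and the choice of the 90th percentile are assumptions, not values used in this notebook.

```python
import numpy as np

# Hypothetical token lists standing in for the cleaned lyrics
toy_examples = [["la"] * n for n in (2, 40, 55, 60, 75, 90, 120, 300, 2000)]
lengths = np.array([len(tokens) for tokens in toy_examples])

# Use a high percentile of the lengths as an assumed cut-off
max_len = int(np.percentile(lengths, 90))

# Crop every example to at most max_len tokens
cropped = [tokens[:max_len] for tokens in toy_examples]

print("cut-off:", max_len)
print("longest after cropping:", max(len(tokens) for tokens in cropped))
```

A percentile-based cut-off affects only the few longest examples while sharply reducing the padded width.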

#### From here on, we'll use a sub-word level embedding package, BPEmb, retrieved from: https://github.com/bheinzerling/bpemb
#### We'll also prepare our data for the Keras implementation

Load the sub-word level embedding model

In [205]:
from bpemb import BPEmb
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
bpemb_en = BPEmb(lang="en", dim=50)

In [206]:
# encode() is a built-in method of the BPEmb library; x holds token lists,
# so each example becomes a list of sub-word lists (one per token)
texts = [bpemb_en.encode(s) for s in x]

In [207]:
# show an example
print(texts[0])

[['▁carry'], ['▁each'], ['▁other'], ["▁'", 's'], ['▁bur', 'd', 'ens'], ['▁carry'], ['▁each'], ['▁other'], ["▁'", 's'], ['▁bur', 'd', 'ens'], ['▁and'], ['▁in'], ['▁this'], ['▁way'], ['▁you'], ['▁will'], ['▁fulf', 'ill'], ['▁the'], ['▁law'], ['▁of'], ['▁christ'], ['▁be'], ['▁devoted'], ['▁', '\\', '('], ['▁be'], ['▁devoted'], ['▁', '\\', ')'], ['▁to'], ['▁one'], ['▁another'], ['▁', '\\', '('], ['▁to'], ['▁one'], ['▁another'], ['▁', '\\', ')'], ['▁in'], ['▁brother', 'ly'], ['▁love'], ['▁', '\\', '('], ['▁in'], ['▁brother', 'ly'], ['▁love'], ['▁', '\\', ')'], ['▁be'], ['▁devoted'], ['▁', '\\', '('], ['▁be'], ['▁devoted'], ['▁', '\\', ')'], ['▁to'], ['▁one'], ['▁another'], ['▁', '\\', '('], ['▁to'], ['▁one'], ['▁another'], ['▁', '\\', ')'], ['▁in'], ['▁brother', 'ly'], ['▁love'], ['▁', '\\', '('], ['▁in'], ['▁brother', 'ly'], ['▁love'], ['▁', '\\', ')'], ['▁honor'], ['▁one'], ['▁another'], ['▁above'], ['▁y', 'ours', 'elves'], ['▁', '\\', '('], ['▁above'], ['▁y', 'ours', 'elves'], ['▁', '\\'

In [208]:
# Create a vocabulary dictionary where each token is mapped to an index value
vocabulary = {}
for index, token in enumerate(bpemb_en.words):
    vocabulary[token] = index + 1

print("Index of 'the' is:", vocabulary['▁the'], "\nIndex of unknown is:", vocabulary["unk"])
print("The vocabulary length is:", len(vocabulary))

Index of 'the' is: 8 
Index of unknown is: 4993
The vocabulary length is: 10000


In [209]:
# Convert each subword to its index value with the following helper function
def subword2index(texts, vocabulary):
    unk_index = vocabulary['unk']  # fallback for sub-words missing from the vocabulary
    sentences = []
    for text in texts:  # one example (song)
        one_line = []
        for sub_word in text:  # the sub-word pieces of one token
            for word in sub_word:
                one_line.append(vocabulary.get(word, unk_index))
        sentences.append(one_line)
    return sentences
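
The flattening and lookup can be illustrated on a toy input. The vocabulary below is hypothetical (the real one comes from BPEmb); the point is that the nested per-token lists collapse into one flat index sequence and that unknown pieces fall back to the `unk` index.

```python
# Hypothetical toy vocabulary; index 0 is left free for padding
toy_vocabulary = {"▁carry": 1, "▁bur": 2, "d": 3, "ens": 4, "unk": 5}

# One example: a list of tokens, each split into sub-word pieces
toy_text = [["▁carry"], ["▁bur", "d", "ens"], ["▁", "\\", "("]]

# Flatten the pieces and map each one to its index, falling back to "unk"
toy_indices = [
    toy_vocabulary.get(piece, toy_vocabulary["unk"])
    for pieces in toy_text
    for piece in pieces
]

print(toy_indices)  # [1, 2, 3, 4, 5, 5, 5]
```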

In [210]:
sentences = subword2index(texts, vocabulary)

In [211]:
# show some examples
print(sentences[0:2])

[[5025, 883, 303, 882, 9921, 1492, 9924, 305, 5025, 883, 303, 882, 9921, 1492, 9924, 305, 35, 27, 216, 1105, 1221, 506, 9265, 142, 8, 873, 28, 1060, 87, 8674, 9913, 4993, 9943, 87, 8674, 9913, 4993, 9942, 43, 273, 1075, 9913, 4993, 9943, 43, 273, 1075, 9913, 4993, 9942, 27, 1724, 75, 2154, 9913, 4993, 9943, 27, 1724, 75, 2154, 9913, 4993, 9942, 87, 8674, 9913, 4993, 9943, 87, 8674, 9913, 4993, 9942, 43, 273, 1075, 9913, 4993, 9943, 43, 273, 1075, 9913, 4993, 9942, 27, 1724, 75, 2154, 9913, 4993, 9943, 27, 1724, 75, 2154, 9913, 4993, 9942, 2853, 273, 1075, 2208, 156, 2006, 2898, 9913, 4993, 9943, 2208, 156, 2006, 2898, 9913, 4993, 9942, 1657, 87, 2856, 38, 27, 2449, 31, 9913, 4993, 9943, 1657, 1657, 1750, 1657, 1750, 1657, 87, 2856, 38, 27, 2449, 31, 9913, 4993, 9942, 2902, 9913, 4993, 9943, 2902, 9913, 4993, 9942, 3286, 6071, 9913, 4993, 9943, 3286, 6071, 9913, 4993, 9942, 22, 914, 23, 2902, 9913, 4993, 9943, 2902, 9913, 4993, 9942, 3286, 9913, 4993, 9943, 3286, 9913, 4993, 9942, 6071,

Check what the sub-word level data looks like in terms of example lengths

In [212]:
# See sub-word level length
length = [len(sentence) for sentence in sentences]
print('The max length is: ', max(length))
print('The min length is: ', min(length))
print('The average length is: ', sum(length)/len(length))

The max length is:  3674
The min length is:  3
The average length is:  326.987


The gap is even bigger now: the longest example has 3674 sub-words, against an average of about 327.

Get ready for the model inputs

In [213]:
# Padding: append zeros ('post') so every example is as long as the longest one
from keras.preprocessing.sequence import pad_sequences
padded_sentences = pad_sequences(sentences, maxlen=max(length), padding='post')

# Convert to numpy array
padded_sentences = np.array(padded_sentences)
padded_sentences.shape

(12000, 3674)
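
For intuition, post-padding can be sketched in plain NumPy. The helper `pad_post` is a hypothetical stand-in, not the Keras function: it crops at the end when a sequence is longer than `maxlen`, which does not matter here because `maxlen` is the longest length.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Right-pad (and, if needed, right-crop) integer sequences to maxlen."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        padded[row, :len(trimmed)] = trimmed
    return padded


toy_sentences = [[5, 8, 2], [7], [3, 3, 9, 1]]
toy_padded = pad_post(toy_sentences, maxlen=max(len(s) for s in toy_sentences))

print(toy_padded)
```

Because 0 is the padding value, the vocabulary indices above start at 1.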

Now it is time to split our dataset

In [228]:
# Shuffle data
np.random.seed(23) # IMPORTANT: use the same seed value in all dataset split versions
shuffle_indices = np.random.permutation(np.arange(len(y_artist))) 
# shuffle all inputs with the same indices
x_shuffled = padded_sentences[shuffle_indices]
y_artist_shuffled = y_artist[shuffle_indices]
y_genre_shuffled = y_genre[shuffle_indices]

# form count dictionaries for both artist and genre labels
# first, artist labels
artist_label_count_dict = dict()
for i in range(120):
    artist_label_count_dict[i] = 0
# then also for genre labels
genre_label_count_dict = dict()
for i in range(12):
    genre_label_count_dict[i] = 0
    
# Now go through the shuffled examples one by one. For each artist label, the first 80 occurrences
# go to the training set, the 81st to 90th occurrences go to the validation set, and the 91st to
# 100th occurrences (the last 10) go to the test set.
# For each genre label, the first 800 occurrences go to the training set, the 801st to 900th
# occurrences go to the validation set, and the 901st to 1000th occurrences (the last 100) go to
# the test set.

# create training, validation and test sets with equal distributions of artists and genres
x_tr_artist_equal, x_val_artist_equal, x_te_artist_equal = list(), list(), list()
x_tr_genre_equal, x_val_genre_equal, x_te_genre_equal = list(), list(), list()
y_tr_artist_equal, y_val_artist_equal, y_te_artist_equal = list(), list(), list()
y_tr_genre_equal, y_val_genre_equal, y_te_genre_equal = list(), list(), list()

for sample, art_label, gen_label in zip(x_shuffled, y_artist_shuffled, y_genre_shuffled):
    artist_label_index = np.argmax(art_label, axis=-1)
    genre_label_index = np.argmax(gen_label, axis=-1)
    # for artist labels
    if artist_label_count_dict[artist_label_index] < 80:
        x_tr_artist_equal.append(sample)
        y_tr_artist_equal.append(art_label)
    elif 80 <= artist_label_count_dict[artist_label_index] < 90:
        x_val_artist_equal.append(sample)
        y_val_artist_equal.append(art_label)
    elif 90 <= artist_label_count_dict[artist_label_index] < 100:
        x_te_artist_equal.append(sample)
        y_te_artist_equal.append(art_label)
    else:
        print("There is an error with artist counts!")
    artist_label_count_dict[artist_label_index] += 1
        
    # for genre labels
    if genre_label_count_dict[genre_label_index] < 800:
        x_tr_genre_equal.append(sample)
        y_tr_genre_equal.append(gen_label)
    elif 800 <= genre_label_count_dict[genre_label_index] < 900:
        x_val_genre_equal.append(sample)
        y_val_genre_equal.append(gen_label)
    elif 900 <= genre_label_count_dict[genre_label_index] < 1000:
        x_te_genre_equal.append(sample)
        y_te_genre_equal.append(gen_label)
    else:
        print("There is an error with genre counts!")
    genre_label_count_dict[genre_label_index] += 1

        
        
# turn the output datasets into np arrays
x_tr_artist_equal = np.array(x_tr_artist_equal)
x_val_artist_equal = np.array(x_val_artist_equal)
x_te_artist_equal = np.array(x_te_artist_equal)
x_tr_genre_equal = np.array(x_tr_genre_equal)
x_val_genre_equal = np.array(x_val_genre_equal)
x_te_genre_equal = np.array(x_te_genre_equal)
y_tr_artist_equal = np.array(y_tr_artist_equal)
y_val_artist_equal = np.array(y_val_artist_equal)
y_te_artist_equal = np.array(y_te_artist_equal)
y_tr_genre_equal = np.array(y_tr_genre_equal)
y_val_genre_equal = np.array(y_val_genre_equal)
y_te_genre_equal  = np.array(y_te_genre_equal)
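
The counting scheme above can be condensed into a small helper that returns index arrays instead of copying samples. This is a sketch on toy data, assuming integer class labels; `quota_split` and its parameters are hypothetical names, not part of the notebook.

```python
import numpy as np


def quota_split(labels, n_train, n_val, n_test):
    """Assign each example to a split based on how often its label was seen so far."""
    seen = {}
    train_idx, val_idx, test_idx = [], [], []
    for idx, label in enumerate(labels):
        count = seen.get(label, 0)
        if count < n_train:
            train_idx.append(idx)
        elif count < n_train + n_val:
            val_idx.append(idx)
        elif count < n_train + n_val + n_test:
            test_idx.append(idx)
        else:
            raise ValueError(f"label {label} occurs more often than expected")
        seen[label] = count + 1
    return np.array(train_idx), np.array(val_idx), np.array(test_idx)


# Hypothetical toy data: 3 classes with 10 examples each, shuffled with a fixed seed
rng = np.random.RandomState(23)
toy_labels = rng.permutation(np.repeat(np.arange(3), 10))

train_idx, val_idx, test_idx = quota_split(toy_labels, n_train=8, n_val=1, n_test=1)

# Every class contributes exactly 8 / 1 / 1 examples to the three splits
print(np.bincount(toy_labels[train_idx]))
print(np.bincount(toy_labels[val_idx]))
print(np.bincount(toy_labels[test_idx]))
```

Returning indices makes it easy to apply the same split to the inputs and to both label arrays.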

Finally prepare the embeddings

In [215]:
from keras.layers import Embedding

model = bpemb_en.emb

embedding_dim = 50 # this depends on the embedding model we've chosen
# create a dummy embedding weights matrix to be filled later on
embedding_weights = np.zeros((len(vocabulary) + 1, embedding_dim)) 

# copy the pretrained vector of every sub-word into its row; row 0 stays all-zero for padding
for subword, i in vocabulary.items():
    if subword in model.vocab:
        embedding_vector = model[subword]
        if embedding_vector is not None:
            embedding_weights[i] = embedding_vector
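
The same construction on toy data, to show the layout of the weight matrix. The vectors and vocabulary here are made up for illustration; only the pattern (one extra all-zero row at index 0, zero rows for sub-words without a pretrained vector) mirrors the cell above.

```python
import numpy as np

embedding_dim = 4  # toy size; the notebook uses 50

# Hypothetical pretrained vectors and a vocabulary whose indices start at 1
toy_vectors = {
    "▁the": np.array([0.1, 0.2, 0.3, 0.4]),
    "▁love": np.array([0.5, 0.6, 0.7, 0.8]),
}
toy_vocabulary = {"▁the": 1, "▁love": 2, "▁rare": 3}

# One row per index plus row 0, which is reserved for padding
toy_weights = np.zeros((len(toy_vocabulary) + 1, embedding_dim))
for subword, index in toy_vocabulary.items():
    vector = toy_vectors.get(subword)
    if vector is not None:  # sub-words without a vector keep an all-zero row
        toy_weights[index] = vector

print(toy_weights)
```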



In [216]:
# some examples
print(embedding_weights[0:2])

[[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [ 1.02473402  0.38355899 -0.079484    0.253795    1.75872803 -0.229478
  -0.96081603 -0.009623    0.24513    -0.30851501 -0.46463799  0.82689101
  -0.052177    0.57460701 -0.184591   -0.302048   -0.099645    0.755027
  -0.46508199  0.115942    0.027584    0.92594999 -0.66696501 -0.133093
   0.95801401  0.636563   -0.432753    0.27072999 -0.056136    0.597776
   0.081105    0.48131901 -0.13154     0.120166    0.480726    0.309266
  -0.845996

In [235]:
# store all important variables in pickle files in a folder specific to the sub-word model
def writePickleChar(Variable, fname):
    filename = fname +".pkl"
    with open("pickle_vars/sub_word/"+filename, 'wb') as f:
        pickle.dump(Variable, f)

# the matrix built above is named embedding_weights
writePickleChar(embedding_weights, "embedding_weights")
writePickleChar(x_tr_artist_equal, "x_tr_artist_equal")
writePickleChar(x_val_artist_equal, "x_val_artist_equal")
writePickleChar(x_te_artist_equal, "x_te_artist_equal")
writePickleChar(x_tr_genre_equal, "x_tr_genre_equal")
writePickleChar(x_val_genre_equal, "x_val_genre_equal")
writePickleChar(x_te_genre_equal, "x_te_genre_equal")
writePickleChar(y_tr_artist_equal, "y_tr_artist_equal")
writePickleChar(y_val_artist_equal, "y_val_artist_equal")
writePickleChar(y_te_artist_equal, "y_te_artist_equal")
writePickleChar(y_tr_genre_equal, "y_tr_genre_equal")
writePickleChar(y_val_genre_equal, "y_val_genre_equal")
writePickleChar(y_te_genre_equal, "y_te_genre_equal")
vocab_size = len(vocabulary)  # define vocab_size before storing it
writePickleChar(vocab_size, "vocab_size")
writePickleChar(length, "sequence_length")

---------------
#### Initial Incorrect Split Code
---------------

In [217]:
'''
# store all important variables in pickle files to a folder specific to the sub-word model
def writePickleSW(Variable, fname):
    filename = fname +".pkl"
    f = open("pickle_vars/sub_word/"+filename, 'wb')
    pickle.dump(Variable, f)
    f.close()

writePickleSW(embedding_weights, "embedding_weights")
writePickleSW(x_tr, "x_tr")
writePickleSW(x_te, "x_te")
writePickleSW(x_val, "x_val")
writePickleSW(y_tr_artist, "y_tr_artist")
writePickleSW(y_tr_genre, "y_tr_genre")
writePickleSW(y_te_artist, "y_te_artist")
writePickleSW(y_te_genre, "y_te_genre")
writePickleSW(y_val_artist, "y_val_artist")
writePickleSW(y_val_genre, "y_val_genre")
vocab_size = len(vocabulary)
writePickleSW(vocab_size, "vocab_size")
writePickleSW(length, "sequence_length")

# Shuffle data
np.random.seed(23)
shuffle_indices = np.random.permutation(np.arange(len(y_artist))) 
# shuffle all inputs with the same indices
x_shuffled = padded_sentences[shuffle_indices]
y_artist_shuffled = y_artist[shuffle_indices]
y_genre_shuffled = y_genre[shuffle_indices]

# Split training, validation and test sets with a ratio of 80-10-10
training_slice = int(len(y_artist) * 0.8)
validation_slice = int(len(y_artist) * 0.9)

# training
x_tr = x_shuffled[:training_slice]
y_tr_artist = y_artist_shuffled[:training_slice]
y_tr_genre = y_genre_shuffled[:training_slice]

# validation
x_val = x_shuffled[training_slice:validation_slice]
y_val_artist = y_artist_shuffled[training_slice:validation_slice]
y_val_genre = y_genre_shuffled[training_slice:validation_slice]

# test
x_te = x_shuffled[validation_slice:]
y_te_artist = y_artist_shuffled[validation_slice:]
y_te_genre = y_genre_shuffled[validation_slice:]

print('Training data size is: ', x_tr.shape)
print('Validation data size is: ', x_val.shape)
print('Test data size is: ', x_te.shape)
'''