### In this notebook, we'll prepare our dataset for the sub-word model

Start by defining some pickle helper functions

In [1]:
import pickle
def writePickle(Variable, fname):
    # save a variable as pickle_vars/<fname>.pkl
    with open("pickle_vars/" + fname + ".pkl", 'wb') as f:
        pickle.dump(Variable, f)

def readPickle(fname):
    # load pickle_vars/<fname>.pkl
    with open("pickle_vars/" + fname + ".pkl", 'rb') as f:
        return pickle.load(f)

def readPicklePast(fname):
    # same as readPickle, but relative to the parent directory
    with open("../pickle_vars/" + fname + ".pkl", 'rb') as f:
        return pickle.load(f)
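
A quick round trip shows how helpers of this kind are meant to be used. The sketch below is self-contained: it writes to a temporary directory instead of `pickle_vars/`, and the names `write_pickle`, `read_pickle` and the `base_dir` parameter are illustrative assumptions, not part of this notebook.

```python
import os
import pickle
import tempfile

def write_pickle(obj, fname, base_dir):
    """Serialize obj to <base_dir>/<fname>.pkl and return the path."""
    path = os.path.join(base_dir, fname + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path

def read_pickle(fname, base_dir):
    """Load the object stored in <base_dir>/<fname>.pkl."""
    with open(os.path.join(base_dir, fname + ".pkl"), "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    mapping = {"rock": 1, "pop": 2}  # hypothetical genre2id content
    write_pickle(mapping, "genre2id", tmp)
    restored = read_pickle("genre2id", tmp)

print(restored == mapping)
```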

Also load the complete sub_dataset from the CSV file, and get all examples with their corresponding artist labels and genre labels (both one-hot encoded)

In [2]:
import string
import numpy as np
import pandas as pd
from keras.utils import to_categorical

def load_data():
    # columns: 0 = artist id, 1 = genre id, 2 = lyrics text
    data = pd.read_csv('sub_dataset.csv', header=None)
    data = data.dropna()

    x = np.array(data[2])

    # ids in the csv start at 1, so shift them to 0-based before one-hot encoding
    y_artist = to_categorical(data[0] - 1)
    y_genre = to_categorical(data[1] - 1)

    return (x, y_artist, y_genre)
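
To make the label handling concrete: the ids are shifted by one before encoding, and the examples further down add one back after `np.argmax`. The following is a minimal pure-Python sketch of that round trip. It only mimics the shape of what `to_categorical` returns; `one_hot` and `decode` are hypothetical helpers and the ids are made up.

```python
def one_hot(label_ids, num_classes):
    """Turn 0-based integer labels into one-hot rows (lists of floats)."""
    rows = []
    for label in label_ids:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

def decode(row):
    """Recover the original 1-based id from a one-hot row."""
    return row.index(max(row)) + 1

raw_ids = [2, 1, 3]  # made-up 1-based ids, as they would be stored in the csv
encoded = one_hot([i - 1 for i in raw_ids], num_classes=3)
print(encoded[0])
print([decode(row) for row in encoded])
```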

In [3]:
x, y_artist, y_genre = load_data()

In [4]:
# also load the genre and artist id mapping dictionaries
artist2id = readPickle("artist2id")
id2artist = readPickle("id2artist")
genre2id = readPickle("genre2id")
id2genre = readPickle("id2genre")

In [5]:
# some examples
print(x[100],'\n', y_artist[100], id2artist[1 + np.argmax(y_artist[100], axis=-1)],'\n', \
      y_genre[100], id2genre[1 + np.argmax(y_genre[100], axis=-1)])

The spacious firmament on high,
With all the blue ethereal sky,
And spangled heavens, a shining frame
Their great Original proclaim.
The unwearied sun, from day to day,
Does his Creator's powers display,
And publishes to every land
The work of an Almighty Hand.

Soon as the evening shades prevail
The moon takes up the wondrous tale,
And nightly to the listening earth
Repeats the story of her birth;
While all the stars that round her burn
And all the planets in their turn,
Confirm the tidings as they roll,
And spread the truth from pole to pole.

What though in solemn silence all
Move round the dark terrestrial ball?
What though no real voice nor sound
Amid the radiant orbs be found?
In reason's ear they all rejoice,
And utter forth a glorious voice,
Forever singing as they shine,
"The hand that made us is divine."

Scriptural Reference:

"The heavens declare the glory of God; the skies proclaim the work of His hands." Psalm 19:1 
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 

Here we'll try to follow the *Kim (2014)* paper **Convolutional Neural Networks for Sentence Classification** for our model. We therefore clean our strings with a function taken from the GitHub page of the model implementation.

For song lyrics, tabs, line breaks and punctuation like "..." might be very important, so we may remove or modify this preprocessing step later.

In [6]:
import re

def clean_str(string):
    """
    Tokenization/string cleaning.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()
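
One detail worth knowing: under Python 3, replacement strings such as `" \( "` and `" \? "` keep the backslash literally, which is why tokens like `'\\?'` show up in the cleaned output below. If that is not wanted, a variant along the following lines could be used. This is only a sketch: `clean_str_no_backslash` is a hypothetical name, and switching to it would change the token sequences and length statistics recorded in this notebook.

```python
import re

def clean_str_no_backslash(text):
    """Variant of clean_str whose replacements emit plain '(', ')' and '?'."""
    text = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", text)
    # split common English contractions off the preceding word
    for suffix in ("'s", "'ve", "n't", "'re", "'d", "'ll"):
        text = text.replace(suffix, " " + suffix)
    # pad punctuation with spaces so that each mark becomes its own token
    text = re.sub(r"([,!()?])", r" \1 ", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip().lower()

tokens = clean_str_no_backslash("What's this (pause)? Don't stop!").split(" ")
print(tokens)
```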

In [7]:
x = [clean_str(sen) for sen in x]
x = [sen.split(" ") for sen in x]

In [8]:
# compare the new example format with the raw version above
print(x[100])

['the', 'spacious', 'firmament', 'on', 'high', ',', 'with', 'all', 'the', 'blue', 'ethereal', 'sky', ',', 'and', 'spangled', 'heavens', ',', 'a', 'shining', 'frame', 'their', 'great', 'original', 'proclaim', 'the', 'unwearied', 'sun', ',', 'from', 'day', 'to', 'day', ',', 'does', 'his', 'creator', "'s", 'powers', 'display', ',', 'and', 'publishes', 'to', 'every', 'land', 'the', 'work', 'of', 'an', 'almighty', 'hand', 'soon', 'as', 'the', 'evening', 'shades', 'prevail', 'the', 'moon', 'takes', 'up', 'the', 'wondrous', 'tale', ',', 'and', 'nightly', 'to', 'the', 'listening', 'earth', 'repeats', 'the', 'story', 'of', 'her', 'birth', 'while', 'all', 'the', 'stars', 'that', 'round', 'her', 'burn', 'and', 'all', 'the', 'planets', 'in', 'their', 'turn', ',', 'confirm', 'the', 'tidings', 'as', 'they', 'roll', ',', 'and', 'spread', 'the', 'truth', 'from', 'pole', 'to', 'pole', 'what', 'though', 'in', 'solemn', 'silence', 'all', 'move', 'round', 'the', 'dark', 'terrestrial', 'ball', '\\?', 'what

Check what the example lengths (in tokens) look like

In [9]:
word_lengths = [len(sentence) for sentence in x]
print(max(word_lengths)) # max length
print(min(word_lengths)) # min length
print(sum(word_lengths)/len(word_lengths)) # avg length

1620
3
235.54333333333332


The longest example (1620 tokens) and the shortest one (3 tokens) are both far from the average of about 236 tokens. Padding everything to the maximum length will therefore add a lot of padding to most examples. **We'll ignore this issue for now, but alternative models may reconsider cropping some data examples later on**

#### From here on, we'll use a sub-word level embedding package retrieved from: https://github.com/bheinzerling/bpemb
#### We'll also prepare our data for the Keras implementation

Load the sub-word level embedding model

In [10]:
from bpemb import BPEmb
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
bpemb_en = BPEmb(lang="en", dim=50)

In [11]:
# encode() is a built-in method of the BPEmb library
texts = [bpemb_en.encode(s) for s in x]

In [12]:
# show an example
print(texts[0])

[['▁i'], ['▁change'], ['▁places'], ['▁,'], ['▁to'], ['▁prevent'], ['▁catch', 'in', "'"], ['▁the'], ['▁cases'], ['▁races'], ['▁,'], ['▁in'], ['▁the'], ['▁faces'], ['▁,'], ['▁hall'], ['▁at'], ['▁you'], ['▁l', 'aces'], ['▁this'], ['▁is'], ['▁a'], ['▁hit'], ['▁,'], ['▁let'], ["▁'", 's'], ['▁see'], ['▁if'], ['▁hom', 'icide'], ['▁tra', 'ce'], ['▁this'], ['▁the'], ['▁only'], ['▁thing'], ['▁hot', 'ter'], ['▁than'], ['▁my'], ['▁flow'], ['▁is'], ['▁the'], ['▁block'], ['▁', '\\', '('], ['▁in', 'h', 'ale'], ['▁and'], ['▁ex', 'h', 'ale'], ['▁', '\\', ')'], ['▁that'], ["▁'", 's'], ['▁why'], ['▁i'], ['▁left'], ['▁this'], ['▁snow'], ['▁b', 'iz'], ['▁,'], ['▁and'], ['▁got'], ['▁into'], ['▁show'], ['▁b', 'iz'], ['▁let'], ["▁'", 's'], ['▁get'], ['▁this'], ['▁clear'], ['▁,'], ['▁it'], ['▁a', 'i'], ['▁n', "'", 't'], ['▁on'], ['▁till'], ['▁i'], ['▁say'], ['▁it'], ["▁'", 's'], ['▁on'], ['▁,'], ['▁', '\\', '('], ['▁p', 'ause'], ['▁', '\\', ')'], ['▁,'], ['▁it'], ["▁'", 's'], ['▁on'], ['▁i', "'", 'm'], ['▁eat'

In [13]:
# Create a vocabulary dictionary where each token is mapped to an index value
# (indices start at 1, so that 0 stays free for padding)
vocabulary = {token: index + 1 for index, token in enumerate(bpemb_en.words)}

print("Index of 'the' is:", vocabulary['▁the'], "\nIndex of unknown is:", vocabulary["unk"])
print("The vocabulary length is:", len(vocabulary))

Index of 'the' is: 8 
Index of unknown is: 4993
The vocabulary length is: 10000


In [14]:
writePickle(vocabulary, "sub_word/vocabulary")

In [15]:
# Convert each subword to its index value with the following helper function
def subword2index(texts, vocabulary):
    sentences = []
    for text in texts:
        one_line = []
        for sub_word in text:      # list of sub-word pieces for one word
            for word in sub_word:  # a single sub-word piece
                # pieces missing from the vocabulary fall back to the 'unk' entry
                one_line.append(vocabulary.get(word, vocabulary['unk']))
        sentences.append(one_line)
    return sentences
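
As a small illustration of what this conversion does, the sketch below flattens the nested sub-word pieces of one example into a single list of indices. The vocabulary is made up and far smaller than the real one, and `to_indices` is a hypothetical compact equivalent of the helper above.

```python
# Made-up vocabulary; index 0 is left unused, as in the notebook.
toy_vocabulary = {"unk": 1, "▁the": 2, "▁l": 3, "aces": 4}

def to_indices(pieces_per_word, vocabulary, fallback="unk"):
    """Flatten [[piece, ...], ...] into one list of vocabulary indices."""
    return [vocabulary.get(piece, vocabulary[fallback])
            for pieces in pieces_per_word
            for piece in pieces]

# one encoded example: a known word, a word split in two, and three unknown pieces
encoded_song = [["▁the"], ["▁l", "aces"], ["▁", "\\", "("]]
indices = to_indices(encoded_song, toy_vocabulary)
print(indices)
```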

In [16]:
sentences = subword2index(texts, vocabulary)

In [17]:
# show some examples
print(sentences[0:2])

[[387, 2220, 3407, 1750, 43, 3178, 7819, 7, 9938, 8, 2962, 2287, 1750, 27, 8, 9424, 1750, 1500, 115, 1221, 44, 2128, 216, 81, 5, 1888, 1750, 2308, 882, 9921, 1843, 876, 2096, 5905, 405, 83, 216, 8, 492, 6018, 2041, 96, 539, 1242, 3229, 81, 8, 3234, 9913, 4993, 9943, 27, 9922, 566, 35, 160, 9922, 566, 9913, 4993, 9942, 121, 882, 9921, 5452, 387, 1166, 216, 4627, 21, 249, 1750, 35, 3287, 424, 702, 21, 249, 2308, 882, 9921, 2069, 216, 2647, 1750, 108, 5, 9917, 48, 9938, 9916, 72, 8707, 387, 2357, 108, 882, 9921, 72, 1750, 9913, 4993, 9943, 25, 752, 9913, 4993, 9942, 1750, 108, 882, 9921, 72, 387, 9938, 9928, 6123, 7, 9938, 1750, 156, 9938, 159, 48, 82, 5339, 3770, 7, 9938, 1069, 108, 882, 9921, 8898, 64, 10, 2538, 835, 1105, 27, 8258, 10, 1750, 2171, 40, 8, 2940, 387, 87, 115, 8, 4471, 28, 8, 736, 1750, 11, 2107, 7, 9938, 5, 1617, 387, 2902, 8, 5210, 1011, 1026, 27, 1740, 1750, 1221, 1657, 2171, 392, 108, 882, 9921, 72, 216, 206, 1345, 21, 2770, 1750, 387, 2515, 48, 9938, 9916, 823, 6896,

Check what the sub-word level data looks like in terms of example lengths

In [18]:
# See sub-word level length
length = [len(sentence) for sentence in sentences]
print('The max length is: ', max(length))
print('The min length is: ', min(length))
print('The average length is: ', sum(length)/len(length))

The max length is:  2441
The min length is:  6
The average length is:  304.57183333333336


The gap between the longest example and the average is even bigger now.

Get ready for the model inputs

In [19]:
# Padding
from keras.preprocessing.sequence import pad_sequences
padded_sentences = pad_sequences(sentences, maxlen=max(length), padding='post')

# Convert to numpy array
padded_sentences = np.array(padded_sentences)
padded_sentences.shape

(12000, 2441)
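
With `padding='post'` and `maxlen` set to the longest example, every shorter sequence gets zeros appended at the end and nothing is truncated. A pure-Python sketch of that behaviour, with made-up index sequences and a hypothetical `pad_post` helper:

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

batch = [[387, 2220], [8, 27, 43, 5], [7]]  # made-up index sequences
padded = pad_post(batch)
for row in padded:
    print(row)
```

Because the padding value is 0, it cannot be confused with a real token as long as the vocabulary indices start at 1.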

Now it is time to split our dataset

In [20]:
# Shuffle data
np.random.seed(23) # !!!!!! use the same seed value in all dataset split versions !!!!!!
shuffle_indices = np.random.permutation(np.arange(len(y_artist))) 
# shuffle all inputs with the same indices
x_shuffled = padded_sentences[shuffle_indices]
y_artist_shuffled = y_artist[shuffle_indices]
y_genre_shuffled = y_genre[shuffle_indices]

# form count dictionaries for both artist and genre labels
# first, artist labels
artist_label_count_dict = dict()
for i in range(120):
    artist_label_count_dict[i] = 0
# then also for genre labels
genre_label_count_dict = dict()
for i in range(12):
    genre_label_count_dict[i] = 0
    
'''Now, we'll go through the data examples one by one. For each artist label, the first 80 occurrences go to the 
training set, the 81st to 90th occurrences go to the validation set and the 91st to 100th occurrences (the last 10) 
go to the test set.
For genre labels, the first 800 occurrences of each label go to the training set, the 801st to 900th occurrences 
go to the validation set and the 901st to 1000th occurrences (the last 100) go to the test set.'''

# create training, validation and test sets with equal distributions of artists and genres
x_tr_artist, x_val_artist, x_te_artist = list(), list(), list()
x_tr_genre, x_val_genre, x_te_genre = list(), list(), list()
y_tr_artist, y_val_artist, y_te_artist = list(), list(), list()
y_tr_genre, y_val_genre, y_te_genre = list(), list(), list()

for sample, art_label, gen_label in zip(x_shuffled, y_artist_shuffled, y_genre_shuffled):
    artist_label_index = np.argmax(art_label, axis=-1)
    genre_label_index = np.argmax(gen_label, axis=-1)
    # for artist labels
    if artist_label_count_dict[artist_label_index] < 80:
        x_tr_artist.append(sample)
        y_tr_artist.append(art_label)
    elif 80 <= artist_label_count_dict[artist_label_index] < 90:
        x_val_artist.append(sample)
        y_val_artist.append(art_label)
    elif 90 <= artist_label_count_dict[artist_label_index] < 100:
        x_te_artist.append(sample)
        y_te_artist.append(art_label)
    else:
        print("There is an error with artist counts!")
    artist_label_count_dict[artist_label_index] += 1
        
    # for genre labels
    if genre_label_count_dict[genre_label_index] < 800:
        x_tr_genre.append(sample)
        y_tr_genre.append(gen_label)
    elif 800 <= genre_label_count_dict[genre_label_index] < 900:
        x_val_genre.append(sample)
        y_val_genre.append(gen_label)
    elif 900 <= genre_label_count_dict[genre_label_index] < 1000:
        x_te_genre.append(sample)
        y_te_genre.append(gen_label)
    else:
        print("There is an error with genre counts!")
    genre_label_count_dict[genre_label_index] += 1

        
        
# turn the output datasets into np arrays
x_tr_artist = np.array(x_tr_artist)
x_val_artist = np.array(x_val_artist)
x_te_artist = np.array(x_te_artist)
x_tr_genre = np.array(x_tr_genre)
x_val_genre = np.array(x_val_genre)
x_te_genre = np.array(x_te_genre)
y_tr_artist = np.array(y_tr_artist)
y_val_artist = np.array(y_val_artist)
y_te_artist = np.array(y_te_artist)
y_tr_genre = np.array(y_tr_genre)
y_val_genre = np.array(y_val_genre)
y_te_genre  = np.array(y_te_genre)
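
The splitting rule above can be condensed into a small function, which also makes it easy to check on toy data. This is a sketch under simplified assumptions: labels are plain strings instead of one-hot rows, the quotas are tiny, and `quota_split` is a hypothetical name.

```python
from collections import defaultdict

def quota_split(labels, n_train, n_val, n_test):
    """Assign each position to train/val/test by how often its label was seen so far."""
    seen = defaultdict(int)
    split = {"train": [], "val": [], "test": []}
    for position, label in enumerate(labels):
        count = seen[label]
        if count < n_train:
            split["train"].append(position)
        elif count < n_train + n_val:
            split["val"].append(position)
        elif count < n_train + n_val + n_test:
            split["test"].append(position)
        else:
            raise ValueError(f"label {label!r} occurs more often than expected")
        seen[label] += 1
    return split

labels = ["a", "b", "a", "a", "b", "b", "a", "b"]  # made-up labels, already shuffled
split = quota_split(labels, n_train=2, n_val=1, n_test=1)
print(split)
```

Note that in the notebook the artist and genre splits are built independently of each other, so the same example can end up in the training set for one task and in the validation or test set for the other.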

Finally prepare the embeddings

In [21]:
from keras.layers import Embedding

model = bpemb_en.emb

embedding_dim = 50 # this depends on the embedding model we've chosen
# create a dummy embedding weights matrix to be filled later on
embedding_weights = np.zeros((len(vocabulary) + 1, embedding_dim)) 

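# row i holds the pretrained vector of the subword with index i;
# row 0 (padding) and subwords without a vector stay all zeros
for subword, i in vocabulary.items():
    if subword in model.vocab:
        embedding_vector = model[subword]
        if embedding_vector is not None:
            embedding_weights[i] = embedding_vector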
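
The same construction on a toy scale may make the layout of the weight matrix easier to see. In this sketch plain lists stand in for the NumPy array, the vectors are made up, and `build_embedding_rows` is a hypothetical helper.

```python
def build_embedding_rows(vocabulary, vectors, dim):
    """Row i holds the vector of the token mapped to index i; row 0 stays all zeros."""
    rows = [[0.0] * dim for _ in range(len(vocabulary) + 1)]
    for token, index in vocabulary.items():
        if token in vectors:
            rows[index] = list(vectors[token])
    return rows

toy_vocabulary = {"▁the": 1, "▁moon": 2, "xyz": 3}
toy_vectors = {"▁the": [0.1, 0.2], "▁moon": [0.3, 0.4]}  # made-up 2-d vectors
weights = build_embedding_rows(toy_vocabulary, toy_vectors, dim=2)
for row in weights:
    print(row)
```

The token `"xyz"` has no vector in this toy example, so its row stays zero, just like the padding row at index 0.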



In [22]:
# some examples
print(embedding_weights[0:2])

[[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [ 1.02473402  0.38355899 -0.079484    0.253795    1.75872803 -0.229478
  -0.96081603 -0.009623    0.24513    -0.30851501 -0.46463799  0.82689101
  -0.052177    0.57460701 -0.184591   -0.302048   -0.099645    0.755027
  -0.46508199  0.115942    0.027584    0.92594999 -0.66696501 -0.133093
   0.95801401  0.636563   -0.432753    0.27072999 -0.056136    0.597776
   0.081105    0.48131901 -0.13154     0.120166    0.480726    0.309266
  -0.845996

In [26]:
# store all important variables in pickle files to a folder specific to the sub-word model
# (the helper keeps its name from the character-level notebook)
def writePickleChar(Variable, fname):
    with open("pickle_vars/sub_word/" + fname + ".pkl", 'wb') as f:
        pickle.dump(Variable, f)

writePickleChar(embedding_weights, "embedding_weights")
writePickleChar(x_tr_artist, "x_tr_artist")
writePickleChar(x_val_artist, "x_val_artist")
writePickleChar(x_te_artist, "x_te_artist")
writePickleChar(x_tr_genre, "x_tr_genre")
writePickleChar(x_val_genre, "x_val_genre")
writePickleChar(x_te_genre, "x_te_genre")
writePickleChar(y_tr_artist, "y_tr_artist")
writePickleChar(y_val_artist, "y_val_artist")
writePickleChar(y_te_artist, "y_te_artist")
writePickleChar(y_tr_genre, "y_tr_genre")
writePickleChar(y_val_genre, "y_val_genre")
writePickleChar(y_te_genre, "y_te_genre")
writePickleChar(len(vocabulary), "vocab_size")
writePickleChar(length, "sequence_length")

In [23]:
y_te_genre

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]], dtype=float32)