### In this notebook, we'll prepare our dataset for the character model

Start by defining some pickle helper functions

In [1]:
import pickle
def writePickle(Variable, fname):
    filename = fname + ".pkl"
    with open("pickle_vars/" + filename, 'wb') as f:
        pickle.dump(Variable, f)

def readPickle(fname):
    filename = "pickle_vars/" + fname + ".pkl"
    with open(filename, 'rb') as f:
        return pickle.load(f)

def readPicklePast(fname):
    # same as readPickle, but relative to the parent directory
    filename = "../pickle_vars/" + fname + ".pkl"
    with open(filename, 'rb') as f:
        return pickle.load(f)
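
The helpers above assume that the `pickle_vars/` folder already exists; otherwise `open` raises `FileNotFoundError`. Below is a minimal sketch of a variant that creates the folder on demand. The names `write_pickle_safe` and `read_pickle_safe` and the `base_dir` parameter are hypothetical and not part of this notebook.

```python
import os
import pickle
import tempfile

def write_pickle_safe(variable, fname, base_dir="pickle_vars"):
    """Pickle `variable` to <base_dir>/<fname>.pkl, creating base_dir if needed."""
    os.makedirs(base_dir, exist_ok=True)
    path = os.path.join(base_dir, fname + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(variable, f)
    return path

def read_pickle_safe(fname, base_dir="pickle_vars"):
    """Load the object stored in <base_dir>/<fname>.pkl."""
    path = os.path.join(base_dir, fname + ".pkl")
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip in a throwaway directory so nothing in the project is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "pickle_vars")
    write_pickle_safe({"rock": 1, "pop": 2}, "genre2id_demo", base_dir=demo_dir)
    restored = read_pickle_safe("genre2id_demo", base_dir=demo_dir)

print(restored)
```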

Also load the complete sub_dataset from the CSV file and get every example together with its artist label and genre label, both one-hot encoded (categorical)

In [2]:
import string
import numpy as np
import pandas as pd
from keras.utils import to_categorical  # the np_utils submodule is not available in newer Keras releases

def load_data():
    
    data = pd.read_csv('sub_dataset.csv', header=None)
    data = data.dropna()

    x = data[2]
    x = np.array(x)

    y_artist = data[0] - 1
    y_artist = to_categorical(y_artist)
    
    y_genre = data[1] - 1
    y_genre = to_categorical(y_genre)
    
    return (x, y_artist, y_genre)
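
`load_data` subtracts 1 from each label before encoding, which suggests that the ids in the CSV start at 1, while `to_categorical` expects classes numbered from 0. Here is a NumPy-only sketch of the same idea on toy labels. The values are made up, and `one_hot` is a hypothetical stand-in for `to_categorical`, not the Keras function itself.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Return a (n_samples, num_classes) one-hot matrix for 0-based integer labels."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1
    return np.eye(num_classes, dtype="float32")[labels]

raw_ids = np.array([1, 3, 2, 3])      # toy 1-based ids, as assumed for the CSV
encoded = one_hot(raw_ids - 1)        # shift to 0-based before encoding

# Decoding reverses both steps: argmax gives the 0-based class, +1 restores the id
recovered_ids = 1 + np.argmax(encoded, axis=-1)
print(encoded)
print(recovered_ids)
```

The `1 + np.argmax(...)` expression is the same decoding step used when looking labels up in `id2artist` and `id2genre`.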

In [3]:
x, y_artist, y_genre = load_data()

In [4]:
# also load the genre and artist id mapping dictionaries
artist2id = readPickle("artist2id")
id2artist = readPickle("id2artist")
genre2id = readPickle("genre2id")
id2genre = readPickle("id2genre")

In [5]:
# an example: the lyrics at index 100, with the one-hot labels decoded back to names
print(x[100], '\n', y_artist[100], id2artist[1 + np.argmax(y_artist[100], axis=-1)], '\n',
      y_genre[100], id2genre[1 + np.argmax(y_genre[100], axis=-1)])

The spacious firmament on high,
With all the blue ethereal sky,
And spangled heavens, a shining frame
Their great Original proclaim.
The unwearied sun, from day to day,
Does his Creator's powers display,
And publishes to every land
The work of an Almighty Hand.

Soon as the evening shades prevail
The moon takes up the wondrous tale,
And nightly to the listening earth
Repeats the story of her birth;
While all the stars that round her burn
And all the planets in their turn,
Confirm the tidings as they roll,
And spread the truth from pole to pole.

What though in solemn silence all
Move round the dark terrestrial ball?
What though no real voice nor sound
Amid the radiant orbs be found?
In reason's ear they all rejoice,
And utter forth a glorious voice,
Forever singing as they shine,
"The hand that made us is divine."

Scriptural Reference:

"The heavens declare the glory of God; the skies proclaim the work of His hands." Psalm 19:1 
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 

In this preprocessing notebook, we'll build character embeddings derived from GloVe word embeddings: every character that appears as a token of its own in GloVe gets that token's vector.
We therefore need to download the GloVe data first

In [6]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2020-10-28 08:39:37--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-10-28 08:39:37--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-10-28 08:39:38--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’

glove

In [7]:
# unzip the file
import zipfile
with zipfile.ZipFile('glove.6B.zip', 'r') as zip_ref:
    zip_ref.extractall('glove')

In [8]:
# get the embeddings and write them in a dictionary
import os
import io
embeddings_index = {}
with io.open('glove/glove.6B.300d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:],dtype='float32')
        embeddings_index[word] = coefs
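
Each line of a GloVe text file holds a token followed by its vector components, separated by spaces. Here is a sketch of the parsing loop on a tiny in-memory stand-in. The three 4-dimensional vectors are invented for illustration; the real file used above has 300 components per line.

```python
import io
import numpy as np

# Toy stand-in for glove.6B.300d.txt (values are made up)
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    ", 0.5 0.6 0.7 0.8\n"
    "a -0.1 -0.2 -0.3 -0.4\n"
)

toy_index = {}
for line in fake_glove:
    values = line.rstrip().split(" ")
    token = values[0]                                   # first field is the token
    toy_index[token] = np.asarray(values[1:], dtype="float32")  # the rest is the vector

print(sorted(toy_index))
print(toy_index["a"].shape)
```

Single-character tokens such as `a` or `,` are ordinary entries in the vocabulary, which is what this notebook relies on when it reuses word vectors as character vectors.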

In [9]:
# see an example
list(embeddings_index.items())[100:102]

[('so', array([-2.4561e-01,  6.8010e-02,  1.8254e-01, -2.9551e-01,  1.8007e-02,
          1.6811e-01,  1.9109e-01,  6.3650e-02,  2.1243e-01, -2.1165e+00,
          3.7864e-01, -1.0974e-01, -2.0340e-01,  8.5525e-02, -3.2146e-01,
          4.8442e-02, -2.7847e-01, -2.7883e-01,  1.2850e-01, -1.0495e-01,
          1.2134e-01,  4.5983e-01,  1.2976e-01,  3.0153e-02, -3.3020e-01,
          2.3499e-01,  1.5116e-01, -2.5010e-01, -1.2135e-01,  4.0642e-01,
          5.2263e-02,  3.2016e-01, -4.2218e-01, -1.5186e-01, -1.0565e+00,
          3.0503e-01, -1.1422e-03, -9.0414e-02, -7.5197e-02, -7.9242e-02,
         -3.4756e-02, -1.4131e-01, -1.9038e-01, -7.7423e-02,  4.8919e-02,
          2.4843e-01,  1.9439e-01, -1.0053e-01,  1.1012e-01,  3.0298e-01,
          2.9027e-01, -2.7237e-01,  8.6791e-02, -1.4677e-01, -3.8868e-01,
          2.1407e-01,  1.4234e-02,  5.5393e-02,  1.7376e-01,  6.5738e-02,
          4.0475e-01, -2.9207e-02,  1.8231e-01,  4.0401e-01, -1.5855e-01,
         -2.3270e-01,  2.3904e-0

Now we'd like to see the complete set of characters used in the whole dataset. We'll also check whether each of these characters is present as a key in the GloVe embedding index. If it is, the character is added to our ultimate alphabet; otherwise it is discarded

In [10]:
characters = []
for song in x:
    for char in song:
        characters.append(char)

        
alphabet = list(set(characters))

vocab_size = len(alphabet)

# create dictionaries for letter indexing
vocab = {}
reverse_vocab = {}
for i, letter in enumerate(alphabet):
    vocab[letter] = i
    reverse_vocab[i] = letter

print(alphabet)

['о', 'í', '小', 'Ö', 'h', 'r', '本', '=', '抰', ';', 'G', 'Ü', '\n', 'Þ', 'J', '2', '＇', 'ò', 'て', 'ê', '/', '\\', 't', 'à', 'M', '_', '-', '㠱', ' ', 'R', '$', ',', ':', 'g', 'U', 'ö', '^', '抦', '}', 'ô', 'F', '°', 'V', 'u', '©', 'N', 'l', 'る', '%', '\t', 'A', 'û', 'ß', '㈲', 'c', '•', 'ä', '8', 'ð', 'W', 'и', 'i', 'q', 'I', '(', '?', 'ó', 'è', 'B', '”', ')', 'Г', '@', '抳', '“', '’', 'C', '<', '|', 'â', 'T', 'x', '‘', '¡', 'P', '…', '9', 'm', '抎', '抮', '1', 'ñ', 'á', '緼', '駉', '3', 'v', 'D', 'ù', 'S', 'в', '—', 'ú', '.', '抣', 'þ', 's', '³', '′', '£', 'ç', '!', '㬸', 'Ó', '>', 'Y', 'f', 'у', '"', 'æ', 'a', '⌦', '㬳', 'ü', 'o', '¿', 'г', 'п', '®', '㬱', 'j', 'n', 'b', '&', 'ì', '[', 'ý', 'н', 'L', 'ï', '~', 'O', '愛', '¦', 'H', '+', '‰', '抯', 'K', 'й', 'し', 'z', 'd', '–', '子', 'е', '*', '¯', 'ī', 'é', 'л', 'k', '{', 'E', 'Æ', '4', "'", ']', '\u2028', 'e', '脚', 'ṗ', 'Z', '0', 'p', 'ë', 'а', '5', 'X', '7', 'р', 'É', 'À', '#', 'У', 'Q', 'ㄲ', '6', 'y', 'ã', 'w']


In [11]:
ultimate_alphabet = list()
for char in alphabet:
    # test membership on the dictionary itself (hash lookup instead of a list scan)
    if char in embeddings_index:
        ultimate_alphabet.append(char)
print(ultimate_alphabet, len(ultimate_alphabet))

['о', 'í', 'h', 'r', '=', ';', '2', 'ò', 'ê', '/', '\\', 't', 'à', '_', '-', '$', ',', ':', 'g', 'ö', '^', '}', 'ô', 'u', 'l', '%', 'û', 'ß', 'c', 'ä', '8', 'ð', 'и', 'i', 'q', '(', '?', 'ó', 'è', '”', ')', '@', '“', '’', '<', '|', 'â', 'x', '‘', '¡', '…', '9', 'm', '1', 'ñ', 'á', '3', 'v', 'в', '—', 'ú', '.', 'þ', 's', '£', 'ç', '!', '>', 'f', 'у', '"', 'æ', 'a', 'ü', 'o', '¿', 'г', 'п', 'j', 'n', 'b', '&', '[', 'ý', 'н', 'ï', '~', '+', 'z', 'd', '–', 'е', '*', 'ī', 'é', 'л', 'k', '{', '4', "'", ']', 'e', '0', 'p', 'ë', 'а', '5', '7', '#', '6', 'y', 'ã', 'w'] 113
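
The filter keeps a character only when it appears as a key in the embedding index. Here is a sketch on toy data; the lyrics and the set of embedding keys below are made up.

```python
toy_lyrics = ["Hello\nWorld", "héllo 小"]
# Toy stand-in for embeddings_index; only the keys matter here
toy_embedding_keys = {"h": None, "e": None, "l": None, "o": None,
                      "é": None, "w": None, "r": None, "d": None}

# Collect every distinct character that occurs in the lyrics
toy_alphabet = set()
for lyric in toy_lyrics:
    toy_alphabet.update(lyric)

kept = sorted(ch for ch in toy_alphabet if ch in toy_embedding_keys)
dropped = sorted(ch for ch in toy_alphabet if ch not in toy_embedding_keys)
print(kept)
print(dropped)
```

In this toy run the uppercase letters, the space, the line break and the CJK character are dropped because none of them is a key in the toy index.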


At this point we have to consider several important things: <br>
1- All our characters with vectors are lowercase. Therefore we need to convert our dataset to lowercase.<br>
2- We need additional vectors for the space and line break characters. The choice of keeping the line break might be crucial.<br>
3- We also need a vector for unknown characters.<br>
4- Finally, we need our dataset in character format.<br>

In [12]:
# start by updating the embeddings index

# Add 'linebreak' vector to the embeddings_index: the mean of all GloVe vectors
linebreak_vector = 0
for word_vector in embeddings_index.values():
    linebreak_vector += word_vector
linebreak_vector /= len(embeddings_index)

# Add 'unknown' vector to the embeddings_index
mu, sigma = 0, 0.1 # mean and standard deviation
unk_vector = np.random.normal(mu, sigma, 300)

# Add 'space' vector to the embeddings_index
mu, sigma = 0, 0.2 # mean and standard deviation
space_vector = np.random.normal(mu, sigma, 300)

embeddings_index['\n'] = linebreak_vector
embeddings_index['UNK'] = unk_vector
embeddings_index[' '] = space_vector

# also update our ultimate character dictionary
ultimate_alphabet.append("UNK")
ultimate_alphabet.append(" ")
ultimate_alphabet.append("\n")

vocab_size = len(ultimate_alphabet)
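
No seed has been set at this point in the notebook, so the `UNK` and space vectors drawn with `np.random.normal` differ between runs, and so does the saved embedding matrix. If reproducibility matters, one option is a seeded generator. Here is a sketch, assuming a seed of 23 (chosen only to mirror the seed used later for shuffling) and a toy 5-dimensional index in place of the GloVe data; `special_vectors` is a hypothetical helper.

```python
import numpy as np

dim = 5
# Toy stand-in for embeddings_index
toy_index = {
    "a": np.ones(dim, dtype="float32"),
    "b": np.full(dim, 3.0, dtype="float32"),
}

def special_vectors(index, dim, seed=23):
    """Return (linebreak, unk, space) vectors; the random ones are reproducible."""
    rng = np.random.default_rng(seed)
    linebreak = np.mean(list(index.values()), axis=0)  # mean of all known vectors
    unk = rng.normal(0.0, 0.1, dim)                    # same mean/std as in the cell above
    space = rng.normal(0.0, 0.2, dim)
    return linebreak, unk, space

lb1, unk1, sp1 = special_vectors(toy_index, dim)
lb2, unk2, sp2 = special_vectors(toy_index, dim)
print(lb1)
print(np.array_equal(unk1, unk2), np.array_equal(sp1, sp2))
```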


In [13]:
# convert our ultimate_alphabet to a dictionary
ultimate_alphabet_dictionary = dict()
for i, char in enumerate(ultimate_alphabet):
    ultimate_alphabet_dictionary[char] = i + 1
print(ultimate_alphabet_dictionary)

{'о': 1, 'í': 2, 'h': 3, 'r': 4, '=': 5, ';': 6, '2': 7, 'ò': 8, 'ê': 9, '/': 10, '\\': 11, 't': 12, 'à': 13, '_': 14, '-': 15, '$': 16, ',': 17, ':': 18, 'g': 19, 'ö': 20, '^': 21, '}': 22, 'ô': 23, 'u': 24, 'l': 25, '%': 26, 'û': 27, 'ß': 28, 'c': 29, 'ä': 30, '8': 31, 'ð': 32, 'и': 33, 'i': 34, 'q': 35, '(': 36, '?': 37, 'ó': 38, 'è': 39, '”': 40, ')': 41, '@': 42, '“': 43, '’': 44, '<': 45, '|': 46, 'â': 47, 'x': 48, '‘': 49, '¡': 50, '…': 51, '9': 52, 'm': 53, '1': 54, 'ñ': 55, 'á': 56, '3': 57, 'v': 58, 'в': 59, '—': 60, 'ú': 61, '.': 62, 'þ': 63, 's': 64, '£': 65, 'ç': 66, '!': 67, '>': 68, 'f': 69, 'у': 70, '"': 71, 'æ': 72, 'a': 73, 'ü': 74, 'o': 75, '¿': 76, 'г': 77, 'п': 78, 'j': 79, 'n': 80, 'b': 81, '&': 82, '[': 83, 'ý': 84, 'н': 85, 'ï': 86, '~': 87, '+': 88, 'z': 89, 'd': 90, '–': 91, 'е': 92, '*': 93, 'ī': 94, 'é': 95, 'л': 96, 'k': 97, '{': 98, '4': 99, "'": 100, ']': 101, 'e': 102, '0': 103, 'p': 104, 'ë': 105, 'а': 106, '5': 107, '7': 108, '#': 109, '6': 110, 'y': 1

In [14]:
# Prepare embedding weights
embedding_dim = 300 # since we have selected the 300-dimensional GloVe embedding
embedding_matrix = np.zeros((vocab_size+1, embedding_dim)) # placeholder for our matrix; row 0 is reserved for padding
for char, i in ultimate_alphabet_dictionary.items(): 
    embedding_vector = embeddings_index.get(char) # returns None if the character is not in the dict
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else: # for a character without a vector, assign the UNK vector
        embedding_vector = embeddings_index.get('UNK')
        embedding_matrix[i] = embedding_vector
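
Row 0 of the matrix stays all zeros because index 0 is reserved for padding, which is why the character indices start at 1 and the matrix has `vocab_size + 1` rows. Here is a toy sketch of the same construction; the 3-dimensional vectors and the index mapping are invented.

```python
import numpy as np

toy_dim = 3
toy_vectors = {
    "a": np.array([1.0, 1.0, 1.0]),
    "b": np.array([2.0, 2.0, 2.0]),
    "UNK": np.array([9.0, 9.0, 9.0]),
}
toy_char2id = {"a": 1, "b": 2, "UNK": 3}   # indices start at 1; 0 is padding

toy_matrix = np.zeros((len(toy_char2id) + 1, toy_dim))
for char, idx in toy_char2id.items():
    # fall back to the UNK vector when a character has no vector of its own
    toy_matrix[idx] = toy_vectors.get(char, toy_vectors["UNK"])

print(toy_matrix)
```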

In [15]:
# see the first three rows: the padding row (all zeros), then the rows for indices 1 and 2 ('о' and 'í' in the dictionary printed above)
print(embedding_matrix[0:3])

[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000

In [16]:
# then convert our dataset to character format with lowercase letters only

# split a song's lyrics into its individual characters, each converted to lowercase
def split(word): 
    return [char.lower() for char in word]  

x_splitted = list() # a new list holding each lyric as a list of character indices
for lyric in x:
    splitted_lyric = split(lyric)
    splitted_lyric_indices = list()
    for char in splitted_lyric:
        try:
            splitted_lyric_indices.append(ultimate_alphabet_dictionary[char])
        except KeyError: # character is not in the ultimate alphabet
            splitted_lyric_indices.append(ultimate_alphabet_dictionary['UNK'])
    x_splitted.append(splitted_lyric_indices)
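
The lookup with a fallback can also be written with `dict.get`, which avoids exception handling altogether. Here is a sketch on a toy dictionary; the indices are invented and `encode` is a hypothetical helper.

```python
toy_char2id = {"h": 1, "i": 2, " ": 3, "\n": 4, "UNK": 5}

def encode(text, char2id):
    """Lowercase each character and map it to its index, or to UNK if unseen."""
    unk_id = char2id["UNK"]
    return [char2id.get(ch.lower(), unk_id) for ch in text]

print(encode("Hi\nhi 小", toy_char2id))
```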


Check what the character-level data looks like in terms of example lengths

In [17]:
# See character-level sequence lengths
length = [len(lyric) for lyric in x_splitted]
print('The max length is: ', max(length))
print('The min length is: ', min(length))
print('The average length is: ', sum(length)/len(length))

The max length is:  8303
The min length is:  17
The average length is:  1065.56675


Get ready for the model inputs

In [18]:
# Padding
from keras.preprocessing.sequence import pad_sequences
padded_sentences = pad_sequences(x_splitted, maxlen=max(length), padding='post')

# Convert to numpy array
padded_sentences = np.array(padded_sentences)
padded_sentences.shape

(12000, 8303)
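
With `padding='post'`, shorter sequences are filled with zeros at the end until they reach `maxlen`. Here is a pure-Python sketch of that behaviour on toy sequences. `pad_post` is a hypothetical stand-in, not the Keras function, and it performs no truncation.

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` to each sequence until it has `maxlen` items."""
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

toy_sequences = [[5, 2, 7], [4], [1, 1]]
toy_maxlen = max(len(seq) for seq in toy_sequences)
padded = pad_post(toy_sequences, toy_maxlen)
print(padded)
```

Index 0 is free to act as the padding value because the character indices start at 1.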

Now it is time to split our dataset

In [19]:
# Shuffle data
np.random.seed(23) # !!!!!! use the same seed value in all dataset split versions !!!!!!
shuffle_indices = np.random.permutation(np.arange(len(y_artist))) 
# shuffle all inputs with the same indices
x_shuffled = padded_sentences[shuffle_indices]
y_artist_shuffled = y_artist[shuffle_indices]
y_genre_shuffled = y_genre[shuffle_indices]

# form count dictionaries for both artist and genre labels
# first, artist labels
artist_label_count_dict = dict()
for i in range(120):
    artist_label_count_dict[i] = 0
# then also for genre labels
genre_label_count_dict = dict()
for i in range(12):
    genre_label_count_dict[i] = 0
    
'''Now, we'll go through the data examples one by one. While an artist label has occurred fewer than 80 times, the sample 
goes to the training set; the 81st to 90th occurrences go to the validation set and the 
91st to 100th occurrences (the last 10) go to the test set.
For genre labels, while a genre label has occurred fewer than 800 times, the sample goes to the training set; the 801st to 
900th occurrences go to the validation set and the 901st to 1000th occurrences (the last 100) 
go to the test set.'''

# create training, validation and test sets with equal distributions of artists and genres
x_tr_artist, x_val_artist, x_te_artist = list(), list(), list()
x_tr_genre, x_val_genre, x_te_genre = list(), list(), list()
y_tr_artist, y_val_artist, y_te_artist = list(), list(), list()
y_tr_genre, y_val_genre, y_te_genre = list(), list(), list()

for sample, art_label, gen_label in zip(x_shuffled, y_artist_shuffled, y_genre_shuffled):
    artist_label_index = np.argmax(art_label, axis=-1)
    genre_label_index = np.argmax(gen_label, axis=-1)
    # for artist labels
    if artist_label_count_dict[artist_label_index] < 80:
        x_tr_artist.append(sample)
        y_tr_artist.append(art_label)
    elif 80 <= artist_label_count_dict[artist_label_index] < 90:
        x_val_artist.append(sample)
        y_val_artist.append(art_label)
    elif 90 <= artist_label_count_dict[artist_label_index] < 100:
        x_te_artist.append(sample)
        y_te_artist.append(art_label)
    else:
        print("There is an error with artist counts!")
    artist_label_count_dict[artist_label_index] += 1
        
    # for genre labels
    if genre_label_count_dict[genre_label_index] < 800:
        x_tr_genre.append(sample)
        y_tr_genre.append(gen_label)
    elif 800 <= genre_label_count_dict[genre_label_index] < 900:
        x_val_genre.append(sample)
        y_val_genre.append(gen_label)
    elif 900 <= genre_label_count_dict[genre_label_index] < 1000:
        x_te_genre.append(sample)
        y_te_genre.append(gen_label)
    else:
        print("There is an error with genre counts!")
    genre_label_count_dict[genre_label_index] += 1

        
        
# turn the output datasets in np arrays
x_tr_artist = np.array(x_tr_artist)
x_val_artist = np.array(x_val_artist)
x_te_artist = np.array(x_te_artist)
x_tr_genre = np.array(x_tr_genre)
x_val_genre = np.array(x_val_genre)
x_te_genre = np.array(x_te_genre)
y_tr_artist = np.array(y_tr_artist)
y_val_artist = np.array(y_val_artist)
y_te_artist = np.array(y_te_artist)
y_tr_genre = np.array(y_tr_genre)
y_val_genre = np.array(y_val_genre)
y_te_genre  = np.array(y_te_genre)
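
The loop above assigns each sample to a split according to how many samples of the same class have been seen so far. Here is a scaled-down sketch of the same counting scheme, using integer class ids and per-class sizes of 3/1/1 in place of 80/10/10. All values are toy assumptions and `split_by_count` is a hypothetical helper.

```python
from collections import defaultdict

def split_by_count(labels, n_train, n_val, n_test):
    """Return index lists (train, val, test) with fixed per-class sizes."""
    seen = defaultdict(int)          # how many samples of each class so far
    train, val, test = [], [], []
    for i, label in enumerate(labels):
        count = seen[label]
        if count < n_train:
            train.append(i)
        elif count < n_train + n_val:
            val.append(i)
        elif count < n_train + n_val + n_test:
            test.append(i)
        else:
            raise ValueError(f"class {label} has more samples than expected")
        seen[label] += 1
    return train, val, test

toy_labels = [0, 1] * 5              # two classes, five samples each
train_idx, val_idx, test_idx = split_by_count(toy_labels, n_train=3, n_val=1, n_test=1)
print(train_idx, val_idx, test_idx)
```

Raising an error instead of printing a message is a design choice of this sketch; it stops the run as soon as a class exceeds the expected count.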

In [21]:
# store all important variables in pickle files in a folder specific to the character model
def writePickleChar(Variable, fname):
    filename = fname +".pkl"
    with open("pickle_vars/character/"+filename, 'wb') as f:
        pickle.dump(Variable, f)

writePickleChar(embedding_matrix, "embedding_weights")
writePickleChar(x_tr_artist, "x_tr_artist")
writePickleChar(x_val_artist, "x_val_artist")
writePickleChar(x_te_artist, "x_te_artist")
writePickleChar(x_tr_genre, "x_tr_genre")
writePickleChar(x_val_genre, "x_val_genre")
writePickleChar(x_te_genre, "x_te_genre")
writePickleChar(y_tr_artist, "y_tr_artist")
writePickleChar(y_val_artist, "y_val_artist")
writePickleChar(y_te_artist, "y_te_artist")
writePickleChar(y_tr_genre, "y_tr_genre")
writePickleChar(y_val_genre, "y_val_genre")
writePickleChar(y_te_genre, "y_te_genre")
writePickleChar(vocab_size, "vocab_size")
writePickleChar(length, "sequence_length")