### In this notebook, we'll retrieve a subset of our main dataset and write it to a csv file, so that the same subset can be used by all our mini models

Start by defining helpers for reading and writing pickle files, then load the pickled variables we'll need

In [1]:
import pickle

def writePickle(Variable, fname):
    # save a variable as pickle_vars/<fname>.pkl
    filename = "pickle_vars/" + fname + ".pkl"
    with open(filename, 'wb') as f:
        pickle.dump(Variable, f)

def readPickle(fname):
    # load pickle_vars/<fname>.pkl
    filename = "pickle_vars/" + fname + ".pkl"
    with open(filename, 'rb') as f:
        return pickle.load(f)

def readPicklePast(fname):
    # same as readPickle, but looks in the parent directory's pickle_vars folder
    filename = "../pickle_vars/" + fname + ".pkl"
    with open(filename, 'rb') as f:
        return pickle.load(f)

In [2]:
EN_PRGE_metadata_dict = readPicklePast("EN_PRGE_metadata_dict") # metadata dataset for English-only songs with raw and projected genre labels
children_to_parent_genre_dict = readPicklePast("children_to_parent_genre_dict") # a dictionary that maps all children genre labels to parent genre labels

We aim to form a sub-dataset that has the following properties: <br>
- It will have 10 artists from each parent genre
- The parent genre set will exclude 'Rest' and include all the remaining genres (i.e. 'Reggae', 'R&B', 'Punk', 'Pop', 'Blues', 'Folk', '(Electronic) Dance', 'Jazz', 'Heavy Metal', 'Country', 'Rock', 'Hip Hop' and 'Gospel&Religious')
- Each artist will have a total of 100 song lyrics. This set of 100 lyrics will be allocated as 80-10-10 for the training, evaluation and test sets respectively.
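The 80-10-10 allocation is not carried out in this notebook, so here is a minimal sketch of how one artist's 100 song ids could be divided later. The function name `split_80_10_10`, the fixed `seed`, and the toy ids are assumptions made for illustration, not part of the original pipeline.

```python
import random

def split_80_10_10(song_ids, seed=0):
    """Shuffle one artist's song ids and cut them into train/eval/test parts."""
    ids = list(song_ids)
    random.Random(seed).shuffle(ids)   # local generator, so the split is repeatable
    n_train = len(ids) * 8 // 10       # 80 of 100
    n_eval = len(ids) // 10            # 10 of 100
    train = ids[:n_train]
    evaluation = ids[n_train:n_train + n_eval]
    test = ids[n_train + n_eval:]      # the remaining 10
    return train, evaluation, test

# hypothetical ids standing in for one artist's 100 songs
toy_ids = ["song_%d" % i for i in range(100)]
train_ids, eval_ids, test_ids = split_80_10_10(toy_ids)
print(len(train_ids), len(eval_ids), len(test_ids))
```

Splitting per artist, rather than over the whole 12000-song pool, keeps every artist represented in all three sets.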

Let's start by collecting each entry (object or instance) in the dataset under a dictionary with the following example structure: <br>
dict = { 'parent_genre_1' : {'artist_name_1' : [object_id_1, object_id_2, ..., object_id_i], 'artist_name_2' : [object_id_j, ..., object_id_k], ... }, <br>
&emsp;&emsp;&emsp; 'parent_genre_2' : {'artist_name_10' : [object_id_l, object_id_m, ..., object_id_p], 'artist_name_11' : [object_id_q, ..., object_id_t], ... }, <br>
&emsp;&emsp;&emsp; 'parent_genre_3'...}
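To make the nested structure concrete, here is a small sketch that builds it from toy records. The ids, child genre names, and the `toy_*` variables are hypothetical; the real metadata entries hold more fields than the two used here.

```python
from collections import defaultdict

# hypothetical toy records: object_id -> (child_genre, artist_name)
toy_metadata = {
    "id_1": ("thrash metal", "Metallica"),
    "id_2": ("thrash metal", "Metallica"),
    "id_3": ("swing", "Ella Fitzgerald"),
    "id_4": ("unknown", "Somebody"),
}
# hypothetical child -> parent genre mapping
toy_child_to_parent = {
    "thrash metal": "Heavy Metal",
    "swing": "Jazz",
    "unknown": "Rest",
}

# genre -> artist -> list of object ids, created on demand
collection = defaultdict(lambda: defaultdict(list))
for object_id, (genre, artist) in toy_metadata.items():
    parent = toy_child_to_parent[genre]
    if parent == "Rest":
        continue  # 'Rest' is excluded from the sub-dataset
    collection[parent][artist].append(object_id)

# convert back to plain dicts for printing
toy_collection = {g: dict(artists) for g, artists in collection.items()}
print(toy_collection)
```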

In [3]:
parent_genre_collection_dict = dict()
for parent_genre_label in set(children_to_parent_genre_dict.values()):
    parent_genre_collection_dict[parent_genre_label] = dict()
del parent_genre_collection_dict['Rest'] # we'll discard the 'Rest' genre

for object_id, metadata in EN_PRGE_metadata_dict.items():
    genre = metadata[1]
    artist_name = metadata[2]
    parent_genre = children_to_parent_genre_dict[genre]
    if parent_genre == 'Rest':
        continue # skip songs whose genre is projected to 'Rest'
    # create the artist's list the first time we see the artist, then record the song id
    parent_genre_collection_dict[parent_genre].setdefault(artist_name, []).append(object_id)



In [4]:
# an example
print(parent_genre_collection_dict["Heavy Metal"]["Metallica"])

['ObjectId(5714dedb25ac0d8aee4ad800)', 'ObjectId(5714dedb25ac0d8aee4ad802)', 'ObjectId(5714dedb25ac0d8aee4ad803)', 'ObjectId(5714dedb25ac0d8aee4ad806)', 'ObjectId(5714dedb25ac0d8aee4ad807)', 'ObjectId(5714dedb25ac0d8aee4ad808)', 'ObjectId(5714dedb25ac0d8aee4ad809)', 'ObjectId(5714dedb25ac0d8aee4ad80e)', 'ObjectId(5714dedb25ac0d8aee4ad80f)', 'ObjectId(5714dedb25ac0d8aee4ad812)', 'ObjectId(5714dedb25ac0d8aee4ad813)', 'ObjectId(5714dedb25ac0d8aee4ad814)', 'ObjectId(5714dedb25ac0d8aee4ad816)', 'ObjectId(5714dedb25ac0d8aee4ad818)', 'ObjectId(5714dedb25ac0d8aee4ad81a)', 'ObjectId(5714dedb25ac0d8aee4ad81b)', 'ObjectId(5714dedb25ac0d8aee4ad81d)', 'ObjectId(5714dedb25ac0d8aee4ad81e)', 'ObjectId(5714dedb25ac0d8aee4ad822)', 'ObjectId(5714dedb25ac0d8aee4ad824)', 'ObjectId(5714dedb25ac0d8aee4ad825)', 'ObjectId(5714dedb25ac0d8aee4ad826)', 'ObjectId(5714dedb25ac0d8aee4ad829)', 'ObjectId(5714dedb25ac0d8aee4ad82d)', 'ObjectId(5714dedb25ac0d8aee4ad82f)', 'ObjectId(5714dedb25ac0d8aee4ad830)', 'ObjectId(5

For each parent genre category, find the list of artists that have at least 100 songs (the check below uses `>= 100`):

In [5]:
_100_plus = dict()
for parent_genre_label in set(children_to_parent_genre_dict.values()):
    _100_plus[parent_genre_label] = list()
del _100_plus['Rest'] # we'll discard the 'Rest' genre

for parent_genre, artists in parent_genre_collection_dict.items():
    for artist, song_ids in artists.items():
        if len(song_ids) >= 100:
            _100_plus[parent_genre].append(artist)
    

In [6]:
for genre_label, artist_list in _100_plus.items():
    print(genre_label, "has", len(artist_list), "artists with more than 100 songs in the dataset.\n")

Country has 113 artists with more than 100 songs in the dataset.

Jazz has 10 artists with more than 100 songs in the dataset.

(Electronic) Dance has 17 artists with more than 100 songs in the dataset.

Pop has 64 artists with more than 100 songs in the dataset.

Rock has 250 artists with more than 100 songs in the dataset.

Folk has 29 artists with more than 100 songs in the dataset.

Heavy Metal has 71 artists with more than 100 songs in the dataset.

Punk has 43 artists with more than 100 songs in the dataset.

Blues has 21 artists with more than 100 songs in the dataset.

Reggae has 8 artists with more than 100 songs in the dataset.

Gospel&Religious has 24 artists with more than 100 songs in the dataset.

Hip Hop has 102 artists with more than 100 songs in the dataset.

R&B has 48 artists with more than 100 songs in the dataset.



According to the counts above, 'Reggae' has only 8 qualifying artists, so it does not meet our criterion of 10 artists with at least 100 songs each. Therefore: <br>
1- We'll remove 'Reggae' and work with the remaining genre labels <br>
2- For each remaining genre label, we'll randomly select 10 artists from the set of all qualifying artists <br>
3- For each selected artist, we'll randomly select 100 songs (instances), and form our subset
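Because the selection is random, rerunning the notebook yields a different subset each time. The sketch below shows one way the two sampling steps could be made repeatable with a seeded generator; the function `sample_subset`, the `seed` value, and the toy data are assumptions for illustration and are not the notebook's own code.

```python
import random

def sample_subset(songs_by_genre_artist, n_artists, n_songs, seed=42):
    """Pick n_artists per genre and n_songs per artist with a seeded generator."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    selected = []
    for genre in sorted(songs_by_genre_artist):            # fixed genre order
        artists = sorted(songs_by_genre_artist[genre])     # fixed artist order
        for artist in rng.sample(artists, n_artists):
            selected.extend(rng.sample(songs_by_genre_artist[genre][artist], n_songs))
    return selected

# hypothetical toy data: 2 genres, 3 artists each, 5 song ids per artist
toy_songs = {
    genre: {
        "%s_artist_%d" % (genre, a): ["%s_%d_%d" % (genre, a, s) for s in range(5)]
        for a in range(3)
    }
    for genre in ("Jazz", "Rock")
}

first_run = sample_subset(toy_songs, n_artists=2, n_songs=3)
second_run = sample_subset(toy_songs, n_artists=2, n_songs=3)
print(len(first_run), first_run == second_run)
```

Sorting the genres and artists before sampling matters here: it fixes the order in which the generator is consumed, so the same seed gives the same subset.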

In [7]:
# remove 'Reggae'
del _100_plus['Reggae']

In [8]:
import random

subset_song_id_list = list() # a list for all the song (object) ids that will be evaluated in the model

for genre_label, artist_list in _100_plus.items():
    # for each genre label, select a list of ten artists randomly
    random_10_list = random.sample(artist_list, 10)
    # then for each (genre_label, artist) combination, select 100 songs randomly
    for artist in random_10_list:
        song_id_list = random.sample(parent_genre_collection_dict[genre_label][artist],100)
        # append the items in this 100-song selection to a more comprehensive list
        subset_song_id_list.extend(song_id_list)
print("The final list of song_ids will have", len(subset_song_id_list), "items.")

The final list of song_ids will have 12000 items.


To finalize our subset, let's form a dictionary that takes song ids as its keys, and a list of ['artist_name', 'genre', 'lyrics'] as its values.

In [9]:
sub_dataset = dict()
for song_id in subset_song_id_list:
    metadata = EN_PRGE_metadata_dict[song_id]
    sub_dataset[song_id] = [metadata[2], # artist name
                            children_to_parent_genre_dict[metadata[1]], # parent genre
                            metadata[5]] # lyrics

In [10]:
# some examples
print(list(sub_dataset.items())[1100:1103])

[('ObjectId(5714dece25ac0d8aee40e3cc)', ['Ella Fitzgerald', 'Jazz', 'Away in a manger, no crib for a bed\nThe little Lord Jesus laid down His sweet head\nThe stars in the sky look down where He lay\nThe little Lord Jesus asleep on the hay\n\nBe near me Lord Jesus, I ask Thee to stay\nClose by me forever and love me, I pray\nBless all the dear children in Thy tender care\nAnd fit us for heaven to live with Thee there']), ('ObjectId(5714dece25ac0d8aee40e1f1)', ['Ella Fitzgerald', 'Jazz', "Sing Hallelujah, Hallelujah\nAnd you'll shoo the blues away\nCares pursue ya, Hallelujah\nGets you through the darkest day\n\nSatan lies a-waitin'\nAnd creatin' skies of gray\nBut Hallelujah, Hallelujah\nHelps to shoo the clouds away\n\nI recall in times when I was small\nIn light and free Jubilee days\nIn that sunny land of milk and honey\nI had no complaints while I thought of saints\nSo I say to all who fearful are\n\nSing Hallelujah, Hallelujah\nAnd you'll shoo the blues away\nWhen cares pursue ya, 

In [11]:
# write the sub_dataset dictionary to a pickle file
writePickle(sub_dataset, "sub_dataset")

Next, we'll convert this dictionary to a csv file

In [12]:
sub_dataset = readPickle("sub_dataset")

In [13]:
# start with a dictionary that maps ids to lyrics
ids2lyrics = dict((ids, info[2]) for ids, info in sub_dataset.items())

In [14]:
# collect two separate lists of unique artist and genre labels
unique_artist_list = sorted(set(info[0] for info in sub_dataset.values()))
unique_genre_list = sorted(set(info[1] for info in sub_dataset.values()))

# create 2 dictionaries that map artist_labels to integer values, and vice versa
artist2id = dict((a, i+1) for i, a in enumerate(unique_artist_list))
id2artist = dict((i+1, a) for i, a in enumerate(unique_artist_list))

# the same for genre_labels
genre2id = dict((a, i+1) for i, a in enumerate(unique_genre_list))
id2genre = dict((i+1, a) for i, a in enumerate(unique_genre_list))

In [18]:
writePickle(artist2id, "artist2id")
writePickle(id2artist, "id2artist")
writePickle(genre2id, "genre2id")
writePickle(id2genre, "id2genre")

In [15]:
# continue with another dictionary that maps each artist label to the list of their song ids in the dataset
artist_to_song_ids = dict((artist, []) for artist in unique_artist_list)
for ids, info in sub_dataset.items():
    artist_to_song_ids[info[0]].append(ids)

Now, we'll write our complete sub-dataset into a csv file. <br>
Each row of this csv will have the format: artist_label_index, genre_label_index, lyrics
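Lyrics contain line breaks, commas and quotation marks, so it is worth checking that a row survives the trip through the csv format. The sketch below writes two made-up rows to an in-memory buffer with Python's `csv` module and reads them back; the row contents are hypothetical.

```python
import csv
import io

# hypothetical rows: artist_label_index, genre_label_index, lyrics
rows = [
    [1, 2, "First line\nSecond line, with a comma"],
    [3, 2, 'A "quoted" word'],
]

buffer = io.StringIO(newline="")    # in-memory stand-in for a file opened with newline=''
csv.writer(buffer).writerows(rows)  # fields with newlines, commas or quotes get quoted

buffer.seek(0)
restored = [[int(artist), int(genre), lyrics]
            for artist, genre, lyrics in csv.reader(buffer)]
print(restored == rows)
```

The reader returns every field as a string, which is why the two index columns are converted back to integers.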

In [16]:
import csv

# write the complete sub-dataset to a single csv file (no header row)
with open('sub_dataset.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for artist_label, id_list in artist_to_song_ids.items():
        for ids in id_list:
            writer.writerow([artist2id[artist_label],genre2id[sub_dataset[ids][1]], ids2lyrics[ids]])


In [17]:
# In other files, use the following helper function to load the complete sub_dataset from the csv file above

import numpy as np
import pandas as pd
from keras.utils import to_categorical

def load_data():
    
    data = pd.read_csv('sub_dataset.csv', header=None)
    data = data.dropna()

    x = data[2]
    x = np.array(x)

    y_artist = data[0] - 1
    y_artist = to_categorical(y_artist)
    
    y_genre = data[1] - 1
    y_genre = to_categorical(y_genre)
    
    return (x, y_artist, y_genre)

Using TensorFlow backend.
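The `- 1` in `load_data` is there because the label indices in the csv start at 1, while one-hot encoding expects classes numbered from 0; without the shift, an unused column 0 would appear. The sketch below illustrates the idea with a plain-Python stand-in for `to_categorical`. The helper `one_hot` and the toy labels are assumptions for illustration, not Keras's implementation.

```python
def one_hot(zero_based_labels, num_classes=None):
    """Return one-hot rows (lists of 0.0/1.0) for integer labels starting at 0."""
    labels = list(zero_based_labels)
    if num_classes is None:
        num_classes = max(labels) + 1  # infer the class count from the largest label
    return [[1.0 if column == label else 0.0 for column in range(num_classes)]
            for label in labels]

# hypothetical genre indices as stored in the csv (starting at 1)
genre_ids_from_csv = [1, 3, 2, 3]
y_genre_demo = one_hot([g - 1 for g in genre_ids_from_csv])

# position of the 1.0 in each row, shifted back to the csv numbering
recovered_ids = [row.index(1.0) + 1 for row in y_genre_demo]
print(y_genre_demo[0], recovered_ids)
```

Adding 1 to the position of the hot entry recovers the original index, which can then be looked up in `id2genre` or `id2artist`.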
