### Here, we will make the final conversion of the lyrics (written in POS tags and RID tags) to a format that suits CNN requirements. We'll follow the steps below:
- By using a script named 'Lyrics2POSandRID_Converter.py', we will convert each song in the final collection to POS tags and RID tags. Then the converted versions will be stored under two distinct dictionaries, one mapping artists to a list of songs written in POS tags, and the other in RID tags
- We will generate two sets (one for POS and one for RID) that contain the unique tags including 'PADDING'. Each of these tags will be mapped to a unique index number
- For future efficiency, each song id will be mapped to its POS_Lyric and RID_Lyric version in two separate dictionaries
- Each song should be extended to a size of 100 lines and 20 tokens per line. Therefore we will take each song and fill in the blanks with 'PADDING' until all of them have size (100x20)
- The padded songs will be converted to tag indices, so that they can be used in the CNN. Then, these indices will be normalized to numbers between 0 and 1 for CNN calculation efficiency.
- We'll finish by applying the training-development-test split that is already prepared in a separate file called 'Training-Development-Test Set Allocation.ipynb'. The split datasets will be stored and passed to the scripts that construct our CNN models.

In [1]:
# start by writing the Pickle functions to call and save pickle variables later on

import pickle

def writePickle(Variable, fname):
    # save Variable as pickle_vars/<fname>.pkl (the target folder must already exist)
    with open("pickle_vars/" + fname + ".pkl", 'wb') as f:
        pickle.dump(Variable, f)

def readPickle(fname):
    # load pickle_vars/<fname>.pkl
    with open("pickle_vars/" + fname + ".pkl", 'rb') as f:
        return pickle.load(f)

def readPicklefromPast(fname):
    # load <fname>.pkl from the pickle_vars folder one directory up
    with open("../pickle_vars/" + fname + ".pkl", 'rb') as f:
        return pickle.load(f)
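
The pickle helpers write into `pickle_vars/` and, later on, into subfolders such as `indexing/` and `cnn_data_inputs/`, which have to exist beforehand. As a minimal sketch, assuming the hypothetical helpers `write_pickle_safe` and `read_pickle_safe` and a `base_dir` parameter (none of which are part of this notebook), missing folders could be created before writing:

```python
import os
import pickle
import tempfile

def write_pickle_safe(variable, fname, base_dir="pickle_vars"):
    """Hypothetical variant of writePickle that creates missing folders first."""
    path = os.path.join(base_dir, fname + ".pkl")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(variable, f)
    return path

def read_pickle_safe(fname, base_dir="pickle_vars"):
    """Hypothetical counterpart of readPickle with a configurable base folder."""
    with open(os.path.join(base_dir, fname + ".pkl"), "rb") as f:
        return pickle.load(f)

# Round trip inside a temporary folder so that no project file is touched
with tempfile.TemporaryDirectory() as tmp:
    write_pickle_safe({"PADDING": 0}, "indexing/POS2id", base_dir=tmp)
    restored = read_pickle_safe("indexing/POS2id", base_dir=tmp)
```

The temporary folder is only used to keep the example self-contained; in the notebook the default `pickle_vars` folder would be the natural choice.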

In [None]:
# Import the POS and RID lyrics dictionaries generated in 'Lyrics2POSandRID_Converter.py'
final_artists_to_RIDsongs_dict = readPickle("final_artists_to_RIDsongs_dict")
final_artists_to_POSsongs_dict = readPickle("final_artists_to_POSsongs_dict")

In [None]:
# create a set that contains all the unique POS tags used by spaCy
unique_POS_set = set()
for artist, song_list in final_artists_to_POSsongs_dict.items():
    for song in song_list:
        flattened_POS_set = set([POS for line in song for POS in line])
        unique_POS_set.update(flattened_POS_set)
print(unique_POS_set)

# do the same for the RID tags
# First create a set that contains all the unique RID tags used within the songs
unique_RID_set = set()
for artist, song_list in final_artists_to_RIDsongs_dict.items():
    for song in song_list:
        flattened_RID_set = set([RID for line in song for RID in line])
        unique_RID_set.update(flattened_RID_set)
print(unique_RID_set)

# add 'PADDING' as the first element to the list versions of these sets
unique_POS_list = list(unique_POS_set)
unique_POS_list.insert(0,"PADDING")
print(unique_POS_list)

unique_RID_list = list(unique_RID_set)
unique_RID_list.insert(0,"PADDING")
print(unique_RID_list)
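
One caveat: the iteration order of a set of strings is not guaranteed to be the same in every Python session, so `list(unique_POS_set)` may give the tags a different order (and thus different indices) if this cell is re-run in a new kernel. The sketch below, which assumes that alphabetical order is acceptable and uses a hypothetical helper `build_tag_list`, shows one way to make the order reproducible:

```python
def build_tag_list(songs, padding="PADDING"):
    """Collect the unique tags of nested songs (song -> lines -> tags), padding first."""
    tags = {tag for song in songs for line in song for tag in line}
    tags.discard(padding)  # make sure the padding token appears only once
    return [padding] + sorted(tags)

# Toy songs, not taken from the dataset
toy_songs = [
    [["NOUN", "VERB"], ["DET", "NOUN"]],
    [["PRON", "VERB", "ADJ"]],
]
tag_list = build_tag_list(toy_songs)
print(tag_list)
```

With a sorted list, 'PADDING' still receives index 0 and every other tag keeps the same index across runs.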

In [None]:
# then link every RID tag and POS tag respectively to a number and store these links in both directions in two dictionaries
RID2id = {rid: index for index, rid in enumerate(unique_RID_list)}
id2RID = {index: rid for index, rid in enumerate(unique_RID_list)}
print(RID2id)

POS2id = {pos: index for index, pos in enumerate(unique_POS_list)}
id2POS = {index: pos for index, pos in enumerate(unique_POS_list)}
print(POS2id)
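
To illustrate how the two directions of such a mapping work together, here is a small round trip on a toy vocabulary (the tag list and the names `tag2id` and `id2tag` are made up for this example):

```python
toy_tags = ["PADDING", "ADJ", "NOUN", "VERB"]  # toy vocabulary, not the real tag set
tag2id = {tag: index for index, tag in enumerate(toy_tags)}
id2tag = {index: tag for tag, index in tag2id.items()}

line = ["NOUN", "VERB", "ADJ", "NOUN"]
encoded = [tag2id[tag] for tag in line]         # tags -> indices
decoded = [id2tag[index] for index in encoded]  # indices -> tags
print(encoded, decoded)
```

Decoding the encoded line gives back the original tags, which is a quick way to verify that both dictionaries are consistent.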


In [None]:
# also the artist names should be mapped to numbers
artist_names = list(final_artists_to_RIDsongs_dict.keys())
Artist2id = {artist: index for index, artist in enumerate(artist_names)}
id2Artist = {index: artist for index, artist in enumerate(artist_names)}

In [None]:
# store all these variables under the folder 'indexing'
writePickle(POS2id, "indexing/POS2id")
writePickle(id2POS, "indexing/id2POS")
writePickle(RID2id, "indexing/RID2id")
writePickle(id2RID, "indexing/id2RID")
writePickle(Artist2id, "indexing/Artist2id")
writePickle(id2Artist, "indexing/id2Artist")

Next, create dictionaries that map song IDs to their POS lyrics, RID lyrics and original lyrics, respectively.

In [None]:
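# NOTE: final_constrained_artist2idlist_dict (artist -> song IDs) and final_artist2lyrics_dict
# (artist -> original lyrics) are not loaded in this notebook; they must already be in memory,
# with each artist's lists in the same order as the POS and RID song lists
final_IDs_to_Lyrics_dict = dict()
final_IDs_to_POS_dict = dict()
final_IDs_to_RID_dict = dict()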

for artist, rid_songs in final_artists_to_RIDsongs_dict.items():
    song_IDs = final_constrained_artist2idlist_dict[artist]
    song_lyrics = final_artist2lyrics_dict[artist]
    for songID, RID in zip(song_IDs,rid_songs):
        final_IDs_to_RID_dict[songID] = RID
    for songID, lyrics in zip(song_IDs,song_lyrics):
        final_IDs_to_Lyrics_dict[songID] = lyrics
    
for artist, pos_songs in final_artists_to_POSsongs_dict.items():
    song_IDs = final_constrained_artist2idlist_dict[artist]    
    for songID, POS in zip(song_IDs,pos_songs):
        final_IDs_to_POS_dict[songID] = POS
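
Since `zip` silently stops at the shorter of its inputs, a mismatch between the number of song IDs and the number of songs of an artist would go unnoticed. A minimal sketch of the same mapping with an explicit length check, assuming a hypothetical helper `map_ids_to_songs` and toy data:

```python
def map_ids_to_songs(artist_to_ids, artist_to_songs):
    """Map every song ID to its song, failing loudly if the lists do not line up."""
    id_to_song = {}
    for artist, songs in artist_to_songs.items():
        ids = artist_to_ids[artist]
        if len(ids) != len(songs):
            raise ValueError(f"{artist}: {len(ids)} ids but {len(songs)} songs")
        id_to_song.update(zip(ids, songs))
    return id_to_song

# Toy data: two artists, three songs in total
toy_ids = {"artist_a": ["1", "2"], "artist_b": ["3"]}
toy_songs = {"artist_a": [[["NOUN"]], [["VERB"]]], "artist_b": [[["ADJ"]]]}
id_to_song = map_ids_to_songs(toy_ids, toy_songs)
print(len(id_to_song))
```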
        

In [None]:
# print the same song from each of the dictionaries to compare them, together with the number of songs
print(final_IDs_to_POS_dict["241007"], len(final_IDs_to_POS_dict))
print(final_IDs_to_RID_dict["241007"], len(final_IDs_to_RID_dict))
print(final_IDs_to_Lyrics_dict["241007"], len(final_IDs_to_Lyrics_dict))

In [None]:
# record these dictionaries into pickle files
writePickle(final_IDs_to_POS_dict, 'final_IDs_to_POS_dict')
writePickle(final_IDs_to_RID_dict, 'final_IDs_to_RID_dict')
writePickle(final_IDs_to_Lyrics_dict, 'final_IDs_to_Lyrics_dict')

In [None]:
# here you can plot a couple of histograms to see the song length and line length distributions over the dataset for POS tags
import matplotlib.pyplot as plt
from collections import Counter

song_len_list = list()
line_len_list = list()
for song in final_IDs_to_POS_dict.values():
    song_len_list.append(len(song))
    for line in song:
        line_len_list.append(len(line))

# count how often each length occurs
all_line_freq = Counter(line_len_list)
all_song_freq = Counter(song_len_list)

# line length distribution
plt.bar(list(all_line_freq.keys()), list(all_line_freq.values()), width=1, color='g')
plt.show()
# song length distribution
plt.bar(list(all_song_freq.keys()), list(all_song_freq.values()), width=1, color='g')
plt.show()
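
One way to read such distributions numerically is to check which share of the songs (or lines) fits within a given limit. The sketch below assumes a hypothetical helper `coverage` and uses made-up lengths, not statistics of the actual dataset:

```python
def coverage(lengths, limit):
    """Fraction of items whose length does not exceed `limit`."""
    return sum(1 for n in lengths if n <= limit) / len(lengths)

toy_song_lengths = [12, 40, 85, 100, 130]  # made-up values for illustration
print(coverage(toy_song_lengths, 100))
```

Applied to the real length lists, the same function would indicate how many songs fit within 100 lines and how many lines fit within 20 tokens.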

Each song should be extended to a size of 100 lines and 20 tokens per line. Therefore we will take each song and fill in the blanks with 'PADDING' until all of them have size (100x20). <br>
Each song should also be converted to its indexed version. <br>
Finally, the converted datasets should be split into training, development and test sets. <br>
The following cell handles these steps: <br>
(We use the previously created variables here. For the generation of the train-dev-test split, provided in the "train_df", "dev_df" and "test_df" variables respectively, please refer to the separate notebook 'Training-Development-Test Set Allocation.ipynb'.)
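
Before the full script, the padding step can be sketched in isolation on a toy song. The helper name `pad_song` is hypothetical, and cutting off songs or lines that exceed the limits is an assumption of this sketch; the notebook does not state how longer inputs are handled.

```python
def pad_song(song, max_song=100, max_line=20, padding="PADDING"):
    """Return a padded copy of `song` with exactly max_song lines of max_line tags."""
    # Copy (and, as an assumption, truncate) so that the input is left untouched
    padded = [list(line[:max_line]) for line in song[:max_song]]
    for line in padded:
        line.extend([padding] * (max_line - len(line)))
    while len(padded) < max_song:
        padded.append([padding] * max_line)
    return padded

toy_song = [["NOUN", "VERB"], ["DET"]]
padded = pad_song(toy_song, max_song=3, max_line=4)
for line in padded:
    print(line)
```

Working on a copy keeps the dictionary of original songs intact, so the function can be called repeatedly on the same song.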

In [None]:
import numpy as np
import pandas as pd
from collections import OrderedDict


max_song = 100 # maximum song length
max_line = 20 # maximum number of tokens in a line

RID2id = readPickle("indexing/RID2id")
POS2id = readPickle("indexing/POS2id")
Artist2id = readPickle("indexing/Artist2id")
ID_to_RID = readPickle("final_IDs_to_RID_dict")
ID_to_POS = readPickle("final_IDs_to_POS_dict")

# also import the split datasets
train_df = readPickle("train_df")
dev_df = readPickle("dev_df")
test_df = readPickle("test_df")

# start with the RID generator function
def RID_generator(dataframe): # pick any of the dataframes; -test, -train or -dev
    # the columns of the dataframe are artists, each holding a list of song IDs
    RID_dict = dataframe.to_dict('list')
    sorted_RID_dict = OrderedDict(sorted(RID_dict.items(), key=lambda v: v, reverse=True))
    artists = list()
    songs = list()
    for artist, song_ID_list in sorted_RID_dict.items():
        for song_ID in song_ID_list:
            artists.append(Artist2id[artist])
            # work on a copy so the stored song is not modified; cut anything beyond the limits,
            # otherwise the padding loops below would never terminate on longer input
            song = [list(line[:max_line]) for line in ID_to_RID[song_ID][:max_song]]
            while len(song) < max_song:
                song.append(["PADDING"])
            for line in song:
                while len(line) < max_line:
                    line.append("PADDING")
            # replace every tag with its index
            RID_song = [[RID2id[tag] for tag in line] for line in song]
            songs.append(RID_song)
    return songs, artists

# using the function, form the datasets in python list format
train_RID_input_data, train_RID_labels = RID_generator(train_df)
print("Training data finished for RID, continuing with development data...")
dev_RID_input_data, dev_RID_labels = RID_generator(dev_df)
print("Development data finished for RID, continuing with test data...")
test_RID_input_data, test_RID_labels = RID_generator(test_df)
print("Test data finished for RID, continuing with POS generation...")

# replicate the process for the POS tags
def POS_generator(dataframe): # pick any of the dataframes; -test, -train or -dev
    # the columns of the dataframe are artists, each holding a list of song IDs
    POS_dict = dataframe.to_dict('list')
    sorted_POS_dict = OrderedDict(sorted(POS_dict.items(), key=lambda v: v, reverse=True))
    artists = list()
    songs = list()
    for artist, song_ID_list in sorted_POS_dict.items():
        for song_ID in song_ID_list:
            artists.append(Artist2id[artist])
            # work on a copy so the stored song is not modified; cut anything beyond the limits,
            # otherwise the padding loops below would never terminate on longer input
            song = [list(line[:max_line]) for line in ID_to_POS[song_ID][:max_song]]
            while len(song) < max_song:
                song.append(["PADDING"])
            for line in song:
                while len(line) < max_line:
                    line.append("PADDING")
            # replace every tag with its index
            POS_song = [[POS2id[tag] for tag in line] for line in song]
            songs.append(POS_song)
    return songs, artists

# using the function, form the datasets in python list format
train_POS_input_data, train_POS_labels = POS_generator(train_df)
print("Training data finished for POS, continuing with development data...")
dev_POS_input_data, dev_POS_labels = POS_generator(dev_df)
print("Development data finished for POS, continuing with test data...")
test_POS_input_data, test_POS_labels = POS_generator(test_df)
print("Test data finished for POS, continuing with pickle file recording...")




# in the end store these as pickle variables for later use
writePickle(train_POS_input_data, "cnn_data_inputs/train_POS_input_data")
writePickle(train_POS_labels, "cnn_data_inputs/train_POS_labels")
writePickle(dev_POS_input_data, "cnn_data_inputs/dev_POS_input_data")
writePickle(dev_POS_labels, "cnn_data_inputs/dev_POS_labels")
writePickle(test_POS_input_data, "cnn_data_inputs/test_POS_input_data")
writePickle(test_POS_labels, "cnn_data_inputs/test_POS_labels")

writePickle(train_RID_input_data, "cnn_data_inputs/train_RID_input_data")
writePickle(train_RID_labels, "cnn_data_inputs/train_RID_labels")
writePickle(dev_RID_input_data, "cnn_data_inputs/dev_RID_input_data")
writePickle(dev_RID_labels, "cnn_data_inputs/dev_RID_labels")
writePickle(test_RID_input_data, "cnn_data_inputs/test_RID_input_data")
writePickle(test_RID_labels, "cnn_data_inputs/test_RID_labels")


print("An example of training RID input data is:", train_RID_input_data[0])
print("The first training RID label is", train_RID_labels[0])
print("The first training POS label is", train_POS_labels[0])

We now have six datasets for each tag type (POS and RID): inputs and labels for each of the training, development and test sets. It is time to feed these into a model.
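
Before moving on, it can be worth verifying that every input has the expected size and that inputs and labels line up. A minimal sketch, assuming a hypothetical helper `check_dataset` and toy data with a reduced size of 2x2:

```python
def check_dataset(inputs, labels, max_song=100, max_line=20):
    """Raise an error if inputs and labels do not match or a song has the wrong size."""
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels differ in length")
    for i, song in enumerate(inputs):
        if len(song) != max_song or any(len(line) != max_line for line in song):
            raise ValueError(f"song {i} is not {max_song}x{max_line}")
    return True

# Two toy songs of 2 lines x 2 tokens, already converted to indices
toy_inputs = [[[0, 1], [2, 0]], [[3, 0], [0, 0]]]
toy_labels = [0, 1]
ok = check_dataset(toy_inputs, toy_labels, max_song=2, max_line=2)
print(ok)
```

A ragged song would otherwise only surface later, when the lists are converted to NumPy arrays.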

In [None]:
# start with reading these dataset variables from pickle files
train_POS_input_data = readPickle("cnn_data_inputs/train_POS_input_data")
train_POS_labels = readPickle("cnn_data_inputs/train_POS_labels")
dev_POS_input_data = readPickle("cnn_data_inputs/dev_POS_input_data")
dev_POS_labels = readPickle("cnn_data_inputs/dev_POS_labels")
test_POS_input_data = readPickle("cnn_data_inputs/test_POS_input_data")
test_POS_labels = readPickle("cnn_data_inputs/test_POS_labels")

train_RID_input_data = readPickle("cnn_data_inputs/train_RID_input_data")
train_RID_labels = readPickle("cnn_data_inputs/train_RID_labels")
dev_RID_input_data = readPickle("cnn_data_inputs/dev_RID_input_data")
dev_RID_labels = readPickle("cnn_data_inputs/dev_RID_labels")
test_RID_input_data = readPickle("cnn_data_inputs/test_RID_input_data")
test_RID_labels = readPickle("cnn_data_inputs/test_RID_labels")

In [None]:
# convert all of them to numpy arrays, so that they can be used in keras
import numpy as np

train_POS_input_data = np.array(train_POS_input_data)
train_POS_labels = np.array(train_POS_labels)
dev_POS_input_data = np.array(dev_POS_input_data)
dev_POS_labels = np.array(dev_POS_labels)
test_POS_input_data = np.array(test_POS_input_data)
test_POS_labels = np.array(test_POS_labels)

train_RID_input_data = np.array(train_RID_input_data)
train_RID_labels = np.array(train_RID_labels)
dev_RID_input_data = np.array(dev_RID_input_data)
dev_RID_labels = np.array(dev_RID_labels)
test_RID_input_data = np.array(test_RID_input_data)
test_RID_labels = np.array(test_RID_labels)


In [None]:
# see an example
train_POS_input_data[0]

In [None]:
# for all the input data, we have to normalize the data points to an interval between 0 and 1, 
# and convert everything to floating numbers

print(np.amax(train_POS_input_data))
print(np.amax(test_POS_input_data))
print(np.amax(dev_POS_input_data))

print(np.amax(train_RID_input_data))
print(np.amax(test_RID_input_data))
print(np.amax(dev_RID_input_data))

In [None]:
# use one scale per tag type (the training maximum), so that a tag index maps to the same value in every split
POS_scale = np.amax(train_POS_input_data)
RID_scale = np.amax(train_RID_input_data)

train_POS_input_data = train_POS_input_data.astype('float32') / POS_scale
dev_POS_input_data = dev_POS_input_data.astype('float32') / POS_scale
test_POS_input_data = test_POS_input_data.astype('float32') / POS_scale

train_RID_input_data = train_RID_input_data.astype('float32') / RID_scale
dev_RID_input_data = dev_RID_input_data.astype('float32') / RID_scale
test_RID_input_data = test_RID_input_data.astype('float32') / RID_scale
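
Note that dividing a split by its own maximum only gives comparable values when all splits contain the same maximum index, which is what the printed maxima in the previous cell allow us to check. The toy sketch below (made-up arrays, not dataset values) shows what happens when one split lacks the highest index:

```python
import numpy as np

train = np.array([[0, 2], [4, 1]])  # toy indices; the highest index 4 appears here
test = np.array([[0, 2], [1, 1]])   # the highest index is missing from this split

shared_scale = train.max()
own = test.astype("float32") / test.max()       # tag 2 becomes 1.0
shared = test.astype("float32") / shared_scale  # tag 2 becomes 0.5, as in train
print(own[0, 1], shared[0, 1])
```

With a shared divisor the same tag is always represented by the same number, regardless of the split it appears in.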

In [None]:
# see an example
print(train_POS_input_data[0])
print(train_POS_labels[0])
print(train_POS_input_data.shape)

In [None]:
# reshape the inputs into desired format

X_train_POS = train_POS_input_data.reshape(len(train_POS_input_data),max_song,max_line,1)
X_dev_POS = dev_POS_input_data.reshape(len(dev_POS_input_data),max_song,max_line,1)
X_test_POS = test_POS_input_data.reshape(len(test_POS_input_data),max_song,max_line,1)

X_train_RID = train_RID_input_data.reshape(len(train_RID_input_data),max_song,max_line,1)
X_dev_RID = dev_RID_input_data.reshape(len(dev_RID_input_data),max_song,max_line,1)
X_test_RID = test_RID_input_data.reshape(len(test_RID_input_data),max_song,max_line,1)

# check the final shapes
print(X_train_POS.shape)
print(X_train_RID.shape)
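
The trailing 1 in these shapes is a single channel axis, comparable to a grayscale image; this matches the (samples, rows, columns, channels) layout that 2D convolution layers in Keras expect by default, assuming such layers are used in the model scripts. A small sketch on a toy array of zeros:

```python
import numpy as np

batch = np.zeros((3, 100, 20), dtype="float32")  # 3 toy songs of 100 lines x 20 tokens
with_channel = batch.reshape(len(batch), 100, 20, 1)
same_result = batch[..., np.newaxis]  # equivalent way to append the channel axis
print(with_channel.shape)
```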

In [None]:
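# turn the labels into categorical (one-hot) values

from keras.utils import to_categorical

# fix the number of classes so that every split gets the same one-hot width,
# even if a split does not contain the artist with the highest index
num_artists = len(Artist2id)

y_train_POS = to_categorical(train_POS_labels, num_classes=num_artists)
y_dev_POS = to_categorical(dev_POS_labels, num_classes=num_artists)
y_test_POS = to_categorical(test_POS_labels, num_classes=num_artists)

y_train_RID = to_categorical(train_RID_labels, num_classes=num_artists)
y_dev_RID = to_categorical(dev_RID_labels, num_classes=num_artists)
y_test_RID = to_categorical(test_RID_labels, num_classes=num_artists)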
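
When `num_classes` is not given, `to_categorical` derives the number of classes from the largest label it sees, so a split that lacks the highest artist index would get a narrower matrix. The idea of a fixed width can be sketched with plain NumPy; the helper `one_hot` is a hypothetical stand-in for illustration, not the Keras function itself:

```python
import numpy as np

def one_hot(labels, num_classes):
    """NumPy stand-in for one-hot encoding with a fixed number of classes."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

# Toy labels: class 3 does not occur, but the matrix still has 4 columns
y = one_hot([0, 2, 1], num_classes=4)
print(y.shape)
print(y)
```

Each row contains a single 1 at the position of the artist index, and all splits share the same number of columns.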



In [None]:
# see how it works
print(y_train_RID)
print(y_train_POS)

Record all of the formatted final inputs and outputs into pickle files

In [None]:
# save the variables
writePickle(X_train_POS,"cnn_data_inputs/POS_Keras/X_train_POS")
writePickle(X_dev_POS,"cnn_data_inputs/POS_Keras/X_dev_POS")
writePickle(X_test_POS,"cnn_data_inputs/POS_Keras/X_test_POS")
writePickle(y_train_POS,"cnn_data_inputs/POS_Keras/y_train_POS")
writePickle(y_dev_POS,"cnn_data_inputs/POS_Keras/y_dev_POS")
writePickle(y_test_POS,"cnn_data_inputs/POS_Keras/y_test_POS")

writePickle(X_train_RID,"cnn_data_inputs/RID_Keras/X_train_RID")
writePickle(X_dev_RID,"cnn_data_inputs/RID_Keras/X_dev_RID")
writePickle(X_test_RID,"cnn_data_inputs/RID_Keras/X_test_RID")
writePickle(y_train_RID,"cnn_data_inputs/RID_Keras/y_train_RID")
writePickle(y_dev_RID,"cnn_data_inputs/RID_Keras/y_dev_RID")
writePickle(y_test_RID,"cnn_data_inputs/RID_Keras/y_test_RID")

### From this moment on, we will use our model scripts to construct our models, using the inputs prepared and stored in pickle files above. For an example model script, please check "POS_Model.py"