#### Start by converting the CMU Pronouncing Dictionary into a JSON file
The original version of the dictionary can be found at: http://www.speech.cs.cmu.edu/cgi-bin/cmudict <br>
It contains more than 130k unique English word pronunciations.

In [1]:
import json

filename = 'pickle_vars/rhyme/cmu_dict.txt'

cmu_dict = dict()

# build the word -> pronunciation dictionary
with open(filename) as fh:
    for line in fh:
        # trim surrounding whitespace and split on the first run of whitespace:
        # the first field is the word, the rest is its phoneme string
        word, phonemes = line.strip().split(None, 1)
        cmu_dict[word] = phonemes.strip()

# write the dictionary out as JSON
with open("pickle_vars/rhyme/cmu.json", "w") as out_file:
    json.dump(cmu_dict, out_file, indent=4, sort_keys=False)
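
To make the conversion concrete, here is a minimal sketch of what happens to a single dictionary line. The helper name `parse_cmu_line` is hypothetical, and the sample line assumes the plain-text layout of a word followed by whitespace and its phonemes; the pronunciation used for SATIN is the one printed later in this notebook.

```python
def parse_cmu_line(line):
    """Split one dictionary line into (word, phoneme string)."""
    # split on the first run of whitespace only, so the phonemes stay together
    word, phonemes = line.strip().split(None, 1)
    return word, phonemes.strip()

sample_line = "SATIN  S AE1 T AH0 N\n"  # assumed layout: word, whitespace, phonemes
word, phonemes = parse_cmu_line(sample_line)
print(word, "->", phonemes)
```

Splitting with `maxsplit=1` is what keeps the whole phoneme sequence in one string, which is the form the rest of the notebook stores as dictionary values.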

#### Now, load our mini dataset and split all lyrics into tokens. Check whether the unique tokens have pronunciation entries in our CMU dictionary

In [2]:
# load all the lyrics in the mini dataset (12000 song lyrics)

import numpy as np
import pandas as pd
from keras.utils import to_categorical
import string

def load_data():
    
    data = pd.read_csv('sub_dataset.csv', header=None)
    data = data.dropna()

    x = data[2]
    x = np.array(x)

    y_artist = data[0] - 1
    y_artist = to_categorical(y_artist)
    
    y_genre = data[1] - 1
    y_genre = to_categorical(y_genre)
    
    return (x, y_artist, y_genre)

x, y_artist, y_genre = load_data()

In [3]:
with open('pickle_vars/rhyme/cmu.json') as json_file: 
    cmu_dict = json.load(json_file)

Take a quick look at the dataset, token by token, to see how well the lyrics are covered by the CMU pronunciation dictionary.

In [4]:
pronounciation_dict = {}
unknown_words = list()
all_count = 0
# translation table that deletes every punctuation character
punctuation_table = str.maketrans('', '', string.punctuation)
for lyrics in x:
    for word in lyrics.split(): # for each token
        all_count += 1
        stripped_word = word.translate(punctuation_table) # first try the version stripped of all punctuation
        try:
            pronounciation_dict[word] = cmu_dict[stripped_word.upper()] # the dictionary keys are uppercase
        except KeyError: # the stripped version of the token is not in the dictionary
            try:
                pronounciation_dict[word] = cmu_dict[word.upper()] # fall back to the raw token, uppercased
            except KeyError:
                unknown_words.append(word)
print("After splitting the whole dataset into tokens, {} tokens are processed. Out of these, {} tokens \
(={} unique tokens) do not have a match in the CMU dictionary. In other words {:03.2f}% of the time when we encounter a \
token in the dataset, we don't know its pronounciation. The tokens with known pronounciations ({} unique tokens) \
are recorded in the pronounciation_dict\n\
".format(all_count, len(unknown_words), len(set(unknown_words)), (len(unknown_words)/all_count)*100, len(pronounciation_dict)))              

print("Some examples of tokens with unknown pronounciations, in their original form in the dataset", set(unknown_words[0:100]))
      

      

After splitting the whole dataset into tokens, 2526333 tokens are processed. Out of these, 72460 tokens (=23609 unique tokens) do not have a match in the CMU dictionary. In other words 2.87% of the time when we encounter a token in the dataset, we don't know its pronounciation. The tokens with known pronounciations (65246 unique tokens) are recorded in the pronounciation_dict

Some examples of tokens with unknown pronounciations, in their original form in the dataset {'niggas', "fo'", 'ATL,', "I'ma", '(ketro),', "hittin'", 'playas,', "fo',", '2X)', 'aiyyo', 'TRL-total', 'playas', 'nigga,', "spittin',", '(hahaha)', 'Trackmasters,', "niggas'll", "holdin'", "packin',", "fastin'", 'Murda', 'Nore', 'Vol.', 'TM,', '3X)', "'", '(haha)', '(50', "(c'mon)", "eatin',", "catchin'", 'Niggas', 'niggas,', 'hahahaha', 'nigga', 'Bowlish', '50,', 'niggaz', 'P-90', '3', 'titties', '1-800-TIPS', 'ketro', '81', 'Noreaga,', 'Rimadon', "sippin'", 'Nore,', '2000', "fiddy'", 'GS', '...', 'piddy-pat', 'faggots'
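
The two-step lookup used above (punctuation-stripped token first, raw token second) can be sketched as a small function. This is an illustration only: the name `lookup` and the one-entry `toy_cmu` dictionary are hypothetical stand-ins for the notebook's loop and `cmu_dict`.

```python
import string

# translation table that deletes every ASCII punctuation character
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def lookup(token, cmu):
    """Return the pronunciation of a token, or None if no key matches."""
    stripped = token.translate(PUNCTUATION_TABLE).upper()
    if stripped in cmu:
        return cmu[stripped]
    # fall back to the raw token, which matters when the dictionary key
    # itself contains punctuation
    return cmu.get(token.upper())

toy_cmu = {"SATIN": "S AE1 T AH0 N"}  # toy stand-in for cmu_dict
print(lookup("satin,", toy_cmu))   # trailing comma is stripped before the lookup
print(lookup("hittin'", toy_cmu))  # no entry for HITTIN or HITTIN'
```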

A quick look at the tokens with unknown pronunciations reveals several common cases: <br>

1- Colloquialisms and slang words are not recognized <br>
2- Numbers are not recognized <br>
3- Hyphenated tokens ("-") are not recognized <br>
4- Verbs written with the final 'g' dropped (e.g. "hittin'" or "hittin") are not recognized <br>

For 1 & 2, we'll use the CMU Lexicon Tool (http://www.speech.cs.cmu.edu/tools/lextool.html) later on. <br>
For 3, we'll split the token at the hyphens and check whether the pieces can be pronounced separately. <br>
For 4, we'll restore the final 'g' and look the word up in the dictionary. If there is an entry, we take it and drop the final 'g' sound again, so that the ending mimics the pronunciation of 'SATIN', shown below:

In [5]:
print("The pronounciation of 'satin':", cmu_dict["SATIN"])

The pronounciation of 'satin': S AE1 T AH0 N


In [6]:
# for hyphens
for token in set(unknown_words):
    if '-' in token:
        replaced = token.replace('-', ' ') # replace '-'s with a blank space
        pronounciation = str()
        translation_exists = True
        for piece in replaced.split():
            try:
                pronounciation += " " + cmu_dict[piece.upper()]
            except KeyError:
                translation_exists = False # if any piece is missing from cmu_dict, the other pieces are not used either
                break
        if translation_exists:
            pronounciation = pronounciation[1:] # drop the leading space
            pronounciation_dict[token] = pronounciation # add the discovered pronunciation to the comprehensive dictionary
            unknown_words = list(filter(lambda a: a != token, unknown_words))

print("Now the pronounciation dictionary has {} unique tokens, whereas the set of unique words \
with unknown pronounciations has left with {} items".format(len(pronounciation_dict), len(set(unknown_words))))


Now the pronounciation dictionary has 67531 unique tokens, whereas the set of unique words with unknown pronounciations has left with 21324 items
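
The hyphen rule can be tried on a toy example. The sketch below is a function-style restatement of the loop above, with a hypothetical name (`hyphen_pronunciation`) and a two-entry toy dictionary in place of `cmu_dict`.

```python
def hyphen_pronunciation(token, cmu):
    """Join the pronunciations of the hyphen-separated pieces, or return None."""
    pieces = [piece for piece in token.split('-') if piece]
    parts = []
    for piece in pieces:
        if piece.upper() not in cmu:
            return None  # one unknown piece makes the whole token unknown
        parts.append(cmu[piece.upper()])
    return " ".join(parts) if parts else None

# toy entries standing in for cmu_dict
toy_cmu = {"UNDER": "AH1 N D ER0", "SATIN": "S AE1 T AH0 N"}
print(hyphen_pronunciation("under-satin", toy_cmu))
print(hyphen_pronunciation("piddy-pat", toy_cmu))
```

One deliberate difference from the loop: a token made only of hyphens yields `None` here, whereas the loop stores an empty pronunciation string for it.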


In [7]:
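# for verbs ending with "in" or "in'"
for token in set(unknown_words):
    if token[-2:] == 'in':
        replaced = token[:-2]+'ing'
    elif token[-3:] == "in'":
        replaced = token[:-3]+'ing'
    else:
        continue # not a dropped-'g' spelling; without this, `replaced` would carry over from the previous token
    try:
        pronounciation_dict[token] = cmu_dict[replaced.upper()][:-1] # drop the final 'G' character so the trailing 'NG' becomes 'N'
        unknown_words = list(filter(lambda a: a != token, unknown_words))
    except KeyError: # the restored '-ing' form is not in the dictionary
        pass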
        
print("Now the pronounciation dictionary has {} unique tokens, whereas the set of unique words \
with unknown pronounciations has left with {} items".format(len(pronounciation_dict), len(set(unknown_words))))


Now the pronounciation dictionary has 83347 unique tokens, whereas the set of unique words with unknown pronounciations has left with 5508 items
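
To see the dropped-'g' rule on a single word, here is a minimal sketch. The function name `dropped_g_pronunciation` is hypothetical, the toy dictionary stands in for `cmu_dict`, and the extra check that the entry really ends in the `NG` phoneme is an assumption added for safety rather than part of the cell above.

```python
def dropped_g_pronunciation(token, cmu):
    """Pronounce "cryin" / "cryin'" via the entry for "crying", or return None."""
    if token.endswith("in'"):
        restored = token[:-3] + "ing"
    elif token.endswith("in"):
        restored = token[:-2] + "ing"
    else:
        return None  # not a dropped-'g' spelling
    full = cmu.get(restored.upper())
    if full is None or not full.endswith(" NG"):
        return None
    # removing the last character turns the final NG phoneme into N
    return full[:-1]

toy_cmu = {"CRYING": "K R AY1 IH0 NG"}  # toy stand-in for cmu_dict
print(dropped_g_pronunciation("cryin'", toy_cmu))
print(dropped_g_pronunciation("love", toy_cmu))
```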


For the remaining 5508 unique tokens that still have no pronunciation, we'll use the CMU Lexicon Tool. Submitting them through its website gave the following translations:

In [8]:
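import json

# load the pronunciations returned by the CMU Lexicon Tool
filename = 'pickle_vars/rhyme/cmu_lextool_dict.txt'

# fill in the remaining pronunciations
# NOTE: lines are matched to tokens by position, so the file must list the words
# in the same order in which set(unknown_words) is iterated here
with open(filename) as fh:
    for line, token in zip(fh, set(unknown_words)):
        try:
            # the first field is discarded, the rest is the phoneme string
            _, phonemes = line.strip().split(None, 1)
            pronounciation_dict[token] = phonemes.strip()
        except ValueError: # the line has no phoneme field
            pronounciation_dict[token] = 'UNK'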
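
Because the loop above pairs file lines with tokens purely by position, the token order used when submitting the words has to be reproducible. One way to get that, sketched here under the assumption that the Lexicon Tool returns one line per submitted word in the same order, is to sort the unknown tokens before both the upload and the pairing. The helper name, the in-memory sample file and the phonemes in it are all made up for illustration.

```python
import io

def pair_lextool_output(lines, tokens):
    """Map each token to the phonemes on the matching line ('UNK' if missing)."""
    result = {}
    # sorted() gives a stable order; a bare set may iterate differently per session
    for line, token in zip(lines, sorted(set(tokens))):
        fields = line.strip().split(None, 1)
        result[token] = fields[1].strip() if len(fields) == 2 else 'UNK'
    return result

# hypothetical tool output for the sorted tokens ['aiyyo', 'ketro']
fake_file = io.StringIO("AIYYO\tAY Y OW\nKETRO\n")
paired = pair_lextool_output(fake_file, ["ketro", "aiyyo", "ketro"])
print(paired)
```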

In [9]:
print(len(pronounciation_dict))

88855


Now we have completed our pronunciation conversion dictionary.

In [11]:
import pickle
# store important variables as pickle files under pickle_vars/rhyme/
def writePickle(Variable, fname):
    filename = fname +".pkl"
    with open("pickle_vars/rhyme/"+filename, 'wb') as f:
        pickle.dump(Variable, f)

In [12]:
writePickle(pronounciation_dict, "pronounciation_dict")
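
For completeness, a stored variable can be loaded back with the mirror-image call. The sketch below is an assumption about how one might read the files, not part of the original notebook: `read_pickle` is a hypothetical helper, and it round-trips a toy dictionary through a temporary folder rather than `pickle_vars/rhyme/`.

```python
import os
import pickle
import tempfile

def read_pickle(fname, folder):
    """Load `<folder>/<fname>.pkl`, the counterpart of writePickle."""
    with open(os.path.join(folder, fname + ".pkl"), 'rb') as f:
        return pickle.load(f)

# round trip through a temporary folder instead of pickle_vars/rhyme/
with tempfile.TemporaryDirectory() as folder:
    toy_dict = {"satin": "S AE1 T AH0 N"}
    with open(os.path.join(folder, "toy_dict.pkl"), 'wb') as f:
        pickle.dump(toy_dict, f)
    restored = read_pickle("toy_dict", folder)

print(restored == toy_dict)
```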

In [10]:
print("There are", list(pronounciation_dict.values()).count('UNK'), "items with unknown pronounciations")

There are 44 items with unknown pronounciations


In [12]:
print(set(list(pronounciation_dict.values())))

{'S P AO1 R', '', 'S ER0 V EY1', 'F L AA1 G IH0 NG', 'L IY1 Z AH0', 'K L AY1 M AH0 T', 'T AA1 N S AH0 L Z', 'F R AH G AE N CH IY AH', 'K L AH1 CH AH0 Z', 'B AY D AY AH', 'K AH0 N T AE1 M AH0 N EY2 T AH0 D', 'AH0 L UH1 K IH0 N', 'S AA1 B Z', 'M AE1 S M EY1 D', 'R AE1 SH AH0 N Z', 'B L IH1 Z ER0 D', 'T R AE0 N S P L AE1 N T IH0 NG', 'EH2 K S T R AH0 T ER2 EH1 S T R IY0 AH0 L Z', 'SH IY1 T S', 'HH AY1 D IY1 HH AY1', 'HH OW1 Z', 'B AA1 R B Z', 'HH AE1 L IY0', 'K IH1 K IH0 N', 'S UW1 DH', 'HH AH1 V ER0 D', 'T AA1 ZH', 'D IY1 F EH0 K T', 'L AE1 N D IH0 N', 'S AA1 D AH0 N', 'K L AH1 B Z', 'R EY1 T IH0 D', 'G AE1 T', 'TH EH1 R AH0', 'AO2 T AH0 M AE1 T IH0 K', 'N EH1 T AH0 L Z', 'B ER1 N Z', 'K AE1 S', 'JH EH1 L AH0 S', 'AH1 N D ER0', 'M IH S T AA', 'AE1 T AH0 T UW2 D', 'M IH0 S Y UW1 Z IH0 Z', 'F AH R M AH M AH N T', 'T R EH1 D IH0 NG', 'R IH0 Z ER1 V', 'F R EH1 N D SH IH0 P', 'R EY1 S IH0 NG', 'L IH1 S AH0 N IH0 NG', 'P AY1 AH0 T IY0', 'K R AY1 IH0 NG', 'S M IH1 TH', 'SH AA1 L AA1 L AA1 L AA1

In [13]:
pieces = []
for pronounce in set(list(pronounciation_dict.values())):
    for piece in pronounce.split():
        pieces.append(piece)
print(len(pieces))

204357


In [14]:
phoneme_vocabulary = list(set(pieces))
#phoneme_vocabulary.append('//') # this line is here for the line break version
len(phoneme_vocabulary)
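# ids start at 1 because 0 is the padding value and row 0 of the embedding matrix is the pad vector
PHO2id = {pho: i for i, pho in enumerate(phoneme_vocabulary, start=1)}
id2PHO = {i: pho for i, pho in enumerate(phoneme_vocabulary, start=1)}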
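
As a small illustration of the two lookup tables, the sketch below encodes and decodes one pronunciation with a toy vocabulary. It assumes that id 0 is reserved for padding, which matches the zero pad-vector row placed first in the embedding matrix later in this notebook; the names `pho_to_id` and `id_to_pho` are illustrative.

```python
toy_vocabulary = ["AE1", "S", "T"]  # toy phoneme vocabulary

# ids start at 1; 0 is left free for the padding value
pho_to_id = {pho: i for i, pho in enumerate(toy_vocabulary, start=1)}
id_to_pho = {i: pho for pho, i in pho_to_id.items()}

encoded = [pho_to_id[pho] for pho in "S AE1 T".split()]
decoded = " ".join(id_to_pho[i] for i in encoded)
print(encoded, decoded)
```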

In [15]:
pho_x = list()
for song in x:
    pho_song = list()
    for line in song.split('\n'):
        if line == '':
            pass
        else:
            for token in line.split():
                #pho_song.append(pronounciation_dict[token])
                for pho in pronounciation_dict[token].split():
                    pho_song.append(PHO2id[pho])

                # pho_song.append('/') #word breaks
            # pho_song.pop()
            #pho_song.append(PHO2id['//'])
    pho_x.append(pho_song)

# convert the list of lists to an array of per-song arrays
# (songs differ in length, so the outer array has to hold objects)
pho_x = np.array([np.array(song) for song in pho_x], dtype=object)


In [16]:
pho_length = [len(song) for song in pho_x]
print('The max length is: ', max(pho_length))
print('The min length is: ', min(pho_length))
print('The average length is: ', sum(pho_length)/len(pho_length))

The max length is:  5747
The min length is:  12
The average length is:  668.6780833333333


In [18]:
# Padding
from keras.preprocessing.sequence import pad_sequences
padded_pho_x = pad_sequences(pho_x, maxlen=max(pho_length), padding='post')

# Convert to numpy array
padded_pho_x = np.array(padded_pho_x)
padded_pho_x.shape

(12000, 5747)
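
What post-padding does to sequences of unequal length can be shown in plain Python. The helper `pad_post` is hypothetical and only mirrors the behaviour relied on here (zeros appended up to the longest song); it is not the Keras implementation.

```python
def pad_post(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

toy_songs = [[2, 1, 3], [5]]  # two toy "songs" of phoneme ids
print(pad_post(toy_songs))
```

Since the padding value is 0, real phoneme ids need to avoid 0 for padded positions to remain distinguishable from actual phonemes.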

In [23]:
# Shuffle data
np.random.seed(23) # !!!!!! use the same seed value in all dataset split versions !!!!!!
shuffle_indices = np.random.permutation(np.arange(len(y_artist))) 
# shuffle all inputs with the same indices
x_shuffled = padded_pho_x[shuffle_indices]
y_artist_shuffled = y_artist[shuffle_indices]
y_genre_shuffled = y_genre[shuffle_indices]

# form count dictionaries for both artist and genre labels
# first, artist labels
artist_label_count_dict = dict()
for i in range(120):
    artist_label_count_dict[i] = 0
# then also for genre labels
genre_label_count_dict = dict()
for i in range(12):
    genre_label_count_dict[i] = 0
    
'''Now we go through the shuffled examples one by one. For each artist label, the first 80 occurrences
go to the training set, the 81st to 90th occurrences go to the validation set, and the 91st to 100th
occurrences (the last 10) go to the test set.
For each genre label, the first 800 occurrences go to the training set, the 801st to 900th occurrences
go to the validation set, and the 901st to 1000th occurrences (the last 100) go to the test set.'''

# create training, validation and test sets with equal distributions of artists and genres
x_tr_artist, x_val_artist, x_te_artist = list(), list(), list()
x_tr_genre, x_val_genre, x_te_genre = list(), list(), list()
y_tr_artist, y_val_artist, y_te_artist = list(), list(), list()
y_tr_genre, y_val_genre, y_te_genre = list(), list(), list()

for sample, art_label, gen_label in zip(x_shuffled, y_artist_shuffled, y_genre_shuffled):
    artist_label_index = np.argmax(art_label, axis=-1)
    genre_label_index = np.argmax(gen_label, axis=-1)
    # for artist labels
    if artist_label_count_dict[artist_label_index] < 80:
        x_tr_artist.append(sample)
        y_tr_artist.append(art_label)
    elif 80 <= artist_label_count_dict[artist_label_index] < 90:
        x_val_artist.append(sample)
        y_val_artist.append(art_label)
    elif 90 <= artist_label_count_dict[artist_label_index] < 100:
        x_te_artist.append(sample)
        y_te_artist.append(art_label)
    else:
        print("There is an error with artist counts!")
    artist_label_count_dict[artist_label_index] += 1
        
    # for genre labels
    if genre_label_count_dict[genre_label_index] < 800:
        x_tr_genre.append(sample)
        y_tr_genre.append(gen_label)
    elif 800 <= genre_label_count_dict[genre_label_index] < 900:
        x_val_genre.append(sample)
        y_val_genre.append(gen_label)
    elif 900 <= genre_label_count_dict[genre_label_index] < 1000:
        x_te_genre.append(sample)
        y_te_genre.append(gen_label)
    else:
        print("There is an error with genre counts!")
    genre_label_count_dict[genre_label_index] += 1

        
        
# turn the output datasets into np arrays
x_tr_pho_artist = np.array(x_tr_artist)
x_val_pho_artist = np.array(x_val_artist)
x_te_pho_artist = np.array(x_te_artist)
x_tr_pho_genre = np.array(x_tr_genre)
x_val_pho_genre = np.array(x_val_genre)
x_te_pho_genre = np.array(x_te_genre)
y_tr_pho_artist = np.array(y_tr_artist)
y_val_pho_artist = np.array(y_val_artist)
y_te_pho_artist = np.array(y_te_artist)
y_tr_pho_genre = np.array(y_tr_genre)
y_val_pho_genre = np.array(y_val_genre)
y_te_pho_genre  = np.array(y_te_genre)
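
The counting scheme behind this split can be checked on a handful of labels. The sketch below is a generic restatement with a hypothetical function name (`split_indices`) and toy sizes of 2/1/1 per label instead of 80/10/10 or 800/100/100; it returns sample indices rather than the samples themselves.

```python
from collections import defaultdict

def split_indices(labels, n_train, n_val, n_test):
    """Assign each sample index to train/val/test by its per-label occurrence count."""
    seen = defaultdict(int)
    train, val, test = [], [], []
    for index, label in enumerate(labels):
        count = seen[label]  # how often this label has appeared so far
        if count < n_train:
            train.append(index)
        elif count < n_train + n_val:
            val.append(index)
        elif count < n_train + n_val + n_test:
            test.append(index)
        else:
            raise ValueError("label {} occurs more often than expected".format(label))
        seen[label] += 1
    return train, val, test

toy_labels = [0, 1, 0, 0, 1, 1, 0, 1]
train, val, test = split_indices(toy_labels, n_train=2, n_val=1, n_test=1)
print(train, val, test)
```

Every label ends up with the same number of samples in each subset, which is the property the loop above enforces for artists and genres.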


#### Creating One-Hot Embeddings
Here we create a one-hot embedding matrix out of the phoneme vocabulary. <br>
Future versions could also include vectors for line and word breaks. Doing so changes the phoneme vocabulary, so the dataset would have to be re-encoded to match.

In [21]:
print(len(phoneme_vocabulary))
vocab_size_pho = len(phoneme_vocabulary)  #(87 phonemes, 1 for unknown)

# Embedding weights
embedding_dim_pho = len(phoneme_vocabulary) #88
zero_vector = np.zeros((1, embedding_dim_pho)) # representing pad vector
embedding_weights_pho = np.concatenate((zero_vector, np.identity(vocab_size_pho)), axis=0) #  concatenate the vectors

from keras.layers import Embedding
input_size = padded_pho_x.shape[1]

# Embedding layer Initialization
embedding_layer_pho = Embedding(vocab_size_pho + 1,
                            embedding_dim_pho,
                            input_length=input_size,
                            weights=[embedding_weights_pho])


88
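
The effect of these weights is easiest to see with a tiny vocabulary. The sketch below builds the same kind of matrix for three phonemes and performs the row lookup by hand with NumPy indexing; it assumes phoneme ids run from 1 to the vocabulary size, with 0 used only for padding.

```python
import numpy as np

vocab_size = 3  # toy vocabulary; the notebook's has 88 entries
# row 0 is the all-zero pad vector, rows 1..vocab_size are one-hot vectors
weights = np.concatenate((np.zeros((1, vocab_size)), np.identity(vocab_size)), axis=0)

ids = np.array([2, 1, 0])  # two phoneme ids followed by one padding position
vectors = weights[ids]     # row lookup, as an embedding layer would do
print(vectors)
```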


In [24]:
writePickle(embedding_weights_pho, "embedding_weights_pho_onehot")
writePickle(x_tr_pho_artist, "x_tr_pho_artist")
writePickle(x_val_pho_artist, "x_val_pho_artist")
writePickle(x_te_pho_artist, "x_te_pho_artist")
writePickle(x_tr_pho_genre, "x_tr_pho_genre")
writePickle(x_val_pho_genre, "x_val_pho_genre")
writePickle(x_te_pho_genre, "x_te_pho_genre")
writePickle(y_tr_pho_artist, "y_tr_pho_artist")
writePickle(y_val_pho_artist, "y_val_pho_artist")
writePickle(y_te_pho_artist, "y_te_pho_artist")
writePickle(y_tr_pho_genre, "y_tr_pho_genre")
writePickle(y_val_pho_genre, "y_val_pho_genre")
writePickle(y_te_pho_genre, "y_te_pho_genre")
writePickle(phoneme_vocabulary, "phoneme_vocabulary") # without spaces

