# Embedding Process

For the embedding process we have chosen a method known as the "Continuous Bag of Words" (CBOW). The CBOW model attempts to predict a target word from its context words (the surrounding words). For example, given the sentence "Beauty is in the eye of the beholder", we can form pairs of context words and a target word. The number of context words in each pair depends on the size of our context window (a numeric value n that determines how many words around the target word are considered). If we take two context words, one on each side of the target, then an example pair would be ([Beauty, in], is). The model would then try to predict the target word "is" from the context words. Note that in the code later in this notebook, window_size counts the words on each side of the target, so a window_size of three yields six context words. This model structure is referenced from the website: https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html.
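To make the pairing concrete, here is a minimal sketch in plain Python that forms (context, target) pairs for the example sentence. The helper name "context_target_pairs" and its "half_window" parameter are illustrative assumptions and do not come from the referenced article; words near the edges of the sentence simply receive fewer context words.

```python
def context_target_pairs(tokens, half_window):
    """Return (context_words, target_word) pairs for every position in tokens.

    half_window is the number of context words taken on each side of the target
    (an illustrative parameter name, not one used elsewhere in this notebook).
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]      # words before the target
        right = tokens[i + 1:i + 1 + half_window]     # words after the target
        pairs.append((left + right, target))
    return pairs


sentence = "Beauty is in the eye of the beholder"
pairs = context_target_pairs(sentence.split(" "), half_window=1)
for context, target in pairs[:3]:
    print(context, "->", target)
# ['is'] -> Beauty
# ['Beauty', 'in'] -> is
# ['is', 'the'] -> in
```

The second pair is the ([Beauty, in], is) example from the text.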

The first step in this process is to obtain an accurate count of the unique words in our refined corpus, which we do below:

In [1]:
import pandas as pd
import numpy as np
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
import nltk.data
from nltk.tokenize import sent_tokenize, word_tokenize
import re

Import the cleaned data:

In [2]:
data = pd.read_csv("cleaned_data.csv")
length = len(data)
data

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,artist,lyrics,listens,hotness,genres,genius ID,spotify ID
0,0,0,Fast Cars,Craig David,Fast cars Fast women Speed bikes with the n...,751624,28,"['R&B Genius', 'Rock Genius']",,
1,1,1,Watching The Rain,Scapegoat Wax,Hello hello its me again You know since yo...,10681,6,['Pop Genius'],,
2,2,2,Run to the Money,Bankroll Fresh,Fuckin up money I already done it Guess ...,160202,17,['Rap Genius'],,
3,3,3,Determined (Vows of Vengeance) - Live at Summe...,Kataklysm,Your fate is made of words that you spill Li...,19218,4,['Pop Genius'],,
4,4,4,Dandelion Days,Adam McHeffey,Well look back on our old ways And call them...,1038,0,['Pop Genius'],,
...,...,...,...,...,...,...,...,...,...,...
36754,36754,36754,Rain,Leela James,I cant take this rain no When Im bringing ...,256830,19,"['Soul', 'R&B Genius']",957879.0,58qUsOq28QtDpPdyHAARNQ
36755,36755,36755,Cardiff Afterlife,Manic Street Preachers,If the love between us has faded away Left ...,94223,15,['Rock Genius'],463900.0,1SylCJI2pHEXC1GO3cd9iR
36756,36756,36756,Darkness Breeds Immortality,Marduk,An angel in ecstasy crosses the sky The sky ...,69664,17,['Pop Genius'],1402388.0,5RFS3Z2vLi0jOzSDAIHQnd
36757,36757,36757,Super Bad Sisters,Sister Sledge,Super bad super bad ah ah ah arent we bad S...,78248,14,['R&B Genius'],3988510.0,3n2aVCZTCCJ4HDIrsiTbZ4


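We will now build a dictionary 'index_lookup' that maps every unique word in the corpus to an integer index, which also gives us the unique word count. Each lyric is lowercased with the ".lower()" method first, so that differently capitalized forms of the same word (for example "Fast" and "fast") are not counted as separate words. We also build the inverse mapping 'word_lookup', which goes from index back to word.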

In [3]:
index_lookup = {}
x = 0
for text in data["lyrics"]:
    for word in text.lower().split(" "):
        # skip the empty strings produced by repeated spaces
        if word != '' and word not in index_lookup:
            index_lookup[word] = x
            x = x + 1
# inverse mapping: index -> word
word_lookup = {v: k for k, v in index_lookup.items()}

In [4]:
vocab_size = len(index_lookup)
vocab_size

118466
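As a small illustration of why the lowercasing matters, the sketch below builds the same kind of lookup on two made-up lyric strings. The function name "build_vocab" and the toy strings are assumptions for demonstration only; the logic mirrors the cell above.

```python
def build_vocab(texts):
    """Map each distinct lowercased word to the order in which it first appears."""
    index_lookup = {}
    for text in texts:
        for word in text.lower().split(" "):
            # ignore empty strings that come from leading or repeated spaces
            if word and word not in index_lookup:
                index_lookup[word] = len(index_lookup)
    return index_lookup


toy_lyrics = ["Fast cars Fast women", "  fast bikes"]   # made-up data
toy_index = build_vocab(toy_lyrics)
toy_words = {index: word for word, index in toy_index.items()}

# Without lowercasing, "Fast" and "fast" would be counted separately.
case_sensitive_count = len({w for t in toy_lyrics for w in t.split(" ") if w})

print(toy_index)              # {'fast': 0, 'cars': 1, 'women': 2, 'bikes': 3}
print(case_sensitive_count)   # 5
```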

As we see above, there are 118466 unique words in our data. Let's take the first song from our data and map each of its words to its numeric index; doing this lets us check the index assignment for errors. We loop over the first song, split its lyrics on spaces, and look up each word. In the end we should have a list of numeric values.

In [5]:
for text in [data["lyrics"][0]]:
    songlist = []
    print(text)
    for word in text.lower().split(" "):
        # empty strings from repeated spaces are not in index_lookup; skip them
        try:
            songlist.append(index_lookup[word])
        except KeyError:
            pass
songlist

   Fast cars Fast women Speed bikes with the nitro in them Dangerous when driven Those are the type that I be feeling    Sitting there while I observe I like your lines I love your curves Checking out your bodywork How can I get with her Youre the one that I want Do anything to turn you on Somebody please just pass the keys so you can take a ride with me   Im on a mission First thing disarming your system Next thing slip the key in the ignition Just listen To the way that you purr at me you know you prefer the speed When your back starts dipping Wheel spinning when the gears start shifting Im sticking til the turbo kicks in You know that Im missing Got me moving so fast you got me missing the flash a 50   Fast cars Fast women Speed bikes with the nitro in them Dangerous when driven Those are the type that I be feeling    Feel the ride feel the rush The moment I tease your clutch Reacting to my every touch Were shifting down or tearing up I dont care where we go To burn you outs the end

[0,
 1,
 0,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 6,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 17,
 23,
 17,
 24,
 25,
 26,
 17,
 27,
 25,
 28,
 29,
 30,
 25,
 31,
 32,
 33,
 17,
 34,
 5,
 35,
 36,
 6,
 37,
 16,
 17,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 6,
 49,
 50,
 43,
 33,
 51,
 52,
 53,
 5,
 54,
 55,
 44,
 52,
 56,
 57,
 58,
 59,
 25,
 60,
 61,
 58,
 62,
 6,
 63,
 8,
 6,
 64,
 47,
 65,
 41,
 6,
 66,
 16,
 43,
 67,
 68,
 54,
 43,
 69,
 43,
 70,
 6,
 3,
 11,
 25,
 71,
 72,
 73,
 74,
 75,
 11,
 6,
 76,
 77,
 78,
 55,
 79,
 80,
 6,
 81,
 82,
 8,
 43,
 69,
 16,
 55,
 83,
 84,
 54,
 85,
 50,
 0,
 43,
 84,
 54,
 83,
 6,
 86,
 52,
 87,
 0,
 1,
 0,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 6,
 15,
 16,
 17,
 18,
 19,
 88,
 6,
 53,
 88,
 6,
 89,
 6,
 90,
 17,
 91,
 25,
 92,
 93,
 41,
 94,
 95,
 96,
 97,
 78,
 98,
 99,
 100,
 101,
 17,
 102,
 103,
 104,
 105,
 106,
 41,
 107,
 43,
 108,
 6,
 109,
 110,
 36,
 111,
 5,
 52,
 112,
 113,
 114,

Everything looks as expected: a repeated word such as "fast" maps to the same index every time it appears. We now do the same for all the song lyrics, which gives us a corpus in which each song is represented as a list of numeric word indices.

In [6]:
corpus = []
for text in data["lyrics"]:
    songlist = []
    for word in text.lower().split(" "):
        # empty strings from repeated spaces are not in index_lookup; skip them
        try:
            songlist.append(index_lookup[word])
        except KeyError:
            pass
    corpus.append(songlist)

# Formation of Context Word Pairs

Now that the corpus is complete, it is time to form the pairs of target words and context words. The size of the context window is a somewhat arbitrary choice; we have decided on three, which we chose for its accuracy and lower probability of overfitting our data. The first step is to write a function that creates the pairs we are looking for.

In [7]:
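from keras.utils import np_utils
from keras.preprocessing import sequence

window_size = 3

def generate_context_word_pairs(corpus, window_size, vocab_size):
    """Yield one (context, one-hot target) pair for every word in the corpus."""
    # window_size words are taken on each side of the target
    context_length = window_size * 2
    for lyrics in corpus:
        lyrics_length = len(lyrics)
        for index, word in enumerate(lyrics):
            context_words = []
            label_word = []
            start = index - window_size
            end = index + window_size + 1

            # collect the in-range neighbours, excluding the target itself
            context_words.append([lyrics[i]
                                 for i in range(start, end)
                                 if 0 <= i < lyrics_length
                                 and i != index])
            label_word.append(word)

            # windows cut short by the start or end of a song are left-padded with 0
            x = sequence.pad_sequences(context_words, maxlen=context_length)
            # one-hot encode the target over the whole vocabulary
            y = np_utils.to_categorical(label_word, vocab_size)
            yield (x, y)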

Using TensorFlow backend.
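To show what the padding and one-hot steps in the function above produce, here is a minimal pure-Python sketch of the same two operations. The helpers "left_pad" and "one_hot" are hypothetical stand-ins written for illustration; they imitate the default behaviour of pad_sequences (pad and truncate at the front, with value 0) and of to_categorical, but they are not the Keras implementations.

```python
def left_pad(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) a list of ints to exactly maxlen items."""
    trimmed = sequence[-maxlen:]                      # keep the last maxlen items
    return [value] * (maxlen - len(trimmed)) + trimmed


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# The first word of the first song has only three neighbours, all to its right.
# Their indices are 1, 0, 2 (the 0 is the real word "fast").
padded = left_pad([1, 0, 2], maxlen=6)
print(padded)          # [0, 0, 0, 1, 0, 2]
print(one_hot(2, 5))   # [0.0, 0.0, 1.0, 0.0, 0.0]
```

In the padded result the first three zeros are padding while the fifth entry is a genuine word, which shows that padding and the word with index 0 cannot be told apart once the context has been padded.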


This function should yield our pairs, and the best way to check that it does is to test it on our first song.

In [8]:
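# Test this out for some samples
i = 0
for x, y in generate_context_word_pairs(corpus=corpus, window_size=window_size, vocab_size=vocab_size):
    # skip padded windows; index 0 is also the index of the first vocabulary
    # word ("fast"), so windows that contain that word are skipped as well
    if 0 not in x[0]:
        print('Context (X):', [word_lookup[w] for w in x[0]], '-> Target (Y):', word_lookup[np.argwhere(y[0])[0][0]])

        # i is checked before it is incremented, so 11 pairs are printed
        if i == 10:
            break
        i += 1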

Context (X): ['women', 'speed', 'bikes', 'the', 'nitro', 'in'] -> Target (Y): with
Context (X): ['speed', 'bikes', 'with', 'nitro', 'in', 'them'] -> Target (Y): the
Context (X): ['bikes', 'with', 'the', 'in', 'them', 'dangerous'] -> Target (Y): nitro
Context (X): ['with', 'the', 'nitro', 'them', 'dangerous', 'when'] -> Target (Y): in
Context (X): ['the', 'nitro', 'in', 'dangerous', 'when', 'driven'] -> Target (Y): them
Context (X): ['nitro', 'in', 'them', 'when', 'driven', 'those'] -> Target (Y): dangerous
Context (X): ['in', 'them', 'dangerous', 'driven', 'those', 'are'] -> Target (Y): when
Context (X): ['them', 'dangerous', 'when', 'those', 'are', 'the'] -> Target (Y): driven
Context (X): ['dangerous', 'when', 'driven', 'are', 'the', 'type'] -> Target (Y): those
Context (X): ['when', 'driven', 'those', 'the', 'type', 'that'] -> Target (Y): are
Context (X): ['driven', 'those', 'are', 'type', 'that', 'i'] -> Target (Y): the
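The first pair printed above has the target "with" rather than an earlier word, because the test skips every window that contains the value 0, and 0 is both the padding value and the index of the word "fast". As a hedged sketch of one alternative, full windows can be identified by position instead of by content. The two helper functions below are hypothetical and exist only to compare the two filters on the first eleven indices of the first song; another option, not shown here, would be to start the word indices at 1 and keep 0 for padding only.

```python
def full_window_positions(lyrics, window_size):
    """Positions whose window lies entirely inside the song (no padding needed)."""
    return [i for i in range(len(lyrics))
            if i - window_size >= 0 and i + window_size < len(lyrics)]


def zero_free_positions(lyrics, window_size):
    """Positions kept by a filter that rejects any context containing 0."""
    kept = []
    for i in full_window_positions(lyrics, window_size):
        context = lyrics[i - window_size:i] + lyrics[i + 1:i + window_size + 1]
        if 0 not in context:
            kept.append(i)
    return kept


# fast cars fast women speed bikes with the nitro in them
first_words = [0, 1, 0, 2, 3, 4, 5, 6, 7, 8, 9]

full_positions = full_window_positions(first_words, window_size=3)
zero_free = zero_free_positions(first_words, window_size=3)
print(full_positions)   # [3, 4, 5, 6, 7]
print(zero_free)        # [6, 7]
```

Positions 3, 4 and 5 ("women", "speed", "bikes") have complete six-word contexts but are dropped by the content-based filter because "fast" appears in them.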


Thus we can see that the function successfully pairs the context words with our target word. Now it's time to build our CBOW model architecture using Keras and TensorFlow. The model takes the context words as input and passes them to an embedding layer, which is initialized with random weights. The embeddings are then sent to a lambda layer, where we average the word embeddings. These averaged values are passed to a dense softmax layer that attempts to predict our target word. The prediction of the dense softmax layer is compared with our actual target word to compute a loss using categorical_crossentropy, and backpropagation on each training step updates the embedding layer in the process.
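The forward pass just described (embedding lookup, averaging, softmax, cross-entropy) can be sketched in plain Python on a toy vocabulary. This is an illustration under stated assumptions, not the Keras model: the sizes "toy_vocab_size" and "toy_embed_size", the uniform initialization range, and the function names are all made up for the example, and no weight update is performed.

```python
import math
import random

random.seed(0)
toy_vocab_size = 5   # assumed toy size, not the real vocabulary
toy_embed_size = 3   # assumed toy size, not the real embedding width

# Randomly initialised embedding matrix (vocab x embed) and dense layer (embed x vocab).
embedding = [[random.uniform(-0.05, 0.05) for _ in range(toy_embed_size)]
             for _ in range(toy_vocab_size)]
dense_w = [[random.uniform(-0.05, 0.05) for _ in range(toy_vocab_size)]
           for _ in range(toy_embed_size)]
dense_b = [0.0] * toy_vocab_size


def cbow_forward(context_ids):
    """Average the context embeddings and return softmax probabilities."""
    vectors = [embedding[i] for i in context_ids]                    # embedding lookup
    mean_vec = [sum(column) / len(vectors) for column in zip(*vectors)]
    logits = [sum(mean_vec[k] * dense_w[k][j] for k in range(toy_embed_size)) + dense_b[j]
              for j in range(toy_vocab_size)]
    shift = max(logits)                                              # numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_id):
    """Categorical cross-entropy against a one-hot target."""
    return -math.log(probs[target_id])


probs = cbow_forward([1, 2, 4, 0])
loss = cross_entropy(probs, target_id=3)
print(len(probs), round(sum(probs), 6))
```

With small random weights the predicted distribution is close to uniform, so the loss starts near the logarithm of the vocabulary size; training would lower it by adjusting the embedding and dense weights.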

In [9]:
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

embed_size = 100

# build CBOW architecture
cbow = Sequential()
cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2))
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
cbow.add(Dense(vocab_size, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# view model summary
print(cbow.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 6, 100)            11846600  
_________________________________________________________________
lambda_1 (Lambda)            (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 118466)            11965066  
Total params: 23,811,666
Trainable params: 23,811,666
Non-trainable params: 0
_________________________________________________________________
None
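The parameter counts in the summary can be checked by hand: the embedding layer holds one vector of length embed_size per vocabulary word, and the dense layer holds one weight per (embedding dimension, vocabulary word) pair plus one bias per vocabulary word. A short arithmetic check, using the values from this notebook:

```python
vocab_size = 118466
embed_size = 100

embedding_params = vocab_size * embed_size               # one 100-d vector per word
dense_params = embed_size * vocab_size + vocab_size      # weights plus biases
total_params = embedding_params + dense_params

print(embedding_params)   # 11846600
print(dense_params)       # 11965066
print(total_params)       # 23811666
```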


As we can see from the summary, we have 23,811,666 parameters! This model is now ready to be trained on our corpus. The following code shows how the training would be run; however, due to the size of these embeddings, we will use a third-party means of computation (as stated in the project proposal).

In [None]:
for epoch in range(1, 6):
    loss = 0.
    i = 0
    for x, y in generate_context_word_pairs(corpus=corpus, window_size=window_size, vocab_size=vocab_size):
        i += 1
        loss += cbow.train_on_batch(x, y)
        if i % 100000 == 0:
            print('Processed {} (context, word) pairs'.format(i))

    print('Epoch:', epoch, '\tLoss:', loss)
    print()

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
