# Embedding Process (WIP)

For the embedding process we have chosen a method known as the "Continuous Bag of Words" (CBOW). The CBOW model attempts to predict a target word from its context words (the surrounding words). For example, given the sentence "Beauty is in the eye of the beholder", we can form pairs of context words and a target word. The number of context words in each pair depends on the size of our context window (a numeric value n, meaning that n words on each side of the target are considered). With one word on each side, giving a context of two words in total, an example pair would be ([Beauty, in], is). The model would then try to predict the target word "is" from that context. This model structure is referenced from the website: https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html.
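
As a minimal illustration of the pairing idea (not the pipeline used later in this notebook), the sketch below builds (context, target) pairs for the example sentence. The helper name `cbow_pairs` is hypothetical, and `window` counts words on each side of the target, so `window=1` gives the two-word context from the example.

```python
# Illustrative sketch: build (context, target) pairs for one sentence,
# taking up to `window` words on each side of the target word.
def cbow_pairs(tokens, window):
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # words before the target
        right = tokens[i + 1:i + 1 + window]      # words after the target
        pairs.append((left + right, target))
    return pairs

sentence = "Beauty is in the eye of the beholder"
pairs = cbow_pairs(sentence.split(), window=1)
print(pairs[1])  # (['Beauty', 'in'], 'is')
```

Words at the edges of the sentence simply get a shorter context here; the notebook's own generator pads short contexts to a fixed length instead.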

The first step is to get an accurate count of the unique words in our refined corpus, which we do below:

In [3]:
import pandas as pd
import numpy as np
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
import nltk.data
from nltk.tokenize import sent_tokenize, word_tokenize
import re

Import the cleaned data:

In [4]:
data = pd.read_csv("../Data Cleanup/english_clean.csv")
length = len(data)
data[:5]

Unnamed: 0,title,artist,lyrics,listens,hotness,genres,genius ID,spotify ID,language,instrumental,song length
0,Fast Cars,Craig David,"['Fast cars', 'Fast women', 'Speed bikes with ...",751624,28,"['r&b', 'rock']",,,en,False,379
1,Watching The Rain,Scapegoat Wax,"['Ya ya ya ya ya ya ya', 'Hello hello its me a...",10681,6,['pop'],,,en,False,360
2,Run to the Money,Bankroll Fresh,"['Fuckin up money I already done it', 'Guess w...",160202,17,['rap'],,,en,False,422
3,Determined (Vows of Vengeance) - Live at Summe...,Kataklysm,"['Your fate is made of words that you spill', ...",19218,4,['pop'],,,en,False,258
4,Dandelion Days,Adam McHeffey,"['Well look back on our old ways', 'And call t...",1038,0,['pop'],,,en,False,181


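We will now build a dictionary 'index_lookup' that maps each unique word in the corpus to a unique integer, which also gives us the unique word count. Words are lowercased with the ".lower()" method so that differences in capitalization are not counted as separate words. The reverse mapping 'word_lookup' maps each integer back to its word.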

In [5]:
index_lookup = {}
x = 0
for text in data["lyrics"]:
    for word in text.lower().split(" "):
        # Skip empty strings produced by consecutive spaces.
        if word != '' and word not in index_lookup:
            index_lookup[word] = x
            x = x + 1
# Reverse mapping: index -> word
word_lookup = {v: k for k, v in index_lookup.items()}

In [11]:
word_lookup[55]

'ride'

In [6]:
vocab_size = len(index_lookup)
vocab_size

314041

As we see above, we have a total of 314041 unique words in our data. Let's take the first song and convert each of its words to its numeric index; this lets us check for errors in the index assignment. We do this by looping over the first song and splitting its text on spaces. In the end we should have a list of numeric values.

In [7]:
text = data["lyrics"][0]
print(text)
songlist = []
for word in text.lower().split(" "):
    # Only empty strings are missing from index_lookup, so they are skipped.
    if word in index_lookup:
        songlist.append(index_lookup[word])
print(songlist)

['Fast cars', 'Fast women', 'Speed bikes with the nitro in them', 'Dangerous when driven', 'Those are the type that I be feeling', 'Sitting there while I observe', 'I like your lines I love your curves', 'Checking out your bodywork', 'How can I get with her', 'Youre the one that I want', 'Do anything to turn you on', 'Somebody please just pass the keys so you can take a ride with me', 'Im on a mission', 'First thing disarming your system', 'Next thing slip the key in the ignition', 'Just listen', 'To the way that you purr at me you know you prefer the speed', 'When your back starts dipping', 'Wheel spinning when the gears start shifting', 'Im sticking til the turbo kicks in', 'You know that Im missing', 'Got me moving so fast you got me missing the flash a 50', 'Fast cars', 'Fast women', 'Speed bikes with the nitro in them', 'Dangerous when driven', 'Those are the type that I be feeling', 'Feel the ride feel the rush', 'The moment I tease your clutch', 'Reacting to my every touch', 'We

Everything looks great. Now we will do this for all the song lyrics in our corpus, which gives us a list of songs where each song is a list of numeric word indices.

In [8]:
corpus = []
for text in data["lyrics"]:
    # Only empty strings are missing from index_lookup, so they are skipped.
    songlist = [index_lookup[word]
                for word in text.lower().split(" ")
                if word in index_lookup]
    corpus.append(songlist)

# Formation of Context Word Pairs

Now that the corpus is complete, it is time to form the pairs of target words and context words. The size of the context window is a somewhat arbitrary choice; we have decided on three words on each side of the target, which we expect to balance accuracy against the risk of overfitting our data. The first step is to write a function that creates the pairs we are looking for.

In [9]:
from keras.utils import np_utils
from keras.preprocessing import sequence

window_size = 3

def generate_context_word_pairs(corpus, window_size, vocab_size):
    # Each target word gets up to window_size words on either side as context.
    context_length = window_size * 2
    for lyrics in corpus:
        lyrics_length = len(lyrics)
        for index, word in enumerate(lyrics):
            start = index - window_size
            end = index + window_size + 1

            # Context indices inside the song, excluding the target itself.
            context_words = [[lyrics[i]
                              for i in range(start, end)
                              if 0 <= i < lyrics_length
                              and i != index]]
            label_word = [word]

            # Short contexts are left-padded with 0 (the pad_sequences default).
            x = sequence.pad_sequences(context_words, maxlen=context_length)
            # One-hot encode the target over the whole vocabulary.
            y = np_utils.to_categorical(label_word, vocab_size)
            yield (x, y)

Using TensorFlow backend.


This function should yield our pairs, and the best way to check that it does is to test it on our first song.

In [10]:
# Test this out for some samples
i = 0
for x, y in generate_context_word_pairs(corpus=corpus, window_size=window_size, vocab_size=vocab_size):
    if 0 not in x[0]:
        print('Context (X):', [word_lookup[w] for w in x[0]], '-> Target (Y):', word_lookup[np.argwhere(y[0])[0][0]])
    
        if i == 10:
            break
        i += 1

Context (X): ["cars',", "'fast", "women',", 'bikes', 'with', 'the'] -> Target (Y): 'speed
Context (X): ["'fast", "women',", "'speed", 'with', 'the', 'nitro'] -> Target (Y): bikes
Context (X): ["women',", "'speed", 'bikes', 'the', 'nitro', 'in'] -> Target (Y): with
Context (X): ["'speed", 'bikes', 'with', 'nitro', 'in', "them',"] -> Target (Y): the
Context (X): ['bikes', 'with', 'the', 'in', "them',", "'dangerous"] -> Target (Y): nitro
Context (X): ['with', 'the', 'nitro', "them',", "'dangerous", 'when'] -> Target (Y): in
Context (X): ['the', 'nitro', 'in', "'dangerous", 'when', "driven',"] -> Target (Y): them',
Context (X): ['nitro', 'in', "them',", 'when', "driven',", "'those"] -> Target (Y): 'dangerous
Context (X): ['in', "them',", "'dangerous", "driven',", "'those", 'are'] -> Target (Y): when
Context (X): ["them',", "'dangerous", 'when', "'those", 'are', 'the'] -> Target (Y): driven',
Context (X): ["'dangerous", 'when', "driven',", 'are', 'the', 'type'] -> Target (Y): 'those


Thus we can see that the function successfully pairs the context words with our target word, although the tokens still carry quote and comma characters left over from the stringified lyric lists. Now it's time to build our CBOW model architecture using Keras and TensorFlow. The model takes the context words as input and passes them to an embedding layer, which is initialized with random weights. The embeddings are then sent to a lambda layer, where we average the word embeddings. The averaged vector is passed to a dense softmax layer that attempts to predict our target word. The prediction is compared to the actual target word to compute a loss using categorical_crossentropy, and backpropagation updates the embedding layer on every batch of each epoch.
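
The forward pass described above can be sketched in plain NumPy for a single training pair. This is only an illustration under assumed toy sizes and made-up word indices; the names `E`, `W`, and `b` are hypothetical and stand in for the embedding matrix and the dense layer's weights and biases.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 10, 4                  # toy sizes, not the notebook's

E = rng.normal(size=(vocab_size, embed_size))   # embedding matrix (random init)
W = rng.normal(size=(embed_size, vocab_size))   # dense layer weights
b = np.zeros(vocab_size)                        # dense layer biases

context = np.array([1, 2, 3, 5, 6, 7])          # six context word indices
target = 4                                      # index of the word to predict

hidden = E[context].mean(axis=0)                # average the context embeddings
logits = hidden @ W + b                         # dense layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax over the vocabulary
loss = -np.log(probs[target])                   # categorical cross-entropy
print(probs.shape, float(loss))
```

Training would then adjust `E`, `W`, and `b` to lower this loss; the Keras model below handles that step automatically.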

In [12]:
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

embed_size = 100

# build CBOW architecture
cbow = Sequential()
cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2))
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
cbow.add(Dense(vocab_size, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# view model summary
print(cbow.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 6, 100)            31404100  
_________________________________________________________________
lambda_1 (Lambda)            (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 314041)            31718141  
Total params: 63,122,241
Trainable params: 63,122,241
Non-trainable params: 0
_________________________________________________________________
None
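
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the vocabulary and embedding sizes used above; the memory figure assumes 4-byte float32 weights and covers the weights only, not gradients or optimizer state.

```python
# Values taken from the notebook above.
vocab_size = 314041
embed_size = 100

embedding_params = vocab_size * embed_size            # one vector per word
dense_params = embed_size * vocab_size + vocab_size   # weights plus biases
total_params = embedding_params + dense_params

# Assumption: weights are stored as 4-byte float32 values.
weights_mb = total_params * 4 / 1e6
print(embedding_params, dense_params, total_params, round(weights_mb, 1))
# 31404100 31718141 63122241 252.5
```

Both layers scale linearly with the vocabulary size, so shrinking the vocabulary is the most direct way to shrink the model.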


As we can see from the summary, we have 63,122,241 parameters! The model is now ready to be trained on our corpus. The following code shows how training will be run; however, due to the size of these embeddings, we will use a third-party means of computation (as stated in the project proposal).

In [13]:
for epoch in range(1):
    loss = 0
    i = 0
    for x, y in generate_context_word_pairs(corpus=corpus, window_size=window_size, vocab_size=vocab_size):
        i += 1
        loss += cbow.train_on_batch(x, y)
        if i % 100000 == 0:
            print('Processed {} (context, word) pairs'.format(i))
        print(str(i) + ' of ' + str(vocab_size), end='\r')

    print('Epoch:', epoch, '\tLoss:', loss)
    print()

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


1 of 3140412 of 314041

ResourceExhaustedError:  OOM when allocating tensor with shape[314041,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node Square (defined at C:\Users\TannerSims\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_keras_scratch_graph_543]

Function call stack:
keras_scratch_graph


### Further Considerations and Work to Do
First, our model as it stands is unusually large. We should investigate ways to reduce the number of unique words, since the vocabulary size is the biggest contributor to the model size. It may be necessary to drop the idea of training our own embedding alongside the transfer-learned embeddings, and do the fit training only.
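
One possible way to shrink the vocabulary is sketched below; it is a suggestion, not something this notebook has run. It assumes each lyrics cell is a valid Python list literal (so `ast.literal_eval` can parse it), strips punctuation with a simple regular expression, and replaces rare words with a shared `<unk>` token. The helper names, the `min_count` threshold, and the `<pad>`/`<unk>` tokens are all hypothetical.

```python
import ast
import re
from collections import Counter

def tokenize_lyrics(raw):
    """Parse a stringified list of lines into lowercase word tokens per line."""
    lines = ast.literal_eval(raw)
    return [re.findall(r"[a-z0-9']+", line.lower()) for line in lines]

def build_vocab(songs, min_count=2):
    """Index words seen at least min_count times; 0 and 1 are reserved."""
    counts = Counter(w for song in songs for line in song for w in line)
    index_lookup = {"<pad>": 0, "<unk>": 1}
    for word, n in counts.most_common():
        if n >= min_count:
            index_lookup[word] = len(index_lookup)
    return index_lookup

raw_songs = [
    "['Fast cars', 'Fast women', 'Speed bikes with the nitro in them']",
    "['Fast cars', 'Dangerous when driven']",
]
songs = [tokenize_lyrics(raw) for raw in raw_songs]
vocab = build_vocab(songs, min_count=2)
print(vocab)  # {'<pad>': 0, '<unk>': 1, 'fast': 2, 'cars': 3}
```

Parsing the list first also removes the quote and comma residue from tokens, and reserving index 0 for padding keeps padded positions distinct from real words.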

Second, it is unusual to train on batches of size one, and this may affect performance and training time. The model is too large, and we have encountered resource issues. We can choose to generate contexts by song or by line, which will have different implications for the associations the model makes. I expect that it is better to create the associations by line, but we will have to modify the code above to work that way.
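
A rough sketch of how a by-line, batched generator might look is given below. It is not the notebook's code: the function name `batched_line_pairs` is hypothetical, it assumes the corpus has been restructured as songs containing lines of word indices with index 0 reserved for padding, and it yields integer labels instead of one-hot rows. Integer labels would require compiling the Keras model with the `sparse_categorical_crossentropy` loss.

```python
import numpy as np

def batched_line_pairs(songs, window_size, batch_size):
    """Yield (X, y) batches whose contexts never cross a line boundary.

    songs: list of songs, each a list of lines, each a list of word
           indices >= 1 (index 0 is assumed to be reserved for padding).
    """
    context_length = 2 * window_size
    X, y = [], []
    for song in songs:
        for line in song:
            for i, target in enumerate(line):
                context = (line[max(0, i - window_size):i]
                           + line[i + 1:i + 1 + window_size])
                # Left-pad short contexts with zeros to a fixed length.
                X.append([0] * (context_length - len(context)) + context)
                y.append(target)
                if len(X) == batch_size:
                    yield np.array(X), np.array(y)
                    X, y = [], []
    if X:  # final partial batch
        yield np.array(X), np.array(y)

songs = [[[1, 2, 3], [4, 5]]]  # one toy song with two lines
batches = list(batched_line_pairs(songs, window_size=1, batch_size=4))
print([(bx.shape, by.tolist()) for bx, by in batches])
```

Avoiding the one-hot target also avoids allocating a vector of length `vocab_size` for every training pair.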

Finally, we need to create a stripped-down script to run on the CHPC computing nodes.