# Embedding Process

For the embedding process we have chosen a method known as the "Continuous Bag of Words" (CBOW). The CBOW model attempts to predict a target word based on its context words (the surrounding words). For example, given the sentence "Beauty is in the eye of the beholder", we can form pairs of surrounding words and a target word. The size of each pair depends on the size of our context window (a numeric value n that sets how many context words around the target word are considered). If we choose a context window size equal to two, taking one word on each side of the target, then an example pair would be ([Beauty, in], is). The model would then try to predict the target word. This model structure is referenced from the website: https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html.
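
To make the pairing concrete, here is a minimal sketch that builds (context, target) pairs for the example sentence. The helper `context_target_pairs` is a hypothetical name introduced only for illustration, and it takes the window as the number of words on each side of the target, so `window=1` reproduces the two-word context ([Beauty, in], is).

```python
def context_target_pairs(tokens, window):
    """Yield (context, target) pairs, taking up to `window` words on each side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # words before the target
        right = tokens[i + 1:i + 1 + window]     # words after the target
        yield left + right, target

sentence = "Beauty is in the eye of the beholder".split()
pairs = list(context_target_pairs(sentence, window=1))

# Show the first few pairs; words at the sentence edges get a shorter context.
for context, target in pairs[:3]:
    print(context, "->", target)
```

Every word in the sentence serves as a target once, so the eight-word sentence produces eight pairs.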

The first step in this process is to get an accurate count of the unique words in our refined corpus, which we will do below:

In [1]:
import pandas as pd
import numpy as np
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
import nltk.data
from nltk.tokenize import sent_tokenize, word_tokenize
import re

Import the cleaned data:

In [2]:
data = pd.read_csv("cleaned_data (1).csv")
length = len(data)
data

FileNotFoundError: [Errno 2] File b'cleaned_data (1).csv' does not exist: b'cleaned_data (1).csv'

We will now build a dictionary 'index_lookup' in order to get a unique word count for the songs in our corpus. This involves lowercasing every word via the ".lower()" method so that the same word with different capitalization is not counted twice.
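
The effect of lowercasing is easiest to see on a toy input. The sketch below is only an illustration: `toy_lyrics` is made-up data and `build_lookup` is a hypothetical helper that mirrors the lookup-building loop, with a switch to turn lowercasing off.

```python
toy_lyrics = ["Hello hello world", "World of  lyrics"]  # made-up example data

def build_lookup(texts, lowercase=True):
    """Map each unique word to an integer index in first-seen order."""
    lookup = {}
    for text in texts:
        if lowercase:
            text = text.lower()
        for word in text.split(" "):
            # Consecutive spaces produce empty strings, which we skip.
            if word and word not in lookup:
                lookup[word] = len(lookup)
    return lookup

case_sensitive = build_lookup(toy_lyrics, lowercase=False)
case_insensitive = build_lookup(toy_lyrics, lowercase=True)
print(len(case_sensitive), len(case_insensitive))
```

Without lowercasing, "Hello"/"hello" and "World"/"world" are counted as separate words, which inflates the vocabulary.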

In [None]:
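# Map each unique lowercased word to an integer index (first-seen order).
# Note: indices start at 0, which is also the value pad_sequences uses for padding.
index_lookup = {}
for text in data["lyrics"]:
    for word in text.lower().split(" "):
        # Skip the empty strings produced by consecutive spaces.
        if word and word not in index_lookup:
            index_lookup[word] = len(index_lookup)

# Reverse mapping: integer index -> word.
word_lookup = {index: word for word, index in index_lookup.items()}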

In [None]:
vocab_size = len(index_lookup)
vocab_size

As we see from above, we have a total of 118466 unique words in our data. Let's take the first song from our data and map each of its words to its numeric index; doing this lets us look for errors in the index assignment. We will loop over the first song, splitting it on spaces, and in the end we should have a list of numeric values.

In [None]:
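# Encode the first song as a list of word indices.
text = data["lyrics"][0]
print(text)
songlist = []
for word in text.lower().split(" "):
    # Empty strings (from repeated spaces) are not in the lookup, so skip them.
    if word in index_lookup:
        songlist.append(index_lookup[word])
songlist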

Everything looks great. Now we will do this for all the song lyrics in our corpus, so that every song is represented as a list of numeric word indices.

In [None]:
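# Encode every song; corpus[i] holds the word indices for song i.
corpus = []
for text in data["lyrics"]:
    # Empty strings (from repeated spaces) are not in the lookup, so skip them.
    songlist = [index_lookup[word]
                for word in text.lower().split(" ")
                if word in index_lookup]
    corpus.append(songlist)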

# Formation of Context Word Pairs

Now that the corpus is complete, it is time to form the pairs of target words and context words. The size of the context window is a somewhat arbitrary choice; we have decided on three because of its accuracy and lower probability of overfitting our data. In the code below, the window size counts the words on each side of the target, so a value of three gives up to six context words. The first step is to write a function that creates the pairs we are looking for.

In [None]:
from keras.utils import to_categorical
from keras.preprocessing import sequence

window_size = 3  # number of context words taken on each side of the target

def generate_context_word_pairs(corpus, window_size, vocab_size):
    """Yield one (context, one-hot target) pair per word in the corpus."""
    context_length = window_size * 2
    for lyrics in corpus:
        lyrics_length = len(lyrics)
        for index, word in enumerate(lyrics):
            start = index - window_size
            end = index + window_size + 1

            # Context = neighbours inside the window, excluding the target itself.
            context_words = [[lyrics[i]
                              for i in range(start, end)
                              if 0 <= i < lyrics_length
                              and i != index]]
            label_word = [word]

            # Left-pad short contexts (near the start or end of a song) with zeros.
            x = sequence.pad_sequences(context_words, maxlen=context_length)
            # One-hot encode the target over the whole vocabulary.
            y = to_categorical(label_word, vocab_size)
            yield (x, y)
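
To show the shapes involved, the padding and one-hot steps at the end of the function can be imitated in plain NumPy. This is only a sketch: `pad_left` and `one_hot` are hypothetical stand-ins that assume the Keras defaults (zeros added on the left, one-hot vectors of length equal to the vocabulary size), and the word indices are made up.

```python
import numpy as np

def pad_left(seq, maxlen, value=0):
    """Left-pad (or left-truncate) one sequence to `maxlen` entries."""
    seq = list(seq)[-maxlen:]
    return np.array([value] * (maxlen - len(seq)) + seq)

def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = np.zeros(num_classes, dtype="float32")
    vec[index] = 1.0
    return vec

toy_vocab_size = 8
window = 3
lyrics = [5, 2, 7, 1]  # made-up word indices for a very short song

# The first word has no left neighbours, so its context is padded.
index = 0
context = [lyrics[i]
           for i in range(index - window, index + window + 1)
           if 0 <= i < len(lyrics) and i != index]
x = pad_left(context, maxlen=2 * window)
y = one_hot(lyrics[index], toy_vocab_size)
print(x)  # [0 0 0 2 7 1]
print(y)  # [0. 0. 0. 0. 0. 1. 0. 0.]
```

Because the padding value is 0, a padded position cannot be told apart from the word whose index is 0.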

This function should yield our pairs, and the best way to check that it does is to test it out on our first song.

In [None]:
# Test this out for some samples
i = 0
for x, y in generate_context_word_pairs(corpus=corpus, window_size=window_size, vocab_size=vocab_size):
    # Skip padded contexts (this also skips any context containing the word with index 0).
    if 0 not in x[0]:
        print('Context (X):', [word_lookup[w] for w in x[0]], '-> Target (Y):', word_lookup[np.argwhere(y[0])[0][0]])

        # Stop after printing a handful of pairs.
        if i == 10:
            break
        i += 1

Thus we can see that the function successfully pairs the context words with our target word. Now it's time to build our CBOW model architecture using Keras and TensorFlow. The model takes context words as input and passes them to an embedding layer, which is initialized with random weights. The embeddings are then sent to a lambda layer, where we average the word embeddings. The averaged embedding is then passed to a dense softmax layer that attempts to predict our target word. The prediction from the dense softmax layer is compared with the actual target word to compute a loss using categorical_crossentropy, and backpropagation with each epoch updates the embedding layer in the process.
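
Before the Keras version, the forward pass described above can be sketched in plain NumPy. This is an illustration rather than the model we train: the sizes, the random weights, and the context and target indices are all made up, and only a single (context, target) pair is processed.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embed_size = 10, 4   # toy sizes chosen for illustration

E = rng.normal(size=(toy_vocab_size, toy_embed_size))  # embedding matrix, random init
W = rng.normal(size=(toy_embed_size, toy_vocab_size))  # dense layer weights
b = np.zeros(toy_vocab_size)                           # dense layer bias

context = np.array([3, 1, 4, 1, 5, 9])  # six made-up context word indices
target = 2                              # made-up target word index

# Embedding lookup followed by averaging (the lambda layer).
h = E[context].mean(axis=0)

# Dense layer followed by a numerically stable softmax.
logits = h @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Categorical cross-entropy against a one-hot target reduces to -log p(target).
loss = -np.log(probs[target])
print(h.shape, probs.shape, float(loss))
```

Training adjusts `E`, `W`, and `b` to lower this loss; the rows of `E` are the word embeddings we are after.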

In [None]:
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

embed_size = 100

# build CBOW architecture
cbow = Sequential()
cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2))
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
cbow.add(Dense(vocab_size, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# view model summary
cbow.summary()  # summary() prints the table itself and returns None
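
The parameter total in the summary can be reproduced by hand. The check below assumes the vocabulary size of 118466 reported earlier and the embedding size of 100; the averaging layer has no trainable weights, so only the embedding and dense layers contribute. The variable names are introduced just for this check.

```python
n_words = 118466   # unique words reported for our corpus
n_dims = 100       # embedding size used in the model

embedding_params = n_words * n_dims            # one vector per word
dense_params = n_dims * n_words + n_words      # weights plus one bias per output word
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)
```

The dense softmax layer is slightly larger than the embedding layer itself, because it needs one output unit (and one bias) for every word in the vocabulary.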

As we can see from the summary, we have 23,811,666 parameters! This model is now ready to be trained on our corpus. The following code shows how the training would be run; however, due to the size of these embeddings, we will use a third-party means of computation (as stated in the project proposal).

In [None]:
for epoch in range(1, 6):
    loss = 0.
    i = 0
    for x, y in generate_context_word_pairs(corpus=corpus, window_size=window_size, vocab_size=vocab_size):
        i += 1
        loss += cbow.train_on_batch(x, y)
        if i % 100000 == 0:
            print('Processed {} (context, word) pairs'.format(i))

    print('Epoch:', epoch, '\tLoss:', loss)
    print()