# Songdata assignment

## Authors: Jan Zahn, Jonas Meier, Thomas Wiktorin


Task:

- Train a recurrent neural network to output the lyrics of a song, given a starting word/character.
- Build two models: one based on characters (inputs and outputs are one-hot encoded characters), and another based on words with embeddings (inputs and outputs are embedding encodings of words).
- Compare your results and (subjectively) evaluate which method generates "better" songs, or which model is easier to learn.
- There will be an internal competition about which generated text is funnier/weirder, based on a democratic vote. This will not affect your grades.

Requirements:

- The example below uses characters. You will have to build a model that generates text from words rather than characters.
- Read about Python generators and the yield operator.
- Embed words using a keras.layers.Embedding layer.
- Initialize the embedding layer with pre-trained weights as explained in the link below:
  https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
- The acquisition of data is left open, but a common dataset can be found here:
  https://www.kaggle.com/mousehead/songlyrics/
- Implement your model using keras.utils.Sequence (or your own generator).
- Use model.fit_generator to fit the model.

Look at the example below (taken from the official Keras examples).
Warning: it uses characters instead of words.

https://raw.githubusercontent.com/keras-team/keras/master/examples/lstm_text_generation.py
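
The embedding requirement above can be sketched in plain NumPy. This is a minimal sketch and not the authors' implementation: `build_embedding_matrix`, the toy `pretrained` dictionary, and the small random vectors for words without a pre-trained entry are all assumptions made for illustration. In practice the vectors would be parsed from a pre-trained file, as the linked blog post describes.

```python
import numpy as np


def build_embedding_matrix(word_indices, pretrained, dim, seed=0):
    """Return a (vocab_size, dim) matrix whose rows follow word_indices.

    Words missing from `pretrained` keep small random vectors
    (an assumption made for this sketch).
    """
    rng = np.random.RandomState(seed)
    matrix = rng.uniform(-0.05, 0.05, size=(len(word_indices), dim))
    for word, idx in word_indices.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix


# Toy stand-in for vectors loaded from a pre-trained file (hypothetical values).
pretrained = {
    "love": np.array([0.1, 0.2, 0.3]),
    "you": np.array([0.4, 0.5, 0.6]),
}
word_indices = {"baby": 0, "love": 1, "you": 2}

embedding_matrix = build_embedding_matrix(word_indices, pretrained, dim=3)
print(embedding_matrix.shape)
```

Following the linked blog post, such a matrix could then be handed to the layer with something like `Embedding(len(word_indices), dim, weights=[embedding_matrix], input_length=maxlen, trainable=False)`; the exact arguments depend on the Keras version in use.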

# Evaluation

The word-based model generates lyrics that consist of actual words, and parts of its sentences make sense. The character-based model also produces special characters such as line breaks and punctuation, which is nice, but the generated text itself is worse. The text itself is the most important aspect for us, so we prefer the word-based model.

# Testing the given code on Nietzsche's text

## To run the following code:

- In the Anaconda console, activate the environment: activate ENVIRONMENT_NAME
- pip install plaidml-keras plaidbench
- plaidml-setup

In [1]:
'''Example script to generate text from Nietzsche's writings.

At least 20 epochs are required before the generated text
starts sounding coherent.

It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.

If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''
from __future__ import print_function

import numpy as np
import random
import sys
import io
import os

# Use this for GPU accelerated keras (also AMD) 
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop

with io.open("data/nietzsche.txt", encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
print(chars)
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
# use the builtin bool: the np.bool alias was removed in newer NumPy releases
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=10,
          callbacks=[print_callback])

Using plaidml.keras.backend backend.


corpus length: 600893
total chars: 57
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ä', 'æ', 'é', 'ë']
nb sequences: 200285
Vectorization...
Build model...


INFO:plaidml:Opening device "opencl_amd_hawaii.0"


Epoch 1/60


----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "ssed, enchained class of spirits,
who de"
ssed, enchained class of spirits,
who destrong of the can a propation of the preased the selfect of the serfect of the prease of the sensemance of this as a probably and a fard the prease of the present of the seems the sense of the prease of the serfect of the probary and the prease of the sense of the serfortion of the present of the seems the prease of the presents the prease of the canse the probal and the such as the sense of the p
----- diversity: 0.5
----- Generating with seed: "ssed, enchained class of spirits,
who de"
ssed, enchained class of spirits,
who decieated at experal of reperence of the has anticisted this good one of suppect of the doestart be with outherse--and who searary or this word this in consequence trations. that as a prepart and as the prease of need even the facture of the pracase and surpation in the serrorations. the can of new



r englanlustentywhion of rayments--it was yird
conceation, uppreitive happy and cakewingly, by ageament with his libeble only one adoon--of the tou,nish metty iaking almoked
contesn shade of hishen. "nachence and for about by impulsed, very proponda
about it
is nonly and love-truthm like a obed, of the sign loamed and
to healthy wome to itredsanc
moral easyroterly?
Epoch 14/60

----- Generating text after Epoch: 13
----- diversity: 0.2
----- Generating with seed: "tions), a false ethic is
reared, to supp"
tions), a false ethic is
reared, to suppose of the possible the same developed to the subjection and action of the soul of the subjection of the subjection of the subjection of the standard of the made the fact, and the same the same the subjection of the philosophers and contradiction and subjection of the subjection of the subjection that the philosophers and an antiquity of the philosophers and in a stands that the same the soul of t
----- diversity: 0.5
----- Generating with seed:

KeyboardInterrupt: 
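
The `diversity` values in the output above are passed to `sample` as the temperature, which divides the log-probabilities before they are renormalized. The following is a minimal sketch of that rescaling step on a made-up three-character distribution; `rescale` is a hypothetical helper that repeats the first lines of `sample` without the random draw.

```python
import numpy as np


def rescale(preds, temperature):
    """Return the distribution that `sample` draws from at this temperature."""
    logits = np.log(np.asarray(preds, dtype="float64")) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)


preds = [0.5, 0.3, 0.2]  # hypothetical softmax output over three characters
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(rescale(preds, temperature), 3))
```

Temperatures below 1 push almost all probability mass onto the most likely character, which is consistent with the repetitive phrases at diversity 0.2 above. A temperature of 1 leaves the distribution unchanged, and values above 1 flatten it, so unlikely characters are drawn more often.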

# Based on characters

In [1]:
from __future__ import print_function

import numpy as np
import random
import sys
import io
import os

# Use this for GPU accelerated keras (also AMD) 
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop

import csv

with open('data/songdata.csv', 'r') as csvfile:
    tmp_content = []
    spamreader = csv.reader(csvfile, delimiter=',')
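    for index, row in enumerate(spamreader):
        # NOTE: unlike the word-based cell, the header row (index 0) is not
        # skipped here, so the corpus starts with the column name "text".
        if index > 1000:
            break
        tmp_content.append(row[3].lower())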
    text = "".join(tmp_content)
print(text[:200])

print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 20
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
# use the builtin bool: the np.bool alias was removed in newer NumPy releases
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=64,
          epochs=10,
          callbacks=[print_callback])

Using plaidml.keras.backend backend.


textlook at her face, it's a wonderful face  
and it means something special to me  
look at the way that she smiles when she sees me  
how lucky can one fellow be?  
  
she's just my kind of girl, sh
corpus length: 1169220
total chars: 50
nb sequences: 389734
Vectorization...
Build model...


INFO:plaidml:Opening device "opencl_amd_hawaii.0"


Epoch 1/10

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "other break  
but pl"
other break  
but please the way in the start  
  
when the sight the same  
  
when you want to be a thought the start the more  
  
what you want to go  
  
i want to be a thought the show  
  
i want to the start  
  
i won't be the start  
  
i want to the start  
i want to say the start  
  
he was a little back  
i won't be the show  
  
when i want to go  
i can't be a the say  
  
when you want to be a boom  
----- diversity: 0.5
----- Generating with seed: "other break  
but pl"
other break  
but please the sunch i wanna thous  
i wanna take your there  
the thing that's say it bepies  
  
what when the line that you need the newfer  
  
you wats i can't go  
know can say the same  
i won't be guess the sare  
  
when you can want to hear the way  
  
moth and living was a some baby  
i wonder the can inside of the screep  
i wanna love is the way  
  
she c



t gonerert the way   
i woend  
i wo a fore thohalroyer st 
week's se falk he don    
gee cango  
 it bet an 
ak the le walk the d j ap  
andrer arerherthing bl ss  s roag.h  ew  auddy  is sererctoroherr itre t's wor  
i'm sca bo wantertryoa ta awak i'm s too  ac to mer  
 me oto an' mindhitkat 
igok boe nedheatu  
  
  i mi
o
a  
  al is dawaymou th
----- diversity: 0.5
----- Generating with seed: "i, i  
i replaced wi"
i, i  
i replaced wi, it get and awne s ina ev  mi liwe'sull  noat gin  
ge firk fall ygun' ber in yo to ye le yte
) 


 i candrhh my so moend a's ster teak  
nohhthhert itue' g   tingher t melfbve goe ardryonoreusn me ar wayt is t bo   
yeg the dowr dowrthohre'  
lit t ster ge 
w inder fyeu 
donungurt toin  
i've in t m h  i woes we cigigo 
me ase was feet na ye stawa's timerd.d here  me andoiteltiu   ca1 s 'verotia
----- diversity: 1.0
----- Generating with seed: "i, i  
i replaced wi"
i, i  
i replaced wig h mall  s it drantiy 
my out toidedvtan agloridnu   lat h we

<keras.callbacks.History at 0x20f600860b8>
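
The `nb sequences` values printed by the two character-based runs above follow directly from the windowing loop `range(0, len(text) - maxlen, step)`. A small check using the logged corpus lengths, where `count_windows` is a hypothetical helper and not part of the original scripts:

```python
def count_windows(corpus_length, maxlen, step):
    """Number of (sequence, next item) pairs the windowing loop produces."""
    return len(range(0, corpus_length - maxlen, step))


# Corpus lengths and window settings taken from the runs above.
print(count_windows(600893, maxlen=40, step=3))   # Nietzsche run: 200285
print(count_windows(1169220, maxlen=20, step=3))  # song lyrics run: 389734
```

A smaller `step` yields more, heavily overlapping training sequences; a larger `step` reduces both the overlap and the size of the one-hot arrays.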

# Based on words

In [1]:
from __future__ import print_function

import numpy as np
import random
import sys
import io
import os

# Use this for GPU accelerated keras (also AMD) 
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop

import csv

with open('data/songdata.csv', 'r') as csvfile:
    text = []
    spamreader = csv.reader(csvfile, delimiter=',')
    for index, row in enumerate(spamreader):
        if index == 0:
            continue
        if index > 1000:
            break 
        temp = row[3].lower().split(" ")
        for word in temp:
            word = ''.join(e for e in word if e.isalnum())
            if word:
                text.append(word)

print('number of words:', len(text))

words = sorted(list(set(text)))
print('total words:', len(words))
word_indices = dict((c, i) for i, c in enumerate(words))
indices_word = dict((i, c) for i, c in enumerate(words))
# cut the text in semi-redundant sequences of maxlen words
maxlen = 10
step = 5
sentences = []
next_words = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_words.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
# use the builtin bool: the np.bool alias was removed in newer NumPy releases
x = np.zeros((len(sentences), maxlen, len(words)), dtype=bool)
y = np.zeros((len(sentences), len(words)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, word in enumerate(sentence):
        x[i, t, word_indices[word]] = 1
    y[i, word_indices[next_words[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(64, input_shape=(maxlen, len(words))))
model.add(Dense(len(words), activation='softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += " ".join(sentence)
        print('----- Generating with seed: "' + " ".join(sentence) + '"')
        sys.stdout.write(generated)

        for i in range(400):
            # one-hot encode the current window of words
            x_pred = np.zeros((1, maxlen, len(words)))
            for t, word in enumerate(sentence):
                x_pred[0, t, word_indices[word]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_word = indices_word[next_index]

            # prepend a space so the first generated word is not glued to the seed
            generated += " " + next_word
            sentence = sentence[1:] + [next_word]

            sys.stdout.write(" " + next_word)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=64,
          epochs=10,
          callbacks=[print_callback])

Using plaidml.keras.backend backend.


number of words: 218279
total words: 9439
nb sequences: 43654
Vectorization...
Build model...


INFO:plaidml:Opening device "opencl_amd_hawaii.0"


Epoch 1/10

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "nothing but the best for you too dont forget me"
nothing but the best for you too dont forget meto the world of the world of the world of the day of the world of the world of the world is the world of the world of the world of the world of the world of the world of the world of the world of the world of the world of the day is the world of the world is the world is the world is a little bit just a little bit just a little bit a little bit just a little bit just a little bit a little bit a little bit a little bit a little bit a little bit a little bit a little bit just a little bit a little bit a little bit a little bit a little bit a little bit a little bit just a little bit a little bit a little bit a little bit a little bit a little bit a little bit a little bit a little bit just a little bit a little bit a little bit a little bit just a little bit a little bit a little bit a little bit

<keras.callbacks.History at 0x252f5e046a0>
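
The word-based cell above builds the full one-hot array `x` with shape (43654, 10, 9439), which is about 4.1 GB even as booleans. The requirements suggest a generator instead; the following is a minimal sketch of that idea, assuming the same `sentences`, `next_words` and `word_indices` structures as above. `one_hot_batches` and the toy data are hypothetical and only illustrate the `yield` pattern.

```python
import numpy as np


def one_hot_batches(sentences, next_words, word_indices, batch_size):
    """Yield (x, y) one-hot batches forever, building one batch at a time."""
    vocab_size = len(word_indices)
    maxlen = len(sentences[0])
    while True:
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            x = np.zeros((len(batch), maxlen, vocab_size), dtype=bool)
            y = np.zeros((len(batch), vocab_size), dtype=bool)
            for i, sentence in enumerate(batch):
                for t, word in enumerate(sentence):
                    x[i, t, word_indices[word]] = True
                y[i, word_indices[next_words[start + i]]] = True
            yield x, y


# Toy data standing in for the lists built in the cell above.
toy_text = "a little bit just a little bit a little bit".split()
toy_indices = {w: i for i, w in enumerate(sorted(set(toy_text)))}
toy_maxlen = 3
toy_sentences = [toy_text[i:i + toy_maxlen]
                 for i in range(len(toy_text) - toy_maxlen)]
toy_next = [toy_text[i + toy_maxlen]
            for i in range(len(toy_text) - toy_maxlen)]

batches = one_hot_batches(toy_sentences, toy_next, toy_indices, batch_size=4)
x_batch, y_batch = next(batches)
print(x_batch.shape, y_batch.shape)
```

With the Keras version used here, such a generator could presumably be passed to `model.fit_generator(batches, steps_per_epoch=..., epochs=10)`, with `steps_per_epoch` set to the number of batches per pass over the data. Only one batch of one-hot vectors is held in memory at a time.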