# Rap Lyrics Generator - Kanye West

## Introduction
This project trains a Recurrent Neural Network (RNN) on more than 300 Kanye West verses. The verses file is [from a dataset on Kaggle](https://www.kaggle.com/viccalexander/kanyewestverses). The dataset contains a total of 364 verses from 243 songs. Verses are separated by an empty line, i.e. an extra "\n". The model is trained on character-based features: the embedding step operates on characters, not words.
This project is inspired by an NLP course on Coursera, part of the TensorFlow in Practice Specialization offered by deeplearning.ai, and by the [Tensorflow tutorial](https://www.tensorflow.org/tutorials/text/text_generation). The process is to prepare the lyrics, window them into segments, and then use the characters in each window to predict the next character. Since each character is related to the characters before and after it, RNN variants such as LSTM, which carry a hidden state forward from the previous cell, are suitable for this application.

### Import Packages

In [1]:
import numpy as np 
import tensorflow as tf

print(tf.__version__)
#!pip install tensorflow==2.0.0

2.0.0


### Load the file and print the first 250 characters

In [2]:
# Read, then decode for py2 compat.
text = open('kanye_verses.txt', 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it

print ('Length of text: {} characters'.format(len(text)))
print(text[:250])

Length of text: 260341 characters
Let the suicide doors up
I threw suicides on the tour bus
I threw suicides on the private jet
You know what that mean, I'm fly to death
I step in Def Jam buildin' like I'm the shit
Tell 'em give me fifty million or I'ma quit
Most rappers' taste level


### Unique characters
We are using character-based features, so we want to identify the unique characters and assign each of them a numerical label.

In [3]:
# number of unique characters 
print(len(set(text)),'unique characters:') 

# dict of these characters
chars = sorted(set(text))
print(chars)

char_dict = {char:i for i,char in enumerate(chars)}
idx2char = np.array(chars)
print(char_dict)

96 unique characters:
['\n', ' ', '!', '"', '#', '$', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '~', '·', 'Á', 'é', 'í', 'ñ', 'ó', 'ā', '\u200b', '–', '‘', '’', '“', '”', '…']
{'\n': 0, ' ': 1, '!': 2, '"': 3, '#': 4, '$': 5, '&': 6, "'": 7, '(': 8, ')': 9, '*': 10, '+': 11, ',': 12, '-': 13, '.': 14, '/': 15, '0': 16, '1': 17, '2': 18, '3': 19, '4': 20, '5': 21, '6': 22, '7': 23, '8': 24, '9': 25, ':': 26, ';': 27, '?': 28, 'A': 29, 'B': 30, 'C': 31, 'D': 32, 'E': 33, 'F': 34, 'G': 35, 'H': 36, 'I': 37, 'J': 38, 'K': 39, 'L': 40, 'M': 41, 'N': 42, 'O': 43, 'P': 44, 'Q': 45, 'R': 46, 'S': 47, 'T': 48, 'U': 49, 'V': 50, 'W': 51, 'X': 52, 'Y': 53, 'Z': 54, 'a':

### Transform lyrics to numerical labels
Here each character is transformed into the numerical label created above so that the text can be fed into Tensorflow.

In [4]:
#Sequence of the text
text_in_num =np.array([char_dict[i] for i in text])
print(text[:13])
print(text_in_num[:13])

print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_in_num[:13]))


Let the suici
[40 59 74  1 74 62 59  1 73 75 63 57 63]
'Let the suici' ---- characters mapped to int ---- > [40 59 74  1 74 62 59  1 73 75 63 57 63]
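
As a minimal, self-contained sketch of the same mapping idea, the round trip from characters to integers and back can be checked on a short string. The helper names `encode` and `decode` are illustrative assumptions, not part of the notebook, and because the vocabulary here is built from the sample only, the labels differ from the ones printed above.

```python
# Hypothetical helpers illustrating the character <-> integer mapping.
sample = "Let the suicide doors up"

# Sorted unique characters define the vocabulary, as in the notebook.
vocab = sorted(set(sample))
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}


def encode(text):
    """Map each character to its integer label."""
    return [char_to_idx[ch] for ch in text]


def decode(indices):
    """Map integer labels back to a string."""
    return "".join(idx_to_char[i] for i in indices)


encoded = encode(sample)
print(encoded[:5])
print(decode(encoded) == sample)
```

Decoding the encoded text should reproduce the original string exactly, which is the property the generator relies on when it converts predicted labels back into lyrics.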


### Create sequences for training X and Y
To create X (the feature set) and Y (the ground truth), the entire lyrics file is flattened into a single vector of shape (1, number of characters). The sequences (training samples) are extracted by sliding a window along this vector. For example, if the flattened lyrics are \[0,1,2,3,4,5,6,7,8,9\] and we use a window size of 3 with a shift of 1 character, we get: \[0,1,2\] \[1,2,3\] \[2,3,4\] \[3,4,5\] \[4,5,6\] \[5,6,7\] \[6,7,8\] and \[7,8,9\].

The window size is set by the length of each training instance. Within a window of n characters, the first n-1 characters form the training sequence, and characters 2 through n form the target, Y. For example, for the instance \[0,1,2,3\] the training sequence is \[0,1,2\] and the target is \[1,2,3\].
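
The windowing described above can be sketched in plain Python. This is only an illustration of the idea, not the `tf.data` pipeline used below; the function name `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(values, size, shift):
    """Return every full window of `size` items, moving `shift` items each step."""
    return [values[start:start + size]
            for start in range(0, len(values) - size + 1, shift)]


vector = list(range(10))

# Window size 3, shift 1, as in the example above.
windows = sliding_windows(vector, size=3, shift=1)
print(windows)

# For training, each window holds seq_len + 1 items and is split into
# an input (all but the last item) and a target (all but the first item).
seq_len = 3
pairs = [(w[:-1], w[1:]) for w in sliding_windows(vector, seq_len + 1, shift=1)]
print(pairs[0])  # ([0, 1, 2], [1, 2, 3])
```

Taking windows of `seq_len + 1` items is what lets a single window yield both an input and a target of length `seq_len`.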

In [5]:
#Convert to trainable data
seq_len = 50
example_per_epoc = (len(text_in_num)-seq_len)//5
BATCH_SIZE = 64
example_per_epoc


52058

In [6]:
# Create training examples / targets

#Convert the numerically labeled data to a tf.data.Dataset for further processing
char_dataset = tf.data.Dataset.from_tensor_slices(text_in_num)

#Create sequences 
sequences = char_dataset.window(size = seq_len+1, shift = 10, drop_remainder = True)

#Turn each window (a nested dataset) into a single tensor of seq_len+1 characters
sequences = sequences.flat_map(lambda window: window.batch(seq_len+1))

#Create a tuple for each training instance: (sequence, target)
dataset = sequences.map(lambda window: (window[:-1], window[1:]))

BUFFER_SIZE = 1000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)


for i,y in dataset.take(1):
    print('Sample:(batch_size, sequence_len)')
    print(i.numpy().shape)
    print('Target:(batch_size, sequence_len)')
    print(y.numpy().shape)

Sample:(batch_size, sequence_len)
(64, 50)
Target:(batch_size, sequence_len)
(64, 50)
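
As a rough sanity check on the size of this dataset, the number of sequences and batches can be worked out by hand. The sketch below assumes that `window(..., drop_remainder=True)` keeps only full windows starting at multiples of the shift, and that `batch(..., drop_remainder=True)` discards a final partial batch; `count_windows` is a hypothetical helper.

```python
def count_windows(n_items, size, shift):
    """Number of full windows when incomplete trailing windows are dropped."""
    if n_items < size:
        return 0
    return (n_items - size) // shift + 1


n_chars = 260341          # length of the text reported above
seq_len = 50
batch_size = 64

# Each window holds seq_len + 1 characters and the code above uses shift = 10.
n_sequences = count_windows(n_chars, size=seq_len + 1, shift=10)
n_batches = n_sequences // batch_size   # a partial final batch is discarded

print(n_sequences)  # 26030
print(n_batches)    # 406
```

Under these assumptions one pass over the data is 406 batches of 64 sequences. The `example_per_epoc` value computed earlier is not used by the cells shown here.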


### Print 10 input sequences from one shuffled batch (window size 50, shift = 10)

In [7]:
for batch,batch_target in  dataset.take(1):
    for i in range(10):
        print ('Input data: ', repr(''.join(idx2char[batch[i].numpy()])))
    

Input data:  "p, that pussy slippery, no whip\nWe ain't trippin' "
Input data:  " problem is I be textin'\nMy psychiatrist got kids "
Input data:  'Huh? Motherfucker we rolling\nWith some light-skinn'
Input data:  ' hoodrats\nAnd I just blame everything on you\nAt le'
Input data:  'o see Thee more clearly\nI know he hear me when my '
Input data:  " a whole lot of O's\nWhat you after, actor money?\nY"
Input data:  "n this bitch another 'gain\n\nI made Jesus Walks, I'"
Input data:  "ike Mekhi Phife'\nIn that pussy so deep I could hav"
Input data:  'at we been through\nI mean, after all the things we'
Input data:  "lla\nOkay, I smashed your Corolla\nI'm hangin' on a "


### The first 5 sequences and respective targets

In [8]:
for batch,batch_target in  dataset.take(1):
    for i in range(5):
        print ('Input data: ', repr(''.join(idx2char[batch[i].numpy()])))
        print ('Target data:', repr(''.join(idx2char[batch_target[i].numpy()])))

Input data:  ", everybody gettin' paid\nNiggas lookin' at me like"
Target data: " everybody gettin' paid\nNiggas lookin' at me like "
Input data:  " like a month right now\nStupid niggas gettin' mone"
Target data: "like a month right now\nStupid niggas gettin' money"
Input data:  "ing I need to let you know\nYou ain't never seen no"
Target data: "ng I need to let you know\nYou ain't never seen not"
Input data:  'very ass, cheated on every test\nI guess, this is m'
Target data: 'ery ass, cheated on every test\nI guess, this is my'
Input data:  " other side\nGotta keep 'em separated, I call that "
Target data: "other side\nGotta keep 'em separated, I call that a"


### Build the model
The model consists of an embedding layer, an RNN (GRU) layer, and a dense output layer. Because `return_sequences` is set to True, the GRU layer returns its output at every time step rather than only at the last one. The dense layer then produces, for each time step, one logit (unnormalized score) per character in the vocabulary; these become probabilities once a softmax is applied.

In [9]:
# Length of the vocabulary in chars
vocab_size = len(chars)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [10]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
    # Embedding layer
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
        
    # RNN layer    
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
        
    # Dense layer    
    tf.keras.layers.Dense(vocab_size)
      ])
    return model

model = build_model(
  vocab_size = len(chars),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

### Take a look at what each batch looks like:

In [34]:
for input_example_batch, target_example_batch in dataset.take(1):
    print(input_example_batch)
    print(target_example_batch)

tf.Tensor(
[[70 75 74 ... 57 65  1]
 [62 63  0 ... 59  1 70]
 [59 77 59 ... 55 74 69]
 ...
 [79 59 72 ...  1 77 59]
 [69 75 68 ... 74  0 48]
 [66 58  1 ... 74 57 62]], shape=(64, 50), dtype=int32)
tf.Tensor(
[[75 74  1 ... 65  1 59]
 [63  0 47 ...  1 70 55]
 [77 59 66 ... 74 69 73]
 ...
 [59 72 73 ... 77 59  1]
 [75 68 58 ...  0 48 62]
 [58  1 58 ... 57 62  1]], shape=(64, 50), dtype=int32)


In [12]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 50, 96) # (batch_size, sequence_length, vocab_size)


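As seen above, each batch contains 64 samples, and each sample has an input and a target that both contain 50 characters. For every one of those 50 positions, the model outputs 96 values, one per character in the vocabulary.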


### Model summary:

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           24576     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 96)            98400     
Total params: 4,061,280
Trainable params: 4,061,280
Non-trainable params: 0
_________________________________________________________________
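
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses the TensorFlow 2 default `reset_after=True`, which gives each of the three weight sets (update gate, reset gate, candidate state) two bias vectors.

```python
vocab_size = 96
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of length embedding_dim per character.
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets, each with input weights, recurrent weights and
# two bias vectors (assumes reset_after=True).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output character.
dense_params = rnn_units * vocab_size + vocab_size

print(embedding_params, gru_params, dense_params)  # 24576 3938304 98400
print(embedding_params + gru_params + dense_params)  # 4061280
```

Almost all of the parameters sit in the GRU layer, so `rnn_units` is the setting that most affects model size.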


### Sample indices
Feed one batch of lyrics to the untrained model and sample a character from its output at each position. As expected, the result is random.

In [14]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

In [15]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices])))

Input: 
 'ay\nSo this is in the name of love like Robert say\n'

Next Char Predictions: 
 "Lq6:·Gjio‘'qP\nuqiCw&uāTa5LYT+EcJkjCh\u200b”P)F·~ó;PT+e”"


### Define loss function and compile the model
Sparse categorical cross-entropy, computed from logits because the dense layer has no softmax activation.

In [16]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 50, 96)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.562809
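
The starting loss is close to what a uniform guess over 96 characters would give. The sketch below is a plain-Python illustration of sparse categorical cross-entropy from logits, not the TensorFlow implementation; `sparse_cross_entropy` is a hypothetical helper.

```python
import math


def sparse_cross_entropy(label, logits):
    """Negative log-probability of `label` under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


vocab_size = 96
uniform_logits = [0.0] * vocab_size   # an untrained model is close to uniform

baseline = sparse_cross_entropy(label=0, logits=uniform_logits)
print(round(baseline, 4))  # 4.5643, i.e. ln(96)
```

A loss near ln(96) before training suggests the output shapes and the loss are wired up sensibly; training should push the loss well below this baseline.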


In [17]:
model.compile(optimizer='adam', loss=loss)

### Train the model
Here we define a callback that saves a checkpoint after each epoch, so that the weights from any epoch can be reloaded later.


In [18]:
import os
import time

In [19]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "c_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [20]:
EPOCHS=30

In [21]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


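### Create a new model using the trained weights

Looking at the loss for each epoch, the 18th epoch seems to work well, so its checkpoint is loaded below. The model is rebuilt with a batch size of 1 because the stateful GRU layer fixes the batch size when the model is built, and text generation feeds one sequence at a time.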

In [22]:
#tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints\\c_30'

In [35]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
#model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.load_weights(os.path.join(checkpoint_dir, 'c_18'))
model.build(tf.TensorShape([1, None]))
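
The checkpoint name `c_18` follows from the `"c_{epoch}"` template given to the callback. As a small sketch, assuming Keras fills the `{epoch}` placeholder with the 1-based epoch number, the path for any epoch can be built as follows; `checkpoint_path` is a hypothetical helper.

```python
import os

checkpoint_dir = "./training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_dir, "c_{epoch}")


def checkpoint_path(epoch):
    """Fill the {epoch} placeholder the same way str.format does."""
    return checkpoint_prefix.format(epoch=epoch)


path = checkpoint_path(18)
print(os.path.basename(path))  # c_18
```

Building the path with `os.path.join` keeps it independent of the operating system's path separator.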

In [36]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (1, None, 256)            24576     
_________________________________________________________________
gru_2 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_2 (Dense)              (1, None, 96)             98400     
Total params: 4,061,280
Trainable params: 4,061,280
Non-trainable params: 0
_________________________________________________________________


### Lyrics Generator
Finally, define the lyrics generator function, which takes a start string and repeatedly predicts the next character, feeding each prediction back in as the next input.

In [43]:
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Convert the start string to numbers (vectorizing)
    input_eval = [char_dict[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty list to store the generated characters
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 0.65

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Scale the logits by the temperature, then sample the next
        # character from the resulting categorical distribution
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # Pass the predicted character as the next input to the model;
        # the stateful GRU keeps the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return start_string + ''.join(text_generated)
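
To see what the temperature does, the following plain-Python sketch converts a few made-up logits to probabilities at two temperatures. It only illustrates the scaling step; the logits and the helper name `softmax_with_temperature` are assumptions for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]   # made-up scores for three characters

for temperature in (1.0, 0.65):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Dividing by a temperature below 1 widens the gaps between logits, so the most likely character receives a larger share of the probability and the sampled text becomes more predictable.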

### Prediction

It is around Thanksgiving, so let's try "Thanksgiving" as the start string and see what we get:

In [44]:
print(generate_text(model, start_string="Thanksgiving"))

Thanksgiving, do we even got a question?
Hermes pastel, I say "I know he got a plan, I know I'm on your beams
One set of footsteps, you was carryin' me
When I was pretty before the dough but now I'm just the man
But she not like you
Right now I need you to muth of me
I will not be headed astronaut
Maybe it's because
That girl is bad

I'm getting spins a song with Coldplay
Back in my mind I'm like a vampire on the gas fully filled up like I nigga

May the Lord forgive us, me it was this is Christians
My apologies, they be all up on your ass, yo
When you gettin' money, cops don’t fall
Made to make niggas fail
Especially if you paid for the way that she 36-2, pieces catch miracle whips
Changing lanes
Yeah, I'm changing lanes

Magazines call me a rock star, girls call me cock star
Billboard, pop star, neighborhood black people to the one
I know they don't want niggas that life
Wele as the light post
They say Drive Slow, I say "I know"
Then errrr, away I go
And the way I want you
I'm goin t

## Conclusion

Here I generated 1000 characters from the start string "Thanksgiving". Since the newline character "\n" and the space are part of the training vocabulary, the generated lyrics break lines between bars and skip a line between verses. Context aside, the lyrics look fairly plausible. Some interesting findings after inspecting them closely:

- The vocabulary is case-sensitive, so it is good to see the model capitalize the first character of each line as well as some names like "Coldplay".

- It also knows to use quotation marks after the word "say", for example: They say Drive Slow, I say "I know".

- We can also see Kanye's influence on the lyrics. For example, he is known as a Christian rapper, and some of the generated lyrics are God-related:
    May the Lord forgive us, me it was this is Christians

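- Even rhymes can be seen sometimes, for example "Slow", "know" and "go" in the generated text above:
    They say Drive Slow, I say "I know"
    Then errrr, away I go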

Overall, the auto-generated lyrics are of course not as good as the originals. However, they demonstrate what NLP can do and a possible way to generate text of different kinds. This model is a character-based RNN because characters give far fewer unique "labels" than words, which largely reduces the computational expense. The drawback is that the composition of sentences and verses may not make much sense to readers. The model could be improved by using a word-based vocabulary and by training on more samples.


# Happy Thanksgiving!
    
    


