# Deep Learning final project - Lyrics Generation

---

    Gabriel Graells - 205638

---
For this project, we implement an RNN that generates new song lyrics character by character after being trained on hundreds of reggaeton songs by some of the genre's best-known singers.

In [None]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

# Creating our dataset

## Load the Data

Load the lyrics text file and convert it into integers for our network to use. 

In [None]:
#mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


# Data Collection
To generate the dataset needed for training we used the [LyricsGenius](https://github.com/johnwmillr/LyricsGenius) library, which is built on top of the API of the music content website [Genius](https://genius.com/).
The code searches for the top 50 songs, by popularity, of each artist in the list, then downloads one JSON file per artist containing the songs and further information. Finally, it copies the lyrics of all songs from the JSON files into the output file "*lyrics.txt*".

**If you want to run this part of the code, it is better to do so locally in a *.py* Python script.**

In [None]:
import lyricsgenius
import json
import os


artist_list = ['Bad Bunny', 'Daddy Yanky','Ozuna', 'Don Omar', 'Anuel AA','Maluma','Juan Magan','Nicky Jam', 'Karol G','Mozart La Para', 'Cosculluela', 'Tito "El Bambino"','Calle Trece', 'Arcangel','Tego Calderon','Plan B']

#Init API client (token read from an environment variable; the variable name is an assumption)
genius = lyricsgenius.Genius(os.environ["GENIUS_ACCESS_TOKEN"])
genius.verbose = False 
genius.remove_section_headers = True
genius.excluded_terms = ['Live', '(Live)']

#Iterate artist_list and download JSON file with lyrics
for art in artist_list:
    print(art)
    artist = genius.search_artist(art, max_songs = 50, sort='popularity')
    
    if artist is None:
        print(art, 'does not exist!')
    else:
        print('saving json...')
        artist.save_lyrics()

#Get names of the JSON files in the current directory
json_files = [name for name in os.listdir() if name.endswith('.json')]

#Append the lyrics of every song to lyrics.txt
with open('lyrics.txt', 'a') as text_file:
    for file_name in json_files:
        with open(file_name) as json_file:
            data = json.load(json_file)
        for s in data['songs']:
            text_file.write(s['lyrics'])


# Data Cleaning
After the "lyrics.txt" file is generated, we need to review the downloaded data and define criteria for cleaning it, with the goal of improving the network's performance.
The model for our lyrics generator learns at the character level: it uses a dictionary of characters and learns which character is most likely to follow a given sequence. Reducing the number of distinct characters therefore narrows the decision range and lowers the complexity of the problem.

Initially our data contained a total of 131 distinct characters. We printed them for inspection and found Korean and German letters. For these cases we took the extreme approach of deleting the songs containing those characters.

Even then, a number of symbols remained in the character dictionary, which we deleted with the following code snippet:
```
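# Symbols to delete. The original snippet does not record them, so this
# empty list is a placeholder to fill in.
symbols_to_remove = []

# Read all lines first: opening the same file with 'w' would truncate it.
with open('lyrics.txt', 'r') as f1:
    lines = f1.readlines()
with open('lyrics.txt', 'w') as f2:
    for line in lines:
        for symbol in symbols_to_remove:
            line = line.replace(symbol, '')
        f2.write(line)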
```
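
The inspection step described above can be sketched as follows. This is a minimal illustration, not the code used for the project: the helper names `char_inventory` and `drop_songs_with`, and the toy song strings, are assumptions made for the example.

```python
from collections import Counter

def char_inventory(text):
    """Return (character, count) pairs, rarest first, for manual inspection."""
    return sorted(Counter(text).items(), key=lambda kv: kv[1])

def drop_songs_with(songs, unwanted):
    """Keep only the songs that contain none of the unwanted characters."""
    unwanted = set(unwanted)
    return [s for s in songs if unwanted.isdisjoint(s)]

# Toy data standing in for the real lyrics (hypothetical strings).
songs = ["Baila conmigo", "Straße 한국", "Dale, dale"]

inventory = char_inventory("".join(songs))
rare = [ch for ch, n in inventory if n == 1]   # characters seen only once
kept = drop_songs_with(songs, "ß한국")

print(rare)
print(kept)
```

Sorting the inventory by frequency puts the unusual characters first, which makes it easier to decide whether to drop a whole song or only strip a symbol.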

# Data Visualization

In [None]:
# open text file and read in data as `text`
with open('/content/drive/My Drive/DeepLearning_2020/FP/lyrics.txt', 'r') as f:
    text = f.read()

In [None]:
print(text[:100])

"---\nY esta noche está pa' bailar, beber, jode'\nHasta que no pueda más (Yeh-yeh-yeh-yeh)\nY esta noche"

# Data Encoding
A neural model needs numerical inputs, so we transform each character into an integer by simply enumerating all the unique characters in our lyrics file. We generate two dictionaries: ***int2char*** maps an integer value to the associated character and ***char2int*** does the inverse mapping.

In [None]:
chars = tuple(set(text))
int2char = dict(enumerate(chars)) #maps integers to characters
char2int = {ch: ii for ii, ch in int2char.items()} #maps characters to unique integers

# encode the text
encoded = np.array([char2int[ch] for ch in text])

In [None]:
encoded[:100]

array([68, 68, 68, 17, 76, 84, 37, 47, 40, 63, 84, 93, 18, 20, 80, 37, 84,
       37, 47, 40, 30, 84, 70, 63, 87, 84, 67, 63, 96, 69, 63, 53, 56, 84,
       67, 37, 67, 37, 53, 56, 84, 44, 18, 42, 37, 87, 17, 12, 63, 47, 40,
       63, 84, 41, 25, 37, 84, 93, 18, 84, 70, 25, 37, 42, 63, 84, 15, 30,
       47, 84, 74, 76, 37, 80, 68, 50, 37, 80, 68, 50, 37, 80, 68, 50, 37,
       80, 64, 17, 76, 84, 37, 47, 40, 63, 84, 93, 18, 20, 80, 37])
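
As a quick sanity check, the two dictionaries should allow a round trip from text to integers and back. The sketch below uses a short toy string instead of the full lyrics file, and sorts the characters only so that the toy example is reproducible (an assumption for this example; the notebook itself uses `tuple(set(text))`).

```python
text = "Y esta noche"                      # toy string, not the real dataset

chars = tuple(sorted(set(text)))           # sorted only for a reproducible example
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

encoded = [char2int[ch] for ch in text]    # characters -> integers
decoded = "".join(int2char[ii] for ii in encoded)  # integers -> characters

print(encoded)
print(decoded)
```

Because the iteration order of a `set` of strings can differ between Python sessions, the integer assigned to each character is only meaningful together with the `chars` tuple that produced it.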

## One Hot Encoding

As stated before, it is mandatory to transform the characters into a numerical representation.

One hot encoding is a common practice for categorical data and for text generation.

One hot encoding consists of representing each character as a binary vector of N dimensions, where N is the number of categories (characters) in the data. The vector has a 1 in the position associated with that character and 0 in all other positions.

If we have a dictionary with three characters [a, t, c] and we want to encode the word "cat", it would be represented as follows:

[[0 0 1]
[1 0 0]
[0 1 0]].
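
The same three-character example can be reproduced in a few lines of NumPy. This is a small sketch for illustration; the variable names are chosen for the example and are not part of the notebook's pipeline.

```python
import numpy as np

vocab = ['a', 't', 'c']                        # dictionary from the example above
char2int = {ch: i for i, ch in enumerate(vocab)}

indices = [char2int[ch] for ch in "cat"]       # position of each letter in vocab
one_hot = np.eye(len(vocab), dtype=int)[indices]  # pick one identity row per letter

print(one_hot)
```

Each row of the identity matrix is already the one hot vector of one category, so indexing it with the integer codes gives the encoded word.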

---


**Why One Hot Encoding?**

Even though we could use integer encoding, one hot encoding preserves the independence between categories, which gives much better results.

Integer encoding introduces a natural continuity between categories. Assume we map the value 2 to the letter 'a' and 3 to the letter 't'. There is a natural numerical continuity between 2 and 3, so if the actual output **y** should be 3 and the model guesses 2.49, the answer is numerically close to our target. If we translate this logic to characters, we can see that there is no actual continuity between the letter 'a' and the letter 't'.

One hot encoding breaks this continuity by creating N buckets (one for each category). The model gives "points" to a bucket if the associated character is likely to follow the sequence, and the character with the most "points" in its bucket is the one selected to follow the sequence.
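
The "points per bucket" idea can be illustrated with a tiny example. The scores below are hypothetical values invented for the sketch; in the real model they would come from the output layer of the network.

```python
import math

vocab = ['a', 't', 'c']
scores = [0.5, 2.0, -1.0]          # hypothetical "points" for each bucket

# Softmax turns the raw scores into probabilities that sum to 1.
top = max(scores)
exps = [math.exp(s - top) for s in scores]   # subtract the max for numerical stability
total = sum(exps)
probs = [e / total for e in exps]

# The bucket with the highest probability is the most likely next character.
best = vocab[probs.index(max(probs))]
print(probs)
print(best)
```

Note that the `predict` function in this notebook samples from these probabilities rather than always taking the single highest one, which adds some variety to the generated text.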

In [None]:
def one_hot_encode(arr, n_labels):
    
    # Initialize the encoded array
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    
    # Fill the correct index with a one for each char
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    
    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    
    return one_hot

In [None]:
# check that the function works as expected
test_seq = np.array([[3, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)

print(one_hot)

[[[0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]


In [None]:
#get_batches function returns batches of size batch_size x seq_length from array
def get_batches(arr, batch_size, seq_length):

    batch_size_total = batch_size * seq_length
    # Get the number of batches we can make
    n_batches = len(arr)//batch_size_total
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size_total]
    
    # Reshape into batch_size rows
    arr = arr.reshape((batch_size, -1))
    
    # Iterate over the batches using a window of size seq_length
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:, n:n+seq_length]
        # The targets, shifted by one
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y

In [None]:
batches = get_batches(encoded, 8, 50)
x, y = next(batches)

In [None]:
#Print the first 10 items in a sequence
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[68 68 68 17 76 84 37 47 40 63]
 [83 63 42 63 84  9 37 62 84 41]
 [37 33 25 53 18 84 37 93 84 15]
 [84 37 69 84 20 63 47 37 53 91]
 [47 37 84 41 25 37 42 63 84 47]
 [41 25 37 84 15 37 84 80 96 20]
 [63 84 70 96 69 63 17 54 63 84]
 [63 69 17 54 53 53 53 63 63 17]]

y
 [[68 68 17 76 84 37 47 40 63 84]
 [63 42 63 84  9 37 62 84 41 25]
 [33 25 53 18 84 37 93 84 15 96]
 [37 69 84 20 63 47 37 53 91 18]
 [37 84 41 25 37 42 63 84 47 18]
 [25 37 84 15 37 84 80 96 20 96]
 [84 70 96 69 63 17 54 63 84 41]
 [69 17 54 53 53 53 63 63 17 12]]
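
The relationship between `x` and `y` printed above is easier to see on a toy array. The sketch below reproduces only the first window of the batching logic on the integers 0 to 23 (toy values chosen for the example, not the encoded lyrics).

```python
import numpy as np

arr = np.arange(24)                      # toy stand-in for the encoded lyrics
batch_size, seq_length = 2, 4

rows = arr.reshape((batch_size, -1))     # 2 rows of 12 consecutive values
x = rows[:, 0:seq_length]                # first window of inputs
y = np.zeros_like(x)
y[:, :-1] = x[:, 1:]                     # targets are the inputs shifted left by one
y[:, -1] = rows[:, seq_length]           # last target is the first value after the window

print(x)
print(y)
```

Every target is simply the character that follows the corresponding input, so the network is trained to predict the next character at each position of the sequence.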


In [None]:
# check if GPU is available
train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
    print('Training on GPU!')
else:
    print('No GPU available, training on CPU; consider making n_epochs very small.')

Training on GPU!


In [None]:
class CharRNN(nn.Module):
    
    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        # creating character dictionaries
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        
        #LSTM layer
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        #dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        #fully-connected output layer
        self.fc = nn.Linear(n_hidden, len(self.chars))
      
    
    def forward(self, x, hidden):
                
        #Get the outputs and the new hidden state from the lstm
        r_output, hidden = self.lstm(x, hidden)
      
        #pass through a dropout layer
        out = self.dropout(r_output)
        
        # Stack up LSTM outputs 
        out = out.contiguous().view(-1, self.n_hidden)
        
        #put x through the fully-connected layer
        out = self.fc(out)
        
        # return the final output and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden
        

In [None]:
#function to train the network
def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    net.train()
    
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # create training and validation data
    val_idx = int(len(data)*(1-val_frac))
    data, val_data = data[:val_idx], data[val_idx:]
    
    if(train_on_gpu):
        net.cuda()
    
    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        
        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1
            
            # One-hot encode our data
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            
            if(train_on_gpu):
                inputs, targets = inputs.cuda(), targets.cuda()

            # Creating new variables for the hidden state, otherwise
            # we'd backpropagate through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            net.zero_grad()
            
            # get the output from the model
            output, h = net(inputs, h)
            
            # calculate the loss and perform backpropagation
            loss = criterion(output, targets.view(batch_size*seq_length).long())
            loss.backward()
            # clip the gradient norm to prevent the exploding gradient problem
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
            
            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = net.init_hidden(batch_size)
                val_losses = []
                net.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_chars)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)
                    
                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    val_h = tuple([each.data for each in val_h])
                    
                    inputs, targets = x, y
                    if(train_on_gpu):
                        inputs, targets = inputs.cuda(), targets.cuda()

                    output, val_h = net(inputs, val_h)
                    val_loss = criterion(output, targets.view(batch_size*seq_length).long())
                
                    val_losses.append(val_loss.item())
                
                net.train()
                
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

In [None]:
#set model hyperparameters
n_hidden= 1024
n_layers= 3

net = CharRNN(chars, n_hidden, n_layers)

In [None]:
batch_size = 128
seq_length = 100
n_epochs = 100

# train the model
train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.0001, print_every=100)

Epoch: 1/100... Step: 100... Loss: 3.2262... Val Loss: 3.2244
Epoch: 2/100... Step: 200... Loss: 3.2258... Val Loss: 3.2213
Epoch: 3/100... Step: 300... Loss: 3.2383... Val Loss: 3.2169
Epoch: 4/100... Step: 400... Loss: 3.2089... Val Loss: 3.1973
Epoch: 5/100... Step: 500... Loss: 3.1048... Val Loss: 3.1065
Epoch: 6/100... Step: 600... Loss: 2.7763... Val Loss: 2.7712
Epoch: 7/100... Step: 700... Loss: 2.7067... Val Loss: 2.6009
Epoch: 8/100... Step: 800... Loss: 2.4637... Val Loss: 2.4378
Epoch: 9/100... Step: 900... Loss: 2.4098... Val Loss: 2.3649
Epoch: 10/100... Step: 1000... Loss: 2.3327... Val Loss: 2.3228
Epoch: 11/100... Step: 1100... Loss: 2.2944... Val Loss: 2.2608
Epoch: 12/100... Step: 1200... Loss: 2.2552... Val Loss: 2.2170
Epoch: 13/100... Step: 1300... Loss: 2.2124... Val Loss: 2.1773
Epoch: 14/100... Step: 1400... Loss: 2.2032... Val Loss: 2.1396
Epoch: 15/100... Step: 1500... Loss: 2.1574... Val Loss: 2.1052
Epoch: 16/100... Step: 1600... Loss: 2.1262... Val Loss: 2
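
The `train` function above calls `nn.utils.clip_grad_norm_` with `clip=5` before every optimizer step. The idea behind norm clipping can be sketched in plain NumPy. This is an illustration of the concept under simplified assumptions, not PyTorch's implementation; the helper name `clip_by_global_norm` and the toy gradients are made up for the example.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]      # toy gradients, joint norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

print(norm_before, norm_after)
```

All gradients are scaled by the same factor, so the update keeps its direction and only its length is limited.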

## Checkpoint

After training, we save the model so we can load it again later if we need to. The checkpoint stores what is needed to recreate the same architecture (the hidden layer hyperparameters and the text characters) together with the model's state dict.

In [None]:
model_name = f'rnn_{n_epochs}_epoch.pt'
results_path = '/content/drive/My Drive/DeepLearning_2020/FP/Results/'

checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

torch.save(checkpoint, results_path + model_name)

In [None]:
def predict(net, char, h=None, top_k=None):
        ''' Given a character, predict the next character.
            Returns the predicted character and the hidden state.
        '''
        
        # tensor inputs
        x = np.array([[net.char2int[char]]])
        x = one_hot_encode(x, len(net.chars))
        inputs = torch.from_numpy(x)
        
        if(train_on_gpu):
            inputs = inputs.cuda()
        
        # detach hidden state from history
        h = tuple([each.data for each in h])
        # get the output of the model
        out, h = net(inputs, h)

        # get the character probabilities
        p = F.softmax(out, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
        
        # get top characters
        if top_k is None:
            top_ch = np.arange(len(net.chars))
        else:
            p, top_ch = p.topk(top_k)
            top_ch = top_ch.numpy().squeeze()
        
        # select the likely next character with some element of randomness
        p = p.numpy().squeeze()
        char = np.random.choice(top_ch, p=p/p.sum())
        
        # return the predicted character and the hidden state
        return net.int2char[char], h
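
The `top_k` option in `predict` restricts sampling to the most probable characters. The sketch below shows the same idea on a toy probability vector using NumPy only. The helper name `sample_top_k` and the probabilities are assumptions made for the example.

```python
import numpy as np

def sample_top_k(probs, k, rng):
    """Sample an index from the k most probable entries of probs."""
    probs = np.asarray(probs, dtype=float)
    top_idx = np.argsort(probs)[-k:]                 # indices of the k largest probabilities
    top_p = probs[top_idx] / probs[top_idx].sum()    # renormalise over the shortlist
    return int(rng.choice(top_idx, p=top_p))

rng = np.random.default_rng(0)                # fixed seed only for reproducibility
probs = [0.05, 0.40, 0.10, 0.30, 0.15]        # toy next-character distribution
draws = {sample_top_k(probs, k=2, rng=rng) for _ in range(200)}

print(draws)
```

A small `top_k` keeps the output close to the most likely continuation, while a larger value (or `None`) allows more varied but also noisier text.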

In [None]:
def sample(net, size, prime='Baila', top_k=None):
        
    if(train_on_gpu):
        net.cuda()
    else:
        net.cpu()
    
    net.eval() # eval mode
    
    #run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    chars.append(char)
    
    # pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

In [None]:
print(sample(net, 1000, prime='Aprende profundo', top_k=5))

Aprende profundo8"8r8"r8"l8ÚÚ"rl"8lÚrlÚr""ÚÚÚÚrrrr8rlrÚ8llr8Ú"Ú"Ú"8Ú8l8"lÚrÚ88"88l"l8lr8rrlÚrrÚÚÚrlrllÚrÚÚr"lll8Ú"ÚllÚÚÚ8r"rl8r"8r88llÚÚrr8Úl88r""rrl8Úr"l8r"ÚÚÚlr8Ú""Ú8l""r""lÚ"Úl8Úr"8lÚ"8Ú8Úrrr"r88lrÚl"r"""lÚrl8rrl8rÚl8Ú8lr"Úr"8rÚ"lÚr8rrÚ"8"ÚÚ88l8Ú"ÚÚÚlr8ÚÚrÚlrr"l"8ÚÚ8llÚr""rlrÚ8rÚrÚ"Ú"Úr8Ú88ÚÚlr""8r"Ú88"rlrÚ8ÚÚ"Ú8lÚl8r88ÚllllÚrÉrl8"Ú"Ú8lr""""Ú8lrll88Ú"8llr"lÚ8ÚÚ8"rÚrÚÚr"rr8""Úl8ÚÚ"ÚÚÚ"l"8rÚ"rr88"Úrl8ÚlÚÚ8rÚÚlr""lÚ8l"l88rll"l"Él"l8rÚrÚ8rr""rrlrl"r8rÚ"rl8Úlrr"l"""r8ÚÚrrlÚÚ8"l"lr8Ú8l"8rlÚrllÚlÉÉlÚ88"ÚrÚ88rrrrl"l8ÚrÚ8rÚl8Ú"lllÉr888"rÚlllrr8ÚlÚr8l8rlrlrÚÚÚrr"lrr88"Úl8"ÚÚlrÚ"r"lrl8"rÚr"8l8""88rÚlrÚrl8""r"llll8É88r8"8r"8"rÚ8l8"l8"8rrlrlÚrÚlrr"ll"ÚÉ"rl8lr8ÚÚ8lrÚr"l"l"l8rÚ8lÚ888""lr"lÚ88ÚÚr"l""Ú"ÚÚÚlrll"8r8Úr8"8l"ÚÚÚrrÚ"Ú88rl"8Úr8Ú"ÚÚr""rllÚÚ8lÚ8"888ÚÚ8"""l"l8lrlrÚÚl8""ÚlÚl8l88Ú8r"""Úlrlr8"rlllrrl"rÚ""l8rrÚ8"lllr8r"Ú8rlrlÚÚÚr8"""8lÚ8ll8lÚ8r8"8l"8l"r8"rÚlr"8lllÚrÚ"llrÚl"8l"rr888""8l8"ÚlrÚÚÚrlÚr8"8l8"l""ÚÚr""ÚÚ88r"""8ÚÚrÚllÚll"Úlrrr8Úll8Úr"Úrrr"rÚÚr"rÚl"lÚrÚ8Ú"lÚrlÚÚÚlllÚÚ"8Ú"Ú"8rÚÚÚlllrllÉlÉÚ"

In [None]:
# load a checkpoint of a model that was trained for 100 epochs (`rnn_100_epoch.ckpt`)
with open('/content/drive/My Drive/DeepLearning_2020/FP/Results/rnn_100_epoch.ckpt', 'rb') as f:
    checkpoint = torch.load(f)
    
loaded = CharRNN(checkpoint['tokens'], n_hidden=checkpoint['n_hidden'], n_layers=checkpoint['n_layers'])
loaded.load_state_dict(checkpoint['state_dict'])

<All keys matched successfully>

In [None]:
# Sample using a loaded model
print(sample(loaded, 2000, top_k=2, prime="Baila"))

Baila con mi corazón
Yo sé que tú te vas, volvera a pensar
Y ahora te tengo amigos que me desestene
Yo sé que tú me preguntas a que no te vender a mi
Y si yo me voy pa'l party
Y aunque te pido perdón
Yo nunca voy a porer el perremo
Y ahora soy pero, pero yo no sé

No paro de pensar en ti como tú no metes
Yandele esa no queda nosotros (Nos damos)
Y es que yo quiero estar contigo hoy (Estoy antendo)
Yo que tú quieras tener (Ah)
Yo te llego tú me desea'
Que yo no tengo miedo a que no te quedas no pares no me desesperes
Y que te prefieres a mí
Yo no sé cómo aquí no me desespero
Quiero decirte aquí, ahora es mejor que no
Cuando yo te pienso y tú ere' mi corazón
Yo te lo meto como tú, tú, tú, tú, tú, tú, tú

Y aunque tú tienes la milla y se me hace los tiempos (Yeh-yeh-yeh)
No te amo estar no puedo cantar
Y aunque tú te vayas (Yah)
Que en la pida hace calor, yeah-eh-eh (Yeh-eh-eh)
Y es que a mi me toca la vida
Yo sé que tú me deseas
Yo me puedo contener
No me dije que te pido
Que no te quier