<a href="https://colab.research.google.com/github/BielC/Reggaeton-generator/blob/master/Lyrics_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning Final Project - Lyrics Generation

---


    Biel Casals - 206743
    Aran Coll - 204887
    Gabriel Graells - 205638

---
For this project, we implement an RNN capable of generating new song lyrics character by character after being trained on hundreds of reggaeton songs by popular artists.

We use NumPy and PyTorch throughout.

In [None]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

In [None]:
# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


# Data Collection
To generate the data set needed for training, we used [this](https://github.com/johnwmillr/LyricsGenius) library, which is built on top of the API of the music content website [Genius](https://genius.com/).
The code searches for the top 50 songs, by popularity, of each artist in the list, then downloads one JSON file per artist containing the songs and further information. Finally, it copies all songs from the JSON files into the output file "*lyrics.txt*".

**If you want to execute this part of the code, it is better to do it locally in a *.py* Python script.**

In [None]:
!pip install lyricsgenius
import lyricsgenius
import json
import os


artist_list = ['Bad Bunny', 'Daddy Yanky','Ozuna', 'Don Omar', 'Anuel AA','Maluma','Juan Magan','Nicky Jam', 'Karol G','Mozart La Para', 'Cosculluela', 'Tito "El Bambino"','Calle Trece', 'Arcangel','Tego Calderon','Plan B']

#Init API client
genius = lyricsgenius.Genius("XAjzWCsgJw9mHU48O7RRZ6-nzXykKtYH9_7zAFnPbl_PHrVQZQBM7InGU05sji9o")
genius.verbose = False 
genius.remove_section_headers = True
genius.excluded_terms = ['Live', '(Live)']

#Iterate artist_list and download JSON file with lyrics
for art in artist_list:
    print(art)
    artist = genius.search_artist(art, max_songs = 50, sort='popularity')
    
    if artist is None:
        print(art,'Does not exist!')
    else:
        print('saving json...')
        artist.save_lyrics()

#Get names json files in current directory
json_files = [pos_json for pos_json in os.listdir() if pos_json.endswith('.json')]

for files in json_files:
    with open(files) as json_file, open('lyrics.txt', 'a') as text_file:
        data = json.load(json_file)
        for s in data['songs']:  
            text_file.write(s['lyrics'])


# Data Cleaning
After the "lyrics.txt" file is generated, we need to review the downloaded data and define criteria for cleaning it, with the goal of improving the network's performance.
The model for our lyrics generator learns at the character level: it uses a dictionary of characters and learns which character is most likely to follow a given sequence. Reducing the number of distinct characters therefore narrows the range of possible outputs and reduces the complexity of the problem.

Initially our data contained a total of 131 distinct characters. We printed them for inspection and found Korean and German letters. In these cases we took the extreme approach of deleting the songs that contained those characters.

A number of symbols still remained in the character dictionary, which we deleted with the following code snippet:
```
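# NOTE: the original snippet omitted the characters to remove;
# `unwanted` is a placeholder to fill in with those symbols.
unwanted = ''
table = str.maketrans('', '', unwanted)

# Read first: opening the same file in 'w' mode truncates it.
with open('lyrics.txt', 'r') as f:
    lines = f.readlines()
with open('lyrics.txt', 'w') as f:
    for line in lines:
        f.write(line.translate(table))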
```
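
The inspection step described above could be done by counting every character and flagging the ones outside an expected alphabet. The sketch below is an assumption about how such a check might look, not the script we actually ran; `allowed`, `find_unexpected_chars` and `sample_text` are hypothetical names, and the whitelist is only an example.

```python
from collections import Counter
import string

# Hypothetical whitelist: ASCII letters, digits, whitespace, basic
# punctuation and the accented characters common in Spanish.
allowed = set(string.ascii_letters + string.digits + string.whitespace
              + ".,;:!?¡¿'\"()-" + "áéíóúüñÁÉÍÓÚÜÑ")

def find_unexpected_chars(text):
    """Return {char: count} for every character that is not in `allowed`."""
    counts = Counter(text)
    return {ch: n for ch, n in counts.items() if ch not in allowed}

# Stand-in text with two Korean characters and one German letter.
sample_text = "Y esta noche está pa' bailar\n사랑 Straße"
unexpected = find_unexpected_chars(sample_text)
print(unexpected)
```

Songs containing any of the flagged characters can then be reviewed and either cleaned or dropped.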

# Data Visualization

In [None]:
# open text file and read in data as `text`
with open('/content/drive/My Drive/DeepLearning_2020/FP/lyrics.txt', 'r') as f:
    text = f.read()

We check the first 100 characters of our dataset to make sure the format is correct.

In [None]:
text[:100]

"---\nY esta noche está pa' bailar, beber, jode'\nHasta que no pueda más (Yeh-yeh-yeh-yeh)\nY esta noche"

# Data Encoding
A neural model needs numerical inputs, so we transform each character into an integer by simply enumerating all the unique characters in our lyrics file. We generate two dictionaries: ***int2char*** maps an integer value to the associated character, and ***char2int*** does the inverse mapping.

In [None]:
chars = tuple(set(text))
int2char = dict(enumerate(chars)) #maps integers to characters
char2int = {ch: ii for ii, ch in int2char.items()} #maps characters to unique integers

# encode the text
encoded = np.array([char2int[ch] for ch in text])

And we can see those same characters from above, encoded as integers.

In [None]:
encoded[:100]

array([73, 73, 73,  1, 65, 29, 53, 90, 67, 48, 29, 52, 79, 16, 19, 53, 29,
       53, 90, 67, 25, 29, 55, 48, 77, 29, 64, 48, 87, 14, 48, 94, 44, 29,
       64, 53, 64, 53, 94, 44, 29, 31, 79, 41, 53, 77,  1, 35, 48, 90, 67,
       48, 29, 26, 32, 53, 29, 52, 79, 29, 55, 32, 53, 41, 48, 29, 15, 25,
       90, 29, 83, 65, 53, 19, 73, 20, 53, 19, 73, 20, 53, 19, 73, 20, 53,
       19, 62,  1, 65, 29, 53, 90, 67, 48, 29, 52, 79, 16, 19, 53])
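
A quick way to check the two dictionaries is a round trip: decoding the encoded array should give back the original text. The sketch below runs on a short stand-in string (`demo_text` and the other `demo_` names are assumptions, not the real dataset). It also sorts the character set, since the iteration order of a `set` of strings can differ between Python sessions; sorting keeps the mapping reproducible if the vocabulary is ever rebuilt from the text rather than loaded from a checkpoint.

```python
import numpy as np

demo_text = "Y esta noche"

# Sorting makes the integer assigned to each character reproducible.
demo_chars = tuple(sorted(set(demo_text)))
demo_int2char = dict(enumerate(demo_chars))
demo_char2int = {ch: ii for ii, ch in demo_int2char.items()}

# Encode to integers, then decode back to a string.
demo_encoded = np.array([demo_char2int[ch] for ch in demo_text])
demo_decoded = ''.join(demo_int2char[int(i)] for i in demo_encoded)

print(demo_decoded == demo_text)
```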

## One Hot Encoding

As stated before, it is mandatory to transform the characters into a numerical representation.

One-hot encoding is a common practice for categorical data and for text generation.

One-hot encoding represents each character as a binary vector of N dimensions, where N is the number of categories (characters) in the data. The vector has a 1 in the position associated with that character and 0 in all other positions.

If we have a dictionary with three characters [a, t, c] and we want to encode the word "cat", it is represented as follows:

[[0 0 1]
[1 0 0]
[0 1 0]].

---


**Why One Hot Encoding?**

Even though we could use integer encoding, one-hot encoding preserves the independence between categories and gives much better results.

Integer encoding introduces an artificial ordering between categories. Assume we map the value 2 to the letter 'a' and 3 to the letter 't'. There is a natural numerical continuity between 2 and 3, so if the true output **y** is 3 and the model predicts 2.49, the prediction is numerically close to the target. Characters have no such ordering: 'a' is not "almost" 't', so this closeness carries no meaning.

One-hot encoding breaks this continuity by creating N buckets (one for each category). The model gives "points" to a bucket if the associated character is likely to follow the sequence, and the character with the most "points" in its bucket is the one selected to follow the sequence.
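
The "cat" example above can be reproduced in a few lines of NumPy. This is only an illustrative sketch; `vocab`, `index` and `one_hot_word` are names introduced here and are not used elsewhere in the notebook.

```python
import numpy as np

vocab = ['a', 't', 'c']                      # dictionary from the example
index = {ch: i for i, ch in enumerate(vocab)}

word = "cat"
ids = np.array([index[ch] for ch in word])   # integer encoding of "cat"

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing the identity matrix with `ids` one-hot encodes the word.
one_hot_word = np.eye(len(vocab), dtype=np.float32)[ids]
print(one_hot_word)
```

Each row contains exactly one 1, so no character is numerically "closer" to another.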

In [None]:
def one_hot_encode(arr, n_labels):
    
    # Initialize the encoded array
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    
    # Fill the correct index with a one
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    
    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    
    return one_hot

In [None]:
# check the one_hot_encoder function
test_seq = np.array([[3, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)

print(one_hot)

[[[0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]


## Making training mini-batches

We split the encoded characters into `batch_size` parallel sequences and step through them in windows of `seq_length` characters. The targets are the inputs shifted by one character.

In [None]:
def get_batches(arr, batch_size, seq_length):

    batch_size_total = batch_size * seq_length
    # Get the number of batches we can make
    n_batches = len(arr)//batch_size_total
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size_total]
    
    # Reshape into batch_size rows
    arr = arr.reshape((batch_size, -1))
    
    # Iterate over the batches using a window of size seq_length
    for n in range(0, arr.shape[1], seq_length):
        # features
        x = arr[:, n:n+seq_length]
        # targets
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y

In [None]:
# Testing the function with batch size=8 and sequence length=50
batches = get_batches(encoded, 8, 50)
x, y = next(batches)

In [None]:
# Print the first 10 items in a sequence
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[73 73 73  1 65 29 53 90 67 48]
 [39 48 41 48 29 13 53 18 29 26]
 [53  0 32 94 79 29 53 52 29 15]
 [29 53 14 29 16 48 90 53 94 82]
 [90 53 29 26 32 53 41 48 29 90]
 [26 32 53 29 15 53 29 19 87 16]
 [48 29 55 87 14 48  1 96 48 29]
 [48 14  1 96 94 94 94 48 48  1]]

y
 [[73 73  1 65 29 53 90 67 48 29]
 [48 41 48 29 13 53 18 29 26 32]
 [ 0 32 94 79 29 53 52 29 15 87]
 [53 14 29 16 48 90 53 94 82 79]
 [53 29 26 32 53 41 48 29 90 79]
 [32 53 29 15 53 29 19 87 16 87]
 [29 55 87 14 48  1 96 48 29 26]
 [14  1 96 94 94 94 48 48  1 35]]
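
The reshape and the one-character shift are easier to see on a toy array. The sketch below follows the same steps as `get_batches` for the first window only; the `toy_` names and the sizes are assumptions chosen for readability.

```python
import numpy as np

toy_arr = np.arange(24)                # stand-in for the encoded text
toy_batch_size, toy_seq_length = 2, 4

# Keep only full batches, then lay the text out as `batch_size` rows.
n_full = len(toy_arr) // (toy_batch_size * toy_seq_length)
toy_arr = toy_arr[:n_full * toy_batch_size * toy_seq_length]
toy_arr = toy_arr.reshape((toy_batch_size, -1))

# First window: inputs are columns 0..3, targets are columns 1..4.
toy_x = toy_arr[:, 0:toy_seq_length]
toy_y = toy_arr[:, 1:toy_seq_length + 1]
print(toy_x)
print(toy_y)
```

Each row is a contiguous stretch of the text, and the next window continues every row where the previous one stopped. That is why the hidden state can be carried over from one batch to the next during training.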


---
## Defining the network
Our model is composed of three layers:
* An LSTM layer
* A dropout layer
* A fully-connected layer

In [None]:
# check if GPU is available
train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
    print('Training on GPU!')
else:
    print('No GPU available')

Training on GPU!


In [None]:
class CharRNN(nn.Module):
    
    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        # creating character dictionaries
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        
        # define the layers of the model
        # LSTM layer
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        # fully-connected output layer
        self.fc = nn.Linear(n_hidden, len(self.chars))
      
    
    def forward(self, x, hidden):
                
        ## Get the outputs and the new hidden state from the lstm
        r_output, hidden = self.lstm(x, hidden)
      
        ## pass through a dropout layer
        out = self.dropout(r_output)
        
        # Stack up LSTM outputs 
        out = out.contiguous().view(-1, self.n_hidden)
        
        ## put x through the fully-connected layer
        out = self.fc(out)
        
        # return the final output and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden
        

## Training

We're using an Adam optimizer and cross entropy loss since we are looking at character class scores as output.

We use gradient clipping to help prevent exploding gradients.
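
To make the idea concrete, here is a NumPy sketch of clipping by global norm: all gradients are rescaled by one common factor whenever their joint L2 norm exceeds a threshold. It illustrates the idea behind `nn.utils.clip_grad_norm_` and is not PyTorch's implementation; `clip_by_global_norm` and the example gradients are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Only shrink: gradients that are already small are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Two made-up gradient arrays whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is limited.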

In [None]:
def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):

    net.train()
    
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # create training and validation data
    val_idx = int(len(data)*(1-val_frac))
    data, val_data = data[:val_idx], data[val_idx:]
    
    if(train_on_gpu):
        net.cuda()
    
    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        
        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1
            
            # One-hot encode our data
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            
            if(train_on_gpu):
                inputs, targets = inputs.cuda(), targets.cuda()
            
            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            net.zero_grad()
            
            # get the output from the model
            output, h = net(inputs, h)
            
            # calculate the loss and perform backprop
            loss = criterion(output, targets.view(batch_size*seq_length).long())
            loss.backward()
            # `clip_grad_norm_` rescales the gradients to prevent the exploding gradient problem
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
            
            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = net.init_hidden(batch_size)
                val_losses = []
                net.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_chars)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)
                    
                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    val_h = tuple([each.data for each in val_h])
                    
                    inputs, targets = x, y
                    if(train_on_gpu):
                        inputs, targets = inputs.cuda(), targets.cuda()

                    output, val_h = net(inputs, val_h)
                    val_loss = criterion(output, targets.view(batch_size*seq_length).long())
                
                    val_losses.append(val_loss.item())
                
                net.train() # reset to train mode after iterating through validation data
                
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

## Define hyperparameters

In [None]:
n_hidden= 1024
n_layers= 3

net = CharRNN(chars, n_hidden, n_layers)

In [None]:
batch_size = 128
seq_length = 100
n_epochs = 10

# train the model
train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.0001, print_every=100)

Epoch: 1/10... Step: 100... Loss: 3.2228... Val Loss: 3.2246
Epoch: 2/10... Step: 200... Loss: 3.2268... Val Loss: 3.2218
Epoch: 3/10... Step: 300... Loss: 3.2381... Val Loss: 3.2181
Epoch: 4/10... Step: 400... Loss: 3.2147... Val Loss: 3.2003
Epoch: 5/10... Step: 500... Loss: 3.1125... Val Loss: 3.0942
Epoch: 6/10... Step: 600... Loss: 2.7845... Val Loss: 2.7630
Epoch: 7/10... Step: 700... Loss: 2.7267... Val Loss: 2.6221
Epoch: 8/10... Step: 800... Loss: 2.4873... Val Loss: 2.4605
Epoch: 9/10... Step: 900... Loss: 2.4134... Val Loss: 2.3729
Epoch: 10/10... Step: 1000... Loss: 2.3429... Val Loss: 2.3160


## Hyperparameters

Here are the hyperparameters for the network.

In defining the model:
* `n_hidden` - The number of units in the hidden layers.
* `n_layers` - Number of hidden LSTM layers to use.

In this example the dropout probability is kept at its default (0.5), while the learning rate is set to 0.0001 in the call to `train` above.

And in training:
* `batch_size` - Number of sequences running through the network in one pass.
* `seq_length` - Number of characters in each sequence the network is trained on. Larger values typically let the network learn longer-range dependencies, but training takes longer. 100 is typically a good value here.
* `lr` - Learning rate for training
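
`batch_size` and `seq_length` together also determine how many optimizer steps one epoch contains, because `get_batches` only yields full batches. The helper below is a small sketch of that arithmetic; `steps_per_epoch` is a hypothetical function and the corpus size in the example is made up.

```python
def steps_per_epoch(n_chars, batch_size, seq_length, val_frac=0.1):
    """Number of full training batches that `get_batches` yields per epoch."""
    n_train = int(n_chars * (1 - val_frac))   # same split as in `train`
    return n_train // (batch_size * seq_length)

# Hypothetical corpus of 1,500,000 characters.
steps = steps_per_epoch(1_500_000, batch_size=128, seq_length=100)
print(steps)
```

Doubling either `batch_size` or `seq_length` roughly halves the number of steps per epoch, which is worth keeping in mind when choosing `print_every`.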

## Saving the model

In [None]:
model_name = f'rnn_{n_epochs}_epoch.pt'
results_path = '/content/drive/My Drive/DeepLearning_2020/FP/Results/'

checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

torch.save(checkpoint, results_path + model_name)

---
## Sampling

With the trained model, we can sample by passing in a character and have the network predict the next character. Then we take that character, pass it back in, and get another predicted character.

In [None]:
def predict(net, char, h=None, top_k=None):
        ''' Given a character, predict the next character.
            Returns the predicted character and the hidden state.
        '''
        
        # tensor inputs
        x = np.array([[net.char2int[char]]])
        x = one_hot_encode(x, len(net.chars))
        inputs = torch.from_numpy(x)
        
        if(train_on_gpu):
            inputs = inputs.cuda()
        
        # detach hidden state from history
        h = tuple([each.data for each in h])
        # get the output of the model
        out, h = net(inputs, h)

        # get the character probabilities
        p = F.softmax(out, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
        
        # get top characters
        if top_k is None:
            top_ch = np.arange(len(net.chars))
        else:
            p, top_ch = p.topk(top_k)
            top_ch = top_ch.numpy().squeeze()
        
        # select the likely next character with some element of randomness
        p = p.numpy().squeeze()
        char = np.random.choice(top_ch, p=p/p.sum())
        
        # return the predicted character (decoded) and the hidden state
        return net.int2char[char], h
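
The `top_k` branch of `predict` restricts sampling to the most probable characters. The same idea is shown below on a plain probability vector, without the network; `sample_top_k`, the probabilities and the seed are assumptions made for illustration.

```python
import numpy as np

def sample_top_k(probs, k, rng):
    """Draw an index from the k most probable entries of `probs`."""
    top_idx = np.argsort(probs)[-k:]                 # indices of the k largest values
    top_p = probs[top_idx] / probs[top_idx].sum()    # renormalize so they sum to 1
    return int(rng.choice(top_idx, p=top_p))

rng = np.random.default_rng(0)
probs = np.array([0.05, 0.50, 0.10, 0.30, 0.05])

# With k=2, only indices 1 and 3 (the two largest probabilities) can be drawn.
draws = [sample_top_k(probs, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

A small `top_k` keeps the output conservative, while `top_k=None` samples from the full distribution and allows unlikely characters through.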

We use a prime to build up a hidden state, otherwise the network will start out generating characters at random. The characters will be less coherent since it hasn't built up a long history of characters to predict from.

In [None]:
def sample(net, size, prime='Baila', top_k=None):
        
    if(train_on_gpu):
        net.cuda()
    else:
        net.cpu()
    
    net.eval() # eval mode
    
    # run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    chars.append(char)
    
    # pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

## Loading a checkpoint

In [None]:
# Here we load a model that was trained for 100 epochs: `rnn_100_epoch.ckpt`
with open('/content/drive/My Drive/DeepLearning_2020/FP/Results/rnn_100_epoch.ckpt', 'rb') as f:
    checkpoint = torch.load(f)
    
loaded = CharRNN(checkpoint['tokens'], n_hidden=checkpoint['n_hidden'], n_layers=checkpoint['n_layers'])
loaded.load_state_dict(checkpoint['state_dict'])

<All keys matched successfully>

In [None]:
# Sample using a loaded model
print(sample(loaded, 2000, top_k=2, prime="Baila"))

Baila me dicen que tú no la abajo yo soy tuyo
Y tú sabes que yo no soy tu camino al cielo, y a ti no voy a estar no hay
Yo no te voy a enla mal
Y tú está' buscando en la mañana (Yah)
Y ahora todos to' si me puse pa' mi amor
Y yo te quiero a ti (pa' no ah)
Por eso me dice la loca, también (Tú no va' a volver)
Y ya no te me apare' como tú no me así'
Ya me paso como un tiempo pa' comprarte (Yeah)
Y tú ere' una diabla y te vas a ver, eh-eh (Uah)
Y todo el amor soy en el camino (Yeah)

Yo sé que está canción (-mande), para mí maminarto (Yeah)
Y ahora to' todo lo que te hace (Nada)
Porque tú no vive' igual que yo (Tú no vive' igual), ah-ah

Te doba te dar (Yeh-yeh)
Baby, tú nadie ve' en el pela'o (Yeah, yaa')

Yo sé que tú me desea' (Que vayan a mirarme)
Tú no vive' igual que yo (Y yo no tengo miedo (No)
Ya no sé qué tú quiere', ya no sé pa' dónde ve (Yeah)
Y aunque tú te va' a dar (Ah)
Y te vo'a matar (-ah)
Yo te acorro de mí (Dama)
Porque tú me dices que yo te voy a decir que ya no ere' tú