# Pop Lyric Generator

Text generation is a type of language modelling problem.

Language modelling is the core problem for a number of Natural Language Processing tasks, such as speech to text, conversational systems, and text summarization.

Text generation can be approached with deep learning models, particularly Recurrent Neural Networks.

In this project, we will experiment with generating new pop lyrics based on [this Dataset](https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres).

# Import Libraries

We have to import the libraries we're going to use. The most relevant are:

- **Pandas** to manipulate and analyze the data
- **Matplotlib** to visualize the data in plots
- **Wordcloud** to visualize the most frequent words in the lyrics
- **Tensorflow & Keras** to build and train the deep learning model


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import string, os 
import tensorflow as tf

# Keras modules for building the LSTM model
# (all imported from tensorflow.keras to avoid mixing it with the standalone keras package)
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential

# Loading the Dataset
Once the data has been pre-cleaned, it is uploaded to the cloud. As mentioned in the documentation, it is stored as a .csv file because that makes the data easier to manipulate with the Pandas library.


In [4]:
# Import the csv file
df = pd.read_csv(os.path.abspath('/input/pop-lyrics/Pop.csv'))

In [None]:
# Print the first 5 rows (the default for head()) to confirm this is the right dataset
df.head()

# Data Cleaning

In this step the dataset is cleaned with the Pandas library: the columns that are not going to be used are removed, and only the column called "Lyric" is kept.

![Example of the first cleaning process](https://miro.medium.com/max/2560/1*cDBVd9wlvjAxKi0hpdQkAg.png)

In [None]:
# Drop all the columns except Lyric
df.drop(['Artist','Songs','Popularity','Genre','Genres','Idiom','ALink','SName','SLink'],axis=1,inplace=True)
df.drop(df.columns[df.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)

In [None]:
# Print the first 5 rows to confirm the drops were made correctly
df.head()

# Shaping the Dataset
Since Kaggle does not provide enough memory (16 GB RAM), disk space (73 GB), or GPU memory (13 GB) to process the entire dataset, we have to limit the training data from 28,441 lyrics to 700. It's worth mentioning that this same test was performed locally and took approximately 100 hours.

In [None]:
# DataFrame shape: shows how many rows and columns the DataFrame has
df.shape

In [None]:
# Take the first 700 rows; copy() so that adding columns later does not
# trigger pandas' SettingWithCopyWarning
df = df[:700].copy()

In [None]:
df.head()

In [None]:
# DataFrame shape: shows how many rows and columns remain after the last cell
df.shape

Now, to get statistical information, the next cell calculates the number of words per row. This helps us determine the frequency distribution of the number of words in each extracted text.

In [None]:
df['Number_of_words'] = df['Lyric'].apply(lambda x:len(str(x).split()))
df.head()

# Statistical information

In [None]:
df['Number_of_words'].describe()

- **count**: Number of rows evaluated
- **mean**: Average number of words per row in the dataset
- **std**: Standard deviation of the number of words per row
- **min**: Minimum number of words found in a lyric (row)
- **max**: Maximum number of words found in a lyric (row)
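To make these fields concrete, the sketch below recomputes them with Python's standard library on a small, made-up list of word counts. The values are illustrative and not taken from the dataset. Note that pandas reports the sample standard deviation, which corresponds to `statistics.stdev`.

```python
import statistics

# Hypothetical word counts for five lyrics (illustrative values only)
word_counts = [120, 250, 310, 180, 240]

summary = {
    "count": len(word_counts),
    "mean": statistics.mean(word_counts),
    "std": statistics.stdev(word_counts),  # sample standard deviation, as in pandas
    "min": min(word_counts),
    "max": max(word_counts),
}
print(summary)
```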

In [None]:
plt.style.use('ggplot')
plt.figure(figsize=(12,6))
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(df['Number_of_words'], kde=False, color="red", bins=200)
plt.title("Frequency distribution of number of words for each text extracted", size=20)

- **Y axis**: Represents how many lyrics have a given number of words
- **X axis**: Represents the number of words in a lyric

# Tokenization

The data in the **Lyric** column must be preprocessed: to feed it to the model, the words have to be converted into numbers. This process is called "tokenization".

Keras provides the ___Tokenizer()___ class for this job. The class has two important methods:

- ___fit_on_texts()___ : Updates the internal vocabulary based on a given list of texts, in this case the "Lyric" column. Each distinct word is assigned an integer index.

- ___texts_to_sequences()___ : Transforms each text in the supplied list into a sequence of integers; only words known by the tokenizer are considered.

![Tokenization Example](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/11/tokenization.png)
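The two steps can be illustrated with a simplified, pure-Python sketch. This is not Keras's implementation: the helper names `fit_word_index` and `to_sequences` are hypothetical, and the sketch only splits on whitespace, whereas the real Tokenizer also filters punctuation. It keeps the same general idea: more frequent words get lower indexes, and indexes start at 1.

```python
from collections import Counter

def fit_word_index(texts):
    """Build a word -> integer index, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by count and keeps first-seen order for ties
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Map each text to a list of integers, skipping words not in the index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_lyrics = ["love me love", "hold me"]  # made-up example texts
word_index = fit_word_index(toy_lyrics)
print(word_index)
print(to_sequences(["love me tonight"], word_index))  # "tonight" is unknown and dropped
```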

In [None]:
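tokenizer = Tokenizer()
# Build the vocabulary from the lyrics (Tokenizer also lowercases text by default)
tokenizer.fit_on_texts(df['Lyric'].astype(str).str.lower())

# +1 because index 0 is reserved for padding and is never assigned to a word
total_words = len(tokenizer.word_index)+1
tokenized_sentences = tokenizer.texts_to_sequences(df['Lyric'].astype(str))
tokenized_sentences[0]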

# Slice sequences into n-gram sequences
After the text is tokenized, each tokenized lyric is expanded into n-gram prefixes: every prefix with at least two tokens becomes one input sequence, whose last token is the word the model will learn to predict.

In [None]:
input_sequences = list()
for i in tokenized_sentences:
    for t in range(1, len(i)):
        n_gram_sequence = i[:t+1]
        input_sequences.append(n_gram_sequence)
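As a small illustration of what the loop above produces, the following sketch applies the same slicing to a single made-up token list. The function name `ngram_prefixes` and the token ids are hypothetical.

```python
def ngram_prefixes(tokens):
    """Return every prefix of `tokens` that has at least two elements."""
    return [tokens[:end + 1] for end in range(1, len(tokens))]

toy_tokens = [5, 12, 7, 3]  # hypothetical token ids for a four-word line
for prefix in ngram_prefixes(toy_tokens):
    print(prefix)
```

A lyric with n tokens therefore contributes n - 1 training sequences, which is why the number of sequences grows quickly with the size of the dataset.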

# Padding

Before building the model, all sequences must be normalized to the same length, because the model expects inputs of a fixed size. This is a simple process that adds 0s at the beginning of each sequence, so that all of them end up with the same length.

___pad_sequences___ transforms a list of sequences (lists of integers) into a 2D array; in this case the list is called input_sequences.

Sequences shorter than **maxlen** are padded with 0s until they have the same length.

The position where the zeros are added is determined by the argument **padding**; in this case, they are added at the beginning of the sequence.

![Padding Example](https://miro.medium.com/max/700/1*CPLhZoVSTCWgAxe2LKXoOA.png)
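The effect of pre-padding can be sketched in plain Python. The helper `pad_pre` below is a hypothetical stand-in written for illustration, not the Keras function: it pads on the left with zeros and, for sequences longer than `maxlen`, keeps only the last `maxlen` items.

```python
def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros up to `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # if too long, keep only the last `maxlen` items
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[5, 12], [5, 12, 7], [5, 12, 7, 3]]  # made-up n-gram prefixes
for row in pad_pre(toy_sequences, maxlen=4):
    print(row)
```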

In [None]:
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

In [None]:
input_sequences[:20]

In [None]:
# create predictors and label
X, labels = input_sequences[:,:-1],input_sequences[:,-1]
y = tf.keras.utils.to_categorical(labels, num_classes=total_words)
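The split into predictors and one-hot labels can be illustrated on a tiny example. This is a pure-Python sketch with a hypothetical helper name and made-up values; it mirrors the slicing above but does not use NumPy or Keras.

```python
def split_and_one_hot(padded, num_classes):
    """Use all but the last token as input and one-hot encode the last token."""
    features = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    one_hot = [[1.0 if i == label else 0.0 for i in range(num_classes)]
               for label in labels]
    return features, one_hot

toy_padded = [[0, 0, 1, 2], [0, 1, 2, 3]]  # made-up padded sequences
toy_X, toy_y = split_and_one_hot(toy_padded, num_classes=4)
print(toy_X)
print(toy_y)
```

Each label row has one entry per vocabulary word, so the label array grows with both the number of sequences and the vocabulary size.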

# Creating the Model

This test uses a Bidirectional LSTM model. As the name says, this kind of neural network processes the sequence in two directions: from past to future and from future to past. This is how the model preserves information from both directions at any point. LSTM networks are commonly used where context matters.

![Bidirectional LSTM Diagram](https://mlwhiz.com/images/birnn.png)

In [None]:
model = Sequential()

Models in Keras are defined as a sequence of layers, and the Sequential model is built by adding layers one at a time. Layers are the basic building blocks of a neural network.

## Layers:

### Embedding
- A core layer that can only be used as the first layer in a model. It turns positive integers (word indexes) into dense vectors of fixed size. The first parameter is the size of the vocabulary, the second is the dimension of the dense embedding, and the third is the length of the input sequences (here max_sequence_len-1, since the last token of each sequence is used as the label).

In [None]:
model.add(Embedding(total_words, 40, input_length=max_sequence_len-1))


### Bidirectional:
- A recurrent layer: a bidirectional wrapper for RNNs that receives a layer as input, in this case the LSTM layer we chose. The LSTM layer in turn receives a positive integer, the number of output units it should return.

In [None]:
model.add(Bidirectional(LSTM(250)))

### Dropout:
- A regularization layer. During training, it randomly sets input units to 0 at each step, with a frequency equal to the rate we pass it, which helps prevent overfitting.

In [None]:
model.add(Dropout(0.1))
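The idea behind dropout can be sketched in a few lines of plain Python. This is an illustrative sketch, not the Keras layer: the function name `dropout` and the input values are assumptions. Like the Keras layer during training, it scales the surviving values by 1 / (1 - rate) so that the expected sum stays the same.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)   # fixed seed so the run is repeatable
activations = [1.0] * 10  # made-up layer outputs
dropped = dropout(activations, rate=0.1, rng=rng)
print(dropped)
```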

### Dense:
- A core layer: a densely connected neural network layer. Its first parameter is a positive integer, the number of output nodes. The second parameter, **activation**, defines the type of predictions the model can make. For the kind of problem we are addressing, the best fit is softmax, which turns the layer's output into a vector of values that can be interpreted as the probability of each word in the vocabulary being the next word.

In [None]:
model.add(Dense(total_words, activation='softmax'))
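To show how softmax turns raw scores into probabilities, here is a minimal pure-Python sketch. The scores are made up, and the function is written for illustration rather than taken from Keras.

```python
import math

def softmax(scores):
    """Convert raw scores into positive values that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for a three-word vocabulary
print(probs)
print(sum(probs))
```

The word with the highest score gets the highest probability, which is what the generation step later relies on when it picks the most likely next word.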

### Compile method.
Takes two relevant parameters:

- **loss**: Also known as a cost function; it is used during the optimization process, and its role is to calculate the error of the model. Cross-entropy estimates the difference between the true and the predicted probability distributions. **categorical_crossentropy** is used here because it suits multi-class problems with one-hot encoded labels, such as next-word prediction.
- **optimizer**: Responsible for reducing the loss to give the most accurate results possible. Adam is chosen because it is a widely used optimizer that usually trains neural networks quickly and efficiently.

The EarlyStopping callback stops the training if the model has stopped improving; this is checked at the end of every epoch.

## Optimization algorithm and Performance metrics

The optimization algorithm set in the compile step is __Adam__, because of its good performance and results in other projects.

In this case, 'accuracy' is measured as the performance metric: the fraction of predictions where the predicted word matches the actual next word.

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
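For a single example, categorical cross-entropy reduces to the negative log of the probability assigned to the correct class. The sketch below is a plain-Python illustration with made-up values; the small `eps` clamp is an assumption added to avoid taking the log of zero.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]  # the correct word is class 1
good = categorical_crossentropy(y_true, [0.1, 0.8, 0.1])  # confident and correct
bad = categorical_crossentropy(y_true, [0.7, 0.2, 0.1])   # confident and wrong
print(good, bad)
```

The loss is small when the model puts most of the probability on the correct word and grows as that probability shrinks.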

The **fit** method trains the model for a fixed number of epochs (iterations over the dataset).

In [None]:
earlystop = EarlyStopping(monitor='loss', min_delta=0, patience=3, verbose=0, mode='auto')
history = model.fit(X, y, epochs=20, verbose=1, callbacks=[earlystop])
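The patience logic of early stopping can be sketched as follows. This is a simplified illustration, not the Keras implementation: the function name `stopping_epoch` and the loss values are hypothetical. Training stops once the monitored loss has failed to improve for `patience` consecutive epochs.

```python
def stopping_epoch(losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

toy_losses = [2.0, 1.5, 1.4, 1.45, 1.41, 1.42, 1.3]  # made-up loss per epoch
print(stopping_epoch(toy_losses))
```

In this toy run the best loss is reached at epoch 3, and the following three epochs do not beat it, so training stops before the later improvement is ever seen. That trade-off is controlled by `patience`.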

# Evaluating the Model

In this block a new plot is generated and displayed: the Y axis stands for training accuracy and the X axis for the number of epochs. As the plot shows, training accuracy tends to increase with the number of training epochs.

In [None]:
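# Plot the training accuracy per epoch
plt.plot(history.history['accuracy'], label='train acc')
plt.legend()
# Save before show(): once the figure has been shown, savefig would write an empty image
plt.savefig('AccVal_acc')
plt.show()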


# Import the Trained Model

Once the training process is completed, all that remains is to load our model so we can test how it works. In our case the trained model is called 'song_lyrics_generator'.

In [None]:
# matplotlib and seaborn are already imported at the top of the notebook
from tensorflow.keras.models import load_model
model = load_model('../input/songlyricmodel/song_lyrics_generator.h5')

# Function to Generate the song

This step prepares the function that will be used to complete a song with the previously trained model. It predicts the next words based on the input words supplied as 'seed_text'. For this to work, the seed_text is tokenized, the resulting sequence is padded, and the padded sequence is passed into the trained model so the next word can be predicted.

In [None]:
def complete_this_song(seed_text, next_words):
    for _ in range(next_words):
        # Tokenize and left-pad the text generated so far
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes is not available in recent TensorFlow releases;
        # take the argmax of the softmax output instead
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        # Map the predicted index back to its word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text
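The generation loop can be tried without a trained network by swapping in a stand-in predictor. Everything in the sketch below is hypothetical: `generate` mirrors the structure of the function above, `toy_index` is a made-up vocabulary, and `toy_predict` is a fake model that always favors the word whose index follows the last token.

```python
def generate(seed_text, next_words, word_index, predict_probs):
    """Greedy generation: repeatedly append the most probable next word."""
    index_word = {i: w for w, i in word_index.items()}
    text = seed_text
    for _ in range(next_words):
        tokens = [word_index[w] for w in text.lower().split() if w in word_index]
        probs = predict_probs(tokens)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        text += " " + index_word.get(best, "")
    return text

toy_index = {"love": 1, "me": 2, "tonight": 3}  # made-up vocabulary

def toy_predict(tokens):
    """Stand-in for model.predict: favors the index after the last token."""
    probs = [0.0] * (len(toy_index) + 1)  # slot 0 is the padding index
    last = tokens[-1] if tokens else 0
    probs[last % len(toy_index) + 1] = 1.0
    return probs

print(generate("love", 3, toy_index, toy_predict))
```

Because the most probable word is always chosen, the same seed text always yields the same continuation, which also holds for the function above.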

# Examples

In [None]:
complete_this_song("I am missing you", 200)

In [None]:
complete_this_song("It's a cruel and random world", 80)

In [None]:
complete_this_song("I want a piece of pizza", 80)

In [None]:
complete_this_song("Love flowers", 80)

In [None]:
complete_this_song("Hate flowers", 80)