## Ben Summer - Lyrics Identification Neural Network

#### Introduction To The Project
I wanted to develop this project to bring together two of my passions: programming and hip-hop music. I figured that creating a machine learning model that could recognize which artist wrote a given set of lyrics could be not only fun, but also a useful tool for my work.

I find it fun to experiment with different styles, and thought software like this would be useful to me. If I wrote something in a certain style and checked it against an ML model, that could help me refine my style and find what makes each artist unique.

While I made significant progress on this project, I would have liked to make more. I spent a large chunk of my time purely on the lyrics-ripping software itself.

There is a Python library called "LyricsGenius" that allows a programmer to scrape lyrics from the Genius website. However, because of the library's limitations and the additional processing needed to ingest the lyrics, this is where the bulk of my time went.

Thus, my model was not as effective as I would have liked. In some of my recent results, I have gotten an accuracy of 7%. At one point, using a smaller and simpler model, I was able to achieve about 75%. However, that model overfit heavily, so I question whether that figure is reliable. Going forward, I would use a pre-trained model in conjunction with mine, which should allow my model to be bigger and more accurate.

#### Imports

I used TensorFlow for this project because it is the library we have used in my machine learning course, so there was less of a learning curve and I could jump right in.

Of note among my imports are SimpleRNN and LSTM, along with Tokenizer. These were integral to processing the lyrics data into something the model could read, and to the structure of the model itself.


In [28]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras import regularizers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import GlobalMaxPooling1D
# used for manipulating directory paths
import os

# Scientific and vector computation for python
import numpy as np

# Plotting library
from matplotlib import pyplot as plt

# Optimization module in scipy
from scipy import optimize

# will be used to load MATLAB mat datafile format
from scipy.io import loadmat

# tells matplotlib to embed plots within the notebook
%matplotlib inline

import pandas as pd

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.utils.class_weight import compute_class_weight

import seaborn as sns

#### Import The Source Data And Tokenize

This section is where I imported my data from my TSV files. My limited amount of data proved to be a challenge here, and moving forward, I would make it more of a priority to get more data, even if not all of it was clean.

I split the data into training, cross-validation (CV), and test sets, which is standard practice.
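As a rough illustration of this kind of split, the sketch below slices an already-shuffled list into three portions by index. The `split_by_index` helper and the 60/20/20 fractions are assumptions made for the example; they are not the exact row boundaries used in this notebook.

```python
def split_by_index(rows, train_frac=0.6, cv_frac=0.2):
    """Slice already-shuffled rows into train / cv / test lists.

    The fractions are illustrative defaults, not the notebook's own boundaries.
    """
    n = len(rows)
    train_end = round(n * train_frac)
    cv_end = train_end + round(n * cv_frac)
    return rows[:train_end], rows[train_end:cv_end], rows[cv_end:]


# Hypothetical stand-in for the shuffled verses
verses = [f"verse {i}" for i in range(10)]
train, cv, test = split_by_index(verses)
print(len(train), len(cv), len(test))
```

Slicing by position only gives a fair split if the rows were shuffled beforehand; otherwise one artist could end up concentrated in a single portion.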

This is also where I use the **tokenizer**. It builds a dictionary that maps each word in my lyrics to an integer, because ML models cannot take in raw text directly and need numeric input.

The tokenized lyrics were also padded with zeros where necessary so that every sequence has the same length, which is another requirement for feeding them to the model in batches.
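To make the two steps concrete, here is a minimal pure-Python sketch of tokenizing and padding. It is a simplified stand-in for what the Keras `Tokenizer` and `pad_sequences` do, not their actual implementation: the helper names and the sample lyrics are made up for illustration, and punctuation handling is left out.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_post(sequences, maxlen):
    """Right-pad with zeros (and cut off extra tokens) so every row has maxlen entries."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in sequences]


# Hypothetical training lyrics; the vocabulary is built from these only
train_texts = ["money on my mind", "my mind on the music"]
word_index = build_word_index(train_texts)

train_sequences = texts_to_sequences(train_texts, word_index)
maxlen = max(len(seq) for seq in train_sequences)
padded = pad_post(train_sequences, maxlen)

# Unseen text reuses the training vocabulary and the training length
unseen = pad_post(texts_to_sequences(["music on my radio"], word_index), maxlen)
print(padded)
print(unseen)
```

The vocabulary is built from the training lyrics only, so a word that appears only in later data (here "radio") has no integer and is skipped.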

In [3]:
df = pd.read_csv('shuffled_verses.tsv', sep='\t')
X = np.asarray(df.values[:3975, 6]).astype('str')
X_cv = np.asarray(df.values[3975:5300, 6]).astype('str')
X_test = np.asarray(df.values[5300:, 6]).astype('str')

# Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X_sequences = tokenizer.texts_to_sequences(X)
X_cv_sequences = tokenizer.texts_to_sequences(X_cv)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

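# Add padding. The CV and test sets are padded (or truncated) to the training
# length so that every input has the same number of tokens.
X_padded = pad_sequences(X_sequences, padding='post')
max_len = X_padded.shape[1]
X_cv_padded = pad_sequences(X_cv_sequences, maxlen=max_len, padding='post', truncating='post')
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_len, padding='post', truncating='post')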

y = np.asarray(df.values[:3975, 1]).astype('float32')
y_cv = np.asarray(df.values[3975:5300, 1]).astype('float32')
y_test = np.asarray(df.values[5300:, 1]).astype('float32')

df = pd.read_csv('my_artists.tsv', sep='\t')
X_encoding = np.asarray(df.values[:, 0]).astype('str')
y_encoding = np.asarray(df.values[:, 1]).astype('float32')

## MODEL

Here is the actual model. It uses a Long Short-Term Memory (LSTM) structure, which is common for text applications like this one. It is a classification model that uses softmax to determine which artist wrote the lyrics. Other techniques are used here as well, such as BatchNormalization, Dropout, and L2 regularization, all to help make the model more robust.
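As a small illustration of the softmax step, the sketch below turns a vector of raw scores into probabilities and picks the most likely class. The artist names and scores are hypothetical values chosen for the example, not output from this model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical class names and raw scores from a final Dense layer
artists = ["artist_a", "artist_b", "artist_c"]
logits = [2.0, 0.5, -1.0]

probs = softmax(logits)
predicted = artists[probs.index(max(probs))]
print(probs, predicted)
```

The predicted artist is simply the class with the highest probability, which is what an argmax over the model's softmax output would return.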

In [30]:
# Hyperparameters
embedding_dim = 100  # Size of each word embedding vector
rnn_units = 128  # Units in the first LSTM layer (the second layer uses twice as many)
max_sequence_length = X_padded.shape[1]  # Length of the padded sequences
vocab_size = len(tokenizer.word_index) + 1  # Vocabulary size (adding 1 for padding token)
num_classes = y_encoding.shape[0]
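# Balanced class weights, keyed by the integer artist label
classes = np.unique(y)
class_weights = compute_class_weight('balanced', classes=classes, y=y)
class_weights = {int(label): weight for label, weight in zip(classes, class_weights)}
# Rescale so the weights sum to 1. This also shrinks the weighted training loss,
# so it is not directly comparable to the unweighted validation loss.
total_weight = sum(class_weights.values())
normalized_class_weights = {k: v / total_weight for k, v in class_weights.items()}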

# Build The Model
model = Sequential()

# Embedding layer
model.add(Embedding(input_dim=vocab_size,  # Size of the vocabulary
                    output_dim=embedding_dim))  # Dimension of each word embedding

# RNN
model.add(Bidirectional(LSTM(rnn_units, return_sequences=True, kernel_regularizer=l2(0.001), recurrent_regularizer=l2(0.001), recurrent_dropout=0.2)))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Bidirectional(LSTM(rnn_units * 2, kernel_regularizer=l2(0.001), recurrent_regularizer=l2(0.001), recurrent_dropout=0.2)))
model.add(BatchNormalization())
model.add(Dropout(0.4))
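# No GlobalMaxPooling1D layer is added here: the second LSTM already returns a
# single vector per sequence (return_sequences defaults to False).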
model.add(Dense(70, activation='relu', kernel_regularizer=l2(0.01)))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dense(45, activation='relu', kernel_regularizer=l2(0.01)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax', kernel_regularizer=l2(0.001)))

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=100,
    decay_rate=0.9
)

# Compile
model.compile(optimizer=Adam(lr_schedule), 
              loss='sparse_categorical_crossentropy',  # Labels are integer class ids; 'categorical_crossentropy' expects one-hot labels
              metrics=['accuracy'])

# EarlyStopping
early_stopping = EarlyStopping(
    monitor='val_loss',        # Monitor validation loss
    patience=3,                # Stop if no improvement for 3 consecutive epochs
    restore_best_weights=True  # Restore the best model weights
)

model.summary()

# Train (early stopping is passed in as a callback so that it takes effect)
history = model.fit(X_padded, y,
                    epochs=300,
                    batch_size=16,
                    validation_data=(X_cv_padded, y_cv),
                    class_weight=normalized_class_weights,
                    callbacks=[early_stopping])

# Plot learning curves
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(1,2,2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.tight_layout()
plt.show()



Epoch 1/300
249/249 ━━━━━━━━━━━━━━━━━━━━ 346s 1s/step - accuracy: 0.0224 - loss: 1.7306 - val_accuracy: 0.0113 - val_loss: 3.7483
Epoch 2/300
249/249 ━━━━━━━━━━━━━━━━━━━━ 334s 1s/step - accuracy: 0.0172 - loss: 0.1742 - val_accuracy: 0.0211 - val_loss: 3.5932
Epoch 3/300
249/249 ━━━━━━━━━━━━━━━━━━━━ 338s 1s/step - accuracy: 0.0130 - loss: 0.0867 - val_accuracy: 0.0136 - val_loss: 3.5555
Epoch 4/300
249/249 ━━━━━━━━━━━━━━━━━━━━ 323s 1s/step - accuracy: 0.0101 - loss: 0.0712 - val_accuracy: 0.0113 - val_loss: 3.5512
Epoch 5/300
249/249 ━━━━━━━━━━━━━━━━━━━━ 327s 1s/step - accuracy: 0.0146 - loss: 0.0640 - val_accuracy: 0.0136 - val_loss: 3.5513
Epoch 6/300
249/249 ━━━━━━━━━━━━━━━━━━━━ 336s 1s/step - accuracy: 0.0204 - loss: 0.0609 - val_accuracy: 0.0136 - val_loss: 3.5530
Epoch 7/300

KeyboardInterrupt: 

## Conclusions

I am proud of the work I put into this project, and of how robust my extraction code is at gathering this data. It gives me a foundation for adding more data and improving the ML model. Using pre-trained libraries like Word2Vec will allow me to 