## Summary and Explanations. ##

Title: Improving Music Genre Classification using Text Lyrics. 

Hypothesis: Lyrics contain important information that is not present in the audio itself and that can improve audio classifiers.

Dataset: Contains audio features and song lyrics, with music genre as the label and artist_and_song as the unique identifier.

Preparation: See the two notebooks below, which contain the filtering, analysis and workflow for this project.
- data_analysis_for_prototype.ipynb
- song_text_data.ipynb

Resources: Custom-built dataset combining data points from
- Audio Features data: https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db
- Text Lyrics data: https://www.kaggle.com/datasets/edenbd/150k-lyrics-labeled-with-spotify-valence

## Imports and loading the custom dataset ##

In [2]:
import pandas as pd
import numpy as np

from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

data = pd.read_csv('DATASETS/Project_dataset.csv')

## Minor changes and some analysis ##

In [3]:
len(data)

12440

In [4]:
data.genre.value_counts()

genre
Country             1738
Ska                 1687
Blues               1469
Folk                1144
Dance                995
Rock                 917
Electronic           911
Reggae               864
Soul                 804
R&B                  469
Jazz                 431
Pop                  387
Hip-Hop              234
Indie                152
Children’s Music     122
Rap                  116
Name: count, dtype: int64

The goal is to answer the question of this project, not to build a general-purpose classifier, so I remove genres with fewer than 500 data points. There are two reasons:
- Analysis and plots are less cluttered when there are fewer classes.
- Dropping the smallest classes makes the dataset more balanced, which makes classification easier for the models.

The downside is that the model becomes less general and less suited to the real world, where there are many more music genres.
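To put a number on the balancing effect, the sketch below computes the ratio between the largest and the smallest class before and after the cut, using the counts printed above. The helper name `imbalance_ratio` is illustrative and not part of the notebook.

```python
# Genre counts copied from the value_counts() output above.
genre_counts = {
    "Country": 1738, "Ska": 1687, "Blues": 1469, "Folk": 1144,
    "Dance": 995, "Rock": 917, "Electronic": 911, "Reggae": 864,
    "Soul": 804, "R&B": 469, "Jazz": 431, "Pop": 387,
    "Hip-Hop": 234, "Indie": 152, "Children's Music": 122, "Rap": 116,
}


def imbalance_ratio(counts):
    """Largest class size divided by smallest class size (hypothetical helper)."""
    return max(counts.values()) / min(counts.values())


# Keep only genres with at least 500 data points, as in the filter below.
kept = {genre: n for genre, n in genre_counts.items() if n >= 500}

print("classes before:", len(genre_counts), "ratio:", round(imbalance_ratio(genre_counts), 2))
print("classes after: ", len(kept), "ratio:", round(imbalance_ratio(kept), 2))
print("rows kept:", sum(kept.values()), "of", sum(genre_counts.values()))
```

The ratio drops from roughly 15:1 to roughly 2:1, at the cost of discarding seven genres.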

In [5]:
# Create a filter for genres with value counts >= 500
valid_genres = data.genre.value_counts()
genres_to_keep = valid_genres[valid_genres >= 500].index

# Subset the DataFrame based on the filter
data = data[data.genre.isin(genres_to_keep)]

data.genre.value_counts()

genre
Country       1738
Ska           1687
Blues         1469
Folk          1144
Dance          995
Rock           917
Electronic     911
Reggae         864
Soul           804
Name: count, dtype: int64

In [6]:
# Accuracy of always predicting the most frequent genre
print('Dummy accuracy:', round(100 * data.genre.value_counts().max() / len(data)), '%')

Dummy accuracy: 17 %
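A majority-class baseline is the accuracy obtained by always predicting the most frequent genre. A minimal sketch of that idea, with toy labels standing in for `data.genre`; the function name `majority_baseline_accuracy` is an assumption made for illustration:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels, not the real dataset.
toy_labels = ["Country"] * 5 + ["Ska"] * 3 + ["Blues"] * 2
print(f"Dummy accuracy: {100 * majority_baseline_accuracy(toy_labels):.0f} %")
```

Any classifier trained below should beat this number to be considered useful.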


In [7]:
# Encode the categorical data
data = data.copy()  # work on a copy of the filtered subset to avoid SettingWithCopyWarning

# One encoder per column, so the original values can be recovered later,
# e.g. encoders['genre'].inverse_transform([1])
encoders = {}
for col in ['key', 'mode', 'time_signature', 'genre']:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])
    data[col] = data[col].astype("category")

data.head()

Unnamed: 0,genre,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,artist_and_song,Lyrics
0,1,0.0293,0.387,234307,0.874,1.7e-05,3,0.171,-4.528,0,0.053,168.105,2,0.674,gary allan /// get off on the pain,I don't know why I love women\r\nThat love to ...
1,1,0.173,0.633,183600,0.444,3e-06,5,0.0821,-14.09,0,0.0264,106.111,1,0.45,"hank williams, jr. /// old habits","I kicked the habit, of smoking back some time ..."
2,1,0.06,0.493,262720,0.78,0.000248,3,0.192,-4.127,0,0.0485,119.996,2,0.19,carrie underwood /// chaser,I need something strong tonight\nI'm needing m...
3,1,0.773,0.687,320040,0.543,2e-06,10,0.705,-10.151,0,0.0521,129.296,3,0.807,bobby bare /// the winner,The hulk of a man with a beer in his hand he l...
4,1,0.0655,0.618,226187,0.79,0.00641,9,0.11,-4.973,0,0.0445,119.987,2,0.545,brooks & dunn /// proud of the house we built,I dropped to my knees in that field on your da...


In [8]:
data.genre.value_counts()

genre
1    1738
7    1687
0    1469
4    1144
2     995
6     917
3     911
5     864
8     804
Name: count, dtype: int64

### Audio Baseline

In [9]:
data.genre

0        1
1        1
2        1
3        1
4        1
        ..
12435    8
12436    8
12437    8
12438    8
12439    8
Name: genre, Length: 10529, dtype: category
Categories (9, int64): [0, 1, 2, 3, ..., 5, 6, 7, 8]

Prepare the datasets for audio and text classification

In [10]:
labels = data.genre
audio_data = data.drop(columns=['genre', 'artist_and_song','Lyrics'])

# Audio classification train test data.
X_train_audio, X_test_audio, y_train_audio, y_test_audio = train_test_split(audio_data, labels, test_size=0.2,random_state=42)

# print(X_test_audio)
# Create train test data for text classification.
idx_train = X_train_audio.index
idx_test = X_test_audio.index

X_train_text = data.loc[idx_train].Lyrics.tolist()
X_test_text = data.loc[idx_test].Lyrics.tolist()
y_train_text = y_train_audio.tolist()
y_test_text = y_test_audio.tolist()
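The split above does not stratify by genre, so class proportions in the test set can drift slightly from those in the training set. One option worth considering is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels and features (assumptions, not the project data) to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for data.genre.
rng = np.random.default_rng(42)
toy_labels = np.array([0] * 600 + [1] * 300 + [2] * 100)
toy_features = rng.normal(size=(len(toy_labels), 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features,
    toy_labels,
    test_size=0.2,
    random_state=42,
    stratify=toy_labels,  # keep class proportions equal in both splits
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

Because the text split reuses the indices of the audio split, stratifying the audio split would carry over to the lyrics data as well.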



In [11]:
# Scale the data for Audio classification.
scaler = StandardScaler()
X_train_audio = scaler.fit_transform(X_train_audio)
X_test_audio = scaler.transform(X_test_audio)

In [12]:
# Initialize the model
MLP_clf = MLPClassifier()  # hidden_layer_sizes=(256,128,64), max_iter=200)

# Train the model
MLP_clf.fit(X_train_audio, y_train_audio)
MLP_y_pred = MLP_clf.predict(X_test_audio)
MLP_y_proba_pred = MLP_clf.predict_proba(X_test_audio)

# Calculate the classification report on the held-out test set
MLP_report = classification_report(y_test_audio, MLP_y_pred)
print(MLP_report)
print(MLP_y_proba_pred.shape)


              precision    recall  f1-score   support

           0       0.43      0.38      0.41       332
           1       0.49      0.68      0.57       366
           2       0.43      0.35      0.39       195
           3       0.52      0.58      0.55       163
           4       0.38      0.37      0.37       225
           5       0.47      0.56      0.51       162
           6       0.28      0.15      0.20       175
           7       0.70      0.76      0.73       333
           8       0.27      0.18      0.22       155

    accuracy                           0.48      2106
   macro avg       0.44      0.45      0.44      2106
weighted avg       0.46      0.48      0.47      2106





In [13]:
naive_clf = GaussianNB()
naive_clf.fit(X_train_audio, y_train_audio)

naive_y_pred = naive_clf.predict(X_test_audio)

# Calculate the classification report on the held-out test set
naive_report = classification_report(y_test_audio, naive_y_pred)
print(naive_report)

              precision    recall  f1-score   support

           0       0.48      0.07      0.13       332
           1       0.37      0.73      0.50       366
           2       0.32      0.34      0.33       195
           3       0.48      0.40      0.43       163
           4       0.37      0.39      0.38       225
           5       0.38      0.49      0.43       162
           6       0.30      0.15      0.20       175
           7       0.59      0.69      0.64       333
           8       0.24      0.12      0.16       155

    accuracy                           0.41      2106
   macro avg       0.39      0.38      0.35      2106
weighted avg       0.41      0.41      0.37      2106



In [14]:
# !pip install pytorch-tabnet

In [15]:
# from pytorch_tabnet.tab_model import TabNetClassifier

# tabnet_clf = TabNetClassifier()
# tabnet_clf.fit(X_train_audio, y_train_audio)

In [16]:
# tabnet_y_pred = tabnet_clf.predict(X_test_audio)
# tabnet_report = classification_report(y_test_audio, tabnet_y_pred)
# print(tabnet_report)

In [17]:
# !pip install xgboost

In [18]:
import xgboost as xgb
xgb_clf = xgb.XGBClassifier()

# classes_weights = class_weight.compute_sample_weight(
#     class_weight='balanced',
#     y=audio_data['genre'])

xgb_clf.fit(X_train_audio, y_train_audio)

In [19]:
xgb_y_pred = xgb_clf.predict(X_test_audio)
xgb_report = classification_report(y_test_audio, xgb_y_pred)
print(xgb_report)

              precision    recall  f1-score   support

           0       0.44      0.39      0.41       332
           1       0.51      0.63      0.56       366
           2       0.43      0.39      0.41       195
           3       0.53      0.55      0.54       163
           4       0.37      0.41      0.39       225
           5       0.48      0.50      0.49       162
           6       0.28      0.20      0.23       175
           7       0.73      0.76      0.75       333
           8       0.30      0.23      0.26       155

    accuracy                           0.49      2106
   macro avg       0.45      0.45      0.45      2106
weighted avg       0.48      0.49      0.48      2106



In [20]:
best_XGB_params = {
                'max_depth': 4,
                'learning_rate': 0.03711299495104505,
                'n_estimators': 447,
                'min_child_weight': 10,
                'gamma': 0.002238238469617126,
                'subsample': 0.849826493051354,
                'colsample_bytree': 0.9977007889943236,
                'reg_alpha': 0.15538382519234734,
                'reg_lambda': 7.07955517130104e-07,
                'enable_categorical' : True,
                'tree_method' : "hist"
               }

xgb2_clf = xgb.XGBClassifier(**best_XGB_params)

xgb2_clf.fit(X_train_audio, y_train_audio)

In [21]:
xgb2_y_pred = xgb2_clf.predict(X_test_audio)
xgb2_report = classification_report(y_test_audio, xgb2_y_pred)
print(xgb2_report)

              precision    recall  f1-score   support

           0       0.46      0.43      0.44       332
           1       0.50      0.66      0.57       366
           2       0.43      0.38      0.40       195
           3       0.58      0.64      0.61       163
           4       0.39      0.39      0.39       225
           5       0.51      0.56      0.53       162
           6       0.29      0.16      0.21       175
           7       0.74      0.78      0.76       333
           8       0.34      0.25      0.28       155

    accuracy                           0.51      2106
   macro avg       0.47      0.47      0.47      2106
weighted avg       0.49      0.51      0.50      2106



In [22]:
#!pip install imbalanced-learn==0.7.0

In [23]:
# from imblearn.over_sampling import SMOTE

# sm = SMOTE(random_state=42)

# X_res, y_res = sm.fit_resample(X_train_audio, y_train_audio)

# print("Inspect oversampling: \n",y_res.value_counts())

# xgb3_clf = xgb.XGBClassifier()#**best_XGB_params)

# xgb3_clf.fit(X_res, y_res)

In [24]:
# xgb3_y_pred = xgb3_clf.predict(X_test_audio)
# xgb3_report = classification_report(y_test_audio, xgb3_y_pred)
# print(xgb3_report)

In [25]:
# tabnet2_clf = TabNetClassifier()
# tabnet2_clf.fit(X_res, y_res)

In [26]:
# tabnet2_y_pred = tabnet2_clf.predict(X_test_audio)
# tabnet2_report = classification_report(y_test_audio, tabnet2_y_pred)
# print(tabnet2_report)

### Let us do prediction on the Lyrics instead! ###

In [27]:
import torch

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device

device(type='cpu')

In [28]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader
import torch.nn.functional as F
from tqdm import tqdm



In [29]:
# #Some testing 
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# text = X_train_text[120]
# print(text)
# test = tokenizer(text)
# test

In [30]:
# Load the tokenizer once and reuse it for every batch.
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def processing_for_bert(batch, tokenizer=bert_tokenizer):
    """Collate a list of (text, label) pairs into padded tensors on `device`."""
    texts = [text for text, _ in batch]
    final_labels = [label for _, label in batch]

    # Pad to the longest sequence in the batch and truncate at BERT's 512-token limit.
    tokenized = tokenizer(texts, padding=True, max_length=512, truncation=True)
    token_id = torch.tensor(tokenized['input_ids'])
    attention_masks = torch.tensor(tokenized['attention_mask'])
    return token_id.to(device), attention_masks.to(device), torch.tensor(final_labels).to(device)

# Example: token_id, attention_masks, labels = processing_for_bert(list(zip(X_train_text[:2], y_train_text[:2])))
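To make the padding step concrete, here is a simplified pure-Python sketch of what padding to the longest sequence in a batch produces. It is an illustration, not the tokenizer's actual implementation: the helper `pad_batch` is hypothetical, the token IDs are made up, `pad_id=0` is an assumption, and the naive truncation here does not preserve the closing special token the way the real tokenizer does.

```python
def pad_batch(token_id_lists, pad_id=0, max_length=512):
    """Truncate to max_length, then right-pad every sequence to the batch maximum."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token IDs for two sequences of different length.
ids, mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(ids)
print(mask)
```

Padding per batch rather than to a fixed 512 tokens keeps short batches small, which matters when training on a CPU.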

In [31]:
def find_max_list(lists):
    """Return the length of the longest list in `lists`."""
    return max(len(item) for item in lists)

# Function to calculate the accuracy of our predictions
def accuracy(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(preds_flat == labels_flat) / len(preds_flat)
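A small usage sketch of the accuracy computation, with made-up scores for three samples and three classes. The function is repeated here only so the example runs on its own:

```python
import numpy as np


def accuracy(preds, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(preds_flat == labels_flat) / len(preds_flat)


# Made-up scores: the predicted classes are 1, 0 and 2.
toy_scores = np.array([[0.1, 0.7, 0.2],
                       [0.8, 0.1, 0.1],
                       [0.3, 0.3, 0.4]])
toy_labels = np.array([1, 0, 1])

print(accuracy(toy_scores, toy_labels))  # two of three predictions are correct
```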

In [32]:
def create_label_map(num_categories):
    """Map each class index to its one-hot vector."""
    zero_vec = [0] * num_categories
    label_map = {}
    # Cover every class, including the last one
    for i in range(num_categories):
        gold_vec = zero_vec.copy()
        gold_vec[i] = 1
        label_map[i] = gold_vec
    return label_map

def create_text_data(X, Y):
    """Create data of shape [(x1, y1), (x2, y2), ...]."""
    return list(zip(X, Y))
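For reference, a compact way to build one-hot vectors for all classes is to take rows of an identity matrix. The sketch below is an alternative formulation under that assumption, shown together with the text/label pairing on toy inputs:

```python
import numpy as np

num_categories = 9  # nine genres remain after filtering

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_categories, dtype=int)
label_map = {i: one_hot[i].tolist() for i in range(num_categories)}

# Pairing texts with labels, in the [(x1, y1), (x2, y2), ...] shape used here.
pairs = list(zip(["first lyric", "second lyric"], [1, 8]))

print(label_map[8])
print(pairs)
```

Note that `BertForSequenceClassification` takes integer class labels directly, so the one-hot map is not needed for the training loop below.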

In [43]:
def train_model(train_data, n_epochs=1, batch_size=32, lr=1e-5):
    """
    Train a BERT classifier on the provided training data.

    Parameters:
    - train_data (list of (str, int) tuples): Training pairs of lyrics text and encoded genre label.
    - n_epochs (int, optional): Number of training epochs. Default is 1.
    - batch_size (int, optional): Batch size. Default is 32.
    - lr (float, optional): Learning rate. Default is 1e-5.

    Returns:
    - classifier (torch.nn.Module): Trained BERT classifier.
    """
    
    # Nine music genres
    num_categories = 9  

    # Prepare the data loaders
    train_loader = DataLoader(train_data, batch_size, shuffle=True, collate_fn=processing_for_bert)

    # Build the classifier
    classifier = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_categories).to(device)

    # Initialise the optimizer
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)

    # Training loop
    try:
        for epoch in range(n_epochs):
            train_losses = 0
            classifier.train()
            for batch in tqdm(train_loader):
                train_token_id = batch[0]
                train_attention_masks = batch[1]
                train_labels = batch[2]

                # Reset the accumulated gradients
                optimizer.zero_grad()

                # Forward pass
                output = classifier(train_token_id,
                                    token_type_ids=None,
                                    attention_mask=train_attention_masks,
                                    labels=train_labels)

                # Save loss
                train_losses += output.loss.item()

                # Backward pass; propagates the loss and computes the gradients
                output.loss.backward()
                # Update the parameters of the model
                optimizer.step()

            print('epoch_avg_train_loss:', round(train_losses / len(train_loader), 3))

        # Save model parameters
        torch.save(classifier.state_dict(), 'trained_model.pth')

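    except KeyboardInterrupt:
        # Training was stopped manually. The partially trained model is returned
        # below, but its parameters are NOT saved to 'trained_model.pth'.
        print('Training interrupted; model parameters were not saved.')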

    return classifier


In [51]:
def test_model(classifier, test_data, batch_size=32):
    """
    Evaluate the classifier on test data, returning classification report and class probabilities.
    
    Parameters:
    - classifier (torch.nn.Module): Model to evaluate.
    - test_data (list of (str, int) tuples): Test pairs of lyrics text and encoded genre label.
    - batch_size (int, optional): Batch size. Default is 32.
    
    Returns:
    - tuple:
        - report (str): Classification report.
        - all_probs_tensor (torch.Tensor): Predicted class probabilities.
    """   
    test_loader = DataLoader(test_data, batch_size, collate_fn=processing_for_bert)

    classifier.eval()

    all_probs = []
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in tqdm(test_loader):
            val_token_id = batch[0]
            val_attention_masks = batch[1]
            val_labels = batch[2]

            output = classifier(val_token_id,
                                token_type_ids=None,
                                attention_mask=val_attention_masks,
                                labels=val_labels)

            # Convert logits to probabilities using softmax
            probs = F.softmax(output.logits, dim=1)
            all_probs.append(probs)

            # Store predictions and true labels for the classification report
            preds_flat = np.argmax(output.logits.detach().cpu().numpy(), axis=1)
            all_preds.extend(preds_flat.tolist())
            all_labels.extend(val_labels.to('cpu').numpy().tolist())

    report = classification_report(all_labels, all_preds)
        
    all_probs_tensor = torch.cat(all_probs, dim=0)
    return report, all_probs_tensor
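The conversion from logits to probabilities can be checked in isolation. The sketch below implements a row-wise softmax in NumPy on made-up logits. It mirrors what `F.softmax(output.logits, dim=1)` computes but is not the PyTorch implementation itself:

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up logits for two samples and three classes.
toy_logits = np.array([[2.0, 1.0, 0.1],
                       [0.5, 0.5, 0.5]])
toy_probs = softmax(toy_logits)

print(toy_probs.sum(axis=1))     # each row sums to 1
print(toy_probs.argmax(axis=1))  # same winners as argmax over the logits
```

Softmax preserves the ordering of the logits, so taking `argmax` over the logits, as `test_model` does, gives the same predictions as taking it over the probabilities.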


In [52]:
text_train_data = create_text_data(X_train_text, y_train_text)
text_test_data = create_text_data(X_test_text, y_test_text)

In [53]:
model = train_model(text_train_data)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  1%|          | 3/264 [09:59<14:29:28, 199.88s/it]


In [56]:
report, text_pred_probs = test_model(model, text_test_data[:2])


100%|██████████| 1/1 [00:00<00:00,  2.31it/s]
  _warn_prf(average, modifier, msg_start, len(result))


In [57]:
# Save the probabilities as a NumPy array (move to CPU first so this also works on a GPU)
np.save('text_pred_probs.npy', text_pred_probs.cpu().numpy())
# Load the NumPy array
loaded_probs_array = np.load('text_pred_probs.npy')
print(loaded_probs_array)
# Convert the NumPy array back to a tensor if needed
loaded_probs_tensor = torch.tensor(loaded_probs_array)

[[0.09162419 0.17112213 0.09888737 0.11190072 0.08715122 0.07841507
  0.06692923 0.21866548 0.0753045 ]
 [0.08354866 0.303768   0.06471396 0.0613597  0.08798452 0.05442909
  0.10990866 0.17746    0.05682735]]
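Each row of the saved text probabilities covers the same nine classes as the output of `MLP_clf.predict_proba`, so one possible way to test the hypothesis is late fusion: a weighted average of the two probability arrays. This is an assumption about how the probabilities could be combined, not a step taken in this notebook. The names `fuse_probabilities` and `text_weight` are hypothetical, and the numbers are made up.

```python
import numpy as np


def fuse_probabilities(audio_probs, text_probs, text_weight=0.5):
    """Weighted average of two (n_samples, n_classes) probability arrays."""
    if audio_probs.shape != text_probs.shape:
        raise ValueError("both arrays must have the same shape")
    fused = (1.0 - text_weight) * audio_probs + text_weight * text_probs
    # Renormalise so every row is a valid probability distribution.
    return fused / fused.sum(axis=1, keepdims=True)


# Made-up probabilities for two samples and three classes.
toy_audio = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3]])
toy_text = np.array([[0.2, 0.7, 0.1],
                     [0.1, 0.2, 0.7]])

fused = fuse_probabilities(toy_audio, toy_text)
print(fused.argmax(axis=1))
```

This only makes sense if both arrays list the test songs in the same order. Here the text split reuses the audio split's indices and the test loader does not shuffle, so the rows should line up.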


In [None]:
print(report)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         1
           3       0.20      1.00      0.33         1
           4       0.00      0.00      0.00         1
           5       0.00      0.00      0.00         1

    accuracy                           0.20         5
   macro avg       0.04      0.20      0.07         5
weighted avg       0.04      0.20      0.07         5



In [None]:
def load_model(model_path, num_categories=9):
    """
    Load the trained BERT classifier from the specified path.

    Parameters:
    - model_path (str): Path to the saved model parameters.
    - num_categories (int, optional): Number of output categories. Default is 9.

    Returns:
    - classifier (torch.nn.Module): Loaded BERT classifier.
    """
    # Build the classifier
    classifier = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_categories)

    # Load the model parameters onto the current device (CPU or GPU)
    classifier.load_state_dict(torch.load(model_path, map_location=device))
    classifier.to(device)
    classifier.eval()  # Set the model to evaluation mode
    return classifier

# Usage:
# Assuming you have saved the model using the train_model function
loaded_model = load_model('trained_model.pth')