<h1> <center> Project work in Deep Learning </center> </h1> 
<h2> <center> Song Lyrics Classification Based on Genres </center> </h2>

<h3> Student: Dinno Koluh (0001034376)</h3>

<h4> Introduction </h4>
<p>
In this project we are going to classify songs into genres based on their lyrics. This is a task in the area of NLP (Natural Language Processing), more specifically a <i>text classification</i> problem. In our case the text to be classified is the song lyrics and the classes are the genres. The task has substantial real-world application, as large music platforms (e.g. Spotify, Deezer, SoundCloud, Apple Music) deal with it on a daily basis.
</p>

<h4> Dataset </h4>

The dataset was obtained from Kaggle (it can be found <a href="https://www.kaggle.com/datasets/mateibejan/multilingual-lyrics-for-genre-classification">here</a>). It comprises ~300,000 samples with the following features: artist, song title, genre, language, lyrics. The dataset is heavily imbalanced, so during the preprocessing phase we are going to reduce it. The issues with the dataset and their respective solutions are addressed as they come up.

<h4> Architecture to be used </h4>

The go-to architecture for NLP tasks used to be the RNN (Recurrent Neural Network) and its variants (LSTM, GRU), as we are dealing with inherently sequential data. RNNs are able to capture contextual information because they carry a hidden state that stores information from previous inputs. This is also their bottleneck: RNNs have long-term dependency issues (information about a fact stated at the beginning of a document is lost at some point), and they are inefficient to train because their step-by-step computation is hard to parallelize on GPUs.
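
To make the sequential bottleneck concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The function name `rnn_forward`, the toy dimensions and the random weights are assumptions chosen purely for illustration; they are not part of this project. The point is that each hidden state depends on the previous one, so the loop over time steps cannot be run in parallel.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence and return all hidden states."""
    h = np.zeros(W_h.shape[0])  # initial hidden state
    states = []
    for x_t in inputs:  # strictly sequential: step t needs the state of step t-1
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 4, 3  # toy sizes (assumed)
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (6, 3): one hidden state per time step
```

Information from the first input reaches the last state only after passing through every intermediate `tanh`, which is one way to see why early context tends to fade in long sequences.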

Nowadays the most popular architecture for NLP tasks is the Transformer. It solves both problems of RNNs with the <i>attention</i> mechanism, which captures dependencies between distant words in a text and lets input sequences be processed in parallel, making the Transformer highly efficient to train. We are going to dive deeper into the architecture of the Transformer when we start to build the model for the classification task.
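
As a rough illustration of the attention mechanism, the following NumPy sketch computes scaled dot-product self-attention for a toy sequence. It is a simplified sketch under stated assumptions: there are no learned query/key/value projections and only a single head, and the function name and toy sizes are made up for this example. Every token attends to every other token in one matrix product, which is what makes distant dependencies reachable and the computation parallel.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return the attention output and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query with every key
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens with 8-dimensional embeddings (toy values)

# self-attention: queries, keys and values all come from the same sequence
output, weights = scaled_dot_product_attention(X, X, X)
print(output.shape)   # (5, 8)
print(weights.shape)  # (5, 5): one weight for every pair of tokens
```

Each row of `weights` sums to 1 and tells how much the corresponding token draws from every other token, regardless of how far apart they are in the sequence.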

In [12]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
import contractions
import re
from ast import literal_eval
import numpy as np
from sklearn.preprocessing import LabelEncoder
# NN importing
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from transformers import TFAutoModel, AutoTokenizer

In [2]:
nltk.download("wordnet")
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pc\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

<h4> Data preprocessing </h4>

In [41]:
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

In [43]:
df_train

Unnamed: 0,Artist,Song,Genre,Language,Lyrics
0,12 stones,world so cold,Rock,en,"It starts with pain, followed by hate\nFueled ..."
1,12 stones,broken,Rock,en,Freedom!\nAlone again again alone\nPatiently w...
2,12 stones,3 leaf loser,Rock,en,"Biting the hand that feeds you, lying to the v..."
3,12 stones,anthem for the underdog,Rock,en,You say you know just who I am\nBut you can't ...
4,12 stones,adrenaline,Rock,en,My heart is beating faster can't control these...
...,...,...,...,...,...
291121,bobby womack,i wish he didn t trust me so much,R&B,en,I'm the best friend he's got I'd give him the ...
291122,bad boys blue,i totally miss you,Pop,en,"Bad Boys Blue ""I Totally Miss You"" I did you w..."
291123,celine dion,sorry for love,Pop,en,Forgive me for the things That I never said to...
291124,dan bern,cure for aids,Indie,en,The day they found a cure for AIDS The day the...


We will only work on songs in English, so we keep only the English samples. After this step we can also drop the Language column, as it is not necessary anymore.

In [44]:
en_df_train = df_train[df_train["Language"] == 'en'] # keeping only the English songs
df_train = en_df_train.drop(columns='Language') # language column no longer needed

Let us now get the unique genres in the Genre column. There is some noisy data, so we will filter it and just keep the genres that make sense. We will construct a dictionary where the keys are the genres and the values are the corresponding dataframes.

In [45]:
genres = df_train['Genre'].unique() # there is some noisy data
print(genres)
genres = ['Rock', 'Metal', 'Pop', 'Indie', 'R&B', 'Electronic', 'Jazz', 'Hip-Hop', 'Country'] # the genres that we will keep

data_train = {} # dictionary of genres used as keys
data_test = {}
train_samples = 0 # number of samples
test_samples = 0
for g in genres:
    data_train[g] = df_train[df_train["Genre"] == g]
    data_test[g] = df_test[df_test["Genre"] == g]
    train_samples += len(data_train[g])
    test_samples += len(data_test[g])
print("Number of training samples: " + str(train_samples))
print("Number of test samples: " + str(test_samples))

['Rock' 'Metal' 'Pop' 'Indie' 'Folk' 'Electronic' 'R&B' 'Jazz' 'Hip-Hop'
 'Country']
Number of training samples: 242027
Number of test samples: 7440


Let us now inspect the obtained data. For training purposes we should have balanced data across the different classes. So, let us see how classes compare to each other.

In [46]:
def print_data(data, n_samples):
    for key in data.keys():
        print("Music genre: {}. Number of samples: {}. Percentage of dataset: {}.\n".format(key, len(data[key]), 100*len(data[key])/n_samples))
print_data(data_train, train_samples)
# print_data(data_test, test_samples)

Music genre: Rock. Number of samples: 107145. Percentage of dataset: 44.2698541898218.

Music genre: Metal. Number of samples: 19133. Percentage of dataset: 7.905316349002384.

Music genre: Pop. Number of samples: 86297. Percentage of dataset: 35.655939213393545.

Music genre: Indie. Number of samples: 7240. Percentage of dataset: 2.9914017857511763.

Music genre: R&B. Number of samples: 2765. Percentage of dataset: 1.1424345217682326.

Music genre: Electronic. Number of samples: 2005. Percentage of dataset: 0.8284199696728051.

Music genre: Jazz. Number of samples: 13314. Percentage of dataset: 5.50103914026121.

Music genre: Hip-Hop. Number of samples: 2238. Percentage of dataset: 0.9246902205125874.

Music genre: Country. Number of samples: 1890. Percentage of dataset: 0.7809046098162602.



We see that the data is unbalanced to a great extent. Rock and Pop are the most represented genres whereas Country and Electronic are the least represented. To keep enough data per class, we will drop the Country and Electronic genres and downsample the other genres to $2500$ random samples each; Hip-Hop has only $2238$ samples, so it is kept whole.

In [47]:
# deleting the two least present genres
del data_train['Electronic']
del data_train['Country']
del data_test['Electronic']
del data_test['Country']

In [48]:
# new number of samples after deleting Electronic and Country genres
train_samples = sum(len(df) for df in data_train.values())
test_samples = sum(len(df) for df in data_test.values())

In [49]:
genres = ['Rock', 'Metal', 'Pop', 'Indie', 'R&B', 'Jazz', 'Hip-Hop'] # the genres that we will keep
train_samples = 0
for g in genres:
    if g == 'Hip-Hop': 
        train_samples += len(data_train[g])
        continue
    data_train[g] = data_train[g].sample(n=2500, random_state=0)
    train_samples += len(data_train[g])
print_data(data_train, train_samples)

Music genre: Rock. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Metal. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Pop. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Indie. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: R&B. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Jazz. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Hip-Hop. Number of samples: 2238. Percentage of dataset: 12.982944657152801.



We now have a decently balanced dataset, so we can proceed with some text preprocessing. Let us look at the lyrics of the first rock song to get an idea of the text we are working with:

In [50]:
print(data_train['Rock']['Lyrics'].iloc[0])

She was my lover
She was working undercover
Oh the woman knew all of the moves
She really had me rompin'
We were barefoot stompin'
She just kept igniting my fuse

I was blinded by the blackness
Of her long silk stockings
She was rocking with an optical illusion
This ain't how I'd thought it'd be
She just kept on keeping me
In a total state of confusion

She took me for a ride
Rattled me down to my shoes
And I found out
She was an undercover agent for the blues

She never really needed love
Omnidirectional
I was just an innocent bystander
She kept on getting kinkier
I sank hook, line, and sinker
Just, just, just too hot to handle

She took me by storm
It must of been a season for the fools
She's so bad
An undercover agent for the blues



We will now normalize the text. We are going to do the following:

- Expand contractions ('cause $\rightarrow$ because) and restore the dropped "g" of verbs ending in "in'" (singin' $\rightarrow$ singing)
- Tokenize the text, which also removes punctuation signs
- Convert all characters to lowercase

We won't do word lemmatization or stop-word removal, as our model of choice is the Transformer, which benefits both from fully expanded tokens and from stop-words, since they give context to the text. Before normalization we split the lyrics into verses and normalize each verse separately, which implies that we will be doing <i>sentence-level</i> instead of <i>token-level</i> classification.

In [51]:
def normalize_verse(verse):
    # expand contractions word by word
    expanded_words = [contractions.fix(word) for word in verse.split()]
    expanded_lyrics = ' '.join(expanded_words)
    # restore the dropped "g" of verbs that end in "in'", singin' -> singing
    expanded_lyrics = re.sub(r"in'", "ing", expanded_lyrics)

    # keep only runs of word characters, which removes punctuation
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(expanded_lyrics)

    return [token.lower() for token in tokens] # lowercase all tokens

In [52]:
delimiters = ['.', ';', '\n']
split_pattern = '|'.join(map(re.escape, delimiters))
def split_verses(lyrics):
    verses = [substring for substring in re.split(split_pattern, lyrics) if substring.strip()]
    for i, verse in enumerate(verses):
        verses[i] = normalize_verse(verse)
    return verses

lyrics = data_train['Rock']['Lyrics'].iloc[0]
verses = split_verses(lyrics)
print(verses)

[['she', 'was', 'my', 'lover'], ['she', 'was', 'working', 'undercover'], ['oh', 'the', 'woman', 'knew', 'all', 'of', 'the', 'moves'], ['she', 'really', 'had', 'me', 'romping'], ['we', 'were', 'barefoot', 'stomping'], ['she', 'just', 'kept', 'igniting', 'my', 'fuse'], ['i', 'was', 'blinded', 'by', 'the', 'blackness'], ['of', 'her', 'long', 'silk', 'stockings'], ['she', 'was', 'rocking', 'with', 'an', 'optical', 'illusion'], ['this', 'are', 'not', 'how', 'i', 'would', 'thought', 'it', 'would', 'be'], ['she', 'just', 'kept', 'on', 'keeping', 'me'], ['in', 'a', 'total', 'state', 'of', 'confusion'], ['she', 'took', 'me', 'for', 'a', 'ride'], ['rattled', 'me', 'down', 'to', 'my', 'shoes'], ['and', 'i', 'found', 'out'], ['she', 'was', 'an', 'undercover', 'agent', 'for', 'the', 'blues'], ['she', 'never', 'really', 'needed', 'love'], ['omnidirectional'], ['i', 'was', 'just', 'an', 'innocent', 'bystander'], ['she', 'kept', 'on', 'getting', 'kinkier'], ['i', 'sank', 'hook', 'line', 'and', 'sinker

Let us now do this for all samples and all genres. We will also add a new column "Tokens" containing the tokenized lyrics, so that the original and the tokenized lyrics can be compared.

In [53]:
def tokenize_dict(data):
    for key in data.keys():
        genre_tokens = []
        for i in range(len(data[key])):
            genre_tokens.append(split_verses(data[key]['Lyrics'].iloc[i]))
        data[key]['Tokens'] = genre_tokens
    return data

In [54]:
data_train = tokenize_dict(data_train)
data_test = tokenize_dict(data_test)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[key]['Tokens'] = genre_tokens

To be able to reuse the dataset, we save the train and test data so that they can be loaded in the future without going through the preprocessing steps again.

In [55]:
def combine_dfs(data):
    dfs = []
    for key in data.keys(): dfs.append(data[key])
    return pd.concat(dfs)
combine_dfs(data_test).to_csv('data/pruned_test.csv', index=False)
combine_dfs(data_train).to_csv('data/pruned_train.csv', index=False)
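
One consequence of saving to CSV is worth spelling out: a column that holds Python lists is written as text, so after reloading each "Tokens" cell is a string rather than a nested list and has to be parsed back. The sketch below illustrates this with the standard-library `csv` module standing in for pandas (an assumption made to keep the example self-contained); the two toy verses are taken from the tokenized output shown earlier.

```python
import csv
import io
from ast import literal_eval

tokens = [['she', 'was', 'my', 'lover'], ['she', 'was', 'working', 'undercover']]

# write one row to an in-memory CSV file; the nested list is stored as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Genre', 'Tokens'])
writer.writerow(['Rock', tokens])

# read it back: the cell is now plain text
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row['Tokens']
restored = literal_eval(raw)  # safely parse the text back into a nested list

print(type(raw).__name__)  # str
print(restored == tokens)  # True
```

`literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing data read from a file.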

<h4> Model architecture </h4>

- Address word embeddings
- Address transformer architecture
- Cross entropy for classification
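
For the last point in the list above, a small NumPy sketch of categorical cross-entropy on one-hot labels may help. For a single sample with one-hot label $y$ and predicted probabilities $p$, the loss is $L = -\sum_{c} y_c \log p_c$, i.e. the negative log-probability assigned to the correct class. The function below is an illustrative sketch, not the Keras implementation; its name, the clipping constant `eps` and the toy predictions are assumptions.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=-1)
    return float(np.mean(per_sample))

# two toy samples with three classes
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = categorical_cross_entropy(y_true, y_pred)
print(loss)

# a classifier that spreads its probability evenly over 7 genres has loss log(7)
uniform_loss = categorical_cross_entropy(np.eye(7), np.full((7, 7), 1.0 / 7.0))
print(uniform_loss, np.log(7))
```

The uniform case gives a useful reference point: a 7-class model that has learned nothing yet should sit near $\log 7 \approx 1.95$.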

In [56]:
genres = ['Rock', 'Metal', 'Pop', 'Indie', 'R&B', 'Jazz', 'Hip-Hop']
n = len(genres)

def one_hot(genre):
    # one-hot encode a genre name using its position in `genres`
    enc = np.zeros(n)
    enc[genres.index(genre)] = 1.0
    return enc

def load_data():
    train = pd.read_csv('data/pruned_train.csv')
    test = pd.read_csv('data/pruned_test.csv')
    for df in (train, test):
        # the CSV stores the token lists as text, so parse them back into nested lists
        df['Tokens'] = df['Tokens'].apply(literal_eval)
        df['Genre'] = df['Genre'].apply(one_hot)
    return train, test
train, test = load_data()

In [None]:
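max_length = train['Tokens'].apply(len) # number of verses in each song
max_length_index = max_length.idxmax()
# number of tokens in the longest verse of each song (0 for songs without verses)
max_verse_len = train['Tokens'].apply(lambda verses: max((len(v) for v in verses), default=0))

max_len = max_length.max()
print("Maximum lyrics length (in verses):", max_len)
print("Index of maximum length:", max_length_index)

print("Maximum verse length (in tokens):", max_verse_len.max())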

In [78]:
train['Lyrics'].iloc[-1]

"[Mr. Cheeks] Basically, LB Fam to the motherfuckin' death Park side, Queen's niggaz represent Long Isle, how we do? They knew our style Represent niggaz in and out the P now Yo, I could do this mother shit for a while I don't give a fuck, my rap style be true, yo Yo, eh yo, yo, yo, how we do this Hey, yo, well back on my South Side Jamaica part of town Where us real niggas love to get down Where you only hear G and P finessin' tracks up on the tape We stuck in Queens, and I'm not tryin to escape Yo, I'm havin cess', drinkin; I'm kickin raps and Emceein' LB for life, kid, my way of bein' Its time to set up shops; wild in this game and got props And fuck cops; we puffin' lah wit' windows up in drop tops Nothin' stops my crew from gettin' it; we learn from the past Puffin' on this ounce of weed, I got this drink in my glass Conversatin' with myself; what does my future hold? Niggaz is dyin', will I make it past thirty years old? I can't run; I guess I gots to hold it down till I'm done W

In [4]:
labels = np.array(train['Genre'].to_list())
labels.shape

(17238, 7)

In [5]:
def create_model(num_classes, max_sequence_length):
    # Inputs
    input_ids = Input(shape=(max_sequence_length,), name='input_ids', dtype=tf.int32)
    attention_mask = Input(shape=(max_sequence_length,), name='attention_mask', dtype=tf.int32)

    # Load pre-trained transformer model
    transformer_model = TFAutoModel.from_pretrained("bert-base-uncased")

    # Freeze the transformer layers
    transformer_model.trainable = False

    # Get the transformer output
    transformer_output = transformer_model(input_ids=input_ids, attention_mask=attention_mask)[0]

    # Classification head
    output = Dense(num_classes, activation='softmax')(transformer_output[:, 0, :])  # Use the [CLS] token

    # Combine inputs and outputs into a Keras model
    model = Model(inputs=[input_ids, attention_mask], outputs=output)
    return model

In [6]:
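# sequence length in BERT tokens, matching the 256 shown in the summary below;
# the verse count computed earlier is not a token length
max_len = 256
model = create_model(7, max_len)
model.summary()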

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 256)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 256)]        0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 256,                                           

In [7]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def generate_training_data(df, ids, masks, tokenizer):
    for i, text in enumerate(df['Lyrics']):
        tokenized_text = tokenizer.encode_plus(
            text,
            max_length=max_len, 
            truncation=True, 
            padding='max_length', 
            add_special_tokens=True,
            return_tensors='tf'
        )
        ids[i, :] = tokenized_text.input_ids
        masks[i, :] = tokenized_text.attention_mask
    return ids, masks

In [8]:
X_input_ids = np.zeros((len(train), max_len))
X_attn_masks = np.zeros((len(train), max_len))
X_input_ids, X_attn_masks = generate_training_data(train, X_input_ids, X_attn_masks, tokenizer)

In [9]:
dataset = tf.data.Dataset.from_tensor_slices((X_input_ids, X_attn_masks, labels))
def GenreDatasetMapFunction(input_ids, attn_masks, labels):
    # map each sample to the named inputs expected by the model, plus its one-hot label
    return {
        'input_ids': input_ids,
        'attention_mask': attn_masks
    }, labels
dataset = dataset.map(GenreDatasetMapFunction)
dataset = dataset.shuffle(10000).batch(16, drop_remainder=True)

In [10]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

In [11]:
model.fit(x=dataset, epochs=5)

Epoch 1/5
   1/1077 [..............................] - ETA: 20:52:37 - loss: 2.7053 - accuracy: 0.0000e+00