<h1> <center> Project work in Deep Learning </center> </h1> 
<h2> <center> Song Lyrics Classification Based on Genres </center> </h2>

<h3> Student: Dinno Koluh (0001034376)</h3>

<h4> Introduction </h4>
<p>
In this project we classify songs into genres based on their lyrics. This is a task in NLP (Natural Language Processing), more specifically a <i>text classification</i> problem. In our case the text to be classified is the song lyrics and the classes are the genres. The task has substantial real-world application, as the big music platforms (e.g. Spotify, Deezer, SoundCloud, Apple Music) face it on a daily basis.
</p>

<h4> Dataset </h4>

The dataset was obtained from Kaggle (it can be found <a href="https://www.kaggle.com/datasets/mateibejan/multilingual-lyrics-for-genre-classification">here</a>). It comprises ~300,000 samples with the following features: artist, song title, genre, language, lyrics. During the preprocessing phase we address the issues in the dataset and how we resolve them.

<h4> Architecture to be used </h4>

The go-to architecture for NLP tasks used to be the RNN (Recurrent Neural Network) and its variants (LSTM, GRU), as we are dealing with inherently sequential data. RNNs capture contextual information because they carry a hidden state that stores information from previous inputs. This is also their bottleneck: RNNs struggle with long-term dependencies (information stated at the beginning of a document is lost at some point), and they are inefficient to train because their sequential computation is hard to parallelize on GPUs.

Nowadays the most popular architecture for NLP tasks is the Transformer. It addresses both problems of RNNs with the <i>attention</i> mechanism: attention captures dependencies between distant words in a text directly, and input sequences can be processed in parallel, which makes the Transformer efficient to train. We will look more closely at the Transformer architecture when we build the model for the classification task.
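To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration only, not the model built later in this project: the helper names `softmax` and `attention` and the toy 2-dimensional token vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    # subtract the maximum for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # scaled dot-product attention: softmax(q . k / sqrt(d_k)) applied to the values
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # each output is a weighted average of all value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# toy "embeddings" for three tokens (hypothetical values)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Every token computes its weights over all other tokens independently of the other queries, so the loop over `queries` could run in parallel, and the distance between two tokens in the sequence plays no role in how strongly they can attend to each other.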

In [166]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import contractions
import re

In [137]:
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pc\AppData\Roaming\nltk_data...


True

<h4> Data preprocessing </h4>

In [202]:
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

In [206]:
df_train

Unnamed: 0,Artist,Song,Genre,Language,Lyrics
0,12 stones,world so cold,Rock,en,"It starts with pain, followed by hate\nFueled ..."
1,12 stones,broken,Rock,en,Freedom!\nAlone again again alone\nPatiently w...
2,12 stones,3 leaf loser,Rock,en,"Biting the hand that feeds you, lying to the v..."
3,12 stones,anthem for the underdog,Rock,en,You say you know just who I am\nBut you can't ...
4,12 stones,adrenaline,Rock,en,My heart is beating faster can't control these...
...,...,...,...,...,...
291117,bobby womack,i wish he didn t trust me so much,R&B,en,I'm the best friend he's got I'd give him the ...
291118,bad boys blue,i totally miss you,Pop,en,"Bad Boys Blue ""I Totally Miss You"" I did you w..."
291119,celine dion,sorry for love,Pop,en,Forgive me for the things That I never said to...
291120,dan bern,cure for aids,Indie,en,The day they found a cure for AIDS The day the...


We will only work with songs in English, so we keep only the English samples. After this step we can also drop the language column, as it is no longer necessary.

In [208]:
en_df_train = df_train[df_train["Language"] == 'en'] # keeping only the English samples
df_train = en_df_train.drop(columns='Language') # language column is no longer needed

Let us now get the unique genres in the genre column. There will be some noisy data, so we will filter it and just keep the genres that make sense. We will construct a dictionary where the keys are the genres and the values are the dataframes holding the samples of each genre.

In [210]:
genres = df_train['Genre'].unique() # there is some noisy data
print(genres)
genres = ['Rock', 'Metal', 'Pop', 'Indie', 'R&B', 'Electronic', 'Jazz', 'Hip-Hop', 'Country'] # the genres that we will keep

data_train = {} # dictionary of genres used as keys
data_test = {}
train_samples = 0 # number of samples
test_samples = 0
for g in genres:
    data_train[g] = df_train[df_train["Genre"] == g]
    data_test[g] = df_test[df_test["Genre"] == g]
    train_samples += len(data_train[g])
    test_samples += len(data_test[g])
print("Number of training samples: " + str(train_samples))
print("Number of test samples: " + str(test_samples))

['Rock' 'Metal' 'Pop' 'Indie' 'Folk' 'Electronic' 'R&B' 'Jazz' 'Hip-Hop'
 'Country']
Number of training samples: 242028
Number of test samples: 7440


Let us now inspect the obtained data. For training purposes we should have balanced data across the different classes. So, let us see how classes compare to each other.

In [214]:
def print_data(data, n_samples):
    for key in data.keys():
        print("Music genre: {}. Number of samples: {}. Percentage of dataset: {}.\n".format(key, len(data[key]), 100*len(data[key])/n_samples))
print_data(data_train, train_samples)
# print_data(data_test, test_samples)

Music genre: Rock. Number of samples: 107145. Percentage of dataset: 44.26967127770341.

Music genre: Metal. Number of samples: 19133. Percentage of dataset: 7.9052836861850695.

Music genre: Pop. Number of samples: 86298. Percentage of dataset: 35.65620506718231.

Music genre: Indie. Number of samples: 7240. Percentage of dataset: 2.9913894260168243.

Music genre: R&B. Number of samples: 2765. Percentage of dataset: 1.142429801510569.

Music genre: Electronic. Number of samples: 2005. Percentage of dataset: 0.8284165468458194.

Music genre: Jazz. Number of samples: 13314. Percentage of dataset: 5.50101641132431.

Music genre: Hip-Hop. Number of samples: 2238. Percentage of dataset: 0.9246863999206704.

Music genre: Country. Number of samples: 1890. Percentage of dataset: 0.7809013833110219.



We see that the data is unbalanced to a great extent. Rock and Pop are the most common genres, whereas Country and Electronic are the least common. To keep enough samples per class, we will drop the Country and Electronic genres and downsample the other genres to $2500$ random samples each. Hip-Hop has only $2238$ samples, so it is kept whole.

In [215]:
# deleting the two least present genres
del data_train['Electronic']
del data_train['Country']
del data_test['Electronic']
del data_test['Country']

In [226]:
# new number of samples after deleting Electronic and Country genres
train_samples = sum(len(df) for df in data_train.values())
test_samples = sum(len(df) for df in data_test.values())

In [227]:
genres = ['Rock', 'Metal', 'Pop', 'Indie', 'R&B', 'Jazz', 'Hip-Hop'] # the genres that we will keep
train_samples = 0
for g in genres:
    if g == 'Hip-Hop': 
        train_samples += len(data_train[g])
        continue
    data_train[g] = data_train[g].sample(n=2500, random_state=0)
    train_samples += len(data_train[g])
print_data(data_train, train_samples)

Music genre: Rock. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Metal. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Pop. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Indie. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: R&B. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Jazz. Number of samples: 2500. Percentage of dataset: 14.5028425571412.

Music genre: Hip-Hop. Number of samples: 2238. Percentage of dataset: 12.982944657152801.
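The special case for Hip-Hop in the cell above can be avoided by capping the sample size at the class size. The sketch below shows the idea on plain Python lists with a toy dataset; the function name `downsample` and the toy data are assumptions for illustration. With pandas the same idea would be `df.sample(n=min(len(df), 2500), random_state=0)`.

```python
import random

def downsample(groups, cap, seed=0):
    # draw at most `cap` random items per class; smaller classes are kept whole
    rng = random.Random(seed)
    balanced = {}
    for label, items in groups.items():
        n = min(len(items), cap)
        balanced[label] = rng.sample(items, n)
    return balanced

# toy data: class sizes 10, 7 and 3, capped at 5
toy = {"Rock": list(range(10)), "Pop": list(range(7)), "Hip-Hop": list(range(3))}
balanced = downsample(toy, cap=5)
sizes = {label: len(items) for label, items in balanced.items()}
print(sizes)
```

A fixed seed keeps the subset reproducible between runs, which is what `random_state=0` does in the cell above.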



We now have a reasonably balanced dataset, so we can proceed with some text preprocessing. Let us look at the lyrics of the first rock song to get an idea of the text we are working with:

In [219]:
print(data_train['Rock']['Lyrics'].iloc[0])

Tell me that you've got everything you want
But you don't get me
You say you've seen seven wonders
And you're bird is green
But you don't see me
When your prized possessions start to weigh you down
Look in my direction,
I'll be 'round
I'll be 'round
When your bird is broken
Will it bring you down
You may feel awoken
I'll be 'round
I'll be 'round
You tell me that you've heard every sound there is
And your bird can swing
But you don't get me
You can't hear me.


We will now normalize the text. We are going to do the following:
<br>
- Expand contractions ('cause $\rightarrow$ because) and restore verbs whose final "g" was dropped (singin' $\rightarrow$ singing)
- Tokenize the text, which also removes punctuation signs
- Convert all tokens to lowercase

We won't do word lemmatization or stop-word removal, as our model of choice is the Transformer, which benefits from both fully expanded tokens and stop-words because they give context to the text. The following function normalizes the lyrics:

#### (Sentence-level or Token-level classification with Transformers?)

In [229]:
def normalize_lyrics(lyrics):
    # expand contractions word by word with contractions.fix
    expanded_words = [contractions.fix(word) for word in lyrics.split()]
    expanded_lyrics = ' '.join(expanded_words)
    # restore verbs whose final "g" was dropped: singin' -> singing
    expanded_lyrics = re.sub(r"in'", "ing", expanded_lyrics)

    # keep only runs of word characters, which also strips punctuation
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(expanded_lyrics)

    # lowercase every token
    return [token.lower() for token in tokens]

lyrics = data_train['Rock']['Lyrics'].iloc[0]
print(normalize_lyrics(lyrics))

['through', 'the', 'course', 'of', 'an', 'embrace', 'our', 'sisters', 'felt', 'a', 'striking', 'hand', 'their', 'fear', 'was', 'raised', 'by', 'the', 'light', 'of', 'day', 'their', 'quiet', 'rage', 'sleeps', 'with', 'them', 'tonight', 'and', 'they', 'say', 'we', 'have', 'a', 'reason', 'to', 'ban', 'our', 'heart', 'we', 'have', 'a', 'reason', 'to', 'change', 'our', 'mind', 'sister', 'midnight', 'sister', 'moon', 'like', 'me', 'so', 'much', 'do', 'not', 'think', 'i', 'will', 'see', 'them', 'soon', 'through', 'the', 'course', 'of', 'an', 'embrace', 'our', 'sisters', 'felt', 'a', 'striking', 'hand', 'their', 'fear', 'was', 'raised', 'by', 'the', 'light', 'of', 'day', 'their', 'quiet', 'rage', 'sleeps', 'with', 'them', 'tonight', 'and', 'they', 'say', 'we', 'have', 'a', 'reason', 'to', 'ban', 'our', 'heart', 'we', 'have', 'a', 'reason', 'to', 'change', 'our', 'mind', 'sister', 'midnight', 'sister', 'moon', 'like', 'me', 'so', 'much', 'do', 'not', 'think', 'i', 'will', 'see', 'them', 'soon']
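The order of the steps matters: contractions have to be expanded before tokenizing, because a `\w+` pattern splits words at the apostrophe. The sketch below illustrates this with `re.findall`, which should behave like `RegexpTokenizer(r'\w+')` for this pattern. The sentence and its expansion are written by hand for illustration and are not the output of `contractions.fix`.

```python
import re

def tokenize(text):
    # keep runs of word characters and lowercase them
    return [t.lower() for t in re.findall(r"\w+", text)]

raw = "I don't think I'll see 'em soon"
expanded = "I do not think I will see them soon"  # hand-expanded for this example

raw_tokens = tokenize(raw)
expanded_tokens = tokenize(expanded)
print(raw_tokens)       # contains fragments such as 'don', 't' and 'll'
print(expanded_tokens)  # contains only full words
```

Without the expansion step the vocabulary would fill up with fragments like `t` and `ll` that carry little meaning on their own.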

Let us now do this for all samples and all genres. We will also add a new column "Tokens" containing the tokenized lyrics, so that the original and tokenized versions can be compared.

In [240]:
def tokenize_dict(data):
    # add a "Tokens" column with the normalized lyrics to every genre dataframe
    for key in data.keys():
        tokens = data[key]['Lyrics'].apply(normalize_lyrics)
        # assign returns a new dataframe instead of writing into a filtered slice
        data[key] = data[key].assign(Tokens=tokens)
    return data

In [None]:
data_train = tokenize_dict(data_train)
data_test = tokenize_dict(data_test)

To reuse the dataset, we save the train and test data so that it can be loaded in the future without repeating the preprocessing steps.

In [246]:
def combine_dfs(data):
    dfs = []
    for key in data.keys(): dfs.append(data[key])
    return pd.concat(dfs)
combine_dfs(data_test).to_csv('data/pruned_test.csv', index=False)
combine_dfs(data_train).to_csv('data/pruned_train.csv', index=False)
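One thing to keep in mind when reloading these files: CSV has no list type, so a list-valued cell such as "Tokens" is stored as text and comes back as a string. The sketch below shows the round trip with the standard `csv` module and `ast.literal_eval`; it assumes the saved cell is the Python string form of the list. With pandas, one option would be to pass `converters={'Tokens': ast.literal_eval}` to `pd.read_csv`.

```python
import ast
import csv
import io

tokens = ["through", "the", "course", "of", "an", "embrace"]

# write one row with a list-valued cell; csv.writer stores it via str()
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Genre", "Tokens"])
writer.writerow(["Rock", tokens])

# read it back: the cell is now a string, not a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
loaded = rows[0]["Tokens"]

# literal_eval safely parses the string form back into a list
restored = ast.literal_eval(loaded)
print(type(loaded).__name__, type(restored).__name__)
```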

<h4> Model architecture </h4>