# Lyrics-based Genre Classification

Musical genres are usually distinguished from one another by their musical content, be it orchestration, rhythmic metre or melodic structure. When it comes to songs, however, the lyrical content is also often associated with a specific genre through its themes and choice of expression.

In this project we will attempt to classify songs belonging to three music genres (Pop, Rock, Hip Hop) based solely on their lyrics. Our dataset for this task is available on Kaggle under the title *Scrapped Lyrics from 6 Genres*, and contains song lyrics across different artists, languages and genres.

In [1]:
# Import dependencies:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/scrapped-lyrics-from-6-genres/artists-data.csv
/kaggle/input/scrapped-lyrics-from-6-genres/lyrics-data.csv


Our data exists in two separate dataframes, one corresponding to the artists and their associated musical genre, and the other containing the songs and their lyrics. The link between the two dataframes is the name of the performing artist. We will process each starting dataframe separately before merging them in order to generate our feature dataset.

## Artists dataframe

First, we read in the "artists-data.csv" file.

In [2]:
df_artists = pd.read_csv('/kaggle/input/scrapped-lyrics-from-6-genres/artists-data.csv')
df_artists.head()

Unnamed: 0,Artist,Songs,Popularity,Link,Genre,Genres
0,10000 Maniacs,110,0.3,/10000-maniacs/,Rock,Rock; Pop; Electronica; Dance; J-Pop/J-Rock; G...
1,12 Stones,75,0.3,/12-stones/,Rock,Rock; Gospel/Religioso; Hard Rock; Grunge; Roc...
2,311,196,0.5,/311/,Rock,Rock; Surf Music; Reggae; Ska; Pop/Rock; Rock ...
3,4 Non Blondes,15,7.5,/4-non-blondes/,Rock,Rock; Pop/Rock; Rock Alternativo; Grunge; Blue...
4,A Cruz Está Vazia,13,0.0,/a-cruz-esta-vazia/,Rock,Rock


We can see that this dataframe contains information about the artists included in the set. The information we will be using for our classification task is the artist name and the genre assigned to them. For the name, it is better to use the Link column: as we will see, it also exists in the lyrics dataframe (as ALink), which will allow us to merge the dataframes more easily. The rest of the columns can be dropped now.

In [3]:
df_artists_2c = df_artists[['Link', 'Genre']]
df_artists_2c.head()

Unnamed: 0,Link,Genre
0,/10000-maniacs/,Rock
1,/12-stones/,Rock
2,/311/,Rock
3,/4-non-blondes/,Rock
4,/a-cruz-esta-vazia/,Rock


So now we have an artist dataframe that's much easier to work with. Let's see if there are any duplicate entries of artists.

In [4]:
df_artists_2c.duplicated(subset = 'Link', keep = 'first').value_counts()

False    2940
True      302
dtype: int64

There are 302 instances of duplicate artist entries. We can check one of those below:

In [5]:
df_artists_2c.loc[lambda df: df['Link'] == '/10000-maniacs/']

Unnamed: 0,Link,Genre
0,/10000-maniacs/,Rock
1947,/10000-maniacs/,Pop


We can see that the issue here is that certain artists have been classified under multiple genres. Let's make sure this is the problem and that we don't actually have rows that are identical across all of our columns.

In [6]:
df_artists_2c.duplicated().value_counts()

False    3242
dtype: int64

Indeed, it is an issue of multiple genres for a single artist. A couple of approaches we can take with this issue are:
1. We keep only one genre per artist. This will greatly simplify things; however, it will make our model relatively rigid in its classification, and there is always the question of which single genre is the most fitting to keep for an artist.
2. We leave this dataframe as is, and the resulting merged dataframe will contain duplicate entries of lyrics under multiple labels. One could argue that keeping these duplicates will make our model assign a weaker association between the lyrics and any specific genre.

To take the first approach, we can run the code cell below and use the resulting df_artists_nd dataframe. This will eliminate all duplicate entries and strictly assign one genre per artist, namely the first one listed for that artist. Our model will be simpler but arguably more robust.

In [7]:
df_artists_nd = df_artists_2c.drop_duplicates(subset='Link', keep='first', ignore_index=False)
df_artists_nd.Genre.value_counts()

Rock            755
Pop             696
Sertanejo       569
Hip Hop         467
Funk Carioca    265
Samba           188
Name: Genre, dtype: int64
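
The question of which genre survives for a multi-genre artist depends on the `keep` argument and on the row order of the file. The sketch below illustrates this on a made-up three-row table; the artist links and the names `toy_artists`, `first_kept` and `last_kept` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy artist table (made-up rows) mirroring the Link/Genre layout above
toy_artists = pd.DataFrame({
    "Link": ["/artist-a/", "/artist-b/", "/artist-a/"],
    "Genre": ["Rock", "Hip Hop", "Pop"],
})

# keep='first' retains the first row encountered for each Link
first_kept = toy_artists.drop_duplicates(subset="Link", keep="first")

# keep='last' retains the last row instead, which changes the label
last_kept = toy_artists.drop_duplicates(subset="Link", keep="last")

print(first_kept)
print(last_kept)
```

With `keep='first'`, an artist such as 10000 Maniacs, listed first under Rock and later under Pop, ends up labelled Rock.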

## Lyrics dataframe

Time to load and take a look at our lyrics dataframe.

In [8]:
df_lyrics = pd.read_csv('/kaggle/input/scrapped-lyrics-from-6-genres/lyrics-data.csv')
df_lyrics.head()

Unnamed: 0,ALink,SName,SLink,Lyric,Idiom
0,/10000-maniacs/,More Than This,/10000-maniacs/more-than-this.html,I could feel at the time. There was no way of ...,ENGLISH
1,/10000-maniacs/,Because The Night,/10000-maniacs/because-the-night.html,"Take me now, baby, here as I am. Hold me close...",ENGLISH
2,/10000-maniacs/,These Are Days,/10000-maniacs/these-are-days.html,These are. These are days you'll remember. Nev...,ENGLISH
3,/10000-maniacs/,A Campfire Song,/10000-maniacs/a-campfire-song.html,"A lie to say, ""O my mountain has coal veins an...",ENGLISH
4,/10000-maniacs/,Everyday Is Like Sunday,/10000-maniacs/everyday-is-like-sunday.html,Trudging slowly over wet sand. Back to the ben...,ENGLISH


One of the columns is Idiom, which denotes the language of the lyrics. Let's see how many languages we have in our dataset.

In [9]:
df_lyrics.Idiom.value_counts()

ENGLISH           114723
PORTUGUESE         85085
SPANISH             4812
ITALIAN              626
FRENCH               471
GERMAN               314
KINYARWANDA           88
ICELANDIC             47
SWEDISH               27
FINNISH               24
INDONESIAN            17
ESTONIAN              12
GALICIAN              12
HAITIAN_CREOLE         9
DANISH                 9
IRISH                  9
BASQUE                 8
CROATIAN               7
NORWEGIAN              7
TAGALOG                7
SUNDANESE              6
CATALAN                6
SWAHILI                5
DUTCH                  5
MALAY                  4
RUSSIAN                4
SERBIAN                3
ARABIC                 2
TURKISH                2
NYANJA                 2
KURDISH                2
SESOTHO                2
CEBUANO                2
JAPANESE               2
MALAGASY               2
CZECH                  1
AFRIKAANS              1
SLOVAK                 1
POLISH                 1
GANDA                  1


Quite a few languages, with Portuguese by far the most common after English. We will only use songs in English for this task, so we must drop all other languages.

In [10]:
df_lyrics_en = df_lyrics.drop(df_lyrics[df_lyrics['Idiom'] !='ENGLISH'].index)
df_lyrics_en.Idiom.value_counts()

ENGLISH    114723
Name: Idiom, dtype: int64
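
The same filter can also be written as a boolean mask, which some readers find easier to follow than dropping rows by index. This is a minimal sketch on made-up rows (the `toy_lyrics` frame is hypothetical) comparing the two forms; it assumes a unique index, as is the case for a freshly loaded CSV.

```python
import pandas as pd

# Toy lyrics table (made-up rows) with the same 'Idiom' column as above
toy_lyrics = pd.DataFrame({
    "Lyric": ["hello world", "olá mundo", "hola mundo", "good night"],
    "Idiom": ["ENGLISH", "PORTUGUESE", "SPANISH", "ENGLISH"],
})

# Form used in the cell above: drop the rows whose Idiom is not English
dropped = toy_lyrics.drop(toy_lyrics[toy_lyrics["Idiom"] != "ENGLISH"].index)

# Alternative: keep the rows whose Idiom is English
masked = toy_lyrics[toy_lyrics["Idiom"] == "ENGLISH"]

# Both forms select the same rows with the same index labels
print(masked.equals(dropped))
```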

So now we have an exclusively English lyrics dataframe. Time to search for duplicate entries.

In [11]:
# Search for duplicate entries in 'SLink' (Artist & Song name):
df_lyrics_en.duplicated(subset = 'SLink', keep = 'first').value_counts()

False    91611
True     23112
dtype: int64

This means that the same song by the same artist has multiple entries. We can discard these duplicates.

In [12]:
# Drop duplicates in the field 'SLink'
df_lyrics_en.drop_duplicates(subset='SLink', keep='first', inplace=True, ignore_index=False)

# Search for persisting duplicate lyric entries:
df_lyrics_en.duplicated(subset = 'Lyric', keep = 'first').value_counts()

False    90796
True       815
dtype: int64

There are still duplicate lyrics entries. This is most likely due to the same song being performed by multiple artists. But let's take a better look.

In [13]:
(df_lyrics_en[df_lyrics_en.duplicated(subset = 'Lyric', keep = 'first')])

Unnamed: 0,ALink,SName,SLink,Lyric,Idiom
460,/311/,Leaving Babylon,/311/leaving-babylon.html,My computer is future shockin’. download this ...,ENGLISH
466,/311/,Livin' & Rockin',/311/livin-rockin.html,My computer is future shockin’. download this ...,ENGLISH
474,/311/,Mindspin,/311/mindspin.html,My computer is future shockin’. download this ...,ENGLISH
1495,/aerosmith/,Write Me A Letter,/aerosmith/write-me-a-letter-cifrada.html,WRITE ME A LETTER\nguitar 1(Joe)\nE-----------...,ENGLISH
2065,/alesana/,Second Guessing,/alesana/second-guessing.html,These eyes they have seen the world. These fee...,ENGLISH
...,...,...,...,...,...
170899,/jorge-e-mateus/,Use Somebody,/jorge-e-mateus/use-somebody.html,I've been roaming around. Always looking down ...,ENGLISH
181811,/ronny-e-rangel/,Have You Ever Seen The Rain?,/ronny-e-rangel/have-you-ever-seen-the-rain.html,Someone told me long ago. There's a calm befor...,ENGLISH
189026,/lenny-b/,We Found Love,/lenny-b/we-found-love.html,Yellow diamonds in the light. And we're standi...,ENGLISH
195050,/alexandre-pires/,Jingle Bell Rock,/alexandre-pires/jingle-bell-rock.html,"Jingle bell, jingle bell, jingle bell rock. Ji...",ENGLISH


Apparently, the 'SName' and 'SLink' columns are not well defined, and we have the same song lyrics under different song names for the same artist. We should make sure that each set of lyrics appears only once per artist, without eliminating any possible cover songs. First we will get rid of the 'SName', 'SLink' and 'Idiom' columns, since they are of no further use.

In [14]:
df_lyrics_nd = df_lyrics_en.drop(['SName', 'SLink', 'Idiom'], axis=1)
df_lyrics_nd

Unnamed: 0,ALink,Lyric
0,/10000-maniacs/,I could feel at the time. There was no way of ...
1,/10000-maniacs/,"Take me now, baby, here as I am. Hold me close..."
2,/10000-maniacs/,These are. These are days you'll remember. Nev...
3,/10000-maniacs/,"A lie to say, ""O my mountain has coal veins an..."
4,/10000-maniacs/,Trudging slowly over wet sand. Back to the ben...
...,...,...
207611,/sambo/,"(Chorus). Hello, hello,hello,how low. Hello,he..."
207624,/sambo/,Well sometimes I go out by myself. And I look ...
207628,/sambo/,Feeling my way through the darkness. Guided by...
207792,/seu-jorge/,"Don't, don't, that's what you say. Each time t..."


Now we have a dataframe with the Artist and Lyric columns. Any duplicate rows will indicate the existence of the same song by the same artist, and can therefore be safely discarded.

In [15]:
# Discard all duplicate rows: 
df_lyrics_nd.drop_duplicates(inplace=True)

# Check that we have absolutely no duplicate entries remaining:
df_lyrics_nd.duplicated().value_counts()

False    91392
dtype: int64
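
To see why deduplicating on the artist and lyric columns together preserves cover songs, consider the following sketch. The rows are made up and the names `toy`, `per_artist` and `lyric_only` are hypothetical.

```python
import pandas as pd

# Two identical entries for artist A, plus a cover of the same song by artist B
toy = pd.DataFrame({
    "ALink": ["/artist-a/", "/artist-a/", "/artist-b/"],
    "Lyric": ["same words", "same words", "same words"],
})

# Comparing whole rows (ALink and Lyric) removes only the repeat for artist A
per_artist = toy.drop_duplicates()

# Comparing on Lyric alone would also remove artist B's cover
lyric_only = toy.drop_duplicates(subset="Lyric")

print(len(per_artist), len(lyric_only))
```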

## Merge dataframes

Now that we have two clean dataframes, we can merge them on the artist link to create our main dataset.

In [16]:
df_merged = pd.merge(df_lyrics_nd, df_artists_nd, how='inner', left_on='ALink', right_on='Link')
df_merged.head()

Unnamed: 0,ALink,Lyric,Link,Genre
0,/10000-maniacs/,I could feel at the time. There was no way of ...,/10000-maniacs/,Rock
1,/10000-maniacs/,"Take me now, baby, here as I am. Hold me close...",/10000-maniacs/,Rock
2,/10000-maniacs/,These are. These are days you'll remember. Nev...,/10000-maniacs/,Rock
3,/10000-maniacs/,"A lie to say, ""O my mountain has coal veins an...",/10000-maniacs/,Rock
4,/10000-maniacs/,Trudging slowly over wet sand. Back to the ben...,/10000-maniacs/,Rock
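
As an optional sanity check, `pd.merge` can verify that each song matches at most one artist row and report how many songs have no matching artist. The sketch below uses made-up rows; the frames `toy_lyrics` and `toy_artists` and the variable `n_unmatched` are hypothetical, and this check is not part of the original workflow.

```python
import pandas as pd

toy_lyrics = pd.DataFrame({
    "ALink": ["/artist-a/", "/artist-a/", "/artist-b/", "/artist-c/"],
    "Lyric": ["song one", "song two", "song three", "song four"],
})
toy_artists = pd.DataFrame({
    "Link": ["/artist-a/", "/artist-b/"],
    "Genre": ["Rock", "Pop"],
})

# validate='many_to_one' raises an error if an artist link appears more
# than once on the right-hand side, i.e. if artist duplicates remain
merged = pd.merge(
    toy_lyrics, toy_artists,
    how="inner", left_on="ALink", right_on="Link",
    validate="many_to_one",
)

# A left join with indicator=True flags songs that have no artist entry;
# the inner join above silently drops them
checked = pd.merge(
    toy_lyrics, toy_artists,
    how="left", left_on="ALink", right_on="Link",
    indicator=True,
)
n_unmatched = (checked["_merge"] == "left_only").sum()

print(len(merged), n_unmatched)
```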


We can now discard the artist names as we have no further use for them.

In [17]:
df_merged_2c = df_merged.drop(['ALink','Link'], axis=1)
df_merged_2c.head()

Unnamed: 0,Lyric,Genre
0,I could feel at the time. There was no way of ...,Rock
1,"Take me now, baby, here as I am. Hold me close...",Rock
2,These are. These are days you'll remember. Nev...,Rock
3,"A lie to say, ""O my mountain has coal veins an...",Rock
4,Trudging slowly over wet sand. Back to the ben...,Rock


Now that we have simplified our dataframe, let's take a look at our labels (Genre).

In [18]:
df_merged_2c.Genre.value_counts()

Rock            47534
Pop             25647
Hip Hop         13661
Sertanejo          51
Samba              42
Funk Carioca       15
Name: Genre, dtype: int64
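
To put the imbalance in numbers, the counts printed above can be converted to shares of the total. The sketch below re-enters those counts by hand; the names `genre_counts`, `shares` and `minor_share` are only illustrative.

```python
import pandas as pd

# Genre counts copied from the value_counts() output above
genre_counts = pd.Series({
    "Rock": 47534,
    "Pop": 25647,
    "Hip Hop": 13661,
    "Sertanejo": 51,
    "Samba": 42,
    "Funk Carioca": 15,
})

# Fraction of all songs that falls under each genre
shares = genre_counts / genre_counts.sum()

# Combined fraction of the three least represented genres
minor_share = shares[["Sertanejo", "Samba", "Funk Carioca"]].sum()

print(shares.round(4))
print(f"Three smallest genres combined: {minor_share:.4%}")
```

On the full dataframe, `df_merged_2c.Genre.value_counts(normalize=True)` should give the same shares directly.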

I like Rock, but this definitely isn't the most balanced dataset. At this stage it is probably better if we completely discard the three least represented genres as they are effectively outliers in our data. 

In [19]:
df_3g = df_merged_2c.drop(df_merged_2c[ (df_merged_2c['Genre'] == 'Sertanejo') | (df_merged_2c['Genre'] == 'Samba') | (df_merged_2c['Genre'] == 'Funk Carioca')].index)
df_3g

Unnamed: 0,Lyric,Genre
0,I could feel at the time. There was no way of ...,Rock
1,"Take me now, baby, here as I am. Hold me close...",Rock
2,These are. These are days you'll remember. Nev...,Rock
3,"A lie to say, ""O my mountain has coal veins an...",Rock
4,Trudging slowly over wet sand. Back to the ben...,Rock
...,...,...
86860,Smile though your heart. Is aching. Smile even...,Pop
86861,A dream like this. Not something. You wish for...,Pop
86862,"Aah, yeah yeah. I see the spotlight in my drea...",Pop
86863,"I'm, dreaming of a white, Christmas. Just like...",Pop


So now we have a three-genre dataset (Pop, Rock & Hip Hop). One last thing to do is to clear any remaining duplicate rows. This will discard the same song performed by different artists in the same genre.

In [20]:
df_3g.drop_duplicates(inplace=True)

Finally, let's perform a one-hot encoding of the labels, so that they are in a convenient format for our classifier.

In [21]:
# Use pd.concat to join the one-hot genre columns to the dataframe
df_3g = pd.concat([df_3g,pd.get_dummies(df_3g['Genre'])],axis=1)

# Drop the original 'Genre' column, which is no longer needed
df_3g.drop(['Genre'],axis=1, inplace=True)
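
The encoding step can be illustrated on a few made-up rows. In this sketch the `toy` frame is hypothetical, and `dtype=int` is an added assumption that forces 0/1 columns, since recent pandas versions return booleans from `get_dummies` by default. It also shows one way to recover a single label column if a classifier expects that instead.

```python
import pandas as pd

toy = pd.DataFrame({
    "Lyric": ["la la la", "yeah yeah", "oh oh oh"],
    "Genre": ["Pop", "Rock", "Hip Hop"],
})

# One column per genre, with a 1 marking the genre of each row
one_hot = pd.get_dummies(toy["Genre"], dtype=int)
encoded = pd.concat([toy.drop(columns="Genre"), one_hot], axis=1)

# Going back: the column holding the 1 is the original label
genre_columns = ["Hip Hop", "Pop", "Rock"]
recovered = encoded[genre_columns].idxmax(axis=1)

print(encoded)
print(recovered.tolist())
```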

We are good to go. Let's store our cleaned-up dataset. We will use this dataset to train our classifier in the next notebook.

In [22]:
# Save as .csv
df_3g.to_csv('DF_3Genres_Lyrics_En.csv', index = False)

Here is one additional step that we could have taken, had we decided to venture into multi-label classification territory by keeping duplicate artist entries for multiple genres: merging the rows that share the same lyrics but carry different labels. Using the aggregate function, this generates entries with multiple labels. We would then need a classifier capable of handling this task, such as Multi-Label K-Nearest Neighbours. We will ignore this for now, since we are taking a single-label approach, which gives us more flexibility in choosing a classifier.
The optional code is shown below:

In [23]:
'''
aggregation_functions = {'Hip Hop': 'sum', 'Pop': 'sum', 'Rock': 'sum','Lyric': 'first'}
df_ml = df_3g.groupby(df_3g.Lyric).agg(aggregation_functions).reset_index(drop=True)
df_ml.head()
'''
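
Since the cell above is disabled, here is a small runnable sketch of the same idea on made-up rows. The `toy` frame is hypothetical and stands in for a one-hot encoded dataframe in which the same lyric appears under two genres.

```python
import pandas as pd

# The lyric "shared words" appears once under Pop and once under Rock
toy = pd.DataFrame({
    "Lyric": ["shared words", "shared words", "only rock"],
    "Hip Hop": [0, 0, 0],
    "Pop": [1, 0, 0],
    "Rock": [0, 1, 1],
})

# Summing the one-hot columns per lyric collapses the duplicates
# into a single row that carries every label the lyric had
aggregation_functions = {"Hip Hop": "sum", "Pop": "sum", "Rock": "sum"}
multi_label = toy.groupby("Lyric", as_index=False).agg(aggregation_functions)

print(multi_label)
```

The sums stay at 0 or 1 here because each lyric appears at most once per genre; if the same lyric could repeat within a genre, the sums would need to be capped at 1.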