# Lyrics Classifier research notebook
This notebook attempts to classify properties of songs, such as their genre, based on their lyrics.

## Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data

Lyrics - [Kaggle](https://www.kaggle.com/datasets/notshrirang/spotify-million-song-dataset/)

A **big** dataset (57,650 songs) containing nothing but the artist, the song name, a link, and the lyrics.

In [3]:
lyrics_data = pd.read_csv('lyrics_data.csv')
lyrics_data = lyrics_data.drop(columns=['link'])
print('lyrics shape:', lyrics_data.shape)
lyrics_data.columns

lyrics shape: (57650, 3)


Index(['artist', 'song', 'text'], dtype='object')

Meta dataset #1 - [Kaggle](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset)

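Detailed track information: audio features and a genre tag per track.  
It needs cleaning up: we drop duplicate tracks and unused columns, and rename the key columns to match the lyrics dataset.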

In [4]:
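meta_data = pd.read_csv('meta_data_1.csv')
# Keep a single row per (track_name, artists) pair.
meta_data = meta_data.drop_duplicates(subset=['track_name', 'artists'])
# Drop columns that are not used in this notebook.
meta_data = meta_data.drop(columns=['track_id', 'album_name', 'time_signature', 'popularity', 'explicit', 'mode'])
# Drop the first column, a leftover row index.
meta_data = meta_data.drop(meta_data.columns[0], axis=1)
# Rename the join keys to match lyrics_data.
meta_data = meta_data.rename(columns={'track_name': 'song', 'artists': 'artist'})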
print('meta #1 shape:', meta_data.shape)
meta_data.columns

meta #1 shape: (81344, 14)


Index(['artist', 'song', 'duration_ms', 'danceability', 'energy', 'key',
       'loudness', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'track_genre'],
      dtype='object')

Meta dataset #2 - [Kaggle](https://www.kaggle.com/datasets/salvatorerastelli/spotify-and-youtube)

Pretty much the same as the first meta dataset. I hope it contains more data (probably not...).

In [5]:
meta_data2 = pd.read_csv('meta_data_2.csv')
meta_data2 = (meta_data2.drop_duplicates(subset=['Track', 'Artist'])
              .drop(columns=['Url_spotify', 'Album', 'Album_type', 'Uri', 'Url_youtube', 'Channel', 'Views', 'Likes', 'Comments', 'Description', 'Licensed', 'official_video', 'Stream', 'Title'])
              .drop(meta_data2.columns[0], axis=1)
              .rename(columns={'Track': 'song'})
              )
meta_data2.columns = meta_data2.columns.str.lower()
meta_data2.columns

Index(['artist', 'song', 'danceability', 'energy', 'key', 'loudness',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms'],
      dtype='object')

## Combining datasets

We have three separate datasets: one for lyrics and two for track metadata.  
Now we try to join each metadata set to the lyrics on artist and song name.

In [6]:
merge1 = pd.merge(lyrics_data, meta_data, on=['artist', 'song'])
merge2 = pd.merge(lyrics_data, meta_data2, on=['artist', 'song'])
print('merge1 size:', merge1.shape)
print('merge2 size:', merge2.shape)

merge1 size: (1127, 15)
merge2 size: (1037, 14)
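
Both merges keep only about a thousand of the 57,650 lyrics rows. `pd.merge` compares the key strings exactly, so differences in case or whitespace alone are enough to block a match. The sketch below shows one way to test that idea on toy data; `normalize_key` and `merge_on_normalized_keys` are hypothetical helpers, not part of this notebook, and whether they would recover many rows here is untested.

```python
import pandas as pd

def normalize_key(s: pd.Series) -> pd.Series:
    """Lowercase, trim, and collapse internal whitespace in a text key column."""
    return s.str.lower().str.strip().str.replace(r'\s+', ' ', regex=True)

def merge_on_normalized_keys(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Inner-join on normalized copies of 'artist' and 'song' (hypothetical helper)."""
    left = left.assign(artist_key=normalize_key(left['artist']),
                       song_key=normalize_key(left['song']))
    right = right.assign(artist_key=normalize_key(right['artist']),
                         song_key=normalize_key(right['song']))
    # Drop the right-hand display columns so they do not clash with the left ones.
    right = right.drop(columns=['artist', 'song'])
    return pd.merge(left, right, on=['artist_key', 'song_key'])

# Toy frames standing in for lyrics_data and meta_data (made-up values).
toy_lyrics = pd.DataFrame({'artist': ['Artist A', 'Artist B'],
                           'song': ['First Song', 'Second Song'],
                           'text': ['...', '...']})
toy_meta = pd.DataFrame({'artist': ['artist a ', 'Artist B'],
                         'song': ['first  song', 'Second Song'],
                         'tempo': [100.0, 120.0]})

exact = pd.merge(toy_lyrics, toy_meta, on=['artist', 'song'])
loose = merge_on_normalized_keys(toy_lyrics, toy_meta)
print('exact matches:', len(exact))
print('normalized matches:', len(loose))
```

On the toy frames the exact merge finds one song and the normalized merge finds both. Passing `how='left', indicator=True` to `pd.merge` is another way to list which lyrics rows fail to match.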


In [7]:
# Only merge1 is used below; merge2 has no 'track_genre' column.
concat = merge1
# concat = pd.concat([merge1, merge2])
# concat = concat.drop_duplicates(subset=['artist', 'song'])
# print('combined unique songs with lyrics and metadata:', concat.shape)
# concat

Narrowing down the fine-grained (and often meaningless, made-up) genre tags into a smaller set of real genres

In [8]:
mapping = {
    'hard-rock': 'rock',
    'psych-rock': 'rock',
    'j-rock': 'rock',
    'goth': 'rock',
    'alt-rock': 'rock',
    'german': 'rock',
    'synth-pop': 'pop',
    'power-pop': 'pop',
    'indie-pop': 'pop',
    'j-pop': 'pop',
    'swedish': 'pop',
    'british': 'pop',
    'piano': 'pop',
    'latin': 'pop',
    'electro': 'pop',
    'electronic': 'pop',
    'world-music': 'pop',
    'edm': 'pop',
    'grunge': 'metal',
    'death-metal': 'metal',
    'black-metal': 'metal',
    'metalcore': 'metal',
    'classical': 'metal',
    'hardcore': 'metal',
    'rockabilly': 'rock-n-roll',
    'r-n-b': 'rock-n-roll',
    'j-dance': 'dance',
    'garage': 'edm',
    'dancehall': 'reggae',
    'ska': 'reggae',
    'dub': 'reggae',
    'children': 'reggae',
    'bluegrass': 'folk',
    'punk-rock': 'punk',
    'alternative': 'punk',
    'emo': 'punk',
    'guitar': 'punk',
    'funk': 'blues',
    'singer-songwriter': 'blues',
    'honky-tonk': 'country'
}

def collapse_genres(genre):
    # Map a fine-grained tag to its broad genre; unmapped tags are kept unchanged.
    return mapping.get(genre, genre)

print('genres before collapsing:', len(concat['track_genre'].unique()))
concat['track_genre'] = concat['track_genre'].apply(collapse_genres)
print('genres after collapsing:', len(concat['track_genre'].unique()))

genres before collapsing: 59
genres after collapsing: 20
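
`collapse_genres` applies the mapping once, so a tag whose target is itself a key stops after the first step: `'garage'` becomes `'edm'`, while rows tagged `'edm'` become `'pop'`. If the intent is for such chains to collapse fully (an assumption about intent, not something the notebook states), a minimal sketch could follow the mapping until it reaches an unmapped genre. `resolve_genre` is a hypothetical name.

```python
def resolve_genre(genre: str, mapping: dict[str, str]) -> str:
    """Follow mapping entries until reaching a genre that is not itself remapped."""
    seen = set()
    while genre in mapping and genre not in seen:
        seen.add(genre)  # guard against accidental cycles such as a -> b -> a
        genre = mapping[genre]
    return genre

# A small excerpt of the mapping above, enough to show a two-step chain.
toy_mapping = {'garage': 'edm', 'edm': 'pop', 'hard-rock': 'rock'}

print(resolve_genre('garage', toy_mapping))     # follows garage -> edm -> pop
print(resolve_genre('hard-rock', toy_mapping))  # single step
print(resolve_genre('jazz', toy_mapping))       # unmapped genres pass through
```

Using this in place of the single lookup would change the genre counts printed above, so those numbers would need to be recomputed.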


Discard genres with too few songs (50 or fewer)

In [19]:
from collections import Counter

# Number of songs per (collapsed) genre.
count = Counter(concat['track_genre'])

# Keep only songs whose genre has more than 50 examples.
pruned = concat[concat['track_genre'].map(lambda g: count[g] > 50)]
print('shape:', pruned.shape)
pruned['track_genre'].unique()

shape: (831, 15)


array(['pop', 'rock', 'metal', 'blues', 'country'], dtype=object)

Not gonna lie, this is depressing: only 831 songs across 5 genres are left.

Let's chop up each song into verses.
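
The cell that follows splits on the literal separator `'\r\n  \r\n'`, i.e. a line holding exactly two spaces. As a small illustration on a toy string, here is a slightly more tolerant variant that accepts any whitespace-only blank line and also collapses repeated spaces. `split_verses` is a hypothetical helper and the toy lyrics are made up; this is a sketch, not the method used in this notebook.

```python
import re

def split_verses(lyrics: str) -> list[str]:
    """Split raw lyrics into verses on blank lines and flatten each verse to one line."""
    # A blank line may contain stray spaces, so match any whitespace-only line break.
    raw_verses = re.split(r'\r?\n\s*\r?\n', lyrics)
    # Collapse every run of whitespace (including line breaks) to a single space.
    verses = [' '.join(v.split()) for v in raw_verses]
    return [v for v in verses if v]  # drop empty fragments

toy = 'line one  \r\nline two  \r\n  \r\nline three  \r\nline four\r\n\r\n'
print(split_verses(toy))
```

The toy string yields two verses. Collapsing whitespace would also remove the runs of three spaces that a plain `replace('\r\n', ' ')` leaves behind when lines end in trailing spaces.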

In [22]:
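def clean_text(t: str) -> str:
    # Join the lines of a verse into a single line.
    return t.replace('\r\n', ' ').strip()

verse_data = pruned.copy()
# Verses are separated by a line containing only two spaces.
verse_data['new_text'] = verse_data['text'].str.split('\r\n  \r\n')
# One row per verse; the song-level columns are repeated for each verse.
verse_data = verse_data.drop(['text'], axis=1).explode('new_text').rename(columns={'new_text': 'text'})
verse_data['text'] = verse_data['text'].apply(clean_text)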
verse_data

| | artist | song | duration_ms | danceability | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | track_genre | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABBA | Andante, Andante | 278213 | 0.523 | 0.361 | 10 | -10.718 | 0.0238 | 0.6840 | 0.000348 | 0.0671 | 0.380 | 101.887 | pop | Take it easy with me, please Touch me gently... |
| 0 | ABBA | Andante, Andante | 278213 | 0.523 | 0.361 | 10 | -10.718 | 0.0238 | 0.6840 | 0.000348 | 0.0671 | 0.380 | 101.887 | pop | Make your fingers soft and light Let your bo... |
| 0 | ABBA | Andante, Andante | 278213 | 0.523 | 0.361 | 10 | -10.718 | 0.0238 | 0.6840 | 0.000348 | 0.0671 | 0.380 | 101.887 | pop | I'm your music (I am your music and I am you... |
| 0 | ABBA | Andante, Andante | 278213 | 0.523 | 0.361 | 10 | -10.718 | 0.0238 | 0.6840 | 0.000348 | 0.0671 | 0.380 | 101.887 | pop | There's a shimmer in your eyes Like the feel... |
| 0 | ABBA | Andante, Andante | 278213 | 0.523 | 0.361 | 10 | -10.718 | 0.0238 | 0.6840 | 0.000348 | 0.0671 | 0.380 | 101.887 | pop | I'm your music (I am your music and I am you... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1124 | Within Temptation | Stand My Ground | 267986 | 0.271 | 0.866 | 5 | -4.072 | 0.0578 | 0.0489 | 0.000762 | 0.1160 | 0.127 | 175.665 | rock | Stand my ground I won't give in No more de... |
| 1124 | Within Temptation | Stand My Ground | 267986 | 0.271 | 0.866 | 5 | -4.072 | 0.0578 | 0.0489 | 0.000762 | 0.1160 | 0.127 | 175.665 | rock | All I know for sure is that I'm trying I wil... |
| 1124 | Within Temptation | Stand My Ground | 267986 | 0.271 | 0.866 | 5 | -4.072 | 0.0578 | 0.0489 | 0.000762 | 0.1160 | 0.127 | 175.665 | rock | Stand my ground I won't give in, (I won't gi... |
| 1124 | Within Temptation | Stand My Ground | 267986 | 0.271 | 0.866 | 5 | -4.072 | 0.0578 | 0.0489 | 0.000762 | 0.1160 | 0.127 | 175.665 | rock | Stand my ground I won't give in No more de... |


I expected more

Take it easy with me, please   Touch me gently like a summer evening breeze   Take your time, make it slow   Andante, Andante   Just let the feeling grow
Make your fingers soft and light   Let your body be the velvet of the night   Touch my soul, you know how   Andante, Andante   Go slowly with me now
I'm your music   (I am your music and I am your song)   I'm your song   (I am your music and I am your song)   Play me time and time again and make me strong   (Play me again 'cause you're making me strong)   Make me sing, make me sound   (You make me sing and you make me)   Andante, Andante   Tread lightly on my ground   Andante, Andante   Oh please don't let me down
There's a shimmer in your eyes   Like the feeling of a thousand butterflies   Please don't talk, go on, play   Andante, Andante   And let me float away
I'm your music   (I am your music and I am your song)   I'm your song   (I am your music and I am your song)   Play me time and time again and make me strong   (Play me again