# Lyrics Classifier research notebook
This notebook explores classifying song attributes from their lyrics.

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data

Lyrics - [Kaggle](https://www.kaggle.com/datasets/notshrirang/spotify-million-song-dataset/)

A **big** dataset containing only the artist, song name, lyrics, and a `link` column, which we drop.

In [16]:
lyrics_data = pd.read_csv('lyrics_data.csv')
lyrics_data = lyrics_data.drop(columns=['link'])
lyrics_data.columns

Index(['artist', 'song', 'text'], dtype='object')

Meta dataset #1 - [Kaggle](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset)

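Detailed per-track information (audio features and genre).  
It needs cleaning: we drop duplicate tracks and unused columns, then rename the key columns to match the lyrics dataset.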

In [75]:
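meta_data = pd.read_csv('meta_data_1.csv')
# Keep one row per (track_name, artists) pair
meta_data = meta_data.drop_duplicates(subset=['track_name', 'artists'])
# Drop columns that are not used in this notebook
meta_data = meta_data.drop(columns=['track_id', 'album_name', 'time_signature', 'popularity', 'explicit', 'mode'])
# Drop the first column (presumably a row index saved with the CSV)
meta_data = meta_data.drop(meta_data.columns[0], axis=1)
# Rename the key columns so they match lyrics_data
meta_data = meta_data.rename(columns={'track_name': 'song', 'artists': 'artist'})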
meta_data.columns

Index(['artist', 'song', 'duration_ms', 'danceability', 'energy', 'key',
       'loudness', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'track_genre'],
      dtype='object')

Meta dataset #2 - [Kaggle](https://www.kaggle.com/datasets/salvatorerastelli/spotify-and-youtube)

Much the same features as the first meta dataset; hopefully it covers additional songs.

In [73]:
meta_data2 = pd.read_csv('meta_data_2.csv')
meta_data2 = (meta_data2.drop_duplicates(subset=['Track', 'Artist'])
              .drop(columns=['Url_spotify', 'Album', 'Album_type', 'Uri', 'Url_youtube', 'Channel', 'Views', 'Likes', 'Comments', 'Description', 'Licensed', 'official_video', 'Stream', 'Title'])
              .drop(meta_data2.columns[0], axis=1)
              .rename(columns={'Track': 'song'})
              )
meta_data2.columns = map(str.lower, meta_data2.columns)
meta_data2.columns

Index(['artist', 'song', 'danceability', 'energy', 'key', 'loudness',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms'],
      dtype='object')
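
To check how similar the two metadata sets really are, the column lists printed above can be compared directly. A small sketch, with the column names copied from the two outputs (the variable names `cols_meta1` and `cols_meta2` are made up for illustration):

```python
# Column names copied from the two `.columns` outputs above.
cols_meta1 = ['artist', 'song', 'duration_ms', 'danceability', 'energy', 'key',
              'loudness', 'speechiness', 'acousticness', 'instrumentalness',
              'liveness', 'valence', 'tempo', 'track_genre']
cols_meta2 = ['artist', 'song', 'danceability', 'energy', 'key', 'loudness',
              'speechiness', 'acousticness', 'instrumentalness', 'liveness',
              'valence', 'tempo', 'duration_ms']

# Columns present in one metadata set but not the other.
only_in_meta1 = sorted(set(cols_meta1) - set(cols_meta2))
only_in_meta2 = sorted(set(cols_meta2) - set(cols_meta1))

print('only in meta_data: ', only_in_meta1)
print('only in meta_data2:', only_in_meta2)
```

The only difference is `track_genre`, which exists in the first metadata set but not the second, so any rows that come from the second set will have no genre.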

## Combining datasets

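We have three separate datasets: one with lyrics and two with track metadata.  
Now we join the lyrics to each metadata set on the `artist` and `song` columns. `pd.merge` does an inner join by default, so only exact matches on both columns are kept.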

In [80]:
merge1 = pd.merge(lyrics_data, meta_data, on=['artist', 'song'])
merge2 = pd.merge(lyrics_data, meta_data2, on=['artist', 'song'])
print('merge1 size:', merge1.shape)
print('merge2 size:', merge2.shape)

merge1 size: (1127, 15)
merge2 size: (1037, 14)
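
Both joins keep only about a thousand songs, since `pd.merge` matches the `artist` and `song` strings exactly. One way to see how many lyrics rows fail to match, and whether casing or stray spaces are to blame, is a left join with `indicator=True` on normalized keys. The sketch below uses toy frames; `normalize_key`, `merge_report`, and the `*_key` columns are hypothetical helpers that are not part of this notebook, and whether normalization recovers more matches on the real data is untested.

```python
import pandas as pd


def normalize_key(series: pd.Series) -> pd.Series:
    """Lowercase, trim, and collapse repeated whitespace in a key column."""
    return (series.astype(str)
                  .str.lower()
                  .str.strip()
                  .str.replace(r"\s+", " ", regex=True))


def merge_report(left: pd.DataFrame, right: pd.DataFrame,
                 keys=("artist", "song")) -> pd.DataFrame:
    """Left-join on normalized keys and flag each row's match status in `_merge`."""
    left, right = left.copy(), right.copy()
    key_cols = []
    for k in keys:
        left[k + "_key"] = normalize_key(left[k])
        right[k + "_key"] = normalize_key(right[k])
        key_cols.append(k + "_key")
    # Drop the raw key columns on the right to avoid artist_x / artist_y clashes.
    right = right.drop(columns=list(keys))
    return left.merge(right, on=key_cols, how="left", indicator=True)


# Toy stand-ins for lyrics_data and meta_data (illustrative rows only).
lyrics_demo = pd.DataFrame({
    "artist": ["ABBA", "ABBA", "Yelawolf"],
    "song": ["Dancing Queen", "Fernando", "Pop The Trunk"],
    "text": ["...", "...", "..."],
})
meta_demo = pd.DataFrame({
    "artist": ["abba ", "ABBA", "Ying Yang Twins"],
    "song": ["Dancing  Queen", "Fernando", "Shake"],
    "tempo": [100.804, 110.821, 117.994],
})

report = merge_report(lyrics_demo, meta_demo)
match_counts = report["_merge"].value_counts()
print(match_counts)
```

Rows flagged `left_only` are lyrics with no metadata match; inspecting a sample of them is usually the quickest way to decide which further normalization, if any, is worth doing.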


In [83]:
concat = pd.concat([merge1, merge2])
concat = concat.drop_duplicates(subset=['artist', 'song'])
print('combined unique songs with lyrics and metadata:', concat.shape)
concat

combined unique songs with lyrics and metadata: (1753, 15)
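
`pd.concat` puts the `merge1` rows first and `drop_duplicates` keeps the first occurrence by default, so when a song appears in both merges the `merge1` row (the one with `track_genre`) wins; 2164 - 1753 = 411 rows are dropped this way. Songs found only in `merge2` get `NaN` for `track_genre`. A toy sketch of that behaviour, with made-up frames `merge1_demo` and `merge2_demo`, and with `ignore_index=True` added (not in the cell above) so row labels are not repeated:

```python
import pandas as pd

# Toy stand-ins for merge1 (has track_genre) and merge2 (does not).
merge1_demo = pd.DataFrame({
    "artist": ["ABBA"],
    "song": ["Fernando"],
    "track_genre": ["swedish"],
})
merge2_demo = pd.DataFrame({
    "artist": ["ABBA", "Yelawolf"],
    "song": ["Fernando", "Pop The Trunk"],
})

combined = (
    pd.concat([merge1_demo, merge2_demo], ignore_index=True)
      # keep="first" is the default; spelled out to show merge1 rows take priority
      .drop_duplicates(subset=["artist", "song"], keep="first")
      .reset_index(drop=True)
)
print(combined)
```

Reversing the order in `pd.concat` would instead keep the `merge2` version of shared songs and lose their genre labels.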


| | artist | song | text | duration_ms | danceability | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | track_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABBA | Andante, Andante | Take it easy with me, please \r\nTouch me gen... | 278213.0 | 0.523 | 0.361 | 10.0 | -10.718 | 0.0238 | 0.684000 | 0.000348 | 0.0671 | 0.380 | 101.887 | swedish |
| 1 | ABBA | Chiquitita | Chiquitita, tell me what's wrong \r\nYou're e... | 326320.0 | 0.500 | 0.554 | 9.0 | -8.108 | 0.0354 | 0.734000 | 0.000004 | 0.3120 | 0.372 | 84.229 | swedish |
| 2 | ABBA | Dancing Queen | You can dance, you can jive, having the time o... | 230400.0 | 0.543 | 0.870 | 9.0 | -6.514 | 0.0428 | 0.358000 | 0.000939 | 0.7920 | 0.754 | 100.804 | swedish |
| 3 | ABBA | Does Your Mother Know | You're so hot, teasing me \r\nSo you're blue ... | 193560.0 | 0.703 | 0.824 | 7.0 | -5.170 | 0.0364 | 0.050400 | 0.000092 | 0.0970 | 0.971 | 135.240 | swedish |
| 4 | ABBA | Fernando | Can you hear the drums Fernando? \r\nI rememb... | 252960.0 | 0.354 | 0.535 | 9.0 | -8.876 | 0.0303 | 0.627000 | 0.000003 | 0.0808 | 0.434 | 110.821 | swedish |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1032 | Yelawolf | Daddy's Lambo | You really in Beverly hills \r\nAnd so Drama ... | 227573.0 | 0.779 | 0.725 | 11.0 | -7.128 | 0.1450 | 0.025900 | 0.000000 | 0.0946 | 0.335 | 131.997 | |
| 1033 | Yelawolf | Pop The Trunk | Meth lab in the back and the crack smoke pills... | 228160.0 | 0.912 | 0.673 | 2.0 | -8.910 | 0.1090 | 0.099500 | 0.000000 | 0.1030 | 0.123 | 120.157 | |
| 1034 | Ying Yang Twins | Naggin' | Every now and then you get mad \r\nSometime I... | 263227.0 | 0.708 | 0.854 | 1.0 | -5.724 | 0.0322 | 0.000682 | 0.000000 | 0.2780 | 0.914 | 100.005 | |
| 1035 | Ying Yang Twins | Shake | Shake, shake, just shake, shake, just shake, s... | 241587.0 | 0.918 | 0.681 | 1.0 | -7.399 | 0.3090 | 0.337000 | 0.000000 | 0.6440 | 0.843 | 117.994 | |
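
The `text` column above still carries raw `\r\n` line breaks with trailing spaces. Before feeding lyrics to a classifier it may help to normalize them. A minimal sketch on a made-up string; `clean_lyrics` is a hypothetical helper, not something this notebook defines, and it assumes stanza breaks (blank lines) do not need to be preserved:

```python
import pandas as pd


def clean_lyrics(text: pd.Series) -> pd.Series:
    """Turn each run of line breaks (plus surrounding whitespace) into one '\\n'."""
    return (text.str.replace(r"\s*\n\s*", "\n", regex=True)  # also collapses blank lines
                .str.strip())


# Made-up example with the same kind of line endings as the real data.
demo = pd.Series(["first line  \r\nsecond line \r\n\r\n"])
cleaned = clean_lyrics(demo)
print(repr(cleaned.iloc[0]))
```

On the real data this would be applied as `concat['text'] = clean_lyrics(concat['text'])`.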
