# Data exploration

Following the experiments in helgi-03-hp-search-first-pass.ipynb (where we performed no filtering), this notebook explores the data further.

In [1]:
import numpy as np
import pandas as pd

import sys
sys.path.append("../")

import libs.visual

## Load data

In [2]:
df = pd.read_csv('E:\\Repos\\comp550-final-project\\data\\scraped-lyrics-v1.csv')
df

| Unnamed: 0 | artist | song | lyrics | genre |
| --- | --- | --- | --- | --- |
| 0 | Disperse | Tether | Warm-hearted\nDirections\nWe're off the map\nI... | Alternative Rock |
| 1 | Disperse | Foreword | [Instrumental] | Alternative Rock |
| 2 | Disperse | Touching The Golden Cloud | Hear in this garden\nHear in this space\nImmer... | Alternative Rock |
| 3 | Disperse | Neon | Hello dear stranger\nI've got so much to tell ... | Alternative Rock |
| 4 | Disperse | Kites | Still a headache\nFrom last night\nIt was vali... | Alternative Rock |
| ... | ... | ... | ... | ... |
| 79995 | Flyleaf | Broken Wings | Thank you for being such a friend to me\nOh, I... | Rock |
| 79996 | Flyleaf | Swept Away | The evil fell from your pretty mouth\nWrapped ... | Rock |
| 79997 | Flyleaf | Call You Out | How can you act like you know\nWhen all you kn... | Rock |
| 79998 | Flyleaf | Beautiful Bride | Unified diversity\nFunctioning as one body\nEv... | Rock |


Songs per genre available:

In [3]:
df.groupby('genre').count().lyrics

genre
Alternative Rock    8000
Country             8000
Hard Rock           8000
Heavy Metal         8000
Hip-Hop             8000
Indie               8000
Pop                 8000
R&B                 8000
Rock                8000
Soul                8000
Name: lyrics, dtype: int64

In [4]:
df_v2 = df.groupby(['artist', 'song']).first().reset_index()
unique_artist_song_lyrics = df_v2.groupby('genre').count().lyrics
unique_artist_song_lyrics

genre
Alternative Rock    7039
Country             7503
Hard Rock           6547
Heavy Metal         6437
Hip-Hop             7010
Indie               5685
Pop                 5190
R&B                 3083
Rock                4277
Soul                7027
Name: lyrics, dtype: int64

In [5]:
print(f'When we only keep one copy of each song (remove duplicates), we see that we end up with {sum(unique_artist_song_lyrics)} songs.')

When we only keep one copy of each song (remove duplicates), we see that we end up with 59798 songs.
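
The drop from 80,000 rows to 59,798 unique (artist, song) pairs means 20,202 rows repeat a song that is already present. A minimal sketch of one way to check which genres each song was filed under, using a small made-up frame in place of `df`. The toy rows and the column names `genres` and `n_genres` are assumptions for illustration, not part of the dataset:

```python
import pandas as pd

# Toy stand-in for `df`: the second song is listed under two genres.
toy = pd.DataFrame({
    "artist": ["A", "B", "B", "C"],
    "song":   ["s1", "s2", "s2", "s3"],
    "lyrics": ["la la", "na na", "na na", "da da"],
    "genre":  ["Rock", "Pop", "R&B", "Soul"],
})

# One row per (artist, song): keep the first lyrics, join all distinct genres,
# and count how many distinct genres the song was assigned to.
merged = (
    toy.groupby(["artist", "song"], as_index=False)
       .agg(
           lyrics=("lyrics", "first"),
           genres=("genre", lambda g: "; ".join(sorted(set(g)))),
           n_genres=("genre", "nunique"),
       )
)

# Songs that appear under more than one genre.
multi_genre = merged[merged["n_genres"] > 1]
print(multi_genre)
```

Rows with `n_genres > 1` are the songs that a plain `first()` would silently assign to a single genre.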


As we can see, many rows are duplicates: the same song was assigned to several genres on Vagalume. We should extract all of the genres associated with each song and create a v2 of the dataset that more accurately represents which genres each song belongs to. See **helgi-05-scraped-lyrics-v2-preprocessing.ipynb**.

Finally, we note that some songs are tabs (instrument tablature) rather than lyrics:

In [6]:
# Raw string, so the regex escapes are not read as (invalid) Python string escapes
tab_lyrics = df[df.lyrics.str.contains(r'.*[\|:][\-]+[0-9a-zA-Z]+[\-]+.*', regex=True)]
print(f'There are at least {len(tab_lyrics)} songs that are tabs')

There are at least 32 songs that are tabs
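
To make the pattern concrete, here is a small stand-alone check with Python's `re` module. The tab line is taken from the sample printed later in this notebook, and the helper name `looks_like_tab` is hypothetical. The pattern is the one used in the cell above without the leading and trailing `.*`, which do not change what a search matches:

```python
import re

# A bar or colon, one or more dashes, an alphanumeric run (for example a fret
# number), then one or more dashes.
TAB_PATTERN = re.compile(r"[|:]-+[0-9a-zA-Z]+-+")

def looks_like_tab(text: str) -> bool:
    """Return True if any part of `text` matches the tab pattern."""
    return TAB_PATTERN.search(text) is not None

tab_line = "|------------|-2--------5----0--2-------"
dash_only_line = "|------------|-------------------------------|"
lyric_line = "Hello dear stranger\nI've got so much to tell"

print(looks_like_tab(tab_line))        # True
print(looks_like_tab(dash_only_line))  # False: nothing alphanumeric between dashes
print(looks_like_tab(lyric_line))      # False
```

A line made only of bars and dashes does not match, which is one reason the count above is a lower bound.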


## Identify outliers and invalid data

Lyrics may be missing or in an invalid format (tabs instead of lyrics). Let's identify these entries.

Distribution of lyric lengths:

In [7]:
lyrics_lengths = df['lyrics'].str.len()
lyrics_lengths.describe()

count    80000.000000
mean      1295.329463
std        811.373017
min          7.000000
25%        777.000000
50%       1112.000000
75%       1588.000000
max      33531.000000
Name: lyrics, dtype: float64

As we can tell, some lyric entries are very short (7 characters) while others are very long (33531 characters). Let's explore them:

In [8]:
libs.visual.analyse_lyrics(dataframe=df, lyrics_length=70, mode='less', n_samples=25, random_state=1234)

There are 749 songs with lyrics of 70 characters or less.
Here are 25 samples:

<index: 55544>
Let me speak
Let me speak
Let me speak
Let me spea <... 8 more chars>

<index: 22349>
Instrumental

<index: 73416>
[Instrumental]

<index: 55541>
(Instrumental)

<index: 75593>
All you have to do
All you have to do
Ahhhhhhh Ahh <... 3 more chars>

<index: 17025>
instrumental

<index: 17005>
[Instrumental]

<index: 40478>
I believe in dreams
Dream everything

<index: 17620>
Instrumental

<index: 16760>
Instrumental

<index: 34762>
Instrumental

<index: 78947>
Instrumental

<index: 36885>
Instrumental

<index: 8244>
Instrumental

<index: 79758>
Everywhere I go
Swallowed up inside

<index: 7002>
Instrumental

<index: 27848>
Instrumental

<index: 3778>
(Instrumental song)

<index: 51848>
[Instrumental]

<index: 19288>
(Instrumental)
Go on cry
Go on cry
Go on cry cry c <... 6 more chars>

<index: 29529>
Instrumental

<index: 37668>
[Instrumental]

<index: 48638>
Instrumental

<index: 33888>
[Instr

Informally, we can see that 21 of those 25 samples (84%) don't have lyrics specified. Let's note that for the future.
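
A rough sketch of how such placeholder entries might be flagged automatically. The pattern is an assumption based only on the variants visible in this sample (`Instrumental`, `[Instrumental]`, `(Instrumental song)`). Other spellings in the dataset would need their own rules, and the helper name is hypothetical:

```python
import re

# Matches text that consists only of the word "instrumental" (optionally
# followed by "song"), ignoring case and surrounding brackets or punctuation.
PLACEHOLDER = re.compile(r"^\W*instrumental(\s+song)?\W*$", re.IGNORECASE)

def is_instrumental_placeholder(lyrics: str) -> bool:
    """Return True if `lyrics` is just an 'instrumental' marker."""
    return PLACEHOLDER.match(lyrics.strip()) is not None

samples = [
    "Instrumental",
    "[Instrumental]",
    "(Instrumental song)",
    "instrumental",
    "(Instrumental)\nGo on cry\nGo on cry",   # marker followed by real lyrics
    "I believe in dreams\nDream everything",  # short but real lyrics
]
flags = [is_instrumental_placeholder(s) for s in samples]
print(flags)
```

On the DataFrame, the same check could be applied with `df['lyrics'].map(is_instrumental_placeholder)`. Entries that mix a marker with actual lyrics are deliberately left unflagged.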

Next, let's look at the longest 1% of lyrics:

In [9]:
longest_1_perc = int(lyrics_lengths.quantile(0.99))
libs.visual.analyse_lyrics(dataframe=df, lyrics_length=longest_1_perc, mode='more', n_samples=25, random_state=1234, max_print=100)

There are 804 songs with lyrics of 3990 characters or more.
Here are 25 samples:

<index: 44083>
I love you
Turn my headphone down a little bit, yeah
For so many reasons
Yeah, yeah, yeah, yeah, yea <... 4922 more chars>

<index: 43357>
Hit-Boy
G. Ry got me
Look
Fuck rap, I'm a street legend, block love me with a deep reverence
I was b <... 3962 more chars>

<index: 43860>
[Intro: Future & DJ Khaled]
Hide the money and ran outta room, yeah
I tried to hide the money and ra <... 4139 more chars>

<index: 47834>
[Wiz Khalifa]
Yea, Uh huh, you know what it is
Black and yellow
Black and yellow
Black and yellow
Bl <... 4403 more chars>

<index: 45969>
[Ludacris - Verse One]
I be that nigga named Ludi
a k a L-O-V-A L-O-V-A
Fuck that shit
Nigga what yo <... 4520 more chars>

<index: 40377>
Yeah! It's my tape man, listen to my tape
WOO!!!
I've waited I've waited
Time went by
All I could do <... 4225 more chars>

<index: 43656>
[Chamillionaire]
I'm like the baddest rapper ever Ever ever? Ever!
I 

Informally, these look fine, but we could experiment with an upper bound on the number of characters considered per song, so as not to produce overly busy embeddings.
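
A minimal sketch of such an upper bound, cutting at the last line break before the limit so that no line is split in the middle. The function name and the choice to cut at a line break are assumptions for illustration. The 3990-character default simply reuses the 99th-percentile value from the output above:

```python
def truncate_lyrics(lyrics: str, max_chars: int = 3990) -> str:
    """Return at most `max_chars` characters, preferring to cut at a newline."""
    if len(lyrics) <= max_chars:
        return lyrics
    head = lyrics[:max_chars]
    cut = head.rfind("\n")
    # Fall back to a hard cut if there is no usable line break before the limit.
    return head[:cut] if cut > 0 else head

example = "line one\nline two\nline three"
print(repr(truncate_lyrics(example, max_chars=20)))
```

Whether a hard character limit or a limit on lines or tokens works better would have to be checked against the embedding model actually used.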

Interestingly, most of these very long lyrics are Hip-Hop (17 of the 25 sampled below):

In [10]:
candidates = lyrics_lengths[lyrics_lengths > longest_1_perc]
samples = candidates.sample(25, random_state=1234)
# samples.index holds index labels, so select by label with .loc rather than by position
df.loc[samples.index, 'genre']

42411        Hip-Hop
40239        Hip-Hop
47891        Hip-Hop
38198    Heavy Metal
46377        Hip-Hop
57919            Pop
40528        Hip-Hop
40795        Hip-Hop
31592           Soul
69876            R&B
66482            R&B
43719        Hip-Hop
41495        Hip-Hop
68588            R&B
47566        Hip-Hop
40382        Hip-Hop
46009        Hip-Hop
43794        Hip-Hop
42665        Hip-Hop
46301        Hip-Hop
43075        Hip-Hop
61315            Pop
68463            R&B
45747        Hip-Hop
44096        Hip-Hop
Name: genre, dtype: object
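
Tallying the 25 sampled genres printed above makes the skew explicit. The list below is copied from that output. `Counter` is used only so the snippet does not depend on the DataFrame:

```python
from collections import Counter

# Genres of the 25 sampled long lyrics, in the order printed above.
sampled_genres = (
    ["Hip-Hop"] * 3 + ["Heavy Metal", "Hip-Hop", "Pop"] + ["Hip-Hop"] * 2
    + ["Soul", "R&B", "R&B"] + ["Hip-Hop"] * 2 + ["R&B"] + ["Hip-Hop"] * 7
    + ["Pop", "R&B"] + ["Hip-Hop"] * 2
)

counts = Counter(sampled_genres)
for genre, n in counts.most_common():
    # Genre name, absolute count, and share of the sample.
    print(f"{genre:<12} {n:>2}  ({n / len(sampled_genres):.0%})")
```

On the full frame, `df.loc[candidates.index, 'genre'].value_counts(normalize=True)` would give the same breakdown for all long lyrics rather than for a sample of 25.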

Finally, let's look at the top 100 longest entries:

In [11]:
lyrics_lengths.nlargest(100)

23832    33531
72425    33531
23981    17006
76479    17006
42945    13408
         ...  
68588     5818
7991      5808
40577     5787
7901      5761
73245     5761
Name: lyrics, Length: 100, dtype: int64

In [12]:
libs.visual.analyse_lyrics(dataframe=df, lyrics_length=5760, mode='more', n_samples=100, random_state=1234, max_print=100)

There are 100 songs with lyrics of 5760 characters or more.
Here are 100 samples:

<index: 43588>
[Royce Da 5'9"]
First verse, uh, I'm on 'til I'm on a island
My life's ridin' on the autobahn on aut <... 6173 more chars>

<index: 43195>
One Blood - (feat. Jim Jones, Snoop Dogg, Nas, T.I., Fat Joe, lil WayneNORE, Jadakiss, Styles P, Fab <... 10859 more chars>

<index: 68523>
[Turk]
Come on, come on
Come on, come on, come on, come on, come on
I roll with a bunch of untamed g <... 5858 more chars>

<index: 46505>
At these up late times, hardcore funkateers before the bop gun.
We unleash you a positive light. The <... 6752 more chars>

<index: 76479>
Intro
|------------|-------------------------------|-----------------------------|
|------------|--- <... 16906 more chars>

<index: 58008>
I was born inside a small town
I've lost that state of mind
Learned to sing inside the Lord's house
 <... 5855 more chars>

<index: 68615>
Baby understand me now
If sometimes you see that I'm mad
Don't you

In the output above, the entries at the following indices are tabs rather than lyrics, so we can remove them: 76479, 7991, 13345, 13327, 23981, 78729 and 72425. Example:

In [13]:
df.iloc[76479].lyrics[:200]

'Intro\n|------------|-------------------------------|-----------------------------|\n|------------|-------------------------------|-----------------------------|\n|------------|-2--------5----0--2-------'

## Conclusion

* We should consider removing songs with very short lyrics. With a threshold of 70 characters or fewer, 749 songs are affected, and in a random sample of 25 of them about 84% were instrumentals or had no lyrics specified.
* Some songs are tabs rather than lyrics; should we process these separately?
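
The two points above could translate into a filtering step along these lines. This is only a sketch on a toy frame: the 70-character threshold is the one explored above, while the function name, the toy rows, and the choice to drop tabs instead of processing them separately are assumptions:

```python
import pandas as pd

def filter_lyrics(frame: pd.DataFrame, min_chars: int = 70, drop_index=()) -> pd.DataFrame:
    """Drop manually flagged rows, then rows with `min_chars` characters or fewer."""
    kept = frame.drop(index=list(drop_index), errors="ignore")
    return kept[kept["lyrics"].str.len() > min_chars]

toy = pd.DataFrame({
    "lyrics": ["Instrumental", "word " * 40, "|-2---5---0-|\n" * 10],
    "genre": ["Rock", "Pop", "Rock"],
})

# Index 2 plays the role of a row identified as a tab by manual inspection.
clean = filter_lyrics(toy, min_chars=70, drop_index=[2])
print(clean.index.tolist())
```

In the notebook, `drop_index` would take the tab indices listed earlier, and the result could be compared against the unfiltered data before committing to the threshold.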