# Data exploration

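* This notebook explores the dataset `scraped-lyrics-v2.csv`: it removes duplicate and near-empty songs, restricts the genre labels to the most common ones, and saves the result.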

In [20]:
import ast

import numpy as np
import pandas as pd

import sys
sys.path.append("../")

import libs.visual

In [21]:
# Relative path, consistent with the output path used when saving the result
df = pd.read_csv('../data/scraped-lyrics-v2.csv')
df

Unnamed: 0,artist,song,lyrics,genres,category
0,The War On Drugs,Pain,Go to bed now I can tell\nPain is on the way o...,['Rock Alternativo' 'Indie' 'Folk'],Rock Alternativo
1,The War On Drugs,Nothing To Find,"Oh, I'm rising from within\nI see it every mor...",['Rock Alternativo' 'Indie' 'Folk'],Rock Alternativo
2,The War On Drugs,Thinking of a Place,It was back in Little Bend that I saw you\nLig...,['Rock Alternativo' 'Indie' 'Folk'],Rock Alternativo
3,The War On Drugs,Under The Pressure,"With a comb down here, it's easy\nBut do you r...",['Rock Alternativo' 'Indie' 'Folk'],Rock Alternativo
4,The War On Drugs,Under The Pressure,"With a comb down here, it's easy\nBut do you r...",['Rock Alternativo' 'Indie' 'Folk'],Rock Alternativo
...,...,...,...,...,...
79872,James Morrison,So Beautiful,"17, way too young for love\nBut I couldn't thi...",['Pop/Rock' 'Romântico' 'Pop'],Soul Music
79873,James Morrison,Lonely People,When you feel alone and need someone\nDo you t...,['Pop/Rock' 'Romântico' 'Pop'],Soul Music
79874,James Morrison,Slave To The Music,She pulled me in so easily\nRight from the sta...,['Pop/Rock' 'Romântico' 'Pop'],Soul Music
79875,James Morrison,Forever,Well my mama used to say\nIf you find a nice g...,['Pop/Rock' 'Romântico' 'Pop'],Soul Music


In [22]:
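# `genres` is stored as a string of space-separated quoted names (e.g. "['Rock' 'Indie']"),
# so insert the missing commas and parse each string into a Python list
df.genres = df.genres.apply(lambda x: ast.literal_eval(x.replace("' ", "', ")))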
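
To make the conversion concrete, here is a minimal sketch of what the lambda does to a single raw value. The sample string is copied from the first row of the table above; `parse_genres` is a hypothetical helper name used only for this illustration. The approach assumes that the closing quote of every genre but the last is followed by a space, and that no genre name itself contains a quote followed by a space.

```python
import ast


def parse_genres(raw):
    """Turn "['A' 'B']" into ['A', 'B'] by inserting the missing commas."""
    return ast.literal_eval(raw.replace("' ", "', "))


sample = "['Rock Alternativo' 'Indie' 'Folk']"
parsed = parse_genres(sample)
print(parsed)  # ['Rock Alternativo', 'Indie', 'Folk']
```

A single-genre value such as `"['Rock']"` contains no quote followed by a space, so it passes through `replace` unchanged and still parses to a one-element list.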

In [23]:
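# Remove duplicates: keep one row per (artist, song) pair.
# Note: first() takes the first non-null value of each column within a group,
# and groupby sorts the result by artist and song.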
df = df.groupby(['artist', 'song']).first().reset_index()
print(f'Remaining songs: {len(df)}')

Remaining songs: 59208
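
As a side note on the deduplication step, here is a small sketch on made-up data (the `toy` frame is hypothetical, not taken from the dataset) comparing `groupby(...).first()` with `drop_duplicates`. The two agree when duplicate rows are identical, but `first()` skips missing values column by column, so it can fill a gap from a later duplicate.

```python
import pandas as pd

# Hypothetical data: the same song appears twice, once with missing lyrics
toy = pd.DataFrame({
    'artist': ['A', 'A', 'B'],
    'song': ['X', 'X', 'Y'],
    'lyrics': [None, 'la la', 'doo'],
})

# Approach used in this notebook: first non-null value per column and group
by_group = toy.groupby(['artist', 'song']).first().reset_index()

# Alternative: keep the first row of each (artist, song) pair as-is
by_drop = toy.drop_duplicates(subset=['artist', 'song']).reset_index(drop=True)

print(by_group)
print(by_drop)
```

Here `by_group` keeps `'la la'` for song X, while `by_drop` keeps the row with the missing lyrics. Which behaviour is preferable depends on how the duplicates arose during scraping.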


In [24]:
libs.visual.analyse_lyrics(dataframe=df, n_samples=25, lyrics_length=40, mode='less', random_state=1234)

There are 483 songs with lyrics of 40 characters or less.
Here are 25 samples:

<index: 32873>
Instrumental

<index: 56740>
Instrumental

<index: 40787>
Instrumental

<index: 37100>
Instrumental

<index: 45991>
We're Thirsty
We are thirsty

<index: 31462>
Instrumental

<index: 40982>
[Instrumental]

<index: 27520>
Instrumental

<index: 3513>
Instrumental

<index: 30187>
Instrumental

<index: 44098>
[This song is an instrumental.]

<index: 1730>
Instrumental

<index: 7296>
Instrumental

<index: 28066>
Instrumental

<index: 14097>
Instrumental

<index: 6405>
[Instrumental]

<index: 40422>
Instrumental

<index: 36817>
Instrumental

<index: 20893>
Instrumental

<index: 40373>
Instrumental

<index: 48189>
Instrumental

<index: 31370>
instrumental

<index: 33684>
Instrumental

<index: 3511>
Instrumental

<index: 7410>
Instrumental



Removing songs with lyrics of 40 characters or less seems reasonable: 24 of the 25 samples above are instrumental placeholders rather than actual lyrics.

In [25]:
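# Keep songs with more than 40 characters of lyrics.
# Rows with missing lyrics are dropped as well, because NaN > 40 evaluates to False.
df = df[df.lyrics.str.len() > 40]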
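
The filter also affects rows without any lyrics, which is easy to overlook. The sketch below uses a made-up Series (not dataset rows) to show how the mask behaves for a short placeholder, a missing value, and a long enough text.

```python
import pandas as pd

# Hypothetical lyrics: a placeholder, a missing value, and 41 characters of text
lyrics = pd.Series(['Instrumental', None, 'x' * 41])

lengths = lyrics.str.len()   # the missing value yields NaN
mask = lengths > 40          # NaN > 40 is False, so the missing row is dropped

print(mask.tolist())  # [False, False, True]
```

This may also be why the row count drops by slightly more than 483: 59208 - 483 = 58725, while the summary further down reports 58723 songs. That explanation is only a guess, since it depends on how `analyse_lyrics` counts missing lyrics.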

In [26]:
# Count how many songs are tagged with each genre
unique, count = np.unique([genre for sublist in df.genres for genre in sublist], return_counts=True)
genre_hist_df = pd.DataFrame({
    'genre': unique,
    'count': count
}).sort_values(by='count', ascending=False).reset_index(drop=True)

pd.options.display.max_rows = 70
genre_hist_df

Unnamed: 0,genre,count
0,Rock,22310
1,Indie,10155
2,Hard Rock,9712
3,R&B,9494
4,Pop,9403
5,Heavy Metal,8890
6,Hip Hop,8842
7,Rock Alternativo,8834
8,Country,7798
9,Black Music,7192
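
For comparison, the same per-genre count can be sketched in pure pandas with `explode` and `value_counts`. This is an alternative formulation on made-up data, not the code that produced the table above.

```python
import pandas as pd

# Hypothetical songs, each with a list of genres
toy = pd.DataFrame({'genres': [['Rock', 'Indie'], ['Rock'], ['Pop', 'Rock']]})

toy_counts = (
    toy.genres
    .explode()                      # one row per (song, genre) pair
    .value_counts()                 # songs per genre, most frequent first
    .rename_axis('genre')
    .reset_index(name='count')
)
print(toy_counts)
```

Because each genre appears at most once per song here, counting exploded rows is the same as counting songs per genre.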


In [27]:
# A genre is "common" if more than `frequency_constraint` songs are tagged with it
frequency_constraint = 1000
common_genres = genre_hist_df[genre_hist_df['count'] > frequency_constraint].genre

# Songs that have at least one genre in common_genres.
# The positional form any(1) is no longer accepted by recent pandas versions; use axis=1.
df_common_genre_songs = df[pd.DataFrame(df.genres.tolist()).isin(set(common_genres)).any(axis=1).values]
qualifying_song_indices = df.index.isin(df_common_genre_songs.index)

# Songs that have no genre among the common genres:
non_qualifying_songs = df[~qualifying_song_indices]
print('Songs that are not part of the "common genres":')
non_qualifying_songs

Songs that are not part of the "common genres":


Unnamed: 0,artist,song,lyrics,genres,category
1170,Air,How Does It Make You Feel?,I am feeling very warm right now\nPlease don't...,"[Chillout, Instrumental, Trip-Hop]",Indie
1172,Air,Seven Stars,How long it will take you\nTo reach the stars?...,"[Chillout, Instrumental, Trip-Hop]",Indie
1173,Air,Somewhere Between Waking And Sleeping,"Without blindness, there is no sight\nYou'd se...","[Chillout, Instrumental, Trip-Hop]",Indie
1174,Air,Who Am I Now?,"What do I know, where should I go\nTelling me ...","[Chillout, Instrumental, Trip-Hop]",Indie
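
The membership test in the cell above is fairly dense, so here is a minimal sketch of it on made-up data. The `toy` frame and the `common` set are hypothetical; only the pandas calls mirror the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'song': ['s1', 's2', 's3'],
    'genres': [['Rock', 'Indie'], ['Chillout', 'Trip-Hop'], ['Pop']],
})
common = {'Rock', 'Indie', 'Pop'}

# One column per genre position; shorter lists are padded with missing values
padded = pd.DataFrame(toy.genres.tolist())

# True for songs where at least one position holds a common genre
mask = padded.isin(common).any(axis=1).values

kept = toy[mask]
dropped = toy[~mask]

# Equivalent check without the intermediate frame
alt_mask = toy.genres.apply(lambda g: not common.isdisjoint(g)).values
print(mask.tolist(), alt_mask.tolist())
```

The `.values` call matters in the notebook: `padded` gets a fresh 0..n-1 index, whereas `df` keeps its original index after filtering, so the mask has to be applied by position rather than by label.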


Chillout, Instrumental and Trip-Hop are not among the common genres, so these four songs have no common genre and are dropped. Finally, let's remove any uncommon genres from the `genres` column:

In [28]:
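df_processed = df_common_genre_songs.copy()
# Keep only the common genres of each song, as a sorted list
common_genre_set = set(common_genres)
df_processed.genres = df_common_genre_songs.genres.apply(lambda x: sorted(common_genre_set.intersection(x)))
# Number of (common) genres per song
genre_counts = df_processed.genres.str.len()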
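
As an illustration of this last step, the sketch below applies the same intersection to made-up genre lists and computes the share of songs left with two or more genres. The names `toy`, `common` and `share_multi` are hypothetical.

```python
import pandas as pd

common = {'Rock', 'Indie', 'Pop'}
toy = pd.DataFrame({
    'genres': [['Rock', 'Indie', 'Shoegaze'], ['Pop', 'Chillout'], ['Indie', 'Pop', 'Rock']],
})

# Drop uncommon genres and sort the remainder for a stable label order
toy['genres'] = toy.genres.apply(lambda g: sorted(common.intersection(g)))

n_genres = toy.genres.str.len()       # number of genres per song
share_multi = (n_genres > 1).mean()   # fraction of songs with 2+ genres

print(toy.genres.tolist())
print(n_genres.tolist())
```

Sorting the lists means two songs with the same set of genres always get an identical label list, regardless of the order in which the genres were scraped.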

In [38]:
print(f'Common genres (genres that have more than {frequency_constraint} songs associated with them):')
print(f'\t* Are {100*len(common_genres)/len(genre_hist_df):.2f}% of all genres ({len(common_genres)}/{len(genre_hist_df)})')
print(f'\t* Describe {100*len(df_processed)/len(df):.2f}% of all songs ({len(df_processed)}/{len(df)})')
print(f'\t* Leaves us with {100*len(genre_counts[genre_counts > 1])/len(genre_counts):.2f}% of songs that have 2 or more genres associated ({len(genre_counts[genre_counts > 1])}/{len(genre_counts)})')

Common genres (genres that have more than 1000 songs associated with them):
	* Are 37.50% of all genres (24/64)
	* Describe 99.99% of all songs (58719/58723)
	* Leaves us with 86.52% of songs that have 2 or more genres associated (50802/58719)
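
The percentages above can be re-derived from the counts printed alongside them. The short check below only uses the numbers from that output; the variable names are chosen for this illustration.

```python
# Counts copied from the printed summary
n_common_genres, n_all_genres = 24, 64
n_kept_songs, n_all_songs = 58719, 58723
n_multi_genre_songs = 50802

pct_genres = f'{100 * n_common_genres / n_all_genres:.2f}'
pct_songs = f'{100 * n_kept_songs / n_all_songs:.2f}'
pct_multi = f'{100 * n_multi_genre_songs / n_kept_songs:.2f}'
n_dropped = n_all_songs - n_kept_songs

print(pct_genres, pct_songs, pct_multi, n_dropped)
```

The difference of 4 dropped songs matches the four Air songs listed earlier as having no common genre.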


## Save result

In [42]:
df_processed.reset_index(drop=True).to_csv('../data/scraped-lyrics-v2-preprocessed.csv', index=False)
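
When the saved file is read back, the `genres` column will again be a string, this time in regular Python list syntax with commas, so `ast.literal_eval` can parse it directly. Below is a minimal round-trip sketch on made-up data, using an in-memory buffer instead of the real file.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for df_processed
toy = pd.DataFrame({'song': ['a', 'b'], 'genres': [['Indie', 'Rock'], ['Pop']]})

# Write to an in-memory CSV, the same way the notebook writes to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the genres column back into lists while reading
reloaded = pd.read_csv(buffer, converters={'genres': ast.literal_eval})
print(reloaded.genres.tolist())  # [['Indie', 'Rock'], ['Pop']]
```

For the real file, the same `converters` argument could be passed to `pd.read_csv('../data/scraped-lyrics-v2-preprocessed.csv', ...)`.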