# Refining the Main Labeled Dataset for Primary Artist Focus
**Author:** Juan Antonio Robledo ([GitHub](https://github.com/JuanRobledo12))

### Description
This notebook refines the main labeled dataset produced by our classification model. It converts a few column data types and reduces the multi-artist fields (`artists_names` and the related popularity, genre, and follower columns) to the primary artist only. Keeping only the primary artist simplifies the dataset, which makes it easier to use in the playlist generation software and in other machine learning applications.

### Execution Instructions
- Execute the notebook in its entirety to generate a file named `cleaned_labeled_main_dataset.csv`.
- A later cell also writes the first 5,000 rows to `cleaned_labeled_subset_dataset.csv` for testing purposes. This subset is already available in the corresponding GitHub repository, but you can edit that cell to generate a different one.


In [1]:
import pandas as pd
import json

## 1. Check the labeled dataset

In [2]:
df = pd.read_csv('DATA/labeled_main_dataset.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277938 entries, 0 to 277937
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   track_uri             277938 non-null  object 
 1   name                  277923 non-null  object 
 2   artists_names         277938 non-null  object 
 3   popularity            277938 non-null  int64  
 4   album_type            277938 non-null  object 
 5   is_playable           277938 non-null  bool   
 6   release_date          277938 non-null  object 
 7   artists_uris          277938 non-null  object 
 8   playlist_uris         277938 non-null  object 
 9   danceability          277938 non-null  float64
 10  energy                277938 non-null  float64
 11  key                   277938 non-null  float64
 12  loudness              277938 non-null  float64
 13  mode                  277938 non-null  float64
 14  speechiness           277938 non-null  float64
 15  

In [3]:
df.head()

Unnamed: 0,track_uri,name,artists_names,popularity,album_type,is_playable,release_date,artists_uris,playlist_uris,danceability,...,liveness,valence,tempo,analysis_url,duration_ms,time_signature,artists_popularities,artists_genres,artists_followers,predicted_mood
0,spotify:track:0yLdNVWF3Srea0uzk55zFn,Flowers,['Miley Cyrus'],100,single,True,2023-01-13,['spotify:artist:5YGY8feqx7naU7z4HrwZM6'],"['spotify:playlist:3n3PnS8EAZwzCbL7AyU7Op', 's...",0.707,...,0.0322,0.646,117.999,https://api.spotify.com/v1/audio-analysis/0yLd...,200455.0,4.0,[92],[['pop']],[20250106],energetic
1,spotify:track:4nrPB8O7Y7wsOCJdgXkthe,"Shakira: Bzrp Music Sessions, Vol. 53","['Bizarrap', 'Shakira']",96,single,True,2023-01-11,"['spotify:artist:716NhGYqD1jl2wI1Qkgq36', 'spo...","['spotify:playlist:6GIsLGY639AcJmgSfhQx7C', 's...",0.778,...,0.0915,0.498,122.104,https://api.spotify.com/v1/audio-analysis/4nrP...,218289.0,4.0,"[88, 92]","[['argentine hip hop', 'pop venezolano', 'trap...","[10050089, 28191070]",happy
2,spotify:track:0DWdj2oZMBFSzRsi2Cvfzf,TQG,['KAROL G'],96,album,True,2023-02-24,['spotify:artist:790FomKkXshlbRYZFtlgla'],"['spotify:playlist:37i9dQZF1DWUArRC04H8rI', 's...",0.72,...,0.0936,0.607,179.974,https://api.spotify.com/v1/audio-analysis/0DWd...,199440.0,4.0,[93],"[['reggaeton', 'reggaeton colombiano', 'urbano...",[33856064],calm
3,spotify:track:6AQbmUe0Qwf5PZnt4HmTXv,Boy's a liar Pt. 2,"['PinkPantheress', 'Ice Spice']",96,single,True,2023-02-03,"['spotify:artist:78rUTD7y6Cy67W1RVzYs7t', 'spo...","['spotify:playlist:40L0ZzcH54squhYMUb5E3a', 's...",0.696,...,0.248,0.857,132.962,https://api.spotify.com/v1/audio-analysis/6AQb...,131013.0,4.0,"[84, 84]","[[], []]","[1928246, 969451]",happy
4,spotify:track:7oDd86yk8itslrA9HRP2ki,Die For You - Remix,"['The Weeknd', 'Ariana Grande']",95,single,True,2023-02-24,"['spotify:artist:1Xyo4u8uXC1ZmMpatF05PJ', 'spo...","['spotify:playlist:5Q6nyYneiOVjbFhzz1TzVt', 's...",0.531,...,0.441,0.502,66.9,https://api.spotify.com/v1/audio-analysis/7oDd...,232857.0,4.0,"[98, 91]","[['canadian contemporary r&b', 'canadian pop',...","[62165751, 89209352]",calm


## 2. Clean the dataset

TODO:

* ~~Drop the album_type, is_playable, artists_uris, playlist_uris, analysis_url columns.~~
* ~~Modify the artists_names column so it only has the first artist and it is not a list. Same for artists_popularities, artists_genres, and artists_followers: just keep the first artist's data and make sure it is not a list.~~
* ~~Make duration_ms and time_signature int64 instead of float.~~
* Make track_uri into links to access the song
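
The steps in this list can be sketched end to end on a toy frame. This is only a minimal illustration under stated assumptions: the rows are made-up values in the same stringified-list layout as the dataset, only one of the unused columns is included, and the helper names `first_or_none`, `first_genre`, and `refine` are hypothetical rather than part of the notebook.

```python
import ast
import pandas as pd

# Toy rows (hypothetical values) in the same stringified-list layout as the dataset
toy = pd.DataFrame({
    "track_uri": ["spotify:track:AAA", "spotify:track:BBB"],
    "artists_names": ["['Artist A']", "['Artist B', 'Artist C']"],
    "artists_popularities": ["[92]", "[88, 92]"],
    "artists_genres": ["[['pop']]", "[[], ['rock']]"],
    "artists_followers": ["[100]", "[200, 300]"],
    "album_type": ["single", "single"],
    "duration_ms": [200455.0, 218289.0],
    "time_signature": [4.0, 4.0],
})

def first_or_none(text):
    """Parse a stringified list and return its first element, or None if it is empty."""
    items = ast.literal_eval(text)
    return items[0] if items else None

def first_genre(text):
    """artists_genres is a list of lists: return the first genre of the first artist."""
    genres = first_or_none(text)
    return genres[0] if genres else None

def refine(df):
    out = df.drop(columns=["album_type"])                      # drop unused columns
    for col in ["artists_names", "artists_popularities", "artists_followers"]:
        out[col] = out[col].apply(first_or_none)               # keep the primary artist only
    out["artists_genres"] = out["artists_genres"].apply(first_genre)
    out = out.astype({"duration_ms": "int64", "time_signature": "int64"})
    # turn 'spotify:track:<id>' into 'open.spotify.com/track/<id>'
    out["track_uri"] = "open.spotify.com/track/" + out["track_uri"].str.split(":").str[-1]
    return out

refined = refine(toy)
print(refined.iloc[1].to_dict())
```

The second toy row has an empty genre list for its first artist, so its primary genre comes out as `None` even though the second artist has one.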

In [4]:
df_cleaned = df.drop(columns=['album_type', 'is_playable', 'artists_uris', 'playlist_uris', 'analysis_url'])
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277938 entries, 0 to 277937
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   track_uri             277938 non-null  object 
 1   name                  277923 non-null  object 
 2   artists_names         277938 non-null  object 
 3   popularity            277938 non-null  int64  
 4   release_date          277938 non-null  object 
 5   danceability          277938 non-null  float64
 6   energy                277938 non-null  float64
 7   key                   277938 non-null  float64
 8   loudness              277938 non-null  float64
 9   mode                  277938 non-null  float64
 10  speechiness           277938 non-null  float64
 11  acousticness          277938 non-null  float64
 12  instrumentalness      277938 non-null  float64
 13  liveness              277938 non-null  float64
 14  valence               277938 non-null  float64
 15  

In [5]:
import ast

# The artist columns hold stringified lists, e.g. "['Bizarrap', 'Shakira']".
# ast.literal_eval parses them without executing arbitrary code, unlike eval.
def first_item(value):
    items = ast.literal_eval(value) if isinstance(value, str) and value else []
    return items[0] if items else None  # None when the list is missing or empty

df_cleaned['artists_names'] = df_cleaned['artists_names'].apply(first_item)
df_cleaned['artists_popularities'] = df_cleaned['artists_popularities'].apply(first_item)
# artists_genres is a list of lists: keep the first genre of the first artist
df_cleaned['artists_genres'] = df_cleaned['artists_genres'].apply(lambda x: (first_item(x) or [None])[0])
df_cleaned['artists_followers'] = df_cleaned['artists_followers'].apply(first_item)
df_cleaned[['artists_names', 'artists_popularities', 'artists_genres', 'artists_followers']].head()


Unnamed: 0,artists_names,artists_popularities,artists_genres,artists_followers
0,Miley Cyrus,92,pop,20250106
1,Bizarrap,88,argentine hip hop,10050089
2,KAROL G,93,reggaeton,33856064
3,PinkPantheress,84,,1928246
4,The Weeknd,98,canadian contemporary r&b,62165751
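
Row 3 in the output above has no genre because the primary artist's genre list is empty. A quick way to see how often that happens is to measure the share of missing values in `artists_genres`. The sketch below uses three rows copied from the output above as a stand-in for `df_cleaned`; the name `toy_cleaned` is hypothetical.

```python
import pandas as pd

# Three rows taken from the head() output above, standing in for df_cleaned
toy_cleaned = pd.DataFrame({
    "artists_names": ["Miley Cyrus", "PinkPantheress", "The Weeknd"],
    "artists_genres": ["pop", None, "canadian contemporary r&b"],
})

# Boolean mask of tracks whose primary artist has no genre
missing = toy_cleaned["artists_genres"].isna()

print(f"{missing.mean():.1%} of tracks have no primary genre")
print(toy_cleaned.loc[missing, "artists_names"].tolist())
```

Running the same two lines on the full `df_cleaned` would show whether missing genres are rare enough to ignore or need a fallback.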


In [6]:
df_cleaned['duration_ms'] = df_cleaned['duration_ms'].astype('int64')
df_cleaned['time_signature'] = df_cleaned['time_signature'].astype('int64')

df_cleaned.rename(columns={'popularity': 'song_popularity'}, inplace=True)
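
The casts above succeed only when the columns contain no missing values. If a future version of the data had gaps, `astype('int64')` would raise, and pandas' nullable `Int64` dtype would be one alternative. The sketch below illustrates this on a toy series; the `NaN` entry is hypothetical and does not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy durations; the NaN is a hypothetical missing value
durations = pd.Series([200455.0, np.nan, 131013.0])

# A plain int64 cast cannot represent NaN and raises
try:
    durations.astype("int64")
    plain_cast_ok = True
except (ValueError, TypeError):
    plain_cast_ok = False

# The nullable integer dtype keeps the gap as <NA>
nullable = durations.astype("Int64")

print(plain_cast_ok)
print(nullable.dtype)
```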

In [7]:
df_cleaned['track_uri'].head()

0    spotify:track:0yLdNVWF3Srea0uzk55zFn
1    spotify:track:4nrPB8O7Y7wsOCJdgXkthe
2    spotify:track:0DWdj2oZMBFSzRsi2Cvfzf
3    spotify:track:6AQbmUe0Qwf5PZnt4HmTXv
4    spotify:track:7oDd86yk8itslrA9HRP2ki
Name: track_uri, dtype: object

In [8]:
# Function to convert track URI to link
def convert_to_link(uri):
    song_id = uri.split(':')[-1]
    return f"open.spotify.com/track/{song_id}"

# Apply the function to the 'track_uri' column
df_cleaned['track_uri'] = df_cleaned['track_uri'].apply(convert_to_link)
df_cleaned['track_uri'].head()

0    open.spotify.com/track/0yLdNVWF3Srea0uzk55zFn
1    open.spotify.com/track/4nrPB8O7Y7wsOCJdgXkthe
2    open.spotify.com/track/0DWdj2oZMBFSzRsi2Cvfzf
3    open.spotify.com/track/6AQbmUe0Qwf5PZnt4HmTXv
4    open.spotify.com/track/7oDd86yk8itslrA9HRP2ki
Name: track_uri, dtype: object
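
Note that `convert_to_link` is not idempotent: running it a second time on an already converted value prepends the prefix again, because the link contains no `:` to split on. Below is a hedged sketch of a stricter variant that validates its input and also adds the `https://` scheme so the result is directly clickable. The function name `uri_to_url` and the choice to raise `ValueError` are assumptions for illustration, not part of the notebook.

```python
def uri_to_url(uri):
    """Turn 'spotify:track:<id>' into 'https://open.spotify.com/track/<id>'."""
    parts = uri.split(":")
    # Reject anything that is not exactly 'spotify:track:<id>'
    if len(parts) != 3 or parts[0] != "spotify" or parts[1] != "track":
        raise ValueError(f"not a Spotify track URI: {uri!r}")
    return f"https://open.spotify.com/track/{parts[2]}"

print(uri_to_url("spotify:track:0yLdNVWF3Srea0uzk55zFn"))
# https://open.spotify.com/track/0yLdNVWF3Srea0uzk55zFn
```

With this variant, re-running the cell on converted values fails loudly instead of silently producing a doubled prefix.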

In [9]:
df_cleaned.head(10)

Unnamed: 0,track_uri,name,artists_names,song_popularity,release_date,danceability,energy,key,loudness,mode,...,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,artists_popularities,artists_genres,artists_followers,predicted_mood
0,open.spotify.com/track/0yLdNVWF3Srea0uzk55zFn,Flowers,Miley Cyrus,100,2023-01-13,0.707,0.681,0.0,-4.325,1.0,...,5e-06,0.0322,0.646,117.999,200455,4,92,pop,20250106,energetic
1,open.spotify.com/track/4nrPB8O7Y7wsOCJdgXkthe,"Shakira: Bzrp Music Sessions, Vol. 53",Bizarrap,96,2023-01-11,0.778,0.632,2.0,-5.6,0.0,...,0.0,0.0915,0.498,122.104,218289,4,88,argentine hip hop,10050089,happy
2,open.spotify.com/track/0DWdj2oZMBFSzRsi2Cvfzf,TQG,KAROL G,96,2023-02-24,0.72,0.63,4.0,-3.547,0.0,...,0.0,0.0936,0.607,179.974,199440,4,93,reggaeton,33856064,calm
3,open.spotify.com/track/6AQbmUe0Qwf5PZnt4HmTXv,Boy's a liar Pt. 2,PinkPantheress,96,2023-02-03,0.696,0.809,5.0,-8.254,1.0,...,0.000128,0.248,0.857,132.962,131013,4,84,,1928246,happy
4,open.spotify.com/track/7oDd86yk8itslrA9HRP2ki,Die For You - Remix,The Weeknd,95,2023-02-24,0.531,0.525,1.0,-6.5,0.0,...,0.0,0.441,0.502,66.9,232857,4,98,canadian contemporary r&b,62165751,calm
5,open.spotify.com/track/0WtM2NBVQNNJLh6scP13H8,Calm Down (with Selena Gomez),Rema,94,2022-08-25,0.801,0.806,11.0,-5.206,1.0,...,0.000669,0.114,0.802,106.999,239318,4,82,nigerian pop,2104953,happy
6,open.spotify.com/track/4uUG5RXrOk84mYEfFvj3cK,I'm Good (Blue),David Guetta,94,2022-08-26,0.561,0.965,7.0,-3.673,0.0,...,7e-06,0.371,0.304,128.04,175238,4,89,big room,25778023,energetic
7,open.spotify.com/track/78Sw5GDo6AlGwTwanjXbGh,Here With Me,d4vd,93,2022-09-22,0.574,0.469,4.0,-8.209,1.0,...,9.2e-05,0.128,0.288,132.023,242485,4,82,bedroom pop,909208,sad
8,open.spotify.com/track/0V3wPSX9ygBnCm8psDIegu,Anti-Hero,Taylor Swift,93,2022-10-21,0.637,0.643,4.0,-6.571,1.0,...,2e-06,0.142,0.533,97.008,200690,4,100,pop,71194980,happy
9,open.spotify.com/track/5ww2BF9slyYgNOk37BlC4u,La Bachata,Manuel Turizo,93,2022-05-26,0.835,0.679,7.0,-5.329,0.0,...,2e-06,0.218,0.85,124.98,162638,4,85,colombian pop,11848110,happy


In [10]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277938 entries, 0 to 277937
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   track_uri             277938 non-null  object 
 1   name                  277923 non-null  object 
 2   artists_names         277938 non-null  object 
 3   song_popularity       277938 non-null  int64  
 4   release_date          277938 non-null  object 
 5   danceability          277938 non-null  float64
 6   energy                277938 non-null  float64
 7   key                   277938 non-null  float64
 8   loudness              277938 non-null  float64
 9   mode                  277938 non-null  float64
 10  speechiness           277938 non-null  float64
 11  acousticness          277938 non-null  float64
 12  instrumentalness      277938 non-null  float64
 13  liveness              277938 non-null  float64
 14  valence               277938 non-null  float64
 15  

In [11]:
df_cleaned.to_csv('DATA/cleaned_labeled_main_dataset.csv', index=False)

In [12]:
# Edit the row count below if you want to generate a different subset dataset

subset_df_cleaned = df_cleaned.head(5000)
subset_df_cleaned.to_csv('DATA/cleaned_labeled_subset_dataset.csv', index=False)

In [13]:
# Create JSON file version of the dataset

mood_groups = subset_df_cleaned.groupby('predicted_mood').apply(lambda x: x.to_dict(orient='records')).to_dict()

# Write the dictionary to a JSON file
with open('DATA/subset_mood_classified_tracks.json', 'w') as file:
    json.dump(mood_groups, file, indent=4)

print("JSON file has been created successfully.")

JSON file has been created successfully.
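
The resulting file maps each mood to a list of track records. The sketch below shows that shape on a toy frame and reads the file back to confirm it. It builds the dictionary by iterating over the groups instead of calling `groupby(...).apply(...)`, which is one way to avoid depending on how `apply` treats the grouping column (this behavior has changed in recent pandas releases). The toy rows, the temporary file path, and the file name `toy_moods.json` are assumptions for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Toy rows with names and moods taken from the head() output above
toy = pd.DataFrame({
    "name": ["Flowers", "TQG", "Die For You - Remix"],
    "predicted_mood": ["energetic", "calm", "calm"],
})

# Build {mood: [record, ...]} by iterating over the groups
mood_groups = {
    mood: group.to_dict(orient="records")
    for mood, group in toy.groupby("predicted_mood")
}

# Write to a temporary location, then read back to check the structure
path = os.path.join(tempfile.mkdtemp(), "toy_moods.json")
with open(path, "w") as f:
    json.dump(mood_groups, f, indent=4)

with open(path) as f:
    loaded = json.load(f)

print(sorted(loaded))
print(len(loaded["calm"]))
```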


## 3. Check the Dataset Quality
See for yourself whether the predicted labels match the songs. In general the classifier does a decent job, but I think we need more expressive labels to describe songs.

In [19]:
subset_df = pd.read_csv('DATA/cleaned_labeled_subset_dataset.csv')
subset_df = subset_df[['track_uri', 'name', 'artists_names', 'predicted_mood']]
subset_df.head()

Unnamed: 0,track_uri,name,artists_names,predicted_mood
0,open.spotify.com/track/0yLdNVWF3Srea0uzk55zFn,Flowers,Miley Cyrus,energetic
1,open.spotify.com/track/4nrPB8O7Y7wsOCJdgXkthe,"Shakira: Bzrp Music Sessions, Vol. 53",Bizarrap,happy
2,open.spotify.com/track/0DWdj2oZMBFSzRsi2Cvfzf,TQG,KAROL G,calm
3,open.spotify.com/track/6AQbmUe0Qwf5PZnt4HmTXv,Boy's a liar Pt. 2,PinkPantheress,happy
4,open.spotify.com/track/7oDd86yk8itslrA9HRP2ki,Die For You - Remix,The Weeknd,calm


In [21]:
required_moods = ['sad', 'calm', 'happy', 'energetic']

# Function to sample 5 songs from each mood
def sample_songs_by_mood(df, mood, n=5):
    # Filter the DataFrame by mood
    mood_df = df[df['predicted_mood'] == mood]
    # Randomly sample n rows
    sampled_df = mood_df.sample(n=n, random_state=42)
    return sampled_df

# Create an empty DataFrame to hold the samples
sampled_songs = pd.DataFrame()

# Sample 5 songs from each required mood
for mood in required_moods:
    sampled_songs = pd.concat([sampled_songs, sample_songs_by_mood(subset_df, mood, n=5)], ignore_index=True)

# Print the sampled songs
sampled_songs

Unnamed: 0,track_uri,name,artists_names,predicted_mood
0,open.spotify.com/track/7fQYRdNX6y8BpfmHvWVPm8,恋人じゃなくなった日,Yuuri,sad
1,open.spotify.com/track/5x5JM1BSB6vollcIzDocqT,The Climb,Miley Cyrus,sad
2,open.spotify.com/track/4frLb7nWtsz2ymBE6k2GRP,Earned It (Fifty Shades Of Grey),The Weeknd,sad
3,open.spotify.com/track/5enxwA8aAbwZbf5qCHORXi,All Too Well (10 Minute Version) (Taylor's Ver...,Taylor Swift,sad
4,open.spotify.com/track/3UoULw70kMsiVXxW0L3A33,pov,Ariana Grande,sad
5,open.spotify.com/track/3JA9Jsuxr4xgHXEawAdCp4,Just Can’t Get Enough,Black Eyed Peas,calm
6,open.spotify.com/track/47EiUVwUp4C9fGccaPuUCS,DÁKITI,Bad Bunny,calm
7,open.spotify.com/track/1635wWSdp29PO3GxYhy991,Bel Mercy,Jengi,calm
8,open.spotify.com/track/0iOZM63lendWRTTeKhZBSC,"Mrs. Robinson - From ""The Graduate"" Soundtrack",Simon & Garfunkel,calm
9,open.spotify.com/track/4ZLzoOkj0MPWrTLvooIuaa,Get You The Moon (feat. Snøw),Kina,calm
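
`DataFrame.sample(n=5)` raises a `ValueError` when a mood has fewer than five rows, which could happen with a smaller subset. A minimal sketch of a more forgiving sampler is shown below; it caps the sample size at the group size. The function name `sample_per_mood` and the toy data are hypothetical.

```python
import pandas as pd

# Toy data: 'sad' has only 2 rows, fewer than the requested sample size
toy = pd.DataFrame({
    "name": [f"song {i}" for i in range(8)],
    "predicted_mood": ["sad"] * 2 + ["calm"] * 6,
})

def sample_per_mood(df, n=5, seed=42):
    """Sample up to n rows per mood, taking every row when a mood has fewer than n."""
    parts = []
    for mood, group in df.groupby("predicted_mood"):
        parts.append(group.sample(n=min(n, len(group)), random_state=seed))
    return pd.concat(parts, ignore_index=True)

sampled = sample_per_mood(toy)
counts = sampled["predicted_mood"].value_counts().to_dict()
print(counts)
```

Unlike the loop above, this version samples every mood present in the data rather than a fixed `required_moods` list, so a mood with no rows is simply absent from the result.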
