# Dataset Setup
Download each of the following Kaggle datasets and save it locally under the file name shown:

charts.csv: https://www.kaggle.com/datasets/dhruvildave/spotify-charts

Spotify 1.2M+ Songs.csv: https://www.kaggle.com/datasets/rodolfofigueroa/spotify-12m-songs

Spotify Top 100 Songs of 2010-2019.csv: https://www.kaggle.com/datasets/muhmores/spotify-top-100-songs-of-20152019

Spotify Top 200 Charts (2020-2021).csv: https://www.kaggle.com/datasets/sashankpillai/spotify-top-200-charts-20202021

Spotify Tracks Dataset.csv: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset

TikTok_songs_2019.csv: https://www.kaggle.com/datasets/sveta151/tiktok-popular-songs-2019

TikTok_songs_2020.csv: https://www.kaggle.com/datasets/sveta151/tiktok-popular-songs-2020

TikTok_songs_2021.csv: https://www.kaggle.com/datasets/sveta151/tiktok-popular-songs-2021
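
Before loading anything, it can help to confirm that every file is saved under the expected name. The following is a minimal sketch, assuming the files sit in the notebook's working directory; `EXPECTED_FILES` and `find_missing_files` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# File names taken from the list above; the working directory is an assumption.
EXPECTED_FILES = [
    "charts.csv",
    "Spotify 1.2M+ Songs.csv",
    "Spotify Top 100 Songs of 2010-2019.csv",
    "Spotify Top 200 Charts (2020-2021).csv",
    "Spotify Tracks Dataset.csv",
    "TikTok_songs_2019.csv",
    "TikTok_songs_2020.csv",
    "TikTok_songs_2021.csv",
]


def find_missing_files(data_dir=".", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in expected if not (base / name).is_file()]


missing = find_missing_files()
print("Missing files:", missing if missing else "none")
```

If the list is non-empty, the `pd.read_csv` calls for those files would raise `FileNotFoundError`, so renaming the downloads first avoids a failed load cell.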

In [34]:
# package installs and imports
%pip install pandasql
import pandas as pd
import numpy as np
import pandasql as ps

Note: you may need to restart the kernel to use updated packages.

In [35]:
# load data into pandas dataframes
charts_df = pd.read_csv('charts.csv')
song_features_1_df = pd.read_csv('Spotify 1.2M+ Songs.csv')
song_features_2_df = pd.read_csv('Spotify Top 100 Songs of 2010-2019.csv')
song_features_3_df = pd.read_csv('Spotify Top 200 Charts (2020-2021).csv')
song_features_4_df = pd.read_csv('Spotify Tracks Dataset.csv')
tiktok_19_df = pd.read_csv('TikTok_songs_2019.csv')
tiktok_20_df = pd.read_csv('TikTok_songs_2020.csv')
tiktok_21_df = pd.read_csv('TikTok_songs_2021.csv')

# Song Feature EDA

This section explores the candidate datasets of songs and their features. Relevant questions to answer:

1. Can we find song features for most of the songs in 'charts.csv'?

2. Do we have to combine datasets to do so?

3. If we have to combine datasets, what fields do they share?

4. What kind of cleaning do we need to do, and how much?
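
For question 3, one quick way to see which fields several DataFrames share is to intersect their column names. This is a minimal sketch on toy frames, not the real datasets; `shared_columns` and `toy_frames` are hypothetical names, and the lowercasing and stripping of column names is an assumption about what counts as "the same" field.

```python
import pandas as pd


def shared_columns(frames):
    """Return column names (lowercased, stripped) present in every DataFrame."""
    normalized = [
        {str(col).strip().lower() for col in df.columns}
        for df in frames.values()
    ]
    return sorted(set.intersection(*normalized)) if normalized else []


# Toy frames standing in for the feature datasets (illustrative only).
toy_frames = {
    "features_a": pd.DataFrame(columns=["title", "artist", "danceability"]),
    "features_b": pd.DataFrame(columns=["Title", "Artist", "energy"]),
}

print(shared_columns(toy_frames))  # ['artist', 'title']
```

A name-based intersection only catches columns that are spelled alike. The feature datasets here label the song title as `name`, `title`, `Song Name`, and `track_name`, so equivalent fields still need to be mapped by hand.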

In [36]:
# first, since we want to find song features for as many of the songs in 'charts_df' as possible, let's start by
# projecting charts_df onto song title and artist name and keeping each (title, artist) pair once
charts_songs_artists_df = charts_df[['title', 'artist']].dropna().drop_duplicates()

# next, let's join the features datasets on song title and artist name and see what percentage of songs in charts_df
# each song features dataset can provide features for
song_features_1_df_projected = song_features_1_df[['name', 'artists']].rename(columns={'name': 'title', 'artists': 'artist'})
song_features_2_df_projected = song_features_2_df[['title', 'artist']]
song_features_3_df_projected = song_features_3_df[['Song Name', 'Artist']].rename(columns={'Song Name': 'title', 'Artist': 'artist'})
song_features_4_df_projected = song_features_4_df[['track_name', 'artists']].rename(columns={'track_name': 'title', 'artists': 'artist'})

# join song features datasets on song title and artist name
merge_1 = pd.merge(charts_songs_artists_df, song_features_1_df_projected, on=['title', 'artist'], how='inner')
merge_2 = pd.merge(charts_songs_artists_df, song_features_2_df_projected, on=['title', 'artist'], how='inner')
merge_3 = pd.merge(charts_songs_artists_df, song_features_3_df_projected, on=['title', 'artist'], how='inner')
merge_4 = pd.merge(charts_songs_artists_df, song_features_4_df_projected, on=['title', 'artist'], how='inner')

match_percentage_1 = 100 * merge_1.shape[0] / charts_songs_artists_df.shape[0]
match_percentage_2 = 100 * merge_2.shape[0] / charts_songs_artists_df.shape[0]
match_percentage_3 = 100 * merge_3.shape[0] / charts_songs_artists_df.shape[0]
match_percentage_4 = 100 * merge_4.shape[0] / charts_songs_artists_df.shape[0]

print(match_percentage_1)
print(match_percentage_2)
print(match_percentage_3)
print(match_percentage_4)


0.0
0.31792156247310577
0.7867039937630674
7.569368156206811
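
Exact string joins are sensitive to case, stray whitespace, and duplicate rows in the feature datasets, any of which could distort the percentages above. The sketch below shows one way to measure the match rate over unique, normalized (title, artist) pairs using `merge` with `indicator=True`. The helper names `normalize_keys` and `match_percentage` and the toy data are hypothetical, and nothing here shows whether this normalization would raise the match rates on the real datasets.

```python
import pandas as pd


def normalize_keys(df, cols=("title", "artist")):
    """Keep the key columns, lowercase and strip them, and drop duplicates."""
    out = df[list(cols)].dropna().copy()
    for col in cols:
        out[col] = out[col].astype(str).str.strip().str.lower()
    return out.drop_duplicates()


def match_percentage(charts, features):
    """Percentage of unique chart songs that also appear in features."""
    left = normalize_keys(charts)
    right = normalize_keys(features)
    # how='left' keeps every chart song; '_merge' marks which ones matched.
    merged = left.merge(right, on=["title", "artist"], how="left", indicator=True)
    return 100 * (merged["_merge"] == "both").mean()


# Toy data (illustrative only): one of the two unique chart songs matches.
toy_charts = pd.DataFrame(
    {"title": ["Song A", "Song B", "Song A"], "artist": ["X", "Y", "X"]}
)
toy_features = pd.DataFrame(
    {"title": ["song a ", "Song C"], "artist": ["x", "Z"]}
)

print(match_percentage(toy_charts, toy_features))  # 50.0
```

Because both sides are de-duplicated before the left join, each chart song contributes exactly one row, so the result cannot exceed 100 even when a features dataset lists the same track several times.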
