In [40]:
import pandas as pd
import numpy as np
import plotly.express as px

In this notebook, we clean the dataset for the *Hot 100 Billboard Charts*. Please make sure to download it from https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs and set the correct path to the .csv file in the cell below:

In [41]:
data_raw = pd.read_csv("data/billboard_charts.csv")


In [42]:
data_raw.head(150)

Unnamed: 0,date,rank,song,artist,last-week,peak-rank,weeks-on-board
0,2021-11-06,1,Easy On Me,Adele,1.0,1,3
1,2021-11-06,2,Stay,The Kid LAROI & Justin Bieber,2.0,1,16
2,2021-11-06,3,Industry Baby,Lil Nas X & Jack Harlow,3.0,1,14
3,2021-11-06,4,Fancy Like,Walker Hayes,4.0,3,19
4,2021-11-06,5,Bad Habits,Ed Sheeran,5.0,2,18
...,...,...,...,...,...,...,...
145,2021-10-30,46,Gyalis,Capella Grey,46.0,38,12
146,2021-10-30,47,Wild Side,Normani Featuring Cardi B,52.0,14,14
147,2021-10-30,48,Fair Trade,Drake Featuring Travis Scott,39.0,3,7
148,2021-10-30,49,Leave Before You Love Me,Marshmello X Jonas Brothers,48.0,19,22


We are only going to look at data from 2020 and 2021, so we can discard older dates. We use the same cutoff as the Spotify dataset:

In [43]:
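# Parse the date strings so they can be compared chronologically
data_raw["date"] = pd.to_datetime(data_raw["date"])
# Keep only chart weeks from 2020 onward, as an independent copy
truncated = data_raw[data_raw["date"] > "2019-12-31"].copy()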
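
A minimal sketch of the same date filter on a toy frame (the values and the names `toy`, `cutoff`, and `recent` are made up for illustration). It uses an explicit `pd.Timestamp` cutoff and takes a `.copy()` of the slice, which is one way to keep the filtered frame independent of the original if columns are modified later:

```python
import pandas as pd

# Toy frame standing in for the Billboard data (values are made up).
toy = pd.DataFrame({
    "date": ["2019-12-28", "2020-01-04", "2021-11-06"],
    "rank": [1, 1, 1],
})
toy["date"] = pd.to_datetime(toy["date"])

# Keep rows from 2020 onward; .copy() makes the result
# independent of the original frame.
cutoff = pd.Timestamp("2020-01-01")
recent = toy[toy["date"] >= cutoff].copy()
```

Because the chart is published weekly, `>= 2020-01-01` and `> 2019-12-31` select the same rows here.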



First, let's see which columns contain missing values:

In [44]:
truncated.isna().sum()

date                 0
rank                 0
song                 0
artist               0
last-week         1536
peak-rank            0
weeks-on-board       0
dtype: int64

In [45]:
# Select every row that has at least one missing value
has_nan = truncated.isna().any(axis=1)
truncated[has_nan]


Unnamed: 0,date,rank,song,artist,last-week,peak-rank,weeks-on-board
26,2021-11-06,27,Moth To A Flame,Swedish House Mafia & The Weeknd,,27,1
27,2021-11-06,28,Let's Go Brandon,Bryson Gray Featuring Tyson James & Chandler C...,,28,1
60,2021-11-06,61,Not In The Mood,"Lil Tjay, Fivio Foreign & Kay Flock",,61,1
68,2021-11-06,69,Switches & Dracs,Moneybagg Yo Featuring Lil Durk & EST Gee,,69,1
78,2021-11-06,79,Poke It Out,Wale Featuring J. Cole,,79,1
...,...,...,...,...,...,...,...
9648,2020-01-04,49,"You're A Mean One, Mr. Grinch",Thurl Ravenscroft,,49,1
9668,2020-01-04,69,Happy Xmas (War Is Over),John Legend,,69,1
9684,2020-01-04,85,Slide,H.E.R. Featuring YG,,85,1
9697,2020-01-04,98,Tusa,Karol G & Nicki Minaj,,78,4


In [46]:
truncated[truncated["song"].str.contains("Maybach")]  # Example of a song with NaN in 'last-week'

Unnamed: 0,date,rank,song,artist,last-week,peak-rank,weeks-on-board
76,2021-11-06,77,Maybach,42 Dugg Featuring Future,85.0,68,5
184,2021-10-30,85,Maybach,42 Dugg Featuring Future,85.0,68,4
284,2021-10-23,85,Maybach,42 Dugg Featuring Future,,68,3
2179,2021-06-12,80,Maybach,42 Dugg Featuring Future,68.0,68,2
2267,2021-06-05,68,Maybach,42 Dugg Featuring Future,,68,1
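
The Maybach rows suggest that 'last-week' is missing exactly when the song was absent from the previous week's chart. A rough way to check this pattern, sketched on a made-up toy history (the names `toy`, `gap`, `absent_last_week`, and `consistent` are hypothetical, and the check assumes charts are exactly 7 days apart):

```python
import pandas as pd

# Toy chart history for one song (values are made up), with a gap between runs.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-05", "2021-06-12", "2021-10-23", "2021-10-30"]),
    "song": ["A", "A", "A", "A"],
    "artist": ["X", "X", "X", "X"],
    "last-week": [float("nan"), 68.0, float("nan"), 85.0],
})

toy = toy.sort_values(["song", "artist", "date"])
# Time since the same song's previous chart appearance (NaT for the first one).
gap = toy.groupby(["song", "artist"])["date"].diff()
# Absent last week: no earlier entry at all, or the gap is not exactly 7 days.
absent_last_week = gap.isna() | (gap != pd.Timedelta(days=7))

# True if missing 'last-week' values line up exactly with absences.
consistent = (toy["last-week"].isna() == absent_last_week).all()
```

On the real data, such a check would probably need to run on the untruncated frame, since the 2020 cutoff removes the preceding week for songs that were already charting in late 2019.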


As we can see, we have NaN values in the 'last-week' column when the song was not in the charts the previous week (a new entry or a re-entry). A less error-prone way of keeping this information could be to set NaN values to -1:

In [47]:
final = truncated.fillna(-1)
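
As a possible variation, the fill can be restricted to the one column that actually has missing values, and the column can then be cast back to integers. This is only a sketch on made-up toy data (`toy` and `filled` are hypothetical names), not a replacement for the cell above:

```python
import pandas as pd

# Toy frame (values are made up); NaN marks "not on the chart last week".
toy = pd.DataFrame({
    "song": ["A", "B"],
    "last-week": [float("nan"), 2.0],
})

# Fill only the 'last-week' column, then cast it back to integers;
# -1 is a sentinel meaning "not on the chart the previous week".
filled = toy.copy()
filled["last-week"] = filled["last-week"].fillna(-1).astype(int)
```

With a sentinel like -1, any later numeric summary of 'last-week' (a mean, for example) should exclude the -1 rows first, since they are not real ranks.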

In [48]:
final.isna().sum()

date              0
rank              0
song              0
artist            0
last-week         0
peak-rank         0
weeks-on-board    0
dtype: int64