In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

In this notebook, we will clean the dataset for the *Hot 100 Billboard Charts*. Please make sure to download it from https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs and set the correct path to the .csv file in the cell below:

In [None]:
data_raw = pd.read_csv("../data/billboardCharts.csv")

In [None]:
data_raw.head(150)

We are only going to look at data from 2020 and 2021, so we can discard all earlier dates. We use the same cutoff as for the Spotify dataset:

In [None]:
data_raw["date"] = pd.to_datetime(data_raw['date'])
truncated = data_raw[(data_raw['date'] > '2019-12-31')]
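
The cutoff filter can be tried out on a small made-up frame first. The following is only a sketch: the toy rows and the names `toy`, `cutoff` and `toy_truncated` are assumptions for illustration and are not part of the Billboard data.

```python
import pandas as pd

# Toy stand-in for the chart data; the values are made up for illustration.
toy = pd.DataFrame({
    "date": ["2019-12-28", "2020-01-04", "2021-06-12"],
    "rank": [1, 1, 2],
})
toy["date"] = pd.to_datetime(toy["date"])

# Keep only rows on or after 2020-01-01. For day-level dates this selects
# the same rows as comparing with > '2019-12-31'.
cutoff = pd.Timestamp("2020-01-01")
toy_truncated = toy[toy["date"] >= cutoff].copy()

print(toy_truncated)
```

Calling `.copy()` on the filtered frame is optional here; it makes the result independent of the original frame in case it is modified in place later.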

First, let's see which columns contain missing values:

In [None]:
truncated.isna().sum()

In [None]:
# show every row that contains at least one missing value
has_nan = truncated.isna().any(axis=1)
truncated[has_nan]

In [None]:
# example of a song with a NaN value in its first chart week
truncated[truncated["song"].str.contains("Maybach")]

As we can see, we have NaN values in the 'last-week' column when the song was not in the charts the previous week. A less error-prone way of keeping this information could be to set the NaN values to -1:

In [None]:
final = truncated.fillna(-1)

In [None]:
final.isna().sum()

The missing values are now filled in, but we observed that the 'last-week' column has a float dtype. This is because pandas stores a numeric column that contains NaN as float. The values in this column refer to chart ranks, which are always integers. Therefore, we will change the dtype to integer to avoid problems in future work on our dataset.

In [None]:
# change type of 'last-week' to integer
final = final.astype({'last-week':'int'})
final.dtypes
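
As a side note, the sentinel approach can be compared with pandas' nullable integer dtype `Int64`, which keeps missing values as `<NA>` while still storing integers. The sketch below uses made-up rows; the names `toy`, `sentinel` and `nullable` are assumptions for illustration, and the notebook itself keeps the -1 sentinel.

```python
import numpy as np
import pandas as pd

# Toy data: the NaN forces the column to float64, as in the real dataset.
toy = pd.DataFrame({
    "song": ["A", "B", "C"],
    "last-week": [3.0, np.nan, 1.0],
})

# Approach used in this notebook: fill with a sentinel, then cast to int.
# Here only 'last-week' is filled; fillna(-1) would fill every column.
sentinel = toy.fillna({"last-week": -1}).astype({"last-week": "int64"})

# Alternative: nullable integer dtype, missing values stay missing (<NA>).
nullable = toy.astype({"last-week": "Int64"})

print(sentinel.dtypes)
print(nullable)
```

The sentinel keeps a plain integer column that any CSV reader handles, but every later computation has to remember that -1 means "not in the charts". The nullable dtype avoids that at the cost of a pandas-specific dtype.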

The column 'date' is stored as datetime64 with the format 'YYYY-MM-DD HH:MM', whereas the Spotify dataset stores 'date' in the format 'YYYY-MM-DD'. If we want to compare this dataset with the Spotify one, we need to unify the dtypes.

In [None]:
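# 'date' is already datetime64, so calling pd.to_datetime again would change nothing;
# format it as 'YYYY-MM-DD' strings instead
final['date'] = final['date'].dt.strftime('%Y-%m-%d')

# check the dtypes again: 'date' should no longer be datetime64 (typically 'object')
final.dtypes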

For future work on this dataset, we export the cleaned data to a new csv file.

In [None]:
final.to_csv('../data/billboardCharts_cleaned.csv')
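
When the cleaned file is loaded again, it can be worth checking that the columns and dtypes survive the round trip. The sketch below uses an in-memory buffer and made-up rows instead of the real file; `toy`, `buffer` and `restored` are assumed names. Note that it passes `index=False`, whereas the export above also writes the row index as an extra unnamed column.

```python
import io

import pandas as pd

# Toy frame with the same kinds of columns as the cleaned data (made-up values).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-04", "2020-01-11"]),
    "rank": [1, 2],
    "last-week": [-1, 1],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back; parse_dates turns the 'date' column into datetime64 again.
restored = pd.read_csv(buffer, parse_dates=["date"])

print(restored.dtypes)
```

Without `parse_dates`, the 'date' column would be read back as text, which may or may not be what the downstream comparison expects.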