In [2]:
import numpy as np
import pandas as pd
import plotly.express as px

# Data processing - Cleaning of the dataset

In this notebook, we clean the Spotify charts dataset used in our project. Please download the entire dataset here: https://www.kaggle.com/datasets/dhruvildave/spotify-charts **Before running this notebook, run 'convert_spotify_data.py' with the correct paths in its method parameters**.
That short script retains only the years we actually need for our project, which makes the CSV much smaller. (The full dataset is so large that running this step inside the notebook takes a very long time.)
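
The actual 'convert_spotify_data.py' is not shown in this notebook. As a rough sketch of what such a year filter could look like, the function below reads the large CSV in chunks and keeps only rows from a given year onward. The name `filter_years`, its parameters, and the default of 2020 (guessed from the output filename 'spotify_2020+.csv') are assumptions for illustration, as is the ISO date format.

```python
import pandas as pd


def filter_years(src_path, dst_path, min_year=2020, chunksize=500_000):
    """Copy only rows dated min_year or later from src_path to dst_path."""
    first_chunk = True
    # Read the large CSV in chunks so it never has to fit in memory at once.
    for chunk in pd.read_csv(src_path, chunksize=chunksize):
        # Assumption: 'date' holds ISO strings such as '2020-01-31'.
        years = pd.to_datetime(chunk["date"]).dt.year
        kept = chunk[years >= min_year]
        # Write the header only once, then append the remaining chunks.
        kept.to_csv(
            dst_path,
            mode="w" if first_chunk else "a",
            header=first_chunk,
            index=False,
        )
        first_chunk = False
```

Processing in chunks trades some speed for a bounded memory footprint, which matters for a file of this size.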

Define functions

In [3]:
# check if a region has dates with fewer than numOfRanks chart entries
def checkMissingRanks(df, region, numOfRanks = 200):
    missing = 0
    df_region = df[df['region'] == region]
    # count the entries per date and sum up the shortfall
    for index, val in df_region['date'].value_counts().items():
        if val < numOfRanks:
            missing = missing + (numOfRanks-val)
            #print('date:', index, ', #ranks:', val)
    if missing == 0:
        print('No missing ranks!')
    else:
        print(missing, 'missing rank entries!')

In [4]:
def checkMissingDates(df, region):
    df_region = df[df['region'] == region]
    
    # dates in the full range (over all regions) that do not occur for this region;
    # requires 'date' to be a datetime column
    missing = pd.date_range(start=df['date'].min(), end=df['date'].max()).difference(df_region['date'])
    
    if missing.size == 0:
        print('No missing dates!')
    else:
        print(missing.size, 'missing dates!')
        return missing

In [5]:
def checkMissingData(df, ranks = 200):
    regions = df['region'].unique()
    
    # print a per-region report of missing dates and missing rank entries
    for region in regions:
        print(region)
        checkMissingDates(df, region)
        checkMissingRanks(df, region, ranks)
        print()
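
The helpers above only print their findings. As a hedged alternative, the sketch below collects the same two counts per region into a DataFrame so they can be sorted or filtered. `summarize_missing` and the toy data are hypothetical and not part of the project code. The sketch assumes 'date' is already a datetime column.

```python
import pandas as pd


def summarize_missing(df, num_ranks=200):
    """Return one row per region with counts of missing dates and rank entries."""
    # Full calendar spanned by the whole dataset, as in checkMissingDates.
    all_dates = pd.date_range(df["date"].min(), df["date"].max())
    rows = []
    for region, group in df.groupby("region"):
        per_date = group["date"].value_counts()
        rows.append({
            "region": region,
            # Calendar days on which the region has no chart entries at all.
            "missing_dates": len(all_dates.difference(per_date.index)),
            # Shortfall on days that exist but have fewer than num_ranks rows.
            "missing_rank_entries": int((num_ranks - per_date).clip(lower=0).sum()),
        })
    return pd.DataFrame(rows).set_index("region")


# Toy data: region 'A' has no entries on 2020-01-02,
# region 'B' is one rank short on 2020-01-03.
toy = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-01", "2020-01-03", "2020-01-03",
        "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03",
    ]),
    "rank": [1, 2, 1, 2, 1, 2, 1, 2, 1],
})
summary = summarize_missing(toy, num_ranks=2)
print(summary)
```

Returning a table instead of printing makes it easy to spot the regions with the largest gaps, for example with `summary.sort_values('missing_dates')`.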

Import dataset

In [None]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)
data_raw = pd.read_csv("../data/spotify_2020+.csv")

Sort the values by region, then by date, and finally by rank for later processing

In [None]:
data_raw.sort_values(by=['region', 'date', 'rank'], inplace=True)

In [None]:
# set a new index after the sort and check the result
data_raw = data_raw.reset_index(drop=True)
data_raw.head(10)

First, let's see which columns contain missing values:

In [None]:
data_raw.isna().sum()

As is evident, most missing values occur in the 'streams' column. We can drop this column since we won't need it in our analysis. We will also drop the 'Unnamed: 0' column.

In [None]:
# 'columns=' already selects the axis, so 'axis=1' is not needed
data_no_streams = data_raw.drop(columns=["Unnamed: 0", "streams"])

In [None]:
data_no_streams.head(5)

Next, we find the rows that still contain missing values, which are the entries with a missing artist, and inspect them:

In [None]:
# boolean mask marking the rows that contain at least one NaN
row_has_nan = data_no_streams.isna().any(axis=1)
data_no_streams[row_has_nan]

All of these rows seem to have the song title 'NO GOOD'. Let's check whether that title is used anywhere else or whether it indicates a defective data entry:

In [None]:
# literal substring match; rows without a title count as non-matches
mask = data_no_streams["title"].str.contains("NO GOOD", regex=False, na=False)
data_no_streams[mask]

This could be an actual track name. We will therefore fill the NaN values with 'Unknown Artist' instead of deleting the rows:

In [None]:
# assign the result back instead of using inplace=True on a column selection
data_no_streams["artist"] = data_no_streams["artist"].fillna("Unknown Artist")

In [None]:
# sanity check
data_no_streams.isna().sum()

Check the dtypes in the dataframe

In [None]:
# check datatypes
data_no_streams.dtypes

Convert the 'date' column to datetime format

In [None]:
data_no_streams['date'] = pd.to_datetime(data_no_streams.date)
data_no_streams.dtypes
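
By default, `pd.to_datetime` raises an error on strings it cannot parse. The following is a minimal sketch of how malformed dates could be located instead, assuming the dates are ISO strings (`YYYY-MM-DD`). The sample values are made up and do not come from the dataset.

```python
import pandas as pd

# Made-up sample: one valid date, one impossible date, one non-date string.
dates = pd.Series(["2020-01-01", "2020-02-30", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# The original strings that failed to parse.
bad = dates[parsed.isna()]
print(bad)
```

If `bad` is empty for the real column, the plain conversion used above is sufficient.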

In [None]:
# check if more cleaning is needed
# checkMissingData(data_no_streams)

A further check for missing data is not needed: our analysis looks at titles, for which missing rank entries do not matter.

For future work on this dataset, we export the cleaned data to a new CSV file.

In [None]:
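# drop the second unnamed column (presumably another leftover index) before exporting
data_no_streams = data_no_streams.drop(columns=['Unnamed: 0.1'])
# note: the dataframe index is written as an extra first column
data_no_streams.to_csv('../data/spotify_2020+_cleaned.csv')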
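
CSV files do not store dtypes, so the converted 'date' column comes back as text when the cleaned file is read again. The sketch below illustrates this with a small stand-in frame and an in-memory buffer instead of the real file. The sample row is made up, and `index=False` is a choice made in this sketch so that no extra unnamed column appears on reading.

```python
import io

import pandas as pd

# Small stand-in for the cleaned frame; the real one comes from the cells above.
cleaned = pd.DataFrame({
    "title": ["Some Title"],
    "artist": ["Some Artist"],
    "date": pd.to_datetime(["2020-01-01"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Without parse_dates, the column comes back as plain strings.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# With parse_dates, it is restored to a datetime column.
buffer.seek(0)
as_dates = pd.read_csv(buffer, parse_dates=["date"])

print(as_text["date"].dtype, as_dates["date"].dtype)
```

Later notebooks that load the cleaned CSV can pass `parse_dates=['date']` to avoid repeating the conversion.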