In [24]:
import pandas as pd
import numpy as np
import plotly.express as px

In this notebook, we clean the Spotify dataset that we use for our project. Please download the entire dataset here: https://www.kaggle.com/datasets/dhruvildave/spotify-charts **Before running this notebook, please run 'convert_spotify_data.py' with the correct paths in the method parameters**.
That short script retains only the years we actually need for our project, which makes the CSV much smaller. (The full dataset is so large that running the script's logic inside the notebook would take a very long time.)
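For readers who do not have the script at hand, the following is a minimal sketch of how such a year filter could be written. It is not the contents of 'convert_spotify_data.py'; the function name `filter_charts_by_year`, its parameters, and the example paths in the comment are assumptions made for illustration. It streams the CSV in chunks so the full file never has to fit in memory.

```python
import pandas as pd


def filter_charts_by_year(src_path, dst_path, min_year=2020, chunksize=500_000):
    """Hypothetical helper: keep only rows whose 'date' is in min_year or later."""
    first_chunk = True
    # Read the source CSV piece by piece instead of loading it all at once
    for chunk in pd.read_csv(src_path, chunksize=chunksize):
        years = pd.to_datetime(chunk["date"]).dt.year
        kept = chunk[years >= min_year]
        # Write the header only once, then append the remaining chunks
        kept.to_csv(
            dst_path,
            mode="w" if first_chunk else "a",
            header=first_chunk,
            index=False,
        )
        first_chunk = False


# Example call (both paths are assumptions, adjust them to your setup):
# filter_charts_by_year("data/charts.csv", "data/spotify_2020+.csv")
```

Note that this sketch writes with `index=False`, so its output would not contain the 'Unnamed: 0' column that appears in the file loaded below.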

In [25]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)
data_raw = pd.read_csv("data/spotify_2020+.csv")


In [26]:
data_raw.head(10)

Unnamed: 0.1,Unnamed: 0,title,rank,date,artist,region,chart,trend,streams
0,19128,Ride It,146,2020-05-01,Regard,Bolivia,top200,MOVE_UP,1956.0
1,19234,+Linda,147,2020-05-01,Dalex,Bolivia,top200,MOVE_DOWN,1953.0
2,19287,Woah,148,2020-05-01,Lil Baby,Canada,top200,MOVE_UP,35211.0
3,19390,25/8,105,2020-05-01,Bad Bunny,Chile,top200,MOVE_DOWN,31854.0
4,19698,Keii,65,2020-05-01,Anuel AA,Colombia,top200,MOVE_UP,16155.0
5,20166,Ride It,99,2020-05-01,Regard,Ecuador,top200,MOVE_UP,5999.0
6,20427,Hei rakas,13,2020-05-01,BEHM,Finland,top200,MOVE_UP,28838.0
7,28326,Tel Me,198,2020-05-01,"Jul, Ninho",France,top200,MOVE_DOWN,33011.0
8,28645,GOOBA,9,2020-06-01,6ix9ine,Austria,top200,MOVE_DOWN,22249.0
9,28799,P2,69,2020-05-01,Lil Uzi Vert,Latvia,top200,MOVE_DOWN,1193.0


First, let's see which columns contain missing values:

In [27]:
data_raw.isna().sum()

Unnamed: 0          0
title               0
rank                0
date                0
artist             18
region              0
chart               0
trend               0
streams       2510831
dtype: int64
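Absolute counts are hard to judge without the total number of rows. As a small sketch on made-up toy data (the values below are illustrative, not taken from the real dataset), the share of missing values per column, and per chart type, can be computed like this:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented)
toy = pd.DataFrame({
    "chart": ["top200", "top200", "viral50", "viral50"],
    "streams": [1956.0, 1953.0, np.nan, np.nan],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Fraction of missing 'streams' values within each chart type
missing_by_chart = toy["streams"].isna().groupby(toy["chart"]).mean()

print(missing_share)
print(missing_by_chart)
```

Applied to `data_raw`, the same two expressions would show whether the missing stream counts cluster in one chart type.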

As the output shows, nearly all missing values occur in the 'streams' column. We can drop this column since we won't need it in our analysis. We will also drop the 'Unnamed: 0' column, which appears to be a leftover row index from the original CSV.

In [28]:
data_no_streams = data_raw.drop(columns=["Unnamed: 0", "streams"])

In this cell, we find out which entries have missing artist fields and inspect them:

In [29]:
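# Boolean mask of missing values (only 'artist' still has any at this point)
artist_missing = data_no_streams.isna()
# Flag rows that contain at least one missing value
row_has_nan = artist_missing.any(axis=1)
rows_with_nan = artist_missing[row_has_nan]
# Look up the affected rows in the data frame by index label
indices_missing = rows_with_nan.index.values
data_no_streams.loc[indices_missing]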

Unnamed: 0,title,rank,date,artist,region,chart,trend
7972060,NO GOOD,10,2020-07-13,,Japan,viral50,NEW_ENTRY
7991853,NO GOOD,10,2020-07-14,,Japan,viral50,SAME_POSITION
8015490,NO GOOD,10,2020-07-15,,Japan,viral50,SAME_POSITION
8037120,NO GOOD,10,2020-07-16,,Japan,viral50,SAME_POSITION
8053041,NO GOOD,10,2020-07-17,,Japan,viral50,SAME_POSITION
8080759,NO GOOD,10,2020-07-18,,Japan,viral50,SAME_POSITION
8102093,NO GOOD,10,2020-07-19,,Japan,viral50,SAME_POSITION
8124034,NO GOOD,13,2020-07-20,,Japan,viral50,MOVE_DOWN
8161478,NO GOOD,14,2020-07-21,,Japan,viral50,MOVE_DOWN
8201506,NO GOOD,19,2020-07-22,,Japan,viral50,MOVE_DOWN
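The same selection can be written more compactly by filtering on the 'artist' column directly. This is a sketch on toy data with invented values, not a replacement for the cell above:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_no_streams (values are invented)
toy = pd.DataFrame({
    "title": ["NO GOOD", "Ride It", "NO GOOD"],
    "artist": [np.nan, "Regard", np.nan],
})

# A boolean mask on one column selects the affected rows in a single step
rows_missing_artist = toy[toy["artist"].isna()]
print(rows_missing_artist)
```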


All rows seem to share the song title 'NO GOOD'. Let's check whether that title is used anywhere else or whether this points to a defective data entry:

In [30]:
mask = data_no_streams["title"].str.contains("NO GOOD")
data_no_streams[mask]

Unnamed: 0,title,rank,date,artist,region,chart,trend
3362515,NO GOOD (feat. MamboLosco),146,2021-01-06,Malerba,Italy,top200,NEW_ENTRY
7972060,NO GOOD,10,2020-07-13,,Japan,viral50,NEW_ENTRY
7991853,NO GOOD,10,2020-07-14,,Japan,viral50,SAME_POSITION
8015490,NO GOOD,10,2020-07-15,,Japan,viral50,SAME_POSITION
8037120,NO GOOD,10,2020-07-16,,Japan,viral50,SAME_POSITION
8053041,NO GOOD,10,2020-07-17,,Japan,viral50,SAME_POSITION
8080759,NO GOOD,10,2020-07-18,,Japan,viral50,SAME_POSITION
8102093,NO GOOD,10,2020-07-19,,Japan,viral50,SAME_POSITION
8124034,NO GOOD,13,2020-07-20,,Japan,viral50,MOVE_DOWN
8161478,NO GOOD,14,2020-07-21,,Japan,viral50,MOVE_DOWN
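Note that `str.contains` performs a substring search (and treats the pattern as a regular expression by default), which is why the Italian track with the longer title shows up as well. A small sketch on toy data contrasting a literal substring match with an exact match:

```python
import pandas as pd

# Toy titles modeled on the output above
toy = pd.DataFrame({
    "title": ["NO GOOD", "NO GOOD (feat. MamboLosco)", "Ride It"],
})

# Literal substring match; regex=False avoids regex interpretation of the pattern
substring_hits = toy["title"].str.contains("NO GOOD", regex=False)

# Exact match on the whole title
exact_hits = toy["title"] == "NO GOOD"

print(toy[substring_hits])
print(toy[exact_hits])
```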


This could be an actual track name. We will therefore fill the NaN values with 'Unknown Artist' instead of deleting the rows:

In [31]:
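# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
data_no_streams["artist"] = data_no_streams["artist"].fillna("Unknown Artist")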

In [32]:
data_no_streams.isna().sum()  # sanity check

title     0
rank      0
date      0
artist    0
region    0
chart     0
trend     0
dtype: int64
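If the check should fail loudly instead of being inspected by eye, it can be turned into an assertion. A minimal sketch on toy data (the frame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing artist (values are invented)
toy = pd.DataFrame({
    "title": ["Ride It", "NO GOOD"],
    "artist": ["Regard", np.nan],
})

# Fill the gap by assigning the result back to the column
toy["artist"] = toy["artist"].fillna("Unknown Artist")

# Raises AssertionError if any missing value is left anywhere in the frame
assert not toy.isna().any().any()
```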