In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

# Data processing - Cleaning of the dataset

In this notebook, we clean the Spotify dataset used in our project. Please download the entire dataset here: https://www.kaggle.com/datasets/dhruvildave/spotify-charts **Before running this notebook, run 'convert_spotify_data.py' with the correct paths in the method parameters.**
That short script retains only the years we actually need for our project, which makes the CSV a lot smaller. (The full dataset is so large that running the script's logic inside the notebook takes a very long time.)
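
The script itself is not reproduced in this notebook. As a rough sketch of the idea only, a large CSV can be filtered in chunks so that it never has to be loaded in full. The function name `filter_years`, its parameters, and the default cutoff year are assumptions for illustration, not the contents of `convert_spotify_data.py`.

```python
import pandas as pd

def filter_years(src_path, dst_path, min_year=2020, chunksize=500_000):
    """Copy the rows dated min_year or later from src_path to dst_path.

    Hypothetical helper: the real convert_spotify_data.py may work differently.
    """
    first_chunk = True
    for chunk in pd.read_csv(src_path, chunksize=chunksize):
        # parse dates per chunk so only one chunk is held in memory at a time
        years = pd.to_datetime(chunk["date"]).dt.year
        kept = chunk[years >= min_year]
        # write the header once, then append the remaining chunks
        kept.to_csv(dst_path, mode="w" if first_chunk else "a",
                    header=first_chunk, index=False)
        first_chunk = False
```

Processing in chunks trades some speed for a bounded memory footprint, which is the main concern with a file of this size.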

Define functions

In [2]:
# check whether a region has dates with fewer than numOfRanks chart entries
def checkMissingRanks(df, region, numOfRanks = 200):
    missing = 0
    df_region = df[df['region'] == region]
    # count the entries per date and sum up the shortfall
    for date, count in df_region['date'].value_counts().items():
        if count < numOfRanks:
            missing += numOfRanks - count
    if missing == 0:
        print('No missing ranks!')
    else:
        print(missing, 'missing rank entries!')

In [3]:
def checkMissingDates(df, region):
    df_region = df[df['region'] == region]
    
    # dates in the dataset's overall range that do not occur for this region
    # (requires 'date' to be a datetime column)
    missing = pd.date_range(start=df['date'].min(), end=df['date'].max()).difference(df_region['date'])
    
    if missing.size == 0:
        print('No missing dates!')
    else:
        print(missing.size, 'missing dates!')
        return missing

In [4]:
def checkMissingData(df, ranks = 200):
    regions = df['region'].unique()
    
    for region in regions:
        print(region)
        # both helpers print their findings; the return values are not used here
        checkMissingDates(df, region)
        checkMissingRanks(df, region, ranks)
        print()
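
The same checks can also be expressed without explicit loops. The sketch below is an alternative to the helpers above, not something the rest of the notebook relies on. The function name `summarize_missing` and the toy data are assumptions, and the 'date' column is expected to be of datetime type.

```python
import pandas as pd

def summarize_missing(df, num_ranks=200):
    """Return, per region, the number of missing dates and missing rank entries."""
    # full calendar spanned by the whole dataset
    all_days = pd.date_range(df["date"].min(), df["date"].max())
    # number of rows per (region, date)
    counts = df.groupby(["region", "date"]).size()
    # shortfall per charted day, never negative
    shortfall = (num_ranks - counts).clip(lower=0)
    return pd.DataFrame({
        "missing_dates": len(all_days) - counts.groupby(level="region").size(),
        "missing_ranks": shortfall.groupby(level="region").sum(),
    })

# made-up data: region A lacks 2020-01-02 and one entry on 2020-01-03
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-03", "2020-01-02"]),
    "rank": [1, 2, 1, 1],
})
summary = summarize_missing(toy, num_ranks=2)
```

Returning a DataFrame instead of printing makes the result easy to sort or filter, for example to list only the regions with gaps.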

Import dataset

In [5]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)
data_raw = pd.read_csv("../data/spotify_2020+.csv")

Sort the values by region, then by date, and finally by rank for later processing.

In [6]:
data_raw.sort_values(by=['region', 'date', 'rank'], inplace=True)

In [7]:
# set a new index after the sort and check the result
data_raw = data_raw.reset_index(drop=True)
data_raw.head(10)

Unnamed: 0.1,Unnamed: 0,title,rank,date,artist,region,chart,trend,streams
0,18739916,Whine Up,1,2020-01-01,"Nicky Jam, Anuel AA",Andorra,viral50,SAME_POSITION,
1,18739917,ROXANNE,2,2020-01-01,Arizona Zervas,Andorra,viral50,MOVE_UP,
2,18739918,Que Calor (with J Balvin & El Alfa),3,2020-01-01,"Major Lazer, Diplo",Andorra,viral50,MOVE_UP,
3,18739919,"Yo x Ti, Tu x Mi",4,2020-01-01,"ROSALÍA, Ozuna",Andorra,viral50,MOVE_UP,
4,18739920,Tutu,5,2020-01-01,"Camilo, Pedro Capó",Andorra,viral50,MOVE_UP,
5,18739921,HIGHEST IN THE ROOM,6,2020-01-01,Travis Scott,Andorra,viral50,MOVE_UP,
6,18739922,Diavla,7,2020-01-01,"Chris Viz, Young Vene",Andorra,viral50,MOVE_UP,
7,18739923,Baila Conmigo (feat. Kelly Ruiz),8,2020-01-01,"Dayvi, Victor Cardenas, Kelly Ruíz",Andorra,viral50,MOVE_UP,
8,18739924,Señorita,9,2020-01-01,"Shawn Mendes, Camila Cabello",Andorra,viral50,MOVE_UP,
9,18739925,Vas A Quedarte,10,2020-01-01,Aitana,Andorra,viral50,MOVE_UP,


First, let's see which columns contain missing values:

In [8]:
data_raw.isna().sum()

Unnamed: 0          0
title               0
rank                0
date                0
artist             18
region              0
chart               0
trend               0
streams       2510831
dtype: int64

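Most missing values occur in the 'streams' column. We can drop this column since we won't need it in our analysis. The 'Unnamed: 0' column is not needed either; note, however, that the cell below removes only 'streams', so 'Unnamed: 0' still appears in the outputs that follow.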

In [9]:
data_no_streams = data_raw.drop(columns=["streams"])

In [10]:
data_no_streams.head(5)

Unnamed: 0.1,Unnamed: 0,title,rank,date,artist,region,chart,trend
0,18739916,Whine Up,1,2020-01-01,"Nicky Jam, Anuel AA",Andorra,viral50,SAME_POSITION
1,18739917,ROXANNE,2,2020-01-01,Arizona Zervas,Andorra,viral50,MOVE_UP
2,18739918,Que Calor (with J Balvin & El Alfa),3,2020-01-01,"Major Lazer, Diplo",Andorra,viral50,MOVE_UP
3,18739919,"Yo x Ti, Tu x Mi",4,2020-01-01,"ROSALÍA, Ozuna",Andorra,viral50,MOVE_UP
4,18739920,Tutu,5,2020-01-01,"Camilo, Pedro Capó",Andorra,viral50,MOVE_UP


In this cell, we find out which entries have missing artist fields and inspect them:

In [11]:
artist_missing = data_no_streams.isna()
row_has_nan = artist_missing.any(axis=1)
rows_with_nan = artist_missing[row_has_nan]
indices_missing = rows_with_nan.index.values
data_no_streams.loc[indices_missing]

Unnamed: 0.1,Unnamed: 0,title,rank,date,artist,region,chart,trend
5715643,20596664,NO GOOD,10,2020-07-13,,Japan,viral50,NEW_ENTRY
5715893,20616457,NO GOOD,10,2020-07-14,,Japan,viral50,SAME_POSITION
5716143,20640094,NO GOOD,10,2020-07-15,,Japan,viral50,SAME_POSITION
5716393,20661724,NO GOOD,10,2020-07-16,,Japan,viral50,SAME_POSITION
5716643,20677645,NO GOOD,10,2020-07-17,,Japan,viral50,SAME_POSITION
5716893,20705363,NO GOOD,10,2020-07-18,,Japan,viral50,SAME_POSITION
5717143,20726697,NO GOOD,10,2020-07-19,,Japan,viral50,SAME_POSITION
5717399,20748638,NO GOOD,13,2020-07-20,,Japan,viral50,MOVE_DOWN
5717651,20788975,NO GOOD,14,2020-07-21,,Japan,viral50,MOVE_DOWN
5717911,20833982,NO GOOD,19,2020-07-22,,Japan,viral50,MOVE_DOWN
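
Since 'artist' is the only column that still contains missing values at this point, the same rows can be selected with a single boolean mask. A minimal sketch on a made-up frame (the toy values are assumptions, not rows from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C"],
    "artist": ["Artist 1", None, "Artist 3"],
})

# boolean mask of rows whose 'artist' is NaN, used directly for selection
missing_artist = toy[toy["artist"].isna()]
```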


All rows seem to use the song title 'NO GOOD'. Let's check whether that title is used anywhere else or whether this is a defective data entry:

In [12]:
mask = data_no_streams["title"].str.contains("NO GOOD")
data_no_streams[mask]

Unnamed: 0.1,Unnamed: 0,title,rank,date,artist,region,chart,trend
5583537,15883417,NO GOOD (feat. MamboLosco),146,2021-01-06,Malerba,Italy,top200,NEW_ENTRY
5715643,20596664,NO GOOD,10,2020-07-13,,Japan,viral50,NEW_ENTRY
5715893,20616457,NO GOOD,10,2020-07-14,,Japan,viral50,SAME_POSITION
5716143,20640094,NO GOOD,10,2020-07-15,,Japan,viral50,SAME_POSITION
5716393,20661724,NO GOOD,10,2020-07-16,,Japan,viral50,SAME_POSITION
5716643,20677645,NO GOOD,10,2020-07-17,,Japan,viral50,SAME_POSITION
5716893,20705363,NO GOOD,10,2020-07-18,,Japan,viral50,SAME_POSITION
5717143,20726697,NO GOOD,10,2020-07-19,,Japan,viral50,SAME_POSITION
5717399,20748638,NO GOOD,13,2020-07-20,,Japan,viral50,MOVE_DOWN
5717651,20788975,NO GOOD,14,2020-07-21,,Japan,viral50,MOVE_DOWN
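
Note that `str.contains` matches substrings and treats the pattern as a regular expression by default, which is why the Italian track with the longer title appears in the result. A small sketch of the difference between a substring match and an exact match, using two titles from the output above plus one made-up title:

```python
import pandas as pd

titles = pd.Series(["NO GOOD", "NO GOOD (feat. MamboLosco)", "Some Other Title"])

# substring match, with regex interpretation switched off
contains_mask = titles.str.contains("NO GOOD", regex=False)
# exact match on the whole title
exact_mask = titles == "NO GOOD"
```

The exact match is the stricter test when the question is whether the very same title occurs elsewhere.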


This could be an actual track name. We will therefore fill the NaN values with 'Unknown artist' instead of deleting the rows:

In [13]:
# assign the result back instead of using inplace=True on a column selection
data_no_streams["artist"] = data_no_streams["artist"].fillna("Unknown artist")

In [14]:
# sanity check
data_no_streams.isna().sum()

Unnamed: 0    0
title         0
rank          0
date          0
artist        0
region        0
chart         0
trend         0
dtype: int64

Check the dtypes in the dataframe

In [15]:
# check datatypes
data_no_streams.dtypes

Unnamed: 0     int64
title         object
rank           int64
date          object
artist        object
region        object
chart         object
trend         object
dtype: object

Convert 'date' column to date format

In [16]:
data_no_streams['date'] = pd.to_datetime(data_no_streams.date)
data_no_streams.dtypes

Unnamed: 0             int64
title                 object
rank                   int64
date          datetime64[ns]
artist                object
region                object
chart                 object
trend                 object
dtype: object
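
If the date strings are known to follow a single layout, passing an explicit format makes the conversion stricter, and `errors="coerce"` turns unparsable values into `NaT` so that they can be counted. A sketch on made-up values, assuming the `YYYY-MM-DD` layout seen in the outputs above:

```python
import pandas as pd

dates = pd.Series(["2020-01-01", "2020-07-13", "not a date"])

# values that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
n_bad = parsed.isna().sum()
```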

We do not check for further missing data (missing dates or ranks), because our analysis focuses on titles, for which the information about missing ranks is not needed.

For future work on this dataset, we export the cleaned data to a new CSV file.

In [17]:
data_no_streams.to_csv('../data/spotify_2020+_cleaned.csv', index=False)
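
CSV files do not store dtypes, so the 'date' column comes back as plain strings unless it is parsed again on load. A sketch of how a later notebook might read the file, shown here with a temporary file and made-up rows rather than the real path:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "title": ["Song A", "Song B"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned.csv")
    toy.to_csv(path, index=False)
    # parse_dates restores a datetime dtype for the listed columns
    reloaded = pd.read_csv(path, parse_dates=["date"])
```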