### Prepping Data Challenge: Making Spotify Data Spotless (Week 26)
 
### Requirements
- Input the data
- Create a new field which would break down milliseconds into seconds and minutes
  - e.g. 208,168 turned into minutes would be 3.47min
- Extract the year from the timestamp field
- Rank the artists by total minutes played overall
- For each year, find the ranking of the artists by total minutes played
- Reshape the data so we can compare how artist position changes year to year
- Filter to the overall top 100 artists
- Output the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
#input the data
df = pd.read_csv('wk26-input.csv', parse_dates = ['ts'])\
               .rename(columns={'ts':'Timestamp'})

In [3]:
#Keep only the columns needed for the analysis
df = df[['Artist Name','ms_played','Timestamp']]

In [4]:
df.head()

,Artist Name,ms_played,Timestamp
0,Seether,208186,2016-02-03 17:46:42+00:00
1,Beissoul & Einius,203738,2019-11-11 08:03:03+00:00
2,The Beta Machine,247265,2021-11-17 09:17:32+00:00
3,Weezer,204480,2016-07-05 20:04:26+00:00
4,Beissoul & Einius,203738,2019-11-16 11:06:24+00:00


In [5]:
#Convert milliseconds to minutes, rounded to 2 decimal places
#(note: this overwrites ms_played, which holds minutes from here on)
df['ms_played'] = np.round(df['ms_played']/60000,2)
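
The requirement mentions breaking milliseconds down into both seconds and minutes, while the cell above keeps only decimal minutes. A minimal sketch of both forms follows, using made-up sample values (not rows from the input file) and hypothetical column names `minutes`, `min_part` and `sec_part`:

```python
import pandas as pd

# Hypothetical sample values, not taken from the challenge input file
sample = pd.DataFrame({"ms_played": [208168, 60000, 90500]})

# Decimal minutes, rounded to two places (same arithmetic as the cell above)
sample["minutes"] = (sample["ms_played"] / 60000).round(2)

# Whole minutes plus leftover whole seconds, for a mm:ss style breakdown
total_seconds = sample["ms_played"] // 1000
sample["min_part"] = total_seconds // 60
sample["sec_part"] = total_seconds % 60

print(sample)
```

Keeping the minutes in a separately named column, rather than overwriting `ms_played`, would also make the later cells easier to read, at the cost of renaming the column wherever it is used.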

In [6]:
#Extract the year from the timestamp field
df['year'] = pd.to_datetime(df['Timestamp']).dt.year
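
As a quick check of the year extraction, the sketch below parses two of the timestamps shown in the `df.head()` output. Passing `utc=True` is an assumption made here so that the year is always read in UTC; the names `stamps`, `parsed` and `years` are illustrative only.

```python
import pandas as pd

# Two timestamps copied from the head() output above
stamps = pd.Series(["2016-02-03 17:46:42+00:00", "2019-11-11 08:03:03+00:00"])

# Parse to timezone-aware datetimes, then take the calendar year
parsed = pd.to_datetime(stamps, utc=True)
years = parsed.dt.year

print(years.tolist())
```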

In [7]:
#Total minutes played per artist, repeated on each of that artist's rows
df['total ms_played'] = df.groupby('Artist Name')['ms_played'].transform('sum')

In [8]:
#Rank the artists by total minutes played overall
df['Overall Rank'] = df['total ms_played'].rank(method = 'dense', ascending = False).astype('int')
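
To see what the per-artist total and the dense rank do, here is a small sketch on an invented play log (the artists `A`, `B`, `C` and the column names `minutes` and `total_minutes` are hypothetical). It is meant as an illustration of the technique, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical play log: one row per play, minutes already converted
plays = pd.DataFrame({
    "Artist Name": ["A", "B", "A", "C", "B", "C"],
    "minutes": [3.0, 2.0, 4.0, 1.0, 5.0, 0.5],
})

# Broadcast each artist's total back onto every one of their rows
plays["total_minutes"] = plays.groupby("Artist Name")["minutes"].transform("sum")

# Dense rank: the largest total gets 1, and tied totals share a rank with no gaps
plays["Overall Rank"] = (
    plays["total_minutes"].rank(method="dense", ascending=False).astype(int)
)

# One row per artist, to inspect the result
overall = plays.drop_duplicates("Artist Name").set_index("Artist Name")["Overall Rank"]
print(overall)
```

In this toy data `A` and `B` both total 7 minutes, so they share rank 1 and `C` follows at rank 2. With `method='min'` or `method='max'` the rank after a tie would skip a position instead.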

In [9]:
# Reshape the data so we can compare how artist position changes year to year
df = df.pivot_table('ms_played', index= ['Overall Rank', 'Artist Name'], columns = 'year', aggfunc= 'sum').reset_index()

In [10]:
# For each year, find the ranking of the artists by total minutes played
# (each year column is ranked separately; artists with no plays that year stay NaN)
df = df.set_index(['Artist Name', 'Overall Rank']).rank(method = 'max', ascending = False).reset_index()
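
The pivot-then-rank pattern can be checked on a tiny invented dataset. The sketch below assumes a long table with hypothetical columns `Artist Name`, `year` and `minutes`, and shows that `DataFrame.rank` works column by column and leaves missing years as `NaN`.

```python
import pandas as pd

# Hypothetical yearly listening data in long form
yearly = pd.DataFrame({
    "Artist Name": ["A", "A", "B", "B", "C"],
    "year": [2020, 2021, 2020, 2021, 2021],
    "minutes": [10.0, 2.0, 5.0, 8.0, 3.0],
})

# One row per artist, one column per year, summing minutes within each cell
wide = yearly.pivot_table("minutes", index="Artist Name", columns="year", aggfunc="sum")

# Rank within each year column; the most-played artist gets rank 1
ranks = wide.rank(method="max", ascending=False)
print(ranks)
```

Artist `C` has no plays in 2020, so its 2020 rank stays `NaN`, which matches the blank cells in the real output.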

In [12]:
df.head()

year,Artist Name,Overall Rank,2015,2016,2017,2018,2019,2020,2021,2022
0,Shinedown,1,1.0,7.0,4.0,14.0,66.0,23.0,367.0,337.0
1,Eminem,2,82.0,1.0,5.0,9.0,3.0,2.0,52.0,
2,ONE OK ROCK,3,12.0,3.0,1.0,3.0,4.0,207.0,242.0,
3,Five Finger Death Punch,4,335.0,4.0,3.0,2.0,6.0,4.0,100.0,
4,Seether,5,4.0,6.0,24.0,119.0,34.0,49.0,134.0,


In [13]:
#Filter to the overall top 100 artists
df = df[df['Overall Rank'] <= 100]

In [14]:
df.head(10)

year,Artist Name,Overall Rank,2015,2016,2017,2018,2019,2020,2021,2022
0,Shinedown,1,1.0,7.0,4.0,14.0,66.0,23.0,367.0,337.0
1,Eminem,2,82.0,1.0,5.0,9.0,3.0,2.0,52.0,
2,ONE OK ROCK,3,12.0,3.0,1.0,3.0,4.0,207.0,242.0,
3,Five Finger Death Punch,4,335.0,4.0,3.0,2.0,6.0,4.0,100.0,
4,Seether,5,4.0,6.0,24.0,119.0,34.0,49.0,134.0,
5,Papa Roach,6,16.0,40.0,2.0,6.0,16.0,69.0,170.0,72.0
6,3 Doors Down,7,3.0,8.0,50.0,144.0,,647.0,,
7,Stone Sour,8,11.0,57.0,20.0,12.0,26.0,11.0,7.0,169.0
8,Foo Fighters,9,5.0,11.0,48.0,91.0,62.0,,14.0,
9,Nickelback,10,2.0,39.0,47.0,333.0,435.0,,,


In [15]:
#output the data 
df.to_csv('wk26-output.csv', index=False)