# Processing the initial [Spotify Charts dataset](https://www.kaggle.com/datasets/dhruvildave/spotify-charts) from Kaggle

## Get dataset filepath

We have downloaded the dataset and placed it in the `data` folder. The file is called `charts.csv`.

In [2]:
from helpers import create_data_path

In [3]:
kaggle_data_path = create_data_path("charts.csv")
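
The `helpers` module is not shown in this notebook. As a rough sketch of what `create_data_path` might look like, assuming it simply joins a file name onto the `data` folder (the `DATA_DIR` constant and the function body are hypothetical, and the real helper may resolve the folder differently):

```python
from pathlib import Path

# Hypothetical location of the data folder, relative to the working directory.
DATA_DIR = Path("data")


def create_data_path(filename: str) -> Path:
    """Return the path of `filename` inside the data folder."""
    return DATA_DIR / filename


kaggle_data_path = create_data_path("charts.csv")
```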

## Inspect file structure

In [4]:
!head $kaggle_data_path

title,rank,date,artist,url,region,chart,trend,streams
Chantaje (feat. Maluma),1,2017-01-01,Shakira,https://open.spotify.com/track/6mICuAdrwEjh6Y6lroV2Kg,Argentina,top200,SAME_POSITION,253019
Vente Pa' Ca (feat. Maluma),2,2017-01-01,Ricky Martin,https://open.spotify.com/track/7DM4BPaS7uofFul3ywMe46,Argentina,top200,MOVE_UP,223988
Reggaetón Lento (Bailemos),3,2017-01-01,CNCO,https://open.spotify.com/track/3AEZUABDXNtecAOSC1qTfo,Argentina,top200,MOVE_DOWN,210943
Safari,4,2017-01-01,"J Balvin, Pharrell Williams, BIA, Sky",https://open.spotify.com/track/6rQSrBHf7HlZjtcMZ4S4bO,Argentina,top200,SAME_POSITION,173865
Shaky Shaky,5,2017-01-01,Daddy Yankee,https://open.spotify.com/track/58IL315gMSTD37DOZPJ2hf,Argentina,top200,MOVE_UP,153956
Traicionera,6,2017-01-01,Sebastian Yatra,https://open.spotify.com/track/5J1c3M4EldCfNxXwrwt8mT,Argentina,top200,MOVE_DOWN,151140
Cuando Se Pone a Bailar,7,2017-01-01,Rombai,https://open.spotify.com/track/1MpKZi1zTXpERKwxmOu1PH,Argentina,top200,MOVE_DOWN,148369
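
The `!head` shell escape only works inside IPython/Jupyter on systems that provide a `head` command. A portable sketch of the same peek in plain Python, assuming the file is UTF-8 encoded (the `peek` helper name is made up for illustration):

```python
from itertools import islice


def peek(path, n=10, encoding="utf-8"):
    """Return the first `n` lines of a text file without reading the rest."""
    with open(path, encoding=encoding) as handle:
        # islice stops after n lines, so the large file is never fully loaded.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Calling `peek(kaggle_data_path)` would then return the header and the first data rows as a list of strings.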

## Load as dataframe

Loading takes a while because the file is large (more than 3 GB).

In [5]:
import pandas as pd

kaggle_data = pd.read_csv(kaggle_data_path)
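
With a file this large, it can help to tell pandas the column types up front. The sketch below shows the idea on a two-row in-memory sample with the same header. The dtype choices (`category` for the repetitive text columns, parsed dates) are an assumption about what suits this dataset, not something the notebook itself does:

```python
import io

import pandas as pd

# Two rows copied from the file preview, used as a stand-in for charts.csv.
sample = io.StringIO(
    "title,rank,date,artist,url,region,chart,trend,streams\n"
    "Chantaje (feat. Maluma),1,2017-01-01,Shakira,"
    "https://open.spotify.com/track/6mICuAdrwEjh6Y6lroV2Kg,"
    "Argentina,top200,SAME_POSITION,253019\n"
    "Shaky Shaky,5,2017-01-01,Daddy Yankee,"
    "https://open.spotify.com/track/58IL315gMSTD37DOZPJ2hf,"
    "Argentina,top200,MOVE_UP,153956\n"
)

typed = pd.read_csv(
    sample,
    # Columns with few distinct values are stored more compactly as categories.
    dtype={"region": "category", "chart": "category", "trend": "category"},
    # Parse the ISO-formatted date strings into datetime values.
    parse_dates=["date"],
)
print(typed.dtypes)
```

For the real file, `sample` would be replaced by `kaggle_data_path`.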

In [6]:
kaggle_data.head()

Unnamed: 0,title,rank,date,artist,url,region,chart,trend,streams
0,Chantaje (feat. Maluma),1,2017-01-01,Shakira,https://open.spotify.com/track/6mICuAdrwEjh6Y6...,Argentina,top200,SAME_POSITION,253019.0
1,Vente Pa' Ca (feat. Maluma),2,2017-01-01,Ricky Martin,https://open.spotify.com/track/7DM4BPaS7uofFul...,Argentina,top200,MOVE_UP,223988.0
2,Reggaetón Lento (Bailemos),3,2017-01-01,CNCO,https://open.spotify.com/track/3AEZUABDXNtecAO...,Argentina,top200,MOVE_DOWN,210943.0
3,Safari,4,2017-01-01,"J Balvin, Pharrell Williams, BIA, Sky",https://open.spotify.com/track/6rQSrBHf7HlZjtc...,Argentina,top200,SAME_POSITION,173865.0
4,Shaky Shaky,5,2017-01-01,Daddy Yankee,https://open.spotify.com/track/58IL315gMSTD37D...,Argentina,top200,MOVE_UP,153956.0


## Process columns, extract Top 200 data

For convenience, rename the `rank` column to `pos`. `rank` is also the name of a [dataframe method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html) in pandas, so attribute access via `kaggle_data.rank` returns that method instead of the column (bracket access via `kaggle_data["rank"]` would still work).

In [7]:
kaggle_data = kaggle_data.rename(columns={"rank": "pos"})
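
The name clash can be seen on a small toy dataframe (the two rows are taken from the preview, everything else is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"title": ["Safari", "Traicionera"], "rank": [4, 6]})

# Attribute access resolves to the DataFrame.rank method, not the column.
print(callable(df.rank))    # True

# Bracket access still returns the column.
print(df["rank"].tolist())  # [4, 6]

# After renaming, attribute access returns the column as expected.
df = df.rename(columns={"rank": "pos"})
print(df.pos.tolist())      # [4, 6]
```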

In [8]:
kaggle_data

Unnamed: 0,title,pos,date,artist,url,region,chart,trend,streams
0,Chantaje (feat. Maluma),1,2017-01-01,Shakira,https://open.spotify.com/track/6mICuAdrwEjh6Y6...,Argentina,top200,SAME_POSITION,253019.0
1,Vente Pa' Ca (feat. Maluma),2,2017-01-01,Ricky Martin,https://open.spotify.com/track/7DM4BPaS7uofFul...,Argentina,top200,MOVE_UP,223988.0
2,Reggaetón Lento (Bailemos),3,2017-01-01,CNCO,https://open.spotify.com/track/3AEZUABDXNtecAO...,Argentina,top200,MOVE_DOWN,210943.0
3,Safari,4,2017-01-01,"J Balvin, Pharrell Williams, BIA, Sky",https://open.spotify.com/track/6rQSrBHf7HlZjtc...,Argentina,top200,SAME_POSITION,173865.0
4,Shaky Shaky,5,2017-01-01,Daddy Yankee,https://open.spotify.com/track/58IL315gMSTD37D...,Argentina,top200,MOVE_UP,153956.0
...,...,...,...,...,...,...,...,...,...
26173509,BYE,46,2021-07-31,Jaden,https://open.spotify.com/track/3OUyyDN7EZrL7i0...,Vietnam,viral50,MOVE_UP,
26173510,Pillars,47,2021-07-31,My Anh,https://open.spotify.com/track/6eky30oFiQbHUAT...,Vietnam,viral50,NEW_ENTRY,
26173511,Gái Độc Thân,48,2021-07-31,Tlinh,https://open.spotify.com/track/2klsSb2iTfgDh95...,Vietnam,viral50,MOVE_DOWN,
26173512,Renegade (feat. Taylor Swift),49,2021-07-31,Big Red Machine,https://open.spotify.com/track/1aU1wpYBSpP0M6I...,Vietnam,viral50,MOVE_DOWN,


In [9]:
kaggle_data.chart.value_counts()

chart
top200     20321904
viral50     5851610
Name: count, dtype: int64

The Viral 50 charts are irrelevant for us: they are not based on raw stream counts (their `streams` values are missing in the table above) but rather represent a curated playlist, which makes them less interesting from a data science perspective.
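
One way to check per chart type how many rows lack a stream count is to compute the share of missing values by group. This is only a sketch; the toy frame below stands in for `kaggle_data`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "chart": ["top200", "top200", "viral50", "viral50"],
        "streams": [253019.0, 223988.0, np.nan, np.nan],
    }
)

# Fraction of rows per chart type whose stream count is missing.
missing_share = toy["streams"].isna().groupby(toy["chart"]).mean()
print(missing_share)
```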

Furthermore, the `url` column is largely redundant: the track ID at the end of each URL is all we need.

In [10]:
def get_id(spotify_url):
    """Return the track ID, i.e. the last path segment of a Spotify track URL."""
    return spotify_url.split("/")[-1]

In [11]:
get_id("https://open.spotify.com/track/2VxeLyX666F8uXCJ0dZF8B")

'2VxeLyX666F8uXCJ0dZF8B'
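
To apply the same extraction to a whole `url` column, pandas' vectorized string methods are one option. The following is a sketch on two URLs from the preview; the `track_id` column name is an assumption:

```python
import pandas as pd

urls = pd.DataFrame(
    {
        "url": [
            "https://open.spotify.com/track/6mICuAdrwEjh6Y6lroV2Kg",
            "https://open.spotify.com/track/7DM4BPaS7uofFul3ywMe46",
        ]
    }
)

# Split once from the right and keep the last piece, i.e. the track ID.
urls["track_id"] = urls["url"].str.rsplit("/", n=1).str[-1]

# The original URL column can then be dropped.
urls = urls.drop(columns=["url"])
print(urls)
```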

In [12]:
kaggle_top200 = (
    kaggle_data[kaggle_data.chart == "top200"]
    .drop(columns=["chart"])
)

In [13]:
kaggle_top200

Unnamed: 0,title,pos,date,artist,url,region,trend,streams
0,Chantaje (feat. Maluma),1,2017-01-01,Shakira,https://open.spotify.com/track/6mICuAdrwEjh6Y6...,Argentina,SAME_POSITION,253019.0
1,Vente Pa' Ca (feat. Maluma),2,2017-01-01,Ricky Martin,https://open.spotify.com/track/7DM4BPaS7uofFul...,Argentina,MOVE_UP,223988.0
2,Reggaetón Lento (Bailemos),3,2017-01-01,CNCO,https://open.spotify.com/track/3AEZUABDXNtecAO...,Argentina,MOVE_DOWN,210943.0
3,Safari,4,2017-01-01,"J Balvin, Pharrell Williams, BIA, Sky",https://open.spotify.com/track/6rQSrBHf7HlZjtc...,Argentina,SAME_POSITION,173865.0
4,Shaky Shaky,5,2017-01-01,Daddy Yankee,https://open.spotify.com/track/58IL315gMSTD37D...,Argentina,MOVE_UP,153956.0
...,...,...,...,...,...,...,...,...
25276069,Ojalá (feat. Darell),196,2018-01-31,"De La Ghetto, Almighty, Bryant Myers",https://open.spotify.com/track/3EMDvnVpQd9RZJv...,Uruguay,MOVE_DOWN,1178.0
25276070,Lo Que Pasa en la Noche,197,2018-01-31,Mano Arriba,https://open.spotify.com/track/2eOleVJlGvBE027...,Uruguay,NEW_ENTRY,1178.0
25276071,El Equivocado,198,2018-01-31,Mano Arriba,https://open.spotify.com/track/5vy1C7DD9xJ5fRB...,Uruguay,MOVE_DOWN,1170.0
25276072,Que Fui Tu Amante,199,2018-01-31,El Gucci y Su Banda,https://open.spotify.com/track/1fmiCxwEbZFIszI...,Uruguay,MOVE_DOWN,1165.0


## Write to file

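It's smarter to store the data in the Parquet format, a compressed columnar format that is well suited to large datasets (reading and writing are typically faster than with CSV, and the output files are much smaller). Note that pandas needs a Parquet engine for this to work, so `pyarrow` (or alternatively `fastparquet`) has to be installed!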

In [14]:
out_path = create_data_path("kaggle_top200.parquet")

kaggle_top200.to_parquet(out_path)
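
A small round trip can confirm that a Parquet engine is available and that the data survives writing and reading. This is a sketch on a toy dataframe written to a temporary directory, not the notebook's actual output file:

```python
import importlib.util
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame(
    {"title": ["Safari", "Shaky Shaky"], "pos": [4, 5], "streams": [173865.0, 153956.0]}
)

# pandas delegates Parquet I/O to an engine; check that one is importable.
engine_available = any(
    importlib.util.find_spec(name) is not None for name in ("pyarrow", "fastparquet")
)

if engine_available:
    with tempfile.TemporaryDirectory() as tmp:
        toy_path = Path(tmp) / "toy.parquet"
        toy.to_parquet(toy_path)
        restored = pd.read_parquet(toy_path)
    # Column types are stored in the file, so no re-parsing is needed.
    print(restored.dtypes)
else:
    print("Install pyarrow (or fastparquet) to write Parquet files.")
```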