# Analyzing Spotify weekly top 200 data

Spotify has become one of the big players in the global music market in recent years. With hundreds of millions of users streaming songs across the globe, it publishes several [music rankings/charts](https://charts.spotify.com/home). The weekly Top 200 most-streamed songs are therefore probably a solid estimate of how popular a piece of music is. I figured it would be interesting to look at how the Top 200 charts change in several countries.

After quite some time spent searching, I finally found [this](https://www.kaggle.com/datasets/dhruvildave/spotify-charts) dataset on Kaggle. It contains the Top 200 and Viral 50 weekly charts for several countries across five years, from 2017 up until the end of 2021. I've downloaded and extracted the ZIP archive. Let's look at the data!

## Analyzing the data

In [1]:
import pandas as pd

In [2]:
from helpers import get_data_path

The CSV is **really** big (3.5 GB), so loading it will take quite a while...

In [22]:
charts = pd.read_csv(
    get_data_path("charts.csv"),
    dtype={
        "title": "category",
        "artist": "category",
        "url": "category",
        "region": "category",
        "chart": "category",
        "trend": "category",
    },
    parse_dates=["date"]
)
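To get a feeling for why the `category` dtype is worth specifying for such a large file, here is a minimal sketch on toy data (the column values are made up and only stand in for the real CSV). It compares the memory footprint of repeated strings with that of the same columns stored as categories. Columns with many distinct values, such as `url`, will probably benefit less than `region` or `trend`.

```python
import pandas as pd

# Toy stand-in for the charts data: few distinct values, repeated many times.
toy = pd.DataFrame({
    "region": ["Argentina", "Austria", "Brazil"] * 10_000,
    "trend": ["MOVE_UP", "MOVE_DOWN", "SAME_POSITION"] * 10_000,
})

# Memory used when the columns are stored as plain strings.
bytes_as_strings = toy.memory_usage(deep=True).sum()

# Memory used after converting the same columns to the categorical dtype.
toy_cat = toy.astype("category")
bytes_as_category = toy_cat.memory_usage(deep=True).sum()

print(f"strings:  {bytes_as_strings:,} bytes")
print(f"category: {bytes_as_category:,} bytes")
```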

This really is a huge dataset, with around 26 million rows! It is very nice that the data is separated by region.

### Column value counts

In [26]:
charts.trend.value_counts()

MOVE_DOWN        11220434
MOVE_UP           9801048
SAME_POSITION     3298392
NEW_ENTRY         1853640
Name: trend, dtype: int64

In [27]:
charts.chart.value_counts()

top200     20321904
viral50     5851610
Name: chart, dtype: int64

In [28]:
charts.region.value_counts()

Argentina        455308
United States    455057
Austria          454593
Brazil           454439
Australia        453103
                  ...  
Ukraine          127544
Russia           126837
Luxembourg        98053
Andorra           79592
South Korea       76276
Name: region, Length: 70, dtype: int64

In [29]:
charts.url.nunique()

217704

In [30]:
charts.artist.nunique()

96156

Those numbers are quite surprising: more than 96k artists and 217k songs? Personally, I wouldn't have expected that much variety.

### Extracting Top 200 data

The Viral 50 charts aren't of much interest here, as they are apparently [curated](https://community.spotify.com/t5/Your-Library/What-s-the-difference-between-daily-top-50-and-viral-top-50/td-p/4973312) and not based purely on stream counts. In general, Spotify does not seem to explain the provenance of the Viral 50 data in any detail. So we'll stick to the Top 200, which is based purely on stream counts.

In [31]:
top200 = charts.loc[charts.chart == 'top200'].drop(columns="chart")
top200

| | title | rank | date | artist | url | region | trend | streams |
|---|---|---|---|---|---|---|---|---|
| 0 | Chantaje (feat. Maluma) | 1 | 2017-01-01 | Shakira | https://open.spotify.com/track/6mICuAdrwEjh6Y6... | Argentina | SAME_POSITION | 253019.0 |
| 1 | Vente Pa' Ca (feat. Maluma) | 2 | 2017-01-01 | Ricky Martin | https://open.spotify.com/track/7DM4BPaS7uofFul... | Argentina | MOVE_UP | 223988.0 |
| 2 | Reggaetón Lento (Bailemos) | 3 | 2017-01-01 | CNCO | https://open.spotify.com/track/3AEZUABDXNtecAO... | Argentina | MOVE_DOWN | 210943.0 |
| 3 | Safari | 4 | 2017-01-01 | J Balvin, Pharrell Williams, BIA, Sky | https://open.spotify.com/track/6rQSrBHf7HlZjtc... | Argentina | SAME_POSITION | 173865.0 |
| 4 | Shaky Shaky | 5 | 2017-01-01 | Daddy Yankee | https://open.spotify.com/track/58IL315gMSTD37D... | Argentina | MOVE_UP | 153956.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25276069 | Ojalá (feat. Darell) | 196 | 2018-01-31 | De La Ghetto, Almighty, Bryant Myers | https://open.spotify.com/track/3EMDvnVpQd9RZJv... | Uruguay | MOVE_DOWN | 1178.0 |
| 25276070 | Lo Que Pasa en la Noche | 197 | 2018-01-31 | Mano Arriba | https://open.spotify.com/track/2eOleVJlGvBE027... | Uruguay | NEW_ENTRY | 1178.0 |
| 25276071 | El Equivocado | 198 | 2018-01-31 | Mano Arriba | https://open.spotify.com/track/5vy1C7DD9xJ5fRB... | Uruguay | MOVE_DOWN | 1170.0 |
| 25276072 | Que Fui Tu Amante | 199 | 2018-01-31 | El Gucci y Su Banda | https://open.spotify.com/track/1fmiCxwEbZFIszI... | Uruguay | MOVE_DOWN | 1165.0 |


### Verifying completeness of data

In [32]:
top200.date.min()

Timestamp('2017-01-01 00:00:00')

In [33]:
top200.date.max()

Timestamp('2021-12-31 00:00:00')

At first glance it looks like we indeed have data for 5 years. But is it really complete for all countries/regions?

In [36]:
top200_rows_per_week_and_region = top200.loc[:,["date","region"]].groupby("date").value_counts()
top200_rows_per_week_and_region

date        region   
2017-01-01  Argentina    200
            Ecuador      200
            Italy        200
            Ireland      200
            Indonesia    200
                        ... 
2021-12-31  Iceland        0
            Italy          0
            Ireland        0
            Norway         0
            Ukraine        0
Length: 127610, dtype: int64

Looks like some data is missing...
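A note on the zeros: the result has 127610 entries, which equals 1823 × 70, so it looks like we get one entry for every combination of date and region, including combinations that have no rows at all. This is likely a consequence of `region` being a categorical column. The sketch below reproduces the effect on toy data (the values are made up) using the `observed` parameter of `groupby`.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-02"]),
    "region": pd.Categorical(["Argentina", "Uruguay", "Argentina"]),
})

# observed=False keeps every (date, category) combination, so the pair
# (2017-01-02, Uruguay) is reported with a count of 0.
all_combinations = toy.groupby(["date", "region"], observed=False).size()

# observed=True only reports combinations that actually occur in the data.
observed_only = toy.groupby(["date", "region"], observed=True).size()

print(all_combinations)
print(observed_only)
```

So a count of 0 means that a region has no rows at all for that date, while a count between 1 and 199 means that the chart for that date is only partially present.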

In [49]:
top_less_than200 = top200_rows_per_week_and_region.loc[top200_rows_per_week_and_region != 200].reset_index()
top_less_than200

| | date | region | 0 |
|---|---|---|---|
| 0 | 2017-01-01 | Uruguay | 199 |
| 1 | 2017-01-01 | Czech Republic | 136 |
| 2 | 2017-01-01 | Guatemala | 130 |
| 3 | 2017-01-01 | Dominican Republic | 129 |
| 4 | 2017-01-01 | Panama | 111 |
| ... | ... | ... | ... |
| 37172 | 2021-12-31 | Iceland | 0 |
| 37173 | 2021-12-31 | Italy | 0 |
| 37174 | 2021-12-31 | Ireland | 0 |
| 37175 | 2021-12-31 | Norway | 0 |


In [51]:
top_less_than200.region.unique()

['Uruguay', 'Czech Republic', 'Guatemala', 'Dominican Republic', 'Panama', ..., 'Singapore', 'Turkey', 'Taiwan', 'Switzerland', 'Portugal']
Length: 70
Categories (70, object): ['Andorra', 'Argentina', 'Australia', 'Austria', ..., 'United Arab Emirates', 'Russia', 'South Korea', 'Ukraine']

Oh no, does that mean that we don't have 100% complete data for any of the countries?
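To answer that question more precisely, we could compute for each region the share of calendar days on which the chart is complete. The following is only a sketch on toy data: the helper name `complete_day_share` is made up for illustration, and it assumes a frame with `date` and `region` columns like `top200`.

```python
import pandas as pd

def complete_day_share(df, expected_rows=200):
    """Return, per region, the share of calendar days with a full chart."""
    # Every calendar day between the first and last date in the data.
    all_days = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
    # Rows per (region, date); only combinations present in the data.
    rows = df.groupby(["region", "date"], observed=True).size()
    # Number of days per region that reach the expected row count.
    full_days = (rows == expected_rows).groupby(level="region", observed=True).sum()
    return (full_days / len(all_days)).sort_values(ascending=False)

# Toy data: three days and an expected chart length of 2 rows per day.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-01-01", "2017-01-01", "2017-01-02",
        "2017-01-02", "2017-01-03", "2017-01-03",  # region A: all days full
        "2017-01-01", "2017-01-01", "2017-01-03",  # region B: one full day
    ]),
    "region": ["A"] * 6 + ["B"] * 3,
})
share = complete_day_share(toy, expected_rows=2)
print(share)
```

Applied to the real data, this would be called as `complete_day_share(top200)`. A region with a share of 1.0 would have a full Top 200 for every day in the covered period.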

## "Enriching the dataset" with data from the Spotify API

We can use the Spotify API to obtain additional data for the songs in the charts.

To access the API we need to have a Spotify developer account and register a new application to obtain a client ID and secret. For this notebook, we have already done that and added a `.env` file with the following content:
```
SPOTIPY_CLIENT_ID='our-client-id-would-be-here'
SPOTIPY_CLIENT_SECRET='our-client-secret-would-be-here'
SPOTIPY_REDIRECT_URI='http://127.0.0.1:9090'
```

These environment variables will be used by the `spotipy` package, a lightweight Python library for getting data from the Spotify API.

We also added the redirect URL `http://127.0.0.1:9090` from the `SPOTIPY_REDIRECT_URI` variable to the application in the Developer Console on the Spotify website. `spotipy` will "instantiate a server on the indicated response to receive the access token from the response at the end of the oauth flow" (as mentioned in the [docs](https://spotipy.readthedocs.io/en/2.21.0/#redirect-uri)).

To load the environment variables from that file we use `load_dotenv`:

In [4]:
from dotenv import load_dotenv
load_dotenv()

True
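Since `load_dotenv()` only tells us that a `.env` file was found, it can be helpful to check that the three variables are really set before creating the client. A minimal sketch, where `missing_spotipy_vars` is a made-up helper name and not part of `spotipy` or `python-dotenv`:

```python
import os

REQUIRED_VARS = (
    "SPOTIPY_CLIENT_ID",
    "SPOTIPY_CLIENT_SECRET",
    "SPOTIPY_REDIRECT_URI",
)

def missing_spotipy_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example with a plain dict instead of the real environment.
example_env = {
    "SPOTIPY_CLIENT_ID": "abc",
    "SPOTIPY_REDIRECT_URI": "http://127.0.0.1:9090",
}
print(missing_spotipy_vars(example_env))
```

Calling `missing_spotipy_vars()` without arguments checks the actual process environment, so an empty list would indicate that all three variables are available.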

Now, we are ready to import spotipy:

In [5]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth

scope = "user-library-read"

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope=scope))

Let's try to fetch more information about the tracks from the CSV we loaded before.

In [6]:
track_uris = sp_glob_charts_week_2022_10_27.uri
track_uris

0      spotify:track:0V3wPSX9ygBnCm8psDIegu
1      spotify:track:5jQI2r1RdgtuT8S3iG8zFC
2      spotify:track:1wtOxkiel43cVs0Yux5Q4h
3      spotify:track:3rWDp9tBPQR9z6U5YyRSK4
4      spotify:track:3eX0NZfLtGzoLUxPNvRfqm
                       ...                 
195    spotify:track:5g7sDjBhZ4I3gcFIpkrLuI
196    spotify:track:1UwUhKmFxGKs59xiWO60Sx
197    spotify:track:7o2CTH4ctstm8TNelqjb51
198    spotify:track:4RvWPyQ5RL0ao9LPZeSouE
199    spotify:track:6VrQTLzzuyGIYjUDe4kAZk
Name: uri, Length: 200, dtype: object

It seems that the Spotify API doesn't allow fetching more than 50 tracks per request, so we only pass the first 50 URIs for now.

If we run the code below, a browser window will open where we need to give our app using `spotipy` permission to make requests to the Spotify API on our behalf. If we confirm, we will get the data.

In [7]:
sp.tracks(tracks=track_uris[:50])

{'tracks': [{'album': {'album_type': 'album',
    'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/06HL4z0CvFAxyc27GXpf02'},
      'href': 'https://api.spotify.com/v1/artists/06HL4z0CvFAxyc27GXpf02',
      'id': '06HL4z0CvFAxyc27GXpf02',
      'name': 'Taylor Swift',
      'type': 'artist',
      'uri': 'spotify:artist:06HL4z0CvFAxyc27GXpf02'}],
    'available_markets': ['AD',
     'AE',
     'AG',
     'AL',
     'AM',
     'AO',
     'AR',
     'AT',
     'AU',
     'AZ',
     'BA',
     'BB',
     'BD',
     'BE',
     'BF',
     'BG',
     'BH',
     'BI',
     'BJ',
     'BN',
     'BO',
     'BR',
     'BS',
     'BT',
     'BW',
     'BZ',
     'CA',
     'CD',
     'CG',
     'CH',
     'CI',
     'CL',
     'CM',
     'CO',
     'CR',
     'CV',
     'CW',
     'CY',
     'CZ',
     'DE',
     'DJ',
     'DK',
     'DM',
     'DO',
     'DZ',
     'EC',
     'EE',
     'EG',
     'ES',
     'FI',
     'FJ',
     'FM',
     'FR',
     'GA',
     'GB',


In [8]:
import os

from helpers import DATA_DIR

charts_target_path = os.path.join(DATA_DIR, "charts.csv")

It would be convenient if we could scrape Spotify Chart data from the Spotify API. Unfort

In [10]:
from fycharts.SpotifyCharts import SpotifyCharts

api = SpotifyCharts()
api.top200Daily(output_file = "top_200_daily.csv", start="2017-10-27", end="2017-10-27")

INFO : 02/11/2022 03:52:01 PM : Extracting top 200 daily for 2017-10-27 - global
ERROR : 02/11/2022 03:52:06 PM : ***** <<HTTPSConnectionPool(host='charts.spotify.com', port=443): Max retries exceeded with url: / (Caused by ResponseError('too many 404 error responses'))>> Data not found. Generating empty dataframe *****
INFO : 02/11/2022 03:52:06 PM : Extracting top 200 daily for 2017-10-27 - ad
INFO : 02/11/2022 03:52:06 PM : Appending data to the file top_200_daily.csv...
INFO : 02/11/2022 03:52:06 PM : Done appending to the file top_200_daily.csv!!!
ERROR : 02/11/2022 03:52:11 PM : ***** <<HTTPSConnectionPool(host='charts.spotify.com', port=443): Max retries exceeded with url: / (Caused by ResponseError('too many 404 error responses'))>> Data not found. Generating empty dataframe *****
INFO : 02/11/2022 03:52:11 PM : Extracting top 200 daily for 2017-10-27 - ar
INFO : 02/11/2022 03:52:11 PM : Appending data to the file top_200_daily.csv...
INFO : 02/11/2022 03:52:11 PM : Done append

KeyboardInterrupt: 