# Combining chart track data from various sources

This notebook documents how I combined additional metadata for the unique songs in my "Spotify daily Top 50 for 50 countries (2017-2021)" dataset (`top50.csv`) from several sources into one large, cleaned dataset, `top50_track_data.csv`.

In [73]:
import pandas as pd
from helpers import get_data_path, create_data_out_path
import json
from ast import literal_eval


## Starting point: track metadata fetched from `api.spotify.com/v1/tracks` 

I stored this data in `tracks.csv`. It will serve as the starting point for the final dataset, as it contains metadata for all the unique tracks.

In [74]:
tracks = pd.read_csv(
    get_data_path("tracks.csv"),
    index_col="id",  
    converters={"album": literal_eval}, # want to "parse" album objects https://stackoverflow.com/a/67079641
)

tracks


id,album,artists,available_markets,disc_number,duration_ms,explicit,external_ids,external_urls,href,is_local,name,popularity,preview_url,track_number,type,uri
6mICuAdrwEjh6Y6lroV2Kg,"{'album_type': 'album', 'artists': [{'external...",[{'external_urls': {'spotify': 'https://open.s...,"['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT...",1,195840,False,{'isrc': 'USSD11600299'},{'spotify': 'https://open.spotify.com/track/6m...,https://api.spotify.com/v1/tracks/6mICuAdrwEjh...,False,Chantaje (feat. Maluma),76,https://p.scdn.co/mp3-preview/b7a66b261ebbe2aa...,3,track,spotify:track:6mICuAdrwEjh6Y6lroV2Kg
7DM4BPaS7uofFul3ywMe46,"{'album_type': 'single', 'artists': [{'externa...",[{'external_urls': {'spotify': 'https://open.s...,"['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT...",1,259195,False,{'isrc': 'USSD11600252'},{'spotify': 'https://open.spotify.com/track/7D...,https://api.spotify.com/v1/tracks/7DM4BPaS7uof...,False,Vente Pa' Ca (feat. Maluma),72,https://p.scdn.co/mp3-preview/21e38a8983daf1c3...,1,track,spotify:track:7DM4BPaS7uofFul3ywMe46
3AEZUABDXNtecAOSC1qTfo,"{'album_type': 'album', 'artists': [{'external...",[{'external_urls': {'spotify': 'https://open.s...,"['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT...",1,222560,False,{'isrc': 'USSD11600135'},{'spotify': 'https://open.spotify.com/track/3A...,https://api.spotify.com/v1/tracks/3AEZUABDXNte...,False,Reggaetón Lento (Bailemos),72,https://p.scdn.co/mp3-preview/ced5c17cadb43603...,3,track,spotify:track:3AEZUABDXNtecAOSC1qTfo
6rQSrBHf7HlZjtcMZ4S4bO,"{'album_type': 'album', 'artists': [{'external...",[{'external_urls': {'spotify': 'https://open.s...,[],1,205600,False,{'isrc': 'USUM71604778'},{'spotify': 'https://open.spotify.com/track/6r...,https://api.spotify.com/v1/tracks/6rQSrBHf7HlZ...,False,Safari,0,,3,track,spotify:track:6rQSrBHf7HlZjtcMZ4S4bO
58IL315gMSTD37DOZPJ2hf,"{'album_type': 'single', 'artists': [{'externa...",[{'external_urls': {'spotify': 'https://open.s...,[],1,234320,False,{'isrc': 'US2BU1600020'},{'spotify': 'https://open.spotify.com/track/58...,https://api.spotify.com/v1/tracks/58IL315gMSTD...,False,Shaky Shaky,0,,1,track,spotify:track:58IL315gMSTD37DOZPJ2hf
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6ClL99PvuCeBLTFTdnR1Hw,"{'album_type': 'album', 'artists': [{'external...",[{'external_urls': {'spotify': 'https://open.s...,"['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT...",1,179120,False,{'isrc': 'UYM120700001'},{'spotify': 'https://open.spotify.com/track/6C...,https://api.spotify.com/v1/tracks/6ClL99PvuCeB...,False,Que Tiene la Noche,42,https://p.scdn.co/mp3-preview/93168930edca3314...,1,track,spotify:track:6ClL99PvuCeBLTFTdnR1Hw
7asgcWpGtv92TiMZUYNWVt,"{'album_type': 'single', 'artists': [{'externa...",[{'external_urls': {'spotify': 'https://open.s...,"['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT...",1,211896,False,{'isrc': 'UYM122103910'},{'spotify': 'https://open.spotify.com/track/7a...,https://api.spotify.com/v1/tracks/7asgcWpGtv92...,False,Me Encanta,58,https://p.scdn.co/mp3-preview/719b6f57859cf341...,1,track,spotify:track:7asgcWpGtv92TiMZUYNWVt
1PIiIVeacU0Fj7Rm9lFlsx,"{'album_type': 'single', 'artists': [{'externa...",[{'external_urls': {'spotify': 'https://open.s...,[],1,180240,True,{'isrc': 'BXIV82097196'},{'spotify': 'https://open.spotify.com/track/1P...,https://api.spotify.com/v1/tracks/1PIiIVeacU0F...,False,Paypal,0,,1,track,spotify:track:1PIiIVeacU0Fj7Rm9lFlsx
6yf3MxEOScNBTYCHOAIiNQ,"{'album_type': 'single', 'artists': [{'externa...",[{'external_urls': {'spotify': 'https://open.s...,"['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT...",1,165466,True,{'isrc': 'GBUM71702567'},{'spotify': 'https://open.spotify.com/track/6y...,https://api.spotify.com/v1/tracks/6yf3MxEOScNB...,False,Instruction (feat. Demi Lovato & Stefflon Don),52,https://p.scdn.co/mp3-preview/d528dbeae270d471...,1,track,spotify:track:6yf3MxEOScNBTYCHOAIiNQ
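To illustrate what the `converters={"album": literal_eval}` argument does, here is a minimal sketch on a tiny in-memory CSV. The two rows and their IDs (`a1`, `b2`) are made up for illustration and are not taken from `tracks.csv`; the only assumption is that the dict-like column is stored as the text form of a Python dict.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical two-row CSV mimicking a column that holds stringified dicts.
csv_text = """id,album
a1,"{'album_type': 'single', 'release_date': '2016-09-22'}"
b2,"{'album_type': 'album', 'release_date': '2017-05-26'}"
"""

# Without a converter the column is read as plain strings.
raw = pd.read_csv(io.StringIO(csv_text), index_col="id")

# With the converter every cell is parsed into a real dict while reading.
parsed = pd.read_csv(
    io.StringIO(csv_text),
    index_col="id",
    converters={"album": literal_eval},
)

print(type(raw.loc["a1", "album"]).__name__)
print(parsed.loc["b2", "album"]["album_type"])
```

Parsing at read time means later cells can access `album["release_date"]` and similar keys directly.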


In [75]:
tracks.columns

Index(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms',
       'explicit', 'external_ids', 'external_urls', 'href', 'is_local', 'name',
       'popularity', 'preview_url', 'track_number', 'type', 'uri'],
      dtype='object')

### Remove redundant columns

In [76]:
tracks.drop(columns=[
  "uri",  # can be recreated from track ID
  "type",  # same for all tracks ("track")
  "artists",  # information will be added later when joining with "track_artists_and_genres.csv"
  "available_markets",
  "disc_number",
  "duration_ms",  # contained in "audio_features.csv" too
  "external_urls",  # unfortunately only contains Spotify track URLs :/
  "href",  # only Spotify API URLs in format https://api.spotify.com/v1/tracks/{track_id}
  "is_local",
  "popularity",  # not really meaningful, as it seems to be "current popularity", which is probably lower for older tracks - discussion here: https://community.spotify.com/t5/Content-Questions/Artist-popularity/td-p/4415259
  "track_number",
], inplace=True)
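The comments above state that `uri` and `href` can be rebuilt from the track ID alone. A small sketch of that, using two IDs from the table above and the formats visible in the output (`spotify:track:{id}` and `https://api.spotify.com/v1/tracks/{track_id}`); the `demo` frame is only for illustration:

```python
import pandas as pd

# Two track IDs taken from the table shown earlier.
ids = pd.Index(["6mICuAdrwEjh6Y6lroV2Kg", "7DM4BPaS7uofFul3ywMe46"], name="id")
demo = pd.DataFrame(index=ids)

# Rebuild the dropped columns purely from the index.
demo["uri"] = [f"spotify:track:{track_id}" for track_id in demo.index]
demo["href"] = [f"https://api.spotify.com/v1/tracks/{track_id}" for track_id in demo.index]

print(demo)
```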

### Extract relevant album information

In [77]:
def extract_relevant_album_data(row):
  album = row.album
  return {
    "album_type": album["album_type"],
    "album_id": album["uri"].split("spotify:album:")[1], # https://stackoverflow.com/a/12572391/13727176
    "album_release_date": album["release_date"],
    "album_release_date_precision": album["release_date_precision"]
  }

album_data = tracks.apply(extract_relevant_album_data, result_type="expand", axis=1)
album_data

id,album_type,album_id,album_release_date,album_release_date_precision
6mICuAdrwEjh6Y6lroV2Kg,album,6bUxh58rYTL67FS8dyTKMN,2017-05-26,day
7DM4BPaS7uofFul3ywMe46,single,1FkaJUwfqLdQdSmRPBlw6l,2016-09-22,day
3AEZUABDXNtecAOSC1qTfo,album,0YLrAWUbY0nyM7PFtqnYld,2016-08-26,day
6rQSrBHf7HlZjtcMZ4S4bO,album,2LYwooMTH1iJeBvWyXXWUf,2016-06-24,day
58IL315gMSTD37DOZPJ2hf,single,2zrLk90b4qjmrxRZKyIY7X,2016-04-08,day
...,...,...,...,...
6ClL99PvuCeBLTFTdnR1Hw,album,6zhawgybFcgvkdqpbFjbRK,2007-01-02,day
7asgcWpGtv92TiMZUYNWVt,single,08orK7pUWMWTpdOK1b3AOi,2021-10-01,day
1PIiIVeacU0Fj7Rm9lFlsx,single,0Pcp62cVNnrKo5QXTs26k1,2020-12-08,day
6yf3MxEOScNBTYCHOAIiNQ,single,4qstESQUoK2J7APuxGx0WN,2017-06-16,day


In [78]:
tracks = tracks.join(album_data).drop(columns="album")
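The `album_release_date_precision` column matters if the release date is ever converted to a real timestamp. The rows shown above all have `day` precision, but as far as I know the Spotify API can also return `month` or `year`, in which case the date string is shorter. The following is a sketch under that assumption; the three example rows and the `album_release_ts` column name are made up for illustration:

```python
import pandas as pd

# Hypothetical rows covering the three precisions the API is assumed to return.
albums = pd.DataFrame(
    {
        "album_release_date": ["2017-05-26", "2007-01", "1999"],
        "album_release_date_precision": ["day", "month", "year"],
    }
)

# Pad shorter date strings so that all of them follow YYYY-MM-DD.
suffix = albums["album_release_date_precision"].map(
    {"day": "", "month": "-01", "year": "-01-01"}
)
albums["album_release_ts"] = pd.to_datetime(
    albums["album_release_date"] + suffix, format="%Y-%m-%d"
)

print(albums)
```

Padding to the first day of the month or year is a convention, not extra information, so the precision column should be kept alongside the timestamp.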

### Extracting Data related to International Standard Recording Codes (ISRCs)

In [79]:
(tracks["external_ids"].str.contains("isrc")).sum()

33857

Almost all the tracks have an [ISRC](https://en.wikipedia.org/wiki/International_Standard_Recording_Code). This is convenient, as we can extract some information from it!

According to Wikipedia:
>An ISRC identifies a particular recording, not the work (composition and lyrical content) itself. Therefore, different recordings, edits, and remixes of the same work should each have their own ISRC. Works are identified by ISWC. Recordings remastered or revised in other ways are usually assigned a new ISRC.

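Every ISRC is of the form `"CC-XXX-YY-NNNNN"`. `CC` and `YY` are interesting: `CC` is a two-letter country code (it belongs to the registrant, so it is not necessarily the country where the recording was made), and `YY` holds the last two digits of the year in which the ISRC was assigned, which usually, but not always, matches the release year.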

However, the Wikipedia article also states:
> High demand for ISRCs within the United States has caused the supply of available registrant codes to become exhausted; after December 6, 2010, new registrants in the US use country code "QM"

Also,
> "XXX" is a three character alphanumeric registrant code of the ISRC issuer. This number by itself does NOT uniquely identify the ISRC issuer as the same 3-digit registrant code may be used in various countries for different issuers. To uniquely identify an issuer, the country code and registrant code should be used together.

So, the `XXX` component might be interesting as well!
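Before applying this to the whole dataset, the structure can be checked on single strings. The following is a standalone sketch, not the extraction used in this notebook: the function name `split_isrc`, the regular expression, and the `pivot` parameter are my own assumptions. The pivot of 30 mirrors the rule used below (two-digit years starting with 0, 1 or 2 are read as 20YY).

```python
import re

# CC (2 letters), XXX (3 alphanumeric), YY (2 digits), NNNNN (5 digits).
ISRC_PATTERN = re.compile(r"^([A-Z]{2})([A-Z0-9]{3})(\d{2})(\d{5})$")


def split_isrc(isrc, pivot=30):
    """Split an ISRC into its components; return None if it does not match."""
    compact = isrc.replace("-", "").upper()
    match = ISRC_PATTERN.match(compact)
    if match is None:
        return None
    cc, registrant, yy, designation = match.groups()
    # Two-digit years below the pivot are read as 20YY, all others as 19YY.
    year = (2000 if int(yy) < pivot else 1900) + int(yy)
    return {"cc": cc, "registrant": registrant, "year": year, "designation": designation}


print(split_isrc("US-SD1-16-00299"))
print(split_isrc("not an isrc"))
```

Returning `None` for malformed input makes it easy to count how many codes do not follow the expected pattern.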


Let's extract relevant data from the ISRCs:

In [80]:
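def extract_isrc_data(row):
    # external_ids is a stringified dict such as "{'isrc': 'USSD11600299'}";
    # literal_eval parses it without the fragile quote replacement json.loads would need
    isrc = literal_eval(row.external_ids).get("isrc")
    if isrc:
        isrc = isrc.replace("-", "").upper()  # normalize to the compact 12-character form
        cc = isrc[:2]  # country code
        xxx = isrc[2:5]  # registrant code
        yy = isrc[5:7]  # last two digits of the year
        # isrc[7:] is the designation code, not really relevant here
        # two-digit years starting with 0, 1 or 2 are read as 20YY, all others as 19YY
        century = "20" if yy[0] in "012" else "19"
        year = int(century + yy)
        return {"isrc": isrc, "isrc_year": year, "isrc_cc": cc, "isrc_xxx": xxx}
    # rows without an ISRC fall through and return None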


In [81]:
isrc_data = tracks.apply(
    extract_isrc_data, result_type="expand", axis=1
).convert_dtypes()
isrc_data


id,isrc,isrc_year,isrc_cc,isrc_xxx
6mICuAdrwEjh6Y6lroV2Kg,USSD11600299,2016,US,SD1
7DM4BPaS7uofFul3ywMe46,USSD11600252,2016,US,SD1
3AEZUABDXNtecAOSC1qTfo,USSD11600135,2016,US,SD1
6rQSrBHf7HlZjtcMZ4S4bO,USUM71604778,2016,US,UM7
58IL315gMSTD37DOZPJ2hf,US2BU1600020,2016,US,2BU
...,...,...,...,...
6ClL99PvuCeBLTFTdnR1Hw,UYM120700001,2007,UY,M12
7asgcWpGtv92TiMZUYNWVt,UYM122103910,2021,UY,M12
1PIiIVeacU0Fj7Rm9lFlsx,BXIV82097196,2020,BX,IV8
6yf3MxEOScNBTYCHOAIiNQ,GBUM71702567,2017,GB,UM7


In [82]:
isrc_data.count()

isrc         33857
isrc_year    33857
isrc_cc      33857
isrc_xxx     33857
dtype: int64

Something interesting shows up when we look for duplicated rows:

In [83]:
isrc_data[isrc_data.duplicated()]

id,isrc,isrc_year,isrc_cc,isrc_xxx
1WniHvhq9zTkny0WvGXX8o,USSD11600299,2016,US,SD1
43it4kot08akLzFIEMhXNN,USSD11600112,2016,US,SD1
4m5A5meIueOcDBpbqGvkQB,USSD11700088,2017,US,SD1
0k23rRi1B8ZHrKtzECGoyk,USUM71604779,2016,US,UM7
79Jhw5xn4gGn6PZak275gg,USUM71605331,2016,US,UM7
...,...,...,...,...
7iC56CDz8miPDKaH0OEIqS,USAT22101243,2021,US,AT2
71l8BEtJPXXlWbV6hhTHWK,USAT22104222,2021,US,AT2
7ACW7VpgoKmfM1sKo15UhX,USRC12101071,2021,US,RC1
7M6CFruBrM5x7u0lTMtm6r,ES5701501180,2015,ES,570


That looks like a lot of duplicates! But don't worry: the reason is that the same recording may appear on different albums/singles. Let's look at an example:

In [84]:
from helpers import get_spotify_link

example_isrc = isrc_data[isrc_data.duplicated()].iloc[0].isrc
track_ids = isrc_data[isrc_data.isrc == example_isrc].index
display(isrc_data.loc[track_ids])
print("Spotify links:")
for track_id in track_ids:
  print(get_spotify_link(track_id))

id,isrc,isrc_year,isrc_cc,isrc_xxx
6mICuAdrwEjh6Y6lroV2Kg,USSD11600299,2016,US,SD1
1WniHvhq9zTkny0WvGXX8o,USSD11600299,2016,US,SD1


Spotify links:
https://open.spotify.com/track/6mICuAdrwEjh6Y6lroV2Kg
https://open.spotify.com/track/1WniHvhq9zTkny0WvGXX8o


If we open the links, we notice that even the track names (one includes the "featuring ..." part while the other doesn't) and release years differ, but it is still the same recording! As we will want to merge the track data with the Spotify Charts data, it doesn't make sense to remove duplicates here. By the way, this example also shows that the country code does not give us reliable information about the language of the song (in this case it is Spanish, even though it has been assigned a US country code).
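To get a feeling for how many track IDs share a recording, one could group by ISRC. A minimal sketch on three rows taken from the outputs above; the `demo` frame is only a stand-in for `isrc_data`:

```python
import pandas as pd

# Three rows copied from the outputs above: two IDs share one ISRC.
demo = pd.DataFrame(
    {"isrc": ["USSD11600299", "USSD11600299", "USSD11600252"]},
    index=pd.Index(
        ["6mICuAdrwEjh6Y6lroV2Kg", "1WniHvhq9zTkny0WvGXX8o", "7DM4BPaS7uofFul3ywMe46"],
        name="id",
    ),
)

# Number of track IDs per recording, keeping only shared recordings.
ids_per_isrc = demo.groupby("isrc").size()
shared = ids_per_isrc[ids_per_isrc > 1]

# keep=False marks every row that belongs to a duplicated ISRC, not just the repeats.
rows_involved = demo["isrc"].duplicated(keep=False).sum()

print(shared)
print(rows_involved)
```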

Those are already interesting insights. But there's more we can do. 

I found an [article](https://isrc.ifpi.org/en/isrc-standard/structure) on the official ISRC website. It links to a [table](https://isrc.ifpi.org/downloads/Valid_Characters.pdf) that maps the `CC` component of ISRC codes to the names and territories of ISRC agencies. Inconveniently, it is provided as a PDF.

I used an online tool to convert the PDF table into an Excel spreadsheet, then cleaned the data a bit in Excel, and finally exported a CSV file. It is available in the [Google Drive folder](https://drive.google.com/drive/folders/1bW2Gh3Xrcj6Dnaooe12JyCgYtmLh7Zt5?usp=sharing) of this project as `isrc_cc_agency_and_territory.csv`. Let's load it (I have already downloaded it to the `data` folder):

In [85]:
agencies_territories = pd.read_csv(get_data_path("isrc_cc_agency_and_territory.csv"), index_col="isrc_cc")
agencies_territories

isrc_cc,agency,territory
AL,International ISRC Registration Authority,Albania
DZ,International ISRC Registration Authority,Algeria
AD,International ISRC Registration Authority,Andorra
AO,International ISRC Registration Authority,Angola
AI,International ISRC Registration Authority,Anguilla
...,...,...
CP,International ISRC Registration Authority,Worldwide
DG,International ISRC Registration Authority,Worldwide
ZZ,International ISRC Registration Authority,Worldwide
CS,International ISRC Registration Authority,(former) Serbia and Montenegro


In [86]:
agencies_territories.agency.value_counts()

International ISRC Registration Authority    108
Pro-música Brazil                              5
Recorded Music NZ                              3
RIAA                                           3
PPL UK                                         3
RISA                                           2
IFPI Switzerland                               2
SCPP                                           2
KMCA                                           2
SIMIM                                          2
Connect                                        2
RIT                                            1
SLOVGRAM                                       1
PRODUCE                                        1
SGP                                            1
UNIMPRO                                        1
PARI                                           1
ZPAV                                           1
AFP                                            1
UPFR                                           1
Recording Industry A

In [87]:
agencies_territories.territory.value_counts()

Brazil                 5
Worldwide              4
United Kingdom         3
United States          3
Canada                 2
                      ..
Grenada                1
Guatemala              1
Guernsey               1
Guyana                 1
(former) Yugoslavia    1
Name: territory, Length: 164, dtype: int64

So, we can "enrich" the `isrc_data` a bit now:

In [88]:
track_agencies_territories = agencies_territories[agencies_territories.index.isin(isrc_data.isrc_cc.unique())]
track_agencies_territories

isrc_cc,agency,territory
AL,International ISRC Registration Authority,Albania
AR,CAPIF,Argentina
AM,International ISRC Registration Authority,Armenia
AU,ARIA,Australia
AT,LSG,Austria
...,...,...
QZ,RIAA,United States
UY,Camara Uruguaya Del Disco,Uruguay
TC,TuneCore,Worldwide
DG,International ISRC Registration Authority,Worldwide


Do we have any cases where the mapping of `isrc_cc` to territory is not clear?

In [89]:
track_agencies_territories[track_agencies_territories.index.duplicated(False)] # show all occurrences of rows with duplicated index

isrc_cc,agency,territory
DG,International ISRC Registration Authority,Turks and Caicos Islands
DG,International ISRC Registration Authority,Worldwide


Unfortunately yes. The Turks and Caicos Islands are pretty small, and `DG` is also not an official ISO 2-letter country code. So, maybe some agency moved to those islands for tax reasons? Anyway, let's remove that row and keep the "Worldwide" entry!

In [90]:
track_agencies_territories = track_agencies_territories[
    ~track_agencies_territories.index.duplicated(keep="last")
]  # removes DG with Turks and Caicos Islands (as it is the first duplicate)
track_agencies_territories.shape


(73, 2)
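The two `duplicated` calls above differ only in the `keep` argument, which is easy to mix up. A small sketch of the difference on a three-row stand-in for the lookup table (the `demo` frame is made up from values shown above):

```python
import pandas as pd

demo = pd.DataFrame(
    {"territory": ["Turks and Caicos Islands", "Uruguay", "Worldwide"]},
    index=pd.Index(["DG", "UY", "DG"], name="isrc_cc"),
)

# keep=False flags every occurrence of a duplicated index value.
all_dupes = demo[demo.index.duplicated(keep=False)]

# keep="last" flags all occurrences except the last one, so negating the mask
# drops the first DG row and keeps the "Worldwide" one.
deduped = demo[~demo.index.duplicated(keep="last")]

print(all_dupes)
print(deduped)
```

This only gives the intended result because the row to keep happens to come last in the CSV; selecting on the `territory` value would be a more explicit alternative.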

Let's quickly check whether the "official list of agencies/territories" for the `CC` component of ISRC codes also covers all the unique values for `isrc_cc` in our track dataset:

In [91]:
set(isrc_data.isrc_cc) - set(track_agencies_territories.index)

{<NA>}

Looks good: the only unmatched value is `<NA>`, which comes from the tracks without an ISRC.
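The same coverage check can also be done as part of the merge itself, using the `indicator` option of `DataFrame.merge`. This is a sketch on made-up miniature versions of the two tables (`codes` and `lookup` are hypothetical names), not a replacement for the check above:

```python
import pandas as pd

# Miniature stand-ins: track t3 has no ISRC and therefore no country code.
codes = pd.DataFrame(
    {"isrc_cc": ["US", "UY", None]},
    index=pd.Index(["t1", "t2", "t3"], name="id"),
)
lookup = pd.DataFrame(
    {"territory": ["United States", "Uruguay"]},
    index=pd.Index(["US", "UY"], name="isrc_cc"),
)

# indicator=True adds a "_merge" column telling where each row found a match.
merged = codes.merge(
    lookup, left_on="isrc_cc", right_index=True, how="left", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "isrc_cc"]

print(merged)
print(unmatched)
```

If `unmatched` contained anything other than missing values, the lookup table would be incomplete for this dataset.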

In [92]:
enriched_isrc_data = pd.merge(isrc_data, track_agencies_territories, left_on="isrc_cc", right_index=True, how="left")
enriched_isrc_data.columns

Index(['isrc', 'isrc_year', 'isrc_cc', 'isrc_xxx', 'agency', 'territory'], dtype='object')

In [93]:
enriched_isrc_data.count()

isrc         33857
isrc_year    33857
isrc_cc      33857
isrc_xxx     33857
agency       33857
territory    33857
dtype: int64

In [94]:
enriched_isrc_data

id,isrc,isrc_year,isrc_cc,isrc_xxx,agency,territory
6mICuAdrwEjh6Y6lroV2Kg,USSD11600299,2016,US,SD1,RIAA,United States
7DM4BPaS7uofFul3ywMe46,USSD11600252,2016,US,SD1,RIAA,United States
3AEZUABDXNtecAOSC1qTfo,USSD11600135,2016,US,SD1,RIAA,United States
6rQSrBHf7HlZjtcMZ4S4bO,USUM71604778,2016,US,UM7,RIAA,United States
58IL315gMSTD37DOZPJ2hf,US2BU1600020,2016,US,2BU,RIAA,United States
...,...,...,...,...,...,...
6ClL99PvuCeBLTFTdnR1Hw,UYM120700001,2007,UY,M12,Camara Uruguaya Del Disco,Uruguay
7asgcWpGtv92TiMZUYNWVt,UYM122103910,2021,UY,M12,Camara Uruguaya Del Disco,Uruguay
1PIiIVeacU0Fj7Rm9lFlsx,BXIV82097196,2020,BX,IV8,Pro-música Brazil,Brazil
6yf3MxEOScNBTYCHOAIiNQ,GBUM71702567,2017,GB,UM7,PPL UK,United Kingdom


Rename the columns and drop the `CC` and `XXX` parts. The exact issuers could in principle be derived from them, but I could not find a list of all issuers, and apart from that these components are pretty useless.

In [95]:
enriched_isrc_data = enriched_isrc_data.rename(
    columns={"agency": "isrc_agency", "territory": "isrc_territory"}
).drop(columns=["isrc_cc", "isrc_xxx"])
enriched_isrc_data.columns


Index(['isrc', 'isrc_year', 'isrc_agency', 'isrc_territory'], dtype='object')

Let's add that data to the existing DataFrame. We can remove the `external_ids` column, as it doesn't contain any additional information beyond the ISRC for most of the rows.

In [96]:
tracks = tracks.join(enriched_isrc_data).drop(columns=["external_ids"])

## Add audio features fetched from `api.spotify.com/v1/audio-features`

In [97]:
track_features = pd.read_csv(get_data_path("audio_features.csv"), index_col="id").drop(
    columns=["uri", "type", "track_href", "analysis_url"] # all those columns are redundant
)
track_features


id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
6mICuAdrwEjh6Y6lroV2Kg,0.852,0.773,8.0,-2.921,0.0,0.0776,0.18700,0.000030,0.1590,0.907,102.034,195840.0,4.0
7DM4BPaS7uofFul3ywMe46,0.663,0.920,11.0,-4.070,0.0,0.2260,0.00431,0.000017,0.1010,0.533,99.935,259196.0,4.0
3AEZUABDXNtecAOSC1qTfo,0.761,0.838,4.0,-3.073,0.0,0.0502,0.40000,0.000000,0.1760,0.710,93.974,222560.0,4.0
6rQSrBHf7HlZjtcMZ4S4bO,0.508,0.687,0.0,-4.361,1.0,0.3260,0.55100,0.000003,0.1260,0.555,180.044,205600.0,4.0
58IL315gMSTD37DOZPJ2hf,0.899,0.626,6.0,-4.228,0.0,0.2920,0.07600,0.000000,0.0631,0.873,88.007,234320.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6ClL99PvuCeBLTFTdnR1Hw,0.766,0.795,9.0,-5.530,0.0,0.0644,0.26600,0.000000,0.3500,0.789,85.055,179120.0,4.0
7asgcWpGtv92TiMZUYNWVt,0.672,0.708,7.0,-5.454,1.0,0.0364,0.55100,0.000000,0.1280,0.939,79.991,211897.0,4.0
1PIiIVeacU0Fj7Rm9lFlsx,0.902,0.560,4.0,-4.471,0.0,0.3220,0.30200,0.000000,0.2890,0.913,130.047,180241.0,4.0
6yf3MxEOScNBTYCHOAIiNQ,0.767,0.929,9.0,-3.080,0.0,0.1850,0.10200,0.000000,0.0921,0.923,121.087,165467.0,4.0


In [98]:
tracks = tracks.join(track_features)
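`DataFrame.join` performs a left join on the index by default, so tracks without audio features are kept and get missing values. A side effect worth knowing is that integer columns are upcast to floats as soon as a missing value appears. A minimal sketch with made-up IDs (`t1` to `t3`):

```python
import pandas as pd

left = pd.DataFrame(
    {"name": ["a", "b", "c"]},
    index=pd.Index(["t1", "t2", "t3"], name="id"),
)
# Features exist only for two of the three tracks.
features = pd.DataFrame(
    {"key": [8, 11]},
    index=pd.Index(["t1", "t3"], name="id"),
)

joined = left.join(features)  # left join on the index by default

print(features["key"].dtype)
print(joined["key"].dtype)
print(joined)
```

The same upcasting happens when a CSV with empty cells is read, which may be why integer-like features such as `key` or `mode` are displayed with a trailing `.0` above.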

In [99]:
tracks

id,explicit,name,preview_url,album_type,album_id,album_release_date,album_release_date_precision,isrc,isrc_year,isrc_agency,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
000RW47rhEkSqjgTrZx7YX,False,Lance Individual,https://p.scdn.co/mp3-preview/c64292f4c53560bc...,album,57h6WHDjwNGIs5NMeKYEoL,2021-04-22,day,BRRGE2010642,2020,Pro-música Brazil,...,-5.536,1.0,0.0509,0.30900,0.000000,0.0750,0.962,117.399,164459.0,4.0
000xQL6tZNLJzIrtIgxqSl,False,Still Got Time (feat. PARTYNEXTDOOR),https://p.scdn.co/mp3-preview/83fad967740b8a85...,single,2kGUeTGnkLOYlinKRJe47G,2017-03-23,day,USRC11700675,2017,RIAA,...,-6.029,1.0,0.0639,0.13100,0.000000,0.0852,0.524,120.963,188491.0,4.0
000xYdQfIZ4pDmBGzQalKU,False,"Eu, Você, O Mar e Ela",https://p.scdn.co/mp3-preview/ae0e943883e06623...,album,4QianJs5Ls4mxwcT7gDBww,2016-11-04,day,BRRGE1603547,2016,Pro-música Brazil,...,-6.743,1.0,0.0400,0.68400,0.000539,0.4630,0.651,166.018,187119.0,4.0
001b8t3bYPfnabpjpfG1Y4,True,Geen Stof,https://p.scdn.co/mp3-preview/535320aa4cbc5811...,album,06v2EPzWTwcP0egTJVrPdU,2021-01-21,day,NLG662000948,2020,SENA,...,-4.846,1.0,0.3720,0.10500,0.000000,0.1170,0.541,95.951,167866.0,4.0
003VDDA7J3Xb2ZFlNx7nIZ,True,YELL OH,https://p.scdn.co/mp3-preview/14e659590d1a70cf...,single,2orYogfKeURqyS1hRP1vZ4,2020-02-07,day,QZJ842000061,2020,RIAA,...,-6.050,0.0,0.1380,0.00419,0.000000,0.2280,0.190,74.496,236779.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7zxphfIdUMOfZOAVpKFlld,True,Trapper Of The Year (Intro),https://p.scdn.co/mp3-preview/b48d68019fc66831...,single,65DP4VYecvLUlUVfCsXPLF,2021-11-19,day,SE6XY2184006,2021,IFPI Sweden,...,-11.291,0.0,0.3160,0.00582,0.013500,0.0827,0.461,121.913,110164.0,4.0
7zyYmIdjqqiX6kLryb7QBx,False,以後別做朋友,https://p.scdn.co/mp3-preview/4d243321f0ec66ba...,album,1JEzXcEYuEFKKmo4mfMgC7,2014-12-19,day,TWA471410001,2014,RIT,...,-9.458,1.0,0.0372,0.72800,0.000000,0.1050,0.291,130.576,260573.0,4.0
7zyZ9yPXIQebb79PrMghpV,False,Zap Zum,https://p.scdn.co/mp3-preview/2f82c76f077fb377...,album,194szTkDIGJsa9iZJNStwN,2021-06-24,day,BCM112100037,2021,Pro-música Brazil,...,-3.850,1.0,0.0400,0.08580,0.000000,0.1580,0.833,169.123,167503.0,4.0
7zyofXGhXgaqT8fhvLufdf,False,Kuka Antais Pukille?,https://p.scdn.co/mp3-preview/236ec096925cdb1f...,single,2QIHDFGxEjYdVg8Y1QYBRv,2018-11-21,day,FIUM71800543,2018,IFPI Finland,...,-4.364,1.0,0.0300,0.20000,0.000000,0.1850,0.419,89.982,241564.0,4.0


## Add first 3 artists and extracted genre list

Next, we add the first three artists of each track and all genres associated with them as our best guess for the track's genres. Using artist genre tags was the only feasible way to get genres for a track, as Spotify doesn't provide song-level genre annotations. See `extract_track_artists_and_genres.py` for more information on how the data was created.

In [100]:
track_artists_and_genres = (
    pd.read_csv(get_data_path("track_artists_and_genres.csv"))
    .rename(
        columns={
            "track_id": "id",
        }
    )
    .set_index("id")
)
track_artists_and_genres

id,artist_id_1,artist_id_2,artist_id_3,artist_name_1,artist_name_2,artist_name_3,genres
6mICuAdrwEjh6Y6lroV2Kg,0EmeFodog0BfCgMzAIvKQp,1r4hJ1h58CWwUQe3MxPuau,,Shakira,Maluma,,"['colombian pop', 'dance pop', 'latin pop', 'r..."
7DM4BPaS7uofFul3ywMe46,7slfeZO9LsJbWgpkIoXBUJ,1r4hJ1h58CWwUQe3MxPuau,,Ricky Martin,Maluma,,"['dance pop', 'latin pop', 'mexican pop', 'pue..."
3AEZUABDXNtecAOSC1qTfo,0eecdvMrqBftK0M1VKhaF4,,,CNCO,,,"['boy band', 'latin pop', 'reggaeton']"
6rQSrBHf7HlZjtcMZ4S4bO,1vyhD5VmyZ7KMfW5gqLgo5,2RdwBSPQiwcmiDo9kixcl8,6veh5zbFpm31XsPdjBgPER,J Balvin,Pharrell Williams,BIA,"['reggaeton', 'reggaeton colombiano', 'trap la..."
58IL315gMSTD37DOZPJ2hf,4VMYDCV2IEDYJArk749S6m,,,Daddy Yankee,,,"['latin hip hop', 'reggaeton', 'trap latino']"
...,...,...,...,...,...,...,...
6ClL99PvuCeBLTFTdnR1Hw,2QoRWNLJ6A9M8f9F0ovGcM,7Bl9s8h4F1jcX1aJYHBpfm,,Sonido Caracol,Chacho Ramos,,"['cumbia pop', 'cumbia uruguaya', 'plena urugu..."
7asgcWpGtv92TiMZUYNWVt,6SGCqG5HEr5gFZR9ct8wID,0WnP62TjkFfRrt52yE8zcX,,Matías Valdez,Lucas Sugo,,"['cumbia pop', 'canto popular uruguayo', 'cumb..."
1PIiIVeacU0Fj7Rm9lFlsx,7A7wnJp6eGBNV5Foax2kYg,6OviQMT0o5K1fzKYvUNbRS,,Teto Mc,Lucas Corleone,,['rap paraense']
6yf3MxEOScNBTYCHOAIiNQ,4Q6nIcaBED8qUel8bBx6Cr,6S2OmqARrzebs0tKUEyXyp,2ExGrw6XpbtUAJHTLtUXUD,Jax Jones,Demi Lovato,Stefflon Don,"['dance pop', 'edm', 'electro house', 'house',..."


In [101]:
tracks = tracks.join(track_artists_and_genres)
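Judging by the output above, the `genres` column holds the text form of a Python list rather than a real list, because it was read from CSV without a converter. A sketch of how it could be parsed and counted, assuming that representation; the three rows are copied from the table above:

```python
from ast import literal_eval

import pandas as pd

demo = pd.DataFrame(
    {
        "genres": [
            "['boy band', 'latin pop', 'reggaeton']",
            "['latin hip hop', 'reggaeton', 'trap latino']",
            "['rap paraense']",
        ]
    },
    index=pd.Index(
        ["3AEZUABDXNtecAOSC1qTfo", "58IL315gMSTD37DOZPJ2hf", "1PIiIVeacU0Fj7Rm9lFlsx"],
        name="id",
    ),
)

# Turn each string into a real list, then put every genre on its own row.
genre_lists = demo["genres"].apply(literal_eval)
genre_counts = genre_lists.explode().value_counts()

print(genre_counts)
```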


## Brief analysis of final dataset

In [102]:
tracks

id,explicit,name,preview_url,album_type,album_id,album_release_date,album_release_date_precision,isrc,isrc_year,isrc_agency,...,tempo,duration_ms,time_signature,artist_id_1,artist_id_2,artist_id_3,artist_name_1,artist_name_2,artist_name_3,genres
000RW47rhEkSqjgTrZx7YX,False,Lance Individual,https://p.scdn.co/mp3-preview/c64292f4c53560bc...,album,57h6WHDjwNGIs5NMeKYEoL,2021-04-22,day,BRRGE2010642,2020,Pro-música Brazil,...,117.399,164459.0,4.0,1elUiq4X7pxej6FRlrEzjM,,,Jorge & Mateus,,,"['arrocha', 'sertanejo', 'sertanejo universita..."
000xQL6tZNLJzIrtIgxqSl,False,Still Got Time (feat. PARTYNEXTDOOR),https://p.scdn.co/mp3-preview/83fad967740b8a85...,single,2kGUeTGnkLOYlinKRJe47G,2017-03-23,day,USRC11700675,2017,RIAA,...,120.963,188491.0,4.0,5ZsFI1h6hIdQRw2ti0hz81,2HPaUgqeutzr3jx5a9WyDV,,ZAYN,PARTYNEXTDOOR,,"['dance pop', 'pop', 'post-teen pop', 'uk pop'..."
000xYdQfIZ4pDmBGzQalKU,False,"Eu, Você, O Mar e Ela",https://p.scdn.co/mp3-preview/ae0e943883e06623...,album,4QianJs5Ls4mxwcT7gDBww,2016-11-04,day,BRRGE1603547,2016,Pro-música Brazil,...,166.018,187119.0,4.0,3qvcCP2J0fWi0m0uQDUf6r,,,Luan Santana,,,"['arrocha', 'sertanejo', 'sertanejo pop', 'ser..."
001b8t3bYPfnabpjpfG1Y4,True,Geen Stof,https://p.scdn.co/mp3-preview/535320aa4cbc5811...,album,06v2EPzWTwcP0egTJVrPdU,2021-01-21,day,NLG662000948,2020,SENA,...,95.951,167866.0,4.0,1wFoE1RwBMWoWkXcFrCgsx,,,Josylvio,,,['dutch hip hop']
003VDDA7J3Xb2ZFlNx7nIZ,True,YELL OH,https://p.scdn.co/mp3-preview/14e659590d1a70cf...,single,2orYogfKeURqyS1hRP1vZ4,2020-02-07,day,QZJ842000061,2020,RIAA,...,74.496,236779.0,4.0,6Xgp2XMz1fhVYe7i6yNAax,50co4Is1HCEo8bhOyUWKpn,,Trippie Redd,Young Thug,,"['melodic rap', 'rap', 'trap', 'atl hip hop', ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7zxphfIdUMOfZOAVpKFlld,True,Trapper Of The Year (Intro),https://p.scdn.co/mp3-preview/b48d68019fc66831...,single,65DP4VYecvLUlUVfCsXPLF,2021-11-19,day,SE6XY2184006,2021,IFPI Sweden,...,121.913,110164.0,4.0,2Dor6diK1zw9BEluKBOdoA,,,23,,,"['swedish drill', 'swedish hip hop', 'swedish ..."
7zyYmIdjqqiX6kLryb7QBx,False,以後別做朋友,https://p.scdn.co/mp3-preview/4d243321f0ec66ba...,album,1JEzXcEYuEFKKmo4mfMgC7,2014-12-19,day,TWA471410001,2014,RIT,...,130.576,260573.0,4.0,5fEQLwq1BWWQNR8GzhOIvi,,,Eric Chou,,,['mandopop']
7zyZ9yPXIQebb79PrMghpV,False,Zap Zum,https://p.scdn.co/mp3-preview/2f82c76f077fb377...,album,194szTkDIGJsa9iZJNStwN,2021-06-24,day,BCM112100037,2021,Pro-música Brazil,...,169.123,167503.0,4.0,6tzRZ39aZlNqlUzQlkuhDV,,,Pabllo Vittar,,,"['dance pop', 'funk carioca', 'funk pop', 'pop..."
7zyofXGhXgaqT8fhvLufdf,False,Kuka Antais Pukille?,https://p.scdn.co/mp3-preview/236ec096925cdb1f...,single,2QIHDFGxEjYdVg8Y1QYBRv,2018-11-21,day,FIUM71800543,2018,IFPI Finland,...,89.982,241564.0,4.0,3zh3U2eQ64EhBFbJuxgf1M,05qPtpcSltJZLI9sj0qm3B,4l0zTor5S32Yly4uw96Bto,Teflon Brothers,Spekti,Petri Nygård,"['eurovision', 'finnish dance pop', 'finnish h..."


In [103]:
tracks.shape

(33881, 31)

In [104]:
tracks.columns

Index(['explicit', 'name', 'preview_url', 'album_type', 'album_id',
       'album_release_date', 'album_release_date_precision', 'isrc',
       'isrc_year', 'isrc_agency', 'isrc_territory', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms',
       'time_signature', 'artist_id_1', 'artist_id_2', 'artist_id_3',
       'artist_name_1', 'artist_name_2', 'artist_name_3', 'genres'],
      dtype='object')

In [105]:
type_counts = tracks.dtypes.value_counts()
type_counts

object     15
float64    13
bool        1
string      1
Int64       1
dtype: int64

In [106]:
for dtype in type_counts.index:
  print(dtype, tracks.dtypes.index[tracks.dtypes == dtype].tolist())
  print()

object ['name', 'preview_url', 'album_type', 'album_id', 'album_release_date', 'album_release_date_precision', 'isrc_agency', 'isrc_territory', 'artist_id_1', 'artist_id_2', 'artist_id_3', 'artist_name_1', 'artist_name_2', 'artist_name_3', 'genres']

float64 ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']

bool ['explicit']

string ['isrc']

Int64 ['isrc_year']



In [107]:
tracks.count().sort_values()

artist_id_3                      3828
artist_name_3                    3828
artist_name_2                   13515
artist_id_2                     13515
preview_url                     25713
artist_name_1                   33856
isrc_year                       33857
isrc                            33857
isrc_agency                     33857
isrc_territory                  33857
name                            33858
instrumentalness                33876
liveness                        33876
mode                            33876
tempo                           33876
valence                         33876
acousticness                    33876
danceability                    33876
loudness                        33876
key                             33876
energy                          33876
duration_ms                     33876
speechiness                     33876
time_signature                  33876
explicit                        33881
album_release_date_precision    33881
album_releas

Interesting, there are some songs without a name (luckily only 23). Also, for 5 songs no audio features seem to be available. Overall, the completeness of the track metadata is pretty satisfying :)
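Instead of subtracting the counts from the number of rows by hand, the missing values can be counted directly. A minimal sketch on a made-up three-row frame:

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "name": ["Safari", None, "Paypal"],
        "tempo": [180.044, 88.007, None],
        "explicit": [False, False, True],
    }
)

# Number of missing values per column, most incomplete columns first.
missing = demo.isna().sum().sort_values(ascending=False)

# Equivalent to subtracting the non-null counts from the number of rows.
same = len(demo) - demo.count()

print(missing)
```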

## Export data

In [108]:
tracks.to_csv(create_data_out_path("top50_track_data.csv"))

Data output path: /home/sejmou/Repos/Uni/VisDS/vis-ds/data/top50_track_data.csv
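One caveat when loading the exported file again: CSV does not store dtypes, so a nullable integer column such as `isrc_year` comes back as `float64` unless the dtype is passed explicitly. A sketch of the round trip using an in-memory buffer instead of the real file; the two-row `demo` frame is made up:

```python
import io

import pandas as pd

demo = pd.DataFrame(
    {"isrc_year": pd.array([2016, None], dtype="Int64")},
    index=pd.Index(["t1", "t2"], name="id"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer)

# Reading back without hints: the missing value forces float64.
buffer.seek(0)
plain = pd.read_csv(buffer, index_col="id")

# Passing the dtype restores the nullable integer column.
buffer.seek(0)
typed = pd.read_csv(buffer, index_col="id", dtype={"isrc_year": "Int64"})

print(plain["isrc_year"].dtype)
print(typed["isrc_year"].dtype)
```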
