In [2]:
import pandas as pd
import os
from pathlib import Path
from dotenv import load_dotenv
import spotipy
from tqdm import tqdm
import numpy as np
from spotipy.oauth2 import SpotifyClientCredentials

In [3]:
paths = Path("./data").glob("endsong*.json")
df = pd.concat(map(pd.read_json, paths))
df = df[df["spotify_track_uri"].notnull()]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 78786 entries, 0 to 15904
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   ts                                 78786 non-null  object 
 1   username                           78786 non-null  object 
 2   platform                           78786 non-null  object 
 3   ms_played                          78786 non-null  int64  
 4   conn_country                       78786 non-null  object 
 5   ip_addr_decrypted                  78786 non-null  object 
 6   user_agent_decrypted               78693 non-null  object 
 7   master_metadata_track_name         78786 non-null  object 
 8   master_metadata_album_artist_name  78786 non-null  object 
 9   master_metadata_album_album_name   78786 non-null  object 
 10  spotify_track_uri                  78786 non-null  object 
 11  episode_name                       0 non-null      object 


![](./er.svg)

## Preparation
1. Fetch track data from the Spotify Web API
    - [ ] id
    - [ ] duration
    - [ ] audio analysis
    - [ ] audio features
2. Fetch artist data

In [4]:
load_dotenv()
ID = os.getenv("ID")
SECRET = os.getenv("SECRET")

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=ID, client_secret=SECRET))

## 1. Tracks Data

### 1.1 Existing data from the extended history

Drop all duplicate entries in the history, thereby removing multiple plays of the same track.
However, this is not sufficient to eliminate duplicate references to the same track, since one track can be played from different albums.
For example, when a song is first released as a single and later as part of an album, its plays appear with different `spotify_track_uri` values, although they should be counted as one track.

To fix this, the online tool [Datablist](https://datablist.com) is used to find and remove duplicates.
An entry is considered a duplicate if the artist, album name, and track name are identical.
Subsequently, the track URIs are merged into a comma-separated string, which is then transformed into an array for further processing.
Though the column is then no longer atomic, it is now easier to assign play entries with different URIs to the same song.

1. Import `unique_tracks.csv` into Datablist
2. Find duplicates based on `artist`, `track` and `album`
3. Merge `spotify_track_uri`, drop `album`

<div class="alert alert-block alert-info">
<b>TODO:</b>    
This should be automated, for processing new data.
</div>
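
The manual Datablist steps above could be automated with a pandas `groupby`. The following is only a minimal sketch, not the workflow used in this notebook: the helper name `merge_duplicate_tracks` and the tiny example frame are assumptions made for illustration, and the key columns are passed in so they can be adjusted.

```python
import pandas as pd


def merge_duplicate_tracks(tracks, keys, uri_col="spotify_track_uri"):
    """Collapse rows that share `keys` into one row holding a list of their URIs.

    Hypothetical helper: it mirrors the "find duplicates, merge URIs" steps
    that are done by hand in Datablist.
    """
    return (
        tracks.groupby(keys, sort=False)[uri_col]
        # dict.fromkeys removes repeated URIs while keeping first-seen order
        .agg(lambda s: list(dict.fromkeys(s)))
        .reset_index()
    )


# Made-up example data: the first two rows describe the same track.
example = pd.DataFrame(
    {
        "master_metadata_track_name": ["Song A", "Song A", "Song B"],
        "master_metadata_album_artist_name": ["Artist 1", "Artist 1", "Artist 2"],
        "master_metadata_album_album_name": ["Album X", "Album X", "Album Y"],
        "spotify_track_uri": ["spotify:track:a", "spotify:track:b", "spotify:track:c"],
    }
)

merged = merge_duplicate_tracks(
    example,
    keys=[
        "master_metadata_album_artist_name",
        "master_metadata_track_name",
        "master_metadata_album_album_name",
    ],
)
```

Applied to `unique_tracks`, this would yield one row per key combination with the URIs already stored as a list, so the CSV round trip and the string split would no longer be needed.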

In [5]:
unique_tracks = df[df["episode_name"].isnull()][
    [
        "master_metadata_track_name",
        "master_metadata_album_album_name",
        "spotify_track_uri",
        "master_metadata_album_artist_name",
    ]
].drop_duplicates()
unique_tracks = unique_tracks.dropna()
unique_tracks.info()
#unique_tracks.to_csv("unique_tracks.csv")

<class 'pandas.core.frame.DataFrame'>
Index: 22705 entries, 0 to 15819
Data columns (total 4 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   master_metadata_track_name         22705 non-null  object
 1   master_metadata_album_album_name   22705 non-null  object
 2   spotify_track_uri                  22705 non-null  object
 3   master_metadata_album_artist_name  22705 non-null  object
dtypes: object(4)
memory usage: 886.9+ KB


In [6]:
tracks_cleaned = pd.read_csv("./unique_cleaned.csv")
tracks_cleaned["spotify_track_uri"] = tracks_cleaned["spotify_track_uri"].str.split(",")
print("Unique tracks after cleaning:", len(tracks_cleaned))
tracks_cleaned.head()

Unique tracks after cleaning: 21206


Unnamed: 0,master_metadata_track_name,master_metadata_album_album_name,spotify_track_uri,master_metadata_album_artist_name
0,STFU,"Long Time, No See.",[spotify:track:7DOq1mhfUI49SNVAlx8E1v],Aim Vision
1,Wo bist Du? (feat. Alligatoah),Wo bist Du? (feat. Alligatoah),"[spotify:track:7Dbq3XgIm824Od54EHIhBC, spotify...",Swiss
2,What I Know Is All Quicksand - rookery live tapes,rookery live tapes,[spotify:track:71E4QM6O81q0DX79Cg6EGk],Giant Rooks
3,Bentley Coupe,Bentley Coupe,[spotify:track:7aksJRV1H0pCds7FSiZ8ZI],Social House
4,Tauchen (feat. KYMA),Malu,[spotify:track:6yBIrzVvtYoL0meYNmDEHa],Tom Thaler & Basil


This separation is still not enough, because having several keys for one track makes it difficult to aggregate the data.
Therefore, every URI in the `extended history` that refers to the same track is replaced by the first URI in that track's `spotify_track_uri` array above.

<div class="alert alert-block alert-warning">
This only has to be done once. The resulting dataframe is stored as a `.csv`. 
</div>

In [9]:
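process = True  # set to False once history.csv has been written
if process:
    # Map every URI list ("from") to its first entry ("to"), the canonical URI.
    lookup = tracks_cleaned["spotify_track_uri"].to_frame()
    lookup.columns = ["from"]
    lookup["to"] = lookup["from"].map(lambda r: r[0])
    # Join each list once, outside the per-row lookup, for substring matching.
    joined = lookup["from"].str.join(" ")

    tqdm.pandas()
    # Replace every URI in the history with the canonical URI of its track.
    df["spotify_track_uri"] = df["spotify_track_uri"].progress_map(
        lambda x: lookup.loc[joined.str.contains(x, regex=False), "to"].item()
    )
    df.to_csv("history.csv")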

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 78786/78786 [05:36<00:00, 234.47it/s]
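
The replacement above scans the whole lookup table for every history entry, which explains the runtime of several minutes. A dictionary that maps each URI directly to its canonical URI avoids that scan. This is a sketch on made-up data rather than the code used here; `build_uri_mapping`, `tracks` and `history` are illustrative names.

```python
import pandas as pd


def build_uri_mapping(uri_lists):
    """Map every URI in each list to the first URI of that list."""
    mapping = {}
    for uris in uri_lists:
        canonical = uris[0]
        for uri in uris:
            mapping[uri] = canonical
    return mapping


# Made-up stand-ins for tracks_cleaned and the extended history.
tracks = pd.DataFrame(
    {"spotify_track_uri": [["spotify:track:a", "spotify:track:b"], ["spotify:track:c"]]}
)
history = pd.DataFrame(
    {"spotify_track_uri": ["spotify:track:b", "spotify:track:c", "spotify:track:a"]}
)

mapping = build_uri_mapping(tracks["spotify_track_uri"])
# Series.map with a dict is a single hash lookup per row.
history["spotify_track_uri"] = history["spotify_track_uri"].map(mapping)
```

One difference in behaviour: a URI that is missing from the mapping becomes `NaN` here, whereas the `.item()` call in the cell above raises an error in that case.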


---
Once the tracks have been processed, the next stage is to prepare the relevant metrics.
The Spotify Web API provides two kinds of information about every track: audio features and audio analysis.

#### Audio Features

This is an example response for the audio features of "Time is Running Out" by "Muse".
```json
{
  "acousticness": 0.00242,
  "analysis_url": "https://api.spotify.com/v1/audio-analysis/2takcwOaAZWiXQijPHIx7B",
  "danceability": 0.585,
  "duration_ms": 237040,
  "energy": 0.842,
  "id": "2takcwOaAZWiXQijPHIx7B",
  "instrumentalness": 0.00686,
  "key": 9,
  "liveness": 0.0866,
  "loudness": -5.883,
  "mode": 0,
  "speechiness": 0.0556,
  "tempo": 118.211,
  "time_signature": 4,
  "track_href": "https://api.spotify.com/v1/tracks/2takcwOaAZWiXQijPHIx7B",
  "type": "audio_features",
  "uri": "spotify:track:2takcwOaAZWiXQijPHIx7B",
  "valence": 0.428
}
```

Since the URIs are now cleaned, we can request the audio features via the Spotify API and append them to our dataframe.

In [13]:
audio_features = pd.DataFrame(columns=[
    "uri",
    "acousticness",
    "danceability",
    "duration_ms",
    "energy",
    "instrumentalness",
    "key",
    "liveness",
    "loudness",
    "mode",
    "speechiness",
    "tempo",
    "time_signature",
    "valence",
])
audio_features.head()

Unnamed: 0,uri,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence


In [14]:
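# Request the features in chunks of roughly 90 URIs per call.
n_chunks = max(1, len(unique_tracks) // 90)
for chunk in tqdm(np.array_split(unique_tracks, n_chunks)):
    try:
        uris = chunk["spotify_track_uri"].tolist()
        result = sp.audio_features(uris)
        # Entries can be None when no features are available; skip those.
        result_nonull = [obj for obj in result if obj is not None]
        chunk_df = pd.DataFrame(result_nonull)
        # join="inner" keeps only the columns declared in audio_features.
        audio_features = pd.concat([audio_features, chunk_df], join="inner")
    except Exception as e:
        # Stop at the first failure, e.g. HTTP 429 when the rate limit is hit.
        print(e)
        break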
    

  return bound(*args, **kwds)
  0%|                                                                                                                                                                                                                    | 0/252 [00:00<?, ?it/s]Max Retries reached
  0%|                                                                                                                                                                                                                    | 0/252 [00:02<?, ?it/s]

http status: 429, code:-1 - /v1/audio-features/?ids=7DOq1mhfUI49SNVAlx8E1v,7Dbq3XgIm824Od54EHIhBC,71E4QM6O81q0DX79Cg6EGk,7aksJRV1H0pCds7FSiZ8ZI,6yBIrzVvtYoL0meYNmDEHa,4nL4zTcQjOADjVziLUrKC4,3Bqa6fS5R7XDJCfVbWAq46,0ElpbbncWT9aS7mgoqEHbQ,1KTJmfwrk5pYqsi9mkY3nT,0dfmg4JoDGMeqtNRF5Gr4t,3qLZhFTjhLiPnVFzK5r8UL,6SpODwwsGQ7jqqFO10RR27,5ncFetHluI0qYiTrG1m48x,1hCVZTG44qDXaIgBSgyvpQ,2NpPNpDJG4g1Txp98TiRsx,2bEnUKx0z0RUYBmuosBRvd,4KPuDGKuBOITkxP5PrBxPU,6R3CeNRk6DOdK6tBXUaJ9X,1EXoOvL1g5UssQmQdgT4JM,6Y1xR0TDpxWQwmTYmkYZMt,1mU5tbLwcCYs64FhTTEQ4U,4rs3frrzpepBXoJOGGlzZQ,3K874ORmFuV9Ari5TmOqfY,4KFcmYICCe24YbAxAWEQAW,3M8Ee7yvy2T3AnhmmR1E0j,00RW0bhpba2Djqf8E65sXL,7qrBYrivpvfXUPBMmqh3dA,5yH8Ce7MpMwJPDSRvgIZF6,1c6xkwwhc0T5JdU2oiTFkH,20lMC6OzZLlI71WuVcVtPX,1ylphlPLI3ri4KGN6y4Dul,2yesMWYhprhtepBgexS8aN,5ThEXbatpKgKn4pOl2qexl,39FUd9QsvtOBqa49TMuY3M,62MosvjGMQY3HeGPFEsmM1,2k9N4caeCIJLOWwWwssrEM,4I3tMsc0RKjVSaOdR3a6LX,6mry9fDj4oTFudQAMRo1lV,4A7DUET5H4f7dJkUhjfVFB,4ZHGRFGwjnEIXHG2FhcDPv,2Pml521aaHZf3b0S2f4no8,3Tkb2
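
The HTTP 429 status above means the request was rejected because of rate limiting, so `audio_features` stays empty. One way to be gentler on the API is to pause between batches and retry a failed batch after a growing delay. The helper below is a generic sketch under that assumption: `fetch_in_batches`, its parameters and the fake fetch function are hypothetical and not part of spotipy.

```python
import time


def fetch_in_batches(ids, fetch, batch_size=90, pause=0.5, max_attempts=3):
    """Call `fetch` on slices of `ids`, retrying failed batches with backoff.

    Hypothetical helper: `fetch` is any callable that takes a list of ids
    and returns a list of results (possibly containing None).
    """
    results = []
    for start in range(0, len(ids), batch_size):
        batch = list(ids[start:start + batch_size])
        for attempt in range(1, max_attempts + 1):
            try:
                response = fetch(batch)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                time.sleep(pause * 2 ** attempt)  # wait longer after each failure
        # Drop None entries, as in the loop above.
        results.extend(obj for obj in response if obj is not None)
        time.sleep(pause)  # short pause between successful batches
    return results


# Stand-in for an API call: records batch sizes and returns None for "x".
calls = []


def fake_fetch(batch):
    calls.append(len(batch))
    return [None if item == "x" else {"uri": item} for item in batch]


demo = fetch_in_batches(["a", "b", "x", "c", "d"], fake_fetch, batch_size=2, pause=0)
```

In this notebook, the call would presumably look like `fetch_in_batches(unique_tracks["spotify_track_uri"].tolist(), sp.audio_features)`, with the result passed to `pd.DataFrame`. Whether a retry succeeds depends on how long the rate limit stays in effect.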




In [15]:
audio_features.rename({"uri": "spotify_track_uri"}, inplace=True, axis=1)
audio_features

Unnamed: 0,spotify_track_uri,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence


In [16]:
history_cleaned = pd.read_csv("history.csv")

In [17]:
tacks_features = history_cleaned.join(audio_features, on="spotify_track_uri")

ValueError: columns overlap but no suffix specified: Index(['spotify_track_uri'], dtype='object')
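
The `ValueError` arises because `DataFrame.join` with `on=` matches the column of the calling frame against the *index* of the other frame. Both frames therefore keep their own `spotify_track_uri` column, and pandas refuses to combine them without a suffix. A column-to-column `merge` avoids this. The frames below are made-up stand-ins for `history_cleaned` and `audio_features`.

```python
import pandas as pd

# Made-up stand-ins for history_cleaned and audio_features.
history = pd.DataFrame(
    {
        "spotify_track_uri": ["spotify:track:a", "spotify:track:b", "spotify:track:a"],
        "ms_played": [1000, 2000, 3000],
    }
)
features = pd.DataFrame(
    {
        "spotify_track_uri": ["spotify:track:a"],
        "tempo": [118.211],
    }
)

# Column-to-column match; how="left" keeps plays without audio features.
merged = history.merge(features, on="spotify_track_uri", how="left")

# Equivalent with join: move the key of the right-hand frame into its index.
joined = history.join(features.set_index("spotify_track_uri"), on="spotify_track_uri")
```

Both variants keep every row of the history and fill the feature columns with `NaN` where no audio features were fetched.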