___________
# **Music Data Analysis: Spotify API with Pandas, Altair, and NetworkX**

This notebook illustrates several ways to work with the Spotify API using Spotipy, a Python package designed to enable user-friendly (ish) interactions with Spotify's music metadata. In Part I of this notebook, we will use Spotipy and Pandas to **set up a DataFrame containing a collection of songs (tracks)** retrieved by playlist ID. Then, we will investigate ways to **visually represent and compare** this collection using Altair (Part II) and explore the basics of **network graph visualization** using Pyvis and NetworkX.

You can learn more about these resources here:
* [Spotify API](https://developer.spotify.com/documentation/web-api/)
* [Spotipy](https://spotipy.readthedocs.io/en/master/#)
* [Pandas](https://pandas.pydata.org/)
* [Altair](https://altair-viz.github.io/)
* [Pyvis](https://pyvis.readthedocs.io/en/latest/)
* [NetworkX](https://networkx.org/)

### Brief Introduction: Spotify, APIs, Spotify API

As many of you know, **Spotify** is a music streaming service founded in 2006 and launched publicly in 2008. The service has about 182 million subscribers and hosts more than 70 million tracks. In 2014, Spotify released the **Spotify API**, a web-based interface that allows anyone with a Spotify account to search, analyze, and manipulate Spotify's music metadata. In short, **an API** is an interface that enables two or more programs to talk to each other. You can learn more about APIs [here](https://en.wikipedia.org/wiki/API).

Going through this notebook, you'll be able to request Spotify API access for your personal notebook and perform all sorts of analyses on the tracks, users, artists, albums, and playlists that interest you. While some of the material covered in this notebook is very basic, some elements might seem quite puzzling. Please don't hesitate to reach out and ask questions.
______

## **Part 1: Setting up**
#### Step 1.1: Importing Python Libraries

In [107]:
import pandas as pd
import numpy as np
import random
import altair as alt
import requests
import inspect
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import networkx as nx
import networkx.algorithms.community as nx_comm
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import pyvis
from pyvis import network as net
from itertools import combinations
from community import community_louvain
from copy import deepcopy

#### Step 1.2: Providing User Credentials


In order to utilize the functionality of Spotify's API, you'll need to establish a connection between the local endpoint (your laptop) and the API (cloud). To do that, you'll need to create a **web client** (read more [here](https://en.wikipedia.org/wiki/Client_(computing))).

A web client typically requires authentication parameters **(key and secret)**. The Spotify API uses the OAuth 2.0 authorization scheme. To save you the trouble of setting up your own credentials, we have created one common set of login credentials for this course. You can learn more about OAuth [here](https://en.wikipedia.org/wiki/OAuth).

Please find the credentials below:

In [29]:
# storing the credentials:
CLIENT_ID = "f4ba183c8722470bbd9f998f445026c7"
CLIENT_SECRET = "a546cfe52a7749829aebd23c0ded5559"
my_username = "sx47r9lq4dwrjx1r0ct9f9m09"

# instantiating the client
# source: Max Hilsdorf (https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6)
client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
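
If you later register your own application, you may prefer not to paste its key and secret into a shared notebook. The sketch below shows one common pattern, reading them from environment variables. The helper name `load_spotify_credentials` is hypothetical; the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones Spotipy itself looks for by default, and the demo values are placeholders rather than real keys.

```python
import os


def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables."""
    # Fall back to the real process environment when no mapping is supplied
    env = os.environ if env is None else env
    names = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]


# Demonstration with a stand-in mapping (placeholder values, not real keys)
demo_env = {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_spotify_credentials(demo_env)
print(client_id)
```

The two returned values could then be passed to `SpotifyClientCredentials` exactly as `CLIENT_ID` and `CLIENT_SECRET` are in the cell above.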

In [3]:
AUTH_URL = 'https://accounts.spotify.com/api/token'

# POST
auth_response = requests.post(AUTH_URL, {
    'grant_type': 'client_credentials',
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
})

# convert the response to JSON
auth_response_data = auth_response.json()

# save the access token
access_token = auth_response_data['access_token']
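
The access token obtained this way is meant to be sent along with each raw HTTP request. As a minimal sketch (no network call is made here), the hypothetical helper `build_track_request` assembles the URL and the `Authorization` header for a single-track lookup; the endpoint pattern follows the `https://api.spotify.com/v1/tracks/...` URLs that appear in the outputs later in this notebook, and the token value is a placeholder.

```python
def build_track_request(track_id, access_token):
    """Return the URL and headers for a track lookup (nothing is sent)."""
    url = f"https://api.spotify.com/v1/tracks/{track_id}"
    # OAuth 2.0 bearer tokens travel in the Authorization header
    headers = {"Authorization": f"Bearer {access_token}"}
    return url, headers


demo_url, demo_headers = build_track_request("3QRM0qZB7oMYavveH0iEqx", "PLACEHOLDER_TOKEN")
print(demo_url)
print(demo_headers)
```

With the real `access_token` from the cell above, `requests.get(demo_url, headers=demo_headers).json()` would perform the actual request.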

At this point, you should be able to access the API! Hence, we move on to retrieving and analyzing music metadata.

----------
## **Part 2: Analyzing Playlists**

### Step 2.1: Obtaining Data

We can **get the tracks in a user's playlist** using the *sp.user_playlist_tracks(username, playlist)* method and turn the result into a Pandas DataFrame. The two parameters we need for this are the **user ID** and the **playlist ID**; both can be easily found on the Spotify website or in the Spotify app. Just look in the URL bar and copy the IDs as Strings.

In this example, we are using the following data:
* "sx47r9lq4dwrjx1r0ct9f9m09": user_1's **Spotify User ID**. Typically, a Spotify ID is formatted somewhat nicer (e.g. "barackobama"), but this one is just a random string.
* "7KfWEjHxpcOIkqvDqMW5RV": the **Playlist ID** for one of user_1's playlists.

Both playlist ID and User ID **can be found in a web browser** when accessing the User's or Playlist's webpage.

* for example, Oleh's Spotify User page can be found at: "https://open.spotify.com/user/sx47r9lq4dwrjx1r0ct9f9m09", and you can see that the User ID is what follows after "...user/", that is, "sx47r9lq4dwrjx1r0ct9f9m09"
* for example, Spotify's featured Pop Mix playlist can be found at: "https://open.spotify.com/playlist/37i9dQZF1EQncLwOalG3K7", and you can find the Playlist ID after "...playlist/", that is, "37i9dQZF1EQncLwOalG3K7"
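
Copying IDs by hand works, but the same rule ("take whatever follows `user/` or `playlist/`") can also be applied in code. Here is a small sketch using only the standard library; the function name `spotify_id_from_url` is made up for illustration, and it assumes URLs shaped like the two examples above.

```python
from urllib.parse import urlparse


def spotify_id_from_url(url, kind):
    """Return the path segment that follows `kind` ("user" or "playlist")."""
    # urlparse separates the path from any query string, so only segments remain
    parts = [part for part in urlparse(url).path.split("/") if part]
    if kind not in parts or parts.index(kind) + 1 >= len(parts):
        raise ValueError(f"No {kind} ID found in {url!r}")
    return parts[parts.index(kind) + 1]


user_id = spotify_id_from_url(
    "https://open.spotify.com/user/sx47r9lq4dwrjx1r0ct9f9m09", "user")
pop_mix_id = spotify_id_from_url(
    "https://open.spotify.com/playlist/37i9dQZF1EQncLwOalG3K7", "playlist")
print(user_id, pop_mix_id)
```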

In [5]:
# sp.user_playlist_tracks(user_id: String, playlist_id: String) returns a JSON-like dict
playlist_tracks = pd.DataFrame(sp.user_playlist_tracks("sx47r9lq4dwrjx1r0ct9f9m09", "7KfWEjHxpcOIkqvDqMW5RV"))
playlist_tracks

Unnamed: 0,href,items,limit,next,offset,previous,total
0,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:22:18Z', 'added_by...",100,,0,,16
1,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:22:32Z', 'added_by...",100,,0,,16
2,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:22:50Z', 'added_by...",100,,0,,16
3,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:23:10Z', 'added_by...",100,,0,,16
4,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:23:22Z', 'added_by...",100,,0,,16
5,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:23:38Z', 'added_by...",100,,0,,16
6,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:26:57Z', 'added_by...",100,,0,,16
7,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:31:50Z', 'added_by...",100,,0,,16
8,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:35:09Z', 'added_by...",100,,0,,16
9,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:38:34Z', 'added_by...",100,,0,,16
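
The `limit`, `next`, and `total` columns in this output hint at paging: results come back in pages of up to `limit` items, and `next` is empty here because all 16 tracks fit on one page. For longer playlists, the pages have to be followed one after another. The sketch below shows that loop in isolation with made-up pages; `collect_all_items` and its `fetch_next` argument are hypothetical names, not part of Spotipy.

```python
def collect_all_items(first_page, fetch_next):
    """Gather the items of every page by following `next` until it is empty."""
    items = list(first_page["items"])
    page = first_page
    while page.get("next"):
        page = fetch_next(page)  # ask for the page that `next` points to
        items.extend(page["items"])
    return items


# Made-up pages standing in for API responses
fake_pages = {"page-2": {"items": ["c"], "next": None}}
first = {"items": ["a", "b"], "next": "page-2"}
all_items = collect_all_items(first, lambda page: fake_pages[page["next"]])
print(all_items)
```

With a real client, the raw dictionary returned by *sp.user_playlist_tracks()* (before it is wrapped in a DataFrame) could serve as `first_page`, and Spotipy's *sp.next* could be passed as `fetch_next`.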


We can take a look at an **individual track** here:

In [6]:
sample_track = playlist_tracks.iloc[1]["items"]["track"]
sample_track

{'album': {'album_type': 'album',
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/1VJspRsoC6c0bvqhnSiFCs'},
    'href': 'https://api.spotify.com/v1/artists/1VJspRsoC6c0bvqhnSiFCs',
    'id': '1VJspRsoC6c0bvqhnSiFCs',
    'name': 'Ini Kamoze',
    'type': 'artist',
    'uri': 'spotify:artist:1VJspRsoC6c0bvqhnSiFCs'}],
  'available_markets': ['AD',
   'AE',
   'AL',
   'AM',
   'AO',
   'AR',
   'AT',
   'AU',
   'AZ',
   'BA',
   'BD',
   'BE',
   'BF',
   'BG',
   'BH',
   'BI',
   'BJ',
   'BN',
   'BO',
   'BR',
   'BT',
   'BW',
   'BY',
   'BZ',
   'CA',
   'CD',
   'CG',
   'CH',
   'CI',
   'CL',
   'CM',
   'CO',
   'CR',
   'CV',
   'CW',
   'CY',
   'CZ',
   'DE',
   'DJ',
   'DK',
   'DO',
   'DZ',
   'EC',
   'EE',
   'EG',
   'ES',
   'ET',
   'FI',
   'FJ',
   'FM',
   'FR',
   'GA',
   'GB',
   'GE',
   'GH',
   'GM',
   'GN',
   'GQ',
   'GR',
   'GT',
   'GW',
   'GY',
   'HK',
   'HN',
   'HR',
   'HU',
   'ID',
   'IE',
   'IL',
   'IN',
 

As you may notice, tracks are stored as **JSON objects** (think Dictionaries), which you can read more about [here](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON). Each Track object has many attributes, including "album", "artists", "id", "duration_ms", "popularity", "name", etc. Some of these are extremely useful to us! You can learn more about Spotify's Track features [here](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-track).
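
Because each entry is a nested dictionary, it can help to flatten a few of them into columns to see what is available. The sketch below uses Pandas' *pd.json_normalize* on two heavily trimmed stand-ins for playlist items; the values are copied from outputs shown in this notebook, but real items carry many more fields.

```python
import pandas as pd

# Trimmed stand-ins for two playlist items (real items have many more keys)
sample_items = [
    {"added_at": "2020-06-24T18:22:18Z",
     "track": {"id": "4fxF8ljwryMZX5c9EKrLFE",
               "name": "Boombastic",
               "album": {"name": "Boombastic"}}},
    {"added_at": "2020-06-24T18:22:32Z",
     "track": {"id": "3QRM0qZB7oMYavveH0iEqx",
               "name": "Here Comes the Hotstepper - Heartical Mix",
               "album": {"name": "Here Comes The Hotstepper"}}},
]

# Nested keys become dotted column names such as "track.album.name"
flat = pd.json_normalize(sample_items)
print(flat.columns.tolist())
print(flat[["track.name", "track.album.name"]])
```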

While this is already a lot of information (!), we can extract some perhaps-more-interesting features of tracks via the Audio Features method. Using *sp.audio_features(track_id)*, we can easily get a track's audio features (by track_id):

In [27]:
sample_track_audio_features = pd.DataFrame(sp.audio_features(sample_track["id"]))
sample_track_audio_features

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.494,0.182,10,-11.116,0,0.0443,0.79,0.00036,0.129,0.21,112.689,audio_features,5vb7At47uO0yPGfmYnAHuw,spotify:track:5vb7At47uO0yPGfmYnAHuw,https://api.spotify.com/v1/tracks/5vb7At47uO0y...,https://api.spotify.com/v1/audio-analysis/5vb7...,355333,4


As you can see, each track has **a large number of recorded audio features**. These are typically generated by Spotify and cover various musical aspects, ranging from Loudness to Liveness, from Danceability to Duration, and from Tempo to Time Signature. The feature values are of different **data types**: "key" is an **Integer**, "energy" is a **Float**, "id" is a **String**, and "mode" is a **Boolean** represented as Integer. As you work your way through this notebook, you will discover many options to count, bin, sort, graph, and connect variables and values of different types.

Consider the function below (courtesy of Max Hilsdorf), which can help us **loop through the items of a playlist and get every track's [audio] features of interest**:

In [8]:
# This function is created based on Max Hilsdorf's article
# Source: https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6
def get_audio_features_df(playlist):
    
    # Create an empty dataframe
    playlist_features_list = ["artist", "album", "track_name", "track_id","danceability","energy","key","loudness","mode", "speechiness","instrumentalness","liveness","valence","tempo", "duration_ms","time_signature"]
    playlist_df = pd.DataFrame(columns = playlist_features_list)
    
    # Loop through every track in the playlist, extract features and append the features to the playlist df
    for track in playlist["items"]:
        # Create empty dict
        playlist_features = {}
        # Get metadata
        playlist_features["artist"] = track["track"]["album"]["artists"][0]["name"]
        playlist_features["album"] = track["track"]["album"]["name"]
        playlist_features["track_name"] = track["track"]["name"]
        playlist_features["track_id"] = track["track"]["id"]
        
        # Get audio features
        audio_features = sp.audio_features(playlist_features["track_id"])[0]
        for feature in playlist_features_list[4:]:
            playlist_features[feature] = audio_features[feature]
        
        # Concat the DataFrames
        track_df = pd.DataFrame(playlist_features, index = [0])
        playlist_df = pd.concat([playlist_df, track_df], ignore_index = True)
        
    return playlist_df

Note: the **playlist parameter** (passed in to the get_audio_features_df() function) should be a **DataFrame consisting of several track objects**. In our case, we have one such collection stored in **playlist_tracks**, which we got from calling sp.user_playlist_tracks() on a playlist and storing the result as a Pandas DataFrame.

Hence, we run the get_audio_features_df() function on our collection to obtain the **audio features DataFrame** for the tracks in **playlist_tracks**.

In [9]:
audio_features_df = get_audio_features_df(playlist_tracks)
audio_features_df.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Shaggy,Boombastic,Boombastic,4fxF8ljwryMZX5c9EKrLFE,0.867,0.538,2,-16.183,1,0.361,1.7e-05,0.316,0.781,158.328,249933,4
1,Ini Kamoze,Here Comes The Hotstepper,Here Comes the Hotstepper - Heartical Mix,3QRM0qZB7oMYavveH0iEqx,0.889,0.454,4,-8.598,0,0.221,0.000186,0.203,0.436,100.36,250467,4
2,Shaggy,Boombastic,In The Summertime,726KAdf3k8Ce8W95O38XNI,0.734,0.684,1,-13.822,1,0.227,0.0,0.0497,0.962,173.607,238360,4
3,Afroman,Waiting to Inhale,Colt 40ty Fiva,3hody5PjTIzwoiV3hnAvWL,0.666,0.751,0,-5.355,0,0.216,0.0,0.294,0.607,175.891,201440,4
4,Salt-N-Pepa,Very Necessary,Shoop,0Pu71wxadDlB8fJXfjIjeJ,0.939,0.675,0,-7.232,1,0.211,0.0,0.0565,0.795,96.918,248573,4


In [30]:
audio_features_df.to_csv("Miles_Davis_Spotify.csv")

As you can see above, our new DataFrame contains **Spotify's audio features for every track in the provided playlist**.
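
The function above asks for audio features one track at a time, which can become slow for long playlists. Spotipy documents that *sp.audio_features()* also accepts a list of up to 100 track IDs, so one possible speed-up is to request features in batches. The helper below (`chunked`, a hypothetical name) only does the splitting; the IDs are made up.

```python
def chunked(ids, size=100):
    """Yield consecutive slices of `ids`, each holding at most `size` entries."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Made-up IDs, just to show how a long list is split up
demo_ids = [f"track_{n}" for n in range(250)]
batches = list(chunked(demo_ids))
print([len(batch) for batch in batches])
```

Each batch could then be handed to *sp.audio_features(batch)* in place of the single-ID call inside the loop.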

### Step 2.2: Charting Data

As we now have a collection of data points that represent different feature values for one complete playlist, we should be able to graph our findings using Altair. While there are many available charts, we will start with graphing **one feature's values for every item (track) in the series**. 

To illustrate this concept, we will use **Altair's scatterplot** to chart **each track's tempo**. This can be done by setting the Chart's data source to **audio_features_df**, its **x** variable to **track_name**, and its **y** variable to **tempo**.

Here's our chart:

In [10]:
alt.Chart(audio_features_df).mark_point().encode(
    x="track_name",
    y='tempo'
)

Note: by default, Altair will sort the tracks alphabetically. If you prefer to keep the original order or sort them in some particular way, you should set the **sort** attribute on the axis of interest. Specifically, we will set up our **x** variable this way: *x=alt.X("track_name", sort=None)* instead of *x="track_name"*.

You can read more about **Altair's axis sorting** [here](https://altair-viz.github.io/user_guide/generated/channels/altair.X.html). 

##### Adding Multiple Variables:


While there are many available charts, one useful way to visually illustrate a correlation between two variables (think DataFrame columns) is **constructing a scatterplot using two data ranges**. 

In general, a Scatterplot requires **two variables (data ranges)** that will be mapped according to their corresponding values. For example, consider **"energy"** and **"loudness"**. Our first track (Shaggy: Boombastic) has an "energy" score of 0.538 and a "loudness" score of -16.183, which together make one of the points on the scatterplot: (0.538, -16.183); the second track (Ini Kamoze: Here Comes The Hotstepper) makes up the (0.454, -8.598) datapoint – hopefully, you can see where this is going.

You can read more about Altair's scatterplots [here](https://altair-viz.github.io/gallery/scatter_tooltips.html).

In the example below, we are using **audio_features_df** as the data source, **"energy"** as the x (horizontal variable) and **"loudness"** as the y (vertical variable). Let's take a look at the result:

In [11]:
alt.Chart(audio_features_df).mark_point().encode(
    x='energy',
    y='loudness'
)

As you can see in the example above, "energy" and "loudness" tend to have somewhat of a **corresponding upward trend**: for items with higher "energy", "loudness" tends to be higher, too. This, in turn, corresponds to our natural hypothesis: one could normally expect a higher-energy track to be louder. Mathematically, the relationship between these two variables could be described as a **positive correlation**.

While we've briefly talked about Correlation in our Pandas lab, we invite you to read more about it [here](https://www.washington.edu/assessment/scanning-scoring/scoring/reports/correlations/).

Using Pandas' built-in *pandas.Series.corr()* method, it is extremely easy to obtain the **Correlation Coefficient** for the two variables:

In [11]:
audio_features_df['energy'].corr(audio_features_df['loudness'])

0.5252140232410945

While there is a multitude of aspects to correlation (including test types, sample sizes, strengths, variance, and many other factors), it sometimes can be a useful statistical measure in your Music Data Analysis exploration.
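
To see what the coefficient measures, it can be computed by hand. The sketch below applies the usual Pearson formula (the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations) to the five tracks shown in the *head()* output above, and compares the result with Pandas. With only five rows, the value will not match the 16-track figure reported above.

```python
import numpy as np
import pandas as pd

# "energy" and "loudness" for the five tracks shown in the head() output
sample = pd.DataFrame({
    "energy":   [0.538, 0.454, 0.684, 0.751, 0.675],
    "loudness": [-16.183, -8.598, -13.822, -5.355, -7.232],
})

# Deviations of each value from its column mean
x = sample["energy"] - sample["energy"].mean()
y = sample["loudness"] - sample["loudness"].mean()

# Pearson's r by hand, then the same quantity from Pandas
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = sample["energy"].corr(sample["loudness"])
print(r_manual, r_pandas)
```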

--- 

To categorize your tracks, you would sometimes need to map their values from a continuous range onto a discrete range. Typically, we call this process **"binning"**. Binning usually involves creating a new column within the existing (or in a new) DataFrame such that the new column's values correspond to the discretely defined categories of the item (based on some threshold value).

You can read more about continuous and discrete variables [here](https://en.wikipedia.org/wiki/Continuous_or_discrete_variable).

For example, consider **"danceability"**, a continuous variable with values ranging from 0 to 1. In order to **"bin"** our tracks, we will classify everything with a "danceability" score of 0.75 and higher as a dance tune. For this, we'll create a new column, "dance_tune": if a track's "danceability" score is equal to or above 0.75, its "dance_tune" value should be True; otherwise, it should be set to False.

This can be easily done using Pandas and NumPy's np.where method, which you can learn more about [here](https://numpy.org/doc/stable/reference/generated/numpy.where.html).

Here's how to do it:

In [12]:
feature_based_tracks = audio_features_df.copy() # make a copy of the DataFrame
feature_based_tracks["dance_tune"] = np.where(feature_based_tracks['danceability'] >= 0.75, True, False)
feature_based_tracks.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,dance_tune
0,Shaggy,Boombastic,Boombastic,4fxF8ljwryMZX5c9EKrLFE,0.867,0.538,2,-16.183,1,0.361,1.7e-05,0.316,0.781,158.328,249933,4,True
1,Ini Kamoze,Here Comes The Hotstepper,Here Comes the Hotstepper - Heartical Mix,3QRM0qZB7oMYavveH0iEqx,0.889,0.454,4,-8.598,0,0.221,0.000186,0.203,0.436,100.36,250467,4,True
2,Shaggy,Boombastic,In The Summertime,726KAdf3k8Ce8W95O38XNI,0.734,0.684,1,-13.822,1,0.227,0.0,0.0497,0.962,173.607,238360,4,False
3,Afroman,Waiting to Inhale,Colt 40ty Fiva,3hody5PjTIzwoiV3hnAvWL,0.666,0.751,0,-5.355,0,0.216,0.0,0.294,0.607,175.891,201440,4,False
4,Salt-N-Pepa,Very Necessary,Shoop,0Pu71wxadDlB8fJXfjIjeJ,0.939,0.675,0,-7.232,1,0.211,0.0,0.0565,0.795,96.918,248573,4,True


At this point, you should be able to see which tunes are "Dance Tunes" based on our categorization threshold. Excitingly, Altair provides an easy way to visualize our findings using a **bar chart**.

You can learn more about Altair's bar charts [here](https://altair-viz.github.io/gallery/simple_bar_chart.html).

Here's how to do it:

In [13]:
alt.Chart(feature_based_tracks).mark_bar().encode(
    x='dance_tune',
    y='count()'
)

As you can see, our data indicates that out of 16 songs in the playlist, 13 are Dance Tunes (i.e. have a "danceability" score of at least 0.75) and 3 are not.

<br> 

If we were looking to make our lives even more complicated, we could **bin "energy"** based on a 0.75 "energy" score threshold:

In [14]:
feature_based_tracks["energy_tune"] = np.where(feature_based_tracks['energy'] >= 0.75, True, False)
feature_based_tracks.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,dance_tune,energy_tune
0,Shaggy,Boombastic,Boombastic,4fxF8ljwryMZX5c9EKrLFE,0.867,0.538,2,-16.183,1,0.361,1.7e-05,0.316,0.781,158.328,249933,4,True,False
1,Ini Kamoze,Here Comes The Hotstepper,Here Comes the Hotstepper - Heartical Mix,3QRM0qZB7oMYavveH0iEqx,0.889,0.454,4,-8.598,0,0.221,0.000186,0.203,0.436,100.36,250467,4,True,False
2,Shaggy,Boombastic,In The Summertime,726KAdf3k8Ce8W95O38XNI,0.734,0.684,1,-13.822,1,0.227,0.0,0.0497,0.962,173.607,238360,4,False,False
3,Afroman,Waiting to Inhale,Colt 40ty Fiva,3hody5PjTIzwoiV3hnAvWL,0.666,0.751,0,-5.355,0,0.216,0.0,0.294,0.607,175.891,201440,4,False,True
4,Salt-N-Pepa,Very Necessary,Shoop,0Pu71wxadDlB8fJXfjIjeJ,0.939,0.675,0,-7.232,1,0.211,0.0,0.0565,0.795,96.918,248573,4,True,False


Based on this information, we can **analyze the composition** of our modified DataFrame using Altair. For example, one could think: out of the dance tunes, are most high energy or low energy? 

Here's a way to find out using Altair:

In [15]:
bars = alt.Chart().mark_bar().encode(
    x=alt.X('energy_tune', title=""),
    y=alt.Y('count()', title='Count'),
    color=alt.Color('energy_tune', title="High energy")
)

alt.layer(bars, data=feature_based_tracks).facet(
    column=alt.Column('dance_tune', title = "Dance tune")
)
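
The same question can also be answered numerically with a cross-tabulation. The sketch below uses Pandas' *pd.crosstab* on the two Boolean columns for the five tracks shown in the *head()* output above; on the full DataFrame, you would pass the columns of **feature_based_tracks** instead.

```python
import pandas as pd

# Boolean columns for the five tracks shown in the head() output
sample = pd.DataFrame({
    "dance_tune":  [True, True, False, False, True],
    "energy_tune": [False, False, False, True, False],
})

# Rows: dance_tune (False, True); columns: energy_tune (False, True)
counts = pd.crosstab(sample["dance_tune"], sample["energy_tune"])
print(counts)
```

Each cell holds the number of tracks with that combination, which is what the height of each bar in the faceted chart represents.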

Alternatively, we can use **Altair's built-in binning** (the *bin=True* option on an encoding channel) to produce more standard binning. This is typically useful when creating a Histogram.

You can learn more about binning and histograms in Altair [here](https://altair-viz.github.io/gallery/simple_histogram.html).

Here's an example:

In [16]:
alt.Chart(feature_based_tracks).mark_bar().encode(
    alt.X("danceability", bin=True),
    y='count()',
)

<br>

Another extremely useful tool is **sorting a DataFrame** based on one or many columns. As an example, we can sort our brand new DataFrame by the Tracks' "energy":

In [17]:
my_sorted_df = feature_based_tracks.sort_values(['energy'], ascending=[True])
my_sorted_df.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,dance_tune,energy_tune
9,Afroman,The Good Times,Crazy Rap (Colt 45 & 2 Zig Zags),1ACZpHI5vZ5Ea4xGlkdGWM,0.927,0.367,9,-7.797,1,0.382,0.0,0.132,0.576,99.053,328667,4,True,False
1,Ini Kamoze,Here Comes The Hotstepper,Here Comes the Hotstepper - Heartical Mix,3QRM0qZB7oMYavveH0iEqx,0.889,0.454,4,-8.598,0,0.221,0.000186,0.203,0.436,100.36,250467,4,True,False
13,Toots & The Maytals,Funky Kingston,Funky Kingston,26WPI2aksB9XdqmeLfca5z,0.777,0.458,11,-12.358,1,0.0841,0.00295,0.0423,0.961,99.05,295667,4,True,False
7,Vanilla Ice,Ice Ice Baby,Ice Ice Baby - Radio Edit,3sy0rren2cVFNfkDxa0q2e,0.977,0.488,2,-15.962,1,0.12,0.0,0.0826,0.77,115.726,231773,4,True,False
11,Vanilla Ice,To The Extreme,Play That Funky Music,1Ezs8eYxuZjhlgyoI1Bo76,0.851,0.514,4,-15.279,0,0.165,4e-06,0.381,0.573,100.425,285800,4,True,False


Things to note here:
* the tracks in this new DataFrame are arranged based on their "energy" scores
* the tracks' indices (the leftmost column) are no longer consecutive, but they can easily be reset (check out Pandas B lab)
* the tracks can also be sorted by multiple columns (by listing several column names in the first argument)
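
The last two points can be sketched together: sorting by two columns at once and then resetting the index. The four rows below are taken from the tables above; the choice of sort keys is only an example.

```python
import pandas as pd

sample = pd.DataFrame({
    "track_name": ["Boombastic", "In The Summertime", "Colt 40ty Fiva", "Shoop"],
    "energy": [0.538, 0.684, 0.751, 0.675],
    "dance_tune": [True, False, False, True],
})

# Dance tunes first (True before False), then by ascending energy within each group;
# reset_index(drop=True) renumbers the rows 0, 1, 2, ... and discards the old index
sorted_sample = (
    sample.sort_values(["dance_tune", "energy"], ascending=[False, True])
          .reset_index(drop=True)
)
print(sorted_sample)
```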

Let's **chart the Tracks' "energy" based on our new DataFrame**:

In [18]:
alt.Chart(my_sorted_df).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y="energy"
)

### Radar Plots

Radar (or Polar) plots are a useful way to represent multiple variables at once.  

Read more about the various features via [Plotly](https://plotly.com/python/radar-chart/) (under scatter plots):  

In [105]:
# radar function

# "danceability" is repeated at the end so that each trace returns to its starting point
feature_columns = ["danceability", "energy", "speechiness", "liveness", "instrumentalness", "valence", "danceability"]

def createRadarElement(row, feature_cols):
    # Build one polar trace (one track) from a DataFrame row
    return go.Scatterpolar(
        r = row[feature_cols].values.tolist(), 
        theta = feature_cols, 
        mode = 'lines', 
        name = row['track_name'])

def get_radar_plot(playlist_id, features_list):
    # Fetch the playlist, compute its audio features, and draw one trace per track
    current_playlist_audio_df = get_audio_features_df(pd.DataFrame(sp.playlist_items(playlist_id)))
    current_data = list(current_playlist_audio_df.apply(createRadarElement, axis=1, args=(features_list, )))  
    fig = go.Figure(current_data)
    fig.show(renderer='iframe')
    # Static image export needs an extra Plotly dependency (such as the kaleido package)
    fig.write_image(playlist_id + '.png', width=1200, height=800)
    
def get_radar_plots(playlist_id_list, features_list):
    for item in playlist_id_list:
        get_radar_plot(item, features_list)

In [108]:
playlist_id = "1NppEwvZhkjeG3ZTYoOwVM"
get_radar_plot(playlist_id, feature_columns)
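
With one trace per track, a radar plot can get crowded for long playlists. One possible alternative (a sketch, not part of the functions above) is to plot a single trace for the playlist's average profile. The snippet below only prepares the `r` and `theta` lists for such a trace, using three tracks and three features from the tables above.

```python
import pandas as pd

features = ["danceability", "energy", "valence"]

# Three tracks from the audio features table above
sample = pd.DataFrame({
    "danceability": [0.867, 0.889, 0.734],
    "energy": [0.538, 0.454, 0.684],
    "valence": [0.781, 0.436, 0.962],
})

# Average value of each feature across the tracks
mean_profile = sample[features].mean().tolist()

# Repeat the first entry at the end so the outline closes on itself
theta = features + features[:1]
r = mean_profile + mean_profile[:1]
print(list(zip(theta, r)))
```

These two lists could be passed to *go.Scatterpolar(r=r, theta=theta, mode='lines')* in the same way createRadarElement() does for a single track.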

----------
## **Part 3: Comparing Playlists**

In this part, we will obtain tracks from multiple playlists (from the same user) and compare these playlists using Altair's charts and scatterplots and Pandas' built-in statistics methods.

<br>

Building onto Max Hilsdorf's code, we can create a function that would **produce an audio features DataFrame for all tracks for a given Spotify User** based on a User ID:

In [24]:
# preserve the name of the playlist in the dataframe
def get_all_user_tracks(username):
  all_my_playlists = pd.DataFrame(sp.user_playlists(username))
  list_of_dataframes = []

  for playlist in all_my_playlists.index:
    current_playlist = pd.DataFrame(sp.user_playlist_tracks(username, all_my_playlists["items"][playlist]["id"]))
    current_playlist_audio = get_audio_features_df(current_playlist)
    if all_my_playlists["items"][playlist]["name"]:
      current_playlist_audio["playlist_name"] = all_my_playlists["items"][playlist]["name"]
    else:
      current_playlist_audio["playlist_name"] = None
    list_of_dataframes.append(current_playlist_audio)

  return pd.concat(list_of_dataframes)

Using this function, we can get **all tracks contained in Oleh's followed public playlists** and **produce an Audio Features DataFrame** for them:

In [30]:
# Getting all of the current user's tracks
all_my_tracks = get_all_user_tracks(my_username)
all_my_tracks["Author"] = "user_1" # noting where the tracks came from
all_my_tracks.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,playlist_name,Author
0,Rogér Fakhr,Fine Anyway (Habibi Funk 016),Fine Anyway,1CvYa7uK1o1YU9k8liIdRB,0.602,0.39,4,-9.198,0,0.0408,0.00025,0.114,0.484,62.47,163437,4,mixtape vol. 2,user_1
1,Rogér Fakhr,Fine Anyway (Habibi Funk 016),Everything You Want,3aR7UNYXSNeV5nfEkvqZGz,0.553,0.574,3,-7.87,1,0.0288,4.9e-05,0.103,0.427,114.845,144653,4,mixtape vol. 2,user_1
2,Graham Nash,Wild Tales,Hey You (Looking at the Moon),0io1WX3hD42QdTnT9LRfRZ,0.747,0.32,0,-12.585,1,0.0324,0.00424,0.061,0.535,107.087,137280,4,mixtape vol. 2,user_1
3,Jim Croce,You Don't Mess Around With Jim,Operator (That's Not the Way It Feels),3NJzkMApQqAudLSgYb5Bz2,0.687,0.453,7,-11.649,1,0.0335,8e-06,0.0897,0.83,129.987,230213,4,mixtape vol. 2,user_1
4,Van Morrison,Moondance (Deluxe Edition),And It Stoned Me - 2013 Remaster,3n5iUh2Z6P7cnWins22W0F,0.593,0.468,7,-11.165,1,0.0285,3.5e-05,0.0919,0.597,75.798,272160,4,mixtape vol. 2,user_1


As you can see, our new DataFrame consists of 267 tracks – which are pretty much **all user_1's tracks**. Using the familiar tools from Altair, we can **produce a color-coded chart** for the new collection.

Specifically, we can **color** the data points **based on the playlist** they are in:

In [31]:
alt.Chart(all_my_tracks).mark_point().encode(
    x="liveness",
    y="danceability",
    color="playlist_name"
)

Bigger sample sizes support stronger observations! A trend spotted in just the 16 Tracks of the initial playlist could be a coincidence; if a similar trend still shows up as the **sample size increases**, you can be more confident that it is real.

Another useful chart would be **charting every track's energy and color-coding the data points** based on what playlist they are in.

Here's how to do it:

In [32]:
alt.Chart(all_my_tracks).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y='energy',
    color="playlist_name",
    tooltip=["artist", "track_name", "playlist_name"]
).properties(
    width=1200
)

As you can see in the chart above, Oleh's playlists tend to **follow a certain "energy" trend (typically upward)** as the playlist progresses. This likely corresponds with how many of you listen to your own playlists: start with less energetic songs and move on to more energetic ones.
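
One rough way to put a number on such a trend is to correlate each track's position within its playlist with its "energy". The sketch below does this on made-up data (two invented playlists, "A" and "B") and assumes the rows appear in playlist order, which is how get_all_user_tracks() builds the DataFrame.

```python
import pandas as pd

# Made-up data: playlist "A" gets more energetic, playlist "B" mostly winds down
sample = pd.DataFrame({
    "playlist_name": ["A"] * 4 + ["B"] * 4,
    "energy": [0.30, 0.45, 0.50, 0.70, 0.80, 0.60, 0.65, 0.40],
})

# Position of each track within its own playlist (0, 1, 2, ...)
sample["position"] = sample.groupby("playlist_name").cumcount()

# Correlation between position and energy, computed separately per playlist
trend = {
    name: group["position"].corr(group["energy"])
    for name, group in sample.groupby("playlist_name")
}
print(trend)
```

A positive value suggests that energy rises as the playlist progresses, and a negative value suggests the opposite.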

Mathematically, we can **describe** each playlist as a subset of the overall DataFrame. You can read more about categorical descriptions in Pandas [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html).

Here's how to get the **description detail for a particular Playlist**:

In [33]:
all_my_tracks[all_my_tracks["playlist_name"] == "Alternative & Indie"].describe()

Unnamed: 0,danceability,energy,loudness,speechiness,liveness,valence,tempo
count,30.0,30.0,30.0,30.0,30.0,30.0,30.0
mean,0.605567,0.6581,-7.0845,0.05091,0.1696,0.549667,115.8822
std,0.155042,0.149876,2.245306,0.036995,0.146648,0.257647,21.587505
min,0.207,0.336,-14.79,0.0265,0.0499,0.198,75.179
25%,0.54575,0.5665,-7.9805,0.031625,0.1045,0.284,100.50875
50%,0.6155,0.669,-6.9525,0.03595,0.1165,0.543,114.87
75%,0.6925,0.74175,-5.60625,0.04845,0.16175,0.77175,128.3745
max,0.876,0.882,-3.809,0.174,0.689,0.975,156.036


Now, let's compare Oleh's tracks to another listener's! For example, we could **get one of user_2's playlists** using the *sp.playlist_items()* method. In this example, we will use a playlist with Playlist ID = "3tt4ET474Xr1uOPgNz8jAY".

Here's how to do it:

In [37]:
user_2_playlist_df = pd.DataFrame(sp.playlist_items("3tt4ET474Xr1uOPgNz8jAY"))
# user_2_playlist_df.head()

Similarly to what we have done earlier, we can **construct an Audio Features DataFrame** for this playlist:

In [36]:
avas_audio_features_df = get_audio_features_df(user_2_playlist_df)
avas_audio_features_df["Author"] = "user_2"
avas_audio_features_df.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,Author
0,Princess Nokia,Gemini - A COLORS SHOW,Gemini - A COLORS SHOW,0KHRpftQXPXk1ZJrBRjbu7,0.707,0.526,9,-8.016,1,0.403,1e-06,0.12,0.809,80.312,184827,4,user_2
1,Alice Phoebe Lou,Witches,Witches,4CZgaNdobtnTfBevPBje0c,0.576,0.719,8,-5.749,1,0.0583,5e-06,0.116,0.729,73.122,195079,4,user_2
2,Japanese Breakfast,Soft Sounds from Another Planet,Boyish,0De8H4o9xzPtjRp9dns0L5,0.227,0.457,2,-7.459,1,0.0299,0.000447,0.189,0.371,106.727,213987,4,user_2
3,Elvis Presley,Elvis 30 #1 Hits,Can't Help Falling In Love,4hAUynwghvrqDXs1ejKNEq,0.438,0.325,2,-11.066,1,0.0268,5e-06,0.107,0.362,101.577,177840,3,user_2
4,King Princess,Prophet,Prophet,4vFTpKeY2F3ckwhULrtS0z,0.502,0.783,7,-4.718,1,0.105,0.000464,0.099,0.626,165.996,249773,4,user_2


In [38]:
# Getting one of Oleh's playlists
gs_playlist_tracks = pd.DataFrame(sp.user_playlist_tracks("sx47r9lq4dwrjx1r0ct9f9m09", "47VfnY1RsMOadBdy9MCDYW"))
gs_playlist_tracks_audio_df = get_audio_features_df(gs_playlist_tracks)
gs_playlist_tracks_audio_df["Author"] = "user_1"

# Combining the two DataFrames
two_playlists_combined = pd.concat([gs_playlist_tracks_audio_df, avas_audio_features_df], ignore_index=True)
two_playlists_combined.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,Author
0,Prince,Purple Rain,Purple Rain,54X78diSLoUDI3joC2bjMz,0.367,0.452,10,-10.422,1,0.0307,0.00228,0.689,0.189,113.066,520787,4,user_1
1,Bob Dylan,Pat Garrett & Billy The Kid (Soundtrack From T...,Knockin' On Heaven's Door,6HSXNV0b4M4cLJ7ljgVVeh,0.513,0.396,7,-13.061,1,0.0299,0.177,0.11,0.229,140.208,149880,4,user_1
2,The Beatles,Abbey Road (Remastered),Here Comes The Sun - Remastered 2009,6dGnYIeXmHdcikdzNNDMm2,0.557,0.54,9,-10.484,1,0.0347,0.00248,0.179,0.394,129.171,185733,4,user_1
3,David Gilmour,Live in Gdansk,Wish You Were Here - Live in Gdańsk,2q0BviPG80XxEkaCJCrBm8,0.526,0.472,7,-13.148,1,0.037,3e-05,0.982,0.339,124.443,314387,4,user_1
4,Chris Cornell,Chris Cornell (Deluxe Edition),Nothing Compares 2 U - Live At SiriusXM/2015,0tUELgOuOJ3KCsYMDDsNvD,0.434,0.327,0,-10.72,1,0.0312,2e-06,0.686,0.295,119.506,303907,4,user_1


Finally, we can **chart the two playlists side by side**.

In this example, we color the entries based on the Author column and sort them exactly the way they appear in the original playlist (by setting sort=None). We will color user_2's tracks blue and user_1's tracks yellow. Some trends are very visible from the plot:

In [39]:
alt.Chart(two_playlists_combined).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y='energy',
    color="Author",
    tooltip=["artist", "track_name"]
).properties(
    width=1000
)

Note a few things here:
* user_1's tracks (yellow) typically **vary** less, whereas user_2's tracks **vary** greatly
* user_1's tracks (yellow) have **average** energy that is higher than that of user_2's
* user_1's tracks (yellow) follow a visible **trend** in the way they are arranged

<br>

We can support our conclusions mathematically, by exploring Pandas' **descriptions** of the "energy" column for the two sub-DataFrames: 

In [40]:
print("user_1's data: \n", two_playlists_combined[two_playlists_combined["Author"] == "user_1"]["energy"].describe(), "\n")
print("user_2's data: \n", two_playlists_combined[two_playlists_combined["Author"] == "user_2"]["energy"].describe())

user_1's data: 
 count    87.000000
mean      0.678161
std       0.168881
min       0.167000
25%       0.569000
50%       0.710000
75%       0.800000
max       0.954000
Name: energy, dtype: float64 

user_2's data: 
 count    60.000000
mean      0.419764
std       0.200654
min       0.006220
25%       0.284750
50%       0.412500
75%       0.553500
max       0.806000
Name: energy, dtype: float64


As expected, there are some corresponding statistical observations:
* user_1's Standard Deviation (std) is 0.169 whereas user_2's is 0.201 (corresponding to the spread)
* user_1's average "energy" (mean) is 0.678 whereas user_2's is 0.420 (lower average, as expected)
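
The `std` value reported by `describe()` is the sample standard deviation (divisor n - 1). As a minimal sketch using only Python's standard library, and with made-up energy values rather than the playlist data, the same two quantities can be computed like this:

```python
import statistics

# Hypothetical "energy" values for two listeners (not taken from the playlists above)
energy_by_author = {
    "user_1": [0.70, 0.80, 0.60, 0.75],
    "user_2": [0.10, 0.45, 0.80, 0.30],
}

def summarize(values):
    """Return (mean, sample standard deviation), the same convention as pandas' default."""
    return statistics.mean(values), statistics.stdev(values)

summary = {author: summarize(values) for author, values in energy_by_author.items()}

for author, (mean, std) in summary.items():
    print(f"{author}: mean={mean:.3f}, std={std:.3f}")
```

A smaller `std` means the values sit closer to their mean, which is what "vary less" refers to in the list above.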

<br>

Instead of comparing just two playlists, we can compare many! As an example, we'll load **8 of user_2's favorite playlists**:

In [41]:
list_of_user_2_playlists = []
user_2_export_playlists_list = ["3tt4ET474Xr1uOPgNz8jAY",
                              "69bvktIqRHFk56zJLFu3ms", 
                              "5nGnFuPH2G1e2lZwji2qxy",
                              "1H715wD7rkVCSGz0fwtLeH",
                              "35DLrFVs4dK3QreeuQt9vZ",
                              "0N6HSTGQcNhgrsjvdgqjH9",
                              "1BwJKfuRNrnfdkvIpaaSHH",
                              "6AfdBAcUHElsK8cRzMpnc1"]

for item in user_2_export_playlists_list:
  temp_playlist_df = pd.DataFrame(sp.playlist_items(item))
  temp_playlist_audio = get_audio_features_df(temp_playlist_df)
  temp_playlist_audio["playlist_name"] = sp.playlist(item)["name"]
  temp_playlist_audio["Author"] = "user_2"
  list_of_user_2_playlists.append(temp_playlist_audio)

user_2_eight_playlists = pd.concat(list_of_user_2_playlists)
user_2_eight_playlists.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,playlist_name,Author
0,Princess Nokia,Gemini - A COLORS SHOW,Gemini - A COLORS SHOW,0KHRpftQXPXk1ZJrBRjbu7,0.707,0.526,9,-8.016,1,0.403,1e-06,0.12,0.809,80.312,184827,4,2020,user_2
1,Alice Phoebe Lou,Witches,Witches,4CZgaNdobtnTfBevPBje0c,0.576,0.719,8,-5.749,1,0.0583,5e-06,0.116,0.729,73.122,195079,4,2020,user_2
2,Japanese Breakfast,Soft Sounds from Another Planet,Boyish,0De8H4o9xzPtjRp9dns0L5,0.227,0.457,2,-7.459,1,0.0299,0.000447,0.189,0.371,106.727,213987,4,2020,user_2
3,Elvis Presley,Elvis 30 #1 Hits,Can't Help Falling In Love,4hAUynwghvrqDXs1ejKNEq,0.438,0.325,2,-11.066,1,0.0268,5e-06,0.107,0.362,101.577,177840,3,2020,user_2
4,King Princess,Prophet,Prophet,4vFTpKeY2F3ckwhULrtS0z,0.502,0.783,7,-4.718,1,0.105,0.000464,0.099,0.626,165.996,249773,4,2020,user_2


That gives us 218 songs from user_2's playlists in total! And, similarly, we'll **chart them side by side**:

In [42]:
alt.Chart(user_2_eight_playlists).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y='energy',
    color="playlist_name",
    tooltip=["artist", "track_name", "playlist_name"]
).properties(
    width=1200
)

Then, we can create our **shared DataFrame of all the tracks** obtained from user_1's and user_2's Spotify profiles:

In [43]:
two_people_dataframe = pd.concat([user_2_eight_playlists, all_my_tracks], ignore_index=True)
two_people_dataframe

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,playlist_name,Author
0,Princess Nokia,Gemini - A COLORS SHOW,Gemini - A COLORS SHOW,0KHRpftQXPXk1ZJrBRjbu7,0.707,0.526,9,-8.016,1,0.4030,0.000001,0.1200,0.809,80.312,184827,4,2020,user_2
1,Alice Phoebe Lou,Witches,Witches,4CZgaNdobtnTfBevPBje0c,0.576,0.719,8,-5.749,1,0.0583,0.000005,0.1160,0.729,73.122,195079,4,2020,user_2
2,Japanese Breakfast,Soft Sounds from Another Planet,Boyish,0De8H4o9xzPtjRp9dns0L5,0.227,0.457,2,-7.459,1,0.0299,0.000447,0.1890,0.371,106.727,213987,4,2020,user_2
3,Elvis Presley,Elvis 30 #1 Hits,Can't Help Falling In Love,4hAUynwghvrqDXs1ejKNEq,0.438,0.325,2,-11.066,1,0.0268,0.000005,0.1070,0.362,101.577,177840,3,2020,user_2
4,King Princess,Prophet,Prophet,4vFTpKeY2F3ckwhULrtS0z,0.502,0.783,7,-4.718,1,0.1050,0.000464,0.0990,0.626,165.996,249773,4,2020,user_2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
467,Random,Chiasso,Chiasso,4UJoAdTMnqxQHKWKLLQpvK,0.721,0.451,6,-13.285,0,0.1920,0,0.1010,0.460,91.999,182229,4,Italian,user_1
468,Coez,Faccio un casino,Le luci della città,381VH1JFj9q08V1OMlPU5m,0.582,0.544,7,-6.591,1,0.0362,0.000002,0.1350,0.142,129.933,177267,4,Italian,user_1
469,Edoardo Bennato,Burattino Senza Fili,Il gatto e la volpe,5ZSzAggKFjyIL6OLA2xUUg,0.632,0.931,6,-4.675,0,0.0463,0,0.0608,0.943,95.636,178547,4,Italian,user_1
470,Gipsy Kings,!Volare! The Very Best of the Gipsy Kings,Volare,5oVs4alUctAl0B0QhWY0I2,0.612,0.876,4,-8.323,1,0.0515,0.000012,0.3830,0.909,116.104,218467,4,Italian,user_1


The combined DataFrame consists of 472 songs that these two people listen to in total (254 from user_1 and 218 from user_2). Let's **chart out the "energy" values** for these songs to see how the two compare:

In [44]:
alt.Chart(two_people_dataframe).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y='energy',
    color="Author",
    tooltip=["artist", "track_name", "playlist_name"]
).properties(
    width=1200
)

Just as noted earlier (when comparing just two playlists), there are some important things to note here:

* user_1's tracks (yellow) typically **vary** less, whereas user_2's tracks **vary** greatly
* user_1's tracks (yellow) have **average** energy that is higher than that of user_2's
* user_1's tracks (yellow) follow a visible **trend** in the way they are arranged

<br>

We can similarly support our conclusions mathematically, by exploring Pandas' **descriptions** of the "energy" column for the two sub-DataFrames: 

In [45]:
print("user_1's data: \n", two_people_dataframe[two_people_dataframe["Author"] == "user_1"]["energy"].describe(), "\n")
print("user_2's data: \n", two_people_dataframe[two_people_dataframe["Author"] == "user_2"]["energy"].describe())

user_1's data: 
 count    254.000000
mean       0.575360
std        0.200207
min        0.058000
25%        0.453250
50%        0.575000
75%        0.723500
max        0.984000
Name: energy, dtype: float64 

user_2's data: 
 count    218.000000
mean       0.473340
std        0.220546
min        0.006220
25%        0.310750
50%        0.472000
75%        0.647000
max        0.973000
Name: energy, dtype: float64


As expected, there are some corresponding statistical observations:
* user_1's Standard Deviation (std) is 0.200 whereas user_2's is 0.221 (corresponding to the spread)
* user_1's average "energy" (mean) is 0.575 whereas user_2's is 0.473 (lower average, as expected)
<br>

----
## **Part 4: Network Graph Visualization**

In this part, we'll explore some basic Network Theory graphing for Spotify's Artists and Songs based on Recommended and Related songs and artists.

#### Step 4.1: Network Basics

At first, we will illustrate the basics of Pyvis-based **Network Graphs**. Generally speaking, a Network Graph is a visual structure designed to emphasize connections between discrete entities. It consists of Nodes and Edges, which represent a system of connected or related elements, and is largely studied within Network Theory. 

You can learn more about Network Theory [here](https://en.wikipedia.org/wiki/Network_theory) and explore Network Graphs [here](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)).

Here's how to **build, populate, and show a simple Network Graph** using NetworkX and Pyvis:

In [46]:
# Creating a Network
g = net.Network(notebook=True, width=1000, height = 800)

# Adding nodes
g.add_node("John")
g.add_node("Paul")

# Adding an edge
g.add_edge("John", "Paul")

# Showing the network
g.show("example.html")

As you can see, in this example we created two nodes, "John" and "Paul", and connected them. We are able to **add nodes** to an existing network by calling *net.Network.add_node()* and **add edges** to the same network by calling *net.Network.add_edge()*. It is also possible to **get all nodes** by calling *net.Network.get_nodes()*.

Using these tools, we can **check if a node is in a network**:

In [47]:
# checking
"John" in g.get_nodes()

True

Building on these tools, we can create something more advanced, for example **a diagram of Oleh's playlists** (by iterating over the playlists returned by *sp.user_playlists*). We will scale the nodes (playlists) based on their size using Pyvis' **value** attribute of Nodes.

Here's how to do it:

In [49]:
# Creating a Network with one center Node
playlists_network = net.Network(notebook=True, width=1000, height = 800)
playlists_network.add_node("user_1's Spotify", color="#ffffff")

# As we want to record both playlist names and corresponding sizes, we need a Dictionary:
user_1_playlist_dictionary = {}
user_1_playlists = pd.DataFrame(sp.user_playlists(my_username)["items"])

# Iterating over the playlists and recording Names and Sizes
for i in range(len(user_1_playlists)):
    user_1_playlist_dictionary[user_1_playlists.loc[i]["name"]] = user_1_playlists["tracks"][i]["total"]

# Adding new Nodes and Edges based on the items in the Dictionary:
for item in user_1_playlist_dictionary:
    playlists_network.add_node(item, value=user_1_playlist_dictionary[item])
    playlists_network.add_edge("user_1's Spotify", item)

# Showing the Network Graph
playlists_network.show("playlists_diagram.html")

As expected, we can see the center node we added at first – which is now connected to 8 other nodes, which all correspond to Oleh's playlists. These nodes are sized based on the playlists' sizes (number of tracks) and named based on the playlists' names. **This is a simple undirected network**.
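
The dictionary-building loop above can also be written as a comprehension. The sketch below uses a hand-written list shaped like the `items` of a playlists response (each entry has a `name` and a `tracks` object with a `total`); the playlist names and sizes are invented for illustration, and no Spotify or Pyvis call is made.

```python
# Hypothetical playlist entries, mimicking the fields used in the loop above
playlist_items = [
    {"name": "Italian", "tracks": {"total": 31}},
    {"name": "Road Trip", "tracks": {"total": 87}},
    {"name": "Focus", "tracks": {"total": 12}},
]

# Map each playlist name to its number of tracks
playlist_sizes = {item["name"]: item["tracks"]["total"] for item in playlist_items}

# Edges of the star-shaped network: the center node connected to every playlist
center = "user_1's Spotify"
edges = [(center, name) for name in playlist_sizes]

print(playlist_sizes)
print(edges)
```

Each `(center, name)` pair corresponds to one `add_edge` call, and each size would be passed as the node's `value`.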

----

Now, we'll get into slightly more complicated things.

Spotify API provides a way to **get related artists** given an Artist ID. According to Spotify, this method returns a collection of artists "similar to a given artist", and the **"similarity is based on analysis of the Spotify community's listening history"**.

You can learn more about Spotify's Related Artists method [here](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-an-artists-related-artists).

Reflecting this method, Spotipy conveniently has *sp.artist_related_artists*, which returns a collection of artists related to an artist. Making use of this method, one could think of a function that would go through a number of related artists (**limit**) and add graph Nodes and Edges corresponding to the newly discovered related artists. We will also **size nodes** based on popularity.

Here's what such a function could look like:

In [50]:
def add_related_artists(starting_artist_name, starting_artist_id, existing_graph, limit, order_group=None):
    # get artists related to the current artist
    current_artist_related = pd.DataFrame(sp.artist_related_artists(starting_artist_id)["artists"])
    # loop through at most `limit` related artists (fewer if the API returned fewer)
    for i in range(min(limit, len(current_artist_related))):
        related = current_artist_related.loc[i]
        # add the node only if it is not in the graph yet
        if related["name"] not in existing_graph.get_nodes():
            # group (color) by the caller-supplied value, or by position in the first generation
            group = order_group if order_group else (i + 1)
            existing_graph.add_node(related["name"], value=int(related["popularity"]), group=group)
        # add edge
        existing_graph.add_edge(starting_artist_name, related["name"])

Get Artist Albums

In [52]:

headers = {
    'Authorization': 'Bearer {token}'.format(token=access_token)
}
BASE_URL = 'https://api.spotify.com/v1/'
artist_id = '7nwUJBm0HE4ZxD3f5cy5ok'

# pull the artist's albums (at most 50 per request)
r = requests.get(BASE_URL + 'artists/' + artist_id + '/albums', 
                 headers=headers, 
                 params={'include_groups': 'album', 'limit': 50})
d = r.json()

df = pd.DataFrame(d)
df
# df["items"][0]

Unnamed: 0,href,items,limit,next,offset,previous,total
0,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56
1,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56
2,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56
3,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56
4,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56
5,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56
6,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56
7,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56
8,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56
9,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,"{'album_group': 'album', 'album_type': 'album'...",50,https://api.spotify.com/v1/artists/7nwUJBm0HE4...,0,,56


In the cell below, we will make use of the function we just defined. Using this function and some basic information, we will **produce a Network Graph for two generations (circles) of artists related to The Beatles**. 

As noted, we will start with The Beatles (Artist ID = "3WrFJ7ztbogyGnTHbHJFl2", Name = "The Beatles")

In [53]:
## First, we need to record the information about The Beatles
center_artist_id = "3WrFJ7ztbogyGnTHbHJFl2"
center_artist_name = "The Beatles"
center_artist_popularity = 80

# # or, we need to record the information about Berlin
# center_artist_id = "6uRJnvQ3f8whVnmeoecv5Z"
# center_artist_name = "Berlin"
# center_artist_popularity = 100

# # or, we need to record the information about Aretha
# center_artist_id = "7nwUJBm0HE4ZxD3f5cy5ok"
# center_artist_name = "Aretha"
# center_artist_popularity = 100

# limit: how many related artists per generation we are interested in
limit = 5

center_artist_related = pd.DataFrame(sp.artist_related_artists(center_artist_id)["artists"]).loc[0:(limit-1)]

# setting up the Network
artist_network = net.Network(notebook=True, width=1000, height=800)
artist_network.add_node(center_artist_name, value=center_artist_popularity, color="#ffffff", group=0)

# Getting the first circle of related artists:
add_related_artists(center_artist_name, center_artist_id, artist_network, limit)

# artist_network.add_node("test")

# Showing the Network Graph
artist_network.show("artist_example.html")

In order to further complicate our lives, we can **add one more generation of related artists** (think friends of friends):

In [54]:
# Running through the once-related artists
for i in range(limit):
    add_related_artists(center_artist_related.loc[i]["name"], center_artist_related.loc[i]["id"], artist_network, limit, (i+1))

# Showing the Network Graph
artist_network.show("artist_example.html")

As you can see, the Network Graph above provides some very interesting information and prompts some very important thoughts. Think about: 
* Why are the nodes located the way they are located? 
* Who are the artists we've missed? 
* How are these people related?
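
The two-generation expansion above is a breadth-first walk that stops at a fixed depth. Here is a library-free sketch of that idea. The `RELATED` table is a made-up stand-in for what `sp.artist_related_artists` would return, and `expand` is a hypothetical helper, not part of Spotipy or Pyvis.

```python
from collections import deque

# Made-up "related artists" table standing in for the API response
RELATED = {
    "The Beatles": ["John Lennon", "Paul McCartney", "The Rolling Stones"],
    "John Lennon": ["The Beatles", "Paul McCartney", "George Harrison"],
    "Paul McCartney": ["Wings", "John Lennon"],
    "The Rolling Stones": ["The Who", "The Beatles"],
}

def expand(center, generations, limit):
    """Collect nodes (with their generation) and edges up to `generations` hops from `center`."""
    generation_of = {center: 0}
    edges = set()
    queue = deque([center])
    while queue:
        artist = queue.popleft()
        if generation_of[artist] == generations:
            continue  # do not expand beyond the last generation
        for related in RELATED.get(artist, [])[:limit]:
            edges.add(frozenset((artist, related)))  # undirected edge, stored once
            if related not in generation_of:  # skip artists already in the graph
                generation_of[related] = generation_of[artist] + 1
                queue.append(related)
    return generation_of, edges

nodes, edges = expand("The Beatles", generations=2, limit=2)
print(nodes)
print(len(edges))
```

Because already-seen artists are skipped as nodes but still linked by an edge, the result can contain cycles, which is one reason the drawn graph is not a simple tree.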

<br>

Similarly to Related Artists, Spotify API has a way of **recommending songs** based on a "seed" of tracks. According to the API Documentation, "recommendations **are generated based on the available information for a given seed entity and matched against similar artists and tracks**".

You can read more about Spotify's Recommendations [here](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-recommendations).

This method is mirrored by Spotipy – specifically, in the *sp.recommendations* method. One could think of a function that would **get a generation of recommended songs and add them to a Network Graph** (scaled by popularity):

In [55]:
def add_related_songs(starting_song_name, starting_artist_name, starting_song_id, existing_graph, limit, first_gen=True, order_group=None):
    current_song_related = pd.DataFrame(sp.recommendations(seed_tracks=[starting_song_id])["tracks"])
    for i in range(limit):
        if str(current_song_related.loc[i]["artists"][0]["name"] + ": " + current_song_related.loc[i]["name"]) not in existing_graph.get_nodes():
            if order_group:
                existing_graph.add_node(str(current_song_related.loc[i]["artists"][0]["name"] + ": " + current_song_related.loc[i]["name"]), value=int(current_song_related.loc[i]["popularity"]), group=order_group)
            else:
                existing_graph.add_node(str(current_song_related.loc[i]["artists"][0]["name"] + ": " + current_song_related.loc[i]["name"]), value=int(current_song_related.loc[i]["popularity"]), group=(i+1))
        existing_graph.add_edge(str(starting_artist_name + ": " + starting_song_name), str(current_song_related.loc[i]["artists"][0]["name"] + ": " + current_song_related.loc[i]["name"]))
    return current_song_related

In the cell below, we will make use of the function we just defined. Using this function and some basic information, we will **produce a Network Graph for two generations (circles) of recommended songs based on Ben E. King's Stand By Me**. 

As noted, we will start with Stand By Me (Song ID = "3SdTKo2uVsxFblQjpScoHy")

In [56]:
# First, we need to record the information about Stand By Me
center_song = sp.track("3SdTKo2uVsxFblQjpScoHy")
# Or Mahler 1

# center_song = sp.track("7vZoMrrqsqfO96vortxxjn")

# Or Lasso

# center_song = sp.track("4CAp8WXEotxJLE5A2c3Yup")

center_song_id = center_song["id"]
center_song_artist = center_song["artists"][0]["name"]
center_song_name = center_song["name"]
center_song_popularity = int(center_song["popularity"])



# limit: how many recommended songs per generation we are interested in
limit = 3

# Creating the Network graph and adding the center Node
song_network = net.Network(notebook=True, width=1000, height=800)
song_network.add_node(str(center_song_artist + ": " + center_song_name), value=center_song_popularity, color="#ffffff", group=0)

# Getting the first circle of recommended songs:
recommended_songs = add_related_songs(center_song_name, center_song_artist, center_song_id, song_network, limit)

# Showing the Network
song_network.show("song_network_short.html")

Similarly to Related Artists, we will further complicate our lives by **adding one more generation of recommended songs** (with no extra seed knowledge):

In [57]:
# Getting the second generation of Recommended songs
for i in range(limit):
    add_related_songs(recommended_songs.loc[i]["name"], recommended_songs.loc[i]["artists"][0]["name"], recommended_songs.loc[i]["id"], song_network, limit, False, (i+1))

# Showing the network
song_network.show("song_network.html")

Interestingly, Spotify's recommendations for songs **change every time you run your code**. We encourage you to re-run the previous two cells a few times! Just like the Related Artists graph, the Network Graph above provides some very interesting information and prompts some very important thoughts. Think about: 
* Why are the nodes located the way they are located? 
* Who are the artists we've missed? 
* How are these people related?
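
The node label `"Artist: Track"` is rebuilt several times inside `add_related_songs`. One possible tidy-up, shown here only as a sketch, is a small helper. `song_label` is a hypothetical name, and the dictionary below merely imitates the two fields of a track object that the function reads.

```python
def song_label(track):
    """Build the 'Artist: Track' label from a track-like dict (first listed artist only)."""
    return f'{track["artists"][0]["name"]}: {track["name"]}'

# A hand-written dict imitating the fields used above (not a real API response)
example_track = {
    "name": "Stand By Me",
    "artists": [{"name": "Ben E. King"}],
}

label = song_label(example_track)
print(label)
```

Building the label in one place also guarantees that the membership check, the node, and the edge all use exactly the same string.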

<br>

Finally, we can make one slight tweak to our add_related_songs function. Previously, we passed only one track as the seed for the GET Recommendations method. The new function below does essentially the same thing as the one above, except that it accepts **a list of seed tracks** (the Recommendations endpoint accepts up to 5 seeds), so we can **pass several random tracks from the graph as the recommendation seed**:

In [58]:
def add_related_songs_gen(starting_song_name, starting_artist_name, starting_song_id, existing_graph, limit, first_gen=True, order_group=None):
    current_song_related = pd.DataFrame(sp.recommendations(seed_tracks=starting_song_id)["tracks"]).loc[0:(limit - 1)]
    for i in range(limit):
        if str(current_song_related.loc[i]["artists"][0]["name"] + ": " + current_song_related.loc[i]["name"]) not in existing_graph.get_nodes():
            if order_group:
                existing_graph.add_node(str(current_song_related.loc[i]["artists"][0]["name"] + ": " + current_song_related.loc[i]["name"]), value=int(current_song_related.loc[i]["popularity"]), group=order_group)
            else:
                existing_graph.add_node(str(current_song_related.loc[i]["artists"][0]["name"] + ": " + current_song_related.loc[i]["name"]), value=int(current_song_related.loc[i]["popularity"]), group=(i+1))
        existing_graph.add_edge(str(starting_artist_name + ": " + starting_song_name), str(current_song_related.loc[i]["artists"][0]["name"] + ": " + current_song_related.loc[i]["name"]))
    return current_song_related

We will run this function for **two generations** for the same song (Stand By Me by Ben E. King):

In [59]:
# Start the network
song_network = net.Network(notebook=False, width=1000, height=800)
song_network.add_node(str(center_song_artist + ": " + center_song_name), value=center_song_popularity, color="#ffffff", group=0)

# First generation
recommended_songs = add_related_songs_gen(center_song_name, center_song_artist, [center_song_id], song_network, limit)

# Second generation
for i in range(limit):
    add_related_songs_gen(recommended_songs.loc[i]["name"], recommended_songs.loc[i]["artists"][0]["name"], random.sample(list(recommended_songs["id"]), 3), song_network, limit, False, (i+1))

# Show the network Graph
song_network.show("song_network.html")

Note that this graph looks a little different! What are **some of your observations**?
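
One detail worth keeping in mind when sampling seeds: `random.sample` raises `ValueError` if asked for more items than the list holds, and the Recommendations endpoint accepts at most 5 seeds in total. A defensive sketch could look like this (the helper name `pick_seeds` and the IDs are placeholders, not part of Spotipy):

```python
import random

MAX_SEEDS = 5  # upper bound on seeds accepted by the Recommendations endpoint

def pick_seeds(track_ids, k, rng=random):
    """Pick up to k distinct seed IDs, never more than MAX_SEEDS or than are available."""
    unique_ids = list(dict.fromkeys(track_ids))  # drop duplicates, keep order
    k = min(k, MAX_SEEDS, len(unique_ids))
    return rng.sample(unique_ids, k)

# Placeholder IDs for illustration only
candidate_ids = ["id_a", "id_b", "id_c", "id_b"]
seeds = pick_seeds(candidate_ids, k=5, rng=random.Random(0))
print(seeds)
```

Passing a seeded `random.Random` instance makes the choice repeatable, which can help when comparing graphs between runs.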

----

### Louvain Community Detection

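In the last part of this Notebook, we will briefly explore Louvain Community Detection. In short, this is a method that partitions a network into communities: groups of nodes that are more densely connected to each other than to the rest of the graph. Coloring the nodes by community lets us identify these groups visually.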

You can learn more about the mathematics behind the Louvain algorithm [here](https://towardsdatascience.com/louvain-algorithm-93fde589f58c), explore its documentation [here](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.louvain.louvain_communities.html), and read about its applications [here](https://towardsdatascience.com/louvains-algorithm-for-community-detection-in-python-95ff7f675306).

In our example, we will **identify Louvain communities of artists** within the playlists they belong to.
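
Louvain searches for a partition that maximizes modularity, a score that compares the edges inside each community with what would be expected from the node degrees alone. For an unweighted, undirected graph with `m` edges, one standard form is `Q = sum over communities c of (L_c / m - (d_c / (2 * m)) ** 2)`, where `L_c` is the number of edges inside community `c` and `d_c` is the total degree of its nodes. The sketch below only evaluates this score for two hand-picked partitions of an invented toy graph; it does not implement the Louvain search itself.

```python
def modularity(edges, communities):
    """Modularity of a partition of an unweighted, undirected graph without self-loops."""
    m = len(edges)
    community_of = {node: idx for idx, group in enumerate(communities) for node in group}
    inside = [0] * len(communities)   # L_c: edges with both ends in community c
    degree = [0] * len(communities)   # d_c: sum of node degrees in community c
    for a, b in edges:
        degree[community_of[a]] += 1
        degree[community_of[b]] += 1
        if community_of[a] == community_of[b]:
            inside[community_of[a]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(inside, degree))

# Toy graph: two triangles joined by a single bridge edge
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"),
         ("c", "x")]

two_groups = modularity(edges, [{"a", "b", "c"}, {"x", "y", "z"}])
one_group = modularity(edges, [{"a", "b", "c", "x", "y", "z"}])
print(round(two_groups, 3), round(one_group, 3))
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the kind of improvement the algorithm looks for.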

<br>

At first, let's **pick 5 playlists** centered around a common theme. For example, let's choose "Rock". By searching on Spotify, we found and randomly picked these 5 playlists:
* [Rock Classics](https://open.spotify.com/playlist/37i9dQZF1DWXRqgorJj26U): "Rock legends & epic songs that continue to inspire generations", Spotify Playlist ID: "37i9dQZF1DWXRqgorJj26U"
* [Rock Mix](https://open.spotify.com/playlist/37i9dQZF1EQpj7X7UK8OOF): "Fleetwood Mac, Elton John, Steve Miller Band and more", Spotify Playlist ID: "37i9dQZF1EQpj7X7UK8OOF"
* [Classic Rock Drive](https://open.spotify.com/playlist/37i9dQZF1DXdOEFt9ZX0dh): "Classic rock to get your motor running. Cover: AC/DC", Spotify Playlist ID: "37i9dQZF1DXdOEFt9ZX0dh"
* [Rock Drive](https://open.spotify.com/playlist/37i9dQZF1DX7Ku6cgJPhh5): "Amp up your commute with these rock hits. Cover: Foo Fighters", Spotify Playlist ID: "37i9dQZF1DX7Ku6cgJPhh5"
* [Dad Rock](https://open.spotify.com/playlist/37i9dQZF1DX09NvEVpeM77): "Classic rock favorites. Cover: Bruce Springsteen", Spotify Playlist ID: "37i9dQZF1DX09NvEVpeM77"

<br>

Let's **put these playlists in a DataFrame**:

In [88]:
# Three Model Lists.  Two of them share only one.  Two of them share all except one.  
# What do we expect?
rock_playlists_dfs_list = []
PL_1 = '37i9dQZF1DWXRqgorJj26U'
PL_2 = '37i9dQZF1EQpj7X7UK8OOF'
PL_3 = '37i9dQZF1DXdOEFt9ZX0dh'
PL_4 = '37i9dQZF1DX7Ku6cgJPhh5'
PL_5 = '37i9dQZF1DX09NvEVpeM77'

rock_playlists_ids_list = [PL_1, PL_2, PL_3]



# # Create a list of playlists
# rock_playlists_dfs_list = []
# rock_playlists_ids_list = ["37i9dQZF1DWXRqgorJj26U",
#                           "37i9dQZF1EQpj7X7UK8OOF",
#                           "37i9dQZF1DXdOEFt9ZX0dh",
#                            "37i9dQZF1DX7Ku6cgJPhh5",
#                           "37i9dQZF1DX09NvEVpeM77"]

# Looping through the items and producing Audio Features DataFrames
for item in rock_playlists_ids_list:
  temp_playlist_df = pd.DataFrame(sp.playlist_items(item))
  temp_playlist_audio = get_audio_features_df(temp_playlist_df)
  temp_playlist_audio["playlist_name"] = sp.playlist(item)["name"]
  rock_playlists_dfs_list.append(temp_playlist_audio)
    
# Concatenating the Audio Features DataFrames
rock_playlists_df = pd.concat(rock_playlists_dfs_list)
len(rock_playlists_df)

250

Our new Rock Playlists DataFrame contains 250 tracks gathered across the three playlists we actually loaded (PL_1, PL_2, and PL_3; the other two are defined but left out of the list). As we don't want to overwhelm our Network, we will **choose a random sample** of 100 tracks out of this DataFrame:

In [89]:
# also take sample

input_data_rock_df = rock_playlists_df.reset_index().sample(100)
input_data_rock_df.head()

Unnamed: 0,index,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,playlist_name
145,45,Led Zeppelin,Physical Graffiti (Deluxe Edition),Kashmir - Remaster,6Vjk8MNXpQpi0F4BefdTyq,0.483,0.615,2,-8.538,1,0.0497,0.000414,0.0512,0.594,80.576,517125,3,Rock Mix
201,51,Pat Benatar,Crimes Of Passion,Hit Me With Your Best Shot,0vOkmmJEtjuFZDzrQSFzEE,0.741,0.58,4,-9.05,1,0.0304,3.3e-05,0.212,0.944,127.402,171267,4,Classic Rock Drive
225,75,Free,Fire And Water,All Right Now,1gcESexgftSuLuML57Y69q,0.787,0.472,2,-12.824,1,0.0832,0.000336,0.153,0.824,120.059,330643,4,Classic Rock Drive
172,22,AC/DC,Back In Black,You Shook Me All Night Long,2SiXAy7TuUkycRVbbWDEpo,0.532,0.767,7,-5.509,1,0.0574,0.000513,0.39,0.755,127.361,210173,4,Classic Rock Drive
247,97,Lynyrd Skynyrd,Pronounced' Leh-'Nerd 'Skin-'Nerd,Gimme Three Steps,0x2wtJbtJrox3SDmnMj97x,0.554,0.74,9,-7.706,1,0.0803,6.9e-05,0.225,0.804,133.363,267173,4,Classic Rock Drive


Now, let's **define the Louvain Community Algorithm methods**.

First, we need a method to **create nodes** (courtesy of Daniel Russo Batterham and Richard Freedman):

In [90]:
# Creating an HTML node
def create_node_html(node: str, source_df: pd.DataFrame, node_col: str):
    rows = source_df.loc[source_df[node_col] == node].itertuples()
    html_lis = []
    for r in rows:
        html_lis.append(f"""<li>Artist: {r.artist}<br>
                                Playlist: {r.playlist_name}<br>"""
                       )
    html_ul = f"""<ul>{''.join(html_lis)}</ul>"""
    return html_ul

Then, a method to **add nodes from edge list** (courtesy of Daniel Russo Batterham and Richard Freedman):

In [91]:
# Adding nodes from an Edgelist
def add_nodes_from_edgelist(edge_list: list, 
                               source_df: pd.DataFrame, 
                               graph: nx.Graph,
                               node_col: str):
    graph = deepcopy(graph)
    node_list = pd.Series(edge_list).apply(pd.Series).stack().unique()
    for n in node_list:
        graph.add_node(n, title=create_node_html(n, source_df, node_col), spring_length=1000)
    return graph

Then, the **Louvain Community Builder method** (courtesy of Daniel Russo Batterham and Richard Freedman):

In [92]:
# Adding Louvain Communities
def add_communities(G):
    G = deepcopy(G)
    partition = community_louvain.best_partition(G)
    nx.set_node_attributes(G, partition, "group")
    return G

Finally, we need a method to **produce a Network of pairs**, which we'll run the add_communities method on, marking the Louvain communities:

In [93]:
def choose_network(df, chosen_word, file_name):
    
    # creating unique pairs (sets drop artists repeated within a playlist, avoiding self-pairs)
    output_grouped = df.groupby(['playlist_name'])[chosen_word].apply(set).reset_index()
    pairs = output_grouped[chosen_word].apply(lambda x: list(combinations(x, 2)))
    pairs2 = pairs.explode().dropna()
    unique_pairs = pairs.explode().dropna().unique()
    
    # creating a new Graph
    pyvis_graph = net.Network(notebook=True, width="1000", height="1000", bgcolor="black", font_color="white")
    G = nx.Graph()
    
    try:
        G = add_nodes_from_edgelist(edge_list=unique_pairs, source_df=df, graph=G, node_col=chosen_word)
    except Exception as e:
        print(e)
    
    # add edges and find communities
    G.add_edges_from(unique_pairs)
    G = add_communities(G)
    pyvis_graph.from_nx(G)
    return pyvis_graph

Now, let's run our algorithm to **detect Louvain communities** of artists within the playlists they belong to:

In [94]:
louvain_network = choose_network(input_data_rock_df, 'artist', 'modified_rock.html')
louvain_network.show("modified_rock.html")

In [95]:
output_grouped = input_data_rock_df.groupby(['playlist_name'])['artist'].apply(set).reset_index()
pairs = output_grouped['artist'].apply(lambda x: list(combinations(x, 2)))
pairs2 = pairs.explode().dropna()
unique_pairs = pairs.explode().dropna().unique()



### First Let's Look at the "Grouped" Playlists

In [96]:
output_grouped

Unnamed: 0,playlist_name,artist
0,Classic Rock Drive,"{Black Sabbath, KISS, AC/DC, Yes, Blue Öyster ..."
1,Rock Classics,"{KISS, Tom Petty and the Heartbreakers, The Po..."
2,Rock Mix,"{The Rolling Stones, Dire Straits, System Of A..."


### And the "Pairs" in each List

In [97]:
pairs[0]

[('Black Sabbath', 'KISS'),
 ('Black Sabbath', 'AC/DC'),
 ('Black Sabbath', 'Yes'),
 ('Black Sabbath', 'Blue Öyster Cult'),
 ('Black Sabbath', 'Boston'),
 ('Black Sabbath', 'Heart'),
 ('Black Sabbath', 'Aerosmith'),
 ('Black Sabbath', 'Rush'),
 ('Black Sabbath', 'Kansas'),
 ('Black Sabbath', 'Steve Miller Band'),
 ('Black Sabbath', 'Free'),
 ('Black Sabbath', 'The Who'),
 ('Black Sabbath', 'George Thorogood & The Destroyers'),
 ('Black Sabbath', 'Bruce Springsteen'),
 ('Black Sabbath', 'The Marshall Tucker Band'),
 ('Black Sabbath', "Guns N' Roses"),
 ('Black Sabbath', 'The Black Crowes'),
 ('Black Sabbath', 'Pat Benatar'),
 ('Black Sabbath', 'The Rolling Stones'),
 ('Black Sabbath', 'Foreigner'),
 ('Black Sabbath', 'Eagles'),
 ('Black Sabbath', 'Bachman-Turner Overdrive'),
 ('Black Sabbath', 'Skid Row'),
 ('Black Sabbath', 'Bob Seger'),
 ('Black Sabbath', 'Lynyrd Skynyrd'),
 ('Black Sabbath', 'Journey'),
 ('Black Sabbath', 'The Beatles'),
 ('Black Sabbath', 'Def Leppard'),
 ('Black Sa

In [98]:
# pairs are produced via combinations of all items in a set:

list(combinations(["paul", "john", 'george'], 2))

[('paul', 'john'), ('paul', 'george'), ('john', 'george')]

In [99]:
# not the same thing as permutations (which considers all orderings)

from itertools import permutations
list(permutations(["paul", "john", 'george'], 3))

[('paul', 'john', 'george'),
 ('paul', 'george', 'john'),
 ('john', 'paul', 'george'),
 ('john', 'george', 'paul'),
 ('george', 'paul', 'john'),
 ('george', 'john', 'paul')]
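
To make the difference concrete, here is a small sketch (not part of the original notebook) that counts unordered and ordered pairs for the same three names. It uses only the standard library, and `names` is just an illustrative variable:

```python
from itertools import combinations, permutations
from math import comb, perm

names = ["paul", "john", "george"]

# Unordered pairs: ('paul', 'john') and ('john', 'paul') count as one pair.
unordered_pairs = list(combinations(names, 2))

# Ordered pairs: both orderings are listed separately.
ordered_pairs = list(permutations(names, 2))

# The lengths match the closed-form counts "n choose 2" and "n permute 2".
print(len(unordered_pairs), comb(len(names), 2))
print(len(ordered_pairs), perm(len(names), 2))
```

Because an edge in an undirected graph has no direction, `combinations` is the natural fit for building the edge list.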

In [100]:
# each item in the Series is a list of tuples; the tuples will become the edges!
pairs

0    [(Black Sabbath, KISS), (Black Sabbath, AC/DC)...
1    [(KISS, Tom Petty and the Heartbreakers), (KIS...
2    [(The Rolling Stones, Dire Straits), (The Roll...
Name: artist, dtype: object

In [101]:
pairs.shape

(3,)

### Explode will unnest the lists. Now the length is 1177 pairs from just three playlists!

In [102]:

pairs.explode().apply(sorted)

0                         [Black Sabbath, KISS]
0                        [AC/DC, Black Sabbath]
0                          [Black Sabbath, Yes]
0             [Black Sabbath, Blue Öyster Cult]
0                       [Black Sabbath, Boston]
                        ...                    
2                    [Aerosmith, Guns N' Roses]
2         [Aerosmith, Rage Against The Machine]
2               [Guns N' Roses, Twisted Sister]
2    [Rage Against The Machine, Twisted Sister]
2     [Guns N' Roses, Rage Against The Machine]
Name: artist, Length: 1177, dtype: object

### There are nevertheless duplicate edges! We could keep them to compute edge weights.
### But the key thing is that Louvain does NOT know about the original lists!
### It creates the communities only on the basis of the edges.


In [103]:

# pairs2 = pairs.explode().dropna()
# dropna() is not needed for this data

# note that the original lists are NOT here!

# other attributes (weights, for instance) could be added to this data structure
unique_pairs = pairs.explode().unique()
unique_pairs

array([('Black Sabbath', 'KISS'), ('Black Sabbath', 'AC/DC'),
       ('Black Sabbath', 'Yes'), ..., ('Twisted Sister', "Guns N' Roses"),
       ('Twisted Sister', 'Rage Against The Machine'),
       ("Guns N' Roses", 'Rage Against The Machine')], dtype=object)
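
The point that community detection sees only edges can be illustrated with a toy graph. This is a minimal sketch, assuming NetworkX 2.8 or later, which provides `louvain_communities`; the notebook's own `add_communities` helper may use a different implementation. The node names and the two groups are hypothetical:

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import louvain_communities

# Two hypothetical "playlists" that share no artists...
group_a = ["A1", "A2", "A3", "A4"]
group_b = ["B1", "B2", "B3", "B4"]

# ...flattened into a single edge list. The grouping is no longer visible.
edges = list(combinations(group_a, 2)) + list(combinations(group_b, 2))
edges.append(("A1", "B1"))  # a single bridge between the two groups

G = nx.Graph()
G.add_edges_from(edges)

# Louvain receives only nodes and edges; seed makes the run repeatable.
communities = louvain_communities(G, seed=42)
print(sorted(sorted(c) for c in communities))
```

On a graph this clearly separated, the algorithm should recover the two groups from edge density alone, even though the group labels were never passed in.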

### Tuples are hashable, so pandas can count them with value_counts()
### These counts could serve as edge weights in the graph

In [104]:

pairs.explode().value_counts()

(Led Zeppelin, Bon Jovi)                                  3
(Steve Miller Band, Bon Jovi)                             2
(KISS, Led Zeppelin)                                      2
(The Who, Lynyrd Skynyrd)                                 2
(Steve Miller Band, Led Zeppelin)                         2
                                                         ..
(Pat Benatar, Foreigner)                                  1
(Tom Petty and the Heartbreakers, Stone Temple Pilots)    1
(The Rolling Stones, Journey)                             1
(Boston, Aerosmith)                                       1
(The Who, The Rolling Stones)                             1
Name: artist, Length: 1122, dtype: int64
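
One caveat when turning these counts into weights: `('A', 'B')` and `('B', 'A')` are different tuples, and sets do not guarantee an iteration order, so the same pair of artists can be counted under two keys. The sketch below, which uses hypothetical playlists and only the standard library, sorts each artist set first so every pair has a single canonical form:

```python
from collections import Counter
from itertools import combinations

# Hypothetical playlists standing in for the grouped artist sets.
playlists = {
    "Playlist 1": {"KISS", "Led Zeppelin", "Bon Jovi"},
    "Playlist 2": {"Led Zeppelin", "Bon Jovi"},
    "Playlist 3": {"KISS", "Bon Jovi"},
}

edge_weights = Counter()
for artists in playlists.values():
    # Sorting first means each pair always comes out in the same order.
    for pair in combinations(sorted(artists), 2):
        edge_weights[pair] += 1

# (source, target, weight) triples, e.g. for Graph.add_weighted_edges_from
weighted_edges = [(a, b, w) for (a, b), w in edge_weights.items()]
print(edge_weights.most_common())
```

Each weight is the number of playlists in which the two artists appear together.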

In the Network Graph above, you can see the 5 **communities** of artists, detected from which artists appear together in a playlist. Note: *we didn't pass the playlist information into the Network*!

What are your **observations**?

<br>

----

<br>

## **Part 5: Reflections**

You've learned **a lot** through this lab! We would love to hear what you liked/disliked most and what you found most interesting. Thank you!