___________
# **Music Data Analysis: Spotify API with Pandas, Altair, and NetworkX**

This notebook illustrates several ways to work with the Spotify API using Spotipy – a Python package designed to enable user-friendly (ish) interactions with Spotify's music metadata. First, we will use Spotipy and Pandas to **set up a DataFrame containing a collection of songs (tracks)** found by a playlist ID. Then, we will investigate ways to **visually represent and compare** this collection using Altair and explore the basics of **network graph visualization** using Pyvis and NetworkX.

You can learn more about these resources here:
* [Spotify API](https://developer.spotify.com/documentation/web-api/)
* [Spotipy](https://spotipy.readthedocs.io/en/master/#)
* [Pandas](https://pandas.pydata.org/)
* [Altair](https://altair-viz.github.io/)
* [Pyvis](https://pyvis.readthedocs.io/en/latest/)
* [NetworkX](https://networkx.org/)

### Brief Introduction: Spotify, APIs, Spotify API

As many of you know, **Spotify** is a music streaming service founded in 2006. At the time of writing, the service has about 182 million paying subscribers and hosts more than 70 million tracks. In 2014, Spotify released the **Spotify API**, a web-based interface that allows anyone with a Spotify account to search, analyze, and manipulate Spotify's music metadata. In short, **an API** is a piece of software that enables two or more programs to talk to each other. You can learn more about APIs [here](https://en.wikipedia.org/wiki/API).

Going through this notebook, you'll be able to request Spotify API access for your personal notebook and perform all sorts of analyses on the tracks, users, artists, albums, and playlists that interest you. While some of the material covered in this notebook is very basic, some elements might seem quite puzzling. Please don't hesitate to reach out and ask questions.
______

## **Part 1: Setting up**
#### Step 1.1: Importing Python Libraries

In [1]:
import pandas as pd
import numpy as np
import random
import altair as alt
import requests
import inspect
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import networkx as nx
import networkx.algorithms.community as nx_comm
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pyvis
from pyvis import network as net

#### Step 1.2: Providing User Credentials


In order to utilize the functionality of Spotify's API, you'll need to establish a connection between the local endpoint (your laptop) and the API (cloud). To do that, you'll need to create a **web client** (read more [here](https://en.wikipedia.org/wiki/Client_(computing))).

A web client typically requires authentication parameters **(key and secret)**. The Spotify API uses the OAuth 2.0 authorization scheme. As we don't want to trouble you with setting up your own credentials, we have created one common set of login credentials for this course. You can learn more about authentication [here](https://en.wikipedia.org/wiki/OAuth).

The credentials are stored below:

In [2]:
# storing the credentials:
CLIENT_ID = "116bae2a86fd4737862816c5f45d4c36"
CLIENT_SECRET = "4f4a732d83d04cfa94acc26d2b77169f"
my_username = "sx47r9lq4dwrjx1r0ct9f9m09"

# instantiating the client
# source: Max Hilsdorf (https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6)
client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

At this point, you should be able to access the API! Hence, we move on to retrieving and analyzing music metadata.
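
If you later register your own Spotify app, you may prefer to keep the key and secret out of the notebook itself. Below is a minimal sketch, assuming the values are stored in the environment variables `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` (the names Spotipy itself looks for). The helper `load_credentials` is hypothetical and not part of Spotipy, and the values in `demo_env` are placeholders.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper for illustration; raises if either value is missing.
    """
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET first")
    return client_id, client_secret


# Demonstration with placeholder values (not real credentials).
demo_env = {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_credentials(demo_env)
print(client_id)
```

The resulting pair could then be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)` in the same way as the constants in the cell above.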

----------
## **Part 2: Analyzing Playlists**

### Step 2.1: Obtaining Data

We can **get tracks in a playlist** of a user using the *sp.user_playlist_tracks(username, playlist)* method and turning it into a Pandas DataFrame. The two parameters we need for this are **user ID** and **playlist ID**; they can be easily found on the Spotify website or in the Spotify app. Just look in the URL bar and copy the IDs as Strings.

In this example, we are using the following data:
* "sx47r9lq4dwrjx1r0ct9f9m09": Oleh's **Spotify User ID**. Typically, a Spotify ID is formatted somewhat nicer (e.g. "barackobama"), but Oleh somehow messed his up...
* "7KfWEjHxpcOIkqvDqMW5RV": the **Playlist ID** for one of Oleh's playlists. 

Both playlist ID and User ID can be found in a web browser when accessing the User's or Playlist's webpage.
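
Copying IDs out of the URL bar by hand is easy to get wrong, so here is a small sketch that pulls the ID out of a pasted link. It assumes URLs of the form `https://open.spotify.com/<kind>/<id>`; the helper name `spotify_id_from_url` and the `?si=abc` query string are made up for illustration.

```python
from urllib.parse import urlparse


def spotify_id_from_url(url):
    """Return (kind, id) from an open.spotify.com URL, e.g. ('playlist', '7Kf...')."""
    path_parts = [part for part in urlparse(url).path.split("/") if part]
    # The last two path segments are the resource kind and its ID;
    # any query string (e.g. '?si=...') is not part of the path and is ignored.
    kind, spotify_id = path_parts[-2], path_parts[-1]
    return kind, spotify_id


print(spotify_id_from_url("https://open.spotify.com/playlist/7KfWEjHxpcOIkqvDqMW5RV"))
print(spotify_id_from_url("https://open.spotify.com/user/sx47r9lq4dwrjx1r0ct9f9m09?si=abc"))
```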

In [3]:
# sp.user_playlist_tracks(user_id: str, playlist_id: str) returns a JSON-like dict
playlist_tracks = pd.DataFrame(sp.user_playlist_tracks("sx47r9lq4dwrjx1r0ct9f9m09", "7KfWEjHxpcOIkqvDqMW5RV"))
playlist_tracks

Unnamed: 0,href,items,limit,next,offset,previous,total
0,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:22:18Z', 'added_by...",100,,0,,16
1,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:22:32Z', 'added_by...",100,,0,,16
2,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:22:50Z', 'added_by...",100,,0,,16
3,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:23:10Z', 'added_by...",100,,0,,16
4,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:23:22Z', 'added_by...",100,,0,,16
5,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:23:38Z', 'added_by...",100,,0,,16
6,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:26:57Z', 'added_by...",100,,0,,16
7,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:31:50Z', 'added_by...",100,,0,,16
8,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:35:09Z', 'added_by...",100,,0,,16
9,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:38:34Z', 'added_by...",100,,0,,16
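
The `limit`, `next`, and `total` columns above are paging information: the API returns tracks in pages (here up to 100 per page), and `next` is empty because all 16 tracks fit on the first page. For longer playlists, the remaining pages would have to be requested separately. Below is a minimal sketch of such a loop, assuming a client object with a `next(page)` method like the one Spotipy's `Spotify.next` provides. `FakeClient` and `collect_all_items` are hypothetical names, and the fake client only exists so that the example runs without network access.

```python
def collect_all_items(client, first_page):
    """Follow the 'next' links of a paged result and return every item."""
    items = list(first_page["items"])
    page = first_page
    while page.get("next"):
        page = client.next(page)  # with Spotipy this would be sp.next(page)
        items.extend(page["items"])
    return items


class FakeClient:
    """Stand-in for a Spotipy client, serving pre-built pages by key."""

    def __init__(self, pages):
        self.pages = pages

    def next(self, page):
        return self.pages[page["next"]]


pages = {"page-2": {"items": ["c", "d"], "next": None}}
first_page = {"items": ["a", "b"], "next": "page-2"}
all_items = collect_all_items(FakeClient(pages), first_page)
print(all_items)
```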


We can take a look at an **individual track** here:

In [4]:
sample_track = playlist_tracks.iloc[0]["items"]["track"]
sample_track

{'album': {'album_type': 'album',
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/5EvFsr3kj42KNv97ZEnqij'},
    'href': 'https://api.spotify.com/v1/artists/5EvFsr3kj42KNv97ZEnqij',
    'id': '5EvFsr3kj42KNv97ZEnqij',
    'name': 'Shaggy',
    'type': 'artist',
    'uri': 'spotify:artist:5EvFsr3kj42KNv97ZEnqij'}],
  'available_markets': ['AD', 'AE', 'AL', 'AM', 'AO', 'AR', 'AT', 'AU', 'AZ', 'BA',
   'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BN', 'BO', 'BR',
   'BT', 'BW', 'BY', 'BZ', 'CA', 'CD', 'CG', 'CH', 'CI', 'CL',
   'CM', 'CO', 'CR', 'CV', 'CW', 'CY', 'CZ', 'DE', 'DJ', 'DK',
   'DZ', 'EC', 'EE', 'EG', 'ES', 'FI', 'FJ', 'FM', 'FR', 'GA',
   'GB', 'GE', 'GH', 'GM', 'GN', 'GQ', 'GR', 'GT', 'GW', 'GY',
   'HK', 'HN', 'HR', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IS',
   ...

As you can see, tracks are stored as **JSON objects** (think Dictionaries), which you can read more about [here](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON). Each Track object has many attributes, including "album", "artists", "id", "duration_ms", "popularity", "name" etc. Some of these are extremely useful to us! You can learn more about Spotify's Track features [here](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-track).
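
To make the "think Dictionaries" comparison concrete, here is a small sketch of how such a nested object can be navigated in Python. The `sample` dictionary is a trimmed-down, hand-written copy of the track shown above (only a few keys are kept), not a real API response.

```python
# A trimmed-down track object with the same nesting as the output above.
sample = {
    "album": {
        "album_type": "album",
        "artists": [{"id": "5EvFsr3kj42KNv97ZEnqij", "name": "Shaggy", "type": "artist"}],
    },
    "id": "4fxF8ljwryMZX5c9EKrLFE",
    "name": "Boombastic",
}

# Walk the nesting step by step: dict -> dict -> list -> dict.
album_artist = sample["album"]["artists"][0]["name"]
track_keys = sorted(sample.keys())
# .get() returns a default instead of raising KeyError for a missing attribute.
popularity = sample.get("popularity", "not available")
print(album_artist, track_keys, popularity)
```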

While this information is already a lot (!), we can extract some perhaps-more-interesting features of tracks via the Audio Features method. Using *sp.audio_features(track_id)*, we can easily get a track's audio features (by track ID):

In [5]:
sample_track_audio_features = pd.DataFrame(sp.audio_features(sample_track["id"]))
sample_track_audio_features

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.867,0.538,2,-16.183,1,0.361,0.242,1.7e-05,0.316,0.781,158.328,audio_features,4fxF8ljwryMZX5c9EKrLFE,spotify:track:4fxF8ljwryMZX5c9EKrLFE,https://api.spotify.com/v1/tracks/4fxF8ljwryMZ...,https://api.spotify.com/v1/audio-analysis/4fxF...,249933,4


As you can see, each track has **a large number of recorded audio features**. These are typically generated by Spotify and cover various musical aspects, ranging from Loudness to Liveness, from Danceability to Duration, and from Tempo to Time Signature. The feature values are of different **data types**: "key" is an **Integer**, "energy" is a **Float**, "id" is a **String**, and "mode" is a **Boolean** represented as Integer. As you work your way through this notebook, you will discover many options to count, bin, sort, graph, and connect variables and values of different types.
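
The data types mentioned above can be inspected directly in Pandas. The sketch below rebuilds four of the columns from the sample track by hand and checks their types; the column names `is_major` and `mode_label` are introduced here for illustration only.

```python
import pandas as pd

features = pd.DataFrame(
    {
        "key": [2],
        "energy": [0.538],
        "id": ["4fxF8ljwryMZX5c9EKrLFE"],
        "mode": [1],
    }
)

print(features.dtypes)

# Spotify encodes mode as 1 (major) or 0 (minor); a boolean or a label is often easier to read.
features["is_major"] = features["mode"].astype(bool)
features["mode_label"] = features["mode"].map({1: "major", 0: "minor"})
print(features[["mode", "is_major", "mode_label"]])
```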

Consider the function below (courtesy of Max Hilsdorf), which can help us **loop through the items of a playlist and get every track's [audio] features of interest**:

In [6]:
# This function is created based on Max Hilsdorf's article
# Source: https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6
def get_audio_features_df(playlist):
    
    # Create an empty dataframe
    playlist_features_list = ["artist", "album", "track_name", "track_id","danceability","energy","key","loudness","mode", "speechiness","instrumentalness","liveness","valence","tempo", "duration_ms","time_signature"]
    playlist_df = pd.DataFrame(columns = playlist_features_list)
    
    # Loop through every track in the playlist, extract features and append the features to the playlist df
    for track in playlist["items"]:
        # Create empty dict
        playlist_features = {}
        # Get metadata
        playlist_features["artist"] = track["track"]["album"]["artists"][0]["name"]
        playlist_features["album"] = track["track"]["album"]["name"]
        playlist_features["track_name"] = track["track"]["name"]
        playlist_features["track_id"] = track["track"]["id"]
        
        # Get audio features
        audio_features = sp.audio_features(playlist_features["track_id"])[0]
        for feature in playlist_features_list[4:]:
            playlist_features[feature] = audio_features[feature]
        
        # Concat the DataFrames
        track_df = pd.DataFrame(playlist_features, index = [0])
        playlist_df = pd.concat([playlist_df, track_df], ignore_index = True)
        
    return playlist_df

Note: the **playlist parameter** (that is passed in to the get_audio_features_df() function) should be a **DataFrame whose "items" column holds one track object per row**. In our case, we have one such collection stored in **playlist_tracks**, which we got from calling sp.user_playlist_tracks() on a playlist and storing it as a Pandas DataFrame.

Hence, we run the get_audio_features_df() function on our collection to obtain the **audio features DataFrame** for the tracks in **playlist_tracks**.

In [7]:
audio_features_df = get_audio_features_df(playlist_tracks)
audio_features_df

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Shaggy,Boombastic,Boombastic,4fxF8ljwryMZX5c9EKrLFE,0.867,0.538,2,-16.183,1,0.361,1.7e-05,0.316,0.781,158.328,249933,4
1,Ini Kamoze,Here Comes The Hotstepper,Here Comes the Hotstepper - Heartical Mix,3QRM0qZB7oMYavveH0iEqx,0.889,0.454,4,-8.598,0,0.221,0.000186,0.203,0.436,100.36,250467,4
2,Shaggy,Boombastic,In The Summertime,726KAdf3k8Ce8W95O38XNI,0.734,0.684,1,-13.822,1,0.227,0.0,0.0497,0.962,173.607,238360,4
3,Afroman,Waiting to Inhale,Colt 40ty Fiva,3hody5PjTIzwoiV3hnAvWL,0.666,0.751,0,-5.355,0,0.216,0.0,0.294,0.607,175.891,201440,4
4,Salt-N-Pepa,Very Necessary,Shoop,0Pu71wxadDlB8fJXfjIjeJ,0.939,0.675,0,-7.232,1,0.211,0.0,0.0565,0.795,96.918,248573,4
5,Shaggy,The Boombastic Collection - Best of Shaggy,It Wasn't Me,0OaunKfsxkgBvPv68jBbmm,0.852,0.604,0,-4.569,1,0.0642,0.0,0.327,0.667,94.762,227547,4
6,House Of Pain,House of Pain (Fine Malt Lyrics),Jump Around,3TZwjdclvWt7iPJUnMpgcs,0.854,0.71,4,-6.32,0,0.0793,8.7e-05,0.166,0.818,106.894,214947,4
7,Vanilla Ice,Ice Ice Baby,Ice Ice Baby - Radio Edit,3sy0rren2cVFNfkDxa0q2e,0.977,0.488,2,-15.962,1,0.12,0.0,0.0826,0.77,115.726,231773,4
8,Beastie Boys,Ill Communication,Sure Shot,21REQ1bCUWphT2QK3bLWYQ,0.692,0.799,1,-7.924,1,0.164,0.0,0.301,0.549,97.978,199667,4
9,Afroman,The Good Times,Crazy Rap (Colt 45 & 2 Zig Zags),1ACZpHI5vZ5Ea4xGlkdGWM,0.927,0.367,9,-7.797,1,0.382,0.0,0.132,0.576,99.053,328667,4


As you can see above, our new DataFrame contains **Spotify's audio features for every track in the provided playlist**.
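
The function above makes one `sp.audio_features` request per track. The underlying Spotify endpoint also accepts several IDs in a single request (up to 100, according to the Web API documentation), so a longer playlist could be processed in batches. The sketch below shows the batching step only: `chunked` is a hypothetical helper, the IDs are placeholders, and the commented-out line marks where the API call would go.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` elements each."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


track_ids = [f"track-{n}" for n in range(250)]  # placeholder IDs, not real tracks

batches = list(chunked(track_ids, 100))
for batch in batches:
    # features = sp.audio_features(batch)  # one request per batch of IDs
    pass

print([len(batch) for batch in batches])
```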

### Step 2.2: Charting Data

As we now have a collection of data points that represent different feature values for one complete playlist, we should be able to graph our findings using Altair. While there are many available charts, we will start with graphing **one feature's values for every item (track) in the series**. 

To illustrate this concept, we will use **Altair's scatterplot** to chart **each track's tempo**. This can be done by setting the Chart's data source to **audio_features_df**, its **x** variable to **track_name**, and its **y** variable to **tempo**.

Here's our chart:

In [8]:
alt.Chart(audio_features_df).mark_point().encode(
    x="track_name",
    y='tempo'
)

Note: by default, Altair will sort the tracks alphabetically. If you prefer to keep the original order or sort them in some particular way, you should set the **sort** attribute on the axis of interest. Specifically, we will set up our **x** variable this way: *x=alt.X("track_name", sort=None)* instead of *x="track_name"*.

You can read more about **Altair's axis sorting** [here](https://altair-viz.github.io/user_guide/generated/channels/altair.X.html). 

##### Adding Multiple Variables:


While there are many available charts, one useful way to visually illustrate a correlation between two variables (think DataFrame columns) is **constructing a scatterplot using two data ranges**. 

In general, a Scatterplot requires **two variables (data ranges)** that will be mapped according to their corresponding values. For example, consider **"energy"** and **"loudness"**. Our first track (Shaggy: Boombastic) has an "energy" score of 0.538 and a "loudness" score of -16.183, which together make one of the points on the scatterplot: (0.538, -16.183); the second track (Ini Kamoze: Here Comes The Hotstepper) makes up the (0.454, -8.598) datapoint – hopefully, you can see where this is going.
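
The pairing of values into points can be reproduced in a couple of lines. The sketch below uses a hand-built DataFrame with just the two tracks mentioned above rather than the full **audio_features_df**.

```python
import pandas as pd

tracks = pd.DataFrame(
    {
        "track_name": ["Boombastic", "Here Comes the Hotstepper - Heartical Mix"],
        "energy": [0.538, 0.454],
        "loudness": [-16.183, -8.598],
    }
)

# Each row becomes one (x, y) point on the scatterplot.
points = list(zip(tracks["energy"], tracks["loudness"]))
print(points)
```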

You can read more about Altair's scatterplots [here](https://altair-viz.github.io/gallery/scatter_tooltips.html).

In the example below, we are using **audio_features_df** as the data source, **"energy"** as the x (horizontal variable) and **"loudness"** as the y (vertical variable). Let's take a look at the result:

In [9]:
alt.Chart(audio_features_df).mark_point().encode(
    x='energy',
    y='loudness'
)

As you can see in the example above, "energy" and "loudness" tend to follow a **corresponding upward trend**: for items with higher "energy", "loudness" tends to be higher, too. This, in turn, corresponds to our natural hypothesis: one could normally expect a higher-energy track to be louder. Mathematically, the relationship between these two variables could be described as a **positive correlation**.

While we've briefly talked about Correlation in our Pandas lab, we invite you to read more about it [here](https://www.washington.edu/assessment/scanning-scoring/scoring/reports/correlations/).

Using Pandas' built-in *pandas.Series.corr()* method, it is extremely easy to obtain the **Correlation Coefficient** for the two variables:

In [10]:
audio_features_df['energy'].corr(audio_features_df['loudness'])

0.5252140232410945

While there is a multitude of aspects to correlation (including test types, sample sizes, strengths, variance, and many other factors), it sometimes can be a useful statistical measure in your Music Data Analysis exploration.
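
For reference, the coefficient returned by *corr()* is Pearson's $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. The sketch below computes it by hand and compares it with Pandas. It uses only the ten tracks visible in the table above, so the value it prints is not expected to match the 16-track result.

```python
import numpy as np
import pandas as pd

# "energy" and "loudness" for the ten tracks shown in the table above.
energy = pd.Series([0.538, 0.454, 0.684, 0.751, 0.675, 0.604, 0.71, 0.488, 0.799, 0.367])
loudness = pd.Series([-16.183, -8.598, -13.822, -5.355, -7.232, -4.569, -6.32, -15.962, -7.924, -7.797])

# Pearson's r: summed product of deviations divided by the product of the spreads.
dx = energy - energy.mean()
dy = loudness - loudness.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = energy.corr(loudness)  # same statistic, computed by Pandas
print(r_manual, r_pandas)
```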

--- 

To categorize your tracks, you would sometimes need to map their values from a continuous range onto a discrete range. Typically, we call this process **"binning"**. Binning usually involves creating a new column within the existing (or in a new) DataFrame such that the new column's values correspond to the discretely defined categories of the item (based on some threshold value).

You can read more about continuous and discrete variables [here](https://en.wikipedia.org/wiki/Continuous_or_discrete_variable).

For example, consider **"danceability"** – a continuous variable with values ranging from 0 to 1. In order to **"bin"** our tracks, we will classify everything with a "danceability" score of 0.75 and higher as a dance tune. For this, we'll create a new column – "dance_tune", and if a track's "danceability" score is equal to or above 0.75, its "dance_tune" value should be True; otherwise, it should be set to False.

This can be easily done using Pandas and NumPy's np.where method, which you can learn more about [here](https://numpy.org/doc/stable/reference/generated/numpy.where.html).

Here's how to do it:

In [11]:
feature_based_tracks = audio_features_df.copy() # make a copy of the DataFrame
feature_based_tracks["dance_tune"] = np.where(feature_based_tracks['danceability'] >= 0.75, True, False)
feature_based_tracks

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,dance_tune
0,Shaggy,Boombastic,Boombastic,4fxF8ljwryMZX5c9EKrLFE,0.867,0.538,2,-16.183,1,0.361,1.7e-05,0.316,0.781,158.328,249933,4,True
1,Ini Kamoze,Here Comes The Hotstepper,Here Comes the Hotstepper - Heartical Mix,3QRM0qZB7oMYavveH0iEqx,0.889,0.454,4,-8.598,0,0.221,0.000186,0.203,0.436,100.36,250467,4,True
2,Shaggy,Boombastic,In The Summertime,726KAdf3k8Ce8W95O38XNI,0.734,0.684,1,-13.822,1,0.227,0.0,0.0497,0.962,173.607,238360,4,False
3,Afroman,Waiting to Inhale,Colt 40ty Fiva,3hody5PjTIzwoiV3hnAvWL,0.666,0.751,0,-5.355,0,0.216,0.0,0.294,0.607,175.891,201440,4,False
4,Salt-N-Pepa,Very Necessary,Shoop,0Pu71wxadDlB8fJXfjIjeJ,0.939,0.675,0,-7.232,1,0.211,0.0,0.0565,0.795,96.918,248573,4,True
5,Shaggy,The Boombastic Collection - Best of Shaggy,It Wasn't Me,0OaunKfsxkgBvPv68jBbmm,0.852,0.604,0,-4.569,1,0.0642,0.0,0.327,0.667,94.762,227547,4,True
6,House Of Pain,House of Pain (Fine Malt Lyrics),Jump Around,3TZwjdclvWt7iPJUnMpgcs,0.854,0.71,4,-6.32,0,0.0793,8.7e-05,0.166,0.818,106.894,214947,4,True
7,Vanilla Ice,Ice Ice Baby,Ice Ice Baby - Radio Edit,3sy0rren2cVFNfkDxa0q2e,0.977,0.488,2,-15.962,1,0.12,0.0,0.0826,0.77,115.726,231773,4,True
8,Beastie Boys,Ill Communication,Sure Shot,21REQ1bCUWphT2QK3bLWYQ,0.692,0.799,1,-7.924,1,0.164,0.0,0.301,0.549,97.978,199667,4,False
9,Afroman,The Good Times,Crazy Rap (Colt 45 & 2 Zig Zags),1ACZpHI5vZ5Ea4xGlkdGWM,0.927,0.367,9,-7.797,1,0.382,0.0,0.132,0.576,99.053,328667,4,True
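
The same idea extends to more than two categories. The sketch below uses `pd.cut` to bin four of the "danceability" values into three levels. The 0.5 cut-off and the "low"/"medium"/"high" labels are assumptions made for this example and are not part of the lab.

```python
import numpy as np
import pandas as pd

tracks = pd.DataFrame({"danceability": [0.867, 0.734, 0.666, 0.939]})

# A comparison already yields True/False, so np.where(..., True, False) is optional.
tracks["dance_tune"] = tracks["danceability"] >= 0.75

# Three illustrative categories; the 0.5 cut-off is an assumption, not from the lab.
tracks["dance_level"] = pd.cut(
    tracks["danceability"],
    bins=[0.0, 0.5, 0.75, np.inf],
    labels=["low", "medium", "high"],
    right=False,  # intervals are [lower, upper), so 0.75 itself counts as "high"
)
print(tracks)
```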


At this point, you should be able to see which tunes are "Dance Tunes" based on our categorization threshold. Excitingly, Altair provides an easy way to visualize our findings using a **bar chart**.

You can learn more about Altair's bar charts [here](https://altair-viz.github.io/gallery/simple_bar_chart.html).

Here's how to do it:

In [12]:
alt.Chart(feature_based_tracks).mark_bar().encode(
    x='dance_tune',
    y='count()'
)

As you can see, our data indicates that out of 16 songs in the playlist, 13 are Dance Tunes (i.e., have a "danceability" score of at least 0.75) and 3 are not.
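
The numbers behind the bars can also be obtained without a chart. The sketch below counts the categories in Pandas, using only the ten "danceability" values visible in the table above, so its counts cover 10 tracks rather than all 16.

```python
import pandas as pd

# "danceability" for the ten tracks visible in the table above (the playlist has 16).
danceability = pd.Series([0.867, 0.889, 0.734, 0.666, 0.939, 0.852, 0.854, 0.977, 0.692, 0.927])
dance_tune = danceability >= 0.75

counts = dance_tune.value_counts()   # one count per category
n_dance = int(dance_tune.sum())      # True counts as 1
n_other = int((~dance_tune).sum())
print(counts)
print(n_dance, n_other)
```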

<br> 

If we were looking to make our lives even more complicated, we could **bin "energy"** based on a 0.75 "energy" score threshold:

In [13]:
feature_based_tracks["energy_tune"] = np.where(feature_based_tracks['energy'] >= 0.75, True, False)
feature_based_tracks

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,dance_tune,energy_tune
0,Shaggy,Boombastic,Boombastic,4fxF8ljwryMZX5c9EKrLFE,0.867,0.538,2,-16.183,1,0.361,1.7e-05,0.316,0.781,158.328,249933,4,True,False
1,Ini Kamoze,Here Comes The Hotstepper,Here Comes the Hotstepper - Heartical Mix,3QRM0qZB7oMYavveH0iEqx,0.889,0.454,4,-8.598,0,0.221,0.000186,0.203,0.436,100.36,250467,4,True,False
2,Shaggy,Boombastic,In The Summertime,726KAdf3k8Ce8W95O38XNI,0.734,0.684,1,-13.822,1,0.227,0.0,0.0497,0.962,173.607,238360,4,False,False
3,Afroman,Waiting to Inhale,Colt 40ty Fiva,3hody5PjTIzwoiV3hnAvWL,0.666,0.751,0,-5.355,0,0.216,0.0,0.294,0.607,175.891,201440,4,False,True
4,Salt-N-Pepa,Very Necessary,Shoop,0Pu71wxadDlB8fJXfjIjeJ,0.939,0.675,0,-7.232,1,0.211,0.0,0.0565,0.795,96.918,248573,4,True,False
5,Shaggy,The Boombastic Collection - Best of Shaggy,It Wasn't Me,0OaunKfsxkgBvPv68jBbmm,0.852,0.604,0,-4.569,1,0.0642,0.0,0.327,0.667,94.762,227547,4,True,False
6,House Of Pain,House of Pain (Fine Malt Lyrics),Jump Around,3TZwjdclvWt7iPJUnMpgcs,0.854,0.71,4,-6.32,0,0.0793,8.7e-05,0.166,0.818,106.894,214947,4,True,False
7,Vanilla Ice,Ice Ice Baby,Ice Ice Baby - Radio Edit,3sy0rren2cVFNfkDxa0q2e,0.977,0.488,2,-15.962,1,0.12,0.0,0.0826,0.77,115.726,231773,4,True,False
8,Beastie Boys,Ill Communication,Sure Shot,21REQ1bCUWphT2QK3bLWYQ,0.692,0.799,1,-7.924,1,0.164,0.0,0.301,0.549,97.978,199667,4,False,True
9,Afroman,The Good Times,Crazy Rap (Colt 45 & 2 Zig Zags),1ACZpHI5vZ5Ea4xGlkdGWM,0.927,0.367,9,-7.797,1,0.382,0.0,0.132,0.576,99.053,328667,4,True,False


Based on this information, we can **analyze the composition** of our modified DataFrame using Altair. For example, one could think: out of the dance tunes, are most high energy or low energy? 

Here's a way to find out using Altair:

In [14]:
bars = alt.Chart().mark_bar().encode(
    x=alt.X('energy_tune', title=""),
    y=alt.Y('count()', title='Count'),
    color=alt.Color('energy_tune', title="High energy")
)

alt.layer(bars, data=feature_based_tracks).facet(
    column=alt.Column('dance_tune', title = "Dance tune")
)
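
The faceted chart is a visual version of a cross-tabulation. As a rough sketch, the same breakdown can be computed with `pd.crosstab`; the example uses the ten tracks visible in the tables above and replaces True/False with the illustrative labels "dance"/"other" and "high"/"low".

```python
import numpy as np
import pandas as pd

# Values for the ten tracks visible in the tables above.
tracks = pd.DataFrame(
    {
        "danceability": [0.867, 0.889, 0.734, 0.666, 0.939, 0.852, 0.854, 0.977, 0.692, 0.927],
        "energy": [0.538, 0.454, 0.684, 0.751, 0.675, 0.604, 0.71, 0.488, 0.799, 0.367],
    }
)

# String labels keep the resulting table easy to read and to index.
dance = np.where(tracks["danceability"] >= 0.75, "dance", "other")
energy = np.where(tracks["energy"] >= 0.75, "high", "low")

table = pd.crosstab(pd.Series(dance, name="dance_tune"), pd.Series(energy, name="energy_tune"))
print(table)
```

Each cell of `table` corresponds to one bar in the faceted chart: rows are the facets and columns are the bars within a facet.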

Alternatively, we can use **Altair's built-in bin option** on an encoding channel (such as *alt.X*) to produce more standard binning. This is typically useful when creating a Histogram.

You can learn more about binning and histograms in Altair [here](https://altair-viz.github.io/gallery/simple_histogram.html).

Here's an example:

In [15]:
alt.Chart(feature_based_tracks).mark_bar().encode(
    alt.X("danceability", bin=True),
    y='count()',
)
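
To see what a histogram counts, the sketch below bins the ten visible "danceability" values with NumPy. The bin edges are chosen by hand for illustration; with `bin=True`, Altair picks its own edges, so the bars in the chart need not match these counts.

```python
import numpy as np

danceability = [0.867, 0.889, 0.734, 0.666, 0.939, 0.852, 0.854, 0.977, 0.692, 0.927]

# Explicit bin edges chosen for illustration; Altair picks its own edges with bin=True.
edges = [0.6, 0.7, 0.8, 0.9, 1.0]
counts, _ = np.histogram(danceability, bins=edges)

for lower, upper, count in zip(edges[:-1], edges[1:], counts):
    print(f"{lower:.1f}-{upper:.1f}: {count}")
```

In Altair, the number of bins can be influenced by passing `bin=alt.Bin(maxbins=...)` instead of `bin=True`.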

<br>

Another extremely useful tool is **sorting a DataFrame** based on one or many columns. As an example, we can sort our brand new DataFrame by the Tracks' "energy":

In [16]:
my_sorted_df = feature_based_tracks.sort_values(['energy'], ascending=[True])
my_sorted_df

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,dance_tune,energy_tune
9,Afroman,The Good Times,Crazy Rap (Colt 45 & 2 Zig Zags),1ACZpHI5vZ5Ea4xGlkdGWM,0.927,0.367,9,-7.797,1,0.382,0.0,0.132,0.576,99.053,328667,4,True,False
1,Ini Kamoze,Here Comes The Hotstepper,Here Comes the Hotstepper - Heartical Mix,3QRM0qZB7oMYavveH0iEqx,0.889,0.454,4,-8.598,0,0.221,0.000186,0.203,0.436,100.36,250467,4,True,False
13,Toots & The Maytals,Funky Kingston,Funky Kingston,26WPI2aksB9XdqmeLfca5z,0.777,0.458,11,-12.358,1,0.0841,0.00295,0.0423,0.961,99.05,295667,4,True,False
7,Vanilla Ice,Ice Ice Baby,Ice Ice Baby - Radio Edit,3sy0rren2cVFNfkDxa0q2e,0.977,0.488,2,-15.962,1,0.12,0.0,0.0826,0.77,115.726,231773,4,True,False
11,Vanilla Ice,To The Extreme,Play That Funky Music,1Ezs8eYxuZjhlgyoI1Bo76,0.851,0.514,4,-15.279,0,0.165,4e-06,0.381,0.573,100.425,285800,4,True,False
0,Shaggy,Boombastic,Boombastic,4fxF8ljwryMZX5c9EKrLFE,0.867,0.538,2,-16.183,1,0.361,1.7e-05,0.316,0.781,158.328,249933,4,True,False
5,Shaggy,The Boombastic Collection - Best of Shaggy,It Wasn't Me,0OaunKfsxkgBvPv68jBbmm,0.852,0.604,0,-4.569,1,0.0642,0.0,0.327,0.667,94.762,227547,4,True,False
12,Naughty By Nature,19 Naughty III,Hip Hop Hooray,1w29UTa5uUvIri2tWtZ12Y,0.862,0.642,6,-13.652,0,0.101,0.0,0.272,0.765,99.201,267267,4,True,False
15,Blackstreet,Another Level,No Diggity,6MdqqkQ8sSC0WB4i8PyRuQ,0.867,0.646,1,-4.674,0,0.288,0.0,0.279,0.67,88.634,304600,4,True,False
4,Salt-N-Pepa,Very Necessary,Shoop,0Pu71wxadDlB8fJXfjIjeJ,0.939,0.675,0,-7.232,1,0.211,0.0,0.0565,0.795,96.918,248573,4,True,False


Things to note here:
* the tracks in this new DataFrame are arranged based on their "energy" scores
* the tracks' indices (the leftmost column) are no longer in consecutive order, but could be easily reset (check out Pandas B lab)
* the tracks can also be sorted by multiple columns (by passing several column names in the list given to sort_values)
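
The last two points can be sketched on a small, hand-built DataFrame with four of the tracks above:

```python
import pandas as pd

tracks = pd.DataFrame(
    {
        "track_name": ["Boombastic", "Here Comes the Hotstepper - Heartical Mix", "Colt 40ty Fiva", "Shoop"],
        "mode": [1, 0, 0, 1],
        "energy": [0.538, 0.454, 0.751, 0.675],
    }
)

by_energy = tracks.sort_values("energy")        # index is now 1, 0, 3, 2
renumbered = by_energy.reset_index(drop=True)   # index is 0, 1, 2, 3 again

# Sort by "mode" (descending) first, then by "energy" (ascending) within each mode.
by_mode_then_energy = tracks.sort_values(["mode", "energy"], ascending=[False, True])
print(renumbered)
print(by_mode_then_energy)
```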

Let's **chart the Tracks' "energy" based on our new DataFrame**:

In [17]:
alt.Chart(my_sorted_df).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y="energy"
)

----------
## **Part 3: Comparing Playlists**

In this part, we will obtain tracks from multiple playlists (from the same user) and compare these playlists using Altair's charts and scatterplots and Pandas' built-in statistics methods.

<br>

Building on Max Hilsdorf's code, we can create a function that would **produce an audio features DataFrame for all tracks for a given Spotify User** based on a User ID:

In [18]:
# Build one audio features DataFrame covering every playlist of a user,
# preserving the name of each playlist in a "playlist_name" column
def get_all_user_tracks(username):
    all_my_playlists = pd.DataFrame(sp.user_playlists(username))
    list_of_dataframes = []

    for playlist in all_my_playlists.index:
        playlist_info = all_my_playlists["items"][playlist]
        current_playlist = pd.DataFrame(sp.user_playlist_tracks(username, playlist_info["id"]))
        current_playlist_audio = get_audio_features_df(current_playlist)
        # Fall back to None when the playlist has an empty name
        current_playlist_audio["playlist_name"] = playlist_info["name"] or None
        list_of_dataframes.append(current_playlist_audio)

    return pd.concat(list_of_dataframes)

Using this function, we can get **all tracks contained in Oleh's followed public playlists** and **produce an Audio Features DataFrame** for them:

In [19]:
# Getting all of the user's tracks
all_my_tracks = get_all_user_tracks(my_username)
all_my_tracks["Author"] = "oleh" # noting where the tracks came from
all_my_tracks

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,playlist_name,Author
0,alt-J,This Is All Yours,Left Hand Free,4iEOVEULZRvmzYSZY2ViKN,0.697,0.877,3,-4.465,1,0.0462,0.00943,0.083,0.802,101.990,173631,4,Alternative & Indie,oleh
1,Cage The Elephant,Cage The Elephant (Expanded Edition),Ain't No Rest for the Wicked,3Pzh926pXggbMe2ZpXyMV7,0.636,0.849,0,-7.075,1,0.1060,0.0,0.372,0.917,156.036,175493,4,Alternative & Indie,oleh
2,Cage The Elephant,Cage The Elephant (Expanded Edition),Back Against the Wall,0vz64VTiPPBpcmla0QvAI9,0.598,0.743,1,-6.163,1,0.0305,0.0,0.112,0.534,110.334,228320,4,Alternative & Indie,oleh
3,The Kooks,Listen,Bad Habit,3huV7eiNpaQlCB3LbZi9bB,0.733,0.882,0,-4.199,0,0.0389,0.00001,0.131,0.854,123.071,221413,4,Alternative & Indie,oleh
4,Weezer,Weezer (Green Album),Island In The Sun,2MLHyLy5z5l5YRp7momlgw,0.654,0.810,4,-6.260,0,0.0288,0.00251,0.165,0.661,114.623,200307,4,Alternative & Indie,oleh
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,The Band,Music From Big Pink (Remastered),The Weight - Remastered 2000,0P7DoyGrr4Wp9w5TotEtUC,0.630,0.551,9,-9.280,1,0.0549,0.000058,0.103,0.518,143.868,274493,4,good shit,oleh
83,BabyJake,Cigarettes On Patios,Cigarettes On Patios,0LJDFZohBgWOMvXQw0cc9W,0.752,0.712,4,-5.467,0,0.0474,0.0,0.136,0.588,139.999,207813,4,good shit,oleh
84,24kGoldn,Mood (feat. iann dior),Mood (feat. iann dior),3tjFYV6RSFtuktYl3ZtYcq,0.700,0.722,7,-3.558,0,0.0369,0.0,0.272,0.756,90.989,140526,4,good shit,oleh
85,Miike Snow,Happy To You,Paddling Out,2egGsu9X7zdNJxU9Kftq6l,0.599,0.818,10,-3.652,0,0.0394,0.00571,0.297,0.366,128.159,217960,4,good shit,oleh
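
Note that the index in the table above restarts for each playlist, because `pd.concat` keeps every DataFrame's own index by default. The sketch below reproduces this with two tiny hand-built DataFrames and shows how `ignore_index=True` would renumber the rows instead.

```python
import pandas as pd

playlist_a = pd.DataFrame({"track_name": ["Left Hand Free", "Bad Habit"]})
playlist_a["playlist_name"] = "Alternative & Indie"  # same value for every row

playlist_b = pd.DataFrame({"track_name": ["Paddling Out", "Cigarettes On Patios"]})
playlist_b["playlist_name"] = "good shit"

stacked = pd.concat([playlist_a, playlist_b])                        # index: 0, 1, 0, 1
renumbered = pd.concat([playlist_a, playlist_b], ignore_index=True)  # index: 0, 1, 2, 3
print(stacked.index.tolist(), renumbered.index.tolist())
```

Duplicate index labels are harmless for the charts in this notebook, but they matter as soon as rows are selected by label with `.loc`.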


As you can see, our new DataFrame consists of 267 tracks – which are pretty much **all Oleh's tracks**. Using the familiar tools from Altair, we can **produce a color-coded chart** for the new collection.

Specifically, we can **color** the data points **based on the playlist** they are in:

In [20]:
alt.Chart(all_my_tracks).mark_point().encode(
    x="liveness",
    y="danceability",
    color="playlist_name"
)

Larger samples support more reliable observations! A trend spotted in just the 16 Tracks of the initial playlist may be a coincidence; if a similar trend persists as the **sample size increases**, it is more likely to reflect a real pattern.

Another useful chart would be **charting every track's energy and color-coding the data points** based on what playlist they are in.

Here's how to do it:

In [21]:
alt.Chart(all_my_tracks).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y='energy',
    color="playlist_name",
    tooltip=["artist", "track_name", "playlist_name"]
).properties(
    width=1200
)

As you can see in the chart above, Oleh's playlists tend to **follow a certain "energy" trend (typically upward)** as the playlist progresses. This likely corresponds with how many of you listen to your own playlists: start with less energetic songs and move on to more energetic ones.

Mathematically, we can **describe** each playlist as a subset of the overall DataFrame. You can read more about categorical descriptions in Pandas [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html).

Here's how to get the **description detail for a particular Playlist**:

In [22]:
all_my_tracks[all_my_tracks["playlist_name"] == "Alternative & Indie"].describe()

Unnamed: 0,danceability,energy,loudness,speechiness,liveness,valence,tempo
count,22.0,22.0,22.0,22.0,22.0,22.0,22.0
mean,0.629318,0.684864,-6.982318,0.051305,0.134209,0.613818,118.456
std,0.150462,0.139713,2.423161,0.035315,0.068607,0.252706,22.96004
min,0.207,0.403,-14.79,0.0265,0.0499,0.216,75.179
25%,0.56075,0.6035,-7.624,0.03275,0.1045,0.389,106.24325
50%,0.626,0.7035,-6.776,0.0384,0.115,0.6045,116.432
75%,0.724,0.77225,-5.60625,0.04845,0.14525,0.838,133.75725
max,0.876,0.882,-3.809,0.174,0.372,0.975,156.036
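
Instead of filtering one playlist at a time, all playlists can be summarized in a single step with `groupby`. The sketch below uses nine rows copied from the all_my_tracks output above (five from one playlist, four from another), so its statistics describe that small excerpt only.

```python
import pandas as pd

# A few rows copied from the all_my_tracks output above.
tracks = pd.DataFrame(
    {
        "playlist_name": ["Alternative & Indie"] * 5 + ["good shit"] * 4,
        "energy": [0.877, 0.849, 0.743, 0.882, 0.810, 0.551, 0.712, 0.722, 0.818],
        "tempo": [101.990, 156.036, 110.334, 123.071, 114.623, 143.868, 139.999, 90.989, 128.159],
    }
)

# One row of summary statistics per playlist instead of filtering them one by one.
summary = tracks.groupby("playlist_name")[["energy", "tempo"]].agg(["count", "mean"])
print(summary)
```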


Now, let's compare Oleh's tracks to another listener! For example, we could **get one of Ava's playlists** using the *sp.playlist_items()* method. In this example, we will use a playlist with Playlist ID = "3tt4ET474Xr1uOPgNz8jAY".

Here's how to do it:

In [23]:
avas_playlist_df = pd.DataFrame(sp.playlist_items("3tt4ET474Xr1uOPgNz8jAY"))
avas_playlist_df

Unnamed: 0,href,items,limit,next,offset,previous,total
0,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-06-16T16:43:01Z', 'added_by...",100,,0,,60
1,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-06-16T16:44:10Z', 'added_by...",100,,0,,60
2,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-06-16T17:02:14Z', 'added_by...",100,,0,,60
3,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-06-19T03:30:59Z', 'added_by...",100,,0,,60
4,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-06-19T04:26:10Z', 'added_by...",100,,0,,60
5,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-06-19T12:19:20Z', 'added_by...",100,,0,,60
6,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-06-24T01:41:49Z', 'added_by...",100,,0,,60
7,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-07-12T13:52:34Z', 'added_by...",100,,0,,60
8,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-07-12T20:49:27Z', 'added_by...",100,,0,,60
9,https://api.spotify.com/v1/playlists/3tt4ET474...,"{'added_at': '2020-07-12T20:54:49Z', 'added_by...",100,,0,,60


Similarly to what we have done earlier, we can **construct an Audio Features DataFrame** for this playlist:

In [24]:
avas_audio_features_df = get_audio_features_df(avas_playlist_df)
avas_audio_features_df["Author"] = "ava"
avas_audio_features_df

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,Author
0,Princess Nokia,Gemini - A COLORS SHOW,Gemini - A COLORS SHOW,0KHRpftQXPXk1ZJrBRjbu7,0.707,0.526,9,-8.016,1,0.403,1e-06,0.12,0.809,80.312,184827,4,ava
1,Alice Phoebe Lou,Witches,Witches,4CZgaNdobtnTfBevPBje0c,0.576,0.719,8,-5.749,1,0.0583,5e-06,0.116,0.729,73.122,195079,4,ava
2,Japanese Breakfast,Soft Sounds from Another Planet,Boyish,0De8H4o9xzPtjRp9dns0L5,0.227,0.457,2,-7.459,1,0.0299,0.000447,0.189,0.371,106.727,213987,4,ava
3,Elvis Presley,Elvis 30 #1 Hits,Can't Help Falling In Love,4hAUynwghvrqDXs1ejKNEq,0.438,0.325,2,-11.066,1,0.0268,5e-06,0.107,0.362,101.577,177840,3,ava
4,King Princess,Prophet,Prophet,4vFTpKeY2F3ckwhULrtS0z,0.502,0.783,7,-4.718,1,0.105,0.000464,0.099,0.626,165.996,249773,4,ava
5,Alaina Castillo,mensajes de voz,un niño,1TXeqjCYIahhfooXkdb3aI,0.493,0.296,7,-5.598,0,0.0761,0.0,0.0957,0.351,197.707,185682,3,ava
6,James Taylor,Greatest Hits,Fire and Rain,1XUKItaRs2494LclJwHhl8,0.611,0.35,5,-14.48,1,0.0356,8e-06,0.0844,0.36,76.064,200579,4,ava
7,Esperanza Spalding,12 Little Spells (Deluxe Edition),12 Little Spells (thoracic spine),0ZoE1JZG6cqjckvvUBqHrT,0.425,0.429,4,-11.29,1,0.037,6.9e-05,0.219,0.0517,125.728,293053,4,ava
8,Clairo,Immunity,Softly,4PvbbMYL4fkToni5BLaYRb,0.759,0.436,0,-11.233,0,0.0419,0.0113,0.102,0.782,94.03,185307,4,ava
9,Faye Webster,Atlanta Millionaires Club,Kingston,0EDQwboQDmswDRn58wcslg,0.729,0.344,10,-9.541,0,0.0395,0.00107,0.134,0.543,142.13,202160,4,ava


As comparisons are most meaningful between collections of similar size, we could **import one of Oleh's playlists of relatively similar length**. One such playlist has ID "47VfnY1RsMOadBdy9MCDYW"; and, as we've seen before, Oleh's Spotify User ID is "sx47r9lq4dwrjx1r0ct9f9m09".

After constructing the playlist DataFrame, we will concatenate the two individual-playlist-based DataFrames into one. This will help us chart our results.

Here's how to do it:

In [25]:
# Getting one of Oleh's playlists
gs_playlist_tracks = pd.DataFrame(sp.user_playlist_tracks("sx47r9lq4dwrjx1r0ct9f9m09", "47VfnY1RsMOadBdy9MCDYW"))
gs_playlist_tracks_audio_df = get_audio_features_df(gs_playlist_tracks)
gs_playlist_tracks_audio_df["Author"] = "oleh"

# Combining the two DataFrames
two_playlists_combined = pd.concat([gs_playlist_tracks_audio_df, avas_audio_features_df], ignore_index=True)
two_playlists_combined

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,Author
0,Prince,Purple Rain,Purple Rain,54X78diSLoUDI3joC2bjMz,0.367,0.452,10,-10.422,1,0.0307,0.002280,0.6890,0.189,113.066,520787,4,oleh
1,Bob Dylan,Pat Garrett & Billy The Kid (Soundtrack From T...,Knockin' On Heaven's Door,6HSXNV0b4M4cLJ7ljgVVeh,0.513,0.396,7,-13.061,1,0.0299,0.177000,0.1100,0.229,140.208,149880,4,oleh
2,The Beatles,Abbey Road (Remastered),Here Comes The Sun - Remastered 2009,6dGnYIeXmHdcikdzNNDMm2,0.557,0.540,9,-10.484,1,0.0347,0.002480,0.1790,0.394,129.171,185733,4,oleh
3,David Gilmour,Live in Gdansk,Wish You Were Here - Live in Gdańsk,2q0BviPG80XxEkaCJCrBm8,0.526,0.472,7,-13.148,1,0.0370,0.000030,0.9820,0.339,124.443,314387,4,oleh
4,Chris Cornell,Chris Cornell (Deluxe Edition),Nothing Compares 2 U - Live At SiriusXM/2015,0tUELgOuOJ3KCsYMDDsNvD,0.434,0.327,0,-10.720,1,0.0312,0.000002,0.6860,0.295,119.506,303907,4,oleh
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,Angèle,Brol La Suite,Oui ou non,7rBP4bLjMLNkix1nGHjheP,0.649,0.574,2,-7.856,1,0.1430,0.000013,0.1610,0.361,199.906,196800,4,ava
143,Jim Croce,You Don't Mess Around With Jim,Walkin' Back to Georgia,51ueZKM83MTRv9rgiDfI6Y,0.661,0.518,6,-10.061,0,0.0319,0.000018,0.0953,0.776,127.266,170760,4,ava
144,Bonnie Raitt,Bonnie Raitt (2008 Remaster),Thank You - 2008 Remaster,2zLIjfjQ8kMy7WSSLmF0I2,0.670,0.279,5,-15.589,1,0.0289,0.008790,0.1090,0.552,77.820,170800,4,ava
145,Aretha Franklin,"Young, Gifted and Black",Day Dreaming,7L4G39PVgMfaeHRyi1ML7y,0.463,0.273,0,-15.364,0,0.0740,0.000367,0.1010,0.293,146.426,239960,4,ava
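
The row labels in the combined table run continuously because of `ignore_index=True`. The sketch below shows what that argument changes, using two tiny made-up frames (the values are hypothetical, not real tracks):

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical values, not real tracks)
oleh_df = pd.DataFrame({"track_name": ["A", "B"], "energy": [0.7, 0.6], "Author": "oleh"})
ava_df = pd.DataFrame({"track_name": ["C", "D"], "energy": [0.4, 0.5], "Author": "ava"})

# Default behaviour keeps each frame's own row labels, so labels repeat
kept = pd.concat([oleh_df, ava_df])
# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([oleh_df, ava_df], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

Unique row labels matter later on, whenever rows are looked up by position with *.loc*.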


Finally, we can **chart the two playlists side by side**.

In this example, we color the entries based on the Author column and sort them exactly the way they appear in the original playlist (by setting sort=None). We will color Ava's tracks blue and Oleh's tracks yellow. Some trends are very visible from the plot:

In [26]:
alt.Chart(two_playlists_combined).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y='energy',
    color="Author",
    tooltip=["artist", "track_name"]
).properties(
    width=1000
)

Note a few things here:
* Oleh's tracks (yellow) typically **vary** less, whereas Ava's tracks **vary** greatly
* Oleh's tracks (yellow) have **average** energy that is higher than that of Ava's
* Oleh's tracks (yellow) follow a visible **trend** in the way they are arranged

<br>

We can support our conclusions mathematically, by exploring Pandas' **descriptions** of the "energy" column for the two sub-DataFrames: 

In [27]:
print("Oleh's data: \n", two_playlists_combined[two_playlists_combined["Author"] == "oleh"]["energy"].describe(), "\n")
print("Ava's data: \n", two_playlists_combined[two_playlists_combined["Author"] == "ava"]["energy"].describe())

Oleh's data: 
 count    87.000000
mean      0.678161
std       0.168881
min       0.167000
25%       0.569000
50%       0.710000
75%       0.800000
max       0.954000
Name: energy, dtype: float64 

Ava's data: 
 count    60.000000
mean      0.419764
std       0.200654
min       0.006220
25%       0.284750
50%       0.412500
75%       0.553500
max       0.806000
Name: energy, dtype: float64


As expected, there are some corresponding statistical observations:
* Oleh's Standard Deviation (std) is 0.169 whereas Ava's is 0.201 (a larger value means a wider spread)
* Oleh's average "energy" (mean) is 0.678 whereas Ava's is 0.420 (Ava's average is lower, as expected)

<br>
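
One rough way to judge whether the gap between the two means is large relative to the spread is Welch's t statistic, which divides the gap by its standard error. The sketch below is not part of the original analysis; it only reuses the count, mean, and std values printed by *describe()* above. Tracks in a playlist are not independent random samples, so the result is best read as a descriptive number rather than a formal test.

```python
import math

# Summary statistics copied from the describe() output above
oleh = {"n": 87, "mean": 0.678161, "std": 0.168881}
ava = {"n": 60, "mean": 0.419764, "std": 0.200654}

# Difference between the two average "energy" values
mean_gap = oleh["mean"] - ava["mean"]

# Standard error of that difference (Welch's form: variances are not pooled)
standard_error = math.sqrt(oleh["std"] ** 2 / oleh["n"] + ava["std"] ** 2 / ava["n"])

# Welch's t statistic: how many standard errors apart the two means are
t_statistic = mean_gap / standard_error

print(f"gap in means: {mean_gap:.3f}")
print(f"Welch's t:    {t_statistic:.2f}")
```

A value well above 2 or 3 suggests the gap is large compared to the track-to-track variation within each playlist.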

Instead of comparing just two playlists, we can compare many! As an example, we'll load **8 of Ava's favorite playlists**:

In [28]:
list_of_avas_playlists = []
avas_export_playlists_list = ["3tt4ET474Xr1uOPgNz8jAY",
                              "69bvktIqRHFk56zJLFu3ms", 
                              "5nGnFuPH2G1e2lZwji2qxy",
                              "1H715wD7rkVCSGz0fwtLeH",
                              "35DLrFVs4dK3QreeuQt9vZ",
                              "0N6HSTGQcNhgrsjvdgqjH9",
                              "1BwJKfuRNrnfdkvIpaaSHH",
                              "6AfdBAcUHElsK8cRzMpnc1"]

for item in avas_export_playlists_list:
  temp_playlist_df = pd.DataFrame(sp.playlist_items(item))
  temp_playlist_audio = get_audio_features_df(temp_playlist_df)
  temp_playlist_audio["playlist_name"] = sp.playlist(item)["name"]
  temp_playlist_audio["Author"] = "ava"
  list_of_avas_playlists.append(temp_playlist_audio)

avas_eight_playlists = pd.concat(list_of_avas_playlists)
avas_eight_playlists

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,playlist_name,Author
0,Princess Nokia,Gemini - A COLORS SHOW,Gemini - A COLORS SHOW,0KHRpftQXPXk1ZJrBRjbu7,0.707,0.526,9,-8.016,1,0.4030,0.000001,0.120,0.809,80.312,184827,4,Cheers to latching,ava
1,Alice Phoebe Lou,Witches,Witches,4CZgaNdobtnTfBevPBje0c,0.576,0.719,8,-5.749,1,0.0583,0.000005,0.116,0.729,73.122,195079,4,Cheers to latching,ava
2,Japanese Breakfast,Soft Sounds from Another Planet,Boyish,0De8H4o9xzPtjRp9dns0L5,0.227,0.457,2,-7.459,1,0.0299,0.000447,0.189,0.371,106.727,213987,4,Cheers to latching,ava
3,Elvis Presley,Elvis 30 #1 Hits,Can't Help Falling In Love,4hAUynwghvrqDXs1ejKNEq,0.438,0.325,2,-11.066,1,0.0268,0.000005,0.107,0.362,101.577,177840,3,Cheers to latching,ava
4,King Princess,Prophet,Prophet,4vFTpKeY2F3ckwhULrtS0z,0.502,0.783,7,-4.718,1,0.1050,0.000464,0.099,0.626,165.996,249773,4,Cheers to latching,ava
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36,Jalen Santoy,Foreplay,Foreplay,28luqgS4NCuFKP6YSOtia5,0.730,0.422,5,-6.211,0,0.0678,0.002920,0.105,0.505,118.427,173785,4,For drives,ava
37,Little Dragon,Ritual Union,Ritual Union,5uTjNzGKCQ50synrf9dWmT,0.700,0.738,1,-4.398,1,0.0324,0.005270,0.220,0.796,144.035,210267,4,For drives,ava
38,The Last Shadow Puppets,Everything You've Come To Expect (Deluxe Edition),Miracle Aligner,4iwpCp7qdDLngGI3gsVTza,0.562,0.724,6,-8.627,0,0.0271,0.000001,0.223,0.855,113.056,245728,4,For drives,ava
39,Young the Giant,Home of the Strange,Art Exhibit,3XqdYTHbYWw2haLim9Kwfc,0.490,0.480,8,-9.949,1,0.0374,0.000002,0.135,0.221,149.819,243640,3,For drives,ava


That gives us all 218 tracks from Ava's eight playlists! And, similarly, we'll **chart them side by side**:

In [29]:
alt.Chart(avas_eight_playlists).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y='energy',
    color="playlist_name",
    tooltip=["artist", "track_name", "playlist_name"]
).properties(
    width=1200
)

Then, we can create our **shared DataFrame of all the tracks** obtained from Oleh's and Ava's Spotify profiles:

In [30]:
two_people_dataframe = pd.concat([avas_eight_playlists, all_my_tracks], ignore_index=True)
two_people_dataframe

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,playlist_name,Author
0,Princess Nokia,Gemini - A COLORS SHOW,Gemini - A COLORS SHOW,0KHRpftQXPXk1ZJrBRjbu7,0.707,0.526,9,-8.016,1,0.4030,0.000001,0.120,0.809,80.312,184827,4,Cheers to latching,ava
1,Alice Phoebe Lou,Witches,Witches,4CZgaNdobtnTfBevPBje0c,0.576,0.719,8,-5.749,1,0.0583,0.000005,0.116,0.729,73.122,195079,4,Cheers to latching,ava
2,Japanese Breakfast,Soft Sounds from Another Planet,Boyish,0De8H4o9xzPtjRp9dns0L5,0.227,0.457,2,-7.459,1,0.0299,0.000447,0.189,0.371,106.727,213987,4,Cheers to latching,ava
3,Elvis Presley,Elvis 30 #1 Hits,Can't Help Falling In Love,4hAUynwghvrqDXs1ejKNEq,0.438,0.325,2,-11.066,1,0.0268,0.000005,0.107,0.362,101.577,177840,3,Cheers to latching,ava
4,King Princess,Prophet,Prophet,4vFTpKeY2F3ckwhULrtS0z,0.502,0.783,7,-4.718,1,0.1050,0.000464,0.099,0.626,165.996,249773,4,Cheers to latching,ava
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
480,The Band,Music From Big Pink (Remastered),The Weight - Remastered 2000,0P7DoyGrr4Wp9w5TotEtUC,0.630,0.551,9,-9.280,1,0.0549,0.000058,0.103,0.518,143.868,274493,4,good shit,oleh
481,BabyJake,Cigarettes On Patios,Cigarettes On Patios,0LJDFZohBgWOMvXQw0cc9W,0.752,0.712,4,-5.467,0,0.0474,0.0,0.136,0.588,139.999,207813,4,good shit,oleh
482,24kGoldn,Mood (feat. iann dior),Mood (feat. iann dior),3tjFYV6RSFtuktYl3ZtYcq,0.700,0.722,7,-3.558,0,0.0369,0.0,0.272,0.756,90.989,140526,4,good shit,oleh
483,Miike Snow,Happy To You,Paddling Out,2egGsu9X7zdNJxU9Kftq6l,0.599,0.818,10,-3.652,0,0.0394,0.00571,0.297,0.366,128.159,217960,4,good shit,oleh


The combined DataFrame consists of 485 songs that these two people listen to in total. Let's **chart out the "energy" values** for these songs to see how the two compare:

In [31]:
alt.Chart(two_people_dataframe).mark_point().encode(
    x=alt.X("track_name", sort=None),
    y='energy',
    color="Author",
    tooltip=["artist", "track_name", "playlist_name"]
).properties(
    width=1200
)

Just as noted earlier (when comparing just two playlists), there are some important things to note here:

* Oleh's tracks (yellow) typically **vary** less, whereas Ava's tracks **vary** greatly
* Oleh's tracks (yellow) have **average** energy that is higher than that of Ava's
* Oleh's tracks (yellow) follow a visible **trend** in the way they are arranged

<br>

We can similarly support our conclusions mathematically, by exploring Pandas' **descriptions** of the "energy" column for the two sub-DataFrames: 

In [32]:
print("Oleh's data: \n", two_people_dataframe[two_people_dataframe["Author"] == "oleh"]["energy"].describe(), "\n")
print("Ava's data: \n", two_people_dataframe[two_people_dataframe["Author"] == "ava"]["energy"].describe())

Oleh's data: 
 count    267.000000
mean       0.653301
std        0.161060
min        0.099300
25%        0.538000
50%        0.665000
75%        0.776500
max        0.984000
Name: energy, dtype: float64 

Ava's data: 
 count    218.000000
mean       0.473340
std        0.220546
min        0.006220
25%        0.310750
50%        0.472000
75%        0.647000
max        0.973000
Name: energy, dtype: float64


As expected, there are some corresponding statistical observations:
* Oleh's Standard Deviation (std) is 0.161 whereas Ava's is 0.221 (a larger value means a wider spread)
* Oleh's average "energy" (mean) is 0.653 whereas Ava's is 0.473 (Ava's average is lower, as expected)
<br>
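
The two filtered *describe()* calls above can also be written as a single grouped call, which returns one row of statistics per Author. Here is a small sketch on made-up values (the `toy` frame is a hypothetical stand-in for *two_people_dataframe*):

```python
import pandas as pd

# Tiny stand-in for two_people_dataframe (hypothetical values)
toy = pd.DataFrame({
    "Author": ["oleh", "oleh", "oleh", "ava", "ava", "ava"],
    "energy": [0.70, 0.65, 0.75, 0.30, 0.50, 0.70],
})

# One describe() per Author, returned as a single table
summary = toy.groupby("Author")["energy"].describe()
print(summary[["count", "mean", "std"]])
```

Having both listeners in one table makes the mean and std columns easy to compare at a glance.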

----
## **Part 4: Network Graph Visualization**

In this part, we'll explore some basic Network Theory graphing for Spotify's Artists and Songs based on Recommended and Related songs and artists.

#### Step 4.1: Network Basics

First, we will illustrate the basics of Pyvis-based **Network Graphs**. Generally speaking, a Network Graph is a visual structure designed to emphasize connections between discrete entities. It consists of Nodes and Edges, which represent a system of connected or related elements, and is largely studied within Network Theory.

You can learn more about Network Theory [here](https://en.wikipedia.org/wiki/Network_theory) and explore Network Graphs [here](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)).
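
Before turning to any library, it may help to see Nodes and Edges in plain Python. The sketch below stores an undirected network as a dictionary that maps each node to the set of its neighbours. The helper names are hypothetical and this is not how Pyvis stores its data; it is only meant to make the two concepts concrete.

```python
# Nodes are stored as dictionary keys; each value is the set of neighbours
graph = {}

def add_node(graph, name):
    # Adding an existing node again leaves its neighbours untouched
    graph.setdefault(name, set())

def add_edge(graph, first, second):
    # Undirected edge: each endpoint lists the other as a neighbour
    add_node(graph, first)
    add_node(graph, second)
    graph[first].add(second)
    graph[second].add(first)

add_node(graph, "George")
add_node(graph, "Ringo")
add_edge(graph, "George", "Ringo")

print("George" in graph)
print(graph["Ringo"])
```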

Here's how to **build, populate, and show a simple Network Graph** using NetworkX and Pyvis:

In [33]:
# Creating a Network
g = net.Network(notebook=True)

# Adding nodes
g.add_node("John")
g.add_node("Paul")

# Adding an edge
g.add_edge("John", "Paul")

# Showing the network
g.show("example.html")

As you can see, in this example we created two nodes "John" and "Paul" and connected them. We are able to **add nodes** to an existing network by calling *net.Network.add_node()* and **add edges** to the same network by calling *net.Network.add_edge()*. It is also possible to **get all nodes** by calling *net.Network.get_nodes()*.

Using these tools, we can **check if a node is in a network**:

In [34]:
# checking
"John" in g.get_nodes()

True

Building on these tools, we can create something more advanced – for example, **a diagram of Oleh's playlists** (by iterating over the playlists returned by *sp.user_playlists()*). We will scale the nodes (playlists) based on their size using Pyvis' **value** attribute of Nodes.

Here's how to do it:

In [35]:
# Creating a Network with one center Node
playlists_network = net.Network(notebook=True)
playlists_network.add_node("Oleh's Spotify", color="#ffffff")

# As we want to record both playlist names and corresponding sizes, we need a Dictionary:
oleh_playlist_dictionary = {}
olehs_playlists = pd.DataFrame(sp.user_playlists(my_username)["items"])

# Iterating over the playlists and recording Names and Sizes
for i in range(len(olehs_playlists)):
    oleh_playlist_dictionary[olehs_playlists.loc[i]["name"]] = olehs_playlists["tracks"][i]["total"]

# Adding new Nodes and Edges based on the items in the Dictionary:
for item in oleh_playlist_dictionary:
    playlists_network.add_node(item, value=oleh_playlist_dictionary[item])
    playlists_network.add_edge("Oleh's Spotify", item)

# Showing the Network Graph
playlists_network.show("playlists_diagram.html")

As expected, we can see the center node we added at first – which is now connected to 8 other nodes, which all correspond to Oleh's playlists. These nodes are sized based on the playlists' sizes (number of tracks) and named based on the playlists' names. **This is a simple undirected network**.
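
The resulting shape is a star: every playlist node touches only the center node. A quick way to see this is to count edges per node (the node's degree). The sketch below does so in plain Python, with hypothetical playlist names and sizes standing in for *oleh_playlist_dictionary*:

```python
# Hypothetical playlist names and sizes (stand-ins for oleh_playlist_dictionary)
playlist_sizes = {"road trip": 40, "focus": 25, "favourites": 87}

center = "Oleh's Spotify"
nodes = [center]
edges = []

# One node and one edge to the center per playlist
for name, size in playlist_sizes.items():
    nodes.append(name)
    edges.append((center, name))

# Degree = number of edges touching a node
degree = {node: 0 for node in nodes}
for first, second in edges:
    degree[first] += 1
    degree[second] += 1

print(degree)
```

The center's degree equals the number of playlists, while every playlist has degree 1.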

----

Now, we'll get into slightly more complicated things.

Spotify API provides a way to **get related artists** given an Artist ID. Reflecting this method, Spotipy conveniently has *sp.artist_related_artists*, which returns a collection of artists related to an artist. Making use of this method, one could think of a function that would go through a number of related artists (**limit**) and add graph Nodes and Edges corresponding to the newly discovered related artists. We will also **size nodes** based on popularity.

Here's what such a function could look like:

In [36]:
def add_related_artists(starting_artist_name, starting_artist_id, existing_graph, limit, order_group=None):
    # get artists related to the current artist
    current_artist_related = pd.DataFrame(sp.artist_related_artists(starting_artist_id)["artists"])
    # loop through the related artists (at most `limit` of them), add nodes and edges
    for i in range(min(limit, len(current_artist_related))):
        related_name = current_artist_related.loc[i]["name"]
        # add a node only if it does not exist yet
        if related_name not in existing_graph.get_nodes():
            # use the caller's group if given, otherwise one group per related artist
            node_group = order_group if order_group else (i + 1)
            existing_graph.add_node(related_name, value=int(current_artist_related.loc[i]["popularity"]), group=node_group)
        # add edge
        existing_graph.add_edge(starting_artist_name, related_name)

In the cell below, we will make use of the function we just defined. Using this function and some basic information, we will **produce a Network Graph for two generations (circles) of artists related to The Beatles**. 

As noted, we will start with The Beatles (Artist ID = "3WrFJ7ztbogyGnTHbHJFl2", Name = "The Beatles")

In [37]:
# First, we need to record the information about The Beatles
center_artist_id = "3WrFJ7ztbogyGnTHbHJFl2"
center_artist_name = "The Beatles"
center_artist_popularity = 83

# limit: how many related artists per generation we are interested in
limit = 10
center_artist_related = pd.DataFrame(sp.artist_related_artists(center_artist_id)["artists"]).loc[0:(limit-1)]

# setting up the Network
artist_network = net.Network(notebook=True)
artist_network.add_node(center_artist_name, value=center_artist_popularity, color="#ffffff", group=0)

# Getting the first circle of related artists:
add_related_artists(center_artist_name, center_artist_id, artist_network, limit)

# Showing the Network Graph
artist_network.show("artist_example.html")

In order to further complicate our lives, we can **add one more generation of related artists** (think friends of friends):

In [38]:
# Running through the once-related artists
for i in range(limit):
    add_related_artists(center_artist_related.loc[i]["name"], center_artist_related.loc[i]["id"], artist_network, limit, (i+1))

# Showing the Network Graph
artist_network.show("artist_example.html")

As you can see, the Network Graph above provides some very interesting information and prompts some very important thoughts. Think about: 
* Why are the nodes located the way they are located? 
* Who are the artists we've missed? 
* How are these people related?

<br>
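
Growing the graph one generation at a time, as the two cells above do, is a breadth-first expansion: first everything one step from the center, then everything one step from those. The sketch below shows the idea in plain Python. The `related` dictionary is a hypothetical lookup standing in for the API calls, and `expand_generations` is an illustrative name, not a function used elsewhere in this notebook.

```python
# Hypothetical "related" lookup standing in for API calls
related = {
    "Center": ["A", "B"],
    "A": ["B", "C"],
    "B": ["Center", "D"],
}

def expand_generations(start, related, generations):
    """Return {name: generation} for everything within `generations` hops of start."""
    generation_of = {start: 0}
    frontier = [start]
    for generation in range(1, generations + 1):
        next_frontier = []
        for name in frontier:
            for neighbour in related.get(name, []):
                # Keep the first (closest) generation at which a name appears
                if neighbour not in generation_of:
                    generation_of[neighbour] = generation
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return generation_of

print(expand_generations("Center", related, 2))
```

A name that is reachable in several ways keeps the generation at which it was first found, much as the graph above only adds a node the first time an artist is encountered.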

Similarly to Related Artists, Spotify API has a way of **recommending songs** based on a "seed" of tracks, which is mirrored by Spotipy – specifically, in the *sp.recommendations* method. One could think of a function that would **get a generation of recommended songs and add them to a Network Graph** (scaled by popularity):

In [39]:
def add_related_songs(starting_song_name, starting_artist_name, starting_song_id, existing_graph, limit, first_gen=True, order_group=None):
    # get songs recommended for the starting song (a single seed track)
    current_song_related = pd.DataFrame(sp.recommendations(seed_tracks=[starting_song_id])["tracks"])
    # nodes are labelled "Artist: Song"
    starting_label = starting_artist_name + ": " + starting_song_name
    # loop through the recommended songs (at most `limit` of them), add nodes and edges
    for i in range(min(limit, len(current_song_related))):
        song = current_song_related.loc[i]
        song_label = song["artists"][0]["name"] + ": " + song["name"]
        # add a node only if it does not exist yet
        if song_label not in existing_graph.get_nodes():
            # use the caller's group if given, otherwise one group per recommended song
            node_group = order_group if order_group else (i + 1)
            existing_graph.add_node(song_label, value=int(song["popularity"]), group=node_group)
        # add edge
        existing_graph.add_edge(starting_label, song_label)
    return current_song_related

In the cell below, we will make use of the function we just defined. Using this function and some basic information, we will **produce a Network Graph for two generations (circles) of recommended songs based on Ben E. King's Stand By Me**. 

As noted, we will start with Stand By Me (Song ID = "3SdTKo2uVsxFblQjpScoHy")

In [40]:
# First, we need to record the information about Stand By Me
center_song = sp.track("3SdTKo2uVsxFblQjpScoHy")
center_song_id = center_song["id"]
center_song_artist = center_song["artists"][0]["name"]
center_song_name = center_song["name"]
center_song_popularity = int(center_song["popularity"])

# limit: how many recommended songs per generation we are interested in
limit = 10

# Creating the Network graph and adding the center Node
song_network = net.Network(notebook=True)
song_network.add_node(str(center_song_artist + ": " + center_song_name), value=center_song_popularity, color="#ffffff", group=0)

# Getting the first circle of recommended songs:
recommended_songs = add_related_songs(center_song_name, center_song_artist, center_song_id, song_network, limit)

# Showing the Network
song_network.show("song_network.html")

Similarly to Related Artists, we will further complicate our lives by **adding one more generation of recommended songs** (with no extra seed knowledge):

In [41]:
# Getting the second generation of Recommended songs
for i in range(limit):
    add_related_songs(recommended_songs.loc[i]["name"], recommended_songs.loc[i]["artists"][0]["name"], recommended_songs.loc[i]["id"], song_network, limit, False, (i+1))

# Showing the network
song_network.show("song_network.html")

Interestingly, Spotify's recommendations for songs **change every time you run your code**. We encourage you to re-run the previous two cells a few times! Just like the Related Artists graph, the Network Graph above provides some very interesting information and prompts some very important thoughts. Think about:
* Why are the nodes located the way they are located? 
* Which songs have we missed?
* How are these songs related?

<br>
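
To put a number on how much the recommendations change between runs, one option is to compare the sets of track IDs returned by two runs. The sketch below computes the Jaccard overlap (shared items divided by all distinct items). The track IDs are made up for illustration, and `jaccard_overlap` is a hypothetical helper, not part of Spotipy.

```python
def jaccard_overlap(first_run, second_run):
    """Share of distinct items that appear in both runs (1.0 = identical sets)."""
    first, second = set(first_run), set(second_run)
    if not first and not second:
        return 1.0
    return len(first & second) / len(first | second)

# Hypothetical track IDs from two runs of the same cell
run_one = ["t1", "t2", "t3", "t4"]
run_two = ["t3", "t4", "t5", "t6"]

print(jaccard_overlap(run_one, run_two))
```

In practice, the two lists could come from the "id" column of the DataFrame that *add_related_songs* returns.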

Finally, we can make one slight tweak to our *add_related_songs* function. Previously, we passed only one track as the seed for the GET Recommendations request. The new function below does essentially the same thing, except that it **accepts a list of seed tracks**. In the cell after it, each second-generation call will **pass 3 random tracks (drawn from the first generation of recommendations) as the recommendation seed**:

In [42]:
def add_related_songs_gen(starting_song_name, starting_artist_name, starting_song_id, existing_graph, limit, first_gen=True, order_group=None):
    # starting_song_id is a list of seed track IDs here; keep the first `limit` recommendations
    current_song_related = pd.DataFrame(sp.recommendations(seed_tracks=starting_song_id)["tracks"]).loc[0:(limit - 1)]
    # nodes are labelled "Artist: Song"
    starting_label = starting_artist_name + ": " + starting_song_name
    # loop through the recommended songs (at most `limit` of them), add nodes and edges
    for i in range(min(limit, len(current_song_related))):
        song = current_song_related.loc[i]
        song_label = song["artists"][0]["name"] + ": " + song["name"]
        # add a node only if it does not exist yet
        if song_label not in existing_graph.get_nodes():
            # use the caller's group if given, otherwise one group per recommended song
            node_group = order_group if order_group else (i + 1)
            existing_graph.add_node(song_label, value=int(song["popularity"]), group=node_group)
        # add edge
        existing_graph.add_edge(starting_label, song_label)
    return current_song_related

We will run this function for **two generations** for the same song (Stand By Me by Ben E. King):

In [43]:
# Start the network
song_network = net.Network(notebook=True)
song_network.add_node(str(center_song_artist + ": " + center_song_name), value=center_song_popularity, color="#ffffff", group=0)

# First generation
recommended_songs = add_related_songs_gen(center_song_name, center_song_artist, [center_song_id], song_network, limit)

# Second generation
for i in range(limit):
    add_related_songs_gen(recommended_songs.loc[i]["name"], recommended_songs.loc[i]["artists"][0]["name"], random.sample(list(recommended_songs["id"]), 3), song_network, limit, False, (i+1))

# Show the network Graph
song_network.show("song_network.html")

Note that this graph looks a little different! What are **some of your observations**?

<br>
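
The second-generation call above draws its seed with *random.sample*, which raises an error when asked for more items than the list holds. Here is a slightly more defensive sketch of that step. It assumes that the recommendations endpoint accepts at most 5 seeds in total; `pick_seed_tracks` and `MAX_SEEDS` are hypothetical names introduced only for this example.

```python
import random

MAX_SEEDS = 5  # assumption: the recommendations endpoint accepts at most 5 seeds in total

def pick_seed_tracks(track_ids, how_many=3, rng=random):
    """Pick up to `how_many` distinct track IDs to use as a recommendation seed."""
    unique_ids = list(dict.fromkeys(track_ids))  # drop duplicates, keep order
    how_many = min(how_many, MAX_SEEDS, len(unique_ids))
    return rng.sample(unique_ids, how_many)

# Hypothetical track IDs standing in for recommended_songs["id"]
candidate_ids = ["id_a", "id_b", "id_c", "id_d"]
seeds = pick_seed_tracks(candidate_ids, how_many=3, rng=random.Random(0))
print(seeds)
```

Passing a seeded *random.Random* instance, as in the usage lines, makes the choice repeatable, which can help when comparing graphs across runs.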

----

<br>

## **Part 5: Reflections**

You've learned **a lot** through this lab! We would love to hear what you liked/disliked most and what you found most interesting. Thank you!