Notebook by Joseph Gentile + Nicholas Lasinsky. Markdown cells are marked (N) or (J) depending on who wrote them. Joseph assembled and wrote most of the code and most of the markdown cells in the notebook, while Nicholas wrote up our shared thoughts, discussions, and challenges and submitted that write-up himself.

(J)The question we're looking at is as follows. Among the playlists, many of the sub-selections of the 40 songs are described as some variety of "these are the songs that I recognize" or "these are songs I have nostalgia for." Effectively, those sub-lists consist of songs that the students, all of whom are roughly in the early-20s demographic, like. Meanwhile, many of the student-generated playlists are built around what the given student thinks some early-20s person would enjoy. So our idea was to compile these two separate lists, what students like and what they think a student around their age would like, and compare them to look for discrepancies. The data set is not very rigorous, and probably too small to support strong conclusions, but we figured it was a good enough idea to let us practice with the tools and see what discrepancies there are between the two.

(J)Just copy/pasting stuff from the other doc to set up all the libraries

In [1]:
pip install spotipy

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install pyvis

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
import random
import altair as alt
import requests
import inspect
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import networkx as nx
import networkx.algorithms.community as nx_comm
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pyvis
from pyvis import network as net
from itertools import combinations
from community import community_louvain
from copy import deepcopy

In [4]:
# storing the credentials:
CLIENT_ID = "116bae2a86fd4737862816c5f45d4c36"
CLIENT_SECRET = "4f4a732d83d04cfa94acc26d2b77169f"
my_username = "sx47r9lq4dwrjx1r0ct9f9m09"

# instantiating the client
# source: Max Hilsdorf (https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6)
client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [5]:
# sp.user_playlist_tracks(user_id, playlist_id) returns the playlist's tracks as a dict (parsed JSON)
playlist_tracks = pd.DataFrame(sp.user_playlist_tracks(my_username, "7KfWEjHxpcOIkqvDqMW5RV"))
playlist_tracks

Unnamed: 0,href,items,limit,next,offset,previous,total
0,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:22:18Z', 'added_by...",100,,0,,16
1,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:22:32Z', 'added_by...",100,,0,,16
2,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:22:50Z', 'added_by...",100,,0,,16
3,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:23:10Z', 'added_by...",100,,0,,16
4,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:23:22Z', 'added_by...",100,,0,,16
5,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:23:38Z', 'added_by...",100,,0,,16
6,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:26:57Z', 'added_by...",100,,0,,16
7,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:31:50Z', 'added_by...",100,,0,,16
8,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:35:09Z', 'added_by...",100,,0,,16
9,https://api.spotify.com/v1/playlists/7KfWEjHxp...,"{'added_at': '2020-06-24T18:38:34Z', 'added_by...",100,,0,,16


(J)To combine playlists, we use a function taken from notebook C.

In [6]:
# This function is created based on Max Hilsdorf's article
# Source: https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6
def get_audio_features_df(playlist):
    """Return one row per playlist track: metadata plus Spotify audio features."""
    playlist_features_list = ["artist", "album", "track_name", "track_id","danceability","energy","key","loudness","mode", "speechiness","instrumentalness","liveness","valence","tempo", "duration_ms","time_signature"]
    rows = []

    # Loop through every track in the playlist and collect its metadata and audio features
    for track in playlist["items"]:
        playlist_features = {}
        # Get metadata (the artist is the first artist credited on the album)
        playlist_features["artist"] = track["track"]["album"]["artists"][0]["name"]
        playlist_features["album"] = track["track"]["album"]["name"]
        playlist_features["track_name"] = track["track"]["name"]
        playlist_features["track_id"] = track["track"]["id"]

        # Get audio features; skip the track if none come back for it
        audio_features = sp.audio_features(playlist_features["track_id"])[0]
        if audio_features is None:
            continue
        for feature in playlist_features_list[4:]:
            playlist_features[feature] = audio_features[feature]

        rows.append(playlist_features)

    # Build the DataFrame once instead of concatenating inside the loop
    return pd.DataFrame(rows, columns=playlist_features_list)
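
The function above requests audio features one track at a time. spotipy's `audio_features` also accepts a list of track IDs (up to 100 per call), so the requests could be batched. Below is a minimal sketch of just the batching step; `chunked` and the placeholder IDs are hypothetical names made up for illustration, and the actual API call is left as a comment because it needs network access and credentials.

```python
from typing import Iterator, List


def chunked(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Placeholder IDs standing in for the track IDs of a long playlist.
track_ids = [f"track{i}" for i in range(250)]
batches = list(chunked(track_ids))
print([len(batch) for batch in batches])

# With the notebook's `sp` client, each batch could then be fetched in one request:
# features = [f for batch in batches for f in sp.audio_features(batch)]
```

For playlists as short as the ones used here the saving is small, so this mostly matters if the same function is reused on larger playlists.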

(J)To restate the question: we're looking at two subsets of the playlists given. The first consists of playlists drawn from the selection of 40 songs and chosen as something along the lines of "songs I recognize/like." The second consists of student-generated playlists that claim to be aimed at a demographic of around early 20s. So we have a selection of songs that students say they recognize/like, and a selection of songs that students claim a 20-something would like. The idea is to analyze the disparity between the characteristics of songs we know we like and songs we think someone of our age group would like.

To that end, I'm making a combined dataframe with info on all songs from both sets of playlists, as well as a separate dataframe for each of the two groups. First is the subset of the top 40 playlists. Playlist IDs below:

2hfOGugGPsjfPTYKlZojom
07UNN5sIx1dYmAOmLPri3B
3cfp7fWCHVXPl6JiBnmkLQ
5QxWG2oKSylpTg9qS5rPRr
4xV24s6m1s6mPrilnsBqBa
1kaDWd90UWUByF4Tu2UPVx
5LVjFO57XKpRNV5vzzDmP5
0QZa8PuiIKpSRDXFcbc2y2


In [7]:
#Combined List
playlist_list = []
#Subset of top 40 List
playlist_40s_list = []
playlist_40s_IDS = ["2hfOGugGPsjfPTYKlZojom",
                              "07UNN5sIx1dYmAOmLPri3B",
                              "3cfp7fWCHVXPl6JiBnmkLQ",
                              "5QxWG2oKSylpTg9qS5rPRr",
                              "4xV24s6m1s6mPrilnsBqBa",
                              "1kaDWd90UWUByF4Tu2UPVx",
                              "5LVjFO57XKpRNV5vzzDmP5",
                              "0QZa8PuiIKpSRDXFcbc2y2"]

for item in playlist_40s_IDS:
  temp_playlist_df = pd.DataFrame(sp.playlist_items(item))
  temp_playlist_audio = get_audio_features_df(temp_playlist_df)
  temp_playlist_audio["playlist_name"] = sp.playlist(item)["name"]
  #Sets new Group column divided into Whattheylike and Thinktheylike which distinguishes both groups in the combined dataframe.
  temp_playlist_audio["Group"] = "Whattheylike"
  playlist_list.append(temp_playlist_audio)
  playlist_40s_list.append(temp_playlist_audio)

# Concatenate once, after the loop has collected every playlist
playlist_top_40s = pd.concat(playlist_40s_list)
playlist_top_40s

(J)This cell sets up the dataframe of songs we think a 20-something would like, and also builds the separate combined dataframe. Playlist IDs below:

1QaBNtgCIzTDoXMjz3EfDM
7fqBcrxNGMjbwTAeCBTzL6
4kiakE4MVsxanqJHdTJwDU
4NC377bgf15Vwymy1yMR9A
1g44fvkwYsjasjPcy0HGOC
6Er5rSv7eawUDyK2gzSa47


In [8]:
playlist_20s_list = []
playlist_20s_IDS = ["1QaBNtgCIzTDoXMjz3EfDM",
                    "7fqBcrxNGMjbwTAeCBTzL6",
                    "4kiakE4MVsxanqJHdTJwDU",
                    "4NC377bgf15Vwymy1yMR9A",
                    "1g44fvkwYsjasjPcy0HGOC",
                    "6Er5rSv7eawUDyK2gzSa47"]


for item in playlist_20s_IDS:
  temp_playlist_df = pd.DataFrame(sp.playlist_items(item))
  temp_playlist_audio = get_audio_features_df(temp_playlist_df)
  temp_playlist_audio["playlist_name"] = sp.playlist(item)["name"]
  temp_playlist_audio["Group"] = "Thinktheylike"
  playlist_list.append(temp_playlist_audio)
  playlist_20s_list.append(temp_playlist_audio)

#What we think 20's people would like
playlist_20s = pd.concat(playlist_20s_list)

#The combined playlist
young_people_playlist = pd.concat(playlist_list)
young_people_playlist

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,playlist_name,Group
0,Aretha Franklin,Respect - The Very Best of,I Say a Little Prayer,0FD8KMG4pHp0O9clTpChjp,0.590,0.3550,9,-14.051,1,0.0352,0,0.0585,0.4990,133.030,216773,4,Most Familiar With,Whattheylike
1,Ben E. King,Stand By Me,Stand By Me,2KQM3kDM0zMBC9iynePBbS,0.653,0.3340,9,-6.955,1,0.0313,0.000035,0.1230,0.6650,119.460,174253,4,Most Familiar With,Whattheylike
2,Louis Armstrong,What A Wonderful World,What A Wonderful World,29U7stRjqHU6rMiS8BfaI9,0.271,0.1650,5,-20.652,1,0.0351,0.000002,0.1180,0.2030,77.082,139227,4,Most Familiar With,Whattheylike
3,Santana,Abraxas,Oye Como Va,5u6y4u5EgDv0peILf60H5t,0.736,0.3790,7,-13.208,1,0.0539,0.345,0.1040,0.9480,128.399,256933,4,Most Familiar With,Whattheylike
4,The Temptations,The Temptations Sing Smokey,My Girl,745H5CctFr12Mo7cqa1BMH,0.572,0.4180,0,-10.738,1,0.0349,0,0.0961,0.6940,104.566,165000,4,Most Familiar With,Whattheylike
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5,Hélène Grimaud,Chopin / Rachmaninov: Piano Sonatas,"Piano Sonata No. 2 in B Flat Minor, Op. 36: II...",79O3P99BRXIOWuARKZXNnT,0.255,0.0496,4,-24.645,0,0.0469,0.869,0.0686,0.0385,72.575,445800,4,your 20s are for losing your mind,Thinktheylike
6,Fred Astaire,Funny Face (Original Motion Picture Soundtrack...,Basal Metabolism (Based On How Long Has This B...,3jV7xeKd7V5suD6w9qCT2m,0.491,0.4450,4,-10.695,0,0.0740,0.164,0.0694,0.4510,114.497,173960,4,your 20s are for losing your mind,Thinktheylike
7,Phoebe Bridgers,Stranger in the Alps (Deluxe Edition),Motion Sickness,25Syi9wnfn6ZGAmiOBypPq,0.651,0.5460,1,-9.021,1,0.0357,0.0437,0.0842,0.6230,107.021,229760,4,your 20s are for losing your mind,Thinktheylike
8,The Buttertones,Buttertones,Dionysus,7wYU1avLwl1Gtkib8OTrZp,0.331,0.6100,9,-8.138,1,0.0293,0.0548,0.1400,0.5500,132.746,209040,4,your 20s are for losing your mind,Thinktheylike


(J) Here we sorted our two datasets by danceability, in the hope that this would make any graphs we created more legible.
In retrospect this did not turn out to be useful, but we left it here just in case.

In [9]:
pl_20s_dance = playlist_20s.sort_values("danceability")
pl_40s_dance = playlist_top_40s.sort_values("danceability")
pl_dance = pd.concat([pl_20s_dance,pl_40s_dance])
pl_dance

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,playlist_name,Group
3,Mac Miller,Watching Movies with the Sound Off,Youforia,2dLYzJHP5Zc6xuMNLnhH16,0.146,0.6060,8,-6.305,1,0.0428,0.0,0.1110,0.0920,178.981,237777,3,MUSC255:Lonely_Loner,Thinktheylike
4,The Smiths,Hatful of Hollow,"Please, Please, Please, Let Me Get What I Want...",6BrMEbPSSj55nQhkgf6DnE,0.241,0.4680,2,-9.579,1,0.0272,0.0,0.1610,0.4510,91.581,112707,3,your 20s are for losing your mind,Thinktheylike
4,Andrew Skeet,The Greatest Video Game Music,Super Mario Galaxy: Gusty Garden Galaxy,05XPxcgHp4I4CFlOhMnskS,0.249,0.3160,1,-14.587,1,0.0368,0.963,0.2940,0.3200,153.897,229777,4,Aidans Shower Playlist,Thinktheylike
5,Hélène Grimaud,Chopin / Rachmaninov: Piano Sonatas,"Piano Sonata No. 2 in B Flat Minor, Op. 36: II...",79O3P99BRXIOWuARKZXNnT,0.255,0.0496,4,-24.645,0,0.0469,0.869,0.0686,0.0385,72.575,445800,4,your 20s are for losing your mind,Thinktheylike
6,David Bowie,Best of Bowie,Space Oddity - 1999 Remaster,22yy03IdKXGRUFMQX8EfBv,0.310,0.4230,0,-12.902,1,0.0324,0.000013,0.3650,0.4570,136.205,316333,4,POV: You Just Discovered Popular Music Older t...,Thinktheylike
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9,Wilson Pickett,In the Midnight Hour,In the Midnight Hour,4NRQwaks9r58tTDvr4iEyv,0.750,0.4440,4,-8.630,1,0.0403,0.000004,0.1180,0.8490,111.919,157160,4,Nostalgia,Whattheylike
4,Wilson Pickett,In the Midnight Hour,In the Midnight Hour,4NRQwaks9r58tTDvr4iEyv,0.750,0.4440,4,-8.630,1,0.0403,0.000004,0.1180,0.8490,111.919,157160,4,10 From List of 40,Whattheylike
9,Wilson Pickett,In the Midnight Hour,In the Midnight Hour,4NRQwaks9r58tTDvr4iEyv,0.750,0.4440,4,-8.630,1,0.0403,0.000004,0.1180,0.8490,111.919,157160,4,Most Familiar With,Whattheylike
10,Goth Babe,Encinitas,Encinitas,4tO2Ol08xzay6zcfhDKpuN,0.760,0.5770,7,-4.981,1,0.0908,0.00352,0.4450,0.5530,116.020,171724,4,10 From List of 40,Whattheylike


(N)Next came an actual attempt at graphing. We first tried a simple x-y chart listing tracks in ascending order of danceability score, but decided that this wasn't the most helpful way to visualize the data. Instead, we settled on a histogram, with bars colored according to our two datasets. Looking at this preliminary charting, it seems like "Thinktheylike" tracks tend to have higher danceability scores, or at least display a more extreme tail. Perhaps this suggests that our conception of what twenty-somethings like is more extreme than the music they are actually drawn to from a playlist.

In [10]:
# First attempt (x-y chart), kept for reference:
# alt.Chart(pl_dance).mark_point().encode(
#     x=alt.X("track_name", sort=None),
#     y='danceability',
#     color="Author",
#     tooltip=["artist", "track_name"]
# ).properties(
#     width=1000
# )

# Histogram of danceability, colored by group
alt.Chart(young_people_playlist).mark_bar().encode(
    alt.X("danceability", bin=True),
    y='count()',
    color="Group",
    tooltip=["artist", "track_name"]
).properties(width=1000)
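
Because the two groups contain different numbers of tracks, raw bar heights can be hard to compare by eye. One way to check the "more extreme tail" impression numerically is to compare the share of each group that falls in the upper bins. The sketch below uses plain Python and a handful of danceability values copied from rows shown in the output tables of this notebook; it is only an excerpt, so its numbers say nothing about the full dataset. The 0.1-wide bins are an assumption and may not match the bins Altair picks with `bin=True`, and `bin_index` and `share_at_or_above` are hypothetical helper names.

```python
from collections import Counter

# (Group, danceability) for a few rows copied from the output tables (not the full data).
tracks = [
    ("Whattheylike", 0.590), ("Whattheylike", 0.653),
    ("Whattheylike", 0.271), ("Whattheylike", 0.736),
    ("Thinktheylike", 0.146), ("Thinktheylike", 0.241),
    ("Thinktheylike", 0.255), ("Thinktheylike", 0.651),
    ("Thinktheylike", 0.331),
]


def bin_index(score):
    """Map a 0-1 score with three decimals to one of ten 0.1-wide bins (0..9)."""
    return min(int(round(score * 1000)) // 100, 9)


counts = Counter((group, bin_index(score)) for group, score in tracks)
totals = Counter(group for group, _ in tracks)


def share_at_or_above(group, first_bin):
    """Fraction of a group's tracks that fall in bin `first_bin` or higher."""
    hits = sum(n for (g, b), n in counts.items() if g == group and b >= first_bin)
    return hits / totals[group]


for group in totals:
    print(group, round(share_at_or_above(group, 6), 2))
```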

(N)This scatterplot also seems to suggest some extremes: "Thinktheylike" seems to have more high-energy songs, even while liveness is relatively constant, save for the one outlier of "I Knew You So Well" in the Thinktheylike group.

In [11]:
alt.Chart(young_people_playlist).mark_point().encode(
    x="liveness",
    y="energy",
    color="Group",tooltip=["artist", "track_name"]
)

(N) These radar plots represent the fruits of a lot of experimentation, and an effort to find the best way to display differences between our datasets with this tool. A few pitfalls included tempo and loudness values distorting the entire plot (we removed both features to solve the issue), uncertainty about which values to plot, and uncertainty about the proper sample size. In the end, we went with a sample of half of each dataset, and plotted danceability, speechiness, liveness, valence, instrumentalness, and energy.


In [13]:
import plotly.graph_objects as go

# Plot a random half of each group so the individual traces stay readable
length = round(len(playlist_20s) / 2)
length2 = round(len(playlist_top_40s) / 2)
input_data = playlist_20s.sample(length).copy()
input_data_2 = playlist_top_40s.sample(length2).copy()

# "danceability" is repeated at the end so each line trace closes back on its starting point
feature_columns = ["danceability", "energy", "speechiness", "liveness","valence","instrumentalness","danceability"]

def createRadarElement(row, feature_cols):
    # One radar trace per track, labelled with the track name
    return go.Scatterpolar(
        r = row[feature_cols].values.tolist(),
        theta = feature_cols,
        mode = 'lines',
        name = row['track_name'])

# Radar plot for the Thinktheylike sample
data = list(input_data.apply(createRadarElement, axis=1, args=(feature_columns, )))
fig = go.Figure(data)
fig.show()

# Radar plot for the Whattheylike sample
data2 = list(input_data_2.apply(createRadarElement, axis=1, args=(feature_columns, )))
fig2 = go.Figure(data2)
fig2.show()
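
Tempo and loudness distort the radar plot because they live on different scales (BPM and dB) from the 0-1 features. We dropped them, but an alternative would be to rescale them onto 0-1 first. The following is a minimal sketch of min-max scaling in plain Python, not something the notebook actually does; `min_max_scale` is a hypothetical helper, and the values are the tempo and loudness entries of the first five rows of the combined table shown earlier.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Tempo (BPM) and loudness (dB) of the first five rows of the combined table.
tempo = [133.030, 119.460, 77.082, 128.399, 104.566]
loudness = [-14.051, -6.955, -20.652, -13.208, -10.738]

scaled_tempo = min_max_scale(tempo)
scaled_loudness = min_max_scale(loudness)
print([round(v, 3) for v in scaled_tempo])
print([round(v, 3) for v in scaled_loudness])
```

Note that min-max scaling depends on the extremes in whatever data it is given, so both groups would need to be scaled together for the plots to stay comparable.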

(J) Looking at the above it seems like the Thinktheylike group tends to have more high-energy, more high-valence, and more instrumental pieces overall. Thinktheylike pieces also tend to be slightly more danceable, while both groups share similar liveness and speechiness values. 
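
These impressions come from reading the radar plots by eye, and the plots are drawn from random samples, so they can shift between runs. A per-group average of each feature gives a number to check them against. The sketch below shows the calculation in plain Python on six rows copied from the combined table shown earlier (three per group); it illustrates the method only, and six rows are far too few to confirm or contradict anything about the full dataset.

```python
from collections import defaultdict
from statistics import mean

# (Group, energy, valence) for six rows copied from the combined table.
rows = [
    ("Whattheylike", 0.3550, 0.4990),
    ("Whattheylike", 0.3340, 0.6650),
    ("Whattheylike", 0.1650, 0.2030),
    ("Thinktheylike", 0.0496, 0.0385),
    ("Thinktheylike", 0.4450, 0.4510),
    ("Thinktheylike", 0.5460, 0.6230),
]

# Collect each feature's values per group.
by_group = defaultdict(lambda: {"energy": [], "valence": []})
for group, energy, valence in rows:
    by_group[group]["energy"].append(energy)
    by_group[group]["valence"].append(valence)

# Average every feature within each group.
group_means = {
    group: {feature: mean(values) for feature, values in features.items()}
    for group, features in by_group.items()
}
for group, means in group_means.items():
    print(group, {feature: round(value, 3) for feature, value in means.items()})
```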

(J) Here I'm setting up a community network to figure out whether there are any composers (taken from the artist column) in common between the two groups. I thought about checking whether any songs were in common, but that seemed unlikely to turn up much, and composers might be a more interesting link. If there are a lot of composers in common, that would suggest that what early 20-somethings think they like and what they say they like have even more in common, whereas if not, it would suggest a disparity even in the composers we think we like.
Code adapted from the example given in notebook C.
Note that it unfortunately doesn't display well on Colab, but copy/pasting the notebook into Jupyter works better.
I also found it useful to group the data in a few different ways, so rather than rewrite the code I added a new parameter (group_by) to choose_network so I can pass the grouping attribute to the function.

In [22]:
# Creating an HTML node
def create_node_html(node: str, source_df: pd.DataFrame, node_col: str):
    rows = source_df.loc[source_df[node_col] == node].itertuples()
    html_lis = []
    for r in rows:
        html_lis.append(f"""<li>Artist: {r.artist}<br>
                                Playlist: {r.playlist_name}<br></li>"""
                       )
    html_ul = f"""<ul>{''.join(html_lis)}</ul>"""
    return html_ul

# Adding nodes from an Edgelist
def add_nodes_from_edgelist(edge_list: list, 
                               source_df: pd.DataFrame, 
                               graph: nx.Graph,
                               node_col: str):
    graph = deepcopy(graph)
    node_list = pd.Series(edge_list).apply(pd.Series).stack().unique()
    for n in node_list:
        graph.add_node(n, title=create_node_html(n, source_df, node_col), spring_length=1000)
    return graph

# Adding Louvain Communities
def add_communities(G):
    G = deepcopy(G)
    partition = community_louvain.best_partition(G)
    nx.set_node_attributes(G, partition, "Group")
    return G
#(J)Added an extra parameter to be able to pass different "group by" parameters without rewriting stuff.
def choose_network(df, chosen_word, file_name, group_by):
    
    # creating unique pairs
    output_grouped = df.groupby([group_by])[chosen_word].apply(list).reset_index()
    pairs = output_grouped[chosen_word].apply(lambda x: list(combinations(x, 2)))
    unique_pairs = pairs.explode().dropna().unique()
    
    # creating a new Graph
    pyvis_graph = net.Network(notebook=True, width="1000", height="1000", bgcolor="black", font_color="white")
    G = nx.Graph()
    
    try:
        # Use the dataframe passed in rather than the global one
        G = add_nodes_from_edgelist(edge_list=unique_pairs, source_df=df, graph=G, node_col=chosen_word)
    except Exception as e:
        print(e)
    
    # add edges and find communities
    G.add_edges_from(unique_pairs)
    G = add_communities(G)
    pyvis_graph.from_nx(G)
    return pyvis_graph

louvain_network = choose_network(young_people_playlist, 'artist', 'out.html', 'playlist_name')
louvain_network.show("out.html")
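
To make the edge-building step in choose_network easier to follow: it groups the rows by one column, then links every pair of values from the other column that land in the same group. The sketch below reproduces that idea in plain Python with made-up playlist and artist names. It is a simplification, not the notebook's code: it de-duplicates artists within a playlist first, so it never produces a pair linking an artist to itself, which the version above can do when an artist appears twice in one playlist.

```python
from collections import defaultdict
from itertools import combinations

# Made-up (playlist_name, artist) rows for illustration only.
rows = [
    ("Playlist A", "Artist 1"),
    ("Playlist A", "Artist 2"),
    ("Playlist A", "Artist 3"),
    ("Playlist B", "Artist 2"),
    ("Playlist B", "Artist 4"),
    ("Playlist B", "Artist 2"),
]

# Group the artists by playlist (the role played by group_by in choose_network).
artists_by_playlist = defaultdict(set)
for playlist, artist in rows:
    artists_by_playlist[playlist].add(artist)

# Every pair of artists sharing a playlist becomes one undirected edge.
edges = set()
for artists in artists_by_playlist.values():
    edges.update(combinations(sorted(artists), 2))

for edge in sorted(edges):
    print(edge)
```

Here "Artist 2" appears in both playlists, so it is the node that bridges the two clusters, which is the kind of link we look for in the network above.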

(J) In the results above, each node is an artist, and artists are linked when they appear in the same playlist, so playlists that have artists in common end up connected. Overall, the only Thinktheylike playlists that share composers with the Whattheylike group are "POV: you discovered music older than you" and "My summer playlist", indicating that the composers we think a 20-something would like and the ones they do like are distinctly different.
Of course, it might be interesting to see which artists these playlists have in common, so I swapped the arguments below so that the nodes are playlists, linked whenever they share an artist, which makes the individual connections easier to pick out.

In [15]:
louvain_network = choose_network(young_people_playlist, 'playlist_name', 'modified_rock.html', 'artist')
louvain_network.show("modified_rock.html")

(J) It would appear that Goth Babe, the Rolling Stones, and Led Zeppelin are the composers in common, which is admittedly a very small number.
This indicates some disparity between what we 20-somethings think we like and what we actually like.
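
The shared artists can also be read off directly, without the network, by intersecting the set of artists in each group. The sketch below shows the idea in plain Python. The rows are a hand-picked excerpt using artist names that appear in this notebook, not the full dataset, so the printed overlap is illustrative only.

```python
# Hand-picked (Group, artist) pairs; an illustrative excerpt, not the full data.
rows = [
    ("Whattheylike", "Aretha Franklin"),
    ("Whattheylike", "Wilson Pickett"),
    ("Whattheylike", "Goth Babe"),
    ("Thinktheylike", "Mac Miller"),
    ("Thinktheylike", "Phoebe Bridgers"),
    ("Thinktheylike", "Goth Babe"),
]

# Collect the distinct artists of each group.
artists = {}
for group, artist in rows:
    artists.setdefault(group, set()).add(artist)

# Artists present in both groups, and their share of all distinct artists
# (the Jaccard index of the two sets).
shared = artists["Whattheylike"] & artists["Thinktheylike"]
all_artists = artists["Whattheylike"] | artists["Thinktheylike"]
overlap = len(shared) / len(all_artists)
print(sorted(shared), round(overlap, 2))
```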