Playlist names cluster
============
Work-in-progress notebook
__________

Function for printing progress

In [3]:
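def log_progress(sequence, every=None, size=None, name='Items'):
    """Iterate over `sequence` while showing an ipywidgets progress bar.

    Yields (record, zero_based_index) tuples. The display is refreshed every
    `every` items; if the length is unknown (no len() and no `size`),
    `every` must be given.
    """
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display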

    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)     # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'

    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)

    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
            yield record, index - 1
    except:
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )


Loading data (first 2000 playlists, first 200000 track rows)

In [4]:
import pandas as pd
import numpy as np

In [5]:
df_playlist = pd.read_csv('../../data/playlists.csv', sep=';', nrows=2000)
df_playlist.head()

Unnamed: 0,name,collaborative,pid,modified_at,num_tracks,num_albums,num_followers,num_edits,duration_ms,num_artists,description
0,Throwbacks,False,0,1493424000,52,47,1,6,11532414,37,
1,Awesome Playlist,False,1,1506556800,39,23,1,5,11656470,21,
2,korean,False,2,1505692800,64,51,1,18,14039958,31,
3,mat,False,3,1501027200,126,107,1,4,28926058,86,
4,90s,False,4,1401667200,17,16,2,7,4335282,16,


In [6]:
df_tracks = pd.read_csv('../../data/tracks.csv', sep=';', nrows=200000)
df_tracks.head()

Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name,pid
0,0,Missy Elliott,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,0
1,1,Britney Spears,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Toxic,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800,In The Zone,0
2,2,Beyoncé,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Crazy In Love,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933,Dangerously In Love (Alben für die Ewigkeit),0
3,3,Justin Timberlake,spotify:track:1AWQoqb9bSvzTjaLralEkT,spotify:artist:31TPClRtHm23RisEBtV3X7,Rock Your Body,spotify:album:6QPkyl04rXwTGlGlcYaRoW,267266,Justified,0
4,4,Shaggy,spotify:track:1lzr43nnXAijIGYnCT8M8H,spotify:artist:5EvFsr3kj42KNv97ZEnqij,It Wasn't Me,spotify:album:6NmFmPX56pcLBOFMhIiKvF,227600,Hot Shot,0


One-hot encoding of the presence of each track in each playlist

In [10]:
_unique_pid = np.unique(df_playlist['pid'])
_unique_tracks = np.unique(df_tracks['track_uri'])
# np.int was removed in NumPy 1.24; the builtin int is the same dtype it aliased.
matrix = np.zeros((len(_unique_pid), len(_unique_tracks)), int)
matrix.shape

(2000, 75779)
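
For a sense of scale, the dense matrix above has 2000 × 75779 cells. The sketch below works out its footprint without allocating anything. It assumes the matrix is stored as 8-byte integers (the default integer on most 64-bit Linux and macOS builds of NumPy, which is an assumption about the environment) and compares that with a 1-byte dtype such as `np.int8`, which is enough for 0/1 flags.

```python
import numpy as np

# Shape reported by the cell above: 2000 playlists x 75779 distinct tracks.
n_playlists, n_tracks = 2000, 75779
n_cells = n_playlists * n_tracks

# Bytes needed for two candidate dtypes (itemsize is read from NumPy itself).
bytes_int64 = n_cells * np.dtype(np.int64).itemsize
bytes_int8 = n_cells * np.dtype(np.int8).itemsize

print(f"cells: {n_cells:,}")
print(f"int64: {bytes_int64 / 1e9:.2f} GB")
print(f"int8:  {bytes_int8 / 1e9:.2f} GB")
```

Since almost every cell is zero, a sparse representation would likely be smaller still, but that is outside what this notebook does.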

In [13]:
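# log_progress yields (track_uri, column_index) pairs, one per distinct track.
for c, index in log_progress(_unique_tracks, every=1, name='distinct tracks'):
    _sub = df_tracks.loc[df_tracks['track_uri'] == c]
    for pid in _sub['pid']:
        # pid doubles as the row index (assumes pids run 0..N-1);
        # skip rows that belong to playlists outside the loaded ones.
        if pid < len(_unique_pid):
            matrix[pid, index] = 1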
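
The loop above scans `df_tracks` once per distinct track. As a possible alternative, the same presence matrix can be filled with a single vectorized assignment by mapping each row's `track_uri` to its column number. The sketch below uses a tiny made-up frame (`toy_tracks`, hypothetical data, not the real CSV) and `np.unique(..., return_inverse=True)`. Like the loop, it assumes `pid` values can be used directly as row indices.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_tracks: (pid, track_uri) pairs.
toy_tracks = pd.DataFrame({
    "pid":       [0, 0, 1, 2, 2, 5],
    "track_uri": ["t:b", "t:a", "t:a", "t:c", "t:b", "t:a"],
})
n_playlists = 3  # only pids 0..2 count as loaded; pid 5 must be skipped

# Sorted distinct tracks plus, for every row, the column it maps to.
unique_tracks, col = np.unique(
    toy_tracks["track_uri"].to_numpy(), return_inverse=True
)
row = toy_tracks["pid"].to_numpy()

# Same guard as the loop: ignore rows whose playlist was not loaded.
keep = row < n_playlists

toy_matrix = np.zeros((n_playlists, len(unique_tracks)), dtype=np.int8)
toy_matrix[row[keep], col[keep]] = 1

print(unique_tracks)
print(toy_matrix)
```

Duplicate (playlist, track) pairs are harmless here because the assignment only ever writes 1.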

Clustering

In [175]:
X = matrix
Y = df_playlist['name'].values  # playlist names, kept for inspecting the clusters
num_clusters = 20

In [None]:
from sklearn import cluster
import matplotlib.pyplot as plt

# Unsupervised fit: scikit-learn ignores the second argument (y) here.
model = cluster.AgglomerativeClustering(n_clusters=num_clusters).fit(X, Y)
new_order = np.argsort(model.labels_)
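
To make the call above easier to inspect, here is a small sketch on a made-up binary matrix (`toy_X`, hypothetical). It uses scikit-learn's default Ward linkage, as the notebook does, and shows that `labels_` carries one cluster id per row. The ids themselves are arbitrary, so only co-membership is meaningful.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical presence matrix: rows 0-2 share tracks 0-2, rows 3-5 share tracks 3-5.
toy_X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
])

toy_model = AgglomerativeClustering(n_clusters=2).fit(toy_X)
toy_labels = toy_model.labels_

# Sorting by label places rows of the same cluster next to each other.
toy_order = np.argsort(toy_labels, kind="stable")

print(toy_labels)
print(toy_order)
```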

In [202]:
df_playlist.loc[:,'cluster'] = model.labels_
clusterized = df_playlist[['name', 'cluster']].sort_values(['cluster'], ascending=False)

Printing some results

In [204]:
from IPython.display import Markdown, display

for i in range(num_clusters):
    # Build the mask from `clusterized` itself so its index order matches.
    _involved = clusterized.loc[clusterized['cluster'] == i]

    display(Markdown('#### CLUSTER %d' % i))
    display(Markdown('%d playlists' % _involved.shape[0]))
    # Show at most the first 30 names of the cluster.
    display(Markdown('`%s`' % '` `'.join(_involved['name'].iloc[:30])))

#### CLUSTER 0

12 playlists

`ALT`  `Pole`  `vibes`  `( ͡° ͜ʖ ͡°)`  `Good Vibes`  `FIREFLY 2016`  `⚡️⚡️⚡️`  `Chill `  `chill out`  `jamz`  `The Good Stuff`  `california`

#### CLUSTER 1

36 playlists

`This Is What You Came For`  `sb2k17`  `Pop Hits`  `NewNew`  `car`  `Clubbin`  `PT`  `rap`  `English `  `w o r k o u t`  `pregame`  `party music`  `Cinco De Mayo`  `vibin'`  `Main Playlist`  `lib`  `bounce`  `everything `  `Rap/Pop`  `NB`  `HITS`  `😈😈😈`  `Rap / hip hop`  `Life `  `party`  `bump`  `Pump up`  `rap mix`  `random`  `PARTY`

#### CLUSTER 2

1593 playlists

`pow pow`  `Autumn Playlist`  `booze cruise`  `Türkçe Slow`  `Eminem`  `summa summa`  `Yeah`  `Class`  `chill`  `winter 2015`  `Movie music`  `Party`  `fire`  `Baseball`  `Frozen`  `now`  `Harvest Moon`  `Slow Jamz`  `idek`  `Hips Don't Lie`  `Elizabeth`  `worship `  `DANCE`  `mmm`  `Me`  `feels`  `back to school`  `Be Mine`  `NEW`  `Ayy`

#### CLUSTER 3

89 playlists

`classics`  `old jams`  `HALLOWEEN`  `Party Mix`  `Oldies`  `Old Rock`  `Music`  `John's Playlist`  `The Drive`  `Oldies`  `The Classics`  `classic `  `Emo`  `family`  `Classic Rock`  `classic rock`  `70/80`  `Oldies`  `before my time`  `WORKOUT!!`  `Jamie`  `Dad Rock`  `🤘🏼`  `Hype`  `Awesome Playlist`  `THE MIX`  `Hotel California`  `Good Music `  `oldies`  `Good Stuff`

#### CLUSTER 4

13 playlists

`Gym Mix`  `Melting Pot`  `Driving`  `Mashup`  `christian`  `2015`  `Rock`  `Kyle`  `Gym`  `Everything`  `rock`  `death`  `Beats`

#### CLUSTER 5

22 playlists

`💛💛`  `Up`  `Rap`  `rap`  `wrap`  `2k17`  `music`  `everything`  `2016`  `❤❤❤`  `Party`  `music`  `New School`  `BANGAZ`  `💯💯💯`  `Gym`  `Litty `  `Its a Trap`  `summer 2k17`  `My Music`  `🔥🔥🔥🔥`  `BUMP`

#### CLUSTER 6

3 playlists

`Favorite Songs`  `jamzzz`  `vibes`

#### CLUSTER 7

9 playlists

`Country`  `Tennessee `  `Country`  `My Favorites`  `Country`  `new`  `Country`  `Country Favorites`  `Country`

#### CLUSTER 8

11 playlists

`Classic Country`  `country`  `Country`  `country `  `Country`  `Country`  `Country summer`  `Country`  `Good Country`  `country`  `Country`

#### CLUSTER 9

13 playlists

`Way Back When`  `housewarming`  `R&B classics `  `2000s r&b`  `Love Music`  `~Rando~`  `Slow jams`  `R&B`  `breathe`  `HER`  `Main`  `in my feels`  `R&B classics`

#### CLUSTER 10

68 playlists

`study!`  `Current`  `Feels`  `yo`  `slow`  `throwbacksss`  `Sleep`  `Party`  `Throwback`  `Party`  `my heart`  `Guilty pleasure`  `Throwbacks `  `Electronic`  `partay `  `// Drive`  `old songs `  `throw backs`  `Dance mix`  `((chris))`  `Running`  `Throwbackkkk`  `Workout!`  `scott`  `Happy :)`  `throwback`  `workout 2`  `2009`  `old but good`  `Chill Out Music`

#### CLUSTER 11

17 playlists

`Disney`  `three`  `Disney`  `Disney Jams`  `Tangled`  `Disney`  `Disney!`  `disney`  `Musicals`  `Disney`  `Disney`  `Disney`  `Disney`  `Disney`  `Girls`  `Disney`  `Musicals`

#### CLUSTER 12

61 playlists

`Bumpin'`  `getting ready`  `IDGAF`  `tailgate`  `roadtrippin`  `New ish`  `mood`  `Playlist`  `beach`  `KILLA`  `hey`  `explicit`  `New music`  `The Mix`  `its lit`  `SENIOR YEAR`  `jams`  `BANGERZ`  `The PLayliSt`  `HARDCORE WORKOUT`  `feels`  `hoco`  `party playlist`  `Lit`  `mar`  `abby `  `mb`  `summer 17`  `Bet`  `Yeet`

#### CLUSTER 13

2 playlists

`Drake`  `Drake`

#### CLUSTER 14

3 playlists

`Musicals`  `Musicales`  `musicals`

#### CLUSTER 15

15 playlists

`Throwbacks`  `90's Hits`  `Rock mix`  `90's`  `Jammin`  `Ma`  `Teen Angst`  `Alternative`  `woo`  `Rock`  `90s alternative`  `Alternative Rock`  `Dancing on my own`  `90's `  `nostalgia`

#### CLUSTER 16

6 playlists

`Country`  `Country`  `Country`  `Country`  `Country`  `country`

#### CLUSTER 17

16 playlists

`getting ready`  `basic`  `Summer17`  `hyfr`  `high`  `pg`  `LIT`  `2020`  `summer 2016`  `Summa`  `lit`  `Summer Party`  `Stuff`  `summer playlist`  `Lifting `  `summer '17`

#### CLUSTER 18

10 playlists

`Throwback`  `childhood`  `my heart`  `Trap`  `HSM`  `hsm`  `BEST SONGS EVER`  `Childhood Jams`  `disney bops`  `BLAST from the PAST`

#### CLUSTER 19

1 playlists

`Chill`
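
The sizes printed above are very uneven. As a quick check, the sketch below re-enters the 20 reported sizes by hand (copied from the output, not recomputed from `model.labels_`) and works out how much of the sample sits in the largest cluster.

```python
# Cluster sizes copied from the printed output above, indexed by cluster id.
cluster_sizes = [12, 36, 1593, 89, 13, 22, 3, 9, 11, 13,
                 68, 17, 61, 2, 3, 15, 6, 16, 10, 1]

total = sum(cluster_sizes)
largest_id = max(range(len(cluster_sizes)), key=cluster_sizes.__getitem__)
largest_share = cluster_sizes[largest_id] / total

print(f"total playlists: {total}")
print(f"largest cluster: {largest_id} ({largest_share:.1%} of all playlists)")
```

With the notebook's own objects, `df_playlist['cluster'].value_counts()` should give the same tally.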

In [207]:
for i in ['country', "90's", 'Work', 'Disney', 'Dance']:
    # Exact, case-sensitive match on the playlist name.
    display(clusterized.loc[clusterized['name'] == i])

Unnamed: 0,name,cluster
1096,country,16
328,country,8
252,country,8
1442,country,2
1248,country,2
1240,country,2
1299,country,2
1915,country,2
1724,country,2
378,country,2


Unnamed: 0,name,cluster
406,90's,15
1190,90's,2
1747,90's,2
485,90's,2


Unnamed: 0,name,cluster
1710,Work,2
557,Work,2
681,Work,2


Unnamed: 0,name,cluster
292,Disney,11
834,Disney,11
748,Disney,11
1365,Disney,11
1259,Disney,11
1806,Disney,11
1731,Disney,11
1364,Disney,11
1048,Disney,11
1458,Disney,2


Unnamed: 0,name,cluster
1621,Dance,2
195,Dance,2
1069,Dance,2
1091,Dance,2
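
The lookups above compare names exactly, so `country`, `Country` and `country ` (with a trailing space) count as different names. A possible refinement, sketched on made-up rows (`toy`, hypothetical), is to normalize names before matching and then tabulate clusters per normalized name with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical (name, cluster) rows in the same layout as `clusterized`.
toy = pd.DataFrame({
    "name":    ["Country", "country ", "country", "Disney", "disney", "Country"],
    "cluster": [16, 8, 2, 11, 11, 2],
})

# Normalize: trim surrounding whitespace and lowercase.
toy["name_norm"] = toy["name"].str.strip().str.lower()

# Rows: normalized name; columns: cluster id; cells: number of playlists.
counts = pd.crosstab(toy["name_norm"], toy["cluster"])
print(counts)
```

A table like this makes it easier to see whether playlists with the same name land in a dedicated cluster or fall into the large catch-all one.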
