# Project Plan

After discussing with Robin Burke, the general model for the recommender system will have multiple levels of detection:
- Use Last.fm to start the cluster
- Use Euclidean distance, and only draw connections between "djable" nodes
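
As a rough sketch of the Euclidean distance idea, the snippet below compares two tracks on a few audio features. The feature choice (`danceability`, `energy`, `valence`) and the helper names are assumptions for illustration; tempo is left out because it sits on a much larger scale and would need rescaling first.

```python
import math

# Assumed feature set: all three are on a 0-1 scale in the Spotify data.
FEATURES = ("danceability", "energy", "valence")

def feature_vector(track):
    """Pull the chosen features out of a track dict in a fixed order."""
    return [track[name] for name in FEATURES]

def euclidean(track_a, track_b):
    """Straight-line distance between two tracks in feature space."""
    return math.dist(feature_vector(track_a), feature_vector(track_b))

# Feature values taken from two chicago-house rows in the dataset.
a = {"danceability": 0.832, "energy": 0.677, "valence": 0.384}
b = {"danceability": 0.829, "energy": 0.332, "valence": 0.448}
print(round(euclidean(a, b), 3))
```

A smaller distance means the two tracks are closer in feel; only tracks that pass the "djable" filters would be compared this way.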

## Data to Test

- [🎹 Spotify Tracks Dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset)

In [107]:
## IMPORTS ##
import seaborn as sns
import pandas as pd
import networkx as nx
import numpy as np
import requests
import re
from collections import Counter
import os
from kmodes.kmodes import KModes
import spotipy
import keys

In [108]:
# looking at dataset of songs
df = pd.read_csv('dataset.csv')

In [109]:
# All the genres in the dataframe
np.unique(df.track_genre)

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'happy', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie', 'indie-pop', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop', 'pop-film',
       'pow

[keys.py](https://github.com/schwartzadev/dj-recommender/blob/master/keys.py)


- Rules for Building Edges
    - BPMs must be within 8 of each other
    - keys must be compatible on the Camelot wheel
    - danceability must be at least 0.5, otherwise add the node but ignore it when drawing edges
    - popularity must be at least 0.7, otherwise add the node but ignore it when drawing edges
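
The edge rules above can be sketched as a check on a single pair of tracks. The `Track` dataclass, `is_djable`, `should_link`, and the `compatible_keys` mapping are all hypothetical names for illustration. The popularity cutoff of 45 assumes the dataset's 0-100 scale, which is a guess on my part since the rule list says 0.7.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: str
    tempo: float
    cam_key: str
    danceability: float
    popularity: int

def is_djable(track, min_dance=0.5, min_pop=45):
    """Tracks below either threshold stay in the graph but get no edges."""
    return track.danceability >= min_dance and track.popularity >= min_pop

def should_link(a, b, compatible_keys, bpm_window=8):
    """Apply the edge rules to one pair of tracks.

    compatible_keys maps a Camelot key to the keys it mixes with
    (a hypothetical stand-in for whatever the key helper returns).
    """
    if a.track_id == b.track_id:
        return False
    if not (is_djable(a) and is_djable(b)):
        return False
    if abs(a.tempo - b.tempo) > bpm_window:
        return False
    return b.cam_key in compatible_keys.get(a.cam_key, [a.cam_key])

# Made-up tracks and a made-up compatibility table, for illustration only.
compatible = {"8B": ["8B", "8A"]}
x = Track("x", 126.0, "8B", 0.80, 60)
y = Track("y", 130.0, "8A", 0.70, 55)
z = Track("z", 128.0, "8B", 0.30, 80)  # not danceable enough
print(should_link(x, y, compatible), should_link(x, z, compatible))
```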

In [110]:
# Helper Functions

def _calc_bpm_range(bpm, brange=8):
    '''
    returns [low, high] bounds of the valid bpm range around bpm
    '''
    return [bpm - brange, bpm + brange]

def song_id_search(_id):
    '''returns every row of df whose track_id matches _id'''
    return df[df['track_id'] == _id]

def find_neighbors(song, danceability=0.5, popularity=45):
    '''returns track_ids with the same genre and key, inside the bpm range, above both thresholds'''
    brange = _calc_bpm_range(song.tempo)
    small_df = df[
        (df['tempo'] > brange[0])
        & (df['tempo'] < brange[1])
        & (df['track_genre'] == song.track_genre)
        & (df['danceability'] > danceability)
        & (df['popularity'] > popularity)
        & (df['key'] == song.key)
        & (df['track_id'] != song.track_id)
    ]
    return small_df['track_id'].tolist()


In [111]:
bieb = df.iloc[30021]
print(bieb.track_name)
find_neighbors(bieb)

Family


['322TxW77VZdX9gHynK5Xue',
 '58kZ9spgxmlEznXGu6FPdQ',
 '4EZDJ5FMTj5pCJe2HBmd21',
 '1ZOVeidJCvkxOARWTHmWOL',
 '0HLhptvI8NozbOHRLNniFz']

In [112]:
song_id_search('4vP5AQEH20l7zXfkdMCtzX')

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
13303,13303,4vP5AQEH20l7zXfkdMCtzX,Barbara Tucker;Obskür,Beautiful People (Obskür Remix),Beautiful People - Obskür Remix,48,333771,False,0.832,0.677,...,-8.78,0,0.105,0.00338,0.0153,0.0494,0.384,128.017,4,chicago-house


In [113]:
song_id_search('6fSdR81YvNG8Wo6i2ytLPR')

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
13012,13012,6fSdR81YvNG8Wo6i2ytLPR,Roy Davis Jr.;Peven Everett,Gabriel,Gabriel - Live Garage Mix,52,443849,False,0.829,0.332,...,-13.837,0,0.047,0.0125,0.294,0.0918,0.448,128.627,4,chicago-house


In [114]:
# The Chainsmokers, So Far So Good
test = song_id_search('2FqkTu4FhwDWn9hzEaWWCE')
test

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
30403,30403,2FqkTu4FhwDWn9hzEaWWCE,The Chainsmokers,So Far So Good,I Love U,69,185522,False,0.651,0.719,...,-5.804,1,0.0318,0.143,3.6e-05,0.0948,0.81,103.981,4,edm
31455,31455,2FqkTu4FhwDWn9hzEaWWCE,The Chainsmokers,So Far So Good,I Love U,69,185522,False,0.651,0.719,...,-5.804,1,0.0318,0.143,3.6e-05,0.0948,0.81,103.981,4,electro
53355,53355,2FqkTu4FhwDWn9hzEaWWCE,The Chainsmokers,So Far So Good,I Love U,69,185522,False,0.651,0.719,...,-5.804,1,0.0318,0.143,3.6e-05,0.0948,0.81,103.981,4,house


In [115]:
# checking to see how many duplicates there are
len(df)-len(np.unique(df.track_id))

24259

In [116]:
# ~24k duplicates?!?!! Let's drop those

In [117]:
# Drop unneeded columns
df = df.drop(columns=[
    'Unnamed: 0', 
    'explicit',
    'loudness',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness']
            )


In [118]:
cols = list(df.columns)
cols.remove('track_genre')
print(cols)

['track_id', 'artists', 'album_name', 'track_name', 'popularity', 'duration_ms', 'danceability', 'energy', 'key', 'mode', 'valence', 'tempo', 'time_signature']


In [119]:
# combines rows that are identical except for genre, joining their genres into one comma-separated string
df = df.groupby(cols)['track_genre'].apply(','.join).reset_index()

In [120]:
df

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
0,0000vdREvCVMxbQTkS888c,Rill,Lolly,Lolly,44,160725,0.910,0.37400,8,0,0.432,104.042,4,german
1,000CC8EParg64OmTxVnZ0p,Glee Cast,Glee Love Songs,It's All Coming Back To Me Now (Glee Cast Vers...,47,322933,0.269,0.51600,0,1,0.341,178.174,4,club
2,000Iz0K615UepwSJ5z2RE5,Paul Kalkbrenner;Pig&Dan,X,Böxig Leise - Pig & Dan Remix,22,515360,0.686,0.56000,5,0,0.108,119.997,4,minimal-techno
3,000RDCYioLteXcutOjeweY,Jordan Sandhu,Teeje Week,Teeje Week,62,190203,0.679,0.77000,0,1,0.839,161.721,4,hip-hop
4,000qpdoc97IMTBvF8gwcpy,Paul Kalkbrenner,Zeit,Tief,19,331240,0.519,0.43100,6,0,0.234,129.971,4,minimal-techno
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90455,7zxHiMmVLt4LGWpOMqOpUh,Haricharan;Gopi Sundar,Bangalore Days,"Aethu Kari Raavilum - From ""Bangalore Days""",56,325156,0.766,0.38200,7,0,0.672,119.992,4,pop-film
90456,7zxpdh3EqMq2JCkOI0EqcG,Piano Genie,Disney Favourites,"Two Worlds (From ""Tarzan"")",23,109573,0.529,0.00879,10,1,0.510,82.694,4,disney
90457,7zyYmIdjqqiX6kLryb7QBx,Eric Chou,學著愛,以後別做朋友,61,260573,0.423,0.36000,3,1,0.291,130.576,4,mandopop
90458,7zybSU9tFO9HNlwmGF7stc,Stereoclip,Echoes,Sunset Drive,54,234300,0.649,0.83400,10,0,0.150,125.004,4,electronic


In [121]:
test = song_id_search('2FqkTu4FhwDWn9hzEaWWCE')
test

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
26106,2FqkTu4FhwDWn9hzEaWWCE,The Chainsmokers,So Far So Good,I Love U,69,185522,0.651,0.719,8,1,0.81,103.981,4,"edm,electro,house"


In [122]:
# Lol, why are there still duplicates
len(df)-len(np.unique(df.track_id))

720

In [123]:
# list of unique ids
unique = np.unique(df.track_id)

# all ids
ids = df.track_id

In [124]:
# duplicates left
len(ids) - len(unique)

720

In [125]:
id_count = dict(Counter(ids))

In [126]:
dup_list = [key for key,val in id_count.items() if val > 1]

In [127]:
# list of duplicates captured
len(dup_list)

720

In [128]:
# list of duplicates
# dup_list

In [129]:
# slight difference in popularity, hm
song_id_search('00YwP3wJWiG8IxAA7OS9lo')

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
109,00YwP3wJWiG8IxAA7OS9lo,Anupam Roy,Doorbiney Chokh Rakhbona,Amake Amar Moto Thakte Dao,46,319946,0.566,0.419,7,1,0.186,147.881,4,"singer-songwriter,songwriter"
110,00YwP3wJWiG8IxAA7OS9lo,Anupam Roy,Doorbiney Chokh Rakhbona,Amake Amar Moto Thakte Dao,47,319946,0.566,0.419,7,1,0.186,147.881,4,"indian,indie-pop,indie,k-pop"


In [130]:
# another difference in popularity
song_id_search('014SIjoLDG1Ku19c5FlDYh')

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
189,014SIjoLDG1Ku19c5FlDYh,Creedence Clearwater Revival,Pumpkin Patch Hits,I Put A Spell On You,0,271786,0.393,0.732,4,0,0.621,100.41,4,country
190,014SIjoLDG1Ku19c5FlDYh,Creedence Clearwater Revival,Pumpkin Patch Hits,I Put A Spell On You,3,271786,0.393,0.732,4,0,0.621,100.41,4,rock


In [131]:
# ANOTHER difference in popularity, weird
song_id_search('0DqLuhTD1xI8mb2gY5YoLM')

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
2547,0DqLuhTD1xI8mb2gY5YoLM,Håkan Hellström,Det är så jag säger det,Den fulaste flickan i världen,35,197826,0.292,0.812,9,1,0.502,108.782,4,swedish
2548,0DqLuhTD1xI8mb2gY5YoLM,Håkan Hellström,Det är så jag säger det,Den fulaste flickan i världen,36,197826,0.292,0.812,9,1,0.502,108.782,4,goth


At this point I am thinking of dropping all of the rows that are still duplicated, including the copy with the higher popularity. I don't think losing them will be the end of the world, and because this data is weird, I probably don't want to recommend these tracks anyway.

In [132]:
df = df[~df['track_id'].isin(dup_list)]

In [176]:
# hell yeah, duplicate values are gone, totally removed for now
df[df['track_id']=='0DqLuhTD1xI8mb2gY5YoLM']

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre,cam_key,valid_cam_keys


### Pseudocode

I want to create a Song object so that I can pass the data in a box to the visualization in D3. I'm hoping I can project the image and add the Spotify link to the visualization.

1. Create Song Objects with attributes
    - Attributes
        - Song ID
        - Name
        - Artist
        - Spotify Link
        - spotify genre
        - popularity
        - key (convert to camelot)
        - tempo
        - lyrics
        - duration
        - explicit
    - Methods
        - Generate Spotify Link
        - get last fm track tag
        - from Song_ID (class method)
        - get lyrics
        - get valid tempo range
        - get neighbors (using filtering)
2. Go through entire track list
    - for a song in the song list
        - build a list of node list songs that share similar features
3. Playlist Object
    - Attributes
        - Song Objects
        - BPM range
        - Key Range
        - Genre
    - Methods
        - Create in Spotify(using Spotipy)
        - From 2 songs (class method)
            - short
            - different paths
        - Add (add to graph, use graph logic to create new playlist?)
4. SongGraph Object
   - Attributes
       - Last Update
   - Methods
       - Save to GraphML
       - From GraphML
       - Visualize
       - _add_song
       - _remove_song

The SongGraph object is simple: it is only referenced once the Playlist object is implemented, and it holds metadata such as the last update. It's nice because it will also have ways to save the current graph to GraphML.
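
A minimal sketch of the graph side of this plan, using `networkx` (already imported above). The node ids, names, and tempos are made up, and the in-memory buffer is only a stand-in for a real `.graphml` file. It shows the "From 2 songs" idea as a shortest path, plus a GraphML round trip.

```python
import io
import networkx as nx

# Hypothetical toy graph: node ids stand in for Spotify track ids.
G = nx.Graph()
G.add_node("song_a", name="Track A", tempo=126.0)
G.add_node("song_b", name="Track B", tempo=128.0)
G.add_node("song_c", name="Track C", tempo=131.0)
G.add_edge("song_a", "song_b")
G.add_edge("song_b", "song_c")

# "From 2 songs": the shortest chain of mixable tracks between two endpoints.
playlist = nx.shortest_path(G, source="song_a", target="song_c")

# Save to GraphML and load it back; the buffer stands in for a file on disk.
buffer = io.BytesIO()
nx.write_graphml(G, buffer)
buffer.seek(0)
restored = nx.read_graphml(buffer)

print(playlist)
print(restored.number_of_nodes(), restored.number_of_edges())
```

The "different paths" variant could probably be built on `nx.all_simple_paths`, though that gets expensive on a dense graph.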
 

In [134]:
class Song:
    def __init__(self, song_id, name, artists, popularity, key, tempo, duration, explicit=None, lyrics=None):
        self.song_id = song_id
        self.name = name
        self.artists = artists
        self.popularity = popularity
        self.key = key
        self.tempo = tempo
        self._duration = duration  # stored in milliseconds
        self.explicit = explicit
        self.lyrics = lyrics

    @classmethod
    def from_df_row(cls, df_row):
        # build through cls so subclasses get instances of themselves
        return cls(df_row.track_id, df_row.track_name, df_row.artists, df_row.popularity, df_row.key, df_row.tempo, df_row.duration_ms)

    @classmethod
    def from_spot_id(cls,_id):
        pass

    @property
    def spot_link(self):
        return f'https://open.spotify.com/track/{self.song_id}'
    
    @property
    def duration(self):
        _sec=int((self._duration/1000)%60)
        _min=int((self._duration/(1000*60))%60)
        return f'{_min}min {_sec}sec'
    
    @property
    def valid_bpm_range(self):
        pass
    
    
    @property
    def valid_keys(self):
        pass
    
    @property
    def stats(self):
        output = f'''Song ID: {self.song_id}
Song Name: {self.name}
Song Artists: {self.artists}
Song Key: {self.key}
Song Tempo: {self.tempo}
Song Duration {self.duration}'''
        print(output)
    
    def __str__(self):
        return f'Song Object: \'{self.name} by {self.artists}\''
    
    def __repr__(self):
        return f'ID: {self.song_id} | Name: {self.name} | Song Artists: {self.artists} | Song Key: {self.key} | Song Tempo: {self.tempo} | Song Duration {self.duration}'
    


In [135]:
# Creating Object

first = df.iloc[0]

song = Song.from_df_row(first)

In [136]:
song.duration

'2min 40sec'

In [137]:
song

ID: 0000vdREvCVMxbQTkS888c | Name: Lolly | Song Artists: Rill | Song Key: 8 | Song Tempo: 104.042 | Song Duration 2min 40sec

## TODOS:

- key to camelot conversion
- build a networkx graph from the dataframe
    - convert to graphml [(link)](https://stackoverflow.com/questions/13159575/using-a-graphml-file-for-d3-js-force-directed-layout)

In [138]:
first['key']

8

In [139]:
first['mode']

0

In [140]:
df.iloc[0]

track_id          0000vdREvCVMxbQTkS888c
artists                             Rill
album_name                         Lolly
track_name                         Lolly
popularity                            44
duration_ms                       160725
danceability                        0.91
energy                             0.374
key                                    8
mode                                   0
valence                            0.432
tempo                            104.042
time_signature                         4
track_genre                       german
Name: 0, dtype: object

In [141]:
key = keys.generate_camelot_key(first['mode'],first['key'])
key
keys._get_matching_keys(key)

['2A', '4A', '7A', '10A', '1A', '1B']

In [142]:
key

'1A'
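
For reference, here is a small sketch of how a key-to-Camelot conversion can work. It is not the code in `keys.py`: `to_camelot` and `wheel_neighbors` are hypothetical names, and the neighbor rule shown is the common "one step around the wheel, or switch letter" convention. That rule is narrower than the list `keys._get_matching_keys` returned above. It assumes Spotify's encoding, where `key` is a pitch class (0 = C) and `mode` is 1 for major and 0 for minor.

```python
def to_camelot(key, mode):
    """Convert a Spotify pitch class (0-11) and mode (1 major, 0 minor) to Camelot."""
    if not 0 <= key <= 11:
        raise ValueError("key must be a pitch class from 0 to 11")
    if mode == 1:
        # Major keys sit on the "B" ring; C major (key 0) is 8B.
        return f"{(7 * key + 7) % 12 + 1}B"
    # Minor keys sit on the "A" ring; A minor (key 9) is 8A.
    return f"{(7 * key + 4) % 12 + 1}A"

def wheel_neighbors(cam_key):
    """Same key, one step either way on the same ring, and the other ring."""
    number, letter = int(cam_key[:-1]), cam_key[-1]
    down = (number - 2) % 12 + 1
    up = number % 12 + 1
    other = "B" if letter == "A" else "A"
    return [cam_key, f"{down}{letter}", f"{up}{letter}", f"{number}{other}"]

print(to_camelot(8, 0))
print(wheel_neighbors("1A"))
```

Note the argument order: this sketch takes `(key, mode)`, while `keys.generate_camelot_key` is called with `(mode, key)`.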

In [155]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

In [156]:
# pandas gave a SettingWithCopyWarning here, so the warning is disabled above to keep things moving

# assign a Camelot key to every row using the mode and key columns and a function from my friend's code
df['cam_key'] = [keys.generate_camelot_key(m, k) for m, k in zip(df['mode'], df['key'])]

In [157]:
df

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre,cam_key
0,0000vdREvCVMxbQTkS888c,Rill,Lolly,Lolly,44,160725,0.910,0.37400,8,0,0.432,104.042,4,german,1A
1,000CC8EParg64OmTxVnZ0p,Glee Cast,Glee Love Songs,It's All Coming Back To Me Now (Glee Cast Vers...,47,322933,0.269,0.51600,0,1,0.341,178.174,4,club,8B
2,000Iz0K615UepwSJ5z2RE5,Paul Kalkbrenner;Pig&Dan,X,Böxig Leise - Pig & Dan Remix,22,515360,0.686,0.56000,5,0,0.108,119.997,4,minimal-techno,4A
3,000RDCYioLteXcutOjeweY,Jordan Sandhu,Teeje Week,Teeje Week,62,190203,0.679,0.77000,0,1,0.839,161.721,4,hip-hop,8B
4,000qpdoc97IMTBvF8gwcpy,Paul Kalkbrenner,Zeit,Tief,19,331240,0.519,0.43100,6,0,0.234,129.971,4,minimal-techno,11A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90455,7zxHiMmVLt4LGWpOMqOpUh,Haricharan;Gopi Sundar,Bangalore Days,"Aethu Kari Raavilum - From ""Bangalore Days""",56,325156,0.766,0.38200,7,0,0.672,119.992,4,pop-film,6A
90456,7zxpdh3EqMq2JCkOI0EqcG,Piano Genie,Disney Favourites,"Two Worlds (From ""Tarzan"")",23,109573,0.529,0.00879,10,1,0.510,82.694,4,disney,6B
90457,7zyYmIdjqqiX6kLryb7QBx,Eric Chou,學著愛,以後別做朋友,61,260573,0.423,0.36000,3,1,0.291,130.576,4,mandopop,5B
90458,7zybSU9tFO9HNlwmGF7stc,Stereoclip,Echoes,Sunset Drive,54,234300,0.649,0.83400,10,0,0.150,125.004,4,electronic,3A


In [159]:
second = df.iloc[1]

In [160]:
key = keys.generate_camelot_key(second['mode'],second['key'])

In [166]:
keys._get_matching_keys(key)

['2B', '5B', '7B', '9B', '11B', '8A', '8B']

In [167]:
#df['valid_cam_keys'] = df['cam_key'].map(keys._get_matching_keys)
#df.drop('valid_cam_keys', axis=1, inplace=True)

In [175]:
df[df['cam_key'] == "8B"]

SyntaxError: invalid syntax (2304662452.py, line 1)

In [219]:
def updated_find_neighbors(song,danceability=0.5,popularity=45):
    '''
    ** Need to convert to using track_id instead of row **
    
    function used to filter df down to valid DJ songs that work with the given song (hypothetically)
    
    @param song: row of df to match, e.g. df.iloc[0]
    @param danceability: float from 0 -> 1, minimum value of Spotify's danceability metric
    @param popularity: int from 0 -> 100, minimum value of Spotify's popularity metric
    
    @return: list of track_ids for the matching songs
    '''
    
    # Removed (df['track_genre'] == song.track_genre) for now, need to just cluster by them
    
    # quick calculations to help with finding neighbors 
    cam_key = keys.generate_camelot_key(song.mode,song.key)
    valid_cam_keys = keys._get_matching_keys(cam_key)
    brange = _calc_bpm_range(song.tempo)
    genres = song.track_genre.split(',')
    
    small_df = df[(df['tempo']>brange[0]) & (df['tempo']<brange[1]) & (df['danceability']>danceability) & (df['popularity']>popularity) & (df['cam_key'].isin(valid_cam_keys)) & (df['track_id'] != song.track_id)]
    return small_df['track_id'].tolist()

In [224]:
bieb.track_name

'Somebody To Love'

In [221]:
len(updated_find_neighbors(bieb))

1513

In [188]:
bieb.cam_key

'10A'

In [191]:
bieb.track_genre.split(',')

['chicago-house']

## Takeaways

We might have to use NLP here, since it is hard for us to tell the difference between chicago-house and progressive-house from the tags alone. Because both are classified as EDM, it might make sense to cluster these songs with K-modes.
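
To make the K-modes idea a bit more concrete, here is a tiny sketch of the two pieces it relies on: a simple matching dissimilarity between categorical rows, and a column-wise mode acting as the cluster center. The genre vocabulary and helper names are assumptions for illustration. Real clustering would go through `KModes` from the `kmodes` package imported above rather than this hand-rolled version.

```python
from collections import Counter

# Assumed small vocabulary; the real one would cover every genre tag in the data.
GENRES = ["chicago-house", "deep-house", "edm", "electro", "house"]

def encode(track_genre):
    """Turn a comma-separated track_genre string into 0/1 genre flags."""
    tags = set(track_genre.split(","))
    return tuple(int(g in tags) for g in GENRES)

def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical rows disagree."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(rows):
    """Column-wise most frequent value: the cluster center in K-modes."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

rows = [encode("edm,electro,house"), encode("house"), encode("edm,house")]
center = mode_of(rows)
print(center)
print([matching_dissimilarity(r, center) for r in rows])
```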

In [222]:
# one additional recommendation!
bieb = df.iloc[30021]
print(bieb.track_name)
updated_find_neighbors(bieb)

Somebody To Love


['0068lzo1xXa9ED8ThypHU1',
 '00B7SBwrjbycLMOgAmeIU8',
 '0110a0AM2nOV8yYa9u6kjQ',
 '02K5L2D21TVIINipDIPEfA',
 '02X93AUKXP7FPJBPckqgGu',
 '033uNwmVzZvXm00CUPBnA0',
 '039VxfSo5FErqE65169VZL',
 '03rdAFKPMOjSOXTcoZSajT',
 '040SS97cdQhszuY9MNNSkQ',
 '043bfUkTydw0xJ5JjOT91w',
 '04EtBLFIxbcVt9NdYgcrpF',
 '04K2bMi2vyOBwxr5EjDq5O',
 '04e6inMj0fH8ifKgxVhGdn',
 '0530i8KhAfzyJfHhjlSlUS',
 '05bUAkDRK3xzvVkZlGU6ee',
 '06eFWpksA3M9qg8GeOGGBX',
 '06rcHn4ehOCA1I1n6CJ2lb',
 '06ypiqmILMdVeaiErMFA91',
 '07PHNpknqZFw6N3GzNQWB8',
 '07RxMrhnBCJbiS6B8Xcwtn',
 '07WWwzmVHoEMgcbuDo8P0N',
 '07fynzlBq7ltsOYSx6cC3A',
 '07hSCImGi0GMSKKRCgP8A5',
 '07qrA1FxZpXy383wX3IDEb',
 '08kSDRN3ZUB5V5ba2Rt5KB',
 '09IT6ZbPsY5EioVEqeyq4j',
 '09cgbbadzZSKFd1hGN23p5',
 '0AJhcuRl3i1FfPNr88ZScv',
 '0AdUlf7gmrwhuVp8oznsRc',
 '0AmA8MQue1LBuYZSAdBMsj',
 '0B2GWovtxzsh0a02PBbNl7',
 '0B9U02XhMpK0N0vbuboKgc',
 '0BNt1E54JDgCc06Ld5oaQ6',
 '0BOivei4EFBvk7CKwEwaZf',
 '0BbPcyNwgIbB9GYNwCN9Ja',
 '0Be7sopyKMv8Y8npsUkax2',
 '0C6G3mqMjVcIgTWbK9nFIE',
 