# Project Plan

After discussing the project with Robin Burke, the general model for the recommender system will have multiple levels:
- Last.fm to start the cluster
- use Euclidean distance, and only draw connections between "djable" nodes

## Data to Test

- [🎹 Spotify Tracks Dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset)

In [1]:
## IMPORTS ##
import seaborn as sns
import pandas as pd
import networkx as nx
import numpy as np
import requests
import re
from collections import Counter
import os
from kmodes.kmodes import KModes
import spotipy
import keys

In [2]:
# looking at dataset of songs
df = pd.read_csv('dataset.csv')

In [3]:
# All the genres in the dataframe
np.unique(df.track_genre)

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'happy', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie', 'indie-pop', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop', 'pop-film',
       'pow

## EDA

After some basic data exploration, I got into the meat of the project. First I needed to settle a few things, such as how I was going to build the edges of the network.

[keys.py](https://github.com/schwartzadev/dj-recommender/blob/master/keys.py)


- Rules for Building Edges
    - BPM within 8 of each other
    - must be within similar key --> Camelot
    - must be at least danceability of 0.5, otherwise add and ignore
    - must be at least popularity of 0.7, otherwise add and ignore
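
The rules above boil down to a pairwise test between two songs. Here is a minimal sketch of that test, assuming each song is a plain dict with `tempo`, `danceability`, `popularity`, and `cam_key` fields. The name `can_mix`, its parameters, and the default thresholds are all hypothetical. The compatible keys are passed in rather than computed, because that logic belongs to `keys.py`. Popularity in this dataset is an integer from 0 to 100, so the default of 45 is only a placeholder.

```python
def can_mix(a, b, matching_keys, bpm_window=8,
            min_danceability=0.5, min_popularity=45):
    """Return True if song b passes the edge rules relative to song a.

    Hypothetical helper: thresholds are placeholders, not tuned values.
    """
    # Rule 1: BPM within `bpm_window` of each other
    if abs(a['tempo'] - b['tempo']) > bpm_window:
        return False
    # Rule 2: b's Camelot key must be one of the keys compatible with a
    if b['cam_key'] not in matching_keys:
        return False
    # Rules 3 and 4: b must be danceable and popular enough
    if b['danceability'] < min_danceability:
        return False
    if b['popularity'] < min_popularity:
        return False
    return True


# Made-up songs for illustration only
song_a = {'tempo': 124.0, 'danceability': 0.8, 'popularity': 50, 'cam_key': '10A'}
song_b = {'tempo': 128.0, 'danceability': 0.7, 'popularity': 60, 'cam_key': '11A'}
song_c = {'tempo': 140.0, 'danceability': 0.9, 'popularity': 80, 'cam_key': '10A'}

print(can_mix(song_a, song_b, matching_keys={'10A', '11A'}))
print(can_mix(song_a, song_c, matching_keys={'10A', '11A'}))
```

The second call fails on the BPM rule alone, which is the cheapest check and therefore comes first.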
    
    
### **The first thing I wanted to do was build helper functions**

In [4]:
# Helper Functions

def _calc_bpm_range(bpm,brange=8):
    '''
    a helper function that calculates the valid bpm range around a song's bpm
    
    @param bpm: bpm of song
    @param brange: number to add/subtract from the given bpm used to create the bpm boundary
    
    @return: list of bpm boundaries for range for valid songs
    '''
    
    return [bpm-brange,bpm+brange]

def song_id_search(id_):
    '''
    a function to search a song by its id
    
    @param id_: id to search the dataframe by
    
    @return: dataframe with id
    '''
    
    return df[df['track_id']==id_]

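def genre_search(set_):
    
    '''
    a function that filters the dataframe by genre(s)
    
    @param set_: a string or list of strings used to filter the dataframe
    
    @return: dataframe of songs with genre in provided set
    '''
    
    # wrap a lone string so set() doesn't split it into characters
    if isinstance(set_, str):
        set_ = [set_]
    set_ = set(set_)
    
    # track_genre is a single string here and a set of strings once the genres are combined
    def _matches(genres):
        genres = {genres} if isinstance(genres, str) else set(genres)
        return bool(genres & set_)
    
    return df[df['track_genre'].apply(_matches)]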


In [5]:
song_id_search('4vP5AQEH20l7zXfkdMCtzX')

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
13303,13303,4vP5AQEH20l7zXfkdMCtzX,Barbara Tucker;Obskür,Beautiful People (Obskür Remix),Beautiful People - Obskür Remix,48,333771,False,0.832,0.677,...,-8.78,0,0.105,0.00338,0.0153,0.0494,0.384,128.017,4,chicago-house


In [6]:
# The Chainsmokers, So Far So Good

# There are a lot of Duplicates
test = song_id_search('2FqkTu4FhwDWn9hzEaWWCE')
test

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
30403,30403,2FqkTu4FhwDWn9hzEaWWCE,The Chainsmokers,So Far So Good,I Love U,69,185522,False,0.651,0.719,...,-5.804,1,0.0318,0.143,3.6e-05,0.0948,0.81,103.981,4,edm
31455,31455,2FqkTu4FhwDWn9hzEaWWCE,The Chainsmokers,So Far So Good,I Love U,69,185522,False,0.651,0.719,...,-5.804,1,0.0318,0.143,3.6e-05,0.0948,0.81,103.981,4,electro
53355,53355,2FqkTu4FhwDWn9hzEaWWCE,The Chainsmokers,So Far So Good,I Love U,69,185522,False,0.651,0.719,...,-5.804,1,0.0318,0.143,3.6e-05,0.0948,0.81,103.981,4,house


### First issue: Duplicates

Now that I knew there were duplicates, I had to see how many duplicates there actually were.

In [7]:
# checking to see how many duplicates there are
len(df)-len(np.unique(df.track_id))

24259

### 24k duplicates?!?!! We NEED to drop those!

Because there were songs that had multiple genres, at this point I realized that I had to use a funny pandas hack to get what I needed to get done. I needed to combine the genres.

To do the hack, I decided to make it easier by dropping columns I wasn't going to use.

In [8]:
# Drop unneeded columns
df = df.drop(columns=[
    'Unnamed: 0', 
    'explicit',
    'loudness',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness']
            )


### Next Step: Creating a list of columns

After dropping these columns, I created a list of the remaining columns, and removed the one I was going to combine, `track_genre`

In [9]:
cols = list(df.columns)
cols.remove('track_genre')
print(cols)

['track_id', 'artists', 'album_name', 'track_name', 'popularity', 'duration_ms', 'danceability', 'energy', 'key', 'mode', 'valence', 'tempo', 'time_signature']


### Final Step: Using apply to combine the `track_genre`

Finally, I used a set to combine all of the genres into one column.

In [10]:
df = df.groupby(cols)['track_genre'].apply(set).reset_index()
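
To make the hack easier to follow, here is the same pattern on a tiny made-up frame. The frame `toy` and the name `key_cols` are only for illustration; the real call above runs against the full dataset.

```python
import pandas as pd

# Made-up rows: track t1 appears twice, once per genre
toy = pd.DataFrame({
    'track_id': ['t1', 't1', 't2'],
    'tempo': [120.0, 120.0, 98.5],
    'track_genre': ['edm', 'house', 'jazz'],
})

# Group on every column except the one being combined
key_cols = [c for c in toy.columns if c != 'track_genre']

# Rows that agree on all key columns collapse into one row with a set of genres
merged = toy.groupby(key_cols)['track_genre'].apply(set).reset_index()
print(merged)
```

Two things to keep in mind with this approach: rows that differ in any grouping column stay separate, and by default `groupby` drops rows where a grouping column is NaN.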

In [11]:
test = song_id_search('2FqkTu4FhwDWn9hzEaWWCE')
test

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
26106,2FqkTu4FhwDWn9hzEaWWCE,The Chainsmokers,So Far So Good,I Love U,69,185522,0.651,0.719,8,1,0.81,103.981,4,"{house, edm, electro}"


### Checking Duplicates: Still?!?

There were still duplicates, so I was kinda frustrated.

In [12]:
# Lol, why are there still duplicates
len(df)-len(np.unique(df.track_id))

720

I decided I needed to make a list of all of the duplicates so I could see what was weird about them.

In [13]:
# list of unique ids
unique = np.unique(df.track_id)

# all ids
ids = df.track_id

In [14]:
# duplicates left
len(ids) - len(unique)

720

In [15]:
id_count = dict(Counter(ids))

In [16]:
dup_list = [key for key,val in id_count.items() if val > 1]

In [17]:
# list of duplicates captured
len(dup_list)

720

In [18]:
# list of duplicates
dup_list[:4]

['00YwP3wJWiG8IxAA7OS9lo',
 '014SIjoLDG1Ku19c5FlDYh',
 '02jLfqc9gMo8PkHEGHY3OT',
 '03mHinvLdrdSTd7w4GPXwH']

In [19]:
# slight difference in popularity, hm
song_id_search('00YwP3wJWiG8IxAA7OS9lo')

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
109,00YwP3wJWiG8IxAA7OS9lo,Anupam Roy,Doorbiney Chokh Rakhbona,Amake Amar Moto Thakte Dao,46,319946,0.566,0.419,7,1,0.186,147.881,4,"{singer-songwriter, songwriter}"
110,00YwP3wJWiG8IxAA7OS9lo,Anupam Roy,Doorbiney Chokh Rakhbona,Amake Amar Moto Thakte Dao,47,319946,0.566,0.419,7,1,0.186,147.881,4,"{indie, indian, indie-pop, k-pop}"


In [20]:
# another difference in popularity
song_id_search('014SIjoLDG1Ku19c5FlDYh')

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
189,014SIjoLDG1Ku19c5FlDYh,Creedence Clearwater Revival,Pumpkin Patch Hits,I Put A Spell On You,0,271786,0.393,0.732,4,0,0.621,100.41,4,{country}
190,014SIjoLDG1Ku19c5FlDYh,Creedence Clearwater Revival,Pumpkin Patch Hits,I Put A Spell On You,3,271786,0.393,0.732,4,0,0.621,100.41,4,{rock}


In [21]:
# ANOTHER difference in popularity, weird
song_id_search('02jLfqc9gMo8PkHEGHY3OT')

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
495,02jLfqc9gMo8PkHEGHY3OT,Feid,Si Te La Encuentras Por Ahí,Si Te La Encuentras Por Ahí,82,191573,0.743,0.576,8,0,0.455,171.945,4,"{reggaeton, reggae}"
496,02jLfqc9gMo8PkHEGHY3OT,Feid,Si Te La Encuentras Por Ahí,Si Te La Encuentras Por Ahí,83,191573,0.743,0.576,8,0,0.455,171.945,4,{latino}


At this point I am thinking of dealing with the duplicate rows by removing the one with the higher popularity. For now, though, I drop every song that appears more than once. I don't think it will be the end of the world, and because this data is weird, I probably don't want to recommend these songs anyway.

In [22]:
df = df[~df['track_id'].isin(dup_list)]

In [23]:
# hell yeah, duplicate values are gone, totally removed for now
df[df['track_id']=='0DqLuhTD1xI8mb2gY5YoLM']

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre


In [24]:
# ids and unique were captured before the drop, so this still reports the old count
len(ids) - len(unique)

720

In [25]:
id_count = dict(Counter(ids))

In [26]:
dup_list = [key for key,val in id_count.items() if val > 1]

In [27]:
len(df)-len(np.unique(df.track_id))

0

## Removed Songs

It was only 720 songs, so I just removed them, but in the future I will probably just remove the less popular version of the song.
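
A rough sketch of what that future version could look like, on made-up rows shaped like the duplicates above. It assumes the duplicated rows differ only in `popularity` and `track_genre`; the names `toy`, `genres`, and `deduped` are hypothetical.

```python
import pandas as pd

# Made-up rows: track x appears twice with different popularity and genres
toy = pd.DataFrame({
    'track_id': ['x', 'x', 'y'],
    'popularity': [46, 47, 30],
    'track_genre': [{'songwriter'}, {'indie', 'indie-pop'}, {'jazz'}],
})

# Union the genre sets of every row that shares a track_id
genres = toy.groupby('track_id')['track_genre'].apply(lambda s: set().union(*s))

# Keep the most popular row per track_id, then attach the merged genres
deduped = (
    toy.sort_values('popularity')
       .drop_duplicates('track_id', keep='last')
       .drop(columns='track_genre')
       .merge(genres.reset_index(), on='track_id')
)
print(deduped)
```

Switching `keep='last'` to `keep='first'` would keep the less popular row instead, so the choice stays a one-word change.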

Next up, I write the pseudocode for the song object that will be used on the website.

### Pseudocode

I want to create a song object so that I can pass the data in a box to the visualization in D3. I'm hoping I can project the image and add the spotify link to the visualization.

1. Create Song Objects with attributes
    - Attributes
        - Song ID
        - Name
        - Artist
        - Spotify Link
        - spotify genre
        - popularity
        - key (convert to camelot)
        - tempo
        - lyrics
        - duration
        - explicit
    - Methods
        - Generate Spotify Link
        - get last fm track tag
        - from Song_ID (class method)
        - get lyrics
        - get valid tempo range
        - get neighbors (using filtering)
2. Go through entire track list
    - for a song in the song list
        - build a list of node list songs that share similar features
3. Playlist Object
    - Attributes
        - Song Objects
        - BPM range
        - Key Range
        - Genre
    - Methods
        - Create in Spotify(using Spotipy)
        - From 2 songs (class method)
            - short
            - different paths
        - Add (add to graph, use graph logic to create new playlist?)
4. SongGraph Object
   - Attributes
       - Last Update
   - Methods
       - Save to GraphML
       - From GraphML
       - Visualize
       - _add_song
       - _remove_song

The SongGraph object is simple: it is only referenced once the Playlist object is implemented, and it holds metadata such as the last update. It's nice because it will also have ways to save the current graph to GraphML.
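
As a starting point, the SongGraph outline could be sketched with `networkx` like this. Everything here is an assumption about a class that does not exist yet: the method names follow the outline, the node and edge attributes are made up, and `Visualize` is left out.

```python
import os
import tempfile
from datetime import datetime, timezone

import networkx as nx


class SongGraph:
    """Hypothetical wrapper around a networkx graph of songs."""

    def __init__(self, graph=None):
        self.graph = nx.Graph() if graph is None else graph
        if 'last_update' not in self.graph.graph:
            self._touch()

    def _touch(self):
        # Stored as an ISO string so it survives a GraphML round trip
        self.graph.graph['last_update'] = datetime.now(timezone.utc).isoformat()

    def _add_song(self, song_id, neighbors=(), **attrs):
        self.graph.add_node(song_id, **attrs)
        for other_id, weight in neighbors:
            self.graph.add_edge(song_id, other_id, weight=weight)
        self._touch()

    def _remove_song(self, song_id):
        self.graph.remove_node(song_id)
        self._touch()

    def save_graphml(self, path):
        nx.write_graphml(self.graph, path)

    @classmethod
    def from_graphml(cls, path):
        return cls(nx.read_graphml(path))


sg = SongGraph()
sg._add_song('a', name='Lolly', tempo=104.042)
sg._add_song('b', neighbors=[('a', 2)], name='Tief', tempo=129.971)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'songs.graphml')
    sg.save_graphml(path)
    loaded = SongGraph.from_graphml(path)

print(loaded.graph.nodes(data=True))
```

GraphML only stores scalar attribute values, so a set of genres would need to be flattened first, for example joined into one string.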
 

In [28]:
class Song:
    def __init__(self, song_id, name, artists, popularity, key, tempo, duration, explicit=None,lyrics=None):
        self.song_id = song_id
        self.name = name
        self.artists = artists
        self.popularity = popularity
        self.key = key
        self.tempo = tempo
        self._duration = duration  # in milliseconds
        self.explicit = explicit
        self.lyrics = lyrics

    @classmethod
    def from_df_row(cls,df_row):
        # build through cls so subclasses get instances of their own type
        return cls(df_row.track_id, df_row.track_name, df_row.artists, df_row.popularity, df_row.key, df_row.tempo, df_row.duration_ms)        

    @classmethod
    def from_spot_id(cls,_id):
        pass

    @property
    def spot_link(self):
        return f'https://open.spotify.com/track/{self.song_id}'
    
    @property
    def duration(self):
        _sec=int((self._duration/1000)%60)
        _min=int((self._duration/(1000*60))%60)
        return f'{_min}min {_sec}sec'
    
    @property
    def valid_bpm_range(self):
        pass
    
    
    @property
    def valid_keys(self):
        pass
    
    @property
    def stats(self):
        output = f'''Song ID: {self.song_id}
Song Name: {self.name}
Song Artists: {self.artists}
Song Key: {self.key}
Song Tempo: {self.tempo}
Song Duration {self.duration}'''
        print(output)
    
    def __str__(self):
        return f'Song Object: \'{self.name} by {self.artists}\''
    
    def __repr__(self):
        return f'ID: {self.song_id} | Name: {self.name} | Song Artists: {self.artists} | Song Key: {self.key} | Song Tempo: {self.tempo} | Song Duration {self.duration}'
    


In [29]:
first = df.iloc[0]
first

track_id          0000vdREvCVMxbQTkS888c
artists                             Rill
album_name                         Lolly
track_name                         Lolly
popularity                            44
duration_ms                       160725
danceability                        0.91
energy                             0.374
key                                    8
mode                                   0
valence                            0.432
tempo                            104.042
time_signature                         4
track_genre                     {german}
Name: 0, dtype: object

In [30]:
df[df['track_id']=='0000vdREvCVMxbQTkS888c']

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre
0,0000vdREvCVMxbQTkS888c,Rill,Lolly,Lolly,44,160725,0.91,0.374,8,0,0.432,104.042,4,{german}


In [31]:
# Creating Object

first = df.iloc[0]

song = Song.from_df_row(first)

In [32]:
song.duration

'2min 40sec'

In [33]:
song

ID: 0000vdREvCVMxbQTkS888c | Name: Lolly | Song Artists: Rill | Song Key: 8 | Song Tempo: 104.042 | Song Duration 2min 40sec

## Song Object Todos

The next thing I need to do is make a song object from the dataframe and also from the spotify api. I'm going to register my app at some point to do it. Wondering if I made this an actual company what would happen.

## Key Matching

I needed to translate the keys from key and mode to camelot for easier mixing. I'm going to use a script my friend wrote a year ago to help with that.
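
For reference, here is a minimal sketch of the conversion itself using the standard Camelot wheel. This is not the code in `keys.py`. The function `to_camelot` and the two lookup tables are hypothetical, and the argument order (key first, then mode) differs from `keys.generate_camelot_key`. It assumes Spotify's encoding: `key` is a pitch class from 0 (C) to 11 (B), with -1 meaning no key was detected, and `mode` is 1 for major and 0 for minor.

```python
# Camelot numbers indexed by pitch class (0 = C, 1 = C#/Db, ..., 11 = B)
_MAJOR = [8, 3, 10, 5, 12, 7, 2, 9, 4, 11, 6, 1]   # major keys sit on the 'B' ring
_MINOR = [5, 12, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10]   # minor keys sit on the 'A' ring


def to_camelot(key, mode):
    """Convert a Spotify (key, mode) pair to a Camelot code such as '8B'.

    Hypothetical helper; returns None when the key is outside 0..11.
    """
    if not 0 <= key <= 11:
        return None
    if mode == 1:
        return f'{_MAJOR[key]}B'
    return f'{_MINOR[key]}A'


print(to_camelot(8, 0))   # G#/Ab minor
print(to_camelot(0, 1))   # C major
```

A lookup table is used instead of a formula so each entry can be checked against a printed Camelot wheel.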

In [34]:
key = keys.generate_camelot_key(first['mode'],first['key'])
key
keys._get_matching_keys(key)

['2A', '4A', '7A', '10A', '1A', '1B']

In [35]:
key

'1A'

In [36]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

In [37]:
# gave a warning, disabled warnings so we could get it to work

# assigned new values for all of the keys using two columns and the function from my friend's code
df['cam_key'] = [keys.generate_camelot_key(mode, key) for mode, key in zip(df['mode'], df['key'])]

In [38]:
second = df.iloc[1]

In [39]:
key = keys.generate_camelot_key(second['mode'],second['key'])

In [40]:
keys._get_matching_keys(key)

['2B', '5B', '7B', '9B', '11B', '8A', '8B']

In [41]:
df.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre,cam_key
0,0000vdREvCVMxbQTkS888c,Rill,Lolly,Lolly,44,160725,0.91,0.374,8,0,0.432,104.042,4,{german},1A
1,000CC8EParg64OmTxVnZ0p,Glee Cast,Glee Love Songs,It's All Coming Back To Me Now (Glee Cast Vers...,47,322933,0.269,0.516,0,1,0.341,178.174,4,{club},8B
2,000Iz0K615UepwSJ5z2RE5,Paul Kalkbrenner;Pig&Dan,X,Böxig Leise - Pig & Dan Remix,22,515360,0.686,0.56,5,0,0.108,119.997,4,{minimal-techno},4A
3,000RDCYioLteXcutOjeweY,Jordan Sandhu,Teeje Week,Teeje Week,62,190203,0.679,0.77,0,1,0.839,161.721,4,{hip-hop},8B
4,000qpdoc97IMTBvF8gwcpy,Paul Kalkbrenner,Zeit,Tief,19,331240,0.519,0.431,6,0,0.234,129.971,4,{minimal-techno},11A


In [42]:
# Testing to see if you can filter by cam_key
df[df['cam_key'] == "8B"].head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre,cam_key
1,000CC8EParg64OmTxVnZ0p,Glee Cast,Glee Love Songs,It's All Coming Back To Me Now (Glee Cast Vers...,47,322933,0.269,0.516,0,1,0.341,178.174,4,{club},8B
3,000RDCYioLteXcutOjeweY,Jordan Sandhu,Teeje Week,Teeje Week,62,190203,0.679,0.77,0,1,0.839,161.721,4,{hip-hop},8B
23,006rHBBNLJMpQs8fRC2GDe,Calcinha Preta;Gusttavo Lima,CP 25 Anos (Ao Vivo em Aracaju),Agora Estou Sofrendo - Ao Vivo,47,260510,0.605,0.678,0,1,0.439,125.059,4,"{sertanejo, forro, pagode}",8B
33,009MGoCC568mI1yvsbmTxw,Justnormal,Guessing Game,Park Bench,8,128984,0.805,0.503,0,1,0.515,97.964,4,{study},8B
38,00BYitnjj9tACCkLapk5uS,Silverchair,Neon Ballroom,Satin Sheets,25,144333,0.345,0.844,0,1,0.384,161.465,4,{grunge},8B


## Making the Neighbor Filtering Rules

The next two functions help with the filtering rules. I'll use this to create a new column for all of the edges between the nodes of the songs.

In [43]:
def find_intersection(song_specific_genres,valid_set_of_genres):
    
    '''
    A function to find the 'genre intersection' between a target song and other 'valid' songs
    
    
    @param song_specific_genres: set of genres for the target song
    @param valid_set_of_genres: set of genres from dataframe
    
    @return: int indicating amount of genres in common 
    '''
    
    # just to make sure the input is a set
    song_specific_genres = set(song_specific_genres)
    valid_set_of_genres = set(valid_set_of_genres)
    
    # Return number of common genres
    return len(song_specific_genres & valid_set_of_genres)

In [44]:
def _song_id_match(id_):
    '''
    a function to create a song series by its id
    
    @param id_: id to search the dataframe by
    
    @return: series with id
    '''
    
    return df[df['track_id']==id_].squeeze()

In [64]:
def find_neighbors(_track_id,danceability=0.5,popularity=45):
    '''
    ** Need to convert to using track_id instead of row **
    
    function used to filter to valid DJ songs that work (hypothetically)
    
    @param _track_id: id used to create a song from the dataframe 
    @param danceability: float from 0 -> 1, threshold on Spotify's danceability metric
    @param popularity: int from 0 -> 100, threshold on Spotify's popularity metric
    
    @return: list of valid ids
    '''
    song = _song_id_match(_track_id)
    print(song)
    
    
    # quick calculations to help with finding neighbors 
    cam_key = keys.generate_camelot_key(song.mode,song.key)
    valid_cam_keys = keys._get_matching_keys(cam_key)
    brange = _calc_bpm_range(song.tempo)
    
        
    # filtering rules for creating small df
    mask = (
        (df['tempo']>brange[0]) & (df['tempo']<brange[1])
        & (df['danceability']>danceability)
        & (df['popularity']>popularity)
        & (df['cam_key'].isin(valid_cam_keys))
        & (df['track_id'] != song.track_id)
    )
    # copy so the new weight column is not written to a view of df
    small_df = df[mask].copy()
    
    # apply genre filtering rule: weight = number of genres shared with the valid set,
    # which is fixed to {'edm'} for now  #should I weight the amount of genres??
    small_df['weight'] = small_df.apply(lambda x: find_intersection(song_specific_genres = x['track_genre'],valid_set_of_genres = {'edm'}), axis=1)
    
    small_df = small_df[small_df['weight']>0]
    
    # return a list of valid track ids
    
    return small_df['track_id'].tolist()
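

The genre step above compares every candidate against the fixed set `{'edm'}`. One possible variation, sketched here on made-up data, is to weight each candidate by how many genres it shares with the target song itself. The function `toy_neighbors`, the frame `toy`, and the `matching_keys` argument are hypothetical; the compatible keys are passed in so the sketch does not depend on `keys.py`.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe; all values are made up
toy = pd.DataFrame({
    'track_id': ['a', 'b', 'c', 'd'],
    'tempo': [124.0, 126.0, 128.0, 140.0],
    'danceability': [0.80, 0.70, 0.90, 0.85],
    'popularity': [50, 60, 55, 70],
    'cam_key': ['10A', '10A', '11A', '10A'],
    'track_genre': [{'house'}, {'house', 'edm'}, {'techno'}, {'house'}],
})


def toy_neighbors(frame, track_id, matching_keys, brange=8,
                  danceability=0.5, popularity=45):
    """Hypothetical variant: weight = number of genres shared with the target song."""
    song = frame[frame['track_id'] == track_id].squeeze()
    mask = (
        (frame['tempo'] > song.tempo - brange)
        & (frame['tempo'] < song.tempo + brange)
        & (frame['danceability'] > danceability)
        & (frame['popularity'] > popularity)
        & (frame['cam_key'].isin(matching_keys))
        & (frame['track_id'] != track_id)
    )
    candidates = frame[mask].copy()
    target_genres = set(song.track_genre)
    candidates['weight'] = [len(set(g) & target_genres) for g in candidates['track_genre']]
    # Drop candidates that share no genre with the target
    candidates = candidates[candidates['weight'] > 0]
    return dict(zip(candidates['track_id'], candidates['weight']))


neighbors = toy_neighbors(toy, 'a', matching_keys=['10A', '11A', '9A'])
print(neighbors)
```

Returning a dict of id to weight, rather than a bare list of ids, keeps the overlap count available as an edge weight when the graph is built.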



In [65]:
# one additional recommendation!
len(find_neighbors("2d4NfufMbawr8n1gBSyGOI"))

index                                   30505
track_id               2d4NfufMbawr8n1gBSyGOI
artists           Mark Farina;Homero Espinosa
album_name                   Somebody To Love
track_name                   Somebody To Love
popularity                                  7
duration_ms                            410269
danceability                            0.806
energy                                  0.666
key                                        11
mode                                        0
valence                                 0.257
tempo                                 123.446
time_signature                              4
track_genre                   {chicago-house}
cam_key                                   10A
Name: 30021, dtype: object


45

In [50]:
# create edges by 

#df['edges'] = df['cam_key'].map(keys._get_matching_keys)

## Testing Below

In [51]:
df = df.reset_index()

In [52]:
# reset_index() named the new column 'index', so there is no 'level_0' column to set
df.set_index('level_0')
df.index.names = ['index']

KeyError: "None of ['level_0'] are in the columns"

In [53]:
df.head()

Unnamed: 0,index,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre,cam_key
0,0,0000vdREvCVMxbQTkS888c,Rill,Lolly,Lolly,44,160725,0.91,0.374,8,0,0.432,104.042,4,{german},1A
1,1,000CC8EParg64OmTxVnZ0p,Glee Cast,Glee Love Songs,It's All Coming Back To Me Now (Glee Cast Vers...,47,322933,0.269,0.516,0,1,0.341,178.174,4,{club},8B
2,2,000Iz0K615UepwSJ5z2RE5,Paul Kalkbrenner;Pig&Dan,X,Böxig Leise - Pig & Dan Remix,22,515360,0.686,0.56,5,0,0.108,119.997,4,{minimal-techno},4A
3,3,000RDCYioLteXcutOjeweY,Jordan Sandhu,Teeje Week,Teeje Week,62,190203,0.679,0.77,0,1,0.839,161.721,4,{hip-hop},8B
4,4,000qpdoc97IMTBvF8gwcpy,Paul Kalkbrenner,Zeit,Tief,19,331240,0.519,0.431,6,0,0.234,129.971,4,{minimal-techno},11A


In [54]:
df[df['track_id']=='038gnuPve7s8rADCEDYsMH'].index

Int64Index([557], dtype='int64')

In [55]:
meduza_df = df[df['track_id'] == '038gnuPve7s8rADCEDYsMH']
meduza_squeeze = meduza_df.squeeze()
type(meduza_df)

pandas.core.frame.DataFrame

In [56]:
meduza_df

Unnamed: 0,index,track_id,artists,album_name,track_name,popularity,duration_ms,danceability,energy,key,mode,valence,tempo,time_signature,track_genre,cam_key
557,563,038gnuPve7s8rADCEDYsMH,Sonu Nigam,Paramathma (Original Motion Picture Soundtrack),Paravashanadenu,55,235833,0.635,0.299,1,0,0.402,83.983,4,{pop-film},12A


In [57]:
meduza_series = df.iloc[557]
type(meduza_series)

pandas.core.series.Series

In [58]:
meduza_series.popularity

55

In [None]:
# Need to figure out the issues with series vs dataframe objects. Can we select a series with loc?
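
On the series vs dataframe question: the short answer is yes. A small sketch on a made-up frame shows which selections return which type. The index has gaps on purpose, like the real frame after filtering, so label and position do not line up.

```python
import pandas as pd

# Made-up frame with a gappy index, as left behind by filtering rows
toy = pd.DataFrame(
    {'track_id': ['a', 'b', 'c'], 'popularity': [10, 55, 30]},
    index=[0, 5, 9],
)

row_by_label = toy.loc[5]                      # single label -> Series
row_by_position = toy.iloc[1]                  # single position -> Series (same row)
frame_by_label = toy.loc[[5]]                  # list of labels -> one-row DataFrame
frame_by_mask = toy[toy['track_id'] == 'b']    # boolean mask -> DataFrame
series_from_mask = frame_by_mask.squeeze()     # one-row DataFrame -> Series

print(type(row_by_label).__name__, type(frame_by_mask).__name__)
```

So a boolean filter always gives a DataFrame, even when it matches one row, and `.squeeze()` or `.iloc[0]` turns that single row into a Series.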