## Dependencies

In [1]:
import pandas as pd
import numpy as np
import json
import re 
import sys
import itertools

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt


import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth
import spotipy.util as util

import warnings
warnings.filterwarnings("ignore")

In [2]:
%matplotlib inline

In [3]:
#If you're not familiar with this, save it! It makes using Jupyter notebooks on laptops much easier
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

## Summary:

## 1. Data Exploration/Preparation

Download datasets here:
https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

In [4]:
spotify_df = pd.read_csv('dataset/tracks.csv')

In [5]:
spotify_df.head()

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.645,0.445,0,-13.338,1,0.451,0.674,0.744,0.151,0.127,104.851,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.695,0.263,0,-22.136,1,0.957,0.797,0.0,0.148,0.655,102.009,1
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.434,0.177,1,-21.18,1,0.0512,0.994,0.0218,0.212,0.457,130.418,5
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918,0.104,0.397,169.98,3
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.402,0.158,3,-16.9,0,0.039,0.989,0.13,0.311,0.196,103.22,4


Observations:
1. This data is at a **song level**
2. There are many numerical values that I'll be able to use to compare songs (liveness, tempo, valence, etc.)
3. Release date will be useful, but I'll need to create an OHE variable for release date in 5-year increments
4. Similar to 3, I'll need to create OHE variables for the popularity. I'll use 5-point increments here
5. There is nothing here related to the genre of the song, which would be useful. This data alone won't help us find relevant content, since this is a content-based recommendation system. Fortunately, there is an artist-level file (`dataset/artists.csv`, loaded below as `data_w_genre`) that should have some useful information

In [6]:
data_w_genre = pd.read_csv('dataset/artists.csv')

Observations:
1. This data is at an **artist level**
2. There are similar continuous variables as in our initial dataset, but I won't use them. I'll just use the values in the previous dataset.
3. The genres are going to be really useful here, and I'll need to use them moving forward. Now, the genre column appears to be in a list format, but my past experience tells me that it's likely not. Let's investigate this further.

In [7]:
data_w_genre.dtypes

id             object
followers     float64
genres         object
name           object
popularity      int64
dtype: object

This checks whether or not `genres` is actually in a list format:

In [8]:
data_w_genre['genres'].values[0]

'[]'

In [9]:
#To check if this is actually a list, let me index it and see what it returns
data_w_genre['genres'].values[0][0]

'['

As we can see, it's actually a string that looks like a list. Based on the example above, I'm going to put together a regex statement that extracts the genres and puts them into a real list.

In [10]:
data_w_genre['genres_upd'] = data_w_genre['genres'].apply(lambda x: [re.sub(' ','_',i) for i in re.findall(r"'([^']*)'", x)])

In [11]:
data_w_genre['genres_upd'].values[48][0]

'carnaval_cadiz'

Voila, now we have the genre column in a format we can actually use. If you go down, you'll see how we use it. 
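To make the regex step easier to follow, here is a minimal sketch of the same extraction wrapped in a small function. The function name `parse_genres` and the sample string are made up for illustration; they are not part of the dataset or the notebook.

```python
import re

def parse_genres(raw):
    """Turn a stringified list such as "['a b', 'c']" into ['a_b', 'c']."""
    # Capture everything between pairs of single quotes, then replace
    # spaces with underscores so each genre becomes a single token.
    return [g.replace(' ', '_') for g in re.findall(r"'([^']*)'", raw)]

sample = "['carnaval cadiz', 'flamenco']"  # hypothetical example value
print(parse_genres(sample))
print(parse_genres('[]'))  # an artist with no genres yields an empty list
```

Joining multi-word genres with underscores matters later, because the genre lists get joined into one space-separated string per song before vectorizing.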

Now, if you recall, this data is at an artist level and the previous dataset is at a song level. So here's what we need to do:
1. Explode the artists column in the previous dataset so each artist within a song has their own row
2. Merge `data_w_genre` onto the exploded dataset from Step 1 so that the previous dataset is now enriched with genre data
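The two steps above can be sketched on toy data. The frames `songs` and `artists` and their values are invented for illustration; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy song-level data: one row per song, with a list of artists.
songs = pd.DataFrame({
    'id': ['s1', 's2'],
    'artists_upd': [['A', 'B'], ['C']],
})
# Toy artist-level data: artist 'C' is deliberately missing.
artists = pd.DataFrame({
    'name': ['A', 'B'],
    'genres_upd': [['rock'], ['jazz', 'blues']],
})

# Step 1: one row per (song, artist) pair.
exploded = songs.explode('artists_upd')

# Step 2: a left merge keeps every pair and leaves NaN where no artist matched.
enriched = exploded.merge(artists, how='left',
                          left_on='artists_upd', right_on='name')
print(enriched)
```

Because the merge is a left join, songs whose artists are not in the artist file survive with a null genre, which is why the unmatched rows have to be filtered out afterwards.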

Before I go further, let's complete these two steps.

Step 1. 
Similar to before, we will need to extract the artists from the string list. 

In [12]:
spotify_df['artists_upd_v1'] = spotify_df['artists'].apply(lambda x: re.findall(r"'([^']*)'", x))

In [13]:
spotify_df['artists'].values[0]

"['Uli']"

In [14]:
spotify_df['artists_upd_v1'].values[0][0]

'Uli'

This looks good, but did it work for every artist string format? Let's double check.

In [15]:
spotify_df[spotify_df['artists_upd_v1'].apply(lambda x: not x)].head(5)

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artists_upd_v1
164,1xEEYhWxT4WhDQdxfPCT8D,Snake Rag,20,194533,0,"[""King Oliver's Creole Jazz Band""]",['08Zk65toyJllap1MnzljxZ'],1923,0.708,0.361,...,-11.764,0,0.0441,0.994,0.883,0.103,0.902,105.695,4,[]
170,3rauXVLOOM5BlxWqUcDpkg,Chimes Blues,14,170827,0,"[""King Oliver's Creole Jazz Band""]",['08Zk65toyJllap1MnzljxZ'],1923,0.546,0.189,...,-15.984,1,0.0581,0.996,0.908,0.339,0.554,80.318,4,[]
172,1UdqHVRFYMZKU2Q7xkLtYc,Pickin' On Your Baby,11,197493,0,"[""Clarence Williams' Blue Five""]",['6RuQvIr0t0otZHnAxXTGkm'],1923,0.52,0.153,...,-14.042,1,0.044,0.995,0.131,0.353,0.319,102.937,4,[]
174,0Vl2DO5U6FjgBpzCtBN3OA,Everybody Loves My Baby,10,152507,0,"[""Clarence Williams' Blue Five""]",['6RuQvIr0t0otZHnAxXTGkm'],1923,0.514,0.193,...,-13.92,0,0.238,0.996,0.199,0.248,0.665,180.674,4,[]
180,5SvyP1ZeJX1jA7AOZD08NA,Tears,10,187227,0,"[""King Oliver's Creole Jazz Band""]",['08Zk65toyJllap1MnzljxZ'],1923,0.359,0.357,...,-11.81,1,0.0511,0.994,0.819,0.29,0.753,205.053,4,[]


So, it looks like it didn't catch all of them, and you can quickly see why: these artists have an apostrophe in their name, so the entry is enclosed in double quotes instead of single quotes. I'll write another regex to handle this and then combine the two.

In [16]:
spotify_df['artists_upd_v2'] = spotify_df['artists'].apply(lambda x: re.findall('\"(.*?)\"',x))
spotify_df['artists_upd'] = np.where(spotify_df['artists_upd_v1'].apply(lambda x: not x), spotify_df['artists_upd_v2'], spotify_df['artists_upd_v1'] )
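One edge case worth knowing about: if a single row mixes both quoting styles, the single-quote regex returns a non-empty but incomplete list, so the double-quote fallback is never consulted. As an alternative that this notebook does not use, the sketch below parses the string with `ast.literal_eval` from the standard library. It assumes every value is a valid Python list literal; the helper name `parse_artists` is hypothetical.

```python
import ast

def parse_artists(raw):
    """Parse a stringified Python list of artist names into a real list."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed input: fall back to an empty list
    return [str(a) for a in parsed] if isinstance(parsed, list) else []

print(parse_artists("['Uli']"))
print(parse_artists('["King Oliver\'s Creole Jazz Band"]'))
# A made-up mixed row: both artists are recovered.
print(parse_artists("['Uli', \"King Oliver's Creole Jazz Band\"]"))
```

Whether mixed rows actually occur in the dataset is something to check before switching approaches.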

In [17]:
#need to create my own song identifier because there are duplicates of the same song with different ids. I see different
spotify_df['artists_song'] = spotify_df.apply(lambda row: str(row['artists_upd'][0])+str(row['name']),axis = 1)

In [18]:
spotify_df.sort_values(['artists_song','release_date'], ascending = False, inplace = True)

In [19]:
spotify_df[spotify_df['name']=='Adore You']

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artists_upd_v1,artists_upd_v2,artists_upd,artists_song
86217,5AnCLGg35ziFOloEnXK4uu,Adore You,71,278747,0,['Miley Cyrus'],['5YGY8feqx7naU7z4HrwZM6'],2013-10-04,0.583,0.655,...,0.111,4e-06,0.113,0.201,119.759,4,[Miley Cyrus],[],[Miley Cyrus],Miley CyrusAdore You
91884,3jjujdWJ72nww5eGnfs2E7,Adore You,88,207133,0,['Harry Styles'],['6KImCVD70vtIoJWnq6nGn3'],2019-12-13,0.676,0.771,...,0.0237,7e-06,0.102,0.569,99.048,4,[Harry Styles],[],[Harry Styles],Harry StylesAdore You
92524,1M4qEo4HE3PRaCOM7EXNJq,Adore You,74,207133,0,['Harry Styles'],['6KImCVD70vtIoJWnq6nGn3'],2019-12-06,0.676,0.771,...,0.0237,7e-06,0.102,0.569,99.048,4,[Harry Styles],[],[Harry Styles],Harry StylesAdore You


In [20]:
spotify_df.drop_duplicates('artists_song',inplace = True)
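The sort followed by `drop_duplicates` keeps the first row per `artists_song` key, which after a descending sort is the most recent release. Here is a minimal sketch of that behaviour on toy rows; the ids are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['a1', 'a2', 'b1'],  # hypothetical ids
    'artists_song': ['Harry StylesAdore You',
                     'Harry StylesAdore You',
                     'Miley CyrusAdore You'],
    'release_date': ['2019-12-06', '2019-12-13', '2013-10-04'],
})

# Descending sort puts the latest release first within each key,
# and drop_duplicates keeps that first row by default.
deduped = (toy.sort_values(['artists_song', 'release_date'], ascending=False)
              .drop_duplicates('artists_song'))
print(deduped)
```

Note that `release_date` is compared as a string here. Full ISO dates sort chronologically, but a year-only value such as `'1922'` sorts before `'1922-02-22'`, so the "latest" row is only approximate when formats are mixed.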

In [21]:
spotify_df[spotify_df['name']=='Adore You']

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artists_upd_v1,artists_upd_v2,artists_upd,artists_song
86217,5AnCLGg35ziFOloEnXK4uu,Adore You,71,278747,0,['Miley Cyrus'],['5YGY8feqx7naU7z4HrwZM6'],2013-10-04,0.583,0.655,...,0.111,4e-06,0.113,0.201,119.759,4,[Miley Cyrus],[],[Miley Cyrus],Miley CyrusAdore You
91884,3jjujdWJ72nww5eGnfs2E7,Adore You,88,207133,0,['Harry Styles'],['6KImCVD70vtIoJWnq6nGn3'],2019-12-13,0.676,0.771,...,0.0237,7e-06,0.102,0.569,99.048,4,[Harry Styles],[],[Harry Styles],Harry StylesAdore You


Now I can explode this column and merge as I planned to in `Step 2`

In [22]:
artists_exploded = spotify_df[['artists_upd','id']].explode('artists_upd')

In [23]:
artists_exploded_enriched = artists_exploded.merge(data_w_genre, how = 'left', left_on = 'artists_upd',right_on = 'name')
artists_exploded_enriched_nonnull = artists_exploded_enriched[~artists_exploded_enriched.genres_upd.isnull()].copy()  # copy so later in-place edits don't act on a slice

In [24]:
artists_exploded_enriched_nonnull.drop(columns=['id_y'], inplace=True)
artists_exploded_enriched_nonnull.rename(columns={'id_x': 'id'}, inplace=True)

In [25]:
artists_exploded_enriched_nonnull[artists_exploded_enriched_nonnull['id'] =='6KuQTIu1KoTTkLXKrwlLPV']

Unnamed: 0,artists_upd,id,followers,genres,name,popularity,genres_upd
195444,Robert Schumann,6KuQTIu1KoTTkLXKrwlLPV,423826.0,"['classical', 'early romantic era', 'german ro...",Robert Schumann,64.0,"[classical, early_romantic_era, german_romanti..."
195445,Vladimir Horowitz,6KuQTIu1KoTTkLXKrwlLPV,92365.0,"['classical', 'classical performance', 'classi...",Vladimir Horowitz,54.0,"[classical, classical_performance, classical_p..."


Alright, we're almost there. Now we need to:
1. Group by the song `id` and essentially create a list of genre lists
2. Consolidate these lists and output the unique values
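Here is a small sketch of those two steps on toy rows (the ids and genres are illustrative). It sorts the unique genres so the output is deterministic, which is a choice made for the example only.

```python
import itertools
import pandas as pd

# Toy rows: song 's1' has two artists, each with its own genre list.
rows = pd.DataFrame({
    'id': ['s1', 's1', 's2'],
    'genres_upd': [['classical', 'early_romantic_era'],
                   ['classical', 'classical_piano'],
                   ['trot']],
})

# Step 1: collect each song's genre lists into a list of lists.
grouped = rows.groupby('id')['genres_upd'].apply(list).reset_index()

# Step 2: flatten the nested lists and keep the unique genres.
grouped['consolidated'] = grouped['genres_upd'].apply(
    lambda lists: sorted(set(itertools.chain.from_iterable(lists)))
)
print(grouped)
```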

In [26]:
artists_genres_consolidated = artists_exploded_enriched_nonnull.groupby('id')['genres_upd'].apply(list).reset_index()

In [27]:
artists_genres_consolidated['consolidates_genre_lists'] = artists_genres_consolidated['genres_upd'].apply(lambda x: list(set(list(itertools.chain.from_iterable(x)))))

In [28]:
artists_genres_consolidated.head()

Unnamed: 0,id,genres_upd,consolidates_genre_lists
0,0004Uy71ku11n3LMpuyf59,[[polish_rock]],[polish_rock]
1,000CSYu4rvd8cQ7JilfxhZ,"[[country_quebecois, rock_quebecois]]","[country_quebecois, rock_quebecois]"
2,000DsoWJKHdaUmhgcnpr8j,[[barnmusik]],[barnmusik]
3,000G1xMMuwxNHmwVsBdtj1,"[[candy_pop, new_wave, new_wave_pop, permanent...","[power_pop, new_wave_pop, rock, new_wave, perm..."
4,000KblXP5csWFFFsD6smOy,"[[chamame, folclore_salteno, folklore_argentino]]","[folklore_argentino, folclore_salteno, chamame]"


In [29]:
spotify_df = spotify_df.merge(artists_genres_consolidated[['id','consolidates_genre_lists']], on = 'id',how = 'left')

## 2. Feature Engineering

### - Normalize float variables
### - OHE Year and Popularity Variables
### - Create TF-IDF features off of artist genres

In [30]:
spotify_df['year'] = spotify_df['release_date'].apply(lambda x: x.split('-')[0])

In [31]:
float_cols = spotify_df.dtypes[spotify_df.dtypes == 'float64'].index.values

In [32]:
ohe_cols = 'popularity'

In [33]:
spotify_df['popularity'].describe()

count    523475.000000
mean         27.235870
std          18.030233
min           0.000000
25%          13.000000
50%          27.000000
75%          40.000000
max          99.000000
Name: popularity, dtype: float64

In [34]:
# create 5 point buckets for popularity 
spotify_df['popularity_red'] = spotify_df['popularity'].apply(lambda x: int(x/5))
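For non-negative integers, `int(x/5)` is the same as integer division by 5. The sketch below shows that bucketing on toy values, plus the same trick applied to years as a 5-year bucket. The `year_bucket` column is an assumption for illustration; it is not created anywhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'popularity': [0, 4, 5, 27, 99],
    'year': ['1922', '1987', '2005', '2019', '2021'],
})

# 5-point popularity buckets: 0-4 -> 0, 5-9 -> 1, ..., 95-99 -> 19.
toy['popularity_red'] = toy['popularity'] // 5

# Hypothetical 5-year buckets, labelled by the first year of each bucket.
toy['year_bucket'] = toy['year'].astype(int) // 5 * 5
print(toy)
```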

In [35]:
# tfidf can't handle nulls so fill any null values with an empty list
spotify_df['consolidates_genre_lists'] = spotify_df['consolidates_genre_lists'].apply(lambda d: d if isinstance(d, list) else [])

In [36]:
spotify_df.head()

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,valence,tempo,time_signature,artists_upd_v1,artists_upd_v2,artists_upd,artists_song,consolidates_genre_lists,year,popularity_red
0,3u1C6nWVRoP5F0w8gGrDL3,사랑의 미로,25,222380,0,['최진희'],['1NSrAf8XJYJVgAXKoxaMet'],1987-06-01,0.367,0.194,...,0.367,144.316,4,[최진희],[],[최진희],최진희사랑의 미로,[trot],1987,5
1,1Mv4u308L16NZDZiD6HZCy,사랑은 힘든가봐,28,213440,0,['지수'],['4c9QIMfEbIIynuaswyxGx9'],2005-12-23,0.675,0.785,...,0.623,103.008,4,[지수],[],[지수],지수사랑은 힘든가봐,[],2005,5
2,1jvoY322nxyKXq8OBhgmSY,어떡하죠,44,244360,0,['지선'],['2Mo9NQaNCFCWSR5CnlfmbN'],2011-10-13,0.606,0.341,...,0.294,135.667,4,[지선],[],[지선],지선어떡하죠,[],2011,8
3,2ghebdwe2pNXT4eL34T7pW,그아픔까지사랑한거야,32,237688,0,['조정현'],['2WTpsPucygbYRnCnoEUkJQ'],1989-06-15,0.447,0.215,...,0.177,71.979,4,[조정현],[],[조정현],조정현그아픔까지사랑한거야,[],1989,6
4,7rxpWwcXNgDUXl0wN0gUvp,천국의 기억 장정우 Version,31,280372,0,['장정우'],['5L7zKs2ftwENWOMI7LFaN1'],2003-12-24,0.494,0.656,...,0.42,82.003,4,[장정우],[],[장정우],장정우천국의 기억 장정우 Version,[],2003,6


In [37]:
#simple function to create OHE features
#this gets passed later on
def ohe_prep(df, column, new_name): 
    """ 
    Create One Hot Encoded features of a specific column

    Parameters: 
        df (pandas dataframe): Spotify Dataframe
        column (str): Column to be processed
        new_name (str): new column name to be used
        
    Returns: 
        tf_df: One hot encoded features 
    """
    
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df
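To see what this helper produces, here is a toy run of the same idea without the function wrapper. The values are invented, and the 0.15 weight is the one applied to the popularity features when the feature set is built.

```python
import pandas as pd

toy = pd.DataFrame({'popularity_red': [5, 8, 5]})

# One column per distinct bucket value.
ohe = pd.get_dummies(toy['popularity_red'])
# Prefix the column names so they stay identifiable after concatenation.
ohe.columns = ['pop|' + str(c) for c in ohe.columns]

# Down-weight the block so it contributes less to cosine similarity.
weighted = ohe.astype(float) * 0.15
print(weighted)
```

Scaling a one-hot block by a constant is a simple way to control how much that feature group influences the final similarity score.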

In [38]:
#function to build entire feature set
def create_feature_set(df, float_cols):
    """ 
    Process spotify df to create a final set of features that will be used to generate recommendations

    Parameters: 
        df (pandas dataframe): Spotify Dataframe
        float_cols (list(str)): List of float columns that will be scaled 
        
    Returns: 
        final: final set of features 
    """
    
    #tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['consolidates_genre_lists'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    genre_df.reset_index(drop = True, inplace=True)

    #explicity_ohe = ohe_prep(df, 'explicit','exp')    
    year_ohe = ohe_prep(df, 'year','year') * 0.5
    popularity_ohe = ohe_prep(df, 'popularity_red','pop') * 0.15

    #scale float columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    #concatenate all features
    final = pd.concat([genre_df, floats_scaled, popularity_ohe, year_ohe], axis = 1)
     
    #add song id
    final['id']=df['id'].values
    
    return final

In [39]:
complete_feature_set = create_feature_set(spotify_df, float_cols=float_cols)#.mean(axis = 0)

In [40]:
complete_feature_set.head()

Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|48g,genre|_brasileira,genre|_hip_hop,genre|_house,genre|a3,genre|a_cappella,genre|abstract,genre|abstract_beats,...,year|2013,year|2014,year|2015,year|2016,year|2017,year|2018,year|2019,year|2020,year|2021,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3u1C6nWVRoP5F0w8gGrDL3
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1Mv4u308L16NZDZiD6HZCy
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1jvoY322nxyKXq8OBhgmSY
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2ghebdwe2pNXT4eL34T7pW
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7rxpWwcXNgDUXl0wN0gUvp


## 3. Connect to Spotify API

Useful links:
1. https://developer.spotify.com/dashboard/
2. https://spotipy.readthedocs.io/en/2.16.1/

In [41]:
#client id and secret for my application
client_id = ''
client_secret= ''

In [42]:
scope = 'user-library-read'

if len(sys.argv) > 1:
    username = sys.argv[1]
else:
    print("Usage: %s username" % (sys.argv[0],))
    sys.exit()

In [43]:
auth_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(auth_manager=auth_manager)

In [44]:
token = util.prompt_for_user_token(scope, client_id= client_id, client_secret=client_secret, redirect_uri='http://localhost:9001/callback')

In [45]:
sp = spotipy.Spotify(auth=token)

In [46]:
#gather playlist names and images. 
#images aren't going to be used until I start building a UI
id_name = {}
list_photo = {}
for i in sp.current_user_playlists()['items']:

    id_name[i['name']] = i['uri'].split(':')[2]
    list_photo[i['uri'].split(':')[2]] = i['images'][0]['url']

In [47]:
def create_necessary_outputs(playlist_name,id_dic, df):
    """ 
    Pull songs from a specific playlist.

    Parameters: 
        playlist_name (str): name of the playlist you'd like to pull from the spotify API
        id_dic (dic): dictionary that maps playlist_name to playlist_id
        df (pandas dataframe): spotify dataframe
        
    Returns: 
        playlist: all songs in the playlist THAT ARE AVAILABLE IN THE KAGGLE DATASET
    """
    
    #generate playlist dataframe
    playlist = pd.DataFrame()
    playlist_name = playlist_name

    for ix, i in enumerate(sp.playlist(id_dic[playlist_name])['tracks']['items']):
        #print(i['track']['artists'][0]['name'])
        playlist.loc[ix, 'artist'] = i['track']['artists'][0]['name']
        playlist.loc[ix, 'name'] = i['track']['name']
        playlist.loc[ix, 'id'] = i['track']['id'] # ['uri'].split(':')[2]
        playlist.loc[ix, 'url'] = i['track']['album']['images'][1]['url']
        playlist.loc[ix, 'date_added'] = i['added_at']

    playlist['date_added'] = pd.to_datetime(playlist['date_added'])  
    
    playlist = playlist[playlist['id'].isin(df['id'].values)].sort_values('date_added',ascending = False)
    
    return playlist
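The function above fills the frame one cell at a time with `.loc`. A sketch of an alternative is to collect plain dicts and build the frame once. The `items` payload below is a made-up stand-in that only mimics the keys the function reads (`track`, `artists`, `name`, `id`, `added_at`); it is not real API output, and `playlist_items_to_frame` is a hypothetical helper.

```python
import pandas as pd

def playlist_items_to_frame(items):
    """Build a playlist frame from a list of playlist item dicts."""
    rows = [{
        'artist': it['track']['artists'][0]['name'],  # first listed artist only
        'name': it['track']['name'],
        'id': it['track']['id'],
        'date_added': it['added_at'],
    } for it in items]
    frame = pd.DataFrame(rows, columns=['artist', 'name', 'id', 'date_added'])
    frame['date_added'] = pd.to_datetime(frame['date_added'])
    return frame

# Hypothetical payload shaped like the fields used above.
items = [
    {'added_at': '2022-08-19T11:46:53Z',
     'track': {'id': 'id_1', 'name': 'Song One', 'artists': [{'name': 'Artist X'}]}},
    {'added_at': '2022-09-19T01:59:03Z',
     'track': {'id': 'id_2', 'name': 'Song Two', 'artists': [{'name': 'Artist Y'}]}},
]
catalog_ids = ['id_1']  # stand-in for the ids present in the Kaggle dataset

frame = playlist_items_to_frame(items)
in_catalog = (frame[frame['id'].isin(catalog_ids)]
              .sort_values('date_added', ascending=False))
print(in_catalog)
```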

In [48]:
id_name

{'ブルーマックスの日本の歌': '081ncGCAGhUdWRT0DE1x33',
 'Wangan Midnight': '6mpm7edNkpHFDS25CUTDlZ',
 'Vivid BAD SQUAD': '62LjZeAJhrwLDwu515QUUZ',
 'Chill OST (Genshin/HSR)': '4tH69XcuH88Cdb0sz5ZKSP',
 "BlueMax's KPop Playlist": '6xX3v6fCPLhVPAal9MUSca',
 'Bolero': '0HaJeUKoHl2fkiRFtDeqfq',
 'Genshin Impact': '4dzmso408xFWuKBmHZ6jJ6',
 'Project SEKAI UNIT IMAGE ALBUM セカイノオト vol.1': '1FVSwUNHb1dwDIoUsenEXh',
 'J-Pop Mix': '37i9dQZF1EQoowv2cDraCW',
 'Fast & Furious 6': '5aPqOEshdvLiPqpaaAkBA0',
 'Furious 7: Original Motion Picture Soundtrack': '0T2sOiitVBZzkEfjkeR9Pp',
 'Fast & Furious 8: The Album': '52SErOtY1gfYFeWsm3jnmL',
 'ueueueuuee': '6L9TcTGdIFrCMUMb8VbJMT'}

In [49]:
playlist_EDM = create_necessary_outputs('ブルーマックスの日本の歌', id_name,spotify_df)

In [50]:
playlist_EDM

Unnamed: 0,artist,name,id,url,date_added
32,Jin Hashimoto,STAND PROUD,3OqPSJsqe4LvcaVl7G6vV3,https://i.scdn.co/image/ab67616d00001e0285d0dd...,2022-10-22 07:46:35+00:00
57,ヨルシカ,春泥棒,1rr2DJOxV0sHXeUXCAz1yf,https://i.scdn.co/image/ab67616d00001e02a19652...,2022-09-19 01:59:03+00:00
68,YOASOBI,ハルカ,5D9MPWdY2hjSeTIGE5n5kv,https://i.scdn.co/image/ab67616d00001e02684d81...,2022-08-19 11:46:53+00:00
67,YOASOBI,たぶん,398dL22bDbKbAmiOnPaq7o,https://i.scdn.co/image/ab67616d00001e02684d81...,2022-08-19 11:46:53+00:00
66,YOASOBI,あの夢をなぞって,5ptl2PXkiSth54HCuGO7vN,https://i.scdn.co/image/ab67616d00001e02684d81...,2022-08-19 11:46:53+00:00
65,YOASOBI,ハルジオン,08YwAPnX8sygJUXG9rvhDv,https://i.scdn.co/image/ab67616d00001e02684d81...,2022-08-19 11:46:53+00:00
64,YOASOBI,アンコール,465JzFiajJO59sUrDFsxdC,https://i.scdn.co/image/ab67616d00001e02684d81...,2022-08-19 11:46:53+00:00
56,ヨルシカ,ただ君に晴れ,3wJHCry960drNlAUGrJLmz,https://i.scdn.co/image/ab67616d00001e0228b535...,2022-08-03 15:20:50+00:00
46,BACK-ON,wimp,4deAcev969KEs1YrSgnhmS,https://i.scdn.co/image/ab67616d00001e0268f5bf...,2022-07-28 00:32:59+00:00
45,BACK-ON,ニブンノイチ,5goG94oNxga2XLJWOHOXdx,https://i.scdn.co/image/ab67616d00001e0268f5bf...,2022-07-28 00:32:59+00:00


## 4. Create Playlist Vector

In [51]:
def generate_playlist_feature(complete_feature_set, playlist_df, weight_factor):
    """ 
    Summarize a user's playlist into a single vector

    Parameters: 
        complete_feature_set (pandas dataframe): Dataframe which includes all of the features for the spotify songs
        playlist_df (pandas dataframe): playlist dataframe
        weight_factor (float): float value that represents the recency bias. The larger the recency bias, the more priority recent songs get. Value should be close to 1. 
        
    Returns: 
        playlist_feature_set_weighted_final (pandas series): single feature vector that summarizes the playlist
        complete_feature_set_nonplaylist (pandas dataframe): feature set of all songs that are not in the playlist
    """
    
    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]#.drop('id', axis = 1).mean(axis =0)
    complete_feature_set_playlist = complete_feature_set_playlist.merge(playlist_df[['id','date_added']], on = 'id', how = 'inner')
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]#.drop('id', axis = 1)
    
    playlist_feature_set = complete_feature_set_playlist.sort_values('date_added',ascending=False)

    most_recent_date = playlist_feature_set.iloc[0,-1]
    
    for ix, row in playlist_feature_set.iterrows():
        playlist_feature_set.loc[ix,'months_from_recent'] = int((most_recent_date.to_pydatetime() - row.iloc[-1].to_pydatetime()).days / 30)
        
    playlist_feature_set['weight'] = playlist_feature_set['months_from_recent'].apply(lambda x: weight_factor ** (-x))
    
    playlist_feature_set_weighted = playlist_feature_set.copy()
    #print(playlist_feature_set_weighted.iloc[:,:-4].columns)
    playlist_feature_set_weighted.update(playlist_feature_set_weighted.iloc[:,:-4].mul(playlist_feature_set_weighted.weight,0))
    playlist_feature_set_weighted_final = playlist_feature_set_weighted.iloc[:, :-4]
    #playlist_feature_set_weighted_final['id'] = playlist_feature_set['id']
    
    return playlist_feature_set_weighted_final.sum(axis = 0), complete_feature_set_nonplaylist
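The recency weighting can be sketched on its own with a few toy dates and a two-column feature frame. The feature values are invented, and the vectorized form below is a rewrite for illustration rather than the loop used in the function.

```python
import pandas as pd

added = pd.to_datetime(pd.Series(['2022-10-22', '2022-09-19', '2022-07-28']))
weight_factor = 1.09

# Whole 30-day "months" between each song and the most recently added one.
months_from_recent = (added.max() - added).dt.days // 30

# Older songs get exponentially smaller weights; the newest gets weight 1.
weights = weight_factor ** (-months_from_recent)

# Toy feature rows, one per playlist song.
features = pd.DataFrame({'f1': [1.0, 0.0, 1.0], 'f2': [0.0, 1.0, 1.0]})

# Scale each row by its weight, then sum the rows into one playlist vector.
playlist_vector = features.mul(weights, axis=0).sum(axis=0)
print(months_from_recent.tolist())
print(playlist_vector)
```

With a factor of 1.09, a song added one month before the newest one counts for roughly 92% of a new song, and the effect compounds with each additional month.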

In [52]:
complete_feature_set_playlist_vector_EDM, complete_feature_set_nonplaylist_EDM = generate_playlist_feature(complete_feature_set, playlist_EDM, 1.09)
#complete_feature_set_playlist_vector_chill, complete_feature_set_nonplaylist_chill = generate_playlist_feature(complete_feature_set, playlist_chill, 1.09)

## 5. Generate Recommendations

In [53]:
def generate_playlist_recos(df, features, nonplaylist_features):
    """ 
    Rank the songs that are not in the playlist by cosine similarity to the playlist vector.

    Parameters: 
        df (pandas dataframe): spotify dataframe
        features (pandas series): summarized playlist feature
        nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
        
    Returns: 
        non_playlist_df_top_40: Top 40 recommendations for that playlist
    """
    
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)].copy()
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    non_playlist_df_top_40['url'] = non_playlist_df_top_40['id'].apply(lambda x: sp.track(x)['album']['images'][1]['url'])
    
    return non_playlist_df_top_40
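To show what the similarity step computes, here is a minimal NumPy sketch of cosine similarity between each candidate row and a single playlist vector. It is a stand-in for the `cosine_similarity` call above, written out by hand; `cosine_to_vector` and the toy arrays are hypothetical.

```python
import numpy as np

def cosine_to_vector(candidates, playlist_vector):
    """Cosine similarity of each row in `candidates` with `playlist_vector`."""
    candidates = np.asarray(candidates, dtype=float)
    playlist_vector = np.asarray(playlist_vector, dtype=float)
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(playlist_vector)
    # Guard against all-zero vectors, which would otherwise divide by zero.
    safe = np.where(norms == 0, 1.0, norms)
    return (candidates @ playlist_vector) / safe

candidates = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
playlist_vector = np.array([1.0, 1.0])

sims = cosine_to_vector(candidates, playlist_vector)
ranking = np.argsort(-sims)  # candidate indices, most similar first
print(sims, ranking)
```

Cosine similarity only looks at direction, so the summed (unnormalized) playlist vector ranks candidates the same way an averaged one would.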

In [55]:
edm_top40 = generate_playlist_recos(spotify_df, complete_feature_set_playlist_vector_EDM, complete_feature_set_nonplaylist_EDM)

In [56]:
edm_top40

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,time_signature,artists_upd_v1,artists_upd_v2,artists_upd,artists_song,consolidates_genre_lists,year,popularity_red,sim,url
13945,1LwSnnsoKcAUv9TPFEZ7iQ,麻痺,69,198100,0,['yama'],['7kOrrFIBIBc8uCu2zbxbLv'],2021-01-15,0.532,0.932,...,4,[yama],[],[yama],yama麻痺,"[japanese_teen_pop, j-pop]",2021,13,0.833887,https://i.scdn.co/image/ab67616d00001e0234faef...
25339,06XQvnJb53SUYmlWIhUXUi,怪物,82,206000,0,['YOASOBI'],['64tJ2EAv1R6UaZqc4iOCyj'],2021-01-06,0.627,0.824,...,4,[YOASOBI],[],[YOASOBI],YOASOBI怪物,"[japanese_teen_pop, j-pop]",2021,16,0.832332,https://i.scdn.co/image/ab67616d00001e02f609c7...
5625,154Jycdld9dX9rBLE6L3v4,寄り酔い,67,216959,0,['和ぬか'],['6LesPuO1nhgJ2acJ4MjyBI'],2021-02-15,0.591,0.665,...,4,[和ぬか],[],[和ぬか],和ぬか寄り酔い,"[japanese_teen_pop, j-pop]",2021,13,0.829102,https://i.scdn.co/image/ab67616d00001e0279132d...
25341,19fhOFi6pNGeZe5uiFlm7c,優しい彗星,74,215333,0,['YOASOBI'],['64tJ2EAv1R6UaZqc4iOCyj'],2021-01-20,0.715,0.722,...,4,[YOASOBI],[],[YOASOBI],YOASOBI優しい彗星,"[japanese_teen_pop, j-pop]",2021,14,0.828348,https://i.scdn.co/image/ab67616d00001e028c9e15...
363366,1EM9pvygHqQm03as6sxLg9,旅路,68,277333,0,['Fujii Kaze'],['6bDWAcdtVR3WHz2xtiIPUi'],2021-03-01,0.59,0.556,...,4,[Fujii Kaze],[],[Fujii Kaze],Fujii Kaze旅路,"[japanese_teen_pop, j-pop]",2021,13,0.82741,https://i.scdn.co/image/ab67616d00001e02261a68...
25347,25o1M3Jse81xusDV6WhvC5,Epilogue,63,50547,0,['YOASOBI'],['64tJ2EAv1R6UaZqc4iOCyj'],2021-01-06,0.714,0.685,...,4,[YOASOBI],[],[YOASOBI],YOASOBIEpilogue,"[japanese_teen_pop, j-pop]",2021,12,0.816993,https://i.scdn.co/image/ab67616d00001e02684d81...
82967,32WvFbddpcnZsQzDCUQLkJ,Luv Letter,52,196193,0,['Takaya Kawasaki'],['3BjFX1nExMNHvSaoLd1I1k'],2018-03-14,0.616,0.377,...,4,[Takaya Kawasaki],[],[Takaya Kawasaki],Takaya KawasakiLuv Letter,"[japanese_teen_pop, j-pop]",2018,10,0.708941,https://i.scdn.co/image/ab67616d00001e02ccf41a...
82963,37IPQgBkvbmH9JR5mlY6a8,魔法の絨毯,75,212018,0,['Takaya Kawasaki'],['3BjFX1nExMNHvSaoLd1I1k'],2018-03-14,0.599,0.309,...,4,[Takaya Kawasaki],[],[Takaya Kawasaki],Takaya Kawasaki魔法の絨毯,"[japanese_teen_pop, j-pop]",2018,15,0.703972,https://i.scdn.co/image/ab67616d00001e02ccf41a...
82964,0ZSh6TRfx5WyYqXDxn6vdQ,幸せあれ,42,237425,0,['Takaya Kawasaki'],['3BjFX1nExMNHvSaoLd1I1k'],2018-03-14,0.55,0.474,...,4,[Takaya Kawasaki],[],[Takaya Kawasaki],Takaya Kawasaki幸せあれ,"[japanese_teen_pop, j-pop]",2018,8,0.702271,https://i.scdn.co/image/ab67616d00001e02ccf41a...
363373,58zXQUhb3NXbY2QjhRgTNL,さよならべいべ,56,260533,0,['Fujii Kaze'],['6bDWAcdtVR3WHz2xtiIPUi'],2020-05-20,0.61,0.884,...,4,[Fujii Kaze],[],[Fujii Kaze],Fujii Kazeさよならべいべ,"[japanese_teen_pop, j-pop]",2020,11,0.700599,https://i.scdn.co/image/ab67616d00001e02731953...
