# Summary & Notes
### Recommendation Technique: Cosine Similarity

- Represent the entire playlist as a vector, with each feature as an element of the vector
- Represent each track in the Spotify dataset as a vector
  - Same features used for each track
  - Only for those songs that are in the dataset but not in the playlist
  - Potential Drawback: I do not have access to the entire Spotify Dataset, and thus have to use a subset of songs (~230k tracks)
- Track vectors with a smaller angle (theta) to the playlist vector have higher similarity
  - Recommend the 10 songs (tunable to more) with the lowest theta value, i.e. the highest cosine similarity
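
The ranking idea above can be sketched on toy data. This is a minimal illustration, assuming hypothetical 3-feature vectors and made-up track names; the notebook itself uses `sklearn`'s `cosine_similarity` on the full feature set.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-feature vectors, for illustration only
playlist_vec = np.array([1.0, 0.5, 0.0])
candidates = {
    "track_a": np.array([2.0, 1.0, 0.0]),  # same direction as the playlist
    "track_b": np.array([0.0, 0.0, 1.0]),  # orthogonal to the playlist
    "track_c": np.array([1.0, 0.0, 0.5]),  # partially aligned
}

# Higher cosine similarity means a smaller angle to the playlist vector
scores = {name: cosine_sim(vec, playlist_vec) for name, vec in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because only the angle matters, `track_a` scores highest even though it is twice as long as the playlist vector.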

### Tunable Parameters:
- Weight of each indicator variable (Genre, key, time_sig, popularity)
- Recency Bias Weight

### Improvements over v1: 
- Better Dataset
  - More features per track
  - More relevant/modern tracks available
- Refined Recommendation Algorithm
  - Base recommendations only on those tracks in the playlist that are in the dataset
    - Avoids any bias/skewing of recommendations

### Future Areas of Improvement
- Incorporate features for track artists
  - Spotify's recommendation algorithm presumably uses artist information as well
- Implement filters for the types of music the user wants recommended

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
from dotenv import load_dotenv
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# Dataset Setup & Feature Engineering

In [2]:
# Has genre and popularity; does not have explicit --> net total of 1 more feature, but will be expanded to more
dataset_df = pd.read_csv('./SpotifyFeatures.csv')
dataset_df.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


In [20]:
def create_feature_vectors(track_dataset_df):
    """
    Creates Feature Vectors for each track in the dataset. 
    Tunable Parameters: Weight of each indicator variable (Genre, key, time_sig, popularity)
    Parameters:
    - track_dataset_df: consists of all tracks in the used dataset, mimicking the "spotify db"
      (note: a 'popularity_red' bucket column is added to this dataframe in place)
    Returns:
    - dataframe consisting of each track's normalized feature vector and its track id.
    """
    # Get Unique Genre Values in df; make col for each genre and its corresponding value 1
    genre_df=pd.get_dummies(track_dataset_df['genre']) * 1

    # Get Unique key Values in df; make col for each key and its corresponding value 1
    key_df=pd.get_dummies(track_dataset_df['key']) * 1

    # Create 5 point buckets for popularity feature (OHE) - Reduces sensitivity to feature
    track_dataset_df['popularity_red'] = track_dataset_df['popularity'].apply(lambda x: int(x/5))
    tf_df = pd.get_dummies(track_dataset_df['popularity_red'])
    feature_names = tf_df.columns
    tf_df.columns = ["pop" + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)
    popularity_cols_df  = tf_df * 0.25

    # Scale and Normalize remaining columns
    float_cols = track_dataset_df[['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence']].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(float_cols), columns = float_cols.columns) * 0.2

    # Create OHE Buckets for time_signature feature
    time_sig_df = pd.get_dummies(track_dataset_df['time_signature']) * 0.2

    # Combine all components
    tracks_feature_set = pd.concat([genre_df,key_df,time_sig_df, popularity_cols_df, floats_scaled], axis = 1)
    tracks_feature_set['id'] = track_dataset_df['track_id'].values


    return tracks_feature_set    
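
The popularity bucketing in `create_feature_vectors` can be illustrated on a toy frame. This is a minimal sketch, assuming hypothetical track ids and popularity values; it uses `//` and the `prefix` argument of `pd.get_dummies` instead of the `apply` and rename steps in the cell above, which should yield the same `pop|N` columns.

```python
import pandas as pd

# Hypothetical mini dataset, for illustration only
toy = pd.DataFrame({"track_id": ["a", "b", "c"], "popularity": [0, 4, 67]})

# Integer-divide by 5: 0-4 -> bucket 0, 5-9 -> bucket 1, ..., 65-69 -> bucket 13
buckets = toy["popularity"] // 5

# One-hot encode the buckets, then scale by the popularity weight (0.25 in the function above)
pop_ohe = pd.get_dummies(buckets, prefix="pop", prefix_sep="|").astype(float) * 0.25
print(pop_ohe)
```

Tracks with popularity 0 and 4 land in the same bucket, so small popularity differences no longer change the feature vector.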

In [21]:
tracks_feature_set = create_feature_vectors(dataset_df)
tracks_feature_set.head()

Unnamed: 0,A Capella,Alternative,Anime,Blues,Children's Music,Children’s Music,Classical,Comedy,Country,Dance,...,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,id
0,0,0,0,0,0,0,0,0,0,0,...,0.071258,0.003033,0.182182,0.0,0.067923,0.180171,0.006414,0.128541,0.1628,0BRjO6ga9RKCKjfDqeFgWV
1,0,0,0,0,0,0,0,0,0,0,...,0.114387,0.004406,0.147546,0.0,0.028542,0.166894,0.013675,0.13516,0.1632,0BjC1NfoEOOusryehmNudP
2,0,0,0,0,0,0,0,0,0,0,...,0.13005,0.005594,0.026223,0.0,0.018848,0.137286,0.002964,0.065036,0.0736,0CoSDzoNIKCRs124s9uTVy
3,0,0,0,0,0,0,0,0,0,0,...,0.039288,0.004949,0.065263,0.0,0.017939,0.143339,0.003662,0.133048,0.0454,0Gc6TVm52BwZD07Ki6tIvf
4,0,0,0,0,0,0,0,0,0,0,...,0.058813,0.002428,0.045042,0.024625,0.038842,0.111411,0.004953,0.103703,0.078,0IuslXpMROHdEPvSl1fTQK


# Spotify API Connection

In [22]:
load_dotenv()
scope = "user-library-read"
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope=scope))

In [23]:
playlists_res = sp.current_user_playlists()['items']
playlists = {}

for item in playlists_res:
  playlists[item['name']] = item['id']

playlists

{'sum': '6URjLae4AnvscdYVVB9Tqq',
 'rihanna': '76QhOFRR5kyyrbupyT9R9I',
 'RO tation': '6eQTWnnnbFDm3QRy7lgSsr',
 'dj': '7BKQPJIKcp9mKV24LyLq61',
 'chill': '1XPdaFPovJmoL9OzDAYJ5Z',
 'old': '3Ehc9YycykfZ6ZvZ6GlIt2'}

In [24]:
ro_id = playlists['RO tation'] # This is what the input to the model will be, playlist ID
ro_id

'6eQTWnnnbFDm3QRy7lgSsr'

In [25]:
def filter_user_playlist(playlist_id, all_tracks_df):
    """
    Given a user playlist to base recommendations on,
    return a 'filtered' playlist containing only the tracks available in the dataset.
    Parameters:
    - playlist_id: id of the user playlist
    - all_tracks_df: tracks dataset
    Returns:
    - refined_playlist: filtered playlist of songs in dataset, most recently added first
    """
    user_playlist = sp.playlist(playlist_id)
    refined_playlist = pd.DataFrame()
    for ix, i in enumerate(user_playlist['tracks']['items']):
        if i['track'] is not None and i['track']['id'] is not None:
            refined_playlist.loc[ix, 'artist'] = i['track']['artists'][0]['name']
            refined_playlist.loc[ix, 'name'] = i['track']['name']
            refined_playlist.loc[ix, 'id'] = i['track']['id']
            refined_playlist.loc[ix, 'url'] = i['track']['album']['images'][1]['url']
            refined_playlist.loc[ix, 'date_added'] = i['added_at']

    refined_playlist['date_added'] = pd.to_datetime(refined_playlist['date_added'])  
    refined_playlist = refined_playlist[refined_playlist['id'].isin(all_tracks_df['track_id'].values)].sort_values('date_added',ascending = False)
    
    return refined_playlist

In [26]:
refined_playlist_ro = filter_user_playlist(ro_id, dataset_df)
refined_playlist_ro.head()

Unnamed: 0,artist,name,id,url,date_added
99,Drake,Portland,2bjwRfXMk4uRgOD9IBYl9h,https://i.scdn.co/image/ab67616d00001e024f0fd9...,2023-03-20 00:23:47+00:00
98,Playboi Carti,Shoota (feat. Lil Uzi Vert),2BJSMvOGABRxokHKB0OI8i,https://i.scdn.co/image/ab67616d00001e02a1e867...,2023-03-20 00:23:17+00:00
95,GoldLink,Herside Story,564oa00vY05d1uYnTEAAmE,https://i.scdn.co/image/ab67616d00001e027bcd3c...,2023-03-19 14:16:38+00:00
93,A$AP Rocky,"A$AP Forever REMIX (feat. Moby, T.I. & Kid Cudi)",3oHkMCVJyOcjg5FhfLc2Rq,https://i.scdn.co/image/ab67616d00001e029feadc...,2023-03-12 20:33:26+00:00
92,Kodak Black,Transportin',1WIZiOuNO3woKfdlSK2gNn,https://i.scdn.co/image/ab67616d00001e02583ce9...,2023-03-12 20:32:42+00:00


# Vectorize Playlist

In [27]:
def create_playlist_vector(tracks_feature_set, refined_playlist, recency_bias=1.2):
    """
    Vectorizes a user playlist by summarizing the playlist dataframe
    into a single vector. 
    Tunable parameters: Recency Bias
    Parameters:
    - tracks_feature_set: Full Feature set of each/all songs in dataset
    - refined_playlist: Refined playlist dataframe (tracks that are in dataset)
    - recency_bias: Weight value for how much to emphasize more recently added songs
    Returns:
    - playlist_vector_weighted_final: Feature Vector summarizing playlist
    - complete_feature_set_nonplaylist: Dataframe where each row is a feature vector for each track not in playlist in dataset
    """
    feature_set_playlist = tracks_feature_set[tracks_feature_set['id'].isin(refined_playlist['id'].values)]
    feature_set_playlist = feature_set_playlist.merge(refined_playlist[['id','date_added']], on = 'id', how = 'inner')
    complete_feature_set_nonplaylist = tracks_feature_set[~tracks_feature_set['id'].isin(refined_playlist['id'].values)]

    playlist_vector = feature_set_playlist.sort_values('date_added',ascending=False)
    most_recent_date = playlist_vector.iloc[0,-1]

    for ix, row in playlist_vector.iterrows():
        playlist_vector.loc[ix,'months_back'] = int((most_recent_date.to_pydatetime() - row.iloc[-1].to_pydatetime()).days / 30)
        
    playlist_vector['weight'] = playlist_vector['months_back'].apply(lambda x: recency_bias ** (-x))
    
    playlist_vector_weighted = playlist_vector.copy()
    playlist_vector_weighted.update(playlist_vector_weighted.iloc[:,:-4].mul(playlist_vector_weighted.weight,0))
    playlist_vector_weighted_final = playlist_vector_weighted.iloc[:, :-4]
    
    return playlist_vector_weighted_final.sum(axis = 0), complete_feature_set_nonplaylist
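
The recency weighting in `create_playlist_vector` is `recency_bias ** (-months_back)`, so a track's contribution decays geometrically with its age relative to the newest track. A minimal sketch of the effect, assuming a hypothetical two-track playlist with made-up feature values (`recency_weight` is an illustrative helper, not part of the notebook):

```python
import pandas as pd

def recency_weight(months_back, recency_bias=1.2):
    """Weight for a track added `months_back` months before the newest track."""
    return recency_bias ** (-months_back)

# Hypothetical playlist: one track added this month, one added six months earlier
toy = pd.DataFrame(
    {"energy": [0.8, 0.2], "valence": [0.6, 0.4], "months_back": [0, 6]}
)
toy["weight"] = toy["months_back"].apply(recency_weight)

# Scale each row by its weight, then sum the rows into one playlist vector
playlist_vec = toy[["energy", "valence"]].mul(toy["weight"], axis=0).sum(axis=0)
print(toy["weight"].round(3).tolist())
print(playlist_vec)
```

With the default `recency_bias=1.2`, a track added six months before the newest one counts for roughly a third as much; setting `recency_bias=1.0` would weight all tracks equally.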

In [28]:
playlist_feature_vector_ro, nonplaylist_features_ro = create_playlist_vector(tracks_feature_set, refined_playlist_ro)
nonplaylist_features_ro

Unnamed: 0,A Capella,Alternative,Anime,Blues,Children's Music,Children’s Music,Classical,Comedy,Country,Dance,...,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,id
0,0,0,0,0,0,0,0,0,0,0,...,0.071258,0.003033,0.182182,0.000000,0.067923,0.180171,0.006414,0.128541,0.1628,0BRjO6ga9RKCKjfDqeFgWV
1,0,0,0,0,0,0,0,0,0,0,...,0.114387,0.004406,0.147546,0.000000,0.028542,0.166894,0.013675,0.135160,0.1632,0BjC1NfoEOOusryehmNudP
2,0,0,0,0,0,0,0,0,0,0,...,0.130050,0.005594,0.026223,0.000000,0.018848,0.137286,0.002964,0.065036,0.0736,0CoSDzoNIKCRs124s9uTVy
3,0,0,0,0,0,0,0,0,0,0,...,0.039288,0.004949,0.065263,0.000000,0.017939,0.143339,0.003662,0.133048,0.0454,0Gc6TVm52BwZD07Ki6tIvf
4,0,0,0,0,0,0,0,0,0,0,...,0.058813,0.002428,0.045042,0.024625,0.038842,0.111411,0.004953,0.103703,0.0780,0IuslXpMROHdEPvSl1fTQK
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
232720,0,0,0,0,0,0,0,0,0,0,...,0.135200,0.011227,0.142942,0.108909,0.015112,0.148862,0.001990,0.080144,0.1924,2XGLdVl7lGeq8ksM6Al7jT
232721,0,0,0,0,0,0,0,0,0,0,...,0.156228,0.009645,0.136735,0.000176,0.045910,0.161965,0.002434,0.078533,0.1938,1qWZdkBl4UVPj9lK6HuuFM
232722,0,0,0,0,0,0,0,0,0,0,...,0.098723,0.005474,0.083882,0.000000,0.017132,0.157204,0.026630,0.050588,0.1626,2ziWXUmQLrXTiYjCg2fZ2t
232723,0,0,0,0,0,0,0,0,0,0,...,0.147645,0.007478,0.140940,0.000000,0.065297,0.161278,0.026207,0.065547,0.0978,6EFsue2YbIG4Qkq8Zr9Rir


# Generate Recommendations

In [29]:
def generate_recommendations(all_tracks_df, playlist_vector, all_tracks_features, recommend_amt=10):
    """
    Generate recommendations based on the playlist vector, using
    the all_tracks_features.
    Parameters:
    - all_tracks_df: All tracks and info in the dataset
    - playlist_vector: Feature Vector summarizing playlist
    - all_tracks_features: All features for each track not in playlist but in dataset
    - recommend_amt: Number of tracks to recommend (default 10)
    Returns:
    - recs: the recommend_amt tracks most similar to the playlist, with album image url
    """
    # Copy so that adding the 'sim' column does not write into a slice of all_tracks_df
    non_playlist_df = all_tracks_df[all_tracks_df['track_id'].isin(all_tracks_features['id'].values)].copy()
    non_playlist_df['sim'] = cosine_similarity(all_tracks_features.drop('id', axis = 1).values, playlist_vector.values.reshape(1, -1))[:,0]
    recs = non_playlist_df.sort_values('sim',ascending = False).head(recommend_amt)
    recs['url'] = recs['track_id'].apply(lambda x: sp.track(x)['album']['images'][1]['url'])

    return recs
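
The shape handling in the `cosine_similarity` call above can be seen on a small example. This is a sketch under assumed toy values (hypothetical ids `t1` to `t3` and two features): the candidate matrix is `(n_candidates, n_features)`, the playlist vector is reshaped to `(1, n_features)`, and the result is an `(n_candidates, 1)` column that `[:, 0]` flattens.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical candidate features and playlist vector, for illustration only
candidates = pd.DataFrame(
    {"energy": [0.9, 0.1, 0.5], "valence": [0.8, 0.2, 0.5], "id": ["t1", "t2", "t3"]}
)
playlist_vector = pd.Series({"energy": 1.0, "valence": 0.0})

# cosine_similarity expects two 2-D arrays and returns a matrix of pairwise similarities
sims = cosine_similarity(
    candidates.drop("id", axis=1).values,
    playlist_vector.values.reshape(1, -1),
)  # shape: (n_candidates, 1)

# Attach the similarity column and keep the most similar tracks
ranked = candidates.assign(sim=sims[:, 0]).sort_values("sim", ascending=False)
top_ids = ranked["id"].head(2).tolist()
print(top_ids)
```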

In [30]:
rotation_recs = generate_recommendations(dataset_df, playlist_feature_vector_ro, nonplaylist_features_ro)
rotation_recs

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,...,liveness,loudness,mode,speechiness,tempo,time_signature,valence,popularity_red,sim,url
114991,Rap,Dr. Dre,Xxplosive,0Ed7MeXx64f6OcIuoTRCg1,65,0.201,0.78,215200,0.882,1.1e-05,...,0.15,-4.368,Major,0.134,168.955,4/4,0.926,13,0.648807,https://i.scdn.co/image/ab67616d00001e029b19c1...
114699,Rap,D12,My Band,4XHQyvbrBsQaaBUW1VvmsL,69,0.497,0.851,298773,0.849,2e-06,...,0.116,-3.383,Minor,0.0828,120.014,4/4,0.844,13,0.647298,https://i.scdn.co/image/ab67616d00001e0260b9bc...
116510,Rap,Sean Kingston,Fire Burning,2oENJa1T33GJ0w8dC167G4,68,0.0192,0.839,239987,0.804,0.0,...,0.331,-2.513,Major,0.0329,122.973,4/4,0.888,13,0.646781,https://i.scdn.co/image/ab67616d00001e02ea3436...
114132,Rap,Mac Miller,Donald Trump,2e0PQjgRNMDKeaMH49tHnC,65,0.0339,0.676,164523,0.935,0.0,...,0.359,-4.97,Minor,0.176,162.996,4/4,0.809,13,0.646647,https://i.scdn.co/image/ab67616d00001e0212a65c...
115860,Rap,Lil Pump,Multi Millionaire (feat. Lil Uzi Vert),0zCKuE6FtcrH9PdZtCdyXP,67,0.0226,0.856,170959,0.82,0.0,...,0.168,-3.813,Major,0.0962,146.009,4/4,0.723,13,0.646241,https://i.scdn.co/image/ab67616d00001e02a89a23...
114483,Rap,Logic,Overnight,3s3VVLE1kB7Xk2AoJKlGmr,66,0.00189,0.868,217560,0.752,1.7e-05,...,0.142,-6.786,Major,0.0667,149.987,4/4,0.874,13,0.645409,https://i.scdn.co/image/ab67616d00001e02e19b1b...
116651,Rap,Ice Cube,Arrest The President,3Oj5f6XETShvp2xknJyGMf,65,0.0954,0.844,233720,0.85,1e-06,...,0.597,-2.882,Major,0.215,94.969,4/4,0.608,13,0.645371,https://i.scdn.co/image/ab67616d00001e02826cbd...
115537,Rap,Kendrick Lamar,i,7wdzLe2Gsx1RGqbvYZHASz,66,0.0196,0.761,231933,0.886,2e-06,...,0.236,-5.322,Major,0.0627,121.91,4/4,0.89,13,0.645334,https://i.scdn.co/image/ab67616d00001e026d89f3...
114760,Rap,Key Glock,Yea!!,1o8n563oEpZzCj4qTIJ0NM,68,0.000807,0.955,191213,0.691,0.0,...,0.138,-6.351,Major,0.188,129.988,4/4,0.744,13,0.644882,https://i.scdn.co/image/ab67616d00001e021ab3ee...
115428,Rap,Smooky MarGielaa,"Flight To Memphis (feat. Chris Brown, Juicy J ...",6R9nl7ucaIt4mMNscXP0Qq,68,0.0237,0.796,215690,0.718,0.0,...,0.105,-7.215,Minor,0.277,156.07,4/4,0.867,13,0.64462,https://i.scdn.co/image/ab67616d00001e0230202b...


# Full Pipeline

In [31]:
def recommend_tracks(playlist_id, spotify_dataset_df=dataset_df, recommend_amt=10):
    """
    Recommends tracks based on the playlist specified by playlist_id.
    Tracks are pulled from the spotify dataset specified.
    The number of tracks recommended is given by recommend_amt.
    Returns:
    - recs_dict: recommended tracks in dictionary format, {track_id : track_name}
    """
    refined_playlist = filter_user_playlist(playlist_id, spotify_dataset_df)
    dataset_features = create_feature_vectors(spotify_dataset_df)
    playlist_vector, remaining_dataset_features = create_playlist_vector(dataset_features, refined_playlist)
    recs = generate_recommendations(spotify_dataset_df, playlist_vector, remaining_dataset_features, recommend_amt)
    recs_dict= recs.set_index('track_id')['track_name'].to_dict()
    return recs_dict

In [32]:
chill_recs=recommend_tracks(playlists['chill'])
chill_recs

{'6ldwfK0yWgTAlmIfuQkTYN': "I'm a Slave 4 U",
 '2USyvcBpPjhW0rgiD2R8Bp': 'Mariposa Traicionera',
 '6gorwqDJ7bsdCHcVs5uS9u': 'HERO',
 '5fSDXbY8o9pA3TKwAbfwML': 'Mi primer millon',
 '6IY2y3kjjLaNbxW4GLiYQR': 'Ill Mind of Hopsin 8',
 '4ou5xyFUJX4VwX76tw1qb1': 'Fall',
 '0jx8zY5JQsS4YEQcfkoc5C': 'Angels (feat. Saba)',
 '3Yt9lRtS5V4nbJnwcgFgvC': 'You Da One',
 '4JViGq60SvqtQXI3WK0OLS': 'Oh My!',
 '3axkNosdVQLZiq1HakuGhc': 'Countdown'}