<H1 align = "center">Spotify Recommendation System</H1>
<H2 align = "center">
<img src="https://developer.spotify.com/assets/branding-guidelines/logo.png" width="300" height="200" align="middle">
    </H2>
 <H3 align = "center">by Randy Williams</H3>


# <H2 align = "center">Introduction</H2>

#### The purpose of this notebook is to demonstrate my implementation of a Python-based recommendation system using a similarity algorithm. The general outline of the process is:

- Define and establish client credentials with your Spotify developer account
- Define a source playlist of the user's favorites and a search playlist to examine for recommendations
- Build functions to extract audio features and genre from a playlist
- Define a function for creating a similarity score (we will use cosine similarity)
- Define a function for returning a DataFrame of recommendations
- Compare recommendations when one-hot encoding genre versus using only audio features
- Simplify the process by building a pipeline function




# <H3 align = "center">1. Setup libraries and client and playlist variables</H3>

#### To run this notebook you need to set up a free Spotify developer account and obtain your client_id and client_secret. If you have not already done so:

- Register as a developer at developer.spotify.com
- Create an app, click on edit settings, then obtain your Client ID and Client Secret.

In [None]:
# install the spotify API if needed
!pip install spotipy

In [1]:
# load the libraries needed for the recommendation system
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth # not used in this notebook, but it allows username authentication
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel # linear_kernel backs the "linear" similarity option
from sklearn.preprocessing import OneHotEncoder


### *NOTE!*
The client_id and client_secret below were temporary IDs created to make this notebook. The developer app behind these codes has since been changed, so you must replace them with your own for the notebook to function. Remember, never share your credentials!
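One way to keep credentials out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: `load_credentials` is a hypothetical helper, and it assumes the variables are named `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`, which are the names spotipy looks for by default.

```python
import os

def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables."""
    env = os.environ if env is None else env
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET before running the notebook"
        )
    return client_id, client_secret

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
demo_id, demo_secret = load_credentials(demo_env)
```

The returned pair could then be passed to `SpotifyClientCredentials` in place of the hard-coded strings.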

In [None]:
# Define environment variables
client_id = '1101196291e049e6952dd1bc5a0168f9' #replace with your client ID
client_secret = '84825c00e5ff47d0af375decc567860a'#replace with your client secret
playlist_personal = '73foPknywpV4l8EdymN68r' #customize to your playlists, this is my test playlist
playlist_compare = '4LZtDy62wDvQ4o8JB4UrcR' # Customize to the playlist you want to compare, this is the BB top 200


In [None]:
# authenticate the user without the username, this method allows the user to only read data from spotify
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)


# <H2 align = "center">2. Define the functions</H2>

- First, a function to extract the playlist data, with audio features, into a pandas DataFrame. To include genre in the returned DataFrame, specify genre=True.
- Second, a function that creates a similarity matrix between the two playlists.
- Third, a function for one-hot encoding.
- Fourth, a function for making recommendations using the similarity matrix.
- Finally, a simple function that pipelines the playlists with a single command.

##### Portions of the original source code for these functions, before my modifications and additions, can be found at Towards Data Science from Merlin Shaefer at https://towardsdatascience.com/using-python-to-refine-your-spotify-recommendations-6dc08bcf408e

<H3 align = "center">Spotify genre data</H3>

##### An important thing to note is how Spotify tags genres. In this notebook the genres come from a track's main artist, and each artist can carry several genre tags, so the genre data we extract varies in length. Additionally, there is a large number of genres that can be tagged, and the distinction between genre types can be subjective. We are also concerned about cardinality: since one-hot encoding creates a column for each distinct value, we could easily add a large number of columns with minimal benefit. I will explore the cardinality of genre further in the Cardinality section. For this function, we extract only the first genre listed. I consider it the most relevant because the first genre listed is the genre the artist generally fits into.
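To make the "first genre only" choice concrete, here is a small sketch on made-up genre lists. `first_genre` and the `"unknown"` fallback for artists with no genre tags are assumptions for illustration, not part of the Spotify API.

```python
def first_genre(artist_genres, default="unknown"):
    """Return the first genre in the list, or a default when the list is empty."""
    return artist_genres[0] if artist_genres else default

# Made-up genre lists, one per track (the second artist has no genre tags)
genre_lists = [
    ["dance pop", "pop"],
    [],
    ["classic rock", "rock", "british invasion"],
]

first_genres = [first_genre(genres) for genres in genre_lists]

# Compare how many one-hot columns each choice would create
columns_first_only = len(set(first_genres))
columns_all_tags = len({tag for genres in genre_lists for tag in genres})
print(columns_first_only, "columns with first genre only,", columns_all_tags, "with every tag")
```

Keeping one genre per track bounds the number of encoded columns by the number of tracks, at the cost of discarding the secondary tags.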

In [None]:
# Define a function for extracting and processing playlists
def feature_extract(plist1,genre=False):
    """ 
    Extracts the audio features of a Spotify playlist into a pandas DataFrame.
    The returned columns are
    'danceability',  'energy', 'key', 'loudness', 'mode', 'speechiness', 
    'acousticness', 'instrumentalness', 'liveness',
    'valence', 'tempo', 'type', 'id', 'uri', 'track_href',
    'analysis_url', 'duration_ms', 'time_signature' and 'track_names'.
    If genre=True is specified then 'genre' is also extracted.
    """
    playlist_link = "https://open.spotify.com/playlist/"+plist1
    playlist_URI = playlist_link.split("/")[-1].split("?")[0]
    # Note: playlist_tracks returns a single page of results (at most 100 tracks per call)
    
    #Define the Playlist variable
    p_list=pd.DataFrame(columns=['danceability',  'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
    'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'])
    
    
    #Extract data from the selected playlist
    # Note: I left the original code for extracting track, artist and album information in place.
    # In this implementation, however, only the audio features (plus track name and genre) are returned.
    # This code could be further modified to return more data
    
    track_list=[]
    track_genre=[]
    for track in sp.playlist_tracks(playlist_URI)["items"]:
         #URI
        track_uri = track["track"]["uri"]
    
        #Track name
        track_name = track["track"]["name"]
        track_list.append(track_name)
        
         #Main Artist
        artist_uri = track["track"]["artists"][0]["uri"]
        artist_info = sp.artist(artist_uri)
    
        #Name, popularity, genre
        artist_name = track["track"]["artists"][0]["name"]
        artist_pop = artist_info["popularity"]
        artist_genres = artist_info["genres"]
        if genre:
            track_genre.append(artist_genres)
        #Album
        album = track["track"]["album"]["name"]
    
        #Popularity of the track
        track_pop = track["track"]["popularity"]
        
        #Audio features - this will be extracted
        temp_list=sp.audio_features(track_uri)
        my_favs_temp=pd.DataFrame(temp_list, columns=['danceability',  'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
        'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'])
        p_list=pd.concat([my_favs_temp, p_list], axis=0) # add together each frame per iteration
        p_list.reset_index(drop=True, inplace=True)
        
    #Create the track list
    
    track_names=pd.DataFrame(track_list, columns=['track_name'])
    track_names=track_names[::-1].reset_index(drop=True)
    
    #Create the genre list if applicable
    if genre:
        # keep only the first genre per track; artists with no genre tags get 'unknown'
        track_genere2= [item[0] if item else 'unknown' for item in track_genre]
        genre_names=pd.DataFrame(track_genere2, columns=['genre'])
        genre_names=genre_names[::-1].reset_index(drop=True) #reverse the list order to match p_list
        
    #Add columns to the returned DataFrame
    p_list['track_names']=track_names['track_name']
    #If genres were requested, add them as a column (one-hot encoding happens later in genre_encode)
    if genre:
        p_list['genre']=genre_names['genre']
    return p_list
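One limitation of the extraction above is that a single `playlist_tracks` call returns one page of results, so a playlist longer than one page is only partly read. The sketch below shows one possible way to follow the pages using the client's `next` method. `fetch_all_playlist_items` is a hypothetical helper, and `_DemoClient` is a stand-in used only so the example runs without network access.

```python
def fetch_all_playlist_items(client, playlist_id):
    """Collect every item in a playlist by following the paged results."""
    page = client.playlist_tracks(playlist_id)
    items = list(page["items"])
    # Each page carries a "next" link until the last page is reached
    while page.get("next"):
        page = client.next(page)
        items.extend(page["items"])
    return items

class _DemoClient:
    """Stand-in for a spotipy client: serves two fixed pages."""
    def __init__(self):
        self._pages = [
            {"items": ["track 1", "track 2"], "next": "page-2"},
            {"items": ["track 3"], "next": None},
        ]
    def playlist_tracks(self, playlist_id):
        return self._pages[0]
    def next(self, page):
        return self._pages[1]

all_items = fetch_all_playlist_items(_DemoClient(), "demo-playlist")
```

With a real client, the same helper would be called as `fetch_all_playlist_items(sp, playlist_URI)`.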

<H3 align = "center">Similarity function</H3>

- This function uses the cosine similarity function from scikit-learn. Other similarity algorithms could be substituted and experimented with.
- Before the data is passed into the similarity algorithm, it is standardized with StandardScaler.


<H3 align = "center">For more information on the mathematics of similarity algorithms</H3>
<H4 align = "center">
<div class="ephox-summary-card" style="max-width: 500px;" align="center" data-ephox-embed-iri="https://medium.com/@sasi24/cosine-similarity-vs-euclidean-distance-e5d9a9375fc8"><a class="ephox-summary-card-link-thumbnail" href="https://medium.com/@sasi24/cosine-similarity-vs-euclidean-distance-e5d9a9375fc8"> <img class="ephox-summary-card-thumbnail" src="https://miro.medium.com/max/1400/0*MWuD1-9QA7wuPYHU" /> </a> <a class="ephox-summary-card-link" href="https://medium.com/@sasi24/cosine-similarity-vs-euclidean-distance-e5d9a9375fc8"> <span class="ephox-summary-card-title">Cosine Similarity Vs Euclidean Distance</span>  <span class="ephox-summary-card-author">Vijaya Sasidhar Nagella</span>  </a></div>
</H4>
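As a quick illustration of the measure, the cosine similarity of two vectors $a$ and $b$ is $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$: it compares direction and ignores length. The sketch below computes it by hand with NumPy on made-up two-feature "tracks"; `cosine_sim_matrix` is a hypothetical helper written for this example and assumes no row is all zeros.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale each row to unit length, then take all pairwise dot products
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

favorites = np.array([[1.0, 0.0], [1.0, 1.0]])   # two made-up tracks
candidates = np.array([[2.0, 0.0], [0.0, 3.0]])  # two made-up candidates

sims = cosine_sim_matrix(favorites, candidates)
print(sims.round(3))
```

The first favorite points in the same direction as the first candidate, so that pair scores 1 even though the candidate vector is twice as long.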

In [None]:
def create_similarity_score(df1,df2,similarity_score = "cosine_sim"):
    """ 
    Creates a similarity matrix for the audio features of two DataFrames.

    Parameters
    ----------
    df1 : DataFrame containing danceability, energy, key, loudness, mode, speechiness, acousticness,
        instrumentalness, liveness, valence, tempo, id, track_names and, if specified, one-hot encoded genre
    df2 : DataFrame with the same columns as df1
    similarity_score : similarity measure, either "linear" or "cosine_sim"

    Returns
    -------
    A matrix of similarity scores with one row per track in df1 and one column per track in df2.
    """

    # every column except the identifiers is a feature
    features = [col for col in df1.columns if col not in ('id', 'track_names')]
    # plain float arrays avoid scaler errors when column names mix strings and integers
    df_features1 = df1[features].to_numpy(dtype=float)
    df_features2 = df2[features].to_numpy(dtype=float)
    # fit one scaler on both playlists so they are standardized on the same scale
    scaler = StandardScaler().fit(np.vstack([df_features1, df_features2]))
    df_features_scaled1 = scaler.transform(df_features1)
    df_features_scaled2 = scaler.transform(df_features2)
    if similarity_score == "linear":
        return linear_kernel(df_features_scaled1, df_features_scaled2)
    elif similarity_score == "cosine_sim":
        return cosine_similarity(df_features_scaled1, df_features_scaled2)
    raise ValueError("similarity_score must be 'linear' or 'cosine_sim'")


<H3 align = "center">Hot Encoding Function</H3>

- If genre is selected, one-hot encoding is required to turn the genre labels into numerical columns
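When two playlists are encoded separately, the same column position can end up standing for a different genre in each frame. Here is a minimal sketch of one way to keep the columns aligned, using pandas `get_dummies` on toy data rather than the `OneHotEncoder` used in this notebook. The frames and genre names are made up for illustration.

```python
import pandas as pd

favorites = pd.DataFrame({"genre": ["pop", "rock"]})
candidates = pd.DataFrame({"genre": ["rock", "country", "pop"]})

# Stack both playlists, encode once, then split them apart again by key
combined = pd.concat([favorites, candidates], keys=["fav", "cand"])
encoded = pd.get_dummies(combined["genre"], prefix="genre", dtype=float)

fav_encoded = encoded.loc["fav"]
cand_encoded = encoded.loc["cand"]

# Both frames now share the same genre columns in the same order
print(list(fav_encoded.columns))
```

Because the encoding is fitted once on the stacked data, a genre that appears in only one playlist still gets a column (filled with zeros) in the other.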


In [None]:
def genre_encode(plist):
    """
    Takes a feature-extracted DataFrame with a genre column and one-hot encodes it with
    the OneHotEncoder class
    
    Returns a DataFrame of columns containing the numerical data
    """
    
    # initiate the encoder; its default sparse output is converted with toarray() below,
    # which works whether the installed scikit-learn names the option sparse or sparse_output
    OH_encoder=OneHotEncoder(handle_unknown='ignore')
    #Distill the frame to the genre column
    Genre_List=plist['genre']
    Genre_Reshape=Genre_List.values.reshape(-1,1) #This is important because the encoder expects 2-D input.
    OH_genre=pd.DataFrame(OH_encoder.fit_transform(Genre_Reshape).toarray())
    OH_genre.index = Genre_List.index #re-index the extracted encoding
    return OH_genre # return the one-hot encoded columns

<H3 align = "center">Recommendation Function</H3>

- This function returns a DataFrame of recommendations based on the similarity scores
- This function trims the results to track name and track ID
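The selection step can be pictured on a toy similarity matrix: for each favorite track (a row) pick the candidate (a column) with the highest score, then drop repeated picks. The numbers and song names below are made up, and this is only a sketch of the idea rather than the function itself.

```python
import numpy as np

# Rows are favorite tracks, columns are candidate tracks (made-up scores)
similarity = np.array([
    [0.10, 0.90, 0.30],
    [0.20, 0.80, 0.40],
    [0.70, 0.10, 0.60],
])
candidate_names = ["Song A", "Song B", "Song C"]

# Index of the best-scoring candidate for each favorite
best_idx = similarity.argmax(axis=1)

# Remove repeated picks while keeping first-seen order
unique_idx = list(dict.fromkeys(best_idx.tolist()))
picks = [candidate_names[i] for i in unique_idx]
print(picks)
```

Two favorites choosing the same candidate yield a single recommendation, which is why the result can be shorter than the personal playlist.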

In [None]:
def recommend_tracks(plist1,plist2, genre=False): 
    """
    Takes the processed DataFrames from the feature_extract function, reduces them to the
    numerical columns, and then feeds the numerical frames into the similarity function.
    
    Returns
    A DataFrame of recommendations with track name and id

    """
    # if we added genres then we need to one-hot encode them and combine them with the audio features
    # we also need to drop the features we are not using
    if genre:
        # encode both playlists together so each encoded column stands for the same genre in both frames
        combined_genres = pd.concat([plist1[['genre']], plist2[['genre']]], keys=['personal', 'compare'])
        oh_all = genre_encode(combined_genres)
        oh_all.columns = ['genre_' + str(col) for col in oh_all.columns] # string labels keep column names one type
        oh_list1, oh_list2 = oh_all.loc['personal'], oh_all.loc['compare']
        plist1=plist1.drop(['type','uri','track_href','analysis_url','duration_ms', 'time_signature','genre'],axis=1)
        plist2=plist2.drop(['type','uri','track_href','analysis_url','duration_ms', 'time_signature','genre'],axis=1)
        Track_Input = pd.concat([oh_list1, plist1], axis=1)
        Track_Input_compare = pd.concat([oh_list2, plist2], axis=1)
    else:
        Track_Input=plist1.drop(['type','uri','track_href','analysis_url','duration_ms', 'time_signature'],axis=1)
        Track_Input_compare=plist2.drop(['type','uri','track_href','analysis_url','duration_ms', 'time_signature'],axis=1)
        
    #create similarity scoring between playlist and recommendations
    similarity_score = create_similarity_score(Track_Input,Track_Input_compare)
    
    #for each track in the personal playlist, take the most similar track from the comparison playlist
    final_recomms = Track_Input_compare.iloc[[np.argmax(i) for i in similarity_score]]
    final_recomms = final_recomms.drop_duplicates()
    
    #filter again so recommended tracks are not already in the personal playlist
    final_recomms = final_recomms[~final_recomms["id"].isin(Track_Input["id"])]
    final_recomms.reset_index(drop = True, inplace = True)
    #trim the results to id and track name
    final_recomms_return =final_recomms.loc[:, ['track_names','id']]
    return final_recomms_return
    

<H3 align = "center">Cardinality</H3>

Before I work on making the pipeline and getting recommendations, I want to look at the cardinality of the one-hot encoding to ensure that the size of the DataFrame isn't excessive when encoding genres.

In [None]:
my_favorites=feature_extract(playlist_personal, genre=True) #create a dataframe from my playlist
my_favorites['genre'].unique() # examine the number of genres

In [None]:
# the same for the comparison playlist
bb_200=feature_extract(playlist_compare, genre=True) #create a dataframe from the comparison playlist
bb_200['genre'].unique() # examine the number of genres

My playlist has 10 unique genres and the BB top 200 has 23 unique genres. This is probably not an excessive number of additional columns, so next we will get the recommendations.


# <H2 align = "center">3. Define the pipeline function</H2>

#### The processing sequence is as follows
- Extract the playlists into DataFrames
- Process the playlists for recommendations using the similarity function, with one-hot encoding if applicable
- Return the recommendations as a DataFrame

The functions were built so it would be easy to make a pipeline function and get the final results with a single command.

In [None]:
def Spotify_AI(playlist_personal,playlist_compare, genre_in=False):
    my_favorites=feature_extract(playlist_personal, genre=genre_in) #my favorite as a dataframe
    comparison_list=feature_extract(playlist_compare, genre=genre_in) #Comparison playlist as a dataframe
    results=recommend_tracks(my_favorites,comparison_list,genre=genre_in)
    return results
    


# <H2 align = "center">4. Test run the algorithm</H2>

- First without genre
- Second with genre

In [None]:
recommendations=Spotify_AI(playlist_personal,playlist_compare,False)
recommendations.head()

Looking at the top 5 in the list, I would say that the comparison did a good job of matching the type of audio aesthetics I enjoy. It was tilted a bit towards hip hop, along with a strictly instrumental Christmas song and a country song. However, I would probably not add those to my playlist.

In [None]:
recommendations=Spotify_AI(playlist_personal,playlist_compare,True)
recommendations.head()

Wow, what a difference, at least in my subjective opinion! Blinding Lights, Come Together, and Happier Than Ever would all be additions to my list. It looks like incorporating genre resulted in better performance.

# <H2 align = "center">5. Final Thoughts</H2>

This system has no objective measure of its performance other than my opinion. Like a lot of recommendation systems, it is disadvantaged by a cold start before it improves. The best way to improve performance would be to significantly enlarge my playlist and to choose alternative comparison playlists that might be closer to the type of music I enjoy. I hope you enjoy tinkering with the process.

<H3 align = "center">Happy Music Hunting</H3>