# Building a Recommendation System

In this notebook we build a content-based recommendation system from the previously saved .csv file.

Importing libraries to be used.

In [15]:
!pip install spotipy
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from collections import defaultdict
from scipy.spatial.distance import cdist
import os

# Read the Spotify API credentials from environment variables instead of hardcoding them
cid = os.environ['SPOTIPY_CLIENT_ID']
secret = os.environ['SPOTIPY_CLIENT_SECRET']
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)



In [16]:
df = pd.read_csv('fourtet.csv')

The function below uses Spotipy to search for a track and return its metadata and audio features.

In [17]:
def find_song(name, artist):

    # Initialize an empty dictionary to store features and values
    song_data = defaultdict()
    
    # Using Spotipy search function for track and artist, returning None if cannot be found in Spotify
    # (field filters are written without a space after the colon)
    results = sp.search(q='track:{} artist:{}'.format(name, artist), limit=1)
    if not results['tracks']['items']:
        return None

    # Isolating track information and ID from results
    results = results['tracks']['items'][0]
    track_id = results['id']
    
    # Obtaining audio features, returning None if Spotify has none for this track
    audio_features = sp.audio_features(track_id)[0]
    if audio_features is None:
        return None
    
    # Preparing columns and converting to DataFrame
    song_data['name'] = [name]
    song_data['artist'] = [artist]
    song_data['explicit'] = [int(results['explicit'])]
    song_data['duration_ms'] = [results['duration_ms']]
    song_data['popularity'] = [results['popularity']]
    
    for key, value in audio_features.items():
        song_data[key] = value
    
    return pd.DataFrame(song_data)
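
The last step of `find_song` relies on pandas broadcasting scalar values against length-1 lists. A minimal sketch of that behaviour, using made-up values in place of a real search result and audio-features response (no Spotify call is made here):

```python
import pandas as pd

# Made-up values standing in for a search result and an audio-features response
song_data = {
    'name': ['Example Track'],
    'artist': ['Example Artist'],
    'explicit': [0],
}
audio_features = {'danceability': 0.5, 'energy': 0.7, 'tempo': 120.0}

# Scalars are broadcast against the length-1 lists, giving a single row
for key, value in audio_features.items():
    song_data[key] = value

row = pd.DataFrame(song_data)
print(row.shape)
```

The result is a one-row DataFrame, which is why `find_song` returns a DataFrame rather than a Series.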

We can also look a song up in the dataset first and fall back to the previous function if it is not there.

In [18]:
def get_song_data(song, spotify_data):

    # Try to find the track by name and artist in the dataset and return its row
    try:
        song_data = spotify_data[(spotify_data['track_name'] == song['name']) 
                                & (spotify_data['artist_name'] == song['artist'])].iloc[0]
        return song_data
    
    except IndexError:
        return find_song(song['name'], song['artist'])
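
The lookup-then-fallback pattern can be tried without any API access. The sketch below is an illustration only: the tiny dataset, the `lookup` helper, and the `not_found` fallback are hypothetical stand-ins, and it checks for an empty match explicitly instead of catching `IndexError`.

```python
import pandas as pd

# Tiny made-up dataset using the same column names as the notebook
data = pd.DataFrame({
    'track_name': ['Song A', 'Song B'],
    'artist_name': ['Artist 1', 'Artist 2'],
    'tempo': [100.0, 140.0],
})

def lookup(song, data, fallback):
    """Return the first matching row, or call fallback when nothing matches."""
    mask = (data['track_name'] == song['name']) & (data['artist_name'] == song['artist'])
    matches = data[mask]
    if matches.empty:
        return fallback(song)
    return matches.iloc[0]

# Hypothetical fallback standing in for the Spotify search
def not_found(song):
    return None

hit = lookup({'name': 'Song B', 'artist': 'Artist 2'}, data, not_found)
miss = lookup({'name': 'Song C', 'artist': 'Artist 3'}, data, not_found)
print(hit['tempo'], miss)
```

Note that a dataset hit is a Series while the Spotify fallback in this notebook returns a one-row DataFrame, so downstream code has to handle both shapes.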

From a list of songs, we can calculate the mean vector of their audio features.

In [19]:
def get_mean_vector(song_list, spotify_data):

    # Initialize empty list to store vectors
    song_vectors = []

    # Identify audio features columns
    number_cols = ['valence', 'acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo']

    # Append list of values to list
    for song in song_list:
        song_data = get_song_data(song, spotify_data)
        if song_data is None:
            print('Warning: {} not found in Spotify or database'.format(song['name']))
            continue
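        # Flatten so dataset rows (Series) and Spotify results (one-row DataFrame) share one shape
        song_vector = np.asarray(song_data[number_cols], dtype=float).ravel()
        song_vectors.append(song_vector)
    
    # Convert to single array and return mean
    song_matrix = np.array(song_vectors)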
    return np.mean(song_matrix, axis=0)
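
To make the averaging step concrete, here is a minimal sketch on made-up numbers. The shortened `cols` list and both rows are assumptions for illustration; one row is a Series and the other a one-row DataFrame, mirroring the two shapes that a song lookup can return.

```python
import numpy as np
import pandas as pd

cols = ['valence', 'energy', 'tempo']  # shortened, hypothetical feature list

# A dataset row comes back as a Series, an API lookup as a one-row DataFrame
row_from_dataset = pd.Series({'valence': 0.2, 'energy': 0.4, 'tempo': 100.0})
row_from_api = pd.DataFrame({'valence': [0.6], 'energy': [0.8], 'tempo': [140.0]})

# ravel() flattens both to shape (n_features,), so they stack cleanly
vectors = [np.asarray(r[cols], dtype=float).ravel()
           for r in (row_from_dataset, row_from_api)]
song_matrix = np.vstack(vectors)

# Average down the rows to get one value per feature
mean_vector = song_matrix.mean(axis=0)
print(mean_vector)
```

Each entry of `mean_vector` is the average of one feature across the input songs.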

We are finally ready to build our recommender. This function will take in a list of songs and return n recommendations as set by the user.
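
Before the full function, the core ranking idea can be sketched on toy data: scale the catalogue, apply the same scaling to the centre of the input songs, and sort tracks by cosine distance. The four-track catalogue and the centre below are made-up values with two features each, not data from the .csv file.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.preprocessing import StandardScaler

# Made-up catalogue of four tracks with two features each (energy, tempo)
catalogue = np.array([
    [0.9, 150.0],
    [0.8, 140.0],
    [0.2, 80.0],
    [0.1, 70.0],
])
# Made-up centre of the listener's songs, in the same feature order
center = np.array([[0.85, 145.0]])

# Fit the scaler on the catalogue, then apply the same scaling to the centre
scaler = StandardScaler().fit(catalogue)
scaled_catalogue = scaler.transform(catalogue)
scaled_center = scaler.transform(center)

# Cosine distance from the centre to every track, smallest first
distances = cdist(scaled_center, scaled_catalogue, 'cosine')
top_two = np.argsort(distances[0])[:2]
print(sorted(top_two.tolist()))
```

Cosine distance compares direction rather than magnitude, so the scaler must be fitted on exactly the columns, in the same order, that are later passed to `transform`.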

In [32]:
def recommend_songs(song_list, spotify_data, n_songs=10):

    # Establishing metadata and numerical columns
    metadata_cols = ['track_name', 'artist_name']
    number_cols = ['valence', 'acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo']
    
    # Getting mean vector
    song_center = get_mean_vector(song_list, spotify_data)
    
    # Dropping extra columns if present
    spotify_data = spotify_data.drop(['popularity', 'Unnamed: 0'], axis=1, errors='ignore')
    
    # Using KMeans to cluster data on the audio feature columns, fitting and adding labels to dataset
    # (fitting on number_cols keeps the scaler's features aligned with the song center)
    X = spotify_data[number_cols]
    cluster_pipeline = Pipeline([('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=3, n_init=10))])
    cluster_pipeline.fit(X)
    cluster_labels = cluster_pipeline.predict(X)
    spotify_data['cluster'] = cluster_labels
    
    # Scaling and transforming numerical columns of data and reshaped song center
    scaler = cluster_pipeline.named_steps['scaler']
    scaled_data = scaler.transform(X)
    center_df = pd.DataFrame(np.asarray(song_center, dtype=float).reshape(1, -1), columns=number_cols)
    scaled_song_center = scaler.transform(center_df)
    
    # Computing cosine distance on transformed arrays
    distances = cdist(scaled_song_center, scaled_data, 'cosine')
    
    # Return sorted list of top n indices
    index = list(np.argsort(distances)[:, :n_songs][0])
    
    # Converting to DataFrame and returning track and artist name
    rec_songs = spotify_data.iloc[index]
    df_recs = pd.DataFrame(rec_songs[metadata_cols])
    return df_recs


In [33]:
# One dictionary per song; repeating keys inside a single dict would keep only the last song
song_list = [{'name': 'Cardigan', 'artist': 'Taylor Swift'},
             {'name': 'Love Ridden', 'artist': 'Fiona Apple'}]
rec_df = recommend_songs(song_list, df)
rec_df


We are now ready to deploy our recommendation system in Streamlit.