# Spotify Content Based Recommender System
The goal of this recommender is to look at tracks (in a playlist) and recommend similar new tracks, by only comparing what features the tracks in the user's playlist has compared to tracks from the dataset we use.

For playlists we choose a previously explored playlist from the Spotify million playlist dataset[0].

For track features we use a dataset from *kaggle*[1] that was recommended through a tutorial by *Tanmoy Ghosh*[2] we leaned heavily on along with insights from *Data Science from Scratch, 2nd Edition by Joel Grus*[3]

[0]https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge

[1]https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db

[2]https://www.section.io/engineering-education/building-spotify-recommendation-engine/#implementation

[3]https://www.oreilly.com/library/view/data-science-from/9781492041122/

## Datasets
### kaggle - SpotifyFeatures.csv

A .csv document with 26 genres and a total of 232,725 tracks with the following features:

| genre | artist_name    | track_name                  | track_id               | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode  | speechiness | tempo   | time_signature | valence |
|-------|----------------|-----------------------------|------------------------|------------|--------------|--------------|-------------|--------|------------------|-----|----------|----------|-------|-------------|---------|----------------|---------|
| Movie | Henri Salvador | C'est beau de faire un Show | 0BRjO6ga9RKCKjfDqeFgWV | 0          | 0.611        | 0.389        | 99373       | 0.91   | 0                | C#  | 0.346    | -1.828   | Major | 0.0525      | 166.969 | 04-Apr         | 0.814   |

### Spotify MPD - playlists.json
Json data used in other recommenders and exploration. For future elaboration, please see the explanation in the collaborative_based notebook.

## Content Based Filters
*Unlike collaborative methods that only rely on the user-item interactions, content-based approaches use additional information about users and/or items. They rely on the assumption that items with similar properties and features will be rated similarly. Determining which features and attributes are most predictive for a given user is the challenge.*

As we have our goals set already, Spotify recommendation and a dataset picked out for us we can easily pull from them to explain. As John Deer adds tracks to his playlist, we can draw a vector on all the features in his playlist, like tempo or danceability and then look for tracks with a close (a small of a difference as possible) cosine similarity as his playlist.

The advantage of the content-based filter is that John Deer doesn’t need to rely on others, a tracks features is simply raw data we gather from the track itself so even if John Deer was the first user in the world on Spotify, we could still recommend him tracks.


## Playlists and metadata
We'll get the same playlists and metadata that we explored in the earlier notebook

In [1]:
import json
import os
import pandas as pd
import numpy as np
import math
from typing import List, Tuple
from collections import defaultdict
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
import random

playlists = []
DATA_PATH = "mpd/data_samples"
REWRITE_DATA = True # Set to True to not use the stored pickled vectors 

num_files = len(os.listdir(DATA_PATH))

# Load in every json file from the mpd/data_samples directory
for filename in os.listdir(DATA_PATH):
    with open(os.path.join(DATA_PATH, filename), "rt", encoding="utf-8") as f:
        playlists.extend(json.load(f)["playlists"])

In [2]:
def get_top_attribute(playlists, attribute):
    '''
        goes through dicts and appends value for choosen attribute
        
            playlist[] = list of dicts with playlist from million playlist dataset
            attribute(string): 'track_name', 'album_name' or 'artist_name'
            
        returns a dict
    '''
    d = {}
    s = attribute.split('_',)
    if attribute == 'track_name':
        for playlist in playlists:
            for track in playlist['tracks']:
                if track[attribute] not in d:
                    d[track[attribute]] = {s[0]+'_popularity':1, 'artist_name':track['artist_name'], 'album_name':track['album_name']}
                else:
                    d[track[attribute]][s[0]+'_popularity'] += 1
    else:
        for playlist in playlists:
            for track in playlist['tracks']:
                if track[attribute] not in d:
                    d[track[attribute]] = 1
                else:
                    d[track[attribute]] += 1
            
    return d

In [3]:
# build dicts
artists = get_top_attribute(playlists, 'artist_name')
songs = get_top_attribute(playlists, 'track_name')
albums = get_top_attribute(playlists, 'album_name')
# make df from tracks
metadata = pd.DataFrame.from_dict(songs, orient='index')
# add album and artist scores
metadata['artist_popularity']=metadata['artist_name'].apply(lambda x: artists.get(x,0))
metadata['album_popularity']=metadata['album_name'].apply(lambda x: albums.get(x,0))
# reformat the df to be more readable (in our opinion)
metadata=metadata[['track_popularity', 'artist_name', 'artist_popularity', 'album_name', 'album_popularity']]

## Adding Spotify data
We found a dataset on kaggle with sufficent track features that we will use to build a content based recommender system.
This dataset has most of the key features we'd need to make content based decisions.

In [4]:
# read the data
spotify_data = pd.read_csv('SpotifyFeatures.csv')
# make a features dataframe
spotify_features_df = spotify_data

### OHE - One-Hot Encoding
    - Categorical data are variables that contain label values rather than numeric values.
    - https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
    
Since we can't use genres like Pop, Soul, Country or Track keys like C# or D into a numerical calculation we can turn them into numerical data using pandas.get_dummies and apply them later on as booelan columns in our feature dataframe.

Trying to encode track, album or artist names seems less important than genre or key for content feature (and we simply did not have the computer capacity take any larger messures either)

In [5]:
# genre and keys with OHE
genre_OHE = pd.get_dummies(spotify_features_df.genre)
key_OHE = pd.get_dummies(spotify_features_df.key)

In [6]:
# use sklearn to fit all data
scaled_features = MinMaxScaler().fit_transform([
  spotify_features_df['acousticness'].values,
  spotify_features_df['danceability'].values,
  spotify_features_df['duration_ms'].values,
  spotify_features_df['energy'].values,
  spotify_features_df['instrumentalness'].values,
  spotify_features_df['liveness'].values,
  spotify_features_df['loudness'].values,
  spotify_features_df['speechiness'].values,
  spotify_features_df['tempo'].values,
  spotify_features_df['valence'].values,
  ])
# put them into the feature dataframe
spotify_features_df[['acousticness','danceability','duration_ms','energy','instrumentalness','liveness','loudness','speechiness','tempo','valence']] = scaled_features.T
# Add the OHE data we made earlier
spotify_features_df = spotify_features_df.join(genre_OHE)
spotify_features_df = spotify_features_df.join(key_OHE)

## Combine into metadata
We can join the new features on as many tracks as possible in our metadata so that we have one large set with everything.

In [7]:
# we can join metadata and spotify_features on track_name
metadata.index = metadata.index.set_names(['track_name'])
# we don't want duouble columns thou, lets drop some redundant columns from the feature
spotify_features_df = spotify_features_df.drop('artist_name', axis = 1)

In [8]:
metadata = pd.merge(metadata, spotify_features_df, how='inner', on=['track_name'])

In [9]:
# remove duplicates, plenty of tracks get joined from multiple albums
metadata = metadata.drop_duplicates(['track_name'])
# remove old index and put a new one
metadata = metadata.reset_index(drop=True)

In [10]:
# now that we have merged the large metadata, lets make the spotify_features_df into purley features that we use in a recommender
spotify_features_df = spotify_features_df.drop('genre',axis = 1)
spotify_features_df = spotify_features_df.drop('track_name', axis = 1)
spotify_features_df = spotify_features_df.drop('popularity',axis = 1)
spotify_features_df = spotify_features_df.drop('key', axis = 1)
spotify_features_df = spotify_features_df.drop('mode', axis = 1)
spotify_features_df = spotify_features_df.drop('time_signature', axis = 1)

## The user
We can create a comprehensive user dataframe with all features from their playlist by merging with the metadata.

As with our previous exploration we'll pick playlist **42** to keep things as decirnable as possible.

In [11]:
#creating the playlist dataframe with extended features using Spotify data
def generate_user_features_df(user_playlist, spotify_data):
    
    df = pd.DataFrame()

    for i, j in enumerate(user_playlist['tracks']):
        df.loc[i, 'track_name'] = j['track_name']
    
    # merge features from metadata
    df = pd.merge(df, metadata, how='inner', on=['track_name'])
    # drop columns not used for making vectors (i.e not numeric)
    df = df.drop(columns=['track_name', 'track_popularity', 'artist_name', 'artist_popularity', 'album_name', 'album_popularity'])

    return df

In [12]:
user_features_df = generate_user_features_df(playlists[42], metadata)

### Non user songs
Lets get the tracks not in our users playlists (we don't want to recommend tracks the user already has in their playlist afterall)

In [13]:
def create_non_user_playlist(spotify_features, user_features):
    # songs not in the users playlist
    spotify_features_non_user_playlist = spotify_features[~spotify_features['track_id'].isin(user_features['track_id'].values)]
    
    return spotify_features_non_user_playlist

### Vectors
We can summirize our users playlist into 1 nice vector that later will identify similar tracks from our non user tracks

In [14]:
non_user_playlist_df = create_non_user_playlist(spotify_features_df, user_features_df)

In [15]:
def create_playlist_vector(spotify_features, user_features):
    
    # make vectors for all songs
    user_features_playlist = spotify_features[spotify_features['track_id'].isin(user_features['track_id'].values)]
    

    return user_features_playlist.sum(axis = 0)

In [16]:
user_playlist_vector = create_playlist_vector(spotify_features_df, user_features_df)

### Recommender
Finally the recommender, comparing cosine_similarity for our vectors in different tracks.

In [17]:
def generate_recommendation(spotify_data, user_playlist_vector, non_user_playlist_df):
    
    # rather than making loads of copies to workaround a warning, we'll suppress this 1 warning
    pd.options.mode.chained_assignment = None
    # Make the not in user playlist
    recommendations = spotify_data[spotify_data['track_id'].isin(non_user_playlist_df['track_id'].values)]
    # combine all columns (but track_id) and check costine similarity between the users songs and songs not in the users playlist
    recommendations['sim'] = cosine_similarity(non_user_playlist_df.drop(['track_id'], axis = 1).values, user_playlist_vector.drop(labels = 'track_id').values.reshape(1, -1))[:,0]
    # sort the not in user playlist by cosine similariy
    recommendations = recommendations.sort_values('sim',ascending = False)
    
    return  recommendations

In [18]:
recommend = generate_recommendation(spotify_data, user_playlist_vector, non_user_playlist_df)  

In [19]:
# top reommendations for our user
recommend.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,sim
150461,Pop,BROCKHAMPTON,THUG LIFE,2c09gumRCmu3qmbcqdQbUN,62,5.3e-05,5.5e-05,1.0,5.6e-05,5e-05,C,5.1e-05,0.0,Major,5.1e-05,0.001351,4/4,5.3e-05,0.703883
151833,Pop,JID,EdEddnEddy,4vmqU3xlzuhxtaeXrtAX1F,62,7.1e-05,7.5e-05,1.0,7.3e-05,7e-05,C,7.3e-05,0.0,Major,7.3e-05,0.001194,4/4,7.2e-05,0.703883
150955,Pop,DaniLeigh,All I Know,4dxM44QcHvrbTqCtBsNprG,60,5e-05,5.2e-05,1.0,5.4e-05,5e-05,C,5.2e-05,0.0,Minor,5.3e-05,0.001196,4/4,5.4e-05,0.703883
112919,Pop,YBN Cordae,Target,13K3BqdOYYMepkNQPWL1DZ,66,4.9e-05,5.5e-05,1.0,5.4e-05,4.9e-05,C,5e-05,0.0,Major,5.2e-05,0.001189,4/4,5.2e-05,0.703883
150818,Pop,Logic,State Of Emergency,58cvckETahSOG74RP5WE99,62,5e-05,5.4e-05,1.0,5.4e-05,4.9e-05,C,5e-05,0.0,Major,5.1e-05,0.001176,4/4,5.4e-05,0.703883
