# Obtaining Genre Data from Spotify Songs


#### Notebook Overview

One major drawback of our dataset is that it does not include song genre. Genre can be a powerful categorical feature, as it often has a large impact on how Spotify playlists are organized.

To fill in this missing data, this notebook serves as a template for pulling genre data through the Spotify API.

#### Import Dependencies

In [1]:
import spotipy
import requests
from spotipy.oauth2 import SpotifyClientCredentials
import json
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from numpy import dot
from numpy.linalg import norm

In [2]:
spotify_keys = json.load(open('spotify_keys.json'))
client_id = spotify_keys['client_id']
client_secret = spotify_keys['client_secret']

In [3]:
client_credentials_manager = SpotifyClientCredentials(client_id=client_id,client_secret=client_secret)

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)


#### Display metadata returned from specific song

In [4]:
uri = "spotify:track:6M94FkXd15sOAOQYRnWPN8"

song = uri.split(":")[2]

results = sp.track(song)
print(results)

{'album': {'album_type': 'album', 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/2maQMqxNnlRrBrS1oAsrX9'}, 'href': 'https://api.spotify.com/v1/artists/2maQMqxNnlRrBrS1oAsrX9', 'id': '2maQMqxNnlRrBrS1oAsrX9', 'name': 'Francisco Canaro', 'type': 'artist', 'uri': 'spotify:artist:2maQMqxNnlRrBrS1oAsrX9'}], 'available_markets': ['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT', 'AU', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BN', 'BO', 'BR', 'BS', 'BT', 'BW', 'BY', 'BZ', 'CA', 'CD', 'CG', 'CH', 'CI', 'CL', 'CM', 'CO', 'CR', 'CV', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EE', 'EG', 'ES', 'FI', 'FJ', 'FM', 'FR', 'GA', 'GB', 'GD', 'GE', 'GH', 'GM', 'GN', 'GQ', 'GR', 'GT', 'GW', 'GY', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IS', 'IT', 'JM', 'JO', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KR', 'KW', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 'MD', 'ME', 'MG', 'MH', 'MK', 'ML'

#### Insights

Song metadata does not appear to contain song genre. However, artist metadata does include genres. This is not a perfect substitute, as an artist's genres may not match each of their songs exactly. For example, an artist may release an experimental album that deviates from their usual genre. Still, it should be a reasonable approximation.

#### Gather genres from artist

In [5]:
artist_id = results['artists'][0]['uri']
print(artist_id)

new_results = sp.artist(artist_id)
print(new_results['genres'])
print(new_results)

spotify:artist:2maQMqxNnlRrBrS1oAsrX9
['tango', 'vintage tango']
{'external_urls': {'spotify': 'https://open.spotify.com/artist/2maQMqxNnlRrBrS1oAsrX9'}, 'followers': {'href': None, 'total': 19976}, 'genres': ['tango', 'vintage tango'], 'href': 'https://api.spotify.com/v1/artists/2maQMqxNnlRrBrS1oAsrX9', 'id': '2maQMqxNnlRrBrS1oAsrX9', 'images': [{'height': 640, 'url': 'https://i.scdn.co/image/ab67616d0000b27370ba3dfe656bd07b54af00c3', 'width': 640}, {'height': 300, 'url': 'https://i.scdn.co/image/ab67616d00001e0270ba3dfe656bd07b54af00c3', 'width': 300}, {'height': 64, 'url': 'https://i.scdn.co/image/ab67616d0000485170ba3dfe656bd07b54af00c3', 'width': 64}], 'name': 'Francisco Canaro', 'popularity': 40, 'type': 'artist', 'uri': 'spotify:artist:2maQMqxNnlRrBrS1oAsrX9'}


#### Certain songs have multiple artists

Write a function that takes a single song, collects the uris of all its artists, and returns the genres associated with those artists.

In [6]:
def get_song_genres(song_uri):
    '''
    Given a uri for a single song, return the genres associated with the song
    
    Parameters:
        song_uri (str): uri for a single song
        
    Returns:
        song_genres (list): all genres associated with the artists for a specific song
    '''
    results = sp.track(song_uri)
    
    artist_ids = []
    for i in range(len(results['artists'])):
        artist_ids.append(results['artists'][i]['uri'])
    
    song_genres = return_artist_genres(artist_ids)
    
    return song_genres

Include a second function that returns the genres for a list of artists. The end result should be a single list of unique genres for each song.

In [7]:
def return_artist_genres(artist_uris):
    '''
    Helper function to return the genres for each artist associated with a single song, without duplicates
    
    Parameters:
        artist_uris (list): List of artist uris pulled from the song uri; could be more than one
        
    Returns:
        list: List of unique genres of all the artists
    '''
    song_genres = []
    
    for i in range(len(artist_uris)):
        artist_metadata = sp.artist(artist_uris[i])
        artist_genres = artist_metadata['genres']
        song_genres.extend(artist_genres)
    
    return list(set(song_genres))

In [8]:
def record_all_song_genres(song_uris):
    '''
    Loop through list of song uris and run functions for each, returning one list of genres.
    
    Parameters:
        song_uris (list): List of song uris for all songs in the database
    
    Returns:
        all_song_genres (list): A list of lists the same size as the original database. Each list contains all
                                unique genres for the song at that index.
    '''
    all_song_genres = []
    
    for i in range(len(song_uris)):
        song_genres = get_song_genres(song_uris[i])
        all_song_genres.append(song_genres)
    
    return all_song_genres

In [9]:
# Test on single song
get_song_genres("6N6tiFZ9vLTSOIxkj8qKrd")

['classical performance',
 'early romantic era',
 'classical',
 'russian classical piano',
 'polish classical',
 'classical piano']

#### Load in our cleaned/improved dataset

In [10]:
cleaned_spotify_df = pd.read_csv('data/improved_df.csv').iloc[:, 1:]
cleaned_spotify_df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,decade_1930,decade_1940,decade_1950,decade_1960,decade_1970,decade_1980,decade_1990,decade_2000,decade_2010,decade_2020
0,0.998996,['Carl Woitschach'],0.716599,0.028442,0.195,0.0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,0.151,0.745,...,0,0,0,0,0,0,0,0,0,0
1,0.997992,"['Robert Schumann', 'Vladimir Horowitz']",0.383603,0.051316,0.0135,0.0,6KuQTIu1KoTTkLXKrwlLPV,0.901,0.0763,0.494026,...,0,0,0,0,0,0,0,0,0,0
2,0.606426,['Seweryn Goszczyński'],0.758097,0.018374,0.22,0.0,6L63VW0PibdM1HDSBoqnoM,0.0,0.119,0.627609,...,0,0,0,0,0,0,0,0,0,0
3,0.998996,['Francisco Canaro'],0.790486,0.032538,0.13,0.0,6M94FkXd15sOAOQYRnWPN8,0.887,0.111,0.708887,...,0,0,0,0,0,0,0,0,0,0
4,0.993976,"['Frédéric Chopin', 'Vladimir Horowitz']",0.212551,0.12645,0.204,0.0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,0.098,0.676079,...,0,0,0,0,0,0,0,0,0,0


In [11]:
song_uris = cleaned_spotify_df['id'].head(10)

#### Genre data for the first 10 Spotify songs in our database

In [12]:
record_all_song_genres(list(song_uris))

[[],
 ['german romanticism',
  'classical performance',
  'early romantic era',
  'classical',
  'russian classical piano',
  'classical piano'],
 [],
 ['tango', 'vintage tango'],
 ['classical performance',
  'early romantic era',
  'classical',
  'russian classical piano',
  'polish classical',
  'classical piano'],
 ['german romanticism',
  'classical performance',
  'early romantic era',
  'classical',
  'russian classical piano',
  'classical piano'],
 ['classical performance',
  'late romantic era',
  'classical',
  'russian classical piano',
  'classical piano'],
 [],
 ['tango', 'orquesta tipica', 'vintage tango'],
 []]

# Experiment Design

#### Challenges

To generate the genres for a specific song, we need to make at least 2 API calls:
- Determine the artists of a track (1 API call)
- Determine the genres of a given artist (1 API call per artist on a track)

For our dataset of 169,909 songs, this means at least 339,818 API calls.
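One way to reduce the number of requests is to batch them. The sketch below is not what this notebook runs; it assumes spotipy's `tracks` and `artists` methods, which accept several IDs per request (up to 50, as far as I know), and it looks up each distinct artist only once. The names `chunked` and `record_all_song_genres_batched` are hypothetical, and the client is passed in as an argument rather than read from the global `sp`.

```python
def chunked(items, size=50):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def record_all_song_genres_batched(sp, song_ids, batch_size=50):
    """Return one sorted list of unique artist genres per song, in input order.

    `sp` is assumed to be a spotipy.Spotify client (or anything exposing
    compatible `tracks` and `artists` methods).
    """
    # Step 1: fetch tracks in batches and remember each track's artist ids.
    track_artists = []
    for batch in chunked(song_ids, batch_size):
        for track in sp.tracks(batch)['tracks']:
            # A track that cannot be found may come back as None.
            artists = track['artists'] if track else []
            track_artists.append([artist['id'] for artist in artists])

    # Step 2: look up each distinct artist once, again in batches.
    unique_artist_ids = sorted({a for ids in track_artists for a in ids})
    genres_by_artist = {}
    for batch in chunked(unique_artist_ids, batch_size):
        for artist in sp.artists(batch)['artists']:
            if artist:
                genres_by_artist[artist['id']] = artist['genres']

    # Step 3: merge the genres of every artist on each track.
    return [
        sorted({g for a in ids for g in genres_by_artist.get(a, [])})
        for ids in track_artists
    ]
```

Under these assumptions, the track lookups for 169,909 songs would take 3,399 requests (batches of 50) instead of one request per song, and artists who appear on many songs would be fetched only once.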

#### Solution

To test this implementation, I sampled a section of the dataset and generated the genres for this sample.

To weight these genres, I applied TF-IDF, which gives higher weight to rarer, more specific genres. See Madhav Thaker's repository (https://github.com/madhavthaker/spotify-recommendation-system) for the inspiration.
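To make the weighting concrete, here is a small hand-rolled sketch of TF-IDF on a few made-up genre lists. It is only an illustration: the notebook itself uses scikit-learn's `TfidfVectorizer`, and the formula below (smoothed IDF followed by L2 normalisation) is my understanding of that class's default behaviour. The function name `tfidf_weights` and the sample songs are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(song_genres):
    """Return one {genre: weight} dict per song, using smoothed IDF and L2 normalisation."""
    n_songs = len(song_genres)
    # Document frequency: in how many songs each genre appears.
    doc_freq = Counter(genre for genres in song_genres for genre in set(genres))

    weighted = []
    for genres in song_genres:
        term_freq = Counter(genres)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the term count.
        raw = {
            g: tf * (math.log((1 + n_songs) / (1 + doc_freq[g])) + 1)
            for g, tf in term_freq.items()
        }
        # Scale each song's weights to unit length; songs without genres stay empty.
        length = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({g: w / length for g, w in raw.items()} if length else {})
    return weighted


# Illustrative genre lists, not taken from the dataset.
songs = [
    ['pop', 'dance pop'],
    ['pop', 'pop rap'],
    ['pop', 'vintage tango'],
    ['tango'],
]
for genres, weights in zip(songs, tfidf_weights(songs)):
    print(genres, {g: round(w, 3) for g, w in weights.items()})
```

In this toy example, `pop` appears in three of four songs and so receives a lower weight than the rarer genre it is paired with, while a song with a single genre ends up with a weight of 1 for that genre after normalisation.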

The success of this recommender model depends on the user-supplied playlist having a majority of its songs in the master database. Otherwise, we cannot generate a reliable playlist vector. Therefore, when sampling data, we need a sampling criterion that keeps most of a specific playlist's songs in the sample while leaving the sample diverse enough for the results to show meaningful learning.

To satisfy these criteria, I sample only the year 2014. This leads to a group of 2000 songs. For our custom playlist, we can use the playlist "Top Hits of 2014" by Spotify. This playlist contains the most popular songs of the year, mostly in the genres Top 40, Pop, and Hip Hop. Ideally, our sample will contain a majority of the songs in this playlist.

A successful model should return songs also in these genres.

In [13]:
spotify_df_improvements = pd.read_csv('data/improved_df.csv').iloc[:, 1:]
spotify_df_improvements.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,decade_1930,decade_1940,decade_1950,decade_1960,decade_1970,decade_1980,decade_1990,decade_2000,decade_2010,decade_2020
0,0.998996,['Carl Woitschach'],0.716599,0.028442,0.195,0.0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,0.151,0.745,...,0,0,0,0,0,0,0,0,0,0
1,0.997992,"['Robert Schumann', 'Vladimir Horowitz']",0.383603,0.051316,0.0135,0.0,6KuQTIu1KoTTkLXKrwlLPV,0.901,0.0763,0.494026,...,0,0,0,0,0,0,0,0,0,0
2,0.606426,['Seweryn Goszczyński'],0.758097,0.018374,0.22,0.0,6L63VW0PibdM1HDSBoqnoM,0.0,0.119,0.627609,...,0,0,0,0,0,0,0,0,0,0
3,0.998996,['Francisco Canaro'],0.790486,0.032538,0.13,0.0,6M94FkXd15sOAOQYRnWPN8,0.887,0.111,0.708887,...,0,0,0,0,0,0,0,0,0,0
4,0.993976,"['Frédéric Chopin', 'Vladimir Horowitz']",0.212551,0.12645,0.204,0.0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,0.098,0.676079,...,0,0,0,0,0,0,0,0,0,0


In [14]:
df_2014 = spotify_df_improvements[spotify_df_improvements['year_2014'] == 1]

In [15]:
spotify_keys = json.load(open('spotify_keys.json'))
client_id = spotify_keys['client_id']
client_secret = spotify_keys['client_secret']

client_credentials_manager = SpotifyClientCredentials(client_id=client_id,client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

playlist_uri = 'spotify:playlist:37i9dQZF1DX0h0QnLkMBl4'
user = spotify_keys['username']


In [16]:
playlist = playlist_uri.split(':')[2]
results = sp.user_playlist(user, playlist, 'tracks')

playlist_song_uris = []

for song in results['tracks']['items']:
    playlist_song_uris.append(song['track']['id'])

print('The number of total songs in the playlist is: ', len(playlist_song_uris))

playlist_data = df_2014[df_2014['id'].isin(playlist_song_uris)]
print('The number of songs in the master database is: ', len(playlist_data))


The number of total songs in the playlist is:  100
The number of songs in the master database is:  45


#### Use our functions for generating genre data to pull genres for all 2000 songs

In [17]:
df_2014_uris = df_2014['id']

In [18]:
df_2014_genres = record_all_song_genres(list(df_2014_uris))

In [19]:
df_2014_genres_raw = df_2014.copy()
df_2014_genres_raw['genres_raw'] = df_2014_genres

In [20]:
df_2014_genres_raw.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,decade_1940,decade_1950,decade_1960,decade_1970,decade_1980,decade_1990,decade_2000,decade_2010,decade_2020,genres_raw
7454,0.007771,['Linkin Park'],0.47166,0.039364,0.852,0.0,3aYBjxTMvrEOP0A0UXg9ER,3e-06,0.0989,0.86936,...,0,0,0,0,0,0,0,1,0,"[post-grunge, alternative metal, nu metal, rap..."
7455,0.007791,['Hippie Sabotage'],0.36336,0.043466,0.557,0.0,5iyFrZpv5f4oWQCWMKfj52,0.00434,0.475,0.795803,...,0,0,0,0,0,0,0,1,0,"[electronic trap, edm]"
7456,0.319277,['Bleachers'],0.558704,0.036174,0.941,0.0,7pK4jm4HiNaYVzL2zbQSoG,0.0655,0.33,0.810336,...,0,0,0,0,0,0,0,1,0,"[modern alternative rock, indie poptimism, dou..."
7457,3.9e-05,['together PANGEA'],0.266194,0.044798,0.853,1.0,14q1CWouLEjFT6zu5re8hR,0.362,0.0839,0.868656,...,0,0,0,0,0,0,0,1,0,[indie garage rock]
7458,0.001255,"['David Guetta', 'Showtek', 'VASSY']",0.621457,0.03066,0.972,0.0,6PtXobrqImYfnpIxNsJApa,0.0186,0.328,0.87813,...,0,0,0,0,0,0,0,1,0,"[pop, australian dance, edm, progressive elect..."


In [21]:
df_2014_genres[:10]

[['post-grunge', 'alternative metal', 'nu metal', 'rap metal'],
 ['electronic trap', 'edm'],
 ['modern alternative rock',
  'indie poptimism',
  'double drumming',
  'indie pop',
  'modern rock',
  'pop rock'],
 ['indie garage rock'],
 ['pop',
  'australian dance',
  'edm',
  'progressive electro house',
  'classic hardstyle',
  'big room',
  'dance pop',
  'euphoric hardstyle',
  'pop dance',
  'pop rap',
  'electro house'],
 ['neon pop punk',
  'pop punk',
  'modern rock',
  'post-teen pop',
  'vegas indie',
  'pop rock'],
 ['ninja', 'chillwave'],
 ['christian alternative rock',
  'ccm',
  'christian music',
  'worship',
  'christian hip hop'],
 ['pop rap'],
 ['pop', 'urban contemporary', 'dance pop', 'r&b', 'alternative r&b']]

#### Define functions to run tfidf and return a dataframe to include in the original dataset

In [22]:
def genre_tfidf(genres):
    '''
    Given a list of genres, generate a dataframe that employs tfidf. This will provide a weight for each song to
    better categorize the artist's set of genres. For tfidf to properly work, multi-word genres must be combined
    into single words.
    
    Parameters:
        genres (list): List of genres matching the genres for every artist in the larger dataframe
        
    Returns:
        dummy_genres (df): Dummy dataframe weighted by tfidf results
    '''
    cleaned_df_genres = []
    
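    for song in genres:
        # Join multi-word genres with underscores without modifying the input lists
        cleaned_df_genres.append(" ".join(genre.replace(" ", "_") for genre in song))
            
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(cleaned_df_genres)
    
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = vectorizer.get_feature_names_out()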
    
    dummy_genres = generate_dummy_df(X, feature_names)
    return dummy_genres

In [23]:
def generate_dummy_df(sparse_dummy_genres, feature_names):
    '''
    Turn the sparse tfidf matrix into a dense dummy dataframe that can be concatenated with the original database.
    
    Parameters:
        sparse_dummy_genres (sparse matrix): Results of tfidf
        feature_names (list): List of feature names returned from tfidf vectorizer
        
    Returns:
        dummy_genres (df): Dummy dataframe weighted by tfidf results
    '''
    clean_feature_names = []
    for name in feature_names:
        clean_feature_names.append("genre_" + name)
    dummy_genres = pd.DataFrame(sparse_dummy_genres.toarray())
    dummy_genres.columns = clean_feature_names
    
    return dummy_genres

In [24]:
dummy_genres = genre_tfidf(df_2014_genres)
dummy_genres.head()

Unnamed: 0,genre__hip_hop,genre_a_cappella,genre_abstract_hip_hop,genre_acoustic_blues,genre_acoustic_pop,genre_adult_standards,genre_aesthetic_rap,genre_african_rock,genre_afro_dancehall,genre_afrobeat,...,genre_west_coast_rap,genre_west_coast_reggae,genre_west_coast_trap,genre_wonky,genre_wop,genre_worcester_ma_indie,genre_world,genre_world_worship,genre_worship,genre_yacht_rock
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
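A column name such as `genre__hip_hop` hints that some genre names are being split during tokenization. As far as I know, `TfidfVectorizer` lowercases text and, by default, extracts tokens with the pattern `(?u)\b\w\w+\b`. That pattern keeps underscores but breaks on hyphens, ampersands, and other punctuation, and it drops single-character tokens. The sketch below approximates that behaviour with `re` alone so the effect can be checked without fitting a vectorizer; the helper name `default_tokens` is hypothetical, and the genres are ones that appear in the output above.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(document):
    """Approximate the default tokenization: lowercase, then keep runs of 2+ word characters."""
    return re.findall(DEFAULT_TOKEN_PATTERN, document.lower())


song = ['post-grunge', 'alternative r&b', 'pop rap']
# Same preparation as genre_tfidf: underscores inside a genre, spaces between genres.
document = " ".join(genre.replace(" ", "_") for genre in song)

print(document)
print(default_tokens(document))
```

Here `post-grunge` becomes two tokens and `alternative_r&b` loses its trailing `b`, so some genre columns may not correspond to real genre names. One possible fix, which I have not tested against this dataset, is to pass each song's genre list directly and give `TfidfVectorizer` a callable `analyzer` that returns the list unchanged, so each genre is treated as a single token.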


In [25]:
df_2014 = df_2014.reset_index(drop=True)
display(df_2014.head())

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,decade_1930,decade_1940,decade_1950,decade_1960,decade_1970,decade_1980,decade_1990,decade_2000,decade_2010,decade_2020
0,0.007771,['Linkin Park'],0.47166,0.039364,0.852,0.0,3aYBjxTMvrEOP0A0UXg9ER,3e-06,0.0989,0.86936,...,0,0,0,0,0,0,0,0,1,0
1,0.007791,['Hippie Sabotage'],0.36336,0.043466,0.557,0.0,5iyFrZpv5f4oWQCWMKfj52,0.00434,0.475,0.795803,...,0,0,0,0,0,0,0,0,1,0
2,0.319277,['Bleachers'],0.558704,0.036174,0.941,0.0,7pK4jm4HiNaYVzL2zbQSoG,0.0655,0.33,0.810336,...,0,0,0,0,0,0,0,0,1,0
3,3.9e-05,['together PANGEA'],0.266194,0.044798,0.853,1.0,14q1CWouLEjFT6zu5re8hR,0.362,0.0839,0.868656,...,0,0,0,0,0,0,0,0,1,0
4,0.001255,"['David Guetta', 'Showtek', 'VASSY']",0.621457,0.03066,0.972,0.0,6PtXobrqImYfnpIxNsJApa,0.0186,0.328,0.87813,...,0,0,0,0,0,0,0,0,1,0


In [26]:
combined_df_2014 = pd.concat([df_2014, dummy_genres], axis=1, join='inner')
display(combined_df_2014.head())

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,genre_west_coast_rap,genre_west_coast_reggae,genre_west_coast_trap,genre_wonky,genre_wop,genre_worcester_ma_indie,genre_world,genre_world_worship,genre_worship,genre_yacht_rock
0,0.007771,['Linkin Park'],0.47166,0.039364,0.852,0.0,3aYBjxTMvrEOP0A0UXg9ER,3e-06,0.0989,0.86936,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.007791,['Hippie Sabotage'],0.36336,0.043466,0.557,0.0,5iyFrZpv5f4oWQCWMKfj52,0.00434,0.475,0.795803,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.319277,['Bleachers'],0.558704,0.036174,0.941,0.0,7pK4jm4HiNaYVzL2zbQSoG,0.0655,0.33,0.810336,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3.9e-05,['together PANGEA'],0.266194,0.044798,0.853,1.0,14q1CWouLEjFT6zu5re8hR,0.362,0.0839,0.868656,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.001255,"['David Guetta', 'Showtek', 'VASSY']",0.621457,0.03066,0.972,0.0,6PtXobrqImYfnpIxNsJApa,0.0186,0.328,0.87813,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
# Save df to csv to avoid rerunning API calls on future runs
spotify_keys = json.load(open('spotify_keys.json'))
path = spotify_keys["csv_path"] + "sample_df_with_genres.csv"
combined_df_2014.to_csv(r'{}'.format(path))

#### Redefine functions for building playlist vector and making predictions

In [28]:
def build_playlist_vector(playlist_uri, user, reference_df):
    '''
    Given a playlist uri, load the playlist songs with the Spotify API. Identify which playlist songs are also in 
    the reference database. For these songs, generate a vector that summarizes the playlist. Return the vector and
    the reference database for making further predictions.
    
    Parameters: 
        playlist_uri (string): Unique spotify identifier for playlists
        user (string): Username associated with a Spotify developer account
        reference_df (df): Master database that contains all of the reference songs and their features
    
    Returns:
        averaged_playlist_vector (df): Dataframe with a single row entry summarizing all songs in the playlist
        updated_reference_df (df): Reference dataframe with all playlist songs removed to avoid predicting songs already in the playlist
    '''
    playlist = playlist_uri.split(":")[2]
    results = sp.user_playlist(user, playlist, 'tracks')
    
    playlist_song_uris = []
    for song in results['tracks']['items']:
        playlist_song_uris.append(song['track']['id'])
        
    print("The number of total songs in the playlist is: ", len(playlist_song_uris))
    
    playlist_df = reference_df[reference_df['id'].isin(playlist_song_uris)]
    print("The number of songs in the master database is: ", len(playlist_df))
        
    averaged_playlist_vector = generate_playlist_vector(playlist_df)

    # Remove songs from the reference dataframe that are also in the playlist
    numeric_reference_df = return_numeric_features(reference_df)
    updated_reference_df = numeric_reference_df.drop(index=playlist_df.index)
    
    return averaged_playlist_vector, updated_reference_df

In [29]:
# redefine our numeric feature function
def return_numeric_features(df):
    '''
    Take a dataframe with both numeric and text features, and return a dataframe with only the numerical features. In
    this case, features like "artist" and "name" will be removed, which are not needed for determining cosine 
    similarity.
    
    Parameters:
        df (df): Any dataframe
        
    Returns:
        numeric_df (df): A dataframe cleaned to only have numeric features
    '''
    numeric_features = df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())

    non_numeric_features = []
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for feat, value in numeric_features.items():
        if not value:
            non_numeric_features.append(feat)
            
    numeric_df = df.drop(non_numeric_features, axis=1)
    
    return numeric_df

In [30]:
def generate_playlist_vector(playlist_df):
    '''
    Given a dataframe with data on our playlist songs, apply a weighting function to offset popularity. Once applied,
    take the mean along axis 0 to return an averaged playlist vector that can best represent the entire playlist.
    
    Parameters:
        playlist_df (df): Dataframe containing all songs within playlist that are also contained in the reference dataframe
        
    Returns:
        averaged_playlist_vector (df): Dataframe with a single entry, summarizing the playlist
    '''
    numeric_cleaned_df = return_numeric_features(playlist_df)

    song_weights = numeric_cleaned_df["popularity"].apply(lambda x: weight_formula(x * 100))
    numeric_cleaned_df["Song Weight"] = song_weights

    weighted_playlist_df = numeric_cleaned_df.mul(numeric_cleaned_df["Song Weight"], axis=0)
    weighted_playlist_df = weighted_playlist_df.drop("Song Weight", axis=1)
    
    averaged_playlist_vector = weighted_playlist_df.mean(axis=0)
    return averaged_playlist_vector

In [31]:
# redefine weighting formula for clarity
def weight_formula(popularity):
    '''
    Equation used to gradually dampen the value of higher popularities, which range from 0 to 100.
    
    Parameters:
        popularity (float): A popularity score ranging between 0 and 100
        
    Returns:
        (float): A weight used to scale the song vector
    '''
    return np.log(-popularity + 120)/(np.log(120))
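As a quick worked check of this formula, the sketch below evaluates it at a few popularity values. It uses a scalar version written with `math.log`; the name `weight_formula_scalar` is hypothetical and exists only for this illustration. Note that the offset of 120 keeps the argument of the logarithm positive across the whole 0 to 100 range (120 - 100 = 20).

```python
import math


def weight_formula_scalar(popularity):
    """Scalar version of the weighting formula: ln(120 - popularity) / ln(120)."""
    return math.log(120 - popularity) / math.log(120)


# Weights shrink slowly at first, then faster as popularity approaches 100.
for popularity in (0, 25, 50, 75, 100):
    print(popularity, round(weight_formula_scalar(popularity), 3))
```

A popularity of 0 gives a weight of exactly 1, and a popularity of 100 gives ln(20) / ln(120), which is roughly 0.63, so the most popular songs are scaled down by a little over a third.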

In [32]:
playlist_uri = "spotify:playlist:37i9dQZF1DX0h0QnLkMBl4"
user = spotify_keys['username']

improved_playlist_vector, reference_df = build_playlist_vector(playlist_uri, user, combined_df_2014)
display(improved_playlist_vector)

The number of total songs in the playlist is:  100
The number of songs in the master database is:  45


acousticness                0.137013
danceability                0.527068
duration_ms                 0.033315
energy                      0.541779
explicit                    0.109857
                              ...   
genre_worcester_ma_indie    0.000000
genre_world                 0.000000
genre_world_worship         0.000000
genre_worship               0.000000
genre_yacht_rock            0.000000
Length: 919, dtype: float64

In [33]:
def cos_sim(row, playlist_vector):
    '''
    Function to return cosine similarity between two vectors.
    
    Parameters:
        row (df): Row of a dataframe that function will be applied to
        playlist_vector (df): Reference vector that will be compared to all rows
    
    Returns:
        (float): Cosine similarity score
    '''
    return dot(row, playlist_vector)/(norm(row)*norm(playlist_vector))
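One edge case worth noting: if a row is all zeros, its norm is 0 and the division above yields `nan`. The sketch below is an alternative, not the notebook's implementation. It computes the similarity for all rows at once with NumPy and assigns a score of 0 to zero-length rows. The name `cosine_similarities` is hypothetical, and treating a zero vector as similarity 0 is an assumption rather than a standard definition.

```python
import numpy as np


def cosine_similarities(matrix, vector):
    """Cosine similarity between every row of `matrix` and `vector`; zero-norm rows score 0."""
    matrix = np.asarray(matrix, dtype=float)
    vector = np.asarray(vector, dtype=float)
    dots = matrix @ vector
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    # Divide only where the norm product is non-zero, leaving 0.0 elsewhere.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)


rows = [[1, 0], [0, 1], [0, 0], [2, 0]]
print(cosine_similarities(rows, [1, 0]))
```

With a dataframe such as `reference_df`, the same function could be called as `cosine_similarities(reference_df.to_numpy(), playlist_vector.to_numpy())`, assuming both share the same column order.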

In [34]:
def predict_top_songs(playlist_vector, reference_df, original_reference_df, number_of_songs):
    '''
    Use a playlist vector and a reference dataframe to calculate the N most highly recommended songs to add to the 
    given playlist.
    
    Parameters:
        playlist_vector (df): Average vector summarizing the entries in the given playlist
        reference_df (df): Numerical dataframe with all playlist songs removed
        original_reference_df (df): Reference df including text features
        number_of_songs (int): Number of top songs to display
        
    Returns:
        top_songs (df): Dataframe showing the song, artist, and cosine score for the N most similar songs
    '''
    song_similarity_to_playlist = reference_df.apply(cos_sim, axis=1, args=(playlist_vector,))
    
    original_reference_df['cosine_similarity'] = song_similarity_to_playlist
    top_songs = original_reference_df.sort_values('cosine_similarity', ascending=False)
    top_songs = top_songs[['name', 'artists', 'cosine_similarity']].sort_values('cosine_similarity', ascending=False).head(number_of_songs)
    return top_songs

In [36]:
top_songs = predict_top_songs(improved_playlist_vector, reference_df, combined_df_2014, 20)
display(top_songs)

Unnamed: 0,name,artists,cosine_similarity
764,Everyday,['Melodyia Music'],0.974734
1745,Sex You,['Bando Jonez'],0.974054
1933,Breaking Free,['Melodyia Music'],0.972717
26,La Planta,['Caos'],0.971756
1790,Little Game,['Benny'],0.969441
658,Hero's Come Back!!(第1話〜第30話),['Junichi Sakamoto'],0.968579
1956,Gotta Go My Own Way,['Melodyia Music'],0.968413
703,Jaan,['Maz Bonafide'],0.967497
390,Hit and Run,['LOLO'],0.967248
766,Monsters,['Katie Sky'],0.967167


# Conclusions

This genre-based model appears to have returned mostly pop/hip-hop songs. However, compared to our previous model, these results look less consistent. Here are some possible reasons:

- Using only 2000 songs provides a smaller pool in which to search for best fits. Running this playlist against the entire dataset would likely surface higher-scoring songs.

- After introducing our new features from the genre data, there were 922 total features, around 700 of which were based on genre data alone. These new features were mostly 0 because each song has only a handful of genres, and these 700 features are categorical. This appears to skew our results, because cosine similarity assigns high similarity between songs simply because they share many of the genre dummy features. This takes importance away from some of the other, more meaningful features. It may also explain the high cosine_similarity scores, despite some of these top songs feeling out of place.

- TFIDF may not be the optimal choice in this scenario. We are scraping genres based on the artist, and not the songs themselves. Some artists have a large number of genres, which would lead to lower TFIDF scores. Other artists only have a single genre, which would be a TFIDF score of 1 for that specific genre. Therefore, while TFIDF does provide an additional level of information on the frequency of certain genres, the effectiveness of this is diminished because of the varying document size. A better solution may be just using term frequency.

How can this idea be improved?

I think the ideal solution would be to find the 20 most common genres in the database, use term frequency to weight these genres, and create a similar dummy dataframe that can be used in the same manner, only 20 features wide. If a song does not have one of these 20 genres, it will receive 0 for all fields. Therefore, all songs without a top-20 genre will implicitly share a feature for "some other less popular genre", which will hold some weight.

This combination of steps will not dilute the feature-space with rarely used genres, yet will provide important information on the genres that appear most commonly.
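A minimal sketch of this proposal, using only the standard library, might look like the following. It has not been run against the dataset. The function name `top_genre_features` is hypothetical, and the choice of term-frequency weight (each genre on a song gets an equal share, 1 divided by the number of genres on that song) is my assumption, since the exact weighting is not pinned down above.

```python
from collections import Counter


def top_genre_features(song_genres, top_n=20):
    """Return (columns, rows): term-frequency weights restricted to the top_n most common genres."""
    # Count in how many songs each genre appears.
    counts = Counter(genre for genres in song_genres for genre in set(genres))
    top_genres = [genre for genre, _ in counts.most_common(top_n)]

    rows = []
    for genres in song_genres:
        unique = set(genres)
        # Assumption: each genre on a song gets an equal share, 1 / (number of genres).
        share = 1 / len(unique) if unique else 0.0
        rows.append([share if genre in unique else 0.0 for genre in top_genres])

    columns = ['genre_' + genre.replace(' ', '_') for genre in top_genres]
    return columns, rows


# Illustrative genre lists, not taken from the dataset.
example = [['pop', 'dance pop'], ['pop'], [], ['tango', 'pop', 'dance pop']]
columns, rows = top_genre_features(example, top_n=2)
print(columns)
print(rows)
```

The result could then be wrapped with `pd.DataFrame(rows, columns=columns)` and concatenated to the song dataframe in the same way as `dummy_genres`. Songs with no top genre end up as an all-zero row.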