# Goal

As Satisfaction Scouts, we want to be able to generate the ultimate ordered-lineup for a concert. In order to do that, we need to determine similarities among various artists to efficiently predict a cohesive lineup plan. Given a dataset of tracks, we want to convert that information into a new dataset of artists, and then use a 5-neighbors classifier model to determine 5 most similar artists using a user's input.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Load Data

In [3]:
data = pd.read_csv("../Data/dataset.csv")

In [4]:
data

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,artist_popularity,release_dates,duration_ms,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,58,2022-04-08,230666,...,-6.746,0,0.1430,0.03220,0.000001,0.3580,0.7150,87.917,4,acoustic
1,1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,42,2021-04-30,149610,...,-17.235,1,0.0763,0.92400,0.000006,0.1010,0.2670,77.489,4,acoustic
2,2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,54,2021-03-17,210826,...,-9.734,1,0.0557,0.21000,0.000000,0.1170,0.1200,76.332,4,acoustic
3,3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,57,2018-08-10,201933,...,-18.515,1,0.0363,0.90500,0.000071,0.1320,0.1430,181.740,3,acoustic
4,4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,59,2017-02-03,198853,...,-9.681,1,0.0526,0.46900,0.000000,0.0829,0.1670,119.949,4,acoustic
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29234,29234,29234,6X6wIzuxsh7GVNMPz1xTNa,Hillsong Worship,No Other Name,Depths,37,70,2014-07-01,377697,...,-8.232,1,0.0284,0.00887,0.000009,0.7060,0.0687,80.003,4,world-music
29235,29235,29235,5y8ARSg47Yx52xvQQAlS35,Mosaic MSC,HUMAN (Deluxe) [Live],Fountain (I Am Good) - Live,22,52,2020-10-09,318874,...,-8.285,1,0.0357,0.02060,0.000013,0.2530,0.1140,139.983,4,world-music
29236,29236,29236,6PM55W7WiUmHVPdUebJP55,Planetshakers,Greater (Live),Stay (You Are Good) - Live,38,56,2022-09-02,462397,...,-5.696,1,0.0548,0.07240,0.000003,0.3740,0.1460,139.051,4,world-music
29237,29237,29237,0XEgJiDryoDd2gIJhVXghd,Bryan & Katie Torwalt;Brock Human,I've Got Good News (Live) [Deluxe],Hallelujah On My Knees - Live,22,47,2022-07-01,380344,...,-12.070,1,0.0316,0.39200,0.000000,0.6620,0.2060,127.731,4,world-music


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29239 entries, 0 to 29238
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0.1       29239 non-null  int64  
 1   Unnamed: 0         29239 non-null  int64  
 2   track_id           29239 non-null  object 
 3   artists            29239 non-null  object 
 4   album_name         29239 non-null  object 
 5   track_name         29239 non-null  object 
 6   popularity         29239 non-null  int64  
 7   artist_popularity  29239 non-null  int64  
 8   release_dates      29239 non-null  object 
 9   duration_ms        29239 non-null  int64  
 10  explicit           29239 non-null  bool   
 11  danceability       29239 non-null  float64
 12  energy             29239 non-null  float64
 13  key                29239 non-null  int64  
 14  loudness           29239 non-null  float64
 15  mode               29239 non-null  int64  
 16  speechiness        292

# Peter's Genre Mapping + Further Data Cleaning

In [6]:
# Mapping dictionary
genre_mapping = {
    'acoustic': 'World/Folk',
    'afrobeat': 'World/Folk',
    'alt-rock': 'Rock',
    'ambient': 'Electronic',
    'anime': 'World/Folk',
    'black-metal': 'Rock',
    'bluegrass': 'World/Folk',
    'blues': 'World/Folk',
    'breakbeat': 'Electronic',
    'british': 'Rock',
    'chicago-house': 'Electronic',
    'chill': 'Electronic',
    'classical': 'Classical',
    'club': 'Electronic',
    'comedy': 'Other/Miscellaneous',
    'country': 'World/Folk',
    'dance': 'Pop',
    'dancehall': 'World/Folk',
    'death-metal': 'Rock',
    'deep-house': 'Electronic',
    'detroit-techno': 'Electronic',
    'disco': 'Pop',
    'drum-and-bass': 'Electronic',
    'dub': 'Electronic',
    'dubstep': 'Electronic',
    'edm': 'Pop',
    'electro': 'Electronic',
    'electronic': 'Electronic',
    'emo': 'Rock',
    'folk': 'World/Folk',
    'forro': 'World/Folk',
    'funk': 'World/Folk',
    'garage': 'World/Folk',
    'goth': 'Rock',
    'grindcore': 'Rock',
    'groove': 'World/Folk',
    'grunge': 'Rock',
    'guitar': 'World/Folk',
    'happy': 'Other/Miscellaneous',
    'hard-rock': 'Rock',
    'hardcore': 'Rock',
    'hardstyle': 'Electronic',
    'heavy-metal': 'Rock',
    'hip-hop': 'Hip-Hop/Rap',
    'honky-tonk': 'World/Folk',
    'house': 'Electronic',
    'idm': 'Electronic',
    'indie': 'Rock',
    'industrial': 'Rock',
    'j-dance': 'Electronic',
    'j-pop': 'Pop',
    'j-rock': 'Rock',
    'jazz': 'Jazz',
    'malay': 'World/Folk',
    'mandopop': 'Pop',
    'metal': 'Rock',
    'metalcore': 'Rock',
    'minimal-techno': 'Electronic',
    'mpb': 'World/Folk',
    'new-age': 'Classical',
    'party': 'Other/Miscellaneous',
    'piano': 'Classical',
    'pop-film': 'Pop',
    'pop': 'Pop',
    'power-pop': 'Pop',
    'progressive-house': 'Electronic',
    'psych-rock': 'Rock',
    'punk-rock': 'Rock',
    'punk': 'Rock',
    'r-n-b': 'Pop',
    'reggae': 'World/Folk',
    'reggaeton': 'World/Folk',
    'rock-n-roll': 'Rock',
    'rock': 'Rock',
    'rockabilly': 'World/Folk',
    'romance': 'Other/Miscellaneous',
    'sad': 'Other/Miscellaneous',
    'show-tunes': 'Other/Miscellaneous',
    'singer-songwriter': 'World/Folk',
    'ska': 'World/Folk',
    'sleep': 'Other/Miscellaneous',
    'soul': 'World/Folk',
    'study': 'Other/Miscellaneous',
    'synth-pop': 'Pop',
    'tango': 'World/Folk',
    'techno': 'Electronic',
    'trance': 'Electronic',
    'trip-hop': 'Electronic',
    'world-music': 'World/Folk'
}

# Apply the mapping
data['track_genre'] = data['track_genre'].replace(genre_mapping)



In [7]:
data=data.dropna()
data = data[data['popularity'] != 0]
data['instrumentalness'] = np.log(data['instrumentalness'].replace(0, np.nan)).fillna(0)
data['acousticness'] = np.log(data['acousticness'].replace(0, np.nan)).fillna(0)
data['speechiness'] = np.log(data['speechiness'].replace(0, np.nan)).fillna(0)
data['liveness'] = np.log(data['liveness'].replace(0, np.nan)).fillna(0)
data['duration_ms'] = np.log(data['duration_ms'].replace(0, np.nan)).fillna(0)


# Getting Mode of Categorical Variables from dataset.csv

This is to prepare for organizing our current dataset by artist, rather than by track. This also keeps categorical variables consistent with each other.

In [8]:
def calculate_mode(category):
    return category.mode().iloc[0] if not category.mode().empty else None

In [9]:
mode_explicit = data.groupby('artists')['explicit'].apply(calculate_mode).reset_index()
mode_explicit.columns = ['artists', 'mode_explicit']

mode_key = data.groupby('artists')['key'].apply(calculate_mode).reset_index()
mode_key.columns = ['artists', 'mode_key']

mode_mode = data.groupby('artists')['mode'].apply(calculate_mode).reset_index()
mode_mode.columns = ['artists', 'mode_mode']

mode_time_signature = data.groupby('artists')['time_signature'].apply(calculate_mode).reset_index()
mode_time_signature.columns = ['artists', 'mode_time_signature']

mode_genre = data.groupby('artists')['track_genre'].apply(calculate_mode).reset_index()
mode_genre.columns = ['artists', 'mode_genre']


Make dataframe of categorical features to make it easier (for me)

In [10]:
categorical_features  = pd.merge(mode_explicit, mode_key, on='artists')

categorical_features = pd.merge(categorical_features, mode_mode, on='artists')

categorical_features = pd.merge(categorical_features, mode_time_signature, on='artists')

categorical_features = pd.merge(categorical_features, mode_genre, on='artists')

In [11]:
categorical_features

Unnamed: 0,artists,mode_explicit,mode_key,mode_mode,mode_time_signature,mode_genre
0,!nvite,False,2,0,4,Other/Miscellaneous
1,"""Weird Al"" Yankovic",False,8,1,4,Other/Miscellaneous
2,$affie,False,0,1,4,Other/Miscellaneous
3,$affie;lofi stu,False,11,0,4,Other/Miscellaneous
4,&ME;Rampa;Adam Port;Sofie Royer,False,11,0,4,Electronic
...,...,...,...,...,...,...
13854,yama,False,5,1,4,Pop
13855,yetep;Caslow;Lexi Scatena,False,6,1,4,Electronic
13856,yetep;Kyle Reynolds,False,9,1,4,Electronic
13857,yeule,False,0,0,4,Electronic


# Make a new dataset organized by artist and corresponding average statistics

Get numerical features

In [12]:
numerical_features = ['popularity', 'artist_popularity','duration_ms', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']

Group dataset.csv by artist by taking the average of track statistics for each artist, named average_data

In [13]:
average_data = data.groupby('artists')[numerical_features].mean().reset_index()

Sort by highest to lowest artist popularity for organizational purposes

In [14]:
average_data = average_data.sort_values(by='artist_popularity', ascending = False)

Merge categorical dataframe to average_data dataset

In [15]:
average_data = pd.merge(average_data, categorical_features, on='artists')


In [16]:
average_data

Unnamed: 0,artists,popularity,artist_popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,mode_explicit,mode_key,mode_mode,mode_time_signature,mode_genre
0,Taylor Swift,86.000000,100.0,12.475802,0.532000,0.623,-9.208000,-3.408222,-0.619897,-9.527795,-2.380547,0.403000,89.937000,False,5,1,4,Pop
1,Drake;21 Savage,91.000000,93.0,12.293919,0.529000,0.673,-4.711000,-1.742969,-8.088663,-12.935884,-2.375156,0.366000,165.921000,True,0,1,4,Hip-Hop/Rap
2,Drake;Travis Scott,83.000000,93.0,12.581680,0.666000,0.465,-8.545000,-1.347074,-2.989750,0.000000,-1.537117,0.292000,167.937000,True,1,1,4,Hip-Hop/Rap
3,The Weeknd;Daft Punk,3.000000,91.0,12.503158,0.773000,0.820,-5.897000,-2.162823,-0.931404,0.000000,-2.605042,0.555000,92.996000,False,0,0,4,Pop
4,Ariana Grande,82.666667,90.0,12.121431,0.761333,0.680,-5.627000,-2.506097,-1.224338,0.000000,-2.020347,0.526333,120.319667,True,0,1,4,Pop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13854,Tisoki;Shreya Jain;MAGIC;Shashwat Sachdev,31.000000,0.0,12.004434,0.737000,0.903,-2.278000,-2.365525,-4.866535,-0.294371,-2.460409,0.179000,138.014000,False,5,0,4,Electronic
13855,Hardcore Masif;Weaver;Suae,24.000000,0.0,12.770949,0.491000,0.941,-4.444000,-1.883875,-6.156339,-1.995100,-2.850767,0.248000,168.959000,False,11,1,4,Other/Miscellaneous
13856,Hardcore Masif;Weaver;Steve Hill,25.333333,0.0,12.815809,0.341667,0.978,-5.264667,-2.651806,-5.673034,-0.388656,-1.556825,0.250667,168.994000,False,7,0,4,Other/Miscellaneous
13857,Bantunagojeje,19.000000,0.0,12.328711,0.572000,0.730,-12.687000,-3.239079,-9.416135,-0.196015,-2.253795,0.815000,129.706000,False,1,1,4,World/Folk


# Predict 5 Similar Artists Using User Input and average_data

## Preprocess average_data

We have several different 'types' of categorical features. Along with the quantitative features that can easily be scaled, we also have ordinal and nominal features with different representations (boolean, integer, string). We need to ensure that the model interprets these values correctly through further preprocessing. 

Specifically, mode_key, mode_mode, mode_time_signature, mode_genre, and mode_explicit is what we need to "worry" about. Key is ordinal, but is already in numeric representation. So to ensure that the model interprets key correctly, we will need to one-hot encode this. Similarly goes for mode_mode, time_signature, and track_genre, all of which are nominal features. 

mode_explicit has boolean representation, so we will have to convert those booleans into numerical representations of 0 and 1.

We will create a new Dataframe, data_encoded, to place our modifications.

In [17]:
#One-Hot Encoding
one_hot_encoding_cols = ['mode_key', 'mode_mode', 'mode_time_signature', 'mode_genre']
data_encoded = pd.get_dummies(average_data, columns=one_hot_encoding_cols) #Note that this gives boolean outputs

In [18]:
#Boolean to numerical conversion
boolean_cols = ['mode_explicit','mode_key_0','mode_key_1','mode_key_2','mode_key_3','mode_key_4','mode_key_5',
                'mode_key_6','mode_key_7','mode_key_8','mode_key_9','mode_key_10','mode_key_11','mode_mode_0',
                'mode_mode_1','mode_time_signature_0','mode_time_signature_1','mode_time_signature_3',
                'mode_time_signature_4','mode_time_signature_5','mode_genre_Classical','mode_genre_Electronic',
                'mode_genre_Hip-Hop/Rap','mode_genre_Jazz','mode_genre_Other/Miscellaneous','mode_genre_Pop',
                'mode_genre_Rock','mode_genre_World/Folk','mode_genre_alternative']
data_encoded[boolean_cols] = data_encoded[boolean_cols].astype(int)

In [19]:
data_encoded

Unnamed: 0,artists,popularity,artist_popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,...,mode_time_signature_5,mode_genre_Classical,mode_genre_Electronic,mode_genre_Hip-Hop/Rap,mode_genre_Jazz,mode_genre_Other/Miscellaneous,mode_genre_Pop,mode_genre_Rock,mode_genre_World/Folk,mode_genre_alternative
0,Taylor Swift,86.000000,100.0,12.475802,0.532000,0.623,-9.208000,-3.408222,-0.619897,-9.527795,...,0,0,0,0,0,0,1,0,0,0
1,Drake;21 Savage,91.000000,93.0,12.293919,0.529000,0.673,-4.711000,-1.742969,-8.088663,-12.935884,...,0,0,0,1,0,0,0,0,0,0
2,Drake;Travis Scott,83.000000,93.0,12.581680,0.666000,0.465,-8.545000,-1.347074,-2.989750,0.000000,...,0,0,0,1,0,0,0,0,0,0
3,The Weeknd;Daft Punk,3.000000,91.0,12.503158,0.773000,0.820,-5.897000,-2.162823,-0.931404,0.000000,...,0,0,0,0,0,0,1,0,0,0
4,Ariana Grande,82.666667,90.0,12.121431,0.761333,0.680,-5.627000,-2.506097,-1.224338,0.000000,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13854,Tisoki;Shreya Jain;MAGIC;Shashwat Sachdev,31.000000,0.0,12.004434,0.737000,0.903,-2.278000,-2.365525,-4.866535,-0.294371,...,0,0,1,0,0,0,0,0,0,0
13855,Hardcore Masif;Weaver;Suae,24.000000,0.0,12.770949,0.491000,0.941,-4.444000,-1.883875,-6.156339,-1.995100,...,0,0,0,0,0,1,0,0,0,0
13856,Hardcore Masif;Weaver;Steve Hill,25.333333,0.0,12.815809,0.341667,0.978,-5.264667,-2.651806,-5.673034,-0.388656,...,0,0,0,0,0,1,0,0,0,0
13857,Bantunagojeje,19.000000,0.0,12.328711,0.572000,0.730,-12.687000,-3.239079,-9.416135,-0.196015,...,0,0,0,0,0,0,0,0,1,0


Define features and target in the dataset

In [20]:
features = data_encoded.drop(columns=['artists'])
target = data_encoded['artists']

Scale numerical features

In [21]:
X = data_encoded[numerical_features]  # quantitative features of artists
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Scale the quantitative features

## Train K-Neighbor Classifier Model with Principal Component Analysis (PCA)

In [22]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=440)

Due to the large amount of features, we (and the computer) could easily get loss in our analysis. For a better understanding of our data, we can use PCA to reduce dimensionality.

In [23]:
# Perform PCA for dimensionality reduction
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [24]:
# Train the k-Neighbor Classifier model
k = 5  # Number of neighbors to consider
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_pca, y_train)

## User Inputs an Artist to Predict 5 Similar Artists

In [25]:
# Get input artist from the user
input_artist = input("Enter the name of the artist: ")

# Find the index of the input artist in the dataset
input_index = data_encoded.index[average_data['artists'] == input_artist].tolist()

# If input artist not found
if not input_index:
    print("Artist not found.")
else:
    # Get the features of the input artist
    query_features = features.iloc[input_index]
    
    query_artist_array = query_features.values.reshape(1, -1)
    
    # Transform the features of the query artist using the PCA transformation
    query_artist_pca = pca.transform(query_artist_array)

    # Find k similar artists
    distances, indices = knn.kneighbors(query_artist_pca)

    # Print the indices of similar artists
    print("Indices of similar artists:", indices)

    # Get the details of similar artists from the dataset
    similar_artists = data_encoded.iloc[indices[0]]

    # Print the details of similar artists
    print("Similar Artists:")
    print(similar_artists)

Indices of similar artists: [[8191  150 3607 8064 6018]]
Similar Artists:
                                         artists  popularity  \
8191                                  Rei Yasuda        52.0   
150         Marshmello;Southside;Giggs;SAINt JHN        49.0   
3607                       Masicka;Bounty Killer        20.0   
8064  Consoul Trainin;Steven Aderinto;DuoViolins        52.0   
6018       Estas Tonne;Istvan Sky;Pablo Arellano        41.0   

      artist_popularity  duration_ms  danceability  energy  loudness  \
8191               38.0    12.374067        0.5170  0.9160    -3.831   
150                80.0    12.241150        0.8780  0.7150    -2.069   
3607               55.0    11.928750        0.6520  0.8470    -5.942   
8064               39.0    12.315236        0.7655  0.7395    -5.516   
6018               46.0    13.828831        0.2750  0.2040   -17.649   

      speechiness  acousticness  instrumentalness  ...  mode_time_signature_5  \
8191    -2.682382     -6.98



# Summary

Given a dataset of tracks with 20+ features, we were able to convert it into a new dataframe (average_data) of artists with those same features using the averages of quantitiative features, and the mode for quantitative features. From there, we were able to create a 5-neighbors classifier model to predict 5 similar artists using our new dataframe. We ensured that our categorical features were treated as such to avoid any discrepancies through encoding methods. We also reduced our dataframe's dimensionality using PCA to better understand the relationships of our data. Some limitations are how in the dataset, some artists have very little tracks, which isn't an accurate representation of their work. Some tracks also contain "featured artists", where the features of those tracks are dependent on all artists present.