
This is a dataset of Spotify tracks over a range of 125 different genres. Each track has some audio features associated with it. The data is in CSV format which is tabular and can be loaded quickly.

Usage
The dataset can be used for:

1. Building a Recommendation System based on some user input or preference
2. Classification purposes based on audio features and available genres
3. Any other application that you can think of. Feel free to discuss!

Column Description
track_id: The Spotify ID for the track

artists: The artists' names who performed the track. If there is more than one artist, they are separated by a ;

album_name: The album name in which the track appears

track_name: Name of the track

popularity: The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity.

duration_ms: The track length in milliseconds

explicit: Whether or not the track has explicit lyrics (true = yes it does; false = no it does not OR unknown)

danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable

energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale

key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1

loudness: The overall loudness of a track in decibels (dB)

mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0

speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks

acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic

instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content

liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live

valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)

tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration

time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of 3/4 to 7/4.

track_genre: The genre in which the track belongs
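
Several of the columns described above are encoded rather than human-readable. The sketch below shows one possible way to decode three of them: splitting the `;`-separated `artists` string, turning `key` and `mode` into a label, and bucketing `speechiness` by the 0.33 and 0.66 thresholds. The helper names (`key_label`, `speechiness_bucket`) and the bucket labels are illustrative assumptions, not part of the dataset, and the treatment of values exactly at a threshold is a choice made here.

```python
import pandas as pd

# Standard pitch-class names for key values 0..11 (0 = C, 1 = C♯/D♭, 2 = D, ...).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def key_label(key: int, mode: int) -> str:
    """Hypothetical helper: readable label from the `key` and `mode` columns."""
    if key == -1:
        return "unknown key"  # -1 means no key was detected
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"


def speechiness_bucket(value: float) -> str:
    """Hypothetical helper: bucket by the 0.33 / 0.66 thresholds."""
    if value > 0.66:
        return "mostly speech"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"


# Three rows with values copied from the dataset preview.
toy = pd.DataFrame({
    "artists": ["Gen Hoshino", "Ingrid Michaelson;ZAYN", "Chord Overstreet"],
    "key": [1, 0, 2],
    "mode": [0, 1, 1],
    "speechiness": [0.143, 0.0557, 0.0526],
})

toy["artist_list"] = toy["artists"].str.split(";")  # one list of names per track
toy["key_label"] = [key_label(k, m) for k, m in zip(toy["key"], toy["mode"])]
toy["speech_bucket"] = toy["speechiness"].map(speechiness_bucket)

print(toy[["artist_list", "key_label", "speech_bucket"]])
```

If one row per artist is needed, `toy.explode("artist_list")` would repeat each track once for every listed artist.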

Goal: we want recommendations for similar artists, different songs from the same artist, and tracks from the same genre.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
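from sklearn.model_selection import train_test_split as sk_train_test_split  # aliased: surprise's train_test_split is imported below under the same name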

from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.similarities import cosine, msd, pearson

In [2]:
spotify_df = pd.read_csv('Data/dataset.csv', index_col = 0)
spotify_df.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


In [3]:
spotify_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 114000 entries, 0 to 113999
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   track_id          114000 non-null  object 
 1   artists           113999 non-null  object 
 2   album_name        113999 non-null  object 
 3   track_name        113999 non-null  object 
 4   popularity        114000 non-null  int64  
 5   duration_ms       114000 non-null  int64  
 6   explicit          114000 non-null  bool   
 7   danceability      114000 non-null  float64
 8   energy            114000 non-null  float64
 9   key               114000 non-null  int64  
 10  loudness          114000 non-null  float64
 11  mode              114000 non-null  int64  
 12  speechiness       114000 non-null  float64
 13  acousticness      114000 non-null  float64
 14  instrumentalness  114000 non-null  float64
 15  liveness          114000 non-null  float64
 16  valence           114000 

In [4]:
spotify_df.isna().sum()

track_id            0
artists             1
album_name          1
track_name          1
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

### Check for Duplicate Values

In [5]:
spotify_df.duplicated().sum()

450

There are 450 identified duplicates. In this case we drop them.

In [6]:
# dropping duplicates

spotify_df = spotify_df.drop_duplicates()
spotify_df.shape

(113550, 20)
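
Note that `duplicated()` with no arguments only flags rows that match in every column. As a small sketch on made-up rows (the toy values below are assumptions, not taken from the dataset), the same `track_id` could still appear more than once after `drop_duplicates()` if, for example, it is listed under two genres. Passing `subset=` makes that visible:

```python
import pandas as pd

# Made-up rows: track "a" is an exact duplicate, track "b" differs only in genre.
toy = pd.DataFrame({
    "track_id": ["a", "a", "b", "b"],
    "track_genre": ["pop", "pop", "rock", "indie"],
    "popularity": [50, 50, 30, 30],
})

full_row_dupes = int(toy.duplicated().sum())                      # identical in every column
same_track_dupes = int(toy.duplicated(subset=["track_id"]).sum())  # repeated track_id only

deduped = toy.drop_duplicates()  # what the notebook does: removes exact copies only

print(full_row_dupes, same_track_dupes, len(deduped))
```

Whether repeated `track_id` values should also be collapsed depends on the use case; for a genre classifier they may be informative, while for a rating matrix they may count the same item twice.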

In [7]:
spotify_df.describe()

Unnamed: 0,popularity,duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,113550.0,113550.0,113550.0,113550.0,113550.0,113550.0,113550.0,113550.0,113550.0,113550.0,113550.0,113550.0,113550.0,113550.0
mean,33.324139,228079.4,0.567031,0.64209,5.309467,-8.243419,0.63786,0.084674,0.314067,0.155702,0.213611,0.474207,122.175888,3.904218
std,22.283976,106414.8,0.173408,0.251052,3.560134,5.011401,0.480621,0.105761,0.331907,0.309216,0.190461,0.259204,29.972861,0.432115
min,0.0,0.0,0.0,0.0,0.0,-49.531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,17.0,174180.2,0.456,0.473,2.0,-9.99775,0.0,0.0359,0.0168,0.0,0.098,0.26,99.2965,4.0
50%,35.0,213000.0,0.58,0.685,5.0,-6.997,1.0,0.0489,0.168,4.1e-05,0.132,0.464,122.02,4.0
75%,50.0,261587.8,0.695,0.854,8.0,-5.001,1.0,0.0845,0.596,0.048675,0.273,0.683,140.07375,4.0
max,100.0,5237295.0,0.985,1.0,11.0,4.532,1.0,0.965,0.996,1.0,1.0,0.995,243.372,5.0


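From this summary, we notice that our numerical columns have very different ranges (for example, `duration_ms` reaches into the millions while `danceability` stays between 0 and 1). We will therefore use `StandardScaler` to standardize these columns.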
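
As a minimal sketch of what standardization does, the snippet below applies `StandardScaler` to two columns with very different ranges and checks the result against the formula z = (x - mean) / std computed by hand. The five rows are `tempo` and `danceability` values copied from the dataset preview; the choice of these two columns is only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: tempo (BPM), danceability (0 to 1); values from the first five preview rows.
X = np.array([
    [87.917, 0.676],
    [77.489, 0.420],
    [76.332, 0.438],
    [181.740, 0.266],
    [119.949, 0.618],
])

scaled = StandardScaler().fit_transform(X)

# The same transformation by hand: StandardScaler uses the population
# standard deviation (ddof=0), which is NumPy's default for .std().
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and unit standard deviation, so no single feature dominates distance-based computations merely because of its units.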

## Preprocessing For Modelling

In [8]:
spotify_df.columns

Index(['track_id', 'artists', 'album_name', 'track_name', 'popularity',
       'duration_ms', 'explicit', 'danceability', 'energy', 'key', 'loudness',
       'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature', 'track_genre'],
      dtype='object')

In [9]:
columns_to_drop = ['album_name', 'track_name', 'time_signature']
filtered_spotify_df = spotify_df.drop(columns=columns_to_drop)
filtered_spotify_df

Unnamed: 0,track_id,artists,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,73,230666,False,0.676,0.4610,1,-6.746,0,0.1430,0.0322,0.000001,0.3580,0.7150,87.917,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,55,149610,False,0.420,0.1660,1,-17.235,1,0.0763,0.9240,0.000006,0.1010,0.2670,77.489,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,57,210826,False,0.438,0.3590,0,-9.734,1,0.0557,0.2100,0.000000,0.1170,0.1200,76.332,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.9050,0.000071,0.1320,0.1430,181.740,acoustic
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,82,198853,False,0.618,0.4430,2,-9.681,1,0.0526,0.4690,0.000000,0.0829,0.1670,119.949,acoustic
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113995,2C3TZjDRiAzdyViavDJ217,Rainy Lullaby,21,384999,False,0.172,0.2350,5,-16.393,1,0.0422,0.6400,0.928000,0.0863,0.0339,125.995,world-music
113996,1hIz5L4IB9hN3WRYPOCGPw,Rainy Lullaby,22,385000,False,0.174,0.1170,0,-18.318,0,0.0401,0.9940,0.976000,0.1050,0.0350,85.239,world-music
113997,6x8ZfSoqDjuNa5SVP5QjvX,Cesária Evora,22,271466,False,0.629,0.3290,0,-10.895,0,0.0420,0.8670,0.000000,0.0839,0.7430,132.378,world-music
113998,2e6sXL2bYv4bSz6VTdnfLs,Michael W. Smith,41,283893,False,0.587,0.5060,7,-10.889,1,0.0297,0.3810,0.000000,0.2700,0.4130,135.960,world-music


Since we do not have user-interaction data such as ratings, we are going to use `popularity` as a stand-in for user ratings. We will therefore use a `MinMaxScaler` to rescale our `popularity` column to a rating scale of 1 to 5.
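
The rescaling itself is a linear map: rating = 1 + 4 * (popularity - min) / (max - min), where min and max are the smallest and largest popularity values seen during fitting. The sketch below writes this out with NumPy; `popularity_to_rating` is a hypothetical helper name, and the input values assume the fitted column spans the full 0 to 100 range.

```python
import numpy as np


def popularity_to_rating(popularity, lo=1.0, hi=5.0):
    """Hypothetical helper: min-max rescale popularity values to [lo, hi]."""
    p = np.asarray(popularity, dtype=float)
    p_min, p_max = p.min(), p.max()
    if p_max == p_min:
        raise ValueError("all popularity values are equal; cannot rescale")
    return lo + (p - p_min) * (hi - lo) / (p_max - p_min)


# 0 and 100 pin the range; 36, 37 and 17 are popularity values from the sample shown later.
ratings = popularity_to_rating([0, 36, 37, 17, 100])
print(ratings.round(2))
```

Because the map depends on the observed minimum and maximum, fitting the scaler on a different sample can shift every rating slightly.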

Moreover, we do not have `user_id` or any other user identifier, so we need to decide how we want to model user behaviour. The options are:
1. `track_genre` - treat each genre as a "user". This approach works if we want to simulate users who prefer specific genres: each genre-user interacts with multiple tracks within that genre, and the ratings vary with each track's popularity.
2. using `track_name` -
3. `artists` - treat each artist as a "user", so that recommendations are based on specific artists. Many users have preferences for certain artists and listen to multiple tracks by the same artist, which gives those synthetic user IDs more interactions and data points. This is also a more granular approach and may lead to more personalized recommendations.
   >> This, however, loses the genre preference information, as users might prefer a mix of styles from an artist. Also, users might like various artists in the same genre, and this approach will not reflect that.
4. Combination of `artists` and `track_genre` - allows capturing a wider range of user preferences, as a user may enjoy multiple tracks from various artists across different genres.
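
To compare the options above, it can help to count how many synthetic users each one produces and how many tracks each user ends up with. The sketch below does this on four made-up rows (the artist names, genres, and the `options` dictionary are assumptions for illustration only).

```python
import pandas as pd

# Made-up tracks: one artist spans two genres, and one genre is shared by three artists.
toy = pd.DataFrame({
    "track_id": ["t1", "t2", "t3", "t4"],
    "artists": ["Artist A", "Artist A", "Artist B", "Artist C"],
    "track_genre": ["rock", "pop", "rock", "rock"],
})

options = {
    "genre": toy["track_genre"],                               # option 1
    "artist": toy["artists"],                                  # option 3
    "artist_genre": toy["artists"] + "_" + toy["track_genre"],  # option 4
}

summary = pd.DataFrame({
    "n_users": {name: ids.nunique() for name, ids in options.items()},
    "mean_tracks_per_user": {name: ids.value_counts().mean() for name, ids in options.items()},
})
print(summary)
```

The finer the ID, the more users there are and the fewer tracks each one rates, which makes the user-item matrix sparser and neighbourhood-based models harder to fit.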

In [10]:
filtered_spotify_df.shape

(113550, 17)

In [11]:
from surprise import Dataset, Reader, SVD, KNNBaseline
from sklearn.preprocessing import MinMaxScaler
from surprise import accuracy
from scipy.sparse import csr_matrix
from surprise.model_selection import train_test_split

filtered_spotify_df = filtered_spotify_df.sample(n=20000, random_state=42)
# Create a combined synthetic user ID by concatenating artists and track_genre
filtered_spotify_df['user_id'] = filtered_spotify_df['artists'] + '_' + filtered_spotify_df['track_genre']

# Normalize popularity scores to fit a rating scale of 1 to 5
scaler = MinMaxScaler(feature_range=(1, 5))
filtered_spotify_df['rating'] = scaler.fit_transform(filtered_spotify_df[['popularity']])

# Check for NaN values and drop them if necessary
filtered_spotify_df.dropna(subset=['user_id', 'track_id', 'rating'], inplace=True)


# Define the Reader and load the data
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(filtered_spotify_df[['user_id', 'track_id', 'rating']], reader)



# Split the dataset into training and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# Display the first few rows of the filtered DataFrame for verification
filtered_spotify_df.head()
# filtered_spotify_df.shape


Unnamed: 0,track_id,artists,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,track_genre,user_id,rating
47730,6cLVxeHljKgjD1mQ75QKhp,Pitty,36,166683,False,0.431,0.768,11,-7.065,0,0.0385,8.8e-05,0.00122,0.113,0.659,82.132,hard-rock,Pitty_hard-rock,2.44
96769,6PphhDw4Fa1U2TJkhejpdD,Akatu,37,105389,False,0.652,0.701,2,-8.178,1,0.0348,0.532,2e-06,0.602,0.928,158.124,samba,Akatu_samba,2.48
6993,1LL4f5D8E5sdURjuADBD8s,Carnifex,17,175906,False,0.573,0.976,1,-4.004,0,0.179,4.6e-05,0.0222,0.272,0.315,114.02,black-metal,Carnifex_black-metal,1.68
92786,6SesqTxqo0FlU6XlHNHKuG,The Delta Bombers,23,211213,False,0.593,0.614,8,-7.134,0,0.0303,0.659,0.0135,0.369,0.531,81.557,rockabilly,The Delta Bombers_rockabilly,1.92
61286,2kAhqjQ6Tjsj9qUbJDs8Ih,≠ME,26,265760,False,0.535,0.685,1,-4.9,0,0.03,0.553,0.0,0.199,0.212,80.008,j-idol,≠ME_j-idol,2.04


In [12]:
unique_users = filtered_spotify_df['user_id'].nunique()
unique_tracks = filtered_spotify_df['track_id'].nunique()
print(f'Unique users: {unique_users}, Unique tracks: {unique_tracks}')


Unique users: 12361, Unique tracks: 19000


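Rows with missing values in `user_id`, `track_id`, or `rating` were dropped by the `dropna` call above. In the full dataset only `artists` has a missing value among the inputs to these columns, and any missing artist makes the concatenated `user_id` missing as well.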

In [13]:
# Split the dataset into training and test sets
# trainset, testset = train_test_split(data, test_size=0.2)

# # Get the training set as a list of tuples
# trainset_data = [(trainset.to_raw_uid(u), trainset.to_raw_iid(i), r) 
#                  for (u, i, r) in trainset.all_ratings()]


In [14]:
# Create a DataFrame
# trainset_df = pd.DataFrame(trainset_data, columns=['user_id', 'track_id', 'rating'])
# trainset_df


### Model 1 - KNNBaseline 


In [17]:
from surprise import Dataset, Reader, accuracy
from surprise.model_selection import train_test_split


# Step 1: Prepare your DataFrame (Assuming 'filtered_spotify_df' is your DataFrame)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(filtered_spotify_df[['user_id', 'track_id', 'rating']], reader)

# Step 2: Split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# Step 3: Define your model and parameter grid
param_grid = {
    'k': [5, 10, 15, 20],
    'min_k': [1, 5],
    'sim_options': {
        'name': ['cosine', 'pearson'],
        'user_based': [True, False]
    }
}

# Step 4: Perform grid search for KNNBaseline
grid_search_knn = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'], cv=3)
grid_search_knn.fit(data)



Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
...
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
...
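
The length of this log follows from the size of the grid. Assuming Surprise's `GridSearchCV` expands the parameter grid as a full Cartesian product, including the lists nested under `sim_options`, the number of model fits can be estimated as below. `count_combinations` is a hypothetical helper written for this sketch.

```python
# Same grid as in the cell above.
param_grid = {
    "k": [5, 10, 15, 20],
    "min_k": [1, 5],
    "sim_options": {
        "name": ["cosine", "pearson"],
        "user_based": [True, False],
    },
}


def count_combinations(grid: dict) -> int:
    """Hypothetical helper: size of the Cartesian product, recursing into nested dicts."""
    total = 1
    for values in grid.values():
        if isinstance(values, dict):
            total *= count_combinations(values)
        else:
            total *= len(values)
    return total


n_combinations = count_combinations(param_grid)
n_fits = n_combinations * 3  # cv=3 folds per combination

print(n_combinations, n_fits)
```

Each fit recomputes the baseline estimates and a similarity matrix, which is why the three log lines repeat so many times.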

In [19]:
# Step 5: Get the best KNN model
best_knn_model = grid_search_knn.best_estimator['rmse']

# Step 6: Fit the best model to the training set
best_knn_model.fit(trainset)

# Step 7: Predict ratings for the training set
# You can use the test method on the trainset
train_predictions = best_knn_model.test(trainset.build_testset())

# Step 8: Calculate RMSE for the training set predictions
train_rmse = accuracy.rmse(train_predictions)

# # Step 9: Predict ratings for the test set
# test_predictions = best_knn_model.test(testset)

# Step 10: Calculate RMSE for the test set predictions
# test_rmse = accuracy.rmse(test_predictions)

# Print the results
print(f"Best KNN Model Parameters: {grid_search_knn.best_params['rmse']}")
print(f"Training RMSE: {train_rmse:.4f}")
# print(f"Test RMSE: {test_rmse:.4f}")


Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.0127
Best KNN Model Parameters: {'k': 5, 'min_k': 1, 'sim_options': {'name': 'cosine', 'user_based': True}}
Training RMSE: 0.0127


### Model 2 - SVD algorithm

In [20]:
# Step 3: Define parameter grid for SVD
param_grid = {
    'n_factors': [50, 100, 150],  # Number of latent factors
    'n_epochs': [20, 30, 40],     # Number of epochs
    'lr_all': [0.005, 0.01],      # Learning rate for all parameters
    'reg_all': [0.02, 0.1],       # Regularization term
}

# Step 4: Perform grid search for SVD
grid_search_svd = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
grid_search_svd.fit(data)
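
For orientation, here is a rough sketch of how an SVD-style model of this kind turns its learned parameters into a rating, assuming the usual biased matrix-factorization form: prediction = global mean + user bias + item bias + dot product of the user and item factor vectors, clipped to the rating scale. All numbers below are made up; `n_factors` is the length of the two factor vectors, and `predict_rating` is a hypothetical helper, not a Surprise function.

```python
import numpy as np

rng = np.random.default_rng(0)

n_factors = 50          # one of the grid values above
mu = 2.3                # made-up global mean rating
b_u, b_i = 0.1, -0.2    # made-up user and item biases
p_u = rng.normal(0, 0.1, n_factors)  # made-up user factor vector
q_i = rng.normal(0, 0.1, n_factors)  # made-up item factor vector


def predict_rating(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Hypothetical helper: biased matrix-factorization prediction, clipped to [low, high]."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return float(np.clip(raw, low, high))


print(predict_rating(mu, b_u, b_i, p_u, q_i))
```

In this picture, `n_epochs` and `lr_all` control how long and how fast the parameters are updated by gradient descent, and `reg_all` penalizes large biases and factors.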




In [21]:
# Step 5: Get the best SVD model
best_svd_model = grid_search_svd.best_estimator['rmse']

# Step 6: Fit the best model to the training set
best_svd_model.fit(trainset)

# Step 7: Predict ratings for the training set
train_predictions = best_svd_model.test(trainset.build_testset())

In [23]:

# Step 8: Calculate RMSE for the training set predictions
train_rmse = accuracy.rmse(train_predictions)

# Step 9: Predict ratings for the test set
# test_predictions = best_svd_model.test(testset)

# Step 10: Calculate RMSE for the test set predictions
# test_rmse = accuracy.rmse(test_predictions)

# Print the results
print(f"Best SVD Model Parameters: {grid_search_svd.best_params['rmse']}")
print(f"Training RMSE: {train_rmse:.4f}")
# print(f"Test RMSE: {test_rmse:.4f}")

RMSE: 0.2288
Best SVD Model Parameters: {'n_factors': 50, 'n_epochs': 40, 'lr_all': 0.01, 'reg_all': 0.02}
Training RMSE: 0.2288


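On the training set, the SVD model's RMSE (0.2288) is higher than the KNNBaseline model's (0.0127). Both figures are training errors, so they mostly reflect how closely each model fits data it has already seen. The test-set evaluation is still commented out above, and a comparison on the held-out test set is needed before concluding that the KNNBaseline model performs better than the SVD algorithm.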
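
To illustrate why a training RMSE alone can be misleading for neighbour-based models, the sketch below uses scikit-learn's `KNeighborsRegressor` on synthetic data. This is a stand-in technique chosen for a self-contained example, not the Surprise models from this notebook, and the data is randomly generated. With one neighbour, every training point is its own nearest neighbour, so the training error is zero regardless of how well the model generalizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)

# Synthetic features and noisy "ratings" roughly in the 1 to 5 range.
X = rng.uniform(0, 1, size=(500, 3))
y = 1 + 4 * X[:, 0] + rng.normal(0, 0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


model = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)

train_rmse = rmse(y_train, model.predict(X_train))  # the model has seen these points
test_rmse = rmse(y_test, model.predict(X_test))     # held-out points

print(f"train RMSE: {train_rmse:.4f}, test RMSE: {test_rmse:.4f}")
```

The same check applies here: calling `.test(testset)` on each fitted model and comparing those RMSE values would give a fairer ranking than the training figures.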

In [30]:

numerical_features = filtered_spotify_df.select_dtypes(include=['float64', 'int64']).columns
categorical_features = filtered_spotify_df.select_dtypes(include=['object']).columns

In [21]:
numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])

# transform categorical data
categorical_transformer = Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore'))])

In [22]:
# combine steps using columnTransformer
preprocessor = ColumnTransformer(transformers= [
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

In [23]:
# full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

In [24]:
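# apply pipeline to the same DataFrame the feature lists were derived from
# (spotify_df lacks the user_id and rating columns, so the ColumnTransformer would fail on it)
spotify_processed = pipeline.fit_transform(filtered_spotify_df)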