# Datasets

- https://www.kaggle.com/datasets/notshrirang/spotify-million-song-dataset
- https://www.kaggle.com/datasets/tonygordonjr/spotify-dataset-2023?select=spotify-albums_data_2023.csv
- https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks?select=tracks.csv

# Links
- https://forecastegy.com/posts/xgboost-multiclass-classification-python/
- https://github.com/jannine92/spotify_recommendation/blob/main/music_recommender.ipynb
- https://www.kaggle.com/code/nyjoey/spotify-clustering
- https://ausaf-a.github.io/ml-song-recommender/
- https://medium.com/@Marlon_H/spotify-clustering-f41b40003c9a
- https://www.kaggle.com/code/choongqianzheng/song-genre-classification-system
- https://developer.spotify.com/documentation/web-api/reference/get-audio-features
- https://medium.com/@miguelrodrigueznovelo/discover-your-perfect-playlist-10-songs-recommended-by-a-music-recommendation-system-with-python-5fd246d87127
- https://medium.com/@shruti.somankar/building-a-music-recommendation-system-using-spotify-api-and-python-f7418a21fa41
- https://www.kaggle.com/code/merveeyuboglu/music-recommendation-system-cosine-s


# ToDo List:

- Basic stuff✅
  - Load Data✅
  -  Display, Info and Describe data✅
  - Split Datasets into song_metrics and song_info✅
- Data Visualization (Also in Percent if valuable)✅
  - Visualize Correlation Heatmap✅
  - Display Genres as Numbers and Histogram✅
  - Display Genre Dendogram✅
  - Display most frequent artists✅
  - Display most popular artist✅
  - Plot Popularity as histogram✅
  - Plot Average Song metric for Each genre (could also be on a 3D plot)✅
  - Plot Box plots to detect outliers✅
- Features
  - Apply Standard and MinMaxScaler ✅
  - Apply and Visualize PCA and t-SNE / UMAP
  - Use Silhouette Score to see how many clusters are needed (also try fancy plot from Medium)
  - Use KMeans to start
  - Use DBSCAN
  - Use Agglomerative Clustering
  - Use HDBSCAN
  - Use XGBClassifier with Cross Validation
- Feature Extensions
  - Plot Similar Artists
  - Plot Similar Genres
  - (Plot Similar Songs [Only small set of Data here])
- Possible Uses:
  - Put song into spotify api, get song data back, and use that to find similar songs (with possibility to get different artists than the one from the provided song)
  - Put Song in, get similar artist (you could also put multiple songs in, but I dont think that this is worth it)
  - Simulate entering a whole user profile, from which we can take the average song data and get new artists this way (which are not in here)
- Things missing
  - We dont have the release date or listening date, so we cannot use time as a feature. This could create even better recommendations, because we would know what the user currently listens to and weigh it  

# Load and View Dataset

In [1]:
import pandas as pd

results = []
for i in range(3):
    data = pd.read_parquet(f'spotify_data_part_{i+1}.parquet')
    results.append(data)

original_data = pd.concat(results)


original_data["year"] = pd.to_datetime(original_data["year"], format='%Y')
original_data = original_data.dropna(subset=["danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms", "time_signature", "popularity", "track_id", "track_name", "artist_name", "year"])
original_data = original_data.drop_duplicates(subset=["track_name", "artist_name", "danceability", "energy", "key", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature"])
original_data = original_data.reset_index(drop=True)
original_data = original_data.drop(columns=["Unnamed: 0"])
display(original_data.head())
display(original_data.describe())
print(original_data.info())

Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,68.0,2012-01-01,acoustic,0.483,0.303,4.0,-10.058,1.0,0.0429,0.694,0.0,0.115,0.139,133.406,240166.0,3.0
1,Jason Mraz,93 Million Miles,1s8tP3jP4GZcyHDsjvw218,50.0,2012-01-01,acoustic,0.572,0.454,3.0,-10.286,1.0,0.0258,0.477,1.4e-05,0.0974,0.515,140.182,216387.0,4.0
2,Joshua Hyslop,Do Not Let Me Go,7BRCa8MPiyuvr2VU3O9W0F,57.0,2012-01-01,acoustic,0.409,0.234,3.0,-13.711,1.0,0.0323,0.338,5e-05,0.0895,0.145,139.832,158960.0,4.0
3,Boyce Avenue,Fast Car,63wsZUhUZLlh1OsyrZq7sz,58.0,2012-01-01,acoustic,0.392,0.251,10.0,-9.845,1.0,0.0363,0.807,0.0,0.0797,0.508,204.961,304293.0,4.0
4,Andrew Belle,Sky's Still Blue,6nXIYClvJAfi6ujLiKqEq8,54.0,2012-01-01,acoustic,0.43,0.791,6.0,-5.419,0.0,0.0302,0.0726,0.0193,0.11,0.217,171.864,244320.0,4.0


Unnamed: 0,popularity,year,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
count,2013313.0,2013313,2013313.0,2013313.0,2013313.0,2013313.0,2013313.0,2013313.0,2013313.0,2013313.0,2013313.0,2013313.0,2013313.0,2013313.0,2013313.0
mean,18.48877,2005-10-05 01:00:34.158026624,0.5491744,0.5904303,5.256363,-9.799186,0.6422682,0.1010894,0.3714914,0.2211444,0.2162194,0.4809441,120.0666,238511.7,3.878141
min,0.0,1886-01-01 00:00:00,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2073.0,0.0
25%,3.0,2002-01-01 00:00:00,0.428,0.387,2.0,-12.176,0.0,0.036,0.0206,0.0,0.0979,0.254,97.003,175125.0,4.0
50%,15.0,2011-01-01 00:00:00,0.565,0.622,5.0,-8.334,1.0,0.0487,0.249,0.000438,0.133,0.474,120.01,218547.0,4.0
75%,30.0,2018-01-01 00:00:00,0.686,0.825,8.0,-5.815,1.0,0.088,0.721,0.424,0.279,0.702,138.254,273987.0,4.0
max,100.0,2023-01-01 00:00:00,0.999,1.0,11.0,6.172,1.0,0.971,0.996,1.0,1.0,1.0,249.993,6000495.0,5.0
std,16.74866,,0.1819063,0.2708,3.549231,5.787686,0.4793327,0.1522333,0.3599918,0.3519784,0.1926347,0.2695143,29.96391,140513.6,0.4832763


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013313 entries, 0 to 2013312
Data columns (total 19 columns):
 #   Column            Dtype         
---  ------            -----         
 0   artist_name       object        
 1   track_name        object        
 2   track_id          object        
 3   popularity        float64       
 4   year              datetime64[ns]
 5   genre             object        
 6   danceability      float64       
 7   energy            float64       
 8   key               float64       
 9   loudness          float64       
 10  mode              float64       
 11  speechiness       float64       
 12  acousticness      float64       
 13  instrumentalness  float64       
 14  liveness          float64       
 15  valence           float64       
 16  tempo             float64       
 17  duration_ms       float64       
 18  time_signature    float64       
dtypes: datetime64[ns](1), float64(14), object(4)
memory usage: 291.8+ MB
None


# Recommendation Engine code

In [13]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler

metric_columns = ["danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature"]

def standardized_data(data:pd.DataFrame):
    standard_scaler = StandardScaler()
    numeric_columns = data[metric_columns]
    other_columns = data.drop(columns=metric_columns).reset_index(drop=True)
    standardized_data = standard_scaler.fit_transform(numeric_columns)
    standardized_df = pd.DataFrame(standardized_data, columns=numeric_columns.columns)
    standardized_df = pd.merge(standardized_df, other_columns, left_index=True, right_index=True, how="left")
    return standardized_df

def normalized_data(data:pd.DataFrame):
    min_max_scaler = MinMaxScaler()
    numeric_columns = data[metric_columns]
    other_columns = data.drop(columns=metric_columns).reset_index(drop=True)
    normalized_data = min_max_scaler.fit_transform(numeric_columns)
    normalized_df = pd.DataFrame(normalized_data, columns=numeric_columns.columns)
    normalized_data = pd.merge(normalized_df, other_columns, left_index=True, right_index=True)
    return normalized_df

def reduce_data(data, dimensions):
    numeric_columns = data[metric_columns]
    pca_standardized = PCA(n_components=dimensions)
    pca_standardized_result = pca_standardized.fit_transform(numeric_columns)
    return pca_standardized_result


import hdbscan
import joblib

original_data_subset = original_data.sample(frac=0.02)

original_data_subset = standardized_data(original_data_subset)
display(original_data_subset)
# original_data_subset = normalized_data(original_data_subset)
# original_data_subset = reduce_data(original_data_subset, 2)

data_for_clustering = original_data_subset[metric_columns]

hdbscan_clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
hdbscan_clusterer.fit(data_for_clustering)
joblib.dump(hdbscan_clusterer, 'hdbscan_model.pkl')

clustered_subset = original_data_subset.copy()
clustered_subset["cluster"] = hdbscan_clusterer.labels_
clustered_subset.to_parquet("clustered_subset.parquet")

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artist_name,track_name,track_id,popularity,year,genre,duration_ms
0,-0.064824,1.317150,-1.477662,0.887697,-1.344018,-0.254815,-0.454620,-0.624436,0.406616,1.397004,1.463675,0.252343,Santana,Brown Skin Girl (feat. Bo Bice),14kZ72wYjQckHYuEvleiUx,20.0,2005-01-01,blues,282707.0
1,0.620469,-1.099396,-1.197239,-0.891436,0.744038,5.574823,1.006105,-0.624436,2.588485,0.160478,-1.153218,-1.818338,Steve Hofstetter,The Reason You Take a Date to See Me,3SqfXryD4Ws7WYLEdOEYRq,8.0,2022-01-01,comedy,262360.0
2,1.007327,-0.285246,-1.477662,0.630475,0.744038,-0.439033,1.075663,-0.623538,-0.478482,0.873430,1.197960,0.252343,Jack Hartmann,We Are a Family,5hnzpC8Fp3BzDgaacxGItd,9.0,2004-01-01,party,194320.0
3,1.134438,0.373476,-1.477662,0.767010,0.744038,-0.426621,-0.707812,-0.620788,-0.845386,1.014535,-0.579823,0.252343,Tina Moore,Never Gonna Let You Go - Kelly G. Bump-N-Go Vo...,0bwYKV0xuWcYviq9XJbCex,24.0,2013-01-01,garage,253080.0
4,0.128605,-1.084593,-1.197239,-0.571346,-1.344018,-0.325367,0.789083,-0.624436,-0.715709,-0.493061,-0.172513,0.252343,The American Young Voices Choir,Pop Medley From Trolls,3wYVbfMrSb1upgotmkIq93,24.0,2023-01-01,,392972.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40261,0.614942,0.432687,-0.355971,0.164445,-1.344018,0.244271,-0.635472,-0.622860,-0.571109,-0.418795,-0.679688,0.252343,Phabo,Step 2 Me,57tg0m3Iu8YzWyaDfFfDtR,51.0,2023-01-01,chill,110400.0
40262,0.686787,1.331953,0.485297,1.292600,-1.344018,-0.361949,-0.867518,-0.624425,0.401470,0.528094,0.257683,0.252343,Sick Individuals,I Want You,0oZDidA985NcUVE2GhnHI1,49.0,2021-01-01,edm,186576.0
40263,0.460198,-0.081708,1.046143,-0.352959,-1.344018,-0.478881,0.271569,-0.424952,-0.478482,0.609786,-0.475782,0.252343,Marshmellow,A Lighthouse For Your Soul,7k74WfhigDSzIFQ9AGXiPB,0.0,2021-01-01,,218350.0
40264,-0.888280,1.376361,1.606988,1.690014,0.744038,1.452791,-0.387844,0.709253,0.036109,-0.972076,0.162863,0.252343,MC5,I’m A Man (Live 1966),0eg53He31OMpYqNWmVwFAq,5.0,2008-01-01,garage,256679.0


## Actual Functionality

In [21]:
from rapidfuzz import process, utils
from sklearn.neighbors import NearestNeighbors
import joblib
def get_closest_match(user_input, df, column, threshold=90):
    processed_user_input = utils.default_process(user_input)
    strings_column = df[column].dropna()
    processed_strings = [utils.default_process(string) for string in strings_column]
    
    match = process.extractOne(processed_user_input, processed_strings, processor=None, score_cutoff=threshold)
    if match is not None:
        return strings_column.iloc[match[2]]  # match[2] is the index of the best match
    return None

def song_finder(song_name, artist_name):
    song = original_data[(original_data["track_name"] == song_name) & (original_data["artist_name"] == artist_name)]
    return None if song.empty else song

def find_closest_songs(song_name, artist_name, same_artist, number_of_songs, data):
    sample_data = pd.read_parquet("clustered_subset.parquet")
    metrics = ["danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature"]
    
    
    artist_name = get_closest_match(artist_name, data, "artist_name")
    song_name = get_closest_match(song_name, data, "track_name")
    print(song_name, artist_name)
    song = song_finder(song_name, artist_name)
    
    
    song = standardized_data(song)
    
    
    if artist_name is None or song_name is None or song is None:
        print("No match found")
        return None
    new_data_point = song[metrics].values.reshape(1, -1)
    model = joblib.load('hdbscan_model.pkl')
    predicted_cluster, _ = hdbscan.approximate_predict(model, new_data_point)
    print(predicted_cluster[0])
    
    if not same_artist:
        sample_data = sample_data[sample_data["artist_name"] != artist_name]
        
    cluster_data = sample_data[sample_data["cluster"] == predicted_cluster[0]]
    
    knn_model = NearestNeighbors(n_neighbors=5)
    cluster_data = cluster_data[metrics]
    knn_model.fit(cluster_data)
    distances, indices = knn_model.kneighbors(song[metrics], n_neighbors=number_of_songs)
    
    return data[data.index.isin(cluster_data.iloc[indices[0]].index)]


find_closest_songs("Shape of You", "Ed Sheeran", False, 5, original_data)

Shape of You Ed Sheeran
12


Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
5028,Nathaniel Kimble,Can U Bagg It Up (Remixed),5ZnddSlFx5cU3d5Aho3ONE,16.0,2012-01-01,blues,0.876,0.415,1.0,-6.795,1.0,0.0663,0.143,0.0,0.0639,0.761,123.955,249133.0,4.0
12030,Kellie Pickler,Stop Cheatin' On Me,2Z6r4r6oqhObyU5Ftt38mP,18.0,2012-01-01,country,0.513,0.536,2.0,-5.682,1.0,0.0246,0.321,0.0,0.0911,0.352,109.608,168067.0,4.0
13792,Cannibal Corpse,Scourge of Iron,6V3SNkvi4BnfmZU0j7s9TQ,54.0,2012-01-01,death-metal,0.446,0.977,10.0,-5.036,0.0,0.0781,0.000535,0.472,0.105,0.339,172.059,284400.0,4.0
22198,Raí Saia Rodada,Ponto Final - Ao Vivo,7twFGyEBkgKVT7cD28xZOk,9.0,2012-01-01,forro,0.557,0.74,0.0,-6.542,1.0,0.0783,0.804,0.000131,0.566,0.51,176.079,216790.0,4.0
22405,Stuck in the Sound,Silent and Sweet,7LROnTlpYggwagWNCdXCB1,32.0,2012-01-01,french,0.508,0.765,4.0,-6.02,1.0,0.0363,0.152,0.00019,0.0995,0.221,131.615,277773.0,4.0


# Here the modelling and transformation starts

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Select the numeric columns
numeric_columns = feature_df.drop(columns=["track_id", "genre"])

standard_scaler = StandardScaler()
min_max_scaler = MinMaxScaler()

# Standardize the numeric columns
standardized_data = standard_scaler.fit_transform(numeric_columns)
standardized_df = pd.DataFrame(standardized_data, columns=numeric_columns.columns)
standardized_df['genre'] = feature_df['genre']
standardized_df['track_id'] = feature_df['track_id']

# Normalize the numeric columns
normalized_data = min_max_scaler.fit_transform(numeric_columns)
normalized_df = pd.DataFrame(normalized_data, columns=numeric_columns.columns)
normalized_df['genre'] = feature_df['genre']
normalized_df['track_id'] = feature_df['track_id']

# Display the standardized and normalized dataframes
display(standardized_df.describe())
display(normalized_df.describe())


In [None]:
display(standardized_df.isna().sum())

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

import plotly.express as px

# Perform PCA on standardized_data
pca_standardized = PCA(n_components=2)
pca_standardized_result = pca_standardized.fit_transform(standardized_data)
print(1)

# Perform PCA on normalized_data
pca_normalized = PCA(n_components=2)
pca_normalized_result = pca_normalized.fit_transform(normalized_data)
print(2)

# # Perform t-SNE on standardized_data
# tsne_standardized = TSNE(n_components=2)
# tsne_standardized_result = tsne_standardized.fit_transform(standardized_data)
# print(3)

# # Perform t-SNE on normalized_data
# tsne_normalized = TSNE(n_components=2)
# tsne_normalized_result = tsne_normalized.fit_transform(normalized_data)
# print(4)

# # Create the subplot with 4 plots
# fig = px.subplots(
#     rows=2, cols=2,
#     subplot_titles=("PCA - Standardized Data", "PCA - Normalized Data", "t-SNE - Standardized Data", "t-SNE - Normalized Data"),
#     shared_xaxes=True, shared_yaxes=True,
#     vertical_spacing=0.1, horizontal_spacing=0.1
# )

# # Add PCA - Standardized Data plot
# fig.add_trace(
#     px.scatter(x=pca_standardized_result[:, 0], y=pca_standardized_result[:, 1], color=standardized_df['track_genre']).data[0],
#     row=1, col=1
# )

# # Add PCA - Normalized Data plot
# fig.add_trace(
#     px.scatter(x=pca_normalized_result[:, 0], y=pca_normalized_result[:, 1], color=normalized_df['track_genre']).data[0],
#     row=1, col=2
# )

# # Add t-SNE - Standardized Data plot
# fig.add_trace(
#     px.scatter(x=tsne_standardized_result[:, 0], y=tsne_standardized_result[:, 1], color=standardized_df['track_genre']).data[0],
#     row=2, col=1
# )

# # Add t-SNE - Normalized Data plot
# fig.add_trace(
#     px.scatter(x=tsne_normalized_result[:, 0], y=tsne_normalized_result[:, 1], color=normalized_df['track_genre']).data[0],
#     row=2, col=2
# )

# # Update layout
# fig.update_layout(
#     height=800,
#     showlegend=False
# )

# # Show the subplot
# fig.show()

px.scatter(x=pca_standardized_result[:, 0], y=pca_standardized_result[:, 1], color=standardized_df['genre']).show()

In [None]:
# sns.pairplot(original_data, hue='track_genre', diag_kind='kde')

In [None]:
# import pandas as pd
# import numpy as np
# from sklearn.manifold import TSNE

# import plotly.express as px

# dataframe = standardized_df.copy()
# # Assuming 'data' is your dataframe and track_genre is a column in the dataframe

# # Create a subset of the data
# subset_data = dataframe.sample(n=1000, random_state=42)

# # Prepare the data: Separate features and labels
# features = subset_data.drop(columns=['track_genre', "track_id"])  # Drop the track_genre column
# labels = subset_data['track_genre']  # Save the track_genre column separately

# # Apply t-SNE
# tsne = TSNE(n_components=2, random_state=42)
# tsne_results = tsne.fit_transform(features)

# # Create a DataFrame for the t-SNE results
# tsne_df = pd.DataFrame(tsne_results, columns=['tsne_1', 'tsne_2'])
# tsne_df['track_genre'] = labels.values

# # Plot the results using Plotly Express
# fig = px.scatter(tsne_df, x='tsne_1', y='tsne_2', color='track_genre', title='t-SNE of Track Features by Genre')
# fig.show()


In [None]:
# import umap

# reducer = umap.UMAP(n_components=2, random_state=42)

# # Apply UMAP
# umap_results = reducer.fit_transform(subset_data.drop(columns=['track_genre', "track_id"]))

# px.scatter(x=umap_results[:, 0], y=umap_results[:, 1], color=subset_data['track_genre']).show()

In [None]:
# import pandas as pd
# import numpy as np
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import MinMaxScaler
# from datetime import datetime
# from sklearn.metrics.pairwise import cosine_similarity

# # a function to get content-based recommendations based on music features
# def content_based_recommendations(input_song_name, num_recommendations=5):
#     if input_song_name not in music_df['Track Name'].values:
#         print(f"'{input_song_name}' not found in the dataset. Please enter a valid song name.")
#         return

#     # Get the index of the input song in the music DataFrame
#     input_song_index = music_df[music_df['Track Name'] == input_song_name].index[0]

#     # Calculate the similarity scores based on music features (cosine similarity)
#     similarity_scores = cosine_similarity([music_features_scaled[input_song_index]], music_features_scaled)

#     # Get the indices of the most similar songs
#     similar_song_indices = similarity_scores.argsort()[0][::-1][1:num_recommendations + 1]

#     # Get the names of the most similar songs based on content-based filtering
#     content_based_recommendations = music_df.iloc[similar_song_indices][['Track Name', 'Artists', 'Album Name', 'Release Date', 'Popularity']]

#     return content_based_recommendations

In [None]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Loop through a range of cluster numbers to calculate silhouette scores
silhouette_scores = []
cluster_range = range(2, 26)
data_sample = standardized_data[np.random.choice(standardized_data.shape[0], 20000, replace=False)]
for k in cluster_range:
    print(f"Calculating silhouette score for k = {k}")
    kmeans = KMeans(n_clusters=k, random_state=42)
    cluster_labels = kmeans.fit_predict(data_sample)
    silhouette_avg = silhouette_score(data_sample, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    print(f"For n_clusters = {k}, the average silhouette score is {silhouette_avg:.4f}")

# Optionally, you can plot the silhouette scores
import matplotlib.pyplot as plt

plt.plot(cluster_range, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for k-means clustering')
plt.show()


In [None]:
import numpy as np
from sklearn.cluster import HDBSCAN

data_sample = standardized_data[np.random.choice(standardized_data.shape[0], 200000, replace=False)]

# Fit the HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=100)
hdbscan_model.fit(data_sample)

# Get the labels assigned to each data point
cluster_labels = hdbscan_model.labels_

# Example: Print out the first 10 cluster labels
print("First 10 cluster labels:", cluster_labels[:10])

# Print out the number of clusters found (excluding noise)
print(f"Number of clusters found: {len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)}")


In [None]:
def song_finder(song_name, artist_name):
    song = original_data[(original_data["track_name"] == song_name) & (original_data["artist_name"] == artist_name)]
    return song

song = song_finder("Shape of You", "Ed Sheeran")

standardized_data[song.index]

for song in original_data[['track_name', 'artist_name']].itertuples():
    print(song[0])
    print(standardized_data[song[0]])

In [None]:
from scipy.spatial import distance

def song_finder(song_name, artist_name):
    song = original_data[(original_data["track_name"] == song_name) & (original_data["artist_name"] == artist_name)]
    return song

def find_closest_songs(song_name, artist_name, song_number=5):
    all_distances = []
    
    chosen_song = song_finder(song_name, artist_name)
    index = chosen_song.index
    print(index)
    print(standardized_data[index][0])
    for song in original_data[['track_name', 'artist_name']].itertuples():

        current_distance = distance.cosine(standardized_data[song[0]],standardized_data[chosen_song.index][0])
        all_distances.append((song.track_name, song.artist_name, current_distance))
    all_distances.sort(key=lambda x: x[2], reverse=False)
    return all_distances[1:song_number+1]

find_closest_songs("Shape of You", "Skrillex")