# Datasets

- https://www.kaggle.com/datasets/notshrirang/spotify-million-song-dataset
- https://www.kaggle.com/datasets/tonygordonjr/spotify-dataset-2023?select=spotify-albums_data_2023.csv
- https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks?select=tracks.csv

# Links
- https://forecastegy.com/posts/xgboost-multiclass-classification-python/
- https://github.com/jannine92/spotify_recommendation/blob/main/music_recommender.ipynb
- https://www.kaggle.com/code/nyjoey/spotify-clustering
- https://ausaf-a.github.io/ml-song-recommender/
- https://medium.com/@Marlon_H/spotify-clustering-f41b40003c9a
- https://www.kaggle.com/code/choongqianzheng/song-genre-classification-system
- https://developer.spotify.com/documentation/web-api/reference/get-audio-features
- https://medium.com/@miguelrodrigueznovelo/discover-your-perfect-playlist-10-songs-recommended-by-a-music-recommendation-system-with-python-5fd246d87127
- https://medium.com/@shruti.somankar/building-a-music-recommendation-system-using-spotify-api-and-python-f7418a21fa41
- https://www.kaggle.com/code/merveeyuboglu/music-recommendation-system-cosine-s


# ToDo List:

- Basic stuff✅
  - Load Data✅
  - Display, Info and Describe data✅
  - Split Datasets into song_metrics and song_info✅
- Data Visualization (Also in Percent if valuable)✅
  - Visualize Correlation Heatmap✅
  - Display Genres as Numbers and Histogram✅
  - Display Genre Dendrogram✅
  - Display most frequent artists✅
  - Display most popular artist✅
  - Plot Popularity as histogram✅
  - Plot Average Song metric for Each genre (could also be on a 3D plot)✅
  - Plot Box plots to detect outliers✅
- Features
  - Apply Standard and MinMaxScaler ✅
  - Apply and Visualize PCA and t-SNE / UMAP
  - Use Silhouette Score to see how many clusters are needed (also try fancy plot from Medium)
  - Use KMeans to start
  - Use DBSCAN
  - Use Agglomerative Clustering
  - Use HDBSCAN
  - Use XGBClassifier with Cross Validation
- Feature Extensions
  - Plot Similar Artists
  - Plot Similar Genres
  - (Plot Similar Songs [Only small set of Data here])
- Possible Uses:
  - Send a song to the Spotify API, get its song data back, and use that to find similar songs (with the option to get artists other than the one from the provided song)
  - Put a song in, get similar artists back (you could also put multiple songs in, but I don't think this is worth it)
  - Simulate entering a whole user profile, take the average song data from it, and use that to find new artists (which are not in here)
- Things missing
  - We don't have the release date or listening date, so we cannot use time as a feature. Time could give even better recommendations, because we would know what the user currently listens to and could weight it accordingly
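
The user-profile idea from the list above can be sketched as follows. This is only a rough illustration on a made-up three-feature catalogue: `catalogue`, `listened` and `profile_vector` are hypothetical names, and the optional `weights` argument is an assumption about how listening recency could be folded in if that data were available.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy catalogue: rows are songs, columns are already-scaled audio features
catalogue = np.array([
    [0.90, 0.80, 0.10],  # song 0
    [0.80, 0.90, 0.20],  # song 1
    [0.10, 0.20, 0.90],  # song 2
    [0.20, 0.10, 0.80],  # song 3
    [0.85, 0.85, 0.15],  # song 4
])

# Songs the simulated user has listened to (row positions in the catalogue)
listened = [0, 1]


def profile_vector(features, weights=None):
    """Weighted mean of the feature rows; uniform weights when none are given."""
    return np.average(features, axis=0, weights=weights)


profile = profile_vector(catalogue[listened])

# Search only among songs the user has not heard yet
candidate_ids = np.array([i for i in range(len(catalogue)) if i not in listened])
knn = NearestNeighbors(n_neighbors=2).fit(catalogue[candidate_ids])
_, idx = knn.kneighbors(profile.reshape(1, -1))

# Map positions in the candidate subset back to catalogue rows
recommended = candidate_ids[idx[0]]
print(recommended)
```

Passing larger weights for recently played songs would pull the profile towards what the user listens to now, which is the time weighting the last bullet asks for.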

# Load and View Dataset

In [None]:
import pandas as pd

results = []
for i in range(3):
    data = pd.read_parquet(f'spotify_data_part_{i+1}.parquet')
    results.append(data)

original_data = pd.concat(results)


original_data["year"] = pd.to_datetime(original_data["year"], format='%Y')
original_data = original_data.dropna(subset=["danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms", "time_signature", "popularity", "track_id", "track_name", "artist_name", "year"])
original_data = original_data.drop_duplicates(subset=["track_name", "artist_name", "danceability", "energy", "key", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature"])
original_data = original_data.reset_index(drop=True)
# Drop the leftover CSV index column if it is present
original_data = original_data.drop(columns=["Unnamed: 0"], errors="ignore")
display(original_data.head())
display(original_data.describe())
print(original_data.info())

# Recommendation Engine code

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler

metric_columns = ["danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature"]

def standardized_data(data:pd.DataFrame):
    standard_scaler = StandardScaler()
    numeric_columns = data[metric_columns]
    other_columns = data.drop(columns=metric_columns).reset_index(drop=True)
    standardized_data = standard_scaler.fit_transform(numeric_columns)
    standardized_df = pd.DataFrame(standardized_data, columns=numeric_columns.columns)
    standardized_df = pd.merge(standardized_df, other_columns, left_index=True, right_index=True, how="left")
    return standardized_df

def normalized_data(data:pd.DataFrame):
    min_max_scaler = MinMaxScaler()
    numeric_columns = data[metric_columns]
    other_columns = data.drop(columns=metric_columns).reset_index(drop=True)
    normalized_values = min_max_scaler.fit_transform(numeric_columns)
    normalized_df = pd.DataFrame(normalized_values, columns=numeric_columns.columns)
    # Re-attach the non-metric columns and return the merged frame
    normalized_df = pd.merge(normalized_df, other_columns, left_index=True, right_index=True)
    return normalized_df

def reduce_data(data, dimensions):
    numeric_columns = data[metric_columns]
    pca_standardized = PCA(n_components=dimensions)
    pca_standardized_result = pca_standardized.fit_transform(numeric_columns)
    return pca_standardized_result


import hdbscan
import joblib

original_data_subset = original_data.sample(frac=0.02)

original_data_subset = standardized_data(original_data_subset)
display(original_data_subset)
# original_data_subset = normalized_data(original_data_subset)
# original_data_subset = reduce_data(original_data_subset, 2)

data_for_clustering = original_data_subset[metric_columns]

hdbscan_clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
hdbscan_clusterer.fit(data_for_clustering)
joblib.dump(hdbscan_clusterer, 'hdbscan_model.pkl')

clustered_subset = original_data_subset.copy()
clustered_subset["cluster"] = hdbscan_clusterer.labels_
clustered_subset.to_parquet("clustered_subset.parquet")
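
The cell above fits the `StandardScaler` inside `standardized_data` and then discards it, so a song that arrives later cannot be put on the same scale. One possible way around this, sketched here on synthetic data, is to persist the fitted scaler next to the model and reuse its `transform`. The file name `feature_scaler.pkl` and the shortened `feature_cols` list are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

feature_cols = ["danceability", "energy", "tempo"]  # shortened, illustrative list

# Synthetic stand-in for the training subset
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "danceability": rng.random(50),
    "energy": rng.random(50),
    "tempo": rng.uniform(60, 180, 50),
})

# Fit once on the training data and persist the fitted scaler (hypothetical file name)
scaler_path = os.path.join(tempfile.gettempdir(), "feature_scaler.pkl")
scaler = StandardScaler().fit(train[feature_cols])
joblib.dump(scaler, scaler_path)

# Later: load the scaler and transform a single new song with the training statistics
loaded = joblib.load(scaler_path)
new_song = pd.DataFrame([{"danceability": 0.7, "energy": 0.6, "tempo": 120.0}])
new_scaled = loaded.transform(new_song[feature_cols])

# For comparison: a scaler fitted on the single row alone collapses it to zeros
refit = StandardScaler().fit_transform(new_song[feature_cols])

print(new_scaled)
print(refit)
```

The comparison at the end shows why refitting on one row is not a substitute: the mean equals the row itself, so every feature becomes 0.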

## Actual Functionality

#### Config

In [None]:
from rapidfuzz import process, utils
from sklearn.neighbors import NearestNeighbors
import hdbscan
import joblib
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd
import os
from dotenv import load_dotenv
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

class MusicRecommendation:
    def __init__(self):

        load_dotenv()

        # Initialize the Spotify client with authentication
        auth_manager = SpotifyOAuth(client_id=os.getenv('SPOTIFY_CLIENT_ID'), client_secret=os.getenv('SPOTIFY_CLIENT_SECRET'), redirect_uri=os.getenv('SPOTIFY_REDIRECT_URI'), scope="user-library-read")
        self.sp = spotipy.Spotify(auth_manager=auth_manager)
        self.model = joblib.load("hdbscan_model.pkl")

        self.original_data = pd.read_parquet("clustered_subset.parquet")
        self.metrics = [
            'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 
            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 
            'time_signature'
        ]
    
    def get_track_features(self, song_name: str, artist_name: str) -> pd.DataFrame:
        results = self.sp.search(q=f"track:{song_name} artist:{artist_name}", type="track", limit=1)
        
        if not results['tracks']['items']:
            print("Track not found")
            return pd.DataFrame()  # Return an empty DataFrame if the track is not found
        
        track = results['tracks']['items'][0]
        track_id = track['id']
        track_name = track['name']
        artist_name = track['artists'][0]['name']
        popularity = track['popularity']
        release_date = track['album']['release_date']
        year = int(release_date.split('-')[0])
        duration_ms = track['duration_ms']
        audio_features = self.sp.audio_features(track_id)[0]
        
        data = {
            'danceability': audio_features['danceability'],
            'energy': audio_features['energy'],
            'key': audio_features['key'],
            'loudness': audio_features['loudness'],
            'mode': audio_features['mode'],
            'speechiness': audio_features['speechiness'],
            'acousticness': audio_features['acousticness'],
            'instrumentalness': audio_features['instrumentalness'],
            'liveness': audio_features['liveness'],
            'valence': audio_features['valence'],
            'tempo': audio_features['tempo'],
            'time_signature': audio_features['time_signature'],
            'artist_name': artist_name,
            'track_name': track_name,
            'track_id': track_id,
            'popularity': popularity,
            'year': year,
            'duration_ms': duration_ms
        }
        return pd.DataFrame([data])
    
    def standardized_data(self, data: pd.DataFrame) -> pd.DataFrame:
        standard_scaler = StandardScaler()
        numeric_columns = data[self.metrics]
        other_columns = data.drop(columns=self.metrics).reset_index(drop=True)
        standardized_data = standard_scaler.fit_transform(numeric_columns)
        standardized_df = pd.DataFrame(standardized_data, columns=self.metrics)
        return pd.merge(standardized_df, other_columns, left_index=True, right_index=True, how="left")

    def normalized_data(self, data: pd.DataFrame) -> pd.DataFrame:
        min_max_scaler = MinMaxScaler()
        numeric_columns = data[self.metrics]
        other_columns = data.drop(columns=self.metrics).reset_index(drop=True)
        normalized_data = min_max_scaler.fit_transform(numeric_columns)
        return pd.merge(pd.DataFrame(normalized_data, columns=self.metrics), other_columns, left_index=True, right_index=True)
    
    def reduce_data(self, data: pd.DataFrame, dimensions: int) -> np.ndarray:
        numeric_columns = data[self.metrics]
        pca = PCA(n_components=dimensions)
        return pca.fit_transform(numeric_columns)

    def get_closest_match(self, user_input: str, df: pd.DataFrame, column: str, threshold: int = 90) -> str:
        processed_user_input = utils.default_process(user_input)
        strings_column = df[column].dropna()
        processed_strings = [utils.default_process(string) for string in strings_column]
        match = process.extractOne(processed_user_input, processed_strings, processor=None, score_cutoff=threshold)
        return strings_column.iloc[match[2]] if match else None

    def song_finder(self, song_name: str, artist_name: str) -> pd.DataFrame:
        song = self.original_data[(self.original_data["track_name"] == song_name) & (self.original_data["artist_name"] == artist_name)]
        return song if not song.empty else None

    def preprocess_song(self, song: pd.DataFrame, normalization: str, reduction: str) -> pd.DataFrame:
        if normalization == "standardized":
            song = self.standardized_data(song)
        elif normalization == "normalized":
            song = self.normalized_data(song)
        if reduction == "pca":
            song = self.reduce_data(song, 2)
        return song

    def get_song_cluster(self, song: pd.DataFrame) -> int:
        new_data_point = song[self.metrics].values.reshape(1, -1)
        model = self.model
        predicted_cluster, _ = hdbscan.approximate_predict(model, new_data_point)
        return predicted_cluster[0]

    def find_nearest_neighbors(self, song: pd.DataFrame, data: pd.DataFrame, number_of_songs: int) -> tuple:
        # Search a pool of up to 100 candidates (at least number_of_songs), capped by the cluster size
        n_neighbors = min(max(100, number_of_songs), len(data))
        knn_model = NearestNeighbors(n_neighbors=n_neighbors)
        cluster_data = data[self.metrics]
        knn_model.fit(cluster_data)
        distances, indices = knn_model.kneighbors(song[self.metrics])
        neighbors_df = data.iloc[indices[0]]
        return neighbors_df, distances[0]

    def get_weighted_scores(self, neighbors_df: pd.DataFrame, neighbor_distances: np.ndarray) -> pd.DataFrame:
        # Ensure we are working with a copy of the DataFrame to avoid SettingWithCopyWarning
        neighbors_df = neighbors_df.copy()
        
        # Convert 'year' column to numeric
        neighbors_df['year_numeric'] = neighbors_df['year'].dt.year
        
        # Normalize 'year_numeric' and 'popularity' columns
        scaler = MinMaxScaler()
        neighbors_df[['year_normalized', 'popularity_normalized']] = scaler.fit_transform(
            neighbors_df[['year_numeric', 'popularity']]
        )
        
        year_normalized = neighbors_df['year_normalized'].values
        popularity_normalized = neighbors_df['popularity_normalized'].values
        
        # Define weights for year and popularity
        year_weight = 0.4
        popularity_weight = 0.6

        # Compute base scores (inverse distance, to ensure higher similarity has a higher base score)
        base_scores = 1 / (neighbor_distances + 1e-8)  # Avoid division by zero

        # Compute boosting scores
        boosting_scores = year_normalized * year_weight + popularity_normalized * popularity_weight

        # Final scores: add boosting scores to base scores
        final_scores = base_scores + boosting_scores

        # Rank neighbors based on weighted scores
        ranked_indices = np.argsort(final_scores)[::-1]  # Sort in descending order
        return neighbors_df.iloc[ranked_indices]

    def print_preview_urls(self, song_df: pd.DataFrame) -> None:
        for _, row in song_df.iterrows():
            track_id = row['track_id']
            track = self.sp.track(track_id)
            preview_url = track.get('preview_url')
            if preview_url:
                print(f"Track: {row['track_name']} by {row['artist_name']}")
                print(f"Preview URL: {preview_url}")
            else:
                print(f"Track: {row['track_name']} by {row['artist_name']}")
                print("Preview URL not available.")

    def find_closest_songs(self, song_name: str = "", artist_name: str = "", same_artist: bool = None, number_of_songs: int = None) -> pd.DataFrame:
        # Prompt interactively for any argument that was not supplied
        if artist_name == "":
            artist_name = input("Enter the artist name: ")
        if song_name == "":
            song_name = input("Enter the song name: ")
        if same_artist is None:
            same_artist = input("Filter by same artist? (yes/no): ").strip().lower() == 'yes'
        if number_of_songs is None:
            number_of_songs = int(input("Enter the number of songs to return: "))

        artist_name_corrected = self.get_closest_match(artist_name, self.original_data, "artist_name")
        song_name_corrected = self.get_closest_match(song_name, self.original_data, "track_name")
        
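        song = self.song_finder(song_name_corrected, artist_name_corrected)
        if song is not None:
            # Rows in clustered_subset.parquet are already standardized; keep the first match
            song_scaled = song.head(1)
        else:
            song = self.get_track_features(song_name, artist_name)
            if song.empty:
                print("No match found")
                return None
            # NOTE: a StandardScaler fitted on a single row maps every feature to 0,
            # so features fetched from the API are not on the training scale here
            song_scaled = self.standardized_data(song)

        predicted_cluster = self.get_song_cluster(song_scaled)

        sample_data = self.original_data[self.original_data["cluster"] == predicted_cluster]
        # Never recommend the query song itself
        sample_data = sample_data[sample_data["track_id"] != song_scaled["track_id"].iloc[0]]
        if not same_artist:
            # Compare against the resolved artist name, not the raw user input
            sample_data = sample_data[sample_data["artist_name"] != song_scaled["artist_name"].iloc[0]]

        # Use the same scaled features for the cluster prediction and the neighbor search
        neighbors_df, neighbor_distances = self.find_nearest_neighbors(song_scaled, sample_data, number_of_songs)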
        neighbors_df = self.get_weighted_scores(neighbors_df, neighbor_distances)
        
        closest_songs = neighbors_df.head(number_of_songs)
        
        self.print_preview_urls(closest_songs)
        
        return closest_songs

# Example usage:
if __name__ == "__main__":
    recommender = MusicRecommendation()

    # With no arguments, the song, artist and options are requested via input()
    recommender.find_closest_songs()
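
In `get_weighted_scores` the base score is an unbounded inverse distance, while the year/popularity boost is at most 0.4 + 0.6 = 1. The toy numbers below (all made up) illustrate how that plays out. The `1 / (1 + d)` variant is only an assumed alternative for comparison, not something the class uses.

```python
import numpy as np

# Made-up neighbor distances and year/popularity boosts in [0, 1]
distances = np.array([0.0, 0.5, 2.0])
boost = np.array([0.0, 1.0, 1.0])

# Scoring as in get_weighted_scores: inverse distance plus boost
inverse = 1 / (distances + 1e-8) + boost

# Assumed alternative: base score bounded to (0, 1], same order of magnitude as the boost
bounded = 1 / (1 + distances) + boost

# Indices of the neighbors from best to worst under each scoring
print(np.argsort(inverse)[::-1])
print(np.argsort(bounded)[::-1])
```

With the inverse distance the closest neighbor always wins regardless of the boost, whereas a bounded base score lets year and popularity change the ranking. Which behaviour is preferable depends on how much weight similarity should carry.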


# Here the modelling and transformation starts

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Select the numeric columns
numeric_columns = feature_df.drop(columns=["track_id", "genre"])

standard_scaler = StandardScaler()
min_max_scaler = MinMaxScaler()

# Standardize the numeric columns
standardized_data = standard_scaler.fit_transform(numeric_columns)
standardized_df = pd.DataFrame(standardized_data, columns=numeric_columns.columns)
# Assign by position (.to_numpy()) so a non-default index on feature_df cannot misalign rows
standardized_df['genre'] = feature_df['genre'].to_numpy()
standardized_df['track_id'] = feature_df['track_id'].to_numpy()

# Normalize the numeric columns
normalized_data = min_max_scaler.fit_transform(numeric_columns)
normalized_df = pd.DataFrame(normalized_data, columns=numeric_columns.columns)
normalized_df['genre'] = feature_df['genre'].to_numpy()
normalized_df['track_id'] = feature_df['track_id'].to_numpy()

# Display the standardized and normalized dataframes
display(standardized_df.describe())
display(normalized_df.describe())


In [None]:
display(standardized_df.isna().sum())

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

import plotly.express as px

# Perform PCA on standardized_data
pca_standardized = PCA(n_components=2)
pca_standardized_result = pca_standardized.fit_transform(standardized_data)
print("PCA on standardized data done")

# Perform PCA on normalized_data
pca_normalized = PCA(n_components=2)
pca_normalized_result = pca_normalized.fit_transform(normalized_data)
print("PCA on normalized data done")

# # Perform t-SNE on standardized_data
# tsne_standardized = TSNE(n_components=2)
# tsne_standardized_result = tsne_standardized.fit_transform(standardized_data)
# print(3)

# # Perform t-SNE on normalized_data
# tsne_normalized = TSNE(n_components=2)
# tsne_normalized_result = tsne_normalized.fit_transform(normalized_data)
# print(4)

# # Create the subplot with 4 plots (make_subplots comes from plotly.subplots, not plotly.express)
# fig = make_subplots(
#     rows=2, cols=2,
#     subplot_titles=("PCA - Standardized Data", "PCA - Normalized Data", "t-SNE - Standardized Data", "t-SNE - Normalized Data"),
#     shared_xaxes=True, shared_yaxes=True,
#     vertical_spacing=0.1, horizontal_spacing=0.1
# )

# # Add PCA - Standardized Data plot
# fig.add_trace(
#     px.scatter(x=pca_standardized_result[:, 0], y=pca_standardized_result[:, 1], color=standardized_df['track_genre']).data[0],
#     row=1, col=1
# )

# # Add PCA - Normalized Data plot
# fig.add_trace(
#     px.scatter(x=pca_normalized_result[:, 0], y=pca_normalized_result[:, 1], color=normalized_df['track_genre']).data[0],
#     row=1, col=2
# )

# # Add t-SNE - Standardized Data plot
# fig.add_trace(
#     px.scatter(x=tsne_standardized_result[:, 0], y=tsne_standardized_result[:, 1], color=standardized_df['track_genre']).data[0],
#     row=2, col=1
# )

# # Add t-SNE - Normalized Data plot
# fig.add_trace(
#     px.scatter(x=tsne_normalized_result[:, 0], y=tsne_normalized_result[:, 1], color=normalized_df['track_genre']).data[0],
#     row=2, col=2
# )

# # Update layout
# fig.update_layout(
#     height=800,
#     showlegend=False
# )

# # Show the subplot
# fig.show()

px.scatter(x=pca_standardized_result[:, 0], y=pca_standardized_result[:, 1], color=standardized_df['genre']).show()
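
Before reading too much into the 2D scatter, it can help to check how much variance the two components actually retain. The sketch below uses a synthetic 12-column matrix as a stand-in for the scaled features. `explained_variance_ratio_` is the fitted `PCA` attribute, and everything else is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled feature matrix: 500 rows, 12 columns
raw = rng.normal(size=(500, 12))
raw[:, 1] = raw[:, 0] + 0.1 * rng.normal(size=500)  # make two columns correlated
scaled = StandardScaler().fit_transform(raw)

pca = PCA(n_components=2).fit(scaled)

# Share of the total variance captured by each component, and by both together
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the two ratios sum to only a small fraction, genres overlapping in the plot may reflect the projection rather than the features themselves.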

In [None]:
# sns.pairplot(original_data, hue='track_genre', diag_kind='kde')

In [None]:
# import pandas as pd
# import numpy as np
# from sklearn.manifold import TSNE

# import plotly.express as px

# dataframe = standardized_df.copy()
# # Assuming 'data' is your dataframe and track_genre is a column in the dataframe

# # Create a subset of the data
# subset_data = dataframe.sample(n=1000, random_state=42)

# # Prepare the data: Separate features and labels
# features = subset_data.drop(columns=['track_genre', "track_id"])  # Drop the track_genre column
# labels = subset_data['track_genre']  # Save the track_genre column separately

# # Apply t-SNE
# tsne = TSNE(n_components=2, random_state=42)
# tsne_results = tsne.fit_transform(features)

# # Create a DataFrame for the t-SNE results
# tsne_df = pd.DataFrame(tsne_results, columns=['tsne_1', 'tsne_2'])
# tsne_df['track_genre'] = labels.values

# # Plot the results using Plotly Express
# fig = px.scatter(tsne_df, x='tsne_1', y='tsne_2', color='track_genre', title='t-SNE of Track Features by Genre')
# fig.show()


In [None]:
# import umap

# reducer = umap.UMAP(n_components=2, random_state=42)

# # Apply UMAP
# umap_results = reducer.fit_transform(subset_data.drop(columns=['track_genre', "track_id"]))

# px.scatter(x=umap_results[:, 0], y=umap_results[:, 1], color=subset_data['track_genre']).show()

In [None]:
# import pandas as pd
# import numpy as np
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import MinMaxScaler
# from datetime import datetime
# from sklearn.metrics.pairwise import cosine_similarity

# # a function to get content-based recommendations based on music features
# def content_based_recommendations(input_song_name, num_recommendations=5):
#     if input_song_name not in music_df['Track Name'].values:
#         print(f"'{input_song_name}' not found in the dataset. Please enter a valid song name.")
#         return

#     # Get the index of the input song in the music DataFrame
#     input_song_index = music_df[music_df['Track Name'] == input_song_name].index[0]

#     # Calculate the similarity scores based on music features (cosine similarity)
#     similarity_scores = cosine_similarity([music_features_scaled[input_song_index]], music_features_scaled)

#     # Get the indices of the most similar songs
#     similar_song_indices = similarity_scores.argsort()[0][::-1][1:num_recommendations + 1]

#     # Get the names of the most similar songs based on content-based filtering
#     content_based_recommendations = music_df.iloc[similar_song_indices][['Track Name', 'Artists', 'Album Name', 'Release Date', 'Popularity']]

#     return content_based_recommendations

In [None]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Loop through a range of cluster numbers to calculate silhouette scores
silhouette_scores = []
cluster_range = range(2, 26)
data_sample = standardized_data[np.random.choice(standardized_data.shape[0], 20000, replace=False)]
for k in cluster_range:
    print(f"Calculating silhouette score for k = {k}")
    kmeans = KMeans(n_clusters=k, random_state=42)
    cluster_labels = kmeans.fit_predict(data_sample)
    silhouette_avg = silhouette_score(data_sample, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    print(f"For n_clusters = {k}, the average silhouette score is {silhouette_avg:.4f}")

# Optionally, you can plot the silhouette scores
import matplotlib.pyplot as plt

plt.plot(cluster_range, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for k-means clustering')
plt.show()


In [None]:
import numpy as np
from sklearn.cluster import HDBSCAN

data_sample = standardized_data[np.random.choice(standardized_data.shape[0], 200000, replace=False)]

# Fit the HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=100)
hdbscan_model.fit(data_sample)

# Get the labels assigned to each data point
cluster_labels = hdbscan_model.labels_

# Example: Print out the first 10 cluster labels
print("First 10 cluster labels:", cluster_labels[:10])

# Print out the number of clusters found (excluding noise)
print(f"Number of clusters found: {len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)}")


In [None]:
def song_finder(song_name, artist_name):
    song = original_data[(original_data["track_name"] == song_name) & (original_data["artist_name"] == artist_name)]
    return song

song = song_finder("Shape of You", "Ed Sheeran")

standardized_data[song.index]

for song in original_data[['track_name', 'artist_name']].itertuples():
    print(song[0])
    print(standardized_data[song[0]])

In [None]:
from scipy.spatial import distance

def song_finder(song_name, artist_name):
    song = original_data[(original_data["track_name"] == song_name) & (original_data["artist_name"] == artist_name)]
    return song

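def find_closest_songs(song_name, artist_name, song_number=5):
    chosen_song = song_finder(song_name, artist_name)
    if chosen_song.empty:
        print(f"'{song_name}' by {artist_name} not found in the dataset")
        return []

    # Feature vector of the first matching song, looked up once instead of on every iteration
    chosen_vector = standardized_data[chosen_song.index[0]]

    all_distances = []
    for song in original_data[['track_name', 'artist_name']].itertuples():
        # song[0] is the row index, used as the row position in standardized_data
        current_distance = distance.cosine(standardized_data[song[0]], chosen_vector)
        all_distances.append((song.track_name, song.artist_name, current_distance))

    # Smallest cosine distance first; position 0 is the chosen song itself
    all_distances.sort(key=lambda x: x[2])
    return all_distances[1:song_number+1]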

find_closest_songs("Shape of You", "Ed Sheeran")
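
The per-row loop above computes one cosine distance at a time. A vectorized variant could look roughly like this, shown on a tiny made-up feature matrix. `closest_by_cosine` is a hypothetical helper, and `cosine_distances` is the scikit-learn pairwise function.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Made-up feature matrix: one row per song
features = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])


def closest_by_cosine(features, query_index, song_number=2):
    """Return row positions and cosine distances of the songs closest to the query row."""
    # One call computes the distance from the query row to every row
    dists = cosine_distances(features[[query_index]], features)[0]
    order = np.argsort(dists)
    order = order[order != query_index]  # drop the query song itself
    top = order[:song_number]
    return top, dists[top]


ids, dists = closest_by_cosine(features, query_index=0)
print(ids, dists)
```

The returned positions can then be used with `original_data.iloc[...]` to look up track and artist names, assuming the feature matrix rows line up with the dataframe rows.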