# **Introduction**:

## **Creating an Innovative Music Recommendation System using Spotify Dataset**

In today's digital age, music has become an integral part of our lives, accompanying us through various moods, activities, and experiences. With the advent of streaming platforms like Spotify, the world of music has undergone a transformative shift, offering a treasure trove of songs across genres, eras, and cultures. In this era of abundance, the need for intelligent music recommendation systems has never been more pronounced.

This project delves into the realm of data science and music to construct a sophisticated music recommendation system that takes advantage of the vast Spotify dataset and leverages advanced analytics and machine learning techniques. The goal is to create a platform that not only curates music based on user preferences but also unravels the nuances of musical trends, genres, and artist profiles.

Through a journey of exploration, analysis, and modeling, this project unveils the steps involved in building a music recommendation system from scratch. We start by comprehensively understanding the Spotify dataset, which provides a panoramic view of music features, artist details, and historical trends. With this robust dataset as our foundation, we embark on a data-driven voyage to uncover the intricacies of music evolution and genre dynamics over the years.

To ensure accurate and personalized recommendations, we harness the power of machine learning. We employ techniques like K-means clustering to group genres and songs with shared audio characteristics, allowing us to categorize and identify patterns in the vast musical landscape. Through Spotipy, a Python library that interfaces with the Spotify Web API, we enrich our recommendations with real-time song data and artist insights, offering users a seamless blend of technology and creativity.

We download the data directly from Kaggle. Before that, the Kaggle API token (`kaggle.json`) must be in place so that the `kaggle` command can authenticate.

In [None]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/

In [None]:
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d vatsalmavani/spotify-dataset

In [None]:
!unzip spotify-dataset.zip -d /content

Let's import the common and necessary libraries:

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from scipy.spatial.distance import cdist

import warnings
warnings.filterwarnings("ignore")

Let's now import and read the data.

In [None]:
data = pd.read_csv("/content/data/data.csv")
genre_data = pd.read_csv("/content/data/data_by_genres.csv")
year_data = pd.read_csv("/content/data/data_by_year.csv")

Some basic exploration of the data:

In [None]:
print(data.info())

In [None]:
print(genre_data.info())

In [None]:
print(year_data.info())

**Getting into EDA:**

Feature Correlation -

In [None]:
# Calculate the correlation between the numeric audio features and visualize it using a heatmap.
# Non-numeric columns (e.g. artists, name) are excluded; pandas 2.0+ raises an error otherwise.
correlation_matrix = data.select_dtypes(np.number).corr()

# Set up the style and context
sns.set(style="whitegrid")
plt.figure(figsize=(12, 10))

# Create a heatmap with a diverging color palette
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", center=0,
            fmt=".2f", linewidths=0.5, cbar_kws={"shrink": 0.8})

# Customize the title and labels
plt.title("Feature Correlation Heatmap", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Show the plot
plt.tight_layout()
plt.show()
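
The heatmap shows every pairwise correlation at once. To read off which features move most strongly with a single column such as popularity, the matrix can also be ranked. The following is a minimal sketch on a made-up toy frame; the helper name `top_correlations` and the toy values are assumptions for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

def top_correlations(df, target, n=5):
    """Return the n numeric features most correlated (by absolute value) with target."""
    numeric = df.select_dtypes(np.number)      # drop text columns before correlating
    corr = numeric.corr()[target].drop(target)  # correlation of every feature with the target
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)

# Toy frame standing in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    "popularity": [10, 20, 30, 40, 50],
    "energy": [0.1, 0.2, 0.3, 0.4, 0.5],
    "acousticness": [0.9, 0.7, 0.5, 0.3, 0.1],
    "tempo": [120, 90, 130, 100, 110],
    "name": ["a", "b", "c", "d", "e"],
})

result = top_correlations(toy, "popularity", n=2)
print(result)
```

On the real data, the same helper could be called as `top_correlations(data, "popularity")`.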

Now let's see how music has changed over the years.

In [None]:
# Visualize how different audio features have changed over the years.
sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))

plt.plot(year_data['year'], year_data['acousticness'], label='Acousticness')
plt.plot(year_data['year'], year_data['danceability'], label='Danceability')
plt.plot(year_data['year'], year_data['energy'], label='Energy')
plt.plot(year_data['year'], year_data['valence'], label='Valence')

# Customizations
plt.title("Audio Feature Trends Over the Years", fontsize=16)
plt.xlabel("Year", fontsize=12)
plt.ylabel("Feature Value", fontsize=12)
plt.legend(fontsize=10)

plt.tight_layout()
plt.show()

Now let's see the characteristics of different genres:

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))

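# The four series share the same x positions, so the bars are semi-transparent to keep overlapping values visible
plt.bar(genre_data['genres'], genre_data['acousticness'], label='Acousticness', alpha=0.6)
plt.bar(genre_data['genres'], genre_data['danceability'], label='Danceability', alpha=0.6)
plt.bar(genre_data['genres'], genre_data['energy'], label='Energy', alpha=0.6)
plt.bar(genre_data['genres'], genre_data['valence'], label='Valence', alpha=0.6)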

plt.title("Audio Feature Characteristics by Genre", fontsize=16)
plt.xlabel("Genre", fontsize=12)
plt.ylabel("Average Feature Value", fontsize=12)
plt.xticks(rotation=90)
plt.legend(fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
sns.set(style="whitegrid")
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))

sns.barplot(x='genres', y='acousticness', data=genre_data, ax=axes[0, 0])
sns.barplot(x='genres', y='danceability', data=genre_data, ax=axes[0, 1])
sns.barplot(x='genres', y='energy', data=genre_data, ax=axes[1, 0])
sns.barplot(x='genres', y='valence', data=genre_data, ax=axes[1, 1])

axes[0, 0].set_title("Acousticness by Genre")
axes[0, 0].set_ylabel("Average Acousticness")
axes[0, 1].set_title("Danceability by Genre")
axes[0, 1].set_ylabel("Average Danceability")
axes[1, 0].set_title("Energy by Genre")
axes[1, 0].set_ylabel("Average Energy")
axes[1, 1].set_title("Valence by Genre")
axes[1, 1].set_ylabel("Average Valence")

for ax in axes.flat:
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

In [None]:
top10_genres = genre_data.nlargest(10, 'popularity')

fig = px.bar(top10_genres, x='genres', y=['valence', 'energy', 'danceability', 'acousticness'], barmode='group')
fig.show()

# Clustering Genres

We use K-Means to group the genres into 10 clusters.


In [None]:
# Create a pipeline with StandardScaler and KMeans
# (both were imported above; random_state makes the cluster assignment reproducible)
cluster_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=10, random_state=42))
])

# Select numerical features for clustering
X = genre_data.select_dtypes(np.number)

# Fit the pipeline to the data and predict clusters
cluster_pipeline.fit(X)
genre_data['cluster'] = cluster_pipeline.predict(X)
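
The number of clusters above is fixed by hand. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs, not on the genre data; the variable names (`X_demo`, `scores`, `best_k`) and the range of k are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the genre features: three well-separated groups.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit K-Means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# A higher silhouette score means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

To try this on the notebook's data, `X_scaled` could be replaced with the standardized numeric columns of `genre_data`. On real audio features the scores are usually much less clear-cut than on these blobs.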

In [None]:
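# Fit t-SNE on the standardized features (the same scaling the clustering pipeline applies),
# so that large-valued columns such as duration_ms do not dominate the distances
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(StandardScaler().fit_transform(X))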

# Add t-SNE coordinates to the genre_data dataframe
genre_data['tsne_x'] = X_tsne[:, 0]
genre_data['tsne_y'] = X_tsne[:, 1]

# Create a scatter plot of clusters using t-SNE coordinates
plt.figure(figsize=(10, 8))
sns.scatterplot(data=genre_data, x='tsne_x', y='tsne_y', hue='cluster', palette='viridis', legend='full')
plt.title('t-SNE Visualization of Genre Clusters')
plt.show()

# Now let's look at the Song Clusters

In [None]:
song_cluster_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=20, random_state=42))  # random_state makes the labels reproducible
])

X = data.select_dtypes(np.number)
number_cols = list(X.columns)
song_cluster_pipeline.fit(X)
data['cluster_label'] = song_cluster_pipeline.predict(X)

We use PCA to project the song features onto two dimensions for plotting:

In [None]:
from sklearn.decomposition import PCA

pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('PCA', PCA(n_components=2))
])

song_embedding = pca_pipeline.fit_transform(X)

projection = pd.DataFrame(columns=['x', 'y'], data=song_embedding)
projection['title'] = data['name']
projection['cluster'] = data['cluster_label']

fig = px.scatter(
    projection, x='x', y='y', color='cluster', hover_data=['x', 'y', 'title'],
    labels={'x': 'PCA Component 1', 'y': 'PCA Component 2'}, title='PCA Visualization of Song Clusters'
)

fig.show()
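
A two-component PCA projection discards some of the variance in the original features, so it helps to check how much the plot actually retains. The sketch below reads `explained_variance_ratio_` from a fitted pipeline. It runs on random synthetic data, and the names `X_demo` and `demo_pipeline` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the numeric song features:
# five columns, the first two strongly correlated with each other.
base = rng.normal(size=(500, 1))
X_demo = np.hstack([
    base,
    base * 2 + rng.normal(scale=0.1, size=(500, 1)),
    rng.normal(size=(500, 3)),
])

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
demo_pipeline.fit(X_demo)

# Fraction of the total variance captured by each retained component.
explained = demo_pipeline.named_steps["pca"].explained_variance_ratio_
print("per component:", explained)
print("total retained:", explained.sum())
```

For the notebook's own pipeline, the equivalent lookup would be `pca_pipeline.named_steps['PCA'].explained_variance_ratio_`.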

# Model Building:

#### Next, set up the Spotify credentials:

1. Create a Spotify Developer account: Go to the Spotify Developer Dashboard and log in or sign up for an account. Create a new app by clicking the "Create an App" button and fill in the necessary information about your app.

2. Get your Client ID and Client Secret: Once your app is created, you'll be able to see your Client ID and Client Secret. Keep these confidential, as they are used to authenticate requests to the Spotify Web API.

3. Use Spotipy in your code: Import the necessary modules from the Spotipy library, set your Client ID and Client Secret, and use the Spotipy methods to interact with the Spotify Web API.

In [None]:
!pip install spotipy

In [None]:
import spotipy
import os
from spotipy.oauth2 import SpotifyClientCredentials
from collections import defaultdict

# Set the environment variables (replace the placeholders with your own credentials;
# never publish real values in a shared notebook)
%env SPOTIFY_CLIENT_ID=<your-client-id>
%env SPOTIFY_CLIENT_SECRET=<your-client-secret>

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=os.environ["SPOTIFY_CLIENT_ID"],
                                                           client_secret=os.environ["SPOTIFY_CLIENT_SECRET"]))

*Special note: Run the `%env` lines in their own cell, before the cell that creates the client, to avoid errors.*
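
Since the Client ID and Client Secret must stay confidential, it can help to fail early with a clear message when they are not set, instead of hitting an authentication error later. The helper below is a hypothetical sketch (the name `load_spotify_credentials` is not part of Spotipy); it only reads the two environment variables used in this notebook.

```python
import os

def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if either is missing."""
    client_id = env.get("SPOTIFY_CLIENT_ID")
    client_secret = env.get("SPOTIFY_CLIENT_SECRET")
    required = [("SPOTIFY_CLIENT_ID", client_id), ("SPOTIFY_CLIENT_SECRET", client_secret)]
    missing = [name for name, value in required if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return client_id, client_secret

# Demonstration with a plain dict standing in for os.environ (dummy values).
demo_env = {"SPOTIFY_CLIENT_ID": "demo-id", "SPOTIFY_CLIENT_SECRET": "demo-secret"}
demo_id, demo_secret = load_spotify_credentials(demo_env)
print(demo_id)
```

In the notebook, the returned pair could then be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)`.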

In [None]:
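def find_song(name, year):
    """Look up a song on Spotify and return its features as a one-row DataFrame (None if not found)."""
    song_data = defaultdict()
    # Spotify search field filters have the form `track:<name> year:<year>` (no space after the colon)
    results = sp.search(q='track:{} year:{}'.format(name, year), limit=1)
    if results['tracks']['items'] == []:
        return None

    results = results['tracks']['items'][0]
    track_id = results['id']
    audio_features = sp.audio_features(track_id)[0]
    # The API may return None for a track that has no audio features
    if audio_features is None:
        return None

    song_data['name'] = [name]
    song_data['year'] = [year]
    song_data['explicit'] = [int(results['explicit'])]
    song_data['duration_ms'] = [results['duration_ms']]
    song_data['popularity'] = [results['popularity']]

    # Copy the audio features (valence, energy, tempo, ...) into the record
    for key, value in audio_features.items():
        song_data[key] = value

    return pd.DataFrame(song_data)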

In [None]:
from sklearn.metrics import pairwise_distances

number_cols = ['valence', 'year', 'acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'popularity', 'speechiness', 'tempo']

def get_song_data(song, spotify_data):
    # Look the song up in the local dataset first; fall back to the Spotify API if it is missing
    matches = spotify_data[(spotify_data['name'] == song['name']) & (spotify_data['year'] == song['year'])]
    if not matches.empty:
        return matches.iloc[0]
    return find_song(song['name'], song['year'])

def get_mean_vector(song_list, spotify_data):
    song_vectors = []
    for song in song_list:
        song_data = get_song_data(song, spotify_data)
        if song_data is None:
            print("Warning: {} was not found in the dataset or on Spotify".format(song['name']))
            continue
        # Flatten so a dataset row (Series) and an API result (one-row DataFrame) have the same shape
        song_vectors.append(np.asarray(song_data[number_cols].values, dtype=float).ravel())
    return np.mean(song_vectors, axis=0)

def flatten_dict_list(dict_list):
    return defaultdict(list, {key: [d[key] for d in dict_list] for key in dict_list[0]})

metadata_cols = ['name', 'year', 'artists']

def recommend_songs(song_list, spotify_data, n_songs=10):
    # Average the input songs, then rank every song by cosine distance to that average
    song_center = get_mean_vector(song_list, spotify_data)
    scaler = song_cluster_pipeline.steps[0][1]
    scaled_song_center = scaler.transform(song_center.reshape(1, -1))
    scaled_data = scaler.transform(spotify_data[number_cols])
    distances = pairwise_distances(scaled_song_center, scaled_data, metric='cosine')[0]
    index = np.argsort(distances)[:n_songs]

    # Drop the input songs themselves from the recommendations
    rec_songs = spotify_data.iloc[index]
    rec_songs = rec_songs[~rec_songs['name'].isin([song['name'] for song in song_list])]
    return rec_songs[metadata_cols].to_dict(orient='records')


In [None]:
recommend_songs([{'name': 'Somebody Like You', 'year': 2002},
                {'name': 'No Excuses', 'year': 1994},
                {'name': 'Corazón Mágico', 'year': 1995}], data)
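
The recommendation step boils down to three operations: standardize the features, average the vectors of the input songs, and rank the catalog by cosine distance to that average. The sketch below reproduces this on a five-song toy catalog so the ranking can be checked by eye. The catalog, its values, and the names `catalog`, `seed_names`, and `top` are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

# Toy catalog: A, B, and E are energetic; C and D are acoustic.
catalog = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "energy": [0.9, 0.8, 0.2, 0.1, 0.85],
    "acousticness": [0.1, 0.2, 0.9, 0.8, 0.15],
})
feature_cols = ["energy", "acousticness"]

# Standardize the features, as the scaler step of the clustering pipeline does.
scaler = StandardScaler().fit(catalog[feature_cols])
scaled_catalog = scaler.transform(catalog[feature_cols])

# Average the scaled vectors of the songs the user already likes.
seed_names = ["A", "B"]
seed_mask = catalog["name"].isin(seed_names).to_numpy()
center = scaled_catalog[seed_mask].mean(axis=0, keepdims=True)

# Rank the remaining songs by cosine distance to that average (smaller is closer).
distances = pairwise_distances(center, scaled_catalog, metric="cosine")[0]
ranked = catalog.assign(distance=distances)[~seed_mask].sort_values("distance")
top = ranked["name"].head(2).tolist()
print(ranked[["name", "distance"]])
```

Song E, which shares the high-energy, low-acousticness profile of the seeds, ranks first, while the acoustic songs C and D end up far away. Note that this sketch removes the seed songs before taking the top results, whereas `recommend_songs` filters them out afterwards and can therefore return fewer than `n_songs` entries.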

# Conclusion

In this project, we embarked on a journey to create a comprehensive music recommendation system using the Spotify dataset and various data science techniques. We started by exploring the dataset, analyzing music trends over the years, and investigating the characteristics of different genres. Through insightful visualizations, we gained a deeper understanding of how music has evolved and diversified over time.

We then utilized machine learning techniques to cluster both genres and songs, uncovering hidden patterns and relationships within the music data. The K-means clustering algorithm enabled us to group genres and songs with similar audio features, providing a meaningful segmentation for further analysis.

Using Spotipy, a Python library for the Spotify Web API, we retrieved audio features for songs that are missing from the dataset. By averaging the standardized audio features of the user's chosen songs and ranking the catalog by cosine distance to that average, we built a system that suggests songs close to the user's taste.

This project is not only a testament to the capabilities of data science but also a tribute to the artistry of music itself. By leveraging data insights and machine learning, we have created a platform that bridges the gap between music data and user experience, ultimately bringing listeners closer to the melodies that resonate with them.

In conclusion, this project demonstrates the synergy between technology and creativity, showcasing how data science can enrich our understanding of music and enhance the way we explore and enjoy the vast world of musical artistry.

Acknowledgement:


*   https://www.kaggle.com/code/vatsalmavani/music-recommendation-system-using-spotify-dataset/notebook
