# Load Dataset

**1. Data Ingestion & Quality Checks**
- Loaded the main Spotify-like dataset (`songs_normalize.csv`, with an absolute-path fallback). Added robust delimiter handling.
- Inspected shape, columns, head, and schema (`df.info()`).
- Checked and handled duplicates and missing values; removed duplicate rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
import seaborn as sns
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',None)
init_notebook_mode(connected=True)

In [2]:
# Load songs_normalize.csv with pandas
from pathlib import Path
import pandas as pd
from IPython.display import display

# Prefer relative path from the notebook folder; fallback to absolute path
csv_path = Path("songs_normalize.csv")
if not csv_path.exists():
    csv_path = Path(r"f:\Business Analytics Dk\Computational Tools for Data Science 1-2\songs_normalize.csv")

# Verify existence
assert csv_path.exists(), f"File not found: {csv_path}"

# Load CSV with robust delimiter handling
try:
    df = pd.read_csv(csv_path)  # assume comma
except pd.errors.ParserError:
    # Try to auto-detect delimiter with Python engine
    try:
        df = pd.read_csv(csv_path, sep=None, engine="python")
    except Exception:
        # Fallback to common delimiters
        df = None
        for sep in [";", "\t", "|"]:
            try:
                df = pd.read_csv(csv_path, sep=sep)
                break
            except Exception:
                pass
        if df is None:
            raise

# Quick sanity checks
print("Path:", csv_path)
print("Shape:", df.shape)
print("Columns:", list(df.columns))
display(df.head(10))
print()
df.info()

Path: songs_normalize.csv
Shape: (2000, 18)
Columns: ['artist', 'song', 'duration_ms', 'explicit', 'year', 'popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'genre']


Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.3,1.8e-05,0.355,0.894,95.053,pop
1,blink-182,All The Small Things,167066,False,1999,79,0.434,0.897,0,-4.918,1,0.0488,0.0103,0.0,0.612,0.684,148.726,"rock, pop"
2,Faith Hill,Breathe,250546,False,1999,66,0.529,0.496,7,-9.007,1,0.029,0.173,0.0,0.251,0.278,136.859,"pop, country"
3,Bon Jovi,It's My Life,224493,False,2000,78,0.551,0.913,0,-4.063,0,0.0466,0.0263,1.3e-05,0.347,0.544,119.992,"rock, metal"
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.0408,0.00104,0.0845,0.879,172.656,pop
5,Sisqo,Thong Song,253733,True,1999,69,0.706,0.888,2,-6.959,1,0.0654,0.119,9.6e-05,0.07,0.714,121.549,"hip hop, pop, R&B"
6,Eminem,The Real Slim Shady,284200,True,2000,86,0.949,0.661,5,-4.244,0,0.0572,0.0302,0.0,0.0454,0.76,104.504,hip hop
7,Robbie Williams,Rock DJ,258560,False,2000,68,0.708,0.772,7,-4.264,1,0.0322,0.0267,0.0,0.467,0.861,103.035,"pop, rock"
8,Destiny's Child,Say My Name,271333,False,1999,75,0.713,0.678,5,-3.525,0,0.102,0.273,0.0,0.149,0.734,138.009,"pop, R&B"
9,Modjo,Lady - Hear Me Tonight,307153,False,2001,77,0.72,0.808,6,-5.627,1,0.0379,0.00793,0.0293,0.0634,0.869,126.041,Dance/Electronic



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist            2000 non-null   object 
 1   song              2000 non-null   object 
 2   duration_ms       2000 non-null   int64  
 3   explicit          2000 non-null   bool   
 4   year              2000 non-null   int64  
 5   popularity        2000 non-null   int64  
 6   danceability      2000 non-null   float64
 7   energy            2000 non-null   float64
 8   key               2000 non-null   int64  
 9   loudness          2000 non-null   float64
 10  mode              2000 non-null   int64  
 11  speechiness       2000 non-null   float64
 12  acousticness      2000 non-null   float64
 13  instrumentalness  2000 non-null   float64
 14  liveness          2000 non-null   float64
 15  valence           2000 non-null   float64
 16  tempo             2000 non-null   float64

In [3]:
df.isnull().sum()

artist              0
song                0
duration_ms         0
explicit            0
year                0
popularity          0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
genre               0
dtype: int64

In [4]:
df.duplicated().value_counts()

False    1941
True       59
Name: count, dtype: int64

In [5]:
df.drop_duplicates(inplace=True)

In [6]:
df.shape

(1941, 18)

In [7]:
df.describe()

Unnamed: 0,duration_ms,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
count,1941.0,1941.0,1941.0,1941.0,1941.0,1941.0,1941.0,1941.0,1941.0,1941.0,1941.0,1941.0,1941.0,1941.0
mean,228594.973725,2009.52035,59.633179,0.667814,0.721549,5.369397,-5.514082,0.553323,0.103783,0.128173,0.015372,0.181726,0.552966,120.158442
std,39249.796103,5.875532,21.501053,0.140608,0.152872,3.61527,1.93895,0.497277,0.096148,0.172584,0.088371,0.14091,0.220845,26.990475
min,113000.0,1998.0,0.0,0.129,0.0549,0.0,-20.514,0.0,0.0232,1.9e-05,0.0,0.0215,0.0381,60.019
25%,203506.0,2004.0,56.0,0.581,0.624,2.0,-6.49,0.0,0.0397,0.0135,0.0,0.0884,0.39,98.986
50%,223186.0,2010.0,65.0,0.676,0.739,6.0,-5.285,1.0,0.061,0.0558,0.0,0.124,0.56,120.028
75%,247946.0,2015.0,73.0,0.765,0.84,8.0,-4.168,1.0,0.129,0.176,6.9e-05,0.242,0.731,134.199
max,484146.0,2020.0,89.0,0.975,0.999,11.0,-0.276,1.0,0.576,0.976,0.985,0.853,0.973,210.851


**2. Exploratory Data Analysis (EDA)**
- Descriptive statistics and correlation heatmap on numeric features.
- Multiple Plotly visualizations: yearly counts, popularity trends, genre distributions, artist productivity & popularity, explicit content patterns, feature relationships (energy vs loudness, tempo vs popularity, etc.).

In [8]:
import plotly.express as px

# Correlate numeric columns only (.corr() cannot handle text columns)
fig = px.imshow(df.select_dtypes(include='number').corr(), 
                text_auto=True,
                height=800,
                width=800,
                color_continuous_scale=px.colors.sequential.Greens,
                aspect='auto',
                title='<b>Pairwise Correlation of Columns')

fig.update_layout(title_x=0.5)
fig.show()

In [9]:
# Count songs per year (groupby already returns the years in ascending order)
songs_per_year = df.groupby('year', as_index=False)['song'].count()
fig = px.area(songs_per_year, x='year', y='song', markers=True,
              labels={'song': 'Total songs'}, color_discrete_sequence=['green'],
              title='<b>Number of Songs per Year')
fig.update_layout(hovermode='x', title_x=0.5)

In [12]:
px.bar(df.groupby('artist',as_index=False)['song'].count().sort_values(by='song',ascending=False).head(50),x='artist',y='song',labels={'song':'Total Songs'},width=1000,color_discrete_sequence=['green'],text='song',title='<b>Top 50 Artists by Number of Songs')

In [13]:
# Sum only the popularity column; a bare .sum() would also try to aggregate text columns
px.bar(df.groupby('artist',as_index=False)['popularity'].sum().sort_values(by='popularity',ascending=False).head(30),x='artist',y='popularity',color_discrete_sequence=['lightgreen'],template='plotly_dark',text='popularity',title='<b>Top 30 Artists by Total Popularity')

In [14]:
fig=px.line(df.sort_values(by='popularity',ascending=False).head(25),x='song',y='popularity',hover_data=['artist'],color_discrete_sequence=['green'],markers=True,title='<b> Top 25 songs in Spotify')
fig.show()

In [15]:
fig=px.treemap(df,path=[px.Constant('Singer'),'artist','genre','song'],values='popularity',title='<b>TreeMap of Singers Playlist')
fig.update_traces(root_color='lightgreen')
fig.update_layout(title_x=0.5)

In [16]:
fig=px.pie(df.groupby('explicit',as_index=False).count().sort_values(by='song',ascending=False),names='explicit',values='song',labels={'song':'Total songs'},hole=.6,color_discrete_sequence=['green','crimson'],template='plotly_dark',title='<b>Songs having explicit content')
fig.update_layout(title_x=0.5)

In [17]:
fig=px.area(df[df['explicit']==True].groupby('year',as_index=False).count().sort_values(by='song',ascending=False).sort_values(by='year'),x='year',y='song',labels={'song':'Total songs'},markers=True,color_discrete_sequence=['red'],template='plotly_dark',title='<b>Yearwise explicit content songs')
fig.update_layout(hovermode='x')

In [18]:
px.box(df,x='explicit',y='popularity',color='explicit',template='plotly_dark',color_discrete_sequence=['cyan','magenta'],title='<b>popularity based on explicit content')

In [19]:
px.scatter(df,x='tempo',y='popularity',color='tempo',color_continuous_scale=px.colors.sequential.Plasma,template='plotly_dark',title='<b>Tempo Versus Popularity')

In [20]:
px.scatter(df,x='speechiness',y='popularity',color='speechiness',color_continuous_scale=px.colors.sequential.Plasma,template='plotly_dark',title='<b> Speechiness Versus Popularity')

In [21]:
px.scatter(df,x='energy',y='danceability',color='danceability',color_continuous_scale=px.colors.sequential.Plotly3,template='plotly_dark',title='<b>Energy Versus Danceability')

In [22]:
px.scatter(df,x='energy',y='loudness',color_discrete_sequence=['lightgreen'],template='plotly_dark',title='<b>Energy versus Loudness correlation')


**3. Clustering ("Vibe" Exploration)**
- Feature selection focused on perceptual audio attributes: energy, loudness, danceability, tempo, valence.
- Scaled features before clustering.
- Applied K-Means: elbow method to choose k; visualized clusters and t-SNE projection for interpretability.
- Added extended clustering overview: K-Means, DBSCAN, Agglomerative, CURE (conceptual) with evaluation metrics (Davies-Bouldin, Silhouette, Calinski-Harabasz), plus a comparison demo cell.

In [63]:
# Group by year and get the average
year_trends = df.groupby('year')[['energy', 'loudness', 'danceability']].mean().reset_index()

# Plot the trend
fig = px.line(year_trends, x='year', y=['energy', 'loudness'],
              title='<b>Trend of Energy and Loudness Over Time</b>',
              template='plotly_dark')
fig.show()

In [23]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# --- Select the features you want to cluster on ---
# Let's use the two you plotted, but you can add more!
# e.g., features = ['energy', 'loudness', 'tempo', 'danceability']
features = ['energy', 'loudness'] 

# Create a new DataFrame for scaling
df_features = df[features]

# --- Scale the data ---
# This is crucial for K-Means to work properly
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_features)
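
To see why scaling matters for a distance-based method like K-Means, here is a minimal sketch on a few made-up rows (the values are hypothetical, not taken from the dataset). Without scaling, `tempo` (in BPM) dominates the Euclidean distance and `energy` (0 to 1) barely counts; after `StandardScaler` both columns contribute on a comparable footing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy rows (hypothetical values): columns are [energy, tempo]
X = np.array([
    [0.90, 120.0],   # song A: high energy
    [0.10, 121.0],   # song B: low energy, almost the same tempo as A
    [0.88, 150.0],   # song C: energy like A, faster tempo
    [0.50,  90.0],   # song D: filler row so the column statistics are less degenerate
])

def dist(M, i, j):
    """Euclidean distance between rows i and j of matrix M."""
    return float(np.linalg.norm(M[i] - M[j]))

# Raw units: the tempo difference swamps the energy difference
raw_ab, raw_ac = dist(X, 0, 1), dist(X, 0, 2)

# After standardization each column has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_ab, scaled_ac = dist(X_scaled, 0, 1), dist(X_scaled, 0, 2)

print(f"raw:    d(A,B)={raw_ab:.2f}  d(A,C)={raw_ac:.2f}")
print(f"scaled: d(A,B)={scaled_ab:.2f}  d(A,C)={scaled_ac:.2f}")
```

In raw units A looks closest to B even though their energy is very different; after scaling A is closer to C, which matches the intuition that both features should count.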

### Choosing Our Features (The "Vibe" Model)

This set of features is excellent for clustering because all five describe the *human-perceived feel* and *composition* of a song, rather than just its technical audio properties. They work together to define a multi-dimensional "vibe space."

* **`energy`**: A perceptual measure of intensity and activity.
    * **What it is:** This feature represents how fast, loud, and "noisy" a song feels. A high energy track (e.g., death metal, a high-octane dance track) feels active and intense, while a low energy track (e.g., a sad ballad, an ambient piece) feels calm.
    * **Why it's appropriate:** This is a core component of "vibe." A "chill" cluster will have low energy, while a "workout" cluster will have high energy.

* **`loudness`**: The overall loudness of a track in decibels (dB).
    * **What it is:** This is a physical measurement of the average volume of the song. While it often correlates with `energy`, it's not the same. You can have a "loud" but "low energy" recording (like a heavily compressed but slow-moving classical piece).
    * **Why it's appropriate:** Loudness is a key contributor to intensity. Your K-Means model will use this to help separate tracks that *feel* intense from those that are quieter.

* **`danceability`**: How suitable a track is for dancing.
    * **What it is:** This is a high-level metric calculated from a combination of *tempo*, *rhythm stability*, and *beat strength*. A high score doesn't just mean "fast," it means the song has a consistent, danceable rhythm (e.g., a disco track).
    * **Why it's appropriate:** This is a perfect "vibe" feature. It directly helps separate "background music" from "groove-based" music, which is a major distinction in how we categorize songs.

* **`tempo`**: The speed or pace of a track, in Beats Per Minute (BPM).
    * **What it is:** The "speed" of the underlying pulse. A low tempo song is slow (like a ballad), and a high tempo song is fast (like a techno track).
    * **Why it's appropriate:** Along with `energy` and `danceability`, tempo forms the "energetic" axis of your analysis. It's a fundamental part of a song's feel.

* **`valence`**: A measure of the musical positiveness (mood) of a track.
    * **What it is:** Of the five features, this is the one **most directly tied to mood**. High valence songs sound happy, cheerful, and euphoric. Low valence songs sound sad, angry, or depressed.
    * **Why it's appropriate:** This is the *direct* measure of "mood" or "vibe." Your clusters will be heavily defined by this. For example, 
        * **Cluster 1:** High `valence` + High `energy` (Happy, workout music)
        * **Cluster 2:** Low `valence` + Low `energy` (Sad, acoustic ballads)
        * **Cluster 3:** Low `valence` + High `energy` (Angry, intense music)
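
One way to check whether the clusters really look like the examples above is to average each feature per cluster. The sketch below assumes a `cluster` column like the one created by the K-Means cells; `profile_clusters` is a hypothetical helper name and the toy values are made up.

```python
import pandas as pd

def profile_clusters(frame, features, cluster_col="cluster"):
    """Mean of each feature per cluster, plus the number of songs in it."""
    profile = frame.groupby(cluster_col)[features].mean()
    profile["n_songs"] = frame.groupby(cluster_col).size()
    return profile.round(3)

# Toy data (hypothetical values) standing in for the real DataFrame
toy = pd.DataFrame({
    "energy":  [0.90, 0.85, 0.20, 0.25, 0.88, 0.92],
    "valence": [0.80, 0.90, 0.15, 0.20, 0.10, 0.20],
    "cluster": ["0", "0", "1", "1", "2", "2"],
})
profile = profile_clusters(toy, ["energy", "valence"])
print(profile)
```

On the real data you would call `profile_clusters(df, features)` after clustering and read each row as a candidate "vibe" (for example high energy with low valence).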

---

### Why Other Features Are Less Appropriate (For This Task)

The other variables in your dataset are less ideal for a *mood clustering* task because they either **don't describe the song's feel** or they **aren't numeric**.

#### 1. Categorical / ID Features

These features are text-based and cannot be used in a mathematical algorithm like K-Means.
* **`artist`**, **`song`**: These are just identifiers. The algorithm can't calculate the "average" of "Britney Spears" and "Queen."
* **`genre`**: This is a label. While you *could* convert it to numbers (one-hot encoding), it's often the *result* you want to analyze, not the *input*. You want to *discover* clusters of vibes, not just re-cluster the genres you already have.
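
If you did want to encode `genre`, note that a row can carry several comma-separated genres (for example `"rock, pop"`), so a plain one-hot encoding of the raw string would treat each combination as its own category. A minimal sketch of a multi-label encoding with pandas' `Series.str.get_dummies`, using genre strings copied from the `df.head(10)` output:

```python
import pandas as pd

# Genre strings as they appear in the first rows of the dataset
genres = pd.Series(
    ["pop", "rock, pop", "hip hop, pop, R&B", "Dance/Electronic"],
    name="genre",
)

# Split on ", " and build one 0/1 indicator column per distinct genre
genre_flags = genres.str.get_dummies(sep=", ")
print(genre_flags)
```

These indicator columns are better suited to comparing discovered clusters against genres afterwards than to feeding into K-Means.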

#### 2. Contextual / Outcome Features

These features are numeric, but they don't describe the *sound* of the music. They describe its *context* or *popularity*.
* **`year`**: This describes *when* a song was made, not what it sounds like (though production styles change). Clustering on `year` would just give you "songs from the early 2000s" or "songs from the late 2010s," which isn't a "vibe" cluster.
* **`popularity`**: This is an *outcome* of how many people streamed the song. It has no bearing on whether the song is happy, sad, or energetic.
* **`duration_ms`**: The length of a song doesn't define its mood. You can have a 2-minute happy punk song and a 10-minute sad orchestral piece.

#### 3. Technical / Timbre Features

These are numeric and describe the sound, but they measure *what* is in the track, not the overall *feel*. Including them can confuse the model.
* **`acousticness`**, **`instrumentalness`**: These measure *timbre* (the "color" of the sound). `acousticness` measures the likelihood the song is acoustic. `instrumentalness` measures the lack of vocals. If you include these, your model might group a "sad acoustic ballad" with a "happy acoustic folk song," which defeats your goal of clustering by *mood*.
* **`speechiness`**: This detects the presence of spoken words. It's great for separating podcasts from music, but not for finding the "vibe" of a song (unless you specifically want a "rap" cluster).
* **`key`**, **`mode`**: These are music theory concepts.
    * **`mode`** (Major/Minor) actually *is* relevant to mood (Major is often happy, Minor is often sad), but it is only a binary flag and overlaps with what `valence` already measures.
    * **`key`** (C, G, F#, etc.), however, is almost completely irrelevant to mood for most listeners and would just add noise to the model.
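
How closely `mode` tracks `valence` is something you can measure rather than assume. The sketch below is a hypothetical helper (`mode_valence_summary`) run on made-up values; on the real data you would pass `df` and read off the numbers.

```python
import pandas as pd

def mode_valence_summary(frame):
    """Pearson correlation of mode with valence, and mean valence per mode."""
    corr = frame["mode"].corr(frame["valence"])
    means = frame.groupby("mode")["valence"].mean()
    return corr, means

# Toy frame (hypothetical values); 1 = major, 0 = minor
toy = pd.DataFrame({
    "mode":    [1, 1, 1, 0, 0, 0],
    "valence": [0.9, 0.4, 0.6, 0.5, 0.3, 0.7],
})
corr, means = mode_valence_summary(toy)
print(f"corr(mode, valence) = {corr:.3f}")
print(means)
```

A correlation near zero would mean `mode` carries information that `valence` does not, which is an argument for testing it as an extra feature.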

**Conclusion:** Your chosen list is a "Goldilocks" set: it's focused, relevant, and all the features work together to describe the "vibe" you're trying to find.

In [24]:
# A list to store the inertia (WCSS) for each k
inertia = []

# We will test k from 1 up to 10
k_range = range(1, 11)

for k in k_range:
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    model.fit(scaled_features)
    inertia.append(model.inertia_)

# --- Plot the elbow curve ---
# Since you're using Plotly, let's stick with it
fig = px.line(x=k_range, y=inertia, 
              title='Elbow Method for Optimal k',
              labels={'x': 'Number of Clusters (k)', 'y': 'Inertia (WCSS)'},
              template='plotly_dark')

fig.update_layout(title_x=0.5)
fig.show()
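
The elbow can be hard to read by eye, so a second opinion helps. Here is a minimal sketch that scores each k with the silhouette coefficient instead of inertia. It runs on synthetic blobs (an assumption, so that it is self-contained); with the notebook data you would pass `scaled_features` in place of `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled_features: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):   # the silhouette is undefined for k=1
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("highest silhouette at k =", best_k)
```

Real song data is unlikely to separate this cleanly, so treat the highest-scoring k as a candidate to inspect, not as a final answer.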

In [25]:
# --- SET YOUR K HERE based on the elbow plot ---
chosen_k = 5

# --- Apply K-Means with your chosen k ---
kmeans = KMeans(n_clusters=chosen_k, init='k-means++', n_init=10, random_state=42)
kmeans.fit(scaled_features)

# Get the cluster labels for each song
cluster_labels = kmeans.labels_

# --- Add the cluster labels back to your ORIGINAL DataFrame ---
# This is great for analysis
df['cluster'] = cluster_labels

# --- Visualize the clusters ---
# Let's re-create your 'energy' vs 'loudness' scatter plot,
# but now colored by the new clusters!

# Make sure to convert the numeric 'cluster' column to a string
# so Plotly treats it as a category (for distinct colors)
df['cluster'] = df['cluster'].astype(str) 

fig = px.scatter(df,
                 x='energy',
                 y='loudness',
                 color='cluster', # Color by the new cluster labels
                 template='plotly_dark',
                 title=f'<b>Song Clusters (k={chosen_k})</b>')

fig.update_layout(title_x=0.5)
fig.show()

In [26]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# --- 1. Select the features you want to cluster on ---
# You can add more numeric columns to this list!
features = ['energy', 'loudness', 'danceability', 'tempo', 'valence'] 

# Create a new DataFrame with only your selected features
# (We'll filter out any non-numeric columns just in case)
df_features = df.select_dtypes(include='number')[features]

# --- 2. Scale the data ---
# This is critical for K-Means to work properly
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_features)

print("Data scaled successfully.")

Data scaled successfully.


## Clustering algorithms overview

This section summarizes common clustering methods and how to evaluate them.

- K-means
  - Partitions data into k clusters by minimizing within-cluster variance (inertia).
  - Strengths: fast, scalable, simple. Works best with spherical, similarly sized clusters.
  - Gotchas: requires k; sensitive to scale and outliers; assumes Euclidean geometry.
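
As a rough illustration of how such a comparison could look, the sketch below fits K-Means, Agglomerative clustering, and DBSCAN on synthetic blobs and reports the Silhouette, Davies-Bouldin, and Calinski-Harabasz scores. The data, the `eps` value, and the choice of three clusters are all assumptions made for the demo; CURE is left out because scikit-learn does not ship an implementation.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled_features (not the song data)
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),   # eps is a hypothetical choice
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    keep = labels != -1                  # DBSCAN marks noise points as -1
    n_clusters = len(set(labels[keep]))
    if n_clusters < 2:                   # the metrics need at least 2 clusters
        results[name] = None
        continue
    results[name] = {
        "clusters": n_clusters,
        # higher is better
        "silhouette": silhouette_score(X[keep], labels[keep]),
        # lower is better
        "davies_bouldin": davies_bouldin_score(X[keep], labels[keep]),
        # higher is better
        "calinski_harabasz": calinski_harabasz_score(X[keep], labels[keep]),
    }

for name, res in results.items():
    print(name, res)
```

Dropping DBSCAN's noise points before scoring flatters it slightly, so it is worth reporting the number of noise points alongside the metrics.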




In [27]:
# A list to store the inertia (WCSS) for each k
inertia = []

# We will test k from 1 up to 10
k_range = range(1, 11)

print("Calculating inertia for k=1 to 10...")
for k in k_range:
    # We use n_init=10 to run the algorithm 10 times with different starting points
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    model.fit(scaled_features)
    inertia.append(model.inertia_)

print("Plotting elbow curve...")

# --- Plot the elbow curve using Plotly ---
fig = px.line(x=k_range, y=inertia, 
              title='<b>Elbow Method for Optimal k</b>',
              labels={'x': 'Number of Clusters (k)', 'y': 'Inertia (WCSS)'},
              template='plotly_dark')

fig.update_layout(title_x=0.5)
fig.show()

Calculating inertia for k=1 to 10...
Plotting elbow curve...


## The clustering set up in the previous step is a Mood & Vibe grouping. You group songs based on their perceived feel, not their timbre.

In [64]:
# --- 1. SET YOUR K HERE based on the elbow plot ---
chosen_k = 6  # <-- CHANGE THIS NUMBER BASED ON YOUR ELBOW PLOT

print(f"Applying K-Means with k={chosen_k}...")

# --- 2. Apply K-Means with your chosen k ---
kmeans = KMeans(n_clusters=chosen_k, init='k-means++', n_init=10, random_state=42)
kmeans.fit(scaled_features)

# Get the cluster labels for each song
cluster_labels = kmeans.labels_

# --- 3. Add the cluster labels back to your ORIGINAL DataFrame ---
df['cluster'] = cluster_labels

# --- 4. Visualize the clusters ---
# Convert 'cluster' to a string so Plotly treats it as a category
df['cluster'] = df['cluster'].astype(str) 

fig = px.scatter(df,
                 x='energy',
                 y='loudness',
                 color='cluster', # Color by the new cluster labels
                 template='plotly_dark',
                 title=f'<b>Song Clusters (k={chosen_k})</b>',
                 hover_data=['song', 'artist']) # Show song and artist on hover

fig.update_layout(title_x=0.5)
fig.show()

Applying K-Means with k=6...


In [65]:
from sklearn.manifold import TSNE

# 1. Use the same 'scaled_features' from your K-Means step
# (The 5D data you used to find the clusters)

# 2. Reduce the 5 dimensions down to 2
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_components = tsne.fit_transform(scaled_features)

# 3. Create a new DataFrame for plotting
df_tsne = pd.DataFrame(tsne_components, columns=['tsne_1', 'tsne_2'])
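# Add your cluster labels from K-Means
# (.to_numpy() copies by position; df's index has gaps after drop_duplicates,
# so plain column assignment would misalign rows against df_tsne's 0..n-1 index)
df_tsne['cluster'] = df['cluster'].to_numpy()
# Add song info for hover
df_tsne['song'] = df['song'].to_numpy()
df_tsne['artist'] = df['artist'].to_numpy()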

# 4. Plot the results!
fig = px.scatter(df_tsne,
                 x='tsne_1',
                 y='tsne_2',
                 color='cluster', # Color by the clusters you found
                 hover_data=['song', 'artist'],
                 template='plotly_dark',
                 title='<b>2D t-SNE Visualization of K-Means Clusters</b>')

fig.update_layout(title_x=0.5)
fig.show()
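
A detail worth keeping in mind when copying columns into a freshly built frame such as `df_tsne`: pandas aligns on index labels, not on row position. If the source frame's index has gaps (for example after dropping rows without resetting the index), label-based assignment silently produces missing or shifted values. A minimal sketch with made-up rows:

```python
import pandas as pd

# A frame whose index has a gap, as happens after dropping rows without resetting
songs = pd.DataFrame({"song": ["a", "b", "c"], "cluster": ["0", "1", "0"]},
                     index=[0, 2, 3])

# A fresh frame (like a t-SNE result) gets a clean 0..n-1 index
proj = pd.DataFrame({"tsne_1": [0.1, 0.2, 0.3]})

proj["by_label"] = songs["cluster"]                # aligns on index labels
proj["by_position"] = songs["cluster"].to_numpy()  # copies row by row

print(proj)
```

The `by_label` column ends up with a missing value and a shifted label, while `by_position` keeps each row paired with its own cluster.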

**4. Recommendation Approaches**
- Built multiple recommendation strategies using the vibe feature space:
  - Same-cluster sampling.
  - Centroid distance ranking.
  - KNN-based nearest neighbors (within-cluster search, falling back to a global search).
- Added comparison logic to view KNN vs centroid recommendations side-by-side.

In [58]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans


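# -------------------------
# YOUR VIBE MODEL FEATURES
# NOTE: the `scaler` and `kmeans` passed to the functions below must have been
# fitted on exactly these columns (not on the 5-feature set used earlier),
# and df[CLUSTER_COL] must hold the integer labels from that same kmeans.
# -------------------------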
VIBE_FEATURES = [
    'danceability', 'energy', 'valence', 'acousticness',
    'instrumentalness', 'speechiness', 'liveness', 'tempo', 'loudness'
]

TRACK_COL = "song"                # Actual name in your dataframe
CLUSTER_COL = "vibe_cluster"      # Your cluster column


# ------------------------------------
# RECOMMENDATION METHOD 1:
# SAME CLUSTER RECOMMENDER
# ------------------------------------
def recommend_same_cluster(df, song_name, n=10):
    if song_name not in df[TRACK_COL].values:
        raise ValueError(f"Song '{song_name}' not found in df['{TRACK_COL}'].")

    # get cluster id
    cluster_id = df[df[TRACK_COL] == song_name][CLUSTER_COL].iloc[0]

    # candidates in same cluster
    candidates = df[df[CLUSTER_COL] == cluster_id].copy()

    # remove the query song
    candidates = candidates[candidates[TRACK_COL] != song_name]

    # sample N
    return candidates.sample(min(n, len(candidates)), random_state=1).reset_index(drop=True)



# ------------------------------------
# RECOMMENDATION METHOD 2:
# NEAREST TO CENTROID IN SAME CLUSTER
# ------------------------------------

def recommend_by_centroid(df, song_name, scaler, kmeans, n=10):
    if song_name not in df[TRACK_COL].values:
        raise ValueError(f"Song '{song_name}' not found in df['{TRACK_COL}'].")

    # scale all song vectors
    X_scaled = scaler.transform(df[VIBE_FEATURES])

    # index of query song
    song_idx = df.index[df[TRACK_COL] == song_name][0]

    # get cluster ID & centroid
    cluster_id = df.loc[song_idx, CLUSTER_COL]
    centroid = kmeans.cluster_centers_[cluster_id]

    # compute distances to centroid
    distances = np.linalg.norm(X_scaled - centroid, axis=1)

    # add distances
    df_local = df.copy()
    df_local["centroid_dist"] = distances

    # filter same cluster
    same_cluster = df_local[df_local[CLUSTER_COL] == cluster_id]

    # remove the query song
    same_cluster = same_cluster[same_cluster[TRACK_COL] != song_name]

    # return closest N songs
    return same_cluster.sort_values("centroid_dist").head(n).reset_index(drop=True)


# ------------------------------------
# RECOMMENDATION METHOD 3:
# KNN NEAREST NEIGHBOURS RECOMMENDER
# ------------------------------------
from sklearn.neighbors import NearestNeighbors

def recommend_knn(df, song_name, scaler, n=10, within_cluster=True):
    """
    Finds the n nearest songs to `song_name` based on scaled vibe features.
    If within_cluster=True, only searches inside the same vibe_cluster.
    """

    if song_name not in df[TRACK_COL].values:
        raise ValueError(f"Song '{song_name}' not found in df['{TRACK_COL}'].")

    # Scale the vibe-feature matrix
    X = df[VIBE_FEATURES].values.astype(float)
    X_scaled = scaler.transform(X)

    # Index of the query song
    query_idx = df.index[df[TRACK_COL] == song_name][0]

    # Restrict search to same cluster (recommended)
    if within_cluster:

        cluster_id = df.loc[query_idx, CLUSTER_COL]

        # All songs in same cluster
        mask = df[CLUSTER_COL] == cluster_id
        X_cluster = X_scaled[mask]
        idx_cluster = df[mask].index.tolist()

        # Handle tiny clusters
        if len(idx_cluster) <= 1:
            print("Cluster too small - using global search instead.")
            within_cluster = False
        else:
            # Train KNN on cluster items
            knn = NearestNeighbors(n_neighbors=min(n+1, len(idx_cluster)), algorithm='auto')
            knn.fit(X_cluster)

            # Find position of query inside cluster index list
            query_pos = idx_cluster.index(query_idx)

            # Query the KNN
            distances, indices = knn.kneighbors(X_cluster[query_pos].reshape(1, -1))

            # Build results (excluding the song itself)
            results = []
            for dist, idx_pos in zip(distances[0], indices[0]):
                orig_idx = idx_cluster[idx_pos]
                if orig_idx == query_idx:
                    continue
                results.append((orig_idx, dist))
                if len(results) >= n:
                    break

            # Build dataframe output
            rec_df = df.loc[[i for i, _ in results]].copy()
            rec_df["distance"] = [d for _, d in results]

            return rec_df[[TRACK_COL, "artist", CLUSTER_COL, "distance"]].reset_index(drop=True)

    # ------------------------------------------
    # GLOBAL SEARCH (if cluster too small or optional)
    # ------------------------------------------
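    # Positional row of the query song; index labels have gaps after drop_duplicates,
    # so the label cannot be used to index the NumPy array directly
    query_pos = df.index.get_loc(query_idx)

    knn = NearestNeighbors(n_neighbors=min(n + 1, len(df)), algorithm='auto')
    knn.fit(X_scaled)

    distances, indices = knn.kneighbors(X_scaled[query_pos].reshape(1, -1))

    results = []
    for dist, pos in zip(distances[0], indices[0]):
        if pos == query_pos:
            continue
        results.append((pos, dist))
        if len(results) >= n:
            break

    rec_df = df.iloc[[i for i, _ in results]].copy()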
    rec_df["distance"] = [d for _, d in results]

    return rec_df[[TRACK_COL, "artist", CLUSTER_COL, "distance"]].reset_index(drop=True)
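
The cell that fits the `scaler` and `kmeans` on `VIBE_FEATURES` and creates the `vibe_cluster` column is not shown in this notebook. A minimal sketch of what that step could look like follows; `fit_vibe_model` is a hypothetical helper, the number of clusters is an arbitrary placeholder, and random toy data stands in for the song DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

VIBE_FEATURES = [
    'danceability', 'energy', 'valence', 'acousticness',
    'instrumentalness', 'speechiness', 'liveness', 'tempo', 'loudness'
]

def fit_vibe_model(frame, n_clusters, random_state=42):
    """Fit a scaler and K-Means on VIBE_FEATURES.

    Returns (copy of frame with a 'vibe_cluster' column, scaler, kmeans).
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(frame[VIBE_FEATURES])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    out = frame.copy()
    # Keep labels as integers so they can index kmeans.cluster_centers_
    out["vibe_cluster"] = kmeans.fit_predict(X_scaled)
    return out, scaler, kmeans

# Random toy data (hypothetical) standing in for the song DataFrame
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((60, len(VIBE_FEATURES))), columns=VIBE_FEATURES)
toy["song"] = [f"song_{i}" for i in range(len(toy))]

toy, scaler, kmeans = fit_vibe_model(toy, n_clusters=4)  # 4 is an arbitrary choice
print(toy["vibe_cluster"].value_counts().sort_index())
```

Keeping `vibe_cluster` as an integer column matters because `recommend_by_centroid` uses it to index `kmeans.cluster_centers_`; the string-typed `cluster` column used for plotting would not work there.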


In [59]:
recommend_knn(df, "Oops!...I Did It Again", scaler, n=10)


Unnamed: 0,song,artist,vibe_cluster,distance
0,Hips Don't Lie (feat. Wyclef Jean),Shakira,16,0.849365
1,Soltera - Remix,Lunay,16,1.112294
2,Black Horse And The Cherry Tree,KT Tunstall,16,1.336496
3,Dura,Daddy Yankee,16,1.383012
4,Hit 'Em Up Style (Oops!),Blu Cantrell,16,1.46275
5,Good Grief,Bastille,16,1.466201
6,What Took You So Long?,Emma Bunton,16,1.524489
7,Heaven Is a Halfpipe (If I Die),OPM,16,1.654326
8,Down With The Trumpets,Rizzle Kicks,16,1.826194
9,Black Magic,Little Mix,16,1.834119


In [60]:
recommend_by_centroid(df, "Oops!...I Did It Again", scaler, kmeans, n=10)


Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre,cluster,vibe_cluster,centroid_dist
0,Flo Rida,Good Feeling,248133,False,2012,76,0.706,0.89,1,-4.444,0,0.0688,0.0588,0.00286,0.306,0.684,128.011,"hip hop, pop",1,16,0.463477
1,Example,Kickstarts,181826,False,2010,63,0.61,0.836,5,-4.455,1,0.0573,0.00374,0.0,0.358,0.657,126.056,"pop, Dance/Electronic",1,16,0.656898
2,Flo Rida,I Cry,223800,False,2012,60,0.693,0.822,4,-5.441,0,0.0439,0.00616,2e-06,0.315,0.763,126.035,"hip hop, pop",1,16,0.707402
3,MIKA,Grace Kelly,187733,False,2006,69,0.675,0.828,0,-5.799,1,0.0454,0.0242,0.0102,0.364,0.669,122.229,pop,1,16,0.777363
4,Oliver Heldens,Gecko (Overdrive) - Radio Edit,165440,False,2014,67,0.609,0.885,0,-5.469,1,0.0642,0.00521,1.2e-05,0.336,0.76,124.959,"pop, Dance/Electronic",1,16,0.816804
5,Maroon 5,"Moves Like Jagger - Studio Recording From ""The...",201493,False,2010,77,0.722,0.761,11,-4.459,0,0.0475,0.0117,0.0,0.315,0.624,128.044,pop,1,16,0.852871
6,Razorlight,In The Morning,222453,False,2006,59,0.616,0.855,4,-3.495,0,0.042,0.00379,0.000863,0.318,0.686,124.191,"rock, pop",1,16,0.860624
7,Alexandra Burke,Bad Boys (feat. Flo Rida),206480,False,2009,57,0.67,0.866,1,-3.684,1,0.0538,0.0115,0.0,0.358,0.636,140.029,pop,1,16,0.883842
8,3OH!3,My First Kiss (feat. Ke$ha),192440,False,2010,62,0.682,0.889,0,-4.166,1,0.0804,0.00564,0.0,0.36,0.827,138.021,"hip hop, pop, rock",1,16,0.89526
9,Joel Corry,Sorry,188640,False,2019,63,0.744,0.79,8,-4.617,0,0.0562,0.0547,0.000802,0.32,0.847,125.002,"pop, Dance/Electronic",1,16,0.901491


In [None]:
recommend_knn(df, "Oops!...I Did It Again", scaler, n=10)


Unnamed: 0,song,artist,vibe_cluster,distance
0,Sunflower - Spider-Man: Into the Spider-Verse,Post Malone,2,1.248728
1,Options,NSG,2,1.592452
2,Secreto,Anuel AA,2,1.627511
3,My Band,D12,2,1.708607
4,Rock Wit U (Awww Baby),Ashanti,2,1.73212
5,Because Of You,Ne-Yo,2,1.778927
6,Get Busy,Sean Paul,2,1.808552
7,Rockabye (feat. Sean Paul & Anne-Marie),Clean Bandit,2,1.88928
8,Crank That (Soulja Boy),Soulja Boy,2,2.066309
9,Drowning (feat. Kodak Black),A Boogie Wit da Hoodie,2,2.079693


In [62]:
# ------------------------------
# Compare two recommendation methods
# ------------------------------
import pandas as pd
import numpy as np

QUERY = "Oops!...I Did It Again"
N = 10

# --- helper: best-guess artist column if it exists ---
possible_artist_cols = [c for c in df.columns if any(k in c.lower() for k in ('artist','singer','performer'))]
ARTIST_COL = possible_artist_cols[0] if possible_artist_cols else None
TRACK_COL = "song"   # your real track column

# --- get recommendations (these calls use your existing functions) ---
knn_df = recommend_knn(df, QUERY, scaler, n=N)              # expected to contain 'distance'
centroid_df = recommend_by_centroid(df, QUERY, scaler, kmeans, n=N)  # expected to contain 'centroid_dist'

# --- normalize column names for merging/display ---
# ensure distance columns exist and are named consistently
if 'distance' not in knn_df.columns and 'dist' in knn_df.columns:
    knn_df = knn_df.rename(columns={'dist':'distance'})

if 'centroid_dist' not in centroid_df.columns and 'dist_to_centroid' in centroid_df.columns:
    centroid_df = centroid_df.rename(columns={'dist_to_centroid':'centroid_dist'})

# If centroid_df lacks centroid_dist, compute it from scaled features & centroids
if 'centroid_dist' not in centroid_df.columns:
    Xs = scaler.transform(df[VIBE_FEATURES])
    q_idx = df.index[df[TRACK_COL] == QUERY][0]
    q_cluster = df.loc[q_idx, 'vibe_cluster']
    centroid = kmeans.cluster_centers_[q_cluster]
    # compute distances for items in same cluster
    mask = (df['vibe_cluster'] == q_cluster)
    sub_Xs = Xs[mask]
    dists = np.linalg.norm(sub_Xs - centroid, axis=1)
    tmp = df[mask].copy().reset_index(drop=True)
    tmp['centroid_dist'] = dists
    tmp = tmp[tmp[TRACK_COL] != QUERY]
    centroid_df = tmp.sort_values('centroid_dist').head(N)

# --- pick display columns (safely) ---
display_cols = [TRACK_COL]
if ARTIST_COL:
    display_cols.append(ARTIST_COL)
# include distance columns if present
if 'distance' in knn_df.columns:
    knn_display_cols = display_cols + ['distance']
else:
    knn_display_cols = display_cols

if 'centroid_dist' in centroid_df.columns:
    centroid_display_cols = display_cols + ['centroid_dist']
else:
    centroid_display_cols = display_cols

# --- add rank and method columns ---
knn_show = knn_df[knn_display_cols].copy().head(N)
knn_show = knn_show.reset_index(drop=True)
knn_show.insert(0, 'rank', knn_show.index + 1)
knn_show.insert(0, 'method', 'KNN')

centroid_show = centroid_df[centroid_display_cols].copy().head(N)
centroid_show = centroid_show.reset_index(drop=True)
centroid_show.insert(0, 'rank', centroid_show.index + 1)
centroid_show.insert(0, 'method', 'Centroid')

# --- side-by-side display: concatenate with suffixes ---
side_by_side = pd.concat([knn_show.add_suffix('_knn'), centroid_show.add_suffix('_centroid')], axis=1)

# --- combined long table for easy filtering/sorting ---
combined_long = pd.concat([knn_show, centroid_show], axis=0).reset_index(drop=True)

# --- show results ---
print("Side-by-side (KNN | Centroid) - columns _knn and _centroid show each method's fields:\n")
display(side_by_side)

print("\nCombined (long) table - each row = one recommendation (method + rank):\n")
display(combined_long)




Side-by-side (KNN | Centroid) - columns _knn and _centroid show each method's fields:



Unnamed: 0,method_knn,rank_knn,song_knn,artist_knn,distance_knn,method_centroid,rank_centroid,song_centroid,artist_centroid,centroid_dist_centroid
0,KNN,1,Hips Don't Lie (feat. Wyclef Jean),Shakira,0.849365,Centroid,1,Good Feeling,Flo Rida,0.463477
1,KNN,2,Soltera - Remix,Lunay,1.112294,Centroid,2,Kickstarts,Example,0.656898
2,KNN,3,Black Horse And The Cherry Tree,KT Tunstall,1.336496,Centroid,3,I Cry,Flo Rida,0.707402
3,KNN,4,Dura,Daddy Yankee,1.383012,Centroid,4,Grace Kelly,MIKA,0.777363
4,KNN,5,Hit 'Em Up Style (Oops!),Blu Cantrell,1.46275,Centroid,5,Gecko (Overdrive) - Radio Edit,Oliver Heldens,0.816804
5,KNN,6,Good Grief,Bastille,1.466201,Centroid,6,"Moves Like Jagger - Studio Recording From ""The...",Maroon 5,0.852871
6,KNN,7,What Took You So Long?,Emma Bunton,1.524489,Centroid,7,In The Morning,Razorlight,0.860624
7,KNN,8,Heaven Is a Halfpipe (If I Die),OPM,1.654326,Centroid,8,Bad Boys (feat. Flo Rida),Alexandra Burke,0.883842
8,KNN,9,Down With The Trumpets,Rizzle Kicks,1.826194,Centroid,9,My First Kiss (feat. Ke$ha),3OH!3,0.89526
9,KNN,10,Black Magic,Little Mix,1.834119,Centroid,10,Sorry,Joel Corry,0.901491



Combined (long) table - each row = one recommendation (method + rank):



Unnamed: 0,method,rank,song,artist,distance,centroid_dist
0,KNN,1,Hips Don't Lie (feat. Wyclef Jean),Shakira,0.849365,
1,KNN,2,Soltera - Remix,Lunay,1.112294,
2,KNN,3,Black Horse And The Cherry Tree,KT Tunstall,1.336496,
3,KNN,4,Dura,Daddy Yankee,1.383012,
4,KNN,5,Hit 'Em Up Style (Oops!),Blu Cantrell,1.46275,
5,KNN,6,Good Grief,Bastille,1.466201,
6,KNN,7,What Took You So Long?,Emma Bunton,1.524489,
7,KNN,8,Heaven Is a Halfpipe (If I Die),OPM,1.654326,
8,KNN,9,Down With The Trumpets,Rizzle Kicks,1.826194,
9,KNN,10,Black Magic,Little Mix,1.834119,
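
To put a number on how much the two methods agree, you could compute the overlap between their recommendation lists. The sketch below uses the Jaccard index; `jaccard_overlap` is a hypothetical helper, and the titles are the top three from each method in the side-by-side output above.

```python
def jaccard_overlap(list_a, list_b):
    """Share of distinct items that appear in both lists (0 = disjoint, 1 = identical)."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Top-3 titles copied from the KNN and centroid outputs
knn_top = ["Hips Don't Lie (feat. Wyclef Jean)", "Soltera - Remix",
           "Black Horse And The Cherry Tree"]
centroid_top = ["Good Feeling", "Kickstarts", "I Cry"]

overlap = jaccard_overlap(knn_top, centroid_top)
print(overlap)
```

With the full frames you would pass `knn_df['song']` and `centroid_df['song']`. A low overlap is expected here, since KNN ranks songs by distance to the query song while the centroid method ranks them by distance to the cluster center.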
