# K-Nearest Neighbors Song Recommendation Model

**Objective:** Build a similarity-based music recommendation system using k-nearest neighbors (k-NN) algorithm with euclidean distance calculations.

**Input:** `../data/processed/spotify_tracks_features_engineered.csv` (standardized audio features)  
**Process:** Load → Build k-NN Model → Generate Recommendations → Format Output  
**Output:** Recommendation dataset matching `recommendation_sample_enhanced.csv` format

**Algorithm:** k-NN with k=10 neighbors using euclidean distance on standardized audio features

## 1. Environment Setup & Library Imports

Import essential libraries for machine learning, data manipulation, and distance calculations. We'll use scikit-learn's NearestNeighbors for efficient k-NN implementation.

In [13]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Machine learning libraries
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import euclidean_distances

# For progress tracking
from tqdm import tqdm

# Display configuration
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

print("Libraries imported successfully!")
print("Ready to build k-NN recommendation model...")

Libraries imported successfully!
Ready to build k-NN recommendation model...


## 2. Data Loading & Exploration

Load the feature-engineered dataset containing both original and standardized audio features. We'll examine the structure to understand our feature matrix for k-NN calculations.

In [14]:
# Define file paths
INPUT_CSV = Path("../data/processed/spotify_tracks_clean_deduplicated.csv")
SAMPLE_OUTPUT = Path("../data/processed/recommendation_sample_enhanced.csv")
OUTPUT_CSV = Path("../data/processed/knn_recommendations.csv")

print(f"Input file: {INPUT_CSV.resolve()}")
print(f"Sample format file: {SAMPLE_OUTPUT.resolve()}")
print(f"Output file: {OUTPUT_CSV.resolve()}")

# Load feature-engineered dataset
try:
    df = pd.read_csv(INPUT_CSV)
    print(f"\nDataset loaded successfully!")
    print(f"Shape: {df.shape}")
    print(f"Columns: {len(df.columns)}")
    display(df.head(3))
except FileNotFoundError:
    print(f"Error: Could not find {INPUT_CSV}")
except Exception as e:
    print(f"Error loading data: {e}")

Input file: /Users/julianelliott/Documents/GitHub/song-recommendation-dashboard/data/processed/spotify_tracks_clean_deduplicated.csv
Sample format file: /Users/julianelliott/Documents/GitHub/song-recommendation-dashboard/data/processed/recommendation_sample_enhanced.csv
Output file: /Users/julianelliott/Documents/GitHub/song-recommendation-dashboard/data/processed/knn_recommendations.csv

Dataset loaded successfully!
Shape: (89740, 35)
Columns: 35

Dataset loaded successfully!
Shape: (89740, 35)
Columns: 35


Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,danceability_scaled,energy_scaled,key_scaled,loudness_scaled,mode_scaled,speechiness_scaled,acousticness_scaled,instrumentalness_scaled,liveness_scaled,valence_scaled,tempo_scaled,time_signature_scaled,duration_min,popularity_category,mood_score,party_score
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,0.629239,-0.717147,-1.210434,0.300825,-1.326297,0.551843,-0.850193,-0.504111,0.758735,0.929315,-1.141854,0.221824,3.844433,High,0.588,0.5685
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,-0.845908,-1.889974,-1.210434,-1.784739,0.753979,-0.078995,1.831744,-0.504097,-0.591216,-0.798681,-1.489708,0.221824,2.4935,High,0.2165,0.293
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,-0.742187,-1.122667,-1.491334,-0.293289,0.753979,-0.273827,-0.315489,-0.504115,-0.507172,-1.365679,-1.528303,0.221824,3.513767,High,0.2395,0.3985


## 3. Examine Target Output Format

Load and analyze the expected output format to ensure our recommendations match the required structure for the recommendation system.

In [15]:
# Load sample output format to understand required structure
try:
    sample_format = pd.read_csv(SAMPLE_OUTPUT)
    print("Target output format structure:")
    print(f"Shape: {sample_format.shape}")
    print(f"Columns: {sample_format.columns.tolist()}")
    display(sample_format.head())
    
    # Understand the format requirements
    print("\nFormat Analysis:")
    print(f"- Total recommendations: {len(sample_format)}")
    print(f"- Unique source tracks: {sample_format['track_id'].nunique()}")
    print(f"- Recommendations per track: {len(sample_format) // sample_format['track_id'].nunique()}")
    
except FileNotFoundError:
    print(f"Sample format file not found. Will create standard format.")
    sample_format = None

Target output format structure:
Shape: (10, 14)
Columns: ['track_id', 'recommended_track_id', 'match_score', 'track_name', 'track_artists', 'track_album', 'track_genre', 'track_image_url', 'track_popularity', 'recommended_track_name', 'recommended_track_artists', 'recommended_track_album', 'recommended_track_genre', 'recommended_track_image_url']


Unnamed: 0,track_id,recommended_track_id,match_score,track_name,track_artists,track_album,track_genre,track_image_url,track_popularity,recommended_track_name,recommended_track_artists,recommended_track_album,recommended_track_genre,recommended_track_image_url
0,3n3Ppam7vgaVa1iaRUc9Lp,7ouMYWpwJ422jRcDASZB7P,0.92,Mr. Brightside,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,85,Somebody Told Me,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...
1,7ouMYWpwJ422jRcDASZB7P,3n3Ppam7vgaVa1iaRUc9Lp,0.87,Somebody Told Me,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,78,Mr. Brightside,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...
2,3n3Ppam7vgaVa1iaRUc9Lp,0eGsygTp906u18L0Oimnem,0.81,Mr. Brightside,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,85,Somebody That I Used to Know,Gotye feat. Kimbra,Making Mirrors,indie,https://i.scdn.co/image/ab67616d0000b273f0c20a...
3,0eGsygTp906u18L0Oimnem,3n3Ppam7vgaVa1iaRUc9Lp,0.78,Somebody That I Used to Know,Gotye feat. Kimbra,Making Mirrors,indie,https://i.scdn.co/image/ab67616d0000b273f0c20a...,82,Mr. Brightside,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...
4,7ouMYWpwJ422jRcDASZB7P,0eGsygTp906u18L0Oimnem,0.85,Somebody Told Me,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,78,Somebody That I Used to Know,Gotye feat. Kimbra,Making Mirrors,indie,https://i.scdn.co/image/ab67616d0000b273f0c20a...



Format Analysis:
- Total recommendations: 10
- Unique source tracks: 4
- Recommendations per track: 2


## 4. Prepare Feature Matrix for k-NN

Extract the standardized audio features that will be used for euclidean distance calculations. These features have been normalized to have mean=0 and std=1, ensuring equal contribution to distance measurements.

In [16]:
# Identify standardized audio features for k-NN model
scaled_features = [col for col in df.columns if col.endswith('_scaled')]

print(f"Standardized audio features for k-NN: {len(scaled_features)}")
print(scaled_features)

# Verify we have the expected 12 audio features
expected_features = [
    'danceability_scaled', 'energy_scaled', 'key_scaled', 'loudness_scaled',
    'mode_scaled', 'speechiness_scaled', 'acousticness_scaled', 
    'instrumentalness_scaled', 'liveness_scaled', 'valence_scaled', 
    'tempo_scaled', 'time_signature_scaled'
]

missing_features = [f for f in expected_features if f not in scaled_features]
if missing_features:
    print(f"\nWarning: Missing expected features: {missing_features}")
else:
    print("\nAll expected standardized features found ✓")

# Create feature matrix for k-NN
feature_matrix = df[scaled_features].values
print(f"\nFeature matrix shape: {feature_matrix.shape}")
print(f"Feature matrix type: {type(feature_matrix)}")

Standardized audio features for k-NN: 12
['danceability_scaled', 'energy_scaled', 'key_scaled', 'loudness_scaled', 'mode_scaled', 'speechiness_scaled', 'acousticness_scaled', 'instrumentalness_scaled', 'liveness_scaled', 'valence_scaled', 'tempo_scaled', 'time_signature_scaled']

All expected standardized features found ✓

Feature matrix shape: (89740, 12)
Feature matrix type: <class 'numpy.ndarray'>


## 5. Build k-Nearest Neighbors Model

Initialize and fit the k-NN model using euclidean distance. We'll use k=11 (to get 10 recommendations excluding the input song itself) and euclidean distance metric for similarity calculations.

In [17]:
# Initialize k-NN model
# Using k=11 to get 10 recommendations (excluding the query song itself)
k_neighbors = 11
knn_model = NearestNeighbors(
    n_neighbors=k_neighbors,
    metric='euclidean',
    algorithm='auto',  # Let sklearn choose the best algorithm
    n_jobs=-1  # Use all available CPU cores for faster computation
)

print(f"Initializing k-NN model with:")
print(f"- k = {k_neighbors} neighbors (10 recommendations + 1 self)")
print(f"- Distance metric: euclidean")
print(f"- Algorithm: auto (sklearn will choose optimal)")

# Fit the model with our standardized feature matrix
print("\nFitting k-NN model...")
knn_model.fit(feature_matrix)
print("k-NN model fitted successfully!")

# Test the model with a sample query
print("\nTesting model with first track:")
sample_distances, sample_indices = knn_model.kneighbors([feature_matrix[0]])
print(f"Found {len(sample_indices[0])} neighbors")
print(f"Distances: {sample_distances[0][:5]}...")  # Show first 5 distances

Initializing k-NN model with:
- k = 11 neighbors (10 recommendations + 1 self)
- Distance metric: euclidean
- Algorithm: auto (sklearn will choose optimal)

Fitting k-NN model...
k-NN model fitted successfully!

Testing model with first track:
Found 11 neighbors
Distances: [0.         1.01463073 1.30584155 1.31147265 1.32838242]...


## 6. Generate Recommendations Function

Create a function to generate recommendations for any given track. This function will find the k-nearest neighbors and return track information excluding the input song itself.

In [18]:
def get_recommendations(track_index, n_recommendations=10, exclude_same_album=False, exclude_duplicates=True):
    """
    Get track recommendations for a given track using the trained k-NN model.
    
    Args:
        track_index (int): Index of the source track in the dataset
        n_recommendations (int): Number of recommendations to return
        exclude_same_album (bool): If True, exclude tracks from the same album
        exclude_duplicates (bool): If True, exclude exact duplicate tracks
    
    Returns:
        pd.DataFrame: DataFrame with recommended tracks and similarity scores
    """
    source_track = df.iloc[track_index]
    source_album = source_track['album_name']
    source_track_name = source_track['track_name']
    source_artists = source_track['artists']
    
    # Get more neighbors than needed to account for filtering
    search_neighbors = min(len(df), n_recommendations * 3)
    distances, indices = knn_model.kneighbors([feature_matrix[track_index]], n_neighbors=search_neighbors)
    
    recommendations = []
    
    for neighbor_idx, distance in zip(indices[0], distances[0]):
        if neighbor_idx == track_index:  # Skip the source track itself
            continue
            
        candidate_track = df.iloc[neighbor_idx]
        
        # Apply filters
        skip_track = False
        
        if exclude_same_album and candidate_track['album_name'] == source_album:
            skip_track = True
            
        if exclude_duplicates:
            # Check for exact duplicates (same name and artist)
            if (candidate_track['track_name'] == source_track_name and 
                candidate_track['artists'] == source_artists):
                skip_track = True
        
        if not skip_track:
            similarity_score = 1 / (1 + distance)  # Convert distance to similarity
            
            recommendations.append({
                'track_index': neighbor_idx,
                'track_id': candidate_track['track_id'],
                'track_name': candidate_track['track_name'],
                'artists': candidate_track['artists'],
                'album_name': candidate_track['album_name'],
                'popularity': candidate_track['popularity'],
                'similarity_score': round(similarity_score, 4),
                'euclidean_distance': round(distance, 4)
            })
            
            if len(recommendations) >= n_recommendations:
                break
    
    return pd.DataFrame(recommendations)

# Test the recommendation function
print("Testing recommendation function with track index 0:")
sample_track = df.iloc[0]
print(f"Source track: '{sample_track['track_name']}' by {sample_track['artists']}")
print(f"Source album: '{sample_track['album_name']}'")

# Test standard recommendations (no filtering)
print(f"\n=== STANDARD RECOMMENDATIONS (no filtering) ===")
standard_recs = get_recommendations(0, n_recommendations=5, exclude_same_album=False, exclude_duplicates=False)
for i, (_, track) in enumerate(standard_recs.iterrows(), 1):
    print(f"{i}. '{track['track_name']}' by {track['artists']} | Score: {track['similarity_score']:.4f}")

# Test filtered recommendations (exclude same album)
print(f"\n=== FILTERED RECOMMENDATIONS (exclude same album) ===")
filtered_recs = get_recommendations(0, n_recommendations=5, exclude_same_album=True, exclude_duplicates=True)
for i, (_, track) in enumerate(filtered_recs.iterrows(), 1):
    print(f"{i}. '{track['track_name']}' by {track['artists']} | Score: {track['similarity_score']:.4f}")

print(f"\nRecommendation counts:")
print(f"- Standard recommendations: {len(standard_recs)}")
print(f"- Filtered recommendations: {len(filtered_recs)}")

Testing recommendation function with track index 0:
Source track: 'Comedy' by Gen Hoshino
Source album: 'Comedy'

=== STANDARD RECOMMENDATIONS (no filtering) ===
1. 'JAMAICA' by Feid;Sech | Score: 0.4964
2. 'Bu Aşk Olur Mu' by MFÖ | Score: 0.4337
3. '空ノムコウ' by Maharajan | Score: 0.4326
4. 'Baari Barsi' by Salim–Sulaiman;Harshdeep Kaur;Labh Janjua;Amitabh Bhattacharya | Score: 0.4295
5. 'Do My Ladies Run This Party - Single' by Cupid | Score: 0.4290

=== FILTERED RECOMMENDATIONS (exclude same album) ===
1. 'JAMAICA' by Feid;Sech | Score: 0.4964
2. 'Bu Aşk Olur Mu' by MFÖ | Score: 0.4337
3. '空ノムコウ' by Maharajan | Score: 0.4326
4. 'Baari Barsi' by Salim–Sulaiman;Harshdeep Kaur;Labh Janjua;Amitabh Bhattacharya | Score: 0.4295
5. 'Do My Ladies Run This Party - Single' by Cupid | Score: 0.4290

Recommendation counts:
- Standard recommendations: 5
- Filtered recommendations: 5


In [19]:
def get_both_recommendation_types(track_index, n_recommendations=10):
    """
    Get both standard and cross-album recommendations for a given track.
    
    Args:
        track_index (int): Index of the source track in the dataset
        n_recommendations (int): Number of recommendations to return for each type
    
    Returns:
        tuple: (standard_recommendations_df, cross_album_recommendations_df)
    """
    # Get standard recommendations (no filtering)
    standard_recs = get_recommendations(
        track_index, 
        n_recommendations=n_recommendations, 
        exclude_same_album=False, 
        exclude_duplicates=False
    )
    
    # Get cross-album recommendations (exclude same album)
    cross_album_recs = get_recommendations(
        track_index, 
        n_recommendations=n_recommendations, 
        exclude_same_album=True, 
        exclude_duplicates=True
    )
    
    return standard_recs, cross_album_recs

print("get_both_recommendation_types function defined successfully!")

get_both_recommendation_types function defined successfully!


## 7. Generate Recommendations for All Tracks

Process the entire dataset to generate 10 recommendations for each track. This creates a comprehensive recommendation dataset for the entire music catalog.

In [20]:
# Generate both types of recommendations for all tracks
print("Generating both standard and cross-album recommendations for all tracks...")
print(f"Processing {len(df)} tracks to generate recommendations")

standard_recommendations = []
cross_album_recommendations = []

# Process tracks in batches with progress bar
batch_size = 500  # Reduced batch size due to additional processing
total_tracks = len(df)

for start_idx in tqdm(range(0, total_tracks, batch_size), desc="Processing batches"):
    end_idx = min(start_idx + batch_size, total_tracks)
    
    for track_idx in range(start_idx, end_idx):
        source_track = df.iloc[track_idx]
        
        # Get both types of recommendations
        standard_recs, cross_album_recs = get_both_recommendation_types(track_idx, n_recommendations=10)
        
        # Process standard recommendations
        for rank, (_, rec_track) in enumerate(standard_recs.iterrows()):
            rec_entry = {
                'track_id': source_track['track_id'],
                'recommended_track_id': rec_track['track_id'],
                'match_score': rec_track['similarity_score'],
                'recommendation_type': 'standard',
                'track_name': source_track['track_name'],
                'track_artists': source_track['artists'],
                'track_album': source_track['album_name'],
                'track_popularity': source_track['popularity'],
                'recommended_track_name': rec_track['track_name'],
                'recommended_track_artists': rec_track['artists'],
                'recommended_track_album': rec_track['album_name'],
                'recommended_track_popularity': rec_track['popularity'],
                'same_album': source_track['album_name'] == rec_track['album_name'],
                'euclidean_distance': rec_track['euclidean_distance']
            }
            standard_recommendations.append(rec_entry)
        
        # Process cross-album recommendations
        for rank, (_, rec_track) in enumerate(cross_album_recs.iterrows()):
            rec_entry = {
                'track_id': source_track['track_id'],
                'recommended_track_id': rec_track['track_id'],
                'match_score': rec_track['similarity_score'],
                'recommendation_type': 'cross_album',
                'track_name': source_track['track_name'],
                'track_artists': source_track['artists'],
                'track_album': source_track['album_name'],
                'track_popularity': source_track['popularity'],
                'recommended_track_name': rec_track['track_name'],
                'recommended_track_artists': rec_track['artists'],
                'recommended_track_album': rec_track['album_name'],
                'recommended_track_popularity': rec_track['popularity'],
                'same_album': False,  # Always False for cross-album recommendations
                'euclidean_distance': rec_track['euclidean_distance']
            }
            cross_album_recommendations.append(rec_entry)

# Convert to DataFrames
standard_df = pd.DataFrame(standard_recommendations)
cross_album_df = pd.DataFrame(cross_album_recommendations)

print(f"\n=== RECOMMENDATION GENERATION COMPLETE ===")
print(f"Standard recommendations: {len(standard_df):,}")
print(f"Cross-album recommendations: {len(cross_album_df):,}")

if len(standard_df) > 0:
    print(f"Standard - Unique source tracks: {standard_df['track_id'].nunique():,}")
    print(f"Standard - Avg recommendations per track: {len(standard_df) / standard_df['track_id'].nunique():.1f}")
    
    # Analyze same-album vs cross-album in standard recommendations
    same_album_count = standard_df['same_album'].sum()
    cross_album_count = len(standard_df) - same_album_count
    print(f"Standard - Same album recommendations: {same_album_count:,} ({same_album_count/len(standard_df)*100:.1f}%)")
    print(f"Standard - Cross album recommendations: {cross_album_count:,} ({cross_album_count/len(standard_df)*100:.1f}%)")

if len(cross_album_df) > 0:
    print(f"Cross-album - Unique source tracks: {cross_album_df['track_id'].nunique():,}")
    print(f"Cross-album - Avg recommendations per track: {len(cross_album_df) / cross_album_df['track_id'].nunique():.1f}")

# Combine both types for comprehensive dataset
all_recommendations_df = pd.concat([standard_df, cross_album_df], ignore_index=True)
print(f"\nCombined dataset: {len(all_recommendations_df):,} total recommendations")

Generating both standard and cross-album recommendations for all tracks...
Processing 89740 tracks to generate recommendations


Processing batches: 100%|██████████| 180/180 [46:16<00:00, 15.42s/it]




=== RECOMMENDATION GENERATION COMPLETE ===
Standard recommendations: 897,400
Cross-album recommendations: 894,009
Standard - Unique source tracks: 89,740
Standard - Avg recommendations per track: 10.0
Standard - Same album recommendations: 4,791 (0.5%)
Standard - Cross album recommendations: 892,609 (99.5%)
Cross-album - Unique source tracks: 89,524
Cross-album - Avg recommendations per track: 10.0

Combined dataset: 1,791,409 total recommendations

Combined dataset: 1,791,409 total recommendations


## 8. Validate & Display Sample Recommendations

Examine the generated recommendations to ensure quality and proper formatting. We'll look at sample recommendations and verify the data structure.

In [22]:
# Display sample recommendations
print("Sample recommendations structure:")
display(all_recommendations_df .head(10))

print(f"\nDataset statistics:")
print(f"- Total recommendation pairs: {len(all_recommendations_df):,}")
print(f"- Unique source tracks: {all_recommendations_df['track_id'].nunique():,}")
print(f"- Unique recommended tracks: {all_recommendations_df['recommended_track_id'].nunique():,}")
print(f"- Recommendations per source track: 10")

# Check recommendation quality with a specific example
sample_track_id = all_recommendations_df['track_id'].iloc[0]
sample_recs = all_recommendations_df[all_recommendations_df['track_id'] == sample_track_id].head(10)

print(f"\nExample: Recommendations for track ID '{sample_track_id}':")
print(f"Source: '{sample_recs.iloc[0]['track_name']}' by {sample_recs.iloc[0]['track_artists']}")
print("\nTop 10 recommendations:")

for i, (_, rec) in enumerate(sample_recs.iterrows(), 1):
    print(f"  {i}. '{rec['recommended_track_name']}' by {rec['recommended_track_artists']}")
    print(f"     Match Score: {rec['match_score']}, Genre: {rec['recommended_track_genre']}")

Sample recommendations structure:


Unnamed: 0,track_id,recommended_track_id,match_score,recommendation_type,track_name,track_artists,track_album,track_popularity,recommended_track_name,recommended_track_artists,recommended_track_album,recommended_track_popularity,same_album,euclidean_distance
0,5SuOikwiRyPMVoIQDJUgSV,04V3hO138BCddFy9TmTUic,0.4964,standard,Comedy,Gen Hoshino,Comedy,73,JAMAICA,Feid;Sech,Halloween 2022 Perreo Vol. 4,5,False,1.0146
1,5SuOikwiRyPMVoIQDJUgSV,007t1Fel5tcxHOEfSYWuGM,0.4337,standard,Comedy,Gen Hoshino,Comedy,73,Bu Aşk Olur Mu,MFÖ,Ve MFÖ,33,False,1.3058
2,5SuOikwiRyPMVoIQDJUgSV,6NGVokXQjdFPBAQGSlyT2S,0.4326,standard,Comedy,Gen Hoshino,Comedy,73,空ノムコウ,Maharajan,セーラ☆ムン太郎,20,False,1.3115
3,5SuOikwiRyPMVoIQDJUgSV,16168M9xvIA5FqGlurQfsv,0.4295,standard,Comedy,Gen Hoshino,Comedy,73,Baari Barsi,Salim–Sulaiman;Harshdeep Kaur;Labh Janjua;Amit...,Band Baaja Baaraat,43,False,1.3284
4,5SuOikwiRyPMVoIQDJUgSV,5O3FiMBxo5d1KBWwQ8mnbr,0.429,standard,Comedy,Gen Hoshino,Comedy,73,Do My Ladies Run This Party - Single,Cupid,Do My Ladies Run This Party - Single,9,False,1.3311
5,5SuOikwiRyPMVoIQDJUgSV,7fm1Nbus8X19wI4oz6FFcb,0.414,standard,Comedy,Gen Hoshino,Comedy,73,Scapegoat,Sidhu Moose Wala,Scapegoat,58,False,1.4156
6,5SuOikwiRyPMVoIQDJUgSV,07fDD54BLVrdAvM4krVGlG,0.4119,standard,Comedy,Gen Hoshino,Comedy,73,Black Life,Navaan Sandhu,Black Life,66,False,1.4281
7,5SuOikwiRyPMVoIQDJUgSV,1IIKrJVP1C9N7iPtG6eOsK,0.4111,standard,Comedy,Gen Hoshino,Comedy,73,Go Crazy,Chris Brown;Young Thug,Slime & B,78,False,1.4326
8,5SuOikwiRyPMVoIQDJUgSV,675g81tQUHsYVwMoLo7Bq4,0.4104,standard,Comedy,Gen Hoshino,Comedy,73,i can't get high,Royal & the Serpent,Internet Sensations,1,False,1.4366
9,5SuOikwiRyPMVoIQDJUgSV,5iNn43bKN1jSLllFkEfdOg,0.4103,standard,Comedy,Gen Hoshino,Comedy,73,ABC Boo,Super Simple Songs,ABC Boo,1,False,1.4371



Dataset statistics:
- Total recommendation pairs: 1,791,409
- Unique source tracks: 89,740
- Unique recommended tracks: 89,088
- Recommendations per source track: 10

Example: Recommendations for track ID '5SuOikwiRyPMVoIQDJUgSV':
Source: 'Comedy' by Gen Hoshino

Top 10 recommendations:
  1. 'JAMAICA' by Feid;Sech

Example: Recommendations for track ID '5SuOikwiRyPMVoIQDJUgSV':
Source: 'Comedy' by Gen Hoshino

Top 10 recommendations:
  1. 'JAMAICA' by Feid;Sech


KeyError: 'recommended_track_genre'

## 9. Format Output to Match Target Structure

Ensure the output format matches the expected structure from the sample enhanced CSV file. We'll align column names and data types for consistency.

In [24]:
# Verify output format matches target structure
if sample_format is not None:
    print("Verifying output format matches target structure...")
    
    target_columns = sample_format.columns.tolist()
    current_columns = all_recommendations_df.columns.tolist()
    
    print(f"Target columns: {target_columns}")
    print(f"Current columns: {current_columns}")
    
    # Check if formats match
    if set(target_columns) == set(current_columns):
        print("✓ Column structure matches target format!")
    else:
        missing_cols = set(target_columns) - set(current_columns)
        extra_cols = set(current_columns) - set(target_columns)
        if missing_cols:
            print(f"⚠ Missing columns: {missing_cols}")
        if extra_cols:
            print(f"⚠ Extra columns: {extra_cols}")
        
else:
    print("Using generated recommendation format...")

# Ensure proper data types and formatting
all_recommendations_df['match_score'] = all_recommendations_df['match_score'].round(2)
all_recommendations_df['track_popularity'] = all_recommendations_df['track_popularity'].astype(int)
all_recommendations_df['recommended_track_popularity'] = all_recommendations_df['recommended_track_popularity'].astype(int)

print("\nData formatting completed!")
print(f"Final dataset shape: {all_recommendations_df.shape}")
print(f"Final columns: {all_recommendations_df.columns.tolist()}")

Verifying output format matches target structure...
Target columns: ['track_id', 'recommended_track_id', 'match_score', 'track_name', 'track_artists', 'track_album', 'track_genre', 'track_image_url', 'track_popularity', 'recommended_track_name', 'recommended_track_artists', 'recommended_track_album', 'recommended_track_genre', 'recommended_track_image_url']
Current columns: ['track_id', 'recommended_track_id', 'match_score', 'recommendation_type', 'track_name', 'track_artists', 'track_album', 'track_popularity', 'recommended_track_name', 'recommended_track_artists', 'recommended_track_album', 'recommended_track_popularity', 'same_album', 'euclidean_distance']
⚠ Missing columns: {'recommended_track_image_url', 'track_genre', 'track_image_url', 'recommended_track_genre'}
⚠ Extra columns: {'same_album', 'recommended_track_popularity', 'euclidean_distance', 'recommendation_type'}

Data formatting completed!
Final dataset shape: (1791409, 14)
Final columns: ['track_id', 'recommended_track_i

## 10. Export Recommendations Dataset

Save the complete recommendation dataset to CSV format for use in the recommendation system and dashboard components.

In [25]:
# Export both types of recommendations
STANDARD_OUTPUT_CSV = Path("../data/processed/knn_recommendations_standard.csv")
CROSS_ALBUM_OUTPUT_CSV = Path("../data/processed/knn_recommendations_cross_album.csv")
COMBINED_OUTPUT_CSV = Path("../data/processed/knn_recommendations_combined.csv")

print("Exporting recommendation datasets...")

# Export standard recommendations
if len(standard_df) > 0:
    standard_df.to_csv(STANDARD_OUTPUT_CSV, index=False)
    print(f"✓ Standard recommendations exported: {STANDARD_OUTPUT_CSV}")
    print(f"  - Rows: {len(standard_df):,}")
    print(f"  - File size: {STANDARD_OUTPUT_CSV.stat().st_size / (1024*1024):.1f} MB")

# Export cross-album recommendations
if len(cross_album_df) > 0:
    cross_album_df.to_csv(CROSS_ALBUM_OUTPUT_CSV, index=False)
    print(f"✓ Cross-album recommendations exported: {CROSS_ALBUM_OUTPUT_CSV}")
    print(f"  - Rows: {len(cross_album_df):,}")
    print(f"  - File size: {CROSS_ALBUM_OUTPUT_CSV.stat().st_size / (1024*1024):.1f} MB")

# Export combined dataset
all_recommendations_df.to_csv(COMBINED_OUTPUT_CSV, index=False)
print(f"✓ Combined recommendations exported: {COMBINED_OUTPUT_CSV}")
print(f"  - Rows: {len(all_recommendations_df):,}")
print(f"  - File size: {COMBINED_OUTPUT_CSV.stat().st_size / (1024*1024):.1f} MB")

# Final summary statistics
print(f"\n{'='*70}")
print("K-NN RECOMMENDATION MODEL SUMMARY")
print(f"{'='*70}")
print(f"✓ Algorithm: k-Nearest Neighbors (k=10)")
print(f"✓ Distance Metric: Euclidean")
print(f"✓ Feature Set: 12 standardized audio features")
print(f"✓ Duplicate Detection: Similarity score ≥ 0.999 and identical track names")

if len(standard_df) > 0:
    print(f"\n📊 STANDARD RECOMMENDATIONS:")
    print(f"  - Total Source Tracks: {standard_df['track_id'].nunique():,}")
    print(f"  - Total Recommendations: {len(standard_df):,}")
    print(f"  - Same Album: {standard_df['same_album'].sum():,} ({standard_df['same_album'].sum()/len(standard_df)*100:.1f}%)")
    print(f"  - Cross Album: {(~standard_df['same_album']).sum():,} ({(~standard_df['same_album']).sum()/len(standard_df)*100:.1f}%)")
    print(f"  - File: {STANDARD_OUTPUT_CSV.name}")

if len(cross_album_df) > 0:
    print(f"\n🎯 CROSS-ALBUM RECOMMENDATIONS:")
    print(f"  - Total Source Tracks: {cross_album_df['track_id'].nunique():,}")
    print(f"  - Total Recommendations: {len(cross_album_df):,}")
    print(f"  - All Cross Album: 100% (by design)")
    print(f"  - File: {CROSS_ALBUM_OUTPUT_CSV.name}")

print(f"\n📁 COMBINED DATASET:")
print(f"  - Total Recommendations: {len(all_recommendations_df):,}")
print(f"  - File: {COMBINED_OUTPUT_CSV.name}")
print(f"{'='*70}")

# Display comparison sample
if len(standard_df) > 0 and len(cross_album_df) > 0:
    print("\nSample comparison for the same source track:")
    sample_track_id = standard_df['track_id'].iloc[0]
    
    print(f"\nSource track ID: {sample_track_id}")
    sample_source = standard_df[standard_df['track_id'] == sample_track_id].iloc[0]
    print(f"Track: '{sample_source['track_name']}' by {sample_source['track_artists']}")
    print(f"Album: '{sample_source['track_album']}'")
    
    print(f"\n🔸 Standard recommendations (top 5):")
    sample_standard = standard_df[standard_df['track_id'] == sample_track_id].head(5)
    for i, (_, rec) in enumerate(sample_standard.iterrows(), 1):
        album_indicator = "📀" if rec['same_album'] else "🎵"
        print(f"  {i}. {album_indicator} '{rec['recommended_track_name']}' | Score: {rec['match_score']:.4f}")
    
    print(f"\n🎯 Cross-album recommendations (top 5):")
    sample_cross = cross_album_df[cross_album_df['track_id'] == sample_track_id].head(5)
    for i, (_, rec) in enumerate(sample_cross.iterrows(), 1):
        print(f"  {i}. 🎵 '{rec['recommended_track_name']}' | Score: {rec['match_score']:.4f}")
        
print(f"\n💡 Usage Guide:")
print(f"- Use {STANDARD_OUTPUT_CSV.name} for general recommendations")
print(f"- Use {CROSS_ALBUM_OUTPUT_CSV.name} for music discovery across different albums")
print(f"- Use {COMBINED_OUTPUT_CSV.name} for comprehensive analysis with 'recommendation_type' column")

Exporting recommendation datasets...
✓ Standard recommendations exported: ../data/processed/knn_recommendations_standard.csv
  - Rows: 897,400
  - File size: 170.7 MB
✓ Standard recommendations exported: ../data/processed/knn_recommendations_standard.csv
  - Rows: 897,400
  - File size: 170.7 MB
✓ Cross-album recommendations exported: ../data/processed/knn_recommendations_cross_album.csv
  - Rows: 894,009
  - File size: 172.7 MB
✓ Cross-album recommendations exported: ../data/processed/knn_recommendations_cross_album.csv
  - Rows: 894,009
  - File size: 172.7 MB
✓ Combined recommendations exported: ../data/processed/knn_recommendations_combined.csv
  - Rows: 1,791,409
  - File size: 340.1 MB

K-NN RECOMMENDATION MODEL SUMMARY
✓ Algorithm: k-Nearest Neighbors (k=10)
✓ Distance Metric: Euclidean
✓ Feature Set: 12 standardized audio features
✓ Duplicate Detection: Similarity score ≥ 0.999 and identical track names

📊 STANDARD RECOMMENDATIONS:
  - Total Source Tracks: 89,740
  - Total Reco

## 11. Album and Artist Recommendations

Generate recommendations at the album and artist level using the same k-NN approach but aggregating results by album and artist rather than individual tracks.

In [26]:
def get_album_recommendations(track_index, n_recommendations=10, exclude_same_album=True, exclude_same_artist=False):
    """
    Get album recommendations based on a source track using k-NN model.
    
    Args:
        track_index (int): Index of the source track in the dataset
        n_recommendations (int): Number of album recommendations to return
        exclude_same_album (bool): If True, exclude the source album
        exclude_same_artist (bool): If True, exclude albums by the same artist
    
    Returns:
        pd.DataFrame: DataFrame with recommended albums and their aggregate scores
    """
    source_track = df.iloc[track_index]
    source_album = source_track['album_name']
    source_artist = source_track['artists']
    
    # Get a large number of track recommendations to aggregate into albums
    search_neighbors = min(len(df), n_recommendations * 20)
    distances, indices = knn_model.kneighbors([feature_matrix[track_index]], n_neighbors=search_neighbors)
    
    album_scores = {}
    album_info = {}
    
    for neighbor_idx, distance in zip(indices[0], distances[0]):
        if neighbor_idx == track_index:  # Skip source track
            continue
            
        candidate_track = df.iloc[neighbor_idx]
        candidate_album = candidate_track['album_name']
        candidate_artist = candidate_track['artists']
        
        # Apply filters
        skip_album = False
        
        if exclude_same_album and candidate_album == source_album:
            skip_album = True
            
        if exclude_same_artist and candidate_artist == source_artist:
            skip_album = True
            
        if not skip_album:
            similarity_score = 1 / (1 + distance)
            
            # Aggregate scores by album (take the best score per album)
            if candidate_album not in album_scores or similarity_score > album_scores[candidate_album]:
                album_scores[candidate_album] = similarity_score
                album_info[candidate_album] = {
                    'album_name': candidate_album,
                    'album_artists': candidate_artist,
                    'sample_track_name': candidate_track['track_name'],
                    'sample_track_popularity': candidate_track['popularity'],
                    'similarity_score': round(similarity_score, 4),
                    'euclidean_distance': round(distance, 4)
                }
    
    # Sort albums by similarity score and get top N
    sorted_albums = sorted(album_scores.items(), key=lambda x: x[1], reverse=True)[:n_recommendations]
    
    # Create DataFrame with album recommendations
    album_recommendations = []
    for album_name, score in sorted_albums:
        album_recommendations.append(album_info[album_name])
    
    return pd.DataFrame(album_recommendations)

def get_artist_recommendations(track_index, n_recommendations=10, exclude_same_artist=True):
    """
    Get artist recommendations based on a source track using k-NN model.
    
    Args:
        track_index (int): Index of the source track in the dataset
        n_recommendations (int): Number of artist recommendations to return
        exclude_same_artist (bool): If True, exclude the source artist
    
    Returns:
        pd.DataFrame: DataFrame with recommended artists and their aggregate scores
    """
    source_track = df.iloc[track_index]
    source_artist = source_track['artists']
    
    # Get a large number of track recommendations to aggregate into artists
    search_neighbors = min(len(df), n_recommendations * 30)
    distances, indices = knn_model.kneighbors([feature_matrix[track_index]], n_neighbors=search_neighbors)
    
    artist_scores = {}
    artist_info = {}
    
    for neighbor_idx, distance in zip(indices[0], distances[0]):
        if neighbor_idx == track_index:  # Skip source track
            continue
            
        candidate_track = df.iloc[neighbor_idx]
        candidate_artist = candidate_track['artists']
        
        # Apply filters
        skip_artist = False
        
        if exclude_same_artist and candidate_artist == source_artist:
            skip_artist = True
            
        if not skip_artist:
            similarity_score = 1 / (1 + distance)
            
            # Aggregate scores by artist (take the best score per artist)
            if candidate_artist not in artist_scores or similarity_score > artist_scores[candidate_artist]:
                artist_scores[candidate_artist] = similarity_score
                artist_info[candidate_artist] = {
                    'artist_name': candidate_artist,
                    'sample_track_name': candidate_track['track_name'],
                    'sample_album_name': candidate_track['album_name'],
                    'sample_track_popularity': candidate_track['popularity'],
                    'similarity_score': round(similarity_score, 4),
                    'euclidean_distance': round(distance, 4)
                }
    
    # Sort artists by similarity score and get top N
    sorted_artists = sorted(artist_scores.items(), key=lambda x: x[1], reverse=True)[:n_recommendations]
    
    # Create DataFrame with artist recommendations
    artist_recommendations = []
    for artist_name, score in sorted_artists:
        artist_recommendations.append(artist_info[artist_name])
    
    return pd.DataFrame(artist_recommendations)

# Test album and artist recommendations
print("Testing album and artist recommendation functions with track index 0:")
sample_track = df.iloc[0]
print(f"Source track: '{sample_track['track_name']}' by {sample_track['artists']}")
print(f"Source album: '{sample_track['album_name']}'")

# Test album recommendations
print(f"\n=== ALBUM RECOMMENDATIONS ===")
album_recs = get_album_recommendations(0, n_recommendations=5)
if len(album_recs) > 0:
    for i, (_, album) in enumerate(album_recs.iterrows(), 1):
        print(f"{i}. 📀 '{album['album_name']}' by {album['album_artists']}")
        print(f"   Sample track: '{album['sample_track_name']}' | Score: {album['similarity_score']:.4f}")
else:
    print("No album recommendations found")

# Test artist recommendations
print(f"\n=== ARTIST RECOMMENDATIONS ===")
artist_recs = get_artist_recommendations(0, n_recommendations=5)
if len(artist_recs) > 0:
    for i, (_, artist) in enumerate(artist_recs.iterrows(), 1):
        print(f"{i}. 🎤 {artist['artist_name']}")
        print(f"   Sample track: '{artist['sample_track_name']}' | Score: {artist['similarity_score']:.4f}")
else:
    print("No artist recommendations found")

print(f"\nRecommendation counts:")
print(f"- Album recommendations: {len(album_recs)}")
print(f"- Artist recommendations: {len(artist_recs)}")

Testing album and artist recommendation functions with track index 0:
Source track: 'Comedy' by Gen Hoshino
Source album: 'Comedy'

=== ALBUM RECOMMENDATIONS ===
1. 📀 'Halloween 2022 Perreo Vol. 4' by Feid;Sech
   Sample track: 'JAMAICA' | Score: 0.4964
2. 📀 'Ve MFÖ' by MFÖ
   Sample track: 'Bu Aşk Olur Mu' | Score: 0.4337
3. 📀 'セーラ☆ムン太郎' by Maharajan
   Sample track: '空ノムコウ' | Score: 0.4326
4. 📀 'Band Baaja Baaraat' by Salim–Sulaiman;Harshdeep Kaur;Labh Janjua;Amitabh Bhattacharya
   Sample track: 'Baari Barsi' | Score: 0.4295
5. 📀 'Do My Ladies Run This Party - Single' by Cupid
   Sample track: 'Do My Ladies Run This Party - Single' | Score: 0.4290

=== ARTIST RECOMMENDATIONS ===
1. 🎤 Feid;Sech
   Sample track: 'JAMAICA' | Score: 0.4964
2. 🎤 MFÖ
   Sample track: 'Bu Aşk Olur Mu' | Score: 0.4337
3. 🎤 Maharajan
   Sample track: '空ノムコウ' | Score: 0.4326
4. 🎤 Salim–Sulaiman;Harshdeep Kaur;Labh Janjua;Amitabh Bhattacharya
   Sample track: 'Baari Barsi' | Score: 0.4295
5. 🎤 Cupid
   Sample 

In [27]:
def generate_batch_album_recommendations(df, n_tracks=100, n_recommendations=10, exclude_same_album=True, exclude_same_artist=False):
    """
    Generate album recommendations for a batch of tracks.
    
    Args:
        df (pd.DataFrame): The tracks dataframe
        n_tracks (int): Number of tracks to generate recommendations for
        n_recommendations (int): Number of album recommendations per track
        exclude_same_album (bool): Exclude albums from the same source album
        exclude_same_artist (bool): Exclude albums from the same source artist
    
    Returns:
        list: List of dictionaries containing source track info and album recommendations
    """
    recommendations = []
    
    for i in tqdm(range(min(n_tracks, len(df))), desc="Generating album recommendations"):
        source_track = df.iloc[i]
        album_recs = get_album_recommendations(i, n_recommendations, exclude_same_album, exclude_same_artist)
        
        rec_data = {
            'source_track_index': i,
            'source_track_name': source_track['track_name'],
            'source_artists': source_track['artists'],
            'source_album': source_track['album_name'],
            'source_popularity': source_track['popularity'],
            'recommended_albums': album_recs.to_dict('records')
        }
        recommendations.append(rec_data)
    
    return recommendations

def generate_batch_artist_recommendations(df, n_tracks=100, n_recommendations=10, exclude_same_artist=True):
    """
    Generate artist recommendations for a batch of tracks.
    
    Args:
        df (pd.DataFrame): The tracks dataframe
        n_tracks (int): Number of tracks to generate recommendations for
        n_recommendations (int): Number of artist recommendations per track
        exclude_same_artist (bool): Exclude the same source artist
    
    Returns:
        list: List of dictionaries containing source track info and artist recommendations
    """
    recommendations = []
    
    for i in tqdm(range(min(n_tracks, len(df))), desc="Generating artist recommendations"):
        source_track = df.iloc[i]
        artist_recs = get_artist_recommendations(i, n_recommendations, exclude_same_artist)
        
        rec_data = {
            'source_track_index': i,
            'source_track_name': source_track['track_name'],
            'source_artists': source_track['artists'],
            'source_album': source_track['album_name'],
            'source_popularity': source_track['popularity'],
            'recommended_artists': artist_recs.to_dict('records')
        }
        recommendations.append(rec_data)
    
    return recommendations

# Generate album recommendations for first 50 tracks
print("Generating album recommendations for first 50 tracks...")
album_recommendations = generate_batch_album_recommendations(df, n_tracks=50, n_recommendations=5)

print(f"\nGenerated album recommendations for {len(album_recommendations)} tracks")
print(f"Sample album recommendation for '{album_recommendations[0]['source_track_name']}':")
for i, album in enumerate(album_recommendations[0]['recommended_albums'][:3], 1):
    print(f"  {i}. {album['album_name']} by {album['album_artists']} (Score: {album['similarity_score']})")

# Generate artist recommendations for first 50 tracks  
print(f"\nGenerating artist recommendations for first 50 tracks...")
artist_recommendations = generate_batch_artist_recommendations(df, n_tracks=50, n_recommendations=5)

print(f"\nGenerated artist recommendations for {len(artist_recommendations)} tracks")
print(f"Sample artist recommendation for '{artist_recommendations[0]['source_track_name']}':")
for i, artist in enumerate(artist_recommendations[0]['recommended_artists'][:3], 1):
    print(f"  {i}. {artist['artist_name']} (Score: {artist['similarity_score']})")

Generating album recommendations for first 50 tracks...


Generating album recommendations: 100%|██████████| 50/50 [00:00<00:00, 55.14it/s]



Generated album recommendations for 50 tracks
Sample album recommendation for 'Comedy':
  1. Halloween 2022 Perreo Vol. 4 by Feid;Sech (Score: 0.4964)
  2. Ve MFÖ by MFÖ (Score: 0.4337)
  3. セーラ☆ムン太郎 by Maharajan (Score: 0.4326)

Generating artist recommendations for first 50 tracks...


Generating artist recommendations: 100%|██████████| 50/50 [00:00<00:00, 53.20it/s]


Generated artist recommendations for 50 tracks
Sample artist recommendation for 'Comedy':
  1. Feid;Sech (Score: 0.4964)
  2. MFÖ (Score: 0.4337)
  3. Maharajan (Score: 0.4326)





In [28]:
# Export album recommendations to CSV
album_output_rows = []
for rec_data in album_recommendations:
    source_info = {
        'source_track_index': rec_data['source_track_index'],
        'source_track_name': rec_data['source_track_name'],
        'source_artists': rec_data['source_artists'],
        'source_album': rec_data['source_album'],
        'source_popularity': rec_data['source_popularity']
    }
    
    for i, album in enumerate(rec_data['recommended_albums'], 1):
        row = {
            **source_info,
            'recommendation_rank': i,
            'recommended_album_name': album['album_name'],
            'recommended_album_artists': album['album_artists'],
            'sample_track_name': album['sample_track_name'],
            'sample_track_popularity': album['sample_track_popularity'],
            'similarity_score': album['similarity_score'],
            'euclidean_distance': album['euclidean_distance']
        }
        album_output_rows.append(row)

album_df = pd.DataFrame(album_output_rows)

# Export artist recommendations to CSV
artist_output_rows = []
for rec_data in artist_recommendations:
    source_info = {
        'source_track_index': rec_data['source_track_index'],
        'source_track_name': rec_data['source_track_name'],
        'source_artists': rec_data['source_artists'],
        'source_album': rec_data['source_album'],
        'source_popularity': rec_data['source_popularity']
    }
    
    for i, artist in enumerate(rec_data['recommended_artists'], 1):
        row = {
            **source_info,
            'recommendation_rank': i,
            'recommended_artist_name': artist['artist_name'],
            'sample_track_name': artist['sample_track_name'],
            'sample_album_name': artist['sample_album_name'],
            'sample_track_popularity': artist['sample_track_popularity'],
            'similarity_score': artist['similarity_score'],
            'euclidean_distance': artist['euclidean_distance']
        }
        artist_output_rows.append(row)

artist_df = pd.DataFrame(artist_output_rows)

# Save to CSV files
album_csv_path = Path("../data/processed/knn_album_recommendations.csv")
artist_csv_path = Path("../data/processed/knn_artist_recommendations.csv")

album_df.to_csv(album_csv_path, index=False)
artist_df.to_csv(artist_csv_path, index=False)

print(f"\n=== EXPORT SUMMARY ===")
print(f"Album recommendations exported to: {album_csv_path.resolve()}")
print(f"- Total rows: {len(album_df):,}")
print(f"- Columns: {len(album_df.columns)}")
print(f"- Source tracks: {album_df['source_track_index'].nunique()}")
print(f"- Unique recommended albums: {album_df['recommended_album_name'].nunique()}")

print(f"\nArtist recommendations exported to: {artist_csv_path.resolve()}")
print(f"- Total rows: {len(artist_df):,}")
print(f"- Columns: {len(artist_df.columns)}")
print(f"- Source tracks: {artist_df['source_track_index'].nunique()}")
print(f"- Unique recommended artists: {artist_df['recommended_artist_name'].nunique()}")

# Show sample of exported data
print(f"\n=== SAMPLE ALBUM RECOMMENDATIONS ===")
print(album_df[['source_track_name', 'recommendation_rank', 'recommended_album_name', 'recommended_album_artists', 'similarity_score']].head(10))

print(f"\n=== SAMPLE ARTIST RECOMMENDATIONS ===")
print(artist_df[['source_track_name', 'recommendation_rank', 'recommended_artist_name', 'similarity_score']].head(10))


=== EXPORT SUMMARY ===
Album recommendations exported to: /Users/julianelliott/Documents/GitHub/song-recommendation-dashboard/data/processed/knn_album_recommendations.csv
- Total rows: 250
- Columns: 12
- Source tracks: 50
- Unique recommended albums: 173

Artist recommendations exported to: /Users/julianelliott/Documents/GitHub/song-recommendation-dashboard/data/processed/knn_artist_recommendations.csv
- Total rows: 250
- Columns: 12
- Source tracks: 50
- Unique recommended artists: 164

=== SAMPLE ALBUM RECOMMENDATIONS ===
  source_track_name  recommendation_rank                recommended_album_name  \
0            Comedy                    1          Halloween 2022 Perreo Vol. 4   
1            Comedy                    2                                Ve MFÖ   
2            Comedy                    3                              セーラ☆ムン太郎   
3            Comedy                    4                    Band Baaja Baaraat   
4            Comedy                    5  Do My Ladies Run

## 11. Model Performance & Next Steps

The k-NN recommendation model has been successfully built and deployed. The model uses euclidean distance calculations on 12 standardized audio features to find similar songs.

### Model Characteristics:
- **Algorithm**: k-Nearest Neighbors (k=10)
- **Distance Metric**: Euclidean distance on standardized features
- **Feature Set**: 12 audio features (danceability, energy, valence, etc.)
- **Output**: 10 recommendations per track with similarity scores

### Key Benefits:
- **No Training Required**: Instant deployment as a lazy learning algorithm
- **Interpretable**: Clear similarity based on audio characteristics
- **Scalable**: Efficient neighbor search with sklearn implementation
- **Consistent**: Standardized features ensure balanced feature contribution

### Next Steps:
1. **Evaluate Recommendations**: Test with known similar songs for quality assessment
2. **Build API Endpoints**: Create functions for real-time recommendation requests
3. **Optimize Performance**: Consider approximate nearest neighbor methods for large datasets
4. **A/B Testing**: Compare with other recommendation algorithms
5. **User Interface**: Integrate with dashboard for interactive recommendations

The recommendation dataset is ready for integration into the song recommendation dashboard!