# K-Nearest Neighbors Song Recommendation Model

**Objective:** Build a similarity-based music recommendation system using the k-nearest neighbors (k-NN) algorithm with Euclidean distance.

**Input:** `../data/processed/spotify_tracks_features_engineered.csv` (standardized audio features)  
**Process:** Load → Build k-NN Model → Generate Recommendations → Format Output  
**Output:** Recommendation dataset matching `recommendation_sample_enhanced.csv` format

**Algorithm:** k-NN with k=10 neighbors using euclidean distance on standardized audio features

## 1. Environment Setup & Library Imports

Import essential libraries for machine learning, data manipulation, and distance calculations. We'll use scikit-learn's NearestNeighbors for efficient k-NN implementation.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Machine learning libraries
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import euclidean_distances

# For progress tracking
from tqdm import tqdm

# Display configuration
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

print("Libraries imported successfully!")
print("Ready to build k-NN recommendation model...")

Libraries imported successfully!
Ready to build k-NN recommendation model...


## 2. Data Loading & Exploration

Load the feature-engineered dataset containing both original and standardized audio features. We'll examine the structure to understand our feature matrix for k-NN calculations.

In [2]:
# Define file paths
INPUT_CSV = Path("../data/processed/spotify_tracks_features_engineered.csv")
SAMPLE_OUTPUT = Path("../data/processed/recommendation_sample_enhanced.csv")
OUTPUT_CSV = Path("../data/processed/knn_recommendations.csv")

print(f"Input file: {INPUT_CSV.resolve()}")
print(f"Sample format file: {SAMPLE_OUTPUT.resolve()}")
print(f"Output file: {OUTPUT_CSV.resolve()}")

# Load feature-engineered dataset
try:
    df = pd.read_csv(INPUT_CSV)
    print(f"\nDataset loaded successfully!")
    print(f"Shape: {df.shape}")
    print(f"Columns: {len(df.columns)}")
    display(df.head(3))
except FileNotFoundError:
    print(f"Error: Could not find {INPUT_CSV}")
except Exception as e:
    print(f"Error loading data: {e}")

Input file: /Users/nasraibrahim/Documents/vscode-projects/song-recommendation-dashboard/data/processed/spotify_tracks_features_engineered.csv
Sample format file: /Users/nasraibrahim/Documents/vscode-projects/song-recommendation-dashboard/data/processed/recommendation_sample_enhanced.csv
Output file: /Users/nasraibrahim/Documents/vscode-projects/song-recommendation-dashboard/data/processed/knn_recommendations.csv

Dataset loaded successfully!
Shape: (113999, 36)
Columns: 36



Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,danceability_scaled,energy_scaled,key_scaled,loudness_scaled,mode_scaled,speechiness_scaled,acousticness_scaled,instrumentalness_scaled,liveness_scaled,valence_scaled,tempo_scaled,time_signature_scaled,duration_min,popularity_category,mood_score,party_score
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic,0.629239,-0.717147,-1.210434,0.300825,-1.326297,0.551843,-0.850193,-0.504111,0.758735,0.929315,-1.141854,0.221824,3.844433,High,0.588,0.5685
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic,-0.845908,-1.889974,-1.210434,-1.784739,0.753979,-0.078995,1.831744,-0.504097,-0.591216,-0.798681,-1.489708,0.221824,2.4935,High,0.2165,0.293
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic,-0.742187,-1.122667,-1.491334,-0.293289,0.753979,-0.273827,-0.315489,-0.504115,-0.507172,-1.365679,-1.528303,0.221824,3.513767,High,0.2395,0.3985


## 3. Examine Target Output Format

Load and analyze the expected output format to ensure our recommendations match the required structure for the recommendation system.

In [3]:
# Load sample output format to understand required structure
try:
    sample_format = pd.read_csv(SAMPLE_OUTPUT)
    print("Target output format structure:")
    print(f"Shape: {sample_format.shape}")
    print(f"Columns: {sample_format.columns.tolist()}")
    display(sample_format.head())
    
    # Understand the format requirements
    print("\nFormat Analysis:")
    print(f"- Total recommendations: {len(sample_format)}")
    print(f"- Unique source tracks: {sample_format['track_id'].nunique()}")
    print(f"- Recommendations per track: {len(sample_format) // sample_format['track_id'].nunique()}")
    
except FileNotFoundError:
    print(f"Sample format file not found. Will create standard format.")
    sample_format = None

Target output format structure:
Shape: (10, 15)
Columns: ['track_id', 'recommended_track_id', 'match_score', 'track_name', 'track_artists', 'track_album', 'track_genre', 'track_image_url', 'track_popularity', 'recommended_track_name', 'recommended_track_artists', 'recommended_track_album', 'recommended_track_genre', 'recommended_track_image_url', 'recommended_track_popularity']


Unnamed: 0,track_id,recommended_track_id,match_score,track_name,track_artists,track_album,track_genre,track_image_url,track_popularity,recommended_track_name,recommended_track_artists,recommended_track_album,recommended_track_genre,recommended_track_image_url,recommended_track_popularity
0,3n3Ppam7vgaVa1iaRUc9Lp,7ouMYWpwJ422jRcDASZB7P,0.92,Mr. Brightside,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,85,Somebody Told Me,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,78
1,7ouMYWpwJ422jRcDASZB7P,3n3Ppam7vgaVa1iaRUc9Lp,0.87,Somebody Told Me,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,78,Mr. Brightside,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,85
2,3n3Ppam7vgaVa1iaRUc9Lp,0eGsygTp906u18L0Oimnem,0.81,Mr. Brightside,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,85,Somebody That I Used to Know,Gotye feat. Kimbra,Making Mirrors,indie,https://i.scdn.co/image/ab67616d0000b273f0c20a...,82
3,0eGsygTp906u18L0Oimnem,3n3Ppam7vgaVa1iaRUc9Lp,0.78,Somebody That I Used to Know,Gotye feat. Kimbra,Making Mirrors,indie,https://i.scdn.co/image/ab67616d0000b273f0c20a...,82,Mr. Brightside,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,85
4,7ouMYWpwJ422jRcDASZB7P,0eGsygTp906u18L0Oimnem,0.85,Somebody Told Me,The Killers,Hot Fuss,rock,https://i.scdn.co/image/ab67616d0000b273ccdddd...,78,Somebody That I Used to Know,Gotye feat. Kimbra,Making Mirrors,indie,https://i.scdn.co/image/ab67616d0000b273f0c20a...,82



Format Analysis:
- Total recommendations: 10
- Unique source tracks: 4
- Recommendations per track: 2


## 4. Prepare Feature Matrix for k-NN

Extract the standardized audio features used for the Euclidean distance calculations. Each feature has been scaled to mean 0 and standard deviation 1, so no single feature dominates the distance simply because of its units (tempo in BPM versus danceability in [0, 1], for example).

In [4]:
# Identify standardized audio features for k-NN model
scaled_features = [col for col in df.columns if col.endswith('_scaled')]

print(f"Standardized audio features for k-NN: {len(scaled_features)}")
print(scaled_features)

# Verify we have the expected 12 audio features
expected_features = [
    'danceability_scaled', 'energy_scaled', 'key_scaled', 'loudness_scaled',
    'mode_scaled', 'speechiness_scaled', 'acousticness_scaled', 
    'instrumentalness_scaled', 'liveness_scaled', 'valence_scaled', 
    'tempo_scaled', 'time_signature_scaled'
]

missing_features = [f for f in expected_features if f not in scaled_features]
if missing_features:
    print(f"\nWarning: Missing expected features: {missing_features}")
else:
    print("\nAll expected standardized features found ✓")

# Create feature matrix for k-NN
feature_matrix = df[scaled_features].values
print(f"\nFeature matrix shape: {feature_matrix.shape}")
print(f"Feature matrix type: {type(feature_matrix)}")

Standardized audio features for k-NN: 12
['danceability_scaled', 'energy_scaled', 'key_scaled', 'loudness_scaled', 'mode_scaled', 'speechiness_scaled', 'acousticness_scaled', 'instrumentalness_scaled', 'liveness_scaled', 'valence_scaled', 'tempo_scaled', 'time_signature_scaled']

All expected standardized features found ✓

Feature matrix shape: (113999, 12)
Feature matrix type: <class 'numpy.ndarray'>
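The standardization described above can be checked numerically. The following is a minimal sketch on a toy table, not the project's actual scaling code: the column values are made up, and it assumes a plain z-score with the population standard deviation (`ddof=0`), which may differ from how the `_scaled` columns were really produced.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two raw audio features on very different scales
# (values are illustrative, not taken from the dataset).
toy = pd.DataFrame({
    "tempo": [87.9, 77.5, 76.3, 120.0, 140.2],
    "danceability": [0.676, 0.420, 0.438, 0.800, 0.550],
})

# z-score: subtract the column mean, divide by the population std (ddof=0)
toy_scaled = (toy - toy.mean()) / toy.std(ddof=0)

# After scaling, every column should have mean ~0 and std ~1
col_means = toy_scaled.mean().to_numpy()
col_stds = toy_scaled.std(ddof=0).to_numpy()
print("means ~ 0:", np.allclose(col_means, 0.0))
print("stds  ~ 1:", np.allclose(col_stds, 1.0))
```

Running the same two checks on `df[scaled_features]` would confirm whether the loaded columns are in fact standardized before they are passed to the model.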


## 5. Build k-Nearest Neighbors Model

Initialize and fit the k-NN model using Euclidean distance. We request k=11 neighbors because a query track that is part of the fitted data is returned as its own neighbor at distance 0; dropping it leaves 10 recommendations.

In [5]:
# Initialize k-NN model
# Using k=11 to get 10 recommendations (excluding the query song itself)
k_neighbors = 11
knn_model = NearestNeighbors(
    n_neighbors=k_neighbors,
    metric='euclidean',
    algorithm='auto',  # Let sklearn choose the best algorithm
    n_jobs=-1  # Use all available CPU cores for faster computation
)

print(f"Initializing k-NN model with:")
print(f"- k = {k_neighbors} neighbors (10 recommendations + 1 self)")
print(f"- Distance metric: euclidean")
print(f"- Algorithm: auto (sklearn will choose optimal)")

# Fit the model with our standardized feature matrix
print("\nFitting k-NN model...")
knn_model.fit(feature_matrix)
print("k-NN model fitted successfully!")

# Test the model with a sample query
print("\nTesting model with first track:")
sample_distances, sample_indices = knn_model.kneighbors([feature_matrix[0]])
print(f"Found {len(sample_indices[0])} neighbors")
print(f"Distances: {sample_distances[0][:5]}...")  # Show first 5 distances

Initializing k-NN model with:
- k = 11 neighbors (10 recommendations + 1 self)
- Distance metric: euclidean
- Algorithm: auto (sklearn will choose optimal)

Fitting k-NN model...
k-NN model fitted successfully!

Testing model with first track:
Found 11 neighbors
Distances: [0.         0.         0.         0.         1.01463073]...
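The test output above shows four neighbors at distance 0, which suggests that several rows share exactly the same feature vector as the first track. A minimal sketch of how such rows could be counted is given below. The toy table, its values, and the idea that one track appears under two genres are assumptions made for illustration; only `duplicated` and `drop_duplicates` are real pandas calls.

```python
import pandas as pd

# Toy data: rows 0 and 1 have identical feature vectors (hypothetical values)
toy_df = pd.DataFrame({
    "track_id": ["t0", "t0", "t1", "t2"],
    "track_genre": ["acoustic", "j-pop", "rock", "rock"],
    "energy_scaled": [-0.72, -0.72, 0.40, 1.10],
    "valence_scaled": [0.93, 0.93, -0.80, 0.25],
})
feature_cols = ["energy_scaled", "valence_scaled"]

# keep=False marks every member of a duplicate group, not just the repeats
dup_mask = toy_df.duplicated(subset=feature_cols, keep=False)
n_duplicate_rows = int(dup_mask.sum())

# Number of distinct feature vectors left after collapsing duplicates
n_unique_vectors = len(toy_df.drop_duplicates(subset=feature_cols))

print(f"Rows in duplicate groups: {n_duplicate_rows}")
print(f"Distinct feature vectors: {n_unique_vectors}")
```

Applied to `df` with `subset=scaled_features`, the same check would show how widespread exact duplicates are before deciding whether to deduplicate the catalog.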


## 6. Generate Recommendations Function

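Create a function to generate recommendations for any given track. The function finds the nearest neighbors and drops the first result, which it assumes is the input song itself. That assumption only holds when the feature vector is unique: in the test output below, 'Comedy' appears three more times at distance 0, so rows that duplicate the source track are still returned as recommendations.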

In [6]:
def get_recommendations(track_index, n_recommendations=10):
    """
    Get song recommendations for a given track using k-NN model.
    
    Args:
        track_index (int): Index of the track in the dataset
        n_recommendations (int): Number of recommendations to return
    
    Returns:
        pd.DataFrame: DataFrame with recommended tracks
    """
    # Get k+1 neighbors (including the track itself)
    distances, indices = knn_model.kneighbors([feature_matrix[track_index]], 
                                            n_neighbors=n_recommendations+1)
    
    # Drop the first result, which is assumed to be the query track itself.
    # Note: exact duplicates of the query (distance 0) are not filtered out here.
    neighbor_indices = indices[0][1:n_recommendations+1]
    neighbor_distances = distances[0][1:n_recommendations+1]
    
    # Get track information for recommendations
    recommended_tracks = df.iloc[neighbor_indices].copy()
    
    # Add similarity score (inverse of distance - higher score = more similar)
    recommended_tracks['similarity_score'] = 1 / (1 + neighbor_distances)
    recommended_tracks['euclidean_distance'] = neighbor_distances
    
    return recommended_tracks

# Test the recommendation function
print("Testing recommendation function with track index 0:")
sample_track = df.iloc[0]
print(f"Source track: '{sample_track['track_name']}' by {sample_track['artists']}")

sample_recs = get_recommendations(0, n_recommendations=5)
print(f"\nTop 5 recommendations:")
for i, (_, track) in enumerate(sample_recs.iterrows(), 1):
    print(f"{i}. '{track['track_name']}' by {track['artists']} (distance: {track['euclidean_distance']:.4f})")

Testing recommendation function with track index 0:
Source track: 'Comedy' by Gen Hoshino

Top 5 recommendations:
1. 'Comedy' by Gen Hoshino (distance: 0.0000)
2. 'Comedy' by Gen Hoshino (distance: 0.0000)
3. 'Comedy' by Gen Hoshino (distance: 0.0000)
4. 'JAMAICA' by Feid;Sech (distance: 1.0146)
5. 'JAMAICA' by Feid;Sech (distance: 1.0146)
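The output above repeats the source track and lists 'JAMAICA' twice. One possible way to avoid this is sketched below on toy data. It is an illustration rather than the notebook's method: `recommend_distinct`, the `oversample` parameter, and the toy tracks are all hypothetical, and it assumes that rows sharing a `track_id` should count as the same song.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy catalog: rows 0 and 1 are the same track with identical features
toy_df = pd.DataFrame({
    "track_id": ["t0", "t0", "t1", "t2", "t3"],
    "track_name": ["A", "A", "B", "C", "D"],
})
toy_features = np.array([
    [0.0, 0.0],
    [0.0, 0.0],  # duplicate row of t0
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
])
toy_model = NearestNeighbors(metric="euclidean").fit(toy_features)


def recommend_distinct(track_index, n_recommendations=2, oversample=3):
    """Return up to n_recommendations distinct tracks, excluding the source."""
    # Ask for extra neighbors so enough remain after filtering
    n_query = min(n_recommendations + oversample + 1, len(toy_features))
    distances, indices = toy_model.kneighbors(
        toy_features[track_index:track_index + 1], n_neighbors=n_query
    )
    recs = toy_df.iloc[indices[0]].copy()
    recs["euclidean_distance"] = distances[0]

    # Exclude by track_id rather than by position in the result list
    source_id = toy_df.iloc[track_index]["track_id"]
    recs = recs[recs["track_id"] != source_id]

    # Keep only the closest row for each recommended track
    recs = recs.drop_duplicates(subset="track_id", keep="first")
    return recs.head(n_recommendations)


toy_recs = recommend_distinct(0)
print(toy_recs[["track_id", "euclidean_distance"]])
```

Filtering by `track_id` does not depend on the order in which tied neighbors are returned, which is why it is used here instead of slicing off the first result.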


## 7. Generate Recommendations for All Tracks

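Process the entire dataset to generate 10 recommendations for each track, producing a recommendation dataset that covers the whole catalog. The recorded run below was interrupted (`KeyboardInterrupt`) before the first batch finished, so the cells in Sections 8 to 10 have no saved output.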

In [7]:
# Generate recommendations for all tracks
print("Generating recommendations for all tracks...")
print(f"Processing {len(df)} tracks to generate {len(df) * 10:,} total recommendations")

all_recommendations = []

# Process tracks in batches with progress bar
batch_size = 1000
total_tracks = len(df)

for start_idx in tqdm(range(0, total_tracks, batch_size), desc="Processing batches"):
    end_idx = min(start_idx + batch_size, total_tracks)
    batch_recommendations = []
    
    for track_idx in range(start_idx, end_idx):
        # Get source track info
        source_track = df.iloc[track_idx]
        
        # Query 11 neighbors and drop the first, assumed to be the track itself;
        # exact duplicates of the track (distance 0) remain in the results
        distances, indices = knn_model.kneighbors([feature_matrix[track_idx]], n_neighbors=11)
        neighbor_indices = indices[0][1:]
        neighbor_distances = distances[0][1:]
        
        # Generate recommendations for this track
        for rank, (neighbor_idx, distance) in enumerate(zip(neighbor_indices, neighbor_distances)):
            rec_track = df.iloc[neighbor_idx]
            
            # Calculate similarity score (inverse of distance + 1)
            similarity_score = round(1 / (1 + distance), 2)
            
            # Create recommendation entry matching target format
            rec_entry = {
                'track_id': source_track['track_id'],
                'recommended_track_id': rec_track['track_id'],
                'match_score': similarity_score,
                'track_name': source_track['track_name'],
                'track_artists': source_track['artists'],
                'track_album': source_track['album_name'],
                'track_genre': source_track['track_genre'],
                'track_image_url': '',  # Will be populated later if needed
                'track_popularity': source_track['popularity'],
                'recommended_track_name': rec_track['track_name'],
                'recommended_track_artists': rec_track['artists'],
                'recommended_track_album': rec_track['album_name'],
                'recommended_track_genre': rec_track['track_genre'],
                'recommended_track_image_url': '',  # Will be populated later if needed
                'recommended_track_popularity': rec_track['popularity']
            }
            batch_recommendations.append(rec_entry)
    
    all_recommendations.extend(batch_recommendations)

# Convert to DataFrame
recommendations_df = pd.DataFrame(all_recommendations)
print(f"\nRecommendations generated successfully!")
print(f"Total recommendations: {len(recommendations_df):,}")
print(f"Unique source tracks: {recommendations_df['track_id'].nunique():,}")
print(f"Average recommendations per track: {len(recommendations_df) / recommendations_df['track_id'].nunique():.1f}")

Generating recommendations for all tracks...
Processing 113999 tracks to generate 1,139,990 total recommendations


Processing batches:   0%|          | 0/114 [00:14<?, ?it/s]



KeyboardInterrupt: 
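The loop above issues one `kneighbors` call per track and builds one dictionary per recommendation, which is likely why the run had to be interrupted. A minimal sketch of a vectorized alternative on toy data follows. The toy frame and its column names are assumptions, and the result has fewer columns than the target format. It relies on the documented scikit-learn behavior that `kneighbors()` called without a query array returns neighbors for every fitted point and does not count a point as its own neighbor.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy catalog of 50 tracks with 4 random features (illustrative only)
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(50, 4))
toy_df = pd.DataFrame({
    "track_id": [f"t{i}" for i in range(50)],
    "track_name": [f"song {i}" for i in range(50)],
})

k = 3
model = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(toy_features)

# One call for all tracks; no query array means self is excluded
distances, indices = model.kneighbors()

# Flatten the (n_tracks, k) arrays into one row per (source, neighbor) pair
source_idx = np.repeat(np.arange(len(toy_df)), k)
neighbor_idx = indices.ravel()

source = toy_df.iloc[source_idx].reset_index(drop=True)
recommended = (
    toy_df.iloc[neighbor_idx].reset_index(drop=True).add_prefix("recommended_")
)
toy_recommendations = pd.concat([source, recommended], axis=1)
toy_recommendations["match_score"] = np.round(1.0 / (1.0 + distances.ravel()), 2)

print(toy_recommendations.head())
```

Exact duplicates of a track would still appear as neighbors with this approach, so it would need to be combined with deduplication if the catalog contains repeated rows.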

## 8. Validate & Display Sample Recommendations

Examine the generated recommendations to ensure quality and proper formatting. We'll look at sample recommendations and verify the data structure.

In [None]:
# Display sample recommendations
print("Sample recommendations structure:")
display(recommendations_df.head(10))

print(f"\nDataset statistics:")
print(f"- Total recommendation pairs: {len(recommendations_df):,}")
print(f"- Unique source tracks: {recommendations_df['track_id'].nunique():,}")
print(f"- Unique recommended tracks: {recommendations_df['recommended_track_id'].nunique():,}")
print(f"- Recommendations per source track: 10")

# Check recommendation quality with a specific example
sample_track_id = recommendations_df['track_id'].iloc[0]
sample_recs = recommendations_df[recommendations_df['track_id'] == sample_track_id].head(10)

print(f"\nExample: Recommendations for track ID '{sample_track_id}':")
print(f"Source: '{sample_recs.iloc[0]['track_name']}' by {sample_recs.iloc[0]['track_artists']}")
print("\nTop 10 recommendations:")

for i, (_, rec) in enumerate(sample_recs.iterrows(), 1):
    print(f"  {i}. '{rec['recommended_track_name']}' by {rec['recommended_track_artists']}")
    print(f"     Match Score: {rec['match_score']}, Genre: {rec['recommended_track_genre']}")

## 9. Format Output to Match Target Structure

Ensure the output format matches the expected structure from the sample enhanced CSV file. We'll align column names and data types for consistency.

In [None]:
# Verify output format matches target structure
if sample_format is not None:
    print("Verifying output format matches target structure...")
    
    target_columns = sample_format.columns.tolist()
    current_columns = recommendations_df.columns.tolist()
    
    print(f"Target columns: {target_columns}")
    print(f"Current columns: {current_columns}")
    
    # Check if formats match
    if set(target_columns) == set(current_columns):
        print("✓ Column structure matches target format!")
    else:
        missing_cols = set(target_columns) - set(current_columns)
        extra_cols = set(current_columns) - set(target_columns)
        if missing_cols:
            print(f"⚠ Missing columns: {missing_cols}")
        if extra_cols:
            print(f"⚠ Extra columns: {extra_cols}")
        
else:
    print("Using generated recommendation format...")

# Ensure proper data types and formatting
recommendations_df['match_score'] = recommendations_df['match_score'].round(2)
recommendations_df['track_popularity'] = recommendations_df['track_popularity'].astype(int)
recommendations_df['recommended_track_popularity'] = recommendations_df['recommended_track_popularity'].astype(int)

print("\nData formatting completed!")
print(f"Final dataset shape: {recommendations_df.shape}")
print(f"Final columns: {recommendations_df.columns.tolist()}")
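The check above compares column names as sets, so it passes even when the columns are in a different order than the target file. If column order matters to the consumer of the CSV, a reorder step along the following lines could be added. This is a sketch on a toy frame with a shortened, hypothetical column list. `DataFrame.reindex(columns=...)` drops columns that are not listed and adds any missing ones filled with NaN.

```python
import pandas as pd

# Shortened stand-in for sample_format.columns.tolist()
target_columns = ["track_id", "recommended_track_id", "match_score"]

# Toy recommendations with columns out of order plus one extra column
toy_recs = pd.DataFrame({
    "match_score": [0.5, 0.33],
    "track_id": ["t0", "t0"],
    "recommended_track_id": ["t1", "t2"],
    "euclidean_distance": [1.0, 2.0],
})

# Reorder to the target layout; extra columns are dropped
aligned = toy_recs.reindex(columns=target_columns)

print(aligned.columns.tolist())
```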

## 10. Export Recommendations Dataset

Save the complete recommendation dataset to CSV format for use in the recommendation system and dashboard components.

In [None]:
# Export recommendations to CSV
recommendations_df.to_csv(OUTPUT_CSV, index=False)

print(f"Recommendations exported successfully!")
print(f"Output location: {OUTPUT_CSV.resolve()}")
print(f"File size: {OUTPUT_CSV.stat().st_size / (1024*1024):.1f} MB")

# Final summary statistics
print(f"\n{'='*60}")
print("K-NN RECOMMENDATION MODEL SUMMARY")
print(f"{'='*60}")
print(f"✓ Algorithm: k-Nearest Neighbors (k=10)")
print(f"✓ Distance Metric: Euclidean")
print(f"✓ Feature Set: 12 standardized audio features")
print(f"✓ Total Source Tracks: {recommendations_df['track_id'].nunique():,}")
print(f"✓ Total Recommendations: {len(recommendations_df):,}")
print(f"✓ Recommendations per Track: 10")
print(f"✓ Output Format: CSV with match scores")
print(f"✓ File Location: {OUTPUT_CSV.name}")
print(f"{'='*60}")

# Display final sample
print("\nFinal sample of recommendations:")
sample_data = recommendations_df.sample(10)[['track_name', 'track_artists', 'recommended_track_name', 'recommended_track_artists', 'match_score']]
display(sample_data)

## 11. Model Performance & Next Steps

The k-NN recommendation model is fitted on 12 standardized audio features and uses Euclidean distance to find similar songs. Once the full generation run in Section 7 completes, the exported CSV provides 10 recommendations per track.

### Model Characteristics:
- **Algorithm**: k-Nearest Neighbors (k=10)
- **Distance Metric**: Euclidean distance on standardized features
- **Feature Set**: 12 audio features (danceability, energy, valence, etc.)
- **Output**: 10 recommendations per track with similarity scores

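### Key Benefits:
- **No Training Phase**: As a lazy learner, k-NN only indexes the data at fit time; the distance computations happen at query time
- **Interpretable**: Similarity is defined directly by distance between audio characteristics
- **Efficient Search**: scikit-learn selects the neighbor-search algorithm (`algorithm='auto'`) and can parallelize queries (`n_jobs=-1`)
- **Consistent**: Standardized features keep any single feature from dominating the distance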

### Next Steps:
1. **Evaluate Recommendations**: Test with known similar songs for quality assessment
2. **Build API Endpoints**: Create functions for real-time recommendation requests
3. **Optimize Performance**: Consider approximate nearest neighbor methods for large datasets
4. **A/B Testing**: Compare with other recommendation algorithms
5. **User Interface**: Integrate with dashboard for interactive recommendations
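As a starting point for the first step above, one simple proxy is the share of recommendations that fall in the same genre as the source track. The sketch below uses a toy frame with the output column names from this notebook. The metric itself (`genre_hit_rate`) is an assumption chosen for illustration, not an evaluation defined by the project, and genre agreement is only a rough indicator of recommendation quality.

```python
import pandas as pd

# Toy recommendations using the notebook's output column names
toy_recs = pd.DataFrame({
    "track_id": ["t0", "t0", "t1", "t1"],
    "track_genre": ["rock", "rock", "indie", "indie"],
    "recommended_track_genre": ["rock", "indie", "indie", "indie"],
})

# True where the recommended track shares the source track's genre
same_genre = toy_recs["track_genre"] == toy_recs["recommended_track_genre"]

# Overall share of same-genre recommendations
genre_hit_rate = same_genre.mean()

# Share per source track, useful for spotting tracks with poor matches
per_track_hit_rate = same_genre.groupby(toy_recs["track_id"]).mean()

print(f"Overall genre hit rate: {genre_hit_rate:.2f}")
print(per_track_hit_rate)
```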

Once exported, the recommendation dataset can be integrated into the song recommendation dashboard.