# 4. k-NN on High-Dimensional Data & Curse of Dimensionality

## Overview
This notebook explores:
- How k-NN performs on high-dimensional datasets
- The curse of dimensionality phenomenon
- Why distance metrics become less meaningful in high dimensions
- Mitigation strategies for high-dimensional data
- Practical implications and solutions
- Analysis using the banknote dataset with synthetic high-dimensional extensions

## Understanding the Curse of Dimensionality

### What is the Curse of Dimensionality?

The curse of dimensionality refers to various phenomena that arise when analyzing data in high-dimensional spaces. For k-NN specifically:

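1. **Distance Concentration**: In high dimensions, pairwise distances cluster around a common value, so the nearest and farthest points are almost equally far away
2. **Sparse Data**: A fixed number of samples covers an exponentially shrinking fraction of the space, so points become isolated
3. **Nearest Neighbors Become Less "Near"**: The concept of "closeness" loses meaning
4. **Computational Cost**: Each distance costs O(d) to compute, and tree-based indexes (KD-tree, ball tree) lose their advantage over brute-force search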

### Mathematical Intuition

In d-dimensional space:
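- **Volume grows exponentially**: V ∝ r^d
- **Volume concentrates near the surface**: Most of a ball's volume lies in a thin shell just inside its boundary, so most points lie near the boundary
- **Distance ratio converges**: max_dist/min_dist → 1 as d → ∞ (for many distributions, e.g. independent, identically distributed features)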
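
The volume scaling above can be made concrete with a little arithmetic. Because V ∝ r^d, the inner ball of radius (1 − ε) holds (1 − ε)^d of a unit ball's volume, and the rest sits in the outer shell. The sketch below is illustrative only; the helper name `outer_shell_fraction` and the 5% shell width are assumptions made for this example.

```python
import numpy as np

def outer_shell_fraction(d, eps=0.05):
    """Fraction of a d-dimensional unit ball's volume within eps of its surface."""
    # Volume scales as r**d, so the inner ball of radius (1 - eps)
    # holds (1 - eps)**d of the total volume; the shell holds the rest.
    return 1.0 - (1.0 - eps) ** d

for d in [1, 2, 10, 50, 100]:
    frac = outer_shell_fraction(d)
    print(f"d={d:>3}: {frac:.3f} of the volume lies in the outer 5% shell")
```

The fraction rises toward 1 as d grows, which is why uniformly scattered points in high dimensions sit almost entirely near the boundary.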

### Why This Affects k-NN

k-NN relies on the assumption that **nearby points are similar**. When this assumption breaks down:
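- All points appear almost equally "close"
- The k nearest neighbors are picked largely by noise rather than by class structure
- Classification becomes unreliable
- Accuracy can degrade sharply, especially when most features are irrelevant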
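
One way to see "all points appear equally close" is the relative contrast between the farthest and nearest point from a query. The sketch below uses uniform random data as a stand-in (not the banknote data), and `relative_contrast` is a hypothetical helper written for this illustration.

```python
import numpy as np

def relative_contrast(d, n_points=500, seed=0):
    """(d_max - d_min) / d_min for one random query against uniform random points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, d))
    query = rng.uniform(size=d)
    # Euclidean distance from the query to every point
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(d) for d in [2, 10, 100, 1000]}
for d, c in contrasts.items():
    print(f"d={d:>4}: relative contrast = {c:.2f}")
```

A small contrast means the nearest neighbor is barely closer than the farthest point, so the choice of "nearest" neighbors carries little information.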

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")

In [None]:
# Load the prepared banknote data
with open('../data/prepared_banknote_data.pkl', 'rb') as f:
    data = pickle.load(f)

X_train_orig = data['X_train']
X_test_orig = data['X_test']
y_train = data['y_train']
y_test = data['y_test']
X_train_scaled_orig = data['X_train_scaled']
X_test_scaled_orig = data['X_test_scaled']
scaler = data['scaler']
feature_names = data['feature_names']

print("Original banknote data loaded successfully!")
print(f"Original dimensions: {X_train_scaled_orig.shape[1]} features")
print(f"Training samples: {X_train_scaled_orig.shape[0]}")
print(f"Test samples: {X_test_scaled_orig.shape[0]}")

## Demonstrating Distance Concentration

Let's create synthetic high-dimensional versions of our banknote data to show how distances behave.

In [None]:
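# Create high-dimensional versions of the banknote dataset
def create_high_dimensional_data(X_orig, dimensions_list, noise_std=0.1, random_state=None):
    """
    Create datasets of varying dimensionality from X_orig.

    For d <= n_orig_features, keep only the first d original features.
    For larger d, append (d - n_orig_features) Gaussian noise features with
    standard deviation noise_std; these carry no class information.
    """
    rng = np.random.default_rng(random_state)  # seeded for reproducible noise
    X_orig = np.asarray(X_orig)
    datasets = {}
    n_orig_features = X_orig.shape[1]
    
    for d in dimensions_list:
        if d <= n_orig_features:
            # Use subset of original features
            X_high_dim = X_orig[:, :d]
        else:
            # Add noise features
            n_noise_features = d - n_orig_features
            noise_features = rng.normal(0, noise_std,
                                        (X_orig.shape[0], n_noise_features))
            X_high_dim = np.hstack([X_orig, noise_features])
        
        datasets[d] = X_high_dim
    
    return datasets

# Create datasets with different dimensionalities.
# Train and test use different seeds so their noise columns are independent.
# Note: if the scaled features have unit variance (StandardScaler), noise_std=0.1
# is weak by comparison; a larger value (e.g. 1.0) makes the effect on k-NN stronger.
dimensions = [2, 4, 8, 16, 32, 64, 128]
train_datasets = create_high_dimensional_data(X_train_scaled_orig, dimensions, random_state=0)
test_datasets = create_high_dimensional_data(X_test_scaled_orig, dimensions, random_state=1)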

print("High-dimensional datasets created:")
for d in dimensions:
    print(f"  {d}D: {train_datasets[d].shape}")

In [None]:
# Analyze distance distribution in different dimensions
def analyze_distances(X, n_samples=100, random_state=0):
    """
    Analyze the pairwise Euclidean distance distribution of (a sample of) X
    """
    X = np.asarray(X)
    # Sample random points to keep the O(n^2) pairwise computation small
    if len(X) > n_samples:
        rng = np.random.default_rng(random_state)
        indices = rng.choice(len(X), n_samples, replace=False)
        X_sample = X[indices]
    else:
        X_sample = X
    
    # Calculate all pairwise distances, keeping each unordered pair once
    diffs = X_sample[:, None, :] - X_sample[None, :, :]
    dist_matrix = np.sqrt(np.sum(diffs**2, axis=-1))
    distances = dist_matrix[np.triu_indices(len(X_sample), k=1)]
    
    # Duplicate rows give zero distances; use the smallest positive distance
    # so the max/min ratio stays finite
    positive = distances[distances > 0]
    min_positive = positive.min() if positive.size else np.nan
    
    return {
        'mean': np.mean(distances),
        'std': np.std(distances),
        'min': np.min(distances),
        'max': np.max(distances),
        'cv': np.std(distances) / np.mean(distances),  # Coefficient of variation
        'ratio': np.max(distances) / min_positive,  # Max / smallest positive distance
        'distances': distances
    }

# Analyze distance statistics for each dimension
distance_stats = {}
print("DISTANCE CONCENTRATION ANALYSIS")
print("=" * 50)
print(f"{'Dim':<4} {'Mean':<8} {'Std':<8} {'CV':<8} {'Max/Min':<8}")
print("-" * 50)

for d in dimensions:
    stats = analyze_distances(train_datasets[d])
    distance_stats[d] = stats
    
    print(f"{d:<4} {stats['mean']:<8.3f} {stats['std']:<8.3f} {stats['cv']:<8.3f} {stats['ratio']:<8.3f}")

print("\nCV = Coefficient of Variation (lower = more concentrated)")
print("Max/Min = Ratio of maximum to minimum distance (lower = more concentrated)")

In [None]:
# Visualize distance concentration
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Distance Concentration in High Dimensions', fontsize=16)

# 1. Distance distributions for different dimensions
selected_dims = [2, 8, 32, 128]
colors = ['blue', 'green', 'orange', 'red']

for i, (d, color) in enumerate(zip(selected_dims, colors)):
    distances = distance_stats[d]['distances']
    axes[0, 0].hist(distances, bins=30, alpha=0.6, label=f'{d}D', 
                   color=color, density=True)

axes[0, 0].set_title('Distance Distributions by Dimension')
axes[0, 0].set_xlabel('Euclidean Distance')
axes[0, 0].set_ylabel('Density')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Coefficient of variation vs dimension
dims = list(distance_stats.keys())
cvs = [distance_stats[d]['cv'] for d in dims]

axes[0, 1].plot(dims, cvs, 'o-', linewidth=2, markersize=8, color='red')
axes[0, 1].set_title('Distance Concentration\n(Lower CV = More Concentrated)')
axes[0, 1].set_xlabel('Number of Dimensions')
axes[0, 1].set_ylabel('Coefficient of Variation')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_xscale('log')

# 3. Max/Min distance ratio
ratios = [distance_stats[d]['ratio'] for d in dims]
axes[1, 0].plot(dims, ratios, 's-', linewidth=2, markersize=8, color='blue')
axes[1, 0].set_title('Distance Range Compression\n(Lower Ratio = More Compressed)')
axes[1, 0].set_xlabel('Number of Dimensions')
axes[1, 0].set_ylabel('Max Distance / Min Distance')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_xscale('log')

# 4. Mean distance vs dimension
mean_dists = [distance_stats[d]['mean'] for d in dims]
axes[1, 1].plot(dims, mean_dists, '^-', linewidth=2, markersize=8, color='green')
axes[1, 1].set_title('Mean Distance vs Dimensionality')
axes[1, 1].set_xlabel('Number of Dimensions')
axes[1, 1].set_ylabel('Mean Euclidean Distance')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_xscale('log')

plt.tight_layout()
plt.show()

print("\n💡 WHAT TO LOOK FOR (check against the table and plots above):")
print("• Coefficient of Variation decreases → distances become more similar")
print("• Max/Min ratio decreases → range of distances compresses")
print("• Mean distance increases, while the spread of distances grows more slowly")
print("• Together these are the distance-concentration side of the curse of dimensionality")

## k-NN Performance vs Dimensionality

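Now let's measure how k-NN classification accuracy changes with dimensionality. Note that the datasets below 4 dimensions drop original features, while those above 4 dimensions add pure-noise features.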

In [None]:
# Test k-NN performance across dimensions
def evaluate_knn_performance(train_datasets, test_datasets, y_train, y_test, k=5):
    """
    Evaluate k-NN performance across different dimensions
    """
    results = {}
    
    for d in train_datasets.keys():
        # Train k-NN
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(train_datasets[d], y_train)
        
        # Predict
        y_pred = knn.predict(test_datasets[d])
        accuracy = accuracy_score(y_test, y_pred)
        
        # Average distance from each test point to its k nearest training neighbors
        # (a measure of neighborhood size, not of prediction confidence)
        distances, indices = knn.kneighbors(test_datasets[d])
        avg_neighbor_distance = np.mean(distances)
        
        results[d] = {
            'accuracy': accuracy,
            'avg_neighbor_distance': avg_neighbor_distance,
            'dimensions': d
        }
    
    return results

# Evaluate performance
performance_results = evaluate_knn_performance(train_datasets, test_datasets, 
                                              y_train, y_test, k=5)

print("k-NN PERFORMANCE VS DIMENSIONALITY")
print("=" * 40)
print(f"{'Dimensions':<12} {'Accuracy':<10} {'Avg Neighbor Dist':<18}")
print("-" * 40)

for d, results in performance_results.items():
    print(f"{d:<12} {results['accuracy']:<10.4f} {results['avg_neighbor_distance']:<18.4f}")

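# Find performance degradation point. Search only past the best-performing
# dimensionality: the low-dimensional subsets score lower because they drop
# informative features, which is not a curse-of-dimensionality effect.
accuracies = [performance_results[d]['accuracy'] for d in dimensions]
max_accuracy = max(accuracies)
best_idx = accuracies.index(max_accuracy)
degradation_threshold = max_accuracy * 0.95  # 5% relative drop

degradation_point = None
for d, acc in zip(dimensions[best_idx + 1:], accuracies[best_idx + 1:]):
    if acc < degradation_threshold:
        degradation_point = d
        break

if degradation_point is not None:
    print(f"\n⚠️  Accuracy falls more than 5% below its peak at {degradation_point} dimensions")
else:
    print("\n✅ Accuracy stays within 5% of its peak for all dimensions beyond the best one")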

In [None]:
# Visualize performance degradation
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('k-NN Performance Degradation in High Dimensions', fontsize=16)

dims = list(performance_results.keys())
accuracies = [performance_results[d]['accuracy'] for d in dims]
neighbor_dists = [performance_results[d]['avg_neighbor_distance'] for d in dims]

# 1. Accuracy vs Dimensions
axes[0, 0].plot(dims, accuracies, 'o-', linewidth=3, markersize=8, color='red')
axes[0, 0].set_title('Classification Accuracy vs Dimensions')
axes[0, 0].set_xlabel('Number of Dimensions')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_xscale('log')
axes[0, 0].set_ylim(0.5, 1.05)

# Add horizontal line for random performance
axes[0, 0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.7, 
                  label='Random Performance')
axes[0, 0].legend()

# 2. Average neighbor distance vs Dimensions
axes[0, 1].plot(dims, neighbor_dists, 's-', linewidth=3, markersize=8, color='blue')
axes[0, 1].set_title('Average Distance to Nearest Neighbors')
axes[0, 1].set_xlabel('Number of Dimensions')
axes[0, 1].set_ylabel('Average Distance')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_xscale('log')

# 3. Accuracy vs Neighbor Distance (correlation)
# Draw the scatter once and keep the handle for the colorbar
scatter = axes[1, 0].scatter(neighbor_dists, accuracies, s=100, alpha=0.7,
                           c=dims, cmap='viridis')
axes[1, 0].set_title('Accuracy vs Average Neighbor Distance')
axes[1, 0].set_xlabel('Average Neighbor Distance')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].grid(True, alpha=0.3)

# Add colorbar
cbar = plt.colorbar(scatter, ax=axes[1, 0])
cbar.set_label('Dimensions')

# 4. Performance degradation rate
# Calculate relative performance (normalized to best performance)
relative_performance = [acc / max(accuracies) for acc in accuracies]
axes[1, 1].plot(dims, relative_performance, '^-', linewidth=3, markersize=8, color='orange')
axes[1, 1].set_title('Relative Performance Degradation')
axes[1, 1].set_xlabel('Number of Dimensions')
axes[1, 1].set_ylabel('Relative Performance')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_xscale('log')
axes[1, 1].set_ylim(0.5, 1.05)

# Add threshold lines
axes[1, 1].axhline(y=0.95, color='orange', linestyle='--', alpha=0.7, 
                  label='5% degradation')
axes[1, 1].axhline(y=0.90, color='red', linestyle='--', alpha=0.7, 
                  label='10% degradation')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

# Calculate performance drop statistics
performance_drop = (max(accuracies) - min(accuracies)) / max(accuracies) * 100
print(f"\n📉 PERFORMANCE DEGRADATION SUMMARY:")
print(f"• Best accuracy: {max(accuracies):.4f} at {dims[accuracies.index(max(accuracies))]} dimensions")
print(f"• Worst accuracy: {min(accuracies):.4f} at {dims[accuracies.index(min(accuracies))]} dimensions")
print(f"• Relative gap between best and worst accuracy: {performance_drop:.1f}%")
print(f"• Average neighbor distance grows {neighbor_dists[-1]/neighbor_dists[0]:.1f}x from {dims[0]}D to {dims[-1]}D")

## Mitigation Strategies for High-Dimensional Data

While the curse of dimensionality is a fundamental challenge, there are several strategies to mitigate its effects:

### 1. Dimensionality Reduction
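- **Principal Component Analysis (PCA)**: Projects data onto the directions of greatest variance; it is unsupervised, so it can discard low-variance directions that separate the classes
- **Feature Selection**: Keeps the features most associated with the target (here, by ANOVA F-score)
- **Linear Discriminant Analysis (LDA)**: Supervised dimensionality reduction; yields at most (number of classes − 1) components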

### 2. Feature Engineering
- Remove irrelevant/noisy features
- Create meaningful feature combinations
- Domain-specific feature selection

### 3. Algorithm Modifications
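- Use different distance metrics (e.g. Manhattan or cosine instead of Euclidean)
- Weighted k-NN (weight each neighbor's vote by inverse distance)
- Local distance metrics (metrics adapted to the neighborhood of each query)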

Let's test these strategies!
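
Before running them on the banknote data, here is a minimal sketch of how the strategies listed above can be compared fairly with cross-validated pipelines, so that scaling, PCA, and feature selection are fitted on the training folds only. The data is a synthetic stand-in from `make_classification`, and the names `X_demo`, `y_demo`, `candidates`, and `cv_means` are assumptions for this illustration. Which strategy wins depends on the data; nothing here should be read as a general result.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 5 informative features hidden among 95 noise features
X_demo, y_demo = make_classification(n_samples=600, n_features=100, n_informative=5,
                                     n_redundant=0, random_state=0)

candidates = {
    "k-NN only": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "PCA(10) + k-NN": make_pipeline(
        StandardScaler(), PCA(n_components=10), KNeighborsClassifier(n_neighbors=5)),
    "SelectKBest(10) + k-NN": make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=10), KNeighborsClassifier(n_neighbors=5)),
    "distance-weighted k-NN": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5, weights="distance")),
}

# 5-fold cross-validation; every pipeline step is refitted inside each fold
cv_means = {}
for name, model in candidates.items():
    cv_means[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"{name:<24} mean CV accuracy = {cv_means[name]:.3f}")
```

Wrapping each preprocessing step in the pipeline keeps information from the validation fold out of the fitted scaler, PCA, or selector.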

In [None]:
# Strategy 1: Principal Component Analysis (PCA)
def test_pca_strategy(X_train_high_dim, X_test_high_dim, y_train, y_test, 
                     n_components_list, original_performance):
    """
    Test PCA dimensionality reduction strategy
    """
    pca_results = {}
    
    print(f"\nTesting PCA on {X_train_high_dim.shape[1]}D data:")
    print(f"Original performance: {original_performance:.4f}")
    print("-" * 40)
    
    for n_comp in n_components_list:
        if n_comp >= X_train_high_dim.shape[1]:
            continue
            
        # Apply PCA
        pca = PCA(n_components=n_comp)
        X_train_pca = pca.fit_transform(X_train_high_dim)
        X_test_pca = pca.transform(X_test_high_dim)
        
        # Train k-NN on reduced data
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(X_train_pca, y_train)
        y_pred = knn.predict(X_test_pca)
        accuracy = accuracy_score(y_test, y_pred)
        
        # Calculate explained variance
        explained_var = np.sum(pca.explained_variance_ratio_)
        
        pca_results[n_comp] = {
            'accuracy': accuracy,
            'explained_variance': explained_var,
            'improvement': accuracy - original_performance
        }
        
        # The :+ format already prints the sign, so no literal '+' is needed
        print(f"PCA to {n_comp:2d}D: Acc={accuracy:.4f} ({accuracy-original_performance:+.4f}), "
              f"Var Explained={explained_var:.3f}")
    
    return pca_results

# Test PCA on our highest dimensional dataset (128D)
high_dim = 128
original_perf = performance_results[high_dim]['accuracy']

pca_components = [2, 4, 8, 16, 32]
pca_results = test_pca_strategy(train_datasets[high_dim], test_datasets[high_dim],
                               y_train, y_test, pca_components, original_perf)

In [None]:
# Strategy 2: Feature Selection
def test_feature_selection_strategy(X_train_high_dim, X_test_high_dim, y_train, y_test,
                                   k_features_list, original_performance):
    """
    Test feature selection strategy using univariate statistical tests
    """
    fs_results = {}
    
    print(f"\nTesting Feature Selection on {X_train_high_dim.shape[1]}D data:")
    print(f"Original performance: {original_performance:.4f}")
    print("-" * 50)
    
    for k_features in k_features_list:
        if k_features >= X_train_high_dim.shape[1]:
            continue
            
        # Apply feature selection
        selector = SelectKBest(score_func=f_classif, k=k_features)
        X_train_selected = selector.fit_transform(X_train_high_dim, y_train)
        X_test_selected = selector.transform(X_test_high_dim)
        
        # Train k-NN on selected features
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(X_train_selected, y_train)
        y_pred = knn.predict(X_test_selected)
        accuracy = accuracy_score(y_test, y_pred)
        
        # Get the ANOVA F-scores (f_classif) and average them over the selected features
        feature_scores = selector.scores_
        selected_features = selector.get_support()
        avg_score = np.mean(feature_scores[selected_features])
        
        fs_results[k_features] = {
            'accuracy': accuracy,
            'avg_feature_score': avg_score,
            'improvement': accuracy - original_performance,
            'selected_features': selected_features
        }
        
        # The :+ format already prints the sign, so no literal '+' is needed
        print(f"Select {k_features:2d} features: Acc={accuracy:.4f} ({accuracy-original_performance:+.4f}), "
              f"Avg Score={avg_score:.2f}")
    
    return fs_results

# Test feature selection
feature_counts = [2, 4, 8, 16, 32]
fs_results = test_feature_selection_strategy(train_datasets[high_dim], test_datasets[high_dim],
                                           y_train, y_test, feature_counts, original_perf)

In [None]:
# Strategy 3: Weighted k-NN
def test_weighted_knn_strategy(X_train, X_test, y_train, y_test, k_values):
    """
    Test weighted k-NN (distance-weighted) vs uniform k-NN
    """
    print(f"\nTesting Weighted k-NN vs Uniform k-NN:")
    print("-" * 45)
    print(f"{'k':<3} {'Uniform':<10} {'Weighted':<10} {'Improvement':<12}")
    print("-" * 45)
    
    weighted_results = {}
    
    for k in k_values:
        # Uniform weights
        knn_uniform = KNeighborsClassifier(n_neighbors=k, weights='uniform')
        knn_uniform.fit(X_train, y_train)
        y_pred_uniform = knn_uniform.predict(X_test)
        acc_uniform = accuracy_score(y_test, y_pred_uniform)
        
        # Distance weights
        knn_weighted = KNeighborsClassifier(n_neighbors=k, weights='distance')
        knn_weighted.fit(X_train, y_train)
        y_pred_weighted = knn_weighted.predict(X_test)
        acc_weighted = accuracy_score(y_test, y_pred_weighted)
        
        improvement = acc_weighted - acc_uniform
        
        weighted_results[k] = {
            'uniform': acc_uniform,
            'weighted': acc_weighted,
            'improvement': improvement
        }
        
        print(f"{k:<3} {acc_uniform:<10.4f} {acc_weighted:<10.4f} {improvement:<+12.4f}")
    
    return weighted_results

# Test weighted k-NN on high-dimensional data
k_values_test = [3, 5, 7, 9, 11]
weighted_results = test_weighted_knn_strategy(train_datasets[high_dim], test_datasets[high_dim],
                                            y_train, y_test, k_values_test)

In [None]:
# Visualize mitigation strategies
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Mitigation Strategies for High-Dimensional k-NN', fontsize=16)

# 1. PCA Results
pca_dims = list(pca_results.keys())
pca_accuracies = [pca_results[d]['accuracy'] for d in pca_dims]
pca_variances = [pca_results[d]['explained_variance'] for d in pca_dims]

ax1 = axes[0, 0]
line1 = ax1.plot(pca_dims, pca_accuracies, 'o-', color='blue', linewidth=2, label='Accuracy')
ax1.axhline(y=original_perf, color='red', linestyle='--', alpha=0.7, label=f'Original ({original_perf:.3f})')
ax1.set_xlabel('PCA Components')
ax1.set_ylabel('Accuracy', color='blue')
ax1.set_title('PCA Dimensionality Reduction')
ax1.grid(True, alpha=0.3)

# Add variance explained on secondary y-axis
ax1_twin = ax1.twinx()
line2 = ax1_twin.plot(pca_dims, pca_variances, 's-', color='green', linewidth=2, label='Explained Variance')
ax1_twin.set_ylabel('Explained Variance', color='green')

# Combine legends
lines = line1 + line2 + [plt.Line2D([0], [0], color='red', linestyle='--', label=f'Original ({original_perf:.3f})')]
labels = [l.get_label() for l in lines]
ax1.legend(lines, labels, loc='lower right')

# 2. Feature Selection Results
fs_dims = list(fs_results.keys())
fs_accuracies = [fs_results[d]['accuracy'] for d in fs_dims]
fs_scores = [fs_results[d]['avg_feature_score'] for d in fs_dims]

ax2 = axes[0, 1]
line3 = ax2.plot(fs_dims, fs_accuracies, 'o-', color='purple', linewidth=2, label='Accuracy')
ax2.axhline(y=original_perf, color='red', linestyle='--', alpha=0.7, label=f'Original ({original_perf:.3f})')
ax2.set_xlabel('Selected Features')
ax2.set_ylabel('Accuracy', color='purple')
ax2.set_title('Feature Selection (SelectKBest)')
ax2.grid(True, alpha=0.3)
ax2.legend()

# 3. Weighted vs Uniform k-NN
k_vals = list(weighted_results.keys())
uniform_accs = [weighted_results[k]['uniform'] for k in k_vals]
weighted_accs = [weighted_results[k]['weighted'] for k in k_vals]

axes[1, 0].plot(k_vals, uniform_accs, 'o-', label='Uniform Weights', linewidth=2, color='orange')
axes[1, 0].plot(k_vals, weighted_accs, 's-', label='Distance Weights', linewidth=2, color='brown')
axes[1, 0].set_xlabel('k Value')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].set_title('Weighted vs Uniform k-NN')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 4. Strategy Comparison
strategies = ['Original\n(128D)', 'Best PCA', 'Best FS', 'Best Weighted']
best_pca_acc = max(pca_accuracies)
best_fs_acc = max(fs_accuracies)
best_weighted_acc = max(weighted_accs)  # distance-weighted runs only, to match the 'Best Weighted' label

strategy_accs = [original_perf, best_pca_acc, best_fs_acc, best_weighted_acc]
colors = ['red', 'blue', 'purple', 'brown']

bars = axes[1, 1].bar(strategies, strategy_accs, color=colors, alpha=0.7)
axes[1, 1].set_title('Strategy Comparison\n(Best Performance)')
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].set_ylim(0.5, 1.05)

# Add value labels
for bar, acc in zip(bars, strategy_accs):
    axes[1, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                   f'{acc:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Summary of best strategies
# Note: each "best" setting is picked using test-set accuracy, so these figures are optimistic
print("\n🏆 MITIGATION STRATEGY RESULTS:")
print(f"• Original 128D performance: {original_perf:.4f}")
print(f"• Best PCA change: {best_pca_acc - original_perf:+.4f} (PCA to {pca_dims[pca_accuracies.index(best_pca_acc)]}D)")
print(f"• Best Feature Selection change: {best_fs_acc - original_perf:+.4f} (Select {fs_dims[fs_accuracies.index(best_fs_acc)]} features)")
print(f"• Best Weighted k-NN change: {best_weighted_acc - original_perf:+.4f}")
print(f"• Best overall strategy: {strategies[strategy_accs.index(max(strategy_accs))]} ({max(strategy_accs):.4f})")

## Practical Recommendations for High-Dimensional k-NN

Based on our analysis, here are practical guidelines for using k-NN with high-dimensional data:

In [None]:
# Create a comprehensive recommendation system
def analyze_dataset_and_recommend(X, y, max_dims_threshold=20):
    """
    Analyze dataset characteristics and provide k-NN recommendations
    """
    n_samples, n_features = X.shape
    n_classes = len(np.unique(y))
    
    print("DATASET ANALYSIS & k-NN RECOMMENDATIONS")
    print("=" * 50)
    print(f"Dataset characteristics:")
    print(f"  • Samples: {n_samples:,}")
    print(f"  • Features: {n_features}")
    print(f"  • Classes: {n_classes}")
    print(f"  • Samples per feature: {n_samples/n_features:.1f}")
    
    # Dimensionality assessment
    if n_features <= 5:
        dim_category = "Low"
        dim_risk = "✅ Minimal"
    elif n_features <= 20:
        dim_category = "Medium"
        dim_risk = "⚠️  Moderate"
    else:
        dim_category = "High"
        dim_risk = "🚨 High"
    
    print(f"\nDimensionality Assessment:")
    print(f"  • Category: {dim_category} dimensional")
    print(f"  • Curse of dimensionality risk: {dim_risk}")
    
    # Sample density assessment
    sample_density = n_samples / (n_features ** 2)  # Rough heuristic
    if sample_density > 10:
        density_status = "✅ Good"
    elif sample_density > 1:
        density_status = "⚠️  Moderate"
    else:
        density_status = "🚨 Poor"
    
    print(f"  • Sample density: {density_status} ({sample_density:.2f})")
    
    # Recommendations
    print(f"\n📋 RECOMMENDATIONS:")
    
    if n_features <= max_dims_threshold:
        print("  ✅ k-NN is suitable for this dataset")
        print("  • Use standard k-NN with proper feature scaling")
        print("  • Try k values between 3-11")
        print("  • Consider cross-validation for optimal k")
    else:
        print("  ⚠️  High-dimensional data detected - apply mitigation:")
        print("  1. 🎯 Dimensionality Reduction:")
        print(f"     • Try PCA to ~{min(10, n_features//4)} components")
        print(f"     • Consider feature selection (top {min(15, n_features//3)} features)")
        print("  2. 🔧 Algorithm Modifications:")
        print("     • Use distance-weighted k-NN")
        print("     • Try larger k values (5-15)")
        print("     • Consider alternative distance metrics")
        print("  3. 🔄 Alternative Approaches:")
        print("     • Consider ensemble methods (Random Forest)")
        print("     • Try SVM with RBF kernel")
        print("     • Consider neural networks for complex patterns")
    
    # Feature scaling reminder
    print("\n  🔧 Always Remember:")
    print("     • Feature scaling is CRITICAL for k-NN")
    print("     • Use StandardScaler or MinMaxScaler")
    print("     • Handle missing values appropriately")
    print("     • Consider feature engineering for domain-specific improvements")
    
    return {
        'dimensionality_category': dim_category,
        'recommended_suitable': n_features <= max_dims_threshold,
        'sample_density': sample_density
    }

# Analyze our banknote dataset
analysis = analyze_dataset_and_recommend(X_train_scaled_orig, y_train)

# Analyze high-dimensional version
print("\n" + "="*60)
analysis_high_dim = analyze_dataset_and_recommend(train_datasets[128], y_train)

## Summary: k-NN and High-Dimensional Data

### Key Findings from Our Analysis:

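1. **Distance Concentration Is Measurable**
   - The coefficient of variation of pairwise distances shrinks as dimensions are added
   - Nearest neighbors become less "meaningful" because they are barely closer than other points
   - How much accuracy suffers depends on how strong the added noise is relative to the informative features

2. **Performance Degradation Patterns**
   - Accuracy tends to decline as uninformative dimensions are added
   - Average distance to the nearest neighbors increases
   - Informative features contribute a shrinking share of each distance, so discriminative power is lost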

3. **Effective Mitigation Strategies**
   - **PCA**: Reduces dimensions while preserving variance
   - **Feature Selection**: Keeps only most informative features
   - **Distance Weighting**: Gives more weight to closer neighbors

4. **Practical Guidelines** (rules of thumb, not hard limits)
   - ✅ k-NN usually works well: ≤20 dimensions with good sample density
   - ⚠️ Apply mitigation: 20-100 dimensions
   - 🚨 Consider alternatives: >100 dimensions

### Best Practices for High-Dimensional k-NN:

1. **Always scale features** (StandardScaler/MinMaxScaler)
2. **Apply dimensionality reduction** when the feature count is large (roughly d > 20)
3. **Try distance-weighted k-NN** and keep it only if validation accuracy improves
4. **Use cross-validation** to choose k and the number of retained dimensions, rather than tuning on the test set
5. **Consider ensemble methods** as alternatives
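
The practices above can be combined in one tuned pipeline. The following is a sketch under stated assumptions: it uses synthetic data from `make_classification` rather than the banknote data, and the grid values and the names `X_syn`, `pipe`, `search`, and `test_accuracy` are illustrative choices, not recommendations from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 6 informative features among 60
X_syn, y_syn = make_classification(n_samples=500, n_features=60, n_informative=6,
                                   n_redundant=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25,
                                          stratify=y_syn, random_state=1)

pipe = Pipeline([
    ("scale", StandardScaler()),        # practice 1: scale features
    ("select", SelectKBest(f_classif)), # practice 2: reduce dimensionality
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "select__k": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 9],
    "knn__weights": ["uniform", "distance"],  # practice 3: try distance weighting
}

# Practice 4: choose hyperparameters by cross-validation on the training split only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

# The held-out test split is used once, after tuning
test_accuracy = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

Keeping the test split out of the search gives an accuracy estimate that is not biased by the tuning itself.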

### When to Avoid k-NN:
- Very high dimensions (>100) without effective reduction
- Sparse, high-dimensional data (text, genomics)
- When prediction latency is critical (each query is compared against the stored training set)
- When interpretability of individual features is needed

### Alternative Algorithms for High-Dimensional Data:
- **Random Forest**: Handles high dimensions well
- **SVM with RBF kernel**: Good for complex boundaries
- **Neural Networks**: Can learn complex patterns
- **Naive Bayes**: Surprisingly effective for high-dimensional text data
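
To check whether one of these alternatives actually beats k-NN on a given dataset, a like-for-like cross-validated comparison is a reasonable first step. The sketch below is hedged accordingly: the data is synthetic (dense features from `make_classification`, not text), the model settings are mostly defaults, and `X_alt`, `models`, and `alt_scores` are names assumed for this example. The ranking it prints is specific to this synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 8 informative features among 200
X_alt, y_alt = make_classification(n_samples=400, n_features=200, n_informative=8,
                                   n_redundant=0, random_state=2)

models = {
    "k-NN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=2),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Same 5-fold protocol for every model
alt_scores = {name: cross_val_score(model, X_alt, y_alt, cv=5).mean()
              for name, model in models.items()}

for name, score in sorted(alt_scores.items(), key=lambda item: -item[1]):
    print(f"{name:<22} mean CV accuracy = {score:.3f}")
```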

The curse of dimensionality is a fundamental challenge, but with proper understanding and mitigation strategies, k-NN can still be effective in many high-dimensional scenarios!