# KNN Model v1 - Baseline Route Recommendation System

This notebook builds the first version of our KNN-based route recommendation model and evaluates it with proxy metrics to establish a baseline.

## 1. Load Data and Setup

In [10]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("UK_Engineered_Data.csv")
pd.set_option('display.max_columns', None)
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (7717, 49)


,Unnamed: 0.1,Unnamed: 0,id,name,distance_m,duration_s,ascent_m,descent_m,steps,turns,Asphalt,Unknown,Paved,Compacted Gravel,Wood,Gravel,Paving Stones,Ground,Concrete,Grass,Metal,Unpaved,Dirt,Grass Paver,Sand,Road,Cycleway,State Road,Track,Street,Path,Footway,Unknown.1,Steps,Construction,Ferry,uphill_very_steep (7% to 10%),uphill_moderate (3% to 5%),uphill_gentle (0% to 3%),flat (0%),downhill_gentle (-5% to 0%),uphill_steep (5% to 7%),uphill_extreme (>10%),downhill_extreme (<-15%),downhill_moderate (-7% to -5%),downhill_steep (-10% to -7%),downhill_very_steep (-15% to -10%),Average_Speed,Turn_Density
0,0,50,1677043,Yorkshire Wolds Cycle Route,67.3,13.5,0.0,0.0,2,0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.985185,0.0
1,1,51,2924632,Ride North Wales,1617.8,323.6,88.9,1.9,2,0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.14,13.79,62.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.999382,0.0
2,2,52,8603353,Living Landscapes (Short Route),149.6,29.9,2.0,0.0,2,0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.003344,0.0
3,3,53,1124203,Sperrins Route 5 - Lough Fea Cycle Route,88.7,17.7,3.0,0.0,2,0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.011299,0.0
4,4,54,913519,Unnamed route,910.0,203.6,0.6,10.6,2,0,18.18,81.82,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,81.82,18.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,4.469548,0.0


## 2. Feature Preparation

In [11]:
# Create feature matrix by dropping the leftover index columns, the route ID, and the name
X = df.drop(['Unnamed: 0.1', 'Unnamed: 0', 'id', 'name'], axis=1)
print(f"Feature matrix shape: {X.shape}")
print(f"Number of features: {len(X.columns)}")

Feature matrix shape: (7717, 45)
Number of features: 45
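
Columns named `Unnamed: ...` typically appear when a dataframe is saved with its index and read back. The following is a minimal sketch of dropping them by prefix instead of listing each one by name. The dataframe `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: two leftover index columns,
# two identifier columns, and two numeric features.
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [50, 51],
    "id": [1677043, 2924632],
    "name": ["Route A", "Route B"],
    "distance_m": [67.3, 1617.8],
    "ascent_m": [0.0, 88.9],
})

# Flag every column whose name starts with "Unnamed" (index residue).
is_residue = toy.columns.str.startswith("Unnamed")

# Keep the non-residue columns, then drop the identifiers explicitly.
X_toy = toy.loc[:, ~is_residue].drop(columns=["id", "name"])

print(list(X_toy.columns))
```

Matching by prefix keeps working if the CSV is re-saved and gains another `Unnamed` column, whereas a hard-coded list would silently let it through as a feature.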


## 3. Feature Scaling

In [12]:
# Standardize the unbounded numeric features (zero mean, unit variance) and
# min-max scale the percentage features (surface, way type, gradient) to [0, 1]
scaler = ColumnTransformer(transformers=[
    ('standard', StandardScaler(), ['distance_m', 'duration_s', 'ascent_m', 'descent_m', 'Turn_Density', 'Average_Speed', 'steps', 'turns']),
    ('minmax', MinMaxScaler(), ['Asphalt', 'Unknown', 'Paved', 'Compacted Gravel', 'Wood', 'Gravel', 'Paving Stones', 'Ground', 'Concrete', 'Grass', 'Metal', 'Unpaved', 'Dirt', 'Grass Paver', 'Sand', 'Road', 'Cycleway', 'State Road', 'Track', 'Street', 'Path', 'Footway', 'Unknown.1', 'Steps', 'Construction', 'Ferry', 'uphill_very_steep (7% to 10%)', 'uphill_moderate (3% to 5%)', 'uphill_gentle (0% to 3%)', 'flat (0%)', 'downhill_gentle (-5% to 0%)', 'uphill_steep (5% to 7%)', 'uphill_extreme (>10%)', 'downhill_extreme (<-15%)', 'downhill_moderate (-7% to -5%)', 'downhill_steep (-10% to -7%)', 'downhill_very_steep (-15% to -10%)']),
], remainder='passthrough')

X_scaled = scaler.fit_transform(X)
print(f"Scaled feature matrix shape: {X_scaled.shape}")

Scaled feature matrix shape: (7717, 45)


## 4. Train/Test Split

We'll split the data and also track indices so we can map back to the original dataframe later.

In [13]:
# Create index array to track which rows go to train vs test
indices_array = np.arange(len(X_scaled))

# Split data: 80% training, 20% testing
X_train, X_test, train_indices, test_indices = train_test_split(
    X_scaled, indices_array, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Training set size: 6173
Test set size: 1544
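
The notebook fits the scaler on all rows before splitting, so the test routes influence the scaling statistics. One variation worth considering is to split first and fit the scaler on the training rows only. The sketch below illustrates the idea on synthetic data; `X_raw`, `row_ids`, and `split_scaler` are hypothetical names, and whether the difference matters for this dataset has not been checked here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled feature matrix.
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=1000.0, scale=300.0, size=(100, 2))

# Split row positions first, so they can still be mapped back to the dataframe.
row_ids = np.arange(len(X_raw))
train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=42)

# Fit the scaler on training rows only, then apply it to both subsets.
split_scaler = StandardScaler().fit(X_raw[train_ids])
X_train_s = split_scaler.transform(X_raw[train_ids])
X_test_s = split_scaler.transform(X_raw[test_ids])

print(X_train_s.shape, X_test_s.shape)
```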


## 5. Build and Train KNN Model

**Model Parameters:**
- n_neighbors = 5 (find 5 most similar routes)
- metric = 'euclidean' (straight-line distance in feature space)

In [14]:
# Fit the model: NearestNeighbors stores the training routes for neighbor lookup (no labels involved)
knn = NearestNeighbors(n_neighbors=5, metric='euclidean')
knn.fit(X_train)
print("Model trained successfully!")

Model trained successfully!


## 6. Get Predictions on Test Set

In [15]:
# Find nearest neighbors for all test routes
distances, indices = knn.kneighbors(X_test)
print(f"Predictions shape: {distances.shape}")
print(f"Each test route has {distances.shape[1]} recommendations")

Predictions shape: (1544, 5)
Each test route has 5 recommendations
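
To make the two returned arrays concrete, here is a small sketch on hand-picked 2D points (all values are made up). `kneighbors` returns one row per query, with neighbors ordered from nearest to farthest, and the index array refers to row positions in the matrix passed to `fit`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four "training" points and one query point, chosen so distances are easy to verify.
train_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
query_pts = np.array([[0.1, 0.0]])

toy_knn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(train_pts)
toy_dist, toy_idx = toy_knn.kneighbors(query_pts)

# The query is 0.1 away from row 0 and 0.9 away from row 1.
print(toy_dist)
print(toy_idx)
```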


## 7. Examine Results

In [16]:
# View neighbor distances (a dissimilarity measure: lower means more similar)
print("Similarity distances for first 5 test routes:")
print(distances[:5])
print("\nLower distance = more similar routes")

Similarity distances for first 5 test routes:
[[0.07528638 0.13264428 0.1350721  0.14729481 0.149056  ]
 [0.46525797 0.50276023 0.55436133 0.57498136 0.58159318]
 [0.00815576 0.01154558 0.01475302 0.0153974  0.01645797]
 [0.38997358 0.47557017 0.47901315 0.50078875 0.51229669]
 [0.00626152 0.00824597 0.00990241 0.01476553 0.01559997]]

Lower distance = more similar routes


In [17]:
# View indices (which routes were recommended)
print("Recommended route indices for first 5 test routes:")
print(indices[:5])

Recommended route indices for first 5 test routes:
[[ 563  930 4116 4075 1662]
 [5260  758 3476 3767 5751]
 [ 543 3980 2194 1613 5588]
 [5880 2007 3429 3014 1310]
 [1080 2622   19 4036   27]]


## 8. Example: View Recommended Routes

Let's look at the actual route details for the first test route's recommendations:

In [18]:
# Get the 5 recommended routes for the first test route
test_route_idx = 0
recommended_indices = indices[test_route_idx]

# Map back to original dataframe
rec_df_indices = [train_indices[idx] for idx in recommended_indices]
neighbors = df.iloc[rec_df_indices]

print("Recommended routes:")
neighbors[['name', 'distance_m', 'ascent_m', 'duration_s']]

Recommended routes:


,name,distance_m,ascent_m,duration_s
2095,Unnamed route,2191.3,17.9,438.3
1973,Unnamed route,639.4,7.8,127.9
4167,Unnamed route,2055.2,15.4,411.0
2169,Unnamed route,3019.0,11.7,603.8
2093,Unnamed route,1797.5,6.0,359.5
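
The two-step lookup is needed because `kneighbors` returns positions within `X_train`, not row positions in `df`. The sketch below shows the mapping with made-up index values; `toy_train_indices` and `toy_neighbor_positions` are hypothetical. NumPy fancy indexing does the same job as the list comprehension used above.

```python
import numpy as np

# Hypothetical split: the training matrix holds these dataframe rows, in this order.
toy_train_indices = np.array([7, 2, 9, 4, 0, 5, 1, 8])

# Suppose kneighbors returned these positions within the training matrix.
toy_neighbor_positions = np.array([3, 0, 6])

# Position 3 is dataframe row 4, position 0 is row 7, position 6 is row 1.
toy_df_rows = toy_train_indices[toy_neighbor_positions]
print(toy_df_rows)
```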


---
# PHASE 1: BASELINE EVALUATION

Now let's evaluate how well our baseline model performs using multiple metrics.

## 9. Define Evaluation Metrics

We have no user ratings, so we measure model performance with proxy metrics:
- **avg_distance**: Mean distance from each test route to its recommendations in scaled feature space (lower = more similar)
- **diversity**: Mean pairwise distance among the 5 recommendations for a test route (higher = more variety)
- **coverage**: Percentage of training routes that appear in at least one recommendation list (higher = less concentration on a few routes)
- **distance_accuracy**: Closeness of the recommended routes' distance to the input route's distance, computed as 1 / (1 + mean relative error), so 1.0 means identical
- **ascent_accuracy**: The same measure applied to ascent
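
The two accuracy metrics are built from a relative error whose denominator has 1 added to avoid division by zero, and the result is mapped into (0, 1] with 1 / (1 + error). Below is a worked sketch for a single route, using the distances reported for the first test route in section 11. Note that the evaluation function averages the error over all test routes before converting it, so this per-route figure is only illustrative.

```python
import numpy as np

# Distances (in metres) of the first test route and its 5 recommendations.
input_distance = 2286.3
rec_distances = np.array([2191.3, 639.4, 2055.2, 3019.0, 1797.5])

# Mean relative error, with +1 in the denominator to guard against zero-length inputs.
relative_error = np.mean(np.abs(rec_distances - input_distance) / (input_distance + 1))

# Map the error to a score in (0, 1]; zero error gives exactly 1.0.
accuracy = 1 / (1 + relative_error)

print(f"relative error: {relative_error:.3f}, accuracy: {accuracy:.3f}")
```

For this route the relative error comes out at about 0.28 and the score at about 0.78, with most of the error contributed by the 639.4 m recommendation.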

In [19]:
def evaluate_knn_model(knn_model, X_test, X_train, df, train_indices, test_indices):
    """
    Evaluate a KNN model using multiple metrics

    Parameters:
    - knn_model: Trained KNN model
    - X_test: Test features (scaled)
    - X_train: Training features (scaled)
    - df: Original dataframe
    - train_indices: Indices of training samples in df
    - test_indices: Indices of test samples in df

    Returns:
    - Dictionary with evaluation metrics
    """
    # Get predictions
    distances, indices = knn_model.kneighbors(X_test)

    # Metric 1: Average similarity distance (lower = better)
    avg_distance = distances.mean()

    # Metric 2: Diversity (avg pairwise distance between recommendations)
    diversity_scores = []
    for idx_set in indices:
        recommended_features = X_train[idx_set]
        pairwise_dists = []
        for i in range(len(idx_set)):
            for j in range(i+1, len(idx_set)):
                dist = np.linalg.norm(recommended_features[i] - recommended_features[j])
                pairwise_dists.append(dist)
        diversity_scores.append(np.mean(pairwise_dists) if pairwise_dists else 0)
    avg_diversity = np.mean(diversity_scores)

    # Metric 3: Coverage (% of training routes that appear in recommendations)
    unique_recommendations = len(np.unique(indices.flatten()))
    coverage = unique_recommendations / len(X_train) * 100

    # Metric 4 & 5: Feature-specific accuracy
    distance_errors = []
    ascent_errors = []

    for i, idx_set in enumerate(indices):
        # Get test route features from original df
        test_df_idx = test_indices[i]
        input_distance = df.iloc[test_df_idx]['distance_m']
        input_ascent = df.iloc[test_df_idx]['ascent_m']

        # Get recommended route features
        rec_df_indices = [train_indices[idx] for idx in idx_set]
        rec_distances = df.iloc[rec_df_indices]['distance_m'].values
        rec_ascents = df.iloc[rec_df_indices]['ascent_m'].values

        # Calculate relative errors
        dist_error = np.mean(np.abs(rec_distances - input_distance) / (input_distance + 1))
        ascent_error = np.mean(np.abs(rec_ascents - input_ascent) / (input_ascent + 1))

        distance_errors.append(dist_error)
        ascent_errors.append(ascent_error)

    avg_distance_error = np.mean(distance_errors)
    avg_ascent_error = np.mean(ascent_errors)

    # Convert mean relative errors to scores in (0, 1]; 1.0 means zero error
    return {
        'avg_distance': avg_distance,
        'diversity': avg_diversity,
        'coverage': coverage,
        'distance_accuracy': 1 / (1 + avg_distance_error),
        'ascent_accuracy': 1 / (1 + avg_ascent_error)
    }

print("Evaluation function defined!")

Evaluation function defined!
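
The nested loops in the diversity metric compute every unordered pairwise distance within a recommendation set. As a possible alternative, `scipy.spatial.distance.pdist` returns the same set of distances in one call. The sketch below checks the two approaches against each other on synthetic data; `mean_pairwise_distance` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from scipy.spatial.distance import pdist

def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of rows (0.0 if fewer than 2 rows)."""
    if len(points) < 2:
        return 0.0
    return float(pdist(points, metric="euclidean").mean())

# Synthetic stand-in for the scaled features of 5 recommended routes.
rng = np.random.default_rng(0)
recs = rng.normal(size=(5, 45))

# Reference value computed the same way as the nested loops in the notebook.
loop_dists = [
    np.linalg.norm(recs[i] - recs[j])
    for i in range(len(recs))
    for j in range(i + 1, len(recs))
]

print(np.isclose(mean_pairwise_distance(recs), np.mean(loop_dists)))
```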


## 10. Run Baseline Evaluation

Let's evaluate our current model (k=5, euclidean) to establish baseline performance.

In [20]:
print("=" * 80)
print("BASELINE MODEL EVALUATION")
print("=" * 80)
print(f"Model: KNN with k={knn.n_neighbors}, metric='{knn.metric}'")
print()

# Run evaluation
baseline_metrics = evaluate_knn_model(knn, X_test, X_train, df, train_indices, test_indices)

print("Baseline Performance Metrics:")
print("-" * 80)
for metric, value in baseline_metrics.items():
    print(f"{metric:<25}: {value:.4f}")

print("\n" + "=" * 80)

BASELINE MODEL EVALUATION
Model: KNN with k=5, metric='euclidean'

Baseline Performance Metrics:
--------------------------------------------------------------------------------
avg_distance             : 0.4781
diversity                : 0.5415
coverage                 : 65.7055
distance_accuracy        : 0.5021
ascent_accuracy          : 0.5163
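
To help read the coverage figure, here is a tiny sketch of the same calculation on made-up neighbor indices. `toy_indices` and `n_train` are hypothetical values chosen so the result can be checked by hand.

```python
import numpy as np

# Neighbor positions for 3 query routes with k=2, over a training set of 5 routes.
toy_indices = np.array([[0, 1], [1, 2], [0, 2]])
n_train = 5

# Routes 0, 1, and 2 are recommended at least once; routes 3 and 4 never are.
unique_recommended = np.unique(toy_indices)
toy_coverage = len(unique_recommended) / n_train * 100

print(f"coverage: {toy_coverage:.1f}%")
```

Coverage counts a training route once no matter how often it is recommended, so it says nothing about how evenly recommendations are spread across the routes that do appear.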



## 11. Detailed Example: Inspect Recommendation Quality

Let's manually inspect a few recommendations to see if they make sense.

In [21]:
# Look at 3 examples from the test set
num_examples = 3

for example_idx in range(num_examples):
    print("\n" + "=" * 80)
    print(f"EXAMPLE {example_idx + 1}")
    print("=" * 80)

    # Get input route
    test_df_idx = test_indices[example_idx]
    input_route = df.iloc[test_df_idx]

    print(f"\nINPUT ROUTE:")
    print(f"  Name: {input_route['name']}")
    print(f"  Distance: {input_route['distance_m']:.1f}m")
    print(f"  Ascent: {input_route['ascent_m']:.1f}m")
    print(f"  Duration: {input_route['duration_s']:.1f}s")
    print(f"  Avg Speed: {input_route['Average_Speed']:.2f}")
    print(f"  Turn Density: {input_route['Turn_Density']:.2f}")

    # Get recommendations
    rec_indices = indices[example_idx]
    rec_distances = distances[example_idx]

    print(f"\nRECOMMENDED ROUTES (Top 5 similar):")
    print("-" * 80)

    for i, (rec_idx, sim_dist) in enumerate(zip(rec_indices, rec_distances), 1):
        rec_df_idx = train_indices[rec_idx]
        rec_route = df.iloc[rec_df_idx]

        # Calculate differences
        dist_diff = rec_route['distance_m'] - input_route['distance_m']
        ascent_diff = rec_route['ascent_m'] - input_route['ascent_m']

        print(f"\n  {i}. {rec_route['name'][:60]}")
        print(f"     Similarity score: {sim_dist:.4f}")
        print(f"     Distance: {rec_route['distance_m']:.1f}m ({dist_diff:+.1f}m)")
        print(f"     Ascent: {rec_route['ascent_m']:.1f}m ({ascent_diff:+.1f}m)")
        print(f"     Duration: {rec_route['duration_s']:.1f}s")
        print(f"     Avg Speed: {rec_route['Average_Speed']:.2f}")

print("\n" + "=" * 80)


EXAMPLE 1

INPUT ROUTE:
  Name: Unnamed route
  Distance: 2286.3m
  Ascent: 17.5m
  Duration: 457.2s
  Avg Speed: 5.00
  Turn Density: 0.00

RECOMMENDED ROUTES (Top 5 similar):
--------------------------------------------------------------------------------

  1. Unnamed route
     Similarity score: 0.0753
     Distance: 2191.3m (-95.0m)
     Ascent: 17.9m (+0.4m)
     Duration: 438.3s
     Avg Speed: 5.00

  2. Unnamed route
     Similarity score: 0.1326
     Distance: 639.4m (-1646.9m)
     Ascent: 7.8m (-9.7m)
     Duration: 127.9s
     Avg Speed: 5.00

  3. Unnamed route
     Similarity score: 0.1351
     Distance: 2055.2m (-231.1m)
     Ascent: 15.4m (-2.1m)
     Duration: 411.0s
     Avg Speed: 5.00

  4. Unnamed route
     Similarity score: 0.1473
     Distance: 3019.0m (+732.7m)
     Ascent: 11.7m (-5.8m)
     Duration: 603.8s
     Avg Speed: 5.00

  5. Unnamed route
     Similarity score: 0.1491
     Distance: 1797.5m (-488.8m)
     Ascent: 6.0m (-11.5m)
     Duration: 359.5s

---
## Summary: Phase 1 Complete

**What we've done:**
- ✅ Built baseline KNN model (k=5, euclidean)
- ✅ Defined evaluation metrics
- ✅ Evaluated baseline performance
- ✅ Inspected example recommendations

**Next steps (Phase 2):**
- Grid search to test different k values and distance metrics
- Compare all model configurations
- Select optimal model parameters
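
As a rough sketch of what the Phase 2 search loop could look like, the block below tries a few values of k and a few distance metrics on synthetic data and records two of the metrics defined above. The candidate values, the variable names, and the data are all assumptions for illustration, not the planned configuration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-ins for the scaled train and test matrices.
rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(200, 6))
X_test_demo = rng.normal(size=(50, 6))

results = []
for k in (3, 5, 10):
    for metric in ("euclidean", "manhattan", "cosine"):
        model = NearestNeighbors(n_neighbors=k, metric=metric).fit(X_train_demo)
        dist, idx = model.kneighbors(X_test_demo)
        results.append({
            "k": k,
            "metric": metric,
            "avg_distance": float(dist.mean()),
            "coverage": len(np.unique(idx)) / len(X_train_demo) * 100,
        })

for row in results:
    print(row)
```

One caveat for the comparison: `avg_distance` values are on different scales under different metrics, so they are only directly comparable between configurations that share a metric. Metrics computed in the original units, such as `distance_accuracy`, do not have this problem.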