# Café Location Suitability Model Training & Evaluation

This notebook demonstrates the complete machine learning pipeline for predicting café location suitability in Kathmandu:

1. **Data Loading**: Load the preprocessed training dataset
2. **Train/Test Split**: Split data into 80% training and 20% testing sets
3. **Model Training**: Train a Random Forest classifier on the training data
4. **Model Evaluation**: Test the model and calculate performance metrics

**Dataset**: Preprocessed training data with balanced classes and selected features
**Target**: Location suitability (High/Medium/Low)
**Algorithm**: Random Forest Classifier

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_squared_error,
    mean_absolute_error
)
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Data Loading

Load the preprocessed training dataset and examine its structure.

In [2]:
# Load the preprocessed training dataset
data_path = 'cafelocate/data/preprocessed_training_dataset.csv'
df = pd.read_csv(data_path)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print("\nColumns:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

print("\nSuitability class distribution:")
print(df['suitability'].value_counts())

print("\nData types:")
print(df.dtypes)

print(f"\nMissing values: {df.isnull().sum().sum()}")

# Display first few rows
print("\nFirst 5 rows:")
df.head()

Dataset loaded successfully!
Shape: 2676 rows × 14 columns

Columns:
 1. competitors_min_distance
 2. roads_within_500m
 3. roads_avg_distance
 4. schools_within_500m
 5. schools_within_200m
 6. schools_min_distance
 7. hospitals_within_500m
 8. hospitals_min_distance
 9. population_density_proxy
10. accessibility_score
11. foot_traffic_score
12. suitability
13. latitude
14. longitude

Suitability class distribution:
suitability
Low       892
High      892
Medium    892
Name: count, dtype: int64

Data types:
competitors_min_distance    float64
roads_within_500m           float64
roads_avg_distance          float64
schools_within_500m         float64
schools_within_200m         float64
schools_min_distance        float64
hospitals_within_500m       float64
hospitals_min_distance      float64
population_density_proxy    float64
accessibility_score         float64
foot_traffic_score          float64
suitability                     str
latitude                    float64
longitude                   float64

| | competitors_min_distance | roads_within_500m | roads_avg_distance | schools_within_500m | schools_within_200m | schools_min_distance | hospitals_within_500m | hospitals_min_distance | population_density_proxy | accessibility_score | foot_traffic_score | suitability | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.729106 | -0.678142 | 0.904368 | -0.66765 | -0.465293 | 0.966992 | -0.498428 | 0.520655 | -0.683634 | -0.697804 | -0.696368 | Low | 27.66403 | 85.353726 |
| 1 | 0.729106 | -0.678142 | 0.904368 | -0.66765 | -0.465293 | 0.966992 | -0.498428 | 0.520655 | -0.683634 | -0.697804 | -0.696368 | Low | 27.748368 | 85.278061 |
| 2 | 0.729106 | -0.678142 | 0.904368 | -0.66765 | -0.465293 | 0.966992 | -0.498428 | 0.520655 | -0.683634 | -0.697804 | -0.696368 | Low | 27.732894 | 85.277918 |
| 3 | 0.729106 | -0.678142 | 0.904368 | -0.66765 | -0.465293 | 0.966992 | -0.498428 | 0.520655 | -0.683634 | -0.697804 | -0.696368 | Low | 27.724335 | 85.275551 |
| 4 | 0.729106 | -0.49357 | -3.239842 | 0.407154 | 1.266451 | -1.711839 | -0.498428 | 0.520655 | 0.134359 | 0.691612 | 0.560886 | High | 27.730741 | 85.295011 |


## 2. Train/Test Split (80/20)

Split the dataset into training (80%) and testing (20%) sets using stratified sampling to maintain class balance.

In [4]:
# Define features and target
feature_cols = [
    'competitors_min_distance', 'roads_within_500m', 'roads_avg_distance',
    'schools_within_500m', 'schools_within_200m', 'schools_min_distance',
    'hospitals_within_500m', 'hospitals_min_distance',
    'population_density_proxy', 'accessibility_score', 'foot_traffic_score'
]

X = df[feature_cols]
y = df['suitability']

print("Features selected:")
for i, col in enumerate(feature_cols, 1):
    print(f"{i:2d}. {col}")

print("\nTarget variable: suitability")
print(f"Classes: {sorted(y.unique())}")

# Split the data (80% train, 20% test) with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,
    stratify=y  # Maintain class balance
)

print("\nData split completed:")
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(df)*100:.1f}%)")

print("\nTraining set class distribution:")
print(y_train.value_counts())

print("\nTest set class distribution:")
print(y_test.value_counts())

Features selected:
 1. competitors_min_distance
 2. roads_within_500m
 3. roads_avg_distance
 4. schools_within_500m
 5. schools_within_200m
 6. schools_min_distance
 7. hospitals_within_500m
 8. hospitals_min_distance
 9. population_density_proxy
10. accessibility_score
11. foot_traffic_score

Target variable: suitability
Classes: ['High', 'Low', 'Medium']

Data split completed:
Training set: 2140 samples (80.0%)
Test set: 536 samples (20.0%)

Training set class distribution:
suitability
High      714
Low       713
Medium    713
Name: count, dtype: int64

Test set class distribution:
suitability
Low       179
Medium    179
High      178
Name: count, dtype: int64
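
The data preview in section 1 shows several rows with identical feature values that differ only in latitude and longitude, and those coordinates are not part of `feature_cols`. If such repeated feature vectors land on both sides of the split, the test score may overstate how well the model generalizes. The sketch below is one possible way to count that overlap. `count_shared_rows` is a hypothetical helper, not part of the notebook, and the toy rows are illustrative only.

```python
def count_shared_rows(train_rows, test_rows, decimals=6):
    """Count test rows whose rounded feature vector also occurs in the training rows.

    Hypothetical helper for illustration; rows are any iterables of numbers.
    """
    def key(row):
        # Round so that tiny floating-point differences do not hide a duplicate
        return tuple(round(float(value), decimals) for value in row)

    train_keys = {key(row) for row in train_rows}
    return sum(1 for row in test_rows if key(row) in train_keys)


# Toy rows (three features each), illustrative values only
train_rows = [(0.729106, -0.678142, 0.904368), (0.729106, -0.493570, -3.239842)]
test_rows = [(0.729106, -0.678142, 0.904368), (0.100000, 0.200000, 0.300000)]

shared = count_shared_rows(train_rows, test_rows)
print(f"{shared} of {len(test_rows)} test rows also appear in the training set")
```

In the notebook, the same check could be run as `count_shared_rows(X_train.itertuples(index=False, name=None), X_test.itertuples(index=False, name=None))`. A large count would suggest splitting on unique feature vectors, or deduplicating first, before trusting the test metrics.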


## 3. Model Training

Train a Random Forest classifier on the training dataset with fixed hyperparameters (no hyperparameter search is performed in this notebook).

In [5]:
# Encode target labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print("Label encoding:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {i}: {label}")

# Initialize and train Random Forest model
rf_model = RandomForestClassifier(
    n_estimators=200,      # Number of trees
    max_depth=15,          # Maximum depth of trees
    min_samples_split=5,   # Minimum samples to split
    min_samples_leaf=2,    # Minimum samples per leaf
    random_state=42,       # For reproducibility
    class_weight='balanced', # Handle any remaining imbalance
    n_jobs=-1              # Use all available cores
)

print("\nTraining Random Forest model...")
rf_model.fit(X_train, y_train_encoded)

print("Model training completed!")
print(f"Number of trees: {rf_model.n_estimators}")
print(f"Maximum depth: {rf_model.max_depth}")
print(f"Number of features: {rf_model.n_features_in_}")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 5 most important features:")
print(feature_importance.head())

Label encoding:
  0: High
  1: Low
  2: Medium

Training Random Forest model...
Model training completed!
Number of trees: 200
Maximum depth: 15
Number of features: 11

Top 5 most important features:
                     feature  importance
10        foot_traffic_score    0.288439
8   population_density_proxy    0.255146
3        schools_within_500m    0.149390
5       schools_min_distance    0.121529
6      hospitals_within_500m    0.038179
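
A single train/test split yields one performance estimate. Stratified k-fold cross-validation is a common way to see how much that estimate varies across different partitions of the data. The sketch below uses a synthetic stand-in dataset from `make_classification` so that it runs on its own; the fold count, the `f1_macro` scoring choice, and the smaller `n_estimators` are assumptions made for illustration, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_train / y_train_encoded: 11 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=11,
    n_informative=6,
    n_redundant=0,
    n_classes=3,
    random_state=42,
)

# Same tree settings as the notebook's model, with fewer trees to keep the demo fast
model = RandomForestClassifier(
    n_estimators=50,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    class_weight="balanced",
    random_state=42,
)

# Five stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")

print("Macro F1 per fold:", [round(float(s), 3) for s in scores])
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train_encoded`. A low spread across folds would give more confidence in the single-split result.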


## 4. Model Testing & Evaluation

Test the trained model on the unseen test dataset and calculate performance metrics.

In [6]:
# Make predictions on test set
y_pred_encoded = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)

# Decode predictions back to labels
y_pred = label_encoder.inverse_transform(y_pred_encoded)

print("Predictions completed!")
print(f"Test set size: {len(y_test)} samples")

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

# Confusion matrix (rows = actual class, columns = predicted class)
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print("Predicted →")
print("Actual ↓")
cm_df = pd.DataFrame(cm, index=label_encoder.classes_, columns=label_encoder.classes_)
print(cm_df)

# Additional metrics (compute the report dictionary once and reuse it)
report = classification_report(
    y_test, y_pred, target_names=label_encoder.classes_, output_dict=True
)
print("\nAdditional Metrics:")
print(f"Macro-averaged F1-score: {report['macro avg']['f1-score']:.4f}")
print(f"Weighted-averaged F1-score: {report['weighted avg']['f1-score']:.4f}")

# Error analysis
errors = y_test != y_pred
error_rate = errors.sum() / len(y_test)
print(f"Error rate: {error_rate:.4f}")

print("\nError Analysis:")
print(f"Total predictions: {len(y_test)}")
print(f"Correct predictions: {(~errors).sum()}")
print(f"Incorrect predictions: {errors.sum()}")

# Per-class error rates (share of each actual class that was misclassified)
for class_name in label_encoder.classes_:
    class_mask = y_test == class_name
    class_errors = errors[class_mask]
    class_error_rate = class_errors.sum() / class_mask.sum() if class_mask.sum() > 0 else 0
    print(f"Error rate for '{class_name}': {class_error_rate:.4f}")

Predictions completed!
Test set size: 536 samples
Accuracy: 0.9963

Detailed Classification Report:
              precision    recall  f1-score   support

        High       0.99      0.99      0.99       178
         Low       1.00      1.00      1.00       179
      Medium       0.99      0.99      0.99       179

    accuracy                           1.00       536
   macro avg       1.00      1.00      1.00       536
weighted avg       1.00      1.00      1.00       536


Confusion Matrix:
Predicted →
Actual ↓
        High  Low  Medium
High     177    0       1
Low        0  179       0
Medium     1    0     178

Additional Metrics:
Macro-averaged F1-score: 0.9963
Weighted-averaged F1-score: 0.9963
Error rate: 0.0037

Error Analysis:
Total predictions: 536
Correct predictions: 534
Incorrect predictions: 2
Error rate for 'High': 0.0056
Error rate for 'Low': 0.0000
Error rate for 'Medium': 0.0056
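
The reported figures can be reproduced by hand from the confusion matrix above, which makes clear what each metric measures. The sketch below copies the matrix from the output and derives accuracy, per-class precision, recall, and F1; the variable names are illustrative and not part of the notebook.

```python
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
labels = ["High", "Low", "Medium"]
cm = [
    [177, 0, 1],
    [0, 179, 0],
    [1, 0, 178],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

metrics = {}
for i, label in enumerate(labels):
    actual_count = sum(cm[i])                    # row sum: samples whose true class is `label`
    predicted_count = sum(row[i] for row in cm)  # column sum: samples predicted as `label`
    recall = cm[i][i] / actual_count
    precision = cm[i][i] / predicted_count
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    print(f"{label:<6} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

# Macro F1 is the unweighted mean of the per-class F1 scores
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)
print(f"accuracy={accuracy:.4f} ({correct}/{total}), macro F1={macro_f1:.4f}")
```

The per-class error rates printed by the notebook are one minus the recall of each class, for example 1/178 ≈ 0.0056 for High.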


## 5. Model Summary & Key Findings

### Performance Summary
- **Overall Accuracy**: 99.63%
- **Best Performing Class**: Low suitability (100% recall; all 179 test samples classified correctly)
- **Most Challenging Classes**: High and Medium suitability (99.44% recall each; one misclassification per class)
- **Key Insights**: Model achieves near-perfect performance with only 2 misclassifications out of 536 test samples

### Model Characteristics
- **Algorithm**: Random Forest Classifier
- **Training Samples**: 2,140 (80% of 2,676)
- **Test Samples**: 536 (20% of 2,676)
- **Features Used**: 11 selected features
- **Classes**: High, Medium, Low suitability

### Recommendations
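- **Deployment Readiness**: Test performance is excellent, but confirm it on independent data before production use; the data preview shows rows with identical feature values, so repeated feature vectors may appear in both the training and test sets and inflate the score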
- **Monitoring**: Implement continuous monitoring for performance drift
- **Feature Engineering**: Consider additional temporal features if available
- **Scalability**: Model is lightweight and suitable for real-time predictions

In [9]:
# Save the trained model and encoder for deployment
models_dir = 'cafelocate/ml/models'
os.makedirs(models_dir, exist_ok=True)

model_path = os.path.join(models_dir, 'final_suitability_rf_model.pkl')
encoder_path = os.path.join(models_dir, 'final_suitability_label_encoder.pkl')

joblib.dump(rf_model, model_path)
joblib.dump(label_encoder, encoder_path)

print("Model saved successfully!")
print(f"Model path: {model_path}")
print(f"Encoder path: {encoder_path}")

# Save feature importance for reference
importance_path = os.path.join(models_dir, 'feature_importance.csv')
feature_importance.to_csv(importance_path, index=False)
print(f"Feature importance saved: {importance_path}")

print("\nNotebook execution completed!")
print("The trained model is ready for deployment in the CaféLocate system.")

Model saved successfully!
Model path: cafelocate/ml/models\final_suitability_rf_model.pkl
Encoder path: cafelocate/ml/models\final_suitability_label_encoder.pkl
Feature importance saved: cafelocate/ml/models\feature_importance.csv

Notebook execution completed!
The trained model is ready for deployment in the CaféLocate system.
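
For reference, a saved model and encoder could be loaded and used for a prediction along the following lines. This is a minimal sketch under stated assumptions: it trains a tiny stand-in model on toy values and writes it to a temporary directory so that it runs on its own, and `new_location` is a hypothetical input. The feature values in the data preview appear to be standardized, so real raw measurements would likely need the same preprocessing first; that preprocessing step is not saved in this notebook.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

feature_cols = [
    'competitors_min_distance', 'roads_within_500m', 'roads_avg_distance',
    'schools_within_500m', 'schools_within_200m', 'schools_min_distance',
    'hospitals_within_500m', 'hospitals_min_distance',
    'population_density_proxy', 'accessibility_score', 'foot_traffic_score'
]

# Stand-in training data (toy values) so the example is self-contained
toy_X = pd.DataFrame([[-1.0] * 11, [0.0] * 11, [1.0] * 11] * 5, columns=feature_cols)
toy_y = ["Low", "Medium", "High"] * 5
encoder = LabelEncoder().fit(toy_y)
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(toy_X, encoder.transform(toy_y))

with tempfile.TemporaryDirectory() as models_dir:
    model_path = os.path.join(models_dir, 'final_suitability_rf_model.pkl')
    encoder_path = os.path.join(models_dir, 'final_suitability_label_encoder.pkl')
    joblib.dump(model, model_path)
    joblib.dump(encoder, encoder_path)

    # Deployment side: load both artifacts back from disk
    loaded_model = joblib.load(model_path)
    loaded_encoder = joblib.load(encoder_path)

# Hypothetical new location, already in the same scaled feature space
new_location = pd.DataFrame([[1.0] * 11], columns=feature_cols)

pred = loaded_encoder.inverse_transform(loaded_model.predict(new_location))[0]
proba = dict(zip(loaded_encoder.classes_, loaded_model.predict_proba(new_location)[0]))
print(f"Predicted suitability: {pred}")
print({label: round(float(p), 3) for label, p in proba.items()})
```

Passing a DataFrame with the same column names and order as `feature_cols` keeps the input aligned with what the model saw during training. Files written by `joblib.dump` should only be loaded from trusted sources, since loading a pickle can execute arbitrary code.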
