# K-Nearest Neighbors: Classifying Supermarkets by Brand

In this notebook, we use **K-Nearest Neighbors (KNN)** to classify supermarkets by brand.
The goal is to predict the supermarket brand (e.g., Spar, Landi, Migros, ALDI) from features such as population density, percentage of foreigners, employment rate, latitude, and longitude.

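KNN is a simple supervised learning algorithm. For classification, a new observation is assigned the majority label among its k nearest neighbors in the training data, where nearness is measured by a distance metric (Euclidean distance by default in scikit-learn).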
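To make the majority-vote idea concrete, here is a minimal from-scratch sketch. It is not how scikit-learn implements KNN internally; the function name `knn_predict` and the toy coordinates are invented for illustration only.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (invented): two features per store, two well-separated groups
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ['ALDI', 'ALDI', 'ALDI', 'MIGROS', 'MIGROS', 'MIGROS']

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # ALDI
print(knn_predict(train_points, train_labels, (5.1, 5.0), k=3))  # MIGROS
```

Every prediction scans the whole training set, which is why KNN is cheap to "train" but comparatively expensive at prediction time.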

## Libraries and Settings

In [None]:
# Libraries
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Show current working directory
print(os.getcwd())

## Import Data

In [None]:
# Load the supermarket dataset
df = pd.read_csv('./Data/supermarkets_data_enriched.csv', 
                 sep=',', 
                 encoding='utf-8',
                 index_col=0)

# Show dimension (rows, columns)
print(df.shape)

# Show the first 5 rows
df.head()

## Data Preprocessing

### Remove rows with missing or duplicated values

In [26]:
# Remove rows with missing values
df = df.dropna()

# Remove duplicated rows
df = df.drop_duplicates()

### Convert brand names to uppercase

In [27]:
# Change brand names to uppercase
df['brand'] = df['brand'].str.upper()

### Create a subset of the data with defined brands

In [None]:
# Select subset of supermarket data (reset the index in the subset)
df_sub = df[df['brand'].isin(['SPAR', 'LANDI', 'MIGROS', 'ALDI'])].reset_index(drop=True)
df_sub.head()

### Select relevant features for classification

In [29]:
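# Predictors: population density, percentage of foreigners, employment rate, latitude, longitude
features = ['pop_dens', 'frg_pct', 'emp', 'lat', 'lon']

# Feature matrix X and target vector y (the brand to predict)
X = df_sub[features]
y = df_sub['brand']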

### Split the data into training and testing sets and standardize the features

In [30]:
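# Split the data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features (removes the mean and scales to unit variance).
# The scaler is fitted on the training set only; the test set is transformed
# with the training statistics so no test information leaks into the model.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)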
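Because KNN relies on distances, a feature measured on a large scale would otherwise dominate features measured on a small scale. The sketch below reproduces the standardization arithmetic by hand for a single feature, using the population standard deviation, which is what `StandardScaler` computes. The helper names and the numbers are invented for illustration.

```python
import statistics


def fit_standardizer(values):
    """Return (mean, std) of `values`, using the population standard deviation."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Shift by the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]


# Invented values for one feature
train_values = [100.0, 200.0, 300.0, 400.0]
test_values = [250.0, 500.0]

# Statistics are estimated from the training data only ...
mean, std = fit_standardizer(train_values)
train_scaled = standardize(train_values, mean, std)
# ... and reused unchanged for the test data
test_scaled = standardize(test_values, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1. The test values generally do not, which is expected, since they are expressed relative to the training distribution.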

## Model Training: K-Nearest Neighbors

In [None]:
# Initialize the KNN model with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn.fit(X_train, y_train)
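The value `n_neighbors=5` is scikit-learn's default rather than a tuned choice. One common way to pick k is cross-validation, sketched below. The data here is a synthetic stand-in generated with `make_classification`, not the supermarket dataset, and the candidate values in `candidate_ks` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (NOT the supermarket dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42
)

candidate_ks = [1, 3, 5, 7, 9, 11]  # arbitrary candidates for illustration
mean_scores = {}
for k in candidate_ks:
    # The scaler is part of the pipeline, so each fold is scaled
    # using only that fold's training portion
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    mean_scores[k] = scores.mean()

# Pick the k with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print('Best k:', best_k)
```

Small values of k tend to follow noise in the training data, while large values smooth over local structure, so the best value depends on the dataset.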

## Model Evaluation

In [None]:
# Predict on the test set
y_pred = knn.predict(X_test)

# Generate confusion matrix (fix the label order so it matches the tick labels below)
cm = confusion_matrix(y_test, y_pred, labels=knn.classes_)

# Plot confusion matrix as a heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, 
            annot=True, 
            fmt='d', 
            cmap='Blues', 
            xticklabels=knn.classes_, 
            yticklabels=knn.classes_,
            cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Print classification report (string concatenation keeps the table columns aligned)
print('\nClassification report:\n' + classification_report(y_test, y_pred))
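The precision, recall, and F1 values in the classification report can be derived directly from the confusion matrix. The sketch below shows the arithmetic, assuming the same layout as the heatmap (rows are actual classes, columns are predicted classes). The function name `per_class_metrics` and the counts are invented for illustration.

```python
def per_class_metrics(cm, labels):
    """Derive precision, recall, and F1 per class from a confusion matrix.

    Rows of `cm` are actual classes, columns are predicted classes.
    """
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = cm[i][i]
        predicted_as_label = sum(row[i] for row in cm)  # column sum
        actually_label = sum(cm[i])                     # row sum
        precision = true_pos / predicted_as_label if predicted_as_label else 0.0
        recall = true_pos / actually_label if actually_label else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {'precision': precision, 'recall': recall, 'f1': f1}
    return metrics


# Invented counts for two classes
demo_labels = ['ALDI', 'MIGROS']
demo_cm = [
    [8, 2],  # actual ALDI:   8 predicted ALDI, 2 predicted MIGROS
    [1, 9],  # actual MIGROS: 1 predicted ALDI, 9 predicted MIGROS
]

for label, values in per_class_metrics(demo_cm, demo_labels).items():
    print(label, {name: round(v, 3) for name, v in values.items()})
```

Precision answers "of the stores predicted as this brand, how many really are?", while recall answers "of the stores that really are this brand, how many were found?".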

### Jupyter notebook --footer info-- (please always provide this at the end of each notebook)

In [None]:
import os
import platform
import socket
from platform import python_version
from datetime import datetime

print('-----------------------------------')
print(os.name.upper())
print(platform.system(), '|', platform.release())
print('Datetime:', datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print('Python Version:', python_version())
print('-----------------------------------')