# Comprehensive Machine Learning Project using K-Nearest Neighbors (KNN)

In this notebook, we demonstrate the KNN algorithm for classification tasks. We cover the complete machine learning workflow, including data loading, exploratory data analysis (EDA), data preprocessing, mathematical explanation, model training and evaluation, model analysis & visualization, discussion, conclusion, and references.

### Introduction

The K-Nearest Neighbors (KNN) algorithm is a simple, non-parametric, and instance-based learning method used for both classification and regression. It works on the assumption that similar instances exist in close proximity in the feature space. KNN is significant because:

- It is easy to implement and understand.
- It makes no assumptions about the underlying data distribution.

KNN finds applications in various domains such as product recommendation systems, fraud detection, and customer segmentation. In this notebook, we will use the well-known Iris dataset to illustrate the KNN workflow.

### Dataset Description & Exploratory Analysis

We will use the Iris dataset, which contains measurements for three species of Iris flowers. The dataset includes the features: sepal length, sepal width, petal length, and petal width, along with the target class (species).

In this section, we:

- Load the dataset
- Display basic statistical summaries
- Check for missing values
- Visualize the data using pair plots

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load Iris dataset from sklearn
from sklearn.datasets import load_iris

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)


In [None]:
iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['target'] = iris.target
data['species'] = data['target'].map(dict(zip(range(3), iris.target_names)))

# Display the first 5 rows of the dataset
data.head()

In [None]:
# Show class distribution
print("Class distribution:")
print(data['species'].value_counts())
sns.countplot(x='species', data=data)
plt.title('Class Distribution')
plt.show()

In [None]:
# Basic statistical summary
print(data.describe())

In [None]:
# Information about the dataset
# (info() prints its summary directly and returns None, so it is not wrapped in print)
data.info()

In [None]:
# Check for missing values
print(data.isnull().sum())

In [None]:
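# Create a pairplot to visualize relationships between features
# (the numeric 'target' column is dropped because 'species' already encodes the class)
sns.pairplot(data.drop(columns='target'), hue='species')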
plt.show()

### Data Preprocessing

In this section, we prepare the data for model training. The steps include:

- Handling missing values (if any)
- Scaling features using standardization (StandardScaler)
- (If needed) Encoding categorical variables
- Splitting the dataset into training and testing sets

In [None]:
# Import libraries for splitting and scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate features and target variable
X = data[iris.feature_names]
y = data['target']

# Split dataset: 80% training and 20% testing (using stratification)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

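Feature scaling is crucial for KNN because it is a distance-based algorithm. Features with larger scales can dominate the distance calculation, so we use StandardScaler to standardize each feature to mean 0 and variance 1. The scaler is fit on the training set only and then applied to the test set, so that no test-set statistics leak into training.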
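
To make the effect of scale concrete, here is a minimal sketch on hypothetical toy data (the array `X_toy` and its units are made up for illustration and are not part of the Iris workflow). It finds the nearest neighbor of the first row before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is on a small scale, column 1 on a large scale
X_toy = np.array([
    [1.0, 1000.0],   # query point
    [1.2, 1300.0],   # close to the query in column 0
    [5.0, 1050.0],   # close to the query in column 1
    [3.0, 1600.0],
])

def nearest_to_first_row(X):
    """Return the row index (1..n-1) closest to row 0 by Euclidean distance."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

# Raw features: the large-scale column decides the result almost alone
nearest_raw = nearest_to_first_row(X_toy)

# Standardized features: both columns contribute on a comparable scale
X_toy_scaled = StandardScaler().fit_transform(X_toy)
nearest_scaled = nearest_to_first_row(X_toy_scaled)

print("Nearest neighbor of row 0 (raw):   ", nearest_raw)
print("Nearest neighbor of row 0 (scaled):", nearest_scaled)
```

On this toy data the nearest neighbor changes once the features are standardized. With unscaled inputs, KNN can behave in the same way and effectively ignore small-scale features.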

In [None]:
# Scale the feature values using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training and testing sets have been prepared.")

### Mathematical Explanation

The K-Nearest Neighbors (KNN) algorithm classifies a new instance based on the classes of its k nearest neighbors. The key mathematical concepts are as follows:

1. **Distance Calculation:**
   - **Euclidean Distance:** For two points, $ x $ and $ y $, in an n-dimensional space:
     
     $ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $

   - **Manhattan Distance:** The distance is calculated as:
     
     $ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| $

2. **Choice of k (Number of Neighbors):**
   - A small k value may make the model sensitive to noise (overfitting), while a large k may smooth out the decision boundary (underfitting).

3. **Majority Voting:**
   - The class label for the new instance is determined by the majority class among its k nearest neighbors.

Both the distance metric and the choice of k play a crucial role in the model’s performance.
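
The three ideas above can be sketched in a few lines of NumPy. This is an illustrative from-scratch version, not how Scikit-learn implements KNN. The function names (`euclidean_distance`, `manhattan_distance`, `knn_predict`) and the toy arrays are assumptions introduced for this example.

```python
from collections import Counter

import numpy as np

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y))))

def knn_predict(X_train, y_train, x_new, k=3, distance=euclidean_distance):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = [distance(x, x_new) for x in X_train]
    nearest = np.argsort(distances)[:k]           # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)  # count labels among the neighbors
    # On a tie, Counter keeps insertion order, so the label seen first
    # (that of the closer neighbor) wins in this sketch.
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three points of class 0 near the origin, two of class 1
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_demo = np.array([0, 0, 0, 1, 1])
point = np.array([3.0, 4.0])

print("Euclidean distance from origin:", euclidean_distance([0, 0], point))
print("Manhattan distance from origin:", manhattan_distance([0, 0], point))
print("Prediction with k=3:", knn_predict(X_demo, y_demo, point, k=3))
print("Prediction with k=5:", knn_predict(X_demo, y_demo, point, k=5))
```

With k=3 the two nearby class-1 points win the vote. With k=5 every training point votes and the larger class 0 wins, which is the smoothing effect of a large k described above.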

### Model Training & Evaluation

In this section, we train the KNN model using Scikit-learn, tune hyperparameters (such as k and the distance metric), and evaluate performance using metrics like accuracy, precision, recall, F1-score, and the confusion matrix.

In [None]:
# Import the KNN classifier and evaluation metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Initialize the KNN classifier with an initial k value (e.g., 5) using Euclidean distance (Minkowski with p=2)
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)

# Train the model on the scaled training data
knn.fit(X_train_scaled, y_train)

# Predict target labels on the test set
y_pred = knn.predict(X_test_scaled)

# Evaluate the performance of the model and show as DataFrame
metrics = {
    "Accuracy": [accuracy_score(y_test, y_pred)],
    "Precision": [precision_score(y_test, y_pred, average='weighted')],
    "Recall": [recall_score(y_test, y_pred, average='weighted')],
    "F1 Score": [f1_score(y_test, y_pred, average='weighted')]
}

metrics_df = pd.DataFrame(metrics)
display(metrics_df)

In [None]:
# Plot the Confusion Matrix using Seaborn
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Classification report with species names instead of numeric labels
print(classification_report(y_test, y_pred, target_names=iris.target_names))

In [None]:
# Optional: Hyperparameter tuning using GridSearchCV to find the best parameters
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'metric': ['euclidean', 'manhattan'],
    'weights': ['uniform', 'distance']
}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

print("Best parameters found:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)

# Evaluate the best estimator on the test set
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test_scaled)

acc_best = accuracy_score(y_test, y_pred_best)
print(f"Test set accuracy of best model: {acc_best:.2f}")


In [None]:
# Show full classification report for the best model
# (classification_report was already imported with the other metrics)
print("Classification report for best model:")
print(classification_report(y_test, y_pred_best, target_names=iris.target_names))
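
One caveat about the grid search above: the scaler was fit on the whole training set before cross-validation, so each validation fold has already influenced the scaling statistics. On Iris the effect is likely small, but a common way to avoid it is to wrap scaling and the classifier in a `Pipeline`, so the scaler is refit inside every fold. The sketch below is one possible variant under that assumption. The names `pipe`, `pipe_grid`, `pipe_search`, and the `X_tr`/`X_te` split are introduced here so the notebook's existing variables are not overwritten.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same split settings as the notebook, but on the unscaled features
X_raw, y_raw = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# Scaling is a pipeline step, so it is fit only on the training part of each fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
pipe_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__metric": ["euclidean", "manhattan"],
    "knn__weights": ["uniform", "distance"],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="accuracy")
pipe_search.fit(X_tr, y_tr)
pipe_test_acc = pipe_search.score(X_te, y_te)

print("Best parameters (pipeline):", pipe_search.best_params_)
print("Best cross-validation accuracy:", pipe_search.best_score_)
print(f"Test set accuracy: {pipe_test_acc:.2f}")
```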

### Model Analysis & Visualization

Here, we further analyze the model by visualizing decision boundaries and the effect of different k values on performance. Since decision boundary plots are easier in two dimensions, we reduce our dataset to two features (sepal length and sepal width) for these visualizations.

The following visualizations will be generated:

- **Decision Boundary Plot**: Shows the classifier’s separation of classes.
- **K-value Selection Visualization**: Illustrates test accuracy as a function of the number of neighbors (k).

In [None]:
# For decision boundary visualization, select two features: sepal length and sepal width
features = ['sepal length (cm)', 'sepal width (cm)']

X_vis = X[features]
y_vis = y

# Split the data (using stratification for consistency)
X_train_vis, X_test_vis, y_train_vis, y_test_vis = train_test_split(X_vis, y_vis, test_size=0.2, random_state=42, stratify=y_vis)

# Scale the selected features
scaler_vis = StandardScaler()
X_train_vis_scaled = scaler_vis.fit_transform(X_train_vis)
X_test_vis_scaled = scaler_vis.transform(X_test_vis)

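# Train a KNN classifier on the two-feature dataset, reusing the best grid-search parameters.
# Note: these were tuned on all four features, so they are only a starting point here.
knn_vis = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'],
                               metric=grid_search.best_params_['metric'],
                               weights=grid_search.best_params_['weights'])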
knn_vis.fit(X_train_vis_scaled, y_train_vis)

# Create a mesh grid for plotting decision boundaries
x_min, x_max = X_train_vis_scaled[:, 0].min() - 1, X_train_vis_scaled[:, 0].max() + 1
y_min, y_max = X_train_vis_scaled[:, 1].min() - 1, X_train_vis_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# Predict class labels for each point in the mesh grid
Z = knn_vis.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundaries along with the training points
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.coolwarm)
scatter = plt.scatter(
    X_train_vis_scaled[:, 0], X_train_vis_scaled[:, 1],
    c=y_train_vis, s=20, edgecolor='k', cmap=plt.cm.coolwarm
)
plt.title('Decision Boundary with KNN')
plt.xlabel(features[0])
plt.ylabel(features[1])

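# Legend: one empty proxy scatter per class, colored to match the colormap
# (labels 0, 1, 2 map to colormap positions 0, 0.5, 1).
# `color=` is used because passing a single RGBA tuple through `c=` is ambiguous.
for i, class_name in zip(np.unique(y_train_vis), iris.target_names):
    plt.scatter([], [], color=plt.cm.coolwarm(i / 2), edgecolor='k', s=20, label=class_name)
plt.legend(title="Classes")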

plt.tight_layout()
plt.show()

In [None]:
# Explore test accuracy for different values of k using the two selected features.
# Note: this curve is illustrative; picking k from test accuracy would tune on the test set.
k_values = range(1, 16)
accuracies = []

for k in k_values:
    knn_temp = KNeighborsClassifier(n_neighbors=k)
    knn_temp.fit(X_train_vis_scaled, y_train_vis)
    pred_temp = knn_temp.predict(X_test_vis_scaled)
    accuracies.append(accuracy_score(y_test_vis, pred_temp))

# Plot test accuracy vs. k
plt.figure(figsize=(8, 4))
plt.plot(k_values, accuracies, marker='o')
best_k = k_values[np.argmax(accuracies)]
best_acc = max(accuracies)
plt.annotate(f'Best k={best_k}\nAcc={best_acc:.2f}', xy=(best_k, best_acc), 
             xytext=(best_k+1, best_acc-0.05),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.title('K-value Selection: Test Accuracy vs. Number of Neighbors')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Test Accuracy')
plt.xticks(k_values)
plt.tight_layout()
plt.show()
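
Because the curve above is computed on the test set, the k it highlights is slightly optimistic. A hedged alternative is to choose k by cross-validation on the training data and use the test set only once at the end. The sketch below assumes the same two sepal features and split settings. The names `cv_k_values`, `cv_means`, and `cv_best_k` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_sepal = X_all[:, :2]  # first two columns: sepal length and sepal width

X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_sepal, y_all, test_size=0.2, random_state=42, stratify=y_all
)

cv_k_values = list(range(1, 16))
cv_means = []
for k in cv_k_values:
    # Scaling lives inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_tr2, y_tr2, cv=5, scoring="accuracy")
    cv_means.append(scores.mean())

# argmax returns the first maximum, so ties go to the smallest k
cv_best_k = cv_k_values[int(np.argmax(cv_means))]

# Refit with the selected k and evaluate once on the held-out test set
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=cv_best_k))
final_model.fit(X_tr2, y_tr2)
final_test_acc = final_model.score(X_te2, y_te2)

print("k selected by cross-validation:", cv_best_k)
print(f"Test accuracy with that k: {final_test_acc:.2f}")
```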

### Discussion

#### Baseline Comparison

For reference, a simple baseline classifier that always predicts the majority class would achieve an accuracy of approximately 33% (since the Iris dataset is balanced). Our KNN model significantly outperforms this baseline, demonstrating its effectiveness.
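
The baseline figure can be checked with Scikit-learn's `DummyClassifier`. This is a small sketch using the same split settings as the rest of the notebook. The variable names (`X_b`, `baseline`, `baseline_acc`) are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X_b, y_b = load_iris(return_X_y=True)
X_b_train, X_b_test, y_b_train, y_b_test = train_test_split(
    X_b, y_b, test_size=0.2, random_state=42, stratify=y_b
)

# Always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_b_train, y_b_train)
baseline_acc = baseline.score(X_b_test, y_b_test)

print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```

Because the split is stratified and the classes are balanced, each class makes up one third of the test set, so the baseline is correct on exactly one third of the test samples.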

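The KNN algorithm performed well on the Iris dataset. In the two-feature visualization, the decision boundary isolates setosa cleanly, while versicolor and virginica overlap in their sepal measurements and are separated less sharply. Some key points to note: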

**Strengths:**
- Simple to implement and interpret.
- No assumptions about the underlying data distribution.

**Weaknesses:**
- Computationally expensive at prediction time for very large datasets, since each query requires distance computations against the stored training data.
- Sensitive to the choice of k and feature scaling.

While more sophisticated algorithms like Support Vector Machines or Decision Trees might capture complex patterns better, KNN remains useful for its simplicity and interpretability in many real-world tasks.

### Conclusion

In this project, we showcased a complete machine learning workflow using the K-Nearest Neighbors (KNN) algorithm. We:

- Loaded and explored the Iris dataset
- Preprocessed the data (including scaling and train-test splitting)
- Explained the mathematical foundation of KNN
- Trained and tuned the model, evaluating it with several performance metrics
- Analyzed model behavior using decision boundary and k-value performance plots

The KNN algorithm, despite its simplicity, demonstrated effective performance on the dataset.

#### Next Steps

- Apply dimensionality reduction (e.g., PCA) before KNN to visualize in lower dimensions.
- Test KNN on larger or more complex datasets.
- Compare KNN’s performance with other classifiers such as Logistic Regression or Decision Trees.
- Explore advanced hyperparameter tuning and cross-validation strategies.
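
As a starting point for the first item, here is a hedged sketch of PCA followed by KNN in a single pipeline. The choice of two components and k=5 is an assumption made for illustration, not a tuned setting, and the names `pca_knn` and `pca_scores` are introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_p, y_p = load_iris(return_X_y=True)

# Standardize, project onto the first two principal components, then classify
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validation; every step is refit on each training fold
pca_scores = cross_val_score(pca_knn, X_p, y_p, cv=5, scoring="accuracy")

print("Accuracy per fold:", pca_scores)
print(f"Mean accuracy: {pca_scores.mean():.2f}")
```

Unlike the sepal-only plots earlier, the two principal components combine all four measurements, so a decision boundary drawn in this space uses information from every feature.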

### References

1. [Scikit-learn Documentation](https://scikit-learn.org/stable/)
2. Müller, A. C., & Guido, S. (2016). *Introduction to Machine Learning with Python*. O'Reilly Media.
3. [Iris Dataset - UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris)