{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7d1a89d9-7c2c-4f0e-8a8d-c5c6a4e2c5b8",
   "metadata": {},
   "source": [
    "# k-NN Classification Analysis\n",
    "\n",
    "This notebook walks through implementing the k-Nearest Neighbors (k-NN) algorithm from scratch on the UCI Wine dataset, which has 178 instances, 13 numerical features, and a class label (1, 2, or 3). We will:\n",
    "\n",
    "- Load and visualize the dataset\n",
    "- Preprocess the data (handle missing values, normalize, split into training/testing sets)\n",
    "- Implement k-NN from scratch (using Euclidean and Manhattan distance metrics)\n",
    "- Evaluate model performance for different K values\n",
    "- Plot accuracy vs. K\n",
    "- Display a confusion matrix and classification report for the best-performing model\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b4f3c2a-840b-4e31-a8b8-5c3583046c86",
   "metadata": {},
   "source": [
    "## 1. Data Loading and Preprocessing\n",
    "\n",
    "In this section, we load the dataset, handle missing values, normalize the data, and split it into training and testing sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a1442a6-7cbd-4e9f-90b9-97f5c48e6a10",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# ----------------------------------------------\n",
    "# Load the Dataset\n",
    "# ----------------------------------------------\n",
    "# If you have a local CSV from the extracted zip, uncomment and adjust the following lines:\n",
    "# df = pd.read_csv('wine.csv')\n",
    "# X = df.iloc[:, :-1]   # all columns except the last one\n",
    "# y = df.iloc[:, -1]    # the last column (class labels)\n",
    "\n",
    "# Alternatively, use the ucimlrepo package to fetch the dataset\n",
    "from ucimlrepo import fetch_ucirepo\n",
    "wine = fetch_ucirepo(id=109)\n",
    "X = wine.data.features.copy()  # explicit copy so later edits do not touch the fetched object\n",
    "# targets is a one-column DataFrame; take that column as a Series so each label is a scalar\n",
    "y = wine.data.targets.iloc[:, 0]\n",
    "\n",
    "# ----------------------------------------------\n",
    "# Preprocess the Data\n",
    "# ----------------------------------------------\n",
    "\n",
    "# Handle missing values (if any) by filling with the column mean\n",
    "X = X.fillna(X.mean())\n",
    "\n",
    "# Split data into Training (80%) and Testing (20%) sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "# Normalize features to [0, 1]. Fit the scaler on the training set only so that\n",
    "# test-set statistics do not leak into training; test values may fall slightly outside [0, 1].\n",
    "scaler = MinMaxScaler()\n",
    "X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns, index=X_train.index)\n",
    "X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)\n",
    "\n",
    "print(\"Training set size:\", X_train.shape)\n",
    "print(\"Testing set size:\", X_test.shape)"
   ]
  },
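  {
   "cell_type": "markdown",
   "id": "minmax-sketch-note",
   "metadata": {},
   "source": [
    "Min-max scaling maps each feature to the range [0, 1] using that column's minimum and maximum: `(x - min) / (max - min)`. The optional sketch below reproduces this by hand on a small made-up array (`toy`, not taken from the Wine data) and checks the result against `MinMaxScaler`. The helper name `min_max_scale` is hypothetical, and the sketch assumes no column is constant (otherwise the denominator would be zero)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "minmax-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# Made-up values for illustration only (4 rows, 2 features)\n",
    "toy = np.array([[1.0, 10.0],\n",
    "                [2.0, 20.0],\n",
    "                [3.0, 40.0],\n",
    "                [5.0, 50.0]])\n",
    "\n",
    "def min_max_scale(data):\n",
    "    # Hypothetical helper: scale each column to [0, 1] via (x - min) / (max - min).\n",
    "    # Assumes every column has max > min.\n",
    "    col_min = data.min(axis=0)\n",
    "    col_max = data.max(axis=0)\n",
    "    return (data - col_min) / (col_max - col_min)\n",
    "\n",
    "manual = min_max_scale(toy)\n",
    "reference = MinMaxScaler().fit_transform(toy)\n",
    "\n",
    "# The hand-written version should agree with scikit-learn\n",
    "print(manual)\n",
    "print('Matches MinMaxScaler:', np.allclose(manual, reference))"
   ]
  },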
  {
   "cell_type": "markdown",
   "id": "a387d2df-d973-4f97-8d7d-45b9b0c1a5e2",
   "metadata": {},
   "source": [
    "## 2. Data Visualization\n",
    "\n",
    "Let's visualize some of the features to see if they overlap across different classes. Here we create a scatter plot of **Alcohol** vs. **Malicacid**, colored by the class label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3b1f9ab-3e2a-4af1-8b27-cc5dbb5d9d5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scatter plot: Alcohol vs. Malicacid\n",
    "plt.figure(figsize=(8, 6))\n",
    "sns.scatterplot(x=X['Alcohol'], y=X['Malicacid'], hue=y, palette=\"viridis\")\n",
    "plt.title(\"Alcohol vs. Malicacid by Class\")\n",
    "plt.xlabel(\"Alcohol\")\n",
    "plt.ylabel(\"Malicacid\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa1b1374-d4ea-4324-8ad3-7e76e3c6f4f2",
   "metadata": {},
   "source": [
    "## 3. k-NN Implementation from Scratch\n",
    "\n",
    "We implement the k-NN algorithm without using `sklearn.neighbors.KNeighborsClassifier`. Two distance metrics are used: **Euclidean** and **Manhattan**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0f91f60-54b8-4cf2-bd25-9a5918c8a8fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "def euclidean_distance(p1, p2):\n",
    "    # Straight-line (L2) distance between two feature vectors\n",
    "    return np.sqrt(np.sum((p1 - p2) ** 2))\n",
    "\n",
    "def manhattan_distance(p1, p2):\n",
    "    # Sum of absolute coordinate differences (L1 distance)\n",
    "    return np.sum(np.abs(p1 - p2))\n",
    "\n",
    "def knn_predict(X_train, y_train, X_test, k, distance_metric=\"euclidean\"):\n",
    "    \"\"\"Predict a label for each row of X_test by majority vote among its k nearest training points.\"\"\"\n",
    "    if distance_metric == \"euclidean\":\n",
    "        distance_fn = euclidean_distance\n",
    "    elif distance_metric == \"manhattan\":\n",
    "        distance_fn = manhattan_distance\n",
    "    else:\n",
    "        raise ValueError(f\"Unsupported distance metric: {distance_metric}\")\n",
    "\n",
    "    X_train_np = np.asarray(X_train)\n",
    "    X_test_np = np.asarray(X_test)\n",
    "    # Flatten labels to 1-D so that a Series or a one-column DataFrame both work\n",
    "    y_train_np = np.ravel(y_train)\n",
    "\n",
    "    predictions = []\n",
    "    for test_point in X_test_np:\n",
    "        # Distance from this test point to every training point, paired with that point's label\n",
    "        distances = [(distance_fn(test_point, train_point), label)\n",
    "                     for train_point, label in zip(X_train_np, y_train_np)]\n",
    "\n",
    "        # Sort by distance and keep the labels of the k nearest neighbors\n",
    "        distances.sort(key=lambda pair: pair[0])\n",
    "        k_nearest = [label for _, label in distances[:k]]\n",
    "\n",
    "        # Majority vote; on a tie, the label that appears first (the closer neighbor) wins\n",
    "        predictions.append(Counter(k_nearest).most_common(1)[0][0])\n",
    "    return np.array(predictions)"
   ]
  },
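  {
   "cell_type": "markdown",
   "id": "knn-vectorized-note",
   "metadata": {},
   "source": [
    "The nested loops in `knn_predict` compute one distance at a time, which is easy to follow but slow on larger datasets. As an optional alternative, the same neighbor search can be sketched with NumPy broadcasting, which computes every test-to-train distance in one step. The function name `knn_predict_vectorized` and the toy arrays are hypothetical; this is a sketch under those assumptions, not a replacement for the implementation used in the rest of this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "knn-vectorized-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from collections import Counter\n",
    "\n",
    "def knn_predict_vectorized(X_train, y_train, X_test, k, distance_metric='euclidean'):\n",
    "    # Hypothetical helper: k-NN prediction with all pairwise distances computed at once\n",
    "    X_train = np.asarray(X_train, dtype=float)\n",
    "    X_test = np.asarray(X_test, dtype=float)\n",
    "    y_train = np.ravel(y_train)\n",
    "\n",
    "    # Broadcasting gives diff the shape (n_test, n_train, n_features)\n",
    "    diff = X_test[:, None, :] - X_train[None, :, :]\n",
    "    if distance_metric == 'euclidean':\n",
    "        dists = np.sqrt((diff ** 2).sum(axis=2))\n",
    "    elif distance_metric == 'manhattan':\n",
    "        dists = np.abs(diff).sum(axis=2)\n",
    "    else:\n",
    "        raise ValueError(f'Unsupported distance metric: {distance_metric}')\n",
    "\n",
    "    # Indices of the k nearest training points per test point (stable sort keeps ties in training order)\n",
    "    nearest = np.argsort(dists, axis=1, kind='stable')[:, :k]\n",
    "    # Majority vote over the neighbor labels of each test point\n",
    "    return np.array([Counter(y_train[row]).most_common(1)[0][0] for row in nearest])\n",
    "\n",
    "# Made-up 2-D points forming two well-separated clusters\n",
    "toy_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],\n",
    "                      [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])\n",
    "toy_labels = np.array([1, 1, 1, 2, 2, 2])\n",
    "toy_test = np.array([[0.05, 0.05], [1.0, 0.95]])\n",
    "\n",
    "print(knn_predict_vectorized(toy_train, toy_labels, toy_test, k=3))\n",
    "print(knn_predict_vectorized(toy_train, toy_labels, toy_test, k=3, distance_metric='manhattan'))"
   ]
  },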
  {
   "cell_type": "markdown",
   "id": "29b4c671-2409-44a5-8c0b-b640f1e9d9e7",
   "metadata": {},
   "source": [
    "## 4. Model Evaluation and Analysis\n",
    "\n",
    "We evaluate the k-NN model for different values of **K** (1, 3, 5, 7, 9) using both distance metrics. We calculate the classification accuracy for each K, plot accuracy vs. K, and generate a confusion matrix and classification report for the best-performing configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6efae60-8a53-4b8e-9367-10a993c3f4e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score, confusion_matrix, classification_report\n",
    "\n",
    "k_values = [1, 3, 5, 7, 9]\n",
    "euclidean_accuracies = []\n",
    "manhattan_accuracies = []\n",
    "\n",
    "for k in k_values:\n",
    "    # Euclidean distance evaluation\n",
    "    y_pred_euclidean = knn_predict(X_train, y_train, X_test, k, distance_metric=\"euclidean\")\n",
    "    acc_euclidean = accuracy_score(y_test, y_pred_euclidean)\n",
    "    euclidean_accuracies.append(acc_euclidean)\n",
    "    \n",
    "    # Manhattan distance evaluation\n",
    "    y_pred_manhattan = knn_predict(X_train, y_train, X_test, k, distance_metric=\"manhattan\")\n",
    "    acc_manhattan = accuracy_score(y_test, y_pred_manhattan)\n",
    "    manhattan_accuracies.append(acc_manhattan)\n",
    "    \n",
    "    print(f\"K = {k}: Euclidean Accuracy = {acc_euclidean:.4f}, Manhattan Accuracy = {acc_manhattan:.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13f1e7c8-c98b-4d8a-8c6b-70193bfe4d9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot Accuracy vs. K\n",
    "plt.figure(figsize=(8, 5))\n",
    "plt.plot(k_values, euclidean_accuracies, marker='o', label='Euclidean')\n",
    "plt.plot(k_values, manhattan_accuracies, marker='s', label='Manhattan')\n",
    "plt.xlabel(\"K Value\")\n",
    "plt.ylabel(\"Accuracy\")\n",
    "plt.title(\"Accuracy vs. K for k-NN\")\n",
    "plt.legend()\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a04b8e59-7e16-4a2e-b956-cf74d25a91d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the best K based on Euclidean accuracy\n",
    "best_index = np.argmax(euclidean_accuracies)\n",
    "best_k = k_values[best_index]\n",
    "print(f\"\\nBest K based on Euclidean distance: {best_k}\")\n",
    "\n",
    "# Get predictions for best K\n",
    "best_predictions = knn_predict(X_train, y_train, X_test, best_k, distance_metric=\"euclidean\")\n",
    "\n",
    "# Confusion Matrix and Classification Report\n",
    "cm = confusion_matrix(y_test, best_predictions)\n",
    "cr = classification_report(y_test, best_predictions)\n",
    "\n",
    "print(\"\\nConfusion Matrix:\")\n",
    "print(cm)\n",
    "print(\"\\nClassification Report:\")\n",
    "print(cr)"
   ]
  },
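  {
   "cell_type": "markdown",
   "id": "confusion-matrix-sketch-note",
   "metadata": {},
   "source": [
    "In a confusion matrix, the entry in row *i* and column *j* counts the test samples whose true class is *i* and whose predicted class is *j*. Correct predictions lie on the diagonal, so accuracy equals the diagonal sum divided by the total count. In keeping with the from-scratch theme, the optional sketch below builds the matrix by hand on made-up labels and compares it with `sklearn.metrics.confusion_matrix`. The helper name `confusion_matrix_from_scratch` and the toy label lists are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "confusion-matrix-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "def confusion_matrix_from_scratch(y_true, y_pred, labels):\n",
    "    # Hypothetical helper: rows are true classes, columns are predicted classes\n",
    "    position = {label: i for i, label in enumerate(labels)}\n",
    "    matrix = np.zeros((len(labels), len(labels)), dtype=int)\n",
    "    for true_label, predicted_label in zip(y_true, y_pred):\n",
    "        matrix[position[true_label], position[predicted_label]] += 1\n",
    "    return matrix\n",
    "\n",
    "# Made-up labels for illustration only\n",
    "toy_true = [1, 1, 2, 2, 3, 3, 3]\n",
    "toy_pred = [1, 2, 2, 2, 3, 3, 1]\n",
    "toy_classes = [1, 2, 3]\n",
    "\n",
    "cm_manual = confusion_matrix_from_scratch(toy_true, toy_pred, toy_classes)\n",
    "cm_sklearn = confusion_matrix(toy_true, toy_pred, labels=toy_classes)\n",
    "\n",
    "# Accuracy is the share of samples on the diagonal\n",
    "accuracy = np.trace(cm_manual) / cm_manual.sum()\n",
    "\n",
    "print(cm_manual)\n",
    "print('Matches scikit-learn:', np.array_equal(cm_manual, cm_sklearn))\n",
    "print(f'Accuracy from the diagonal: {accuracy:.4f}')"
   ]
  },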
  {
   "cell_type": "markdown",
   "id": "2ac1195a-73c8-4eeb-9556-b3225d3a8146",
   "metadata": {},
   "source": [
    "## 5. Conclusion\n",
    "\n",
    "In this notebook, we loaded and preprocessed the Wine dataset, plotted **Alcohol** against **Malicacid** by class, and implemented a k-NN classifier from scratch with both Euclidean and Manhattan distance. We compared test accuracy for K = 1, 3, 5, 7, and 9 under each metric, then reported a confusion matrix and classification report for the best-performing configuration. Together, these steps show how the choice of K and distance metric affects k-NN accuracy on this dataset."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.x"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
