# Case Study 1: Sonar

**Nearest Neighbour Classification using Minkowski Distance**

## Overview

### Objective
- Implement a Nearest Neighbour classifier using the Minkowski distance.
- Evaluate the Nearest Neighbour model on the test dataset using performance metrics.
- Train and tune a Decision Tree classifier with cross-validation to control overfitting.
- Compare the performance of both models on the Sonar dataset.

### Dataset
- **Sonar Dataset:** Contains 60 predictors (A1, A2, …, A60) representing sonar signal measurements.
- **Target Variable:** "Class" – labels objects as either a rock ("R") or a metal cylinder ("M").

### Tasks
1. **Nearest Neighbour Classification:**
   - Classify each test record by finding its nearest neighbour from the training set using the Minkowski distance.
   - Assess performance metrics: accuracy, recall, precision, and F1 measure (with "M" as the positive class).
   - Experiment with different Minkowski powers (q from 1 to 20) to determine the optimal q based on accuracy.

2. **Decision Tree Classification:**
   - Train a Decision Tree classifier on the Sonar dataset.
   - Tune hyperparameters using grid search and cross-validation.
   - Evaluate the optimised model on the test set.

## Data Loading and Preprocessing

In [None]:
import pandas as pd 
import numpy as np

In [None]:
# Load Sonar datasets
train_data = pd.read_csv("sonar_train.csv")
test_data = pd.read_csv("sonar_test.csv")

# Separate features and targets fromtraiin and test data
train_X = train_data.drop(columns=["Class"])
train_y = train_data[["Class"]].values.flatten()

test_X = test_data.drop(columns=["Class"])
test_y = test_data[["Class"]].values.flatten()

# Display s ummary  statistics of training features
train_X.describe()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,...,A51,A52,A53,A54,A55,A56,A57,A58,A59,A60
count,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,...,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0
mean,0.028881,0.037319,0.041682,0.052694,0.075263,0.105,0.123863,0.13281,0.17516,0.2054,...,0.016472,0.013347,0.01017,0.010484,0.009495,0.008061,0.007673,0.008077,0.007858,0.006549
std,0.022602,0.033011,0.038513,0.047588,0.056174,0.057755,0.061604,0.087889,0.120416,0.126882,...,0.012275,0.010252,0.007068,0.006864,0.007135,0.006076,0.005402,0.006803,0.006001,0.004598
min,0.0015,0.0017,0.0015,0.0058,0.0067,0.0102,0.0033,0.0055,0.0075,0.0193,...,0.0015,0.0008,0.0005,0.001,0.0006,0.0004,0.0003,0.0006,0.0001,0.0006
25%,0.01315,0.0158,0.01705,0.0244,0.0392,0.07195,0.0868,0.08265,0.094,0.1103,...,0.0087,0.0068,0.00465,0.00535,0.0041,0.00395,0.0037,0.00355,0.00355,0.0031
50%,0.0221,0.0297,0.0324,0.0415,0.0617,0.0929,0.1053,0.1117,0.1522,0.1838,...,0.014,0.0113,0.0084,0.0089,0.0079,0.0064,0.0062,0.0058,0.0065,0.0054
75%,0.03525,0.04755,0.0556,0.0627,0.10105,0.13255,0.1607,0.1676,0.2265,0.269,...,0.0208,0.01645,0.01355,0.0135,0.0121,0.01015,0.01035,0.01065,0.01005,0.00875
max,0.1313,0.2339,0.3059,0.4264,0.401,0.307,0.3322,0.459,0.6828,0.5966,...,0.1004,0.0709,0.0361,0.0352,0.0447,0.0394,0.0355,0.044,0.0294,0.0231


Standardise the data to avoid any potential disproportionality of certain features negatively impacting Minkowski distance calculation.

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
prepared_train_X = scaler.fit_transform(train_X, train_y)
prepared_test_X = scaler.fit_transform(test_X, test_y)

## Nearest Neighbour Implementation

In this section, we implement a Nearest Neighbour (1-NN) classifier using Minkowski distance.

### NearestNeighbourClassifier Class
We create a class called NearestNeighbourClassifier, this class has two main methods:
- `fit(X, y)`
This method validates and stores the training data along with the labels. It also determines the unique classes in the data set.

- `predict(X)`
  This method validates the test data and checks that the classifier has been fitted. For each test sample, it computes the Minkowski distance to every training sample, finds the training sample with the smallest distance, and assigns the label of that sample as the prediction.

We define a function `minkowski_dist` that computes the Minkowski distance between two vectors using a given power q. It traverses both vectors, summing the absolute differences between their respective elements raised to the power of q. Finally, it takes the qth root of the sum, returning the Minkowski distance.

Custom Estimator Template sourced from: https://scikit-learn.org/stable/developers/develop.html

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import validate_data, check_is_fitted
from sklearn.utils.multiclass import unique_labels

class NearestNeighbourClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self, minkowski_q=2): # Euclidean distance by default
        """The constructor for the NearestNeighbourClassifier class."""
        self.minkowski_q = minkowski_q

    def fit(self, X, y):
        """Validate and store the training data."""
        X, y = validate_data(self, X, y)

        self.classes_ = unique_labels(y)

        self.X_ = X
        self.y_ = y

        return self
    
    def predict(self, X):
        """Predict the class labels for the provided data."""
        check_is_fitted(self) # Check fit has been called
        X = validate_data(self, X, reset=False)

        preds = []
        for test_inst in X:
            dists = []
            # Find the Minkowski distance between the test instance and all training instances
            for train_inst in self.X_:
                dists.append(self.minkowski_dist(test_inst, train_inst, self.minkowski_q))

            # Get index of training instance with smallest distance
            nearest_neighbour_index = np.argmin(dists) 
            
            # Append the nearest neighbour's label to the predictions
            preds.append(self.y_[nearest_neighbour_index])
            
        return np.array(preds)
    
    def minkowski_dist(self, array1, array2, q):
        """Compute the Minkowski distance between two vectors using a given power q."""
        assert array1.shape == array2.shape
        
        dist = 0
        for feature1, feature2 in zip(array1, array2):
            dist += np.abs(feature1 - feature2)**q
        return dist**(1/q)

We then verify our classifier with scikit-learn’s estimator checks:

In [None]:
from sklearn.utils.estimator_checks import check_estimator
check_estimator(NearestNeighbourClassifier()) # Passes all common checks

[{'estimator': NearestNeighbourClassifier(),
  'check_name': 'check_estimator_cloneable',
  'exception': None,
  'status': 'passed',
  'expected_to_fail': False,
  'expected_to_fail_reason': 'Check is not expected to fail'},
 {'estimator': NearestNeighbourClassifier(),
  'check_name': 'check_estimator_cloneable',
  'exception': None,
  'status': 'passed',
  'expected_to_fail': False,
  'expected_to_fail_reason': 'Check is not expected to fail'},
 {'estimator': NearestNeighbourClassifier(),
  'check_name': 'check_estimator_tags_renamed',
  'exception': None,
  'status': 'passed',
  'expected_to_fail': False,
  'expected_to_fail_reason': 'Check is not expected to fail'},
 {'estimator': NearestNeighbourClassifier(),
  'check_name': 'check_valid_tag_types',
  'exception': None,
  'status': 'passed',
  'expected_to_fail': False,
  'expected_to_fail_reason': 'Check is not expected to fail'},
 {'estimator': NearestNeighbourClassifier(),
  'check_name': 'check_estimator_repr',
  'exception': N

In [None]:
from sklearn.metrics import accuracy_score

nearest_neighbour = NearestNeighbourClassifier(minkowski_q=1)

nearest_neighbour.fit(prepared_train_X, train_y)
preds = nearest_neighbour.predict(prepared_test_X)

score = accuracy_score(test_y, preds)

print(f"Test accuracy: {round(score*100, 2)}%")

Test accuracy: 94.2%
