# Overview  

This is the **fourth notebook** in my machine learning series.  

In this notebook, we’ll implement our **second classification algorithm — K-Nearest Neighbors (KNN)**.  

Let’s get started!

# Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Data Loading and Analysis

## Dataset — Iris 

In this notebook, we'll be using the **Iris dataset**, one of the most famous datasets in machine learning.  

The **Iris dataset** was first introduced in **R.A. Fisher’s 1936 paper**, *“The Use of Multiple Measurements in Taxonomic Problems”*, and is also available on the **UCI Machine Learning Repository**.  

It contains **three species of Iris flowers**, with **50 samples each**, and includes measurements of their physical attributes.  
One of the species (Setosa) is **linearly separable** from the other two, while Versicolor and Virginica are **not linearly separable** from each other, making this dataset a great starting point for classification tasks.  

### Columns:
- **Id** — Unique identifier for each observation  
- **SepalLengthCm** — Length of the sepal (in cm)  
- **SepalWidthCm** — Width of the sepal (in cm)  
- **PetalLengthCm** — Length of the petal (in cm)  
- **PetalWidthCm** — Width of the petal (in cm)  
- **Species** — Type of Iris flower (Setosa, Versicolor, Virginica)


In [2]:
# Load the dataset
df = pd.read_csv('Iris.csv')

In [3]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [4]:
# Drop the Id column: it is only a row identifier and carries no predictive information
df.drop('Id', axis=1, inplace=True)

In [5]:
df.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Data Visualization

In [6]:
fig = px.pie(df, 'Species', template='plotly_dark', title='Data Distribution')
fig.show(renderer='iframe')

This plot shows that there is an **equal number of samples for each class**, making the dataset **balanced**.
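
The balance can also be checked numerically. The sketch below assumes a `Species` column like the one in `df`. To stay runnable without `Iris.csv`, it builds a hypothetical stand-in frame (`demo_df`) with the 50-per-class counts described earlier; the label strings for Versicolor and Virginica are assumed to follow the `Iris-setosa` pattern.

```python
import pandas as pd

# Hypothetical stand-in for df: 50 rows per species, as described for Iris
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
demo_df = pd.DataFrame({"Species": [name for name in species for _ in range(50)]})

# Count the rows per class; equal counts mean the dataset is balanced
class_counts = demo_df["Species"].value_counts()
print(class_counts)

# A single distinct count value means every class has the same size
is_balanced = class_counts.nunique() == 1
print(f"Balanced: {is_balanced}")
```

On the real data, the same check would be `df['Species'].value_counts()`.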

### Sepal Length 

In [7]:
fig = px.box(df, x='Species', y='SepalLengthCm', color='Species', template='plotly_dark')
fig.show(renderer='iframe')

In [8]:
fig = px.histogram(df, x='SepalLengthCm', color='Species', template='plotly_dark', nbins=50)
fig.show(renderer='iframe')

These plots show that:  
- **Setosa** flowers generally have the shortest sepals, which sets most of them apart from the other two classes.  
- **Virginica** flowers generally have the longest sepals, and the box plot shows an outlier.  
- **Versicolor** and **Virginica** overlap considerably, which makes **Sepal Length** on its own a less useful feature for telling them apart.

### Petal Length

In [9]:
fig = px.box(df, x='Species', y='PetalLengthCm', color='Species', template='plotly_dark')
fig.show(renderer='iframe')

In [10]:
fig = px.histogram(df, x='PetalLengthCm', color='Species', template='plotly_dark', nbins=30)
fig.show(renderer='iframe')

These plots show that:
- Again, **Setosa** has a much smaller **Petal Length** compared to the other two classes.
- **Virginica** has the largest **Petal Length**.
- There is some overlap between **Versicolor** and **Virginica**, but **Petal Length** still differentiates the classes reasonably well.

### Sepal Width

In [11]:
fig = px.box(df, x='Species', y='SepalWidthCm', color='Species', template='plotly_dark')
fig.show(renderer='iframe')

In [12]:
fig = px.histogram(df, x='SepalWidthCm', color='Species', template='plotly_dark', nbins=30)
fig.show(renderer='iframe')

These plots show that:
- **Setosa** has the largest **Sepal Width**.
- **Versicolor** has the smallest **Sepal Width**.
- There is a lot of overlap between the classes, suggesting that **Sepal Width** might not be a useful feature.

### Petal Width

In [13]:
fig = px.box(df, x='Species', y='PetalWidthCm', color='Species', template='plotly_dark')
fig.show(renderer='iframe')

In [14]:
fig = px.histogram(df, x='PetalWidthCm', color='Species', template='plotly_dark', nbins=20)
fig.show(renderer='iframe')

These plots show that:
- **Setosa** has a much smaller **Petal Width** than the other two classes; it also has two outliers.
- Again, the difference between **Virginica** and **Versicolor** is less clear.
- Overall, **Petal Width** looks like a useful feature for making good predictions.

In [15]:
fig = px.scatter(df, x='SepalLengthCm', y='SepalWidthCm', size='PetalLengthCm', color='Species', template='plotly_dark')
fig.show(renderer='iframe')

In [16]:
fig = px.scatter(df, x='SepalLengthCm', y='SepalWidthCm', size='PetalWidthCm', color='Species', template='plotly_dark')
fig.show(renderer='iframe')

In [17]:
fig = px.scatter(df, x='PetalLengthCm', y='PetalWidthCm', size='SepalLengthCm', color='Species', template='plotly_dark')
fig.show(renderer='iframe')

In [18]:
fig = px.scatter(df, x='PetalLengthCm', y='PetalWidthCm', size='SepalWidthCm', color='Species', template='plotly_dark')
fig.show(renderer='iframe')

These plots show us that:
- **Setosa** is the smallest of the three flower species.
- **Virginica** is the largest of the three flower species.
- **Versicolor** and **Virginica** overlap, making them difficult to distinguish.
- Some of the features are positively correlated, for example **Petal Length** and **Petal Width**.

# Data Processing

In [19]:
# Setting the features
X = df.iloc[:,:-1]

# Setting the labels
y = df.iloc[:,-1]

In [20]:
def train_test_split(X, y, test_size=0.2, random_state=42):
    """
    Splits the data into training and testing sets.

    Parameters:
        X (numpy.ndarray): Features array of shape (n_samples, n_features).
        y (numpy.ndarray): Target array of shape (n_samples,).
        test_size (float): Proportion of samples to include in the test set. Default is 0.2.
        random_state (int): Seed for the random number generator. Default is 42.
        

    Returns:
        Tuple[numpy.ndarray]: A tuple containing X_train, X_test, y_train, y_test.
    """

    # Get number of samples
    n_samples = X.shape[0]

    # Set the seed for the random number generator
    np.random.seed(random_state)

    # Shuffle the indices
    shuffled_indices = np.random.permutation(np.arange(n_samples))

    # Determine the size of the test set
    test_size = int(n_samples * test_size)

    # Split the indices into test and train
    test_indices = shuffled_indices[:test_size]
    train_indices = shuffled_indices[test_size:]

    # Split the features and target arrays into test and train
    X_train, X_test = X[train_indices], X[test_indices]
    y_train, y_test = y[train_indices], y[test_indices]

    return X_train, X_test, y_train, y_test

In [21]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size = 0.2, random_state=42)
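
As a quick sanity check on the split logic, here is a minimal sketch of the same shuffle-and-slice idea applied to index positions only. It assumes 150 samples and a test proportion of 0.2, as in this notebook; the variable names are illustrative.

```python
import numpy as np

n_samples = 150      # number of rows in the Iris dataset
test_fraction = 0.2  # same proportion as used above

# Shuffle the row indices reproducibly
np.random.seed(42)
shuffled = np.random.permutation(n_samples)

# The first 20% of the shuffled indices form the test set, the rest the training set
n_test = int(n_samples * test_fraction)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(len(train_idx), len(test_idx))

# No index should appear in both sets
print(np.intersect1d(train_idx, test_idx).size == 0)
```

With these numbers the split yields 120 training rows and 30 test rows, and every row lands in exactly one of the two sets.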

# Model Implementation

## What is K-Nearest Neighbors (KNN)?

**K-Nearest Neighbors (KNN)** is a **supervised machine learning algorithm** primarily used for **classification**, but it can also be applied to **regression** tasks.  

It works by finding the **“k” closest data points (neighbors)** to a given input and making predictions based on:  
- The **majority class** (for classification), or  
- The **average value** (for regression).  

Since KNN makes **no assumptions** about the underlying data distribution, it is considered a **non-parametric** method. It is also an **instance-based learning** method, because predictions are made directly from the stored training examples.  

KNN is also known as a **lazy learner** because it doesn’t build an explicit model during training.  
Instead, it **stores the entire dataset** and performs computations **only at the time of prediction**.  

---

## How does KNN work?

1. **Compute Distances:**  
   Calculate the **Euclidean distance** between the new sample and all data points in the training set.

2. **Find Nearest Neighbors:**  
   Select the **k data points** that are closest to the new sample.

3. **Make a Prediction:**  
   - For **classification** → choose the **majority class** among the neighbors.  
   - For **regression** → take the **average value** of the neighbors.
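
The three steps can be sketched on a tiny made-up dataset. Everything here (the toy points, the labels `"a"` and `"b"`, the numeric targets and `k = 3`) is assumed for illustration only, and the distances are computed with NumPy broadcasting rather than an explicit loop.

```python
import numpy as np
from collections import Counter

# Toy training data (illustrative only): two well-separated groups
X_train_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                        [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_labels_toy = np.array(["a", "a", "a", "b", "b", "b"])   # classification targets
y_values_toy = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])   # regression targets

x_new = np.array([1.1, 1.0])
k = 3

# Step 1: Euclidean distance from x_new to every training point
distances = np.sqrt(((X_train_toy - x_new) ** 2).sum(axis=1))

# Step 2: indices of the k closest training points
nearest_idx = np.argsort(distances)[:k]

# Step 3a: classification -> majority class among the neighbors
predicted_label = Counter(y_labels_toy[nearest_idx]).most_common(1)[0][0]

# Step 3b: regression -> average target value of the neighbors
predicted_value = y_values_toy[nearest_idx].mean()

print(predicted_label, predicted_value)
```

Here the three nearest neighbors all belong to the first group, so the classification vote is unanimous and the regression prediction is the mean of that group's targets.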


## Euclidean Distance  

To classify a test sample $X_{\text{test}}$, we calculate its distance from every training sample $X_i$ using the Euclidean distance formula:

$$
d(X_{\text{test}}, X_i) = \sqrt{\sum_{j=1}^{n} (X_{\text{test},j} - X_{i,j})^2}
$$

### Notation
- $n$: Number of features  
- $X_{\text{test},j}$: The $j^{th}$ feature value of the test sample  
- $X_{i,j}$: The $j^{th}$ feature value of the $i^{th}$ training sample  
- $d(X_{\text{test}}, X_i)$: Euclidean distance between the test sample and the $i^{th}$ training sample 
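
As a worked example, the formula can be applied to the first two rows shown in `df.head()`. Only the sepal measurements differ (by 0.2 and 0.5), so the distance is $\sqrt{0.2^2 + 0.5^2} = \sqrt{0.29} \approx 0.5385$. The variable names below are illustrative.

```python
import numpy as np

# First two rows of the dataset, as shown in df.head()
x_a = np.array([5.1, 3.5, 1.4, 0.2])
x_b = np.array([4.9, 3.0, 1.4, 0.2])

# Direct translation of the formula: square the differences, sum them, take the root
distance = np.sqrt(np.sum((x_a - x_b) ** 2))

# np.linalg.norm of the difference vector computes the same quantity
distance_norm = np.linalg.norm(x_a - x_b)

print(round(float(distance), 4), round(float(distance_norm), 4))
```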

In [22]:
import numpy as np
from collections import Counter

class KNN:

    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def euclidean_distance(self, x1, x2):
        """
        Calculate the Euclidean distance between two data points.

        Parameters:
        -----------
        x1 : numpy.ndarray, shape (n_features,)
            A data point in the dataset.
            
        x2 : numpy.ndarray, shape (n_features,)
            A data point in the dataset.

        Returns:
        --------
        distance : float
            The Euclidean distance between x1 and x2.
        """
        return np.sqrt(np.sum((x1 - x2) ** 2))


    def fit(self, X_train, y_train):
        """
        Stores the values of X_train and y_train.

        Parameters:
        -----------
        X_train : numpy.ndarray, shape (n_samples, n_features)
            The training dataset.

        y_train : numpy.ndarray, shape (n_samples,)
            The target labels.
        """
        self.X_train = X_train
        self.y_train = y_train


    def predict(self, X):
        """
        Predicts the class labels for each example in X.

        Parameters:
        -----------
        X : numpy.ndarray, shape (n_samples, n_features)
            The test dataset.

        Returns:
        --------
        predictions : numpy.ndarray, shape (n_samples,)
            The predicted class labels for each example in X.
        """
        
        # Create an empty list to store the predictions
        predictions = []

        # Loop over the examples in X
        for x_test in X:
            # Get prediction using the prediction helper function
            prediction = self._predict(x_test)
            # Append the prediction to the predictions list
            predictions.append(prediction)
            
        return np.array(predictions)

        
    def _predict(self, x_test):
        """
        Predicts the class label for a single example.

        Parameters:
        -----------
        x_test : numpy.ndarray, shape (n_features,)
            A data point in the test dataset.

        Returns:
        --------
        most_common_label : str or int
            The predicted class label for x_test, i.e. the most frequent label among its k nearest neighbors.
        """

        # Create an empty list to store the distances
        distances = []

        # Compute the distance between x_test and every training example
        for x_train in self.X_train:
            distance = self.euclidean_distance(x_test, x_train)
            distances.append(distance)

        distances = np.array(distances)

        # Sort the distances in ascending order and keep the indices of the k nearest neighbors
        k_neighbors_idx = np.argsort(distances)[:self.k]

        # Get the labels of the k nearest neighbors
        k_neighbors_labels = self.y_train[k_neighbors_idx]

        # Get the most frequent label among the neighbors
        most_common_label = Counter(k_neighbors_labels).most_common(1)[0][0]
        return most_common_label

In [23]:
clf = KNN(7)
clf.fit(X_train, y_train)

In [24]:
def compute_accuracy(y_true, y_pred):
    """
    Computes the accuracy of a classification model.

    Parameters:
    y_true (numpy array): A numpy array of true labels for each data point.
    y_pred (numpy array): A numpy array of predicted labels for each data point.

    Returns:
    float: The accuracy of the model, expressed as a fraction between 0 and 1.
    """
    y_true = y_true.flatten()
    total_samples = len(y_true)
    correct_predictions = np.sum(y_true == y_pred)
    return correct_predictions / total_samples

In [25]:
predictions = clf.predict(X_test)
accuracy = compute_accuracy(y_test, predictions)
print(f" Classifier Accuracy  : {accuracy}")    

 Classifier Accuracy  : 0.9666666666666667
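
To make the metric concrete, here is a small sketch on made-up labels (not the Iris predictions). Accuracy is the fraction of positions where the predicted label matches the true one.

```python
import numpy as np

# Made-up labels for illustration only
y_true_demo = np.array(["a", "b", "a", "b", "a"])
y_pred_demo = np.array(["a", "b", "b", "b", "a"])

# Element-wise comparison gives a boolean array; its mean is the accuracy
matches = y_true_demo == y_pred_demo
accuracy_demo = matches.mean()

print(accuracy_demo)  # 4 of the 5 labels match
```

By the same arithmetic, assuming the 30-sample test set that `test_size=0.2` gives on 150 rows, the score above corresponds to 29 correct predictions out of 30.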


# Thank You
If you have any suggestions, please let me know.