# K-Nearest Neighbors
## Introduction
K-Nearest Neighbors (KNN) is a simple yet powerful supervised machine learning algorithm used for classification and regression. The core idea is to predict the class (or value) of a data point from the majority class (or average value) of its k nearest neighbors in feature space. The algorithm computes the distance between the query point and every point in the training set, typically the Euclidean distance, then selects the k closest points and uses their class labels or values to make the prediction. KNN is intuitive and has no explicit training phase: it stores the training data and defers all computation to prediction time. Its performance depends on the choice of distance metric, the value of k, and the dimensionality of the data.
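
To make the idea concrete before the full implementation, here is a minimal sketch of plain (unweighted) KNN classification on a toy dataset. The function name `knn_majority_vote` and the sample points are illustrative assumptions, not part of the notebook's code.

```python
from collections import Counter

import numpy as np


def knn_majority_vote(points, labels, query, k=3):
    """Plain (unweighted) KNN: majority label among the k closest points."""
    points = np.asarray(points, dtype=float)
    # Euclidean distance from the query to every stored point
    dists = np.linalg.norm(points - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three points of class "a", two of class "b"
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
labels = ["a", "a", "a", "b", "b"]

print(knn_majority_vote(points, labels, [1.1, 1.0], k=3))  # a
print(knn_majority_vote(points, labels, [5.1, 5.0], k=3))  # b
```

In the second query only two of the three nearest neighbors belong to class "b", yet the majority still decides the label.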


## Explanation of Algorithm in Python
### Import Required Libraries
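This step imports the essential libraries: numpy for numerical computations, pandas for loading and processing the tabular dataset, random for shuffling the data, and Counter for tallying votes. (The weighted implementation below accumulates its votes in a plain dictionary, so Counter is only needed for unweighted majority voting.)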

In [1]:
import numpy as np
import pandas as pd
import random
from collections import Counter

### Normalize the Dataset
This function applies Min-Max normalization to the feature columns of the dataset. The idea is to scale every feature into a fixed range, here [0, 1], so that features with large numeric ranges do not dominate the distance computation and all features contribute comparably to the KNN algorithm’s result.
The function first extracts all feature columns (excluding the class label).
It then computes the minimum and maximum values for each feature across the dataset.
Each feature is normalized by subtracting the minimum value and dividing by the feature range (max - min).
The class label is left unchanged, and the normalized feature values are appended with the original class label for each row.

In [2]:
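def normalize_dataset(dataset):
    """Min-Max scale every feature column to [0, 1]; the last column (class label) is unchanged."""
    data = np.array(dataset, dtype=float)
    features = data[:, :-1]
    min_vals = np.min(features, axis=0)   # per-feature minimum
    max_vals = np.max(features, axis=0)   # per-feature maximum
    ranges = max_vals - min_vals
    ranges[ranges == 0] = 1.0             # constant feature: avoid division by zero (it maps to 0)
    normalized_data = []
    for row in data:
        normalized_features = (row[:-1] - min_vals) / ranges
        normalized_data.append(np.append(normalized_features, row[-1]))
    return normalized_data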
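
As a quick illustration of why scaling matters, the following sketch normalizes a tiny feature matrix whose two columns live on very different scales. The helper `min_max_scale` and the toy values are assumptions made for this example; it works on a feature-only array, whereas the function above also carries the class label along.

```python
import numpy as np


def min_max_scale(features):
    """Scale each column of a 2-D array to [0, 1]; constant columns map to 0."""
    features = np.asarray(features, dtype=float)
    min_vals = features.min(axis=0)
    ranges = features.max(axis=0) - min_vals
    ranges[ranges == 0] = 1.0  # guard against division by zero
    return (features - min_vals) / ranges


# Hypothetical toy data: column 0 spans 50 units, column 1 spans 80000 units
raw = np.array([[150.0, 20000.0],
                [175.0, 60000.0],
                [200.0, 100000.0]])

scaled = min_max_scale(raw)
print(scaled.tolist())  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Before scaling, the Euclidean distance between any two rows is almost entirely determined by the second column; after scaling, both columns contribute equally.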

### Remove Outliers
This function removes outliers using the Z-score method. Outliers are data points that deviate significantly from other points and can negatively impact the KNN algorithm. The function calculates the mean and standard deviation for each feature. 
For each row, it computes the Z-score for each feature, which measures how many standard deviations a value is away from the mean. 
If any Z-score is greater than 3 or less than -3 (meaning the value is more than 3 standard deviations away from the mean), the row is considered an outlier and removed. 
Only rows where all features have Z-scores within the [-3, 3] range are retained.

In [3]:
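def remove_outliers(dataset):
    """Drop every row in which any feature lies more than 3 standard deviations from its mean."""
    features = np.array(dataset, dtype=float)[:, :-1]
    mean = np.mean(features, axis=0)      # per-feature mean
    std_dev = np.std(features, axis=0)    # per-feature (population) standard deviation
    std_dev[std_dev == 0] = 1.0           # constant feature: z-scores become 0 instead of NaN
    cleaned_data = []
    for row in dataset:
        z_scores = (np.array(row[:-1], dtype=float) - mean) / std_dev
        if np.all(np.abs(z_scores) <= 3):  # keep rows with every z-score in [-3, 3]
            cleaned_data.append(row)
    return cleaned_data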
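
One property of the Z-score rule is worth seeing in numbers: with the population standard deviation (NumPy's default), a single extreme value among n points can never have a Z-score above sqrt(n - 1), so on very small datasets the 3-sigma rule cannot remove anything. The sketch below demonstrates this; `z_score_mask` and the toy arrays are hypothetical names and data introduced only for the demonstration.

```python
import numpy as np


def z_score_mask(features, threshold=3.0):
    """Boolean mask: True for rows whose every feature has |z| <= threshold."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0] = 1.0             # constant column: treat z as 0
    z = (features - mean) / std
    return np.all(np.abs(z) <= threshold, axis=1)


# One extreme value among n points has |z| = sqrt(n - 1), whatever its size
small = np.array([[1.0]] * 7 + [[1000.0]])    # n = 8  -> z = sqrt(7)  ~ 2.65
large = np.array([[1.0]] * 19 + [[1000.0]])   # n = 20 -> z = sqrt(19) ~ 4.36

print(int(z_score_mask(small).sum()))  # 8: the extreme value is NOT removed
print(int(z_score_mask(large).sum()))  # 19: the extreme value is removed
```

If the dataset is small, a different rule (for example one based on the interquartile range) may be more appropriate, though that is outside what this notebook implements.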

### Weighted K-Nearest Neighbors Algorithm
This is the core implementation of the K-Nearest Neighbors (KNN) algorithm with weighted voting. In regular KNN, each of the k neighbors contributes equally to the classification. In this weighted version, closer neighbors have a greater influence on the classification decision. 
First, the Euclidean distance between the test point (predict) and all the points in the training data is computed.
The k nearest neighbors are selected based on their distance.
Each neighbor's vote is weighted inversely by the distance. The closer a neighbor is, the larger its weight.
The class with the highest total weight from the neighbors is predicted.
The confidence is calculated as the proportion of the total weight attributed to the predicted class.

In [4]:
def k_nearest_neighbors(data, predict, k=3):
    """Predict the class of `predict` by inverse-distance-weighted voting among its k nearest neighbors."""
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    # Keep the k closest points, sorting on distance only
    distances = sorted(distances, key=lambda d: d[0])[:k]
    # Inverse-distance weighting; the small epsilon avoids division by zero and
    # lets an exact match (distance 0) dominate the vote instead of getting weight 1
    weights = [1 / (d[0] + 1e-9) for d in distances]
    weighted_votes = {}
    for weight, (_, group) in zip(weights, distances):
        weighted_votes[group] = weighted_votes.get(group, 0) + weight
    votes_result = max(weighted_votes, key=weighted_votes.get)
    confidence = (weighted_votes[votes_result] / sum(weights)) * 100
    return votes_result, confidence
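
The effect of weighting is easiest to see on a case where it overturns the simple majority. The sketch below works directly on precomputed (distance, label) pairs; the helper `weighted_vote`, the `eps` parameter, and the example distances are assumptions chosen for illustration.

```python
def weighted_vote(neighbors, eps=1e-9):
    """neighbors: list of (distance, label). Returns (label, confidence in percent)."""
    totals = {}
    for dist, label in neighbors:
        # Each neighbor votes with weight 1 / distance (eps guards against distance 0)
        totals[label] = totals.get(label, 0.0) + 1.0 / (dist + eps)
    winner = max(totals, key=totals.get)
    return winner, 100.0 * totals[winner] / sum(totals.values())


# One "A" neighbor at distance 0.1 and two "B" neighbors at distance 0.5
neighbors = [(0.1, "A"), (0.5, "B"), (0.5, "B")]
label, confidence = weighted_vote(neighbors)

# Weights: A = 1/0.1 = 10, B = 1/0.5 + 1/0.5 = 4, so A wins with 10/14 of the weight
print(label, round(confidence, 1))  # A 71.4
```

An unweighted majority vote over the same three neighbors would predict "B" (two votes against one), which is exactly the situation the weighted version is meant to correct.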

### Cross-Validation
Cross-validation assesses the model's performance more reliably than a single train/test split. The dataset is divided into a number of equal parts (folds); this fold count is unrelated to the k of KNN, which is why the function below takes them as two separate parameters, `folds` and `k`.
For each fold, the model is trained on the remaining folds and evaluated on the fold that was held out.
The accuracy is computed for each fold and then averaged to give an overall performance measure.

In [5]:
def cross_validate(dataset, k, folds=5):
    """Return the mean accuracy of weighted KNN (k neighbors) over `folds` folds."""
    fold_size = len(dataset) // folds
    classes = set(row[-1] for row in dataset)
    accuracies = []
    for i in range(folds):
        start = i * fold_size
        # The last fold also takes any leftover rows, so every row is tested exactly once
        end = (i + 1) * fold_size if i < folds - 1 else len(dataset)
        train_data = dataset[:start] + dataset[end:]
        test_data = dataset[start:end]
        # Group the training rows by class label, as k_nearest_neighbors expects
        train_set = {cls: [] for cls in classes}
        for row in train_data:
            train_set[row[-1]].append(row[:-1])
        correct = 0
        for row in test_data:
            vote, _ = k_nearest_neighbors(train_set, row[:-1], k=k)
            if vote == row[-1]:
                correct += 1
        if test_data:  # skip empty folds (fewer rows than folds)
            accuracies.append(correct / len(test_data))
    return sum(accuracies) / len(accuracies)
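
The fold bookkeeping can also be expressed with row indices alone. The following sketch uses `np.array_split`, which tolerates row counts that are not a multiple of the fold count; the generator `fold_indices` is a hypothetical helper, and it assumes at least two folds.

```python
import numpy as np


def fold_indices(n_rows, folds=5):
    """Yield (train_idx, test_idx) pairs; every row is held out exactly once.

    Assumes folds >= 2, otherwise there would be no training rows.
    """
    parts = np.array_split(np.arange(n_rows), folds)
    for i, test_idx in enumerate(parts):
        # Training indices are all parts except the held-out one
        train_idx = np.concatenate(parts[:i] + parts[i + 1:])
        yield train_idx, test_idx


splits = [(tr.tolist(), te.tolist()) for tr, te in fold_indices(7, folds=3)]
for train_idx, test_idx in splits:
    print("test:", test_idx, "train:", train_idx)
# test: [0, 1, 2] train: [3, 4, 5, 6]
# test: [3, 4] train: [0, 1, 2, 5, 6]
# test: [5, 6] train: [0, 1, 2, 3, 4]
```

With 7 rows and 3 folds the fold sizes are 3, 2, and 2, so no row is left out of evaluation.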


### Implementing the Model
This main script loads the dataset, preprocesses it, and evaluates the KNN model. It drops rows with missing values (marked `?` in the CSV), removes outliers, and normalizes the features. The data is then shuffled and split into a training set (80%) and a testing set (20%). Cross-validation selects the best k from 1 to 10, and the model's accuracy on the test set is then calculated and displayed.

In [7]:
if __name__ == "__main__":
    df = pd.read_csv(r"C:\Users\arsha\OneDrive - Manipal Academy of Higher Education\Desktop\Cryptonite\Sample_Datasets\knn_dataset.csv")
    # Treat '?' as a missing value and drop incomplete rows
    df.replace('?', np.nan, inplace=True)
    df.dropna(inplace=True)
    full_data = df.astype(float).values.tolist()
    full_data = remove_outliers(full_data)
    full_data = normalize_dataset(full_data)
    random.shuffle(full_data)
    # Hold out the last 20% of the shuffled rows as the test set
    test_size = 0.2
    split = len(full_data) - int(test_size * len(full_data))
    train_set = {cls: [] for cls in set(row[-1] for row in full_data)}
    test_set = {cls: [] for cls in set(row[-1] for row in full_data)}
    training_data = full_data[:split]
    testing_data = full_data[split:]
    for row in training_data:
        train_set[row[-1]].append(row[:-1])
    for row in testing_data:
        test_set[row[-1]].append(row[:-1])
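    # Choose k by cross-validation on the training rows only, so that the
    # held-out test rows play no part in model selection
    best_k = 1
    best_accuracy = 0
    for k in range(1, 11):
        accuracy = cross_validate(training_data, k=k, folds=5)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_k = k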
    correct = 0
    total = 0
    for group in test_set:
        for data in test_set[group]:
            vote, confidence = k_nearest_neighbors(train_set, data, k=best_k)
            if group == vote:
                correct += 1
            total += 1
    print(f"Test Set Accuracy: {(correct / total) * 100:.2f}%")

Test Set Accuracy: 99.21%
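
Because the script shuffles the rows without a fixed seed, the reported accuracy will differ slightly from run to run. A minimal sketch of a repeatable shuffle-and-split is shown below; the helper `shuffle_and_split` and its `seed` parameter are assumptions introduced here and do not appear in the script above.

```python
import random


def shuffle_and_split(rows, test_size=0.2, seed=0):
    """Return (train_rows, test_rows) from a seeded shuffle of `rows`."""
    rows = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)    # private generator: same order for the same seed
    split = len(rows) - int(test_size * len(rows))
    return rows[:split], rows[split:]


train_rows, test_rows = shuffle_and_split(range(10), test_size=0.2, seed=42)
print(len(train_rows), len(test_rows))  # 8 2
```

Calling the helper twice with the same seed returns the same partition, which makes a reported accuracy reproducible.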
