Name = Goutam Kumar Sah

Roll Number = 2312res271

**Experiment No = 1**

Title = K-Nearest Neighbors (KNN)

Aim = Implementation of the K-Nearest Neighbors (KNN) algorithm

K-Nearest Neighbors (KNN) is a simple, intuitive machine learning algorithm used for classification and regression. It finds the K training points closest to a new input point and uses these neighbors to predict the output. For classification, it takes a “majority vote” among the neighbors’ class labels; for regression, it takes the average (or sometimes the median) of the neighbors’ values. The number of neighbors, K, is a critical parameter: too small a value makes predictions sensitive to noisy points, while too large a value smooths over local structure and pulls predictions toward the most common class.
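
The two prediction rules can be sketched in a few lines. This is only an illustration: the helper names `vote` and `average` and the neighbor values are made up for the example, and the neighbors are assumed to have been found already.

```python
from collections import Counter
from statistics import mean

def vote(neighbor_labels):
    """Classification: return the most common label among the K neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def average(neighbor_values):
    """Regression: return the mean of the K neighbors' target values."""
    return mean(neighbor_values)

# Hypothetical neighbors of one query point, K = 3
print(vote(["setosa", "versicolor", "setosa"]))  # setosa
print(average([2.0, 3.0, 4.0]))                  # 3.0
```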

To determine the neighbors, KNN uses a distance metric, most commonly Euclidean distance; Manhattan distance and the more general Minkowski distance (which includes both as special cases) are also used. After calculating the distance from the query to every training point, KNN sorts the distances in ascending order and selects the K nearest points. The choice of metric matters most when features have different scales or types, because a feature with a large numeric range can dominate the distance.
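
As a small sketch of how these metrics relate, the Minkowski distance of order p reduces to Manhattan distance for p = 1 and Euclidean distance for p = 2. The function names below are chosen for this example only.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan_distance(a, b):
    # Special case p = 1: sum of absolute differences
    return minkowski_distance(a, b, 1)

def euclidean_distance(a, b):
    # Special case p = 2: straight-line distance
    return minkowski_distance(a, b, 2)

a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan_distance(a, b))  # 7.0
print(euclidean_distance(a, b))  # 5.0
```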

KNN is often considered a non-parametric and lazy learning algorithm. As a non-parametric model, it makes no assumptions about the data distribution, allowing it to work well with complex datasets. Being a lazy learner, KNN does not train a model in advance. Instead, it stores all data points and performs computations only when making predictions, which can be computationally demanding for large datasets.
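
The "lazy" behaviour can be illustrated with a toy class in which `fit` only stores the data and all distance computation happens in `predict_one`. `LazyKNN` and the sample points are hypothetical and are not part of the experiment code.

```python
import math
from collections import Counter

class LazyKNN:
    """Toy KNN classifier: fit() only stores data, predict_one() does the work."""

    def __init__(self, k):
        self.k = k

    def fit(self, points, labels):
        # No model is built here; the training set is simply memorised.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict_one(self, query):
        # Every stored point is compared with the query at prediction time.
        distances = [math.dist(p, query) for p in self.points]
        nearest = sorted(range(len(distances)), key=distances.__getitem__)[: self.k]
        return Counter(self.labels[i] for i in nearest).most_common(1)[0][0]

model = LazyKNN(k=3).fit(
    [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)],
    ["a", "a", "a", "b", "b"],
)
print(model.predict_one((0.5, 0.5)))  # a
print(model.predict_one((5, 5.5)))    # b
```

Each call to `predict_one` costs one distance computation per stored point, which is why prediction becomes expensive as the training set grows.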

Because each prediction compares the query with every stored point, KNN is best suited to smaller, lower-dimensional datasets. It is also sensitive to irrelevant features and usually benefits from feature scaling and feature selection. Common applications include image recognition (e.g., identifying handwritten digits), recommendation systems, and medical diagnosis prediction based on symptoms.
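
To see why scaling matters, the sketch below uses made-up points whose second feature has a range about 100 times larger than the first. Min-max scaling is used here purely for illustration; it is not the scaler used in the experiment below.

```python
import math

def nearest_index(points, query):
    """Index of the point closest to query (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

def min_max_scale(points, reference):
    """Rescale each feature to [0, 1] using the min and max found in reference."""
    columns = list(zip(*reference))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# Hypothetical data: feature 1 spans 0..10, feature 2 spans 0..1000
train = [(1.2, 560.0), (9.0, 510.0), (0.0, 0.0), (10.0, 1000.0)]
query = (1.0, 500.0)

print(nearest_index(train, query))  # 1: the large-range feature dominates

scaled_train = min_max_scale(train, train)
scaled_query = min_max_scale([query], train)[0]
print(nearest_index(scaled_train, scaled_query))  # 0: both features now count
```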

While KNN’s simplicity and interpretability make it popular, its performance can degrade with large, high-dimensional datasets. Nonetheless, with careful tuning of parameters and preprocessing, KNN is an effective and versatile tool for many applications in machine learning.
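
One common way to tune K is to score several candidate values on held-out data. The sketch below uses leave-one-out evaluation on a tiny made-up dataset with one deliberately mislabelled point; the data and function names are assumptions for illustration and are unrelated to the iris experiment.

```python
import math
from collections import Counter

def predict(train_x, train_y, query, k):
    """Majority vote among the k training points nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def leave_one_out_accuracy(xs, ys, k):
    """Predict each point from all the others and return the hit rate."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        hits += predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)

# Two clusters plus one noisy "b" label placed inside the "a" cluster
xs = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6), (0.5, 0.5)]
ys = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

for k in (1, 3, 5, 7):
    print(k, leave_one_out_accuracy(xs, ys, k))
```

On this toy set K = 3 and K = 5 score highest, while K = 1 (misled by the noisy point) and K = 7 (too many neighbors for such small clusters) both do worse.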

Platform = Google Colab

# Manual code

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


df = pd.read_csv('/content/iris.csv')
x = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

In [4]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state fixes the shuffle.
# x and y are already NumPy arrays (from .values), so the splits are too.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=0
)

In [5]:
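from sklearn.preprocessing import Normalizer
# Normalizer rescales each sample (row) to unit L2 norm. It is stateless,
# so fit() learns nothing from x_train; it is not per-feature scaling.
scaler = Normalizer().fit(x_train)
normalized_x_train = scaler.transform(x_train)
normalized_x_test = scaler.transform(x_test)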
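
For reference, with its default `norm='l2'` setting, `Normalizer` divides each row by that row's own Euclidean norm. A NumPy-only sketch of the same operation is shown below; `l2_normalize_rows` is a name made up for this example and it assumes no row is all zeros.

```python
import numpy as np

def l2_normalize_rows(x):
    """Divide each row by its own Euclidean norm (per sample, not per feature)."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)  # assumes no all-zero rows
    return x / norms

sample = np.array([[3.0, 4.0], [6.0, 8.0]])
print(l2_normalize_rows(sample))
# Both rows become [0.6, 0.8]: magnitude is discarded, only direction is kept.
```

This differs from per-feature scalers such as `StandardScaler` or `MinMaxScaler`, which rescale each column using statistics learned from the training set.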

In [6]:
from collections import Counter

def calculate_euclidean_distance(training_data, test_point):
    """Return a DataFrame of distances from test_point to every training row."""
    distances = []
    for current_train_point in training_data:
        # Sum of squared differences over all features, then square root
        squared_sum = 0
        for j in range(len(current_train_point)):
            squared_sum += (current_train_point[j] - test_point[j]) ** 2
        distances.append(np.sqrt(squared_sum))

    # Row i of the DataFrame corresponds to row i of training_data
    return pd.DataFrame(data=distances, columns=['distance'])

def find_nearest_neighbors(distance_df, K):
    """Keep the K rows with the smallest distances; the index identifies them."""
    return distance_df.sort_values(by=['distance']).head(K)

def make_prediction(nearest_neighbors_df, training_labels):
    """Majority vote over the labels of the K nearest neighbors."""
    votes = Counter(training_labels[nearest_neighbors_df.index])
    return votes.most_common(1)[0][0]

def knn_classifier(training_data, training_labels, test_data, K):
    """Predict a label for every row of test_data."""
    predictions = []
    for test_point in test_data:
        distance_df = calculate_euclidean_distance(training_data, test_point)
        nearest_neighbors_df = find_nearest_neighbors(distance_df, K)
        predicted_label = make_prediction(nearest_neighbors_df, training_labels)
        predictions.append(predicted_label)
    return predictions
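
The nested Python loops above are easy to follow but slow on larger inputs. A possible vectorized variant, sketched below, computes all pairwise distances with NumPy broadcasting. `knn_classifier_vectorized` is not part of the original notebook; it assumes the (n_test, n_train, n_features) intermediate array fits in memory, and it should give the same predictions as `knn_classifier` whenever there are no ties in distance.

```python
import numpy as np
from collections import Counter

def knn_classifier_vectorized(training_data, training_labels, test_data, K):
    """KNN classification with pairwise distances computed by broadcasting."""
    training_data = np.asarray(training_data, dtype=float)
    test_data = np.asarray(test_data, dtype=float)
    training_labels = np.asarray(training_labels)

    # diffs has shape (n_test, n_train, n_features)
    diffs = test_data[:, np.newaxis, :] - training_data[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))  # shape (n_test, n_train)

    # Indices of the K smallest distances for each test point
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :K]
    return [
        Counter(training_labels[row].tolist()).most_common(1)[0][0]
        for row in nearest
    ]

# Toy data for illustration, not the iris set
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
train_y = ["a", "a", "a", "b", "b"]
print(knn_classifier_vectorized(train_x, train_y, [[0.2, 0.2], [5.0, 5.4]], 3))
# ['a', 'b']
```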



In [8]:
k = 3
y_pred = knn_classifier(normalized_x_train, y_train, normalized_x_test, k)
print(y_pred)

['virginica', 'versicolor', 'setosa', 'virginica', 'setosa', 'virginica', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'setosa', 'versicolor', 'virginica', 'setosa', 'setosa', 'virginica', 'versicolor', 'setosa', 'setosa', 'virginica', 'setosa', 'setosa', 'versicolor', 'versicolor', 'setosa']


In [9]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.9666666666666667
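
Accuracy is the fraction of test labels predicted correctly. With the 30 predictions printed above, a score of 0.9666666666666667 corresponds to 29 correct out of 30. A minimal sketch of the calculation, using made-up labels rather than the actual `y_test`:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: 30 samples with exactly one mistake
true_labels = ["setosa"] * 30
predicted_labels = ["setosa"] * 29 + ["versicolor"]
print(accuracy(true_labels, predicted_labels))  # 0.9666666666666667
```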


# Using the scikit-learn library

In [10]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(normalized_x_train, y_train)
y_pred = classifier.predict(normalized_x_test)

In [11]:
print(accuracy_score(y_test, y_pred))

0.9666666666666667
