# **K-Nearest Neighbors (KNN)**

# We must know...

KNN is a flexible technique that can be applied to both regression and classification tasks. It is a non-parametric model, so it does not make the assumptions about the data that a linear regression model does, such as a linear relationship between the features and the target. It is a supervised learning technique, so it needs labeled data to work with: each training example pairs a feature vector x with a known output y.

The objective is to identify a function h: X -> Y that gives a reliable prediction of the output y for an unseen observation x. KNN does this by measuring how close the observation is to the training points, here with the Euclidean metric. The same procedure handles both classification and regression: you specify the number of neighbors k, and those neighbors determine the prediction (a majority vote of their labels for classification, an average of their values for regression). To compute the Euclidean distance between two arrays, the element-wise differences are squared and summed, and the square root of that sum is taken.
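Written out, the Euclidean distance described above looks as follows. The symbols p, q, and n are notation introduced here for illustration: p and q are two points with n features each.

```latex
d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} \left(p_i - q_i\right)^2}
```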

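A common way to choose K is to repeat the evaluation while iterating K over the odd numbers within a specified range, and keep the value that scores best on held-out data. Odd values avoid tied votes in two-class problems. Too many neighbors usually blurs the class boundaries, because distant points start to influence the vote, while too few makes the prediction sensitive to noise. When you've determined a good value, you can use it for further training and testing.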
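The search over odd K values could be sketched as below. This is a minimal illustration, not the notebook's own procedure: the synthetic two-class data, the 25% hold-out split, the range 1 to 21, and the helper name `knn_predict` are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_query, k):
    """Label each query row by majority vote of its k nearest training rows."""
    preds = []
    for q in X_query:
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        preds.append(Counter(y_train[nearest].tolist()).most_common(1)[0][0])
    return np.array(preds)

# Synthetic two-class data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(2.5, 1.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

# Hold out 25% of the shuffled rows for validation.
order = rng.permutation(len(X))
val_idx, train_idx = order[:30], order[30:]

scores = {}
for k in range(1, 22, 2):  # odd values 1, 3, ..., 21
    preds = knn_predict(X[train_idx], y[train_idx], X[val_idx], k)
    scores[k] = float((preds == y[val_idx]).mean())

# Keep the K with the highest validation accuracy (first one wins on ties).
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Scoring on held-out rows rather than on the training rows matters here: with K = 1 every training point is its own nearest neighbor, so training accuracy would always look perfect.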

To sum up, the implementation of KNN involves the following steps:

* Calculate the Euclidean distance between a test data point and every training data point. This shows which training points are close and how far away the others are.
* Sort the distances and pick the K smallest (the first K entries). Those are the K closest neighbors of the given test data point.
* Get the labels of the selected K neighbors. The most common label (the label with a majority vote) is the predicted label for the test data point.
* Repeat everything above for all the test data points in your test set.
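The four steps above can be sketched compactly with NumPy. This is a minimal illustration on made-up points; the function names `k_nearest_labels` and `classify` and the toy arrays are assumptions for the example, not part of the notebook.

```python
import numpy as np
from collections import Counter

def k_nearest_labels(X_train, y_train, query, k):
    # Step 1: Euclidean distance from the query to every training point.
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Step 3 (first half): labels of those k neighbors.
    return y_train[nearest]

def classify(X_train, y_train, query, k=3):
    # Step 3 (second half): the most common neighbor label wins.
    labels = k_nearest_labels(X_train, y_train, query, k)
    return Counter(labels.tolist()).most_common(1)[0][0]

# Toy data: three points near (1, 1) labeled 0, three near (6, 6) labeled 1.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.5], [6.5, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[1.2, 1.4], [6.4, 6.8]])

# Step 4: repeat for every test point.
predictions = [classify(X_train, y_train, q, k=3) for q in X_test]
print(predictions)  # [0, 1]
```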


# Pseudocode

1. Define the KNN class:
* Create a class named "KNN" to encapsulate the K-Nearest Neighbors algorithm.

2. Constructor - `__init__(self, k=5)`:
* Create a constructor to set the number of neighbors (k), with a default value of 5.

3. Fit the Model - `fit(self, X, y)`:
* Create a method to fit the KNN model with training data.
* Store the training data (X) in the class attribute `self.X_train`.
* Store the corresponding labels (y) in the class attribute `self.y_train`.

4. Calculate Euclidean Distance - `distance(self, X1, X2)`:
* Create a method to calculate the Euclidean distance between two data points (X1 and X2).
* Initialize a variable `distance` to 0.
* Loop through the features of the data points and add each squared difference to `distance`.
* Return the square root of the computed distance.

5. Make Predictions - `predict(self, X_test)`:
* Create a method to make predictions for the test data.
* Initialize an empty list `sorted_output` to store predictions.
* Loop through each test data point in `X_test`.
* For each test data point, calculate the distances to all training data points.
* Store the distances and corresponding training data indices in the `distances` list.
* Sort the distances in ascending order.
* Select the first `k` distances (k nearest neighbors).
* Retrieve the labels of the k nearest neighbors and store them in the `neighbors` list.
* Predict the class label as the majority class among the k nearest neighbors.
* Append the prediction to the `sorted_output` list.
* Return the list of predictions.

6. Calculate Accuracy - `score(self, X_test, y_test)`:
* Create a method to calculate the accuracy of the model.
* Make predictions for the test data using the `predict` method.
* Count the number of correct predictions by comparing predictions with actual labels.
* Calculate the accuracy as the ratio of correct predictions to the total number of test data points.

7. Dataset Preprocessing:
* Import the dataset from a CSV file (e.g., "diabetes.csv").
* Select a subset of the dataset (e.g., the first 120 rows) for training and testing.

8. Feature Selection and Label Extraction:
* Extract the features (X) and labels (y) from the dataset.

9. Split Data into Training and Test Sets:
* Split the data into training and test sets, e.g., using a function like `train_test_split`.
* Set a random seed for reproducibility and specify the test set size (e.g., 20% of the data).

10. Initialize the KNN Model:
* Create an instance of the KNN class with a specified number of neighbors (e.g., k=5).

11. Fit the Model:
* Fit the KNN model with the training data using the `fit` method.

12. Make Predictions:
* Use the `predict` method to make predictions for the test data.
* Display each prediction.

13. Calculate and Display Accuracy:
* Use the `score` method to calculate the accuracy of the model on the test data.
* Display the accuracy of the model.
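To make step 9 concrete, here is a rough NumPy sketch of what a seeded train/test split does. The helper `split_train_test` is hypothetical and written only for illustration; it is not scikit-learn's implementation, and the toy arrays are made up.

```python
import numpy as np

def split_train_test(X, y, test_size=0.2, seed=40):
    """Shuffle row indices with a fixed seed, then cut off the test fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows with 2 features each, alternating 0/1 labels.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

X_train, X_test, y_train, y_test = split_train_test(X, y, test_size=0.2, seed=40)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Fixing the seed means the same rows land in the test set on every run, which is what makes accuracy figures comparable between runs.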


# CODE EXPLANATION

In the code I use the numpy and pandas libraries, plus `train_test_split` from `sklearn.model_selection`.




In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd

Explanation

* We import numpy first and then create the class for our KNN algorithm. The model is built around the number of neighbors consulted for each prediction. Let k be that number of neighbors; in this case it defaults to k = 5.

* Secondly, the fit function stores the training data on the instance as X_train and y_train, which is what we will call them moving forward. X and y must have the same length, otherwise we will have trouble down the road.

* Thirdly, the distance function calculates the Euclidean distance between two points. It takes two arrays, which are converted to np.arrays. The distance is then initialized to 0, which is the distance between two identical points. For each feature, we take the difference between the two values and square it. Squaring removes the sign and gives larger differences more weight. Finally, we return the square root of the summed squares.

* The X_test inputs are passed to the predict function. To start, we create an empty list for our sorted_output. A for loop then iterates over the rows of X_test. Inside this loop we create two more empty lists, one for distances and one for neighbors. These hold the results of the distance calculations and the neighbor labels used for the prediction. A second for loop, nested inside the first, iterates over the X_train data that has already been fitted to the model. Each iteration calculates the distance between that row of X_train and the current row of X_test.

  This calculation is added to the distances list together with the index of the training row. The list is then sorted; sklearn approaches this from a different angle, but this was the most straightforward way that I found. Once it is sorted, we slice the list down to the most relevant datapoints, which will be 0:k (k being the number of neighbors). Another for loop then runs over the sliced distances and appends the y_train value of each instance to the neighbors list. This gives the labels of the k nearest neighbors. The most common label in that list is the prediction (a majority vote; taking `max(neighbors)` instead would return 1 on 0/1 labels whenever any single neighbor is labeled 1). That value is appended to the sorted outputs, and once every test row has been processed the outputs are returned as the predictions.


* With the last function, called 'score', we obtain the accuracy of the predictions. It accepts X_test and y_test arguments and builds the list of predictions by running X_test through the predict method. It then counts how many predictions equal the corresponding y_test labels and divides by the length of y_test, which gives the fraction that is correct. Scoring only works if we have both the test values and their true labels.


In [None]:
from collections import Counter

class KNN:
    def __init__(self, k=5):
        # Number of neighbors consulted for each prediction
        self.k = k

    def fit(self, X, y):
        # KNN has no parameters to learn; it only stores the training data
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        self.X_train = X
        self.y_train = y

    def distance(self, X1, X2):
        X1, X2 = np.array(X1), np.array(X2)
        distance = 0
        # Sum the squared difference over every feature
        for i in range(len(X1)):
            distance += (X1[i] - X2[i]) ** 2
        return np.sqrt(distance)

    def predict(self, X_test):
        sorted_output = []
        for i in range(len(X_test)):
            distances = []
            neighbors = []
            # Distance from this test point to every training point
            for j in range(len(self.X_train)):
                dist = self.distance(self.X_train[j], X_test[i])
                distances.append([dist, j])
            # Keep the k closest training points
            distances.sort()
            distances = distances[0:self.k]
            for _, j in distances:
                neighbors.append(self.y_train[j])
            # Majority vote among the neighbor labels
            ans = Counter(neighbors).most_common(1)[0][0]
            sorted_output.append(ans)

        return sorted_output

    def score(self, X_test, y_test):
        # Fraction of predictions that match the true labels
        predictions = np.array(self.predict(X_test))
        return (predictions == np.array(y_test)).sum() / len(y_test)
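As a side note on the distance calculation, the feature-by-feature loop can also be written in vectorized form. The sketch below compares the two on made-up vectors; the function names `euclidean_loop` and `euclidean_vectorized` are illustrative and not part of the class above.

```python
import numpy as np

def euclidean_loop(x1, x2):
    """Accumulate squared differences feature by feature, then take the root."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    total = 0.0
    for i in range(len(x1)):
        total += (x1[i] - x2[i]) ** 2
    return np.sqrt(total)

def euclidean_vectorized(x1, x2):
    """Same quantity computed as the L2 norm of the difference vector."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.linalg.norm(diff)

a, b = [0.0, 3.0, 1.0], [4.0, 0.0, 1.0]
print(euclidean_loop(a, b), euclidean_vectorized(a, b))  # 5.0 5.0
```

The vectorized form tends to be faster on long feature vectors because the arithmetic runs inside NumPy rather than in a Python loop.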

* I load the 'diabetes' dataset, keep the first 120 rows, and run it through train_test_split with a test size of 20%.

* I first assign an instance of the class to a variable and set the number of neighbors. For this situation I used 5, an odd number.

* I fit the model to our training data.

* I test that the prediction process is working. This yields a list of predicted tags, 1 for diabetes and 0 for no diabetes, one for each row of the test set.

* I run the scoring method to see how well the model did.

* I check, class by class, how many predictions were correct and how many were not.


In [None]:
#Importing dataset, I used the diabetes dataset
dataset = pd.read_csv("diabetes.csv", sep=",")
dataset = dataset.head(120)
X = dataset[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI','DiabetesPedigreeFunction','Age']].values
y = dataset['Outcome'].values
X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=40, test_size=0.2)

# Initialize model
testing_neighbors = KNN(k=5)

# We fit the model to the training data
testing_neighbors.fit(X_train, y_train)

# We score the prediction accuracy.
score = testing_neighbors.score(X_test, y_test)
print('The accuracy of the model is :', score)

# Run predictions using the test sample data.
prediction = testing_neighbors.predict(X_test)
print('\nPredictions:\n', prediction)


prediction == y_test

# Classification results
pred_arr = np.array(prediction)  # array form makes the element-wise comparisons explicit
print("\nClassification Results:")
for cls in np.unique(y_test):
    correct_count = np.sum((y_test == cls) & (pred_arr == cls))
    incorrect_count = np.sum((y_test == cls) & (pred_arr != cls))
    print(f"Class {cls}: {correct_count} classified correctly, {incorrect_count} incorrect")

The accuracy of the model is : 0.7916666666666666

Predictions:
 [1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Classification Results:
Class 0: 5 classified correctly, 5 incorrect
Class 1: 14 classified correctly, 0 incorrect
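The per-class counts can also be arranged as a confusion matrix, which shows what each misclassified point was predicted as. The sketch below is illustrative only: the helper `confusion_matrix` is written here by hand, and the label arrays are made up rather than taken from the diabetes run.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[index[t], index[p]] += 1
    return matrix

# Made-up labels for illustration.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[2 1]
#  [1 2]]
```

The diagonal holds the correct predictions per class, so accuracy is the sum of the diagonal divided by the total count.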


# LOSS + OPTIMIZATION FUNCTION

The K-Nearest Neighbors (KNN) algorithm has no explicit loss function to minimize and no optimization routine, because:

1. KNN is instance-based and doesn't learn model parameters.
2. KNN's training phase only stores the training data; all computation happens at prediction time.
3. Predictions are directly based on nearest neighbors, not model parameters.
4. Hyperparameters, like 'k', are typically chosen by comparing validation accuracy across candidate values rather than by gradient-based optimization.

In parametric models, loss and optimization functions adjust the model parameters during training, which doesn't apply to KNN's simple, non-parametric approach.