# PHASE 4

## PART 1: K-NEAREST NEIGHBORS (KNN) ALGO

* KNN is used for classification & regression by leveraging nearby data points (neighbors).  
__Steps__:
> 1. Choose a point to predict.
> 2. Find the K nearest points (K is a predefined constant such as 1, 3 or 5).
> 3. For **classification**: Predict by taking the most common class among the K neighbors.
> 4. For **regression**: Predict by averaging the target values of the K neighbors.
> 5. **Weighted Prediction**: KNN can also weight each neighbor's influence by its distance from the point being predicted, so closer neighbors count more.
   
> Choosing the right distance metric is crucial to the success of KNN.
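
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy `points`, `classes` and `targets` values and all function names are made up for the example, and inverse-distance weighting is only one common choice for the weighted variant.

```python
import math
from collections import Counter

# Toy training data (hypothetical values, for illustration only)
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
classes = ["a", "a", "b", "b", "b"]      # labels for classification
targets = [1.0, 1.5, 3.0, 6.0, 4.0]      # values for regression

def k_nearest(query, k):
    """Return (distance, index) pairs for the k training points closest to query."""
    dists = [(math.dist(query, p), i) for i, p in enumerate(points)]
    return sorted(dists)[:k]

def classify(query, k=3):
    """Step 3: most common class among the k neighbors."""
    votes = Counter(classes[i] for _, i in k_nearest(query, k))
    return votes.most_common(1)[0][0]

def regress(query, k=3):
    """Step 4: plain average of the k neighbors' target values."""
    nearest = k_nearest(query, k)
    return sum(targets[i] for _, i in nearest) / len(nearest)

def regress_weighted(query, k=3, eps=1e-9):
    """Step 5: average weighted by 1/distance (eps avoids division by zero)."""
    nearest = k_nearest(query, k)
    weights = [1.0 / (d + eps) for d, _ in nearest]
    weighted_sum = sum(w * targets[i] for w, (_, i) in zip(weights, nearest))
    return weighted_sum / sum(weights)

query = (3.0, 4.5)
print(classify(query))
print(round(regress(query), 2))
print(round(regress_weighted(query), 2))
```

The weighted result is pulled toward the targets of the two closest neighbors, which is the point of step 5.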

* K-means is a related algorithm but it's used for unsupervised learning and clustering.
* In K-means, K represents the number of clusters, not neighbors.
* It's an iterative algorithm that groups points based on a distance metric until convergence.
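
For contrast with KNN, here is a rough sketch of the K-means loop described above (assign points to the nearest centroid, move each centroid to the mean of its points, repeat until nothing changes). The function name `kmeans`, the random initialisation from data points and the toy array `X` are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (hypothetical data)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Note there are no labels in the input: K here is the number of clusters to find, not a number of neighbors.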

### **A}** DISTANCE METRICS

> * Distance metrics quantify how similar data points are; algorithms like KNN rely on them.   
> * Data points that are closer in distance are more likely to belong to the same class.   
> * **Application**: Each dataset column represents a dimension, allowing distance measurement between points in a multi-dimensional space.

> MANHATTAN DISTANCE {c=1}  
* It measures distance by traveling along the grid axes, like walking through a city block by block: the sum of the absolute differences in each dimension.
* Extends to higher dimensions (e.g. 3D space) by summing over every dimension.
* **Best for grid based problem & works well in high dimension spaces**
> EUCLIDEAN DISTANCE {c=2}  
* It measures the straight-line distance between 2 points using the Pythagorean theorem.
* **Most Common. Best when shortest path is needed** 
> MINKOWSKI DISTANCE {c>2, cubic, fifth}  
* Generalization of Manhattan & Euclidean Distance
* Defined by a parameter c: each absolute difference is raised to the power c, the results are summed, and the c-th root of the sum is taken.
* Applied in ML, e.g. in KNN, where c is chosen to suit the problem.
* **Flexible because it reduces to Manhattan (c=1) or Euclidean (c=2) by changing the parameter c**

In [1]:
# Manhattan Distance
A = (2,3,5)
B = (1,-1,3)
manhattan_dist = sum(abs(A[i] - B[i]) for i in range(3))
manhattan_dist

7

In [6]:
# Euclidean Distance
from math import sqrt
A = (2,3,5)
B = (1,-1,3)
euclidean_distance = sqrt(sum((A[i] -B[i])**2 for i in range(3)))
print(euclidean_distance)
print(f"{euclidean_distance:.2f}")

4.58257569495584
4.58


In [8]:
# Minkowski Distance
import numpy as np
A = (2,3,5)
B = (1,-1,3)
c = 3
# Calculate the Minkowski distance between points A and B
# 1. For each dimension, calculate the absolute difference between A[i] and B[i].
# 2. Raise that difference to the power of c.
# 3. Sum these powered differences.
# 4. Take the c-th root of the sum (raise the sum to the power of 1/c).
minkowski_distance = np.power(sum(np.abs(A[i] - B[i])**c for i in range(3)),1/c)
print(minkowski_distance)
print(f"{minkowski_distance:.2f}")

4.179339196381232
4.18
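
To see that one formula covers all three metrics, the cells above can be folded into a single function. This is a small sketch; the name `minkowski` is chosen for the example and it uses the same points A and B as the cells above.

```python
def minkowski(a, b, c):
    """Minkowski distance of order c between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum |a_i - b_i|^c over all dimensions, then take the c-th root
    return sum(abs(x - y) ** c for x, y in zip(a, b)) ** (1 / c)

A = (2, 3, 5)
B = (1, -1, 3)
for c in (1, 2, 3):
    print(f"c={c}: {minkowski(A, B, c):.2f}")
# c=1: 7.00   (Manhattan)
# c=2: 4.58   (Euclidean)
# c=3: 4.18   (Minkowski with c=3)
```

Using `zip` instead of `range(3)` also means the function works for points with any number of dimensions.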


### **B}** K-NEAREST NEIGHBORS

> OVERVIEW 
* It's a supervised learning algorithm for classification & regression tasks.
* Its principle is that similar data points are close together, and distance metrics help identify similarity.
> FIT STAGE
* It stores training data with labels without calculating distances.
> PREDICTION STAGE
* Calculates the distances between the new data point and every data point in the training set.
* Identifies the K closest points (neighbors) and assigns a class based on the majority vote among those neighbors.
> DISTANCE METRICS
* Uses distance metrics like Manhattan, Euclidean or Minkowski depending on the problem
> EVALUATING PERFORMANCE
* **FOR CLASSIFICATION**: performance is measured using Accuracy, Precision, Recall and F1-Score.
* **FOR REGRESSION**: the prediction is the average of the target values of the K nearest neighbors; performance is measured with error metrics such as MAE, MSE/RMSE or R².
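
The classification metrics listed above can be computed with scikit-learn. The sketch below assumes scikit-learn's own `KNeighborsClassifier` on the Iris dataset and macro averaging for the per-class metrics; both are choices made for this example, and other averaging modes may suit other problems.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit stage stores the data; distances are computed at prediction time
model = KNeighborsClassifier(n_neighbors=3)  # default metric is Minkowski with p=2 (Euclidean)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Iris has 3 classes, so precision/recall/F1 need an averaging mode
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, average="macro"),
    "recall": recall_score(y_test, y_pred, average="macro"),
    "f1": f1_score(y_test, y_pred, average="macro"),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```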

### **C}** K-NEAREST NEIGHBORS Classifier - {*Used iris dataset*}

> **Fit Method**
* Stores training data for later use
> **_get_distances Method**
* Calculates the Euclidean distance between the test point and every training point.
> **_get_k_nearest Method**
* Sorts the distances and returns the k nearest neighbors as (index, distance) tuples
> **_get_label_prediction Method**
* Finds the most common label among the k-nearest neighbors
> **Predict method**
* Generates predictions for all test points using above methods
> **Testing Method**
* The model is tested on the Iris dataset and should reach an accuracy score of about 97%

In [15]:
# Import relevant libraries
from scipy.spatial.distance import euclidean
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Define KNN Class
class KNN:
    def __init__(self):
        pass
    def fit(self,X_train, y_train):
        # store training data and labels
        self.X_train = X_train
        self.y_train = y_train
    def _get_distances(self,x):
        # Create an empty list to store distances
        distances = []
        # enumerate through the training data to calculate distances
        for idx,point in enumerate(self.X_train):
            dist = euclidean(x,point)
            distances.append((idx,dist))
        return distances
    def _get_k_nearest(self,dists,k):
        # sort the distances by the second value in each tuple(distance)
        sorted_dists = sorted(dists,key=lambda x:x[1])
        # return the first k tuples (the k nearest neighbors)
        return sorted_dists[:k]
    def _get_label_prediction(self,k_nearest):
        # get the labels for the k_nearest neighbors
        labels = [self.y_train[idx] for idx,_ in k_nearest]
        # count the frequency of each label (np.bincount needs non-negative integer labels)
        counts = np.bincount(labels)
        # return the label with the highest frequency
        return np.argmax(counts)
    def predict(self,X_test,k=3):
        # A list to store the predictions
        preds = []
        # Iterate through all the test points
        for x in X_test:
            # get distances to all training pts
            distances = self._get_distances(x)
            # get k_nearest pts
            k_nearest = self._get_k_nearest(distances,k)
            # predict label based on the nearest neighbors
            pred = self._get_label_prediction(k_nearest)
            # Append the prediction to the list
            preds.append(pred)
        # return the predictions for all the test pts
        return preds
# load Iris dataset
iris = load_iris()
data = iris.data
target = iris.target

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size = 0.25, random_state = 0)

# Instantiate and fit the KNN model
knn = KNN()
knn.fit(X_train,y_train)

# generate predictions for the test set
preds = knn.predict(X_test, k=3)

#calculate & print the accuracy
print(f"Accuracy Score: {accuracy_score(y_test,preds)}")

Accuracy Score: 0.9736842105263158


### **D}** Finding Best value for K in KNN

> Optimal K value in KNN: 
* **A small K (k=1)**: Can lead to *overfitting* whereby the model is too sensitive to noise and small variations in the training data
* **A large K**: Can lead to *underfitting* whereby the model oversimplifies and misses important patterns
* **Odd values for K (k=3,k=5)** help avoid ties in binary classification (with more than two classes, ties can still occur)
* Generally there is no universally best value for K
> Iterating to find best K:
* Best to try different values of K, especially the odd numbers
* Plot the error for each K value.
* Choose the K where the error is lowest (equivalently, where performance is highest)
> KNN & Curse of Dimensionality
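* Due to the curse of dimensionality, KNN struggles with high-dimensional data that has many columns (features): as dimensions are added, points spread out and the nearest neighbors are barely closer than the farthest ones.
* It is also inefficient for very large datasets (thousands of columns, millions of rows): every prediction computes the distance to every training point, so prediction time grows with both the number of rows and the number of columns, while the amount of data needed to keep neighbors meaningfully close grows exponentially with the number of dimensions.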
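
A quick way to see the curse of dimensionality is to compare the nearest and farthest distances from a random query point as the number of columns grows. The sketch below uses uniformly random data, which is an assumption made purely for illustration; the helper name `nearest_to_farthest_ratio` is also made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the nearest to the farthest Euclidean distance from a random query point."""
    data = rng.random((n_points, n_dims))   # random points in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return dists.min() / dists.max()

for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(nearest_to_farthest_ratio(n_dims), 3))
```

A ratio near 0 means the nearest neighbor is far closer than the farthest point; as the ratio approaches 1, "nearest" carries very little information, which is why KNN degrades with many features.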

<img src="Images/1. Best K value.webp" alt="Best K Value" width="300" height="300">

> A smaller K (like K=1) leads to overfitting, while a larger K may lead to underfitting.    
> The optimal K value is where the error is lowest, which is K=3 in the figure above.
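
The "iterating to find best K" steps can be sketched as a loop over odd K values. This example assumes scikit-learn's `KNeighborsClassifier` and 5-fold cross-validation on the training split as the way to estimate error; the notes do not say which error estimate was used, so treat that as one reasonable option rather than the method behind the figure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

errors = {}
for k in range(1, 20, 2):  # odd K values: 1, 3, ..., 19
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data; error = 1 - accuracy
    accuracy = cross_val_score(model, X_train, y_train, cv=5).mean()
    errors[k] = 1 - accuracy

# Lowest error wins; on ties, min() keeps the smallest K (first in insertion order)
best_k = min(errors, key=errors.get)

for k, err in errors.items():
    print(f"K={k:2d}  error={err:.3f}")
print("best K:", best_k)
```

The `errors` dictionary can be passed to any plotting library to draw the error-vs-K curve. Estimating the error on the training split keeps the test set untouched for the final score.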