
## **Finding the Best K-value**

- Finding the best value for K in a K-Nearest Neighbors (KNN) algorithm is a crucial step in optimizing the model’s performance
- The value of K determines how many neighbors to consider when classifying a new data point or making a regression prediction

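**Steps to Find the Best K:**
1. **Split the data into training and validation sets:** hold out data that the model never sees during fitting.
2. **Choose a range of K values:** for example, 1 to 20.
3. **Train the model with each K value.**
4. **Use cross-validation:** average the score over several folds so that the estimate for each K does not depend on a single split.
5. **Evaluate the results:** compare the validation metric (e.g., accuracy) across the K values.
6. **Select the best K:** choose the K that strikes the best balance between bias and variance. A small K gives a flexible, high-variance model, while a large K gives a smoother, higher-bias one. The best K is the point where the model generalizes well to new data.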
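The bias-variance trade-off across K can be made visible by comparing training and validation accuracy. The following is a minimal sketch, assuming the Iris dataset and scikit-learn's `validation_curve` helper; the names `k_values`, `mean_train`, `mean_val`, and `best_k` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Assumption: Iris is used purely as a small, built-in example dataset
X_demo, y_demo = load_iris(return_X_y=True)
k_values = np.arange(1, 21)

# For each K, fit on the training folds and score on both the training
# folds and the held-out fold (5-fold cross-validation)
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(), X_demo, y_demo,
    param_name="n_neighbors", param_range=k_values,
    cv=5, scoring="accuracy",
)

# Average over folds: rows are K values, columns are folds
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
best_k = int(k_values[np.argmax(mean_val)])

for k, tr, va in zip(k_values, mean_train, mean_val):
    print(f"K={k:2d}  train={tr:.3f}  validation={va:.3f}")
print(f"K with the highest mean validation accuracy: {best_k}")
```

A large gap between training and validation accuracy at small K suggests high variance, while both scores dropping together at large K suggests high bias.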

## KNN with scikit-learn


- KNN is **easy to implement** and **works well for small datasets**
- Choosing the right value of K is crucial, and **cross-validation helps select the optimal value**
- Distance metrics and weighting can improve performance for certain problems

In [None]:
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


In [None]:
# Import the dataset
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features (inputs)
y = iris.target  # Target labels (outputs)


In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)


In [None]:
# Choosing a value of K
# Instantiate the KNN classifier with K=3
knn = KNeighborsClassifier(n_neighbors=3)


In [None]:
# Fit the KNN model to the training data
knn.fit(X_train, y_train)


In [None]:
# Make predictions on the test data
y_pred = knn.predict(X_test)


In [None]:
# Calculate the accuracy of the model on the test data
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.95


In [None]:
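# Testing K values from 1 to 20
# Note: picking K from test-set accuracy gives an optimistic estimate;
# cross-validation on the training data is a safer way to choose K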
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(f"K={k}, Accuracy: {accuracy_score(y_test, y_pred):.2f}")


K=1, Accuracy: 0.89
K=2, Accuracy: 0.89
K=3, Accuracy: 0.95
K=4, Accuracy: 0.95
K=5, Accuracy: 0.97
K=6, Accuracy: 0.95
K=7, Accuracy: 0.97
K=8, Accuracy: 0.97
K=9, Accuracy: 0.97
K=10, Accuracy: 0.95
K=11, Accuracy: 0.97
K=12, Accuracy: 0.97
K=13, Accuracy: 0.97
K=14, Accuracy: 0.97
K=15, Accuracy: 0.97
K=16, Accuracy: 0.97
K=17, Accuracy: 0.97
K=18, Accuracy: 0.92
K=19, Accuracy: 0.95
K=20, Accuracy: 0.89


In [None]:
# Using KNN for regression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Note: the Iris targets are class labels (0, 1, 2), so treating them as
# numeric values here only illustrates the regressor API

# Instantiate the KNN regressor
knn_regressor = KNeighborsRegressor(n_neighbors=3)

# Fit the regressor model
knn_regressor.fit(X_train, y_train)

# Make predictions (each prediction is the mean target of the 3 nearest neighbors)
y_pred_reg = knn_regressor.predict(X_test)

# Evaluate the regressor using mean squared error
mse = mean_squared_error(y_test, y_pred_reg)
print(f"Mean Squared Error: {mse:.2f}")


Mean Squared Error: 0.03
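Because the Iris targets are class labels, the regression example above mainly demonstrates the API. A sketch on a genuinely continuous target might look like the following, assuming scikit-learn's bundled diabetes dataset; the variable names (`X_diab`, `reg`, and so on) are chosen for illustration and kept separate from the notebook's own variables.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Assumption: the diabetes dataset stands in for any continuous-target problem
X_diab, y_diab = load_diabetes(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_diab, y_diab, test_size=0.25, random_state=123
)

# Each prediction is the average target of the 5 nearest training points
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(Xd_train, yd_train)
yd_pred = reg.predict(Xd_test)

mse_diab = mean_squared_error(yd_test, yd_pred)
print(f"Mean Squared Error: {mse_diab:.1f}")
print(f"R^2 on the test set: {reg.score(Xd_test, yd_test):.2f}")
```

Since predictions are neighbor averages, they always fall within the range of the training targets.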


**Advanced Options: Distance Metrics, Weights**

You can customize the distance metric by setting the metric parameter (**e.g., 'euclidean', 'manhattan'**).
You can also use weighted KNN, where closer neighbors contribute more to the decision than further ones, by setting **weights='distance'**.

In [None]:
# Weighted KNN with Euclidean distance
knn_weighted = KNeighborsClassifier(n_neighbors=3, weights='distance', metric='euclidean')
knn_weighted.fit(X_train, y_train)
y_pred_weighted = knn_weighted.predict(X_test)

# Check accuracy
accuracy_weighted = accuracy_score(y_test, y_pred_weighted)
print(f"Weighted KNN Accuracy: {accuracy_weighted:.2f}")


Weighted KNN Accuracy: 0.95
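To see whether the metric or weighting matters for a given problem, the combinations can be compared with cross-validation. This is a rough sketch, assuming the Iris data and a fixed K of 5; the dictionary `metric_scores` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Mean 5-fold accuracy for each (metric, weights) combination
metric_scores = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    for weights in ("uniform", "distance"):
        model = KNeighborsClassifier(n_neighbors=5, metric=metric, weights=weights)
        scores = cross_val_score(model, X_iris, y_iris, cv=5, scoring="accuracy")
        metric_scores[(metric, weights)] = scores.mean()

for (metric, weights), score in metric_scores.items():
    print(f"metric={metric:<10} weights={weights:<8} accuracy={score:.3f}")
```

On a small, well-separated dataset such as Iris the differences are likely to be minor; they tend to matter more when features have very different distributions.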


**Cross-Validation for K Selection**

Cross-validation finds the best K by evaluating performance across multiple folds rather than on a single train/test split.

In [None]:
from sklearn.model_selection import cross_val_score

# Test K values from 1 to 10 using 5-fold cross-validation
k_range = range(1, 11)
k_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    k_scores.append(scores.mean())

print(f"Best K value: {k_range[np.argmax(k_scores)]}")


Best K value: 6


**Considerations:**
1. **Odd K Values:** For binary classification, an odd K avoids tied votes between the two classes. With more than two classes, ties can still occur for any K, so distance weighting (`weights='distance'`) is another way to reduce them.
2. **Size of Dataset:** For larger datasets, a larger K value may provide more stable predictions.
3. **Distance Metric:** The choice of the distance metric (Euclidean, Manhattan, etc.) can also affect the performance, so it’s worth testing different metrics when optimizing K.
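The tie problem with even K can be shown on a tiny example. The sketch below uses a made-up one-dimensional dataset (`X_toy`, `y_toy`) with two classes; the numbers are chosen only so that the two nearest neighbors of the query belong to different classes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two points per class on a line
X_toy = np.array([[0.0], [0.2], [1.0], [1.2]])
y_toy = np.array([0, 0, 1, 1])
query = np.array([[0.55]])

# K=2: the nearest neighbors are 0.2 (class 0) and 1.0 (class 1), one vote each
knn_even = KNeighborsClassifier(n_neighbors=2).fit(X_toy, y_toy)
proba_even = knn_even.predict_proba(query)
print("K=2 class probabilities:", proba_even)

# K=3: the third neighbor is 0.0 (class 0), which breaks the tie
knn_odd = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
proba_odd = knn_odd.predict_proba(query)
print("K=3 class probabilities:", proba_odd)
print("K=3 prediction:", knn_odd.predict(query))
```

With K=2 the vote is split evenly, so the predicted label depends on the library's tie-breaking rule rather than on the data.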

## How is KNN related to the Curse of Dimensionality?

K-Nearest Neighbors (KNN) is highly sensitive to the curse of dimensionality, a phenomenon that occurs when data exists in a high-dimensional space, making it harder to analyze and interpret. Here's how KNN and the curse of dimensionality are related:

1. **Distance Metrics Become Less Informative**
KNN relies on calculating distances between data points to classify or make predictions, typically using metrics like Euclidean distance. In high-dimensional spaces, the distance between points increases, and all points tend to become equally distant from each other. This dilutes the meaningfulness of proximity, which is the core concept of KNN.

**Effect:** In higher dimensions, the differences in distance between the nearest neighbors and farthest points become smaller, making it difficult for the algorithm to distinguish between relevant and irrelevant points.

2. **Increased Sparsity of Data**
As the number of dimensions (features) increases, the volume of the space grows exponentially, but the amount of data typically remains constant. This makes the data points more sparsely distributed across the feature space.

**Effect:** With sparser data in high dimensions, the neighbors of a point may be far away or irrelevant, reducing the ability of KNN to accurately classify or predict. The model becomes less reliable because it doesn't have enough nearby points to make meaningful decisions.

3. **Overfitting Risk**
In higher dimensions, each data point may become isolated, with only a few close neighbors. This can lead to overfitting because KNN may memorize specific points and produce poor generalization on new data.

**Effect:** If the number of dimensions grows, the model may classify points based on very few or irrelevant neighbors, leading to high variance in the model and overfitting.

4. **Increased Computational Complexity**
As dimensionality increases, the cost of each distance computation grows linearly with the number of features, and a brute-force search must compare every query against all training points. Tree-based indexes such as KD-trees, which speed up neighbor search in low dimensions, lose their advantage in high-dimensional spaces and degrade toward brute-force search.

**Effect:** This results in slower performance and greater resource consumption when the data has many features.
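The first point, that distances become less informative, can be checked numerically. The sketch below is an illustration under simple assumptions (uniformly random points in the unit hypercube, Euclidean distance); it measures the relative contrast, defined here as `(max - min) / min` of the distances from one query point to all others.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

# Relative contrast between the farthest and nearest point, per dimensionality
contrasts = {}
for dim in (2, 10, 100, 1000):
    points = rng.random((n_points, dim))
    query_point = rng.random(dim)
    dists = np.linalg.norm(points - query_point, axis=1)
    contrasts[dim] = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:4d}  nearest={dists.min():.3f}  "
          f"farthest={dists.max():.3f}  contrast={contrasts[dim]:.3f}")
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest points end up at almost the same distance, so "nearest" carries less information.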


### Strategies to Mitigate the Curse of Dimensionality in KNN

**Dimensionality Reduction:**

- Use techniques like Principal Component Analysis (PCA) to reduce the number of features while preserving as much of the data's variance as possible. t-SNE preserves local structure, but it is mainly used for visualization, and scikit-learn's implementation cannot transform new points after fitting.
- Feature selection methods can help identify the most important features and discard irrelevant ones.
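A minimal sketch of dimensionality reduction ahead of KNN, assuming PCA with two components on the Iris data. The choice of two components and the name `pca_knn` are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scale, project onto 2 principal components, then classify with KNN
pca_knn = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

pca_score = cross_val_score(pca_knn, X_iris, y_iris, cv=5, scoring="accuracy").mean()
print(f"Mean CV accuracy with 2 PCA components: {pca_score:.3f}")

# Fit once on all data to inspect how much variance the components retain
pca_knn.fit(X_iris, y_iris)
retained = pca_knn.named_steps["pca"].explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.3f}")
```

Placing PCA inside the pipeline means the projection is learned from the training folds only during cross-validation.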


**Distance Metric Optimization:**

Try different distance metrics (e.g., Manhattan distance) to see which one performs better in high-dimensional spaces.
Some high-dimensional problems may benefit from using Mahalanobis distance, which accounts for correlations between features.


**Normalize the Data:**

In high-dimensional spaces, it’s essential to ensure that all features contribute equally by scaling them (e.g., using standardization or min-max normalization) before calculating distances.
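The effect of scaling is easiest to see on data whose features have very different ranges. The sketch below assumes scikit-learn's bundled wine dataset, chosen here only because its features span very different scales.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Same classifier, with and without standardization
raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_score = cross_val_score(raw_knn, X_wine, y_wine, cv=5).mean()
scaled_score = cross_val_score(scaled_knn, X_wine, y_wine, cv=5).mean()

print(f"Without scaling: {raw_score:.3f}")
print(f"With scaling:    {scaled_score:.3f}")
```

Without scaling, the features with the largest numeric ranges dominate the distance, so the remaining features contribute almost nothing to the neighbor search.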

**Use Dimensionality-Sensitive Algorithms:**

Consider algorithms better suited for high-dimensional data, like Support Vector Machines (SVM) or Random Forests, if KNN struggles.


- In summary, KNN’s reliance on distance calculations makes it particularly vulnerable to the curse of dimensionality.
- As the number of features increases, distances between points become less meaningful, data becomes sparse, and the computational cost increases.
- Using dimensionality reduction, feature selection, and normalization can help mitigate these challenges.

## Challenge - Use an Actual Dataset following the steps above

## **Model tuning and pipelines**

Model tuning and pipelines are essential components of machine learning workflows, helping to optimize model performance and streamline the process.

Here's an overview of each and how they work together.

1. **Model Tuning**
Model tuning is the process of optimizing the hyperparameters of a machine learning algorithm to improve its performance.

Hyperparameters are parameters that are not learned by the model directly during training but are set before training begins.

**Common Model Tuning Techniques:**

**Grid Search:**

Exhaustively searches through a manually specified subset of hyperparameters.
Trains and evaluates the model for each combination to find the best one.

**Random Search:**

Samples random combinations of hyperparameters from a given range.
More efficient than grid search for high-dimensional hyperparameter spaces.

**Bayesian Optimization:**

Uses a probabilistic model to select the next set of hyperparameters based on past evaluations.
Often needs fewer evaluations than grid or random search because it uses past results to decide where to look next, although each step carries some modeling overhead.

**Hyperopt, Optuna, and other advanced libraries:**

Libraries like Hyperopt and Optuna offer more sophisticated optimization techniques, including Bayesian optimization, the Tree-structured Parzen Estimator (TPE), and adaptive sampling.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Example classifier
knn = KNeighborsClassifier()

# Define a grid of hyperparameters
param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}

# Grid search
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best hyperparameters
print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Best Accuracy: {grid_search.best_score_}")


Best Hyperparameters: {'n_neighbors': 3, 'weights': 'uniform'}
Best Accuracy: 0.9818181818181818
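Random search, described above, follows the same pattern as grid search but samples a fixed number of combinations. The following is a sketch, assuming scikit-learn's `RandomizedSearchCV` and `scipy.stats.randint`; the search ranges and `n_iter=15` are arbitrary illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=123
)

# Distributions (or lists) to sample hyperparameters from
param_distributions = {
    "n_neighbors": randint(1, 21),        # integers from 1 to 20
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

# Evaluate 15 randomly sampled combinations with 5-fold cross-validation
random_search = RandomizedSearchCV(
    KNeighborsClassifier(), param_distributions,
    n_iter=15, cv=5, scoring="accuracy", random_state=0,
)
random_search.fit(Xi_train, yi_train)

print(f"Best hyperparameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.3f}")
```

The cost is controlled by `n_iter` rather than by the size of the grid, which is why random search scales better as more hyperparameters are added.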


2. **Pipelines**

A pipeline is a sequence of steps where data is preprocessed and passed to a model for training and evaluation in a systematic way.

Using pipelines ensures that all steps are executed in the right order and that there's no data leakage between training and testing.

**Benefits of Using Pipelines:**

- Reproducibility: Ensures consistent application of data transformations.
- Simplifies Code: Consolidates multiple steps (data preprocessing, model training, etc.) into a single process.
- Prevents Data Leakage: Ensures transformations like scaling are fitted only on the training data, then applied to the test data with those same fitted parameters.

Using Pipelines in Scikit-learn

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Define the steps in the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: Standardize the data
    ('knn', KNeighborsClassifier(n_neighbors=5))  # Step 2: Train the KNN model
])

# Train the pipeline
pipe.fit(X_train, y_train)

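# Make predictions on the test data
y_pred = pipe.predict(X_test)

# Evaluate the pipeline (the scaler reuses the statistics learned from the training data)
print(f"Pipeline test accuracy: {pipe.score(X_test, y_test):.2f}")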
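One way to confirm that a pipeline avoids leakage is to inspect the fitted scaler. This sketch, which rebuilds a similar pipeline under the illustrative name `pipe_demo`, checks that the scaler's learned means come from the training split only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=123
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe_demo.fit(Xi_train, yi_train)

# The scaler's statistics were computed from the training rows only
fitted_scaler = pipe_demo.named_steps["scaler"]
matches_train = np.allclose(fitted_scaler.mean_, Xi_train.mean(axis=0))
print(f"Scaler means match training-set means: {matches_train}")

# At prediction time the test data is transformed with those same statistics
test_accuracy = pipe_demo.score(Xi_test, yi_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Calling `fit` on the pipeline never touches the test rows, so no information from them can influence the scaling.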


**Pipelines with Model Tuning**

You can combine pipelines with hyperparameter tuning using GridSearchCV or RandomizedSearchCV.
This allows you to tune hyperparameters of both the preprocessing steps (like scaling) and the model itself.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Define a pipeline with scaling and KNN
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Standardize data
    ('knn', KNeighborsClassifier())  # KNN model
])

# Define a parameter grid for hyperparameter tuning
param_grid = {
    'knn__n_neighbors': [3, 5, 7],  # Tuning the number of neighbors
    'knn__weights': ['uniform', 'distance'],  # Tuning the weight function
}

# Perform grid search
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Output the best parameters and model performance
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")


Best parameters: {'knn__n_neighbors': 5, 'knn__weights': 'uniform'}
Best score: 0.9731225296442687
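The grid above tunes only the KNN step. Preprocessing can be searched too, for example by treating the scaler itself as a choice. The sketch below assumes that swapping a whole pipeline step through the parameter grid (including the special value `'passthrough'`, which skips the step) is the desired kind of preprocessing tuning; `pipe_demo` and `param_grid_demo` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=123
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# The 'scaler' key replaces the whole step; 'passthrough' disables scaling
param_grid_demo = {
    "scaler": [StandardScaler(), MinMaxScaler(), "passthrough"],
    "knn__n_neighbors": [3, 5, 7],
}

search_demo = GridSearchCV(pipe_demo, param_grid_demo, cv=5, scoring="accuracy")
search_demo.fit(Xi_train, yi_train)

print(f"Best parameters: {search_demo.best_params_}")
print(f"Best CV accuracy: {search_demo.best_score_:.3f}")
```

This searches 3 scaler options times 3 values of K, so 9 candidate pipelines are cross-validated in total.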


**Advanced Techniques: Cross-Validation and Nested Cross-Validation**

**Cross-validation (CV):** Evaluates model performance by splitting the dataset into training and validation sets multiple times.
It helps **detect overfitting and gives a more reliable estimate of how well the model generalizes to unseen data.**

**Nested cross-validation:** Combines model tuning and cross-validation so that the performance estimate is not biased by the hyperparameter search.
It involves an **outer loop for model evaluation and an inner loop for hyperparameter tuning**.

In [None]:
# Using nested cross-val

from sklearn.model_selection import cross_val_score, GridSearchCV

# Define the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Standardize the features
    ('knn', KNeighborsClassifier())  # KNN classifier
])

# Parameter grid for Grid Search
param_grid = {'knn__n_neighbors': [3, 5, 7], 'knn__weights': ['uniform', 'distance']}

# Perform GridSearchCV
grid_search = GridSearchCV(pipe, param_grid, cv=5)

# Nested cross-validation (with 10 outer folds)
nested_cv_scores = cross_val_score(grid_search, X, y, cv=10)

# Print the nested cross-validation score
print(f"Nested CV accuracy: {np.mean(nested_cv_scores):.2f}")


Nested CV accuracy: 0.95


**In Summary**

- Model tuning is the process of optimizing hyperparameters to improve the performance of a machine learning model.
- Pipelines organize the workflow by combining preprocessing steps with model training in a structured manner.
- Combining pipelines and model tuning helps keep the entire machine learning process systematic, reproducible, and efficient.
- Cross-validation gives a more robust estimate of model performance, which helps detect overfitting before the model is used on new data.