Q1. What is the KNN algorithm?

In [None]:
'''
 The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning method that uses proximity
 to make predictions: it assigns a data point the class (or value) suggested by the k training points closest to it.
'''
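As a rough illustration of the idea above, here is a minimal from-scratch sketch of KNN classification. The function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed as the proximity measure.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label, closest first.
    distances = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two small clusters in 2-D.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2.0, 2.0)))  # A
print(knn_predict(train_points, train_labels, (6.0, 7.0)))  # B
```

There is no training step: the stored points are the model, and all the work happens at prediction time.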

Q2. How do you choose the value of K in KNN?

In [None]:
'''
Input data-
The type of data you're using will determine the best value of k. For example, if the data has a lot of outliers
or noise, you might want to choose a higher value of k.

Odd or even-
For binary classification, it's generally recommended to choose an odd value of k to avoid tied votes. For example,
if you have 7 nearest neighbors, and 4 belong to class 2 and 3 belong to class 1, the point is classified as
belonging to class 2. With more than two classes, ties are still possible even when k is odd.

Square root of n-
A common rule of thumb is to start with k close to the square root of the total number of data points (n).

Cross-validation-
You can use cross-validation techniques to help you find the best value of k for your dataset.

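Overfitting and underfitting-
If you choose a value of k that's too small, the model follows noise in the training data (overfitting). If you
choose a value of k that's too large, your model might become too simplistic and fail to capture the underlying
patterns in the data. This is known as underfitting.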


The KNN algorithm is versatile and can be used for both classification and regression tasks. With a sufficiently
large k it is fairly robust to noisy training data, and it is effective for non-linear relationships.
'''
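The cross-validation approach described above can be sketched as follows. This is only an illustration: the Iris dataset, the range of candidate values, and 5 folds are assumptions chosen for the example. Note that Iris has three classes, so an odd k does not rule out ties here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Example dataset (an assumption for this sketch): 150 samples, 3 classes.
X, y = load_iris(return_X_y=True)

# Candidate odd values of k: 1, 3, ..., 21.
candidate_ks = range(1, 22, 2)

mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: each score is the accuracy on one held-out fold.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(f"best k = {best_k}, mean CV accuracy = {mean_scores[best_k]:.3f}")
```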

Q3. What is the difference between KNN classifier and KNN regressor?

In [None]:
'''
A KNN classifier is used for classification tasks: it predicts discrete categories such as "cat" or "dog". A KNN
regressor is used for regression tasks: it predicts continuous values such as price or temperature. Both use the
same underlying KNN algorithm and differ only in how they combine the neighbors: a KNN classifier returns the
most frequent class among the nearest neighbors, while a KNN regressor returns the average of the neighbors'
target values.
'''
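To make the difference concrete, the sketch below applies both prediction rules to one set of neighbors. The neighbor labels and values are invented for illustration, and the neighbors are assumed to have been found already.

```python
from collections import Counter
from statistics import mean

# Hypothetical targets of the k = 5 nearest neighbors of one query point.
neighbor_classes = ["cat", "dog", "cat", "cat", "dog"]   # classification targets
neighbor_values = [210.0, 195.0, 205.0, 220.0, 190.0]    # regression targets

# Classifier: majority vote over the neighbors' class labels.
predicted_class = Counter(neighbor_classes).most_common(1)[0][0]

# Regressor: average of the neighbors' target values.
predicted_value = mean(neighbor_values)

print(predicted_class)  # cat
print(predicted_value)  # 204.0
```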

Q4. How do you measure the performance of KNN?

In [None]:
'''
The performance of a KNN (K-Nearest Neighbors) model is measured by splitting the data into training and testing
sets and comparing the model's predictions on the test set with the known targets. For classification, the usual
metric is accuracy (the percentage of correct classifications), supplemented by precision, recall, and F1-score,
which are more informative when the classes are imbalanced. For regression, error metrics such as mean squared
error (MSE), mean absolute error (MAE), or R-squared are used instead.

Key points about measuring KNN performance:

Accuracy as primary metric:
For simple classification tasks, accuracy is the most common metric used to evaluate KNN performance.

Choosing the right K value:
A crucial aspect of KNN is selecting the optimal value of "k" (the number of nearest neighbors to consider),
which significantly impacts accuracy.

Cross-validation:
To find the best "k" value, cross-validation is often employed: the data is split into multiple folds, the model
is trained on all folds but one and evaluated on the held-out fold, and this is repeated so that each fold serves
once as the test set. The results are then averaged to get a more robust evaluation.
'''
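The classification metrics mentioned above can be computed by hand from a model's predictions, as in this small sketch. The label lists are invented for illustration; in practice `y_pred` would come from a fitted KNN model evaluated on a held-out test set.

```python
# Hypothetical true labels and KNN predictions for a binary problem (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)        # share of all predictions that are correct
precision = tp / (tp + fp)               # share of predicted positives that are correct
recall = tp / (tp + fn)                  # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.700 precision=0.750 recall=0.600 f1=0.667
```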

Q5. What is the curse of dimensionality in KNN?

In [None]:
'''
The curse of dimensionality in the context of k-Nearest Neighbors (kNN) refers to the challenges and problems
that arise when the number of features (dimensions) in the dataset increases. It impacts both the effectiveness
and the efficiency of kNN in several ways:

Distance between points: As the number of dimensions increases, the distances between pairs of data points become
more similar to one another and therefore less meaningful.

Nearest neighbor calculations: In high-dimensional spaces, the distances between nearest and farthest points from
query points become almost equal. This makes it difficult for nearest neighbor calculations to discriminate candidate
points.

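Overfitting: With many dimensions, the nearest neighbors are often not truly similar to the query point, so KNN
tends to fit noise in the training data and overfit.

Reduced sample size: The data becomes sparse in higher dimensions. The number of points needed to cover the space
at the same density grows exponentially with the number of dimensions, so a fixed dataset behaves like a much
smaller sample.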
'''
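The convergence of nearest and farthest distances can be observed with a small simulation. This is a sketch under assumptions: points are drawn uniformly from the unit hypercube, and the helper name `relative_contrast` is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def relative_contrast(n_dims, n_points=500):
    """Return (farthest - nearest) / nearest distance from a random query to random points."""
    points = rng.random((n_points, n_dims))  # uniform points in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)  # Euclidean distances to the query
    return (distances.max() - distances.min()) / distances.min()


# The contrast shrinks as the number of dimensions grows.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

A small contrast means the "nearest" neighbor is barely closer than the farthest point, which is why neighbor-based predictions lose discriminating power in high dimensions.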

Q6. How do you handle missing values in KNN?

In [None]:
'''
Handling missing values in k-Nearest Neighbors (kNN) is an essential preprocessing step, because distances cannot
be computed directly when feature values are missing. The nearest-neighbor idea itself can be used to impute them.

kNN-Based Imputation

The most common approach is to use kNN imputation. It fills in missing values by looking at the values of the
k-nearest neighbors based on a similarity measure (e.g., Euclidean distance, Manhattan distance).

Steps for kNN Imputation:

1. Choose k (number of neighbors): Select an appropriate value for k, typically based on cross-validation or domain
knowledge.

2. Calculate distances: For the record with missing values, calculate the distances to other records using only the
non-missing features.

3. Find k-nearest neighbors: Identify the k most similar records.

4. Impute the missing value:

   For numerical data: Take the mean (or median) of the corresponding values from the k-nearest neighbors.

   For categorical data: Take the mode (most frequent value) of the corresponding values from the k-nearest neighbors.
'''

In [1]:
from sklearn.impute import KNNImputer
import numpy as np

# Example dataset with missing values
data = np.array([[1, 2, np.nan],
                 [3, np.nan, 5],
                 [np.nan, 4, 6],
                 [7, 8, 9]])

# Create and fit the KNN imputer
imputer = KNNImputer(n_neighbors=3, weights="uniform")
imputed_data = imputer.fit_transform(data)

print(imputed_data)


[[1.         2.         6.66666667]
 [3.         4.66666667 5.        ]
 [3.66666667 4.         6.        ]
 [7.         8.         9.        ]]


Q7. Compare and contrast the performance of the KNN classifier and regressor. Which one is better for
which type of problem?

In [None]:
'''
Neither model is better in general; the choice depends on the type of the target variable. A KNN classifier is
suited to classification problems, where the target is categorical (like "yes/no" or "cat/dog"): it predicts the
most frequent class among the nearest neighbors. A KNN regressor is suited to regression problems, where the target
is continuous (like house price or temperature): it predicts the average target value of the nearest neighbors.
The key difference is the output: a discrete class label for classification and a continuous value for regression.
'''

Q8. What are the strengths and weaknesses of the KNN algorithm for classification and regression tasks,
and how can these be addressed?

In [None]:
'''
Strengths of kNN

1. Simplicity:
kNN is easy to understand and implement.
It makes no assumptions about the underlying data distribution (non-parametric).

2. Versatility:
Can be applied to both classification and regression tasks.
Handles multi-class classification naturally.

3. Adaptable to Complex Boundaries:
Can model complex decision boundaries if the data is well-labeled and sufficient.

4. No Training Time:
No explicit training phase (lazy learning). The entire dataset serves as the model, which can be advantageous when
training time is a constraint.

5. Robust to Small Datasets:
Works well with small datasets, especially when the relationship between features is meaningful and distances are
informative.


Weaknesses of kNN

1. Computational Cost:
High memory usage and computational cost during prediction, as it involves calculating distances to all points in
the dataset. Scales poorly with large datasets.

2. Sensitive to Feature Scaling:
Distance metrics like Euclidean distance are affected by the scale of features, which can skew results if features
are not normalized.

3. Sensitive to Irrelevant Features:
Performance deteriorates when irrelevant or redundant features dominate the distance metric.

4. Curse of Dimensionality:
As the number of features increases, distances between points tend to converge, making it hard to differentiate
between neighbors.

5. Imbalanced Data:
kNN can perform poorly on imbalanced datasets because the majority class can dominate the neighborhood.

6. Choice of k:
Results are highly sensitive to the choice of k. Too small a value can lead to overfitting, while too large a
value may result in underfitting.

7. Outlier Sensitivity:
Outliers can significantly affect the neighborhood and skew predictions.

How these can be addressed

Feature scaling: Standardize or min-max scale the features before computing distances.
Irrelevant features and high dimensionality: Apply feature selection or dimensionality reduction (for example, PCA).
Choice of k: Tune k with cross-validation.
Computational cost: Use index structures such as KD-trees or ball trees, or approximate nearest-neighbor search.
Imbalanced data and outliers: Weight neighbors by distance, resample the classes, or remove outliers beforehand.

'''

Q9. What is the difference between Euclidean distance and Manhattan distance in KNN?

In [None]:
'''
Euclidean distance is the straight-line distance between two points. Manhattan distance is the sum of the absolute
differences along each axis, which corresponds to following a grid-like path, often visualized as a taxi moving
only along city streets. Euclidean distance is a common default for continuous features, while Manhattan distance
is often preferred for data with a grid-like structure or for high-dimensional spaces.

1. Euclidean Distance
It measures the shortest straight-line distance ("as the crow flies") between two points.
It is based on the Pythagorean theorem.

2. Manhattan Distance
It measures the distance between two points along axes at right angles ("city block" distance, like moving on a grid
in a city).
Adds up the absolute differences between corresponding dimensions.
'''
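Both distances are short to compute by hand. The sketch below uses two arbitrary 2-D points chosen for illustration; the function names are made up for the example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """City-block distance: sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (1, 2), (4, 6)  # differences of 3 and 4 along the two axes
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7
```

In scikit-learn, the distance used by `KNeighborsClassifier` can be switched with its `metric` parameter, for example `metric="manhattan"`.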

Q10. What is the role of feature scaling in KNN?

In [None]:
'''
In KNN (K-Nearest Neighbors), feature scaling plays a crucial role because predictions rely entirely on distances
between data points. Scaling ensures that all features contribute comparably to the distance calculation, so that
no single feature dominates merely because of its magnitude. This typically leads to more accurate predictions.

Distance-based algorithm:
KNN is a distance-based algorithm, meaning it determines the nearest neighbors of a data point based on their
distances in the feature space.

Impact of feature scale:
When features have different scales, the feature with a larger range will have a disproportionate impact on the
distance calculation, potentially leading to inaccurate classifications.

Importance of standardization:
Scaling features to a similar range (often using techniques like standardization or min-max scaling) lets all
features contribute comparably to the distance calculation, allowing KNN to make more informed decisions.
'''
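The effect of scale on neighbor selection can be shown with a few rows of made-up data. In this sketch the columns are assumed to be age (years) and income (dollars), the helper `nearest_to_first` is hypothetical, and standardization is done by hand with NumPy as a z-score.

```python
import numpy as np

# Hypothetical data: column 0 is age in years, column 1 is income in dollars.
X = np.array([[25.0, 50_000.0],   # row 0: the query point
              [60.0, 50_500.0],   # row 1: very different age, almost the same income
              [26.0, 54_000.0],   # row 2: almost the same age, somewhat different income
              [40.0, 90_000.0],
              [50.0, 30_000.0]])


def nearest_to_first(data):
    """Return the index of the row (other than row 0) closest to row 0 in Euclidean distance."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1


# Standardize each column to zero mean and unit variance (z-score).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(nearest_to_first(X))         # 1: income dominates the raw distance
print(nearest_to_first(X_scaled))  # 2: after scaling, age counts as well
```

Before scaling, the 35-year age gap to row 1 is negligible next to income differences measured in thousands, so row 1 looks closest. After scaling, row 2 is the nearest neighbor.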