# k-Nearest Neighbors (kNN)

kNN is a type of instance-based (lazy) learning: the function is approximated only locally, and all computation is deferred until prediction time.

## Core Concepts

1. **Distance Metric**:
   - kNN relies on a distance metric to calculate how close points are. Common metrics include:
     - **Euclidean Distance**: Most common for continuous variables.
     - **Manhattan Distance**: The sum of absolute differences along each axis (city block distance). Useful when movement is restricted to axis-aligned steps, and less dominated by a single large coordinate difference than Euclidean distance.
     - **Hamming Distance**: For categorical variables, counts the number of positions at which the corresponding symbols are different.

2. **k Selection**:
   - 'k' is the number of neighbors that contribute to the classification of a new instance. Choosing k is critical:
     - Small k can make the model sensitive to noise (overfitting).
     - Large k can smooth out local variations (underfitting).

3. **Voting (for classification)**:
   - For classification, the class of the new instance is decided by the majority vote among its k nearest neighbors.

4. **Averaging (for regression)**:
   - For regression tasks, the prediction is often the average of the target values of the k nearest neighbors.
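
The three metrics listed above can be written in a few lines each. This is a minimal sketch in plain Python; the function names are illustrative rather than taken from any library, and the inputs are assumed to be equal-length sequences.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))      # 5.0
print(manhattan((0, 0), (3, 4)))      # 7
print(hamming("karolin", "kathrin"))  # 3
```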

## Algorithm Steps

1. **Store Training Data**:
   - Unlike other algorithms that build a model during training, kNN simply stores all data points.

2. **Calculate Distance**:
   - When a new instance comes for prediction, compute the distance from this instance to all stored instances.

3. **Find k-Nearest Neighbors**:
   - Sort the distances and select the k training points with the smallest ones.

4. **Predict**:
   - For classification, take a vote among these neighbors. For regression, compute the average of their target values.
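
The four steps above can be sketched as a brute-force implementation. This assumes numeric features, Euclidean distance via `math.dist` (Python 3.8+), and an unweighted vote or average; the function names and the one-dimensional toy data are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Steps 2 and 3: compute all distances, return indices of the k closest."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()  # ascending by distance, ties broken by index
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k):
    """Step 4 (classification): majority vote among the k nearest labels."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k):
    """Step 4 (regression): mean target value of the k nearest points."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / len(neighbors)

# Step 1: "training" is just keeping the data around (toy values).
X = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
labels = ["low", "low", "low", "high", "high"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0]

print(knn_classify(X, labels, (2.5,), k=3))  # low
print(knn_regress(X, targets, (2.5,), k=2))  # 2.5
```

Sorting every distance costs O(n log n) per query; for large datasets a partial selection or a spatial index would likely be preferable, but the sort keeps the sketch close to the steps as written.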

## Example

Let's consider a simple scenario for classifying whether a fruit is an apple or an orange based on two features: weight and texture.

**Training Data:**

| Fruit  | Weight (g) | Texture (1-5 scale, 1 being smooth) |
|--------|------------|-------------------------------------|
| Apple  | 150        | 4                                   |
| Apple  | 170        | 3                                   |
| Orange | 130        | 2                                   |
| Orange | 160        | 3                                   |
| Apple  | 140        | 4                                   |

**New Instance to Classify:**

- Weight: 155g
- Texture: 3

**Steps:**

1. **Compute Distances** (using Euclidean for simplicity):
   - Distance to Apple(150, 4): $\sqrt{(155-150)^2 + (3-4)^2} \approx 5.1$
   - Distance to Apple(170, 3): $\sqrt{(155-170)^2 + (3-3)^2} = 15$
   - Distance to Orange(130, 2): $\sqrt{(155-130)^2 + (3-2)^2} \approx 25.0$
   - Distance to Orange(160, 3): $\sqrt{(155-160)^2 + (3-3)^2} = 5$
   - Distance to Apple(140, 4): $\sqrt{(155-140)^2 + (3-4)^2} \approx 15.03$

2. **Select k=3 Nearest Neighbors**:
   - Orange (160, 3) - 5
   - Apple (150, 4) - 5.1
   - Apple (170, 3) - 15

   With k=3, these are the three closest points. The next closest point, Apple (140, 4), is slightly farther than 15 and is left out.

3. **Majority Vote**:
   - 2 Apples, 1 Orange. Therefore, the new fruit is classified as an **Apple**.
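
The worked example can be checked numerically. The sketch below uses the five training rows and the query from the tables above, with unscaled Euclidean distance as in the walkthrough; the variable names are illustrative.

```python
import math
from collections import Counter

training = [
    ("Apple", (150, 4)),
    ("Apple", (170, 3)),
    ("Orange", (130, 2)),
    ("Orange", (160, 3)),
    ("Apple", (140, 4)),
]
query = (155, 3)
k = 3

# Rank every training point by its distance to the query.
ranked = sorted(
    (math.dist(features, query), label, features) for label, features in training
)
for distance, label, features in ranked:
    print(f"{label:<6} {features}  {distance:.2f}")

# Majority vote among the k closest labels.
nearest_labels = [label for _, label, _ in ranked[:k]]
prediction = Counter(nearest_labels).most_common(1)[0][0]
print(prediction)  # Apple
```

Because weight is measured in grams and texture on a 1-5 scale, the weight difference contributes almost all of each distance here.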

## Practical Considerations

- **Choice of k**:
  - Often chosen via cross-validation to balance between underfitting and overfitting.

- **Distance Weighting**:
  - Points closer to the query point can be given more influence in the vote or average.

- **Feature Scaling**:
  - Important because kNN is sensitive to scale; features should be normalized or standardized.

- **Memory and Time Complexity**:
  - kNN needs to store all training data, which can be memory-intensive. A brute-force query computes the distance to every stored point, so prediction time grows linearly with the size of the training set.

- **Curse of Dimensionality**:
  - As the number of dimensions increases, the volume of the space increases so fast that the available data becomes sparse. This can make distance measures less meaningful.
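
Feature scaling and distance weighting can both change which neighbors matter. The sketch below reuses the fruit data from the example, applies min-max scaling, and weights each vote by inverse distance. The helper names, the choice of min-max over standardization, and the `eps` guard against division by zero are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def min_max_scaler(rows):
    """Build a function mapping each feature to [0, 1] over the training range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # constant feature: avoid /0

    def scale(point):
        return tuple((v - lo) / s for v, lo, s in zip(point, lows, spans))

    return scale

def weighted_vote(rows, labels, query, k, eps=1e-9):
    """Classify query; each of the k nearest neighbors votes with weight 1/distance."""
    ranked = sorted((math.dist(r, query), i) for i, r in enumerate(rows))[:k]
    scores = defaultdict(float)
    for distance, i in ranked:
        scores[labels[i]] += 1.0 / (distance + eps)
    return max(scores, key=scores.get), [i for _, i in ranked]

rows = [(150, 4), (170, 3), (130, 2), (160, 3), (140, 4)]
labels = ["Apple", "Apple", "Orange", "Orange", "Apple"]
query = (155, 3)

raw_pred, raw_idx = weighted_vote(rows, labels, query, k=3)
scale = min_max_scaler(rows)
scaled_pred, scaled_idx = weighted_vote(
    [scale(r) for r in rows], labels, scale(query), k=3
)
print(raw_idx, raw_pred)        # [3, 0, 1] Apple
print(scaled_idx, scaled_pred)  # [3, 1, 0] Orange
```

On the raw features the weighted vote still picks Apple. After scaling, texture counts as much as weight, the orange at (160, 3) is far closer than either apple, and the inverse-distance vote flips to Orange. On a dataset this small the result is only an illustration of sensitivity, not evidence that one setting is better.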


# Interview Questions

1. **What is kNN?**

   kNN, or k-Nearest Neighbors, is a non-parametric, lazy learning algorithm used for classification and regression. For classification, it assigns a new instance the majority class among its k nearest neighbors in the feature space; for regression, it averages their target values.

2. **Why is kNN called a "lazy learner"?**

   kNN is termed a "lazy learner" because it does not learn a discriminative function from the training data beforehand; instead, it memorizes the entire dataset and uses it directly for classification at prediction time. There's no explicit training phase other than storing the data.

3. **How do you choose the value of k in kNN?**

   The choice of k is typically made using cross-validation. A small k can lead to overfitting, while a large k might smooth out class boundaries too much, leading to underfitting. One rule of thumb is to choose k as the square root of the number of samples in the dataset, but empirical testing is key.

4. **What distance metrics can be used in kNN?**

   Common distance metrics include:
   - **Euclidean Distance** for continuous data.
   - **Manhattan Distance** (sum of absolute differences along each axis) for grid-like movement or when a single large coordinate difference should not dominate.
   - **Hamming Distance** for categorical data.
   - **Minkowski Distance** as a generalization of Euclidean and Manhattan: its parameter p gives Manhattan at p = 1 and Euclidean at p = 2.

5. **How does kNN handle imbalanced datasets?**

   kNN can struggle with imbalanced datasets since a majority vote might not represent the minority class well. Techniques like adjusting the number of neighbors (k), using weighted voting where votes from minority classes have more weight, or using different distance metrics can help manage this issue.

6. **What are the advantages of kNN?**

   - Simple to implement and understand.
   - No training phase, which means new data can be added seamlessly.
   - Works well with small datasets.
  
7. **What are the disadvantages of kNN?**

   - Computationally expensive for large datasets as it requires distance calculations for all points.
   - Sensitive to the scale of data; features need to be normalized.
   - Prone to the curse of dimensionality where performance degrades as the number of features increases.
  
8. **How does scaling affect kNN performance?**
   
   Scaling is crucial for kNN because the algorithm uses distance metrics. Features on larger scales can dominate the distance calculation, leading to biased results. Normalization or standardization helps ensure all features contribute equally to the distance metric.

9. **Can kNN be used for regression?**
    
    Yes. For regression, kNN predicts the value of a new instance by averaging the target values of its k nearest neighbors. Weights based on distance can be applied to give closer neighbors more influence.
 
10. **How do you handle categorical variables in kNN?**
    
    For categorical variables, you can use distance metrics like Hamming distance for binary attributes or convert categorical variables into numerical ones using techniques like one-hot encoding, then apply traditional distance measures.
 
11. **What's the impact of choosing too large or too small a k?**
  - **Too small k:** The model might overfit, capturing noise as patterns.
  - **Too large k:** The model might underfit, missing out on local patterns due to over-smoothing.
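
The effect of k, and the cross-validation approach to choosing it, can be sketched with leave-one-out cross-validation. The dataset below is hypothetical: two well-separated clusters plus one deliberately mislabeled point, constructed so that both a very small and a very large k are penalized. The function names are illustrative.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Majority vote among the k points closest to query (Euclidean distance)."""
    ranked = sorted((math.dist(p, query), i) for i, p in enumerate(points))
    votes = Counter(labels[i] for _, i in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_classify(rest_points, rest_labels, point, k) == label
    return hits / len(points)

points = [
    (1, 1), (1, 2), (2, 1), (2, 2),   # cluster "a"
    (6, 6), (6, 7), (7, 6), (7, 7),   # cluster "b"
    (1.5, 1.5),                       # sits inside cluster "a" ...
]
labels = ["a"] * 4 + ["b"] * 4 + ["b"]  # ... but is labeled "b" (noise)

accuracy = {k: loo_accuracy(points, labels, k) for k in (1, 3, 7)}
best_k = max(accuracy, key=accuracy.get)
for k, score in accuracy.items():
    print(k, round(score, 2))
print("best k:", best_k)
```

On this toy set, k = 1 lets the mislabeled point mislead every nearby "a" point, k = 7 pulls in votes from the distant cluster, and k = 3 scores highest. Real datasets rarely behave this cleanly, so the candidate values of k and the validation scheme would need to be chosen per problem.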