# KNN Assignment 1

### Q1. What is the KNN algorithm?

K-Nearest Neighbors (KNN) is like asking your neighbors for advice. Imagine you move to a new neighborhood, and you want to know if it's safe or friendly. You go to your nearest neighbors and ask them about their experiences.

In the world of data, KNN is an algorithm that helps a computer make decisions based on what its "neighbors" (similar data points in a dataset) are like. To predict the category or value of a new data point, KNN finds the K points in the dataset that are most similar to it (usually the closest by some distance measure) and decides based on them: a majority vote for a category, or an average for a numerical value.

For example, if you have data about fruits and you want to know if a new fruit is an apple or a banana, KNN would check the features (like size and color) of the nearest fruits in the dataset and make a prediction based on what those similar fruits are.

The key idea is that things that are close to each other in a dataset are often similar, so KNN leverages this similarity to make predictions. It's a simple and intuitive way for a computer to learn from data.
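
The idea can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `knn_classify`, the fruit measurements, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points.

    `train` is a list of (features, label) pairs; features are numeric tuples.
    """
    # Rank the training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical fruit data: (length in cm, yellowness from 0 to 1).
fruits = [
    ((7.0, 0.10), "apple"),
    ((7.5, 0.20), "apple"),
    ((8.0, 0.15), "apple"),
    ((18.0, 0.90), "banana"),
    ((19.5, 0.95), "banana"),
    ((17.0, 0.85), "banana"),
]

print(knn_classify(fruits, (18.5, 0.80)))  # banana
print(knn_classify(fruits, (7.2, 0.30)))   # apple
```

Note that there is no training step: the "model" is just the stored data, and all the work happens when a prediction is requested.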

### Q2. How do you choose the value of K in KNN?

1. **Odd vs. Even K**:
   - For a binary classification problem (two classes), it's better to choose an odd K value to prevent ties when voting for the class, making predictions less ambiguous.

2. **Cross-Validation**:
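   - A reliable way to find the right K is k-fold cross-validation: split the training data into several folds, hold out each fold in turn as validation data, and average the scores for each candidate K. Pick the K with the best average validation performance. This guards against a K that is too small (which overfits to noise) or too large (which blurs the classes together).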

3. **Domain Knowledge**:
   - If you know your problem well, consider your domain knowledge when selecting K. It can provide valuable insights into what neighborhood size is most meaningful for your specific application.

4. **Rule of Thumb**:
   - As a starting point, you can use K ≈ sqrt(N), where N is the number of data points in your training set. This is only a heuristic, so treat it as an initial guess and refine it with cross-validation.
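
The cross-validation approach above can be sketched as follows. This is a rough illustration under stated assumptions: the helper names `predict` and `cv_accuracy`, the synthetic two-cluster data, and the candidate list are all made up for the example, and the folds are formed by simple striding rather than stratified sampling.

```python
import math
import random
from collections import Counter

def predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, k, folds=5):
    """Mean validation accuracy of KNN with k neighbors over `folds` folds."""
    scores = []
    for f in range(folds):
        valid = data[f::folds]  # every folds-th point, starting at index f
        train = [p for i, p in enumerate(data) if i % folds != f]
        hits = sum(predict(train, x, k) == y for x, y in valid)
        scores.append(hits / len(valid))
    return sum(scores) / folds

# Hypothetical data: two overlapping 2-D clusters, shuffled so folds are mixed.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(40)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), "b") for _ in range(40)]
rng.shuffle(data)

candidates = [1, 3, 5, 7, 9]  # odd values avoid ties with two classes
scores = {k: cv_accuracy(data, k) for k in candidates}
best_k = max(scores, key=scores.get)

for k in candidates:
    print(k, round(scores[k], 3))
print("best K:", best_k)
```

In practice the differences between neighboring K values are often small, so it is reasonable to prefer a slightly larger K when scores are close.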

### Q3. What is the difference between KNN classifier and KNN regressor?


| Aspect                    | KNN Classifier                | KNN Regressor                 |
|---------------------------|-------------------------------|------------------------------|
| **Type of Problem**       | Classification                | Regression                    |
| **Output**                | Discrete class labels         | Continuous numerical values   |
| **Predicted Value**       | Assigns a class label         | Assigns a numerical value    |
| **How Neighbors Are Combined** | Majority vote among the K neighbors | Average (optionally distance-weighted) of the K neighbors' values |
| **Distance Metric**       | Often uses Euclidean distance | Often uses Euclidean distance |
| **Decision Boundary / Prediction Surface** | Separates classes with decision boundaries | Piecewise-constant when neighbors are averaged with equal weights |
| **Evaluation Metrics**    | Accuracy, precision, recall, F1-score, etc. | Mean squared error, R-squared, etc. |
| **Common Applications**   | Image classification, text categorization, etc. | Stock price prediction, house price prediction, etc. |
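
The two variants share the same neighbor search and differ only in the final step. The sketch below makes that explicit; the function names and the house data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def k_nearest_targets(train, query, k):
    """Return the targets of the k training points closest to `query`."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    return [target for _, target in ranked[:k]]

def knn_classify(train, query, k=3):
    # Classification: the most common label among the neighbors wins.
    return Counter(k_nearest_targets(train, query, k)).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    # Regression: the prediction is the mean of the neighbors' values.
    neighbors = k_nearest_targets(train, query, k)
    return sum(neighbors) / len(neighbors)

# Hypothetical houses described by (area in square meters, rooms).
features = [(50, 2), (60, 2), (65, 3), (120, 4), (140, 5)]
labelled = list(zip(features, ["small", "small", "small", "large", "large"]))
priced = list(zip(features, [150.0, 180.0, 210.0, 400.0, 450.0]))

print(knn_classify(labelled, (62, 2)))  # small
print(knn_regress(priced, (62, 2)))     # 180.0
```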


### Q4. How do you measure the performance of KNN?


**For KNN Classification:**

1. **Accuracy**: It's like checking how many of your predictions are correct out of all predictions.

2. **Precision and Recall**: Precision checks how many of the cases you predicted as positive really are positive. Recall checks how many of the actual positive cases you managed to catch.

3. **F1-Score**: It's the harmonic mean of precision and recall, so it is only high when you both catch most positive cases and make few wrong positive predictions.

4. **Confusion Matrix**: It's a table of actual versus predicted classes, showing how many predictions were correct and which kinds of mistakes you made.

5. **ROC Curve and AUC**: They help you see how well your model distinguishes between positive and negative cases.
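
For a binary problem, the first four metrics can all be derived from the four confusion-matrix counts. The following sketch shows the arithmetic; the function name `binary_metrics` and the example labels are assumptions for illustration only.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
# Counts here: tp=3, fp=2, fn=1, tn=4
for name, value in metrics.items():
    print(name, round(value, 3))
```

Here precision (3 of 5 predicted positives are right) is lower than recall (3 of 4 actual positives are caught), which accuracy alone would not reveal.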

**For KNN Regression:**

1. **Mean Squared Error (MSE)**: It's the average of the squared differences between your predictions and the actual values, so large errors are penalized more heavily than small ones.

2. **Root Mean Squared Error (RMSE)**: Similar to MSE but gives you an error measure in the same units as the values you're predicting.

3. **Mean Absolute Error (MAE)**: It's like checking how far, on average, your predictions are from the actual values but without squaring the differences.

4. **R-squared (R²)**: It tells you what fraction of the variation in the actual values your model explains. A value of 1 is a perfect fit, and 0 means the model does no better than always predicting the mean.
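
The four regression metrics can be computed directly from the prediction errors. This is a small sketch with made-up numbers; the function name `regression_metrics` is an assumption for the example.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)             # sum of squared errors
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation

    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),                     # same units as the target
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

results = regression_metrics(y_true, y_pred)
for name, value in results.items():
    print(name, round(value, 3))
```

In this example the single error of size 2 contributes 4 of the 6 units of squared error, which is why MSE (1.5) is larger than MAE (1.0).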

Remember, which one you use depends on whether you're working with classification (e.g., classifying things into categories) or regression (e.g., predicting numerical values).

### Q5. What is the curse of dimensionality in KNN?

Imagine you have a lot of data with many, many features (like hundreds or thousands). In KNN, having too many features can be a problem.

Here's why:

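1. **It's Hard to Find Neighbors**: In a high-dimensional space, data points become sparse, and the distance to the nearest point becomes almost the same as the distance to the farthest one. It's like living in a huge city where even your nearest neighbors are very far from you. KNN needs neighbors that are meaningfully closer than everyone else to work well.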

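2. **Calculations are Slow**: KNN involves measuring distances between data points. The cost of each distance grows with the number of features, and index structures such as KD-trees that normally speed up the neighbor search lose their advantage in high dimensions, so prediction can become a real computational burden.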

3. **You Need a Lot of Data**: To make up for the sparsity in high dimensions, you'd need an enormous amount of data. Getting that much data can be tough.

4. **It's Easy to Make Mistakes**: Every feature contributes to the distance, so irrelevant or noisy features can drown out the few that matter. The "nearest" neighbors may then not be truly similar, leading to poor predictions.

To deal with this, people often use tricks like reducing the number of features or using special distance measures. The curse of dimensionality is all about the problems you face when dealing with too many features in KNN.
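
The first point can be checked with a small experiment: for random points, the ratio between the nearest and the farthest distance from a query moves toward 1 as the number of dimensions grows. This is a rough sketch that assumes uniformly random data in the unit cube; the function name and the sample sizes are arbitrary choices for the illustration.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the smallest to the largest distance from a random query point.

    A ratio near 0 means the nearest neighbor is much closer than the farthest
    point; a ratio near 1 means all points are at almost the same distance.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)

for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

When the ratio is close to 1, the "nearest" neighbors are barely nearer than any other point, so their labels carry little extra information.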

### Q6. How do you handle missing values in KNN?

Handling missing values in K-Nearest Neighbors (KNN) can be a bit tricky because KNN relies on measuring distances between data points. Here are several approaches to deal with missing values when using KNN:

1. **Imputation**:
   - Fill in missing values with estimated or imputed values. Common methods for imputation include using the mean, median, mode, or a regression model to estimate the missing values. Imputing values allows you to maintain complete data for KNN.

2. **Data Transformation**:
   - Add a binary flag for each feature that has missing values (0 for missing, 1 for present) and fill the missing entry itself with a placeholder such as the mean. This way, you can still calculate distances, and the flag keeps the information that the value was missing.

3. **Feature Selection**:
   - If a feature has too many missing values, and you believe it doesn't carry much information, you may consider removing that feature from the dataset. Feature selection can help simplify your model.

4. **Weighted KNN**:
   - Modify the KNN algorithm to give different weights to neighbors based on the number of missing values they have. Neighbors with fewer missing values can be given more weight in the calculations, as they are more informative.

5. **Use a Separate Category**:
   - If appropriate for your data, you can treat missing values as a separate category or label. This can work well when the missing values themselves contain valuable information.

6. **Nearest Neighbors with Similar Missing Data**:
   - When you have a lot of missing values, consider using only the neighbors with similar patterns of missing values. This can be more informative than including all neighbors.

7. **KNN Imputation**:
   - There's a specific method called "KNN imputation" where you use KNN to impute missing values. For each missing value, you find the K-nearest neighbors for that data point and use their values to estimate the missing value. This approach is more complex but can be effective.

8. **Use Specialized Libraries**:
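   - Some machine learning libraries, like scikit-learn in Python, offer imputers such as `SimpleImputer` and `KNNImputer` (both in `sklearn.impute`) that fill in missing values before the data is passed to a KNN model. With default settings, scikit-learn's KNN estimators reject inputs that contain missing values, so imputation has to happen first.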

The choice of method depends on your specific data, the nature of the missing values, and the goals of your analysis. In any case, it's essential to carefully consider how missing values may affect the quality of your KNN model and choose an approach that makes sense for your particular problem.
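
To make the KNN imputation idea (item 7) concrete, here is a simplified sketch in plain Python. It is not a reimplementation of scikit-learn's `KNNImputer` or any other library routine: the function name `knn_impute`, the use of `None` for missing entries, and the distance over shared columns are assumptions made for this illustration.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distances use only the columns observed in both rows.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing to compare, treat as infinitely far
        # Average over shared columns so rows with few shared columns
        # are not favored just because fewer terms are summed.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows where column j is observed.
            donors = [r for idx, r in enumerate(rows)
                      if idx != i and r[j] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:  # leave the gap if no donor exists
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled

# Hypothetical data: the last row is missing its third value.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 8.0, 50.0],
    [1.05, 2.05, None],
]

completed = knn_impute(rows, k=2)
print(completed[3])  # [1.05, 2.05, 11.0]
```

The missing value is filled from the two rows that look most like the incomplete one, and the distant third row (with value 50.0) is ignored.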

### Q7. Compare and contrast the performance of the KNN classifier and regressor. Which one is better for which type of problem?


**KNN Classifier:**

| Aspect                   | Description                                  |
|--------------------------|----------------------------------------------|
| Type of Problem          | Classification                               |
| Output                   | Discrete class labels                       |
| Predicted Value          | Class label (e.g., "spam" or "not spam")    |
| Performance Metrics      | Accuracy, precision, recall, F1-score, etc.  |
| Decision Boundary        | Separates classes with decision boundaries   |
| Use Cases                | Image classification, text categorization, fraud detection, medical diagnosis (e.g., disease/no disease) |

**KNN Regressor:**

| Aspect                   | Description                                  |
|--------------------------|----------------------------------------------|
| Type of Problem          | Regression                                   |
| Output                   | Continuous numerical values                  |
| Predicted Value          | Continuous numerical value (e.g., house price, stock price) |
| Performance Metrics      | Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared (R²), etc. |
| Prediction Surface       | Piecewise-constant when neighbors are averaged with equal weights |
| Use Cases                | House price prediction, stock price prediction, demand forecasting, any prediction involving numerical values |

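These tables summarize the key differences between the KNN classifier and the KNN regressor. Neither is better in general: use the classifier when the target is a category and the regressor when the target is a number. Because the two differ only in how the neighbors' targets are combined, they share the same strengths and weaknesses.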

### Q8. What are the strengths and weaknesses of the KNN algorithm for classification and regression tasks, and how can these be addressed?



**KNN for Classification**

| **Strengths**             | **Weaknesses**                              | **Addressing Weaknesses**                       |
|---------------------------|---------------------------------------------|-------------------------------------------------|
| Simplicity                | Performance depends on the choice of K      | Optimize K: Use cross-validation or grid search. |
| Non-parametric            | Sensitivity to outliers                     | Outlier Handling: Identify and handle outliers. |
| Adaptability              | Sensitivity to feature scale and distance metric | Data Preprocessing: Standardize or normalize data. Distance Metrics: Experiment with alternatives. |
|                           | Computationally intensive at prediction time | Reduce the number of features, or use an index such as a KD-tree for the neighbor search. |
|                           |                                             | Ensemble Methods: Combine with other algorithms. |

**KNN for Regression**

| **Strengths**             | **Weaknesses**                              | **Addressing Weaknesses**                       |
|---------------------------|---------------------------------------------|-------------------------------------------------|
| Simplicity                | Performance depends on the choice of K      | Optimize K: Use cross-validation or grid search. |
| Flexibility               | Sensitivity to outliers                     | Outlier Handling: Identify and handle outliers. |
| Adaptability              | Sensitivity to feature scale and distance metric | Data Preprocessing: Standardize or normalize data. Distance Metrics: Experiment with alternatives. |
|                           | Computationally intensive at prediction time | Reduce the number of features, or use an index such as a KD-tree for the neighbor search. |
|                           |                                             | Ensemble Methods: Combine with other algorithms. |


### Q9. What is the difference between Euclidean distance and Manhattan distance in KNN?

![download.jpg](attachment:8ccf951b-20e4-468a-b4c0-c5a3e49a6539.jpg)

**Euclidean Distance:**

| Aspect                  | Description                                          |
|-------------------------|------------------------------------------------------|
| Formula                 | `sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2 + ...)`     |
| Sensitivity             | Differences are squared, so one large difference (for example, an outlier) can dominate the distance. |
| Geometry                | Straight-line distance; isotropic (the same in all directions) and unchanged when the axes are rotated. |

**Manhattan Distance:**

| Aspect                  | Description                                          |
|-------------------------|------------------------------------------------------|
| Formula                 | `abs(x1-x2) + abs(y1-y2) + abs(z1-z2) + ...`         |
| Sensitivity             | Less sensitive to outliers than Euclidean distance because differences are not squared. Still affected by feature scale. |
| Geometry                | Sum of axis-aligned steps, like walking along city blocks; depends on how the axes are oriented. |
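
Both formulas are easy to write out in code. The sketch below is a minimal illustration; the function names and the sample points are arbitrary.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean one, and the two are equal only when the points differ along a single axis.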


### Q10. What is the role of feature scaling in KNN?


Feature scaling helps KNN work better by making sure all the features (like height and weight) are on the same scale. This way, KNN doesn't favor one feature over the others when deciding how similar data points are. It ensures fairness in comparing features.

Imagine you have data about people, and you're using KNN to find similar individuals. Without feature scaling, KNN might give too much importance to, say, weight, because weights in kilograms differ by tens of units while heights in meters differ by only a few tenths. Feature scaling, for example standardizing each feature to zero mean and unit variance or rescaling it to the range 0 to 1, levels the playing field, so both height and weight contribute comparably to the similarity calculation.

It's like putting everything in the same units so that each feature gets a fair say in the decision-making process. This makes KNN more accurate and less biased toward features with large values.
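
The effect is easy to demonstrate on a tiny example. The sketch below assumes standardization (zero mean, unit standard deviation) as the scaling method; the helper names and the five hypothetical people are invented for the illustration.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]

def nearest_index(rows, query_index):
    """Index of the row closest (Euclidean) to the row at `query_index`."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], query))

# Hypothetical people: (height in m, weight in kg).
people = [
    (1.60, 70.0),  # 0: the person we are matching
    (1.95, 71.0),  # 1: much taller, almost the same weight
    (1.61, 75.0),  # 2: almost the same height, slightly heavier
    (1.75, 90.0),  # 3
    (1.80, 55.0),  # 4
]

print(nearest_index(people, 0))               # 1
print(nearest_index(standardize(people), 0))  # 2
```

On the raw data the match is driven almost entirely by weight, so the much taller person 1 is picked. After standardization the 35 cm height gap counts, and person 2 becomes the nearest neighbor.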
