Q1. What is the KNN algorithm?

Ans: K-Nearest Neighbors (KNN) - In Short
KNN is a supervised machine learning algorithm used for classification and regression. It finds the K closest data points (neighbors) using distance metrics like Euclidean distance and predicts based on majority voting (classification) or averaging (regression).

No training phase → Just stores the dataset.

Finds nearest neighbors → Compares with existing data points.

Simple but slow → Works well for small datasets but is computationally expensive for large ones.
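
The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `knn_predict` and the toy fruit data are made up for this example, and a real project would more likely use a library implementation.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just keeping the data; all the work happens at prediction time.
    distances = [
        (math.dist(x, query), label)  # Euclidean distance to each stored point
        for x, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (made up): [weight in grams, colour score]
train_X = [[150, 0.9], [160, 0.8], [155, 0.85], [120, 0.2], [115, 0.3], [125, 0.25]]
train_y = ["apple", "apple", "apple", "orange", "orange", "orange"]

print(knn_predict(train_X, train_y, [152, 0.8], k=3))
```

Every prediction scans the whole stored dataset, which is why the method slows down as the dataset grows.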

Q2. How do you choose the value of K in KNN?

Ans: Choosing the right K value in KNN is crucial for model performance:

Small K (e.g., 1-5): The model becomes too sensitive to noise, meaning it might misclassify due to outliers, leading to overfitting (memorizing the data instead of generalizing).

Large K: The model considers more neighbors, which smooths the decision boundary but may lead to underfitting, meaning it loses important details.

Optimal K Selection: A common approach is using cross-validation to test different values and find the one that minimizes error. A frequently used starting point is K ≈ √N (square root of the number of training samples), but this is only a rule of thumb and should be confirmed by validation.

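Odd vs. Even K: Choosing an odd K avoids tied votes in binary classification. With three or more classes ties can still occur, so a tie-breaking rule (for example, distance-weighted voting) is still needed.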
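
As a rough sketch of the cross-validation idea, the snippet below scores several K values with leave-one-out validation on a tiny made-up dataset. The helper names `knn_predict` and `loo_error` and the data are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest neighbors (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_error(X, y, k):
    """Leave-one-out error rate: predict each point from all the others."""
    mistakes = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) != y[i]:
            mistakes += 1
    return mistakes / len(X)

# Toy 1-D data (made up): two clusters plus one noisy "b" point at 2.5
X = [[1.0], [2.0], [3.0], [4.0], [2.5], [7.0], [8.0], [9.0], [10.0]]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

errors = {k: loo_error(X, y, k) for k in (1, 3, 5, 7)}
best_k = min(errors, key=errors.get)  # first K with the lowest error
print(errors, best_k)
```

On this toy data the smallest K is hurt by the noisy point and the largest K over-smooths the smaller cluster, so an intermediate K gives the lowest error.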

Q3. What is the difference between KNN classifier and KNN regressor?

Ans: The KNN Classifier and KNN Regressor both use the K-Nearest Neighbors algorithm but serve different purposes:

KNN Classifier is used for categorical outputs (classification tasks). It looks at the K nearest neighbors and assigns the most common class among them as the final prediction. Example: Determining if a fruit is an apple or orange.

KNN Regressor is used for continuous outputs (regression tasks). Instead of assigning a class, it calculates the average (or weighted average) of the neighbors' values to make a prediction. Example: Predicting the price of a house based on nearby houses.
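
A minimal sketch of the regression variant, assuming a hypothetical helper `knn_regress` and made-up house data. It shows both the plain average and an inverse-distance weighted average.

```python
import math

def knn_regress(train_X, train_y, query, k=3, weighted=False):
    """Predict a number as the (optionally distance-weighted) mean of k neighbors."""
    nearest = sorted(
        (math.dist(x, query), target) for x, target in zip(train_X, train_y)
    )[:k]
    if not weighted:
        return sum(target for _, target in nearest) / len(nearest)
    # Inverse-distance weights; the small epsilon avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)

# Toy data (made up): house size in square meters -> price in thousands
sizes = [[50], [60], [70], [100], [120]]
prices = [150.0, 180.0, 210.0, 300.0, 360.0]

print(knn_regress(sizes, prices, [65], k=2))                 # plain average
print(knn_regress(sizes, prices, [62], k=2, weighted=True))  # closer house counts more
```

The only change from the classifier is the final step: neighbor targets are averaged instead of counted.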

Q4. How do you measure the performance of KNN?

Ans: The performance of KNN can be measured using different metrics depending on whether it's used for classification or regression:

For KNN Classification
Accuracy – Percentage of correct predictions.

Precision, Recall, and F1-score – Useful for imbalanced datasets.

Confusion Matrix – Shows true positives, false positives, etc.

ROC Curve & AUC – Measures model performance at different thresholds.

For KNN Regression
Mean Squared Error (MSE) – Measures average squared difference between actual and predicted values.

Mean Absolute Error (MAE) – Measures average absolute difference between actual and predicted values.

R² Score (Coefficient of Determination) – Measures how well predictions fit the actual data (closer to 1 is better).
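
To make the definitions concrete, here is a small sketch that computes these metrics by hand. The function names and the example labels are hypothetical, and the classification part assumes a binary problem.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def regression_metrics(y_true, y_pred):
    """MSE, MAE and R² computed directly from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"mse": ss_res / n, "mae": mae, "r2": 1 - ss_res / ss_tot}

# Made-up predictions, as if produced by a KNN model
clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(clf)
print(reg)
```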

Q5. What is the curse of dimensionality in KNN?

Ans: The curse of dimensionality in KNN refers to the problem where the performance of the algorithm degrades as the number of features (dimensions) increases.

Why does it happen?
Increased Sparsity – In high dimensions, data points become more spread out, making it harder to find close neighbors.

Distance Becomes Less Meaningful – The difference between the nearest and farthest points reduces, making distance-based methods like KNN less effective.

Computational Complexity – More dimensions increase the time required to compute distances and find neighbors.
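
The "distance becomes less meaningful" effect can be observed with a small experiment. The sketch below, using a hypothetical helper `relative_contrast` and uniformly random points (an assumption, not real data), measures how much farther the farthest point is than the nearest one as the dimension grows.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The printed ratio should shrink as the dimension increases, meaning the nearest and farthest points end up at nearly the same distance from the query.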

How to handle it?
Feature Selection – Keep only the most relevant features.

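Dimensionality Reduction – Use PCA (Principal Component Analysis) to project the data onto fewer dimensions. t-SNE also reduces dimensions, but it is mainly a visualization tool and does not readily map new, unseen points, so it is less suited as a preprocessing step for KNN.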

Scaling & Normalization – Standardizing data does not reduce the number of dimensions, but it helps keep distances meaningful by preventing features with large ranges from dominating.

Q6. How do you handle missing values in KNN?

Ans: Handling missing values in KNN is important to ensure accurate predictions. Here are some common techniques:

1. Remove Rows with Missing Values
If only a few rows have missing values, you can simply drop them.

Suitable when the dataset is large and missing data is minimal.

2. Impute Missing Values
Mean/Median Imputation – Replace missing values with the column’s mean or median (works well for numerical data).

Mode Imputation – Use the most frequent value for categorical features.

3. KNN-Based Imputation
Use KNN Imputer, which replaces missing values using the average of the nearest K neighbors based on feature similarity.

Often more accurate than simple mean/median imputation when features are correlated, but slower and sensitive to feature scaling.

4. Predict Missing Values
Use a separate machine learning model (like Linear Regression or Decision Trees) to predict and fill missing values.
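
The difference between mean imputation and KNN-based imputation can be sketched as follows. This is a simplified illustration with hypothetical helpers (`mean_impute`, `knn_impute`) and made-up rows where `None` marks a missing value; it is not the exact algorithm used by any particular library's KNN imputer.

```python
import math

def mean_impute(rows):
    """Replace None in each column with the mean of that column's observed values."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def knn_impute(rows, k=2):
    """Fill each None with the column mean over the k most similar rows.

    Similarity is Euclidean distance over the columns both rows have observed.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(rows):
                if other_i == i or other[j] is None:
                    continue  # a donor row must have the value we need
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((row[c] - other[c]) ** 2 for c in shared))
                candidates.append((dist, other[j]))
            nearest = sorted(candidates)[:k]
            if nearest:  # leave the gap if no donor row exists
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Made-up rows; the second column is roughly 10x the first
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [4.0, 40.0], [10.0, 100.0], [11.0, 110.0]]
print(mean_impute(rows)[2])
print(knn_impute(rows, k=2)[2])
```

In this example the column mean is pulled up by the large rows, while the neighbor-based estimate uses only the rows most similar to the incomplete one.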

Q7. Compare and contrast the performance of the KNN classifier and regressor. Which one is better for which type of problem?

Ans: The KNN classifier and KNN regressor serve different purposes based on the type of problem. The KNN classifier is used for classification tasks where the target variable is categorical, such as spam detection or disease prediction. It determines the class of a data point by majority voting among its k nearest neighbors. In contrast, the KNN regressor is used for regression tasks where the target variable is continuous, such as predicting house prices or temperature. Instead of voting, it takes the average (or weighted average) of the k nearest neighbors' values to make predictions.

While both methods rely on distance-based similarity, the classifier's performance is typically evaluated using accuracy, precision, recall, or F1-score, whereas the regressor is assessed using RMSE, MAE, or R² score. The classifier can struggle with imbalanced datasets, while the regressor is more sensitive to noisy data and outliers. KNN is computationally expensive for large datasets, but it performs well on small datasets.

Choosing between them depends on whether the problem requires class label predictions (classifier) or numerical value predictions (regressor).

Q8. What are the strengths and weaknesses of the KNN algorithm for classification and regression tasks, and how can these be addressed?

Ans: Strengths and Weaknesses of KNN
Strengths:
Simple, easy to implement, and non-parametric (makes no assumptions about the underlying data distribution).

Works well for both classification and regression, especially on small datasets.

Effective when data is well-structured and noise-free.

Weaknesses:
Computationally expensive for large datasets due to distance calculations.

Sensitive to noise, outliers, and imbalanced data.

Struggles with high-dimensional data (curse of dimensionality).

Solutions:
Use KD-Trees, Ball Trees, or Approximate Nearest Neighbors (ANN) for faster search.

Apply feature scaling, outlier removal, and weighted KNN to improve accuracy.

Reduce dimensionality using PCA or feature selection to mitigate the curse of dimensionality.

Handle imbalanced data with SMOTE or weighted KNN.
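
Weighted KNN, mentioned above, can be sketched by letting each neighbor vote with weight 1/distance. The function name `weighted_knn_predict` and the toy data are assumptions made for this illustration.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, query, k=3):
    """Vote with weight 1/distance so that closer neighbors count more."""
    nearest = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    scores = defaultdict(float)
    for dist, label in nearest:
        scores[label] += 1.0 / (dist + 1e-9)  # epsilon guards against dist == 0
    return max(scores, key=scores.get)

# Toy data (made up): one "rare" point close to the query, "common" points farther away
train_X = [[0.0], [2.0], [2.2], [5.0]]
train_y = ["rare", "common", "common", "common"]

print(weighted_knn_predict(train_X, train_y, [0.3], k=3))
```

With a plain majority vote the two "common" neighbors would outvote the single "rare" one here; distance weighting lets the much closer point win, which is one reason it can help with imbalanced or noisy data.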

Q9. What is the difference between Euclidean distance and Manhattan distance in KNN?

Ans: Difference Between Euclidean Distance and Manhattan Distance in KNN

Path Type:

Euclidean: Measures the shortest straight-line distance between two points.

Manhattan: Measures distance along axis-aligned, grid-like paths (the sum of absolute coordinate differences).

Best for:

Euclidean: Works well for continuous data with smooth distributions.

Manhattan: Suitable for grid-based data or independent features.

Computation Complexity:

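Euclidean: Slightly more work per pair (squares plus a square root), although the square root can be skipped when only ranking neighbors.

Manhattan: Slightly cheaper, as it only involves absolute differences. Both scale linearly with the number of features, so the difference is rarely significant in practice.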

Effect of High Dimensions:

Euclidean: Affected by the curse of dimensionality (becomes less reliable in high dimensions).

Manhattan: Often found to be somewhat less affected than Euclidean distance, though it also degrades as dimensions grow.
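
For reference, both metrics are short to write out. The function names below are chosen for this sketch only.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Grid distance: sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean one, since the straight line is the shortest path.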

Q10. What is the role of feature scaling in KNN?

Ans: Role of Feature Scaling in KNN
Ensures Fair Distance Calculation:

KNN relies on distance metrics (e.g., Euclidean, Manhattan).

Features with larger ranges dominate those with smaller ranges.

Improves Model Accuracy:

Prevents bias towards features with higher numerical values.

Ensures all features contribute equally to distance computations.

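Does Not Speed Up KNN Itself:

KNN has no iterative training, so there is no convergence to speed up; that benefit of scaling applies to gradient-based models.

Scaling does not reduce the cost of distance computations either; its value in KNN is making distances comparable across features.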

Prevents Skewed Predictions:

Unscaled data can cause incorrect nearest-neighbor selection.

Leads to more reliable classifications and regressions.

Common Scaling Techniques:
Min-Max Scaling: Rescales each feature to the range 0 to 1.

Standardization (Z-score): Rescales each feature to mean 0 and unit variance.
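
A small sketch of both techniques, plus a made-up example of how scaling can change which neighbor is nearest. The helper names and the age/income values are assumptions for illustration.

```python
import math

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score: subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def nearest_to_first(points):
    """Index of the point closest (Euclidean) to points[0], excluding itself."""
    return min(range(1, len(points)), key=lambda i: math.dist(points[0], points[i]))

# Made-up people: age in years, income in currency units
ages = [25, 60, 27, 40]
incomes = [50000, 51000, 56000, 90000]

raw = list(zip(ages, incomes))
scaled = list(zip(min_max_scale(ages), min_max_scale(incomes)))

print(nearest_to_first(raw))     # income dominates the raw distance
print(nearest_to_first(scaled))  # both features contribute after scaling
```

On the raw values the first person's nearest neighbor is the one with a similar income but a 35-year age gap; after scaling, the person of similar age and moderately different income becomes the nearest.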