## K-Nearest Neighbors (KNN)
 
KNN is a non-parametric algorithm that classifies (or regresses) a point based on the K closest training examples in feature space. It works like asking your neighbors for advice, where "neighbors" are the data points most similar to the one you're trying to predict. The algorithm measures how far each training example is from the new data point using a distance metric (such as Euclidean distance), then identifies the K closest neighbors. For classification, KNN takes a vote among these K neighbors and predicts the majority class; for regression, it averages their values. The entire training dataset must be stored and used during prediction, which is why KNN is called a "lazy learner": it doesn't build a model during training but does all the work at prediction time.
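
The steps described above (measure distances, pick the K closest points, vote) can be sketched in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements it: the function name `knn_predict`, its `p` argument (1 for Manhattan, 2 for Euclidean), and the toy data are all assumptions made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, p=2):
    """Classify one point by majority vote among its k nearest neighbors.

    p selects the Minkowski distance: p=1 is Manhattan, p=2 is Euclidean.
    """
    # Distance from x_new to every stored training point
    distances = np.sum(np.abs(X_train - x_new) ** p, axis=1) ** (1.0 / p)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))       # → 0
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3, p=1))  # → 1
```

Note that there is no separate "fit" step: the function needs the full training arrays at prediction time, which is the lazy-learning behavior described above. For regression, the vote would be replaced by the mean of the neighbors' target values.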

*   **Use Cases:** Use KNN when you have a reasonable amount of labeled data and the relationship between features and the target is complex or unknown. It works best with low-dimensional data, where distance metrics are meaningful, and when similar inputs truly correspond to similar outputs. KNN relies on three assumptions: the local structure of your data carries important information about the target variable; the distance metric you choose (Euclidean, Manhattan, etc.) appropriately captures similarity between data points; and all features contribute equally to the distance calculation unless explicitly weighted otherwise.

*   **Pros:**
    - Simple to understand and implement
    - Makes no assumptions about the underlying data distribution
    - Naturally handles multi-class classification problems
    - No explicit training phase (lazy learning); fitting essentially just stores the data
    - Can adapt quickly as new training data becomes available
    - Works well with feature spaces where local information is important

*   **Cons:**
    - Computationally expensive during prediction as dataset size increases
    - Performs poorly with high-dimensional data (curse of dimensionality)
    - Sensitive to irrelevant features that distort distance calculations
    - Requires feature scaling to prevent features with larger ranges from dominating
    - Memory-intensive as it stores the entire training dataset
    - Choosing the optimal K value can be challenging and problem-dependent


*   **Best Practices:**
    - Always scale your features so that larger-valued features don't dominate the distance calculations.
    - Test different K values using cross-validation to find the best number of neighbors for your specific problem.
    - Remove irrelevant features, since KNN performs poorly in high-dimensional spaces due to the "curse of dimensionality."
    - Consider weighted voting, where closer neighbors have more influence on the prediction.
    - Use specialized data structures such as KD-trees or Ball trees to speed up neighbor searches on large datasets.
    - Tune `n_neighbors` (a common starting heuristic is the square root of the number of training samples), `weights` (`'uniform'` vs `'distance'`), `p` (1 for Manhattan, 2 for Euclidean distance), and `algorithm` (`'auto'`, `'ball_tree'`, `'kd_tree'`, or `'brute'`) depending on dataset size and dimensionality.
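
Several of these practices can be combined in one place. The sketch below is one possible arrangement, not the only one: it wraps `StandardScaler` and `KNeighborsClassifier` in a scikit-learn `Pipeline` so the scaler is re-fitted on each cross-validation training fold, and searches over `n_neighbors`, `weights`, and `p` together. The step names `"scaler"` and `"knn"`, the variable names, the stratified split, and the grid ranges are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Scaling lives inside the pipeline, so it only ever sees training folds
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameter names are prefixed with the pipeline step name ("knn__")
param_grid = {
    "knn__n_neighbors": list(range(1, 21)),
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],  # 1 = Manhattan, 2 = Euclidean
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_te, y_te), 3))
```

Keeping the scaler inside the pipeline means the test set and the validation folds never influence the scaling statistics, which avoids a subtle form of data leakage.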



In [60]:
%pip install --quiet pandas numpy matplotlib seaborn scikit-learn


import pandas as pd
import seaborn as sns



In [61]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset into a DataFrame and append the class labels
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target


In [62]:
df.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

In [None]:
# Check column dtypes and look for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB


In [None]:
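# Standardize the features (zero mean, unit variance) so that all of them
# contribute equally to the distance calculations in KNN.
# Note: the scaler is fitted on the full dataset here; fitting it on the
# training split only would keep test-set statistics out of preprocessing.
scaler = StandardScaler()
scaler.fit(df.drop('target', axis=1))
scaled_feature = scaler.transform(df.drop('target', axis=1))
scaled_feature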

array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00],
       [-5.37177559e-01,  1.47939788e+00, -1.28338910e+00,
        -1.31544430e+00],
       [-1.26418478e+00,  7.88807586e-01, -1.22655167e+00,
        ...

In [66]:
# Wrap the scaled array in a DataFrame, reusing the original feature names
df_feat = pd.DataFrame(scaled_feature, columns=df.columns[:-1])
df_feat

|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|-----|-------------------|------------------|-------------------|------------------|
| 0   | -0.900681         | 1.019004         | -1.340227         | -1.315444        |
| 1   | -1.143017         | -0.131979        | -1.340227         | -1.315444        |
| 2   | -1.385353         | 0.328414         | -1.397064         | -1.315444        |
| 3   | -1.506521         | 0.098217         | -1.283389         | -1.315444        |
| 4   | -1.021849         | 1.249201         | -1.340227         | -1.315444        |
| ... | ...               | ...              | ...               | ...              |
| 145 | 1.038005          | -0.131979        | 0.819596          | 1.448832         |
| 146 | 0.553333          | -1.282963        | 0.705921          | 0.922303         |
| 147 | 0.795669          | -0.131979        | 0.819596          | 1.053935         |
| 148 | 0.432165          | 0.788808         | 0.933271          | 1.448832         |
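
To see why the scaling step matters for a distance-based method, it can help to look at a case where the nearest neighbor changes once features are standardized. The sketch below uses made-up income and age values (not the Iris data) and a hypothetical helper, `nearest_to_first`, purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is income (dollars), column 1 is age (years)
X_raw = np.array([[30000.0, 25.0],   # row 0: the query point
                  [30500.0, 65.0],   # row 1: similar income, very different age
                  [31000.0, 26.0],   # row 2: similar income, similar age
                  [60000.0, 40.0]])  # row 3: very different income


def nearest_to_first(X):
    """Return the index of the row closest to row 0 (Euclidean distance)."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1


# Unscaled: income differences (hundreds of dollars) swamp age differences
print(nearest_to_first(X_raw))     # → 1

# Standardized: both columns are on comparable scales, so age counts too
X_scaled = StandardScaler().fit_transform(X_raw)
print(nearest_to_first(X_scaled))  # → 2
```

On the raw values, the 40-year age gap of row 1 is invisible next to a 500-dollar income gap; after standardization, row 2 (close in both income and age) becomes the nearest neighbor.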


In [67]:
# Hold out 20% of the samples for testing; random_state makes the split reproducible
X = df_feat
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
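# Search n_neighbors from 1 to 20 using 5-fold cross-validation on the training set
param_grid = {'n_neighbors': range(1, 21)}
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, cv=5)
grid.fit(X_train, y_train)
# Report the best parameters and the mean cross-validated score found by the grid search
print(grid.best_params_, grid.best_score_)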

In [69]:
# Evaluate the best estimator (refitted on the full training set) on the held-out test set
best_knn = grid.best_estimator_
pred = best_knn.predict(X_test)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
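
The report above is based on only 30 test samples, so a single split may not reflect how the model behaves on other partitions of the data. One way to get a steadier estimate is to cross-validate over the whole dataset, sketched below under some assumptions: `n_neighbors=5` is scikit-learn's default and stands in for whatever value the grid search selected, and the 10-fold stratified setup is just one reasonable choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)

# Placeholder K: replace 5 with the n_neighbors chosen by the grid search
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Shuffled, stratified folds keep the class balance the same in every split
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X_all, y_all, cv=cv)

print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The spread across folds gives a rough sense of how much the accuracy could vary with a different train/test split.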
