# Notebook by -
## Himel Sarder
## LinkedIn : https://www.linkedin.com/in/himel-sarder/

# ***KNN Imputer***

The KNN Imputer is a powerful tool provided by the scikit-learn library for handling missing values in datasets. It uses the k-Nearest Neighbors (kNN) algorithm to impute missing values by considering the mean value from the nearest neighbors.

## Key Principles

The KNN Imputer works by finding the nearest neighbors for each sample with missing values and imputing the missing values using the mean value from these neighbors. This method is particularly useful when the data has a pattern or structure that can be captured by the neighbors.
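
To make the neighbor-averaging idea concrete, here is a minimal sketch on a small toy matrix. The array `X_toy` is invented purely for illustration and is not part of the notebook's dataset. Each missing entry is replaced by the mean of that feature over the 2 nearest rows that have the feature observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (illustrative only): rows are samples, columns are features,
# and np.nan marks a missing entry.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)

print(X_filled)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

In the first row, the two nearest rows with an observed third feature hold 3 and 5, so the gap becomes 4. In the third row, the two nearest rows with an observed first feature hold 3 and 8, so the gap becomes 5.5.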

## Parameters

### missing_values: The placeholder for the missing values. Default is np.nan.

### n_neighbors: Number of neighboring samples to use for imputation. Default is 5.

### weights: Weight function used in prediction. Options are 'uniform', 'distance', or a callable function.

### metric: Distance metric for searching neighbors. Default is 'nan_euclidean'.

### copy: If True, a copy of X will be created. If False, imputation will be done in-place.

### add_indicator: If True, a MissingIndicator transform will stack onto the output of the imputer's transform.

### keep_empty_features: If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called.
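
As a rough sketch of how a couple of these parameters change the output, the snippet below combines `weights='distance'` with `add_indicator=True` on a small made-up array (`X_toy` is an assumption for illustration, not data from this notebook). The imputed features come first, followed by one binary flag column per feature that contained missing values during `fit`.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (illustrative only): features 0 and 2 each contain one gap.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2, weights="distance", add_indicator=True)
X_out = imputer.fit_transform(X_toy)

# 3 imputed feature columns + 2 indicator columns (for features 0 and 2).
print(X_out.shape)   # (4, 5)

# Indicator block: 1.0 where the original value was missing.
print(X_out[:, 3:])
```

The indicator columns can let a downstream model learn from the fact that a value was missing, not only from the imputed estimate.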


## Considerations

Choosing k: The choice of the number of neighbors (k) can significantly impact the imputation results. It is often recommended to test different values of k and use cross-validation to find the optimal value.
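
One possible way to tune k with cross-validation is to put the imputer and the model in a single pipeline and search over `n_neighbors`. The sketch below uses synthetic data from `make_classification` with randomly injected gaps; the dataset, the step names `imputer` and `model`, and the candidate grid are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (illustrative only), with ~10% of entries removed.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Imputation lives inside the pipeline, so it is re-fit on each training fold.
pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("model", LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(
    pipe,
    param_grid={"imputer__n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the imputer inside the pipeline avoids fitting it on validation folds, which would otherwise leak information into the score.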

Distance Metric: The default distance metric is 'nan_euclidean', which handles missing values appropriately. However, other distance metrics can be used if needed.
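
To illustrate how 'nan_euclidean' copes with gaps, the sketch below calls scikit-learn's `nan_euclidean_distances` on two made-up rows. Coordinates missing in either row are skipped, and the squared distance over the remaining coordinates is scaled up by (total coordinates / coordinates present in both rows).

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two illustrative rows with gaps in different positions.
a = np.array([[3.0, np.nan, np.nan, 6.0]])
b = np.array([[1.0, np.nan, 4.0, 5.0]])

# Only coordinates 0 and 3 are observed in both rows, so:
# squared distance = (4 / 2) * ((3 - 1)**2 + (6 - 5)**2) = 2 * 5 = 10
dist = nan_euclidean_distances(a, b)

print(dist)  # [[3.16227766]]
```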

Performance: KNNImputer can be computationally intensive, especially for large datasets. It is important to consider the trade-off between accuracy and computational cost.

In summary, the KNNImputer is a robust and effective method for imputing missing values in datasets, leveraging the k-Nearest Neighbors algorithm to provide accurate estimates based on the nearest neighbors.

![image.png](attachment:d62e3f29-aeed-48fc-b627-d365944feec1.png)

![image.png](attachment:68d03da3-6958-46bc-bf31-dc5ee9d558a0.png)

![image.png](attachment:cc892bc6-ce7f-467c-b203-1f8e8f1eac4e.png)

![image.png](attachment:8f2a4f05-308b-4a9d-a636-adaeec91425d.png)

![image.png](attachment:f1e67a6f-e7a7-4b13-884a-5d80813ff781.png)

![image.png](attachment:8dfc24c0-f725-4a50-86d3-090be2749281.png)

![image.png](attachment:2e60be4e-af93-4670-b922-01c72428db06.png)

Why Use KNN Imputation?   
✅ Preserves Patterns: Unlike mean/median imputation, KNN imputation keeps relationships between features intact.   
✅ Works Well for Non-Normal Data: Handles skewed distributions and non-Gaussian data.   
✅ Adapts to Local Data Structure: Uses actual values from similar records instead of a fixed statistic.   
✅ Can Be Applied to Encoded Categorical Data: scikit-learn's KNNImputer works on numeric arrays, so categorical features must be encoded as numbers first.   

🚫 When NOT to Use KNN Imputation:   
When the dataset is very large (computationally expensive).   
When the missing data is not random (e.g., missing due to external factors).   
When the dataset is highly sparse (too many missing values).   


Advantages of KNN Imputation   
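✔ Uses actual data points for imputation.   
✔ Handles numerical data directly, and categorical data once it has been numerically encoded.   
✔ Preserves relationships between variables.   
✔ Often more accurate than mean/median imputation when similar rows carry information about the missing value.   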

Disadvantages of KNN Imputation   
❌ Computationally expensive for large datasets.   
❌ Sensitive to outliers (distance-based method).   
❌ Does not work well with very sparse data.   
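
Because the method is distance-based, features measured on larger scales can dominate the neighbor search. One common mitigation, sketched here as an assumption and not as something this notebook does, is to standardize the features before imputing. `StandardScaler` ignores NaNs when fitting and passes them through when transforming, so it can sit in front of `KNNImputer`. The array `X_scales` is made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative values only: three features on very different scales.
X_scales = np.array([
    [22.0, 3.0, 7.25],
    [38.0, 1.0, 71.28],
    [np.nan, 3.0, 7.93],
    [35.0, 1.0, 53.10],
    [35.0, 3.0, np.nan],
    [54.0, 1.0, 51.86],
])

# Scale first (NaNs are preserved), then impute in the scaled space.
scale_then_impute = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_scaled_filled = scale_then_impute.fit_transform(X_scales)

# Map the result back to the original units for inspection.
scaler = scale_then_impute.named_steps["standardscaler"]
X_original_units = scaler.inverse_transform(X_scaled_filled)

print(X_original_units)
```

Whether scaling helps depends on the data, so it is worth comparing both variants on a validation set.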

![image.png](attachment:313777ac-df27-44b1-ad95-5dcb7a80a588.png)

In [32]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [33]:
df = pd.read_csv('train.csv')[['Age','Pclass','Fare','Survived']]

In [34]:
df.head()

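|   | Age | Pclass | Fare | Survived |
|---|-----|--------|------|----------|
| 0 | 22.0 | 3 | 7.25 | 0 |
| 1 | 38.0 | 1 | 71.2833 | 1 |
| 2 | 26.0 | 3 | 7.925 | 1 |
| 3 | 35.0 | 1 | 53.1 | 1 |
| 4 | 35.0 | 3 | 8.05 | 0 |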


In [35]:
df.isnull().mean() * 100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [36]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [37]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

In [38]:
X_train.head()

|     | Age | Pclass | Fare |
|-----|-----|--------|------|
| 30 | 40.0 | 1 | 27.7208 |
| 10 | 4.0 | 3 | 16.7 |
| 873 | 47.0 | 3 | 9.0 |
| 182 | 9.0 | 3 | 31.3875 |
| 876 | 20.0 | 3 | 9.8458 |


In [53]:
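# Impute using the 3 nearest neighbors, giving closer neighbors more weight
knn = KNNImputer(n_neighbors=3, weights='distance')

# Fit on the training split only, then reuse the fitted imputer on the test split
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)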

In [54]:
lr = LogisticRegression()

lr.fit(X_train_trf,y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test,y_pred)

0.7150837988826816

In [55]:
# Comparison with SimpleImputer (default strategy: mean)

si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [56]:
lr = LogisticRegression()

lr.fit(X_train_trf2,y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test,y_pred2)

0.6927374301675978
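
The two accuracies above come from a single train/test split, so the gap between them may partly reflect that particular split. A hedged sketch of a more stable comparison is to score each imputer with cross-validation inside a pipeline. The snippet uses synthetic data with randomly injected gaps in place of `train.csv`; the dataset and the dictionary keys are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (illustrative only), with ~15% of entries removed.
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=1)
rng = np.random.default_rng(1)
X_syn[rng.random(X_syn.shape) < 0.15] = np.nan

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=3, weights="distance"),
}

# Mean 5-fold accuracy for each imputer followed by the same classifier.
cv_scores = {}
for name, imputer in candidates.items():
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    cv_scores[name] = cross_val_score(
        pipe, X_syn, y_syn, cv=5, scoring="accuracy"
    ).mean()
    print(f"{name}: {cv_scores[name]:.3f}")
```

The same loop could be pointed at `X` and `y` from this notebook to check whether the KNN advantage holds across folds.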

## Thank you