<div id="header">
    <p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:20px;">KNN Imputer
    </p>
</div>

---

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• The KNN (K-Nearest Neighbors) imputer is designed to fill in missing values in a dataset by looking at the `k` nearest samples (neighbors) in the feature space. 
<br>
• It estimates each missing value from the values those neighbors have for the same feature (by default, their mean).
</div>
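
The idea above can be seen on a tiny array before touching the Titanic data. This is a minimal sketch on made-up numbers (the array `X` and the name `X_filled` are illustrative, not part of the dataset used below): each `nan` is replaced by the mean of that feature over the two nearest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: 4 samples, 3 features, two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry becomes the mean of that feature over the
# 2 nearest rows in which the feature is observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

Row 0 borrows feature 2 from rows 1 and 2 (mean of 3 and 5), while row 2 borrows feature 0 from rows 1 and 3 (mean of 3 and 8).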

In [190]:
# Importing Libraries
import numpy as np
import pandas as pd

In [191]:
# Reading CSV File
df = pd.read_csv("titanic_data.csv", usecols=['Age','Pclass','Fare','Survived'])
df.sample(5)

     Survived  Pclass   Age     Fare
178         0       2  30.0     13.0
375         1       1   NaN  82.1708
130         0       3  33.0   7.8958
399         1       2  28.0    12.65
115         0       3  21.0    7.925


In [192]:
# Shape of the Data
df.shape

(891, 4)

In [193]:
# Percentage of Null values in each Column
df.isna().mean()*100

Survived     0.00000
Pclass       0.00000
Age         19.86532
Fare         0.00000
dtype: float64

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Train Test Split</strong>
<br>
The train-test split is a common technique in machine learning for evaluating model performance. It involves dividing your dataset into two parts :
<br>
• <strong>Training Set :</strong> Used to train the model.
<br>
• <strong>Testing Set :</strong> Used to evaluate the model's performance on unseen data.
<br>
<br>
<strong>Parameters</strong>
<br>
• <strong>arrays :</strong> One or more indexables of the same length (e.g., the feature matrix and the target vector), passed as positional arguments.
<br>
• <strong>test_size :</strong> A float between 0 and 1 sets the proportion of the dataset to include in the test split (e.g., 0.2 for 20%); an integer sets the absolute number of test samples.
<br>
• <strong>random_state :</strong> Seeds the shuffling applied to the data before the split. Passing an integer makes the split reproducible across runs.
<br>
• <strong>shuffle :</strong> A boolean (default True) that indicates whether to shuffle the data before splitting.
</div>
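
To make the `random_state` and `shuffle` parameters concrete, here is a small sketch on hypothetical toy arrays (`X_demo` and `y_demo` are made up for illustration). It only checks that a fixed seed reproduces the same split and that `shuffle=False` keeps the original row order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# The same integer random_state yields the same split on every call
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(X_tr_a.shape, X_te_a.shape)      # (7, 2) (3, 2)
print(np.array_equal(X_te_a, X_te_b))  # True

# shuffle=False keeps the row order, so the last 3 rows form the test set
X_tr_c, X_te_c = train_test_split(X_demo, test_size=0.3, shuffle=False)
print(X_te_c)
# [[14 15]
#  [16 17]
#  [18 19]]
```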

In [194]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [195]:
# Dividing Features and Target Variables
X = df.drop(columns=['Survived'])
y = df['Survived']

In [196]:
# Splitting the Data (no random_state is set, so the split and the scores below change between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [197]:
# Shape of Training and Testing Set 
print(X_train.shape, X_test.shape)

(623, 3) (268, 3)


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Parameters of KNNImputer</strong>
<br>
→ The main parameters to consider when initializing scikit-learn's KNNImputer :
<br>
1. <strong>n_neighbors (default=5)</strong>
<br>
• The number of nearest neighbors to consider when imputing the missing values.
<br>
2. <strong>weights (default="uniform")</strong>
<br>
• Defines how the neighbors contribute to the imputation.
<br>
→ uniform: All neighbors contribute equally.
<br>
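→ distance: Closer neighbors contribute more to the average (each neighbor is weighted by the inverse of its distance).
<br>
→ callable: A user-defined function that takes an array of distances and returns an array of weights.
<br>
• Distance weighting can help when the neighbors lie at very different distances, but it is not guaranteed to be better, so compare both options on validation data.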
<br>
3. <strong>metric (default="nan_euclidean")</strong>
<br>
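• The distance metric used to find neighbors.
<br>
→ nan_euclidean: The only built-in option. It is the Euclidean distance computed over the coordinates present in both samples, rescaled to account for the coordinates that were skipped.
<br>
→ callable: A user-defined function that takes two rows (1-D arrays) plus a missing_values keyword and returns a scalar distance.
<br>
→ String metrics such as "manhattan" or "cosine" are not accepted by KNNImputer.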
<br>
4. <strong>missing_values (default=np.nan)</strong>
<br>
• The placeholder for missing values in the dataset.
<br>
<br>
<strong>Advantages of KNNImputer</strong>
<br>
1. Preserves Data Distribution
<br>
• KNN imputation fills each gap with values taken from similar observed rows rather than a single constant, which tends to distort the feature's distribution less than mean or median imputation.
<br>
2. Flexible
<br>
• It can be adapted with different weighting schemes or a custom distance function. Note that scikit-learn's KNNImputer works on numerical data only, so categorical features must be encoded as numbers first.
<br>
3. Contextual Imputation
<br>
• By considering the nearest neighbors, it accounts for the local structure of the data, which can lead to more accurate imputations than simpler methods like mean or median imputation.
<br>
4. Good for Small to Medium Datasets
<br>
• KNN can perform well when the dataset size is manageable, allowing for effective neighbor searches.
<br>
<br>
<strong>Disadvantages of KNNImputer</strong>
<br>
1. Computationally Intensive
<br>
• KNN imputation can be slow, especially with large datasets, because every row that has a missing value must be compared against the rows stored during fitting.
<br>
2. Sensitivity to Outliers
<br>
• Outliers can significantly affect the imputation results since KNN relies on the local neighborhood of points.
<br>
3. Curse of Dimensionality
<br>
• As the number of features increases, distances between points become less informative. This can lead to poorer neighbor selection and consequently less accurate imputations.
<br>
4. Choice of k and Distance Metric
<br>
• The performance heavily depends on the choice of k (the number of neighbors) and the distance metric used. There’s no one-size-fits-all and it often requires experimentation.
<br>
5. Scalability Issues
<br>
• As the dataset grows, both memory usage and computation time increase making it less feasible for very large datasets without optimizations.
</div>
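
The `metric` and `weights` parameters described above can be illustrated on three made-up rows. This is a sketch under stated assumptions: `X_toy` and the variable names are hypothetical, and `nan_euclidean_distances` from `sklearn.metrics.pairwise` is used only to show the distances that the default metric produces.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Made-up rows: the first one is missing its second feature
X_toy = np.array([
    [1.0, np.nan],  # row to impute
    [2.0, 10.0],    # close neighbor
    [5.0, 40.0],    # far neighbor
])

# nan_euclidean skips coordinates missing in either row and rescales the
# squared distance by (total features / features actually compared)
dists = nan_euclidean_distances(X_toy[:1], X_toy[1:])
print(dists)  # sqrt(2 * 1) and sqrt(2 * 16), roughly 1.414 and 5.657

uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_toy)
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X_toy)

print(uniform[0, 1])   # 25.0 -> plain mean of 10 and 40
print(distance[0, 1])  # 16.0 -> inverse-distance weights pull it toward 10
```

The far neighbor is four times as distant as the close one, so its weight is four times smaller: (4 × 10 + 1 × 40) / 5 = 16.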

In [198]:
# Importing KNNImputer
from sklearn.impute import KNNImputer

In [199]:
# Creating KNNImputer Object
knn = KNNImputer(n_neighbors=2)

In [200]:
# Fit the imputer on the Training Data only, then transform it
X_train_knn = knn.fit_transform(X_train)

In [201]:
# Transform the Testing Data using the imputer fitted on the Training Data
X_test_knn = knn.transform(X_test)

In [202]:
# Importing LogisticRegression
from sklearn.linear_model import LogisticRegression

In [203]:
# Creating LogisticRegression Object for KNNImputer
lr_knn = LogisticRegression()

In [204]:
# Fitting the Data to LogisticRegression
lr_knn.fit(X_train_knn, y_train)

In [205]:
# Prediction from the Trained Model
y_pred_knn = lr_knn.predict(X_test_knn)

In [206]:
# Importing Accuracy Score
from sklearn.metrics import accuracy_score

In [207]:
# Accuracy Score of the KNNImputed Data
accuracy_score(y_test, y_pred_knn)

0.7201492537313433

In [208]:
# Importing SimpleImputer
from sklearn.impute import SimpleImputer

In [209]:
# Creating SimpleImputer Object (default strategy="mean")
si = SimpleImputer()

In [210]:
# Fit and Transform is called on Training Data only
X_train_si = si.fit_transform(X_train)

In [211]:
# Transformation on Testing Data 
X_test_si = si.transform(X_test)

In [212]:
# Importing LogisticRegression
from sklearn.linear_model import LogisticRegression

In [213]:
# Creating LogisticRegression Object for SimpleImputer
lr_si = LogisticRegression()

In [214]:
# Fitting the Data to LogisticRegression
lr_si.fit(X_train_si, y_train)

In [215]:
# Prediction from the Trained Model
y_pred_si = lr_si.predict(X_test_si)

In [216]:
# Accuracy Score of the SimpleImputed Data
accuracy_score(y_test, y_pred_si)

0.7126865671641791
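
The two scores above differ by only two of the 268 test rows (193 vs 191 correct) and come from a single split, so the gap may not be stable. One way to compare imputers more reliably, and to experiment with `k`, is cross-validation with the imputer inside a pipeline. The sketch below uses synthetic data as a stand-in (an assumption: `X_syn`, `y_syn`, and the candidate values of `k` are made up, and the numbers it prints say nothing about the Titanic data).

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the features (an assumption, not the real dataset)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 3))
y_syn = (X_syn[:, 0] + 0.5 * X_syn[:, 1] > 0).astype(int)
X_syn[rng.random(300) < 0.2, 1] = np.nan  # roughly 20% missing in one column

# Candidate imputers: mean imputation plus KNN with a few values of k
candidates = {"mean": SimpleImputer(strategy="mean")}
for k in (2, 5, 10):
    candidates[f"knn k={k}"] = KNNImputer(n_neighbors=k)

scores = {}
for name, imputer in candidates.items():
    # The pipeline refits the imputer inside every fold, so held-out rows
    # never influence the imputation of the training rows
    model = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    fold_scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

To apply the same comparison to the notebook's data, `X_syn` and `y_syn` would be replaced by `X` and `y` from the cells above.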