## KNN Imputer

KNN imputation fills missing values in a dataset using the K-Nearest Neighbors algorithm. For each sample with a missing value, it finds the k most similar samples that have that feature observed and imputes the missing entry from their values, typically the mean (scikit-learn's `KNNImputer` uses a uniform or distance-weighted mean). Because the neighbors are selected using the other features, this approach takes relationships between features into account, which can lead to better model performance than simpler methods such as mean or median imputation.

### How the KNN Imputer Works
- **Identifying Missing Values:** The first step is to identify the missing values in the dataset, typically marked as NaN (Not a Number).
- **Finding Nearest Neighbors:** For each data point with a missing value, the KNN imputer finds the k nearest neighbors based on a distance metric. scikit-learn's `KNNImputer` defaults to `nan_euclidean`, a Euclidean distance that skips coordinates missing in either sample.
- **Imputing Missing Values:** The missing value is then imputed from the neighbors' values for that feature, usually their mean; `KNNImputer` uses the mean, optionally weighted by distance.
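
The three steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the helper name `knn_impute` is hypothetical, and the distance rescaling (total coordinates divided by shared coordinates) is written to mirror the documented `nan_euclidean` behaviour as an assumption.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest donor rows."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n_features = X.shape[1]
    # Step 1: locate every missing entry (row i, column j)
    for i, j in zip(*np.where(np.isnan(X))):
        # Donors are the other rows in which feature j is observed
        donors = [r for r in range(len(X)) if r != i and not np.isnan(X[r, j])]
        dists = []
        for r in donors:
            # Step 2: NaN-aware Euclidean distance over coordinates present in both rows
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if not shared.any():
                dists.append(np.inf)
                continue
            sq = np.sum((X[i, shared] - X[r, shared]) ** 2)
            dists.append(np.sqrt(n_features / shared.sum() * sq))
        # Step 3: average feature j over the k closest donors
        nearest = np.argsort(dists)[:k]
        out[i, j] = np.mean([X[donors[n], j] for n in nearest])
    return out

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])
filled = knn_impute(X, k=2)
print(filled)  # row 0, col 2 becomes 4.0; row 2, col 0 becomes 5.5
```

The sketch loops over rows for readability; a vectorised distance computation would be preferable on anything larger than a toy array.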

### Choosing the Right Parameters for KNN Imputer
The performance of the KNN Imputer depends on the choice of parameters:

- **n_neighbors:** The number of neighbors to consider for imputation. A smaller value may be more sensitive to noise, while a larger value may oversmooth the data.
- **weights:** Determines how the neighbors' contributions are weighted. Options include:
  - **uniform:** All neighbors have equal weight.
  - **distance:** Neighbors are weighted by the inverse of their distance, so closer neighbors have more influence.
- **metric:** The distance used to find neighbors. scikit-learn's `KNNImputer` defaults to `nan_euclidean` and also accepts a callable; it does not expose a Minkowski power parameter `p` (that parameter belongs to estimators such as `KNeighborsClassifier`).
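
The effect of `weights` can be seen on a tiny array. The sketch below assumes scikit-learn is installed; the three-row array is made up for illustration. The middle row is missing its second feature, and its two donors sit at distances in a 1:2 ratio, so inverse-distance weighting pulls the imputed value towards the closer donor.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Row 1 is missing its second feature; rows 0 and 2 are the only possible donors
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [4.0, 40.0]])

uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(uniform[1, 1])   # plain mean of 10 and 40 -> 25.0
print(distance[1, 1])  # weights 1/d give 2/3 and 1/3 -> 20.0
```

In practice `n_neighbors` and `weights` are usually chosen by cross-validating the downstream model rather than by inspecting imputed values directly.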

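### Advantages
- Simple to apply, and often more accurate than mean or median imputation because it uses information from the other features.

### Disadvantages
- The fitted imputer must store the entire training set, so it is best suited to small and medium-sized datasets.
- Imputation is slower overall, because distances to the stored samples must be computed for every sample with a missing value.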

`imputer = KNNImputer(n_neighbors=3, weights='distance')`

https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

### Missing Completely at Random (MCAR)
MCAR refers to a scenario where the missing observations in a dataset are independent of the observed and unobserved data. This implies that the missingness is purely random and does not depend on any systematic factor related to the dataset.<br>
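Under MCAR, analyses of the observed data are not biased by the missingness, although discarding rows reduces the sample size. Common ways to handle it:<br>
**Listwise Deletion:** Remove cases (rows) with missing values; this is safe under MCAR because the remaining rows are still a random sample.<br>
**Pairwise Deletion:** Use all available values for each calculation (for example, each pairwise correlation) instead of dropping whole rows.<br>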
**Mean/Median/Mode Imputation:** Replace missing values with the mean/median for numerical variables or mode for categorical variables.<br>
**Multiple Imputation:** Use statistical methods to estimate and replace missing values while accounting for uncertainty.<br>
**Maximum Likelihood Estimation:** A more advanced approach that estimates parameters directly using the likelihood function.<br>
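
A minimal pandas sketch of the first three options, on a made-up four-row frame (the column names and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "age":  [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, np.nan, 53.10],
})

# Listwise deletion: keep only rows with no missing values
listwise = df_demo.dropna()

# Available-case statistics: each column mean uses every observed value in that column
column_means = df_demo.mean()

# Mean imputation: replace each NaN with its column mean
mean_imputed = df_demo.fillna(column_means)

print(len(listwise))                # 2 complete rows remain
print(mean_imputed.loc[1, "age"])   # mean of 22, 26 and 35
```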

### Missing at Random (MAR)
MAR is a situation in which the missing values are dependent on observed variables; that is, the probability of missingness depends on the observed data but not on the missing data itself.<br>
For example, older patients might miss blood pressure readings more frequently. If age is recorded, the missingness depends on an observed variable (age) but not on the blood pressure values.<br>
MAR can result from respondents skipping sensitive questions, or from non-response patterns influenced by observed variables such as education, age, or location.<br>

Because MAR missingness can be explained by the observed data, the following statistical methods can be used to handle it:<br>
**Multiple Imputation:** Predicts missing values from the observed data, producing several completed datasets whose results are pooled.<br>
**Maximum Likelihood Estimation (MLE):** Estimates parameters without imputing missing values, as is common in regression and Structural Equation Modelling (SEM).<br>
**Weighting Methods:** Adjusts for missing data by assigning weights to observed data; this is often used in survey analysis.<br>

### Missing Not at Random (MNAR)
MNAR occurs when the probability of missing data depends on unobserved data. The missingness is therefore systematically related to the very values that are missing, and is not random.<br>
A typical example in a medical study: patients with severe symptoms are more likely to drop out of a clinical trial, and their severity is never recorded.<br>
Analyzing MNAR data can lead to biased estimates, because the missingness depends on the unobserved values themselves and cannot be fully explained by the observed data. MNAR also reduces the dataset's representativeness, since a systematic portion of the data is missing.<br>

Unlike MCAR and MAR, which one can handle using traditional imputation methods, such as mean or regression-based imputation, MNAR needs an advanced approach such as:<br>
**Modeling the Missing Data Mechanism:** You can use external information to model why values are missing and to estimate them from known relationships. If that information explains the missingness, the data can be treated as MAR, and the MAR handling methods above apply.<br>
**Sensitivity Analysis:** Conducting sensitivity analysis helps assess the impact of different assumptions about missing data.<br>
**Multiple Imputation with MNAR Models:** Use imputation techniques that account for MNAR patterns, such as Heckman selection models or pattern-mixture models.<br>
**Using Domain Knowledge:** Subject-matter expertise can guide adjustments and improve estimation techniques.<br>
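
The three mechanisms can be compared on simulated data. The sketch below reuses the age and blood pressure example from the MAR section; the distributions, the missingness probabilities and all variable names are assumptions chosen for illustration. It measures how far the mean of the observed readings drifts from the true mean under each mechanism, and then shows that under MAR a regression on the observed variable (age) largely removes the drift.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 80, n)
bp = 100 + 0.5 * age + rng.normal(0, 5, n)  # blood pressure rises with age

missing = {
    # MCAR: every reading has the same 30% chance of being missing
    "MCAR": rng.random(n) < 0.3,
    # MAR: the chance grows with age, which is always observed
    "MAR": rng.random(n) < 0.01 * (age - 20),
    # MNAR: the chance depends on the (unobserved) reading itself
    "MNAR": rng.random(n) < np.where(bp > np.median(bp), 0.6, 0.0),
}

# Bias of the complete-case mean relative to the true mean
bias = {name: bp[~mask].mean() - bp.mean() for name, mask in missing.items()}
print(bias)

# Under MAR, predicting the missing readings from age corrects most of the bias
mar = missing["MAR"]
slope, intercept = np.polyfit(age[~mar], bp[~mar], 1)
bp_filled = np.where(mar, intercept + slope * age, bp)
mar_bias_after = bp_filled.mean() - bp.mean()
print(mar_bias_after)
```

No comparable correction is available for the MNAR mask using `age` alone, because the readings that drive the missingness are exactly the ones that were not recorded.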

In [11]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [3]:
df = pd.read_csv('train_1.csv')[['Age','Pclass','Fare','Survived']]
df.head()

Unnamed: 0,Age,Pclass,Fare,Survived
0,22.0,3,7.25,0
1,38.0,1,71.2833,1
2,26.0,3,7.925,1
3,35.0,1,53.1,1
4,35.0,3,8.05,0


In [4]:
df.isnull().mean() * 100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [5]:
X = df.drop(columns=['Survived'])
y = df['Survived']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

In [6]:
knn = KNNImputer(n_neighbors=3,weights='distance')

X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)

In [7]:
lr = LogisticRegression()

lr.fit(X_train_trf,y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test,y_pred)

0.7039106145251397

### Comparison with SimpleImputer (mean)

In [10]:
si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [9]:
lr = LogisticRegression()

lr.fit(X_train_trf2,y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test,y_pred2)

0.6927374301675978
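
The two accuracies above come from a single train/test split, so the gap between them may not be stable. A hedged sketch of a more robust comparison follows: it puts each imputer in a `Pipeline` and scores it with cross-validation. The data here is synthetic (`make_classification` with values knocked out at random), not the Titanic file, and the scaling step before `KNNImputer` is an added assumption, included because distance-based neighbors are otherwise dominated by features with large ranges such as `Fare`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 20% of entries removed at random
X_syn, y_syn = make_classification(n_samples=300, n_features=5,
                                   n_informative=3, random_state=0)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.2] = np.nan

candidates = {
    "mean": make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression()),
    # StandardScaler ignores NaNs when fitting and leaves them in place
    "knn": make_pipeline(StandardScaler(),
                         KNNImputer(n_neighbors=3, weights="distance"),
                         LogisticRegression()),
}

# The imputer is refit inside every fold, so no test-fold statistics leak into training
scores = {name: cross_val_score(pipe, X_syn, y_syn, cv=5).mean()
          for name, pipe in candidates.items()}
print(scores)
```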

## Iterative Imputer - Multivariate Imputation by Chained Equations (MICE)

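Assumption: MICE assumes the data are missing at random (MAR); it also applies when the data are MCAR.<br>
It is a multivariate imputer that estimates each feature from all the others: each feature with missing values is modeled as a function of the other features, in a round-robin fashion, and the cycle is repeated until the imputed values stabilize or an iteration limit is reached.

It is slower than simple imputation, because a regression model is fitted for each feature with missing values in every round, but the results are usually good.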

https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
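
A minimal usage sketch with scikit-learn, applied to the five-row sample used in the walkthrough that follows. `IterativeImputer` is still marked experimental, so it has to be enabled with an explicit import first. Its default estimator is `BayesianRidge`, so the imputed values are not expected to match a hand calculation based on `LinearRegression`; the `max_iter` and `random_state` values here are arbitrary choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Columns: R&D Spend, Administration, Marketing Spend
X_small = np.array([[8.0, 15.0, 30.0],
                    [np.nan, 5.0, 20.0],
                    [15.0, 10.0, 41.0],
                    [12.0, np.nan, 26.0],
                    [2.0, 15.0, np.nan]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_small)
print(X_filled)
```

With so few rows the imputer may warn that it stopped before converging; the observed entries are returned unchanged either way.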

In [12]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression

In [13]:
df = np.round(pd.read_csv('50_Startups.csv')[['R&D Spend','Administration','Marketing Spend','Profit']]/10000)
np.random.seed(9)
df = df.sample(5)
df

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit
21,8.0,15.0,30.0,11.0
37,4.0,5.0,20.0,9.0
2,15.0,10.0,41.0,19.0
14,12.0,16.0,26.0,13.0
44,2.0,15.0,3.0,7.0


In [14]:
df = df.iloc[:,0:-1]
df

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,4.0,5.0,20.0
2,15.0,10.0,41.0
14,12.0,16.0,26.0
44,2.0,15.0,3.0


In [17]:
df.iat[1,0] = np.nan
df.iat[3,1] = np.nan
df.iat[-1,-1] = np.nan

df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,,26.0
44,2.0,15.0,


In [18]:
# Step 1 - Impute all missing values with mean of respective col

df0 = pd.DataFrame()

df0['R&D Spend'] = df['R&D Spend'].fillna(df['R&D Spend'].mean())
df0['Administration'] = df['Administration'].fillna(df['Administration'].mean())
df0['Marketing Spend'] = df['Marketing Spend'].fillna(df['Marketing Spend'].mean())

In [19]:
df0

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,9.25,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25


In [20]:
# Remove the col1 imputed value
df1 = df0.copy()

df1.iat[1,0] = np.nan

df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25


In [22]:
# Train on the four rows where R&D Spend is observed (positions 0, 2, 3, 4), then predict it for row 1

X = df1.iloc[[0,2,3,4],1:3]
X

Unnamed: 0,Administration,Marketing Spend
21,15.0,30.0
2,10.0,41.0
14,11.25,26.0
44,15.0,29.25


In [23]:
y = df1.iloc[[0,2,3,4],0]
y

21     8.0
2     15.0
14    12.0
44     2.0
Name: R&D Spend, dtype: float64

In [24]:
lr = LinearRegression()
lr.fit(X,y)
lr.predict(df1.iloc[1,1:].values.reshape(1,2))



array([23.14158651])

In [25]:
df1.iloc[1,0] = 23.14

In [26]:
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25


In [27]:
# Remove the col2 imputed value

df1.iloc[3,1] = np.nan

df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,,26.0
44,2.0,15.0,29.25


In [28]:
# Train on the four rows where Administration is observed (positions 0, 1, 2, 4), then predict it for row 3
X = df1.iloc[[0,1,2,4],[0,2]]
X

Unnamed: 0,R&D Spend,Marketing Spend
21,8.0,30.0
37,23.14,20.0
2,15.0,41.0
44,2.0,29.25


In [29]:
y = df1.iloc[[0,1,2,4],1]
y

21    15.0
37     5.0
2     10.0
44    15.0
Name: Administration, dtype: float64

In [30]:
lr = LinearRegression()
lr.fit(X,y)
lr.predict(df1.iloc[3,[0,2]].values.reshape(1,2))



array([11.06331285])

In [31]:
df1.iloc[3,1] = 11.06

In [32]:
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.06,26.0
44,2.0,15.0,29.25


In [34]:
# Remove the col3 imputed value
df1.iloc[4,-1] = np.nan

df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.06,26.0
44,2.0,15.0,


In [35]:
# Train on the first four rows, where Marketing Spend is observed, then predict it for the last row
X = df1.iloc[0:4,0:2]
X

Unnamed: 0,R&D Spend,Administration
21,8.0,15.0
37,23.14,5.0
2,15.0,10.0
14,12.0,11.06


In [36]:
y = df1.iloc[0:4,-1]
y

21    30.0
37    20.0
2     41.0
14    26.0
Name: Marketing Spend, dtype: float64

In [37]:
lr = LinearRegression()
lr.fit(X,y)
lr.predict(df1.iloc[4,0:2].values.reshape(1,2))



array([31.56351448])

In [38]:
df1.iloc[4,-1] = 31.56

In [39]:
# After 1st Iteration
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.06,26.0
44,2.0,15.0,31.56


In [40]:
# Subtract 0th iteration from 1st iteration

df1 - df0

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,0.0,0.0,0.0
37,13.89,0.0,0.0
2,0.0,0.0,0.0
14,0.0,-0.19,0.0
44,0.0,0.0,2.31


In [57]:
df2 = df1.copy()

df2.iloc[1,0] = np.nan

df2

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.220202,26.0
44,2.0,15.0,31.56


In [58]:
X = df2.iloc[[0,2,3,4],1:3]
y = df2.iloc[[0,2,3,4],0]

lr = LinearRegression()
lr.fit(X,y)
lr.predict(df2.iloc[1,1:].values.reshape(1,2))



array([24.57800502])

In [59]:
df2.iloc[1,0] = 23.78

In [60]:
df2.iloc[3,1] = np.nan
X = df2.iloc[[0,1,2,4],[0,2]]
y = df2.iloc[[0,1,2,4],1]

lr = LinearRegression()
lr.fit(X,y)
lr.predict(df2.iloc[3,[0,2]].values.reshape(1,2))



array([11.22020174])

In [63]:
df2.iloc[3,1] = 11.22020174

In [64]:
df2.iloc[4,-1] = np.nan

X = df2.iloc[0:4,0:2]
y = df2.iloc[0:4,-1]

lr = LinearRegression()
lr.fit(X,y)
lr.predict(df2.iloc[4,0:2].values.reshape(1,2))



array([38.88361565])

In [65]:
df2.iloc[4,-1] = 38.88361565

In [66]:
df2

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.78,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.220202,26.0
44,2.0,15.0,38.883616


In [67]:
df2 - df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,0.0,0.0,0.0
37,0.64,0.0,0.0
2,0.0,0.0,0.0
14,0.0,0.0,0.0
44,0.0,0.0,7.323616
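
The manual passes above repeat the same three steps for each column, so they can be wrapped in a loop. The sketch below is one way to do that in NumPy, under stated assumptions: the function name `chained_imputation` and the `n_iter` and `tol` parameters are hypothetical, ordinary least squares is solved with `np.linalg.lstsq` instead of `LinearRegression`, and intermediate values are not rounded, so results after the first prediction differ slightly from the hand-entered ones.

```python
import numpy as np

def chained_imputation(X, n_iter=10, tol=1e-3):
    """Round-robin regression imputation, starting from column means."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Iteration 0: fill every gap with its column mean
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        previous = filled.copy()
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Regress column j on the other columns (plus an intercept),
            # training only on rows where column j was originally observed
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            coef, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            # Overwrite the originally missing entries with the predictions
            filled[rows, j] = A[rows] @ coef
        # Stop once a full pass changes no value by more than tol
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled

# Columns: R&D Spend, Administration, Marketing Spend
X = np.array([[8.0, 15.0, 30.0],
              [np.nan, 5.0, 20.0],
              [15.0, 10.0, 41.0],
              [12.0, np.nan, 26.0],
              [2.0, 15.0, np.nan]])

first_pass = chained_imputation(X, n_iter=1)
print(first_pass[1, 0])  # about 23.14, the first prediction in the walkthrough
```

With only four training rows per regression the values are not guaranteed to settle, which is why the loop is capped by `n_iter` as well as by `tol`.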


In [69]:
df3 = df2.copy()

df3.iloc[1,0] = np.nan

df3

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.220202,26.0
44,2.0,15.0,38.883616
