In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [18]:
The K-Nearest Neighbors (KNN) Imputer is a data imputation method used to fill in missing values in a dataset by utilizing the values of the nearest neighbors. This method assumes that data points close to each other are likely to have similar values. Here’s a detailed explanation of how to use the `KNNImputer` from scikit-learn, along with an example.

### KNN Imputer Overview

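The KNN Imputer replaces each missing value with the mean of that feature over the `k` nearest neighbors (`n_neighbors`) that have the feature observed. The mean is unweighted by default (`weights='uniform'`) or weighted by inverse distance (`weights='distance'`). scikit-learn's `KNNImputer` works on numerical data only, so categorical features must be encoded first. It is most useful when the distance between rows is meaningful.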

### How KNN Imputer Works

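1. **Identify Neighbors:** For each missing value, find the `k` nearest rows that have that feature observed. Distances use the `nan_euclidean` metric: Euclidean distance over the features both rows have, scaled up to account for the features that were skipped.
2. **Aggregate Values:** Replace the missing value with the mean of that feature over those neighbors, optionally weighted by inverse distance. `KNNImputer` does not take a majority vote, so it does not impute categorical data directly.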
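
The two steps above can be reproduced by hand on a toy matrix. This is a minimal sketch with made-up values; `nan_euclidean_distances` is scikit-learn's NaN-aware distance function, and the manual result is compared against `KNNImputer` as a check.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy matrix (hypothetical values): row 1 is missing its second feature.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [8.0, 80.0],
])

# Step 1: distance from row 1 to every row, using only the features both rows have.
dist = nan_euclidean_distances(X[[1]], X)[0]

# Potential donors are the rows where column 1 is observed.
donors = np.flatnonzero(~np.isnan(X[:, 1]))

# Step 2: average column 1 over the 2 nearest donors.
nearest = donors[np.argsort(dist[donors])[:2]]
manual_value = X[nearest, 1].mean()

# The same imputation done by the library.
sklearn_value = KNNImputer(n_neighbors=2).fit_transform(X)[1, 1]
print(manual_value, sklearn_value)
```

Rows 0 and 2 are the two closest donors here, so both approaches average 10 and 30.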

### Usage

Here’s how you can use the `KNNImputer` in scikit-learn:

#### 1. Import Necessary Libraries

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
```

#### 2. Create Sample Data

```python
# Sample data with missing values
data = {
    'Age': [25, 30, np.nan, 40, 35],
    'Income': [50000, np.nan, 60000, 70000, 80000],
    'Fare': [100.0, 200.0, np.nan, 300.0, 250.0]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
```

#### 3. Initialize and Apply KNN Imputer

```python
# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=2)

# Apply the imputer to the DataFrame
imputed_data = imputer.fit_transform(df)

# Convert the imputed data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

print("\nDataFrame after KNN Imputation:")
print(imputed_df)
```

### Full Example Code

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Sample data with missing values
data = {
    'Age': [25, 30, np.nan, 40, 35],
    'Income': [50000, np.nan, 60000, 70000, 80000],
    'Fare': [100.0, 200.0, np.nan, 300.0, 250.0]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=2)

# Apply the imputer to the DataFrame
imputed_data = imputer.fit_transform(df)

# Convert the imputed data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

print("\nDataFrame after KNN Imputation:")
print(imputed_df)
```

### Output

```
Original DataFrame:
    Age   Income   Fare
0  25.0  50000.0  100.0
1  30.0      NaN  200.0
2   NaN  60000.0    NaN
3  40.0  70000.0  300.0
4  35.0  80000.0  250.0

DataFrame after KNN Imputation:
    Age   Income   Fare
0  25.0  50000.0  100.0
1  30.0  65000.0  200.0
2  32.5  60000.0  200.0
3  40.0  70000.0  300.0
4  35.0  80000.0  250.0
```

### Explanation of the Output

- **Age:** Row 2 has only `Income` observed, so its distances rest on `Income` alone. Rows 0 and 3 are equally close (a difference of 10,000 each), giving (25 + 40) / 2 = 32.5.
- **Income:** Row 1 is compared on `Age` and `Fare`. Its two nearest rows with an observed income are rows 4 and 0, giving (80000 + 50000) / 2 = 65000.
- **Fare:** Row 2 uses the same neighbors as for `Age`, rows 0 and 3, giving (100 + 300) / 2 = 200.

### Considerations

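- **Number of Neighbors (`n_neighbors`):** The choice of `k` can noticeably change the imputed values. Cross-validating the downstream model's score over several values of `k` is one way to choose it.
- **Distance Metric:** The default, and only built-in, metric is `nan_euclidean`, a Euclidean distance that skips missing coordinates. A custom callable can be passed through the `metric` parameter. Because distances are computed on raw feature values, features with large ranges dominate unless the data is scaled first.
- **Scalability:** `KNNImputer` can be slow on large datasets because it computes distances between every row that has a missing value and the rows stored at fit time.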
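
As a rough sketch of choosing `n_neighbors` by cross-validation, the imputer can be placed in a `Pipeline` and tuned with `GridSearchCV`. The data below is synthetic and purely illustrative, the candidate values are arbitrary, and the `StandardScaler` step is an added assumption so that no single large-range feature dominates the distances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not from this notebook), with about 15% of entries set to NaN.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.15] = np.nan

# StandardScaler ignores NaNs when fitting and passes them through on transform,
# so it can run before the imputer.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("imputer", KNNImputer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Arbitrary candidate values for k; the pipeline is refit inside each CV fold.
candidate_k = [1, 3, 5, 7]
search = GridSearchCV(pipe, {"imputer__n_neighbors": candidate_k}, cv=5, scoring="accuracy")
search.fit(X_syn, y_syn)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the imputer inside the pipeline means it is fitted only on the training portion of each fold, which avoids leaking information from the held-out rows.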

### Summary

The KNN Imputer handles missing values by leveraging the similarity between data points. It is most effective when distances between rows are meaningful, so that a row's nearest neighbors are good proxies for its missing values. `KNNImputer` from scikit-learn makes this technique straightforward to add to a data preprocessing pipeline.


In [2]:
df = pd.read_csv('train.csv')[['Age','Pclass','Fare','Survived']]

In [3]:
df.head()

    Age  Pclass     Fare  Survived
0  22.0       3     7.25         0
1  38.0       1  71.2833         1
2  26.0       3    7.925         1
3  35.0       1     53.1         1
4  35.0       3     8.05         0


In [4]:
df.isnull().mean() *100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [5]:
X = df.drop(columns = ['Survived'])
y = df['Survived']

In [6]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

In [7]:
X_train.head()

      Age  Pclass     Fare
30   40.0       1  27.7208
10    4.0       3     16.7
873  47.0       3      9.0
182   9.0       3  31.3875
876  20.0       3   9.8458


In [11]:
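# Impute with the 3 nearest rows, giving closer neighbors more weight
knn = KNNImputer(n_neighbors=3, weights='distance')

# Fit on the training split only, then reuse the fitted imputer on the test split
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)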
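
The `weights='distance'` option used in this cell changes the average taken over the neighbors. A small sketch with made-up values (not the Titanic data) shows the difference from the default `weights='uniform'`:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (hypothetical values): row 1 is missing its second feature.
# On the first feature, row 0 is 1 unit away from row 1 and row 2 is 2 units away.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [4.0, 40.0],
    [9.0, 90.0],
])

# Uniform weights: plain mean of the two nearest donors (rows 0 and 2).
uniform_value = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_toy)[1, 1]

# Distance weights: each donor is weighted by 1 / distance,
# so row 0 counts twice as much as row 2.
distance_value = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X_toy)[1, 1]

print(uniform_value, distance_value)
```

With uniform weights the imputed value is (10 + 40) / 2 = 25. With distance weights it is (2 * 10 + 1 * 40) / 3 = 20, pulled toward the closer row.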

In [12]:
lr = LogisticRegression()

In [15]:
lr.fit(X_train_trf, y_train)
y_pred = lr.predict(X_test_trf)
accuracy_score(y_test, y_pred)

0.7150837988826816

In [16]:
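# Baseline: SimpleImputer defaults to strategy='mean' (fill with the column mean)
si = SimpleImputer()

# Fit on the training split only, then apply the same means to the test split
X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)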

In [17]:
lr = LogisticRegression()

lr.fit(X_train_trf2,y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test,y_pred2)

0.6927374301675978
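
The two accuracies above come from a single 80/20 split, so a gap of about two percentage points may partly reflect which rows landed in the test set. One way to get a steadier comparison is to cross-validate each imputer inside a pipeline. The sketch below is an assumption-laden illustration: `compare_imputers` is a hypothetical helper, and the data is a synthetic stand-in rather than `train.csv`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_imputers(X, y, cv=5):
    """Hypothetical helper: mean cross-validated accuracy for each imputer."""
    imputers = {
        "knn": KNNImputer(n_neighbors=3, weights="distance"),
        "mean": SimpleImputer(),
    }
    scores = {}
    for name, imputer in imputers.items():
        # The imputer is refit on the training part of every fold.
        model = make_pipeline(imputer, LogisticRegression(max_iter=1000))
        scores[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    return scores


# Synthetic stand-in data with about 20% missing values in the first column.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)
X_demo[rng.random(300) < 0.2, 0] = np.nan

scores = compare_imputers(X_demo, y_demo)
print(scores)
```

In this notebook, the same helper could be called as `compare_imputers(X, y)` with the `X` and `y` defined in cell 5.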