Importing pandas to read the Titanic dataset and creating a dataframe.

In [2]:
import pandas as pd

df = pd.read_csv("/Users/venkat/Downloads/Titanic_full.csv")

Removing the independent variables which are of little use for the KNN imputer.

In [3]:
df = df.drop(['PassengerId', 'Name', 
              'Ticket', 'Cabin'], axis=1)

Using the pandas DataFrame methods .isna() and .any() to detect whether any column in the dataset has missing values.

In [4]:
df.isna().any()

Survived    False
Pclass      False
Sex         False
Age          True
SibSp       False
Parch       False
Fare         True
Embarked     True
dtype: bool

Using .sum() to see the number of missing values.

In [5]:
df.isna().sum()

Survived      0
Pclass        0
Sex           0
Age         263
SibSp         0
Parch         0
Fare          1
Embarked      2
dtype: int64
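
A count is easier to judge next to the share of rows it represents. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the Titanic data) to show that the mean of the `.isna()` mask gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Titanic data
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan],
                    "Fare": [7.25, 71.28, np.nan, 53.1]})

missing_count = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()   # fraction of rows that are missing

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```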

Importing the KNN Imputer. The KNN Imputer works only on numeric data and does not accept text values. In our Titanic dataset, the categorical columns ‘Sex’ and ‘Embarked’ contain text. A good way to convert the text data is to create “dummy variables”: each category becomes its own binary column holding a 1 or 0.

In [6]:
from sklearn.impute import KNNImputer

In [7]:
cat_variables = df[['Sex', 'Embarked']]
cat_dummies = pd.get_dummies(cat_variables, drop_first=True)
cat_dummies.head()

Unnamed: 0,Sex_male,Embarked_Q,Embarked_S
0,1,0,1
1,0,0,0
2,0,0,1
3,0,0,1
4,1,0,1


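Now we have 3 dummy variable columns. In the “Sex_male” column, 1 indicates that the passenger is male and 0 that the passenger is female. The “Sex_female” column is dropped because the “drop_first” parameter is set to True. Similarly, there are only 2 columns for “Embarked” because the first category, “Embarked_C”, has been dropped. Note that get_dummies, with its default settings, encodes a missing category as 0 in every dummy column, so the 2 passengers with a missing “Embarked” value end up looking like the dropped category instead of staying missing.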
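
To see this behaviour on a small scale, here is a minimal sketch with a made-up three-row frame (`toy` and its values are illustrative, not taken from the Titanic data). It shows which columns survive `drop_first=True` and how a missing category is encoded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Titanic data
toy = pd.DataFrame({"Sex": ["male", "female", "male"],
                    "Embarked": ["S", "C", np.nan]})

# The first category of each column ("female", "C") is dropped
toy_dummies = pd.get_dummies(toy, drop_first=True)

print(toy_dummies.columns.tolist())
# Cast to int so the output reads as 1/0 on every pandas version
print(toy_dummies.astype(int))
```

The last row has no “Embarked” value, yet its dummy column holds 0 rather than NaN. If the missing category should stay visible to the imputer, one option is to set those dummy cells back to NaN before imputing.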

Next, we will drop the original “Sex” and “Embarked” columns from the data frame and add the dummy variables.

In [8]:
df = df.drop(['Sex', 'Embarked'], axis=1)
df = pd.concat([df, cat_dummies], axis=1)
df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,1,0,1
1,1,1,38.0,1,0,71.2833,0,0,0
2,1,3,26.0,0,0,7.925,0,0,1
3,1,1,35.0,1,0,53.1,0,0,1
4,0,3,35.0,0,0,8.05,1,0,1


Another critical point is that the KNN Imputer is a distance-based imputation method, so it requires us to normalize our data. Otherwise, columns with larger scales will dominate the distance calculation and the KNN Imputer will generate biased replacements for the missing values. For simplicity, I am using Scikit-Learn’s MinMaxScaler, which scales each variable to values between 0 and 1.

In [9]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)
df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,0.0,1.0,0.273456,0.125,0.0,0.014151,1.0,0.0,1.0
1,1.0,0.0,0.473882,0.125,0.0,0.139136,0.0,0.0,0.0
2,1.0,1.0,0.323563,0.0,0.0,0.015469,0.0,0.0,1.0
3,1.0,0.0,0.436302,0.125,0.0,0.103644,0.0,0.0,1.0
4,0.0,1.0,0.436302,0.0,0.0,0.015713,1.0,0.0,1.0
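
The scaling step can be checked on a tiny example. The sketch below uses made-up numbers (`toy` is illustrative, not the Titanic data) to show that MinMaxScaler maps each column to (x - min) / (max - min), ignores missing values when computing min and max, and leaves them as NaN, which is why scaling can safely come before imputation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values, chosen only to show two very different scales
toy = pd.DataFrame({"Age": [20.0, 40.0, np.nan, 60.0],
                    "Fare": [0.0, 500.0, 250.0, 100.0]})

# Each column is rescaled independently to the range [0, 1]
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(scaled)
```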


Now that our dataset has dummy variables and is normalized, we can move on to the KNN imputation. Let’s import the imputer from Scikit-Learn’s impute module and apply it to our data. We are setting the parameter ‘n_neighbors’ to 5, so each missing value will be replaced by the mean value of its 5 nearest neighbors, measured by a Euclidean distance that skips missing coordinates.


In [10]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(imputer.fit_transform(df),columns = df.columns)

Let’s see the results:

In [11]:
df.isna().any()

Survived      False
Pclass        False
Age           False
SibSp         False
Parch         False
Fare          False
Sex_male      False
Embarked_Q    False
Embarked_S    False
dtype: bool

Our data frame no longer has missing values. The missing “Age” and “Fare” values have been replaced by the mean of their 5 nearest neighbors, expressed on the scaled 0-to-1 range.
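
To make “mean of the nearest neighbors” concrete, here is a minimal sketch on a 4-row array with 2 neighbors instead of 5 (`X_toy` is an illustrative array, not the Titanic data).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy array with two missing entries
X_toy = np.array([[1.0, 2.0, np.nan],
                  [3.0, 4.0, 3.0],
                  [np.nan, 6.0, 5.0],
                  [8.0, 8.0, 7.0]])

# Each NaN is replaced by the mean of that column over the 2 nearest rows
imputer_toy = KNNImputer(n_neighbors=2)
X_filled = imputer_toy.fit_transform(X_toy)

print(X_filled)
```

The missing value in the first row becomes 4.0, the mean of 3.0 and 5.0 from its two closest rows, and the missing value in the third row becomes 5.5, the mean of 3.0 and 8.0.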

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    1309 non-null   float64
 1   Pclass      1309 non-null   float64
 2   Age         1309 non-null   float64
 3   SibSp       1309 non-null   float64
 4   Parch       1309 non-null   float64
 5   Fare        1309 non-null   float64
 6   Sex_male    1309 non-null   float64
 7   Embarked_Q  1309 non-null   float64
 8   Embarked_S  1309 non-null   float64
dtypes: float64(9)
memory usage: 92.2 KB


In [13]:
from sklearn.model_selection import train_test_split

Create a dataframe with all the feature columns, i.e. everything except the target column

In [14]:
X = df.drop(columns=['Survived'])


Check that the target variable has been removed

In [15]:
X.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,1.0,0.273456,0.125,0.0,0.014151,1.0,0.0,1.0
1,0.0,0.473882,0.125,0.0,0.139136,0.0,0.0,0.0
2,1.0,0.323563,0.0,0.0,0.015469,0.0,0.0,1.0
3,0.0,0.436302,0.125,0.0,0.103644,0.0,0.0,1.0
4,1.0,0.436302,0.0,0.0,0.015713,1.0,0.0,1.0


Separate target values

In [16]:
y = df['Survived'].values

View target values

In [17]:
y[0:5]

array([0., 1., 1., 1., 0.])

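Splitting the dataset into train and test data. 20% of the rows are held out for testing, and stratify=y keeps the share of survivors the same in both parts.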

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
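
The effect of the stratify argument is easiest to see on imbalanced labels. The sketch below uses made-up data (`X_toy` and `y_toy` are illustrative, with 20% positives) and checks the class share on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy)

# Share of positive labels in the train and test parts
print(y_tr.mean(), y_te.mean())
```

Both shares should match the 20% of the full data. Without stratify, the test share would vary with the random seed.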

In [19]:
from sklearn.neighbors import KNeighborsClassifier
# Creating KNN classifier
knn = KNeighborsClassifier(n_neighbors = 5)
# Fit the classifier to the data
knn.fit(X_train,y_train)

KNeighborsClassifier()

Predicting and viewing the first 5 predictions

In [20]:
knn.predict(X_test)[0:5]

array([0., 1., 0., 0., 0.])

In [21]:
# Check accuracy of our model on the test data
knn.score(X_test, y_test)

0.7175572519083969
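
For a classifier, .score() returns the accuracy, that is, the fraction of test rows whose prediction matches the true label. A minimal sketch on made-up one-feature data (all names and values here are illustrative) compares it with a manual calculation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: two well-separated groups on one feature
X_toy = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Hypothetical test points; the last label is deliberately a miss
X_new = np.array([[0.05], [0.95], [0.7]])
y_new = np.array([0, 1, 0])

pred = knn_toy.predict(X_new)
manual_accuracy = np.mean(pred == y_new)  # fraction of correct predictions

print(pred, manual_accuracy, knn_toy.score(X_new, y_new))
```

Read this way, the score of about 0.72 above means that roughly 72% of the held-out passengers were classified correctly.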