## 2. Nearest Neighbor Classification

We implemented a basic, unoptimized nearest neighbor classifier in a separate Python file, which we import first. We also reload the data, since we now want to use some of the categorical features as well.

In [1]:
from nearest_neighbor_code import KNearestNeighbor
import numpy as np
import pandas as pd

In [2]:
train = pd.read_csv('data/train.csv')

train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# Passenger names are dropped; `columns=` already implies axis=1
train.drop(columns=['Name'], inplace=True)

Before proceeding, we normalize the numerical values so that they all fall in a similar range. Although potentially not optimal, we start with simple min-max scaling: each value x becomes (x - min) / (max - min), where min and max are computed on the training set.
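
As a compact illustration of this scaling step, the sketch below records the minimum and maximum of each column on a training frame and reuses them on a test frame. The helper names `fit_min_max` and `apply_min_max` and the toy values are assumptions made for this example; they are not part of the notebook.

```python
import pandas as pd

def fit_min_max(df, columns):
    """Record (min, max) per column; pandas skips NaN values by default."""
    return {col: (df[col].min(), df[col].max()) for col in columns}

def apply_min_max(df, stats):
    """Return a copy of df with each listed column rescaled by the stored stats."""
    scaled = df.copy()
    for col, (lo, hi) in stats.items():
        scaled[col] = (df[col] - lo) / (hi - lo)
    return scaled

# Toy data (hypothetical values, not taken from the Titanic files)
toy_train = pd.DataFrame({'Age': [2.0, 22.0, 42.0], 'Fare': [0.0, 10.0, 40.0]})
toy_test = pd.DataFrame({'Age': [12.0, 62.0], 'Fare': [20.0, 80.0]})

stats = fit_min_max(toy_train, ['Age', 'Fare'])
toy_train_scaled = apply_min_max(toy_train, stats)
toy_test_scaled = apply_min_max(toy_test, stats)
```

Because the test frame is scaled with training statistics, test values above the training maximum end up larger than 1. That is expected with this approach and keeps both sets on the same scale.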

In [4]:
Age_min, Age_max = train['Age'].min(), train['Age'].max()
SibSp_min, SibSp_max = train['SibSp'].min(), train['SibSp'].max()
Parch_min, Parch_max = train['Parch'].min(), train['Parch'].max()
Fare_min, Fare_max = train['Fare'].min(), train['Fare'].max()

# Pclass takes the values 1, 2, 3, so (x - 1) / 2 maps it to 0, 0.5, 1
train['Pclass'] = train['Pclass'].apply(lambda x: (x - 1) / 2)
train['Age'] = train['Age'].apply(lambda x: (x - Age_min) / (Age_max - Age_min))
train['SibSp'] = train['SibSp'].apply(lambda x: (x - SibSp_min) / (SibSp_max - SibSp_min))
train['Parch'] = train['Parch'].apply(lambda x: (x - Parch_min) / (Parch_max - Parch_min))
train['Fare'] = train['Fare'].apply(lambda x: (x - Fare_min) / (Fare_max - Fare_min))

train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,1.0,male,0.271174,0.125,0.0,A/5 21171,0.014151,,S
1,2,1,0.0,female,0.472229,0.125,0.0,PC 17599,0.139136,C85,C
2,3,1,1.0,female,0.321438,0.0,0.0,STON/O2. 3101282,0.015469,,S
3,4,1,0.0,female,0.434531,0.125,0.0,113803,0.103644,C123,S
4,5,0,1.0,male,0.434531,0.0,0.0,373450,0.015713,,S


In [5]:
y_train = train['Survived']
train.drop(columns=['Survived'], inplace=True)

In [6]:
test = pd.read_csv('data/test.csv')

In [7]:
test.drop(columns=['Name'], inplace=True)

In [8]:
# Reuse the minima and maxima from the training set so both sets share one scale
test['Pclass'] = test['Pclass'].apply(lambda x: (x - 1) / 2)
test['Age'] = test['Age'].apply(lambda x: (x - Age_min) / (Age_max - Age_min))
test['SibSp'] = test['SibSp'].apply(lambda x: (x - SibSp_min) / (SibSp_max - SibSp_min))
test['Parch'] = test['Parch'].apply(lambda x: (x - Parch_min) / (Parch_max - Parch_min))
test['Fare'] = test['Fare'].apply(lambda x: (x - Fare_min) / (Fare_max - Fare_min))

test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,1.0,male,0.428248,0.0,0.0,330911,0.015282,,Q
1,893,1.0,female,0.585323,0.125,0.0,363272,0.013663,,S
2,894,0.5,male,0.773813,0.0,0.0,240276,0.018909,,Q
3,895,1.0,male,0.334004,0.0,0.0,315154,0.016908,,S
4,896,1.0,female,0.271174,0.125,0.166667,3101298,0.023984,,S


In [9]:
KNN_classifier = KNearestNeighbor(n_nb=3)

In [10]:
KNN_classifier.fit(train, y_train, cat=['Sex', 'Ticket', 'Cabin', 'Embarked'], num=['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'])
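
The code in `nearest_neighbor_code` is not shown in this notebook. To give a rough idea of how a classifier with separate `cat` and `num` feature lists could work, here is a minimal sketch. The class `ToyMixedKNN`, its distance function (squared differences for numerical features, a 0/1 mismatch penalty for categorical ones), its handling of missing values, and the toy rows are all assumptions made for illustration; the imported `KNearestNeighbor` may behave differently.

```python
import math
from collections import Counter

class ToyMixedKNN:
    """Hypothetical stand-in, NOT the KNearestNeighbor class imported above."""

    def __init__(self, n_nb=3):
        self.n_nb = n_nb

    def fit(self, rows, labels, cat, num):
        # Nearest neighbor "training" only stores the data and feature lists.
        self.rows, self.labels = list(rows), list(labels)
        self.cat, self.num = cat, num

    def _distance(self, a, b):
        total = 0.0
        for col in self.num:
            x, y = a[col], b[col]
            # Assumption: a missing value counts as the maximal difference of 1.
            missing = x is None or y is None or math.isnan(x) or math.isnan(y)
            total += 1.0 if missing else (x - y) ** 2
        for col in self.cat:
            # Assumption: a categorical mismatch adds 1, a match adds 0.
            total += 0.0 if a[col] == b[col] else 1.0
        return math.sqrt(total)

    def predict(self, row):
        # Sort training indices by distance and take a majority vote
        # among the n_nb closest ones.
        order = sorted(range(len(self.rows)),
                       key=lambda i: self._distance(row, self.rows[i]))
        votes = Counter(self.labels[i] for i in order[:self.n_nb])
        return votes.most_common(1)[0][0]

# Toy rows with already-normalized numerical features (hypothetical values)
toy_rows = [
    {'Sex': 'male', 'Pclass': 1.0, 'Age': 0.30},
    {'Sex': 'male', 'Pclass': 1.0, 'Age': 0.40},
    {'Sex': 'female', 'Pclass': 0.0, 'Age': 0.35},
    {'Sex': 'female', 'Pclass': 0.0, 'Age': 0.50},
    {'Sex': 'female', 'Pclass': 0.5, 'Age': float('nan')},
]
toy_labels = [0, 0, 1, 1, 1]

toy_knn = ToyMixedKNN(n_nb=3)
toy_knn.fit(toy_rows, toy_labels, cat=['Sex'], num=['Pclass', 'Age'])
pred_a = toy_knn.predict({'Sex': 'female', 'Pclass': 0.0, 'Age': 0.45})
pred_b = toy_knn.predict({'Sex': 'male', 'Pclass': 1.0, 'Age': 0.35})
```

In this sketch a categorical mismatch weighs as much as the largest possible difference in one normalized numerical feature, which is one reason to scale the numerical columns to [0, 1] first.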

In [11]:
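# Placeholder column for the predictions
test['Survived'] = None

# Predict one passenger at a time; test.loc[i, ...] relies on the default 0..n-1 index
for i in range(test.shape[0]):
    test.loc[i, 'Survived'] = KNN_classifier.predict(test.iloc[i])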

test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,1.0,male,0.428248,0.0,0.0,330911,0.015282,,Q,0
1,893,1.0,female,0.585323,0.125,0.0,363272,0.013663,,S,1
2,894,0.5,male,0.773813,0.0,0.0,240276,0.018909,,Q,0
3,895,1.0,male,0.334004,0.0,0.0,315154,0.016908,,S,1
4,896,1.0,female,0.271174,0.125,0.166667,3101298,0.023984,,S,0
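
The row-by-row loop above can also be written with `DataFrame.apply(..., axis=1)`, which does not depend on the index labels. The sketch below shows the pattern with a placeholder classifier; `SexRule` and the toy frame are assumptions for illustration and stand in for any object whose `predict` method accepts a single row.

```python
import pandas as pd

class SexRule:
    """Hypothetical placeholder with a per-row predict method."""

    def predict(self, row):
        return 1 if row['Sex'] == 'female' else 0

# Toy frame (hypothetical values)
toy_test = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Sex': ['male', 'female', 'male'],
})

clf = SexRule()
# axis=1 passes each row to predict as a pandas Series
toy_test['Survived'] = toy_test.apply(clf.predict, axis=1)
```

With integer predictions this produces an integer column directly, whereas a column initialized with `None` starts out with object dtype.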


In [None]:
predictions3nb = test[['PassengerId', 'Survived']]

# predictions3nb.tail()

In [None]:
# predictions3nb.to_csv('predictions3nb.csv', index=False)

Result: 0.73684 accuracy

Let's try a subset of the features (dropping `Ticket` and `Parch`) and a different number of neighbors (5 instead of 3):

In [15]:
KNN_classifier2 = KNearestNeighbor(n_nb=5)

In [16]:
KNN_classifier2.fit(train, y_train, cat=['Sex', 'Cabin', 'Embarked'], num=['Pclass', 'Age', 'SibSp', 'Fare'])

In [17]:
test['Survived'] = None

for i in range(test.shape[0]):
    test.loc[i, 'Survived'] = KNN_classifier2.predict(test.iloc[i])

test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,1.0,male,0.428248,0.0,0.0,330911,0.015282,,Q,0
1,893,1.0,female,0.585323,0.125,0.0,363272,0.013663,,S,0
2,894,0.5,male,0.773813,0.0,0.0,240276,0.018909,,Q,0
3,895,1.0,male,0.334004,0.0,0.0,315154,0.016908,,S,0
4,896,1.0,female,0.271174,0.125,0.166667,3101298,0.023984,,S,0


In [None]:
predictions5nb = test[['PassengerId', 'Survived']]

# predictions5nb.tail()

In [None]:
# predictions5nb.to_csv('predictions5nb.csv', index=False)

Result: 0.74880 accuracy