

The objective of this project is to build a k-Nearest Neighbour algorithm that takes a training and a test dataset as input and predicts the binary target variable with a reasonable degree of accuracy.

This project focuses on the Titanic dataset that we used during the lectures. You can find the dataset at:

https://github.com/andvise/DataAnalyticsDatasets/blob/main/titanic.csv


# Tasks:



# Data Preparation (5 marks)



1.  Load the dataset on Colab


In [None]:
# YOUR CODE HERE
import pandas as pd
import numpy as np

df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/dc74027bd063510ede437e4d612f41eb078e49b9/titanic.csv?raw=true")
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,1,0,3,Mr. Owen Harris Braund,male,22.0,1,0.0,7.2500
1,2,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0.0,71.2833
2,3,1,3,Miss. Laina Heikkinen,female,26.0,0,0.0,7.9250
3,4,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,,53.1000
4,5,0,3,Mr. William Henry Allen,male,35.0,0,0.0,8.0500
...,...,...,...,...,...,...,...,...,...
882,883,0,2,Rev. Juozas Montvila,male,27.0,0,0.0,13.0000
883,884,1,1,Miss. Margaret Edith Graham,female,19.0,0,0.0,30.0000
884,885,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2.0,23.4500
885,886,1,1,Mr. Karl Howell Behr,male,26.0,0,0.0,30.0000




2. Display the attribute names and their data types




In [None]:
# YOUR CODE HERE
df.dtypes

PassengerId                  int64
Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard    float64
Fare                       float64
dtype: object

3. Delete the columns that should not be included in a k-NN technique. Explain why you removed them in 1-2 sentences.




In [None]:
# YOUR CODE HERE
df_new = df[["Survived", "Pclass", "Sex", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare"]]
df_new.dtypes
# I removed PassengerId and Name: PassengerId is an arbitrary identifier and Name is free text,
# so no meaningful distance can be calculated on either column and they should not be used in a k-NN model.

Survived                     int64
Pclass                       int64
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard    float64
Fare                       float64
dtype: object

4. Replace all missing values with 0


In [None]:
# YOUR CODE HERE
df_na = df_new.fillna(0)
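
Before filling, it can help to check which columns actually contain missing values. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from the Titanic file); it counts the missing entries per column before and after `fillna(0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each of two columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Parents/Children Aboard": [0.0, 1.0, np.nan],
    "Fare": [7.25, 71.28, 7.93],
})

missing_before = toy.isna().sum()        # number of NaN entries per column
toy_filled = toy.fillna(0)               # same operation as in the cell above
missing_after = toy_filled.isna().sum()  # should be zero everywhere

print(missing_before)
print(missing_after)
```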

5. Transform the Sex column into a numerical one

In [None]:
# YOUR CODE HERE
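# map the two categories to 0/1 and assign the result back to the column
# (an in-place replace on a selected column may not modify the data frame)
df_na['Sex'] = df_na['Sex'].map({'male': 0, 'female': 1})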

6. Use **Survived** as the target label and the rest of the data frame as features

In [None]:
# YOUR CODE HERE
df_shuffle = df_na.sample(frac = 1, random_state = 45) # randomly shuffles the dataset
feat = df_shuffle[["Pclass", "Sex", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare"]]
labels = df_shuffle[["Survived"]]

7. Divide your dataset into 80% for training and 20% for test


In [None]:
# YOUR CODE HERE

# 80% of the rows (rounded down to a whole number) go to training; the remaining 20% go to test
split = int(0.8 * len(feat))

# Slice by position and reset the indices of all four subsets
train_feat = feat[:split].reset_index(drop=True)
test_feat = feat[split:].reset_index(drop=True)
train_labels = labels[:split].reset_index(drop=True)
test_labels = labels[split:].reset_index(drop=True)


train_feat

Unnamed: 0,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,2,1,45.0,1,1.0,26.2500
1,3,1,17.0,4,2.0,7.9250
2,3,0,44.0,0,0.0,8.0500
3,2,0,19.0,0,0.0,10.5000
4,3,0,43.0,0,0.0,8.0500
...,...,...,...,...,...,...
704,2,1,24.0,1,0.0,26.0000
705,3,0,23.0,0,0.0,7.8958
706,2,1,28.0,0,0.0,13.0000
707,3,1,30.0,0,0.0,12.4750


8. Scale the columns using min-max scalers


In [None]:
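# YOUR CODE HERE
# min-max scaling is (x - min) / (max - min); the training set's statistics are used for both splits
feat_min = train_feat.min()
feat_range = train_feat.max() - feat_min
train_feat = (train_feat - feat_min) / feat_range
test_feat = (test_feat - feat_min) / feat_range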
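
As a worked illustration of min-max scaling, the sketch below applies (x - min) / (max - min) to two small made-up frames. The data and the choice to reuse the training minimum and range for the test frame are assumptions made for this example. Note that a test value outside the training range ends up outside [0, 1].

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
train = pd.DataFrame({"Pclass": [1, 2, 3], "Fare": [0.0, 10.0, 50.0]})
test = pd.DataFrame({"Pclass": [3, 1], "Fare": [25.0, 60.0]})

col_min = train.min()                # per-column minimum of the training data
col_range = train.max() - col_min    # per-column (max - min) of the training data

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range   # same statistics, so both splits share one scale

print(train_scaled)
print(test_scaled)
```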

9. Print the shape of the train and test set

In [None]:
# YOUR CODE HERE
print(train_feat.shape)
print(test_feat.shape)
print(train_labels.shape)
print(test_labels.shape)

(709, 6)
(178, 6)
(709, 1)
(178, 1)
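
The row counts above follow from the split arithmetic: 709 + 178 = 887 rows, and `int(0.8 * 887)` is 709. A minimal sketch of a positional split, assuming the frame has already been shuffled (the helper name `train_test_split_frame` is hypothetical and not part of the notebook):

```python
import pandas as pd

def train_test_split_frame(frame, train_frac=0.8):
    """Hypothetical helper: split an already-shuffled frame by position."""
    cut = int(train_frac * len(frame))              # number of training rows, rounded down
    train = frame.iloc[:cut].reset_index(drop=True)
    test = frame.iloc[cut:].reset_index(drop=True)
    return train, test

toy = pd.DataFrame({"x": range(887)})  # same row count as the shapes printed above
train, test = train_test_split_frame(toy)
print(train.shape, test.shape)
```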


# NN implementation (10 Marks)

1. Implement your NN method. To predict **each** point (query point) of the test set, you need to:


> a. Find the training point with the smallest **Euclidean** distance to the query point

> b. Use as a prediction the target value of the point with the smallest distance to the query point

2. Compute the mean absolute error (MAE) between the test labels and the predicted values

In [None]:
# YOUR CODE HERE

# calculates the Euclidean distance between each query point and each point in the training set
# outputs a list of lists: one list of distances to all training points per query point
def euc(test, train):
    lis = []
    for ind, row in test.iterrows(): # iterates through each row in the test set
        tot = []
        for in2, row2 in train.iterrows(): # iterates through each row in the training set
            tot.append((((row - row2)**2).sum())**0.5) # calculates the Euclidean distance
        lis.append(tot)
    return lis
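
The nested `iterrows` loops are easy to follow but slow on larger data. As an alternative sketch (not the notebook's implementation), the same pairwise Euclidean distances can be computed with NumPy broadcasting. The helper name `euclidean_matrix` and the toy points are assumptions made for illustration.

```python
import numpy as np

def euclidean_matrix(test, train):
    """Hypothetical helper: pairwise Euclidean distances, shape (n_test, n_train)."""
    test = np.asarray(test, dtype=float)
    train = np.asarray(train, dtype=float)
    # Broadcasting gives an array of shape (n_test, n_train, n_features)
    diff = test[:, None, :] - train[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy points chosen so the distances are easy to verify by hand
train_pts = np.array([[0.0, 0.0], [3.0, 4.0]])
test_pts = np.array([[0.0, 0.0], [3.0, 0.0]])

dist = euclidean_matrix(test_pts, train_pts)
nearest = dist.argmin(axis=1)   # position of the closest training point per query point
print(dist)
print(nearest)
```

Passing `train_feat` and `test_feat` to such a helper should work as well, since `np.asarray` accepts a data frame with numeric columns.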

In [None]:
# function for determining the label for the test set based on the NN approach

def NN(train, test, labels):
    list_euc = euc(test, train)
    pred_list = [np.argmin(i) for i in list_euc] # gets the position of the minimum Euclidean distance for each
    # query point
    pred_labels = [labels.iloc[i, 0] for i in pred_list] # gets the corresponding training label for each position
    final_labels = pd.DataFrame(pred_labels, index=range(len(test)), columns=[labels.columns[0]]) # converts output to
    # a pandas data frame (sized by the test argument rather than the global test_feat)
    return final_labels

pred_labels = NN(train_feat, test_feat, train_labels)

[100, 684, 114, 329, 493, 257, 459, 250, 396, 230, 480, 244, 114, 114, 571, 61, 656, 401, 167, 455, 370, 291, 219, 283, 56, 708, 620, 411, 310, 696, 283, 64, 79, 544, 145, 656, 358, 254, 693, 370, 561, 4, 480, 360, 194, 138, 265, 114, 228, 209, 553, 662, 87, 242, 209, 280, 337, 636, 283, 60, 140, 439, 707, 1, 283, 312, 547, 458, 660, 636, 78, 427, 696, 433, 75, 379, 216, 458, 376, 701, 149, 696, 51, 504, 51, 322, 283, 527, 145, 413, 696, 626, 356, 14, 606, 417, 417, 382, 356, 273, 391, 172, 132, 158, 523, 480, 215, 706, 357, 170, 624, 25, 217, 144, 696, 410, 665, 296, 458, 64, 382, 327, 692, 452, 391, 177, 233, 370, 365, 693, 370, 537, 340, 53, 676, 697, 343, 604, 114, 656, 701, 167, 149, 383, 96, 164, 696, 105, 275, 696, 235, 452, 219, 467, 229, 692, 372, 118, 370, 545, 496, 409, 707, 308, 144, 250, 2, 170, 343, 434, 696, 144, 433, 413, 310, 357, 696, 657]


In [None]:
# Calculates the Mean Absolute Error between the test labels and the predicted labels

def mae(pred, true):
    diff = abs(pred.iloc[:, 0] - true.iloc[:, 0]) # calculates the absolute difference between the two dataframes
    tot = diff.sum()
    return tot/len(pred)

mae(pred_labels, test_labels)

0.23595505617977527

In [None]:
def accuracy(pred, true):
    diff = true.values - pred.values
    label_diff = pd.DataFrame(diff, index=range(len(true)), columns=[true.columns[0]])
    final_accuracy = sum(label_diff[true.columns[0]] == 0)/len(label_diff)
    return final_accuracy

accuracy(pred_labels, test_labels)

0.7640449438202247
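
For 0/1 labels, every wrong prediction contributes exactly 1 to the absolute error, so the MAE equals 1 minus the accuracy. This is consistent with the two values printed above (0.2360 + 0.7640 = 1). A small sketch with made-up labels illustrates the relationship:

```python
import numpy as np

# Hypothetical labels, chosen only to illustrate the identity
true = np.array([1, 0, 1, 1, 0])
pred = np.array([1, 1, 1, 0, 0])

mae_value = np.abs(pred - true).mean()   # fraction of mismatches for 0/1 labels
accuracy_value = (pred == true).mean()   # fraction of matches

print(mae_value, accuracy_value, mae_value + accuracy_value)
```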

# k-NN implementation (10 Marks)

1. Extend the previous implementation by using the average of the k closest neighbours as a prediction

2. Add the **Manhattan** and **Hamming** distance.


3. Compute the mean absolute error (MAE) between the test labels and the predicted values

> a. Is it more accurate compared to the previous implementation?

In [None]:
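# calculates the Manhattan distance between each query point and each point in the training set
def man(test, train):
    lis = []
    for ind, row in test.iterrows(): # iterates through each row in the test set
        tot = []
        for in2, row2 in train.iterrows(): # iterates through each row in the training set
            tot.append(abs(row - row2).sum()) # sum of the absolute feature differences
        lis.append(tot)
    return lis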

In [None]:
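# calculates the Hamming distance between each query point and each point in the training set
def ham(test, train):
    lis = []
    for ind, row in test.iterrows(): # iterates through each row in the test set
        tot = []
        for in2, row2 in train.iterrows(): # iterates through each row in the training set
            tot.append((row != row2).sum()) # number of features whose values differ
        lis.append(tot)
    return lis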
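
To make the three measures concrete, the sketch below evaluates them on one pair of made-up feature vectors. On min-max scaled continuous columns such as Age and Fare, two passengers rarely share exactly the same value, so the Hamming distance mostly reflects the categorical columns; this may be one reason it performs differently from the other two.

```python
import numpy as np

# Two hypothetical, already-scaled feature vectors
a = np.array([0.0, 1.0, 0.5])
b = np.array([1.0, 1.0, 0.0])

euclidean = np.sqrt(((a - b) ** 2).sum())   # sqrt(1 + 0 + 0.25)
manhattan = np.abs(a - b).sum()             # 1 + 0 + 0.5
hamming = int((a != b).sum())               # number of positions that differ

print(euclidean, manhattan, hamming)
```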

In [None]:
def KNN(train, test, labels, k, method):
    if method == "hamming":
        list_dist = ham(test, train) # gets the Hamming distance between the query and training data
    elif method == "manhattan":
        list_dist = man(test, train) # gets the Manhattan distance between the query and training data
    elif method == "euclidean":
        list_dist = euc(test, train) # gets the Euclidean distance between the query and training data
    else:
        raise ValueError("Non-valid method: " + str(method))

    # gets the positions of the k smallest distances for each of the test points; a stable argsort
    # keeps tied distances as distinct neighbours instead of repeating the first match
    pred_list = [np.argsort(i, kind="stable")[:k] for i in list_dist]
    pred_labels = [[labels.iloc[j, 0] for j in i] for i in pred_list] # gets the labels of the k nearest
    # neighbours for each of the test points (from the labels argument, not the global train_labels)
    act_labels = [max(set(i), key = i.count) for i in pred_labels] # gets the majority label out of the k NN.
    # this is the prediction label for the test set
    final_labels = pd.DataFrame(act_labels, index=range(len(test)), columns=[labels.columns[0]]) # converts the output to a
    # pandas data frame
    return final_labels
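
The task statement asks for the average of the k closest neighbours, whereas the function above takes a majority vote. For binary labels and odd k the two agree once the average is thresholded at 0.5. A minimal sketch of the averaging variant, assuming a precomputed distance matrix (the helper name `knn_average_predict` and the toy numbers are hypothetical):

```python
import numpy as np

def knn_average_predict(distances, train_labels, k):
    """Hypothetical helper: average the labels of the k nearest neighbours, then threshold at 0.5."""
    distances = np.asarray(distances, dtype=float)
    train_labels = np.asarray(train_labels, dtype=float)
    # Positions of the k smallest distances in each row (one row per query point)
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]
    mean_label = train_labels[nearest].mean(axis=1)
    return (mean_label > 0.5).astype(int)

# Toy distance matrix: 2 query points, 4 training points
distances = [[0.1, 0.4, 0.2, 0.9],
             [0.8, 0.3, 0.7, 0.1]]
labels = [1, 0, 1, 0]

pred = knn_average_predict(distances, labels, k=3)
print(pred)
```

For the first query point the three nearest labels are 1, 1 and 0 (mean 2/3), and for the second they are 0, 0 and 1 (mean 1/3).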

In [None]:
print(accuracy(KNN(train_feat, test_feat, train_labels, 1, "euclidean"), test_labels))
print(accuracy(KNN(train_feat, test_feat, train_labels, 3, "euclidean"), test_labels))
print(accuracy(KNN(train_feat, test_feat, train_labels, 5, "euclidean"), test_labels))
print(accuracy(KNN(train_feat, test_feat, train_labels, 7, "euclidean"), test_labels))

0.7640449438202247
0.8033707865168539
0.848314606741573
0.8426966292134831


In [None]:
print(accuracy(KNN(train_feat, test_feat, train_labels, 1, "manhattan"), test_labels))
print(accuracy(KNN(train_feat, test_feat, train_labels, 3, "manhattan"), test_labels))
print(accuracy(KNN(train_feat, test_feat, train_labels, 5, "manhattan"), test_labels))
print(accuracy(KNN(train_feat, test_feat, train_labels, 7, "manhattan"), test_labels))

0.7640449438202247
0.797752808988764
0.8146067415730337
0.8258426966292135


In [None]:
print(accuracy(KNN(train_feat, test_feat, train_labels, 1, "hamming"), test_labels))
print(accuracy(KNN(train_feat, test_feat, train_labels, 3, "hamming"), test_labels))
print(accuracy(KNN(train_feat, test_feat, train_labels, 5, "hamming"), test_labels))
print(accuracy(KNN(train_feat, test_feat, train_labels, 7, "hamming"), test_labels))

0.7528089887640449
0.7528089887640449
0.7415730337078652
0.7471910112359551
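
The accuracies printed by the three cells above can also be compared programmatically. The sketch below copies those printed values into a dictionary keyed by (distance, k) and picks the largest one; the dictionary layout is an assumption made for illustration.

```python
# Accuracies copied from the printed outputs above, keyed by (distance, k)
results = {
    ("euclidean", 1): 0.7640449438202247,
    ("euclidean", 3): 0.8033707865168539,
    ("euclidean", 5): 0.848314606741573,
    ("euclidean", 7): 0.8426966292134831,
    ("manhattan", 1): 0.7640449438202247,
    ("manhattan", 3): 0.797752808988764,
    ("manhattan", 5): 0.8146067415730337,
    ("manhattan", 7): 0.8258426966292135,
    ("hamming", 1): 0.7528089887640449,
    ("hamming", 3): 0.7528089887640449,
    ("hamming", 5): 0.7415730337078652,
    ("hamming", 7): 0.7471910112359551,
}

best_combo = max(results, key=results.get)   # key with the highest accuracy
best_accuracy = results[best_combo]
print(best_combo, round(best_accuracy, 4))
```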


Accuracy for each distance measure and k value (rounded to four decimal places):

| Distance  | k = 1  | k = 3  | k = 5  | k = 7  |
|-----------|--------|--------|--------|--------|
| Euclidean | 0.7640 | 0.8034 | 0.8483 | 0.8427 |
| Manhattan | 0.7640 | 0.7978 | 0.8146 | 0.8258 |
| Hamming   | 0.7528 | 0.7528 | 0.7416 | 0.7472 |


4. Test your approach for k values = [1, 3, 5, 7] and the three distance measures

> a. Which is the best combination?

The best k-NN model uses the Euclidean distance with k = 5 (accuracy of about 0.848). However, this can change depending on how the data is shuffled.

> b. How can you improve the results?


To improve the model, the correlation between all columns was investigated before separating the label and feature datasets, using the code below.

All features with a correlation below 0.1 with "Survived" (i.e. the labels dataset) were removed, as these were not expected to have a significant influence on the accuracy of the models. This reduced the accuracy by ~5% across the board (i.e. for each distance measure and k value). Therefore, for the greatest accuracy it is best to keep these features in the model. However, accuracy is still predominantly determined by the features "Pclass", "Sex" and "Fare".

Features that had a relatively high correlation with each other (e.g. "Fare" and "Pclass") were also reduced so that only one of each pair is used. This assumes that "Fare" and "Pclass" provide similar information to the model; two features that carry the same information can cause overfitting. "Fare" was removed instead of "Pclass" because "Fare" has a higher correlation with the other features than "Pclass" does. This reduced the accuracy of the model by ~3-4%. This is likely because the correlation between "Fare" and "Pclass" is 0.68, which may not be high enough for the two features to be treated as redundant. "Fare" also has a correlation of 0.32 with "Survived", a relatively significant correlation, which likely explains why removing it reduced the accuracy by 3-4%.

The code below was substituted for the "feat" subset when splitting the data into features and labels while investigating model improvement.

In [None]:
# gets the Spearman correlation between all columns of the dataset
corr = df_na.corr(method = "spearman")

# the two options below were tried one at a time
# option 1: keeps only the features whose correlation with "Survived" is at least 0.1
corr_df = feat[["Pclass", "Sex", "Fare"]]

# option 2: keeps all features except "Fare", which has a correlation of 0.68 with "Pclass"
corr_df = feat[["Pclass", "Sex", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard"]]
**Reminder: You cannot use any library, with the exception of pandas and NumPy. So scikit-learn is forbidden!**