**DATA LOADING AND DATA PREPROCESSING**

In [1]:
#Import library and load data
import pandas as pd 
import numpy as np
from math import sqrt
from collections import defaultdict  # defaultdict avoids a KeyError when a dictionary key has no value yet.

titanic = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/16ca8de1233c8643bfe85fcd1cd87c9ff2221312/titanic.csv?raw=True")

In [2]:
#Check data types
titanic.dtypes

titanic.drop(['PassengerId','Name'], axis=1, inplace=True)


In [3]:
# check missing values in variables
titanic.isnull().sum()

#Check how many null values are present in the Parents/Children Aboard column
titanic['Parents/Children Aboard'].isnull().sum()

#Replace missing values with 0
titanic['Parents/Children Aboard'] = titanic['Parents/Children Aboard'].fillna(0)


**The Parents/Children Aboard column contains 1 NaN value, which is replaced with 0 above.**

In [4]:
#Transform the Sex column into a numerical one
titanic['Sex'] = np.where(titanic['Sex'] == 'female', 0, 1)
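One detail of the `np.where` encoding is that every value other than `'female'` becomes 1, including a missing value. The sketch below illustrates this on a small hypothetical Series (not the Titanic data) and compares it with `Series.map`, which leaves unknown values as NaN instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, including one missing value.
sex = pd.Series(["male", "female", "female", None])

# np.where: anything that is not 'female' (including None) is encoded as 1.
encoded_where = np.where(sex == "female", 0, 1)

# map: values without a dictionary entry (here None) become NaN.
encoded_map = sex.map({"female": 0, "male": 1})

print(encoded_where)
print(encoded_map)
```

If the column has no missing values, both approaches give the same encoding.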

In [5]:
#Assign the predictor variables to x and the target variable to y
x = titanic.iloc[:,1:].copy()  # copy so that scaling does not write into a slice of titanic
y = titanic.iloc[:,:1].copy()


In [6]:
def min_max_scaling(column):

    """ 
    Scale a column to the range [0, 1] using min-max scaling.

    """
    return(column-column.min())/(column.max() - column.min())

#Scaling X and Y values
for xcol in x.columns:
    x[[xcol]] = min_max_scaling(x[[xcol]])

for ycol in y.columns:
    y[[ycol]] = min_max_scaling(y[[ycol]])
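As a small worked example of min-max scaling, the sketch below applies the same formula to a toy column. The function name `min_max_scaling_safe` and the guard for a constant column (where max equals min and the formula would divide by zero) are assumptions added for illustration, not part of the notebook.

```python
import pandas as pd

def min_max_scaling_safe(column):
    """Scale a Series to [0, 1]; a constant column is mapped to all zeros."""
    span = column.max() - column.min()
    if span == 0:
        # Assumption: map a constant column to 0.0 rather than dividing by zero.
        return column * 0.0
    return (column - column.min()) / span

ages = pd.Series([2.0, 22.0, 42.0])
scaled = min_max_scaling_safe(ages)
print(scaled.tolist())  # [0.0, 0.5, 1.0]

constant = min_max_scaling_safe(pd.Series([3.0, 3.0]))
print(constant.tolist())  # [0.0, 0.0]
```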

    

In [7]:
#Divide the data into training and testing set with 80% of training data and 20% of testing data

X_train = x.sample(frac=0.8, random_state=1)
X_test = x.drop(X_train.index)

Y_train = y.sample(frac=0.8, random_state=1)
Y_test = y.drop(Y_train.index)
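The split above relies on `x` and `y` having the same length and the same `random_state`, so that both samples pick the same rows. A possible alternative, sketched here on a hypothetical toy frame, is to sample the rows once and slice features and target afterwards, which keeps the indices aligned by construction.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame.
toy = pd.DataFrame({"Survived": [0, 1, 1, 0, 1],
                    "Age": [22, 38, 26, 35, 54]})

# Sample the rows once, then take the complement as the test set.
train = toy.sample(frac=0.8, random_state=1)
test = toy.drop(train.index)

# Slice features and target from the same rows.
X_train_toy, Y_train_toy = train[["Age"]], train[["Survived"]]
X_test_toy, Y_test_toy = test[["Age"]], test[["Survived"]]

print(len(train), len(test))
```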

In [8]:
#Shape of test and training data
x_training_data_shape = X_train.shape
x_testing_data_shape = X_test.shape

y_training_data_shape = Y_train.shape
y_testing_data_shape = Y_test.shape

print("X_train_shape:",x_training_data_shape,"Y_train_shape:",y_training_data_shape)
print("X_test_shape:",x_testing_data_shape,"Y_test_shape:",y_testing_data_shape)

X_train_shape: (710, 6) Y_train_shape: (710, 1)
X_test_shape: (177, 6) Y_test_shape: (177, 1)


# Part 2 - k-NN implementation


Class to implement KNN

In [16]:
class KNN:

    def __init__(self, k):
        """
        Let k be the number of nearest neighbours considered for each prediction.
        
        """
        self.k = k

    def fit(self, X, y):
        """
        Fit the training data to the model. We also assert that the lengths
        of the training data and targets are the same, otherwise the prediction method will break.
        """
        assert len(X) == len(y)
        self.X = X
        self.y = y
        return self

    def _distance(self, data1, data2):
        """
        Compute the Euclidean distance between two points.
        """
        return np.sqrt(sum((data1 - data2)**2))
  
    def _predict_one(self, test):
        """
        Predict the target for one test instance. The Euclidean distance from
        the test point to every training point is computed and sorted, and the
        k nearest neighbours (smallest distances) are grouped by class. The
        class returned is the one whose neighbours' distances sum to the
        largest value.
        """
        distances = sorted((self._distance(x, test), y) for x, y in zip(self.X, self.y))
        neighbors = distances[:self.k]
        neighbours_by_class = defaultdict(list)
        for d, c in neighbors:
            neighbours_by_class[c].append(d)
        return max((sum(val), key) for key, val in neighbours_by_class.items())[1]

    def predict(self, X):
        """
        Methods for predicting each instance
        """
        return [self._predict_one(x) for x in X]

    def score(self, X, y):
        """
        Method takes the X_test and y_test, runs the data through the predict method.
        The number of successful guesses are summed and compared to the total 
        number in the test data.
        """
        return sum(1 for pred, true in zip(self.predict(X), y) if pred == true) / len(y)
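Note that `_predict_one` above ranks classes by the sum of their neighbours' distances, which can differ from a count-based vote. For comparison, here is a minimal sketch of plain majority voting, assuming the k nearest labels are simply counted. The function name `majority_vote_predict` and the toy data are hypothetical.

```python
from collections import Counter
import numpy as np

def majority_vote_predict(X_train, y_train, query, k):
    """Return the most common label among the k nearest training points."""
    # Euclidean distance from the query to every training row.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0]])
y_toy = [0, 0, 0, 1, 1]

pred_near_origin = majority_vote_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_near_corner = majority_vote_predict(X_toy, y_toy, np.array([0.95, 1.0]), k=3)
print(pred_near_origin, pred_near_corner)  # 0 1
```

In the second query, two of the three neighbours are very close points of class 1 and one is a distant point of class 0. Counting votes returns 1, whereas summing distances per class would favour the single distant neighbour.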

A helper function to flatten the single-column target DataFrame into a flat list.

In [10]:
#Method to flatten list of lists to a single list
def flatten(t):
    return [item for sublist in t for item in sublist]
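For a single-column DataFrame, the same flat list can also be obtained directly from NumPy. The sketch below compares the two on a hypothetical toy frame; `Y_toy` is an illustrative name, not a variable from the notebook.

```python
import pandas as pd

# Hypothetical single-column target frame.
Y_toy = pd.DataFrame({"Survived": [0.0, 1.0, 1.0]})

# List-comprehension flattening, as in flatten() above.
via_flatten = [item for sublist in Y_toy.values.tolist() for item in sublist]

# Equivalent NumPy route: ravel the 2-D array to 1-D, then convert to a list.
via_ravel = Y_toy.to_numpy().ravel().tolist()

print(via_flatten == via_ravel)  # True
```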

Running the KNN implementation.

Input parameters:
X as a NumPy array,
Y as a list.



In [18]:
#Instantiate the KNN model with K
testing_neighbors_knn = KNN(k=5)

# Fit the model to the training data.
testing_neighbors_knn.fit(X_train.to_numpy(), flatten(Y_train.values.tolist()))

# Run predictions using the test sample data.
prediction = testing_neighbors_knn.predict(X_test.to_numpy())

# Prediction accuracy.
pred_acc_knn = testing_neighbors_knn.score(X_test.to_numpy(),flatten(Y_test.values.tolist()))

data = {'y_Actual':   flatten(Y_test.values.tolist()),
       'y_Predicted': prediction 
       }
df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], 
                               rownames=['Actual'], colnames=['Predicted'], 
                               margins = True)

print('The accuracy of the model is :', round(pred_acc_knn*100,2),'%')
print("***************************")
print("'Confusion Matrix'")
print("***************************")
print(confusion_matrix)

The accuracy of the model is : 86.44 %
***************************
'Confusion Matrix'
***************************
Predicted  0.0  1.0  All
Actual                  
0.0         96    8  104
1.0         16   57   73
All        112   65  177


The confusion matrix shows that the model correctly predicted 96 passengers in class 0 (did not survive) and 57 passengers in class 1 (survived).

It wrongly predicted class 1 for 8 passengers who did not survive (false positives).

It wrongly predicted class 0 for 16 passengers who survived (false negatives).
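The counts in the confusion matrix can be turned into summary metrics with a few lines of arithmetic. The sketch below treats class 1 (survived) as the positive class, which is an assumption made for this illustration; the four counts are copied from the matrix above.

```python
# Counts taken from the confusion matrix (rows: actual, columns: predicted).
tn, fp = 96, 8    # actual 0: predicted 0, predicted 1
fn, tp = 16, 57   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted survivors who survived
recall = tp / (tp + fn)        # share of actual survivors who were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.8644 precision=0.8769 recall=0.7808
```

The accuracy matches the 86.44% reported above, while the lower recall reflects the 16 survivors predicted as non-survivors.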


# Part 3 - Hyperparameters search 


In [17]:
K=[1, 3, 5, 7, 9, 11]

for i in K :
  #Instantiate the KNN model with K
  testing_neighbors_knn = KNN(k=i)

  # Fit the model to the training data.
  testing_neighbors_knn.fit(X_train.to_numpy(), flatten(Y_train.values.tolist()))

  # Run predictions using the test sample data.
  prediction = testing_neighbors_knn.predict(X_test.to_numpy())

  # Prediction accuracy.
  pred_acc_knn = testing_neighbors_knn.score(X_test.to_numpy(),flatten(Y_test.to_numpy()))

  print("The accuracy of the model is %.2f%s for K = %d" % (round(pred_acc_knn*100,2),'%', i))



The accuracy of the model is 81.92% for K = 1
The accuracy of the model is 83.05% for K = 3
The accuracy of the model is 86.44% for K = 5
The accuracy of the model is 82.49% for K = 7
The accuracy of the model is 79.10% for K = 9
The accuracy of the model is 81.92% for K = 11


Among the tested values, k = 5 gives the highest test accuracy, 86.44%. The kNN classifier determines the class of a data point from its nearest neighbours: when k is set to 5, the classes of the 5 closest training points are checked and the prediction follows the majority class. Similarly, kNN regression would take the mean value of the 5 closest points.

On the other hand, when k = 1 the prediction is based on a single sample, the closest neighbour. This is very sensitive to all sorts of distortions such as noise, outliers and mislabelled data.

Lower k values tend to overfit and higher k values tend to underfit, so an intermediate k is usually preferable. Cross-validation is one technique that could be used to select k more reliably, since choosing k on the test set gives an optimistic accuracy estimate.
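To illustrate how cross-validation could be used to choose k, here is a minimal sketch of 5-fold cross-validation on synthetic two-class data. The helper names (`knn_predict`, `cross_val_accuracy`), the fold count and the data are all assumptions for illustration; it uses a simple majority-vote kNN rather than the `KNN` class above.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Majority-vote kNN prediction for each row of X_query."""
    preds = []
    for q in X_query:
        d = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]
        preds.append(Counter(y_train[nearest].tolist()).most_common(1)[0][0])
    return np.array(preds)

def cross_val_accuracy(X, y, k, n_folds=5, seed=1):
    """Mean validation accuracy over n_folds shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        preds = knn_predict(X[train_idx], y[train_idx], X[val_idx], k)
        scores.append((preds == y[val_idx]).mean())
    return float(np.mean(scores))

# Synthetic data: two Gaussian clusters, 50 points each.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

cv_scores = {k: cross_val_accuracy(X_demo, y_demo, k) for k in [1, 3, 5, 7, 9, 11]}
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, "best k:", best_k)
```

With this approach, k is chosen using only the training data, and the test set is kept for a final accuracy estimate.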

# Part 4 - Weighted k-NN 




Class to implement weighted KNN

In [19]:
class WeightedKNN:

    def __init__(self, k):

        """
        Let k be the number of nearest neighbours considered for each prediction.

        """
        self.k = k

    def fit(self, X, y):

        """
        Method to fit training data to the model. We also assert that 
        the lengths of the training data and targets are the same.
        """
        assert len(X) == len(y)
        self.X = X
        self.y = y
        return self

    def _distance(self, data1, data2):

        """
        Compute the Euclidean distance between two points.
        """
        return np.sqrt(sum((data1 - data2)**2))
        

    def _compute_weights(self, distances):

       """
       Compute the weights as the inverse of the squared distance.
       If any neighbour has distance 0, only those exact matches are
       returned, each with weight 1.
       """
       matches = [(1, y) for d, y in distances if d == 0]
       return matches if matches else [(1/np.square(d), y) for d, y in distances]
  
    def _predict_weight(self, test):

        """
        Predict one instance by summing the neighbour weights per class
        and returning the class with the largest total weight.

        """
        distances = sorted((self._distance(x, test), y) for x, y in zip(self.X, self.y))
        weights = self._compute_weights(distances[:self.k])
        weights_by_class = defaultdict(list)
        for d, c in weights:
            weights_by_class[c].append(d)
        return max((sum(val), key) for key, val in weights_by_class.items())[1]

    def predict(self, X):

        return [self._predict_weight(x) for x in X]

    def score(self, X, y):

        """
        Method takes the X_test and y_test, runs the data through the predict method.
        The number of successful guesses are summed and compared to the total number in the test data.
        """
        return sum(1 for pred, true in zip(self.predict(X), y) if pred == true) / len(y)
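A small worked example may help show how inverse squared-distance weighting can change the outcome of a vote. The three (distance, label) pairs below are hypothetical: one very close neighbour of class 1 and two slightly farther neighbours of class 0.

```python
from collections import Counter, defaultdict

# Hypothetical k = 3 neighbours as (distance, label) pairs.
neighbours = [(0.1, 1), (0.2, 0), (0.25, 0)]

# Weighted vote: each neighbour contributes 1 / distance**2 to its class.
weights_by_class = defaultdict(float)
for d, label in neighbours:
    weights_by_class[label] += 1 / d**2
weighted_winner = max(weights_by_class, key=weights_by_class.get)

# Unweighted vote: each neighbour contributes one vote.
majority_winner = Counter(label for _, label in neighbours).most_common(1)[0][0]

print(dict(weights_by_class))            # class 1: about 100, class 0: about 41
print(weighted_winner, majority_winner)  # 1 0
```

The single close neighbour carries a weight of about 100, against about 25 + 16 = 41 for the two class 0 neighbours, so the weighted vote returns class 1 while a simple count returns class 0.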

In [20]:
#Instantiate the weighted KNN model with K
testing_neighbors_wknn = WeightedKNN(k=5)

# Fit the model to the training data.
testing_neighbors_wknn.fit(X_train.to_numpy(), flatten(Y_train.values.tolist()))

# Run predictions using the test sample data.
prediction = testing_neighbors_wknn.predict(X_test.to_numpy())

# Prediction accuracy.
pred_acc_wknn = testing_neighbors_wknn.score(X_test.to_numpy(),flatten(Y_test.values.tolist()))

data = {'y_Actual':   flatten(Y_test.values.tolist()),
       'y_Predicted': prediction 
       }
df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], 
                               rownames=['Actual'], colnames=['Predicted'], 
                               margins = True)

print('The accuracy of the model is :', round(pred_acc_wknn*100,2),'%')
print("***************************")
print("'Confusion Matrix'")
print("***************************")
print(confusion_matrix)


The accuracy of the model is : 84.18 %
***************************
'Confusion Matrix'
***************************
Predicted  0.0  1.0  All
Actual                  
0.0         92   12  104
1.0         16   57   73
All        108   69  177


In [21]:
K=[1, 3, 5, 7, 9, 11]

for i in K :
  #Instantiate the weighted KNN model with K
  testing_neighbors_wknn = WeightedKNN(k=i)

  # Fit the model to the training data.
  testing_neighbors_wknn.fit(X_train.to_numpy(), flatten(Y_train.values.tolist()))

  # Run predictions using the test sample data.
  prediction = testing_neighbors_wknn.predict(X_test.to_numpy())

  # Prediction accuracy.
  pred_acc_wknn = testing_neighbors_wknn.score(X_test.to_numpy(),flatten(Y_test.to_numpy()))

  print("The accuracy of the model is %.2f%s for K %d" % (round(pred_acc_wknn*100,2),'%', i))

The accuracy of the model is 81.92% for K 1
The accuracy of the model is 83.05% for K 3
The accuracy of the model is 84.18% for K 5
The accuracy of the model is 84.18% for K 7
The accuracy of the model is 84.18% for K 9
The accuracy of the model is 84.75% for K 11


Comparing the results of the plain k-NN and the weighted k-NN on this dataset, the majority-voting rule and the inverse-distance-squared weighted rule perform similarly; the differences are not overwhelming. The two give identical accuracy for k = 1 and k = 3. The plain k-NN is better at k = 5 (86.44% vs 84.18%), while the weighted k-NN is better for k = 7, 9 and 11 and varies less across k.

The weighted k-NN therefore does not outperform the plain k-NN at its best k on this dataset. It might give better results on larger datasets.

**REFERENCES**

Below is the link that I used as a reference to build my k-NN code.

I reused the code of my plain k-NN to create the weighted k-NN by adding methods.

http://robdbennett.com/2020-09-25-Cluster_From_Scratch/