# Data Preparation
1. Load the dataset on Colab
2. Display the attribute names and their data types
3. Delete the columns PassengerId and Name
4. Replace all missing values with 0
5. Transform the Sex column into a numerical one
6. Use Survived as the target label and the rest of the data frame as features
7. Split the dataset into 80% for training and 20% for testing
8. Scale the columns using min-max scalers
9. Print the shape of the train and test set

In [None]:
import pandas as pd
#1. Load the dataset on Colab
df = pd.read_csv('https://github.com/andvise/DataAnalyticsDatasets/blob/16ca8de1233c8643bfe85fcd1cd87c9ff2221312/titanic.csv?raw=true')

#2. Display the attribute names and their data types
print(df.dtypes)

#3. Delete the columns PassengerId and Name
df = df[['Survived', 'Pclass', 'Sex', 'Age',
       'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']]

#4. Replace all missing values with 0
df = df.fillna(0)

#5. Transform the Sex column into a numerical one
df['Sex'] = df['Sex'].replace({'male':1 , 'female':0})

# 7. Split the dataset into 80% for training and 20% for testing
df_train = df.sample(frac=0.8, replace=False, random_state=6405)
df_test = df.drop(df_train.index)

PassengerId                  int64
Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard    float64
Fare                       float64
dtype: object


# Scaler Object

Below is a sklearn-inspired scaler object with fit_transform and transform methods.

.fit_transform(): Takes a pandas dataframe as input. It iterates over the columns, storing each column's minimum and maximum in self.min_x and self.max_x, then applies the scaling and returns the scaled dataframe.

.transform(): Applies the scaling to the passed-in dataframe using the stored minimum and maximum values.

Note: both dataframes must have the same columns in the same order, because the stored values are matched to columns by position.
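
As a minimal sketch of the same idea, the min-max formula (x - min) / (max - min) can be applied to whole dataframes at once. The toy values below are hypothetical and are not taken from the Titanic data; the point is that the minimum and maximum come from the training frame only.

```python
import pandas as pd

# Hypothetical toy data, not from the Titanic dataset
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 50.0, 110.0]})
test = pd.DataFrame({"Age": [25.0, 50.0], "Fare": [5.0, 60.0]})

# "Fit": learn the per-column min and max from the training frame only
col_min = train.min()
col_max = train.max()

# "Transform": apply the same statistics to both frames
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)
```

Training values land in [0, 1] by construction. Test values can fall outside that range when they lie beyond the training minimum or maximum, as the second Age value and the first Fare value do here.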

In [None]:
import numpy as np

class MinMaxScaler:
  def __init__(self):
    self.min_x, self.max_x = [],[]

  def fit_transform(self, data):
    #reset the stored values so that refitting does not append to stale ones
    self.min_x, self.max_x = [],[]
    for column in data:
      self.min_x.append(np.min(data[column]))
      self.max_x.append(np.max(data[column]))
    return self.transform(data)

  def transform(self, data):
    #scale each column with the min and max learned in fit_transform
    #columns are matched by position, so the order must match the fitted dataframe
    data_scaled = pd.DataFrame()
    for c, column in enumerate(data):
      data_scaled[column] = (data[column]-self.min_x[c])/(self.max_x[c]-self.min_x[c])
    return data_scaled
    

In [None]:
# 6. Use Survived as the target label and the rest of the data frame as features
features = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard','Parents/Children Aboard', 'Fare']
labels = ['Survived']
x_train, x_test, y_train, y_test = df_train[features], df_test[features], df_train[labels], df_test[labels]

#8. Scale the columns using min-max scalers
myscaler = MinMaxScaler()

x_train = myscaler.fit_transform(x_train)
x_test = myscaler.transform(x_test) 

#9. Print the shape of the train and test set
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((710, 6), (177, 6), (710, 1), (177, 1))

# k-NN Implementation

## 1. Implement k-NN method
To classify each point (query point) of the test set using the k-NN method, follow these steps:

### a. Calculate the Euclidean distance
Calculate the Euclidean distance between the query point (each point in the testing set) and all the training points of the training set.

### b. Select k nearest points
Pick the k points with the smallest distance to the query point (k must be a hyperparameter).

### c. Determine the most common class
Select the most common class among the k points.
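
Steps a to c can be sketched for a single query point as follows. This is an illustrative vectorised version, not the notebook's implementation; the function name knn_predict_one and the toy data are assumptions made for the example.

```python
import numpy as np

def knn_predict_one(query, X_train, y_train, k):
    """Classify one query point by majority vote among its k nearest training points."""
    # a. Euclidean distance from the query to every training point
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # b. Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # c. Most common class among those k labels
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]

# Hypothetical toy data: two features, binary labels
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 0.8], [1.0, 1.0], [0.2, 0.1]])
y_train = np.array([0, 0, 1, 1, 0])

pred_a = knn_predict_one(np.array([0.15, 0.15]), X_train, y_train, k=3)
pred_b = knn_predict_one(np.array([0.95, 0.90]), X_train, y_train, k=3)
print(pred_a, pred_b)
```

The first query sits among the three class-0 points, so all three neighbours vote 0. The second query has two class-1 neighbours and one class-0 neighbour, so the vote is 1.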

## 2. Evaluate the model

### a. Compute the accuracy
Calculate the accuracy of the k-NN classifier on the test set.

### b. Plot the confusion matrix and interpret the results
Create a confusion matrix for the k-NN classifier and briefly describe what the confusion matrix shows about the classifier's performance.


# KNN Object

Below is a sklearn-inspired KNN object with .fit and .predict methods.

It also contains the .euclidean and .weighted_distance methods used by .predict.

It takes two inputs on initialization, k_neighbors and weighted, which are an integer and a boolean respectively.

.fit(): stores the training data set.

.predict(): loops through the test set, calculates the Euclidean distance from each test point to every training point, and returns predictions based on the k_neighbors nearest points, using distance weighting if weighted is True.

In [None]:
#a sklearn-style knn object :) made the code a bit longer but worth it
class KNearestNeighbors:
  def __init__(self, k_neighbors, weighted):
    self.k_neighbours = k_neighbors
    self.weighted = weighted

  def fit(self, X, y):
    #k-NN has no training step, so fit only stores the data
    self.X_train = X.values
    self.y_train = y.values

  def predict(self, X):
    self.X_test = X.values
    prediction = []
    for test in self.X_test:

      #1.a. Calculate the Euclidean distance between the query point (each point in the testing set) and all the training points of the training set.
      dist = np.array([ self.euclidean(test, train) for train in self.X_train])

      #1.b. Then pick the k points with the smallest distance to the query point
      indx_small = np.argsort(dist)[:self.k_neighbours]
      labels = self.y_train[indx_small]

      #weighted or not?
      if self.weighted:
        k_dists = dist[indx_small]
        #pass the labels and distances to the weighted vote
        prediction.append(self.weighted_distance(labels,k_dists))
      else:
        #1.c. Select the most common class among the k points
        #since values are 0,1 we take the ratio of 1's to total neighbours and round to find the majority vote
        #(with an even k, an exact tie rounds to 0)
        mode = np.round(np.sum(labels)/labels.shape[0])
        prediction.append(mode)
      
    #return predictions in numpy array
    return np.asarray(prediction)
  
  #a method for euclidean distance
  def euclidean(self, p,q):
    return np.sqrt(np.sum((np.subtract(p,q))**2))

  def weighted_distance(self, labels, distance):
    #added this to handle zero distance
    distance = distance + 0.001
    #each neighbour votes with weight 1/distance^2
    #since 0 and 1 are the labels, the sum of weight*label gives the total weight for 1
    wgt1 = sum([li/np.power(di,2) for di, li in zip(distance,labels)])
    wgt0 = sum([1/np.power(di,2) for di in distance]) - wgt1
    #return 1 if the total weight for label 1 is at least that for label 0
    return 1 if wgt1>=wgt0 else 0
    

In [None]:
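#2. Compute the accuracy and plot the confusion matrix
def confusionMatrix(y_preds,y_test):
  #note: class 0 (not surviving) is treated as the "positive" class here
  tp,fp,tn,fn = 0,0,0,0

  for pred, targ in zip(y_preds, y_test):
    if pred==0:
      if (pred==targ): 
        tp+=1   #predicted 0, actually 0
      else:
        fp+=1   #predicted 0, actually 1
    if pred==1:
      if (pred==targ): 
        tn+=1   #predicted 1, actually 1
      else:
        fn+=1   #predicted 1, actually 0

  #rows are the actual class (0, 1), columns are the predicted class (0, 1)
  cm = np.array([[tp,fn],[fp,tn]])
  return cm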

In [None]:
# here we call the knn for non weighted and 3 neighbours
myknn = KNearestNeighbors(k_neighbors=3, weighted=False)
myknn.fit(x_train, y_train)
y_pred = myknn.predict(x_test)
cm = confusionMatrix(y_pred,y_test.values)

In [None]:
# here we call the knn for weighted and 3 neighbours
mywknn = KNearestNeighbors(k_neighbors=3, weighted=True)
mywknn.fit(x_train, y_train)
y_predw = mywknn.predict(x_test)
cmw = confusionMatrix(y_predw,y_test.values)

In [None]:
print('Confusion Matrix KNN:\n', pd.DataFrame(cm, columns=['Predicted: 0', 'Predicted: 1']))
print('Test Accuracy KNN: ', np.trace(cm)/np.sum(cm))
print('\n')
print('Confusion Matrix Weighted KNN:\n', pd.DataFrame(cmw, columns=['Predicted: 0', 'Predicted: 1']))
print('Test Accuracy Weighted KNN: ', np.trace(cmw)/np.sum(cmw))

Confusion Matrix KNN:
    Predicted: 0  Predicted: 1
0            94            19
1            25            39
Test Accuracy KNN:  0.751412429378531


Confusion Matrix Weighted KNN:
    Predicted: 0  Predicted: 1
0            86            27
1            24            40
Test Accuracy Weighted KNN:  0.711864406779661


# Weighted k-NN

## 1. Compare Weighted k-NN to Normal k-NN

### a. Does it outperform the normal k-NN?
As seen from the results, the normal k-NN outperforms the weighted k-NN on the test set, with accuracies of 75.14% and 71.19% respectively (both with k=3).
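
To see how the two voting rules can disagree on the same neighbours, here is a minimal sketch in plain Python. It uses the same 1/(distance + 0.001)^2 weighting as the KNN object; the function names and the toy labels and distances are assumptions made for illustration.

```python
def majority_vote(labels):
    """Unweighted vote for binary 0/1 labels (assumes an odd number of neighbours)."""
    return 1 if sum(labels) > len(labels) / 2 else 0

def inverse_square_vote(labels, distances, eps=0.001):
    """Vote in which each neighbour counts 1 / (distance + eps)^2."""
    weights = [1.0 / (d + eps) ** 2 for d in distances]
    weight_1 = sum(w for w, label in zip(weights, labels) if label == 1)
    weight_0 = sum(weights) - weight_1
    return 1 if weight_1 >= weight_0 else 0

# Hypothetical neighbours: one very close point of class 1, two farther points of class 0
labels = [1, 0, 0]
distances = [0.05, 0.40, 0.50]

plain = majority_vote(labels)
weighted = inverse_square_vote(labels, distances)
print(plain, weighted)
```

The majority vote returns 0 because two of the three neighbours are class 0, while the weighted vote returns 1 because the single close neighbour outweighs the two distant ones. With inverse-square weights, one very near point can dominate the prediction, which is one way the two classifiers end up with different test accuracies.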

## 2. Evaluate the Weighted k-NN Model

### a. Compute the accuracy
Calculate the accuracy of the weighted k-NN classifier on the test set.

### b. Plot the confusion matrix and interpret the results
Create a confusion matrix for the weighted k-NN classifier and briefly describe what the confusion matrix shows about the classifier's performance.

The confusion matrix compares the predicted and actual classes, giving the number of True Positives, True Negatives, False Positives, and False Negatives.

- True Positives (TP) and True Negatives (TN) count the correct classifications of 0 (Not Surviving) and 1 (Surviving) respectively; class 0 is treated as the positive class here.
- False Positives (FP) count passengers predicted as 0 (Not Surviving) who actually survived, and False Negatives (FN) count passengers predicted as 1 (Surviving) who did not survive.

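The confusion matrix is arranged as follows, with rows for the actual class and columns for the predicted class:

|           | Predicted: 0 | Predicted: 1 |
|-----------|--------------|--------------|
| Actual: 0 | TP           | FN           |
| Actual: 1 | FP           | TN           |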

We can also derive the accuracy from the confusion matrix by dividing the sum of the diagonal elements by the sum of all elements in the matrix.
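
As a quick check of that rule, the sketch below recomputes the accuracies from the two confusion matrices printed earlier. The helper name accuracy is an assumption for the example.

```python
import numpy as np

# Confusion matrices reported earlier (rows: actual class, columns: predicted class)
cm = np.array([[94, 19], [25, 39]])    # normal k-NN, k=3
cmw = np.array([[86, 27], [24, 40]])   # weighted k-NN, k=3

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    return float(np.trace(matrix) / np.sum(matrix))

acc = accuracy(cm)
accw = accuracy(cmw)
print(round(acc, 4), round(accw, 4))  # 0.7514 0.7119
```

Both matrices sum to 177, the size of the test set, and the diagonals sum to 133 and 126 correct predictions respectively.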


# Hyperparameter search
1. Test [1, 3, 5, 7, 9, 11] as possible k  values

a. Select the best one and explain why

In [None]:
Ks =  [1, 3, 5, 7, 9, 11]
accs = []
for k in Ks:
  cvknn = KNearestNeighbors(k_neighbors=k, weighted=True)
  cvknn.fit(x_train, y_train)
  y_preds = cvknn.predict(x_test)
  cm = confusionMatrix(y_preds,y_test.values)
  accs.append(np.trace(cm)/np.sum(cm))

cv_df = pd.DataFrame(list(zip(Ks, accs)), columns=['K','Accuracy'])
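# Styler.hide_index() was removed in pandas 2.0; hide(axis="index") replaces it (pandas >= 1.4)
cv_df.style.hide(axis="index")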

K,Accuracy
1,0.694915
3,0.711864
5,0.723164
7,0.723164
9,0.734463
11,0.734463


K=9 appears to be the best choice, as it has the highest accuracy on the test set (0.734463). K=11 reaches the same accuracy, so K=9 is the smallest K that attains it.
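
The selection can be made explicit with a short sketch that uses the accuracies from the table above. Breaking a tie in favour of the smaller K is an assumption made here for illustration; the table alone does not separate the tied values.

```python
# Accuracies copied from the results table above (weighted k-NN on the test set)
results = {1: 0.694915, 3: 0.711864, 5: 0.723164, 7: 0.723164, 9: 0.734463, 11: 0.734463}

best_acc = max(results.values())
# Every K that reaches the best accuracy
tied = [k for k, acc in results.items() if acc == best_acc]
# Assumption: on a tie, prefer the smaller K (fewer neighbours to compare per query)
best_k = min(tied)

print(tied, best_k)  # [9, 11] 9
```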
