## COMP2006: Graded Lab 2

In this lab, you will create a *k-nearest neighbours* algorithm from scratch. 

**Requirements**

- All code written in Python
- Create a Python class called `KNeighbours` that:
    - (10 marks) initializes an instance of the class with a user-provided value of `k` and whether or not it will be used for a *regression* or a *classification* problem
    - (15 marks) contains a method to calculate the Euclidean distance between a test point and all points in a training data set
        - do NOT use built-in functions to calculate the distance
    - (15 marks) contains a method that sorts Euclidean distances and returns the top-`k` values and the indices of those samples in the training data set
    - contains a prediction method for:
        - (15 marks) classification
        - (15 marks) regression
- (15 marks) Use sklearn's `make_classification()` function to generate data and test if your class works
    - compare the results of your class to those of sklearn's `KNeighborsClassifier`
- (15 marks) Use sklearn's `make_regression()` function to generate data and test if your class works
    - compare the results of your class to those of sklearn's `KNeighborsRegressor`

**What to submit**
 - A completed copy of this notebook with all final, error-free code executed and all output visible in the notebook
 - ONE submission per group

## **The KNeighbours class - TEAM 7**

**Manuel Bishop Noriega - 4362207**

**Robert Matney - 4364229**

We added two more methods, `fit` and `predict`, to mirror how the `sklearn.neighbors` estimators are used. Alternatively, we could have called the prediction methods directly, passing them the training data and the test points.

In [1]:
import numpy as np
# import matplotlib.pyplot as plt
# import pandas as pd

from math import sqrt # had to use this builtin function :(
class KNeighbours:
    
# constructor: takes k and the type of model to use ("regressor" or "classifier")
    def __init__(self,k,model_type="regressor"):
        self.k=k
        self.distances=[]
        self.X_train=[]
        self.y_train=[]
        if model_type in ("regressor","classifier"):
            self.model_type=model_type
        else:
            # any other value falls back to regression
            self.model_type="regressor"
            print('using default model regression')


# calculates distances between the test point and all points in the training data set
# NOTE: only the first two features of each point are used
    def calculate_euclidean_distance(self, test, train_points):
        self.distances=[]
        for idx, trp in enumerate(train_points):
            dist=sqrt((test[0]-trp[0])**2 + (test[1]-trp[1])**2)
            self.distances.append([idx,dist])
        return self.distances

# sorts the distances and returns the k smallest as [index, distance] pairs
    def sort_distances(self):
        sorted_distances=sorted(self.distances,key=lambda pt:pt[1])
        return sorted_distances[:self.k]

# stores the training data in the object
# receives 2 training data sets: X for features and y for targets
    def fit(self, X_train, y_train):
        self.X_train=X_train
        self.y_train=y_train

# delivers predictions according to the model_type chosen at instantiation
# similar to sklearn.neighbors, takes an array of test points and returns an array of predictions
    def predict(self,test):
        if self.model_type=="regressor":
            return self.myRegressor(test)
        return self.myClassifier(test)

# method for classification prediction
# receives test points; for each one, looks for the k nearest neighbours in the training data,
# checks their classes and predicts the class with the most votes.
# returns an array of predicted classes

    def myClassifier(self,test):
        predictList=[]
        for p in test:
            self.calculate_euclidean_distance(p,self.X_train)
            knn=self.sort_distances()
            predict=None
            cset=set(self.y_train)  # extracting all different classes from train set
            # print(f"classes:{cset}")
            classes=dict.fromkeys(cset,0)    # creating a classes dictionary to accumulate votes for knn
            for i in knn:
                classes[self.y_train[i[0]]]+=1  # counting knn per class
            # print(f"votes per class: {classes}")
            predict=max(classes,key=classes.get)  # most popular class for knn is predicted
            predictList.append(predict)
        return np.array(predictList)

# method for regression prediction
# receives test points; for each one, looks for the k nearest neighbours in the training data
# and calculates the average of their target values.
# returns an array of predicted values
    def myRegressor(self, test):
        predictList=[]
        for p in test:
            self.calculate_euclidean_distance(p,self.X_train)
            knn=self.sort_distances()
            predict=0
            for i in knn:
                predict+=self.y_train[i[0]]  # summing the targets of the k nearest neighbours
            predict=predict/self.k
            predictList.append(predict)
        return np.array(predictList)
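
The distance method above reads only `test[0]` and `test[1]`, so it is tied to two-feature data. A minimal sketch of how the same steps (distance, top-`k` selection, vote or average) could be written for any number of features, without `math.sqrt` or other distance helpers, might look like the following. The function names `euclidean`, `k_nearest`, `knn_classify` and `knn_regress` are hypothetical and not part of the class above, and the data is made up for illustration.

```python
def euclidean(a, b):
    """Euclidean distance between two points with the same number of features."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
    return total ** 0.5  # square root without math.sqrt


def k_nearest(test_point, train_points, k):
    """Return the k closest training points as (index, distance) pairs."""
    pairs = [(i, euclidean(test_point, p)) for i, p in enumerate(train_points)]
    pairs.sort(key=lambda pair: pair[1])
    return pairs[:k]


def knn_classify(test_point, X_train, y_train, k):
    """Majority vote among the k nearest neighbours."""
    votes = {}
    for i, _ in k_nearest(test_point, X_train, k):
        votes[y_train[i]] = votes.get(y_train[i], 0) + 1
    return max(votes, key=votes.get)


def knn_regress(test_point, X_train, y_train, k):
    """Mean target of the k nearest neighbours."""
    neighbours = k_nearest(test_point, X_train, k)
    return sum(y_train[i] for i, _ in neighbours) / len(neighbours)


# Made-up three-feature data, for illustration only
X_train = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5], [6, 5, 5]]
y_class = [0, 0, 0, 1, 1]
y_value = [1.0, 2.0, 3.0, 10.0, 12.0]

print(euclidean([0, 0, 0], [3, 4, 0]))                      # 5.0
print(knn_classify([0.2, 0.1, 0.0], X_train, y_class, 3))   # 0
print(knn_regress([0.2, 0.1, 0.0], X_train, y_value, 3))    # 2.0
```

Because `zip` walks over every coordinate, the same functions work for the two-feature regression data and the three-feature classification data.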


## Regression models comparison
**Creating regression training and test data sets**

In [2]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

#data set for regression method
X,y = make_regression(n_samples = 50, n_features = 2, random_state = 16)
#splitting into training (80%) and test (20%) sets
X1,X2,y1,y2= train_test_split(X,y,test_size=0.2)


**Showing the training and test feature and target arrays**

In [3]:
X1,X2,y1,y2

(array([[-0.66612899,  0.16360842],
        [ 0.57970657, -1.33817457],
        [-0.82130108,  1.80553286],
        [-1.07884538,  0.23353755],
        [ 0.06633037, -0.59446798],
        [-2.48664218, -1.40482756],
        [ 0.25394558,  0.16490347],
        [ 1.52115734, -0.72946441],
        [ 0.65863226,  0.71596476],
        [ 0.49336691, -0.75686314],
        [ 0.11874356, -0.7868971 ],
        [-1.04842202,  0.50927062],
        [-0.63928976,  1.67433222],
        [-0.10486685,  0.72127531],
        [-0.54930662, -0.32608562],
        [ 1.10165993,  1.59371334],
        [ 0.50319344,  1.67471048],
        [-0.62136861, -1.23549487],
        [ 2.0104348 ,  0.30906374],
        [-1.45171934, -0.27091291],
        [ 1.88446401, -0.34251968],
        [ 0.55642014,  0.64840236],
        [ 1.61226119,  0.27634147],
        [ 1.07974786,  1.27354347],
        [-0.53386485,  0.87187693],
        [-1.60906787,  0.6500202 ],
        [-1.01502185, -0.83940431],
        [-0.22107202, -0.171

**Training my model**

In [4]:

k=3
# Instantiating KNeighbours object as regressor
myreg=KNeighbours(k,"regressor")

#training with regression data set
myreg.fit(X1,y1)


**Looking at the distances array and the kNN distances array**

These calls only show what the distance methods return. They do not need to be called separately, because the `predict` method calls them internally.

In [5]:
test=[0,0]
# distances between the test point and every point in X (the full data set, not only the training split)
print(f"distances array : {myreg.calculate_euclidean_distance(test,X)}")
# kNN distances sorted array
print(f"\nkNN distances sorted array : {myreg.sort_distances()}")

distances array : [[0, 2.28750015688305], [1, 0.23767728528584464], [2, 1.1460182479954695], [3, 1.0567801656504316], [4, 1.1655665125837735], [5, 1.6430617726408525], [6, 2.8560339260029317], [7, 0.6859267756858886], [8, 1.915339221500047], [9, 0.5847953433748087], [10, 1.3829484912269019], [11, 1.533819412308689], [12, 0.9728319436385796], [13, 1.9374149765551019], [14, 1.0223409640630428], [15, 0.10665550526400476], [16, 1.103832935185791], [17, 0.5549256016767292], [18, 2.783228548620853], [19, 0.6388032483311681], [20, 0.36633192518272567], [21, 1.687020443754655], [22, 2.034052228357873], [23, 0.3027895536866004], [24, 1.019381789818306], [25, 0.8906614084325801], [26, 1.7922276065381109], [27, 0.728858781762822], [28, 1.4767812438692498], [29, 2.4680996983574874], [30, 1.6336143143149098], [31, 0.602501701314028], [32, 1.2077804117148108], [33, 0.7958059330273406], [34, 0.8544173388156928], [35, 1.7486734481402009], [36, 0.279968546426739], [37, 1.9835534717746477], [38, 1.03741

**Training sklearn model, predicting with both models and comparing results**

In [6]:

#predicting values for the test set
print(f"MY REGRESSOR ----------- \nTest dataset :\n")
# print(X2)
print(f"target values : {y2}")
print(f"predicted values : {myreg.predict(X2)}")

#importing sklearn regressor
from sklearn.neighbors import KNeighborsRegressor
# instantiating regressor
reg = KNeighborsRegressor(n_neighbors=k)

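# training model with regression data set
# NOTE: this fits on the full data set (X, y), which includes the test points,
# whereas myreg was fitted on the training split (X1, y1)
reg.fit(X,y)
#printing sklearn prediction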
print("\nSKLEARN ------- \nTest dataset : \n")
# print(X2)
print(f"target values : {y2}")
result=reg.predict(X2)
print(f"predicted value is: {result}")

#testing both models using a single point
test=[[1.23,-.000232]]
print(f"\n\nMY REGRESSOR prediction for {test} : {myreg.predict(test)}")
# NOTE: this line also calls myreg.predict, not reg.predict, so it repeats the value above
print(f"SKLEARN REGRESSOR prediction for {test} : {myreg.predict(test)}")

MY REGRESSOR ----------- 
Test dataset :

target values : [ 152.579763    232.29849042  -87.00912871  -51.40881716   36.09658124
   49.94463153 -120.0648182    67.71337073   75.72853688   37.83425893]
predicted values : [155.80609594 168.97256752 -82.03970074 -58.99059193 -11.75797932
  22.50149288 -85.93282897  86.52350673  35.32533785  29.60333507]

SKLEARN ------- 
Test dataset : 

target values : [ 152.579763    232.29849042  -87.00912871  -51.40881716   36.09658124
   49.94463153 -120.0648182    67.71337073   75.72853688   37.83425893]
predicted value is: [167.8357878  181.54604887 -82.80785786 -47.89197854  56.60034137
  36.02133262 -90.74284545  68.41445579  51.40665247  33.35404218]


MY REGRESSOR prediction for [[1.23, -0.000232]] : [156.22973675]
SKLEARN REGRESSOR prediction for [[1.23, -0.000232]] : [156.22973675]
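
In the cell above, the sklearn model is fitted on the full data set (`reg.fit(X,y)`) while the custom model is fitted on the training split, and the final line calls `myreg.predict` for both printed values. A minimal sketch of a like-for-like check, assuming both models are fitted on the same training split, could look like this. `knn_regress_predict` is a hypothetical NumPy stand-in for the class above (it uses vectorized operations, which the lab's no-built-ins rule might not allow), and the `random_state=0` split seed is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def knn_regress_predict(X_train, y_train, X_test, k):
    """Hypothetical stand-in: mean target of the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for point in np.asarray(X_test, dtype=float):
        # Euclidean distance from this test point to every training point
        dists = ((X_train - point) ** 2).sum(axis=1) ** 0.5
        nearest = np.argsort(dists)[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)


X, y = make_regression(n_samples=50, n_features=2, random_state=16)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

k = 3
mine = knn_regress_predict(X_tr, y_tr, X_te, k)
theirs = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr).predict(X_te)

# Both models see the same training data, so the predictions should agree
# up to floating-point rounding (barring exact distance ties).
print(np.allclose(mine, theirs))
```

Comparing with `np.allclose` rather than `==` avoids false mismatches caused by different summation orders.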


## Classification models comparison
**Creating classification training and test data sets**

In [7]:
from sklearn.datasets import make_classification

#data set for classification method
X,y = make_classification(n_samples = 50, n_features = 3,n_informative=3,n_classes=3,n_clusters_per_class=2, n_redundant = 0, random_state =8)
#X,y = make_classification(n_samples = 50, n_features = 2, n_redundant = 0, random_state =8)
#splitting into training (80%) and test (20%) sets
X1,X2,y1,y2= train_test_split(X,y,test_size=0.2)

**Showing Training-Test features and target arrays**

In [8]:
X1, X2, y1, y2      # just to know our datasets

(array([[-0.1224994 ,  3.64890009, -0.42073859],
        [-0.92880925,  0.57310721, -1.07704731],
        [ 1.13314418,  1.35520448, -0.20495827],
        [-0.29891041,  2.47950676,  1.11545618],
        [-1.37892072, -0.67112414,  1.14997261],
        [-0.59841123, -1.64074771, -1.93770078],
        [-0.62338725,  0.60044107, -0.97519507],
        [ 0.89397398,  0.43926437,  0.45317951],
        [ 0.98946994, -1.13541166,  0.3158525 ],
        [ 0.57951866,  1.26611661, -0.62477474],
        [-0.40298131,  2.72152926,  1.19458268],
        [-1.25686285, -0.18729888, -1.97940219],
        [-1.3612487 , -2.35550844,  0.58355255],
        [ 0.17717601,  2.46916311,  1.87284244],
        [-1.02510856,  0.78033122, -1.14206415],
        [-1.55292509, -1.27450565, -0.66978277],
        [ 0.34786703,  1.58164867, -0.06558902],
        [-0.59802075,  2.52559994,  0.1080309 ],
        [ 0.49977479, -1.85814098,  1.57988422],
        [-0.5032997 , -0.32659633, -1.64511494],
        [-1.15156371

In [9]:
# calculates accuracy by comparing predicted targets (yp) with true targets (yt)
def calc_accuracy(yp,yt):
    correct=0
    for p, t in zip(yp,yt):
        if p==t:
            correct+=1
    accuracy=correct/len(yt)
    print(f"Accuracy : {accuracy}")
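
`calc_accuracy` prints its result rather than returning it. If the accuracy needs to be stored or compared in code, a returning variant could look like the sketch below. `accuracy` is a hypothetical helper, not part of the notebook, and the labels are made up.

```python
def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    correct = sum(1 for p, t in zip(y_pred, y_true) if p == t)
    return correct / len(y_true)


# Made-up labels: three of the four predictions match
score = accuracy([1, 0, 2, 2], [1, 0, 1, 2])
print(score)  # 0.75
```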

**Training models and printing their results to compare**

In [10]:
k=3
# Instantiating KNeighbours object as classifier
myclf=KNeighbours(k,"classifier")
#training with classification data set
myclf.fit(X1,y1)
#predicting class for test dataset
print(f"MY-CLASSIFIER ------------\n test dataset:")
# print(X2)  # test dataset elements
print(f"\ntarget values :  {y2}")
myPredict=myclf.predict(X2)
print(f"predicted values {myPredict}")
# let's calculate model's accuracy
calc_accuracy(myPredict,y2)
#---------------------------------------------------------------------------------------#
#importing class
from sklearn.neighbors import KNeighborsClassifier
# instantiating classifier
clf = KNeighborsClassifier(n_neighbors=k) #<-- the parameter is the number of nearest neighbours, k

# training model with classification data set
clf.fit(X1, y1)
# printing sklearn prediction
print(f"\n\nSKLEARN classifier -------\n test dataset")
# print(X2)    # test dataset elements
print(f"\ntarget values :  {y2}")
sklPredict=clf.predict(X2)
print(f"predicted values {sklPredict}")
calc_accuracy(sklPredict,y2)

MY-CLASSIFIER ------------
 test dataset:

target values :  [2 2 1 2 0 0 2 1 1 0]
predicted values [1 1 1 1 0 0 0 1 1 1]
Accuracy : 0.5


SKLEARN classifier -------
 test dataset

target values :  [2 2 1 2 0 0 2 1 1 0]
predicted values [1 1 1 2 0 1 0 2 1 1]
Accuracy : 0.4
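
The two classifiers disagree on several test points. One likely contributor is that `calculate_euclidean_distance` uses only the first two features of each point, while this data set has three and `KNeighborsClassifier` measures distance over all of them. The sketch below illustrates, on made-up points rather than the notebook's data, how dropping a feature can change which training point is nearest. `distance` is a hypothetical helper.

```python
def distance(a, b, n_features=None):
    """Euclidean distance over the first n_features coordinates (all if None)."""
    if n_features is None:
        n_features = len(a)
    return sum((a[i] - b[i]) ** 2 for i in range(n_features)) ** 0.5


test_point = [0.0, 0.0, 0.0]
train = [
    [0.1, 0.1, 3.0],   # close in the first two features, far in the third
    [1.0, 1.0, 0.0],   # farther in the first two, identical in the third
]

# Index of the nearest training point using two features vs. all three
nearest_2d = min(range(len(train)), key=lambda i: distance(test_point, train[i], 2))
nearest_3d = min(range(len(train)), key=lambda i: distance(test_point, train[i]))

print(nearest_2d)  # 0
print(nearest_3d)  # 1
```

When the nearest neighbours change, the majority vote can change too, so the two models should only be expected to agree once both use every feature.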
