# <div align="center">CP322-A Mini-Project 1: Machine Learning</div>
## <div align="center">Group 6</div>
### <div align="center">due on 15-Oct-2023 at 11:30 PM</div>

Imports:

In [143]:
import numpy as np
import heapq
from math import sqrt
from collections import Counter

## Task 1: Acquire, preprocess, and analyze the data

1. Load the datasets into NumPy objects (i.e., arrays or matrices) in Python. Remember to convert the wine dataset
to a binary task, as discussed above.
2. Clean the data. Are there any missing or malformed features? Are there other data oddities that need to be
dealt with? You should remove any examples with missing or malformed features and note this in your
report. For categorical variables, you can use a one-hot encoding.
3. Compute basic statistics on the data to understand it better. E.g., what are the distributions of the positive vs.
negative classes, what are the distributions of some of the numerical features? what are the correlations between
the features? how do the scatter plots of pair-wise features look like for some subset of features?

In [4]:
def readFile(filename):
    
    data = []
    labels = []
    with open(filename, "r") as file:
        for line in file:
            line = line.strip()
            if line:  # Skip empty lines
                row = line.split(",")
                if filename=="data/iris.data":
                    data.append([float(val) for val in row[:-1]])
                    labels.append(row[-1])
                else:
                    data.append(row)

    data = np.array(data)
    if filename == "data/iris.data":
        return data,labels
    else:
        return data

### Dataset 1 (Ionosphere): 

In [5]:
filename = "data/ionosphere.data"

ionosphere_data = readFile(filename)
X = ionosphere_data[:, :-1]  # All columns except the last one
Y = (ionosphere_data[:, -1] == 'g').astype(bool)   # 1 if the class is 'g', 0 otherwise

positive_count = np.sum(Y == 1)
negative_count = np.sum(Y == 0)

#what are the distributions of the positive vs. negative classes?
print("Distribution of classes:")
print("Positive (g):", positive_count)
print("Negative (b):", negative_count)

print("\nData:")
print(ionosphere_data)

Distribution of classes:
Positive (g): 225
Negative (b): 126

Data:
[['1' '0' '0.99539' ... '0.18641' '-0.45300' 'g']
 ['1' '0' '1' ... '-0.13738' '-0.02447' 'b']
 ['1' '0' '1' ... '0.56045' '-0.38238' 'g']
 ...
 ['1' '0' '0.94701' ... '0.92697' '-0.00577' 'g']
 ['1' '0' '0.90608' ... '0.87403' '-0.16243' 'g']
 ['1' '0' '0.84710' ... '0.85764' '-0.06151' 'g']]


### Dataset 2 (Adult Data Set):

In [6]:
filename = "data/adult.data"
adult_data = readFile(filename)

X = adult_data[:, :-1]  # All columns except the last one
Y = (adult_data[:, -1] == '>50K').astype(int)   # 1 if the class is '>50', 0 otherwise

#what are the distributions of the positive vs. negative classes?

positive_count = np.sum(Y == 1)
negative_count = np.sum(Y == 0)

print("Distribution of classes:")
print("Positive (>50):", positive_count)
print("Negative (<=50):", negative_count)

print("\nData:")
print(ionosphere_data)

#what are the distributions of some of the numerical features?

Distribution of classes:
Positive (>50): 0
Negative (<=50): 32561

Data:
[['1' '0' '0.99539' ... '0.18641' '-0.45300' 'g']
 ['1' '0' '1' ... '-0.13738' '-0.02447' 'b']
 ['1' '0' '1' ... '0.56045' '-0.38238' 'g']
 ...
 ['1' '0' '0.94701' ... '0.92697' '-0.00577' 'g']
 ['1' '0' '0.90608' ... '0.87403' '-0.16243' 'g']
 ['1' '0' '0.84710' ... '0.85764' '-0.06151' 'g']]


### Dataset 3 (Iris Data Set):

In [7]:
filename = "data/iris.data"
iris_data, iris_labels = readFile(filename)

# 3 Classifications (is there a good clasification? Doesn't seem like there is a pass or fail here, just which one of the 3 is it):
# Iris-setosa
# Iris-versicolor
# Iris-virginica
unique_classifications = np.unique(iris_labels)
print(unique_classifications)
# X = np.array(iris_data)
# Y = np.array(labels)

setosa_count = 0
versicolor_count = 0
virginica_count = 0

for classification in iris_labels:
    setosa_count += 1 if (classification == "Iris-setosa") else 0
    versicolor_count += 1 if (classification == "Iris-versicolor") else 0
    virginica_count += 1 if (classification == "Iris-virginica") else 0

print("Classification Analysis:\n------------------------")
print("Total:", len(iris_labels))
print("Iris-setosa:", setosa_count)
print("Iris-versicolor:", versicolor_count)
print("Iris-virginica:", virginica_count)


# print(iris_data)
# print(iris_labels)

['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Classification Analysis:
------------------------
Total: 150
Iris-setosa: 50
Iris-versicolor: 50
Iris-virginica: 50


### Dataset 4 (Car Evaluation):

In [9]:
filename = "data/car.data"
cars_data = readFile(filename)
#classifications: 
#unacc
#acc
#good
#vgood

unique_values = np.unique(cars_data[:, 4])
print(unique_values)

# car acceptability
# . PRICE                  overall price (x3) 
# . . buying               buying price (x2) index 0
# . . maint                price of the maintenance (x2) index 1
# . TECH                   technical characteristics (x3) 
# . . COMFORT              comfort (x2) 
# . . . doors              number of doors (x1) index 2 
# . . . persons            capacity in terms of persons to carry (x1) index 3
# . . . lug_boot           the size of luggage boot (x1) index 4
# . . safety               estimated safety of the car (x2) index 5
# and classification is index 7
 

# k-Nearest Neighbors (k-NN), work with numerical data. If you have categorical data 
# represented as strings, you will need to convert them to numeric form. Common 
# approaches include one-hot encoding for nominal categorical variables and label 
# encoding for ordinal categorical variables.

unacc = len(cars_data[cars_data[:, -1] == "unacc"])
acc = cars_data[cars_data[:, -1] == "acc"]



outputs1 = ['low' ,'med', 'high', 'vhigh']
outputs2 = ['small' ,'med', 'big']
for i in range(len(outputs1)):   
    cars_data[cars_data == outputs1[i]] = i
    
for i in range(len(outputs2)):   
    cars_data[cars_data == outputs2[i]] = i
    
cars_data[cars_data == "5more"] = 5
cars_data[cars_data == "more"] = 5


total = len(cars_data)
unacc = len(cars_data[cars_data[:, -1] == "unacc"])
acc = len(cars_data[cars_data[:, -1] == "acc"])
good = len(cars_data[cars_data[:, -1] == "good"])
vgood = len(cars_data[cars_data[:, -1] == "vgood"])


print("Classification Analysis:")
print("------------------------")
print(f"Total: {total}")
print(f"Unacceptable (-): {unacc}")
print(f"Passing (+): {total - unacc}\n")
print(f"Acceptable: {acc}")
print(f"Good: {good}")
print(f"Very Good: {vgood}")

### Weighting the data appropriately
weighted_cars_data = cars_data[:, :6].astype(int)
weighted_cars_labels = cars_data[:, -1]
print(weighted_cars_data)
print(weighted_cars_labels)
for weighted_car in weighted_cars_data:
#     print(weighted_car)
    weighted_car[0] *= 2 #multiply buying price by 2
    weighted_car[1] *= 2 #multiply maint price by 2
    weighted_car[5] *= 2 #multiply safety by 2
#     print(f"Adjusted buying price: {weighted_car[0]}\nAdjusted maint price: {weighted_car[1]}\nAdjusted safety: {weighted_car[5]}")

print(weighted_cars_data)


['big' 'med' 'small']
Classification Analysis:
------------------------
Total: 1728
Unacceptable (-): 1210
Passing (+): 518

Acceptable: 384
Good: 69
Very Good: 65
[[3 3 2 2 0 0]
 [3 3 2 2 0 1]
 [3 3 2 2 0 2]
 ...
 [0 0 5 5 2 0]
 [0 0 5 5 2 1]
 [0 0 5 5 2 2]]
['unacc' 'unacc' 'unacc' ... 'unacc' 'good' 'vgood']
[[6 6 2 2 0 0]
 [6 6 2 2 0 2]
 [6 6 2 2 0 4]
 ...
 [0 0 5 5 2 0]
 [0 0 5 5 2 2]
 [0 0 5 5 2 4]]


## Task 2: Implement the models

#### 1. Implement logistic regression, and use (full batch) gradient descent for optimization.
#### 2. Implement k-Nearest Neighbor (KNN), and find the best K.

Implement both models as Python classes. You should use the constructor for the class to initialize the model
parameters as attributes, as well as to define other important properties of the model.
• Each of your models’ classes should have (at least) two functions:
– Define a fit function, which takes the training data (i.e., x and y)—as well as other hyperparameters (e.g.,
the learning rate and/or number of gradient descent iterations)—as input. This function should train your
model by modifying the model parameters.
– Define a predict function, which takes a set of input points (i.e., x) as input and outputs predictions (i.e.,
yˆ) for these points. Note that for linear regression you need to convert probabilities to binary 0-1
predictions by thresholding the output at 0.5!
In addition to the model classes, you should also define functions evaluate_acc to evaluate the model accuracy.
This function should take the true labels (i.e., y), and target labels (i.e., yˆ) as input, and it should output the accuracy
score.
• Lastly, you should implement a script to run k-fold cross-validation

### Logistic Regression:

In [7]:
class logistic_regression:
    def __init__(self):
        self.threshold = 0.5
        self.convergence = 0.01
        return None
    
    def fit(self, x,y,alpha,iterations):
        m = len(x)
        w = 0
        b = 0

        while self.convergence < self.J(w,b,m,x,y)- self.mse:
            tmp_w = w - alpha*self.J(w,b,m,x,y)
            tmp_b = b - alpha*self.J(w,b,m,x,y)

            w = tmp_w
            b = tmp_b
        
        return None

    def predict(self, x):
        return None

    def evaluate_acc(self, y, y_ex):
        return None

    def mse(self, m, w, b, x, y):
        val = 0

        for i in range(m):
            val += (b + (w*x[i]) - y[i])**2

        return val

    def J(self, w, b, m, x, y):
        return (1/2*m)*self.mse(m,w,b,x,y)

### K-Nearest Neighbor (KNN):
Riley and Torin

In [175]:
# 1) a new data point is input that we need to classify
# 2) check the classification of the k nearest elements
# 3) assunming we have 2 unique classifications (a,b). we take the classification of the dominant group
# 4) if a tie exists take the class with the shortest distance from 

#to calculate distance we can use the Euclidean distance formula sqrt(sum i to N (x1_i — x2_i)²)


class kNN:
    def __init__(self, k, dist_metric="euclidean"):     
        '''
        ===================================================================================
        DESCRIPTION: 
        ===================================================================================
        initialize kNN model
        ===================================================================================
        PARAMETERS:
        ===================================================================================
        * self (kNN): 
        ----------------------------------------
        kNN model to define k values, training data, and distance metric
        ----------------------------------------
        * k (int):
        ----------------------------------------
        integer representing number of neighbours to compare to
        ----------------------------------------
        * dist_metric (string):
        ----------------------------------------
        string representing distance metric formula to follow
        ===================================================================================
        '''
        self.k = k #num of neighbours
        self.dist_metric = dist_metric #equation to calculate distance with
        self.train_data = None #initialize using fit method
        self.train_labels = None
        
    def fit(self, data, labels):
        '''
        ===================================================================================
        DESCRIPTION: 
        ===================================================================================
        set train_data and train_labelsby loading in Train data used to compare new data
        ===================================================================================
        PARAMETERS:
        ===================================================================================
        * data[] (NumPy Array): 
        ----------------------------------------
        list of data with labels seperated
        ----------------------------------------
        * labels (NumPy Array):
        ----------------------------------------
        list of labels with data removed
        ===================================================================================
        '''
        self.train_data = data
        self.train_labels = labels

    def predict(self, new_data):
        '''
        ===================================================================================
        DESCRIPTION: 
        ===================================================================================
        given new data, compare its items to the k closest elements of training data based 
        on a set distance metric and predict the datas classification.
        ===================================================================================
        PARAMETERS:
        ===================================================================================
        * self (kNN): 
        ----------------------------------------
        kNN model with predefined k values, training data, and distance metric
        ----------------------------------------
        * new_data (NumPy Array):
        ----------------------------------------
        Array of new data to predict classifications for
        ===================================================================================
        RETURNS:
        ===================================================================================
        * predictions (List):
        ----------------------------------------
        list of labels for each item in new_data
        ===================================================================================
        '''
        predictions = []#return array of predicted classifications, for each row in new_data
        for new_row in new_data:
            # calculate distances between new data and training data                   
            k_neighbours = self.__neighbours(new_row) #determine the k nearest neighbours using preffered distance metric
            classifications = []#for the given neighbors check their label
            for result in k_neighbours:
                print(f"new = {new_row}: train = {result[2]}")
                i = result[1]#results formatted [row, index of row], so take the index to find the associated label
                classifications.append(self.train_labels[i])#add label at index i

            #take the most frequent item 
            predictions.append(str(max(classifications, key=classifications.count))) #from collections import Counter
        
        return predictions
    
    def __calc_distance(self,newRow, trainRow):
        '''
        ===================================================================================
        DESCRIPTION: 
        ===================================================================================
        Private Function used in self.__neighbours(). Given a row from new data, calculate 
        the distance based on a set metric from a row in Train data
        ===================================================================================
        PARAMETERS:
        ===================================================================================
        * self (kNN): 
        ----------------------------------------
        kNN model with predefined k values, training data, and distance metric
        ----------------------------------------
        * newRow[] (List of data points (float/int)):
        ----------------------------------------
        data row to compare distance from train row data 
        ----------------------------------------
        * trainRow[] (List of data points (float/int)):
        ----------------------------------------
        data row to compare distance with test data
        ===================================================================================
        RETURNS:
        ===================================================================================
        * distance (float):
        ----------------------------------------
        float distance between to rows of data
        ===================================================================================
        '''
        distance = 0
        if self.dist_metric == "euclidean":
            for i in range(len(newRow)):
                squared = pow(newRow[i] - trainRow[i],2)
                distance += squared
            distance = sqrt(distance)
        return(distance)

    def __neighbours(self, new_row):
        '''
        ===================================================================================
        DESCRIPTION: 
        ===================================================================================
        private function used in self.predict(). Given a row from new data, return k number
        of neigbours based on distance
        ===================================================================================
        PARAMETERS:
        ===================================================================================
        * self (kNN): 
        ----------------------------------------
        kNN model with predefined k values, training data, and distance metric
        ----------------------------------------
        * newRow[] (List of data points (float/int)):
        ----------------------------------------
        data row to compare distance from train row data 
        ----------------------------------------
        * trainRow[] (List of data points (float/int)):
        ----------------------------------------
        data row to compare distance with test data
        ===================================================================================
        RETURNS:
        ===================================================================================
        * k_neighbours (List):
        ----------------------------------------
        list of k closest neighbours based on distance metric
        ===================================================================================
        '''
        distances = []#heap array
        #for every row of data
        for index in range(len(self.train_data)):#use index to return that value later
            train_row = self.train_data[index]#current row of train data
            dist = self.__calc_distance(new_row, train_row)#calculate distance between new row and train data row
            heapq.heappush(distances, [-dist, index, list(train_row)])#make negative value temporarily to assure we have smallest values 
            if len(distances) > self.k:#past k values remove largest from heap
                heapq.heappop(distances)
        
        k_neighbours = [[-dist, index, train_row] for dist, index, train_row in sorted(distances)]#make positive values, only 5 smallest remain

        return k_neighbours
    
    def testTrainSplit(self, data, testSplit=0.7):
        '''
        ===================================================================================
        DESCRIPTION: 
        ===================================================================================
        Used externally to split data into test and train data. probably remove later
        ===================================================================================
        PARAMETERS:
        ===================================================================================
        * self (kNN): [not used]
        ----------------------------------------
        kNN model with predefined k values, training data, and distance metric
        ----------------------------------------
        * data (NumPy Array):
        ----------------------------------------
        data imported for assignment
        ----------------------------------------
        * testSplit (float):
        ----------------------------------------
        ratio of data to be used for testing, default 70/30 split
        ===================================================================================
        RETURNS:
        ===================================================================================
        * dataSplit (tuple(List1,List2)):
        ----------------------------------------
        a tuple of 2 lists where list 1 contains training data and training labels,
        similarily list 2 contains test data and tes labels
        ===================================================================================
        '''
        split = int(len(data) * testSplit )
        #Split train data (70% standard)
        train_data = data[:split, :-1].astype(int)#data only 
        train_labels = data[:split, -1]#classifications only
        
        #Split test data (30% standard)
        test_data = data[split:, :-1].astype(int)#data only
        test_labels = data[split:, -1]#classifications only
        
        dataSplit = ([train_data, train_labels], [test_data, test_labels])
        
        return(dataSplit)



In [174]:
model = kNN(3)
train,test = model.testTrainSplit(cars_data)
model.fit(train[0],train[1])
predictions = model.predict(test[0])

print(predictions)

check = predictions.count('vgood')
print(check)

# unacc = len(test[1][test[1] == "unacc"])
# print(unacc)
# # acc = len(test[1][test[1] == "acc"])
# # print(acc)
# good = len(test[1][test[1] == "vgood"])
# print(good)
# vgood = len(test[1][test[1] == "vgood"])
# print(vgood)
# model.predict(test[0])

new = [1 0 2 5 1 0]: train = [1, 1, 2, 5, 1, 0]
new = [1 0 2 5 1 0]: train = [1, 0, 2, 4, 1, 0]
new = [1 0 2 5 1 0]: train = [1, 0, 2, 5, 0, 0]
new = [1 0 2 5 1 1]: train = [1, 1, 2, 5, 1, 1]
new = [1 0 2 5 1 1]: train = [1, 0, 2, 4, 1, 1]
new = [1 0 2 5 1 1]: train = [1, 0, 2, 5, 0, 1]
new = [1 0 2 5 1 2]: train = [1, 1, 2, 5, 1, 2]
new = [1 0 2 5 1 2]: train = [1, 0, 2, 4, 1, 2]
new = [1 0 2 5 1 2]: train = [1, 0, 2, 5, 0, 2]
new = [1 0 2 5 2 0]: train = [2, 0, 2, 5, 2, 0]
new = [1 0 2 5 2 0]: train = [1, 1, 2, 5, 2, 0]
new = [1 0 2 5 2 0]: train = [1, 0, 2, 4, 2, 0]
new = [1 0 2 5 2 1]: train = [2, 0, 2, 5, 2, 1]
new = [1 0 2 5 2 1]: train = [1, 1, 2, 5, 2, 1]
new = [1 0 2 5 2 1]: train = [1, 0, 2, 4, 2, 1]
new = [1 0 2 5 2 2]: train = [2, 0, 2, 5, 2, 2]
new = [1 0 2 5 2 2]: train = [1, 1, 2, 5, 2, 2]
new = [1 0 2 5 2 2]: train = [1, 0, 2, 4, 2, 2]
new = [1 0 3 2 0 0]: train = [2, 0, 3, 2, 0, 0]
new = [1 0 3 2 0 0]: train = [1, 1, 3, 2, 0, 0]
new = [1 0 3 2 0 0]: train = [1, 0, 2, 2

new = [1 0 5 5 1 2]: train = [1, 1, 5, 5, 2, 2]
new = [1 0 5 5 1 2]: train = [2, 0, 5, 5, 1, 2]
new = [1 0 5 5 1 2]: train = [1, 1, 5, 5, 1, 2]
new = [1 0 5 5 2 0]: train = [1, 1, 5, 5, 2, 1]
new = [1 0 5 5 2 0]: train = [2, 0, 5, 5, 2, 0]
new = [1 0 5 5 2 0]: train = [1, 1, 5, 5, 2, 0]
new = [1 0 5 5 2 1]: train = [1, 1, 5, 5, 2, 2]
new = [1 0 5 5 2 1]: train = [2, 0, 5, 5, 2, 1]
new = [1 0 5 5 2 1]: train = [1, 1, 5, 5, 2, 1]
new = [1 0 5 5 2 2]: train = [1, 1, 5, 5, 2, 1]
new = [1 0 5 5 2 2]: train = [2, 0, 5, 5, 2, 2]
new = [1 0 5 5 2 2]: train = [1, 1, 5, 5, 2, 2]
new = [0 3 2 2 0 0]: train = [1, 3, 3, 2, 0, 0]
new = [0 3 2 2 0 0]: train = [1, 2, 2, 2, 0, 0]
new = [0 3 2 2 0 0]: train = [1, 3, 2, 2, 0, 0]
new = [0 3 2 2 0 1]: train = [1, 3, 3, 2, 0, 1]
new = [0 3 2 2 0 1]: train = [1, 2, 2, 2, 0, 1]
new = [0 3 2 2 0 1]: train = [1, 3, 2, 2, 0, 1]
new = [0 3 2 2 0 2]: train = [1, 3, 3, 2, 0, 2]
new = [0 3 2 2 0 2]: train = [1, 2, 2, 2, 0, 2]
new = [0 3 2 2 0 2]: train = [1, 3, 2, 2

new = [0 3 4 2 0 1]: train = [1, 3, 5, 2, 0, 1]
new = [0 3 4 2 0 1]: train = [1, 2, 4, 2, 0, 1]
new = [0 3 4 2 0 1]: train = [1, 3, 4, 2, 0, 1]
new = [0 3 4 2 0 2]: train = [1, 3, 5, 2, 0, 2]
new = [0 3 4 2 0 2]: train = [1, 2, 4, 2, 0, 2]
new = [0 3 4 2 0 2]: train = [1, 3, 4, 2, 0, 2]
new = [0 3 4 2 1 0]: train = [1, 3, 5, 2, 1, 0]
new = [0 3 4 2 1 0]: train = [1, 2, 4, 2, 1, 0]
new = [0 3 4 2 1 0]: train = [1, 3, 4, 2, 1, 0]
new = [0 3 4 2 1 1]: train = [1, 3, 5, 2, 1, 1]
new = [0 3 4 2 1 1]: train = [1, 2, 4, 2, 1, 1]
new = [0 3 4 2 1 1]: train = [1, 3, 4, 2, 1, 1]
new = [0 3 4 2 1 2]: train = [1, 3, 5, 2, 1, 2]
new = [0 3 4 2 1 2]: train = [1, 2, 4, 2, 1, 2]
new = [0 3 4 2 1 2]: train = [1, 3, 4, 2, 1, 2]
new = [0 3 4 2 2 0]: train = [1, 3, 5, 2, 2, 0]
new = [0 3 4 2 2 0]: train = [1, 2, 4, 2, 2, 0]
new = [0 3 4 2 2 0]: train = [1, 3, 4, 2, 2, 0]
new = [0 3 4 2 2 1]: train = [1, 3, 5, 2, 2, 1]
new = [0 3 4 2 2 1]: train = [1, 2, 4, 2, 2, 1]
new = [0 3 4 2 2 1]: train = [1, 3, 4, 2

new = [0 2 2 4 0 2]: train = [1, 2, 3, 4, 0, 2]
new = [0 2 2 4 0 2]: train = [1, 1, 2, 4, 0, 2]
new = [0 2 2 4 0 2]: train = [1, 2, 2, 4, 0, 2]
new = [0 2 2 4 1 0]: train = [1, 2, 3, 4, 1, 0]
new = [0 2 2 4 1 0]: train = [1, 1, 2, 4, 1, 0]
new = [0 2 2 4 1 0]: train = [1, 2, 2, 4, 1, 0]
new = [0 2 2 4 1 1]: train = [1, 2, 3, 4, 1, 1]
new = [0 2 2 4 1 1]: train = [1, 1, 2, 4, 1, 1]
new = [0 2 2 4 1 1]: train = [1, 2, 2, 4, 1, 1]
new = [0 2 2 4 1 2]: train = [1, 2, 3, 4, 1, 2]
new = [0 2 2 4 1 2]: train = [1, 1, 2, 4, 1, 2]
new = [0 2 2 4 1 2]: train = [1, 2, 2, 4, 1, 2]
new = [0 2 2 4 2 0]: train = [1, 2, 3, 4, 2, 0]
new = [0 2 2 4 2 0]: train = [1, 1, 2, 4, 2, 0]
new = [0 2 2 4 2 0]: train = [1, 2, 2, 4, 2, 0]
new = [0 2 2 4 2 1]: train = [1, 2, 3, 4, 2, 1]
new = [0 2 2 4 2 1]: train = [1, 1, 2, 4, 2, 1]
new = [0 2 2 4 2 1]: train = [1, 2, 2, 4, 2, 1]
new = [0 2 2 4 2 2]: train = [1, 2, 3, 4, 2, 2]
new = [0 2 2 4 2 2]: train = [1, 1, 2, 4, 2, 2]
new = [0 2 2 4 2 2]: train = [1, 2, 2, 4

new = [0 2 4 5 1 1]: train = [1, 2, 5, 5, 1, 1]
new = [0 2 4 5 1 1]: train = [1, 1, 4, 5, 1, 1]
new = [0 2 4 5 1 1]: train = [1, 2, 4, 5, 1, 1]
new = [0 2 4 5 1 2]: train = [1, 2, 5, 5, 1, 2]
new = [0 2 4 5 1 2]: train = [1, 1, 4, 5, 1, 2]
new = [0 2 4 5 1 2]: train = [1, 2, 4, 5, 1, 2]
new = [0 2 4 5 2 0]: train = [1, 2, 5, 5, 2, 0]
new = [0 2 4 5 2 0]: train = [1, 1, 4, 5, 2, 0]
new = [0 2 4 5 2 0]: train = [1, 2, 4, 5, 2, 0]
new = [0 2 4 5 2 1]: train = [1, 2, 5, 5, 2, 1]
new = [0 2 4 5 2 1]: train = [1, 1, 4, 5, 2, 1]
new = [0 2 4 5 2 1]: train = [1, 2, 4, 5, 2, 1]
new = [0 2 4 5 2 2]: train = [1, 2, 5, 5, 2, 2]
new = [0 2 4 5 2 2]: train = [1, 1, 4, 5, 2, 2]
new = [0 2 4 5 2 2]: train = [1, 2, 4, 5, 2, 2]
new = [0 2 5 2 0 0]: train = [1, 2, 5, 2, 1, 0]
new = [0 2 5 2 0 0]: train = [1, 1, 5, 2, 0, 0]
new = [0 2 5 2 0 0]: train = [1, 2, 5, 2, 0, 0]
new = [0 2 5 2 0 1]: train = [1, 2, 5, 2, 1, 1]
new = [0 2 5 2 0 1]: train = [1, 1, 5, 2, 0, 1]
new = [0 2 5 2 0 1]: train = [1, 2, 5, 2

new = [0 1 3 5 0 1]: train = [1, 1, 3, 5, 1, 1]
new = [0 1 3 5 0 1]: train = [1, 1, 4, 5, 0, 1]
new = [0 1 3 5 0 1]: train = [1, 1, 3, 5, 0, 1]
new = [0 1 3 5 0 2]: train = [1, 1, 3, 5, 1, 2]
new = [0 1 3 5 0 2]: train = [1, 1, 4, 5, 0, 2]
new = [0 1 3 5 0 2]: train = [1, 1, 3, 5, 0, 2]
new = [0 1 3 5 1 0]: train = [1, 1, 3, 5, 2, 0]
new = [0 1 3 5 1 0]: train = [1, 1, 4, 5, 1, 0]
new = [0 1 3 5 1 0]: train = [1, 1, 3, 5, 1, 0]
new = [0 1 3 5 1 1]: train = [1, 1, 3, 5, 2, 1]
new = [0 1 3 5 1 1]: train = [1, 1, 4, 5, 1, 1]
new = [0 1 3 5 1 1]: train = [1, 1, 3, 5, 1, 1]
new = [0 1 3 5 1 2]: train = [1, 1, 3, 5, 2, 2]
new = [0 1 3 5 1 2]: train = [1, 1, 4, 5, 1, 2]
new = [0 1 3 5 1 2]: train = [1, 1, 3, 5, 1, 2]
new = [0 1 3 5 2 0]: train = [1, 1, 3, 5, 2, 1]
new = [0 1 3 5 2 0]: train = [1, 1, 4, 5, 2, 0]
new = [0 1 3 5 2 0]: train = [1, 1, 3, 5, 2, 0]
new = [0 1 3 5 2 1]: train = [1, 1, 3, 5, 2, 2]
new = [0 1 3 5 2 1]: train = [1, 1, 4, 5, 2, 1]
new = [0 1 3 5 2 1]: train = [1, 1, 3, 5

new = [0 0 2 2 1 1]: train = [1, 0, 2, 2, 1, 2]
new = [0 0 2 2 1 1]: train = [1, 0, 2, 2, 2, 1]
new = [0 0 2 2 1 1]: train = [1, 0, 2, 2, 1, 1]
new = [0 0 2 2 1 2]: train = [1, 0, 2, 2, 1, 1]
new = [0 0 2 2 1 2]: train = [1, 0, 2, 2, 2, 2]
new = [0 0 2 2 1 2]: train = [1, 0, 2, 2, 1, 2]
new = [0 0 2 2 2 0]: train = [1, 0, 2, 2, 1, 0]
new = [0 0 2 2 2 0]: train = [1, 0, 2, 2, 2, 1]
new = [0 0 2 2 2 0]: train = [1, 0, 2, 2, 2, 0]
new = [0 0 2 2 2 1]: train = [1, 0, 2, 2, 2, 0]
new = [0 0 2 2 2 1]: train = [1, 0, 2, 2, 2, 2]
new = [0 0 2 2 2 1]: train = [1, 0, 2, 2, 2, 1]
new = [0 0 2 2 2 2]: train = [1, 0, 2, 2, 1, 2]
new = [0 0 2 2 2 2]: train = [1, 0, 2, 2, 2, 1]
new = [0 0 2 2 2 2]: train = [1, 0, 2, 2, 2, 2]
new = [0 0 2 4 0 0]: train = [1, 0, 2, 4, 1, 0]
new = [0 0 2 4 0 0]: train = [1, 0, 2, 5, 0, 0]
new = [0 0 2 4 0 0]: train = [1, 0, 2, 4, 0, 0]
new = [0 0 2 4 0 1]: train = [1, 0, 2, 4, 1, 1]
new = [0 0 2 4 0 1]: train = [1, 0, 2, 5, 0, 1]
new = [0 0 2 4 0 1]: train = [1, 0, 2, 4

new = [0 0 5 2 1 2]: train = [1, 1, 5, 2, 1, 1]
new = [0 0 5 2 1 2]: train = [1, 1, 5, 2, 2, 2]
new = [0 0 5 2 1 2]: train = [1, 1, 5, 2, 1, 2]
new = [0 0 5 2 2 0]: train = [1, 1, 5, 2, 1, 0]
new = [0 0 5 2 2 0]: train = [1, 1, 5, 2, 2, 1]
new = [0 0 5 2 2 0]: train = [1, 1, 5, 2, 2, 0]
new = [0 0 5 2 2 1]: train = [1, 1, 5, 2, 2, 0]
new = [0 0 5 2 2 1]: train = [1, 1, 5, 2, 2, 2]
new = [0 0 5 2 2 1]: train = [1, 1, 5, 2, 2, 1]
new = [0 0 5 2 2 2]: train = [1, 1, 5, 2, 1, 2]
new = [0 0 5 2 2 2]: train = [1, 1, 5, 2, 2, 1]
new = [0 0 5 2 2 2]: train = [1, 1, 5, 2, 2, 2]
new = [0 0 5 4 0 0]: train = [1, 1, 5, 4, 1, 0]
new = [0 0 5 4 0 0]: train = [1, 1, 5, 5, 0, 0]
new = [0 0 5 4 0 0]: train = [1, 1, 5, 4, 0, 0]
new = [0 0 5 4 0 1]: train = [1, 1, 5, 4, 1, 1]
new = [0 0 5 4 0 1]: train = [1, 1, 5, 5, 0, 1]
new = [0 0 5 4 0 1]: train = [1, 1, 5, 4, 0, 1]
new = [0 0 5 4 0 2]: train = [1, 1, 5, 4, 1, 2]
new = [0 0 5 4 0 2]: train = [1, 1, 5, 5, 0, 2]
new = [0 0 5 4 0 2]: train = [1, 1, 5, 4

## Task 3: Run Experiments

The goal of this project is to have you explore linear classification and compare different features and models. Use
5-fold cross-validation to estimate performance in all of the experiments. Evaluate the performance using accuracy.
You are welcome to perform any experiments and analyses you see fit (e.g., to compare different features), but at a
minimum, you must complete the following experiments in the order stated below:

#### 1. Compare the accuracy of k-nearest neighbor and logistic regression on the four datasets.

#### 2. Test different k values for the k-nearest neighbor to find the best k-value by showing the accuracy plot. 

#### 3. Test different learning rates for gradient descent applied to logistic regression. Use a threshold for change in the value of the cost function as termination criteria and plot the accuracy on the train/validation set as a function of iterations of gradient descent.

#### 4. Compare the accuracy of the two models as a function of the size of the dataset (by controlling the training size)

Note: The above experiments are the minimum requirements that you must complete; however, this project is open-ended. For example, you might investigate different stopping criteria for gradient descent in logistic regression and develop an automated approach to select a good subset of features. You do not need to do all of these things, but you should demonstrate creativity, rigor, and an understanding of the course material in how you run your chosen experiments and how you report on them in your write-up.

In [9]:
## Testing KNN logic w/ Dataset 1
# Just arbitrairly choosing k = 5 to start, and 70/30 training/test data distribution

len_training_data = int( len(ionosphere_data) * 0.7 )
print("Length of training data set:", len_training_data)
ionosphere_train = ionosphere_data[:len_training_data] # 0 to length of training data
ionosphere_test = ionosphere_data[len_training_data+1:] # should grab everything after what we want as training data

iono_knn = knn(5, ionosphere_train, ionosphere_test)
print("5 Nearest Neighbours for Ionosphere Data:")
print(iono_knn)

Length of training data set: 245
vector1[i]: 1  and it's data type:  <class 'numpy.str_'>
vector2[i]: ['1' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '-1' '1' '1' '0.55172'
 '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '1' '1' '0' '0' '0' '0' '0' '0'
 'b']  and it's data type:  <class 'numpy.ndarray'>


UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('<U1'), dtype('<U8')) -> None