<h1>Part 1</h1>
<ul>
<li>fcond001@gold.ac.uk</li>
<li>Filip Condac</li>
<li>Student Number: 33643814</li>
</ul>

I worked alone

In [1]:
import csv
from collections import Counter
from math import pow
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

In [2]:

#Load data from csv file
def load_data(file_path):
    data = []
    labels = []
    with open(file_path) as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row
        for row in reader:
            #Convert the string data to float
            data.append([float(x) for x in row[0:-1]])
            #Add the label to the labels list
            labels.append(row[-1])
    return data, labels

In [3]:
#Calculate the distance between two points using Minkowski distance formula
def minkowski_distance(x, y, p):
    distance = 0
    for i in range(len(x)):
        distance += pow(abs(x[i] - y[i]), p)
    return pow(distance, 1/p)
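
As a quick sanity check of the distance formula, the sketch below evaluates it on a hand-checkable pair of points. The function is restated compactly so the snippet runs on its own, and the two points are hypothetical values chosen for easy arithmetic, not rows from the sonar data.

```python
def minkowski_distance(x, y, p):
    # Sum of |x_i - y_i|^p over all coordinates, then the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Hypothetical points forming a 3-4-5 right triangle.
a = [0.0, 0.0]
b = [3.0, 4.0]

manhattan = minkowski_distance(a, b, 1)  # |3| + |4|
euclidean = minkowski_distance(a, b, 2)  # sqrt(3^2 + 4^2)
print(f'p=1: {manhattan:.1f}, p=2: {euclidean:.1f}')
```

With p=1 the coordinate differences are simply added, while p=2 gives the straight-line distance, which is why the two settings can rank neighbours differently.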

In [4]:
#Find the nearest neighbor
def nearest_neighbor(train_data, train_labels, test_datum, p):
    distances = []
    for i in range(len(train_data)):
        #Calculate the distance between the test datum and the current train datum
        distance = minkowski_distance(train_data[i], test_datum, p)
        #Add the distance and the label of the train datum to the list of candidates
        distances.append((distance, train_labels[i]))
    return min(distances)[1]

In [5]:

#Classify the test data
def classify(train_data, train_labels, test_data, p):
    predicted_labels = []
    for test_datum in test_data:
        #Find the nearest neighbor
        label = nearest_neighbor(train_data, train_labels, test_datum, p)
        #Assign the label to the test datum
        predicted_labels.append(label)
    return predicted_labels

In [6]:

#Calculate the accuracy, precision, recall and f1 measure
def calculate_metrics(true_labels, predicted_labels):
    tp, fp, tn, fn = 0, 0, 0, 0
    for i in range(len(true_labels)):
        if true_labels[i] == 'M' and predicted_labels[i] == 'M':
            #True positive
            tp += 1
        elif true_labels[i] == 'R' and predicted_labels[i] == 'M':
            #False positive
            fp += 1
        elif true_labels[i] == 'R' and predicted_labels[i] == 'R':
            #True negative
            tn += 1
        else:
            #False negative (true label 'M', predicted 'R')
            fn += 1
    #Calculate the accuracy, precision, recall and f1 measure            
    accuracy = (tp + tn) / len(true_labels)
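    #Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1_measure = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0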
    return accuracy, precision, recall, f1_measure

In [7]:
#Main function to run the program and print the results
if __name__ == '__main__':
    train_data, train_labels = load_data('sonar_train.csv')
    test_data, test_labels = load_data('sonar_test.csv')

    # Using my version
    for p in [1, 2]:
        #Classify the test data
        predicted_labels = classify(train_data, train_labels, test_data, p)
        #Calculate the accuracy, precision, recall and f1 measure
        accuracy, precision, recall, f1_measure = calculate_metrics(test_labels, predicted_labels)
        if p == 1:
            print('Using Manhattan')
        else:
            print('Using Euclidean')

        print(f'Using Minkowski distance with p={p}')
        print(f'Accuracy: {accuracy:.3f}')
        print(f'Precision: {precision:.3f}')
        print(f'Recall: {recall:.3f}')
        print(f'F1 Measure: {f1_measure:.3f}\n')

    # Using scikit-learn's version (default metric: Minkowski with p=2), varying the number of neighbours k
    for n_neighbors in [1, 2]:
        #Create the KNN classifier
        knn = KNeighborsClassifier(n_neighbors=n_neighbors)
        #Train the classifier
        knn.fit(train_data, train_labels)
        #Predict the labels of the test data
        predicted_labels = knn.predict(test_data)
        #Calculate the accuracy, precision, recall and f1 measure
        accuracy, precision, recall, f1_measure = calculate_metrics(test_labels, predicted_labels)
        print(f'Using scikit-learn\'s KNeighborsClassifier with n_neighbors={n_neighbors}')
        print(f'Accuracy: {accuracy:.3f}')
        print(f'Precision: {precision:.3f}')
        print(f'Recall: {recall:.3f}\n')
       


Using Manhattan
Using Minkowski distance with p=1
Accuracy: 0.884
Precision: 0.854
Recall: 0.946
F1 Measure: 0.897

Using Euclidean
Using Minkowski distance with p=2
Accuracy: 0.899
Precision: 0.857
Recall: 0.973
F1 Measure: 0.911

Using scikit-learn's KNeighborsClassifier with n_neighbors=1
Accuracy: 0.899
Precision: 0.857
Recall: 0.973

Using scikit-learn's KNeighborsClassifier with n_neighbors=2
Accuracy: 0.725
Precision: 0.667
Recall: 0.973



**Evaluation of results**

On the provided test set, the accuracy of the Nearest Neighbour algorithm with Minkowski distance ranges from 0.884 (p=1, Manhattan) to 0.899 (p=2, Euclidean), so the Euclidean distance performs slightly better here. Precision ranges from 0.854 to 0.857, meaning that most of the samples predicted as positive (metal cylinders, label M) really are positive. Recall is higher still, from 0.946 to 0.973, meaning that the algorithm finds nearly all of the actual positive cases in the test set. The F1 measure, which balances precision and recall, ranges from 0.897 to 0.911.
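
To make the relationship between the four metrics concrete, here is a minimal sketch that computes them directly from confusion-matrix counts. The helper name `metrics_from_counts` and the counts passed to it are hypothetical, chosen only for illustration; they are not the counts behind the results above.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything predicted positive, how much was correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts, for illustration only.
accuracy, precision, recall, f1 = metrics_from_counts(tp=8, fp=2, tn=7, fn=3)
print(f'Accuracy: {accuracy:.3f}')
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1 Measure: {f1:.3f}')
```

Because precision and recall share the same numerator, a change in false positives moves precision only, while a change in false negatives moves recall only.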

With scikit-learn's KNeighborsClassifier, k=1 reproduces the p=2 results of the Nearest Neighbour implementation above exactly (accuracy 0.899, precision 0.857, recall 0.973), which is expected because the classifier's default metric is Minkowski distance with p=2. With k=2, accuracy falls to 0.725 and precision to 0.667 while recall stays at 0.973, so the extra neighbour mainly adds false positives. This suggests that the simpler 1-Nearest Neighbour rule is a reasonable choice for this dataset.
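
One possible explanation for the drop at k=2 is tie-breaking: with two neighbours and two classes, the vote can split 1-1, and the prediction then depends on how the tie is resolved. The sketch below is a plain-Python illustration of that effect, not scikit-learn's actual implementation. The helper `knn_predict`, the tie-break rule (the alphabetically smallest tied label wins) and the toy data are all assumptions made for this example.

```python
from collections import Counter

def minkowski_distance(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def knn_predict(train_data, train_labels, query, k, p=2):
    # Rank training points by distance to the query and keep the k closest.
    ranked = sorted(
        zip(train_data, train_labels),
        key=lambda pair: minkowski_distance(pair[0], query, p),
    )
    votes = Counter(label for _, label in ranked[:k])
    top = max(votes.values())
    # Assumed tie-break rule: the alphabetically smallest tied label wins.
    return min(label for label, count in votes.items() if count == top)

# Hypothetical toy data, not the sonar set.
train_data = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
train_labels = ['R', 'M', 'M']
query = [0.2, 0.0]

pred_k1 = knn_predict(train_data, train_labels, query, k=1)
pred_k2 = knn_predict(train_data, train_labels, query, k=2)
print(f'k=1 -> {pred_k1}, k=2 -> {pred_k2}')
```

Under this assumed rule, the query's closest neighbour is an 'R', but the 1-1 tie at k=2 is resolved as 'M'. A tie-break that consistently favours 'M' would add false positives without losing true positives, which matches the pattern of lower precision and unchanged recall. Using an odd k avoids such ties in a two-class problem.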