# Context

The paper "Distance and Similarity Measures Effect on the Performance of K-Nearest Neighbor Classifier" showed how effective the Hassanat distance can be compared to a broad range of other metrics for a 1-NN classifier.
Original paper: https://arxiv.org/pdf/1708.04321.pdf

## Observation
The formula for the Hassanat metric given in the paper contradicts the author's earlier definition (https://arxiv.org/pdf/1501.00687.pdf): it is missing a 1 in the denominator. We use the version with the 1, since it is the one used in the earlier papers and the one that has the desired properties.
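
For reference, the version with the 1 can be written as below. This is a transcription of the `HasD` function defined later in this notebook, not a quotation from either paper, so the notation is an assumption.

$$
D(x, y) = \sum_{i=1}^{n} D(x_i, y_i), \qquad
D(x_i, y_i) =
\begin{cases}
1 - \dfrac{1 + \min(x_i, y_i)}{1 + \max(x_i, y_i)} & \text{if } \min(x_i, y_i) \ge 0 \\[2ex]
1 - \dfrac{1 + \min(x_i, y_i) + |\min(x_i, y_i)|}{1 + \max(x_i, y_i) + |\min(x_i, y_i)|} & \text{otherwise}
\end{cases}
$$

In the second case the numerator simplifies to 1, which is what the code relies on.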

# Hypothesis

My hypothesis is that the significant difference in results is due to the data not being standardized. If so, we should not expect similar results once the data is standardized.
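
One possible mechanism behind this hypothesis can be illustrated with a toy example. Each per-dimension Hassanat term is bounded below 1, while a squared difference grows with the feature's scale. The sketch below uses made-up feature values, and `hassanat_term` is a hypothetical helper, so treat it as an illustration rather than a result from the paper.

```python
def hassanat_term(a, b):
    """Per-dimension Hassanat distance (version with the 1 in the denominator)."""
    lo, hi = min(a, b), max(a, b)
    # For a negative minimum, shift both values by |lo| so the ratio stays in (0, 1].
    shift = abs(lo) if lo < 0 else 0.0
    return 1 - (1 + lo + shift) / (1 + hi + shift)

# Made-up points: feature 0 is on a small scale, feature 1 on a large one.
p = [1.0, 1000.0]
q = [2.0, 3000.0]

has_terms = [hassanat_term(a, b) for a, b in zip(p, q)]
sq_terms = [(a - b) ** 2 for a, b in zip(p, q)]

# Share of the total contributed by the large-scale feature under each measure.
has_share = has_terms[1] / sum(has_terms)
l2_share = sq_terms[1] / sum(sq_terms)

print(f"Hassanat share of large-scale feature: {has_share:.3f}")
print(f"Squared-difference share of large-scale feature: {l2_share:.3f}")
```

On unstandardized data the squared differences are dominated almost entirely by the large-scale feature, whereas the Hassanat terms stay comparable. That could explain a gap on raw data that shrinks after standardization, although this toy example alone does not establish it.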

# Datasets
Sometimes I could not find the exact dataset used in the paper in the UCI repository. In other cases the number of attributes or features did not match what the original paper reports.

This might also mean that the original datasets were later changed by the owners.

The following datasets could not be found and were not included:

* Heart - It matches none of [1](https://archive.ics.uci.edu/ml/datasets/spect+heart), [2](https://archive.ics.uci.edu/ml/datasets/heart+Disease) or <a href="http://archive.ics.uci.edu/ml/datasets/statlog+(heart)">3</a>

* German - no longer matches properties from <a href="https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)">UCI</a>

* Glass - [Extra ID attribute probably](https://archive.ics.uci.edu/ml/datasets/glass+identification)

* EEG - [Arff format](https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State)


## Special mentions
The cancer, iris and wine datasets were loaded from sklearn.

# Imports

In [39]:

import math
import time
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
import sklearn.datasets

pd.set_option('display.max_columns', 50)

# Load datasets

In [36]:
type(datasets["iris"])

sklearn.utils.Bunch

In [42]:
datasets = {"cancer": load_breast_cancer(), "iris": load_iris(), "wine": load_wine()}

# Map each local file name to the position of its target column.
data_rules = {"haberman.data": -1, "sonar.all-data": -1,
              "letter-recognition.data": 0, "breast-cancer-wisconsin.data": -1,
              "balance-scale.data": 0}

for filename, y_pos in data_rules.items():
    df = pd.read_csv(f"datasets/{filename}", header=None)
    # sklearn.datasets.base was removed from scikit-learn; Bunch lives in sklearn.utils.
    datasets[filename] = sklearn.utils.Bunch(data=df.drop(columns=df.columns[y_pos]),
                                             target=df.iloc[:, y_pos])

# Create the different metric functions

In [44]:
v1 = [5.1, 3.5, 1.4, 0.3]
v2 = [5.4, 3.4, 1.7, 0.2]
def test_func(func, num, tol=0.0001):
    if not (num-tol < func(v1, v2) < num+tol):
        raise ValueError("Error on function!")

In [45]:
def HasD(x, y):
    total = 0
    for xi, yi in zip(x, y):
        min_value = min(xi, yi)
        max_value = max(xi, yi)
        total += 1 # we sum the 1 in both cases
        if min_value >= 0:
            total -= (1 + min_value) / (1 + max_value)
        else:
            # min_value + abs(min_value) = 0, so we ignore that
            total -= 1 / (1 + max_value + abs(min_value))
    return total

test_func(HasD, 0.2572)
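
A callable metric is invoked from Python for every pair of points, so the loop above may become slow on the larger datasets. The following is a minimal NumPy sketch of the same computation. The name `hasd_vectorized` is hypothetical, and whether it is actually faster depends on the number of features, so it should be benchmarked before being swapped in.

```python
import numpy as np

def hasd_vectorized(x, y):
    """Hassanat distance computed with array operations instead of a Python loop."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lo = np.minimum(x, y)
    hi = np.maximum(x, y)
    # Where the minimum is negative, shift both values by |lo|; otherwise no shift.
    shift = np.where(lo < 0, -lo, 0.0)
    return float(np.sum(1.0 - (1.0 + lo + shift) / (1.0 + hi + shift)))

# Same check vectors as used for test_func in this notebook.
check = hasd_vectorized([5.1, 3.5, 1.4, 0.3], [5.4, 3.4, 1.7, 0.2])
print(round(check, 4))
```

The function takes two 1-D vectors and returns a float, so it can be passed as `metric=` in the same way as `HasD`.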

In [46]:
def LD(x, y):
    total = 0
    for xi, yi in zip(x, y):
        total += math.log(1 + abs(xi-yi))
    return total

test_func(LD, 0.7153)
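
Written as a formula, the `LD` function above computes the following, which appears to correspond to the Lorentzian distance from the paper (the name is inferred from the abbreviation and the formula, so treat it as an assumption):

$$
\mathrm{LD}(x, y) = \sum_{i=1}^{n} \ln\left(1 + |x_i - y_i|\right)
$$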

In [47]:
funcs = {
    "HasD": HasD,
    "LD": LD,
    "CanD": "canberra",
    "L2": "euclidean"
}

In [48]:
def run_experiment(metric_func, data, testing=False):
    """
    testing: if True, uses just 10% of the data.
    """
    if testing:
        x, _, y, _ = train_test_split(data.data, data.target, test_size=0.90)
    else:
        x, y = data.data, data.target
        
    # Create standardized train/test
    train_data, test_data, train_y, test_y = train_test_split(x, y, test_size=0.34)
    scaler = StandardScaler()
    train_data = scaler.fit_transform(train_data)
    test_data = scaler.transform(test_data)
    
    clf = KNeighborsClassifier(n_neighbors=1, metric=metric_func, n_jobs=6)
    clf.fit(train_data, train_y)
    preds = clf.predict(test_data)
    return accuracy_score(test_y, preds)
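
Since `run_experiment` always standardizes, testing the hypothesis also needs a run on raw data to compare against. The sketch below shows one way to do that with a `standardize` switch. `run_once`, its parameters and the fixed seed are hypothetical and not part of the original experiment, and it uses only the iris dataset with the Euclidean metric to stay short.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def run_once(data, metric="euclidean", standardize=True, seed=0):
    """Train and score a 1-NN classifier, optionally standardizing the features."""
    x_train, x_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.34, random_state=seed)
    if standardize:
        # Fit the scaler on the training split only to avoid leaking test data.
        scaler = StandardScaler()
        x_train = scaler.fit_transform(x_train)
        x_test = scaler.transform(x_test)
    clf = KNeighborsClassifier(n_neighbors=1, metric=metric)
    clf.fit(x_train, y_train)
    return accuracy_score(y_test, clf.predict(x_test))

iris = load_iris()
raw_acc = run_once(iris, standardize=False)
std_acc = run_once(iris, standardize=True)
print(f"raw: {raw_acc:.3f}, standardized: {std_acc:.3f}")
```

Using the same seed for both calls keeps the train/test split identical, so any difference in accuracy comes from the scaling step alone.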

In [None]:
%%time
start_time = time.time()
results = pd.DataFrame(index=funcs.keys(), columns=datasets.keys())
for metric_name, metric_func in funcs.items():
    for data_name, data in datasets.items():
        # Not sure why cancer isn't working, so skip it for now
        if data_name == "cancer":
            continue
        avg_score = 0
        for _ in range(10):
            avg_score += run_experiment(metric_func, data)
        avg_score /= 10
        results.loc[metric_name, data_name] = avg_score
        
end_time = time.time()

In [None]:
print(end_time - start_time)

In [None]:
results.to_csv("results.csv")

In [None]:
results