### K Nearest Neighbors

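Based on the chosen distance metric, the KNN algorithm finds the k samples in the
training dataset that are closest (most similar) to the point that we want to classify.
The class label of the new data point is then determined by a majority vote among its
k nearest neighbors. When the target is continuous, as with the LC50 values used below,
the prediction is typically the mean of the neighbors' target values instead.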
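
The majority vote described above can be sketched in a few lines. This is only an illustration: `majority_vote` and the labels in `neighbor_labels` are hypothetical and not part of the notebook, and ties are broken arbitrarily.

```julia
# Hypothetical helper (not from the original notebook): return the most
# frequent label among the labels of the k nearest neighbors.
function majority_vote(labels::AbstractVector)
    # Count how often each label occurs
    counts = Dict{eltype(labels),Int}()
    for label in labels
        counts[label] = get(counts, label, 0) + 1
    end
    # Pick the label with the highest count (ties broken arbitrarily)
    best_label, best_count = first(labels), 0
    for (label, count) in counts
        if count > best_count
            best_label, best_count = label, count
        end
    end
    return best_label
end

# Made-up labels of five neighbors, for illustration only
neighbor_labels = ["toxic", "safe", "toxic", "toxic", "safe"]
println(majority_vote(neighbor_labels))   # prints "toxic" (3 votes against 2)
```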

<img src="Images/knn_ex.PNG" width="400" />

In [1]:

using LinearAlgebra, CSV, Plots
theme(:dark)




### Data

In [5]:
fish = CSV.read("qsar_fish_toxicity.csv"; delim=';', header =["CIC0",
        "SM1_Dz", "GATS1i", "NdsCH", "NdssC", "MLOGP",  "LC50"]);

In [6]:
fish

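| Row | CIC0 | SM1_Dz | GATS1i | NdsCH | NdssC | MLOGP | LC50 |
|-----|------|--------|--------|-------|-------|-------|------|
| | Float64 | Float64 | Float64 | Int64 | Int64 | Float64 | Float64 |
| 1 | 3.26 | 0.829 | 1.676 | 0 | 1 | 1.453 | 3.77 |
| 2 | 2.189 | 0.58 | 0.863 | 0 | 0 | 1.348 | 3.115 |
| 3 | 2.125 | 0.638 | 0.831 | 0 | 0 | 1.348 | 3.531 |
| 4 | 3.027 | 0.331 | 1.472 | 1 | 0 | 1.807 | 3.51 |
| 5 | 2.094 | 0.827 | 0.86 | 0 | 0 | 1.886 | 5.39 |
| 6 | 3.222 | 0.331 | 2.177 | 0 | 0 | 0.706 | 1.819 |
| 7 | 3.179 | 0.0 | 1.063 | 0 | 0 | 2.942 | 3.947 |
| 8 | 3.0 | 0.0 | 0.938 | 1 | 0 | 2.851 | 3.513 |
| 9 | 2.62 | 0.499 | 0.99 | 0 | 0 | 2.942 | 4.402 |
| 10 | 2.834 | 0.134 | 0.95 | 0 | 0 | 1.591 | 3.021 |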


In [7]:
X_data = [x for x in zip(fish.CIC0, fish.SM1_Dz, fish.GATS1i, fish.NdsCH, fish.NdssC, fish.MLOGP)]
Y_data = [x for x in fish.LC50];

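The distance between two samples $\boldsymbol{x}^{(i)}$ and $\boldsymbol{x}^{(j)}$ is measured with the Minkowski distance; setting $p = 2$ gives the Euclidean distance implemented below.

$$d\left(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)}\right)=\sqrt[p]{\sum_{k}\left|x_{k}^{(i)}-x_{k}^{(j)}\right|^{p}}$$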
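
As a rough illustration of the formula, the sketch below computes the Minkowski distance for an arbitrary order `p`. The name `minkowski_distance` and the sample points are assumptions made for this example and do not appear in the notebook.

```julia
# Hypothetical helper: Minkowski distance of order p between two points
# given as tuples or vectors of equal length.
function minkowski_distance(a, b, p::Real = 2)
    length(a) == length(b) || throw(DimensionMismatch("points must have the same length"))
    # Sum |a_k - b_k|^p over all coordinates, then take the p-th root
    return sum(abs(a[i] - b[i])^p for i in 1:length(a))^(1 / p)
end

# Made-up points, for illustration only
a = (0.0, 0.0)
b = (3.0, 4.0)
println(minkowski_distance(a, b, 2))   # p = 2: Euclidean distance, prints 5.0
println(minkowski_distance(a, b, 1))   # p = 1: Manhattan distance, prints 7.0
```

With `p = 2` this should agree with the `euclidean_distance` function defined in the notebook, while other values of `p` change how strongly large coordinate differences dominate.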

In [8]:
# Define the Euclidean distance formula as a function
function euclidean_distance(p1::Tuple{Float64,Float64,Float64,Int64,Int64,Float64}, 
                            p2::Tuple{Float64,Float64,Float64,Int64,Int64,Float64})::Float64
    return sqrt(sum([(p1[i] - p2[i])^2 for i = 1:length(p1)]))
end

# Test the function euclidean_distance to make sure it works!
print("The distance between ", X_data[1], " and ", X_data[50])
println(" is ", euclidean_distance(X_data[1], X_data[50]))
@time euclidean_distance(X_data[1], X_data[50])

The distance between (3.26, 0.829, 1.676, 0, 1, 1.453) and (4.171, 0.693, 1.678, 0, 2, 2.849) is 1.948650045544351
  0.000031 seconds (37 allocations: 1.563 KiB)


1.948650045544351

In [13]:
function k_nearest_neighbors(p, X, Y, k::Int64)
    # Calculate the distance between p and all other points in X
    distance_array = [(X[i], Y[i], euclidean_distance(p, X[i])) 
                      for i = 1:length(X)
                      if X[i] != p
                      ]
    # Sort the distance array in ascending order according to distance
    sort!(distance_array, by = x -> x[3])     # Python = distance_array.sort(key = lambda x : x[2])
    
    # Return the first k entries from the sorted distance array 
    return distance_array[1:k]                # Python = distance_array[0:k]
end 

# Test the k_nearest_neighbors function
test = k_nearest_neighbors(X_data[120], X_data, Y_data, 5)
println("")
println("Target Point P = ", X_data[120])
println("k = ", 5)
println("_________________________________________________")
for i = 1:length(test)
    println("Point $i = ", test[i][1])
    println("Point Label = ", test[i][2])
    println("Point Distance = ", test[i][3])
    if i != length(test)
        println("")
    else
      println("______________________________________________")  
    end
end
println("")


Target Point P = (2.665, 0.251, 1.762, 0, 0, -0.534)
k = 5
_________________________________________________
Point 1 = (2.377, 0.331, 1.734, 0, 0, -0.534)
Point Label = 1.474
Point Distance = 0.3002132575353729

Point 2 = (2.366, 0.405, 1.735, 0, 0, -0.534)
Point Label = 0.242
Point Distance = 0.337410728934336

Point 3 = (2.429, 0.405, 1.954, 0, 0, -0.473)
Point Label = 0.931
Point Distance = 0.34640583135969305

Point 4 = (2.377, 0.331, 2.111, 0, 0, -0.534)
Point Label = 2.156
Point Distance = 0.4595051686325196

Point 5 = (2.983, 0.496, 2.106, 0, 0, -0.521)
Point Label = 1.68
Point Distance = 0.5288232218804313
______________________________________________
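
Because LC50 is a continuous value, one common way to turn the neighbors into a prediction is to average their labels rather than take a vote. The sketch below shows that idea on made-up data. `euclid`, `knn_regress`, `X_toy` and `Y_toy` are hypothetical names, and unlike `k_nearest_neighbors` the sketch does not exclude the query point from its own neighbors.

```julia
using Statistics   # provides mean

# Euclidean distance between two points of equal length
euclid(a, b) = sqrt(sum((a[i] - b[i])^2 for i in 1:length(a)))

# Hypothetical helper: predict the target of p as the mean target of its
# k nearest neighbors in X (assumes k <= length(X)).
function knn_regress(p, X, Y, k::Integer)
    # Indices of X ordered from nearest to farthest from p
    order = sortperm([euclid(p, x) for x in X])
    return mean(Y[order[1:k]])
end

# Made-up data, for illustration only (not taken from the fish dataset)
X_toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
Y_toy = [1.0, 2.0, 3.0, 10.0]
println(knn_regress((0.1, 0.1), X_toy, Y_toy, 3))   # prints 2.0
```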



In [10]:
function more_like_this(fish_LC50, X, Y, k)
    
    for i = 1:length(Y)
        if Y[i] == fish_LC50
            L = k_nearest_neighbors(X[i], X, Y, k)
            println("The $k nearest neighbors of the sample with LC50 = $fish_LC50 have LC50 values:")
            for j = 1:k
                println("$j. ", L[j][2])
            end
        end
    end
end


more_like_this (generic function with 1 method)

In [11]:
more_like_this(6.535, X_data, Y_data, 5)

The 5 nearest neighbors of the sample with LC50 = 6.535 have LC50 values:
1. 6.564
2. 5.039
3. 5.048
4. 6.077
5. 4.828
