First, we load the CSV file into a DataFrame with pandas. read_csv accepts a URL directly, so no separate download step is needed; header=None is passed because the file has no header row.

In [25]:
import pandas as pd
import numpy as np
import math as mt


df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",header=None)
df.tail() # show the last five rows

       0    1    2    3               4
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica


The CSV file contains 150 rows. The assignment asks for a random 20% of them, which .sample(frac=0.2) would return (frac is a fraction between 0 and 1); the cell below instead passes frac=1, which returns all 150 rows in random order.
Then we use the pandas indexer .iloc to split the shuffled table into the four feature columns and the species column.
The function "FromStringToValue" maps the three Iris species to the integers 0 to 2: each integer is a LABEL.
Finally, the testing dataset is built by stacking the four-column feature matrix and the label column side by side.

In [26]:
shuffleTest = df.sample(frac=1)
x_test = shuffleTest.iloc[:,0:4].values
y_test = shuffleTest.iloc[:,4].values


def FromStringToValue(species_list):
    # Map each distinct species name to an integer label.
    # A set has no guaranteed order, so the name-to-number assignment may differ between runs.
    class_values = set(species_list)
    LUT = {}
    print("Legend as it follows:")
    print()
    for i, species in enumerate(class_values):
        LUT[species] = i
        print(f"{species} is equal to {i}")
    # Replace every species name with its integer label
    updated_speciesList = [LUT[row] for row in species_list]
    return updated_speciesList

y_test = FromStringToValue(y_test)
dasetTest=np.column_stack((x_test,y_test))

rowInit = dasetTest[144]  # row used as the test sample; its 5th column holds
                          # the species label (0, 1 or 2)
speciesInit = int(rowInit[4])
print()
print(f"Test row is: {rowInit}")

Legend as it follows:

Iris-versicolor is equal to 0
Iris-setosa is equal to 1
Iris-virginica is equal to 2

Test row is: [5.8 2.7 3.9 1.2 0. ]
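The legend above depends on the iteration order of a Python set, which is not guaranteed to be the same from one run to the next. A minimal sketch of a reproducible alternative follows; the name encode_labels is hypothetical and not part of the notebook, and the assumption is that numbering the species alphabetically is acceptable. With this variant the legend would read setosa 0, versicolor 1, virginica 2, unlike the run shown above.

```python
def encode_labels(species_list):
    """Return (codes, lookup), numbering the distinct labels in alphabetical order."""
    # sorted() fixes the order, so the same name always gets the same integer
    lookup = {name: i for i, name in enumerate(sorted(set(species_list)))}
    codes = [lookup[name] for name in species_list]
    return codes, lookup


species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
codes, lookup = encode_labels(species)
print(lookup)
print(codes)
```

Returning the lookup table alongside the codes also makes it possible to translate a predicted number back into a species name.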


As the first step of KNN, we define three types of distance between two rows of the dataset. Each row is treated as a 4-dimensional feature vector; its last element, the label, is skipped. When a row is compared with every row of the dataset, its distance to itself is 0.
Then we create a "choosing" function (which could later be refactored into a class) that takes a distance ID and computes the corresponding type of distance.
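For reference, the three distances can be written as follows for two feature vectors p and q with n = 4 components. The symbols p, q and n are notation introduced here; the absolute value in the cosine distance mirrors the abs() call in the code rather than the more common definition without it.

```latex
\[
\begin{aligned}
d_{\mathrm{Euclidean}}(p, q) &= \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \\
d_{\mathrm{Cosine}}(p, q) &= 1 - \left| \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2 \, \sum_{i=1}^{n} q_i^2}} \right| \\
d_{\mathrm{Manhattan}}(p, q) &= \sum_{i=1}^{n} \lvert p_i - q_i \rvert
\end{aligned}
\]
```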

In [27]:
def EuclideanDistance(r1, r2): # r1, r2 are rows; the last element (the label) is skipped
    distance = 0.0
    for i in range(len(r1)-1):
        distance += (r1[i] - r2[i])**2
    return mt.sqrt(distance)


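def CosineDistance(r1,r2):
    sum1 = 0.0  # dot product of the two feature vectors
    sum2 = 0.0  # squared norm of r1
    sum3 = 0.0  # squared norm of r2
    for i in range(len(r1)-1):
        sum1 += (r1[i]*r2[i])
        sum2 += r1[i]**2
        sum3 += r2[i]**2
    # 1 - |cosine similarity|; abs() has no effect here because all Iris features are positive
    distance = 1-abs(sum1/mt.sqrt(sum2*sum3))
    return distance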


def ManhattanDistance(r1,r2):
    distance = 0.0
    for i in range(len(r1)-1):
        distance += abs((r1[i] - r2[i]))
    return distance


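def ChooseTypeOfDistance(k,r1,r2):
    # k is the distance ID: 0 = Euclidean, 1 = Cosine, 2 = Manhattan
    types = {'Euclidean': 0, 'Cosine': 1, 'Manhattan': 2}
    if k == types['Euclidean']:
        return EuclideanDistance(r1,r2)
    elif k == types['Cosine']:
        return CosineDistance(r1,r2)
    elif k == types['Manhattan']:
        return ManhattanDistance(r1,r2)
    else:
        # fail loudly instead of silently returning None
        raise ValueError(f"Unknown distance ID: {k}")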
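One possible way to simplify the choosing function is a lookup table from names to functions. The sketch below is not the notebook's code: the names DISTANCES, distance, euclidean, cosine and manhattan are hypothetical, it keeps the convention that the last element of each row is the label, and its cosine variant omits the abs() call.

```python
import math


def euclidean(r1, r2):
    # r[:-1] drops the trailing label before comparing features
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r1[:-1], r2[:-1])))


def cosine(r1, r2):
    dot = sum(a * b for a, b in zip(r1[:-1], r2[:-1]))
    norm1 = math.sqrt(sum(a * a for a in r1[:-1]))
    norm2 = math.sqrt(sum(b * b for b in r2[:-1]))
    return 1.0 - dot / (norm1 * norm2)


def manhattan(r1, r2):
    return sum(abs(a - b) for a, b in zip(r1[:-1], r2[:-1]))


# Lookup table: distance name -> function
DISTANCES = {"euclidean": euclidean, "cosine": cosine, "manhattan": manhattan}


def distance(name, r1, r2):
    """Compute the named distance between two labelled rows."""
    if name not in DISTANCES:
        raise ValueError(f"Unknown distance: {name!r}")
    return DISTANCES[name](r1, r2)


# Two rows with two features each plus a label in the last position
row_a = [0.0, 0.0, 1]
row_b = [3.0, 4.0, 2]
print(distance("euclidean", row_a, row_b))
print(distance("manhattan", row_a, row_b))
```

Adding a new distance then only requires one more entry in the table, with no extra elif branch.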

At this point we write a function that, given the proper parameters, applies the KNN algorithm and returns the k nearest neighbors. It relies on the distance functions defined above.

In [28]:
def getNeighbors(trainingList, testRow, numNeighbors):
    distances = []
    for trainingRow in trainingList:
        dist = ChooseTypeOfDistance(2, testRow, trainingRow)  # distance ID 2 selects the Manhattan distance
        distances.append((trainingRow, dist))
    distances.sort(key=lambda tup: tup[1])
    neighbors = []
    
    for i in range(numNeighbors):
        neighbors.append(distances[i][0])
    return neighbors


print()
neighbors = getNeighbors(dasetTest,dasetTest[144], 5)
for neighbor in neighbors:
    print(neighbor)
    print()


[5.8 2.7 3.9 1.2 0. ]

[5.8 2.6 4.  1.2 0. ]

[5.8 2.7 4.1 1.  0. ]

[5.7 2.8 4.1 1.3 0. ]

[5.6 2.5 3.9 1.1 0. ]
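The neighbor search above sorts every distance and hard-codes the distance type. As a hedged alternative, the sketch below takes the distance function as a parameter and uses heapq.nsmallest from the standard library, which returns the same rows as sorting by the key and taking the first k. The names k_nearest and manhattan and the four sample rows are illustrative; the rows reuse values printed earlier in this notebook.

```python
import heapq


def manhattan(r1, r2):
    # The last element of each row is the label, so it is skipped
    return sum(abs(a - b) for a, b in zip(r1[:-1], r2[:-1]))


def k_nearest(training_rows, test_row, k, distance=manhattan):
    """Return the k training rows closest to test_row."""
    return heapq.nsmallest(k, training_rows, key=lambda row: distance(test_row, row))


rows = [
    [5.8, 2.7, 3.9, 1.2, 0],
    [5.8, 2.6, 4.0, 1.2, 0],
    [6.7, 3.0, 5.2, 2.3, 2],
    [5.9, 3.0, 5.1, 1.8, 2],
]
nearest = k_nearest(rows, rows[0], 2)
for row in nearest:
    print(row)
```

Because the test row is itself in rows, it comes back as its own nearest neighbor, just as in the output above.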



Each training row is stored together with its distance as a tuple (trainingRow, dist). The key parameter of sort tells it which value to order by; here it is the second element of each tuple, "dist", so the closest rows come first. Because the test row belongs to the dataset, it appears as its own nearest neighbor, at distance 0.
Now that we can find the neighbors in the dataset, it's time to make predictions.
We apply the max function to the set of neighbor labels, using how often each label occurs as the key, so that it returns the single most common class value.


In [29]:
def PredictClassification(trainingList, testRow, numNeighbors):
    neighbors = getNeighbors(trainingList, testRow, numNeighbors)
    outputs = [el[-1] for el in neighbors]
    prediction = max(set(outputs), key=outputs.count) # majority vote: the label that occurs most often among the neighbors
    return prediction


neighbors = getNeighbors(dasetTest, rowInit, 5)
prediction = PredictClassification(dasetTest, rowInit, 5)

print("Data selected:")
print(list(rowInit[:4]))
print()
print(f"Predicted {int(prediction)}")
print()
print(f"Verifying: the expected one is {speciesInit}")
print()


Data selected:
[5.8, 2.7, 3.9, 1.2]

Predicted 0

Verifying: the expected one is 0
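The majority vote can also be sketched with collections.Counter from the standard library. The function name majority_vote and the sample labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label that occurs most often in labels."""
    # most_common(1) returns a list holding one (label, count) pair
    return Counter(labels).most_common(1)[0][0]


neighbor_labels = [0.0, 0.0, 2.0, 0.0, 2.0]
winner = majority_vote(neighbor_labels)
print(winner)
```

Counter counts all labels in a single pass, whereas outputs.count rescans the list once per distinct label. For the five neighbors used here the difference is negligible, so this is mainly a readability choice.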

