In this project, we analyze and process the Iris flower dataset using the NumPy library.

In [21]:
import numpy as np

The features of 120 Iris flowers are used as the training data.

In [22]:
irises = np.load('irises.npy')
print(irises.shape)

(120, 4)


The labels of these flowers are provided in a separate array as numerical values.

In [23]:
types = np.load('types.npy')
print(types.shape)

(120,)


The features of 30 other Iris flowers with unknown labels are used as the test data.

In [24]:
new_irises = np.load('new_irises.npy')
print(new_irises.shape)

(30, 4)


We denote the number of training samples as n and the number of test samples as m.

In [25]:
n, m = len(irises), len(new_irises)
print("Number of training samples (n):", n)
print("Number of test samples (m):", m)

Number of training samples (n): 120
Number of test samples (m): 30


Three implementations of the pairwise Euclidean distance calculation are compared: two loops, one loop, and no loop (fully vectorized).
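
For reference, all three implementations compute the same quantity. Writing x_i for test sample i (a row of new_irises) and y_j for training sample j (a row of irises), a notation assumed here for illustration, the entry in row i and column j of the distance matrix is:

$$
d_{ij} = \sqrt{\sum_{f=1}^{4} \left( x_{i,f} - y_{j,f} \right)^2}, \qquad i = 1, \dots, m, \quad j = 1, \dots, n
$$

The sum runs over the 4 features of each flower.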

In [26]:
def calc_two_loops(new_points, points):
    m = len(new_points)
    n = len(points)
    d = np.zeros((m, n))

    for i in range(m):
        for j in range(n):
            # Euclidean distance between test sample i and training sample j
            d[i, j] = np.sqrt(np.sum(np.square(new_points[i] - points[j])))

    return d

In [27]:
d2 = calc_two_loops(new_irises, irises)
print(d2.shape)

(30, 120)


In [28]:
def calc_one_loop(new_points, points):
    m = len(new_points)
    n = len(points)
    d = np.zeros((m, n))

    for i in range(m):
        # Broadcasting gives the distances from test sample i to all n training samples at once
        d[i] = np.sqrt(np.sum(np.square(points - new_points[i]), axis=1))

    return d

In [29]:
d1 = calc_one_loop(new_irises, irises)
print(d1.shape)

(30, 120)


In [30]:
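def calc_no_loop(new_points, points):
    # Broadcasting: (m, 1, f) - (1, n, f) yields an (m, n, f) array of differences,
    # which is then reduced over the feature axis
    distances = np.sqrt(
        np.sum(
            np.square(new_points[:, np.newaxis, :] - points[np.newaxis, :, :]),
            axis=2
        )
    )
    return distances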

In [31]:
d = calc_no_loop(new_irises, irises)
print(d.shape)

(30, 120)
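
The no-loop version relies on NumPy broadcasting. As a minimal sketch on small synthetic arrays (the names `a` and `b` and their sizes are illustrative assumptions, not the Iris data), the intermediate shapes look like this:

```python
import numpy as np

# Hypothetical stand-ins for new_irises (m=3) and irises (n=5), 4 features each
rng = np.random.default_rng(0)
a = rng.random((3, 4))
b = rng.random((5, 4))

# (3, 1, 4) - (1, 5, 4) broadcasts to (3, 5, 4): one difference vector per pair
diff = a[:, np.newaxis, :] - b[np.newaxis, :, :]
print(diff.shape)

# Summing the squares over the feature axis leaves one distance per pair
dist = np.sqrt(np.sum(np.square(diff), axis=2))
print(dist.shape)
```

Note that the intermediate array holds m * n * 4 values, so this approach trades memory for speed.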


In [32]:
print(np.abs(d - d1).max())
print(np.abs(d - d2).max())
print(np.abs(d1 - d2).max())

0.0
0.0
0.0


All three methods should produce the same distance array, and the maximum absolute differences printed above are indeed zero.
As a further check, the following cell compares the arrays with np.allclose.

In [33]:
if np.allclose(d, d1, rtol=1e-5) and np.allclose(d, d2, rtol=1e-5) and np.allclose(d1, d2, rtol=1e-5):
    print('Fine!')
else:
    print('There is something wrong!')

Fine!
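
Since the three methods agree, the remaining difference between them is speed. The following is a rough timing sketch, assuming synthetic random data of the same shapes as the Iris arrays and shortened function names (`two_loops`, `one_loop`, `no_loop`) that mirror the functions above. The measured times depend on the machine, so no figures are quoted here.

```python
import timeit
import numpy as np

def two_loops(new_points, points):
    d = np.zeros((len(new_points), len(points)))
    for i in range(len(new_points)):
        for j in range(len(points)):
            d[i, j] = np.sqrt(np.sum(np.square(new_points[i] - points[j])))
    return d

def one_loop(new_points, points):
    d = np.zeros((len(new_points), len(points)))
    for i in range(len(new_points)):
        d[i] = np.sqrt(np.sum(np.square(points - new_points[i]), axis=1))
    return d

def no_loop(new_points, points):
    diff = new_points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt(np.sum(np.square(diff), axis=2))

# Synthetic data with the same shapes as irises and new_irises (an assumption)
rng = np.random.default_rng(0)
points = rng.random((120, 4))
new_points = rng.random((30, 4))

# Time 20 calls of each implementation
timings = {}
for name, func in [("two loops", two_loops), ("one loop", one_loop), ("no loop", no_loop)]:
    timings[name] = timeit.timeit(lambda: func(new_points, points), number=20)
    print(f"{name}: {timings[name]:.4f} s for 20 calls")
```

On typical hardware the vectorized variants tend to be faster, because the inner loops run in compiled NumPy code rather than in Python.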


Find the k nearest neighbors of each test sample. The result, k_nearest, is a matrix of training-sample indices with shape (30, 10).

In [34]:
k = 10
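# argpartition moves the indices of the k smallest distances in each row to the front (in no particular order)
k_nearest = np.argpartition(d, k, axis=1)[:, :k]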
print(k_nearest)

[[  6  14   0  32  22  17  39  21  23  37]
 [ 10   1  30  24  20   2  36   3   6  28]
 [ 27  13   8  39   4  29  15  12  26  25]
 [ 14  17   0  39  37  22   8  32  26  13]
 [ 21  20  30   3   9  24   6  19  35   5]
 [ 10   1   2   3  38   5  36  30  24   9]
 [ 10   1  30  24  20   2  36   3   6  28]
 [  0   6  22  23  21  14   9  32  17  39]
 [ 39  35  17  22   4  21  37  19   8  16]
 [ 28   6  23   0  14  21  30  32  22   9]
 [ 47  73  69  61  52  51  41  60 101  58]
 [ 71  43  65  64  77  72  66  76  74  54]
 [ 71  65  74  64  66  77  76  54  57  49]
 [ 64  65  74  66  43  50  54  72  77  71]
 [ 78  60  47  41  73  52  63  57  51  59]
 [ 74  65  54  64  66  43  50  57  71  75]
 [ 77  71  44  53  72  76  49  97  56  63]
 [ 72  66  64  74  43  65  77  54  71  76]
 [ 77  74  44  72  71  76  54  66  43  53]
 [ 74  71  66  76  77  54  57  44  49  72]
 [ 90 103 106  96 112 115  93 110  83  80]
 [ 82  96 100 108  94 115 112 105  80  84]
 [111  92  97 117  81 114 102  91 101  56]
 [107  67  

In [35]:
k_nearest.shape

(30, 10)
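
np.argpartition only guarantees that the first k indices of each row belong to the k smallest distances; it does not order them. Majority voting does not need an order, but if the neighbors should be listed from nearest to farthest, they can be sorted afterwards. A minimal sketch on a made-up distance matrix (`d_demo` and its values are illustrative assumptions):

```python
import numpy as np

# Hypothetical distance matrix: 2 test samples, 6 training samples
d_demo = np.array([
    [0.9, 0.1, 0.5, 0.3, 0.7, 0.2],
    [0.4, 0.8, 0.05, 0.6, 0.15, 0.25],
])
k = 3

# Indices of the k smallest distances per row, in arbitrary order
nearest = np.argpartition(d_demo, k, axis=1)[:, :k]

# Look up the distances of those k neighbors and sort the indices by them
nearest_d = np.take_along_axis(d_demo, nearest, axis=1)
order = np.argsort(nearest_d, axis=1)
nearest_sorted = np.take_along_axis(nearest, order, axis=1)
print(nearest_sorted)
# [[1 5 3]
#  [2 4 5]]
```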

Look up the label (type) of each of the k nearest neighbors by indexing types with k_nearest.

In [36]:
k_nearest_types = types[k_nearest]
k_nearest_types

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 2, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 2, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 1],
       [2, 1, 2, 2, 1, 2, 2, 2, 1, 1],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2,
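
The lookup above uses NumPy integer-array indexing: indexing a 1-D array with a 2-D index array returns an array with the shape of the index array. A small sketch with made-up values (`labels` and `idx` are illustrative assumptions):

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 2])         # hypothetical labels of 5 training samples
idx = np.array([[0, 1], [2, 4], [4, 3]])   # hypothetical neighbor indices, shape (3, 2)

# The result has the shape of idx; entry (i, j) is labels[idx[i, j]]
neighbor_labels = labels[idx]
print(neighbor_labels)
# [[0 0]
#  [1 2]
#  [2 1]]
```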

5. The most frequent label (mode) in k_nearest_types[i] is assigned as the predicted label for new_irises[i].

In [37]:
from scipy import stats
predicted_types = stats.mode(k_nearest_types, axis=1).mode.reshape(m)
predicted_types

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)
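
The same majority vote can also be sketched with NumPy alone, which avoids the SciPy dependency. This is an alternative to stats.mode, not the method used above; the array `neighbor_labels` and the value of `n_classes` are illustrative assumptions:

```python
import numpy as np

# Hypothetical neighbor labels: 3 test samples, 4 neighbors each
neighbor_labels = np.array([
    [0, 0, 0, 1],
    [1, 2, 1, 1],
    [2, 2, 1, 2],
])
n_classes = 3

# Count how often each class appears in every row: shape (3, n_classes)
counts = (neighbor_labels[:, :, np.newaxis] == np.arange(n_classes)).sum(axis=1)

# argmax picks the most frequent class; on a tie it returns the smallest label
predicted = counts.argmax(axis=1)
print(predicted)
# [0 1 2]
```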

6. Calculate the model accuracy as the fraction of predicted labels that match the labels stored in new_types.npy.

In [38]:
new_types = np.load('new_types.npy')
accuracy = np.mean(predicted_types == new_types)
print('Accuracy:', accuracy)

Accuracy: 1.0
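
A single accuracy figure hides which classes are confused with each other. One possible way to break it down is a confusion matrix, sketched here on made-up labels (`true`, `pred`, and `n_classes` are illustrative assumptions, not the notebook's data):

```python
import numpy as np

# Hypothetical true and predicted labels for 6 samples
true = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 2, 2, 2])
n_classes = 3

# confusion[i, j] counts samples of true class i that were predicted as class j
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (true, pred), 1)
print(confusion)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]

# The diagonal holds the correct predictions, so accuracy is trace / total
accuracy = np.trace(confusion) / confusion.sum()
print("Accuracy:", accuracy)
```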
