<a href="https://colab.research.google.com/github/MLcmore2023/MLcmore2023/blob/main/day2_pm_afternoon/knn-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KNN demo
K-Nearest Neighbors (KNN) is a supervised ML algorithm: it predicts a target variable (y) from independent variables (X).
The algorithm stores the entire training dataset. When a prediction is required, it locates the k training records most similar to the new record and summarizes their labels into a single prediction (for classification, a majority vote).

**accompanying slides:**
https://docs.google.com/presentation/d/1jNR_1fJYOuR8qoLcwh7M07zZB2ObDOufiHQtOVJKuNU/edit?usp=sharing

### Import libraries and initialize random generator

In [1]:
import pandas as pd
import numpy as np
# Set the seed value to make the random numbers reproducible
np.random.seed(0)
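
As a quick illustration of why the seed matters, the sketch below (toy values, not part of the original notebook) seeds NumPy's global generator twice with the same value and checks that the same permutation comes out both times.

```python
import numpy as np

# Seeding the global generator makes the following random calls repeatable.
np.random.seed(0)
first = np.random.permutation(10)

# Re-seeding with the same value restarts the same random sequence.
np.random.seed(0)
second = np.random.permutation(10)

print(np.array_equal(first, second))  # True
```

Without the second `np.random.seed(0)` call, `second` would generally be a different permutation.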

### Read data from CSV
We will use the Iris flower dataset. It contains the sepal and petal length and width of three types of iris (Setosa, Versicolour, and Virginica).

In [2]:
#read in dataset
dataset = pd.read_csv("https://raw.githubusercontent.com/MLcmore2023/MLcmore2023/main/day2_pm_afternoon/Iris.csv")
display(dataset)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


Here we split the data into input variables (X) and the output variable (y).

In [3]:
#Split the dataset into X (input features) and y (output labels)
X = dataset[ ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'] ].values
y = dataset['Species'].values

In [4]:
# first 5 rows of X:
display(X[0:5])

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [5]:
# first 5 rows of y:
display(y[0:5])

array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa'], dtype=object)

The labels are strings, so we will convert them into integers.
With integer labels, the model can process the data more efficiently and compare labels directly.

In [6]:
#converting text labels to integers
print("Before:",y)
unique_label_text, y_labels_integer = np.unique(y, return_inverse=True)
print("After:",y_labels_integer)

Before: ['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-ve
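
To see what `np.unique(..., return_inverse=True)` returns, here is a minimal sketch on toy labels. The values `"a"`, `"b"`, `"c"` and the variable names are made up for illustration and are not taken from the iris data.

```python
import numpy as np

labels = np.array(["b", "a", "c", "a"])  # toy labels, not the iris data

# unique_text holds the sorted distinct labels;
# as_int holds, for every original entry, its position in unique_text.
unique_text, as_int = np.unique(labels, return_inverse=True)
print(unique_text)  # ['a' 'b' 'c']
print(as_int)       # [1 0 2 0]

# Indexing the unique labels with the integers recovers the original text.
recovered = unique_text[as_int]
print(recovered)    # ['b' 'a' 'c' 'a']
```

Keeping the array of unique labels around is what lets an integer prediction be turned back into a species name later.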

### Split the dataset into training sets and testing sets
This is usually done using the `scikit-learn` library's `train_test_split` function:
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)`

However, this tutorial will explicitly show the processes of splitting datasets.

The code below splits the dataset into a training set and a testing set.
Holding out a testing set lets us assess how the model performs on data it has not seen during training.

- An array of indices, `indices`, containing the integers from 0 to 149, is created.
- The `np.random.shuffle(indices)` line shuffles the indices to randomize the data order. The rows are sorted by species, so if we did not shuffle and simply took the last thirty samples (rows 120 to 149) as the testing set, they would all be the same type of iris.

In [7]:
indices = np.arange(len(X))  # Create an array of indices for shuffling
print(indices)

[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149]


In [8]:
np.random.shuffle(indices)  # Shuffle the indices to randomize the data
print(indices)

[114  62  33 107   7 100  40  86  76  71 134  51  73  54  63  37  78  90
  45  16 121  66  24   8 126  22  44  97  93  26 137  84  27 127 132  59
  18  83  61  92 112   2 141  43  10  60 116 144 119 108  69 135  56  80
 123 133 106 146  50 147  85  30 101  94  64  89  91 125  48  13 111  95
  20  15  52   3 149  98   6  68 109  96  12 102 120 104 128  46  11 110
 124  41 148   1 113 139  42   4 129  17  38   5  53 143 105   0  34  28
  55  75  35  23  74  31 118  57 131  65  32 138  14 122  19  29 130  49
 136  99  82  79 115 145  72  77  25  81 140 142  39  58  88  70  87  36
  21   9 103  67 117  47]


- The `split_ratio` variable determines the ratio of the training set to the entire dataset.
- The `index_to_split` variable calculates the index at which the dataset will be split based on the ratio.

In [9]:
split_ratio = 0.8  # Ratio of training set vs. whole dataset
index_to_split = int(len(X) * split_ratio)  # Determine the index to split the dataset
print(index_to_split)

120


- The next code block splits the dataset explicitly, using loops rather than array slicing.
- The lists `X_train`, `X_test`, `y_train`, and `y_test` are initialized to store the training and testing data and labels.
- The first loop fills the `X_train` and `y_train` lists with the data and labels for the training set; the second loop does the same for `X_test` and `y_test`.

In [10]:
#initialize the lists which the dataset will be split into
X_train = []
X_test = []
y_train = []
y_test = []

# Populate the training data and labels
for i in range(index_to_split):
    X_train.append( X[indices[i]] )
    y_train.append( y_labels_integer[indices[i]] )

# Populate the testing data and labels
for i in range(index_to_split, len(indices)):
    X_test.append( X[indices[i]] )
    y_test.append( y_labels_integer[indices[i]] )
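
For comparison, the same split can be written with array slicing instead of loops. This is a minimal sketch on toy data; the names `X_toy`, `y_toy`, and the `*_toy` outputs are hypothetical, and it uses a local `np.random.default_rng` generator with an arbitrary seed rather than the global seed set earlier.

```python
import numpy as np

rng = np.random.default_rng(0)         # local generator; the seed is arbitrary
X_toy = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y_toy = np.arange(10) % 3              # toy integer labels

indices = rng.permutation(len(X_toy))  # shuffled row indices
index_to_split = int(len(X_toy) * 0.8)

# Slice the shuffled indices, then use them to pick rows in one step.
train_idx = indices[:index_to_split]
test_idx = indices[index_to_split:]
X_train_toy, y_train_toy = X_toy[train_idx], y_toy[train_idx]
X_test_toy, y_test_toy = X_toy[test_idx], y_toy[test_idx]

print(X_train_toy.shape, X_test_toy.shape)  # (8, 2) (2, 2)
```

Indexing with an index array already returns NumPy arrays, so no list-to-array conversion is needed in this variant.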

In [11]:
X_train

[array([5.8, 2.8, 5.1, 2.4]),
 array([6. , 2.2, 4. , 1. ]),
 array([5.5, 4.2, 1.4, 0.2]),
 array([7.3, 2.9, 6.3, 1.8]),
 array([5. , 3.4, 1.5, 0.2]),
 array([6.3, 3.3, 6. , 2.5]),
 array([5. , 3.5, 1.3, 0.3]),
 array([6.7, 3.1, 4.7, 1.5]),
 array([6.8, 2.8, 4.8, 1.4]),
 array([6.1, 2.8, 4. , 1.3]),
 array([6.1, 2.6, 5.6, 1.4]),
 array([6.4, 3.2, 4.5, 1.5]),
 array([6.1, 2.8, 4.7, 1.2]),
 array([6.5, 2.8, 4.6, 1.5]),
 array([6.1, 2.9, 4.7, 1.4]),
 array([4.9, 3.1, 1.5, 0.1]),
 array([6. , 2.9, 4.5, 1.5]),
 array([5.5, 2.6, 4.4, 1.2]),
 array([4.8, 3. , 1.4, 0.3]),
 array([5.4, 3.9, 1.3, 0.4]),
 array([5.6, 2.8, 4.9, 2. ]),
 array([5.6, 3. , 4.5, 1.5]),
 array([4.8, 3.4, 1.9, 0.2]),
 array([4.4, 2.9, 1.4, 0.2]),
 array([6.2, 2.8, 4.8, 1.8]),
 array([4.6, 3.6, 1. , 0.2]),
 array([5.1, 3.8, 1.9, 0.4]),
 array([6.2, 2.9, 4.3, 1.3]),
 array([5. , 2.3, 3.3, 1. ]),
 array([5. , 3.4, 1.6, 0.4]),
 array([6.4, 3.1, 5.5, 1.8]),
 array([5.4, 3. , 4.5, 1.5]),
 array([5.2, 3.5, 1.5, 0.2]),
 array([6.

- Finally, the lists are converted to NumPy arrays (2D for the features, 1D for the labels) for compatibility with machine learning algorithms.

In [12]:
# Convert the lists to numpy arrays (X: 2D, y: 1D)
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)
print("X_test",X_test)
print("y_test",y_test)

X_test [[5.8 4.  1.2 0.2]
 [7.7 2.8 6.7 2. ]
 [5.1 3.8 1.5 0.3]
 [4.7 3.2 1.6 0.2]
 [7.4 2.8 6.1 1.9]
 [5.  3.3 1.4 0.2]
 [6.3 3.4 5.6 2.4]
 [5.7 2.8 4.1 1.3]
 [5.8 2.7 3.9 1.2]
 [5.7 2.6 3.5 1. ]
 [6.4 3.2 5.3 2.3]
 [6.7 3.  5.2 2.3]
 [6.3 2.5 4.9 1.5]
 [6.7 3.  5.  1.7]
 [5.  3.  1.6 0.2]
 [5.5 2.4 3.7 1. ]
 [6.7 3.1 5.6 2.4]
 [5.8 2.7 5.1 1.9]
 [5.1 3.4 1.5 0.2]
 [6.6 2.9 4.6 1.3]
 [5.6 3.  4.1 1.3]
 [5.9 3.2 4.8 1.8]
 [6.3 2.3 4.4 1.3]
 [5.5 3.5 1.3 0.2]
 [5.1 3.7 1.5 0.4]
 [4.9 3.1 1.5 0.1]
 [6.3 2.9 5.6 1.8]
 [5.8 2.7 4.1 1. ]
 [7.7 3.8 6.7 2.2]
 [4.6 3.2 1.4 0.2]]
y_test [0 2 0 0 2 0 2 1 1 1 2 2 1 1 0 1 2 2 0 1 1 1 1 0 0 0 2 1 2 0]


### KNN classifier function
This k-Nearest Neighbors tutorial is broken down into 3 parts:

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*CVcFTGsIZ-sj_Z_CY5X0Ew.png" width=50%>

#### Step 1: Calculate Euclidean Distance.
- We can calculate the straight line distance between two vectors using the Euclidean distance measure. It is calculated as the square root of the sum of the squared differences between the two vectors.
- To locate the neighbors for a new piece of data within a dataset we must first calculate the distance between each record in the dataset to the new piece of data (note in the image not all arrows are shown)
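
The distance described above can be sketched in a few lines of NumPy. The two vectors are the first two rows of the iris features shown earlier; the variable names are illustrative only.

```python
import numpy as np

a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([4.9, 3.0, 1.4, 0.2])

# Square root of the sum of squared differences, written out explicitly.
manual = np.sqrt(np.sum((a - b) ** 2))

# np.linalg.norm of the difference vector computes the same quantity.
builtin = np.linalg.norm(a - b)
print(np.isclose(manual, builtin))  # True

# Broadcasting gives the distance from one point to every row at once.
rows = np.array([[5.1, 3.5, 1.4, 0.2],
                 [4.9, 3.0, 1.4, 0.2]])
dists = np.sqrt(np.sum((rows - a) ** 2, axis=1))
print(dists.shape)  # (2,)
```

The same expressions work for vectors of any length, since nothing in them depends on there being exactly four features.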

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*0ObiRFulLMuXQ5aj6dO7BQ.png" width=50%>

#### Step 2: Get Nearest Neighbors.
- Neighbors for a new piece of data in the dataset are the k closest instances, as defined by our distance measure.
- Once distances are calculated, we must sort all of the records in the training dataset by their distance to the new data. We can then select the top k to return as the most similar neighbors.
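
As a small sketch of this step, `np.argsort` returns the indices that would sort an array, so its first k entries are the indices of the k nearest points. The distances below are toy values, not computed from the iris data.

```python
import numpy as np

distances = np.array([0.9, 0.2, 0.5, 0.1, 0.7])  # toy distances to 5 training points
k = 3

# Indices of the training points, ordered from nearest to farthest.
# kind="stable" keeps equal distances in their original order.
order = np.argsort(distances, kind="stable")

nearest = order[:k]
print(nearest)  # [3 1 2]
```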

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*El2DdaYNYaBficIL7EXwxg.png" width=50%>

#### Step 3: Make Predictions.
- The most similar neighbors collected from the training dataset are used to make the prediction: the label that occurs most often among them wins.
- `np.unique` with `return_counts=True` returns each unique label together with the number of times it occurs.
- `np.argmax` returns the index of the largest count, which identifies the majority label.
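
The majority vote can be sketched on a toy list of neighbour labels (the label values are made up for illustration). The second half shows what happens on a tie.

```python
import numpy as np

neighbour_labels = np.array([1, 0, 0, 2])  # toy labels of 4 neighbours

values, counts = np.unique(neighbour_labels, return_counts=True)
print(values, counts)  # [0 1 2] [2 1 1]

# argmax gives the position of the largest count; values maps it to a label.
winner = values[np.argmax(counts)]
print(winner)  # 0

# On a tie, argmax returns the first maximum, so the smallest label wins.
tied = np.array([2, 2, 1, 1])
tied_values, tied_counts = np.unique(tied, return_counts=True)
tie_winner = tied_values[np.argmax(tied_counts)]
print(tie_winner)  # 1
```

With an even k such as 8, ties are possible, so this tie-breaking behaviour can affect predictions.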

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*TyqUebB-B9MN-XhcKmVGWA.png" width=50%>


In [13]:
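def KNNClassify(X_test, Y_train = y_train, X_train = X_train, k = 8): # default k is 8
    # X_test is the data of 1 flower (4 features)
    # X_train is the data of 120 flowers; Y_train holds their integer labels
    # note: the default Y_train/X_train are bound once, when this function is defined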

    ## Step 1:
    distance_list = []

    #for every example in the training set, calculate euclidean distance against the test example
    for i in range(len(X_train)):
        point = X_train[i]
        d1 = (point[0]-X_test[0])**2 # ** is the power operator: 5**2 is 5 squared (25)
        d2 = (point[1]-X_test[1])**2
        d3 = (point[2]-X_test[2])**2
        d4 = (point[3]-X_test[3])**2
        distance = np.sqrt(d1+d2+d3+d4)

        #place this calculated distance into the list
        distance_list.append( (distance,i) )

    ## Step 2:
    #sort distances in ascending order by distance
    sorted_distance_list = sorted(distance_list)

    #the k nearest neighbours are the top k points in the sorted distance list
    neighbours = sorted_distance_list[0:k]
    # for example, neighbours = [(352.1, 51),(371.2, 9),(373.5, 2),(376.0, 110),]

    #get index of the minimum distances
    neighbours_index = []
    for distance,idx in neighbours:
        neighbours_index.append(idx)
    # for example, neighbours_index = [51, 9, 2, 110]

    ## Step 3:
    #check which label has majority
    output = Y_train[neighbours_index]
    # for example, output = [1, 0, 0, 2] because #51 is iris_type1, #9 and #2 are iris_type0, #110 is iris_type2
    values, counts = np.unique(output, return_counts=True)

    #return label with majority occurrence
    max_idx = np.argmax(counts)
    return values[max_idx]
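
For comparison, the three steps can also be written without explicit loops. The function below is a hypothetical vectorized variant, not the notebook's own implementation; its name, the toy data, and the choice of `k=3` are assumptions made for illustration.

```python
import numpy as np

def knn_classify_vectorized(x_new, X_train, y_train, k=3):
    """Hypothetical loop-free variant of the three KNN steps."""
    # Step 1: Euclidean distance from x_new to every training row.
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances, kind="stable")[:k]
    # Step 3: majority vote among the labels of those neighbours.
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Two well-separated toy clusters with labels 0 and 1.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify_vectorized(np.array([0.3, 0.3]), X_toy, y_toy))  # 0
print(knn_classify_vectorized(np.array([4.8, 5.0]), X_toy, y_toy))  # 1
```

Passing the training data explicitly, rather than through default arguments, avoids surprises if `X_train` or `y_train` are reassigned later in the notebook.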

### Running the classifier to predict the labels for the testing set

In [14]:
#getting predicted values using our algorithm
predicted_y = []
for point in X_test:
    predicted_y.append(KNNClassify(point))

print(predicted_y)

[0, 2, 0, 0, 2, 0, 2, 1, 1, 1, 2, 2, 2, 1, 0, 1, 2, 2, 0, 1, 1, 1, 1, 0, 0, 0, 2, 1, 2, 0]


In [15]:
def accuracy(predictions , y_test):
    count = 0
    for i in range(len(predictions)):
        if predictions[i] == y_test[i]:
            count +=1
    return count/len(predictions)


#compare the predictions against y_test (the true labels)
acc = accuracy(predicted_y, y_test)
print("Accuracy =", acc*100, "%")

Accuracy = 96.66666666666667 %
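
The same accuracy can be computed without a loop by comparing two NumPy arrays element-wise. This is a minimal sketch on toy predictions (the values are made up); it also shows one way to find which samples were misclassified.

```python
import numpy as np

predictions = np.array([0, 2, 1, 1, 0])  # toy predicted labels
true_labels = np.array([0, 2, 2, 1, 0])  # toy true labels

# The comparison yields booleans; their mean is the fraction that match.
acc = np.mean(predictions == true_labels)
print(f"Accuracy = {acc * 100:.1f} %")  # Accuracy = 80.0 %

# Positions where the prediction differs from the true label.
wrong = np.flatnonzero(predictions != true_labels)
print(wrong)  # [2]
```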


### Exercise
1. Alice wants to use KNN to classify cat and dog images, using millions of training images. She wants to distribute this model on the internet. Is this feasible? Why or why not? (hint: think about the size of the trained model)
2. Explain why there are 4 variables added together in `distance = np.sqrt(d1+d2+d3+d4)`
3. Complete the code below so that the KNN classifier can classify 28x28-pixel greyscale images of handwritten digits. The inputs are vectors of 784 numbers, and the output is the label (10 categories)

In [48]:
def KNNClassify(X_test, Y_train,X_train, k = 8):
    distance_list = []
    for idx,point in enumerate(X_train):
        #exercise: code here

        distance_list.append( (distance,idx) )

    sorted_distance_list = sorted(distance_list)

    neighbours = sorted_distance_list[:k]
    neighbours_index = []
    for distance,idx in neighbours:
        neighbours_index.append(idx)

    output = Y_train[neighbours_index]
    values, counts = np.unique(output, return_counts=True)

    max_idx = np.argmax(counts)
    return values[max_idx]

4. Now we apply the KNN classifier to the handwritten digit dataset. Complete the code in the 2 blocks below (this will take 2 to 3 minutes to run)

In [40]:
# this block is completed for you
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
mnist = fetch_openml('mnist_784')
X = mnist.data / 255.0  # Scale pixel values to [0, 1]
y = mnist.target.astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
display(X_train)
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()




Unnamed: 0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
56331,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41538,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24892,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25192,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
55278,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [50]:
# this block is for exercise
for i in range(5):
    #getting predicted values using our algorithm
    point = X_test[i]
    predicted_y = ### exercise
    actual_y = ### exercise
    print("prediction:",predicted_y, "  actual:",actual_y)

prediction: 2 actual: 2
prediction: 1 actual: 1
prediction: 5 actual: 5
prediction: 7 actual: 7
prediction: 0 actual: 0


### References
1. https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
2. https://medium.com/analytics-vidhya/k-nn-from-scratch-212dcff13eb3
3. https://towardsdatascience.com/create-your-own-k-nearest-neighbors-algorithm-in-python-eb7093fc6339