# K-Nearest Neighbours

# 1 - Abstract

The objective is to write the functions required to implement the K-Nearest Neighbour (KNN) algorithm to classify the age of the abalone. The dataset originates from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Abalone).

#### Summary of results and findings

__I. Accuracy__

Based on the KNN implementation used to predict the age of the abalone, the experiment setup using a train-and-test split with a ratio of 70:30 and 20 k-neighbours achieved the best accuracy of 27.614%, where accuracy is the number of correct predictions as a percentage of the total number of predictions. However, this accuracy is low, which reflects how difficult it is for the algorithm to predict the abalone's age accurately.

One potential reason is the imbalanced data set. According to the data set readme file, there are 29 classes in total, but 17 of them have fewer than a hundred data points. Depending on how the data is split, there could therefore be insufficient data (i.e., neighbours) in the train set to predict such classes accurately. Gathering more samples of these classes of abalone might improve the model's predictions.

__II. Computational Time__

As for computational time, k-fold cross validation requires more computation time and resources than the train-and-test split. This is due to the additional iterations required to calculate the Euclidean distances to the train partitions and predict the output for the test partition in each fold. For more details on the computation of the time taken, refer to [section 5f](#plot).

# 2 - Introduction

A **classification** problem refers to grouping similar data, or classifying data into specific categories, using a predetermined set of features. The output of a classification problem is a discrete value (generally represented by an integer).

Specifically for this practicum, we'll attempt to classify the age of the abalone into 15 classes (i.e., Rings) with the following features:

|Features|Data Type|Measures|Description|
|--------|---------|--------|-----------|
|Sex|Nominal|M, F, I (Infant)|Nil|
|Length|Continuous|mm|Longest shell measurement|
|Diameter|Continuous|mm|Perpendicular to length|
|Height|Continuous|mm|With meat in shell|
|Whole weight|Continuous|grams|Whole abalone|
|Shucked weight|Continuous|grams|Weight of meat|
|Viscera weight|Continuous|grams|Gut weight (after bleeding)|
|Shell weight|Continuous|grams|After being dried|

The **K-Nearest Neighbours (KNN)** algorithm assumes that similar data points exist in close proximity. To predict a new data point, the distance between it and every point in the dataset is calculated. The nearest neighbouring data points are then used to predict the outcome of the new data point: the most common category among the neighbours for classification, or the mean of the neighbours for regression. "K" represents the number of neighbours used to predict the outcome.

There are many ways to calculate the distance between the data points. In this practicum, we'll use the Euclidean distance (formula as shown below) for the calculation.

$\text{Euclidean distance} = d(p,q) = \sqrt{(p_{1}-q_{1})^2 + (p_{2}-q_{2})^2 + \dots + (p_{n}-q_{n})^2}$
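
The formula above translates almost term by term into code. The following is a minimal sketch using only the standard library; the function name `euclidean_distance` is chosen here for illustration and is not taken from `knn.py`.

```python
import math

def euclidean_distance(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2) for two equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of features")
    # Sum the squared per-feature differences, then take the square root
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(d)  # 5.0
```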

When using KNN for prediction, it is crucial to normalise the data points so that the distance is not dominated by features measured on larger scales, which could result in misclassifications. Distance-dependent algorithms such as KNN implicitly put more weight on features (or independent variables) with larger scales.
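
To illustrate the effect of scale, the sketch below splits a squared Euclidean distance into the share contributed by each feature. The two samples are made-up values chosen only to show a small-scale feature next to a large-scale one; they are not rows from the abalone data set.

```python
# Made-up two-feature samples on very different scales (illustration only)
p = (0.45, 200.0)
q = (0.60, 210.0)

# Squared difference that each feature contributes to the squared distance
contributions = [(pi - qi) ** 2 for pi, qi in zip(p, q)]
total = sum(contributions)

# Fraction of the squared distance explained by each feature
shares = [c / total for c in contributions]
print(shares)
```

In this toy case the second feature accounts for more than 99% of the squared distance, so the first feature has almost no say in which neighbours are nearest.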

For this practicum, we will explore the following range of K-neighbours: 1, 5, 10, 15, 20 in the respective experiment setups and make a comparison on the accuracy and computation run time among the setups.

# 3 - Accuracy

The accuracy is calculated as the percentage of accurate predictions. Values are rounded to 3 decimal places.
* The highest and lowest accuracy for train-and-test split are 27.614% for 70:30 split where k-neighbour=20 and 18.861% for 50:50 split where k-neighbour=1. 
* The highest and lowest accuracy for k-fold cross validation are 25.659% for k-fold=15 where k-neighbour=20 and 19.641% for k-fold=5 where k-neighbour=1.
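
The accuracy measure described above can be sketched as follows. The helper name `accuracy_percent` and the toy labels are assumptions made for illustration; the actual computation happens inside `knn` in `knn.py`, which is not shown here.

```python
def accuracy_percent(actual, predicted):
    """Percentage of predictions that match the true labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)

# Two of the four made-up predictions match the true labels
acc = accuracy_percent([9, 10, 7, 8], [9, 11, 7, 10])
print(f"{acc:.3f}%")  # 50.000%
```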

For detailed execution of the experiment setups, please refer to [section 5d](#code) below. 

|Accuracy|Train-and-Test|    |    |Cross-Validation|    |    |
|--------|--------------|----|----|----------------|----|----|
| |0.7-0.3|0.6-0.4|0.5-0.5|5-fold|10-fold|15-fold|
|K=1|19.393%|19.270%|<mark>18.861%</mark>|<mark>19.641%</mark>|19.895%|19.976%|
|K=5|25.858%|24.536%|22.882%|23.401%|22.410%|23.285%|
|K=10|24.741%|25.613%|25.227%|24.048%|23.916%|24.580%|
|K=15|26.257%|26.272%|26.663%|25.102%|24.731%|24.724%|
|K=20|<mark>27.614%</mark>|27.169%|26.472%|25.269%|25.162%|<mark>25.659%</mark>|

# 4 - Running Time

The computational time is measured using `default_timer` from the `timeit` library. Values are rounded to 3 decimal places. For detailed execution of the experiment setups, please refer to [section 5f](#code) below.

|Computational Time|Train-and-Test|    |    |Cross-Validation|    |    |
|------------------|--------------|----|----|----------------|----|----|
| |0.7-0.3|0.6-0.4|0.5-0.5|5-fold|10-fold|15-fold|
|K=1|11.969 seconds|13.483 seconds|14.147 seconds|45.429 seconds|50.753 seconds|52.492 seconds|
|K=5|12.136 seconds|13.474 seconds|14.070 seconds|45.299 seconds|50.434 seconds|52.282 seconds|
|K=10|11.931 seconds|13.479 seconds|13.972 seconds|45.307 seconds|50.532 seconds|52.689 seconds|
|K=15|11.906 seconds|13.605 seconds|13.977 seconds|45.840 seconds|50.737 seconds|52.855 seconds|
|K=20|11.936 seconds|13.662 seconds|13.992 seconds|45.423 seconds|50.601 seconds|52.021 seconds|
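
The timing pattern behind the table can be wrapped in a small helper. This is a sketch only: `time_call` is a hypothetical name, and `sum` stands in for the `knn` call used in the experiments.

```python
from timeit import default_timer as timer

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = timer()
    result = func(*args, **kwargs)
    elapsed = timer() - start
    return result, elapsed

# Stand-in workload; in the experiments the timed call is knn(train, test, k)
result, elapsed = time_call(sum, range(1_000_000))
print(f"{elapsed:.3f} seconds")
```

A single run like this is sensitive to whatever else the machine is doing, so small differences between neighbouring cells of the table should be read with that in mind.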

# 5 - Tasks

### a. Load data from file 

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import all functions required for the KNN implementation
import numpy as np
from knn import *

# Fix the random seed so that the splits are reproducible
np.random.seed(42)

# Load dataset
X = loadData('abalone.data')

In [3]:
# Verify the size of dataset (4177,9)
X.shape

(4177, 9)

### b. Normalize the data set

Normalization equation:
$X' = \frac{X-min}{max-min}$
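
A minimal sketch of this equation applied column by column is shown below. It is not the `dataNorm` function from `knn.py` (whose source is not shown here); the name `min_max_normalise` and the toy data are assumptions. The sketch rescales every column it is given, so the label column would have to be left out before calling it.

```python
def min_max_normalise(rows):
    """Rescale every column of a list of equal-length rows to the range [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    normalised = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            # A constant column has max == min; map it to 0.0 to avoid dividing by zero
            new_row.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        normalised.append(new_row)
    return normalised

data = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
scaled = min_max_normalise(data)
print(scaled)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```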

In [4]:
# Normalisation of the 8 input attributes
X_norm = dataNorm(X)

In [5]:
# Verification of the result: mean and sum of each attribute: OK
testNorm([X_norm])

[0.47750066 0.60674608 0.59307774 0.12346584 0.29280756 0.24100033
 0.23712127 0.2365031  9.93368446]
[ 1994.52023988  2534.37837838  2477.28571429   515.71681416
  1223.05719851  1006.65837256   990.45556287   987.87344295
 41493.        ]


### c. Split the dataset into training and testing set 

In [6]:
# (i) Split dataset using train-and-test split method (70-30, 60-40 and 50-50)
X_split_TT70 = splitTT(X_norm, 0.7)
X_split_TT60 = splitTT(X_norm, 0.6)
X_split_TT50 = splitTT(X_norm, 0.5)

In [7]:
# Verification of train-and-test split: OK
testNorm(X_split_TT70)

[0.47750066 0.60674608 0.59307774 0.12346584 0.29280756 0.24100033
 0.23712127 0.2365031  9.93368446]
[ 1994.52023988  2534.37837838  2477.28571429   515.71681416
  1223.05719851  1006.65837256   990.45556287   987.87344295
 41493.        ]


In [8]:
testNorm(X_split_TT60)

[0.47750066 0.60674608 0.59307774 0.12346584 0.29280756 0.24100033
 0.23712127 0.2365031  9.93368446]
[ 1994.52023988  2534.37837838  2477.28571429   515.71681416
  1223.05719851  1006.65837256   990.45556287   987.87344295
 41493.        ]


In [9]:
testNorm(X_split_TT50)

[0.47750066 0.60674608 0.59307774 0.12346584 0.29280756 0.24100033
 0.23712127 0.2365031  9.93368446]
[ 1994.52023988  2534.37837838  2477.28571429   515.71681416
  1223.05719851  1006.65837256   990.45556287   987.87344295
 41493.        ]


In [10]:
# (ii) Split dataset using k-fold cross-validation method 
X_split_CV5 = splitCV(X_norm, 5)
X_split_CV10 = splitCV(X_norm, 10)
X_split_CV15 = splitCV(X_norm, 15)

In [11]:
# Verification of k-fold cross-validation split: OK
testNorm(X_split_CV5)

[0.47748988 0.60683282 0.5931646  0.12348259 0.29289753 0.24106963
 0.23719466 0.23657808 9.93508982]
[ 1993.52023988  2533.52702703  2476.46218487   515.53982301
  1222.84717549  1006.46570276   990.28768927   987.71350274
 41479.        ]


In [12]:
testNorm(X_split_CV10)

[0.47750066 0.60674608 0.59307774 0.12346584 0.29280756 0.24100033
 0.23712127 0.2365031  9.93368446]
[ 1994.52023988  2534.37837838  2477.28571429   515.71681416
  1223.05719851  1006.65837256   990.45556287   987.87344295
 41493.        ]


In [13]:
testNorm(X_split_CV15)

[0.47770323 0.60682481 0.59314632 0.12347892 0.29292453 0.2410988
 0.23723073 0.23659745 9.9352518 ]
[ 1992.02248876  2530.45945946  2473.42016807   514.90707965
  1221.49530724  1005.38197714   989.25213957   986.61136024
 41430.        ]
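
`splitTT` and `splitCV` are imported from `knn.py`, whose source is not shown here. As a rough idea of what such helpers might look like, the sketch below shuffles the rows and then either cuts them once or deals them into folds. The names `split_train_test` and `split_k_folds`, the `seed` parameter and the choice to keep every row are all assumptions, not the author's implementation.

```python
import random

def split_train_test(rows, train_fraction, seed=42):
    """Shuffle the rows and cut them into [train, test] by train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return [shuffled[:cut], shuffled[cut:]]

def split_k_folds(rows, n_folds, seed=42):
    """Shuffle the rows and deal them into n_folds nearly equal folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every n_folds-th row starting at i, so no row is dropped
    return [shuffled[i::n_folds] for i in range(n_folds)]

rows = list(range(10))
train, test = split_train_test(rows, 0.7)
folds = split_k_folds(rows, 3)
print(len(train), len(test), [len(f) for f in folds])  # 7 3 [4, 3, 3]
```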


### <a id='code'>d. & f. Implement KNN algorithm for K = 1, 5, 10, 15, 20 and compute both accuracy and computational time</a>

In [14]:
# KNN algorithm for train-and-test split (70-30) 
from timeit import default_timer as timer
k = [1, 5, 10, 15, 20]
acc_TT70 = []
time_TT70 = []
for num in k: 
    time_start = timer()
    acc = knn(X_split_TT70[0], X_split_TT70[1], num)
    time_elapsed = timer() - time_start
    print("Train-and-test split: 70-30")
    print(f"K-neighbours: {num}")
    print(f"Mean Accuracy for {num} neighbours: {acc:.3f}%")
    print(f"Computational time for {num} neighbours: {time_elapsed:.3f} seconds")
    print("******************************************************************")
    acc_TT70.append(acc)
    time_TT70.append(time_elapsed)

Train-and-test split: 70-30
K-neighbours: 1
Mean Accuracy for 1 neighbours: 19.393%
Computational time for 1 neighbours: 11.969 seconds
******************************************************************
Train-and-test split: 70-30
K-neighbours: 5
Mean Accuracy for 5 neighbours: 25.858%
Computational time for 5 neighbours: 12.136 seconds
******************************************************************
Train-and-test split: 70-30
K-neighbours: 10
Mean Accuracy for 10 neighbours: 24.741%
Computational time for 10 neighbours: 11.931 seconds
******************************************************************
Train-and-test split: 70-30
K-neighbours: 15
Mean Accuracy for 15 neighbours: 26.257%
Computational time for 15 neighbours: 11.906 seconds
******************************************************************
Train-and-test split: 70-30
K-neighbours: 20
Mean Accuracy for 20 neighbours: 27.614%
Computational time for 20 neighbours: 11.936 seconds
******************************************************************

In [15]:
# KNN algorithm for train-and-test split (60-40) 
acc_TT60 = []
time_TT60 = []

for num in k: 
    time_start = timer()
    acc = knn(X_split_TT60[0], X_split_TT60[1], num)
    time_elapsed = timer() - time_start
    print("Train-and-test split: 60-40")
    print(f"K-neighbours: {num}")
    print(f"Mean Accuracy for {num} neighbours: {acc:.3f}%")
    print(f"Computational time for {num} neighbours: {time_elapsed:.3f} seconds")
    print("******************************************************************")
    acc_TT60.append(acc)
    time_TT60.append(time_elapsed)

Train-and-test split: 60-40
K-neighbours: 1
Mean Accuracy for 1 neighbours: 19.270%
Computational time for 1 neighbours: 13.483 seconds
******************************************************************
Train-and-test split: 60-40
K-neighbours: 5
Mean Accuracy for 5 neighbours: 24.536%
Computational time for 5 neighbours: 13.474 seconds
******************************************************************
Train-and-test split: 60-40
K-neighbours: 10
Mean Accuracy for 10 neighbours: 25.613%
Computational time for 10 neighbours: 13.479 seconds
******************************************************************
Train-and-test split: 60-40
K-neighbours: 15
Mean Accuracy for 15 neighbours: 26.272%
Computational time for 15 neighbours: 13.605 seconds
******************************************************************
Train-and-test split: 60-40
K-neighbours: 20
Mean Accuracy for 20 neighbours: 27.169%
Computational time for 20 neighbours: 13.662 seconds
******************************************************************

In [16]:
# KNN algorithm for train-and-test split (50-50) 
acc_TT50 = []
time_TT50 = []

for num in k: 
    time_start = timer()
    acc = knn(X_split_TT50[0], X_split_TT50[1], num)
    time_elapsed = timer() - time_start
    print("Train-and-test split: 50-50")
    print(f"K-neighbours: {num}")
    print(f"Accuracy for {num} neighbours: {acc:.3f}%")
    print(f"Computational time for {num} neighbours: {time_elapsed:.3f} seconds")
    print("******************************************************************")
    acc_TT50.append(acc)
    time_TT50.append(time_elapsed)

Train-and-test split: 50-50
K-neighbours: 1
Accuracy for 1 neighbours: 18.861%
Computational time for 1 neighbours: 14.147 seconds
******************************************************************
Train-and-test split: 50-50
K-neighbours: 5
Accuracy for 5 neighbours: 22.882%
Computational time for 5 neighbours: 14.070 seconds
******************************************************************
Train-and-test split: 50-50
K-neighbours: 10
Accuracy for 10 neighbours: 25.227%
Computational time for 10 neighbours: 13.972 seconds
******************************************************************
Train-and-test split: 50-50
K-neighbours: 15
Accuracy for 15 neighbours: 26.663%
Computational time for 15 neighbours: 13.977 seconds
******************************************************************
Train-and-test split: 50-50
K-neighbours: 20
Accuracy for 20 neighbours: 26.472%
Computational time for 20 neighbours: 13.992 seconds
******************************************************************


In [17]:
## KNN algorithm for CV 5 
acc_CV5 = []
time_CV5 = []
folds_5 = len(X_split_CV5)

for num in k:
    acc_k = []
    time_start = timer()
    for i in range(0, folds_5):
        test = X_split_CV5[i]
        train = X_split_CV5.copy()
        train.pop(i)
        train = np.vstack(tuple(train))
        acc = knn(train, test, num)
        acc_k.append(acc)
    time_elapsed = timer() - time_start
    print("K-fold cross validation: k=5")
    print(f"K-neighbours: {num}")
    print(f"Accuracy Scores for {num} neighbours for each of 5-folds:")
    print(acc_k)
    print(f"Mean Accuracy for {num} neighbours: {np.mean(acc_k):.3f}%")
    print(f"Computational time for {num} neighbours: {time_elapsed:.3f} seconds")
    print("******************************************************************")
    acc_CV5.append(np.mean(acc_k))
    time_CV5.append(time_elapsed)

K-fold cross validation: k=5
K-neighbours: 1
Accuracy Scores for 1 neighbours for each of 5-folds:
[19.64071856287425, 18.20359281437126, 21.077844311377245, 19.760479041916167, 19.520958083832333]
Mean Accuracy for 1 neighbours: 19.641%
Computational time for 1 neighbours: 45.429 seconds
******************************************************************
K-fold cross validation: k=5
K-neighbours: 5
Accuracy Scores for 5 neighbours for each of 5-folds:
[23.592814371257482, 21.676646706586826, 22.035928143712574, 23.7125748502994, 25.98802395209581]
Mean Accuracy for 5 neighbours: 23.401%
Computational time for 5 neighbours: 45.299 seconds
******************************************************************
K-fold cross validation: k=5
K-neighbours: 10
Accuracy Scores for 10 neighbours for each of 5-folds:
[22.994011976047904, 21.676646706586826, 24.550898203592812, 24.790419161676645, 26.22754491017964]
Mean Accuracy for 10 neighbours: 24.048%
Computational time for 10 neighbours: 45.307 

In [18]:
## KNN algorithm for CV 10

acc_CV10 = []
time_CV10 = []
folds_10 = len(X_split_CV10)

for num in k:
    acc_k = []
    time_start = timer()
    for i in range(0, folds_10, 1):
        test = X_split_CV10[i]
        train = X_split_CV10.copy()
        train.pop(i)
        train = np.vstack(tuple(train))
        acc = knn(train, test, num)
        acc_k.append(acc)
    time_elapsed = timer() - time_start
    print("K-fold cross validation: k=10")
    print(f"K-neighbours: {num}")
    print(f"Accuracy Scores for {num} neighbours for all 10-folds:")
    print(acc_k)
    print(f"Mean Accuracy for {num} neighbours: {np.mean(acc_k):.3f}%")
    print(f"Computational time for {num} neighbours: {time_elapsed:.3f} seconds")
    print("******************************************************************")
    acc_CV10.append(np.mean(acc_k))
    time_CV10.append(time_elapsed)

K-fold cross validation: k=10
K-neighbours: 1
Accuracy Scores for 1 neighbours for all 10-folds:
[18.660287081339714, 21.770334928229666, 19.617224880382775, 17.942583732057415, 22.00956937799043, 19.37799043062201, 17.464114832535884, 20.334928229665074, 21.291866028708135, 20.481927710843372]
Mean Accuracy for 1 neighbours: 19.895%
Computational time for 1 neighbours: 52.492 seconds
******************************************************************
K-fold cross validation: k=10
K-neighbours: 5
Accuracy Scores for 5 neighbours for all 10-folds:
[22.727272727272727, 25.358851674641148, 19.617224880382775, 20.813397129186605, 20.095693779904305, 21.5311004784689, 17.703349282296653, 24.880382775119617, 27.033492822966508, 24.337349397590362]
Mean Accuracy for 5 neighbours: 22.410%
Computational time for 5 neighbours: 52.282 seconds
******************************************************************
K-fold cross validation: k=10
K-neighbours: 10
Accuracy Scores for 10 neighbours for all 1

In [19]:
## KNN algorithm for CV 15

acc_CV15 = []
time_CV15 = []
folds_15 = len(X_split_CV15)

for num in k:
    acc_k = []
    time_start = timer()
    for i in range(0, folds_15, 1):
        test = X_split_CV15[i]
        train = X_split_CV15.copy()
        train.pop(i)
        train = np.vstack(tuple(train))
        acc = knn(train, test, num)
        acc_k.append(acc)
    time_elapsed = timer() - time_start
    print("K-fold cross validation: k=15")
    print(f"K-neighbours: {num}")
    print(f"Accuracy Scores for {num} neighbours for each of 15-folds:")
    print(acc_k)
    print(f"Mean Accuracy for {num} neighbours: {np.mean(acc_k):.3f}%")
    print(f"Computational time for {num} neighbours: {time_elapsed:.3f} seconds")
    print("******************************************************************")
    acc_CV15.append(np.mean(acc_k))
    time_CV15.append(time_elapsed)


K-fold cross validation: k=15
K-neighbours: 1
Accuracy Scores for 1 neighbours for each of 15-folds:
[19.784172661870503, 20.14388489208633, 21.58273381294964, 20.863309352517987, 17.985611510791365, 18.705035971223023, 20.863309352517987, 20.503597122302157, 18.345323741007196, 15.827338129496402, 20.503597122302157, 20.14388489208633, 22.66187050359712, 18.345323741007196, 23.381294964028775]
Mean Accuracy for 1 neighbours: 19.976%
Computational time for 1 neighbours: 53.963 seconds
******************************************************************
K-fold cross validation: k=15
K-neighbours: 5
Accuracy Scores for 5 neighbours for each of 15-folds:
[25.179856115107913, 23.741007194244602, 26.618705035971225, 16.906474820143885, 23.381294964028775, 22.66187050359712, 19.784172661870503, 21.223021582733814, 21.58273381294964, 20.14388489208633, 21.942446043165468, 26.618705035971225, 29.856115107913666, 23.381294964028775, 26.258992805755394]
Mean Accuracy for 5 neighbours: 23.285%
Comp

### <a id="plot">Computational Time Plots</a>

Based on the plots below, we observe the following: 
1. Comparing the computation time required for the train-and-test splits, the time required to train and predict increases in the order 70:30, 60:40, 50:50. A likely explanation is that KNN computes a Euclidean distance between every test point and every train point, so the number of distance calculations is proportional to the product of the train and test set sizes, which is largest for the 50:50 split.
2. For k-fold cross validation, a higher number of folds requires more computational time. This is due to the increase in the number of train-and-test iterations as the number of folds increases.
3. The computational time for k-fold cross validation is significantly higher than for the train-and-test split. Thus, we can conclude that k-fold cross validation typically requires more computational resources than a train-and-test split, due to the additional iterations required to train and test on the different folds.
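
One way to sanity-check these observations is to count how many pairwise distances each setup has to compute, assuming every test row is compared with every train row, as `knn_pred` in section 5e does. The sketch below does only that counting; the function names are hypothetical and the fold sizes are assumed to be as equal as possible.

```python
def distance_count_train_test(n_rows, train_fraction):
    """Pairwise distances for one split: every test row against every train row."""
    n_train = int(n_rows * train_fraction)
    n_test = n_rows - n_train
    return n_train * n_test

def distance_count_k_fold(n_rows, n_folds):
    """Pairwise distances summed over all folds (fold sizes assumed near equal)."""
    total = 0
    for i in range(n_folds):
        # Fold i holds every n_folds-th row starting at i
        n_test = len(range(i, n_rows, n_folds))
        total += n_test * (n_rows - n_test)
    return total

N = 4177  # number of rows in the abalone data set
for frac in (0.7, 0.6, 0.5):
    print(f"train fraction {frac}: {distance_count_train_test(N, frac):,}")
for folds in (5, 10, 15):
    print(f"{folds}-fold: {distance_count_k_fold(N, folds):,}")
```

Under these assumptions the counts grow in the same order as the measured times: 70:30, 60:40, 50:50, then 5, 10 and 15 folds.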

In [30]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

K = [1, 5, 10, 15, 20]

y1 = time_TT70
y2 = time_TT60
y3 = time_TT50
y4 = time_CV5
y5 = time_CV10
y6 = time_CV15


fig = make_subplots(rows=2, cols = 3, shared_yaxes = True, subplot_titles=(
                    'Train-and-Test split 70-30', 'Train-and-Test split 60-40', 'Train-and-Test split 50-50',
                    'K-Fold cross-validation 5 folds', 'K-Fold cross-validation 10 folds', 
                    'K-Fold cross-validation 15 folds'))

fig.add_trace(
    go.Scatter(x=K, y=y1, name="TT_70_30"), row=1, col=1)
fig.add_trace(
    go.Scatter(x=K, y=y2, name="TT_60_40"), row=1, col=2)
fig.add_trace(
    go.Scatter(x=K, y=y3, name="TT_50_50"), row=1, col=3)
fig.add_trace(
    go.Scatter(x=K, y=y4, name="CV_5"), row=2, col=1)
fig.add_trace(
    go.Scatter(x=K, y=y5, name="CV_10"), row=2, col=2)
fig.add_trace(
    go.Scatter(x=K, y=y6, name="CV_15"), row=2, col=3)

fig.update_yaxes(title_text="Time (in seconds)", row=1, col=1)
fig.update_yaxes(title_text="Time (in seconds)", row=2, col=1)
fig.update_xaxes(title_text="K-Neighbours", row=2, col=2)

fig.update_annotations(font_size=12)
fig.update_layout(height = 500, width = 800, title_text="Computational Time Taken for Each Experiment Setup")
fig.show()

In [27]:
fig = make_subplots(rows=1, cols = 1)
fig.add_scatter(x=K, y=y1, name='TT_7030')
fig.add_scatter(x=K, y=y2, name='TT_6040')
fig.add_scatter(x=K, y=y3, name='TT_5050')
fig.update_yaxes(title_text="Time (in seconds)", range=[11, 15])
fig.update_xaxes(title_text="K-Neighbours", range=[0,21])
fig.update_layout(height = 400, width = 700, title_text="Computational Time Taken for Train-and-Test Split")
fig.show()

In [28]:
fig = make_subplots(rows=1, cols = 1)
fig.add_scatter(x=K, y=y4, name='CV_5')
fig.add_scatter(x=K, y=y5, name='CV_10')
fig.add_scatter(x=K, y=y6, name='CV_15')
fig.update_yaxes(title_text="Time (in seconds)", range=[44, 55])
fig.update_xaxes(title_text="K-Neighbours", range=[0,21])
fig.update_layout(height = 400, width = 700, title_text="Computational Time Taken for K-Fold Cross Validation")
fig.show()

### e. Classification report for 5-fold cross validation with K=15

The classification report for 5-fold cross validation with k-neighbours = 15 shows an F1-score of 0.25 and a weighted-average F1-score of 0.23. The F1-score is the harmonic mean of precision and recall, and is commonly used to measure a classifier's accuracy. The weighted-average F1 weights the F1-score of each class by the number of samples in that class.

The weighted averages for precision and recall are 0.22 and 0.25 respectively.

The accuracy measure does not account for the precision and recall of the KNN classifier, whereas the F1-score does. However, the F1-score is also relatively low, which reflects how difficult it is to predict the age of the abalone accurately.
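
The two quantities discussed above, the F1-score as a harmonic mean and the support-weighted average, can be sketched by hand as follows. The per-class precision, recall and support values are made up for illustration and are not taken from the report; the function names are likewise hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_average(scores, supports):
    """Average of per-class scores weighted by each class's sample count."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

# Made-up per-class (precision, recall, support) values for three classes
per_class = [(0.5, 0.5, 10), (0.25, 0.75, 30), (0.0, 0.0, 60)]
f1s = [f1_score(p, r) for p, r, _ in per_class]
supports = [n for _, _, n in per_class]
print(f1s)                              # [0.5, 0.375, 0.0]
print(weighted_average(f1s, supports))  # 0.1625
```

In this toy case the largest class has an F1-score of zero, so it pulls the weighted average well below the scores of the two smaller classes.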

In [23]:
def knn_pred(X_train, X_test, K): 
    predictions = []
    # Generate prediction for X_test
    for row in X_test:
        # Calculate the distance of the data points in X_train for each data point in X_test 
        distances = []
        for row_train in X_train:
            distance = 0
            # Calculated distance based on Euclidean distance
            for i in range(len(row)-1):
                distance += (row[i] - row_train[i])**2
            distance = np.sqrt(distance)
            distances.append((row_train, distance))
        # Sort the distances for the data point of X_test
        distances.sort(key=lambda elem: elem[1])
        # Get the "K" (specified number of neighbours) and predict the class for that data point
        neighbours = []
        for i in range(K):
            neighbours.append(distances[i][0])
        output_vals = [n[-1] for n in neighbours]
        prediction = max(set(output_vals), key=output_vals.count)
        predictions.append(prediction)
    # Extract the actual labels 
    actual = X_test[:,-1]
    return actual, predictions

In [24]:
from sklearn.metrics import classification_report 

k = 15 
f_5 = len(X_split_CV5)
print("K-fold cross validation: k=5 & K-neighbours = 15")
actual = [] 
pred = []
for i in range(0, f_5, 1): 
    test = X_split_CV5[i]
    train = X_split_CV5.copy()
    train.pop(i)
    train = np.vstack(tuple(train))
    actual_i, pred_i = knn_pred(train, test, k)
    for val in actual_i: 
        actual.append(val)
    for val in pred_i: 
        pred.append(val)
clsrep = classification_report(actual, pred)
print("Classification report:")
print(clsrep)

K-fold cross validation: k=5 & K-neighbours = 15
Classification report:
              precision    recall  f1-score   support

         1.0       0.00      0.00      0.00         1
         2.0       0.00      0.00      0.00         1
         3.0       0.00      0.00      0.00        15
         4.0       0.40      0.47      0.43        57
         5.0       0.28      0.29      0.28       115
         6.0       0.30      0.30      0.30       259
         7.0       0.29      0.32      0.30       389
         8.0       0.31      0.34      0.32       568
         9.0       0.25      0.36      0.29       689
        10.0       0.21      0.31      0.25       634
        11.0       0.25      0.24      0.24       487
        12.0       0.15      0.07      0.10       267
        13.0       0.15      0.04      0.07       203
        14.0       0.07      0.01      0.01       126
        15.0       0.00      0.00      0.00       103
        16.0       0.17      0.03      0.05        67
        1