# k-Nearest-Neighbors Classification

This notebook demonstrates k-nearest-neighbors (k-NN) classification in two ways: implemented by hand with pandas, and with scikit-learn's `KNeighborsClassifier`.

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

We will use this classifier to predict the species of iris flower from its petal and sepal measurements.

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [None]:
# Load the train, test, and validation sets from the Iris dataset.
# From what we discussed in class, each of these splits has a specific purpose:
#
#   - Training set: Used to train the k-nearest-neighbors algorithm. In practice,
#     it will be used as reference material, and will be compared against any new
#     examples of iris flowers we want to classify.
#   - Validation set: Used to test the k-nearest-neighbors classifier while we
#     find an ideal value of k.
#   - Testing set: Used to perform one final evaluation on the best-performing value
#     of k. This evaluation gives us an idea of how the model will perform on new,
#     previously unseen data.
train_df = pd.read_csv("https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/iris/iris_train.csv")
val_df = pd.read_csv("https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/iris/iris_val.csv")
test_df = pd.read_csv("https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/iris/iris_test.csv")
test_df

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.5,2.3,4.0,1.3,Iris-versicolor
1,6.7,3.0,5.0,1.7,Iris-versicolor
2,6.9,3.2,5.7,2.3,Iris-virginica
3,4.6,3.2,1.4,0.2,Iris-setosa
4,4.8,3.0,1.4,0.1,Iris-setosa
5,6.7,3.1,4.4,1.4,Iris-versicolor
6,6.2,3.4,5.4,2.3,Iris-virginica
7,5.5,3.5,1.3,0.2,Iris-setosa
8,5.6,3.0,4.5,1.5,Iris-versicolor
9,4.9,2.5,4.5,1.7,Iris-virginica
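
The CSV files above are already split. As a rough sketch of how a three-way split like this could be produced, the snippet below uses scikit-learn's bundled copy of the Iris data and `train_test_split`. The 60/20/20 proportions, the `random_state`, and the variable names are assumptions for illustration; they are not how the course files were created.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled Iris data as a DataFrame: 4 feature columns plus a 'target' column.
df = load_iris(as_frame=True).frame

# Hold out 20% of the rows as the test set (assumed proportion).
train_val, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["target"]
)

# Split the remainder 75/25 into training and validation sets,
# which gives 60% / 20% of the original data.
train, val = train_test_split(
    train_val, test_size=0.25, random_state=0, stratify=train_val["target"]
)

print(len(train), len(val), len(test))
```

Stratifying on the label keeps the three species in similar proportions in every split, which matters for small datasets like this one.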




1. Make the necessary modifications to prepare the input columns
2. Run the hand-written k-NN pseudocode
3. Create the classifier, fit it, and report accuracy

Key:

```python

    # Compute the Euclidean distance between the current row and the training data.
    # The exponent must be parenthesized: **1/2 would parse as (x**1)/2.
    train_df['distance'] = ((train_df[input_cols] - input)**2).sum(axis=1)**(1/2)

    # Use majority voting to predict the species
    predictions = train_df.sort_values('distance')['Species'].iloc[:K].value_counts()

```
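
As a minimal sketch of the two key steps, the snippet below applies them to a tiny made-up reference set. The names `toy_train`, `toy_cols`, `query`, and `k`, and all measurements in it, are hypothetical and not taken from the Iris CSV files. The square root is written as `** 0.5`; note that `** 1/2` would be parsed as `(x ** 1) / 2`, which halves the sum instead of taking its root.

```python
import pandas as pd

# Hypothetical reference set; the values are made up for illustration.
toy_train = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.3, 4.5, 4.7, 5.9],
    "PetalWidthCm": [0.2, 0.2, 1.5, 1.4, 2.1],
    "Species": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica",
    ],
})
toy_cols = ["PetalLengthCm", "PetalWidthCm"]

# Hypothetical flower to classify.
query = pd.Series({"PetalLengthCm": 4.6, "PetalWidthCm": 1.3})
k = 3

# Euclidean distance from the query to every reference row.
distances = ((toy_train[toy_cols] - query) ** 2).sum(axis=1) ** 0.5

# Take the labels of the k closest rows and count the votes.
nearest = toy_train.loc[distances.sort_values().index[:k], "Species"]
votes = nearest.value_counts()

# The label with the most votes is the prediction.
predicted = votes.idxmax()
print(predicted)
```

Here the three nearest rows are two versicolor flowers and one virginica, so the majority vote is `Iris-versicolor`.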


In [None]:
# Example 1: The k-nearest-neighbors classification algorithm

# This section of code finds the best value of k for classification.
# We try several values of k and measure the accuracy of each one on the
# validation set.

# Dataframe columns used as input
input_cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']


for K in range(1, 10):
  # For computing accuracy
  total = 0
  correct = 0

  # Perform k-nn classification on each row in the validation set
  for _, row in val_df.iterrows():
    input = row[input_cols]
    species = row['Species']

    # Compute the Euclidean distance between the current row and the training data.
    # The exponent must be parenthesized: **1/2 would parse as (x**1)/2.
    train_df['distance'] = ((train_df[input_cols] - input)**2).sum(axis=1)**(1/2)

    # Use majority voting to predict the species
    predictions = train_df.sort_values('distance')['Species'].iloc[:K].value_counts()

    # Was it correct? value_counts() sorts by count, so the first index label
    # is the majority species.
    correct += predictions.index[0] == species
    total += 1

  print('k:', K, 'accuracy:', correct / total)

k: 1 accuracy: 0.9444444444444444
k: 2 accuracy: 0.9444444444444444
k: 3 accuracy: 0.9444444444444444
k: 4 accuracy: 0.9444444444444444
k: 5 accuracy: 0.9444444444444444
k: 6 accuracy: 0.9444444444444444
k: 7 accuracy: 1.0
k: 8 accuracy: 1.0
k: 9 accuracy: 1.0


In [None]:
# Example 2: k-nearest-neighbors classification with scikit-learn

# Dataframe columns used as input
input_cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

for K in range(1, 10):
  # Create a K-nearest-neighbors classifier object and train it
  knn = KNeighborsClassifier(n_neighbors=K)
  knn.fit(train_df[input_cols], train_df['Species'])

  # Make predictions with the validation set and compute accuracy
  predictions = knn.predict(val_df[input_cols])
  print('k:', K, 'accuracy:', metrics.accuracy_score(val_df['Species'], predictions))

k: 1 accuracy: 0.9444444444444444
k: 2 accuracy: 0.9444444444444444
k: 3 accuracy: 0.9444444444444444
k: 4 accuracy: 0.9444444444444444
k: 5 accuracy: 0.9444444444444444
k: 6 accuracy: 0.9444444444444444
k: 7 accuracy: 1.0
k: 8 accuracy: 1.0
k: 9 accuracy: 1.0
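
Rather than reading the best k off the printed list, the choice can be made in code. The sketch below is one way to do it, assuming scikit-learn's bundled Iris data and an arbitrary train/validation split as stand-ins for `train_df` and `val_df`; the names `scores` and `best_k` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: the bundled Iris set with an assumed 75/25 split.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Record the validation accuracy for each candidate k.
scores = {}
for k in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_val, y_val)

# max() returns the first key with the highest score, so ties go to
# the smallest k because the keys were inserted in increasing order.
best_k = max(scores, key=scores.get)
print("best k:", best_k, "validation accuracy:", scores[best_k])
```

Breaking ties toward the smallest k is a design choice of this sketch; preferring a larger k among tied values is also defensible, since larger neighborhoods tend to smooth out noisy points.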


In [None]:
# k=7 is the smallest value that reached 100% accuracy on the validation set.
# Let's retrain the classifier with this value and perform one final evaluation
# with the test set.

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(train_df[input_cols], train_df['Species'])

# The accuracy on the test set is about 88.9%. This is lower than the 100%
# validation accuracy, which is expected: k was chosen because it did well on
# the validation set, so that score is optimistic. The test set is not used to
# keep developing the model; it gives a general idea of how the model might
# perform when classifying new examples of irises.
metrics.accuracy_score(test_df['Species'], knn.predict(test_df[input_cols]))

0.8888888888888888
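
A single accuracy number does not show which species are being confused with each other. A confusion matrix does; the sketch below shows the idea with `metrics.confusion_matrix` on short, hypothetical label lists (`y_true` and `y_pred` are made up here and are not the notebook's test results).

```python
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
y_true = ["Iris-setosa", "Iris-setosa",
          "Iris-versicolor", "Iris-versicolor",
          "Iris-virginica", "Iris-virginica"]
y_pred = ["Iris-setosa", "Iris-setosa",
          "Iris-versicolor", "Iris-virginica",
          "Iris-virginica", "Iris-virginica"]

# Rows are true species, columns are predicted species, in `labels` order.
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Accuracy is the diagonal (correct predictions) over the total.
accuracy = metrics.accuracy_score(y_true, y_pred)
print(accuracy)
```

In the notebook, the same call could be made with `test_df['Species']` and `knn.predict(test_df[input_cols])` in place of the hypothetical lists.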