# k‑Nearest Neighbors (kNN)

In this notebook, you'll work through two parts:

1. Use a small dataset to implement and test the k‑Nearest Neighbors algorithm.
2. Apply kNN to a new scenario (animal classification) and experiment with different values of k.

Read the instructions in each cell carefully and complete the code where indicated.

## Part 1: Exploration with a Small Dataset

In this section, you'll:
- Define a function to calculate the Euclidean distance.
- Create a small dataset (height, weight, and class).
- Define a test point and visualize the dataset.
- Compute distances and classify the test point using majority voting.

**Step 1:** First, import NumPy.

**Step 2:** Define the Euclidean distance function. Fill in the missing code wherever you see an ellipsis (`...`).
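
For reference, here is the two-dimensional Euclidean distance written out, together with the optional test case from the cell below worked by hand. The symbols $p$ and $q$ are notation assumed here for the two points; they are not names used in the notebook.

$$
d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2},
\qquad
d\big((0, 0), (3, 4)\big) = \sqrt{3^2 + 4^2} = \sqrt{25} = 5
$$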

In [None]:
# TODO: Steps 1 and 2: Import NumPy, then define the Euclidean distance function
# Complete the function so that it returns the Euclidean distance between two points
import ... as ...

def euclidean_distance(point1, point2):
    # HINT: Use the formula sqrt((x2 - x1)^2 + (y2 - y1)^2)
    # Replace the following line with your code if desired:
    return ...((...[...] - ...[...])**2 + (...[...] - ...[...])**2)

# Test your function (optional)
print(euclidean_distance([0, 0], [3, 4]))  # Expected output: 5.0

### Creating the Dataset
Next, create a small dataset. Each data point has two features (height and weight) and a class label (e.g., 'A' or 'B').

**Step 3:** Add at least two more data points to the dataset. First decide on a pattern for what is classified as 'A' and what is classified as 'B', then make your new data points follow that pattern. Feel free to add more data points if you'd like.

In [None]:
# TODO: Step 3: Complete your dataset
data_points = [
    [160, 60, 'A'],
    [165, 55, 'A'],
    [170, 75, 'B'],
    [175, 80, 'B'],
    [..., ..., ...],
    [..., ..., ...]
]

# Feel free to modify or add more data points if you'd like.

### Defining the Test Point and Visualization
Now, define a new test point that you want to classify, and then visualize the dataset using matplotlib.

In [None]:
# TODO: Step 4: Define a test point
test_point = [..., ...]

# TODO: Step 5: Visualize the dataset
import matplotlib.pyplot as plt

heights = [point[...] for point in data_points]
weights = [point[...] for point in data_points]
classes = [point[...] for point in data_points]

plt.figure(figsize=(6,4))
plt.scatter(heights, weights, c=['red' if c == 'A' else 'blue' for c in classes], marker='o')
plt....(...)  # Set the x-axis label (height)
plt....(...)  # Set the y-axis label (weight)
plt....(...)  # Set the title of the plot

# Mark the test point on the plot (optional)
plt.scatter(test_point[...], test_point[...], c='green', marker='*', s=200, label='Test Point')
plt.legend()
plt.show()

### Computing Distances and Classifying the Test Point
Now compute the distance from the test point to each data point, sort the distances, choose a value for **k** (for example, 3), and then perform majority voting to predict the class.
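
The voting step on its own can be sketched as follows, using hypothetical `(distance, label)` pairs that are already sorted. The helper name `majority_vote` and the sample values are assumptions for illustration, not part of the exercise. The sketch uses `collections.Counter` instead of the `max(set(...))` expression in the cell below, so that a tie is resolved in a predictable way.

```python
from collections import Counter

def majority_vote(sorted_pairs, k):
    """Return the most common label among the first k (distance, label) pairs."""
    labels = [label for _, label in sorted_pairs[:k]]
    # most_common(1) returns a list holding one (label, count) tuple
    return Counter(labels).most_common(1)[0][0]

# Hypothetical pairs, sorted by distance in ascending order
example_pairs = [(1.0, 'A'), (2.5, 'B'), (3.0, 'B')]

vote_k3 = majority_vote(example_pairs, 3)  # labels A, B, B: 'B' wins 2 to 1
vote_k2 = majority_vote(example_pairs, 2)  # labels A, B: a 1 to 1 tie
print(vote_k3, vote_k2)
```

With two classes, an even k can produce a tie. In Python 3.7 and later, `Counter.most_common` orders equal counts by first appearance, so the tie here goes to the nearest neighbor's label. Choosing an odd k avoids the question entirely for two-class data.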

In [None]:
# TODO: Step 6: Compute distances from the test point to each data point
distances = []
for point in data_points:
    dist = ...(..., ...)  # Use only the features (height and weight) of point, not its class label
    distances.append((..., ...))  # Append a tuple: (distance from the test point, class label of the point)

# Sort distances in ascending order
distances.sort(key=lambda x: x[0])
print("Distances (sorted):", distances)

# TODO: Step 7: Choose a value for k (e.g., k = 3) and classify the test point
k = ...
nearest_neighbors = distances[:k]

neighbor_classes = [neighbor[1] for neighbor in nearest_neighbors]
predicted_class = max(set(neighbor_classes), key=neighbor_classes.count)
print(f"Predicted class for test point {test_point} with k={k}: {predicted_class}")

## Part 2: Applying kNN to a New Scenario

In this section, you'll use kNN to classify animals as either **Herbivore** or **Carnivore** based on their weight and height.

You will:
- Create a new dataset of animals.
- Define a new test animal.
- Visualize the new dataset.
- Compute distances and classify the test animal using different values of **k**.

In [None]:
# TODO: Step 1: Create a new dataset for animals. Fill in the rest of animal_data.
# Look up the weight and height of different Herbivores and Carnivores online.
# Each data point is in the form: [weight (kg), height (cm), class]
animal_data = [
    [6000, 300, 'Herbivore'],   # Elephant
    [190, 120, 'Carnivore'],    # Lion
    [..., ..., ...],
    [..., ..., ...],
    [..., ..., ...],
    [..., ..., ...]
]

In [None]:
# TODO: Step 2: Define a new test animal
test_animal = [..., ...]  # [weight, height]

In [None]:
# TODO: Step 3: Visualize the animal dataset
weights = [...[...] for animal in animal_data]
heights = [...[...] for animal in animal_data]
classes = [...[...] for animal in animal_data]

# We will use different colors for each class: green for Herbivore, orange for Carnivore
colors = ['green' if c == 'Herbivore' else 'orange' for c in classes]

plt.figure(figsize=(6,4))
plt.scatter(weights, heights, c=colors, marker='s')
plt....(...)  # Set the x-axis label (weight)
plt....(...)  # Set the y-axis label (height)
plt....(...)  # Set the title of the plot
plt.show()

In [None]:
# TODO: Step 4: Compute distances for the new dataset
animal_distances = ...
for animal in animal_data:
    dist = ...
    animal_distances.append((..., ...))  # Append a tuple: (distance from the test animal, class label of the animal)

animal_distances.sort(key=lambda x: x[0])
print("Animal Distances (sorted):", animal_distances)

In [None]:
# TODO: Step 5: Classify the test animal using different k values
def classify(test_point, data, k):
    distances = []
    for item in data:
        d = euclidean_distance(test_point, item[:2])
        distances.append((d, item[2]))
    distances.sort(key=lambda x: x[0])
    nearest = distances[:k]
    classes = [neighbor[1] for neighbor in nearest]
    return max(set(classes), key=classes.count)

for k in [..., ..., ...]:  # Choose 3 different k values
    prediction = classify(test_animal, animal_data, k)
    print(f"Predicted class for test animal {test_animal} with k={k}: {prediction}")

### Reflection

You have implemented and experimented with the k‑Nearest Neighbors algorithm on two datasets. Reflect on how changing the value of **k** affects the prediction.
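
As one possible starting point for the reflection, the sketch below shows a prediction flipping as k grows. The data in `toy_points`, the query `toy_query`, and the helper `knn_predict` are all made up for illustration, and `math.dist` (Python 3.8+) stands in for the notebook's `euclidean_distance` so that the block runs on its own.

```python
import math
from collections import Counter

def knn_predict(query, labeled_points, k):
    """Predict a label for query by majority vote among its k nearest points."""
    # Rank points by Euclidean distance to the query (features are the first two entries)
    ranked = sorted(labeled_points, key=lambda p: math.dist(query, p[:2]))
    labels = [p[2] for p in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical data: one 'A' point sits right next to the query,
# two 'B' points are a little farther away, and two 'A' points are far away.
toy_points = [
    [1, 1, 'A'],
    [2, 2, 'B'],
    [2, 3, 'B'],
    [6, 6, 'A'],
    [7, 7, 'A'],
]
toy_query = [1.2, 1.2]

results = {k: knn_predict(toy_query, toy_points, k) for k in (1, 3, 5)}
print(results)  # {1: 'A', 3: 'B', 5: 'A'}
```

In this toy case, k = 1 follows the single closest point, k = 3 is outvoted by the two nearby 'B' points, and k = 5 uses every point, so the overall majority class wins regardless of distance.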

Notebook built with OpenAI o3 Mini High for the 1000 Scientist AI Jam

Additional edits by: Sage Miller