In [8]:
import numpy as np

### KNN Classification Task

Each fruit in the dataset is described by three features:

1. **Weight (g)**: The weight of the fruit.
2. **Size (cm)**: The size of the fruit.
3. **Color**: Encoded as:
   - `0`: Yellow
   - `1`: Red
   - `2`: Orange

The goal is to implement the **K-Nearest Neighbors (KNN)** algorithm from scratch using only Python and NumPy to classify new fruit samples into one of three types:

- **Apple**
- **Banana**
- **Orange**

The dataset serves as training data for the KNN model, which will predict the fruit type for test samples based on feature similarity.

In [9]:
data = [
    [150, 7.0, 1, 'Apple'],
    [120, 6.5, 0, 'Banana'],
    [180, 7.5, 2, 'Orange'],
    [155, 7.2, 1, 'Apple'],
    [110, 6.0, 0, 'Banana'],
    [190, 7.8, 2, 'Orange'],
    [145, 7.1, 1, 'Apple'],
    [115, 6.3, 0, 'Banana']
]

# Dictionary to encode fruit labels into numeric values
label_encoding = {
    'Apple': 0,
    'Banana': 1,
    'Orange': 2
}

# Replace the fruit type in each row with its corresponding numeric label
for row in data:
    row[3] = label_encoding[row[3]]

# Convert the list of lists into a NumPy array for easier numerical processing
data = np.array(data, dtype=float)

In [10]:
X = data[:, :-1]              # Features (all columns except the last one)
y = data[:, -1].astype(int)   # Labels (last column), cast back to integer class IDs

### Euclidean Distance Function

The `euclidean_distance` function calculates the straight-line distance between two points in a multi-dimensional space using the **Euclidean Distance formula**:
$$
d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$
where $x_i$ and $y_i$ are the $i$-th coordinates of the two points and $n$ is the number of dimensions.

#### Steps:
1. **Convert Inputs to NumPy Arrays**:  
   Ensures compatibility for vectorized operations.

2. **Dimension Check**:  
   Validates that both points have the same number of dimensions.

3. **Squared Differences**:  
   Calculates $(x_i - y_i)^2$ for each dimension.

4. **Sum and Square Root**:  
   Sums the squared differences and takes the square root to compute the distance.
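
As a quick check of the four steps, the sketch below works through one pair of points by hand: the first test sample used later in this notebook and the Banana training row `[120, 6.5, 0]`. The variable names are illustrative, and `np.linalg.norm` appears only as a cross-check, not as part of the from-scratch implementation.

```python
import numpy as np

# Two points with the same three features: weight (g), size (cm), color code
p = np.array([118, 6.2, 0])   # a test sample
q = np.array([120, 6.5, 0])   # a Banana row from the training data

squared_diff = (p - q) ** 2             # approximately [4.0, 0.09, 0.0]
distance = np.sqrt(squared_diff.sum())  # sqrt(4.09), roughly 2.02

# Cross-check against NumPy's built-in vector norm
assert np.isclose(distance, np.linalg.norm(p - q))
print(round(float(distance), 4))
```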

In [11]:
def euclidean_distance(point1, point2):
    # Convert inputs to numpy arrays for vectorized operations
    point1 = np.array(point1)
    point2 = np.array(point2)
    
    # Check if points have the same dimensions
    if point1.shape != point2.shape:
        raise ValueError("Points must have the same dimensions")
    
    # Calculate squared differences
    squared_diff = (point1 - point2) ** 2
    
    # Sum the squared differences and take the square root
    distance = np.sqrt(np.sum(squared_diff))
    
    return distance

### K-Nearest Neighbors (KNN) Implementation

This class implements the K-Nearest Neighbors algorithm, which:

1. **Stores** all training examples without building a model (lazy learning)
2. **Classifies** new instances by their similarity to the stored examples

#### Algorithm Steps:
- **Training**: Simply stores labeled data points
- **Prediction**: For each test sample:
  1. Calculate distances to all training points
  2. Find k closest neighbors
  3. Use majority vote to determine class

#### Key Components:
- **`__init__(k=3)`**: Sets the number of neighbors to consider
- **`fit(X, y)`**: Stores training data without computation
- **`predict(X_test)`**: Processes multiple samples in batch
- **`predict_one(x)`**: Core logic for a single prediction:
  - Calculates distances using Euclidean metric
  - Selects k nearest points using `np.argsort`
  - Determines majority class with `Counter`
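
The single-sample prediction logic can be sketched in a few lines. The version below is a standalone illustration, not the class used in this notebook: the function name `knn_predict_one` and the toy arrays are assumptions, and it computes all distances at once with NumPy broadcasting instead of calling `euclidean_distance` in a loop.

```python
from collections import Counter

import numpy as np


def knn_predict_one(x, X_train, y_train, k=3):
    """Hypothetical helper: classify one sample by majority vote of its k nearest neighbors."""
    # Distances from x to every training row, computed in one broadcasted step
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    k_indices = np.argsort(distances)[:k]
    # Count the neighbors' labels and return the most frequent one
    votes = Counter(y_train[k_indices].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two clusters on a line
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 0, 1, 1])

print(knn_predict_one(np.array([1.5]), X_toy, y_toy))   # all three neighbors are class 0
print(knn_predict_one(np.array([10.4]), X_toy, y_toy))  # two class-1 neighbors outvote one class-0
```

Broadcasting `X_train - x` subtracts the sample from every training row, so no Python-level loop over the training set is needed.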

In [12]:
from collections import Counter

class KNN:
    def __init__(self, k=3):
        # Initialize with default k=3 neighbors to consider
        self.k = k
        self.X_train = None  # Will store training features
        self.y_train = None  # Will store training labels
    
    def fit(self, X, y):
        # Store training data (KNN doesn't actually "train" - just memorizes data)
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self  # Return instance for method chaining
    
    def predict(self, X_test):
        # Convert input to numpy array and predict labels for all test samples
        X_test = np.array(X_test)
        # Get prediction for each test sample using list comprehension
        predictions = np.array([self.predict_one(x) for x in X_test])
        return predictions
    
    def predict_one(self, x):
        # Calculate Euclidean distance between test sample and all training samples
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        
        # Get indices of k smallest distances using argsort
        k_indices = np.argsort(distances)[:self.k]
        
        # Extract labels of the nearest neighbors using the indices
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        
        # Majority vote: most_common(1) returns [(label, count)]; ties go to the
        # label counted first, i.e. the one belonging to the nearer neighbor
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]  # Extract the label from the (label, count) pair

### Using KNN for Fruit Classification

This code demonstrates how to apply the KNN model to classify new fruit samples:

#### Test Data:
The test data contains three fruit samples with features:
- **Sample 1**: Weight=118g, Size=6.2cm, Yellow color (0) → Expected: Banana
- **Sample 2**: Weight=160g, Size=7.3cm, Red color (1) → Expected: Apple  
- **Sample 3**: Weight=185g, Size=7.7cm, Orange color (2) → Expected: Orange

#### Prediction Process:
1. **Instantiate**: Create KNN classifier with k=3
2. **Train**: Store the training dataset using `fit()`
3. **Predict**: For each test sample:
   - Calculate distances to all training points
   - Find 3 nearest neighbors
   - Assign majority class via voting

#### Output Processing:
- Convert numeric predictions (0,1,2) to fruit names using a dictionary
- Compare predictions with expected values to verify accuracy

The k=3 parameter means each prediction is based on the 3 most similar fruits from the training set.
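
To make the role of k concrete, the sketch below lists the three nearest training rows for Sample 3. It repeats the training data from the notebook so that it runs on its own; the variable names are illustrative. Because the training set contains only two Orange rows, the third neighbor is necessarily another fruit, and the vote is 2 to 1.

```python
import numpy as np

# Training rows copied from the dataset above: weight (g), size (cm), color code
X_train = np.array([
    [150, 7.0, 1], [120, 6.5, 0], [180, 7.5, 2], [155, 7.2, 1],
    [110, 6.0, 0], [190, 7.8, 2], [145, 7.1, 1], [115, 6.3, 0],
])
names = np.array(["Apple", "Banana", "Orange", "Apple",
                  "Banana", "Orange", "Apple", "Banana"])

sample = np.array([185, 7.7, 2])  # Sample 3, expected: Orange

# Euclidean distance from the sample to every training row
distances = np.sqrt(((X_train - sample) ** 2).sum(axis=1))
nearest = np.argsort(distances)[:3]  # indices of the 3 closest rows

# Show each neighbor's label, features, and distance to the sample
for i in nearest:
    print(names[i], X_train[i], round(float(distances[i]), 2))
```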

In [13]:
# Test data:
test_data = np.array([
    [118, 6.2, 0],  # Expected: Banana
    [160, 7.3, 1],  # Expected: Apple
    [185, 7.7, 2]   # Expected: Orange
])

# Create a KNN classifier; k=3 is also the class default, but is passed explicitly for clarity
knn = KNN(k=3)

# Fit the model using training data (X: features, y: labels)
knn.fit(X, y)

# Predict labels for the test data using the trained KNN model
predictions = knn.predict(test_data)

# Map numeric labels to corresponding fruit names for better readability
fruit_names = {0: "Apple", 1: "Banana", 2: "Orange"}
predicted_fruits = [fruit_names[label] for label in predictions]

# Print predicted fruit types and expected fruit types for comparison
print("\nPredicted fruits:", predicted_fruits)
print("Expected fruits: ['Banana', 'Apple', 'Orange']")


Predicted fruits: ['Banana', 'Apple', 'Orange']
Expected fruits: ['Banana', 'Apple', 'Orange']
