# Lab Assignment 2 - Part B: k-Nearest Neighbor Classification
Please refer to the `README.pdf` for full laboratory instructions.


## Problem Statement
In this part, you will implement the k-Nearest Neighbor (k-NN) classifier and evaluate it on two datasets:
- **Lenses Dataset**: A small dataset for contact lens prescription
- **Credit Approval (CA) Dataset**: Credit card application data with binary labels (+/-)

### Your Tasks
1. **Preprocess the data**: Handle missing values and normalize features
2. **Implement k-NN** with L2 distance
3. **Evaluate** on both datasets for different values of k
4. **Discuss** your results

### Datasets
The data files are located in the `credit 2017/` folder:
- `lenses.training`, `lenses.testing`
- `crx.data.training`, `crx.data.testing`
- `crx.names` (describes the features)


## Setup


In [46]:
# Library declarations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter


In [47]:
# Data paths
DATA_PATH = "credit 2017/"

# Load Lenses data
def load_lenses_data():
    """Load the lenses dataset."""
    train_data = np.loadtxt(DATA_PATH + "lenses.training", delimiter=',')
    test_data = np.loadtxt(DATA_PATH + "lenses.testing", delimiter=',')
    
    # First column is ID, last column is label
    X_train = train_data[:, 1:-1]
    y_train = train_data[:, -1]
    X_test = test_data[:, 1:-1]
    y_test = test_data[:, -1]
    
    return X_train, y_train, X_test, y_test

# Load Credit Approval data
def load_credit_data():
    """
    Load the Credit Approval dataset.
    Note: This dataset contains missing values (?) and mixed types.
    You will need to preprocess it.
    """
    # TODO: Implement data loading
    # The data is comma-separated
    # Missing values are marked with '?'
    # Last column is the label ('+' or '-')
    
    train_data_crv = pd.read_csv(DATA_PATH+ "crx.data.training", header=None, na_values='?', dtype=str)
    test_data_crv = pd.read_csv(DATA_PATH+ "crx.data.testing", header=None, na_values='?', dtype=str)
    # test_data_crv = np.loadtxt(DATA_PATH , delimiter=',')
        # First column is ID, last column is label
    X_train = train_data_crv.iloc[:, :-1].to_numpy()
    y_train = train_data_crv.iloc[:, -1].to_numpy()
    X_test = test_data_crv.iloc[:, :-1].to_numpy()
    y_test = test_data_crv.iloc[:, -1].to_numpy()
    
    return X_train, y_train, X_test, y_test




# Test loading lenses data
X_train_lenses, y_train_lenses, X_test_lenses, y_test_lenses = load_lenses_data()
print(f"Lenses - Train: {X_train_lenses.shape}, Test: {X_test_lenses.shape}")

X_train_crv, y_train_crv, X_test_crv, y_test_crv = load_credit_data()
print(f"Lenses - Train: {X_train_crv.shape}, Test: {X_test_crv.shape}")


Lenses - Train: (18, 3), Test: (6, 3)
Lenses - Train: (552, 15), Test: (138, 15)


## Task 1: Data Preprocessing
For the Credit Approval dataset, you need to:
1. **Handle missing values** (marked with '?'):
   - Categorical features: replace with mode/median
   - Numerical features: replace with label-conditioned mean
2. **Normalize features** using z-scaling:
   $$z_i^{(m)} = \frac{x_i^{(m)} - \mu_i}{\sigma_i}$$

Document exactly how you handle each feature!


In [48]:
def preprocess_credit_data(train_file, test_file):
    """
    Preprocess the Credit Approval dataset.
    
    Steps:
    1. Load the data
    2. Handle missing values
    3. Encode categorical variables
    4. Normalize numerical features
    
    Returns:
    --------
    X_train, y_train, X_test, y_test : numpy arrays
    """
    # TODO: Implement preprocessing
    # Hint: Read crx.names to understand the features
    # Feature types (from crx.names):
    # A1: categorical (b, a)
    # A2: continuous
    # A3: continuous
    # A4: categorical (u, y, l, t)
    # A5: categorical (g, p, gg)
    # A6: categorical (c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff)
    # A7: categorical (v, h, bb, j, n, z, dd, ff, o)
    # A8: continuous
    # A9: categorical (t, f)
    # A10: categorical (t, f)
    # A11: continuous
    # A12: categorical (t, f)
    # A13: categorical (g, p, s)
    # A14: continuous
    # A15: continuous
    # A16: label (+, -)
    
    # Step 1 load data
    train_df = pd.read_csv(train_file, header=None, na_values='?', dtype=str)
    test_df = pd.read_csv(test_file, header=None, na_values='?', dtype=str)
    num_cols = train_df.shape[1]
    col_name = [f'A{i}' for i in range(1,num_cols+1)]
    train_df.columns = col_name
    test_df.columns = col_name
    numerical_cols = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']
    categorical_cols = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
    label_col = 'A16'

    # Step 2 missing data
    # label
    train_df[label_col] = train_df[label_col].map({'+': 1, '-': 0})
    test_df[label_col] = test_df[label_col].map({'+': 1, '-': 0})
    # 
    for col in numerical_cols:
        train_df[col] = pd.to_numeric(train_df[col], errors='coerce')
        test_df[col] = pd.to_numeric(test_df[col], errors='coerce')
        # mean_val = train_df[col].mean()
        # mean_val_test = test_df[col].mean()

        # train_df[col] = train_df[col].fillna(mean_val)
        # test_df[col] = test_df[col].fillna(mean_val_test)

        train_df[col] = train_df[col].fillna(train_df.groupby(label_col)[col].transform('mean'))
        test_df[col] = test_df[col].fillna(test_df.groupby(label_col)[col].transform('mean'))
    # categorical
    for col in categorical_cols:
        mode_val = train_df[col].mode()[0]
        mode_val_test = test_df[col].mode()[0]
        train_df[col] = train_df[col].fillna(mode_val)
        test_df[col] = test_df[col].fillna(mode_val_test)

  


    # Step 3: One-Hot Encoding
    n_train = len(train_df)
    n_test = len(test_df)

    combined_df = pd.concat([train_df, test_df], axis=0)
    combined_df_encoded = pd.get_dummies(combined_df, columns=categorical_cols)
    
    train_encoded = combined_df_encoded.iloc[:n_train, :].copy()
    test_encoded = combined_df_encoded.iloc[n_test:, :].copy()



    # Step 4: Z-Score normalization
    for col in numerical_cols:
        mean = train_encoded[col].mean()
        std = train_encoded[col].std()
        if std == 0: std = 1
        train_encoded[col] = (train_encoded[col] - mean) / std
        test_encoded[col] = (test_encoded[col] - mean) / std

    # This step is used for the next l2_norm
    #     For **categorical attributes**, use:
    # - Distance = 1 if values are different
    # - Distance = 0 if values are the same
    
    dummy_cols = [c for c in train_encoded.columns 
                  if c not in numerical_cols and c != label_col]
    
    scale_factor = 1.0 / np.sqrt(2)  # 0.7071
    
    train_encoded[dummy_cols] *= scale_factor
    test_encoded[dummy_cols] *= scale_factor

    y_train = train_encoded[label_col].values
    X_train = train_encoded.drop(columns=[label_col]).values
    
    y_test = test_encoded[label_col].values
    X_test = test_encoded.drop(columns=[label_col]).values


    return X_train, y_train, X_test, y_test







# def z_normalize(X_train, X_test, feature_indices):
#     """
#     Apply z-score normalization to specified features.
    
#     Parameters:
#     -----------
#     X_train, X_test : numpy arrays
#     feature_indices : list of indices for numerical features
    
#     Returns:
#     --------
#     X_train_normalized, X_test_normalized : numpy arrays
#     """
#     # TODO: Implement z-normalization
#     pass


## Task 2: Implement k-NN Classifier
Implement k-NN with L2 (Euclidean) distance:
$$\mathcal{D}_{L2}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_i (a_i - b_i)^2}$$

For **categorical attributes**, use:
- Distance = 1 if values are different
- Distance = 0 if values are the same


In [49]:
def l2_distance(a, b):
    """
    Compute L2 (Euclidean) distance between two vectors.
    
    Parameters:
    -----------
    a, b : numpy arrays of same shape
    
    Returns:
    --------
    distance : float
    """
    # TODO: Implement L2 distance

    distances = np.linalg.norm(a-b)
    
    return distances


def knn_predict(X_train, y_train, X_test, k):
    """
    Predict labels for test data using k-NN.
    
    Parameters:
    -----------
    X_train : numpy array of shape (n_train, n_features)
    y_train : numpy array of shape (n_train,)
    X_test : numpy array of shape (n_test, n_features)
    k : int, number of neighbors
    
    Returns:
    --------
    predictions : numpy array of shape (n_test,)
    """
    # TODO: Implement k-NN prediction
    # For each test sample:
    #   1. Compute distance to all training samples
    #   2. Find k nearest neighbors
    #   3. Predict using majority voting
    # pass
    

    # It is more convenient to use calculate the distance in matrix and vector
    
    num_test = X_test.shape[0]
    num_train = X_train.shape[0]
    y_pred = np.zeros(num_test)
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        # Broadcasting
        # (N, D) - (D,) -> (N, D) 
        dists[i, :] = np.linalg.norm(X_train - X_test[i], axis=1)     
        closest_y_indices = np.argsort(dists[i, :])[:k]
        closest_y = y_train[closest_y_indices]

        counts = np.bincount(closest_y.astype(int))
    
        y_pred[i] = np.argmax(counts)

    return y_pred


def compute_accuracy(y_true, y_pred):
    """
    Compute classification accuracy.
    
    Returns:
    --------
    accuracy : float (between 0 and 1)
    """
    # TODO: Implement accuracy computation
    accu =  np.mean(y_pred == y_true)

    return accu



## Task 3: Evaluate on Lenses Dataset
Test your k-NN implementation on the Lenses dataset for different values of k.


In [50]:
# TODO: Evaluate k-NN on Lenses dataset
# Try different values of k (e.g., 1, 3, 5, 7)

k_values = [1, 3, 5, 7]
lenses_results = []

for k in k_values:
    predictions = knn_predict(X_train_lenses, y_train_lenses, X_test_lenses, k)
    accuracy = compute_accuracy(y_test_lenses, predictions)
    lenses_results.append((k, accuracy))
    print(f"k={k}: Accuracy = {accuracy:.4f}")
# print(X_train_lenses, X_test_lenses)
print(predictions,y_test_lenses)



k=1: Accuracy = 1.0000
k=3: Accuracy = 1.0000
k=5: Accuracy = 0.5000
k=7: Accuracy = 0.8333
[3. 3. 3. 2. 3. 2.] [3. 1. 3. 2. 3. 2.]


## Task 4: Evaluate on Credit Approval Dataset
First preprocess the data, then evaluate k-NN.


In [51]:
# TODO: Preprocess Credit Approval data
X_train_credit, y_train_credit, X_test_credit, y_test_credit = preprocess_credit_data(
    DATA_PATH + "crx.data.training",
    DATA_PATH + "crx.data.testing"
)


In [52]:
# TODO: Evaluate k-NN on Credit Approval dataset
k_values = [1, 3, 5, 7]
credit_results = []

for k in k_values:
    predictions = knn_predict(X_train_credit, y_train_credit, X_test_credit, k)
    accuracy = compute_accuracy(y_test_credit, predictions)
    credit_results.append((k, accuracy))
    print(f"k={k}: Accuracy = {accuracy:.4f}")


k=1: Accuracy = 0.9529
k=3: Accuracy = 0.8768
k=5: Accuracy = 0.8605
k=7: Accuracy = 0.8551


## Summary and Discussion

### Results Table

| Dataset | k=1 | k=3 | k=5 | k=7 |
|---------|-----|-----|-----|-----|
| Lenses | 1 | 1 | 0.5 | 0.8333 |
| Credit Approval | 0.9529 | 0.8768 | 0.8587 | 0.8551 |

### Discussion
*Answer these questions:*
1. Which value of k works best for each dataset? Why do you think that is?
2. How did preprocessing affect your results on the Credit Approval dataset?
3. What are the trade-offs of using different values of k?
4. What did you learn from this exercise?


1. **Which value of k works best for each dataset? Why do you think that is?**

Credit Approval Dataset: $k=1$ performed best with an accuracy of 0.9529. The accuracy consistently decreased as $k$ increased ($k=7$ dropped to 0.8551).

Lenses Dataset: $k=1$ and $k=3$ performed best, both achieving 1.0000 accuracy. However, performance dropped drastically at $k=5$ (0.5000).


2. **How did preprocessing affect your results on the Credit Approval dataset?**

(1) Normalization (Z-Score): Since KNN relies on Euclidean distance, features with large magnitudes (like A15 or A14) would have completely dominated the distance calculation without normalization.

(2) One-Hot Encoding: Converting categorical variables (like A6, A7) into binary vectors allowed us to mathematically calculate distances for non-numeric data.

3. **What are the trade-offs of using different values of k?**


Small $k$ (e.g., $k=1$): Pros: Low bias; captures intricate, local structures in the data.

Cons: High variance; very sensitive to noise and outliers. If a test point is near a mislabeled training point, it will be misclassified.

Large $k$ (e.g., $k=7$):Pros: Low variance; smoothes out noise and decision boundaries, making the model more robust to outliers.

Cons: High bias; computational cost is slightly higher (conceptually), and it risks underfitting. As seen in the Lenses dataset ($k=5$), if $k$ is too large, it can "drown out" the signal from the true minority class by including too many neighbors from the majority class.


4. **What did you learn from this exercise?**

Metric Sensitivity: I learned that KNN is not a "plug-and-play" algorithm; it is highly sensitive to how distance is defined. Mixed-type data (continuous and categorical) requires careful weighting (like the $\sqrt{2}$ scaling we discussed) to ensure consistency(Distance = 1 if they are not accurate).