# K-Nearest Neighbours Wine Quality Classifier
https://github.com/AreedAdmin/BUSI70261-Machine-Learning.git

Group K: 
Shehab Hassani 06071687
Fares gharbi 06052076
Riddhima Tanwar 06015675
Leni


## Table of Contents

1. [Imports](#1-imports)
2. [Load Data](#2-load-data)
3. [Create Binary Target](#3-create-binary-target)
4. [Stratified Data Splitting Function](#4-stratified-data-splitting-function)
5. [Z-Score Normalization Function](#5-z-score-normalization-function)
6. [Weighted k-NN with Cross-Validation Function](#6-weighted-k-nn-with-cross-validation-function)
7. [Test Evaluation Function](#7-test-evaluation-function)
8. [CV Validation Curve Plotting Function](#8-cv-validation-curve-plotting-function)
9. [Train/Test Split and Model Training](#9-traintest-split-and-model-training)
10. [Results Summary](#10-results-summary)
11. [Confusion Matrix](#11-confusion-matrix)
12. [Conclusion & Findings](#12-conclusion--findings)

## 1. Imports
We import the required libraries:
- **NumPy & Pandas**: Data manipulation and numerical operations
- **Plotly**: Interactive visualization for validation curves
- **Scikit-learn**: k-NN classifier, evaluation metrics, and cross-validation utilities

In [None]:
import numpy as np
import pandas as pd

import plotly.graph_objects as go

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, cross_val_score

## 2. Load Data
We load the sparkling wine dataset from a CSV file.
- **1,600 samples** of wine with physicochemical measurements
- **11 input features**: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free/total sulfur dioxide, density, pH, sulphates, alcohol
- **Target**: quality score (0-10 scale)

In [17]:
filepath = 'sparklingwine.csv'
df = pd.read_csv(filepath, index_col=0)

print(f"Shape: {df.shape}")

Shape: (1600, 12)


In [18]:
print(f"Columns: {list(df.columns)}")

Columns: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']


In [20]:
print(f"Missing values: {df.isnull().sum()}")

Missing values: fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


## 3. Create Binary Target

In [8]:
threshold = 6
df['good_wine'] = (df['quality'] >= threshold).astype(int)

print(f"\nBinary target created:")
print(f"  Good wines (quality >= {threshold}): {df['good_wine'].sum()}")
print(f"  Not good wines (quality < {threshold}): {(df['good_wine'] == 0).sum()}")


Binary target created:
  Good wines (quality >= 6): 1073
  Not good wines (quality < 6): 527
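
The class counts above give a useful reference point: a trivial classifier that labels every wine as "Good" is already right about two thirds of the time. The short sketch below works this out from the printed counts; the variable names are illustrative and not part of the notebook's pipeline.

```python
# Class counts taken from the output above.
n_good = 1073
n_not_good = 527
n_total = n_good + n_not_good

# Accuracy of a trivial classifier that labels every wine as "Good".
baseline_accuracy = n_good / n_total

print(f"Total samples: {n_total}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Any model accuracy reported later is best read relative to this baseline rather than relative to 50%.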


## 4. Stratified Data Splitting Function

In [9]:
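def split_data_stratified(df, train_ratio, val_ratio, test_ratio, random_state=42):
    """Split df into stratified train/validation/test arrays (features and binary target)."""
    # train_ratio is implied by the other two ratios, so check that all three are consistent.
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-9, "ratios must sum to 1"
    
    feature_cols = [col for col in df.columns if col not in ['quality', 'good_wine']]
    
    X = df[feature_cols].values
    y = df['good_wine'].values
    
    # The validation set is drawn from what remains after the test set is removed,
    # so its fraction is rescaled relative to that remainder.
    test_size = test_ratio
    val_size_adjusted = val_ratio / (1 - test_ratio)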
    
    sss_test = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    train_val_idx, test_idx = next(sss_test.split(X, y))
    
    X_train_val, y_train_val = X[train_val_idx], y[train_val_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    
    sss_val = StratifiedShuffleSplit(n_splits=1, test_size=val_size_adjusted, random_state=random_state)
    train_idx, val_idx = next(sss_val.split(X_train_val, y_train_val))
    
    X_train, y_train = X_train_val[train_idx], y_train_val[train_idx]
    X_val, y_val = X_train_val[val_idx], y_train_val[val_idx]
    
    print(f"\nStratified Data Split (with shuffle):")
    print(f"  Training set: {X_train.shape[0]} samples (Class 0: {sum(y_train==0)}, Class 1: {sum(y_train==1)})")
    print(f"  Validation set: {X_val.shape[0]} samples (Class 0: {sum(y_val==0)}, Class 1: {sum(y_val==1)})")
    print(f"  Test set: {X_test.shape[0]} samples (Class 0: {sum(y_test==0)}, Class 1: {sum(y_test==1)})")
    
    class_ratio_train = sum(y_train==1) / len(y_train)
    class_ratio_val = sum(y_val==1) / len(y_val)
    class_ratio_test = sum(y_test==1) / len(y_test)
    print(f"\n  Class balance (% Good Wine):")
    print(f"    Train: {class_ratio_train:.2%}, Val: {class_ratio_val:.2%}, Test: {class_ratio_test:.2%}")
    
    return X_train, y_train, X_val, y_val, X_test, y_test
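
The rescaling of the validation fraction in the function above can be checked by hand. The sketch below repeats the arithmetic for the 80/15/5 split on 1,600 samples, using exact fractions so that floating-point rounding does not blur the result. It only illustrates the ratio logic and does not reproduce scikit-learn's internal rounding rules.

```python
from fractions import Fraction

n_total = 1600
val_ratio = Fraction(15, 100)
test_ratio = Fraction(5, 100)

# Step 1: the test set is carved out of the full dataset.
n_test = int(n_total * test_ratio)
n_remaining = n_total - n_test

# Step 2: the validation fraction is rescaled relative to what is left.
val_size_adjusted = val_ratio / (1 - test_ratio)
n_val = int(n_remaining * val_size_adjusted)
n_train = n_remaining - n_val

print(f"Adjusted validation fraction: {float(val_size_adjusted):.4f}")
print(f"Train: {n_train}, Val: {n_val}, Test: {n_test}")
```

Taking 15% of the remaining 1,520 samples directly would give only 228 validation samples, which is why the adjustment is needed.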

## 5. Z-Score Normalization Function
We standardize all features to zero mean and unit variance using `z = (x - μ) / σ`.

We normalize the features because k-NN uses Euclidean distance, so features with larger scales would dominate the distance calculation.
- Example: alcohol (roughly 8-15) spans a far wider numeric range than chlorides (roughly 0.01-0.1), so without scaling, differences in alcohol would dominate the distance and chlorides would be almost ignored.

We compute the mean and standard deviation from the training set only, then apply them to the validation and test sets. This prevents data leakage.
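
To make the scale argument concrete, the sketch below compares how much of the squared Euclidean distance between two wines comes from alcohol, before and after dividing by the standard deviation. The feature differences and standard deviations are hypothetical values chosen for illustration, not statistics from the dataset.

```python
import math

# Hypothetical differences between two wines, in raw units.
raw_diff = {"alcohol": 1.0, "chlorides": 0.05}
# Hypothetical training-set standard deviations for the same features.
train_std = {"alcohol": 1.0, "chlorides": 0.05}

def alcohol_share(diff):
    """Fraction of the squared Euclidean distance contributed by alcohol."""
    total = sum(d ** 2 for d in diff.values())
    return diff["alcohol"] ** 2 / total

# Dividing a difference by sigma is what z-scoring does to it (the mean cancels).
scaled_diff = {name: raw_diff[name] / train_std[name] for name in raw_diff}

raw_distance = math.sqrt(sum(d ** 2 for d in raw_diff.values()))
scaled_distance = math.sqrt(sum(d ** 2 for d in scaled_diff.values()))

print(f"Raw distance: {raw_distance:.4f}, alcohol share: {alcohol_share(raw_diff):.2%}")
print(f"Scaled distance: {scaled_distance:.4f}, alcohol share: {alcohol_share(scaled_diff):.2%}")
```

In raw units alcohol accounts for over 99% of the squared distance. After scaling, both features differ by one standard deviation and contribute equally.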

In [10]:
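def zscore_normalize(X_train, X_val, X_test):
    """Standardise all three sets using the training set's mean and standard deviation."""
    mean = np.mean(X_train, axis=0)
    std = np.std(X_train, axis=0)
    
    # Guard against division by zero for constant features.
    std[std == 0] = 1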
 
    X_train_norm = (X_train - mean) / std
    X_val_norm = (X_val - mean) / std
    X_test_norm = (X_test - mean) / std
    
    print(f"\nZ-score normalization applied using training set statistics.")
    
    return X_train_norm, X_val_norm, X_test_norm

## 6. Weighted k-NN with Cross-Validation Function
We find the optimal k using **5-fold Stratified Cross-Validation**:

**Distance-weighted voting** `weights='distance'`
- Closer neighbours contribute more to the prediction than distant ones, reducing noise sensitivity.

**Stratified K-Fold CV**
- Splits training data into 5 folds, each maintaining class balance
- Trains on 4 folds, validates on 1, rotates through all combinations
- Returns mean accuracy ± std for each k value (1 to 100)

The k with the highest mean CV accuracy is selected as optimal.
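
The effect of distance-weighted voting can be seen on a toy example. The sketch below contrasts a uniform vote with an inverse-distance vote, which is the weighting scikit-learn applies for `weights='distance'`. The three neighbours are hypothetical, and the helper functions are illustrative only (they assume all distances are strictly positive).

```python
from collections import defaultdict

# Hypothetical neighbours of one query point: (distance, class label).
neighbours = [(0.5, 1), (2.0, 0), (2.5, 0)]

def uniform_vote(neighbours):
    """Every neighbour counts equally."""
    votes = defaultdict(float)
    for _, label in neighbours:
        votes[label] += 1.0
    return max(votes, key=votes.get)

def distance_weighted_vote(neighbours):
    """Each neighbour's vote is weighted by the inverse of its distance."""
    votes = defaultdict(float)
    for distance, label in neighbours:
        votes[label] += 1.0 / distance
    return max(votes, key=votes.get)

print(f"Uniform vote predicts class {uniform_vote(neighbours)}")
print(f"Distance-weighted vote predicts class {distance_weighted_vote(neighbours)}")
```

Two of the three neighbours belong to class 0, so the uniform vote picks class 0. The single class 1 neighbour is much closer (weight 2.0 against a combined 0.9), so the weighted vote picks class 1.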

In [11]:
def train_and_evaluate_knn_cv(X_train, y_train, k_values, n_folds=5):
    
    cv_mean_accuracies = []
    cv_std_accuracies = []
    cv_mean_errors = []
    
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    
    print(f"\nTraining weighted k-NN with {n_folds}-fold Stratified Cross-Validation...")
    print(f"Testing k = {min(k_values)} to {max(k_values)}...")
    
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k, weights='distance')
        cv_scores = cross_val_score(knn, X_train, y_train, cv=skf, scoring='accuracy')
        
        cv_mean_accuracies.append(cv_scores.mean())
        cv_std_accuracies.append(cv_scores.std())
        cv_mean_errors.append(1 - cv_scores.mean())

    best_idx = np.argmax(cv_mean_accuracies)
    best_k = k_values[best_idx]
    best_accuracy = cv_mean_accuracies[best_idx]
    best_std = cv_std_accuracies[best_idx]
    
    print(f"\nCross-Validation Results:")
    print(f"  Best k: {best_k}")
    print(f"  Best CV accuracy: {best_accuracy:.4f} (+/- {best_std:.4f})")
    print(f"  Best CV error: {1 - best_accuracy:.4f}")
    
    return best_k, best_accuracy, cv_mean_accuracies, cv_mean_errors, cv_std_accuracies
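
One subtlety in the function above: `X_train` has already been z-scored using statistics from the entire training set, so within each CV round the held-out fold has contributed to the mean and standard deviation. The effect is usually small, but it can be avoided by fitting the scaler inside each fold. The sketch below is a possible variant, not the approach used in this notebook. It assumes scikit-learn's `make_pipeline` and `StandardScaler`, uses synthetic data from `make_classification` as a stand-in for the wine features, and picks `n_neighbors=15` arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not the wine dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, weights=[0.33, 0.67], random_state=42
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is refitted on the training portion of every fold,
# so the held-out fold never influences the scaling statistics.
pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=15, weights='distance'),
)

fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=skf, scoring='accuracy')
print(f"Mean CV accuracy: {fold_scores.mean():.4f} (+/- {fold_scores.std():.4f})")
```

With this variant, the raw (unnormalised) training features would be passed to the cross-validation step, and the pipeline would handle the scaling.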

## 7. Test Evaluation Function

In [12]:
def evaluate_on_test(X_train, y_train, X_test, y_test, best_k):
    
    knn = KNeighborsClassifier(n_neighbors=best_k, weights='distance')
    knn.fit(X_train, y_train)
    
    y_pred = knn.predict(X_test)
    
    test_accuracy = accuracy_score(y_test, y_pred)
    test_error = 1 - test_accuracy
    
    print(f"\nTest Set Results (Weighted k-NN, k={best_k}):")
    print(f"  Test accuracy: {test_accuracy:.4f}")
    print(f"  Generalisation error: {test_error:.4f}")
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Not Good', 'Good']))
    print(f"Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    
    return test_accuracy, test_error, knn

## 8. CV Validation Curve Plotting Function

In [13]:
def plot_cv_validation_curve(k_values, cv_errors, cv_stds, title, filename):
    
    best_idx = np.argmin(cv_errors)
    best_k = k_values[best_idx]
    best_error = cv_errors[best_idx]
    best_std = cv_stds[best_idx]
    
    cv_errors = np.array(cv_errors)
    cv_stds = np.array(cv_stds)
    
    hover_text = [f'k={k}<br>CV Error={err:.4f} ± {std:.4f}<br>CV Accuracy={1-err:.4f}' 
                  for k, err, std in zip(k_values, cv_errors, cv_stds)]
    
    fig = go.Figure()
    
    fig.add_trace(go.Scatter(
        x=k_values + k_values[::-1],
        y=list(cv_errors + cv_stds) + list((cv_errors - cv_stds)[::-1]),
        fill='toself',
        fillcolor='rgba(51, 102, 204, 0.2)',
        line=dict(color='rgba(255,255,255,0)'),
        name='±1 Std Dev',
        hoverinfo='skip'
    ))
    
    fig.add_trace(go.Scatter(
        x=k_values,
        y=cv_errors,
        mode='lines+markers',
        name='Mean CV Error',
        line=dict(color='#3366CC', width=2),
        marker=dict(size=6, color='#3366CC'),
        hovertext=hover_text,
        hoverinfo='text'
    ))
    
    fig.add_trace(go.Scatter(
        x=[best_k],
        y=[best_error],
        mode='markers',
        name=f'Best k={best_k} (Error={best_error:.4f}±{best_std:.4f})',
        marker=dict(size=16, color='#DC3912', symbol='star'),
        hovertext=f'<b>BEST</b><br>k={best_k}<br>CV Error={best_error:.4f}±{best_std:.4f}<br>CV Accuracy={1-best_error:.4f}',
        hoverinfo='text'
    ))
    
    fig.update_layout(
        title=dict(text=title, font=dict(size=16)),
        xaxis=dict(
            title='k (Number of Neighbours)',
            tickmode='linear',
            dtick=10,
            gridcolor='rgba(128,128,128,0.2)'
        ),
        yaxis=dict(
            title='Cross-Validation Error',
            gridcolor='rgba(128,128,128,0.2)'
        ),
        legend=dict(x=0.65, y=0.98),
        hovermode='closest',
        template='plotly_white',
        width=900,
        height=500
    )
    
    html_filename = filename.replace('.png', '.html')
    fig.write_html(html_filename)
    fig.show()
    print(f"\nInteractive plot saved as '{html_filename}'")

## 9. Train/Test Split and Model Training

In [14]:
print("\n" + "=" * 60)
print("STRATIFIED DATA SPLIT: 80% Train / 15% Validation / 5% Test")
print("=" * 60)

X_train, y_train, X_val, y_val, X_test, y_test = split_data_stratified(
    df, train_ratio=0.80, val_ratio=0.15, test_ratio=0.05, random_state=42
)

X_train_norm, X_val_norm, X_test_norm = zscore_normalize(X_train, X_val, X_test)

k_values = list(range(1, 101))
best_k, best_cv_acc, cv_accuracies, cv_errors, cv_stds = train_and_evaluate_knn_cv(
    X_train_norm, y_train, k_values, n_folds=5
)

test_acc, test_error, classifier = evaluate_on_test(
    X_train_norm, y_train, X_test_norm, y_test, best_k
)

plot_cv_validation_curve(
    k_values, cv_errors, cv_stds,
    'Weighted k-NN Cross-Validation Error vs k (80/15/5 Split)',
    'validation_curve.png'
)


STRATIFIED DATA SPLIT: 80% Train / 15% Validation / 5% Test

Stratified Data Split (with shuffle):
  Training set: 1280 samples (Class 0: 422, Class 1: 858)
  Validation set: 240 samples (Class 0: 79, Class 1: 161)
  Test set: 80 samples (Class 0: 26, Class 1: 54)

  Class balance (% Good Wine):
    Train: 67.03%, Val: 67.08%, Test: 67.50%

Z-score normalization applied using training set statistics.

Training weighted k-NN with 5-fold Stratified Cross-Validation...
Testing k = 1 to 100...

Cross-Validation Results:
  Best k: 56
  Best CV accuracy: 0.7672 (+/- 0.0227)
  Best CV error: 0.2328

Test Set Results (Weighted k-NN, k=56):
  Test accuracy: 0.7875
  Generalisation error: 0.2125

Classification Report:
              precision    recall  f1-score   support

    Not Good       0.76      0.50      0.60        26
        Good       0.79      0.93      0.85        54

    accuracy                           0.79        80
   macro avg       0.78      0.71      0.73        80
weighted


Interactive plot saved as 'validation_curve.html'


## 10. Results Summary
Here is the final model configuration and performance metrics:
- **Data split sizes**: Number of samples in each set
- **Optimal k**: Best hyperparameter found via CV
- **CV Accuracy**: Expected performance based on cross-validation
- **Test Accuracy**: Actual performance on unseen data
- **Generalisation Error**: 1 - Test Accuracy (estimate of real-world error)

In [None]:
print("\n" + "=" * 22)
print("FINAL RESULTS SUMMARY")
print("=" * 22)

print(f"\nData Split: 80% Train / 15% Validation / 5% Test")
print(f"  Training samples: {len(y_train)}")
print(f"  Validation samples: {len(y_val)}")
print(f"  Test samples: {len(y_test)}")

print(f"\n" + "-" * 50)
print(f"Model: Weighted k-NN with 5-Fold Stratified CV")
print(f"-" * 50)

print(f"\n  Optimal k: {best_k}")
print(f"  CV Accuracy: {best_cv_acc:.4f}")
print(f"  CV Error: {1 - best_cv_acc:.4f}")
print(f"\n  Test Accuracy: {test_acc:.4f}")
print(f"  Generalisation Error: {test_error:.4f}")


FINAL RESULTS SUMMARY

Data Split: 80% Train / 15% Validation / 5% Test
  Training samples: 1280
  Validation samples: 240
  Test samples: 80

--------------------------------------------------
Model: Weighted k-NN with 5-Fold Stratified CV
--------------------------------------------------

  Optimal k: 56
  CV Accuracy: 0.7672
  CV Error: 0.2328

  Test Accuracy: 0.7875
  Generalisation Error: 0.2125


## 11. Confusion Matrix
Visual representation of model predictions vs actual labels:
- **True Negatives (TN)**: Correctly predicted "Not Good" wines
- **False Positives (FP)**: "Not Good" wines incorrectly predicted as "Good"
- **False Negatives (FN)**: "Good" wines incorrectly predicted as "Not Good"
- **True Positives (TP)**: Correctly predicted "Good" wines

In [22]:
y_pred = classifier.predict(X_test_norm)
cm = confusion_matrix(y_test, y_pred)

labels = ['Not Good', 'Good']

annotations = []
for i in range(2):
    for j in range(2):
        annotations.append(
            dict(
                x=labels[j],
                y=labels[i],
                text=str(cm[i, j]),
                showarrow=False,
                font=dict(size=24, color='white' if cm[i, j] > cm.max()/2 else 'black')
            )
        )

fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=labels,
    y=labels,
    colorscale='Blues',
    showscale=True,
    hovertemplate='Actual: %{y}<br>Predicted: %{x}<br>Count: %{z}<extra></extra>'
))

fig.update_layout(
    title=dict(text=f'Confusion Matrix (k={best_k})', font=dict(size=18)),
    xaxis=dict(title='Predicted Label', tickfont=dict(size=14)),
    yaxis=dict(title='Actual Label', tickfont=dict(size=14), autorange='reversed'),
    annotations=annotations,
    width=500,
    height=450
)

fig.write_html('confusion_matrix.html')
fig.show()
print(f"\nInteractive confusion matrix saved as 'confusion_matrix.html'")

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"\n" + "=" * 30)
print("CLASSIFICATION METRICS")
print("=" * 30)
print(f"\n  Accuracy:  {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1-Score:  {f1:.4f}")


Interactive confusion matrix saved as 'confusion_matrix.html'

CLASSIFICATION METRICS

  Accuracy:  0.7875
  Precision: 0.7937
  Recall:    0.9259
  F1-Score:  0.8547
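
As a cross-check, the four confusion-matrix cells can be recovered from the per-class recall and support printed in the classification report, and the metrics above can then be recomputed from those counts. The sketch below uses only the reported figures; the variable names are illustrative.

```python
# Values reported in the classification report and metrics above.
support_not_good, support_good = 26, 54
recall_not_good, recall_good = 0.50, 0.9259

# Recall is the fraction of each actual class that was predicted correctly.
tn = round(recall_not_good * support_not_good)
tp = round(recall_good * support_good)
fp = support_not_good - tn
fn = support_good - tp

precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy={accuracy:.4f}, Precision={precision:.4f}, F1={f1:.4f}")
```

The recovered counts show that half of the 26 "Not Good" wines in the test set were classified as "Good", while only 4 of the 54 "Good" wines were missed.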


## 12. Conclusion & Findings

### Project Overview
This project implements a **k-Nearest Neighbours classifier** to predict wine quality from physicochemical properties. 

The sparkling wine dataset contains 1,600 samples with 11 features (acidity, sugar, pH, alcohol, etc.) and a quality score, which we converted to a binary classification task.

### Key Methodology Choices

| Technique | Rationale |
|-----------|-----------|
| **Stratified Splitting** | Maintains ~67% class ratio across all sets, preventing distribution mismatch |
| **Z-Score Normalization** | Essential for k-NN since it uses Euclidean distance; prevents features with larger scales from dominating |
| **Distance-Weighted k-NN** | Closer neighbours have more influence, reducing impact of outliers |
| **5-Fold Stratified CV** | Robust hyperparameter selection; reduces overfitting to a single validation set |

### Results Summary

| Metric | Value |
|--------|-------|
| **Optimal k** | 56 |
| **CV Accuracy** | 76.72% (±2.27%) |
| **Test Accuracy** | **78.75%** |
| **Generalisation Error** | 21.25% |


### Key Observations

1. **Optimal k = 56**: A relatively high k suggests the decision boundary benefits from smoothing, indicating some noise in the data.

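2. **Good generalisation**: Test accuracy (78.75%) is slightly above CV accuracy (76.72%) and lies within one CV standard deviation (±2.27%), so there is no sign of overfitting. With only 80 test samples, however, the gap itself is within sampling noise.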

3. **Class imbalance effect**: The model performs better on the majority class ("Good" wine) with 93% recall vs 50% for "Not Good" wines.

4. **Validation curve insight**: Error remains relatively stable for k ∈ [10, 70], showing the model is not highly sensitive to k in this range.


### Limitations
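- Small test set (80 samples) leads to high variance in the accuracy estimate. This split was chosen to maximise the amount of data available for training.
- Class imbalance (67% Good vs 33% Not Good) hurts minority-class prediction. Stratified sampling keeps the ratio consistent across splits but does not correct the imbalance itself.
- The 15% validation split (240 samples) is normalised but never used: k is selected by cross-validation on the training set, so these samples contribute to neither training nor evaluation.
- k-NN doesn't provide feature-importance insights, which would have helped guide feature selection. Dimensionality reduction techniques such as PCA were not explored.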
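
The small-test-set limitation can be quantified with a rough binomial standard error. The sketch below uses the normal approximation, which is only a rule of thumb at this sample size, together with the reported test accuracy and test-set size.

```python
import math

test_accuracy = 0.7875
n_test = 80

# Standard error of a proportion under the binomial model.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# Approximate 95% interval using the normal approximation.
margin = 1.96 * standard_error
lower, upper = test_accuracy - margin, test_accuracy + margin

print(f"Standard error: {standard_error:.4f}")
print(f"Approximate 95% interval: [{lower:.4f}, {upper:.4f}]")
```

The interval spans roughly 9 percentage points on either side of the test accuracy and comfortably contains the CV accuracy of 76.72%, so the two estimates are statistically indistinguishable.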

### Potential Improvements
- Try other distance metrics (e.g. Manhattan, or Minkowski with other values of p) and compare their CV accuracy.
- Apply SMOTE to address class imbalance and improve performance on the minority class.
- Apply feature selection to reduce dimensionality and computational cost.
- Compare against other ML algorithms, including ensemble methods such as Random Forest.
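
For the distance-metric suggestion, Manhattan and Euclidean distance are both special cases of the Minkowski distance (p = 1 and p = 2 respectively). The sketch below shows this on a pair of illustrative points; `minkowski_distance` is a hypothetical helper written for this example. In scikit-learn, the corresponding knobs on `KNeighborsClassifier` are the `metric` and `p` parameters.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Illustrative points, not taken from the dataset.
point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, p=1)
euclidean = minkowski_distance(point_a, point_b, p=2)

print(f"Manhattan (p=1): {manhattan:.1f}")
print(f"Euclidean (p=2): {euclidean:.1f}")
```

Because p changes which neighbours count as closest, it could be tuned alongside k in the same cross-validation loop.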