# Predicting Wine Quality with k-Nearest Neighbours

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

### 1. Load the data file

In [2]:
#df = pd.read_csv("/Users/apple/Desktop/IC/spring semester/ML/assignment 1/sparklingwine.csv")
df = pd.read_csv("/Users/amermulla/Desktop/Imperial/Term 2/Machine Learning/Assignments/Assignment 1/Group Assignment/sparklingwine.csv")
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.4 | 0.29 | 0.38 | 1.2 | 0.029 | 31.0 | 132.0 | 0.98895 | 3.28 | 0.36 | 12.4 | 6 |
| 1 | 6.7 | 0.24 | 0.29 | 14.9 | 0.053 | 55.0 | 136.0 | 0.99839 | 3.03 | 0.52 | 9.0 | 5 |
| 2 | 6.8 | 0.33 | 0.31 | 7.4 | 0.045 | 34.0 | 143.0 | 0.99226 | 3.06 | 0.55 | 12.2 | 6 |
| 3 | 6.4 | 0.27 | 0.19 | 2.0 | 0.084 | 21.0 | 191.0 | 0.99516 | 3.49 | 0.63 | 9.6 | 4 |
| 4 | 6.1 | 0.3 | 0.3 | 2.1 | 0.031 | 50.0 | 163.0 | 0.9895 | 3.39 | 0.43 | 12.7 | 7 |


### 2. Create the binary column `good_wine`

In [3]:
df["good_wine"] = (df["quality"] >= 6).astype(int)  # Binary column (1 if quality >= 6, else 0)

X = df.drop(columns = ["quality", "good_wine"])     # Feature matrix
y = df["good_wine"]                                 # Target variable

When constructing the feature matrix `X`, `quality` and `good_wine` were both dropped because:

- `good_wine` is the label we want to predict, so it should not appear in `X`
- `good_wine` is defined directly from `quality`, so keeping `quality` as a feature would artificially inflate validation and test accuracies
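
The leakage argument can be illustrated with a minimal sketch. It uses only the five `quality` scores visible in the `df.head()` output, and the "model" is a hypothetical rule that simply copies the labelling threshold, so it is an illustration of the risk and not part of the analysis pipeline.

```python
import numpy as np

# The five quality scores shown in the df.head() output.
quality = np.array([6, 5, 6, 4, 7])

# The label is a deterministic function of quality ...
good_wine = (quality >= 6).astype(int)

# ... so anything that can see quality can reproduce the label exactly,
# without learning anything from the chemical features.
leaky_prediction = (quality >= 6).astype(int)
leaky_accuracy = float((leaky_prediction == good_wine).mean())

print(good_wine.tolist())  # [1, 0, 1, 0, 1]
print(leaky_accuracy)      # 1.0
```

A perfect score obtained this way says nothing about how well the physico-chemical features predict quality, which is why both columns are excluded from `X`.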

### 3. Split the data into training, validation, and test sets

In [4]:
X_train = X.iloc[:900]
y_train = y.iloc[:900]

X_val = X.iloc[900:1200]
y_val = y.iloc[900:1200]

X_test = X.iloc[1200:]
y_test = y.iloc[1200:]

The data was split into training, validation, and test sets, without shuffling. The first 900 samples were used for training, the next 300 for validation, and the final 400 for testing.
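
Because the same slicing pattern is reused later with different sizes, it could be wrapped in a small helper. The sketch below is one possible way to do that; `sequential_split` is a hypothetical name, and the total of 1600 rows is inferred from the split sizes quoted above (900 + 300 + 400), not read from the file.

```python
import numpy as np

def sequential_split(n_samples, n_train, n_val):
    """Return index arrays for an unshuffled train/validation/test split.

    Hypothetical helper: rows keep their file order, and everything after
    the validation block becomes the test set.
    """
    if n_train + n_val >= n_samples:
        raise ValueError("no rows left for the test set")
    idx = np.arange(n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 1600 rows is implied by the split sizes quoted above.
train_idx, val_idx, test_idx = sequential_split(1600, 900, 300)
print(len(train_idx), len(val_idx), len(test_idx))  # 900 300 400
```

The returned arrays can be passed to `X.iloc[...]` and `y.iloc[...]`, which keeps the features and labels aligned by construction.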

### 4. Normalise the features using the Z-score transform

In [5]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)


The feature matrix was normalised using Z-score normalisation (via `StandardScaler`). The scaler is fit on the training set only to learn each feature’s mean and standard deviation, and the same transformation is then applied to the validation and test sets to prevent data leakage. This step is important for k-Nearest Neighbours (k-NN) because the algorithm relies on Euclidean distances, so features on larger scales would otherwise dominate the distance calculations.
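
To make the fit/transform distinction concrete, the sketch below performs the same Z-score transform by hand with NumPy. The data is synthetic (two made-up features on very different scales), so the numbers are illustrative only; the use of the population standard deviation (`ddof=0`) matches what `StandardScaler` does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for two features on very different scales
# (loosely "total sulfur dioxide" and "density"); not the wine data.
X_train = np.column_stack([rng.normal(140, 40, 900), rng.normal(0.994, 0.003, 900)])
X_val = np.column_stack([rng.normal(140, 40, 300), rng.normal(0.994, 0.003, 300)])

# "fit": learn the statistics from the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same training statistics to every split.
X_train_z = (X_train - mu) / sigma
X_val_z = (X_val - mu) / sigma

print(X_train_z.mean(axis=0).round(6))  # approximately 0 for each feature
print(X_train_z.std(axis=0).round(6))   # approximately 1 for each feature
print(X_val_z.mean(axis=0).round(2))    # close to 0, but generally not exactly 0
```

The validation columns end up only approximately standardised, which is the intended behaviour: their own mean and standard deviation are never looked at.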

### 5. Train k-NN classifiers for k = 1, 2, …, 100

In [6]:
validation_accuracies = {}

for k in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train_scaled, y_train)
    
    y_val_pred = knn.predict(X_val_scaled)
    acc = accuracy_score(y_val, y_val_pred)
    
    validation_accuracies[k] = acc

A k-NN classifier was trained for each k from 1 to 100 using the training set, and each model was evaluated on the validation set using accuracy. The validation accuracy for each k was stored to select the best-performing model in the next step.


### 6. Select the best k using the validation set

In [30]:
best_val_acc = max(validation_accuracies.values())
best_ks = [k for k, v in validation_accuracies.items() if v == best_val_acc]
best_k = min(best_ks)

print(f"Best k: {best_ks}")
print(f"Selected k: {best_k}")
print(f"Validation accuracy: {best_val_acc:.4f}")


Best k: [1, 9, 17]
Selected k: 1
Validation accuracy: 0.7567


The maximum validation accuracy was achieved by multiple values of k (i.e., k = 1, 9, 17), so the smallest k achieving the maximum was selected.
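
The tie-breaking rule is a design choice, and the sketch below shows two options side by side. The accuracies are hypothetical values chosen to mirror the three-way tie above; they are not the notebook's actual results.

```python
# Hypothetical validation accuracies with a three-way tie for the maximum,
# mirroring the k = 1, 9, 17 tie reported above.
val_acc = {1: 0.75, 2: 0.70, 9: 0.75, 17: 0.75, 25: 0.72}

best = max(val_acc.values())
tied = [k for k, acc in val_acc.items() if acc == best]

smallest_k = min(tied)  # rule used in this notebook: most flexible model
largest_k = max(tied)   # alternative: smoothest model among the tied ones

# max() over a dict returns the first key that attains the maximum, so with
# keys inserted in ascending order it agrees with min(tied).
first_max = max(val_acc, key=val_acc.get)

print(tied, smallest_k, largest_k, first_max)  # [1, 9, 17] 1 17 1
```

Choosing the largest tied k would be an equally defensible rule, since a larger k gives a smoother decision boundary; whether it generalises better here would have to be checked on data not used for selection.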


### 7. Estimate the generalisation error using the test set

In [32]:
# Train best model on training data
best_knn = KNeighborsClassifier(n_neighbors = best_k)
best_knn.fit(X_train_scaled, y_train)

# Test prediction
y_test_pred = best_knn.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)

generalisation_error = 1 - test_accuracy

print(f"Test accuracy: {test_accuracy:.4f}")
print(f"Generalisation error: {generalisation_error:.4f}")

Test accuracy: 0.6900
Generalisation error: 0.3100


For model training, k-NN classifiers were fitted for k = 1 to 100 using the normalised training data. Each classifier was evaluated on the validation set, and the value of k that achieved the highest validation accuracy was selected as the best model. Selecting k on held-out data is intended to balance model complexity, avoiding overfitting at very small values of k and underfitting at large values. Here, however, the tie-break chose k = 1, the most flexible model in the range, so some overfitting remains likely.

The selected classifier was then evaluated on the test set to estimate the generalisation error. The test accuracy (0.6900) was lower than the validation accuracy (0.7567). This is expected: k was chosen to maximise validation accuracy, which makes the validation score an optimistic estimate, whereas the test set played no part in model selection and so gives a more realistic assessment of how well the model generalises to unseen data.
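
A test accuracy measured on 400 samples is itself a noisy estimate. As a rough guide to that uncertainty, the sketch below computes a normal-approximation interval; `accuracy_interval` is a hypothetical helper, and treating the test predictions as independent Bernoulli trials is an assumption.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n samples.

    Hypothetical helper; treats each test prediction as an independent
    Bernoulli trial, which is an assumption, not something the data guarantees.
    """
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se, se

low, high, se = accuracy_interval(0.69, 400)     # first-split test result
print(f"standard error: {se:.4f}")               # standard error: 0.0231
print(f"95% interval: [{low:.3f}, {high:.3f}]")  # 95% interval: [0.645, 0.735]
```

Under these assumptions the generalisation error of 0.31 carries an uncertainty of a few percentage points, which is worth keeping in mind when comparing results across splits.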

In [9]:
print("Classification Report for First Split:")
print(classification_report(y_test, y_test_pred))

print("Confusion Matrix for First Split:")
print(confusion_matrix(y_test, y_test_pred))

Classification Report for First Split:
              precision    recall  f1-score   support

           0       0.53      0.51      0.52       132
           1       0.76      0.78      0.77       268

    accuracy                           0.69       400
   macro avg       0.65      0.64      0.65       400
weighted avg       0.69      0.69      0.69       400

Confusion Matrix for First Split:
[[ 67  65]
 [ 59 209]]
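
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them, relying on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above; in scikit-learn's convention rows are the
# true classes and columns the predicted classes, both ordered [0, 1].
cm = np.array([[67, 65],
               [59, 209]])

precision = np.diag(cm) / cm.sum(axis=0)  # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print(precision.round(2))         # [0.53 0.76]
print(recall.round(2))            # [0.51 0.78]
print(f1.round(2))                # [0.52 0.77]
print(round(float(accuracy), 2))  # 0.69
```

The low recall for class 0 means that roughly half of the wines that are not good are still predicted as good, something the overall accuracy hides.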


### 8. Try a new split: 400 training, 400 validation, and 800 test samples

The data set is split again, without shuffling, into a training set (first 400 samples), a validation set (next 400 samples), and a test set (last 800 samples), and steps 4 to 7 are repeated. What is the new generalisation error, and what does it show? How do we judge whether the classifier is well suited to the data set?

In [10]:
# Second split
X_train2 = X.iloc[:400]
y_train2 = y.iloc[:400]

X_val2 = X.iloc[400:800]
y_val2 = y.iloc[400:800]

X_test2 = X.iloc[800:]
y_test2 = y.iloc[800:]

# Normalisation: refit the scaler on the new training set only
X_train2_scaled = scaler.fit_transform(X_train2)
X_val2_scaled = scaler.transform(X_val2)
X_test2_scaled = scaler.transform(X_test2)

# Train and validate again
validation_accuracies2 = {}

for k in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train2_scaled, y_train2)
    
    y_val_pred = knn.predict(X_val2_scaled)
    acc = accuracy_score(y_val2, y_val_pred)
    validation_accuracies2[k] = acc

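# max() returns the first key with the highest accuracy, so ties go to the
# smallest k, matching the rule used for the first split
best_k2 = max(validation_accuracies2, key=validation_accuracies2.get)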
best_val_acc2 = validation_accuracies2[best_k2]

# Test
best_knn2 = KNeighborsClassifier(n_neighbors=best_k2)
best_knn2.fit(X_train2_scaled, y_train2)

y_test2_pred = best_knn2.predict(X_test2_scaled)
test_accuracy2 = accuracy_score(y_test2, y_test2_pred)
generalisation_error2 = 1 - test_accuracy2

print(f"New best k: {best_k2}")
print(f"New test accuracy: {test_accuracy2:.4f}")
print(f"New generalisation error: {generalisation_error2:.4f}")

New best k: 5
New test accuracy: 0.7475
New generalisation error: 0.2525


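The new generalisation error is 0.2525, which is lower than the 0.3100 obtained with the first split, and the selected k is now 5 instead of 1. A larger k averages over more neighbours, so its predictions are less sensitive to individual noisy training points than those of a 1-NN classifier, in which every training point is its own nearest neighbour and the model can overfit. When the range of k was changed, the selected k was often the smallest value in the loop. This may partly reflect the tie-breaking rule, since `max` returns the first (smallest) k among those with equal validation accuracy.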

In [11]:
print("\nClassification Report:")
print(classification_report(y_test2, y_test2_pred))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test2, y_test2_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.62      0.58      0.60       259
           1       0.81      0.83      0.82       541

    accuracy                           0.75       800
   macro avg       0.71      0.70      0.71       800
weighted avg       0.74      0.75      0.75       800

Confusion Matrix:
[[151 108]
 [ 94 447]]


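In the second experiment, the data was re-split into 400 training samples, 400 validation samples, and 800 test samples, and the same normalisation, training, and model selection procedure was repeated. The generalisation error decreased from 0.3100 to 0.2525, even though the training set shrank from 900 to 400 samples. A smaller training set would normally be expected to hurt k-NN, so the improvement probably reflects other differences between the two experiments: the selected model changed from k = 1 to the smoother k = 5, the test set is a different and larger block of rows (the last 800 instead of the last 400), and because the data was not shuffled the blocks may not follow the same distribution. The gap between the two estimates shows how strongly the result depends on the split.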

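Overall, k-NN performs reasonably on this dataset after normalisation, but its margin over a trivial baseline is modest. Test accuracy was 0.69 and 0.75 on the two splits, against roughly 0.67 and 0.68 for always predicting the majority class (268 of 400 and 541 of 800 test wines are good), and recall for class 0 was only 0.51 and 0.58. To judge whether the classifier is well suited to the data set, it should therefore be compared with such a baseline and assessed per class using the classification report and confusion matrix, not on accuracy alone. Performance is also sensitive to how the data is split and to the choice of k. This suggests that while k-NN is usable for this problem, its effectiveness depends strongly on data availability and proper preprocessing.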
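
One simple way to judge suitability is to compare each test accuracy with the accuracy of always predicting the majority class. The sketch below does this using the class counts (support) and accuracies from the two classification reports; the dictionary layout and key names are illustrative choices.

```python
# Test-set class counts (support) and accuracies taken from the two
# classification reports in this notebook.
splits = {
    "first split (900/300/400)": {"bad": 132, "good": 268, "test_acc": 0.6900},
    "second split (400/400/800)": {"bad": 259, "good": 541, "test_acc": 0.7475},
}

for name, s in splits.items():
    total = s["bad"] + s["good"]
    baseline = max(s["bad"], s["good"]) / total  # always predict the majority class
    s["baseline"] = baseline
    s["gain"] = s["test_acc"] - baseline
    print(f"{name}: baseline {baseline:.4f}, k-NN {s['test_acc']:.4f}, gain {s['gain']:+.4f}")
```

A gain of only a few percentage points over this baseline suggests that the classifier extracts limited signal from the features, so accuracy alone should not be taken as evidence of a good fit.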