In [16]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

1. loads the data file;

In [10]:
df = pd.read_csv("/Users/apple/Desktop/IC/spring semester/ML/assignment 1/sparklingwine.csv")


2. construct a new binary column “good wine” that indicates whether the wine is good (which we define as having a quality of 6 or higher) or not;

In [12]:
df["good_wine"] = (df["quality"] >= 6).astype(int)

# Features: every column except the original score and the derived label
X = df.drop(columns=["quality", "good_wine"])
# Target: 1 if quality >= 6, else 0
y = df["good_wine"]

We drop both quality and good_wine from the feature matrix. good_wine is the target variable, and it was built directly from quality, so keeping quality as a feature would let the model read the answer off a single column. Validation and test accuracy would then be artificially inflated.

3. splits the data set into a training data set (first 900 samples), a validation data set (next 300 samples) and a test data set (last 400 samples) — please do not shuffle the data, as it is already shuffled

In [13]:
X_train = X.iloc[:900]
y_train = y.iloc[:900]

X_val = X.iloc[900:1200]
y_val = y.iloc[900:1200]

X_test = X.iloc[1200:]
y_test = y.iloc[1200:]

4. normalises the data according to the Z-score transform;

In [14]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)


The data was split into training, validation, and test sets. In the first experiment, the first 900 samples were used for training, the next 300 for validation, and the final 400 for testing. Z-score normalisation was then applied to the feature data. The mean and standard deviation were computed using the training set only, and the same transformation was applied to the validation and test sets to prevent data leakage. This step is important for k-Nearest Neighbours (k-NN), as the algorithm relies on distance calculations and is sensitive to feature scales.
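As a minimal sketch of what the Z-score step computes, the snippet below reproduces it by hand with NumPy on hypothetical toy arrays (`X_train_toy` and `X_val_toy` are illustrative names, not part of the assignment data). It assumes the population standard deviation (ddof=0), which is the convention StandardScaler uses.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_val_toy = np.array([[2.5, 250.0], [5.0, 500.0]])

# The statistics come from the training rows only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0)

X_train_z = (X_train_toy - mu) / sigma
X_val_z = (X_val_toy - mu) / sigma  # reuse the training mu and sigma

print(X_train_z.mean(axis=0))  # close to 0 for each feature
print(X_train_z.std(axis=0))   # close to 1 for each feature
print(X_val_z)                 # need not be zero-mean, which is expected
```

After the transform both features contribute on a comparable scale to a Euclidean distance, whereas before it the second feature would dominate.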

5. loads and trains the k-Nearest Neighbours classifiers for k= 1,2,...,100;

In [27]:
validation_accuracies = {}

for k in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    
    y_val_pred = knn.predict(X_val_scaled)
    acc = accuracy_score(y_val, y_val_pred)
    
    validation_accuracies[k] = acc

6. evaluates each classifier using the validation data set and selects the best classifier;

In [26]:

# Select best k
best_k = max(validation_accuracies, key=validation_accuracies.get)
best_val_acc = validation_accuracies[best_k]

print(f"Best k: {best_k}")
print(f"Validation accuracy: {best_val_acc:.4f}")

Best k: 7
Validation accuracy: 0.9933


7. predicts the generalisation error using the test data set.

In [18]:
# Train best model on training data
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train_scaled, y_train)

# Test prediction
y_test_pred = best_knn.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)

generalisation_error = 1 - test_accuracy

print(f"Test accuracy: {test_accuracy:.4f}")
print(f"Generalisation error: {generalisation_error:.4f}")

Test accuracy: 0.9750
Generalisation error: 0.0250


For model training, k-NN classifiers were fitted for k = 1 to 100 using the normalised training data. Each classifier was evaluated on the validation set, and the value of k with the highest validation accuracy was selected as the best model. Selecting k on held-out data rather than on the training data balances model complexity: very small values of k tend to overfit, and very large values tend to underfit.
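The overfitting at very small k can be seen on training accuracy, which is why k is chosen on the validation set instead. The snippet below is a from-scratch sketch on hypothetical toy data (`knn_predict`, `toy_X` and `toy_y` are illustrative, and this is not how scikit-learn implements the classifier). With k = 1 every training point is its own nearest neighbour, so the training accuracy is perfect even though one label is noisy.

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 1-D toy data with one noisy label at x = 2.0.
toy_X = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
toy_y = [0, 0, 1, 0, 1, 1, 1]

def training_accuracy(k):
    hits = sum(knn_predict(toy_X, toy_y, x, k) == y for x, y in zip(toy_X, toy_y))
    return hits / len(toy_X)

acc_k1 = training_accuracy(1)  # each point retrieves itself
acc_k3 = training_accuracy(3)  # the noisy label at x = 2.0 is outvoted
print(acc_k1, acc_k3)
```

The perfect score at k = 1 says nothing about unseen data, so a held-out set is needed to compare values of k fairly.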

The selected classifier was then evaluated on the test set to estimate the generalisation error. The test accuracy (0.9750) was lower than the validation accuracy (0.9933). This is expected: k was chosen to maximise validation accuracy, so the validation score is an optimistic estimate, while the test set played no part in model selection and gives a more realistic picture of performance on unseen data.

In [24]:
print("Classification Report for First Split:")
print(classification_report(y_test, y_test_pred))

print("Confusion Matrix for First Split:")
print(confusion_matrix(y_test, y_test_pred))

Classification Report for First Split:
              precision    recall  f1-score   support

           0       0.98      0.94      0.96       132
           1       0.97      0.99      0.98       268

    accuracy                           0.97       400
   macro avg       0.98      0.97      0.97       400
weighted avg       0.98      0.97      0.97       400

Confusion Matrix for First Split:
[[124   8]
 [  2 266]]
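As a quick consistency check, the figures in the classification report can be recomputed from the confusion matrix printed above. The sketch assumes scikit-learn's layout, in which rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix for the first split (rows: true class, columns: predicted class).
cm = [[124, 8],
      [2, 266]]

tn, fp = cm[0]  # true "not good" wines: correctly rejected, wrongly accepted
fn, tp = cm[1]  # true "good" wines: missed, correctly identified
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_good = tp / (tp + fp)
recall_good = tp / (tp + fn)
recall_bad = tn / (tn + fp)

print(f"accuracy:         {accuracy:.4f}")
print(f"precision (good): {precision_good:.4f}")
print(f"recall (good):    {recall_good:.4f}")
print(f"recall (not good): {recall_bad:.4f}")
```

Of the 10 test errors, 8 are wines below the quality threshold that were predicted as good, so the classifier is weaker on the minority class than the overall accuracy suggests.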


8. Try a new splitting: split the data set into a training data set (first 400 samples), a
validation data set (next 400 samples), and a test data set (last 800 samples) - again,
please do not shuffle the data. Then redo steps 4 to 7. What is the new generalisation
error? Explain what you find.
How do you judge whether the classifier is well-suited for the data set?

In [23]:
# Second split
X_train2 = X.iloc[:400]
y_train2 = y.iloc[:400]

X_val2 = X.iloc[400:800]
y_val2 = y.iloc[400:800]

X_test2 = X.iloc[800:]
y_test2 = y.iloc[800:]

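# Normalisation (a fresh scaler, fitted on the new training set only,
# so the scaler from the first split is not overwritten)
scaler2 = StandardScaler()
X_train2_scaled = scaler2.fit_transform(X_train2)
X_val2_scaled = scaler2.transform(X_val2)
X_test2_scaled = scaler2.transform(X_test2)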

# Train and validate again
validation_accuracies2 = {}

for k in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train2_scaled, y_train2)
    
    y_val_pred = knn.predict(X_val2_scaled)
    acc = accuracy_score(y_val2, y_val_pred)
    validation_accuracies2[k] = acc

best_k2 = max(validation_accuracies2, key=validation_accuracies2.get)
best_val_acc2 = validation_accuracies2[best_k2]

# Test
best_knn2 = KNeighborsClassifier(n_neighbors=best_k2)
best_knn2.fit(X_train2_scaled, y_train2)

y_test2_pred = best_knn2.predict(X_test2_scaled)
test_accuracy2 = accuracy_score(y_test2, y_test2_pred)
generalisation_error2 = 1 - test_accuracy2

print(f"New best k: {best_k2}")
print(f"New test accuracy: {test_accuracy2:.4f}")
print(f"New generalisation error: {generalisation_error2:.4f}")

New best k: 1
New test accuracy: 0.9637
New generalisation error: 0.0363


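The new generalisation error is 0.0363, which is higher than the 0.0250 obtained with the first split. The new best k is 1, and when the range of k is changed, the best k is often the smallest value in the loop. This may be because, with a smaller training set, the k nearest neighbours of a point lie farther away and are less representative of it, so larger values of k bring in less relevant votes. A 1-NN classifier also memorises the training data, since each training point is its own nearest neighbour, which makes it the most flexible of the classifiers tried and the most prone to overfitting.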
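One further factor that may contribute to the smallest k being selected is tie-breaking. `max` over a dictionary returns the first key that reaches the maximum, and the loop inserts keys in increasing order, so any tie is resolved in favour of the smallest k. The sketch below uses hypothetical accuracies (`toy_accuracies` is illustrative); whether ties actually occurred in this notebook is not visible in the printed output.

```python
# Hypothetical validation accuracies, with k = 1 and k = 5 tied for the best score.
toy_accuracies = {1: 0.95, 2: 0.93, 3: 0.94, 4: 0.94, 5: 0.95}

# max() keeps the first maximal key, and dicts preserve insertion order.
first_best = max(toy_accuracies, key=toy_accuracies.get)

# Listing every tied k makes the ambiguity visible.
top_score = max(toy_accuracies.values())
tied_ks = [k for k, acc in toy_accuracies.items() if acc == top_score]

print(first_best)  # 1
print(tied_ks)     # [1, 5]
```

Inspecting the tied values, or plotting validation accuracy against k, shows whether the selected k is a clear optimum or one of several equally good choices.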

In [25]:
print("\nClassification Report:")
print(classification_report(y_test2, y_test2_pred))

# Confusion matrix for the second split
print("Confusion Matrix:")
print(confusion_matrix(y_test2, y_test2_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.95      0.94       259
           1       0.98      0.97      0.97       541

    accuracy                           0.96       800
   macro avg       0.96      0.96      0.96       800
weighted avg       0.96      0.96      0.96       800

Confusion Matrix:
[[246  13]
 [ 16 525]]


In the second experiment, the data was re-split into 400 training samples, 400 validation samples, and 800 test samples, and the same normalisation, training, and model selection procedure was repeated. The generalisation error increased from 0.0250 to 0.0363. A likely explanation is the reduced size of the training set (400 instead of 900 samples), which leaves fewer nearby neighbours for each query point and makes the local vote less reliable.
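To put the size of this increase in context, a rough binomial standard error can be attached to each test error. This is a sketch under the simplifying assumption that test samples are independent draws; the helper name is illustrative, and the two test sets overlap (the last 400 rows belong to both), so the comparison is only indicative.

```python
import math

def error_with_standard_error(n_errors, n_samples):
    """Return the error rate p and its binomial standard error sqrt(p * (1 - p) / n)."""
    p = n_errors / n_samples
    return p, math.sqrt(p * (1 - p) / n_samples)

# Error counts are the off-diagonal entries of the two confusion matrices.
err1, se1 = error_with_standard_error(8 + 2, 400)    # first split
err2, se2 = error_with_standard_error(13 + 16, 800)  # second split

print(f"first split:  {err1:.4f} +/- {se1:.4f}")
print(f"second split: {err2:.4f} +/- {se2:.4f}")
print(f"difference:   {err2 - err1:.4f}")
```

The gap of about 0.011 is of the same order as the standard errors (roughly 0.007 to 0.008), so the increase points in the expected direction but is not large relative to the sampling noise.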

Overall, k-NN performs reasonably well on this dataset after normalisation, but its performance is sensitive to the amount of training data and the choice of k. This suggests that while k-NN is suitable for this problem, its effectiveness depends strongly on data availability and proper preprocessing.
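One simple way to judge whether the classifier is well suited is to compare it with a majority-class baseline that always predicts the most frequent label. The sketch below takes the class supports from the two classification reports; for simplicity it assumes the baseline is scored with the test-set class frequencies, whereas a stricter baseline would pick the majority class from the training labels.

```python
# Test-set class supports and k-NN accuracies taken from the reports above.
splits = {
    "first split": {"support": {0: 132, 1: 268}, "knn_accuracy": 0.9750},
    "second split": {"support": {0: 259, 1: 541}, "knn_accuracy": 0.9637},
}

baseline = {}
for name, info in splits.items():
    n = sum(info["support"].values())
    # Always predicting the most frequent class gets that class's share right.
    baseline[name] = max(info["support"].values()) / n
    print(f"{name}: baseline {baseline[name]:.4f}, k-NN {info['knn_accuracy']:.4f}")
```

In both splits the baseline is about 0.67, so k-NN is well above what class imbalance alone would give. Per-class recall and the confusion matrix remain useful alongside this check, because accuracy can hide weaker performance on the minority class.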