# Assignment 1, Predictive Methods - AVTEK 2025 

Participants' names and contributions

## Part 1

### 1.1 Dataset from Kaggle

**Dataset & suitability for kNN (meets 2/2 points criteria)**  
- **Dataset:** Wine Quality (WineQT.csv) from Kaggle.  
- **Target (y):** `quality` (discrete score → use as classification label).  
- **Why suitable for kNN:**  
  - All predictors are **numeric**, enabling distance-based learning.  
  - Dataset size is moderate → kNN is computationally feasible.  
  - Clear supervised task with a labeled target.  
- **What is scikit-learn:** Python ML toolkit providing ready-to-use models (e.g., `KNeighborsClassifier`), utilities (`train_test_split`) and metrics (`accuracy_score`).
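
To make the distance-based idea concrete, here is a minimal from-scratch sketch of what a kNN classifier does for one query point: measure the distance to every training row, take the `k` closest, and return the majority label. The helper `knn_predict` and the toy arrays are made up for illustration; this is not how `KNeighborsClassifier` is implemented internally, only the same idea in a few lines.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training row
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two features, two well-separated classes
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_low, pred_high)  # prints: 0 1
```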

In [None]:
# Load dataset and define X, y (place WineQT.csv next to this notebook)
import pandas as pd

df = pd.read_csv('WineQT.csv')  # adjust path if needed, e.g., 'data/WineQT.csv'
X = df.drop(['quality','Id'], axis=1)
y = df['quality']

print("Data shape:", df.shape)
print("Features shape:", X.shape, "| Target shape:", y.shape)
print("\nMissing values in whole df:", int(df.isna().sum().sum()))
print("\nTarget distribution (quality):")
print(y.value_counts().sort_index())

df.head()

**Why this dataset works well with kNN (short rationale):**  
Numeric features allow meaningful distances (provided they are on comparable scales); the quality label provides a multi-class setup; and the moderate size keeps training and testing fast.
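
Because kNN compares samples by distance, a feature with a large numeric range can dominate the neighbor search. The sketch below illustrates this on synthetic stand-in data (an assumption, not WineQT): one informative feature with a small range and one pure-noise feature with a huge range. It compares plain kNN with a `StandardScaler` + kNN pipeline; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): NOT the wine dataset
rng = np.random.default_rng(0)
n = 400
y_demo = rng.integers(0, 2, size=n)
# Informative feature: centered at -1 for class 0 and +1 for class 1
informative = rng.normal(loc=2.0 * y_demo - 1.0, scale=0.5)
# Noise feature: unrelated to the label, but with a very large spread
noise = rng.normal(loc=0.0, scale=1000.0, size=n)
X_demo = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Plain kNN: distances are dominated by the large-range noise feature
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Pipeline: the scaler is fitted on the training data only, then kNN is applied
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

acc_raw = raw_knn.score(X_te, y_te)
acc_scaled = scaled_knn.score(X_te, y_te)
print("Accuracy without scaling:", acc_raw)
print("Accuracy with scaling:   ", acc_scaled)
```

In the notebook, the same pipeline could be fitted on `X_train`, `y_train` in place of the bare `KNeighborsClassifier`; whether it helps on the wine data would have to be checked by running it.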

### 1.2 First kNN run (Training) 


**Goal (2/2 points):** Split 80/20 (stratified) and **train** baseline kNN (k=5). This cell **only trains** the model on the training set (no leakage).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Fixed random_state makes the split reproducible; stratify preserves the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Train size:", len(X_train), "| Test size:", len(X_test))
print("Model object:", knn)

### 1.2 First kNN run (Testing)

**Goal (2/2 points):** Predict on the held-out test set and report accuracy.  
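The cell below is **robust** to out-of-order execution: if the model `knn` isn't defined, it re-splits and re-trains before predicting. It still needs `X` and `y` from the loading cell, so after a kernel reset that cell must be run first.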

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X and y must already exist (loading cell); only the trained model is re-created if missing
try:
    knn
except NameError:
    # re-split and re-train to avoid NameError if cells were run out of order
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

# Evaluate
y_pred = knn.predict(X_test)
basic_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy (k=5) on test set:", basic_accuracy)
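
An accuracy figure is easier to judge next to a trivial baseline. The sketch below uses synthetic imbalanced data as a stand-in for the quality labels (an assumption, not results from WineQT) and compares kNN with scikit-learn's `DummyClassifier`, which always predicts the most frequent training class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data with imbalanced classes (assumption: stand-in only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: always predict the majority class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
knn_acc = knn_demo.score(X_te, y_te)
print("Majority-class baseline accuracy:", baseline_acc)
print("kNN (k=5) accuracy:              ", knn_acc)
```

With the notebook's own `X_train`, `X_test`, `y_train`, `y_test`, the same two lines of fitting and scoring would show how far `basic_accuracy` is above the share of the most common quality score.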

**Illustration: 80/20 split (conceptual)**

In [None]:
# Single illustrative plot (default style/colors)
import numpy as np
import matplotlib.pyplot as plt

grid = np.zeros((20, 6))
grid[:16,:5] = 1   # train features
grid[:16,5]  = 2   # train labels
grid[16:,:5] = 3   # test features
grid[16:,5]  = 4   # test labels

plt.figure(figsize=(4,6))
plt.imshow(grid, aspect='auto')
plt.title('Example of 80/20 split')
plt.axis('off')
plt.show()

### 1.3 Two more interesting use cases for the kNN algorithm

**Two real-world kNN use cases (done fully, 2/2 points)**  
1) **Handwritten digit recognition (MNIST):** Classify digits (0–9) by comparing pixel vectors with nearest labeled examples. kNN is a clear baseline for image tasks.  
2) **Customer similarity for marketing:** Find nearest customers by purchase/behavior vectors to enable look‑alike targeting and simple recommendations (items liked by neighbors).
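
The customer-similarity use case can be sketched with scikit-learn's unsupervised `NearestNeighbors`. The customer vectors below are hypothetical values chosen for illustration; in practice the features would normally be scaled first so that spend does not dominate the distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical customers: [monthly spend, visits per month, share of purchases online]
customers = np.array([
    [20.0, 1.0, 0.1],   # customer 0
    [22.0, 1.0, 0.2],   # customer 1
    [95.0, 8.0, 0.9],   # customer 2
    [90.0, 7.0, 0.8],   # customer 3
    [50.0, 4.0, 0.5],   # customer 4
])

# Ask for 2 neighbors: the query customer itself plus the closest other customer
nn = NearestNeighbors(n_neighbors=2).fit(customers)
distances, indices = nn.kneighbors(customers[[2]])

# indices[0, 0] is customer 2 itself (distance 0); indices[0, 1] is the look-alike
lookalike = int(indices[0, 1])
print("Most similar customer to customer 2:", lookalike)  # prints 3
```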

## Part 2

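### 2.1 Experiments with different values of $k$

The comparison of different `k` values (table and accuracy-vs-k plot) is presented at the start of section 2.3.
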
### 2.2 Studying the effect of different train/test splits

**Task (2/2 points):** Compare several **test sizes** (0.2, 0.3, 0.4) while keeping k=5 and present a tidy table with interpretation.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

split_sizes = [0.2, 0.3, 0.4]
rows = []
for s in split_sizes:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=s, random_state=42, stratify=y)
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append((s, accuracy_score(y_te, pred)))

df_splits = pd.DataFrame(rows, columns=['test_size','test_accuracy']).sort_values('test_size')
print(df_splits)

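**Analysis:** A larger test fraction leaves less data for training, which can lower accuracy, while a smaller test set makes the accuracy estimate itself noisier. Differences between the three splits should be read with both effects in mind, since each number comes from a single random split. A 20% test split is a reasonable default. Cross-validation (section 2.3) reduces dependence on a single split.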
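
One way to see how much of the difference between test sizes is just split-to-split noise is to repeat each split with several seeds and look at the spread. The sketch below does this on synthetic data from `make_classification` (an assumption, standing in for the wine features); the seed range and sample size are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): not the wine dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

results = {}
for test_size in [0.2, 0.3, 0.4]:
    accs = []
    for seed in range(10):  # 10 different random splits per test size
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed, stratify=y_demo
        )
        model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
        accs.append(model.score(X_te, y_te))
    # Mean and standard deviation of accuracy over the repeated splits
    results[test_size] = (float(np.mean(accs)), float(np.std(accs)))
    print(f"test_size={test_size}: mean={results[test_size][0]:.3f}, "
          f"std={results[test_size][1]:.3f}")
```

If the standard deviation across seeds is as large as the gap between test sizes, the single-split table cannot distinguish them reliably.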

### 2.3 $k$-fold validation

**Task (2/2 points):** Train kNN with multiple `k` values and compare **test accuracies** systematically; also show a simple accuracy‑vs‑k plot.

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

k_values = [1, 3, 5, 7, 9, 11, 15]
rows = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append((k, accuracy_score(y_test, pred)))

df_k = pd.DataFrame(rows, columns=['k','test_accuracy']).sort_values('k')
print(df_k)

# Single plot, default style/colors
plt.figure()
plt.plot(df_k['k'], df_k['test_accuracy'], marker='o')
plt.xlabel('k (neighbors)')
plt.ylabel('Test accuracy')
plt.title('kNN: accuracy vs k')
plt.tight_layout()
plt.show()

best_idx = df_k['test_accuracy'].idxmax()
best_k = int(df_k.loc[best_idx, 'k'])
best_acc = float(df_k.loc[best_idx, 'test_accuracy'])
print(f"Best k on test set: k={best_k} with accuracy={best_acc:.4f}")

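**Analysis:** Small `k` may overfit noise; larger `k` smooths the decision boundary. Choose the `k` with highest accuracy; if tied, prefer a moderate `k` (5–7) for stability. Note that because `k` is selected here using the test set, the reported best accuracy is an optimistic estimate; choosing `k` on a validation set or by cross-validation on the training data avoids this.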
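
The overfitting of small `k` shows up when training and test accuracy are compared side by side: with `k=1` every training point is its own nearest neighbor, so training accuracy is perfect regardless of how well the model generalizes. The sketch below repeats the k sweep above on synthetic data with some label noise (an assumption, not WineQT) and records both scores.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 10% randomly flipped labels (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

curve = {}
for k in [1, 3, 5, 7, 9, 11, 15]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each k
    curve[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:2d}  train={curve[k][0]:.3f}  test={curve[k][1]:.3f}")
```

A large gap between the two columns at small `k` that narrows as `k` grows is the usual sign of the effect described above.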

**Task (2/2 points):** Perform **5-fold cross‑validation** with k=5 and report the **average accuracy** with justification.

In [None]:
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

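# Shuffle before splitting so folds do not depend on row order; fixed seed for reproducibility
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for tr_idx, te_idx in kf.split(X):
    # Positional indexing: each fold holds out a different ~20% of the rows
    X_tr, X_te = X.iloc[tr_idx], X.iloc[te_idx]
    y_tr, y_te = y.iloc[tr_idx], y.iloc[te_idx]
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    scores.append(accuracy_score(y_te, preds))

# Average the five per-fold accuracies
cv_mean = sum(scores)/len(scores)
print("5-fold CV mean accuracy:", cv_mean)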

**Justification & interpretation:** 5 folds balance stability and runtime for this dataset size. The CV mean is less sensitive to lucky/unlucky single splits and usually reflects generalization more reliably than one 80/20 result.
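
The manual loop above can also be written with `cross_val_score`. The sketch below goes one step beyond the task, as an assumption: it uses `StratifiedKFold` so every fold keeps the class proportions (matching the stratified splits used earlier) and compares several `k` values by their CV mean rather than by a single test split. It runs on synthetic data; in the notebook, `X` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): not the wine dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# 5 stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for k in [1, 3, 5, 7, 9, 11, 15]:
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=cv, scoring="accuracy"
    )
    # Mean and spread of the five per-fold accuracies
    cv_results[k] = (scores.mean(), scores.std())
    print(f"k={k:2d}  CV mean={scores.mean():.3f}  std={scores.std():.3f}")

# k with the highest mean CV accuracy
best_k_cv = max(cv_results, key=lambda k: cv_results[k][0])
print("Best k by 5-fold CV:", best_k_cv)
```

Reporting the standard deviation next to the mean shows how much the folds disagree, which a single average hides.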