
# Wine Quality Prediction — Portfolio Project

**Author:** _<Your Name Here>_<br>
**Dataset:** `allwine.csv` (red + white wine quality attributes)<br>
**Goal:** Predict wine **quality** and demonstrate both **from-scratch logistic regression** (recap) and **applied ML with ensembles** using robust evaluation and interpretability.

---

## Project Story

- **Problem:** Predict wine quality (good vs. not-good) from physicochemical properties.
- **Business framing:** Help vintners and quality-control teams triage batches and optimize processes.
- **What I show here:**
  1. Clean EDA and data preprocessing.
  2. Baseline from-scratch logistic regression **(recap of academic work)**.
  3. Stronger applied models: LogisticRegression (sklearn), Random Forest, Gradient Boosting, AdaBoost.
  4. Robust evaluation: cross-validation, ROC/PR curves, confusion matrix, calibration.
  5. Feature importance + permutation importance.
  6. Clear conclusions and next steps.

> This notebook is the **portfolio-ready** version. The class-restricted notebook remains unchanged for grading.



## 1. Setup & Data Load


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from pathlib import Path

# Reproducibility
RANDOM_STATE = 42

# Load
data_path = Path('/mnt/data/allwine.csv')  # adjust if needed
df = pd.read_csv(data_path)

print(df.shape)
df.head()



## 2. Quick EDA

We look at schema, missingness, target distribution, and pairwise correlations.


In [None]:

df.info()


In [None]:

df.describe().T


In [None]:

df.isnull().sum().sort_values(ascending=False)


In [None]:

plt.figure(figsize=(6,4))
sns.countplot(x=df['quality'])
plt.title('Raw Quality Distribution')
plt.show()


In [None]:

plt.figure(figsize=(10,8))
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=False)
plt.title('Correlation Heatmap')
plt.show()



## 3. Target Engineering

For this portfolio version, we frame the task as a **binary classification** problem, a common practice for wine quality datasets:

- **Good (1):** quality \>= 7  
- **Not Good (0):** otherwise


In [None]:

df = df.copy()
df['target'] = (df['quality'] >= 7).astype(int)
df['target'].value_counts(normalize=True).rename('proportion')
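
With a cutoff like this, the "good" class may well turn out to be the minority. In that case plain accuracy can look strong for a model that never predicts "good". The sketch below shows that reference point on made-up quality scores, not on `allwine.csv`; the helper `majority_baseline_accuracy` and the `_demo` arrays are hypothetical names introduced for illustration.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / y.size

# Hypothetical quality scores, for illustration only (not taken from allwine.csv)
quality_demo = np.array([5, 6, 7, 5, 8, 6, 5, 6, 6, 7])
target_demo = (quality_demo >= 7).astype(int)

positive_rate = target_demo.mean()
majority_acc = majority_baseline_accuracy(target_demo)
print(f"positive rate: {positive_rate:.2f}, majority-class accuracy: {majority_acc:.2f}")
```

Any accuracy reported later is easier to interpret next to this number.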



## 4. Feature Set

We use the same 10 features required by the assignment, so results stay comparable with the academic notebook.


In [None]:

FEATURES = ['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides',
            'free sulfur dioxide','density','pH','sulphates','alcohol']

X = df[FEATURES].copy()
y = df['target'].copy()

X.head()



## 5. Train/Test Split & Scaling


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

X_train.shape, X_test.shape



## 6. Baseline (Recap): From-Scratch Logistic Regression

Below is a compact recap of a from-scratch logistic regression implementation (vectorized).  
This mirrors the academic notebook but is included here to demonstrate algorithmic understanding.


In [None]:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def initialize_weights(n_features):
    w = np.zeros((1, n_features))
    b = 0.0
    return w, b

def optimize(w, b, X, y):
    """Return the gradients and the cross-entropy cost for one pass over the data."""
    m = X.shape[0]
    Y = y.values.reshape(1, -1)          # labels as a (1, m) row vector
    A = sigmoid(np.dot(w, X.T) + b)      # predicted probabilities, shape (1, m)
    eps = 1e-12                          # numerical stability: keeps log() away from zero
    cost = (-1/m) * np.sum(Y*np.log(A+eps) + (1-Y)*np.log(1-A+eps))
    dw = (1/m) * np.dot(X.T, (A - Y).T)  # shape (n_features, 1)
    db = (1/m) * np.sum(A - Y)
    return {"dw": dw, "db": db}, cost

def train_from_scratch(Xs, ys, lr=0.01, iters=1000):
    w, b = initialize_weights(Xs.shape[1])
    costs = []
    for i in range(iters):
        grads, cost = optimize(w, b, Xs, ys)
        w = w - lr * grads['dw'].T
        b = b - lr * grads['db']
        if i % 100 == 0:
            costs.append(cost)
    return w, b, costs

def predict_from_scratch(w, b, Xs):
    A = sigmoid(np.dot(w, Xs.T) + b)
    return (A.flatten() > 0.5).astype(int)

# Train on scaled data
w_fs, b_fs, costs_fs = train_from_scratch(pd.DataFrame(X_train_s), y_train, lr=0.05, iters=2000)
yhat_fs = predict_from_scratch(w_fs, b_fs, pd.DataFrame(X_test_s))

from sklearn.metrics import accuracy_score
acc_fs = accuracy_score(y_test, yhat_fs)
acc_fs


In [None]:

plt.figure(figsize=(6,4))
plt.plot(costs_fs)
plt.title('From-Scratch Logistic: Cost over Iterations')
plt.xlabel('Iterations (x100)')
plt.ylabel('Cost')
plt.show()
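
A common sanity check for a hand-written gradient is to compare it with central finite differences of the cost. The sketch below does this for the same cross-entropy cost, but on random data and with a 1-D weight vector instead of the `(1, n_features)` row used above. `cost_and_grads`, `numerical_grads`, and the `_demo` arrays are hypothetical names introduced for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, y):
    """Mean cross-entropy and its analytic gradients; w has shape (n,), X has shape (m, n)."""
    m = X.shape[0]
    a = sigmoid(X @ w + b)
    eps = 1e-12
    cost = -np.mean(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps))
    dw = X.T @ (a - y) / m
    db = np.mean(a - y)
    return cost, dw, db

def numerical_grads(w, b, X, y, h=1e-6):
    """Central finite differences of the cost with respect to each weight and the bias."""
    dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        c_plus = cost_and_grads(w + step, b, X, y)[0]
        c_minus = cost_and_grads(w - step, b, X, y)[0]
        dw[j] = (c_plus - c_minus) / (2 * h)
    db = (cost_and_grads(w, b + h, X, y)[0] - cost_and_grads(w, b - h, X, y)[0]) / (2 * h)
    return dw, db

# Random data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) < 0.3).astype(float)
w_demo = rng.normal(size=3)
b_demo = 0.1

_, dw_analytic, db_analytic = cost_and_grads(w_demo, b_demo, X_demo, y_demo)
dw_numeric, db_numeric = numerical_grads(w_demo, b_demo, X_demo, y_demo)
max_abs_diff = max(np.max(np.abs(dw_analytic - dw_numeric)), abs(db_analytic - db_numeric))
print("max |analytic - numeric| =", max_abs_diff)
```

A difference many orders of magnitude below the gradient values suggests the analytic expressions are right; a large gap usually points to a sign or shape error.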



## 7. Applied Models (sklearn) — Stronger Baselines & Ensembles


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

models = {
    "LogReg (sklearn)": LogisticRegression(max_iter=2000, random_state=RANDOM_STATE),
    "RandomForest": RandomForestClassifier(n_estimators=300, random_state=RANDOM_STATE),
    "GradientBoosting": GradientBoostingClassifier(random_state=RANDOM_STATE),
    "AdaBoost": AdaBoostClassifier(n_estimators=300, random_state=RANDOM_STATE),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
cv_results = {}

for name, clf in models.items():
    scores = cross_val_score(clf, X_train_s, y_train, cv=cv, scoring='accuracy', n_jobs=None)
    cv_results[name] = (scores.mean(), scores.std())

cv_results
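
The loop above cross-validates on `X_train_s`, which was scaled with statistics from the whole training split, so each validation fold has already influenced the scaler. For standardization the effect is usually small. One way to remove it is to put the scaler inside a `Pipeline`, so it is refit on the training part of every fold. The sketch below shows this on synthetic data from `make_classification`, not on the wine data; the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features (10 columns, imbalanced binary target)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# The scaler is refit on the training part of every fold, so the
# validation part never contributes to the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
print(f"accuracy: {pipe_scores.mean():.3f} +/- {pipe_scores.std():.3f}")
```

In the notebook, the same pattern would take the unscaled `X_train` instead of `X_train_s`. Tree ensembles do not depend on feature scaling, so this matters mainly for the logistic regression entry.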


In [None]:

# Fit best-performing model on full train and evaluate on test
# (We'll pick the model with the highest CV mean)
best_name = max(cv_results.items(), key=lambda kv: kv[1][0])[0]
best_model = models[best_name]
best_model.fit(X_train_s, y_train)
yhat_test = best_model.predict(X_test_s)

test_acc = accuracy_score(y_test, yhat_test)
print("Best CV model:", best_name)
print("Test Accuracy:", test_acc)



## 8. Evaluation: Confusion Matrix & Classification Report


In [None]:

from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, yhat_test)

plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d')
plt.title('Confusion Matrix (Test)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

print(classification_report(y_test, yhat_test, digits=3))
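
The headline numbers in the classification report follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand for the positive class, using hypothetical counts rather than this notebook's results. The layout matches scikit-learn's convention: rows are actual classes, columns are predicted classes.

```python
import numpy as np

# Hypothetical counts, for illustration only:
#                 predicted 0   predicted 1
cm_demo = np.array([[90, 10],    # actual 0
                    [15, 35]])   # actual 1
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)   # of the wines flagged "good", how many really are
recall = tp / (tp + fn)      # of the truly good wines, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```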



## 9. ROC & Precision-Recall Curves


In [None]:

from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

if hasattr(best_model, "predict_proba"):
    y_proba = best_model.predict_proba(X_test_s)[:,1]
else:
    # For models without predict_proba (e.g., some SVMs), fall back to decision_function if available
    if hasattr(best_model, "decision_function"):
        # min-max scale to [0,1] for plotting; these are ranking scores, not calibrated probabilities
        z = best_model.decision_function(X_test_s)
        y_proba = (z - z.min()) / (z.max() - z.min() + 1e-12)
    else:
        y_proba = yhat_test.astype(float)

fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

prec, rec, _ = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)

plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.3f}')
plt.plot([0,1],[0,1],'--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()

plt.figure(figsize=(6,4))
plt.plot(rec, prec, label=f'AP = {ap:.3f}')
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc='lower left')
plt.show()
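
The predictions above use the default 0.5 cutoff, while the precision-recall curve shows the trade-off at every threshold. One illustrative option, which this notebook does not use, is to read off the threshold that maximizes F1. The labels and scores below are made up and the `_demo` names are hypothetical. In practice the threshold would be chosen on validation data rather than on the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and scores, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.45, 0.90])

prec_demo, rec_demo, thr_demo = precision_recall_curve(y_true_demo, y_score_demo)

# precision and recall have one more entry than thresholds (the final
# precision=1, recall=0 point), so drop it before pairing with thresholds.
p, r = prec_demo[:-1], rec_demo[:-1]
f1_demo = 2 * p * r / (p + r + 1e-12)   # small constant guards against 0/0

best_idx = int(np.argmax(f1_demo))
best_threshold = thr_demo[best_idx]
best_f1 = f1_demo[best_idx]
print(f"best threshold: {best_threshold:.2f}, F1 at that threshold: {best_f1:.3f}")
```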



## 10. Feature Importance

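We inspect model-specific importances where the model exposes them (impurity-based importances for tree ensembles, absolute coefficients for logistic regression) and complement them with **Permutation Importance**.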


In [None]:

import numpy as np

def plot_importances(names, importances, title):
    order = np.argsort(importances)[::-1]
    plt.figure(figsize=(8,5))
    plt.bar(range(len(names)), np.array(importances)[order])
    plt.xticks(range(len(names)), np.array(names)[order], rotation=45, ha='right')
    plt.title(title)
    plt.tight_layout()
    plt.show()

# Model-specific importances (if available)
if hasattr(best_model, "feature_importances_"):
    plot_importances(FEATURES, best_model.feature_importances_, f'{best_name}: Feature Importances')
elif best_name.startswith("LogReg") and hasattr(best_model, "coef_"):
    plot_importances(FEATURES, np.abs(best_model.coef_[0]), f'{best_name}: |Coefficients|')
else:
    print("Model-specific importances not available for", best_name)


In [None]:

# Permutation Importance (simple implementation: one shuffle per feature, so estimates are noisy)
from sklearn.metrics import accuracy_score

baseline_acc = accuracy_score(y_test, yhat_test)
perm_importances = []

rng = np.random.RandomState(RANDOM_STATE)
X_test_s_copy = X_test_s.copy()

for j in range(X_test_s.shape[1]):
    saved = X_test_s_copy[:, j].copy()
    rng.shuffle(X_test_s_copy[:, j])
    y_perm_pred = best_model.predict(X_test_s_copy)
    perm_acc = accuracy_score(y_test, y_perm_pred)
    drop = baseline_acc - perm_acc
    perm_importances.append(drop)
    X_test_s_copy[:, j] = saved  # restore

plot_importances(FEATURES, perm_importances, f'{best_name}: Permutation Importance (Accuracy Drop)')
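
The loop above shuffles each feature once, so a single unlucky permutation can move a feature's rank. scikit-learn's `sklearn.inspection.permutation_importance` repeats the shuffle and reports a mean and standard deviation per feature. The sketch below runs it on synthetic data with a random forest, not on this notebook's fitted model; the `_demo` names and the repeat count are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
model_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Each feature is shuffled n_repeats times; the drop in accuracy is averaged.
result = permutation_importance(
    model_demo, X_te, y_te, scoring="accuracy", n_repeats=20, random_state=42
)

for j in result.importances_mean.argsort()[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.3f} +/- {result.importances_std[j]:.3f}")
```

In the notebook, the equivalent call would take `best_model`, `X_test_s`, and `y_test`. The standard deviations help separate real differences between features from shuffle noise.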



## 11. Probability Calibration (Optional)

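Well-calibrated probabilities matter when scores drive operational decisions such as batch triage: among wines given a predicted probability of 0.8, roughly 80% should actually be good. We check this with a calibration curve.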


In [None]:

from sklearn.calibration import calibration_curve

if 'y_proba' in locals():
    prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=10, strategy='quantile')
    plt.figure(figsize=(6,4))
    plt.plot(prob_pred, prob_true, marker='o')
    plt.plot([0,1],[0,1],'--')
    plt.title('Calibration Curve')
    plt.xlabel('Mean Predicted Probability')
    plt.ylabel('Fraction of Positives')
    plt.show()
else:
    print("Calibration skipped: probability scores not available.")
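
A calibration curve is a visual check. A single-number companion is the Brier score: the mean squared difference between predicted probability and the 0/1 outcome, where lower is better. The sketch below computes it by hand on made-up values; `brier_score` and the `_demo` arrays are hypothetical names. scikit-learn offers the same metric as `sklearn.metrics.brier_score_loss`.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and binary outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return np.mean((y_prob - y_true) ** 2)

# Hypothetical labels and predicted probabilities, for illustration only
y_true_demo = np.array([0, 0, 1, 1])
y_prob_demo = np.array([0.1, 0.4, 0.8, 0.6])

brier_demo = brier_score(y_true_demo, y_prob_demo)

# Reference point: always predicting the base rate of the positive class
base_rate = y_true_demo.mean()
brier_reference = brier_score(y_true_demo, np.full(y_true_demo.shape, base_rate))

print(f"model: {brier_demo:.4f}, base-rate reference: {brier_reference:.4f}")
```

A model is only informative in this sense if its score beats the base-rate reference.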



## 12. Conclusions & Next Steps

**Findings (example — update with your actual numbers):**
- From-scratch logistic regression reached ~**A%** accuracy.
- The best applied model was **<Model>** with **B%** test accuracy.
- Top drivers included **alcohol**, **sulphates**, and **volatile acidity** (based on importances).

**What this shows:**
- Ability to implement algorithms from scratch **and** build production-leaning models with robust validation.
- End-to-end ML: EDA → preprocessing → modeling → evaluation → interpretability → communication.

**Next steps:**
- Try multi-class quality prediction (0–10) or ordinal models.
- Add class imbalance handling (e.g., class weights, SMOTE).
- Explore model monitoring and drift over time.
- Package as a reproducible repo with `README`, environment file, and unit tests for data/metrics.
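
As a starting point for the class-imbalance item, the sketch below compares a default logistic regression with one using `class_weight="balanced"`. It uses a synthetic imbalanced dataset, not the wine data; the 90/10 class split and the `_demo` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives, for illustration only
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

plain = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
# "balanced" reweights samples inversely to class frequency in the training data
balanced = LogisticRegression(max_iter=2000, class_weight="balanced").fit(X_tr, y_tr)

pred_plain = plain.predict(X_te)
pred_balanced = balanced.predict(X_te)

n_pos_plain = int(pred_plain.sum())
n_pos_balanced = int(pred_balanced.sum())
recall_plain = recall_score(y_te, pred_plain)
recall_balanced = recall_score(y_te, pred_balanced)

print(f"predicted positives: plain={n_pos_plain}, balanced={n_pos_balanced}")
print(f"minority recall:     plain={recall_plain:.3f}, balanced={recall_balanced:.3f}")
```

Reweighting typically trades precision for recall on the minority class, so it is worth reading alongside the precision-recall curve rather than accuracy alone.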
