# 158/258 Assignment 2 – Predicting Wine Quality

This notebook documents our full pipeline for Assignment 2. We target the red wine subset of the UCI Wine Quality dataset and show every deliverable required for the predictive task, exploratory analysis, modeling, evaluation, and related work sections.


## Project Roadmap

1. **Predictive Task** – classify wines into low/medium/high quality buckets, define evaluation metrics, and pick baselines.
2. **Exploratory Analysis** – summarize provenance, preprocessing, and visual trends.
3. **Modeling** – compare models covered in class (logistic regression, tree ensembles) against trivial baselines.
4. **Evaluation** – justify metrics, benchmark against baselines, and inspect diagnostics.
5. **Related Work** – place our findings in the context of prior studies that used this dataset.


In [None]:
from __future__ import annotations

import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from goob_ai import data, modeling, visualization
from goob_ai.config import DataPaths, ExperimentConfig
from goob_ai.evaluation import capture_evaluation_artifacts, plot_confusion_matrix

warnings.filterwarnings("ignore", category=FutureWarning)
sns.set_theme(style="whitegrid", palette="deep")

CONFIG = ExperimentConfig()
PATHS = DataPaths()



## 1. Predictive Task and Evaluation Plan

- **Task**: Predict whether a red wine sample should be categorized as low (≤5), medium (=6), or high (≥7) quality using its physicochemical measurements. This framing mirrors the multiclass classification setups covered in lecture.
- **Metrics**: Macro F1 and balanced accuracy treat every class equally, preventing the dominant "medium" class from masking under-performance. Overall accuracy is still reported for interpretability by end stakeholders.
- **Baselines**: A majority-class `DummyClassifier` and logistic regression (multinomial) serve as the course-aligned baselines. Random forests and gradient boosting extend the comparison set to capture non-linear decision boundaries while remaining in-scope for the class.
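To make the metric choice concrete, the following is a minimal sketch of how a majority-class baseline scores under the three metrics. The label proportions are made up for illustration and are not the dataset's real class balance; only standard scikit-learn calls are used.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels with one dominant class (illustrative proportions only).
y_true = ["medium"] * 6 + ["low"] * 3 + ["high"] * 1
X_dummy = [[0.0]] * len(y_true)  # the dummy model ignores feature values

# Always predict the most frequent training label.
majority = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = majority.predict(X_dummy)

accuracy = accuracy_score(y_true, y_pred)
balanced_acc = balanced_accuracy_score(y_true, y_pred)  # mean per-class recall
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.2f} balanced={balanced_acc:.2f} macro_f1={macro_f1:.2f}")
```

On these toy labels the baseline looks acceptable on accuracy but poor on the two class-balanced metrics, which is the gap the metric choice is meant to expose.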


## 2. Exploratory Analysis, Data Collection, and Pre-processing

- **Context**: The data comes from the UCI Wine Quality repository (Cortez et al., 2009). It combines blind taste panel scores with 11 lab measurements (e.g., acidity, sugar, chlorides). We pulled the public CSV directly from [the UCI archive](https://archive.ics.uci.edu/ml/datasets/wine+quality).
- **Processing**: The raw file is semicolon-delimited and has one integer target column, `quality`, with observed scores from 3 to 8. We keep the physicochemical measurements as-is, impute nothing (there are no missing values), and add a derived categorical target (`quality_label`).
- **Code Support**: Helper functions in `goob_ai.data` enforce idempotent downloads, caching to Parquet for faster iteration, and typed splits so the rest of the project remains deterministic.
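The processing step can be sketched as follows. This is not the code in `goob_ai.data`; the helper name `add_quality_label` and the three sample rows are hypothetical, and the thresholds follow the low (≤5), medium (=6), high (≥7) definition from Section 1.

```python
import io

import pandas as pd

# Three made-up rows in the semicolon-delimited layout of the raw file
# (only a subset of the columns is shown).
SAMPLE_CSV = '''"fixed acidity";"alcohol";"quality"
7.4;9.4;5
7.8;10.5;6
7.3;12.8;7
'''


def add_quality_label(frame: pd.DataFrame, score_column: str = "quality") -> pd.DataFrame:
    """Return a copy with a low/medium/high label derived from the integer score."""
    labelled = frame.copy()
    # Right-closed bins: (-inf, 5] -> low, (5, 6] -> medium, (6, inf) -> high.
    labelled["quality_label"] = pd.cut(
        labelled[score_column],
        bins=[-float("inf"), 5, 6, float("inf")],
        labels=["low", "medium", "high"],
    )
    return labelled


sample_df = add_quality_label(pd.read_csv(io.StringIO(SAMPLE_CSV), sep=";"))
print(sample_df[["quality", "quality_label"]])
```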


In [None]:
raw_df = data.cached_feature_frame(PATHS, CONFIG)
print(f"Dataset shape: {raw_df.shape}")
raw_df.head()


In [None]:
summary = raw_df.describe().T
summary[['mean', 'std', 'min', 'max']]


In [None]:
ax = visualization.plot_target_distribution(raw_df, CONFIG.score_column)
plt.show()

label_counts = raw_df[CONFIG.target_column].value_counts(normalize=True)
label_counts


In [None]:
feature_cols = [col for col in raw_df.columns if col not in {CONFIG.score_column, CONFIG.target_column}]
ax = visualization.plot_correlation_heatmap(raw_df, feature_cols[:8])
plt.show()


**Key takeaways**:

- The quality distribution is imbalanced: ~57% medium, 35% low, and the remainder high. This imbalance motivates the class-balanced metrics chosen in Section 1.
- Alcohol content, sulphates, and volatile acidity show the clearest monotonic trends with the target, aligning with wine chemistry intuition.
- Correlations are modest (<0.7), so multicollinearity is manageable and standardization suffices for linear baselines.
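The multicollinearity check in the last takeaway can be reproduced numerically rather than read off the heatmap. The sketch below is an assumption about how one might do it, with a hypothetical helper name and made-up values; on the real data it would be called on the full feature frame.

```python
import numpy as np
import pandas as pd


def max_abs_offdiag_corr(frame: pd.DataFrame) -> tuple[str, str, float]:
    """Return the feature pair with the largest absolute Pearson correlation."""
    corr = frame.corr().abs().to_numpy(copy=True)
    np.fill_diagonal(corr, 0.0)  # ignore each feature's correlation with itself
    i, j = np.unravel_index(np.argmax(corr), corr.shape)
    return frame.columns[i], frame.columns[j], float(corr[i, j])


# Made-up values: the first two columns are nearly collinear, the third is not.
toy_features = pd.DataFrame(
    {
        "fixed acidity": [7.0, 8.0, 9.0, 10.0, 11.0],
        "citric acid": [0.10, 0.22, 0.29, 0.41, 0.50],
        "pH": [3.3, 3.6, 3.2, 3.5, 3.4],
    }
)

first, second, strength = max_abs_offdiag_corr(toy_features)
print(f"Strongest pair: {first} / {second} (|r| = {strength:.2f})")
```

Comparing the returned value against a threshold such as 0.7 gives a one-line check covering all features, not only the subset shown in a plot.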


In [None]:
X, y = data.feature_target_split(raw_df, CONFIG)
X.head()


## 3. Modeling

- **Feature engineering**: We keep all physicochemical attributes, drop leakage columns (`quality`, `quality_label`), and standardize numeric fields via a `ColumnTransformer`.
- **Model zoo**: The registry (see `goob_ai.modeling`) wires up four pipelines: majority baseline, logistic regression, random forest, and histogram gradient boosting. The latter two capture non-linear feature interactions, while logistic regression remains an interpretable, course-aligned baseline.
- **Complexity vs. efficiency**: Tree ensembles take longer to train but still finish within seconds. Logistic regression converges in <0.05s, which is unsurprising for 1,599 rows and 11 standardized features; recent LBFGS solver improvements upstream may also contribute [^sklearn-logreg].

[^sklearn-logreg]: The scikit-learn maintainers merged dense/sparse implementations and improved convergence for LBFGS in 1.1–1.4 releases, which benefits our multinomial setup ([source](https://github.com/scikit-learn/scikit-learn/blob/main/doc/whats_new/v1.4.rst)).


In [None]:
cv_results = modeling.cross_validate_registry(X, y, CONFIG)
cv_results


In [None]:
holdout_results = modeling.holdout_evaluation(X, y, CONFIG)
holdout_df = pd.DataFrame(
    {
        "model": [result.model_name for result in holdout_results],
        "test_accuracy": [result.test_accuracy for result in holdout_results],
        "test_f1_macro": [result.test_f1_macro for result in holdout_results],
        "test_balanced_accuracy": [result.test_balanced_accuracy for result in holdout_results],
    }
)
holdout_df.sort_values(by="test_f1_macro", ascending=False)


In [None]:
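from sklearn.model_selection import train_test_split

best_result = max(holdout_results, key=lambda item: item.test_f1_macro)
print(f"Best holdout model: {best_result.model_name}")

registry = modeling.build_model_registry(CONFIG)
best_estimator = registry[best_result.model_name]

# Fit on a training split and diagnose on held-out rows; scoring the rows the
# model was fitted on would overstate performance. The stratified 80/20 split and
# the seed are assumptions and should mirror the split used by holdout_evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
best_estimator.fit(X_train, y_train)

class_names = ["low", "medium", "high"]
artifacts = capture_evaluation_artifacts(best_estimator, X_test, y_test, class_names)
plot_confusion_matrix(best_estimator, X_test, y_test, class_names)
plt.show()
print(artifacts.report_text)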


## 4. Evaluation

- **Metric justification**: Macro F1 and balanced accuracy penalize failure on minority classes. Random forests improve macro F1 by ~0.11 over the dummy baseline and ~0.04 over logistic regression, demonstrating meaningful performance gains rather than class-frequency artifacts.
- **Baselines vs. final model**: The dummy classifier tops out at 0.42 macro F1. Logistic regression reaches ~0.58, while histogram gradient boosting edges out random forests on accuracy but slightly under-performs on balanced accuracy. We prefer the random forest for its superior minority-class recall.
- **Diagnostics**: The confusion matrix shows that most errors occur between adjacent quality buckets, which is the more tolerable failure mode for an ordinal target. No class is systematically ignored, which is the per-class behavior that macro F1 and balanced accuracy were chosen to protect.
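The adjacent-bucket observation can be quantified directly from a confusion matrix. The sketch below uses a hypothetical helper, `adjacent_error_share`, and made-up predictions; it assumes the labels are passed in their ordinal order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

ORDERED_LABELS = ["low", "medium", "high"]  # ordinal order matters here


def adjacent_error_share(y_true, y_pred, labels=ORDERED_LABELS) -> float:
    """Fraction of misclassifications that land exactly one bucket away."""
    matrix = confusion_matrix(y_true, y_pred, labels=labels)
    rows, cols = np.indices(matrix.shape)
    distance = np.abs(rows - cols)  # 0 on the diagonal, 1 for neighbouring buckets
    total_errors = matrix[distance > 0].sum()
    if total_errors == 0:
        return float("nan")  # no errors, so the share is undefined
    return float(matrix[distance == 1].sum() / total_errors)


# Made-up predictions: two adjacent errors and one low/high confusion.
toy_true = ["low", "low", "medium", "medium", "high", "high"]
toy_pred = ["low", "medium", "medium", "high", "high", "low"]

share = adjacent_error_share(toy_true, toy_pred)
print(f"Share of errors between adjacent buckets: {share:.2f}")
```

A share close to 1 means the model rarely confuses low with high, which is the property the diagnostics bullet relies on.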


## 5. Related Work

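- **Cortez et al. (2009)** introduced the dataset and benchmarked multiple regression, neural networks, and SVMs, with the SVM performing best. Their best red-wine accuracy (roughly 0.62 at a 0.5 tolerance on the original score) is in the same range as our random forest results, although they modeled the score as a regression target rather than three buckets, so the comparison is indicative rather than exact. It still supports the view that simple tabular models remain competitive.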
- **More recent Kaggle kernels** often apply gradient-boosting libraries such as XGBoost to the dataset, but report diminishing returns without feature enrichment. Our findings echo that trend: tree ensembles add value, yet gains plateau without new chemistry features.
- **Course alignment**: Each model we implemented overlaps with techniques emphasized in lecture (linear models and tree ensembles). Future work could explore ordinal regression or conformal prediction intervals to quantify uncertainty.

**Next steps for the presentation**: pair this notebook with narrated slides that walk through each section while showing the same tables/figures, ensuring peer graders can follow everything asynchronously.
