🍷 Wine Quality Prediction - ML Zoomcamp Project

Problem Description

Context: Wine quality assessment is crucial for wineries, distributors, and consumers. Traditional quality evaluation relies on expert sommeliers conducting sensory analysis, which is subjective, time-consuming, and expensive.

Problem: Can we predict wine quality objectively using physicochemical properties measured through laboratory tests?

Solution: This project trains two machine learning models on 11 physicochemical properties (acidity, pH, alcohol content, etc.):

Regression Model: Predicts exact quality score (0-10 scale)
Classification Model: Classifies wine as "Good" (≥7) or "Average" (<7)
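The two targets are linked by a single threshold. A minimal sketch of that mapping, assuming the cutoff of 7 stated above; the helper name `to_quality_class` is hypothetical and is not taken from the project's code:

```python
GOOD_THRESHOLD = 7  # cutoff stated in this README: "Good" means quality >= 7


def to_quality_class(quality_score: float) -> int:
    """Map a 0-10 quality score to the binary label: 1 = Good, 0 = Average."""
    return int(quality_score >= GOOD_THRESHOLD)


# A few example scores and their binary labels.
scores = [5, 6, 7, 8, 3]
labels = [to_quality_class(s) for s in scores]
print(labels)
```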

Business Value:

🏭 Winemakers: Optimize production processes by identifying key quality factors
🔬 Quality Control: Early detection of quality issues before bottling
💰 Pricing: Data-driven pricing based on predicted quality
🛒 Consumers: Make informed purchasing decisions

Dataset

Source: UCI Machine Learning Repository - Wine Quality Dataset

Description:

Samples: 1,599 red wine samples
Features: 11 physicochemical properties
Target: Quality score (0-10) based on sensory evaluation

Features:

fixed acidity - Tartaric acid content (g/dm³)
volatile acidity - Acetic acid content (g/dm³)
citric acid - Freshness factor (g/dm³)
residual sugar - Sugar after fermentation (g/dm³)
chlorides - Salt content (g/dm³)
free sulfur dioxide - Free SO₂ (mg/dm³)
total sulfur dioxide - Total SO₂ (mg/dm³)
density - Wine density (g/cm³)
pH - Acidity level (0-14 scale)
sulphates - Potassium sulphate (g/dm³)
alcohol - Alcohol percentage (%)

Dataset included in repository: winequality-red.csv

Project Structure

wine-quality-prediction/
│
├── data/
│   └── winequality-red.csv          # Dataset
│
├── notebook.ipynb                    # Complete EDA and modeling
├── train.py                          # Training script
├── predict.py                        # Flask API service
├── test_api.py                       # API testing script
│
├── wine_quality_regressor.pkl        # Trained regression model
├── wine_quality_classifier.pkl       # Trained classification model
├── feature_scaler.pkl                # Feature scaler
│
├── requirements.txt                  # Python dependencies
├── Dockerfile                        # Docker configuration
├── README.md                         # This file
│
└── wine_quality_streamlit.py         # (BONUS) Streamlit web app

Key Findings from EDA

Data Overview

No missing values ✓
Quality scores range from 3 to 8 (most wines are 5-6)
Only 13.6% of wines are "good quality" (≥7) - class imbalance
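The overview figures can be rechecked without pandas. The sketch below assumes the file has a `quality` column and semicolon delimiters (as noted under Troubleshooting); the function name `quality_summary` is illustrative only:

```python
import csv
from collections import Counter


def quality_summary(lines):
    """Return (count per quality score, share of wines with quality >= 7).

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    reader = csv.DictReader(lines, delimiter=";")
    scores = [int(row["quality"]) for row in reader]
    counts = Counter(scores)
    good_share = sum(1 for s in scores if s >= 7) / len(scores)
    return counts, good_share
```

Calling it on an open handle to winequality-red.csv (opened with `newline=""`) should give counts for scores 3 to 8 and a share close to 0.136, if the file matches the figures above.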

Most Important Features

Alcohol (+) - Higher alcohol → Better quality
Volatile Acidity (-) - Too much → Vinegar taste
Sulphates (+) - Preservative quality
Citric Acid (+) - Adds freshness

Correlations

Strongest positive: Alcohol ↔ Quality (ρ = 0.48)
Strongest negative: Volatile Acidity ↔ Quality (ρ = -0.39)
Density and alcohol are inversely related

Model Performance

Regression Model (Random Forest)

MAE: 0.52 (on the 0-10 quality scale)
RMSE: 0.65
R²: 0.42
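For reference, the three regression metrics are defined as follows. This is a plain-Python sketch of the standard formulas, not the project's evaluation code (which presumably uses scikit-learn's metric functions):

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}
```

An R² of 0.42 means the model explains roughly 42% of the variance in the quality scores it was evaluated on.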

Classification Model (Random Forest)

Accuracy: 88%
Precision: 0.68
Recall: 0.54
F1-Score: 0.60
ROC AUC: 0.75
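Because only 13.6% of wines are "good", accuracy should be read against a majority-class baseline: a classifier that always predicts "Average" would already score about 86.4%, assuming the evaluation split has the same class share as the full dataset. The sketch below shows the standard metric definitions from confusion-matrix counts; the function names are illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def majority_baseline_accuracy(positive_share):
    """Accuracy of always predicting the more frequent class."""
    return max(positive_share, 1 - positive_share)


baseline = majority_baseline_accuracy(0.136)
```

The reported F1 is consistent with the reported precision and recall: 2 × 0.68 × 0.54 / (0.68 + 0.54) ≈ 0.60.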

Model Selection: Random Forest outperformed Linear Regression and Gradient Boosting after hyperparameter tuning with GridSearchCV.

Installation & Setup

Prerequisites

Python 3.11 (recommended; Python 3.12 may need the setuptools workaround described under Troubleshooting)
Docker (for containerized deployment)

Option 1: Local Installation

Clone the repository

git clone https://github.com/HighviewOne/MLZoomcampProject1.git
cd MLZoomcampProject1

Create virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies

pip install -r requirements.txt

Train the model (optional - models already included)

python train.py

Start the Flask API

python predict.py

The API will be available at http://localhost:9696

Option 2: Docker Deployment

Build Docker image

docker build -t wine-quality-api .

Run container

docker run -it -p 9696:9696 wine-quality-api

Verify it's running

curl http://localhost:9696/health
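A container can take a moment to start, so a single curl may fail even when the service is fine. The sketch below polls the health endpoint using only the standard library. The function name `wait_for_health` and its default timings are assumptions, not part of the project:

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_health(url="http://localhost:9696/health", timeout_s=30.0, interval_s=1.0):
    """Poll the health endpoint until it answers 200 or the timeout expires.

    Returns the decoded JSON body on success, or None if the service never
    became reachable within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s + 1) as resp:
                if resp.status == 200:
                    return json.loads(resp.read().decode("utf-8"))
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # service not up yet; retry after a short pause
        time.sleep(interval_s)
    return None
```

The function assumes the endpoint returns JSON, as in the Health Check example in this README; a non-JSON body would raise an error rather than return None.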

API Usage

Endpoints

1. Health Check

GET http://localhost:9696/health

Response:

{
  "status": "healthy",
  "service": "wine-quality-predictor",
  "version": "1.0"
}

2. Predict Wine Quality

POST http://localhost:9696/predict
Content-Type: application/json

Request Body:

{
  "fixed acidity": 7.4,
  "volatile acidity": 0.7,
  "citric acid": 0.0,
  "residual sugar": 1.9,
  "chlorides": 0.076,
  "free sulfur dioxide": 11.0,
  "total sulfur dioxide": 34.0,
  "density": 0.9978,
  "pH": 3.51,
  "sulphates": 0.56,
  "alcohol": 9.4
}
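Before sending a request, it can help to check the body on the client side. The sketch below assumes the service expects exactly the 11 keys shown above; whether predict.py validates input itself is not stated here, and `check_payload` is a hypothetical helper:

```python
FEATURE_NAMES = [
    "fixed acidity", "volatile acidity", "citric acid",
    "residual sugar", "chlorides", "free sulfur dioxide",
    "total sulfur dioxide", "density", "pH",
    "sulphates", "alcohol",
]


def check_payload(payload: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    for name in FEATURE_NAMES:
        if name not in payload:
            problems.append(f"missing feature: {name}")
        elif isinstance(payload[name], bool) or not isinstance(payload[name], (int, float)):
            problems.append(f"non-numeric value for: {name}")
    for name in payload:
        if name not in FEATURE_NAMES:
            problems.append(f"unexpected key: {name}")
    return problems


# The request body from this README passes the check.
example_payload = {
    "fixed acidity": 7.4, "volatile acidity": 0.7, "citric acid": 0.0,
    "residual sugar": 1.9, "chlorides": 0.076, "free sulfur dioxide": 11.0,
    "total sulfur dioxide": 34.0, "density": 0.9978, "pH": 3.51,
    "sulphates": 0.56, "alcohol": 9.4,
}
```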

Response:

{
  "status": "success",
  "predictions": {
    "quality_score": 5.23,
    "quality_class": "Average Wine (<7)",
    "quality_class_numeric": 0,
    "probabilities": {
      "bad_wine": 0.847,
      "good_wine": 0.153
    },
    "confidence": 84.7
  },
  "input": {...}
}
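In the sample response, `confidence` equals the larger of the two class probabilities expressed as a percentage (0.847 → 84.7). Whether predict.py computes it exactly that way is an assumption. A small sketch for turning such a response into a one-line summary, with a hypothetical helper name:

```python
def summarize_prediction(response: dict) -> str:
    """Build a one-line summary from a /predict response body."""
    preds = response["predictions"]
    probs = preds["probabilities"]
    top_label = max(probs, key=probs.get)  # class with the highest probability
    return (
        f"{preds['quality_class']}: score {preds['quality_score']:.2f}, "
        f"most likely class {top_label} ({probs[top_label]:.1%})"
    )


# Sample values copied from the response shown in this README.
sample_response = {
    "status": "success",
    "predictions": {
        "quality_score": 5.23,
        "quality_class": "Average Wine (<7)",
        "quality_class_numeric": 0,
        "probabilities": {"bad_wine": 0.847, "good_wine": 0.153},
        "confidence": 84.7,
    },
}
summary = summarize_prediction(sample_response)
```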

Testing the API

Using curl:

curl -X POST http://localhost:9696/predict \
  -H "Content-Type: application/json" \
  -d '{
    "fixed acidity": 7.4,
    "volatile acidity": 0.7,
    "citric acid": 0.0,
    "residual sugar": 1.9,
    "chlorides": 0.076,
    "free sulfur dioxide": 11.0,
    "total sulfur dioxide": 34.0,
    "density": 0.9978,
    "pH": 3.51,
    "sulphates": 0.56,
    "alcohol": 9.4
  }'

Using Python:

import requests

wine_data = {
    "fixed acidity": 7.4,
    "volatile acidity": 0.7,
    "citric acid": 0.0,
    "residual sugar": 1.9,
    "chlorides": 0.076,
    "free sulfur dioxide": 11.0,
    "total sulfur dioxide": 34.0,
    "density": 0.9978,
    "pH": 3.51,
    "sulphates": 0.56,
    "alcohol": 9.4
}

response = requests.post('http://localhost:9696/predict', json=wine_data)
print(response.json())

Using test script:

python test_api.py

Reproducibility

Training from Scratch

Ensure dataset is present: winequality-red.csv
Run training script:

python train.py

This will:

Load and preprocess data
Train models with GridSearchCV
Save three files:
- wine_quality_regressor.pkl
- wine_quality_classifier.pkl
- feature_scaler.pkl
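The tuning step can be sketched as follows. This is not the contents of train.py: the data is synthetic so the example runs without the CSV, the parameter grid is hypothetical, and the scaler is bundled into a Pipeline for brevity, whereas the project saves it as a separate file.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 features and the quality target.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 11))
y = 5.5 + 0.8 * X[:, 10] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=150)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Hypothetical grid; the README does not list the values train.py searches.
param_grid = {
    "model__n_estimators": [20, 50],
    "model__max_depth": [3, None],
}

# 3-fold cross-validation, scored by (negated) mean absolute error.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_)
```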

Start API:

python predict.py

Running the Notebook

Open notebook.ipynb in Jupyter:

jupyter notebook notebook.ipynb

The notebook contains:

Complete EDA with visualizations
Feature analysis
Model training and comparison
Hyperparameter tuning
Model evaluation

Bonus: Streamlit Web App

An interactive web interface is also provided (not required for ML Zoomcamp, but great for demos!):

streamlit run wine_quality_streamlit.py

Access at: http://localhost:8501

Features:

Interactive sliders for all 11 features
Real-time predictions
Visual feedback and confidence scores
Educational tooltips

Deployment (Bonus)

Local Docker Deployment ✅

Build and run:

docker build -t wine-quality-api .
docker run -it -p 9696:9696 wine-quality-api

Cloud Deployment Options

The service can be deployed to:

AWS Elastic Beanstalk
Google Cloud Run
Azure Container Instances
Heroku

(Deployment instructions available upon request)

Project Limitations & Future Work

Current Limitations

Dataset Size: Only 1,599 samples - more data could improve generalization
Class Imbalance: Only 13.6% "good" wines - affects classification performance
Geographic Scope: Dataset from Portuguese wines - may not generalize globally
Quality Subjectivity: Based on human ratings which vary

Future Improvements

Feature Engineering: Polynomial features, interaction terms
Advanced Models: XGBoost, Neural Networks
Ensemble Methods: Stacking multiple models
SHAP Values: For better model interpretability
Real-time Monitoring: Track model performance in production
Multi-class Classification: Predict exact quality levels (3-8)

Tech Stack

Language: Python 3.11
ML Framework: Scikit-learn
Web Framework: Flask
Data Processing: Pandas, NumPy
Visualization: Matplotlib, Seaborn
Deployment: Docker
Version Control: Git

Model Files

Three pickle files are included:

wine_quality_regressor.pkl (3.2 MB) - Random Forest Regressor
wine_quality_classifier.pkl (3.0 MB) - Random Forest Classifier
feature_scaler.pkl (1.5 KB) - StandardScaler for features
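A sketch for loading the three files together, assuming they were written with the standard `pickle` module (not joblib) and sit in one directory. `load_artifacts` is a hypothetical helper, not a function from predict.py:

```python
import pickle
from pathlib import Path

ARTIFACT_FILES = {
    "regressor": "wine_quality_regressor.pkl",
    "classifier": "wine_quality_classifier.pkl",
    "scaler": "feature_scaler.pkl",
}


def load_artifacts(directory="."):
    """Unpickle the three artifacts; raise FileNotFoundError naming any that are missing."""
    base = Path(directory)
    missing = [name for name in ARTIFACT_FILES.values() if not (base / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing model files: {', '.join(missing)}")
    artifacts = {}
    for key, name in ARTIFACT_FILES.items():
        with open(base / name, "rb") as f:
            artifacts[key] = pickle.load(f)
    return artifacts
```

Only unpickle files from a trusted source, and load them with the same scikit-learn version that produced them where possible.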

Troubleshooting

Common Issues & Solutions

Issue 1: ModuleNotFoundError when running scripts

Error:

ModuleNotFoundError: No module named 'pandas'

Solution:

# Make sure virtual environment is activated
source venv/bin/activate

# Verify you see (venv) in your prompt
# Then run your script
python train.py

Issue 2: Python 3.12 compatibility errors during pip install

Error:

AttributeError: module 'pkgutil' has no attribute 'ImpImporter'

Solution:

# Option 1: Install setuptools first
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

# Option 2: Use Python 3.11 (recommended)
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Issue 3: Docker port already in use

Error:

Error starting userland proxy: listen tcp4 0.0.0.0:9696: bind: address already in use

Solution:

# Kill process using port 9696
lsof -ti:9696 | xargs kill -9

# Or use a different port
docker run -it --rm -p 8080:9696 wine-quality-api
# Then test on: http://localhost:8080
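To check from Python whether the port is taken before starting the container, a TCP connection attempt is enough. A minimal sketch; `port_in_use` is an illustrative name:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return s.connect_ex((host, port)) == 0
```

`port_in_use(9696)` returning True before the service is started suggests another process holds the port.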

Issue 4: Docker build fails - missing .pkl files

Error:

COPY failed: file not found in build context

Solution:

# Train models first to generate .pkl files
source venv/bin/activate
python train.py

# Verify files exist
ls *.pkl

# Then build Docker image
docker build -t wine-quality-api .

Issue 5: API returns 404 Not Found

Error:

404 Not Found - The requested URL was not found on the server

Solution:

# Use correct endpoint: /predict (not /flask_predict_service)
curl -X POST http://localhost:9696/predict \
  -H "Content-Type: application/json" \
  -d '{...}'

# Available endpoints:
# GET  /         - API documentation
# GET  /health   - Health check
# POST /predict  - Prediction endpoint

Issue 6: CSV parsing error

Error:

ParserError: Error tokenizing data

Solution:

import pandas as pd
# The CSV uses semicolons as delimiters
df = pd.read_csv('winequality-red.csv', sep=';')  # Not comma!

Issue 7: Feature names mismatch

Error:

ValueError: The feature names should match those that were passed during fit

Solution:

# Ensure features are in correct order and use exact column names:
feature_names = [
    'fixed acidity', 'volatile acidity', 'citric acid',
    'residual sugar', 'chlorides', 'free sulfur dioxide',
    'total sulfur dioxide', 'density', 'pH',
    'sulphates', 'alcohol'
]
input_df = input_df[feature_names]  # Reorder if needed

Issue 8: Docker daemon not running

Error:

Cannot connect to the Docker daemon

Solution:

# Start Docker service
sudo systemctl start docker

# Or on Mac/Windows: Start Docker Desktop application

# Verify Docker is running
docker --version
docker ps

Verification Checklist

Before submitting or sharing:

# 1. Fresh clone test
cd /tmp
git clone YOUR_REPO_URL
cd MLZoomcampProject1  # or whichever directory the clone created

# 2. Virtual environment test
python3.11 -m venv test_venv
source test_venv/bin/activate
pip install -r requirements.txt

# 3. Training test
python train.py  # Should complete in <10 minutes

# 4. API test
python predict.py &
sleep 5
curl http://localhost:9696/health
python test_api.py
pkill -f predict.py

# 5. Docker test
docker build -t wine-test .
docker run -d -p 9696:9696 --name wine-test wine-test
sleep 5  # give the container time to start
curl http://localhost:9696/health
docker stop wine-test && docker rm wine-test

# If all pass ✅ You're ready to submit!

Author

Your Name

Acknowledgments

Dataset: UCI Machine Learning Repository
Course: ML Zoomcamp by DataTalks.Club
Instructor: Alexey Grigorev
Community: DataTalks.Club Slack community

License

This project is for educational purposes as part of ML Zoomcamp.
