Context: Wine quality assessment is crucial for wineries, distributors, and consumers. Traditional quality evaluation relies on expert sommeliers conducting sensory analysis, which is subjective, time-consuming, and expensive.
Problem: Can we predict wine quality objectively using physicochemical properties measured through laboratory tests?
Solution: This project builds a machine learning model that predicts wine quality based on 11 physicochemical properties (acidity, pH, alcohol content, etc.), providing:
- Regression Model: Predicts a numeric quality score (0-10 scale)
- Classification Model: Classifies wine as "Good" (≥7) or "Average" (<7)
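As a rough illustration of how the two targets relate, the classification label can be derived from the quality score using the ≥7 threshold stated above. The function name and the sample scores are hypothetical and are not taken from the project's code.

```python
def to_quality_class(score: float, threshold: int = 7) -> int:
    """Return 1 for "Good" (score >= threshold), else 0 for "Average"."""
    return 1 if score >= threshold else 0


# Hypothetical quality scores, for illustration only.
scores = [5, 6, 7, 5, 8, 6, 4, 7]
labels = [to_quality_class(s) for s in scores]

# Share of "Good" wines in the sample; a low share indicates class imbalance.
good_share = sum(labels) / len(labels)
print(labels, f"{good_share:.1%} good")
```

The 0/1 encoding mirrors the `quality_class_numeric` field in the API response, where 0 corresponds to "Average".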
Business Value:
- 🏭 Winemakers: Optimize production processes by identifying key quality factors
- 🔬 Quality Control: Early detection of quality issues before bottling
- 💰 Pricing: Data-driven pricing based on predicted quality
- 🛒 Consumers: Make informed purchasing decisions
Source: UCI Machine Learning Repository - Wine Quality Dataset
Description:
- Samples: 1,599 red wine samples
- Features: 11 physicochemical properties
- Target: Quality score (0-10) based on sensory evaluation
Features:
- fixed acidity - Tartaric acid content (g/dm³)
- volatile acidity - Acetic acid content (g/dm³)
- citric acid - Freshness factor (g/dm³)
- residual sugar - Sugar after fermentation (g/dm³)
- chlorides - Salt content (g/dm³)
- free sulfur dioxide - Free SO₂ (mg/dm³)
- total sulfur dioxide - Total SO₂ (mg/dm³)
- density - Wine density (g/cm³)
- pH - Acidity level (0-14 scale)
- sulphates - Potassium sulphate (g/dm³)
- alcohol - Alcohol percentage (%)
Dataset included in repository: winequality-red.csv
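The CSV uses semicolons rather than commas as delimiters (see the troubleshooting notes further down). A minimal standard-library sketch of reading it follows. The helper name is hypothetical, the single inline row stands in for the real file, and the `quality` column name is an assumption about the file header.

```python
import csv
import io

EXPECTED_FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]


def read_wine_rows(handle):
    """Parse semicolon-delimited rows into dicts of floats, checking the header."""
    reader = csv.DictReader(handle, delimiter=";")
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_FEATURES + ["quality"] if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [{key: float(value) for key, value in row.items()} for row in reader]


# Stand-in for open("data/winequality-red.csv", newline=""); values are illustrative.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";'
    '"alcohol";"quality"\n'
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
)
rows = read_wine_rows(sample)
print(len(rows), rows[0]["alcohol"], rows[0]["quality"])
```

Checking the header up front turns a wrong delimiter into an explicit error instead of a silently mis-parsed single column.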
wine-quality-prediction/
│
├── data/
│ └── winequality-red.csv # Dataset
│
├── notebook.ipynb # Complete EDA and modeling
├── train.py # Training script
├── predict.py # Flask API service
├── test_api.py # API testing script
│
├── wine_quality_regressor.pkl # Trained regression model
├── wine_quality_classifier.pkl # Trained classification model
├── feature_scaler.pkl # Feature scaler
│
├── requirements.txt # Python dependencies
├── Dockerfile # Docker configuration
├── README.md # This file
│
└── app.py # (BONUS) Streamlit web app
- No missing values ✓
- Quality scores range from 3 to 8 (most wines are 5-6)
- Only 13.6% of wines are "good quality" (≥7) - class imbalance
- Alcohol (+) - Higher alcohol → Better quality
- Volatile Acidity (-) - Too much → Vinegar taste
- Sulphates (+) - Preservative quality
- Citric Acid (+) - Adds freshness
- Strongest positive: Alcohol ↔ Quality (ρ = 0.48, a moderate correlation)
- Strongest negative: Volatile Acidity ↔ Quality (ρ = -0.39, a moderate correlation)
- Density and alcohol are inversely related
- MAE: 0.52 (on the 0-10 quality scale)
- RMSE: 0.65
- R²: 0.42
- Accuracy: 88%
- Precision: 0.68
- Recall: 0.54
- F1-Score: 0.60
- ROC AUC: 0.75
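For reference, the metrics listed above can be sketched in a few lines of plain Python. The true and predicted scores below are hypothetical; the last line only checks that the reported F1-Score is consistent with the reported precision and recall.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return mae, rmse, 1 - ss_res / ss_tot


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical true and predicted quality scores, for illustration only.
mae, rmse, r2 = regression_metrics([5, 6, 7, 5], [5.5, 5.5, 6.0, 5.0])
print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")

# Consistency check on the reported figures: precision 0.68, recall 0.54.
print(f"F1={f1_score(0.68, 0.54):.2f}")  # prints F1=0.60
```

With only 13.6% positive examples, accuracy alone can look high even for a weak classifier, which is why precision, recall, and F1 are reported alongside it.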
Model Selection: Random Forest outperformed Linear Regression and Gradient Boosting after hyperparameter tuning with GridSearchCV.
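GridSearchCV scores every combination in a parameter grid with cross-validation and keeps the best one. The sketch below shows only the enumeration step using the standard library. The grid values and the scoring function are hypothetical stand-ins; the grid actually searched in train.py is not documented here, and in scikit-learn the score would come from cross-validating a real model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid, for illustration only.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at n_estimators=100, max_depth=10."""
    depth = params["max_depth"] if params["max_depth"] is not None else 30
    return -abs(params["n_estimators"] - 100) / 100 - abs(depth - 10) / 10


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The number of fits grows multiplicatively with each added parameter (here 3 × 3 = 9 combinations, times the number of cross-validation folds in a real search).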
- Python 3.11+
- Docker (for containerized deployment)
- Clone the repository
git clone https://github.com/HighviewOne/MLZoomcampProject1.git
cd MLZoomcampProject1
- Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Train the model (optional - models already included)
python train.py
- Start the Flask API
python predict.py
The API will be available at http://localhost:9696
- Build Docker image
docker build -t dockerfile .
- Run container
docker run -it -p 9696:9696 dockerfile
- Verify it's running
curl http://localhost:9696/health
GET http://localhost:9696/health
Response:
{
"status": "healthy",
"service": "wine-quality-predictor",
"version": "1.0"
}
POST http://localhost:9696/predict
Content-Type: application/json
Request Body:
{
"fixed acidity": 7.4,
"volatile acidity": 0.7,
"citric acid": 0.0,
"residual sugar": 1.9,
"chlorides": 0.076,
"free sulfur dioxide": 11.0,
"total sulfur dioxide": 34.0,
"density": 0.9978,
"pH": 3.51,
"sulphates": 0.56,
"alcohol": 9.4
}
Response:
{
"status": "success",
"predictions": {
"quality_score": 5.23,
"quality_class": "Average Wine (<7)",
"quality_class_numeric": 0,
"probabilities": {
"bad_wine": 0.847,
"good_wine": 0.153
},
"confidence": 84.7
},
"input": {...}
}
Using curl:
curl -X POST http://localhost:9696/predict \
-H "Content-Type: application/json" \
-d '{
"fixed acidity": 7.4,
"volatile acidity": 0.7,
"citric acid": 0.0,
"residual sugar": 1.9,
"chlorides": 0.076,
"free sulfur dioxide": 11.0,
"total sulfur dioxide": 34.0,
"density": 0.9978,
"pH": 3.51,
"sulphates": 0.56,
"alcohol": 9.4
}'
Using Python:
import requests
wine_data = {
"fixed acidity": 7.4,
"volatile acidity": 0.7,
"citric acid": 0.0,
"residual sugar": 1.9,
"chlorides": 0.076,
"free sulfur dioxide": 11.0,
"total sulfur dioxide": 34.0,
"density": 0.9978,
"pH": 3.51,
"sulphates": 0.56,
"alcohol": 9.4
}
response = requests.post('http://localhost:9696/predict', json=wine_data)
print(response.json())
Using test script:
python test_api.py
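A script such as test_api.py might also sanity-check the response body; its actual contents are not shown in this README, so the following is only a sketch. The helper name is hypothetical, and two rules are assumptions inferred from the documented example: that class 1 is chosen when `good_wine` is at least 0.5, and that `confidence` is the larger class probability expressed as a percentage.

```python
import math


def check_prediction(payload):
    """Return a list of problems found in a /predict response body (empty if none)."""
    problems = []
    if payload.get("status") != "success":
        problems.append("status is not 'success'")
        return problems
    pred = payload["predictions"]
    probs = pred["probabilities"]
    # Class probabilities should sum to 1 (within rounding).
    if not math.isclose(probs["bad_wine"] + probs["good_wine"], 1.0, abs_tol=1e-3):
        problems.append("probabilities do not sum to 1")
    # Assumption: class 1 ("good") is chosen when good_wine >= 0.5.
    if pred["quality_class_numeric"] != int(probs["good_wine"] >= 0.5):
        problems.append("class does not match probabilities")
    # Assumption: confidence is the larger class probability as a percentage.
    if not math.isclose(pred["confidence"], 100 * max(probs.values()), abs_tol=0.1):
        problems.append("confidence does not match probabilities")
    return problems


# The example response documented above (the "input" echo is omitted).
sample = {
    "status": "success",
    "predictions": {
        "quality_score": 5.23,
        "quality_class": "Average Wine (<7)",
        "quality_class_numeric": 0,
        "probabilities": {"bad_wine": 0.847, "good_wine": 0.153},
        "confidence": 84.7,
    },
}
print(check_prediction(sample))  # prints []
```

In practice `payload` would be the result of `response.json()` from the Python example above.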
- Ensure dataset is present: winequality-red.csv
- Run training script:
python train.py
This will:
- Load and preprocess data
- Train models with GridSearchCV
- Save three files: wine_quality_regressor.pkl, wine_quality_classifier.pkl, feature_scaler.pkl
- Start API:
python predict.py
Open notebook.ipynb in Jupyter:
jupyter notebook notebook.ipynb
The notebook contains:
- Complete EDA with visualizations
- Feature analysis
- Model training and comparison
- Hyperparameter tuning
- Model evaluation
An interactive web interface is also provided (not required for ML Zoomcamp, but great for demos!):
streamlit run wine_quality_streamlit.py
Access at: http://localhost:8501
Features:
- Interactive sliders for all 11 features
- Real-time predictions
- Visual feedback and confidence scores
- Educational tooltips
Build and run:
docker build -t dockerfile .
docker run -it -p 9696:9696 dockerfile
The service can be deployed to:
- AWS Elastic Beanstalk
- Google Cloud Run
- Azure Container Instances
- Heroku
(Deployment instructions available upon request)
- Dataset Size: Only 1,599 samples - more data could improve generalization
- Class Imbalance: Only 13.6% "good" wines - affects classification performance
- Geographic Scope: Dataset from Portuguese wines - may not generalize globally
- Quality Subjectivity: Based on human ratings which vary
- Feature Engineering: Polynomial features, interaction terms
- Advanced Models: XGBoost, Neural Networks
- Ensemble Methods: Stacking multiple models
- SHAP Values: For better model interpretability
- Real-time Monitoring: Track model performance in production
- Multi-class Classification: Predict exact quality levels (3-8)
- Language: Python 3.11
- ML Framework: Scikit-learn
- Web Framework: Flask
- Data Processing: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Deployment: Docker
- Version Control: Git
Three pickle files are included:
- wine_quality_regressor.pkl (3.2 MB) - Random Forest Regressor
- wine_quality_classifier.pkl (3.0 MB) - Random Forest Classifier
- feature_scaler.pkl (1.5 KB) - StandardScaler for features
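A minimal sketch of loading the three artifacts follows. It assumes they were written with Python's `pickle` module (the `.pkl` extension suggests this, but the README does not confirm it), and the loader name is hypothetical. To stay self-contained, the demo writes small stand-in objects to a temporary directory instead of reading the real model files.

```python
import pickle
import tempfile
from pathlib import Path

ARTIFACT_NAMES = (
    "wine_quality_regressor.pkl",
    "wine_quality_classifier.pkl",
    "feature_scaler.pkl",
)


def load_artifacts(directory="."):
    """Unpickle the three artifacts and return them keyed by file stem."""
    loaded = {}
    for name in ARTIFACT_NAMES:
        path = Path(directory) / name
        with path.open("rb") as handle:
            loaded[path.stem] = pickle.load(handle)
    return loaded


# Demo with stand-in objects; the real files hold scikit-learn estimators.
with tempfile.TemporaryDirectory() as tmp:
    for name in ARTIFACT_NAMES:
        (Path(tmp) / name).write_bytes(pickle.dumps({"stand_in_for": name}))
    artifacts = load_artifacts(tmp)

print(sorted(artifacts))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code, and the real artifacts are best loaded with the same scikit-learn version that produced them.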
Error:
ModuleNotFoundError: No module named 'pandas'
Solution:
# Make sure virtual environment is activated
source venv/bin/activate
# Verify you see (venv) in your prompt
# Then run your script
python train.py
Error:
AttributeError: module 'pkgutil' has no attribute 'ImpImporter'
Solution:
# Option 1: Install setuptools first
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
# Option 2: Use Python 3.11 (recommended)
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Error:
Error starting userland proxy: listen tcp4 0.0.0.0:9696: bind: address already in use
Solution:
# Kill process using port 9696
lsof -ti:9696 | xargs kill -9
# Or use a different port
docker run -it --rm -p 8080:9696 wine-quality-api
# Then test on: http://localhost:8080
Error:
COPY failed: file not found in build context
Solution:
# Train models first to generate .pkl files
source venv/bin/activate
python train.py
# Verify files exist
ls *.pkl
# Then build Docker image
docker build -t wine-quality-api .
Error:
404 Not Found - The requested URL was not found on the server
Solution:
# Use correct endpoint: /predict (not /flask_predict_service)
curl -X POST http://localhost:9696/predict \
-H "Content-Type: application/json" \
-d '{...}'
# Available endpoints:
# GET / - API documentation
# GET /health - Health check
# POST /predict - Prediction endpoint
Error:
ParserError: Error tokenizing data
Solution:
# The CSV uses semicolons as delimiters
df = pd.read_csv('winequality-red.csv', sep=';') # Not comma!
Error:
ValueError: The feature names should match those that were passed during fit
Solution:
# Ensure features are in correct order and use exact column names:
feature_names = [
'fixed acidity', 'volatile acidity', 'citric acid',
'residual sugar', 'chlorides', 'free sulfur dioxide',
'total sulfur dioxide', 'density', 'pH',
'sulphates', 'alcohol'
]
input_df = input_df[feature_names] # Reorder if needed
Error:
Cannot connect to the Docker daemon
Solution:
# Start Docker service
sudo systemctl start docker
# Or on Mac/Windows: Start Docker Desktop application
# Verify Docker is running
docker --version
docker ps
Before submitting or sharing:
# 1. Fresh clone test
cd /tmp
git clone YOUR_REPO_URL
cd wine-quality-prediction
# 2. Virtual environment test
python3.11 -m venv test_venv
source test_venv/bin/activate
pip install -r requirements.txt
# 3. Training test
python train.py # Should complete in <10 minutes
# 4. API test
python predict.py &
sleep 5
curl http://localhost:9696/health
python test_api.py
pkill -f predict.py
# 5. Docker test
docker build -t wine-test .
docker run -d -p 9696:9696 --name wine-test wine-test
curl http://localhost:9696/health
docker stop wine-test && docker rm wine-test
# If all pass ✅ You're ready to submit!
Your Name
- GitHub: @your-username
- LinkedIn: your-profile
- Email: your.email@example.com
- Dataset: UCI Machine Learning Repository
- Course: ML Zoomcamp by DataTalks.Club
- Instructor: Alexey Grigorev
- Community: DataTalks.Club Slack community
This project is for educational purposes as part of ML Zoomcamp.
- Problem description with context
- Dataset included and documented
- Complete EDA in notebook
- Multiple models trained and compared
- Hyperparameter tuning performed
- train.py script for model training
- predict.py script with Flask API
- Dependencies listed in requirements.txt
- Dockerfile for containerization
- API endpoints tested and documented
- Reproducible from scratch
- README with clear instructions
- Troubleshooting section added