# Wine Quality Prediction - Data Preprocessing Pipeline

## Objective
Prepare the wine quality dataset for machine learning with a stratified train-test split, feature scaling, and validation checks on the preprocessed data.

## Preprocessing Strategy
- **Stratified sampling**: Maintain target distribution across train/test splits
- **Feature standardization**: Scale each feature to mean 0 and standard deviation 1 so that scale-sensitive algorithms weigh features comparably
- **Data validation**: Check the scaled statistics and the target distribution after the split

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

wine_data = pd.read_csv('../data/winequality-red.csv', sep=';')

target = wine_data['quality']
features = wine_data.drop('quality', axis=1)

print(f"Dataset columns: {len(wine_data.columns)}")
print(f"Features columns: {len(features.columns)}")
print(f"Dataset shape: {wine_data.shape}")

Dataset columns 12
Features columns 11


In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42, stratify=target)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

# Standardize the features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

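# Convert the scaled arrays back to DataFrames, keeping the column names and the
# original row index so rows stay aligned with y_train / y_test
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)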

print(f"\nScaled training set shape: {X_train_scaled.shape}")
print(f"Scaled testing set shape: {X_test_scaled.shape}")

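# Compare means and standard deviations before and after scaling.
# pandas .std() uses the sample standard deviation (ddof=1), while StandardScaler
# divides by the population standard deviation (ddof=0), so the scaled std prints
# slightly above 1 rather than exactly 1.
print("\n--- Scaling Validation ---")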
print(f"Original mean: {features.mean().mean():.4f}")
print(f"Original std: {features.std().mean():.4f}")
print(f"Scaled mean: {X_train_scaled.mean().mean():.4f}")
print(f"Scaled std: {X_train_scaled.std().mean():.4f}")

# Compare target value counts in the original dataset and in the train/test splits
print("\n--- Target Distribution Validation ---")
print(f"Original distribution: {target.value_counts().sort_index().to_dict()}")
print(f"Training set distribution: {y_train.value_counts().sort_index().to_dict()}")
print(f"Testing set distribution: {y_test.value_counts().sort_index().to_dict()}")



Original mean: 18.432687072460002
Original std: 6.105775283639784 

Scaled mean: -2.024481303950566e-15
Scaled std: 1.0001276405646913
target values count: quality
3      20
4     163
5    1457
6    2198
7     880
8     175
9       5
Name: count, dtype: int64
ytrain values count: quality
3      16
4     130
5    1166
6    1758
7     704
8     140
9       4
Name: count, dtype: int64


## Preprocessing Results

### Data Split Validation
- **Training set**: 1,279 samples (80%)
- **Testing set**: 320 samples (20%)
- **Stratification success**: Target distribution preserved across splits

### Feature Standardization Results
- **Original features**: Mixed scales (pH: 2.7-4.0, alcohol: 8.4-14.9)
- **Standardized features**: Mean ≈ 0.0, Standard deviation ≈ 1.0
- **Scaling validation**: Printed means and standard deviations of the scaled training features confirm the transformation
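
The scaled standard deviation in the cell output is about 1.0001 rather than exactly 1. A likely explanation is the difference between the population standard deviation that `StandardScaler` divides by (`ddof=0`) and the sample standard deviation that pandas reports by default (`ddof=1`). The sketch below reproduces the effect on synthetic data with NumPy only; the sample count and distribution are assumptions for illustration, not values from the wine dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical sample count
x = rng.normal(loc=10.0, scale=3.0, size=n)

# Standardize with the population std (ddof=0), as StandardScaler does
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)      # population std of the scaled column
sample_std = z.std(ddof=1)   # sample std, the pandas default
expected = np.sqrt(n / (n - 1))

print(f"ddof=0: {pop_std:.7f}")
print(f"ddof=1: {sample_std:.7f}")
print(f"sqrt(n / (n - 1)): {expected:.7f}")
```

The sample standard deviation of a column standardized this way equals `sqrt(n / (n - 1))`, so the gap shrinks as the training set grows.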

### Technical Implementation
- **StandardScaler**: Fit on training data only (prevents data leakage)
- **DataFrame preservation**: Maintained feature names and structure
- **Pipeline ready**: Preprocessed data ready for model training
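
One way to keep the "fit on training data only" rule from being broken by accident is to wrap the scaler and the model in a scikit-learn pipeline. This is a minimal sketch on synthetic data, not the notebook's actual modeling code; the feature values, coefficients, and the choice of `LinearRegression` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on three different scales
X = rng.normal(loc=[3.3, 10.0, 50.0], scale=[0.2, 1.0, 30.0], size=(500, 3))
y = 0.5 * X[:, 1] - 0.01 * X[:, 2] + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# fit() learns the scaler statistics from X_tr only, then fits the regressor
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

# The fitted scaler holds training-set means, not full-dataset means
fitted_scaler = model.named_steps["standardscaler"]
scaler_uses_train_only = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
print(f"Scaler fitted on training data only: {scaler_uses_train_only}")

# score() applies the same training-set scaling to X_te before predicting
print(f"Test R^2: {model.score(X_te, y_te):.3f}")
```

With a pipeline, the test set is only ever transformed, never used for fitting, which is the same guarantee the manual `fit` / `transform` sequence provides.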

## Key Preprocessing Decisions

### Why Stratified Sampling?
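- **Class imbalance**: Quality scores 5-6 represent 82% of the dataset
- **Risk mitigation**: Keeps rare quality scores represented in both splits instead of leaving their placement to chance
- **Model reliability**: Evaluation metrics are computed on a test set with the same class mix as the training data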
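
The effect of `stratify` can be checked by comparing class proportions between the two splits. The sketch below does this on synthetic imbalanced labels; the class probabilities and sample count are assumptions chosen to resemble a skewed quality distribution, not figures taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical imbalanced labels: two dominant classes and several rare ones
y = rng.choice([3, 4, 5, 6, 7, 8], size=2000, p=[0.01, 0.03, 0.42, 0.40, 0.12, 0.02])
X = rng.normal(size=(2000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

def proportions(labels, classes):
    """Return the share of each class in labels, in the order given by classes."""
    return np.array([(labels == c).mean() for c in classes])

classes = np.unique(y)
train_props = proportions(y_tr, classes)
test_props = proportions(y_te, classes)
gap = np.abs(train_props - test_props).max()

for c, p_tr, p_te in zip(classes, train_props, test_props):
    print(f"quality {c}: train {p_tr:.3f}, test {p_te:.3f}")
print(f"Largest proportion gap: {gap:.4f}")
```

Comparing proportions rather than raw counts makes the check independent of the split sizes.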

### Why StandardScaler?
- **Algorithm optimization**: Distance-based algorithms (e.g., k-NN, SVM) are sensitive to feature scale
- **Feature equality**: Prevents high-magnitude features from dominating
- **Numerical stability**: Improves gradient descent convergence
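
To illustrate how a high-magnitude feature can dominate a distance computation, the sketch below measures each feature's share of the squared Euclidean distance between two rows, before and after standardization. The two columns and their values are hypothetical, chosen only to mimic one small-range and one large-range feature.

```python
import numpy as np

# Hypothetical data: column 0 has a small range, column 1 a large one
X = np.array([
    [3.0, 40.0],
    [3.8, 50.0],
    [3.4, 120.0],
    [3.2, 20.0],
    [3.6, 90.0],
])

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Share per feature on the raw scale
raw_share = squared_distance_share(X[0], X[1])

# Standardize each column to mean 0 and population std 1, then recompute
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = squared_distance_share(X_scaled[0], X_scaled[1])

print(f"Raw shares:    {np.round(raw_share, 3)}")
print(f"Scaled shares: {np.round(scaled_share, 3)}")
```

On the raw scale, nearly all of the distance comes from the large-range column even though the two rows differ widely in the small-range column relative to its spread; after scaling, the contributions reflect relative rather than absolute differences.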

## Next Steps
1. **Linear Regression**: Baseline model with interpretable coefficients
2. **XGBoost**: Advanced ensemble method for comparison
3. **Model evaluation**: MSE, MAE, R² metrics for performance assessment
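
For reference, the three evaluation metrics named above can be computed directly from their definitions. This is a minimal NumPy sketch; the helper name `regression_metrics` and the example values are hypothetical, and `sklearn.metrics` offers equivalent functions (`mean_squared_error`, `mean_absolute_error`, `r2_score`).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R^2 from their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "mae": mae, "r2": r2}

# Hypothetical quality scores and predictions
y_true = [5, 6, 5, 7, 6]
y_pred = [5.2, 5.8, 5.5, 6.4, 6.1]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MSE penalizes large errors more heavily than MAE, while R² reports the share of target variance explained relative to always predicting the mean.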