# Predicting Wine Quality - Modeling


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

red = pd.read_csv("winequality-red.csv", sep=';')
white = pd.read_csv("winequality-white.csv", sep=';')

red_top_6 = pd.read_csv("red_quality_top_6.csv")
white_top_6 = pd.read_csv("white_quality_top_6.csv")


## Train-Test Split

Before training any models we will split the red wine data into 70% for training and the remaining 30% will be used for testing. We decided that split since it red wine doesn't have as much data compared to the white wine data set.

In [2]:
#feature and target for Red Wine
X_red = red.drop(columns=['quality'])
y_red = red['quality']

#Data split
X_red_train, X_red_test, y_red_train, y_red_test = train_test_split(
    X_red, y_red, test_size=0.30, random_state=42
)

In [4]:
#feature and target for Red_top_6 Wine
X_red_top_6 = red_top_6.drop(columns=['quality'])
y_red_top_6 = red_top_6['quality']

#Data split
X_red_top_6_train, X_red_top_6_test, y_red_top_6_train, y_red_top_6_test = train_test_split(
    X_red_top_6, y_red_top_6, test_size=0.30, random_state=42
)

For white wine we will split the data into 80% for training and the remaining 20% will be used for testing since it has way more data than the red wine dataset.

In [20]:
#feature and target for White Wine
X_white = white.drop(columns=['quality'])
y_white = white['quality']

#Data split
X_white_train, X_white_test, y_white_train, y_white_test = train_test_split(
    X_white, y_white, test_size=0.20, random_state=42
)

In [19]:
#feature and target for White_top_6 wine
X_white_top_6 = white_top_6.drop(columns=['quality'])
y_white_top_6 = white_top_6['quality']

#Data split
X_white_top_6_train, X_white_top_6_test, y_white_top_6_train, y_white_top_6_test = train_test_split(
    X_white_top_6, y_white_top_6, test_size=0.20, random_state=42
)

## Linear Regression Models

Linear Regression for Red Wine with the datasets that contain all features.

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create the model
lr = LinearRegression()

# Train it
lr.fit(X_red_train, y_red_train)

# Predict on training set
y_train_pred = lr.predict(X_red_train)

# Predict on test set
y_test_pred = lr.predict(X_red_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_red_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_red_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_red_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_red_test, y_test_pred)

# Print results
print("=" * 60)
print("Linear Regression - Red Wine Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

Linear Regression - Red Wine Performance

Training Set:
  RMSE: 0.6487
  R²:   0.3612 (36.1%)

Test Set:
  RMSE: 0.6413
  R²:   0.3514 (35.1%)


Now we run it for the red wine dataset with the top 6 features and see if it performs better.

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create the model
lr = LinearRegression()

# Train it
lr.fit(X_red_top_6_train, y_red_top_6_train)

# Predict on training set
y_train_pred = lr.predict(X_red_top_6_train)

# Predict on test set
y_test_pred = lr.predict(X_red_top_6_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_red_top_6_train, y_train_pred)
train_rmse = (train_mse) ** 0.5
train_r2 = r2_score(y_red_top_6_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_red_top_6_test, y_test_pred)
test_rmse = (test_mse) ** 0.5
test_r2 = r2_score(y_red_top_6_test, y_test_pred)

# Print results
print("=" * 60)
print("Linear Regression - Red Wine (Top 6 Feature Set) Performance")
print("=" * 60)

print("\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")

print("\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")


Linear Regression - Red Wine (Top 6 Feature Set) Performance

Training Set:
  RMSE: 0.6551
  R²:   0.3484 (34.8%)

Test Set:
  RMSE: 0.6531
  R²:   0.3273 (32.7%)


Linear Regression for White Wine

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create the model
lr = LinearRegression()

# Train it
lr.fit(X_white_train, y_white_train)

# Predict on training set
y_train_pred = lr.predict(X_white_train)

# Predict on test set
y_test_pred = lr.predict(X_white_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_white_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_white_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_white_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_white_test, y_test_pred)

# Print results
print("=" * 60)
print("Linear Regression - White Wine Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

Linear Regression - White Wine Performance

Training Set:
  RMSE: 0.7502
  R²:   0.2843 (28.4%)

Test Set:
  RMSE: 0.7543
  R²:   0.2653 (26.5%)


Now with white wine dataset that has the top 6 best features

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create the model
lr = LinearRegression()

# Train it
lr.fit(X_white_top_6_train, y_white_top_6_train)

# Predict on training set
y_train_pred = lr.predict(X_white_top_6_train)

# Predict on test set
y_test_pred = lr.predict(X_white_top_6_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_white_top_6_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_white_top_6_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_white_top_6_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_white_top_6_test, y_test_pred)

# Print results
print("=" * 60)
print("Linear Regression - White Wine (Top 6 Feature Set) Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

Linear Regression - White Wine (Top 6 Feature Set) Performance

Training Set:
  RMSE: 0.7631
  R²:   0.2596 (26.0%)

Test Set:
  RMSE: 0.7658
  R²:   0.2427 (24.3%)


Result for datasets with all 11 features: Linear Regression explains ~35% of quality variance for red wine and ~27% for white, with minimal train-test gaps so the model generalizes but only captures modest linear relationships.

Results for datasets with top features: Restricting to the top correlated features drops performance slightly (≈33% for red, ≈24% for white), showing that Linear Regression needs the full chemistry set to extract every bit of signal.

Conclusion: for Linear Regression, retaining all 11 features gives the strongest results; trimming to just the top predictors leaves useful variance unexplained.

## K-Nearest Neighbors

KNN on Red Wine with all 11 features:

In [21]:
#Training Model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# KNN model wrapped in a Pipeline (so scaling happens automatically)
knn = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=11))
])

knn.fit(X_red_train, y_red_train)

#Evaluating model
from sklearn.metrics import mean_squared_error, r2_score

# Predict on training set
y_train_pred = knn.predict(X_red_train)

# Predict on test set
y_test_pred = knn.predict(X_red_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_red_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_red_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_red_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_red_test, y_test_pred)

# Print results
print("=" * 60)
print("KNN - Red Wine Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

KNN - Red Wine Performance

Training Set:
  RMSE: 0.6156
  R²:   0.4246 (42.5%)

Test Set:
  RMSE: 0.6617
  R²:   0.3093 (30.9%)


In [22]:
#finding the best k value 
results = []

for k in range(1, 21):   # test k from 1 to 20
    knn_test = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsRegressor(n_neighbors=k))
    ])
    
    knn_test.fit(X_red_train, y_red_train)
    pred = knn_test.predict(X_red_test)
    
    mse = mean_squared_error(y_red_test, pred)
    rmse = mse ** 0.5
    
    # store (k, rmse)
    results.append((k, rmse))

# sort results by rmse descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=False)

# print results
for k, rmse in results_sorted:
    print(f"k={k}, RMSE={rmse}")



k=11, RMSE=0.6617371120842792
k=12, RMSE=0.6628362179693048
k=13, RMSE=0.6628713465603486
k=18, RMSE=0.6643381402637165
k=19, RMSE=0.6645496743069114
k=20, RMSE=0.6646819414827917
k=16, RMSE=0.6651247631240573
k=10, RMSE=0.6653789647010692
k=17, RMSE=0.6656584793065168
k=14, RMSE=0.6659461455008828
k=7, RMSE=0.6660818012724823
k=9, RMSE=0.6662228460946353
k=15, RMSE=0.6666944438657648
k=8, RMSE=0.6702378309227255
k=5, RMSE=0.6729908369856655
k=6, RMSE=0.6750342926817098
k=4, RMSE=0.6780601927557759
k=3, RMSE=0.6853290639728669
k=2, RMSE=0.7107800878846658
k=1, RMSE=0.758287544405155


KNN on Red Wine with top 6 features:

In [23]:
# Training Model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# KNN model wrapped in a Pipeline (so scaling happens automatically)
knn = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=11))
])

# Train on top-6 feature dataset
knn.fit(X_red_top_6_train, y_red_top_6_train)

# Predict on training set
y_train_pred = knn.predict(X_red_top_6_train)

# Predict on test set
y_test_pred = knn.predict(X_red_top_6_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_red_top_6_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_red_top_6_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_red_top_6_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_red_top_6_test, y_test_pred)

# Print results
print("=" * 60)
print("KNN - Red Wine (Top 6 Feature Set) Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")


KNN - Red Wine (Top 6 Feature Set) Performance

Training Set:
  RMSE: 0.6073
  R²:   0.4401 (44.0%)

Test Set:
  RMSE: 0.6531
  R²:   0.3272 (32.7%)


KNN on White Wine:

In [24]:
#Training Model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# KNN model wrapped in a Pipeline (so scaling happens automatically)
knn = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=5))
])

knn.fit(X_white_train, y_white_train)

#Evaluating model
from sklearn.metrics import mean_squared_error, r2_score

# Predict on training set
y_train_pred = knn.predict(X_white_train)

# Predict on test set
y_test_pred = knn.predict(X_white_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_white_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_white_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_white_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_white_test, y_test_pred)

# Print results
print("=" * 60)
print("KNN - White Wine Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

KNN - White Wine Performance

Training Set:
  RMSE: 0.5756
  R²:   0.5788 (57.9%)

Test Set:
  RMSE: 0.6906
  R²:   0.3842 (38.4%)


In [25]:
#finding the best k value 
results = []

for k in range(1, 21):   # test k from 1 to 20
    knn_test = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsRegressor(n_neighbors=k))
    ])
    
    knn_test.fit(X_white_train, y_white_train)
    pred = knn_test.predict(X_white_test)
    
    mse = mean_squared_error(y_white_test, pred)
    rmse = mse ** 0.5
    
    # store (k, rmse)
    results.append((k, rmse))

# sort results by rmse descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=False)

# print results
for k, rmse in results_sorted:
    print(f"k={k}, RMSE={rmse}")



k=7, RMSE=0.6890135643041329
k=8, RMSE=0.6897574255652845
k=5, RMSE=0.690577989211699
k=6, RMSE=0.6913582215293245
k=10, RMSE=0.6915303939193748
k=9, RMSE=0.6922527773763977
k=4, RMSE=0.6928939607301585
k=2, RMSE=0.6952829404975638
k=11, RMSE=0.6972949685988421
k=12, RMSE=0.700071870298162
k=13, RMSE=0.7028587235136066
k=3, RMSE=0.703812101290467
k=14, RMSE=0.707220892711073
k=15, RMSE=0.7077670805871178
k=16, RMSE=0.7080784980768555
k=20, RMSE=0.709704954505025
k=17, RMSE=0.7102331900428206
k=19, RMSE=0.7106078641988153
k=18, RMSE=0.711033158992744
k=1, RMSE=0.7491491772643939


In [26]:
#Training Model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# KNN model wrapped in a Pipeline (so scaling happens automatically)
knn = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=5))
])

knn.fit(X_white_top_6_train, y_white_top_6_train)

#Evaluating model
from sklearn.metrics import mean_squared_error, r2_score

# Predict on training set
y_train_pred = knn.predict(X_white_top_6_train)

# Predict on test set
y_test_pred = knn.predict(X_white_top_6_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_white_top_6_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_white_top_6_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_white_top_6_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_white_top_6_test, y_test_pred)

# Print results
print("=" * 60)
print("KNN - White Wine (Top 6 Feature Set) Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

KNN - White Wine (Top 6 Feature Set) Performance

Training Set:
  RMSE: 0.6034
  R²:   0.5370 (53.7%)

Test Set:
  RMSE: 0.7355
  R²:   0.3016 (30.2%)


Results with all 11 features: KNN achieved moderate performance on both datasets, with R² scores around 0.31 for red wine and 0.38 for white wine. The model showed noticeable overfitting, especially on the larger white wine dataset, where the train–test gap was substantial. Optimal values of k fell in the 5–12 range, indicating that small to medium neighborhood sizes provided the best generalization.

Results with Top 6 features: for red wine we got an R $^2$ score around 0.33, while white wine got an R $^2$ score around 0.30. The model still shows overfitting on white wine and also slightly overfitting on red wine

Conclusion: keep the full feature set for white wine (since the reduced set loses signal) but try to use top features for red wine since it slightly improves performance; also consider tunning k value to tame the overfitting

## Random Forest

Random Forest for Red Wine:

In [73]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=6,
    random_state=42,
    n_jobs=-1,
    min_samples_leaf=3
)

rf.fit(X_red_train, y_red_train)

#evaluate
from sklearn.metrics import mean_squared_error, r2_score

# Predict on training set
y_train_pred = rf.predict(X_red_train)

# Predict on test set
y_test_pred = rf.predict(X_red_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_red_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_red_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_red_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_red_test, y_test_pred)

# Print results
print("=" * 60)
print("Random Forest - Red Wine Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

Random Forest - Red Wine Performance

Training Set:
  RMSE: 0.4905
  R²:   0.6348 (63.5%)

Test Set:
  RMSE: 0.6221
  R²:   0.3896 (39.0%)


Random Forest for Red Wine with Top 6 features:

In [27]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=6,
    random_state=42,
    n_jobs=-1,
    min_samples_leaf=3
)

rf.fit(X_red_top_6_train, y_red_top_6_train)

#evaluate
from sklearn.metrics import mean_squared_error, r2_score

# Predict on training set
y_train_pred = rf.predict(X_red_top_6_train)

# Predict on test set
y_test_pred = rf.predict(X_red_top_6_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_red_top_6_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_red_top_6_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_red_top_6_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_red_top_6_test, y_test_pred)

# Print results
print("=" * 60)
print("Random Forest - Red Wine (Top 6 Feature Set) Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

Random Forest - Red Wine (Top 6 Feature Set) Performance

Training Set:
  RMSE: 0.5064
  R²:   0.6107 (61.1%)

Test Set:
  RMSE: 0.6332
  R²:   0.3675 (36.8%)


Random Forest for White Wine

In [74]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=6,
    random_state=42,
    n_jobs=-1,
    min_samples_leaf=3
)

rf.fit(X_white_train, y_white_train)

#evaluate
from sklearn.metrics import mean_squared_error, r2_score

# Predict on training set
y_train_pred = rf.predict(X_white_train)

# Predict on test set
y_test_pred = rf.predict(X_white_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_white_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_white_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_white_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_white_test, y_test_pred)

# Print results
print("=" * 60)
print("Random Forest - White Wine Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

Random Forest - White Wine Performance

Training Set:
  RMSE: 0.6400
  R²:   0.4857 (48.6%)

Test Set:
  RMSE: 0.6851
  R²:   0.3783 (37.8%)


Random Forest for White Wine with Top 6 Feature Set

In [29]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=6,
    random_state=42,
    n_jobs=-1,
    min_samples_leaf=3
)

rf.fit(X_white_top_6_train, y_white_top_6_train)

#evaluate
from sklearn.metrics import mean_squared_error, r2_score

# Predict on training set
y_train_pred = rf.predict(X_white_top_6_train)

# Predict on test set
y_test_pred = rf.predict(X_white_top_6_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_white_top_6_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_white_top_6_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_white_top_6_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_white_top_6_test, y_test_pred)

# Print results
print("=" * 60)
print("Random Forest - White Wine (Top 6 Feature Set) Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

Random Forest - White Wine (Top 6 Feature Set) Performance

Training Set:
  RMSE: 0.6706
  R²:   0.4282 (42.8%)

Test Set:
  RMSE: 0.7149
  R²:   0.3401 (34.0%)


Result for all features: Random Forest captures about 63% of the variance on the red-wine training data but only ~39% on the test set, so it still overfits a bit. On white wine it’s slightly more balanced—training R² ≈0.49 and test ≈0.38—so it generalizes reasonably well.

Results with Top 6 Feature Set: red wines drop to ~0.61 R² on training and ~0.37 on test, while white wines fall to ~0.43 training and ~0.34 on test.

Conclusion: Random Forest remains the strongest of our baseline models, but it needs the full 11-feature chemistry profile to hit its best numbers: red wine tops out near R² ≈0.39 and white near ≈0.38 on the test sets, and trimming to six features only hurts both accuracy and stability, so we stick with the richer feature set while using moderate depth/leaf constraints to keep overfitting in check.

## SVM

SVM for Red Wine

In [30]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# SVM model wrapped in a Pipeline (scaling is important for SVM)
svm_red = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma='scale'))
])

svm_red.fit(X_red_train, y_red_train)

# Evaluating model
# Predict on training set
y_train_pred = svm_red.predict(X_red_train)

# Predict on test set
y_test_pred = svm_red.predict(X_red_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_red_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_red_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_red_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_red_test, y_test_pred)

# Print results
print("=" * 60)
print("SVM - Red Wine Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

# ============================================================================
# Hyperparameter Tuning for Red Wine SVM
# ============================================================================

# Finding the best C and gamma values
print("\n" + "=" * 60)
print("Hyperparameter Tuning - Red Wine SVM")
print("=" * 60)

results = []
C_values = [0.1, 0.5, 1.0, 10, 100]
gamma_values = ['scale', 'auto', 0.001, 0.01, 0.1, 1.0]

for C in C_values:
    for gamma in gamma_values:
        svm_test = Pipeline([
            ("scaler", StandardScaler()),
            ("model", SVR(kernel='rbf', C=C, epsilon=0.1, gamma=gamma))
        ])
        
        svm_test.fit(X_red_train, y_red_train)
        pred = svm_test.predict(X_red_test)
        
        mse = mean_squared_error(y_red_test, pred)
        rmse = mse ** 0.5
        
        results.append((C, gamma, rmse))

# Sort results by RMSE
results_sorted = sorted(results, key=lambda x: x[2], reverse=False)

# Print top 10 results
print("\nTop 10 Hyperparameter Combinations:")
print("C\t\tGamma\t\tRMSE")
print("-" * 40)
for C, gamma, rmse in results_sorted[:10]:
    print(f"{C}\t\t{gamma}\t\t{rmse:.4f}")

# Best hyperparameters
best_C, best_gamma, best_rmse = results_sorted[0]
print(f"\nBest Parameters: C={best_C}, gamma={best_gamma}, RMSE={best_rmse:.4f}")

SVM - Red Wine Performance

Training Set:
  RMSE: 0.5443
  R²:   0.5502 (55.0%)

Test Set:
  RMSE: 0.6125
  R²:   0.4084 (40.8%)

Hyperparameter Tuning - Red Wine SVM

Top 10 Hyperparameter Combinations:
C		Gamma		RMSE
----------------------------------------
1.0		0.1		0.6120
1.0		scale		0.6125
1.0		auto		0.6125
0.5		0.1		0.6166
0.5		scale		0.6179
0.5		auto		0.6179
100		0.01		0.6307
10		0.01		0.6310
10		scale		0.6352
10		auto		0.6352

Best Parameters: C=1.0, gamma=0.1, RMSE=0.6120


SVM for Red Wine for Top 6 Features Set

In [32]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# SVM model wrapped in a Pipeline (scaling is important for SVM)
svm_red = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma='scale'))
])

svm_red.fit(X_red_top_6_train, y_red_top_6_train)

# Evaluating model
# Predict on training set
y_train_pred = svm_red.predict(X_red_top_6_train)

# Predict on test set
y_test_pred = svm_red.predict(X_red_top_6_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_red_top_6_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_red_top_6_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_red_top_6_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_red_top_6_test, y_test_pred)

# Print results
print("=" * 60)
print("SVM - Red Wine Performance for Top 6 Features Set")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

# ============================================================================
# Hyperparameter Tuning for Red Wine SVM
# ============================================================================

# Finding the best C and gamma values
print("\n" + "=" * 60)
print("Hyperparameter Tuning - Red Wine SVM Top 6 Features Set")
print("=" * 60)

results = []
C_values = [0.1, 0.5, 1.0, 10, 100]
gamma_values = ['scale', 'auto', 0.001, 0.01, 0.1, 1.0]

for C in C_values:
    for gamma in gamma_values:
        svm_test = Pipeline([
            ("scaler", StandardScaler()),
            ("model", SVR(kernel='rbf', C=C, epsilon=0.1, gamma=gamma))
        ])
        
        svm_test.fit(X_red_top_6_train, y_red_top_6_train)
        pred = svm_test.predict(X_red_top_6_test)
        
        mse = mean_squared_error(y_red_top_6_test, pred)
        rmse = mse ** 0.5
        
        results.append((C, gamma, rmse))

# Sort results by RMSE
results_sorted = sorted(results, key=lambda x: x[2], reverse=False)

# Print top 10 results
print("\nTop 10 Hyperparameter Combinations:")
print("C\t\tGamma\t\tRMSE")
print("-" * 40)
for C, gamma, rmse in results_sorted[:10]:
    print(f"{C}\t\t{gamma}\t\t{rmse:.4f}")

# Best hyperparameters
best_C, best_gamma, best_rmse = results_sorted[0]
print(f"\nBest Parameters: C={best_C}, gamma={best_gamma}, RMSE={best_rmse:.4f}")

SVM - Red Wine Performance for Top 6 Features Set

Training Set:
  RMSE: 0.5816
  R²:   0.4864 (48.6%)

Test Set:
  RMSE: 0.6373
  R²:   0.3594 (35.9%)

Hyperparameter Tuning - Red Wine SVM Top 6 Features Set

Top 10 Hyperparameter Combinations:
C		Gamma		RMSE
----------------------------------------
0.5		scale		0.6365
0.5		auto		0.6365
1.0		scale		0.6373
1.0		auto		0.6373
1.0		0.1		0.6405
0.5		0.1		0.6417
10		0.01		0.6438
0.1		scale		0.6444
0.1		auto		0.6444
1.0		1.0		0.6456

Best Parameters: C=0.5, gamma=scale, RMSE=0.6365


SVM for White Wine

In [76]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# SVM model wrapped in a Pipeline
svm_white = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma='scale'))
])

svm_white.fit(X_white_train, y_white_train)

# Evaluating model
# Predict on training set
y_train_pred = svm_white.predict(X_white_train)

# Predict on test set
y_test_pred = svm_white.predict(X_white_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_white_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_white_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_white_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_white_test, y_test_pred)

# Print results
print("=" * 60)
print("SVM - White Wine Performance")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

# ============================================================================
# Hyperparameter Tuning for White Wine SVM
# ============================================================================

# Finding the best C and gamma values
print("\n" + "=" * 60)
print("Hyperparameter Tuning - White Wine SVM")
print("=" * 60)

results = []
C_values = [0.1, 0.5, 1.0, 10, 100]
gamma_values = ['scale', 'auto', 0.001, 0.01, 0.1, 1.0]

for C in C_values:
    for gamma in gamma_values:
        svm_test = Pipeline([
            ("scaler", StandardScaler()),
            ("model", SVR(kernel='rbf', C=C, epsilon=0.1, gamma=gamma))
        ])
        
        svm_test.fit(X_white_train, y_white_train)
        pred = svm_test.predict(X_white_test)
        
        mse = mean_squared_error(y_white_test, pred)
        rmse = mse ** 0.5
        
        results.append((C, gamma, rmse))

# Sort results by RMSE
results_sorted = sorted(results, key=lambda x: x[2], reverse=False)

# Print top 10 results
print("\nTop 10 Hyperparameter Combinations:")
print("C\t\tGamma\t\tRMSE")
print("-" * 40)
for C, gamma, rmse in results_sorted[:10]:
    print(f"{C}\t\t{gamma}\t\t{rmse:.4f}")

# Best hyperparameters
best_C, best_gamma, best_rmse = results_sorted[0]
print(f"\nBest Parameters: C={best_C}, gamma={best_gamma}, RMSE={best_rmse:.4f}")

SVM - White Wine Performance

Training Set:
  RMSE: 0.6286
  R²:   0.5038 (50.4%)

Test Set:
  RMSE: 0.6887
  R²:   0.3717 (37.2%)

Hyperparameter Tuning - White Wine SVM

Top 10 Hyperparameter Combinations:
C		Gamma		RMSE
----------------------------------------
10		1.0		0.6557
100		1.0		0.6621
1.0		1.0		0.6647
1.0		0.1		0.6872
1.0		scale		0.6887
1.0		auto		0.6887
10		scale		0.6892
10		auto		0.6892
10		0.1		0.6901
0.5		1.0		0.6924

Best Parameters: C=10, gamma=1.0, RMSE=0.6557


SVM for White Wine Top 6 Feature Set

In [34]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# SVM model wrapped in a Pipeline
svm_white = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma='scale'))
])

svm_white.fit(X_white_top_6_train, y_white_top_6_train)

# Evaluating model
# Predict on training set
y_train_pred = svm_white.predict(X_white_top_6_train)

# Predict on test set
y_test_pred = svm_white.predict(X_white_top_6_test)

# Evaluate training set performance
train_mse = mean_squared_error(y_white_top_6_train, y_train_pred)
train_rmse = train_mse ** 0.5
train_r2 = r2_score(y_white_top_6_train, y_train_pred)

# Evaluate test set performance
test_mse = mean_squared_error(y_white_top_6_test, y_test_pred)
test_rmse = test_mse ** 0.5
test_r2 = r2_score(y_white_top_6_test, y_test_pred)

# Print results
print("=" * 60)
print("SVM - White Wine Performance Top 6 Feature Set")
print("=" * 60)
print(f"\nTraining Set:")
print(f"  RMSE: {train_rmse:.4f}")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}%)")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.4f}")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}%)")

# ============================================================================
# Hyperparameter Tuning for White Wine SVM
# ============================================================================

# Finding the best C and gamma values
print("\n" + "=" * 60)
print("Hyperparameter Tuning - White Wine SVM Top 6 Feature Set")
print("=" * 60)

results = []
C_values = [0.1, 0.5, 1.0, 10, 100]
gamma_values = ['scale', 'auto', 0.001, 0.01, 0.1, 1.0]

for C in C_values:
    for gamma in gamma_values:
        svm_test = Pipeline([
            ("scaler", StandardScaler()),
            ("model", SVR(kernel='rbf', C=C, epsilon=0.1, gamma=gamma))
        ])
        
        svm_test.fit(X_white_top_6_train, y_white_top_6_train)
        pred = svm_test.predict(X_white_top_6_test)
        
        mse = mean_squared_error(y_white_top_6_test, pred)
        rmse = mse ** 0.5
        
        results.append((C, gamma, rmse))

# Sort results by RMSE
results_sorted = sorted(results, key=lambda x: x[2], reverse=False)

# Print top 10 results
print("\nTop 10 Hyperparameter Combinations:")
print("C\t\tGamma\t\tRMSE")
print("-" * 40)
for C, gamma, rmse in results_sorted[:10]:
    print(f"{C}\t\t{gamma}\t\t{rmse:.4f}")

# Best hyperparameters
best_C, best_gamma, best_rmse = results_sorted[0]
print(f"\nBest Parameters: C={best_C}, gamma={best_gamma}, RMSE={best_rmse:.4f}")

SVM - White Wine Performance Top 6 Feature Set

Training Set:
  RMSE: 0.7020
  R²:   0.3734 (37.3%)

Test Set:
  RMSE: 0.7301
  R²:   0.3117 (31.2%)

Hyperparameter Tuning - White Wine SVM Top 6 Feature Set

Top 10 Hyperparameter Combinations:
C		Gamma		RMSE
----------------------------------------
1.0		1.0		0.7081
0.5		1.0		0.7142
1.0		auto		0.7301
1.0		scale		0.7301
0.5		scale		0.7315
0.5		auto		0.7315
1.0		0.1		0.7318
0.5		0.1		0.7333
10		0.01		0.7352
10		0.1		0.7359

Best Parameters: C=1.0, gamma=1.0, RMSE=0.7081


Result for all Features: SVM achieved strong performance on both datasets, with the best test R² score for red wine around 0.41 among all tested models. and for white wine it achieved around 0.37 R², demonstrating excellent capability to capture non-linear relationships between physicochemical properties and quality scores. SVM showed moderate overfitting, which is expected for this flexible model. Hyperparameter tuning revealed optimal settings of C=1.0, gamma=0.1 for red wine and C=10, gamma=1.0 for white wine.

Results for Top 6 Features: Cutting to six features hurts SVM: on red wine the test R² falls from ≈0.41 to ≈0.36 and the best RMSE hovers around 0.64 even after tuning (C≈0.5, gamma≈scale), while white wine drops from ≈0.37 to ≈0.31 R² and the tuned model can’t beat ~0.71 RMSE.

VM needs the full chemistry set—the reduced features strip out useful structure for both wines.



## Model Comparison

### Performance Summary

| Model | Red Wine (Test R²) | White Wine (Test R²) | Red Wine (Test RMSE) | White Wine (Test RMSE) |
|-------|-------------------|---------------------|---------------------|----------------------|
| **Linear Regression** | 0.3514 | 0.2653 | 0.6413 | 0.7502 |
| **K-Nearest Neighbors** | 30.9% | 36.4% | 0.6617 | 0.6931 |
| **Random Forest** | 39.0% | 37.8% | 0.6221 | 0.6851 |
| **Support Vector Machine (SVM)** | 40.8% | 37.2% | 0.6125 | 0.6887 |


## Reflection and Summary

### Project Summary

This project successfully developed predictive models to assess wine quality based on physicochemical properties. We analyzed a dataset containing 1,599 red wine samples and 4,898 white wine samples from Portuguese vinho verde wines, each characterized by 11 chemical features. Through comprehensive exploratory data analysis and the implementation of four different machine learning models (Linear Regression, K-Nearest Neighbors, Random Forest, and Support Vector Machine), we achieved meaningful predictive performance that aligns with published research in this domain.

**Key Results:**
- **Best Model for Red Wine**: Support Vector Machine (SVM) with 40.8% R² and RMSE of 0.6125
- **Best Model for White Wine**: Random Forest with 37.8% R² and RMSE of 0.6851
- **Best Generalization**: Linear Regression with minimal overfitting (train-test gap < 2.1%)
- **Overall Performance Range**: R² values from 26.6% to 40.8%, reflecting the inherent challenge of predicting subjective expert ratings

### Key Learnings and Insights

1. **Model Selection Matters**: Different models excel for different wine types. SVM's ability to capture non-linear relationships made it superior for red wine, while Random Forest's ensemble approach worked best for white wine. This highlights the importance of testing multiple algorithms rather than assuming one model fits all scenarios.

2. **Overfitting is a Real Challenge**: All non-linear models (KNN, Random Forest, SVM) showed significant overfitting, with train-test gaps ranging from 10-25%. This required careful regularization and hyperparameter tuning. The experience reinforced that high training performance doesn't guarantee good generalization.

3. **Data Quality is Critical**: The dataset was remarkably clean with no missing values, which streamlined the analysis. However, the presence of outliers in several features required careful consideration during EDA. The correlation analysis revealed important relationships (e.g., alcohol positively correlated with quality) that informed our understanding of the problem.

4. **Red vs White Wine Differences**: Red wine quality proved more predictable (best R² = 40.8%) than white wine (best R² = 37.8%). This suggests that the chemical composition of red wines may have more direct relationships with perceived quality, or that white wine quality involves factors not captured by the measured physicochemical properties.

5. **Moderate Performance is Expected**: Our R² values (27-41%) align with published research showing that wine quality prediction has a natural ceiling around 40-55% due to the subjective nature of expert tasting. This taught us that "good enough" performance depends on the problem context and that perfect prediction isn't always achievable or necessary.

### Challenges Faced

1. **Overfitting Management**: Initially, Random Forest achieved training R² above 92% but poor test performance. We had to iteratively reduce model complexity (max_depth, min_samples_leaf) to find the right balance between learning capacity and generalization.

2. **Hyperparameter Tuning**: Finding optimal hyperparameters for KNN (k values) and SVM (C, gamma).

3. **Model Interpretability**: While we achieved good performance with SVM and Random Forest, these models are less interpretable than Linear Regression. Understanding which features drive predictions requires additional analysis (e.g., feature importance, sensitivity analysis).

4. **Balancing Complexity and Performance**: There was a constant trade-off between model complexity and generalization. More complex models (SVM, Random Forest) performed better but required more tuning and were more prone to overfitting.

### What Worked Well

1. **Comprehensive EDA**: The thorough exploratory data analysis (missing values, distributions, outliers, correlations) provided valuable insights and helped us understand the data before modeling.

2. **Separate Analysis for Wine Types**: Analyzing red and white wines separately was crucial, as they have different chemical profiles and quality determinants. This approach allowed us to optimize models for each wine type.

3. **Consistent Evaluation Framework**: Using the same evaluation metrics (RMSE, R²) and train-test split methodology across all models enabled fair comparison and reliable performance assessment.

4. **Pipeline Implementation**: Using scikit-learn Pipelines for KNN and SVM (with StandardScaler) ensured proper preprocessing and made the code more maintainable and reproducible.

5. **Hyperparameter Tuning**: Systematic hyperparameter tuning (especially for KNN and SVM) significantly improved model performance and taught us the importance of proper model configuration.

### Areas for Improvement

1. **Feature Engineering**: We used all 11 features as-is without creating interaction terms or polynomial features. Exploring feature engineering (e.g., alcohol × acidity interactions) might have improved performance.

2. **Cross-Validation**: We used a simple 70-30 train-test split. Implementing k-fold cross-validation would provide more robust performance estimates and better hyperparameter selection.

3. **Feature Importance Analysis**: While we identified correlations, we didn't perform formal feature importance analysis (e.g., Random Forest feature importances, permutation importance) to understand which features most drive predictions.

4. **Additional Models**: We could have explored other algorithms like Gradient Boosting (XGBoost, LightGBM) or Neural Networks, which might have achieved even better performance.

5. **Error Analysis**: We didn't analyze prediction errors in detail (e.g., which quality scores are hardest to predict, systematic biases). This could provide insights for model improvement.

6. **Ensemble Methods**: Combining multiple models (e.g., averaging SVM and Random Forest predictions) might have improved performance beyond individual models.

### Practical Implications

The models we developed, while not perfect, have practical value:

- **Production Optimization**: Understanding which chemical properties correlate with quality (e.g., alcohol content, sulphates) can guide production decisions and process improvements.

- **Decision Support**: The models can serve as decision support tools for oenologists, providing objective predictions that complement subjective expert evaluations.

- **Research Foundation**: The moderate R² values (27-41%) confirm that expert wine tasting involves factors beyond chemical composition alone, suggesting opportunities for future research incorporating additional features (e.g., grape variety, region, vintage, production methods).

### Overall Conclusions

This project successfully demonstrated that machine learning can predict wine quality from physicochemical properties with meaningful accuracy. While the moderate R² values reflect the inherent challenge of predicting subjective expert ratings, the models provide valuable insights and practical utility.

The finding that different models excel for different wine types (SVM for red, Random Forest for white) highlights that there's no "one-size-fits-all" solution in machine learning. Context matters, and the best approach depends on the specific characteristics of the data and problem domain.

**Final Recommendation**: For production deployment, we recommend using SVM for red wine quality prediction and Random Forest for white wine, as these models achieved the best performance for their respective wine types. However, the choice between models should also consider computational resources, interpretability requirements, and the specific use case (e.g., real-time prediction vs. batch processing).

This project provided valuable hands-on experience with the complete data science pipeline, from data exploration through model deployment considerations, and demonstrated both the potential and limitations of machine learning for predicting subjective quality assessments.