# Lab 5 – Ensemble ML Models on Wine Quality Data
**Name:** Huzaifa Nadeem  
**Date:** 2025-04-11  

In this notebook, we explore ensemble machine learning methods to classify red wine quality from physicochemical features. We compare a Random Forest with a soft-voting classifier (Random Forest + Logistic Regression + KNN) and evaluate which model generalizes better to unseen data.


In [None]:
# ------------------------------------------------
# Imports 
# ------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)


## Section 1 – Load and Inspect the Data

In [13]:
df = pd.read_csv("winequality-red.csv", sep=";")
df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## Section 2 – Prepare the Data

In [14]:
def quality_to_label(q):
    """Map a raw quality score to a label: <=4 low, 5-6 medium, >=7 high."""
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

df["quality_label"] = df["quality"].apply(quality_to_label)

def quality_to_number(q):
    """Map a raw quality score to a class id: 0 = low, 1 = medium, 2 = high."""
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

df["quality_numeric"] = df["quality"].apply(quality_to_number)
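The two mapping functions above share the same thresholds, so they can be collapsed into one vectorized step. The following is a minimal sketch using `pd.cut` on a toy series; the series and the names `bins`, `labels`, and `quality_label` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy quality scores covering both sides of every threshold (not the real dataset).
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-closed bins: (-inf, 4] -> low, (4, 6] -> medium, (6, inf) -> high.
bins = [-float("inf"), 4, 6, float("inf")]
labels = ["low", "medium", "high"]

quality_label = pd.cut(quality, bins=bins, labels=labels)
# The categories are ordered as listed, so their codes are 0, 1, 2.
quality_numeric = quality_label.cat.codes

print(pd.DataFrame({"quality": quality, "label": quality_label, "numeric": quality_numeric}))
```

Keeping the bin edges in one place avoids the two functions drifting apart if the thresholds are changed later.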


## Section 3 – Feature Selection and Justification

In [15]:
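# Features: all 11 physicochemical measurements; the raw score and derived labels are dropped.
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])
# Target: the 3-class encoding (0 = low, 1 = medium, 2 = high).
y = df["quality_numeric"]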


## Section 4 – Split the Data into Train and Test

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
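Because the three quality classes are imbalanced, `stratify=y` matters here: it keeps the class proportions the same in both splits. A small sketch on synthetic labels (the class counts below are made up for illustration, and `class_shares` is a hypothetical helper) shows how this can be checked:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: a rare class, a dominant class, and a smaller one.
y_demo = np.array([0] * 40 + [1] * 820 + [2] * 140)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def class_shares(labels):
    """Return the fraction of samples in each class, ordered by class id."""
    counts = np.bincount(labels, minlength=3)
    return counts / counts.sum()

print("train shares:", class_shares(y_tr).round(3))
print("test shares: ", class_shares(y_te).round(3))
```

The same check can be run on `y_train` and `y_test` from the cell above.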


## Section 5 – Evaluate Model Performance

In [17]:
results = []

def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    """Fit the model, print test metrics, and append a summary row to results."""
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

evaluate_model(
    "Voting (RF + LR + KNN)",
    VotingClassifier(
        estimators=[
            ("RF", RandomForestClassifier(n_estimators=100)),
            ("LR", LogisticRegression(max_iter=2000)),
            ("KNN", KNeighborsClassifier()),
        ],
        voting="soft",
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)



Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661

Voting (RF + LR + KNN) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 257   7]
 [  0  28  15]]
Train Accuracy: 0.9171, Test Accuracy: 0.8500
Train F1 Score: 0.9003, Test F1 Score: 0.8166


## Section 6 – Compare Results

In [18]:
results_df = pd.DataFrame(results)
results_df


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 |
|---|---|---|---|---|---|
| 0 | Random Forest (100) | 1.0 | 0.8875 | 1.0 | 0.866056 |
| 1 | Voting (RF + LR + KNN) | 0.917123 | 0.85 | 0.900276 | 0.816557 |


## Section 7 – Conclusions and Insights
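The Random Forest model scored higher on the test set in both accuracy (0.8875 vs. 0.8500) and weighted F1 (0.8661 vs. 0.8166), with the voting classifier close behind. Neither model correctly identified any of the 13 low-quality wines in the test set, and the Random Forest's perfect training scores point to some overfitting.
This suggests that tree-based ensembles generalize reasonably well on this dataset, although the minority classes remain difficult. Further tuning or boosting methods could be explored.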


## Section 5 – Evaluate Model Performance

In [19]:
results = []

def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    """Fit the model, print test metrics, and append a summary row to results."""
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append({
        "Model": name,
        "Train Accuracy": train_acc,
        "Test Accuracy": test_acc,
        "Train F1": train_f1,
        "Test F1": test_f1
    })

In [20]:
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train, y_train, X_test, y_test, results
)


Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661
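The overall accuracy hides how unevenly the classes are handled. As a worked check, per-class recall can be read directly off the confusion matrix printed above (rows are true classes, columns are predicted classes; the class names are the labels from Section 2):

```python
import numpy as np

# Test confusion matrix printed above for Random Forest (100).
# Rows are true classes, columns are predicted classes (0=low, 1=medium, 2=high).
cm = np.array([
    [0, 13, 0],
    [0, 256, 8],
    [0, 15, 28],
])

support = cm.sum(axis=1)            # number of true samples per class
recall = np.diag(cm) / support      # fraction of each class that was recovered
accuracy = np.trace(cm) / cm.sum()  # overall accuracy

for name, r, n in zip(["low", "medium", "high"], recall, support):
    print(f"{name:>6}: recall={r:.4f} (n={n})")
print(f"accuracy={accuracy:.4f}")
```

The low class has a recall of 0 (all 13 samples are predicted as medium), while the medium class reaches about 0.97, so the 0.8875 accuracy is driven mainly by the dominant class.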


In [21]:
evaluate_model(
    "Voting (RF + LR + KNN)",
    VotingClassifier(
        estimators=[
            ("RF", RandomForestClassifier(n_estimators=100)),
            ("LR", LogisticRegression(max_iter=2000)),
            ("KNN", KNeighborsClassifier()),
        ],
        voting="soft",
    ),
    X_train, y_train, X_test, y_test, results
)


Voting (RF + LR + KNN) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 257   7]
 [  0  27  16]]
Train Accuracy: 0.9156, Test Accuracy: 0.8531
Train F1 Score: 0.8967, Test F1 Score: 0.8210
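The voting classifier's scores here (test accuracy 0.8531) differ slightly from the earlier run of the same cell (0.8500). The most likely cause is that the inner `RandomForestClassifier` is created without a `random_state`. A minimal sketch of a reproducible variant follows, run on synthetic data as a stand-in for the wine features. `build_voting` is a hypothetical helper, and the `StandardScaler` pipelines for the distance- and gradient-based members are an optional addition that the original notebook does not use.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: 3 classes, 11 features).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def build_voting(seed=42):
    """Soft-voting ensemble with a seeded forest and scaled LR/KNN members."""
    return VotingClassifier(
        estimators=[
            ("RF", RandomForestClassifier(n_estimators=100, random_state=seed)),
            ("LR", make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))),
            ("KNN", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ],
        voting="soft",
    )

# Two independent fits with the same seed should agree on every prediction.
pred_a = build_voting().fit(X_tr, y_tr).predict(X_te)
pred_b = build_voting().fit(X_tr, y_tr).predict(X_te)
print("identical predictions across runs:", (pred_a == pred_b).all())
```

Scaling changes the model, so scores from a scaled variant are not directly comparable to the numbers reported above.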


## Section 6 – Compare Results

In [22]:
results_df = pd.DataFrame(results)
results_df

| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 |
|---|---|---|---|---|---|
| 0 | Random Forest (100) | 1.0 | 0.8875 | 1.0 | 0.866056 |
| 1 | Voting (RF + LR + KNN) | 0.915559 | 0.853125 | 0.896724 | 0.821034 |
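One way to read this table is to look at the gap between train and test accuracy, which gives a rough indication of overfitting. The sketch below copies the values from the table above into a small frame; `results_demo` and the `Accuracy Gap` column are illustrative names.

```python
import pandas as pd

# Values copied from the results table above.
results_demo = pd.DataFrame(
    {
        "Model": ["Random Forest (100)", "Voting (RF + LR + KNN)"],
        "Train Accuracy": [1.0, 0.915559],
        "Test Accuracy": [0.8875, 0.853125],
    }
)

# Train-minus-test accuracy: a larger gap suggests more overfitting.
results_demo["Accuracy Gap"] = (
    results_demo["Train Accuracy"] - results_demo["Test Accuracy"]
)
print(results_demo.sort_values("Test Accuracy", ascending=False).to_string(index=False))
```

The Random Forest has the larger gap (about 0.11 versus 0.06) but still the better test accuracy.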


## Section 7 – Conclusions and Insights

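The Random Forest classifier achieved the highest test accuracy (0.8875) and weighted F1 score (0.8661), suggesting it's a strong model for wine quality classification. The Voting Classifier also performed well (0.8531 accuracy, 0.8210 F1) and showed a smaller train-test gap, though its added model diversity did not improve on the Random Forest here. Both models missed all 13 low-quality test samples, so the headline scores mostly reflect the dominant medium class. Based on this, tree-based models appear effective on this dataset. If more time were available, I’d explore Gradient Boosting and hyperparameter tuning to push performance further.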
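As a starting point for that follow-up, here is a hedged sketch of Gradient Boosting with a small grid search. It runs on synthetic data rather than the wine dataset, and the parameter grid is an arbitrary illustration, not a tuned recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the wine data (assumption: 3 classes, 11 features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Deliberately tiny grid so the search finishes quickly.
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="f1_weighted",  # same averaging as evaluate_model
    cv=3,
)
search.fit(X_tr, y_tr)

# GridSearchCV refits the best configuration on the full training split.
test_f1 = f1_score(y_te, search.predict(X_te), average="weighted")
print("best params:", search.best_params_)
print(f"weighted test F1: {test_f1:.4f}")
```

On the real data, the fitted `search.best_estimator_` could be passed to `evaluate_model` so its scores land in the same results table.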