# Clayton Seabaugh: Ensemble Machine Learning
**Author:** Clayton Seabaugh  
**Date:** 4-13-2025  
**Objective:** Explore the Wine Quality dataset, use ensemble models to predict wine quality categories, and evaluate their performance.

## Imports

In [3]:
# Import Wine dataset
# use pip install ucimlrepo in terminal with a virtual environment
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
wine_quality = fetch_ucirepo(id=186) 
  
# data (as pandas dataframes) 
X = wine_quality.data.features 
y = wine_quality.data.targets 
  
# metadata 
print(wine_quality.metadata) 
  
# variable information 
print(wine_quality.variables) 


{'uci_id': 186, 'name': 'Wine Quality', 'repository_url': 'https://archive.ics.uci.edu/dataset/186/wine+quality', 'data_url': 'https://archive.ics.uci.edu/static/public/186/data.csv', 'abstract': 'Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/).', 'area': 'Business', 'tasks': ['Classification', 'Regression'], 'characteristics': ['Multivariate'], 'num_instances': 4898, 'num_features': 11, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['quality'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2009, 'last_updated': 'Wed Nov 15 2023', 'dataset_doi': '10.24432/C56S3T', 'creators': ['Paulo Cortez', 'A. Cerdeira', 'F. Almeida', 'T. Matos', 'J. Reis'], 'intro_paper': {'ID': 252, 'type': 'NATIVE', 'title': 'Modeling wine preferences

In [4]:
# Local and Python module imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

## Section 1: Load and Inspect the Data

In [9]:
# Load the dataset
df = pd.read_csv('winequality-red.csv', sep=';')

# Display the structure and first few rows
df.info()
df.head()

# 11 Features and Target is quality

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## Reflection Section 1
#### The dataset includes 11 physicochemical input variables (features):
- fixed acidity: mostly tartaric acid
- volatile acidity: mostly acetic acid (vinegar)
- citric acid: can add freshness and flavor
- residual sugar: remaining sugar after fermentation
- chlorides: salt content
- free sulfur dioxide: protects wine from microbes
- total sulfur dioxide: sum of free and bound forms
- density: related to sugar content
- pH: acidity level (lower = more acidic)
- sulphates: antioxidant and microbial stabilizer
- alcohol: % alcohol by volume

#### The target variable is:
- quality (integer score from 0 to 10, rated by wine tasters)

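#### We will simplify this target into three categories:
- low (3-4), medium (5-6), and high (7-8), to make classification feasible
- We will also encode these categories numerically (0 = low, 1 = medium, 2 = high), keeping both the label and the numeric column for clarity
#### This dataset contains 1599 samples and 12 columns (11 features plus the quality target)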

### Section 2: Prepare the Data
Includes cleaning, feature engineering, encoding, splitting, and helper functions.

In [11]:
# Define helper function to convert quality into a string (quality_label)

def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else: 
        return "high"

In [12]:
# Call apply() method on the quality column to create a quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)

# Create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2
    
df["quality_numeric"] = df["quality"].apply(quality_to_number)
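The same binning can also be written without a row-wise `apply()`. The following is a minimal sketch using `pd.cut` on a toy Series; the scores and the bin edges are illustrative assumptions chosen to match the thresholds above (`<= 4` is low, `<= 6` is medium, everything else is high), not values taken from the notebook.

```python
import pandas as pd

# Toy quality scores (hypothetical values, not read from the dataset)
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-inclusive bins: (-1, 4] -> low, (4, 6] -> medium, (6, 10] -> high
bins = [-1, 4, 6, 10]
labels = ["low", "medium", "high"]

quality_label = pd.cut(quality, bins=bins, labels=labels)

# Category codes follow the label order: 0 = low, 1 = medium, 2 = high
quality_numeric = quality_label.cat.codes

print(pd.DataFrame({
    "quality": quality,
    "quality_label": quality_label,
    "quality_numeric": quality_numeric,
}))
```

Because the labels are an ordered list, the numeric codes come directly from the categorical column, so the label and the number cannot drift apart.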

### Section 3: Feature Selection and Justification

In [13]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label', and 'quality_numeric' - drop these from the input array
# Target: quality_numeric (the numeric column we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

### Reflection Section 3:
The input features are all of the columns except those derived from 'quality'. We want to see how all of the features impact the quality of the wine. If needed, we can also narrow these down to a subset of features for more targeted results. <br> The target is a column we created that encodes the quality rating numerically: low, medium, and high. <br> We made it numeric so we can apply ML and statistical analysis to the target.
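One possible way to look at how much each feature contributes is the impurity-based importance that a fitted random forest exposes. This is a sketch on synthetic data rather than the notebook's analysis: the data from `make_classification`, the `feature_i` column names, and the model settings are all assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wine features (hypothetical data and column names)
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

On the real data, the same pattern would apply with `X.columns` as the index, which could help decide which features to keep if a smaller feature set is wanted.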

### Section 4: Split the Data into Train and Test

In [14]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
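To illustrate what `stratify=y` does, here is a small sketch on synthetic labels. The 10% / 70% / 20% class mix and the placeholder feature are assumptions made up for the example; the point is only that the class shares in the train and test sets stay close to those of the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical proportions: 10% / 70% / 20%)
y_demo = np.array([0] * 100 + [1] * 700 + [2] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def class_shares(labels):
    """Return the fraction of samples in each of the classes 0, 1, 2."""
    return np.bincount(labels, minlength=3) / len(labels)

print("full :", class_shares(y_demo))
print("train:", class_shares(y_tr))
print("test :", class_shares(y_te))
```

Without stratification, a small class such as the low-quality wines could end up under- or over-represented in the test set purely by chance.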

### Section 5: Evaluate Model Performance
I will use two models: Option 2, Random Forest (200, max_depth=10), and Option 5, Gradient Boosting (100).

In [20]:
# Helper function to train and evaluate models

results = []
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

In [21]:
# 2. Random Forest (200, max depth=10) 
evaluate_model(
    "Random Forest (200, max_depth=10)",
    RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


Random Forest (200, max_depth=10) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 255   9]
 [  0  16  27]]
Train Accuracy: 0.9758, Test Accuracy: 0.8812
Train F1 Score: 0.9745, Test F1 Score: 0.8596


In [22]:
# 5. Gradient Boosting
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  3 247  14]
 [  0  16  27]]
Train Accuracy: 0.9601, Test Accuracy: 0.8562
Train F1 Score: 0.9584, Test F1 Score: 0.8411
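The confusion matrices above can be read class by class. The sketch below copies the two test matrices from the printed outputs and derives accuracy and per-class recall from them; the helper name `per_class_recall` is introduced here for illustration. Rows are true classes and columns are predicted classes, in the order 0 = low, 1 = medium, 2 = high.

```python
import numpy as np

# Test-set confusion matrices copied from the outputs above
# (rows = true class, columns = predicted class; 0 = low, 1 = medium, 2 = high)
cm_rf = np.array([[0, 13, 0], [0, 255, 9], [0, 16, 27]])
cm_gb = np.array([[0, 13, 0], [3, 247, 14], [0, 16, 27]])

def per_class_recall(cm):
    """Recall per class: diagonal (correct predictions) divided by row totals."""
    return np.diag(cm) / cm.sum(axis=1)

for name, cm in [("Random Forest", cm_rf), ("Gradient Boosting", cm_gb)]:
    accuracy = np.trace(cm) / cm.sum()
    recall = per_class_recall(cm)
    print(f"{name}: accuracy={accuracy:.5f}, recall per class={np.round(recall, 3)}")
```

Both matrices have a zero in the top-left cell, so neither model correctly identifies any of the 13 low-quality wines in the test set; all 13 are predicted as medium. The weighted F1 score reported above is dominated by the much larger medium class, which is why it stays high despite this.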


### Section 6: Compare Results

In [25]:
# Create a DataFrame from the list of results
results_df = pd.DataFrame(results)

# Calculate the gap between Train and Test Accuracy
results_df["Accuracy Gap"] = results_df["Train Accuracy"] - results_df["Test Accuracy"]

# Sort by Test Accuracy in descending order
results_df = results_df.sort_values(by="Test Accuracy", ascending=False)

print("\nSummary of All Models (Sorted by Test Accuracy):")
display(results_df)



Summary of All Models (Sorted by Test Accuracy):


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 | Accuracy Gap |
|---|---|---|---|---|---|---|
| 0 | Random Forest (200, max_depth=10) | 0.975762 | 0.88125 | 0.974482 | 0.859643 | 0.094512 |
| 1 | Gradient Boosting (100) | 0.960125 | 0.85625 | 0.95841 | 0.841106 | 0.103875 |


### Section 7: Conclusions and Insights