# Predicting Red Wine Quality Using Ensemble Machine Learning Models

Author: Data-Git-Hub <br>
GitHub Project Repository Link: https://github.com/Data-Git-Hub/applied-ml-data-git-hub <br>
Dataset Link: https://archive.ics.uci.edu/ml/datasets/Wine+Quality <br>
15 April 2025 <br>

### Introduction

This project investigates the use of ensemble machine learning methods to classify the quality of red wine based on physicochemical properties. Ensemble models integrate the predictions of multiple base estimators to improve overall model performance and generalization. Techniques such as Random Forests, AdaBoost, Gradient Boosting, and Voting Classifiers are commonly employed to address the limitations of single-model approaches, particularly in reducing overfitting and increasing predictive reliability. <br>

The dataset used in this analysis is sourced from the UCI Machine Learning Repository and was originally compiled by Cortez, Cerdeira, Almeida, Matos, and Reis (2009). It contains various physicochemical attributes of red wine samples, such as fixed acidity, residual sugar, pH, alcohol content, and sulfur dioxide levels. Each sample includes a corresponding quality rating, evaluated by wine tasters on a scale from 0 to 10. To simplify the classification task, the original numerical ratings were categorized into three discrete classes: low, medium, and high quality. <br>

The objective of the analysis is to evaluate and compare multiple ensemble classification techniques using a range of performance metrics, including accuracy, precision, recall, and F1 score. The results are used to determine which models are most effective in predicting wine quality and to identify the trade-offs between model complexity and generalization. <br>

### Imports
Python libraries are collections of pre-written code that provide specific functionalities, making programming more efficient and reducing the need to write code from scratch. These libraries cover a wide range of applications, including data analysis, machine learning, web development, and automation. Some libraries, such as os, sys, math, json, and datetime, come built-in with Python as part of its standard library, providing essential functions for file handling, system operations, mathematical computations, and data serialization. Other popular third-party libraries, like pandas, numpy, matplotlib, seaborn, and scikit-learn, must be installed separately and are widely used in data science and machine learning. The extensive availability of libraries in Python's ecosystem makes it a versatile and powerful programming language for various domains. <br>

Pandas is a powerful data manipulation and analysis library that provides flexible data structures, such as DataFrames and Series. It is widely used for handling structured datasets, enabling easy data cleaning, transformation, and aggregation. Pandas is essential for data preprocessing in machine learning and statistical analysis. <br>
https://pandas.pydata.org/docs/ <br>

NumPy (Numerical Python) is a foundational library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a comprehensive collection of mathematical functions to operate on these arrays efficiently. NumPy is a key component in scientific computing and machine learning. <br>
https://numpy.org/doc/stable/ <br>

Matplotlib is a widely used data visualization library that allows users to create static, animated, and interactive plots. It provides extensive tools for generating various chart types, including line plots, scatter plots, histograms, and bar charts, making it a critical library for exploratory data analysis. <br>
https://matplotlib.org/stable/contents.html <br>

Seaborn is a statistical data visualization library built on top of Matplotlib, designed for creating visually appealing and informative plots. It simplifies complex visualizations, such as heatmaps, violin plots, and pair plots, making it easier to identify patterns and relationships in datasets. <br>
https://seaborn.pydata.org/ <br>

Scikit-learn provides a variety of tools for machine learning, including data preprocessing, model selection, and evaluation. It contains essential functions for building predictive models and analyzing datasets. <br>
sklearn.metrics: This module provides various performance metrics for evaluating machine learning models. <br>
https://scikit-learn.org/stable/modules/model_evaluation.html<br>

# Data handling
import pandas as pd
import numpy as np

# Machine learning imports
from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, PolynomialFeatures, MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    precision_score,
    recall_score,
    f1_score,
    silhouette_score,
    ConfusionMatrixDisplay,
)
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier  # needed by the RF + LR + KNN voting model
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV, RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter display utilities (for VS Code); display() renders the results table in Section 6
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display

### Section 1. Load and Inspect the Data

# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
df.info()
df.head()

# The dataset includes 11 physicochemical input variables (features):
# ---------------------------------------------------------------
# - fixed acidity          mostly tartaric acid
# - volatile acidity       mostly acetic acid (vinegar)
# - citric acid            can add freshness and flavor
# - residual sugar         remaining sugar after fermentation
# - chlorides              salt content
# - free sulfur dioxide    protects wine from microbes
# - total sulfur dioxide   sum of free and bound forms
# - density                related to sugar content
# - pH                     acidity level (lower = more acidic)
# - sulphates              antioxidant and microbial stabilizer
# - alcohol                % alcohol by volume

# The target variable is:
# - quality (integer score from 0 to 10, rated by wine tasters)

# We will simplify this target into three categories:
#   - low (3–4), medium (5–6), high (7–8) to make classification feasible.
#   - we will also make this numeric (we want both for clarity)
# The dataset contains 1599 samples and 12 columns (11 features + target).
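Before binning the target, it can help to confirm that no values are missing and to see how many samples fall under each quality score. The sketch below is one possible check, not part of the original notebook: `inspect_frame` is a hypothetical helper, and `demo` is a tiny made-up stand-in for the wine DataFrame.

```python
import pandas as pd


def inspect_frame(frame: pd.DataFrame, target: str = "quality"):
    """Return (total missing values, sample counts per target score)."""
    missing = int(frame.isna().sum().sum())
    counts = frame[target].value_counts().sort_index()
    return missing, counts


# Made-up stand-in for the wine data (illustrative values only)
demo = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 10.5, None, 11.2],
        "pH": [3.51, 3.20, 3.26, 3.16, 3.30],
        "quality": [5, 5, 6, 7, 4],
    }
)

missing, counts = inspect_frame(demo)
print(f"Missing values: {missing}")
print(counts)
```

With the real data, the same call would be `inspect_frame(df)`.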


### Section 2. Prepare the Data

# This section covers cleaning, feature engineering, encoding, splitting, and helper functions.

# Helper function: takes one input, the quality score (named q inside the function),
# and returns the matching quality label as a string (low, medium, or high).
# It is used to create the quality_label column.
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"


# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)


# Then, create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2


df["quality_numeric"] = df["quality"].apply(quality_to_number)
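The two helper functions above apply the same thresholds twice. As an alternative sketch, `pd.cut` can produce both the label and the numeric code from a single set of bin edges. The `scores` series below is a made-up example, and the bin edges are an assumption chosen to match the thresholds above (scores up to 4 are low, 5 to 6 are medium, 7 and above are high).

```python
import pandas as pd

# Made-up quality scores covering the range seen in the red wine data
scores = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-closed bins: (-1, 4] -> low, (4, 6] -> medium, (6, 10] -> high
bins = [-1, 4, 6, 10]
labels = ["low", "medium", "high"]

quality_label = pd.cut(scores, bins=bins, labels=labels)

# Category codes follow the label order: low=0, medium=1, high=2
quality_numeric = quality_label.cat.codes

print(pd.DataFrame({"quality": scores, "label": quality_label, "code": quality_numeric}))
```

Keeping the thresholds in one place means the label and numeric columns cannot drift apart if the cut points are changed later.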

# Explain what we do and why as you proceed. 

### Section 3. Feature Selection and Justification 

# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label', and 'quality_numeric' - drop these from the input array
# Target: quality_numeric (the numeric version of the quality_label column we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

# Explain / introduce your choices.

### Section 4. Split the Data into Train and Test

# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
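To see what `stratify=y` does, the class proportions of the full, training, and test sets can be compared side by side. The sketch below uses synthetic data with a deliberately imbalanced three-class target. The names `X_demo` and `y_demo` and the class sizes are assumptions for illustration, not the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 3 features, imbalanced 3-class target
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(np.repeat([0, 1, 2], [50, 800, 150]), name="quality_numeric")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare class shares; with stratification the three columns should nearly match
proportions = pd.DataFrame(
    {
        "full": y_demo.value_counts(normalize=True),
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }
).sort_index()
print(proportions.round(3))
```

Without stratification, a rare class can end up over- or under-represented in the test set purely by chance, which makes test metrics for that class unreliable.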

### Section 5. Evaluate Model Performance (Choose 2)
Below is a list of 9 model variations. Choose two to focus on for your comparison.

| Option | Model Name | Notes |
|--------|------------|-------|
| 1 | Random Forest (100) | A strong baseline model using 100 decision trees. |
| 2 | Random Forest (200, max_depth=10) | Adds more trees, but limits tree depth to reduce overfitting. |
| 3 | AdaBoost (100) | Boosting method that focuses on correcting previous errors. |
| 4 | AdaBoost (200, lr=0.5) | More iterations and slower learning for better generalization. |
| 5 | Gradient Boosting (100) | Boosting approach using gradient descent. |
| 6 | Voting (DT + SVM + NN) | Combines diverse models by averaging their predictions. |
| 7 | Voting (RF + LR + KNN) | Another mix of different model types. |
| 8 | Bagging (DT, 100) | Builds many trees in parallel on different samples. |
| 9 | MLP Classifier | A basic neural network with one hidden layer. |


Here's a helper function that might be nice - feel free to use or adjust as you like. 


# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )
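
The helper above reports accuracy and weighted F1. Since precision and recall are also among the metrics named in the introduction, the sketch below shows one way they could be computed for a fitted model. It runs on synthetic data from `make_classification`. The dataset parameters and the `X_demo`, `y_demo` names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class problem with 11 features (stand-in for the wine data)
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=11,
    n_informative=6,
    n_classes=3,
    weights=[0.1, 0.7, 0.2],
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Weighted averages account for class imbalance; zero_division=0 avoids
# warnings when a class is never predicted
test_precision = precision_score(y_te, y_pred, average="weighted", zero_division=0)
test_recall = recall_score(y_te, y_pred, average="weighted", zero_division=0)
print(f"Test Precision: {test_precision:.4f}, Test Recall: {test_recall:.4f}")

# Per-class breakdown, useful for spotting a minority class the model ignores
print(classification_report(y_te, y_pred, zero_division=0))
```

Weighted recall is mathematically equal to accuracy, so the per-class rows of the report are usually the more informative part when classes are imbalanced.
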

Here's how to create each of the ensemble models listed above. You don't need to run all of them yourself; choose 2, since we have a whole team working on this.

results = []

# 1. Random Forest
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 2. Random Forest (200, max depth=10) 
evaluate_model(
    "Random Forest (200, max_depth=10)",
    RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 3. AdaBoost 
evaluate_model(
    "AdaBoost (100)",
    AdaBoostClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 4. AdaBoost (200, lr=0.5) 
evaluate_model(
    "AdaBoost (200, lr=0.5)",
    AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 5. Gradient Boosting
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 6. Voting Classifier (DT, SVM, NN) 
voting1 = VotingClassifier(
    estimators=[
        ("DT", DecisionTreeClassifier()),
        ("SVM", SVC(probability=True)),
        ("NN", MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000)),
    ],
    voting="soft",
)
evaluate_model(
    "Voting (DT + SVM + NN)", voting1, X_train, y_train, X_test, y_test, results
)

# 7. Voting Classifier (RF, LR, KNN) 
voting2 = VotingClassifier(
    estimators=[
        ("RF", RandomForestClassifier(n_estimators=100)),
        ("LR", LogisticRegression(max_iter=1000)),
        ("KNN", KNeighborsClassifier()),
    ],
    voting="soft",
)
evaluate_model(
    "Voting (RF + LR + KNN)", voting2, X_train, y_train, X_test, y_test, results
)
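
The voting ensembles above pass raw features to every base estimator. Distance-based and gradient-based models such as KNN, SVM, logistic regression, and MLP are generally sensitive to feature scale, so one variation worth trying is to wrap those estimators in a pipeline with `StandardScaler`. The sketch below is an assumption about a possible improvement, not a result from this notebook, and it uses synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in for the wine data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale-sensitive estimators get their own scaler; the random forest does not need one
scaled_voting = VotingClassifier(
    estimators=[
        ("RF", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("LR", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("KNN", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    voting="soft",
)
scaled_voting.fit(X_tr, y_tr)

test_acc = scaled_voting.score(X_te, y_te)
print(f"Test Accuracy: {test_acc:.4f}")
```

Because each scaler lives inside its pipeline, it is fitted on the training data only, which avoids leaking test-set statistics into the model.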

# 8. Bagging 
evaluate_model(
    "Bagging (DT, 100)",
    BaggingClassifier(
        estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 9. MLP Classifier 
evaluate_model(
    "MLP Classifier",
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


### Section 6. Compare Results 
# Create a table of results 
results_df = pd.DataFrame(results)

print("\nSummary of All Models:")
display(results_df)

# Recommendation: See if you can add gap calculations to your results and sort the table by test accuracy to find the best models more efficiently. 
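
One way to act on the recommendation above is to add train-minus-test gap columns and sort by test accuracy. In the sketch below, `demo_results` uses the same keys that `evaluate_model` appends, but the model names and numbers are placeholders, not results from this notebook.

```python
import pandas as pd

# Placeholder entries with the same keys as the results list (values are made up)
demo_results = [
    {"Model": "Model A", "Train Accuracy": 1.00, "Test Accuracy": 0.85, "Train F1": 1.00, "Test F1": 0.84},
    {"Model": "Model B", "Train Accuracy": 0.90, "Test Accuracy": 0.88, "Train F1": 0.89, "Test F1": 0.87},
    {"Model": "Model C", "Train Accuracy": 0.95, "Test Accuracy": 0.80, "Train F1": 0.94, "Test F1": 0.78},
]
demo_df = pd.DataFrame(demo_results)

# A large positive gap suggests the model fits the training set better than it generalizes
demo_df["Accuracy Gap"] = demo_df["Train Accuracy"] - demo_df["Test Accuracy"]
demo_df["F1 Gap"] = demo_df["Train F1"] - demo_df["Test F1"]

# Best test accuracy first
demo_df = demo_df.sort_values("Test Accuracy", ascending=False).reset_index(drop=True)
print(demo_df.round(4))
```

The same three lines of gap and sort logic can be applied directly to `results_df`.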

### Section 7. Conclusions and Insights

# Using both your results and the results from others, which options are performing well and why do you think so. 

# This is your value as an analyst - narrate your story, link to other notebooks, provide a comprehensive view of what you feel is the best model for predicting quality in red wine. Base all your reasoning on data. Feel free to tune parameters if you like.  Discuss the types of models and why you think some seem to be more helpful. List the next steps you'd like to try if you were in a competition to build the best predictor. 

# Don't just copy code and don't just copy AI insights - use them to learn, but we all get them for free. Use all your tools to provide your own unique value and insights. Professional communication skills are critical. Evaluate your work in the context of others - how well can you craft a unique data story and present a compelling project to your clients / readers / self. 

### References:

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. _Decision Support Systems_, _47_(4), 547–553. https://doi.org/10.1016/j.dss.2009.05.016 <br>