# Predicting Red Wine Quality Using Machine Learning  
### Nolan Moss

In this project, I explore a dataset of red wine physicochemical properties and their associated quality ratings, with the goal of building a machine learning model that can accurately predict wine quality based on chemical attributes.

In [11]:
# ------------------------------------------------
# Imports once at the top, organized
# ------------------------------------------------

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

## Section 1. Load and Inspect the Data

In [12]:
# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
df.info()
df.head()

# The dataset includes 11 physicochemical input variables (features):
# ---------------------------------------------------------------
# - fixed acidity          mostly tartaric acid
# - volatile acidity       mostly acetic acid (vinegar)
# - citric acid            can add freshness and flavor
# - residual sugar         remaining sugar after fermentation
# - chlorides              salt content
# - free sulfur dioxide    protects wine from microbes
# - total sulfur dioxide   sum of free and bound forms
# - density                related to sugar content
# - pH                     acidity level (lower = more acidic)
# - sulphates              antioxidant and microbial stabilizer
# - alcohol                % alcohol by volume

# The target variable is:
# - quality (integer score from 0 to 10, rated by wine tasters)

# We will simplify this target into three categories:
#   - low (3–4), medium (5–6), high (7–8) to make classification feasible.
#   - we will also make this numeric (we want both for clarity)
# The dataset contains 1599 samples and 12 columns (11 features + target).




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## Section 2. Prepare the Data
Includes cleaning, feature engineering, encoding, splitting, helper functions

In [14]:
# Helper function: takes a quality score (named q inside the function)
# and returns its quality label as a string ("low", "medium", or "high").
# It is used below to create the quality_label column.
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"


# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)


# Then, create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2


df["quality_numeric"] = df["quality"].apply(quality_to_number)

Explain what we do and why as you proceed. 

Quality scores are grouped into three categories to simplify analysis and improve interpretability, and a numeric version of those categories is then created for use in modeling. This dual labeling, one categorical and one numeric, supports both exploratory data analysis and machine learning model development.
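The same grouping can also be expressed without hand-written helper functions. The sketch below is one possible alternative, not the approach used in this notebook: it applies `pd.cut` to a small toy series (the `scores` values and the bin edges are illustrative assumptions chosen to match the `<= 4` and `<= 6` thresholds above).

```python
import pandas as pd

# Toy quality scores (illustrative only, not rows from the wine dataset).
scores = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-inclusive bins: (-1, 4] -> low, (4, 6] -> medium, (6, 10] -> high.
# These edges mirror the thresholds in quality_to_label.
bins = [-1, 4, 6, 10]
labels = ["low", "medium", "high"]

quality_label = pd.cut(scores, bins=bins, labels=labels)

# Ordered categorical codes give the numeric version: 0 = low, 1 = medium, 2 = high.
quality_numeric = quality_label.cat.codes

print(pd.DataFrame({"quality": scores, "label": quality_label, "numeric": quality_numeric}))
```

Because the label and the code come from one categorical column, the two versions cannot drift apart if the thresholds are changed later.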

## Section 3. Feature Selection and Justification

In [None]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label', and 'quality_numeric' (dropped from the input array)
# Target: quality_numeric (the numeric column we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

Explain / introduce your choices.

Target-related columns are excluded from the feature set to avoid data leakage. The objective chemical measurements are used as input features because they are relevant, independent, and measurable properties of the wine. A numeric target was selected to match the classification goal and ensure compatibility with ML models.

## Section 4. Split the Data into Train and Test

In [None]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
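To see what `stratify` does, the following sketch compares class proportions before and after a split. It uses synthetic, imbalanced labels (`y_demo`, with made-up class counts) rather than the wine data, so the numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for quality_numeric.
# The class counts (50 / 800 / 150) are invented for this demo.
y_demo = pd.Series([0] * 50 + [1] * 800 + [2] * 150)
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class proportions in the full data, the training set, and the test set.
proportions = pd.DataFrame(
    {
        "full": y_demo.value_counts(normalize=True),
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }
).sort_index()

print(proportions.round(3))
```

With stratification, the three columns should be nearly identical, so a rare class is represented in the test set in about the same share as in the full data.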

## Section 5. Evaluate Model Performance (Choose 2)

Below is a list of 9 model variations. Choose two to focus on for your comparison.

| Option | Model Name | Notes |
|---|---|---|
| 1 | Random Forest (100) | A strong baseline model using 100 decision trees. |
| 2 | Random Forest (200, max_depth=10) | Adds more trees, but limits tree depth to reduce overfitting. |
| 3 | AdaBoost (100) | Boosting method that focuses on correcting previous errors. |
| 4 | AdaBoost (200, lr=0.5) | More iterations and slower learning for better generalization. |
| 5 | Gradient Boosting (100) | Boosting approach using gradient descent. |
| 6 | Voting (DT + SVM + NN) | Combines diverse models by averaging their predictions. |
| 7 | Voting (RF + LR + KNN) | Another mix of different model types. |
| 8 | Bagging (DT, 100) | Builds many trees in parallel on different samples. |
| 9 | MLP Classifier | A basic neural network with one hidden layer. |

In [None]:
# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

Here's how to create the different types of ensemble models listed above (you don't need to do all of them yourself. Choose 2 - we have a whole team working on this.)

In [None]:
results = []

In [None]:
# 1. Random Forest
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)



# 9. MLP Classifier 
evaluate_model(
    "MLP Classifier",
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)



Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661

MLP Classifier Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 257   7]
 [  0  30  13]]
Train Accuracy: 0.8514, Test Accuracy: 0.8438
Train F1 Score: 0.8141, Test F1 Score: 0.8073
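For reference, one of the options not run here, option 7 (Voting: RF + LR + KNN), might be assembled roughly as follows. This is a sketch on synthetic data from `make_classification`, not a result for the wine dataset; the hyperparameters (`n_estimators=100`, `max_iter=1000`, `n_neighbors=5`) and the choice of `voting="soft"` are assumptions, since the table above does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data with 11 features, standing in for the wine features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Soft voting averages the predicted class probabilities of the three models.
voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",
)
voting.fit(X_tr, y_tr)

y_pred = voting.predict(X_te)
test_acc = accuracy_score(y_te, y_pred)
print(f"Test accuracy on synthetic data: {test_acc:.4f}")
```

An estimator built this way can be passed to `evaluate_model` in place of the Random Forest or MLP instances used above.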


## Section 6. Compare Results

In [None]:
# Create a table of results 
results_df = pd.DataFrame(results)

print("\nSummary of All Models:")
display(results_df)



Summary of All Models:


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 |
|---|---|---|---|---|---|
| 0 | Random Forest (100) | 1.0 | 0.8875 | 1.0 | 0.866056 |
| 1 | MLP Classifier | 0.851446 | 0.84375 | 0.814145 | 0.807318 |


Recommendation: See if you can add gap calculations to your results and sort the table by test accuracy to find the best models more efficiently. 
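One way to follow this recommendation is sketched below. It rebuilds the summary table from the values printed above (in the notebook, `results_df` could be used directly) and adds two hypothetical columns, `Accuracy Gap` and `F1 Gap`, defined here as train minus test.

```python
import pandas as pd

# Values copied from the summary table above.
results_demo = pd.DataFrame(
    [
        {"Model": "Random Forest (100)", "Train Accuracy": 1.0,
         "Test Accuracy": 0.8875, "Train F1": 1.0, "Test F1": 0.866056},
        {"Model": "MLP Classifier", "Train Accuracy": 0.851446,
         "Test Accuracy": 0.84375, "Train F1": 0.814145, "Test F1": 0.807318},
    ]
)

# Gap = train metric minus test metric; a larger gap hints at more overfitting.
results_demo["Accuracy Gap"] = results_demo["Train Accuracy"] - results_demo["Test Accuracy"]
results_demo["F1 Gap"] = results_demo["Train F1"] - results_demo["Test F1"]

# Best test accuracy first.
ranked = results_demo.sort_values("Test Accuracy", ascending=False).reset_index(drop=True)

print(ranked.round(4))
```

On these numbers the Random Forest ranks first on test accuracy with an accuracy gap of about 0.11, while the MLP's gap is under 0.01.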

## Section 7. Conclusions and Insights

The Random Forest performs perfectly on the training data (accuracy 1.0000), which suggests overfitting. It still does well on the test set (accuracy 0.8875), especially at identifying medium-quality wines (256 of 264 correct). However, it struggles most with low-quality wines, classifying none of the 13 low-quality test samples correctly, probably because of class imbalance.

The MLP Classifier is less overfit because its train and test performance are closer together (accuracy 0.8514 vs. 0.8438), but its overall performance is lower than Random Forest's. This is likely affected by the class imbalance and possibly by too little hidden-layer capacity for this problem.

The Random Forest is more accurate overall but shows signs of overfitting. It still has the best test metrics (test accuracy 0.8875 vs. 0.8438), so it is likely the better model. The MLP Classifier has a much smaller train/test gap and is less prone to overfitting, but it underperforms Random Forest.
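The effect of class imbalance is easier to see as per-class recall. The sketch below computes it from the two test confusion matrices printed in Section 5, relying on the scikit-learn convention that rows are true classes and columns are predicted classes; the helper name `per_class_recall` is introduced here for illustration.

```python
import numpy as np

# Test confusion matrices copied from the output above
# (rows = true class, columns = predicted class; order: low, medium, high).
cm_rf = np.array([[0, 13, 0], [0, 256, 8], [0, 15, 28]])
cm_mlp = np.array([[0, 13, 0], [0, 257, 7], [0, 30, 13]])


def per_class_recall(cm):
    """Recall per class: correct predictions divided by true samples in that class."""
    return np.diag(cm) / cm.sum(axis=1)


recall_rf = per_class_recall(cm_rf)
recall_mlp = per_class_recall(cm_mlp)

for name, recall in [("Random Forest (100)", recall_rf), ("MLP Classifier", recall_mlp)]:
    print(name, dict(zip(["low", "medium", "high"], recall.round(4))))
```

Both models have zero recall on the low class, and the main difference between them is on the high class (roughly 0.65 for Random Forest versus 0.30 for the MLP), which the overall accuracy figures do not show.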

Another student's work with the Voting Classifier shows that the ensemble overfits less and is more balanced. Its performance is slightly lower than Random Forest's, but it is more conservative and robust.