# Dowdle's Wine Quality Prediction
**Author:** Brittany Dowdle
**Date:** April 8, 2025
**Objective:** Implement an ensemble model that combines multiple models to improve performance, evaluate it using performance metrics, then compare the results and provide insights.

## Introduction
This project uses the Wine Quality dataset to predict the quality of red wines from their physicochemical properties. Features include acidity measures, sulfur dioxide levels, and residual sugar. I will train two models. I am using ensemble models because they often outperform individual models by reducing overfitting and improving generalization.

****

## Imports
In the code cell below, import the necessary Python libraries for this notebook. All imports should be at the top of the notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

****

## Section 1. Load and Inspect the Data
Load the Wine Quality dataset and confirm it’s structured correctly.

In [6]:
# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
print('Info:')
print(df.info(), "\n")
print('First Few Rows:')
print(df.head())

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
None 

First Few Rows:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4   
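
Beyond `info()` and `head()`, a few quick checks can help confirm the structure. The sketch below runs on a tiny made-up frame: the values, the `sample` name, and the `expected_columns` list are illustrative assumptions, not rows from the dataset. With the real data, `df` from the cell above would take the place of `sample`.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset (values are illustrative only)
sample = pd.DataFrame({
    "alcohol": [9.0, 10.0, 10.0, 11.5],
    "pH": [3.4, 3.2, 3.2, 3.3],
    "quality": [5, 6, 6, 7],
})

# Hypothetical list of columns we expect to find
expected_columns = ["alcohol", "pH", "quality"]

# Columns that are expected but absent from the frame
missing_columns = [c for c in expected_columns if c not in sample.columns]

# Missing values per column
null_counts = sample.isnull().sum()

# Rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

print("Missing columns:", missing_columns)
print("Null counts:")
print(null_counts)
print("Duplicate rows:", duplicate_rows)
```

Duplicate rows are not necessarily errors, but knowing how many there are can inform later cleaning decisions.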

****

## Section 2. Prepare the Data
Includes cleaning, feature engineering, encoding, splitting, helper functions.

In [12]:
# Define helper function (Takes the quality and returns 3 categorical labels (low, medium, high))
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"
    
# Create the new column the helper function made
df["quality_label"] = df["quality"].apply(quality_to_label)

# Confirm the column
print("quality_label values:")
print(df["quality_label"].unique())

quality_label values:
['medium' 'high' 'low']


In [13]:
# Define helper function (takes the quality score and simplifies the target into 3 categories: 0 = low, 1 = medium, 2 = high)
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

# Create the new column the helper function made
df["quality_numeric"] = df["quality"].apply(quality_to_number)

# Confirm the column
print("quality_numeric values:")
print(df["quality_numeric"].unique())

quality_numeric values:
[1 2 0]
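
The two helper functions share the same thresholds, so one possible alternative is to derive both columns from a single binning step. The sketch below uses `pd.cut` on a small made-up series; the `quality`, `bins`, and `labels` names are assumptions for illustration, and this is a swapped-in technique rather than the approach used in the cells above.

```python
import pandas as pd

# Made-up quality scores for illustration (not taken from the dataset)
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Bin edges mirror the helper functions: <= 4 is low, 5-6 is medium, >= 7 is high.
# pd.cut uses right-closed intervals by default: (-inf, 4], (4, 6], (6, inf)
bins = [-float("inf"), 4, 6, float("inf")]
labels = ["low", "medium", "high"]

# Ordered categorical labels, one per score
quality_label = pd.cut(quality, bins=bins, labels=labels)

# Integer codes follow the label order: 0 = low, 1 = medium, 2 = high
quality_numeric = quality_label.cat.codes

print(pd.DataFrame({
    "quality": quality,
    "quality_label": quality_label,
    "quality_numeric": quality_numeric,
}))
```

Keeping the thresholds in one place means the text labels and numeric codes cannot drift apart if the cut points change later.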


### Reflection
Simplifying the target from an integer quality score into three classes (low, medium, high) makes the data easier to explore and visualize. It also turns the task into a classification problem with a small set of discrete categories, which suits the classifiers imported above.
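
Once the target is categorical, a natural follow-up is to check how many samples fall into each class and to keep those proportions when splitting. The sketch below is a minimal example on a made-up frame: the `toy` name, the class counts, and the 50/50 split are assumptions chosen for illustration, not results from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 2 low, 14 medium, 4 high (counts are illustrative only)
toy = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(20)],
    "quality_numeric": [0] * 2 + [1] * 14 + [2] * 4,
})

# Samples per class, ordered by class code
class_counts = toy["quality_numeric"].value_counts().sort_index()
print(class_counts)

X = toy[["alcohol"]]
y = toy["quality_numeric"]

# stratify=y keeps the class proportions similar in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)

print(y_test.value_counts().sort_index())
```

Without stratification, a small class such as "low" could end up missing from one of the splits entirely.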

****

## 