# Lab 5: Project (Ensemble ML, Spiral)



Author: Sandra Ruiz

Date: April 10, 2025

### Introduction

Objective:
In this project, we work with the red wine quality CSV file to learn how to implement and evaluate more complex models when simpler techniques aren't enough. We build on the earlier train/test workflow and explore ensemble models, which combine multiple models to improve performance. Ensemble methods often outperform individual models by reducing overfitting and improving generalization.

 

In [None]:
### Imports Needed at the Top


!pip install pandas numpy matplotlib scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)


In [None]:
## Section 1. Load and Inspect the Data
# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
df.info()
df.head()

# The dataset includes 11 physicochemical input variables (features):
# ---------------------------------------------------------------
# - fixed acidity          mostly tartaric acid
# - volatile acidity       mostly acetic acid (vinegar)
# - citric acid            can add freshness and flavor
# - residual sugar         remaining sugar after fermentation
# - chlorides              salt content
# - free sulfur dioxide    protects wine from microbes
# - total sulfur dioxide   sum of free and bound forms
# - density                related to sugar content
# - pH                     acidity level (lower = more acidic)
# - sulphates              antioxidant and microbial stabilizer
# - alcohol                % alcohol by volume

# The target variable is:
# - quality (integer score from 0 to 10, rated by wine tasters)

# We will simplify this target into three categories:
#   - low (3–4), medium (5–6), high (7–8) to make classification feasible.
#   - we will also make this numeric (we want both for clarity)
# The dataset contains 1599 samples and 12 columns (11 features + target).



In [8]:
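# Load the wine quality dataset into `spiral`
# Note: this call omits sep=";", so pandas reads every row as one
# semicolon-joined column (see the output below). The load in Section 5 passes sep=";".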
spiral = pd.read_csv(r"C:\Users\19564\Desktop\Ensemble.SRuiz\ml-05-sruiz\winequality-red.csv")


# Display basic information
spiral.info()

# Display first few rows
print(spiral.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 1 columns):
 #   Column                                                                                                                                                                   Non-Null Count  Dtype 
---  ------                                                                                                                                                                   --------------  ----- 
 0   fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"  1599 non-null   object
dtypes: object(1)
memory usage: 12.6+ KB
  fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
0   7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5                                        
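The single-column output above is what pandas produces when a semicolon-delimited file is read with the default comma separator. A minimal sketch of the difference, using an in-memory sample instead of the file on disk (the header and data row are copied from the output above; the variable names `sample`, `wrong`, and `right` are only illustrative):

```python
import io

import pandas as pd

# Header and first data row copied from the output above.
sample = (
    'fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";'
    '"alcohol";"quality"\n'
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
)

# Default separator is a comma, so the whole row lands in one column.
wrong = pd.read_csv(io.StringIO(sample))

# Passing sep=";" splits the row into the 12 expected columns.
right = pd.read_csv(io.StringIO(sample), sep=";")

print("default separator:", wrong.shape)
print('sep=";":', right.shape)
print(list(right.columns))
```

Checking `df.shape` or `df.columns` right after loading is a quick way to catch this kind of parsing problem before any modeling starts.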

### Section 2. Prepare the Data
This section covers cleaning, feature engineering, encoding, splitting, and helper functions.
#### Define a helper function that:

#### Takes one input, the quality score (temporarily named `q` inside the function)
#### And returns the quality label as a string (low, medium, high)
#### This function is used to create the quality_label column
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

In [None]:
# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)


# Then, create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2


df["quality_numeric"] = df["quality"].apply(quality_to_number)

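Explain what we do and why as you proceed.
#### The first helper, quality_to_label, maps each quality score to a text label. The second helper, quality_to_number, maps the same score ranges to numbers (0 = low, 1 = medium, 2 = high) for modeling.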
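The two helpers above apply the same score ranges twice. As an alternative sketch (not the approach used in this lab), `pd.cut` can produce both the label and the numeric code from one set of bin edges. The `scores` series below is hypothetical sample data, and the bin edges are an assumption chosen to match the helpers' thresholds:

```python
import pandas as pd

# Hypothetical quality scores covering the range binned in this lab (3 to 8).
scores = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Bins are right-inclusive: (-1, 4] -> low, (4, 6] -> medium, (6, 10] -> high.
# This matches the helpers: q <= 4 is low, q <= 6 is medium, otherwise high.
labels = pd.cut(scores, bins=[-1, 4, 6, 10], labels=["low", "medium", "high"])

# The category codes follow the label order: 0 = low, 1 = medium, 2 = high.
numeric = labels.cat.codes

binned = pd.DataFrame(
    {"quality": scores, "quality_label": labels, "quality_numeric": numeric}
)
print(binned)
```

Keeping the thresholds in one place means the label and numeric columns cannot drift apart if the cutoffs change later.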

### Section 3. Feature Selection and Justification
#### Define input features (X) and target (y)
#### Features: all columns except 'quality', 'quality_label', and 'quality_numeric' (drop these from the input array)
#### Target: quality_numeric (the numeric column we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

Explain / introduce your choices:

We want to train only on the physicochemical properties of the wine (acidity, pH, alcohol content, etc.). We’re treating this as a multi-class classification problem in which the model predicts one of three categories.

In [None]:
## Section 4. Split the Data into Train and Test
# Train/test split (stratify to preserve class balance)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
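To see what `stratify=y` does, the sketch below splits a small synthetic dataset and compares class counts in each part. The labels and the 10/80/10 class mix are hypothetical stand-ins, not the wine data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10% low (0), 80% medium (1), 10% high (2).
y_demo = pd.Series([0] * 100 + [1] * 800 + [2] * 100)
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, each split keeps the 10/80/10 class mix.
train_counts = yd_train.value_counts().sort_index()
test_counts = yd_test.value_counts().sort_index()
print("train counts:", train_counts.tolist())
print("test counts:", test_counts.tolist())
```

Without `stratify`, a rare class could end up over- or under-represented in the test set purely by chance, which makes accuracy comparisons less reliable.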

### Section 5. Evaluate Model Performance (Choose 2)

Below is a list of 9 model variations. Choose two to focus on for your comparison.

| Option | Model Name | Notes |
|---|---|---|
| 1 | Random Forest (100) | A strong baseline model using 100 decision trees. |
| 2 | Random Forest (200, max_depth=10) | Adds more trees, but limits tree depth to reduce overfitting. |
| 3 | AdaBoost (100) | Boosting method that focuses on correcting previous errors. |
| 4 | AdaBoost (200, lr=0.5) | More iterations and slower learning for better generalization. |
| 5 | Gradient Boosting (100) | Boosting approach using gradient descent. |
| 6 | Voting (DT + SVM + NN) | Combines diverse models by averaging their predictions. |
| 7 | Voting (RF + LR + KNN) | Another mix of different model types. |
| 8 | Bagging (DT, 100) | Builds many trees in parallel on different samples. |
| 9 | MLP Classifier | A basic neural network with one hidden layer. |
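This lab evaluates options 1 and 3 only. For reference, here is a minimal sketch of how one of the voting options (option 7) might be built with scikit-learn. It runs on synthetic data from `make_classification` rather than the wine file, and the hyperparameters shown (`max_iter=1000`, `n_neighbors=5`, `voting="soft"`) are assumptions, not settings from the lab:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the 11 wine features (an assumption).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=42
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Option 7 from the table: Random Forest + Logistic Regression + KNN.
# voting="soft" averages predicted class probabilities; the default,
# voting="hard", takes a majority vote over predicted labels instead.
voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",
)
voting.fit(Xs_train, ys_train)

test_accuracy = voting.score(Xs_test, ys_test)
print(f"Voting (RF + LR + KNN) test accuracy: {test_accuracy:.3f}")
```

Distance-based and linear members such as KNN and logistic regression are sensitive to feature scale, so on the real wine data it would likely help to wrap them in a pipeline with a scaler first.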

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# evaluate_model must already be defined before this cell runs (its definition
# appears in the next cell); it appends each model's scores to the `results` list
results = []


# 1. Random Forest
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 3. AdaBoost 
evaluate_model(
    "AdaBoost (100)",
    AdaBoostClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load dataset with semicolon separator and quoted headers
df = pd.read_csv(
    r"C:\Users\19564\Desktop\Ensemble.SRuiz\ml-05-sruiz\winequality-red.csv",
    sep=';',
    quotechar='"'
)

#  Quality labeling (categorical)
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

df["quality_label"] = df["quality"].apply(quality_to_label)

#  Quality encoding (numeric)
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

df["quality_numeric"] = df["quality"].apply(quality_to_number)

#  Feature/Target split
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

#  Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

#  Evaluation function
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    results.append({
        "model": name,
        "train_accuracy": accuracy_score(y_train, y_train_pred),
        "test_accuracy": accuracy_score(y_test, y_test_pred),
    })

#  Model evaluation
results = []

# Model 1: Random Forest
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train, y_train, X_test, y_test, results
)

# Model 2: AdaBoost
evaluate_model(
    "AdaBoost (100)",
    AdaBoostClassifier(n_estimators=100, random_state=42),
    X_train, y_train, X_test, y_test, results
)

#  Show results with gap calculation
results_df = pd.DataFrame(results)
results_df["gap"] = results_df["train_accuracy"] - results_df["test_accuracy"]
results_df = results_df.sort_values(by="test_accuracy", ascending=False)

### Section 6. Compare Results 
# Create a table of results 
results_df = pd.DataFrame(results)

print("\nSummary of All Models:")
display(results_df)

#### Recommendation: See if you can add gap calculations to your results and sort the table by test accuracy to find the best models more efficiently.


# Display results
print("\nSummary of All Models (sorted by Test Accuracy):")
print(results_df)



Summary of All Models:


|   | model | train_accuracy | test_accuracy |
|---|---|---|---|
| 0 | Random Forest (100) | 1.0 | 0.8875 |
| 1 | AdaBoost (100) | 0.834246 | 0.825 |



Summary of All Models (sorted by Test Accuracy):
                 model  train_accuracy  test_accuracy
0  Random Forest (100)        1.000000         0.8875
1       AdaBoost (100)        0.834246         0.8250
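The printed tables above have no gap column because `results_df` is rebuilt from `results` in Section 6, which discards the gap and sort applied at the end of Section 5. A minimal sketch of the recommended gap calculation, with the accuracy values copied from the summary above rather than recomputed:

```python
import pandas as pd

# Accuracy values copied from the summary table above.
results = [
    {"model": "Random Forest (100)", "train_accuracy": 1.0, "test_accuracy": 0.8875},
    {"model": "AdaBoost (100)", "train_accuracy": 0.834246, "test_accuracy": 0.825},
]

results_df = pd.DataFrame(results)

# gap = train accuracy minus test accuracy; a large gap suggests overfitting.
results_df["gap"] = results_df["train_accuracy"] - results_df["test_accuracy"]

# Sort so the best test accuracy comes first, and do not rebuild the frame afterwards.
results_df = results_df.sort_values(by="test_accuracy", ascending=False).reset_index(
    drop=True
)
print(results_df)
```

With these numbers the gap is about 0.11 for Random Forest and about 0.01 for AdaBoost, which is the comparison discussed in the conclusions.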


### Section 7. Conclusions and Insights
Using both your results and the results from others, which options are performing well, and why do you think so? Discuss the types of models and why you think some seem to be more helpful. List the next steps you'd like to try if you were in a competition to build the best predictor.

Results:

Two models were tested to see which could best predict red wine quality. Random Forest had the best test accuracy at 88.75%, but it scored 100% on the training set. That gap of about 11 points suggests it may be partly memorizing the training data rather than learning patterns that generalize. AdaBoost did somewhat worse, with 82.5% on the test set and 83.4% on the training set, but its train and test scores are within one point of each other, so it is more balanced and its performance on new data is likely to be more predictable. Random Forest is strong because it combines many decision trees, while AdaBoost builds its ensemble in sequence, with each round focusing on the mistakes of the previous ones. In the end, AdaBoost feels like the safer choice, while Random Forest is more powerful but riskier. If I had more time, I’d tune model settings, test new models, and add new features to improve predictions. For a competition, I would also try more models and use K-fold cross-validation.
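The K-fold cross-validation mentioned as a next step could look roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the wine features, and the choice of 5 stratified folds is an assumption rather than something this lab ran:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data standing in for the wine features (an assumption).
X_cv, y_cv = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=42
)

# Stratified folds keep the class mix similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Random Forest (100)": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost (100)": AdaBoostClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # One accuracy score per fold; the mean and spread summarize stability.
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Averaging over several folds gives a steadier estimate than a single 80/20 split, which matters here because the test set has only 320 wines.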









