## Group 42 Project Proposal: The Classification of Wine Quality

### Introduction

White wine is one of the oldest and most cherished alcoholic beverages known to humanity. It is not merely a drink; it is a wonderful interaction between flavors and aromas, a product of nature's alchemy and human craftsmanship that is commonly enjoyed before a meal, with dessert, or as a refreshing drink between meals. White wine is known for its light and refreshing taste, which sets it apart from many of its red wine counterparts. Due to its acidity and aroma, white wine is also useful in cooking, helping to soften meat and enhance the flavors of various dishes. The essence of white wine lies in its quality, a multifaceted concept that encompasses various chemical components and sensory attributes. 

This study examines wine quality, using a systematic approach to assess white wines on a scale of 1 to 10. Our research focuses on five chemical properties: pH, density, alcohol content, residual sugar content, and citric acid content. Because each of these properties helps shape the taste, aroma, and overall character of the wine, we would like to build a model that predicts a wine's quality score from these five properties.

This project uses the white wine portion of the Wine Quality dataset, which describes the white variant of the Portuguese "Vinho Verde" wine. The dataset contains 4898 observations of white wine with 12 attributes each; however, only 6 of the attributes (the five predictors and the quality score) will be used for this classification project. The dataset contains no missing values.


#### Research question: Can a wine’s quality be accurately predicted on a scale of 1 to 10 based on its pH, density, alcohol content, residual sugar content, and citric acid?

To approach this question, we first take a look at the raw dataset. We begin by loading a few packages for reading and working with the data.

In [3]:
# Load libraries
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 8)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

### 1. Read Data
Before we begin working with the data, we must load it into R from the web. The URL for this dataset is https://archive.ics.uci.edu/static/public/186/wine+quality.zip. Note that this is a zip file, so we must unzip it to access the .csv files within.

In [4]:
dir.create("data/", showWarnings = FALSE) # Creating the data folder; no warning if it already exists.


In [5]:
url <- "https://archive.ics.uci.edu/static/public/186/wine+quality.zip" # URL for the dataset's zip file, containing white and red wine data.

download.file(url, destfile = "data/wine_quality.zip", mode = "wb") # mode = "wb" downloads in binary mode so the zip stays intact on Windows.
unzip("data/wine_quality.zip", exdir = "data/") # Unzipping the zipped wine quality file.
white_wine_data <- read_delim("data/winequality-white.csv", delim = ";") |>  # Reading the white wine data that will be used for this project.
                   drop_na()                                                 # Dropping any rows with missing values.

Rows: 4898 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


### 2. Wrangling and Cleaning

In [6]:
colnames(white_wine_data) <- c("fixed_acidity", # Adjusting column names for cleanliness.
              "volatile_acidity",
              "citric_acid",
              "residual_sugar",
              "chlorides",
              "free_sulfur_dioxide",
              "total_sulfur_dioxide",
              "density",
              "pH",
              "sulphates",
              "alcohol",
              "quality")

white_wine_data <- white_wine_data |>      # Converting the "quality" column to a factor
    mutate(quality = as_factor(quality))

paste("Table 1. Glimpse of the White Wine Data")
head(white_wine_data, n = 5)

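| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |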


Above are the first five rows of the cleaned data.
A brief description of each column in the dataset is as follows:
- `fixed_acidity` -> The concentration of fixed acids in the wine (g(tartaric acid)/dm$^{3}$).
- `volatile_acidity` -> The concentration of volatile acids in the wine (g(acetic acid)/dm$^{3}$).
- `citric_acid` -> The concentration of citric acid in the wine (g/dm$^{3}$).
- `residual_sugar` -> The concentration of residual sugar in the wine (g/dm$^{3}$).
- `chlorides` -> The concentration of chlorides in the wine (g(sodium chloride)/dm$^{3}$).
- `free_sulfur_dioxide` -> The concentration of free sulfur dioxide in the wine (mg/dm$^{3}$).
- `total_sulfur_dioxide` -> The concentration of total sulfur dioxide in the wine (mg/dm$^{3}$).
- `density` -> The overall density of the wine (g/cm$^{3}$).
- `pH` -> The pH of the wine (on the 0-14 pH scale).
- `sulphates` -> The concentration of sulphates in the wine (g(potassium sulphate)/dm$^{3}$).
- `alcohol` -> The alcohol content of the wine (% by volume).

The last column, `quality`, is the wine's quality rating on a scale from 1 to 10. It is the response variable that we aim to predict from the physicochemical measurements above.
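
Since only five predictors and the response are used in this project, the data frame can be narrowed to those six columns. The following is a minimal sketch, assuming the cleaned column names listed above. `wine_example` is a hypothetical two-row stand-in built from the first two rows of Table 1 so that the snippet runs on its own; in the notebook, the same `select()` call would be applied to `white_wine_data`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in: the first two rows of Table 1 (quality kept as a factor).
wine_example <- tibble(
  fixed_acidity = c(7.0, 6.3),
  volatile_acidity = c(0.27, 0.30),
  citric_acid = c(0.36, 0.34),
  residual_sugar = c(20.7, 1.6),
  chlorides = c(0.045, 0.049),
  free_sulfur_dioxide = c(45, 14),
  total_sulfur_dioxide = c(170, 132),
  density = c(1.001, 0.994),
  pH = c(3.0, 3.3),
  sulphates = c(0.45, 0.49),
  alcohol = c(8.8, 9.5),
  quality = factor(c(6, 6))
)

# Keep the five predictors named in the research question plus the response.
wine_selected <- wine_example |>
  select(pH, density, alcohol, residual_sugar, citric_acid, quality)

colnames(wine_selected)
```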

In [None]:
set.seed(1)
# Creating the training and testing split of the data
wine_split <- initial_split(white_wine_data, prop = .75, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)

### 3. Summarizing the Data
To summarize our training data, we:
1. count the number of observations for each quality of wine,
2. calculate the mean of each predictor, and
3. calculate the mean of each predictor for each quality.

Note: The dataset has no missing values, and `drop_na()` was applied when reading the data, so no observations are dropped from these summaries.

In [None]:
# Count the number of observations
wine_qual_counts <- wine_train |>
            group_by(quality) |>
            summarize(count = n())
paste("Table 2. Wine Quality Count")
wine_qual_counts

From the table, we can see that only white wines of qualities 3 through 9 are present. There is also a class imbalance: qualities 5 to 7 have many more observations than the others. Because a classifier trained on such data would be dominated by the majority classes and would rarely predict the underrepresented ones, which would hurt the accuracy of the model, we collapse the response into two classes: qualities 3 to 6 are treated as **Low** quality, and qualities 7 to 9 as **High** quality.
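
To see how collapsing the classes changes the class balance, one can compare class proportions before and after recoding. The following is a minimal sketch, assuming numeric quality scores. `toy_scores` holds made-up values for illustration only and is not taken from the dataset.

```r
library(dplyr)
library(tibble)

# Made-up quality scores for illustration only (NOT from the dataset).
toy_scores <- tibble(quality = c(3, 5, 5, 5, 6, 6, 6, 6, 7, 8))

# Proportion of observations in each original quality class.
original_props <- toy_scores |>
  count(quality) |>
  mutate(prop = n / sum(n))

# Collapse the scores into the two classes used in this proposal:
# 3 to 6 become "Low", 7 to 9 become "High".
binary_props <- toy_scores |>
  mutate(quality = factor(if_else(quality >= 7, "High", "Low"))) |>
  count(quality) |>
  mutate(prop = n / sum(n))

original_props
binary_props
```

Comparing the two tables shows whether one class still dominates after recoding. If it does, the classifier's accuracy should be read against the share of the majority class.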

In [None]:
# Recode the quality column
white_wine_data <- white_wine_data |> 
    mutate(quality = recode(quality, "3" = "Low", "4" = "Low", "5" = "Low", "6" = "Low", 
                            "7" = "High", "8" = "High", "9" = "High"))   
paste("Table 3. Glimpse of the White Wine Data with Low and High Quality")
head(white_wine_data, n = 5)

##### Next, we use the recoded dataset to create the training and testing sets, and summarize the training data again.

In [None]:
set.seed(1)
# Creating the training and testing split of the data
wine_split <- initial_split(white_wine_data, prop = .75, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)

paste("Table 4. Training Dataset")
wine_train
paste("Table 5. Testing Dataset")
wine_test

In [None]:
# Count the number of observations
wine_qual_counts <- wine_train |>
            group_by(quality) |>
            summarize(count = n())
paste("Table 6. Wine Quality (Low/High) Count")
wine_qual_counts

In [None]:
# Summarize the overall mean of each predictor
wine_avgs <- wine_train |>
            select(fixed_acidity:alcohol) |>
            map_df(mean) 
paste("Table 7. Predictors Mean")
wine_avgs

In [None]:
# Summarize the mean of predictors for each quality of wine
wine_each_avgs <- wine_train |>
            group_by(quality) |>
            summarize(across(fixed_acidity:alcohol, mean))
paste("Table 8. Predictors Mean for Each Quality")
wine_each_avgs

### 4. Visualizing the Data

In [None]:
# Plot each predictor to see useful variables
plot1 <- wine_train |>
    ggplot(aes(x = citric_acid, y = residual_sugar, color = quality)) +
    geom_point() +
    ggtitle("Figure 1. Citric Acid Against Residual Sugar") +
    labs(x = "Citric Acid (g/dm3)", y = "Residual Sugar (g/dm3)")

plot2 <- wine_train |>
    ggplot(aes(x = citric_acid, y = fixed_acidity, color = quality)) +
    geom_point() +
    ggtitle("Figure 2. Citric Acid Against Fixed Acidity") +
    labs(x = "Citric Acid (g/dm3)", y = "Fixed Acidity (g/dm3)")

plot3 <- wine_train |>
    ggplot(aes(x = citric_acid, y = volatile_acidity, color = quality)) +
    geom_point() +
    ggtitle("Figure 3. Citric Acid Against Volatile Acidity") +
    labs(x = "Citric Acid (g/dm3)", y = "Volatile Acidity (g/dm3)")

plot4 <- wine_train |>
    ggplot(aes(x = citric_acid, y = chlorides, color = quality)) +
    geom_point() +
    ggtitle("Figure 4. Citric Acid Against Chlorides") +
    labs(x = "Citric Acid (g/dm3)", y = "Chlorides (g/dm3)")

plot5 <- wine_train |>
    ggplot(aes(x = citric_acid, y = free_sulfur_dioxide, color = quality)) +
    geom_point() +
    ggtitle("Figure 5. Citric Acid Against Free Sulfur Dioxide") +
    labs(x = "Citric Acid (g/dm3)", y = "Free Sulfur Dioxide (mg/dm3)")

plot6 <- wine_train |>
    ggplot(aes(x = citric_acid, y = total_sulfur_dioxide, color = quality)) +
    geom_point() +
    ggtitle("Figure 6. Citric Acid Against Total Sulfur Dioxide") +
    labs(x = "Citric Acid (g/dm3)", y = "Total Sulfur Dioxide (mg/dm3)")

plot7 <- wine_train |>
    ggplot(aes(x = citric_acid, y = density, color = quality)) +
    geom_point() +
    ggtitle("Figure 7. Citric Acid Against Density") +
    labs(x = "Citric Acid (g/dm3)", y = "Density (g/cm3)")

plot8 <- wine_train |>
    ggplot(aes(x = citric_acid, y = pH, color = quality)) +
    geom_point() +
    ggtitle("Figure 8. Citric Acid Against pH") +
    labs(x = "Citric Acid (g/dm3)", y = "pH")

plot9 <- wine_train |>
    ggplot(aes(x = citric_acid, y = sulphates, color = quality)) +
    geom_point() +
    ggtitle("Figure 9. Citric Acid Against Sulphates") +
    labs(x = "Citric Acid (g/dm3)", y = "Sulphates (g/dm3)")


In [None]:
plot1
plot2
plot3
plot4
plot5
plot6
plot7
plot8
plot9
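
Because the nine plots share the same x-axis and color mapping, they could also be drawn as a single faceted figure. The following is a sketch of that alternative, assuming the column names used above. `toy_train` is randomly generated placeholder data, not the wine data, and only three predictors are included to keep the example short.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

set.seed(1)
# Randomly generated placeholder data (NOT the wine data).
toy_train <- tibble(
  citric_acid = runif(50, 0, 1),
  residual_sugar = runif(50, 1, 20),
  pH = runif(50, 2.8, 3.8),
  alcohol = runif(50, 8, 14),
  quality = factor(sample(c("Low", "High"), 50, replace = TRUE))
)

# Reshape so that each (observation, predictor) pair becomes one row.
toy_long <- toy_train |>
  pivot_longer(cols = c(residual_sugar, pH, alcohol),
               names_to = "predictor", values_to = "value")

# One panel per predictor, each plotted against citric acid.
facet_plot <- ggplot(toy_long, aes(x = citric_acid, y = value, color = quality)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ predictor, scales = "free_y") +
  labs(x = "Citric Acid (g/dm3)", y = "Predictor value", color = "Quality")

facet_plot
```

Free y-axis scales are used here because the predictors are measured in different units; the trade-off is that individual axis labels with units are lost.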

### 5. Methods
The response variable we want to predict:
- *Quality*


The predictor variables, as stated before:
1) pH 
2) Density 
3) Alcohol 
4) Residual Sugar 
5) Citric Acid 

Procedure:
1. Read the data into R from the web
2. Clean and format the data into a tidy format
3. Select the predictor columns and the response column
4. Separate the data set into a training and test set
5. Create a recipe that scales and centers all predictors, using only the training set
6. Create a K-nearest neighbors model specification using the **tidymodels** package
7. Perform cross-validation on the training set to determine the K-value to use for the classifier
8. Train the classifier with the chosen K-value and find its accuracy on the test set

Visualization
- Scatterplots are used to first visualize the data set.
- A line graph can be used to visualize the accuracy of the classifier for each K-value.
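
The procedure and the accuracy plot described above can be sketched as follows. This is a minimal illustration rather than the project's final analysis: `toy_wine` is randomly generated stand-in data with two hypothetical predictors, the grid of K-values is an arbitrary choice, and the `kknn` package is assumed to be installed as the engine for `nearest_neighbor()`.

```r
library(tidymodels)

set.seed(1)
# Randomly generated stand-in data (NOT the wine data): two predictors, two classes.
n <- 200
toy_wine <- tibble(
  alcohol = c(rnorm(n / 2, mean = 10, sd = 1), rnorm(n / 2, mean = 12, sd = 1)),
  density = c(rnorm(n / 2, mean = 0.995, sd = 0.002), rnorm(n / 2, mean = 0.992, sd = 0.002)),
  quality = factor(rep(c("Low", "High"), each = n / 2))
)

# Split first, so that preprocessing is learned from the training set only.
toy_split <- initial_split(toy_wine, prop = 0.75, strata = quality)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Recipe: scale and center all predictors.
knn_recipe <- recipe(quality ~ ., data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with the number of neighbors left to be tuned.
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation over an arbitrary grid of K-values.
toy_folds <- vfold_cv(toy_train, v = 5, strata = quality)
k_grid <- tibble(neighbors = seq(1, 21, by = 4))

cv_accuracy <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = toy_folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Line graph of cross-validated accuracy against K.
accuracy_plot <- ggplot(cv_accuracy, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors (K)", y = "Cross-validated accuracy estimate")

# Pick the K with the highest mean accuracy and refit on the full training set.
best_k <- cv_accuracy |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_spec) |>
  fit(data = toy_train)

# Accuracy on the held-out test set.
test_accuracy <- predict(final_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = quality, estimate = .pred_class) |>
  filter(.metric == "accuracy")

test_accuracy
```

Putting the scaling and centering steps inside the recipe means they are re-estimated within each cross-validation fold, so the validation folds do not influence the preprocessing.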


#### Expected outcomes and significance:

**What do we expect to find?**

We expect to be able to classify the quality of a wine using 5 of its properties as predictors: pH, density, alcohol, residual sugar, and citric acid. In other words, when given a new observation of wine, we should be able to predict its quality class (**Low** or **High**, as recoded from the 1 to 10 scale) using our classification method.
Generally, we expect wine quality to increase with lower pH, residual sugar, and density, and with higher alcohol content and citric acid.

**What impact could such findings have?**

These findings can help wine sellers assess the quality of the wine they are selling and set price ranges accordingly. In addition, wine manufacturers will be able to see which properties to adjust in order to increase the quality of their wine. Customers will be better able to identify high-quality wine, and understanding quality can also help sellers understand why certain kinds of wine sell most successfully.

**Some future questions this could lead to are:**
1. How can wine manufacturers find a balance between the quality and cost of the wines they make? 
2. Which of the factors we discussed contributes the most to wine quality?
3. How would using different predictor variables affect the predictions of wine quality?