# Group 42 Project Report: The Classification of Wine Quality

## Introduction

White wine is one of the oldest and most cherished alcoholic beverages known to humanity. It is not merely a drink; it is a wonderful interaction between flavors and aromas, a product of nature's alchemy and human craftsmanship that is commonly enjoyed before a meal, with dessert, or as a refreshing drink between meals. White wine is known for its light and refreshing taste, which sets it apart from many of its red wine counterparts. Due to its acidity and aroma, white wine is also useful in cooking, helping to soften meat and enhance the flavors of various dishes. The essence of white wine lies in its quality, a multifaceted concept that encompasses various chemical components and sensory attributes. 

This study examines wine quality, which is rated on a scale of 1 to 10. Our research focuses on five chemical properties: pH, density, alcohol content, residual sugar content, and citric acid content. Because each of these plays a role in shaping the taste, aroma, and overall character of a wine, we aim to build a model that predicts a wine's quality score from these five properties.

This project uses the white variant of the Wine Quality dataset, which describes Portuguese "Vinho Verde" wines. The dataset contains 4898 observations of white wine with 12 attributes each; only 6 of these attributes (the five predictors and the quality score) are used in this classification project. The dataset contains no missing values.


#### Research question: Can a wine’s quality be accurately predicted on a scale of 1 to 10 based on its pH, density, alcohol content, residual sugar content, and citric acid?

To approach this question, we first look at the raw data set. We begin by loading the packages needed to read and work with the data.

In [1]:
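# Install kknn and GGally only when they are not already available.
for (pkg in c("kknn", "GGally")) {
    if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(tidyverse)
library(repr)
library(tidymodels)
library(rvest)
library(GGally)
library(kknn)
# Limit the number of rows shown when a data frame is printed.
options(repr.matrix.max.rows = 8)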


## Methods and Results

##### The response variable we are looking for: 
- *Quality*

##### Procedure:
1. Read the data into R from the web
2. Clean and format the data into a tidy format
3. Summarize the data to find appropriate variables
4. Separate the data set into a training and test set
5. Perform cross-validation in order to determine the K-value to use for the classifier
6. Create the K-nearest neighbors classifier with the training set using the **tidymodels** package
7. Create a model and recipe, and train the classifier
8. Find the classifier’s accuracy
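
Steps 4 to 8 of the procedure can be sketched end to end as follows. This is a minimal illustration rather than the analysis itself: it uses the built-in `iris` data as a stand-in for the wine data, and the seed, the 5 folds, and the grid of candidate K values are assumptions chosen only to keep the example small. It also assumes the **kknn** package is installed.

```r
library(tidymodels)

set.seed(1)  # assumed seed, for reproducibility of this sketch only

# Stand-in data: iris (NOT the wine data); Species plays the role of quality.
data_split <- initial_split(iris, prop = 0.75, strata = Species)
train <- training(data_split)
test <- testing(data_split)

# Recipe: standardize all predictors, since K-nearest neighbors is distance based.
knn_recipe <- recipe(Species ~ ., data = train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Model specification with the number of neighbors left to be tuned.
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Cross-validation over an assumed grid of candidate K values.
folds <- vfold_cv(train, v = 5, strata = Species)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = folds, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# Pick the K with the highest mean cross-validation accuracy.
best_k <- cv_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Refit on the full training set with the chosen K, then score the test set.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    fit(data = train)

test_accuracy <- predict(knn_fit, test) |>
    bind_cols(test) |>
    metrics(truth = Species, estimate = .pred_class) |>
    filter(.metric == "accuracy")
```

Plotting `mean` against `neighbors` from `cv_results` gives the kind of accuracy line graph mentioned under Visualization.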

##### Visualization
- A scatterplot will be used to first visualize the data set.
- A line graph will be used to visualize the accuracy of the classifier.

### 1. Read Data
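Before we begin working with the data, we must load it into R from the web. The dataset is distributed by the UCI Machine Learning Repository at https://archive.ics.uci.edu/static/public/186/wine+quality.zip. Because that download is a zip archive, we read the white wine file (`winequality-white.csv`) from a copy hosted on GitHub. The file is semicolon-delimited, so we use `read_delim` with `delim = ";"`.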

In [None]:
url <- "https://raw.githubusercontent.com/RachelX6/DSCI100-Group-Project/main/winequality-white.csv"
white_wine_data <- read_delim(url, delim = ";") |>
        drop_na()
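
To illustrate what the read step does, the sketch below parses a small semicolon-delimited excerpt and checks it for missing values. The three rows and the name `csv_text` are made up for illustration (one pH value is deliberately left empty), and the example assumes readr 2.0 or later, where `I()` marks a string as literal data.

```r
library(readr)

# Hypothetical excerpt in the same semicolon-delimited layout; values are invented.
csv_text <- "fixed acidity;pH;quality\n7.0;3.00;6\n6.3;3.30;6\n8.1;;5\n"

# I() tells readr to treat the string as literal data rather than a file path.
toy_wine <- read_delim(I(csv_text), delim = ";", show_col_types = FALSE)

# Count missing values per column before deciding whether drop_na() changes anything.
missing_per_column <- colSums(is.na(toy_wine))

# drop_na() removes every row that contains at least one missing value.
complete_rows <- tidyr::drop_na(toy_wine)
```

Running the same `colSums(is.na(...))` check on the real data is a quick way to confirm that `drop_na()` leaves it unchanged.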

### 2. Wrangling and Cleaning

In [None]:
# Rename the columns to consistent snake_case names.
colnames(white_wine_data) <- c("fixed_acidity",
              "volatile_acidity",
              "citric_acid",
              "residual_sugar",
              "chlorides",
              "free_sulfur_dioxide",
              "total_sulfur_dioxide",
              "density",
              "pH",
              "sulphates",
              "alcohol",
              "quality")

# Convert `quality` to a factor so that it is treated as a class label.
white_wine_data <- white_wine_data |>
    mutate(quality = as_factor(quality))

paste("Table 1. Glimpse of the White Wine Data")
head(white_wine_data, n = 5)

The first five rows of the cleaned data are shown above.
A brief description of each column in the dataset is as follows:
- `fixed_acidity` -> The concentration of fixed acids in the wine (g(tartaric acid)/dm$^{3}$).
- `volatile_acidity` -> The concentration of volatile acids in the wine (g(acetic acid)/dm$^{3}$).
- `citric_acid` -> The concentration of citric acid in the wine (g/dm$^{3}$).
- `residual_sugar` -> The concentration of residual sugar in the wine (g/dm$^{3}$).
- `chlorides` -> The concentration of chlorides in the wine (g(sodium chloride)/dm$^{3}$).
- `free_sulfur_dioxide` -> The concentration of free sulfur dioxide in the wine (mg/dm$^{3}$).
- `total_sulfur_dioxide` -> The concentration of total sulfur dioxide in the wine (mg/dm$^{3}$).
- `density` -> The overall density of the wine (g/cm$^{3}$).
- `pH` -> The pH of the wine (on the 0-14 scale).
- `sulphates` -> The concentration of sulphates in the wine (g(potassium sulphate)/dm$^{3}$).
- `alcohol` -> The alcohol content of the wine (% by volume).

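The last column, `quality`, is the wine's quality rating on a scale from 1 to 10. In the source dataset this rating comes from sensory evaluation; it is the response variable that we try to predict from the physicochemical measurements.

Before summarizing the data, we split it into a training set (75%) and a testing set (25%), stratified by `quality`, so that the testing set plays no part in exploration or tuning.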

In [None]:
set.seed(420)
# Creating the training and testing split of the data
wine_split <- initial_split(white_wine_data, prop = .75, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)
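
The effect of `strata` can be seen in a small sketch. The data frame `toy`, its class sizes, and the seed are hypothetical and chosen only to make the classes unequal; the point is that a stratified split keeps the class proportions similar in the training and testing sets.

```r
library(rsample)

set.seed(2023)  # assumed seed, for this illustration only

# Hypothetical imbalanced data: 60, 280 and 60 observations in three classes.
toy <- data.frame(
    quality = factor(rep(c("5", "6", "7"), times = c(60, 280, 60))),
    alcohol = runif(400, min = 8, max = 14)
)

# Stratified 75/25 split: sampling is done within each quality class.
toy_split <- initial_split(toy, prop = 0.75, strata = quality)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Class proportions in each set; these should be close to one another.
train_props <- prop.table(table(toy_train$quality))
test_props <- prop.table(table(toy_test$quality))
```

Without `strata`, a rare class could by chance end up almost entirely in one of the two sets.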

### 3. Summarizing the Data
To summarize our training data, we
1. count the number of observations for each quality of wine, and
2. calculate the mean of each predictor for each quality.

Note: The data contain no missing values.

In [None]:
# Count the number of observations
wine_qual_counts <- wine_train |>
            group_by(quality) |>
            summarize(count = n())
paste("Table 2. Wine Quality Count")
wine_qual_counts

In Table 2, the `count` column gives the number of observations of each `quality`. The training set contains 3672 observations in total, and only qualities 3 through 9 are present. There is also a **class imbalance**: classes 5 to 7 contain many more observations than the others, which we need to address before building the actual model.
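
One way to quantify the imbalance is to turn the counts into proportions. The sketch below does this on made-up counts (the tibble `toy_train` and its class sizes are hypothetical, not the wine data).

```r
library(dplyr)

# Hypothetical training labels: 100 observations spread unevenly over five classes.
toy_train <- tibble(
    quality = factor(rep(c(4, 5, 6, 7, 8), times = c(5, 30, 45, 15, 5)))
)

# Count each class and express the count as a share of all observations.
toy_counts <- toy_train |>
    count(quality, name = "count") |>
    mutate(proportion = count / sum(count))

# Share of the largest class: the accuracy of always predicting that class.
majority_share <- max(toy_counts$proportion)
```

The majority share is a useful baseline: a classifier is only informative if its accuracy clearly exceeds it.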

In [None]:
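# Mean of every predictor within each quality class.
# Because the list passed to across() is unnamed, each output column
# gets the suffix "_1" (for example, density_1).
wine_samp_m <- wine_train |>
    group_by(quality) |>
    summarize(across(everything(), list(mean)))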
paste("Table 3. Predictors Mean")
wine_samp_m

In Table 3, columns 2-12 give the mean value of each variable within each quality class. We use these means for the summary visualization in the next section, to see whether any variable has a strong relationship with wine quality.
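
The column names in the summary table follow from how `across()` names its output. The sketch below uses a made-up four-row tibble (`toy` is hypothetical) to compare an unnamed list of functions with a named one.

```r
library(dplyr)

# Hypothetical data: two quality classes and two predictors.
toy <- tibble(
    quality = factor(c(5, 5, 6, 6)),
    alcohol = c(9, 11, 12, 14),
    pH = c(3.0, 3.2, 3.1, 3.3)
)

# Unnamed list: columns are suffixed with the function's position in the list.
unnamed <- toy |>
    group_by(quality) |>
    summarize(across(everything(), list(mean)))

# Named list: columns are suffixed with the name given to the function.
named <- toy |>
    group_by(quality) |>
    summarize(across(everything(), list(mean = mean)))

names(unnamed)  # "quality" "alcohol_1" "pH_1"
names(named)    # "quality" "alcohol_mean" "pH_mean"
```

A named list gives more descriptive column names, at the cost of having to use those names in any later code that refers to the columns.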

### 4. Visualizing the Data

In [None]:
# Reshape the per-quality means into long format so that every predictor is
# drawn in its own panel instead of repeating the same ggplot call 11 times.
wine_samp_long <- wine_samp_m |>
    pivot_longer(cols = -quality, names_to = "predictor", values_to = "mean_value") |>
    mutate(predictor = str_remove(predictor, "_1$"))

# `quality` is a factor, so group = 1 is needed for geom_line() to connect the points.
ggplot(wine_samp_long, aes(x = quality, y = mean_value, group = 1)) +
    geom_point() +
    geom_line() +
    facet_wrap(vars(predictor), scales = "free_y", ncol = 3) +
    labs(y = "Mean value", x = "Quality")

From the visualization, we cannot see a clear relationship between each predictor and wine quality, mainly because of the **class imbalance** observed earlier. In the next section we address this imbalance before carrying out the actual data analysis.

### 5. Class Imbalance