# Classification Model Predicting White Wine Quality

## Introduction

<b>Vinho Verde</b> is renowned for its savoury taste, fresh colour and stress-relieving benefits. Among the variants of Vinho Verde, white Vinho Verde stood out as the most promising individual in the global market. A study suggests that the global dry white wine industry surged in 2022 and is expected to maintain an upward trend until 2030 (Market Reports World, 2023). This urging demand in the dry white wine market made quality classification daily more significant; therefore, we have designed a k-nearest-neighbor classification model that determines the quality of the Vinho Verde regarding the wine’s chemical ingredients with reasonable accuracy.

<b>Predictive Question: </b>In our project, we will try to answer the question: <u>“How can we predict the level of quality of the White Vinho Verde given the physicochemical attributes in our dataset?”</u>

We utilized the <b>Wine Quality</b> dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/), which features 11 physicochemical attributes of wines, such as fixed acidity, citric acid, residual sugar, density, a quality variable, etc.

Most of the variables, besides the “wine quality” variable, are quantitative. Our dataset focuses on the white variant of Vinho Verde, in which most of the variables are measured in grams/dm^3, with the exceptions of free_sulfur_dioxide (milligrams/dm^3), total_sulfur_dioxide (milligrams/dm^3), and pH (represented on a scale from 0 to 14) (Cortez, Cerdeira, Almeida, Matos, & Reis, 2009). Additionally, the dataset contains 4898 observations without any non-applicable values. Our project involves cleaning and preprocessing the Vinho Verde dataset, implementing appropriate algorithms, k-tuning, and k-nearest-neighbor classification models for wine quality predictions on a scale from 1 to 10 with increasing quality evaluation.

In summary, this document provides a thorough list of procedures for our development of an accurate white Vinho Verde wine quality classification model.  


## Methods & Results

In [1]:
# Run This Cell Before Continuing
set.seed(999)
library(repr)
library(tidyverse)
library(tidymodels)
library(themis)
library(janitor)
library(cowplot)

── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.2 ──
[32m✔[39m [34mggplot2[39m 3.4.2     [32m✔[39m [34mpurrr  [39m 1.0.1
[32m✔[39m [34mtibble [39m 3.2.1     [32m✔[39m [34mdplyr  [39m 1.1.1
[32m✔[39m [34mtidyr  [39m 1.3.0     [32m✔[39m [34mstringr[39m 1.5.0
[32m✔[39m [34mreadr  [39m 2.1.3     [32m✔[39m [34mforcats[39m 0.5.2
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()
── [1mAttaching packages[22m ────────────────────────────────────── tidymodels 1.0.0 ──

[32m✔[39m [34mbroom       [39m 1.0.2     [32m✔[39m [34mrsample     [39m 1.1.1
[32m✔[39m [34mdials       [39m 1.1.0     [32m✔[39m [34mtune        [39m 1.0.1
[32m✔[39m [34minfer       [39m 1.0.4     [32m✔[39m [34mworkflows   [39m 1.1.2
[32m✔[39

ERROR: Error in library(themis): there is no package called ‘themis’


In [2]:
install.packages("themis")

also installing the dependencies ‘RANN’, ‘ROSE’


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



Downloading the data to use during our Analysis.

In [16]:
url <- "https://raw.githubusercontent.com/TrBili/dsci-100-project/main/data_2/winequality-white.csv"
download.file(url, "data/winequality-white.csv")

“URL https://raw.githubusercontent.com/TrBili/dsci-100-project/main/data_2/winequality-white.csv: cannot open destfile 'data/winequality-white.csv', reason 'No such file or directory'”
“download had nonzero exit status”


Extracting the data from the downloaded file

In [17]:
wine_data_raw <- read_csv2("data/winequality-white.csv")

head(wine_data_raw)


[36mℹ[39m Using [34m[34m"','"[34m[39m as decimal and [34m[34m"'.'"[34m[39m as grouping mark. Use `read_delim()` for more control.



ERROR: Error: 'data/winequality-white.csv' does not exist in current working directory ('/home/jovyan/dsci-100-project').


We can see that some numerical variables have a `chr` data type; hence, we need to make it numeric to use later in our model. We can also see that the names of variables have spaces; hence, we need to make them suitable for use. Finally, we have to make the `quality` column, a factor as we will use it as our Class (categorical variable) during this analysis. 

We will now clean our data to make it suitable for Exploratory Data Analysis.

In [15]:
wine_data <- wine_data_raw |> 
                clean_names() |>                        
                drop_na() |> # removes rows with NA 
                map_df(as.numeric) |> # as all our columns are numeric
                mutate(quality = as_factor(quality)) # we will use quality as our class
                

head(wine_data)

ERROR: Error in clean_names(wine_data_raw): could not find function "clean_names"


All the unique values in the quality column

In [6]:
wine_data |> distinct(quality)

ERROR: Error in distinct(wine_data, quality): object 'wine_data' not found


Using the clean data, we will spit our data into training & testing set, then perform exploratory data analysis. 

In [7]:
wine_split <- initial_split(wine_data, prop=0.75,strata=quality)

## Training Data
wine_train <- training(wine_split)

## Testing Data
wine_test <- testing(wine_split)

head(wine_train)
head(wine_test)

ERROR: Error in eval_select_impl(NULL, .vars, expr(c(!!!dots)), include = .include, : object 'wine_data' not found


We will now be doing Exploratory Data Analysis on our training set.

In [8]:
## Setting the Width & Height of the Plot
options(repr.plot.width=8,repr.plot.height=25)

## Extracting all the column names from our clean Dataset
all_cols <- wine_train |> select(-quality) |> colnames()

## Extracting all the column names from our raw Dataset
col_names <- wine_data_raw |> colnames()

## Creating a list to store all our plots
plots <- list()

## Loop Variable
i <- 0

## Looping through each column
for(c in all_cols) {
    i <- i + 1
    c_sym <- sym(c)
    box_plot <- ggplot(wine_train, aes(x=quality,y=!!c_sym)) +
            geom_boxplot() +
            labs(x="Quality", y=col_names[i], subtitle=(100 + i)/100)
    plots[[c]] <- box_plot
}

## Merging all the plots
plot_grid(plotlist = plots, ncol = 2)

ERROR: Error in select(wine_train, -quality): object 'wine_train' not found


Observing the box plots above, the median/length appears to be different between the boxes, which implies that the variable has a high effect on the response variable.

Observing the boxplot above, we can choose the following attributes.
1. Volatile Acidity
2. Citric Acid
3. Residual Sugar
4. Sulphates

We will now perform a summary analysis on our selected predictors from our training data, to further distinguish between relevant predictors.

In [9]:
## selecting the required variables
selected_wine_train_data <- wine_train |> 
                    select(quality, volatile_acidity, citric_acid, residual_sugar, sulphates)


## Summary of Training Data - Mean of Each Column & Count of Each Quality
summary_wine_train_data <- wine_train |>
                    group_by(quality) |>
                    summarize(mean_volatile_acidity = mean(volatile_acidity),
                             mean_citric_acid = mean(citric_acid),
                             mean_residual_sugar = mean(residual_sugar),
                             mean_sulphates = mean(sulphates),
                             total_count=n(),
                             percentage=(100*n()/nrow(wine_train)))

summary_wine_train_data
print("Summary Table 1")

ERROR: Error in select(wine_train, quality, volatile_acidity, citric_acid, residual_sugar, : object 'wine_train' not found


The summary table above shows that our selected predictors have variations with `quality`. 

Total Count: The `total_count` column indicates the number of observations for each quality level. A significant imbalance is evident, with much more data for quality levels 5 and 6 compared to others. This could potentially bias a KNN model, and we might need to consider methods to address this class imbalance, such as upsampling.

Percentage: This column shows the percentage of observations in each quality level relative to the entire dataset. Quality 5 and 6 make up a large percentage of the data, indicating that the dataset is imbalanced, which could influence the KNN classifier's performance.

<hr></hr>

We will start by creating a recipe which scales all our variables, and also rebalances our dataset by oversampling all the qualities and maintains a 1:1 ratio.

In [10]:
wine_recipe <- recipe(quality ~ quality + volatile_acidity + citric_acid + residual_sugar + sulphates, data = wine_train) |>
                step_upsample(quality, over_ratio = 1, skip=TRUE) |>
                step_scale(all_predictors()) |>
                step_center(all_predictors())

wine_recipe

ERROR: Error in step_upsample(recipe(quality ~ quality + volatile_acidity + citric_acid + : could not find function "step_upsample"


In order to obtain the optimal k value for k-nearest-neighbor classification algorithm, we apply cross validation that divides the training data set into 5 validation sets (5-fold cross validation). Having multiple training sets would allows us to acquire a more precise calculation of the accuracy of the classification model, which aids us in finding the best k-neighbor.

In [11]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")

k_vals <- tibble(neighbors = seq(from=1,to=10,by=1))

wine_train_vfold <- vfold_cv(wine_train, v=5,strata=quality)

vfold_metrics <- workflow() |>
                    add_recipe(wine_recipe) |>
                    add_model(knn_spec) |>
                    tune_grid(resamples=wine_train_vfold, grid=k_vals) |>
                    collect_metrics()

accuracies <- vfold_metrics |> filter(.metric=="accuracy")

accuracies

ERROR: Error in eval_select_impl(NULL, .vars, expr(c(!!!dots)), include = .include, : object 'wine_train' not found


We now plot a graph to choose the Best K.

In [None]:
options(repr.plot.width=7,repr.plot.height=7)

accuracy_vs_k <- ggplot(accuracies, aes(x=neighbors, y=mean)) +
                    geom_point() +
                    geom_line() +
                    labs(x="Neighbors", y="Accuracy Estimate") +
                    scale_x_continuous(limits=c(1,10), breaks=1:10) +
                    theme(text=element_text(size=12))
accuracy_vs_k

According to the above accuracy vs k-neighbors line plot, we observe that the curve peaks at k= 2, which provides an indication that our classification model would return the most accurate predictions at k=2. As a result, we retrained the training dataset with a neighbor of k= 2.

In [None]:
wine_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 2) |>
            set_engine("kknn") |>
            set_mode("classification")

wine_fit <- workflow() |>
            add_recipe(wine_recipe) |>
            add_model(wine_spec) |>
            fit(data=wine_test)



wine_test_predictions <- predict(wine_fit, wine_test) |>
                            bind_cols(wine_test)

head(wine_test_predictions)

Now we will check the accuracy of the prediction results using metrics and see the table of predicted and correct labels using Confusion Matrix

In [None]:
wine_test_predictions |> metrics(truth=quality, estimate=.pred_class) |> filter(.metric == "accuracy")

In [None]:
wine_confusion <- wine_test_predictions |> conf_mat(truth=quality, estimate=.pred_class)
wine_confusion

From both the accuracy metrics and the confusion matrix, we can observe that the majority of the inputs were predicted accurately in terms of wine quality. The incorrect predictions occurred with data that has wine quality equals to 5 and 7. In fact, the model demonstrated an accuracy of 95.02%, which is a reflection of the success and usefulness of our wine quality classification model.