# Predicting Wine Quality Using Multiple Linear Regression
Group 7: Rui Xiang Yu, Rico Chan, & Kevin Yu  
Course: DSCI 310, 2024 Winter Term 2

## 1. Summary

This project examines which physicochemical properties of wine are associated with higher quality scores and which are associated with lower ones. We fitted a multiple linear regression to a public-use dataset to estimate how each property relates to wine quality. After splitting the dataset into a 75/25 training/testing split, the model achieved an RMSE of 0.67 and an MAE of 0.52 on the testing set, which we consider reasonably low, but a mediocre R-squared value of 0.32. In the fitted model, fixed acidity, residual sugar, free sulfur dioxide, sulphates, and alcohol tend to increase the wine quality, while volatile acidity, citric acid, chlorides, total sulfur dioxide, density, and pH tend to reduce it.

## 2. Introduction
Wine is entrenched in many cultures and remains a strong industry worldwide (Elfman, 2017; *Culture of wine*, n.d.). Technological innovations have supported the growth of the wine industry, especially in the realm of certification and quality assessment (Cortez et al., 2009). One prominent innovation is the use of laboratory testing to relate physicochemical properties of wine to human sensory perceptions (Cortez et al., 2009; Luque et al., 2023). Examples of physicochemical indicators include pH and residual sugar. Using data to model complex wine perceptions is a daunting task, but it can benefit wine production by flagging the most important properties to consider and informing price setting (Cortez et al., 2009).

Thus, our key question is: **Can we use multiple linear regression and various physicochemical indicators to predict the quality of red wine?**

To answer whether a full regression model is viable, we use a dataset on red wine quality [from the UCI Machine Learning Repository](https://doi.org/10.24432/C56S3T). The dataset comprises 12 variables (11 physicochemical indicators and 1 quality indicator) and contains 1599 instances of red vinho verde, a popular wine from Portugal. Each instance of wine was assessed by at least three sensory assessors and scored on a ten-point scale that ranges from "very bad" to "excellent"; the wine quality for each instance is the median of these scores (Cortez et al., 2009). The data was collected by the CVRVV, an inter-professional organisation dedicated to the promotion of vinho verde, from May 2004 to February 2007.

## 3. Methods

### 3.1. Loading Data
From UCI Machine Learning Repository: <https://doi.org/10.24432/C56S3T>

In [None]:
# Import packages
library(tidyverse)
library(tidymodels)
library(cowplot)
library(car)
library(corrplot)

In [None]:
# Read CSV data.
wine <- read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
                delim = ";")

# Adding new names as the original names contain spaces.
new_names <- c("fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar", "chlorides", "free_sulfur_dioxide", 
              "total_sulfur_dioxide", "density", "pH", "sulphates", "alcohol", "quality")
colnames(wine) <- new_names

# Previewing the first 6 rows.
head(wine)

> *Table 1. Loaded dataset of wine quality.*

### 3.2. Exploratory Data Analysis

We first set a seed to ensure reproducibility. We then split the data into a training set (75% of the dataset) and a testing set (the remaining 25%), stratifying on quality. We check the number of rows in the training and testing sets to make sure the split was done correctly. The training set will be used to train our model. The testing set will be used to evaluate how well the model predicts unseen data.

In [None]:
# Setting the seed.
set.seed(7)

# Splitting the data.
wine_split <- initial_split(wine, prop = 0.75, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)

nrow(wine_train)
nrow(wine_test)

As we can see, the split was done appropriately. We can now move on to see if there are any missing values in our entire dataset:

In [None]:
sum(is.na(wine))

We then compute summary statistics for the training set:

In [None]:
# Summary Statistics
summary(wine_train)

> *Figure 1. Summary statistics of the wine training set.*

Next, we examine the means of the independent variables for every level of our response variable "quality".

In [None]:
# Means for each level of response variable "quality"
response_means <- wine_train %>%
    mutate(quality = as.factor(quality)) %>%
    group_by(quality) %>%
    summarise_all(mean)

response_means

> *Table 2. Means for each level of the response variable "quality".*

### 3.3. Exploratory Data Analysis Visualization

Before beginning the analysis, we visualize the training data to get a general understanding of it, check whether our assumptions are reasonable, and identify potential issues that we may need to address later on.

In [None]:
wine_train_categorical <- wine_train %>%
    mutate(quality = as.factor(quality))

In [None]:
font_size = 50

options(repr.plot.width = 60, repr.plot.height = 30)

fixed_acidity_hist <- wine_train_categorical %>%
    ggplot(aes(x = fixed_acidity)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none")

volatile_acidity_hist <- wine_train_categorical %>%
    ggplot(aes(x = volatile_acidity)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none",
         axis.title.y = element_blank())

citric_acid_hist <- wine_train_categorical %>%
    ggplot(aes(x = citric_acid)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none",
         axis.title.y = element_blank())

residual_sugar_hist <- wine_train_categorical %>%
    ggplot(aes(x = residual_sugar)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none",
         axis.title.y = element_blank())

chlorides_hist <- wine_train_categorical %>%
    ggplot(aes(x = chlorides)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none")

free_sulfur_dioxide_hist <- wine_train_categorical %>%
    ggplot(aes(x = free_sulfur_dioxide)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none",
         axis.title.y = element_blank())

total_sulfur_dioxide_hist <- wine_train_categorical %>%
    ggplot(aes(x = total_sulfur_dioxide)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none",
         axis.title.y = element_blank())

density_hist <- wine_train_categorical %>%
    ggplot(aes(x = density)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none",
         axis.title.y = element_blank())

pH_hist <- wine_train_categorical %>%
    ggplot(aes(x = pH)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none")

sulphates_hist <- wine_train_categorical %>%
    ggplot(aes(x = sulphates)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none",
         axis.title.y = element_blank())

alcohol_hist <- wine_train_categorical %>%
    ggplot(aes(x = alcohol)) +
    geom_histogram(aes(color = quality, fill = quality)) + 
    theme(text = element_text(size=font_size),
         legend.position="none",
         axis.title.y = element_blank())

quality_hist <- wine_train %>%
    ggplot(aes(x = quality)) +
    geom_histogram(aes(fill = as.factor(quality))) + 
    theme(text = element_text(size=font_size),
         axis.title.y = element_blank()) +
    labs(fill = "quality")

hists <- plot_grid(fixed_acidity_hist, volatile_acidity_hist, citric_acid_hist, residual_sugar_hist, 
                   chlorides_hist, free_sulfur_dioxide_hist, total_sulfur_dioxide_hist, density_hist, 
                   pH_hist, sulphates_hist, alcohol_hist, quality_hist + theme(legend.position="none"),
             ncol=4, nrow =3)

title_hist <- ggdraw() +
    draw_label(
        "Figure 2. Histograms of the input variables, color-coded by quality level. The last panel shows the number of wines at each quality level.",
        fontface = 'bold', x = 0, hjust = 0, size = font_size + 20) +
    theme(plot.margin = margin(0, 0, 0, 7)) # alignment
          
legend_hist <- get_legend(quality_hist + theme(legend.box.margin = margin(0, 0, 0, 12),
                                         legend.key.size = unit(3, 'cm'),
                                         legend.title = element_text(size = font_size)))

plot_grid(
    plot_grid(title_hist, hists, ncol = 1, rel_heights = c(0.1, 1)),
    legend_hist,
    rel_widths = c(4, .4)
)

Each input variable is plotted as a histogram so that we can check some of the assumptions we are making. For some variables (such as density, pH, and volatile_acidity), a normality assumption is reasonable. For others (such as citric_acid and total_sulfur_dioxide), normality is harder to assume. The bars are also coloured by quality level, so the distribution of each variable across the quality levels can be seen.
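As a numeric complement to this visual check, the sample skewness of each variable could be computed, with values near 0 indicating a roughly symmetric distribution. The sketch below is illustrative only: `sample_skewness` is a hypothetical helper that is not part of the original analysis, and the vectors are toy data rather than wine measurements.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^(3 / 2)
}

# Toy vectors (not wine data): one symmetric, one right-skewed.
symmetric_values <- c(1, 2, 3, 4, 5)
right_skewed_values <- c(1, 1, 1, 2, 10)

skew_symmetric <- sample_skewness(symmetric_values)
skew_right <- sample_skewness(right_skewed_values)
```

Assuming `wine_train` from the split above, the same helper could be applied to every column with `sapply(wine_train, sample_skewness)`.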

The last plot shows the number of wines at each quality level. Unfortunately, the counts are not balanced across quality levels: most of the wines in the dataset have a quality between 5 and 7.

In [None]:
options(repr.plot.width = 60, repr.plot.height = 30)

point_size = 4

fixed_acidity_plot <- wine_train %>%
    ggplot(aes(x = quality, y = fixed_acidity)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

volatile_acidity_plot <- wine_train %>%
    ggplot(aes(x = quality, y = volatile_acidity)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

citric_acid_plot <- wine_train %>%
    ggplot(aes(x = quality, y = citric_acid)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

residual_sugar_plot <- wine_train %>%
    ggplot(aes(x = quality, y = residual_sugar)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

chlorides_plot <- wine_train %>%
    ggplot(aes(x = quality, y = chlorides)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

free_sulfur_dioxide_plot <- wine_train %>%
    ggplot(aes(x = quality, y = free_sulfur_dioxide)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

total_sulfur_dioxide_plot <- wine_train %>%
    ggplot(aes(x = quality, y = total_sulfur_dioxide)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

density_plot <- wine_train %>%
    ggplot(aes(x = quality, y = density)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

pH_plot <- wine_train %>%
    ggplot(aes(x = quality, y = pH)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

sulphates_plot <- wine_train %>%
    ggplot(aes(x = quality, y = sulphates)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")

alcohol_plot <- wine_train %>%
    ggplot(aes(x = quality, y = alcohol)) +
    geom_point(size=point_size, aes(color = as.factor(quality), fill = as.factor(quality))) +
    theme(text = element_text(size=font_size), axis.title.x = element_blank(), legend.position = "none")


plots <- plot_grid(fixed_acidity_plot, volatile_acidity_plot, citric_acid_plot, residual_sugar_plot, 
                   chlorides_plot, free_sulfur_dioxide_plot, total_sulfur_dioxide_plot, density_plot, 
                   pH_plot, sulphates_plot, alcohol_plot,
                   ncol=4, nrow =3)

title_plots <- ggdraw() +
    draw_label(
        "Figure x. Scatterplot of the quality and the input variables, color-coded by their respective qualities.",
        fontface = 'bold', x = 0, hjust = 0, size = font_size + 20) +
    theme(plot.margin = margin(0, 0, 0, 7)) # alignment

plot_grid(title_plots, plots, ncol = 1, rel_heights = c(0.1, 1))


Plotting each predictor on the y-axis against wine quality suggests a possible linear trend for several of them, especially alcohol, volatile_acidity, and density. Because quality takes only integer values, the points for each quality level stack into vertical strips, so most panels have a histogram-like shape; the two sulfur dioxide variables are clear examples.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

wine_cors <- cor(wine_train)

corr_plot <- corrplot(wine_cors, method = 'number',
                     title = "Figure x. Correlation Matrix of all Input Variables",
                     mar=c(0,0,3,0))

The correlation matrix suggests that most pairs of variables are only weakly correlated, although some pairs are more strongly related than others. The variables most correlated with quality appear to be volatile_acidity and alcohol. This suggests that those two may be the best predictors and that the others may be somewhat weaker.
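Pairwise correlations do not capture every form of collinearity among predictors, so a variance inflation factor (VIF) check is a common follow-up. The sketch below is not part of the original analysis. It relies on the identity that, for numeric predictors, the VIFs equal the diagonal of the inverse of the predictors' correlation matrix. `vif_from_cor` is a hypothetical helper and the data frame is simulated.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), read off the diagonal of
# the inverse correlation matrix of the predictors.
vif_from_cor <- function(predictors) {
  vifs <- diag(solve(cor(predictors)))
  names(vifs) <- colnames(predictors)
  vifs
}

# Simulated predictors (not wine data): a and b are strongly related,
# c is generated independently of both.
set.seed(1)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.3), c = rnorm(200))

toy_vif <- vif_from_cor(toy)
toy_vif
```

Assuming `wine_train` from the split above, the predictors could be passed as `vif_from_cor(select(wine_train, -quality))`. The `car` package loaded earlier also provides `vif()`, which accepts a fitted `lm` object.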

### 3.4. Multiple Linear Regression Analysis

We first specify a linear regression model and then a recipe. In the recipe, we state "quality" as our response variable, and the other 11 variables as input variables. We then set up the workflow and train the model using our training set.

In [None]:
# Specifying a linear regression model.
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Setting up the recipe.
wine_lm_recipe <- recipe(quality ~ ., data = wine_train)

# Training the model.
wine_lm_fit <- workflow() %>%
  add_recipe(wine_lm_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = wine_train)

wine_lm_fit

> *Figure 3. Linear regression workflow summary.*

Let's take a closer look at the obtained coefficients:

In [None]:
# Pulling information of the coefficients in a tibble.
wine_coeffs <- wine_lm_fit %>%
               extract_fit_parsnip() %>%
               tidy()

wine_coeffs

> *Table 3. Summary of the coefficients from the linear regression with their respective standard error, statistic, and p-value.*

Most of our input variables are statistically significant at the 0.05 level, as their p-values are < 0.05. However, four have p-values that are > 0.05 and thus are not statistically significant.
- Significant: fixed acidity, sulphates, alcohol, volatile acidity, chlorides, total sulfur dioxide, and density.
- Non-significant: residual sugar, citric acid, free sulfur dioxide, and pH.
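The grouping above can also be produced programmatically by flagging each coefficient whose p-value falls below 0.05. The following is a minimal base R sketch on simulated data, not the wine model itself; the variable names `x1`, `x2`, and `y` are placeholders.

```r
# Simulated data (not wine data): y depends on x1 but not on x2.
set.seed(7)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)

toy_fit <- lm(y ~ x1 + x2, data = toy)

# Coefficient table with a logical flag for significance at the 0.05 level.
coef_table <- as.data.frame(summary(toy_fit)$coefficients)
coef_table$term <- rownames(coef_table)
coef_table$significant <- coef_table[["Pr(>|t|)"]] < 0.05

coef_table[coef_table$term != "(Intercept)",
           c("term", "Estimate", "Pr(>|t|)", "significant")]
```

With the tidy tibble from the cell above, the equivalent step would be `wine_coeffs %>% mutate(significant = p.value < 0.05)`.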

The full equation of our linear regression model (coefficients rounded to three decimal places) is:  

quality = 56.828 + 0.067 × fixed_acidity - 1.063 × volatile_acidity - 0.312 × citric_acid + 0.027 × residual_sugar - 1.934 × chlorides + 0.003 × free_sulfur_dioxide - 0.002 × total_sulfur_dioxide - 53.300 × density - 0.341 × pH + 0.881 × sulphates + 0.261 × alcohol  

The sign of each coefficient indicates the direction of the estimated association between that input variable and quality, holding the other variables fixed:
- Positive coefficient: fixed acidity, residual sugar, free sulfur dioxide, sulphates, and alcohol.
- Negative coefficient: volatile acidity, citric acid, chlorides, total sulfur dioxide, density, and pH.

We then test our model on the testing set:

In [None]:
# Finding the RMSPE, R^2, and MAE.

wine_lm_test_results <- wine_lm_fit %>%
  predict(wine_test) %>%
  bind_cols(wine_test) %>%
  metrics(truth = quality, estimate = .pred)

wine_lm_test_results

> *Table 4. Estimates of the model's performance on the testing set.*

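Our RMSPE (the RMSE computed on the testing set) is 0.67 units of quality and our mean absolute error is 0.52 units of quality, both of which we deem to be low relative to the ten-point quality scale. In that sense, we believe our model performs relatively well. However, our R^2 is 0.32, which is a low number, indicating that our model does not fit the data as well as hoped. One likely reason both can hold is that most wines score between 5 and 7, so even predictions close to the average quality produce small errors.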
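To make the three metrics concrete, the sketch below computes them by hand in base R on toy vectors standing in for the true and predicted quality scores. The helper names are hypothetical. Note that the `rsq` value reported by `metrics()` is the squared correlation between truth and estimate, so that is what `rsq_manual` reproduces.

```r
# Toy vectors (not wine data) standing in for wine_test$quality and .pred.
truth <- c(5, 6, 5, 7, 6)
pred <- c(5.2, 5.8, 5.5, 6.4, 6.1)

# Hypothetical helpers mirroring the three reported metrics.
rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
mae_manual <- function(truth, estimate) mean(abs(truth - estimate))
rsq_manual <- function(truth, estimate) cor(truth, estimate)^2

rmse_val <- rmse_manual(truth, pred)
mae_val <- mae_manual(truth, pred)
rsq_val <- rsq_manual(truth, pred)
```

RMSE penalises large errors more heavily than MAE, which is why the RMSE is never smaller than the MAE on the same predictions.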

### 3.5. Multiple Linear Regression Visualization

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

wine_lm_test_preds <- wine_lm_fit %>%
  predict(wine_test) %>%
  bind_cols(wine_test) %>%
  mutate(resid = quality - .pred)

qqPlot(wine_lm_test_preds$resid)

> *Figure 4. Quantile-quantile plot of the model's errors.*

The quantile-quantile plot of the test-set residuals looks reasonable. The points dip away from the line near quantiles = 0 and there appear to be a few outliers, but overall the assumption of normally distributed errors is reasonable.
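A formal test can complement the visual reading of the Q-Q plot. The sketch below runs a Shapiro-Wilk test on the residuals of a toy linear model; it was not part of the original analysis and the data are simulated. With several hundred observations, such tests can flag even small departures from normality, so the result is best read alongside the plot rather than instead of it.

```r
# Simulated data (not wine data) with normally distributed errors.
set.seed(7)
toy <- data.frame(x = rnorm(200))
toy$y <- 1 + 0.5 * toy$x + rnorm(200)

toy_resid <- residuals(lm(y ~ x, data = toy))

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(toy_resid)
sw$p.value
```

For the wine model, the same call would be `shapiro.test(wine_lm_test_preds$resid)`, assuming the object created in the cell above.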

## 4. Discussion

### 4.1. Summary and Expectations

### 4.2. Impacts and Future Questions

The multiple regression analysis conducted on the wine quality dataset aimed to uncover how factors such as alcohol content, acidity, and residual sugar relate to the overall quality of wine. Although the predictive accuracy of our model is mediocre, the analysis is still valuable because it indicates which properties are associated with quality, which can inform changes to winemaking practices and our understanding of consumer preferences.

Future research should attempt to improve the accuracy of our model, for example by using model selection techniques such as stepwise regression, or by shrinking the coefficients through LASSO or ridge regression. An alternative strategy would be to treat quality as a class label and apply a classification analysis, rather than a regression analysis, to the data. More data could also be collected to increase both the number of predictors and the number of observations. For those in the winemaking business, our results may suggest alternative winemaking techniques to improve the quality of the wine being produced, and they raise the question of why some properties negatively affect wine quality while others positively affect it.
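To illustrate the shrinkage idea mentioned above, here is a minimal base R sketch of ridge regression using its closed-form solution on standardised predictors. It was not run on the wine data. `ridge_coefs` is a hypothetical helper, the data are simulated, and the penalty value `lambda` is arbitrary and not necessarily on the same scale as the penalty used by dedicated packages.

```r
# Hypothetical helper: closed-form ridge coefficients,
# beta = (X'X + lambda * I)^(-1) X'y, on standardised predictors.
ridge_coefs <- function(x, y, lambda) {
  xs <- scale(x)        # centre and scale each predictor
  yc <- y - mean(y)     # centring y removes the need for an intercept
  p <- ncol(xs)
  beta <- solve(crossprod(xs) + lambda * diag(p), crossprod(xs, yc))
  setNames(as.numeric(beta), colnames(xs))
}

# Simulated data (not wine data): y depends on x1 and x2 but not on x3.
set.seed(7)
toy_x <- matrix(rnorm(300), ncol = 3,
                dimnames = list(NULL, c("x1", "x2", "x3")))
toy_y <- 1.5 * toy_x[, "x1"] - 0.5 * toy_x[, "x2"] + rnorm(100)

ols_like <- ridge_coefs(toy_x, toy_y, lambda = 0)   # no penalty
shrunk <- ridge_coefs(toy_x, toy_y, lambda = 50)    # coefficients pulled toward 0
```

With `lambda = 0` the helper reproduces the ordinary least squares slopes on the standardised predictors, and larger values of `lambda` shrink the coefficient vector toward zero. In practice the penalty would be chosen by cross-validation.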

## 5. References

<div class="csl-bib-body" style="line-height: 2; margin-left: 2em; text-indent:-2em;">
  <div class="csl-entry">Cortez, P., Cerdeira, A., Almeida, F., Matos, T., &amp; Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. <i>Decision Support Systems</i>, <i>47</i>(4), 547–553. https://doi.org/10.1016/j.dss.2009.05.016</div>
  <div class="csl-entry"><i>Culture of wine</i>. (n.d.). Retrieved 29 February 2024, from https://www.wineinmoderation.eu/culture</div>
  <div class="csl-entry">Elfman, Z. (2017, January 6). <i>Libation frontiers – a deep dive into the world wine industry | toptal®</i>. Toptal Finance Blog. https://www.toptal.com/finance/market-sizing/wine-industry</div>
  <div class="csl-entry">Luque, A., Mazzoleni, M., Zamora-Polo, F., Ferramosca, A., Lama, J. R., &amp; Previdi, F. (2023). Determining the importance of physicochemical properties in the perceived quality of wines. <i>IEEE Access</i>, <i>11</i>, 115430–115449. https://doi.org/10.1109/ACCESS.2023.3325676</div>
</div>