# Predicting Wine Quality Using Linear Regression

## Summary

## Introduction
[Wine is entrenched in many cultures and remains a strong industry worldwide.](https://www.toptal.com/finance/market-sizing/wine-industry) [Technological innovations have supported the growth of the wine industry, especially in the realm of certification and quality assessment.](http://dx.doi.org/10.1016/j.dss.2009.05.016) [One prominent innovation is the use of laboratory testing to relate physicochemical properties of wine to human sensory perceptions.](https://ieeexplore.ieee.org/document/10287348) Examples of physicochemical indicators include pH and residual sugar. [Using data to model complex wine perceptions is a daunting task, but it can benefit wine production by flagging the most important properties to consider and informing price setting.](http://dx.doi.org/10.1016/j.dss.2009.05.016)

Thus, our key question is: **Can we use multiple linear regression and various physicochemical indicators to predict the quality of red wine?**

To answer whether a full regression model is viable, we use a [dataset on red wine quality from the UCI Machine Learning Repository](https://doi.org/10.24432/C56S3T). The dataset comprises 12 variables (11 physicochemical indicators and 1 quality indicator) and contains 1599 instances of red vinho verde, a popular wine from Portugal. Each instance of wine was assessed by at least three [sensory assessors](https://www.sensorysociety.org/knowledge/sspwiki/Pages/assessor.aspx) and scored on a ten-point scale that ranges from "very bad" to "excellent"; the wine quality for each instance is determined by the median of these scores. The data was collected by the CVRVV, an inter-professional organisation dedicated to the promotion of vinho verde, from May 2004 to February 2007.

*need to format citations*

## Methods

### Load Data

In [2]:
# Import packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [15]:
# Read CSV data
wine <- read_delim("data/winequality-red.csv", delim = ";")
new_names <- c("fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar", "chlorides", "free_sulfur_dioxide", 
              "total_sulfur_dioxide", "density", "pH", "sulphates", "alcohol", "quality")
colnames(wine) <- new_names
head(wine)

[1mRows: [22m[34m1599[39m [1mColumns: [22m[34m12[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ";"
[32mdbl[39m (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


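| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.4 | 0.66 | 0.0 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |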


> *Figure 1.1. Loaded dataset of wine quality.*

### Split Dataset

In [16]:
# Load rsample (part of tidymodels) for initial_split(), training(), and testing()
library(rsample)

# Pick seed 1234 for reproducible results
set.seed(1234)

# Split dataset into 75% training and 25% testing, stratified by quality
wine_split <- initial_split(wine, prop = 0.75, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)

glimpse(wine_train)
glimpse(wine_test)
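
To illustrate what a stratified 75/25 split does, here is a minimal base-R sketch on simulated quality scores. The data frame `toy`, its values, and the helper `stratified_split()` are hypothetical and not part of the analysis. Unlike `initial_split()`, which bins a numeric `strata` variable into quantiles, this sketch samples within each distinct quality score.

```r
# Simulated stand-in for the wine data (hypothetical values, illustration only)
set.seed(1234)
toy <- data.frame(
  alcohol = runif(200, min = 8, max = 14),
  quality = sample(3:8, size = 200, replace = TRUE,
                   prob = c(0.02, 0.05, 0.40, 0.40, 0.10, 0.03))
)

# Hypothetical helper: sample a share `prop` of the rows within each level of `strata`
stratified_split <- function(data, strata, prop = 0.75) {
  rows_by_level <- split(seq_len(nrow(data)), data[[strata]])
  train_rows <- unlist(lapply(rows_by_level, function(rows) {
    n_train <- round(length(rows) * prop)
    rows[sample.int(length(rows), n_train)]
  }), use.names = FALSE)
  test_rows <- setdiff(seq_len(nrow(data)), train_rows)
  list(train = data[train_rows, ], test = data[test_rows, ])
}

toy_split <- stratified_split(toy, strata = "quality", prop = 0.75)

# Compare the share of each quality score in the two subsets
round(prop.table(table(toy_split$train$quality)), 2)
round(prop.table(table(toy_split$test$quality)), 2)
```

If the split is stratified as intended, the two tables of proportions should look similar, which is the property the `strata = quality` argument is meant to provide.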


### EDA 

First, we check for missing values...

In [10]:
sum(is.na(wine))

... and compute summary statistics.

In [9]:
summary(wine)

 fixed_acidity   volatile_acidity  citric_acid    residual_sugar  
 Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
 1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
 Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
 Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
 3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
 Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
   chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
 Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
 1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
 Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
 Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
 3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
 Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1

> *Figure 1.2. Summary statistics.*
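
A single total of missing values does not show where any gaps occur. The sketch below counts missing values per column with `colSums(is.na(...))` on a small made-up data frame; `toy` and its values are hypothetical, and the same call could be applied to `wine`.

```r
# Made-up data frame with two deliberately missing entries (illustration only)
toy <- data.frame(
  pH        = c(3.51, NA, 3.26, 3.16),
  sulphates = c(0.56, 0.68, NA, 0.58),
  alcohol   = c(9.4, 9.8, 9.8, 9.8)
)

# Total number of missing values across the whole data frame
total_missing <- sum(is.na(toy))
total_missing

# Missing values broken down by column
missing_by_column <- colSums(is.na(toy))
missing_by_column
```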

Next, we examine the means of the independent variables for every level of our response variable "quality".

In [8]:
response_means <- wine %>% 
    mutate(quality = as.factor(quality)) %>% 
    group_by(quality) %>%
    summarise_all(mean)
response_means

| quality | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 3 | 8.36 | 0.8845 | 0.171 | 2.635 | 0.1225 | 11.0 | 24.9 | 0.997464 | 3.398 | 0.57 | 9.955 |
| 4 | 7.779245 | 0.6939623 | 0.1741509 | 2.69434 | 0.09067925 | 12.26415 | 36.24528 | 0.9965425 | 3.381509 | 0.5964151 | 10.265094 |
| 5 | 8.167254 | 0.5770411 | 0.2436858 | 2.528855 | 0.09273568 | 16.98385 | 56.51395 | 0.9971036 | 3.304949 | 0.6209692 | 9.899706 |
| 6 | 8.347179 | 0.4974843 | 0.2738245 | 2.477194 | 0.08495611 | 15.7116 | 40.86991 | 0.9966151 | 3.318072 | 0.6753292 | 10.629519 |
| 7 | 8.872362 | 0.4039196 | 0.3751759 | 2.720603 | 0.07658794 | 14.04523 | 35.0201 | 0.9961043 | 3.290754 | 0.7412563 | 11.465913 |
| 8 | 8.566667 | 0.4233333 | 0.3911111 | 2.577778 | 0.06844444 | 13.27778 | 33.44444 | 0.9952122 | 3.267222 | 0.7677778 | 12.094444 |


> *Figure 1.3. Means for each level of the response variable "quality".*

### EDA Visualization

### Regression
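
As a minimal sketch of the multiple linear regression posed in the key question, the code below fits `lm()` on simulated data and computes the test-set RMSE. The data, coefficients, and object names (`toy`, `toy_fit`, `rmse`) are hypothetical and stand in for `wine_train` and `wine_test`; nothing here is a result from the wine dataset.

```r
# Simulated predictors and response (hypothetical values, illustration only)
set.seed(1234)
n <- 400
toy <- data.frame(
  alcohol          = runif(n, min = 8, max = 14),
  volatile_acidity = runif(n, min = 0.1, max = 1.2),
  sulphates        = runif(n, min = 0.3, max = 1.5)
)
toy$quality <- 2 + 0.3 * toy$alcohol - 1.0 * toy$volatile_acidity +
  0.8 * toy$sulphates + rnorm(n, sd = 0.6)

# Simple 75/25 split by row index
train_rows <- sample(seq_len(n), size = round(0.75 * n))
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Fit the full additive model (all predictors) on the training rows
toy_fit <- lm(quality ~ ., data = toy_train)
summary(toy_fit)$coefficients

# Root mean squared error on the held-out rows
predictions <- predict(toy_fit, newdata = toy_test)
rmse <- sqrt(mean((toy_test$quality - predictions)^2))
rmse
```

On the real data, the formula `quality ~ .` would include all 11 physicochemical indicators, assuming `quality` is kept numeric.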

### Regression Visualization

## Discussion

### Findings

### Impacts and Future Questions

Using variable selection... stepwise? LASSO?
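
One of the options raised above, stepwise selection, can be sketched with base R's `step()`, which here drops terms one at a time while doing so lowers AIC. The simulated data and names below (`toy`, `noise`, `full_fit`, `selected_fit`) are hypothetical; LASSO would need an additional package and is not shown.

```r
# Simulated data with one predictor unrelated to the response (illustration only)
set.seed(1234)
n <- 300
toy <- data.frame(
  alcohol          = runif(n, min = 8, max = 14),
  volatile_acidity = runif(n, min = 0.1, max = 1.2),
  noise            = rnorm(n)  # unrelated to quality by construction
)
toy$quality <- 2 + 0.3 * toy$alcohol - 1.0 * toy$volatile_acidity +
  rnorm(n, sd = 0.6)

# Start from the full model containing every predictor
full_fit <- lm(quality ~ ., data = toy)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
selected_fit <- step(full_fit, direction = "backward", trace = 0)
formula(selected_fit)
```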

## References

[Dataset information](http://dx.doi.org/10.1016/j.dss.2009.05.016)