In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

Before working with the data, we must load it into R from the web. The dataset's URL is https://archive.ics.uci.edu/static/public/186/wine+quality.zip. Because this is a zip archive, we must first unzip it to access the .csv files inside.
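
If the notebook is re-run often, the download step can be made safe to repeat. The following is a minimal sketch, not part of the original analysis: `download_if_missing()` is a hypothetical helper name, and it assumes that skipping the download when the file already exists locally is acceptable.

```r
# Hypothetical helper: fetch a file only when it is not already cached locally.
download_if_missing <- function(url, destfile) {
  # Create the target directory if needed; stays silent when it already exists.
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
  if (!file.exists(destfile)) {
    # mode = "wb" keeps binary files such as zip archives intact on Windows.
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Possible usage with the URL from this notebook:
# download_if_missing(
#   "https://archive.ics.uci.edu/static/public/186/wine+quality.zip",
#   "data/wine_quality.zip"
# )
```

With this pattern, deleting the cached zip file is enough to force a fresh download.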

In [2]:
dir.create("data/", showWarnings = FALSE) # Creating the data directory; no warning if it already exists.

In [3]:
url <- "https://archive.ics.uci.edu/static/public/186/wine+quality.zip" # URL for the dataset's zip file, containing white and red wine data.

download.file(url, destfile = "data/wine_quality.zip")
unzip("data/wine_quality.zip", exdir = "data/") # Unzipping the zipped wine quality file.
white_wine_data <- read_delim("data/winequality-white.csv", delim = ";")  # Selecting the white wine data that will be used for this project.
colnames(white_wine_data) <- c("fixed_acidity", # Adjusting column names for cleanliness.
              "volatile_acidity",
              "citric_acid",
              "residual_sugar",
              "chlorides",
              "free_sulfur_dioxide",
              "total_sulfur_dioxide",
              "density",
              "pH",
              "sulphates",
              "alcohol",
              "quality")
white_wine_data <- white_wine_data |>
            select(citric_acid, residual_sugar, density, pH, alcohol, quality) # Selecting the variables to be measured.
white_wine_data

Rows: 4898 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


citric_acid,residual_sugar,density,pH,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.36,20.7,1.0010,3.00,8.8,6
0.34,1.6,0.9940,3.30,9.5,6
0.40,6.9,0.9951,3.26,10.1,6
⋮,⋮,⋮,⋮,⋮,⋮
0.19,1.2,0.99254,2.99,9.4,6
0.30,1.1,0.98869,3.34,12.8,7
0.38,0.8,0.98941,3.26,11.8,6


The first five columns are the predictors for the `quality` column. A brief description of each follows:
- `citric_acid` -> The mass of citric acid in the wine (g/dm$^{3}$).
- `residual_sugar` -> The mass of residual sugar in the wine (g/dm$^{3}$).
- `density` -> The density of the wine (g/cm$^{3}$).
- `pH` -> The pH of the wine (0-14 scale).
- `alcohol` -> The volume % alcohol content of the wine.

The last column, `quality`, is a sensory rating of the wine on a scale from 0 to 10. It is the response variable that we aim to predict from the physicochemical variables above.

In [4]:
set.seed(1357)
# Creating the training and testing split of the data
wine_split <- initial_split(white_wine_data, prop = .75, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)

wine_train
wine_test

citric_acid,residual_sugar,density,pH,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.40,4.20,0.9947,3.14,9.7,5
0.37,1.20,0.9920,3.18,10.8,5
0.62,19.25,1.0002,2.98,9.7,5
⋮,⋮,⋮,⋮,⋮,⋮
0.28,5.7,0.99168,3.21,12.15,7
0.22,1.9,0.98928,3.04,13.00,7
0.30,1.1,0.98869,3.34,12.80,7


citric_acid,residual_sugar,density,pH,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.40,6.90,0.9951,3.26,10.1,6
0.43,1.50,0.9938,3.22,11.0,6
0.41,1.45,0.9908,2.99,12.0,5
⋮,⋮,⋮,⋮,⋮,⋮
0.40,8.1,0.99494,3.15,9.533333,6
0.38,1.3,0.99298,3.29,9.700000,5
0.19,1.2,0.99254,2.99,9.400000,6
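
One way to check that the stratified split kept the quality distribution similar in both sets is to compare the share of each quality level side by side. The sketch below is an illustration only: `quality_share()` is a hypothetical helper, and the toy data frames stand in for `wine_train` and `wine_test` so the block runs on its own.

```r
library(dplyr)

# Hypothetical helper: number and share of rows at each quality level.
quality_share <- function(data) {
  data |>
    count(quality, name = "n") |>
    mutate(share = n / sum(n))
}

# Toy stand-ins for a split; in the notebook, pass wine_train and wine_test instead.
toy_train <- data.frame(quality = c(5, 5, 6, 6, 6, 7))
toy_test  <- data.frame(quality = c(5, 6, 6, 7))

# Join the two summaries so the shares can be compared row by row.
shares <- full_join(
  quality_share(toy_train),
  quality_share(toy_test),
  by = "quality",
  suffix = c("_train", "_test")
) |>
  mutate(abs_diff = abs(share_train - share_test))

shares
```

Because `quality` is numeric, `initial_split()` stratifies on binned values rather than on each exact level, so the shares should be expected to match only approximately.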


In [5]:
wine_qual_counts <- wine_train |>
            group_by(quality) |>
            summarize(count = n())
print(wine_qual_counts)

# A tibble: 7 × 2
  quality count
    <dbl> <int>
1       3    19
2       4   122
3       5  1089
4       6  1648
5       7   658
6       8   132
7       9     5


This table shows the number of training-set observations at each wine quality level. Only qualities 3 through 9 appear, and the most common quality (the mode) is 6, with 1,648 observations.
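
The mode and the degree of class imbalance can also be read off programmatically. The following is a small sketch that rebuilds the counts printed above by hand (so the block runs on its own) and adds a `share` column; the names `wine_qual_shares` and `modal_quality` are illustrative, not from the original analysis.

```r
library(dplyr)

# Counts copied from the training-set table printed above.
wine_qual_counts <- data.frame(
  quality = 3:9,
  count = c(19, 122, 1089, 1648, 658, 132, 5)
)

# Share of the training set at each quality level.
wine_qual_shares <- wine_qual_counts |>
  mutate(share = count / sum(count))

# The quality level with the largest count is the mode.
modal_quality <- wine_qual_counts |>
  slice_max(count, n = 1) |>
  pull(quality)

wine_qual_shares
modal_quality
```

The shares make the imbalance explicit: qualities 5 and 6 dominate, while qualities 3 and 9 are represented by only a handful of observations.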

In [6]:
wine_avgs <- wine_train |>
            select(citric_acid:alcohol) |>
            map_df(mean)
wine_avgs

citric_acid,residual_sugar,density,pH,alcohol
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.3326491,6.378968,0.9940026,3.188707,10.52237


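These are the training-set means of each predictor column. Note that the predictors sit on quite different scales: density averages about 0.994 g/cm$^{3}$, while alcohol averages about 10.5%.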
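
The same column means can also be computed with `summarize()` and `across()`, which keeps everything within dplyr and may be preferable because `map_df()` is marked as superseded in recent purrr releases. The sketch below uses a toy data frame with made-up values as a stand-in for `wine_train`; only the column names are taken from this notebook.

```r
library(dplyr)

# Toy stand-in for wine_train; the values are illustrative only.
toy_wine <- data.frame(
  citric_acid = c(0.30, 0.40, 0.50),
  alcohol = c(9, 10, 11),
  quality = c(5, 6, 7)
)

# Mean of every column from citric_acid through alcohol, returned as one row.
toy_avgs <- toy_wine |>
  summarize(across(citric_acid:alcohol, mean))

toy_avgs
```

In the notebook, replacing `toy_wine` with `wine_train` should give the same one-row result as the `map_df(mean)` call above.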