In [1]:
# Run this cell first: it loads the packages used throughout the notebook
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

✔

**Preliminary exploratory data analysis:**
* Demonstrate that the dataset can be read from the web into R 
* Clean and wrangle your data into a tidy format
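
The cell below reads local copies of the two files. A minimal sketch of reading them straight from the web could look like the following. The helper `read_wine()` and its argument names are hypothetical, and the UCI Machine Learning Repository URL is an assumption that should be checked before use.

```r
library(readr)
library(dplyr)
library(stringr)

# Hypothetical helper (not part of the original notebook): read one
# semicolon-delimited wine file from a URL or a local path, replace spaces in
# the column names with underscores, and label every row with its wine type.
read_wine <- function(source, wine_type_label) {
  read_delim(source, delim = ";") %>%
    rename_with(~ str_replace_all(.x, " ", "_")) %>%
    mutate(wine_type = wine_type_label)
}

# Assumed location of the files on the UCI repository; verify before relying on it.
base_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/"

# Example calls (these need network access, so they are left commented out):
# red_wine_data   <- read_wine(paste0(base_url, "winequality-red.csv"), "red")
# white_wine_data <- read_wine(paste0(base_url, "winequality-white.csv"), "white")
```

Because `read_delim()` accepts a URL wherever it accepts a file path, the same helper works for both the local copies and the remote files.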

In [10]:
# Read the red wine data, replace spaces in the column names with underscores,
# and label every row with its wine type
red_wine_data <- read_delim("winequality-red (1).csv", delim = ";") %>%
    rename_with(~ str_replace_all(.x, " ", "_")) %>%
    mutate(wine_type = "red")

# Read the white wine data and apply the same cleaning steps
white_wine_data <- read_delim("winequality-white.csv", delim = ";") %>%
    rename_with(~ str_replace_all(.x, " ", "_")) %>%
    mutate(wine_type = "white")

# Combine both wine types into one data frame, then convert the class label
# and the wine type to factors so that both share one set of levels
wine_data <- bind_rows(red_wine_data, white_wine_data) %>%
    mutate(quality = as.factor(quality), wine_type = as.factor(wine_type))

Parsed with column specification:
cols(
  `fixed acidity` = col_double(),
  `volatile acidity` = col_double(),
  `citric acid` = col_double(),
  `residual sugar` = col_double(),
  chlorides = col_double(),
  `free sulfur dioxide` = col_double(),
  `total sulfur dioxide` = col_double(),
  density = col_double(),
  pH = col_double(),
  sulphates = col_double(),
  alcohol = col_double(),
  quality = col_double()
)

Parsed with column specification:
cols(
  `fixed acidity` = col_double(),
  `volatile acidity` = col_double(),
  `citric acid` = col_double(),
  `residual sugar` = col_double(),
  chlorides = col_double(),
  `free sulfur dioxide` = col_double(),
  `total sulfur dioxide` = col_double(),
  density = col_double(),
  pH = col_double(),
  sulphates = col_double(),
  

* Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
* Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

In [11]:
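# Split the dataset into training (75%) and testing (25%) sets, stratified by quality.
# Setting a seed makes the random split reproducible; the value 2020 is arbitrary.
set.seed(2020)
wine_split <- initial_split(wine_data, prop = 0.75, strata = quality)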
wine_training <- training(wine_split)
wine_testing <- testing(wine_split)
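
Stratifying on `quality` is meant to keep the class proportions similar in the training and testing sets. One way to check this is sketched below on simulated toy data. The helper `class_share()` and the `toy` tibble are hypothetical stand-ins, not part of the original analysis.

```r
library(rsample)
library(dplyr)

# Hypothetical helper: count and percentage of each class in a data frame.
class_share <- function(data, class_col) {
  data %>%
    count({{ class_col }}, name = "count") %>%
    mutate(percentage = count / sum(count) * 100)
}

# Simulated toy data standing in for wine_data (the values are made up).
set.seed(1)
toy <- tibble(
  quality = factor(sample(c("5", "6", "7"), 400, replace = TRUE, prob = c(0.3, 0.5, 0.2))),
  x = rnorm(400)
)

# Stratified 75/25 split, mirroring the call used for the wine data.
toy_split <- initial_split(toy, prop = 0.75, strata = quality)

# The two tables should show similar percentages for every class.
class_share(training(toy_split), quality)
class_share(testing(toy_split), quality)
```

With the real data, the same check would be `class_share(wine_training, quality)` next to `class_share(wine_testing, quality)`.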

In [12]:
glimpse(wine_training)

Rows: 4,874
Columns: 13
$ fixed_acidity        <dbl> 7.4, 7.8, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7, 7.5…
$ volatile_acidity     <dbl> 0.700, 0.880, 0.700, 0.660, 0.600, 0.650, 0.580,…
$ citric_acid          <dbl> 0.00, 0.00, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36, …
$ residual_sugar       <dbl> 1.9, 2.6, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 1.8, 6.1…
$ chlorides            <dbl> 0.076, 0.098, 0.076, 0.075, 0.069, 0.065, 0.073,…
$ free_sulfur_dioxide  <dbl> 11, 25, 11, 13, 15, 15, 9, 17, 15, 17, 9, 52, 35…
$ total_sulfur_dioxide <dbl> 34, 67, 34, 40, 59, 21, 18, 102, 65, 102, 29, 14…
$ density              <dbl> 0.9978, 0.9968, 0.9978, 0.9978, 0.9964, 0.9946, …
$ pH                   <dbl> 3.51, 3.20, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35, …
$ sulphates            <dbl> 0.56, 0.68, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80
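
The numeric predictors listed by `glimpse()` can be compared visually with one histogram per predictor. The following is a minimal sketch of such a plot. The helper name `plot_predictor_distributions()` is hypothetical, and the built-in `iris` data is used only so that the example runs on its own.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Hypothetical helper: draw one histogram per numeric column of `data`,
# each in its own facet with its own axis scales.
plot_predictor_distributions <- function(data, bins = 30) {
  data %>%
    select(where(is.numeric)) %>%
    pivot_longer(everything(), names_to = "predictor", values_to = "value") %>%
    ggplot(aes(x = value)) +
    geom_histogram(bins = bins) +
    facet_wrap(~ predictor, scales = "free") +
    labs(x = "Value", y = "Number of observations")
}

# Demonstration on a built-in dataset so the sketch is self-contained.
plot_predictor_distributions(iris)
```

For this analysis the call would be `plot_predictor_distributions(wine_training)`, so that only training data is used for exploration. Free scales are chosen because the predictors are measured in very different units.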

Extract the quality classes that the model may predict:

In [15]:
classes <- wine_training %>% pull(quality) %>% levels()
classes

Count the observations in each quality class of the training set and express each count as a percentage of all training observations:

In [16]:
num_obs <- nrow(wine_training)

wine_training %>%
  group_by(quality) %>%
  summarize(
    count = n(),
    percentage = n() / num_obs * 100
  )

`summarise()` ungrouping output (override with `.groups` argument)



| quality \<fct\> | count \<int\> | percentage \<dbl\> |
|---|---|---|
| 3 | 22 | 0.45137464 |
| 4 | 156 | 3.20065654 |
| 5 | 1614 | 33.11448502 |
| 6 | 2126 | 43.61920394 |
| 7 | 812 | 16.65982766 |
| 8 | 142 | 2.91341814 |
| 9 | 2 | 0.04103406 |
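
The class counts above cover one part of the suggested summary table. The predictor means and the number of rows with missing data could be reported as sketched below. The helper `summarize_predictors()` is hypothetical, and the built-in `airquality` data is used only because it contains missing values.

```r
library(dplyr)

# Hypothetical helper: return a one-row summary holding the mean of every
# numeric column (ignoring NAs) and the number of rows with at least one NA.
summarize_predictors <- function(data) {
  data %>%
    summarize(
      # across() comes first so that it only sees the original columns
      across(where(is.numeric), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
      rows_with_missing = sum(!complete.cases(data))
    )
}

# Demonstration on a built-in dataset so the sketch is self-contained.
summarize_predictors(airquality)
```

For this analysis the call would be `summarize_predictors(wine_training)`. The order inside `summarize()` matters: if `rows_with_missing` were created before `across()`, it would itself be picked up as a numeric column and averaged.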
