# The Classification of Wine

### Introduction 

Given the chemical properties of an unknown wine (specifically, Flavanoids and Color Intensity), is it possible to accurately classify the wine's type?

We are using the Wine Dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine), which examines different types of wine (Pinot Noir, Merlot, and Cabernet Sauvignon) grown in the same region but derived from different cultivars. The different cultivars give each type of wine different chemical constituents; hence, the goal of this project is to see if we can classify types of wine given chemical predictors.

According to Jonathon Bechtel (https://jonathonbechtel.com/blog/2018/02/06/wines/), the three classes of wine in the set (1, 2, and 3) most likely correspond to Pinot Noir, Cabernet Sauvignon, and Merlot, respectively. This set also contains data on 13 attributes: Alcohol content, Malic Acid, Ash, Alcalinity of Ash, Magnesium, Total Phenols, Flavanoids, Nonflavanoid Phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. There are 178 observations of wine samples in this data set.

### Preliminary exploratory data analysis

In [None]:
# Install and load necessary packages
install.packages("GGally")
library(repr)
library(tidyverse)
library(tidymodels)
library(GGally)

In [None]:
# Set seed and options
set.seed(27)
options(repr.matrix.max.rows = 15)

In [None]:
## Read dataset from the web into R
wine <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", 
                 col_names = FALSE)
head(wine)


The first column is our class, and it's currently of type numeric (`<dbl>`). Since we'll be treating class as a categorical variable, we'll convert it to type factor.

In [None]:
## Clean and Wrangle

# assign column names
colnames(wine) <- c("class", "alcohol", "acid", "ash", "alcalinity", "mg", "total_phenol", 
                    "flavanoid", "non_f_phenol", "proantho", "color", "hue", "od280/od315", "proline")

# convert type where applicable
wine <- wine %>%
        mutate(class = as.factor(class), mg = as.integer(mg), proline = as.integer(proline))

# Split wine data into training and testing data
wine_split <- initial_split(wine, prop = 0.75, strata = class)
wine_training <- training(wine_split)
wine_testing <- testing(wine_split)

head(wine)

After some research and exploratory data analysis, we narrowed our exploration to the 5 variables that appear most relevant to `class`: `alcohol`, `flavanoid`, `color`, `hue`, and `proline`.

In [None]:
## Summarization of data
options(digits = 4)

# create new data set with chosen variables
wine_main <- select(wine_training, class, alcohol, flavanoid, color, hue, proline)

# create table that summarizes total observations, variables, and missing values of data set
total_observations <- nrow(wine_training)
total_variables <- ncol(wine_training)
total_na <- sum(is.na(wine_training))
table1 <- data.frame(total_observations, total_variables, total_na)

# create summary table of observations in each class of data set
obs_per_class <- wine_training %>%
    group_by(class) %>%
    summarize(count = n(),
    percentage = n() / total_observations * 100)

# create summary table for the means and standard deviations of chosen variables
means_of_var <- wine_main %>%
    summarize(across(alcohol:proline, mean)) %>%
    pivot_longer(cols = alcohol:proline,
                 names_to = "chemical_components",
                 values_to = "mean")

# compute standard deviations the same way as the means, then join on the variable name
sd_of_var <- wine_main %>%
    summarize(across(alcohol:proline, sd)) %>%
    pivot_longer(cols = alcohol:proline,
                 names_to = "chemical_components",
                 values_to = "sd")

summary_tbl <- left_join(means_of_var, sd_of_var, by = "chemical_components") %>%
    arrange(mean)

In [None]:
# Number of total observations, variables (including class), and missing values
table1

In [None]:
# Number and percentage of observations in each class
obs_per_class

In [None]:
# Means and SD of chosen variables, arranged in ascending order by mean
summary_tbl

In [None]:
# Visualization using matrix plot to examine each pair of variables in the chosen set
options(repr.plot.width = 15, repr.plot.height = 15)
ggpairs(wine_main, aes(color = class, alpha = 0.5), title = "Matrix plot of variables") +
    theme(text = element_text(size = 18))

In the matrix plot above (histograms and box plots), the distributions of the wine classes overlap the least for `flavanoid` and `color` in comparison to the other variables. Less overlap between classes makes the classification of wine type clearer, as each type has a more distinct range of values within the variable. Furthermore, the scatter plot of `flavanoid` and `color` shows little overplotting of the classes' data points, and the distributions of the different classes are distinct. Hence, we believe `flavanoid` and `color` would be the best predictors for this project.

In [None]:
# Clearer scatter plot of Flavanoid vs Color Intensity 
options(repr.plot.width = 8, repr.plot.height = 6)
plot_flava_color <- ggplot(wine_training, aes(x = flavanoid, y = color, color = class)) +
                      geom_point(alpha = 0.5) +
                      labs(x = "Flavanoids", y = "Color Intensity", color = "Class") +
                      ggtitle("Flavanoids vs. Color Intensity") +
                      theme(text = element_text(size = 20))
plot_flava_color

### Methods
Based on the reasoning above and further research, we decided to use Flavanoids and Color Intensity as predictors. The columns `class`, `flavanoid`, and `color` will be used in the data analysis.

This is a classification problem so we use the K-nearest neighbors algorithm. The main library used to perform this algorithm is `tidymodels`.
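
To make the idea behind K-nearest neighbors concrete, here is a minimal base R sketch: standardize the predictors, measure the Euclidean distance from the new observation to every training observation, and take a majority vote among the K closest. The helper `knn_classify()` and the toy values are hypothetical and are not taken from the wine data; the analysis itself relies on `tidymodels`.

```r
# Minimal base R sketch of K-nearest neighbors classification.
# knn_classify() is a hypothetical helper, not a tidymodels function.
knn_classify <- function(train_x, train_class, new_x, k = 3) {
  # Standardize using the training means and standard deviations
  centers <- colMeans(train_x)
  scales <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = centers, scale = scales)
  new_z <- (new_x - centers) / scales

  # Euclidean distance from the new observation to every training observation
  distances <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

  # Majority vote among the k closest training observations
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_class[nearest])
  names(votes)[which.max(votes)]
}

# Made-up values for illustration only (NOT taken from the wine data)
toy_x <- cbind(flavanoid = c(3.0, 2.8, 3.2, 0.6, 0.8, 0.7),
               color     = c(5.5, 5.0, 6.0, 7.5, 8.0, 7.0))
toy_class <- factor(c("1", "1", "1", "3", "3", "3"))

knn_classify(toy_x, toy_class, new_x = c(flavanoid = 2.9, color = 5.4), k = 3)
```

Standardizing first matters because Color Intensity and Flavanoids are measured on different scales, and an unscaled distance would be dominated by whichever variable has the larger range.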

Scatter plots, line plots, and tables will be used for visualization. Scatter plots visualize the distribution of classes given the predictors, line plots help determine K, and tables display results such as the classifier's accuracy.
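
One way the choice of K could be carried out with `tidymodels` is sketched below: estimate accuracy by cross-validation for several candidate values of K, then draw the line plot of accuracy against K. The helpers `tune_k()` and `plot_k()` are hypothetical, and the candidate K values, the 5 folds, the equal-vote weighting (`weight_func = "rectangular"`), and the standardization steps are assumptions rather than decisions stated in this proposal. The sketch expects a data frame with the columns `class`, `flavanoid`, and `color`, such as `wine_training` from the cells above, and it needs the `kknn` package installed.

```r
library(tidymodels)  # the "kknn" engine also requires the kknn package

# Hypothetical helper: cross-validated accuracy for each candidate K.
# The default k_values and folds are illustrative assumptions.
tune_k <- function(training_data, k_values = seq(1, 15, by = 2), folds = 5) {
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("classification")

  # Standardize predictors so both contribute comparably to distances
  knn_recipe <- recipe(class ~ flavanoid + color, data = training_data) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

  cv_folds <- vfold_cv(training_data, v = folds, strata = class)

  workflow() %>%
    add_recipe(knn_recipe) %>%
    add_model(knn_spec) %>%
    tune_grid(resamples = cv_folds, grid = tibble(neighbors = k_values)) %>%
    collect_metrics() %>%
    filter(.metric == "accuracy")
}

# Hypothetical helper: line plot of estimated accuracy against K
plot_k <- function(k_results) {
  ggplot(k_results, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Number of neighbors (K)", y = "Cross-validated accuracy")
}

# Runs only if the training set from the earlier cells is available
if (exists("wine_training")) {
  k_results <- tune_k(wine_training)
  print(plot_k(k_results))
}
```

The table returned by `tune_k()` has one row per candidate K, with the estimated accuracy in the `mean` column, so it can also serve as the accuracy table mentioned above.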

### Expected outcomes and significance
We expect to predict the most likely class of an unknown wine given its chemical analysis. We hope that our model's accuracy will exceed 85%.

The impact of these findings is significant: the wine industry contributes 2.8 billion dollars annually within British Columbia, and 339.53 billion globally. Consumers are particular about the type of wine they wish to purchase, so accurately classifying types of wine is important to the wine industry.

This classification could lead to further questions, such as: Is one type of wine healthier to consume than another? Which type of wine is more sought after and more heavily consumed?


