# Predicting Wine Cultivars

<br><br>

## Summary
---
In this project, we will predict what cultivar a wine was derived from based on its chemical properties.  

The data was sourced from the [UCI Machine Learning Repository](https://doi.org/10.24432/C5PC7J). It contains data about various wines from Italy derived from three different cultivars. Each row represents the chemical and physical properties of a different wine, such as its concentration of alcohol, magnesium level and hue. 

<br>

## Introduction
---
Wine is a beverage that has been enjoyed by humans for thousands of years (Fehér et al. 2007). Consequently, humans have a long agricultural history with the grape plant, which has led to the development of many different cultivars: grape plants selected and bred for their desirable characteristics (Harutyunyan and Malfeito-Ferreira 2022). Our dataset contains information about thirteen chemical properties of 178 red wines made from three grape cultivars in Italy. 

The recorded chemical properties include: 
1. Alcohol content
2. Malic acid (gives the wine a fruity flavour)
3. Ash (leftover inorganic matter from the wine-making process)
4. Alkalinity of ash (ability to resist acidification)
5. Magnesium
6. Total phenols (contribute to bitter flavour of wine)
7. Flavanoids (antioxidants that contribute to bitter flavour and aroma of wine)
8. Nonflavanoid phenols (weakly acidic)
9. Proanthocyanins (bitter smell)
10. Color intensity
11. Hue
12. The ratio of OD280 to OD315 of diluted wines (protein concentration)
13. Proline (main amino acid in wine, important aspect of the flavour) (Bai et al. 2019).

Using this dataset, our predictive question is: what is the cultivar of an unknown wine, given its chemical properties? Determining which chemical properties distinguish the cultivars could help even inexperienced wine drinkers identify them.

<br>

## Code and Analysis
---

In [7]:
# imports and libraries 
library(GGally) # for ggpairs
library(tidyverse) #importing tidyverse
library(dplyr) # for data wrangling
library(knitr) # to create tables
library(themis) # to balance our cultivar classes out



In [None]:
# reading in the data from the web
raw_data <- read.csv("https://raw.githubusercontent.com/DSCI-310-2024/DSCI-310-Group-5/main/data/wine.data", header= FALSE)
           
# name the columns based on the dataset description 
col_names <- c("cultivar","alcohol","malicacid", "ash", "alcalinity_of_ash", "magnesium", "total_phenols", 
               "flavanoids", "nonflavanoid_phenols", "proanthocyanins", "color_intensity", "hue", "OD280_OD315_ratio", "proline")
colnames(raw_data) <- col_names
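
Column names that begin with a digit are not syntactic in R and have to be wrapped in backticks wherever they are referenced. As a minimal sketch (the `example_names` vector below is hypothetical and is not the project's column list), base R's `make.names()` can be used to flag such names before they are assigned:

```r
# Hypothetical column names, for illustration only
example_names <- c("cultivar", "alcohol", "0D280_ratio")

# make.names() returns a syntactic version of each name;
# a name is already syntactic when it comes back unchanged
is_syntactic <- example_names == make.names(example_names)

# Names that would need backticks in formulas or dplyr verbs
non_syntactic <- example_names[!is_syntactic]
print(non_syntactic)
```

Running a check like this right after defining the names makes it easier to catch a stray leading digit before it reaches a formula or a column range.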


## Exploratory Data Analysis

In [None]:
# Check the number of observations per cultivar
sample_size_cultivar <- raw_data %>%
    group_by(cultivar) %>%
    summarize(sample_size = n())

sample_size_cultivar # class 2 has more observations than the other classes -> need to balance the classes out

In [None]:
# balance the cultivar class 

# convert cultivar into a factor
raw_data <- raw_data |>
  mutate(cultivar = as.factor(cultivar))

# recipe for balancing
cultivar_balance_recipe <- recipe(cultivar ~ ., data = raw_data) |>
  step_upsample(cultivar, over_ratio = 1, skip = FALSE) |>
  prep()

# execute the balancing
data <- bake(cultivar_balance_recipe, raw_data)

# check the data is balanced
balanced_data <- data |>
  group_by(cultivar) |>
  summarize(n = n())

balanced_data # data has been upsampled so all groups have equal sample size


In [None]:
# EDA: use a plot to see the relationships between variables
options(repr.plot.width = 12, repr.plot.height = 30) # size the plot so that all panels are easily viewable
pairplots <- raw_data %>%
  ggpairs(progress = FALSE) +
  theme(
    text = element_text(size = 15),
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "bold")
  )
pairplots

In [None]:
# calculate the mean for each cultivar group
cultivar_mean_table <- raw_data |>
    group_by(cultivar) |>
    summarize(across(alcohol:proline, \(x) mean(x, na.rm = TRUE))) # anonymous function avoids the deprecated `...` argument of across()

cultivar_mean_table


In [None]:
# calculate the standard deviation for each cultivar group
cultivar_sd_table <- raw_data |>
    group_by(cultivar) |>
    summarize(across(alcohol:proline, \(x) sd(x, na.rm = TRUE))) # anonymous function avoids the deprecated `...` argument of across()

cultivar_sd_table


In [None]:
# calculate the maximum values for each cultivar group
cultivar_max_table <- raw_data |>
    group_by(cultivar) |>
    summarize(across(alcohol:proline, \(x) max(x, na.rm = TRUE))) # anonymous function avoids the deprecated `...` argument of across()

cultivar_max_table


In [None]:
# calculate the minimum values for each cultivar group
cultivar_min_table <- raw_data |>
    group_by(cultivar) |>
    summarize(across(alcohol:proline, \(x) min(x, na.rm = TRUE))) # anonymous function avoids the deprecated `...` argument of across()

cultivar_min_table
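
The four summary tables above (mean, standard deviation, maximum and minimum) can also be produced in a single pass. The sketch below is one possible way to do that in base R, using a small made-up data frame (`toy`) and a hypothetical helper (`four_stats`) rather than the wine data itself:

```r
# Small made-up data set standing in for the wine data (values are illustrative)
toy <- data.frame(
  cultivar = factor(c(1, 1, 2, 2, 3, 3)),
  alcohol  = c(14.2, 13.2, 12.4, 12.3, 12.9, 13.3),
  hue      = c(1.04, 1.05, 1.05, 0.92, 0.74, 0.57)
)

# Hypothetical helper returning the four statistics for one numeric vector
four_stats <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
}

# Apply the helper to every numeric column within each cultivar;
# each variable becomes a matrix column with one sub-column per statistic
stats_by_cultivar <- aggregate(. ~ cultivar, data = toy, FUN = four_stats)
print(stats_by_cultivar)
```

Individual values can then be read off by name, for example `stats_by_cultivar$alcohol[, "mean"]` for the per-cultivar mean alcohol content.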


In [None]:
# plotting all variables against flavanoids to see how the cultivars differ
eda_plot_data <- raw_data |>
    relocate(flavanoids, 1)

eda_plot_data <- eda_plot_data |>
        pivot_longer(
        cols= alcohol:proline,
        names_to="factor",
        values_to="values")

eda_plot1 <- eda_plot_data|>
    ggplot(aes(x=flavanoids,y=values,color=cultivar))+
    geom_point(alpha=0.35)+
    facet_grid(factor~.,scales="free")+
    labs(x="Flavanoid",y="Values",color="Cultivar")
eda_plot1


In [None]:
# Convert cultivar to a factor
raw_data$cultivar <- as.factor(raw_data$cultivar)

# Create a recipe
balance_recipe <- recipe(cultivar ~ ., data = raw_data) %>%
  step_upsample(cultivar) %>%
  prep()

balanced_data <- bake(balance_recipe, raw_data)

# Selecting flavanoids as a variable for visualization
options(repr.plot.width = 8, repr.plot.height = 6)

# Create a boxplot using ggplot2
ggplot(balanced_data, aes(x = cultivar, y = flavanoids, color = cultivar)) +
  geom_boxplot() +
  labs(title = "Flavanoids across Different Cultivars", x = "Cultivar", y = "Flavanoids")

In [None]:
# Select the list of variables ("flavanoids", "alcohol", "color_intensity") to visualize and compare
variables_to_plot <- c("flavanoids", "alcohol", "color_intensity")

plots_list <- lapply(variables_to_plot, function(var) {
  ggplot(balanced_data, aes(x = cultivar, y = .data[[var]], color = cultivar)) +
    geom_boxplot() +
    labs(title = paste(var, "across Different Cultivars"), x = "Cultivar", y = var) +
    theme_minimal()
})

plots_list

<br><br>

## Methods & Results
---
This project utilized a linear regression model to understand the effects of various chemical properties in order to predict what cultivar a wine was derived from. First, we read in data from the [UCI Machine Learning Repository](https://doi.org/10.24432/C5PC7J). It contains data about various wines from Italy derived from three different cultivars. Each row represents the chemical and physical properties of a different wine, such as its concentration of alcohol, magnesium level and hue.

We then tidied the data and balanced the classes of the classification variable we are interested in, the cultivar. Because the data set is relatively small, ensuring that each class has an equal number of observations helps prevent our model from being biased towards the dominant class. Next, we calculated summary statistics to facilitate exploratory data analysis, with the goal of finding key input variables for our model. 
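
To illustrate the idea behind class balancing, the sketch below upsamples a small made-up data set in base R by resampling the smaller classes with replacement until every class matches the largest one. It is a conceptual illustration only: the `toy` data, the seed and the `upsample_class` helper are assumptions made for this example, and it is not the implementation used by `themis`.

```r
set.seed(310)  # arbitrary seed so the resampling is reproducible

# Made-up imbalanced data: class "2" is larger than the others
toy <- data.frame(
  cultivar = factor(c(rep("1", 4), rep("2", 6), rep("3", 3))),
  alcohol  = round(runif(13, 11, 15), 1)
)

# Size of the largest class; every class is resampled up to this size
target_n <- max(table(toy$cultivar))

# Hypothetical helper: keep all rows of a class and add rows drawn with replacement
upsample_class <- function(rows) {
  extra <- target_n - length(rows)
  c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
}

# Split row indices by class, upsample each class, and rebuild the data frame
row_ids  <- split(seq_len(nrow(toy)), toy$cultivar)
balanced <- toy[unlist(lapply(row_ids, upsample_class)), ]

print(table(balanced$cultivar))
```

After resampling, each class in `balanced` has as many rows as the largest class in `toy`, at the cost of repeating some observations from the smaller classes.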


<br>

## Discussion
---
1. summarize what you found
2. discuss whether this is what you expected to find?
3. discuss what impact could such findings have?
4. discuss what future questions could this lead to?

<br>

## References
---
Aeberhard, S., & Forina, M. (1991). Wine. UCI Machine Learning Repository. https://doi.org/10.24432/C5PC7J

Bai, X., Wang, L., & Li, H. (2019). Identification of red wine categories based on physicochemical properties. International Conference on Educational Technology, Management, and Humanities Science, 1443-1448. https://doi.org/10.25236/etmhs.2019.30



Fehér, J., Lengyel, G., & Lugasi, A. (2007). The cultural history of wine—Theoretical background to wine therapy. Central European Journal of Medicine, 2(4), 379–391. https://doi.org/10.2478/s11536-007-0048



Harutyunyan, M., & Malfeito-Ferreira, M. (2022). The Rise of Wine among Ancient Civilizations across the Mediterranean Basin. Heritage, 5(2), Article 2. https://doi.org/10.3390/heritage50203


