# Predicting Wine Cultivars

<br><br>

## Summary
---
In this project, we predict which cultivar a wine was derived from based on its chemical properties.

The data was sourced from the [UCI Machine Learning Repository](https://doi.org/10.24432/C5PC7J). It contains data about various wines from Italy derived from three different cultivars. Each row represents the chemical and physical properties of a different wine, such as its concentration of alcohol, magnesium level and hue. 

<br>

## Introduction
---
1. provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
2. clearly state the question you tried to answer with your project
3. identify and describe the dataset that was used to answer the question

<br>

## Code and Analysis
---

In [None]:
# imports and libraries
library(GGally) # for ggpairs
library(tidyverse) # data wrangling and plotting (attaches dplyr, tidyr, ggplot2)
library(knitr) # to create tables
library(recipes) # for recipe(), prep() and bake()
library(themis) # for step_upsample(), to balance the cultivar classes

In [None]:
# reading in the data from the web (the file has no header row)
raw_data <- read.csv("https://raw.githubusercontent.com/DSCI-310-2024/DSCI-310-Group-5/main/data/wine.data", header = FALSE)

# name the columns based on the dataset description
# (OD280/OD315 uses the letter O, and the name must not start with a digit)
col_names <- c("cultivar", "alcohol", "malicacid", "ash", "alcalinity_of_ash", "magnesium", "total_phenols",
               "flavanoids", "nonflavanoid_phenols", "proanthocyanins", "color_intensity", "hue", "od280_od315_ratio", "proline")
colnames(raw_data) <- col_names


## Exploratory Data Analysis

In [None]:
# Check the number of observations per cultivar
sample_size_cultivar <- raw_data %>%
    group_by(cultivar) %>%
    summarize(sample_size = n())

sample_size_cultivar # class 2 has more observations than the other classes -> need to balance the classes out

In [None]:
# balance the cultivar class 

# convert cultivar into a factor
raw_data <- raw_data |>
  mutate(cultivar = as.factor(cultivar))

# recipe for balancing; upsampling is random, so fix a seed (arbitrary value) for reproducibility
set.seed(2024)
cultivar_balance_recipe <- recipe(cultivar ~ ., data = raw_data) |>
  step_upsample(cultivar, over_ratio = 1, skip = FALSE) |>
  prep()

# execute the balancing
data <- bake(cultivar_balance_recipe, raw_data)

# check the data is balanced
balanced_data <- data |>
  group_by(cultivar) |>
  summarize(n = n())

balanced_data # data has been upsampled so all groups have equal sample size
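
To make the balancing step less of a black box, here is a minimal base-R sketch of random upsampling on a small toy data frame. The `toy` data, its class counts, the helper `upsample_rows` and the seed are all hypothetical and chosen only for illustration. This is not the `themis` implementation, just the general idea of resampling the smaller classes with replacement until they match the largest class.

```r
# Toy data standing in for the wine table (hypothetical class counts)
set.seed(1)  # arbitrary seed so the sketch is reproducible
toy <- data.frame(
  cultivar = factor(rep(c("1", "2", "3"), times = c(5, 8, 3))),
  value    = seq_len(16)
)

# Every class is resampled up to the size of the largest class
target_n <- max(table(toy$cultivar))

# Keep all original rows of a class and draw the shortfall with replacement
upsample_rows <- function(idx, n) {
  extra <- idx[sample.int(length(idx), n - length(idx), replace = TRUE)]
  c(idx, extra)
}

rows_by_class <- split(seq_len(nrow(toy)), toy$cultivar)
balanced_idx  <- unlist(lapply(rows_by_class, upsample_rows, n = target_n),
                        use.names = FALSE)
toy_balanced  <- toy[balanced_idx, ]

table(toy_balanced$cultivar)  # every class now has target_n rows
```

Because the extra rows are copies of existing observations, upsampling adds no new information; it only changes how much weight each class carries.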


In [None]:
# EDA: use a plot to see the relationships between variables
options(repr.plot.width = 12, repr.plot.height = 30) # size the plots so they are easy to view
pairplots <- raw_data %>%
  ggpairs(progress = FALSE) +
  theme(
    text = element_text(size = 15),
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "bold")
  )
pairplots

In [None]:
# calculate the mean for each cultivar group
cultivar_mean_table <- raw_data |>
    group_by(cultivar) |>
    summarize(across(alcohol:proline, \(x) mean(x, na.rm = TRUE))) # extra args via across(...) are deprecated

cultivar_mean_table


In [None]:
# calculate the standard deviation for each cultivar group
cultivar_sd_table <- raw_data |>
    group_by(cultivar) |>
    summarize(across(alcohol:proline, \(x) sd(x, na.rm = TRUE))) # extra args via across(...) are deprecated

cultivar_sd_table


In [None]:
# calculate the maximum values for each cultivar group
cultivar_max_table <- raw_data |>
    group_by(cultivar) |>
    summarize(across(alcohol:proline, \(x) max(x, na.rm = TRUE))) # extra args via across(...) are deprecated

cultivar_max_table


In [None]:
# calculate the minimum values for each cultivar group
cultivar_min_table <- raw_data |>
    group_by(cultivar) |>
    summarize(across(alcohol:proline, \(x) min(x, na.rm = TRUE))) # extra args via across(...) are deprecated

cultivar_min_table
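
The four summary cells above differ only in the statistic they apply. As a possible alternative, the sketch below computes all four statistics in one pass with base R. It uses the built-in `iris` data as a stand-in for the wine table (an assumption made so the example runs without downloading anything), with `Species` playing the role of `cultivar`; the names `four_stats`, `stats_nested` and `stats_flat` are hypothetical.

```r
# Mean, standard deviation, minimum and maximum of one numeric column
four_stats <- function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

# aggregate() applies four_stats to every numeric column within each group;
# each result column is a 4-column matrix, so it is flattened afterwards
stats_nested <- aggregate(. ~ Species, data = iris, FUN = four_stats)
stats_flat   <- do.call(data.frame, stats_nested)

names(stats_flat)  # Species, then <variable>.<statistic> for each pair
stats_flat
```

One wide table like this makes it easier to compare the location and spread of a variable across groups than four separate tables do.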


In [None]:
# plotting all variables against flavanoids to see how the cultivars differ
# move flavanoids to the front so it falls outside the alcohol:proline range pivoted below
eda_plot_data <- raw_data |>
    relocate(flavanoids)

eda_plot_data <- eda_plot_data |>
        pivot_longer(
        cols= alcohol:proline,
        names_to="factor",
        values_to="values")

eda_plot1 <- eda_plot_data|>
    ggplot(aes(x=flavanoids,y=values,color=cultivar))+
    geom_point(alpha=0.35)+
    facet_grid(factor~.,scales="free")+
    labs(x="Flavanoid",y="Values",color="Cultivar")
eda_plot1


<br><br>

## Methods & Results
---
This project uses a linear regression model to relate a wine's chemical properties to the cultivar it was derived from. First, we read in the data from the [UCI Machine Learning Repository](https://doi.org/10.24432/C5PC7J). As described in the Summary, each row holds the chemical and physical properties of one Italian wine from one of three cultivars.

We then tidied the data and balanced the classes of the target variable, `cultivar`. The data set is small, so giving each class an equal number of observations helps keep the model from being biased towards the largest class. Next, we calculated summary statistics (mean, standard deviation, minimum and maximum per cultivar) for exploratory data analysis, with the goal of finding key input variables for our model.
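
One way to turn per-group means and standard deviations into a ranking of candidate input variables is a one-way ANOVA F statistic per variable: it is large when the group means differ a lot relative to the spread within each group. The sketch below is only an illustration of that idea, not the analysis performed in this report. It uses the built-in `iris` data as a stand-in for the wine table (an assumption), and the names `predictors` and `f_stat` are hypothetical.

```r
# iris stands in for the wine data: Species plays the role of cultivar
predictors <- setdiff(names(iris), "Species")

# One-way ANOVA F statistic for each predictor against the class label
f_stat <- sapply(predictors, function(v) {
  fit <- aov(iris[[v]] ~ iris$Species)
  summary(fit)[[1]][["F value"]][1]
})

# Variables with the largest F separate the classes most clearly
sort(f_stat, decreasing = TRUE)
```

A ranking like this looks at one variable at a time, so it can miss variables that are only useful in combination and should be read alongside the pair plots.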


    
6. creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned classification analysis
7. performs classification or regression analysis
8. creates a visualization of the result of the analysis
9. note: all tables and figure should have a figure/table number and a legend

<br>

## Discussion
---
1. summarize what you found
2. discuss whether this is what you expected to find?
3. discuss what impact could such findings have?
4. discuss what future questions could this lead to?

<br>

## References
---
Aeberhard, S., & Forina, M. (1991). Wine. UCI Machine Learning Repository. https://doi.org/10.24432/C5PC7J