# Understanding Apple Quality: A Prediction Based on Ripeness and Acidity
### by Paul Bunuan, Nathan Ng, Tyler Tan, Junyi Yao

# Introduction
Apples are a common fruit eaten by people every day, and many varieties are cultivated and harvested on farms across the world. In 2014, roughly 84 million tonnes of apples were produced (Musacchi & Serra, 2018). It is important for producers and sellers to maintain a quality standard for apples with regard to characteristics such as acidity, colour, shape, and weight. These characteristics influence sales and the likelihood that a consumer purchases the product again. There are different ways producers can measure the traits of apples. For instance, as apples ripen, they tend to become less acidic as their content of malic acid, the primary acid found in apples, decreases (Farcuh, 2018). However, acidity is not used to track apple maturity because it varies among varieties, and the optimal acid content for consumption differs as well (Farcuh, 2018). In our project, we aim to determine how the acidity and ripeness values of a given apple influence its overall quality status as “good” or “bad”. To answer this predictive question, we use a public dataset from Kaggle regarding apple quality (Elgiriyewithana, 2024).

This dataset contains nine columns: Apple ID, Size, Weight, Sweetness, Crunchiness, Juiciness, Ripeness, Acidity, and Quality. For the purposes of this report, only the Ripeness, Acidity, and Quality columns are used. We hypothesize that apples with a higher Ripeness value and a lower Acidity value are more likely to be classified as “good” quality apples.


# Methods & Results
## Methods
1. describe in written English the methods you used to perform your analysis from beginning to end that narrate the code that does the analysis.
2. your report should include code which:
    - loads data from the original source on the web
    - wrangles and cleans the data from its original (downloaded) format to the format necessary for the planned analysis
    - performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis
    - creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
    - performs the analysis
    - creates a visualization of the analysis

note: all tables and figures should have a figure/table number and a legend


### Analysis
First, we loaded the libraries needed for the analysis.

In [1]:
# Libraries used
library(tidyverse)
library(repr)
library(tidymodels)
library(janitor)
options(repr.matrix.max.rows = 6)

“package ‘ggplot2’ was built under R version 4.3.2”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom    

Then we read the data into a data frame called `apple`.

In [37]:

# Read the data. head(-1) removes the last row of the data frame,
# which contains NA values and text crediting the author.
apple <- read_csv("data/apple_quality.csv", show_col_types = FALSE) |>
    head(-1)
apple
print("Table 1: Apple dataset")

A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
0,-3.9700485,-2.512336,5.346330,-1.0120087,1.8449004,0.32983980,-0.491590483,good
1,-1.1952172,-2.839257,3.664059,1.5882323,0.8532858,0.86753008,-0.722809367,good
2,-0.2920239,-1.351282,-1.738429,-0.3426159,2.8386355,-0.03803333,2.621636473,bad
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
3997,-2.6345153,-2.138247,-2.4404613,0.6572229,2.199709,4.7638592,-1.334611391,bad
3998,-4.0080037,-1.779337,2.3663970,-0.2003294,2.161435,0.2144884,-2.229719806,good
3999,0.2785397,-1.715505,0.1212173,-1.1540748,1.266677,-0.7765715,1.599796456,good


[1] "Table 1: Apple dataset"


### Cleaning up the Data 

We have now read the CSV file where the data is stored. However, the data contains several columns that are irrelevant to our classification model, so we clean the data and then select only the columns the model needs.

BONUS: We noticed that `read_csv()` parsed the original `Acidity` column as character (`<chr>` in Table 1) rather than as numeric. This was most likely caused by non-numeric text in the file, such as the author credit in the last row, so we cleaned the column names and converted `acidity` to numeric with the following code.

In [38]:

apple_manipulated <- apple |>
    clean_names() |>
    mutate(acidity = as.numeric(acidity))
apple_manipulated
print("Table 2: Apple dataset (tidy)")

a_id,size,weight,sweetness,crunchiness,juiciness,ripeness,acidity,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0,-3.9700485,-2.512336,5.346330,-1.0120087,1.8449004,0.32983980,-0.4915905,good
1,-1.1952172,-2.839257,3.664059,1.5882323,0.8532858,0.86753008,-0.7228094,good
2,-0.2920239,-1.351282,-1.738429,-0.3426159,2.8386355,-0.03803333,2.6216365,bad
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
3997,-2.6345153,-2.138247,-2.4404613,0.6572229,2.199709,4.7638592,-1.334611,bad
3998,-4.0080037,-1.779337,2.3663970,-0.2003294,2.161435,0.2144884,-2.229720,good
3999,0.2785397,-1.715505,0.1212173,-1.1540748,1.266677,-0.7765715,1.599796,good


[1] "Table 2: Apple dataset (tidy)"
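
Because `as.numeric()` turns any text it cannot parse into `NA` (with only a warning), it can be worth confirming that the conversion did not silently lose values. The following is a minimal sketch on a hypothetical toy column, not on the apple data itself; the names `toy`, `toy_numeric`, `acidity_num`, and `failed` are illustrative assumptions.

```r
library(tidyverse)

# Hypothetical toy column: numbers stored as text plus one non-numeric entry
toy <- tibble(acidity = c("-0.491590483", "2.621636473", "not_a_number"))

# Convert to numeric; suppressWarnings() hides the "NAs introduced by coercion" warning
toy_numeric <- toy |>
    mutate(acidity_num = suppressWarnings(as.numeric(acidity)))

# Rows that had text originally but became NA after conversion
failed <- toy_numeric |>
    filter(is.na(acidity_num) & !is.na(acidity))

failed
```

If `failed` has zero rows for the real data, every original value was converted successfully.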


In [39]:
set.seed(4321)
# We selected the columns used in our data analysis: Ripeness, Acidity, and Quality.
# Note: no random sample is drawn in this cell; the table below contains every row of the data set.
apple_select <- apple_manipulated |>
    select(ripeness, acidity, quality)
apple_select
print("Table 3: Apple select dataset")

ripeness,acidity,quality
<dbl>,<dbl>,<chr>
0.32983980,-0.4915905,good
0.86753008,-0.7228094,good
-0.03803333,2.6216365,bad
⋮,⋮,⋮
4.7638592,-1.334611,bad
0.2144884,-2.229720,good
-0.7765715,1.599796,good


[1] "Table 3: Apple select dataset"
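
An exploratory summary and plot for the selected columns could look like the sketch below. It is only an illustration: it uses the six rows visible in Table 3 as toy data rather than the full data set, and the names `apple_toy`, `apple_summary`, and `apple_plot` are assumptions, as is the choice to convert `quality` to a factor before summarizing.

```r
library(tidyverse)

# Toy data: the six rows displayed in Table 3 (not the full data set)
apple_toy <- tibble(
    ripeness = c(0.32983980, 0.86753008, -0.03803333, 4.7638592, 0.2144884, -0.7765715),
    acidity  = c(-0.4915905, -0.7228094, 2.6216365, -1.334611, -2.229720, 1.599796),
    quality  = c("good", "good", "bad", "bad", "good", "good")
)

# Treat the label as a categorical variable (levels sort alphabetically: bad, good)
apple_toy <- apple_toy |>
    mutate(quality = factor(quality))

# Count and average each predictor within each quality class
apple_summary <- apple_toy |>
    group_by(quality) |>
    summarize(
        n = n(),
        mean_ripeness = mean(ripeness),
        mean_acidity = mean(acidity)
    )
apple_summary

# Scatter plot of the two predictors, coloured by quality
apple_plot <- ggplot(apple_toy, aes(x = ripeness, y = acidity, colour = quality)) +
    geom_point() +
    labs(x = "Ripeness", y = "Acidity", colour = "Quality")
apple_plot
```

Replacing `apple_toy` with `apple_select` would apply the same steps to the full data; the class counts in the summary also show whether the two classes are balanced.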


## Results

# Discussion
summarize what you found
discuss whether this is what you expected to find?
discuss what impact could such findings have?
discuss what future questions could this lead to?


# Sources 
Elgiriyewithana, N. (2024, January 11). Apple Quality. Kaggle. https://www.kaggle.com/datasets/nelgiriyewithana/apple-quality/data
    
Farcuh, M. (2023). Fruit Harvest - Determining Apple Fruit Maturity and Optimal Harvest Date.

Musacchi, S., & Serra, S. (2018). Apple Fruit Quality: Overview on pre-harvest factors. Scientia Horticulturae, 234, 409–430. https://doi.org/10.1016/j.scienta.2017.12.057