# COGS 118A - Final Project

# Classifying Grape Yields through Machine Learning Predictive Modeling

# Names
- Lucas Giumarra
- Sebastian Olivas Beltran
- Yu-Hsuan Chi
- Tom Hocquet

# Abstract

This project aims to develop a machine-learning model that can accurately classify grape varieties based on their yield. We plan to build two types of binary classification models to see which is more useful for making business decisions. Type 1 focuses on identifying high-yield vines, with “high” and “low” as its binary categories. Type 2 focuses on identifying low-yield vines, with “good” and “bad” as its binary categories. The data used in the project is grape yield data, measured in boxes per acre, and includes features such as climate conditions, the color of the grape, and the age of the vine. The model will be trained on a subset of this data, with the aim of learning to identify patterns in the data that are associated with different yield levels. Once the model is trained, it will predict the yield level of new, unseen data.
Ultimately, the goal is to create a reliable and accurate classification model that can assist grape growers in making informed decisions about their vineyard management practices. We plan to use 3 different algorithms for our models: Support Vector Machine, Gradient Boosting, and Random Forest. The data will be split into 80% training and 20% testing. To tackle our problem, our main metrics will be accuracy, precision, recall, and F1 score.


# Background

Our research area revolves around the world of viticulture, particularly grape production in vineyards. With a growing demand for high-quality grapes and wineries, along with their economic importance, the efficiency of wine and vineyard operations relies heavily on the models that help predict yield production in certain regions around the world[<sup>1</sup>](https://www.researchgate.net/profile/Manisha-Sirsat/publication/334387643_Machine_Learning_predictive_model_of_grapevine_yield_based_on_agroclimatic_patterns/links/5d3af058a6fdcc370a621a0a/Machine-Learning-predictive-model-of-grapevine-yield-based-on-agroclimatic-patterns.pdf). Viticulture as a whole has been a field of study tackled by machine learning on many occasions. Whether it’s detecting diseases in a grapevine’s leaves or using deep learning and neural networks to classify patterns in their early growing stages, models have been trained to make predictions and be evaluated by ML metrics such as accuracy or precision rates[<sup>2</sup>](https://www.researchgate.net/profile/Jerry-Gao/publication/348023331_Grape_Leaf_Disease_Detection_and_Classification_Using_Machine_Learning/links/6096c83a92851c490fc748d5/Grape-Leaf-Disease-Detection-and-Classification-Using-Machine-Learning.pdf). In turn, many of these models were created to make predictions on grape yields based on certain criteria such as weather conditions or pesticide use. 
One predictive model that was based on agroclimatic patterns used 4 machine learning methods (LASSO, ElasticNet, SpikesLab, and RandomForest) to predict grapevine yield. After performing feature selection and cross-validation, the authors narrowed down the variables to those most significant for their problem statement. They outlined the specific function, training process, and importance of all 4 predictive methods in order to compare their calculated RMSE results based on selected variables. Using the model’s behavior and RMSE values for flowering, coloring, and harvest phenologies, they concluded that “meteorology is the key relation in measuring quantity of grapes” [<sup>1</sup>](https://www.researchgate.net/profile/Manisha-Sirsat/publication/334387643_Machine_Learning_predictive_model_of_grapevine_yield_based_on_agroclimatic_patterns/links/5d3af058a6fdcc370a621a0a/Machine-Learning-predictive-model-of-grapevine-yield-based-on-agroclimatic-patterns.pdf). Drawing inspiration from this, we will also use a few predictive methods such as Gradient Boosting, SVM, and RandomForest. Instead of calculating RMSE, however, we will build confusion matrices and use F1 scores to evaluate our models. We will classify the grape yields into a respective category (“high” and “low” as Type 1 or “good” and “bad” as Type 2) using a range of values based on certain variables such as boxes per acre and whether the grapevine was grafted, among others. Another key difference in our approaches is our use of data. Instead of using a combination of climatic variables, plotting, and phenological stage datasets, we will use a main plotting dataset, which narrows our focus and steers us away from a detailed discussion of weather patterns.

A key problem is that these predictive models can be too complex and computationally expensive. Despite the growing demand, these models have failed to consider factors such as particular practices that can have a positive or negative impact on production rates or even grape varieties and the acres used to produce them. It’s also important to note that comparing current yield models to previous yield models can lead to several misinterpretations, meaning new ones must be trained on a daily basis. Too many confounding variables come into play such as limited varieties in a certain number of parcels at a certain time, grapevine locations, and the availability of workers[<sup>3</sup>](https://www.mdpi.com/2073-4395/12/10/2463). For instance, climate change has a severe impact on wine production, which may lead to unpredictable and inconsistent data collection throughout the seasons. The models are prone to large errors and can lead to misinformation regarding the best course of action for workers on the field. Even with the technology available today, it’s quite difficult to gather this type of data and accurately pinpoint financial gains/losses based on more or less production yields. 
To tackle these limitations in our project, we plan to narrow our scope in terms of the variables we analyze (such as yield and boxes per acre). The 23 unique ranches in our dataset are relatively close to each other, so drastic weather differences between them are not expected to be an issue. Our models should also be less computationally expensive than the prior work because our machine learning methods are simpler and more straightforward. Each model will still be trained, analyzed, and scored for accuracy. Also, while some of the prior models can be prone to large errors, our model can reduce this risk because we plan on classifying yield using ranges of values rather than trying to compute an actual yield value. A prediction only needs to fall in the correct range rather than match an exact number, which should lead to a relatively high accuracy rate. 
Thus, answering our question through a business lens is key to creating, evaluating, and drawing conclusions from our predictive yield model. The higher the yield, the better the grape quality and quantity which results in more revenue for the company/vineyard.

# Problem Statement

Grape yield is an important factor for grape growers as it directly impacts their revenue and profitability. Currently, grape growers rely on their experience and knowledge of vineyard management practices to estimate the yield level, which can be subjective and prone to errors. Therefore, an accurate yield classification model can greatly assist grape growers in making informed decisions about their vineyard management practices. To ensure the problem is quantifiable, we will define the yield levels based on a specific range of boxes per acre. We will also use a set of objective features to describe the grape varieties, such as temperature, vine age, grape color, and ranch location. We plan to train classification models (random forest, support vector machine, and gradient boosting) to predict whether a batch of grapevines would produce a high/low and good/bad yield given variables such as its variety, color, daily high temperatures, daily low temperatures, precipitation, etc. A vine is categorized as “high” yielding if its yield is greater than 175 boxes per acre, and “low” yielding otherwise. Furthermore, a vine is considered to be “good” if its yield is greater than 55 boxes per acre and “bad” otherwise.
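The two labeling rules above can be sketched in a few lines of Python. This is only an illustration of the thresholds stated in this section; the function and constant names are hypothetical and are not taken from the project notebooks.

```python
# Thresholds (boxes per acre) as stated in the problem statement.
HIGH_THRESHOLD = 175
GOOD_THRESHOLD = 55


def label_high_low(boxes_per_acre: float) -> str:
    """Type 1 label: "high" only when yield is strictly above 175."""
    return "high" if boxes_per_acre > HIGH_THRESHOLD else "low"


def label_good_bad(boxes_per_acre: float) -> str:
    """Type 2 label: "good" only when yield is strictly above 55."""
    return "good" if boxes_per_acre > GOOD_THRESHOLD else "bad"


# A vine yielding 100 boxes per acre is "low" under Type 1 but "good" under Type 2.
example = (label_high_low(100), label_good_bad(100))
```

Note that a yield exactly at a threshold falls into the lower class, since both rules use a strict "greater than" comparison.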

# Data

The data for this project is sourced from a single vineyard and consists of three separate datasets that will be merged and analyzed. The production dataset contains 4129 weekly observations of grape collection, including the number of boxes filled, grape variety, and harvest date. The grafting dataset provides information on fields, including acreage, grafting, and other characteristics. The weather dataset includes average daily high and low temperatures for the week and precipitation amounts for the week. These measurements were recorded near the vineyard location for the years 2015 to 2022.

The critical variables for this project are the box count and acreage which will be important for understanding the overall productivity of the vineyard. The production week will also be an important variable as it could give some insights into harvesting patterns. Furthermore, the average daily high and low temperatures for the week could also be important variables as they could have the greatest measurable effect on the productivity of the vineyard from the variables that we have. Due to the number and nature of the datasets, significant data wrangling will be required before performing the analysis and modeling. This will involve merging the datasets, cleaning and transforming the data, and handling missing or erroneous data. 
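As a rough sketch of the merging step, the production and grafting tables could be joined and a boxes-per-acre value derived as shown below. The tiny inline tables are made-up rows, and the choice of join keys (`ranch_no`, `ranch_sub`, `var_cd`) is an assumption based on the columns the two datasets share; the actual cleaning steps are in the linked notebook PDF.

```python
import pandas as pd

# Made-up rows that mimic a few columns of production.csv and graft.csv.
production = pd.DataFrame({
    "ranch_no": [10, 10, 17],
    "ranch_sub": ["NW", "SW", "NW"],
    "var_cd": [101, 102, 101],
    "prod_yr": [2020, 2020, 2021],
    "var_boxes": [900, 150, 400],
})
graft = pd.DataFrame({
    "ranch_no": [10, 10, 17],
    "ranch_sub": ["NW", "SW", "NW"],
    "var_cd": [101, 102, 101],
    "acres": [5.0, 3.0, 8.0],
    "grafted": [True, False, True],
})

# Assumed join keys: the columns shared by both datasets.
keys = ["ranch_no", "ranch_sub", "var_cd"]

# validate="many_to_one" raises if a field appears more than once in graft,
# which guards against silently duplicating production rows.
merged = production.merge(graft, on=keys, how="inner", validate="many_to_one")

# Yield expressed in boxes per acre, the quantity the labels are based on.
merged["boxes_per_acre"] = merged["var_boxes"] / merged["acres"]
```

The weather table would need to be aggregated to the same weekly granularity before a similar join; that step is omitted here.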

#### *BOLD TEXT*  = Critical variable 

Production.csv -  4129 rows x 12 columns | https://github.com/COGS118A/Group036-Wi23/blob/main/production.csv 

Observations:
- __prod_wk - A date corresponding to the week in which a particular vine was picked__
- ranch_no - The ranch number where the vine is growing
- ranch_sub - Where on the ranch the vine is located (NW - northwest, SW - southwest, etc) 
- var_cd - The code number corresponding to the variety of the vine. 
- __variety_desc - The variety description of the vine__
- __var_boxes - The number of boxes of grapes harvested for the week__
- color - The color of the grape (Green, Red, Black)
- prod_date_ct - The number of days in a week that a particular vine was harvested
- prod_wk_no - The week number for the year
- prod_yr - The production year
- min_date - first day of harvesting a vine for the week
- max_date - last day of harvesting a vine for the week

Graft.csv - 161 rows × 7 columns | https://github.com/COGS118A/Group036-Wi23/blob/main/graft.csv 

Observations:
- ranch_no - The ranch number where the vine is growing
- ranch_sub - Where on the ranch the vine is located (NW - northwest, SW - southwest, etc) 
- var_cd - The code number corresponding to the variety of the vine. 
- __variety_desc - The variety description of the vine__
- grafted - A boolean expression for whether the vine was planted or grafted (True = grafted, False = planted) 
- __acres - The number of acres that were planted__
- Year_planted_grafted - The year the vine was planted or grafted. 

Temps.csv -  2909 rows × 4 columns | https://github.com/COGS118A/Group036-Wi23/blob/main/temps.csv 

Observations:
- Date - Year/Month/Day 
- __*Daily_High - The measured daily high temperature (Fahrenheit)*__  
- __*Daily_Low - The measured daily low temperature (Fahrenheit)*__ 
- Precipitation - The measured precipitation for the day. 

Special handling, transformations, cleaning, etc.

Merged Dataframe: 3366 rows × 66 columns 

Coding Steps (Data Cleaning):
https://github.com/COGS118A/Group036-Wi23/blob/main/Dataset.pdf 

Written Data Cleaning Observations: 
- __Year - the year a vine was harvested__
- __Month - the month a vine was harvested__
- __Day - the day a vine was harvested__
- __ranch_(*ranch number*)  - (One-hot encoded) The ranch where the vine is grown (accounts for 23 columns)__
- __variety_(*variety name*)  - (One-hot encoded) The variety of the vine (accounts for 32 columns)__
- __color_(*grape color*) - (One-hot-encoded) The color of grape the vine grows (accounts for 3 columns)__ 
- __Year_planted_grafted - The year the vine was planted or grafted.__
- Age - The age of the vine in years (prod_yr - Year_planted_grafted)
- Daily_High - The measured daily high temperature (Fahrenheit) 	
- Daily_Low - The measured daily low temperature (Fahrenheit) 	
- Precipitation - The measured precipitation for the day. 
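The one-hot encoded columns and the `Age` feature listed above could be produced along the following lines. This is a sketch on made-up rows, assuming pandas `get_dummies` with the `ranch_`, `variety_`, and `color_` prefixes and assuming age is the production year minus the planting or grafting year; the project's own cleaning code may differ.

```python
import pandas as pd

# Made-up rows; the column names follow the dataset descriptions above.
df = pd.DataFrame({
    "ranch_no": [10, 17, 10],
    "variety_desc": ["RED GLOBE", "CRIMSON SEEDLESS", "RED GLOBE"],
    "color": ["RED", "RED", "GREEN"],
    "prod_yr": [2020, 2021, 2022],
    "Year_planted_grafted": [2010, 2015, 2018],
})

# Assumed definition: vine age in years at harvest time.
df["Age"] = df["prod_yr"] - df["Year_planted_grafted"]

# One indicator column per distinct ranch, variety, and color.
encoded = pd.get_dummies(
    df,
    columns=["ranch_no", "variety_desc", "color"],
    prefix=["ranch", "variety", "color"],
)
```

With the full dataset this kind of encoding would yield the 23 ranch, 32 variety, and 3 color columns described above.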


# Proposed Solution

To address the challenge of accurately predicting grape yield levels, we plan to use three different models: random forest, support vector machine (SVM), and gradient boosting. Random forest is a suitable model for the project domain, as it can handle both categorical and numerical features and can handle interactions between these features. SVM is a powerful algorithm that works well with high-dimensional feature spaces, making it suitable for the grape yield classification problem. Gradient boosting is a boosting algorithm that builds an ensemble of weak models and gradually improves their performance by optimizing a loss function. By comparing the performance of these three models, we aim to select the best-performing model for the grape yield classification task.

We will evaluate the models using metrics such as accuracy, precision, recall, and F1-score. These metrics will be used to compare the performance of the models using both binary classification types, i.e., high/low and good/bad. Historical grape yield data will be used to train and validate the models. The dataset will be split into training and testing sets, and the models' accuracy will be evaluated using cross-validation techniques.
Once the best-performing model is selected, it could be used to evaluate the yield levels of grapevines and grape growers can use this knowledge to make informed decisions about their vineyard management practices, leading to improved revenue and profitability. The selected model will be tested by evaluating its performance on a holdout set of grape yield data to ensure that it generalizes well to new data.
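The comparison described in this section might look roughly like the sketch below, which uses scikit-learn on synthetic data rather than the vineyard dataset. The hyperparameters, the use of `LinearSVC` with feature scaling, and the synthetic class balance are illustrative assumptions, not the settings used in the project notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the merged dataset (roughly 55% / 45% classes).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=5,
    weights=[0.55, 0.45], random_state=0,
)

# 80% training / 20% testing, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    # Linear SVMs are sensitive to feature scale, so scale inside the pipeline.
    "linear_svm": make_pipeline(
        StandardScaler(), LinearSVC(random_state=0, max_iter=5000)
    ),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation on the training portion only.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    # Refit on the full training set, then score once on the holdout set.
    model.fit(X_train, y_train)
    results[name] = {
        "cv_mean": cv_scores.mean(),
        "cv_std": cv_scores.std(),
        "test_accuracy": model.score(X_test, y_test),
    }
```

Keeping the holdout set out of cross-validation means the final test accuracy is measured on data that played no part in model selection.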


# Evaluation Metrics

To evaluate the model's performance, we will split the dataset into a training set and a test set. The training set will be used to train the model, and the test set will be used to assess its performance. We will use a confusion matrix and several metrics such as accuracy, precision, and recall to evaluate the model.

Precision measures the accuracy of positive predictions and is calculated as true positives / (true positives + false positives). Recall measures the ability of the model to identify positive instances correctly and is calculated as true positives / (true positives + false negatives). In our project, a false positive occurs when a plant with a low yield class is incorrectly predicted to have a high yield class, while a false negative occurs when a plant with a high yield class is incorrectly predicted to have a low yield class.
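The formulas above, together with the F1 score (the harmonic mean of precision and recall), can be written out as a small helper. The counts in the example are invented for illustration and are not results from our models.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented counts: 40 true positives, 10 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
# p = 0.8, r = 2/3, f1 = 80/110
```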


# Results

### Regression or Classification

The dataset in question consists of temporal production data covering approximately 7 years. When developing a model to analyze this dataset, there are two possible approaches: one can create a model to predict the number of boxes produced, or one can make a model to classify a vine's yield status.

While it may initially seem more beneficial to create a model that predicts the number of boxes produced, the high degree of variability associated with grape production - which can be influenced by numerous factors such as climate, soil, water availability, pests, diseases, and vine age - makes it difficult to accurately model and predict the exact number of boxes that will be produced.

As a result, it may be more practical and valuable to develop a model that classifies a vine's yield status based on the available data. This approach involves dividing the production data into categories or classes based on certain thresholds, such as high or low yield. By identifying patterns and relationships in the data that are associated with different yield levels, a model can be developed that accurately predicts a vine's yield status based on these patterns.

Appropriate machine learning algorithms for classification include support vector machine, gradient boosting, and random forest. We use these algorithms to train and test models on the data. Overall, by creating a model that classifies a vine's yield status, vineyard owners can gain valuable insights into the factors that affect grape production, which can help them make informed decisions regarding vineyard management and resource allocation.


### Binary or Multi-Class Classification  
After considering the nature of the problem, we need to determine whether a binary or multi-class classification approach is best suited for predicting the yield of vines. If we opt for binary classification, we can easily determine whether a vine has a 'good' or 'bad' yield, providing a straightforward and clear outcome. In contrast, a multi-class classification model can provide a more nuanced view of the yield by identifying 'high,' 'medium,' and 'low' yielding vines, enabling more detailed insights.

However, it's important to note that the complexity of a multi-class classification model can affect its accuracy, and it may not necessarily provide a more accurate outcome. Therefore, after experimenting with both approaches, we determined that a simpler binary classification model was more effective in accurately predicting the yield category of the vines. The binary models consistently achieved higher accuracy, making binary classification the better approach for this specific problem.

## Model Performance (Random Forest Classifier)
This model utilizes two types of binary classification labels: "high and low" and "good and bad." The "high and low" label is determined by categorizing yields as either "high" (yield of over 175 boxes per acre) or "low" (yield of 175 boxes per acre or under). The "good and bad" label is determined by categorizing yields as either "good" (yield of over 55 boxes per acre) or "bad" (yield of 55 boxes per acre or below).

According to the learning curve of the random forest model, while using the “good and bad” labels, the cross-validation accuracy increases slightly as the number of training samples increases. Using the “high and low” labels, the cross-validation accuracy does not seem to increase even as we increase the number of samples used in training. Therefore, it does not seem that obtaining more data for the dataset will necessarily increase the accuracy of the model significantly. We also plotted the performance of the models across folds of the cross-validation. We used a 20-fold cross-validation since our relatively small dataset allowed us to train the model quickly.
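A learning curve with 20-fold cross-validation, as described above, can be computed roughly as follows. The sketch uses scikit-learn's `learning_curve` on synthetic data; the forest size, the training-size grid, and the stratified shuffled folds are assumptions for illustration rather than the notebook's exact settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic stand-in for the merged vineyard dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# 20 folds, stratified so each fold keeps the overall class ratio.
cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    cv=cv,
    train_sizes=np.linspace(0.2, 1.0, 5),  # 20% to 100% of each training fold
    scoring="accuracy",
)

# One row per training size, one column per fold.
mean_val = val_scores.mean(axis=1)
std_val = val_scores.std(axis=1)
```

If `mean_val` flattens out as `train_sizes` grows, collecting more data of the same kind is unlikely to raise accuracy much, which is the reading applied to the “high and low” curve above.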

### Random Forest (“Good” and “Bad”):
The accuracy of the model using a simple train and test split is around 0.87 for predicting “high and low,” and 0.76 for predicting “good and bad.” The cross-validation score is around 0.78 with a standard deviation of 0.07 for predicting “high and low,” and 0.57 with a standard deviation of 0.08 for predicting “good and bad.” According to the feature importance of the model, the top five most important features are 'year_planted_grafted', 'age', 'Daily_High', 'Daily_Low', and 'day'.
(https://github.com/COGS118A/Group036-Wi23/blob/main/Random%20Forest.ipynb)

![Screenshot%202023-03-22%20at%208.34.00%20PM.png](attachment:Screenshot%202023-03-22%20at%208.34.00%20PM.png)

### Random Forest (“High” and “Low”):
While classifying high and low, the model only has a 0.53 recall for “high” and a 0.95 recall for “low.” This is likely due to the fact that around 81% of the data points are labeled as low. However, while classifying good and bad, the model has a 0.71 recall for “bad” and a 0.8 recall for “good.” Considering that the data for “good and bad” is relatively balanced, with around 45% of data labeled as “bad” and 55% as “good,” the results should not be nearly as heavily affected by data imbalance as the results from classifying “high and low.”
(https://github.com/COGS118A/Group036-Wi23/blob/main/Random%20Forest.ipynb)

![Screenshot%202023-03-22%20at%208.38.37%20PM.png](attachment:Screenshot%202023-03-22%20at%208.38.37%20PM.png)
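One way to make the effect of imbalance concrete is balanced accuracy, which is the unweighted mean of the per-class recalls. The short calculation below uses only the recall values and class proportion quoted in the text above; the helper name is hypothetical, and this metric was not part of our original evaluation.

```python
def balanced_accuracy(recalls: list[float]) -> float:
    """Unweighted mean of per-class recall values."""
    return sum(recalls) / len(recalls)


# Recall values quoted in the text above.
high_low = balanced_accuracy([0.53, 0.95])  # "high", "low"   -> 0.74
good_bad = balanced_accuracy([0.71, 0.80])  # "bad", "good"   -> 0.755

# Accuracy of always predicting "low", given that about 81% of points are "low".
majority_baseline_high_low = 0.81
```

Seen this way, the two labelings are much closer than the raw accuracies of 0.87 and 0.76 suggest, and the 0.87 accuracy for “high and low” sits only a few points above the 0.81 majority-class baseline.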

## Model Performance (SVM & Gradient Boosting)

These models utilize the same two binary classification labels as the Random Forest Classifier. Once again, the "high and low" labels are determined by categorizing yields as either "high" (yield of over 175 boxes per acre) or "low" (yield of 175 boxes per acre or under). The "good and bad" labels are determined by categorizing yields as either "good" (yield of over 55 boxes per acre) or "bad" (yield of 55 boxes per acre or below). There is about an 18% to 82% split for the “high and low” labels respectively and a 55% to 45% split for the “good and bad” labels respectively. 

### Gradient Boosting (“High” and “Low”):
The gradient boosting classifier used to predict the binary labels "high" and "low" achieved high precision but a lower recall score. Furthermore, the model achieves a high F1 score (0.92) for the "low" class, but a lower score (0.49) for the "high" class. The accuracy score on the held-out test set is 0.86, which indicates that the model correctly predicts the binary labels for approximately 86% of the samples in the test set. The 5-fold cross-validation score shows an average accuracy of 0.77 (+/- 0.12). The cross-validated mean is lower than the held-out accuracy, and the spread of 0.12 indicates that performance varies noticeably across the 5 different splits of the data. 


(https://github.com/COGS118A/Group036-Wi23/blob/main/GradientBoosting_high.ipynb)

![Screenshot%202023-03-22%20at%208.39.43%20PM.png](attachment:Screenshot%202023-03-22%20at%208.39.43%20PM.png)

### SVM Classifier (“High” and “Low”):

The SVM classifier used to predict the binary labels "high" and "low" achieved a moderate precision and recall score. The model achieves a high F1 score (0.82) for the "low" class, but a lower score (0.51) for the "high" class. The accuracy score on the held-out test set is 0.74, which indicates that the model correctly predicts the binary labels for approximately 74% of the samples in the test set. The 5-fold cross-validation score shows an average accuracy of 0.70 (+/- 0.16). The spread of 0.16 is the largest among our models, which indicates that performance varies considerably across the 5 different splits of the data.

(https://github.com/COGS118A/Group036-Wi23/blob/main/SVM_high.ipynb) 

![Screenshot%202023-03-22%20at%208.40.51%20PM.png](attachment:Screenshot%202023-03-22%20at%208.40.51%20PM.png)

### Gradient Boosting (“Good” and “Bad”):

The gradient boosting classifier used to predict the binary labels "good" and "bad" achieved high precision for both classes, with a slightly higher precision score for the "good" class. The model also achieved a higher recall score for the "good" class, indicating that it was able to identify more of the positive samples correctly. The F1 score was high for both classes, with a slightly higher score for the "good" class. The accuracy score on the held-out test set was 0.75, which indicates that the model correctly predicted the binary labels for approximately 75% of the samples in the test set. The 5-fold cross-validation score shows an average accuracy of 0.67 (+/- 0.07), which suggests that performance is fairly stable across the 5 different splits of the data, although lower than on the single held-out test set.

(https://github.com/COGS118A/Group036-Wi23/blob/main/Gradient_Boosting_Final.ipynb)

![Screenshot%202023-03-22%20at%208.43.10%20PM.png](attachment:Screenshot%202023-03-22%20at%208.43.10%20PM.png)

### SVM Classifier (“Good” and “Bad”):

The SVM classifier used to predict the binary labels "good" and "bad" achieved moderate precision for both classes (0.59, 0.70), with a slightly higher precision score for the "good" class. The model also achieved a higher recall score for the "bad" class, indicating that it was able to identify more of the negative samples correctly. The F1 score was similar for both classes. The accuracy score on the held-out test set was 0.64, which indicates that the model correctly predicted the binary labels for approximately 64% of the samples in the test set. The 5-fold cross-validation score shows an average accuracy of 0.58 (+/- 0.08), which is only marginally above the roughly 0.55 that always predicting "good" would achieve. 

(https://github.com/COGS118A/Group036-Wi23/blob/main/SVM_final.ipynb)

![Screenshot%202023-03-22%20at%208.43.44%20PM.png](attachment:Screenshot%202023-03-22%20at%208.43.44%20PM.png)

### Important Features
#### Top 5 Feature Importance Scores for High/Low Classification: 

SVM: (https://github.com/COGS118A/Group036-Wi23/blob/main/Important_features%20(1).ipynb)

1. variety_CRIMSON SEEDLESS:
 0.19763333262158764
2. variety_IVORY - SHEEGENE 21:
 0.1953983031917974
3. ranch_19:
 0.1791559462239295
4. ranch_10:
 0.17568194039154975
5. ranch_17:
 0.1499427468970572

Gradient Boosting:
(https://github.com/COGS118A/Group036-Wi23/blob/main/Important_features%20(1).ipynb)
1. age:
 0.19526792552393726
2. variety_IVORY - SHEEGENE 21:
 0.10209010352263745
3. ranch_37:
 0.07067177963136012
4. variety_EXPERIMENTAL 95B-59+6:
 0.058431734030680575
5. variety_RED GLOBE:
 0.05397218069270886

Random Forest:
(https://github.com/COGS118A/Group036-Wi23/blob/main/Random%20Forest.ipynb)
1. year_planted_grafted:
  0.12196435
2. age:
  0.1162755
3. Daily_Low:
  0.10102425
4. Daily_High:
  0.09675002
5. day:
  0.09110464

#### Top 5 Feature Importance Scores for Good/Bad Classification: 
	
SVM:
1. month:
 0.24639531056087258
2. ranch_36:
 0.14590810144829414
3. ranch_17:
 0.14037501540564085
4. variety_ALLISON - SHEEGENE 20:
 0.12215982660312553
5. color_RED:
 0.1115153902773848


Gradient Boosting:
1. year_planted_grafted:
 0.1897613330328271
2. month:
 0.07609958395177072
3. color_RED:
 0.06818481505278351
4. day:
 0.058194751369134136
5. age:
 0.057377626733544594


Random Forest:
1. day:
  0.1222436
2. Daily_High:
  0.11911654
3. Daily_Low:
  0.1184155
4. age:
  0.11139227
5. Year_planted_grafted:
  0.09183314
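For tree ensembles, a top-5 ranking like the ones above can be read from the fitted model's `feature_importances_` attribute. The sketch below does this for a random forest on synthetic data with placeholder feature names; it is not the code from the linked notebooks, and the SVM scores listed above were necessarily obtained differently, since SVMs do not expose this attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names standing in for the 66 merged columns.
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = forest.feature_importances_

# Indices of the five largest importances, in descending order.
order = np.argsort(importances)[::-1][:5]
top5 = [(feature_names[i], float(importances[i])) for i in order]
```

Because the importances sum to 1 across all features, a score near 0.12 in a 66-column dataset already marks a feature as well above average.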
  
### Analysis of the Results
Across all algorithms, the Random Forest Classifier obtained the best accuracy, followed closely by Gradient Boosting and then Linear Support Vector Classifier. However, regarding the results of the random forest model, we believe that it is only reasonable to consider its performance on classifying “good and bad,” since the “high and low” labels are imbalanced. While classifying with the random forest model for “high and low,” the low 0.53 recall for “high” and the high 0.95 recall for “low” serve as evidence that the model was heavily affected by the imbalanced data.

## Discussion
### Interpreting the result

Main Point: Out of the 3 trained models, the Random Forest model, followed closely by the Gradient Boosting model, garnered the best results using the “good” and “bad” yield labels.
We can see from the results above that our best results come from using the Random Forest Classifier with the “good” and “bad” yield labels. Because the “good” and “bad” labels were more balanced than the “high” and “low” labels, the results for this labeling are less distorted by class imbalance. While the difference between the Random Forest and Gradient Boosting models was relatively small (approximately 9% after cross-validation), we considered other factors that indicated that Random Forest still had the edge. With the random forest, the factors that matter most are the age of the vine, when it was grafted, and the weather. This differs sharply from SVM and gradient boosting, whose top factors were mostly linked to the type of grape and where it was planted. The factors surfaced by the Random Forest seem more valuable and more interesting than the factors resulting from the other models.


Secondary Point 1: Having the simplicity of a binary classification of “good” and “bad” or “high” and “low” can lead to clearly interpretable models and thus, more effective business applications. 
The binary classification of “good” and “bad” makes it easy to see whether a vine is a good vine or a bad vine that needs to be uprooted. From a business standpoint, a model that classifies vines in this way helps determine which grapevine varieties are going to increase or decrease profit. All the confusion matrices show a clear distinction between the predicted and the actual labels, allowing us to quickly notice any imbalances in the data (such as in the Random Forest model based on the “high” and “low” labels) or any evident trends. One clear limitation is that some grapes may be inherently more valuable per box filled, but this can be accounted for with the right data. Overall, simplicity makes the results easier to comprehend and lets them be used in numerous different applications.


Secondary Point 2: An additional point to make is that the Gradient Boosting models still generated solid classification metrics that, while not as accurate as Random Forest, can still effectively classify grape yields as “high” or “low” and “good” or “bad”. On the other hand, the SVM models did not have high enough metrics to truly be on par.
	The Gradient Boosting models performed relatively well. For the “high” vs “low” classification, the model had a high precision and lower recall rate. It had an 86% accuracy rate, but after 5-fold cross validation, this was reduced to 77%. For the “good” vs “bad” model, there were high precision rates for both classes and the accuracy rate went from 75% to 67% also after 5-fold CV. As previously stated, the Gradient Boosting models evidently hold merit and almost reach the metrics of the Random Forest models. The pros and cons of each model are almost identical when it comes to the variety of the data, their evaluation metrics, and their prominent features. The SVM models do have slightly bigger differences. They have moderate precision and recall metrics for both the “high” vs “low” (74% → 70% accuracy after CV) and the “good” vs “bad” (64% → 58% accuracy after CV) models. It’s important to discuss these differences because it comes to show how 


Secondary Point 3: The “Year_planted_grafted”, “age”, “Daily_High”, and “Daily_Low” are important features that have been pinpointed by several of the models discussed.
	Using the variety of models in this project, we have seen a pattern among the important features. The most common important features across the models are the year the plant was grafted, how old the plant is, and the temperature (highs and lows). These factors appear consistently in most of the models and are some of our best indicators of the yield of the plants. We see that the SVM, which is one of our more moderate models, has the location of the plant and the variety among its top factors, while the gradient boosting and the Random Forest are more 


### Limitations
One of the primary limitations of our project is the size of our dataset. Despite our best efforts to collect as much data as possible, our dataset is still relatively small, which can limit the accuracy of our models. Furthermore, our dataset only contains data from as early as 2015, which means that we do not have access to any historical data from before this time. This lack of historical data can limit our ability to make accurate predictions and identify long-term trends.

In addition to these limitations, we also faced challenges with the level of detail in our data. For example, we do not have information about the exact dates when plants were planted and removed, which can make it difficult to understand the specific factors that contribute to the yields of individual fields. Furthermore, there are many factors that can impact the growth of vines, such as soil quality and water conditions, that are not fully captured in our dataset. Incorporating these additional factors into our analysis could help us gain a more comprehensive understanding of the various factors that contribute to yield optimization.

Lastly, in terms of our best model selection, while the Random Forest Classifier performed the best using “good” and “bad” yield levels, we encountered an issue with our “high” and “low” level models. The “high” and “low” data was heavily imbalanced: around 75% of the data fell in the “low” category, so the high accuracy of these models was misleading. A model that predicted “low” every time would already be correct about 75% of the time. To account for this, we focused on recall rather than accuracy for this particular grouping when discussing how the model was affected by the imbalance. Nonetheless, this goes to show that our model accuracies would reflect a wider variety of data points if we had access to more data.
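
The arithmetic behind this limitation can be checked with a small, self-contained sketch. The labels below are a made-up 75/25 split chosen to match the approximate proportion described above, not the project's actual data, and the helper functions `accuracy` and `recall` are defined here only for the illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly `positive` samples that were predicted `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical labels: 75 "low" and 25 "high" samples.
y_true = ["low"] * 75 + ["high"] * 25

# A trivial baseline that ignores the features and always predicts "low".
y_pred = ["low"] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
high_recall = recall(y_true, y_pred, "high")
low_recall = recall(y_true, y_pred, "low")

print(f"accuracy:      {baseline_accuracy:.2f}")  # 0.75
print(f"recall (high): {high_recall:.2f}")        # 0.00
print(f"recall (low):  {low_recall:.2f}")         # 1.00
```

The baseline scores 75% accuracy while never identifying a single “high” vine, which is why recall on the minority class is the more informative number for this grouping.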


### Ethics and Privacy
The dataset contains crucial information about the different grape varieties in the vineyard, as well as their respective yields at each harvest. There could be privacy concerns if competitors intended to use this data for ulterior purposes. To mitigate this, the dataset we obtained already used ranch numbers instead of specific locations and code numbers for vine varieties, and we made further manipulations so that the data could be used for training while retaining confidentiality. If we had more time, we could have taken it a step further and replaced the grape variety names and colors with numbers to conceal even more company information. However, we decided to leave them unchanged to support more critical analysis. Another concern arises when people rely on the predictions of the model to determine which grape varieties to grow. The outcome is likely to differ from the predictions because the model is built on a dataset from one particular vineyard, which may not share the climate and soil characteristics of other vineyards. Therefore, people who rely entirely on the model to decide which grape varieties to plant or graft in their vineyard may suffer an unwanted loss of profit.

### Conclusion
Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

### Footnotes
1.[^](https://www.researchgate.net/profile/Manisha-Sirsat/publication/334387643_Machine_Learning_predictive_model_of_grapevine_yield_based_on_agroclimatic_patterns/links/5d3af058a6fdcc370a621a0a/Machine-Learning-predictive-model-of-grapevine-yield-based-on-agroclimatic-patterns.pdf): Sirsat, M. S., Mendes-Moreira, J., Ferreira, C., & Cunha, M. (2019). Machine Learning predictive model of grapevine yield based on agroclimatic patterns. Engineering in Agriculture, Environment and Food, 12(4), 443-450.

2.[^](https://www.researchgate.net/profile/Jerry-Gao/publication/348023331_Grape_Leaf_Disease_Detection_and_Classification_Using_Machine_Learning/links/6096c83a92851c490fc748d5/Grape-Leaf-Disease-Detection-and-Classification-Using-Machine-Learning.pdf): Huang, Z., Qin, A., Lu, J., Menon, A., & Gao, J. (2020, November). Grape leaf disease detection and classification using machine learning. In 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics) (pp. 870-877). IEEE.

3.[^](https://www.mdpi.com/2073-4395/12/10/2463): Mohimont, L., Alin, F., Rondeau, M., Gaveau, N., & Steffenel, L. A. (2022). Computer Vision and Deep Learning for Precision Viticulture. Agronomy, 12(10), 2463.