# COGS 118A - Project Proposal

# Names

- Lucas Giumarra
- Sebastian Olivas Beltran
- Yu-Hsuan Chi
- Tom Hocquet

# Abstract 
This project aims to develop a machine-learning model that can accurately classify grape varieties based on their yield as high, medium, or low. The data used in the project represents grape yield data, which is measured in boxes per acre and includes a variety of features such as climate conditions, the color of the grape, and the age of the vine. The model will be trained on a subset of this data, with the aim of learning to identify patterns in the data that are associated with different yield levels. Once the model is trained, it will predict the yield level of new, unseen data.
Ultimately, the goal is to create a reliable and accurate classification model that can assist grape growers in making informed decisions about their vineyard management practices.

(Feedback Response): Add which machine learning model you plan to use, the train/test split, any cross-validation, hyper-param tuning, as well as metrics you plan to use in 1-2 lines within the abstract as well. Our main model will be a random forest classifier (see Proposed Solution), trained with an 80%/20% train/test split. We will use k-fold cross-validation with between 5 and 10 folds and will tune hyperparameters such as the learning rate (for boosted models), the number and depth of the trees, and the minimum size of each subtree. Our main metric will be the F1 score, computed from the precision and recall values read off a confusion matrix; we will also report accuracy. These metrics will let us validate the effectiveness of our model against a benchmark model.


# Background

Our research area is viticulture, particularly grape production in vineyards. With growing demand for high-quality grapes and wine, and given their economic importance, the efficiency of vineyard operations relies heavily on models that predict yield in particular regions around the world [1]. Machine learning has been applied to viticulture on many occasions. Whether detecting diseases in a grapevine’s leaves or using deep neural networks to classify patterns in early growth stages, models have been trained to make predictions and evaluated with metrics such as accuracy and precision [2]. Many of these models were built to predict grape yields from criteria such as weather conditions or pesticide use. However, one key problem is that these predictive models can be too complex and computationally expensive. Despite the growing demand, they have also failed to consider factors such as particular management practices that can raise or lower production, the grape varieties grown, and the acreage used to grow them. It’s also important to note that comparing current yield models to previous ones can lead to misinterpretation, meaning new models must be trained on a daily basis. Too many confounding variables come into play, such as the limited varieties planted in a given set of parcels at a given time, grapevine locations, and the availability of workers [3]. For instance, climate change has a severe impact on wine production, which may lead to unpredictable and inconsistent data collection across seasons. As a result, the models are prone to large errors and can misinform workers in the field about the best course of action.
Even with the technology available today, it’s quite difficult to gather this type of data and accurately pinpoint the financial gains or losses tied to higher or lower yields. Thus, answering our question through a business lens is key to creating, evaluating, and drawing conclusions from our predictive yield model. The higher the yield, the better the grape quality and quantity, which results in more revenue for the vineyard.
(Feedback Response): Regarding prior work - what kinds of predictive models were used to predict yield? What were the results? Did you draw inspiration from these models for your approach/How is your approach different? Address the above points. I would suggest finding a specific example of a predictive model for grape yield and highlighting relevant points from that work.
One predictive model based on agroclimatic patterns used four machine learning methods (LASSO, ElasticNet, SpikesLab, and RandomForest) to predict grapevine yield. After performing feature selection and cross-validation, the authors narrowed the variables down to those most significant for their problem statement. They outlined the function, training process, and importance of all four methods in order to compare RMSE results on the selected variables. Using the models’ behavior and RMSE values for the flowering, coloring, and harvest phenologies, they concluded that “meteorology is the key relation in measuring quantity of grapes” [1]. Drawing inspiration from this, we will also compare several predictive methods: Gradient Boosting, SVM, and RandomForest. Instead of calculating RMSE, however, we will build confusion matrices and use F1 scores to evaluate our models, since we are predicting categories rather than continuous values. We will assign each grape yield to a category (high, medium, or low) using ranges of values based on variables such as boxes per acre and whether the grapevine was grafted, among others. Another key difference between the approaches is our use of data. Instead of combining climatic variables, plotting, and phenological stage datasets, we will rely on one main plotting dataset, which narrows our focus and steers our analysis away from weather patterns.
You also mention that prior work is limited by being too computationally expensive or having too many confounding variables - what about your work? How do you deal with these limitations?
Prior work on the subject tends to be computationally expensive and to involve many confounding variables because its data collection and overall scope are quite large. For our project, we plan to narrow the scope of the variables we analyze (such as yield and boxes per acre). The 23 unique ranches are relatively close to one another, so drastic weather differences should not be an issue. Our model will not be as computationally expensive as those in prior work because our machine learning methods are simpler and more straightforward. Regardless, we will still train the model and report its accuracy. Also, while some prior models are prone to large errors, ours should be less so because we classify yield into ranges of values rather than computing an exact yield value. A prediction only has to land in the correct range, which leaves less room for error and should lead to a relatively high accuracy rate.


# Problem Statement

Grape yield is an important factor for grape growers as it directly impacts their revenue and profitability. Currently, grape growers rely on their experience and knowledge of vineyard management practices to estimate the yield level, which can be subjective and prone to errors. Therefore, an accurate yield classification model can greatly assist grape growers in making informed decisions about their vineyard management practices. To ensure the problem is quantifiable, we will define the yield levels based on a specific range of boxes per acre. We will also use a set of objective features to describe the grape varieties, such as temperature, vine age, grape color, and ranch location.
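As a minimal sketch of how the yield levels could be made quantifiable, the snippet below bins boxes per acre into low, medium, and high. The column names, the sample values, and the cut points of 400 and 700 boxes per acre are all assumptions for illustration; the real thresholds would be chosen from the data (for example, terciles) or from the grower's own targets.

```python
import pandas as pd

# Hypothetical per-field yields in boxes per acre (illustrative values only).
yields = pd.DataFrame({
    "field_id": ["A", "B", "C", "D", "E", "F"],
    "boxes_per_acre": [310.0, 540.0, 820.0, 450.0, 990.0, 275.0],
})

# Assumed cut points, not values taken from the vineyard data.
bins = [0, 400, 700, float("inf")]
labels = ["low", "medium", "high"]

# right=False makes each bin include its lower edge: [0, 400), [400, 700), [700, inf).
yields["yield_level"] = pd.cut(
    yields["boxes_per_acre"], bins=bins, labels=labels, right=False
)

print(yields)
```

If the classes turn out to be badly imbalanced under fixed cut points, `pd.qcut` with three quantiles is one alternative that yields roughly equal-sized classes.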

# Data

The data for this project will be sourced from a vineyard and consists of three separate datasets that must be merged before analysis. The first dataset is 4,129 rows by 12 columns and contains information about the weekly collection of grapes and the number of boxes filled. The second dataset is 162 rows by 7 columns and provides information about each field (e.g., acreage and grafting). The last dataset contains the average weekly high and low temperatures, as well as the average weekly precipitation. The most critical variables for this project will most likely be the temperature and precipitation data, but the box count and acreage observations will also be important, since together they determine boxes per acre. Other features that will be considered include the variety of grapes, the age of the plant, and the harvest date. Since a significant amount of data wrangling will be required, we will need to carefully prepare the data before performing the analysis and modeling.
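The merge described above could look roughly like the following sketch. Every column name and value here is hypothetical, since the proposal does not list the actual schema; the only assumption carried over from the text is that harvest records link to fields by a field identifier and to weather by week.

```python
import pandas as pd

# Hypothetical stand-ins for the three datasets (names and values are assumptions).
harvest = pd.DataFrame({
    "field_id": ["F1", "F1", "F2"],
    "week": [30, 31, 30],
    "boxes": [1200, 900, 400],
})
fields = pd.DataFrame({
    "field_id": ["F1", "F2"],
    "acres": [10.0, 4.0],
    "grafted": [True, False],
})
weather = pd.DataFrame({
    "week": [30, 31],
    "avg_high_f": [95.0, 97.0],
    "avg_low_f": [64.0, 66.0],
    "avg_precip_in": [0.0, 0.1],
})

# Left joins keep every harvest record; validate raises an error if a field
# or week appears more than once in the lookup table, which would silently
# duplicate harvest rows.
merged = (
    harvest
    .merge(fields, on="field_id", how="left", validate="many_to_one")
    .merge(weather, on="week", how="left", validate="many_to_one")
)

# Normalize box counts by acreage so fields of different sizes are comparable.
merged["boxes_per_acre"] = merged["boxes"] / merged["acres"]

print(merged)
```

Checking for missing values after each join is a cheap way to catch harvest rows whose field or week has no match in the other tables.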

# Proposed Solution

One potential approach to building a classification model for this problem is to use a random forest algorithm. This approach has been shown to be effective in a variety of classification problems and is particularly well-suited to handling high-dimensional datasets with many features. We can use scikit-learn's RandomForestClassifier along with techniques such as cross-validation and hyperparameter tuning to build and optimize a model. Furthermore, we will also consider a benchmark model, such as a simple decision tree or logistic regression, against which to compare the random forest's performance and validate its effectiveness.
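The workflow above might be sketched as follows. The data is synthetic (generated with `make_classification` as a stand-in for the merged vineyard table), and the hyperparameter grid, the 5-fold setting, and the depth-3 decision tree benchmark are illustrative assumptions rather than final choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vineyard features; three classes play the
# role of low / medium / high yield.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# 80/20 split, stratified so each yield class keeps its proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

# Benchmark: a shallow decision tree scored with the same folds and metric.
baseline = DecisionTreeClassifier(max_depth=3, random_state=0)
baseline_scores = cross_val_score(
    baseline, X_train, y_train, cv=5, scoring="f1_macro"
)

print("best random forest params:", search.best_params_)
print("random forest CV macro-F1:", round(search.best_score_, 3))
print("decision tree CV macro-F1:", round(baseline_scores.mean(), 3))
```

The test split is deliberately left untouched here; it would be used only once, after model selection, to report final performance.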

# Evaluation Metrics

To test the performance of the model, we will split the dataset into a training set and a test set. We will use the training set to train the model and the test set to evaluate its performance. After creating a confusion matrix, we will compute metrics such as accuracy, precision, recall, and F1 score to evaluate the model's performance. However, we will mainly focus on the F1 score, since the costs of false positives and false negatives are approximately equal.
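A small sketch of these metrics on made-up labels is shown below. The true and predicted labels are invented for illustration, and macro averaging of the per-class F1 scores is an assumption: with three yield classes some averaging scheme must be chosen, and macro averaging weights each class equally.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_recall_fscore_support,
)

labels = ["low", "medium", "high"]

# Invented labels for eight test samples (not real model output).
y_true = ["low", "low", "medium", "medium", "high", "high", "high", "medium"]
y_pred = ["low", "medium", "medium", "medium", "high", "high", "medium", "medium"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision, recall, and F1, in the order of `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

print(cm)
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f} macro-F1={macro_f1:.3f}")
```

Reading the per-class rows alongside the macro average helps show whether a good overall score is hiding poor performance on one yield level.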

# Ethics & Privacy

The dataset contains crucial information about the different grape varieties in the vineyard, as well as their respective yields at each harvest. There could be privacy concerns if competitors intend to use this data for ulterior purposes. To avoid leaking any crucial information, we will change the names of the grape varieties to keep the vineyard's data as anonymous as possible.

If people rely on the model's predictions to decide which grape varieties to grow, the outcome is likely to differ from those predictions, because the model is built using a dataset from one particular vineyard, which may not share the climate and soil characteristics of other vineyards. Therefore, people who rely entirely on the model to decide which grape varieties to plant in their vineyard may suffer an unwanted loss of profit.


# Team Expectations 

* Reach out to other group members if you run into any obstacles
* Communicate on Discord when there are concerns regarding the project
* Make sure to meet the deadlines
* Be respectful and calm

# Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/21  |  6 PM |  Be ready to call on Zoom  | Discuss all aspects regarding the project proposal; Assign group members to each part of the project | 
| 2/28  |  6 PM |  Import & Wrangle Data; EDA | Review/edit Wrangling and EDA; Discuss Analysis Plan | 
| 3/7  | 6 PM  | Finalize data wrangling & EDA; Begin Analysis  | Finalize Analysis Plan; Complete checkpoint   |
| 3/14  | 6 PM  | Complete analysis; Draft results | Discuss/edit project; Complete project   |
| 3/25  | Before 11:59 PM  | NA | Turn in Final Project |

# Footnotes

<a name="sirsat"></a>1.[^](#sirsat): Sirsat, M. S., Mendes-Moreira, J., Ferreira, C., & Cunha, M. (2019). Machine Learning predictive model of grapevine yield based on agroclimatic patterns. Engineering in Agriculture, Environment and Food, 12(4), 443-450.<br>
<a name="huang"></a>2.[^](#huang): Huang, Z., Qin, A., Lu, J., Menon, A., & Gao, J. (2020, November). Grape leaf disease detection and classification using machine learning. In 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics) (pp. 870-877). IEEE.<br>
<a name="mohimont"></a>3.[^](#mohimont): Mohimont, L., Alin, F., Rondeau, M., Gaveau, N., & Steffenel, L. A. (2022). Computer Vision and Deep Learning for Precision Viticulture. Agronomy, 12(10), 2463.