# 1. Business Understanding

In this notebook, I define the business context, objectives, success criteria, and constraints for the avocado price forecasting project. Business Understanding is the first phase of the CRISP-DM methodology and lays the foundation for all subsequent phases. This notebook covers:
- Market overview and trends
- Business objectives and research questions
- Success criteria and metrics
- Project constraints and resources

**What is CRISP-DM?** The CRoss-Industry Standard Process for Data Mining (CRISP-DM) is a proven, robust process model for guiding data mining and machine learning projects. It consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

**Why is Business Understanding important?** This phase ensures that the project addresses actual business needs rather than being a purely technical exercise. By clearly defining objectives and success criteria upfront, we can ensure that our modeling efforts remain focused on delivering tangible business value.

## The Avocado Boom: A Decade of Growth and Consumption Surge

The avocado has transitioned from a niche product to a global dietary staple, witnessing a remarkable surge in consumption over the past ten years. This growth is driven by increasing awareness of its health benefits, culinary versatility, and expanded availability.

### Global Consumption Trends

* **Significant Global Expansion:** The global avocado industry has experienced substantial growth, with production and trade expanding rapidly. Global avocado production increased at a compound annual growth rate of approximately 7% over the past decade, reaching over 8.4 million metric tons in 2022.
* **Regional Growth:**
    * **North America:** Consumption has nearly doubled in the last ten years. The U.S. remains a major consumer, with a significant portion of its avocados imported from Mexico.
    * **Europe:** Avocado consumption in Europe has also seen a dramatic increase, nearly tripling over the past decade. Countries like Germany, the UK, and Spain have shown particularly strong growth.
    * **Latin America:** Consumption has also grown strongly across Latin American countries.
    * **Global trade:** Mexico is the world's largest exporter of avocados.

### US Consumption Statistics

* **Dramatic Increase:** In the United States, avocado consumption has seen a particularly dramatic increase. According to the Hass Avocado Board, U.S. per capita consumption of avocados has increased from 3.5 pounds in 2006 to 8.5 pounds in 2020.
* **Percentage Growth:** This represents a 143% increase in per capita consumption over 14 years.

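As a quick check, the 143% figure follows directly from the two per capita values quoted above. The sketch below recomputes it and also derives the implied compound annual growth rate for 2006-2020. The helper names are my own, not part of any library.

```python
def percent_growth(start: float, end: float) -> float:
    """Total percentage change from start to end."""
    return (end - start) / start * 100


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate, expressed as a percentage."""
    return ((end / start) ** (1 / years) - 1) * 100


# Per capita consumption figures quoted above (pounds per person).
consumption_2006 = 3.5
consumption_2020 = 8.5

total_growth = percent_growth(consumption_2006, consumption_2020)
annual_growth = cagr(consumption_2006, consumption_2020, years=2020 - 2006)

print(f"Total growth 2006-2020: {total_growth:.0f}%")
print(f"Implied annual growth:  {annual_growth:.1f}%")
```

The total comes out at about 143%, which corresponds to roughly 6.5% growth per year when compounded over the 14 years.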
### Key Drivers

* Increased popularity of avocados in various cuisines.
* Growing awareness of the health benefits of avocados.
* Expanded availability of avocados in supermarkets and restaurants.


This growth in popularity has been accompanied by substantial price volatility due to several factors:
- **Seasonal production cycles** affecting supply
- **Weather events** impacting harvests in major producing regions
- **Regional demand variations** across different US markets
- **Growing preference for organic produce** affecting market segmentation

### Business Value of Price Prediction
Accurate price prediction can deliver substantial value to various stakeholders in the supply chain:

- **Retailers**: 
  - Optimize inventory management to reduce waste (estimated 12% reduction potential)
  - Develop data-driven pricing strategies to maximize profitability while maintaining competitiveness
  - Plan promotional activities during periods of favorable pricing
  
- **Farmers**: 
  - Make informed decisions about harvesting timing to maximize returns
  - Plan production based on predicted market conditions
  - Optimize resource allocation for different avocado varieties
  
- **Consumers**: 
  - Benefit from more stable pricing through improved market efficiency
  - Make more informed purchasing decisions based on price forecasts


## 1.1. Business Objectives and Research Questions

### Primary Business Objectives:
1. **Develop a reliable price prediction model for Hass avocados**: Create a forecasting model that accurately predicts avocado prices across different regions and market segments to support decision-making for stakeholders.

2. **Identify key factors influencing avocado prices**: Determine the most significant variables affecting price variations to provide actionable insights for stakeholders.

### Research Questions:
To guide the analysis and modeling efforts, I have formulated the following research questions:

1. **How accurately can I predict avocado prices using historical data?**
   - *Importance*: Establishes the feasibility of the core objective and defines performance benchmarks for the prediction model.
   - *Approach*: Evaluate different forecasting techniques and quantify prediction accuracy through appropriate metrics.

2. **What features have the strongest influence on avocado prices?**
   - *Importance*: Identifies key price drivers that stakeholders can monitor and potentially influence.
   - *Approach*: Implement feature importance analysis across different model types to identify consistently significant predictors.

3. **How do regional differences affect pricing patterns?**
   - *Importance*: Regional market dynamics significantly impact local pricing strategies and distribution decisions.
   - *Approach*: Analyze price variations across regions and identify region-specific trends and factors.

4. **What is the impact of organic vs. conventional classification on prices?**
   - *Importance*: The premium for organic produce varies over time and by region, affecting production and purchasing decisions.
   - *Approach*: Compare price trends between organic and conventional avocados and identify factors affecting the price differential.

5. **How do seasonal patterns affect avocado prices?**
   - *Importance*: Seasonality is a critical factor in agricultural commodities that affects both supply and demand.
   - *Approach*: Conduct time series decomposition to isolate seasonal components and analyze their consistency across years.
   
6. **How does weather influence avocado prices across different regions?**
   - *Importance*: Weather conditions affect both production and consumption patterns in complex ways.
   - *Approach*: Analyze correlations between weather variables and prices while controlling for other factors.

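To illustrate the decomposition approach named under question 5, here is a minimal sketch of a classical additive decomposition (trend from a centered moving average, then averaged seasonal offsets) written in plain Python. The series is synthetic, with a made-up cycle of 4 periods so the example stays short; weekly avocado data would more plausibly use a period of 52. In practice a library routine such as `seasonal_decompose` from statsmodels could replace this, although statsmodels is not part of this project's listed stack.

```python
from statistics import mean


def centered_moving_average(series, period):
    """Trend estimate; None where the window does not fit."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        if period % 2:
            # Odd period: plain centered window.
            trend[i] = mean(series[i - half : i + half + 1])
        else:
            # Even period: half weight on the two end points.
            inner = sum(series[i - half + 1 : i + half])
            ends = 0.5 * series[i - half] + 0.5 * series[i + half]
            trend[i] = (inner + ends) / period
    return trend


def additive_decompose(series, period):
    """Split a series into trend, seasonal, and residual components."""
    trend = centered_moving_average(series, period)
    detrended = [y - t if t is not None else None for y, t in zip(series, trend)]
    # Average the detrended values at each position in the cycle.
    raw = [
        mean(d for d in detrended[pos::period] if d is not None)
        for pos in range(period)
    ]
    offset = mean(raw)  # center the seasonal component on zero
    cycle = [s - offset for s in raw]
    seasonal = [cycle[i % period] for i in range(len(series))]
    residual = [
        y - t - s if t is not None else None
        for y, t, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic prices: linear trend plus a repeating pattern (illustration only).
pattern = [0.10, -0.05, -0.10, 0.05]
prices = [1.00 + 0.01 * i + pattern[i % 4] for i in range(12)]

trend, seasonal, residual = additive_decompose(prices, period=4)
print([round(s, 2) for s in seasonal[:4]])
```

Because the synthetic series is exactly trend plus pattern, the recovered seasonal component matches `pattern` and the residuals are close to zero. Real price data will leave a non-trivial residual, and comparing the seasonal component across years is what indicates whether the pattern is consistent.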
## 1.2. Business Success Criteria

The success of this project will be primarily evaluated based on the accuracy of predicting avocado prices, differentiated by organic versus conventional types, across distinct US regions. 

### 1.2.1. Technical Success Criteria:
The thresholds below come from my general research rather than from an established benchmark for this domain, so it is hard to say with confidence which score counts as good or excellent for each model. For instance, I am not sure whether R² ≥ 0.7 is an appropriate bar for linear regression in this context. There may be a more principled way to determine an acceptable score for each case, so these targets should be read as provisional.

#### Machine Learning Models:

* **R² ≥ 0.7 for Linear Regression:**
    * *Justification:* Represents a good fit for linear models in agricultural price forecasting.
    * *Current Status:* indicating linear models are insufficient for this task's complexity.
* **R² ≥ 0.9 for Random Forest:**
    * *Justification:* Ensemble methods like Random Forest are expected to achieve high accuracy for complex, non-linear relationships.
    * *Current Status:* demonstrating strong predictive performance.
    * *Note:* While 0.9 is excellent, 0.88 is still very good and could be considered acceptable depending on the specific business requirements.
* **MAE < 0.1:**
    * *Justification:* With typical avocado prices of $1-2, an MAE below $0.10 corresponds to an average error of roughly 5-10% of the price, which is considered actionable for business planning.
    * *Current Status:* indicating good absolute prediction accuracy.
* **RMSE < 0.15:**
    * *Justification:* Captures and penalizes larger prediction errors, crucial for preventing costly business decisions based on outliers.
    * *Current Status:* showing good performance with limited large errors.
    * *Note:* It is important to remember that RMSE is more sensitive to outliers than MAE.
     
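To make the thresholds above concrete, here is a minimal, dependency-free sketch of how R², MAE, and RMSE are computed and compared against the targets. The prices are made up for illustration and are not project results. In the project itself, scikit-learn's `r2_score`, `mean_absolute_error`, and `mean_squared_error` compute the same quantities.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices in USD, for illustration only (not project results).
actual = [1.00, 1.20, 1.50, 1.80, 2.00]
predicted = [1.05, 1.15, 1.60, 1.70, 2.30]

scores = {
    "R2": r2(actual, predicted),
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
}

# Compare against the error targets stated in this section.
meets_target = {
    "MAE < 0.10": scores["MAE"] < 0.10,
    "RMSE < 0.15": scores["RMSE"] < 0.15,
}

# MAE relative to the mean price, to relate it to the "under 10%" reasoning.
mae_pct = scores["MAE"] / (sum(actual) / len(actual)) * 100

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
print(meets_target, f"MAE as % of mean price: {mae_pct:.1f}%")
```

In this toy example a single 30-cent miss pushes RMSE (about 0.15) above MAE (0.12), which is the outlier sensitivity noted above.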
***

## 2 Situation Assessment

A comprehensive assessment of the project context, resources, and constraints is essential for establishing realistic expectations and designing an appropriate approach.

### Data Resources:
- **Avocado Price and Volume Data**: 
  - Weekly retail scan data for national retail volume and price (2015-2023)
  - Data from multiple retail channels (grocery, mass, club, drug, dollar, military)
  - Sourced from Kaggle datasets and includes historical trends
  - Contains details on both conventional and organic avocados
  
- **Product Information**:
  - PLU codes identifying different avocado types and sizes
  - Regional market classification
  
- **Weather Data**:
  - Historical temperature data for US regions corresponding to avocado markets
  - Retrieved from Meteostat API

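Since the price data is weekly, temperature readings need to be brought to the same granularity before they can be joined. The sketch below shows one way to do that, assuming the weather data arrives as daily averages and that grouping by ISO week is acceptable. Both are my assumptions: the actual Meteostat frequency and the week boundaries used in the price data may differ. The function name and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def weekly_mean_temperature(daily):
    """Average daily readings into (ISO year, ISO week) buckets."""
    buckets = defaultdict(list)
    for day, temp in daily:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(temp)
    return {week: mean(temps) for week, temps in sorted(buckets.items())}


# Made-up daily average temperatures in °C, starting on a Monday.
first_day = date(2023, 1, 2)
daily = [(first_day + timedelta(days=i), 10.0 + i) for i in range(14)]

weekly = weekly_mean_temperature(daily)
print(weekly)
```

Each key is an `(ISO year, ISO week)` pair, which can then serve as a join key against weekly price records for the same region.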

### Technologies Overview

#### Project Structure and Management
- **Cookiecutter Data Science** - Standardized project structure template
- **Python 3.12** - Primary programming language
- **Jupyter Notebooks/JupyterLab** - Interactive development
- **pandas** - Data manipulation and analysis
- **numpy** - Numerical computing and array operations
- **matplotlib** - Basic plotting capabilities
- **seaborn** - Advanced statistical data visualization
- **scikit-learn**
  - Data preprocessing (StandardScaler, LabelEncoder, SimpleImputer)
  - Model selection (train_test_split, cross_val_score)
  - ML models (GradientBoostingRegressor, RandomForestRegressor, LinearRegression)
  - Metrics (mean_squared_error, r2_score, mean_absolute_error)
  - Pipeline workflow automation
- **XGBoost** - Advanced gradient boosting implementation


### Constraints and Assumptions:

#### 1. Data Constraints:
   - **Limited to Hass avocados only**: 
     - *Impact*: Cannot generalize findings to other avocado varieties
     - *Mitigation*: Focus analysis specifically on Hass market dynamics and clearly communicate this limitation
     
   - **Weekly granularity of data**: 
     - *Impact*: Unable to capture daily price fluctuations or intra-week patterns
     - *Mitigation*: Focus on week-to-week trends and longer-term forecasting horizons
     
   - **Historical data might not capture recent market changes**: 
     - *Impact*: Model may not account for emerging trends or structural market shifts
     - *Mitigation*: Implement a validation approach using the most recent data and monitor model drift
     
   - **Weather data limitations**: 
     - *Impact*: Weather data for regions is averaged and may not reflect microclimate variations
     - *Mitigation*: Use weather as a general indicator rather than a precise predictor

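The mitigation of validating on the most recent data can be sketched as a time-based split: train on earlier weeks and hold out the latest weeks for validation. This is a minimal illustration with made-up records and hypothetical field names (`week`, `region`, `price`), not the project's actual split. Note that scikit-learn's `train_test_split` shuffles rows by default, so a date-based cutoff like this one (or `TimeSeriesSplit`) is a safer fit for this purpose.

```python
from datetime import date, timedelta


def split_by_date(rows, holdout_weeks):
    """Hold out the most recent `holdout_weeks` distinct weeks for validation."""
    weeks = sorted({row["week"] for row in rows})
    cutoff = weeks[-holdout_weeks]
    train = [row for row in rows if row["week"] < cutoff]
    valid = [row for row in rows if row["week"] >= cutoff]
    return train, valid, cutoff


# Made-up weekly records for two regions (illustration only).
start = date(2023, 1, 1)
rows = [
    {"week": start + timedelta(weeks=i), "region": region, "price": 1.0 + 0.01 * i}
    for i in range(10)
    for region in ("West", "Northeast")
]

train, valid, cutoff = split_by_date(rows, holdout_weeks=3)
print(len(train), len(valid), cutoff)
```

Every training row predates every validation row, so the validation score reflects how the model would have performed on weeks it had not yet seen. Repeating the split with a moving cutoff is one simple way to watch for drift over time.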
#### 2. Technical Constraints:
   - **No deep learning techniques allowed**: 
