
# COGS 108 - Data Checkpoint

# Names

- Kaleigh Mogatas A17051705 kmogatas@ucsd.edu
- Tairan Liu A17399714 tal012@ucsd.edu
- Teresa Tian A16878664 shtian@ucsd.edu
- Lynna Nguyen A16906910 lnn002@ucsd.edu
- Ella Tung A16363333 etung@ucsd.edu


# Research Question

How do region and production method (conventional vs. organic) affect fluctuations in U.S. avocado prices from 2015 to 2023?



## Background and Prior Work



Avocados, often hailed as a superfruit, are packed with a wide array of vitamins and minerals. Owing to their extensive health benefits, they have remained popular over the past decade and appear in a wide range of dishes. Avocado consumption has grown significantly in recent years, increasing demand. As a result, avocado prices have been affected by several factors. The seasonality of avocado production (with specific periods favoring optimal growth), the type of avocado (organic or conventional), and the region of cultivation all influence avocado availability and quality.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Avocado prices fluctuate with the time of year. Tanmay Deshpande, the author of ‘Avocado Price Forecast,’ used Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive Integrated Moving Average (SARIMA) models to examine how the prices of the two types of avocado change over the year. He found that average avocado prices rise from around September to November and drop again around December and January. This suggests that demand for avocados may be seasonal, which could explain why prices rise at a specific time of year.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

However, avocado prices also vary with the region in which they are produced and sold. Mario Caesar, author of ‘Avocado Price Regression w/ PyCaret & EDA,’ discusses how different areas of the United States contribute to avocado sales and prices, particularly with respect to bag size. His analysis shows that the West (excluding California) and California are more likely to purchase both conventional and organic avocados, conventional especially, than the other top five purchasing areas (the Northeast, Great Lakes, and South Central), which raises the question of why avocado consumption differs so much between regions. The type of avocado also affects cost and purchasing decisions, since the two types are grown using different methods. Caesar concludes that bagged conventional avocados are more likely to be purchased than organic avocados because they tend to be cheaper and larger in volume and size. Organic avocados are produced in a better environment, and their prices are higher for that reason.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)


**Citations** 
1. <a name='cite_note-1'></a> [1](#cite_ref-1) Bastida, Olmo. Fresh Avocado Market in the US. *ProducePay*, 5 Oct. 2023, https://producepay.com/blog/fresh-avocado-market-us/#:~:text=Price%20of%20imported%20fresh%20avocado,December%2C%20November%2C%20and%20October . Accessed 9 Feb. 2024.
2. <a name='cite_note-2'></a> [2](#cite_ref-2) Deshpande, Tanmay. Avocado Price Forecast- ARIMA & SARIMA. *Kaggle*, 2023, https://www.kaggle.com/code/tanmay111999/avocado-price-forecast-arima-sarima-detailed . Accessed 9 Feb. 2024.
3. <a name='cite_note-3'></a> [3](#cite_ref-3) Caesar, Mario. Avocado Price Regression w/ PyCaret & EDA. *Kaggle*, 2022, http://www.kaggle.com/code/caesarmario/avocado-price-regression-w-pycaret-eda . Accessed 9 Feb. 2024.

# Hypothesis



In the United States, avocado prices are primarily affected by their production method, where organic avocados command higher prices than conventional ones. Additionally, regional variations and seasonal demand fluctuations play critical roles in price determination, leading to higher prices in regions with scarce supply or elevated demand during certain seasons.
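One way to start checking this hypothesis is to compare mean prices by production method and to compute the organic premium within each region. The following is only a minimal sketch on a toy table: the four rows are taken from the first date in the dataset (2015-01-04), the column names follow the raw CSV header, and the names `toy`, `mean_by_type`, and `premium` are illustrative, not part of our notebook.

```python
import pandas as pd

# Toy rows (Albany and Atlanta on 2015-01-04); the real analysis would use
# the full dataframe instead of this hand-built sample.
toy = pd.DataFrame({
    "AveragePrice": [1.22, 1.79, 1.00, 1.76],
    "type": ["conventional", "organic", "conventional", "organic"],
    "region": ["Albany", "Albany", "Atlanta", "Atlanta"],
})

# Mean price per production method.
mean_by_type = toy.groupby("type")["AveragePrice"].mean()

# Organic-minus-conventional price gap within each region.
by_region = toy.pivot_table(index="region", columns="type", values="AveragePrice")
premium = by_region["organic"] - by_region["conventional"]

print(mean_by_type)
print(premium)
```

A positive `premium` in every region would be consistent with the first part of the hypothesis, although a formal test would still be needed before drawing conclusions.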


# Data

## Data Overview


Dataset Name: Avocado_HassAvocadoBoard_20152023v1.0.1.csv

Link to the Dataset: 
https://www.kaggle.com/datasets/vakhariapujan/avocado-prices-and-sales-volume-2015-2023?rvi=1 

Number of Observations: 53415 

Number of Variables: 12 

This dataset provides comprehensive data on Hass avocado sales, sourced from the Hass Avocado Board. It builds on previous versions introduced on Kaggle by Justin Kiggins and later updated by Valentin Joseph to include data from 2015 through 2021. The dataset categorizes sales data by regions and key locations within the United States, including cities and sub-regions. Notably, the location values do not sum to the regional totals; the regions are California, West, Plains, South Central, Southeast, Midsouth, Great Lakes, and Northeast. The dataset offers insight into sales trends for conventional and organic avocados, along with bag sizes, across different geographical areas.
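Because regions and individual locations are mixed in the same column, it may help to separate them before comparing prices across regions. The sketch below shows one possible way to do that. The spellings in `AGGREGATE_REGIONS` are an assumption (multi-word names written without spaces, as with other values in the file such as `WestTexNewMexico`), and the California volume is a made-up placeholder.

```python
import pandas as pd

# ASSUMPTION: the eight aggregate regions are spelled without spaces in the CSV.
# Check df["region"].unique() and adjust this set if the spellings differ.
AGGREGATE_REGIONS = {
    "California", "West", "Plains", "SouthCentral",
    "Southeast", "Midsouth", "GreatLakes", "Northeast",
}

toy = pd.DataFrame({
    "region": ["Albany", "West", "Toledo", "California"],
    # The last volume is a made-up placeholder for illustration only.
    "TotalVolume": [40873.28, 343326.10, 5693.91, 100.0],
})

# Split rows into aggregate regions and individual locations.
is_aggregate = toy["region"].isin(AGGREGATE_REGIONS)
regions_only = toy[is_aggregate]
locations_only = toy[~is_aggregate]

print(regions_only)
print(locations_only)
```

Keeping the two groups apart avoids counting the same sales twice when volumes are summed.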



## Avocado_HassAvocadoBoard_20152023v1.0.1.csv

To begin, we import a few packages that will help us prepare our data for calculations and analysis.

In [1]:
# Setup for data
%matplotlib inline

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns
sns.set()
sns.set_context('talk')

import warnings
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", 104)
#import patsy
#import statsmodels.api as sm
#import scipy.stats as stats
#from scipy.stats import ttest_ind, chisquare, normaltest

We then stored our data file into a variable named df and printed the dataset to observe what it looks like.

In [3]:
# Import original csv file 
df = pd.read_csv("Avocado_HassAvocadoBoard_20152023v1.0.1.csv")
df

Unnamed: 0,Date,AveragePrice,TotalVolume,plu4046,plu4225,plu4770,TotalBags,SmallBags,LargeBags,XLargeBags,type,region
0,2015-01-04,1.220000,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,Albany
1,2015-01-04,1.790000,1373.95,57.42,153.88,0.00,1162.65,1162.65,0.00,0.0,organic,Albany
2,2015-01-04,1.000000,435021.49,364302.39,23821.16,82.15,46815.79,16707.15,30108.64,0.0,conventional,Atlanta
3,2015-01-04,1.760000,3846.69,1500.15,938.35,0.00,1408.19,1071.35,336.84,0.0,organic,Atlanta
4,2015-01-04,1.080000,788025.06,53987.31,552906.04,39995.03,141136.68,137146.07,3990.61,0.0,conventional,BaltimoreWashington
...,...,...,...,...,...,...,...,...,...,...,...,...
53410,2023-12-03,1.550513,5693.91,204.64,1211.25,0.00,4278.03,,,,organic,Toledo
53411,2023-12-03,1.703920,343326.10,66808.44,132075.11,58.65,138830.45,,,,organic,West
53412,2023-12-03,1.618931,34834.86,15182.42,1211.38,0.00,18075.66,,,,organic,WestTexNewMexico
53413,2023-12-03,1.245406,2942.83,1058.54,7.46,0.00,1779.19,,,,organic,Wichita
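Since our hypothesis involves seasonal effects, the Date column will probably need to be parsed into real dates at some point. The following is a minimal sketch of one way to do this, using three rows from the preview in place of the real file; the `Year` and `Month` column names are hypothetical additions, not columns in the dataset.

```python
import io
import pandas as pd

# Three rows from the preview, standing in for the real CSV file.
sample_csv = io.StringIO(
    "Date,AveragePrice,type,region\n"
    "2015-01-04,1.22,conventional,Albany\n"
    "2015-01-04,1.79,organic,Albany\n"
    "2023-12-03,1.550513,organic,Toledo\n"
)

# parse_dates converts the Date strings to datetime64 values on load.
df_sample = pd.read_csv(sample_csv, parse_dates=["Date"])

# Hypothetical helper columns for grouping by year or by month.
df_sample["Year"] = df_sample["Date"].dt.year
df_sample["Month"] = df_sample["Date"].dt.month

print(df_sample.dtypes)
```

With a month column available, prices could later be grouped by month to look for the seasonal pattern described in the background section.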


From the rows displayed, the data appear well organized and consistently structured. However, the preview shows only a few of the 53415 rows, so we make a few changes and check the data before starting our calculations.

We renamed some of the columns to make their descriptions more comprehensible and to make it clearer what kind of data each one holds. Here is further clarification for some of the columns:


Small Avocadoes: The number of small/medium (3-5 oz) avocados

Large Avocadoes: The number of large (8-10 oz) avocados

XLarge Avocadoes: The number of extra large (10-15 oz) avocados

Small Bags: The number of small bags of various-sized avocados

Large Bags: The number of large bags of various-sized avocados

XLarge Bags: The number of extra large bags of various-sized avocados

In [4]:
# Rename columns to more descriptive labels
df = df.rename(columns={
    'AveragePrice': 'Average Price (USD)',
    'TotalVolume': 'Total Volume (lbs)',
    'plu4046': 'Small Avocadoes',
    'plu4225': 'Large Avocadoes',
    'plu4770': 'XLarge Avocadoes',
    'TotalBags': 'Total Bags',
    'SmallBags': 'Small Bags',
    'LargeBags': 'Large Bags',
    'XLargeBags': 'XLarge Bags',
    'type': 'Type',
    'region': 'Region',
})
df

Unnamed: 0,Date,Average Price (USD),Total Volume (lbs),Small Avocadoes,Large Avocadoes,XLarge Avocadoes,Total Bags,Small Bags,Large Bags,XLarge Bags,Type,Region
0,2015-01-04,1.220000,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,Albany
1,2015-01-04,1.790000,1373.95,57.42,153.88,0.00,1162.65,1162.65,0.00,0.0,organic,Albany
2,2015-01-04,1.000000,435021.49,364302.39,23821.16,82.15,46815.79,16707.15,30108.64,0.0,conventional,Atlanta
3,2015-01-04,1.760000,3846.69,1500.15,938.35,0.00,1408.19,1071.35,336.84,0.0,organic,Atlanta
4,2015-01-04,1.080000,788025.06,53987.31,552906.04,39995.03,141136.68,137146.07,3990.61,0.0,conventional,BaltimoreWashington
...,...,...,...,...,...,...,...,...,...,...,...,...
53410,2023-12-03,1.550513,5693.91,204.64,1211.25,0.00,4278.03,,,,organic,Toledo
53411,2023-12-03,1.703920,343326.10,66808.44,132075.11,58.65,138830.45,,,,organic,West
53412,2023-12-03,1.618931,34834.86,15182.42,1211.38,0.00,18075.66,,,,organic,WestTexNewMexico
53413,2023-12-03,1.245406,2942.83,1058.54,7.46,0.00,1779.19,,,,organic,Wichita


Now that our columns are clearly named, let's check whether there are NaN values in any of the columns.

In [5]:
# Checking for NaN values for necessary columns 
df.isnull().any()

Date                   False
Average Price (USD)    False
Total Volume (lbs)     False
Small Avocadoes        False
Large Avocadoes        False
XLarge Avocadoes       False
Total Bags             False
Small Bags              True
Large Bags              True
XLarge Bags             True
Type                   False
Region                 False
dtype: bool

Based on this output, NaN values appear only in the Small Bags, Large Bags, and XLarge Bags columns. Since these columns are not the primary focus of our research question, we decided not to remove the rows containing these NaN values.
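To see how much data is missing, and where it starts, the boolean check can be extended with counts and shares. This is a small sketch on a toy table built from rows in the preview (the bag-size values are missing for the 2023-12-03 rows shown there); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2015-01-04", "2015-01-04", "2023-12-03", "2023-12-03"],
    "Total Bags": [9716.46, 1162.65, 4278.03, 138830.45],
    "Small Bags": [9186.93, 1162.65, np.nan, np.nan],
    "Large Bags": [529.53, 0.00, np.nan, np.nan],
    "XLarge Bags": [0.0, 0.0, np.nan, np.nan],
})

# Number and fraction of missing values per column.
missing_counts = toy.isnull().sum()
missing_share = toy.isnull().mean()

# Earliest date with a missing Small Bags value (ISO date strings sort correctly).
rows_missing = toy[toy["Small Bags"].isnull()]
first_missing_date = rows_missing["Date"].min()

print(missing_counts)
print(first_missing_date)
```

Knowing whether the gaps are concentrated in particular dates would help confirm that leaving them in place does not affect the price analysis.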


We noticed that some values in the Average Price (USD) column have six digits after the decimal point. We rounded this column to two decimal places because U.S. currency does not use more precision than that. This also makes the numbers easier to work with and cleaner to read.

In [6]:
df = df.round({'Average Price (USD)':2})
df

Unnamed: 0,Date,Average Price (USD),Total Volume (lbs),Small Avocadoes,Large Avocadoes,XLarge Avocadoes,Total Bags,Small Bags,Large Bags,XLarge Bags,Type,Region
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,Albany
1,2015-01-04,1.79,1373.95,57.42,153.88,0.00,1162.65,1162.65,0.00,0.0,organic,Albany
2,2015-01-04,1.00,435021.49,364302.39,23821.16,82.15,46815.79,16707.15,30108.64,0.0,conventional,Atlanta
3,2015-01-04,1.76,3846.69,1500.15,938.35,0.00,1408.19,1071.35,336.84,0.0,organic,Atlanta
4,2015-01-04,1.08,788025.06,53987.31,552906.04,39995.03,141136.68,137146.07,3990.61,0.0,conventional,BaltimoreWashington
...,...,...,...,...,...,...,...,...,...,...,...,...
53410,2023-12-03,1.55,5693.91,204.64,1211.25,0.00,4278.03,,,,organic,Toledo
53411,2023-12-03,1.70,343326.10,66808.44,132075.11,58.65,138830.45,,,,organic,West
53412,2023-12-03,1.62,34834.86,15182.42,1211.38,0.00,18075.66,,,,organic,WestTexNewMexico
53413,2023-12-03,1.25,2942.83,1058.54,7.46,0.00,1779.19,,,,organic,Wichita
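An alternative we could consider is keeping the unrounded prices for calculations and storing the rounded values in a separate column for display. The sketch below illustrates this with the four unrounded prices from the earlier preview; the column name `Price Rounded (USD)` is hypothetical.

```python
import pandas as pd

# Unrounded prices from the last rows of the original preview.
toy = pd.DataFrame({
    "Average Price (USD)": [1.550513, 1.703920, 1.618931, 1.245406],
})

# Keep the raw values and add a rounded copy for display.
toy["Price Rounded (USD)"] = toy["Average Price (USD)"].round(2)

print(toy)
```

This avoids small rounding differences accumulating when averages are computed later, while still giving a clean two-decimal view.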


In [8]:
# Checking unique values of all necessary columns
price = df['Average Price (USD)'].unique()
reg = df['Region'].unique()
types = df['Type'].unique()

price, reg, types

(array([1.22, 1.79, 1.  , 1.76, 1.08, 1.29, 1.01, 1.64, 1.02, 1.83, 1.4 ,
        1.73, 0.93, 1.24, 1.19, 2.13, 1.11, 1.49, 0.88, 1.34, 0.89, 1.44,
        0.74, 1.35, 0.99, 1.42, 1.7 , 0.95, 1.6 , 1.54, 1.05, 1.68, 1.06,
        2.32, 0.71, 1.63, 0.97, 1.81, 0.8 , 1.5 , 0.85, 1.25, 0.92, 1.48,
        1.82, 1.1 , 1.56, 1.84, 0.94, 1.41, 1.09, 1.93, 1.88, 1.8 , 1.72,
        0.65, 1.12, 1.52, 1.69, 1.28, 1.2 , 2.01, 1.13, 1.39, 1.33, 1.23,
        1.18, 1.86, 0.77, 0.98, 1.75, 1.3 , 1.15, 1.46, 0.75, 1.77, 1.17,
        1.94, 1.59, 1.26, 2.29, 1.32, 0.76, 1.87, 1.03, 1.65, 1.38, 2.28,
        0.78, 1.91, 1.85, 1.07, 1.92, 2.03, 0.61, 1.36, 1.98, 1.66, 1.16,
        1.47, 1.27, 0.82, 2.06, 1.45, 2.  , 2.15, 1.14, 1.37, 1.9 , 2.35,
        1.96, 2.08, 0.67, 1.55, 1.97, 1.62, 0.79, 0.96, 1.71, 1.89, 1.67,
        2.21, 1.78, 2.02, 1.57, 1.21, 0.81, 1.04, 1.53, 0.91, 1.43, 0.86,
        0.87, 0.68, 0.72, 0.9 , 0.56, 1.58, 0.84, 2.37, 0.7 , 1.61, 1.95,
        0.64, 0.73, 1.51, 0.83, 0.6 , 
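Printing every unique price produces a long array that is hard to scan. A more compact summary might look like the sketch below, which uses a small toy table with values taken from the preview; the variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Type": ["conventional", "organic", "conventional", "organic", "organic"],
    "Region": ["Albany", "Albany", "Atlanta", "Atlanta", "Toledo"],
    "Average Price (USD)": [1.22, 1.79, 1.00, 1.76, 1.55],
})

# Number of distinct categories in each categorical column.
n_unique = toy[["Type", "Region"]].nunique()

# How many rows fall under each production method.
type_counts = toy["Type"].value_counts()

# Min, max, mean, and quartiles of the price column.
price_summary = toy["Average Price (USD)"].describe()

print(n_unique)
print(type_counts)
print(price_summary)
```

For categorical columns such as Type and Region, the counts reveal unexpected labels or misspellings, while the numeric summary flags implausible prices.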

# Ethics & Privacy

Our data doesn’t involve human subjects, so informed consent is not a concern, and the data contain no personally identifiable information (PII). We believe the avocados will not request that their personal information be removed. Because we are not using human subjects, there should be no privacy, consent, or human-subject bias concerns. Furthermore, since the data are publicly available online, we do not plan to delete them after use. The prices of avocados and their origins are transparent online, so we see no need for additional measures to secure the data.

One bias our dataset might have is that it includes only avocados sold through larger institutions, such as supermarkets. Smaller sellers, such as farmers selling small quantities of avocados, might not be accounted for. However, since we are focusing on the impact of region and production method on prices, these small sellers should have little effect on the statistics. Because they have little impact on overall avocado prices and appear randomly, they can be disregarded in our analysis.

Another potential impact of our project is that readers might draw conclusions about avocados based on their origin. People might be inclined to buy avocados from one origin and avoid avocados from another. We will address this in our write-up by stating that our conclusions rest on previously collected, publicly available data, and that avocado prices in different areas might fluctuate in the future and even reverse our conclusions.

# Team Expectations 



* *Team Expectation 1*:  Punctual Participation: All team members are expected to attend meetings promptly. Should illness or unavoidable circumstances arise, members are encouraged to participate virtually to maintain continuity.
* *Team Expectation 2*: Effective Communication: We prioritize clear, timely, and constructive communication. Keeping all members informed and engaged is essential for our collective success.
* *Team Expectation 3*: Mutual Respect Towards Teammates: Every team member deserves to be treated with dignity and respect. We commit to fostering an inclusive environment where diverse perspectives are valued and encouraged.
* *Team Expectation 4*: Active Contribution and Collaboration: Each member is expected to actively contribute to our project by sharing ideas, taking on tasks, and collaborating with others. Our goal is to leverage our collective strengths to achieve our research objectives.

By adhering to these expectations, our group aims to create a productive, supportive, and respectful team dynamic conducive to our project's success.


# Project Timeline Proposal




| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/8  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 2/10  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/13  | 3 PM   | Read and understand checkpoint #1, brainstorm the project | Discuss how we can tackle our project as we continue the project | 
| 2/18  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part for Checkpoint #1   |
| 2/24  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/29  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/2  | 7 PM   | Understand what we need to get done for checkpoint #2, prepare questions if needed for this part | Discuss what we should do for checkpoint #2 and assign members their part for this section | 
| 3/5   | 3 PM |  Continue doing parts for checkpoint #2  | Discuss if someone needs extra help for their part and update each other on what we each need so we know what we need to focus on | 
| 3/9   | 1 PM  | Review each other's parts and see what we have done for this checkpoint | Work collaboratively as a group during this checkpoint to get as much done to catch everyone up on their parts | 
| 3/10   | 1 PM   | Continue to review each other's part to make sure that everyone is caught up and that we are not behind | Finalize Checkpoint #2 to make sure that it is ready to submit | 
| 3/11   | 6 PM | Read the final report requirements | Discussion about the final report and make sure members are aware of what we are doing | 
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/15  | 5 PM | Review each other's work for the Final Report | Comment on what else needs editing and finalization for what other members should do for their part and make sure that we have everything that is required for the Final Report. | 
| 3/19  | Before 11:59 PM  | Review Final Report to make sure that it is ready to submit  | Finalize the Final Report and turn in Final Project & Group Project Surveys | 
