## Exploratory Data Analysis Questions 
1. What are the overall pricing levels, variability, and distribution of avocado prices?
2. How do avocado prices and sales volumes differ between organic and conventional products?
3. What is the relationship between price and demand in the avocado market?
4. How have avocado prices and sales volumes changed over time, and are there clear trends or seasonality?
5. How do pricing, demand, and revenue vary across different regions?
6. Which avocado segments and regions present the strongest revenue opportunities and business insights?

In [1]:
import pandas as pd 
import numpy as np

In [2]:
df = pd.read_csv("avocado_cleaned.csv")

In [4]:
df.columns
df.shape

(18040, 17)

In [6]:
df['date'] = pd.to_datetime(df['date'])

date_min = df['date'].min()
date_max = df['date'].max()

date_min, date_max

(Timestamp('2015-01-04 00:00:00'), Timestamp('2018-03-25 00:00:00'))

In [7]:
total_rows = df.shape[0]
total_regions = df['region'].nunique()
total_types = df['type'].nunique()

total_rows, total_regions, total_types

(18040, 54, 2)

In [8]:
price_stats = df['averageprice'].agg(
    mean_price = 'mean', 
    median_price = 'median',
    min_price = 'min',
    max_price = 'max', 
    std_price = 'std', 
    var_price = 'var'
)
price_stats

mean_price      1.390967
median_price    1.360000
min_price       0.440000
max_price       2.490000
std_price       0.379588
var_price       0.144087
Name: averageprice, dtype: float64

#### Price Distribution Summary

The average avocado price is $1.39, with a median of $1.36; the mean sitting slightly above the median points to a mildly right-skewed distribution.
Prices vary widely, ranging from $0.44 to $2.49, suggesting the influence of seasonal and regional factors.
The standard deviation of $0.38 indicates moderate price volatility around the mean.
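
One quick way to put a number on the "slightly right-skewed" reading is Pearson's second skewness coefficient, 3 × (mean − median) / std. The sketch below plugs in the summary statistics reported above; the helper name `pearson_median_skew` is introduced here for illustration only.

```python
def pearson_median_skew(mean: float, median: float, std: float) -> float:
    """Pearson's second (median) skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std

# Values copied from the price_stats output above.
skew = pearson_median_skew(mean=1.390967, median=1.360000, std=0.379588)
print(round(skew, 2))
```

The result is roughly 0.24, a small positive value consistent with mild right skew. On the full data, `df['averageprice'].skew()` would give the moment-based sample skewness as a cross-check.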

In [9]:
price_by_type = (
    df.groupby('type')['averageprice']
    .agg(
        avg_price = 'mean', 
        median_price = 'median', 
        min_price = 'min',
        max_price = 'max', 
        price_std = 'std'
    )
)
price_by_type

| type | avg_price | median_price | min_price | max_price | price_std |
|---|---|---|---|---|---|
| conventional | 1.15804 | 1.13 | 0.46 | 2.22 | 0.263041 |
| organic | 1.629433 | 1.61 | 0.44 | 2.49 | 0.329176 |


#### Price Analysis by Avocado Type

Organic avocados have a substantially higher average price ($1.63) than conventional avocados ($1.16),
a premium of approximately 41%. Organic prices also show a larger absolute spread, with a standard
deviation of $0.33 versus $0.26 for conventional avocados, although relative to the mean the spread is
similar (about 20% for organic and 23% for conventional). Both types share similar minimum prices, but
organic avocados reach higher maximum prices ($2.49 versus $2.22), indicating stronger premium pricing potential.
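
The premium figure, and a scale-free view of volatility, can be reproduced directly from the table above. This is a small arithmetic sketch using the reported means and standard deviations; the coefficient of variation (std / mean) is an addition here, not a statistic computed in the notebook.

```python
# Values copied from the price_by_type output above.
stats = {
    "conventional": {"mean": 1.158040, "std": 0.263041},
    "organic": {"mean": 1.629433, "std": 0.329176},
}

# Relative price premium of organic over conventional.
premium = stats["organic"]["mean"] / stats["conventional"]["mean"] - 1

# Coefficient of variation (std / mean): dispersion relative to the price level.
cv = {name: s["std"] / s["mean"] for name, s in stats.items()}

print(f"organic premium: {premium:.1%}")
for name, value in cv.items():
    print(f"{name} CV: {value:.1%}")
```

The premium comes out near 41%. Organic prices have the larger standard deviation in dollars, but measured against their own mean they are slightly less dispersed than conventional prices.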


In [11]:
volume_stats = df['total_volume'].agg(
    avg_volume = 'mean',
    median_volume = 'median',
    min_volume = 'min',
    max_volume = 'max', 
    std_volume = 'std'
)
volume_stats

avg_volume       8.603374e+05
median_volume    1.118385e+05
min_volume       8.456000e+01
max_volume       6.250565e+07
std_volume       3.472312e+06
Name: total_volume, dtype: float64

#### Sales Volume Distribution

Sales volume per observation is highly right-skewed. The mean (860K) is far above the median
(112K), indicating that a small number of observations account for a large share of total sales.
Part of this skew comes from the dataset itself: it includes the national aggregate ("Totalus")
alongside individual markets, and those aggregate rows are far larger than city-level rows. The
high standard deviation (3.47M) reflects the same mix of scales, along with variation across
regions and time periods.
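
To quantify how concentrated volume is, one option is to measure the share of the total contributed by the largest observations. The sketch below is a minimal illustration; `top_share` is a hypothetical helper and the toy numbers are made up, not taken from `avocado_cleaned.csv`.

```python
import pandas as pd

def top_share(values: pd.Series, top_frac: float = 0.01) -> float:
    """Share of the total contributed by the largest `top_frac` of observations."""
    n_top = max(1, int(len(values) * top_frac))
    return values.nlargest(n_top).sum() / values.sum()

# Made-up volumes for illustration only: nine small markets and one very large row.
toy = pd.Series([100, 120, 90, 110, 95, 105, 100, 115, 85, 9080])
print(top_share(toy, top_frac=0.10))
```

On the real data this would be called as `top_share(df['total_volume'], 0.01)`, ideally after deciding how to treat aggregate rows.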


In [12]:
volume_by_type = (
    df.groupby('type')['total_volume']
    .agg(
        avg_volume = 'mean',
        total_volume = 'sum',
        volume_std = 'std'
    )
)
volume_by_type

| type | avg_volume | total_volume | volume_std |
|---|---|---|---|
| conventional | 1653213.0 | 15087220000.0 | 4747892.0 |
| organic | 48605.0 | 433264900.0 | 143945.9 |


#### Sales Volume Comparison by Avocado Type

Conventional avocados account for about 97% of total sales volume, with an average volume per
observation approximately 34 times higher than organic avocados. While organic avocados command
higher prices, their total sales volume remains low, indicating a niche, premium market segment.
Conventional avocado sales also show much higher absolute variability, which is expected given
their larger scale and may also reflect seasonal demand patterns.


In [13]:
price_volume_corr = df[['averageprice', 'total_volume']].corr()
price_volume_corr

| | averageprice | total_volume |
|---|---|---|
| averageprice | 1.0 | -0.196085 |
| total_volume | -0.196085 | 1.0 |


#### Relationship Between Price and Volume

The correlation between average price and total sales volume is -0.20, indicating a weak negative
linear relationship. Higher prices are associated with slightly lower sales volumes, but price
alone explains little of the variation in volume (r² ≈ 0.04). Because this figure pools both
product types and all regions, it likely reflects differences between segments as much as price
sensitivity within them; seasonality, region, and product type appear to play a larger role in
avocado sales.
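
Since the pooled correlation mixes two product types with very different price and volume levels, a natural follow-up is to compute it within each type. The sketch below assumes the notebook's column names (`type`, `averageprice`, `total_volume`); the function name and the toy rows are illustrative and not from the dataset. The rank-based option is Spearman's correlation computed as Pearson on ranks.

```python
import pandas as pd

def price_volume_corr_by_type(df: pd.DataFrame, use_ranks: bool = False) -> pd.Series:
    """Price-volume correlation per avocado type.

    With use_ranks=True, both columns are ranked first, which yields
    Spearman's rank correlation (robust to skewed volumes).
    """
    result = {}
    for avocado_type, group in df.groupby("type"):
        price, volume = group["averageprice"], group["total_volume"]
        if use_ranks:
            price, volume = price.rank(), volume.rank()
        result[avocado_type] = price.corr(volume)
    return pd.Series(result, name="price_volume_corr")

# Made-up rows for illustration only; they do not come from avocado_cleaned.csv.
toy = pd.DataFrame({
    "type": ["conventional"] * 4 + ["organic"] * 4,
    "averageprice": [1.0, 1.1, 1.2, 1.3, 1.5, 1.6, 1.7, 1.8],
    "total_volume": [900, 850, 700, 650, 60, 55, 50, 40],
})
print(price_volume_corr_by_type(toy))
print(price_volume_corr_by_type(toy, use_ranks=True))
```

The same idea extends to grouping by region, or to correlating log-transformed volume, if the skew in volume is a concern.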


In [14]:
top_volume_regions = (
    df.groupby('region')['total_volume']
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
top_volume_regions

region
Totalus         5.864740e+09
West            1.086481e+09
California      1.028784e+09
Southcentral    1.011280e+09
Northeast       7.132809e+08
Southeast       6.152384e+08
Greatlakes      5.896425e+08
Midsouth        5.083494e+08
Losangeles      5.078965e+08
Plains          3.111885e+08
Name: total_volume, dtype: float64

#### Regional Sales Volume Analysis

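Excluding the national aggregate ("Totalus"), the West, California, and South Central regions
account for the highest avocado sales volumes, each with roughly 1.0 to 1.1 billion units over the
period. The list mixes broad regions (West, Northeast) with a state (California) and a city
(Los Angeles), so some entries likely overlap and their volumes should not be added together.
Plains and Midsouth sit at the lower end of this top ten and may represent growth opportunities,
potentially addressed through targeted distribution and marketing strategies.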
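
The ranking can also be computed with aggregate rows filtered out up front, rather than excluded by eye. This is a minimal sketch: `top_regions_by_volume` is a hypothetical helper, and the default exclusion list contains only "Totalus" because that is the only label identified above as an aggregate. Whether labels such as "West" or "Northeast" are also roll-ups is an assumption to verify against the data source.

```python
import pandas as pd

def top_regions_by_volume(df: pd.DataFrame, n: int = 10,
                          exclude: tuple = ("Totalus",)) -> pd.Series:
    """Total volume per region, largest first, skipping the regions in `exclude`."""
    markets = df[~df["region"].isin(exclude)]
    return (
        markets.groupby("region")["total_volume"]
        .sum()
        .sort_values(ascending=False)
        .head(n)
    )

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "region": ["Totalus", "West", "Albany", "West", "Albany", "Totalus"],
    "total_volume": [1000, 300, 50, 320, 60, 1100],
})
print(top_regions_by_volume(toy, n=2))
```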


In [15]:
expensive_regions = (
    df.groupby('region')['averageprice']
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
expensive_regions

region
Hartfordspringfield    1.782105
Newyork                1.724837
Philadelphia           1.632130
Sanfrancisco           1.629120
Northeast              1.601923
Sacramento             1.598273
Charlotte              1.573079
Albany                 1.561036
Chicago                1.556775
Baltimorewashington    1.534231
Name: averageprice, dtype: float64

#### Regional Pricing Analysis

Average avocado prices vary significantly by region. Hartford-Springfield and New York exhibit the
highest average prices, followed by Philadelphia and San Francisco. These higher-priced regions
are largely concentrated in major metropolitan areas, suggesting stronger demand, higher cost
structures, or greater willingness to pay. In contrast, regions with lower average prices may be
more price-sensitive or benefit from closer proximity to supply sources. This regional variation
highlights opportunities for differentiated pricing and targeted market strategies.

In [17]:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

In [18]:
yearly_price = (
    df.groupby('year')['averageprice']
    .mean()
)
yearly_price

year
2015    1.372235
2016    1.328633
2017    1.482359
2018    1.347531
Name: averageprice, dtype: float64

#### Average Avocado Price Trend by Year

Average avocado prices fluctuated over the observed period. Prices declined slightly from 2015
($1.37) to 2016 ($1.33), increased notably in 2017 ($1.48), and then decreased again in 2018
($1.35). The 2017 spike suggests a temporary supply or demand shock. The 2018 figure covers only
January through late March (the data ends on 2018-03-25), months in which prices are typically at
their lowest, so it is not directly comparable with the full-year averages and may overstate the
degree of price normalization. Overall, avocado pricing shows year-to-year volatility rather than
a consistent upward or downward trend.
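
Because 2018 only runs through March, a like-for-like comparison restricts every year to the same months. The sketch below assumes a datetime `date` column as in the notebook; `yearly_mean_for_months` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def yearly_mean_for_months(df: pd.DataFrame, months, value: str = "averageprice") -> pd.Series:
    """Mean of `value` per year, using only rows whose month is in `months`."""
    subset = df[df["date"].dt.month.isin(months)]
    return subset.groupby(subset["date"].dt.year)[value].mean().rename_axis("year")

# Made-up rows for illustration only: 2017 has a late-year price, 2018 does not.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-01-08", "2017-02-05", "2017-08-06", "2018-01-07", "2018-02-04"]
    ),
    "averageprice": [1.0, 1.2, 2.0, 1.1, 1.3],
})
print(yearly_mean_for_months(toy, months=[1, 2]))
```

On the real data, `yearly_mean_for_months(df, months=[1, 2, 3])` would compare the first quarter of each year.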


In [19]:
yearly_volume = (
    df.groupby('year')['total_volume']
    .sum()
)
yearly_volume

year
2015    4.385265e+09
2016    4.820425e+09
2017    4.932058e+09
2018    1.382738e+09
Name: total_volume, dtype: float64

#### Total Sales Volume Trend by Year

Total avocado sales volume increased from 2015 (4.39B units) to a peak in 2017 (4.93B units),
with growth slowing from about 10% in 2016 to about 2% in 2017. In 2018, total volume dropped
sharply to 1.38B units. This decline reflects incomplete data coverage rather than a true
contraction in demand: the 2018 data ends on 2018-03-25, so the year covers only about three
months. Overall, the data suggests growth through 2017, with 2018 requiring cautious
interpretation.
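
One way to make a partial year comparable is to normalize total volume by the number of reporting dates in that year. The sketch below assumes each distinct `date` corresponds to one reporting period (the data appears to be weekly, but that is an assumption); `weekly_volume_by_year` and the toy rows are illustrative only.

```python
import pandas as pd

def weekly_volume_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Total volume per year divided by the number of distinct report dates in that year."""
    grouped = df.groupby(df["date"].dt.year)
    out = pd.DataFrame({
        "weeks": grouped["date"].nunique(),
        "total_volume": grouped["total_volume"].sum(),
    })
    out["volume_per_week"] = out["total_volume"] / out["weeks"]
    return out.rename_axis("year")

# Made-up rows for illustration only: two report dates in 2017, one in 2018.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-01-01", "2017-01-01", "2017-01-08", "2017-01-08", "2018-01-07", "2018-01-07"]
    ),
    "total_volume": [100, 50, 120, 30, 110, 40],
})
print(weekly_volume_by_year(toy))
```

In the toy data the 2018 total is half the 2017 total, yet volume per report date is identical, which is the pattern a truncated year produces.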


In [20]:
monthly_price = (
    df.groupby('month')['averageprice']
    .mean()
)
monthly_price

month
1     1.308299
2     1.272145
3     1.324540
4     1.362157
5     1.335764
6     1.394793
7     1.452847
8     1.485543
9     1.533166
10    1.543613
11    1.441944
12    1.325712
Name: averageprice, dtype: float64

#### Monthly Price Seasonality

Average avocado prices display a clear seasonal pattern. Prices are lowest early in the year,
bottoming out in February ($1.27), rise through spring and summer with a brief dip in May, and
peak in September and October ($1.53 to $1.54). This mid-to-late-year increase suggests seasonal
supply constraints or increased demand. Prices then fall through November and December ($1.33),
returning to roughly early-year levels. These patterns highlight the importance of seasonality in
avocado pricing and can inform timing for procurement and pricing strategies.
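
Monthly means pooled across years can hide whether the pattern repeats every year, especially since January to March include an extra year of data. A month-by-year table is one way to check. The sketch below assumes a datetime `date` column; `month_by_year` is a hypothetical helper, and the toy rows are made up.

```python
import pandas as pd

def month_by_year(df: pd.DataFrame, value: str = "averageprice") -> pd.DataFrame:
    """Mean of `value` for each (month, year) pair, months as rows and years as columns."""
    return (
        df.assign(year=df["date"].dt.year, month=df["date"].dt.month)
        .pivot_table(index="month", columns="year", values=value, aggfunc="mean")
    )

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-02-05", "2017-09-03", "2018-02-04"]),
    "averageprice": [1.2, 1.6, 1.3],
})
print(month_by_year(toy))
```

Passing `value="total_volume"` gives the equivalent view for sales volume. Months with no data in a given year appear as missing values rather than zeros.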


In [21]:
monthly_volume = (
    df.groupby('month')['total_volume']
    .mean()
)
monthly_volume 

month
1     9.053908e+05
2     1.020563e+06
3     8.892901e+05
4     8.883027e+05
5     9.810292e+05
6     9.400899e+05
7     8.658180e+05
8     8.213477e+05
9     7.780890e+05
10    7.009238e+05
11    6.868249e+05
12    7.779024e+05
Name: total_volume, dtype: float64

#### Monthly Sales Volume Seasonality

Average sales volume per observation shows clear seasonal variation. Volumes are highest in the
first half of the year, with peaks in February (1.02M) and May (0.98M), and decline steadily from
June to a low in November (0.69M). Volume recovers modestly in December (0.78M). This pattern
indicates that demand is strongest in the first half of the year, while later months experience
softer volumes, highlighting the importance of aligning supply and inventory planning with
seasonal demand trends.


In [22]:
df['revenue'] = df['averageprice'] * df['total_volume']

In [23]:
revenue_stats = df['revenue'].agg(
    avg_revenue = 'mean',
    total_revenue = 'sum', 
    max_revenue = 'max', 
    min_revenue = 'min'
)
revenue_stats

avg_revenue      9.382658e+05
total_revenue    1.692631e+10
max_revenue      5.437991e+07
min_revenue      1.344504e+02
Name: revenue, dtype: float64

#### Revenue Distribution Summary

Revenue here is estimated as average price multiplied by total volume. Revenue per observation
shows a wide distribution, reflecting significant variation in sales scale across regions and time
periods. The average revenue is approximately $938K, while total revenue across the dataset
exceeds $16.9B. Because the dataset includes the national aggregate ("Totalus") alongside
individual regions, this total counts some sales more than once and should not be read as a market
size. Revenue ranges from as low as $134 to about $54.4M, indicating the presence of both
small-scale markets and very high-revenue segments. This wide spread highlights the importance of
segmenting revenue analysis by region, product type, and seasonality to better understand key
revenue drivers.


In [24]:
revenue_by_type = (
    df.groupby('type')['revenue']
    .agg(
        total_revenue = 'sum',
        avg_revenue = 'mean'
    )
)
revenue_by_type

| type | total_revenue | avg_revenue |
|---|---|---|
| conventional | 16253520000.0 | 1781013.0 |
| organic | 672792200.0 | 75475.9 |


#### Revenue Analysis by Avocado Type

Conventional avocados generate the vast majority of total revenue, exceeding $16.2B, with an
average revenue per observation of approximately $1.78M. In contrast, organic avocados generate
around $673M in total revenue, with a far lower average revenue per observation (about $75K).
Despite lower volumes, organic avocados command higher prices, consistent with their role as a
premium, niche product, while conventional avocados drive overall market revenue through scale.
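
The organic segment's weight can be expressed as a share of volume and of revenue, using the totals reported in the tables above. This is plain arithmetic on those figures, not a new computation on the dataset.

```python
# Totals copied from the volume_by_type and revenue_by_type outputs above.
volume = {"conventional": 15_087_220_000.0, "organic": 433_264_900.0}
revenue = {"conventional": 16_253_520_000.0, "organic": 672_792_200.0}

organic_volume_share = volume["organic"] / sum(volume.values())
organic_revenue_share = revenue["organic"] / sum(revenue.values())

print(f"organic share of volume:  {organic_volume_share:.1%}")
print(f"organic share of revenue: {organic_revenue_share:.1%}")
```

Organic accounts for a little under 3% of volume but close to 4% of revenue; the gap between the two shares is the price premium at work.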


In [26]:
Q1 = df['averageprice'].quantile(0.25)
Q3 = df['averageprice'].quantile(0.75)
IQR = Q3 - Q1

price_outliers = df[
    (df['averageprice'] < Q1 - 1.5 * IQR) |
    (df['averageprice'] > Q3 + 1.5 * IQR)
]
price_outliers.shape

(5, 18)

#### Price Outlier Detection

Using the interquartile range (IQR) method with fences at 1.5 × IQR beyond the quartiles, only 5
of 18,040 observations (about 0.03%) are flagged as price outliers. Extreme prices are therefore
rare in this cleaned dataset, and the price variability observed is more plausibly normal market
behavior than anomalies or data quality issues.
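
The same IQR rule can be wrapped in a small reusable function so it can be applied to other columns, such as volume or revenue. This is a sketch of the rule used above; `iqr_outliers` is a hypothetical helper name and the toy values are made up.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Made-up values for illustration only: Q1 = 2, Q3 = 4, so the fences are -1 and 7.
toy = pd.Series([1, 2, 3, 4, 100])
print(iqr_outliers(toy))
```

For heavily skewed columns like `total_volume`, this rule will flag many of the large rows, so it is better suited to roughly symmetric columns such as price.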


In [27]:
summary = {
    "Avg Price": df['averageprice'].mean(),
    "Median Price": df['averageprice'].median(),
    "Avg Volume": df['total_volume'].mean(),
    "Price-Volume Corr": df[['averageprice', 'total_volume']].corr().iloc[0,1],
    "Total Revenue": df['revenue'].sum()
}
summary

{'Avg Price': np.float64(1.390966740576497),
 'Median Price': 1.36,
 'Avg Volume': np.float64(860337.3529157427),
 'Price-Volume Corr': np.float64(-0.19608494373434956),
 'Total Revenue': np.float64(16926314619.8055)}

## Executive Summary

This analysis examines avocado pricing, sales volume, and revenue dynamics across time, regions,
and product types. The average avocado price is $1.39, with a median of $1.36, indicating a
roughly symmetric, only mildly right-skewed price distribution. Average sales volume per
observation is approximately 860K units, though volumes are highly skewed, with a small number of
regions (and the national aggregate rows) accounting for a large share of the total.

The correlation between price and volume is weakly negative (-0.20): higher prices are associated
with slightly lower volumes, but price alone explains little of the variation in demand. Other
factors such as seasonality, region, and product type appear to play a more significant role.

Total estimated revenue across the dataset exceeds $16.9B (a figure that includes the national
aggregate rows), with conventional avocados generating the vast majority of revenue through high
sales volumes, while organic avocados operate as a lower-volume, premium-priced segment. Overall,
the findings highlight a market driven by scale in conventional products, premium positioning in
organic products, and strong seasonal and regional effects that influence pricing and demand.
