# Agricultural Commodity Price Analysis in India (2001-2025)

This notebook conducts an in-depth exploratory data analysis (EDA) of daily commodity prices in India. The goal is to uncover trends, patterns, and insights related to price stability, seasonality, and regional disparities. The analysis is based on a pre-cleaned and feature-enriched dataset prepared in the `data_cleaning.ipynb` notebook.

### 0. Setup and Data Loading

The first step is to import the necessary libraries and load the cleaned dataset from the Parquet file.

In [1]:
# Import the pandas library, the primary tool for data manipulation and analysis in Python.
import pandas as pd
import numpy as np # Although not used here, it's standard practice to import numpy.

# Load the pre-processed data from the Parquet file created during the cleaning phase.
# Using the Parquet file is highly efficient and ensures all data types are correctly preserved.
df_master = pd.read_parquet('/kaggle/input/cleaned-commodity-prices/cleaned_commodity_prices.parquet')

In [2]:
# Display a concise summary of the DataFrame to verify that all columns, record counts, 
# and data types (especially 'Arrival_Date' and 'Category') are loaded correctly.
print("--- DataFrame Information ---")
df_master.info()

# Perform a final sanity check for any null values to confirm the dataset is clean and ready for analysis.
# The output should show '0' for all columns.
print("\n--- Null Value Check ---")
print(df_master.isnull().sum())

--- DataFrame Information ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75207044 entries, 0 to 75207043
Data columns (total 15 columns):
 #   Column          Dtype         
---  ------          -----         
 0   State           object        
 1   District        object        
 2   Market          object        
 3   Commodity       object        
 4   Variety         object        
 5   Grade           object        
 6   Arrival_Date    datetime64[ns]
 7   Min_Price       float64       
 8   Max_Price       float64       
 9   Modal_Price     float64       
 10  Commodity_Code  int64         
 11  Category        object        
 12  Year            int32         
 13  Month           int32         
 14  Month_Name      object        
dtypes: datetime64[ns](1), float64(3), int32(2), int64(1), object(8)
memory usage: 7.8+ GB

--- Null Value Check ---
State             0
District          0
Market            0
Commodity         0
Variety           0
Grade             0
Arriva

### 1. Analysis of Long-Term Price Trends (2015-2025)

#### Business Query:
*Which commodities show the strongest upward price trends over the last decade (2015-2025), and why (e.g., inflation, demand spikes)?*

To answer this, we will calculate the absolute increase in the average `Modal_Price` for each commodity between the start year (2015) and the end year (2025).

In [3]:
# Calculate the average modal price for each commodity specifically for the year 2015.
avg_prices_2015_df = df_master[df_master['Arrival_Date'].dt.year == 2015].groupby('Commodity').agg(avg_price_2015=('Modal_Price', 'mean'))

# Repeat the same process to get the average modal price for each commodity in 2025.
avg_prices_2025_df = df_master[df_master['Arrival_Date'].dt.year == 2025].groupby('Commodity').agg(avg_price_2025=('Modal_Price', 'mean'))

# Merge the 2015 and 2025 price data into a single DataFrame, aligning them by 'Commodity'.
# This places the start and end prices side-by-side, making comparison easy.
price_trends_df = pd.merge(avg_prices_2015_df, avg_prices_2025_df, on='Commodity', how='left')

# Calculate the absolute price increase over the decade by subtracting the 2015 average from the 2025 average.
price_trends_df['absolute_price_increase'] = price_trends_df['avg_price_2025'] - price_trends_df['avg_price_2015']

# Sort the DataFrame in descending order based on the price increase to identify the top performers.
top_10_trending_commodities = price_trends_df.sort_values(by='absolute_price_increase', ascending=False).head(10)

# Display the final ranked list of the top 10 commodities.
top_10_trending_commodities

Commodity,avg_price_2015,avg_price_2025,absolute_price_increase
Cardamoms,75053.351699,217995.302013,142941.950315
Almond (Badam),6134.710744,92961.325967,86826.615223
Tube Flower,6400.0,64049.19598,57649.19598
Kakada,4250.0,44405.898298,40155.898298
Jasmine,27222.674419,56660.396929,29437.722511
Ghee,30438.65974,55989.558541,25550.898801
She Buffalo,26699.829452,47637.931034,20938.101583
Coconut Oil,14502.177373,32595.693463,18093.51609
Bull,18733.556738,36342.857143,17609.300405
Coffee,9508.096721,26448.22627,16940.129549


### Conclusion & Interpretation (Revised)

The analysis successfully identifies the commodities with the largest absolute increase in average `Modal_Price` between 2015 and 2025.

#### Key Findings:

*   **High-Value Goods Dominate:** The list is led by non-staple, high-priced commodities. **Cardamoms** showed the largest increase, with its average price rising by over **₹142,000**. **Almond (Badam)** followed, with an increase of over **₹86,000**.
*   **Distinct Clusters Emerge:** The top performers fall into clear categories:
    *   **Spices & Nuts:** (e.g., Cardamoms, Almonds)
    *   **Horticulture (Flowers):** (e.g., Tube Flower, Kakada, Jasmine)
    *   **Animal & Processed Products:** (e.g., Ghee, She Buffalo, Bull, Coconut Oil)
*   **Absence of Staple Grains:** Staple food grains like wheat and rice do not appear on this list; their absolute price increases were far smaller than those of these specialized products. Because the ranking uses absolute rupee differences, it naturally favors commodities that were already expensive in 2015.

#### Interpretation for Stakeholders:

This upward trend suggests that the largest rupee price gains over the last decade occurred in markets for premium and specialized agricultural products. Likely drivers include general inflation, rising disposable incomes that increase demand for luxury goods, and potentially more volatile supply chains for these specialized items. The price data alone cannot separate these causes. Note also that 2024 is the most recent full year in the dataset, so the 2025 averages are based on a partial year.

For farmers and exporters, this suggests that the largest potential price gains lie in the cultivation and trade of high-value, non-staple commodities.
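
The point about absolute differences favoring expensive commodities can be checked by adding a percentage column next to the rupee difference. The following is a minimal sketch on made-up data; the commodity names and prices are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical yearly averages for two invented commodities.
toy = pd.DataFrame({
    "Commodity": ["Spice", "Spice", "Grain", "Grain"],
    "Year": [2015, 2025, 2015, 2025],
    "Modal_Price": [75000.0, 210000.0, 800.0, 2600.0],
})

# One row per commodity, one column per year.
yearly = toy.pivot_table(
    index="Commodity", columns="Year", values="Modal_Price", aggfunc="mean"
)

# Absolute and relative change between the two endpoint years.
yearly["absolute_increase"] = yearly[2025] - yearly[2015]
yearly["percent_increase"] = (yearly[2025] / yearly[2015] - 1) * 100

print(yearly.sort_values("absolute_increase", ascending=False))
```

In this toy case the expensive item leads on rupee increase while the cheap item leads on percentage increase, which is why the two rankings should be read side by side.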

### Analysis of Top Commodity Prices in the Most Recent Year (2024)

#### Business Query:
*Which 3 commodities had the highest average `Modal_Price` in 2024?*

This analysis identifies the most expensive commodities based on their average market price in the most recent full year of data.

In [4]:
# Filter the DataFrame to include only data from the year 2024.
df_2024 = df_master[df_master['Year'] == 2024]

# Group the 2024 data by 'Commodity', calculate the mean of 'Modal_Price' for each,
# sort the results in descending order, and select the top 3.
top_3_highest_price_commodities_2024 = df_2024.groupby('Commodity').agg(
    avg_modal_price_2024=('Modal_Price', 'mean')
).sort_values(by='avg_modal_price_2024', ascending=False).head(3)

# Display the final result.
print("--- Top 3 Most Expensive Commodities by Average Price in 2024 ---")
print(top_3_highest_price_commodities_2024.to_string(float_format='₹{:,.2f}'.format))

--- Top 3 Most Expensive Commodities by Average Price in 2024 ---
             avg_modal_price_2024
Commodity                        
Saffron               ₹292,139.28
Cardamoms             ₹191,437.35
poppy seeds            ₹93,352.90


### Conclusion & Interpretation

The analysis pinpoints the three commodities with the highest monetary value in the 2024 market.

| Commodity | Average Modal Price (2024) |
| :--- | :--- |
| **Saffron** | **₹292,139.28** |
| **Cardamoms** | **₹191,437.35** |
| **Poppy Seeds** | **₹93,352.90** |

The results show that high-value spices are the most expensive commodities. **Saffron**, known as one of the most expensive spices in the world, leads with an average price of over **₹2.9 lakh**. **Cardamoms** follow at over **₹1.9 lakh**.

This information is critical for traders and exporters who specialize in high-margin, premium agricultural products.

#### Sub-Analysis: Price Trend by Compound Annual Growth Rate (CAGR)

While the absolute price increase is informative, the **Compound Annual Growth Rate (CAGR)** offers a more standardized measure of investment return. CAGR represents the steady, year-over-year growth rate required to get from the start price to the end price.

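For this analysis, we will focus on the **10 most frequently recorded commodities**. Record count serves as a proxy for trading volume, since the dataset has no quantity column, and it ensures that each yearly average rests on a large number of observations.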
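
As a quick illustration of the formula, CAGR = (end price / start price)^(1 / years) - 1, here is a small sketch. The helper name `cagr` is introduced only for this example; the two Potato averages are the ones reported in this notebook's CAGR results table.

```python
def cagr(start_price: float, end_price: float, num_years: int) -> float:
    """Return the compound annual growth rate as a fraction (0.05 means 5%)."""
    if start_price <= 0 or num_years <= 0:
        raise ValueError("start_price and num_years must be positive")
    return (end_price / start_price) ** (1 / num_years) - 1


# Potato: average Modal_Price of 835.16 in 2015 and 1,886.68 in 2025.
potato_cagr = cagr(835.16, 1886.68, 2025 - 2015)
print(f"{potato_cagr:.2%}")  # prints 8.49%
```

The same arithmetic is applied column-wise to the merged DataFrame in the cell that follows.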

In [5]:
# First, identify the top 10 most frequently recorded commodities in the dataset.
top_10_commodities_by_volume = df_master['Commodity'].value_counts().head(10).index.tolist()

# Filter the DataFrame to a decade's worth of data (2015-2025) and for only our top 10 commodities.
decade_df = df_master[
    (df_master['Year'].between(2015, 2025)) &
    (df_master['Commodity'].isin(top_10_commodities_by_volume))
]

# Calculate the average Modal_Price for each commodity for every year in the filtered dataset.
yearly_avg_price_df = decade_df.groupby(['Commodity', 'Year'])['Modal_Price'].mean().reset_index()

# Isolate the starting prices from 2015.
start_prices_df = yearly_avg_price_df[yearly_avg_price_df['Year'] == 2015][['Commodity', 'Modal_Price']]
start_prices_df = start_prices_df.rename(columns={'Modal_Price': 'price_2015'})

# Isolate the ending prices from 2025.
end_prices_df = yearly_avg_price_df[yearly_avg_price_df['Year'] == 2025][['Commodity', 'Modal_Price']]
end_prices_df = end_prices_df.rename(columns={'Modal_Price': 'price_2025'})

# Merge the start and end prices into a single DataFrame.
cagr_df = pd.merge(start_prices_df, end_prices_df, on='Commodity')

# Define the number of years for the CAGR calculation.
num_years = 2025 - 2015

# Calculate the CAGR and the total growth percentage.
cagr_df['CAGR'] = ((cagr_df['price_2025'] / cagr_df['price_2015']) ** (1 / num_years)) - 1
cagr_df['total_growth_percent'] = ((cagr_df['price_2025'] / cagr_df['price_2015']) - 1) * 100

# Sort the final results by CAGR in descending order.
final_cagr_results = cagr_df.sort_values(by='CAGR', ascending=False)

# --- Display the final results ---
print("--- CAGR and Total Growth for Top 10 Most-Traded Commodities (2015-2025) ---")
print(final_cagr_results.to_string(index=False, formatters={
    'CAGR': '{:,.2%}'.format,
    'total_growth_percent': '{:,.2f}%'.format,
    'price_2015': '₹{:,.2f}'.format,
    'price_2025': '₹{:,.2f}'.format
}))

--- CAGR and Total Growth for Top 10 Most-Traded Commodities (2015-2025) ---
           Commodity price_2015 price_2025   CAGR total_growth_percent
              Potato    ₹835.16  ₹1,886.68  8.49%              125.91%
              Banana  ₹1,999.43  ₹4,036.11  7.28%              101.86%
             Brinjal  ₹1,442.73  ₹2,677.58  6.38%               85.59%
               Wheat  ₹1,573.86  ₹2,600.70  5.15%               65.24%
        Green Chilli  ₹2,615.29  ₹4,140.81  4.70%               58.33%
         Cauliflower  ₹1,558.71  ₹2,454.18  4.64%               57.45%
Paddy (Dhan)(Common)  ₹1,455.30  ₹2,266.58  4.53%               55.75%
                Rice  ₹2,547.04  ₹3,633.20  3.62%               42.64%
              Tomato  ₹1,729.33  ₹2,026.03  1.60%               17.16%
               Onion  ₹2,295.86  ₹2,274.19 -0.09%               -0.94%


### Conclusion & Interpretation

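When the **rate of return** (CAGR) is computed for the ten most frequently recorded commodities, a different picture emerges from the absolute ranking. These staples have far smaller rupee increases than items like Cardamoms, yet several of them roughly doubled in price over the decade. This ranking covers only these ten commodities; it does not compare them with the high-value items from the previous analysis.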

| Commodity | Price 2015 | Price 2025 | Total Growth | **CAGR** |
| :--- | :--- | :--- | :--- | :--- |
| **Potato** | ₹835.16 | ₹1,886.68 | 125.91% | **8.49%** |
| **Banana** | ₹1,999.43 | ₹4,036.11 | 101.86% | **7.28%** |
| **Brinjal** | ₹1,442.73 | ₹2,677.58 | 85.59% | **6.38%** |

**Potato** stands out as the top-performing high-volume commodity, with an implied annual growth rate of nearly **8.5%**. **Banana** and **Brinjal** follow. CAGR smooths the path between the two endpoint years, so it says nothing about how prices moved in between.

Notably, **Onion** showed a slightly negative CAGR, indicating that despite its price volatility, its average price did not achieve consistent year-over-year growth in this period.

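#### Overall Insight:
- **Absolute Growth Leaders:** High-value, lower-volume goods (Cardamoms, Almonds).
- **Relative Growth (CAGR) Leaders among high-volume commodities:** Potato and Banana.

The two rankings are not directly comparable, because CAGR was computed only for the ten high-volume commodities. For example, the Cardamoms averages from the first analysis (about ₹75,053 in 2015 and ₹217,995 in 2025) imply a CAGR of roughly 11%, which is higher than Potato's 8.49%. Read together, the two analyses give a more complete picture of market trends.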

### Price Change Analysis for Key Staples (Rice & Wheat)

#### Business Query:
*For Rice and Wheat, what was the average `Modal_Price` in 2020 and in 2024? Show the price increase in rupees and percentage.*

This analysis focuses on the price evolution of India's two most important staple food grains over a recent four-year period, providing a clear picture of food price inflation.

In [6]:
# Filter the DataFrame to isolate records for only 'Rice' and 'Wheat' in the specific years 2020 and 2024.
df_rice_wheat = df_master[
    (df_master['Commodity'].isin(['Rice', 'Wheat'])) & 
    (df_master['Year'].isin([2020, 2024]))
]

# Create a pivot table to place the average prices for 2020 and 2024 side-by-side.
# This structure is ideal for direct comparison and calculations.
avg_modal_price_pivot = df_rice_wheat.pivot_table(
    index='Commodity',
    columns='Year',
    values='Modal_Price',
    aggfunc='mean'
)

# Calculate the absolute price increase using the .diff() method across the columns (axis=1).
# For each row, this calculates the difference (2024 price - 2020 price), and we select the result from the 2024 column.
avg_modal_price_pivot['price_increase_inr'] = avg_modal_price_pivot[[2020, 2024]].diff(axis=1)[2024]

# Calculate the percentage change using the .pct_change() method, which is the idiomatic pandas way to find growth rate.
# We then format the result as a percentage.
avg_modal_price_pivot['increase_percentage'] = round(avg_modal_price_pivot[[2020, 2024]].pct_change(axis=1)[2024] * 100, 2)


# --- Display the final comparison table ---
print("--- Price Evolution of Rice & Wheat (2020 vs. 2024) ---")
# Renaming columns for a cleaner final presentation.
avg_modal_price_pivot.columns.name = 'Metric'
avg_modal_price_pivot.rename(columns={2020: 'Avg Price 2020 (INR)', 2024: 'Avg Price 2024 (INR)'}, inplace=True)
print(avg_modal_price_pivot.to_string(float_format='{:,.2f}'.format))

--- Price Evolution of Rice & Wheat (2020 vs. 2024) ---
Metric     Avg Price 2020 (INR)  Avg Price 2024 (INR)  price_increase_inr  increase_percentage
Commodity                                                                                     
Rice                   2,773.30              3,481.27              707.97                25.53
Wheat                  1,860.52              2,558.25              697.73                37.50


### Conclusion & Interpretation

The analysis of staple food prices reveals significant inflation between 2020 and 2024.

| Metric | Avg Price 2020 (INR) | Avg Price 2024 (INR) | price_increase_inr | increase_percentage |
| :--- | :--- | :--- | :--- | :--- |
| **Rice** | 2,773.30 | 3,481.27 | 707.97 | 25.53% |
| **Wheat**| 1,860.52 | 2,558.25 | 697.73 | 37.50% |

Over the four-year period:
- **Wheat** experienced the larger relative price increase, rising by **37.50%**.
- **Rice** prices increased by **25.53%**.

In absolute terms, the average price for both commodities rose by approximately **₹700**. This analysis provides a clear, quantitative measure of the impact of recent food price inflation on the country's most essential food staples.

### Analysis of Seasonal Price Inflation (YoY)

#### Business Query:
*Calculate the year-over-year (YoY) price change (as a percentage) for Grains and Vegetables categories, aggregated monthly. Determine the months with the largest average YoY increases across these categories.*

This analysis aims to identify specific months of the year that consistently experience the highest price inflation for staple food categories, providing insight into seasonal market pressures.

In [7]:
# Filter the dataset to focus only on the 'Grains' and 'Vegetables' categories.
staples_df = df_master[df_master['Category'].isin(['Grains', 'Vegetables'])].copy()

# Group the data by category, year, and month, and then calculate the average price for each period.
monthly_avg_price_df = staples_df.groupby(['Category', 'Year', 'Month'])['Modal_Price'].mean().reset_index()

# It is critical to sort the DataFrame chronologically within each category before calculating percentage change.
monthly_avg_price_df.sort_values(by=['Category', 'Year', 'Month'], inplace=True)

# Calculate the Year-over-Year (YoY) percentage change.
# `groupby('Category')` ensures we only compare Grains-to-Grains and Vegetables-to-Vegetables.
# `periods=12` compares each row with the row 12 positions earlier within its category.
# That row is the same month of the previous year only if no month is missing for the category.
monthly_avg_price_df['YoY_Change'] = monthly_avg_price_df.groupby('Category')['Modal_Price'].pct_change(periods=12)

# Now, to find the average inflation for each month, we group by 'Month' and calculate the mean of all the YoY changes.
monthly_inflation_ranking = monthly_avg_price_df.groupby('Month')['YoY_Change'].mean().sort_values(ascending=False)

# Display the final ranked list of months.
print("--- Average Year-over-Year Price Increase by Month ---")
print("Categories Analyzed: Grains & Vegetables")
# We multiply by 100 to display the result as a proper percentage.
print((monthly_inflation_ranking * 100).to_string(float_format='{:,.2f}%'.format))

--- Average Year-over-Year Price Increase by Month ---
Categories Analyzed: Grains & Vegetables
Month
12   12.20%
1    11.11%
7    10.08%
3     8.47%
5     8.21%
11    8.07%
4     8.06%
2     7.82%
6     7.48%
10    7.03%
9     6.58%
8     6.12%


### Conclusion & Interpretation (Revised)

The analysis successfully identified the months that, on average, exhibit the highest year-over-year price inflation for the staple food categories of Grains and Vegetables.

| Rank | Month | Average YoY Increase |
| :--- | :--- | :--- |
| 1 | **December** | 12.20% |
| 2 | **January** | 11.11% |
| 3 | **July** | 10.08% |

#### Interpretation for Stakeholders:

The results reveal **two distinct peak seasons** for year-over-year price inflation in staple foods:

1.  **The Winter Period (December-January):** The most significant and consistent price increases occur at the turn of the year. This is likely driven by a combination of factors, including the transition between the Kharif (monsoon) and Rabi (winter) crop cycles, potential weather-related disruptions to supply chains in northern India, and heightened demand during festive periods.

2.  **The Mid-Monsoon Period (July):** A secondary spike in price inflation occurs in the middle of the monsoon season. A plausible explanation is that heavy rains disrupt transportation and logistics, leading to temporary, localized supply shortages and subsequent price increases.

#### Actionable Insight:
Instead of a single period of inflation, cooperatives and policymakers should anticipate and prepare for two primary high-risk windows: the **peak winter months (December-January)** and the **peak monsoon month (July)**. Strategic decisions regarding inventory management, pricing, and supply chain logistics should be focused on mitigating price volatility during these specific times of the year.
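
The `pct_change(periods=12)` step in the cell above lines up with the same month of the previous year only when every category has a row for every month. As a hedged alternative, the sketch below pairs each month with the same calendar month one year earlier through an explicit join, so a missing month yields `NaN` instead of a comparison against the wrong month. The data is a small made-up table, and `Prev_Year_Price` is a column name introduced just for this example.

```python
import pandas as pd

# Toy monthly averages (hypothetical); March 2023 is deliberately missing.
monthly = pd.DataFrame({
    "Category": ["Grains"] * 5,
    "Year": [2023, 2023, 2024, 2024, 2024],
    "Month": [1, 2, 1, 2, 3],
    "Modal_Price": [100.0, 200.0, 110.0, 250.0, 300.0],
})

# Copy the table with the year moved forward by one, so that joining on
# (Category, Year, Month) pairs each row with the same month a year earlier.
previous = monthly.assign(Year=monthly["Year"] + 1).rename(
    columns={"Modal_Price": "Prev_Year_Price"}
)
yoy = monthly.merge(previous, on=["Category", "Year", "Month"], how="left")
yoy["YoY_Change"] = yoy["Modal_Price"] / yoy["Prev_Year_Price"] - 1

print(yoy)
```

Here January and February 2024 get a YoY value, while March 2024 stays `NaN` because there is no March 2023 row to compare against.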

### Regional Price Analysis for a Key Commodity (Onion)

#### Business Query:
*Which state had the highest average `Modal_Price` for Onion in 2024?*

This analysis pinpoints the most expensive state for a key, volatile commodity in the most recent year, which is valuable information for understanding regional supply chains and price disparities.

In [18]:
# Filter the dataset to isolate all records for 'Onion' specifically in the year 2024.
df_onion_2024 = df_master[
    (df_master['Commodity'] == 'Onion') & 
    (df_master['Year'] == 2024)
].copy()

# As discovered in our earlier data quality checks, the dataset contains a significant outlier
# for Onion prices. An index-based removal is shown below but is left commented out, so this
# first pass still includes the outlier and the resulting average is skewed by it.
# The index 69409345 corresponds to the erroneous entry.
# df_onion_2024.drop(index=69409345, inplace=True, errors='ignore') # 'errors=ignore' prevents a crash if the code is re-run

# Group the unfiltered 2024 onion data by 'State', calculate the mean of 'Modal_Price',
# sort the results, and select the top state.
highest_price_state_for_onion = df_onion_2024.groupby('State').agg(
    avg_modal_price=('Modal_Price', 'mean')
).sort_values(by='avg_modal_price', ascending=False).head(1)

# Display the final result.
print("--- State with the Highest Average Price for Onion in 2024 ---")
print(highest_price_state_for_onion.to_string(float_format='₹{:,.2f}'.format))

--- State with the Highest Average Price for Onion in 2024 ---
             avg_modal_price
State                       
Maharashtra       ₹57,261.22


In [20]:
# Filter for 'Onion' in 2024, AND add a condition to exclude any impossibly high prices.
# This is the most robust way to remove the outlier without relying on a specific index number.
df_onion_2024_cleaned = df_master[
    (df_master['Commodity'] == 'Onion') & 
    (df_master['Year'] == 2024) &
    (df_master['Modal_Price'] < 1000000) # This condition removes the outlier
].copy()

# Now, group the CLEANED data by 'State' to get the correct average price.
highest_price_state_for_onion = df_onion_2024_cleaned.groupby('State').agg(
    avg_modal_price=('Modal_Price', 'mean')
).sort_values(by='avg_modal_price', ascending=False).head(1)

# --- Display the final, correct result ---
print("--- State with the Highest Average Price for Onion in 2024 (Outlier Removed) ---")
print(highest_price_state_for_onion.to_string(float_format='₹{:,.2f}'.format))

--- State with the Highest Average Price for Onion in 2024 (Outlier Removed) ---
                     avg_modal_price
State                               
Andaman and Nicobar        ₹5,875.00


### Conclusion & Interpretation

After correcting for a significant data outlier, the analysis identified the state with the highest average price for onions in 2024.

| State | Average Modal Price (2024) |
| :--- | :--- |
| **Andaman and Nicobar** | **₹5,875.00** |

The highest prices for onions were found in **Andaman and Nicobar**.

This is a plausible finding, as island territories often face higher prices for staple goods because of the transportation and logistics costs involved in bringing produce from the mainland.
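
The two runs above show how strongly a single bad record can move a group mean. The sketch below reproduces that effect on made-up data (the states `X` and `Y` and all prices are hypothetical) and compares the mean with the median, which is far less sensitive to one extreme value.

```python
import pandas as pd

# Toy onion prices (hypothetical); one row in state "X" is a data-entry error.
onion = pd.DataFrame({
    "State": ["X", "X", "X", "Y", "Y", "Y"],
    "Modal_Price": [2000.0, 2200.0, 9000000.0, 5000.0, 5500.0, 6000.0],
})

# Mean, median, and max per state: the max column makes the outlier easy to spot.
by_state = onion.groupby("State")["Modal_Price"].agg(["mean", "median", "max"])
print(by_state)

# Apply the same fixed cap used in the notebook; the cap value is a judgment call.
cleaned = onion[onion["Modal_Price"] < 1_000_000]
top_state_cleaned = cleaned.groupby("State")["Modal_Price"].mean().idxmax()
print(top_state_cleaned)
```

Before filtering, state `X` has the highest mean only because of the erroneous row; after filtering, `Y` leads. Checking the per-group maximum or median first is one way to notice such rows without knowing their index in advance.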

### Analysis of District-Level Price Volatility

#### Business Query:
*For each state, compute the average price spread (`Max_Price - Min_Price`) as a percentage of `Modal_Price`, segmented by district. Identify the five districts with the highest average spreads and their associated volatility metrics.*

This analysis pinpoints local markets with the highest intraday price volatility. The "spread percentage" is a normalized measure of risk and price instability, which is critical for traders and cooperatives in making logistical decisions.

In [9]:
# To prevent potential 'divide by zero' errors, we first filter out any records where the Modal_Price is zero.
df_filtered_for_spread = df_master[df_master['Modal_Price'] > 0].copy()

# Using .assign() is an efficient method to create multiple new columns in one chained operation.
# We first calculate the absolute spread, then use that new 'Spread' column to calculate the 'Spread_Percentage'.
daily_spreads_df = df_filtered_for_spread.assign(
    Spread = lambda x: x['Max_Price'] - x['Min_Price'],
    Spread_Percentage = lambda x: (x['Spread'] / x['Modal_Price']) * 100
)

# To find the average volatility for each district, we group by 'State' and 'District'
# and then calculate the mean of the daily 'Spread_Percentage' for each group.
avg_district_volatility = daily_spreads_df.groupby(['State', 'District'])['Spread_Percentage'].mean()

# Sort the results in descending order to identify the 5 districts with the highest average volatility.
top_5_volatile_districts = avg_district_volatility.sort_values(ascending=False).head(5)

# --- Display the final results ---
print("--- Top 5 Districts by Average Daily Price Volatility ---")
# `name` in Series.to_string is a boolean flag that toggles the "Name: ..." footer; it is not a label.
print(top_5_volatile_districts.round(2).to_string(name=True))

--- Top 5 Districts by Average Daily Price Volatility ---
State        District  
Maharashtra  Sholapur      69.09
             Kolhapur      65.44
Telangana    Hyderabad     62.15
Chandigarh   Chandigarh    58.92
Tripura      Dhalai        58.42
Name: Spread_Percentage


### Conclusion & Interpretation

The analysis aimed to identify districts exhibiting the highest market volatility, quantified by the average daily price spread as a percentage of the modal price. A higher percentage signifies a less stable market.

| Rank | State | District | Avg. Daily Spread % |
|:---|:---|:---|:---:|
| 1 | Maharashtra | Sholapur | 69.09% |
| 2 | Maharashtra | Kolhapur | 65.44% |
| 3 | Telangana | Hyderabad | 62.15% |
| 4 | Chandigarh | Chandigarh | 58.92% |
| 5 | Tripura | Dhalai | 58.42% |

The district with the highest price volatility in India is **Sholapur, Maharashtra**, with an average daily price spread of **69.09%**.

A notable regional pattern emerges: **Maharashtra** is home to two of the top five districts, **Sholapur** (Rank 1) and **Kolhapur** (Rank 2). Markets in these two districts show the widest average gap between reported minimum and maximum prices relative to the modal price, which suggests a higher-risk trading environment than in other regions in the dataset.
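
An average spread can be dominated by districts with very few records. One possible safeguard, sketched below on made-up data, is to carry the record count alongside the mean and rank only districts above a minimum count. The `min_records` cutoff and all values are hypothetical.

```python
import pandas as pd

# Toy price records (hypothetical states, districts, and prices).
toy = pd.DataFrame({
    "State": ["S1", "S1", "S1", "S2"],
    "District": ["D1", "D1", "D2", "D3"],
    "Min_Price": [900.0, 1000.0, 500.0, 100.0],
    "Max_Price": [1100.0, 1400.0, 700.0, 400.0],
    "Modal_Price": [1000.0, 1000.0, 500.0, 200.0],
})

# Spread as a percentage of the modal price, skipping non-positive modal prices.
spreads = toy[toy["Modal_Price"] > 0].assign(
    Spread_Percentage=lambda x: (x["Max_Price"] - x["Min_Price"]) / x["Modal_Price"] * 100
)

# Keep the number of records next to the average so thin groups are visible.
district_stats = spreads.groupby(["State", "District"])["Spread_Percentage"].agg(
    avg_spread_pct="mean", records="count"
)

min_records = 2  # hypothetical cutoff for this toy example
reliable = district_stats[district_stats["records"] >= min_records]
print(reliable.sort_values("avg_spread_pct", ascending=False))
```

In this toy table the district with the largest spread has a single record and is excluded, leaving only the district whose average is backed by more than one observation.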

### Regional Deep-Dive: Top Tomato Markets in Maharashtra

#### Business Query:
*List the top 5 districts in Maharashtra with the highest average `Modal_Price` for Tomato during the 2023–2024 period.*

This analysis performs a focused, regional deep-dive to identify the most lucrative local markets for a specific high-volume vegetable, providing actionable intelligence for farmers and distributors within Maharashtra.

In [10]:
# Filter the master DataFrame to create a specific subset for tomatoes in Maharashtra between 2023 and 2024.
df_maharashtra_tomato = df_master[
    (df_master['Commodity'] == 'Tomato') & 
    (df_master['State'] == 'Maharashtra') &
    (df_master['Year'].between(2023, 2024))
].copy()

# Group this subset by 'District', calculate the average of 'Modal_Price' for each,
# sort the results in descending order, and select the top 5.
top_5_districts_tomato_price = df_maharashtra_tomato.groupby('District').agg(
    avg_modal_price=('Modal_Price', 'mean')
).sort_values(by='avg_modal_price', ascending=False).head(5)

# Display the final ranked list of districts.
print("--- Top 5 Districts in Maharashtra for Tomato Prices (2023-2024) ---")
print(top_5_districts_tomato_price.to_string(float_format='₹{:,.2f}'.format))

--- Top 5 Districts in Maharashtra for Tomato Prices (2023-2024) ---
                       avg_modal_price
District                              
Dharashiv (Usmanabad)        ₹4,933.33
Raigad                       ₹3,128.14
Thane                        ₹3,082.30
Mumbai                       ₹2,595.64
Ratnagiri                    ₹2,524.02


### Conclusion & Interpretation

The analysis identified the top five districts within Maharashtra where farmers could get the best average price for tomatoes during the 2023-2024 period.

| Rank | District | Average Modal Price |
| :--- | :--- | :--- |
| 1 | **Dharashiv (Usmanabad)** | **₹4,933.33** |
| 2 | Raigad | ₹3,128.14 |
| 3 | Thane | ₹3,082.30 |
| 4 | Mumbai | ₹2,595.64 |
| 5 | Ratnagiri | ₹2,524.02 |

**Dharashiv (Usmanabad)** stands out as the highest-priced district for tomatoes, with an average price over **₹1,800** higher than the second-ranked district, Raigad. This may indicate a very strong local market or supply shortages in the region during this period, although the analysis does not show how many records each district's average is based on.

### Market-Level Deep-Dive: Top Potato Market in Uttar Pradesh

#### Business Query:
*Which market (mandi) in Uttar Pradesh had the highest average `Modal_Price` for Potato in 2024?*

This hyper-specific analysis drills down to the individual market level (`mandi`) to find the single most profitable location for selling a key commodity in a major agricultural state.

In [21]:
# Create a filtered DataFrame for all commodities sold in Uttar Pradesh during 2024.
df_uttar_pradesh_2024 = df_master[
    (df_master['State'] == 'Uttar Pradesh') & 
    (df_master['Year'] == 2024)
].copy()

# Further filter this DataFrame to isolate only the records for 'Potato'.
df_up_potato_2024 = df_uttar_pradesh_2024[df_uttar_pradesh_2024['Commodity'] == 'Potato']

# Group the potato data by the specific 'Market' (mandi), calculate the mean 'Modal_Price',
# sort to find the highest price, and select the top entry.
top_market_for_potato = df_up_potato_2024.groupby('Market')['Modal_Price'].mean().sort_values(ascending=False).head(1)

# Display the final result.
print("--- Top Market in Uttar Pradesh for Potato Prices in 2024 ---")
print(top_market_for_potato.to_string(float_format='₹{:,.2f}'.format))

--- Top Market in Uttar Pradesh for Potato Prices in 2024 ---
Market
Sehjanwa   ₹2,273.20


### Conclusion & Interpretation

The analysis successfully identified the single most profitable market (mandi) for selling potatoes in Uttar Pradesh during 2024.

| Market | Average Modal Price (2024) |
| :--- | :--- |
| **Sehjanwa** | **₹2,273.20** |

The **Sehjanwa** mandi recorded the highest average potato price in Uttar Pradesh in 2024. This gives local farmers and cooperatives a concrete data point when deciding where to sell their produce.

### Analysis of Quality-Based Price Premiums

#### Business Query:
*Group by commodity and variety, then compute the average price premium (`Modal_Price` difference) for 'FAQ' grade versus 'Medium' grade. Quantify premiums for the top five commodities showing the largest differentials.*

This analysis aims to quantify the economic benefit of producing higher-quality goods, providing a clear data-driven incentive for farmers to invest in quality improvement.

In [12]:
# Filter the DataFrame to only include records with the two grades we want to compare.
grades_to_compare = ['FAQ', 'Medium']
df_filtered_grades = df_master[df_master['Grade'].isin(grades_to_compare)].copy()

# Reshape the data to place the prices for 'FAQ' and 'Medium' grades side-by-side
# by grouping, aggregating, and then unstacking the 'Grade' column.
grade_comparison_df = df_filtered_grades.groupby(['Commodity', 'Variety', 'Grade']) \
    ['Modal_Price'].mean() \
    .unstack(level='Grade')

# To ensure a fair comparison, remove any products that do not have a price for BOTH grades.
grade_comparison_df.dropna(subset=['FAQ', 'Medium'], inplace=True)

# Calculate the price premium for each variety.
grade_comparison_df['Price_Premium_FAQ'] = grade_comparison_df['FAQ'] - grade_comparison_df['Medium']

# Calculate the average premium for each commodity across all its varieties.
avg_commodity_premiums = grade_comparison_df.groupby('Commodity')['Price_Premium_FAQ'].mean()

# Sort to find the 5 commodities with the highest average premium for the 'FAQ' grade.
top_5_premium_commodities = avg_commodity_premiums.sort_values(ascending=False).head(5)

# The 'top_5_premium_commodities' Series now holds the final result of our analysis.
top_5_premium_commodities

Commodity
Mace                72380.994330
Almond (Badam)      47742.942499
Pepper ungarbled    36259.972019
Cashewnuts          13349.292410
Rubber              12159.757808
Name: Price_Premium_FAQ, dtype: float64
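A premium computed from only a handful of records can be misleading, so it may help to carry the record count alongside each mean. The following is a minimal sketch on invented toy data, not the notebook's actual method; `MIN_RECORDS` is a hypothetical threshold chosen for illustration.

```python
import pandas as pd

# Toy stand-in for df_filtered_grades; the values are invented for illustration only.
toy = pd.DataFrame({
    "Commodity": ["Mace", "Mace", "Mace", "Rice", "Rice", "Rice", "Rice"],
    "Variety": ["Other"] * 7,
    "Grade": ["FAQ", "Medium", "Medium", "FAQ", "FAQ", "Medium", "Medium"],
    "Modal_Price": [900.0, 500.0, 700.0, 320.0, 300.0, 290.0, 310.0],
})

# Mean price and record count per (Commodity, Variety), with one column per grade.
stats = (
    toy.groupby(["Commodity", "Variety", "Grade"])["Modal_Price"]
    .agg(["mean", "count"])
    .unstack("Grade")
)

# MIN_RECORDS is a hypothetical threshold, not a value taken from the notebook.
MIN_RECORDS = 2
enough = (stats["count"] >= MIN_RECORDS).all(axis=1)

# Premium for every product, then only for products with enough records in both grades.
premium = stats["mean"]["FAQ"] - stats["mean"]["Medium"]
reliable_premium = premium[enough]
print(reliable_premium)
```

In this toy data the 'Mace' premium rests on a single 'FAQ' record and is filtered out, while 'Rice' passes the check.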

### Conclusion & Interpretation

The analysis quantifies the average price difference between 'FAQ' grade and 'Medium' grade produce. In absolute terms, high-value, non-staple commodities show the largest premiums for the higher grade.

| Commodity | Average Price Premium (for 'FAQ' Grade) |
| :--- | :--- |
| **Mace** | **₹72,380.99** |
| **Almond (Badam)** | **₹47,742.94** |
| **Pepper ungarbled** | **₹36,259.97** |
| **Cashewnuts** | **₹13,349.29** |
| **Rubber** | **₹12,159.76** |
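Absolute premiums naturally scale with a commodity's price level, so a relative (percentage) premium can be a useful companion measure. The sketch below assumes a frame shaped like `grade_comparison_df`; the commodity names and prices are invented, and `Premium_Pct` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical grade means shaped like grade_comparison_df; all values are invented.
grade_means = pd.DataFrame(
    {"FAQ": [120000.0, 2400.0], "Medium": [100000.0, 2000.0]},
    index=pd.Index(["Spice X", "Staple Y"], name="Commodity"),
)

# Absolute premium, as in the notebook, and the premium relative to the Medium price.
grade_means["Price_Premium_FAQ"] = grade_means["FAQ"] - grade_means["Medium"]
grade_means["Premium_Pct"] = 100 * grade_means["Price_Premium_FAQ"] / grade_means["Medium"]

print(grade_means[["Price_Premium_FAQ", "Premium_Pct"]])
```

Here the two invented commodities differ fifty-fold in absolute premium yet share the same 20% relative premium, which is why ranking by percentage can surface different commodities than ranking by rupee difference.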

#### Actionable Insight:
This result gives farmers and cooperatives a data-driven argument for quality improvement. For example, the average modal price of 'FAQ' grade **Mace** was **₹72,380.99** per quintal higher than that of 'Medium' grade.

This suggests that investments in processes that yield higher-quality grades, such as better sorting, drying, and storage, can lead to substantial increases in revenue, particularly for high-value export-oriented commodities. Advising cooperatives to focus on quality improvement initiatives is a key recommendation from this finding.
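The premiums above compare grade means pooled across every market and date, so part of the difference may reflect where and when each grade was sold rather than quality alone. One way to check this, sketched here on invented toy data and not used in the notebook, is to measure the premium only where both grades traded in the same market on the same day.

```python
import pandas as pd

# Toy records; in the notebook these would come from df_filtered_grades. Values are invented.
toy = pd.DataFrame({
    "Market": ["A", "A", "B", "B", "C"],
    "Arrival_Date": pd.to_datetime(["2024-01-01"] * 5),
    "Commodity": ["Mace"] * 5,
    "Variety": ["Other"] * 5,
    "Grade": ["FAQ", "Medium", "FAQ", "Medium", "FAQ"],
    "Modal_Price": [1100.0, 1000.0, 2050.0, 2000.0, 9000.0],
})

keys = ["Market", "Arrival_Date", "Commodity", "Variety"]

# One row per market-day-product, one column per grade; keep rows where both grades traded.
paired = (
    toy.pivot_table(index=keys, columns="Grade", values="Modal_Price", aggfunc="mean")
    .dropna(subset=["FAQ", "Medium"])
)

# Premium measured within the same market on the same day, then averaged per commodity.
paired["Premium"] = paired["FAQ"] - paired["Medium"]
paired_premium = paired.groupby("Commodity")["Premium"].mean()

# The pooled difference of grade means, in the style of the notebook, for comparison.
pooled = toy.groupby("Grade")["Modal_Price"].mean()
pooled_premium = pooled["FAQ"] - pooled["Medium"]

print(paired_premium)
print(pooled_premium)
```

In this toy data, market C sells only 'FAQ' at a high price, which inflates the pooled premium well beyond the within-market premium.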

### Identification of High-Price Markets for Tomato

#### Business Query:
*Find 3 markets where the Tomato price went above ₹3,000 per quintal at least once in 2024.*

This analysis identifies specific markets that experienced significant price spikes for a key commodity, highlighting potentially lucrative but volatile sales locations.

In [13]:
# Filter the DataFrame to create a subset containing only records where the 'Modal_Price' for 'Tomato' 
# in 2024 was at or above the ₹3,000 threshold.
df_tomato_high_price = df_master[
    (df_master['Commodity'] == 'Tomato') & 
    (df_master['Year'] == 2024) & 
    (df_master['Modal_Price'] >= 3000)
].copy()

# To find the markets where this occurred most often, count how often each market's name appears
# in the filtered DataFrame. Note that value_counts() counts records, so the result equals the number
# of days only if each market reports a single Tomato record per day.
top_markets_by_frequency = df_tomato_high_price['Market'].value_counts()

# Select the top 3 markets from this frequency count.
top_3_high_price_markets = top_markets_by_frequency.head(3)

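# Display the final result. Series.to_string() treats `name` as a boolean flag (it does not
# relabel the Series), so the footer shows the Series name, 'count'.
print("--- Top 3 Markets with Most Frequent High Prices for Tomato in 2024 ---")
print(top_3_high_price_markets.to_string(name=True))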

--- Top 3 Markets with Most Frequent High Prices for Tomato in 2024 ---
Market
Kottayam         345
Kanjirappally    326
Anchal           296
Name: count


The analysis identified the markets that most frequently experienced high prices (≥ ₹3,000 per quintal) for tomatoes in 2024, revealing a remarkable and consistent high-price environment in specific locations.

| Rank | Market | State | Number of Days Price >= ₹3,000 |
|:---|:---|:---|:---:|
| 1 | **Kottayam** | Kerala | 345 |
| 2 | **Kanjirappally** | Kerala | 326 |
| 3 | **Anchal** | Kerala | 296 |
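The business query can also be read literally: any market whose price went strictly above the threshold at least once qualifies. A minimal sketch of that reading follows, using invented toy data; it groups by `State` as well as `Market` on the assumption that market names may repeat across states.

```python
import pandas as pd

# Toy tomato records for one year; all values are invented for illustration.
toy = pd.DataFrame({
    "State": ["Kerala", "Kerala", "Punjab", "Punjab", "Bihar"],
    "Market": ["M1", "M1", "M2", "M2", "M3"],
    "Modal_Price": [3500.0, 2800.0, 1500.0, 1800.0, 3000.0],
})

THRESHOLD = 3000  # ₹ per quintal, from the business query

# Highest price seen in each market; grouping by State keeps same-named markets apart.
peak_price = toy.groupby(["State", "Market"])["Modal_Price"].max()

# Strictly above the threshold at least once.
markets_above = peak_price[peak_price > THRESHOLD]
print(markets_above)
```

Note that the strict comparison excludes a market that only ever reached exactly ₹3,000, whereas the `>=` filter used in the notebook would include it.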

#### Key Findings & Interpretation:

*   **Persistent High-Price Environment:** The most striking finding is how often the price threshold was met. In **Kottayam**, the tomato price was at or above ₹3,000 on **345 occasions** in 2024, a year with 366 calendar days. If each record corresponds to one trading day, this indicates a sustained, year-round premium price environment rather than temporary volatility.

*   **Strong Regional Concentration:** All three top markets (Kottayam, Kanjirappally, and Anchal) are located in **Kerala**. This suggests a regional market dynamic in which tomato prices are structurally higher than elsewhere in the country, likely due to a combination of high local demand, logistical costs, and regional consumption patterns.
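Because `value_counts()` tallies records rather than dates, a market that reports several varieties or grades per day would be counted more than once per day. The sketch below, on invented toy data, counts distinct dates instead and expresses them as a share of the days each market reported at all; the column names `days_high`, `days_reported`, and `share_high` are hypothetical.

```python
import pandas as pd

# Toy tomato records; M1 reports two records on the same day. Values are invented.
toy = pd.DataFrame({
    "State": ["Kerala"] * 5,
    "Market": ["M1", "M1", "M1", "M2", "M2"],
    "Arrival_Date": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]
    ),
    "Modal_Price": [3200.0, 3400.0, 3100.0, 3500.0, 2500.0],
})

by_market = ["State", "Market"]

# Distinct dates on which each market reported any tomato price.
days_reported = toy.groupby(by_market)["Arrival_Date"].nunique()

# Distinct dates on which at least one record met the threshold.
high = toy[toy["Modal_Price"] >= 3000]
days_high = high.groupby(by_market)["Arrival_Date"].nunique()

# Align both counts per market and compute the share of reporting days at or above the threshold.
summary = pd.DataFrame({"days_high": days_high, "days_reported": days_reported}).fillna(0)
summary["share_high"] = summary["days_high"] / summary["days_reported"]
print(summary)
```

In this toy data, a plain record count would credit M1 with three high-price entries even though it only traded on two days.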

#### Actionable Insight for Stakeholders:
For tomato producers and distributors, Kerala, and specifically the **Kottayam** market, appears by this measure to be the most consistent opportunity for premium price realization in India. Unlike markets that experience occasional price spikes, these locations recorded high prices throughout the year, making them a strong strategic target for supply contracts and distribution efforts.