In [None]:
import pandas as pd
import numpy as np

# Data Overview

In [None]:
data = pd.read_csv("/kaggle/input/produce-prices-dataset/ProductPriceIndex.csv")
data.head()

In [None]:
data.info()

In [None]:
data.isnull().sum()

## Converting relevant columns to appropriate data types

In [None]:
data['date'] = pd.to_datetime(data['date'])

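# Strip the currency symbol and convert the price columns to floats.
# regex=False treats '$' as a literal character, not the regex end-of-string anchor.
price_columns = ['farmprice', 'atlantaretail', 'chicagoretail', 'losangelesretail', 'newyorkretail']
for col in price_columns:
    data[col] = pd.to_numeric(data[col].str.replace('$', '', regex=False), errors='coerce')

# Strip the percent sign from the spread column; unparseable values become NaN.
data["averagespread"] = pd.to_numeric(data["averagespread"].str.replace('%', '', regex=False), errors='coerce')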
data.dropna(inplace=True)
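
The conversion step can be checked in isolation on a few hand-written values. The snippet below is a minimal sketch: the sample strings are made up to mimic the `$1.16` and `106.90%` style of the CSV and are not taken from it. It shows that passing `regex=False` removes the symbol literally, and that `errors='coerce'` turns anything unparseable into `NaN` for the later `dropna` to remove.

```python
import pandas as pd

# Made-up sample values in the same style as the dataset columns.
prices = pd.Series(["$1.16", "$2.50", None, "$0.75"])
spreads = pd.Series(["106.90%", "35.2%", "n/a", "250%"])

# regex=False makes "$" a literal character rather than an end-of-string anchor.
clean_prices = pd.to_numeric(prices.str.replace("$", "", regex=False), errors="coerce")
clean_spreads = pd.to_numeric(spreads.str.replace("%", "", regex=False), errors="coerce")

# Missing or unparseable entries show up as NaN.
print(clean_prices.tolist())
print(clean_spreads.tolist())
```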
    

# Exploratory Data Analysis (EDA)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

### Correlation matrix

In [None]:
# Select only numeric columns for correlation calculation
numeric_data = data.select_dtypes(include=[np.number])

# Calculate correlation matrix
correlation_matrix = numeric_data.corr()

# Display a heatmap of the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()


### Visualizing the spread

In [None]:
# Box plot of the average spread, one box per product
plt.figure(figsize=(15, 8))
sns.boxplot(x='productname', y='averagespread', data=data)
plt.xticks(rotation=90)
plt.title('Spread Distribution by Product')
plt.show()

### Time series analysis

In [None]:
plt.figure(figsize=(15, 8))
data.groupby('date')['averagespread'].mean().plot()
plt.title('Average Spread Over Time')
plt.xlabel('Date')
plt.ylabel('Average Spread')
plt.show()

### Rolling statistics

In [None]:
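# 7-observation rolling mean per product, in date order, so a window never mixes products
by_product = data.sort_values('date').groupby('productname')['averagespread']
data['rolling_avg'] = by_product.transform(lambda s: s.rolling(window=7).mean())
data.dropna(subset=['rolling_avg']).head(3)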

### Seasonal decomposition
Decompose the time series data into trend, seasonal, and residual components.

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
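# Decompose the mean spread per date; period is the number of observations per seasonal cycle
spread_by_date = data.groupby('date')['averagespread'].mean()
result = seasonal_decompose(spread_by_date, model='additive', period=7)
fig = result.plot()  # result.plot() creates its own figure, so resize it afterwards
fig.set_size_inches(15, 8)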
plt.show()

### Pair Plots
Create pair plots to visualize relationships between multiple numeric variables.

In [None]:
# sns.pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(data[['farmprice', 'atlantaretail', 'chicagoretail', 'losangelesretail', 'newyorkretail', 'averagespread']])
plt.show()

# Exploratory Data Analysis Report

## 1. Dataset Overview

The dataset contains information about various agricultural products, including product name, date, farm price, retail prices in different cities, and the average spread.

## 2. Dataset Information

The dataset comprises multiple columns:

- **Product Name** (`productname`): Name of the produce item.
- **Date** (`date`): The date of the pricing information.
- **Farm Price** (`farmprice`): The price at which the produce is sold at the farm.
- **Retail Prices** (`atlantaretail`, `chicagoretail`, `losangelesretail`, `newyorkretail`): Retail prices in four major cities (Atlanta, Chicago, Los Angeles, New York).
- **Average Spread** (`averagespread`): Percentage indicating the average markup between farm and retail prices.

## 3. Summary Statistics

Summary statistics provide an overview of the numerical features in the dataset:

In [None]:
# Summary statistics
summary_stats = data.describe()
summary_stats

## 4. Spread Distribution

The box plot of the average spread by product shows that the spread varies considerably across agricultural products. Products with a higher average spread carry a larger markup between farm and retail prices; the height of each box, rather than its position, indicates how variable that spread is.
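
To separate the level of the spread from its variability, a per-product summary can be tabulated. The following is a minimal sketch on made-up rows (only the column names `productname` and `averagespread` come from the dataset), where one hypothetical product has a high but stable spread and the other a lower but volatile one.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
df = pd.DataFrame({
    "productname": ["Carrots"] * 4 + ["Avocados"] * 4,
    "averagespread": [300.0, 310.0, 290.0, 300.0, 100.0, 20.0, 180.0, 100.0],
})

# Median captures the typical markup; standard deviation captures variability.
summary = (
    df.groupby("productname")["averagespread"]
      .agg(median="median", std="std")
      .sort_values("median", ascending=False)
)
print(summary)
```

Applied to the cleaned `data` frame, the same `groupby` call would rank the real products by typical markup and show which of them fluctuate most.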

## 5. Time Series Analysis

A time series analysis was performed to understand how the average spread, taken as the mean across all products on each date, has changed over time. The series fluctuates noticeably, suggesting potential seasonal trends or external factors influencing pricing. Attributing these fluctuations to particular products requires a per-product breakdown.
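
A per-product view of the same series can be obtained by pivoting the data so that each product becomes its own column. This is a sketch under assumptions: the weekly dates and spread values below are invented for illustration, and only the column names match the dataset.

```python
import pandas as pd

# Invented weekly observations for two hypothetical products.
dates = pd.date_range("2020-01-05", periods=6, freq="W")
df = pd.DataFrame({
    "date": list(dates) * 2,
    "productname": ["Carrots"] * 6 + ["Avocados"] * 6,
    "averagespread": [300, 305, 310, 295, 300, 290, 100, 150, 90, 160, 80, 170],
})

# One column per product, indexed by date.
by_product = df.pivot_table(
    index="date", columns="productname", values="averagespread", aggfunc="mean"
)

# Standard deviation over time ranks products by how much their spread fluctuates.
volatility = by_product.std().sort_values(ascending=False)
print(volatility)
```

Calling `by_product.plot()` would then draw one line per product, which makes it easier to see which products drive the movement of the aggregate.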

## 6. Correlation Analysis

The correlation matrix summarizes the pairwise linear relationships between the numeric columns. Notably, the farm price showed a positive correlation with the average spread. Because the matrix pools all products together, further investigation into this relationship is needed before drawing conclusions about pricing dynamics.
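
One way to probe such a relationship is to compare the pooled correlation with the correlation inside each product, since the two can differ in sign when products sit at different price levels. The numbers below are made up purely to illustrate that effect and say nothing about the real dataset.

```python
import pandas as pd

# Made-up values chosen so that the pooled and per-product correlations disagree.
df = pd.DataFrame({
    "productname": ["Carrots"] * 4 + ["Avocados"] * 4,
    "farmprice": [0.20, 0.25, 0.30, 0.35, 1.00, 1.20, 1.40, 1.60],
    "averagespread": [120.0, 110.0, 100.0, 90.0, 300.0, 290.0, 280.0, 270.0],
})

# Pearson correlation over all rows, ignoring which product each row belongs to.
overall = df["farmprice"].corr(df["averagespread"])

# The same correlation computed separately within each product.
per_product = pd.Series({
    name: group["farmprice"].corr(group["averagespread"])
    for name, group in df.groupby("productname")
})

print(f"pooled: {overall:.2f}")
print(per_product.round(2))
```

In this toy example the pooled correlation is positive while each product on its own shows a negative one, which is why a per-product check is a useful follow-up to the heatmap.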

## 7. Regional Price Variations

Retail prices in different cities exhibited variations, with certain cities consistently having higher or lower prices across multiple products. Understanding these regional variations could be crucial for market analysis and decision-making.
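
A regional comparison of this kind can be sketched by averaging each city's retail price per product and expressing it relative to the cross-city mean. The prices below are invented; the four column names are the ones used in the dataset.

```python
import pandas as pd

city_cols = ["atlantaretail", "chicagoretail", "losangelesretail", "newyorkretail"]

# Invented prices for two hypothetical products.
df = pd.DataFrame({
    "productname": ["Carrots", "Carrots", "Avocados", "Avocados"],
    "atlantaretail": [1.00, 1.10, 2.00, 2.10],
    "chicagoretail": [1.20, 1.30, 2.30, 2.20],
    "losangelesretail": [0.90, 1.00, 1.80, 1.90],
    "newyorkretail": [1.40, 1.50, 2.60, 2.70],
})

# Mean retail price per product and city.
mean_by_city = df.groupby("productname")[city_cols].mean()

# Each city's price relative to the cross-city mean for that product (1.0 = average).
relative = mean_by_city.div(mean_by_city.mean(axis=1), axis=0)

# The most expensive city for each product.
priciest = mean_by_city.idxmax(axis=1)

print(relative.round(2))
print(priciest)
```

A city whose relative price stays above 1.0 for most products is consistently expensive; one that stays below 1.0 is consistently cheap.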

## 8. Outliers and Anomalies

Identifying outliers and anomalies in farm prices, retail prices, and average spread is an essential step. Such data points may reflect exceptional market conditions or errors in data collection.
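
A common way to flag such points is the interquartile range (IQR) rule, applied within each product so that naturally high-spread products are not flagged wholesale. The sketch below assumes that rule with the conventional multiplier of 1.5; the helper `flag_iqr_outliers` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up spreads for one hypothetical product, with one extreme value.
df = pd.DataFrame({
    "productname": ["Carrots"] * 6,
    "averagespread": [300.0, 310.0, 290.0, 305.0, 295.0, 900.0],
})

def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Apply the rule separately within each product.
df["is_outlier"] = (
    df.groupby("productname")["averagespread"]
      .transform(flag_iqr_outliers)
      .astype(bool)
)
print(df[df["is_outlier"]])
```

Flagged rows are candidates for manual review rather than automatic removal, since an extreme spread may be a genuine market event.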

## 9. Impact of Specific Products

Analyzing the impact of specific products on the overall spread distribution could help identify key drivers in the agricultural market. Some products might contribute more significantly to price fluctuations than others.

# Recommendations

Based on the EDA findings, the following next steps are recommended:

- Conduct further investigation into the positive correlation between farm prices and average spread to uncover underlying factors.
- Explore the reasons behind the observed regional variations in retail prices and assess their impact on overall market dynamics.
- Consider deeper analysis of time series trends, especially for products with significant spread fluctuations, to uncover potential seasonality or external influences.


## Conclusion

The EDA provided valuable insights into the dataset, laying the groundwork for more in-depth analyses. The identified patterns and relationships will guide further exploration and contribute to a better understanding of the agricultural market dynamics.
