# Strawberry Price Prediction - Exploratory Data Analysis

Objective: build a model that predicts the price of strawberries two weeks ahead, training on past data and testing on the 2022-2023 season.

According to an internet search, strawberry prices depend on:
* Supply: season, yield, weather, diseases
* Demand: consumption, events, competition
* Production costs: inputs, labor, crop type
* Logistics: transportation, storage, preservation
* Policies and economics: taxes, subsidies, currency fluctuations

This notebook explores the strawberry price prediction dataset, focusing on:
1. Data overview and missing values
2. Price distribution and trends
3. Weather features analysis
4. Seasonal patterns

In [None]:
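import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Project helpers used in later cells; all are assumed to live in the local utils module
from utils import (
    plot_missing_values, split_train_test, plot_price_distribution,
    plot_weather_correlations, plot_seasonal_patterns,
)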

## 1. Load and Examine Data

In [2]:
# Load data
df = pd.read_csv('data/raw/senior_ds_test.csv')

# Display basic information
print("Dataset Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData Types:\n", df.dtypes)
print("\nSample Data:\n", df.head())

Dataset Shape: (521, 15)

Columns: ['Unnamed: 0', 'year', 'week', 'windspeed', 'temp', 'cloudcover', 'precip', 'solarradiation', 'start_date', 'end_date', 'category', 'unit', 'price_min', 'price_max', 'price']

Data Types:
 Unnamed: 0          int64
year                int64
week                int64
windspeed         float64
temp              float64
cloudcover        float64
precip            float64
solarradiation    float64
start_date         object
end_date           object
category           object
unit               object
price_min         float64
price_max         float64
price             float64
dtype: object

Sample Data:
    Unnamed: 0  year  week  windspeed       temp  cloudcover    precip  \
0           1  2013    28  21.900000  26.100000   13.500000  0.013000   
1           2  2013    29  24.185714  26.914286   10.242857  0.000000   
2           3  2013    30  21.728571  27.871429   13.785714  0.000000   
3           4  2013    31  22.971429  27.014286   18.900000  0.00
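
The `Unnamed: 0` column looks like a row index that was saved with the CSV, and the two date columns load as `object`. A minimal sketch of one way to handle both at load time, using a small in-memory CSV in place of the real file (the values and the date format are illustrative assumptions, not taken from the dataset):

```python
import io

import pandas as pd

# Illustrative stand-in for data/raw/senior_ds_test.csv (made-up values).
toy_csv = io.StringIO(
    ",year,week,temp,start_date,end_date,price\n"
    "1,2013,28,26.1,2013-07-08,2013-07-14,2.5\n"
    "2,2013,29,26.9,,,\n"
)

# index_col=0 reads the unnamed first column as the index instead of data;
# parse_dates converts the date columns to datetime64 (missing dates become NaT).
toy_df = pd.read_csv(toy_csv, index_col=0, parse_dates=["start_date", "end_date"])

print(toy_df.dtypes)
```

With real data the shape would then report 14 columns instead of 15, since the index is no longer counted as a column.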

## 2. Missing Values Analysis

In [3]:
# Check missing values
print("Missing Values Count:\n")
print(df.isnull().sum())

# Plot missing values heatmap
plot_missing_values(df)

Missing Values Count:

Unnamed: 0          0
year                0
week                0
windspeed           0
temp                0
cloudcover          0
precip              0
solarradiation      0
start_date        242
end_date          242
category          242
unit              242
price_min         242
price_max         242
price             242
dtype: int64
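
The identical count of 242 across the seven price-related columns suggests that the same rows lack all of them, but the counts alone do not prove it. A minimal sketch of how this could be checked, shown on a toy frame; the helper names `summarize_missing` and `missing_together` are hypothetical and not part of `utils`:

```python
import numpy as np
import pandas as pd


def summarize_missing(frame):
    """Return the missing-value count and percentage for each column."""
    counts = frame.isnull().sum()
    pct = (counts / len(frame) * 100).round(1)
    return pd.DataFrame({"missing": counts, "missing_pct": pct})


def missing_together(frame, cols):
    """True if every row is either complete or entirely missing across cols."""
    n_missing = frame[cols].isnull().sum(axis=1)
    return bool(n_missing.isin([0, len(cols)]).all())


# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "week": [1, 2, 3, 4],
    "temp": [10.0, 11.0, 12.0, 13.0],
    "price_min": [1.0, np.nan, 1.2, np.nan],
    "price": [1.5, np.nan, 1.6, np.nan],
})

print(summarize_missing(toy))
print(missing_together(toy, ["price_min", "price"]))
```

On the real data, `missing_together(df, [...])` with the seven price-related columns would confirm or refute the joint-missingness hypothesis.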



## 3. Split Train/Test Before Analysis

In [None]:
# Split data
train_df, test_df = split_train_test(df)

print("Training set shape:", train_df.shape)
print("Testing set shape:", test_df.shape)

# Display date ranges
print("\nTraining data date range:")
print(f"Start: {train_df['start_date'].min()}, End: {train_df['start_date'].max()}")
print("\nTesting data date range:")
print(f"Start: {test_df['start_date'].min()}, End: {test_df['start_date'].max()}")
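
The definition of `split_train_test` is not shown in this notebook. As a rough sketch of what a chronological split could look like, the function below uses the `year` and `week` columns, which have no missing values, rather than `start_date`. The function name `split_by_season` and the cutoff `TEST_START` are assumptions for illustration; the actual boundary of the 2022-2023 season is not stated in the source.

```python
import pandas as pd

# Hypothetical cutoff: first (year, week) treated as part of the 2022-2023 season.
TEST_START = (2022, 28)


def split_by_season(frame, test_start=TEST_START):
    """Chronological split: rows before test_start go to train, the rest to test."""
    key = frame["year"] * 100 + frame["week"]  # sortable year-week key
    cutoff = test_start[0] * 100 + test_start[1]
    ordered = frame.assign(_key=key).sort_values("_key")
    train = ordered[ordered["_key"] < cutoff].drop(columns="_key")
    test = ordered[ordered["_key"] >= cutoff].drop(columns="_key")
    return train, test


# Toy data (illustrative values only).
toy = pd.DataFrame({
    "year": [2021, 2022, 2022, 2023],
    "week": [50, 27, 28, 5],
    "price": [2.0, 2.1, 2.3, 2.5],
})
train_toy, test_toy = split_by_season(toy)
print(len(train_toy), len(test_toy))
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for a forecasting task.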

## 4. Price Analysis

In [None]:
# Plot price distribution and time series
plot_price_distribution(train_df)

# Basic statistics
print("\nPrice Statistics (Training Set):")
print(train_df['price'].describe())
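
The helper `plot_price_distribution` is not defined in this notebook. A minimal stand-in, assuming it is meant to show a histogram of prices next to the price time series, might look like the sketch below. The name `plot_price_overview` is hypothetical, and the date strings in the toy frame are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_price_overview(frame, price_col="price", date_col="start_date"):
    """Draw a price histogram next to the price time series; return the figure."""
    data = frame.dropna(subset=[price_col]).copy()
    data[date_col] = pd.to_datetime(data[date_col])
    data = data.sort_values(date_col)

    fig, (ax_hist, ax_ts) = plt.subplots(1, 2, figsize=(12, 4))
    ax_hist.hist(data[price_col], bins=30)
    ax_hist.set_xlabel(price_col)
    ax_hist.set_ylabel("number of weeks")
    ax_hist.set_title("Price distribution")

    ax_ts.plot(data[date_col], data[price_col], marker=".")
    ax_ts.set_xlabel(date_col)
    ax_ts.set_ylabel(price_col)
    ax_ts.set_title("Price over time")
    fig.tight_layout()
    return fig


# Toy data (illustrative values only); the row without a price is dropped.
toy = pd.DataFrame({
    "start_date": ["2021-01-04", "2021-01-11", "2021-01-18"],
    "price": [2.0, None, 2.4],
})
fig = plot_price_overview(toy)
```

Because rows without a price are dropped, the line in the time series bridges any off-season gaps; plotting per season would avoid that.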

## 5. Weather Features Analysis

In [None]:
# Plot correlations
plot_weather_correlations(train_df)

# Weather features statistics
weather_cols = ['windspeed', 'temp', 'cloudcover', 'precip', 'solarradiation']
print("\nWeather Features Statistics (Training Set):")
print(train_df[weather_cols].describe())
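
Alongside the plot, the relationship between each weather feature and price can be summarized numerically. The sketch below computes Pearson correlations on rows that have a price; the function name `weather_price_correlation` is hypothetical and the toy values are made up. Pearson correlation only captures linear association, so it is a first look rather than a full analysis.

```python
import pandas as pd

weather_cols = ["windspeed", "temp", "cloudcover", "precip", "solarradiation"]


def weather_price_correlation(frame):
    """Pearson correlation of each weather feature with price, strongest first."""
    rows = frame.dropna(subset=["price"])
    corr = rows[weather_cols + ["price"]].corr()["price"].drop("price")
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data (illustrative values only): price falls linearly as temp rises.
toy = pd.DataFrame({
    "windspeed": [20, 22, 21, 23, 19],
    "temp": [10, 12, 14, 16, 18],
    "cloudcover": [50, 40, 45, 30, 35],
    "precip": [0.0, 0.1, 0.0, 0.3, 0.2],
    "solarradiation": [100, 150, 130, 200, 180],
    "price": [3.0, 2.8, 2.6, 2.4, 2.2],
})
result = weather_price_correlation(toy)
print(result)
```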

## 6. Seasonal Patterns

In [None]:
# Plot seasonal patterns
plot_seasonal_patterns(train_df)

# Additional seasonal analysis
print("\nAverage Price by Year:")
print(train_df.groupby('year')['price'].mean())
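
Yearly averages hide the within-year cycle. A complementary view, sketched below on toy data, is the average price per week of the year together with the number of priced observations behind each average. The function name `weekly_profile` is hypothetical.

```python
import pandas as pd


def weekly_profile(frame):
    """Mean price and number of priced observations per week of the year."""
    priced = frame.dropna(subset=["price"])
    return (
        priced.groupby("week")["price"]
        .agg(mean_price="mean", n_obs="count")
        .sort_index()
    )


# Toy data (illustrative values only); week 3 never has a price.
toy = pd.DataFrame({
    "year": [2020, 2021, 2020, 2021, 2020],
    "week": [1, 1, 2, 2, 3],
    "price": [2.0, 4.0, 3.0, None, None],
})
profile = weekly_profile(toy)
print(profile)
```

Weeks that never appear in the result have no price in any year, which is one way to see whether the missing prices follow a seasonal pattern.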

## 7. Key Findings and Next Steps

1. Missing Values Pattern:
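   - Each of the seven price-related columns (`start_date`, `end_date`, `category`, `unit`, `price_min`, `price_max`, `price`) has 242 missing values out of 521 rows (about 46%); the weather columns are complete.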

2. Price Trends:
   - [To be filled after analysis]

3. Weather Correlations:
   - [To be filled after analysis]

4. Seasonal Effects:
   - [To be filled after analysis]

Next Steps:
1. Handle missing values separately for train and test sets
2. Feature engineering based on observed patterns
3. Model selection considering the seasonal nature of the data
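
As a starting point for the feature engineering step, the two-week horizon from the objective can be framed as a supervised target by shifting the price column. The sketch below is one possible framing, not a settled design: the function name `add_target_and_lags` and the choice of lags are assumptions, and it relies on the data having one row per consecutive week so that a row shift equals a week shift.

```python
import pandas as pd

HORIZON = 2  # forecast horizon in weeks, from the stated objective


def add_target_and_lags(frame, lags=(1, 2)):
    """Add a HORIZON-weeks-ahead target and lagged price features.

    Assumes one row per consecutive week, so row shifts equal week shifts.
    """
    out = frame.sort_values(["year", "week"]).reset_index(drop=True)
    out["target"] = out["price"].shift(-HORIZON)  # price two weeks later
    for lag in lags:
        out[f"price_lag_{lag}"] = out["price"].shift(lag)  # price `lag` weeks earlier
    return out


# Toy data (illustrative values only).
toy = pd.DataFrame({
    "year": [2020] * 5,
    "week": [1, 2, 3, 4, 5],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],
})
out = add_target_and_lags(toy)
print(out)
```

Rows whose target or lags fall outside the data, or on weeks without a price, end up with missing values and need to be handled before modeling.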