# Comprehensive Exploratory Data Analysis (EDA)

This notebook performs a deep dive into the retail sales data to understand underlying patterns, seasonality, and stationarity before modeling.

## Analysis Steps
1. **Data Quality Check**: Missing values, types, summary stats.
2. **Visual Inspection**: Time series plots.
3. **Decomposition**: Trend, Seasonality, Residuals.
4. **Stationarity Test**: Augmented Dickey-Fuller (ADF) test.
5. **Autocorrelation**: ACF and PACF plots.
6. **Seasonality Analysis**: Day of week, Monthly patterns.
7. **Demand Classification**: ADI/CV Analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Configuration
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

# Load Data
df = pd.read_csv("../data/raw/retail_sales.csv")
df['date'] = pd.to_datetime(df['date'])
df.head()

## 1. Data Quality Check

In [None]:
print("Dataset Info:")
df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"

print("\nMissing Values:")
print(df.isnull().sum())

print("\nDescriptive Statistics:")
display(df.describe())
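
`isnull()` only catches missing values in rows that exist. It does not reveal calendar days that are absent from a series, or dates that appear twice. The sketch below counts both per store-product pair. `find_missing_dates` is a hypothetical helper, and the demo frame is synthetic; it only assumes the `store_id`, `product_id`, and `date` columns used elsewhere in this notebook and a daily frequency.

```python
import pandas as pd

def find_missing_dates(frame, keys=("store_id", "product_id"), date_col="date"):
    """Count absent and duplicated calendar days for each series (daily data assumed)."""
    rows = []
    for key, group in frame.groupby(list(keys)):
        dates = pd.DatetimeIndex(group[date_col])
        # Every day between the first and last observation of this series
        full_range = pd.date_range(dates.min(), dates.max(), freq="D")
        rows.append({
            **dict(zip(keys, key)),
            "n_missing_days": len(full_range.difference(dates.unique())),
            "n_duplicate_days": int(dates.duplicated().sum()),
        })
    return pd.DataFrame(rows)

# Synthetic demo (not the retail data): 2023-01-03 is absent, 2023-01-04 appears twice
demo = pd.DataFrame({
    "store_id": 1,
    "product_id": 1,
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04", "2023-01-04"]),
    "sales": [3, 0, 5, 5],
})
gaps = find_missing_dates(demo)
print(gaps)
```

In the notebook this would be called as `find_missing_dates(df)`. Gaps matter later on, because the decomposition and the ADI calculation both treat consecutive rows as consecutive periods.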

## 2. Visual Inspection
We plot a single store-product series (store 1, product 1) as an example. Change the IDs below to inspect a different series.

In [None]:
store_id = 1
product_id = 1

subset = df[(df['store_id'] == store_id) & (df['product_id'] == product_id)].set_index('date').sort_index()

plt.figure(figsize=(15, 5))
plt.plot(subset['sales'], label='Sales')
plt.title(f'Daily Sales - Store {store_id}, Product {product_id}')
plt.legend()
plt.show()

## 3. Time Series Decomposition
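Decomposing the series into trend, seasonal, and residual components using an additive model (sales = trend + seasonal + residual). `period=7` assumes daily data with a weekly cycle.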

In [None]:
decomposition = seasonal_decompose(subset['sales'], model='additive', period=7)
fig = decomposition.plot()
plt.show()
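
To make the three components less of a black box, here is a rough sketch of the classical moving-average idea behind an additive decomposition with an odd period: a centered moving average estimates the trend, and the average of the detrended values at each position in the cycle estimates the seasonal component. `additive_decompose` is a hypothetical helper written for illustration, not the statsmodels implementation, and the demo series is synthetic.

```python
import numpy as np
import pandas as pd

def additive_decompose(series, period=7):
    """Classical additive decomposition sketch (odd period only)."""
    # Trend: centered moving average over one full cycle
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonal: mean detrended value at each position in the cycle, centered on zero
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=series.index)
    # Residual: whatever trend and seasonality leave unexplained
    resid = series - trend - seasonal
    return trend, seasonal, resid

# Synthetic demo: linear trend plus a weekly pattern that sums to zero
pattern = np.array([-2.0, -1.0, 0.0, 0.0, 1.0, 3.0, -1.0])
t = np.arange(56)
demo = pd.Series(10 + 0.1 * t + pattern[t % 7],
                 index=pd.date_range("2023-01-02", periods=56, freq="D"))
trend, seasonal, resid = additive_decompose(demo, period=7)
print(seasonal.head(7).round(3).tolist())
```

The first and last three trend values are `NaN` because the centered window is incomplete there, which is also why the trend panel in the plot above does not reach the edges.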

## 4. Stationarity Test (ADF Test)
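Checking whether the time series is stationary (constant mean and variance over time). The null hypothesis of the ADF test is that the series has a unit root, i.e. is non-stationary. A p-value below 0.05 rejects that null at the 5% level, which is evidence for stationarity. A larger p-value does not prove non-stationarity; it only means the test could not rule out a unit root.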

In [None]:
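# adfuller does not accept NaNs, so drop them before testing
result = adfuller(subset['sales'].dropna())
print('ADF Statistic:', result[0])
print('p-value:', result[1])
# H0: the series has a unit root (non-stationary)
if result[1] < 0.05:
    print("Result: Unit root rejected at the 5% level; the series appears stationary.")
else:
    print("Result: Unit root not rejected; the series may be non-stationary.")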

## 5. Autocorrelation Analysis (ACF & PACF)
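These plots help choose ARIMA orders: the lag at which the PACF cuts off suggests the autoregressive order p, and the lag at which the ACF cuts off suggests the moving-average order q. Spikes at lags 7, 14, 21, ... point to weekly seasonality.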

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 4))
plot_acf(subset['sales'], ax=ax[0], lags=40)
plot_pacf(subset['sales'], ax=ax[1], lags=40)
plt.show()
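
Reading significant lags off a plot can be imprecise, so a numeric list is sometimes handy. The sketch below uses pandas' `Series.autocorr`, which computes the Pearson correlation between the series and a lagged copy of itself. It is close to, but not the same estimator as, the one behind `plot_acf`. The `1.96 / sqrt(n)` band is the usual approximate 95% bound for white noise. `significant_lags` is a hypothetical helper and the demo series is synthetic.

```python
import numpy as np
import pandas as pd

def significant_lags(series, max_lag=40):
    """Lags whose sample autocorrelation falls outside the approximate 95% white-noise band."""
    clean = series.dropna()
    bound = 1.96 / np.sqrt(len(clean))
    corr = {lag: clean.autocorr(lag=lag) for lag in range(1, max_lag + 1)}
    return [lag for lag, r in corr.items() if abs(r) > bound]

# Synthetic series: a weekly pattern repeated for 12 weeks (illustration only)
weekly_pattern = np.array([5, 6, 6, 7, 9, 14, 12], dtype=float)
demo = pd.Series(np.tile(weekly_pattern, 12))

lags = significant_lags(demo, max_lag=21)
print(lags)
```

On the notebook's data the call would be `significant_lags(subset['sales'])`.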

## 6. Seasonality Analysis
Checking sales distribution by Day of Week.

In [None]:
subset['day_of_week'] = subset.index.day_name()
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

plt.figure(figsize=(10, 6))
sns.boxplot(data=subset, x='day_of_week', y='sales', order=order)
plt.title('Sales Distribution by Day of Week')
plt.show()
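
The step list at the top also mentions monthly patterns. A minimal sketch for that, assuming a series with a `DatetimeIndex` like `subset['sales']`, is to summarize sales by calendar month. `monthly_profile` is a hypothetical helper and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def monthly_profile(series):
    """Mean, median, and observation count of a date-indexed series per calendar month."""
    profile = series.groupby(series.index.month).agg(["mean", "median", "count"])
    profile.index.name = "month"
    return profile

# Synthetic demo: one year of daily data where sales equal the month number
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
demo = pd.Series(dates.month.to_numpy(dtype=float), index=dates)

profile = monthly_profile(demo)
print(profile)
```

With less than a few years of history, each month is observed only once or twice, so a monthly profile can reflect trend or one-off events as much as true seasonality.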

## 7. Demand Classification (ADI/CV)
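Recomputing ADI (Average Demand Interval: number of periods per non-zero sale) and CV (coefficient of variation of the non-zero sales) for every store-product pair, then placing each pair in a demand category: Smooth, Erratic, Intermittent, or Lumpy. The ADI calculation assumes each series has one row per period, including zero-sales days.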

In [None]:
def classify_demand(df):
    results = []
    grouped = df.groupby(['store_id', 'product_id'])
    
    for (store_id, product_id), group in grouped:
        non_zero = group[group['sales'] > 0]['sales']
        n_periods = len(group)
        n_non_zero = len(non_zero)
        
        if n_non_zero == 0:
            adi, cv, category = np.nan, np.nan, "No Demand"
        else:
            adi = n_periods / n_non_zero
            # CV of non-zero demand (NaN when there is only one non-zero observation)
            cv = non_zero.std() / non_zero.mean()
            cv2 = cv ** 2

            # Syntetos-Boylan cutoffs: ADI = 1.32 and squared CV = 0.49 (i.e. CV = 0.7)
            if adi < 1.32:
                category = "Smooth" if cv2 < 0.49 else "Erratic"
            else:
                category = "Intermittent" if cv2 < 0.49 else "Lumpy"

        results.append({'store_id': store_id, 'product_id': product_id, 'ADI': adi, 'CV': cv, 'Category': category})
    return pd.DataFrame(results)

classification_df = classify_demand(df)

plt.figure(figsize=(10, 6))
sns.scatterplot(data=classification_df, x='ADI', y='CV', hue='Category', style='Category', s=100)
plt.axvline(x=1.32, color='r', linestyle='--')
plt.axhline(y=0.7, color='g', linestyle='--')  # CV = 0.7 corresponds to CV^2 = 0.49
plt.title('Demand Classification Matrix')
plt.show()
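
A small worked example helps sanity-check the numbers. In the Syntetos-Boylan scheme as it is usually stated, the 1.32 cutoff applies to ADI and the 0.49 cutoff applies to the squared CV. The sketch below applies those cutoffs to a single toy series. `adi_cv2` is a hypothetical helper, and it assumes the series has at least one non-zero value.

```python
import pandas as pd

def adi_cv2(sales):
    """ADI, squared CV, and demand category for one series (needs >= 1 non-zero value)."""
    non_zero = sales[sales > 0]
    adi = len(sales) / len(non_zero)                  # periods per non-zero sale
    cv2 = (non_zero.std() / non_zero.mean()) ** 2     # squared CV of non-zero demand
    if adi < 1.32:
        category = "Smooth" if cv2 < 0.49 else "Erratic"
    else:
        category = "Intermittent" if cv2 < 0.49 else "Lumpy"
    return adi, cv2, category

# Toy series: 10 periods, 3 non-zero sales (4, 6, 5)
toy = pd.Series([0, 4, 0, 0, 6, 0, 0, 0, 5, 0])
adi, cv2, category = adi_cv2(toy)
# adi = 10 / 3, cv2 = (1 / 5) ** 2 = 0.04, so the series is "Intermittent"
print(round(adi, 3), round(cv2, 3), category)
```

Sales arrive rarely (ADI above 1.32) but in similar quantities (squared CV well below 0.49), which is what the Intermittent label is meant to capture.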