# 🛒 Capstone Phase 1: Exploration & Feature Engineering

**Goal**: Understand our synthetic retail data and build the "Industrial" feature pipeline required for non-time-series models (like LightGBM) and Deep Learning models.

---

## 1. Load & Explore Data

We start by loading the `parquet` file generated by our ingestion script. Parquet is a compressed, column-oriented format that stores column dtypes alongside the data, so for large datasets it is typically smaller on disk and faster to read than CSV.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load Data
df = pd.read_parquet('../data/raw/m5_lite_synthetic.parquet')

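# Convert object columns to category for memory efficiency
for col in ['store_id', 'item_id']:
    df[col] = df[col].astype('category')

# Make sure `date` is a datetime column (a no-op if it already is);
# the plotting and `.dt` feature code below relies on it
df['date'] = pd.to_datetime(df['date'])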

print(f"Dataset Shape: {df.shape}")
df.head()
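
The memory benefit of the `category` conversion can be checked directly. The sketch below uses a small made-up frame (the `toy` name and its values are illustrative, not the real dataset) and compares the deep memory footprint before and after conversion. The exact byte counts depend on the pandas version, so only the direction of the difference should be relied on.

```python
import pandas as pd

# Toy frame with highly repetitive string IDs (illustrative values only)
toy = pd.DataFrame({
    "store_id": ["store_0", "store_1"] * 5000,
    "item_id": ["item_0", "item_1", "item_2", "item_3"] * 2500,
})

# deep=True counts the string payloads, not just the pointers
bytes_as_object = toy.memory_usage(deep=True).sum()

# category stores each distinct string once plus a small integer code per row
toy_cat = toy.astype("category")
bytes_as_category = toy_cat.memory_usage(deep=True).sum()

print(f"object: {bytes_as_object:,} bytes, category: {bytes_as_category:,} bytes")
```

The saving grows with the number of rows per distinct ID, which is why it pays off for `store_id` and `item_id` but would not for a column of mostly unique strings.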

## 2. Visualizing Demand Patterns

Before modeling, we must verify that our synthetic data looks "real".
We expect to see:
1. **Seasonality**: A repeating weekly cycle in sales.
2. **Trends**: A long-term increase or decrease.
3. **Promo Effects**: Spikes on days when `is_promo=1`.

In [None]:
# Pick one item to plot
sample_item = 'item_0'
sample_store = 'store_0'

subset = df[(df['item_id'] == sample_item) & (df['store_id'] == sample_store)].set_index('date')

plt.figure(figsize=(15, 5))
plt.plot(subset['sales'], label='Sales', alpha=0.7)
# Highlight promo days on top of the sales line
promo = subset[subset['is_promo'] == 1]
plt.scatter(promo.index, promo['sales'], color='red', label='Promo', s=20)

plt.title(f"Sales History: {sample_store} - {sample_item}")
plt.legend()
plt.show()
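
A plot is easy to misread, so it can help to back it up with numbers. The following is a minimal sketch on a hand-built toy series (not the real data): sales are constructed with a weekend bump and a promo bump, and two group-wise means recover both effects. The column names mirror the ones used above; the values and effect sizes are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# 8 weeks of daily data starting on a Monday (toy values, not the real dataset)
dates = pd.date_range("2023-01-02", periods=56, freq="D")
toy = pd.DataFrame({"date": dates})
toy["is_promo"] = (np.arange(56) % 10 == 0).astype(int)

# Baseline of 10 units, +5 on weekends, +8 on promo days
weekend = toy["date"].dt.dayofweek >= 5
toy["sales"] = 10 + 5 * weekend.astype(int) + 8 * toy["is_promo"]

# Weekly seasonality: average sales per day of week (0=Monday, 6=Sunday)
mean_by_weekday = toy.groupby(toy["date"].dt.dayofweek)["sales"].mean()

# Promo effect: average sales with and without a promotion
mean_by_promo = toy.groupby("is_promo")["sales"].mean()
promo_uplift = mean_by_promo[1] - mean_by_promo[0]

print(mean_by_weekday)
print(f"Promo uplift: {promo_uplift:.2f}")
```

Running the same two `groupby` calls on `subset` gives a quick numeric counterpart to the chart.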

## 3. Feature Engineering (The "Secret Sauce")

Tree-based models (like XGBoost/LightGBM) have no built-in notion of time: they treat every row as an independent observation.
To tell them about the past, we must explicitly create features that represent history.

### 3.1 Lag Transform
What were the sales 7 days ago? 28 days ago?
- `lag_7`: Captures weekly seasonality (e.g., this Saturday is like last Saturday).
- `lag_28`: Captures roughly monthly patterns. 28 days is exactly four weeks, so the weekday still lines up.

In [None]:
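def create_lag_features(df, lags=(7, 14, 28)):
    """Add one `lag_<n>` column per lag, computed within each (store, item) series."""
    # Sort so that shifting by n rows means shifting by n days within a series
    df = df.sort_values(['store_id', 'item_id', 'date']).copy()
    # observed=True skips unused category combinations for categorical keys
    grouped = df.groupby(['store_id', 'item_id'], observed=True)['sales']

    for lag in lags:
        df[f'lag_{lag}'] = grouped.shift(lag)

    return df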

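df_lags = create_lag_features(df)
# Drop only the warm-up rows at the start of each series where a lag is undefined
df_lags = df_lags.dropna(subset=['lag_7', 'lag_14', 'lag_28'])
df_lags[['date', 'item_id', 'sales', 'lag_7', 'lag_28']].head(10)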
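
Grouping before shifting matters. The sketch below uses a six-row toy frame (made-up values) to contrast a grouped shift with a naive one: without the `groupby`, the first row of `item_1` would inherit the last sale of `item_0`.

```python
import pandas as pd

# Two short series stacked in one frame (toy values only)
toy = pd.DataFrame({
    "store_id": ["store_0"] * 6,
    "item_id": ["item_0"] * 3 + ["item_1"] * 3,
    "date": list(pd.date_range("2023-01-01", periods=3)) * 2,
    "sales": [1, 2, 3, 100, 200, 300],
})
toy = toy.sort_values(["store_id", "item_id", "date"])

# Grouped shift: each series only sees its own history
toy["lag_1_grouped"] = toy.groupby(["store_id", "item_id"])["sales"].shift(1)

# Naive shift: values bleed across the item_0 / item_1 boundary
toy["lag_1_naive"] = toy["sales"].shift(1)

print(toy[["item_id", "sales", "lag_1_grouped", "lag_1_naive"]])
```

In the printed frame, the first `item_1` row has a missing `lag_1_grouped` but a `lag_1_naive` taken from `item_0`, which is exactly the contamination the grouped version avoids.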

### 3.2 Rolling Window Statistics
Instead of just a single point (lag), we look at a summary of a window.
- **Rolling Mean**: "What is the average level of sales recently?" (Trend)
- **Rolling Std**: "How volatile is this item?" (Uncertainty)

In [None]:
def create_rolling_features(df, windows=(7, 28), horizon=28):
    """Add rolling mean/std of past sales within each (store, item) series."""
    df = df.sort_values(['store_id', 'item_id', 'date']).copy()
    grouped = df.groupby(['store_id', 'item_id'], observed=True)['sales']

    # Shift before rolling so that a window never contains the current day's
    # sales (leakage). Shifting by 1 would be enough for day-ahead forecasts;
    # shifting by the full horizon (28 days) keeps every feature available
    # for each day of a 4-week forecast.
    for window in windows:
        df[f'rolling_mean_{window}'] = grouped.transform(
            lambda x: x.shift(horizon).rolling(window).mean())
        df[f'rolling_std_{window}'] = grouped.transform(
            lambda x: x.shift(horizon).rolling(window).std())

    return df

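df_feats = create_rolling_features(df_lags)
# Drop only the warm-up rows where a rolling window is not yet full
rolling_cols = [c for c in df_feats.columns if c.startswith('rolling_')]
df_feats = df_feats.dropna(subset=rolling_cols)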
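
The effect of shifting before rolling is easiest to see on a tiny series. This sketch uses a shift of 1 and a window of 3 (smaller than the values in the pipeline, chosen only to keep the example readable) and compares a leaky rolling mean with a safe one.

```python
import pandas as pd

sales = pd.Series([10, 20, 30, 40, 50], name="sales")

# Leaky: the window ends on the current day, so today's sales leak into the feature
leaky = sales.rolling(3).mean()

# Safe: shift first, so the window ends on the previous day
safe = sales.shift(1).rolling(3).mean()

print(pd.DataFrame({"sales": sales, "leaky": leaky, "safe": safe}))
```

On the last day the leaky mean averages 30, 40 and 50, while the safe mean averages 20, 30 and 40. The cost of the shift is extra warm-up: a shift of `h` followed by a window of `w` leaves the first `h + w - 1` rows of each series undefined, which is 55 rows for a 28-day shift and a 28-day window.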

### 3.3 Date Features
The model needs to know "Is it a weekend?" or "Is it December?".

In [None]:
def create_date_features(df):
    df = df.copy()  # avoid mutating the caller's DataFrame
    df['month'] = df['date'].dt.month
    df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday
    df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)

    # Cyclical encoding for month, so that December (12) sits next to January (1)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)

    return df

df_final = create_date_features(df_feats)
df_final.head()
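
Why bother with sine and cosine? As a raw integer, December (12) and January (1) are 11 apart, although they are neighbours on the calendar. The sketch below checks that the sin/cos pair removes this artificial gap. `encode_month` is a hypothetical helper written for this example and applies the same formula as `create_date_features`.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

# Distances between encoded months
dist_dec_jan = np.linalg.norm(encode_month(12) - encode_month(1))
dist_jan_feb = np.linalg.norm(encode_month(1) - encode_month(2))
dist_jan_jul = np.linalg.norm(encode_month(1) - encode_month(7))

print(f"Dec-Jan: {dist_dec_jan:.3f}, Jan-Feb: {dist_jan_feb:.3f}, Jan-Jul: {dist_jan_jul:.3f}")
```

Adjacent months end up equally far apart, including across the year boundary, and months half a year apart are the most distant. Tree models can often cope with the raw integer, so this encoding tends to matter more for the Deep Learning models.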

## 4. Save Processed Data

We now have a "Tabular" dataset ready for training LightGBM or Deep Learning models.
Features:
- `lag_7`, `lag_14`, `lag_28`
- `rolling_mean_7`, `rolling_std_7`, `rolling_mean_28`, `rolling_std_28`
- `month`, `day_of_week`, `is_weekend`, `month_sin`, `month_cos`
- `sell_price`, `is_promo`

Target: `sales`

In [None]:
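from pathlib import Path
output_path = '../data/processed/tabular_train_data.parquet'
Path(output_path).parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
df_final.to_parquet(output_path, index=False)
print(f"Saved processed data to {output_path}")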
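
Before handing the table to a model, it can be worth checking that the expected columns exist and contain no missing values. The following is a minimal sketch, not part of the pipeline above: `check_training_frame` is a hypothetical helper, `FEATURES` holds some of the columns listed in this section, and the two-row frame is made-up data.

```python
import pandas as pd

# Some of the feature columns listed above; extend as needed
FEATURES = ["lag_7", "lag_14", "lag_28", "rolling_mean_7", "rolling_std_7",
            "month", "day_of_week", "sell_price", "is_promo"]
TARGET = "sales"

def check_training_frame(frame, features=FEATURES, target=TARGET):
    """Return (X, y) after verifying that the columns exist and have no NaNs."""
    required = list(features) + [target]
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    n_null = int(frame[required].isna().sum().sum())
    if n_null:
        raise ValueError(f"Found {n_null} missing values in required columns")
    return frame[list(features)], frame[target]

# Two-row toy frame with every required column filled in (made-up values)
toy = pd.DataFrame({col: [1.0, 2.0] for col in FEATURES + [TARGET]})
X, y = check_training_frame(toy)
print(X.shape, y.shape)
```

Calling `check_training_frame(df_final)` right before saving would fail fast if a feature step was skipped or left warm-up rows behind.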