# Walmart Sales Forecasting: Feature Engineering Documentation

## Overview

This document outlines the comprehensive feature engineering strategy for the Walmart sales forecasting competition. The approach focuses on creating meaningful features that capture the complex patterns in retail sales data, including temporal trends, promotional effects, economic influences, and store-specific characteristics.

## Core Philosophy

The feature engineering strategy is built on several key principles:

- **Time Series Awareness**: Retail sales are inherently time-dependent with strong seasonal and trend components
- **Business Context**: Features should reflect real-world retail dynamics (holidays, promotions, economic conditions)
- **Hierarchy Capture**: Different stores and departments have unique characteristics that need to be modeled
- **Multi-scale Patterns**: Sales patterns exist at different time scales (weekly, monthly, seasonal, yearly)

## Feature Categories

### 1. Holiday-Related Features

**Purpose**: Capture the significant impact of holidays on retail sales patterns.

#### Core Holiday Features
- **`Holiday_Weight`**: Assigns higher weight (5.0) to holiday weeks vs. regular weeks (1.0)
  - *Rationale*: Holiday sales are typically more volatile and important for revenue; the 5:1 ratio matches the holiday weighting in the competition's WMAE metric
  
- **`Pre_Holiday`**: Binary indicator for the week before a holiday
  - *Business Logic*: Customers often shop in advance of holidays
  
- **`Post_Holiday`**: Binary indicator for the week after a holiday
  - *Business Logic*: Post-holiday sales often show different patterns (returns, clearance)

#### Implementation Notes
- Uses grouped shift operations to maintain store-department granularity
- Fills the gaps left by the shift (the first or last week of each series) with 0 (non-holiday assumption)
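
The grouped shift described above can be sketched on toy data as follows. The frame `df` and its values are made up for illustration; only the column names come from this document. Using `fill_value=False` is one way to apply the non-holiday assumption at the edges of each series.

```python
import pandas as pd

# Toy data: two store-department series, one holiday week in each.
df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2],
    "Dept": [1, 1, 1, 1, 1, 1, 1],
    "Date": pd.to_datetime(
        ["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26",
         "2010-02-05", "2010-02-12", "2010-02-19"]
    ),
    "IsHoliday": [False, True, False, False, False, False, True],
}).sort_values(["Store", "Dept", "Date"])

holiday_by_series = df.groupby(["Store", "Dept"])["IsHoliday"]

# shift(-1) looks one week ahead, shift(1) one week back. Both stay inside
# the group, and fill_value covers the first/last week of each series.
df["Pre_Holiday"] = holiday_by_series.shift(-1, fill_value=False).astype(int)
df["Post_Holiday"] = holiday_by_series.shift(1, fill_value=False).astype(int)

print(df[["Store", "Date", "IsHoliday", "Pre_Holiday", "Post_Holiday"]])
```

Because the shift is grouped, the last week of store 1 never sees the first week of store 2.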

### 2. Temporal Features

**Purpose**: Capture cyclical and seasonal patterns in sales data.

#### Linear Time Components
- **`Year`**: Linear year component for long-term trends
- **`Month`**: Month of year (1-12)
- **`Week`**: ISO week number (1-52, or 53 in ISO years that have 53 weeks)
- **`Quarter`**: Business quarter (1-4)

#### Cyclical Encoding
- **`Month_sin/cos`**: Sine and cosine transformations of month
  - *Formula*: `sin(2π × month / 12)`, `cos(2π × month / 12)`
  - *Benefit*: Captures cyclical nature (December and January are adjacent)
  
- **`Week_sin/cos`**: Sine and cosine transformations of week
  - *Formula*: `sin(2π × week / 52)`, `cos(2π × week / 52)`
  - *Benefit*: Handles end-of-year continuity

#### Why Cyclical Encoding?
Traditional numerical encoding treats December (12) and January (1) as distant, when they're actually adjacent in the business cycle. Cyclical encoding preserves this relationship.
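
A small numeric check of this claim, as a sketch: the helper `encode_cyclical` is a hypothetical name introduced here, not part of the feature engineering class. It maps a month onto the unit circle and compares distances in the encoded space.

```python
import numpy as np

def encode_cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jun = (encode_cyclical(m, 12) for m in (12, 1, 6))

# Euclidean distance in the encoded space
dist_dec_jan = np.linalg.norm(dec - jan)  # adjacent months: small
dist_dec_jun = np.linalg.norm(dec - jun)  # opposite months: the full diameter

print(f"Dec-Jan: {dist_dec_jan:.4f}, Dec-Jun: {dist_dec_jun:.4f}")
```

With raw month numbers, December and January are 11 apart and December and June only 6 apart; in the encoded space the ordering is reversed, which is the intended behaviour.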

### 3. Lag Features

**Purpose**: Capture autoregressive patterns - how past sales influence current sales.

#### Lag Periods
- **Short-term**: 1, 2 weeks (immediate recent trends)
- **Medium-term**: 4, 8 weeks (monthly patterns)
- **Long-term**: 12 weeks (quarterly patterns)
- **Seasonal**: 52 weeks (year-over-year comparison)

#### Implementation
```python
lag_periods = [1, 2, 4, 8, 12, 52]
# Produces one column per lag: Sales_lag_1, Sales_lag_2, ..., Sales_lag_52
```

#### Business Rationale
- **1-2 weeks**: Immediate momentum effects
- **4 weeks**: Monthly buying cycles
- **8-12 weeks**: Seasonal preparation
- **52 weeks**: Year-over-year comparison for same time period
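
As a minimal sketch of how such lags can be built, the toy frame below (invented values, already sorted by date within each series) applies a grouped shift for two of the lag periods. Longer lags work the same way but need more history than the toy data has.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1] * 4 + [2] * 4,
    "Dept": [1] * 8,
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0, 300.0, 400.0],
})

for lag in [1, 2]:
    # The grouped shift keeps each lag inside its own store-department series.
    df[f"Sales_lag_{lag}"] = df.groupby(["Store", "Dept"])["Weekly_Sales"].shift(lag)

print(df)
```

The first `lag` rows of every series are NaN because no earlier observation exists for them.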

### 4. Rolling Statistics

**Purpose**: Capture moving trends and volatility patterns over different time horizons.

#### Window Sizes
- **4 weeks**: Monthly trends
- **8 weeks**: Bi-monthly patterns
- **12 weeks**: Quarterly trends
- **24 weeks**: Semi-annual patterns
- **52 weeks**: Annual moving statistics

#### Statistics Calculated
For each window size:
- **`Sales_rolling_mean_X`**: Moving average (trend)
- **`Sales_rolling_std_X`**: Moving standard deviation (volatility)
- **`Sales_rolling_max_X`**: Moving maximum (peak performance)
- **`Sales_rolling_min_X`**: Moving minimum (trough identification)

#### Business Applications
- **Mean**: Underlying trend identification
- **Std**: Volatility and predictability measures
- **Max/Min**: Peak and trough pattern recognition
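
The sketch below computes the four statistics per store-department series on toy data. It makes one assumption that is not stated in this section: sales are shifted by one week first, so each window contains only past weeks. The window of 2 is chosen only so that the tiny example produces values.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1] * 5 + [2] * 5,
    "Dept": [1] * 10,
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 5.0, 5.0, 5.0, 5.0, 5.0],
})

keys = ["Store", "Dept"]
window = 2  # tiny window so the toy data produces values

# Assumption of this sketch: shift by one week so the window ends at t-1.
past_sales = df.groupby(keys)["Weekly_Sales"].shift(1)
rolled = past_sales.groupby([df["Store"], df["Dept"]]).rolling(window=window)

for stat in ("mean", "std", "max", "min"):
    # Drop the two group levels so the result aligns with df's original index.
    df[f"Sales_rolling_{stat}_{window}"] = (
        getattr(rolled, stat)().reset_index(level=[0, 1], drop=True)
    )

print(df.filter(like="Sales_rolling"))
```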

### 5. Markdown (Promotion) Features

**Purpose**: Capture the impact of promotional activities on sales.

#### Core Markdown Processing
- **Missing Value Logic**: NaN → 0 (no promotion assumption)
- **`Total_MarkDown`**: Sum of all markdown values
- **`MarkDown1_Active`, etc.**: Binary indicators for active promotions

#### Advanced Markdown Features
- **`MarkDown_Intensity`**: Total markdowns normalized by store size
  - *Formula*: `Total_MarkDown / (Size + 1)`
  - *Purpose*: Account for store size in promotion impact

#### Promotion Strategy Insights
- Multiple concurrent promotions may have interaction effects
- Store size influences promotion effectiveness
- Binary indicators capture promotion presence vs. magnitude
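
A minimal sketch of the markdown features on invented values, restricted to two markdown columns for brevity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Size": [150000, 150000, 40000],
    "MarkDown1": [np.nan, 3000.0, 500.0],
    "MarkDown2": [np.nan, np.nan, 100.0],
})
markdown_cols = ["MarkDown1", "MarkDown2"]

df[markdown_cols] = df[markdown_cols].fillna(0)      # NaN -> no promotion
df["Total_MarkDown"] = df[markdown_cols].sum(axis=1)
for col in markdown_cols:
    df[f"{col}_Active"] = (df[col] > 0).astype(int)  # presence, not magnitude
# The +1 guards against division by zero for a hypothetical zero-size store.
df["MarkDown_Intensity"] = df["Total_MarkDown"] / (df["Size"] + 1)

print(df)
```

The same 600 in markdowns counts for more in the 40,000 sq ft store than it would in a 150,000 sq ft one, which is what the intensity feature encodes.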

### 6. Store and Department Performance Features

**Purpose**: Capture relative performance and hierarchy effects.

#### Hierarchical Averages
- **`Store_Type_Avg`**: Average sales for the store's type (A, B, C)
- **`Dept_Avg`**: Average sales for the department across all stores
- **`Store_Dept_Avg`**: Historical average for specific store-department combination

#### Relative Performance
These features enable the model to understand:
- How a store performs relative to similar stores
- How a department performs relative to other departments
- Store-department specific dynamics
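
One way to compute these averages is a grouped `transform("mean")`, which broadcasts each group mean back to its rows. The sketch below uses invented values; it is an illustration of the idea rather than a copy of the class implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Dept": [1, 2, 1, 2],
    "Type": ["A", "A", "B", "B"],
    "Weekly_Sales": [100.0, 300.0, 50.0, 150.0],
})

# Each transform returns a column aligned with df, one group mean per row.
df["Store_Type_Avg"] = df.groupby("Type")["Weekly_Sales"].transform("mean")
df["Dept_Avg"] = df.groupby("Dept")["Weekly_Sales"].transform("mean")
df["Store_Dept_Avg"] = df.groupby(["Store", "Dept"])["Weekly_Sales"].transform("mean")

print(df)
```

Since these averages are derived from `Weekly_Sales`, computing them on training rows only avoids leaking the target into validation data.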

### 7. Economic Interaction Features

**Purpose**: Capture how economic conditions interact with other factors.

#### Interaction Terms
- **`Unemployment_Temperature`**: Economic stress × weather interaction
- **`CPI_Fuel_Interaction`**: Inflation × transportation cost interaction

#### Economic Stress Indicator
```python
df["Economic_Stress"] = (
    (df["Unemployment"] - df["Unemployment"].mean()) / df["Unemployment"].std()
    + (df["Fuel_Price"] - df["Fuel_Price"].mean()) / df["Fuel_Price"].std()
)
```

#### Business Logic
- High unemployment + extreme temperatures may affect shopping patterns
- Inflation combined with high fuel costs creates compound economic pressure
- Standardized stress indicator captures overall economic environment

### 8. Trend Features

**Purpose**: Capture linear growth or decline patterns for each store-department.

#### Trend Calculation
- Uses linear regression slope for each store-department combination
- Minimum 3 data points required for trend calculation
- Captures whether sales are generally increasing, decreasing, or stable

#### Implementation Details
```python
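import numpy as np

def calculate_trend(group):
    """Return the least-squares slope of sales against week index."""
    if len(group) < 3:
        return 0  # Insufficient data
    x = np.arange(len(group))  # 0, 1, 2, ... as the time axis
    slope = np.polyfit(x, group.values, 1)[0]  # degree-1 fit; [0] is the slope
    return slope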
```
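
As a usage sketch on toy data, the slope can be broadcast to every row of a series with a grouped `transform`. The frame and its values are invented, and using `transform` here is an assumption of this example.

```python
import numpy as np
import pandas as pd

def calculate_trend(group):
    if len(group) < 3:
        return 0.0  # Insufficient data
    x = np.arange(len(group))
    return np.polyfit(x, group.values, 1)[0]

df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2],
    "Dept": [1, 1, 1, 1, 1, 1],
    "Weekly_Sales": [10.0, 12.0, 14.0, 16.0, 7.0, 9.0],
})

# transform repeats the scalar slope for every row of its group.
df["Sales_Trend"] = df.groupby(["Store", "Dept"])["Weekly_Sales"].transform(
    calculate_trend
)

print(df)
```

Store 1 grows by 2 per week, so its slope is approximately 2; store 2 has only two observations and falls back to 0.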

## Missing Value Strategy

### Domain-Specific Logic

#### Markdown Columns
- **Strategy**: Fill NaN with 0
- **Rationale**: Missing markdown data implies no promotion was active

#### Lag and Rolling Features
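- **Strategy**: Forward fill within store-department groups
- **Rationale**: Maintains temporal continuity within each time series. Forward fill cannot reach the leading NaNs at the start of a series (the weeks before the first complete lag or window), so those are left to the later fill steps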

#### Remaining Numeric Features
- **Strategy**: Linear interpolation within groups
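- **Fallback**: Forward fill → Backward fill → Zero fill, applied across the whole table rather than per group, so values can carry over between adjacent store-department series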

### Missing Value Hierarchy
1. **Domain Logic**: Use business understanding (markdowns = 0)
2. **Temporal Logic**: Forward fill for time series continuity
3. **Statistical Logic**: Interpolation for smooth transitions
4. **Fallback Logic**: Zero fill as last resort
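
The four levels can be sketched on a toy frame as below. The values are invented, and the example uses a plain zero fill as the last step to keep the effect of each level visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Dept": [1, 1, 1, 1, 1, 1],
    "MarkDown1": [np.nan, 200.0, np.nan, np.nan, np.nan, 50.0],
    "Sales_lag_1": [np.nan, 10.0, np.nan, np.nan, 7.0, 8.0],
    "CPI": [210.0, np.nan, 212.0, np.nan, np.nan, np.nan],
})
keys = ["Store", "Dept"]

# 1. Domain logic: a missing markdown means no promotion.
df["MarkDown1"] = df["MarkDown1"].fillna(0)
# 2. Temporal logic: forward fill within each series.
df["Sales_lag_1"] = df.groupby(keys)["Sales_lag_1"].ffill()
# 3. Statistical logic: linear interpolation within each series.
df["CPI"] = df.groupby(keys)["CPI"].transform(
    lambda s: s.interpolate(method="linear")
)
# 4. Fallback logic: whatever is still missing becomes 0.
df = df.fillna(0)

print(df)
```

Store 2 has no CPI observation at all, so interpolation has nothing to work with and the column ends up as 0 there. That is a reminder to check how many values reach the last-resort step.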

## Feature Scaling Strategy

### When to Scale
- Neural network models require scaled features
- Tree-based models can handle raw features
- Mixed model ensembles may benefit from both versions

### Scaling Approach
- **Method**: StandardScaler (zero mean, unit variance)
- **Scope**: Selected feature columns
- **Output**: Creates `{feature}_scaled` columns alongside originals
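
To make the arithmetic concrete, the sketch below reproduces standardization with pandas alone. It is not how the class does it (the class uses `StandardScaler`). `StandardScaler` divides by the population standard deviation, hence `ddof=0`. The frame and the `scaler_params` dictionary are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [40.0, 50.0, 60.0],
    "CPI": [210.0, 211.0, 215.0],
})
feature_columns = ["Temperature", "CPI"]

scaler_params = {}
for col in feature_columns:
    mean, std = df[col].mean(), df[col].std(ddof=0)  # population std
    df[f"{col}_scaled"] = (df[col] - mean) / std
    scaler_params[col] = (mean, std)  # keep to reuse on test data

print(df)
```

Keeping the fitted mean and standard deviation is what allows the identical transformation to be applied to unseen data later.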

## Implementation Best Practices

### Data Integrity
- Always sort by Store, Department, Date before feature creation
- Maintain groupby operations at appropriate granularity
- Handle edge cases (insufficient data, extreme values)

### Performance Optimization
- Use vectorized pandas operations
- Minimize Python loops and `.apply()` calls
- Consider memory usage for large datasets

### Validation Strategy
- Fit data-dependent statistics (group averages, scaler parameters) on training data only
- Apply the same fitted transformations to test data
- Validate feature distributions and ranges
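
A minimal fit-then-apply sketch for one feature, using invented `train` and `test` frames: the store-department average is learned from training rows and joined onto both sets, and the validation step counts the pairs that the training data never saw.

```python
import pandas as pd

train = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Dept": [1, 1, 1, 1],
    "Weekly_Sales": [100.0, 200.0, 30.0, 50.0],
})
test = pd.DataFrame({"Store": [1, 2, 3], "Dept": [1, 1, 1]})

keys = ["Store", "Dept"]
# Fit: learn the statistic from training rows only.
store_dept_avg = (
    train.groupby(keys)["Weekly_Sales"].mean().rename("Store_Dept_Avg").reset_index()
)
# Apply: the same lookup table is joined onto both frames.
train = train.merge(store_dept_avg, on=keys, how="left")
test = test.merge(store_dept_avg, on=keys, how="left")

# Validate: unseen store-department pairs surface as NaN in the test set.
unseen = int(test["Store_Dept_Avg"].isna().sum())
print(test, f"\nunseen pairs: {unseen}")
```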

## Feature Importance Considerations

### Expected High-Impact Features
1. **Recent Lag Features** (1-4 weeks): Strong autoregressive patterns
2. **Rolling Means** (4-12 weeks): Trend capture
3. **Holiday Indicators**: Known sales drivers
4. **Store-Department Averages**: Baseline performance

### Expected Medium-Impact Features
1. **Seasonal Encodings**: Cyclical patterns
2. **Markdown Features**: Promotion effects
3. **Economic Interactions**: External factors

### Feature Validation
- Monitor feature importance in model training
- Check for feature redundancy and correlation
- Validate business logic alignment

## Usage Guidelines

### Data Pipeline Integration
1. Load and merge base datasets
2. Apply feature engineering class
3. Handle missing values
4. Scale features if needed
5. Validate feature quality

### Model-Specific Considerations
- **Tree Models**: Can use raw features, benefit from all feature types
- **Linear Models**: Require scaled features, may need feature selection
- **Neural Networks**: Require scaled features, benefit from rich feature sets

## Future Enhancements

### Potential Additional Features
- **Cross-Store Patterns**: Regional or competitive effects
- **Weather Interactions**: Temperature × season interactions
- **Advanced Time Features**: Fourier transforms for complex seasonality
- **External Data**: Economic indicators, demographic data
- **Text Features**: Department name embeddings

### Advanced Techniques
- **Automated Feature Engineering**: Using tools like Featuretools
- **Deep Feature Learning**: Embedding layers for categorical variables
- **Feature Selection**: Automated importance-based selection
- **Feature Interaction Discovery**: Automated interaction term generation

## Conclusion

This feature engineering strategy provides a comprehensive foundation for Walmart sales forecasting by:

- Capturing multiple time scales and patterns
- Incorporating business domain knowledge
- Handling data quality issues appropriately
- Providing flexibility for different model types
- Maintaining interpretability and business relevance

The features are designed to work together as a cohesive system, with each category addressing different aspects of the retail sales forecasting challenge.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler


class WalmartFeatureEngineering:
    """Advanced feature engineering specific to Walmart competition"""

    def __init__(self, merged_data):
        self.data = merged_data.copy()
        self.feature_importance = {}

    def create_walmart_features(self):
        """Create features specific to Walmart competition"""
        print("=== WALMART-SPECIFIC FEATURE ENGINEERING ===")

        # Sort data for time-based features
        self.data = self.data.sort_values(["Store", "Dept", "Date"])

        # 1. Holiday-related features
        # Vectorized: weight 5.0 for holiday weeks, 1.0 otherwise
        self.data["Holiday_Weight"] = np.where(self.data["IsHoliday"] == 1, 5.0, 1.0)

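        # Pre/post holiday indicators (0/1). The first/last week of each series
        # has no neighbour and is treated as non-holiday; casting to int keeps
        # the columns numeric instead of object dtype.
        holiday_by_series = self.data.groupby(["Store", "Dept"])["IsHoliday"]
        self.data["Pre_Holiday"] = holiday_by_series.shift(
            -1, fill_value=False
        ).astype(int)
        self.data["Post_Holiday"] = holiday_by_series.shift(
            1, fill_value=False
        ).astype(int)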

        # 2. Temporal features
        self.data["Year"] = self.data["Date"].dt.year
        self.data["Month"] = self.data["Date"].dt.month
        # isocalendar() returns a nullable UInt32 column; cast to a plain int
        self.data["Week"] = self.data["Date"].dt.isocalendar().week.astype(int)
        self.data["Quarter"] = self.data["Date"].dt.quarter

        # Cyclical encoding
        self.data["Month_sin"] = np.sin(2 * np.pi * self.data["Month"] / 12)
        self.data["Month_cos"] = np.cos(2 * np.pi * self.data["Month"] / 12)
        self.data["Week_sin"] = np.sin(2 * np.pi * self.data["Week"] / 52)
        self.data["Week_cos"] = np.cos(2 * np.pi * self.data["Week"] / 52)

        # 3. Lag features (critical for time series)
        lag_periods = [1, 2, 4, 8, 12, 52]  # Including yearly lag
        for lag in lag_periods:
            self.data[f"Sales_lag_{lag}"] = self.data.groupby(["Store", "Dept"])[
                "Weekly_Sales"
            ].shift(lag)

        # 4. Rolling statistics
        # Shift by one week first so every window ends at t-1; otherwise the
        # window would contain the current week's target value (leakage).
        windows = [4, 8, 12, 24, 52]
        past_sales = self.data.groupby(["Store", "Dept"])["Weekly_Sales"].shift(1)
        for window in windows:
            rolled = past_sales.groupby(
                [self.data["Store"], self.data["Dept"]]
            ).rolling(window=window)
            for stat in ("mean", "std", "max", "min"):
                # Drop the group levels so the result aligns with self.data
                self.data[f"Sales_rolling_{stat}_{window}"] = getattr(
                    rolled, stat
                )().reset_index(level=[0, 1], drop=True)

        # 5. Markdown features
        markdown_cols = [
            "MarkDown1",
            "MarkDown2",
            "MarkDown3",
            "MarkDown4",
            "MarkDown5",
        ]

        # Keep only the markdown columns that are present, so the total below
        # does not raise a KeyError when one is missing
        markdown_cols = [col for col in markdown_cols if col in self.data.columns]

        # Fill markdown NaN with 0 (no promotion)
        self.data[markdown_cols] = self.data[markdown_cols].fillna(0)

        # Total markdown
        self.data["Total_MarkDown"] = self.data[markdown_cols].sum(axis=1)

        # Markdown indicators
        for col in markdown_cols:
            self.data[f"{col}_Active"] = (self.data[col] > 0).astype(int)

        # Markdown intensity
        self.data["MarkDown_Intensity"] = self.data["Total_MarkDown"] / (
            self.data["Size"] + 1
        )

        # 6. Store and department-level features
        # transform("mean") broadcasts each group mean back to its rows.
        # Average sales for the store's type
        self.data["Store_Type_Avg"] = self.data.groupby("Type")[
            "Weekly_Sales"
        ].transform("mean")

        # Average sales for the department across all stores
        self.data["Dept_Avg"] = self.data.groupby("Dept")["Weekly_Sales"].transform(
            "mean"
        )

        # Average sales for the store-department combination
        self.data["Store_Dept_Avg"] = self.data.groupby(["Store", "Dept"])[
            "Weekly_Sales"
        ].transform("mean")

        # 7. Economic interaction features
        self.data["Unemployment_Temperature"] = (
            self.data["Unemployment"] * self.data["Temperature"]
        )
        self.data["CPI_Fuel_Interaction"] = self.data["CPI"] * self.data["Fuel_Price"]

        # Economic stress indicator: sum of the z-scores of unemployment and fuel price
        unemployment = self.data["Unemployment"]
        fuel_price = self.data["Fuel_Price"]
        self.data["Economic_Stress"] = (
            (unemployment - unemployment.mean()) / unemployment.std()
            + (fuel_price - fuel_price.mean()) / fuel_price.std()
        )

        # 8. Trend features
        # Linear trend (least-squares slope) for each store-department combination
        def calculate_trend(group):
            if len(group) < 3:
                return 0.0  # Insufficient data for a slope
            x = np.arange(len(group))
            return np.polyfit(x, group.values, 1)[0]

        # transform broadcasts the per-group slope to every row of the group
        self.data["Sales_Trend"] = self.data.groupby(["Store", "Dept"])[
            "Weekly_Sales"
        ].transform(calculate_trend)

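        print(f"Feature engineering completed. New shape: {self.data.shape}")
        # Note: the count below is relative to the five base training columns,
        # so it also includes columns merged in from the features/stores tables.
        base_cols = ["Store", "Dept", "Date", "Weekly_Sales", "IsHoliday"]
        print(f"Added {self.data.shape[1] - len(base_cols)} new features")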

        return self.data

    def handle_missing_values(self):
        """Handle missing values with Walmart-specific logic"""
        print("=== HANDLING MISSING VALUES ===")

        # Markdown columns: NaN means no promotion (fill with 0)
        markdown_cols = [
            "MarkDown1",
            "MarkDown2",
            "MarkDown3",
            "MarkDown4",
            "MarkDown5",
        ]
        for col in markdown_cols:
            if col in self.data.columns:
                self.data[col] = self.data[col].fillna(0)

        # For lag and rolling features, use forward fill within store-dept groups.
        # (fillna(method="ffill") is deprecated in recent pandas; use ffill().)
        lag_cols = [
            col for col in self.data.columns if "lag_" in col or "rolling_" in col
        ]
        for col in lag_cols:
            self.data[col] = self.data.groupby(["Store", "Dept"])[col].ffill()

        # For remaining missing values, interpolate linearly within each group.
        # transform keeps the original index, so no reset_index is needed.
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            if self.data[col].isnull().any():
                self.data[col] = self.data.groupby(["Store", "Dept"])[col].transform(
                    lambda x: x.interpolate(method="linear")
                )

        # Final cleanup: fill any remaining NaN (forward, then backward, then 0).
        # These fills run across the whole frame, not per store-dept group.
        self.data = self.data.ffill().bfill().fillna(0)

        print(
            f"Missing values handled. Remaining NaN: {self.data.isnull().sum().sum()}"
        )
        return self.data

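    def scale_features(self, feature_columns):
        """Standardize the given columns (zero mean, unit variance).

        Adds a `{col}_scaled` column per feature and returns the fitted
        scalers so the same transformation can be reused on other data.
        """
        print("=== SCALING FEATURES ===")

        scalers = {}
        for col in feature_columns:
            if col in self.data.columns:
                scaler = StandardScaler()
                # fit_transform returns an (n, 1) array; flatten it to a column
                self.data[f"{col}_scaled"] = scaler.fit_transform(
                    self.data[[col]]
                ).ravel()
                scalers[col] = scaler

        # Report the columns actually scaled, not the number requested
        print(f"Scaled {len(scalers)} features")
        return self.data, scalers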


if __name__ == "__main__":
    # Assuming you have merged_data from previous steps
    from src.data_loader import WalmartDataLoader
    from src.data_processing import WalmartComprehensiveEDA

    data_loader = WalmartDataLoader()
    data_loader.load_data()
    train_data = data_loader.train_data
    test_data = data_loader.test_data
    features_data = data_loader.features_data
    stores_data = data_loader.stores_data

    eda = WalmartComprehensiveEDA(train_data, test_data, features_data, stores_data)
    merged_data = eda.merge_datasets()

    feature_eng = WalmartFeatureEngineering(merged_data)
    processed_data = feature_eng.create_walmart_features()
    processed_data = feature_eng.handle_missing_values()
    print("Feature Engineering class ready!")

All datasets loaded successfully!
MERGING DATASETS
Merged training data shape: (421570, 16)
Date range: 2010-02-05 00:00:00 to 2012-10-26 00:00:00
Number of stores: 45
Number of departments: 81
=== WALMART-SPECIFIC FEATURE ENGINEERING ===
Feature engineering completed. New shape: (421570, 67)
Added 62 new features
=== HANDLING MISSING VALUES ===
Missing values handled. Remaining NaN: 0
Feature Engineering class ready!


In [53]:
processed_data

| | Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | ... | MarkDown4_Active | MarkDown5_Active | MarkDown_Intensity | Store_Type_Avg | Dept_Avg | Store_Dept_Avg | Unemployment_Temperature | CPI_Fuel_Interaction | Economic_Stress | Sales_Trend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.50 | False | 42.31 | 2.572 | 0.00 | 0.00 | 0.00 | ... | 0 | 0 | 0.000000 | 20099.568043 | 19213.485088 | 22513.322937 | 342.96486 | 542.939833 | -1.642631 | -19.053179 |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | True | 38.51 | 2.548 | 0.00 | 0.00 | 0.00 | ... | 0 | 0 | 0.000000 | 20099.568043 | 19213.485088 | 22513.322937 | 312.16206 | 538.245049 | -1.694974 | -19.053179 |
| 2 | 1 | 1 | 2010-02-19 | 41595.55 | False | 39.93 | 2.514 | 0.00 | 0.00 | 0.00 | ... | 0 | 0 | 0.000000 | 20099.568043 | 19213.485088 | 22513.322937 | 323.67258 | 531.180905 | -1.769127 | -19.053179 |
| 3 | 1 | 1 | 2010-02-26 | 19403.54 | False | 46.63 | 2.561 | 0.00 | 0.00 | 0.00 | ... | 0 | 0 | 0.000000 | 20099.568043 | 19213.485088 | 22513.322937 | 377.98278 | 541.189605 | -1.666622 | -19.053179 |
| 4 | 1 | 1 | 2010-03-05 | 21827.90 | False | 46.50 | 2.625 | 0.00 | 0.00 | 0.00 | ... | 0 | 0 | 0.000000 | 20099.568043 | 19213.485088 | 22513.322937 | 376.92900 | 554.794125 | -1.527041 | -19.053179 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 421565 | 45 | 98 | 2012-09-28 | 508.37 | False | 64.88 | 3.997 | 4556.61 | 20.64 | 1.50 | ... | 1 | 1 | 0.080087 | 12237.075977 | 6824.694889 | 561.239037 | 563.41792 | 767.478190 | 1.775434 | 5.419078 |
| 421566 | 45 | 98 | 2012-10-05 | 628.10 | False | 64.89 | 3.985 | 5046.74 | 0.00 | 18.82 | ... | 1 | 1 | 0.081702 | 12237.075977 | 6824.694889 | 561.239037 | 562.40163 | 765.799090 | 1.740139 | 5.419078 |
| 421567 | 45 | 98 | 2012-10-12 | 1061.02 | False | 54.47 | 4.000 | 1956.28 | 0.00 | 7.89 | ... | 1 | 1 | 0.055438 | 12237.075977 | 6824.694889 | 561.239037 | 472.09149 | 769.309062 | 1.772853 | 5.419078 |
| 421568 | 45 | 98 | 2012-10-19 | 760.01 | False | 56.47 | 3.969 | 2004.02 | 0.00 | 3.18 | ... | 1 | 1 | 0.033686 | 12237.075977 | 6824.694889 | 561.239037 | 489.42549 | 763.361160 | 1.705244 | 5.419078 |
