## Computational Complexity Analysis

| Model | Training Complexity | Prediction Complexity | Memory Usage | Data Level |
|-------|-------------------|---------------------|--------------|------------|
| Prophet | O(n log n) | O(1) | Low | Aggregated |
| Random Forest | O(n log n × m × t) | O(t × d), d = tree depth | Medium | Store-Dept |
| LSTM | O(n × s × f × h²) | O(s × f × h²) | High | Store-Dept |
| Transformer | O(n × s² × f) | O(s² × f) | Very High | Store Level |

Where:
- n = number of samples
- m = number of features  
- t = number of trees
- s = sequence length
- f = feature dimension
- h = hidden units
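
To get a feel for relative magnitudes, the training expressions in the table can be evaluated with the hyperparameters documented for each model (20 features and 100 trees for Random Forest; s = 5, f = 15 for the LSTM; s = 5, f = 5 for the Transformer). This is only a rough sketch: the function name `training_cost`, the choice of h = 64 (the widest LSTM layer), and the sample count are assumptions, constants are ignored, and the results are unitless operation counts rather than runtimes. In practice n also differs per model because of aggregation.

```python
import math

# Hyperparameters taken from the model descriptions in this document.
# h = 64 uses only the widest LSTM layer (an assumption to keep the formula simple).
CONFIG = {
    "random_forest": {"m": 20, "t": 100},
    "lstm": {"s": 5, "f": 15, "h": 64},
    "transformer": {"s": 5, "f": 5},
}


def training_cost(model: str, n: int) -> float:
    """Evaluate the table's training-complexity expression for n samples."""
    if model == "prophet":
        return n * math.log2(n)
    if model == "random_forest":
        c = CONFIG["random_forest"]
        return n * math.log2(n) * c["m"] * c["t"]
    if model == "lstm":
        c = CONFIG["lstm"]
        return n * c["s"] * c["f"] * c["h"] ** 2
    if model == "transformer":
        c = CONFIG["transformer"]
        return n * c["s"] ** 2 * c["f"]
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    n = 100_000  # hypothetical sample count, for illustration only
    for name in ("prophet", "random_forest", "lstm", "transformer"):
        print(f"{name:>14}: {training_cost(name, n):.3e}")
```

With sequences this short, the s² attention term contributes only a factor of 25; it becomes the dominant cost only as the sequence length grows.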

### Granularity vs Efficiency Trade-offs

The system demonstrates a clear **granularity-efficiency spectrum**:

**Granularity Hierarchy** (Most to Least Detailed):
1. **LSTM & Random Forest**: Store-Department level (most granular)
2. **Transformer**: Store level (departments aggregated)  
3. **Prophet**: Total aggregated (all stores and departments combined)

**Computational Efficiency** (Fastest to Slowest, by asymptotic training cost):
1. **Prophet**: O(n log n) - fastest, simplest
2. **Random Forest**: O(n log n × m × t) - fast, trees train in parallel
3. **LSTM**: O(n × s × f × h²) - slow, time steps are processed sequentially
4. **Transformer**: O(n × s² × f) - quadratic in sequence length, so the slowest as s grows; at s = 5 the attention term stays small

**Feature Complexity Progression**:
- **Transformer**: 5 features (computational efficiency priority)
- **Prophet**: 5 external regressors plus a custom holiday calendar (economic focus)
- **LSTM**: 15 features (sequential pattern optimization)
- **Random Forest**: 20 features (maximum information utilization)

This design allows the ensemble to capture patterns at multiple scales while balancing computational constraints.

# Walmart Sales Forecasting Models - Technical Documentation

## Overview

The `WalmartForecastingModels` class implements four forecasting models tuned for retail sales prediction: a statistical model (Prophet), a tree-based model (Random Forest), and two deep learning models (LSTM and Transformer). The system targets the challenges of the Walmart Sales Forecasting competition.

## Problem Context

Walmart sales forecasting presents several distinct challenges:
- **Seasonal patterns**: Weekly, monthly, and yearly seasonality
- **Holiday effects**: Significant sales spikes during major holidays; holiday weeks carry 5x the weight of regular weeks in the competition metric
- **Store heterogeneity**: 45 stores with different characteristics and performance patterns
- **Department variations**: 99 departments with unique sales behaviors
- **External factors**: Economic indicators (unemployment, CPI, fuel prices) and weather impacts
- **Promotional effects**: Markdown events that drive sales variations

## Model Architecture

The system implements four complementary forecasting approaches:

### 1. Prophet Model (Statistical Approach)

**What it is:**
Prophet is Facebook's time series forecasting tool designed for business applications with strong seasonal patterns and holiday effects.

**Implementation Details:**
- **Seasonality**: Multiplicative mode with yearly seasonality enabled
- **Holiday handling**: Custom holiday calendar for major retail events (Super Bowl, Labor Day, Thanksgiving, Christmas)
- **External regressors**: Temperature, fuel price, CPI, unemployment, and markdown totals
- **Data aggregation**: Aggregates all store-department combinations by date for macro-level trends

**Why it works for Walmart:**
- Explicitly models holiday effects with a large holiday prior scale (`holidays_prior_scale=25`)
- Handles missing data gracefully
- Incorporates external economic factors as regressors
- Robust to outliers and structural breaks

**Strengths:**
- Interpretable components (trend, seasonality, holidays)
- Fast training and prediction
- Built-in uncertainty quantification
- Minimal hyperparameter tuning required

**Weaknesses:**
- Loses granular store-department level patterns through aggregation
- Linear relationship assumptions for regressors
- Less flexible than neural networks for complex interactions

### 2. LSTM Model (Deep Learning - RNN)

**What it is:**
A deep Long Short-Term Memory network with holiday-aware architecture designed to capture sequential dependencies in sales data.

**Architecture:**
```
Input (sequence_length=5, features=15) 
→ LSTM(64, return_sequences=True) + Dropout(0.3)
→ LSTM(32, return_sequences=True) + Dropout(0.2) 
→ LSTM(16) + Dropout(0.2)
→ Dense(32, ReLU) → Dense(16, ReLU) → Dense(1)
```
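
As a sanity check on the size of this stack, the trainable parameter count can be worked out by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and exactly 15 input features; the helper names are illustrative and not part of the class.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Parameters in one LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    """Parameters in one fully connected layer: weights + bias."""
    return input_dim * units + units


def stack_params(n_features: int = 15) -> int:
    """Total trainable parameters of the LSTM stack described above."""
    layers = [
        lstm_params(n_features, 64),  # LSTM(64)
        lstm_params(64, 32),          # LSTM(32)
        lstm_params(32, 16),          # LSTM(16)
        dense_params(16, 32),         # Dense(32)
        dense_params(32, 16),         # Dense(16)
        dense_params(16, 1),          # Dense(1)
    ]
    return sum(layers)  # Dropout layers add no parameters


if __name__ == "__main__":
    print(stack_params())  # 37121
```

The count does not depend on the sequence length, because the same LSTM weights are reused at every time step.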

**Key Features:**
- **Sequence modeling**: Uses 5-week lookback windows
- **Feature selection**: Top 15 features based on correlation analysis, plus any critical features not already selected
- **Holiday weighting**: Sample weights applied during training (5x for holiday weeks)
- **Store-department level**: Maintains granular predictions
- **Scaling**: StandardScaler normalization for stable training

**Why it works for Walmart:**
- Captures temporal dependencies and sequential patterns
- Memory cells retain long-term seasonal information
- Multiple LSTM layers learn hierarchical patterns
- Holiday weighting emphasizes critical sales periods

**Strengths:**
- Excellent for sequential pattern recognition
- Handles variable-length sequences
- Captures long-term dependencies
- Non-linear feature interactions

**Weaknesses:**
- High computational complexity: O(n × s × f × h²), where n = samples, s = sequence length, f = features, h = hidden units
- Requires large amounts of training data
- Black box model with limited interpretability
- Sensitive to hyperparameter choices

### 3. Transformer Model (Attention-Based)

**What it is:**
A Transformer architecture adapted for time series forecasting using multi-head self-attention mechanisms.

**Architecture:**
```
Input (sequence_length=5, features=5)
→ MultiHeadAttention(heads=4, key_dim=32) + LayerNorm
→ Dense(128, ReLU) + Dropout(0.2) → Dense(features) + LayerNorm  
→ GlobalAveragePooling1D → Dense(64, ReLU) + Dropout(0.3) → Dense(1)
```

**Key Features:**
- **Self-attention**: Learns relationships between different time steps
- **Store-level aggregation**: Reduces computational complexity while preserving temporal patterns
- **Multi-head attention**: 4 attention heads capture different pattern types
- **Feed-forward networks**: Dense layers for non-linear transformations
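
The core of each attention head is scaled dot-product attention over the five time steps. The dependency-free sketch below shows a single head without learned projections (queries, keys, and values are all the raw input), which is a simplification of what `MultiHeadAttention` does. The function names and input values are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with Q = K = V = x.

    x is a list of time steps, each a list of features.
    Returns (output, weights), where weights is an s x s matrix.
    """
    dim = len(x[0])
    scale = math.sqrt(dim)
    weights = []
    for query in x:
        # Similarity of this time step to every time step, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in x]
        weights.append(softmax(scores))
    # Each output step is a weighted average of all input steps
    output = [
        [sum(w * step[j] for w, step in zip(row, x)) for j in range(dim)]
        for row in weights
    ]
    return output, weights


if __name__ == "__main__":
    # Five weeks, two toy features per week (values are made up)
    weeks = [[0.1, 1.0], [0.2, 0.9], [0.0, 1.1], [0.3, 0.8], [2.0, 0.1]]
    out, attn = self_attention(weeks)
    print(len(attn), len(attn[0]))  # 5 5
```

The weight matrix has one row and one column per time step, which is where the quadratic dependence on sequence length comes from.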

**Why it works for Walmart:**
- Attention mechanism identifies relevant historical periods
- Parallel processing of temporal sequences
- Captures both local and global temporal dependencies
- Less prone to vanishing gradients than RNNs

**Strengths:**
- Superior long-range dependency modeling
- Parallel computation (faster than RNNs)
- Attention weights provide some interpretability
- State-of-the-art performance on many sequence tasks

**Weaknesses:**
- High memory requirements: attention memory grows as O(s²) in the sequence length s
- Complex architecture requires careful tuning
- May overfit on small datasets
- Computational intensity limits scalability

### 4. Random Forest Model (Tree-Based Baseline)

**What it is:**
An ensemble of decision trees with Walmart-specific feature engineering and holiday sample weighting.

**Configuration:**
- **Estimators**: 100 trees
- **Max depth**: 15 levels
- **Sample weighting**: Holiday weeks weighted 5x during training
- **Features**: Top 20 most important features
- **Regularization**: min_samples_split=10, min_samples_leaf=5

**Why it works for Walmart:**
- Handles mixed data types (categorical and numerical)
- Built-in feature importance ranking
- Robust to outliers; missing values are filled with 0 before training
- Non-linear relationships without explicit feature engineering

**Strengths:**
- Fast training and prediction
- Feature importance interpretation
- Tolerates simple missing-value imputation (the pipeline uses `fillna(0)`)
- No scaling requirements
- Robust to outliers

**Weaknesses:**
- Limited extrapolation capability
- Can overfit with deep trees
- Biased toward features with more levels
- Struggles with linear relationships

## Data Handling Strategies

### Time Series Splitting
- **Validation approach**: Time-based split using last 8 weeks
- **Rationale**: Prevents data leakage and mimics real-world deployment
- **Store-department integrity**: Maintains temporal order within each store-department combination
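
The split can be sketched as follows. To stay self-contained, the example uses plain Python rows rather than a DataFrame; the function name `time_based_split` and the toy data are assumptions made for illustration.

```python
from datetime import date, timedelta


def time_based_split(rows, validation_weeks=8):
    """Hold out the last `validation_weeks` distinct dates as the validation set."""
    unique_dates = sorted({row["Date"] for row in rows})
    if len(unique_dates) <= validation_weeks:
        raise ValueError("not enough distinct dates for the requested split")
    split_date = unique_dates[-validation_weeks]
    # Every store-department series is cut at the same calendar date,
    # so no validation week ever precedes a training week.
    train = [row for row in rows if row["Date"] < split_date]
    val = [row for row in rows if row["Date"] >= split_date]
    return train, val


if __name__ == "__main__":
    start = date(2012, 1, 6)
    # Toy data: 2 stores x 2 departments x 20 weeks (values are made up)
    rows = [
        {"Store": s, "Dept": d, "Date": start + timedelta(weeks=w), "Weekly_Sales": 100.0}
        for s in (1, 2)
        for d in (1, 2)
        for w in range(20)
    ]
    train, val = time_based_split(rows)
    print(len(train), len(val))  # 48 32
```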

### Holiday Weighting System
```python
Holiday_Weight = 5.0 if IsHoliday else 1.0
```
- Applied during model training to emphasize holiday performance
- Counteracts the scarcity of holiday weeks relative to regular weeks
- Critical for competition metric optimization
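
A minimal sketch of how these weights enter evaluation, assuming the holiday-weighted mean absolute error used as the competition metric (weight 5 for holiday weeks, 1 otherwise). The function names and the sample numbers are made up for illustration.

```python
def holiday_weights(is_holiday, holiday_weight=5.0):
    """Map holiday flags to per-row weights (5.0 for holiday weeks, 1.0 otherwise)."""
    return [holiday_weight if flag else 1.0 for flag in is_holiday]


def wmae(actual, predicted, weights):
    """Weighted mean absolute error: sum(w * |y - yhat|) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_error = sum(w * abs(a - p) for a, p, w in zip(actual, predicted, weights))
    return weighted_error / total_weight


if __name__ == "__main__":
    is_holiday = [False, False, True, False]
    actual = [100.0, 110.0, 200.0, 105.0]
    predicted = [90.0, 110.0, 160.0, 100.0]
    weights = holiday_weights(is_holiday)
    print(wmae(actual, predicted, weights))  # 26.875
```

In this toy case the single holiday week accounts for 200 of the 215 weighted error units, which is why a model that misses holiday spikes scores poorly even if regular weeks are accurate.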

### Model-Specific Data Structures

Each model requires fundamentally different data formatting due to their unique architectures:

#### Prophet Model Data Structure
**Format**: Aggregated time series
```python
# Aggregates all store-department combinations by date
prophet_train = train_data.groupby("Date").agg({
    "Weekly_Sales": "sum",        # Total sales across all stores/depts
    "Temperature": "mean",        # Average temperature
    "Fuel_Price": "mean",        # Average fuel price
    "CPI": "mean",               # Average CPI
    "Unemployment": "mean",      # Average unemployment
    "IsHoliday": "max",          # Holiday indicator
    "Total_MarkDown": "sum"      # Total markdowns
}).reset_index()                 # Turn the Date index back into a column

# Prophet requires "ds" (date) and "y" (target); the rest are regressor columns
prophet_train.columns = ["ds", "y", "temperature", "fuel_price", "cpi", "unemployment", "holiday", "markdown"]
```

**Characteristics**:
- **Single time series**: One aggregated series instead of multiple combinations
- **Shape**: `(n_weeks, 8_columns)` - one row per week
- **External regressors**: Economic and weather variables
- **Custom holidays**: Manually defined retail holidays

#### Random Forest Data Structure
**Format**: Flat tabular structure
```python
# Standard ML format with features as columns
feature_cols = [
    'Temperature', 'Fuel_Price', 'CPI', 'Unemployment',
    'IsHoliday', 'Total_MarkDown', 'Month_sin', 'Month_cos',
    'Sales_lag_1', 'Sales_rolling_mean_4', 'Store_Size_encoded'
    # ... up to 20 features total
]

X_train = train_data[feature_cols].fillna(0)      # Shape: (n_samples, 20)
y_train = train_data["Weekly_Sales"]              # Shape: (n_samples,)
sample_weights = train_data["Holiday_Weight"]     # Shape: (n_samples,)
```

**Characteristics**:
- **Granular level**: Store-department-week observations
- **Shape**: `(n_store_dept_weeks, 20_features)`
- **Sample weighting**: Holiday weeks weighted 5x during training
- **No aggregation**: Preserves all granular patterns

#### LSTM Data Structure
**Format**: 3D sequential tensors
```python
# Creates time sequences for each store-department combination
def create_sequences_walmart(data, seq_length, features, target='Weekly_Sales'):
    X, y, weights = [], [], []

    for (store, dept), group in data.groupby(['Store', 'Dept']):
        group_sorted = group.sort_values('Date')

        for i in range(seq_length, len(group_sorted)):
            # seq_length weeks of feature data
            X.append(group_sorted[features].iloc[i - seq_length:i].values)  # Shape: (5, 15)
            # Next week's sales
            y.append(group_sorted[target].iloc[i])
            # Holiday weight of the target week
            weights.append(group_sorted['Holiday_Weight'].iloc[i])

    return np.array(X), np.array(y), np.array(weights)

# Final scaled shapes
X_train_scaled.shape  # (n_sequences, 5, 15) - 3D tensor
y_train_scaled.shape  # (n_sequences,) - 1D array
```

**Characteristics**:
- **3D structure**: `(samples, timesteps, features)` = `(n_sequences, 5, 15)`
- **Temporal sequences**: 5 consecutive weeks per sample
- **Store-department level**: Separate sequences for each combination
- **Feature selection**: Top 15 most important features

#### Transformer Data Structure
**Format**: Store-level aggregated sequences
```python
# Store-level aggregation (departments combined)
agg_train = train_data.groupby(['Store', 'Date']).agg({
    'Weekly_Sales': 'sum',        # Total sales per store
    'Temperature': 'mean',        
    'Fuel_Price': 'mean',         
    'Unemployment': 'mean',       
    'IsHoliday': 'max',          
    'Total_MarkDown': 'sum'      
}).reset_index()                  # Keep Store and Date as columns

# Simplified feature set (5 features only)
features = ['Temperature', 'Fuel_Price', 'Unemployment', 'IsHoliday', 'Total_MarkDown']

# Final shapes
X_train_scaled.shape  # (n_sequences, 5, 5) - 3D tensor
```

**Characteristics**:
- **Store-level aggregation**: Combines departments within stores
- **Reduced features**: Only 5 core features vs 15 for LSTM
- **3D structure**: `(samples, timesteps, features)` = `(n_sequences, 5, 5)`
- **Computational efficiency**: Simplified for faster attention computation

### Data Structure Comparison Table

| Model | Data Level | Structure | Shape | Features | Rationale |
|-------|------------|-----------|-------|----------|-----------|
| **Prophet** | All Stores Aggregated | 2D Time Series | `(weeks, 8)` | 5 regressors | Macro-level trends, interpretability |
| **Random Forest** | Store-Dept Level | 2D Tabular | `(observations, 20)` | 20 engineered | Maximum granularity, feature importance |
| **LSTM** | Store-Dept Level | 3D Sequences | `(sequences, 5, 15)` | 15 selected | Sequential patterns, dept-level detail |
| **Transformer** | Store Level | 3D Sequences | `(sequences, 5, 5)` | 5 core | Attention efficiency, store-level focus |

### Feature Complexity Trade-offs

**Why Transformer Uses Fewer Features (5 vs 15 for LSTM)**:

1. **Computational Complexity**: Attention cost is quadratic in sequence length and linear in the per-step feature dimension, so a smaller feature set keeps each attention layer cheap
2. **Store-level Aggregation**: Department-specific features become less meaningful when aggregated
3. **Attention Benefits**: Self-attention learns feature interactions automatically, reducing need for engineered features  
4. **Memory Constraints**: Fewer features = smaller projection weights and activations = more efficient training
5. **Stable Features**: Focuses on economic indicators and business factors that work well across stores

### Sequence Creation (Neural Networks)
- **LSTM**: 5-week sequences per store-department combination
- **Transformer**: Store-level sequences with department aggregation
- **Padding strategy**: Skip insufficient sequences rather than padding
- **Scaling**: Feature-wise standardization for neural network stability

### Data Flow Pipeline
1. **Raw Data**: Store-Dept-Week level observations
2. **Feature Engineering**: Add lags, rolling statistics, cyclical encodings
3. **Model-Specific Preparation**:
   - **Prophet**: Aggregate to weekly totals → 2D time series
   - **Random Forest**: Keep original structure → 2D matrix  
   - **LSTM**: Create store-dept sequences → 3D tensor (15 features)
   - **Transformer**: Aggregate to store level → create sequences → 3D tensor (5 features)

## Model Selection and Ensemble Strategy

### Individual Model Use Cases

**Prophet**: Best for
- Quick baseline establishment
- Business stakeholder communication (interpretable components)
- Long-term trend analysis
- Scenarios with limited computational resources

**Random Forest**: Best for
- Feature importance analysis
- Robust baseline performance
- Mixed data types handling
- When interpretability is required

**LSTM**: Best for
- Complex temporal pattern recognition
- When sequential dependencies are critical
- Non-linear feature interactions
- Sufficient training data availability

**Transformer**: Best for
- State-of-the-art performance requirements
- Long-range dependency modeling
- When computational resources are abundant
- Complex attention pattern analysis

### Ensemble Considerations
The models complement each other:
- **Prophet** captures macro trends and seasonality
- **Random Forest** provides robust baseline and feature insights
- **LSTM** models sequential dependencies at granular level
- **Transformer** captures complex attention patterns
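
The text does not specify how the predictions are combined, so the following is only one possible sketch: an inverse-error weighted average. It assumes the predictions have already been brought to a common granularity (for example, store-department forecasts summed to store level), since the four models predict at different levels. All names and numbers are hypothetical.

```python
def inverse_error_weights(val_errors):
    """Turn per-model validation errors into weights that sum to 1.

    Lower validation error gives a higher weight.
    """
    inverse = {name: 1.0 / err for name, err in val_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def blend(predictions, weights):
    """Weighted average of per-model predictions with identical length and order."""
    names = list(predictions)
    length = len(predictions[names[0]])
    if any(len(predictions[name]) != length for name in names):
        raise ValueError("all models must predict the same rows")
    return [
        sum(weights[name] * predictions[name][i] for name in names)
        for i in range(length)
    ]


if __name__ == "__main__":
    # Made-up validation errors and forecasts for two models
    weights = inverse_error_weights({"Random Forest": 2.0, "LSTM": 4.0})
    combined = blend(
        {"Random Forest": [100.0, 200.0], "LSTM": [130.0, 170.0]},
        weights,
    )
    print([round(value, 2) for value in combined])
```

A simple unweighted mean is a reasonable alternative when validation errors are close to each other.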

## Performance Optimization Strategies

### Neural Network Optimizations
- **Early stopping**: Prevents overfitting (patience=8-10)
- **Learning rate scheduling**: ReduceLROnPlateau for convergence
- **Dropout layers**: Regularization to prevent overfitting
- **Batch size optimization**: 32-64 for memory efficiency
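
The effect of the patience setting can be illustrated with a small stand-alone function. This mirrors the idea behind Keras `EarlyStopping` with `restore_best_weights=True`, not its exact implementation; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the
    best loss seen so far; best_epoch is the epoch whose weights would be kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch


if __name__ == "__main__":
    losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
    print(early_stopping_epoch(losses, patience=3))  # (5, 2)
```

Here the best loss occurs at epoch 2, and training stops at epoch 5 after three epochs without improvement.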

### Feature Selection
- **Correlation-based**: Top features by correlation with target
- **Critical feature enforcement**: Ensures important features included
- **Dimensionality management**: Limits features to prevent curse of dimensionality
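
A minimal sketch of this selection logic, assuming features are ranked by absolute Pearson correlation with the target and that critical features are appended when they miss the cut. The helper that actually computes feature importance is not shown in this document, so the function names and data below are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, top_k, critical=()):
    """Pick the top_k columns by |correlation|, then append missing critical ones."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    selected = ranked[:top_k]
    for name in critical:
        if name in columns and name not in selected:
            selected.append(name)
    return selected


if __name__ == "__main__":
    # Toy columns (values are made up)
    columns = {
        "Sales_lag_1": [1, 2, 3, 4, 6],
        "Temperature": [5, 3, 4, 1, 2],
        "IsHoliday": [0, 1, 0, 0, 0],
    }
    sales = [1, 2, 3, 4, 5]
    print(select_features(columns, sales, top_k=2, critical=("IsHoliday",)))
```

Because critical features are appended after ranking, the final feature count can exceed `top_k`.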

### Memory Management
- **Data aggregation**: Transformer uses store-level rather than store-department level
- **Sequence batching**: Processes sequences in manageable batches
- **Model checkpointing**: Saves best weights during training

## Limitations and Considerations

### Data Requirements
- **Minimum sequence length**: Neural networks require sufficient historical data
- **Holiday representation**: Limited holiday examples may affect generalization
- **Store-department combinations**: Some combinations have insufficient data

### Scalability Constraints
- **Neural networks**: Memory and computation scale poorly with large datasets
- **Real-time deployment**: LSTM and Transformer have higher latency
- **Storage requirements**: Multiple models increase storage needs

### Generalization Challenges
- **Walmart-specific**: Heavy customization may not transfer to other retailers
- **Holiday calendar**: Hard-coded holidays may not apply to other regions
- **Economic indicators**: External factors may not be available in all contexts

## Best Practices and Recommendations

### Model Development
1. **Start with Prophet** for quick insights and baseline
2. **Use Random Forest** for feature importance analysis
3. **Deploy LSTM** when sequential patterns are critical
4. **Consider Transformer** only with sufficient computational resources

### Production Deployment
1. **Ensemble voting**: Combine predictions from multiple models
2. **A/B testing**: Compare model performance in production
3. **Monitoring**: Track prediction accuracy and model drift
4. **Retraining schedule**: Regular model updates with new data

### Performance Tuning
1. **Cross-validation**: Use time series cross-validation for hyperparameter tuning
2. **Feature engineering**: Continuously evaluate and add relevant features
3. **Architecture search**: Experiment with different neural network architectures
4. **Regularization**: Balance model complexity with generalization
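
Time series cross-validation can be sketched as expanding-window folds, where each fold validates on a later block of weeks and trains on everything before it. The fold count, the reuse of the 8-week validation window, and the function name are assumptions for illustration.

```python
def expanding_window_folds(dates, n_folds=3, validation_weeks=8):
    """Build (train_dates, val_dates) pairs in chronological order.

    Each fold validates on `validation_weeks` consecutive dates and trains on
    all earlier dates, so the training window grows from fold to fold.
    """
    unique_dates = sorted(set(dates))
    needed = n_folds * validation_weeks + 1  # at least one training date
    if len(unique_dates) < needed:
        raise ValueError("not enough distinct dates for the requested folds")
    folds = []
    for k in range(n_folds, 0, -1):
        val_end = len(unique_dates) - (k - 1) * validation_weeks
        val_start = val_end - validation_weeks
        folds.append((unique_dates[:val_start], unique_dates[val_start:val_end]))
    return folds


if __name__ == "__main__":
    weeks = list(range(30))  # week indices stand in for real dates
    for train_dates, val_dates in expanding_window_folds(weeks):
        print(len(train_dates), val_dates[0], val_dates[-1])
```

Hyperparameters chosen by averaging the validation score over these folds are less tied to a single 8-week window than those tuned on one split.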

## Conclusion

This Walmart forecasting system combines statistical, tree-based, and deep learning methods for retail sales prediction. The holiday weighting system and Walmart-specific feature engineering tailor it to retail forecasting challenges. The neural network approaches may reach higher accuracy when enough data and compute are available, while the statistical and tree-based methods provide valuable baselines and interpretability. The choice of model should be based on specific requirements for accuracy, interpretability, computational resources, and deployment constraints.

In [None]:
import pandas as pd
import numpy as np
import time
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (
    LSTM,
    Dense,
    Dropout,
    GRU,
    Input,
    MultiHeadAttention,
    LayerNormalization,
    GlobalAveragePooling1D,
)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from prophet import Prophet


class WalmartForecastingModels:
    """Forecasting models optimized for Walmart competition"""

    def __init__(self, data):
        self.data = data
        self.models = {}
        self.results = {}
        self.feature_columns = []
        self.train_data = None
        self.val_data = None

    def prepare_walmart_data(self, validation_weeks=8):
        """Prepare data specifically for Walmart forecasting with holiday weights"""
        print("=== PREPARING WALMART DATA FOR MODELING ===")

        # Remove rows without sales data (early periods with insufficient lags)
        self.data_clean = self.data.dropna(subset=["Weekly_Sales"])

        # Sort by date for proper time series split
        self.data_clean = self.data_clean.sort_values(["Store", "Dept", "Date"])

        # Select feature columns
        exclude_cols = ["Weekly_Sales", "Store", "Dept", "Date", "Type"]
        self.feature_columns = [
            col
            for col in self.data_clean.columns
            if col not in exclude_cols and not col.endswith("_scaled")
        ]

        # Time-based split (last N weeks for validation)
        unique_dates = sorted(self.data_clean["Date"].unique())
        split_date = unique_dates[-validation_weeks]

        self.train_data = self.data_clean[self.data_clean["Date"] < split_date].copy()
        self.val_data = self.data_clean[self.data_clean["Date"] >= split_date].copy()

        # Create holiday weights for training data
        self.train_weights = self.train_data["Holiday_Weight"].values
        self.val_weights = self.val_data["Holiday_Weight"].values

        print(f"Training data: {self.train_data.shape}")
        print(f"Validation data: {self.val_data.shape}")
        print(f"Features: {len(self.feature_columns)}")
        print(f"Holiday-weighted rows in training: {(self.train_weights == 5.0).sum()}")

        return self.train_data, self.val_data

    def prophet_walmart_model(self):
        """Prophet model optimized for Walmart data"""
        print("=== TRAINING WALMART-OPTIMIZED PROPHET MODEL ===")
        start_time = time.time()

        try:
            # Aggregate by date for Prophet (total sales across all stores/depts)
            prophet_train = (
                self.train_data.groupby("Date")
                .agg(
                    {
                        "Weekly_Sales": "sum",
                        "Temperature": "mean",
                        "Fuel_Price": "mean",
                        "CPI": "mean",
                        "Unemployment": "mean",
                        "IsHoliday": "max",
                        "Total_MarkDown": "sum",
                    }
                )
                .reset_index()
            )

            prophet_train.columns = [
                "ds",
                "y",
                "temperature",
                "fuel_price",
                "cpi",
                "unemployment",
                "holiday",
                "markdown",
            ]

            # Initialize Prophet with Walmart-specific settings
            model = Prophet(
                yearly_seasonality=True,
                weekly_seasonality=False,  # Weekly data, not daily
                daily_seasonality=False,
                seasonality_mode="multiplicative",
                changepoint_prior_scale=0.1,  # More flexible for retail
                seasonality_prior_scale=15,  # Strong seasonality in retail
                holidays_prior_scale=25,  # Strong holiday effects
                interval_width=0.95,
            )

            # Add regressors
            regressors = [
                "temperature",
                "fuel_price",
                "cpi",
                "unemployment",
                "markdown",
            ]
            for regressor in regressors:
                model.add_regressor(regressor)

            # Add custom holidays (major retail holidays)
            holidays = pd.DataFrame(
                {
                    "holiday": ["Super Bowl", "Labor Day", "Thanksgiving", "Christmas"]
                    * 3,
                    "ds": [
                        "2010-02-12",
                        "2010-09-10",
                        "2010-11-26",
                        "2010-12-31",
                        "2011-02-11",
                        "2011-09-09",
                        "2011-11-25",
                        "2011-12-30",
                        "2012-02-10",
                        "2012-09-07",
                        "2012-11-23",
                        "2012-12-28",
                    ],
                }
            )
            holidays["ds"] = pd.to_datetime(holidays["ds"])
            model.holidays = holidays

            # Fit model
            model.fit(prophet_train)

            # Prepare validation data
            prophet_val = (
                self.val_data.groupby("Date")
                .agg(
                    {
                        "Temperature": "mean",
                        "Fuel_Price": "mean",
                        "CPI": "mean",
                        "Unemployment": "mean",
                        "IsHoliday": "max",
                        "Total_MarkDown": "sum",
                        "Weekly_Sales": "sum",
                    }
                )
                .reset_index()
            )

            prophet_val.columns = [
                "ds",
                "temperature",
                "fuel_price",
                "cpi",
                "unemployment",
                "holiday",
                "markdown",
                "actual",
            ]

            # Make predictions
            forecast = model.predict(prophet_val[["ds"] + regressors])

            # Extract predictions
            predictions = forecast["yhat"].values
            actual = prophet_val["actual"].values

            training_time = time.time() - start_time

            self.models["Prophet"] = model
            self.results["Prophet"] = {
                "predictions": predictions,
                "actual": actual,
                "training_time": training_time,
                "model_type": "Statistical",
                "forecast": forecast,
                "weights": np.ones(
                    len(predictions)
                ),  # Simplified weights for aggregated data
            }

            print(f"Prophet model trained in {training_time:.2f} seconds")
            return model, predictions

        except Exception as e:
            print(f"Prophet training failed: {e}")
            return None, None

    def lstm_walmart_model(self, sequence_length=5, epochs=50):
        """LSTM model with holiday weighting for Walmart data"""
        print("=== TRAINING WALMART LSTM MODEL ===")
        start_time = time.time()

        try:
            # Select top features for LSTM
            feature_importance = self._calculate_feature_importance()
            top_features = feature_importance.head(15).index.tolist()

            # Ensure critical features are included
            critical_features = [
                "IsHoliday",
                "Total_MarkDown",
                "Month_sin",
                "Month_cos",
                "Sales_lag_1",
                "Sales_rolling_mean_4",
            ]
            for feat in critical_features:
                if feat in self.train_data.columns and feat not in top_features:
                    top_features.append(feat)

            # Create sequences for time series
            def create_sequences_walmart(
                data, seq_length, features, target="Weekly_Sales"
            ):
                X, y, weights = [], [], []

                # Process by store-department combinations
                for (store, dept), group in data.groupby(["Store", "Dept"]):
                    if len(group) < seq_length + 1:
                        continue

                    group_sorted = group.sort_values("Date")

                    for i in range(seq_length, len(group_sorted)):
                        # Features sequence
                        X.append(group_sorted[features].iloc[i - seq_length : i].values)
                        # Target
                        y.append(group_sorted[target].iloc[i])
                        # Holiday weight
                        weights.append(group_sorted["Holiday_Weight"].iloc[i])

                return np.array(X), np.array(y), np.array(weights)

            # Create training sequences
            X_train, y_train, train_weights = create_sequences_walmart(
                self.train_data, sequence_length, top_features
            )

            # Create validation sequences
            X_val, y_val, val_weights = create_sequences_walmart(
                self.val_data, sequence_length, top_features
            )

            if len(X_train) == 0 or len(X_val) == 0:
                print("Insufficient data for LSTM sequences")
                return None, None

            # Scale features
            scaler_X = StandardScaler()
            scaler_y = StandardScaler()

            X_train_scaled = scaler_X.fit_transform(
                X_train.reshape(-1, X_train.shape[-1])
            ).reshape(X_train.shape)

            X_val_scaled = scaler_X.transform(
                X_val.reshape(-1, X_val.shape[-1])
            ).reshape(X_val.shape)

            y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten()

            # Build LSTM with holiday-aware architecture
            model = Sequential(
                [
                    LSTM(
                        64,
                        return_sequences=True,
                        input_shape=(sequence_length, len(top_features)),
                    ),
                    Dropout(0.3),
                    LSTM(32, return_sequences=True),
                    Dropout(0.2),
                    LSTM(16, return_sequences=False),
                    Dropout(0.2),
                    Dense(32, activation="relu"),
                    Dense(16, activation="relu"),
                    Dense(1),
                ]
            )

            model.compile(
                optimizer=Adam(learning_rate=0.001), loss="mse", metrics=["mae"]
            )

            # Train with callbacks
            callbacks = [
                EarlyStopping(
                    monitor="val_loss", patience=10, restore_best_weights=True
                ),
                ReduceLROnPlateau(
                    monitor="val_loss", factor=0.5, patience=5, min_lr=1e-6
                ),
            ]

            history = model.fit(
                X_train_scaled,
                y_train_scaled,
                sample_weight=train_weights,  # 5x weight on holiday weeks
                validation_data=(
                    X_val_scaled,
                    scaler_y.transform(y_val.reshape(-1, 1)).flatten(),
                ),
                epochs=epochs,
                batch_size=64,
                callbacks=callbacks,
                verbose=0,
            )

            # Make predictions
            y_pred_scaled = model.predict(X_val_scaled, verbose=0)
            y_pred = scaler_y.inverse_transform(y_pred_scaled).flatten()

            training_time = time.time() - start_time

            self.models["LSTM"] = model
            self.results["LSTM"] = {
                "predictions": y_pred,
                "actual": y_val,
                "weights": val_weights,
                "training_time": training_time,
                "model_type": "Neural Network",
                "history": history,
                "scalers": {"X": scaler_X, "y": scaler_y},
                "features": top_features,
            }

            print(f"LSTM model trained in {training_time:.2f} seconds")
            print(f"Using {len(top_features)} features")
            return model, y_pred

        except Exception as e:
            print(f"LSTM training failed: {e}")
            return None, None

    def transformer_walmart_model(self, sequence_length=5, epochs=40):
        """Transformer model for Walmart time series"""
        print("=== TRAINING WALMART TRANSFORMER MODEL ===")
        start_time = time.time()

        try:
            # Use aggregated data for transformer (computational efficiency):
            # departments are rolled up to one row per Store and Date
            agg_spec = {
                "Weekly_Sales": "sum",
                "Temperature": "mean",
                "Fuel_Price": "mean",
                "Unemployment": "mean",
                "IsHoliday": "max",
                "Total_MarkDown": "sum",
                "Holiday_Weight": "max",
            }

            agg_train = (
                self.train_data.groupby(["Store", "Date"]).agg(agg_spec).reset_index()
            )

            agg_val = (
                self.val_data.groupby(["Store", "Date"]).agg(agg_spec).reset_index()
            )

            # Prepare features
            features = [
                "Temperature",
                "Fuel_Price",
                "Unemployment",
                "IsHoliday",
                "Total_MarkDown",
            ]

            # Create sequences
            def create_transformer_sequences(
                data, seq_len, features, target="Weekly_Sales"
            ):
                X, y, weights = [], [], []

                for store in data["Store"].unique():
                    store_data = data[data["Store"] == store].sort_values("Date")

                    if len(store_data) < seq_len + 1:
                        continue

                    for i in range(seq_len, len(store_data)):
                        X.append(store_data[features].iloc[i - seq_len : i].values)
                        y.append(store_data[target].iloc[i])
                        weights.append(store_data["Holiday_Weight"].iloc[i])

                return np.array(X), np.array(y), np.array(weights)

            X_train, y_train, train_weights = create_transformer_sequences(
                agg_train, sequence_length, features
            )
            X_val, y_val, val_weights = create_transformer_sequences(
                agg_val, sequence_length, features
            )

            if len(X_train) == 0 or len(X_val) == 0:
                print("Insufficient data for Transformer")
                return None, None

            # Scale data
            scaler_X = StandardScaler()
            scaler_y = StandardScaler()

            X_train_scaled = scaler_X.fit_transform(
                X_train.reshape(-1, X_train.shape[-1])
            ).reshape(X_train.shape)

            X_val_scaled = scaler_X.transform(
                X_val.reshape(-1, X_val.shape[-1])
            ).reshape(X_val.shape)

            y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten()

            # Build Transformer model
            inputs = Input(shape=(sequence_length, len(features)))

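            # Multi-head self-attention over the weeks in the window.
            # No positional encoding is added, so combined with the average
            # pooling below the model does not see the order of the weeks.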
            attention_output = MultiHeadAttention(num_heads=4, key_dim=32, dropout=0.1)(
                inputs, inputs
            )

            # Add & Norm
            attention_output = LayerNormalization()(inputs + attention_output)

            # Feed Forward Network
            ffn_output = Dense(128, activation="relu")(attention_output)
            ffn_output = Dropout(0.2)(ffn_output)
            ffn_output = Dense(len(features))(ffn_output)

            # Add & Norm
            ffn_output = LayerNormalization()(attention_output + ffn_output)

            # Global pooling and output
            pooled = GlobalAveragePooling1D()(ffn_output)
            pooled = Dense(64, activation="relu")(pooled)
            pooled = Dropout(0.3)(pooled)
            outputs = Dense(1)(pooled)

            model = Model(inputs=inputs, outputs=outputs)
            model.compile(
                optimizer=Adam(learning_rate=0.001), loss="mse", metrics=["mae"]
            )

            # Train model
            callbacks = [
                EarlyStopping(
                    monitor="val_loss", patience=8, restore_best_weights=True
                ),
                ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=4),
            ]

            history = model.fit(
                X_train_scaled,
                y_train_scaled,
                validation_data=(
                    X_val_scaled,
                    scaler_y.transform(y_val.reshape(-1, 1)).flatten(),
                ),
                epochs=epochs,
                batch_size=32,
                callbacks=callbacks,
                verbose=0,
            )

            # Make predictions
            y_pred_scaled = model.predict(X_val_scaled, verbose=0)
            y_pred = scaler_y.inverse_transform(y_pred_scaled).flatten()

            training_time = time.time() - start_time

            self.models["Transformer"] = model
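            self.results["Transformer"] = {
                "predictions": y_pred,
                "actual": y_val,
                "weights": val_weights,
                "training_time": training_time,
                "model_type": "Neural Network",
                "history": history,
                # Keep scalers and feature names so new data can be transformed
                # the same way at inference time (mirrors the LSTM results)
                "scalers": {"X": scaler_X, "y": scaler_y},
                "features": features,
            }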

            print(f"Transformer model trained in {training_time:.2f} seconds")
            return model, y_pred

        except Exception as e:
            print(f"Transformer training failed: {e}")
            return None, None

    def random_forest_walmart_model(self):
        """Random Forest baseline with Walmart-specific features"""
        print("=== TRAINING WALMART RANDOM FOREST MODEL ===")
        start_time = time.time()

        try:
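            # Prepare features: take the first 20 engineered columns that exist in
            # the training data (list order, not an importance ranking)
            available_cols = [
                col for col in self.feature_columns if col in self.train_data.columns
            ]
            feature_cols = available_cols[:20]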

            X_train = self.train_data[feature_cols].fillna(0)
            y_train = self.train_data["Weekly_Sales"]
            X_val = self.val_data[feature_cols].fillna(0)
            y_val = self.val_data["Weekly_Sales"]

            # Train Random Forest with holiday sample weights
            model = RandomForestRegressor(
                n_estimators=100,
                max_depth=15,
                min_samples_split=10,
                min_samples_leaf=5,
                random_state=42,
                n_jobs=-1,
            )

            # Fit with sample weights (holiday weeks weighted 5x)
            sample_weights = self.train_data["Holiday_Weight"].values
            model.fit(X_train, y_train, sample_weight=sample_weights)

            # Make predictions
            y_pred = model.predict(X_val)

            training_time = time.time() - start_time

            # Feature importance
            feature_importance = pd.DataFrame(
                {"feature": feature_cols, "importance": model.feature_importances_}
            ).sort_values("importance", ascending=False)

            self.models["RandomForest"] = model
            self.results["RandomForest"] = {
                "predictions": y_pred,
                "actual": y_val.values,
                "weights": self.val_data["Holiday_Weight"].values,
                "training_time": training_time,
                "model_type": "Tree-based",
                "feature_importance": feature_importance,
            }

            print(f"Random Forest model trained in {training_time:.2f} seconds")
            print("Top 5 most important features:")
            print(feature_importance.head())

            return model, y_pred

        except Exception as e:
            print(f"Random Forest training failed: {e}")
            return None, None

    def _calculate_feature_importance(self):
        """Rank numeric features by absolute correlation with Weekly_Sales"""
        numeric_features = self.train_data.select_dtypes(include=[np.number]).columns
        numeric_features = [
            col
            for col in numeric_features
            if col not in ["Store", "Dept", "Weekly_Sales"]
        ]

        # Calculate correlation with target
        correlations = {}
        for feature in numeric_features:
            if feature in self.train_data.columns:
                corr = abs(
                    self.train_data[feature].corr(self.train_data["Weekly_Sales"])
                )
                correlations[feature] = corr if not np.isnan(corr) else 0

        return pd.Series(correlations).sort_values(ascending=False)


if __name__ == "__main__":
    from src.data_loader import WalmartDataLoader
    from src.data_processing import WalmartComprehensiveEDA
    from src.feature_engineering import WalmartFeatureEngineering

    data_loader = WalmartDataLoader()
    data_loader.load_data()
    train_data = data_loader.train_data
    test_data = data_loader.test_data
    features_data = data_loader.features_data
    stores_data = data_loader.stores_data

    eda = WalmartComprehensiveEDA(train_data, test_data, features_data, stores_data)
    merged_data = eda.merge_datasets()

    feature_eng = WalmartFeatureEngineering(merged_data)
    processed_data = feature_eng.create_walmart_features()
    processed_data = feature_eng.handle_missing_values()
    print("Feature Engineering class ready!")

    forecasting_models = WalmartForecastingModels(processed_data)
    train_data, val_data = forecasting_models.prepare_walmart_data()

    # Train each model and keep its validation predictions
    prophet_model, prophet_pred = forecasting_models.prophet_walmart_model()
    lstm_model, lstm_pred = forecasting_models.lstm_walmart_model()
    rf_model, rf_pred = forecasting_models.random_forest_walmart_model()
    trans_model, trans_pred = forecasting_models.transformer_walmart_model()

    print("Forecasting Models class ready!")

✅ All datasets loaded successfully!
=== MERGING DATASETS ===
Merged training data shape: (421570, 16)
Date range: 2010-02-05 00:00:00 to 2012-10-26 00:00:00
=== WALMART-SPECIFIC FEATURE ENGINEERING ===
Feature engineering completed. New shape: (421570, 67)
Added 62 new features
=== HANDLING MISSING VALUES ===
Missing values handled. Remaining NaN: 0
Feature Engineering class ready!
=== PREPARING WALMART DATA FOR MODELING ===
Training data: (397841, 67)
Validation data: (23729, 67)
Features: 62
Holiday weeks in training: 26695
=== TRAINING WALMART-OPTIMIZED PROPHET MODEL ===
17:44:11 - cmdstanpy - INFO - Chain [1] start processing
17:44:11 - cmdstanpy - INFO - Chain [1] done processing
Prophet model trained in 0.14 seconds
=== TRAINING WALMART LSTM MODEL ===
LSTM model trained in 198.94 seconds
Using 19 features
=== TRAINING WALMART RANDOM FOREST MODEL ===
Random Forest model trained in 4.09 seconds
Top 5 most important features:
         feature  importance
10          Size    0.658443
9   Unemployment    0.079670
8            CPI    0.074539
16          Week    0.056932
5      MarkDown3    0.031259
=== TRAINING WALMART TRANSFORMER MODEL ===
Transformer model trained in 10.54 seconds
Forecasting Models class ready!
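
Each model stores `predictions`, `actual`, and `weights` in `self.results`, which is the shape needed for a holiday-weighted mean absolute error. The sketch below shows one way those three arrays could be scored. The helper names `weighted_mae` and `summarize_results` and the toy numbers are hypothetical and not part of the class above; it also assumes the stored weights are meant to be used this way.

```python
import numpy as np


def weighted_mae(actual, predicted, weights):
    """Weighted mean absolute error: sum(w * |y - y_hat|) / sum(w)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if not (actual.shape == predicted.shape == weights.shape):
        raise ValueError("actual, predicted and weights must share one shape")
    total_weight = weights.sum()
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return float(np.sum(weights * np.abs(actual - predicted)) / total_weight)


def summarize_results(results):
    """Return {model_name: WMAE} for a results dict shaped like self.results."""
    scores = {}
    for name, entry in results.items():
        scores[name] = weighted_mae(
            entry["actual"], entry["predictions"], entry["weights"]
        )
    return scores


# Toy data (hypothetical): the third week is a holiday week weighted 5x
toy_results = {
    "ModelA": {
        "predictions": np.array([10.0, 20.0, 30.0]),
        "actual": np.array([12.0, 20.0, 24.0]),
        "weights": np.array([1.0, 1.0, 5.0]),
    }
}

scores = summarize_results(toy_results)
print(scores)
```

Because the models are validated at different granularities (store-department rows for Random Forest and LSTM, store-level sums for the Transformer), scores computed this way are only directly comparable between models that share a granularity.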

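The windowing used by `create_transformer_sequences` pairs each target week with the `seq_len` weeks of features that precede it, so a store with `T` weeks yields `T - seq_len` samples. The following is a minimal standalone sketch of that indexing on toy arrays. The function name `make_windows` and the toy values are illustrative assumptions, not code from the class.

```python
import numpy as np


def make_windows(features, target, seq_len):
    """Pair target[i] with the seq_len feature rows immediately before row i."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    X, y = [], []
    for i in range(seq_len, len(features)):
        X.append(features[i - seq_len : i])  # rows i-seq_len .. i-1
        y.append(target[i])  # the week being predicted
    return np.array(X), np.array(y)


# Toy store: 8 weeks, 2 features, window of 5 weeks (all values hypothetical)
weeks, n_features, seq_len = 8, 2, 5
toy_features = np.arange(weeks * n_features, dtype=float).reshape(weeks, n_features)
toy_target = np.arange(weeks, dtype=float) * 100.0

X, y = make_windows(toy_features, toy_target, seq_len)
print(X.shape, y.shape)
```

One consequence of building validation windows from validation rows only is that the first `seq_len` validation weeks of each store receive no prediction, since there is no full window of earlier weeks for them.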