# Walmart Sales Dataset 

## Overview

This dataset contains historical weekly sales data for 45 Walmart stores located in different regions, and is intended for predicting department-wide sales for each store. The competition description gives the training period as February 5, 2010, to November 1, 2012; the last weekly record in `train.csv` is dated October 26, 2012. The dataset also includes external factors that may influence sales. (Data Source: https://www.kaggle.com/competitions/walmart-recruiting-store-sales-forecasting/data)

A key challenge in this dataset is modeling the effect of promotional markdowns during holiday weeks. In the evaluation, weeks containing one of four major holidays (Super Bowl, Labor Day, Thanksgiving, and Christmas) are weighted five times higher than non-holiday weeks.

## Dataset Files

### 1. train.csv
**Purpose**: Historical training data for model development  
**Date Range**: February 5, 2010 to October 26, 2012 (last weekly record; the competition description lists November 1, 2012 as the end of the period)  
**Key Fields**:
- `Store`: Store identifier (1-45)
- `Dept`: Department number within each store
- `Date`: Week ending date
- `Weekly_Sales`: Target variable - sales for the given department and store
- `IsHoliday`: Boolean indicator for special holiday weeks

### 2. test.csv
**Purpose**: Test dataset for predictions (identical structure to train.csv, excluding Weekly_Sales)  
**Usage**: Contains store-department-date combinations requiring sales predictions

### 3. stores.csv
**Purpose**: Store metadata and characteristics  
**Key Fields**:
- `Store`: Store identifier
- `Type`: Store type classification (anonymized)
- `Size`: Store size

### 4. features.csv
**Purpose**: External factors and promotional data  
**Key Fields**:
- `Store`: Store identifier
- `Date`: Week ending date
- `Temperature`: Regional average temperature
- `Fuel_Price`: Regional fuel costs
- `MarkDown1-5`: Promotional markdown data (available after November 2011, with gaps)
- `CPI`: Consumer Price Index
- `Unemployment`: Regional unemployment rate
- `IsHoliday`: Holiday week indicator
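
Because `features.csv` is keyed by `Store` and `Date`, and `stores.csv` by `Store`, the files can be joined onto the sales records. The following is a minimal sketch, not part of the provided loader: the helper name `build_modeling_frame` is hypothetical, and the toy rows are made-up values used only to show the shape of the result.

```python
import pandas as pd


def build_modeling_frame(sales, features, stores):
    """Left-join sales rows with weekly features and store metadata."""
    # IsHoliday appears in both the sales and features files; drop one copy
    # so the merge does not create IsHoliday_x / IsHoliday_y columns.
    features = features.drop(columns=["IsHoliday"])
    merged = sales.merge(
        features, on=["Store", "Date"], how="left", validate="many_to_one"
    )
    # Each store has exactly one metadata row.
    merged = merged.merge(stores, on="Store", how="left", validate="many_to_one")
    return merged


# Toy rows (illustrative values only, not taken from the real files)
sales = pd.DataFrame(
    {
        "Store": [1, 1],
        "Dept": [1, 2],
        "Date": pd.to_datetime(["2010-02-05", "2010-02-05"]),
        "Weekly_Sales": [100.0, 200.0],
        "IsHoliday": [False, False],
    }
)
features = pd.DataFrame(
    {
        "Store": [1],
        "Date": pd.to_datetime(["2010-02-05"]),
        "Temperature": [40.0],
        "IsHoliday": [False],
    }
)
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [150000]})

frame = build_modeling_frame(sales, features, stores)
print(frame.columns.tolist())
```

The `validate="many_to_one"` argument makes pandas raise an error if a key is duplicated on the right-hand side, which guards against silently multiplying sales rows.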

## Data Characteristics

### Holiday Weighting
Special evaluation weighting (5x) applies to weeks containing:
- **Super Bowl**: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- **Labor Day**: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- **Thanksgiving**: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- **Christmas**: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
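
To make the 5x weighting concrete, the sketch below scores predictions with a weighted mean absolute error, assuming that is the metric in use (check the competition's evaluation page before relying on it). The function name `wmae` and the `holiday_weight` parameter are illustrative, and the numbers are made up.

```python
def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error: holiday weeks count holiday_weight times."""
    weights = [holiday_weight if h else 1.0 for h in is_holiday]
    total_error = sum(
        w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred)
    )
    return total_error / sum(weights)


# Three weeks, the last one a holiday week (illustrative values)
score = wmae(
    y_true=[100.0, 200.0, 300.0],
    y_pred=[110.0, 190.0, 280.0],
    is_holiday=[False, False, True],
)
print(round(score, 3))  # (10 + 10 + 5 * 20) / 7 = 17.143
```

In this toy case the single holiday week contributes 100 of the 120 weighted error units, which shows why holiday accuracy matters disproportionately.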

### Data Quality Notes
- **Markdown Data**: Only available after November 2011 and not consistently across all stores
- **Missing Values**: Markdown fields contain NA values where data is unavailable
- **Anonymization**: Store and markdown information is anonymized for privacy
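
One possible way to deal with the NA markdown values is to record where they were missing and then fill them with zero. This is only a sketch of one option: the zero fill treats "no data" as "no markdown", which is an assumption rather than something the dataset documents, and the helper name `flag_and_fill_markdowns` is hypothetical.

```python
import pandas as pd

MARKDOWN_COLS = [f"MarkDown{i}" for i in range(1, 6)]


def flag_and_fill_markdowns(features):
    """Add a <col>_missing flag per markdown column, then fill NA with 0.0."""
    out = features.copy()  # leave the caller's frame untouched
    for col in MARKDOWN_COLS:
        out[f"{col}_missing"] = out[col].isna()
        out[col] = out[col].fillna(0.0)
    return out


# Toy frame: first row has no markdown data, second row does (made-up values)
toy_features = pd.DataFrame({col: [float("nan"), 5.0] for col in MARKDOWN_COLS})
cleaned = flag_and_fill_markdowns(toy_features)
print(cleaned[["MarkDown1", "MarkDown1_missing"]])
```

Keeping the `_missing` flags lets a model distinguish a genuine zero from a filled gap.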

## Loading Instructions

Use the provided `WalmartDataLoader` class to load and explore the dataset:

```python
from walmart_data_loader import WalmartDataLoader

# Initialize and load data
loader = WalmartDataLoader()
success = loader.load_data()

if success:
    # Display dataset overview
    loader.basic_info()
```

### File Structure Requirements
Ensure the following files are in your `data/` directory:
- `train.csv`
- `test.csv` 
- `stores.csv`
- `features.csv`
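
A quick check of the directory before loading can give a clearer message than a failed read. The sketch below uses only the standard library; the helper name `missing_files` is hypothetical and is not part of `WalmartDataLoader`.

```python
from pathlib import Path

REQUIRED_FILES = ("train.csv", "test.csv", "stores.csv", "features.csv")


def missing_files(data_dir="data"):
    """Return the required file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


missing = missing_files("data")
if missing:
    print(f"Missing from data/: {', '.join(missing)}")
else:
    print("All required files found.")
```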

## Modeling Considerations

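1. **Holiday Impact**: Holiday weeks carry five times the weight of other weeks in evaluation, so prediction errors on those weeks count five times as much toward the score
2. **Missing Markdowns**: `MarkDown1-5` are NA before November 2011 and intermittently afterwards; handle these gaps explicitly (for example, by imputing or flagging them)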
3. **Regional Factors**: Incorporate temperature, fuel prices, and economic indicators
4. **Store Heterogeneity**: Account for different store types and sizes
5. **Temporal Patterns**: Consider seasonal and weekly sales patterns
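
As one way to expose the temporal and holiday patterns listed above, calendar features can be derived from `Date`. The sketch below is an assumption about how such features might be built, not a prescribed method: the helper `add_calendar_features`, the new column names, and the `"NoHoliday"` label are all hypothetical. The holiday dates are the ones listed under Holiday Weighting.

```python
import pandas as pd

HOLIDAY_DATES = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labor Day": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}
# Reverse lookup: week date -> holiday name
DATE_TO_HOLIDAY = {
    pd.Timestamp(d): name for name, dates in HOLIDAY_DATES.items() for d in dates
}


def add_calendar_features(df):
    """Add Year, ISO WeekOfYear, and a named Holiday column based on Date."""
    out = df.copy()
    out["Year"] = out["Date"].dt.year
    out["WeekOfYear"] = out["Date"].dt.isocalendar().week.astype(int)
    # Dates not in the lookup map to NaN, which is replaced by a plain label.
    out["Holiday"] = out["Date"].map(DATE_TO_HOLIDAY).fillna("NoHoliday")
    return out


weeks = pd.DataFrame({"Date": pd.to_datetime(["2010-02-05", "2010-02-12"])})
calendar = add_calendar_features(weeks)
print(calendar)
```

Naming the holiday, rather than relying on the boolean `IsHoliday` alone, lets a model learn a different effect for each of the four holidays.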

## Expected Outcomes

The goal is to predict `Weekly_Sales` for each store-department-date combination in the test set, with particular attention to accuracy during weighted holiday periods.


```python
import pandas as pd


class WalmartDataLoader:
    """
    Data loader for the Walmart sales dataset from Kaggle.
    Dataset: Walmart Recruiting - Store Sales Forecasting
    """

    def __init__(self):
        self.train_data = None
        self.test_data = None
        self.features_data = None
        self.stores_data = None
        self.holiday_weights = None

    def load_data(self):
        """Load all CSV files from data/ and parse dates. Returns True on success."""
        try:
            self.train_data = pd.read_csv("data/train.csv")
            self.test_data = pd.read_csv("data/test.csv")
            self.stores_data = pd.read_csv("data/stores.csv")
            self.features_data = pd.read_csv("data/features.csv")

            # Convert date strings to datetime (stores.csv has no Date column)
            self.train_data["Date"] = pd.to_datetime(self.train_data["Date"])
            self.test_data["Date"] = pd.to_datetime(self.test_data["Date"])
            self.features_data["Date"] = pd.to_datetime(self.features_data["Date"])

            print("All datasets loaded successfully!")
            return True
        except FileNotFoundError as e:
            print(f"Error loading files: {e}")
            print(
                "Please make sure all CSV files (train.csv, test.csv, stores.csv, features.csv) are in the data/ directory."
            )
            return False

    def basic_info(self):
        """Display basic information about all datasets"""
        print("=" * 80)
        print("WALMART SALES FORECASTING - DATASET OVERVIEW")
        print("=" * 80)

        datasets = {
            "Training Data": self.train_data,
            "Test Data": self.test_data,
            "Stores Data": self.stores_data,
            "Features Data": self.features_data,
        }

        for name, df in datasets.items():
            # Skip any dataset that has not been loaded
            if df is not None:
                print(f"\n {name}:")
                print(f"   Shape: {df.shape}")
                print(f"   Columns: {list(df.columns)}")
                print(
                    f"   Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB"
                )

                if name == "Training Data":
                    print(f"   Date range: {df['Date'].min()} to {df['Date'].max()}")
                    print(f"   Unique stores: {df['Store'].nunique()}")
                    print(f"   Unique departments: {df['Dept'].nunique()}")
                    print(f"   Total sales records: {len(df):,}")


if __name__ == "__main__":
    loader = WalmartDataLoader()
    # Only report on the data if loading actually succeeded
    if loader.load_data():
        loader.basic_info()
        print("Data loading complete!")
```

```
All datasets loaded successfully!
WALMART SALES FORECASTING - DATASET OVERVIEW

 Training Data:
   Shape: (421570, 5)
   Columns: ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday']
   Memory usage: 13.27 MB
   Date range: 2010-02-05 00:00:00 to 2012-10-26 00:00:00
   Unique stores: 45
   Unique departments: 81
   Total sales records: 421,570

 Test Data:
   Shape: (115064, 4)
   Columns: ['Store', 'Dept', 'Date', 'IsHoliday']
   Memory usage: 2.74 MB

 Stores Data:
   Shape: (45, 3)
   Columns: ['Store', 'Type', 'Size']
   Memory usage: 0.00 MB

 Features Data:
   Shape: (8190, 12)
   Columns: ['Store', 'Date', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'IsHoliday']
   Memory usage: 0.70 MB
Data loading complete!
```
