# 🛠️ M5 Forecasting Data Preprocessing and Feature Engineering

This notebook focuses on efficiently preparing the **M5 Forecasting** dataset for further analysis and modeling, primarily using **Dask** for handling large data.

---

#### 1. 📦 Import Required Libraries
- Imported Dask, NumPy, Pandas, OS, and JSON libraries.
- Note: All necessary libraries should be installed from the provided `requirements.txt`.

---

#### 2. 📂 Create Output Directory
- Created a `./data` folder if it doesn't already exist to store processed datasets.

---


#### 3. 🛒 Sales Data Preprocessing
- Read `sales_train_validation.csv` using Dask.
- **Truncated** the dataset to only include `HOBBIES` category products due to limited compute resources.
- **Repartitioned** the dataset based on `item_id` for better parallelism.
- **Melted** the data:
  - Converted it from wide format (columns `d_1`, `d_2`, ...) to long format (one row per item-day pair).
- Extracted day numbers (e.g., `d_1` → `1`) from the `day` column.
- **Sorted** data by `id` and `day`.
- Saved the processed sales data in **Parquet** format for faster loading.

---

#### 4. 📈 Load Datasets
- Loaded the following datasets:
  - **Sell prices**: `sell_prices.csv`
  - **Calendar**: `calendar.csv` (handled categorical columns and missing values properly)
  - **Sales**: Loaded preprocessed sales data from the saved parquet files.

---

#### 5. 🧹 Handling Missing Values
- Created a utility function `handle_missing_values(df)`:
  - Generated a missing value report (percentage of nulls per column).
  - Imputed missing values using **forward fill** (`ffill`) followed by **backward fill** (`bfill`).
- Applied this function across **sales**, **prices**, and **calendar** datasets.
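
The report-then-impute pattern above can be sketched on a tiny pandas frame. The column names and values below are made up for illustration and are not taken from the M5 files. The example also shows a side effect worth keeping in mind: forward and backward fill copy the nearest label into every gap, so a sparse column such as an event name ends up fully populated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric column and a sparse label column.
toy = pd.DataFrame({
    "price": [np.nan, 2.0, np.nan, 4.0],
    "event": [None, "Easter", None, None],
})

# Fraction of nulls per column, taken before any imputation.
missing_report = toy.isnull().mean()

# Forward fill first, then backward fill to cover leading nulls.
filled = toy.ffill().bfill()

print(missing_report)
print(filled)
```

Whether that behaviour is desirable for label-like columns depends on the downstream model, so it is worth checking the filled output before relying on it.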

---

#### 6. 🛠️ Memory Management
- Checked the partition count of the sales dataset (565 partitions) to confirm it is split into memory-friendly chunks.

---

#### 7. 🧪 Feature Engineering
- **Sorted** sales data by `id` and `day` again to ensure correct sequencing.
- **Set `id` as the index** (ensuring a sorted index for partition operations).
- Created **lag features**:
  - Added lagged sales columns for **1**, **7**, and **28** days.
  - These features help capture sales patterns and temporal dependencies during modeling.
- Created **rolling-mean features** over **7** and **14** day windows, computed per `id`.
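
A minimal pandas sketch of the per-series lag idea, using a made-up two-series frame and shorter lags (1 and 2) so the result is easy to read. Grouping by `id` before shifting keeps the last days of one series from leaking into the first days of the next.

```python
import pandas as pd

# Hypothetical toy data: two series, four days each.
toy_sales = pd.DataFrame({
    "id": ["A"] * 4 + ["B"] * 4,
    "day": [1, 2, 3, 4] * 2,
    "sales": [5, 0, 2, 1, 7, 7, 0, 3],
})

# Lags are only meaningful when rows are in time order within each series.
toy_sales = toy_sales.sort_values(["id", "day"])

for lag in (1, 2):
    # shift() inside groupby leaves NaN at the start of every series
    toy_sales[f"lag_{lag}"] = toy_sales.groupby("id")["sales"].shift(lag)

print(toy_sales)
```

The first `lag` rows of every series are NaN by construction, which is why the lag columns are declared as `float64`.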

---

### ✅ Output:
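- Preprocessed sales data with lag and rolling-mean features, saved to `./data/final_sales_data` and ready for modeling.
- Missing values handled across all datasets, with the per-column missing-value report written to `./data/data_quality_report.json`.
- All transformations are expressed as Dask operations, so they run partition by partition rather than loading the full dataset into memory.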

---


In [1]:
#Importing required libraries
#Note: Install all libraries from the requirements.txt file
import dask.dataframe as dd
import numpy as np
import os
import json
import pandas as pd



⚡ Note: The datasets are too large to upload directly to GitHub, hence manual placement is required.

# 📂 Dataset Placement Instructions

Please download the following datasets from the [M5 Forecasting Accuracy Kaggle competition](https://www.kaggle.com/competitions/m5-forecasting-accuracy/data):

1. `sales_train_evaluation.csv`
2. `sales_train_validation.csv`

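After downloading, **place them in the `dataset/M5 forecasting accuracy/` directory** next to this notebook, which is the path the cells below read from. The same directory must also contain `sell_prices.csv` and `calendar.csv`, which are loaded later.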

In [2]:
# Create output folder if it doesn't exist
os.makedirs("./data", exist_ok=True)

#### 🛠️ Sales Data Reshaping: Wide to Long Format

The **sales dataset** provided is originally in a **wide format**, where each day's sales are represented as separate columns.

For effective **time-series analysis** and **modeling**, it is essential to reshape this data into a **long format** — where each row represents a single product's sales on a specific day.

This preprocessing step **transforms** the dataset and **stores** the reshaped version, enabling easier feature engineering, model training, and forecasting tasks.
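
As a small illustration of the reshape, here is a pandas sketch on a made-up three-day frame. It keeps only two identifier columns for brevity, whereas the real file has six. The Dask cell below applies the same `melt` and day extraction per partition.

```python
import pandas as pd

# Hypothetical toy data in wide format: one row per item, one column per day.
wide = pd.DataFrame({
    "id": ["A_validation", "B_validation"],
    "item_id": ["A", "B"],
    "d_1": [0, 2],
    "d_2": [1, 0],
    "d_3": [3, 5],
})

# Wide -> long: every d_* column becomes a (day, sales) row.
long_df = wide.melt(id_vars=["id", "item_id"], var_name="day", value_name="sales")

# Turn labels such as "d_1" into integer day numbers.
long_df["day"] = long_df["day"].str.extract(r"d_(\d+)", expand=False).astype("int64")

# One row per item-day pair, ordered by series and time.
long_df = long_df.sort_values(["id", "day"]).reset_index(drop=True)
print(long_df)
```

Two items and three day columns become six rows, one per item-day pair.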


In [None]:
## The sales dataset is in wide format; this cell converts it to long format and stores the result
if not os.path.isdir('./data/processed_sales_data'):
    # Forward slashes avoid invalid escape sequences and work on every OS
    sales_data_prep = dd.read_csv('./dataset/M5 forecasting accuracy/sales_train_validation.csv')

    ## Compute is limited for this project, so the dataset is truncated to one category
    sales_data_prep = sales_data_prep[sales_data_prep["cat_id"] == "HOBBIES"]

    print("1 : Partitions : ", sales_data_prep.npartitions)
    groups = sales_data_prep['item_id'].unique().compute().sort_values().values.tolist()
    groups.append(groups[-1])
    sales_data_prep = sales_data_prep.set_index('item_id',
                        divisions=groups
                        ).reset_index()
    print("2 : re-Partitions : ", sales_data_prep.npartitions)

    # Melt the data: Convert wide format to long format
    sales_data_prep = sales_data_prep.melt(
        id_vars=["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"],
        var_name="day",
        value_name="sales"
    )
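
    # Extract the day number from labels such as "d_1" (runs on each pandas partition)
    def extract_day(df):
        df = df.copy()  # avoid mutating the partition passed in by Dask
        df["day"] = df["day"].str.extract(r"d_(\d+)", expand=False).astype("int64")
        df["sales"] = df["sales"].astype("float64")  # match the dtype declared in `meta`
        return df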

    meta = {
        "id": "object",
        "item_id": "object",
        "dept_id": "object",
        "cat_id": "object",
        "store_id": "object",
        "state_id": "object",
        "day": "int64",
        "sales": "float64"
    }

    # Apply transformation using map_partitions
    sales_data_prep = sales_data_prep.map_partitions(extract_day, meta=meta)
    sales_data_prep = sales_data_prep.map_partitions(lambda df: df.sort_values(['id',"day"]))

    print("3 : final Partitions : ", sales_data_prep.npartitions)
    # Save processed sales data
    sales_data_prep.to_parquet("./data/processed_sales_data")

In [4]:
# Step 1: Load the dataset efficiently using Dask

# Forward slashes avoid invalid escape sequences and work on every OS
prices_data = dd.read_csv('./dataset/M5 forecasting accuracy/sell_prices.csv')
calendar_data = dd.read_csv('./dataset/M5 forecasting accuracy/calendar.csv', dtype={
        'event_name_1': 'object',
        'event_type_1': 'object',
        'event_name_2': 'object',
        'event_type_2': 'object'
    },
    assume_missing=True  # Ensures proper dtype handling for missing values
)
sales_data = dd.read_parquet("./data/processed_sales_data")

In [5]:
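# Step 2: Report and impute missing values
def handle_missing_values(df):
    # Fraction of nulls per column, computed before imputation
    missing_report = df.isnull().mean().compute()
    # Forward fill, then backward fill to cover leading nulls.
    # Note: fills follow row order and are not restricted to a single id or event.
    df = df.ffill().bfill()
    return df, missing_report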

sales_data, sales_missing = handle_missing_values(sales_data)
prices_data, prices_missing = handle_missing_values(prices_data)
calendar_data, calendar_missing = handle_missing_values(calendar_data)

In [6]:
# Inspect the partition count (no repartitioning is performed in this cell)
print("Partitions : ", sales_data.npartitions)

Partitions :  565


In [7]:
# Step 3: Feature Engineering
# Creating lag features for sales

sales_data = sales_data.map_partitions(
    lambda df: df.sort_values(["id", "day"])
)
sales_data = sales_data.set_index("id", sorted=True)

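def add_lag_features(df, lags=(1, 7, 28)):
    df = df.copy()  # avoid mutating the partition passed in by Dask
    for lag in lags:
        # Shift within each series so lags never cross from one id to another
        df[f'lag_{lag}'] = df.groupby('id')['sales'].shift(lag)
    return df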


# Create updated meta
new_cols = {f'lag_{lag}': 'float64' for lag in [1, 7, 28]}
meta = sales_data._meta.assign(**{k: pd.Series(dtype=v) for k,v in new_cols.items()})
sales_data = sales_data.map_partitions(add_lag_features, meta=meta)



In [8]:
# Rolling window features
def create_rolling_features(df, window_sizes=(7, 14)):
    def rolling_func(partition_df):
        partition_df = partition_df.sort_values(['id', 'day'])
        for window in window_sizes:
            # transform keeps the result aligned with the partition's `id` index;
            # resetting the index here would misalign the assignment and yield NaN
            partition_df[f'rolling_mean_{window}'] = (
                partition_df.groupby('id')['sales']
                            .transform(lambda s: s.rolling(window=window, min_periods=1).mean())
            )
        return partition_df
    
    # Create updated meta
    new_cols = {f'rolling_mean_{w}': 'float64' for w in window_sizes}
    meta = df._meta.assign(**{k: pd.Series(dtype=v) for k,v in new_cols.items()})
    df = df.map_partitions(rolling_func, meta=meta)
    return df

sales_data = create_rolling_features(sales_data)
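
A small pandas sketch of the per-series rolling mean on a made-up frame indexed by `id`, mirroring the layout the sales data has at this point. It uses a window of 2 so the numbers are easy to check by hand. `transform` is used here because it returns values aligned with the original index, which matters when the index contains repeated ids.

```python
import pandas as pd

# Hypothetical toy data: two series of three days, with `id` as the index.
toy = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "day": [1, 2, 3, 1, 2, 3],
    "sales": [2.0, 4.0, 6.0, 10.0, 0.0, 2.0],
}).set_index("id")

# Rows must be in time order within each series before rolling.
toy = toy.sort_values(["id", "day"])

# min_periods=1 gives a value from the first day instead of NaN.
toy["rolling_mean_2"] = toy.groupby("id")["sales"].transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)
print(toy)
```

Note that the window includes the current day's sales, which is also how the cell above is written.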

In [9]:
# Step 4: Generate Data Quality Report
quality_report = {
    'sales_missing': sales_missing.to_dict(),
    'prices_missing': prices_missing.to_dict(),
    'calendar_missing': calendar_missing.to_dict()
}

with open('./data/data_quality_report.json', 'w') as f:
    json.dump(quality_report, f, indent=4)

# Save processed data
sales_data.to_parquet("./data/final_sales_data")  # Export cleaned and engineered dataset
print("Data engineering completed. Processed data saved.")


Data engineering completed. Processed data saved.
