## **Saudi Retail Demand Forecasting**
## Business Problem
Retail chains in Saudi Arabia struggle to maintain optimal inventory and staffing levels due to unpredictable daily and weekly sales patterns.

This leads to:

  - **Overstocking**: Increased costs and spoilage, especially for perishable goods.
  - **Understocking**: Lost sales, supply shortages, and poor customer experience.

Without accurate, localized demand forecasts, store managers rely on guesswork, which results in inefficiencies and avoidable losses.

## Project Objective  
Build a data-driven forecasting pipeline that:  
1. **Ingests** just over four years of daily retail sales data (Q1 2020–Q1 2024) across multiple Saudi cities and channels (in-store vs. online).  
2. **Automates** data cleaning, feature engineering (lags, rolling averages, holiday effects), and modeling in reusable, object-oriented modules.  
3. **Tracks** experiments, parameters, and metrics with MLflow, enabling easy comparison of XGBoost, Prophet, or other models.  
4. **Delivers** weekly sales forecasts per city and product category, exported as CSV (and optionally served via a simple API) to help operations teams optimize inventory and staffing.
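
The lag and rolling-average features named in step 2 can be illustrated with a minimal, dependency-free sketch. The function name `add_lag_and_rolling`, its parameters, and the sample values are hypothetical and are not taken from the project's `src` modules; holiday effects are left out because the holiday calendar used by the project is not shown here.

```python
from statistics import mean


def add_lag_and_rolling(sales, lag=1, window=4):
    """Build one feature dict per period from a list of weekly sales totals.

    lag_<lag>: the value `lag` periods earlier (None when unavailable).
    rolling_mean_<window>: mean of the `window` periods before the current
    one (None until a full window exists). The current value is excluded,
    so the feature does not leak the target.
    """
    rows = []
    for i, value in enumerate(sales):
        lagged = sales[i - lag] if i >= lag else None
        history = sales[i - window:i] if i >= window else None
        rows.append({
            "sales": value,
            f"lag_{lag}": lagged,
            f"rolling_mean_{window}": mean(history) if history else None,
        })
    return rows


# Made-up weekly totals, for illustration only.
weekly_sales = [100, 120, 90, 110, 130, 125]
features = add_lag_and_rolling(weekly_sales, lag=1, window=4)
```

Rows that lack enough history keep `None` in the feature columns; whether to drop or impute them before modeling is a design choice the pipeline has to make.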

## Required Forecasts
| Forecast Type | Granularity | Use Case |
|---|---|---|
| 1. **`City-Level Weekly Sales`** | Weekly | Help regional managers plan weekly inventory. |
| 2. **`Product Category Weekly Sales`** | Weekly | Understand category trends over weeks. |
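
Both forecasts start from the same step: rolling daily invoices up to weekly totals per group. The sketch below shows one way this might look in plain Python. The helper names `week_start` and `weekly_totals` and the sample rows (including the city and category values) are illustrative assumptions; the column names and the `M/D/YYYY` date format match the dataset. Weeks are assumed to start on Monday, which may need adjusting to the business week actually used for planning.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def week_start(date_text):
    """Parse an M/D/YYYY string and return the Monday of that week."""
    day = datetime.strptime(date_text, "%m/%d/%Y").date()
    return day - timedelta(days=day.weekday())


def weekly_totals(rows, key):
    """Sum 'Total Sales' per (row[key], week start) pair."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row[key], week_start(row["Invoice Date"]))] += row["Total Sales"]
    return dict(totals)


# Made-up invoices, for illustration only.
rows = [
    {"Invoice Date": "11/7/2023", "City": "Riyadh", "Product Category": "Food", "Total Sales": 100},
    {"Invoice Date": "11/12/2023", "City": "Riyadh", "Product Category": "Food", "Total Sales": 50},
    {"Invoice Date": "11/13/2023", "City": "Riyadh", "Product Category": "Electronics", "Total Sales": 70},
    {"Invoice Date": "11/7/2023", "City": "Jeddah", "Product Category": "Food", "Total Sales": 30},
]

by_city = weekly_totals(rows, "City")
by_category = weekly_totals(rows, "Product Category")
```

The same function serves both forecast types because only the grouping key changes.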

#### Add Project Root to Path
Adds the project root to Python's module search path so that modules such as `src.config_loader` can be imported when notebooks run from a subfolder (e.g., `notebooks/`).

In [2]:
import os
import sys
# Go one level up from current working directory
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Add project root to Python path if not already there
if project_root not in sys.path:
    sys.path.append(project_root)
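
The cell above assumes the notebook sits exactly one level below the project root. A possible alternative, sketched here with hypothetical helpers `find_project_root` and `add_to_sys_path`, walks upward until it finds a directory containing `src`, so it would also work from deeper subfolders. The marker name `src` is an assumption based on the imports used in this notebook.

```python
import sys
from pathlib import Path


def find_project_root(start, marker="src"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


def add_to_sys_path(path):
    """Append `path` to sys.path once; return True if it was added."""
    path = str(path)
    if path in sys.path:
        return False
    sys.path.append(path)
    return True
```

In a notebook this could be called as `add_to_sys_path(find_project_root(Path.cwd()))`.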

In [3]:
from src.config_loader import get_config
config = get_config()
config

{'data': {'raw_dir': '../data/raw/saudi_store_sales_dataset.csv',
  'processed_dir': 'data/processed'},
 'spark': {'master': 'local[*]',
  'app_name': 'RetailForecast',
  'configs': {'spark.sql.shuffle.partitions': 200}},
 'mlflow': {'tracking_uri': 'file:./mlruns',
  'experiment_name': 'retail_sales_forecast'},
 'model': {'random_seed': 42,
  'horizon': 'weekly',
  'train_start': '2020-01-01',
  'train_end': '2023-12-31',
  'test_start': '2024-01-01',
  'test_end': '2024-03-31'},
 'general': {'log_level': 'INFO',
  'feature_flags': {'use_new_preprocessor': False}}}
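
The four date keys under `model` define a chronological train/test split. As a rough sketch of how they might be applied, the hypothetical helper `split_by_date` below treats all four bounds as inclusive and works on simple `(date, value)` tuples; the project's own splitting logic is not shown in this notebook and may differ.

```python
from datetime import date

# Date bounds copied from the `model` section of the config above.
model_cfg = {
    "train_start": "2020-01-01",
    "train_end": "2023-12-31",
    "test_start": "2024-01-01",
    "test_end": "2024-03-31",
}


def split_by_date(records, cfg):
    """Split (date, value) records into train and test lists using inclusive bounds."""
    keys = ("train_start", "train_end", "test_start", "test_end")
    bounds = {k: date.fromisoformat(cfg[k]) for k in keys}
    train = [r for r in records if bounds["train_start"] <= r[0] <= bounds["train_end"]]
    test = [r for r in records if bounds["test_start"] <= r[0] <= bounds["test_end"]]
    return train, test


# Made-up weekly totals, for illustration only.
records = [
    (date(2023, 12, 25), 500),
    (date(2024, 1, 1), 450),
    (date(2024, 4, 1), 470),  # outside both ranges, so it is dropped
]
train, test = split_by_date(records, model_cfg)
```

Splitting on dates rather than at random keeps every test observation later than every training observation, which is the usual requirement for evaluating a forecast.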

#### Load Data

In [4]:
from src.load_data import DataLoader

loader = DataLoader()
df = loader.load_csv(config['data']['raw_dir'])

Spart Session created with app name: Saudi Retail Demand Forecasting
Loading CSV file from: ../data/raw/saudi_store_sales_dataset.csv


## Data Exploration

In [5]:
import importlib
import src.explore_data 
importlib.reload(src.explore_data)
from src.explore_data import DataExplorer

explorer = DataExplorer(df)
print("Is Spark DataFrame?:", explorer.is_spark)
print("DataFrame loaded successfully.")

Is Spark DataFrame?: True
DataFrame loaded successfully.


In [13]:
explorer.explore_data(schema=True)

root
 |-- Invoice Date: string (nullable = true)
 |-- Invoice ID: string (nullable = true)
 |-- Customer Type: string (nullable = true)
 |-- Customer Name: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Customer Gender: string (nullable = true)
 |-- Employee Name: string (nullable = true)
 |-- Manager Name: string (nullable = true)
 |-- Product Name: string (nullable = true)
 |-- Product Category: string (nullable = true)
 |-- Channel: string (nullable = true)
 |-- Customer Satisfaction: string (nullable = true)
 |-- Total Sales: integer (nullable = true)



In [14]:
explorer.explore_data(overview=True)

Rows: 49998, Columns: ['Invoice Date', 'Invoice ID', 'Customer Type', 'Customer Name', 'City', 'Customer Gender', 'Employee Name', 'Manager Name', 'Product Name', 'Product Category', 'Channel', 'Customer Satisfaction', 'Total Sales']


In [15]:
explorer.explore_data(summary=True)

+-------+------------+----------+---------------+----------------+---------------+---------------+--------------------+-------------+----------------+----------------+-------+---------------------+------------------+
|summary|Invoice Date|Invoice ID|  Customer Type|   Customer Name|           City|Customer Gender|       Employee Name| Manager Name|    Product Name|Product Category|Channel|Customer Satisfaction|       Total Sales|
+-------+------------+----------+---------------+----------------+---------------+---------------+--------------------+-------------+----------------+----------------+-------+---------------------+------------------+
|  count|       49998|     49998|          49998|           49998|          49998|          49998|               49998|        49998|           49998|           49998|  49998|                49998|             49998|
|   mean|        NULL|      NULL|           NULL|            NULL|           NULL|           NULL|                NULL|         NULL

In [16]:
explorer.explore_data(nulls=True)

+------------+----------+-------------+-------------+----+---------------+-------------+------------+------------+----------------+-------+---------------------+-----------+
|Invoice Date|Invoice ID|Customer Type|Customer Name|City|Customer Gender|Employee Name|Manager Name|Product Name|Product Category|Channel|Customer Satisfaction|Total Sales|
+------------+----------+-------------+-------------+----+---------------+-------------+------------+------------+----------------+-------+---------------------+-----------+
|           0|         0|            0|            0|   0|              0|            0|           0|           0|               0|      0|                    0|          0|
+------------+----------+-------------+-------------+----+---------------+-------------+------------+------------+----------------+-------+---------------------+-----------+



In [17]:
explorer.explore_data(duplicates=True)

Duplicate Rows: 0


In [19]:
explorer.explore_data(value_counts=True)


--- Invoice Date ---
+------------+-----+
|Invoice Date|count|
+------------+-----+
|  11/12/2023|   59|
|    8/1/2020|   53|
|   6/25/2023|   53|
|   10/7/2020|   52|
|    7/3/2020|   52|
|  12/27/2020|   51|
|    8/2/2021|   51|
|   8/20/2020|   50|
|   2/27/2020|   50|
|    7/3/2023|   50|
|   5/15/2022|   49|
|  12/30/2021|   49|
|    3/4/2021|   49|
|   9/15/2022|   49|
|   7/30/2022|   49|
|   5/22/2020|   48|
|  10/22/2020|   48|
|    6/9/2021|   48|
|  11/15/2022|   48|
|    8/3/2021|   48|
+------------+-----+
only showing top 20 rows


--- Invoice ID ---
+----------+-----+
|Invoice ID|count|
+----------+-----+
| ID-242374|    1|
|  ID-88813|    1|
| ID-306624|    1|
|  ID-72919|    1|
|  ID-61245|    1|
|   ID-1038|    1|
| ID-198702|    1|
| ID-297519|    1|
| ID-458725|    1|
| ID-158345|    1|
| ID-101394|    1|
| ID-153240|    1|
| ID-346847|    1|
| ID-170627|    1|
| ID-340587|    1|
| ID-219963|    1|
| ID-141305|    1|
| ID-138569|    1|
| ID-447967|    1|
| ID-25503