# **1. Business Understanding**

## **Project Overview**
A Sales and Inventory Forecasting System was developed to help retail and product-driven businesses anticipate future demand, optimize stock levels, and support strategic decision-making. Accurate forecasting enables improved budgeting, efficient purchasing cycles, and prevention of operational risks such as stock-outs and overstocking. By analyzing historical sales and inventory movement over an 18-month period, the system identifies patterns, detects seasonality, and predicts future behavior across hundreds of products.

## **1.1 Business Problem**
Retail operations frequently encounter uncertainty in customer demand, fluctuating sales volumes, and inconsistent supplier lead times. Without data-driven forecasting, businesses may face:

- Stock-outs during high-demand periods  
- Overstocking that ties up capital and increases storage costs  
- Poor financial planning and inaccurate budgeting  
- Reduced ability to respond to market changes  

A forecasting system provides visibility into future demand and stock requirements, reducing operational risk.

## **1.2 Project Objectives**
The forecasting system is designed to:

- Predict monthly sales for each product  
- Forecast revenue trends across the business  
- Estimate future inventory requirements  
- Identify products at risk of depletion or overstocking  
- Support budgeting, procurement planning, and operational strategy  

These objectives ensure that sales performance, stock movement, and purchasing decisions are aligned with future expectations.

## **1.3 Key Business Questions**
The system aims to answer several critical operational and financial questions:

- What are the expected sales levels over the next 3, 6, and 12 months?  
- Which products are experiencing growth or decline?  
- When will current stock fall below reorder thresholds?  
- How much inventory should be replenished, and when?  
- Which suppliers contribute to delays due to long lead times?  

Answering these questions strengthens both operational decision-making and financial forecasting.
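
As a rough illustration of the stock-depletion question, the time until a product reaches its reorder level can be estimated from a constant daily demand rate. This is a simplified sketch, not the project's forecasting method: `days_until_reorder` is a hypothetical helper, and the input figures (including the demand rate of 8 units per day) are assumed for illustration.

```python
import math


def days_until_reorder(current_stock, reorder_level, avg_daily_demand):
    """Whole days until stock drops to the reorder level at a constant demand rate.

    Hypothetical helper: assumes constant demand and no incoming deliveries.
    """
    if current_stock <= reorder_level:
        return 0  # already at or below the threshold
    if avg_daily_demand <= 0:
        return None  # no demand, so the threshold is never reached
    return math.floor((current_stock - reorder_level) / avg_daily_demand)


# Illustrative inputs only; the demand rate is an assumption, not a project result.
days_left = days_until_reorder(current_stock=466, reorder_level=114, avg_daily_demand=8)
lead_time_days = 16

# Reorder immediately if the supplier cannot deliver before the threshold is hit.
order_now = days_left is not None and days_left <= lead_time_days
```

Comparing the remaining days with the supplier lead time turns the estimate into a simple "order now or wait" signal.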

## **1.4 Success Criteria**
The project is considered successful if:

- Forecasting accuracy meets acceptable business thresholds  
  (for example, MAPE < 20%, depending on product behavior)  
- Sales, revenue, and stock forecasts are clear and actionable  
- Operational users can identify inventory risks early  
- A dashboard enables leaders to visualize trends and predictions  
- The forecasting results support data-driven procurement and budgeting  
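
The MAPE threshold mentioned above can be checked with a small helper. The sketch below is one possible formulation: the function name `mape` is hypothetical, and skipping periods with zero actual sales is an assumption made here because the percentage error is undefined for them.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumption: periods with zero actual sales are skipped, because the
    percentage error is undefined there.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


# Illustrative numbers only, not results from this project.
actual_units = [100, 120, 80, 0, 150]
forecast_units = [110, 108, 88, 5, 135]

score = mape(actual_units, forecast_units)
meets_threshold = score < 20  # example threshold from the success criteria
```

For products with many zero-sales months, a scale-free alternative such as a weighted percentage error may be more informative than MAPE.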

## **Key Highlights**
- End-to-end forecasting pipeline covering sales, revenue, and inventory  
- 18 months of synthetic but realistic daily transactional data  
- Multi-table relational database reflecting real business operations  
- Forecasts support strategic planning, purchasing, and stock control  
- Designed for scalability to hundreds of products and multiple suppliers  


# **2. Data Understanding**

## **Overview of the Dataset**
The project uses a multi-table relational dataset designed to resemble a real retail or product-based business. The data spans an 18-month period and includes detailed information about products, daily sales transactions, supplier performance, and inventory conditions. Understanding the structure of these datasets is essential for detecting patterns, identifying trends, validating data quality, and selecting the appropriate forecasting techniques.

## **2.1 Data Sources / Tables**
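Four primary tables form the foundation of the forecasting system. The data-generation code later adds two supporting tables, `Purchases` (restock orders) and `DailyInventory` (simulated daily stock levels).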

### **A. Products Table**
Contains core product attributes used for sales forecasting, pricing analysis, and inventory planning.

**Key Columns**
- `product_id`  
- `product_name`  
- `category`  
- `brand`  
- `sku_code`  
- `cost_price`  
- `selling_price`  
- `weight_kg`  
- `dimensions`  
- `launch_date`  
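- `discontinued` (0/1)  
- `supplier_id` (links each product to the Suppliers table)  
- `reorder_level` and `safety_stock` (generated with the product and copied into the Inventory table)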

These attributes enable advanced analytics such as product lifecycle evaluation, category-level performance, and profitability estimation.

### **B. Sales Table (Daily Transaction Data)**
Captures every sales transaction over the 18-month period, enriched with customer, pricing, and operational details. This table is the core source for identifying demand patterns, seasonality, and price-driven behavior.

**Key Columns**
- `sale_id`  
- `product_id`  
- `sale_date`  
- `quantity_sold`  
- `selling_price_at_time`  
- `revenue`  
- `profit`  
- `customer_type` (Retail, Wholesale, Online, Walk-in)  
- `region` (Nairobi, Mombasa, Kisumu, etc.)  
- `payment_method` (Cash, Mpesa, Card, Bank Transfer)  
- `discount_applied` (fraction of the base price, 0 to 0.25)  
- `order_channel` (Website, App, Physical Store)  
- `promotion_flag` (0/1)

This table supports analysis of demand shifts, pricing effects, regional behavior, and channel performance.

### **C. Inventory Table**
Provides operational stock information and logistics-related attributes that help determine inventory stability and future replenishment requirements.

**Key Columns**
- `product_id`  
- `current_stock`  
- `reorder_level`  
- `safety_stock`  
- `lead_time_days`  
- `last_restock_date`  
- `warehouse_location`  
- `max_capacity`  
- `stock_value`

This table supports forecasting stock depletion, identifying risks, and optimizing purchases.

### **D. Suppliers Table**
Contains supplier details used to model lead times, delivery reliability, and sourcing impact on forecasting.

**Key Columns**
- `supplier_id`  
- `supplier_name`  
- `contact_email`  
- `country`  
- `delivery_lead_time_days`  
- `reliability_score`

This enables simulation of supply chain delays and supplier performance trends.

## **2.2 Entity Relationship (ER) Structure**
The relational database follows a clean and scalable structure suitable for forecasting and operational analytics.

**Core Relationships**
- `Products.product_id` → `Sales.product_id`  
  (Each product has many sales records)

- `Products.product_id` → `Inventory.product_id`  
  (Each product has one inventory profile)

- `Suppliers.supplier_id` → `Products.supplier_id`  
  (Each product is linked to a supplier)

This design ensures data integrity and supports complex joins for forecasting, trend detection, and inventory risk analysis.
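
A minimal sketch of how these relationships translate into joins, using tiny in-memory frames in place of the real tables. The sample values are made up; the `validate` argument of `DataFrame.merge` raises an error if a relationship is not many-to-one as expected.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative).
products = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Eco Iron Box", "Classic T-shirt"],
    "supplier_id": [10, 20],
})
sales = pd.DataFrame({
    "sale_id": [1, 2, 3],
    "product_id": [1, 1, 2],
    "quantity_sold": [3, 2, 5],
})
inventory = pd.DataFrame({"product_id": [1, 2], "current_stock": [40, 15]})
suppliers = pd.DataFrame({"supplier_id": [10, 20], "delivery_lead_time_days": [7, 21]})

# Each merge follows one relationship from the ER structure.
joined = (
    sales
    .merge(products, on="product_id", how="left", validate="many_to_one")
    .merge(inventory, on="product_id", how="left", validate="many_to_one")
    .merge(suppliers, on="supplier_id", how="left", validate="many_to_one")
)
```

Each sales row now carries its product, stock, and supplier attributes, which is the shape needed for inventory risk analysis.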

## **2.3 Initial Data Exploration Approach**
Before modeling, exploratory checks are performed to validate the generated dataset:

### **A. Date Range Checks**
- Identify earliest and latest `sale_date`  
- Confirm full 18-month coverage  
- Detect gaps or inconsistencies in the daily timeline  
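
The date checks above can be sketched as follows. The three sample dates are made up to show a gap; on the real data the same steps would run against the `sale_date` column of the Sales table.

```python
import pandas as pd

# Toy sample with a deliberate two-day gap.
sales = pd.DataFrame({"sale_date": ["2023-01-01", "2023-01-02", "2023-01-05"]})

sale_dates = pd.to_datetime(sales["sale_date"])
first_day, last_day = sale_dates.min(), sale_dates.max()
span_days = (last_day - first_day).days

# Every calendar day between the first and last sale.
expected_days = pd.date_range(first_day, last_day, freq="D")

# Days in the expected range with no sales at all.
missing_days = expected_days.difference(pd.DatetimeIndex(sale_dates))
```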

### **B. Row Count Checks**
Expected approximate sizes:
- Products: *300 items*  
- Inventory: *one row per product*  
- Suppliers: *9 suppliers*  
- Sales: *3,000–5,000 transactions in total, spread randomly across the daily timeline*  

### **C. Sample Data Previews**
Preview the first rows from each table to confirm:
- Correct column names and formats  
- Realistic pricing and quantity distributions  
- Accurate foreign key relationships  
- Randomized, non-sequential ordering of sales  

## **2.4 Understanding Sales Patterns**
Several aspects of the daily sales data are examined prior to modeling:

- Frequency and randomness of transactions  
- Product-level demand variability  
- Seasonal patterns or spikes  
- Regional differences  
- Effects of promotions, discounts, or channels  

These patterns guide the selection of forecasting models and feature engineering strategies.
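
One way to examine promotion and regional effects is a pivot of average quantity by region and promotion flag. The numbers below are illustrative. In the synthetic data generated later in this notebook, promotion flags are drawn independently of quantity, so little real uplift should be expected there.

```python
import pandas as pd

# Toy transactions (values are illustrative).
sales = pd.DataFrame({
    "region": ["Nairobi", "Nairobi", "Kisumu", "Kisumu"],
    "promotion_flag": [0, 1, 0, 1],
    "quantity_sold": [2, 6, 3, 5],
})

# Average quantity per transaction, split by region and promotion status.
avg_qty = sales.pivot_table(
    index="region", columns="promotion_flag", values="quantity_sold", aggfunc="mean"
)

# Relative uplift of promoted over non-promoted transactions, per region.
promo_uplift = avg_qty[1] / avg_qty[0] - 1
```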

## **Key Highlights**
- Multi-table relational structure representing real retail operations  
- Detailed daily sales spanning 18 months  
- Randomized and unsorted transactions for realistic behavior  
- Comprehensive product, supplier, and inventory attributes  
- Designed for advanced forecasting and inventory optimization  


# **3. Data Preparation**

## **Overview**
This stage establishes the structured foundation used for analysis, forecasting, and dashboard development. It involves creating the relational database schema, generating synthetic but realistic datasets, enforcing data quality standards, and transforming the raw data into analytical formats. A clean, well-organized dataset ensures that forecasting models receive accurate and consistent input.

## **3.1 Creating the Database Structure**
A complete relational database schema is prepared to support sales forecasting, inventory planning, and supplier performance analysis. Four main tables are constructed:

### **A. Products Table**
Defines foundational product characteristics, including:
- Category and brand information  
- SKU codes and physical attributes  
- Pricing and lifecycle status  
- Supplier linkage  

This table enables product-level forecasting and inventory evaluation.

### **B. Sales Table**
Stores granular daily sales transactions across an 18-month period.  
Attributes include:
- Quantities sold  
- Sales price at time of transaction  
- Revenue and profit calculations  
- Customer, region, payment method, and channel identifiers  
- Promotion and discount usage  

This table is the primary source for demand analytics.

### **C. Inventory Table**
Represents product-level inventory conditions with:
- Current stock  
- Safety levels  
- Lead times  
- Restock dates  
- Warehouse capacity metrics  

This table is used to evaluate inventory health and predict stock depletion.

### **D. Suppliers Table**
Captures supplier performance with:
- Lead time behavior  
- Delivery reliability  
- Country of origin  
- Supplier contact details  

This supports inventory optimization and restocking logic.

## **3.2 Cleaning and Validating Source Data**
Before loading data for modeling, several quality checks are performed:

### **A. Column Type Verification**
- Dates converted to valid date formats  
- Prices stored as numeric values  
- Categorical entries standardized (e.g., payment methods, regions, channels)

### **B. Removing Irregularities**
- Invalid or impossible dates  
- Negative quantities  
- Inconsistent pricing or discounts  
- Missing or mismatched product references  
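
The type checks and irregularity filters above can be sketched as follows. The raw rows are made up to include an impossible date and a negative quantity; dropping such rows (rather than repairing them) is an assumption of this sketch.

```python
import pandas as pd

# Toy raw data with typical problems (values are illustrative).
raw = pd.DataFrame({
    "sale_date": ["2023-02-10", "2023-02-30", "2023-03-01"],  # 30 Feb is impossible
    "quantity_sold": [2, 1, -4],                               # negative quantity
    "selling_price_at_time": ["1500.50", "980", "2100"],       # prices stored as text
    "payment_method": [" mpesa", "CASH", "Card "],             # inconsistent labels
})

clean = raw.copy()

# Invalid dates and non-numeric prices become NaT / NaN instead of raising.
clean["sale_date"] = pd.to_datetime(clean["sale_date"], format="%Y-%m-%d", errors="coerce")
clean["selling_price_at_time"] = pd.to_numeric(clean["selling_price_at_time"], errors="coerce")

# Standardize categorical labels: trim whitespace, use title case.
clean["payment_method"] = clean["payment_method"].str.strip().str.title()

# Drop rows with invalid dates or non-positive quantities.
valid = clean["sale_date"].notna() & (clean["quantity_sold"] > 0)
clean = clean[valid].reset_index(drop=True)
```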

### **C. Ensuring Referential Integrity**
To maintain a reliable relational structure:
- Every `Sales.product_id` must exist in `Products`  
- Every `Inventory.product_id` must exist in `Products`  
- Every `Products.supplier_id` must correspond to a record in `Suppliers`  

Maintaining strict integrity ensures accurate joins and prevents misleading results during analysis.
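
A minimal sketch of the referential-integrity check, assuming the tables are loaded as pandas DataFrames. `orphan_keys` is a hypothetical helper, and the sample frames are illustrative.

```python
import pandas as pd


def orphan_keys(child, parent, key):
    """Return the key values present in `child` but absent from `parent`."""
    missing = ~child[key].isin(parent[key])
    return sorted(child.loc[missing, key].unique().tolist())


# Toy tables: sale 3 points at a product that does not exist.
products = pd.DataFrame({"product_id": [1, 2, 3], "supplier_id": [10, 10, 20]})
suppliers = pd.DataFrame({"supplier_id": [10, 20]})
sales = pd.DataFrame({"sale_id": [1, 2, 3], "product_id": [1, 2, 99]})

orphan_sales = orphan_keys(sales, products, "product_id")
orphan_products = orphan_keys(products, suppliers, "supplier_id")
```

An empty list means the relationship holds; any returned keys identify the records to fix or drop before joining.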

## **3.3 Transforming Sales Data for Forecasting**
Raw daily transactions are not directly used for forecasting. They must be aggregated and reshaped into structured time-series datasets.

### **A. Converting Daily Sales → Monthly Sales**
Daily-level data is grouped into monthly totals for each product:

**Aggregated Metrics**
- Total monthly quantity sold  
- Total monthly revenue  
- Total monthly profit  
- Number of monthly transactions  
- Average monthly discount  
- Dominant sales region  
- Promotion activity indicators  

Monthly-level formatting simplifies forecasting and captures broader trends.
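
The aggregation can be sketched with a pandas group-by. The output column names (`total_quantity`, `dominant_region`, and so on) are assumptions for illustration, and the sample rows are made up.

```python
import pandas as pd

# Toy daily transactions (values are illustrative).
sales = pd.DataFrame({
    "product_id": [1, 1, 1, 2],
    "sale_date": ["2023-01-05", "2023-01-20", "2023-02-03", "2023-01-11"],
    "quantity_sold": [2, 3, 1, 4],
    "revenue": [200.0, 300.0, 100.0, 800.0],
    "profit": [40.0, 60.0, 20.0, 160.0],
    "discount_applied": [0.10, 0.20, 0.00, 0.05],
    "region": ["Nairobi", "Nairobi", "Kisumu", "Mombasa"],
    "promotion_flag": [0, 1, 0, 0],
})

# Monthly period key, e.g. 2023-01.
sales["year_month"] = pd.to_datetime(sales["sale_date"]).dt.to_period("M")

monthly_sales = (
    sales.groupby(["product_id", "year_month"])
    .agg(
        total_quantity=("quantity_sold", "sum"),
        total_revenue=("revenue", "sum"),
        total_profit=("profit", "sum"),
        n_transactions=("quantity_sold", "size"),
        avg_discount=("discount_applied", "mean"),
        # Most frequent region; ties resolve to the alphabetically first value.
        dominant_region=("region", lambda s: s.mode().iloc[0]),
        promo_transactions=("promotion_flag", "sum"),
    )
    .reset_index()
)
```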

### **B. Filling Missing Months**
In time-series forecasting, each product must have:
- A continuous sequence of months  
- No gaps in the timeline  
- Zero-sales months recorded explicitly  

Missing periods are filled to ensure uniform modeling.
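
One way to fill the gaps is to reindex each product against the full product-by-month grid. The sketch below assumes a `total_quantity` column and a three-month window for brevity; both are illustrative.

```python
import pandas as pd

# Toy monthly table with gaps: product 1 has no February, product 2 only February.
monthly = pd.DataFrame({
    "product_id": [1, 1, 2],
    "year_month": pd.PeriodIndex(["2023-01", "2023-03", "2023-02"], freq="M"),
    "total_quantity": [5, 2, 4],
})

# Full grid: every product crossed with every month in the window.
all_months = pd.period_range("2023-01", "2023-03", freq="M")
full_index = pd.MultiIndex.from_product(
    [monthly["product_id"].unique(), all_months],
    names=["product_id", "year_month"],
)

# Missing product-month combinations become explicit zero-sales rows.
filled = (
    monthly.set_index(["product_id", "year_month"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)
```

Filling with zero is appropriate for quantities and revenue; averages such as discount rates may be better left missing for months without sales.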

### **C. Creating Modeling Tables**
Several structured modeling datasets are produced:

#### **1. `monthly_sales`**
Includes:
- `product_id`  
- `year_month`  
- Monthly quantity, revenue, profit  
- Discount and promotion indicators  
- Region dominance  

#### **2. `monthly_inventory_snapshot`**
Includes:
- Starting and ending monthly stock  
- Lead times  
- Stock-out flags  
- Inventory turnover metrics  

#### **3. `product_master`**
A consolidated reference table including:
- Category, brand, SKU codes  
- Cost and selling price  
- Supplier information  
- Physical attributes  

These datasets serve as the primary inputs for feature engineering and forecasting.
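
As an illustration of the `monthly_inventory_snapshot` table, starting stock, ending stock, and stock-out flags can be derived from daily stock levels. The input column names follow the `DailyInventory` table generated later in this notebook; the output column names and sample values are assumptions. Sorting by date first matters because the generated daily rows are shuffled.

```python
import pandas as pd

# Toy daily stock levels for one product (values are illustrative).
daily_inventory = pd.DataFrame({
    "product_id": [1, 1, 1, 1],
    "date": ["2023-02-28", "2023-01-01", "2023-02-01", "2023-01-31"],  # unsorted on purpose
    "stock_on_hand": [0, 50, 20, 20],
})

daily_inventory["date"] = pd.to_datetime(daily_inventory["date"])
daily_inventory = daily_inventory.sort_values(["product_id", "date"])
daily_inventory["year_month"] = daily_inventory["date"].dt.to_period("M")

monthly_inventory_snapshot = (
    daily_inventory.groupby(["product_id", "year_month"])
    .agg(
        starting_stock=("stock_on_hand", "first"),
        ending_stock=("stock_on_hand", "last"),
        stockout_days=("stock_on_hand", lambda s: int((s == 0).sum())),
    )
    .reset_index()
)

# Flag months in which the product hit zero stock on at least one day.
monthly_inventory_snapshot["stockout_flag"] = (
    monthly_inventory_snapshot["stockout_days"] > 0
).astype(int)
```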

## **3.4 Data Quality Assessment Before Modeling**
Before advancing to feature engineering and modeling, the prepared datasets undergo several checks:

- Confirm that all products have continuous 18-month timelines  
- Verify that aggregated monthly sales align with daily totals  
- Detect outliers or extreme spikes in demand  
- Confirm no missing product or supplier references  
- Validate the range and distribution of prices, discounts, and quantities  

Ensuring quality at this stage reduces model errors and increases forecasting accuracy.
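
Two of these checks can be sketched briefly: reconciling monthly totals against daily totals, and flagging demand spikes. The interquartile-range rule with a multiplier of 1.5 is a common convention chosen here as an assumption, not a method stated by the project; `iqr_outliers` is a hypothetical helper and all values are illustrative.

```python
import pandas as pd

# Toy daily transactions and their monthly aggregation.
daily = pd.DataFrame({
    "product_id": [1, 1, 1, 2, 2],
    "sale_date": ["2023-01-05", "2023-01-20", "2023-02-03", "2023-01-11", "2023-02-15"],
    "quantity_sold": [2, 3, 1, 4, 6],
})
daily["year_month"] = pd.to_datetime(daily["sale_date"]).dt.to_period("M")
monthly = (
    daily.groupby(["product_id", "year_month"])["quantity_sold"]
    .sum()
    .rename("total_quantity")
    .reset_index()
)

# Check 1: monthly totals must add back up to the daily totals per product.
daily_totals = daily.groupby("product_id")["quantity_sold"].sum()
monthly_totals = (
    monthly.groupby("product_id")["total_quantity"].sum()
    .reindex(daily_totals.index, fill_value=0)
)
mismatched_products = daily_totals.index[daily_totals != monthly_totals].tolist()


# Check 2: flag values outside [Q1 - k*IQR, Q3 + k*IQR].
def iqr_outliers(values, k=1.5):
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    spread = q3 - q1
    mask = (values < q1 - k * spread) | (values > q3 + k * spread)
    return values[mask]


monthly_units = pd.Series([10, 12, 11, 9, 13, 60])
spikes = iqr_outliers(monthly_units).tolist()
```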

## **Key Highlights**
- Fully structured relational schema reflecting real retail operations  
- Cleaned and validated synthetic datasets ensuring reliability  
- Daily sales transformed into monthly analytical formats  
- Missing months filled for stable time-series forecasting  
- Modeling-ready datasets created for product-level and category-level analysis  


In [3]:
# Imports and lookup lists

import pandas as pd
import numpy as np
import sqlite3
import random
from datetime import datetime, timedelta

# Fix the random seeds so the generated dataset is reproducible
np.random.seed(42)
random.seed(42)

# Lookup Lists (Static Values)

categories = [
    "Electronics", "Home Appliances", "Beauty & Personal Care", "Groceries",
    "Office Supplies", "Kitchenware", "Sports Equipment", "Clothing",
    "Footwear", "Toys", "Automotive", "Health Products"
]

brands = [
    "NovaTech", "PrimePro", "EcoLine", "UltraHome", "Zenith", "BrightPlus",
    "FlexiStore", "ApexGear", "PureEssence", "UrbanStyle"
]

regions = [
    "Nairobi", "Mombasa", "Kisumu", "Nakuru", "Eldoret", "Thika"
]

payment_methods = [
    "Cash", "Mpesa", "Card", "Bank Transfer"
]

customer_types = [
    "Retail", "Wholesale", "Online", "Walk-in"
]

order_channels = [
    "Website", "App", "Physical Store"
]

supplier_names = [
    "Kenya Distributors Ltd", "Prime Imports", "EastAfrica Supply Co",
    "Urban Retail Logistics", "GlobalMart Partners", "AfriTrade Solutions",
    "SupplyLink Movers", "ProSource Traders", "Zen Wholesale Group"
]

print("Lookup lists created successfully.")


Lookup lists created successfully.


In [4]:
#Generate products table
num_products = 300

# Function to generate random product names
def generate_product_name():
    adjectives = [
        "Premium", "Advanced", "Eco", "Ultra", "Smart", "Compact",
        "Portable", "Durable", "Enhanced", "Pro", "Lite", "Classic"
    ]
    items = [
        "Blender", "TV", "Laptop", "Notebook", "Shoes", "Backpack", "Perfume",
        "Headphones", "Iron Box", "Frying Pan", "Sports Bottle", "Desk Lamp",
        "Hand Mixer", "Electric Kettle", "Hair Dryer", "Vacuum Cleaner",
        "Soccer Ball", "Yoga Mat", "T-shirt", "Wireless Charger", "Keyboard"
    ]
    return f"{random.choice(adjectives)} {random.choice(items)}"

# Generate product attributes
product_ids = np.arange(1, num_products + 1)
product_names = [generate_product_name() for _ in range(num_products)]
product_categories = np.random.choice(categories, num_products)
product_brands = np.random.choice(brands, num_products)
product_suppliers = np.random.choice(np.arange(1, len(supplier_names) + 1), num_products)

# Random pricing logic
cost_prices = np.round(np.random.uniform(200, 15000, num_products), 2)
selling_prices = np.round(cost_prices * np.random.uniform(1.1, 1.8, num_products), 2)

# Random weight and dimensions
weights = np.round(np.random.uniform(0.1, 10.0, num_products), 2)
dimensions = [f"{np.random.randint(5,50)}x{np.random.randint(5,50)}x{np.random.randint(5,50)} cm"
              for _ in range(num_products)]

# Random launch dates (0 to 900 days after 2021-01-01)
start_date = datetime(2021, 1, 1)
launch_dates = [start_date + timedelta(days=random.randint(0, 900)) for _ in range(num_products)]
launch_dates = [d.strftime("%Y-%m-%d") for d in launch_dates]

# Random discontinued flag (10% discontinued)
discontinued_flags = np.random.choice([0, 1], num_products, p=[0.9, 0.1])

# SKU codes
sku_codes = [f"SKU-{random.randint(100000,999999)}" for _ in range(num_products)]

# Reorder & safety stock
reorder_points = np.random.randint(10, 200, num_products)
safety_stocks = np.random.randint(5, 120, num_products)

# Create products DataFrame
df_products = pd.DataFrame({
    "product_id": product_ids,
    "product_name": product_names,
    "category": product_categories,
    "brand": product_brands,
    "sku_code": sku_codes,
    "cost_price": cost_prices,
    "selling_price": selling_prices,
    "weight_kg": weights,
    "dimensions": dimensions,
    "launch_date": launch_dates,
    "discontinued": discontinued_flags,
    "supplier_id": product_suppliers,
    "reorder_level": reorder_points,
    "safety_stock": safety_stocks
})

# Shuffle rows so data is NOT in any order
df_products = df_products.sample(frac=1).reset_index(drop=True)

df_products.head()

| | product_id | product_name | category | brand | sku_code | cost_price | selling_price | weight_kg | dimensions | launch_date | discontinued | supplier_id | reorder_level | safety_stock |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Eco Iron Box | Toys | NovaTech | SKU-650121 | 1930.09 | 3304.88 | 8.68 | 16x36x36 cm | 2021-11-24 | 0 | 6 | 145 | 118 |
| 1 | 98 | Lite Vacuum Cleaner | Health Products | PureEssence | SKU-177680 | 7687.15 | 11726.1 | 1.78 | 5x27x41 cm | 2023-01-23 | 0 | 1 | 68 | 79 |
| 2 | 11 | Premium Yoga Mat | Automotive | ApexGear | SKU-931450 | 14851.41 | 23185.25 | 8.72 | 26x10x6 cm | 2021-05-05 | 0 | 8 | 197 | 79 |
| 3 | 60 | Classic T-shirt | Home Appliances | PureEssence | SKU-973074 | 2261.23 | 4066.94 | 4.92 | 33x47x6 cm | 2021-06-04 | 0 | 9 | 36 | 109 |
| 4 | 219 | Premium Frying Pan | Clothing | PrimePro | SKU-901909 | 6259.29 | 10367.68 | 0.28 | 9x6x26 cm | 2021-05-04 | 0 | 3 | 66 | 72 |


In [6]:
# Generate 18-month daily date range 
start_date = datetime(2023, 1, 1)
end_date = start_date + timedelta(days=18 * 30)  # 540 days, roughly 18 months

# Generate full daily timeline
date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Convert to strings for easier dataset merging
dates_list = [d.strftime("%Y-%m-%d") for d in date_range]

# Shuffle the dates so they are NOT in order
random.shuffle(dates_list)

print("Total dates generated:", len(dates_list))
print("Sample dates:", dates_list[:10])


Total dates generated: 541
Sample dates: ['2023-07-23', '2023-01-31', '2023-07-07', '2023-07-09', '2023-04-14', '2023-09-09', '2024-03-23', '2024-05-19', '2023-08-28', '2023-07-13']


In [7]:
# PART 4: GENERATE DAILY SALES TABLE (3,000–5,000 ROWS)

# Number of daily sales transactions to generate
num_sales = random.randint(3000, 5000)

# Randomly sample the attributes of each transaction
sale_ids = np.arange(1, num_sales + 1)
product_ids_list = np.random.choice(df_products["product_id"], num_sales)
quantities = np.random.randint(1, 10, num_sales)  # quantity sold per transaction
customer_type_list = np.random.choice(customer_types, num_sales)
region_list = np.random.choice(regions, num_sales)
payment_list = np.random.choice(payment_methods, num_sales)
channel_list = np.random.choice(order_channels, num_sales)

# Random discounts and promotions
discounts = np.round(np.random.uniform(0, 0.25, num_sales), 2)
promotion_flags = np.random.choice([0, 1], num_sales, p=[0.85, 0.15])  # 15% chance of promotion

# Assign random sale dates from shuffled 18-month timeline
sale_dates = np.random.choice(dates_list, num_sales)

# Determine selling price at time of sale (base price less the transaction discount)
selling_prices_at_time = []
profits = []

# Index prices by product_id once instead of filtering the DataFrame on every row
price_lookup = df_products.set_index("product_id")[["selling_price", "cost_price"]]

for pid, qty, disc in zip(product_ids_list, quantities, discounts):
    base_price = price_lookup.at[pid, "selling_price"]
    price_after_discount = base_price * (1 - disc)
    selling_prices_at_time.append(round(price_after_discount, 2))

    # Profit can be negative when the discount exceeds the product's margin
    cost_price = price_lookup.at[pid, "cost_price"]
    profit_per_unit = price_after_discount - cost_price
    profits.append(round(profit_per_unit * qty, 2))

# Total revenue per transaction
revenues = [round(q * p, 2) for q, p in zip(quantities, selling_prices_at_time)]

# Create final sales DataFrame
df_sales = pd.DataFrame({
    "sale_id": sale_ids,
    "product_id": product_ids_list,
    "sale_date": sale_dates,
    "quantity_sold": quantities,
    "selling_price_at_time": selling_prices_at_time,
    "revenue": revenues,
    "profit": profits,
    "customer_type": customer_type_list,
    "region": region_list,
    "payment_method": payment_list,
    "order_channel": channel_list,
    "discount_applied": discounts,
    "promotion_flag": promotion_flags
})

# Shuffle rows so the entire dataset is NOT ordered
df_sales = df_sales.sample(frac=1).reset_index(drop=True)

df_sales.head()


| | sale_id | product_id | sale_date | quantity_sold | selling_price_at_time | revenue | profit | customer_type | region | payment_method | order_channel | discount_applied | promotion_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1563 | 165 | 2023-06-07 | 1 | 1300.6 | 1300.6 | 128.11 | Retail | Thika | Cash | Website | 0.01 | 0 |
| 1 | 4024 | 122 | 2023-05-29 | 5 | 9794.04 | 48970.2 | 4215.79 | Wholesale | Eldoret | Bank Transfer | App | 0.06 | 0 |
| 2 | 1494 | 264 | 2024-01-11 | 7 | 2315.48 | 16208.36 | -2418.54 | Retail | Eldoret | Mpesa | Physical Store | 0.24 | 1 |
| 3 | 3832 | 165 | 2023-07-22 | 9 | 1024.72 | 9222.48 | -1329.96 | Retail | Kisumu | Mpesa | Physical Store | 0.22 | 0 |
| 4 | 1831 | 209 | 2024-05-16 | 9 | 10423.62 | 93812.58 | 14351.38 | Wholesale | Nakuru | Cash | App | 0.11 | 0 |


In [8]:
#Generate Inventory Master Table
#Warehouse locations
warehouses = ["Main Warehouse", "Industrial Area Depot", "CBD Storage", "Westlands Hub"]

inventory_records = []

for _, row in df_products.iterrows():
    product_id = row["product_id"]
    cost_price = row["cost_price"]

    # Random stock levels
    current_stock = np.random.randint(20, 1000)             # current quantity on-hand
    reorder_level = row["reorder_level"]                    # already randomized earlier
    safety_stock = row["safety_stock"]                      # from products table
    lead_time_days = np.random.randint(3, 30)               # supplier delivery time
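    # Note: max_capacity is drawn independently, so it can be lower than current_stock
    max_capacity = np.random.randint(500, 2000)             # warehouse capacity per product

    # Last restock date: random date 1 to 179 days before the notebook run date
    # (relative to datetime.now(), so it is not tied to the 18-month sales window)
    last_restock = datetime.now() - timedelta(days=np.random.randint(1, 180))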
    last_restock = last_restock.strftime("%Y-%m-%d")

    # Stock value = current stock × cost price
    stock_value = round(current_stock * cost_price, 2)

    inventory_records.append([
        product_id,
        current_stock,
        reorder_level,
        safety_stock,
        lead_time_days,
        last_restock,
        np.random.choice(warehouses),
        max_capacity,
        stock_value
    ])

df_inventory = pd.DataFrame(
    inventory_records,
    columns=[
        "product_id",
        "current_stock",
        "reorder_level",
        "safety_stock",
        "lead_time_days",
        "last_restock_date",
        "warehouse_location",
        "max_capacity",
        "stock_value"
    ]
)

# Shuffle inventory rows to avoid any natural ordering
df_inventory = df_inventory.sample(frac=1).reset_index(drop=True)

df_inventory.head()


| | product_id | current_stock | reorder_level | safety_stock | lead_time_days | last_restock_date | warehouse_location | max_capacity | stock_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 145 | 541 | 121 | 100 | 27 | 2025-10-07 | Westlands Hub | 541 | 5170612.91 |
| 1 | 108 | 466 | 114 | 72 | 16 | 2025-11-17 | Westlands Hub | 1474 | 4067746.62 |
| 2 | 102 | 40 | 168 | 99 | 18 | 2025-06-16 | Westlands Hub | 1213 | 559814.0 |
| 3 | 278 | 972 | 66 | 114 | 9 | 2025-11-13 | CBD Storage | 534 | 10732892.04 |
| 4 | 137 | 844 | 181 | 84 | 15 | 2025-11-16 | CBD Storage | 595 | 5820308.4 |


In [11]:
#Generate suppliers table
# Supplier countries list
supplier_countries = [
    "Kenya", "China", "India", "United Arab Emirates", "Turkey",
    "South Africa", "United Kingdom"
]

supplier_records = []

for supplier_id, supplier_name in enumerate(supplier_names, start=1):
    contact_email = supplier_name.lower().replace(" ", "_") + "@supplier.com"

    # Random delivery lead time
    delivery_lead_time = np.random.randint(3, 35)  # 3 to 34 days (upper bound is exclusive)

    # Random reliability score (0.5 to 1.0)
    reliability_score = round(np.random.uniform(0.5, 1.0), 2)

    # Random country of origin
    country = np.random.choice(supplier_countries)

    supplier_records.append([
        supplier_id,
        supplier_name,
        contact_email,
        country,
        delivery_lead_time,
        reliability_score
    ])

df_suppliers = pd.DataFrame(
    supplier_records,
    columns=[
        "supplier_id",
        "supplier_name",
        "contact_email",
        "country",
        "delivery_lead_time_days",
        "reliability_score"
    ]
)

# Shuffle to avoid natural order
df_suppliers = df_suppliers.sample(frac=1).reset_index(drop=True)

df_suppliers.head()


| | supplier_id | supplier_name | contact_email | country | delivery_lead_time_days | reliability_score |
|---|---|---|---|---|---|---|
| 0 | 7 | SupplyLink Movers | supplylink_movers@supplier.com | United Kingdom | 29 | 0.8 |
| 1 | 2 | Prime Imports | prime_imports@supplier.com | China | 11 | 0.61 |
| 2 | 9 | Zen Wholesale Group | zen_wholesale_group@supplier.com | Kenya | 23 | 0.58 |
| 3 | 1 | Kenya Distributors Ltd | kenya_distributors_ltd@supplier.com | Kenya | 18 | 0.98 |
| 4 | 4 | Urban Retail Logistics | urban_retail_logistics@supplier.com | United Arab Emirates | 18 | 0.99 |


In [12]:
#Generate purchases (Restock events)
# Number of restock purchase orders to generate
num_purchases = random.randint(600, 1200)   # fairly large & realistic

purchase_records = []

for purchase_id in range(1, num_purchases + 1):

    # Random product
    product_row = df_products.sample(1).iloc[0]
    product_id = product_row["product_id"]
    supplier_id = product_row["supplier_id"]

    # Random quantity ordered
    quantity_ordered = np.random.randint(50, 1000)

    # Random purchase date (within the 18-month window)
    purchase_date = pd.to_datetime(np.random.choice(dates_list))

    # Supplier lead time
    supplier_lead_time = df_suppliers.loc[
        df_suppliers["supplier_id"] == supplier_id,
        "delivery_lead_time_days"
    ].values[0]

    # Delivery date = purchase_date + lead_time
    delivery_date = purchase_date + timedelta(days=int(supplier_lead_time))

    purchase_records.append([
        purchase_id,
        product_id,
        supplier_id,
        quantity_ordered,
        purchase_date.strftime("%Y-%m-%d"),
        delivery_date.strftime("%Y-%m-%d")
    ])

df_purchases = pd.DataFrame(
    purchase_records,
    columns=[
        "purchase_id",
        "product_id",
        "supplier_id",
        "quantity_ordered",
        "purchase_date",
        "delivery_date"
    ]
)

# Shuffle dataset to ensure no natural order
df_purchases = df_purchases.sample(frac=1).reset_index(drop=True)

df_purchases.head()


| | purchase_id | product_id | supplier_id | quantity_ordered | purchase_date | delivery_date |
|---|---|---|---|---|---|---|
| 0 | 146 | 98 | 1 | 109 | 2024-05-09 | 2024-05-27 |
| 1 | 951 | 89 | 8 | 318 | 2024-03-01 | 2024-03-11 |
| 2 | 386 | 191 | 5 | 972 | 2023-01-04 | 2023-01-21 |
| 3 | 249 | 284 | 8 | 187 | 2024-05-13 | 2024-05-23 |
| 4 | 788 | 263 | 6 | 580 | 2023-07-30 | 2023-08-13 |


In [13]:
#Generate daily inventory snapshots for each product
# Convert dates back to sorted list for sequential inventory simulation
sorted_dates = sorted(pd.to_datetime(dates_list))

inventory_snapshots = []

# Pre-calculate sales grouped by product and date for efficiency
sales_grouped = df_sales.groupby(["product_id", "sale_date"])["quantity_sold"].sum()

# Pre-calculate incoming purchases grouped by product and delivery date
purchases_grouped = df_purchases.groupby(["product_id", "delivery_date"])["quantity_ordered"].sum()

# Start daily simulation
for _, product_row in df_products.iterrows():

    product_id = product_row["product_id"]
    
    # Start with current stock from inventory master
    stock_level = int(df_inventory.loc[df_inventory["product_id"] == product_id, "current_stock"].values[0])
    
    # Simulate day-by-day changes
    for date in sorted_dates:
        date_str = date.strftime("%Y-%m-%d")

        # Subtract total sales on this date
        if (product_id, date_str) in sales_grouped.index:
            stock_level -= int(sales_grouped.loc[(product_id, date_str)])

        # Add incoming deliveries on this date
        if (product_id, date_str) in purchases_grouped.index:
            stock_level += int(purchases_grouped.loc[(product_id, date_str)])

        # Natural random fluctuation (spoilage/damage/adjustment)
        stock_level += np.random.randint(-2, 3)

        # Prevent negative stock
        stock_level = max(stock_level, 0)

        # Record snapshot
        inventory_snapshots.append([
            product_id,
            date_str,
            stock_level
        ])

# Create DataFrame
df_daily_inventory = pd.DataFrame(
    inventory_snapshots,
    columns=["product_id", "date", "stock_on_hand"]
)

# Shuffle to ensure daily rows are NOT in order
df_daily_inventory = df_daily_inventory.sample(frac=1).reset_index(drop=True)

df_daily_inventory.head()


| | product_id | date | stock_on_hand |
|---|---|---|---|
| 0 | 255 | 2024-03-03 | 31 |
| 1 | 39 | 2023-06-21 | 1332 |
| 2 | 151 | 2024-05-10 | 1573 |
| 3 | 110 | 2023-06-25 | 322 |
| 4 | 272 | 2024-04-27 | 1444 |


In [14]:
#Save all tables to sqlite database

db_name = "sales_inventory_forecasting.db"

# Create / connect to SQLite database
conn = sqlite3.connect(db_name)

# Save each DataFrame as a table
df_products.to_sql("Products", conn, if_exists="replace", index=False)
df_sales.to_sql("Sales", conn, if_exists="replace", index=False)
df_inventory.to_sql("Inventory", conn, if_exists="replace", index=False)
df_daily_inventory.to_sql("DailyInventory", conn, if_exists="replace", index=False)
df_suppliers.to_sql("Suppliers", conn, if_exists="replace", index=False)
df_purchases.to_sql("Purchases", conn, if_exists="replace", index=False)

# Quick row count check for each table
tables = ["Products", "Sales", "Inventory", "DailyInventory", "Suppliers", "Purchases"]
for table in tables:
    count = pd.read_sql_query(f"SELECT COUNT(*) AS row_count FROM {table};", conn)
    print(f"Table {table}: {count['row_count'][0]} rows")

conn.close()
print(f"\nAll tables saved successfully to '{db_name}'.")


Table Products: 300 rows
Table Sales: 4625 rows
Table Inventory: 300 rows
Table DailyInventory: 162300 rows
Table Suppliers: 9 rows
Table Purchases: 970 rows

All tables saved successfully to 'sales_inventory_forecasting.db'.
