# Business Understanding
This project addresses a core business need faced by retail and product-driven organizations: the ability to anticipate future sales and inventory requirements with accuracy. Reliable forecasting supports informed budgeting, optimized purchasing decisions, and long-term strategic planning. By modeling sales behavior and stock movement over time, the business gains visibility into demand patterns, operational risks, and the resources required to remain competitive.

## **1. Business Problem**
Retail and product-based businesses must **anticipate future sales** to:
- **Plan inventory levels**
- **Avoid stock-outs**
- **Prevent overstocking**
- **Improve budgeting and financial forecasting**

Without forecasting, the business risks:
- **Running out of stock** during demand peaks  
- **Holding too much inventory**, tying up cash  
- **Poor financial planning** and delayed decisions  

## **2. Objectives**
This project aims to:
- **Forecast monthly sales** for each product  
- **Forecast monthly revenue**  
- **Predict future inventory requirements**  
- **Identify products at risk of low stock / stock-out**  
- **Support budgeting and strategic planning** using predictive analytics  

## **3. Key Business Questions**
The forecasting system should answer:
- **What will sales look like in the next 3, 6, and 12 months?**  
- **Which products are growing or declining?**  
- **When will current stock fall below reorder level?**  
- **How much inventory will be needed in the coming months?**

## **4. Success Criteria**
The project is successful if:
- **Forecasting accuracy is acceptable**  
  (e.g., **MAPE < X%** depending on business tolerance)
- **Insights are clear and actionable** for both finance and operations  
- A **dashboard** is created to visualize sales trends, future predictions, and inventory risks  
- The system **supports budgeting, purchasing decisions, and financial planning**  
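
The accuracy criterion above relies on MAPE (mean absolute percentage error). A minimal sketch of the metric follows. The function name `mape` and the sample numbers are illustrative, and skipping months with zero actual sales is an assumption about how undefined percentages might be handled, not a decision stated in this project.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent.

    Periods with zero actual sales are skipped (an assumption), because
    the percentage error is undefined when the denominator is zero.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


# Illustrative numbers only; the third month has zero actual sales.
actual = [100, 200, 0, 50]
forecast = [110, 180, 5, 50]
print(round(mape(actual, forecast), 2))  # 6.67
```

Because zero-sales months are common for slow-moving products, the business may prefer a metric that tolerates zeros; the threshold itself remains the business-defined X%.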


# Data Understanding

This section explores the structure and richness of the dataset that drives the forecasting system. The project uses a comprehensive, multi-table design to reflect real-world retail operations, including product profiles, transactional sales data, supplier performance, and inventory conditions. Understanding how these data sources interact is essential for identifying trends, validating data quality, and selecting the appropriate forecasting approach.

## **1. Identify Data Sources / Tables**

We work with four main tables: **Products, Sales, Inventory, Suppliers.**  
Each contains detailed columns to mimic realistic business datasets.

### **Products Table (Expanded Attributes)**  
Contains core product information for forecasting, pricing, and inventory planning.

**Columns include:**

- product_id (PK)  
- product_name  
- category  
- brand  
- sku_code  
- cost_price  
- selling_price  
- weight_kg  
- dimensions  
- launch_date  
- discontinued (0/1)  
- supplier_id (FK → Suppliers)

These attributes allow advanced analysis such as product lifecycle, profitability, and category-level forecasting.

---

### **Sales Table (Highly Detailed Transaction Data)**  
Contains every sales transaction with additional operational, customer, and pricing dimensions.

**Columns include:**

- sale_id (PK)  
- product_id (FK → Products)  
- sale_date  
- quantity_sold  
- selling_price_at_time  
- revenue  
- profit  
- customer_type (retail, wholesale, online, store)  
- region (e.g., Nairobi, Mombasa, Kisumu...)  
- payment_method (cash, mpesa, card, transfer)  
- discount_applied  
- order_channel (website, store, app)  
- promotion_flag (0/1)

This table is critical for understanding demand, seasonality, price effects, customer mix, and regional patterns.

---

### **Inventory Table (Operational Inventory Data)**  
Contains detailed stock status and logistics-related attributes.

**Columns include:**

- product_id (FK → Products)  
- current_stock  
- reorder_level  
- safety_stock  
- lead_time_days  
- last_restock_date  
- warehouse_location  
- max_capacity  
- stock_value

This supports forecasting future inventory needs, flagging risks, and planning replenishment cycles.
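
As a rough illustration of how `current_stock` and `reorder_level` could be combined with a demand forecast, the sketch below projects stock forward month by month. The function name `months_until_reorder` is hypothetical, and the sketch assumes no replenishment arrives during the forecast window.

```python
def months_until_reorder(current_stock, reorder_level, forecast_monthly_demand):
    """Return the 1-based month in which projected stock first drops below
    reorder_level, or None if it never does within the forecast.

    Assumes no restocking during the forecast window.
    """
    stock = current_stock
    for month, demand in enumerate(forecast_monthly_demand, start=1):
        stock -= demand
        if stock < reorder_level:
            return month
    return None


# Stock goes 100 -> 75 -> 50 -> 25, dropping below 30 in month 3.
print(months_until_reorder(100, 30, [25, 25, 25, 25]))  # 3
```

A fuller version would likely also account for `lead_time_days` and `safety_stock`, which this sketch ignores.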

---

### **Suppliers Table (Optional but Realistic)**  
Provides external sourcing information for products.

**Columns include:**

- supplier_id (PK)  
- supplier_name  
- contact_email  
- country  
- delivery_lead_time_days  
- reliability_score

This helps simulate lead-time impacts on inventory forecasting.

---

## **2. Entity Relationship (ER) Diagram**

The database structure forms a clean relational model:

### **Core Relationships**
- **Products.product_id → Sales.product_id**  
  (Each product has many sales)

- **Products.product_id → Inventory.product_id**  
  (Each product has one inventory record)

- **Suppliers.supplier_id → Products.supplier_id**  
  (Each product has one supplier)

This structure supports complex joins for forecasting, trend analysis, and inventory management.
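
The relationships listed above can be sketched as SQLite DDL through Python's built-in `sqlite3` module. This is an abbreviated sketch, not the project's actual schema: the column types, the lowercase table names, and the choice of SQLite are assumptions, and most descriptive columns are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE suppliers (
    supplier_id   INTEGER PRIMARY KEY,
    supplier_name TEXT NOT NULL
);
CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    supplier_id  INTEGER NOT NULL REFERENCES suppliers(supplier_id)
);
-- product_id is the primary key, so each product has at most one inventory row
CREATE TABLE inventory (
    product_id    INTEGER PRIMARY KEY REFERENCES products(product_id),
    current_stock INTEGER NOT NULL,
    reorder_level INTEGER NOT NULL
);
CREATE TABLE sales (
    sale_id       INTEGER PRIMARY KEY,
    product_id    INTEGER NOT NULL REFERENCES products(product_id),
    sale_date     TEXT NOT NULL,
    quantity_sold INTEGER NOT NULL
);
""")

conn.execute("INSERT INTO suppliers VALUES (1, 'Acme Supplies')")
conn.execute("INSERT INTO products VALUES (10, 'Widget', 1)")
conn.execute("INSERT INTO inventory VALUES (10, 40, 25)")

# A sale that points at a non-existent product should be rejected.
try:
    conn.execute("INSERT INTO sales VALUES (1, 999, '2024-01-15', 3)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print("orphan sale rejected:", fk_enforced)
```

Declaring the foreign keys in the schema lets the database reject orphan rows at insert time, which complements any integrity checks run later on loaded data.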

---

## **3. Initial Data Exploration**

Once the synthetic dataset is generated, we will perform:

### **A. Date Range Checks**
- Identify the earliest and latest sale dates  
- Confirm the dataset spans the full 3-year period  
- Ensure no gaps in date continuity
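
One possible way to run these date checks is sketched below. The helper `date_coverage` and the sample dates are hypothetical; the sketch assumes sale dates have already been parsed into `datetime.date` objects.

```python
from datetime import date, timedelta


def date_coverage(sale_dates):
    """Return (earliest, latest, missing_days) for an iterable of dates."""
    present = set(sale_dates)
    first, last = min(present), max(present)
    span = (last - first).days + 1
    missing = [
        first + timedelta(days=i)
        for i in range(span)
        if first + timedelta(days=i) not in present
    ]
    return first, last, missing


# Illustrative dates: 3 and 4 January have no sales.
sample = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)]
first, last, missing = date_coverage(sample)
print(first, last, missing)
```

Comparing `first` and `last` against the intended three-year window confirms the span, while `missing` lists the calendar days with no transactions at all.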

### **B. Row Counts**
- Products: ~10–20  
- Inventory: one row per product (same count as Products)  
- Suppliers: 3–6  
- **Sales: 3,000+ entries (randomly generated and deliberately noisy)**

### **C. Sample Previews**
Examine the first few rows from each table to verify:
- Column names and data types  
- Correct foreign key mappings  
- Realistic pricing, quantities, and profit values  
- Random distribution of customer types, regions, payments

---

## **4. Understand Sales Frequency & Patterns**

Because the dataset is synthetic and randomly generated, we verify that:

- Sales occur **daily or at random intervals** across the timeline  
- Each product exhibits **unique demand patterns**  
- Seasonal boosts (e.g., holidays, mid-year sales) appear where they were applied  
- Regions, customer types, and discounts vary across transactions

Understanding these patterns helps in choosing appropriate forecasting models.


# Data Preparation
This stage prepares the dataset for analysis and forecasting by establishing a structured schema, enforcing data quality, and transforming raw information into meaningful analytical formats. Multiple tables are designed with detailed attributes that support analytics, forecasting, and operational insights. Cleaning, validating, and aggregating the data ensures that the modeling pipeline is built on accurate, complete, and well-organized information.

## **1. Creating the Database Structure**

Four main tables are prepared for this project:

### **Products Table**
Contains expanded product attributes such as:
- category  
- brand  
- sku_code  
- pricing  
- physical characteristics  
- lifecycle status  

### **Sales Table**
Stores granular sales transactions with:
- customer characteristics  
- pricing changes  
- payment methods  
- regions  
- discounts & promotions  

### **Inventory Table**
Provides operational stock information:
- current inventory  
- safety thresholds  
- lead times  
- warehouse locations  
- capacity metrics  

### **Suppliers Table**
Includes:
- supplier performance  
- delivery times  
- reliability score  

These tables create a comprehensive structure suited for demand forecasting and inventory planning.

---

## **2. Cleaning and Checking the Source Data**

Before generating modeling tables, the following checks are performed:

### **A. Verify Column Types**
- Dates are formatted as proper date types  
- Prices and quantities are stored as numeric types  
- Categorical values are standardized (e.g., region lists, payment methods)

### **B. Remove Irregularities**
- Invalid dates  
- Negative sales quantities  
- Missing product IDs  
- Incorrect category labels  
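
A minimal sketch of how the sales-related checks might be applied is shown below. It assumes raw rows arrive as dictionaries with string values and ISO-formatted dates (`YYYY-MM-DD`); the function name `clean_sales` is hypothetical. Category-label validation applies to the Products table and is not covered here.

```python
from datetime import datetime


def clean_sales(rows):
    """Split raw sales rows into (clean, rejected) lists."""
    clean, rejected = [], []
    for row in rows:
        try:
            parsed = {
                "product_id": row["product_id"],
                "sale_date": datetime.strptime(row["sale_date"], "%Y-%m-%d").date(),
                "quantity_sold": int(row["quantity_sold"]),
            }
        except (KeyError, TypeError, ValueError):
            rejected.append(row)  # missing field, invalid date, or non-numeric quantity
            continue
        if parsed["product_id"] is None or parsed["quantity_sold"] < 0:
            rejected.append(row)  # missing product ID or negative quantity
        else:
            clean.append(parsed)
    return clean, rejected


raw = [
    {"product_id": 1, "sale_date": "2024-01-15", "quantity_sold": "3"},
    {"product_id": 1, "sale_date": "2024-02-30", "quantity_sold": "2"},     # invalid date
    {"product_id": 1, "sale_date": "2024-01-16", "quantity_sold": "-4"},    # negative quantity
    {"product_id": None, "sale_date": "2024-01-17", "quantity_sold": "1"},  # missing product ID
]
clean, rejected = clean_sales(raw)
print(len(clean), len(rejected))  # 1 3
```

Keeping the rejected rows instead of silently dropping them makes it possible to report how much data each rule removed.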

### **C. Ensure Referential Integrity**
- Every `Sales.product_id` must exist in `Products`  
- Every `Inventory.product_id` must exist in `Products`  
- Every `Products.supplier_id` must exist in `Suppliers`

These checks ensure the dataset is clean and stable for forecasting.
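
For data that has already been loaded without enforced constraints, orphan rows can be located with an anti-join. The sketch below uses Python's built-in `sqlite3` module with a reduced, hypothetical two-table schema; the same query shape would apply to the `Inventory` and `Products.supplier_id` checks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY);
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER);
INSERT INTO products VALUES (1), (2);
INSERT INTO sales VALUES (100, 1), (101, 2), (102, 7);
""")

# Sales rows with no matching product: the LEFT JOIN leaves p.product_id NULL.
orphans = conn.execute("""
    SELECT s.sale_id, s.product_id
    FROM sales AS s
    LEFT JOIN products AS p ON p.product_id = s.product_id
    WHERE p.product_id IS NULL
""").fetchall()
print(orphans)  # [(102, 7)]
```

An empty result means the relationship holds. Sales rows whose `product_id` is NULL would also appear in this result, since they match no product.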

---

## **3. Transforming Sales Data for Forecasting**

To achieve meaningful forecasting, raw daily transactions are transformed into structured time-series data.

### **A. Aggregate Daily Sales → Monthly Sales**
Data is grouped by:
- product_id  
- year-month  

Metrics include:
- total monthly quantity sold  
- total monthly revenue  
- total monthly profit  
- number of transactions  
- average discount  
- region-based breakdowns (optional)
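
The grouping described above can be sketched with the standard library alone. The function `aggregate_monthly` and the sample rows are illustrative, and only three of the listed metrics are computed to keep the example short.

```python
from collections import defaultdict
from datetime import date


def aggregate_monthly(sales):
    """Group transactions by (product_id, 'YYYY-MM') and sum key metrics."""
    totals = defaultdict(lambda: {"quantity": 0, "revenue": 0.0, "transactions": 0})
    for s in sales:
        key = (s["product_id"], s["sale_date"].strftime("%Y-%m"))
        totals[key]["quantity"] += s["quantity_sold"]
        totals[key]["revenue"] += s["revenue"]
        totals[key]["transactions"] += 1
    return dict(totals)


sales = [
    {"product_id": 1, "sale_date": date(2024, 1, 5), "quantity_sold": 2, "revenue": 20.0},
    {"product_id": 1, "sale_date": date(2024, 1, 20), "quantity_sold": 3, "revenue": 30.0},
    {"product_id": 1, "sale_date": date(2024, 2, 1), "quantity_sold": 1, "revenue": 10.0},
]
monthly = aggregate_monthly(sales)
print(monthly[(1, "2024-01")])
```

Profit and average discount would follow the same pattern, with the average computed as a running sum divided by the transaction count.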

### **B. Fill Missing Time Periods**
If a product has no sales in a month, a zero-sales row is inserted to maintain continuous time-series structure.
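
A minimal sketch of this zero-filling step follows. It assumes monthly quantities are keyed by `(product_id, year_month)` with `year_month` formatted as `YYYY-MM`; the helper names `month_range` and `fill_missing_months` are hypothetical.

```python
def month_range(start, end):
    """Yield 'YYYY-MM' labels from start to end, inclusive."""
    year, month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    while (year, month) <= (end_year, end_month):
        yield f"{year:04d}-{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def fill_missing_months(monthly_qty, product_ids, start, end):
    """Return a dense {(product_id, year_month): quantity} mapping,
    inserting 0 for every month without recorded sales."""
    return {
        (pid, ym): monthly_qty.get((pid, ym), 0)
        for pid in product_ids
        for ym in month_range(start, end)
    }


# December 2023 has no sales for product 1 and is filled with zero.
sparse = {(1, "2023-11"): 5, (1, "2024-01"): 2}
dense = fill_missing_months(sparse, [1], "2023-11", "2024-01")
print(dense)
```

Whether months before a product's `launch_date` should be filled with zeros or excluded is a modeling decision that this sketch does not make.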

### **C. Create Modeling Tables**
The following datasets are prepared:

#### **1. monthly_sales**
Contains:
- product_id  
- year_month  
- monthly_quantity  
- monthly_revenue  
- monthly_profit  
- average_discount  
- dominant_region  
- promo_activity_flag  

#### **2. monthly_inventory_snapshot**
Contains:
- product_id  
- year_month  
- starting_stock  
- ending_stock  
- stockouts_occurred  
- lead_time_days  

#### **3. product_master**
Contains static attributes such as:
- category  
- brand  
- cost_price  
- selling_price  
- weight & dimensions  
- supplier info  

These datasets feed into the forecasting models.

---

## **4. Assess Data Quality Before Modeling**

Before moving to the modeling phase, the following assessments are made:

- Check for **missing months**  
- Detect **outliers** (extremely high sales spikes)  
- Confirm **seasonality patterns** (holidays, mid-year boosts)  
- Verify the final dataset contains **at least 3,000 random sales rows**  
- Ensure the time series is **continuous** for each product

This guarantees a stable input for the forecasting process.
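
As one possible approach to the outlier check, the sketch below flags monthly quantities outside the interquartile-range fences. The IQR rule and the multiplier `k=1.5` are assumptions chosen for illustration; the project does not prescribe a specific outlier method.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Illustrative monthly quantities for one product; index 6 is an obvious spike.
monthly_quantity = [20, 22, 19, 25, 21, 23, 120, 24, 20, 22, 21, 23]
spikes = iqr_outliers(monthly_quantity)
print(spikes)  # [(6, 120)]
```

Flagged months are candidates for review, not automatic removal, since a spike may reflect a genuine seasonal boost or promotion.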

