### Step 1: Generate Semiconductor Supplier List
1. This script creates a realistic supplier dataset for a semiconductor company.
2. Includes: Supplier ID, Name, Country, Tier Level, On-Time Delivery Rating.
3. Tier 1 = critical, high-value suppliers; Tier 2 = secondary or less critical.
4. Ratings are percentages (sampled between 85% and 99%) and will be used later in supplier performance analysis.
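
The supplier cell below draws from Python's global `random` state without seeding it first (the seed only appears in the next step), so the table changes on every run. A minimal sketch of one way to make the tier draw repeatable, assuming a local seeded generator is acceptable; the helper name `make_tiers` and the seed value are illustrative and not part of the original notebook:

```python
import random

def make_tiers(n, seed):
    """Draw n tier labels with a 60/40 Tier 1 / Tier 2 split (hypothetical helper)."""
    rng = random.Random(seed)  # local generator; leaves the global random state untouched
    return rng.choices(["Tier 1", "Tier 2"], weights=[0.6, 0.4], k=n)

# Two runs with the same seed give the same labels.
run_a = make_tiers(10_000, seed=42)
run_b = make_tiers(10_000, seed=42)
tier1_share = run_a.count("Tier 1") / len(run_a)

print("identical across runs:", run_a == run_b)
print(f"Tier 1 share: {tier1_share:.3f}")
```

With only 20 suppliers the realized Tier 1 share can sit well away from 60%; the large sample here is only meant to show that the weights behave as intended.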

In [1]:
import pandas as pd
import random

# Predefined realistic supplier names (fictional but industry-inspired)
supplier_names = [
    "Nippon Substrate Co.", "FormoTech Packaging", "Taiwan Advanced Materials",
    "Korea Probe Solutions", "Silicon Precision Works", "Tokyo WaferTech",
    "GlobalTest Equipment", "Micron Substrates EU", "United Semiconductor Parts",
    "Shenzhen Bonding Supplies", "ASE Materials Division", "SPIL Components Ltd.",
    "Hanwa Leadframes", "Hitachi Bonding Wires", "K&S Precision Tools",
    "Amkor Assembly Supplies", "Infineon Packaging Materials", "ST Micro Parts Asia",
    "Applied Materials Korea", "Nanotech Packaging Taiwan"
]

# Candidate supplier countries (actual IC supply hubs). Countries are assigned
# at random below, so a supplier's name and country may not match.
countries = [
    "Japan", "Taiwan", "Korea", "United States", "China", "Germany", "Singapore", "Malaysia"
]

# Generate data
suppliers_data = []
for i, name in enumerate(supplier_names, start=1):
    supplier_id = f"S{i:03d}"
    country = random.choice(countries)
    tier_level = random.choices(["Tier 1", "Tier 2"], weights=[0.6, 0.4])[0]
    on_time_rating = round(random.uniform(85, 99), 2)  # percentage
    suppliers_data.append([supplier_id, name, country, tier_level, on_time_rating])

# Create DataFrame
suppliers_df = pd.DataFrame(suppliers_data, columns=[
    "supplier_id", "supplier_name", "country", "tier_level", "on_time_rating"
])

# Save to CSV for later MySQL loading
suppliers_df.to_csv("suppliers.csv", index=False)

print("Supplier list generated and saved to suppliers.csv")
print(suppliers_df.head())

Supplier list generated and saved to suppliers.csv
  supplier_id              supplier_name        country tier_level  \
0        S001       Nippon Substrate Co.          Japan     Tier 2   
1        S002        FormoTech Packaging       Malaysia     Tier 1   
2        S003  Taiwan Advanced Materials  United States     Tier 1   
3        S004      Korea Probe Solutions          Korea     Tier 1   
4        S005    Silicon Precision Works      Singapore     Tier 1   

   on_time_rating  
0           88.76  
1           94.34  
2           88.20  
3           92.80  
4           92.33  


### Generate BOM Components with Categories, Unit Costs, and Lead Times

1. This script creates a realistic Bill of Materials (BOM) for a semiconductor packaging/testing context.
2. Fields: component_id, component_name, category, unit_cost_usd, lead_time_days
    - Categories reflect real IC packaging/test materials & parts.
    - Unit costs and lead times are sampled from realistic ranges per category.
    - Output is saved to components.csv for downstream loading.
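
Lead times in this step are drawn from the category range and then multiplied by log-normal noise, so individual values can land outside the nominal range (the sample output shows a 77-day wafer against a 35–70 day range). A rough standard-library sketch of that sampling, assuming the same sigma of 0.15 and 7-day floor as the cell below; `sample_lead_time` is a hypothetical helper, and `random.Random` stands in for the notebook's NumPy calls:

```python
import random

def sample_lead_time(lt_range, rng, sigma=0.15, floor_days=7):
    """Uniform base lead time times log-normal noise, floored at floor_days."""
    base = rng.uniform(*lt_range)
    noise = rng.lognormvariate(0.0, sigma)  # median 1.0, mild right skew
    return int(max(floor_days, round(base * noise)))

rng = random.Random(0)
nominal = (35, 70)  # Silicon Wafers range from CATEGORY_SPECS
samples = [sample_lead_time(nominal, rng) for _ in range(10_000)]

# Share of draws that the noise pushes past the nominal upper bound.
share_above = sum(s > nominal[1] for s in samples) / len(samples)
print(f"min={min(samples)}, max={max(samples)}, share above {nominal[1]} days: {share_above:.1%}")
```

If values must stay inside the category range, clipping after the noise step is one option, at the cost of piling probability mass on the bounds.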

In [2]:
import numpy as np

random.seed(42)
np.random.seed(42)

# Category specifications: (name, unit_cost_range_usd, lead_time_days_range)
CATEGORY_SPECS = [
    ("Silicon Wafers", (80, 450), (35, 70)),
    ("Organic Substrates", (0.6, 3.5), (28, 56)),
    ("Leadframes", (0.05, 0.45), (21, 42)),
    ("Bonding Wire (Au/Cu)", (0.12, 0.9), (21, 49)),
    ("Mold Compound", (2.0, 8.0), (28, 56)),
    ("Die Attach Film/Paste", (4.0, 16.0), (28, 63)),
    ("Underfill/Epoxy", (3.0, 14.0), (28, 63)),
    ("Solder Balls/Spheres", (0.01, 0.08), (21, 42)),
    ("Probe Cards", (2000, 12000), (56, 98)),
    ("Test Sockets", (60, 380), (35, 77)),
    ("Carrier Tapes/Trays", (0.03, 0.25), (14, 35)),
    ("Nozzles/Capillaries", (15, 95), (21, 49)),
]

# Helper to create realistic component names within each category
NAME_TEMPLATES = {
    "Silicon Wafers": lambda i: f"200mm Si Wafer {i}" if i % 2 else f"300mm Si Wafer {i}",
    "Organic Substrates": lambda i: f"ABF Substrate {i}",
    "Leadframes": lambda i: f"Cu Leadframe QFN-{3+i%5}x{3+i%5} #{i}",
    "Bonding Wire (Au/Cu)": lambda i: f"{'Au' if i%3 else 'Cu'} Wire Ø{0.6 + (i%5)*0.05:.2f}mil #{i}",
    "Mold Compound": lambda i: f"EMC Low-Alpha Grade {i}",
    "Die Attach Film/Paste": lambda i: f"DAF {25 + (i%6)*5}µm #{i}",
    "Underfill/Epoxy": lambda i: f"Capillary Underfill UF-{100 + (i%7)*20} #{i}",
    "Solder Balls/Spheres": lambda i: f"SnAgCu BGA Φ{0.25 + (i%6)*0.05:.2f}mm #{i}",
    "Probe Cards": lambda i: f"MEMS Probe Card {i}",
    "Test Sockets": lambda i: f"BGA Test Socket {0.4 + (i%6)*0.1:.1f}mm pitch #{i}",
    "Carrier Tapes/Trays": lambda i: f"JEDEC Tray {i}",
    "Nozzles/Capillaries": lambda i: f"Bond Capillary ID{30 + (i%8)*5}µm #{i}",
}

# Target total SKUs (within the 200–300 range from the plan)
TOTAL_SKUS = 240

# Allocate SKUs roughly proportional to category importance/cost impact
weights = np.array([10, 24, 20, 18, 16, 16, 14, 20, 4, 10, 24, 14], dtype=float)
weights = weights / weights.sum()
allocations = (weights * TOTAL_SKUS).round().astype(int)

# Adjust to hit exactly TOTAL_SKUS
diff = TOTAL_SKUS - allocations.sum()
for k in range(abs(diff)):
    allocations[k % len(allocations)] += 1 if diff > 0 else -1

rows = []
comp_counter = 1
for (cat, cost_rng, lt_rng), count in zip(CATEGORY_SPECS, allocations):
    for i in range(count):
        unit_cost = float(np.round(np.random.uniform(*cost_rng), 2))
        # Add mild log-normal noise to lead time to reflect variability
        base_lt = np.random.uniform(*lt_rng)
        lt_noise = np.random.lognormal(mean=0.0, sigma=0.15)
        lead_time_days = int(max(7, round(base_lt * lt_noise)))

        name_fn = NAME_TEMPLATES[cat]
        component_name = name_fn(i + 1)

        rows.append({
            "component_id": f"C{comp_counter:04d}",
            "component_name": component_name,
            "category": cat,
            "unit_cost_usd": unit_cost,
            "lead_time_days": lead_time_days
        })
        comp_counter += 1

components_df = pd.DataFrame(rows)

# Basic sanity checks: ranges and duplicates
assert components_df["component_id"].is_unique, "component_id must be unique"
assert components_df["unit_cost_usd"].gt(0).all(), "unit_cost_usd must be positive"
assert components_df["lead_time_days"].ge(7).all(), "lead_time_days must be >= 7 days"

# Save for downstream use
components_df.to_csv("components.csv", index=False)

print(" Components BOM generated:", components_df.shape[0], "rows -> components.csv")
print(components_df.sample(8, random_state=7))

 Components BOM generated: 240 rows -> components.csv
    component_id                   component_name               category  \
80         C0081             Au Wire Ø0.75mil #13   Bonding Wire (Au/Cu)   
129        C0130                     DAF 30µm #19  Die Attach Film/Paste   
3          C0004                 300mm Si Wafer 4         Silicon Wafers   
205        C0206                    JEDEC Tray 14    Carrier Tapes/Trays   
148        C0149   Capillary Underfill UF-180 #18        Underfill/Epoxy   
190        C0191  BGA Test Socket 0.4mm pitch #12           Test Sockets   
94         C0095            EMC Low-Alpha Grade 4          Mold Compound   
88         C0089             Cu Wire Ø0.65mil #21   Bonding Wire (Au/Cu)   

     unit_cost_usd  lead_time_days  
80            0.40              32  
129           5.72              52  
3            87.62              77  
205           0.07              31  
148           8.10              36  
190          94.30              42  
94

### Generate Realistic Delivery Logs (orders, ETAs, actual receipts)

1. Creates delivery_logs with realistic order/expected/actual dates and quantities.
    - Reads suppliers.csv and components.csv from prior steps.
    - Expected delivery = order_date + component lead time (days).
    - Actual delivery reflects delays driven by supplier on-time ratings & randomness.
    - Quantities scale by component category; partial receipts & shorts included.
    - ~2% missing actual_delivery_date to mimic real-world data gaps.
    - Output: delivery_logs.csv
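
To see how the on-time rating drives delay, the delay logic from the cell below can be isolated into a single function and compared for a weak and a strong supplier. This is a sketch under the same constants (0.6 days per rating point below 95, sigma 0.35, 3% disruption chance); `sample_delay_days` is a hypothetical name, and the standard-library `random` module replaces the NumPy calls:

```python
import math
import random

def sample_delay_days(rating, rng):
    """Sample one delivery delay in days for a supplier on-time rating (percent)."""
    mean_delay = max(0.0, (95 - rating) * 0.6)
    extra = rng.lognormvariate(math.log(1 + mean_delay / 5 + 1e-6), 0.35) - 1
    if rng.random() < 0.03:
        # Rare disruption: 5 to 20 extra days (inclusive), like np.random.randint(5, 21)
        extra += rng.randint(5, 20)
    return int(round(max(0.0, extra)))

rng = random.Random(7)
n = 20_000
avg_delay = {
    rating: sum(sample_delay_days(rating, rng) for _ in range(n)) / n
    for rating in (86, 98)
}
for rating, avg in avg_delay.items():
    print(f"rating {rating}%: average delay {avg:.2f} days")
```

Ratings of 95% and above all share the same base distribution, because `mean_delay` is floored at zero; only the rare disruption term separates them from a perfect supplier.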

In [3]:
# ---------- Config ----------
np.random.seed(7)

# Simulation horizon (18 months back from today)
END_DATE = pd.Timestamp.today().normalize()
START_DATE = END_DATE - pd.DateOffset(months=18)

# Order intensity per category (relative weekly volume)
CATEGORY_QTY_SCALE = {
    "Silicon Wafers": 50,
    "Organic Substrates": 4000,
    "Leadframes": 12000,
    "Bonding Wire (Au/Cu)": 8000,
    "Mold Compound": 3000,
    "Die Attach Film/Paste": 2500,
    "Underfill/Epoxy": 2200,
    "Solder Balls/Spheres": 20000,
    "Probe Cards": 6,
    "Test Sockets": 80,
    "Carrier Tapes/Trays": 15000,
    "Nozzles/Capillaries": 300
}

# Weekly order probability by category (how often we place orders)
CATEGORY_ORDER_PROB = {
    "Silicon Wafers": 0.25,
    "Organic Substrates": 0.65,
    "Leadframes": 0.7,
    "Bonding Wire (Au/Cu)": 0.6,
    "Mold Compound": 0.5,
    "Die Attach Film/Paste": 0.45,
    "Underfill/Epoxy": 0.45,
    "Solder Balls/Spheres": 0.7,
    "Probe Cards": 0.08,
    "Test Sockets": 0.2,
    "Carrier Tapes/Trays": 0.55,
    "Nozzles/Capillaries": 0.35
}

# ---------- Load prior artifacts ----------
suppliers = pd.read_csv("suppliers.csv")
components = pd.read_csv("components.csv")

# Ensure expected columns exist
assert {"supplier_id","tier_level","on_time_rating"}.issubset(suppliers.columns)
assert {"component_id","category","lead_time_days"}.issubset(components.columns)

# Helper: pick a supplier (bias Tier 1 slightly, but allow Tier 2)
tier_weight = suppliers["tier_level"].map({"Tier 1": 1.3, "Tier 2": 0.7}).fillna(1.0)
supplier_probs = tier_weight / tier_weight.sum()

# Build a weekly calendar; orders are generated on random weekdays
weeks = pd.date_range(START_DATE, END_DATE, freq="W-MON")

rows = []
delivery_counter = 1

for _, comp in components.iterrows():
    cat = comp["category"]
    comp_id = comp["component_id"]
    base_qty = CATEGORY_QTY_SCALE.get(cat, 1000)
    order_prob = CATEGORY_ORDER_PROB.get(cat, 0.4)

    # Determine how many weeks we place orders for this component
    for wk_start in weeks:
        if np.random.rand() > order_prob:
            continue

        # Randomize intra-week order day
        order_date = wk_start + pd.Timedelta(days=int(np.random.randint(0, 5)))

        # Choose supplier with weighted probability
        idx = np.random.choice(suppliers.index, p=supplier_probs.values)
        sup = suppliers.loc[idx]

        # Expected delivery = order + component lead time (+/- small planning jitter)
        lead = int(comp["lead_time_days"])
        plan_jitter = int(np.random.normal(loc=0, scale=max(2, lead * 0.05)))
        expected_delivery_date = order_date + pd.Timedelta(days=max(1, lead + plan_jitter))

        # Actual delivery delay driven by on-time rating (lower rating -> more delay)
        # Convert rating (85-99%) to delay distribution
        rating = float(sup["on_time_rating"])
        # Mean extra delay increases as rating drops; add occasional severe tails
        mean_delay = max(0, (95 - rating) * 0.6)  # days
        extra_delay = np.random.lognormal(mean=np.log(1 + mean_delay/5 + 1e-6), sigma=0.35) - 1
        # Rare long-tail disruptions
        if np.random.rand() < 0.03:
            extra_delay += np.random.randint(5, 21)

        delay_days = int(round(max(0, extra_delay)))
        actual_delivery_date = expected_delivery_date + pd.Timedelta(days=delay_days)

        # Quantity ordered: lognormal around base with moderate variance, min 1
        qty_ordered = int(max(1, np.random.lognormal(mean=np.log(max(1, base_qty)), sigma=0.5)))

        # Partial receipts/shorts (5% chance), else full receipt
        if np.random.rand() < 0.05:
            short_ratio = np.clip(np.random.beta(2, 10), 0.02, 0.25)
            qty_received = int(max(0, round(qty_ordered * (1 - short_ratio))))
        else:
            qty_received = qty_ordered

        # Status from schedule adherence. The NaT branch is defensive only:
        # missing dates are injected after the loop, not here.
        if pd.isna(actual_delivery_date):
            delivery_status = "Unknown"
        elif actual_delivery_date > expected_delivery_date:
            delivery_status = "Delayed"
        elif qty_received < qty_ordered:
            delivery_status = "Partial"
        else:
            delivery_status = "On-Time"

        rows.append({
            "delivery_id": f"D{delivery_counter:07d}",
            "supplier_id": sup["supplier_id"],
            "component_id": comp_id,
            "order_date": order_date.date(),
            "expected_delivery_date": expected_delivery_date.date(),
            "actual_delivery_date": actual_delivery_date.date(),
            "quantity_ordered": qty_ordered,
            "quantity_received": qty_received,
            "delivery_status": delivery_status
        })
        delivery_counter += 1

# Assemble DataFrame
dl = pd.DataFrame(rows)

# Inject ~2% missing actual_delivery_date to emulate data gaps
mask_missing = np.random.rand(len(dl)) < 0.02
dl.loc[mask_missing, "actual_delivery_date"] = pd.NaT
dl.loc[mask_missing, "delivery_status"] = "Unknown"

# Light sanity checks
assert dl["quantity_ordered"].ge(1).all(), "quantity_ordered must be >= 1"
assert set(["On-Time","Delayed","Partial","Unknown"]).issuperset(set(dl["delivery_status"].unique()))

# Save
dl.sort_values("order_date", inplace=True)
dl.to_csv("delivery_logs.csv", index=False)

print(f" Delivery logs generated: {len(dl):,} rows -> delivery_logs.csv")
print(dl.head(8))

 Delivery logs generated: 9,650 rows -> delivery_logs.csv
     delivery_id supplier_id component_id  order_date expected_delivery_date  \
8668    D0008669        S001        C0212  2024-02-19             2024-03-03   
2175    D0002176        S015        C0052  2024-02-19             2024-03-21   
449     D0000450        S005        C0018  2024-02-19             2024-03-25   
498     D0000499        S008        C0019  2024-02-19             2024-04-08   
7347    D0007348        S016        C0170  2024-02-19             2024-03-11   
2505    D0002506        S016        C0058  2024-02-19             2024-03-22   
9304    D0009305        S004        C0229  2024-02-19             2024-03-19   
7713    D0007714        S015        C0184  2024-02-19             2024-05-06   

     actual_delivery_date  quantity_ordered  quantity_received delivery_status  
8668           2024-03-04             15347              15347         Delayed  
2175           2024-03-21             19149              19

### Simulate Daily Inventory Levels with Stock‑In/Stock‑Out Events


1. Generates daily inventory per component using:
    - Deliveries from delivery_logs (stock_in on actual_delivery_date)
    - Stochastic daily consumption (stock_out) by component category
    - Safety stock heuristic to initialize opening balances

Output schema: date, component_id, opening_stock, stock_in, stock_out, closing_stock
- File: inventory_levels.csv

2. Notes:
    - Uses category-based demand rates (aligned with our delivery log scales).
    - Introduces realistic demand spikes and occasional negative balances (rare), which we will clean in Task 2 (data quality issues).
    - Produces about 150k rows in the run shown below (240 components × 630 calendar days = 151,200); the exact count depends on the run date.
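
The ledger follows a simple recurrence: each day's closing stock is opening + stock_in - stock_out, and that closing value becomes the next day's opening stock. A small deterministic sketch with made-up quantities (the function name `build_ledger` and the numbers are illustrative only):

```python
def build_ledger(opening, stock_in, stock_out):
    """Roll a stock balance forward day by day; returns one dict per day."""
    rows = []
    for day, (s_in, s_out) in enumerate(zip(stock_in, stock_out)):
        closing = opening + s_in - s_out
        rows.append({
            "day": day,
            "opening_stock": opening,
            "stock_in": s_in,
            "stock_out": s_out,
            "closing_stock": closing,
        })
        opening = closing  # carry the balance into the next day
    return rows

ledger = build_ledger(opening=100, stock_in=[0, 50, 0], stock_out=[30, 20, 120])
negative_days = [r["day"] for r in ledger if r["closing_stock"] < 0]

print([r["closing_stock"] for r in ledger])  # [70, 100, -20]
print("days with negative balance:", negative_days)  # [2]
```

Because consumption is not capped at available stock, a large stock_out can push the balance below zero, which is the kind of record the later cleaning task is meant to catch.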

In [4]:
np.random.seed(11)

# ---------- Load prior artifacts ----------
components = pd.read_csv("components.csv")
deliveries = pd.read_csv(
    "delivery_logs.csv",
    parse_dates=["order_date", "expected_delivery_date", "actual_delivery_date"]
)

# Build simulation calendar
start_date = min(
    deliveries["order_date"].min(),
    deliveries["actual_delivery_date"].min(skipna=True)
)
end_date = max(
    deliveries["expected_delivery_date"].max(),
    deliveries["actual_delivery_date"].max(skipna=True)
)
calendar = pd.date_range(start_date, end_date, freq="D")

# Category baseline: weekly -> daily demand
CATEGORY_QTY_SCALE_WEEKLY = {
    "Silicon Wafers": 50,
    "Organic Substrates": 4000,
    "Leadframes": 12000,
    "Bonding Wire (Au/Cu)": 8000,
    "Mold Compound": 3000,
    "Die Attach Film/Paste": 2500,
    "Underfill/Epoxy": 2200,
    "Solder Balls/Spheres": 20000,
    "Probe Cards": 6,
    "Test Sockets": 80,
    "Carrier Tapes/Trays": 15000,
    "Nozzles/Capillaries": 300
}
CATEGORY_DAILY_BASE = {k: max(1, v/7) for k, v in CATEGORY_QTY_SCALE_WEEKLY.items()}

# Pre-aggregate deliveries by component & date
inbound = (
    deliveries
    .dropna(subset=["actual_delivery_date"])
    .assign(date=lambda df: df["actual_delivery_date"].dt.date)  # ensure date column exists
    .groupby(["component_id", "date"], as_index=False)["quantity_received"]
    .sum()
    .rename(columns={"quantity_received": "stock_in"})
)

# Prepare per-component demand parameters
comp_meta = components[["component_id", "category", "lead_time_days"]].copy()
comp_meta["daily_base"] = comp_meta["category"].map(CATEGORY_DAILY_BASE).astype(float)

rng = np.random.default_rng(11)
comp_meta["demand_multiplier"] = rng.normal(loc=1.0, scale=0.25, size=len(comp_meta)).clip(0.5, 1.6)
comp_meta["daily_demand_mu"] = (comp_meta["daily_base"] * comp_meta["demand_multiplier"]).clip(0.5, None)

# Safety stock heuristic
comp_meta["safety_stock"] = np.ceil(
    comp_meta["daily_demand_mu"] * rng.integers(10, 17, size=len(comp_meta)) *
    (1 + (comp_meta["lead_time_days"]/90).clip(0, 0.6))
).astype(int)

# Inject occasional negative initial stock (data quality issue)
glitch_mask = rng.random(len(comp_meta)) < 0.01
comp_meta.loc[glitch_mask, "safety_stock"] *= -1

daily_mu = dict(zip(comp_meta["component_id"], comp_meta["daily_demand_mu"]))
safety_stock = dict(zip(comp_meta["component_id"], comp_meta["safety_stock"]))

# Build inventory ledger
records = []
for comp_id in comp_meta["component_id"]:
    opening = int(safety_stock[comp_id])

    comp_inbound = inbound[inbound["component_id"] == comp_id]
    inbound_by_date = dict(zip(comp_inbound["date"], comp_inbound["stock_in"]))

    spike_days = set(np.random.choice(calendar.date, size=max(3, int(0.02*len(calendar))), replace=False))

    for dt in calendar:
        date_key = dt.date()
        stock_in = int(inbound_by_date.get(date_key, 0))

        mu = max(0.5, daily_mu[comp_id])
        base_out = np.random.poisson(mu)

        if date_key in spike_days:
            spike_mult = np.random.uniform(1.5, 3.0)
        else:
            spike_mult = np.random.normal(1.0, 0.10)

        stock_out = int(max(0, round(base_out * max(0.2, spike_mult))))
        closing = opening + stock_in - stock_out

        records.append({
            "date": date_key,
            "component_id": comp_id,
            "opening_stock": int(opening),
            "stock_in": int(stock_in),
            "stock_out": int(stock_out),
            "closing_stock": int(closing)
        })

        opening = closing

inv = pd.DataFrame(records)
inv.sort_values(["component_id", "date"], inplace=True)
inv.to_csv("inventory_levels.csv", index=False)

print(f" Inventory levels generated: {len(inv):,} rows -> inventory_levels.csv")
print(inv.head(8))

 Inventory levels generated: 151,200 rows -> inventory_levels.csv
         date component_id  opening_stock  stock_in  stock_out  closing_stock
0  2024-02-19        C0001            162         0         10            152
1  2024-02-20        C0001            152         0          2            150
2  2024-02-21        C0001            150         0          4            146
3  2024-02-22        C0001            146         0         11            135
4  2024-02-23        C0001            135         0          5            130
5  2024-02-24        C0001            130         0          3            127
6  2024-02-25        C0001            127         0         12            115
7  2024-02-26        C0001            115         0          7            108


### Generate Production Orders with Daily Spikes (consistent with inventory usage)

This script creates production_orders that are *consistent* with previously simulated inventory movements. We treat the inventory "stock_out" as the ACTUAL issued quantity (units_issued). Then, we simulate higher "units_required" on spike days to create unmet demand/backorders when inventory is tight (e.g., closing_stock <= 0).

- Output schema: prod_order_id, date, component_id, units_required, units_issued
- File: production_orders.csv

In [5]:
np.random.seed(21)

# Load prior artifacts
inv = pd.read_csv("inventory_levels.csv", parse_dates=["date"])
components = pd.read_csv("components.csv")

# Merge category into inventory for spike behavior
inv = inv.merge(
    components[["component_id", "category"]],
    on="component_id",
    how="left"
)

# Prepare spike days for each component
per_comp_spikes = {}
for comp_id, comp_df in inv.groupby("component_id", sort=False):
    unique_dates = np.sort(comp_df["date"].unique())  # sorted array of this component's dates
    n_days = len(unique_dates)
    n_spikes = max(3, int(0.02 * n_days))
    spike_idx = np.random.choice(n_days, size=n_spikes, replace=False)
    spike_days = set(pd.to_datetime(unique_dates[spike_idx]).date)
    per_comp_spikes[comp_id] = spike_days

# Category-level spike intensity tuning
CATEGORY_SPIKE_MULT = {
    "Silicon Wafers": (1.3, 1.8),
    "Organic Substrates": (1.4, 2.1),
    "Leadframes": (1.4, 2.2),
    "Bonding Wire (Au/Cu)": (1.3, 1.9),
    "Mold Compound": (1.3, 1.9),
    "Die Attach Film/Paste": (1.3, 2.0),
    "Underfill/Epoxy": (1.3, 2.0),
    "Solder Balls/Spheres": (1.5, 2.4),
    "Probe Cards": (1.1, 1.4),
    "Test Sockets": (1.2, 1.6),
    "Carrier Tapes/Trays": (1.5, 2.3),
    "Nozzles/Capillaries": (1.2, 1.7)
}

records = []
counter = 1

for row in inv.itertuples(index=False):
    comp_id = row.component_id
    dt = row.date.date()
    cat = row.category
    issued = int(max(0, row.stock_out))

    # Base variation
    base_mult = np.random.normal(loc=1.0, scale=0.08)
    base_mult = max(0.85, min(1.25, base_mult))

    # Spike multiplier
    spike_low, spike_high = CATEGORY_SPIKE_MULT.get(cat, (1.25, 1.9))
    if dt in per_comp_spikes.get(comp_id, set()):
        spike_mult = np.random.uniform(spike_low, spike_high)
    else:
        spike_mult = 1.0

    # Units required and unmet demand
    rough_required = int(max(issued, round(issued * base_mult * spike_mult)))
    if row.closing_stock <= 0:
        unmet = int(np.random.poisson(lam=max(1, issued * 0.25)))
    else:
        unmet = int(np.random.binomial(n=max(0, issued), p=0.03))

    units_required = rough_required + unmet

    records.append({
        "prod_order_id": f"PO{counter:08d}",
        "date": dt,
        "component_id": comp_id,
        "units_required": int(units_required),
        "units_issued": int(issued)
    })
    counter += 1

po = pd.DataFrame(records)
po.sort_values(["component_id", "date"], inplace=True)
po.to_csv("production_orders.csv", index=False)

print(f"Production orders generated: {len(po):,} rows -> production_orders.csv")
print(po.head(8))

Production orders generated: 151,200 rows -> production_orders.csv
  prod_order_id        date component_id  units_required  units_issued
0    PO00000001  2024-02-19        C0001              10            10
1    PO00000002  2024-02-20        C0001               2             2
2    PO00000003  2024-02-21        C0001               4             4
3    PO00000004  2024-02-22        C0001              12            11
4    PO00000005  2024-02-23        C0001               5             5
5    PO00000006  2024-02-24        C0001               3             3
6    PO00000007  2024-02-25        C0001              12            12
7    PO00000008  2024-02-26        C0001               8             7
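
The gap between `units_required` and `units_issued` is the unmet demand for that day. As a quick illustration, the eight rows printed above can be reduced to a per-row shortfall and an overall fill rate; the variable names here are illustrative, not part of the notebook:

```python
# First eight rows of the printed production_orders output (component C0001)
units_required = [10, 2, 4, 12, 5, 3, 12, 8]
units_issued = [10, 2, 4, 11, 5, 3, 12, 7]

# Unmet demand per order line; never negative because required >= issued by construction
unmet = [req - iss for req, iss in zip(units_required, units_issued)]

# Fill rate: share of required units that were actually issued
fill_rate = sum(units_issued) / sum(units_required)

print("unmet per row:", unmet)          # [0, 0, 0, 1, 0, 0, 0, 1]
print(f"fill rate: {fill_rate:.1%}")    # 96.4%
```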


### Generate Monthly Demand Forecasts with ±20–40% Variation

1. Creates forecasts.csv using actual monthly demand (units_required from production_orders)
2. Adds forecast error in both directions (over- and under-estimation); the code draws it uniformly from -40% to +40%, so errors smaller than 20% also occur
3. Output schema: month (YYYY-MM), component_id, forecast_units
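
The heading targets an error of ±20–40%, while `np.random.uniform(-0.4, 0.4)` in the cell below also yields errors close to zero. If the magnitude really should stay between 20% and 40%, one possible approach is to draw the magnitude and the sign separately. This is a sketch of that alternative, not the notebook's method; `signed_variation` is a hypothetical helper and the 1,000-unit actual demand is a placeholder:

```python
import random

def signed_variation(rng, low=0.20, high=0.40):
    """Relative forecast error with magnitude in [low, high] and a random sign."""
    magnitude = rng.uniform(low, high)
    sign = rng.choice([-1, 1])  # -1 = underestimate, +1 = overestimate
    return sign * magnitude

rng = random.Random(33)
variations = [signed_variation(rng) for _ in range(5_000)]

actual_units = 1000  # placeholder monthly demand
forecasts = [max(1, round(actual_units * (1 + v))) for v in variations]

print(f"smallest |error|: {min(abs(v) for v in variations):.3f}")
print(f"largest  |error|: {max(abs(v) for v in variations):.3f}")
print("forecast range:", min(forecasts), "to", max(forecasts))
```

The resulting error distribution is bimodal, with no forecasts within 20% of actual demand, which may or may not be what a downstream forecast-accuracy analysis should see.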

In [6]:
np.random.seed(33)

# Load production orders
po = pd.read_csv("production_orders.csv", parse_dates=["date"])

# Aggregate actual monthly demand per component
po["month"] = po["date"].dt.to_period("M").astype(str)
monthly_demand = (
    po.groupby(["month", "component_id"], as_index=False)["units_required"]
      .sum()
      .rename(columns={"units_required": "actual_units"})
)

# Apply forecast variation: relative error drawn uniformly from -40% to +40%
# (this range also includes errors smaller than 20%)
# Positive error = overestimate, negative = underestimate
variation_pct = np.random.uniform(-0.4, 0.4, size=len(monthly_demand))
monthly_demand["forecast_units"] = (
    monthly_demand["actual_units"] * (1 + variation_pct)
).round().astype(int)

# Avoid zero or negative forecasts (set a min of 1 unit)
monthly_demand["forecast_units"] = monthly_demand["forecast_units"].clip(lower=1)

# Save only required forecast columns
forecasts = monthly_demand[["month", "component_id", "forecast_units"]]
forecasts.to_csv("forecasts.csv", index=False)

print(f"Forecasts generated: {len(forecasts):,} rows -> forecasts.csv")
print(forecasts.sample(8, random_state=7))

Forecasts generated: 5,280 rows -> forecasts.csv
        month component_id  forecast_units
4445  2025-08        C0126           19576
4139  2025-07        C0060           74500
3707  2025-05        C0108           22253
1010  2024-06        C0051           64578
722   2024-05        C0003             268
4648  2025-09        C0089           55860
3870  2025-06        C0031           22757
1684  2024-09        C0005             346
