### The Smart Supplier: Optimizing Orders in a Fluctuating Market - 6 Marks

Develop a reinforcement learning agent using dynamic programming to help a Smart Supplier decide which products to manufacture and sell each day to maximize profit. The agent must learn the optimal policy for choosing daily production quantities, considering its limited raw materials and the unpredictable daily demand and selling prices for different products.

#### **Scenario**
A small Smart Supplier manufactures two simple products: Product A and Product B. Each day, the supplier has a limited amount of raw material. The market demand and selling price for Product A and Product B change randomly each day, so each product is more profitable on some days than on others. The supplier must decide how much of each product to produce to maximize profit within its limited raw material.

#### **Objective**
The Smart Supplier's agent must learn the optimal policy π∗ using dynamic programming (Value Iteration or Policy Iteration) to decide how many units of Product A and Product B to produce each day to maximize the total profit over the fixed number of days, given the daily changing market conditions and limited raw material.
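
For reference, the finite-horizon recursion that the value-iteration code in this notebook implements can be written as below. The notation is chosen here for illustration and is not part of the assignment text: $t$ is the day, $x$ the raw material on hand, $m$ the market state, $r$ the immediate revenue, $P(m')$ the probability of next-day market state $m'$, $T$ the horizon (`HORIZON`), and $X_{\max}$ the daily raw-material allowance (`MAX_RM`).

```latex
\begin{aligned}
V_{T+1}(x, m) &= 0, \\
V_t(x, m) &= \max_{a \in \mathcal{A}} \Big[\, r(x, m, a) + \sum_{m'} P(m')\, V_{t+1}(X_{\max}, m') \Big],
  \qquad t = T, T-1, \dots, 1, \\
\pi^*_t(x, m) &\in \arg\max_{a \in \mathcal{A}} \Big[\, r(x, m, a) + \sum_{m'} P(m')\, V_{t+1}(X_{\max}, m') \Big].
\end{aligned}
```

The next-day raw material is always $X_{\max}$ in this formulation, which reflects the daily reset used in the code.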

### --- 1. Custom Environment Creation (SmartSupplierEnv) --- (1 Mark)

In [13]:
!pip install numpy
import numpy as np
import random

# ------------------------
# 1. Define Environment
# ------------------------

# Raw material per day
MAX_RM = 10

# Market states
MARKET_STATES = {
    1: {'price_A': 8, 'price_B': 2},   # High Demand A
    2: {'price_A': 3, 'price_B': 5},   # High Demand B
}

# Actions: name -> (units of A, units of B).
# Raw-material cost is computed in reward(): 2 per unit of A, 1 per unit of B.
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing':     (0, 0),
}
ACTION_LIST = list(ACTIONS.keys())

# Horizon: number of days
HORIZON = 5

# Probability of each market state on the next day (50/50, independent of today's state)
TRANSITION_PROB = {1: 0.5, 2: 0.5}


# ------------------------
# 2. Reward Function
# ------------------------
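def reward(day, rm, market_state, action):
    """
    Compute the immediate reward (revenue) for taking `action` in state
    (day, rm, market_state). Each unit of A costs 2 raw material and each
    unit of B costs 1. `day` does not affect the reward.
    """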
    units_A, units_B = ACTIONS[action]
    cost_rm = 2*units_A + 1*units_B
    # If not enough raw material, action fails -> zero production
    if cost_rm > rm:
        units_A, units_B = 0, 0
    pA = MARKET_STATES[market_state]['price_A']
    pB = MARKET_STATES[market_state]['price_B']
    return units_A * pA + units_B * pB


### --- 2. Dynamic Programming Implementation (Value Iteration or Policy Iteration) --- (2 Marks)

In [14]:
# ------------------------
# 3. Finite‑Horizon Value Iteration
# ------------------------
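# V[t, rm, m] = max expected cumulative reward from day t through HORIZON,
# starting day t with raw material rm in market state m.
# Market states are 1-based, so index 0 on the last axis is unused.
V = np.zeros((HORIZON+2, MAX_RM+1, len(MARKET_STATES)+1))
# policy[(t, rm, m)] = best action at (t, rm, m)
policy = {}

# V[HORIZON+1] stays at zero (no future days); sweep backward from the last day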
for t in reversed(range(1, HORIZON+1)):
    for rm in range(0, MAX_RM+1):
        for m in MARKET_STATES:
            best_val = -np.inf
            best_act = None
            # Evaluate each action
            for act in ACTION_LIST:
                r = reward(t, rm, m, act)
                # Next day resets rm to MAX_RM; market transitions randomly
                exp_future = 0.0
                for m_next, p in TRANSITION_PROB.items():
                    exp_future += p * V[t+1, MAX_RM, m_next]
                total = r + exp_future
                if total > best_val:
                    best_val = total
                    best_act = act
            V[t, rm, m] = best_val
            policy[(t, rm, m)] = best_act

### --- 3. Simulation and Policy Analysis --- (1 Mark)

In [15]:
# ------------------------
# 4. Inspect Optimal Policy & Value
# ------------------------
start_value = V[1, MAX_RM, 1]
print(f"Optimal state‑value V*(day=1, RM=10, market=1): {start_value:.2f}\n")

# Show policy for a few sample states
for day in [1, 3, 5]:
    for m in [1, 2]:
        act = policy[(day, MAX_RM, m)]
        print(f"Day {day}, RM=10, Market {m} → Produce: {act}")
    print()

# ------------------------
# 5. Simulate the Learned Policy
# ------------------------
def run_episode(policy):
    total = 0
    rm = MAX_RM
    market = random.choice([1, 2])
    for t in range(1, HORIZON+1):
        act = policy[(t, rm, market)]
        r = reward(t, rm, market, act)
        total += r
        # Next day
        rm = MAX_RM
        market = 1 if random.random() < 0.5 else 2
    return total

# Run many episodes to estimate average profit
N_EPISODES = 1000
profits = [run_episode(policy) for _ in range(N_EPISODES)]
avg_profit = np.mean(profits)
print(f"Average total profit over {N_EPISODES} runs: {avg_profit:.2f}")

Optimal state‑value V*(day=1, RM=10, market=1): 122.00

Day 1, RM=10, Market 1 → Produce: Produce_3A_0B
Day 1, RM=10, Market 2 → Produce: Produce_0A_5B

Day 3, RM=10, Market 1 → Produce: Produce_3A_0B
Day 3, RM=10, Market 2 → Produce: Produce_0A_5B

Day 5, RM=10, Market 1 → Produce: Produce_3A_0B
Day 5, RM=10, Market 2 → Produce: Produce_0A_5B

Average total profit over 1000 runs: 122.55
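
The two printed numbers can be cross-checked by hand. The sketch below is not part of the original solution; it recomputes them from the notebook's prices and action set, assuming all actions are affordable at RM = 10 (the most expensive one costs 6). The helper `best_revenue` is a name introduced here for illustration.

```python
# Prices per market state and (units_A, units_B) per action, copied from the notebook.
MARKET_STATES = {1: {'price_A': 8, 'price_B': 2}, 2: {'price_A': 3, 'price_B': 5}}
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing': (0, 0),
}
HORIZON = 5

def best_revenue(market):
    """Highest one-day revenue in `market` when every action is affordable (RM = 10)."""
    p = MARKET_STATES[market]
    return max(a * p['price_A'] + b * p['price_B'] for a, b in ACTIONS.values())

best = {m: best_revenue(m) for m in MARKET_STATES}   # {1: 24, 2: 25}
daily_mean = 0.5 * best[1] + 0.5 * best[2]            # 24.5

# V*(day=1, RM=10, market=1): day 1 is known to be state 1, the other four days are 50/50.
v_start_market_1 = best[1] + (HORIZON - 1) * daily_mean
# The simulation draws the day-1 market at random as well.
expected_sim_profit = HORIZON * daily_mean

print(v_start_market_1)     # 122.0
print(expected_sim_profit)  # 122.5
```

The first value matches the printed V* of 122.00, and the simulated average of 122.55 is within sampling noise of 122.5.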


### --- 4. Impact of Dynamics Analysis --- (1 Mark)

When market prices for products A and B fluctuate over time, a static “one‑size‑fits‑all” production plan quickly becomes suboptimal. Dynamic programming (DP), and in particular finite‑horizon value iteration, lets the supplier tailor its daily production decisions to the current price regime while still accounting for future uncertainties. Here’s how:

(1) Price‑Driven Action Selection

High-Price States: On days when product A’s price is high (state 1: $8 for A vs. $2 for B), the DP-derived policy selects Produce_3A_0B, the action with the most units of A in the action set, and forgoes B completely (revenue 3 × 8 = 24). Conversely, when B’s price is higher (state 2: $3 for A vs. $5 for B), the policy switches to Produce_0A_5B (revenue 5 × 5 = 25).

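Threshold Behavior: Product A consumes twice as much raw material per unit as B (2 vs. 1), so which actions are affordable depends on the raw material on hand. With the full 10 units every action is affordable and only prices matter. At lower levels the policy shows thresholds in raw material: in state 2, for example, Produce_0A_5B needs 5 units, so with only 4 units the best affordable action is Produce_1A_2B (revenue 13), and below 4 units no productive action is affordable.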
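
The per-action revenues behind these choices can be tabulated directly from the notebook's prices and action set. This is a minimal sketch; `revenue` mirrors the notebook's `reward` without the unused `day` argument, and `best_action` is a hypothetical helper introduced here.

```python
# Prices and actions copied from the notebook.
MARKET_STATES = {1: {'price_A': 8, 'price_B': 2}, 2: {'price_A': 3, 'price_B': 5}}
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing': (0, 0),
}

def revenue(rm, market, action):
    """One-day revenue; an unaffordable action produces nothing, as in reward()."""
    units_a, units_b = ACTIONS[action]
    if 2 * units_a + units_b > rm:
        return 0
    prices = MARKET_STATES[market]
    return units_a * prices['price_A'] + units_b * prices['price_B']

def best_action(rm, market):
    """First action with the highest one-day revenue (same tie-break as the notebook loop)."""
    return max(ACTIONS, key=lambda act: revenue(rm, market, act))

# Revenue of every action in each market state with full raw material.
for act in ACTIONS:
    print(f"{act:15s} state 1: {revenue(10, 1, act):2d}   state 2: {revenue(10, 2, act):2d}")

# Best action in each market state as raw material shrinks.
for rm in (10, 5, 4, 3):
    print(rm, best_action(rm, 1), best_action(rm, 2))
```

With 3 units or fewer, every action yields zero revenue, so the tie-break returns the first entry of `ACTIONS` even though nothing is produced. The notebook's policy table behaves the same way in those states.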

(2) Balancing Immediate vs. Future Reward

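Finite Horizon Tradeoff: DP explicitly balances today’s profit against expected profit on future days. In the model as implemented here, however, raw material resets to MAX_RM every morning and the next market state does not depend on the action taken, so the expected-future term in the Bellman update is the same for every action. The optimal action therefore maximizes today’s revenue, which is why the printed policy is identical on days 1, 3, and 5. “Saving” raw material today (producing less, or doing nothing) would only pay off if unused material carried over to later days.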

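Adaptive Conservatism: In a carry-over variant, the same recursion would yield adaptive conservatism: moderate production on marginally favorable price days, with heavier production reserved for days that are more lucrative. With daily replenishment there is nothing to conserve, and producing at the best affordable level every day is optimal.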
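
One way to check how much look-ahead matters in the model as coded is to compare the backward-induction policy with a purely myopic one that maximizes today's revenue. The sketch below re-implements the notebook's recursion with a plain dictionary instead of a NumPy array; `myopic_policy` is a name introduced here for illustration.

```python
MAX_RM, HORIZON = 10, 5
MARKET_STATES = {1: {'price_A': 8, 'price_B': 2}, 2: {'price_A': 3, 'price_B': 5}}
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing': (0, 0),
}
TRANSITION_PROB = {1: 0.5, 2: 0.5}

def reward(rm, market, action):
    """Immediate revenue; unaffordable actions produce nothing."""
    units_a, units_b = ACTIONS[action]
    if 2 * units_a + units_b > rm:
        units_a, units_b = 0, 0
    prices = MARKET_STATES[market]
    return units_a * prices['price_A'] + units_b * prices['price_B']

V = {}          # (t, rm, m) -> value; missing keys (day HORIZON+1) count as 0
dp_policy = {}
for t in range(HORIZON, 0, -1):
    # rm resets to MAX_RM and the market draw ignores the action,
    # so this continuation value is shared by every action on day t.
    future = sum(p * V.get((t + 1, MAX_RM, m2), 0.0) for m2, p in TRANSITION_PROB.items())
    for rm in range(MAX_RM + 1):
        for m in MARKET_STATES:
            totals = {act: reward(rm, m, act) + future for act in ACTIONS}
            best = max(totals, key=totals.get)
            dp_policy[(t, rm, m)] = best
            V[(t, rm, m)] = totals[best]

# Myopic rule: ignore the future and maximize today's revenue.
myopic_policy = {
    (t, rm, m): max(ACTIONS, key=lambda act: reward(rm, m, act))
    for t in range(1, HORIZON + 1)
    for rm in range(MAX_RM + 1)
    for m in MARKET_STATES
}

print(dp_policy == myopic_policy)  # True
print(V[(1, MAX_RM, 1)])           # 122.0
```

The two policies coincide in every state, which is what one would expect whenever the action has no effect on tomorrow's state.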

(3) Robustness to Price Volatility

Stochastic Averaging: By incorporating the 50/50 transition probabilities into its Bellman updates, DP averages the value of tomorrow over both market states, so the value function reflects all future price paths rather than a single forecast. The resulting policy maximizes expected profit; it is risk-neutral and does not trade expected profit for lower variance.

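Policy Stability: Even when day-to-day price realizations differ from expectations, the policy remains optimal in expectation under the assumed model, because it specifies the best action for every (day, raw material, market) state rather than for one forecast path. Realized profit still varies from run to run: the 1000-episode average of 122.55 sits close to the expected 122.5 (5 days × 24.5 per day when the day-1 market is also drawn 50/50).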
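
How much realized profit can vary under the optimal policy can be worked out exactly by enumerating all market paths. This is a minimal sketch, assuming the best one-day revenues at RM = 10 implied by the notebook's prices (24 in state 1 from Produce_3A_0B, 25 in state 2 from Produce_0A_5B) and a 50/50 draw on every day including day 1, as in the simulation.

```python
from collections import Counter
from itertools import product

BEST_REVENUE = {1: 24, 2: 25}  # best one-day revenue per market state at RM = 10
HORIZON = 5

# Every 5-day market path has probability (1/2)**5 under independent 50/50 draws.
totals = Counter(
    sum(BEST_REVENUE[m] for m in path)
    for path in product((1, 2), repeat=HORIZON)
)
n_paths = 2 ** HORIZON
mean = sum(total * count for total, count in totals.items()) / n_paths

for total in sorted(totals):
    print(total, totals[total] / n_paths)
print("mean:", mean)  # mean: 122.5
```

Total profit ranges only from 120 to 125 because the best revenues in the two states are nearly equal, which is why the simulated averages stay so close to 122.5.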

(4) Operational Insights

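Spot-Price Exploitation: The DP solution shows whether “spot plays” (i.e., producing only on peak-price days) beat a steady production rhythm. In this model they do not: every market state has an action with positive revenue and unused raw material is not carried over, so the policy produces every day and only switches the product mix.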

Inventory Valuation: In settings where raw material can carry over (if modified), the same DP framework would assign an “option value” to holding inventory when both products are temporarily undervalued.
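
To illustrate what such a carry-over modification might look like, the sketch below changes the model in ways that are assumptions of this example, not part of the notebook: a single stock of `START_RM = 10` units must last the whole horizon, there is no daily replenishment, and unaffordable actions are skipped rather than zeroed. The function names are hypothetical.

```python
MARKET_STATES = {1: {'price_A': 8, 'price_B': 2}, 2: {'price_A': 3, 'price_B': 5}}
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing': (0, 0),
}
TRANSITION_PROB = {1: 0.5, 2: 0.5}
HORIZON = 5
START_RM = 10  # assumption: one stock of raw material for the whole horizon

def rm_cost(action):
    units_a, units_b = ACTIONS[action]
    return 2 * units_a + units_b

def revenue(market, action):
    units_a, units_b = ACTIONS[action]
    prices = MARKET_STATES[market]
    return units_a * prices['price_A'] + units_b * prices['price_B']

def solve_carry_over():
    """Backward induction where unused raw material carries over to the next day."""
    # V[t][rm][m]; V[HORIZON + 1] stays at zero.
    V = [[{m: 0.0 for m in MARKET_STATES} for _ in range(START_RM + 1)]
         for _ in range(HORIZON + 2)]
    policy = {}
    for t in range(HORIZON, 0, -1):
        for rm in range(START_RM + 1):
            for m in MARKET_STATES:
                best_val, best_act = float('-inf'), None
                for act in ACTIONS:
                    cost = rm_cost(act)
                    if cost > rm:
                        continue  # unaffordable actions are skipped
                    rm_next = rm - cost  # leftover material is available tomorrow
                    future = sum(p * V[t + 1][rm_next][m2]
                                 for m2, p in TRANSITION_PROB.items())
                    total = revenue(m, act) + future
                    if total > best_val:
                        best_val, best_act = total, act
                V[t][rm][m] = best_val
                policy[(t, rm, m)] = best_act
    return V, policy

V, policy = solve_carry_over()
print(policy[(4, 5, 1)], V[4][5][1])  # Do_Nothing 20.5
print(policy[(5, 5, 1)])              # Produce_2A_0B
```

On day 4 with 5 units left in state 1, waiting is worth 20.5 in expectation (half of 16 plus half of 25), which beats the 16 available from Produce_2A_0B today. On the last day there is no future, so the same state produces immediately. This is the option value of holding inventory that the daily-reset model cannot exhibit.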

(5) Economic Value of Flexibility

Quantified Gains: A DP-based policy can be compared against a naïve fixed rule such as always choosing Produce_2A_0B. That rule earns 16 in state 1 and 6 in state 2, or 11 per day in expectation (55 over five days), versus 24.5 per day (122.5 over five days) for the state-dependent policy. The gain comes from matching the product mix to the current price state.
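
The comparison with fixed rules can be made exact rather than simulated. The sketch below assumes the notebook's setting (RM resets to 10 daily, so every action is affordable, and each day's market is an independent 50/50 draw); `expected_profit` is a helper introduced here for illustration.

```python
MARKET_STATES = {1: {'price_A': 8, 'price_B': 2}, 2: {'price_A': 3, 'price_B': 5}}
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing': (0, 0),
}
MARKET_PROB = {1: 0.5, 2: 0.5}
HORIZON = 5

def revenue(market, action):
    """One-day revenue at RM = 10, where every action is affordable."""
    units_a, units_b = ACTIONS[action]
    prices = MARKET_STATES[market]
    return units_a * prices['price_A'] + units_b * prices['price_B']

def expected_profit(rule):
    """Exact expected profit over HORIZON days for a rule mapping market -> action."""
    per_day = sum(p * revenue(m, rule(m)) for m, p in MARKET_PROB.items())
    return HORIZON * per_day

# One fixed rule per action: always play that action regardless of the market.
fixed_rules = {act: expected_profit(lambda m, act=act: act) for act in ACTIONS}
# State-dependent rule: best action for the observed market state.
adaptive = expected_profit(lambda m: max(ACTIONS, key=lambda act: revenue(m, act)))

for act, value in fixed_rules.items():
    print(f"always {act:15s} {value:6.1f}")
print(f"state-dependent        {adaptive:6.1f}")
```

Under these assumptions the best fixed rule is always Produce_0A_5B at 87.5, so adapting to the market state adds 35 in expected profit over five days.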

Decision Support: Managers can use the value function surface V(t,RM,m) to quantify how much an extra unit of raw material is worth in each market state and time, guiding not only production but also procurement and pricing strategies.
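
As a sketch of how the marginal value of raw material could be read off, the code below differences the one-day value across raw-material levels. It relies on the notebook's daily reset: the continuation term does not depend on today's raw material, so the difference V(t, rm+1, m) − V(t, rm, m) is the same on every day and equals the one-day difference. `one_day_value` and `marginal` are names introduced here.

```python
MAX_RM = 10
MARKET_STATES = {1: {'price_A': 8, 'price_B': 2}, 2: {'price_A': 3, 'price_B': 5}}
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing': (0, 0),
}

def reward(rm, market, action):
    """Immediate revenue; unaffordable actions produce nothing."""
    units_a, units_b = ACTIONS[action]
    if 2 * units_a + units_b > rm:
        units_a, units_b = 0, 0
    prices = MARKET_STATES[market]
    return units_a * prices['price_A'] + units_b * prices['price_B']

def one_day_value(rm, market):
    """Best revenue achievable today with `rm` units in `market`."""
    return max(reward(rm, market, act) for act in ACTIONS)

# marginal[m][rm] = gain from having rm + 1 units instead of rm in market state m.
marginal = {
    m: [one_day_value(rm + 1, m) - one_day_value(rm, m) for rm in range(MAX_RM)]
    for m in MARKET_STATES
}
for m, gains in marginal.items():
    print(f"state {m}: {gains}")
# state 1: [0, 0, 0, 16, 0, 8, 0, 0, 0, 0]
# state 2: [0, 0, 0, 13, 12, 0, 0, 0, 0, 0]
```

The value is a step function of raw material because the action set is discrete: extra units are only worth something when they make a better action affordable, and nothing beyond 6 units adds value in either state.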

In summary, dynamic programming transforms a volatile pricing environment from a risk‑laden guessing game into a structured optimization problem. By embedding market‑state transitions and finite‑horizon goals into the Bellman equations, DP yields an optimal policy that dynamically reallocates resources to wherever they earn the highest expected return—both today and tomorrow.

In [16]:
# --- Main Execution ---
import random

# ------------------------
# 1. Define Environment
# ------------------------
MAX_RM = 10
MARKET_STATES = {
    1: {'price_A': 8, 'price_B': 2},
    2: {'price_A': 3, 'price_B': 5},
}
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing':     (0, 0),
}
ACTION_LIST = list(ACTIONS.keys())
HORIZON = 5
TRANSITION_PROB = {1: 0.5, 2: 0.5}

# ------------------------
# 2. Reward Function
# ------------------------
def reward(rm, market_state, action):
    uA, uB = ACTIONS[action]
    cost = 2*uA + 1*uB
    if cost > rm:
        uA, uB = 0, 0
    pA = MARKET_STATES[market_state]['price_A']
    pB = MARKET_STATES[market_state]['price_B']
    return uA*pA + uB*pB

# ------------------------
# 3. Initialize Value & Policy Tables
# ------------------------
# V[t][rm][m] = 0 for all t,rm,m
V = [
    [ {1:0.0, 2:0.0} for _ in range(MAX_RM+1) ]
    for _ in range(HORIZON+2)
]
policy = {}  # (t,rm,m) -> best action

# ------------------------
# 4. Finite‑Horizon Value Iteration
# ------------------------
for t in range(HORIZON, 0, -1):
    for rm in range(MAX_RM+1):
        for m in MARKET_STATES:
            best_val = float('-inf')
            best_act = None
            for act in ACTION_LIST:
                r = reward(rm, m, act)
                # next-day RM resets to MAX_RM
                exp_future = sum(
                    TRANSITION_PROB[m_next] * V[t+1][MAX_RM][m_next]
                    for m_next in MARKET_STATES
                )
                total = r + exp_future
                if total > best_val:
                    best_val = total
                    best_act = act
            V[t][rm][m] = best_val
            policy[(t, rm, m)] = best_act

# ------------------------
# 5. Inspect Results
# ------------------------
print(f"Optimal V*(day=1, RM=10, market=1): {V[1][10][1]:.2f}\n")
for day in [1,3,5]:
    for m in [1,2]:
        print(f" Day {day}, RM=10, Market {m} → {policy[(day,10,m)]}")
    print()

# ------------------------
# 6. Simulate Policy
# ------------------------
def run_episode():
    total = 0
    rm = MAX_RM
    market = random.choice([1,2])
    for t in range(1, HORIZON+1):
        act = policy[(t, rm, market)]
        total += reward(rm, market, act)
        rm = MAX_RM
        market = 1 if random.random() < 0.5 else 2
    return total

# Estimate average profit
trials = 1000
profits = [run_episode() for _ in range(trials)]
print(f"Avg. profit over {trials} runs: {sum(profits)/trials:.2f}")



Optimal V*(day=1, RM=10, market=1): 122.00

 Day 1, RM=10, Market 1 → Produce_3A_0B
 Day 1, RM=10, Market 2 → Produce_0A_5B

 Day 3, RM=10, Market 1 → Produce_3A_0B
 Day 3, RM=10, Market 2 → Produce_0A_5B

 Day 5, RM=10, Market 1 → Produce_3A_0B
 Day 5, RM=10, Market 2 → Produce_0A_5B

Avg. profit over 1000 runs: 122.57
