File Name: mini_project_02b.ipynb

Description: This program trains a deep Q-network (DQN) to optimize inventory management in a simulated environment, handling raw material and product stocks, cash flow, and variable demand. The agent learns an effective policy using experience replay, target networks, and reward normalization, achieving improved cash performance while mitigating challenges from stockout penalties.

Record of Revisions (Date | Author | Change):  
10/29/2025 | Rhys DeLoach | Initial creation

In [31]:
# Import Libraries
import gymnasium as gym
import random
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn

# Import the environment, observation wrapper, replay buffer, and DQN classes from the mini_project_02a module
from mini_project_02a import InventoryManagementEnv, NormalizeObservation, ReplayBuffer, DQN

In [32]:
# Create environment instance - gym.make is not needed here because the environment class is provided directly
env = InventoryManagementEnv(max_steps=50)

# Normalize the observation space for better training performance
env = NormalizeObservation(env)

In [33]:
# Check the observation and action spaces
print(f"Observation space which is continuous:\n{env.observation_space}\n")
print(f"Action space which is discrete:\n{env.action_space}\n")

# Check the dimensions of the observation space 
print(f"Observation space dimensions: {env.observation_space.shape[0]}")
print(f"Action space dimensions: {env.action_space.n}")

Observation space which is continuous:
Box(0.0, inf, (6,), float32)

Action space which is discrete:
Discrete(3)

Observation space dimensions: 6
Action space dimensions: 3


In [34]:
# Reset the environment to a start state
observation, info = env.reset()

# Take a few random actions to see how the env behaves and what information rendering gives us
# Note: the loop runs 5 steps of a single episode (the loop variable is named `episode`, but each iteration is one step)
for episode in range(5):
    action = env.action_space.sample()  # Replace with agent policy
    observation, reward, terminated, truncated, info = env.step(action)
    print(f"\nEpisode {episode+1} | Action taken: {action}")
    env.render()
    
    if terminated or truncated:
        observation, info = env.reset()
        
env.close()


Episode 1 | Action taken: 2
Step: 1
Raw Inventory: 0.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 4.76, Product Price: 20.34
Demand: 7.34, Cash: 1000.00

Episode 2 | Action taken: 1
Step: 2
Raw Inventory: 1.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 4.66, Product Price: 20.37
Demand: 8.69, Cash: 995.24

Episode 3 | Action taken: 1
Step: 3
Raw Inventory: 2.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 4.65, Product Price: 20.58
Demand: 8.71, Cash: 990.58

Episode 4 | Action taken: 1
Step: 4
Raw Inventory: 3.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 4.34, Product Price: 20.51
Demand: 7.15, Cash: 985.92

Episode 5 | Action taken: 0
Step: 5
Raw Inventory: 3.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 4.48, Product Price: 20.18
Demand: 12.68, Cash: 985.92


#### <u>When developing the DQN solution you need a few things</u> ####

In [35]:
# Device selection: use Apple's MPS backend when it is available, otherwise fall back to CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Assign dimension values of state and actions
input_dim = env.observation_space.shape[0]  # 6 dimensions
output_dim = env.action_space.n             # 3 actions

# Initialize networks
Q_net = DQN(input_dim, output_dim).to(device)
target_net = DQN(input_dim, output_dim).to(device)

# Copy the weights from Q network to target network
target_net.load_state_dict(Q_net.state_dict())

# Put the target network in evaluation mode; it is only updated by copying weights from Q_net
target_net.eval()

DQN(
  (network): Sequential(
    (0): Linear(in_features=6, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=64, bias=True)
    (5): ReLU()
    (6): Linear(in_features=64, out_features=3, bias=True)
  )
)
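
As a quick sanity check on the printed architecture, the number of trainable parameters can be worked out by hand from the layer sizes. The sketch below uses only the `in_features`/`out_features` values shown in the printout; the helper name `linear_params` is made up for illustration and is not part of the project code.

```python
# Layer sizes copied from the printed DQN summary: (in_features, out_features)
layers = [(6, 64), (64, 64), (64, 64), (64, 3)]

def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

per_layer = [linear_params(n_in, n_out) for n_in, n_out in layers]
total_params = sum(per_layer)

print(per_layer)     # [448, 4160, 4160, 195]
print(total_params)  # 8963
```

At roughly nine thousand parameters the network is small, so the long training time comes from the number of environment steps rather than from the cost of each gradient update.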

In [36]:
replay_buffer = ReplayBuffer(capacity=50000)    # The value is your choice
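
The `ReplayBuffer` class is imported from `mini_project_02a` and its implementation is not shown here. The sketch below is only an assumption about what such a buffer typically looks like, inferred from how it is used in this notebook (`push`, `sample`, and `len`). The class name `SketchReplayBuffer` is hypothetical, and the real class may differ, for example by returning NumPy arrays.

```python
import random
from collections import deque

class SketchReplayBuffer:
    """Hypothetical stand-in for ReplayBuffer; the real class lives in mini_project_02a."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once the buffer is full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, then regroup the batch by field
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Usage: with capacity 3, pushing 5 transitions keeps only the 3 most recent
buf = SketchReplayBuffer(capacity=3)
for t in range(5):
    buf.push([float(t)], 0, -1.0, [float(t + 1)], False)
print(len(buf))
```

With a capacity of 50000 and 50 steps per early episode, a buffer like this would hold about the last 1000 episodes of experience before old transitions start to be evicted.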

In [21]:
# Hyperparameters
lr = 1e-4  # DQN learning rate
gamma = 0.95 # Gamma value
batchSize = 64 # NN batch size
episodes = 30000 # Max episodes

# Epsilon-greedy
epsStart = 1.0 # Initial epsilon
epsEnd = 0.1 # Final epsilon
epsDecayEpisodes = 20000 # Episode at which epsilon reaches its final value

# Training controls
startTrainAfter = 3000 # Minimum number of buffered transitions before training starts
targetUpdateFreq = 1000 # Target NN update frequency, in global environment steps
gradientClip = 0.5 # Max gradient norm; limits large gradients brought on by large rewards

# Initialize
optimizer = optim.Adam(Q_net.parameters(), lr=lr) # Adam optimizer
lossFn = nn.MSELoss()  # MSE loss

# Tracking progress
globalSteps = 0
episodeRewards = []
episodeLossesHistory = []

# Curriculum Learning
def getMaxSteps(episode):
    if episode < 8000:
        return 50
    elif episode < 18000:
        return 150
    else:
        return 300

# Train Function
def training(Q_net, target_net, replay_buffer, optimizer, lossFn, batchSize, gamma, device):
    states, actions, rewards, nextStates, dones = replay_buffer.sample(batchSize) # Pull samples from buffer

    # DQN Inputs
    states = torch.tensor(states, dtype=torch.float32, device=device)
    actions = torch.tensor(actions, dtype=torch.long, device=device).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32, device=device).unsqueeze(1)
    nextStates = torch.tensor(nextStates, dtype=torch.float32, device=device)
    dones = torch.tensor(dones, dtype=torch.float32, device=device).unsqueeze(1)
    
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8) # Normalize rewards
    
    qValues = Q_net(states).gather(1, actions) # Current Q-values
    
    # Target Q-values
    with torch.no_grad():
        nextQ = target_net(nextStates).max(1)[0].unsqueeze(1)
        qTargets = rewards + gamma * nextQ * (1 - dones)
    
    loss = lossFn(qValues, qTargets) # Loss function
    
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(Q_net.parameters(), gradientClip)
    optimizer.step()
    
    return loss.item()

# Train Loop
for episode in range(episodes):
    currentMaxSteps = getMaxSteps(episode)
    
    # Recreate environment
    env = InventoryManagementEnv(max_steps=currentMaxSteps)
    env = NormalizeObservation(env)
    
    state, _ = env.reset() # Reset environment
    done = False # Done flag
    
    # Reinitialize episode parameters
    totalReward = 0.0
    episodeLosses = []
    tStep = 0
    
    epsilon = max(epsEnd, epsStart - (epsStart - epsEnd) * episode / epsDecayEpisodes) # Epsilon function

    # Exploitation vs exploration
    while not done and tStep < currentMaxSteps:
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                stateTensor = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
                action = torch.argmax(Q_net(stateTensor)).item()
        
        nextState, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        
        replay_buffer.push(state, action, reward, nextState, done) # Update buffer

        # Update step info
        totalReward += reward
        state = nextState
        tStep += 1
        globalSteps += 1

        # Train once the replay buffer holds enough transitions
        if len(replay_buffer) >= startTrainAfter:
            loss = training(Q_net, target_net, replay_buffer, optimizer, lossFn, 
                            batchSize, gamma, device)
            episodeLosses.append(loss)

        # Update target NN
        if globalSteps % targetUpdateFreq == 0:
            target_net.load_state_dict(Q_net.state_dict())
            target_net.eval()
    
    # Update episode info
    episodeRewards.append(totalReward)
    avgLoss = np.mean(episodeLosses) if episodeLosses else 0.0
    episodeLossesHistory.append(avgLoss)
    
    # Logging
    if episode % 100 == 0:
        avgReward100 = np.mean(episodeRewards[-100:]) if len(episodeRewards) >= 100 else np.mean(episodeRewards)
        avgLoss100 = np.mean(episodeLossesHistory[-100:]) if len(episodeLossesHistory) >= 100 else avgLoss
        print(f"Episode {episode}/{episodes} | Steps: {currentMaxSteps} | "
              f"Avg Reward (100ep): {avgReward100:.2f} | "
              f"Epsilon: {epsilon:.3f} | Avg MSE Loss: {avgLoss100:.4f} | Buffer: {len(replay_buffer)}")

print("\n✅ Training Complete!")
torch.save(Q_net.state_dict(), 'dqn_inventory_final.pth')
print("Model saved as 'dqn_inventory_final.pth'")

Using device: mps
Starting DQN Training for Inventory Management
Goal: Minimize losses (negative rewards)
Expected: Rewards start very negative, stabilize at less negative values

Episode 0/30000 | Steps: 50 | Avg Reward (100ep): -2293.89 | Epsilon: 1.000 | Avg MSE Loss: 0.0000 | Buffer: 50
Episode 100/30000 | Steps: 50 | Avg Reward (100ep): -2274.04 | Epsilon: 0.996 | Avg MSE Loss: 0.3213 | Buffer: 5050
Episode 200/30000 | Steps: 50 | Avg Reward (100ep): -2261.88 | Epsilon: 0.991 | Avg MSE Loss: 0.6550 | Buffer: 10050
Episode 300/30000 | Steps: 50 | Avg Reward (100ep): -2271.32 | Epsilon: 0.987 | Avg MSE Loss: 0.7283 | Buffer: 15050
Episode 400/30000 | Steps: 50 | Avg Reward (100ep): -2269.89 | Epsilon: 0.982 | Avg MSE Loss: 0.8331 | Buffer: 20050
Episode 500/30000 | Steps: 50 | Avg Reward (100ep): -2254.31 | Epsilon: 0.978 | Avg MSE Loss: 0.9289 | Buffer: 25050
Episode 600/30000 | Steps: 50 | Avg Reward (100ep): -2256.74 | Epsilon: 0.973 | Avg MSE Loss: 0.9705 | Buffer: 30050
Episode
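
The exploration schedule and the curriculum used in the training loop can be checked in isolation. The sketch below copies the epsilon formula and the `getMaxSteps` thresholds from the cell above into two standalone functions; the snake_case names are introduced here for illustration only.

```python
# Values copied from the hyperparameter cell
eps_start, eps_end, eps_decay_episodes = 1.0, 0.1, 20000

def epsilon_at(episode):
    """Linear decay from eps_start to eps_end, clamped at eps_end."""
    return max(eps_end, eps_start - (eps_start - eps_end) * episode / eps_decay_episodes)

def max_steps_at(episode):
    """Curriculum thresholds copied from getMaxSteps."""
    if episode < 8000:
        return 50
    elif episode < 18000:
        return 150
    return 300

# Print the schedule at a few episodes, including the curriculum boundaries
for ep in (0, 100, 600, 8000, 18000, 20000, 29999):
    print(ep, f"{epsilon_at(ep):.3f}", max_steps_at(ep))
```

Epsilon drops by 0.9 / 20000 per episode, so episode 100 gives about 0.9955 and episode 600 gives 0.973, which is consistent with the logged values above. Exploration is still high (epsilon 0.64) when the episode length first increases at episode 8000.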

In [40]:
# Test Policy
Q_net.load_state_dict(torch.load("dqn_inventory_final.pth", map_location=device))
Q_net.eval()

state, info = env.reset()
total_reward = 0

for step in range(50):
    env.render()

    state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
    with torch.no_grad():
        action = torch.argmax(Q_net(state_tensor)).item()

    next_state, reward, terminated, truncated, info = env.step(action)

    total_reward += reward
    state = next_state

    if terminated or truncated:
        print(f"Environment ended early at step {step+1}.")
        break
        
env.close()


Step: 0
Raw Inventory: 0.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 4.74, Product Price: 19.98
Demand: 13.01, Cash: 1000.00
Step: 1
Raw Inventory: 1.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 4.85, Product Price: 19.73
Demand: 14.02, Cash: 995.26
Step: 2
Raw Inventory: 2.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 4.76, Product Price: 20.03
Demand: 8.49, Cash: 990.41
Step: 3
Raw Inventory: 0.0
Product Inventory Before Sale: 1.0, After Sale: 0.0
Raw Price: 4.74, Product Price: 19.89
Demand: 8.44, Cash: 1010.44
Step: 4
Raw Inventory: 1.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 5.05, Product Price: 19.99
Demand: 12.62, Cash: 1005.70
Step: 5
Raw Inventory: 2.0
Product Inventory Before Sale: 0.0, After Sale: 0.0
Raw Price: 5.07, Product Price: 20.24
Demand: 11.54, Cash: 1000.65
Step: 6
Raw Inventory: 0.0
Product Inventory Before Sale: 1.0, After Sale: 0.0
Raw Price: 5.43, Product Price: 19.95
Demand: 11.

1. One of the primary challenges I encountered was managing the large negative reward values caused by stockout penalties. These penalties not only destabilized DQN training, often leading to exploding gradients, but also shifted the agent’s focus almost entirely toward avoiding stockouts, at the expense of learning other beneficial actions. To address this, I applied gradient clipping and per-batch reward normalization to keep training stable and to keep the reward signal informative. During troubleshooting, I also added extensive logging to monitor the model’s behavior throughout training. This experience emphasized the importance of watching the neural network’s loss trend, since waiting solely for increases in cumulative reward is impractical for slow-converging models. One question I still have is how to determine whether the model has converged to a local minimum rather than the optimal policy. In a real-world deployment, would it be advantageous to maintain a prolonged exploration period to ensure sufficient coverage of the state space?
2. Overall, the model performed reasonably well, achieving approximately a 10% increase in cash over 50 steps.
3. Reinforcement learning offers several advantages over traditional control methods, and control plays a central role across many industries. In agriculture in particular, RL has the potential to significantly improve system performance. Biological processes are often highly non-linear and influenced by numerous interacting variables, which makes them difficult to manage using traditional PID controllers that rely on linear assumptions. RL, by contrast, is inherently well-suited for non-linear and dynamic environments, enabling it to adapt control policies based on feedback and changing conditions. Furthermore, within the broader context of AI development, RL is considered a key component in progressing toward artificial general intelligence (AGI). Unlike large language models, which are stateless and limited to the distribution of data on which they were trained, RL agents actively interact with and explore their environments. This ability allows RL systems to discover new states and behaviors beyond their initial training data, and often at a lower computational cost than retraining or scaling static models. As a result, RL represents a promising path toward building systems that can operate flexibly in open-ended, real-world environments.
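
The reward normalization and target calculation discussed in item 1 can be illustrated without PyTorch. The sketch below is a plain-Python version of the two steps in the `training` function: standardizing a batch of rewards, then forming the one-step target. The reward and Q-value numbers are made up for illustration and are not taken from the environment.

```python
import statistics

def standardize(rewards, eps=1e-8):
    """Shift a batch of rewards to zero mean and scale to unit sample std."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # sample std (n - 1), the default of torch.Tensor.std()
    return [(r - mean) / (std + eps) for r in rewards]

def td_target(reward, next_q_values, done, gamma=0.95):
    """One-step DQN target: r + gamma * max over next Q-values, zeroed when the episode ended."""
    return reward + gamma * max(next_q_values) * (1.0 - float(done))

# Made-up batch: mostly small costs plus one large stockout-style penalty
raw_rewards = [-5.0, -4.0, -6.0, -100.0]
scaled = standardize(raw_rewards)
print([round(x, 3) for x in scaled])

# Made-up target-network Q-values for the three actions in the next state
print(td_target(scaled[0], [0.2, -0.1, 0.4], done=False))
print(td_target(scaled[0], [0.2, -0.1, 0.4], done=True))
```

Because the statistics are recomputed for every sampled batch, the same raw reward can map to different normalized values from one update to the next. This keeps gradients bounded, but it also means the learned Q-values are in normalized units rather than in units of cash.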

Note: I showcased the model running for 50 steps rather than 10 to better show the increase in cash.

ChatGPT Help:
Prompt: "Adjust for grammar, spelling, and academic tone"
Result: See above