## REINFORCE (POLICY-BASED) ALGORITHM ON INVENTORY MANAGEMENT

Suppose a retailer wants to manage the inventory of a retail product to maximize product availability over a 3-month period. The retailer can take three inventory actions: Order, Maintain, and Reduce. The demand for the product can be High, Medium, or Low.


![Product Pricing](https://www.foodrepublic.com/img/gallery/why-you-should-never-eat-canned-food-that-was-accidentally-frozen/l-intro-1693380530.jpg)

### States:
- High

- Medium

- Low

### Actions:
- Reduce

- Maintain

- Order



### Let's use the REINFORCE Algorithm to Find the Best Policy for the Retail Inventory

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
import numpy as np
import pandas as pd

# Assume we load the inventory dataset like this (please replace this with actual data)
inventory_data = pd.read_csv('/content/drive/MyDrive/Deep Learning/Lab 2 Practice/inventory_dataset.csv')
inventory_data

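| Unnamed: 0 | Current Inventory | Demand | Lead Time | Price | Season | Action Taken |
|---|---|---|---|---|---|---|
| 0 | 5 | 91 | 5 | 16 | Spring | Reduce |
| 1 | 8 | 39 | 5 | 14 | Spring | Maintain |
| 2 | 10 | 238 | 10 | 20 | Winter | Reduce |
| 3 | 12 | 37 | 10 | 17 | Autumn | Maintain |
| 4 | 4 | 295 | 8 | 13 | Summer | Order |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 20 | 229 | 3 | 18 | Spring | Maintain |
| 996 | 5 | 11 | 8 | 12 | Winter | Reduce |
| 997 | 23 | 91 | 8 | 11 | Winter | Order |
| 998 | 16 | 11 | 9 | 15 | Spring | Maintain |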


#### Q1 Define your states, actions, learning rate (typically between 0.001 and 0.5), and discount factor. Fill in numerical values to replace the question mark (?) for transition_probs and rewards. For transition_probs, ensure that the probabilities for each state-action pair sum to 1.

In [13]:
# Define states, actions, and parameters
states =  ["Low", "Medium", "High"]
actions = ["Order", "Maintain", "Reduce"]
learning_rate =  0.001
discount_factor =  0.9

# Initialize policy parameters
policy_params = {state: {action: 0.0 for action in actions} for state in states}

# Define transition probabilities
transition_probs = {
    ("Low", "Order"): {"Low": 0.8, "Medium": 0.2, "High": 0.0},
    ("Medium", "Order"): {"Low": 0.1, "Medium": 0.8, "High": 0.1},
    ("High", "Order"): {"Low": 0.0, "Medium": 0.2, "High": 0.8},
    ("Low", "Maintain"): {"Low": 0.9, "Medium": 0.1, "High": 0.0},
    ("Medium", "Maintain"): {"Low": 0.0, "Medium": 0.9, "High": 0.1},
    ("High", "Maintain"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
    ("Low", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
    ("Medium", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
    ("High", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0}}

# Define rewards
rewards = {
    'Low': {'Order': 5, 'Maintain': 0.5, 'Reduce': -10},
    'Medium': {'Order': 0, 'Maintain': 0.5, 'Reduce': 5},
    'High': {'Order': -1, 'Maintain': -0.5, 'Reduce': 2}
}
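
The requirement that each state-action row sums to 1 can be checked programmatically. The following is a minimal sketch: `find_invalid_rows` and `example_transition_probs` are hypothetical names introduced here for illustration, and the example table is only a subset of the one defined above.

```python
import math

# Illustrative subset of the transition table defined above.
example_transition_probs = {
    ("Low", "Order"): {"Low": 0.8, "Medium": 0.2, "High": 0.0},
    ("Medium", "Order"): {"Low": 0.1, "Medium": 0.8, "High": 0.1},
    ("High", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
}

def find_invalid_rows(transition_table, tol=1e-9):
    """Return the (state, action) keys whose probabilities are negative or do not sum to 1."""
    invalid = []
    for key, next_state_probs in transition_table.items():
        values = list(next_state_probs.values())
        sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
        if any(p < 0 for p in values) or not sums_to_one:
            invalid.append(key)
    return invalid

# An empty list means every row is a valid probability distribution.
print(find_invalid_rows(example_transition_probs))
```

Running the same function on the full `transition_probs` dictionary before training is a cheap guard, since `np.random.choice` raises an error when the probabilities passed to `p` do not sum to 1.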

#### Q2 Implement the Policy Gradient Update Rule in the function, 'update_policy'. Replace the question mark (?)
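
For reference, here is a sketch of the update rule for a softmax policy. The symbols are introduced here for illustration: $\theta(s,a)$ stands for `policy_params[state][action]`, $\alpha$ for `learning_rate`, $G_t$ for the discounted return from step $t$, and $a_t$ for the action actually taken in state $s_t$.

$$
\pi_\theta(a \mid s) = \frac{e^{\theta(s,a)}}{\sum_{b} e^{\theta(s,b)}},
\qquad
\frac{\partial \log \pi_\theta(a_t \mid s_t)}{\partial \theta(s_t, a)} = \mathbb{1}[a = a_t] - \pi_\theta(a \mid s_t)
$$

$$
\theta(s_t, a) \leftarrow \theta(s_t, a) + \alpha \, G_t \left( \mathbb{1}[a = a_t] - \pi_\theta(a \mid s_t) \right)
$$

For the taken action the bracket equals $1 - \pi_\theta(a_t \mid s_t)$, and for every other action it equals $-\pi_\theta(a \mid s_t)$, which corresponds to the two branches of the `if`/`else` in `update_policy`.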

In [14]:
def get_state_from_inventory(inventory_level):
    """Convert inventory level to state"""
    if inventory_level <= 10:
        return "Low"
    elif inventory_level <= 20:
        return "Medium"
    else:
        return "High"

def get_policy_probabilities(state):
    """Convert policy parameters to probabilities using softmax"""
    params = np.array([policy_params[state][action] for action in actions])
    exp_params = np.exp(params - np.max(params))
    return exp_params / np.sum(exp_params)

def choose_action(state):
    """Choose action based on policy probabilities"""
    probabilities = get_policy_probabilities(state)
    return np.random.choice(actions, p=probabilities)

def get_next_state(current_state, action):
    """Determine next state based on transition probabilities"""
    next_state_probs = transition_probs[(current_state, action)]
    states_list = list(next_state_probs.keys())
    probs = list(next_state_probs.values())
    return np.random.choice(states_list, p=probs)

def generate_episode():
    """Generate a single episode"""
    episode = []
    current_state = np.random.choice(states)

    for _ in range(10):  # Fixed episode length
        action = choose_action(current_state)
        reward = rewards[current_state][action]
        next_state = get_next_state(current_state, action)
        episode.append((current_state, action, reward))
        current_state = next_state

    return episode

def calculate_returns(episode):
    """Calculate returns for each step"""
    returns = []
    G = 0
    for _, _, reward in reversed(episode):
        G = reward + discount_factor * G
        returns.insert(0, G)
    return returns

def update_policy(episode, returns):
    """Update policy parameters using policy gradient"""
    for (state, action, _), G in zip(episode, returns):
        action_probs = get_policy_probabilities(state)
        action_idx = actions.index(action)

        for a_idx, a in enumerate(actions):
            if a_idx == action_idx:
                policy_params[state][a] += learning_rate * G * (1 - action_probs[a_idx])
            else:
                policy_params[state][a] -= learning_rate * G * action_probs[a_idx]
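
The return and update computations in the cell above can be traced by hand on a tiny example. The sketch below is self-contained and uses hypothetical rewards for a 3-step episode (not taken from the notebook's runs). `discounted_returns` and `softmax` are illustrative helper names, and the sampled action is assumed to be index 0.

```python
import numpy as np

def discounted_returns(step_rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed backwards."""
    returns = []
    G = 0.0
    for reward in reversed(step_rewards):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()
    return returns

def softmax(preferences):
    """Numerically stable softmax over a 1-D array of action preferences."""
    shifted = np.exp(preferences - np.max(preferences))
    return shifted / shifted.sum()

# Hypothetical 3-step episode rewards.
example_returns = discounted_returns([5.0, 0.5, 2.0], gamma=0.9)
print(example_returns)  # approximately [7.07, 2.3, 2.0]

# One update for the first step, starting from all-zero preferences
# and assuming the sampled action was index 0.
preferences = np.zeros(3)
probs = softmax(preferences)
grad_log_pi = -probs
grad_log_pi[0] += 1.0  # one-hot(action) minus probabilities
preferences += 0.001 * example_returns[0] * grad_log_pi
print(preferences)
```

The preference of the taken action rises and the other two fall by the same total amount, so the preferences for a state always sum to the same value after an update.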



In [15]:
# Training loop
print("Starting training...")
num_episodes = 1000
for episode_num in range(num_episodes):
    episode = generate_episode()
    returns = calculate_returns(episode)
    update_policy(episode, returns)

    if (episode_num + 1) % 100 == 0:
        print(f"Episode {episode_num + 1}/{num_episodes} completed")

# Print final policy
print("\nFinal Policy:")
for state in states:
    probs = get_policy_probabilities(state)
    print(f"\nState: {state}")
    for action, prob in zip(actions, probs):
        print(f"Action: {action}, Probability: {prob:.3f}")

# Test policy on specific inventory levels
print("\nPolicy Recommendations for Test Inventory Levels:")
test_inventory_levels = [5, 15, 25]

for inventory in test_inventory_levels:
    state = get_state_from_inventory(inventory)
    probs = get_policy_probabilities(state)
    recommended_action = actions[np.argmax(probs)]

    print(f"\nInventory Level: {inventory}")
    print(f"State: {state}")
    print(f"Recommended Action: {recommended_action}")
    print("Action Probabilities:")
    for action, prob in zip(actions, probs):
        print(f"  {action}: {prob:.3f}")

Starting training...
Episode 100/1000 completed
Episode 200/1000 completed
Episode 300/1000 completed
Episode 400/1000 completed
Episode 500/1000 completed
Episode 600/1000 completed
Episode 700/1000 completed
Episode 800/1000 completed
Episode 900/1000 completed
Episode 1000/1000 completed

Final Policy:

State: Low
Action: Order, Probability: 0.911
Action: Maintain, Probability: 0.066
Action: Reduce, Probability: 0.022

State: Medium
Action: Order, Probability: 0.146
Action: Maintain, Probability: 0.178
Action: Reduce, Probability: 0.677

State: High
Action: Order, Probability: 0.025
Action: Maintain, Probability: 0.028
Action: Reduce, Probability: 0.947

Policy Recommendations for Test Inventory Levels:

Inventory Level: 5
State: Low
Recommended Action: Order
Action Probabilities:
  Order: 0.911
  Maintain: 0.066
  Reduce: 0.022

Inventory Level: 15
State: Medium
Recommended Action: Reduce
Action Probabilities:
  Order: 0.146
  Maintain: 0.178
  Reduce: 0.677

Inventory Level: 25
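State: High
Recommended Action: Reduce
Action Probabilities:
  Order: 0.025
  Maintain: 0.028
  Reduce: 0.947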

The REINFORCE algorithm has been run for 1,000 episodes of 10 steps each, and the policy parameters have been updated after every episode. The probabilities shown are the softmax of the learned action preferences in each state.

Keep in mind that with a short training run, a small learning rate (0.001), and high-variance Monte Carlo returns, these values are not necessarily indicative of an optimal policy. In practice, you would run the REINFORCE algorithm for many more episodes, potentially with a more expressive policy representation (such as a neural network), to learn a more reliable policy.
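
Because the transition and reward tables are fully known in this exercise, one way to sanity-check the learned policy is to compare it with the greedy policy obtained by value iteration. Value iteration is a planning technique that is not part of REINFORCE, so it serves here only as a reference point. The names `P`, `R`, `gamma`, and `q_from_values` are introduced for illustration, and the tables are copied from Q1 so that the sketch runs on its own.

```python
states = ["Low", "Medium", "High"]
actions = ["Order", "Maintain", "Reduce"]
gamma = 0.9

# Same tables as in Q1, repeated so the snippet is self-contained.
P = {
    ("Low", "Order"): {"Low": 0.8, "Medium": 0.2, "High": 0.0},
    ("Medium", "Order"): {"Low": 0.1, "Medium": 0.8, "High": 0.1},
    ("High", "Order"): {"Low": 0.0, "Medium": 0.2, "High": 0.8},
    ("Low", "Maintain"): {"Low": 0.9, "Medium": 0.1, "High": 0.0},
    ("Medium", "Maintain"): {"Low": 0.0, "Medium": 0.9, "High": 0.1},
    ("High", "Maintain"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
    ("Low", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
    ("Medium", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
    ("High", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
}
R = {
    "Low": {"Order": 5, "Maintain": 0.5, "Reduce": -10},
    "Medium": {"Order": 0, "Maintain": 0.5, "Reduce": 5},
    "High": {"Order": -1, "Maintain": -0.5, "Reduce": 2},
}

def q_from_values(V):
    """One-step lookahead: Q(s,a) = R(s,a) + gamma * sum over s' of P(s'|s,a) * V(s')."""
    return {
        s: {
            a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            for a in actions
        }
        for s in states
    }

V = {s: 0.0 for s in states}
for _ in range(1000):  # far more sweeps than needed for gamma = 0.9
    Q_opt = q_from_values(V)
    V = {s: max(Q_opt[s].values()) for s in states}

greedy_policy = {s: max(actions, key=lambda a: Q_opt[s][a]) for s in states}
print(greedy_policy)
```

If the greedy action in each state matches the highest-probability action in the final policy printed above, the learned policy agrees with the planning solution for these tables.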

## DYNA-Q (MODEL-BASED) ALGORITHM FOR INVENTORY MANAGEMENT

Inventory management is the backbone of any retail business: it is the set of systems and processes you implement to track your store's inventory and keep the business in order.

![Inventory Management](https://aotmp.com/wp-content/uploads/3-Ways-to-Drive-Efficiency-for-Your-Enterprise-Expense-Management-Program-via-Inventory-Management.png)

We can define the states, actions, transition probabilities, rewards, and discount factor for an inventory management problem. Let's go through each component:

- **States**:
    - The states represent the different levels of inventory. In this case, the states are defined as **["Low", "Medium", "High"]**, indicating low, medium, and high levels of inventory, respectively.

- **Actions**:
    - The actions represent the decisions the agent can take regarding the inventory. The available actions in this problem are **["Order", "Maintain", "Reduce"]**, which correspond to ordering more inventory, maintaining the current inventory level, or reducing the inventory level, respectively.

    In the context of the product inventory management problem, the actions 'Reduce', 'Maintain', and 'Order' have specific meanings:

    - *Reduce*: The 'Reduce' action means **decreasing the product inventory level**. This could involve strategies such as **selling or promoting products** to reduce the inventory to a desired level. The specific implementation of the 'Reduce' action would depend on the business's inventory management practices.

    - *Maintain*: The 'Maintain' action means **keeping the product inventory level** unchanged. When the agent selects the 'Maintain' action, it implies that the current inventory level is considered satisfactory, and there is no need to increase or decrease it.

    - *Order*: The 'Order' action means **replenishing the product inventory** by placing an order for more products. When the agent chooses the 'Order' action, it indicates that the current inventory level is insufficient, and it is necessary to **order more products** to meet the expected demand.

- **Transition Probabilities**:
    - The transition probabilities define the likelihood of **moving from one state to another** when a particular action is taken. The probabilities are represented in a nested dictionary format, where the keys are tuples of the form **(current_state, action)**, and the values are dictionaries mapping possible next states to their corresponding probabilities.

- **Rewards**:
    - The rewards represent the **immediate rewards** received for taking a specific action in a given state. They are stored as a nested dictionary keyed first by the current state and then by the action, so a reward is looked up as **rewards[current_state][action]**.

- **Discount Factor**:
    - The discount factor, represented as discount_factor, **determines the importance of immediate rewards** versus future rewards. It is a value between 0 and 1, where a higher value places more emphasis on future rewards.
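
The components above map onto plain Python dictionaries. The sketch below mirrors the layout used in the code cells and shows how one transition could be sampled from them. Only two illustrative rows are included, and the `step` helper and the seeded `rng` are assumptions introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability

# Two illustrative rows; the full tables are defined in the cells below.
transition_probs = {
    ("Low", "Order"): {"Low": 0.8, "Medium": 0.2, "High": 0.0},
    ("High", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
}
rewards = {
    "Low": {"Order": 5},
    "High": {"Reduce": 2},
}

def step(state, action):
    """Sample one transition: return (next_state, reward) for a state-action pair."""
    next_state_probs = transition_probs[(state, action)]
    next_state = rng.choice(
        list(next_state_probs.keys()), p=list(next_state_probs.values())
    )
    # The reward depends only on the current state and the action taken.
    return str(next_state), rewards[state][action]

print(step("High", "Reduce"))  # always ('High', 2): this row is deterministic
print(step("Low", "Order"))    # 'Low' about 80% of the time, 'Medium' otherwise
```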

#### Q3 Define your states, actions, transition_probs, rewards, learning rate, discount factor and epsilon.

In [16]:
states = ["Low", "Medium", "High"]

In [17]:
actions = ["Order", "Maintain", "Reduce"]

In [18]:
transition_probs = {
    ("Low", "Order"): {"Low": 0.8, "Medium": 0.2, "High": 0.0},
    ("Medium", "Order"): {"Low": 0.1, "Medium": 0.8, "High": 0.1},
    ("High", "Order"): {"Low": 0.0, "Medium": 0.2, "High": 0.8},
    ("Low", "Maintain"): {"Low": 0.9, "Medium": 0.1, "High": 0.0},
    ("Medium", "Maintain"): {"Low": 0.0, "Medium": 0.9, "High": 0.1},
    ("High", "Maintain"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
    ("Low", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
    ("Medium", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0},
    ("High", "Reduce"): {"Low": 0.0, "Medium": 0.0, "High": 1.0}}

In [19]:
# Define rewards
rewards = {
    'Low': {'Order': 5, 'Maintain': 0.5, 'Reduce': -10},
    'Medium': {'Order': 0, 'Maintain': 0.5, 'Reduce': 5},
    'High': {'Order': -1, 'Maintain': -0.5, 'Reduce': 2}
}

In [20]:
learning_rate =  0.001
discount_factor = 0.9
epsilon = 0.1
planning_steps = 50  # Number of planning steps for Dyna-Q
num_episodes = 1000

In [21]:
# Initialize Q-values
Q = {
    state: {action: 0.0 for action in actions}
    for state in states
}

# Initialize model for Dyna-Q
# The model stores the most recently observed (next_state, reward) for each
# state-action pair; None marks pairs that have not been visited yet
model = {
    state: {action: None for action in actions}
    for state in states
}

# Keep track of visited state-action pairs
visited_pairs = set()

### Let's use the DYNA-Q Algorithm to obtain the Optimal Q-Values for all State-Action Pairs.

We implement the Dyna-Q algorithm below and use the Q-learning update rule to update the Q values:

![Alt text](https://miro.medium.com/v2/resize:fit:1400/1*XRF0ejkSFrQsWh55BC1H1Q.png)
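
In case the image does not render, the standard Q-learning update can be written as follows. Here $\alpha$ is the learning rate, $\gamma$ the discount factor, $r$ the reward for taking action $a$ in state $s$, and $s'$ the next state:

$$
Q^{\text{new}}(s, a) \leftarrow Q(s, a) + \alpha \Big[ \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{TD target}} - Q(s, a) \Big]
$$

The bracketed quantity, the TD target minus the current estimate $Q(s,a)$, is the TD error.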

#### Q4: Using the Q-learning update rule shown above, write the lines for the temporal-difference target (td_target), the TD error (td_error), and the updated value Qnew(s,a). In the function update_q_value, td_target, td_error, and Q(s,a) have been replaced with question marks (?).

In [22]:
def get_state_from_inventory(inventory_level):
    """Convert inventory level to state"""
    if inventory_level <= 10:
        return "Low"
    elif inventory_level <= 20:
        return "Medium"
    else:
        return "High"

def choose_action(state, epsilon=0.1):
    """Choose action using epsilon-greedy policy"""
    if np.random.random() < epsilon:
        return np.random.choice(actions)
    else:
        return max(actions, key=lambda a: Q[state][a])

def get_next_state(current_state, action):
    """Get next state based on transition probabilities"""
    probs = transition_probs[(current_state, action)]
    return np.random.choice(list(probs.keys()), p=list(probs.values()))

def get_reward(state, action):
    """Get reward for state-action pair"""
    return rewards[state][action]

def update_q_value(state, action, reward, next_state):
    """Update Q-value using Q-learning update rule"""
    best_next_action = max(actions, key=lambda a: Q[next_state][a])
    td_target = reward + discount_factor * Q[next_state][best_next_action]
    td_error = td_target - Q[state][action]
    Q[state][action] += learning_rate * td_error

def update_model(state, action, next_state, reward):
    """Update model with observed transition"""
    model[state][action] = (next_state, reward)
    visited_pairs.add((state, action))

def planning_step():
    """Perform one planning step using the model"""
    if not visited_pairs:
        return

    # Randomly select a previously visited state-action pair
    state, action = list(visited_pairs)[np.random.randint(len(visited_pairs))]
    next_state, reward = model[state][action]

    # Update Q-value using the model
    update_q_value(state, action, reward, next_state)
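
To make the update in `update_q_value` concrete, here is a minimal worked example of a single step. It assumes the first ever visit to (Low, Order), when all Q-values are still 0 and the reward is 5. The function name `q_learning_update` is hypothetical and returns the intermediate quantities instead of mutating `Q`.

```python
discount_factor = 0.9
learning_rate = 0.001

def q_learning_update(q_sa, reward, max_q_next):
    """Return (td_target, td_error, new_q) for a single Q-learning update."""
    td_target = reward + discount_factor * max_q_next
    td_error = td_target - q_sa
    new_q = q_sa + learning_rate * td_error
    return td_target, td_error, new_q

# First visit to (Low, Order): Q(s, a) = 0, max over next-state Q-values = 0, reward = 5.
print(q_learning_update(q_sa=0.0, reward=5, max_q_next=0.0))  # (5.0, 5.0, 0.005)
```

With a learning rate of 0.001, each update moves the estimate only 0.1% of the way toward the target, which is why the planning steps, which replay stored transitions many times per real step, matter for how quickly the Q-values grow.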


In [23]:
# Training loop with Dyna-Q
print("Starting Dyna-Q training...")
for episode in range(num_episodes):
    # Start from any state
    current_state = np.random.choice(states)

    # Real experience
    for _ in range(10):  # Steps per episode
        # Choose and take action
        action = choose_action(current_state, epsilon)
        next_state = get_next_state(current_state, action)
        reward = get_reward(current_state, action)

        # Update Q-value with real experience
        update_q_value(current_state, action, reward, next_state)

        # Update model
        update_model(current_state, action, next_state, reward)

        # Planning steps (using the model)
        for _ in range(planning_steps):
            planning_step()

        current_state = next_state

    if (episode + 1) % 100 == 0:
        print(f"Episode {episode + 1}/{num_episodes} completed")


Starting Dyna-Q training...
Episode 100/1000 completed
Episode 200/1000 completed
Episode 300/1000 completed
Episode 400/1000 completed
Episode 500/1000 completed
Episode 600/1000 completed
Episode 700/1000 completed
Episode 800/1000 completed
Episode 900/1000 completed
Episode 1000/1000 completed


In [24]:
# Print final Q-values
print("\nFinal Q-values:")
for state in states:
    print(f"\nState: {state}")
    for action in actions:
        print(f"Action: {action}, Q-value: {Q[state][action]:.3f}")

# Test policy on specific inventory levels
print("\nPolicy Recommendations for Test Inventory Levels:")
test_inventory_levels = [5, 15, 25]

for inventory in test_inventory_levels:
    state = get_state_from_inventory(inventory)
    best_action = max(actions, key=lambda a: Q[state][a])
    q_values = [Q[state][a] for a in actions]

    print(f"\nInventory Level: {inventory}")
    print(f"State: {state}")
    print(f"Recommended Action: {best_action}")
    print("Q-values for each action:")
    for action, q_value in zip(actions, q_values):
        print(f"  {action}: {q_value:.3f}")


Final Q-values:

State: Low
Action: Order, Q-value: 26.571
Action: Maintain, Q-value: 23.480
Action: Reduce, Q-value: 7.964

State: Medium
Action: Order, Q-value: 18.100
Action: Maintain, Q-value: 21.162
Action: Reduce, Q-value: 22.964

State: High
Action: Order, Q-value: 17.178
Action: Maintain, Q-value: 17.464
Action: Reduce, Q-value: 19.965

Policy Recommendations for Test Inventory Levels:

Inventory Level: 5
State: Low
Recommended Action: Order
Q-values for each action:
  Order: 26.571
  Maintain: 23.480
  Reduce: 7.964

Inventory Level: 15
State: Medium
Recommended Action: Reduce
Q-values for each action:
  Order: 18.100
  Maintain: 21.162
  Reduce: 22.964

Inventory Level: 25
State: High
Recommended Action: Reduce
Q-values for each action:
  Order: 17.178
  Maintain: 17.464
  Reduce: 19.965


### Key differences from the REINFORCE implementation:

- Q-value Based: Instead of policy parameters, we maintain Q-values for each state-action pair.


- Model-Based Learning: The Dyna-Q algorithm maintains a model of the environment that stores observed transitions and rewards.


- Planning Steps: After each real experience, the algorithm performs multiple planning steps using the learned model.


- Epsilon-Greedy Exploration: Dyna-Q selects actions with an epsilon-greedy policy over the Q-values instead of sampling from a softmax (stochastic) policy.


While both algorithms can solve reinforcement learning problems, they suit different scenarios. REINFORCE is the better fit for problems that require stochastic policies or continuous actions, while Dyna-Q is more sample-efficient for discrete problems where a learned model can be leveraged.
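
The difference between the two action-selection schemes can be seen by computing the action probabilities each one induces for the same numbers. This is a minimal sketch with hypothetical values for a single state. `epsilon_greedy_probs` and `softmax_probs` are illustrative helpers, not functions from the notebook.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy: explore uniformly with probability epsilon."""
    q_values = np.asarray(q_values, dtype=float)
    probs = np.full(len(q_values), epsilon / len(q_values))
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def softmax_probs(preferences):
    """Action probabilities under a softmax policy over action preferences."""
    preferences = np.asarray(preferences, dtype=float)
    shifted = np.exp(preferences - preferences.max())
    return shifted / shifted.sum()

# Hypothetical values for one state, in the order [Order, Maintain, Reduce].
values = [1.0, 0.5, 2.0]
print(epsilon_greedy_probs(values, epsilon=0.1))
print(softmax_probs(values))
```

Epsilon-greedy gives every non-greedy action the same small probability regardless of its value, whereas the softmax policy ranks all actions by their preferences.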

##### Gerald Onwujekwe - PhD