# **Lab-based Assignment 4:** Practice - Q Learning for Retail Inventory Management

![Imgur](https://i.imgur.com/b4dtRYW.png)

<div class="LI-profile-badge"  data-version="v1" data-size="large" data-locale="en_US" data-type="horizontal" data-theme="light" data-vanity="drsalihtutun"><a class="LI-simple-link" href='https://www.linkedin.com/in/drsalihtutun/en-us?trk=profile-badge'>Salih Tutun, PhD</a></div>

![Imgur](https://i.imgur.com/jmmupc7.png)

By Smart Retail Solutions

## **What is Retail Inventory?**

**Retail inventory management** is the backbone of any retail business: it is the set of systems and processes you use to keep a record of your store's inventory and keep the business in order. Having the right **automated inventory management system** in place can make all the difference. Out-of-stock items mean **frustrated customers and lost sales**, which over time can damage a retailer's reputation and cost future customers and sales.

![Imgur](https://i.imgur.com/UUUcCid.png)

### **Components of Inventory Management**

We can define the states, actions, rewards, and discount factor for an inventory management problem.

Let's go through each component:

- **States**:
    - The states represent the different levels of inventory. In this case, the states are defined as **["Low", "Medium", "High"]**, indicating low, medium, and high levels of inventory, respectively.

- **Actions**:
    - The actions represent the decisions the agent can take regarding the inventory. The available actions in this problem are **["Order", "Maintain", "Reduce"]**, which correspond to ordering more inventory, maintaining the current inventory level, or reducing the inventory level, respectively.

    In the context of the product inventory management problem, the actions 'Reduce', 'Maintain', and 'Order' have specific meanings:

   - *Reduce*: The 'Reduce' action means **decreasing the product inventory level**. This could involve strategies such as **selling or promoting products** to reduce the inventory to a desired level. The specific implementation of the 'Reduce' action would depend on the business's inventory management practices.

   - *Maintain*: The 'Maintain' action means **keeping the product inventory level** unchanged. When the agent selects the 'Maintain' action, it implies that the current inventory level is considered satisfactory, and there is no need to increase or decrease it.

   - *Order*: The 'Order' action means **replenishing the product inventory** by placing an order for more products. When the agent chooses the 'Order' action, it indicates that the current inventory level is insufficient, and it is necessary to **order more products** to meet the expected demand.

- **Rewards**:
    - The rewards represent the **immediate rewards** received for taking a specific action in a given state. They are stored in a dictionary whose keys are tuples of the form **(current_state, action)** and whose values are the associated rewards.

- **Discount Factor**:
    - The discount factor, represented as discount_factor, **determines the importance of immediate rewards** versus future rewards. It is a value between 0 and 1, where a higher value places more emphasis on future rewards.
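To make the role of the discount factor concrete, here is a minimal sketch that weights a short reward sequence with two different discount factors. The reward sequence and the helper name `discounted_return` are illustrative assumptions, not part of the lab code.

```python
def discounted_return(reward_sequence, discount_factor):
    """Sum the rewards, weighting the reward at step t by discount_factor ** t."""
    return sum(r * discount_factor ** t for t, r in enumerate(reward_sequence))

# Hypothetical rewards received at steps 0, 1, and 2
reward_sequence = [2, 1, 2]

far_sighted = discounted_return(reward_sequence, 0.9)   # about 4.52
short_sighted = discounted_return(reward_sequence, 0.1)  # about 2.12

print(f"discount_factor=0.9 -> {far_sighted:.2f}")
print(f"discount_factor=0.1 -> {short_sighted:.2f}")
```

With a discount factor of 0.9 the later rewards contribute most of their value, while with 0.1 the total is dominated by the first reward.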

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import random

# Read the CSV file into a DataFrame
df = pd.read_csv('/content/drive/MyDrive/Deep Learning/Lab 4 Files/inventory_dataset.csv')
df.head()

- In **Value Iteration and Policy Iteration**, the transition probabilities (which describe the probability of moving from one state to another given a specific action) are a critical component because they are part of the Markov Decision Process (**MDP**) model. These methods rely on a complete knowledge of **the environment's dynamics**, including how actions affect state transitions.

- In contrast, Q-learning is a **model-free** reinforcement learning algorithm. It does not require knowledge of transition probabilities or the reward function of the environment. Instead, it learns by interacting with the environment, receiving feedback in the form of rewards, and updating its estimates of the Q-values (which represent the expected utility of taking a given action in a given state) based on experience.

- The **key difference** is that Q-learning can learn directly from experience without needing a model of the environment (such as transition probabilities), making it suitable for situations where the model is unknown or too complex to formulate.

- However, **if transition probabilities are known** or can be estimated from data, they can be used to improve the learning process in Q-learning, especially in the initial phases. This is often done by simulating experiences using the transition probabilities, which can help to initialize or guide the learning process. But fundamentally, Q-learning is designed to work without needing them.

- In this lab's setup, if reliable data is available to estimate transition probabilities, we could use those estimates to initialize the Q-values or to simulate experiences for the Q-learning algorithm. Otherwise, the algorithm can still learn directly from the data as it is.
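As a rough illustration of what estimating transition probabilities from data could look like, the sketch below counts how often one state follows another in an observed sequence. The sequence and the function name `estimate_transition_probabilities` are hypothetical. The sketch also ignores actions, whereas a full MDP model would estimate a separate distribution for each state-action pair.

```python
from collections import Counter, defaultdict

def estimate_transition_probabilities(state_sequence):
    """Estimate P(next_state | state) from consecutive pairs in the sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(state_sequence, state_sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(next_counts.values()) for nxt, c in next_counts.items()}
        for state, next_counts in counts.items()
    }

# Hypothetical sequence of observed inventory states
observed = ["Low", "Medium", "Medium", "High", "High", "Medium"]
probs = estimate_transition_probabilities(observed)

for state, next_probs in probs.items():
    print(state, next_probs)
```

Q-learning does not need these estimates; they would only be used to seed or supplement the learning process.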

## Updating Q-learning algorithm with expanded states and actions

## **Q1: Define the three states and three actions, then find the optimal policy**

#### **How to define the states, actions and learning parameters?**

- This version of the Q-learning algorithm uses three states (**"Low Inventory", "Medium Inventory", "High Inventory"**) and three actions (**"Order", "Maintain", "Reduce"**). The reward system is also more nuanced, reflecting different scenarios in inventory management.

- The Q-learning algorithm iterates over a number of episodes, updating the Q-values based on the rewards received for each state-action pair. It balances exploration and exploitation using the **epsilon** parameter and updates the Q-values with the learning rate (**alpha**) and **discount factor**.

- The convergence plot tracks how the Q-value for a specific state-action pair ("Low Inventory", "Order") evolves each time that pair is updated, which is helpful for understanding the learning process and checking whether the algorithm is converging to a stable policy.
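The update described above can be traced by hand for a couple of steps. The sketch below wraps the standard Q-learning update in a hypothetical helper, `q_update`, and applies it twice using the same `alpha` and `discount_factor` values as the cell that follows. The two transitions are made up for illustration.

```python
def q_update(q_table, state, action, reward, next_state, alpha, discount_factor):
    """Apply one Q-learning update in place and return the new Q-value."""
    best_next = max(q_table[next_state].values())
    old_value = q_table[state][action]
    q_table[state][action] = old_value + alpha * (
        reward + discount_factor * best_next - old_value
    )
    return q_table[state][action]

states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]
Q = {state: {action: 0.0 for action in actions} for state in states}

# First update: every Q-value is still zero, so the target is just the reward.
# 0 + 0.5 * (2 + 0.9 * 0 - 0) = 1.0
first = q_update(Q, "Low Inventory", "Order", 2, "Medium Inventory",
                 alpha=0.5, discount_factor=0.9)

# Second update: the next state is "Low Inventory", whose best Q-value is now 1.0.
# 1.0 + 0.5 * (2 + 0.9 * 1.0 - 1.0) = 1.95
second = q_update(Q, "Low Inventory", "Order", 2, "Low Inventory",
                  alpha=0.5, discount_factor=0.9)

print(first, second)
```

Each update moves the old estimate a fraction `alpha` of the way toward the target `reward + discount_factor * max(Q[next_state])`.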

In [None]:
import random
import matplotlib.pyplot as plt

# Step 1: Define the states, actions, and rewards

# Expanded states based on inventory levels and demand
states =  ["Low Inventory", "Medium Inventory", "High Inventory"]

# Expanded actions
actions = ["Order", "Maintain", "Reduce"]

# Define a more complex reward system
rewards = {
    ('Low Inventory', 'Order'): 2,       # Reward for ordering when inventory is low
    ('Low Inventory', 'Maintain'): -1,   # Penalty for maintaining low inventory
    ('Low Inventory', 'Reduce'): -2,     # Penalty for reducing already low inventory
    ('Medium Inventory', 'Order'): 1,    # Smaller reward for ordering when inventory is medium
    ('Medium Inventory', 'Maintain'): 2, # Reward for maintaining medium inventory
    ('Medium Inventory', 'Reduce'): 1,   # Small reward for reducing medium inventory
    ('High Inventory', 'Order'): -3,     # Penalty for ordering when inventory is high
    ('High Inventory', 'Maintain'): -1,  # Penalty for maintaining high inventory
    ('High Inventory', 'Reduce'): 2      # Reward for reducing high inventory
}

# Step 2: Initialize the Q-table
Q = {state: {action: 0 for action in actions} for state in states}

# Step 3: Set the learning parameters
alpha = 0.5  # Learning rate
epsilon = 0.2  # Exploration rate
discount_factor = 0.9  # Discount factor
num_episodes = 100  # Number of episodes

# Tracking Q-values for convergence analysis
tracked_state = 'Low Inventory'
tracked_action = 'Order'
tracked_q_values = []

# Step 4: Q-learning algorithm
for episode in range(num_episodes):
    state = random.choice(states)

    while True:
        # Exploration-exploitation trade-off
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            action = max(Q[state], key=Q[state].get)

        reward = rewards[(state, action)]
        # No environment model is used here: the next state is drawn uniformly at random
        next_state = random.choice(states)

        # Q-learning update rule
        old_value = Q[state][action]
        Q[state][action] = old_value + alpha * (reward + discount_factor * max(Q[next_state].values()) - old_value)

        # Track Q-value for a specific state-action pair
        if state == tracked_state and action == tracked_action:
            tracked_q_values.append(Q[state][action])

        state = next_state

        # End the episode with probability 0.1 after each step (keeps episodes short)
        if random.uniform(0, 1) < 0.1:
            break

# Step 5: Determine the optimal policy
policy = {state: max(Q[state], key=Q[state].get) for state in states}

# Prepare for displaying the results
Optimal_Policy = {
    "Final Q-Table": {state: Q[state] for state in states},
    "Optimal Policy": policy
}

print("\nOptimal Policy:")
print(Optimal_Policy)


# Plot the Q-value convergence
plt.plot(tracked_q_values)
plt.xlabel('Updates to tracked pair')  # one point is appended per update, not per episode
plt.ylabel('Q-value')
plt.title(f'Q-value Convergence for state-action pair ({tracked_state}, {tracked_action})')
plt.show()



After executing this code, we should see:

1. The final Q-table: It will display the learned Q-values for each state-action pair.
2. The optimal policy: It will indicate the best action to take for each state based on the learned Q-values.
3. A plot of Q-value convergence: It will show how the Q-value for the tracked state-action pair changes across its successive updates.

This implementation can serve as a solid basis for more complex scenarios or more detailed inventory management models.



## **Q2: Define the states, actions, state transitions, and rewards, and incorporate the dataset**

### **How to incorporate the dataset into the algorithm?**

To incorporate the dataset into the Q-learning algorithm, we'll first analyze the data to determine how best to define the states, actions, and rewards. Based on our dataset, which includes "Current Inventory", "Demand", "Lead Time", "Price", and "Season", we can create a more dynamic and realistic model.

Here's an approach to integrate the dataset:

1. **Defining States**: States can be derived from "Current Inventory" and "Demand". For example, states could be defined based on the ratio of current inventory to demand (e.g., "Low", "Medium", "High").

2. **Actions**: These could be similar to what you have already defined - "Order", "Maintain", "Reduce".

3. **Rewards**: The reward system can be based on the impact of actions on inventory levels relative to demand, and possibly factor in "Lead Time" and "Price".

4. **Dataset-Driven Simulation**: Instead of randomly choosing states and next states, we can use the dataset to simulate realistic transitions based on historical data.

5. **Policy Evaluation**: The policy derived from the Q-learning algorithm can then be evaluated against the historical data to see how well it performs.

First, let's analyze the dataset to define the states more precisely. We'll examine the distributions of "Current Inventory" and "Demand" to help us in defining appropriate thresholds for the states.

The dataset provides the following statistics for "Current Inventory" and "Demand":

- **Current Inventory**:
  - Mean: 14.65
  - Standard Deviation: 8.81
  - Min: 0
  - Max: 30

- **Demand**:
  - Mean: 148.54
  - Standard Deviation: 88.80
  - Min: 0
  - Max: 300

Given this information, we can define states based on inventory levels relative to demand. For example:

- **Low Inventory**: Inventory is significantly lower than the average demand.
- **Medium Inventory**: Inventory is around the average demand level.
- **High Inventory**: Inventory is significantly higher than the average demand.

The exact thresholds for these categories can be set based on the statistical distribution. For instance, we might consider inventory to be "Low" if it is less than 50% of the average demand, "Medium" if it is between 50% and 150% of the average demand, and "High" if it is more than 150%.
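Before settling on thresholds, it helps to check what inventory-to-average-demand ratios can actually occur. The short calculation below uses only the summary statistics listed above; the variable names are illustrative.

```python
# Summary statistics reported above
mean_demand = 148.54
mean_inventory = 14.65
max_inventory = 30

# Largest and typical ratio of inventory to average demand
max_ratio = max_inventory / mean_demand
mean_ratio = mean_inventory / mean_demand

print(f"max ratio:  {max_ratio:.3f}")   # about 0.202
print(f"mean ratio: {mean_ratio:.3f}")  # about 0.099
```

Since even the maximum inventory is only about 20% of the average demand, thresholds of 50% and 150% would place every row in "Low Inventory". Thresholds for this dataset need to sit well below those illustrative values.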



Let's integrate the dataset into the Q-learning algorithm with the following steps:

1. **Define States**: We'll define states based on the ratio of the current inventory to the average demand. The states will be "Low Inventory", "Medium Inventory", and "High Inventory".

2. **Define Actions and Rewards**: We'll keep the actions as "Order", "Maintain", and "Reduce". The rewards will be determined based on the effectiveness of these actions in maintaining an optimal inventory level.

3. **Update the Q-Learning Algorithm**: The algorithm will be modified to simulate transitions and rewards based on the dataset.

4. **Track and Plot Q-Value Convergence**: We'll track the convergence of Q-values for a specific state-action pair for analysis.

Here's the updated plan:

- **States**:
  - "Low Inventory": Current Inventory < 10% of Average Demand
  - "Medium Inventory": 10% <= Current Inventory <= 30% of Average Demand
  - "High Inventory": Current Inventory > 30% of Average Demand

- **Rewards**: These will be dynamically calculated based on the state-action outcomes.

Let's implement this updated Q-learning model. We'll start by defining the states and rewards based on the dataset's statistics, and then proceed to update the Q-learning algorithm.

### **How to define the state transitions?**

In the updated Q-learning code, we are not explicitly using transition probabilities in the traditional sense as used in model-based reinforcement learning methods like Value Iteration or Policy Iteration. Instead, we are using the dataset to simulate state transitions. Here's how this approach differs from using transition probabilities:

1. **Model-Free Learning**: Q-learning is a model-free method. It does not require a model of the environment (which includes transition probabilities and reward functions). The agent learns the value of actions directly from the experience of interacting with the environment.

2. **Dataset-Driven Transitions**: In our updated code, the transition from one state to another is guided by the actual data from the inventory dataset. When the Q-learning algorithm processes each row of the dataset, it determines the current state based on the inventory and demand values. The next state is determined by the inventory and demand of the next time step (or row) in the dataset. This approach mimics the dynamics of the environment using real data rather than probabilistic models.

3. **Empirical Approach**: This method is empirical and data-driven. It leverages the actual sequences of inventory levels and demands observed in the past to inform the agent about how the state transitions occur in reality. This can be seen as an alternative to using estimated transition probabilities, especially in environments where such probabilities are difficult to model accurately.

In summary, while we are not using explicit transition probabilities in the code, we are using the sequence of states as they occur in the dataset to guide the learning process. This approach allows the Q-learning algorithm to learn an optimal policy based on real-world data, adapting to the actual dynamics observed in the inventory management system.

To illustrate how state transitions are driven by the sequence of inventory levels and demands in the dataset, let's consider a simple example with hypothetical data. It shows how the Q-learning algorithm moves through different states as it reads the data row by row.

#### Hypothetical Dataset Example
Imagine we have a small dataset representing inventory levels and demands over several days:

| Day | Current Inventory | Demand |
|-----|-------------------|--------|
| 1   | 100               | 200    |
| 2   | 120               | 180    |
| 3   | 80                | 60     |
| 4   | 150               | 100    |
| 5   | 200               | 120    |

#### State Categorization
Let's assume the states are categorized as follows based on the inventory-to-demand ratio (for this illustration the ratio uses each day's own demand, whereas the lab code divides by the average demand):
- "Low Inventory": Inventory ≤ 50% of demand
- "Medium Inventory": Inventory strictly between 50% and 150% of demand
- "High Inventory": Inventory ≥ 150% of demand

#### State Transitions Based on Data

- **Day 1**: Inventory = 100, Demand = 200
  - Inventory-to-Demand Ratio = 100 / 200 = 0.5
  - State = "Low Inventory"

- **Day 2**: Inventory = 120, Demand = 180
  - Inventory-to-Demand Ratio = 120 / 180 ≈ 0.67
  - State = "Medium Inventory"

- **Day 3**: Inventory = 80, Demand = 60
  - Inventory-to-Demand Ratio = 80 / 60 ≈ 1.33
  - State = "Medium Inventory"

- **Day 4**: Inventory = 150, Demand = 100
  - Inventory-to-Demand Ratio = 150 / 100 = 1.5
  - State = "High Inventory"

- **Day 5**: Inventory = 200, Demand = 120
  - Inventory-to-Demand Ratio = 200 / 120 ≈ 1.67
  - State = "High Inventory"

#### Explanation of State Transitions

The Q-learning algorithm will process each row in the dataset sequentially. For each row, it determines the current state based on the inventory and demand values. Here's how the state transitions occur over the days:

1. Day 1: Starts in "Low Inventory".
2. Day 2: Transitions to "Medium Inventory".
3. Day 3: Remains in "Medium Inventory".
4. Day 4: Transitions to "High Inventory".
5. Day 5: Remains in "High Inventory".

These transitions are driven by the actual changes in inventory and demand over time, as recorded in the dataset. The Q-learning algorithm learns from these transitions by updating its Q-values based on the actions taken and the rewards received in each state. This way, the algorithm gradually learns the optimal action to take in each state based on historical data.
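The walk-through above can be reproduced with a few lines of code. This is a minimal sketch using the hypothetical table, with the boundary ratios 0.5 and 1.5 assigned to "Low Inventory" and "High Inventory" respectively, as in the day-by-day listing. The function name `categorize_by_ratio` is an assumption for illustration.

```python
# Hypothetical rows copied from the table above: (day, inventory, demand)
rows = [(1, 100, 200), (2, 120, 180), (3, 80, 60), (4, 150, 100), (5, 200, 120)]

def categorize_by_ratio(inventory, demand):
    """Map an inventory-to-demand ratio to a state label."""
    ratio = inventory / demand
    if ratio <= 0.5:
        return "Low Inventory"
    if ratio < 1.5:
        return "Medium Inventory"
    return "High Inventory"

# State for each day, then consecutive (state, next_state) pairs
day_states = [categorize_by_ratio(inv, dem) for _, inv, dem in rows]
transitions = list(zip(day_states, day_states[1:]))

for (day, _, _), (current, nxt) in zip(rows, transitions):
    print(f"Day {day} -> Day {day + 1}: {current} -> {nxt}")
```

Each `(state, next_state)` pair is what the Q-learning update consumes: the current row supplies the state and the following row supplies the next state.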

### **How do we define the rewards?**

Defining the rewards in a Q-learning model is a crucial step, as it directly influences the learning and the eventual policy that the model adopts. The rewards should be set based on the desired outcomes and the specific characteristics of the inventory management system. Here are some guidelines for defining rewards in this context:

1. **Reward for Reducing High Inventory**: If the inventory is high, reducing it (to avoid overstocking and possible wastage or storage costs) should be rewarded.

2. **Penalty for Ordering with High Inventory**: Ordering more items when the inventory is already high should be penalized.

3. **Reward for Ordering with Low Inventory**: If the inventory is low, ordering more should be rewarded (to avoid stockouts and lost sales).

4. **Penalty for Reducing Low Inventory**: Reducing inventory when it's already low should be penalized.

5. **Maintaining Medium Inventory**: Maintaining a medium level of inventory might be the most desirable state, so actions that help stay in this state should be slightly rewarded.

Based on these guidelines, let's define a reward system:

```python
rewards = {
    ('Low Inventory', 'Order'): 2,       # Reward for ordering when inventory is low
    ('Low Inventory', 'Maintain'): -1,   # Penalty for maintaining low inventory
    ('Low Inventory', 'Reduce'): -2,     # Penalty for reducing already low inventory
    ('Medium Inventory', 'Order'): 0,    # Neutral: ordering at medium inventory is neither rewarded nor penalized
    ('Medium Inventory', 'Maintain'): 1, # Reward for maintaining medium inventory
    ('Medium Inventory', 'Reduce'): 0,   # Neutral: reducing at medium inventory is neither rewarded nor penalized
    ('High Inventory', 'Order'): -2,     # Penalty for ordering when inventory is high
    ('High Inventory', 'Maintain'): -1,  # Penalty for maintaining high inventory
    ('High Inventory', 'Reduce'): 2      # Reward for reducing high inventory
}
```

These values are initial estimates. You may need to adjust them based on the actual dynamics of your inventory system and the outcomes you observe. The key is to align the rewards with your business objectives and adjust them as you learn more about how they impact the agent's behavior.
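A quick sanity check on a reward table like this one is to confirm that every state-action pair has an entry and to see which action each state favors on immediate reward alone. The sketch below does both; the names `missing` and `best_immediate` are illustrative.

```python
from itertools import product

states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

rewards = {
    ('Low Inventory', 'Order'): 2,
    ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2,
    ('Medium Inventory', 'Order'): 0,
    ('Medium Inventory', 'Maintain'): 1,
    ('Medium Inventory', 'Reduce'): 0,
    ('High Inventory', 'Order'): -2,
    ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2,
}

# Any (state, action) pair without a reward would raise a KeyError during training
missing = [pair for pair in product(states, actions) if pair not in rewards]

# Action with the highest immediate reward in each state
best_immediate = {s: max(actions, key=lambda a: rewards[(s, a)]) for s in states}

print("Missing pairs:", missing)
print("Best immediate action per state:", best_immediate)
```

When the next state does not depend on the chosen action, as in a loop that reads states from consecutive dataset rows, the learned policy would be expected to match these immediate-reward preferences for every state that is actually visited.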

## **Q3: Use Q-learning to find the optimal policy, then improve the results**

In [None]:
import random
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset
file_path = '/content/drive/MyDrive/Deep Learning/Lab 4 Files/inventory_dataset.csv'
inventory_data = pd.read_csv(file_path)
average_demand = inventory_data['Demand'].mean()

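# Define the function to categorize inventory level
def categorize_inventory(inventory, demand):
    # NOTE: `demand` is only used for this zero-demand guard; the ratio below
    # divides by the dataset-wide average demand, not by this row's demand
    if demand == 0:
        return "Low Inventory"
    else:
        ratio = inventory / average_demand  # Ratio of this row's inventory to the average demand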
        if ratio < 0.1:
            return "Low Inventory"
        elif ratio <= 0.3:
            return "Medium Inventory"
        else:
            return "High Inventory"

# States, actions, and rewards
states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

# Define the rewards
rewards = {
    ('Low Inventory', 'Order'): 2,
    ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2,
    ('Medium Inventory', 'Order'): 0,
    ('Medium Inventory', 'Maintain'): 1,
    ('Medium Inventory', 'Reduce'): 0,
    ('High Inventory', 'Order'): -2,
    ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2
}

# Initialize the Q-table
Q = {state: {action: 0 for action in actions} for state in states}

# Learning parameters
alpha = 0.3  # Learning rate
epsilon = 0.5  # Exploration rate
discount_factor = 0.7  # Discount factor
num_episodes = 1000  # Number of episodes

# Tracking Q-values for convergence analysis
tracked_state = 'Medium Inventory'
tracked_action = 'Order'
tracked_q_values = []

# Implement the Q-learning algorithm
for episode in range(num_episodes):
    for index, row in inventory_data.iterrows():
        state = categorize_inventory(row['Current Inventory'], row['Demand'])

        # Exploration-exploitation trade-off
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            action = max(Q[state], key=Q[state].get)

        # Get the reward for the current state-action pair
        reward = rewards[(state, action)]

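        # Update the Q-value
        old_value = Q[state][action]
        # The next state comes from the following row; the last row has no successor, so it reuses itself.
        # Assumes the default RangeIndex from read_csv, so `index` is also the row position.
        next_row = inventory_data.iloc[min(index + 1, len(inventory_data) - 1)]
        next_state = categorize_inventory(next_row['Current Inventory'], next_row['Demand'])
        Q[state][action] = old_value + alpha * (reward + discount_factor * max(Q[next_state].values()) - old_value)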

        # Track Q-value for a specific state-action pair
        if state == tracked_state and action == tracked_action:
            tracked_q_values.append(Q[state][action])

# Determine the optimal policy
policy = {state: max(Q[state], key=Q[state].get) for state in states}

# Print the final Q-table and optimal policy
print("Final Q-Table:")
for state in states:
    print(f"{state}: {Q[state]}")

print("\nOptimal Policy:")
print(policy)

# Plot the Q-value convergence
plt.plot(tracked_q_values)
plt.xlabel('Updates to tracked pair')  # one point is appended per update, not per episode
plt.ylabel('Q-value')
plt.title(f'Q-value Convergence for state-action pair ({tracked_state}, {tracked_action})')
plt.show()

The following analysis is based on the final Q-table and optimal policy produced by running the code above:

#### Final Q-Table Analysis
- **Low Inventory**: The highest Q-value is for 'Order', suggesting that the optimal action when the inventory is low is to order more items. This aligns well with intuitive inventory management strategies.
- **Medium Inventory**: The highest Q-value is for 'Maintain', indicating that the best action for medium inventory levels is to maintain the current level. This seems reasonable as medium inventory levels are likely adequate to meet demand without the need for additional ordering or reduction.
- **High Inventory**: All Q-values are zero, which means this state was never updated. The summary statistics explain why: the maximum inventory is 30 and the average demand is 148.54, so the largest possible ratio is about 0.20, which is below the 0.3 threshold. No row in the dataset can be categorized as 'High', so the way we defined 'High Inventory' is too restrictive for this data.

#### Optimal Policy Analysis
- **Low Inventory**: 'Order' as the optimal action aligns with standard inventory management practices.
- **Medium Inventory**: 'Maintain' is a logical choice, balancing the need to meet demand without overstocking.
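- **High Inventory**: The policy suggests 'Order', which is counterintuitive. This is an artifact of the zero Q-values: when all three values are tied, `max` returns the first action in the dictionary, which is 'Order'. The result reflects a lack of learning in the 'High Inventory' state rather than a learned preference.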
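The tie-breaking behavior behind the 'High Inventory' result is easy to check in isolation. The sketch below uses a hypothetical all-zero row of the Q-table and also shows one possible way to flag unvisited states instead of reporting an arbitrary action. The `policy_or_none` helper is an assumption, not part of the lab code.

```python
# A Q-table row that was never updated (all values still zero)
untrained_row = {"Order": 0, "Maintain": 0, "Reduce": 0}

# max() returns the first key among ties, following dict insertion order
chosen = max(untrained_row, key=untrained_row.get)
print(chosen)  # Order

def policy_or_none(q_row):
    """Return the greedy action, or None if every Q-value is still zero."""
    if all(value == 0 for value in q_row.values()):
        return None
    return max(q_row, key=q_row.get)

print(policy_or_none(untrained_row))                                # None
print(policy_or_none({"Order": -1.2, "Maintain": 0.4, "Reduce": 3}))  # Reduce
```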

#### Q-Value Convergence Plot
The convergence plot tracks the Q-value for the pair ('Medium Inventory', 'Order') each time that pair is updated. It shows whether the Q-value is stabilizing or whether more training is needed for convergence.



### **How to improve the results?**

Early convergence in a Q-learning model, especially when some states have not been adequately learned (as indicated by zero Q-values for the 'High Inventory' state), can be a sign of several potential issues or misconfigurations in the learning process. Here's how to address this:

1. **Increase Exploration**: Early convergence might be due to insufficient exploration. The algorithm could be exploiting its current knowledge too quickly without exploring other possibilities. Increasing the exploration rate (\( \epsilon \)) can encourage the algorithm to try out different actions more frequently. Note that in this dataset-driven loop, \( \epsilon \) only affects which action is tried; which states are visited is determined by the dataset rows.

2. **Adjust Learning Rate** (\( \alpha \)): A high learning rate can lead to rapid changes in Q-values, which might cause early convergence. Experiment with a lower learning rate to allow more gradual updates to the Q-table.

3. **Review Reward Structure**: Ensure that the rewards are encouraging the desired behavior in all states. If the rewards are not balanced or do not accurately reflect the goals of the system, the learning process can be skewed.

4. **Dataset and State Definitions**: Check if the dataset includes enough diversity in terms of state occurrences. Also, review the definitions of the states (especially 'High Inventory') to ensure they are realistic and achievable given the dataset.

5. **Longer Training**: Increase the number of episodes to give the model more time to learn from different state-action pairs. Early convergence might indicate that the model has not had enough interaction with the environment.
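For the fourth point, one simple diagnostic is to count how often each state occurs under a given pair of thresholds. The sketch below uses the average demand reported earlier and a small hypothetical list of inventory values in the dataset's 0 to 30 range. It compares the original thresholds (0.1, 0.3) with the adjusted ones (0.01, 0.1) used in the next cell. The function and variable names are illustrative.

```python
from collections import Counter

average_demand = 148.54  # mean demand from the summary statistics

def categorize(inventory, low=0.1, high=0.3):
    """Label an inventory level by its ratio to the average demand."""
    ratio = inventory / average_demand
    if ratio < low:
        return "Low Inventory"
    elif ratio <= high:
        return "Medium Inventory"
    return "High Inventory"

# Hypothetical inventory values within the dataset's observed range (0 to 30)
sample_inventory = [0, 5, 12, 14, 20, 27, 30]

original_counts = Counter(categorize(x) for x in sample_inventory)
adjusted_counts = Counter(categorize(x, low=0.01, high=0.1) for x in sample_inventory)

print("Thresholds 0.1 / 0.3: ", dict(original_counts))
print("Thresholds 0.01 / 0.1:", dict(adjusted_counts))
```

With the original thresholds no sample value reaches "High Inventory", while the adjusted thresholds spread the same values across all three states. Running the same count on the real dataset would show whether each state is visited often enough to be learned.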



By exploring these areas, you can often diagnose and correct the issue of early convergence in a Q-learning model.

In [None]:
import random
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset
inventory_data = pd.read_csv(file_path)
average_demand = inventory_data['Demand'].mean()

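# Define the function to categorize inventory level (thresholds lowered so all three states occur)
def categorize_inventory(inventory, demand):
    # NOTE: `demand` is only used for this zero-demand guard; the ratio below
    # divides by the dataset-wide average demand, not by this row's demand
    if demand == 0:
        return "Low Inventory"
    else:
        ratio = inventory / average_demand  # Ratio of this row's inventory to the average demand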
        if ratio < 0.01:
            return "Low Inventory"
        elif ratio <= 0.1:
            return "Medium Inventory"
        else:
            return "High Inventory"

# States, actions, and rewards
states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

# Define the rewards
rewards = {
    ('Low Inventory', 'Order'): 2,
    ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2,
    ('Medium Inventory', 'Order'): 0,
    ('Medium Inventory', 'Maintain'): 1,
    ('Medium Inventory', 'Reduce'): 0,
    ('High Inventory', 'Order'): -2,
    ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2
}

# Initialize the Q-table
Q = {state: {action: 0 for action in actions} for state in states}

# Learning parameters
alpha = 0.3  # Learning rate
epsilon = 0.5 # Exploration rate
discount_factor = 0.7  # Discount factor
num_episodes = 10000  # Number of episodes

# Tracking Q-values for convergence analysis
tracked_state = 'Medium Inventory'
tracked_action = 'Order'
tracked_q_values = []

# Implement the Q-learning algorithm
for episode in range(num_episodes):
    for index, row in inventory_data.iterrows():
        state = categorize_inventory(row['Current Inventory'], row['Demand'])

        # Exploration-exploitation trade-off
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            action = max(Q[state], key=Q[state].get)

        # Get the reward for the current state-action pair
        reward = rewards[(state, action)]

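        # Update the Q-value
        old_value = Q[state][action]
        # The next state comes from the following row; the last row has no successor, so it reuses itself.
        # Assumes the default RangeIndex from read_csv, so `index` is also the row position.
        next_row = inventory_data.iloc[min(index + 1, len(inventory_data) - 1)]
        next_state = categorize_inventory(next_row['Current Inventory'], next_row['Demand'])
        Q[state][action] = old_value + alpha * (reward + discount_factor * max(Q[next_state].values()) - old_value)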

        # Track Q-value for a specific state-action pair
        if state == tracked_state and action == tracked_action:
            tracked_q_values.append(Q[state][action])

# Determine the optimal policy
policy = {state: max(Q[state], key=Q[state].get) for state in states}

# Print the final Q-table and optimal policy
print("Final Q-Table:")
for state in states:
    print(f"{state}: {Q[state]}")

print("\nOptimal Policy:")
print(policy)

# Plot the Q-value convergence
plt.plot(tracked_q_values)
plt.xlabel('Updates to tracked pair')  # one point is appended per update, not per episode
plt.ylabel('Q-value')
plt.title(f'Q-value Convergence for state-action pair ({tracked_state}, {tracked_action})')
plt.show()


## **Q4: Find the optimal policy using Q-learning with experience replay**

Incorporating **experience replay** into your Q-learning algorithm can help improve its performance, particularly in terms of learning from a more diverse range of experiences and breaking correlations between consecutive learning updates.

Experience replay involves storing past experiences (state, action, reward, next state) in a memory buffer and then randomly sampling from this buffer to update the Q-values. This approach can help the algorithm learn from a wider range of experiences and smooth out the learning process.

Here's how you can modify your code to include experience replay:

1. **Create a Replay Buffer**: Store experiences (state, action, reward, next state) in a buffer.
2. **Sample from the Buffer**: During the learning process, instead of using the immediate next state-action pair for updating the Q-values, randomly sample a batch of experiences from the buffer.
3. **Update Q-values Using Sampled Experiences**: Use these experiences to update the Q-values.

The core idea is to store experiences and sample from them at random, which breaks the correlation between consecutive updates.
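The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the lab's final implementation: the buffer capacity, batch size, random action choice, and the short state sequence are all hypothetical, and the learning parameters are borrowed from the earlier cells.

```python
import random
from collections import deque

random.seed(0)  # fixed seed so the sketch is repeatable

states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]
rewards = {
    ('Low Inventory', 'Order'): 2, ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2, ('Medium Inventory', 'Order'): 0,
    ('Medium Inventory', 'Maintain'): 1, ('Medium Inventory', 'Reduce'): 0,
    ('High Inventory', 'Order'): -2, ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2,
}
Q = {state: {action: 0.0 for action in actions} for state in states}
alpha, discount_factor = 0.3, 0.7

# Step 1: replay buffer with a hypothetical capacity; old entries drop off when full
replay_buffer = deque(maxlen=1000)
batch_size = 4  # hypothetical batch size

def replay_update(q_table, buffer, batch_size):
    """Steps 2 and 3: sample stored experiences at random and update Q from them."""
    batch = random.sample(list(buffer), min(batch_size, len(buffer)))
    for state, action, reward, next_state in batch:
        old_value = q_table[state][action]
        target = reward + discount_factor * max(q_table[next_state].values())
        q_table[state][action] = old_value + alpha * (target - old_value)

# Hypothetical sequence of states, standing in for consecutive dataset rows
state_sequence = ["Low Inventory", "Medium Inventory", "Medium Inventory",
                  "High Inventory", "High Inventory", "Low Inventory"]

for state, next_state in zip(state_sequence, state_sequence[1:]):
    action = random.choice(actions)  # pure exploration, to keep the sketch short
    replay_buffer.append((state, action, rewards[(state, action)], next_state))
    replay_update(Q, replay_buffer, batch_size)

print(len(replay_buffer), "experiences stored")
for state in states:
    print(state, Q[state])
```

Because each update draws from the whole buffer, an experience can be reused several times and the order of updates no longer follows the order of the data.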


Experience replay is a technique used in reinforcement learning, particularly in algorithms that utilize deep learning, like Deep Q-Networks (DQNs). Let's understand it with a simple example:

#### The Concept of Experience Replay
In traditional reinforcement learning, an agent learns from experiences (state, action, reward, next state) sequentially and immediately. This approach can have several drawbacks:
1. **Sequential Dependency**: Learning from consecutive experiences can lead to strong correlations between these experiences, which might hinder the learning process.
2. **Forgetting Past Experiences**: Without a mechanism to remember past states, the agent might forget valuable lessons from earlier in the training.
3. **Inefficient Use of Data**: Each experience is used only once and then discarded, which is not efficient, especially in complex environments.

#### Example Without Experience Replay
Imagine you're teaching a robot to navigate a room to reach a charging station. The robot tries different paths, hitting obstacles and occasionally finding the station. Without experience replay, the robot learns from each attempt right after it happens and then forgets it. The learning is:
- **Sequential**: The robot learns from each attempt in the order they occur.
- **Short-lived**: Each learning opportunity is used once.
- **Prone to Biases**: If the robot keeps hitting an obstacle repeatedly, it might overfit to this particular experience.

#### Example With Experience Replay
Now, consider the robot has a memory where it stores its experiences (paths taken, obstacles hit, successful navigation, etc.). Periodically, the robot revisits these memories randomly and learns from them again. This approach changes the learning process:
- **Random Sampling**: By revisiting old experiences randomly, the robot breaks the sequence and correlation, leading to more robust learning.
- **Repeated Use**: Experiences are used multiple times, making the learning process more efficient.
- **Balanced Learning**: The robot learns from a variety of experiences (both successes and failures) which prevents overfitting to recent or frequent experiences.
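
The storing and random revisiting described above can be sketched with a small buffer. This is a minimal illustration, not the lab's implementation: the buffer capacity of 5, the 8 stored steps, and the experience tuples (with an extra `step` field to show arrival order) are assumptions chosen to make eviction and shuffling visible.

```python
import random
from collections import deque

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical experiences: (step, state, action, reward, next_state)
buffer = deque(maxlen=5)  # small capacity so that eviction is visible
for step in range(8):
    buffer.append((step, "Low Inventory", "Order", 2, "Medium Inventory"))

# The oldest entries (steps 0-2) were evicted; only the 5 newest remain
stored_steps = [exp[0] for exp in buffer]

# Random sampling without replacement: the batch ignores arrival order
batch = random.sample(list(buffer), 3)
sampled_steps = [exp[0] for exp in batch]

print("stored steps: ", stored_steps)
print("sampled steps:", sampled_steps)
```

Because `deque(maxlen=...)` drops the oldest item when full, the buffer always holds the most recent experiences, and each sampled batch mixes experiences from different points in time.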

#### In Summary
Experience replay improves the learning process by enabling the agent to learn from a more diverse and balanced set of experiences. It helps in breaking the correlation between sequential experiences, makes better use of past experiences, and leads to more stable and effective learning, especially in complex environments where learning from immediate, sequential experiences might not be sufficient.

In [None]:
import random
import matplotlib.pyplot as plt
import pandas as pd
from collections import deque
import numpy as np

# Load the dataset
inventory_data = pd.read_csv(file_path)
average_demand = inventory_data['Demand'].mean()

# Define the function to categorize inventory level
def categorize_inventory(inventory, demand):
    if demand == 0:
        return "Low Inventory"
    else:
        ratio = inventory / average_demand  # Calculate the ratio based on average demand
        if ratio < 0.01:
            return "Low Inventory"
        elif ratio <= 0.1:
            return "Medium Inventory"
        else:
            return "High Inventory"

# States, actions, and rewards
states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

# Define the rewards
rewards = {
    ('Low Inventory', 'Order'): 2,
    ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2,
    ('Medium Inventory', 'Order'): 0,
    ('Medium Inventory', 'Maintain'): 1,
    ('Medium Inventory', 'Reduce'): 0,
    ('High Inventory', 'Order'): -2,
    ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2
}

# Initialize the Q-table
Q = {state: {action: 0 for action in actions} for state in states}

# Learning parameters
alpha = 0.3  # Learning rate
epsilon = 0.5  # Exploration rate
discount_factor = 0.7  # Discount factor
num_episodes = 1000  # Number of episodes
buffer_size = 100  # Size of the experience replay buffer
batch_size = 32  # Batch size for sampling from the buffer

# Initialize replay buffer
replay_buffer = deque(maxlen=buffer_size)

# Function to add experience to the buffer
def add_to_buffer(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

# Function to sample from the buffer
def sample_from_buffer(batch_size):
    return random.sample(replay_buffer, min(len(replay_buffer), batch_size))

# Tracking Q-values for convergence analysis
tracked_state = 'Medium Inventory'
tracked_action = 'Order'
tracked_q_values = []

# Implement the Q-learning algorithm with experience replay
for episode in range(num_episodes):
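    # Walk over consecutive row pairs: the current row gives the state, the following row gives next_state
    for index in range(len(inventory_data) - 1):
        row = inventory_data.iloc[index]
        next_row = inventory_data.iloc[index + 1]
        state = categorize_inventory(row['Current Inventory'], row['Demand'])
        action = random.choice(actions) if random.uniform(0, 1) < epsilon else max(Q[state], key=Q[state].get)
        next_state = categorize_inventory(next_row['Current Inventory'], next_row['Demand'])  # Based on the next row in the dataset
        reward = rewards[(state, action)]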

        # Add experience to buffer
        add_to_buffer(state, action, reward, next_state)

        # Sample from buffer
        batch = sample_from_buffer(batch_size)
        for exp_state, exp_action, exp_reward, exp_next_state in batch:
            old_value = Q[exp_state][exp_action]
            Q[exp_state][exp_action] = old_value + alpha * (exp_reward + discount_factor * max(Q[exp_next_state].values()) - old_value)

        # Track Q-value for a specific state-action pair
        if state == tracked_state and action == tracked_action:
            tracked_q_values.append(Q[state][action])

# Determine the optimal policy
policy = {state: max(Q[state], key=Q[state].get) for state in states}

# Print the final Q-table and optimal policy
print("Final Q-Table:")
for state in states:
    print(f"{state}: {Q[state]}")

print("\nOptimal Policy:")
print(policy)

# Plot the Q-value convergence
plt.plot(tracked_q_values)
plt.xlabel('Visits to the tracked state-action pair')  # one point is recorded per visit, not per episode
plt.ylabel('Q-value')
plt.title(f'Q-value Convergence for state-action pair ({tracked_state}, {tracked_action})')
plt.show()

The final Q-table and the derived optimal policy from your Q-learning model provide insightful information about the learned strategy for inventory management.

Let's analyze the results:

#### Final Q-Table Analysis
The Q-table shows the learned values for each state-action pair, indicating the expected long-term reward for taking a certain action in a given state.

- **Low Inventory**:
  - The action 'Order' has the highest Q-value, suggesting that the model has learned it is most beneficial to order more items when the inventory is low. This aligns well with intuitive inventory management.
- **Medium Inventory**:
  - 'Maintain' has the highest Q-value, indicating that the best strategy in a medium inventory situation is to keep the inventory level as it is, balancing between overstocking and understocking.
- **High Inventory**:
  - The action 'Reduce' has the highest Q-value. This implies that the model recommends reducing inventory when levels are high, which is sensible to avoid overstocking issues and potential associated costs.

#### Optimal Policy Analysis
The optimal policy derived from the Q-table is:

- **Low Inventory**: 'Order' more stock. This is a logical approach to avoid stockouts and lost sales opportunities.
- **Medium Inventory**: 'Maintain' the current level, indicating an efficient balance has been achieved.
- **High Inventory**: 'Reduce' inventory, which is a typical response to prevent excess holding costs and potential waste.

#### Interpretation
- The policy seems logical and aligns with standard inventory management practices.
- The model appears to have learned an effective strategy for each inventory level, suggesting successful training.
- The values in the Q-table indicate the relative preference for each action in the corresponding state, guiding decision-making.



## **Q5: Please find the optimal policy based on Double Q-Learning?**

#### Double Q Learning

Double Q-learning is an advanced reinforcement learning technique that aims to reduce the overestimation bias often present in the standard Q-learning algorithm. This overestimation occurs because, in standard Q-learning, the same Q-value estimates are used both to select the best action and to evaluate its value. Double Q-learning addresses this issue by maintaining two separate Q-tables (or Q-functions in the case of deep learning models), which independently estimate the Q-values.

Here's a basic outline of how Double Q-learning works:

#### Double Q-learning Algorithm:
1. **Maintain Two Q-Tables**: Let's call them \( Q_A \) and \( Q_B \). Each table is updated independently.

2. **Experience Sampling**: Just like in standard Q-learning, the agent interacts with the environment, collecting experiences in terms of state, action, reward, and next state.

3. **Updating the Q-Tables**:
   - When updating \( Q_A \), the action is selected using \( Q_A \) (like in standard Q-learning), but the Q-value for updating is taken from \( Q_B \).
   - Conversely, when updating \( Q_B \), the action is selected using \( Q_B \), but the Q-value for updating is taken from \( Q_A \).
   - Whether \( Q_A \) or \( Q_B \) is updated is typically decided randomly at each learning step.

4. **Policy Derivation**: The policy (i.e., the decision-making strategy) can be derived by averaging the Q-values from both tables or by using the sum of the Q-values.
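
The update step in point 3 can be sketched as a single function. This is a minimal sketch under stated assumptions: the function name `double_q_update`, its default `alpha` and `gamma`, and the example transition are illustrative choices, not part of the lab code.

```python
import random

def double_q_update(Q_A, Q_B, state, action, reward, next_state,
                    alpha=0.3, gamma=0.7, rng=random):
    """Apply one Double Q-learning update in place; return which table changed."""
    if rng.random() < 0.5:
        # Update Q_A: select the greedy next action with Q_A, evaluate it with Q_B
        best_next = max(Q_A[next_state], key=Q_A[next_state].get)
        target = reward + gamma * Q_B[next_state][best_next]
        Q_A[state][action] += alpha * (target - Q_A[state][action])
        return "A"
    # Update Q_B: select the greedy next action with Q_B, evaluate it with Q_A
    best_next = max(Q_B[next_state], key=Q_B[next_state].get)
    target = reward + gamma * Q_A[next_state][best_next]
    Q_B[state][action] += alpha * (target - Q_B[state][action])
    return "B"

states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]
Q_A = {s: {a: 0.0 for a in actions} for s in states}
Q_B = {s: {a: 0.0 for a in actions} for s in states}

random.seed(0)
updated = double_q_update(Q_A, Q_B, "Low Inventory", "Order", 2, "Medium Inventory")
print(updated, Q_A["Low Inventory"]["Order"], Q_B["Low Inventory"]["Order"])
```

Only one table changes per call. The key point is that the table that selects the next action is never the table that supplies its value.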

#### Why Double Q-learning?
- **Reducing Overestimation**: In standard Q-learning, there's a tendency to overestimate the Q-values, which can lead to suboptimal policy learning. Double Q-learning helps in mitigating this issue.
- **More Accurate Value Estimates**: By decoupling the action selection and value estimation, Double Q-learning tends to give a more accurate estimation of the action values.
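
The overestimation effect can be reproduced with a small simulation. This sketch is an illustration under assumptions that are not part of the lab: ten actions whose true value is exactly 0, Gaussian reward noise, and five samples per action. Taking the maximum over noisy estimates (single estimator) is compared with selecting on one set of estimates and evaluating on an independent set (double estimator).

```python
import random

random.seed(42)

NUM_ACTIONS = 10        # every action has a true value of 0 (assumption)
SAMPLES_PER_ACTION = 5  # noisy reward samples behind each estimate
NUM_TRIALS = 2000

def noisy_estimates():
    """Return one sample-mean estimate per action; the true mean is 0 for all."""
    estimates = []
    for _ in range(NUM_ACTIONS):
        samples = [random.gauss(0, 1) for _ in range(SAMPLES_PER_ACTION)]
        estimates.append(sum(samples) / len(samples))
    return estimates

single, double = [], []
for _ in range(NUM_TRIALS):
    est_a = noisy_estimates()
    est_b = noisy_estimates()  # independent second set of estimates
    # Single estimator: select and evaluate with the same estimates
    single.append(max(est_a))
    # Double estimator: select with A, evaluate with B
    best = max(range(NUM_ACTIONS), key=lambda i: est_a[i])
    double.append(est_b[best])

single_bias = sum(single) / NUM_TRIALS
double_bias = sum(double) / NUM_TRIALS
print(f"single estimator average: {single_bias:.3f}")
print(f"double estimator average: {double_bias:.3f}")
```

Since the true value of every action is 0, any average clearly above 0 is pure overestimation. The single estimator shows it, while the double estimator stays close to 0.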

#### Example:
Imagine a simple game where an agent needs to choose between two doors, behind one of which is a reward. The true value of the reward behind each door changes slightly each time the agent chooses a door, but the agent does not know this.

- In standard Q-learning, the agent might consistently overestimate the value of one door if it initially gets higher rewards from it, leading to a lack of exploration and potential neglect of the other door.
- In Double Q-learning, since two separate Q-tables are maintained and updated independently, the agent gets a more balanced view of the potential rewards behind each door, reducing the chance of such overestimation and potentially leading to more exploration and better overall reward accumulation.

Double Q-learning is especially useful in environments with high variability and uncertainty, where overestimation of Q-values can significantly impact the learning of an optimal policy.

In [None]:
import random
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset
inventory_data = pd.read_csv(file_path)
average_demand = inventory_data['Demand'].mean()

# Define the function to categorize inventory level
def categorize_inventory(inventory, demand):
    if demand == 0:
        return "Low Inventory"
    else:
        ratio = inventory / average_demand  # Calculate the ratio based on average demand
        if ratio < 0.01:
            return "Low Inventory"
        elif ratio <= 0.1:
            return "Medium Inventory"
        else:
            return "High Inventory"

# States, actions, and rewards
states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

# Define the rewards
rewards = {
    ('Low Inventory', 'Order'): 2,
    ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2,
    ('Medium Inventory', 'Order'): 0,
    ('Medium Inventory', 'Maintain'): 1,
    ('Medium Inventory', 'Reduce'): 0,
    ('High Inventory', 'Order'): -2,
    ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2
}

# Initialize two Q-tables for Double Q-Learning
Q_A = {state: {action: 0 for action in actions} for state in states}
Q_B = {state: {action: 0 for action in actions} for state in states}

# Learning parameters
alpha = 0.3  # Learning rate
epsilon = 0.5  # Exploration rate
discount_factor = 0.7  # Discount factor
num_episodes = 1000  # Number of episodes

# Tracking Q-values for convergence analysis
tracked_state = 'Medium Inventory'
tracked_action = 'Order'
tracked_q_values = []

# Implement the Double Q-learning algorithm
for episode in range(num_episodes):
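    # Walk over consecutive row pairs: the current row gives the state, the following row gives next_state
    for index in range(len(inventory_data) - 1):
        row = inventory_data.iloc[index]
        next_row = inventory_data.iloc[index + 1]
        state = categorize_inventory(row['Current Inventory'], row['Demand'])
        action = random.choice(actions) if random.uniform(0, 1) < epsilon else max(Q_A[state], key=Q_A[state].get)

        # Get reward and next state
        reward = rewards[(state, action)]
        next_state = categorize_inventory(next_row['Current Inventory'], next_row['Demand'])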

        # Update rule for Double Q-Learning: select the next action with the table being updated, evaluate it with the other table
        if random.uniform(0, 1) < 0.5:
            next_action = max(Q_A[next_state], key=Q_A[next_state].get)
            Q_A[state][action] += alpha * (reward + discount_factor * Q_B[next_state][next_action] - Q_A[state][action])
        else:
            next_action = max(Q_B[next_state], key=Q_B[next_state].get)
            Q_B[state][action] += alpha * (reward + discount_factor * Q_A[next_state][next_action] - Q_B[state][action])

        # Track Q-value for a specific state-action pair
        if state == tracked_state and action == tracked_action:
            tracked_q_values.append((Q_A[state][action] + Q_B[state][action]) / 2)

# Determine the optimal policy based on combined Q-tables
policy = {state: max({a: Q_A[state][a] + Q_B[state][a] for a in actions}, key=lambda a: Q_A[state][a] + Q_B[state][a]) for state in states}

# Print the final Q-tables and optimal policy
print("Final Q-Table A:")
for state in states:
    print(f"{state}: {Q_A[state]}")

print("Final Q-Table B:")
for state in states:
    print(f"{state}: {Q_B[state]}")

print("\nOptimal Policy:")
print(policy)

# Plot the Q-value convergence
plt.plot(tracked_q_values)
plt.xlabel('Visits to the tracked state-action pair')  # one point is recorded per visit, not per episode
plt.ylabel('Q-value')
plt.title(f'Q-value Convergence for state-action pair ({tracked_state}, {tracked_action})')
plt.show()


## **Q6: Please find the optimal policy with Double Q-Learning with Experience Replay?**

Combining Double Q-learning with experience replay involves maintaining two separate Q-tables, Q_A and Q_B, and an experience replay buffer. Experiences are stored in the buffer and sampled randomly to update the Q-values in both tables. This approach combines the benefits of Double Q-learning (reducing overestimation bias) with those of experience replay (breaking the correlation between consecutive experiences and making better use of past experiences). Note that the discount factor in this code is raised from 0.7 to 0.99.

Here's how you can modify the code to include both Double Q-learning and experience replay:

In [None]:
import random
import matplotlib.pyplot as plt
import pandas as pd
from collections import deque

# Load the dataset
inventory_data = pd.read_csv(file_path)
average_demand = inventory_data['Demand'].mean()

# Define the function to categorize inventory level
def categorize_inventory(inventory, demand):
    if demand == 0:
        return "Low Inventory"
    else:
        ratio = inventory / average_demand
        if ratio < 0.01:
            return "Low Inventory"
        elif ratio <= 0.1:
            return "Medium Inventory"
        else:
            return "High Inventory"

# States, actions, and rewards
states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

# Define the rewards
rewards = {
    ('Low Inventory', 'Order'): 2,
    ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2,
    ('Medium Inventory', 'Order'): 0,
    ('Medium Inventory', 'Maintain'): 1,
    ('Medium Inventory', 'Reduce'): 0,
    ('High Inventory', 'Order'): -2,
    ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2
}

# Initialize two Q-tables for Double Q-Learning
Q_A = {state: {action: 0 for action in actions} for state in states}
Q_B = {state: {action: 0 for action in actions} for state in states}

# Learning parameters
alpha = 0.3  # Learning rate
epsilon = 0.5  # Exploration rate
discount_factor = 0.99  # Discount factor
num_episodes = 1000  # Number of episodes
buffer_size = 100  # Size of the experience replay buffer
batch_size = 32  # Batch size for sampling from the buffer

# Initialize replay buffer
replay_buffer = deque(maxlen=buffer_size)

# Function to add experience to the buffer
def add_to_buffer(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

# Function to sample from the buffer
def sample_from_buffer(batch_size):
    return random.sample(replay_buffer, min(len(replay_buffer), batch_size))

# Tracking Q-values for convergence analysis
tracked_state = 'Medium Inventory'
tracked_action = 'Order'
tracked_q_values = []

# Implement the Double Q-learning algorithm with experience replay
for episode in range(num_episodes):
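    # Walk over consecutive row pairs: the current row gives the state, the following row gives next_state
    for index in range(len(inventory_data) - 1):
        row = inventory_data.iloc[index]
        next_row = inventory_data.iloc[index + 1]
        state = categorize_inventory(row['Current Inventory'], row['Demand'])
        action = random.choice(actions) if random.uniform(0, 1) < epsilon else max(Q_A[state], key=Q_A[state].get)
        next_state = categorize_inventory(next_row['Current Inventory'], next_row['Demand'])
        reward = rewards[(state, action)]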

        # Add experience to buffer
        add_to_buffer(state, action, reward, next_state)

        # Sample from buffer
        batch = sample_from_buffer(batch_size)
        for exp_state, exp_action, exp_reward, exp_next_state in batch:
            # Update rule for Double Q-Learning with experience replay:
            # select the next action with the table being updated, evaluate it with the other table
            if random.uniform(0, 1) < 0.5:
                next_action = max(Q_A[exp_next_state], key=Q_A[exp_next_state].get)
                Q_A[exp_state][exp_action] += alpha * (exp_reward + discount_factor * Q_B[exp_next_state][next_action] - Q_A[exp_state][exp_action])
            else:
                next_action = max(Q_B[exp_next_state], key=Q_B[exp_next_state].get)
                Q_B[exp_state][exp_action] += alpha * (exp_reward + discount_factor * Q_A[exp_next_state][next_action] - Q_B[exp_state][exp_action])

        # Track Q-value for a specific state-action pair
        if state == tracked_state and action == tracked_action:
            tracked_q_values.append((Q_A[state][action] + Q_B[state][action]) / 2)

# Determine the optimal policy based on combined Q-tables
policy = {state: max({a: Q_A[state][a] + Q_B[state][a] for a in actions}, key=lambda a: Q_A[state][a] + Q_B[state][a]) for state in states}

# Print the final Q-tables and optimal policy
print("Final Q-Table A:")
for state in states:
    print(f"{state}: {Q_A[state]}")

print("Final Q-Table B:")
for state in states:
    print(f"{state}: {Q_B[state]}")

print("\nOptimal Policy:")
print(policy)

# Plot the Q-value convergence
plt.plot(tracked_q_values)
plt.xlabel('Visits to the tracked state-action pair')  # one point is recorded per visit, not per episode
plt.ylabel('Q-value')
plt.title(f'Q-value Convergence for state-action pair ({tracked_state}, {tracked_action})')
plt.show()


The results from your Double Q-learning algorithm with experience replay indicate that the model has learned distinct preferences for actions in each inventory state. Here's an analysis of the results:

#### Final Q-Tables
Both Q-tables (Q-Table A and Q-Table B) have very similar values, which is a good sign of stability in learning. It suggests that both tables are converging towards similar value estimates.

- **Low Inventory**: The action 'Order' has the highest Q-value. This implies that the model has learned that ordering more inventory is the best action when inventory levels are low, which aligns well with typical inventory management strategies.
- **Medium Inventory**: The action 'Maintain' has the highest Q-value. This suggests that when the inventory is at a medium level, the best action is to maintain the current inventory level, indicating a balance between overstocking and understocking.
- **High Inventory**: The action 'Reduce' has the highest Q-value. This means that the model recommends reducing inventory when the levels are high, which is sensible to prevent overstocking and potential waste or increased storage costs.

#### Optimal Policy
The derived optimal policy is consistent with logical inventory management practices:
- **Low Inventory**: 'Order' more items to replenish stock.
- **Medium Inventory**: 'Maintain' current levels, indicating an optimal balance.
- **High Inventory**: 'Reduce' stock to mitigate overstocking issues.

#### Interpretation and Considerations
- **Realistic Policy**: The policy appears realistic and aligned with standard inventory management practices, suggesting effective learning.
- **Stable Learning**: The similarity in Q-values between both tables indicates that the learning process is stable and robust.
- **Experience Replay Impact**: The use of experience replay likely contributed to the stability and efficiency of the learning process by allowing the model to learn from a diverse set of experiences and reduce the correlation inherent in sequential learning.



### **How to define the discount factor?**

Increasing the discount factor in a Q-learning algorithm usually produces Q-values of larger magnitude. This follows from how the discount factor enters the calculation of these values.

#### Understanding the Discount Factor
The discount factor (usually denoted as \( \gamma \)) in reinforcement learning is a crucial parameter that determines how future rewards are valued compared to immediate rewards. Here's how it influences the learning:

1. **Discount Factor Close to 0**: The agent values immediate rewards much more than future rewards. It becomes short-sighted in its policy, focusing on actions that yield quick benefits.

2. **Discount Factor Close to 1**: The agent places significant value on future rewards. It is more far-sighted and considers the long-term consequences of its actions.

#### Larger Q-Values with Higher Discount Factor
In the Q6 code, the discount factor was raised from 0.7 to 0.99, which shifts the agent's focus from short-term gains to long-term gains. This change has a few implications:

- **Long-Term Planning**: The agent now values future rewards almost as much as immediate rewards. It's more willing to take actions that might have a lower immediate payoff but lead to greater rewards in the future.

- **Accumulation of Future Rewards**: The Q-value of a state-action pair is essentially an estimation of all future rewards that can be obtained from that state onwards, discounted back to the present. A higher discount factor means that future rewards are less diminished, leading to a higher overall estimation.

- **Greater Sensitivity to Future Rewards**: The agent becomes more sensitive to potential high-reward states in the distant future, which can result in overall higher Q-values.
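
The accumulation effect can be checked numerically. The sketch below makes a simplifying assumption that is not part of the lab: the agent stays in one state and receives the same reward every step, so the Q-learning update settles at `reward / (1 - gamma)`. The function name `updates_until_settled`, the 1% tolerance, and the reward of 2 (matching the `('Low Inventory', 'Order')` entry) are illustrative choices.

```python
def updates_until_settled(reward, gamma, alpha=0.3, tolerance=0.01, max_steps=100_000):
    """Repeat the Q-learning update for a state that always leads back to itself.

    Returns (q, steps): the Q-value reached and the number of updates needed to get
    within a relative `tolerance` of the fixed point reward / (1 - gamma).
    """
    fixed_point = reward / (1 - gamma)
    q = 0.0
    for step in range(1, max_steps + 1):
        q += alpha * (reward + gamma * q - q)
        if abs(q - fixed_point) <= tolerance * abs(fixed_point):
            return q, step
    return q, max_steps

for gamma in (0.7, 0.99):
    q, steps = updates_until_settled(reward=2, gamma=gamma)
    print(f"gamma={gamma}: fixed point={2 / (1 - gamma):.2f}, Q={q:.2f}, updates={steps}")
```

With the same reward, the larger discount factor yields a much larger Q-value, and it also needs many more updates to settle.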

#### Practical Implications
- **Policy Impact**: The policy derived from the Q-learning process might shift towards actions that favor long-term benefits.
- **Risk of Overestimation**: A very high discount factor can sometimes lead to overestimation of Q-values, especially in environments with a lot of uncertainty or variability.
- **Balance is Key**: It's important to choose a discount factor that balances immediate and future rewards appropriately for your specific application. The optimal value often depends on the nature of the environment and the objectives of the agent.

In summary, by increasing the discount factor to 0.99, you encouraged the model to prioritize long-term rewards more heavily, leading to an overall increase in the calculated Q-values. This reflects a more far-sighted approach in the decision-making process of the agent.

### **Why does convergence happen early in our problem?**

The plot above indicates that the Q-value for the state-action pair (Medium Inventory, Order) quickly converges to a value and then remains flat for the remainder of training. This suggests that the learning algorithm reaches its estimate for this state-action pair early on and receives no significant updates afterward.

If the Q-value convergence is occurring too early and remaining constant (which might look like convergence to zero if the scale of the plot is too large or if the Q-values are small relative to the scale), it could be due to several reasons:

1. **Insufficient Exploration**: The exploration rate (epsilon) might be too low, causing the algorithm to exploit the current policy without exploring other possibilities sufficiently.

2. **Learning Rate Decay**: If the learning rate doesn't decay over time or episodes, the Q-values might be updated too aggressively throughout the training process. Introducing a decay can help the model gradually move toward convergence, allowing for finer adjustments as it learns more about the environment.

3. **Reward Structure**: The rewards might not be scaled properly, or the reward signal could be too weak or sparse, leading to insignificant updates after the initial learning phase.

4. **State Definition**: The thresholds for categorizing inventory levels might be too broad or not representative of meaningful differences in inventory states, leading to a lack of granularity in the learned policy.

5. **Discount Factor**: The discount factor affects how quickly the updates settle. A low value shrinks the influence of future estimates, so Q-values stabilize after fewer updates. A value close to 1 usually slows convergence and produces larger Q-values, which doesn't seem to be the case here.

To address the early convergence and improve the learning process, you can:

- **Adjust Exploration**: Increase the exploration rate or implement an exploration decay strategy to encourage the algorithm to explore more thoroughly.
- **Learning Rate Decay**: Implement a learning rate decay over episodes to fine-tune the updates to the Q-values as learning progresses.
- **Review Rewards**: Ensure that the rewards are meaningful and scaled properly to provide a strong enough signal for learning.
- **State Granularity**: Refine the state definitions to ensure they capture meaningful differences in inventory levels.
- **Evaluate Data**: Ensure the dataset provides sufficient variability in states and rewards. If the environment is very deterministic, the Q-values might converge quickly because there isn't much new information to learn from.

By implementing these strategies, you should see a more gradual convergence of Q-values, which would typically indicate a more robust and reliable learning process.

### **How to define the learning rate decay?**

Implementing a learning rate decay involves reducing the learning rate α over time or over episodes. This allows the model to make large updates to the Q-values initially when it knows less about the environment, and then make smaller, more fine-tuned updates as it learns more.

A common approach to decay the learning rate is to use a multiplicative decay, which reduces the learning rate by a certain factor each episode. Another approach is to use a time-based decay, which reduces the learning rate based on the inverse of the iteration count. You can also use a step decay, where the learning rate is reduced by a factor every fixed number of episodes.
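
The three schedules described above can be sketched as small functions. The function names, the `min_alpha` floor, and the example parameters are assumptions chosen for illustration, and the exact formulas are one common form of each schedule rather than the only one.

```python
def multiplicative_decay(initial_alpha, decay_factor, episode, min_alpha=0.01):
    """Shrink alpha by a constant factor every episode."""
    return max(min_alpha, initial_alpha * decay_factor ** episode)

def time_based_decay(initial_alpha, decay_rate, episode, min_alpha=0.01):
    """Shrink alpha in proportion to 1 / (1 + decay_rate * episode)."""
    return max(min_alpha, initial_alpha / (1 + decay_rate * episode))

def step_decay(initial_alpha, drop_factor, episodes_per_drop, episode, min_alpha=0.01):
    """Multiply alpha by drop_factor once every episodes_per_drop episodes."""
    return max(min_alpha, initial_alpha * drop_factor ** (episode // episodes_per_drop))

for episode in (0, 100, 500, 999):
    print(
        episode,
        round(multiplicative_decay(0.3, 0.99, episode), 4),
        round(time_based_decay(0.3, 0.01, episode), 4),
        round(step_decay(0.3, 0.5, 250, episode), 4),
    )
```

Multiplicative decay falls fastest and reaches the floor early, time-based decay flattens out gradually, and step decay holds the rate constant between drops.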

In [None]:
import random
import matplotlib.pyplot as plt
import pandas as pd
from collections import deque

# Load the dataset

inventory_data = pd.read_csv(file_path)
average_demand = inventory_data['Demand'].mean()

# Define the function to categorize inventory level
def categorize_inventory(inventory, demand):
    if demand == 0:
        return "Low Inventory"
    else:
        ratio = inventory / average_demand
        if ratio < 0.01:
            return "Low Inventory"
        elif ratio <= 0.1:
            return "Medium Inventory"
        else:
            return "High Inventory"

# States, actions, and rewards
states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

# Define the rewards
rewards = {
    ('Low Inventory', 'Order'): 2,
    ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2,
    ('Medium Inventory', 'Order'): 0,
    ('Medium Inventory', 'Maintain'): 1,
    ('Medium Inventory', 'Reduce'): 0,
    ('High Inventory', 'Order'): -2,
    ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2
}

# Initialize two Q-tables for Double Q-Learning
Q_A = {state: {action: 0 for action in actions} for state in states}
Q_B = {state: {action: 0 for action in actions} for state in states}

# Learning parameters
initial_alpha = 0.3  # Initial learning rate
min_alpha = 0.01     # Minimum learning rate
alpha_decay_rate = 0.01  # Decay rate per episode
epsilon = 0.5  # Exploration rate
discount_factor = 0.8  # Discount factor
num_episodes = 1000  # Number of episodes
buffer_size = 100  # Size of the experience replay buffer
batch_size = 32  # Batch size for sampling from the buffer

# Initialize replay buffer
replay_buffer = deque(maxlen=buffer_size)

# Function to add experience to the buffer
def add_to_buffer(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

# Function to sample from the buffer
def sample_from_buffer(batch_size):
    return random.sample(replay_buffer, min(len(replay_buffer), batch_size))

# Tracking Q-values for convergence analysis
tracked_state = 'Medium Inventory'
tracked_action = 'Order'
tracked_q_values = []

# Implement the Double Q-learning algorithm with experience replay and learning rate decay
for episode in range(num_episodes):
    # Decay learning rate: multiply by (1 - alpha_decay_rate) each episode, floored at min_alpha
    alpha = max(min_alpha, initial_alpha * (1 - alpha_decay_rate) ** episode)

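    # Walk over consecutive row pairs: the current row gives the state, the following row gives next_state
    for index in range(len(inventory_data) - 1):
        row = inventory_data.iloc[index]
        next_row = inventory_data.iloc[index + 1]
        state = categorize_inventory(row['Current Inventory'], row['Demand'])
        action = random.choice(actions) if random.uniform(0, 1) < epsilon else max(Q_A[state], key=Q_A[state].get)
        next_state = categorize_inventory(next_row['Current Inventory'], next_row['Demand'])
        reward = rewards[(state, action)]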

        # Add experience to buffer
        add_to_buffer(state, action, reward, next_state)

        # Sample from buffer
        batch = sample_from_buffer(batch_size)
        for exp_state, exp_action, exp_reward, exp_next_state in batch:
            # Update rule for Double Q-Learning with experience replay:
            # select the next action with the table being updated, evaluate it with the other table
            if random.uniform(0, 1) < 0.5:
                next_action = max(Q_A[exp_next_state], key=Q_A[exp_next_state].get)
                Q_A[exp_state][exp_action] += alpha * (exp_reward + discount_factor * Q_B[exp_next_state][next_action] - Q_A[exp_state][exp_action])
            else:
                next_action = max(Q_B[exp_next_state], key=Q_B[exp_next_state].get)
                Q_B[exp_state][exp_action] += alpha * (exp_reward + discount_factor * Q_A[exp_next_state][next_action] - Q_B[exp_state][exp_action])

        # Track Q-value for a specific state-action pair
        if state == tracked_state and action == tracked_action:
            tracked_q_values.append((Q_A[state][action] + Q_B[state][action]) / 2)

# Determine the optimal policy based on combined Q-tables
policy = {state: max({a: Q_A[state][a] + Q_B[state][a] for a in actions}, key=lambda a: Q_A[state][a] + Q_B[state][a]) for state in states}

# Print the final Q-tables and optimal policy
print("Final Q-Table A:")
for state in states:
    print(f"{state}: {Q_A[state]}")

print("Final Q-Table B:")
for state in states:
    print(f"{state}: {Q_B[state]}")

print("\nOptimal Policy:")
print(policy)

# Plot the Q-value convergence
plt.plot(tracked_q_values)
plt.xlabel('Visits to the tracked state-action pair')  # one point is recorded per visit, not per episode
plt.ylabel('Q-value')
plt.title(f'Q-value Convergence for state-action pair ({tracked_state}, {tracked_action})')
plt.show()


### **How to use exploration decay?**

To introduce an exploration decay in your Q-learning algorithm, you will adjust the exploration rate (epsilon) over time, just as we did with the learning rate (alpha). A common method is to start with a high exploration rate and then gradually decrease it as the agent learns more about the environment.

In this code, both the learning rate and the exploration rate are set to decay over the course of the episodes, which helps the agent to start with a high degree of exploration and learning rate and then gradually shift to exploiting the learned knowledge and fine-tuning the Q-values as it gains more experience.
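
Before the full listing, the exploration schedule on its own can be sketched as follows. This is a minimal sketch assuming a multiplicative decay with a floor. The helper names `decayed_epsilon` and `epsilon_greedy` and the example Q-values are hypothetical, and the lab code may use a different decay formula.

```python
import random

def decayed_epsilon(initial_epsilon, min_epsilon, decay_rate, episode):
    """Multiply epsilon by (1 - decay_rate) each episode, floored at min_epsilon."""
    return max(min_epsilon, initial_epsilon * (1 - decay_rate) ** episode)

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued action."""
    if rng.random() < epsilon:
        return rng.choice(list(q_row))
    return max(q_row, key=q_row.get)

random.seed(0)
q_row = {"Order": 0.0, "Maintain": 1.0, "Reduce": 0.0}  # illustrative values only
for episode in (0, 100, 500, 999):
    eps = decayed_epsilon(1.0, 0.01, 0.01, episode)
    print(episode, round(eps, 4), epsilon_greedy(q_row, eps))
```

Early episodes choose almost entirely at random. Once epsilon reaches its floor, the agent picks the greedy action about 99% of the time.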

In [None]:
import random
import matplotlib.pyplot as plt
import pandas as pd
from collections import deque

# Load the dataset
inventory_data = pd.read_csv(file_path)
average_demand = inventory_data['Demand'].mean()

# Define the function to categorize inventory level
def categorize_inventory(inventory, demand):
    if demand == 0:
        return "Low Inventory"
    else:
        ratio = inventory / average_demand
        if ratio < 0.01:
            return "Low Inventory"
        elif ratio <= 0.1:
            return "Medium Inventory"
        else:
            return "High Inventory"

# States, actions, and rewards
states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

# Define the rewards
rewards = {
    ('Low Inventory', 'Order'): 2,
    ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2,
    ('Medium Inventory', 'Order'): 0,
    ('Medium Inventory', 'Maintain'): 1,
    ('Medium Inventory', 'Reduce'): 0,
    ('High Inventory', 'Order'): -2,
    ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2
}

# Initialize two Q-tables for Double Q-Learning
Q_A = {state: {action: 0 for action in actions} for state in states}
Q_B = {state: {action: 0 for action in actions} for state in states}

# Learning parameters
initial_alpha = 0.3  # Initial learning rate
min_alpha = 0.01     # Minimum learning rate
alpha_decay_rate = 0.01  # Decay rate per episode
initial_epsilon = 1.0  # Initial exploration rate
min_epsilon = 0.01  # Minimum exploration rate
epsilon_decay_rate = 0.01  # Decay rate per episode
discount_factor = 0.8  # Discount factor
num_episodes = 1000  # Number of episodes
buffer_size = 100  # Size of the experience replay buffer
batch_size = 32  # Batch size for sampling from the buffer

# Initialize replay buffer
replay_buffer = deque(maxlen=buffer_size)

# Function to add experience to the buffer
def add_to_buffer(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

# Function to sample from the buffer
def sample_from_buffer(batch_size):
    return random.sample(replay_buffer, min(len(replay_buffer), batch_size))

# Tracking Q-values for convergence analysis
tracked_state = 'Medium Inventory'
tracked_action = 'Order'
tracked_q_values = []

# Implement the Double Q-learning algorithm with experience replay, learning rate decay, and exploration decay
for episode in range(num_episodes):
    # Decay learning rate and exploration rate multiplicatively each episode, down to their minimums.
    # (The linear form initial * (1 - episode / num_episodes * rate) lowers each rate by only 1% in total.)
    alpha = max(min_alpha, initial_alpha * (1 - alpha_decay_rate) ** episode)
    epsilon = max(min_epsilon, initial_epsilon * (1 - epsilon_decay_rate) ** episode)

    for index, row in inventory_data.iterrows():
        state = categorize_inventory(row['Current Inventory'], row['Demand'])
        action = random.choice(actions) if random.uniform(0, 1) < epsilon else max(Q_A[state], key=Q_A[state].get)
        # Simplification: next_state is derived from the same row, so it always equals state
        next_state = categorize_inventory(row['Current Inventory'], row['Demand'])
        reward = rewards[(state, action)]

        # Add experience to buffer
        add_to_buffer(state, action, reward, next_state)

        # Sample from buffer
        batch = sample_from_buffer(batch_size)
        for exp_state, exp_action, exp_reward, exp_next_state in batch:
            # Update rule for Double Q-Learning with experience replay
            if random.uniform(0, 1) < 0.5:
                next_action = max(Q_B[exp_next_state], key=Q_B[exp_next_state].get)
                Q_A[exp_state][exp_action] += alpha * (exp_reward + discount_factor * Q_B[exp_next_state][next_action] - Q_A[exp_state][exp_action])
            else:
                next_action = max(Q_A[exp_next_state], key=Q_A[exp_next_state].get)
                Q_B[exp_state][exp_action] += alpha * (exp_reward + discount_factor * Q_A[exp_next_state][next_action] - Q_B[exp_state][exp_action])

        # Track Q-value for a specific state-action pair
        if state == tracked_state and action == tracked_action:
            tracked_q_values.append((Q_A[state][action] + Q_B[state][action]) / 2)

# Determine the optimal policy based on combined Q-tables
policy = {state: max({a: Q_A[state][a] + Q_B[state][a] for a in actions}, key=lambda a: Q_A[state][a] + Q_B[state][a]) for state in states}

# Print the final Q-tables and optimal policy
print("Final Q-Table A:")
for state in states:
    print(f"{state}: {Q_A[state]}")

print("Final Q-Table B:")
for state in states:
    print(f"{state}: {Q_B[state]}")

print("\nOptimal Policy:")
print(policy)

# Plot the Q-value convergence
plt.plot(tracked_q_values)
plt.xlabel('Visits to tracked state-action pair')
plt.ylabel('Q-value')
plt.title(f'Q-value Convergence for state-action pair ({tracked_state}, {tracked_action})')
plt.show()


## **Q7: Please update the states and rewards and obtain the optimal policies.**

To update your code with a dynamic reward function suitable for the new discretized state representation, we'll first define a reward function and then integrate it into your Q-learning algorithm. Let's assume a simple reward strategy for the sake of demonstration. You might need to adjust this based on the specific dynamics of your inventory management system.

This code utilizes discretized inventory levels as states and uses a dynamic reward function that assigns rewards based on the current state and action. The reward function here is quite simple and might need further refinement to accurately reflect the complexities of a real-world inventory management system. The key is to ensure that the rewards incentivize the desired outcomes in line with your business objectives.
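
To make the discretization concrete, here is a small sketch of how `np.digitize` maps raw inventory values to state indices. The inventory range of 0 to 90 units is made up for illustration; the real bin edges depend on the dataset.

```python
import numpy as np

# Hypothetical inventory range (0 to 90 units), not taken from the dataset
bins = np.linspace(0, 90, num=10)  # 10 edges: 0, 10, ..., 90

def discretize_inventory(inventory, bins):
    # Returns i such that bins[i-1] <= inventory < bins[i];
    # 0 below the first edge, len(bins) at or above the last edge
    return int(np.digitize(inventory, bins))

# Map a few sample inventory values to their state index
for value in (-5, 0, 45, 90):
    print(value, "->", discretize_inventory(value, bins))
```

With 10 edges there are 11 possible indices (0 to 10), but index 0 is produced only for values below the smallest edge.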

In [None]:
import random
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from collections import deque

# Load the dataset
inventory_data = pd.read_csv(file_path)
average_demand = inventory_data['Demand'].mean()

# Discretize inventory levels into bins
def discretize_inventory(inventory, bins):
    return np.digitize(inventory, bins)  # Assign inventory to a bin

# Define the bin edges for inventory levels
inventory_bins = np.linspace(min(inventory_data['Current Inventory']),
                             max(inventory_data['Current Inventory']),
                             num=10)  # 10 evenly spaced bin edges

# States and actions
states = range(len(inventory_bins) + 1)  # np.digitize returns indices 0..len(inventory_bins)
actions = ["Order", "Maintain", "Reduce"]

# Define a reward function
def calculate_reward(state, action, inventory, demand):
    if state < 3 and action == 'Order':  # Low inventory levels
        return 2
    elif 3 <= state < 7 and action == 'Maintain':  # Medium inventory levels
        return 1
    elif state >= 7 and action == 'Reduce':  # High inventory levels
        return 2
    return -1  # Default penalty for non-optimal actions

# Initialize two Q-tables for Double Q-Learning
Q_A = {state: {action: 0 for action in actions} for state in states}
Q_B = {state: {action: 0 for action in actions} for state in states}

# Learning parameters
initial_alpha = 0.3  # Initial learning rate
min_alpha = 0.01     # Minimum learning rate
alpha_decay_rate = 0.01  # Decay rate per episode
initial_epsilon = 1.0  # Initial exploration rate
min_epsilon = 0.01  # Minimum exploration rate
epsilon_decay_rate = 0.01  # Decay rate per episode
discount_factor = 0.8  # Discount factor
num_episodes = 1000  # Number of episodes
buffer_size = 100  # Size of the experience replay buffer
batch_size = 32  # Batch size for sampling from the buffer

# Initialize replay buffer
replay_buffer = deque(maxlen=buffer_size)

# Function to add experience to the buffer
def add_to_buffer(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

# Function to sample from the buffer
def sample_from_buffer(batch_size):
    return random.sample(replay_buffer, min(len(replay_buffer), batch_size))

# Tracking Q-values for convergence analysis
tracked_state = 5  # Example tracked state, corresponding to one of the bins
tracked_action = 'Order'
tracked_q_values = []

# Implement the Double Q-learning algorithm with experience replay, learning rate decay, and exploration decay
for episode in range(num_episodes):
    # Decay learning rate and exploration rate multiplicatively each episode, down to their minimums.
    # (The linear form initial * (1 - episode / num_episodes * rate) lowers each rate by only 1% in total.)
    alpha = max(min_alpha, initial_alpha * (1 - alpha_decay_rate) ** episode)
    epsilon = max(min_epsilon, initial_epsilon * (1 - epsilon_decay_rate) ** episode)

    for index, row in inventory_data.iterrows():
        state = discretize_inventory(row['Current Inventory'], inventory_bins)
        action = random.choice(actions) if random.uniform(0, 1) < epsilon else max(Q_A[state], key=Q_A[state].get)
        # Simplification: next_state is derived from the same row, so it always equals state
        next_state = discretize_inventory(row['Current Inventory'], inventory_bins)

        # Calculate reward
        reward = calculate_reward(state, action, row['Current Inventory'], row['Demand'])

        # Add experience to buffer
        add_to_buffer(state, action, reward, next_state)

        # Sample from buffer
        batch = sample_from_buffer(batch_size)
        for exp_state, exp_action, exp_reward, exp_next_state in batch:
            # Update rule for Double Q-Learning with experience replay
            if random.uniform(0, 1) < 0.5:
                next_action = max(Q_B[exp_next_state], key=Q_B[exp_next_state].get)
                Q_A[exp_state][exp_action] += alpha * (exp_reward + discount_factor * Q_B[exp_next_state][next_action] - Q_A[exp_state][exp_action])
            else:
                next_action = max(Q_A[exp_next_state], key=Q_A[exp_next_state].get)
                Q_B[exp_state][exp_action] += alpha * (exp_reward + discount_factor * Q_A[exp_next_state][next_action] - Q_B[exp_state][exp_action])

        # Track Q-value for a specific state-action pair
        if state == tracked_state and action == tracked_action:
            tracked_q_values.append((Q_A[state][action] + Q_B[state][action]) / 2)

# Determine the optimal policy based on combined Q-tables
policy = {state: max({a: Q_A[state][a] + Q_B[state][a] for a in actions}, key=lambda a: Q_A[state][a] + Q_B[state][a]) for state in states}

# Print the final Q-tables and optimal policy
print("Final Q-Table A:")
for state in states:
    print(f"{state}: {Q_A[state]}")

print("Final Q-Table B:")
for state in states:
    print(f"{state}: {Q_B[state]}")

print("\nOptimal Policy:")
print(policy)

# Plot the Q-value convergence
plt.plot(tracked_q_values)
plt.xlabel('Visits to tracked state-action pair')
plt.ylabel('Q-value')
plt.title(f'Q-value Convergence for state-action pair ({tracked_state}, {tracked_action})')
plt.show()


The results from your Double Q-learning algorithm with discretized inventory states indicate a logical and coherent policy for inventory management. Here's a breakdown of the results:

#### Final Q-Tables
Both Q-Table A and Q-Table B show similar Q-values for each state-action pair, indicating stable learning. The Q-values represent the expected cumulative reward for taking a particular action in a given state and following the optimal policy thereafter.

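- States `0`, `1`, `2`: The 'Order' action has the highest Q-value, suggesting that ordering more inventory is the best action when inventory levels are very low. Note that state `0` corresponds to inventory below the smallest bin edge, which is the minimum observed in the dataset, so it should never be visited during training; its Q-values stay at zero, and 'Order' is selected for it only because `max` returns the first action on a tie.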
- States `3`, `4`, `5`, `6`: The 'Maintain' action has the highest Q-value, indicating that maintaining the current inventory level is the optimal action for these medium inventory levels.
- States `7`, `8`, `9`, `10`: The 'Reduce' action has the highest Q-value. This implies that reducing inventory is the most beneficial action when the inventory levels are high.
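
As a rough sanity check on the Q-tables, one can compute the values that the updates should settle toward. The sketch below assumes, as in the training loop above, that every transition returns to the same state, and it uses plain value iteration rather than Double Q-learning, so it shows only the fixed point, not the learning dynamics. The helper `fixed_point_q` is hypothetical.

```python
actions = ["Order", "Maintain", "Reduce"]
discount_factor = 0.8

def calculate_reward(state, action):
    # Same thresholds as the notebook's reward function
    if state < 3 and action == "Order":
        return 2
    if 3 <= state < 7 and action == "Maintain":
        return 1
    if state >= 7 and action == "Reduce":
        return 2
    return -1

def fixed_point_q(state, sweeps=200):
    """Iterate Q(s, a) = r(s, a) + gamma * max_a' Q(s, a') for a self-looping state."""
    q = {a: 0.0 for a in actions}
    for _ in range(sweeps):
        best = max(q.values())
        q = {a: calculate_reward(state, a) + discount_factor * best for a in actions}
    return q

# One representative state from each of the low, medium, and high ranges
for state in (1, 5, 8):
    print(state, {a: round(v, 3) for a, v in fixed_point_q(state).items()})
```

Under these assumptions, the best action in a state with reward r settles at r / (1 - 0.8): 10 for 'Order' in low states and 'Reduce' in high states, and 5 for 'Maintain' in medium states. Learned Q-values close to these figures would be consistent with convergence.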

#### Optimal Policy
The optimal policy derived from the Q-tables is:

- **Low Inventory States (0, 1, 2)**: 'Order' more stock, suggesting replenishment is needed to avoid stockouts.
- **Medium Inventory States (3, 4, 5, 6)**: 'Maintain' current inventory levels, indicating a balance between overstocking and understocking.
- **High Inventory States (7, 8, 9, 10)**: 'Reduce' the inventory, likely to minimize holding costs or avoid excess inventory.

#### Interpretation
- The policy appears to be aligned with logical inventory management practices, suggesting effective learning.
- The discretization of inventory levels into bins allows for more nuanced decision-making compared to broad categories like 'Low', 'Medium', and 'High'.
- The similarity in Q-values between Q-Table A and Q-Table B suggests that the learning process is stable and robust.


If you have any questions, please contact salihtutun@wustl.edu.

Salih Tutun, Ph.D.