# **Lab-based Assignment 5:** Practice - Deep Q Learning for Retail Inventory Management

![Imgur](https://i.imgur.com/b4dtRYW.png)

## **What is Retail Inventory?**

**Retail inventory management** is the backbone of any retail business: it is the set of systems and processes you use to keep a record of your store's inventory. Having the right **automated inventory management system** in place can make all the difference. Out-of-stock items mean **frustrated customers and lost sales**, which over time can damage a retailer's reputation and cost it future customers.

We can define the states, actions, rewards, and discount factor for an inventory management problem.

Let's go through each component:

- **States**:
    - The states represent the different levels of inventory. In this case, the states are defined as **["Low", "Medium", "High"]**, indicating low, medium, and high levels of inventory, respectively.

- **Actions**:
    - The actions represent the decisions the agent can take regarding the inventory. The available actions in this problem are **["Order", "Maintain", "Reduce"]**, which correspond to ordering more inventory, maintaining the current inventory level, or reducing the inventory level, respectively.

    In the context of the product inventory management problem, the actions 'Reduce', 'Maintain', and 'Order' have specific meanings:

   - *Reduce*: The 'Reduce' action means **decreasing the product inventory level**. This could involve strategies such as **selling or promoting products** to reduce the inventory to a desired level. The specific implementation of the 'Reduce' action would depend on the business's inventory management practices.

   - *Maintain*: The 'Maintain' action means **keeping the product inventory level** unchanged. When the agent selects the 'Maintain' action, it implies that the current inventory level is considered satisfactory, and there is no need to increase or decrease it.

   - *Order*: The 'Order' action means **replenishing the product inventory** by placing an order for more products. When the agent chooses the 'Order' action, it indicates that the current inventory level is insufficient, and it is necessary to **order more products** to meet the expected demand.

- **Rewards**:
    - The rewards are the **immediate rewards** the agent receives for taking a specific action in a given state. They are stored in a dictionary whose keys are tuples of the form **(current_state, action)** and whose values are the associated rewards.

- **Discount Factor**:
    - The discount factor, represented as discount_factor, **determines the importance of immediate rewards** versus future rewards. It is a value between 0 and 1, where a higher value places more emphasis on future rewards.
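
As a rough illustration of how these four components fit together, the sketch below writes them out in plain Python and works through two small calculations: the discounted return of a reward sequence and a one-step Q-learning target. The state labels, reward values, and `discount_factor = 0.7` follow the code further down in this notebook; the helper names `discounted_return` and `q_target`, the example trajectory, and the example Q-values are hypothetical.

```python
# States and actions as described above
states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

# Immediate reward for each (current_state, action) pair
rewards = {
    ("Low Inventory", "Order"): 2,
    ("Low Inventory", "Maintain"): -1,
    ("Low Inventory", "Reduce"): -2,
    ("Medium Inventory", "Order"): 1,
    ("Medium Inventory", "Maintain"): 2,
    ("Medium Inventory", "Reduce"): 1,
    ("High Inventory", "Order"): -3,
    ("High Inventory", "Maintain"): -1,
    ("High Inventory", "Reduce"): 2,
}

discount_factor = 0.7


def discounted_return(reward_sequence, gamma):
    """Sum of rewards, each weighted by gamma**t (hypothetical helper)."""
    return sum(r * gamma ** t for t, r in enumerate(reward_sequence))


def q_target(reward, next_q_values, gamma):
    """One-step Q-learning target: r + gamma * max over next-state Q-values."""
    return reward + gamma * max(next_q_values)


# Hypothetical trajectory: order when low, then maintain at medium twice
trajectory = [
    ("Low Inventory", "Order"),
    ("Medium Inventory", "Maintain"),
    ("Medium Inventory", "Maintain"),
]
sequence = [rewards[step] for step in trajectory]               # [2, 2, 2]
example_return = discounted_return(sequence, discount_factor)   # 2 + 1.4 + 0.98
example_target = q_target(2, [0.5, 1.0, -0.2], discount_factor) # 2 + 0.7 * 1.0

print(round(example_return, 2), round(example_target, 2))
```

With a discount factor closer to 1, the later rewards in the sequence would contribute more to the return; with a value closer to 0, the return would be dominated by the first reward.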

![Imgur](https://i.imgur.com/UUUcCid.png)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import random

# Read the CSV file into a DataFrame
df = pd.read_csv('/content/drive/MyDrive/Deep Learning/Lab 5 Files/inventory_dataset.csv')
df.head()

In [None]:
!pip install tensorflow numpy pandas matplotlib

The next cell covers deep learning, experience replay, and Q-learning:

In [None]:
# Importing necessary libraries
import random
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from collections import deque
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import layers

"""Question 1: Provide the file path"""
# Load the dataset
inventory_data = df

# Split the dataset into training and testing sets
train_data, test_data = train_test_split(inventory_data, test_size=0.2, random_state=42)  # 80% training, 20% testing

# Calculate the average demand from the training set for use in state categorization
average_demand = train_data['Demand'].mean()

# Function to categorize the state of inventory based on current inventory level and demand
def categorize_inventory(inventory, demand):
    if demand == 0:
        return "Low Inventory"
    ratio = inventory / average_demand  # Ratio of current inventory to average demand
    if ratio < 0.01:
        return "Low Inventory"
    elif ratio <= 0.1:
        return "Medium Inventory"
    else:
        return "High Inventory"

# Define the possible states and actions for the agent
states = ["Low Inventory", "Medium Inventory", "High Inventory"]
actions = ["Order", "Maintain", "Reduce"]

# Define a reward structure for the agent based on state-action pairs
rewards = {
    ('Low Inventory', 'Order'): 2,
    ('Low Inventory', 'Maintain'): -1,
    ('Low Inventory', 'Reduce'): -2,
    ('Medium Inventory', 'Order'): 1,
    ('Medium Inventory', 'Maintain'): 2,
    ('Medium Inventory', 'Reduce'): 1,
    ('High Inventory', 'Order'): -3,
    ('High Inventory', 'Maintain'): -1,
    ('High Inventory', 'Reduce'): 2
}

# One-hot encode states and actions for input into the neural network
# (float vectors rather than booleans, so they can be fed directly to the network)
state_encoding = {state: np.eye(len(states), dtype=np.float32)[index] for index, state in enumerate(states)}
action_encoding = {action: np.eye(len(actions), dtype=np.float32)[index] for index, action in enumerate(actions)}

# Function to build the neural network model for Deep Q-Learning
"""Question 2: Write the input layer of the model. It is a fully connected (dense) layer with 64 neurons.
Set activation to 'relu' and input shape to len(states).

Question 3: Write the output layer. It is also a fully connected (dense) layer with size of len(actions)"""
def build_model():
    model = tf.keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=(len(states),)),   # Write the input layer of the model here
        layers.Dense(64, activation='relu'),  # Hidden layer
        layers.Dense(len(actions), activation='linear') # Write the Output layer of the model here
    ])
    model.compile(optimizer='adam', loss='mse')  # Compile model with MSE loss function and Adam optimizer
    return model

# Create the model
model = build_model()

# Set learning parameters
alpha = 0.3  # Learning rate (not used in this snippet, typically for manual updates)
initial_epsilon = 1.0  # Starting rate for exploration
final_epsilon = 0.01  # Final rate for exploration after decay
epsilon_decay = 0.995  # Rate at which to decay exploration
epsilon = initial_epsilon  # Initialize epsilon
discount_factor = 0.7  # Discount factor for future rewards
num_episodes = 20  # Number of episodes to train for
buffer_size = 100  # Max size of replay buffer
batch_size = 16  # Number of experiences to sample from buffer

replay_buffer = deque(maxlen=buffer_size)  # Initialize the replay buffer

# Functions to add to and sample from the replay buffer
def add_to_buffer(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))  # Add experience to buffer

def sample_from_buffer(batch_size):
    return random.sample(replay_buffer, min(len(replay_buffer), batch_size))  # Sample a batch from buffer

# Lists to track total rewards, losses, epsilon and average Q-values per episode for plotting
episode_rewards = []
average_q_values = []
losses = []
epsilon_values = []

# Training loop
for episode in range(num_episodes):
    total_reward = 0
    total_q_values = 0
    q_value_count = 0
    episode_loss = 0

    # Iterate over each step in the episode
    for index, row in train_data.iterrows():
        state = categorize_inventory(row['Current Inventory'], row['Demand'])
        state_vector = state_encoding[state]

        # Default to zeros; only overwritten when the model is actually queried
        q_values = np.zeros((1, len(actions)))
        used_model = False

        # Epsilon-greedy: explore with probability epsilon, otherwise exploit the model
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            q_values = model.predict(state_vector.reshape(1, -1), verbose=0)
            action = actions[np.argmax(q_values[0])]
            used_model = True

        # Get the next state and reward, and update the replay buffer.
        # Note: each row is treated as a one-step transition, so next_state is
        # computed from the same row and always equals state.
        next_state = categorize_inventory(row['Current Inventory'], row['Demand'])
        next_state_vector = state_encoding[next_state]
        reward = rewards[(state, action)]
        add_to_buffer(state_vector, action, reward, next_state_vector)

        # Update totals; only count Q-values that the model actually predicted
        total_reward += reward
        if used_model:
            total_q_values += np.max(q_values[0])
            q_value_count += 1

    # Training the network on a batch of experiences from the buffer
    if len(replay_buffer) >= batch_size:
        batch = sample_from_buffer(batch_size)
        for exp_state, exp_action, exp_reward, exp_next_state in batch:
            # Prepare the target Q-value for training
            exp_state = exp_state.reshape(1, -1)
            exp_next_state = exp_next_state.reshape(1, -1)
            target_q = exp_reward + discount_factor * np.max(model.predict(exp_next_state, verbose=0)[0])
            target_q_array = model.predict(exp_state, verbose=0)
            action_index = actions.index(exp_action)
            target_q_array[0][action_index] = target_q
            # Fit once per sampled experience and record the loss
            history = model.fit(exp_state, target_q_array, epochs=1, verbose=0)
            episode_loss += history.history['loss'][0]

    # Mean loss over the sampled batch (0 if no training happened this episode)
    losses.append(episode_loss / batch_size)

    # Decay epsilon and track rewards and Q-values
    epsilon = max(final_epsilon, epsilon * epsilon_decay)
    episode_rewards.append(total_reward)
    average_q_values.append(total_q_values / q_value_count if q_value_count > 0 else 0)
    epsilon_values.append(epsilon)

# Testing loop to evaluate the performance of the model on unseen data
total_reward = 0
for index, row in test_data.iterrows():
    state = categorize_inventory(row['Current Inventory'], row['Demand'])
    state_vector = state_encoding[state].reshape(1, -1)
    q_values = model.predict(state_vector)
    action = actions[np.argmax(q_values[0])]
    next_state = categorize_inventory(row['Current Inventory'], row['Demand'])
    reward = rewards[(state, action)]
    total_reward += reward

"""Question 4: Write the code to Print total reward from the test data"""

print(f"Total reward on test data: {total_reward}")

# Plot the rewards and Q-values for analysis
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(episode_rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Total Rewards per Episode')
plt.subplot(1, 2, 2)
plt.plot(average_q_values)
plt.xlabel('Episode')
plt.ylabel('Average Q-Value')
plt.title('Average Q-Values per Episode')
plt.tight_layout()
plt.show()


In [None]:
# Predict Q-values for each state-action pair
state_q_values = {}
for state in states:
    state_vector = state_encoding[state].reshape(1, -1)
    q_values = model.predict(state_vector)
    state_q_values[state] = {action: q_values[0][i] for i, action in enumerate(actions)}

# Print the Q-values for verification
for state, q_vals in state_q_values.items():
    print(f"State: {state}, Q-Values: {q_vals}")

# Create a plot for Q-values
plt.figure(figsize=(10, 6))
for action in actions:
    q_values = [state_q_values[state][action] for state in states]
    plt.plot(states, q_values, label=f"Action: {action}", marker='o')

plt.title("Q-Values for Each State-Action Pair")
plt.xlabel("State")
plt.ylabel("Q-Value")
plt.legend()
plt.show()


In [None]:
# Plot Loss Over Episodes
plt.figure(figsize=(6, 4))
plt.plot(losses)
plt.title("Loss Over Episodes")
plt.xlabel("Episode")
plt.ylabel("Loss")
plt.show()

The plots show the total reward per episode and the average Q-value per episode for the reinforcement learning agent over the training run (`num_episodes = 20` in the code above). Let's analyze each plot:

### Total Rewards per Episode:

- **Upward Trend**: There's a clear upward trend in the total rewards, indicating that the agent is learning and improving its policy over time.
- **Reduced Fluctuations**: Initially, there are some fluctuations, but as episodes increase, the fluctuations in total reward decrease. This suggests that the agent is becoming more consistent in its performance.
- **Higher Plateau**: Towards the later episodes, the rewards seem to plateau, suggesting that the agent may be approaching an optimal policy for the environment it is in.

### Average Q-Values per Episode:

- **Sharp Increase**: The average Q-values rise sharply in the initial episodes, which indicates that the agent is quickly learning from its environment.
- **Plateauing of Q-Values**: After the sharp increase, the average Q-values plateau, which typically means the agent has learned to predict the expected rewards from its actions fairly consistently.
- **Stability**: The relatively flat line towards the end suggests that the Q-value function has stabilized, which in turn indicates that the agent's policy may have converged.

### Combined Interpretation:

- **Learning Efficiency**: The agent's learning process seems efficient, as indicated by the upward trend in rewards and by the early rise and later stabilization of the average Q-values.
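
One way to check descriptions such as "upward trend" or "plateau" numerically rather than by eye is to smooth the per-episode series with a moving average. The sketch below is a minimal, standard-library example; `moving_average`, the window size, and the sample numbers are hypothetical, and in this notebook you would pass `episode_rewards` or `average_q_values` instead.

```python
def moving_average(values, window=5):
    """Trailing moving average; early entries average over the points seen so far."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up rewards for illustration only (not results from this lab)
sample_rewards = [10, 12, 9, 15, 18, 17, 20, 21, 20, 22]
smoothed = moving_average(sample_rewards, window=3)

# Compare the smoothed value at the start and at the end of the run
improved = smoothed[-1] > smoothed[0]
print(smoothed[0], smoothed[-1], improved)
```

Plotting `smoothed` next to the raw series makes it easier to tell a genuine trend from episode-to-episode noise, especially with as few episodes as this lab uses.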

In [None]:
model.save('/content/drive/MyDrive/Deep Learning/Lab 5 Files/inventory_management_model.keras')

print("Model saved as 'inventory_management_model.keras'")

Using a trained Deep Q-Learning model in a business context, especially for inventory management, involves several steps. These steps include integrating the model into a business process or system, making predictions, and taking actions based on those predictions. Here's a general outline of how you can use your trained model in a business setting:

### 1. **Integration with Business Systems**

   - **API Development**: Develop an API around your model. This allows various systems in your business to interact with the model. You can use frameworks like Flask or FastAPI for Python to create a simple web service that takes input and returns the model's predictions.
   - **Database Integration**: Ensure your model has access to real-time or updated data from your business's inventory database. It's important that the model receives current inventory and demand data to make accurate predictions.

### 2. **Making Predictions**

   - **Data Preprocessing**: Process the input data (inventory levels, demand, etc.) in the same way as you did during training. This often involves categorizing inventory levels and encoding them before feeding them into the model.
   - **Model Prediction**: Use the model to predict the optimal action based on the current inventory state. The model, given a state, will output the Q-values for each action, and you choose the action with the highest Q-value.
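
The two steps above can be sketched as a single function. This is a minimal, standard-library sketch rather than the notebook's exact code: `recommend_action`, `one_hot`, and `fake_q_function` are hypothetical names, the thresholds copy `categorize_inventory` from the training code, `average_demand=500.0` is a made-up value, and `fake_q_function` stands in for a call to the trained Keras model.

```python
STATES = ["Low Inventory", "Medium Inventory", "High Inventory"]
ACTIONS = ["Order", "Maintain", "Reduce"]


def categorize_inventory(inventory, demand, average_demand):
    """Same thresholds as the training code; average_demand comes from the training set."""
    if demand == 0:
        return "Low Inventory"
    ratio = inventory / average_demand
    if ratio < 0.01:
        return "Low Inventory"
    elif ratio <= 0.1:
        return "Medium Inventory"
    return "High Inventory"


def one_hot(state):
    """Encode a state label as a one-hot list, in the order of STATES."""
    return [1.0 if s == state else 0.0 for s in STATES]


def recommend_action(inventory, demand, average_demand, q_function):
    """Preprocess the inputs, query the Q-function, and pick the highest-valued action."""
    state = categorize_inventory(inventory, demand, average_demand)
    q_values = q_function(one_hot(state))
    best_index = max(range(len(ACTIONS)), key=lambda i: q_values[i])
    return state, ACTIONS[best_index]


def fake_q_function(state_vector):
    """Stand-in for the trained model: returns made-up Q-values for each state."""
    table = {
        (1.0, 0.0, 0.0): [2.0, -1.0, -2.0],
        (0.0, 1.0, 0.0): [1.0, 2.0, 1.0],
        (0.0, 0.0, 1.0): [-3.0, -1.0, 2.0],
    }
    return table[tuple(state_vector)]


state, action = recommend_action(
    inventory=2, demand=40, average_demand=500.0, q_function=fake_q_function
)
print(state, action)
```

Passing the Q-function in as an argument keeps the preprocessing testable without loading the model; in production the stand-in would be replaced by a thin wrapper around the saved model's prediction call.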

### 3. **Action Based on Predictions**

   - **Implementing Decisions**: The chosen action (e.g., order more stock, maintain current levels, or reduce stock) should be implemented in your inventory management system.
   - **Monitoring & Feedback**: Monitor the outcomes of these actions to provide feedback into the system. This can be used for further training or model refinement.

### 4. **Model Maintenance and Updating**

   - **Regular Re-training**: Update the model periodically with new data to ensure it stays accurate. The business environment often changes, and your model should adapt to these changes.
   - **Performance Monitoring**: Continuously monitor the model's performance and its impact on business metrics. If performance degrades, consider retraining the model with more recent data.
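
A simple way to make "performance degrades" concrete is to track a rolling average of observed rewards and flag when it drops below a threshold. The sketch below is illustrative only; the `PerformanceMonitor` class, the window size, the threshold, and the reward values are hypothetical choices, not values from this lab.

```python
from collections import deque


class PerformanceMonitor:
    """Tracks recent rewards and flags when their average falls below a threshold."""

    def __init__(self, window=5, threshold=0.5):
        self.recent = deque(maxlen=window)  # Oldest rewards drop out automatically
        self.threshold = threshold

    def record(self, reward):
        self.recent.append(reward)

    def average(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def needs_retraining(self):
        # Only judge once the window is full, to avoid reacting to a few early samples
        return len(self.recent) == self.recent.maxlen and self.average() < self.threshold


monitor = PerformanceMonitor(window=5, threshold=0.5)

# Made-up rewards from a period where recommendations work well
for reward in [2, 2, 1, 2, 2]:
    monitor.record(reward)
healthy_flag = monitor.needs_retraining()

# Made-up rewards from a period where recommendations no longer fit demand
for reward in [-1, -3, -1, -2, -1]:
    monitor.record(reward)
degraded_flag = monitor.needs_retraining()

print(healthy_flag, degraded_flag)
```

In practice the reward signal would come from observed business outcomes, such as stock-outs or overstock costs, and the threshold would need to be calibrated against historical performance.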

### 5. **Compliance and Ethical Considerations**

   - **Data Privacy**: Ensure that the use of data complies with all relevant data privacy laws and regulations.
   - **Transparency**: Maintain transparency about how the model makes decisions, especially if these decisions significantly impact the business or customers.

### 6. **User Interface**

   - **Dashboard**: Develop a user-friendly dashboard that shows the model's recommendations, current inventory levels, and other relevant metrics. This aids in decision-making and provides insight into the model's performance.

### Example Scenario:

Imagine a scenario where your model suggests ordering new stock when inventory levels fall into the "Low Inventory" category. Your system, via an API, automatically sends this recommendation to the supply chain management team. The team can then review and approve the order, ensuring that decisions are both data-driven and human-reviewed.

### Conclusion:

Implementing a Deep Q-Learning model in a business, especially in a critical area like inventory management, requires careful planning, continuous monitoring, and regular updates. The key is to ensure that the model's predictions align well with the business's operational realities and objectives.