This project builds a simulated cooking environment in which an RL agent learns the correct sequence of steps (a recipe) needed to complete a task. The agent is a Deep Q-Network (DQN) implemented in PyTorch. The environment is a sequential task: the recipe is a fixed sequence of steps, and at each stage the agent must choose the correct action. Each correct action yields a positive reward and moves the agent to the next step, while an incorrect action incurs a penalty and ends the episode, so the agent starts again from the first step. This is loosely analogous to a foraging animal learning a reliable route to food by trial and error: each step is checked against the known path, and an error sends it back to relearn the route.

The design of this project follows key principles of reinforcement learning (RL). First, the environment is formulated as a Markov Decision Process (MDP) with discrete states and actions. Second, the DQN algorithm is used to approximate the Q-function, which estimates the expected discounted return of each action in a given state. A replay buffer breaks correlations between consecutive samples and stabilises training, loosely like the way natural systems are thought to replay past experiences during sleep. A separate target network, synchronised periodically with the policy network, keeps the training targets stable.
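
To make the Q-function concrete, the sketch below computes the optimal Q-values for a four-step recipe by backward induction on the Bellman equation. It assumes the reward scheme and discount factor that appear in the code cells of this notebook (+1 for a correct step, -1 and termination for a wrong one, gamma = 0.99). The function name `optimal_q_values` is illustrative and not part of the project code.

```python
def optimal_q_values(recipe, n_actions, gamma=0.99):
    """Return q[state][action] under the optimal policy, by backward induction."""
    n_steps = len(recipe)
    # A wrong action always gives -1 and ends the episode, so its value is -1.
    q = [[-1.0] * n_actions for _ in range(n_steps)]
    best_next = 0.0  # nothing follows the final step (terminal)
    for state in reversed(range(n_steps)):
        # Correct action: +1 now, plus the discounted value of the next step.
        q[state][recipe[state]] = 1.0 + gamma * best_next
        best_next = max(q[state])
    return q


q_star = optimal_q_values(recipe=[0, 1, 2, 3], n_actions=5)
for state, row in enumerate(q_star):
    print(state, [round(v, 4) for v in row])
```

Under these assumptions the correct action is worth about 3.94, 2.97, 1.99 and 1.0 at steps 0 to 3, which gives a reference point for judging what the trained network predicts.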

# Imports

In [19]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import deque

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)
torch.manual_seed(42)

# Check device: use GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


# Environment: Simulated Cooking Task

In [20]:
class CookingEnv:
    """
    A simple environment that simulates a cooking recipe.
    The recipe is a fixed sequence of steps. The agent must perform the
    correct action at each step to complete the recipe.
    
    State: current step index (0-indexed).
    Actions: discrete actions representing possible steps.
    Reward: +1 for correct step, -1 for wrong step.
    Episode terminates when the recipe is completed or a wrong step is taken.
    
    This setup is analogous to following a recipe where each step is checked
    before moving on, similar to how animals learn sequential tasks.
    """

    def __init__(self):
        # Define the recipe as a sequence of actions (for example, represented by integers)
        # For instance, a recipe with 4 steps: [0, 1, 2, 3] where each number represents a step.
        self.recipe = [0, 1, 2, 3]  # The correct sequence of actions
        self.n_steps = len(self.recipe)
        self.action_space = 5  # Total number of possible actions (more than recipe steps)
        self.state = 0  # current step index
        self.done = False
        
    def reset(self):
        """Reset the environment for a new episode."""
        self.state = 0
        self.done = False
        return self.state
    
    def step(self, action):
        """
        Take an action and return the next state, reward, done flag and extra info.
        If the action is the correct next step, move forward; else, terminate with a penalty.
        """
        if self.done:
            raise RuntimeError("Episode has terminated. Please reset the environment.")
        
        # Check if the chosen action is the expected one in the recipe
        if action == self.recipe[self.state]:
            reward = 1
            # If we are at the last step, do not increment further
            if self.state == self.n_steps - 1:
                self.done = True
                info = "Recipe completed"
            else:
                self.state += 1
                info = "Correct step, continue"
        else:
            reward = -1
            self.done = True
            info = "Wrong step, episode terminated"
        
        return self.state, reward, self.done, info
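
A quick way to sanity-check these rules is to roll out a few fixed action sequences. The sketch below restates the transition logic as a standalone function so that it runs on its own; `recipe_step` and `rollout` are illustrative helpers, not part of the class above.

```python
RECIPE = [0, 1, 2, 3]


def recipe_step(state, action, recipe=RECIPE):
    """Mirror the rules of CookingEnv.step: return (next_state, reward, done)."""
    if action != recipe[state]:
        return state, -1, True       # wrong step ends the episode
    if state == len(recipe) - 1:
        return state, 1, True        # last step correct: recipe completed
    return state + 1, 1, False       # correct step: advance


def rollout(actions):
    """Apply actions in order until the episode ends; return (total_reward, steps)."""
    state, total, steps = 0, 0, 0
    for action in actions:
        state, reward, done = recipe_step(state, action)
        total += reward
        steps += 1
        if done:
            break
    return total, steps


print(rollout([0, 1, 2, 3]))  # full recipe
print(rollout([0, 1, 4]))     # wrong third step
print(rollout([2]))           # wrong first step
```

The three rollouts return total rewards of 4, 1 and -1 respectively, so a partially correct episode can still end with a positive total.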

# Replay Buffer

In [21]:
class ReplayBuffer:
    """
    Experience replay buffer to store transitions.
    This buffer stores tuples of (state, action, reward, next_state, done).
    Such buffers help stabilise training by randomising over the agent's experiences.
    """
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
    
    def __len__(self):
        return len(self.buffer)
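
The buffer relies on `deque(maxlen=...)`, which silently discards the oldest transition once capacity is reached. The sketch below shows that behaviour with a deliberately tiny capacity and placeholder transitions; the values are illustrative only.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # tiny capacity, for illustration only
for t in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((t, 0, 1, t + 1, False))

# Only the three most recent transitions remain.
stored_states = [transition[0] for transition in buffer]
print(stored_states)

# Uniform sampling without replacement, as in ReplayBuffer.sample
batch = random.sample(buffer, 2)
print(len(batch))
```

With the default capacity of 10000 and episodes of at most four steps, eviction only begins after a few thousand episodes, so in the runs shown here the buffer keeps every transition.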

# DQN Network

In [22]:
class DQN(nn.Module):
    """
    A simple feed-forward network that approximates the Q-value function.
    The input is the state (represented as an integer, but converted to one-hot encoding)
    and the output is a vector of Q-values for each action.
    """
    def __init__(self, state_size, action_size, hidden_size=16):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)
    
    def forward(self, x):
        # Pass through fully connected layers with ReLU activation
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        q_values = self.fc3(x)
        return q_values

# Helper Functions

In [23]:
def one_hot(state, state_size):
    """
    Convert a state (integer index) to a one-hot encoded vector.
    This is similar to representing an organism's state using distinct signals.
    """
    vec = np.zeros(state_size, dtype=np.float32)
    vec[state] = 1.0
    return vec

# DQN Agent

In [24]:
class DQNAgent:
    """
    The agent that uses DQN to learn the cooking procedure.
    It employs an epsilon-greedy policy for exploration.
    """
    def __init__(self, state_size, action_size, hidden_size=16, lr=0.001, gamma=0.99, epsilon_start=1.0, epsilon_min=0.01, epsilon_decay=0.995):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma  # Discount factor for future rewards
        self.epsilon = epsilon_start  # Exploration rate
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        
        # Initialize the network and optimizer
        self.policy_net = DQN(state_size, action_size, hidden_size).to(device)
        self.target_net = DQN(state_size, action_size, hidden_size).to(device)
        self.update_target_network()
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        self.replay_buffer = ReplayBuffer()
    
    def update_target_network(self):
        """Copy the policy network weights to the target network."""
        self.target_net.load_state_dict(self.policy_net.state_dict())
    
    def select_action(self, state):
        """
        Select an action using an epsilon-greedy policy.
        With probability epsilon, a random action is selected (exploration);
        otherwise, the best action is chosen (exploitation).
        """
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        else:
            # Convert state to one-hot tensor
            state_one_hot = torch.tensor(one_hot(state, self.state_size), dtype=torch.float32).unsqueeze(0).to(device)
            with torch.no_grad():
                q_values = self.policy_net(state_one_hot)
            return q_values.argmax().item()
    
    def train_step(self, batch_size):
        """Sample a batch from the replay buffer and perform a training step."""
        if len(self.replay_buffer) < batch_size:
            return 0
        
        batch = self.replay_buffer.sample(batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        
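        # Stack the one-hot vectors into a single NumPy array first; building a tensor
        # from a Python list of arrays is slow and triggers a PyTorch warning.
        states = torch.tensor(np.array([one_hot(s, self.state_size) for s in states]), dtype=torch.float32).to(device)
        next_states = torch.tensor(np.array([one_hot(s, self.state_size) for s in next_states]), dtype=torch.float32).to(device)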
        actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1).to(device)
        rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(1).to(device)
        dones = torch.tensor(dones, dtype=torch.float32).unsqueeze(1).to(device)
        
        # Compute current Q-values using the policy network
        q_values = self.policy_net(states).gather(1, actions)
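        # Compute next Q-values from the target network and take max over actions.
        # The targets need no gradients, so skip building a graph for them.
        with torch.no_grad():
            next_q_values = self.target_net(next_states).max(1)[0].unsqueeze(1)
            # Compute expected Q-values using the Bellman equation;
            # (1 - dones) zeroes the bootstrap term for terminal transitions.
            expected_q_values = rewards + self.gamma * next_q_values * (1 - dones)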
        
        # Compute loss using Mean Squared Error (MSE)
        loss = F.mse_loss(q_values, expected_q_values)
        
        # Perform backpropagation and optimization
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        return loss.item()
    
    def decay_epsilon(self):
        """Reduce the exploration rate over time."""
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
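
To see what this decay rule implies over a full run, the sketch below replays it outside the agent using the default hyperparameters (start 1.0, floor 0.01, factor 0.995). The helper name `epsilon_schedule` is illustrative.

```python
def epsilon_schedule(num_episodes, start=1.0, minimum=0.01, decay=0.995):
    """Return epsilon after each episode, using the same rule as decay_epsilon."""
    eps, history = start, []
    for _ in range(num_episodes):
        if eps > minimum:
            eps *= decay
        history.append(eps)
    return history


history = epsilon_schedule(1000)
print(f"after episode 1:   {history[0]:.4f}")
print(f"after episode 500: {history[499]:.4f}")

# First episode after which epsilon is at or below the floor
first_floor = next(i + 1 for i, e in enumerate(history) if e <= 0.01)
print(f"floor reached after episode {first_floor}")
```

Because the check happens before the multiplication, epsilon can end marginally below `epsilon_min` rather than exactly at it. With these defaults the floor is reached a little after episode 900, so most of a 1000-episode run is spent with meaningful exploration.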

# Training Loop

In [25]:
def train_agent(num_episodes=500, batch_size=32, target_update_interval=10):
    """
    Train the DQN agent in the CookingEnv environment.
    The training loop iterates over episodes, gathering experience and updating the network.
    """
    env = CookingEnv()
    # The state is represented as a one-hot vector; state_size equals number of steps in recipe.
    state_size = env.n_steps
    action_size = env.action_space
    agent = DQNAgent(state_size, action_size)
    
    episode_losses = []
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        total_loss = 0
        steps = 0
        
        while not done:
            action = agent.select_action(state)
            next_state, reward, done, info = env.step(action)
            # Store the transition in the replay buffer
            agent.replay_buffer.push(state, action, reward, next_state, done)
            state = next_state
            steps += 1
            
            # Train on a mini-batch from the replay buffer
            loss = agent.train_step(batch_size)
            total_loss += loss
        
        # Decay the exploration rate after each episode
        agent.decay_epsilon()
        
        # Update the target network periodically
        if (episode + 1) % target_update_interval == 0:
            agent.update_target_network()
        
        episode_losses.append(total_loss)
        print(f"Episode {episode+1}/{num_episodes}, Steps: {steps}, Loss: {total_loss:.4f}, Epsilon: {agent.epsilon:.4f}")
    
    return agent, episode_losses

# Evaluation Loop

In [26]:
def evaluate_agent(agent, num_episodes=100):
    """
    Evaluate the trained agent over several episodes.
    Compute the average reward and the success rate in completing the recipe.
    """
    env = CookingEnv()
    total_rewards = []
    successes = 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        episode_reward = 0
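        while not done:
            # Note: select_action is still epsilon-greedy, so evaluation includes
            # a small amount of random exploration at the agent's current epsilon.
            action = agent.select_action(state)
            next_state, reward, done, info = env.step(action)
            episode_reward += reward
            state = next_state
        total_rewards.append(episode_reward)
        # Note: any positive return counts as a success, which includes episodes
        # that fail at the third or fourth step (returns of 1 and 2).
        if episode_reward > 0:
            successes += 1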
    average_reward = np.mean(total_rewards)
    success_rate = successes / num_episodes
    print(f"Average Reward: {average_reward:.4f}, Success Rate: {success_rate*100:.2f}%")
    return average_reward, success_rate
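
A positive episode return is not quite the same thing as finishing the recipe. The sketch below lists the return for each possible failure point, assuming the four-step recipe and the +1/-1 rewards defined in the environment; `episode_return` is an illustrative helper.

```python
def episode_return(fail_at, n_steps=4):
    """Return for an episode whose first wrong action is at step `fail_at`
    (0-indexed). fail_at=None means every step was correct."""
    if fail_at is None:
        return n_steps        # +1 for each of the n_steps correct actions
    return fail_at - 1        # fail_at correct steps (+1 each), then one -1


returns = {}
for fail_at in [0, 1, 2, 3, None]:
    returns[fail_at] = episode_return(fail_at)
    completed = fail_at is None
    print(f"fail_at={fail_at}: return={returns[fail_at]}, "
          f"return > 0: {returns[fail_at] > 0}, completed: {completed}")
```

Failing at step index 2 or 3 still gives a return of 1 or 2, so a stricter success test under these assumptions would be `episode_reward == env.n_steps`, or a check for the "Recipe completed" info string.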

# Training & Evaluation

In [29]:
# Train the agent
trained_agent, losses = train_agent(num_episodes=1000, batch_size=32, target_update_interval=20)
        
evaluate_agent(trained_agent, num_episodes=666)

Episode 1/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9950
Episode 2/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9900
Episode 3/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9851
Episode 4/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9801
Episode 5/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9752
Episode 6/1000, Steps: 2, Loss: 0.0000, Epsilon: 0.9704
Episode 7/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9655
Episode 8/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9607
Episode 9/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9559
Episode 10/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9511
Episode 11/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9464
Episode 12/1000, Steps: 2, Loss: 0.0000, Epsilon: 0.9416
Episode 13/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9369
Episode 14/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9322
Episode 15/1000, Steps: 2, Loss: 0.0000, Epsilon: 0.9276
Episode 16/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9229
Episode 17/1000, Steps: 1, Loss: 0.0000, Epsilon: 0.9183
Episode 18/1000, Steps: 2, Loss: 0.0000,

(3.8828828828828827, 0.9834834834834835)

# Interpreting the Output

1. **Episode Number (e.g. Episode 1/1000, Episode 2/1000, …):**  
   This indicates which training episode is being executed out of the total (here, 1000). In reinforcement learning, one episode is one complete “attempt” at solving the task, much as an animal might try a new route from its nest each day.

2. **Steps:**  
   The number of steps is the count of actions the agent took during that episode. For example, “Steps: 3” means the agent took 3 actions before the episode terminated. Because the recipe has four steps and a wrong action ends the episode, this value ranges from 1 to 4. A higher step count means the agent got further through the recipe before failing or completing it.

3. **Loss:**  
   This is the training loss accumulated over the episode: the sum, over the episode's training steps, of the mean squared error between the network’s predicted Q-values and the Bellman targets. DQN has a single value-based loss; there is no separate policy loss. Early on the loss is exactly 0.0000 because `train_step` returns 0 until the replay buffer holds at least one full batch (32 transitions), so no update has happened yet. Once training starts, nonzero values appear. In our output, the loss starts at 0.0000 and later becomes a very small positive value (for example, 0.0004 or 0.0124). This loss tells you how far the network’s predictions are from the targets.

4. **Epsilon:**  
   Epsilon is the exploration rate in an epsilon-greedy policy. It starts high (close to 1), meaning the agent takes mostly random actions. Epsilon is then multiplied by 0.995 after every episode (e.g. from 0.9950 after the first episode to around 0.0816 by episode 500) until it reaches the floor of 0.01 (`epsilon_min`), so the agent relies more and more on what it has learned. This decay is what shifts the agent from exploration (trying new things) to exploitation (using learned behavior).

5. **Average Reward and Success Rate:**  
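   After training, the evaluation run prints “Average Reward: 3.8829” and “Success Rate: 98.35%.” The average reward is the mean total reward per episode over the 666 evaluation episodes; the maximum possible is 4, one point for each recipe step. The success rate is the fraction of evaluation episodes whose total reward was positive. This is a slightly generous criterion: an episode that fails at the third or fourth step still ends with a return of 1 or 2 and is counted as a success, so the true completion rate can be lower than 98.35%. The agent also still acts epsilon-greedily during evaluation (epsilon is about 0.01 by then), which accounts for occasional wrong steps.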