# The Cooking Chef Problem

Consider the case where the agent is your personal Chef. In particular, the agent (the smiley on the map) wants to cook the eggs recipe according to your indication (scrambled or pudding).

In order to cook the desired recipe, the agent must first collect the needed tools (the egg beater on the map). Then he must reach the stove (the frying pan or the oven on the map). Finally, he can cook.

Note that there are two special interlinked cells (marked with the G) that allow the agent to go from one side of the map to the other. But to do so, the agent needs to express his will to go on the other side.

Cells in (4, 2) and (9, 3) are the special gate ones. They allow the agent to go from one side of the map to another. Those two special cells are interlinked, but the agent needs to express his will to go on the other side.

Since you are very hungry, it is fundamental that the agent cooks the eggs according to your taste (scrambled/pudding) as fast as he can without letting you wait for more than necessary.

In order to apply optimal control techniques such as value iteration, you need to model the aforementioned scenario as an MDP. Recall that an MDP is defined as a tuple (S, A, P, R, γ), where:

- **S**: The (finite) set of all possible states.
- **A**: The (finite) set of all possible actions.
- **P**: The transition function P: S x S x A -> [0,1], which maps (s',s,a) to P(s′|s,a), i.e., the probability of transitioning to state s′ ∈ S when taking action a ∈ A in state s ∈ S. Note that ∑{s' ∈ S} P(s'|s, a) = 1 for all s ∈ S, a ∈ A.
- **R**: The reward function R : S x A x S -> *R*, which maps (s,a,s') to R(s,a,s′), i.e., the reward obtained when taking action a ∈ A in state s ∈ S and arriving at state s′ ∈ S.
- **γ**: The discount factor which controls how important rewards are in the future.


![Instance Image](Problem_image.png)

##### A particular instance of the cooking Chef problem. The goal is for the agent currently located in state (4, 3) to have a policy that always leads to cooking the eggs in location (1, 4) or (8, 4). Cells in (4, 2) and (9,3) are the special gate.

##### In the figure:
• The agent is at (4, 3) (but it can start at any of the grid cells).
• The agent needed cooking tools as the egg beater is in position (1, 3) and (8, 3).
• There are two different final goals, displayed as the frying pan is in position (1, 4) and the oven in position (8, 4).
• Cells in (4, 2) and (9, 3) are the special gate ones. They allow the agent to go from one side of the map to another. Those two special cells are interlinked, but the agent needs to express his will to go on the other side.
• The agent is not able to move diagonally.
• Walls are represented by thick black lines.
• The agent cannot move through walls.
• An episode will end when the agent successfully cooks the scrambled eggs (see the above description).

Now, considering the problem as a model-free scenario, provide a program (written in Python, possibly based on the labs) that can compute the optimal policy for this world by solely considering the pudding eggs scenario. Draw the computed policy in the grid by putting the optimal action in each cell. If multiple actions are possible, include the probability of each arrow. There may be multiple optimal policies; pick one to show it. Note that the model is not available for computation but must be encoded to be used as the "real-world" environment.

Note that the enviroment is 'setted' in the following way

![Instance_Image](states.png)





__Author__ : Alessandro Pio n.294417

# Imports

In [1]:
from itertools import zip_longest
import numpy as np
import random

try:
    from terminaltables import AsciiTable
except ImportError:
    print("No terminaltables library was found but there is no problem, the code could still run")

# Utilities functions

I the following cell i provide some utilities functions

- `get_optimal_policy(q_table, actions)`
This function calculates the optimal policy based on the Q-table and the available actions. It iterates through each state, finds the action with the highest Q-value, and assigns it to the optimal policy for that state.
- `get_optimal_policy_matrix(optimal_policy)`
This function creates two policy matrices (4x4 grids) for displaying the optimal policy visually. It maps actions to arrows for better representation.
- `get_info(beater_opt_pol, cook_opt_pol)`
This function prints a tabular representation of the optimal policy for both taking the beater and cooking actions. It uses the AsciiTable library for better formatting.

In [2]:
def get_optimal_policy(q_table, actions):
    """
    Calculates the optimal policy based on the Q-table and the available actions.

    Parameters:
    - q_table (numpy.ndarray): Q-table containing Q-values for different actions and states.
    - actions (list): List of possible actions.

    Returns:
    - dict: Optimal policy mapping states to corresponding optimal actions.
    """
    
    optimal_policy = {}
        
    for state in range(1, 33): # starting from 1 to 32
        optimal_action = actions[np.argmax(q_table[state - 1])]
        optimal_policy[state] = optimal_action
    return optimal_policy 

def get_optimal_policy_matrix(optimal_policy):
    """
    Creates two policy matrices for visualizing the optimal policy.

    Parameters:
    - optimal_policy (dict): Optimal policy mapping states to corresponding optimal actions.

    Prints:
    - Two policy matrices displayed in the console.
    """
    
    mapped_actions = {
            "U": "↑",
            "D": "↓",
            "L": "←",
            "R": "→",
            "T": "⤴",
            "TR": "⇒",
            "TL": "⇒"
        }
    
    policy_matrix1 = np.zeros((4, 4), dtype='U2')
    policy_matrix2 = np.zeros((4, 4), dtype='U2')
         
    for state, action in optimal_policy.items():
        
        if state <= 16:
            if Environment.oven == state:
                policy_matrix1[(state - 1) // 4, (state - 1) % 4] = "C"
            else:
                policy_matrix1[(state - 1) // 4, (state - 1) % 4] = mapped_actions[action]
        else:
            if Environment.oven == state:
                policy_matrix2[(state - 17) // 4, (state - 17) % 4] = "C"
            else:
                policy_matrix2[(state - 17) // 4, (state - 17) % 4] = mapped_actions[action]
        
    print("Optimal Policy Matrix 1:")
    print(policy_matrix1)
        
    print("\nOptimal Policy Matrix 2:")
    print(policy_matrix2)
    
def get_info(beater_opt_pol, cook_opt_pol):
    """
    Prints a tabular representation of the optimal policy for taking the beater and cooking actions.

    Parameters:
    - beater_opt_pol (dict): Optimal policy mapping states to corresponding optimal actions for taking the beater.
    - cook_opt_pol (dict): Optimal policy mapping states to corresponding optimal actions for cooking.

    Prints:
    - A formatted table displaying the optimal policies for both actions.
    """
    
    data = [["STATE", "BEATER OPT POLICY", "COOK OPT POLICY"]]

    keys_beater, values_beater = list(beater_opt_pol.keys()), list(cook_opt_pol.values())
    keys_cook, values_cook = list(beater_opt_pol.keys()), list(cook_opt_pol.values())
    
    for num, (key_b, value_b, key_c, value_c) in enumerate(zip_longest(keys_beater, values_beater, keys_cook, values_cook, fillvalue=['-', '-', '-', '-']), start=1):
        data.append([num, value_b, value_c])
    try:
        print(AsciiTable(data).table)
    except NameError:
        column_widths = [max(len(str(item)) for item in column) + 5 for column in zip(*data)]
        
        for row in data:
            formatted_row = "|".join(f"{item:^{width}}" for item, width in zip(row, column_widths))
            print(formatted_row)
            

# Chef Class

The `Chef` class represents a chef in a cooking environment. It is designed for use in a reinforcement learning scenario where the chef needs to perform various actions such as moving, taking a beater, and cooking eggs in an oven.

## Attributes

- `state` (int): Current state or position of the chef in the environment.
- `has_beater` (bool): A flag indicating whether the chef currently has a beater.
- `q_beater` (numpy.ndarray): Q-values for actions related to taking the beater.
- `q_cook` (numpy.ndarray): Q-values for cooking actions.
- `actions` (list): List of possible actions for the chef ('L', 'R', 'U', 'D', 'T', 'TR', 'TL').

## Methods

- `take_beater(reward)`: Attempts to take the beater, updates state and reward accordingly.
- `cook_eggs(reward)`: Cooks eggs, updates reward, and sets cooking_done flag to True.
- `step(state, action, illegal_moves, gates, beaters, oven)`: Takes a step in the environment based on the given action.

In [3]:
class Chef:

    def __init__(self, state):
        """
        Initializes a new Chef instance.

        Parameters:
        - state (int): The initial state or position of the chef in the environment.
        """
        
        self.state = state
        self.has_beater = False
        self.q_beater = np.zeros((32, 7))
        self.q_cook = np.zeros((32, 7))
        
        # LEFT, RIGHT, UP, DOWN, TAKE, TELEPORT RIGHT, TELEPORT LEFT
        self.actions = ['L', 'R', 'U', 'D', 'T', 'TR', 'TL']

    def take_beater(self, reward):
        """
        Attempts to take the beater.

        Parameters:
        - reward (float): The current reward value.

        Returns:
        - float: The updated reward value after attempting to take the beater.
        """
        
        if not self.has_beater:
            self.has_beater = True
            reward += Environment.rewards['take_beater']
        return reward

    def cook_eggs(self, reward):
        """
        Cooks eggs.

        Parameters:
        - reward (float): The current reward value.

        Returns:
        - tuple: A tuple containing the updated reward value and cooking_done flag.
        """
        
        reward += Environment.rewards['cook_eggs']
        cooking_done = True
        return reward, cooking_done
    
    def step(self, state, action, illegal_moves, gates, beaters, oven):
        """
        Takes a step in the environment based on the given action.

        Parameters:
        - state (int): The current state or position in the environment.
        - action (str): The action to be taken.
        - illegal_moves (set): Set of illegal state transitions.
        - gates (tuple): Tuple containing the positions of the teleportation gates.
        - beaters (set): Set containing positions of beaters.
        - oven (int): The position of the oven.

        Returns:
        - tuple: A tuple containing the updated state, reward, and cooking_done flag.
        """
        
        cooking_done = False
        reward = Environment.rewards['not_final']

        moves = {
            'L': action == 'L' and (state, state - 1) not in illegal_moves,
            'R': action == 'R' and (state, state + 1) not in illegal_moves,
            'U': action == 'U' and (state, state - 4) not in illegal_moves,
            'D': action == 'D' and (state, state + 4) not in illegal_moves,
            'T': action == 'T' and state in beaters,
            'TR': action == 'TR' and state == gates[0],
            'TL': action == 'TL' and state == gates[1]
        }

        if moves['L']:
            state -= 1
        elif moves['R']:
            state += 1 
        elif moves['U']:
            state -= 4
        elif moves['D']:
            state += 4
            
        elif moves['T']:
            reward = self.take_beater(reward)
            
        elif moves['TR']: # teleport 1 changing state
            state = 24
        elif moves['TL']: # teleport 2 changing state
            state = 12
            
        elif state == oven and self.has_beater:
            reward, cooking_done = self.cook_eggs(reward)
        elif (self.has_beater and action == 'T' or 
              state == oven and not self.has_beater or
              action in ['TR', 'TL'] and state not in gates):
            
            reward -= Environment.rewards['discouraged_action_penalty']

        return state, reward, cooking_done


# Environment Class

The `Environment` class models the cooking environment where a `Chef` agent interacts. It defines the rewards for different scenarios, the layout of the environment, and the main simulation loop.

## Attributes

- `rewards` (dict): Dictionary containing reward values for different scenarios.
- `oven` (int): Position of the oven in the environment.
- `gates` (list): Positions of teleportation gates.
- `beaters` (list): Positions of beaters.
- `illegal_moves` (list): List of illegal state transitions in the environment.
- `learning_rate` (float): Learning rate for updating Q-values in the simulation.
- `gamma` (float): Discount factor for future rewards in the Q-learning algorithm.
- `epsilon` (float): Exploration-exploitation trade-off parameter.

## Methods

- `__init__(self, learning_rate, gamma, epsilon)`: Initializes the environment with given learning parameters.
- `generate_illegal_moves(self)`: Generates a list of illegal state transitions based on the environment layout.
- `run(self, chef: Chef)`: Runs the simulation loop where the chef agent takes actions in the environment, and Q-values are updated accordingly.


In [4]:
class Environment:
    """
    The `Environment` class models the cooking environment where a `Chef` agent interacts. It defines the rewards for
    different scenarios, the layout of the environment, and the main simulation loop.

    """
    
    rewards = {"not_final": -0.5,
               "wall": -150, 
               "discouraged_action_penalty": 7.0, 
               "take_beater": 10, 
               "cook_eggs": 50}
    oven = 19
    
    def __init__(self, learning_rate, gamma, epsilon):
        """
        Initializes a new Environment instance.

        Parameters:
        - learning_rate (float): Learning rate for updating Q-values in the simulation.
        - gamma (float): Discount factor for future rewards in the Q-learning algorithm.
        - epsilon (float): Exploration-exploitation trade-off parameter.
        """
    
        self.gates = [12, 24]
        self.beaters = [5, 23]
        self.illegal_moves = self.generate_illegal_moves()
        
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon

    def generate_illegal_moves(self):
        """
        Generates a list of illegal state transitions based on the environment layout.

        Returns:
        - list: List of illegal state transitions.
        """
        
        return [
            (1, 5), (5, 6), (6, 10), (7, 11), (9, 13), (10, 14), (11, 15),                  # -------
            (5, 1), (6, 5), (10, 6), (11, 7), (13, 9), (14, 10), (15, 11),                  # | internal
            (19, 23), (20, 24), (22, 23), (26, 27), (27, 31),                               # | walls
            (23, 19), (24, 20), (23, 22), (27, 26), (31, 27),                               # --------
            (1, 0), (5, 4), (9, 8), (13, 12), (17, 16), (21, 20), (25, 24), (29, 28),       # | left   
            (4, 5), (8, 9), (12, 13), (16, 17), (20, 21), (24, 25), (28, 29), (32, 33),     # | right  - external 
            (1, -3), (2, -2), (3, -1), (4, 0), (17, 13), (18, 14), (19, 15), (20, 16),      # | up     - walls
            (13, 17), (14, 18), (15, 19), (16, 20), (29, 33), (30, 34), (31, 35), (32, 36)  # | down
        ]

    def run(self, chef: Chef):
        """
        Runs the simulation loop where the chef takes actions in the environment, and Q-values are updated accordingly.

        Parameters:
        - chef (Chef): The chef agent interacting with the environment.

        Returns:
        - float: Total reward accumulated during the simulation.
        """
        
        state = random.randint(1, 32)
        has_cooked = False
        total_reward = 0
        chef.has_beater = False
    
        while not has_cooked:
            if random.uniform(0, 1) < self.epsilon:
                action = np.random.choice(chef.actions)
            else:
                if not chef.has_beater:
                    action = chef.actions[np.argmax(chef.q_beater[state - 1])]
                else:
                    action = chef.actions[np.argmax(chef.q_cook[state - 1])]
    
            next_state, reward, cooking_done = chef.step(state, action, self.illegal_moves, self.gates, self.beaters, self.oven)
    
            if (state, next_state) in self.illegal_moves:
                reward -= Environment.rewards['wall']
    
            if not chef.has_beater:
                chef.q_beater[state - 1][chef.actions.index(action)] += self.learning_rate * (reward + self.gamma * np.max(chef.q_beater[next_state - 1]) - chef.q_beater[state - 1][chef.actions.index(action)])
            else:
                chef.q_cook[state - 1][chef.actions.index(action)] += self.learning_rate * (reward + self.gamma * np.max(chef.q_cook[next_state - 1]) - chef.q_cook[state - 1][chef.actions.index(action)])
    
            state = next_state
            total_reward += reward
    
            if cooking_done:
                has_cooked = True
    
        return total_reward

# Main

In [9]:
epsilon = 0.01
gamma = 0.8 
learning_rate = 0.2
    
environment = Environment(learning_rate=learning_rate, gamma=gamma, epsilon=epsilon)
chef = Chef(8)
max_episodes = 100000
    
for episode in range(max_episodes):
    environment.epsilon = 1 / (episode + 1)
    reward = environment.run(chef)
        
    print(f"Episode {episode + 1}, Reward: {reward}")
    
beater_optimal_policy = get_optimal_policy(chef.q_beater, chef.actions)
cook_optimal_policy = get_optimal_policy(chef.q_cook, chef.actions)

Episode 1, Reward: -7101.0
Episode 2, Reward: -2405.0
Episode 3, Reward: -417.5
Episode 4, Reward: -270.0
Episode 5, Reward: -81.0
Episode 6, Reward: -46.5
Episode 7, Reward: -183.0
Episode 8, Reward: -48.0
Episode 9, Reward: -89.5
Episode 10, Reward: 6.0
Episode 11, Reward: -108.0
Episode 12, Reward: -32.5
Episode 13, Reward: 0.5
Episode 14, Reward: -27.0
Episode 15, Reward: -123.5
Episode 16, Reward: -50.0
Episode 17, Reward: -28.0
Episode 18, Reward: -76.5
Episode 19, Reward: 18.5
Episode 20, Reward: 20.5
Episode 21, Reward: 4.0
Episode 22, Reward: 27.5
Episode 23, Reward: 40.0
Episode 24, Reward: 30.5
Episode 25, Reward: 11.0
Episode 26, Reward: -5.0
Episode 27, Reward: 2.0
Episode 28, Reward: 28.5
Episode 29, Reward: 25.0
Episode 30, Reward: -19.5
Episode 31, Reward: 30.5
Episode 32, Reward: 8.5
Episode 33, Reward: 36.0
Episode 34, Reward: 20.5
Episode 35, Reward: 44.0
Episode 36, Reward: 8.0
Episode 37, Reward: 43.0
Episode 38, Reward: -0.5
Episode 39, Reward: 20.0
Episode 40, Re

In [10]:
get_info(beater_optimal_policy, cook_optimal_policy)

+-------+-------------------+-----------------+
| STATE | BEATER OPT POLICY | COOK OPT POLICY |
+-------+-------------------+-----------------+
| 1     | D                 | D               |
| 2     | D                 | D               |
| 3     | D                 | D               |
| 4     | L                 | L               |
| 5     | D                 | D               |
| 6     | R                 | R               |
| 7     | D                 | D               |
| 8     | R                 | R               |
| 9     | R                 | R               |
| 10    | R                 | R               |
| 11    | R                 | R               |
| 12    | TR                | TR              |
| 13    | D                 | D               |
| 14    | L                 | L               |
| 15    | D                 | D               |
| 16    | R                 | R               |
| 17    | R                 | R               |
| 18    | R                 | R         

In [11]:
get_optimal_policy_matrix(beater_optimal_policy)

Optimal Policy Matrix 1:
[['→' '→' '→' '↓']
 ['⤴' '→' '→' '↓']
 ['↑' '←' '←' '⇒']
 ['→' '→' '→' '↑']]

Optimal Policy Matrix 2:
[['→' '↓' 'C' '←']
 ['→' '↓' '⤴' '←']
 ['→' '↓' '↑' '←']
 ['→' '→' '→' '↑']]


In [12]:
get_optimal_policy_matrix(cook_optimal_policy)

Optimal Policy Matrix 1:
[['↓' '↓' '↓' '←']
 ['↓' '→' '↓' '→']
 ['→' '→' '→' '⇒']
 ['↓' '←' '↓' '→']]

Optimal Policy Matrix 2:
[['→' '→' 'C' '←']
 ['↑' '↑' '↓' '↓']
 ['→' '↑' '→' '↓']
 ['←' '↑' '←' '←']]
