# Example Agent and Environment

This example of a rational agent enables experimentation with decision strategies in two ways: 
1. based on expected utility 
2. based on rewards obtained after each action

The environment is inspired by the OpenAI Gym framework, but extended to support decisions based on expected utility.

# 1. Definition of the Environment

The code below defines a simple Vacuum Cleaner Environment with the following characteristics:

Environment state:
- the agent is in one of two rooms (A or B)
- there is a certain amount of dust (d_A) on the floor in room A
- there is a certain amount of dust (d_B) on the floor in room B

This state is represented by a dictionary with the keys `room`, `d_A` and `d_B`.

An environment object has the following methods:
- reset(), which brings the environment into a random (start) state and returns that state
- step(action), which processes the action of the agent and returns the new state, done, reward and optional debug info (in that order)
- render(), which prints a simple visualisation of the current state of the world

The actions (Left, Right and Suck) are represented by an enum.

We will illustrate each of the elements above by simple code examples below.

In [2]:
from enum import Enum
from random import random, choice

ActionSpace = Enum('ActionSpace', 'Left Right Suck')

Room = Enum('Room', 'A B')

IN_ROOM_A = [
    '┌───────┬───────┐',
    '│ A     │ B     │',
    '│   ╱╲  │       │',
    '│   --  │       │',
    '├───────┼───────┤',
    '│   0.0 │   0.0 │',
    '└───────┴───────┘'
]
    
IN_ROOM_B = [
    '┌───────┬───────┐',
    '│ A     │ B     │',
    '│       │   ╱╲  │',
    '│       │   --  │',
    '├───────┼───────┤',
    '│   0.0 │   0.0 │',
    '└───────┴───────┘'
]

class VacuumCleanerEnvironment():
    def __init__(self, room = Room.A, mm_A = 0.5, mm_B = 0.5):
        self.__state = {'room': room, 'd_A': mm_A, 'd_B': mm_B}
    
    def reset(self):
        mm_A = round(random(), 1)
        mm_B = round(random(), 1)
        room = choice([Room.A, Room.B])
        self.__state = {'room': room, 'd_A': mm_A, 'd_B': mm_B}
        return self.__state
        
    def get_new_state(self, action):
        r, d_A, d_B = self.__state['room'], self.__state['d_A'], self.__state['d_B']
        # process the action selected by the agent
        if action == ActionSpace.Left:
            r = Room.A
        elif action == ActionSpace.Right:
            r = Room.B
        else:  # action == ActionSpace.Suck
            if r == Room.A:
                d_A = max(0.0, d_A - 0.1)
            else:  # self.__state['room'] == Room.B
                d_B = max(0.0, d_B - 0.1)
        return {'room': r, 'd_A': d_A, 'd_B': d_B}
        
    def step(self, action):
        self.__state = self.get_new_state(action)
        observation = self.__state  # state is fully observable
        done = self.__state['d_A'] <= 0.001 and self.__state['d_B'] <= 0.001 # prevent rounding errors
        reward = -1
        info = {}  # optional debug info
        return observation, done, reward, info

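    def render(self):
        # copy the template so that the module-level drawing is not modified
        if self.__state['room'] == Room.A:
            rendering = list(IN_ROOM_A)
        else:
            rendering = list(IN_ROOM_B)
        d_A = round(self.__state['d_A'], 1)
        d_B = round(self.__state['d_B'], 1)
        # line 5 of the drawing shows the amount of dust in each room
        rendering[5] = '│ ' + str(d_A).rjust(5) + ' │ ' + str(d_B).rjust(5) + ' │'
        for line in rendering:
            print(line)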
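
The `done` check in step() compares against 0.001 rather than 0 because dust is reduced by repeated floating-point subtraction of 0.1. The short sketch below is only an illustration of that effect and is not part of the environment; the variable `d` is a stand-in for one room's dust level.

```python
# 0.1 has no exact binary representation, so float arithmetic is slightly off
print(0.3 - 0.1 == 0.2)   # False

# mimic five Suck actions in a room that starts with 0.5 dust
d = 0.5
for _ in range(5):
    d = max(0.0, d - 0.1)

# d may be a tiny positive residue instead of exactly 0.0,
# which is why the environment tests against a small threshold
print(d <= 0.001)         # True
```

An alternative design would be to store dust as whole tenths (integers) and divide by 10 only when rendering.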

## Creation of an Environment

The environment class allows an environment to be created in one of three initial states:
- default (state parameters not specified at construction)
- specific (state parameters specified at construction)
- random (by using method reset())

Below are examples of all three uses.

In [3]:
# example of creation of an environment in the default state
env = VacuumCleanerEnvironment()
env.render()

┌───────┬───────┐
│ A     │ B     │
│   ╱╲  │       │
│   --  │       │
├───────┼───────┤
│   0.5 │   0.5 │
└───────┴───────┘


In [4]:
# example of creation of an environment in a specific state
env = VacuumCleanerEnvironment(Room.B, 0.2, 0.9)
env.render()

┌───────┬───────┐
│ A     │ B     │
│       │   ╱╲  │
│       │   --  │
├───────┼───────┤
│   0.2 │   0.9 │
└───────┴───────┘


In [7]:
# example of creation of an environment in a random state
env = VacuumCleanerEnvironment()
env.reset()
env.render()

┌───────┬───────┐
│ A     │ B     │
│   ╱╲  │       │
│   --  │       │
├───────┼───────┤
│   0.4 │   1.0 │
└───────┴───────┘


## Action Space

We will only deal with environments with a finite number of discrete actions.

In that case the so-called Action Space (set of all possible actions) can be easily represented by an enum:

In [8]:
for nr, action in enumerate(ActionSpace, 1):
    print('action', nr, 'is', action)

action 1 is ActionSpace.Left
action 2 is ActionSpace.Right
action 3 is ActionSpace.Suck
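
Because the action space is a standard Python `Enum`, its members can also be counted and looked up by name or by value. The sketch below repeats the definition of `ActionSpace` only to be self-contained; whether an agent needs these lookups depends on how it represents its choices.

```python
from enum import Enum

ActionSpace = Enum('ActionSpace', 'Left Right Suck')  # same definition as above

print(len(ActionSpace))       # number of available actions
print(ActionSpace['Suck'])    # lookup by name
print(ActionSpace(1))         # lookup by value (the functional API numbers from 1)
print(ActionSpace.Right.name, ActionSpace.Right.value)
```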


In [17]:
env.step(ActionSpace.Suck)
env.render()

┌───────┬───────┐
│ A     │ B     │
│   ╱╲  │       │
│   --  │       │
├───────┼───────┤
│   0.0 │   0.9 │
└───────┴───────┘


# 2. Random Agent

In the cell below, you can see the effect of an agent that chooses an arbitrary action, regardless of the state.



In [24]:
def select_random_action(state):
    # action is random choice from all actions in Action Space
    action = choice([a for a in ActionSpace])
    return action

# create an environment and bring it into a random start state (reset() overrides the constructor arguments)
env = VacuumCleanerEnvironment(Room.A, 0.4, 0.5)
state = env.reset()

total_reward = 0.0
done = False
while not done:
    next_action = select_random_action(state)
    state, done, reward, info = env.step(next_action)
    total_reward += reward
    print('action: {0}\tstate: ({1}, {2:.1f}, {3:.1f}), reward: {4:.1f}\n'
          .format(next_action, state['room'], state['d_A'], state['d_B'], reward))
print('episode done. total reward:', total_reward)

action: ActionSpace.Right	state: (Room.B, 0.0, 0.4), reward: -1.0

action: ActionSpace.Suck	state: (Room.B, 0.0, 0.3), reward: -1.0

action: ActionSpace.Right	state: (Room.B, 0.0, 0.3), reward: -1.0

action: ActionSpace.Left	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Left	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Suck	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Suck	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Left	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Suck	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Suck	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Suck	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Suck	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Left	state: (Room.A, 0.0, 0.3), reward: -1.0

action: ActionSpace.Right	state: (Room.B, 0.0, 0.3), reward: -1.0

action: ActionSpace.Left	state: (Room.A, 0.0, 0.3), reward: -1.0

action:

If you run the cell above a couple of times, you can see that the environment always ends up in a state with no dust in rooms A and B, but the agent is not very 'efficient'! It is natural to interpret the negative reward of -1 at each step as a penalty for 'wasting time' on 'useless' intermediate steps. Let's see if we can do better using Utility Theory.
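
To put a number on that inefficiency, the sketch below estimates the average episode length of a random agent. It is a simplified stand-in for the environment above, not the class itself: actions and rooms are plain strings, dust is counted in whole units of 0.1, and the function name `random_episode_length` and the start state are assumptions made for this illustration.

```python
from random import Random
from statistics import mean

ACTIONS = ('Left', 'Right', 'Suck')

def random_episode_length(room, d_a, d_b, rng):
    """Number of steps a uniformly random agent needs to remove all dust."""
    steps = 0
    while d_a > 0 or d_b > 0:
        action = rng.choice(ACTIONS)
        if action == 'Left':
            room = 'A'
        elif action == 'Right':
            room = 'B'
        elif room == 'A':
            d_a = max(0, d_a - 1)
        else:
            d_b = max(0, d_b - 1)
        steps += 1
    return steps

rng = Random(0)  # fixed seed so that the run is repeatable
lengths = [random_episode_length('A', 4, 5, rng) for _ in range(1000)]

# starting in room A with 4 and 5 units of dust, the shortest episode is
# 4 x Suck, 1 x Right, 5 x Suck = 10 steps
print('shortest possible episode: 10 steps')
print('random agent, mean episode length:', mean(lengths))
```

Since every step costs a reward of -1, the mean episode length is also the mean penalty the random agent collects.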

# 3. Decisions based on Utility

Rational Agents have an order of preference for states that is expressed in a Utility Function.

As an example: let the Utility of state $s = (r, d_A, d_B)$ be given by:

$U(s) = - d_A/2 - d_B$ if $r = RoomA$

$U(s) = - d_A - d_B/2$ if $r = RoomB$

According to Utility Theory, a Rational Agent should choose the action with the highest expected utility. Because this environment is deterministic, the expected utility of an action is simply the utility of the new state after the action.
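
As a worked example, take the state $s = (RoomA, 0.5, 0.5)$, where each Suck removes 0.1 of dust as in the environment above. The three actions lead to the following new states $s'$:

Left: $s' = (RoomA, 0.5, 0.5)$, so $U(s') = - 0.5/2 - 0.5 = -0.75$

Right: $s' = (RoomB, 0.5, 0.5)$, so $U(s') = - 0.5 - 0.5/2 = -0.75$

Suck: $s' = (RoomA, 0.4, 0.5)$, so $U(s') = - 0.4/2 - 0.5 = -0.70$

Suck yields the highest utility, so it is the action selected in this state.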

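In the code example below the agent tries out every action to observe the new state (and calculate its utility). It does so with get_new_state(), which computes the new state without changing the environment. Trying out actions with step() instead would need a fresh copy of the environment for every action.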

In [22]:
from copy import deepcopy 

def utility(state):
    if state['room'] == Room.A:
        return - state['d_A']/2 - state['d_B'] 
    else:
        return - state['d_A'] - state['d_B']/2

def select_action_with_max_utility(env, state):
    # state is not used directly: env.get_new_state() starts from
    # the current state that the environment keeps internally
    max_utility = float('-inf')
    best_action = None
    for action in ActionSpace:
        new_state = env.get_new_state(action)
        new_utility = utility(new_state)
        if new_utility > max_utility:
            best_action = action
            max_utility = new_utility
    return best_action, max_utility

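# create an environment in a specific state (use state = env.reset() for a random one)
env = VacuumCleanerEnvironment(Room.A, 0.5, 0.5)
state = {'room': Room.A, 'd_A': 0.5, 'd_B': 0.5}  # mirrors the constructor arguments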

total_reward = 0.0
done = False
while not done:
    best_action, new_utility = select_action_with_max_utility(env, state)
    state, done, reward, info = env.step(best_action)
    total_reward += reward
    print('action: {0}\tstate: ({1}, {2:.1f}, {3:.1f}), utility: {4:.2f}\n'
          .format(best_action, state['room'], state['d_A'], state['d_B'], new_utility))
print('episode done. total reward:', total_reward)

action: ActionSpace.Suck	state: (Room.A, 0.4, 0.5), utility: -0.70

action: ActionSpace.Right	state: (Room.B, 0.4, 0.5), utility: -0.65

action: ActionSpace.Suck	state: (Room.B, 0.4, 0.4), utility: -0.60

action: ActionSpace.Suck	state: (Room.B, 0.4, 0.3), utility: -0.55

action: ActionSpace.Left	state: (Room.A, 0.4, 0.3), utility: -0.50

action: ActionSpace.Suck	state: (Room.A, 0.3, 0.3), utility: -0.45

action: ActionSpace.Suck	state: (Room.A, 0.2, 0.3), utility: -0.40

action: ActionSpace.Right	state: (Room.B, 0.2, 0.3), utility: -0.35

action: ActionSpace.Suck	state: (Room.B, 0.2, 0.2), utility: -0.30

action: ActionSpace.Suck	state: (Room.B, 0.2, 0.1), utility: -0.25

action: ActionSpace.Left	state: (Room.A, 0.2, 0.1), utility: -0.20

action: ActionSpace.Suck	state: (Room.A, 0.1, 0.1), utility: -0.15

action: ActionSpace.Suck	state: (Room.A, 0.0, 0.1), utility: -0.10

action: ActionSpace.Right	state: (Room.B, 0.0, 0.1), utility: -0.05

action: ActionSpace.Suck	state: (Room.B, 0.0,

# 4. Decisions based on Reward

Although decisions based on Utility can lead to efficient behavior, they have drawbacks:
- it is not always simple to find a Utility Function that yields all aspects of desired agent behavior
- often the utility depends on a sequence of actions rather than a single action; this cannot be handled by the decision process of part 3.
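
The second drawback can be made concrete with a small sketch. It compares the one-step utility agent of part 3 with a simple plan that cleans one room completely before moving on. This is an illustration under assumptions, not the notebook's environment: states are tuples, dust is counted in whole units of 0.1, and the helper names `successor`, `greedy_steps` and `room_by_room_steps` are made up for this example.

```python
from enum import Enum

ActionSpace = Enum('ActionSpace', 'Left Right Suck')
Room = Enum('Room', 'A B')

def successor(state, action):
    """Deterministic transition on a (room, d_A, d_B) tuple."""
    room, d_a, d_b = state
    if action == ActionSpace.Left:
        room = Room.A
    elif action == ActionSpace.Right:
        room = Room.B
    elif room == Room.A:
        d_a = max(0, d_a - 1)
    else:
        d_b = max(0, d_b - 1)
    return room, d_a, d_b

def utility(state):
    """Same utility function as in part 3, on dust units of 0.1."""
    room, d_a, d_b = state
    return -d_a / 2 - d_b if room == Room.A else -d_a - d_b / 2

def greedy_steps(state):
    """Steps needed when always taking the action with the best next-state utility."""
    steps = 0
    while state[1] > 0 or state[2] > 0:
        # max() keeps the first best action, like the strict '>' test in part 3
        action = max(ActionSpace, key=lambda a: utility(successor(state, a)))
        state = successor(state, action)
        steps += 1
    return steps

def room_by_room_steps(state):
    """Steps needed when cleaning the current room first, then the other room."""
    room, d_a, d_b = state
    here, there = (d_a, d_b) if room == Room.A else (d_b, d_a)
    return here + (1 + there if there > 0 else 0)

start = (Room.A, 5, 5)
print('one-step utility agent:', greedy_steps(start), 'steps')
print('room-by-room plan:     ', room_by_room_steps(start), 'steps')
```

From this start state the one-step agent needs 15 steps and the room-by-room plan needs 11. The difference comes from the extra moves between rooms, which only show up when the utility of a whole sequence of actions is considered.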

TBD

## 4.a. Value Iteration

TBD

## 4.b. Policy Iteration

TBD