# Linear World

In this notebook we implement a very simple example of the agent-environment interface used in reinforcement learning, called "linear world".

The world consists of $n$ places in a row, labelled $0, 1, \dots, n-1$, and the state of the world consists of the position where the player is located. The initial state has the player in the middle of the world:
- The (empty) world for $n=5$: `"_ _ _ _ _"`
- With the player in position $2$: `"_ _ X _ _"`

The actions that the agent can take are `LEFT` and `RIGHT`, each moving the player one place in the indicated direction. In the two outer positions ($0$ and $n-1$), both actions result in a step towards the inside:
- After action `RIGHT`: `"_ _ _ X _"`
- After action `RIGHT`: `"_ _ _ _ X"`
- After action `LEFT` or `RIGHT`: `"_ _ _ X _"`

The reward is $+1$ for an action that leaves the player in one of the outer positions, and $0$ otherwise. A possible sequence of events in this setting is:
- `"_ _ X _ _"` $S_0 = 2, A_0 = \text{"RIGHT"}$
- `"_ _ _ X _"` $S_1 = 3, A_1 = \text{"LEFT"}, R_1 = 0$
- `"_ _ X _ _"` $S_2 = 2, A_2 = \text{"LEFT"}, R_2 = 0$
- `"_ X _ _ _"` $S_3 = 1, A_3 = \text{"LEFT"}, R_3 = 0$
- `"X _ _ _ _"` $S_4 = 0, A_4 = \text{"RIGHT"}, R_4 = 1$
- `"_ X _ _ _"` $S_5 = 1, A_5 = \dots, R_5 = 0$
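The rules above can be written down as a single pure function from (state, action) to (next state, reward). The following is a minimal sketch, not the class implemented below: the names `transition` and `replay` are hypothetical, and the action encoding `LEFT = -1`, `RIGHT = 1` is an assumption chosen so that an action can be added to the position. It replays the example sequence for $n = 5$.

```python
# Assumed encoding of the two actions (hypothetical for this sketch)
LEFT, RIGHT = -1, 1


def transition(state, action, length):
    """Return (next_state, reward) for one step in a world with `length` places."""
    if action not in (LEFT, RIGHT):
        raise ValueError(f"unknown action: {action!r}")

    if state == 0:
        # Left edge: both actions lead one step inwards
        next_state = 1
    elif state == length - 1:
        # Right edge: both actions lead one step inwards
        next_state = length - 2
    else:
        next_state = state + action

    # Reward 1 if the player ends up in an outer position, else 0
    reward = 1 if next_state in (0, length - 1) else 0
    return next_state, reward


def replay(start, actions, length):
    """Apply a list of actions and collect the visited states and rewards."""
    states, rewards = [start], []
    state = start
    for action in actions:
        state, reward = transition(state, action, length)
        states.append(state)
        rewards.append(reward)
    return states, rewards


# The example sequence: S_0 = 2, then RIGHT, LEFT, LEFT, LEFT, RIGHT
states, rewards = replay(2, [RIGHT, LEFT, LEFT, LEFT, RIGHT], length=5)
print(states)   # [2, 3, 2, 1, 0, 1]   (S_0 ... S_5)
print(rewards)  # [0, 0, 0, 1, 0]      (R_1 ... R_5)
```

Keeping the transition free of hidden state makes it easy to check individual cases by hand, for example that both actions from position $0$ lead to position $1$.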




*<span style="color:red">Below, the parts indicated by `#??` need to be filled in!</span>*

In [4]:
import math

In [5]:
# We use the constants -1 and 1 to represent LEFT and RIGHT,
# so that an action can be added directly to the position
LEFT = -1
RIGHT = 1

In [6]:
class LinearWorld:
    def __init__(self, length):
        # Store length of world
        # The leading underscore marks the attribute as "private" by convention
        # (Python does not enforce this; users should simply not modify it)
        self._length = length
        
        # Initialize state of world in the middle
        self._state = length // 2
    
    def step(self, action):
        # Reject anything that is not a valid action, regardless of position
        if action not in (LEFT, RIGHT):
            raise ValueError(f'Unknown action: {action!r}')

        # Compute new state
        #?? (1. handle the outer two positions)
        if self._state == 0:
            self._state = 1
        elif self._state == self._length - 1:
            self._state = self._length - 2
        #?? (2. change position to the left/right)
        else:
            self._state = self._state + action

        # Compute reward
        if self._state in (0, self._length - 1):
            reward = 1
        else:
            reward = 0
        
        # Return state and reward
        return self._state, reward
    
    def reset(self):
        # Reset the position to the middle
        self._state = self._length//2
    
    def showWorld(self):
        # Print a representation of the linear world
        # (the string itself is built in .__str__() below)
        print(self)
    
    def __str__(self):
        # (!) Advanced concept:
        # Custom string-conversion (used e.g. by `print()`)
        
        # Start with "_" for every spot, then mark the player with "X"
        cells = ['_'] * self._length
        cells[self._state] = 'X'

        # Join the spots with spaces, e.g. "_ _ X _ _"
        return ' '.join(cells)


## Testing the linear world

First, we create an instance of `LinearWorld`, then we use the `.step()` method to perform actions ($A_t$) and observe the resulting state ($S_{t+1}$) and reward ($R_{t+1}$).

In [7]:
# Create a new instance of the LinearWorld class
lw = LinearWorld(5)

In [8]:
# Check the attributes `_length` and `_state` of the instance
print(lw._length)
print(lw._state)

5
2


In [9]:
# Make a step and assign the outcome to variables (state + reward)
state, reward = lw.step(RIGHT)

# Print the outcome
print(state)
print(reward)

3
0


We can use the method `.showWorld()` to visualize the events "graphically":

In [10]:
# Make a step
lw.step(RIGHT)

# Show the new state of the world
lw.showWorld()

_ _ _ _ X


In [11]:
lw = LinearWorld(5)
lw.showWorld()
lw.step(RIGHT)
lw.showWorld()
lw.step(RIGHT)
lw.showWorld()
lw.step(RIGHT)
lw.showWorld()

_ _ X _ _
_ _ _ X _
_ _ _ _ X
_ _ _ X _


## Two simple policies

Next, we implement two policies and see how they perform over a timespan of $T = 100$ steps:
- The random policy chooses an action uniformly at random
- The "right" policy always goes `RIGHT`
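One way to make the two policies explicit is to treat each as a function from the current state to an action and to share a single interaction loop between them. The sketch below is an illustration under assumptions, not the code used in the cells that follow: `MiniWorld` is a compact hypothetical stand-in for `LinearWorld`, the names `random_policy`, `right_policy`, `run_policy` and the `seed` parameter are made up for this example, and the standard-library `random` module is used only to keep the sketch free of dependencies.

```python
import random

# Assumed encoding of the two actions (hypothetical for this sketch)
LEFT, RIGHT = -1, 1


class MiniWorld:
    """Compact stand-in for LinearWorld, used only in this sketch."""

    def __init__(self, length):
        self.length = length
        self.state = length // 2  # start in the middle

    def step(self, action):
        if self.state == 0:
            self.state = 1
        elif self.state == self.length - 1:
            self.state = self.length - 2
        else:
            self.state += action
        reward = 1 if self.state in (0, self.length - 1) else 0
        return self.state, reward


def random_policy(state, rng):
    # Ignores the state and picks LEFT or RIGHT with equal probability
    return rng.choice((LEFT, RIGHT))


def right_policy(state, rng):
    # Ignores the state and the random generator, always goes RIGHT
    return RIGHT


def run_policy(policy, length, steps, seed=0):
    """Run `policy` for `steps` steps and return the total reward."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    world = MiniWorld(length)
    state, total = world.state, 0
    for _ in range(steps):
        state, reward = world.step(policy(state, rng))
        total += reward
    return total


print(run_policy(right_policy, length=7, steps=100))   # 49
print(run_policy(random_policy, length=7, steps=100))  # depends on the seed
```

Passing the policy as an argument keeps the loop identical for both cases, so any difference in total reward comes from the policy alone.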

In [12]:
# We use numpy to choose a random action
import numpy as np

In [13]:
# Number of steps
T = 100

In [14]:
# Run the random policy for T steps, update the total reward
lw = LinearWorld(7)
totalRandom = 0
for t in range(T):
    _, reward = lw.step(np.random.choice((LEFT, RIGHT)))
    lw.showWorld()
    totalRandom = totalRandom + reward

_ _ _ _ X _ _
_ _ _ X _ _ _
_ _ _ _ X _ _
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ X _ _
_ _ _ X _ _ _
_ _ X _ _ _ _
_ _ _ X _ _ _
_ _ _ _ X _ _
_ _ _ X _ _ _
_ _ _ _ X _ _
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ X _ _
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ X _ _
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ X _ _
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ X _ _
_ _ _ _ _ X _
_ _ _ _ X _ _
_ _ _ X _ _ _
_ _ _ _ X _ _
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ X _ _
_ _ _ X _ _ _
_ _ X _ _ _ _
_ X _ _ _ _ _
_ _ X _ _ _ _
_ _ _ X _ _ _
_ _ _ _ X _ _
_ _ _ X _ _ _
_ _ X _ _ _ _
_ X _ _ _ _ _
_ _ X _ _ _ _
_ X _ _ _ _ _
X _ _ _ _ _ _
_ X _ _ _ _ _
X _ _ _ _ _ _
_ X _ _ _ _ _
_ _ X _ _ _ _
_ X _ _ _ _ _
_ _ X _ _ _ _
_ X _ _ _ _ _
X _ _ _ _ _ _
_ X _ _ _ _ _
_ _ X _ _ _ _
_ X _ _ _ _ _
X _ _ _ _ _ _
_ X _ _ _ _ _
_ _ X _ _ _ _
_ X _ _ _ _ _
X _ _ _ _ _ _
_ X _ _ _ _ _
_ _ X _ _ _ _
_ _ _ X _ _ _
_ _ _ _ X _ _
[output truncated]

In [15]:
# Check the total rewards we got
print(totalRandom)

16


In [16]:
# Run the "right" policy for T steps, update the total reward
lw = LinearWorld(7)
totalRight = 0
for t in range(T):
    _, reward = lw.step(RIGHT)
    lw.showWorld()
    totalRight = totalRight + reward

_ _ _ _ X _ _
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
_ _ _ _ _ X _
_ _ _ _ _ _ X
[output truncated]

In [17]:
# Check the total rewards we got
print(totalRight)

49
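The total of $49$ can also be derived by counting. Starting in the middle of a world with $n = 7$ places, the "right" policy needs $d = (n-1) - \lfloor n/2 \rfloor = 3$ steps to reach the right edge, and from then on it alternates between positions $n-2$ and $n-1$, collecting a reward on every second step. For $T \ge d$ this gives $\lfloor (T-d)/2 \rfloor + 1$ rewards, which is $\lfloor 97/2 \rfloor + 1 = 49$ for $T = 100$. The sketch below checks this count against a direct simulation; the function names are hypothetical, and the formula assumes $n \ge 3$ so that the step back inwards does not land on an outer position.

```python
# Assumed encoding of the RIGHT action (hypothetical for this sketch)
RIGHT = 1


def right_policy_total(length, steps):
    """Closed-form total reward of the always-RIGHT policy (assumes length >= 3)."""
    first = (length - 1) - length // 2  # steps needed to reach the right edge
    if steps < first:
        return 0
    # One reward at step `first`, then one more on every second step
    return (steps - first) // 2 + 1


def simulate_right_policy(length, steps):
    """Total reward of the always-RIGHT policy, obtained by simulating each step."""
    state, total = length // 2, 0
    for _ in range(steps):
        if state == 0:
            state = 1
        elif state == length - 1:
            state = length - 2
        else:
            state += RIGHT
        total += 1 if state in (0, length - 1) else 0
    return total


print(right_policy_total(7, 100))     # 49
print(simulate_right_policy(7, 100))  # 49
```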
