**1.1 Chess:**_  
  
**environment** - 8x8 checker board  
**agent** - the individual moving the pawns   
**states** - current board layout    
**transitions/actions** - moving pawns (in different ways for each kind of pawn), kicking the others pawns off, gaining a pawn via crossing to the other side with a pawn  
**reward** - checkmate on the other players king   


_**1.2 LunarLander:**_  
  
**environment** - two-dimensional space with a spaceship   
**agent** - individual controlling the ship  
**state** - point at which the rocket currently is      
**transitions/actions** - do nothing, fire left engine, fire main engine, fire right engine  
**reward** - successful landing (from the LunarLander page: Reward for moving from the top of the screen to the landing pad and coming to rest is about 100-140 points. If the lander moves away from the landing pad, it loses reward. If the lander crashes, it receives an additional -100 points. If it comes to rest, it receives an additional +100 points. Each leg with ground contact is +10 points. Firing the main engine is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200 points.)

_**1.3 Model-based RL**_  
  
- **What are environment dynamics?:**  
    The environment dynamics of a problem are the reward function and the transition function. For the reward function we need to figure out what the goal state is (and in model-based RL there usually is one) or otherwise what is going to increase our reward. In classical Gridworld, for example, that would be the tile granting positive points and acting as end. Let's say the agent is playing as adventurer and gains the reward once the holy grail is reached. The transition function would then constitute of the moves and actions needed to reach the end tile. For the adventurer it might be to escape traps or defeat enemies before they are able to reach the holy grail at the end. Another example would be the game of tictactoe. There you have the reward of winning the game against your opponent. To achieve that you have take the action of putting your assigned symbol into the free squares, aiming to have three of them within a row, column or in the diagonal next to each other. 
    ** **
- **Can we use that?**
    Unfortunately, the environment dynamics are oftentimes not accessible, especially not to determine the optimal sequence of actions to gain the reward. In the adventurer example we cannot determine *the* optimal move, as the adventurer may backtrack or run into deadends from time to time. More importantly the enemies' actions may be random and unpredictable, which would make it impossible to know the optimal sequence of actions. In the TicTacToe example, we have a similar problem. Since we do not know the moves of thge opponent beforehanfs, we cannot ascribe the optimal policy. This is what makes it hard to use the environment dynamics within Reinforcement Learning.

_**2 & 3**_

In [6]:
import numpy as np

In [7]:
class GridWorld:
    
    """Gridworld as MDP"""
    def __init__(self, m, n):
        self.m = m
        self.n = n
        self.grid = np.zeros(shape = (m,n))
        self.end = False

        # possible actions
        self.actions = ["up", "down", "right", "left"]

        #startpoint of the agent
        self.grid[1][1] = 5
        
        #goal position
        self.grid[4][3] = 7
        
        # icy tiles
        self.grid[2][2] = 1
        self.grid[3][3] = 1
        self.grid[4][4] = 1
        self.grid[0][1] = 1
        self.grid[3][5] = 1

        self.reward = 0

        # wall tiles
        for i in range(self.m):
            self.grid[i][0] = 9
            self.grid[i][self.n-1] = 9
        for j in range(self.n):
            self.grid[0][j] = 9
            self.grid[self.m-1][j] = 9
        
        # other wall tiles
        self.grid[1][2] = 9
        self.grid[1][3] = 9
        self.grid[5][4] = 9
    
    def __getitem__(self, grid):
        print(self.grid)
        return self.grid

    # a function that returns the four tiles that are visible to the agent
    def visible_tiles(self):
        x,y = self.current_pos()
        visible_tiles = []
        
        for i in self.actions:
            next, next_two, con, con2 = self.action_parameter(i)
            if con == True:
                visible_tiles.append(self.grid[next])
        return visible_tiles

                
     # a function that creates probabilities for the agent to follow and that has no zeros in it that uses only the visible tiles
    def probabilities(self):
        visible_tiles = self.visible_tiles()
        probabilities = []
        for i in range(len(visible_tiles)):
            # if the agent is seeing an icy tile, the probability of moving in the direction of the edge is 0.25
            if visible_tiles[i] == 1:
                probabilities.append(0.25)
                self.reward -= 0.1
            elif visible_tiles[i] == 7:
                probabilities.append(1)
                self.reward += 10
            elif visible_tiles[i] == 9:
                probabilities.append(0.1)
                self.reward -= 0.1
            else:
                probabilities.append(0.5)
                self.reward -= 0.1
        return probabilities, self.reward
    
    
    #state (tile on which the agent is positioned)
    def current_pos(self):
        found = False
        for x in range(self.m):
            for y in range(self.n):
                if self.grid[x,y] == 5:
                    # print the x and y coordinates of the agent
                    found = True
                    break
            if found == True:
                break
        return x,y

    # return random action and activates it
    def random_action(self):
        action = np.random.choice(self.actions)
        self.move(action)
        return action
       
    
    def action_parameter(self, action):
        x, y = self.current_pos()
        con = False
        con2 = False
        if action == "up":
            next = x-1,y
            next_two = x-2,y
            if x-1 >= 0:
                con = True
            if x-2 >= 0:
                con2 = True

        elif action == "down":
            next = x+1,y
            next_two = x+2,y
            if x+1 >= 0:
                con = True
            if x+2 <= self.m-1:
                con2 = True

        elif action == "right":
            next = x,y+1
            next_two = x,y+2
            if y+1 <= self.n-1:
                con = True
            if y+2 <= self.n-1:
                con2 = True
        elif action == "left":
            next = x,y-1
            next_two = x,y-2
            if y-1 >= 0:
                con = True
            if y-2 >= 0:
                con2 = True
        return next, next_two, con, con2
            
    def move(self, action):
        next, next_two, con, con2 = self.action_parameter(action)
        x, y = self.current_pos()
        self.grid[x,y] = 0
        # if the agent reaches the goal, the game ends
        if self.grid[next] == 7:
            self.grid[next] = 5
            self.end = True
        # if the agent reaches an icy tile, it moves two steps left
        elif self.grid[next] == 1 and con2 == True:
            if self.grid[next_two] == 7:
                self.grid[next_two] == 5
                self.end = True
            elif self.grid[next_two] == 9:
                self.grid[next] = 5
            else:
                self.grid[next_two] = 5            
        # if the agent reaches a wall, it cannot move
        elif self.grid[next] == 9:
            self.grid[x,y] = 5
            #print("Wall! Try again.")
        elif self.grid[next] == 0:
            self.grid[next] = 5
  

In [8]:

def agent():
    reward = 0
    # probability of the agent to follow the probabilities_no_zeros function
    if np.random.uniform(0,1) < 0.5:
        probs, reward = world.probabilities()
        # print(probs)
        # normalize the probabilities
        probs = [float(i)/sum(probs) for i in probs]

        # choose one of the 4 actions according to the probabilities
        action = np.random.choice(world.actions, p=probs)
        world.move(action)

        return reward
        
    # probability of the agent to follow the random_action function
    else:
        world.random_action()
        # print("random action")
        return reward

# Evaluate the policy

• Sample at least 1000 episodes of your agent interacting with your self-built
GridWorld

• For all states s, which have been reached at least once in these episodes,
calculate a MC-estimation of Vπ(s) of this state.

In [17]:

# sample 1000 episodes and calculate a Monte Carlo estimate of the value function for each state
world = GridWorld(6,6)
def mc(n_episodes):    
    # initialize the value function
    V = np.zeros((world.m, world.n))
    # initialize the number of times each state is visited
    N = np.zeros((world.m, world.n))
    
    for i in range(n_episodes):
        # initialize the state
        world.__init__(6,6)
        # initialize the list of states and rewards
        states = []
        rewards = []
        while world.end == False:
            # append the current state and reward to the lists
            rewards.append(agent())
            states.append(world.current_pos())
        # reverse the lists
        states.reverse()
        rewards.reverse()
        
        # initialize the return
        G = 0
        for j in range(len(states)):
            # update the return
            G = rewards[j] + G
            # update the number of times each state is visited
            N[states[j]] += 1
            # update the value function
            V[states[j]] += (G - V[states[j]])/N[states[j]]
    return V, N, rewards

        


In [18]:
mc(1000)


  0%|          | 0/1000 [00:00<?, ?it/s]

(array([[  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        , -31.06789734,   0.        ,   0.        ,
         -69.24022787,   0.        ],
        [  0.        , -28.25534905, -27.29275362, -28.6806962 ,
         -46.64663677,   0.        ],
        [  0.        , -14.40022936,  -6.64321678,   0.        ,
         -17.89804104,   0.        ],
        [  0.        ,  13.67972028,  24.48336466,   6.60523466,
          25.7908642 ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,  -1.19035874]]),
 array([[   0.,    0.,    0.,    0.,    0.,    0.],
        [   0., 4442.,    0.,    0., 1141.,    0.],
        [   0., 4412.,  690., 1896., 1784.,    0.],
        [   0., 2616., 1430.,    0., 1072.,    0.],
        [   0., 1430., 1064.,  554.,  405.,    0.],
        [   0.,    0.,    0.,    0.,    0.,  446.]]),
 [-4.4,
  0,
  0,
  0,
  -4.000000000000002,
  0,
  0,
  0,
  -3.6