Short Corridor with Switched Actions
---
Consider the small corridor grid world shown inset in the graph below. The reward is -1 per step, as usual. In each of the three nonterminal states there are only two actions, right and left. These actions have their usual consequences in the first and third states (left causes no movement in the first state), but in the second state they are reversed, so that right moves to the left and left moves to the right. The problem is difficult because all the states appear identical under the function approximation. In particular, we define `x(s, right) = [1, 0]` and `x(s, left) = [0, 1]`, for all s.

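The performance measure is the value of the start state $S$ under the parameterized policy $\pi_\theta$:

$$J(\theta) = V_{\pi_\theta}(S)$$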

<img src="corridor.png" width="600">
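
Because every state shares the same features, the policy reduces to a single number, the probability of choosing right. As a rough illustration (not part of the original notebook), the start-state value for a fixed probability can be obtained by solving the three Bellman equations as a linear system. The function name `start_state_value` and the probability grid are hypothetical, and the sketch assumes an undiscounted return.

```python
import numpy as np

def start_state_value(p_right):
    """Value of state 0 when 'right' is chosen with probability p_right in every state."""
    p, q = p_right, 1.0 - p_right
    # Transition probabilities among the nonterminal states 0, 1, 2.
    # The terminal state has value 0, so it is left out of the matrix.
    P = np.array([
        [q,   p,   0.0],  # state 0: left stays put, right moves to 1
        [p,   0.0, q],    # state 1 (reversed): right moves to 0, left moves to 2
        [0.0, q,   0.0],  # state 2: left moves to 1, right terminates
    ])
    # Bellman equations v = -1 + P v, rearranged to (I - P) v = -1
    v = np.linalg.solve(np.eye(3) - P, -np.ones(3))
    return v[0]

# Scan a grid of probabilities and pick the one with the highest start-state value
ps = np.linspace(0.05, 0.95, 181)
values = np.array([start_state_value(p) for p in ps])
best_p = ps[np.argmax(values)]
```

Under these assumptions the value peaks at a probability of roughly 0.59 for right, where it is about -11.6, and it falls off sharply as the policy approaches either deterministic choice.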

MC Policy Gradient
---
<img src="mc_policy_gradient.png" width="600">
<img src="h.png" width="300">
<img src="policy.png" width="400">
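
For reference, the update that the code below implements can be written out as follows. This is the standard REINFORCE form with linear action preferences and a softmax policy; it is an assumption, not something verified here, that the images above show the same equations.

$$h(s, a, \theta) = \theta^\top x(s, a), \qquad \pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}$$

$$\nabla \ln \pi(a \mid s, \theta) = x(s, a) - \sum_b \pi(b \mid s, \theta)\, x(s, b)$$

$$\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla \ln \pi(A_t \mid S_t, \theta), \qquad G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$$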

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [5]:
class ShortCorridor:
    def __init__(self, alpha=0.2, gamma=0.8):
        self.actions = ["left", "right"]
        self.x = np.array([[0, 1], [1, 0]])  # columns: x(s, left) = [0, 1], x(s, right) = [1, 0]
        self.theta = np.array([-1.47, 1.47])
        self.state = 0  # initial state 0
        self.gamma = gamma
        self.alpha = alpha
        
    def softmax(self, vector):
        # subtract the max before exponentiating for numerical stability
        e = np.exp(vector - np.max(vector))
        return e/np.sum(e)
        
    def chooseAction(self):
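        h = np.dot(self.theta, self.x)  # action preferences [h_left, h_right]
        prob = self.softmax(h)  # [P(left), P(right)], identical in every state
        
        # keep the less likely action at probability >= epsilon so exploration never stops
        # (the overwrite below is only valid for two actions)
        imin = np.argmin(prob)
        epsilon = 0.05
        
        if prob[imin] < epsilon:
            prob[:] = 1 - epsilon
            prob[imin] = epsilon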
        
        action = np.random.choice(self.actions, p=prob)
        return action
    
    def takeAction(self, action):
        if self.state == 0:
            nxtState = 0 if action == "left" else 1
        elif self.state == 1:
            nxtState = 2 if action == "left" else 0  # reversed
        elif self.state == 2:
            nxtState = 1 if action == "left" else 3
        else:
            nxtState = 2 if action == "left" else 3
        return nxtState
    
    def giveReward(self):
        if self.state == 3:
            return 0
        return -1
    
    def reset(self):
        self.state = 0
    
    def run(self, rounds=100):
        actions = []
        rewards = []
        for i in range(1, rounds+1):
            reward_sum = 0
            while True:
                action = self.chooseAction()
                nxtState = self.takeAction(action)
                reward = self.giveReward()
                reward_sum += reward

                actions.append(action)
                rewards.append(reward)
                
                self.state = nxtState
                # game end
                if self.state == 3:
                    T = len(rewards)
                    for t in range(T):
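                        # return from step t; rewards[k] is the reward that followed actions[k],
                        # so the sum starts at k = t
                        G = 0
                        for k in range(t, T):
                            G += np.power(self.gamma, k-t)*rewards[k]
                
                        j = 1 if actions[t] == "right" else 0  # column of x for the action taken
                        h = np.dot(self.theta, self.x)
                        prob = self.softmax(h)
                        # gradient of ln pi(a|s) for a softmax over linear preferences
                        grad = self.x[:, j] - np.dot(self.x, prob)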

                        self.theta += self.alpha*np.power(self.gamma, t)*G*grad
                    # reset 
                    self.state = 0
                    actions = []
                    rewards = []
                    
                    if i % 50 == 0:
                        print("round {}: current prob {} reward {}".format(i, prob, reward_sum))
                    break
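
The gradient expression `self.x[:, j] - np.dot(self.x, prob)` used in the update loop above can be sanity-checked against a finite-difference estimate of the log-probability gradient. The following is a minimal standalone sketch, not part of the original notebook; the helper names `log_pi`, `analytic_grad`, and `numeric_grad` are hypothetical.

```python
import numpy as np

# Columns are the feature vectors: x(s, left) = [0, 1], x(s, right) = [1, 0]
x = np.array([[0.0, 1.0], [1.0, 0.0]])

def log_pi(theta, j):
    """Log-probability of action j under a softmax over linear preferences."""
    h = theta @ x
    h = h - h.max()  # shift for numerical stability
    return h[j] - np.log(np.exp(h).sum())

def analytic_grad(theta, j):
    """x(s, a) minus the policy-weighted average of the feature vectors."""
    h = theta @ x
    prob = np.exp(h - h.max())
    prob /= prob.sum()
    return x[:, j] - x @ prob

def numeric_grad(theta, j, eps=1e-6):
    """Central finite-difference approximation of the gradient of log_pi."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (log_pi(theta + step, j) - log_pi(theta - step, j)) / (2 * eps)
    return grad

theta = np.array([-1.47, 1.47])
# Largest absolute disagreement between the two gradients over both actions
max_err = max(
    np.abs(analytic_grad(theta, j) - numeric_grad(theta, j)).max()
    for j in (0, 1)
)
```

If the two gradients agree, `max_err` should be tiny, limited only by finite-difference and rounding error.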

In [14]:
sc = ShortCorridor(alpha=2e-4, gamma=1)

In [16]:
sc.run(1000)

round 50: current prob [0.42473803 0.57526197] reward -15
round 100: current prob [0.41771671 0.58228329] reward -8
round 150: current prob [0.40590189 0.59409811] reward -42
round 200: current prob [0.40479147 0.59520853] reward -7
round 250: current prob [0.43495446 0.56504554] reward -5
round 300: current prob [0.39796873 0.60203127] reward -5
round 350: current prob [0.40440231 0.59559769] reward -7
round 400: current prob [0.40185045 0.59814955] reward -7
round 450: current prob [0.38858673 0.61141327] reward -3
round 500: current prob [0.37848939 0.62151061] reward -26
round 550: current prob [0.39928466 0.60071534] reward -23
round 600: current prob [0.36995291 0.63004709] reward -7
round 650: current prob [0.35917078 0.64082922] reward -5
round 700: current prob [0.36502733 0.63497267] reward -3
round 750: current prob [0.37528428 0.62471572] reward -10
round 800: current prob [0.41668482 0.58331518] reward -9
round 850: current prob [0.43119919 0.56880081] reward -8
round 900:

In [18]:
h = np.dot(sc.theta, sc.x)
sc.softmax(h)  # [P(left), P(right)], the same in every state

array([0.41148944, 0.58851056])
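
To put the learned probability of about 0.59 for right in context, one could estimate the average undiscounted return of a fixed policy by simulation. The sketch below is independent of the `ShortCorridor` class; the function name `average_return`, the episode count, and the seed are all hypothetical choices.

```python
import numpy as np

def average_return(p_right, episodes=5000, seed=0):
    """Monte Carlo estimate of the start-state return for a fixed P(right)."""
    rng = np.random.default_rng(seed)
    total = 0
    for _ in range(episodes):
        state = 0
        while state != 3:  # state 3 is terminal
            chose_right = rng.random() < p_right
            # in state 1 the actions are reversed
            moves_right = (not chose_right) if state == 1 else chose_right
            state = state + 1 if moves_right else max(state - 1, 0)
            total -= 1  # reward of -1 on every step
    return total / episodes

estimate = average_return(0.5885)
```

With a probability this close to 0.59, the estimate should come out close to -11.7, up to sampling noise.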