Cart and Pole attempt 2

There are two ways to solve the Cart and Pole environment:

1.) Problem-specific solution:
    Uses specific observations of the environment, i.e. cart position, cart velocity, pole angle and pole angular velocity.
    These observations represent the state, and Q-values are calculated from them.

2.) Generalised solution:
    Takes only the pixels of the screen as input. (DeepMind used 4 consecutive frames as input, so that the agent has a temporal understanding of the environment.)
    The screen input is the state, which is used to estimate Q-values.
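
For the problem-specific approach (1), the four observations are continuous, so a Q table needs them mapped to a finite set of states first. Below is a minimal sketch of one way to do that by binning each component. The clipping bounds, the bin counts and the `discretize` helper are assumptions made for illustration; they are not taken from this notebook or read from the environment.

```python
import numpy as np

# Assumed clipping bounds for (cart position, cart velocity,
# pole angle, pole angular velocity). Illustrative values only.
LOW = np.array([-2.4, -3.0, -0.21, -3.0])
HIGH = np.array([2.4, 3.0, 0.21, 3.0])
BINS = np.array([6, 6, 12, 12])  # assumed number of bins per component


def discretize(observation):
    """Map a continuous 4-component observation to a tuple of bin indices."""
    obs = np.clip(observation, LOW, HIGH)
    # Scale each component to [0, 1], then to a bin index in [0, BINS - 1].
    ratio = (obs - LOW) / (HIGH - LOW)
    idx = np.minimum((ratio * BINS).astype(int), BINS - 1)
    return tuple(int(i) for i in idx)


# A Q table indexed by the discretized state plus the action
# (two actions assumed: push left, push right).
q_table = np.zeros(tuple(BINS) + (2,))
state = discretize([0.0, 0.5, 0.02, -1.2])
print(state, q_table[state].shape)
```

Finer bins give a more precise state but a larger table that takes longer to fill in, so the bin counts are a trade-off rather than fixed values.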
    

Bellman Equation Derivation

$$Q(s_t, a_t) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$

The Q value for a state-action pair is given by the sum of all future rewards in subsequent states. We are less certain about rewards further in the future, so a discount rate $\gamma$ (between 0 and 1) is applied to them.

$$Q(s_{t+1}, a_{t+1}) = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots$$

The Q value for the next state-action pair has the same form, shifted by one step. Therefore $Q(s_t, a_t)$ can be written recursively:

$$Q(s_t, a_t) = R_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1})$$

This can be simplified further. We assume the action taken at $t+1$ is the best action available in state $s_{t+1}$:

$$a_{t+1} = \arg\max_{a} Q(s_{t+1}, a) \quad\Rightarrow\quad Q(s_{t+1}, a_{t+1}) = \max_{a} Q(s_{t+1}, a)$$

Therefore, our final equation is:

$$Q(s_t, a_t) = R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)$$
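
The recursive step in the derivation can be checked numerically. The sketch below uses a made-up reward sequence and an assumed discount rate of 0.9, and only illustrates the "reward plus discounted remainder" identity for a fixed sequence of rewards; it does not involve the max over actions.

```python
gamma = 0.9  # assumed discount rate, chosen only for this example
rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical rewards R_{t+1}, R_{t+2}, ...


def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the k-th one by gamma ** k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


g_t = discounted_return(rewards, gamma)         # return from step t
g_next = discounted_return(rewards[1:], gamma)  # return from step t+1

# Recursive form: G_t = R_{t+1} + gamma * G_{t+1}
print(g_t, rewards[0] + gamma * g_next)
```

Both printed numbers agree (up to floating-point rounding), which is the identity the Bellman equation builds on.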


Learning Q values

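We want all values in our Q table to satisfy the Bellman equation. Since we do not know the Q values in advance, they must be learned iteratively, through small updates, as we learn the environment:

$$q(s, a) \leftarrow q(s, a) + \alpha \, (\text{target} - q(s, a))$$

where $\alpha$ is the learning rate and the target is the right-hand side of the Bellman equation, $R_{t+1} + \gamma \max_{a'} q(s_{t+1}, a')$, with $\gamma$ the discount rate.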
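
As a worked example of a single update, the sketch below applies the rule to one table entry. The current estimate, the next-state Q values and the reward are hypothetical numbers; the learning rate and discount rate match the `QAgent` defaults used later in this notebook.

```python
alpha = 0.1   # learning rate
gamma = 0.99  # discount rate

q_sa = 0.5                     # hypothetical current estimate q(s, a)
q_next = [0.2, 0.8, 0.4, 0.1]  # hypothetical Q values of the next state
reward = 0.0                   # hypothetical reward for this step

# Bellman target: reward plus discounted best next value
target = reward + gamma * max(q_next)

# Move the estimate a fraction alpha of the way towards the target
q_sa_new = q_sa + alpha * (target - q_sa)

# Algebraically equivalent "blend" form, as written in QAgent.train
q_blend = (1 - alpha) * q_sa + alpha * target

print(target, q_sa_new, q_blend)
```

The two forms differ only in how the terms are grouped, so they give the same new estimate.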



In [3]:
import gym
import random
import numpy as np
import time
from IPython.display import clear_output

In [4]:
cart_pole = "CartPole-v1"
mountain_car = "MountainCar-v0"
mountain_car_cont = "MountainCarContinuous-v0"
acrobot = "Acrobot-v1"
pendulum = "Pendulum-v0"
frozen_lake = "FrozenLake-v0"
env = gym.make(frozen_lake)
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
type(env.action_space)

Observation space: Discrete(16)
Action space: Discrete(4)


gym.spaces.discrete.Discrete

In [5]:
class Agent():
    def __init__(self, env):
        # Discrete action spaces expose .n; continuous (Box) spaces expose low/high/shape
        self.is_discrete = isinstance(env.action_space, gym.spaces.Discrete)
        
        if self.is_discrete:
            self.action_size = env.action_space.n
            print("Action size:", self.action_size)
        else:
            self.action_low = env.action_space.low
            self.action_high = env.action_space.high
            self.action_shape = env.action_space.shape
            print("Action range:", self.action_low, self.action_high)
        
    def get_action(self, state):
        if self.is_discrete:
            action = random.choice(range(self.action_size))
        else:
            action = np.random.uniform(self.action_low,
                                       self.action_high,
                                       self.action_shape)
        return action

In [25]:
class QAgent(Agent):
    def __init__(self,
                 env,
                 discount_rate=0.99,
                 learning_rate=0.1,
                 min_exploration_rate=0.01,
                 max_exploration_rate=1,
                 exploration_decay_rate=0.001):
        super().__init__(env)
        self.state_size = env.observation_space.n
        print("state size:", self.state_size)
        
        # Start at the maximum exploration rate rather than a hard-coded 1
        self.eps = max_exploration_rate
        self.discount_rate = discount_rate
        self.learning_rate = learning_rate
        self.min_exploration_rate = min_exploration_rate
        self.max_exploration_rate = max_exploration_rate
        self.exploration_decay_rate = exploration_decay_rate
        self.build_model()
        
    def build_model(self):
        self.q_table = np.zeros([self.state_size,self.action_size])
        
    def get_action(self,state):
        q_state = self.q_table[state]
        exploit = np.argmax(q_state)
        explore = super().get_action(state)
        return explore if np.random.random() < self.eps else exploit
    
    def train(self, experience):
        state, action, next_state, reward, done = experience
        
        q_old = self.q_table[state,action]
        q_next = self.q_table[next_state]
        q_next = np.zeros([self.action_size]) if done else q_next

        # Blend the old estimate with the Bellman target
        target = reward + self.discount_rate * np.max(q_next)
        self.q_table[state, action] = \
            (1 - self.learning_rate) * q_old + self.learning_rate * target

    def decay(self,episode):
        # Exponential decay from the maximum towards the minimum exploration rate
        self.eps = self.min_exploration_rate + (
            self.max_exploration_rate - self.min_exploration_rate
        ) * np.exp(-self.exploration_decay_rate * episode)
        return self.eps
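
To get a feel for how quickly exploration falls off, the sketch below evaluates the same decay formula outside the class, using the `QAgent` default values. The standalone `epsilon` helper and the episode numbers are chosen only for illustration.

```python
import math

# Same defaults as QAgent
min_eps, max_eps, decay_rate = 0.01, 1.0, 0.001


def epsilon(episode):
    """Exploration rate after the given episode number."""
    return min_eps + (max_eps - min_eps) * math.exp(-decay_rate * episode)


for ep in (0, 400, 1000, 5000):
    print(ep, round(epsilon(ep), 3))
```

With these defaults the agent is still exploring on roughly two thirds of its steps after 400 episodes, so a short run stays mostly random unless the decay rate is raised or the number of episodes is increased.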
    

In [1]:
agent = QAgent(env)       
num_episodes = 400


total_reward = 0
for ep in range(num_episodes):
    if ep % 1000 == 0:
        print(ep)
        print(total_reward)
    state = env.reset()
    done = False
    for step in range(100):
        action = agent.get_action(state)
        
        next_state, reward, done, info = env.step(action)
        agent.train((state, action, next_state, reward, done))
        state = next_state
        total_reward += reward
        
        if done:
            break

    # Decay exploration once per episode, even when the step limit ends it
    agent.decay(ep)
        
print(total_reward)   
print(agent.eps)

In [28]:
print(agent.q_table)

[[0.4909542  0.45334818 0.46546493 0.45782367]
 [0.30429607 0.28255306 0.22845774 0.39051919]
 [0.29439742 0.25832868 0.25530994 0.26097333]
 [0.03822265 0.14998099 0.05042876 0.07943631]
 [0.52393394 0.31258843 0.32656837 0.38475094]
 [0.         0.         0.         0.        ]
 [0.22230575 0.15603042 0.15714952 0.15868687]
 [0.         0.         0.         0.        ]
 [0.32023891 0.3883467  0.33936479 0.56698469]
 [0.47753114 0.64420596 0.34044153 0.37090803]
 [0.57198816 0.29637058 0.4192104  0.2465289 ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.44327615 0.47935633 0.75870145 0.46250383]
 [0.69563635 0.90033893 0.73898216 0.80188869]
 [0.         0.         0.         0.        ]]
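
A Q table is easier to read as a policy. The sketch below takes the values printed above, picks the highest-valued action in each state, and lays the result out on the 4x4 grid. It assumes the default 4x4 FrozenLake map and its action numbering (0 = left, 1 = down, 2 = right, 3 = up); the arrow symbols are just a display choice.

```python
import numpy as np

# Q table values copied from the output above
# (rows = states 0..15, columns = actions 0..3).
q_table = np.array([
    [0.4909542, 0.45334818, 0.46546493, 0.45782367],
    [0.30429607, 0.28255306, 0.22845774, 0.39051919],
    [0.29439742, 0.25832868, 0.25530994, 0.26097333],
    [0.03822265, 0.14998099, 0.05042876, 0.07943631],
    [0.52393394, 0.31258843, 0.32656837, 0.38475094],
    [0.0, 0.0, 0.0, 0.0],
    [0.22230575, 0.15603042, 0.15714952, 0.15868687],
    [0.0, 0.0, 0.0, 0.0],
    [0.32023891, 0.3883467, 0.33936479, 0.56698469],
    [0.47753114, 0.64420596, 0.34044153, 0.37090803],
    [0.57198816, 0.29637058, 0.4192104, 0.2465289],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.44327615, 0.47935633, 0.75870145, 0.46250383],
    [0.69563635, 0.90033893, 0.73898216, 0.80188869],
    [0.0, 0.0, 0.0, 0.0],
])

# Assumed action numbering: 0 = left, 1 = down, 2 = right, 3 = up
ARROWS = np.array(["<", "v", ">", "^"])

# Greedy action per state
greedy = np.argmax(q_table, axis=1)

# All-zero rows were never updated; mark them with "." instead of an arrow
unvisited = ~q_table.any(axis=1)
symbols = np.where(unvisited, ".", ARROWS[greedy])

print(symbols.reshape(4, 4))
```

The all-zero rows are states the agent never acts from, which on the default map are the holes and the goal, since an episode ends as soon as one of them is reached.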
