In [1]:

# Exercise 1
## Implement Q learning in the cartpole environment


Reinforcement learning tries to optimise an agent's behaviour based on the rewards it receives.
Games make good environments because episodes are easy and fast to repeat, which makes them ideal for learning by trial and error.


### Markov decision processes
Decision maker = Agent
In this example, the agent is the cart, and its actions are moving either left or right.

Environment -> Agent -> States -> Actions -> Rewards -> Repeat

The agent wants to maximise its cumulative reward, not just the immediate reward.

Set of states = S
Set of actions = A
Set of rewards = R

At each time step (t = 0, 1, 2, ...) the agent receives a representation of the environment's state St.
Based on this state, the agent selects an action At. This gives a state-action pair (St, At).
Time is then incremented to t + 1, and the environment moves to a new state S(t + 1).
At this time, the agent receives a reward R(t + 1) as a consequence of the action At. The reward R(t + 1) is based on the state-action pair (St, At).

We can look at this as a function *f*(St, At) = R(t + 1)

This is a sequential process, which can be presented like this: S0, A0, R1, S1, A1, R2, S2...

![Illustrated diagram](images/Environment-state-action-flow.png)
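
The loop described above can be sketched in a few lines of Python. This is only an illustration: `ToyEnvironment` and `run_episode` are hypothetical names made up for this sketch and are not part of gym or of the exercise code. The toy environment pays a reward of 1 for moving right and ends after three steps.

```python
import random


class ToyEnvironment:
    """Hypothetical stand-in for an environment (not part of gym)."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # this is S0

    def step(self, action):
        # Action 0 moves left, action 1 moves right
        self.position += 1 if action == 1 else -1
        self.steps += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, choose_action):
    """Collect the sequence S0, A0, R1, S1, A1, R2, ... for one episode."""
    trajectory = []
    state = env.reset()
    done = False
    while not done:
        action = choose_action(state)
        next_state, reward, done = env.step(action)
        trajectory.extend([state, action, reward])
        state = next_state
    trajectory.append(state)  # the final state
    return trajectory


rng = random.Random(0)
trajectory = run_episode(ToyEnvironment(), lambda state: rng.choice([0, 1]))
print(trajectory)
```

Each group of three entries in the printed list is one (state, action, reward) step, followed by the final state.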

### Expected return
The goal for the agent is to maximise the cumulative reward. The return is the sum of future rewards:
Gt = Rt+1 + Rt+2 + Rt+3 + ... + RT

T is the final time step.

The agent's interaction with the environment breaks up into episodes. A reward Rt+1 is received after every time step, and an episode ends at the final time step T.
The environment is then reset and the agent can start over from a new initial state.

We modify the agent's goal so that it tries to maximise the cumulative discounted reward.

The discount rate d (gamma) is a number between 0 and 1.

The discounted return is Gt = Rt+1 + d(Rt+2) + d²(Rt+3) + d³(Rt+4) + ..., the sum of the rewards at each time step, each weighted by a power of the discount rate.

This leads the agent to prioritise immediate rewards, since rewards further in the future are discounted more heavily.
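
As a small worked example, the sketch below computes a return from a list of rewards. It uses the standard definition, in which the reward k steps ahead is weighted by the discount rate to the power k, so the first reward is not discounted. The function name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, discount):
    """Sum of discount**k * rewards[k], where rewards[0] is R(t+1)."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (discount ** k) * reward
    return total


rewards = [1.0, 1.0, 1.0, 1.0]

# With a discount rate of 1 this is the plain, undiscounted return
print(discounted_return(rewards, 1.0))  # 4.0

# With a discount rate of 0.5 later rewards count for less
print(discounted_return(rewards, 0.5))  # 1.875
```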

### Policies and value functions
How likely is an agent to take any given action based on the state?

#### Policies
A policy is a function that maps a given state to the probability of selecting each possible action from that state.
Generally, an agent follows a policy. If an agent follows policy p at time t, then p(a | s) is the probability that At = a if St = s.
This means that, at time t, under policy p, the probability of taking action a in state s is p(a | s).
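
One way to picture a policy is as a lookup table of probabilities. The sketch below is purely illustrative: the state names, the probabilities and the helper functions are assumptions for the example, not something the cartpole environment provides.

```python
import random

# Hypothetical policy table: state -> {action: probability}
policy = {
    "pole_leaning_left": {"left": 0.8, "right": 0.2},
    "pole_leaning_right": {"left": 0.1, "right": 0.9},
}


def action_probability(policy, state, action):
    """p(a | s): probability of choosing `action` in `state`."""
    return policy[state][action]


def sample_action(policy, state, rng):
    """Draw one action according to the probabilities for `state`."""
    actions = list(policy[state].keys())
    weights = list(policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(42)
print(action_probability(policy, "pole_leaning_left", "left"))
print(sample_action(policy, "pole_leaning_right", rng))
```

The probabilities for each state sum to 1, since the agent always has to pick one of the available actions.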

#### Value functions
Value functions estimate how good it is for an agent to be in a given state, or to perform a given action in a given state.
The value that a value function returns is the expected return.

There are two kinds: the state-value function and the action-value function.

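#### State-value function
The state-value function for policy p, denoted v<sub>p</sub>, tells us how good a given state is for the agent: it is the expected return from starting in that state and following policy p.

#### Action-value function
The action-value function for policy p, denoted q<sub>p</sub>, tells us how good it is for the agent to take a given action from a given state while following policy p.

In other words, the Q function gives us the value of an action in a state under policy p: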

q<sub>p</sub>(s, a )
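
When states and actions are discrete, the action values can be stored in a table and looked up directly. The following sketch uses invented state names and values to show the lookup, and how the action with the highest value in a state can be picked. `q` and `greedy_action` are hypothetical helper names.

```python
# Hypothetical action values q(s, a) for two states and two actions
q_values = {
    ("upright", "left"): 4.0,
    ("upright", "right"): 5.5,
    ("tilting_right", "left"): 1.0,
    ("tilting_right", "right"): 3.0,
}
ACTIONS = ["left", "right"]


def q(state, action):
    """Look up the stored value of taking `action` in `state`."""
    return q_values[(state, action)]


def greedy_action(state):
    """Return the action with the highest stored value in `state`."""
    return max(ACTIONS, key=lambda action: q(state, action))


print(q("upright", "right"))
print(greedy_action("tilting_right"))
```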


### Optimal policies


### Q - learning
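
The training loop later in this notebook updates one Q table entry per step by blending the old value with a target made of the reward plus the discounted best value of the next state. A minimal sketch of that single update, isolated into a hypothetical helper called `q_update` (the numbers are made up, apart from the learning rate and discount used in this notebook):

```python
def q_update(current_q, reward, max_next_q, learning_rate, discount):
    """Move current_q a fraction learning_rate towards the target."""
    target = reward + discount * max_next_q
    return (1 - learning_rate) * current_q + learning_rate * target


new_q = q_update(
    current_q=0.5,
    reward=1.0,
    max_next_q=2.0,
    learning_rate=0.1,
    discount=0.95,
)
print(new_q)
```

A learning rate of 0 would leave the old value untouched, while a learning rate of 1 would replace it with the target entirely.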




In [2]:

from collections import deque
import math
import random
import time

import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

cartpole_environment = gym.make("CartPole-v1")

In [3]:
#Constants
LR = 0.1
DISCOUNT = 0.95
EPISODES = 20000

# Variables
total = 0
total_reward = 0
prior_reward = 0

# Number of buckets per observation dimension:
# cart position, cart velocity, pole angle, pole angular velocity
# The exact number of buckets does not seem to matter much; 100 in each was also tried
observation = [30, 30, 50, 50]

# Bucket width (step size) per observation dimension:
# cart position, cart velocity, pole angle, pole angular velocity
scaling_values = np.array([0.25, 0.25, 0.01, 0.1])



# Exploration rate, just called epsilon
exploration_rate = 1
exploration_rate_decay = 0.99995
exploration_rate_minimum_threshold = 0.05


# The Q table stores an estimated value for every (discrete state, action) pair
# The agent uses it to determine the next move: the action with the highest value in the current state
# Each move is one of two actions (push the cart left or right)
# Alternative: start the Q table with zeroes, which are then improved during training
#q_table = np.zeros(observation + [cartpole_environment.action_space.n])

# The Q table starts with randomised values, which are refined during training
q_table = np.random.uniform(low=0, high=1, size=(observation + [cartpole_environment.action_space.n]))

# Q-table shape: [30, 30, 50, 50, 2]


# The observations are continuous, so this function groups similar state values into "buckets".
# This yields a more manageable state space, which we can use to index the Q table.
# It takes a state/observation and returns a tuple of integers: the reduced, discretised state.
def get_discrete_state(state):
    discrete_state = state/scaling_values + np.array([15, 10, 1, 10])
    # np.int was removed in NumPy 1.24; the built-in int does the same job here
    return tuple(discrete_state.astype(int))

In [None]:

for episode in range(EPISODES + 1): # +1 so the loop also runs the final episode number
    time_0 = time.time() # t0: the time at which this episode started

    discrete_state = get_discrete_state(cartpole_environment.reset())

    done = False

    episode_reward = 0 # Initialising reward for this episode

    if episode % 1000 == 0: # Just printing the Episode
        print("Episode: " + str(episode))

    while not done: # continue balancing the pole as long as it has not fallen

        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            # Choosing an exploitation action
            action = np.argmax(q_table[discrete_state])
        else: # Choosing a random action from the environment
            action = cartpole_environment.action_space.sample()

        # Passing our action to the environment
        # new_state is the new state we have to work with, also called the observation
        # reward is the reward for the action we chose
        # done signals that the episode has ended (the pole fell, the cart left the track, or the step limit was reached)
        # info contains diagnostics, which are not used at the moment, could be _
        # Note: this 4-value return is the gym API before version 0.26
        new_state, reward, done, info = cartpole_environment.step(action)

        # Updates the reward for the current episode
        episode_reward += reward

        new_discrete_state = get_discrete_state(new_state)


        #Rendering the gui showing the crazy moves of the agent
        if episode % 1000 == 0:
            # The cartpole will simply freeze when the episode ends, and will wait for the next 1000 iterations
            cartpole_environment.render()


        if not done:
            # Highest Q value available from the new state
            max_q = np.max(q_table[new_discrete_state])
            # Current q value
            current_q = q_table[discrete_state + (action,)]

            # Blend the old value with the target (reward + discounted best future value)
            new_q = current_q * (1 - LR) + LR * (reward + DISCOUNT * max_q)

            # Updating the Q table entry for the current state and action
            q_table[discrete_state + (action,)] = new_q

        # Discrete state is updated
        discrete_state = new_discrete_state

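    # Only decay while the exploration rate is above its minimum threshold
    if exploration_rate > exploration_rate_minimum_threshold:
        # Reducing the exploration rate, but only after episode 10000
        # and only when this episode scored better than the previous one
        if episode_reward > prior_reward and episode > 10000:
            # Note: the exponent is offset by 1000 while decay starts after 10000,
            # so the first decay step drops the rate from 1 to at most about 0.64
            exploration_rate = math.pow(exploration_rate_decay, episode - 1000)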

            if episode % 500 == 0:
                print("Exploration rate: " + str(exploration_rate))

    # Measuring the time spent balancing
    time_1 = time.time()
    episode_total = time_1 - time_0

    #Updating the total
    total = total + episode_total

    #Updating the total reward
    total_reward += episode_reward

    # Saving the last reward
    prior_reward = episode_reward

    #Measuring averages
    if episode % 1000 == 0:
        mean = total / 1000
        print("Time Average: " + str(mean))
        total = 0

        mean_reward = total_reward / 1000
        print("Mean Reward: " + str(mean_reward))
        total_reward = 0

cartpole_environment.close()



#Mean reward after 20000 episodes == 77 with buckets [30, 30, 50, 50]
#2000 episodes, mean reward = 79, time average = 0.00726 with buckets [50, 50, 80, 80]
