## Q-Learning

Q-learning is a value-based learning algorithm. Value-based algorithms update the value function using an equation (in particular, the Bellman equation). Policy-based algorithms, the other type, estimate the value function with a greedy policy obtained from the last policy improvement.

The rule for updating the Q-values in the Q-table is given below, where $\alpha$ and $\gamma$ are the learning rate and the discount factor, with $0 < \alpha, \gamma < 1$.

<img src="Qlearning_update.png">
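
In case the image does not render: assuming it shows the standard tabular Q-learning rule, which is also the form applied by the code in this section, the update can be written as

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)$$

The short sketch below works through one such update by hand. The `q_update` helper and all the numbers are illustrative assumptions, not values taken from the notebook's run.

```python
import numpy as np

def q_update(q_sa, reward, next_q_row, alpha, gamma):
    """Return the new Q(s, a) after one tabular Q-learning step."""
    # Target: immediate reward plus the discounted best value of the next state.
    target = reward + gamma * np.max(next_q_row)
    # Move the old estimate a fraction alpha of the way toward the target.
    return (1 - alpha) * q_sa + alpha * target

# Hypothetical values, chosen only to keep the arithmetic easy to follow.
old_value = -2.0
next_row = np.array([-1.5, -1.0, -3.0])  # Q(s', a) for three actions
new_value = q_update(old_value, reward=-1.0, next_q_row=next_row,
                     alpha=0.18, gamma=0.95)
print(new_value)
```

Here the target is $-1 + 0.95 \cdot (-1.0) = -1.95$, and the estimate moves 18% of the way from $-2.0$ toward it, giving about $-1.991$.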

Since Q-learning is a model-free algorithm, it learns from direct interaction with an environment, so we use an environment from Gym. Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball. In our example, a car has to climb a mountain to reach a goal position at the top. In MountainCar-v0 every step yields a reward of -1 until the goal is reached, so the goal, where the penalties stop, is the most desired state. At each step the car is given an $action$; the environment responds with the car's new state, a $reward$ for the action, and a boolean flag indicating whether the episode has ended (either because the goal was reached or because the step limit ran out).
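
The action, state, reward and done cycle described above can be sketched without installing Gym. The `ToyMountainCar` class below is a hypothetical stand-in that only imitates the shape of the classic `reset()` / `step()` interface; its dynamics and constants are loosely modeled on a mountain-car problem and are assumptions for illustration, not Gym's own code.

```python
import math

class ToyMountainCar:
    """Hypothetical Gym-style environment (illustrative, not Gym's implementation)."""
    GOAL = 0.5

    def reset(self):
        self.position, self.velocity = -0.5, 0.0
        return (self.position, self.velocity)

    def step(self, action):
        # action: 0 = push left, 1 = no push, 2 = push right
        self.velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * self.position)
        self.velocity = max(-0.07, min(0.07, self.velocity))
        self.position = max(-1.2, min(0.6, self.position + self.velocity))
        if self.position <= -1.2 and self.velocity < 0:
            self.velocity = 0.0  # the car stops at the left wall
        done = self.position >= self.GOAL
        reward = -1.0  # every step costs -1, so shorter episodes score higher
        return (self.position, self.velocity), reward, done, {}

env = ToyMountainCar()
state = env.reset()
total_reward, steps, done = 0.0, 0, False
while not done and steps < 200:
    # Hand-written rule (not learned): push in the direction the car is moving.
    action = 2 if state[1] >= 0 else 0
    state, reward, done, info = env.step(action)
    total_reward += reward
    steps += 1
print(steps, total_reward, done)
```

The loop has the same skeleton as a Q-learning episode; the learning algorithm replaces the hand-written rule with an action chosen from the Q-table and adds an update after each step.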

In [4]:
import gym
import numpy as np

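#written against the classic Gym API (gym < 0.26): reset() returns the observation
#and step() returns (observation, reward, done, info)
env = gym.make("MountainCar-v0")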
#env.reset()

alpha = 0.18      #learning rate
discount = 0.95   #discount factor (gamma)
Episodes = 500    #with these parameters the goal is only reached efficiently by about episode 2000
freq = 100        #print progress and render every freq-th episode

discrete_os_size = [20] * len(env.observation_space.low) #20 bins per observation dimension (position, velocity)
discrete_block_size = (env.observation_space.high - env.observation_space.low)/discrete_os_size #width of one bin

#Q-table with one entry per (position bin, velocity bin, action), randomly initialised
q_values = np.random.normal(0, 2, size = (discrete_os_size + [env.action_space.n]))

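def get_discrete_state(state):
    #map a continuous observation to a tuple of bin indices into the Q-table
    dis_state = (state - env.observation_space.low)/discrete_block_size
    dis_state = dis_state.astype(int) #np.int was removed in NumPy 1.24
    return tuple(dis_state)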

#dis_state = get_discrete_state(env.reset())
#print(q_values[dis_state].shape)

for episode in range(Episodes + 1) :
    done = False
    dis_state = get_discrete_state(env.reset())
    cnt = 0
    if episode % freq == 0:
        #print(q_values)
        print(episode)
    while not done:
        #print(dis_state)
        action = np.argmax(q_values[dis_state]) #greedy action for the current state
        new_state, reward, done, info = env.step(action)
        temp_dis_state = get_discrete_state(new_state)
        if episode % freq == 0 :
            env.render()
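        if not done:
            #Q-learning update: blend the old value with reward + discounted best next value
            curr = q_values[dis_state + (action, )]
            mx = np.max(q_values[temp_dis_state])
            q_values[dis_state + (action, )] = (1-alpha)*curr + alpha*(reward + discount*mx)
        elif new_state[0] >= env.unwrapped.goal_position :
            #goal reached: no further cost remains, so the value is set to 0
            q_values[dis_state + (action, )] = 0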
        dis_state = temp_dis_state
        cnt += 1
        if cnt > 1000 :
            break

env.close()

0
100
200
300
400
500
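
To make the discretization step in the cell above concrete, here is a minimal standalone sketch of the same binning idea. The observation bounds are hard-coded as assumptions about MountainCar-v0 (the notebook reads them from `env.observation_space`), and the `to_bin` helper with its `np.clip` guard is an addition for illustration: it keeps an observation that sits exactly on the upper bound inside the table, which the notebook's function does not do.

```python
import numpy as np

# Assumed MountainCar-v0 bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07].
low = np.array([-1.2, -0.07])
high = np.array([0.6, 0.07])
bins = np.array([20, 20])        # 20 bins per dimension, as in the notebook
block = (high - low) / bins      # width of one bin in each dimension

def to_bin(state):
    """Map a continuous (position, velocity) pair to integer bin indices."""
    idx = ((np.asarray(state) - low) / block).astype(int)
    idx = np.clip(idx, 0, bins - 1)  # guard against values on the upper edge
    return tuple(int(i) for i in idx)

print(to_bin([-0.5, 0.01]))  # a typical mid-valley observation
print(to_bin(high))          # the upper corner of the observation space
```

With these bounds each position bin is 0.09 wide and each velocity bin is 0.007 wide, so the first call lands in bin `(7, 11)` and the second is clipped to the last bin `(19, 19)`.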
