<a href="https://colab.research.google.com/github/Ranjani94/Reinforcement_Learning/blob/main/Q_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Q-Learning algorithm for the Mountain Car example

Mountain Car is a standard benchmark problem in reinforcement learning. It consists of an under-powered car that has to drive up a steep hill to reach a flag.

The catch is that gravity is stronger than the car's engine, so even at full throttle the car cannot accelerate up the steep slope. The car therefore has to build momentum first: it drives in reverse up the opposite slope to gain potential energy, then uses that energy to reach the flag at the top right.
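
The claim that the engine loses to gravity can be checked with a small sketch of the per-step velocity update. The constants `FORCE` and `GRAVITY` and the `cos(3 * position)` slope term are assumptions taken from gym's classic-control Mountain Car implementation, not from this notebook, and `net_acceleration` is a hypothetical helper name.

```python
import math

# Assumed constants from gym's classic-control MountainCar implementation.
FORCE = 0.001     # velocity gained per step from the engine
GRAVITY = 0.0025  # scale of the velocity lost per step to the slope


def net_acceleration(position, action):
    """Velocity change for one step; action is 0 (left), 1 (neutral) or 2 (right)."""
    return (action - 1) * FORCE - GRAVITY * math.cos(3 * position)


# Full throttle to the right at several positions along the track.
for position in (-0.5, 0.0, 0.3, 0.5):
    print(position, net_acceleration(position, action=2))
```

Under these assumptions the net acceleration is positive near the valley floor (around -0.5) but negative on the steep part of the right-hand slope (around 0.0 to 0.3), which is why the car needs a run-up.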

Here, the state space is continuous and is defined by two variables: position and velocity. In a given state (that is, a position and a velocity) the agent can take one of three discrete actions: push forward (towards the top right in the diagram), push backward (towards the top left in the diagram), or leave the engine in neutral. The agent receives a negative reward at every step until it reaches the goal state.



In [12]:
pip install gym



In [13]:
pip show pyglet

Name: pyglet
Version: 1.5.0
Summary: Cross-platform windowing and multimedia library
Home-page: http://pyglet.readthedocs.org/en/latest/
Author: Alex Holkner
Author-email: Alex.Holkner@gmail.com
License: BSD
Location: /usr/local/lib/python3.6/dist-packages
Requires: future
Required-by: gym


Q-learning applies directly to environments with discrete state and action spaces. Mountain Car became a test bed for reinforcement learning algorithms because its state space is continuous, so tabular Q-learning requires either discretizing the state space or replacing the table with a function approximator.

The state space is two-dimensional and continuous. It consists of position and velocity, with the following ranges:

- Position: [-1.2, 0.6]
- Velocity: [-0.07, 0.07]

The action space is discrete and one-dimensional, with three options:

- (left, neutral, right)

Reward: -1 for every timestep.

Start state:

- Position: about -0.5 (gym draws it uniformly from [-0.6, -0.4])
- Velocity: 0.0

Terminal state condition:

- An episode ends when the position reaches the flag at 0.5 (the goal position in gym's `MountainCar-v0`), or after 200 timesteps.

In [5]:
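import numpy as np

# Number of training episodes. Not defined in the original cell; 5000 is
# inferred from the training log, where epsilon falls by 0.0005 per episode.
MAX_NUM_EPISODES = 5000
EPSILON_MIN = 0.005
STEPS_PER_EPISODE = 200
max_num_steps = MAX_NUM_EPISODES * STEPS_PER_EPISODE
EPSILON_DECAY = 500 * EPSILON_MIN / max_num_steps  # Subtracted from epsilon on every step
ALPHA = 0.05  # Learning rate
GAMMA = 0.98  # Discount factor
NUM_DISCRETE_BINS = 30  # Number of bins to discretize each observation dim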

class Q_Learner(object):
    def __init__(self, env):
        self.obs_shape = env.observation_space.shape
        self.obs_high = env.observation_space.high
        self.obs_low = env.observation_space.low
        self.obs_bins = NUM_DISCRETE_BINS  # Number of bins to Discretize each observation dim
        self.bin_width = (self.obs_high - self.obs_low) / self.obs_bins
        self.action_shape = env.action_space.n
        # Create a multi-dimensional array (a table) to hold the Q-values.
        # One extra bin per dimension keeps an observation that lies exactly
        # on the upper bound inside the table.
        self.Q = np.zeros((self.obs_bins + 1, self.obs_bins + 1,
                           self.action_shape))  # (31 x 31 x 3)
        self.alpha = ALPHA  # Learning rate
        self.gamma = GAMMA  # Discount factor
        self.epsilon = 1.0

    def discretize(self, obs):
        return tuple(((obs - self.obs_low) / self.bin_width).astype(int))

    def get_action(self, obs):
        discretized_obs = self.discretize(obs)
        # Epsilon-Greedy action selection
        if self.epsilon > EPSILON_MIN:
            self.epsilon -= EPSILON_DECAY
        if np.random.random() > self.epsilon:
            return np.argmax(self.Q[discretized_obs])
        else:  # Choose a random action
            return np.random.choice(self.action_shape)

    def learn(self, obs, action, reward, next_obs):
        discretized_obs = self.discretize(obs)
        discretized_next_obs = self.discretize(next_obs)
        td_target = reward + self.gamma * np.max(self.Q[discretized_next_obs])
        td_error = td_target - self.Q[discretized_obs][action]
        self.Q[discretized_obs][action] += self.alpha * td_error
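
As a worked example of the update in `learn`, the sketch below applies one temporal-difference step to plain numbers. The function names `td_update` and `blended_update` are hypothetical; the second form, `(1 - alpha) * Q + alpha * target`, is algebraically the same update written as a weighted average.

```python
ALPHA = 0.05  # learning rate, same value as in the cell above
GAMMA = 0.98  # discount factor, same value as in the cell above


def td_update(q_sa, reward, max_next_q, alpha=ALPHA, gamma=GAMMA):
    """Q(s, a) moved a fraction alpha towards the TD target."""
    td_target = reward + gamma * max_next_q
    td_error = td_target - q_sa
    return q_sa + alpha * td_error


def blended_update(q_sa, reward, max_next_q, alpha=ALPHA, gamma=GAMMA):
    """The same update written as a weighted average of old value and target."""
    return (1 - alpha) * q_sa + alpha * (reward + gamma * max_next_q)


# First update on an all-zero table with the usual reward of -1.
print(td_update(0.0, -1.0, 0.0))
# Both forms agree on arbitrary inputs.
print(td_update(-3.0, -1.0, -2.0), blended_update(-3.0, -1.0, -2.0))
```

With an all-zero table the first update moves the entry from 0 to -0.05, that is, 5% of the way towards the target of -1.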

In [6]:
def train(agent, env):
    best_reward = -float('inf')
    for episode in range(MAX_NUM_EPISODES):
        done = False
        obs = env.reset()
        total_reward = 0.0
        while not done:
            action = agent.get_action(obs)
            next_obs, reward, done, info = env.step(action)
            agent.learn(obs, action, reward, next_obs)
            obs = next_obs
            total_reward += reward
        if total_reward > best_reward:
            best_reward = total_reward
        print("Episode#:{} reward:{} best_reward:{} eps:{}".format(episode,
                                     total_reward, best_reward, agent.epsilon))
    # Return the trained policy
    return np.argmax(agent.Q, axis=2)
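
The last line of `train` turns the Q-table into a policy by taking the best action for every (position bin, velocity bin) pair. A minimal sketch on a toy table, where `toy_q` and its values are made up for illustration:

```python
import numpy as np

# Toy Q-table: 2 position bins x 2 velocity bins x 3 actions.
toy_q = np.zeros((2, 2, 3))
toy_q[0, 0] = [-1.0, -0.5, -2.0]  # action 1 has the highest value
toy_q[1, 1] = [-3.0, -3.0, -1.0]  # action 2 has the highest value

# argmax over the action axis gives one action index per discretized state.
toy_policy = np.argmax(toy_q, axis=2)
print(toy_policy.shape)
print(toy_policy)
```

States that were never updated keep all-zero values, so `np.argmax` returns action 0 for them; unvisited states therefore default to pushing left.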

In [7]:
def test(agent, env, policy):
    done = False
    obs = env.reset()
    total_reward = 0.0
    while not done:
        action = policy[agent.discretize(obs)]
        next_obs, reward, done, info = env.step(action)
        obs = next_obs
        total_reward += reward
    return total_reward
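
The loops above assume the older gym interface, in which `env.reset()` returns only the observation and `env.step()` returns four values. Newer gym releases and gymnasium return `(obs, info)` from `reset` and five values from `step`. The helpers below are a hedged sketch for running the same loops on either interface; `reset_env` and `step_env` are hypothetical names, not part of gym.

```python
def reset_env(env):
    """Return only the observation, whichever reset() convention the env uses."""
    result = env.reset()
    # Newer API: (obs, info) with info a dict. Older API: obs alone.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def step_env(env, action):
    """Return (obs, reward, done, info), whichever step() convention the env uses."""
    result = env.step(action)
    if len(result) == 5:
        # Newer API splits the end-of-episode flag into terminated and truncated.
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    return result
```

With these, `obs = reset_env(env)` and `next_obs, reward, done, info = step_env(env, action)` would replace the direct calls in `train` and `test`.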

In [9]:
import gym
import numpy as np

if __name__ == "__main__":
    env = gym.make('MountainCar-v0')
    agent = Q_Learner(env)
    learned_policy = train(agent, env)
    # Use the Gym Monitor wrapper to evaluate the agent and record video
    gym_monitor_path = "./gym_monitor_output"
    env = gym.wrappers.Monitor(env, gym_monitor_path, force=True)
    for _ in range(1000):
        test(agent, env, learned_policy)
    env.close()

Streaming output truncated to the last 5000 lines.
Episode#:1 reward:-200.0 best_reward:-200.0 eps:0.9989999999999934
Episode#:2 reward:-200.0 best_reward:-200.0 eps:0.9984999999999902
Episode#:3 reward:-200.0 best_reward:-200.0 eps:0.9979999999999869
Episode#:4 reward:-200.0 best_reward:-200.0 eps:0.9974999999999836
Episode#:5 reward:-200.0 best_reward:-200.0 eps:0.9969999999999803
Episode#:6 reward:-200.0 best_reward:-200.0 eps:0.9964999999999771
Episode#:7 reward:-200.0 best_reward:-200.0 eps:0.9959999999999738
Episode#:8 reward:-200.0 best_reward:-200.0 eps:0.9954999999999705
Episode#:9 reward:-200.0 best_reward:-200.0 eps:0.9949999999999672
Episode#:10 reward:-200.0 best_reward:-200.0 eps:0.994499999999964
Episode#:11 reward:-200.0 best_reward:-200.0 eps:0.9939999999999607
Episode#:12 reward:-200.0 best_reward:-200.0 eps:0.9934999999999574
Episode#:13 reward:-200.0 best_reward:-200.0 eps:0.9929999999999541
Episode#:14 reward:-200.0 best_reward:-200.0 eps:0.9924999999

NameError: ignored
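
The epsilon values in the log fall by 0.0005 per episode, which follows from the linear decay schedule. The sketch below reproduces that arithmetic; `MAX_NUM_EPISODES = 5000` is an assumption inferred from the log, and the estimate of when the floor is reached assumes every episode runs the full 200 steps.

```python
EPSILON_MIN = 0.005
STEPS_PER_EPISODE = 200
MAX_NUM_EPISODES = 5000  # assumption: inferred from the log, not given in the notebook

max_num_steps = MAX_NUM_EPISODES * STEPS_PER_EPISODE
epsilon_decay = 500 * EPSILON_MIN / max_num_steps  # subtracted on every step

# Drop in epsilon over one full 200-step episode.
decay_per_episode = epsilon_decay * STEPS_PER_EPISODE

# Steps and episodes needed for epsilon to fall from 1.0 to EPSILON_MIN.
steps_to_floor = (1.0 - EPSILON_MIN) / epsilon_decay
episodes_to_floor = steps_to_floor / STEPS_PER_EPISODE

print(decay_per_episode)
print(episodes_to_floor)
```

Under these assumptions the agent explores almost uniformly at random for the first few hundred episodes and becomes nearly greedy after roughly 40% of training.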

### Better way to explain Q-learning

The action space is a discrete set of three possible actions, and the state space is a two-dimensional continuous space in which one dimension is the car's position and the other is its velocity.

In [None]:
# Exploring the Mountain Car environment
import gym
import numpy as np

env_name = 'MountainCar-v0'
env = gym.make(env_name)

print("Action space:", env.action_space)
print("Observation space:", env.observation_space)
print("Highest state feature values:", env.observation_space.high)
print("Lowest state feature values:", env.observation_space.low)
print("Observation shape:", env.observation_space.shape)

We assign the hyperparameters, namely the number of discrete states per dimension, the number of episodes, the learning rate (initial and minimum), the discount factor gamma, the maximum steps in an episode, and epsilon for the epsilon-greedy policy, using the following code:

In [None]:
n_states = 40  # number of discrete bins per state dimension
episodes = 10  # number of episodes

initial_lr = 1.0 # initial learning rate
min_lr = 0.005 # minimum learning rate
gamma = 0.99 # discount factor
max_steps = 300
epsilon = 0.05

env = env.unwrapped
env.seed(0)         #setting environment seed to reproduce same result
np.random.seed(0)   #setting numpy random number generation seed to reproduce same random numbers

Our next task is to write a function that discretizes the continuous state space. Discretization converts a continuous observation into a pair of integer bin indices, one for position and one for velocity:

In [None]:
def discretization(env, obs):
    
    env_low = env.observation_space.low
    env_high = env.observation_space.high
    
    env_den = (env_high - env_low) / n_states
    pos_den = env_den[0]
    vel_den = env_den[1]
    
    pos_high = env_high[0]
    pos_low = env_low[0]
    vel_high = env_high[1]
    vel_low = env_low[1]
    
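    # Integer bin indices, clamped so that an observation exactly on the
    # upper bound stays inside the n_states x n_states table
    pos_scaled = min(n_states - 1, int((obs[0] - pos_low)/pos_den))
    vel_scaled = min(n_states - 1, int((obs[1] - vel_low)/vel_den))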
    
    return pos_scaled,vel_scaled
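
To make the binning concrete, here is a standalone sketch that needs no environment. It assumes the bounds listed earlier (position in [-1.2, 0.6], velocity in [-0.07, 0.07]); `to_bins` is a hypothetical vectorized equivalent of `discretization`. The clip to `n_states - 1` is included because an observation exactly on the upper bound would otherwise map to index `n_states`, one past the end of a table with `n_states` entries per dimension.

```python
import numpy as np

n_states = 40
low = np.array([-1.2, -0.07])   # assumed lower bounds (position, velocity)
high = np.array([0.6, 0.07])    # assumed upper bounds (position, velocity)
bin_width = (high - low) / n_states


def to_bins(obs):
    """Map a continuous (position, velocity) pair to integer bin indices."""
    idx = ((np.asarray(obs) - low) / bin_width).astype(int)
    idx = np.clip(idx, 0, n_states - 1)
    return tuple(int(i) for i in idx)


print(bin_width)               # width of one bin in each dimension
print(to_bins([-0.5, 0.01]))   # a state near the valley floor
print(to_bins([0.6, 0.07]))    # both features on their upper bounds
```

Each position bin is 0.045 wide and each velocity bin 0.0035 wide, so all states within one such cell share a single row of Q-values.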

Now we implement the Q-learning algorithm by initializing a Q-table and updating its Q-values after every step. Note that the total reward reported for each episode is not the sum of the environment's rewards: it accumulates the absolute difference between the car's current position and the lowest point (-0.5, the start point), so it grows as the car moves further away from the center. The Q-table update itself uses the environment's reward of -1 per step:

In [None]:
# Q-table
# Rows are states, but here a state is 2-D (pos, vel)
# Columns are actions
# Therefore, the Q-table is 3-D

q_table = np.zeros((n_states, n_states, env.action_space.n))
total_steps = 0
for episode in range(episodes):
    obs = env.reset()
    total_reward = 0
    # Decrease the learning rate alpha over time
    alpha = max(min_lr, initial_lr * (gamma ** (episode // 100)))
    steps = 0
    # Runs until the car reaches the flag; max_steps is not enforced here
    while True:
        env.render()
        pos, vel = discretization(env, obs)

        # Action for the current state using epsilon-greedy
        if np.random.uniform(low=0, high=1) < epsilon:
            a = np.random.choice(env.action_space.n)
        else:
            a = np.argmax(q_table[pos][vel])
        obs, reward, terminate, _ = env.step(a)
        # Reported total: distance from the lowest point, not the env reward
        total_reward += abs(obs[0] + 0.5)

        # Q-table update
        pos_, vel_ = discretization(env, obs)
        q_table[pos][vel][a] = (1 - alpha) * q_table[pos][vel][a] + alpha * (reward + gamma * np.max(q_table[pos_][vel_]))
        steps += 1
        if terminate:
            break
    print("Episode {} completed with total reward {} in {} steps".format(episode + 1, total_reward, steps))

while True:  # Hold the render at the last step, when the car passes the flag
    env.render()
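
The learning-rate schedule used in the training loop can be examined on its own. This sketch copies the expression from the loop into a hypothetical helper, `learning_rate`, using the hyperparameter values defined earlier; note that the schedule reuses `gamma` as its decay base.

```python
initial_lr = 1.0  # initial learning rate
min_lr = 0.005    # minimum learning rate
gamma = 0.99      # discount factor, reused here as the decay base


def learning_rate(episode):
    """Step-wise decay: alpha shrinks by a factor of gamma every 100 episodes."""
    return max(min_lr, initial_lr * (gamma ** (episode // 100)))


for episode in (0, 99, 100, 1000, 100000):
    print(episode, learning_rate(episode))
```

Because the exponent is `episode // 100`, alpha stays at 1.0 for the first 100 episodes. With `episodes = 10` as set above, the decay never takes effect, and every update fully overwrites the old Q-value with the new target.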