<a href="https://colab.research.google.com/github/Ranjani94/Reinforcement_Learning/blob/main/SARSA_Learning_Algorithm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## SARSA Algorithm for Mountain Car Example

SARSA stands for State-Action-Reward-State-Action. The agent in state S takes an action A, receives a reward R, and moves to the next state S', where it chooses the next action A' using its current policy. From this experience (S, A, R, S', A'), the agent updates the Q-function, also called the action-value function. Because the update uses the action A' that the agent's own behaviour policy actually selects, exploration included, SARSA is called an on-policy reinforcement learning algorithm. This makes it account for the cost of exploratory moves: for instance, if a robot's route passes near the top of a staircase, SARSA tends to learn a policy that keeps the robot away from the steps, because an exploratory action taken there could be dangerous. The update target is the current reward plus the discounted Q-value of the next state-action pair (S', A').

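For reference, the update described above can be written out as follows. This is the standard textbook form of the SARSA rule, with $\alpha$ the learning rate and $\gamma$ the discount factor; the right-hand side is the equivalent weighted-average form that matches the Q-table update line in the training loop of this notebook.

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma\, Q(S', A') - Q(S, A) \right] = (1 - \alpha)\, Q(S, A) + \alpha \left[ R + \gamma\, Q(S', A') \right]
$$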


### Importing the necessary gym library which has the toolkit for developing the reinforcement learning environment.

In [1]:
import gym
import numpy as np

### Environment used is mountain car. The action space, observation space, and the highest and lowest feature values are printed

In [2]:
env_name = 'MountainCar-v0'
env = gym.make(env_name)

print("Action Set size :",env.action_space)
print("Observation set shape :",env.observation_space) 
print("Highest state feature value :",env.observation_space.high) 
print("Lowest state feature value:",env.observation_space.low) 
print(env.observation_space.shape) 

Action Set size : Discrete(3)
Observation set shape : Box(2,)
Highest state feature value : [0.6  0.07]
Lowest state feature value: [-1.2  -0.07]
(2,)


### Assigning the Hyperparameters

The cell below defines the hyperparameters: the number of discrete bins per state feature (n_states), the number of training episodes, the initial and minimum learning rates, gamma (the discount factor applied to future rewards), max_steps (the intended maximum number of steps per episode), and epsilon (the probability of a random action under the epsilon-greedy policy).

In [3]:
n_states = 40  
episodes = 10 

initial_lr = 1.0 
min_lr = 0.005
gamma = 0.99
max_steps = 300
epsilon = 0.05

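env = env.unwrapped   # strip the wrappers (including the step limit) to use the raw environment
env.seed(0)           # seed the environment (older gym API)
np.random.seed(0)     # reproduces the same random values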

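The learning-rate values above feed the step-decay schedule used in the training loop later in the notebook. The following is a minimal sketch of that schedule in isolation; the function name `alpha_for_episode` is made up for illustration, and the constants are copied from the cell above.

```python
# Hyperparameter values copied from the cell above.
initial_lr = 1.0
min_lr = 0.005
gamma = 0.99


def alpha_for_episode(episode):
    """Step decay: alpha shrinks by a factor of gamma every 100 episodes,
    and never drops below min_lr."""
    return max(min_lr, initial_lr * gamma ** (episode // 100))


# Print alpha for a few episode indices to see how slowly it decays.
for ep in (0, 9, 100, 1000, 100000):
    print(ep, alpha_for_episode(ep))
```

With `episodes = 10`, the integer division `episode // 100` is always 0, so alpha stays at `initial_lr` for the whole run; the decay only becomes visible with several hundred episodes.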
### Performing discretization 

Discretization converts the continuous state space into a discrete one, so that the agent can learn with a tabular Q-function indexed by position and velocity bins.

In [4]:
def discretization(env, obs):
    
    env_low = env.observation_space.low
    env_high = env.observation_space.high
    
    env_den = (env_high - env_low) / n_states
    pos_den = env_den[0]
    vel_den = env_den[1]
    
    pos_high = env_high[0]
    pos_low = env_low[0]
    vel_high = env_high[1]
    vel_low = env_low[1]
    
    # bin index of each feature, capped so a value on the upper bound stays inside the Q-table
    pos_scaled = min(int((obs[0] - pos_low)/pos_den), n_states - 1)
    vel_scaled = min(int((obs[1] - vel_low)/vel_den), n_states - 1)
    
    return pos_scaled,vel_scaled

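To make the binning concrete, here is a small standalone sketch that does not need gym. It assumes the bounds shown in the observation-space printout earlier in the notebook; the helper name `to_bins` is hypothetical. It also clips the indices to the valid range, so an observation exactly on the upper bound cannot index past the table.

```python
import numpy as np

n_states = 40
# Bounds taken from the observation-space printout earlier in the notebook.
low = np.array([-1.2, -0.07])
high = np.array([0.6, 0.07])


def to_bins(obs):
    """Map a continuous (position, velocity) pair to integer bin indices."""
    width = (high - low) / n_states                        # bin width per feature
    idx = ((np.asarray(obs) - low) / width).astype(int)   # truncate to a bin index
    # Clip so a value on the upper bound still maps to the last bin.
    return tuple(int(i) for i in np.clip(idx, 0, n_states - 1))


print(to_bins([-0.5, 0.01]))   # a car near the valley floor, moving slowly right
print(to_bins([0.6, 0.07]))    # both features at their upper bound
```

The first call lands in position bin 15 and velocity bin 22; the second is clipped to the last bin, 39, for both features.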
### SARSA Algorithm
So far, most of the steps are the same as for the Q-learning algorithm. From here on, the SARSA algorithm updates the Q value and the corresponding Q-table using the action that is actually chosen for the next state.

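The difference from Q-learning comes down to the update target. The sketch below compares the two targets for a single transition; the action values in `q_next`, the reward, and the chosen next action are made-up numbers for illustration only.

```python
import numpy as np

gamma = 0.99
# Hypothetical action values for the next state S' (three actions, as in MountainCar).
q_next = np.array([-2.0, -1.0, -3.0])
reward = -1.0
a_next = 2  # action actually selected in S' (here, an exploratory non-greedy pick)

# SARSA (on-policy): bootstrap from the action the agent will really take.
sarsa_target = reward + gamma * q_next[a_next]

# Q-learning (off-policy): bootstrap from the greedy action, whatever is taken.
q_learning_target = reward + gamma * q_next.max()

print(sarsa_target, q_learning_target)
```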
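The reported total reward is accumulated from the distance between the car's current position and the lowest point of the valley (approximately position -0.5), which is treated as the starting point: the further the car moves from it, the more is added to the total. The Q-table update itself uses the reward returned by the environment.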

In [9]:
q_table = np.zeros((n_states, n_states, env.action_space.n))
total_steps = 0
for episode in range(episodes):
    obs = env.reset()
    total_reward = 0
    # decreasing learning rate alpha over time
    alpha = max(min_lr, initial_lr * (gamma ** (episode // 100)))
    steps = 0

    # action for the initial state using epsilon greedy
    if np.random.uniform(low=0, high=1) < epsilon:
        a = np.random.choice(env.action_space.n)
    else:
        pos, vel = discretization(env, obs)
        a = np.argmax(q_table[pos][vel])

    while True:
        # env.render()
        pos, vel = discretization(env, obs)

        obs, reward, terminate, _ = env.step(a)
        total_reward += abs(obs[0] + 0.5)
        pos_, vel_ = discretization(env, obs)

        # action for the next state using epsilon greedy
        if np.random.uniform(low=0, high=1) < epsilon:
            a_ = np.random.choice(env.action_space.n)
        else:
            a_ = np.argmax(q_table[pos_][vel_])

        # q-table update
        q_table[pos][vel][a] = (1 - alpha) * q_table[pos][vel][a] + alpha * (reward + gamma * q_table[pos_][vel_][a_])
        steps += 1
        if terminate:
            break
        a = a_
    print("Episode {} completed with total reward {} in {} steps".format(episode + 1, total_reward, steps))
while True:  # to hold the render at the last step when Car passes the flag
    env.render()

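The epsilon-greedy selection appears twice in the loop above, once for the initial action and once for the next action. As a sketch only, it could be factored into a helper like the one below; `epsilon_greedy` is a hypothetical name, and it uses NumPy's `Generator` API instead of the global `np.random` functions used in the notebook.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.uniform() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: highest-valued action


rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.7, 0.2])   # made-up action values for one state

# With epsilon = 0 the choice is always greedy, so this prints 1.
print(epsilon_greedy(q_row, epsilon=0.0, rng=rng))
```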
The following code will print the reward points for all steps in each episode. The code runs only in the local..