<a href="https://colab.research.google.com/github/Nov05/Google-Colaboratory/blob/master/20240102_solve_openAI_gym's_taxi_v2_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **\<TOP>**

* created by nov05 on 2020-05-28. changed on 2024-01-02  
* project folder [on google drive](https://drive.google.com/drive/folders/1I_xoXvtDxSrTJ-CkFE662q_MZrtuiYch), [on github](https://github.com/udacity/deep-reinforcement-learning)    
* go to the course ["Solve OpenAI Gym's Taxi-v2 Task"](https://learn.udacity.com/nanodegrees/nd893/parts/6f8342e1-2278-4998-a384-283c136c9f69/lessons/d122ea3e-79eb-457a-bdf1-6c37ff3f18c8/concepts/cc1028c3-f559-4d18-a09a-17f1fef2c6e8)    
* go to [the "Temporal Difference" notebook](https://drive.google.com/file/d/1itgKIuuoXybSGnWwO1PfUsQPAk0ZmPKh)  

# **OpenAI Gym's Taxi-v2 Task**    

Before proceeding, read the description of the environment in subsection 3.1 of [this paper](https://arxiv.org/pdf/cs/9905014.pdf), quoted below.

```
To make the discussion concrete, let us consider the following simple example. Figure 1 shows
a 5-by-5 grid world inhabited by a taxi agent. There are four specially-designated locations in
this world, marked as R(ed), B(lue), G(reen), and Y(ellow). The taxi problem is episodic. In
each episode, the taxi starts in a randomly-chosen square. There is a passenger at one of the
four locations (chosen randomly), and that passenger wishes to be transported to one of the four
locations (also chosen randomly). The taxi must go to the passenger’s location (the “source”), pick
up the passenger, go to the destination location (the “destination”), and put down the passenger
there. (To keep things uniform, the taxi must pick up and drop off the passenger even if he/she
is already located at the destination!) The episode ends when the passenger is deposited at the
destination location.

There are six primitive actions in this domain:
(a) four navigation actions that move the taxi
one square North, South, East, or West,
(b) a Pickup action, and
(c) a Putdown action. Each action
is deterministic. There is a reward of −1 for each action and an additional reward of +20 for
successfully delivering the passenger. There is a reward of −10 if the taxi attempts to execute the
Putdown or Pickup actions illegally. If a navigation action would cause the taxi to hit a wall, the
action is a no-op, and there is only the usual reward of −1.

We seek a policy that maximizes the total reward per episode. There are 500 possible states:
25 squares, 5 locations for the passenger (counting the four starting locations and the taxi), and 4
destinations.
```
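
The count of 500 states in the quoted passage is simply the product 25 × 5 × 4. As a rough illustration, the sketch below packs the four state components into a single integer and unpacks it again. The function names `encode` and `decode` and the ordering of the factors are assumptions made for this example; they are not taken from the paper or from the workspace files.

```python
# Illustrative only: the factor ordering below is an assumption for this sketch.
N_ROWS, N_COLS = 5, 5
N_PASSENGER_LOCS = 5   # four landmarks (R, G, Y, B) plus "inside the taxi"
N_DESTINATIONS = 4

def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one index in [0, 500)."""
    index = taxi_row
    index = index * N_COLS + taxi_col
    index = index * N_PASSENGER_LOCS + passenger_loc
    index = index * N_DESTINATIONS + destination
    return index

def decode(index):
    """Invert encode(): recover (taxi_row, taxi_col, passenger_loc, destination)."""
    index, destination = divmod(index, N_DESTINATIONS)
    index, passenger_loc = divmod(index, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(index, N_COLS)
    return taxi_row, taxi_col, passenger_loc, destination

n_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
print(n_states)                       # 500
print(decode(encode(2, 3, 1, 0)))     # (2, 3, 1, 0)
```

Because every state maps to one small integer, a tabular method can key its action values directly on that integer.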

<img src="https://github.com/Nov05/pictures/blob/master/Udacity/20231221_reinforcement%20learning/taxi-v2.png?raw=true" width=200>

# **Workspace**  
  
The workspace contains three files:

* **agent.py**: Develop your reinforcement learning agent here. This is the only file that you should modify.
* **monitor.py**: The **interact** function tests how well your agent learns from interaction with the environment.
* **main.py**: Run this file in the terminal to check the performance of your agent.

When you run **main.py**, the agent that you specify in agent.py interacts with the environment for 20,000 episodes. The details of the interaction are specified in monitor.py, which returns two variables: **avg_rewards** and **best_avg_reward**.

* **avg_rewards** is a deque where **avg_rewards[i]** is the average (undiscounted) return collected by the agent from episode **i+1** to episode **i+100**, inclusive. For instance, **avg_rewards[0]** is the average return collected by the agent over the first 100 episodes.
* **best_avg_reward** is the largest entry in **avg_rewards**. This is the final score that you should use when determining how well your agent performed in the task.
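
To make the two return values concrete, here is a minimal sketch of how a 100-episode rolling average and its maximum could be computed from a list of per-episode returns. The helper name `rolling_average_returns` is hypothetical, and the real monitor.py may organize this differently.

```python
from collections import deque

def rolling_average_returns(episode_returns, window=100):
    """Compute (avg_rewards, best_avg_reward) from per-episode returns.

    avg_rewards[i] is the mean return over episodes i+1 .. i+window.
    """
    recent = deque(maxlen=window)        # holds only the last `window` returns
    avg_rewards = deque()
    best_avg_reward = float("-inf")
    for episode_return in episode_returns:
        recent.append(episode_return)
        if len(recent) == window:        # wait until a full window is available
            avg = sum(recent) / window
            avg_rewards.append(avg)
            best_avg_reward = max(best_avg_reward, avg)
    return avg_rewards, best_avg_reward

# Toy data: returns 1, 2, ..., 200 stand in for real episode returns.
avg_rewards, best_avg_reward = rolling_average_returns(list(range(1, 201)))
print(avg_rewards[0], best_avg_reward)   # 50.5 150.5
```

With 200 episodes there are 101 full windows, so `avg_rewards` has 101 entries.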

Your assignment is to modify the agent.py file to improve the agent's performance.

* Use the **__init__()** method to define any needed instance variables. Currently, we define the number of actions available to the agent (**nA**) and initialize the action values (**Q**) to an empty dictionary of arrays. Feel free to add more instance variables; for example, you may find it useful to define the value of epsilon if the agent uses an epsilon-greedy policy for selecting actions.
* The **select_action()** method accepts the environment state as input and returns the agent's choice of action. The default code that we have provided randomly selects an action.
* The **step()** method accepts a (**state**, **action**, **reward**, **next_state**) tuple as input, along with the **done** variable, which is **True** if the episode has ended. The default code (which you should certainly change!) increments the action value of the previous state-action pair by 1. You should change this method to use the sampled tuple of experience to update the agent's knowledge of the problem.
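
The three methods above are called from an interaction loop. The sketch below shows one plausible shape of that loop for a single episode, assuming the older Gym convention in which `env.step()` returns `(next_state, reward, done, info)`. `RandomAgent` mirrors the default behavior described above, while `CorridorEnv` and `run_episode` are hypothetical stand-ins invented for this example so that it runs without Gym; they are not the contents of monitor.py.

```python
import numpy as np
from collections import defaultdict

class RandomAgent:
    """Default behavior described above: random actions, placeholder update."""
    def __init__(self, nA=6):
        self.nA = nA
        self.Q = defaultdict(lambda: np.zeros(self.nA))

    def select_action(self, state):
        return np.random.choice(self.nA)

    def step(self, state, action, reward, next_state, done):
        self.Q[state][action] += 1       # placeholder, not a learning rule

class CorridorEnv:
    """Hypothetical toy environment: start at cell 0, goal at the last cell."""
    def __init__(self, length=5, max_steps=50):
        self.length, self.max_steps = length, max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        self.t += 1
        if action == 0:
            self.pos += 1                      # move right
        else:
            self.pos = max(self.pos - 1, 0)    # move left, blocked by the wall
        at_goal = self.pos == self.length - 1
        reward = -1 + (20 if at_goal else 0)   # -1 per action, +20 at the goal
        done = at_goal or self.t >= self.max_steps
        return self.pos, reward, done, {}

def run_episode(env, agent):
    """Run one episode and return the undiscounted return."""
    state = env.reset()
    episode_return = 0
    while True:
        action = agent.select_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.step(state, action, reward, next_state, done)
        episode_return += reward
        state = next_state
        if done:
            return episode_return

np.random.seed(0)
print(run_episode(CorridorEnv(), RandomAgent(nA=2)))
```

The agent never sees the environment object itself; everything it learns must come through the arguments of `step()`.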

# **Evaluate your Performance**

OpenAI Gym defines ["solving"](https://www.gymlibrary.dev/environments/toy_text/taxi/) this task as getting an **average return of 9.7** over 100 consecutive trials.

While this coding exercise is ungraded, we recommend that you try to attain an average return of at least 9.1 over 100 consecutive trials (**best_avg_reward > 9.1**).


# **Not sure where to start?**

Note that this exercise is intentionally open-ended, and we won't provide an official solution. For help with this exercise, please reach out to your instructors and fellow students! As a first step, you should figure out how to adapt your implementation in the **Temporal-Difference Methods** lesson to implement an agent to learn in this new environment. The code will likely be very similar to the notebook from the Temporal-Difference Methods lesson, where you need only modify very few things to fit this slightly different format.
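
As one possible starting point, the sketch below works through a single Sarsamax (Q-learning) update and the action probabilities of an epsilon-greedy policy on made-up numbers. The function names and the values of `alpha`, `gamma`, and `epsilon` are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Probability epsilon/nA for every action, plus 1 - epsilon on the greedy one."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def sarsamax_update(q_sa, reward, next_q_values, alpha, gamma, done):
    """Return the updated Q(s, a) after one Sarsamax (Q-learning) step."""
    # At a terminal transition there is no next state to bootstrap from.
    target = reward if done else reward + gamma * np.max(next_q_values)
    return q_sa + alpha * (target - q_sa)

# Epsilon-greedy over four actions with epsilon = 0.2:
# each action gets 0.05, and the greedy action (index 1) gets 0.05 + 0.8.
print(epsilon_greedy_probs(np.array([0.0, 2.0, 1.0, 0.0]), epsilon=0.2))

# One update: target = -1 + 1.0 * 2.0 = 1.0, so Q(s, a) moves from 0.0 to 0.1.
print(sarsamax_update(0.0, -1, np.array([0.0, 2.0, 1.0]),
                      alpha=0.1, gamma=1.0, done=False))
```

The form `q_sa + alpha * (target - q_sa)` is algebraically the same as `(1 - alpha) * q_sa + alpha * target`.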

# **Share your Results**  

If you arrive at an implementation that you are proud of, please share your results with the student community! You can also reach out to ask questions, get implementation hints, share ideas, or find collaborators!

As a final step towards sharing your ideas with the wider RL community, you may like to create a write-up and submit it to the [OpenAI Gym Leaderboard](https://github.com/openai/gym/wiki/Leaderboard)!

# **The Code**  

In [None]:
## get the "lab-taxi" folder
!gdown --no-check-certificate --folder https://drive.google.com/drive/folders/10uOw8cEHoE2gDn9ZTAi3EuIQWGiQuq6e
!pip uninstall -y gym ## colab pre-installed 0.25.2
!pip install gym==0.9.6
import gym
print(gym.__version__)
## then restart the session (Ctrl+M .) so that the downgraded gym is loaded

In [1]:
%cd lab-taxi
!pwd

/content/lab-taxi
/content/lab-taxi


In [2]:
import gym
env = gym.make('Taxi-v2')
env.action_space.n ## 6


6

In [22]:
import numpy as np
from collections import defaultdict

class Agent:


    def __init__(self, nA=6):
        """ Initialize agent.

        Params
        ======
        - nA: number of actions available to the agent
        """
        self.nA = nA
        self.Q = defaultdict(lambda: np.zeros(self.nA))
        self.alpha = .01
        self.gamma = 1.0
        self.epsilon = 1./20000 ## for Ɛ-greedy(Q)


    def select_action(self, state):
        """ Given the state, select an action.

        Params
        ======
        - state: the current state of the environment

        Returns
        =======
        - action: an integer, compatible with the task's action space
        """
        # return np.random.choice(self.nA) ## equiprobable random policy
        if state in self.Q: ## Ɛ-greedy(Q)
            policy_s = self.get_policy_for_state(self.Q[state])
            return np.random.choice(np.arange(self.nA), p=policy_s)
        else: ## equiprobable random policy
            return np.random.choice(self.nA)


    def step(self, state, action, reward, next_state, done):
        """ Update the agent's knowledge, using the most recently sampled tuple.

        Params
        ======
        - state: the previous state of the environment
        - action: the agent's previous choice of action
        - reward: last reward received
        - next_state: the current state of the environment
        - done: whether the episode is complete (True or False)
        """
        # self.Q[state][action] += 1 ## placeholder update from the starter code
        ## Sarsamax (Q-learning) is off-policy: the update needs only next_state,
        ## not the action the agent will actually take there.
        if done:
            self.update_Q_sarsamax(state, action, reward) ## terminal: no bootstrap
        else:
            self.update_Q_sarsamax(state, action, reward, next_state)


    def get_policy_for_state(self, Q_s): ## Ɛ-greedy(Q)
        policy_s = self.epsilon/self.nA * np.ones(self.nA)
        policy_s[np.argmax(Q_s)] = 1 - self.epsilon + self.epsilon/self.nA
        return policy_s


    def update_Q_sarsamax(self, state, action, reward, next_state=None):
        ## CAUTION: next_state could be 0
        Q_sa_next = self.Q[next_state].max() if next_state is not None else 0
        self.Q[state][action] = (1-self.alpha) * self.Q[state][action] + \
                                self.alpha * (reward + self.gamma*Q_sa_next)

In [24]:
%%time
# from agent import Agent
from monitor import interact
import gym
import numpy as np

env = gym.make('Taxi-v2')
agent = Agent()
avg_rewards, best_avg_reward = interact(env, agent)
## Wall time: 2min 46s
## equiprobable random policy: Episode 20000/20000 || Best average reward -723.32
## Q-Learning (alpha=.01), best average reward by epsilon:
##   1/5000: 9.0 | 1/10000: 9.2 | 1/15000: 9.07 | 1/20000: 9.05, 9.29 | 1/21000: 9.01 | 1/30000: 9.0

Episode 20000/20000 || Best average reward 9.21

CPU times: user 2min 29s, sys: 4.58 s, total: 2min 34s
Wall time: 2min 43s


# **\<BOTTOM>**