<img src="../assets/images/Cover.png" alt="Cover" title="AI2E Cover" />

# AI2E - [Workshop 11] - [RL: Q-learning] - Self-Driving Cab

Let's design a simulation of a self-driving cab. The major goal is to demonstrate, in a simplified environment, how we can use RL techniques to develop an efficient and safe approach for tackling this problem.

<br/> The Smartcab's job is to pick up the passenger at one location and drop him off in another. Here are a few things that we'd love our Smartcab to take care of:
<ul>
    <li>Drop off the passenger at the right location.</li>
    <li>Save the passenger's time by taking the minimum time possible for the trip.</li>
    <li>Take care of the passenger's safety and obey traffic rules.</li>
</ul>

Several aspects need to be considered when modeling an RL solution to this problem: rewards, states, and actions.


## Content 
1. What's OpenAI Gym
2. Taxi-v3 environment
3. Q-Learning conception
<br/> 3.1. Rewards
<br/> 3.2. State space
<br/> 3.3. Action space
4. Q-Learning implementation
<br/> 4.1. Agent Training
<br/> 4.2. Agent Evaluating
5. Conclusion


### What's OpenAI Gym
**OpenAI** is an artificial intelligence research organization that aims to promote and develop artificial general intelligence in ways that benefit humanity as a whole. 
<br/> For our purposes, we will be using OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. Gym provides the environment, Taxi-v3, in which we train our Q-learning agent.
We will use the following functions:
<ul>
<li>env.reset: resets the environment and returns a random initial state.</li>
<li>env.step(action): advances the environment by one timestep.</li>
<li>env.render: renders one frame of the environment (helpful for visualization)</li>

<li>The environment returns:
    <ul>
<li>observation: an environment-specific object representing your observation of the environment </li>
<li>reward: the amount of reward/score achieved by the previous action </li>
<li>done: indicates whether it is time to reset the environment again, for example because the agent has achieved the goal </li>
<li>info: diagnostic info such as performance and latency useful for debugging purposes</li>
    </ul>
</li>
</ul>
In each timestep, our agent performs a specific action, and the environment returns the observation and a reward as a consequence of performing that particular action in the environment.
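
The interaction loop described above can be sketched without Gym itself. In the snippet below, `CorridorEnv` is a hypothetical stand-in (not part of Gym) that only mimics the `reset` / `step` contract and the four returned values; the real Taxi-v3 environment is more involved.

```python
import random


class CorridorEnv:
    """Hypothetical stand-in environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Put the agent back at the start and return the initial observation.
        self.position = 0
        return self.position

    def step(self, action):
        # action 0 = move left, action 1 = move right; the ends of the corridor are walls.
        delta = 1 if action == 1 else -1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 20 if done else -1
        info = {}  # diagnostic information, empty in this sketch
        return self.position, reward, done, info


def run_random_episode(env, rng):
    """Act randomly until the environment reports done."""
    observation = env.reset()
    total_reward, steps, done = 0, 0, False
    while not done:
        action = rng.choice([0, 1])
        observation, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
    return observation, total_reward, steps


final_observation, total_reward, steps = run_random_episode(CorridorEnv(), random.Random(0))
print(final_observation, total_reward, steps)
```

Each call to `step` yields the observation, the reward for the action just taken, the `done` flag, and an `info` dictionary, which is the same shape of result the training loop later in this workshop unpacks.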

### Taxi-v3 environment
This task was introduced in [Dietterich2000] to illustrate some issues in hierarchical reinforcement learning. There are 4 locations (labeled by different letters) and your job is to pick up the passenger at one location and drop him off in another. You receive +20 points for a successful dropoff, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.


### Q-Learning conception


#### Rewards
Since the agent (the imaginary driver) is reward-motivated and is going to learn how to control the cab by trial and error in the environment, we need to decide the rewards and/or penalties and their magnitude accordingly. Here are a few points to consider:
<ul>
    <li> The agent should receive a high positive reward for a successful dropoff because this behavior is highly desired </li>
    <li> The agent should be penalized if it tries to drop off a passenger at the wrong location </li>
    <li> The agent should get a slight negative reward for every time-step in which it has not reached the destination. "Slight" negative because we would prefer our agent to arrive late rather than make wrong moves while trying to reach the destination as fast as possible </li>

</ul>    
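
As a rough illustration of how these magnitudes combine, the sketch below adds up the reward of one episode using the Taxi-v3 values quoted earlier (+20, -1, -10). The helper `episode_return` is hypothetical, and it assumes that each step earns exactly one of the three rewards.

```python
# Reward magnitudes quoted in the Taxi-v3 description.
DROPOFF_REWARD = 20
STEP_PENALTY = -1
ILLEGAL_ACTION_PENALTY = -10


def episode_return(num_steps, illegal_actions=0):
    """Total reward of an episode that ends with a successful dropoff.

    num_steps counts every action taken, including the final dropoff.
    Assumption: every step earns exactly one of the three rewards above.
    """
    if num_steps < 1 or illegal_actions > num_steps - 1:
        raise ValueError("illegal actions must fit in the steps before the final dropoff")
    ordinary_steps = num_steps - 1 - illegal_actions
    return (
        ordinary_steps * STEP_PENALTY
        + illegal_actions * ILLEGAL_ACTION_PENALTY
        + DROPOFF_REWARD
    )


print(episode_return(12))      # 11 ordinary steps, then the dropoff
print(episode_return(12, 2))   # same length, but two illegal pickup/dropoff attempts
```

Under these assumptions, a single illegal action costs as much as ten ordinary steps, which is why the agent learns to prefer a slightly longer route over a wrong pickup or dropoff.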

#### State Space
The state should contain the information the agent needs to choose the right action.

<br/> The state space is the set of all possible situations our taxi could be in. Let's assume the Smartcab is the only vehicle in this parking lot. We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space.

<br/> In the illustration below, the taxi is currently at coordinate (3, 1).

<br/> You'll also notice four (4) locations where we can pick up and drop off a passenger, represented by **R**, **G**, **Y**, **B**, or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates. Our illustrated passenger is at location Y and wishes to go to location R.

<br/> When we also account for one (1) additional passenger state, being inside the taxi, we can take all combinations of passenger locations and destination locations to arrive at the total number of states for our taxi environment: there are four (4) destinations and five (4 + 1) passenger locations.

<br/> **So, our taxi environment has 5×5×5×4=500 total possible states**.
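
The count above can be checked with a small sketch that packs the four components into a single state index. The helpers `encode` and `decode` below are illustrative re-implementations; the mixed-radix ordering is an assumption, chosen because it reproduces the value 328 that `env.encode(3, 1, 2, 0)` prints in the implementation section.

```python
GRID_ROWS, GRID_COLS = 5, 5
PASSENGER_LOCATIONS = 5  # R, G, Y, B, or inside the taxi
DESTINATIONS = 4         # R, G, Y, B


def encode(taxi_row, taxi_col, passenger_index, destination_index):
    """Pack the four state components into one integer (assumed ordering)."""
    index = taxi_row
    index = index * GRID_COLS + taxi_col
    index = index * PASSENGER_LOCATIONS + passenger_index
    index = index * DESTINATIONS + destination_index
    return index


def decode(index):
    """Inverse of encode: recover (row, col, passenger, destination)."""
    index, destination_index = divmod(index, DESTINATIONS)
    index, passenger_index = divmod(index, PASSENGER_LOCATIONS)
    taxi_row, taxi_col = divmod(index, GRID_COLS)
    return taxi_row, taxi_col, passenger_index, destination_index


total_states = GRID_ROWS * GRID_COLS * PASSENGER_LOCATIONS * DESTINATIONS
print(total_states)          # 500
print(encode(3, 1, 2, 0))    # 328
print(decode(328))           # (3, 1, 2, 0)
```

Because every combination maps to a distinct integer from 0 to 499, the state can be used directly as a row index into a Q-table.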

<br/>

<div>
<img src="../assets/images/Self-Driving Cab.png" alt="Self-Driving Cab" title="Self-Driving Cab"/>
</div>

#### Action Space
The agent encounters one of the 500 states and it takes an action. The action in our case can be to move in a direction or decide to pickup/dropoff a passenger.

<br/> In other words, we have six possible actions:
<ul>
    <li> Move south </li> 
    <li> Move north </li> 
    <li> Move east </li> 
    <li> Move west </li> 
    <li> Pickup </li> 
    <li> Dropoff </li> 
</ul>
    
<br/> This is the action space: the set of all the actions that our agent can take in a given state.

You'll notice in the illustration above that the taxi cannot perform certain actions in certain states because of walls. In the environment's code, every wall hit simply incurs a -1 penalty and the taxi stays where it is. These penalties accumulate, which pushes the taxi to learn to go around the wall.
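
A minimal sketch of this "bump into a wall and stay put" behaviour is shown below. The action numbering is an assumption that follows the order of the list above (index 5 for dropoff agrees with the `Action: 5` shown in the training output), and only the outer border of the grid is modelled; the interior walls from the illustration are left out.

```python
# Assumed action numbering, following the order of the list above.
ACTIONS = {0: "south", 1: "north", 2: "east", 3: "west", 4: "pickup", 5: "dropoff"}
GRID_SIZE = 5


def move(taxi_row, taxi_col, action):
    """Return the taxi position after an action.

    Only the outer border is modelled: a move into the border leaves the
    taxi where it is. Interior walls are omitted in this sketch.
    """
    if action == 0:    # south: one row down
        taxi_row = min(taxi_row + 1, GRID_SIZE - 1)
    elif action == 1:  # north: one row up
        taxi_row = max(taxi_row - 1, 0)
    elif action == 2:  # east: one column right
        taxi_col = min(taxi_col + 1, GRID_SIZE - 1)
    elif action == 3:  # west: one column left
        taxi_col = max(taxi_col - 1, 0)
    # pickup (4) and dropoff (5) do not change the position
    return taxi_row, taxi_col


print(move(3, 1, 0))  # moving south from (3, 1)
print(move(0, 0, 1))  # moving north from the top row hits the border
```

Since a blocked move still costs a time-step, repeatedly driving into a wall only accumulates the per-step penalty without making progress.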

### Q-Learning implementation

#### Imports

In [3]:
#!pip install cmake gym[atari] scipy
import gym
import numpy as np
import random
from IPython.display import clear_output
from time import sleep

#### Environment

In [4]:
env = gym.make("Taxi-v3").env
#env.render()
#env.reset() # reset environment to a new, random state
#env.render() 

#print("Action Space {}".format(env.action_space))
#print("State Space {}".format(env.observation_space))

state = env.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)
env.s = state
env.render()
# Q-table: one row per state, one column per action, initialised to zero
q_table = np.zeros([env.observation_space.n, env.action_space.n])



State: 328
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



####  Agent Training

In [5]:
# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    frames = [] # for animation
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        # Put each rendered frame into dict for animation
        frames.append({
            'frame': env.render(mode='ansi'),
            'state': state,
            'action': action,
            'reward': reward
            }
        )
        epochs += 1
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

Episode: 100000
Training finished.
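
To make the two core steps of the training loop easier to follow, here is a small pure-Python sketch of the epsilon-greedy choice and of the update Q(s, a) ← (1 - α) · Q(s, a) + α · (r + γ · max Q(s', a')). The function names `choose_action` and `q_update` are hypothetical helpers introduced for illustration; the default values of `alpha` and `gamma` are the hyperparameters used above.

```python
import random


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore: random action index
    # exploit: index of the largest Q-value (first one wins on ties)
    return max(range(len(q_row)), key=lambda action: q_row[action])


def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """One Q-learning update for a single (state, action) entry."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


# A fresh entry after an ordinary step (reward -1, next state still all zeros).
first = q_update(0.0, -1, 0.0)
# The same entry after a later visit that ends in a successful dropoff (reward 20).
second = q_update(first, 20, 0.0)
print(first, second)

# With epsilon = 0 the choice is purely greedy.
print(choose_action([0.0, 2.0, 1.0], 0.0, random.Random(0)))
```

The first update moves the entry to about -0.1 and the second to about 1.91: with `alpha = 0.1`, each visit shifts the stored value only a tenth of the way towards the new estimate, which is why many episodes are needed before the table settles.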



In [6]:
def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'].getvalue())
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 18
State: 97
Action: 5
Reward: 20


#### Agent Evaluating

In [8]:
total_epochs, total_penalties = 0, 0
episodes = 100
frames = [] # for animation

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])  # greedy action from the Q-table only, no exploration
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1
        # Put each rendered frame into dict for animation
        frames.append({
            'frame': env.render(mode='ansi'),
            'state': state,
            'action': action,
            'reward': reward
            }
        )

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs
print_frames(frames)
print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 1226
State: 97
Action: 5
Reward: 20
Results after 100 episodes:
Average timesteps per episode: 12.26
Average penalties per episode: 0.0


## Conclusion 
Alright! We began by understanding Reinforcement Learning with the help of real-world analogies, then dived into its basic concepts. After that, we framed a self-driving cab as a Reinforcement Learning problem and used OpenAI's Gym in Python to provide a matching environment in which to develop and evaluate our agent. Finally, we implemented the Q-learning algorithm from scratch. The agent's performance improved significantly with Q-learning: after training, it drops off the passenger at the right destination with 0 penalties on average over the 100 evaluation episodes. During the training phase it was still taking random actions that did not necessarily lead to the desired destination, but towards the end it was taking only the right ones.
#### References
<ul>
    <li>https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/</li>
    <li>https://medium.com/@anirbans17/reinforcement-learning-for-taxi-v2-edd7c5b76869</li>
</ul>