# Train a Smartcab to Drive

#### Summary

In the not-so-distant future, taxicab companies across the United States no longer employ human drivers to operate their fleets of vehicles. Instead, the taxicabs are operated by self-driving agents, known as smartcabs, that transport people from one location to another within the cities those companies serve. In major metropolitan areas such as Chicago, New York City, and San Francisco, an increasing number of people have come to rely on smartcabs to get where they need to go as safely and efficiently as possible. Although smartcabs have become the transport of choice, concerns have arisen that a self-driving agent might not be as safe or efficient as a human driver, particularly when it comes to city traffic lights and other vehicles.

To alleviate these concerns, our task is to use reinforcement learning techniques to construct a demonstration of a smartcab operating in real time, showing that both safety and efficiency can be achieved.


#### Environment

The smartcab operates in an ideal, grid-like city (similar to New York City), with roads going in the North-South and East-West directions. Other vehicles will certainly be present on the road, but there will be no pedestrians to be concerned with. At each intersection there is a traffic light that either allows traffic in the North-South direction or the East-West direction. U.S. Right-of-Way rules apply:

* On a green light, a left turn is permitted if there is no oncoming traffic making a right turn or coming straight through the intersection.
* On a red light, a right turn is permitted if no oncoming traffic is approaching from your left through the intersection.

To understand how to correctly yield to oncoming traffic when turning left, you may refer to this official drivers’ education video, or this passionate exposition.
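
The two right-of-way rules above can be written as a small predicate. This is a minimal sketch, not the simulator's own code: the function name `is_legal` is hypothetical, and it assumes that traffic inputs take the values `None`, `'forward'`, `'left'`, or `'right'`, and that idling is always allowed.

```python
def is_legal(light, oncoming, left, action):
    """Return True if `action` obeys the two right-of-way rules (sketch)."""
    if action is None:
        # Assumption: idling at the intersection never breaks a rule.
        return True
    if light == 'green':
        if action == 'left':
            # Yield to oncoming traffic going straight or turning right.
            return oncoming not in ('forward', 'right')
        return True  # forward or right on green
    # Red light: only a right turn, and only if nothing is coming
    # straight through the intersection from the left.
    return action == 'right' and left != 'forward'
```

For example, `is_legal('red', None, None, 'right')` is allowed, while `is_legal('red', None, 'forward', 'right')` is not.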

#### Inputs and Outputs

The smartcab has only an egocentric view of the intersection it is at: It can determine the state of the traffic light for its direction of movement, and whether there is a vehicle at the intersection for each of the oncoming directions. For each action, the smartcab may either idle at the intersection, or drive to the next intersection to the left, right, or ahead of it. Finally, each trip has a time to reach the destination which decreases for each action taken (the passengers want to get there quickly). If the allotted time becomes zero before reaching the destination, the trip has failed.

#### Rewards and Goal

The smartcab receives a reward for each successfully completed trip, and also receives a smaller reward for each action it executes successfully that obeys traffic rules. The smartcab receives a small penalty for any incorrect action, and a larger penalty for any action that violates traffic rules or causes an accident with another vehicle. Based on the rewards and penalties the smartcab receives, the self-driving agent implementation will learn an optimal policy for driving on the city roads while obeying traffic rules, avoiding accidents, and reaching passengers’ destinations in the allotted time.

### Implement a Basic Driving Agent

Our first task is to get our smartcab moving around our environment. During this task, no attention will be spent on finding the optimal driving policy. Using the `random` module, we have our smartcab randomly select an action from `[None, 'forward', 'left', 'right']`.

    action = random.choice(Environment.valid_actions)

**Question:** What do we see in the agent's behavior as it takes random actions? Does the smartcab eventually make it to the destination? Are there any other interesting observations to note?

**Answer**: When taking a random walk through the environment, our smartcab sometimes makes it to the destination and sometimes hits the hard deadline. Either way, it is clear that the route taken is not optimal. Sometimes the smartcab will sit still at an intersection even though there is neither oncoming traffic nor a red light. Sometimes the smartcab will run a red light with complete disregard for the reward system. When the smartcab does reach the destination, it does not learn anything from the trip. The next trip will again be completely random.

### Inform the Driving Agent

Our next task is to identify a set of states that are appropriate for modeling the smartcab and environment. This is implemented in the following code.

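    self.next_waypoint = self.planner.next_waypoint()
    inputs = self.env.sense(self)
    # Look inputs up by key rather than by position, since the ordering
    # of dict items is not guaranteed; 'right' is deliberately left out.
    self.state = (inputs['light'], inputs['oncoming'], inputs['left'], self.next_waypoint)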

**Question:** What states have we identified that are appropriate for modeling the smartcab and environment? Why do we believe each of these states to be appropriate for this problem?

**Answer**: I chose to use a combination of environmental inputs as the state for our smartcab. The first input I chose was `'light'`. If `'light'` is red, I know I can turn right or stay still. If `'light'` is green, I know I can turn right, go straight, or maybe turn left. The second input I chose was `'oncoming'`. If there is oncoming traffic going straight or turning right and the light is green, I know I can’t turn left. The third input I chose was `'left'`. If traffic from the left is coming straight through and the light is red, I know I can’t turn right. The fourth input I used was `'next_waypoint'`. This is the direction of the next waypoint toward the destination relative to our current position, and it will be useful for deciding how to act optimally given a time constraint.

I did not include the input `'right'` since, under the rules above, traffic coming from the right never affects our decision. I considered using `'deadline'` but felt that it should be irrelevant: time pressure is already captured by the reward system, and I do not want my agent to start breaking laws when time is running out. It would also blow up our state space, causing us to suffer from the curse of dimensionality.
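
To make the state-space argument concrete, the sketch below counts the states implied by the four chosen inputs. The value sets are assumptions (the text does not enumerate them), and the deadline range of 30 is purely hypothetical, chosen only to show how quickly the table would grow.

```python
from itertools import product

# Assumed value sets for each input; not spelled out in the text above.
LIGHT = ['red', 'green']
TRAFFIC = [None, 'forward', 'left', 'right']  # used for 'oncoming' and 'left'
WAYPOINT = ['forward', 'left', 'right']
ACTIONS = [None, 'forward', 'left', 'right']

states = list(product(LIGHT, TRAFFIC, TRAFFIC, WAYPOINT))
n_states = len(states)                 # 2 * 4 * 4 * 3 = 96
n_q_entries = n_states * len(ACTIONS)  # 96 * 4 = 384

# Hypothetical: a deadline with 30 distinct values multiplies the table size.
HYPOTHETICAL_DEADLINES = 30
n_q_with_deadline = n_q_entries * HYPOTHETICAL_DEADLINES  # 11520
```

Under these assumptions the Q-table has at most 384 entries, small enough to be visited many times in 100 trials, whereas adding the deadline would leave most entries unvisited.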

### Implement a Q-Learning Driving Agent

Now that our driving agent can interpret input information, our next task is to implement a Q-Learning algorithm for our driving agent to choose the best action at each step. This will be based on the reward given for each action at each step. To implement this, I have created a `QTable` class that stores Q-values keyed by (state, action) pairs.


    class QTable(object):
        """ 
        Table to store Q-learning data Q(s,a).  
        Does not scale well with increasing sizes of state/action space.
        """
    
        def __init__(self):
            self.Qhat = dict()

        def get(self, state, action):
            key = (state, action)
            return self.Qhat.get(key, None)

        def set(self, state, action, q):
            key = (state, action)
            self.Qhat[key] = q

**Question:** What changes do we notice in the agent's behavior when compared to the basic driving agent when random actions were always taken? Why is this behavior occurring?

**Answer:** It quickly becomes apparent that Q-Learning is giving our smartcab a better appreciation of the environment. For example, the smartcab is actually making an effort to reach the destination. Before, the smartcab would often sit still at a green light, but that happens less frequently now. The smartcab is also beginning to understand traffic rules, avoiding the penalties for running red lights.

This behavior is occurring because, as the Q-table is updated throughout the trials, the agent starts to understand what it gets rewarded for and what it gets punished for. Through this reinforcement, the agent starts to formulate a policy for maximizing value at each state.

### Improve the Q-Learning Driving Agent

Our final task is to enhance our driving agent so that, after sufficient training, the smartcab is able to reach the destination within the allotted time safely and efficiently. To implement this, we will include `alpha`, `gamma`, and `epsilon` as parameters for our Q-Learning algorithm. The main chunk of learning code is:

    # Q(s,a) <-- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    # Estimate max_a' Q(s',a'), ignoring (s',a') pairs not yet visited
    next_q = [self.Qhat.get(next_state, a) for a in self.val_actions]
    next_q = [v for v in next_q if v is not None]
    future_util = max(next_q) if next_q else 0.0

    # Get current q from the Q-table
    q = self.Qhat.get(self.state, action)

    # Temporal-difference update; an unseen (s,a) pair starts at the reward
    if q is None:
        q = reward
    else:
        # Old value + learning rate * (reward + discount * est. future value - old value)
        q += self.alpha * (reward + self.gamma * future_util - q)

    self.Qhat.set(self.state, action, q)
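
A worked example may help show what the update rule does to a single Q-value. The helper `q_update` and the numbers below are illustrative only and are not taken from an actual run.

```python
def q_update(q, reward, future_util, alpha, gamma):
    """One Q-learning step: move q toward reward + gamma * future_util."""
    return q + alpha * (reward + gamma * future_util - q)

# Illustrative numbers only (not from a real run).
# Target = 2.0 + 0.25 * 4.0 = 3.0, so with alpha = 0.5 the old value 1.0
# moves halfway toward the target.
new_q = q_update(q=1.0, reward=2.0, future_util=4.0, alpha=0.5, gamma=0.25)
```

With `alpha = 1` the old value is replaced by the target outright, and with `alpha = 0` it is left unchanged.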
        
**Question:** Report the different values for the parameters tuned in our basic implementation of Q-Learning. For which set of parameters does the agent perform best? How well does the final driving agent perform?

**Answer:**

| `alpha` | `gamma` | `epsilon` | run 1 (# passed) | run 2 (# passed) | run 3 (# passed) | average |
|---------|---------|-----------|-----------------|-----------------|-----------------|---------|
| 1 --> 0 | 0.5     | 0.9       | 24              | 27              | 24              | 25      |
| 1 --> 0 | 0.5     | 0.5       | 61              | 51              | 68              | 60      |
| 1 --> 0 | 0.5     | 0.05      | 81              | 63              | 76              | 73      |
| 1 --> 0 | 0.5     | 0.1       | 71              | 85              | 80              | 79      |
| 1 --> 0 | 0.7     | 0.1       | 79              | 76              | 89              | 81      |
| 1 --> 0 | 0.17    | 0.05      | 97              | 79              | 76              | 84      |
| 1 --> 0 | 0.1     | 0.1       | 93              | 83              | 90              | 89      |
| 1 --> 0 | 0.3     | 0.1       | 93              | 87              | 89              | 90      |
| 1 --> 0 | 0.2     | 0.05      | 82              | 98              | 92              | 91      |
| 1 --> 0 | 0.23    | 0.05      | 83              | 98              | 91              | 91      |
| 1 --> 0 | 0.2     | 0.1       | 94              | 93              | 86              | 91      |
| 1 --> 0 | 0.25    | 0.05      | 87              | 93              | 96              | 92      |

It is worth going over how `alpha`, `gamma`, and `epsilon` affect our algorithm.
* `alpha`: The learning rate determines the extent to which new information overrides old information. A factor of 0 means the agent will not learn anything, while a factor of 1 means the agent will only consider the newest information. I felt that learning should diminish over time, so I set `alpha` to start at 1 and then approach zero over the course of the trials.
    * `self.alpha = 1.0 / self.counter`
* `gamma`: The discount factor determines how important future rewards will be. A factor of 0 means the agent will only consider immediate rewards, while a factor approaching 1 means the agent will strive for a long-term high reward. Through trial and error I arrived at a value of 0.25 for `gamma`.
    * `self.gamma = 0.25`
* `epsilon`: The exploration factor determines how frequently we choose a random action. A factor of 0 means the agent will always exploit its current Q-values and never explore, while a factor of 1 means the agent will always choose randomly and never use what it has learned. Through trial and error I arrived at a value of 0.05 for `epsilon`.
    * `self.epsilon = 0.05`
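
The roles of `epsilon` and `alpha` described above can be sketched as follows. This is a hedged illustration rather than the agent's actual code: `choose_action` and `decayed_alpha` are hypothetical helpers, and treating unseen actions as 0.0 with a random tie-break is an assumption of the sketch.

```python
import random

def choose_action(q_values, actions, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit.

    q_values maps action -> Q-value for the current state; actions with
    no entry are treated as 0.0 (an assumption of this sketch).
    """
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: pick any action
    best = max(q_values.get(a, 0.0) for a in actions)
    ties = [a for a in actions if q_values.get(a, 0.0) == best]
    return rng.choice(ties)  # exploit, breaking ties at random

def decayed_alpha(counter):
    """Learning rate 1/counter; float division avoids truncation to 0."""
    return 1.0 / counter
```

With `epsilon = 0.05`, roughly one action in twenty is random, which keeps some exploration going without undoing much of what has been learned.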
    
Our final agent performs fairly well. Out of 100 trials, the smartcab reached its destination on time in 92 on average.

**Question:** Does our agent get close to finding an optimal policy, i.e. reach the destination in the minimum possible time, and not incur any penalties? How would you describe an optimal policy for this problem?

**Answer:** Our agent is successful 92% of the time, and it typically passes all of the last 10 trials. However, there are still some issues with the agent. The following table shows each time the agent receives a negative reward in the last 10 trials.

| trial |                                      input                                     | waypoint |  action | reward | destination reached |
|:-----:|:------------------------------------------------------------------------------:|:--------:|:-------:|:------:|:-------------------:|
|   91  | [('light', 'green'), ('oncoming', None), ('right', None), ('left', 'forward')] |   right  | forward |  -0.5  |         yes         |
|   92  |     [('light', 'red'), ('oncoming', None), ('right', None), ('left', None)]    |  forward |   left  |  -1.0  |         yes         |
|   93  |     [('light', 'red'), ('oncoming', None), ('right', None), ('left', None)]    |  forward |  right  |  -0.5  |         yes         |
|   94  |     [('light', 'red'), ('oncoming', None), ('right', None), ('left', None)]    |  forward |  right  |  -0.5  |         yes         |
|   96  |  [('light', 'red'), ('oncoming', None), ('right', None), ('left', 'forward')]  |  forward |   left  |  -1.0  |         yes         |

This would not pass muster in a real-world setting. In four of these five cases the agent acts on a red light: two are illegal left turns (reward -1.0), and two are right turns that the right-of-way rules permit but that deviate from the waypoint (reward -0.5). The optimal path isn't always taken either. Despite these issues, I still think it is possible to find an optimal policy with some more experimentation. I could try decaying `epsilon`, creating new tiebreakers, and changing the initial Q-values.
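
One way the `epsilon` decay idea could look is sketched below. The exponential schedule, the helper name `decayed_epsilon`, and the `start` and `decay` values are all hypothetical and have not been tested on the smartcab; only the floor of 0.05 matches the value tuned above.

```python
def decayed_epsilon(trial, start=1.0, floor=0.05, decay=0.9):
    """Exponentially decay epsilon per trial, never dropping below `floor`."""
    return max(floor, start * decay ** trial)
```

The intent is to explore heavily in early trials, when the Q-table is mostly empty, and to act almost greedily in the later trials where the penalties above were observed.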

I would describe an optimal policy as:
1. Choose the path that minimizes travel distance to the waypoint.
2. Follow that path while observing traffic laws and avoiding collisions.
3. Repeat at each intersection until the destination is reached.
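
The policy described in this list can be sketched as a single function. This is an illustration under stated assumptions, not the learned policy: `optimal_action` is a hypothetical name, the planner's waypoint is assumed to point along a shortest path, and idling is assumed to be the right fallback whenever the waypoint move is not permitted.

```python
def optimal_action(light, oncoming, left, waypoint):
    """Follow the waypoint when right-of-way rules allow it, else idle."""
    if light == 'green':
        # Only a left turn can be blocked on green.
        blocked = waypoint == 'left' and oncoming in ('forward', 'right')
    else:
        # Red: only a right turn with no car coming straight from the left.
        blocked = not (waypoint == 'right' and left != 'forward')
    return None if blocked else waypoint
```

For the state in trial 91 of the table (green light, waypoint `'right'`), this returns `'right'` rather than the `'forward'` the agent chose.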