# Smart cab report
## Definitions
### Environment
The smartcab operates in an ideal, grid-like city (similar to New York City), with roads going in the North-South and East-West directions. Other vehicles will certainly be present on the road, but there will be no pedestrians to be concerned with. At each intersection there is a traffic light that either allows traffic in the North-South direction or the East-West direction. U.S. Right-of-Way rules apply:

> On a green light, a left turn is permitted if there is no oncoming traffic making a right turn or coming straight through the intersection.

> On a red light, a right turn is permitted if no oncoming traffic is approaching from your left through the intersection. 
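
The two right-of-way rules quoted above can be sketched as a small legality check. This is only an illustration under assumed names: `is_legal` and its parameters are hypothetical and are not part of the project's `environment` module.

```python
def is_legal(action, light, oncoming=None, left=None):
    """Return True if `action` obeys the right-of-way rules quoted above.

    Hypothetical helper for illustration only.
    action:   None (idle), 'forward', 'left', or 'right'
    light:    'green' or 'red'
    oncoming: intended move of the oncoming vehicle, or None if absent
    left:     intended move of the vehicle approaching from the left, or None
    """
    if action is None:
        return True  # idling never breaks a rule
    if light == 'green':
        if action == 'left':
            # yield to oncoming traffic going straight or turning right
            return oncoming not in ('forward', 'right')
        return True  # forward and right are allowed on green
    # red light: only a right turn can be allowed
    if action == 'right':
        return left != 'forward'
    return False


print(is_legal('left', 'green', oncoming='forward'))  # False
print(is_legal('right', 'red', left=None))            # True
```

Idling is treated as always legal here, which is an assumption of the sketch rather than something stated in the rules.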


### Inputs and Outputs
Assume that the smartcab is assigned a route plan based on the passengers’ starting location and destination. The route is split at each intersection into waypoints, and you may assume that the smartcab, at any instant, is at some intersection in the world. Therefore, the next waypoint to the destination, assuming the destination has not already been reached, is one intersection away in one direction (North, South, East, or West). 

The smartcab has only an egocentric view of the intersection it is at: It can determine the state of the traffic light for its direction of movement, and whether there is a vehicle at the intersection for each of the oncoming directions. For each action, the smartcab may either idle at the intersection, or drive to the next intersection to the left, right, or ahead of it. 

Finally, each trip has a time to reach the destination which decreases for each action taken (the passengers want to get there quickly). If the allotted time becomes zero before reaching the destination, the trip has failed.

### Rewards and Goal
The smartcab receives a reward for each successfully completed trip, and also receives a smaller reward for each action it executes successfully that obeys traffic rules. 

The smartcab receives a small penalty for any incorrect action, and a larger penalty for any action that violates traffic rules or causes an accident with another vehicle. 

Based on the rewards and penalties the smartcab receives, the self-driving agent implementation should learn an optimal policy for driving on the city roads while obeying traffic rules, avoiding accidents, and reaching passengers’ destinations in the allotted time.

## Tasks
### Project Report
You will be required to submit a project report along with your modified agent code as part of your submission. As you complete the tasks below, include thorough, detailed answers to each question provided in italics.


### Task 1: Implement a Basic Driving Agent
To begin, your only task is to get the smartcab to move around in the environment. At this point, you will not be concerned with any sort of optimal driving policy. Note that the driving agent is given the following information at each intersection:
> The next waypoint location relative to its current location and heading.

> The state of the traffic light at the intersection and the presence of oncoming vehicles from other directions.

> The current time left from the allotted deadline.

To complete this task, simply have your driving agent choose a random action from the set of possible actions (None, 'forward', 'left', 'right') at each intersection, disregarding the input information above. Set the simulation deadline enforcement, enforce_deadline to False and observe how it performs.
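
A minimal sketch of the random choice described above, independent of the simulator. The names `VALID_ACTIONS` and `random_action` are assumptions made for this example; the action values are the ones listed in the task.

```python
import random

# The action set named in the task description; None means "idle".
VALID_ACTIONS = [None, 'forward', 'left', 'right']


def random_action(rng=random):
    """Pick one action uniformly at random, ignoring all sensor input."""
    return rng.choice(VALID_ACTIONS)


rng = random.Random(0)  # seeded only so the demo is repeatable
print([random_action(rng) for _ in range(5)])
```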

In [43]:
import environment  # I modified environment.py to print the minimum distance so that trials can be compared.
import planner
import simulator


### QUESTION 1: 
Observe what you see with the agent's behavior as it takes random actions. Does the smartcab eventually make it to the destination? Are there any other interesting observations to note?

**What I observed:**
> First, not all agents reached their destination. The destination is reached only by chance and with low efficiency, so the agent has almost no possibility of getting there.

> The agent chooses its action randomly, regardless of the traffic rules.

In [44]:
import random
from environment import Agent, Environment
from planner import RoutePlanner
from simulator import Simulator

class LearningAgent(Agent):
    def __init__(self, env):
        super(LearningAgent, self).__init__(env)  # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        # TODO: Update state
        
        # TODO: Select action according to your policy
        action = random.choice(Environment.valid_actions)  # random action, ignoring all inputs

        # Execute action and get reward
        reward = self.env.act(self, action)

        # TODO: Learn policy based on state, action, reward

        #print "LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, reward)  # [debug]
        # I chose not to print every agent status update because the volume of output may overwhelm the computer.

def run():
    """Run the agent for a finite number of trials."""

    # Set up environment and agent
    e = Environment()  # create environment (also adds some dummy traffic)
    a = e.create_agent(LearningAgent)  # create agent
    e.set_primary_agent(a, enforce_deadline=False)  # specify agent to track
    # NOTE: You can set enforce_deadline=False while debugging to allow longer trials

    # Now simulate it
    sim = Simulator(e, update_delay=0.001, display=False)  # create simulator (uses pygame when display=True, if available)
    # NOTE: To speed up simulation, reduce update_delay and/or set display=False

    sim.run(n_trials=100)  # run for a specified number of trials
    # NOTE: To quit midway, press Esc or close pygame window, or hit Ctrl+C on the command-line


if __name__ == '__main__':
    run()


Simulator.run(): Trial 0
Environment.reset(): Trial set up with start = (8, 5), destination = (3, 5), deadline = 25
At least the distance is 5.
Environment.step(): Primary agent hit hard time limit (-100)! Trial aborted.
Simulator.run(): Trial 1
Environment.reset(): Trial set up with start = (7, 1), destination = (2, 5), deadline = 45
At least the distance is 9.
Environment.step(): Primary agent hit hard time limit (-100)! Trial aborted.
Simulator.run(): Trial 2
Environment.reset(): Trial set up with start = (4, 4), destination = (1, 5), deadline = 20
At least the distance is 4.
Environment.step(): Primary agent hit hard time limit (-100)! Trial aborted.
Simulator.run(): Trial 3
Environment.reset(): Trial set up with start = (5, 5), destination = (1, 3), deadline = 30
At least the distance is 6.
Environment.step(): Primary agent hit hard time limit (-100)! Trial aborted.
Simulator.run(): Trial 4
Environment.reset(): Trial set up with start = (8, 1), destination = (4, 4), deadline = 35


### Task 2: Inform the Driving Agent
Now that your driving agent is capable of moving around in the environment, your next task is to identify a set of states that are appropriate for modeling the smartcab and environment. 

The main source of state variables are the current inputs at the intersection, but not all may require representation. You may choose to explicitly define states, or use some combination of inputs as an implicit state. 

At each time step, process the inputs and update the agent's current state using the self.state variable. Continue with the simulation deadline enforcement enforce_deadline being set to False, and observe how your driving agent now reports the change in state as the simulation progresses.
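
One way to turn the sensed inputs into a hashable state is sketched below. The helper name `build_state` and the example input dictionary are assumptions for illustration; the real inputs come from `self.env.sense(self)`.

```python
def build_state(inputs, next_waypoint):
    """Combine sensed inputs and the next waypoint into a hashable state."""
    state = dict(inputs)  # copy so the sensed inputs are not mutated
    state['next_waypoint'] = next_waypoint
    # Sorting by key gives the same tuple regardless of dict ordering.
    return tuple(sorted(state.items()))


# Hypothetical sensor reading, for illustration only.
example_inputs = {'light': 'red', 'oncoming': None, 'left': 'forward'}
state = build_state(example_inputs, 'right')
print(state)
```

A tuple is used because it can serve as a dictionary key, which is what a Q-table lookup needs.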

In [45]:
import random
from environment import Agent, Environment
from planner import RoutePlanner
from simulator import Simulator

class LearningAgent(Agent):
    """An agent that learns to drive in the smartcab world."""

    def __init__(self, env):
        super(LearningAgent, self).__init__(env)  # sets self.env = env, state = None, next_waypoint = None, and a default color
        self.color = 'red'  # override color
        self.planner = RoutePlanner(self.env, self)  # simple route planner to get next_waypoint
        # TODO: Initialize any additional variables here
        self.reward = 0
        self.next_waypoint = None

    def reset(self, destination=None):
        self.planner.route_to(destination)
        # TODO: Prepare for a new trip; reset any variables here, if required
        self.state = None
        self.next_waypoint = None

    def update(self, t):
        # Gather state_factors
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        state_factors = self.env.sense(self)
        deadline = self.env.get_deadline(self)
        # TODO: Update state

        # TODO: Select action according to your policy
        action = random.choice(Environment.valid_actions)

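        # Right-of-way check modeled on the DummyAgent:
        # if moving toward next_waypoint would break a traffic rule, idle instead of taking the random action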
        action_okay = True
        if self.next_waypoint == 'right':
            if state_factors['light'] == 'red' and state_factors['left'] == 'forward':
                action_okay = False

        elif self.next_waypoint == 'forward':  # waypoints use 'forward', matching the valid actions
            if state_factors['light'] == 'red':
                action_okay = False

        elif self.next_waypoint == 'left':
            if state_factors['light'] == 'red' or state_factors['oncoming'] == 'forward' or state_factors['oncoming'] == 'right':
                action_okay = False

        if not action_okay:
            action = None

        self.state = state_factors
        self.state['next_waypoint'] = self.next_waypoint
        self.state = tuple(sorted(self.state.items()))
        # Execute action and get reward
        reward = self.env.act(self, action)
        # TODO: Learn policy based on state, action, reward
        #print "LearningAgent.update(): deadline = {}, state_factors = {}, action = {}, reward = {}".format(deadline, state_factors, action, reward)  # [debug]

def run():
    """Run the agent for a finite number of trials."""

    # Set up environment and agent
    e = Environment()  # create environment (also adds some dummy traffic)
    a = e.create_agent(LearningAgent)  # create agent
    e.set_primary_agent(a, enforce_deadline=True)  # specify agent to track
    # NOTE: You can set enforce_deadline=False while debugging to allow longer trials

    # Now simulate it
    sim = Simulator(e, update_delay=0.001, display=True)  # create simulator (uses pygame when display=True, if available)
    # NOTE: To speed up simulation, reduce update_delay and/or set display=False

    sim.run(n_trials=100)  # run for a specified number of trials
    # NOTE: To quit midway, press Esc or close pygame window, or hit Ctrl+C on the command-line


if __name__ == '__main__':
    run()


Simulator.__init__(): Unable to import pygame; display disabled.
ImportError: No module named pygame
Simulator.run(): Trial 0
Environment.reset(): Trial set up with start = (7, 2), destination = (1, 5), deadline = 45
At least the distance is 9.
Environment.step(): Primary agent ran out of time! Trial aborted.
Simulator.run(): Trial 1
Environment.reset(): Trial set up with start = (4, 1), destination = (1, 6), deadline = 40
At least the distance is 8.
Environment.step(): Primary agent ran out of time! Trial aborted.
Simulator.run(): Trial 2
Environment.reset(): Trial set up with start = (7, 5), destination = (4, 1), deadline = 35
At least the distance is 7.
Environment.act(): Primary agent has reached destination 
with time step for 20,
'final' reward is 12.0.
Simulator.run(): Trial 3
Environment.reset(): Trial set up with start = (2, 4), destination = (6, 6), deadline = 30
At least the distance is 6.
Environment.step(): Primary agent ran out of time! Trial aborted.
Simulator.run(): Tri

### QUESTION 2: 
What states have you identified that are appropriate for modeling the smartcab and environment? Why do you believe each of these states to be appropriate for this problem?

**The factors of state:**

> Traffic light: the color of the traffic light. The agent acts based on the light, and obeying or disobeying the rules yields different rewards, so the light should be included in the state.

> Other agents: other vehicles at the same intersection. They may act differently from our agent, and our agent must not violate their right of way. Avoiding conflicts earns a good reward, so this should also be included in the state.

> Next waypoint: the direction the agent should take toward the destination, regardless of whether the traffic rules allow it. This is the choice the agent is responsible for and that earns the reward or penalty, so it should be included in the state.

> Deadline: this measures performance, that is, whether the agent reaches the destination in time. It is not essential to the state because it only produces a penalty in the final steps.

> Locations: specific locations are needed to update the environment, but they do not affect the behavior of the agent because the agent learns from its actions. So location is not needed in the state.

### OPTIONAL:
How many states in total exist for the smartcab in this environment? Does this number seem reasonable given that the goal of Q-Learning is to learn and make informed decisions about each state? Why or why not?

**States numbers:**
> Actions: None, Forward, Left, Right

> Traffic Lights: Red, Green

> Other agents at the same intersection: Oncoming (None, Forward, Left, Right), Left (None, Forward, Left, Right)

> So the total number of situations is 4 x 2 x 4 x 4 = 128.

> I chose these conditions for their actual meaning: each one affects whether the agent reaches the destination. Over time, Q-Learning adjusts the score of each condition toward the best reward. Values such as the start and destination, which change only at reset rather than at every time step, would carry no meaning as state.
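
The count above can be checked by enumerating every combination of the four factors listed. The variable names are chosen for this sketch only.

```python
from itertools import product

# The four factors and their values, as listed above.
waypoints = [None, 'forward', 'left', 'right']
lights = ['red', 'green']
oncoming = [None, 'forward', 'left', 'right']
left = [None, 'forward', 'left', 'right']

all_states = list(product(waypoints, lights, oncoming, left))
print(len(all_states))  # 128
```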



### Task 3: Implement a Q-Learning Driving Agent
With your driving agent being capable of interpreting the input information and having a mapping of environmental states, your next task is to implement the Q-Learning algorithm for your driving agent to choose the best action at each time step, based on the Q-values for the current state and action. 

Each action taken by the smartcab will produce a reward which depends on the state of the environment. The Q-Learning driving agent will need to consider these rewards when updating the Q-values. 

Once implemented, set the simulation deadline enforcement enforce_deadline to True. Run the simulation and observe how the smartcab moves about the environment in each trial.
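
As a worked example of a single Q-value update, the sketch below applies the standard Q-Learning rule once. The learning rate 0.8, discount factor 0.5, and starting Q-value 10.0 match the agent in this report; the reward of 2.0 is a made-up number used only for the arithmetic, and `q_update` is a hypothetical helper name.

```python
def q_update(old_q, reward, max_next_q, alpha, gamma):
    """One Q-Learning step: move old_q toward reward + gamma * max_next_q."""
    return old_q + alpha * (reward + gamma * max_next_q - old_q)


# 10.0 + 0.8 * (2.0 + 0.5 * 10.0 - 10.0) = 10.0 - 2.4
new_q = q_update(old_q=10.0, reward=2.0, max_next_q=10.0, alpha=0.8, gamma=0.5)
print(round(new_q, 2))  # 7.6
```

With a high starting value, a small reward pulls the estimate down, so untried actions keep looking attractive until they have been tried.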

In [46]:
import random
from environment import Agent, Environment
from planner import RoutePlanner
from simulator import Simulator


class QLearningAgent(Agent):
    def __init__(self, env):
        super(QLearningAgent, self).__init__(env)
        self.color = 'red'
        self.planner = RoutePlanner(self.env, self)
        self.deadline = self.env.get_deadline(self)
        self.next_waypoint = None
        self.moves = 0

        self.qDict = dict()
        self.alpha = 0.8  # learning rate
        self.epsilon = 0.2  # exploration rate: probability of taking a random action
        self.gamma = 0.5  # discount factor

        self.state = None
        self.new_state = None

        self.reward = None
        self.cum_reward = 0

        self.possible_actions = Environment.valid_actions
        self.action = None

    def reset(self, destination = None):
        self.planner.route_to(destination)
        self.next_waypoint = None
        self.moves = 0

        self.state = None
        self.new_state = None

        self.reward = 0
        self.cum_reward = 0

        self.action = None

    def getQvalue(self, state, action):
        key = (state, action)
        return self.qDict.get(key, 10.0)

    def getMaxQ(self, state):
        q = [self.getQvalue(state, a) for a in self.possible_actions]
        return max(q)

    def get_action(self, state):
        """
        epsilon-greedy approach to choose action given the state 
        """
        if random.random() < self.epsilon:
            action = random.choice(self.possible_actions)
        else:
            q = [self.getQvalue(state, a) for a in self.possible_actions]
            if q.count(max(q)) > 1: 
                best_actions = [i for i in range(len(self.possible_actions)) if q[i] == max(q)]                       
                index = random.choice(best_actions)

            else:
                index = q.index(max(q))
            action = self.possible_actions[index]

        return action

    def qlearning(self, state, action, nextState, reward):
        """
        use Qlearning algorithm to update q values
        """
        key = (state, action)
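        if key not in self.qDict:
            # first visit: store the initial Q-value; the observed reward is not applied on this visit
            self.qDict[key] = 10.0
        else:
            # move the Q-value toward reward + discounted best Q-value of the next state
            self.qDict[key] = self.qDict[key] + self.alpha * (reward + self.gamma*self.getMaxQ(nextState) - self.qDict[key])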

    def update(self, t):
        self.next_waypoint = self.planner.next_waypoint()
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        self.new_state = inputs
        self.new_state['next_waypoint'] = self.next_waypoint
        self.new_state = tuple(sorted(self.new_state.items()))

        # for the current state, choose an action based on epsilon policy
        action = self.get_action(self.new_state)
        # observe the reward
        new_reward = self.env.act(self, action)
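        # update the Q-value of the previous (state, action) pair using the previous reward;
        # skip the first step of a trial, when there is no previous state yet
        if self.state is not None:
            self.qlearning(self.state, self.action, self.new_state, self.reward)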
        # set the state to the new state
        self.action = action
        self.state = self.new_state
        self.reward = new_reward
        self.cum_reward = self.cum_reward + new_reward
        self.moves = self.moves + 1
        #print "LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, new_reward)  # [debug]

def run():
    """Run the agent for a finite number of trials."""

    # Set up environment and agent
    e = Environment() 
    a = e.create_agent(QLearningAgent)  # create agent
    e.set_primary_agent(a, enforce_deadline=True)  # set agent to track

    # Now simulate it
    sim = Simulator(e, update_delay=0.001)  # reduce update_delay to speed up simulation
    sim.run(n_trials=100)  # press Esc or close pygame window to quit

if __name__ == '__main__':
    run()


Simulator.__init__(): Unable to import pygame; display disabled.
ImportError: No module named pygame
Simulator.run(): Trial 0
Environment.reset(): Trial set up with start = (2, 1), destination = (5, 6), deadline = 40
At least the distance is 8.
Environment.step(): Primary agent ran out of time! Trial aborted.
Simulator.run(): Trial 1
Environment.reset(): Trial set up with start = (1, 4), destination = (6, 6), deadline = 35
At least the distance is 7.
Environment.act(): Primary agent has reached destination 
with time step for 9,
'final' reward is 12.0.
Simulator.run(): Trial 2
Environment.reset(): Trial set up with start = (8, 4), destination = (1, 5), deadline = 40
At least the distance is 8.
Environment.step(): Primary agent ran out of time! Trial aborted.
Simulator.run(): Trial 3
Environment.reset(): Trial set up with start = (7, 3), destination = (2, 3), deadline = 25
At least the distance is 5.
Environment.act(): Primary agent has reached destination 
with time step for 23,
'final

### QUESTION 3:
What changes do you notice in the agent's behavior when compared to the basic driving agent when random actions were always taken? Why is this behavior occurring?

**Changes**
> First, the Q-Learning agent reaches the destination more frequently as the trials pass.

> Second, the step ratio for reaching the destination (final steps divided by the minimum distance) decreases as the agent learns from earlier trials.

> Q-Learning lets every trial contribute to the agent's accumulated experience. So as time passes, the Q-Learning agent learns to behave more wisely, with a lower step ratio and more reward.
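
The step ratio mentioned above can be computed as follows. This is a sketch: `step_ratio` is a hypothetical helper, and the numbers in the demo are illustrative rather than measured results.

```python
def step_ratio(steps_taken, min_distance):
    """Steps actually used divided by the minimum possible distance.

    A ratio of 1.0 means the agent took the shortest possible route.
    """
    if min_distance <= 0:
        raise ValueError("min_distance must be positive")
    return steps_taken / float(min_distance)


# Illustrative numbers only: 14 steps for a trip whose minimum distance is 7.
print(step_ratio(14, 7))  # 2.0
```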


### Task 4: Improve the Q-Learning Driving Agent
Your final task for this project is to enhance your driving agent so that, after sufficient training, the smartcab is able to reach the destination within the allotted time safely and efficiently. Parameters in the Q-Learning algorithm, such as the learning rate (alpha), the discount factor (gamma) and the exploration rate (epsilon) all contribute to the driving agent’s ability to learn the best action for each state. To improve on the success of your smartcab:

- Set the number of trials, n_trials, in the simulation to 100.
- Run the simulation with the deadline enforcement enforce_deadline set to True (you will need to reduce the update delay update_delay and set the display to False).
- Observe the driving agent’s learning and smartcab’s success rate, particularly during the later trials.
- Adjust one or several of the above parameters and iterate this process.

This task is complete once you have arrived at what you determine is the best combination of parameters required for your driving agent to learn successfully.
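
The adjust-and-iterate loop described above could be automated with a simple grid search, sketched here. Everything in it is an assumption for illustration: `grid_search` is a hypothetical helper, and `toy_evaluate` is a placeholder scoring function, not real simulator output. In practice the evaluation would run the simulation and return something like the success rate.

```python
from itertools import product


def grid_search(evaluate, alphas, gammas, epsilons):
    """Try every parameter combination and return the best one with its score."""
    best_params, best_score = None, float('-inf')
    for alpha, gamma, epsilon in product(alphas, gammas, epsilons):
        score = evaluate(alpha, gamma, epsilon)
        if score > best_score:
            best_params, best_score = (alpha, gamma, epsilon), score
    return best_params, best_score


def toy_evaluate(alpha, gamma, epsilon):
    """Placeholder score with an arbitrary peak; NOT a simulator result."""
    return -(alpha - 0.5) ** 2 - (gamma - 0.5) ** 2 - epsilon


best, score = grid_search(toy_evaluate, [0.25, 0.5, 0.75], [0.25, 0.5], [0.0, 0.1])
print(best)  # (0.5, 0.5, 0.0) for this toy function only
```

Because learning is stochastic, a real evaluation would likely need to average over several runs per combination before comparing scores.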

In [47]:
import random
from environment import Agent, Environment
from planner import RoutePlanner
from simulator import Simulator


class QLearningAgent(Agent):

    def __init__(self, env):
        super(QLearningAgent, self).__init__(env)
        self.color = 'red'
        self.planner = RoutePlanner(self.env, self)
        self.deadline = self.env.get_deadline(self)
        self.next_waypoint = None
        self.moves = 0

        self.qDict = dict()
        self.alpha = 0.6  # Learning rate: weight given to new information in each Q-value update.
        self.epsilon = 0.01  # Exploration rate: probability of choosing a random action.
        self.gamma = 0.2  # Discount factor: weight given to the estimated future reward.

        self.state = None
        self.new_state = None

        self.reward = None
        self.cum_reward = 0

        self.possible_actions = Environment.valid_actions
        self.action = None

    def reset(self, destination = None):
        self.planner.route_to(destination)
        self.next_waypoint = None
        self.moves = 0

        self.state = None
        self.new_state = None

        self.reward = 0
        self.cum_reward = 0

        self.action = None

    def getQvalue(self, state, action):
        key = (state, action)
        return self.qDict.get(key, 10.0)

    def getMaxQ(self, state):
        q = [self.getQvalue(state, a) for a in self.possible_actions]
        return max(q)

    def get_action(self, state):
        """
        epsilon-greedy approach to choose action given the state 
        """
        if random.random() < self.epsilon:
            action = random.choice(self.possible_actions)
        else:
            q = [self.getQvalue(state, a) for a in self.possible_actions]
            if q.count(max(q)) > 1: 
                best_actions = [i for i in range(len(self.possible_actions)) if q[i] == max(q)]                       
                index = random.choice(best_actions)

            else:
                index = q.index(max(q))
            action = self.possible_actions[index]

        return action

    def qlearning(self, state, action, nextState, reward):
        """
        use Qlearning algorithm to update q values
        """
        key = (state, action)
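        if key not in self.qDict:
            # first visit: store the initial Q-value; the observed reward is not applied on this visit
            self.qDict[key] = 10.0
        else:
            # move the Q-value toward reward + discounted best Q-value of the next state
            self.qDict[key] = self.qDict[key] + self.alpha * (reward + self.gamma*self.getMaxQ(nextState) - self.qDict[key])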

    def update(self, t):
        self.next_waypoint = self.planner.next_waypoint()
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)

        self.new_state = inputs
        self.new_state['next_waypoint'] = self.next_waypoint
        self.new_state = tuple(sorted(self.new_state.items()))

        action = self.get_action(self.new_state)
        new_reward = self.env.act(self, action)

        # update the previous (state, action) pair; skip the first step of a trial, when there is none
        if self.state is not None:
            self.qlearning(self.state, self.action, self.new_state, self.reward)
        self.action = action
        self.state = self.new_state
        self.reward = new_reward
        self.cum_reward = self.cum_reward + new_reward
        self.moves = self.moves + 1
        #print "LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, new_reward)  # [debug]

def run():
    """Run the agent for a finite number of trials."""

    # Set up environment and agent
    e = Environment()  
    a = e.create_agent(QLearningAgent)  # Create Qlearning agent
    e.set_primary_agent(a, enforce_deadline=True)  # Set agent to track

    # Now simulate it
    sim = Simulator(e, update_delay=0.001)  # Reduce update_delay to speed up simulation
    sim.run(n_trials=100)  # Press Esc or close pygame window to quit

if __name__ == '__main__':
    run()


Simulator.__init__(): Unable to import pygame; display disabled.
ImportError: No module named pygame
Simulator.run(): Trial 0
Environment.reset(): Trial set up with start = (1, 6), destination = (7, 3), deadline = 45
At least the distance is 9.
Environment.step(): Primary agent ran out of time! Trial aborted.
Simulator.run(): Trial 1
Environment.reset(): Trial set up with start = (8, 2), destination = (2, 5), deadline = 45
At least the distance is 9.
Environment.act(): Primary agent has reached destination 
with time step for 23,
'final' reward is 12.0.
Simulator.run(): Trial 2
Environment.reset(): Trial set up with start = (8, 1), destination = (4, 6), deadline = 45
At least the distance is 9.
Environment.step(): Primary agent ran out of time! Trial aborted.
Simulator.run(): Trial 3
Environment.reset(): Trial set up with start = (3, 3), destination = (6, 6), deadline = 30
At least the distance is 6.
Environment.act(): Primary agent has reached destination 
with time step for 17,
'fina


### QUESTION 4:
Report the different values for the parameters tuned in your basic implementation of Q-Learning. For which set of parameters does the agent perform best? How well does the final driving agent perform?

**What parameters mean:**
> Epsilon is the probability of choosing a random action at the current intersection. Otherwise, the Q-Learning agent ranks the actions by their Q-values and chooses the one with the maximum score.

> Alpha is the learning rate: how much a new reward changes the Q-value of an action. It mainly affects how quickly the agent responds to each step. It controls how much the agent learns from a new run versus how much it relies on its accumulated experience. Usually, the learning rate should drop (learning rate decay) as the agent learns. In this project we can keep it constant, but it should not be too high.

> Gamma is the discount rate. It controls how much weight the agent gives to future reward in the Q-value calculation. It can be tuned to make the agent more "determined" to head to the destination instead of circling around collecting small safe rewards.
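
The learning rate decay mentioned above is not used in this project, but one common form can be sketched as follows. The function name `decayed_alpha`, the inverse-time schedule, and the `decay_rate` value are all assumptions chosen for illustration.

```python
def decayed_alpha(initial_alpha, trial, decay_rate=0.05):
    """Inverse-time decay: alpha shrinks as the trial index grows."""
    return initial_alpha / (1.0 + decay_rate * trial)


for trial in (0, 10, 50, 99):
    print("trial {}: alpha = {:.3f}".format(trial, decayed_alpha(0.6, trial)))
```

Early trials then change the Q-values quickly, while later trials rely more on what has already been learned.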


**Tuning theory and practice:**
> Given the definitions of these three parameters, I tried to maximize the chance of reaching the destination, so epsilon should be as low as possible. Since I was looking for the best parameters, I set epsilon to 0.

> Alpha should be higher so that scores change rapidly in response to actions, which minimizes the number of steps.

> Gamma changes the scores of actions, which can differ as the situation changes in every trial.


<img src="conbinations.png">


### QUESTION 5: 
Does your agent get close to finding an optimal policy, i.e. reach the destination in the minimum possible time, and not incur any penalties? How would you describe an optimal policy for this problem?

**Value of Qlearning agent:**

> As time passes, every combination converges toward a rising success rate. I tuned each parameter toward the fewest steps and the most reward. It is possible to use a function that tests combinations automatically and returns the best model, but for this agent a 97 percent success rate is good and practical in this ideal situation.

> The optimal policy should have the fewest steps, the highest success rate, and the highest score.

> My agent always failed to reach the destination during the first 10 to 20 steps and continued to succeed after that. So this agent does learn from experience.

> Q-Learning learns from experience, so the success rate may still fall short of 100 percent because learning needs time and practice at the start.

> Even if the agent can reach a 100% success rate, unexpected real-time situations still occur while driving. This model still needs further improvement.
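
A success rate like the one discussed above can be computed from a list of per-trial outcomes, as in this sketch. The helper `success_rate` and the outcome list are hypothetical; the list is made up and is not taken from the runs in this report.

```python
def success_rate(outcomes):
    """Fraction of trials in which the destination was reached."""
    if not outcomes:
        return 0.0
    return sum(1 for reached in outcomes if reached) / float(len(outcomes))


# Made-up outcomes: True means the trial reached the destination.
outcomes = [False, True, False, True, True, True]
print(success_rate(outcomes))       # rate over all trials
print(success_rate(outcomes[-3:]))  # rate over the last three trials only
```

Comparing the overall rate with the rate over the later trials separates the early learning phase from the behavior of the trained agent.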
