# Training a Smartcab
**Implement a Basic Driving Agent**<br>
To begin, your only task is to get the smartcab to move around in the environment. At this point, you will not be concerned with any sort of optimal driving policy. Note that the driving agent is given the following information at each intersection:

- The next waypoint location relative to its current location and heading.
- The state of the traffic light at the intersection and the presence of oncoming vehicles from other directions.
- The current time left from the allotted deadline.

To complete this task, simply have your driving agent choose a random action from the set of possible actions (None, 'forward', 'left', 'right') at each intersection, disregarding the input information above. Set the simulation deadline enforcement, enforce_deadline, to False and observe how it performs.

**QUESTION: Observe what you see with the agent's behavior as it takes random actions. Does the smartcab eventually make it to the destination? Are there any other interesting observations to note?**

I have added a few lines of code in order to make the smartcab move randomly in the environment.       

    def update(self, t):
        # Gather inputs
        self.next_waypoint = self.planner.next_waypoint()  # from route planner, also displayed by simulator
        inputs = self.env.sense(self)
        deadline = self.env.get_deadline(self)
        # TODO: Update state

        # TODO: Select action according to your policy
        # For now, ignore the inputs and pick uniformly at random from the valid actions
        action = random.choice(Environment.valid_actions)

        # Execute action and get reward
        reward = self.env.act(self, action)

        # TODO: Learn policy based on state, action, reward

        # Parenthesised form with a single argument works in both Python 2 and Python 3
        print("LearningAgent.update(): deadline = {}, inputs = {}, action = {}, reward = {}".format(deadline, inputs, action, reward))

I ran this agent for over 100 trials. It did reach the destination, but it took ages and finished nowhere near the allocated time, so it was clear that random actions are not the best way to go about this.

It is interesting to note that the random function does not seem very random, as consecutive actions are often the same. The reward is also mostly negative and only rarely positive.
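Repeated actions are less surprising than they look. As a rough check, the sketch below draws actions uniformly at random and measures how often one action equals the previous one. `VALID_ACTIONS` is a hypothetical stand-in for `Environment.valid_actions`, and `repeat_fraction` is an illustrative helper that is not part of the project code.

```python
import random

# Hypothetical stand-in for Environment.valid_actions from the simulator.
VALID_ACTIONS = [None, 'forward', 'left', 'right']

def repeat_fraction(n_steps, seed=0):
    """Fraction of steps whose action equals the previous step's action."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    actions = [rng.choice(VALID_ACTIONS) for _ in range(n_steps)]
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return repeats / float(n_steps - 1)

print(repeat_fraction(100000))
```

With four equally likely actions, any step matches the previous one with probability 1/4, so the printed fraction should come out close to 0.25 and short runs of identical actions are expected.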

**Inform the Driving Agent**<br>
Now that your driving agent is capable of moving around in the environment, your next task is to identify a set of states that are appropriate for modeling the smartcab and environment. The main source of state variables are the current inputs at the intersection, but not all may require representation. You may choose to explicitly define states, or use some combination of inputs as an implicit state. At each time step, process the inputs and update the agent's current state using the self.state variable. Continue with the simulation deadline enforcement enforce_deadline being set to False, and observe how your driving agent now reports the change in state as the simulation progresses.

**QUESTION: What states have you identified that are appropriate for modeling the smartcab and environment? Why do you believe each of these states to be appropriate for this problem?**

The first state variable I think we need to consider for this problem is the **light**, which is either **GREEN or RED**, as the car needs to be safe and not run red lights. Next is the traffic situation, **oncoming, left and right**, each of which is **FORWARD, LEFT, RIGHT or NONE**, as we do not want the car to crash into an oncoming car. Finally, the **waypoint**, which is **FORWARD, LEFT, RIGHT or NONE**, tells the car where the next waypoint is relative to it, so it can navigate optimally to the final destination.
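One way to turn these variables into a single state is sketched below. It assumes `env.sense()` returns a dictionary with the keys `light`, `oncoming`, `left` and `right`; `State` and `build_state` are hypothetical names introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical state container; the field names mirror the variables listed above.
State = namedtuple('State', ['light', 'oncoming', 'left', 'right', 'waypoint'])

def build_state(inputs, next_waypoint):
    """Combine the sensed inputs and the planner's waypoint into a hashable state."""
    return State(
        light=inputs['light'],
        oncoming=inputs['oncoming'],
        left=inputs['left'],
        right=inputs['right'],
        waypoint=next_waypoint,
    )

# Made-up sensor values in the assumed shape of the env.sense() result.
example_inputs = {'light': 'green', 'oncoming': None, 'left': None, 'right': 'forward'}
state = build_state(example_inputs, 'forward')
print(state)
```

Because a named tuple is hashable, it can be used directly as a dictionary key. With two light values and four values for each of the other four fields, this representation has at most 2 × 4 × 4 × 4 × 4 = 512 distinct states.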

I found it interesting that the problem is modelled on traffic rules in the USA, where you drive on the right side of the road with the driver sitting on the left side of the car. I live in the UK, so it took me a while to get my head around the turning operations and when and where you can safely turn or move forward.

**Implement a Q-Learning Driving Agent**<br>
With your driving agent being capable of interpreting the input information and having a mapping of environmental states, your next task is to implement the Q-Learning algorithm for your driving agent to choose the best action at each time step, based on the Q-values for the current state and action. Each action taken by the smartcab will produce a reward which depends on the state of the environment. The Q-Learning driving agent will need to consider these rewards when updating the Q-values. Once implemented, set the simulation deadline enforcement enforce_deadline to True. Run the simulation and observe how the smartcab moves about the environment in each trial.


**QUESTION: What changes do you notice in the agent's behavior when compared to the basic driving agent when random actions were always taken? Why is this behavior occurring?**

The agent is now more responsive to the environment and takes actions that are logical from my perspective regarding the route he is taking. The actions are no longer random: they take into account the previous state, the current state and the action which leads to the next state. The agent seems to be learning which way is best to go, and sometimes he takes routes that I couldn't see before, but when he takes them I see that it was logically the shortest route to take. This behaviour is due to the Q-values that the Q-Learning algorithm estimates for each state and action from the rewards received; the agent then prefers the action with the highest Q-value in its current state.
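The learning step behind this behaviour can be sketched as a standard tabular Q-learning update. This is a minimal illustration under stated assumptions rather than the exact code used in the agent: `q_update`, the `(state, action)` dictionary layout and the placeholder state names `'s0'` and `'s1'` are all hypothetical.

```python
from collections import defaultdict

VALID_ACTIONS = [None, 'forward', 'left', 'right']  # assumed action set

def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update and return the new Q-value."""
    # Value of the best action available from the state we ended up in.
    best_next = max(q_table[(next_state, a)] for a in VALID_ACTIONS)
    target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way towards the target.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]

q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
new_value = q_update(q, 's0', 'forward', 2.0, 's1', alpha=0.5, gamma=0.1)
print(new_value)
```

Here alpha controls how far each new experience moves the stored estimate, and gamma controls how much the estimated value of the next state counts relative to the immediate reward.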

**Improve the Q-Learning Driving Agent**<br>
Your final task for this project is to enhance your driving agent so that, after sufficient training, the smartcab is able to reach the destination within the allotted time safely and efficiently. Parameters in the Q-Learning algorithm, such as the learning rate (alpha), the discount factor (gamma) and the exploration rate (epsilon) all contribute to the driving agent’s ability to learn the best action for each state. To improve on the success of your smartcab:

- Set the number of trials, n_trials, in the simulation to 100.
- Run the simulation with the deadline enforcement enforce_deadline set to True (you will need to reduce the update delay update_delay and set the display to False).
- Observe the driving agent’s learning and smartcab’s success rate, particularly during the later trials.
- Adjust one or several of the above parameters and iterate this process.

This task is complete once you have arrived at what you determine is the best combination of parameters required for your driving agent to learn successfully.

**QUESTION: Report the different values for the parameters tuned in your basic implementation of Q-Learning. For which set of parameters does the agent perform best? How well does the final driving agent perform?**

So I decided to play with the values of alpha (learning rate), gamma (discount factor) and epsilon (exploration rate).


| Alpha | Gamma | Epsilon | Reward | Successes | Penalties |
| :------: | :------: | :------: | :------: | :-------:| :---:|
|0|0|0|513.3|14|1608|
|0.1|0.1|0.1|2298.5|98|146|
|0.5|0.1|0.1|2169.5|99|107|
|0.9|0.1|0.1|2198.0|98|97|
|0.7|0.1|0.1|2200.0|97|80|
|0.7|0.2|0.1|2221.5|97|124|
|0.7|0.5|0.1|2416.0|97|240|
|0.7|0.1|0.2|2227.5|94|182|
|0.9|0.3|0.1|2272.0|99|113|

After a lot of playing around, I think that as alpha increases the agent learns more from each new experience, so he is more likely to reach his destination; hence I think the optimal value is 0.9. I think that as gamma increases, the agent tries harder to maximise reward and cares less about the penalties he incurs. Gamma should therefore be fairly low, because we want to make sure we are safe, but not too low, because we also want to maximise reward; hence an optimal value of 0.3. Finally, the epsilon value: I think that as epsilon increases, the agent more often explores alternatives to the best route he knows, so this should be low; hence 0.1. These values give a relatively high reward of 2272, a success rate of 99/100 and only 113 penalties, which is relatively low.
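The role of epsilon can be illustrated with an epsilon-greedy action selector. This is a sketch under assumptions, not necessarily how the agent above is implemented: `choose_action`, the `(state, action)` dictionary layout and the example Q-values are hypothetical.

```python
import random

VALID_ACTIONS = [None, 'forward', 'left', 'right']  # assumed action set

def choose_action(q_table, state, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(VALID_ACTIONS)  # explore
    q_values = [q_table.get((state, a), 0.0) for a in VALID_ACTIONS]
    best = max(q_values)
    # Break ties at random so the agent does not always favour the first action.
    best_actions = [a for a, q in zip(VALID_ACTIONS, q_values) if q == best]
    return rng.choice(best_actions)  # exploit

# Made-up Q-values for a single placeholder state.
q = {('s0', 'forward'): 1.5, ('s0', 'left'): -0.5}
print(choose_action(q, 's0', epsilon=0.0))
```

With epsilon set to 0 the selector always exploits what it has learned so far; a value such as 0.1 makes roughly one decision in ten a uniformly random draw, some of which will still happen to pick the best-known action.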


**QUESTION: Does your agent get close to finding an optimal policy, i.e. reach the destination in the minimum possible time, and not incur any penalties? How would you describe an optimal policy for this problem?**

The agent tries to reach the destination in the minimum time possible. During 100 trials it incurred 109 penalties, and the first 50 of those occurred in the first 37 trials, so the number of penalties incurred decreased as the number of trials increased. The output from those 100 trials is in the file called "100 Trial output". I think the optimal policy would be to reach the destination in the shortest amount of time while driving safely and without incurring any penalties, but this is not always possible, so in practice the best achievable policy may sometimes include incurring a penalty. If more than 100 trials were used to train the smartcab, it might get better.
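A figure such as "the first 50 penalties occurred in the first 37 trials" can be computed from per-trial penalty counts with a small helper like the one below. `trial_reaching` is a hypothetical function, and the example counts are made up for illustration; they are not taken from the "100 Trial output" file.

```python
def trial_reaching(penalties_per_trial, threshold):
    """Return the 1-based trial at which cumulative penalties first reach
    the threshold, or None if they never do."""
    total = 0
    for trial, count in enumerate(penalties_per_trial, start=1):
        total += count
        if total >= threshold:
            return trial
    return None

# Made-up per-trial penalty counts, for illustration only.
example_counts = [4, 3, 3, 2, 1, 1, 0, 1, 0, 0]
print(trial_reaching(example_counts, 10))
```

If the agent is learning, successive thresholds should be reached after longer and longer gaps in trials.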