<a href="https://colab.research.google.com/github/GabeMaldonado/JupyterNotebooks/blob/master/ReinforcementLearningAWSDeepRacer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# REINFORCEMENT LEARNING

## AWS DeepRacer Notes

### What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of Machine Learning in which an *agent* explores an *environment*  to learn how to perform desired tasks by taking actions with good outcomes and avoiding actions with bad outcomes. 

An RL model learns from its experience and, over time, is able to identify which actions lead to the best results.

Other types of Machine Learning:


*   Supervised Learning -- example-driven learning, with labeled data of known outputs for given inputs, where a model is trained to predict outcomes from new inputs.
*   Unsupervised Learning -- inference-based training, with unlabeled data without known outputs, where a model is trained to identify related structures or similar patterns within the input data.

### How does AWS DeepRacer learn how to drive by itself?

In RL, the agent interacts with an environment with the objective of maximizing its total reward.
The agent takes an *action*  based on the environment *state* and the environment returns the reward and the next state. The agent learns from trial and error, initially taking random actions and over time identifying the actions that lead to long-term rewards. 
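
The interaction loop described above can be sketched in a few lines. This is only an illustration: `ToyEnvironment`, its one-dimensional track, and the reward values are made up for these notes and are not part of AWS DeepRacer.

```python
import random


class ToyEnvironment:
    """Hypothetical 1-D track: the state is a position from 0 to 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is 1 (move forward) or 0 (stay put); both are assumptions
        self.state += action
        reward = 1.0 if action == 1 else 0.0
        done = self.state >= 4
        return self.state, reward, done


def run_random_agent(seed=0):
    """Take random actions until the end of the track is reached."""
    rng = random.Random(seed)
    env = ToyEnvironment()
    state, total_reward, steps, done = env.state, 0.0, 0, False
    while not done:
        action = rng.choice([0, 1])             # the agent picks an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        steps += 1
    return state, total_reward, steps
```

A trained agent would replace `rng.choice` with a policy that prefers the actions that led to higher rewards in the past.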

### AGENT
The *agent* represents the AWS DeepRacer vehicle in the training simulation. More specifically, it embodies the neural network that controls the vehicle, taking inputs and deciding actions.

### ENVIRONMENT
The *environment* contains a track that defines where the agent can go and what state it can be in. The agent explores the environment to collect data to train the underlying neural network.

### STATE 
A *state* represents a snapshot of the environment the agent is in at a point in time. For AWS DeepRacer, a state is an image captured by the front-facing camera of the vehicle. 

### ACTION
An *action* is a move made by the agent in the current state. For the AWS DeepRacer, an action represents a move at a particular speed and steering angle.

### REWARD
A *reward* is the score given as feedback to the agent when it takes an action in a given state. In training the AWS DeepRacer model, the reward is returned by a *reward function*. In general, you define or supply a reward function to specify which actions are desirable or undesirable for the agent to take in a given state.

## TRAINING AN RL MODEL
Training is an iterative process. In a simulator, the agent explores the environment and builds up experience. The experiences collected are used to update the neural network periodically and the updated models are used to create more experiences. 
With AWS DeepRacer, we are training a vehicle to drive itself. Let's explore this concept with a simplified example.

### A Simplified Environment
In this example, we want the vehicle to travel from point A to point B along the shortest path. Let's pretend we have a grid of squares in which each square represents an individual state. We will allow the vehicle to move up or down while facing the direction of the goal.

### Scores
We can assign a score to each square in the grid to decide which behavior to incentivize. We can designate the squares on the edge of the track as 'stop states', which tell the vehicle that it has gone off the track and failed.
Since we want to incentivize the vehicle to learn to drive down the center of the track, we assign a high reward for the squares on the center line and a low reward elsewhere.
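
As a rough sketch, the scores could be stored in a small table. The grid size and the score values below are assumptions chosen for illustration; `None` is used here to mark a stop state.

```python
# Hypothetical grid: 5 rows by 6 columns, with row 2 as the center line.
N_ROWS, N_COLS = 5, 6
CENTER_ROW = 2
STOP_ROWS = {0, N_ROWS - 1}  # the edge rows are stop states


def build_score_grid(center_score=2.0, off_center_score=0.1):
    """Return a list of rows; None marks a stop state."""
    grid = []
    for row in range(N_ROWS):
        if row in STOP_ROWS:
            grid.append([None] * N_COLS)
        elif row == CENTER_ROW:
            grid.append([center_score] * N_COLS)
        else:
            grid.append([off_center_score] * N_COLS)
    return grid


SCORES = build_score_grid()
```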

### An Episode
In RL training, the vehicle will start by exploring the grid until it moves out of bounds or reaches its destination.
As it drives around, the vehicle accumulates rewards from the scores we defined. This process is called an *episode*.
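
A single episode on such a grid might look like the sketch below. The grid size, the score values, and the assumption that the vehicle advances one column per step (moving up, straight, or down) are all made up for illustration.

```python
import random

N_ROWS, N_COLS = 5, 6        # hypothetical grid size
CENTER_ROW = 2
STOP_ROWS = {0, N_ROWS - 1}  # the edge rows are stop states


def score(row):
    """High score on the center line, low score elsewhere (assumed values)."""
    return 2.0 if row == CENTER_ROW else 0.1


def run_episode(policy, start_row=CENTER_ROW):
    """Advance one column per step; return (total_reward, reached_goal)."""
    row, total = start_row, 0.0
    for col in range(1, N_COLS):
        row += policy(row, col)  # -1 = up, 0 = straight, +1 = down
        if row in STOP_ROWS:
            return total, False  # off the track: the episode ends
        total += score(row)
    return total, True


straight = lambda row, col: 0
random_policy = lambda row, col: random.choice([-1, 0, 1])

print(run_episode(straight))  # (10.0, True)
```

Calling `run_episode(random_policy)` many times gives the kind of varied experience that training uses.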

### Iteration
RL algorithms are trained by repeated optimization of cumulative rewards. 
The model will learn which action (and subsequent actions) will result in the highest cumulative reward on the way to the goal.
Learning does not happen on the first go; it takes some iteration. First, the agent needs to explore the environment and see where it can get the highest rewards before it can exploit that knowledge.

### Exploration
As the agent gains more and more experience, it learns to stay on the central squares to get higher rewards.
If we plot the results from each episode, we can see how the model performs and improves over time.

### Exploitation and Convergence

With more experience, the agent gets better and it is able to get to the destination reliably. Depending on the exploration-exploitation strategy, the vehicle may still have a small probability of taking random actions to explore the environment. 
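
One common exploration-exploitation strategy is epsilon-greedy, sketched below. These notes do not say which strategy AWS DeepRacer uses, so treat this only as an example of the idea; the function name and the action values are hypothetical.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-valued one."""
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)              # explore
    return max(actions, key=action_values.get)  # exploit


# Hypothetical estimates of the long-term reward for each action
values = {'up': 0.1, 'straight': 2.0, 'down': 0.1}
print(epsilon_greedy(values, epsilon=0.0))  # straight
```

With a small `epsilon` such as `0.05`, the agent mostly exploits what it has learned but still takes an occasional random action.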

## Parameters of Reward Functions 

### Reward function parameters for AWS DeepRacer

In AWS DeepRacer, the reward function is a Python function which is given certain parameters that describe the current state and returns a numeric reward value.
The parameters passed to the reward function describe various aspects of the state of the vehicle, such as its position and orientation on the track, observed speed, steering angle and more.
We will explore some of these parameters and how they describe the vehicle as it drives around the track:


*   Position on the track
*   Heading
*   Waypoints
*   Track width
*   Distance from center line
*   All wheels on track
*   Speed
*   Steering angle

### 1- Position on the track

The parameters ```x``` and ```y``` describe the position of the vehicle in meters, measured from the lower-left corner of the environment. 

### 2- Heading

The ```heading``` parameter describes the orientation of the vehicle in degrees, measured counter-clockwise from the X-axis of the coordinate system. 

### 3- Waypoints

The ```waypoints``` parameter is an ordered list of milestones placed along the track center. 
Each waypoint in ```waypoints``` is a pair ```[x, y]``` of coordinates in meters, measured in the same coordinate system as the car's position. 
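
To see how position, heading and waypoints fit together, here is a minimal sketch that compares the car's heading with the direction of the track near the car. The helper name `heading_error` is hypothetical, and the sketch assumes the track is a closed loop and that the nearest waypoint can be found by straight-line distance.

```python
import math


def heading_error(x, y, heading, waypoints):
    """Angle in degrees (0 to 180) between the heading and the local track direction."""
    # Index of the waypoint closest to the car
    nearest = min(
        range(len(waypoints)),
        key=lambda i: math.hypot(waypoints[i][0] - x, waypoints[i][1] - y),
    )
    nxt = (nearest + 1) % len(waypoints)  # assumption: the track loops around

    # Direction from the nearest waypoint to the next one, measured
    # counter-clockwise from the X-axis, like the heading
    dx = waypoints[nxt][0] - waypoints[nearest][0]
    dy = waypoints[nxt][1] - waypoints[nearest][1]
    track_direction = math.degrees(math.atan2(dy, dx))

    diff = abs(track_direction - heading) % 360
    return min(diff, 360 - diff)


# Hypothetical square track with four waypoints
square = [[0, 0], [1, 0], [1, 1], [0, 1]]
print(heading_error(0.1, 0.0, 90.0, square))  # 90.0
```

A reward function could then give a lower reward when this angle is large.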

### 4- Track Width

The ```track_width``` parameter is the width of the track in meters.

### 5- Distance from the center line

The ```distance_from_center``` parameter measures the displacement of the vehicle from the center of the track.
The ```is_left_of_center``` parameter is a boolean describing whether the vehicle is to the left of the center of the track.

### 6- All wheels on track 

The ```all_wheels_on_track``` parameter is a boolean (true/false) which is true if all four wheels of the vehicle are inside the track borders and false if any wheel is outside the track.

### 7- Speed

The ```speed``` parameter is the observed speed of the vehicle in meters per second.

### 8- Steering angle

The ```steering_angle``` parameter is the steering angle of the vehicle in degrees.
This value is negative if the vehicle is steering right and positive if the vehicle is steering left. 
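
As a small illustration of the sign convention, the sketch below reads the steering angle and reduces a reward when the car steers sharply in either direction. The helper names, the 15-degree threshold, and the 0.8 factor are assumptions made for this example.

```python
def steering_direction(params):
    """Describe the steering direction from the sign of steering_angle."""
    angle = params['steering_angle']
    if angle > 0:
        return 'left'
    if angle < 0:
        return 'right'
    return 'straight'


def apply_steering_penalty(params, base_reward=1.0, threshold=15.0):
    """Scale the reward down when steering is sharper than the threshold."""
    if abs(params['steering_angle']) > threshold:
        return base_reward * 0.8  # assumed penalty factor
    return base_reward


print(steering_direction({'steering_angle': -20.0}))      # right
print(apply_steering_penalty({'steering_angle': -20.0}))  # 0.8
```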

## The Reward Function

### Putting all the pieces together

With all these parameters, we can define a reward function to incentivize whatever driving behavior we like.
The following are examples of reward functions:

### 1- Stay on Track

With this function, we give a high reward for when the car stays on the track and penalize if the car deviates from the track boundaries.
This example uses the ```all_wheels_on_track```, ```distance_from_center``` and ```track_width``` parameters to determine whether the car is on the track and gives a high reward if so.
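
A minimal sketch of such a function is shown below. It assumes the parameter key is `all_wheels_on_track`, and the 0.05 m safety margin from the track border is an assumption, not a value given in these notes.

```python
def reward_function(params):
    """Sketch: reward the agent for staying inside the track borders."""
    all_wheels_on_track = params['all_wheels_on_track']
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']

    # Start with a very low reward
    reward = 1e-3

    # High reward only if no wheel is off the track and the car is
    # at least 0.05 m (assumed margin) away from the border
    if all_wheels_on_track and (0.5 * track_width - distance_from_center) >= 0.05:
        reward = 1.0

    return float(reward)
```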

### 2- Follow Center Line

In this example, we measure how far the car is from the center of the track and give a higher reward if the car is close to the center line. 
This example uses the ```track_width``` and ```distance_from_center``` parameters and returns a reward that decreases the further the car is from the center of the track.

### 3- No Incentive

An alternative strategy is to give a constant reward on each step regardless of how the car is driving. 
This example does not use any of the input parameters -- instead it returns a constant reward of ```1.0``` at each step. 
The agent's only incentive is to successfully finish the track, and it has no incentive to drive faster or follow any particular path. It may behave erratically.
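
Written out, this strategy is about as short as a reward function can be. The sketch below follows the description above: it ignores `params` entirely.

```python
def reward_function(params):
    """Sketch: return the same reward at every step, whatever the state."""
    return 1.0
```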





## The Basic Reward Function

The basic reward function is based on the distance from the center line, using 'markers' at different percentages of the track width.

![alt text](https://s3.amazonaws.com/video.udacity-data.com/topher/2019/June/5d116efd_l2-reward-function/l2-reward-function.png)

In the above image, we can see that there are different rewards based on how far the car is from the center line. 
Here's the code from the AWS console:

```
def reward_function(params):
    '''
    Example of rewarding the agent to follow center line
    '''

    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate 3 markers that are varying distances away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    return float(reward)
```

In this code, we first gather the parameters for the track width and the distance from the center line. Then, we create three markers based on the track width: one at 10% of the track width, another at 25% and the last at half of the track width.
From there, it's a simple if/else statement that gives different, decreasing rewards depending on whether the vehicle is within a given marker. If it's outside all three markers, notice that the reward is almost zero.
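
To make the thresholds concrete, the sketch below restates the same marker logic in a compact form and evaluates it at a few distances. The helper name `marker_reward` and the track width of 1.0 m are assumptions made for this worked example.

```python
def marker_reward(track_width, distance_from_center):
    """Same tiers as above: (fraction of track width, reward)."""
    tiers = [(0.1, 1.0), (0.25, 0.5), (0.5, 0.1)]
    for fraction, reward in tiers:
        if distance_from_center <= fraction * track_width:
            return reward
    return 1e-3  # outside all three markers


track_width = 1.0  # hypothetical width in meters
for distance in (0.05, 0.2, 0.4, 0.6):
    print(distance, marker_reward(track_width, distance))
# 0.05 1.0
# 0.2 0.5
# 0.4 0.1
# 0.6 0.001
```

On this track, the markers sit at 0.1 m, 0.25 m and 0.5 m from the center line, so a car 0.6 m from the center is outside all of them.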
