# Deep Reinforcement Learning

## Definition

Combines **deep learning** and **reinforcement learning** principles to create efficient algorithms that can be applied to areas such as robotics, video games, finance, and healthcare.  

## Reinforcement learning task

Training an **agent** that interacts with its **environment**. By performing **actions**, the agent arrives at different scenarios known as **states**. Actions lead to **rewards**, which can be positive or negative.  

The agent has only one purpose here: to **maximize its total reward** across an **episode**. An episode is everything that happens between the first state and the last, or terminal, state within the environment. We reinforce the agent so that it learns to **perform the best actions** through experience. The resulting mapping from states to actions is the **strategy** or **policy**.  
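
The agent-environment loop described above can be sketched as follows. This is only an illustration: `CorridorEnv`, its reward values, and `run_episode` are hypothetical names invented for this example, not part of any library.

```python
class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Every episode starts in the leftmost cell (the first state).
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped to the corridor).
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1  # terminal state reached
        reward = 1.0 if done else -0.1        # assumed rewards: step penalty, goal bonus
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode with the given policy and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the policy maps a state to an action
        state, reward, done = env.step(action)  # the environment returns state and reward
        total_reward += reward
        if done:
            break
    return total_reward


# A fixed policy that always moves right reaches the goal in four steps.
print(run_episode(CorridorEnv(), lambda state: 1))
```

Learning consists of replacing the fixed policy with one that improves from the rewards it observes.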

## The Bellman Equation

It writes the *value* of a decision problem at a certain point in time in terms of the payoff from some initial choices and the *value* of the remaining decision problem that results from those initial choices.

### Concepts

- s = State  
- a = Action  
- R = Reward  
- γ = Discount factor  
- V = Value of a state  

### Equation 

![equation](https://i.imgur.com/I39aaIa.png)

- s' = New State

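#### Discount Factor

Determines the importance of future rewards. A discount factor close to 0 makes the agent focus on immediate rewards, while a value close to 1 makes it weigh long-term rewards almost as heavily as immediate ones.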
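
As a small worked example of the discount factor, the sketch below computes the discounted sum of a reward sequence, r0 + γ·r1 + γ²·r2 + ... The function name `discounted_return` and the sample rewards are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r0 + gamma * r1 + gamma**2 * r2 + ... for a list of rewards."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma to later rewards.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward counts
print(discounted_return(rewards, gamma=0.5))  # later rewards count for less
print(discounted_return(rewards, gamma=1.0))  # all rewards count equally
```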

## Markov Decision Process

**Discrete time stochastic control** process.  
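- Discrete Time : Time is treated as a sequence of separate steps rather than as a continuous quantity.  
- Stochastic : Outcomes are partly random, so the same action in the same state can lead to different next states.  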
- Control : Deals with finding a control law for a dynamical system over a period of time such that an objective function is optimized.  

### Markov Property
The current state **depends** solely on the **previous** state and the **transition** from that state to the current one, not on the full history of earlier states.  

### Equation

![equation](https://i.imgur.com/K7ILfti.png)
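
A minimal sketch of how such an equation can be solved by repeated updates (value iteration), assuming the stochastic Bellman form V(s) = max_a [R(s, a) + γ Σ P(s, a, s') V(s')]. The three-state MDP, its probabilities, and the name `value_iteration` are hypothetical and chosen only for illustration.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8):
    """Compute state values for a small MDP.

    transitions[s][a] is a list of (probability, next_state) pairs;
    rewards[s][a] is the reward for taking action a in state s.
    """
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:  # terminal state: no actions, value stays 0
                continue
            best = max(
                rewards[s][a] + gamma * sum(p * values[s2] for p, s2 in outcomes)
                for a, outcomes in actions.items()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:  # stop once no value changed noticeably
            return values


# Hypothetical MDP: from A, "go" reaches B with probability 0.8 and slips back to A otherwise.
transitions = {
    "A": {"go": [(0.8, "B"), (0.2, "A")], "stay": [(1.0, "A")]},
    "B": {"go": [(1.0, "end")]},
    "end": {},
}
rewards = {"A": {"go": 0.0, "stay": 0.0}, "B": {"go": 1.0}}
print(value_iteration(transitions, rewards))
```

Because the transition out of A is random, the value of A is an expectation over both possible next states rather than the value of a single successor.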

## Q-Learning
Finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state.  

Perform the sequence of actions that will eventually generate the maximum total reward.  

- Q = Value of an action  

![formula](https://i.imgur.com/XmPfsqi.png)

## Temporal Difference

Adjust predictions to match later, more accurate, predictions about the future before the final outcome is known.  

![formula](https://i.imgur.com/L0SJiqe.png)

![formula2](https://i.imgur.com/HLEQd20.png)

### Full formula 

![formula3](https://i.imgur.com/2TvtcCc.png)

- α = Learning Rate  

#### Learning Rate

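Hyperparameter that determines to what extent newly acquired information overrides old information. A learning rate of 0 means the agent learns nothing new, while a learning rate of 1 means it considers only the most recent information.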
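
The temporal difference update can be sketched for a tabular Q-function as below, assuming the usual Q-learning form Q(s, a) ← Q(s, a) + α [R + γ max Q(s', a') − Q(s, a)]. The function name `q_learning_update`, the dictionary-based table, and the sample transition are assumptions for illustration.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-learning update in place and return the temporal difference."""
    # Value of the best action in the next state (0 if the episode has ended).
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[(state, action)]
    # The learning rate alpha controls how far the old estimate moves toward the target.
    q_table[(state, action)] += alpha * td_error
    return td_error


q_table = defaultdict(float)  # unseen (state, action) pairs start at 0
td = q_learning_update(q_table, state=0, action=1, reward=1.0,
                       next_state=1, actions=[0, 1], alpha=0.5)
print(td, q_table[(0, 1)])
```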

## Deep Q-Learning

Used when the Q-table (the "cheat sheet" of Q-values for every state-action pair) is too large to store or fill in.  

Use a **neural network** to approximate the Q-value function.  

The state is given as the input and the Q-values of all possible actions are generated as the output.  

![qlearning vs deepQL](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/04/Screenshot-2019-04-16-at-5.46.01-PM-850x558.png)

1) All past experience is stored by the agent in memory  
2) The next action is determined by the maximum output of the Q-network  
3) The loss function is the mean squared error between the predicted Q-value and the target Q-value, Q*. This is basically a regression problem. However, we do not know the target or actual value here, as we are dealing with a reinforcement learning problem, so the target is estimated from the Q-value update equation derived from the *Bellman equation*.  

#### 1- Loss Function 

![loss](https://i.imgur.com/c7eQbkV.png)
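
A minimal sketch of this loss on a small batch, assuming the target is R + γ max Q(s', a') and that the network outputs are already available as plain lists. The function `dqn_loss` and the sample numbers are hypothetical; a real implementation would obtain the Q-values from a neural network.

```python
def dqn_loss(q_values, actions, rewards, next_q_values, dones, gamma=0.99):
    """Mean squared error between predicted Q(s, a) and the bootstrapped target.

    q_values[i] and next_q_values[i] hold one Q-value per action for the
    current and next state of transition i.
    """
    squared_errors = []
    for q, a, r, next_q, done in zip(q_values, actions, rewards, next_q_values, dones):
        # No future reward is added once the episode has ended.
        target = r if done else r + gamma * max(next_q)
        squared_errors.append((target - q[a]) ** 2)
    return sum(squared_errors) / len(squared_errors)


loss = dqn_loss(
    q_values=[[1.0, 2.0], [0.5, 0.0]],
    actions=[1, 0],
    rewards=[1.0, 0.0],
    next_q_values=[[0.0, 2.0], [3.0, 1.0]],
    dones=[False, True],
    gamma=0.5,
)
print(loss)
```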

#### 2- Selecting the right Q 

Convert the Q-values into action probabilities using the SoftMax function, so that the action with the highest Q-value is the most likely to be selected.  

![softmax](https://i.imgur.com/dgHJB4q.png)
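
Softmax-based selection might look like the sketch below. The `temperature` parameter and the function names are assumptions added for illustration; with `temperature=1.0` the function reduces to the plain softmax.

```python
import math
import random


def softmax(q_values, temperature=1.0):
    """Turn a list of Q-values into probabilities that sum to 1."""
    scaled = [q / temperature for q in q_values]
    largest = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, temperature=1.0, rng=random):
    """Sample an action index with probability given by the softmax of its Q-value."""
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_values = [1.0, 2.0, 3.0]
print(softmax(q_values))
print(softmax_action(q_values))
```

Unlike always taking the maximum, sampling from these probabilities still lets lower-valued actions be tried occasionally.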

## Experience Replay

Maintain a **buffer** of old experiences and train the network on them again.  

Store the (S, A) → R situations, giving priority to those that create the highest “surprise” (highest loss).  

- Advantages of experience replay:

    - More efficient use of previous experience, by learning with it multiple times. This is key when gaining real-world experience is costly, because you can get full use of it. The Q-learning updates are incremental and do not converge quickly, so multiple passes over the same data are beneficial, especially when there is low variance in immediate outcomes (reward, next state) given the same state-action pair.

    - Better convergence behaviour when training a function approximator. Partly this is because the data is more like the i.i.d. data assumed in most supervised learning convergence proofs.

- Disadvantage of experience replay:

    - It is harder to use multi-step learning algorithms, such as Q(𝜆), which can be tuned to give better learning curves by balancing between bias (due to bootstrapping) and variance (due to delays and randomness in long-term outcomes).
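
A minimal replay buffer can be sketched as follows. Note that this sketch samples uniformly at random, which is the simplest variant; it does not prioritize high-loss experiences. The class name `ReplayBuffer` and its methods are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks up the order in which
        # transitions were collected.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action=0, reward=0.0, next_state=step + 1, done=False)
print(len(buffer), buffer.sample(2))
```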

## Action Selection Policies

- Epsilon-greedy : The best action (the best lever, in multi-armed bandit terms) is selected for a proportion 1 - epsilon of the trials, and an action is selected at random (with uniform probability) for a proportion epsilon.

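- Epsilon-soft (1 - epsilon) : An action is selected at random for a proportion 1 - epsilon of the trials, so every action keeps a non-zero probability of being chosen.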

- Softmax : Takes as input a vector of K real numbers (here, the Q-values of the K actions) and normalizes it into a probability distribution consisting of K probabilities.  
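
The epsilon-greedy rule from the list above can be sketched as below. The function name `epsilon_greedy` and the optional `rng` argument are assumptions for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return rng.randrange(len(q_values))
    # Exploit: index of the highest Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.2, 0.8, 0.5]
print(epsilon_greedy(q_values, epsilon=0.0))  # always the greedy action
print(epsilon_greedy(q_values, epsilon=0.1))  # greedy most of the time
```

A common design choice is to start with a large epsilon and reduce it as the Q-value estimates become more reliable.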

### Exploration vs Exploitation
- Exploration : Where we gather more information that might lead us to better decisions in the future. This helps the agent avoid getting stuck in a local maximum.  

- Exploitation : Where we make the best decision given current information.  

## Applications 

![applications](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/04/Screenshot-2019-04-17-at-3.48.22-PM-850x626.png)