## Berkeley Deep RL Course Notes

### Video Lecture 4

---
__[Berkeley DeepRL Course Lecture 4 Video](https://youtu.be/qVsLk5CVy_c)__

### What do we do if the dynamics are unknown?
Some systems have dynamics that are very hard to model ahead of time: you don't know how the state and action lead to the next state.  In that case the model of the dynamics must be learned.

#### Overview of model-based RL
Why learn the model?
To plan, we need to predict the results of actions.

Version 0.5: collect random samples, train dynamics, plan
Pro: simple and no iterative process.
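A minimal sketch of the Version 0.5 recipe, under stated assumptions: the point-mass system (`A_TRUE`, `B_TRUE`), the linear least-squares model, and the random-shooting planner are all hypothetical choices made for illustration, not something the lecture prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" system, unknown to the learner: a 1D point mass.
A_TRUE = np.array([[1.0, 0.1], [0.0, 1.0]])
B_TRUE = np.array([[0.0], [0.1]])

def true_step(x, u):
    """Ground-truth dynamics, used only to generate data."""
    return A_TRUE @ x + B_TRUE @ u

# 1. Collect transitions with a random base policy.
X, U, X_next = [], [], []
x = np.zeros(2)
for _ in range(200):
    u = rng.uniform(-1.0, 1.0, size=1)
    x_next = true_step(x, u)
    X.append(x)
    U.append(u)
    X_next.append(x_next)
    x = x_next

# 2. Train the dynamics model: x' ~ [x, u] @ W, fit by least squares.
inputs = np.hstack([np.array(X), np.array(U)])        # shape (200, 3)
W, *_ = np.linalg.lstsq(inputs, np.array(X_next), rcond=None)
A_hat, B_hat = W[:2].T, W[2:].T

# 3. Plan through the learned model (random shooting, open loop).
def plan(x0, goal, horizon=20, n_candidates=500):
    best_cost, best_actions = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 1))
        x, cost = x0, 0.0
        for u in actions:
            x = A_hat @ x + B_hat @ u          # roll out the learned model
            cost += np.sum((x - goal) ** 2)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost
```

Calling `plan(np.zeros(2), np.array([1.0, 0.0]))` returns an open-loop action sequence. Nothing here corrects the plan if the model is wrong in the states the plan actually visits, which is the distribution mismatch problem.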

Classic system identification approach: does it work?
Yes... but in reality, no.
- This is how system identification works in classical robotics.
- You must design a good base policy.
- It is effective when the dynamics representation can be hand-engineered using physics, leaving just a few parameters to fit.

Problems?
- Distribution mismatch problem: the states we observe at test time are different from the states used to train the dynamics model.
- The distribution mismatch problem gets worse as we use more expressive model classes.
- The model class is very restrictive in classical robotics, so it works there.

Version 1.0: iteratively collect data, replan, collect data
Can we do better?
_Replanning helps with model errors_
Errors accumulate over time steps, so they need to be corrected as they appear.  Your model can tell you how much to correct: you can replan your actions at each time step as you find errors, and then relearn the dynamics model over time from the newly collected data.
It is simple and solves the distribution mismatch problem, but the open-loop plan can perform poorly in stochastic domains.

Version 1.5: iteratively collect data using MPC (replan at each step)
It is robust where there are small errors, but it is computationally expensive and requires a planning algorithm.
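The replanning loop can be sketched as follows. This is an illustration under assumptions: the scalar system, the deliberately wrong `MODEL_GAIN`, and the crude planner that only considers constant-action plans are hypothetical, chosen so the example stays small and deterministic.

```python
import numpy as np

MODEL_GAIN = 0.5   # what the learned model believes
TRUE_GAIN = 0.4    # what the real system does (model error on purpose)
CANDIDATES = np.linspace(-1.0, 1.0, 21)   # crude planner: constant-action plans

def model_step(x, u):
    return x + MODEL_GAIN * u

def true_step(x, u):
    return x + TRUE_GAIN * u

def plan_cost(x, u, horizon):
    """Cost of holding action u for `horizon` steps under the model."""
    cost = 0.0
    for _ in range(horizon):
        x = model_step(x, u)
        cost += x ** 2          # goal is x = 0
    return cost

def mpc_action(x, horizon=5):
    """Plan from the current state, but return only the first action."""
    costs = [plan_cost(x, u, horizon) for u in CANDIDATES]
    return CANDIDATES[int(np.argmin(costs))]

def run_mpc(x0, n_steps=30):
    x, states = x0, [x0]
    for _ in range(n_steps):
        u = mpc_action(x)       # replan at every step
        x = true_step(x, u)     # execute one action on the real system
        states.append(x)
    return np.array(states)

states = run_mpc(2.0)
```

Even though the model overestimates the effect of each action, the state still settles near zero because every new plan starts from the state that was actually observed. The price is one planning call per time step.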

Version 2.0: backpropagate directly into policy
Computationally cheap at runtime, but can be unstable, especially in stochastic domains

#### What kind of models can we use?
__Gaussian process__
Fit a GP whose input is the (state, action) tuple and whose output is the next state.
GPs are very data-efficient, but they are not great with non-smooth dynamics and they get slow when the dataset is large.
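As a rough illustration of a GP dynamics model, here is the posterior mean of a GP regression written directly in NumPy. The 1D dynamics in `true_next_state`, the RBF kernel, its length scale, and the `NOISE` value are assumptions made for the sketch; a real application would tune these or use a GP library.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_next_state(s, a):
    """Hypothetical smooth 1D dynamics, used only to generate data."""
    return s + 0.1 * np.sin(s) + 0.1 * a

def rbf_kernel(p, q, length_scale=1.0):
    """Squared-exponential kernel between two sets of (state, action) rows."""
    sq_dists = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

# Training set: (state, action) -> next state.
inputs = np.column_stack([rng.uniform(-2, 2, 30), rng.uniform(-1, 1, 30)])
targets = true_next_state(inputs[:, 0], inputs[:, 1])

# GP posterior mean: k(x*, X) (K + noise * I)^{-1} y
NOISE = 1e-4
K = rbf_kernel(inputs, inputs) + NOISE * np.eye(len(inputs))
alpha = np.linalg.solve(K, targets)    # cubic in the number of data points

def predict_next_state(state, action):
    query = np.array([[state, action]])
    return float((rbf_kernel(query, inputs) @ alpha)[0])
```

The `np.linalg.solve` call on the full kernel matrix is where the cost grows quickly with dataset size.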

__Neural Network__
The input is a (state, action) tuple and the output is the next state.
A Euclidean training loss corresponds to a Gaussian model of the next state; more complex losses are also possible.  
Neural networks are very expressive and can use lots of data, but they are not great in low-data regimes.

__Other__
Gaussian Mixture Model (GMM)
Domain specific models
Conditional Restricted Boltzmann Machines
GPs and GPLVMs
Linear and switching linear dynamical systems

#### Global models and local models
Model-based RL will seek out the regions where the model is overly optimistic, so you end up visiting all of the model's possible mistakes before you get the right behavior.
The model must be very good across the whole state space for planning to converge on a good solution.
For easy tasks the model is much more complex than the policy. Instead of fitting a global model, fit a local model in the region where you are located.

#### Learning with local models and trust regions
Local models: fit a new model at each time step.

What controller to execute?
Version 0.5: doesn't correct for drift or deviations.
Version 1.0: gives you one trajectory over and over again.
Version 2.0: stochastic, adds noise to vary the states.  How much noise depends on the importance of the step.

__local model__
- run controllers
- collect dynamics
- fit dynamics to get derivatives
- feed the derivatives to LQR, repeat
- fit the dynamics
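One pass of this loop can be sketched as below. The assumptions are loud ones: the point-mass system, the quadratic cost weights `Q` and `R`, random initial states, and an initial controller that is pure noise are all hypothetical. The fitted `A_t` and `B_t` play the role of the dynamics derivatives that LQR needs.

```python
import numpy as np

rng = np.random.default_rng(0)
A_TRUE = np.array([[1.0, 0.1], [0.0, 1.0]])    # hypothetical point mass
B_TRUE = np.array([[0.0], [0.1]])
T, N = 20, 30                                   # horizon, number of rollouts
Q, R = np.eye(2), 0.1 * np.eye(1)               # quadratic cost weights

def rollout(gains, x0, noise_std):
    """Run the controller u_t = K_t x_t + noise on the real system."""
    xs, us, x = [x0], [], x0
    for t in range(T):
        u = gains[t] @ x + noise_std * rng.standard_normal(1)
        x = A_TRUE @ x + B_TRUE @ u
        us.append(u)
        xs.append(x)
    return np.array(xs), np.array(us)

def fit_local_dynamics(all_xs, all_us):
    """Per-time-step least squares: x_{t+1} ~ A_t x_t + B_t u_t."""
    models = []
    for t in range(T):
        inputs = np.hstack([all_xs[:, t], all_us[:, t]])            # (N, 3)
        W, *_ = np.linalg.lstsq(inputs, all_xs[:, t + 1], rcond=None)
        models.append((W[:2].T, W[2:].T))
    return models

def lqr_backward(models):
    """Finite-horizon LQR backward pass; returns gains with u_t = K_t x_t."""
    V, gains = Q.copy(), [None] * T
    for t in reversed(range(T)):
        A, B = models[t]
        K = -np.linalg.solve(R + B.T @ V @ B, B.T @ V @ A)
        V = Q + A.T @ V @ (A + B @ K)
        gains[t] = K
    return gains

def total_cost(xs, us):
    return np.sum(xs @ Q * xs) + np.sum(us @ R * us)

# Run controllers, collect data, fit dynamics, feed them to LQR.
gains = [np.zeros((1, 2))] * T                  # initial controller: noise only
data = [rollout(gains, x0, noise_std=1.0) for x0 in rng.standard_normal((N, 2))]
all_xs = np.array([xs for xs, _ in data])       # (N, T + 1, 2)
all_us = np.array([us for _, us in data])       # (N, T, 1)
models = fit_local_dynamics(all_xs, all_us)
gains = lqr_backward(models)
```

In the full method the new `gains` would be run again to collect fresh data and the fit would be repeated. Here the true system is linear, so a single pass already recovers it.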

Can we do better?
Version 2.0: fit the local dynamics with Bayesian linear regression, using a global model (GP, deep net, GMM) as the prior.

What if we go too far?
The linearization is only valid in a local region. We want to keep the error of the controller as small as possible.  
The controller is linear-Gaussian, so it induces a distribution over trajectories.  We use a trust region to keep the new trajectory distribution close to the old one.

The trust region is measured with KL divergences between trajectory distributions.
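
One way to write the trust region down, as a sketch in notation that the notes do not use ($\bar{p}$ for the previous controller's trajectory distribution, $c$ for the cost, and $\epsilon$ for the trust-region size are assumed names):

$$
\min_{p} \; \mathbb{E}_{p(\tau)}\left[c(\tau)\right]
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\left(p(\tau) \,\|\, \bar{p}(\tau)\right) \le \epsilon
$$

If both distributions factor as $p(\tau) = p(x_1) \prod_{t=1}^{T} p(u_t \mid x_t)\, p(x_{t+1} \mid x_t, u_t)$ with the same initial-state distribution and the same dynamics, those terms cancel in the log ratio and only the controllers remain:

$$
D_{\mathrm{KL}}\left(p(\tau) \,\|\, \bar{p}(\tau)\right)
= \sum_{t=1}^{T} \mathbb{E}_{p(x_t, u_t)}\left[\log p(u_t \mid x_t) - \log \bar{p}(u_t \mid x_t)\right]
$$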

### Video Lecture 5

---
__[Berkeley DeepRL Video Lecture 5](https://youtu.be/o0Ebur3aNMo)__

__Gaze Heuristic__
Useful when the physics of a task is prohibitively complex: view an object and make corrections as the object moves to keep it in the same location in the field of view.  Allows you to ignore the model of the environment when you develop a policy.

##### Can we backpropagate into the policy?
Problems: each action affects all future actions, so changes to early actions have a big gradient.  If you change an action at the beginning, it changes the last action A LOT (susceptible to vanishing/exploding gradients).<br>
The dynamics aren't simple; they are chosen by nature and can be very complex.<br>

##### Collocation Methods
With collocation the early actions don't 'wave the end' of the trajectory around, but these methods are more expensive computationally.<br>
Extending trajectory optimization to policy optimization.<br>

How to solve the constrained optimization:
- dual gradient descent (DGD): imposes the constraints
- DGD with a small tweak (augmented Lagrangian): better behaved, keeps the constraint violation small.

#### Constraining trajectory optimization with dual gradient descent
Minimize the cost of some trajectory, subject to a constraint that the actions along the trajectory are the actions the policy would have taken.<br>

_Optimize to convergence_
1. minimize with respect to the trajectory
2. minimize with respect to the policy parameters 
3. increment the dual variable
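
The three steps can be tried on a toy problem. Everything in this sketch is an assumption made for illustration: two states, one action per state, a quadratic cost that prefers the actions in `targets`, and a one-parameter linear policy `u = theta * s` that cannot match those actions exactly. It uses the augmented Lagrangian variant, with the penalty weight `rho` doubling as the dual step size.

```python
# Hypothetical toy problem: two states, one action per state.
states = [1.0, 2.0]      # s_i
targets = [1.0, 1.0]     # a_i: the action the cost prefers at each state
rho = 1.0                # penalty weight, also used as the dual step size

def cost(actions):
    """Trajectory cost: squared distance to the preferred actions."""
    return sum((u - a) ** 2 for u, a in zip(actions, targets))

def solve(n_iters=100):
    theta = 0.0
    duals = [0.0, 0.0]
    actions = [0.0, 0.0]
    for _ in range(n_iters):
        # 1. Minimize the augmented Lagrangian w.r.t. the trajectory actions
        #    (closed form, because the toy cost is quadratic).
        actions = [(2 * a - lam + rho * theta * s) / (2 + rho)
                   for a, lam, s in zip(targets, duals, states)]
        # 2. Minimize w.r.t. the policy parameter (a least-squares fit).
        theta = (sum(s * (lam + rho * u)
                     for s, lam, u in zip(states, duals, actions))
                 / (rho * sum(s * s for s in states)))
        # 3. Increment the dual variables by the constraint violation.
        duals = [lam + rho * (u - theta * s)
                 for lam, u, s in zip(duals, actions, states)]
    return actions, theta, duals

actions, theta, duals = solve()
```

At convergence the trajectory actions equal `theta * s` at both states, so the trajectory has been pulled toward something the limited policy can actually reproduce.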

__Guided Policy Search__ (deterministic policies, dynamics)
- constrained trajectory optimization (generating the data for imitation)
- imitation of an optimal control expert (imitation learning)
- the teacher adapts to the learner to avoid actions the learner can't mimic.

Need to choose:<br>
form of p(tau)<br>
optimization method for p(tau)<br>
surrogate cost (how to enforce constraint)<br>
supervised objective for the policy<br>

__Stochastic Case__ (gaussian distribution)
1. Optimize p(tau) with respect to some surrogate cost. (LQR)
2. Optimize theta with respect to some supervised objective.
3. Increment or modify the dual variables
Can now be used with learned local models.  

#### Input Remapping Trick
Assumes that you have the state (observations are not Markovian, but states are).<br>
At training time you have the full state for the trajectory optimization, but the policy is given only the observations, so that at test time it can work from observation input alone.

### Case Study: Vision-based control with GPS
Paper: End-to-End Training of Deep Visuomotor Policies<br>
Controlling a one-armed robot using vision: modifying trajectories to best suit the policy optimization procedure.<br>

---
Paper: Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning <br>
Imitation learning with DAgger: learn to play Atari games without Q-learning, using model-based planning instead.

*DAgger Problem*<br>
The computer can label the action, but the policy must be run in the real world.  Initially the policy won't be very good, and sometimes you want to avoid running bad policies (safety...).<br>
Model Predictive Control: PLATO algorithm<br>
Use this to avoid running a partially trained policy.  Pi hat minimizes the modified cost function.  Take actions that are as close as possible to the policy's, but still minimize the cost.

replanning = Model Predictive Control (MPC)

### DAgger vs Guided Policy Search (GPS)
- DAgger does not require an adaptive expert; any expert will work as long as the states visited by the learned policy can be labeled.  It assumes the expert's behavior can be imitated with bounded loss, so you can learn something from the expert and measure how wrong you are.
- GPS adapts the expert behavior, no need for bounded loss because expert changes (NPC).  

#### Why do we want to imitate optimal control?
These techniques are stable and easy to use.  Once you have the planner, the rest is supervised learning.  <br>
The input remapping trick can allow for quickly getting the policy.<br>
Overcomes the challenges of backpropagating into a policy.<br>
Viable for real systems, efficient.<br>

**Limitations**
You need a model.<br>
It takes time and data to learn the model<br>
There are assumptions.