## Berkeley Deep RL Course Notes

### Video Lecture 6

---
__[Berkeley DeepRL Course Lecture 6 Video](https://youtu.be/iHugFEgovhc)__

--- 
**Direct Collocation Methods for Trajectory Optimization and Policy Learning**
*Igor Mordatch, OpenAI*

Forward shooting is sensitive: small changes at the beginning of a trajectory can have a large impact on the end result.  It also has a very narrow feasible region: it cannot work through infeasible regions and can get stuck in local minima.  This can be mitigated by initializing from a demonstration or by adding some randomness.<br>
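
A minimal sketch of this sensitivity, assuming a toy unstable scalar system `x[t+1] = a * x[t] + u[t]` with `a = 1.5` (the system, the horizon, and the perturbation size are illustrative choices, not from the lecture). The same small nudge is applied once to the first action and once to the last action, and the change in the final state is compared.

```python
import numpy as np

def rollout(x0, actions, a=1.5):
    """Forward-integrate the toy dynamics x[t+1] = a * x[t] + u[t]."""
    x = x0
    for u in actions:
        x = a * x + u
    return x

T = 20                                   # horizon (assumed)
actions = np.zeros(T)
base_final = rollout(0.0, actions)

perturbed = actions.copy()
perturbed[0] += 1e-3                     # nudge only the first action
early_effect = rollout(0.0, perturbed) - base_final

perturbed = actions.copy()
perturbed[-1] += 1e-3                    # nudge only the last action
late_effect = rollout(0.0, perturbed) - base_final

print("effect of early nudge:", early_effect)
print("effect of late nudge: ", late_effect)
```

In this toy system the early nudge is amplified by `a` at every remaining step, which is the kind of ill-conditioning that makes shooting hard to optimize.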

Collocation: simultaneously optimize over actions and states, subject to constraints.<br>
Direct collocation: uses an inverse dynamics function, so each state is tied only to its neighbors, not to all the following states.  There are no instabilities from forward integration because there is no forward dynamics function anymore.  Satisfaction of the dynamics is an explicit constraint (hard or soft), which makes the method less prone to local minima.<br>
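
As a rough sketch of the idea, the snippet below optimizes states and actions together for a 1D point with velocity control, `x[t+1] = x[t] + dt * u[t]`, treating the dynamics and the boundary conditions as soft constraints. The system, the weights `w_dyn` and `w_bc`, and the helper `add_row` are assumptions made for illustration. Because every residual is linear here, the whole problem reduces to one least-squares solve.

```python
import numpy as np

T, dt = 10, 0.1
start, goal = 0.0, 1.0
w_dyn, w_bc = 100.0, 100.0       # soft-constraint weights (assumed values)

n_x, n_u = T + 1, T              # decision variables: x[0..T] followed by u[0..T-1]
rows, rhs = [], []

def add_row(coeffs, target, weight):
    """Append one weighted linear residual: weight * (sum(c * z[idx]) - target)."""
    row = np.zeros(n_x + n_u)
    for idx, c in coeffs:
        row[idx] = c
    rows.append(weight * row)
    rhs.append(weight * target)

for t in range(T):
    # Dynamics residual: x[t+1] - x[t] - dt * u[t] = 0
    add_row([(t + 1, 1.0), (t, -1.0), (n_x + t, -dt)], 0.0, w_dyn)
    # Effort residual: keep u[t] small
    add_row([(n_x + t, 1.0)], 0.0, 1.0)
add_row([(0, 1.0)], start, w_bc)     # start near `start`
add_row([(T, 1.0)], goal, w_bc)      # end near `goal`

A, b = np.array(rows), np.array(rhs)
z, *_ = np.linalg.lstsq(A, b, rcond=None)
states, actions = z[:n_x], z[n_x:]

# Largest remaining violation of the (soft) dynamics constraint
dyn_violation = np.abs(states[1:] - states[:-1] - dt * actions).max()
print(states.round(3), actions.round(3), dyn_violation)
```

Note that each dynamics residual touches only `x[t]`, `x[t+1]`, and `u[t]`, so no state is computed by integrating forward from the start.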

#### Numerical Solutions for Direct Collocation Methods
- Setting up a TensorFlow graph and optimizing with gradient descent works, but convergence is slow. Instead:
- Use a truncated 2nd-order method (iterative LQR, DDP)
 - Gauss-Newton method: get the trajectory, compute the gradient and Hessian, and obtain the solution through iterative Gauss-Newton steps.
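
The Gauss-Newton step above can be sketched on a small generic problem. This is not the lecture's trajectory solver: the residuals below are the Rosenbrock function written in least-squares form, chosen only because the result is easy to check. The gradient is `J.T @ r` and the Hessian is approximated by `J.T @ J`, which drops the second-derivative terms of the residuals.

```python
import numpy as np

def residuals(z):
    """Toy residual vector r(z); the cost is 0.5 * ||r(z)||^2."""
    x, y = z
    return np.array([1.0 - x, 10.0 * (y - x**2)])

def jacobian(z):
    """Analytic Jacobian dr/dz of the toy residuals."""
    x, _ = z
    return np.array([[-1.0, 0.0],
                     [-20.0 * x, 10.0]])

def gauss_newton(z0, n_steps=10, tol=1e-12):
    z = np.asarray(z0, dtype=float)
    for _ in range(n_steps):
        r, J = residuals(z), jacobian(z)
        grad = J.T @ r                 # gradient of the cost
        hess = J.T @ J                 # Gauss-Newton Hessian approximation
        step = np.linalg.solve(hess, -grad)
        z = z + step
        if np.linalg.norm(step) < tol:
            break
    return z

z_opt = gauss_newton([-1.2, 1.0])
print(z_opt)
```

A practical solver would usually add damping or a line search; this sketch takes full steps, which happens to be enough for this particular problem.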
 
#### Dynamics with contact
Shooting and collocation methods work for movement without contact. With contact (manipulation, walking, etc.), either method is hard to apply.<br>
Jumps and contact forces may be discontinuous.<br>
Inactive contacts provide no gradient, so the optimizer cannot tell that new forces would be applied once contact is made.<br>

Contact activity is an indirect function of state. Make contact activity a direct optimization variable.<br>

#### Contact-invariant optimization
The state also contains contact decision variables that indicate whether a part is in contact with something.  These are independent of the poses and can be decided independently of them.  The contact variables must stay consistent, so penalties are added for forces that are not wanted.  This removes the discontinuities caused by contact.<br>
Start out very soft (few rules for movement) and gradually enforce the rules until you get motion where all forces are plausible and agree with the poses.

#### Learning Policies from Trajectory Optimization
Trajectory optimization only solves one particular problem. To generalize, control policies need to be learned through imitation of the optimal controller.  <br>
Some policies may be difficult to learn because the solutions are very inconsistent. When the solutions split between two policies (starting with the left or the right), the average of the two cannot be used.<br>
Can add auxiliary variables, enforce soft consistency between variables, and search in the larger/easier space.
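
The left/right problem can be illustrated with a tiny regression, assuming hypothetical demonstrations that all start from the same state and steer either fully left (`-1`) or fully right (`+1`). A least-squares fit returns the average of the two modes, which matches neither demonstration.

```python
import numpy as np

# Hypothetical demonstrations from one state: half go left (-1), half go right (+1).
demo_actions = np.array([-1.0, 1.0] * 50)

# Constant feature: every demonstration starts from the same state.
features = np.ones((len(demo_actions), 1))

# Squared-error regression, as a stand-in for naive policy fitting.
weights, *_ = np.linalg.lstsq(features, demo_actions, rcond=None)
predicted_action = float(weights[0])

# Distance from the fitted action to the nearest demonstrated action.
gap_to_nearest_demo = np.abs(demo_actions - predicted_action).min()
print(predicted_action, gap_to_nearest_demo)
```

The fitted action sits halfway between the two modes, which is the failure case the notes describe.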

_Policy is not converging_<br>
Errors are not independent and can cause drift over time. <br>
The policy should use the neighborhood of each state to determine how the controls should change to stay on the optimal trajectory.  Inject noise into the states so the policy learns these corrections and converges.<br>

Decompose alternating optimization into:
- trajectory optimizations
- regression

#### Unknown/Uncertain Dynamics Applications
If the actual robot's specs are not perfectly known, noise can be added to the models and the policy optimized over multiple state trajectories. This can produce a policy that is more likely to succeed when transferred from simulation to the real world.  With too much noise, the policy becomes so conservative that it cannot move at all.   The policy can also be updated while running: predict what the next step should look like after the next action, then analyze what actually happens.
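
A very small sketch of optimizing over several perturbed models, under strong simplifying assumptions: the "robot" is a body of uncertain mass `m`, a single push `u` moves it by `x = u / m`, and the ensemble of masses is hypothetical. The push is chosen to minimize the mean squared miss over the whole ensemble instead of over the nominal model alone.

```python
import numpy as np

target = 1.0
# Hypothetical ensemble of plausible masses (nominal mass is 1.0).
masses = np.array([0.6, 1.0, 1.4])

# Push that is exact for the nominal model only.
nominal_push = target * 1.0

# Minimize sum over the ensemble of (u / m - target)^2, a linear least-squares problem in u.
gains = (1.0 / masses).reshape(-1, 1)
targets = np.full(len(masses), target)
solution, *_ = np.linalg.lstsq(gains, targets, rcond=None)
robust_push = float(solution[0])

print(nominal_push, robust_push)
```

In this toy case the ensemble solution pushes less than the nominal one, a mild version of the conservatism the notes warn about when the model noise is large.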

### Video Lecture 7

---
__[Berkeley DeepRL Course Lecture 7 Video](https://youtu.be/IL3gVyJMmhg)__

--- 
**Markov Decision Processes and Solving Finite problems**<br>
Focus: solving without the expert.<br>
Assumes a known system with finite states and actions.

If you modify value iteration to use small updates and neural nets -> deep Q-network methods<br>
If you modify policy iteration to use small updates and neural nets -> deep policy gradient methods

#### Markov Decision Process
- S: state space (a set of states of the environment)
- A: action space (a set of actions which the agent selects from at each timestep)
- P: probability of the reward and the next state, given the current state and action (transition probability distribution)

**Partially observed MDP**
Frozen Lake: a simple MDP (a gridworld in OpenAI Gym) with a start location and a goal location.
- Gym: FrozenLake-v0
- START state, GOAL state, other locations are FROZEN (safe) or HOLE (unsafe)
- Episode terminates when GOAL or HOLE state is reached
- Reward = 1 for GOAL, 0 otherwise
- The 4 directions are the actions, but you will move in the wrong direction with probability 0.5
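
To make the S/A/P description concrete, here is a hand-built tabular MDP in the spirit of Frozen Lake. It does not use Gym: the 2x2 map, the array layout `P[s, a, s']` and `R[s, a, s']`, and the choice to spread the 0.5 slip probability evenly over the three other directions are all assumptions for illustration, and Gym's own transition table may differ.

```python
import numpy as np

GRID = ["SF",
        "HG"]                                  # hypothetical 2x2 map
N_ROWS, N_COLS = len(GRID), len(GRID[0])
N_STATES, N_ACTIONS = N_ROWS * N_COLS, 4
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]     # left, down, right, up
SLIP_PROB = 0.5                                # from the notes; the even split below is assumed

def step_cell(row, col, move):
    """Move one cell, staying in place if the move would leave the grid."""
    new_row = min(max(row + move[0], 0), N_ROWS - 1)
    new_col = min(max(col + move[1], 0), N_COLS - 1)
    return new_row, new_col

P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  # transition probabilities P[s, a, s']
R = np.zeros((N_STATES, N_ACTIONS, N_STATES))  # rewards R[s, a, s']
for s in range(N_STATES):
    row, col = divmod(s, N_COLS)
    for a in range(N_ACTIONS):
        if GRID[row][col] in "HG":
            P[s, a, s] = 1.0                   # terminal cells absorb with zero reward
            continue
        for actual in range(N_ACTIONS):
            prob = 1.0 - SLIP_PROB if actual == a else SLIP_PROB / 3
            r2, c2 = step_cell(row, col, MOVES[actual])
            s2 = r2 * N_COLS + c2
            P[s, a, s2] += prob
            if GRID[r2][c2] == "G":
                R[s, a, s2] = 1.0              # reward 1 only for entering GOAL

print(P[1, 1])   # next-state distribution for "down" from the FROZEN cell
```

Every `P[s, a]` row is a probability distribution over next states, which is what the P in the MDP definition encodes.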

Policies: how the actions are chosen<br>
Deterministic policies: a = pi(s), the action is a function of the state.<br>
Stochastic policies: a ~ pi(a|s), the action is drawn from a conditional distribution given the state.<br>
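
For a finite MDP both kinds of policy fit in small arrays. The sketch below assumes 3 states and 2 actions with made-up entries: the deterministic policy is a lookup table, and the stochastic policy is a table whose row `s` holds the distribution `pi(a|s)`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Deterministic policy: one action index per state, a = pi(s).
deterministic_pi = np.array([0, 1, 1])

# Stochastic policy: row s is the distribution pi(a | s).
stochastic_pi = np.array([[0.9, 0.1],
                          [0.5, 0.5],
                          [0.0, 1.0]])

def act_deterministic(s):
    return int(deterministic_pi[s])

def act_stochastic(s):
    # Sample an action index according to the row for state s.
    return int(rng.choice(n_actions, p=stochastic_pi[s]))

print(act_deterministic(0), act_stochastic(0))
```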

The problems solved with an MDP include policy optimization (maximize the expected reward with respect to the policy) and policy evaluation (compute the expected return for a given policy). <br>
The return is the sum of future rewards in an episode; the discount is the weighting applied to future rewards.<br>
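
A short sketch of the discounted return for one episode, assuming the common convention that the reward at step `t` is weighted by `gamma**t` (the function name and the sample rewards are illustrative).

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Frozen-Lake-style episode: reward 1 arrives only on the third step.
episode_return = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
print(episode_return)
```

With `gamma = 1` this reduces to the plain sum of rewards; smaller values of `gamma` shrink the contribution of late rewards.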

#### Value iteration
Value iteration is related to dynamic programming algorithms.  <br>
The finite horizon case: the optimal policy might be time dependent.  (LQR is value iteration)<br>
Infinite (or very long) time horizon: the discount downweights future rewards, giving the discounted return.  Find the policy that optimizes the discounted sum of rewards for each state.  
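
A minimal sketch of discounted value iteration on a hand-made MDP. The two-state MDP, the array layout (`P[s, a, s']`, expected reward `R[s, a]`), and the stopping tolerance are assumptions for illustration. Action 0 stays in the current state, action 1 switches state, and only staying in state 1 pays a reward of 1.

```python
import numpy as np

P = np.zeros((2, 2, 2))          # P[s, a, s']
P[0, 0, 0] = 1.0                 # state 0, stay
P[0, 1, 1] = 1.0                 # state 0, switch
P[1, 0, 1] = 1.0                 # state 1, stay
P[1, 1, 0] = 1.0                 # state 1, switch
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])       # R[s, a]
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        Q = R + gamma * P @ V            # Bellman backup, Q[s, a]
        V_new = Q.max(axis=1)            # act greedily in every state
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    greedy_policy = (R + gamma * P @ V).argmax(axis=1)
    return V, greedy_policy

V_star, pi_star = value_iteration(P, R, gamma)
print(V_star, pi_star)
```

For this MDP the values can be checked by hand: staying in state 1 forever is worth `1 / (1 - gamma) = 10`, and state 0 is one discounted step away from that.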

#### Policy iteration
Initialize the policy, evaluate it to get the value function, then compute a new policy that is greedy with respect to that value function.<br>
This reaches the optimal policy and value function after a finite number of iterations.  It typically converges in fewer iterations than value iteration.<br>
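
The evaluate-then-improve loop can be sketched as follows, again on an assumed two-state MDP (action 0 stays, action 1 switches, only staying in state 1 pays 1). Policy evaluation is done exactly by solving the linear system `(I - gamma * P_pi) V = R_pi`, which is one of several possible choices.

```python
import numpy as np

P = np.zeros((2, 2, 2))          # P[s, a, s']
P[0, 0, 0] = 1.0                 # state 0, stay
P[0, 1, 1] = 1.0                 # state 0, switch
P[1, 0, 1] = 1.0                 # state 1, stay
P[1, 1, 0] = 1.0                 # state 1, switch
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])       # R[s, a]
gamma = 0.9

def policy_iteration(P, R, gamma, max_iters=100):
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)       # initial policy: always action 0
    n_updates = 0
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi, R_pi = P[idx, policy], R[idx, policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
        n_updates += 1
    return V, policy, n_updates

V_pi, pi_final, n_updates = policy_iteration(P, R, gamma)
print(V_pi, pi_final, n_updates)
```

On this tiny example the policy changes once and then stays fixed, at which point it is greedy with respect to its own value function.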
