## Berkeley Deep RL Course Notes

### Video Lecture 1
***
__[Berkeley DeepRL Course Lecture 1 Video](https://youtu.be/8jQIKgTzQd4)__

Reinforcement learning has a stateful world.  The input is not just randomly sampled; it depends on the previous inputs and previous actions.  

The input data changes as a function of what the agent is doing: it takes an action and receives a cost.

### Video Lecture 2
***
__[Berkeley DeepRL Course Lecture 2 Video](https://youtu.be/kl_G95uKTHw)__

_Definitions_

State: $X_{t}$,<br>
Observation: $O_{t}$,<br>
Action: $U_{t}$, <br>
Policy: $\pi_{\theta}(U_{t}|O_{t})$ (a distribution over actions given the observation),<br>
Cost Function: $C(X_{t}, U_{t})$,<br>
Reward Function: $R(X_{t}, U_{t})$ (just negative C)<br>

### Imitation Learning and Behavior Cloning

Sequential decision problems unfold over time steps. Actions can be discrete or continuous.

Markov Property: each state X influences the next one. If you know the current state you can forget about the past, because everything relevant from the past is already included in the current state.

State X has to satisfy the Markov property; observation O does not (the process is then a partially observed Markov model).
If the system is fully observed, O and X are the same and the process is a full Markov model.

#### Imitation Learning

observations --> training data --> supervised learning --> output  (does this work: no, but...)

__Distribution Mismatch Problem__
Every time you train with supervised learning and then run the policy, it will deviate a little from the training trajectory (an epsilon error).  That small error in the action causes a small error in the state, leading to a state that was not present in the training set. Each following step adds more error, so the errors compound as the trajectory progresses.

The problem can be addressed with self-stabilizing methods that estimate and correct for the error.  Training on a distribution of trajectories, rather than just a single path, makes the policy more stable.  Include data that contains errors/mistakes together with their corrections, so that the policy learns to recover and the error does not compound.
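
To make the compounding concrete, here is a minimal toy sketch. It is not from the lecture: the 1-D "lateral offset" system, the fixed per-step `bias`, and both policies are assumptions made up for illustration. A policy that only ever saw on-path states never steers back, so its offset grows with every step, while a policy that also saw off-path states with corrective labels keeps the offset bounded.

```python
def rollout(policy, steps, bias):
    """Roll out a 1-D lateral offset from the path; the ideal offset is 0."""
    offset, total_cost = 0.0, 0.0
    for _ in range(steps):
        # `bias` models the small per-step policy error (the epsilon error).
        offset += policy(offset) + bias
        total_cost += abs(offset)
    return offset, total_cost


def cloned_policy(offset):
    # Trained only on on-path states, so it never learned to steer back.
    return 0.0


def corrective_policy(offset):
    # Also saw off-path states with corrective labels, so it steers toward the path.
    return -0.5 * offset
```

Calling `rollout(cloned_policy, 100, 0.01)` and `rollout(corrective_policy, 100, 0.01)` shows the difference: the first drifts further from the path at every step, the second settles at a small constant offset.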

Instead of trying to be creative with the policy, be creative about the data.

The cost function for imitation learning is just the distance between the policy's action and the human's action.

##### DAgger: Dataset Aggregation
Collect training data from the states the policy itself visits. But how do you get the labels?

How it works:
1. Collect human data, train using supervised learning algorithm.
2. Run the policy to collect a new set of observations (the rollouts can be a mixture of policy and human control).
3. Ask a human to label the new observations with what the human thinks is a good action.
4. Construct a dataset that is a union of the original and new datasets.
5. Repeat until the dataset becomes primarily from the policy.
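
The steps above can be sketched on a toy problem. Everything here is an assumption for illustration: the 1-D system `x' = x + u`, the scripted `expert_action` standing in for the human labeler, the nearest-neighbor learner standing in for supervised learning, and a fixed iteration count in place of step 5's stopping rule.

```python
def expert_action(x):
    """Hypothetical expert: push the state toward 0, action clipped to [-1, 1]."""
    return max(-1.0, min(1.0, -x))


def fit_nearest_neighbor(dataset):
    """Supervised-learning stand-in: 1-nearest-neighbor over (state, action) pairs."""
    def policy(x):
        _, action = min(dataset, key=lambda pair: abs(pair[0] - x))
        return action
    return policy


def rollout(policy, x0, steps):
    """Run the policy in the toy system x' = x + u; return the visited states."""
    states, x = [], x0
    for _ in range(steps):
        states.append(x)
        x = x + policy(x)
    return states


def dagger(x0, steps=10, iterations=5):
    # Step 1: expert demonstrations, here only from states near 0.
    dataset = [(x, expert_action(x)) for x in (-0.2, -0.1, 0.0, 0.1, 0.2)]
    policy = fit_nearest_neighbor(dataset)
    for _ in range(iterations):
        visited = rollout(policy, x0, steps)                 # Step 2: run the policy
        labeled = [(x, expert_action(x)) for x in visited]   # Step 3: expert labels
        dataset = dataset + labeled                          # Step 4: aggregate
        policy = fit_nearest_neighbor(dataset)               # retrain on the union
    return policy, dataset
```

Starting from `x0=3.0`, the initial policy only knows the small corrections seen near 0 and crawls toward the origin. After the expert labels the states the policy actually visited, the retrained policy takes the expert's full-size steps from those states.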

Problems with DAgger:
- Step 3 is non-trivial and not intuitive to implement: the human has to label recorded observations (like a video) after the fact.

Imitation learning is often insufficient by itself.  It can work well with some hacky modifications, samples from a stable trajectory distribution, and more on-policy data (DAgger).

##### Problems with Imitation Data
- Humans need to provide the data- this is finite and may not be sufficient.
- There are actions that humans are not good at and can't demonstrate well, some are impossible.
- Humans can learn autonomously, which gives unlimited data collection and constant improvement; machines cannot do this through imitation learning alone.

__Learning without humans?__
What makes an action good? It minimizes the sum of the cost over all time steps.
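
Written out as a sketch, with the cost taken as a function of state and action and with a horizon $T$ that the notes do not name explicitly (both are assumptions here), the objective reads roughly:

$$\min_{U_{1},\dots,U_{T}} \; \sum_{t=1}^{T} C(X_{t}, U_{t})$$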

__Cost and Reward Function Problems__
Reward functions are not obvious and can be hard to evaluate (is there water in the glass? That is hard to answer with a computer).



#### Research Papers on Imitation learning

- _A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots_
  - Goal: navigate a quadcopter down a forest trail without it going off the trail.
  - Abstracted the scene to forward, left, right.
  - Perceiving the images is difficult, hence imitation learning.
  - Collected training data by having a person walk the trail wearing a headband with 3 cameras (left, right, center).
  - Used a classifier, then tested with a person using a cell phone camera.

- _Learning Transferable Policies for Monocular Reactive MAV Control_
  - Goal: use human-labeled quadcopter data from summer to train the copter to fly in winter.
  - Trained a CNN to map from the image to the actions.
  - Trained the network so that its upper layers match across the winter/spring domains.
  - Attempted to transfer from one type of drone to another, from one type of forest to another, and from a forest in summer to one in winter.

- _Learning Real Manipulation Tasks from Virtual Demonstrations using LSTM_
  - Goal: put an object on the shelf, push an object.
  - Not using computer vision, just using tags.
  - If the grip fails, it will try again.

#### Other Topics in Imitation Learning
- Structured Prediction: network that answers questions.
- Interaction and Active Learning: reason about which states to visit to best learn.
- Inverse Reinforcement Learning: instead of copying the demonstration, figure out what the goal was and determine on your own how to accomplish it.


### Video Lecture 3

---
__[Berkeley DeepRL Course Lecture 3 Video](https://youtu.be/mZtlW_xtarI)__

#### Making Decisions under Known Dynamics
The costs form a sequence: you could choose a low-cost action at the beginning that leads to very high costs later.
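
A tiny made-up example shows the effect. The `MODEL` table, the state names, and the action names are all hypothetical. Greedily taking the cheapest action at each step walks into an expensive state, while planning over the whole sequence with the known dynamics avoids it.

```python
from itertools import product

# Hypothetical known dynamics: (state, action) -> (cost, next_state).
MODEL = {
    ("start", "cheap"): (1, "trap"),
    ("start", "pricey"): (3, "safe"),
    ("trap", "cheap"): (10, "trap"),
    ("trap", "pricey"): (12, "trap"),
    ("safe", "cheap"): (1, "safe"),
    ("safe", "pricey"): (2, "safe"),
}
ACTIONS = ("cheap", "pricey")


def total_cost(actions, state="start"):
    """Sum the cost of an action sequence under the known dynamics."""
    total = 0
    for action in actions:
        cost, state = MODEL[(state, action)]
        total += cost
    return total


def greedy_plan(horizon, state="start"):
    """Pick the cheapest immediate action at every step, ignoring the future."""
    plan = []
    for _ in range(horizon):
        action = min(ACTIONS, key=lambda a: MODEL[(state, a)][0])
        plan.append(action)
        state = MODEL[(state, action)][1]
    return tuple(plan)


def best_plan(horizon):
    """Enumerate every action sequence and keep the one with the lowest total cost."""
    return min(product(ACTIONS, repeat=horizon), key=total_cost)
```

With a horizon of 2, `greedy_plan(2)` costs 11 in total, while `best_plan(2)` pays 3 up front and only 4 overall.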

#### Trajectory Optimization: backpropagation through dynamical systems
![Trajectory Formula](Trajectory-Formula.jpg)
Boils down to: differentiate via backpropagation and optimize (it helps to use 2nd order methods).

#### Linear Dynamics:  linear-quadratic regulator (LQR)
_Shooting Methods_
Optimize over the actions only; the states are not free variables but consequences of the actions through the dynamics.  The action at step one affects the whole trajectory.  Simple.

_Collocation_
Keep the original constrained optimization problem and optimize over both the actions and the states, with the dynamics as constraints.  
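
As a sketch of the difference, write $f$ for the known dynamics, $C$ for the cost, and $T$ for the horizon (the horizon symbol is an assumption here). A shooting method substitutes the dynamics into the cost, so only the actions remain as variables:

$$\min_{U_{1},\dots,U_{T}} \; C(X_{1}, U_{1}) + C(f(X_{1}, U_{1}), U_{2}) + \dots + C(f(f(\dots)\dots), U_{T})$$

Collocation keeps the states as variables and enforces the dynamics as constraints:

$$\min_{U_{1},\dots,U_{T},\, X_{1},\dots,X_{T}} \; \sum_{t=1}^{T} C(X_{t}, U_{t}) \quad \text{s.t.} \quad X_{t} = f(X_{t-1}, U_{t-1})$$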

_Linear Case (a shooting method): LQR_

Simplified version for linear dynamical systems (robotics).  Make a simplifying assumption about the system: assume the dynamics (f) are linear, so the next state can be expressed as a linear function of the previous state and action.
Assume the cost is a quadratic function of the state and action. Start at the end of the trajectory and go backwards through time to compute the optimal action at each step as a function of the state (backward recursion), then go forward through time to get the actual states and actions (forward recursion).
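
A minimal scalar sketch of the two recursions, assuming dynamics `x' = a*x + b*u` and cost `q*x**2 + r*u**2` per step, with terminal cost `q*x**2`. The scalar setting and all parameter names are assumptions for illustration; the lecture works with matrices.

```python
def lqr_backward(a, b, q, r, horizon):
    """Backward recursion: feedback gains K_t and cost-to-go coefficients P_t."""
    p = q                                    # terminal cost-to-go: V_T(x) = q * x**2
    gains, values = [], [p]
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)    # optimal action is u_t = -k * x_t
        p = q + a * a * p - a * b * p * k    # Riccati update: V_t(x) = p * x**2
        gains.append(k)
        values.append(p)
    gains.reverse()                          # gains[t] now belongs to time step t
    values.reverse()                         # values[0] is the cost-to-go at t = 0
    return gains, values


def lqr_forward(a, b, gains, x0):
    """Forward recursion: apply the gains to obtain the actual states and actions."""
    xs, us = [x0], []
    for k in gains:
        u = -k * xs[-1]
        us.append(u)
        xs.append(a * xs[-1] + b * u)
    return xs, us


def trajectory_cost(xs, us, q, r):
    """Total quadratic cost of a trajectory, including the terminal state."""
    return sum(q * x * x for x in xs) + sum(r * u * u for u in us)
```

One way to check the sketch: the cost of the forward rollout should equal `values[0] * x0**2`, the cost-to-go predicted by the backward pass.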

#### Nonlinear Dynamics: differential dynamic programming (DDP) & iterative LQR
_Nonlinear Case: DDP (differential dynamic programming)/iterative LQR_

Approximate a nonlinear system as a linear-quadratic system (uses Taylor expansion). 
_Iteratively optimizing trajectory_
1. Start with a current guess of the states and actions 
2. Linearize the dynamics around those states and actions 
3. Quadratize your costs around those states and actions
4. Run LQR, with the forward pass using the true nonlinear dynamics
5. Get updated actions
6. Repeat until convergence
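
The loop above can be sketched for a scalar system. The toy dynamics `f`, the cost weights `Q` and `R`, the step size `DT`, and the simple backtracking over `alpha` are all assumptions for illustration, not the lecture's implementation. The backward pass uses first derivatives of the dynamics only, which is the iterative LQR variant rather than full DDP.

```python
import math

DT, Q, R = 0.1, 1.0, 0.1     # hypothetical step size and cost weights


def f(x, u):
    """Toy nonlinear dynamics (made up for this sketch)."""
    return x + DT * (math.sin(x) + u)


def cost(xs, us):
    return sum(Q * x * x for x in xs) + sum(R * u * u for u in us)


def simulate(x0, us):
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    return xs


def backward_pass(xs, us):
    """LQR backward recursion on the linearized dynamics and quadratic cost."""
    v_x, v_xx = 2 * Q * xs[-1], 2 * Q            # derivatives of the terminal cost
    ks, Ks = [], []
    for x, u in zip(reversed(xs[:-1]), reversed(us)):
        fx, fu = 1.0 + DT * math.cos(x), DT      # linearization of f around (x, u)
        q_x, q_u = 2 * Q * x + fx * v_x, 2 * R * u + fu * v_x
        q_xx, q_uu, q_ux = 2 * Q + fx * fx * v_xx, 2 * R + fu * fu * v_xx, fu * fx * v_xx
        k, K = -q_u / q_uu, -q_ux / q_uu         # open-loop step and feedback gain
        v_x, v_xx = q_x + K * q_u, q_xx + K * q_ux
        ks.append(k)
        Ks.append(K)
    return ks[::-1], Ks[::-1]


def forward_pass(xs, us, ks, Ks, alpha):
    """Roll out the updated actions through the true nonlinear dynamics."""
    new_xs, new_us = [xs[0]], []
    for x, u, k, K in zip(xs, us, ks, Ks):
        u_new = u + alpha * k + K * (new_xs[-1] - x)
        new_us.append(u_new)
        new_xs.append(f(new_xs[-1], u_new))
    return new_xs, new_us


def ilqr(x0, horizon=30, iterations=20):
    us = [0.0] * horizon                          # initial guess: do nothing
    xs = simulate(x0, us)
    for _ in range(iterations):
        ks, Ks = backward_pass(xs, us)
        for alpha in (1.0, 0.5, 0.25, 0.1, 0.01):  # shrink the step until cost drops
            new_xs, new_us = forward_pass(xs, us, ks, Ks, alpha)
            if cost(new_xs, new_us) < cost(xs, us):
                xs, us = new_xs, new_us
                break
        else:
            break                                 # no improving step: treat as converged
    return xs, us
```

The backtracking over `alpha` is one simple guard against a full step overshooting when the local approximation is poor.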

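Like Newton's method, the update can take steps that are too large: the linear-quadratic approximation is only accurate near the current trajectory, so a full step can lead to large errors.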

##### Case Study: nonlinear model-predictive control
_Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization_
(no deep or reinforcement learning)
Uses iterative LQR with model predictive control (MPC), where at every single time step you replan the optimal trajectory.
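
The replanning loop itself is short. In this sketch a brute-force search over three discrete actions stands in for iterative LQR, and the toy `model`, the drifting `true_system`, and the horizon are all assumptions for illustration. The point is only the structure: plan, execute the first action, observe the state actually reached, plan again.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)


def model(x, u):
    """Planner's model of the dynamics (hypothetical): no disturbance."""
    return x + u


def true_system(x, u):
    """The 'real' system drifts by +0.3 per step, which the model does not know."""
    return x + u + 0.3


def plan(x, horizon=3):
    """Stand-in planner: brute-force the action sequence with the lowest model cost."""
    def predicted_cost(actions):
        total, state = 0.0, x
        for u in actions:
            state = model(state, u)
            total += state * state          # cost: squared distance from 0
        return total
    return min(product(ACTIONS, repeat=horizon), key=predicted_cost)


def mpc(x0, steps):
    xs = [x0]
    for _ in range(steps):
        actions = plan(xs[-1])                        # replan from the state actually reached
        xs.append(true_system(xs[-1], actions[0]))    # execute only the first planned action
    return xs
```

Because each plan starts from the observed state, the model error from the drift does not accumulate: `mpc(5.0, steps=15)` brings the state near 0 and keeps it there.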

#### Discrete Systems: Monte-Carlo tree search (MCTS)
Discrete systems (like Atari games)
States (memory), observations (pixels), actions
States and observations are often combined because they are essentially the same for these types of games. Here we consider the image to be the state.  
The image and actions are discrete.

_Monte Carlo tree search_: a heuristic search algorithm that frames a planning problem as a search problem.

Start at the current time step and apply actions, which lead to new states.  Parts of the tree can be left unexpanded by estimating how good a state is, for example by running a policy from it; that policy may be random.
Choose which paths to search first by starting at the leaves and expanding the better subtrees first.  Choose the nodes that appear to be best so far, with a preference for nodes that haven't been visited often.  The system is more confident in the score of the commonly visited nodes and is encouraged to explore less visited ones to improve the accuracy of their scores.

_Generic MCTS sketch_
>find a leaf s_t using TreePolicy(s_1)<br>
>evaluate the leaf using DefaultPolicy(s_t)<br>
>update all values in the tree between s_t and s_1<br>
>repeat<br>
>take the best action from s_1

_UCT TreePolicy(s_t)_
>if s_t is not fully expanded, choose a new action a_t (child)<br>
>else choose the child with the best Score(s_t+1)
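
The sketch above can be turned into a small runnable example. The toy game (a state is the tuple of actions taken so far, and the reward is the fraction of 1s), the exploration constant `c`, and the exact form of `score` (average value plus a UCB-style bonus) are assumptions for illustration; the notes do not spell out the Score formula.

```python
import math
import random

HORIZON = 4
ACTIONS = (0, 1)


def step(state, action):
    """Toy deterministic dynamics: the state is the tuple of actions taken so far."""
    return state + (action,)


def is_terminal(state):
    return len(state) == HORIZON


def reward(state):
    """Terminal reward in [0, 1]: fraction of steps on which action 1 was chosen."""
    return sum(state) / HORIZON


class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                   # action -> Node
        self.visits, self.total = 0, 0.0


def score(child, c=1.0):
    # Average value plus a bonus that shrinks as the child is visited more often.
    bonus = c * math.sqrt(2 * math.log(child.parent.visits) / child.visits)
    return child.total / child.visits + bonus


def tree_policy(node):
    while not is_terminal(node.state):
        untried = [a for a in ACTIONS if a not in node.children]
        if untried:                          # not fully expanded: add a new child
            action = untried[0]
            node.children[action] = Node(step(node.state, action), parent=node)
            return node.children[action]
        node = max(node.children.values(), key=score)   # else follow the best score
    return node


def default_policy(state, rng):
    """Evaluate a leaf with a random rollout to the end of the game."""
    while not is_terminal(state):
        state = step(state, rng.choice(ACTIONS))
    return reward(state)


def mcts(root_state, iterations, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        leaf = tree_policy(root)
        value = default_policy(leaf.state, rng)
        node = leaf
        while node is not None:              # update values between the leaf and the root
            node.visits += 1
            node.total += value
            node = node.parent
    # Take the best action from the root, here the most visited child.
    return max(root.children, key=lambda a: root.children[a].visits)
```

In this toy game action 1 is always better, so `mcts((), 500)` should settle on it. Picking the most visited child rather than the highest average is a common choice, not something the notes specify.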

What is wrong with known dynamics?
There are domains where the dynamics (the physics) are exceptionally complex.