### What is Q?
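Q(s,a) = immediate reward + discounted future reward
- Q is not greedy: it values an action by the reward it earns now plus the discounted rewards that follow, not by the immediate reward alone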

##### How to use Q?
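P(s) = argmax_a(Q(s,a)) is the policy: in state s, it picks the action a that maximizes Q(s,a)

P*(s) = argmax_a(Q*(s,a)) is the optimal policy, read off the optimal Q table Q*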
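
As a minimal sketch of this lookup, assume the Q table is stored as a list of rows indexed by an integer state, with one column per action. That layout, and the names `q_table` and `policy`, are illustrative assumptions rather than anything fixed by these notes.

```python
def policy(q_table, s):
    """Return the index of the action with the highest Q-value in state s."""
    row = q_table[s]
    # argmax over actions; ties resolve to the lowest action index
    return max(range(len(row)), key=lambda a: row[a])


# Hypothetical 2-state, 3-action table, for illustration only.
q_table = [
    [0.1, 0.5, 0.2],  # state 0: action 1 has the highest value
    [0.7, 0.0, 0.3],  # state 1: action 0 has the highest value
]
best_actions = [policy(q_table, s) for s in range(len(q_table))]
```

Here `best_actions` holds one action index per state, which is the whole policy for this toy table.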

How do we build that Q table?

### Learning Procedure
Big Picture
- Select training data
- iterate over time, collecting experience tuples (s,a,s',r)
- test policy P
- repeat until convergence (until it doesn't get any better, i.e., no better return)

Details
- set start time, init Q: we initialize the Q table with small random numbers
- compute s
- select a
- observe r, s'
- update Q
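
The detail steps can be sketched on a toy problem. Everything about the environment here is an assumption made for illustration: a corridor of five states where action 1 moves right, action 0 moves left, and reaching the last state pays a reward of 1. The sketch also runs a fixed number of episodes instead of testing the policy for convergence, and it uses the standard Q-learning update with hypothetical default parameters.

```python
import random


def step(s, a, num_states):
    """Hypothetical corridor environment: returns (next state, reward)."""
    s_next = min(num_states - 1, s + 1) if a == 1 else max(0, s - 1)
    r = 1.0 if s_next == num_states - 1 else 0.0
    return s_next, r


def train(num_states=5, num_actions=2, episodes=200,
          alpha=0.2, gamma=0.9, rand_prob=0.3, seed=0):
    rng = random.Random(seed)
    # init Q: small random numbers
    q = [[rng.uniform(-0.001, 0.001) for _ in range(num_actions)]
         for _ in range(num_states)]
    goal = num_states - 1
    for _ in range(episodes):
        s = 0                                    # compute s (start state)
        while s != goal:
            # select a: random with probability rand_prob, else greedy
            if rng.random() < rand_prob:
                a = rng.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda x: q[s][x])
            s_next, r = step(s, a, num_states)   # observe r, s'
            # update Q from the experience tuple (s, a, s', r)
            best_next = max(q[s_next])
            q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * best_next)
            s = s_next
    return q


q = train()
```

After training, the "move right" column should dominate in every non-goal state, since that is the only way to reach the reward.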

### Update Rule
Updating Q:
- $\alpha$: learning rate, between 0 and 1.0, usually 0.2
- $\gamma$: discount rate, between 0 and 1.0

$$Q'(s,a) = (1 - \alpha)\cdot Q(s,a) + \alpha \cdot \text{improved estimate}$$

$$Q'(s,a) = (1 - \alpha)\cdot Q(s,a) + \alpha \cdot (r + \gamma \cdot Q(s', \mathrm{argmax}_{a'}(Q(s', a'))))$$


### Update Rule Notes

The formula for computing Q for any state-action pair (s, a), given an experience tuple (s, a, s', r), is:

Q'[s, a] = (1 - α) · Q[s, a] + α · (r + γ · Q[s', argmax_a'(Q[s', a'])])


Here:

- r = R[s, a] is the immediate reward for taking action a in state s,
- γ ∈ [0, 1] (gamma) is the discount factor used to progressively reduce the value of future rewards,
- s' is the resulting next state,
- argmax_a'(Q[s', a']) is the action that maximizes the Q-value among all possible actions a' from s', and,
- α ∈ [0, 1] (alpha) is the learning rate used to vary the weight given to new experiences compared with past Q-values.
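
A small worked example may make the update concrete. The sketch below assumes a Q table stored as a list of rows (one per state) and uses made-up numbers; `q_update` is a hypothetical helper name. It relies on the fact that Q[s', argmax_a'(Q[s', a'])] is simply the largest value in row s'.

```python
def q_update(q, s, a, s_next, r, alpha=0.2, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q[s][a]."""
    best_next = max(q[s_next])  # value of the best action available in s'
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * best_next)
    return q[s][a]


# Illustrative table: 2 states, 2 actions.
q = [
    [0.0, 0.0],
    [1.0, 2.0],
]
# Experience tuple (s=0, a=1, s'=1, r=1.0):
# (1 - 0.2) * 0.0 + 0.2 * (1.0 + 0.9 * 2.0) = 0.2 * 2.8 = 0.56
new_value = q_update(q, s=0, a=1, s_next=1, r=1.0)
```

With α = 0.2, only a fifth of the improved estimate (2.8) is blended in, so the entry moves from 0 to about 0.56 rather than jumping straight to 2.8.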



### Two Finer Points
- Success depends on exploration. One way to explore is with randomness.
- Choose a random action with probability c.

Set c to about 0.3 at the beginning of learning and make it smaller with each iteration, so that the learner relies more and more on its Q table. Random actions let it reach states it might not otherwise arrive at.
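
One way to sketch this in code is shown below. The multiplicative decay factor of 0.99 and the names `choose_action`, `rand_prob`, and `decay` are assumptions for illustration; the notes only say that c starts near 0.3 and shrinks over time.

```python
import random


def choose_action(q_row, rand_prob, rng):
    """Pick a random action with probability rand_prob, else the greedy one."""
    if rng.random() < rand_prob:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


rng = random.Random(0)
rand_prob = 0.3           # c at the beginning of learning
decay = 0.99              # hypothetical decay factor applied each iteration
q_row = [0.1, 0.9, 0.4]   # illustrative Q-values for a single state

actions = []
for _ in range(1000):
    actions.append(choose_action(q_row, rand_prob, rng))
    rand_prob *= decay    # explore less as learning progresses
```

Early iterations pick a non-greedy action fairly often; by the end `rand_prob` is close to zero and the greedy action (index 1 here) is chosen almost every time.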

### The Trading Problem: Actions
- Buy
- Sell
- Nothing

### The Trading Problem: Rewards
Which results in faster convergence?
- r = daily return. Immediate reward
- r = 0 until exit, then cumulative return. Delayed Reward

Daily returns. With the delayed reward, the learner gets a single reward at the exit and has to infer, all the way back, which of the earlier actions were correct.
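
The two reward signals can be compared on a short price series. This is a sketch under simplifying assumptions: the learner holds one long position from the first price to the last, the prices are made up, and both function names are hypothetical.

```python
def daily_return_rewards(prices):
    """Immediate reward: one daily return per step."""
    return [prices[t] / prices[t - 1] - 1.0 for t in range(1, len(prices))]


def delayed_rewards(prices):
    """Delayed reward: zero until exit, then the cumulative return.

    Assumes at least two prices.
    """
    steps = len(prices) - 1
    return [0.0] * (steps - 1) + [prices[-1] / prices[0] - 1.0]


prices = [100.0, 102.0, 101.0, 105.0]   # illustrative adjusted closes
immediate = daily_return_rewards(prices)
delayed = delayed_rewards(prices)
```

Both lists describe the same 5% overall gain, but `immediate` gives feedback on every step while `delayed` is zero everywhere except the final entry.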

### The Trading Problem: State
What belongs in state?
- adjusted close / SMA (the ratio): adjusted close and SMA are poor factors on their own because their values differ greatly between stocks
- Bollinger Band Value
- P/E ratio
- Holding stock
- return since entry

### Creating the state
- State is an integer. (Not rounding)
- discretize each factor
- combine
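
A minimal sketch of the combine step, assuming each factor has already been discretized to a digit from 0 to 9 and that the digits are stacked into one base-10 integer. The function name `combine` and the sample digits are illustrative.

```python
def combine(digits, base=10):
    """Stack discretized factors into a single integer state."""
    state = 0
    for d in digits:
        if not 0 <= d < base:
            raise ValueError(f"factor value {d} is outside 0..{base - 1}")
        state = state * base + d
    return state


# Four hypothetical factors discretized to 2, 5, 0 and 9 give state 2509.
state = combine([2, 5, 0, 9])
```

With four factors of ten levels each, the state space has 10,000 possible values, which sets the number of rows the Q table needs.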

### Discretizing or Discretization
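We want to map a real number onto a limited integer scale (e.g., 0 to 9).
```
data = sorted(data)              # sort so thresholds fall at percentile cut points
stepsize = len(data) // steps    # number of samples in each bin
thresholds = [0.0] * steps
for i in range(steps):
    # the last sample of each bin is that bin's upper threshold
    thresholds[i] = data[(i + 1) * stepsize - 1]
```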
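
Once the thresholds exist, a value can be mapped to a bin by finding the first threshold that is at least as large. The sketch below uses the standard library's `bisect_left` for that lookup; taking the last sample of each bin as its threshold and clamping out-of-range values to the top bin are assumptions, as are the function names.

```python
from bisect import bisect_left


def build_thresholds(data, steps):
    """Return `steps` upper thresholds with roughly equal samples per bin."""
    ordered = sorted(data)
    stepsize = len(ordered) // steps
    return [ordered[(i + 1) * stepsize - 1] for i in range(steps)]


def discretize(value, thresholds):
    """Map a real value to a bin index from 0 to len(thresholds) - 1."""
    # index of the first threshold >= value, clamped to the last bin
    return min(bisect_left(thresholds, value), len(thresholds) - 1)


data = [float(x) for x in range(100)]           # illustrative data: 0.0 .. 99.0
thresholds = build_thresholds(data, steps=10)   # [9.0, 19.0, ..., 99.0]
bins = [discretize(v, thresholds) for v in (0.0, 9.0, 10.0, 99.0, 1000.0)]
```

Because the thresholds come from sorted data, each bin holds about the same number of samples, even when the underlying values are unevenly spread.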

### Q-Learning Recap
Building a model
- define states, actions, rewards
- choose in-sample training period
- iterate: Q-table update
- backtest
- repeat last 2 steps

Testing a model
- backtest on later data

#### Summary
Advantages

- The main advantage of a model-free approach like Q-Learning over model-based techniques is that it can easily be applied to domains where not all states and/or transitions are fully defined.
- As a result, we do not need additional data structures to store transitions T(s, a, s') or rewards R(s, a).
- Also, the Q-value for any state-action pair takes into account future rewards. Thus, it encodes both the best possible value of a state (max_a Q(s, a)) as well as the best policy in terms of the action that should be taken (argmax_a Q(s, a)).

Issues

- The biggest challenge is that the reward (e.g. for buying a stock) often comes in the future - representing that properly requires look-ahead and careful weighting.
- Another problem is that taking random actions (such as trades) just to learn a good strategy is not really feasible (you'll end up losing a lot of money!).
- In the next lesson, we will discuss an algorithm that tries to address this second problem by simulating the effect of actions based on historical data.

### Resources

- CS7641 Machine Learning, taught by Charles Isbell and Michael Littman
    - Watch for free on [Udacity](https://classroom.udacity.com/courses/ud262) (mini-course 3, lessons RL 1 - 4)
    - Watch for free on [YouTube](https://www.youtube.com/watch?v=_ocNerSvh5Y&list=PLAwxTw4SYaPnidDwo9e2c7ixIsu_pdSNp)
    - Or take the course as part of the OMSCS program!
- [RL course by David Silver](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html) (videos, slides)
- [A Painless Q-Learning Tutorial](http://mnemstudio.org/path-finding-q-learning-tutorial.htm)