# Reinforcement Learning in Action
## Course 1: Introduction to Reinforcement Learning
___

## Course Syllabus
---
- What is RL?
- What kind of problems does RL solve?
- **C1: Foolsball**
- Modeling problems using the RL framework: Agent, Environment, State, Goals, Rewards, Returns
- MDPs and single-step dynamics
- State-values, action values, policies, optimality
- Solving MDP with known single-step dynamics
- Monte Carlo estimation, greedy policies, exploration and exploitation, epsilon-greedy policies
- TD methods: Sarsa, Sarsamax (Q-learning), Expected Sarsa
- **P1: Taxi-V3**
- Discretization: Tile-coding, Kernels

## Week 2 Summary
---

## Supervised Learning vs Planning and Control


| Supervised Learning | Learning Planning/Control |
|---------------------|---------------------------|
| Learning from examples | Learning from experience |
| Labels accompany every example | Labels (rewards) are sparse |
| Trying to maximize the number of correct predictions | Trying to maximize long-term reward |
| Dataset is static, representative, and manually labeled | Dataset is dynamic and generated through interaction |


### Supervised Learning Useful for Planning/Control?

Sometimes [\[1\]](http://rail.eecs.berkeley.edu/deeprlcoursesp17/docs/week_2_lecture_1_behavior_cloning.pdf) [\[2\]](https://images.nvidia.com/content/tegra/automotive/images/2016/solutions/pdf/end-to-end-dl-using-px.pdf)

![behavior-cloning.png](res/behavior-cloning.png)

## Modeling Learning to Plan and Control From Experience

### Environment and Agent Framework

![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTMDmrmnl_dAyjCOErHPak2gLXmQTgQnVT8gQ&usqp=CAU)

- The agent performs an action in the environment

- The state of the environment and agent change as a result

- The agent receives a reward and the updated state from the environment

- The goal is to learn the best action (decision) for every state so as to maximize the reward accumulated over the long term
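The interaction loop described above can be sketched in a few lines. This is a minimal illustration, not the course's Foolsball environment: `LineWorld`, `run_episode`, and the `max_steps` cap are hypothetical names introduced here, and the environment is assumed to be a 1-D line with a single rewarded goal cell.

```python
import random


class LineWorld:
    """Hypothetical 1-D environment: states 0..4, goal at state 4."""
    n_states = 5
    actions = ['left', 'right']

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Move one cell, clipped to the ends of the line.
        delta = 1 if action == 'right' else -1
        self.state = min(max(self.state + delta, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0  # sparse reward: only at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return its (state, action, reward) steps."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # agent acts
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


# A random policy, and a policy that always moves toward the goal.
random_trajectory = run_episode(LineWorld(), lambda s: random.choice(LineWorld.actions))
right_trajectory = run_episode(LineWorld(), lambda s: 'right')
```

Each step records the state the agent was in, the action it took, and the reward it received, which is the raw experience the later sections learn from.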

### Converting the problem into Code

- Construct a returns table (n_states x n_actions)

- Fill the table with the highest possible return for every (state, action) pair

### Solving the Problem

- Set the return for all actions in all terminal states to 0

- Exploit the recursive relationship between $return(state_t, action_t)$ and $return(state_{t+1}, action_{t+1})$

$$ Return(state_t,action_t) = Reward(state_t,state_{t+1}) + \gamma * \max \begin{bmatrix} Return(state_{t+1}, action_{t+1}=='n')\\ Return(state_{t+1}, action_{t+1}=='e')\\  Return(state_{t+1}, action_{t+1}=='w')\\  Return(state_{t+1}, action_{t+1}=='s') \end{bmatrix}$$
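As a worked instance of this recursion, assume (purely for illustration) $\gamma = 0.9$, a transition reward of $-1$, and next-state returns of 2, 5, 3, and 0 for the actions 'n', 'e', 'w', and 's':

$$ Return(state_t,action_t) = -1 + 0.9 * \max \begin{bmatrix} 2\\ 5\\ 3\\ 0 \end{bmatrix} = -1 + 0.9 * 5 = 3.5 $$

Only the best next action contributes, so the other three next-state returns do not affect the result.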

### Solving the Problem

- To avoid indefinite mutual recursion, use an iterative approach

    - Initialize all entries in the returns table to arbitrary values

    - Update each entry in the returns table using the recursive formula

$$ RETURNS\_TABLE(state_t,action_t) = Reward(state_t,state_{t+1}) + \gamma * \max \begin{bmatrix} RETURNS\_TABLE(state_{t+1}, action_{t+1}=='n')\\ RETURNS\_TABLE(state_{t+1}, action_{t+1}=='e')\\  RETURNS\_TABLE(state_{t+1}, action_{t+1}=='w')\\  RETURNS\_TABLE(state_{t+1}, action_{t+1}=='s') \end{bmatrix}$$

    - Repeat the previous step until all values in the table converge

### Solution At a Glance

```
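import pandas as pd

def make_returns_table(states_list, actions_list, terminal_states):
    """Create a returns table (states x actions) with every entry initialized to 0."""
    # Terminal-state entries must be 0; starting everything at 0.0 satisfies that,
    # so `terminal_states` is not needed here. Floats avoid dtype changes on update.
    table = pd.DataFrame.from_dict({s: {a: 0.0 for a in actions_list} for s in states_list}, orient='index')
    return table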
```

```
RETURNS_TBL = make_returns_table(range(foolsball.n_states), foolsball.actions, terminal_states)
```

```
def compute_returns(table, state, action, debug=False):
    """Update the discounted return for a (state, action) pair with a one-step lookahead."""
    # Terminal states are skipped, so they keep their initial return of 0.
    if not foolsball.__is_terminal_state__(state):
        next_state = foolsball.__get_next_state_on_action__(state, action)
        reward = foolsball.__get_reward_for_transition__(state, next_state)

        # Discounted best return available from the next state.
        update = HYPER_PARAMS['gamma'] * max(table.loc[next_state, a] for a in foolsball.actions)

        table.loc[state, action] = reward + update

    return table.loc[state, action]
```

### Solution At a Glance

```
for i in range(1,50):
    RETURNS_TBL_OLD = RETURNS_TBL.copy()
    for s in range(foolsball.n_states):
        for a in foolsball.actions:
            compute_returns(RETURNS_TBL, state=s, action=a, debug=True)
    
    if i%5 == 0:
        print(f'\n{i} iterations')
        print(RETURNS_TBL)
    
    deltas = ((RETURNS_TBL- RETURNS_TBL_OLD)).abs().values.max()
    if deltas < 1e-3:
        print(f'\nConvergence achieved at {i} iterations')
        print(RETURNS_TBL)
        break
```        

### Best Policy from Returns Table

An optimal policy picks the best action in every state; following it from the initial state yields a sequence of $(state_t, best\_action_t)$ pairs

1: Set state to initial state

2: Pick the action with the highest return in the state

3: Transition to a new state based on the state and the selected action

4: Update state to new state and repeat from step 2 until the new state is a goal state
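The four steps above can be sketched as a short rollout function. This is an illustrative sketch under stated assumptions: the returns table is a plain nested dict rather than the DataFrame used earlier, and `greedy_rollout`, `next_state_fn`, and the three-state example are hypothetical names and values introduced here.

```python
def greedy_rollout(returns, next_state_fn, start, goal_states, max_steps=100):
    """Follow the highest-return action from `start` until a goal state is reached."""
    state, path = start, []
    for _ in range(max_steps):  # cap guards against loops in a bad table
        if state in goal_states:
            break
        # Step 2: pick the action with the highest return in the current state.
        best_action = max(returns[state], key=returns[state].get)
        path.append((state, best_action))
        # Steps 3 and 4: transition and continue from the new state.
        state = next_state_fn(state, best_action)
    return path, state


# Assumed example: states 0, 1, 2 on a line, goal at 2, actions 'e' (east) and 'w' (west).
# The values correspond to gamma = 0.9 and a reward of 1 for entering the goal.
example_returns = {
    0: {'e': 0.9, 'w': 0.81},
    1: {'e': 1.0, 'w': 0.81},
    2: {'e': 0.0, 'w': 0.0},
}


def example_next_state(state, action):
    return min(state + 1, 2) if action == 'e' else max(state - 1, 0)


path, final_state = greedy_rollout(example_returns, example_next_state, start=0, goal_states={2})
```

Here `path` holds the (state, best action) pairs visited on the way to the goal, which is the sequence the policy description refers to.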

## A More Realistic Setup

- Dealing with incomplete knowledge of the environment

- We may not know all the states of the environment in advance

- We may not know the single-step dynamics $(state_{t},action_{t}) \rightarrow state_{t+1}=??,\ reward_t=??$

### The solution

- Interact with the environment by taking random actions (a.k.a. sampling)

- Every step in the interaction provides information about state transitions and accompanying rewards

- The interaction ends once we reach a terminal state

- The steps make up an episode 

- Calculate the return for each step $(state_t, action_t)$

- The return calculated from a single episode for any $(state_t,action_t)$ pair is not an accurate estimate of the highest possible return

- Generate many episodes to get better estimates of the highest possible returns
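One way to turn sampled episodes into return estimates is sketched below. It assumes an episode is a list of (state, action, reward) steps and uses every-visit Monte Carlo averaging, which is one common choice and not necessarily the one used in the course notebooks; `discounted_returns` and `average_returns` are hypothetical helper names. Averages over randomly sampled episodes estimate the return of the sampling behavior, so they only approach the highest possible return once action selection improves.

```python
from collections import defaultdict


def discounted_returns(episode, gamma):
    """Return [(state, action, G_t)] for an episode of (state, action, reward) steps."""
    g, out = 0.0, []
    # Walk backwards so each return reuses the one that follows it:
    # G_t = reward_t + gamma * G_{t+1}
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        out.append((state, action, g))
    return out[::-1]


def average_returns(episodes, gamma):
    """Average the sampled returns of every visit to each (state, action) pair."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        for state, action, g in discounted_returns(episode, gamma):
            totals[(state, action)] += g
            counts[(state, action)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Two assumed episodes on a 3-state line with a reward of 1 for reaching the goal.
episodes = [
    [(0, 'e', 0.0), (1, 'e', 1.0)],
    [(0, 'w', 0.0), (0, 'e', 0.0), (1, 'e', 1.0)],
]
estimates = average_returns(episodes, gamma=0.9)
```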

### Incremental Improvements (W3S1)

- Random actions (exploration) do not yield the best returns

- Instead of taking random actions, pick greedy actions (exploitation) using entries from the returns table

- Interleave episode generation and returns table updates

- Combine random actions strategy and greedy actions strategy (exploration + exploitation)

- Optimize for faster convergence 
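Combining random and greedy action selection is commonly done with an epsilon-greedy rule, as listed in the syllabus. The sketch below assumes one row of the returns table is available as a dict from action to estimated return; `epsilon_greedy` and its parameters are hypothetical names introduced here.

```python
import random


def epsilon_greedy(returns_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    actions = list(returns_row)
    if rng.random() < epsilon:
        return rng.choice(actions)                # explore
    return max(actions, key=returns_row.get)      # exploit


row = {'n': 0.2, 'e': 0.9, 'w': 0.1, 's': 0.0}    # assumed returns for one state
greedy_action = epsilon_greedy(row, epsilon=0.0)  # always exploits
random_action = epsilon_greedy(row, epsilon=1.0)  # always explores
```

A frequent design choice is to start with a large `epsilon` and shrink it over episodes, so early episodes explore widely and later ones mostly exploit the table.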

### Putting Everything Together: Specific Algorithms (W3S2)

- Markov decision processes

- State-values and action-values

- Policies: Greedy and Optimal Policies

- Bellman equations

- Monte Carlo Sampling

- Exploration and exploitation

- Q-learning

- SARSA and expected SARSA

- **P1: OpenAI Gym Taxi V3 Problem**
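As a preview of how two of the listed TD methods differ, the sketch below shows the standard tabular update rules for Q-learning and Sarsa. The nested-dict table `q`, the function names, and the example numbers are assumptions made for illustration; `alpha` is the step size and `gamma` the discount factor.

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma, done=False):
    """Q-learning (Sarsamax): bootstrap from the best action in the next state."""
    target = reward if done else reward + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


def sarsa_update(q, state, action, reward, next_state, next_action, alpha, gamma, done=False):
    """Sarsa: bootstrap from the action actually taken in the next state."""
    target = reward if done else reward + gamma * q[next_state][next_action]
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Assumed two-state table for a single illustrative update of each kind.
q = {0: {'e': 0.0, 'w': 0.0}, 1: {'e': 1.0, 'w': 0.5}}
q_value = q_learning_update(q, 0, 'e', reward=0.0, next_state=1, alpha=0.5, gamma=0.9)
sarsa_value = sarsa_update(q, 0, 'w', reward=0.0, next_state=1, next_action='w', alpha=0.5, gamma=0.9)
```

The only difference is the bootstrap term: Q-learning uses the maximum over next actions, while Sarsa uses the next action the agent actually selects.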

## Week 4

- Introducing stochasticity 

- Playing against intelligent adversaries

- Introducing continuous state/action spaces