# Reinforcement Learning in Action
## Course 1: Introduction to Reinforcement Learning
___

## Course Syllabus
---
- What is RL?
- What kind of problems does RL solve?
- **C1: Foolsball**
- Modeling problems using the RL framework: Agent, Environment, State, Goals, Rewards, Returns
- MDPs and single-step dynamics
- State-values, action values, policies, optimality
- Solving MDPs with known single-step dynamics
- Monte Carlo estimation, greedy policies, exploration and exploitation, epsilon-greedy policies
- TD methods: Sarsa, Sarsamax (Q-learning), Expected Sarsa
- **P1: Taxi-V3**
- Discretization: Tile-coding, Kernels

## Week 3 Summary
---

## The exploration-exploitation dilemma

### Exploration using random sampling

Select a uniformly random action in every state within an episode

Good strategy to learn about the environment in the beginning

In [None]:
action = np.random.choice(foolsball.actions)

### Exploitation using greedy sampling

Select the action with the highest estimated discounted return in every state within an episode

Does not give good results in the beginning when the returns table is based on very little experience

In [None]:
greedy_action = table.loc[state].idxmax()
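
As a minimal sketch of greedy selection, the block below builds a small returns table and looks up the greedy action per state. The table contents, state labels, and action names are made up for illustration and are not the Foolsball environment's actual values. Note that `idxmax` breaks ties by returning the first maximum, which is one reason a freshly initialized table gives poor greedy choices.

```python
import pandas as pd

# Hypothetical returns table: rows are states, columns are actions.
table = pd.DataFrame(
    [[0.0, 1.5, -0.2],
     [0.3, 0.3, 0.1]],
    index=["s0", "s1"],
    columns=["left", "right", "shoot"],
)


def greedy_action(table, state):
    """Return the action label with the highest estimated return in `state`."""
    return table.loc[state].idxmax()


action_s0 = greedy_action(table, "s0")  # clear winner in this row
action_s1 = greedy_action(table, "s1")  # tie: the first maximum is returned
```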

### Combining Exploration and Exploitation with $\epsilon$-greedy sampling

Select the action with the highest estimated discounted return most of the time

Select a uniformly random action with a small probability $\epsilon$

In [None]:
actions = table.columns
action_probs = np.asarray([epsilon/len(actions)]*len(actions),dtype=float)

greedy_action_index = np.argmax(table.loc[state].values)
action_probs[greedy_action_index] += 1-epsilon

epsilon_greedy_action = np.random.choice(table.columns,p=action_probs)
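
The same steps can be wrapped in a reusable function. This is a sketch under stated assumptions: the function name, the explicit `rng` argument, and the toy table are illustrative choices, not part of the course code. Every action receives probability $\epsilon/|A|$, and the greedy action receives an extra $1-\epsilon$, so the probabilities always sum to 1.

```python
import numpy as np
import pandas as pd


def epsilon_greedy_action(table, state, epsilon, rng):
    """Sample an action epsilon-greedily; also return the probabilities used."""
    actions = list(table.columns)
    n_actions = len(actions)
    # Every action gets an equal share of the exploration mass epsilon.
    probs = np.full(n_actions, epsilon / n_actions, dtype=float)
    # The greedy action additionally gets the exploitation mass 1 - epsilon.
    greedy_idx = int(np.argmax(table.loc[state].to_numpy()))
    probs[greedy_idx] += 1.0 - epsilon
    return rng.choice(actions, p=probs), probs


# Hypothetical one-state table for illustration.
table = pd.DataFrame([[0.0, 1.5, -0.2]], index=["s0"],
                     columns=["left", "right", "shoot"])
rng = np.random.default_rng(0)

_, probs = epsilon_greedy_action(table, "s0", epsilon=0.1, rng=rng)
# With epsilon = 0 the policy is purely greedy.
greedy_only = [epsilon_greedy_action(table, "s0", epsilon=0.0, rng=rng)[0]
               for _ in range(5)]
```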

### $\epsilon$ Decay

Start with pure exploration

Rely more and more on exploitation with each passing episode

In [None]:
episode_i = collect_epsilon_greedy_episode_from_returns_tbl(estimated_returns,epsilon=epsilon)
epsilon *= epsilon_decay
epsilon = max(epsilon, min_epsilon)
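
To see how quickly exploration fades, the decay rule can be run on its own. The starting value, decay factor, and floor below are illustrative assumptions, not the course's hyperparameters.

```python
def epsilon_schedule(n_episodes, epsilon=1.0, epsilon_decay=0.9, min_epsilon=0.05):
    """Return the epsilon used in each episode under multiplicative decay."""
    schedule = []
    for _ in range(n_episodes):
        schedule.append(epsilon)
        # Decay after each episode, but never drop below the floor.
        epsilon = max(epsilon * epsilon_decay, min_epsilon)
    return schedule


schedule = epsilon_schedule(50)
```

With these values, the first episode is pure exploration (`schedule[0]` is 1.0) and the schedule eventually settles at the floor of 0.05, so a small amount of exploration always remains.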

## Converging Faster

### Constant Alpha ($\alpha$)

Dividing by the number of visits $n$ scales each new update to the returns table by $1/n$, so later episodes change the estimates less and less

Instead, move the existing estimate a fixed fraction $\alpha$ of the way toward the new estimate based on the latest episode

In [None]:
ESTIMATED_RETURNS_TBL.loc[s,a] += alpha*(ret - ESTIMATED_RETURNS_TBL.loc[s,a])
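
The contrast between the two update rules can be sketched with scalar estimates. The function names and the sequence of returns are hypothetical; the sequence imitates a policy that improves halfway through, which is when the visit-count average reacts slowly.

```python
def sample_average_update(estimate, n_visits, ret):
    """Incremental mean: the step size 1/n shrinks as visits accumulate."""
    n_visits += 1
    estimate += (ret - estimate) / n_visits
    return estimate, n_visits


def constant_alpha_update(estimate, ret, alpha):
    """Constant step size: recent returns keep the same influence."""
    return estimate + alpha * (ret - estimate)


# Hypothetical returns: poor early episodes, then consistently better ones.
returns = [0.0] * 10 + [1.0] * 10

avg, n = 0.0, 0
const = 0.0
for ret in returns:
    avg, n = sample_average_update(avg, n, ret)
    const = constant_alpha_update(const, ret, alpha=0.2)
```

After the loop, `avg` is the plain mean of all returns (0.5), while `const` has moved most of the way toward the recent returns ($1 - 0.8^{10} \approx 0.89$).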

### Temporal Difference (TD) Learning

Update the returns table after every step of an episode 

In the absence of an entire episode, use the difference between the current estimate and a new estimate based on just one step

In [None]:
ESTIMATED_RETURNS_TBL.loc[s0,a0] +=\
    alpha*(reward + HYPER_PARAMS['gamma']*ESTIMATED_RETURNS_TBL.loc[s1,a1] - ESTIMATED_RETURNS_TBL.loc[s0,a0])
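
A single TD update can be worked through by hand. The sketch below uses a made-up two-state table and illustrative values for `alpha` and `gamma`; the helper name `td_update` is an assumption, not a function from the course code.

```python
import pandas as pd

# Hypothetical table: only Q(s1, right) has a non-zero estimate so far.
Q = pd.DataFrame(0.0, index=["s0", "s1"], columns=["left", "right"])
Q.loc["s1", "right"] = 2.0


def td_update(Q, s0, a0, reward, s1, a1, alpha, gamma):
    """Apply one TD update in place and return the TD error."""
    # One-step estimate of the return: immediate reward plus discounted bootstrap.
    td_target = reward + gamma * Q.loc[s1, a1]
    td_error = td_target - Q.loc[s0, a0]
    Q.loc[s0, a0] += alpha * td_error
    return td_error


td_error = td_update(Q, "s0", "left", reward=1.0, s1="s1", a1="right",
                     alpha=0.5, gamma=0.9)
```

Here the target is $1.0 + 0.9 \times 2.0 = 2.8$, the TD error is $2.8 - 0 = 2.8$, and the estimate for `("s0", "left")` moves halfway to the target, to 1.4.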

### SARSA

```
for i in tqdm(range(n_episodes)):
    foolsball.reset()
    s0 = foolsball.init_state
    a0 = epsilon_greedy_action_from_Q(Q,s0,epsilon)
    done = False
    
    while not done:
        s1, reward, done  = foolsball.step(a0)
        a1 = epsilon_greedy_action_from_Q(Q,s1,epsilon)
        
        Q.loc[s0,a0] += alpha*(reward + HYPER_PARAMS['gamma']*Q.loc[s1,a1] - Q.loc[s0,a0])
        
        s0, a0 = s1, a1
  
    epsilon *= epsilon_decay
    epsilon = max(epsilon,min_epsilon)

```
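
The loop above depends on `foolsball`, `epsilon_greedy_action_from_Q`, and hyperparameters defined elsewhere in the course. As a self-contained sketch, the same loop is run below on a tiny corridor environment. The `Corridor` class, its rewards, the body of `epsilon_greedy_action_from_Q`, and all hyperparameter values are assumptions made for illustration; only the `reset`/`step` interface mirrors the loop above.

```python
import numpy as np
import pandas as pd


class Corridor:
    """Hypothetical stand-in for foolsball: states 0..3, state 3 is the goal."""
    actions = ["left", "right"]
    init_state = 0
    goal = 3

    def reset(self):
        self.state = self.init_state

    def step(self, action):
        move = 1 if action == "right" else -1
        # Moving left from state 0 leaves the agent in place.
        self.state = min(max(self.state + move, 0), self.goal)
        done = self.state == self.goal
        reward = 1.0 if done else -0.1  # small penalty for every non-goal step
        return self.state, reward, done


def epsilon_greedy_action_from_Q(Q, state, epsilon, rng):
    """Assumed helper: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return Q.columns[rng.integers(len(Q.columns))]
    return Q.loc[state].idxmax()


def sarsa(env, n_episodes=500, alpha=0.1, gamma=0.9,
          epsilon=1.0, epsilon_decay=0.99, min_epsilon=0.05, seed=0):
    rng = np.random.default_rng(seed)
    Q = pd.DataFrame(0.0, index=range(env.goal + 1), columns=env.actions)
    for _ in range(n_episodes):
        env.reset()
        s0 = env.init_state
        a0 = epsilon_greedy_action_from_Q(Q, s0, epsilon, rng)
        done = False
        while not done:
            s1, reward, done = env.step(a0)
            # On-policy: the next action is chosen first, then used in the target.
            a1 = epsilon_greedy_action_from_Q(Q, s1, epsilon, rng)
            Q.loc[s0, a0] += alpha * (reward + gamma * Q.loc[s1, a1] - Q.loc[s0, a0])
            s0, a0 = s1, a1
        epsilon = max(epsilon * epsilon_decay, min_epsilon)
    return Q


Q = sarsa(Corridor())
greedy_policy = Q.idxmax(axis=1)  # greedy action per state
```

The goal row of `Q` is never updated and stays at zero, so bootstrapping from a terminal state contributes nothing to the target.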

### Q-Learning

```
for i in tqdm(range(n_episodes)):
    foolsball.reset()
    s0 = foolsball.init_state
    done = False
    
    while not done:
        a0 = epsilon_greedy_action_from_Q(Q,s0,epsilon)
        s1, reward, done  = foolsball.step(a0)
        
        Q.loc[s0,a0] += alpha*(reward + HYPER_PARAMS['gamma']*Q.loc[s1].max() - Q.loc[s0,a0])
        
        s0 = s1
  
    epsilon *= epsilon_decay
    epsilon = max(epsilon,min_epsilon)
```
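
The only difference between the two loops is the bootstrap term in the update target: Sarsa uses the value of the action actually selected in `s1`, while Q-learning uses the maximum value in `s1` regardless of which action is taken next. A small numeric sketch, with a made-up table and values, shows the two targets diverging when exploration picks a non-greedy action.

```python
import pandas as pd

# Hypothetical Q table and transition for illustration.
Q = pd.DataFrame([[0.0, 0.0], [0.5, 2.0]],
                 index=["s0", "s1"], columns=["left", "right"])
reward, gamma = 1.0, 0.9
s1, a1 = "s1", "left"  # suppose exploration chose the non-greedy action in s1

# Sarsa bootstraps from the action that will actually be taken.
sarsa_target = reward + gamma * Q.loc[s1, a1]
# Q-learning bootstraps from the best action in s1, whatever is taken next.
q_learning_target = reward + gamma * Q.loc[s1].max()
```

Here the Sarsa target is $1.0 + 0.9 \times 0.5 = 1.45$ and the Q-learning target is $1.0 + 0.9 \times 2.0 = 2.8$. When `a1` happens to be the greedy action, the two targets coincide.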

## Terminology
---

## Project P1 Intro
---

### The environment

### The rubric 