## Q Learning

I will introduce it from two perspectives: the first follows the Cambridge lectures, the second follows Limu's book.

### Cambridge Version (Deterministic)

#### Defining the Q value and its properties

Let's look at why we define the action-value function (Q value).

To find the optimum policy $\pi^{*}(s)$, we need to maximize the value V:

$\pi^{*}(s) = a^{*}(s) = \arg\max_{a}[r(s,a) + V(f(s,a))]$

Here $f(s,a)$ is the **model**: it tells us which state we reach by taking action $a$ in state $s$.

But we don't know the model!!

Hence, to get around this, let us define: $Q(s,a) = r(s,a) + V(f(s,a))$.

And we have: $V(s) = \max_{a}Q(s,a)$.

Hence, by definition: 

$Q(s,a) = r(s,a) + V(f(s,a)) = r(s,a) + \max_{a'}Q(f(s,a),a')$.

This gives us another recursive equation that converges to the optimum (Bellman's equation).

(Note we still need the model f() at this point.)

With that, we can find the policy using $\pi^{*}(s) = a^{*}(s) = \arg\max_{a} Q(s,a)$

With discount factor $\gamma$, we have:

$Q(s,a) = r(s,a) + \gamma \max_{a'}Q(f(s,a),a')$
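To make the recursion concrete, here is a minimal sketch that iterates the discounted equation on a made-up 4-state chain. The chain, its reward, and the names `GOAL`, `ACTIONS` and `q_iteration` are assumptions for illustration only; they are not from the lectures. Note that this version still calls the model `f` directly.

```python
GAMMA = 0.9
N_STATES, GOAL = 4, 3
ACTIONS = (0, 1)  # 0 = left, 1 = right (hypothetical toy problem)

def f(s, a):
    """Deterministic model: next state after taking action a in state s."""
    if s == GOAL:
        return s  # the goal is absorbing
    step = 1 if a == 1 else -1
    return min(max(s + step, 0), N_STATES - 1)

def r(s, a):
    """Reward 1 for stepping into the goal, 0 otherwise."""
    return 1.0 if s != GOAL and f(s, a) == GOAL else 0.0

def q_iteration(num_sweeps=50):
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(num_sweeps):
        # Synchronous sweep: build Q_{k+1} from Q_k using the model f
        Q = [[r(s, a) + GAMMA * max(Q[f(s, a)]) for a in ACTIONS]
             for s in range(N_STATES)]
    return Q

Q = q_iteration()
# Greedy policy: argmax over actions of Q(s, a)
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)]
```

On this chain the values settle at $Q(2,\text{right}) = 1$, $Q(1,\text{right}) = 0.9$ and $Q(0,\text{right}) = 0.81$, and the greedy policy moves right in every non-goal state.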

#### Q-learning

Take samples $(s(i), a(i), r(i), s(i+1))$ and use Bellman's equation to iterate the Q function towards convergence:

$Q_{k+1}(s(i),a(i)) = r(s(i),a(i)) + \gamma \max_{a'}Q_{k}(s(i+1),a')$

Now we no longer need to know the model; instead, we learn it implicitly from the sampled data.

For the stochastic case, every sample is noisy, so it makes sense to keep part of $Q_{k}$ instead of overwriting it entirely with the new value:

$Q_{k+1}(s(i),a(i)) = (1-\alpha_{k}) Q_{k}(s(i), a(i)) +\alpha_{k}[r(s(i),a(i)) + \gamma \max_{a'}Q_{k}(s(i+1),a')]$

where $\alpha_{k}$ is the learning rate.
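A small sketch of why the blending helps when samples are noisy. The list `targets` is made-up data standing in for the bracketed term $r + \gamma \max_{a'}Q_{k}(s', a')$ observed on repeated visits to one state-action pair; in real Q-learning these targets also change as $Q_{k}$ changes, which this illustration ignores.

```python
def q_update(q_old, target, alpha):
    """Blend the old estimate with a newly sampled target."""
    return (1 - alpha) * q_old + alpha * target

# Hypothetical noisy targets for a single (s, a) pair
targets = [1.0, 0.0, 1.0, 1.0]

q_overwrite = 0.0  # alpha = 1: always overwrite with the latest sample
q_average = 0.0    # alpha_k = 1/k: running mean of all samples so far
for k, y in enumerate(targets, start=1):
    q_overwrite = q_update(q_overwrite, y, alpha=1.0)
    q_average = q_update(q_average, y, alpha=1.0 / k)
```

With $\alpha = 1$ the estimate simply tracks the last sample (here 1.0), while $\alpha_{k} = 1/k$ recovers the sample mean (here 0.75).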




### Limu's Version

(Maybe not as strict, but the idea is right)

If we have found the optimum policy $\pi^{*}(s)$ and the optimum Q values, we should have:

$Q(s(i),a(i)) = r(s(i),a(i)) + \gamma\max_{a'}(Q(s(i+1), a'))$

If we now collect a dataset along one trajectory {s(0), a(0), s(1), a(1) ...}, we can define a loss function as:

$ L(Q) = \sum_{i}[Q(s(i),a(i)) - r(s(i),a(i)) - \gamma\max_{a'}(Q(s(i+1), a'))]^{2} $

Minimizing this loss gives the optimum Q:

$\hat{Q} = \arg\min_{Q}L(Q)$

Again, we do not explicitly need the transition function (model), yet we implicitly have it by sampling data.
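As a rough illustration, the loss can be evaluated on a table of Q values using only sampled transitions. The three transitions and both Q tables below are made up (a 4-state chain with two actions and a reward of 1 on reaching state 3); `bellman_loss` is a hypothetical helper name.

```python
GAMMA = 0.9

def bellman_loss(Q, transitions, gamma=GAMMA):
    """Sum of squared Bellman residuals over sampled (s, a, r, s_next) tuples."""
    return sum((Q[s][a] - r - gamma * max(Q[s_next])) ** 2
               for s, a, r, s_next in transitions)

# One hypothetical trajectory: always move right until the goal (state 3)
transitions = [(0, 1, 0.0, 1), (1, 1, 0.0, 2), (2, 1, 1.0, 3)]

Q_zero = [[0.0, 0.0] for _ in range(4)]  # untrained table
Q_star = [[0.729, 0.81], [0.729, 0.9], [0.81, 1.0], [0.0, 0.0]]  # optimal for this chain

loss_zero = bellman_loss(Q_zero, transitions)
loss_star = bellman_loss(Q_star, transitions)
```

The all-zero table has loss 1.0 (only the last transition violates the equation), while the optimal table has loss zero up to floating-point error.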

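Now we use gradient descent:

$\begin{aligned} Q_{k+1}(s(i), a(i)) &\leftarrow Q_{k}(s(i), a(i)) - \alpha \nabla_{Q}L(Q_{k}) \\ &= (1-\alpha_{k}) Q_{k}(s(i), a(i)) +\alpha_{k}[r(s(i),a(i)) + \gamma \max_{a'}Q_{k}(s(i+1),a')] \end{aligned}$

Here the gradient of the $i$-th term is taken with respect to $Q(s(i),a(i))$ while the target in the brackets is treated as a constant. The factor 2 from differentiating the square is absorbed into the learning rate, so $\alpha_{k} = 2\alpha$.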

To account for the end of a trajectory, note that the value of a terminal state is zero because the robot takes no further actions there:

$\begin{aligned} Q_{k+1}(s(i), a(i)) &\leftarrow Q_{k}(s(i), a(i)) - \alpha \nabla_{Q}L(Q_{k}) \\ &= (1-\alpha_{k}) Q_{k}(s(i), a(i)) +\alpha_{k}[r(s(i),a(i)) + \gamma (1 - \mathbf{1}_{s(i+1) \text{ is terminal}})\max_{a'}Q_{k}(s(i+1),a')] \end{aligned}$
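A minimal sketch of the terminal mask in a tabular update. The helper names `td_target` and `q_update` and the toy table are assumptions for illustration; they are not taken from the book.

```python
def td_target(reward, q_next_row, gamma, terminal):
    """Target r + gamma * (1 - 1[terminal]) * max_a' Q(s', a')."""
    mask = 0.0 if terminal else 1.0  # drop the bootstrap term at the end of a trajectory
    return reward + gamma * mask * max(q_next_row)

def q_update(Q, s, a, reward, s_next, terminal, alpha, gamma):
    """In-place tabular update for one sampled transition."""
    y = td_target(reward, Q[s_next], gamma, terminal)
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * y
    return Q[s][a]

# Toy table: the row for state 1 is deliberately non-zero to show the mask at work
Q = [[0.0, 0.0], [5.0, 2.0]]
q_update(Q, s=0, a=1, reward=1.0, s_next=1, terminal=True, alpha=0.5, gamma=0.9)
```

Because `s_next` is flagged as terminal, the target is just the reward (1.0) and `Q[0][1]` becomes 0.5; without the mask the target would have been 5.5.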

Finally, with the solution of these updates $\hat{Q}$, which is an approximation of the optimal value function $Q^{*}$, we can recover the optimal policy easily:

$\hat{\pi}(s) = \arg\max_{a}\hat{Q}(s,a)$



### My question

I don't get why we have to use the Q value. It seems like the value function V can do the very same thing, e.g. converging by applying Bellman's equation randomly.

ChatGPT:

Q-learning is model-free; V-learning can be model-free or model-based.

We use Q-learning more often because of:
1. More effective exploration (we will talk about this soon, in the context of data sampling)
2. More detailed data (both action and state are recorded)
3. Faster convergence as a result
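One concrete way to see the difference, sketched on a made-up 4-state chain: picking the greedy action from $V$ requires calling the model $f$ and the reward $r$, while picking it from $Q$ is only a table lookup. The chain, `V_star` and `Q_star` are assumptions for illustration.

```python
GAMMA = 0.9
GOAL = 3
ACTIONS = (0, 1)  # 0 = left, 1 = right

def f(s, a):
    """Hypothetical deterministic model of a 4-state chain."""
    if s == GOAL:
        return s  # the goal is absorbing
    return min(max(s + (1 if a == 1 else -1), 0), GOAL)

def r(s, a):
    """Reward 1 for stepping into the goal, 0 otherwise."""
    return 1.0 if s != GOAL and f(s, a) == GOAL else 0.0

# Optimal values for this chain (worked out by hand from Bellman's equation)
V_star = [0.81, 0.9, 1.0, 0.0]
Q_star = [[0.729, 0.81], [0.729, 0.9], [0.81, 1.0], [0.0, 0.0]]

def greedy_from_v(V, s):
    # Needs the model: both r(s, a) and f(s, a) are evaluated
    return max(ACTIONS, key=lambda a: r(s, a) + GAMMA * V[f(s, a)])

def greedy_from_q(Q, s):
    # Model-free: only looks up the stored Q values
    return max(ACTIONS, key=lambda a: Q[s][a])

actions_v = [greedy_from_v(V_star, s) for s in range(GOAL)]
actions_q = [greedy_from_q(Q_star, s) for s in range(GOAL)]
```

Both lists agree (always move right), but only `greedy_from_q` would still work if `f` and `r` were unknown.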

### Exploration in Q-learning

It is crucial that all states $s \in S$ are visited; otherwise the estimate of $Q^{*}$ will be poor, because value iteration essentially ties together the values of all state-action pairs. It is therefore essential to pick the right exploration policy $\pi_{e}$ to collect the data.

#### Epsilon-Greedy


\begin{equation}
  \pi_{e}(a|s) =
  \begin{cases}
    \arg\max_{a'}\hat{Q}(s,a'), & \text{with prob $1-\epsilon$}.\\
    \text{uniform}(A), & \text{with prob $\epsilon$}.
  \end{cases}
\end{equation}

This ensures that, for small $\epsilon$, we mostly choose the action that maximizes Q, while leaving some room for random exploration. Good actions, i.e., those whose value is large, are explored more often by the robot and thereby reinforced.

Other exploration policies, such as **softmax**, work similarly.
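For comparison, a minimal sketch of softmax (Boltzmann) exploration for a single state. The `temperature` parameter and the function names are assumptions; the notes above do not specify a particular form.

```python
import math
import random

def softmax_probs(q_row, temperature=1.0):
    """Action probabilities proportional to exp(Q(s, a) / temperature)."""
    m = max(q_row)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_row]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_action(q_row, temperature=1.0, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_row, temperature)
    return rng.choices(range(len(q_row)), weights=probs, k=1)[0]

probs = softmax_probs([1.0, 2.0, 3.0], temperature=1.0)
action = softmax_action([1.0, 2.0, 3.0], temperature=1.0)
```

Actions with larger Q values get larger probabilities. A high temperature pushes the distribution towards uniform (more exploration), and a low temperature pushes it towards the greedy action.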

### Implementation of Q Learning

In [None]:
"""
Using the same frozenlake example
"""

%matplotlib inline
import random
import numpy as np
from d2l import torch as d2l

seed = 0  # Random number generator seed
gamma = 0.95  # Discount factor
num_iters = 256  # Number of iterations
alpha   = 0.9  # Learning rate
epsilon = 0.9  # Epsilon in the epsilon-greedy algorithm (probability of a random action)
random.seed(seed)  # Set the random seed
np.random.seed(seed)

# Now set up the environment
env_info = d2l.make_env('FrozenLake-v1', seed=seed)

In [None]:
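def e_greedy(env, Q, s, epsilon):
    # With probability epsilon, explore with a uniformly random action
    if random.random() < epsilon:
        return env.action_space.sample()
    # Otherwise exploit: pick the action with the largest Q-value in state s
    else:
        return np.argmax(Q[s,:])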

In [None]:
def q_learning(env_info, gamma, num_iters, alpha, epsilon):
    env_desc = env_info['desc']  # 2D array specifying what each grid item means
    env = env_info['env']  # The environment object used to reset and step
    num_states = env_info['num_states']
    num_actions = env_info['num_actions']

    Q  = np.zeros((num_states, num_actions))
    V  = np.zeros((num_iters + 1, num_states))
    pi = np.zeros((num_iters + 1, num_states))

    for k in range(1, num_iters + 1):
        # Reset environment
        state, done = env.reset(), False
        while not done:
            # Select an action for the current state and act in the env with it
            action = e_greedy(env, Q, state, epsilon)
            next_state, reward, done, _ = env.step(action)

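            # Q-update: rows of Q for terminal states are never updated and stay 0,
            # so no explicit terminal mask is needed in this tabular version
            y = reward + gamma * np.max(Q[next_state,:])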
            Q[state, action] = Q[state, action] + alpha * (y - Q[state, action])

            # Move to the next state
            state = next_state
        # Record max value and max action for visualization purpose only
        for s in range(num_states):
            V[k,s]  = np.max(Q[s,:])
            pi[k,s] = np.argmax(Q[s,:])
    d2l.show_Q_function_progress(env_desc, V[:-1], pi[:-1])

q_learning(env_info=env_info, gamma=gamma, num_iters=num_iters, alpha=alpha, epsilon=epsilon)

#### Some remarks on implementation

This Q-learning implementation is much slower than the previous value-iteration method because we do not have access to the model (transition function / MDP) and must learn from sampled transitions instead.