# Q Learning

### Monte-Carlo (MC) methods for computing action-value functions and improving policies


Monte-Carlo methods are those based on repeated sampling to estimate a quantity. 

As with dynamic programming (DP), we can implement policy iteration and value iteration using MC methods, by sampling the values of states that the agent actually experiences rather than simulating them with a model.

The goal of using MC methods is to avoid the need for a model: if we don't have to look ahead from each state, then we can remove the model.

However, even if we find an optimal value function, we still need a model to extract an optimal policy from it, because we need to know which states are reachable from which. It's no good having a chess piece next to your opponent's king if you don't know how that piece can move.

When we had a model, we chose actions greedily by using it to consider all possible actions and then taking the one that gave us the best expected return.

As such, it would be more useful to use MC methods to estimate the action-value (Q) function. This tells us how good any particular action is from a certain state. If we know how good each action is, then we don't need a model: as long as we know all possible actions, we can plug each of them into our q function and take the one for which it returns the largest value.
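To make this concrete, here is a minimal sketch of model-free greedy action selection. The q-values and the `greedy_action` helper are made up for illustration and are not taken from the environment used later in this notebook.

```python
def greedy_action(q_values):
    """Return the action with the largest estimated q-value.

    q_values maps each available action to its estimated value from
    the current state. No model of the environment is needed.
    """
    return max(q_values, key=q_values.get)


# Hypothetical q-value estimates for the four actions available in one state
q_values_for_state = {0: 0.1, 1: 0.7, 2: -0.2, 3: 0.4}

best_action = greedy_action(q_values_for_state)
print(best_action)  # 1
```

Note that the only thing we need to know about the environment is the set of available actions, not where each action leads.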

### Q-functions - how good is taking a certain action in a certain state?
![](images/q_def.jpg)

![](images/q_optimality_derivation.JPG)
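Written out, the standard definitions look as follows. The notation (return $G_t$, discount factor $\gamma$, reward $R_{t+1}$) is the usual textbook convention and is assumed here, so it may differ slightly from the figures above.

$$
q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right], \qquad G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

$$
q_*(s, a) = \max_\pi q_\pi(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a \right]
$$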

### Computing action-value functions using Monte-Carlo methods

For this method, we will use Monte-Carlo sampling to estimate the q-value of each state-action pair by running many episodes, and then backing up the observed returns from the terminal state.
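The backup from the terminal state can be sketched as below. This is a simplified illustration that works on a plain list of rewards; the `discounted_returns` helper and the example episode are assumptions made for this sketch.

```python
def discounted_returns(rewards, discount_factor=0.9):
    """Compute the return G_t for every timestep of one episode.

    Works backwards from the terminal step, using
    G_t = r_t + discount_factor * G_{t+1}.
    """
    returns = [0.0] * len(rewards)
    future_return = 0.0  # nothing is received after the terminal step
    for t in reversed(range(len(rewards))):
        future_return = rewards[t] + discount_factor * future_return
        returns[t] = future_return
    return returns


# Hypothetical episode: no reward until the goal is reached on the last step
episode_rewards = [0, 0, 1]
print(discounted_returns(episode_rewards))
```

For this episode the returns are approximately 0.81, 0.9 and 1.0: the closer a step is to the rewarded terminal step, the less its return is discounted.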

The q-value is the <strong>mean</strong> future return following an action from a given state.
Rather than storing all of our experience and taking the mean over it, we can use each experience to update an exponentially weighted average and then forget that experience.

![](./images/exp-avg.jpg)
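A minimal sketch of this running update is shown below. The helper name and the observed returns are hypothetical; the learning rate of 0.3 matches the value used in the code in this notebook.

```python
def update_running_estimate(estimate, sample, learning_rate=0.3):
    """Move the estimate a fraction of the way towards the new sample."""
    return estimate + learning_rate * (sample - estimate)


# Hypothetical returns observed for the same state-action pair
observed_returns = [1.0, 0.0, 1.0, 1.0]

estimate = 0.0
for g in observed_returns:
    # After this update the sample is no longer needed and can be discarded
    estimate = update_running_estimate(estimate, g)

true_mean = sum(observed_returns) / len(observed_returns)
print(round(estimate, 4), true_mean)
```

The running estimate is not identical to the true mean: it weights recent samples more heavily and starts out biased towards the initial value of 0. In exchange, it only needs to store one number per state-action pair.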

![](images/q_learning_algorithm.JPG)

In [None]:
import time
import numpy as np

from GriddyEnv import GriddyEnv

def initialise_action_value_table(num_states=16, num_actions=4):
    q_table = {}
    for s in range(num_states):
        q_table[s] = {}
        for a in range(num_actions):
            q_table[s][a] = 0
    return q_table

def get_state_idx_from_observation(observation):
    return np.argmax(observation[2].reshape(1, -1))

In [None]:
env = GriddyEnv(4, 4)
epsilon = 1
i_episode=0
discount_factor=0.9
learning_rate=0.3

def return_optimal_policy_from_q(q_table):
    optimal_policy = {}
    for state, actions in q_table.items():
        optimal_policy[state] = np.argmax(list(actions.values())) # dict views must be converted to a list before argmax
    return optimal_policy

In [None]:
def update_q_table(episode_mem, q_table, discount_factor=0.9, learning_rate=0.1):
    for idx, mem in reversed(list(enumerate(episode_mem))):
        state = get_state_idx_from_observation(mem['observation'])
        new_state = get_state_idx_from_observation(mem['new_observation'])
        action = mem['action']
        if idx == len(episode_mem) - 1: # if terminal state, G=reward
            _return = episode_mem[idx]['reward']
        else:
            _return = mem['reward'] + (discount_factor * _return)
        q_table[state][action] =  q_table[state][action] + learning_rate * (_return - q_table[state][action]) # take exponential average
        q_table[state][action] = np.round(q_table[state][action], 2)
    return q_table

In [None]:
def MC_policy_evaluation(policy, n_episodes=100):
    q_table = initialise_action_value_table()
    try:
        
        # SAMPLE SOME EPISODES
        for episode_idx in range(n_episodes):
            observation = env.reset()
            episode_mem = []
            done=False
            t = 0
            while not done:
                env.render()
                time.sleep(0.05)
                action = policy(observation)
                new_observation, reward, done, info = env.step(action)
                episode_mem.append({'observation':observation,
                                    'action':action,
                                    'reward':reward,
                                    'new_observation':new_observation,
                                    'done':done})
                observation = new_observation
                t += 1
                
            q_table = update_q_table(episode_mem, q_table, discount_factor, learning_rate) # update the q estimates; any policy that is greedy w.r.t. them changes implicitly
            print("Episode {} finished after {} timesteps. Epsilon={}.".format(episode_idx, t, epsilon))#, end='\r')
            print('q table:', q_table)
            time.sleep(2)


        env.close()
    except KeyboardInterrupt:
        env.close()
        
    return q_table

MC_policy_evaluation(random_policy)

In [None]:
def greedy_policy(state, q_table):
    return np.argmax(list(q_table[state].values())) # return action with best value for this state

In [None]:
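def greedy_MC_control():
    policy = random_policy # start from the random policy, as no q-values have been estimated yet
    while True:
        
        # POLICY EVALUATION
        q_table = MC_policy_evaluation(policy)
        
        # POLICY IMPROVEMENT
        # act greedily with respect to the newly estimated q-values
        policy = lambda observation, q=q_table: greedy_policy(get_state_idx_from_observation(observation), q)
        
        # CHECK CONVERGENCE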

## Why not just act greedily all the time?

If we act greedily all the time then we will move into the state with the best value. But remember that these values are only estimates based on our agent's experience with the game, which means that they might not be correct. So if we want our agent to do well by always choosing the next action greedily, we need to make sure that it has good estimates for the values of those states. This brings us to a core challenge in reinforcement learning: **the exploration vs exploitation dilemma**. Our agent can either exploit what it knows by using its current knowledge to choose the best action, or it can explore more and improve its knowledge, perhaps learning that some actions are better, or even worse, than what it does currently.

Because we aren't using a model here, we aren't considering all possible actions. Instead we are just acting based on experience.
If we experience a lower return from taking one action in a state rather than another, then we are not going to try the one that gave us the low return again.
This is even more problematic in a stochastic environment, where transitions can vary randomly, because a single unlucky outcome can make a good action look bad.
When we act greedily we stop exploring and only exploit the experience we have, even though we might not have experienced the best states.
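The following sketch shows a purely greedy agent getting stuck. It uses a hypothetical one-state problem with two actions and deterministic rewards; `run_greedy` and its parameters are illustrative and not part of the notebook's environment.

```python
def run_greedy(true_rewards, n_steps=20, learning_rate=0.5):
    """Act greedily on running q estimates in a one-state problem."""
    q_estimates = [0.0] * len(true_rewards)
    action_counts = [0] * len(true_rewards)
    for _ in range(n_steps):
        # Greedy choice; ties go to the lowest-numbered action
        action = max(range(len(q_estimates)), key=lambda a: q_estimates[a])
        reward = true_rewards[action]
        q_estimates[action] += learning_rate * (reward - q_estimates[action])
        action_counts[action] += 1
    return q_estimates, action_counts


# Hypothetical rewards: action 1 is better, but action 0 happens to be tried first
q_estimates, action_counts = run_greedy(true_rewards=[0.2, 1.0])
print(action_counts)  # [20, 0]
```

After the first step, action 0 has a positive estimate while action 1 still sits at its initial value of 0, so the greedy agent never tries action 1 and never finds out that it is better.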

## An epsilon-greedy policy
We can combine our random policy and our greedy policy to make an improved policy that both explores its environment and exploits its current knowledge. An $\epsilon$-greedy (epsilon-greedy) policy is one which exploits what it knows most of the time, but with probability $\epsilon$ will instead select a random action to try.

## Do we need to keep exploring once we are confident in the values of states?

As our agent explores more, it becomes more confident in predicting how valuable any state is. Once it knows a lot, it should start to explore less and exploit what it knows more. That means that we should decrease epsilon over time.
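One simple way to decrease epsilon is to multiply it by a constant factor after every episode. The sketch below assumes that schedule; the decay rate of 0.995 is the value used in the training code in this notebook, while the `decayed_epsilon` helper and its optional `min_epsilon` floor are illustrative additions.

```python
def decayed_epsilon(initial_epsilon, decay_rate, n_episodes, min_epsilon=0.0):
    """Epsilon after multiplying by decay_rate once per episode."""
    return max(min_epsilon, initial_epsilon * decay_rate ** n_episodes)


# How much exploration is left after a given number of episodes?
for n in (0, 100, 500, 1000):
    print(n, round(decayed_epsilon(1.0, 0.995, n), 3))
```

With these numbers the agent still explores roughly 61% of the time after 100 episodes and under 1% of the time after 1000. A non-zero `min_epsilon` keeps a small amount of exploration going indefinitely.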

Let's implement an epsilon-greedy policy.

In [None]:
import random

def epsilon_greedy_policy(state):
    epsilon = 0.05 # probability of taking a random (exploratory) action
    if random.random() < epsilon:
        return random_policy(state)
    else:
        return greedy_policy(state)

## Tabular Q learning

In [1]:
import time
import pickle
import numpy as np

import gym
from GriddyEnv import GriddyEnv

In [2]:
def visualise_agent(policy, n=5):
    try:
        for trial_i in range(n):
            observation = env.reset()
            done=False
            t=0
            while not done:
                env.render()
                policy_action = policy(observation)
                observation, reward, done, info = env.step(policy_action)
                time.sleep(0.5)
                t+=1
            env.render()
            time.sleep(1.5)
            print("Episode {} finished after {} timesteps".format(trial_i, t))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [3]:
def random_policy(state):
    return env.action_space.sample()

In [4]:
#visualise_agent(random_policy, 3)

In [5]:
def greedy_policy(state, return_action_val=False):
    action_values=[] #store the value of each action from this state
    for test_action in range(4): #for each action
        key = pickle.dumps(np.array((*np.copy(state).flatten(), test_action))) #calculate the key for our dictionary
        if key not in q_table: q_table[key] = 0 #if unseen state-action pair, initialize q value to 0
        action_values.append(q_table[key]) #append the value of this action to a list
    policy_action = np.argmax(action_values) #get an action by performing argmax operation
    if return_action_val: return policy_action, action_values[policy_action] #if flag, return value of action as well
    return policy_action

In [6]:
def create_epsilon_greedy_policy(policy):
    def epsilon_greedy_policy(state):
        action = env.action_space.sample() if np.random.rand()<epsilon else policy(state) #epsilon greedy policy
        return action
    return epsilon_greedy_policy

In [7]:
def update_q_table(episode_mem, q_table, discount_factor=0.95, alpha=0.2):
    all_diffs=[] #store difference between new and old q values
    for i, mem in reversed(list(enumerate(episode_mem))): #iterate over the memories in reverse chronological order
        if i==len(episode_mem)-1: #if terminal state
            calculated_q = mem['reward'] #set q = the reward in that memory
        else:#if non-terminal state
            _, next_obs_q = greedy_policy(mem['new_observation'], return_action_val=True) #get q value of next state
            calculated_q = mem['reward']+discount_factor*next_obs_q #calculate new q value estimate for this state-action pair
        
        key = pickle.dumps(np.array((*mem['observation'].flatten(), mem['action']))) #get key of current state-action pair
        if key not in q_table: q_table[key]=0 #if unseen state-action pair, initialize q value to 0
        new_val = q_table[key] + alpha*(calculated_q-q_table[key]) #update q with a step of size alpha to new q value
        diff = abs(q_table[key]-new_val) #calculate difference between old and new q values estimate
        all_diffs.append(diff)
        q_table[key] = new_val
    return q_table, np.mean(all_diffs)

In [8]:
i_episode=0
epsilon = 1 #initialize epsilon
q_table = {} #initialize q table
discount_factor=0.95
alpha=0.1

env = GriddyEnv()
epsilon_greedy_policy = create_epsilon_greedy_policy(greedy_policy)

In [9]:
def train(policy, n_episodes=100):
    global epsilon
    global q_table
    global i_episode
    try:
        for _ in range(n_episodes):
            observation = env.reset()
            episode_mem = []
            done=False
            t=0
            while not done:
                action = policy(observation)
                new_observation, reward, done, info = env.step(action)
                episode_mem.append({'observation':observation,
                                    'action':action,
                                    'reward':reward,
                                    'new_observation':new_observation,
                                    'done':done})
                observation=new_observation
                t+=1
            epsilon*=0.995 #decay epsilon
            q_table, q_delta = update_q_table(episode_mem, q_table, discount_factor, alpha) #update our q table using the current episode memory
            i_episode+=1
            print("Episode {} finished after {} timesteps. Epsilon={}. Q_Delta={}".format(i_episode, t, epsilon, q_delta))#, end='\r')
        env.close()
    except KeyboardInterrupt:
        env.close()

In [None]:
train(epsilon_greedy_policy, 50)

In [11]:
visualise_agent(greedy_policy, n=3)

Episode 0 finished after 5 timesteps


## Deep Q Learning

Instead of using a table to store a q value for every state-action pair, which becomes impractical when we have a large state space, we can use a neural network to approximate the q function. A network does not need to store the value of each state-action pair explicitly, so its size does not grow with the number of states. We also improve the performance of our agent in unseen states. This is because the tabular method assumes a q value of 0 for unseen state-action pairs, whereas our network will make a guess based on similar state-action pairs it has seen before.
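To get a feel for the storage argument, the sketch below compares the number of values a full q table has to hold with the number of parameters in a small fully connected network. The layer sizes are assumptions chosen for illustration (a 48-value observation, two hidden layers of 32 units, 4 actions), and both helper functions are hypothetical.

```python
def mlp_parameter_count(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


def q_table_entry_count(num_states, num_actions):
    """Number of values a full q table has to store."""
    return num_states * num_actions


# Assumed sizes: 48 inputs, two hidden layers of 32 units, 4 actions
network_params = mlp_parameter_count([48, 32, 32, 4])
print(network_params)  # 2756

# The network size is fixed, while the table grows with the number of states
for num_states in (16, 1_000, 1_000_000):
    print(num_states, q_table_entry_count(num_states, 4))
```

For a tiny state space the table is actually the smaller of the two. The network only pays off once the number of distinct states becomes large, or when we want sensible estimates for states that have never been visited.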

In [11]:
import torch
import torch.nn.functional as F

class QNetwork(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(np.prod(env.observation_space.shape), 32) #input layer
        self.fc2 = torch.nn.Linear(32, 32)
        self.fc3 = torch.nn.Linear(32, env.action_space.n) #output layer
    def forward(self, obs):
        obs = obs.view(-1, np.prod(env.observation_space.shape)) #flatten input
        x = F.relu(self.fc1(obs))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    def create_optimizer(self, lr=0.001):
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)

In [12]:
def create_greedy_policy(q_network):
    def greedy_policy(state, return_action_val=False):
        action_values = q_network(torch.tensor(state).double()).detach().numpy()
        policy_action = np.argmax(action_values)
        if return_action_val: return policy_action, action_values[0][policy_action]
        return policy_action
    return greedy_policy

def create_stochastic_policy(q_network):
    def stochastic_policy(state, return_action_val=False):
        action_values = q_network(torch.tensor(state).double()).detach().numpy()
        action_probs = F.softmax(torch.tensor(action_values), dim=-1)
        policy_action = torch.distributions.Categorical(action_probs).sample().item()
        if return_action_val: return policy_action, action_values[0][policy_action]
        return policy_action
    return stochastic_policy

In [13]:
def update_q_network(episode_mem, q_network, discount_factor=0.95, alpha=0.2):
    all_diffs=[] #store difference between new and old q values
    for i, mem in reversed(list(enumerate(episode_mem))): #iterate over the memories in reverse chronological order
        if i==len(episode_mem)-1: #if terminal state
            calculated_new_q = mem['reward'] #set q = the reward in that memory
        else:#if non-terminal state
            _, next_obs_q = greedy_policy(mem['new_observation'], return_action_val=True) #get q value of next state
            calculated_new_q = mem['reward']+discount_factor*next_obs_q #calculate new q value estimate for this state-action pair
            
        predicted_old_q = q_network(torch.tensor(mem['observation']).double())[0, mem['action']] #what does our network predict for the current state-action pair
        label_new_q = predicted_old_q.item() + alpha*(calculated_new_q-predicted_old_q.item()) #what should our label be for the network given our new prediction of q
        cost = F.mse_loss(predicted_old_q, torch.tensor(label_new_q).double()) #calculate cost
        cost.backward() #calculate gradients
        q_network.optimizer.step() #update weights
        q_network.optimizer.zero_grad() #reset gradients
        all_diffs.append(abs(calculated_new_q-predicted_old_q.item()))
    return np.mean(all_diffs)

In [14]:
#HYPER-PARAMS
epsilon = 1
i_episode=0
discount_factor=0.95
alpha=0.1
lr = 0.001

env = GriddyEnv(4, 4)
q_network = QNetwork().double()
q_network.create_optimizer(lr)
greedy_policy = create_greedy_policy(q_network)
stochastic_policy = create_stochastic_policy(q_network)
epsilon_greedy_policy = create_epsilon_greedy_policy(greedy_policy)

In [15]:
def train_deep(policy, n_episodes=100):
    global epsilon
    global q_network
    global i_episode
    try:
        for _ in range(n_episodes):
            observation = env.reset()
            episode_mem = []
            done=False
            t=0
            while not done:
                action = policy(observation)
                new_observation, reward, done, info = env.step(action)
                episode_mem.append({'observation':observation,
                                    'action':action,
                                    'reward':reward,
                                    'new_observation':new_observation,
                                    'done':done})
                observation=new_observation
                t+=1
            epsilon*=0.995
            q_delta = update_q_network(episode_mem, q_network, discount_factor, alpha) #update our q network using the current episode memory
            i_episode+=1
            print("Episode {} finished after {} timesteps. Epsilon={}. Q_Delta={}".format(i_episode, t, epsilon, q_delta))#, end='\r')
            #print(value_table_viz(value_table))
            #print()
            #env.render(value_table_viz(value_network, observation))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [None]:
train_deep(epsilon_greedy_policy, 100)

In [17]:
visualise_agent(greedy_policy, n=3)

Episode 0 finished after 3 timesteps
Episode 1 finished after 1 timesteps
Episode 2 finished after 4 timesteps
