CQL

1. Introduction

Conservative Q-learning (CQL) [1] is an algorithmic framework for offline RL that learns a Q-function whose expectation lower-bounds the true policy value, by penalizing the Q-values of actions that are not observed in the dataset at the dataset's states. This yields a conservative value estimate for any policy and mitigates over-estimation bias and distribution shift. On the D4RL tasks, CQL is implemented on top of soft actor-critic (SAC). The Q-function iteration is

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\ \max_{\alpha \ge 0}\ \alpha\left(\mathbb{E}_{s\sim\mathcal{D}}\!\left[\log\sum_{a}\exp Q(s,a)\right]-\mathbb{E}_{s\sim\mathcal{D},\,a\sim\hat{\pi}_\beta(a|s)}\!\left[Q(s,a)\right]-\tau\right)+\frac{1}{2}\,\mathbb{E}_{s,a,s'\sim\mathcal{D}}\!\left[\left(Q(s,a)-\hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\right)^{2}\right]$$

It consists of a bootstrap error term (the squared Bellman error) and a CQL regularization term. The regularizer arises from maximizing the Q-values of an action distribution that is KL-regularized toward the uniform distribution, which gives the log-sum-exp above: minimizing it pushes down the Q-values of actions sampled from the current policy, while the second term inside the 'max' pushes up the Q-values of actions sampled from the behavior policy $\hat{\pi}_\beta$. The weight $\alpha$ is adjusted automatically via Lagrangian dual gradient descent, and $\tau$ is a threshold value. On continuous benchmarks such as the MuJoCo tasks, the log-sum-exp is estimated with importance sampling:

$$\log\sum_{a}\exp Q(s,a)\approx\log\left(\frac{1}{2N}\sum_{a_i\sim \mathrm{Unif}(a)}\left[\frac{\exp Q(s,a_i)}{\mathrm{Unif}(a)}\right]+\frac{1}{2N}\sum_{a_i\sim \pi_\phi(a_i|s)}\left[\frac{\exp Q(s,a_i)}{\pi_\phi(a_i|s)}\right]\right)$$
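
As a rough illustration of how this penalty is typically implemented on top of a SAC critic, the PyTorch sketch below estimates the log-sum-exp with the two importance-sampling proposals above (uniform actions and current-policy actions) and forms the conservative gap, plus the dual update for $\alpha$. The names `q_net`, `policy.sample`, `cql_logsumexp_penalty`, `lagrange_alpha_loss`, and `num_samples` are assumptions for illustration, not the exact interface of this repository.

```python
import torch

def cql_logsumexp_penalty(q_net, policy, obs, data_act, num_samples=10):
    """Importance-sampled log-sum-exp of Q minus the Q-value of dataset actions.

    Assumed interfaces: q_net(obs, act) -> (B,) Q-values, and
    policy.sample(obs) -> (action, log_prob) for a tanh-Gaussian SAC policy
    with actions in [-1, 1]^act_dim.
    """
    B, act_dim = data_act.shape
    obs_rep = obs.unsqueeze(1).expand(-1, num_samples, -1).reshape(B * num_samples, -1)

    # Proposal 1: uniform actions on [-1, 1]^act_dim, density (1/2)^act_dim.
    rand_act = torch.empty(B * num_samples, act_dim, device=obs.device).uniform_(-1.0, 1.0)
    log_unif = -act_dim * torch.log(torch.tensor(2.0, device=obs.device))

    # Proposal 2: actions from the current policy with their log-densities.
    pi_act, pi_logp = policy.sample(obs_rep)

    # exp(Q)/proposal_density inside the log becomes Q - log_density inside logsumexp.
    q_rand = q_net(obs_rep, rand_act).reshape(B, num_samples) - log_unif
    q_pi = q_net(obs_rep, pi_act).reshape(B, num_samples) - pi_logp.reshape(B, num_samples)

    # log((1/2N) * sum over both proposals) = logsumexp(...) - log(2N)
    logsumexp = torch.logsumexp(torch.cat([q_rand, q_pi], dim=1), dim=1) \
        - torch.log(torch.tensor(2.0 * num_samples, device=obs.device))

    return (logsumexp - q_net(obs, data_act)).mean()


def lagrange_alpha_loss(log_alpha, penalty, tau=10.0):
    """Dual gradient step on alpha that keeps the conservative gap close to tau."""
    return -(log_alpha.exp() * (penalty.detach() - tau)).mean()
```

In the Lagrange variant, `log_alpha` is a learned scalar; its dual update grows $\alpha$ when the conservative gap exceeds the threshold $\tau$ and shrinks it otherwise, while the critic loss adds `alpha * penalty` to the usual Bellman error.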

The policy improvement step is the same as SAC's.
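
For completeness, a minimal sketch of the SAC-style policy improvement step under the same assumed interfaces (`policy.rsample` for reparameterized sampling, `q_net`, and an entropy weight `alpha_ent`):

```python
import torch

def sac_policy_loss(policy, q_net, obs, alpha_ent=0.2):
    """SAC policy improvement: minimize E[alpha_ent * log pi(a|s) - Q(s, a)]
    using reparameterized actions so the gradient flows through the policy."""
    act, logp = policy.rsample(obs)      # reparameterized sample and its log-density
    q = q_net(obs, act)
    return (alpha_ent * logp - q).mean()
```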

2. Instruction

3. Performance

Reference

  1. Kumar A, Zhou A, Tucker G, Levine S. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems. 2020;33:1179-1191.