# Abstract

* From one perspective, the Bellman optimality equation is used to derive a global optimality condition that is useful for iteratively learning control policies through interaction with an environment. From another, the Bellman equation is widely used to derive tractable optimization-based control policies that satisfy a local notion of optimality.
* Framework: learn an interpretable local decision maker that approximately satisfies the global Bellman equation.

# Introduction
* The aim of this paper is to delineate the local-global perspective underlying dynamic programming; in particular, the role of local optimization-based approximators in a global Bellman equation framework. 
* Two possible ways of combining RL and MPC:
    - One approach takes a modular view towards learning MPC
    - The other adopts an all-in-one perspective

# Stochastic Optimal Control
* The dynamic programming solution is computationally challenging for several reasons. For continuous state and action spaces, the Q-function belongs to an infinite-dimensional function space and, as a consequence, the policy it defines can become intractable to evaluate.
* In practice, value-based and policy-based strategies are often combined, leading to actor-critic RL methods. The actor refers to the policy and the critic refers to the value function. Combining the two strategies allows for learning from a predefined policy class using one-step transitions. This is an important property, as value-based approaches alone lead to policies of the form in ![image.png](attachment:image.png), which may be prohibitively expensive to evaluate unless special care is taken. Learning from one-step transitions is appealing in its own right, as it enables online updates of the actor or critic parameters aimed directly at the optimal value function.
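To make the one-step update concrete, here is a minimal sketch of a single actor-critic step. It assumes a tabular setting with a softmax actor, a cost-minimization convention, and hypothetical names (`actor_critic_step`, `lr_actor`, `lr_critic`); none of these come from the paper, and it is an illustration rather than the authors' method.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def actor_critic_step(theta, v, s, a, cost, s_next,
                      gamma=0.9, lr_actor=0.1, lr_critic=0.5):
    """One actor-critic update from a single transition (s, a, cost, s_next).

    theta: (n_states, n_actions) actor logits (assumed tabular softmax policy).
    v:     (n_states,) critic estimate of the cost-to-go.
    Returns updated copies of theta and v, plus the TD error.
    """
    theta, v = theta.copy(), v.copy()

    # Critic: the one-step temporal-difference error of the cost-to-go.
    delta = cost + gamma * v[s_next] - v[s]
    v[s] += lr_critic * delta

    # Actor: gradient of log pi(a|s) for a softmax policy is onehot(a) - pi(.|s).
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0

    # Costs are minimized, so an action with positive TD error becomes less likely.
    theta[s] -= lr_actor * delta * grad_log_pi
    return theta, v, delta
```

Each call uses only one transition, which is what makes the update usable online; the critic supplies the TD error that the actor would otherwise need a full rollout to estimate.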

# The Local-Global Interface in Optimal Control
* The Bellman optimality equation contains two insights:
    - A global property that characterizes optimal trajectories over the whole state space
    - A local strategy for interacting with the environment at a given state
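
Both insights can be read off one pair of relations. As a sketch, assuming a cost-minimization convention with stage cost $c$, discount factor $\gamma$, and next state $s'$ (notation chosen here for illustration, not taken from the paper):

$$
Q^*(s,a) = \mathbb{E}\left[\, c(s,a) + \gamma \min_{a'} Q^*(s',a') \,\right] \quad \text{for all } (s,a),
\qquad
\pi^*(s) \in \arg\min_{a} Q^*(s,a).
$$

The first relation must hold at every state-action pair (the global property), whereas the second only requires evaluating $Q^*$ at the current state (the local strategy).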

* It follows that optimization as a function approximator provides a straightforward interface for integrating system knowledge into policy structure. 

* The benefits of choosing the maps provided by parametric optimization for function approximation of Q and $\pi$ can be attributed to their equation-oriented structure:
    - Modularity: individual components can be learned or designed independently. This can simplify learning through the inclusion of physics-based or other prior models, which can reduce the number of learnable parameters in the function approximators of Q and $\pi$. Moreover, practical design requirements can be encoded directly through the cost functions and constraints.
    - Safety: the parameterization readily accommodates (robust) constraint satisfaction. This makes it well suited for function approximation in safety-critical applications, or when assurances such as stability or robustness are required.
    - Interpretability: provided a comprehensible set of equations, the clear specification of objective and constraints enhances understanding of how optimal decisions and values are obtained.
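
The three properties above can be illustrated with a small sketch of an optimization-based approximator. It assumes a scalar, deterministic linear model, a quadratic cost, a one-step horizon, and a box constraint on the action; the class `MPCParams` and all of its fields are hypothetical names introduced for illustration, not the paper's parameterization.

```python
from dataclasses import dataclass


@dataclass
class MPCParams:
    A: float = 1.0       # model: s_next = A*s + B*a (could come from physics)
    B: float = 0.5
    q: float = 1.0       # stage-cost weight on the state
    r: float = 0.1       # stage-cost weight on the action
    p: float = 2.0       # terminal-value weight (the part one might learn)
    gamma: float = 0.95  # discount factor
    a_max: float = 1.0   # safety constraint: |a| <= a_max


def q_value(phi, s, a):
    """Q_phi(s, a): stage cost plus discounted terminal value at the predicted state."""
    s_next = phi.A * s + phi.B * a
    return phi.q * s**2 + phi.r * a**2 + phi.gamma * phi.p * s_next**2


def policy(phi, s):
    """pi_phi(s): minimizer of Q_phi(s, .) subject to |a| <= a_max.

    For a scalar convex quadratic, clipping the unconstrained minimizer
    onto the interval gives the constrained minimizer.
    """
    a_unc = -phi.gamma * phi.p * phi.A * phi.B * s / (phi.r + phi.gamma * phi.p * phi.B**2)
    return max(-phi.a_max, min(phi.a_max, a_unc))
```

In this sketch the model `(A, B)`, the cost weights, and the constraint are separate, readable pieces (modularity and interpretability), the bound `a_max` is enforced by construction (safety), and only `p` would need to be learned.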

# Practical Complications in Local-Global Learning
* Local-Global Optimization requirements
    - For the local-global interface to be realizable with online optimization, $Q_{\phi}$ should be
        1) rich enough to approximate the global MDP solution
        2) conducive to local online refinement of the approximation
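
As a rough illustration of the second requirement, the sketch below refines one parameter of a simple $Q_{\phi}$ from a single transition using a semi-gradient temporal-difference step. The scalar linear-quadratic form of `q_value`, the single learnable weight `p`, and every function name are assumptions made for this example only.

```python
def q_value(p, s, a, A=1.0, B=0.5, q=1.0, r=0.1, gamma=0.95):
    """Assumed Q_phi: quadratic stage cost plus a terminal weight p on the predicted state."""
    s_next = A * s + B * a
    return q * s**2 + r * a**2 + gamma * p * s_next**2


def greedy_action(p, s, A=1.0, B=0.5, r=0.1, gamma=0.95):
    """Unconstrained minimizer of q_value(p, s, .) in closed form."""
    return -gamma * p * A * B * s / (r + gamma * p * B**2)


def td_refine(p, s, a, cost, s_next, lr=0.01, gamma=0.95, A=1.0, B=0.5):
    """One semi-gradient TD step on p from a single transition.

    The bootstrapped target is treated as a constant (hence "semi-gradient").
    """
    target = cost + gamma * q_value(p, s_next, greedy_action(p, s_next))
    delta = target - q_value(p, s, a)
    grad_p = gamma * (A * s + B * a) ** 2  # d q_value / d p at (s, a)
    return p + lr * delta * grad_p
```

The step touches only the data from the current transition, which is what makes the refinement local; whether the parameterization is also rich enough to represent the global solution is a separate question that this sketch does not address.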

* Exploration versus Exploitation
