%!TEX root = autocontgrlp.tex
\section{Introduction}
The framework of Markov decision processes (MDPs) is useful for mathematically casting optimal sequential decision-making problems arising in science and engineering. At each decision instance, an action is chosen, which yields an immediate reward, and the system moves to the next state in a stochastic manner such that the next state depends only on the current state and the chosen action. The set of all states, the state space, is denoted by $S$, and the set of all actions, the action space, is denoted by $A$. Formally, a decision rule is called a policy $u$ ($u\colon S\ra A$) and has an associated value function $J_u$\footnote{Without loss of generality, the value function $J_u$ can be thought of as a vector in $\R^{|S|}$.} ($J_u\colon S\ra \R$), which specifies the expected cumulative reward obtained by following the policy $u$ starting from each state.\par
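For concreteness, in the infinite-horizon discounted setting the value function can be sketched as follows (illustrative notation only, fixed precisely in the sequel: $g(s,a)$ denotes the single-stage reward, $\alpha\in(0,1)$ the discount factor, and $s_t$ the state at time $t$):
\begin{equation*}
J_u(s) \;=\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \alpha^{t}\, g\big(s_t, u(s_t)\big) \,\Big|\, s_0 = s\Big], \qquad s \in S.
\end{equation*}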
The so-called dynamic programming (DP) methods \cite{BertB} first compute the optimal value function $J^*$ and then obtain an optimal policy $u^*$ from $J^*$\footnote{Obtaining $u^*$ from $J^*$ is computationally cheap.}. Conventional DP techniques, such as value iteration, policy iteration, or linear programming (LP) \cite{BertB}, can compute $J^*$ (and hence $u^*$) exactly. However, a shortcoming of these conventional methods is that their computational overhead grows with the number of states, a practical hindrance when the MDP has a large number of states.\par
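In the discounted setting, $J^*$ is characterized as the unique solution of the Bellman optimality equation, stated here schematically (illustrative notation, made precise later: $g(s,a)$ is the single-stage reward, $\alpha\in(0,1)$ the discount factor, and $p(s'|s,a)$ the transition probability):
\begin{equation*}
J^*(s) \;=\; \max_{a\in A}\Big( g(s,a) + \alpha \sum_{s'\in S} p(s'|s,a)\, J^*(s') \Big), \qquad \forall\, s\in S.
\end{equation*}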
This paper is concerned with LP-based techniques for MDPs with large state spaces. One way to handle the large number of states is to restrict the value function to the sub-space spanned by the columns of an $|S|\times k$ feature matrix $\Phi$\footnote{This is known as linear function approximation (LFA), wherein the value function is approximated by $\Phi \tr$ for some $\tr\in \R^k$. The idea of LFA is not restricted to the LP-based approach and is also widely used in other approximate dynamic programming methods \cite{dpchapter}, which are not discussed in this paper.}. This sub-space restriction can be accommodated in the LP formulation and results in the approximate linear programming (ALP) formulation \cite{ALP,CS,SALP,ALP-Bor}. The ALP computes an approximate value function $\tj=\Phi \tr$, from which a sub-optimal policy $\tu$ can be obtained\footnote{It is computationally cheap to obtain such a $\tu$ from $\tj$, just as it is to obtain $u^*$ from $J^*$.}. The sub-optimal policy can then be used to make decisions, resulting in a cumulative return given by $J_{\tu}$. The performance of the ALP was studied in \cite{ALP} in terms of the quantities $\norm{J^*-\tilde{J}}$ and $\norm{J^*-J_{\tu}}$ (where $\norm{\cdot}$ is an appropriate norm), known as the prediction error and the control error, respectively. The prediction error is the error in the approximate value function $\tj$, and the control error is the loss in performance due to following the sub-optimal policy $\tu$.\par
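Schematically, and under illustrative notation to be made precise later ($c$ a vector of positive state-relevance weights, $g_a$ the reward vector and $P_a$ the transition probability matrix under action $a$, and $\alpha\in(0,1)$ the discount factor), the ALP can be sketched as
\begin{align*}
\min_{\tr\in\R^k}\; & c^{\top} \Phi \tr\\
\text{s.t.}\quad & \Phi \tr \;\ge\; g_a + \alpha P_a \Phi \tr, \qquad \forall\, a \in A,
\end{align*}
where the inequalities hold componentwise, so that there is one linear constraint per state--action pair.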
A critical shortcoming of the vanilla ALP is that the number of constraints is of the order of the size of the state space, making this vanilla version intractable for MDPs with a large number of states. A way out is to choose a subset of the constraints at random and drop the rest, thereby formulating a \emph{reduced linear program} (RLP). A performance analysis of the RLP can be found in \cite{CS}, and the RLP has also been shown to perform well in experiments \cite{ALP,CS,CST}. An alternative approach to handling the large number of constraints is to employ function approximation in the dual variables of the ALP \cite{ALP-Bor,dolgov}, an approach that has also been found useful in experiments. However, to date, there exist no theoretical guarantees for this approach.\par
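In illustrative notation (with $c$, $g_a$, $P_a$, and $\alpha$ denoting hypothetical state-relevance weights, reward vectors, transition matrices, and the discount factor, all fixed precisely later), the RLP retains only the constraints indexed by a sampled set $\mathcal{I}$ of state--action pairs:
\begin{equation*}
\min_{\tr\in\R^k} \; c^{\top}\Phi\tr \quad \text{s.t.} \quad (\Phi\tr)(s) \;\ge\; g_a(s) + \alpha \big(P_a \Phi\tr\big)(s), \qquad \forall\, (s,a)\in \mathcal{I}.
\end{equation*}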
In this paper, we generalize the RLP to define a generalized reduced linear program (GRLP) which has a tractable number of constraints that are obtained as positive linear combinations of the original constraints.
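Schematically, writing each ALP constraint in linear form and letting $W_a$ be a hypothetical $|S|\times m$ matrix of nonnegative combination weights for action $a$ (illustrative notation: $g_a$ is the reward vector, $P_a$ the transition matrix, and $\alpha\in(0,1)$ the discount factor), the GRLP constraints can be sketched as
\begin{equation*}
W_a^{\top}\, \Phi\tr \;\ge\; W_a^{\top}\big(g_a + \alpha P_a \Phi\tr\big), \qquad W_a \ge 0 \text{ elementwise}, \quad \forall\, a\in A,
\end{equation*}
so that the $|S|\,|A|$ original constraints are replaced by $m\,|A|$ constraints; choosing the columns of each $W_a$ to be indicator vectors of sampled states recovers the RLP as a special case.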
The salient aspects of our contribution are listed below:
\begin{enumerate}
\item We develop novel analytical machinery to relate $\hat{J}$, the solution to the GRLP, and the optimal value function $J^*$ by bounding the prediction error $\norm{J^*-\hj}$ (\Cref{cmt2mn}).
\item We also bound the performance loss due to using the policy $\hu$ that is obtained using $\hj$ (\Cref{polthe}).
\item Our analysis is based on two novel $\max$-norm contraction operators and our results hold \emph{deterministically}, as opposed to the results on RLP \cite{SALP,CS}, where the guarantees have a probabilistic nature.
\item Our analysis also makes use of arguments based on \emph{Lyapunov} functions, an approach similar to prior works in the ALP literature \cite{ALP,SALP}.
\item Our results on the GRLP are the first to theoretically analyze the use of linear function approximation of Lagrangian (dual) variables underlying the constraints.
\item A numerical example in controlled queues is provided to illustrate the theory.
\end{enumerate}
A short and preliminary version of this paper without the theoretical analysis can be found in \cite{aaaipaper}.