prob.tex

\section{Exact and Approximate Linear Programming for MDP}
In this section, we review relevant results \cite{BertB,ALP,CS} in the linear programming approach to MDPs. Throughout the rest of the paper, we let $c\in (0,1)^n$ to denote a probability distribution with strictly positive components.\\
\begin{comment}
%first present linear programming (LP), approximate linear programming (ALP) and reduced linear programming (RLP) respectively. The LP is an exact method in that it computes $J^*$ and the other two methods (ALP and RLP) are approximate methods that are based on introducing LFA in LP. We then discuss prior literature that have analyzed the approximation errors in ALP and RLP. Finally, we present the open question that we wish to understand in this paper.
\end{comment}
\textbf{Linear Programming (LP)}: As it is well known \cite{manne}, the optimal value function $J^*$ can be obtained by solving the following linear program:
\begin{align}\label{mdplp}
&\text{LP:~}\min_{J\in \R^n}\,c^\top J\,\,\\
%\mb
&J(s)\geq g_a(s)+\alpha\sum_{s'}p_a(s,s')J(s'),\,\,
\forall (s,a)\in S\times A.\nn
\end{align}
Once $J^*$ is obtained from \eqref{mdplp}, an optimal policy $u^*$ can then be obtained via \eqref{bellpol} for each state relatively cheaply even for large state-spaces. However, notice that \eqref{mdplp} is a linear program in $n$ variables with $nd$ constraints. As a result, when $n$ is large it is computationally expensive to solve \eqref{mdplp}, a fact that motivated the need to look at the approximate linear programming (ALP) formulation which computes an approximate value function $\tj$.\\
\textbf{Approximate Linear Programming (ALP)} is obtained by adding the extra constraint $J=\Phi r$ in \eqref{mdplp} with $\Phi \in \R^{n\times k}$ \citep{SchSei85}. By substitution, this leads to
\begin{flalign}\label{alp}
\begin{split}
\text{ALP:}& \min_{r\in \R^k}\, c^\top \Phi r\\
& E\Phi r\geq H \Phi r,
\end{split}
\end{flalign}
where $E\Phi r\geq H \Phi r$ is a shorthand for the $nd$ constraints in \eqref{mdplp} which uses the definition of $H$ from \Cref{notations}-\eqref{bellactval}.
ALP introduces what is known as linear function approximation in the LP, i.e., the value function computed by ALP can be given by $\tj=\Phi \tr$ for an appropriate $\tr\in \R^k$. ALP is a linear program in $k$ ($<<n$) variables and hence is computationally easier to handle in practice than LP. Note that, for $k=n$ and $\Phi=I_{n\times n}$, ALP is same as the LP. In what follows, we use the notation $\tj=\Phi \tr$ and $\tu=u_{\tj}$ (the greedy policy with respect to $\tj$, see \Cref{notations}-\eqref{greedy}) where $\tr\in\R^k$ denotes an arbitrary solution to ALP in \eqref{alp}. Further, when convenient, we also use the notation $\Phi r \geq T\Phi r$, instead of $E\Phi r\geq H \Phi r$ in \eqref{alp} using the definition of $T$ from \Cref{notations}-\eqref{bellopval}.\\
\textbf{Error Analysis of ALP:} %ALP is an approximate method as opposed to LP. Thus, it is important for us to characterize the approximation error. In the context of ALP for MDPs,
We care about two error terms namely the \emph{prediction error} $\norm{\tj-J^*}$, that captures the error in the approximate value function $\tj$ and the \emph{control error} $\norm{J_{\tu}-J^*}$, that is the loss in performance when a sub-optimal policy $\tu$ is used instead of an optimal policy $u^*$ (note that here $\norm{\cdot}$ is an appropriate norm).\par
A detailed theoretical analysis of ALP can be found in \cite{ALP}. However, we restate (from \cite{ALP}) a useful preliminary\footnote{The improved error bounds in \cite{ALP} as well as in \Cref{sec:improv} of this paper are in a weighted $L_\infty$ norm (see \Cref{notations}-\eqref{norms}).} performance bound for ALP as follows:
\begin{theorem}[Error Bound for ALP]\label{alpvanilla}
Let $\one\in \R^k$ belong to the column span of $\Phi$ and let $\mu_u\eqdef (1-\alpha)c^\top(I-\alpha P_u)^{-1}$ be a probability distribution over $S$ induced by a policy $u$. Then it follows that
\begin{enumerate}
\item\label{alppe} $\norm{J^*-\tj}_{1,c}\leq \frac{2}{1-\alpha}\min_{r}\norm{J^*-\Phi r}_\infty$.
\item\label{alpce} $\norm{J^*-J_{\tu}}_{1,c}\leq \frac{2}{1-\alpha}\norm{J^*-\tj}_{1,\mu_{\tu}}$.
\end{enumerate}
\end{theorem}
The gist of \Cref{alpvanilla}-\eqref{alppe} is that the prediction error $\norm{J^*-\tj}_{1,c}$ of ALP is worse utmost by a constant factor than the best possible error that can be achieved using the span of $\Phi$. Further from \Cref{alpvanilla}-\eqref{alpce} the control error is worse utmost by a constant factor than the prediction error.\\
\textbf{Issues with ALP}: While ALP has only $k$ variables, it has $nd$ constraints.
Two approaches have been found empirically successful \cite{CS,dolgov,ALP-Bor} in addressing the issue of large number of constraints in ALP. In the first approach \cite{CS}, a random subset of constraints is chosen (dropping the rest), thereby formulating a \emph{reduced linear programming} (RLP) problem. The performance analysis of RLP can be found in \cite{CS}, however, the bounds hold only in high probability under idealized assumptions. The second approach involves employing function approximation in the dual variables of ALP \cite{ALP-Bor,dolgov}. However, to this date, there exist no theoretical guarantees bounding the loss in performance due to this approach. We now dicuss RLP and its error analysis.\\
\textbf{Reduced Linear Programming (RLP)} is formulated by retaining only $m$ out of the $nd$ ($m<<nd$) constraints of ALP. RLP is easier to solve compared to ALP due to the reduced number of constraints.
\begin{align}\label{rlp}
\begin{split}
&\text{RLP:~}\underset{r\in \N\subset R^k}{\min}\, c^\top \Phi r\,\,\\
%\mb
&\Phi r (s)\geq g_a(s)+\alpha\sum_{s'}p_a(s,s')\Phi r(s'),\,\,
\forall (s,a)\in \I,
\end{split}
\end{align}
where $\I\subset S \times A$ is the index set of sampled constraints, i.e., $|\I|=m$ and $\N$\footnote{The appendix explains how $\N$ can be chosen analytically.} is a compact set such that $\tr \in \N$, and it ensures boundedness of RLP \eqref{alp}. We denote an arbitrary solution to RLP by $\tr_{RLP}$.\\
\textbf{Error Analysis of RLP:} We now state a useful result from \cite{CS} that characterizes the performance analysis RLP as follows:
\begin{theorem}[Error Bound for RLP]\label{rlpt}
Let $\mu_{u^*}\eqdef(1-\alpha)c^\top (I-\alpha P_{u^*})^{-1}$ be a probability distribution (of discounted number of visits) over the states $S$. Define $\psi_{u^*}$ to be the distribution over the state-action pairs such that $\psi(s,a)\eqdef \frac{\mu_{u^*}}{d}$ and define $\theta\eqdef\frac{1+\alpha}{2 c^\top J^*}\underset{r\in \N}{\sup}\parallel J^*-\Phi r \parallel_\infty$. Then for a given $\epsilon>0$ and $\delta>0$ it holds with probability greater than $1-\delta$ that
\begin{align*}
\norm{J^*-\Phi\tilde{r}_{RLP}}_{1,c}\leq \norm{J^*-\Phi\tilde{r}}_{1,c}+\epsilon \norm{J^*}_{1,c}
\end{align*}
for all $m\geq \frac{16d\theta}{(1-\alpha)\epsilon}\big(k\ln\frac{48d\theta}{(1-\alpha)\epsilon}+\ln \frac{2}{\delta}\big)$.
\end{theorem}
The gist of \Cref{rlpt} is that when `sufficient' number of samples are retained using a specialized distribution ($\psi_{u^*}$), then with high probability, the prediction error of RLP and ALP differ only by a small factor (i.e., $\epsilon\norm{J^*}_{1,c}$).\\
\textbf{Limitations of Error Analysis for RLP:} A major weakness of \Cref{rlpt} is that it holds only for RLP formulated using $\psi_{u^*}$ which is impractical since $u^*$ is unknown in general. However, empirical studies \cite{ALP,CS,CST} have shown that RLP formulated with other sampling distributions do perform well in experiments. In this context, another interesting approach is linear approximation of the dual variables \cite{dolgov}, which also is known to work in experiments but without any theoretical guarantees. These fact hint to the fact that the theoretical understanding of RLP and constraint sampling/approximation needs to be improved. Further, such theoretical understanding will justify that ALP with constraint reduction is a fully dimensionality free method in that the problem variables as well as constraints are independent of the number of states of the MDP.
\begin{comment}
The Generalized Reduced Linear Program is given as:
\begin{align}\label{grlp}
\begin{split}
&\text{GRLP:~}\underset{r\in \N\subset R^k}{\min}\, \, c^\top \Phi r\,\,\,\,\\
&  \,W^\top E\Phi r\geq W^\top H \Phi r.
\end{split}
\end{align}
\end{comment}
\begin{comment}
\begin{enumerate}

\item In \eqref{mdplp}, \eqref{alp}, \eqref{rlp} and \eqref{grlp} $c>0 \in \R^n$ is a probability distribution.
\item The ALP \eqref{alp} is obtained by making use of LFA in the LP, i.e., by introducing the new variables $r\in \R^k$ and adding the extra constraint $J=\Phi r$ in \eqref{mdplp} with $\Phi \in \R^{n\times k}$ \citep{SchSei85}. Note that in \eqref{alp}, $Phi r\geq T\Phi r$ is a shortcut to denote the $S\times A$ constraints. Also, for $k=n$ and $\Phi=I_{n\times n}$, ALP is same as the LP. We denote an arbitrary solution to ALP by $\tr\in \R^k$.\par
\item The RLP \eqref{rlp} was first introduced in \cite{CS}. The RLP \eqref{rlp} has the same objective function as ALP \eqref{rlp}. The construction of RLP needs three parameters to be specified namely, the number of samples $m$, the sampling distribution $\psi$ and the constraint set $\N$. In particular, $\psi=\psi(u^*)$ i.e., the sampling distribution
In RLP in \eqref{rlp}, $\I\subset S \times A$ is the index set of sampled constraints, i.e., $|\I|=m$. The samples are generated using a distribution $\psi$ (see a restatement of the results in \cite{CS} summarized in \Cref{}).
$\N$ is a compact set such that $\tr \in \N$, and it ensures boundedness of RLP \eqref{alp}. We denote an arbitrary solution to RLP by $\tr_{RLP}$. It is important to note that RLP \eqref{rlp} is a random linear program, in the sense that it depends on the $m$ samples obtained from $\psi$.
\item In the GRLP in \eqref{grlp}, $W \in \R^{nd\times m}_+$ is a matrix with all positive entries which specifies the linear combination of constraints. The RLP \eqref{rlp} can be recovered as a special case of the GRLP \cite{grlp} by setting $W$ to the matrix with the $i^{th}$ column having $1$ corresponding to the constraint that was sampled (implicitly assuming that there is a natural ordering of the $nd$ constraints) and all the other entries as $0$. The $\N$ in RLP \eqref{rlp} is retained in the GRLP \eqref{grlp} as well. For $m=nd$ and $W=I_{nd\times nd}$, the GRLP is same as ALP. We denote an arbitrary solution to the GRLP by $\hr$.
\end{enumerate}
In what follows, unless specified otherwise we use let $\hj=\Phi \hr$/$\tj=\Phi \tr$ to denote the corresponding value functions and $\hu$/$\tu$ to denote the greedy policy w.r.t. $\hj$/$\tj$.\par
\FloatBarrier
\input{cartoon}
\Cref{cartoon} which shows the solutions to the LP, ALP and GRLP respectively. The error in ALP solution has already been studied in \cite{ALP}. Our objective is to study the extra source of error due to constraint approximation.
\end{comment}
\begin{comment}
In addition, we answer the following questions related to constraint reduction in ALP that have so far remained open. \\
$\bullet$ As a natural generalization of RLP, what happens if we define a generalized reduced linear program (GRLP) whose constraints are positive linear combinations of the original constraints of ALP?\\
$\bullet$ Unlike \cite{CS} which provides error bounds for a specific RLP formulated using an idealized sampling distribution, is it possible to provide error bounds for any GRLP (and hence any RLP)?
In this paper, we address both of the questions above.
\end{comment}