rel.tex

\begin{comment}
In order to interpret the results in \Cref{cmt2} and \Cref{st}, we now discuss relevant prior work. We begin by re-stating a preliminary\footnote{The improved error bounds in \cite{ALP} as well as in \Cref{sec:improv} of this paper are in a weighted $L_\infty$ norm and are based on Lyapunov analysis.} error bounds for the ALP from \cite{ALP}:
\begin{theorem}[Error Bound for ALP]\label{alpvanilla}
\begin{align*}
||J^*-\tj||_{1,c}\leq \frac{2}{1-\alpha}\min_{r}||J^*-\Phi r||_\infty
\end{align*}
\end{theorem}
The ALP is a linear program in $k$ ($<<n$) variables as opposed to the LP in \eqref{mdplp} which has $n$ variables.
Nevertheless, the ALP has $nd$ constraints (same as the LP) which is an issue when $n$ is large and calls for constraint approximation/reduction techniques.

Most works in literature make use of the underlying structure of the problem to cleverly reduce the number of constraints of the ALP. A good example is \cite{gkp}, wherein the structure in factored linear functions is exploited. The use of basis function also helps constraint reduction in \cite{Mor-Kum}. In \cite{ALP-Bor}, the constraints are approximated indirectly by approximating the square of the Lagrange multipliers. In \cite{petrik} the transitional error is reduced ignoring the representational and sampling errors.\par

%The most important work in the direction of constraint reduction is constraint sampling \cite{CS} wherein a reduced linear program (RLP) is solved instead of the ALP. While the objective of the RLP is same as that of the ALP, the RLP has only $m<<nd$ constraints \emph{sampled} from the original $nd$ constraints of the ALP.
%The following is a preliminary error bound for the RLP from \cite{CS} holds for a special sampling distribution which is dependent on the optimal policy $u^*$ (see \cite{CS} for a detailed presentation):
\begin{theorem}[Error Bound for RLP]\label{rlpt}
Let $\mu_{u^*}\eqdef(1-\alpha)c^\top (I-\alpha P_{u^*})^{-1}$ be a probability distribution (of discounted number of visits) over the states $S$. Define $\psi_{u^*}$ to be the distribution over the state-action pairs such that $\psi(s,a)\eqdef \frac{\mu_{u^*}}{d}$ and define $\theta\eqdef\frac{1+\alpha}{2 c^\top J^*}\underset{r\in \N}{\sup}\parallel J^*-\Phi r \parallel_\infty$. Then for a given $\epsilon>0$ and $\delta>0$ it holds with probability greater than $1-\delta$ that
\begin{align*}
||J^*-\Phi\tilde{r}_{RLP}||_{1,c}\leq ||J^*-\Phi\tilde{r}||_{1,c}+\epsilon ||J^*||_{1,c}
\end{align*}
for all $m\geq \frac{16d\theta}{(1-\alpha)\epsilon}\big(k\ln\frac{48d\theta}{(1-\alpha)\epsilon}+\ln \frac{2}{\delta}\big)$.
\end{theorem}
\end{comment}
\begin{comment}
\subsection{Open Questions}
Interestingly, the RLP has nevertheless been found to do well empirically in domain such as Tetris \cite{CST} and controlled queues \cite{CS} even when the constraints were sampled using distribution other than the ideal distribution. This fact indicates a gap in the theoretical analysis and points to the need for a more elaborate theory that addresses the issue of constraint approximation. In this paper, we answer the following questions related to constraint reduction in ALP that have so far remained open. \\
$\bullet$ As a natural generalization of the RLP, what happens if we define a generalized reduced linear program (GRLP) whose constraints are positive linear combinations of the original constraints of the ALP?\\
$\bullet$ Unlike \cite{CS} which provides error bounds for a specific RLP formulated using an idealized sampling distribution, is it possible to provide error bounds for any GRLP (and hence any RLP)?
In this paper, we address both of the questions above.
\end{comment}
\begin{comment}
\subsection{Comparison with ADP methods}
A host of the ADP methods such as \cite{lspi,lspe,lstd,Tsit} are based on solving the projected Bellman equation (PBE). The PBE based methods have been empirically successful and also have theoretical guarantees for the approximate value function. However, a significant shortcoming is that they suffer from the issue of \emph{policy-chattering} (see section~$6.4.3$ of \cite{BertB}), i.e., the sequence of policies might oscillate within a set of bad policies. A salient feature of the ALP based methods is that they find only one approximate value function $\tj$ and one sub-optimal policy derived as a greedy policy with respect to $\tj$. As a result there is no such issue of policy-chattering for the ALP based methods. By providing the error bounds for the GRLP, our paper provides the much required theoretical support for the RLP. Our GRLP framework closes the long-standing gap in the literature of providing a theoretical framework to bound the error due to constraint reduction in ALP based schemes.\\

\FloatBarrier
\begin{table}[H]
\resizebox{\columnwidth}{!}{
\begin{tabular}{|c|c|c|}\hline
ADP Method	&Empirical	&Theoretical\\\hline
Projected Bellman &\ding{51}    &\ding{53}-Policy Chattering\\
Equation	&	&\\\hline
ALP		&\ding{53}-Large number of Constraints &\ding{51}\\\hline
RLP		&\ding{51} &\ding{53}-  Only under\\
&&ideal assumptions\\\hline
\end{tabular}
}
\end{table}
\end{comment}