# 9 Proximal Gradient Method

In the last section we introduced the subgradient method, which converges slowly. We now introduce the proximal gradient method, a more efficient method for solving 
$$\min \left\{g(x)+h(x)\right\}$$
where $g$ is differentiable over $\mathbb R^n$ and $h$ is convex but not necessarily differentiable. We shall see how to solve this problem when $h$ has a special structure that makes each iteration cheap to compute.


## Proximal Gradient Method

Given a step size $t_k > 0$, each iteration computes
$$x_{k+1} = \text{argmin}_x\left\{\nabla g(x_k)^T(x - x_k) + \frac{1}{2t_k}\Vert x - x_k\Vert^2 + h(x)\right\}.$$

### Special Cases

#### Gradient Descent

When $h(x)\equiv 0$, the method reduces to gradient descent.

Proof:
$$\nabla g(x_k)^T(x - x_k) + \frac{1}{2t_k}\Vert x - x_k\Vert^2 
= \frac{1}{2t_k} \Vert t_k\nabla g(x_k)  +( x - x_k)\Vert^2 - \frac{t_k}{2}\Vert \nabla g(x_k)\Vert^2.
$$
The last term does not depend on $x$, so the minimizer is $x_{k+1} = x_k -  t_k\nabla g(x_k)$, precisely the gradient descent iteration. Hence the proximal gradient method generalizes gradient descent to the case $h(x)\not\equiv 0$.

#### Projected Subgradient Method

When $C$ is a closed convex set and $h(x)  = I_C(x) =\left\{\begin{array}{ll} 0 & x\in C\\ +\infty & x \notin C\end{array}\right.$ is its indicator function, the method reduces to the projected gradient method (the projected subgradient method with $\nabla g(x_k)$ as the subgradient).

Proof: Since $h(x) = +\infty$ whenever $x\notin C$, it suffices to consider $x\in C$, where $h(x) = 0$. By the same completion of the square as in the gradient descent case, the problem becomes
$$\text{argmin}_{x\in C}\left\{\frac{1}{2t_k} \Vert t_k\nabla g(x_k)  +( x - x_k)\Vert^2 - \frac{t_k}{2}\Vert \nabla g(x_k)\Vert^2\right\}.$$
The minimizer is the point of $C$ closest to $x_k - t_k\nabla g(x_k)$, i.e. its projection onto $C$:
$$x_{k+1} = \Pi_C (x_k - t_k \nabla g(x_k)).$$
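
As a minimal sketch of this special case, the snippet below takes $C$ to be a box, for which the projection is an entrywise clip. The box, the test function $g(x) = \frac12\Vert x - a\Vert^2$, and the names `project_box` and `projected_gradient_step` are illustrative assumptions, not part of the notes.

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box C = {x : lo <= x_i <= hi}."""
    return np.clip(x, lo, hi)

def projected_gradient_step(x, grad_g, t, lo, hi):
    """One step x_{k+1} = Pi_C(x_k - t * grad g(x_k)) for a box C."""
    return project_box(x - t * grad_g(x), lo, hi)

# Hypothetical example: g(x) = 0.5 * ||x - a||^2 with a outside the box [0, 1]^2.
a = np.array([2.0, -1.0])
grad_g = lambda x: x - a

x = np.array([0.5, 0.5])
for _ in range(50):
    x = projected_gradient_step(x, grad_g, t=0.5, lo=0.0, hi=1.0)
```

For this choice of $g$ the constrained minimizer is the projection of $a$ onto the box, so the iterates should settle at $(1, 0)$.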




## Proximal Mapping 

As seen above, completing the square and multiplying the objective by $t_k$ turns each iteration into the minimization problem
$$\text{argmin}_x\left\{\frac12 \Vert t_k\nabla g(x_k)  +( x - x_k)\Vert^2 +t_k h(x)\right\}. $$
If we use the notation
$${\rm prox}_f(x_0)= \text{argmin}_x\{f(x) +\frac 12 \Vert x -x_0\Vert^2  \},$$
then the iteration is $x_{k+1} = {\rm prox}_{t_kh}(x_k - t_k\nabla g(x_k))$. The map ${\rm prox}_f$ is called the proximal mapping of $f$. 
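
The iteration, written as a gradient step on $g$ followed by the proximal mapping of $t_kh$, can be sketched as follows. This is an illustrative sketch with a fixed step size; the function name `proximal_gradient` and the convention that `prox_h(v, t)` returns ${\rm prox}_{th}(v)$ are assumptions made here.

```python
import numpy as np

def proximal_gradient(grad_g, prox_h, x0, t, num_iters=100):
    """Iterate x_{k+1} = prox_{t h}(x_k - t * grad g(x_k)) with a fixed step t.

    prox_h(v, t) is assumed to return prox_{t h}(v).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = prox_h(x - t * grad_g(x), t)
    return x

# With h = 0 the proximal mapping is the identity, giving plain gradient descent.
a = np.array([1.0, -2.0, 3.0])
grad_g = lambda x: x - a            # gradient of g(x) = 0.5 * ||x - a||^2
identity_prox = lambda v, t: v
x_gd = proximal_gradient(grad_g, identity_prox, np.zeros(3), t=0.5)
```

Passing a different `prox_h` changes the regularizer without touching the loop, which is the main appeal of this formulation.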

Intuitively, the proximal mapping minimizes $f$ while penalizing the distance to $x_0$, so the minimizer stays in a neighborhood of $x_0$.

### Examples 

#### Quadratic Function

Let $A$ be positive semidefinite, 
$$th(x) = t\left(\frac 12 x^TAx+b^Tx+c\right)\quad\Rightarrow\quad {\rm prox}_{th}(z) = (I+tA)^{-1}(z - tb).$$

Proof: The proximal mapping minimizes (with respect to $x$) 
$$t\left(\frac 12x^TAx+b^Tx+c\right) +\frac12  x^Tx -z^Tx +\frac12 z^Tz
=\frac 12x^T(I + tA)x - (z - tb)^Tx + \frac 12 z^Tz + tc.$$
Since $I + tA$ is positive definite, setting the gradient $(I+tA)x - (z - tb)$ to zero yields the unique minimizer $x_* = (I + tA)^{-1}(z - tb)$.
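
A small numerical sketch of this formula is given below. The function name `prox_quadratic` and the test data are assumptions chosen for illustration; the linear system is solved directly instead of forming the inverse.

```python
import numpy as np

def prox_quadratic(z, A, b, t):
    """prox_{t h}(z) for h(x) = 0.5 x^T A x + b^T x + c, A symmetric PSD."""
    n = z.shape[0]
    # Solve (I + t A) x = z - t b rather than forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) + t * A, z - t * b)

A = np.array([[2.0, 0.0], [0.0, 0.0]])   # positive semidefinite, singular
b = np.array([1.0, -1.0])
z = np.array([3.0, 1.0])
t = 0.5
x = prox_quadratic(z, A, b, t)

# Optimality check: the gradient t (A x + b) + (x - z) should vanish.
gradient = t * (A @ x + b) + (x - z)
```

Note that $A$ itself may be singular, as in this example, while $I + tA$ is always invertible for $t > 0$.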

#### Euclidean Norm 

$$th(x) = t\Vert x \Vert_2 \quad\Rightarrow\quad {\rm prox}_{th}(z) = \left\{\begin{array}{ll}(\Vert z \Vert_2 - t) \frac{z}{\Vert z \Vert_2} & {\rm if\ }\Vert z \Vert_2 \geqslant t,\\  0 & {\rm if \ }\Vert z \Vert_2 < t.\end{array}\right.$$

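Proof: The proximal mapping minimizes (with respect to $x$)
$t\Vert x \Vert_2 + \frac12\Vert x - z\Vert_2^2$. By the reverse triangle inequality, 
$$t\Vert x\Vert +\frac 12 \Vert x - z \Vert ^2\geqslant t\Vert x \Vert  +\frac12 \left(\Vert z\Vert - \Vert x \Vert\right)^2,
$$
with equality when $x$ is a nonnegative multiple of $z$. The right-hand side is a quadratic in $r = \Vert x\Vert \geqslant 0$ with derivative $t + r - \Vert z\Vert$. When $\Vert z \Vert \geqslant t$, the minimum is reached at $\Vert x_*\Vert = \Vert z \Vert - t$ and $x_* = \Vert x_*\Vert \frac{z}{\Vert z\Vert}$. When $\Vert z\Vert < t$, the derivative is positive for all $r\geqslant 0$, so the minimum is reached at $x_* = 0$.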
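
The two cases can be sketched in a few lines. The function name `prox_l2_norm` and the test vector are illustrative assumptions.

```python
import numpy as np

def prox_l2_norm(z, t):
    """prox_{t h}(z) for h(x) = ||x||_2: shrink the length of z by t, or return 0."""
    norm_z = np.linalg.norm(z)
    if norm_z <= t:
        return np.zeros_like(z)
    return (1.0 - t / norm_z) * z

z = np.array([3.0, 4.0])          # ||z||_2 = 5
shrunk = prox_l2_norm(z, t=1.0)   # same direction, length reduced from 5 to 4
zeroed = prox_l2_norm(z, t=6.0)   # t exceeds ||z||_2, so the result is 0
```

Unlike the entrywise soft threshold, this operator shrinks the whole vector along its own direction.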

#### Logarithmic Barrier

$$th(x) = -t\sum_{i=1}^n \log x_i\quad \Rightarrow\quad {\rm prox}_{th}(z)_i = \frac{z_i+\sqrt{z_i^2+4t}}{2}.$$

Proof: The objective is separable, so it suffices to minimize it entrywise over $x_i > 0$: $u(x_i) = -t\log x_i +\frac 12 (x_i - z_i)^2$ with 
$u'(x_i)=-\frac t{x_i}+x_i - z_i$. Setting $u'(x_i) = 0$ gives $x_i^2 - z_ix_i - t = 0$, whose positive root $x_i = \frac{z_i+\sqrt{z_i^2+4t}}{2}$ is the minimizer.
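
The closed form can be checked numerically against the stationarity condition $u'(x_i) = 0$. The function name `prox_log_barrier` and the test values are assumptions made for this sketch.

```python
import numpy as np

def prox_log_barrier(z, t):
    """prox_{t h}(z) for h(x) = -sum_i log(x_i), applied entrywise."""
    z = np.asarray(z, dtype=float)
    return (z + np.sqrt(z**2 + 4.0 * t)) / 2.0

t = 1.0
z = np.array([-1.0, 0.0, 3.0])
x = prox_log_barrier(z, t)

# Stationarity residual u'(x_i) = -t / x_i + x_i - z_i; should be close to 0.
residual = -t / x + x - z
```

The output is strictly positive even when $z_i \leqslant 0$, so the result always lies in the domain of the barrier.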

#### Soft Threshold

$$th(x) = t\Vert x\Vert_1\quad\Rightarrow\quad {\rm prox}_{th}(z)_i= \left\{\begin{array}{ll} z_i - t & {\rm if\ }z_i \geqslant  t,\\ 0 & {\rm if \ } -t<z_i < t,\\ z_i + t & {\rm if\ } z_i \leqslant -t.\end{array}\right.$$

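Proof: The objective is separable, so it suffices to minimize it entrywise: $u(x_i) = t|x_i| + \frac 12(x_i - z_i)^2$. For $x_i > 0$ the stationary point is $x_i = z_i - t$, which is valid only if $z_i > t$; for $x_i < 0$ it is $x_i = z_i + t$, which is valid only if $z_i < -t$; otherwise the minimizer is $x_i = 0$. This case analysis over $z_i\geqslant t$, $z_i\leqslant -t$ and $-t<z_i<t$ leads to the result. 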
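
The three cases collapse into a single vectorized expression, sketched below. The function name `soft_threshold` and the test vector are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """prox_{t h}(z) for h(x) = ||x||_1, applied entrywise."""
    # Shrink each magnitude by t, clamp at zero, and restore the sign.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, 0.5, -0.2, -2.0])
x = soft_threshold(z, t=1.0)
```

Here the entries with $|z_i| < 1$ are set to zero and the others move toward zero by $1$, giving $(2, 0, 0, -1)$.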

### Convergence 

For the convergence analysis, we assume the following. The function $g$ is convex over $\mathbb R^n$, its gradient is Lipschitz continuous with constant $L$, $\Vert \nabla g(x) -\nabla g(y)\Vert \leqslant L\Vert x - y\Vert$, and it is strongly convex with constant $m\geqslant 0$, $g(y)\geqslant g(x)+\nabla g(x)^T(y-x)+\frac m2 \Vert y - x\Vert^2$ (the case $m = 0$ is plain convexity). The function $h$ is closed and convex. Lastly, we assume the optimal value of $g(x)+h(x)$ is attained at some $x_*$. Then, with the fixed step size $t_k = \frac 1L$, the proximal gradient method has an $O(\frac 1k)$ convergence rate in the objective value, i.e. $g(x_k)+h(x_k) - g(x_*) - h(x_*) = O(\frac 1k)$. 
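
To illustrate the fixed step size $t_k = \frac1L$, the sketch below runs the method on a small randomly generated problem with $g(x) = \frac12\Vert Ax - b\Vert^2$ and $h(x) = \lambda\Vert x\Vert_1$. This test problem, the data, and the value of `lam` are assumptions chosen for illustration and do not come from the notes; here $L$ is the largest eigenvalue of $A^TA$.

```python
import numpy as np

def soft_threshold(z, t):
    """prox_{t h}(z) for h(x) = ||x||_1, applied entrywise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Hypothetical test data for g(x) = 0.5 * ||A x - b||^2, h(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1

def objective(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Lipschitz constant of grad g: squared spectral norm of A.
L = np.linalg.norm(A, 2) ** 2
t = 1.0 / L

x = np.zeros(5)
values = [objective(x)]
for _ in range(200):
    grad = A.T @ (A @ x - b)
    # prox of t * lam * ||.||_1 is the soft threshold with level lam * t.
    x = soft_threshold(x - t * grad, lam * t)
    values.append(objective(x))
```

With this step size the recorded objective values should be nonincreasing, which can be confirmed by inspecting `values`.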

Proof: 