## References 

1. Ruszczyński, Andrzej. Nonlinear Optimization. Princeton University Press, Princeton, New Jersey, 2006.

# 8 Subgradient Method

In this chapter we develop optimization strategies for non-differentiable functions. Throughout the discussion below we assume that $f$ is <font color=red>convex</font>.

### Subgradient

Recall that for a differentiable convex function $f$, the gradient $\nabla f$ has the property that 
$$f(y)\geqslant f(x) + \nabla f(x)^T(y - x).$$

In the same manner, for a convex but possibly non-differentiable function $f$, we call a vector $g$ a **subgradient** of $f$ at $x$ if 
$$f(y)\geqslant f(x)+g^T(y-x)\quad \forall y\in {\rm dom}(f).$$
Geometrically, $g$ defines a supporting hyperplane of the epigraph of $f$ at $(x, f(x))$. Hence a convex function has at least one subgradient at every point in the interior of its domain.
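The defining inequality can be checked numerically on a grid. The sketch below is only an illustration: it assumes the example function $f(x) = |x|$, and the helper name `is_subgradient` is hypothetical. A finite grid can refute a candidate slope but cannot prove the inequality for all $y$.

```python
def f(x):
    # Example function (an assumption for illustration): f(x) = |x|.
    return abs(x)

def is_subgradient(g, x, test_points, tol=1e-12):
    """Check f(y) >= f(x) + g * (y - x) on a finite set of test points."""
    return all(f(y) >= f(x) + g * (y - x) - tol for y in test_points)

# Grid of test points in [-5, 5].
ys = [i / 10.0 for i in range(-50, 51)]

print(is_subgradient(0.3, 0.0, ys))  # a slope in [-1, 1] at the kink x = 0
print(is_subgradient(1.5, 0.0, ys))  # a slope that is too steep at x = 0
print(is_subgradient(1.0, 2.0, ys))  # the ordinary derivative at x = 2
```

The first and third candidates pass the check and the second fails, matching the fact that the slopes supporting $|x|$ at the kink are exactly those in $[-1, 1]$.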

### Subdifferential

The subdifferential of $f$ at $x$ is the set of all subgradients:
$$\partial f(x) = \{g:\ g^T(y - x)\leqslant f(y) - f(x)\quad  \forall y\in {\rm dom}(f )\}.$$
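As a small worked example (a standard one, not taken from the text above), consider $f(x) = |x|$ on $\mathbb R$. Away from the origin $f$ is differentiable and the subdifferential contains only the derivative. At $x = 0$ the condition $g\,y\leqslant |y|$ for all $y$ holds exactly when $|g|\leqslant 1$, so

$$\partial f(x) = \left\{\begin{array}{ll}\{-1\} & x<0\\ \left[-1,1\right] & x=0\\ \{1\} & x>0\end{array}\right.$$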


## Basic Rules


### Calculus

Proofs of the basic properties below can be found in [1]. Suppose the functions involved are convex with nonempty open domains. Then

1. $\partial (\alpha f) = \alpha \partial f$ for $\alpha>0$. 
2. $\partial (f_1 + f_2) = \partial f_1 + \partial f_2$. 
3. If $h(x) = f(Ax+b)$ is the composition of $f$ with an affine map, then $$\partial h(x) = A^T\partial f(Ax+b).$$
4. If $f(x) = \max_{1\leqslant i\leqslant n}f_i(x)$ is a pointwise maximum of convex functions, then
$$\partial f(x) = {\rm Conv\ Hull\ }\bigcup_{ f_i(x)= f(x)}  \partial f_i(x).$$
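The affine rule (item 3) can be illustrated on the function $x\mapsto\Vert Ax+b\Vert_1$, for which one subgradient is $A^T s$ with $s_i = {\rm sign}((Ax+b)_i)$. The following is a minimal sketch under that assumption; the matrix, vector, and function names are made up for the example.

```python
def sign(t):
    """Return sign(t); this is a subgradient of |t| at t (0 is a valid choice at t = 0)."""
    return (t > 0) - (t < 0)

def residual(A, b, x):
    """Compute Ax + b for a list-of-rows matrix A."""
    return [sum(a * xj for a, xj in zip(row, x)) + bi for row, bi in zip(A, b)]

def l1_affine(A, b, x):
    """Value of ||Ax + b||_1."""
    return sum(abs(r) for r in residual(A, b, x))

def l1_affine_subgradient(A, b, x):
    """One element of A^T * (subdifferential of ||.||_1 at Ax + b)."""
    s = [sign(r) for r in residual(A, b, x)]
    return [sum(A[i][j] * s[i] for i in range(len(A))) for j in range(len(x))]

# Illustrative data (assumed, not from the text).
A = [[1.0, 2.0], [3.0, -1.0]]
b = [0.0, 1.0]
x = [1.0, 1.0]
g = l1_affine_subgradient(A, b, x)
print(g)  # Ax + b = [3, 3], so s = [1, 1] and g = A^T s = [4.0, 1.0]
```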

### Optimality Conditions
 
A point $x$ minimizes $f$ if and only if $0\in \partial f(x)$. Indeed, $0\in \partial f(x)$ means exactly that $f(y)\geqslant f(x) + 0^T(y-x) = f(x)$ for all $y$.



### Examples




#### Maximum of Linear

Let $f(x) = \max_{1\leqslant i\leqslant m}(a_i^Tx+b_i)$. Characterize the minimizers of $f$.

Solution: $f$ is convex, as it is the pointwise maximum of the affine functions $f_i(x) = a_i^Tx+b_i$. So for arbitrary $x$, 
$$\partial f(x) = {\rm Conv\ Hull\ }\bigcup_{ f_i(x)= f(x)}  \{a_i\}.$$
By the optimality condition, $x$ is a minimizer if and only if $0\in \partial f(x)$, that is, $0$ is a convex combination of the active $a_i$. Thus $x$ is a minimizer exactly when there exist $\lambda_i\geqslant 0$ such that
$$0 = \sum_{i=1}^m \lambda_i a_i\quad{\rm and}\quad \lambda_i = 0 {\ \rm if\ }f_i(x)\neq f(x)\quad{\rm and}\quad \sum_{i=1}^m\lambda_i = 1.$$

These are the KKT conditions of the equivalent linear program $\min_{x,t} t$ subject to $a_i^Tx+b_i\leqslant t$ (obtained by introducing the extra variable $t$), and the second condition corresponds to complementary slackness.
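The active set and the optimality certificate can be computed directly. The sketch below assumes a small illustrative instance, $f(x) = \max(x_1, -x_1, x_2, -x_2) = \Vert x\Vert_\infty$, and a hypothetical helper `max_affine`; any active $a_i$ is a subgradient, and at the minimizer equal weights on the active rows sum to zero.

```python
def max_affine(a, b, x, tol=1e-9):
    """Return f(x) = max_i (a_i^T x + b_i) and the list of active indices."""
    vals = [sum(aij * xj for aij, xj in zip(ai, x)) + bi for ai, bi in zip(a, b)]
    fx = max(vals)
    active = [i for i, v in enumerate(vals) if fx - v <= tol]
    return fx, active

# Illustrative data (assumed): f is the infinity norm on R^2.
a = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [0.0, 0.0, 0.0, 0.0]

# At a generic point only one piece is active; its a_i is a subgradient.
fx, active = max_affine(a, b, [2.0, 1.0])
print(fx, active, a[active[0]])

# At the origin every piece is active, and equal weights give 0.
f0, active0 = max_affine(a, b, [0.0, 0.0])
lam = 1.0 / len(active0)
combo = [sum(lam * a[i][j] for i in active0) for j in range(2)]
print(f0, active0, combo)
```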

## Projection

### Indicator

Given a <font color=red>convex</font> set $C$, define an indicator $I_C(x)$ by $I_C(x) = \left\{\begin{array}{ll}0 & x\in C\\ +\infty & x\notin C\end{array}\right.$, then for $x\in C$, $\partial I_C(x) = N_C(x)$ is the normal cone where $$N_C(x) = \{g:\ g^T(y - x)\leqslant 0\quad \forall y\in C\}.$$

Here the convexity of $C$ guarantees the convexity of $I_C(x)$. In the following sections, $C$ is always convex.

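Proof: Let $x\in C$, so that $I_C(x) = 0$. A vector $g$ is a subgradient at $x$ if and only if $I_C(y)\geqslant I_C(x) + g^T(y - x) = g^T(y-x)$ for all $y$. When $y\notin C$ the left-hand side is $+\infty$ and the inequality holds trivially. When $y\in C$, it reads $0\geqslant g^T(y-x)$.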
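For a concrete picture (an illustrative example, not from the text above), take the interval $C = [0,1]\subset\mathbb R$. At the left endpoint the condition $g\,(y-0)\leqslant 0$ for all $y\in[0,1]$ forces $g\leqslant 0$; at the right endpoint $g\,(y-1)\leqslant 0$ forces $g\geqslant 0$; at an interior point $y - x$ takes both signs, so only $g = 0$ remains:

$$N_C(x) = \left\{\begin{array}{ll}(-\infty,0] & x=0\\ \{0\} & 0<x<1\\ \left[0,+\infty\right) & x=1\end{array}\right.$$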

### Minimum on a Set
Suppose $f$ is convex over a domain $C$, then $\min_{x\in C}f(x) = \min_x \left(f(x) + I_C(x)\right)$.


### Projection

Let $\prod_C(x)$ be the projection of $x$ onto a closed convex set $C$, that is, the point $y$ in $C$ with minimum distance to $x$,
$$y = \prod_C(x) = {\rm argmin}_{y\in C}\frac12 \Vert y - x\Vert_2^2
= {\rm argmin}_{y}\frac12 \Vert y - x\Vert_2^2 + I_C(y).$$
By the sum rule, the subdifferential of the objective at a point $y\in C$ is
$$\{y - x+g:\quad g^T(u-y)\leqslant 0\quad\forall u\in C\}.$$
At the minimizer $y = \prod_C(x)$ this set must contain $0$, which forces $g = x-y$ and therefore
$$(x - y)^T(u - y)\leqslant 0 \quad \forall u \in C.$$ 
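The inequality $(x-y)^T(u-y)\leqslant 0$ can be checked numerically for sets whose projection has a closed form. The sketch below assumes two such sets, a Euclidean ball centered at the origin and a box; the function names are hypothetical.

```python
import math
import random

def project_ball(x, radius=1.0):
    """Project x onto the Euclidean ball of the given radius centered at 0."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm <= radius:
        return list(x)
    return [radius * t / norm for t in x]

def project_box(x, lo, hi):
    """Project x onto the box [lo, hi]^n by clipping each coordinate."""
    return [min(max(t, lo), hi) for t in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
x = [3.0, -4.0]
y = project_ball(x)
print(y)  # x has norm 5, so y = x / 5

# Sample points u in the ball and evaluate (x - y)^T (u - y).
samples = [project_ball([random.uniform(-2, 2), random.uniform(-2, 2)])
           for _ in range(1000)]
worst = max(dot([xi - yi for xi, yi in zip(x, y)],
                [ui - yi for ui, yi in zip(u, y)]) for u in samples)
print(worst <= 1e-9)
```

The largest sampled value of $(x-y)^T(u-y)$ stays nonpositive up to rounding, as the characterization predicts.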

Also, if we let $f(x) = d_C(x) = \min_{y\in C}\Vert y - x\Vert = \Vert \prod_C(x) - x\Vert$ where $C$ is convex, then $\frac{x - \prod_C(x)}{\Vert x - \prod_C(x)\Vert}\in \partial f$ at $x$.

Proof: Write $\prod$ for $\prod_C$. By the definition of a subgradient, it suffices to prove that 
$$\frac{(x - \prod(x))^T}{\Vert x - \prod(x)\Vert}(y - x)\leqslant \Vert \prod (y) - y\Vert -\Vert x - \prod(x)\Vert.$$
This is because 
$$\begin{aligned}(x - \prod(x))^T(y - x) + \Vert x - \prod(x)\Vert^2&
=(x - \prod(x))^T(y - \prod(x))\\ 
&= (x - \prod(x))^T(\prod(y) - \prod(x))+(x - \prod(x))^T(y - \prod(y) )\\ 
&\leqslant (x - \prod(x))^T(y - \prod(y) )\\ 
&\leqslant \Vert x - \prod(x)\Vert \Vert y - \prod(y)\Vert.
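\end{aligned}$$
Here the first inequality uses the projection property $(x - \prod(x))^T(u - \prod(x))\leqslant 0$ with $u = \prod(y)\in C$, and the second is the Cauchy-Schwarz inequality. Dividing by $\Vert x - \prod(x)\Vert$ gives the claim.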
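As a numerical sanity check of this subgradient, the sketch below assumes $C$ is the unit Euclidean ball, where $d_C(x) = \max(\Vert x\Vert - 1, 0)$; all function names are hypothetical.

```python
import math

def project_ball(x, radius=1.0):
    """Project x onto the Euclidean ball of the given radius centered at 0."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm <= radius:
        return list(x)
    return [radius * t / norm for t in x]

def dist_to_ball(x, radius=1.0):
    """Distance d_C(x) = ||x - proj_C(x)||."""
    p = project_ball(x, radius)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))

def dist_subgradient(x, radius=1.0):
    """(x - proj_C(x)) / ||x - proj_C(x)|| outside C; 0 (also valid) inside C."""
    p = project_ball(x, radius)
    d = dist_to_ball(x, radius)
    if d == 0.0:
        return [0.0] * len(x)
    return [(a - b) / d for a, b in zip(x, p)]

x = [3.0, 4.0]
g = dist_subgradient(x)
dx = dist_to_ball(x)

# Check d_C(y) >= d_C(x) + g^T (y - x) on a grid of points y.
ok = all(
    dist_to_ball([i / 2.0, j / 2.0])
    >= dx + g[0] * (i / 2.0 - x[0]) + g[1] * (j / 2.0 - x[1]) - 1e-9
    for i in range(-12, 13) for j in range(-12, 13)
)
print(g, dx, ok)
```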

### Law of Cosines

For $x\in C$ and arbitrary $y$, we have 
$$\Vert \prod_C(y) - x\Vert \leqslant \Vert y - x\Vert.$$
Proof: Since $x\in C$, the projection property gives $\left(y - \prod_C(y)\right)^T\left(x - \prod_C(y)\right)\leqslant 0$, so the cross term below is nonnegative and
$$\Vert y - x\Vert^2 = \Vert y - \prod_C(y)\Vert^2 + 2\left(y - \prod_C(y)\right)^T\left(\prod_C(y) - x\right)
+\Vert  \prod_C(y) - x\Vert^2\geqslant \Vert \prod_C(y) - x\Vert^2.$$

## Projected Subgradient Method

Consider the optimization problem $\min_{x\in C}f(x)$ where $f$ is convex but not necessarily differentiable and $C$ is a closed convex set containing more than one point.
We solve it by iterating as follows:
$$y_{k+1} = x_k - \alpha_k g_k\quad\quad x_{k+1} = \prod_C(y_{k+1})$$
where $g_k$ is an arbitrary subgradient from $\partial f(x_k)$ while $\prod_C(y_{k+1})$ stands for the projection. The parameter $\alpha_k$ determines the step size, which can be a constant or diminishing to zero.

Further, if $C = \mathbb R^n$ is the whole space, the projection is unnecessary and the iteration is simply called the subgradient method.
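The iteration can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function `projected_subgradient` and the test problem (minimizing $\Vert x - c\Vert_1$ over the box $[0,1]^2$ with $c = (2,-1)$, whose solution is $x_* = (1, 0)$ with $f_* = 2$) are assumptions chosen so the answer is known. Because the method is not a descent method, the sketch tracks the best iterate seen so far.

```python
import math

def projected_subgradient(f, subgrad, project, x1, step, iters):
    """Run x_{k+1} = project(x_k - step(k) * g_k) and return the best iterate."""
    x = list(x1)
    best_x, best_f = list(x), f(x)
    for k in range(1, iters + 1):
        g = subgrad(x)                                   # any g_k in the subdifferential
        y = [xi - step(k) * gi for xi, gi in zip(x, g)]  # subgradient step
        x = project(y)                                   # projection back onto C
        fx = f(x)
        if fx < best_f:
            best_x, best_f = list(x), fx
    return best_x, best_f

def sign(t):
    return (t > 0) - (t < 0)

# Illustrative problem (assumed): minimize ||x - c||_1 over the box [0, 1]^2.
c = [2.0, -1.0]
f = lambda x: sum(abs(xi - ci) for xi, ci in zip(x, c))
subgrad = lambda x: [sign(xi - ci) for xi, ci in zip(x, c)]
project = lambda y: [min(max(t, 0.0), 1.0) for t in y]

best_x, best_f = projected_subgradient(
    f, subgrad, project, x1=[0.5, 0.5],
    step=lambda k: 0.1 / math.sqrt(k), iters=200)
print(best_x, best_f)
```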

<br> 

<br>

In the following part, we assume that the objective function $f$ is convex, attains its finite minimum $f_*$ over $C$ at some point $x_*\in C$, and has bounded subgradients,
$$\Vert g \Vert\leqslant G\quad\quad\forall g\in \partial f(x),\ x\in C.$$
We further introduce the notation $R = \Vert x_1  - x_*\Vert$.

### Lipschitz Continuity

In fact, for a convex $f$, we can show that the bounded-subgradient property $\Vert g \Vert\leqslant G\ \forall g\in \partial f$ required above is equivalent to 
the Lipschitz continuity $| f(y) - f(x)|\leqslant G\Vert y-x \Vert $.

Proof: $\Rightarrow$: For arbitrary $y$ and $x$, we may assume $f(y) \leqslant f(x)$ (otherwise swap the roles of $x$ and $y$). Since $f$ is convex, $\partial f(x)$ is not empty, and we pick an arbitrary $g_x\in \partial f(x)$ so that 
$$ 0\geqslant f(y) - f(x)\geqslant g_x^T(y - x) \geqslant -G\Vert y - x\Vert.$$

$\Leftarrow$: For arbitrary $x$ and $g\in \partial f(x)$, we have for all $y$
$$g^T(y - x)\leqslant f(y) - f(x) \leqslant G\Vert y - x\Vert.$$
Choosing $y = x + g$, so that $g$ and $y - x$ are parallel, gives $\Vert g\Vert^2\leqslant G\Vert g\Vert$ and thus $\Vert g\Vert\leqslant G$.

### Convergence

In the projected subgradient method, we have that
$$\Vert x_{i+1} - x_*\Vert^2 \leqslant \Vert x_i - x_*\Vert^2 - 2\alpha_i \left(f(x_i) - f_*\right) + \alpha_i^2 \Vert g_i\Vert^2.$$

If we denote by $f_{bs} = \min_{1\leqslant i\leqslant k} f(x_i)$ the best value found in the first $k$ iterations (the projected subgradient method is not a descent method, so $f(x_i)$ need not decrease monotonically), then 
$$f_{bs} - f_*\leqslant \frac{\Vert x_1 -x_*\Vert^2+\sum_{i=1}^k \alpha_i^2 \Vert g_i\Vert^2}{2\sum_{i=1}^k \alpha_i}
\leqslant \frac{R^2+G^2\sum_{i=1}^k \alpha_i^2 }{2\sum_{i=1}^k \alpha_i}.$$

Proof: By the iteration $x_{i+1} = \prod_C(x_i - \alpha_i g_i)$, the law of cosines applied to $x_*\in C$, and the subgradient inequality $g_i^T(x_* - x_i)\leqslant f(x_*) - f(x_i)$, 
$$\begin{aligned}\Vert x_{i+1} - x_*\Vert^2 &= \Vert \prod_C(x_i - \alpha_i g_i) - x_*\Vert^2
\leqslant \Vert \ x_i - \alpha_i g_i  - x_*\Vert ^2\\
&= \Vert x_i-x_*\Vert^2 +2\alpha_i g_i^T(x_* - x_i  ) + \alpha_i^2\Vert g_i\Vert^2 \\ 
&\leqslant \Vert x_i-x_*\Vert^2 +2\alpha_i  (f(x_* )- f(x_i)  ) + \alpha_i^2\Vert g_i\Vert^2.
\end{aligned}$$

Summing from $i = 1$ to $k$ and using $f(x_i)\geqslant f_{bs}$, we obtain
$$0\leqslant \Vert x_{k+1} - x_*\Vert^2\leqslant \Vert x_1 - x_*\Vert^2 - 2\sum_{i=1}^k \alpha_i \left(f_{bs} - f_*\right) 
+ \sum_{i=1}^k \alpha_i^2 \Vert g_i\Vert^2. $$
Rearranging the inequality leads to the desired result. 


### Step Size

We further analyze the convergence with different step size strategies. 

#### Constant Step Size

When $\alpha_i$ is constant $\alpha$, then 
$$f_{bs} - f_* \leqslant \frac{R^2 +  k\alpha^2 G^2}{2k\alpha}\rightarrow \frac 12\alpha G^2.$$
Hence the bound does not guarantee convergence to the optimal value, only to within $\frac 12\alpha G^2$ of it. A smaller $\alpha$ gives a tighter limiting bound, at the cost of a slower decay of the $\frac{R^2}{2k\alpha}$ term.
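The stalling behaviour is easy to observe on a toy problem. The sketch below assumes $f(x) = |x|$ on $\mathbb R$ (so $G = 1$ and $f_* = 0$), a start at $x_1 = 1.05$, and $\alpha = 0.1$; the helper name `best_value_abs` is hypothetical. The iterates end up jumping between roughly $\pm 0.05$, so the best value stays near $\frac12\alpha G^2 = 0.05$ no matter how many iterations are run.

```python
def sign(t):
    return (t > 0) - (t < 0)

def best_value_abs(x1, step, k):
    """Subgradient method on f(x) = |x|; returns min f(x_i) over i = 1..k."""
    x, best = x1, float("inf")
    for i in range(1, k + 1):
        best = min(best, abs(x))      # f(x_i)
        x = x - step(i) * sign(x)     # step with g_i = sign(x_i)
    return best

alpha, k, R, G = 0.1, 100, 1.05, 1.0
best = best_value_abs(1.05, lambda i: alpha, k)
bound = (R ** 2 + k * alpha ** 2 * G ** 2) / (2 * k * alpha)
print(best, bound)

# Running much longer does not help with a constant step.
best_long = best_value_abs(1.05, lambda i: alpha, 10000)
print(best_long)
```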

#### Constant Step Length

The constant step length strategy selects $\alpha_k$ such that $\alpha_k \Vert g_k \Vert = \gamma$ is a constant, i.e. $\alpha_k = \gamma/\Vert g_k\Vert$. In this case, using $\Vert g_i\Vert\leqslant G$, 
$$f_{bs} - f_*\leqslant \frac{R^2+k\gamma^2}{2\gamma\sum_{i=1}^k \frac{1}{\Vert g_i\Vert}}\leqslant 
G\frac{R^2+k\gamma^2}{2k\gamma }\rightarrow \frac 12 G\gamma . $$

It suffers from the same limitation as the constant-step-size strategy. 

#### Diminishing Step Size

If we choose $\alpha_k\rightarrow 0$ while $\sum_{i=1}^{\infty}\alpha_i = +\infty$, then $f_{bs}$ converges to the optimal value $f_*$.

Proof: When $\sum_{i=1}^{k}\alpha_i^2$ is bounded, the numerator stays bounded while the denominator diverges, so 
$\frac{R^2+G^2\sum_{i=1}^k \alpha_i^2}{2\sum_{i=1}^k \alpha_i}\rightarrow 0$ and $f_{bs}$ converges to $f_*$. When it is not bounded, we can apply the **Stolz** theorem:
$$\lim_{k\rightarrow \infty}\frac{R^2+G^2\sum_{i=1}^k \alpha_i^2}{2\sum_{i=1}^k \alpha_i}
= \lim_{k\rightarrow \infty}\frac{G^2\alpha_k^2}{2\alpha_k} = \lim_{k\rightarrow \infty}\frac{G^2\alpha_k}{2} = 0.$$
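For comparison with constant steps, the sketch below runs the same kind of toy experiment with the diminishing rule $\alpha_i = 1/i$. It again assumes $f(x) = |x|$ on $\mathbb R$ ($G = 1$, $f_* = 0$, $x_1 = 1.05$), and the helper name is hypothetical. It also evaluates the general bound $\frac{R^2+G^2\sum\alpha_i^2}{2\sum\alpha_i}$, which is quite loose here.

```python
def sign(t):
    return (t > 0) - (t < 0)

def best_value_abs(x1, step, k):
    """Subgradient method on f(x) = |x|; returns min f(x_i) over i = 1..k."""
    x, best = x1, float("inf")
    for i in range(1, k + 1):
        best = min(best, abs(x))      # f(x_i)
        x = x - step(i) * sign(x)     # step with g_i = sign(x_i)
    return best

k, R, G = 1000, 1.05, 1.0
step = lambda i: 1.0 / i              # alpha_i -> 0 and sum alpha_i = +infinity
best = best_value_abs(1.05, step, k)

sum_alpha = sum(step(i) for i in range(1, k + 1))
sum_alpha_sq = sum(step(i) ** 2 for i in range(1, k + 1))
bound = (R ** 2 + G ** 2 * sum_alpha_sq) / (2 * sum_alpha)
print(best, bound)
```

Since $\sum 1/i$ grows only logarithmically, the bound decreases slowly for this rule; the observed best value is much smaller than the bound on this example.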

#### Polyak's Step Size

When $f_*$ is known (e.g. $f_* = 0$ for some problems), we can choose 
$$\alpha_k = \frac{f(x_k) - f_*}{\Vert g_k\Vert^2}$$
and the previous inequality $\Vert x_{i+1} - x_*\Vert^2 \leqslant \Vert x_i - x_*\Vert^2 - 2\alpha_i \left(f(x_i) - f_*\right) + \alpha_i^2 \Vert g_i\Vert^2$ now reads
$$\Vert x_{i+1} - x_*\Vert^2 \leqslant \Vert x_i - x_*\Vert^2 -\frac{\left(f(x_i) - f_*\right)^2}{\Vert g_i\Vert^2}.$$
Summing from $i = 1$ to $k$ and using $f(x_i)\geqslant f_{bs}$ and $\Vert g_i\Vert\leqslant G$,
$$0\leqslant \Vert x_{k+1} - x_*\Vert^2 \leqslant R^2 -\frac{k\left(f_{bs} - f_*\right)^2}{G^2}.$$
Hence, the convergence is guaranteed by
$$f_{bs} - f_*\leqslant \frac{GR}{\sqrt k}.$$
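Polyak's rule can be tried on a problem with known optimal value. The sketch below assumes the illustrative objective $f(x) = |x_1| + 2|x_2|$ on $\mathbb R^2$ (so $f_* = 0$ at $x_* = 0$ and $G = \sqrt 5$) with $x_1 = (3,-2)$, hence $R = \sqrt{13}$; the function names are hypothetical. The loop stops early if the optimum is hit exactly, to avoid dividing by a zero subgradient.

```python
import math

def sign(t):
    return (t > 0) - (t < 0)

def f(x):
    # Illustrative objective (assumed): f(x) = |x_1| + 2 |x_2|, with f_* = 0.
    return abs(x[0]) + 2.0 * abs(x[1])

def subgrad(x):
    return [sign(x[0]), 2.0 * sign(x[1])]

def polyak(x1, f_star, k):
    """Subgradient method with alpha_i = (f(x_i) - f_*) / ||g_i||^2."""
    x, best = list(x1), float("inf")
    for _ in range(k):
        fx = f(x)
        best = min(best, fx)
        g = subgrad(x)
        gg = sum(gi * gi for gi in g)
        if fx <= f_star or gg == 0.0:   # already optimal
            break
        alpha = (fx - f_star) / gg
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return best

k = 50
best = polyak([3.0, -2.0], 0.0, k)
bound = math.sqrt(5.0) * math.sqrt(13.0) / math.sqrt(k)   # G R / sqrt(k)
print(best, bound)
```

On this example the best value falls far below the $GR/\sqrt k$ bound, which is a worst-case guarantee rather than a prediction of the actual rate.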