# Problem 0

# Problem 1. Convexity and least squares

## 1.1

Note that
$$f(x) = \|b - Ax\|^2 = (b - Ax)^T(b - Ax) = \underbrace{b^Tb - 2b^TAx}_{f_1(x)} + \underbrace{x^TA^TAx}_{f_2(x)}$$
and for any $x, y$ and any $\alpha\in(0, 1)$, 
$$f_1(x + \alpha(y - x)) = b^Tb - 2b^TA(x + \alpha(y - x)) = \alpha(b^Tb - 2b^TAy) + (1 - \alpha)(b^Tb - 2b^TAx) = \alpha f_1(y) + (1 - \alpha)f_1(x).$$

Also, let $Q = A^TA$. Then $Q^T = Q$, so $Q$ is symmetric. Moreover, for any vector $x$, $x^TQx = x^TA^TAx = \|Ax\|^2\ge 0$, so $Q$ is positive semi-definite. By the conclusion from the last homework, $f_2(x) = x^TQx$ is convex. 

Consequently, 
$$f(x + \alpha(y - x)) = f_1(x + \alpha(y - x)) + f_2(x + \alpha(y - x))\le \alpha f_1(y) + (1 - \alpha)f_1(x) + \alpha f_2(y) + (1 - \alpha)f_2(x) = \alpha f(y) + (1 - \alpha)f(x).$$
Hence, $f(x)$ is convex. 
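As a sanity check of the inequality above, here is a minimal Julia sketch that tests it on random points. The matrix `A`, the vector `b`, their sizes and the tolerance `1e-10` are arbitrary choices made for illustration; they are not part of the problem.

```julia
using LinearAlgebra, Random

Random.seed!(0)            # fixed seed so the check is reproducible
A = randn(6, 3)            # arbitrary test matrix (assumption, not from the problem)
b = randn(6)               # arbitrary right-hand side (assumption)
f(x) = norm(b - A * x)^2   # the least squares objective from 1.1

# One random test of f(x + α(y - x)) <= α f(y) + (1 - α) f(x),
# with a small tolerance for floating-point rounding.
function check_once()
    x, y, α = randn(3), randn(3), rand()
    return f(x + α * (y - x)) <= α * f(y) + (1 - α) * f(x) + 1e-10
end

convex_ok = all(check_once() for _ in 1:1000)
```

A random test like this cannot prove convexity, but a single failure would reveal a mistake in the derivation.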

## 1.2

Given any matrix $A$, the corresponding null space is 
$$\text{Null}(A) = \{x \mid Ax = 0\}$$
Take any $x, y\in\text{Null}(A)$, so that $Ax = Ay = 0$. Then, for any $\alpha\in(0, 1)$, we have 
$$A(x + \alpha(y - x)) = \alpha Ay + (1 - \alpha)Ax = 0$$
which means $x + \alpha(y - x)\in\text{Null}(A)$, and thus $\text{Null}(A)$ is a convex set.
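The same argument can be illustrated numerically. The sketch below assumes an arbitrary wide random matrix (so that the null space is nontrivial) and uses `nullspace` from Julia's `LinearAlgebra`; the sizes and $\alpha = 0.3$ are illustrative choices.

```julia
using LinearAlgebra, Random

Random.seed!(6)
A = randn(2, 5)                      # wide matrix, so Null(A) is nontrivial (assumption)
N = nullspace(A)                     # columns form an orthonormal basis of Null(A)

# Two arbitrary points of Null(A), built as combinations of the basis vectors.
x = N * randn(size(N, 2))
y = N * randn(size(N, 2))

α = 0.3
z = x + α * (y - x)                  # point on the segment between x and y
residual = norm(A * z)               # should be zero up to rounding
```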

# Problem 2. Ridge Regression

## 2.1

Let $f_\lambda(x) = \|b - Ax\|_2^2 + \lambda\|x\|_2^2$ with $\lambda > 0$. A necessary condition for a minimizer is $\nabla f_\lambda(x) = 0$, i.e., 
$$-2A^T(b - Ax) +2\lambda x = 0$$
which is equivalent to 
$$(A^TA + \lambda I)x = A^Tb\tag{1}$$

For any nonzero vector $y$ and $\lambda > 0$, we have 
$$y^T(A^TA + \lambda I)y = \|Ay\|^2 + \lambda\|y\|^2 > 0$$ 
so $A^TA+ \lambda I$ is positive definite and hence invertible. Therefore, equation (1) has the unique solution
$$x^* = (A^TA + \lambda I)^{-1}A^Tb$$

We will prove $x^*$ is the global minimizer by verifying $f_\lambda(x)$ is (strictly) convex. From Problem 1, we know that $\|b - Ax\|_2^2$ is convex. Also, 
$$\lambda\|x\|_2^2 = x^T(\lambda I)x$$
where $\lambda I$ is positive definite for $\lambda > 0$, so $\lambda\|x\|_2^2$ is strictly convex. The sum of a convex function and a strictly convex function is strictly convex (one can use the same argument as in 1.1), so $f_\lambda(x)$ is strictly convex. Hence, $x^*$ is the unique global minimizer, i.e., the unique solution to ridge regression. 
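The closed form can be checked numerically with a short Julia sketch. The data `A`, `b` and the value `λ = 0.5` are arbitrary assumptions used only for this check.

```julia
using LinearAlgebra, Random

Random.seed!(1)
A = randn(8, 4)      # arbitrary data matrix (assumption)
b = randn(8)         # arbitrary right-hand side (assumption)
λ = 0.5              # any λ > 0 works here

# Solve (AᵀA + λI) x = Aᵀb, i.e. equation (1), without forming an explicit inverse.
x_star = (A' * A + λ * I) \ (A' * b)

# The gradient -2Aᵀ(b - Ax) + 2λx should vanish at x_star.
grad = -2 * A' * (b - A * x_star) + 2 * λ * x_star

# Strict convexity: random perturbations should never decrease the objective.
f(x) = norm(b - A * x)^2 + λ * norm(x)^2
never_better = all(f(x_star + 0.1 * randn(4)) >= f(x_star) for _ in 1:100)
```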

## 2.2

Suppose $A$ is a $p\times q$ matrix with $p\ge q$. Its (thin) SVD is 
$$A = U\Sigma V^T$$
where $U$ is a $p\times q$ matrix with orthonormal columns ($U^TU = I$), $\Sigma$ is a $q\times q$ diagonal matrix with nonnegative entries, and $V$ is a $q\times q$ orthogonal matrix. We have 
$$x^* = (A^TA + \lambda I)^{-1}A^Tb = (V\Sigma U^TU\Sigma V^T + \lambda I)^{-1}V\Sigma U^Tb$$
$$=(V\Sigma^2V^T + \lambda I)^{-1}V\Sigma U^Tb = [V(\Sigma^2 + \lambda I)V^T]^{-1}V\Sigma U^Tb$$
$$=V(\Sigma^2 + \lambda I)^{-1}V^TV\Sigma U^Tb = V(\Sigma^2 + \lambda I)^{-1}\Sigma U^Tb\tag{2}$$
Now, writing $V = (v_1, \dots, v_q)$, $U = (u_1, \dots, u_q)$ and $\Sigma = \text{diag}\{s_1, \dots, s_q\}$, we can express equation (2) as 
$$x^* = \sum_{i = 1}^q\frac{s_i}{s_i^2 + \lambda}v_iu_i^Tb\tag{3}$$
where each $u_i^Tb$ is a scalar, the inner product of $u_i$ with the full vector $b$.
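A minimal Julia sketch comparing the SVD expansion with the closed form from 2.1 is given below. Here $u_i^Tb$ is the inner product of $u_i$ with the whole vector $b$; the data and `λ = 0.3` are arbitrary assumptions.

```julia
using LinearAlgebra, Random

Random.seed!(2)
A = randn(6, 3)      # arbitrary p × q matrix with p >= q (assumption)
b = randn(6)
λ = 0.3

F = svd(A)           # thin SVD: A = F.U * Diagonal(F.S) * F.Vt

# Sum over singular triplets: s_i / (s_i^2 + λ) * (u_iᵀ b) * v_i
x_svd = sum(F.S[i] / (F.S[i]^2 + λ) * dot(F.U[:, i], b) * F.V[:, i]
            for i in eachindex(F.S))

# Closed form from 2.1 for comparison.
x_direct = (A' * A + λ * I) \ (A' * b)
gap = norm(x_svd - x_direct)
```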

## 2.3

As $\lambda\to\infty$, every coefficient $s_i / (s_i^2 + \lambda)\to 0$, so $x^*\to 0$. 

As $\lambda\to 0$, assuming $A$ has full column rank (all $s_i > 0$), we get $x^*\to\sum_{i = 1}^qs_i^{-1}v_iu_i^Tb$, which is the OLS solution since 
$$x_\text{ols} = (A^TA)^{-1}A^Tb = (V\Sigma^2V^T)^{-1}V\Sigma U^Tb = V\Sigma^{-1}U^Tb= \sum_{i = 1}^qs_i^{-1}v_iu_i^Tb.$$
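Both limits can be observed numerically. The sketch below assumes a random $A$ (which has full column rank with probability one) and uses `1e-10` and `1e10` as stand-ins for $\lambda\to 0$ and $\lambda\to\infty$; the helper name `ridge` is introduced here for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(3)
A = randn(7, 3)                          # full column rank with probability one (assumption)
b = randn(7)

ridge(λ) = (A' * A + λ * I) \ (A' * b)   # ridge solution from 2.1
x_ols = A \ b                            # least squares solution (Julia uses QR here)

err_small = norm(ridge(1e-10) - x_ols)   # small λ: close to the OLS solution
norm_large = norm(ridge(1e10))           # large λ: solution shrinks toward zero
```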

## 2.4

The code is listed below.

In [1]:
using LinearAlgebra, Plots, Random, SparseArrays

In [2]:
# data
teams = ["duke","miami","unc","uva","vt"]
data = [ # team 1 team 2, team 1 pts, team 2 pts
    1 2  7 52 # duke played Miami and lost 7 to 52 
    1 3 21 24 # duke played unc and lost 21 to 24 
    1 4  7 38
    1 5  0 45
    2 3 34 16
    2 4 25 17
    2 5 27  7
    3 4  7  5
    3 5  3 30
    4 5 14 52
]
ngames = size(data,1)
nteams = length(teams)

G = zeros(ngames, nteams)
p = zeros(ngames, 1)

for g=1:ngames
    i = data[g,1]
    j = data[g,2]
    Pi = data[g,3]
    Pj = data[g,4]
  
    G[g,i] = 1
    G[g,j] = -1
    p[g] = Pi - Pj
end

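In [8]:
# For lambda = 0 (limit of the ridge solution in equation (2))
λ = 0
F = svd(G)
# G has a zero singular value (each row sums to zero), so drop it instead of dividing by it
filt = [s > 1e-10 * maximum(F.S) ? s / (s^2 + λ) : 0.0 for s in F.S]
ridge_zero = F.V * Diagonal(filt) * F.U' * p
round.(ridge_zero; digits=1)

5×1 Matrix{Float64}:
 -24.8
  18.2
  -8.0
  -3.4
  18.0

In [9]:
isapprox(sum(ridge_zero), 0; atol=1e-8)

true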

As we can see, the ranking of the five teams is Duke < UNC < UVA < VT < Miami, which matches what we got in class. However, recall that in class we solved a constrained least squares problem by constraining the sum of the scores to be a constant. Here, without any explicit constraint, the scores still sum to zero, because every row of $G$ sums to zero and the computed score vector lies in the row space of $G$. 

As $\lambda\to\infty$, the score vector tends to $0$, which is uninformative.
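For finite $\lambda > 0$ the ridge system is invertible and can be solved directly. The following sketch rebuilds `G` and `p` from the same game data and compares the rankings for a few illustrative values of $\lambda$; the helper name `ridge_scores` and the chosen values are assumptions of this sketch.

```julia
using LinearAlgebra

# Same game data as above: team i, team j, points of i, points of j.
data = [1 2 7 52; 1 3 21 24; 1 4 7 38; 1 5 0 45; 2 3 34 16;
        2 4 25 17; 2 5 27 7; 3 4 7 5; 3 5 3 30; 4 5 14 52]
G = zeros(size(data, 1), 5)
p = zeros(size(data, 1))
for g in axes(data, 1)
    i, j, Pi, Pj = data[g, :]
    G[g, i], G[g, j], p[g] = 1, -1, Pi - Pj
end

# Ridge scores for λ > 0 (hypothetical helper, not from the original notebook).
ridge_scores(λ) = (G' * G + λ * I) \ (G' * p)

λs = (0.01, 1.0, 100.0)
rankings = [sortperm(ridge_scores(λ)) for λ in λs]   # team indices, worst to best
norms = [norm(ridge_scores(λ)) for λ in λs]          # shrinks as λ grows
```

For this round-robin schedule the ordering should be the same for every $\lambda > 0$; only the scale of the scores shrinks as $\lambda$ grows.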

## 2.5

First consider the partition of $A$ and $x$: 
$$A = \begin{pmatrix} A_1 & \tilde{A}\end{pmatrix}$$
$$x = \begin{pmatrix} x_1 & \tilde{x}^T\end{pmatrix}^T$$
where $A_1$ has one column and $x_1$ is a scalar.

The objective function is then 
$$f_\lambda(x) = \|b - A_1x_1 - \tilde{A}\tilde{x}\|_2^2 + \lambda x_1^2 = \|b - \tilde{A}\tilde{x}\|_2^2 + \|A_1x_1\|_2^2 - 2x_1A_1^T(b - \tilde{A}\tilde{x}) + \lambda x_1^2$$
Now, we take the partial derivatives w.r.t. $\tilde{x}$ and $x_1$, respectively, 
$$\frac{\partial f_\lambda(x)}{\partial\tilde{x}} = -2\tilde{A}^T(b - \tilde{A}\tilde{x}) + 2\tilde{A}^TA_1x_1\tag{4}$$
$$\frac{\partial f_\lambda(x)}{\partial x_1} = 2A_1^TA_1x_1 - 2A_1^T(b - \tilde{A}\tilde{x}) + 2\lambda x_1\tag{5}$$
Setting (5) equal to zero, we get 
$$x_1 = (A_1^TA_1 + \lambda)^{-1}(A_1^Tb - A_1^T\tilde{A}\tilde{x}):= a(A_1^Tb - A_1^T\tilde{A}\tilde{x})$$
Substituting this into (4) and setting (4) equal to zero yields 
$$\tilde{A}^T(I - aA_1A_1^T)\tilde{A}\tilde{x} = \tilde{A}^T(I - aA_1A_1^T)b$$

Solving this linear system gives $\tilde{x}$, and substituting $\tilde{x}$ back into the expression for $x_1$ gives $x_1$. 
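The elimination can be compared with a direct solve of the full stationarity system $(A^TA + \lambda e_1e_1^T)x = A^Tb$, which penalizes only the first coordinate. The sketch below uses arbitrary random data and `λ = 2.0`; the variable names are chosen for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(4)
A = randn(8, 4)                          # arbitrary data (assumption)
b = randn(8)
λ = 2.0

A1, At = A[:, 1], A[:, 2:end]            # first column A₁ and remaining columns Ã
a = 1 / (dot(A1, A1) + λ)                # scalar a = (A₁ᵀA₁ + λ)⁻¹
M = I - a * A1 * A1'                     # the matrix I - a A₁A₁ᵀ

xt = (At' * M * At) \ (At' * M * b)      # reduced system for x̃
x1 = a * dot(A1, b - At * xt)            # back-substitution for x₁

# Direct solve with the penalty on the first coordinate only.
D = Diagonal([λ; zeros(3)])
x_direct = (A' * A + D) \ (A' * b)
gap = norm([x1; xt] - x_direct)
```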

# Problem 3. Thinking about constraints

Substituting $y = 5x + 2$ into the function, we get 
$$f(x, y) = x^2 + 2(5x + 2)^2 = 51x^2 + 40x + 8 = 51\left(x + \frac{20}{51}\right)^2 + \frac{8}{51}\ge\frac{8}{51}$$
The function is minimized at $\left(-\frac{20}{51}, \frac{2}{51}\right)$, with minimum value $8 / 51$. 
This is the global minimizer over the constraint, because every other point on the line $y = 5x + 2$ gives a strictly larger value.

The gradient at $\left(-\frac{20}{51}, \frac{2}{51}\right)$ is 
$$\left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)\bigg|_{(x, y) = \left(-\frac{20}{51}, \frac{2}{51}\right)} = (2x, 4y)\bigg|_{(x, y) = \left(-\frac{20}{51}, \frac{2}{51}\right)} = \left(-\frac{40}{51}, \frac{8}{51}\right).$$

Without the constraint, the minimizer would be $(0, 0)$, where the gradient is also $(0, 0)$. So the difference lies in the gradient: under a constraint, the gradient at the minimizer need not be zero.
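One way to see why a nonzero gradient is still compatible with optimality is the Lagrange condition. Writing the constraint as $g(x, y) = 5x - y + 2 = 0$ (the names $g$ and $\mu$ are introduced here only for illustration), the gradient of $f$ at the constrained minimizer is a multiple of $\nabla g$:
$$\nabla g = (5, -1),\qquad \nabla f\left(-\frac{20}{51}, \frac{2}{51}\right) = \left(-\frac{40}{51}, \frac{8}{51}\right) = -\frac{8}{51}(5, -1) = \mu\nabla g,\quad \mu = -\frac{8}{51}$$
Equivalently, the gradient has no component along the direction $(1, 5)$ of the line:
$$\left(-\frac{40}{51}\right)\cdot 1 + \frac{8}{51}\cdot 5 = 0$$
so moving along the constraint does not change $f$ to first order.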

# Problem 4. Alternate formulations of Least Squares



## 4.1 & 4.2 & 4.3

We can rewrite the optimization problem, with the residual $r$ as an additional variable, as 
$$\min_{y, r}\frac{1}{2}\|b - Cy\|_2^2\quad\text{s.t. }Cy = b - r$$
The Lagrangian is 
$$\mathcal{L}(y, r; \lambda) = \frac{1}{2}\|b - Cy\|_2^2 - \lambda^T(Cy - b + r)$$
Setting its partial derivatives to zero, we get 
$$\frac{\partial\mathcal{L}}{\partial y} = -C^T(b - Cy) - C^T\lambda = 0$$
$$\frac{\partial\mathcal{L}}{\partial r} = -\lambda = 0$$
$$\frac{\partial\mathcal{L}}{\partial\lambda} = -(Cy + r - b) = 0$$
The first and third conditions give the augmented system 
$$\begin{pmatrix}C^TC & C^T\\C & 0\end{pmatrix}\begin{pmatrix}y\\-\lambda\end{pmatrix} = \begin{pmatrix}C^Tb\\ b - r\end{pmatrix}\tag{6}$$
Since rank$(C^TC) = \text{rank}(C) = n$, we know $C^TC$ is invertible. Let 
$$A = \begin{pmatrix}C^TC & C^T & C^Tb\\C & 0 & b - r\end{pmatrix}$$
be the augmented matrix of the linear system.

Left-multiplying the first block row of $A$ by $-C(C^TC)^{-1}$ and adding it to the second block row, we get 
$$A_1 = \begin{pmatrix} C^TC & C^T & C^Tb\\0 & -C(C^TC)^{-1}C^T & b - r - C(C^TC)^{-1}C^Tb\end{pmatrix}$$
Then, left-multiplying the second block row of $A_1$ by $C^T$ and adding it to the first block row, and using $C^TC(C^TC)^{-1}C^T = C^T$, we get the block-diagonal matrix
$$A_2 = \begin{pmatrix}C^TC & 0 & C^Tb - C^Tr\\0 & -C(C^TC)^{-1}C^T & b - r - C(C^TC)^{-1}C^Tb\end{pmatrix}$$
By the theory of linear equations, equation (6) has the same solutions as 
$$\begin{pmatrix} C^TC & 0\\ 0 & -C(C^TC)^{-1}C^T\end{pmatrix}\begin{pmatrix}y\\-\lambda\end{pmatrix} = \begin{pmatrix}C^Tb - C^Tr\\b - r - C(C^TC)^{-1}C^Tb\end{pmatrix}$$
The second block row reads $C(C^TC)^{-1}C^T\lambda = b - r - C(C^TC)^{-1}C^Tb$. Since $\lambda = 0$ by stationarity in $r$, the residual is $r = b - C(C^TC)^{-1}C^Tb$, and left-multiplying by $C^T$ gives 
$$C^Tr = C^Tb - C^TC(C^TC)^{-1}C^Tb = 0$$
The first block row then reduces to 
$$C^TCy = C^Tb - C^Tr = C^Tb$$
which is exactly the normal equations.
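As a numerical cross-check, the sketch below solves the normal equations for random data, forms the residual $r = b - Cy$, and verifies that $C^Tr = 0$ and that the block system (6) holds. It assumes the multiplier is $\lambda = 0$ (stationarity of the Lagrangian in $r$); the sizes and data are arbitrary.

```julia
using LinearAlgebra, Random

Random.seed!(5)
C = randn(9, 3)                  # full column rank with probability one (assumption)
b = randn(9)

CtC = C' * C
Ct = Matrix(C')
y = CtC \ (Ct * b)               # solution of the normal equations
r = b - C * y                    # residual, so Cy = b - r holds by construction
λ = zeros(9)                     # multiplier, assumed zero (stationarity in r)

# Block system (6): [CᵀC Cᵀ; C 0] * [y; -λ] = [Cᵀb; b - r]
K = [CtC Ct; C zeros(9, 9)]
lhs = K * [y; -λ]
rhs = [Ct * b; b - r]

orth = norm(Ct * r)              # residual should be orthogonal to the columns of C
gap = norm(lhs - rhs)
```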

Honestly, I don't see any advantage...