# Linear Algebra and Optimisation

###### COMP4670/8600 - Introduction to Statistical Machine Learning - Week 1

In this lab we will practice minimising a cost function with gradient descent.

### Assumed knowledge
- Linear algebra (see Sam Roweis' notes, linked below, for matrix calculus tips)
- Python programming
- Preferably: Using numpy for matrix calculations (precourse material)

### After this lab, you should be comfortable with:
- Using numpy ndarrays for matrix calculations
- Using scipy.optimise routines to minimise a cost function, with and without a gradient
- Randomly generating input values for testing

## Pre-lab notes
In this lab, you will apply linear algebra to to minimise a cost function in three steps: implementing the cost function, implementing a gradient function, and applying gradient descent. We will be doing this to solve problems throughout the course.

As in all labs, feel free to skip questions if you get stuck, and ask your tutor if you have any questions!

A note on style: in this course we emphasise *functional decomposition* in code style. Avoid using global variables, and remember that often splitting code off into separate functions can make it more readable and testable. (Jupyter notebooks let you call functions defined in previous cells.)

$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$
$\newcommand{\Norm}[1]{\lVert#1\rVert}$
$\newcommand{\RR}{\mathbb{R}}$
$\newcommand{\inner}[2]{\langle #1, #2 \rangle}$
$\newcommand{\DD}{\mathscr{D}}$
$\newcommand{\grad}[1]{\operatorname{grad}#1}$
$\DeclareMathOperator*{\argmin}{arg\,min}$

Setting up python environment (this cell contains Latex macros).

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.optimize as opt
import time

%matplotlib inline

A *cost function* or *loss function* is the function we want to minimize in a given problem. For example, it might measure the error between the indicator values our models predicts and the true values for the training data. In this lab, we consider a toy example. We will define a cost function $f(X)$ where $X$ is a $n\times p$ matrix.

If $A$ is a square matrix, then we write $\trace{A}$ for its trace. Let $ \Norm{A}_F = \sqrt{\trace{A^T A}} $, the *Frobenius norm* of a matrix.

Let our cost function $f(X)$ be defined for $n\times p$ matrices $X$ as follows. Let $C$ be a fixed symmetric $n\times n$ matrix (so $C = C^T$). Let $\mu$ be a scalar that is larger than the $p^{th}$ smallest eigenvalue of $C$. Let $N$ be a diagonal $p\times p$ matrix with distinct positive entries on the diagonal.

The cost function is defined as
\begin{equation}
  f(X) = \frac{1}{2} \trace{X^T C X N} + \mu \frac{1}{4} \Norm{N - X^T X}^2_F
\end{equation}
where $ X \in \RR^{n \times p} $, $ n \ge p $.

## Frobenious Norm

Implement a Python function ```frobenius_norm``` which accepts an arbitrary matrix $ A $ and returns
$ \Norm{A}_F $ using the formula given. (Use ```numpy.trace``` and ```numpy.sqrt```.) We represent matrices and vectors as numpy ndarrays.
1. Given a matrix $ A \in \RR^{n \times p} $, what is the complexity of your implementation of ```frobenius_norm```
using the formula above?
2. Can you come up with a faster implementation, if you were additionally told that $ p \ge n $ ?

Extension: Can you find an even faster implementation than in 1. and 2.? 

### <span style="color:blue">Answer</span>

1. Matrix transpose complexity = $O(n \times p)$  
   Matrix multiplication complexity = $O(n^2 \times p)$  
   Trace complexity = $O(n)$  
   Total complexity = $O(n^2 \times p + n \times p + n) \approx O(n^2 \times p) $

2. 

In [2]:
# replace this with your solution, add and remove code and markdown cells as appropriate
def frobenius_norm(A):
    return np.sqrt(np.trace(A.T@A))


## Implementing the cost function

Write a Python function, ```cost_function_for_matrix```, which implements the function $f(X)$ defined above.

Hint: What should the arguments to this function be?

In [3]:
# replace this with your solution, add and remove code and markdown cells as appropriate
def cost_function_for_matrix(X, C, N, mu):
    return (1/2)*np.trace(X.T@(C@(X@N))) + (mu/4)*(frobenius_norm(N-X.T@X)**2)
    

## Cost function with vector argument

The standard optimisation functions we will be using work only for cost functions that take a vector as the varying argument. Write a new function, ```cost_function_for_vector```, that takes $X$ represented as a vector of length $np$ rather than a matrix of dimensions $n\times p$. What arguments will this function take?

In [4]:
def cost_function_for_vector(Xvec, C, N, mu):
    return cost_function_for_matrix(Xvec.reshape(C.shape[0], N.shape[0]),
                                   C, N, mu)

## Minimising the cost function

At this point, we have two main choices in how we minimise the cost function using gradient descent functions from ``scipy.optimize``. First, we can use ``fmin``, which takes a function to minimize and an initial value. Second, we can use ``fmin_bfgs``, which takes an additional argument: the gradient of the function. As a result, (we would expect to find that) ``fmin_bfgs`` has substantially faster convergence.

### Minimizing with ```fmin```

Implement a function ```minimise_f_using_fmin``` that, for given values of $C$, $N$ and $\mu$, finds the matrix $X$ that minimizes $f(X)$ using ``fmin``. You will likely need [the docs for ``fmin``](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin.html). Check if your function converges for some (randomly generated) values of $C$, $N$ and $\mu$.

Summary of the docs: if you have a cost function $g(x, y)$ with a fixed value of $y$ and wish to find the value of $x$ that minimizes it, the syntax for calling ``fmin`` would be ``fmin(g, x0, args=(y))`` where ``x0`` is an initial guess for the value of $x$, and ``args=(y)`` gives ``fmin`` the rest of the values to pass to the cost function. Note that it is necessary that the variable that can change is the first argument to the cost function.

In [5]:
# replace this with your solution, add and remove code and markdown cells as appropriate
def minimize_f_using_fmin(Xg, C, N, mu):
    return opt.fmin(cost_function_for_vector, Xg, 
                    args=(C, N, mu), full_output=1)


In [6]:
# Testing fmin optimization

np.random.seed(100)
n = 3
p = 2

C = np.random.rand(n,n)
C = C+C.T
N = np.diag(np.random.rand(p))
mu = frobenius_norm(C)
Xg = np.random.rand(n*p)

fmin = minimize_f_using_fmin(Xg,C,N,mu)
print("Optimal X:\n", fmin[0].reshape(n,p))
print("Min function value:", fmin[1])
print("Number of iterations:", fmin[3])


Optimization terminated successfully.
         Current function value: -0.463333
         Iterations: 761
         Function evaluations: 1176
Optimal X:
 [[-0.55754081  0.20110317]
 [ 0.22286015 -0.8939434 ]
 [ 0.5531214   0.56289308]]
Min function value: -0.46333302519264374
Number of iterations: 1176


### Calculating the gradient of the cost function

To use ``fmin_bfgs``, which is substantially more time efficient, we need to compute the gradient of $f(X)$ with respect to $X$. Calculate this gradient, then implement a function to calculate it. You may want to use Sam Roweis' [Matrix Identities](https://cs.nyu.edu/~roweis/notes/matrixid.pdf) and/or the [Matrix Cookbook](https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf) as a reference for matrix calculus. As our cost function uses its main argument $X$ represented as a vector, also implement a function ```gradient_for_vector``` which returns the gradient represented as a vector.

### <span style="color:blue">Answer</span>

Cost function $f(X)$ can be rewritten as:
\begin{align*}
    f(X) &= \frac{1}{2} \trace{X^T C X N} + \frac{\mu}{4} \Norm{N - X^T X}^2_F \\
    &= \frac{1}{2} \trace{X^T C X N} + \frac{\mu}{4} \big( \trace{N^\top N} - \trace{N^\top X^\top X} - \trace{X^\top X N} + \trace{X^\top XX^\top X} \big) \\
    &= \frac{1}{2} \trace{X^T C X N} + \frac{\mu}{4} \big( \trace{N^\top N} - \trace{(N^\top X^\top X)^\top} - \trace{X^\top X N} + \trace{X^\top XX^\top X} \big) \\
    &= \frac{1}{2} \trace{X^T C X N} + \frac{\mu}{4} \big( \trace{N^\top N} - \trace{X^\top X N} - \trace{X^\top X N} + \trace{X^\top XX^\top X} \big)
\end{align*}
Thus,
\begin{align*}
    \nabla_{X} f &= \frac{1}{2} (CXN + C^\top X N^\top) + \frac{\mu}{4} \big( - (XN + XN^\top) - (XN + XN^\top ) + 4XX^\top X \big) \\
    &= \frac{1}{2} (CXN + C X N) + \frac{\mu}{4} \big( - (XN + XN) - (XN + XN) + 4XX^\top X \big) \\
    &= CXN + \frac{\mu}{4} (-4 XN + 4XX^\top X) \\
    &= CXN - \mu X(N - X^\top X)
\end{align*}

In [7]:
def gradient_for_vector(X, C, N, mu):
    X = X.reshape(C.shape[0], N.shape[0]) #reshape X to matrix
    return (C@X@N - mu * (X@(N - X.T @ X))).reshape(C.shape[0]*N.shape[0],) #reshape back tp vector


### Minimizing the cost function using the gradient

Write a function ```minimise_f_using_fmin_bfgs``` to minimise $f(X)$ using ```fmin_bfgs```. Have a look at the docs to find the correct syntax. Again, have a try of your function to check that it converges.

* bfgs documentation: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_bfgs.html

In [8]:
def minimize_f_using_bfgs(Xg, C, N, mu):
    return opt.fmin_bfgs(cost_function_for_vector, Xg,
                         fprime=gradient_for_vector, 
                         args=(C,N,mu), 
                         full_output=1)

In [9]:
# Testing fmin bfgs optimization

np.random.seed(100)
n = 3
p = 2

C = np.random.rand(n,n)
C = C+C.T
N = np.diag(np.random.rand(p))
mu = frobenius_norm(C)
Xg = np.random.rand(n*p)


fmin_bfgs = minimize_f_using_bfgs(Xg, C, N, mu)
optX_bfgs = fmin_bfgs[0].reshape(n,p)
print("Optimal X:\n", optX_bfgs)
print("Min function value:", fmin_bfgs[1])
print("Number of function calls:", fmin_bfgs[4])

Optimization terminated successfully.
         Current function value: -0.463333
         Iterations: 27
         Function evaluations: 34
         Gradient evaluations: 34
Optimal X:
 [[ 0.55753844  0.20111853]
 [-0.22287788 -0.89393062]
 [-0.55314803  0.56290324]]
Min function value: -0.46333302750197697
Number of function calls: 34


## Time for convergence

We wish to check whether ``fmin_bfgs`` is actually faster than ``fmin``.

First, we need a way of randomly generating input parameters for our cost function.

### Construction of a random matrix $C$ with given eigenvalues

A diagonal matrix has the nice property that the eigenvalues can be directly read off
the diagonal.

* Given a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues, 
how many different diagonal matrices have the same set of eigenvalues?

* Given a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues,
how many different matrices have the same set of eigenvalues?

Given a set of $ n $ distinct real eigenvalues $ \mathcal{E} = \{e_1, \dots, e_n \} $, 
write a Python function ```random_matrix_from_eigenvalues``` which takes a list of
eigenvalues $ E $ and returns a random symmetric matrix $ C $ having the same eigenvalues.

### <span style="color:blue">Answer</span>

* For a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues, there are $n!-1$ diagonal matrices with the same set of eigenvalues.
* For a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues, there are infinitely many different matrices with the same set of eigenvalues.


In [10]:
# replace this with your solution, add and remove code and markdown cells as appropriate
def random_matrix_from_eigenvalues(e):
    n = len(e)
    A = np.random.rand(n,n)
    Q,R = np.linalg.qr(A)
    D = np.diag(e)
    C = Q@D@Q.T
#     print("C:\n",C)
#     print("Generated eigenvalues:",e)
#     print("Eigenvalues of C:",np.linalg.eigvals(C))
    return C

#### Note:
Above function generates random matrix $C$ is generated using matrix equation $C = QDQ^{\top}$.

We start with $C = ADA^{-1}$ where $A$ is an invertible matrix, and diagonal elements of $D$ are eigenvalues of $C$ and $A$.  

A random orthogonal matrix $Q$ is then generated using QR decomposition, and then assigned to $A:=Q$, and because of orthogonality of columns in $Q$, then $Q^{-1}=Q^{\top}$, therefore  $C = QDQ^{-1} = QDQ^{\top}$

References:  
* https://www.mapleprimes.com/questions/40024-Creating-A-Matrix-By-Setting-Eigenvalues  
* QR decomposition

### Checking convergence time

Is ``fmin_bfgs`` actually faster than ``fmin``? Write some code to find out, using ```time.clock()```.

Make sure to check this for relatively small and relatively large values of $n$ and $p$. Use ``random_matrix_from_eigenvalues`` to generate your $C$ parameter.

In [13]:
# Solution
np.random.seed(1)
def initialise_low_dimensional_data():
    """Initialise the data, low dimensions"""
    n = 3
    p = 2
    mu = 2.7

    N = np.matrix(np.diag([2.5, 1.5]))
    E = [1, 2, 3]
    C = random_matrix_from_eigenvalues(E)
    X0 = np.random.rand(n*p)

    return C, N, mu, n, p, X0


def initialise_higher_dimensional_data():
    """Initialise the data, higher dimensions"""
    n  = 20
    p  =  5
    mu = p + 0.5

    N = np.matrix(np.diag(np.arange(p, 0, -1)))
    E = np.arange(1, n+1)
    C = random_matrix_from_eigenvalues(E)
    X0 = np.random.rand(n*p)

    return C, N, mu, n, p, X0

def pretty_printing(task_string):
    line_length  = 76
    spaces       = 2
    left_padding = (line_length - len(task_string)) // 2
    right_padding = line_length - left_padding - len(task_string)
    print("=" * line_length)
    print("=" * (left_padding - spaces) + " " * spaces + task_string + \
            " " * spaces + "=" * (right_padding - spaces))
    print("=" * line_length)    

def run_and_time_all_tests():
    """Run all test and time them using a list of function names"""
    List_of_Test_Names = ["minimize_f_using_fmin",
                 "minimize_f_using_bfgs"]

    List_of_Initialisations = ["initialise_low_dimensional_data",
                               "initialise_higher_dimensional_data"]

    for test_name in List_of_Test_Names:
        for init_routine in List_of_Initialisations:
            task_string  = test_name + "(" + init_routine + ")"
            pretty_printing(task_string)

            start = time.clock()
            C, N, mu, n, p, X0 = globals()[init_routine]()
            exec(test_name+"(X0,C,N,mu)")
            run_time = time.clock() - start
            print("run_time :", run_time)

run_and_time_all_tests()

Optimization terminated successfully.
         Current function value: 3.962963
         Iterations: 439
         Function evaluations: 681
run_time : 0.11900199999999961
run_time : 6.001239999999999


ValueError: shapes (6,6) and (1,6) not aligned: 6 (dim 1) != 1 (dim 0)

In [11]:
n = 4
p = 3

# np.random.seed(100)
Xg = np.random.rand(n*p)
N = np.diag(np.random.rand(p))
e = np.random.rand(n)
C = random_matrix_from_eigenvalues(e)
mu = np.amin(e[:(len(e)-p)])

print("\nStart bfgs...")
start = time.clock()
fmin_bfgs = minimize_f_using_bfgs(Xg, C, N, mu)
bfgs_time = time.clock() - start
print("bfgs time:", bfgs_time)
print("Min function value w/ bfgs:", fmin_bfgs[1])

print("\nStart fmin...")
start = time.clock()
fmin = minimize_f_using_fmin(Xg,C,N,mu)
fmin_time = time.clock() - start
print("fmin time:", fmin_time)
print("Min function value w/ fmin:", fmin[1])



Start bfgs...
Optimization terminated successfully.
         Current function value: 0.014929
         Iterations: 99
         Function evaluations: 104
         Gradient evaluations: 104
bfgs time: 0.03460900000000011
Min function value w/ bfgs: 0.014928879562503591

Start fmin...
fmin time: 0.3741439999999998
Min function value w/ fmin: 0.014983180890579224


## Minima of $f(X)$

Compare the columns $x_1,\dots, x_p$ of the matrix $X^\star$ which minimises $ f(X) $ 
\begin{equation}
  X^\star = \argmin_{X \in \RR^{n \times p}} f(X)
\end{equation}

with the eigenvectors related to the smallest eigenvalues of $ C $.

What do you believe this means about $f(X)$?


### <span style="color:blue">Answer</span>

code references:
* get eigenvalues and eigenvectors `numpy.linalg.eig`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eig.html
* eigenvalue-eigenvector sorting: https://stackoverflow.com/questions/8092920/sort-eigenvalues-and-associated-eigenvectors-after-using-numpy-linalg-eig-in-pyt

In [17]:
# Solution
np.random.seed(5)

def pretty_printing(task_string):
    line_length  = 76
    spaces       = 2
    left_padding = (line_length - len(task_string)) // 2
    right_padding = line_length - left_padding - len(task_string)
    print("=" * line_length)
    print("=" * (left_padding - spaces) + " " * spaces + task_string + \
            " " * spaces + "=" * (right_padding - spaces))
    print("=" * line_length)  
    
def normalize_columns(A):
    """Normalise the columns of a 2-D array or matrix to length one
    A - array or matrix (which will be modified)
    """
    if A.ndim != 2:
        raise ValueError("A is not a 2-D array")

    number_of_columns = A.shape[1]
    for i in range(number_of_columns):
        A[:,i] /= np.linalg.norm(A[:,i], ord=2)


def show_results(X_at_min, C):
    """Display the found arg min and compare with eigenvalues of C
    X_at_min -- arguement at minimum found
    C        -- symmetric matrix
    """
    n,p = X_at_min.shape

    normalize_columns(X_at_min)

    # Get the eigenvectors belonging to the smallest eigenvalues
    Eigen_Values, Eigen_Vectors = np.linalg.eig(C)
    Permutation = Eigen_Values.argsort()
    Smallest_Eigenvectors = Eigen_Vectors[:, Permutation[:p]]

    if n < 10:
        print("X_at_min               :\n", X_at_min)
        print()
        print("Smallest_Eigenvectors  :\n", Smallest_Eigenvectors)
        print()
    else:
        Project_into_Eigenvectorspace = \
          Smallest_Eigenvectors * Smallest_Eigenvectors.T * X_at_min
        Normal_Component = X_at_min - Project_into_Eigenvectorspace

        print("norm(Normal_Component)/per entry :", \
            np.linalg.norm(Normal_Component, ord=2) / float(n*p))

def show_comparision(num=3):
    for _ in range(num):
        pretty_printing("Comparing X_at_min and Eigenvalues for random values")
        C, N, mu, n, p, X0 = initialise_low_dimensional_data()
        X_at_min = minimize_f_using_bfgs(X0,C,N,mu)
        show_results(X_at_min, C)

show_comparision()



ValueError: shapes (6,6) and (1,6) not aligned: 6 (dim 1) != 1 (dim 0)

In [13]:
X_opt_bfgs = fmin_bfgs[0]
X_opt_fmin = fmin[0]

X_opt_bfgs = np.reshape(X_opt_bfgs, (n,p))
X_opt_fmin = np.reshape(X_opt_fmin, (n,p))

# normalize columns of optimized X
for i in range(X_opt_bfgs.shape[1]):
    normb = np.linalg.norm(X_opt_bfgs[:,i], ord=1)
    X_opt_bfgs[:,i] = X_opt_bfgs[:,i] / normb
    normf = np.linalg.norm(X_opt_fmin[:,i], ord=1)
    X_opt_fmin[:,i] = X_opt_fmin[:,i] / normf    
    
# get eigenvalues and eigernvectors
pmineigval, pmineigvec = np.linalg.eig(C)
# sort eigenvalues and eigenvectors
permut = pmineigval.argsort()
pmineigval = pmineigval[permut]
pmineigvec = pmineigvec[:,permut]
    
# normalize columns of eigenvectors of C
for i in range(pmineigvec.shape[1]):
    normb = np.linalg.norm(pmineigvec[:,i], ord=1)
    pmineigvec[:,i] = pmineigvec[:,i] / normb
    normf = np.linalg.norm(pmineigvec[:,i], ord=1)
    pmineigvec[:,i] = pmineigvec[:,i] / normf
    
print("bfgs optimized X:")
print(X_opt_bfgs)
print("fmin optimized X:")
print(X_opt_fmin)

print("min eigvenvalues:")
print(pmineigval)
print("corresponding eigenvectors:")
print(np.around(pmineigvec, decimals=4))

bfgs optimized X:
[[-0.43033632  0.47752377 -0.2200242 ]
 [-0.33325752  0.21646084  0.41024105]
 [-0.09914416  0.14742203 -0.22746719]
 [-0.137262    0.15859336 -0.14226756]]
fmin optimized X:
[[ 0.3054756   0.21799689  0.2250257 ]
 [ 0.28563533 -0.41251024 -0.40378771]
 [ 0.08595752  0.22762328  0.22791211]
 [ 0.32293155  0.1418696   0.14327448]]
min eigvenvalues:
[0.03647606 0.10514769 0.38194344 0.89041156]
corresponding eigenvectors:
[[ 0.2202  0.442   0.2568  0.067 ]
 [-0.41    0.3346 -0.1255 -0.097 ]
 [ 0.2275  0.0946 -0.2359 -0.4692]
 [ 0.1423  0.1288 -0.3818  0.3668]]
