# What is an interpolative decomposition?  
Now that we have written our neural network as a string of matrix operations, we want to use this to prune our network.  For the next two tutorials, we are going to take a detour into some numerical linear algebra.  


First, we are going to learn about matrix and vector norms, which give us a way to measure how "big" a matrix (and eventually the error we are adding to our networks) is.   




### What is a norm?  
Note that I am choosing to work with only real-valued matrices and vectors, since those are the only kind we will see with these neural networks.  

A vector norm is a function that assigns a real-valued length to a vector.  If you have taken physics classes, you are most likely familiar with at least one type of vector norm.  We call it the "magnitude".  For our purposes, the magnitude is called the 2-norm.  There are other types of norms, but the 2-norm is the most relevant here.  The 2-norm of a vector $x$ is denoted $||x|| $.  

We will also be using norms with matrices.  Again, there are multiple types of norms that we could use.  We will be using the norm "induced" by the vector 2-norm.  In simple terms, the 2-norm of a matrix is the largest factor by which the matrix can "stretch" any nonzero vector that it is applied to.  

This can be written mathematically as:  
$||A||=\sup_{x\in \mathbb{R}^n,\ x\neq 0} \frac{||Ax||}{||x||}$  

where $Ax$ is the vector we get when we multiply $A$ and $x$. This means that to find $||A||$, we have to use
the vector $x$ which maximizes $\frac{||Ax||}{||x||}$.

SciPy has a simple function to compute the 2-norm of a matrix.  

We see an example below.  Because the identity matrix cannot stretch or shrink any vector--every vector it is applied to is the same before and after application of the identity matrix--it has a 2-norm of 1.  

In [None]:
import scipy.linalg
import numpy as np
mat=np.eye(3)
print(mat)
scipy.linalg.norm(mat, ord=2)

However, if we change this matrix so that it is no longer the identity matrix, it will stretch a vector and the 2-norm will not be 1 anymore.


In [None]:
mat[0,0]=3
print("A")
print(mat)
vec=np.array([1,1,1])
print("||Ax||/||x||:")
print(scipy.linalg.norm(mat@vec, ord=2)/scipy.linalg.norm(vec, ord=2))

In the example above, we did not choose the vector $x$ which would be stretched the most by the matrix $A$ (which is the vector that maximizes $\frac{||Ax||}{||x||}$), and so the number on the last output line is not the norm.  

The vector that would be stretched the most (while having its own vector norm as 1) is called a right singular vector (particularly, the singular vector which corresponds to the largest singular value). This is the vector we would have to find to calculate $||A||$. The amount that it is stretched by is called the largest singular value. This is the basis of the singular value decomposition.   

For our purposes, though, it is sufficient to understand how the norm works and to know that the most that a matrix can stretch any vector is by the largest singular value.  For our example matrix $A$, the right singular vector with the largest singular value is $(1,0,0)$.  



In [None]:
vec=np.array([1,0,0])
print("||Ax||/||x||:")
print(scipy.linalg.norm(mat@vec, ord=2)/scipy.linalg.norm(vec, ord=2))

So 3.0 is the 2-norm of our matrix $A$.  
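As a rough sanity check, the sketch below compares the 2-norm with the largest singular value and with the stretch ratio of many random vectors. The diagonal matrix is the same example as above; the variable names, the random seed, and the number of sample vectors are arbitrary choices made for illustration.

```python
import numpy as np
import scipy.linalg

# Same example matrix as above: the identity with the (0,0) entry set to 3.
A = np.diag([3.0, 1.0, 1.0])

# Singular values are returned in descending order, so sigma[0] is the largest.
sigma = scipy.linalg.svdvals(A)
two_norm = scipy.linalg.norm(A, ord=2)

# Stretch ratio ||Ax||/||x|| for 1000 random vectors (one per column of X).
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 1000))
ratios = np.linalg.norm(A @ X, axis=0) / np.linalg.norm(X, axis=0)

print("largest singular value:", sigma[0])
print("2-norm:", two_norm)
print("largest ratio among random vectors:", ratios.max())
```

No random vector should be stretched by more than the 2-norm, and the ratios only approach it when a vector points nearly along $(1,0,0)$.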

There are a few other things which are important to know about norms, primarily that the matrix 2-norm is submultiplicative.  This means that when multiplying matrices, 

$||AB|| \leq ||A||\ ||B|| $

and  

$||A+B|| \leq ||A||+||B|| $

the latter of which is called the triangle inequality.


Both inequalities are proved in most numerical linear algebra textbooks.  
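The two inequalities can also be checked numerically. The sketch below does this for a pair of random $4 \times 4$ matrices; the sizes, the seed, and the helper name `two_norm` are assumptions made for this example, and a numerical check on one pair of matrices is of course not a proof.

```python
import numpy as np
import scipy.linalg

def two_norm(M):
    """Induced 2-norm (largest singular value) of a matrix."""
    return scipy.linalg.norm(M, ord=2)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# Submultiplicativity: ||AB|| <= ||A|| ||B||
product_norm = two_norm(A @ B)
product_bound = two_norm(A) * two_norm(B)

# Triangle inequality: ||A+B|| <= ||A|| + ||B||
sum_norm = two_norm(A + B)
sum_bound = two_norm(A) + two_norm(B)

print("||AB||  =", product_norm, "<=", product_bound)
print("||A+B|| =", sum_norm, "<=", sum_bound)
```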

### The interpolative decomposition 

Oftentimes matrices need to be made smaller. If we think of a matrix as a collection of columns, this means selecting only a few columns to explicitly keep and setting up a way to figure out (or approximate) the unselected columns of the matrix later.

An interpolative decomposition is one way to do this. We select a set of columns and then use linear combinations of the selected columns to represent the columns that we left out. That way we only have a few of the matrix's columns 'stored', but can still access all of them (approximately). 


Our goal is to approximate $A$ as $A_{:,I} T$, where $A_{:,I}$ is a matrix made up of a subset of the columns of $A$ and $I$ is the set of indices of the selected columns. $T$ is the interpolation matrix, which reproduces the columns of $A$ that were kept (apart from re-arranging them into the right order) and uses linear combinations of those columns to approximate the columns that were left out.  

If our interpolative decomposition is successful, we would like the entries of $T$ to be reasonably small (less than 2 in absolute value), and for the difference between our interpolated matrix and $A$ to be small as well (relative to the size of $A$):   

$||A-A_{:,I} T|| \leq \epsilon ||A||$

This is called an $\epsilon$-accurate interpolative decomposition.  Obviously, we would like $\epsilon$ to be small, since that means our approximation is relatively close to the matrix we are approximating.  We would also like to use the smallest number of columns possible to achieve this accuracy--that is, make $A_{:,I}$ as small as we can.  
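To make the notation concrete, here is a minimal sketch of an interpolative decomposition built by hand. It assumes we already know which columns to keep (here `I = [0, 1]`, chosen because the test matrix is constructed so that its last two columns are nearly linear combinations of the first two) and it fits $T$ by ordinary least squares. This is only an illustration of the definition, not the column-selection method covered in the next tutorial.

```python
import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)

# Build a 5x4 matrix whose last two columns are (almost) linear
# combinations of the first two, plus a small amount of noise.
c0 = rng.standard_normal(5)
c1 = rng.standard_normal(5)
noise = 1e-3 * rng.standard_normal((5, 2))
A = np.column_stack([c0, c1, 0.5 * c0 + 0.5 * c1, c0 - c1])
A[:, 2:] += noise

# Assumed choice of columns to keep (picked by hand for this example).
I = [0, 1]
A_I = A[:, I]

# Interpolation matrix: least-squares solution of A_I @ T = A.
T, *_ = np.linalg.lstsq(A_I, A, rcond=None)

# Relative error ||A - A_I T|| / ||A|| in the 2-norm.
rel_err = scipy.linalg.norm(A - A_I @ T, ord=2) / scipy.linalg.norm(A, ord=2)

print("T =")
print(np.round(T, 3))
print("largest |T| entry:", np.abs(T).max())
print("relative error:", rel_err)
```

The first two columns of `T` form an identity block, since the kept columns are reproduced exactly, and the remaining columns hold the interpolation coefficients. The relative error plays the role of $\epsilon$ in the inequality above.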

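This wishlist (small entries in $T$, a small error, and as few columns as possible) is not easy to fulfill.  Finding the optimal set of columns and interpolation matrix is hard in practice.  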

In the next tutorial, we will show one method which gives us a good interpolative decomposition in practice most of the time.  