***

*Course:* [Math 535](https://people.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS)  
*Chapter:* 3-Singular value decomposition   
*Author:* [Sebastien Roch](https://people.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison  
*Updated:* Jan 5, 2024   
*Copyright:* &copy; 2024 Sebastien Roch

***

In [None]:
# IF RUNNING ON GOOGLE COLAB, UNCOMMENT THE FOLLOWING CODE CELL
# When prompted, upload: 
#     * mmids.py
#     * h3n2-snp.csv
#     * h3n2-other.csv 
#     * advertising.csv 
# from your local file system
# Files at: https://github.com/MMiDS-textbook/MMiDS-textbook.github.io/tree/main/utils
# Alternative instructions: https://colab.research.google.com/notebooks/io.ipynb

In [None]:
# from google.colab import files
#
# uploaded = files.upload()
#
# for fn in uploaded.keys():
#     print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

In [None]:
# PYTHON 3
import numpy as np
from numpy import linalg as LA
from numpy.random import default_rng
rng = default_rng(535)
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx
import mmids

## Motivating example: visualizing genetic data

We consider an application of dimensionality reduction in biology. We will look at SNP data from viruses. A little background first. From [Wikipedia](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism):

> A single-nucleotide polymorphism (SNP; /snɪp/; plural /snɪps/) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population. For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations – C or A – are said to be the alleles for this specific position.

Quoting [Jombart et al., BMC Genetics (2010)](https://bmcgenet.biomedcentral.com/articles/10.1186/1471-2156-11-94), we analyze:

> the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter. These genetic changes also force influenza vaccines to be updated on a yearly basis. [...] Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next.

Some details about the Jombart et al. dataset:

> For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics. Because of the temporal lag between influenza epidemics in the two hemispheres, and given the fact that most available sequences were sampled in the northern hemisphere, we restricted our analysis to strains from the northern hemisphere (latitudes above 23.4°north). The final dataset included 1903 strains characterized by 125 SNPs which resulted in a total of 334 alleles. All strains from 2001 to 2007 were classified into six winter epidemics (2001-2006). This was done by assigning all strains from the second half of the year with those from the first half of the following year. For example, the 2005 winter epidemic comprises all strains collected between the 1st of July 2005 and the 30th of June 2006.

We load a dataset, which contains a subset of strains from the dataset mentioned above.

In [None]:
df = pd.read_csv('h3n2-snp.csv')

The first five rows are the following.

In [None]:
df.head()

Overall it contains $1642$ strains. 

In [None]:
df.shape[0]

The data lives in a $318$-dimensional space.

In [None]:
df.shape[1]

Obviously, visualizing this data is not straightforward. How can we make sense of it? More specifically, how can we explore any underlying structure it might have? Quoting [Wikipedia](https://en.wikipedia.org/wiki/Exploratory_data_analysis):

> In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. [...] Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.

In this chapter we will encounter an important mathematical technique for dimension reduction, which will allow us to explore this data -- and find interesting structure -- in $2$ (rather than $318$!) dimensions.

## Background: review of spectral decomposition

**NUMERICAL CORNER:** In Python, the eigenvalues and eigenvectors of a matrix can be computed using [`numpy.linalg.eig`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html).

In [None]:
A = np.array([[2.5, -0.5], [-0.5, 2.5]])

In [None]:
w, v = LA.eig(A)
print(w)
print(v)
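
As a quick sanity check, here is a minimal sketch of the spectral decomposition on the same matrix. It uses `numpy.linalg.eigh`, NumPy's routine for symmetric matrices, rather than `eig`; the names `Q` and `A_rebuilt` are ours.

```python
import numpy as np

A = np.array([[2.5, -0.5], [-0.5, 2.5]])

# eigh assumes a symmetric (Hermitian) input; it returns the eigenvalues
# in ascending order and orthonormal eigenvectors as the columns of Q.
w, Q = np.linalg.eigh(A)
print(w)

# Rebuild A from its spectral decomposition Q diag(w) Q^T.
A_rebuilt = Q @ np.diag(w) @ Q.T
print(np.allclose(A, A_rebuilt))

# The eigenvector matrix is orthogonal: Q^T Q = I.
print(np.allclose(Q.T @ Q, np.eye(2)))
```

For a symmetric matrix, `eigh` guarantees real eigenvalues and orthonormal eigenvectors, which is what the spectral decomposition requires.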

$\unlhd$

## Dimension reduction and approximating subspaces

**NUMERICAL CORNER:** In Numpy, the outer product is computed using [`numpy.outer`](https://numpy.org/doc/stable/reference/generated/numpy.outer.html).

In [None]:
u = np.array([0., 2., -1.])
v = np.array([3., -2.])
Z = np.outer(u, v)
print(Z)

In [None]:
print(LA.matrix_rank(Z))
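
To see why the rank is $1$, the following sketch (with our own variable names) writes the same outer product as a column vector times a row vector and checks that every column of the result is a multiple of $\mathbf{u}$.

```python
import numpy as np

u = np.array([0., 2., -1.])
v = np.array([3., -2.])

# np.outer(u, v) has entries Z[i, j] = u[i] * v[j].
Z = np.outer(u, v)

# Same matrix, written as a (3 x 1) column times a (1 x 2) row.
Z_col_row = u[:, np.newaxis] @ v[np.newaxis, :]
print(np.allclose(Z, Z_col_row))

# Column j of Z is v[j] * u, so all columns lie on the line spanned by u.
for j in range(len(v)):
    print(np.allclose(Z[:, j], v[j] * u))
```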

$\unlhd$

## Power iteration

**NUMERICAL CORNER:** We implement the algorithm suggested by the *Power Iteration Lemma*. That is, we compute $B^{k} \mathbf{x}$ with $B = A^T A$, then normalize it. To obtain the corresponding singular value and left singular vector, we use that $\sigma_1 = \|A \mathbf{v}_1\|$ and $\mathbf{u}_1 = A \mathbf{v}_1/\sigma_1$.

In [None]:
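def topsing(A, maxiter=10):
    # random Gaussian starting vector, one entry per column of A
    x = rng.normal(0,1,np.shape(A)[1])
    B = A.T @ A
    # power iteration: after the loop, x is B^maxiter times the starting vector
    for _ in range(maxiter):
        x = B @ x
    # normalize to get the approximate top right singular vector
    v = x / LA.norm(x)
    # sigma_1 = ||A v_1|| and u_1 = A v_1 / sigma_1
    s = LA.norm(A @ v)
    u = A @ v / s
    return u, s, v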
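
The sketch below is a variant of the same idea, not the function above: it normalizes at every iteration (which keeps the iterates from growing very large when many iterations are used) and applies $A^T A$ in two steps. The function name `topsing_normalized`, the seed, and the small test matrix are our own choices. For this matrix, $A^T A$ has eigenvalues $7$ and $3$, so the top singular value is $\sqrt{7}$ and the top right singular vector is $\pm(1,1)/\sqrt{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def topsing_normalized(A, maxiter=100):
    # random Gaussian starting vector
    x = rng.normal(0, 1, A.shape[1])
    for _ in range(maxiter):
        # one power-iteration step with B = A^T A, followed by normalization
        x = A.T @ (A @ x)
        x = x / np.linalg.norm(x)
    s = np.linalg.norm(A @ x)   # sigma_1 = ||A v_1||
    u = A @ x / s               # u_1 = A v_1 / sigma_1
    return u, s, x

A = np.array([[2., 0.], [1., 2.], [0., 1.]])
u1, s1, v1 = topsing_normalized(A)

# Compare with NumPy's SVD (singular vectors are defined up to sign).
s_ref = np.linalg.svd(A, compute_uv=False)
print(s1, s_ref[0])
print(v1)
```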

We will apply it to our previous two-cluster example.

In [None]:
d, n, w = 10, 100, 3.
X1, X2 = mmids.two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1])
plt.show()

Let's compute the top singular vector.

In [None]:
u, s, v = topsing(X)
print(v)

This is approximately $-\mathbf{e}_1$. We get roughly the same answer (possibly up to sign) from Python's [`numpy.linalg.svd`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html) function.

In [None]:
u, s, vh = LA.svd(X)
print(vh.T[:,0])

Recall that, when we applied $k$-means clustering to this example with $d=1000$ dimensions, we obtained a very poor clustering. Let's try again after projecting onto the top singular vector.

In [None]:
d, n, w = 1000, 100, 3.
X1, X2 = mmids.two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1])
plt.show()

In [None]:
assign = mmids.kmeans(X, 2)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()

Let's try again, but after projecting on the top singular vector. Recall that this corresponds to finding the best one-dimensional approximating subspace. The projection can be computed using the truncated SVD $Z= U_{(1)} \Sigma_{(1)} V_{(1)}^T$. We can interpret the rows of $U_{(1)} \Sigma_{(1)}$ as the coefficients of each data point in the basis $\mathbf{v}_1$. We will work in that basis. We need one small hack: because our implementation of $k$-means clustering expects data points in at least $2$ dimensions, we add a column of $0$'s.

In [None]:
u, s, v = topsing(X)
Xproj = np.stack((u*s, np.zeros(np.shape(X)[0])), axis=-1)
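
The interpretation of $U_{(1)} \Sigma_{(1)}$ as coefficients can be checked on a small example. This is a sketch on random data of our own choosing (not the two-cluster dataset), using `numpy.linalg.svd` rather than `topsing`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (6, 4))   # 6 data points in 4 dimensions

U, S, Vh = np.linalg.svd(X, full_matrices=False)
v1 = Vh[0, :]                  # top right singular vector

# Coefficient of each data point along v1, computed two ways.
coeffs = U[:, 0] * S[0]
print(np.allclose(coeffs, X @ v1))

# The rank-one truncation Z = U_(1) Sigma_(1) V_(1)^T projects each row
# of X orthogonally onto the line spanned by v1.
Z = np.outer(coeffs, v1)
print(np.allclose(Z, X @ np.outer(v1, v1)))
```

In other words, `u*s` in the cell above is the inner product of each data point with $\mathbf{v}_1$.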

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(Xproj[:,0], Xproj[:,1])
plt.ylim([-3,3])
plt.show()

There is a small - yet noticeable - gap around 0.

A histogram of the first component of `Xproj` gives a better sense of the density of points.

In [None]:
plt.hist(Xproj[:,0])
plt.show()

We run $k$-means clustering on the projected data.

In [None]:
assign = mmids.kmeans(Xproj, 2)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()

Much better. We will give an explanation of this outcome in an upcoming (optional) subsection. In essence, quoting [BHK, Section 7.5.1]:

> [...] let's understand the central advantage of doing the projection to [the top $k$ right singular vectors]. It is simply that for any reasonable (unknown) clustering of data points, the projection brings data points closer to their cluster centers.

Finally, looking at the top right singular vector (or its first ten entries for lack of space), we see that it does align quite well (but not perfectly) with the first dimension. In the next (optional) section, we try again with the top two singular vectors.

In [None]:
print(v[:10])

$\unlhd$

**NUMERICAL CORNER:** We implement this last algorithm. We will need our previous implementation of *Gram-Schmidt*.

In [None]:
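def svd(A, l, maxiter=100):
    # random Gaussian initialization of l candidate right singular vectors
    V = rng.normal(0,1,(np.size(A,1),l))
    for _ in range(maxiter):
        # apply A^T A in two steps, without forming the product
        W = A @ V
        Z = A.T @ W
        # re-orthonormalize the columns
        V, R = mmids.gramschmidt(Z)
    W = A @ V
    # singular values: column norms of A V; left singular vectors: normalized columns
    S = [LA.norm(W[:, i]) for i in range(np.size(W,1))]
    U = np.stack([W[:,i]/S[i] for i in range(np.size(W,1))],axis=-1)
    return U, S, V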

Note that above we avoided forming the matrix $A^T A$. With a small number of iterations, that approach potentially requires fewer arithmetic operations overall, and it allows us to take advantage of the possible sparsity of $A$ (i.e., the fact that it may have many zeros).
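
To make the operation count concrete, here is a rough sketch. The matrix sizes and the helper names `cost_two_step` and `cost_gram` are hypothetical, the counts are approximate multiply-add estimates for dense matrices, and the Gram-Schmidt step (identical in both variants) is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 1000, 100, 2          # A is n x m, and we track l vectors
A = rng.normal(0, 1, (n, m))
V = rng.normal(0, 1, (m, l))

# Both orderings give the same matrix.
Z_two_step = A.T @ (A @ V)
Z_gram = (A.T @ A) @ V
print(np.allclose(Z_two_step, Z_gram))

def cost_two_step(T):
    # each iteration: A @ V (n*m*l) followed by A.T @ W (n*m*l)
    return 2 * n * m * l * T

def cost_gram(T):
    # form A^T A once (n*m*m), then each iteration costs m*m*l
    return n * m * m + m * m * l * T

for T in [10, 100]:
    print(T, cost_two_step(T), cost_gram(T))
```

With these sizes, the two-step product is cheaper for a small number of iterations, while forming $A^T A$ once pays off when many iterations are run. When $A$ is sparse, the two-step cost drops further, since it scales with the number of nonzero entries.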

We apply it again to our two-cluster example.

In [None]:
d, n, w = 1000, 100, 3.
X1, X2 = mmids.two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)

Let's try again, but after projecting on the top two singular vectors. Recall that this corresponds to finding the best two-dimensional approximating subspace. The projection can be computed using the truncated SVD $Z= U_{(2)} \Sigma_{(2)} V_{(2)}^T$. We can interpret the rows of $U_{(2)} \Sigma_{(2)}$ as the coefficients of each data point in the basis $\mathbf{v}_1,\mathbf{v}_2$. We will work in that basis.

In [None]:
U, S, V = svd(X, 2)
Xproj = np.stack((U[:,0]*S[0], U[:,1]*S[1]), axis=-1)
assign = mmids.kmeans(Xproj, 2)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()

Finally, looking at the first two right singular vectors, we see that the first one does align quite well with the first dimension.

In [None]:
print(np.stack((V[:,0], V[:,1]), axis=-1))

$\unlhd$

We load the dataset again and examine its first rows.

In [None]:
df = pd.read_csv('h3n2-snp.csv')

In [None]:
df.head()

Recall that it contains $1642$ strains and lives in a $318$-dimensional space. 

In [None]:
df.shape

Our goal is to find a "good" low-dimensional representation of the data. Two dimensions will do here. We use the SVD. 

Specifically, we extract a data matrix, run our SVD algorithm with $k=2$, and plot the data in the projected subspace of the first two singular vectors.

In [None]:
A = df.iloc[:, 1:].to_numpy()

In [None]:
U, S, V = svd(A, 2)

In [None]:
plt.scatter(U[:,0]*S[0], U[:,1]*S[1])
plt.show()

There seem to be some reasonably well-defined clusters in this projection. To further reveal the structure, we color the data points by year. That information is in a separate file. 

In [None]:
dfoth = pd.read_csv('h3n2-other.csv')
dfoth.head()

In [None]:
year = dfoth['year'].to_numpy()

We color the points on the scatterplot by year. (We use [`legend_elements()`](https://matplotlib.org/stable/api/collections_api.html#matplotlib.collections.PathCollection.legend_elements) for automatic legend creation.) 

In [None]:
scatter = plt.scatter(U[:,0]*S[0], U[:,1]*S[1], c=year, label=year)
plt.legend(*scatter.legend_elements())
plt.show()

To some extent, one can "see" the virus evolving from year to year. The $y$-axis in particular seems to correlate strongly with the year. 

## Further applications of the SVD

**NUMERICAL CORNER:** In Numpy, the Frobenius norm of a matrix can be computed using the default of the function `numpy.linalg.norm` while the induced norm can be computed using the same function with [`ord` parameter set to `2`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html).

In [None]:
A = np.array([[1., 0.],[0., 1.],[0., 0.]])
print(A)

In [None]:
LA.norm(A)

In [None]:
LA.norm(A, 2)
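
Both norms can also be read off the singular values: the Frobenius norm is the square root of the sum of the squared singular values, and the induced $2$-norm is the largest singular value. A minimal check on the same matrix (the names `fro_from_svd` and `two_from_svd` are ours):

```python
import numpy as np

A = np.array([[1., 0.], [0., 1.], [0., 0.]])

# Singular values only, in decreasing order.
s = np.linalg.svd(A, compute_uv=False)

fro_from_svd = np.sqrt(np.sum(s**2))
two_from_svd = s[0]

print(fro_from_svd, np.linalg.norm(A))
print(two_from_svd, np.linalg.norm(A, 2))
```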

$\unlhd$

**NUMERICAL CORNER:** In Numpy, the pseudoinverse of a matrix can be computed using the function [`numpy.linalg.pinv`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html).

In [None]:
M = np.array([[1.5, 1.3], [1.2, 1.9], [2.1, 0.8]])
print(M)

In [None]:
Mp = LA.pinv(M)
print(Mp)

In [None]:
Mp @ M

Let's try our previous example.

In [None]:
A = np.array([[1., 0.], [-1., 0.]])
print(A)

In [None]:
Ap = LA.pinv(A)
print(Ap)
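
For comparison, the pseudoinverse can be assembled by hand from a compact SVD, $A^+ = V \Sigma^{-1} U^T$, keeping only the nonzero singular values. This is a sketch; the cutoff `tol` used to decide which singular values count as nonzero is a hypothetical choice of ours, not the rule used internally by `numpy.linalg.pinv`.

```python
import numpy as np

A = np.array([[1., 0.], [-1., 0.]])

U, s, Vh = np.linalg.svd(A)

# Keep only the singular values above a small (hypothetical) threshold.
tol = 1e-12
r = int(np.sum(s > tol))

# Compact SVD pieces: first r columns of U and V, first r singular values.
Ap_svd = Vh[:r, :].T @ np.diag(1. / s[:r]) @ U[:, :r].T

print(Ap_svd)
print(np.allclose(Ap_svd, np.linalg.pinv(A)))
```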

$\unlhd$

**NUMERICAL CORNER:** In Numpy, the condition number of a matrix can be computed using the function [`numpy.linalg.cond`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.cond.html).

For example, orthogonal matrices have condition number $1$, the lowest possible value for it (Why?). That indicates that orthogonal matrices have good numerical properties.

In [None]:
q = 1/np.sqrt(2)
Q = np.array([[q, q], [q, -q]])
print(Q)

In [None]:
LA.cond(Q)
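
One way to see why: the ($2$-norm) condition number is the ratio of the largest to the smallest singular value, and all singular values of an orthogonal matrix equal $1$. The following sketch checks this on the same matrix (the name `kappa` is ours).

```python
import numpy as np

q = 1 / np.sqrt(2)
Q = np.array([[q, q], [q, -q]])

# Singular values of an orthogonal matrix are all equal to 1.
s = np.linalg.svd(Q, compute_uv=False)
print(s)

# Condition number as the ratio of extreme singular values.
kappa = s[0] / s[-1]
print(kappa, np.linalg.cond(Q))
```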

In contrast, matrices with nearly linearly dependent columns have large condition numbers.

In [None]:
eps = 1e-6
A = np.array([[q, q], [q, q+eps]])
print(A)

In [None]:
LA.cond(A)

In [None]:
u, s, vh = LA.svd(A)
print(s)

We compute the solution to $A \mathbf{x} = \mathbf{b}$ when $\mathbf{b}$ is the left singular vector of $A$ corresponding to the largest singular value. Recall that in the proof of the *Conditioning of Matrix-Vector Multiplication Theorem*, we showed that the worst-case bound is achieved when $\mathbf{z} = \mathbf{b}$ is the right singular vector of $M= A^{-1}$ corresponding to the lowest singular value. In a previous example, given a matrix $A = \sum_{j=1}^n \sigma_j \mathbf{u}_j \mathbf{v}_j^T$ in compact SVD form, we derived a compact SVD for the inverse as

$$
A^{-1} = \sigma_n^{-1} \mathbf{v}_n \mathbf{u}_n^T + \sigma_{n-1}^{-1} \mathbf{v}_{n-1} \mathbf{u}_{n-1}^T + \cdots + \sigma_1^{-1} \mathbf{v}_1 \mathbf{u}_1^T.
$$

Here, compared to the SVD of $A$, the order of the singular values is reversed and the roles of the left and right singular vectors are exchanged. So we take $\mathbf{b}$ to be the top left singular vector of $A$.

In [None]:
b = u[:,0]
print(b)

In [None]:
x = LA.solve(A,b)
print(x)

We make a small perturbation in the direction of the second left singular vector of $A$. Recall that in the proof of the *Conditioning of Matrix-Vector Multiplication Theorem*, we showed that the worst case is achieved when $\mathbf{d} = \delta\mathbf{b}$ is the top right singular vector of $M = A^{-1}$. By the argument above, that is the left singular vector of $A$ corresponding to the lowest singular value.

In [None]:
delta = 1e-6
bp = b + delta*u[:,1]
print(bp)

The relative change in solution is:

In [None]:
xp = LA.solve(A,bp)
print(xp)

In [None]:
(LA.norm(x-xp)/LA.norm(x))/(LA.norm(b-bp)/LA.norm(b))

Note that this matches the condition number of $A$ (up to rounding error).
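
The compact SVD of $A^{-1}$ used in this example can be checked numerically. The sketch below uses a small invertible matrix of our own choosing rather than the nearly singular one above, so that rounding errors stay negligible.

```python
import numpy as np

A = np.array([[2., 1.], [1., 3.]])

u, s, vh = np.linalg.svd(A)

# A^{-1} = sum_j sigma_j^{-1} v_j u_j^T, i.e. V diag(1/s) U^T.
A_inv_svd = vh.T @ np.diag(1. / s) @ u.T
print(np.allclose(A_inv_svd, np.linalg.inv(A)))

# The singular values of A^{-1} are the reciprocals of those of A,
# listed in reverse order so that they are again decreasing.
s_inv = np.linalg.svd(np.linalg.inv(A), compute_uv=False)
print(np.allclose(s_inv, 1. / s[::-1]))
```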

$\unlhd$

**NUMERICAL CORNER:** We give a quick example.

In [None]:
A = np.array([[1., 101.],[1., 102.],[1., 103.],[1., 104.],[1., 105]])
print(A)

In [None]:
LA.cond(A)

In [None]:
LA.cond(A.T @ A)
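
The two values above are related: when $A$ has linearly independent columns, the singular values of $A^T A$ are the squares of those of $A$, so the condition number of $A^T A$ is the square of that of $A$. A minimal check on the same matrix (the names `kA` and `kATA` are ours):

```python
import numpy as np

A = np.array([[1., 101.], [1., 102.], [1., 103.], [1., 104.], [1., 105.]])

kA = np.linalg.cond(A)
kATA = np.linalg.cond(A.T @ A)
print(kA**2, kATA)

# Same relation at the level of singular values.
sA = np.linalg.svd(A, compute_uv=False)
sATA = np.linalg.svd(A.T @ A, compute_uv=False)
print(sA**2, sATA)
```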

This observation -- and the resulting increased numerical instability -- is one of the reasons we previously developed an alternative approach to the least-squares problem. Quoting [Sol, Section 5.1]:

> Intuitively, a primary reason that $\mathrm{cond}(A^T A)$ can be large is that columns of $A$ might
look “similar” [...] If two columns $\mathbf{a}_i$ and $\mathbf{a}_j$ satisfy $\mathbf{a}_i \approx \mathbf{a}_j$, then the least-squares residual length $\|\mathbf{b} − A \mathbf{x}\|_2$ will not suffer much if we replace multiples of $\mathbf{a}_i$ with multiples of $\mathbf{a}_j$ or vice versa. This wide range of nearly—but not completely—equivalent solutions yields poor conditioning. [...] To solve such poorly conditioned problems, we will employ an alternative technique with closer attention to the column space of $A$ rather than employing row operations as in Gaussian elimination. This strategy identifies and deals with such near-dependencies explicitly, bringing about greater numerical stability.

$\unlhd$

**NUMERICAL CORNER:** Here is a numerical example taken from [[TB](https://books.google.com/books/about/Numerical_Linear_Algebra.html?id=JaPtxOytY7kC), Lecture 19]. We will approximate the following function with a polynomial.

In [None]:
n = 100 
t = np.arange(n)/(n-1)
b = np.exp(np.sin(4 * t))

In [None]:
plt.plot(t, b)
plt.show()

We use a [Vandermonde matrix](https://en.wikipedia.org/wiki/Vandermonde_matrix), which can be constructed using [`numpy.vander`](https://numpy.org/doc/stable/reference/generated/numpy.vander.html), to perform polynomial regression.

In [None]:
m = 17
A = np.vander(t, m, increasing=True)
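
To make the structure of the matrix concrete, here is a tiny example with three points and degree $2$ (our own toy values). With `increasing=True`, row $i$ is $(1, t_i, t_i^2)$, so multiplying by a coefficient vector evaluates the polynomial at each $t_i$.

```python
import numpy as np

t_small = np.array([1., 2., 3.])
V = np.vander(t_small, 3, increasing=True)
print(V)

# Coefficients of p(t) = 1 + 0*t + 2*t^2, lowest degree first.
coeffs = np.array([1., 0., 2.])

# V @ coeffs evaluates p at t = 1, 2, 3.
print(V @ coeffs)
print(1. + 2. * t_small**2)
```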

The condition numbers of $A$ and $A^T A$ are both high in this case.

In [None]:
print(LA.cond(A))

In [None]:
print(LA.cond(A.T @ A))

We first use the normal equations and plot the residual vector.

In [None]:
xNE = LA.solve(A.T @ A, A.T @ b)
print(LA.norm(b - A@xNE))

In [None]:
plt.plot(t, b - A@xNE)
plt.show()

We then use `numpy.linalg.qr` to compute the QR solution instead.

In [None]:
Q, R = LA.qr(A)
xQR = mmids.backsubs(R, Q.T @ b)
print(LA.norm(b - A@xQR)) 
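
For contrast, on a well-conditioned problem the two approaches agree closely. The sketch below fits a cubic (a choice of ours) to the same function on $20$ points. It uses `numpy.linalg.solve` on the triangular factor in place of `mmids.backsubs` so that it runs on its own, and it also compares with `numpy.linalg.lstsq`, NumPy's built-in least-squares routine.

```python
import numpy as np

t = np.linspace(0., 1., 20)
b = np.exp(np.sin(4 * t))
A = np.vander(t, 4, increasing=True)   # cubic fit: low condition number

# Normal equations.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# QR: solve R x = Q^T b (generic solver standing in for back substitution).
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# NumPy's least-squares routine.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.linalg.cond(A))
print(np.linalg.norm(x_ne - x_qr), np.linalg.norm(x_qr - x_ls))
```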

In [None]:
plt.plot(t, b - A@xNE)
plt.plot(t, b - A@xQR)
plt.show()

$\unlhd$