***

*Course:* [Math 535](https://people.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS)  
*Chapter:* 3-Singular value decomposition   
*Author:* [Sebastien Roch](https://people.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison  
*Updated:* Jan 4, 2024   
*Copyright:* &copy; 2024 Sebastien Roch

***

In [None]:
# IF RUNNING ON GOOGLE COLAB, UNCOMMENT THE FOLLOWING CODE CELL
# When prompted, upload: 
#     * mmids.py
#     * h3n2-snp.csv
#     * h3n2-other.csv 
#     * advertising.csv 
# from your local file system
# Files at: https://github.com/MMiDS-textbook/MMiDS-textbook.github.io/tree/main/utils
# Alternative instructions: https://colab.research.google.com/notebooks/io.ipynb

In [None]:
# from google.colab import files

# uploaded = files.upload()

# for fn in uploaded.keys():
#     print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

In [None]:
# PYTHON 3
import numpy as np
from numpy import linalg as LA
from numpy.random import default_rng
rng = default_rng(535)
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx
import mmids

## Motivating example: exploratory data analysis of viral evolution

We consider an application of dimension reduction in biology. We will look at SNP data from viruses. A little background first. From [Wikipedia](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism):

> A single-nucleotide polymorphism (SNP; /snɪp/; plural /snɪps/) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population. For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations – C or A – are said to be the alleles for this specific position.

Quoting [Jombart et al., BMC Genetics (2010)](https://bmcgenet.biomedcentral.com/articles/10.1186/1471-2156-11-94), we analyze:

> the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter. These genetic changes also force influenza vaccines to be updated on a yearly basis. [...] Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next.

Some details about the Jombart et al. dataset:

> For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics. Because of the temporal lag between influenza epidemics in the two hemispheres, and given the fact that most available sequences were sampled in the northern hemisphere, we restricted our analysis to strains from the northern hemisphere (latitudes above 23.4°north). The final dataset included 1903 strains characterized by 125 SNPs which resulted in a total of 334 alleles. All strains from 2001 to 2007 were classified into six winter epidemics (2001-2006). This was done by assigning all strains from the second half of the year with those from the first half of the following year. For example, the 2005 winter epidemic comprises all strains collected between the 1st of July 2005 and the 30th of June 2006.

We load a dataset, which contains a subset of strains from the dataset mentioned above.

In [None]:
df = pd.read_csv('h3n2-snp.csv')

The first five rows are the following.

In [None]:
df.head()

Overall it contains $1642$ strains. 

In [None]:
df.shape[0]

The data lives in a $318$-dimensional space.

In [None]:
df.shape[1]

Obviously, visualizing this data is not straightforward. How can we make sense of it? More specifically, how can we explore any underlying structure it might have? Quoting [Wikipedia](https://en.wikipedia.org/wiki/Exploratory_data_analysis):

> In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. [...] Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.

In this chapter we will encounter an important mathematical technique for dimension reduction, which allows us to explore this data, and find interesting structure, in $2$ (rather than $318$!) dimensions.

## Background: review of spectral decomposition

**NUMERICAL CORNER:** In Python, the eigenvalues and eigenvectors of a matrix can be computed using [`numpy.linalg.eig`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html).

In [None]:
A = np.array([[2.5, -0.5], [-0.5, 2.5]])

In [None]:
w, v = LA.eig(A)
print(w)
print(v)
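Because the matrix $A$ above is symmetric, one could also use `numpy.linalg.eigh`, which is designed for symmetric (Hermitian) matrices and returns the eigenvalues in ascending order together with orthonormal eigenvectors. The following is a minimal sketch (the names `lam`, `Q` and `A_rebuilt` are ours, for illustration) that checks the spectral decomposition $A = Q \Lambda Q^T$ numerically.

```python
import numpy as np
from numpy import linalg as LA

A = np.array([[2.5, -0.5], [-0.5, 2.5]])

# eigh assumes a symmetric input; eigenvalues are returned in ascending order
lam, Q = LA.eigh(A)
print(lam)

# the columns of Q are orthonormal eigenvectors
print(np.allclose(Q.T @ Q, np.eye(2)))

# rebuild A from its spectral decomposition Q diag(lam) Q^T
A_rebuilt = Q @ np.diag(lam) @ Q.T
print(np.allclose(A_rebuilt, A))
```

For a non-symmetric matrix, `eigh` is not appropriate and `eig` should be used instead.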

$\unlhd$

## Dimension reduction and approximating subspaces

**NUMERICAL CORNER:** In NumPy, the outer product is computed using [`numpy.outer`](https://numpy.org/doc/stable/reference/generated/numpy.outer.html).

In [None]:
u = np.array([0., 2., -1.])
v = np.array([3., -2.])
Z = np.outer(u, v)
print(Z)

In [None]:
print(LA.matrix_rank(Z))
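As a sanity check on the definition, the sketch below verifies that each entry of the outer product is $Z_{ij} = u_i v_j$ and that it agrees with the product of a column vector and a row vector. Every row of $Z$ is then a multiple of $\mathbf{v}$, which is why the rank is one. The names `Z_manual` and `Z_matmul` are ours.

```python
import numpy as np
from numpy import linalg as LA

u = np.array([0., 2., -1.])
v = np.array([3., -2.])
Z = np.outer(u, v)

# entry (i, j) of the outer product is u[i] * v[j]
Z_manual = np.array([[u[i] * v[j] for j in range(len(v))]
                     for i in range(len(u))])

# equivalently, a (3 x 1) matrix times a (1 x 2) matrix
Z_matmul = u[:, None] @ v[None, :]

print(np.allclose(Z, Z_manual), np.allclose(Z, Z_matmul))
print(LA.matrix_rank(Z))
```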

$\unlhd$

## Power iteration

**NUMERICAL CORNER:** We implement the algorithm suggested by the *Power Iteration Lemma*. That is, with $B = A^T A$ and a random initial vector $\mathbf{x}$, we compute $B^{k} \mathbf{x}$, then normalize it to obtain an approximation of the top right singular vector $\mathbf{v}_1$. To obtain the corresponding singular value and left singular vector, we use that $\sigma_1 = \|A \mathbf{v}_1\|$ and $\mathbf{u}_1 = A \mathbf{v}_1/\sigma_1$.

In [None]:
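def topsing(A, maxiter=10):
    # random starting vector with one entry per column of A
    x = rng.normal(0,1,np.shape(A)[1])
    # the eigenvectors of B = A^T A are the right singular vectors of A
    B = A.T @ A
    # power iteration: after the loop, x is B^maxiter times the initial x
    for _ in range(maxiter):
        x = B @ x
    # normalize to get the (approximate) top right singular vector
    v = x / LA.norm(x)
    # sigma_1 = ||A v_1|| and u_1 = A v_1 / sigma_1
    s = LA.norm(A @ v)
    u = A @ v / s
    return u, s, v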
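One practical caveat: the entries of $B^k \mathbf{x}$ can grow (or shrink) geometrically in $k$, so with many iterations or large singular values the iterate may overflow before it is normalized. A common variant, sketched below, renormalizes at every step, which does not change the direction of the iterate. It also applies $A$ and $A^T$ in turn rather than forming $B$. The function name `topsing_normalized`, the `seed` parameter and the test matrix are ours, for illustration only.

```python
import numpy as np
from numpy import linalg as LA

def topsing_normalized(A, maxiter=100, seed=535):
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, A.shape[1])
    for _ in range(maxiter):
        x = A.T @ (A @ x)      # apply B = A^T A without forming it
        x = x / LA.norm(x)     # renormalize to avoid overflow
    v = x
    s = LA.norm(A @ v)
    u = A @ v / s
    return u, s, v

# small test matrix whose first column dominates, so sigma_1 is well separated
rng = np.random.default_rng(0)
A = rng.normal(0, 1, (50, 5))
A[:, 0] *= 10.

u, s, v = topsing_normalized(A)
U, S, Vh = LA.svd(A, full_matrices=False)

print(np.isclose(s, S[0]))               # same top singular value
print(np.isclose(abs(v @ Vh[0]), 1.0))   # same direction, up to sign
```

The comparison with `numpy.linalg.svd` is made up to sign, since singular vectors are only determined up to a simultaneous sign flip of $\mathbf{u}_1$ and $\mathbf{v}_1$.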

We will apply it to our previous two-cluster example.

In [None]:
d, n, w = 10, 100, 3.
X1, X2 = mmids.two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1])
plt.show()

Let's compute the top singular vector.

In [None]:
u, s, v = topsing(X)
print(v)

This is approximately $-\mathbf{e}_1$. We get roughly the same answer (possibly up to sign) from NumPy's [`numpy.linalg.svd`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html) function.

In [None]:
u, s, vh = LA.svd(X)
print(vh.T[:,0])

Recall that, when we applied $k$-means clustering to this example in $d=1000$ dimensions, we obtained a very poor clustering. We first rerun $k$-means on the raw data for comparison, then try again after projecting onto the top singular vector.

In [None]:
d, n, w = 1000, 100, 3.
X1, X2 = mmids.two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1])
plt.show()

In [None]:
assign = mmids.kmeans(X, 2)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()

Let's try again, but after projecting onto the top singular vector. Recall that this corresponds to finding the best one-dimensional approximating subspace. The projection can be computed using the truncated SVD $Z= U_{(1)} \Sigma_{(1)} V_{(1)}^T$. We can interpret the rows of $U_{(1)} \Sigma_{(1)}$ as the coefficients of each data point in the basis $\mathbf{v}_1$. We will work in that basis. We need one small hack: because our implementation of $k$-means clustering expects data points in at least $2$ dimensions, we add a column of $0$'s.

In [None]:
u, s, v = topsing(X)
Xproj = np.stack((u*s, np.zeros(np.shape(X)[0])), axis=-1)
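The interpretation of $U_{(1)} \Sigma_{(1)}$ as coefficients along $\mathbf{v}_1$ can be checked numerically: since $A \mathbf{v}_1 = \sigma_1 \mathbf{u}_1$, these coefficients are the inner products of the rows of the data matrix with $\mathbf{v}_1$. The sketch below uses a small random matrix (a hypothetical stand-in, not the two-cluster dataset) and NumPy's SVD. The names `coeffs` and `Z` are ours.

```python
import numpy as np
from numpy import linalg as LA

rng = np.random.default_rng(535)
X = rng.normal(0, 1, (20, 6))   # stand-in data: 20 points in 6 dimensions

U, S, Vh = LA.svd(X, full_matrices=False)
u1, s1, v1 = U[:, 0], S[0], Vh[0, :]

# coefficients of each data point along v1
coeffs = u1 * s1
print(np.allclose(coeffs, X @ v1))

# rank-one truncation Z = sigma_1 u_1 v_1^T: row i is coeffs[i] * v1
Z = np.outer(coeffs, v1)

# the residual X - Z is orthogonal to v1, so Z is the row-wise projection
print(np.allclose((X - Z) @ v1, 0.))
```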

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(Xproj[:,0], Xproj[:,1])
plt.ylim([-3,3])
plt.show()

There is a small, yet noticeable, gap around $0$.

A histogram of the first component of `Xproj` gives a better sense of the density of points.

In [None]:
plt.hist(Xproj[:,0])
plt.show()

We run $k$-means clustering on the projected data.

In [None]:
assign = mmids.kmeans(Xproj, 2)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()

Much better. We will give an explanation of this outcome in an upcoming (optional) subsection. In essence, quoting [BHK, Section 7.5.1]:

> [...] let's understand the central advantage of doing the projection to [the top $k$ right singular vectors]. It is simply that for any reasonable (unknown) clustering of data points, the projection brings data points closer to their cluster centers.

Finally, looking at the top right singular vector (or its first ten entries for lack of space), we see that it does align quite well (but not perfectly) with the first dimension. In the next (optional) section, we try again with the top two singular vectors.

In [None]:
print(v[:10])

$\unlhd$