In [3]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Matrix Operations in Python

In [4]:
age_and_height = np.array([[182, 28], [399, 30], [725, 33]])

In [5]:
M = np.array([[1, 0, 0], [0, 1, 1/12]])

In [6]:
age_and_height @ M

array([[182.        ,  28.        ,   2.33333333],
       [399.        ,  30.        ,   2.5       ],
       [725.        ,  33.        ,   2.75      ]])

## Singular Value Decomposition Experiment

### Manual Decomposition

In the table below, we have the width, height, area, and perimeter of several rectangles stored in a dataframe, one rectangle per row.

In [8]:
rectangle = pd.read_csv("data/rectangle_data.csv")
rectangle.head(5)

Unnamed: 0,width,height,area,perimeter
0,8,6,48,28
1,2,4,8,12
2,1,3,3,8
3,9,3,27,24
4,9,8,72,34


The perimeter column is redundant: it is always twice the width plus twice the height. Thus, if we create a new dataframe that has only the width, height, and area...

In [9]:
rectangle_no_perim = rectangle[["width", "height", "area"]]
rectangle_no_perim.head(5)

Unnamed: 0,width,height,area
0,8,6,48
1,2,4,8
2,1,3,3
3,9,3,27
4,9,8,72


... then we can recover the perimeter by multiplying this matrix by

`1   0   0   2
0   1   0   2
0   0   1   0`
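
Written out for a single row, with $w$, $h$, and $a$ standing for width, height, and area (symbols introduced here for illustration only), the product is:

$$
\begin{bmatrix} w & h & a \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 2 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 0 \end{bmatrix}
=
\begin{bmatrix} w & h & a & 2w + 2h \end{bmatrix}
$$

The first three columns of the matrix copy the inputs through unchanged, and the fourth column builds the perimeter. For the first row of the table, $2 \cdot 8 + 2 \cdot 6 = 28$, which matches the perimeter column.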

In [10]:
transform_3D_to_4D = [[1, 0, 0, 2], [0, 1, 0, 2], [0, 0, 1, 0]]

In [11]:
rectangle_with_perimeter_back = np.array(rectangle_no_perim) @ transform_3D_to_4D
pd.DataFrame(rectangle_with_perimeter_back, columns = ["width", "height", "area", "perimeter"]).head(5)

Unnamed: 0,width,height,area,perimeter
0,8,6,48,28
1,2,4,8,12
2,1,3,3,8
3,9,3,27,24
4,9,8,72,34


### Singular Value Decomposition Example

Singular value decomposition (SVD) is a numerical technique that (among other things) automatically uncovers such redundancies. Given an input matrix $X$, SVD returns $U$, $\Sigma$, and $V^T$ such that $X = U \Sigma V^T$.
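
As a minimal sketch of what the decomposition returns, the snippet below runs `np.linalg.svd` on a small stand-in array. The array `X` is an assumption: it holds only the five rows shown by `head(5)` above, not the full CSV file. Note that NumPy hands back the singular values as a 1D array rather than as a diagonal matrix.

```python
import numpy as np

# Hypothetical stand-in for the rectangle data: only the five rows shown by
# head(5), with columns width, height, area, perimeter.
X = np.array([
    [8, 6, 48, 28],
    [2, 4, 8, 12],
    [1, 3, 3, 8],
    [9, 3, 27, 24],
    [9, 8, 72, 34],
], dtype=float)

# full_matrices=False requests the reduced ("economy") decomposition.
u, s, vt = np.linalg.svd(X, full_matrices=False)

# s is a 1D array of singular values, sorted from largest to smallest.
print(u.shape, s.shape, vt.shape)

# Rebuild Sigma as a diagonal matrix and check that U @ Sigma @ V^T matches X.
reconstructed = u @ np.diag(s) @ vt
print(np.allclose(reconstructed, X))
```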

In [16]:
u, s, vt = np.linalg.svd(rectangle, full_matrices = False)

As we did before with our manual decomposition, we can recover our original rectangle data by multiplying the three return values of this function back together.

In [17]:
usig = u * s

In [18]:
pd.DataFrame(usig @ vt).head(4)

Unnamed: 0,0,1,2,3
0,8.0,6.0,48.0,28.0
1,2.0,4.0,8.0,12.0
2,1.0,3.0,3.0,8.0
3,9.0,3.0,27.0,24.0


The two key pieces of the decomposition are $U\Sigma$ and $V^T$, which we can think of for now as analogous to our 'data' and 'transformation operation' from our manual decomposition earlier.

Let's start by looking at $U\Sigma$, which we can compute with the Python code `u*s`.
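
The expression `u*s` works because `s` is a 1D array, so NumPy broadcasting scales column `j` of `u` by `s[j]`, which is the same as right-multiplying by the diagonal matrix $\Sigma$. The sketch below checks this on a small made-up pair of arrays (the values are hypothetical and unrelated to the rectangle data).

```python
import numpy as np

# Hypothetical small example; any matrix and 1D array of matching length works.
u = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
s = np.array([10.0, 0.5])

# Broadcasting multiplies column j of u by s[j] ...
scaled = u * s

# ... which is what right-multiplying by the diagonal matrix does.
explicit = u @ np.diag(s)

print(scaled)
print(np.allclose(scaled, explicit))
```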

In [19]:
usig

array([[-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
         6.39107280e-15],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00,
        -3.48640947e-15],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
         3.26733408e-17],
       [-3.68444316e+01, -4.80005945e+00, -3.80095908e+00,
        -1.46928838e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -4.80752577e-16],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00,
         1.93442399e-16],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00,
         7.89334171e-17],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01,
         3.09112653e-16],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00,
         3.37879780e-16],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00,
        -5.64863142e-17],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -9.85652541e-17],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
      

Similarly, we can look at vt.

In [20]:
vt

array([[-0.14643575, -0.12994219, -0.8100201 , -0.55275586],
       [-0.1927359 , -0.18912774,  0.5863482 , -0.76372728],
       [-0.70495745,  0.70915533,  0.00795161,  0.00839577],
       [-0.66666667, -0.66666667,  0.        ,  0.33333333]])

The automatic decomposition returned by the `svd` function looks quite different from what we got when we manually decomposed our data into "data" and "operations". That is, `vt` is a bunch of seemingly arbitrary numbers instead of the rather simple:

`1   0   0   2
0   1   0   2
0   0   1   0`

Similarly, if we look at the shapes of $U\Sigma$ and $V^T$, we see that they are bigger than in our manual decomposition. Specifically, $U\Sigma$ still has 4 columns, meaning that each observation is 4-dimensional. Furthermore, rather than being 3x4, our transformation operation $V^T$ is 4x4, meaning that it maps 4-dimensional inputs back to 4 dimensions.

This seems problematic, because our goal in using SVD was to find a transformation operation that takes 3D inputs and maps them up to 4 dimensions.

Luckily, if we look carefully at $U\Sigma$, we see that the last attribute of each observation is very close to 0. 

In [21]:
usig

array([[-5.63092679e+01,  4.08369641e+00, -7.67968689e-01,
         6.39107280e-15],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00,
        -3.48640947e-15],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00,
         3.26733408e-17],
       [-3.68444316e+01, -4.80005945e+00, -3.80095908e+00,
        -1.46928838e-16],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01,
        -4.80752577e-16],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00,
         1.93442399e-16],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00,
         7.89334171e-17],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01,
         3.09112653e-16],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00,
         3.37879780e-16],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00,
        -5.64863142e-17],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01,
        -9.85652541e-17],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00,
      

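Since that column is numerically zero, it contributes essentially nothing to the product $U \Sigma V^T$. Thus, it makes sense to remove the last column of $U \Sigma$.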
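
One way to check that the last column is negligible without scanning the whole array is to look at the singular values themselves. Because the columns of `u` have unit length, column `j` of `u * s` has length `s[j]`, so a near-zero last singular value means a near-zero last column. The sketch below assumes a stand-in array `X` holding only the five rows shown by `head(5)` earlier.

```python
import numpy as np

# Hypothetical stand-in for the rectangle data (first five rows only).
X = np.array([
    [8, 6, 48, 28],
    [2, 4, 8, 12],
    [1, 3, 3, 8],
    [9, 3, 27, 24],
    [9, 8, 72, 34],
], dtype=float)

u, s, vt = np.linalg.svd(X, full_matrices=False)
usig = u * s

# The length of each column of usig equals the matching singular value.
column_norms = np.linalg.norm(usig, axis=0)
print(s)
print(np.allclose(column_norms, s))

# The last singular value is negligible next to the first one.
print(s[-1] / s[0])

# matrix_rank counts the singular values above a small tolerance.
print(np.linalg.matrix_rank(X))
```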

In [23]:
usig = usig[:, 0:3]
usig

array([[-5.63092679e+01,  4.08369641e+00, -7.67968689e-01],
       [-1.39258714e+01, -5.61592446e+00,  1.59106852e+00],
       [-7.38836950e+00, -5.11089273e+00,  1.51352951e+00],
       [-3.68444316e+01, -4.80005945e+00, -3.80095908e+00],
       [-7.94726055e+01,  1.30026983e+01,  1.86597851e-01],
       [-7.42135662e+00, -5.11810904e+00, -1.31469604e+00],
       [-1.39588585e+01, -5.62314077e+00, -1.23715703e+00],
       [-3.79895573e+01, -1.31360807e+00, -2.60712770e-01],
       [-1.56692269e+01, -9.65347804e+00, -4.03555325e+00],
       [-2.54468092e+01, -7.81311695e+00, -3.92620778e+00],
       [-3.26875093e+01, -2.52515864e+00,  3.87695076e-01],
       [-5.38957011e+01,  2.32104364e+00, -2.20593631e+00],
       [-4.08780385e+01, -1.86471027e+00, -2.34708823e+00],
       [-5.34289549e+00, -3.98065864e+00,  7.79631035e-01],
       [-2.82033419e+01, -8.33535389e+00,  5.30031897e+00],
       [-6.40083890e+01,  7.06150790e+00,  1.43570386e+00],
       [-4.32916052e+01, -1.02057499e-01

Similarly, because the observations are now 3D, we should remove the last row of $V^T$, since we want to use $V^T$ to map our now 3D data into 4D (rather than expecting 4D input data).

In [24]:
vt = vt[0:3, :]

After removing the redundant portions of $U\Sigma$ and $V^T$, we can verify that multiplying them together again yields our original array.
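
Rather than comparing the printed arrays by eye, the check can be done with `np.allclose`. This is a minimal sketch under the same assumption as before: `X` is a stand-in holding only the five rows shown by `head(5)`, and `k`, `usig_k`, and `vt_k` are names introduced here for illustration.

```python
import numpy as np

# Hypothetical stand-in for the rectangle data (first five rows only).
X = np.array([
    [8, 6, 48, 28],
    [2, 4, 8, 12],
    [1, 3, 3, 8],
    [9, 3, 27, 24],
    [9, 8, 72, 34],
], dtype=float)

u, s, vt = np.linalg.svd(X, full_matrices=False)

# Keep only the first k columns of U @ Sigma and the first k rows of V^T.
k = 3
usig_k = (u * s)[:, :k]
vt_k = vt[:k, :]
print(usig_k.shape, vt_k.shape)

# The truncated factors still reproduce X up to floating-point error.
print(np.allclose(usig_k @ vt_k, X))
```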

In [25]:
usig @ vt

array([[ 8.,  6., 48., 28.],
       [ 2.,  4.,  8., 12.],
       [ 1.,  3.,  3.,  8.],
       [ 9.,  3., 27., 24.],
       [ 9.,  8., 72., 34.],
       [ 3.,  1.,  3.,  8.],
       [ 4.,  2.,  8., 12.],
       [ 6.,  5., 30., 22.],
       [ 7.,  1.,  7., 16.],
       [ 8.,  2., 16., 20.],
       [ 5.,  5., 25., 20.],
       [ 9.,  5., 45., 28.],
       [ 8.,  4., 32., 24.],
       [ 1.,  2.,  2.,  6.],
       [ 2.,  9., 18., 22.],
       [ 7.,  8., 56., 30.],
       [ 7.,  5., 35., 24.],
       [ 2.,  4.,  8., 12.],
       [ 2.,  5., 10., 14.],
       [ 4.,  3., 12., 14.],
       [ 8.,  1.,  8., 18.],
       [ 8.,  5., 40., 26.],
       [ 9.,  8., 72., 34.],
       [ 7.,  3., 21., 20.],
       [ 8.,  3., 24., 22.],
       [ 6.,  6., 36., 24.],
       [ 6.,  4., 24., 20.],
       [ 8.,  4., 32., 24.],
       [ 7.,  2., 14., 18.],
       [ 4.,  4., 16., 16.],
       [ 4.,  1.,  4., 10.],
       [ 9.,  7., 63., 32.],
       [ 4.,  2.,  8., 12.],
       [ 1.,  2.,  2.,  6.],
       [ 8.,  