# Chapter 6: Matrix Multiplication

In [1]:
# standard python library imports
import numpy as np
from matplotlib import pyplot as plt

Reading pp. 141-175  
Exercises pp. 176-185

- There are several ways of multiplying matrices (all explained in this chapter)
- not all pairs of matrices can be multiplied

## 6.1 "Standard" matrix multiplication

*notation-wise, this book uses two matrices next to each other (like this: $AB$) to indicate "standard" matrix multiplication*

- Matrix multiplication is not commutative: in general, $AB \neq BA$
- to emphasize this, sometimes people use the phrases "A left-multiplies B" or "A pre-multiplies B"
- Only matrices where the "inner dimensions" are equal in size can be multiplied (e.g. [4,2] x [2,6])
- the resulting product matrix's size will be the "outer dimensions" (e.g. [4,2] x [2,6] = [4,6])

- we can use this new info to explain the previous notation for dot product and outer product
- $v^Tw$ for dot product and $vw^T$ for outer product
- *note: this notation is still vague imo, since the dot product $v^Tw$ is a scalar, while the matrix product $v^Tw$ is technically a 1x1 matrix*
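
A minimal NumPy sketch of the notation point above, assuming the vectors are stored as explicit N x 1 column matrices (the names `v`, `w` and their values are made up for illustration). With 2-D arrays, `v.T @ w` comes back as a 1x1 matrix rather than a bare scalar, while `v @ w.T` gives the outer product:

```python
import numpy as np

# Column vectors stored explicitly as 3x1 matrices (illustrative values)
v = np.array([[1.0], [2.0], [3.0]])
w = np.array([[4.0], [5.0], [6.0]])

dot_as_matrix = v.T @ w                        # (1x3)(3x1) -> 1x1 matrix
outer = v @ w.T                                # (3x1)(1x3) -> 3x3 matrix
dot_as_scalar = np.dot(v.ravel(), w.ravel())   # 1-D arrays -> plain scalar

print(dot_as_matrix.shape)   # (1, 1)
print(outer.shape)           # (3, 3)
print(dot_as_scalar)         # 32.0
```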

In [2]:
# In Python, matrix multiplication uses the @ symbol
M1 = np.random.randn(4,2)
M2 = np.random.randn(2,6)
C = M1 @ M2
print(C)

[[ 1.18065911 -1.14654382 -1.6825322   0.1480106  -1.5144497   0.70675351]
 [-1.2243817  -1.47099013  1.56215408  0.6121936   0.06866105 -1.98875251]
 [ 0.50836082 -2.16128703 -0.83898472  0.54375653 -1.59364308 -0.48299895]
 [-1.42475762 -1.86280673  1.80743187  0.75587133 -0.00540544 -2.38554967]]


There are 4 ways to think about / implement matrix multiplication:
1. The "element perspective"
2. The "layer perspective"
3. The "column perspective"
4. The "row perspective"

### 1) The "element perspective"

- each element of the result is the dot product of a row of $A$ and a column of $B$
- memory tool: left hand moves --> across a row of the left matrix, right hand moves vvv down a column of the right matrix; the top row and the left column give the upper-left element of the product

3 important features of matrix multiplication
1. The diagonal of the result contains dot products between rows/columns of the same ordinal position (e.g. $a_1b_1$, $a_2b_2$, etc)
  - *important for understanding data covariance matrices*
2. The lower triangle of the result contains dot products between *later* rows in A and *earlier* columns in B (e.g. $a_5b_1$)
3. The upper triangle of the result contains dot products between *earlier* rows in A and *later* columns in B (e.g. $a_1b_3$)
  - *2 & 3 are important for understanding matrix decompositions like QR decomp and generalized eigendecomposition*
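
The element perspective can be sketched as a double loop in which each entry of the product is one row-by-column dot product. This is only an illustration (the `@` operator is what you would use in practice); the small matrices and the name `C_elem` are chosen here for the example:

```python
import numpy as np

A = np.array([[3, 4], [-1, 2], [0, 4]])   # 3x2
B = np.array([[5, 1], [3, 1]])            # 2x2

# Result has the "outer dimensions": rows of A by columns of B
C_elem = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        # element (i, j) = dot product of row i of A with column j of B
        C_elem[i, j] = np.dot(A[i, :], B[:, j])

print(np.allclose(C_elem, A @ B))  # True
```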

### 2) The layer perspective

- layer perspective involves conceptualizing the product matrix as a series of layers or "sheets" that are summed together
- implemented by creating outer products by taking the **columns of $A$ multiplied by the rows of $B$** then summing those outer products together
- Each outer product is the same size as the result $C$ and can be thought of as a layer

Example:
$$
\begin{bmatrix}
3 & 4 \\
-1 & 2 \\
0 & 4
\end{bmatrix}
\begin{bmatrix}
5 & 1 \\
3 & 1
\end{bmatrix}
=
\begin{bmatrix}
15 & 3 \\
-5 & -1 \\
0 & 0
\end{bmatrix}
+
\begin{bmatrix}
12 & 4 \\
6 & 2 \\
12 & 4
\end{bmatrix}
=
\begin{bmatrix}
27 &  7 \\
1 & 1 \\
12 & 4
\end{bmatrix}
$$
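
The same example can be checked in NumPy. This is a small sketch that uses `np.outer` to build each layer from a column of $A$ and the matching row of $B$ (the names `layers` and `C_layer` are just for illustration):

```python
import numpy as np

A = np.array([[3, 4], [-1, 2], [0, 4]])
B = np.array([[5, 1], [3, 1]])

# One layer per inner index k: column k of A times row k of B
layers = [np.outer(A[:, k], B[k, :]) for k in range(A.shape[1])]

# Summing the layers gives the product matrix
C_layer = sum(layers)

print(C_layer)
print(np.array_equal(C_layer, A @ B))  # True
```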

### 3) The column perspective

- treats all matrices as sets of column vectors, the product matrix is created one column at a time.
- the 1st column of the product matrix is a linear weighted combination of all columns in the left matrix, where the weights are defined by the elements in the first column of the right matrix.
  - the 2nd column is again a weighted combination of all columns in the left matrix, except that the weights now come from the 2nd column in the right matrix.
  - etc
- **note that for column perspective, matrix B creates the weights!**
- the column perspective is useful in statistics when the columns of the left matrix contain a set of regressors (a simplified model of the data), and the right matrix contains coefficients (i.e. the importance of each regressor). More on this in ch 14.

Example:
$$
\begin{bmatrix}
3 & 4 \\
-1 & 2 \\
0 & 4
\end{bmatrix}
\begin{bmatrix}
5 & 1 \\
3 & 1
\end{bmatrix}
=
\begin{bmatrix}
5
\begin{bmatrix}
3 \\
-1 \\
0
\end{bmatrix}
+ 3
\begin{bmatrix}
4 \\
2 \\
4
\end{bmatrix}
\hspace{0.5cm}
1
\begin{bmatrix}
3 \\
-1 \\
0
\end{bmatrix}
+ 1
\begin{bmatrix}
4 \\
2 \\
4
\end{bmatrix}
\end{bmatrix}
=
\begin{bmatrix}
27 &  7 \\
1 & 1 \\
12 & 4
\end{bmatrix}
$$
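
A short NumPy sketch of the column perspective with the same matrices, building the product one column at a time (the loop and the name `C_col` are illustrative, not how you would multiply in practice):

```python
import numpy as np

A = np.array([[3, 4], [-1, 2], [0, 4]])
B = np.array([[5, 1], [3, 1]])

C_col = np.zeros((A.shape[0], B.shape[1]))
for j in range(B.shape[1]):
    # column j of the product = columns of A weighted by column j of B
    C_col[:, j] = sum(B[k, j] * A[:, k] for k in range(A.shape[1]))

print(C_col)
print(np.allclose(C_col, A @ B))  # True
```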

### 4) The row perspective

- similar to column perspective but for rows
- each row in the product matrix is the weighted sum of all rows in the right matrix, where the weights are given by the elements in the corresponding row of the left matrix.
  - think of each row of $A$ as a list of scalars (weights)
  - the 1st row of the product is the rows of $B$ weighted by the elements in the 1st row of $A$, then summed
  - the 2nd row of the product uses the same rows of $B$, but with the weights from the 2nd row of $A$
  - etc...
- **note that for row perspective, matrix A creates the weights!**
- useful in cases like principal components analysis, where the rows of the right matrix contain data (observations in rows and features in columns) and the rows of the left matrix contain weights for combining the features.  Then the weighted sum of data creates the principal component scores.

*no example because it's a pain to do in LaTeX*
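
In place of a LaTeX example, here is a small NumPy sketch of the row perspective, reusing the matrices from the layer and column examples. Each row of the product is built as a weighted sum of the rows of $B$ (the name `C_row` is made up for this example):

```python
import numpy as np

A = np.array([[3, 4], [-1, 2], [0, 4]])
B = np.array([[5, 1], [3, 1]])

C_row = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    # row i of the product = rows of B weighted by the elements of row i of A
    C_row[i, :] = sum(A[i, k] * B[k, :] for k in range(B.shape[0]))

print(C_row[0, :])                # first row: 3*[5, 1] + 4*[3, 1]
print(np.allclose(C_row, A @ B))  # True
```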

## 6.2 Multiplication and equations

- when multiplying both sides of an equation by a scalar, order doesn't matter
- but since matrix multiplication is **not commutative** if you are multiplying both sides of an equation by a matrix then you must put them in the **same order** on both sides of the equation.
- i.e. if you pre-multiply the left side with matrix $D$, you must also pre-multiply the right side.  (same for post-multiplying)

In [3]:
# Confirm in code that A@B != B@A
A = np.random.randn(2,2)
B = np.random.randn(2,2)
C1 = A@B
C2 = B@A
print(C1)

[[-0.4968394   1.0134623 ]
 [ 0.63156963 -4.24308136]]


In [4]:
print(C2)

[[-4.48533552  0.23818947]
 [-1.36932077 -0.25458523]]
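
The "same order on both sides" rule can also be checked with fixed matrices. In this sketch the matrices are hypothetical and `B` is simply a copy of `A`, so the equation $A = B$ holds before anything is multiplied:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = A.copy()                              # the "equation" A = B holds
D = np.array([[0.0, 1.0], [2.0, 0.0]])

# Pre-multiplying both sides by D keeps the equation valid
same_side = np.allclose(D @ A, D @ B)

# Pre-multiplying one side and post-multiplying the other breaks it
mixed_side = np.allclose(D @ A, B @ D)

print(same_side)   # True
print(mixed_side)  # False
```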


## 6.3 Multiplication with diagonals

There's a special property of multiplication when one matrix is a diagonal matrix and the other is a dense matrix
- pre-multiplication by a diagonal matrix scales the **rows** of the right matrix by the diagonal elements
- post-multiplication by a diagonal matrix scales the **columns** of the left matrix by the diagonal elements

Mnemonic:  
- P**R**e-multiply to affect **R**ows
- P**O**st-multiply to affect c**O**lumns
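
A quick NumPy check of the row/column scaling, using a made-up dense matrix and diagonal values chosen so the scaling is easy to see:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
d = np.array([10.0, 100.0, 1000.0])
D = np.diag(d)            # diagonal matrix with d on the diagonal

pre = D @ A               # row i of A is scaled by d[i]
post = A @ D              # column j of A is scaled by d[j]

print(pre[1, :])          # [400. 500. 600.]
print(post[:, 1])         # [200. 500. 800.]
```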

#### Multiplying two diagonal matrices
- The product of 2 diagonal matrices is another diagonal matrix whose diagonal elements are the products of the corresponding diagonal elements

Example:
$$\begin{bmatrix}
a & 0 & 0 \\
0 & b & 0 \\
0 & 0 & c
\end{bmatrix}
\begin{bmatrix}
d & 0 & 0 \\
0 & e & 0 \\
0 & 0 & f
\end{bmatrix}
=
\begin{bmatrix}
ad & 0 & 0 \\
0 & be & 0 \\
0 & 0 & cf
\end{bmatrix}$$
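
The same property in NumPy, with illustrative numbers standing in for $a$ through $f$. The sketch also checks that two diagonal matrices give the same product in either order, which follows from the formula above:

```python
import numpy as np

D1 = np.diag([1, 2, 3])   # a, b, c
D2 = np.diag([4, 5, 6])   # d, e, f

prod = D1 @ D2

print(np.array_equal(prod, np.diag([4, 10, 18])))  # True
print(np.array_equal(D1 @ D2, D2 @ D1))            # True
```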

## 6.4 LIVE EVIL

**Important but not intuitive point:**
- An operation applied to multiplied matrices gets applied to each matrix individually **in reverse order**
- You will need to swap the order of the matrices before multiplying them
- *note: think of it as similar to function composition in programming.  In `f(g(x))` the rightmost function is applied first, so working back through the whole thing means going through the functions in reverse order*

e.g. $(ABC)^T = C^T B^T A^T$

e.g. $(ABCD)^{-1} = D^{-1} C^{-1} B^{-1} A^{-1}$

- be careful with square matrices because you can still get a result (an incorrect one) if you transpose each matrix without reversing the order
- fortunately, with rectangular matrices the mistake is usually caught for you: without the LIVE EVIL reversal the inner dimensions generally no longer match, so the multiplication fails
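
A small NumPy sketch of LIVE EVIL. The shapes, the seed, and the matrices `P` and `Q` are assumptions picked for the example; the rectangular shapes are chosen so that forgetting to reverse the order raises an error instead of silently giving a wrong answer:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Rectangular matrices with compatible inner dimensions: (2x3)(3x4)(4x5)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

# Transpose of the product = product of the transposes in reverse order
transpose_ok = np.allclose((A @ B @ C).T, C.T @ B.T @ A.T)
print(transpose_ok)  # True

# Without reversing: (3x2)(4x3) has mismatched inner dimensions
try:
    A.T @ B.T @ C.T
    wrong_order_failed = False
except ValueError:
    wrong_order_failed = True
print(wrong_order_failed)  # True

# Same rule for the inverse, using two small invertible matrices
P = np.array([[2.0, 1.0], [1.0, 1.0]])
Q = np.array([[1.0, 2.0], [0.0, 1.0]])
inverse_ok = np.allclose(np.linalg.inv(P @ Q),
                         np.linalg.inv(Q) @ np.linalg.inv(P))
print(inverse_ok)  # True
```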

## 6.5 Matrix-vector multiplication

- Important feature of matrix-vector multiplication: the result is always a vector
- this provides the connection between linear transformations and matrices:
  - to apply a transform to a vector, convert the transform to a matrix, then multiply the vector by that matrix
- pre-multiplying by a vector gives a different result from post-multiplying by the same vector (transposed to match size)
- exception: if the matrix is symmetric, then pre-multiplying by a vector is essentially the same as post-multiplying by the vector (transposed).
  - *Though one is a column vector and the other is a row vector, they're essentially the same*
- formally: if $A = A^T$ then $Ab = (b^TA)^T$
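
A quick check of $Ab = (b^TA)^T$ in NumPy, with one symmetric and one non-symmetric matrix (both made up for the example, as is the vector `b`):

```python
import numpy as np

S = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 4.0],
              [0.0, 4.0, 5.0]])    # symmetric: S equals S.T
N = np.array([[2.0, 1.0, 0.0],
              [7.0, 3.0, 4.0],
              [0.0, -1.0, 5.0]])   # not symmetric
b = np.array([[1.0], [2.0], [3.0]])  # column vector

# Symmetric case: post- and pre-multiplying give the same numbers
sym_match = np.allclose(S @ b, (b.T @ S).T)

# Non-symmetric case: the two results differ
nonsym_match = np.allclose(N @ b, (b.T @ N).T)

print(sym_match)     # True
print(nonsym_match)  # False
```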

## 6.6 Creating symmetric matrices

2 methods for non-symmetric matrices to be converted to symmetric:
1) additive
2) multiplicative

### Additive method
- not widely used but useful to know
- Add the matrix to its transpose and divide by 2
- *only valid for square matrices*

$C = \frac{1}{2}(A^T + A)$

### Multiplicative method
- more commonly used
- multiply a matrix by its transpose (this is the $A^TA$ learned previously)
- can use on any matrix, square or non-square

$A^TA$
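
Both methods can be sketched in a few lines of NumPy. The matrices below are arbitrary examples; the rectangular matrix `R` shows that the multiplicative method also works for non-square input (and that $A^TA$ and $AA^T$ have different sizes):

```python
import numpy as np

# Additive method: square matrices only
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])
C_add = (A + A.T) / 2
print(np.array_equal(C_add, C_add.T))  # True

# Multiplicative method: any matrix, here 3x2
R = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
RtR = R.T @ R    # 2x2 symmetric
RRt = R @ R.T    # 3x3 symmetric
print(RtR.shape, np.array_equal(RtR, RtR.T))  # (2, 2) True
print(RRt.shape, np.array_equal(RRt, RRt.T))  # (3, 3) True
```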

## 6.7 Multiply symmetric matrices

- in general, the product of 2 symmetric matrices is not a symmetric matrix
- (there are exceptions, but this is generally the case)

Reflection:
- this may seem like a useless factoid, but it leads to one of the biggest limitations of principal components analysis, and one of the most important advantages of generalized eigendecomposition, which is the computational backbone of many machine-learning methods, most prominently linear classifiers and discriminant analyses.
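
A concrete pair of symmetric matrices (chosen arbitrarily) shows the general case, along with one of the exceptions. Since $(AB)^T = B^TA^T = BA$ for symmetric $A$ and $B$, the product is symmetric exactly when the two matrices commute, which is why multiplying a symmetric matrix by itself is a safe example of an exception:

```python
import numpy as np

S1 = np.array([[1, 2], [2, 3]])   # symmetric
S2 = np.array([[0, 1], [1, 5]])   # symmetric

product = S1 @ S2
print(product)
print(np.array_equal(product, product.T))  # False

# Exception: a symmetric matrix times itself (the factors commute)
square = S1 @ S1
print(np.array_equal(square, square.T))    # True
```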

## 6.8 Hadamard multiplication

- Hadamard multiplication is what a layperson would guess matrix multiplication would be like
- Simply multiply each element of one matrix with the corresponding element of another matrix
- Both matrices must be the same size (rows x columns) and the result will be the same size
- because Hadamard multiplication is implemented element-wise, it actually is commutative, where "standard" matrix multiplication isn't

Hadamard multiplication applications:
- used as a step in one of the key algorithms for computing the matrix inverse

In [5]:
# Hadamard multiplication in Python (note the * symbol!)
M1 = np.random.randn(3,4)
M2 = np.random.randn(3,4)
print(M1 * M2)

[[ 3.14037788 -0.00928369 -0.47585834 -0.14712052]
 [ 1.5345449   0.42816     0.78747033 -0.85696369]
 [ 1.92785464 -1.22364961  0.14584775  0.29115969]]
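
A small follow-up sketch with fixed integer matrices (made up for the example) that contrasts the commutativity of `*` with the non-commutativity of `@`. One caveat worth knowing: NumPy's `*` will broadcast some mismatched shapes instead of raising an error, so it is worth checking shapes yourself when you mean a true Hadamard product.

```python
import numpy as np

M1 = np.array([[1, 2], [3, 4]])
M2 = np.array([[5, 6], [7, 8]])

hadamard = M1 * M2      # element-wise product
standard = M1 @ M2      # "standard" matrix product

print(hadamard)
print(np.array_equal(M1 * M2, M2 * M1))  # True: Hadamard is commutative
print(np.array_equal(M1 @ M2, M2 @ M1))  # False: standard product is not
```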


## 6.9 Frobenius dot product

- an operation that produces a scalar (single number) given 2 matrices of the same size (M x N)
- also called Frobenius inner product
- to compute:
  - first vectorize the 2 matrices (concatenate all the columns to make a single large column vector)
  - then compute their dot product as with normal vectors

Notation:  
$\langle A,B \rangle_F$

In [6]:
# Vectorize a matrix into 1 column in Python
# (note that NumPy defaults to row-major order, so you have to specify Fortran (column-major) order)
A = np.array([ [1,2,3],[4,5,6] ])
A.flatten(order='F')

array([1, 4, 2, 5, 3, 6])

- A curious but useful way to compute the Frobenius dot product between A and B is by taking the trace of $A^TB$.
- Therefore: $\langle A,B \rangle_F = tr(A^TB)$

- Frobenius dot product has several uses in signal processing and machine learning, for example, as a measure of "distance" or similarity between 2 matrices
- The Frobenius inner product of a matrix with itself is the sum of all squared elements, and it's called the *squared Frobenius norm* or *squared Euclidean norm* of the matrix

In [7]:
# Compute the Frobenius dot product by using the trace transpose trick
A = np.random.randn(3,4)
B = np.random.randn(3,4)
frob = np.trace(A.T@B)
print(frob)

2.4493756775317816
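
To see that the different recipes agree, here is a sketch with small fixed matrices (made up for the example) comparing the vectorize-then-dot approach, the trace trick, and a plain sum of element-wise products:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[1, 0, -1], [2, 1, 0]])

# 1) vectorize both matrices column-wise, then take the dot product
via_vec = np.dot(A.flatten(order='F'), B.flatten(order='F'))

# 2) trace of A^T B
via_trace = np.trace(A.T @ B)

# 3) sum of the element-wise (Hadamard) products
via_sum = np.sum(A * B)

print(via_vec, via_trace, via_sum)  # 11 11 11
```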


## 6.10 Matrix norms

Vectors:
- The norm of a vector is the square root of the dot product of a vector with itself.
- This is the same as the magnitude / length of a vector (which is also called the norm)

Matrices:
- Annoyingly, the norm of a matrix is more complicated.
- there are many types of norms but they all have some things in common:
  - single number
  - corresponds in some way to the "magnitude" of the matrix

### Frobenius matrix norm

- see eq. 6.24 in book
- If we think of a matrix space as Euclidean, then the Frobenius norm of the difference between two matrices provides a measure of the Euclidean distance between those matrices (i.e. the Pythagorean theorem)
- only valid for 2 matrices of the same size
- sometimes called the ℓ2 norm (cursive l)
  - there is also an ℓ1 matrix norm: sum the absolute values of the elements in each column, then take the largest column sum

Now we know 3 ways to compute the Frobenius norm:
1. equation 6.24
2. vectorizing the matrix, computing the dot product with itself, and taking the square root
3. computing $\sqrt{tr(A^TA)}$

### Other norms
- There are many other matrix norms with varied formulas
- to avoid overwhelming the reader, the book doesn't cover them

In [8]:
# Calculate frobenius norm in Python
A = np.random.randn(3,4)
np.linalg.norm(A, 'fro')

4.9486756166769235
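
A sketch comparing several routes to the Frobenius norm on a small made-up matrix. Each route takes a square root at the end, since the sum of squared elements is the *squared* norm. The last lines use `np.linalg.norm` for the distance between two matrices and for the ℓ1 (maximum absolute column sum) norm:

```python
import numpy as np

A = np.array([[1.0, -2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

# Square root of the sum of squared elements
norm_elements = np.sqrt(np.sum(A**2))

# Vectorize, dot product with itself, square root
a = A.flatten(order='F')
norm_vectorized = np.sqrt(np.dot(a, a))

# Square root of the trace of A^T A
norm_trace = np.sqrt(np.trace(A.T @ A))

# NumPy's built-in Frobenius norm
norm_numpy = np.linalg.norm(A, 'fro')

print(np.allclose([norm_elements, norm_vectorized, norm_trace], norm_numpy))  # True

# Frobenius "distance" between two same-size matrices
dist = np.linalg.norm(A - B, 'fro')

# l1 matrix norm: largest absolute column sum (columns sum to 4 and 6)
l1 = np.linalg.norm(A, 1)
print(l1)  # 6.0
```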

## 6.11 Matrix asymmetry index

- **every square matrix can be expressed as the sum of a symmetric matrix and a skew-symmetric matrix**
  - (a skew-symmetric matrix can also be called an antisymmetric matrix)
- quite a claim, and not obvious at first
- think of the skew-symmetric matrix as a "residual" that we added to the symmetric matrix to produce the new matrix

*matrix asymmetry index:*
- In order to find out how asymmetric a matrix is, we would like to derive a scalar index to quantify it
- perfectly symmetric = 0, perfectly skew-symmetric = 1.
- computed as the ratio of the squared Frobenius norm of the skew-symmetric "layer" $A_k = \frac{1}{2}(A - A^T)$ to the squared Frobenius norm of the original matrix

$A_\sigma = \frac{\lVert A_k \rVert^2_F}{\lVert A \rVert^2_F}$
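
A sketch of the decomposition and the index in NumPy. It assumes the skew-symmetric layer is $A_k = (A - A^T)/2$ and the symmetric layer is $(A + A^T)/2$; the function name `asymmetry_index` and the test matrix are made up for this example:

```python
import numpy as np

def asymmetry_index(A):
    """Squared Frobenius norm of the skew-symmetric part over that of A."""
    A_k = (A - A.T) / 2   # skew-symmetric "layer" (assumed definition of A_k)
    return np.linalg.norm(A_k, 'fro')**2 / np.linalg.norm(A, 'fro')**2

A = np.array([[1.0, 2.0, 0.0],
              [4.0, 3.0, -1.0],
              [5.0, 1.0, 2.0]])

S = (A + A.T) / 2   # symmetric part
K = (A - A.T) / 2   # skew-symmetric part

print(np.allclose(S + K, A))   # True: the two layers sum back to A
print(np.allclose(K, -K.T))    # True: K is skew-symmetric

idx_sym = asymmetry_index(S)   # 0 for a perfectly symmetric matrix
idx_skew = asymmetry_index(K)  # 1 for a perfectly skew-symmetric matrix
idx_A = asymmetry_index(A)     # somewhere in between for a general matrix
print(idx_sym, idx_skew, idx_A)
```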

## 6.12 What about matrix division?

- matrix division does not exist in general, but there's an equivalent operation
- since $\frac{2}{3} = 2 \times 3^{-1}$, we can perform matrix "division" by multiplying by the inverse (which requires $B$ to be invertible):

$AB^{-1}$

- This is important enough that it has its own chapter (ch 12)
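
A minimal sketch of this "division" in NumPy, using small made-up matrices where `B` is invertible. It also shows an alternative that avoids forming the inverse explicitly by solving a linear system with `np.linalg.solve`; that alternative is a common numerical practice rather than something taken from the book. Order matters here too: $AB^{-1}$ and $B^{-1}A$ are generally different.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[2.0, 0.0], [1.0, 1.0]])   # invertible (determinant is 2)

# "A divided by B" as post-multiplication by the inverse
X = A @ np.linalg.inv(B)
print(np.allclose(X @ B, A))   # True: multiplying back by B recovers A

# Same result without an explicit inverse: solve B^T X^T = A^T
X_solve = np.linalg.solve(B.T, A.T).T
print(np.allclose(X, X_solve))  # True

# Pre-multiplying by the inverse gives a different matrix
Y = np.linalg.inv(B) @ A
print(np.allclose(X, Y))        # False
```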

## 6.13-6.14 Exercises

## 6.15 Code Challenges