# 8 OLS Matrix Form

In a given dataset, the linear relationship between dependent and independent variables can be expressed in a general format as follows:

$$
\begin{align}
    y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + ... + \beta_k x_{1k} + u_1 \\
    y_2 &= \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + ... + \beta_k x_{2k} + u_2 \\
    &... \\
    y_n &= \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + ... + \beta_k x_{nk} + u_n \\ 
\end{align}
$$

which is essentially a linear equation system, and therefore linear algebra comes into play. For simplicity, this linear equation system can be expressed in a matrix form,

$$\vec{y} = X\vec{\beta} + \vec{u}$$

where $\vec{y} = (y_1,...,y_n)^T$, $\vec{\beta}=(\beta_0,...,\beta_k)^T$, $\vec{u} = (u_1,...,u_n)^T$, and

$$
X = \begin{bmatrix}
    1 & x_{11} & x_{12} &\dots & x_{1k} \\
    1 & x_{21} & x_{22} &\dots & x_{2k} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    1 & x_{n1} & x_{n2} & \dots & x_{nk}
\end{bmatrix}
$$

X is also known as the **design matrix**.
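As a rough illustration, the design matrix can be assembled in NumPy by putting a column of ones in front of the regressors. The data and coefficient values below are made up for the sketch and are not taken from any dataset in this chapter.

```python
import numpy as np

# Hypothetical data: n = 4 observations of k = 2 regressors
x = np.array([[3.0, 21.0],
              [3.2, 24.0],
              [3.6, 26.0],
              [3.5, 27.0]])
n, k = x.shape

# Design matrix: a column of ones (intercept) followed by the regressors
X = np.column_stack([np.ones(n), x])

beta = np.array([1.0, 0.5, 0.01])  # (beta_0, beta_1, beta_2), made-up values
u = np.zeros(n)                    # error term set to zero for illustration

# The whole equation system in one line: y = X beta + u
y = X @ beta + u

print(X.shape)  # (4, 3), i.e. n rows and k + 1 columns
print(y.shape)  # (4,)
```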

# 8.1 Why matrix form?

Matrix form simplifies the presentation of a linear system and lets us deal with multiple equations at a time. There are profound theorems and properties in **linear algebra** that can help us reach a conclusion in a few steps.
> If you are interested in matrix forms, you can refer to textbooks for linear algebra.

# 8.2 Matrix addition and scalar multiplication

Adding matrices is very simple. Just add each element in the first matrix to the corresponding element in the second matrix.

$
\begin{bmatrix}
    1 & 2 \\
    3 & 4
\end{bmatrix} + 
\begin{bmatrix}
    2 & 3 \\
    4 & 5
\end{bmatrix} =
\begin{bmatrix}
    1+2 & 2+3 \\
    3+4 & 4+5
\end{bmatrix} =
\begin{bmatrix}
    3 & 5 \\
    7 & 9
\end{bmatrix}
$

In order to add two matrices, the entries must correspond. Therefore, addition and subtraction of matrices are only possible when the matrices have the same dimensions. Matrix addition is commutative and associative, so the following is true:

$A+B = B+A$ \
$(A+B)+C = A+(B+C)$

Scalar multiplication of a matrix is done by multiplying each element by that scalar.

$
3 \times \begin{bmatrix}
    1 & 2 \\
    3 & 4
\end{bmatrix} =
\begin{bmatrix}
    3 \times 1 & 3 \times 2 \\
    3 \times 3 & 3 \times 4
\end{bmatrix} =
\begin{bmatrix}
    3 & 6 \\
    9 & 12
\end{bmatrix}
$

And we have the following properties for scalar multiplication of a matrix

 - Left and right distributivity: $c(A+B) = cA+cB = (A+B)c$
 - Associativity: $cdA = c(dA)$
 - Null: $0\times A = 0_{m\times n}$

The "Identity Matrix" is the matrix equivalent of the number "1":

$I_{3\times 3} = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}$
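The rules in this section can be checked numerically. The sketch below reuses the two matrices from the addition example; the third matrix `C` and the scalars `c` and `d` are arbitrary values chosen only for illustration.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[2, 3], [4, 5]])
C = np.array([[0, 1], [1, 0]])   # arbitrary third matrix
c, d = 3, 2                      # arbitrary scalars

# Addition is elementwise, commutative and associative
print(np.array_equal(A + B, B + A))              # True
print(np.array_equal((A + B) + C, A + (B + C)))  # True

# Scalar multiplication: distributivity, associativity, null
print(np.array_equal(c * (A + B), c * A + c * B))  # True
print(np.array_equal((c * d) * A, c * (d * A)))    # True
print(np.array_equal(0 * A, np.zeros((2, 2))))     # True

# The identity matrix leaves a matrix unchanged under multiplication
I = np.eye(2, dtype=int)
print(np.array_equal(I @ A, A) and np.array_equal(A @ I, A))  # True
```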

# 8.3 Transpose of a Matrix

Transposing a matrix means flipping it over its main diagonal, so that rows become columns.

If $A = \begin{bmatrix}
    1 & 2 \\
    3 & 4
\end{bmatrix}$ then,

$A^T = \begin{bmatrix}
    1 & 3 \\
    2 & 4
\end{bmatrix}$

**Properties**

- $(AB)^T = B^TA^T$
- $(A^T)^T = A$
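A quick numerical check of the two transpose properties, as a sketch. The non-square matrix `B` is a made-up example, used to show that the order reversal in $(AB)^T = B^TA^T$ is needed for the dimensions to match.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])          # 2 x 2, from the example above
B = np.array([[2, 3, 1], [4, 5, 0]])    # 2 x 3, hypothetical

At = A.T                                 # transpose in NumPy
print(np.array_equal(At, np.array([[1, 3], [2, 4]])))  # True

# (AB)^T has shape 3 x 2, and so does B^T A^T
print((A @ B).T.shape)                          # (3, 2)
print(np.array_equal((A @ B).T, B.T @ A.T))     # True
print(np.array_equal(A.T.T, A))                 # True
```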

# 8.4 Matrix multiplication

To multiply a matrix by another matrix we need to do the "dot product" of rows and columns, i.e. If 

$
A = \begin{bmatrix}
    \vec{a_1}^T \\
    \vec{a_2}^T \\
    \vdots \\
    \vec{a_n}^T
\end{bmatrix}
$ and $
B = \begin{bmatrix}
    \vec{b_1} & \vec{b_2} & \dots & \vec{b_k}
\end{bmatrix}
$, then

$
AB = \begin{bmatrix}
    A\vec{b_1} & A\vec{b_2} & \dots & A\vec{b_k}
\end{bmatrix} =
\begin{bmatrix}
    \vec{a_1}^T\vec{b_1} & \vec{a_1}^T\vec{b_2} & \dots & \vec{a_1}^T \vec{b_k}\\
    \vec{a_2}^T\vec{b_1} & \vec{a_2}^T\vec{b_2} & \dots & \vec{a_2}^T \vec{b_k}\\
    \vdots & \vdots & \ddots & \vdots\\
    \vec{a_n}^T\vec{b_1} & \vec{a_n}^T\vec{b_2} & \dots & \vec{a_n}^T \vec{b_k}
\end{bmatrix}
$

**special case**

If B has only one column $\vec{b}$, then

$A\vec{b}= \begin{bmatrix}
    \vec{a_1}^T\vec{b} \\
    \vec{a_2}^T\vec{b} \\
    \vdots \\
    \vec{a_n}^T\vec{b}
\end{bmatrix} $


If A has only one row $\vec{a}^T$, then

$\vec{a}^TB = \begin{bmatrix}
    \vec{a}^T\vec{b_1} & \vec{a}^T\vec{b_2} & \dots & \vec{a}^T\vec{b_k}
\end{bmatrix} $

> These two special cases show when you can factor out a vector. 

![matrix-multiplication.png](./images/matrix-multiplication.png)

With the definition of matrix multiplication, we have the following properties

 - $AB \neq BA$ in general (matrix multiplication is not commutative)
 - $A(B+C) = AB + AC$
 - $ABC = A(BC)$
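The "rows times columns" rule and the properties above can be verified on small matrices. This is only a sketch: `A` and `B` are reused from section 8.2 and `C` is an arbitrary extra matrix.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[2, 3], [4, 5]])
C = np.array([[1, 0], [2, 1]])   # arbitrary third matrix

# Entry (i, j) of AB is the dot product of row i of A with column j of B
AB_manual = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
                      for i in range(A.shape[0])])
print(np.array_equal(AB_manual, A @ B))          # True

# Not commutative in general
print(np.array_equal(A @ B, B @ A))              # False

# Distributive and associative
print(np.array_equal(A @ (B + C), A @ B + A @ C))  # True
print(np.array_equal((A @ B) @ C, A @ (B @ C)))    # True
```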

# 8.5 Inverse of a matrix

The inverse of a square matrix $A$ is the matrix $A^{-1}$ that satisfies $A^{-1}A = AA^{-1} = I$. Not every matrix has an inverse.

The inverse matrix is used to mimic *division* of real numbers. For example, if $AB = C$ and $B$ is invertible, we can rewrite the equation as $A = CB^{-1}$. (This is because $ABB^{-1} = CB^{-1}$, so $AI = CB^{-1}$, and therefore $A = CB^{-1}$.)

**properties**

- $(AB)^{-1} = B^{-1}A^{-1}$
- $(A^{-1})^{-1} = A$
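The inverse and its properties can be checked with `np.linalg.inv`. The matrices below are the invertible 2 x 2 examples used earlier; floating-point results are compared with `np.allclose` rather than exact equality.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[2.0, 3.0], [4.0, 5.0]])

A_inv = np.linalg.inv(A)
B_inv = np.linalg.inv(B)

# Definition: A^{-1} A = A A^{-1} = I
print(np.allclose(A_inv @ A, np.eye(2)))   # True
print(np.allclose(A @ A_inv, np.eye(2)))   # True

# "Division": if AB = C, then A = C B^{-1}
C = A @ B
print(np.allclose(C @ B_inv, A))           # True

# Properties
print(np.allclose(np.linalg.inv(A @ B), B_inv @ A_inv))  # True
print(np.allclose(np.linalg.inv(A_inv), A))              # True
```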

# 8.6 Matrix derivatives

To mimic the derivative of a real valued function - for example, if $y = \beta_0 + \beta_1 x +u$, then $\frac{\partial y}{\partial x}=\beta_1$ - we define **vector (matrix with one column) derivatives** as 

$\frac{\partial \vec{y}_{n\times 1}}{\partial \vec{x}_{k\times 1}} = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \dots & \frac{\partial y_1}{\partial x_k} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \dots & \frac{\partial y_2}{\partial x_k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_n}{\partial x_1} & \frac{\partial y_n}{\partial x_2} & \dots & \frac{\partial y_n}{\partial x_k}
\end{bmatrix}_{n\times k}$

**properties**

- if $y = Ax$ then $\frac{\partial \vec{y}}{\partial \vec{x}} = A$
- if $z = Ay$ and $y = Bx$, then $\frac{d \vec{z}}{d \vec{x}} = \frac{\partial \vec{z}}{\partial \vec{y}}\frac{\partial \vec{y}}{\partial \vec{x}} = AB$ - chain rule

- if a scalar $ c = \vec{y}^TA\vec{x}$, then $\frac{\partial c}{\partial \vec{x}} = \vec{y}^TA$ and $\frac{\partial c}{\partial \vec{y}} = (A\vec{x})^T$

- if a scalar $c = \vec{y}^T A \vec{x}$, where both $\vec{x}$ and $\vec{y}$ are functions of $\vec{z}$, then $\frac{d c}{d \vec{z}} = \frac{\partial c}{\partial \vec{x}} \frac{\partial \vec{x}}{\partial \vec{z}} + \frac{\partial c}{\partial \vec{y}} \frac{\partial \vec{y}}{\partial \vec{z}} = \vec{y}^TA \frac{\partial \vec{x}}{\partial \vec{z}} + \vec{x}^TA^T \frac{\partial \vec{y}}{\partial \vec{z}}$

- if a scalar $ c = \vec{x}^TA\vec{x}$, then $\frac{\partial c}{\partial \vec{x}} = \vec{x}^T(A+A^T)$
- if a scalar $ c = \vec{x}^TA\vec{x}$ and A is symmetric, then $\frac{\partial c}{\partial \vec{x}} = 2\vec{x}^TA$
- if a scalar $ c = \vec{x}^T\vec{x}$, then $\frac{\partial c}{\partial \vec{x}} = 2\vec{x}^T$
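One way to gain confidence in these rules is to compare them with finite-difference approximations. The helper `numerical_jacobian` below is a hypothetical function written for this sketch (it is not part of NumPy), and the matrix and vector are random.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian: rows index outputs, columns index inputs."""
    x = np.asarray(x, dtype=float)
    f0 = np.atleast_1d(f(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))) / (2 * eps)
    return J

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

# y = Ax  ->  dy/dx = A
J_linear = numerical_jacobian(lambda v: A @ v, x)
print(np.allclose(J_linear, A, atol=1e-5))                # True

# c = x^T A x  ->  dc/dx = x^T (A + A^T), a 1 x 3 row vector
J_quad = numerical_jacobian(lambda v: v @ A @ v, x)
print(np.allclose(J_quad, x @ (A + A.T), atol=1e-5))      # True

# c = x^T x  ->  dc/dx = 2 x^T
J_norm = numerical_jacobian(lambda v: v @ v, x)
print(np.allclose(J_norm, 2 * x, atol=1e-5))              # True
```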

# 8.7 Convert linear equation to matrix form

We've learned many matrix operations. The next question is how to convert a linear equation to its matrix form so that we can use all of the properties mentioned above.

> Indeed, the properties of matrices are derived the other way round. We first convert matrix operations to their linear equation forms and then prove the properties.

### Summation

$\sum_k x_k\beta_k$ can be expressed as $\vec{x}^T \vec{\beta}$

**special case**: $\sum_k x_k^2 = \vec{x}^T \vec{x}$, which is also denoted as $||\vec{x}||_2^2$ - the squared $L_2$ norm of a vector.

### Double summation

$\sum_i\sum_j a_{ij} x_i y_j$ can be rewritten as $\vec{x}^TA\vec{y}$

**special case 1**: $\vec{x}^TA\vec{x}$ is called the *quadratic form* of A.

**special case 2**: $(\sum_i x_i)^2 = \vec{x}^T \textbf{1}_{n\times n} \vec{x}$, where $\textbf{1}_{n\times n}$ is the $n\times n$ matrix whose entries are all ones.
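The summation identities above can be confirmed by writing the sums out as explicit loops and comparing them with the matrix expressions. All numbers below are arbitrary illustration values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -1.0, 2.0])
y = np.array([2.0, 0.0, -1.0])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [4.0, 0.0, 1.0]])
n = x.size

# Single summation: sum_k x_k beta_k = x^T beta
single = sum(x[k] * beta[k] for k in range(n))
print(np.isclose(single, x @ beta))                       # True

# Sum of squares: x^T x, the squared L2 norm
print(np.isclose(np.sum(x ** 2), x @ x))                  # True
print(np.isclose(np.linalg.norm(x) ** 2, x @ x))          # True

# Double summation: sum_i sum_j a_ij x_i y_j = x^T A y
double = sum(A[i, j] * x[i] * y[j] for i in range(n) for j in range(n))
print(np.isclose(double, x @ A @ y))                      # True

# Squared sum, using the all-ones matrix
ones = np.ones((n, n))
print(np.isclose(x.sum() ** 2, x @ ones @ x))             # True
```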

### Stacking - equation system

Suppose that for each row $i$ of the equation system we have $\sum_k x_{ik}\beta_k = y_i$. Then by stacking the equations we have

$
\begin{bmatrix}
    \sum_k x_{1k}\beta_k \\
    \sum_k x_{2k}\beta_k \\
    \vdots \\
    \sum_k x_{nk}\beta_k 
\end{bmatrix} = 
\begin{bmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_n
\end{bmatrix}
$

Then we can rewrite each row to its matrix form and stack them together

$
\begin{bmatrix}
    \vec{x_1}^T\vec{\beta} \\
    \vec{x_2}^T\vec{\beta} \\
    \vdots \\
    \vec{x_n}^T\vec{\beta} 
\end{bmatrix} = 
\begin{bmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_n
\end{bmatrix}
$

$\vec{\beta}$ can be factored out using matrix multiplication

$
\begin{bmatrix}
    \vec{x_1}^T \\
    \vec{x_2}^T \\
    \vdots \\
    \vec{x_n}^T 
\end{bmatrix}\vec{\beta} = 
\begin{bmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_n
\end{bmatrix}
$

This is often rewritten as $X\vec{\beta} = \vec{y}$
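The stacking step can be mirrored in code: computing $\vec{x_i}^T\vec{\beta}$ row by row and stacking the results gives the same vector as the single product $X\vec{\beta}$. The matrix and coefficients here are made-up values for the sketch.

```python
import numpy as np

# Hypothetical stacked rows x_i^T (first column is the constant)
X = np.array([[1.0, 3.0, 21.0],
              [1.0, 3.2, 24.0],
              [1.0, 3.6, 26.0],
              [1.0, 3.5, 27.0]])
beta = np.array([1.0, 0.5, 0.01])

# Row by row: one dot product per equation, then stack
row_by_row = np.array([X[i] @ beta for i in range(X.shape[0])])

# Factored form: beta pulled out to the right of the stacked rows
factored = X @ beta

print(np.allclose(row_by_row, factored))  # True
```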

# 8.8 Application - OLS

Consider a model $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$, then in matrix form this becomes

$$y = \vec{x}^T\vec{\beta}+ u$$

This equation holds for each observation, therefore we can stack them together and get
$$\vec{y} = X\vec{\beta}+\vec{u}$$

Apply the OLS method where we solve 

$$\min_{\beta_0,\beta_1,\beta_2} \sum_i [y_i-(\beta_0+\beta_1 x_{i1} + \beta_2 x_{i2})]^2$$

This is equivalent to

$$\min_{\beta_0,\beta_1,\beta_2} \sum_i u_i^2 = \vec{u}^T\vec{u}$$

Since $\vec{u} = \vec{y} - X\vec{\beta}$, we have

$$\min_{\vec{\beta}} (\vec{y} - X\vec{\beta})^T(\vec{y}-X\vec{\beta})$$

To solve this optimization problem, we use the first order condition. Let $v = (\vec{y} - X\vec{\beta})^T(\vec{y}-X\vec{\beta})$; then we need

$$\frac{\partial v}{\partial \vec{\beta}} = \vec{0}_{1\times 3}$$

Apply the chain rule for matrix derivatives, noting that $\frac{\partial (\vec{y}-X\vec{\beta})}{\partial \vec{\beta}} = -X$

$$\frac{\partial v}{\partial \vec{\beta}} = 2(\vec{y}-X\vec{\beta})^T\frac{\partial (\vec{y}-X\vec{\beta})}{\partial \vec{\beta}} = -2(\vec{y}-X\vec{\beta})^TX = \vec{0}_{1\times 3}$$

Transpose both sides of the last equation

$$-2X^T(\vec{y}-X\vec{\beta})=\vec{0}_{3\times 1}$$

Rearrange the equation

$$X^TX\vec{\beta} = X^T\vec{y}$$
$$\vec{\hat{\beta}} = (X^TX)^{-1}X^T\vec{y}$$
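The derivation can be checked on simulated data. The sketch below uses made-up coefficients and random regressors, solves the normal equations $X^TX\vec{\beta} = X^T\vec{y}$ with `np.linalg.solve` (which avoids forming the inverse explicitly), and confirms that the first order condition holds at the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data: intercept plus two regressors, made-up true coefficients
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# First order condition: X^T (y - X beta_hat) should be (numerically) zero
foc = X.T @ (y - X @ beta_hat)
print(np.allclose(foc, 0.0, atol=1e-8))          # True

# Cross-check against NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))         # True
print(beta_hat)                                  # close to beta_true
```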

# 8.9 Application - Ridge Regression

As mentioned in chapter 7-2, ridge regression tries to reduce the MSE by deliberately introducing bias in order to reduce variance.

The motivation for ridge regression is direct. The more variables we introduce into the model, the more $\beta$s have to be estimated. To keep the model from relying on many large coefficients, we add a penalty on the size of $\vec{\beta}$ to the minimization problem.

$$\min_{\vec{\beta}} ||\vec{y}-X\vec{\beta}||_2^2 + \alpha||\vec{\beta}||_2^2$$

where $||\cdot||_2^2$ is the square of the $L_2$ norm, and $\alpha$ is a hyperparameter (also called a meta parameter), which means we choose its value in advance and then solve the minimization problem.

Taking derivatives and solving the first order condition, we have

$$\vec{\hat{\beta}} = (X^TX+\alpha I)^{-1}X^T\vec{y}$$

Look, we added a "ridge" (diagonal matrix) to $X^TX$, hence the name.
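A minimal sketch of the closed-form ridge estimator on simulated data. The data, the true coefficients and the value of `alpha` are arbitrary, and for simplicity the model has no intercept, so every coefficient is penalized.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Simulated data without an intercept (assumption made for this sketch)
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
alpha = 10.0  # penalty strength, chosen by hand

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# First order condition of the ridge objective:
# -2 X^T (y - X beta) + 2 alpha beta = 0
grad = -2 * X.T @ (y - X @ beta_ridge) + 2 * alpha * beta_ridge
print(np.allclose(grad, 0.0))                                   # True

# The penalty shrinks the coefficient vector towards zero
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))    # True
```

Setting `alpha = 0` recovers the OLS estimator from section 8.8.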

# 8.10 Application - Lasso Regression

Lasso stands for *Least Absolute Shrinkage and Selection Operator*, and it is similar to the ridge regression.

$$\min_{\vec{\beta}} \frac{1}{2n}||\vec{y}-X\vec{\beta}||_2^2 + \alpha||\vec{\beta}||_1$$

Instead of using $L_2$ norm, Lasso uses the $L_1$ norm of vector $\vec{\beta}$. 

> $||\beta||_1 = \sum_p |\beta_p|$

Hence, Lasso is also known as the $L_1$ regularization, and Ridge as the $L_2$ regularization.
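Unlike OLS and ridge, the Lasso problem has no closed-form solution in general, because the $L_1$ penalty is not differentiable at zero. One common numerical approach, which is not covered in this chapter and is shown here only as a sketch, is coordinate descent with soft-thresholding. The functions `soft_threshold` and `lasso_cd` are hypothetical helpers written for this example, the data are simulated, and there is no intercept.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha, and set it to zero if |rho| <= alpha."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)||y - X b||_2^2 + alpha ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n          # (1/n) x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of feature j added back
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, alpha) / z[j]
    return beta

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_no_penalty = lasso_cd(X, y, alpha=0.0)   # should reproduce OLS
beta_lasso = lasso_cd(X, y, alpha=0.1)        # moderate penalty
beta_heavy = lasso_cd(X, y, alpha=1e3)        # penalty so large that all coefficients vanish

print(np.allclose(beta_no_penalty, beta_ols, atol=1e-6))  # True
print(np.all(beta_heavy == 0.0))                           # True
print(beta_lasso)
```

The "selection" in the name refers to the fact that a large enough penalty sets coefficients exactly to zero, which ridge regression does not do.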

# 8.11 Python matrix operation - numpy & patsy

**Numpy** supports matrix operations and **patsy** is used for building design matrices. 

Consider a model $colGPA = \beta_0 + \beta_1 hsGPA +\beta_2 ACT + u$. Instead of using statsmodels, we can calculate the estimators using matrix operations.

First, we need the design matrix X, whose first column contains n ones, and whose other columns contain the observations of the independent variables.

In [1]:
import wooldridge as woo
import patsy as pt

df = woo.data("gpa1")

#create the design matrix (and the y vector) using patsy
y, X = pt.dmatrices("colGPA~hsGPA+ACT",data=df)
# pt.dmatrix("hsGPA+ACT", data=df) function will return X only
print(y[:5]) 
print(X[:5])

[[3.       ]
 [3.4000001]
 [3.       ]
 [3.5      ]
 [3.5999999]]
[[ 1.          3.         21.        ]
 [ 1.          3.20000005 24.        ]
 [ 1.          3.5999999  26.        ]
 [ 1.          3.5        27.        ]
 [ 1.          3.9000001  28.        ]]


Next, we can calculate $\hat{\beta}$ using the formula: $\vec{\hat{\beta}} = (X^TX)^{-1}X^T \vec{y}$

This equation involves three matrix operations
- Transpose: X.T in numpy
- Matrix multiplication: X.T @ X in numpy
- Inverse: np.linalg.inv(X.T @ X)

In [2]:
import numpy as np

In [3]:
b_h = np.linalg.inv(X.T@X)@X.T@y
print(b_h)

[[1.28632777]
 [0.45345589]
 [0.00942601]]


Other Examples:
- Elementwise multiplication

In [4]:
A = np.array([[1,2],[3,4]])
B = np.array([[2,3],[4,5]])
A*B

array([[ 2,  6],
       [12, 20]])

- Eigenvalues and eigenvectors

In [5]:
np.linalg.eig(A)

(array([-0.37228132,  5.37228132]),
 array([[-0.82456484, -0.41597356],
        [ 0.56576746, -0.90937671]]))

- Determinant

In [6]:
np.linalg.det(A)

-2.0000000000000004

- Rank

In [7]:
np.linalg.matrix_rank(A)

2

### Exercise

The estimated variance of u is given by $s^2 = \frac{1}{n-k-1}\sum \hat{u_i}^2=\frac{1}{n-k-1}\vec{\hat{u}}^T\vec{\hat{u}}$, and the variance-covariance matrix of $\vec{\hat{\beta}}$ is given by $s^2(X^TX)^{-1}$. Try to calculate the variance-covariance matrix.

In [8]:
# Residuals
u_h = y-X@b_h
# X.shape[1] equals k+1 (intercept included), so the denominator is n-k-1
s_2 = u_h.T@u_h/(len(u_h)-X.shape[1])
vcov = s_2*np.linalg.inv(X.T@X)
vcov

array([[ 1.16159717e-01, -2.26063687e-02, -1.59084858e-03],
       [-2.26063687e-02,  9.18011489e-03, -3.57076664e-04],
       [-1.59084858e-03, -3.57076664e-04,  1.16147776e-04]])

In [9]:
# Take the diagonal values
var = np.diagonal(vcov)

# Take the square root of variances to get the standard error
np.sqrt(var)

array([0.34082212, 0.09581292, 0.01077719])

In [10]:
# compare the result with smf
import statsmodels.formula.api as smf
smf.ols("colGPA~hsGPA + ACT",data=df).fit().bse

Intercept    0.340822
hsGPA        0.095813
ACT          0.010777
dtype: float64