# **Equations for multivariate statistics cheat sheet**

### **Week 1 lecture A**

##### **Multivariate normal or Gaussian distribution (slide 45)**



$p$-dimensional random variable:
$$
X = \begin{bmatrix} X_1 \\ \vdots \\ X_p \end{bmatrix} \in N_p(\mu, \Sigma)
$$

Mean value or expectation:
$$
\mu = E(X) = \begin{bmatrix} E(X_1) \\ \vdots \\ E(X_p) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_p \end{bmatrix}
$$
Dispersion or variance matrix:
$$
\Sigma = D(X) = \begin{bmatrix} V(X_1) & \dots & Cov(X_1, X_p) \\ \vdots & \ddots & \vdots \\ Cov(X_p, X_1) & \dots & V(X_p) \end{bmatrix}= \begin{bmatrix} \sigma_1^2 & \dots & \sigma_1 \sigma_p \rho_{1p} \\ \vdots & \ddots & \vdots \\ \sigma_p \sigma_1 \rho_{p1} & \dots & \sigma_p^2 \end{bmatrix}
$$
Frequency or density function:
$$
f(x) = \frac{1}{\sqrt{(2\pi)}^p \sqrt{\text{det}{\Sigma}}} \exp \left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
$$
The contours of the density are ellipsoids with main axes given by the eigenvectors of the variance matrix. For a constant density level $c$:
$$
f(x) = c \Leftrightarrow (x-\mu)^T \Sigma^{-1}(x-\mu) = k
$$
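As a small numeric check of the density formula, the sketch below evaluates $f(x)$ for $p=2$ in plain Python. The function name `mvn_pdf_2d` and the example values are illustrative assumptions, not part of the course material.

```python
import math

def mvn_pdf_2d(x, mu, sigma):
    """Density of N_2(mu, sigma) at x; sigma is a 2x2 nested list."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Inverse of a 2x2 matrix
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T sigma^{-1} (x - mu)
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    # For p = 2 the factor sqrt(2*pi)^p equals 2*pi
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

# Hypothetical example: standard bivariate normal
identity = [[1.0, 0.0], [0.0, 1.0]]
at_mean = mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], identity)
on_contour = mvn_pdf_2d([1.0, 0.0], [0.0, 0.0], identity)
print(at_mean, on_contour)
```

The points $(1,0)$ and $(0,1)$ give the same quadratic form $k=1$, so they lie on the same contour and have the same density.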

##### **Estimation of parameters (slides 46-47)**

The sample mean of the $n$ observations:
$$
\hat{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
$$
The empirical variance-covariance matrix:
$$
S = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \hat{X})(X_i - \hat{X})^T = \frac{1}{n-1} \sum_{i=1}^{n} X_iX_i^T - \frac{n}{n-1} \hat{X}\hat{X}^T
$$
Other formulas for mean and dispersion:
$$
\hat{X} = \frac{1}{n} X^T \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \frac{1}{n} X^T 1
$$
For variance:
$$
(n-1)S = \sum_{i=1}^{n} (X_i - \hat{X})(X_i - \hat{X})^T = X^TX - n \hat{X}\hat{X}^T = X^TX - \frac{1}{n} X^T 11^TX
$$
Correlation matrix:
$$
\rho_{ij} = \frac{Cov(X_i, X_j)}{\sqrt{V(X_i)} \sqrt{V(X_j)}} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}
$$
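The two expressions for $S$ can be checked against each other on a tiny data set. This is a minimal sketch in plain Python; the helper names and the data are made up for illustration.

```python
def sample_mean(rows):
    n, p = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]

def sample_cov(rows):
    """S = 1/(n-1) * sum (x_i - xbar)(x_i - xbar)^T."""
    n, p = len(rows), len(rows[0])
    xbar = sample_mean(rows)
    return [[sum((r[i] - xbar[i]) * (r[j] - xbar[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def sample_cov_alt(rows):
    """S via (n-1) S = X^T X - n * xbar xbar^T."""
    n, p = len(rows), len(rows[0])
    xbar = sample_mean(rows)
    return [[(sum(r[i] * r[j] for r in rows) - n * xbar[i] * xbar[j]) / (n - 1)
             for j in range(p)] for i in range(p)]

# Hypothetical data: n = 4 observations of p = 2 variables
data = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
xbar = sample_mean(data)
S = sample_cov(data)
S_alt = sample_cov_alt(data)

# Empirical correlation from the covariance matrix
r12 = S[0][1] / (S[0][0] ** 0.5 * S[1][1] ** 0.5)
print(xbar, S, r12)
```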

##### **Summary of calculation rules for univariate (slide 54):**


Mean rules:
$$
E[a+bX] = a+bE[X]
$$
$$
E[X+Y] = E[X] + E[Y]
$$

Variance rules:
$$
V[a+bX] = b^2V[X]
$$
$$
V[X+Y] = V[X] + V[Y] + 2Cov(X,Y)
$$
Variance if $X$ and $Y$ are independent:
$$
V[X+Y] = V[X] + V[Y]
$$

Covariance rules:
$$
Cov[X,X] = V[X]
$$
$$
Cov[X,Y] = Cov[Y,X]
$$
$$
Cov[aX, bY] = abCov[X,Y]
$$
$$
Cov[X, Y+Z] = Cov[X,Y] + Cov[X,Z]
$$
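The variance and covariance rules also hold exactly for sample moments, which gives a quick sanity check. The sketch below uses made-up data and a hand-written `cov` helper (both assumptions for illustration).

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with divisor n - 1; cov(u, u) is the sample variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
s = [a + b for a, b in zip(x, y)]

lhs = cov(s, s)                              # V[X + Y]
rhs = cov(x, x) + cov(y, y) + 2 * cov(x, y)  # V[X] + V[Y] + 2 Cov(X, Y)

a, b = 3.0, -2.0
scaled = cov([a * v for v in x], [b * v for v in y])  # Cov[aX, bY]
print(lhs, rhs, scaled, a * b * cov(x, y))
```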

##### **Summary of calculation rules for multivariate (slide 54):**


Mean rules:
$$
E[A+X] = A + E[X]
$$
$$
E[AX] = A E[X]
$$
$$
E[XB] = E[X] B
$$
$$
E[X+Y] = E[X] + E[Y]
$$

Variance rules:
$$
V[A+BX] = B V[X] B^T
$$
$$
V[AX] = A V[X] A^T
$$
$$
V[X+Y] = V[X] + V[Y]+Cov[X,Y] + Cov[X,Y]^T
$$

Variance if $X$ and $Y$ are independent:
$$
V[X+Y] = V[X] + V[Y]
$$

Covariance rules:
$$
Cov[X,X] = V[X]
$$
$$
Cov[X,Y] = Cov[Y,X]^T
$$
$$
Cov[AX,BY] = A Cov[X,Y] B^T
$$
$$
Cov[X, Y+Z] = Cov[X,Y] + Cov[X,Z]
$$

##### **Conditional distributions (slide 71):**

If $(X_1, X_2)$ is multivariate normal, then the conditional distribution $\mathbf{L}(X_1|X_2=x_2)$ is normal for every value of $x_2$. Mean formula:
$$
E(X_1|X_2=x_2) = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)
$$
Variance formula:
$$
V(X_1|X_2=x_2)= \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
$$
With:
$$
X \sim N_p (\mu, \Sigma), \quad X= \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
$$
Variance $\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ does not depend on the value of $x_2$.

### **Week 2 lecture B**

##### **Formula for two dimensional normal distribution (slide 22):**

If we have a two-dimensional normal distribution $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim N_2 \left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix} \right)$ with $\rho = Cor(X_1, X_2)$.

Then expectation:
$$
E(X_1|X_2=x_2) = \mu_1 + \frac{\sigma_{12}}{\sigma_2^2} (x_2 - \mu_2) = \mu_1 + \rho \frac{\sigma_1}{\sigma_2} (x_2 - \mu_2)
$$

The variance:
$$
V(X_1|X_2=x_2) = \sigma_1^2 - \frac{\sigma_{12}^2}{\sigma_2^2} = \sigma_1^2 (1-\rho^2)
$$
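A short numeric sketch of the two-dimensional conditional formulas, checking that the covariance form and the correlation form agree. The parameter values are hypothetical.

```python
import math

def conditional_normal_2d(mu1, mu2, var1, var2, cov12, x2):
    """Mean and variance of X1 given X2 = x2 for a bivariate normal."""
    cond_mean = mu1 + cov12 / var2 * (x2 - mu2)
    cond_var = var1 - cov12 ** 2 / var2
    return cond_mean, cond_var

# Hypothetical parameters
mu1, mu2 = 1.0, 2.0
var1, var2, cov12 = 4.0, 9.0, 3.0
rho = cov12 / math.sqrt(var1 * var2)

cond_mean, cond_var = conditional_normal_2d(mu1, mu2, var1, var2, cov12, x2=5.0)

# Same quantities via the correlation form
cond_mean_rho = mu1 + rho * math.sqrt(var1) / math.sqrt(var2) * (5.0 - mu2)
cond_var_rho = var1 * (1 - rho ** 2)
print(cond_mean, cond_var)
```

Note that the conditional variance is the same for every value of `x2`; only the conditional mean moves.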

##### **Formula for partial correlation coefficient (slide 25):**

Partial correlation between some variables given others is:
$$
\rho_{ij|k} = \frac{\rho_{ij} - \rho_{ik}\rho_{jk}}{\sqrt{(1-\rho_{ik}^2)(1-\rho_{jk}^2)}}
$$
From successive conditioning:
$$
\rho_{ij|kl} = \frac{\rho_{ij|k} - \rho_{il|k}\rho_{jl|k}}{\sqrt{(1-\rho_{il|k}^2)(1-\rho_{jl|k}^2)}}
$$
THE OTHER FORMULA applies if you have the conditional covariance matrix $V(Y|X)$:
$$
\rho_{ij|k} = \frac{Cov(X_i, X_j|X_k)}{\sqrt{Var(X_i|X_k)Var(X_j|X_k)}}
$$
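The recursion for partial correlations can be sketched as one small function applied twice. The correlation values below are hypothetical and only meant to show the mechanics.

```python
import math

def partial_corr(r_ij, r_ik, r_jk):
    """First-order partial correlation rho_{ij|k}."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Hypothetical correlations between variables 1, 2, 3, 4
r12, r13, r14 = 0.5, 0.4, 0.2
r23, r24, r34 = 0.3, 0.1, 0.25

# Condition on variable 3
r12_3 = partial_corr(r12, r13, r23)
r14_3 = partial_corr(r14, r13, r34)
r24_3 = partial_corr(r24, r23, r34)

# Successive conditioning: apply the same formula to the first-order partials
r12_34 = partial_corr(r12_3, r14_3, r24_3)
print(round(r12_3, 4), round(r12_34, 4))
```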

##### **Multiple correlation coefficient (slide 33-35):**

We have:
$$
\rho_{y_i|x}= \frac{\sqrt{\sigma_i^T\Sigma_{xx}^{-1}\sigma_i}}{\sqrt{\sigma_{ii}}}
$$
With:
$$
\Sigma_i = \begin{bmatrix} \sigma_{ii} & \sigma_i^T \\ \sigma_i & \Sigma_{xx} \end{bmatrix}
$$
Then:
$$
1-\rho^2_{y_i|x} = \frac{\text{det}(\Sigma_i)}{\sigma_{ii} \text{det}(\Sigma_{xx})} = \frac{V(Y_i|X)}{V(Y_i)}
$$
(this is theorem 1.42 in the book).

##### **Multiple correlation $F$-statistic (slide 39):**

Formula based on $n$ observations:
$$
\frac{R^2}{1-R^2} \cdot \frac{n-(p-m)-1}{p-m} \sim F(p-m, n-(p-m)-1)
$$
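The statistic follows this $F$-distribution when $\rho_{y_i|x} = 0$ (the null hypothesis) holds. A value above the chosen $F$-quantile leads to rejecting the null hypothesis, meaning the predictors explain a significant part of the variance.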

### **Week 3 lecture C**

##### **Formula for $100(1-\alpha)\%$ confidence ellipsoid for unknown mean (slide 6):**

Confidence region for the mean $\mu$:
$$
\{\mu | (\mu - \bar{x})^Ts^{-1} (\mu-\bar{x}) \leq \frac{p(n-1)}{n(n-p)} F(p, n-p)_{1-\alpha} \}
$$

Prediction region for a coming observation $x$ (also a $100(1-\alpha)\%$ ellipsoid):
$$
\{x | (x - \bar{x})^Ts^{-1}(x-\bar{x}) \leq \frac{p(n+1)(n-1)}{n(n-p)} F(p, n-p)_{1-\alpha} \}
$$

##### **Partial correlation coefficient III formula for $t$-test (slide 13):**

Formula:
$$
\frac{R}{\sqrt{1-R^2}}\sqrt{n-2-(p-m)} \sim t(n-2-(p-m))
$$
REMEMBER this is the standard relationship between a partial correlation and the $t$-distribution. Again, $p$ is the total number of variables in the full model, $m$ is the number in the reduced model, $R$ is the partial correlation coefficient, and $p-m$ is the number of variables being tested.

USE THIS if $R$ is the sample partial correlation between the two variables after conditioning on the control variables and the degrees of freedom are computed correctly.

##### **Formula for squared multiple correlation (slide 17):**


Formula from book:
$$
1-\rho^2_{y_i|x} = \frac{\text{det}\Sigma_i}{\sigma_{ii} \text{det}\Sigma_{xx}}
$$

Formula when variable $1$ has unit variance ($\sigma_{11} = 1$, for example when $\Sigma$ is a correlation matrix):
$$
\rho^2_{1|ab} = \Sigma_{1, ab} \Sigma_{ab, ab}^{-1} \Sigma_{ab, 1}
$$
Here, $1$, $a$, and $b$ are indices for variables in matrices.

If variable $1$ does NOT have unit variance (a general variance matrix), use:
$$
\rho^2_{1|ab} = \Sigma_{1, ab} \Sigma_{ab, ab}^{-1} \Sigma_{ab, 1} \cdot \frac{1}{\sigma_{11}}
$$
So in the general case, remember to divide by the variance of variable $1$.

##### **$t$-test formula for testing that the intercept is zero in a simple (or multiple) linear regression estimated by OLS (exam 2011 problem 1):**

Formula:
$$
t = \frac{\hat{\alpha}-0}{SE(\hat{\alpha})}
$$

##### **The standard t–test for a partial correlation $r$ conditioning on $k$ control variables (exam 2011 problem 5):**

Formula:
$$
t = \frac{r \sqrt{n-k-2}}{\sqrt{1-r^2}}
$$
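A minimal sketch of this $t$-statistic in Python, with hypothetical input values. The $p$-value would then come from a $t(n-k-2)$ table or a statistics package, which is left out here.

```python
import math

def partial_corr_t(r, n, k):
    """t-statistic and degrees of freedom for a partial correlation r
    conditioned on k control variables, based on n observations."""
    df = n - k - 2
    t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t_stat, df

# Hypothetical values: r = 0.5, n = 30 observations, k = 2 controls
t_stat, df = partial_corr_t(r=0.5, n=30, k=2)
print(t_stat, df)
```

With $k=0$ this reduces to the usual test for a simple correlation with $n-2$ degrees of freedom.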

##### **Formula for test of $k-m$ smallest eigenvalues in the correlation matrix (slide 69):**

Formula:
$$
Z_2 = -n \log \frac{\hat{\lambda}_{m+1} \dots \hat{\lambda}_k}{\hat{\lambda}_*^{k-m}} = -n(k-m)\left( \log G - \log(\hat{\lambda}_*) \right)
$$
In this formula, $G$ is the geometric mean of the $k-m$ smallest eigenvalues (so the numerator equals $G^{k-m}$) and $\hat{\lambda}_*$ in the denominator is their arithmetic mean, given as:
$$
\hat{\lambda}_* = \frac{k-\hat{\lambda}_1 - \dots - \hat{\lambda}_m}{k-m} = \frac{\hat{\lambda}_{m+1} + \dots + \hat{\lambda}_k}{k-m}
$$

BUT for the degrees of freedom, use the standard Bartlett-type chi-square value for testing that the last $q$ eigenvalues are equal:

$$
\text{df} = \tfrac{1}{2}(q-1)(q+2)
$$

Here, $q=k-m$ is the number of smallest eigenvalues being tested.

When testing whether all eigenvalues are equal, use the degrees of freedom:
$$
\text{df} = \frac{1}{2}k(k-1)
$$
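The test statistic $Z_2$ and the Bartlett-type degrees of freedom can be sketched as follows. The eigenvalues, $m$ and $n$ are hypothetical; the eigenvalues sum to $k=4$ as they would for a correlation matrix.

```python
import math

def equal_eigenvalue_test(eigenvalues, m, n):
    """Z_2 and df for testing that the k - m smallest eigenvalues are equal."""
    lam = sorted(eigenvalues, reverse=True)
    k = len(lam)
    q = k - m
    tail = lam[m:]                       # the q smallest eigenvalues
    arith = sum(tail) / q                # arithmetic mean (lambda star)
    geom = math.exp(sum(math.log(v) for v in tail) / q)  # geometric mean G
    z2 = -n * q * (math.log(geom) - math.log(arith))
    df = (q - 1) * (q + 2) / 2
    return z2, df

# Hypothetical eigenvalues of a 4x4 correlation matrix
z2, df = equal_eigenvalue_test([2.5, 1.0, 0.3, 0.2], m=2, n=50)
print(z2, df)
```

Since the geometric mean never exceeds the arithmetic mean, $Z_2$ is never negative, and it is zero exactly when the tested eigenvalues are all equal.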

### **Week 4 lecture D**

##### **Formulas for factor loadings and model assumptions:**

Formulas from the book:
$$
V(X) = R = V(AF+G) = AA^T + \Delta
$$
$$
V(X_i) = a_{i1}^2+\dots + a_{im}^2+ \delta_i = h_i^2 + \delta_i = 1
$$
$$
Cov(X, F) = Cov(AF + G, F) = A
$$
$$
Cov(X_i, F_j) = Corr(X_i, F_j) = a_{ij}
$$
$$
V\begin{bmatrix} X \\ F \end{bmatrix} = \begin{bmatrix} R & A \\ A^T & I \end{bmatrix}
$$
$$
V(X|F) = R-AA^T = \Delta
$$
$$
V\begin{bmatrix} X_i \\ F \end{bmatrix} = \begin{bmatrix} 1 & \begin{bmatrix} a_{i1} & \cdots & a_{im} \end{bmatrix} \\ \begin{bmatrix} a_{i1} \\ \vdots \\ a_{im} \end{bmatrix} & I \end{bmatrix}
$$
$$
V(X_i|F) = 1-[a_{i1} \cdots a_{im}]I^{-1} \begin{bmatrix} a_{i1} \\ \vdots \\ a_{im} \end{bmatrix} = 1 - (a_{i1}^2 + \dots + a_{im}^2) = \delta_i
$$
$$
V(X_i) - V(X_i|F) = a_{i1}^2 + \dots + a_{im}^2 = h_i^2
$$
REMEMBER:
$$
E(X_1|X_2 = x_2) = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)
$$
$$
V(X_1|X_2 = x_2) = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
$$

##### **Formula for reconstructing the factor scores (slide 48):**

Formulas:
$$
V(X_i|F_j) = 1-a_{ij}^2
$$
$$
V(X_1|F_j) + \dots + V(X_k|F_j) = k-(a_{1j}^2 + \dots + a_{kj}^2)
$$
$$
\sum_{i=1}^{k} V(X_i) - \{V(X_1|F_j) + \dots + V(X_k|F_j)\} = a_{1j}^2 + \dots + a_{kj}^2
$$
$$
V \begin{bmatrix} F \\ X \end{bmatrix} = \begin{bmatrix} I & A^T \\ A & R \end{bmatrix}
$$
$$
E(F|X=x) = \mu_F + A^T R^{-1}(x-\mu_X) = A^TR^{-1} x
$$

##### **PCA formula (slide 49):**

Formula:
$$
\Sigma = P \Lambda P^T = P \Lambda^{1/2} \Lambda^{1/2} P^T = P \Lambda^{1/2} (P \Lambda^{1/2})^T = AA^T
$$

##### **Formula for communalities (used in ex 7.5):**


Given the following, which is called the factor model:
$$
X = \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \begin{bmatrix} F_1 \\ F_2 \end{bmatrix} + \begin{bmatrix} G_1 \\ G_2 \\ G_3 \end{bmatrix} = AF + G
$$
Then the communalities are given by:
$$
h_i^2 = a_{i1}^2 + a_{i2}^2
$$
So in this example we have $3$ communalities, one for each variable.

##### **Formula for dispersion for factor model (used in ex 7.5):**

The matrix $R$, given as:
$$
R = AA^T+\Delta
$$
Here, $\Delta = V(G)$ is the diagonal variance matrix of the unique factors $G$.

##### **Uniqueness of a variable formula in factor analysis (exam 22 problem 3):**

Formula:
$$
\delta_i = 1- \sum_{j=1}^{m} a_{ij}^2
$$
For instance, given a table of loadings for $m$ rotated factors and all variables, take the loadings of one variable on each factor, square them, sum them up, and subtract the sum from $1$ to get the uniqueness of that variable.

The professor also calls it "simply just $1$ minus the communality", so the uniqueness is the variance in variable $i$ not explained by the common factors.

##### **Explained variance in variable $i$ (communality):**

Formula:
$$
h_i^2 = \sum_{j=1}^{m} a_{ij}^2
$$
Which is for variance in variable $i$ explained by common factors.

##### **Explained fraction of variance of a variable in no intercept regression (exam 2024 problem 1):**


For no intercept regression we have:
$$
\hat{\beta} = (X^TX)^{-1}X^Ty
$$
Explained fraction of variance is:
$$
R^2 = \frac{y^T X (X^TX)^{-1} X^T y}{y^T y}
$$
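For a single predictor the matrix expressions collapse to sums, which makes the no-intercept formulas easy to check by hand. A minimal sketch with made-up data:

```python
def no_intercept_fit(x, y):
    """Slope and R^2 for the regression y = beta * x (no intercept)."""
    sxx = sum(a * a for a in x)              # X^T X
    sxy = sum(a * b for a, b in zip(x, y))   # X^T y
    syy = sum(b * b for b in y)              # y^T y
    beta = sxy / sxx
    r2 = sxy * sxy / (sxx * syy)
    return beta, r2

# Hypothetical data
beta, r2 = no_intercept_fit([1.0, 2.0, 3.0], [2.0, 4.0, 7.0])
print(beta, r2)
```

Note that $R^2$ here is measured against $y^Ty$ (uncentered), not against the variance around the mean of $y$.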

##### **Proportion of total variance explained by factor $j$**

Formula:
$$
\text{Proportion explained by } F_j = \frac{1}{k} \sum_{i=1}^k a_{ij}^2
$$
Here, $k$ is the number of variables. This is the PROPORTION (OR FRACTION) of the total variance (across all variables) explained by factor $j$.
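Communality, uniqueness and the proportion explained by each factor are all sums of squared loadings, taken either along a row or down a column. The sketch below uses a hypothetical $3 \times 2$ loading matrix.

```python
# Hypothetical loadings: rows are variables, columns are factors
loadings = [
    [0.8, 0.3],
    [0.6, 0.5],
    [0.2, 0.9],
]
k = len(loadings)      # number of variables
m = len(loadings[0])   # number of factors

# Row sums of squares: variance in each variable explained by the factors
communality = [sum(a ** 2 for a in row) for row in loadings]

# Uniqueness is 1 minus the communality (standardized variables)
uniqueness = [1 - h for h in communality]

# Column sums of squares divided by k: share of total variance per factor
factor_share = [sum(loadings[i][j] ** 2 for i in range(k)) / k for j in range(m)]

print(communality, uniqueness, factor_share)
```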

### **Week 5 lecture E**

##### **Formulas for the factor model (slide 3):**

First the factor model is:
$$
[X_1 \; X_2 \; \cdots \; X_k]^T = A [F_1 \; F_2 \; \cdots \; F_m]^T + [G_1 \; G_2 \; \cdots \; G_k]^T
$$
And:
$$
X = AF + G
$$
For the expression below, the data has variance $1$, so the data is normalized and we use the correlation matrix.
$$
V(X) = R = \begin{bmatrix} 1 & \dots & \rho_{1k} \\ \vdots & \ddots & \vdots \\ \rho_{k1} & \dots & 1 \end{bmatrix} 
$$
The factors are uncorrelated below:
$$
V(F) = I = I_m = \begin{bmatrix} 1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & 1 \end{bmatrix}
$$
The unique factors are uncorrelated below:
$$
V(G) = \Delta = \begin{bmatrix} \delta_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \delta_k \end{bmatrix}
$$
$F$ and $G$ are uncorrelated:
$$
Cov(F,G) = 0 = \begin{bmatrix} 0 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & 0 \end{bmatrix}
$$

##### **Testing the factor structure formula (slide 19):**

Test statistic:
$$
Z = \left( n-1 - \frac{1}{6}(2k+5)-\frac{2}{3}m\right) \log \left( \frac{|\hat{A}\hat{A}^T + \hat{\Delta}|}{|R|} \right)
$$
Which asymptotically follows a $\chi^2\left( \frac{1}{2}((k-m)^2-k-m)\right)$ distribution where $n$ is observations, $k$ is variables and $m$ is factors.

##### **Squared correlation formula (slide 28-31):**

Formula:
$$
\rho^2 = \frac{\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}}{\sigma_Y^2}
$$

##### **(MORE FORMULAS TO BE ADDED)**

### **Week 6 lecture F**

##### **Canonical correlation formulas (slide 31):**


Formula for covariance between a linear combination $a$ of $Y$ and a linear combination $b$ of $X$:
$$
Cov(a^TY, b^TX) = a^T \Sigma_{YX} b
$$


The correlation is:
$$
Cor(a^TY, b^TX) = \frac{a^T \Sigma_{YX} b}{\sqrt{a^T \Sigma_{YY} a} \sqrt{b^T \Sigma_{XX} b}}
$$


To find the linear combinations for which the correlation between them is maximal, solve:
$$
\max_{a,b}a^T \Sigma_{YX} b \quad \text{under the constraints} \quad a^T \Sigma_{YY} a = 1, \quad b^T \Sigma_{XX} b = 1
$$

##### **Optimal discriminator formula (from ex 6.2 and slide 56):**



Formula to see which groups are most influential and which variables are the strongest:
$$
\delta = \Sigma^{-1}(\mu_1 - \mu_2)
$$

##### **Bayes solution formula for a region (from slide 34 and ex 6.2):**

Formula used in exercise:
$$
\frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{L_{21} p_2}{L_{12} p_1}
$$

For the full one:
$$
R_1 = \left\{ x | \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{L_{21} p_2}{L_{12} p_1} \right\}    
$$

##### **Bayes solution full derivation (from ex 6.2):**

Formula used in exercise:
$$
\frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{L_{21} p_2}{L_{12} p_1}
$$
Then after substitution and simplification, the rule becomes:
$$
\mathbf{x}^T \Sigma^{-1}(\mu_1 - \mu_2) - \frac{1}{2} (\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2) \ge \ln \left(\frac{L_{21} p_2}{L_{12} p_1}\right)
$$
Basically, the left-hand side contains a part that does not depend on $\mathbf{x}$, which we must calculate:
$$
\mathbf{x}^T \hat{\Sigma}^{-1} (\hat{\mu}_1 - \hat{\mu}_2) \underbrace{- \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2}_{\text{The constant part } (-c)}
$$
Moving this part to the right-hand side gives the constant threshold $c$. Hence we need:
$$
c = -\frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 + \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1
$$

##### **Formula for bayes classification rule corresponding to using prior probabilites (used in ex 6.2):**


From above we have (EQUAL PRIORS):
$$
c = -\frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1
$$
This constant $c$ is the scalar part of the discriminant function assuming equal prior probabilities ($\pi_1 = \pi_2 = 0.5$) and equal costs.

But now we have to adjust the decision rule since the prior probabilities ($\pi_k$) have changed. The change in priors introduces an extra term into the classification boundary equation. This term is added to the constant $c$ you calculated in exercise 3.

The general decision rule boundary, comparing class 1 and class 2, is where the linear discriminant function (LDF) equals the threshold $c$ plus the log-ratio of the priors and costs, given as:

$$
\mathbf{x}^T \mathbf{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = \underbrace{\frac{1}{2} (\boldsymbol{\mu}_1^T \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_2)}_{\text{part from ex. 3 (c)}} + \underbrace{\ln \left(\frac{L_{21} \pi_2}{L_{12} \pi_1}\right)}_{\text{new term for ex. 4}}
$$

The new decision rule is to classify $\mathbf{x}$ to group 1 if:
$$
\mathbf{x}^T \mathbf{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) > c_{\text{new}}
$$

Where the new threshold $c_{\text{new}}$ is:

$$c_{\text{new}} = \left(\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 - \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2\right) + \ln \left(\frac{\pi_2}{\pi_1}\right)$$

EXAMPLE ON HOW TO FIND $\pi$ values:

Now, to find the two $\pi$ values: we know that group 1 (class 10 or whatever) has 493 pixels while the other group has 73 pixels. These give our proportions, hence:
$$
\pi_1 = \frac{493}{493 + 73} \approx 0.871 \quad \text{and} \quad \pi_2 = \frac{73}{493 + 73} \approx 0.129
$$
Now we just insert everything in Python.
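A sketch of that computation in plain Python could look like the following. The group sizes are the ones from the example; the means and the common covariance matrix are hypothetical placeholders, and equal costs are assumed. The rule used is: classify to group 1 when $\mathbf{x}^T \Sigma^{-1}(\mu_1-\mu_2) > c + \ln(\pi_2/\pi_1)$, with $c = \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 - \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2$.

```python
import math

def inverse_2x2(s):
    (a, b), (c, d) = s
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad_form(u, mat, v):
    """u^T mat v for 2-dimensional vectors."""
    return sum(u[i] * mat[i][j] * v[j] for i in range(2) for j in range(2))

# Group sizes from the example above
n1, n2 = 493, 73
pi1 = n1 / (n1 + n2)
pi2 = n2 / (n1 + n2)

# Hypothetical means and common covariance matrix (illustration only)
mu1, mu2 = [1.0, 0.0], [0.0, 0.0]
sigma_inv = inverse_2x2([[1.0, 0.0], [0.0, 1.0]])

# Threshold for equal priors and equal costs
c = 0.5 * quad_form(mu1, sigma_inv, mu1) - 0.5 * quad_form(mu2, sigma_inv, mu2)
# Threshold adjusted for the priors
c_new = c + math.log(pi2 / pi1)

def classify(x):
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    score = quad_form(x, sigma_inv, diff)
    return 1 if score > c_new else 2

print(round(pi1, 3), round(pi2, 3), c, c_new, classify([0.0, 0.0]))
```

Because group 1 is much more common, the threshold drops below $c$, so points that would go to group 2 under equal priors (such as the origin here) are now assigned to group 1.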

##### **Discrimination with unknown parameters (slide 54):**

Estimated decision rule:
$$
x^T\hat{\Sigma}^{-1} (\hat{\mu}_1 - \hat{\mu}_2) - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2
$$
Here, for large sample sizes we can use distribution of $Z$ from theorem $5.8$ to estimate the separating hyperplane.

Mahalanobis distance:
$$
\Delta^2_{\hat{\Sigma}^{-1}}(\hat{\mu}_1, \hat{\mu}_2) = \| \hat{\mu}_1 - \hat{\mu}_2 \|^2_{\hat{\Sigma}^{-1}} = (\hat{\mu}_1 - \hat{\mu}_2)^T \hat{\Sigma}^{-1} (\hat{\mu}_1 - \hat{\mu}_2)
$$
Formula for how $D^2$ is linked to Hotelling's $T^2$ statistic:
$$
D^2=\frac{n_1+n_2}{n_1 n_2} T^2
$$

### **Week 7 lecture G**

##### **Formula for least squares estimator (used in ex 3.1):**

Formula for maximum likelihood estimator in multiple linear regression:
$$
\hat{\beta} = (X^TX)^{-1}X^Ty
$$

Estimated error:
$$
\hat{\epsilon} = y - X\hat{\beta}
$$

##### **Variance of the least squares estimator (exam 2022 problem 5):**

Formula:
$$
V(\hat{\beta}) = \sigma^2 (X^TX)^{-1}
$$
THIS MATRIX ALSO GIVES THE COVARIANCES: the diagonal elements are the variances of the coefficient estimates and the off-diagonal elements are their covariances.

##### **The Mahalanobis distance (slide 11):**

Formula for empirical Mahalanobis distance:
$$
D^2 = \| \hat{\mu}_1 - \hat{\mu}_2 \|^2_{\hat{\Sigma}^{-1}} 
$$

##### **2-sample test in $p$ dimensions (slide 15):**

Formula:
$$
T^2 = \frac{n_1 n_2}{n_1+n_2}D^2
$$

##### **Formula for estimating residual error for a model (ex 3.1):**

Formula:
$$
\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} e^2_i}{n-k}}
$$
Here, $n$ is the number of observations, $k$ is the number of parameters, and $e$ are residuals which can be read on a summary R output.

(NOTE: $k$ often includes the intercept, while $p$ does NOT)

### **Week 8 lecture H**

##### **(MORE FORMULAS TO BE ADDED)**

### **Week 9 lecture I**

##### **Formula for unbiased estimate (exam 2012 problem 5):**


The formula for $\hat{\sigma}^2$ is given as:
$$
\hat{\sigma}^2 = \frac{(y-x\hat{\theta})'(y-x\hat{\theta})}{n-k}
$$

##### **(MORE FORMULAS TO BE ADDED)**

### **Week 10 lecture J**

##### **Formula for confidence interval (exam 2007 problem 8):**


Formula:
$$
[u+t(n-k)_{\frac{\alpha}{2}}s\sqrt{c}, \; u+t(n-k)_{1-\frac{\alpha}{2}}s\sqrt{c}]
$$
Here, $u$ is the point estimate, $n$ is the number of observations, $k$ is the number of parameters (with intercept), $t(n-k)_{1-\frac{\alpha}{2}}$ is the $t$-quantile for the confidence level (and $t(n-k)_{\frac{\alpha}{2}} = -t(n-k)_{1-\frac{\alpha}{2}}$), $s$ is the estimated standard deviation of the residuals, and $c$ is given by:
$$
c = x_0^T (X^TX)^{-1} x_0
$$
Moreover, for instance, if we have $95\%$ confidence interval, then $\alpha = 0.05$ so inside R function we write `qt(0.975, df=n-k)` because $1-\frac{\alpha}{2} = 0.975$.

##### **Formula for reduced variance (exam 2008 problem 3):**

Formula (THIS IS FOR REGRESSION MODELS):
$$
R^2 = \frac{SSR_{M2} - SSR_{M1}}{SSR_{M2}}
$$
This is specifically the reduced variance between two models $M1$ and $M2$.

##### **Formula for when asked to find test statistic for hypothesis $R^2 = 0$ (exam 2008 problem 3):**

This means we have to check $H_0: \beta_1 = \beta_2 = 0$, that is, whether the model explains any variance at all. We are testing the null hypothesis that all regression coefficients (except the intercept) are zero.

Formula is (AGAIN, IT IS TESTING FOR GLOBAL MODEL):
$$
F = \frac{R^2/k}{(1-R^2)/(n-k-1)} 
$$
Here, $k$ is the number of parameters (excluding intercept) and $n$ is the number of observations.

OR THIS ONE, if we go from a model with more parameters to less parameters (reduced model):
$$
  F=\frac{(SSE_{\text{reduced}}-SSE_{\text{full}})/q}{SSE_{\text{full}}/(n-p_{\text{full}})} \sim F(q, n-p_\text{full})
$$
With $q$ being number of restrictions (number of parameters in full model minus reduced model), and remember $p$ is with intercept.
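Both $F$ formulas above can be sketched in a few lines. The numbers are hypothetical; they are chosen so that the reduced model is the intercept-only model, in which case the nested test and the global $R^2$ test give the same value.

```python
def overall_f(r2, n, k):
    """Global F-statistic from R^2 with k predictors (excluding intercept)."""
    return (r2 / k) / ((1 - r2) / (n - k - 1)), (k, n - k - 1)

def nested_f(sse_reduced, sse_full, q, n, p_full):
    """Partial F-statistic for q restrictions; p_full includes the intercept."""
    f_stat = ((sse_reduced - sse_full) / q) / (sse_full / (n - p_full))
    return f_stat, (q, n - p_full)

# Hypothetical values: n = 23 observations, 2 predictors plus intercept
f_global, df_global = overall_f(r2=0.5, n=23, k=2)

# Intercept-only reduced model: SSE_reduced is the total sum of squares,
# so R^2 = 1 - SSE_full / SSE_reduced = 0.5 matches the call above
f_nested, df_nested = nested_f(sse_reduced=100.0, sse_full=50.0, q=2, n=23, p_full=3)

print(f_global, df_global, f_nested, df_nested)
```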

OR another one, testing whether some variables contribute significantly to discriminating between two groups (full versus reduced model):
$$
F = \frac{N-p-1}{q} \cdot \frac{d_\text{full}^2 - d_\text{reduced}^2}{N(N-2)/(n_1 n_2) + d_\text{reduced}^2}
$$
Here, $d^2$ is the squared Mahalanobis distance for the full and the reduced model, $N = n_1 + n_2$ is the total number of observations in both groups, $p$ is the number of parameters in the full model, and $q$ is the number of restrictions (number of parameters in the full model minus the reduced model).

##### **Formula for $R^2=0$ hypothesis again, but based on the book (exam 2008 problem 3):**

Theorem 1.45 in the book:
$$
\frac{R^2}{1-R^2}\frac{n-k-1}{k} \sim F(k, n-k-1)
$$
Here, $k$ is the number of predictors (not including intercept) and $n$ is the number of observations.

OR for $R^2=0$ but with $t$-test (also $F=t^2$) between correlation of $X$ and $Y$ (exam 2023 problem 3):
$$
t= \frac{r}{\sqrt{(1-r^2)}}\sqrt{n-2} \sim t(n-2)
$$
So with $n-2$ degrees of freedom.

##### **(OVERALL for a SINGLE linear MODEL) $F$-statistic (exam 2008 problem 3):**

How it is usually written:
$$
F(k, n-k-1)
$$

##### **(PARTIAL for TWO models) $F$-test for nested models (exam 2014 problem 4):**

Formula:
$$
F(q, n - p_{\text{full}}) 
$$
Where $q$ is the number of restrictions (so difference between reduced and full model), $n$ is the number of observations, and $p_{\text{full}}$ is the number of parameters in the full model (including intercept).

##### **Summary table for values where Cook's D for a model can be compared in a suitable percentage in a distribution (exam 2008 problem 3):**

TABLE:

| Model parameterization     | $p$ (incl. intercept) | $n$  | Compare Cook's D to $F(p, n-p)$ |
| -------------------------- | --------------------- | ---- | ------------------------------- |
| Simple linear              | $2$                   | $n$  | $F(2, n-2)$                     |
| Two predictors + intercept | $3$                   | $45$ | $F(3, 42)$                      |
| Big model with 10 params   | $10$                  | $n$  | $F(10, n-10)$                   |

GENERAL formula for distribution of Cook's D:
$$
F(p, n-p)
$$

##### **Formula for testing if there is a difference in the intercept between three locations (exam 2008 problem 3):**

Formula with models inserted from the exercise:
$$
\frac{[SS_{model}(M3)-SS_{model}(M1)]/ (df_{M3}-df_{M1})}{[SS_{residual}(M3)/df_{residual}(M3)]}
$$

##### **Formula for Cook's distance (slide 54):**

Formula:
$$
D_i = \frac{e^2_i}{p \hat{\sigma}^2} \cdot \frac{h_{ii}}{(1-h_{ii})^2}
$$
Note that Cook's distance is high when both the leverage $h_{ii}$ and the squared residual $e_i^2$ are high. Here, $p$ is the number of parameters in the model (including intercept).

Here, another formula:
$$
h_{ii} = X_i(X^TX)^{-1}X_i^T = \|X_i\|^2_{(X^TX)^{-1}}
$$
NOTE that the leverages are the hat values. If a question asks for the POTENTIAL to influence the model, then we look at the leverage values. If it asks directly for large influence, then we look at the Cook's D values.

NOTE: the higher the Cook's distance value is, the more influence the observation has on the overall regression fit.
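To see how residual and leverage combine, here is a minimal sketch of the Cook's distance formula with hypothetical inputs. The function name and values are illustrative only.

```python
def cooks_distance(residual, leverage, p, sigma2):
    """Cook's D for one observation.

    residual: raw residual e_i
    leverage: hat value h_ii (between 0 and 1)
    p: number of parameters including the intercept
    sigma2: estimated residual variance
    """
    return residual ** 2 / (p * sigma2) * leverage / (1 - leverage) ** 2

# Same residual, different leverage (hypothetical values)
d_high = cooks_distance(residual=2.0, leverage=0.5, p=2, sigma2=4.0)
d_low = cooks_distance(residual=2.0, leverage=0.1, p=2, sigma2=4.0)
print(d_high, d_low)
```

The same residual gives a much larger Cook's D when the leverage is high, which is why leverage measures the potential for influence.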

##### **Confidence interval and prediction interval formulas (from ex 4.3):**

Confidence interval formula FOR REGRESSION:
$$
\hat{y} \pm t s \sqrt{x_0^T (X^T X)^{-1} x_0}
$$


Prediction interval formula:
$$
\hat{y} \pm t s \sqrt{1 + x_0^T (X^T X)^{-1} x_0}
$$

TOTAL length of prediction interval:
$$
\text{Length} = 2 \cdot t_{1-\alpha/2,n-k} \cdot s \sqrt{1 + x_0^T (X^T X)^{-1} x_0}.
$$

If we want the “half-length”, it is just:
$$
t_{1-\alpha/2,n-k} \cdot s \sqrt{1 + x_0^T (X^TX)^{-1} x_0}.
$$



##### **Formula for Hotelling's T-square in the one sample case (formula 61):**

Formula:
$$
T^2 = n(\bar{X} - \mu)^T S^{-1} (\bar{X} - \mu)
$$

##### **Notations for Hotelling's T-square in the two-sample case (slide 63):**

Notations:
$$
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \in N_p(\mu, \frac{1}{n}\Sigma)
$$
$$
\bar{Y} = \frac{1}{m} \sum_{i=1}^{m} Y_i \in N_p(\nu, \frac{1}{m}\Sigma)
$$
$$
S_1 = \frac{1}{n-1}  \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T
$$
$$
S_2 = \frac{1}{m-1}  \sum_{i=1}^m (Y_i - \bar{Y})(Y_i - \bar{Y})^T
$$
$$
S = \frac{(n-1)S_1 + (m-1)S_2}{n+m-2} \in W(n+m-2, \frac{1}{n+m-2}\Sigma)
$$

##### **Other formula for Hotelling's T-square in the two-sample case (slide 64):**

Other formula using the notation above (with $n_X = n$, $n_Y = m$, $S_X = S_1$, and $S_Y = S_2$):
$$
T^2=\frac{n_X n_Y}{n_X+n_Y}(\bar X-\bar Y)^\top S^{-1}(\bar X-\bar Y)
$$
Here, $d^2 = (\bar X-\bar Y)^\top S^{-1}(\bar X-\bar Y)$ is the squared Mahalanobis distance between the two sample means.

Where the pooled covariance is
$$
S=\frac{(n_X-1)S_X+(n_Y-1)S_Y}{n_X+n_Y-2}.
$$

##### **Formula to go from Hotelling's T-square to F-distribution (from ex 5.1):**


The conversion (two-sample case, pooled) is:
$$
F=\frac{(n_X+n_Y-p-1)}{p(n_X+n_Y-2)}T^2  \sim F(p, n_X+n_Y-p-1)
$$
This is for when TWO GROUPS are being compared.

Then $p$-value is found via:
$$
\text{p-value} = P(F_{2,17} \ge 6.1389)\approx 0.00985.
$$
THIS USES numbers from the exercise 5.1, but $6.1389$ is from the $F$ formula above, and $F_{2,17}$ is from $F\sim F_{p,n_X+n_Y-p-1}$ where the exercise had $10$ observations in each group and $p=2$ variables.
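The chain from $d^2$ to $T^2$ to $F$ can be sketched as below. The group sizes and $p$ match the exercise as described above; the value $d^2 = 2.6$ is an assumption chosen only so that the result lands near the quoted $F$ value, not a number taken from the exercise.

```python
def two_sample_t2_to_f(d2, n_x, n_y, p):
    """Hotelling's T^2 and the equivalent F-statistic from a squared
    Mahalanobis distance d2 between two sample means (pooled covariance)."""
    t2 = n_x * n_y / (n_x + n_y) * d2
    df2 = n_x + n_y - p - 1
    f_stat = df2 / (p * (n_x + n_y - 2)) * t2
    return t2, f_stat, (p, df2)

# 10 observations per group and p = 2 variables; d2 is hypothetical
t2, f_stat, dfs = two_sample_t2_to_f(d2=2.6, n_x=10, n_y=10, p=2)
print(t2, f_stat, dfs)
```

The $p$-value is then the upper tail probability of the $F(p, n_X+n_Y-p-1)$ distribution at `f_stat`, looked up in a table or computed with a statistics package.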

##### **Formula for testing that last p-q variables do not contribute to discrimination between population x and y (exam 2020 problem 4):**


Formula:
$$
\frac{n_1 + n_2-p-1}{p-q} \cdot \frac{d^2-d_1^2}{(n_1+n_2)(n_1+n_2-2)/(n_1n_2)+d_1^2} \sim F(p-q, n_1+n_2-p-1)
$$
This is for testing that the last $p-q$ VARIABLES do not contribute to discrimination between population $X$ and $Y$, with $n_1$ and $n_2$ observations in the two groups. Here, $d^2$ is the squared Mahalanobis distance for the full model and $d_1^2$ is the squared Mahalanobis distance for the reduced model.

##### **Rule of thumb for multicollinearity (from exam 2022 problem 4):**

General rule of thumb for multicollinearity problem:

$$TOL < 0.1$$

$$VIF > 10$$

Either of these indicates a multicollinearity problem.

Then a definition of the condition number (another multicollinearity diagnostic):
$$
\text{Condition number} = \sqrt{\frac{\lambda_{\text{max}}}{\lambda_{\text{min}}}}
$$
It should be $< 15$, and a value above $30$ is a serious concern.
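These diagnostics are simple to compute once the inputs are known. The sketch below assumes the usual definitions $VIF_j = 1/(1-R_j^2)$ and $TOL_j = 1/VIF_j$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors; these definitions are not stated in the notes above, and the input values are hypothetical.

```python
import math

def vif_and_tolerance(r2_j):
    """VIF and tolerance for predictor j, given R^2 from regressing
    predictor j on all the other predictors (assumed definition)."""
    tol = 1 - r2_j
    return 1 / tol, tol

def condition_number(eigenvalues):
    """Square root of the ratio of largest to smallest eigenvalue."""
    return math.sqrt(max(eigenvalues) / min(eigenvalues))

# Hypothetical values
vif, tol = vif_and_tolerance(0.95)
cn = condition_number([2.5, 0.4, 0.1])

# Rule-of-thumb flags from the thresholds above
problem = vif > 10 or tol < 0.1
print(vif, tol, cn, problem)
```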

##### **Formula for usual test statistic for Wilks' Lambda (from exam 2014 problem 6 or 2011 problem 3):**

Formula:
$$
U(p, k-1, n-k)
$$
Where $p$ is the number of response dimensions, $k$ is the number of groups (so $k-1$ is the number of restrictions), and $n-k$ is the degrees of freedom for error.

##### **Test for additional information in LDA (Theorem 5.19) in exam 2024 problem 2:**

Formula:
$$
U(p,k-1,n-k)
$$

OR for multivariate regression:
$$
U(p, q, n  - r)
$$
SEE EXAM 2024 problem 5. Here, $p$ is the number of response variables, $q$ is the number of added predictors, $n$ is the number of observations, and $r$ is the number of predictors for each response variable.