# Metrics

## Linear Models

Before relying on a linear model, we need to check its assumptions. This process is called regression diagnostics. The potential problems fall into the following categories.

**Error term:** $\epsilon \sim N(0, \sigma_\epsilon^2)$, iid.

**Linear assumption:** $E(Y|X) = X\beta$

**Unusual observations:** a few observations might change the choice and fit of the model. For example, large values can affect which model we deem the "best", which raises the question of whether these large values should be included.






## Binary classification

### Accuracy

$$\hbox{Accuracy} = \frac{\hbox{Number of correct predictions}}{\hbox{Total number of predictions made}}$$

### Confusion matrix


$n = 165$

$$\begin{matrix}
& \hbox{Predicted No (-1)} & \hbox{Predicted Yes (+1)}\\
\hbox{Actual No (-1)} & 50 \hbox{(TN)} & 10 \hbox{(FP)}\\
\hbox{Actual Yes (+1)} & 5 \hbox{(FN)}& 100 \hbox{(TP)}\\
\end{matrix}$$

**True positive (TP)**: cases in which we predicted YES and the actual output was YES.

**True negative (TN)**: cases in which we predicted NO and the actual output was NO.

**False positive (FP)**: cases in which we predicted YES and the actual output was NO.

**False negative (FN)**: cases in which we predicted NO and the actual output was YES.

$$Accuracy = \frac{TP + TN}{\hbox{Total sample}} = \frac{100 + 50}{165} \approx 0.91$$

**True positive rate (sensitivity)**

$$TPR = \frac{TP}{FN + TP}$$

**True negative rate (specificity)**

$$TNR = \frac{TN}{FP + TN}$$

**False positive rate (1-specificity)**

$$FPR = \frac{FP}{FP + TN}$$

**Precision**

$$Precision = \frac{TP}{TP + FP}$$

**Recall** (identical to the true positive rate)

$$Recall = \frac{TP}{TP + FN}$$
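As a quick check of the formulas above, the following minimal sketch plugs in the counts from the confusion matrix ($TN = 50$, $FP = 10$, $FN = 5$, $TP = 100$). The variable names are chosen for this illustration only.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 50, 10, 5, 100
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
tpr = tp / (fn + tp)          # sensitivity
tnr = tn / (fp + tn)          # specificity
fpr = fp / (fp + tn)          # 1 - specificity
precision = tp / (tp + fp)
recall = tp / (tp + fn)       # same quantity as the true positive rate

metrics = {
    "accuracy": accuracy,
    "TPR": tpr,
    "TNR": tnr,
    "FPR": fpr,
    "precision": precision,
    "recall": recall,
}
for name, value in metrics.items():
    print(f"{name:<10} {value:.3f}")
```

Note that $FPR = 1 - TNR$ and $Recall = TPR$ hold by definition, so only four of the six quantities carry independent information.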



## Regression

### Mean absolute error (MAE)

$$MAE = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i|$$

where $y_i$ are the actual values and $\hat{y}_i$ are the predicted values.

### Mean squared error (MSE)

$$MSE = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2$$

The $MSE$ is differentiable everywhere and its gradient is simple to compute, so it is often preferred as a training objective. The $MAE$ is not differentiable where a residual is zero, so minimizing it typically relies on subgradient methods or linear programming tools.
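The two losses and their derivatives with respect to the predictions can be sketched in a few lines. This is an illustration under stated assumptions: the function names and the toy data are made up for this example, and `mae_subgrad` returns one valid subgradient (it uses 0 where the residual is exactly zero).

```python
def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mse_grad(y, y_hat):
    """Gradient of the MSE with respect to each prediction."""
    n = len(y)
    return [-2.0 * (a - b) / n for a, b in zip(y, y_hat)]


def sign(r):
    """Return -1, 0 or 1 depending on the sign of r."""
    return (r > 0) - (r < 0)


def mae_subgrad(y, y_hat):
    """A subgradient of the MAE with respect to each prediction."""
    n = len(y)
    return [-sign(a - b) / n for a, b in zip(y, y_hat)]


# Toy data (hypothetical values).
y = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

print(mae(y, y_hat), mse(y, y_hat))
print(mse_grad(y, y_hat))
print(mae_subgrad(y, y_hat))
```

The MSE gradient scales with the size of each residual, while the MAE subgradient only carries its sign, which is one reason the MAE is less sensitive to large errors.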

## Collinearity

With the columns of $X$ centered by their means, $(X^T X)_{2,2}$ is equal to $(n-1) \times Var(X_1)$.

Likewise, $(X^T X)_{2,3}$ is equal to $(n-1) \times Cov(X_1, X_2)$. Together with the corresponding diagonal entries, this gives a 2x2 matrix for $X_1$ and $X_2$. If $X_1$ and $X_2$ are perfectly collinear, this 2x2 matrix is not full rank (it is of rank 1 in this case) and it is not invertible. If they are only nearly collinear, the matrix is invertible but ill-conditioned.
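The relationship between the centered cross-product matrix and the sample covariance can be checked numerically. The sketch below assumes NumPy is available and uses simulated predictors. It omits the intercept column, so the entries the text indexes as $(2,2)$ and $(2,3)$ sit at positions `[0, 0]` and `[0, 1]` here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.5, size=n)  # correlated with x1, not perfectly
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)          # center each column by its mean
gram = Xc.T @ Xc                 # cross-product matrix of the centered predictors
cov = np.cov(X, rowvar=False)    # sample covariance (divides by n - 1)

print(np.allclose(gram, (n - 1) * cov))

# Perfect collinearity: the second column is an exact multiple of the first.
X_perfect = np.column_stack([x1, 3.0 * x1])
Xp = X_perfect - X_perfect.mean(axis=0)
gram_perfect = Xp.T @ Xp
rank_perfect = np.linalg.matrix_rank(gram_perfect)
print(rank_perfect)
```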

When the matrix is inverted, the factor of $n$ moves to the denominator, so the entries of $(X^T X)^{-1}$ shrink as $n$ grows.

The power of a test is $1 - P(\hbox{type II error})$: the probability that a non-zero $\beta$ is detected to be non-zero.

We quantify collinearity with the variance inflation factor, $VIF(\hat{\beta}_j)$.

Model selection uses a regularized solution.

$X^T X$ is full rank when there is no perfect collinearity among the predictors.

If $p > n$, then $X^T X$ is not a full rank matrix; we cannot define x s.t. 


$Var(\hat{\beta}) = (X^T X)^{-1} \sigma_\epsilon^2$
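To see how collinearity enters this variance formula, the following sketch compares the diagonal of $(X^T X)^{-1} \sigma_\epsilon^2$ for an uncorrelated and a nearly collinear design. It assumes NumPy, simulated predictors, and a hypothetical error variance of 1. The helper name `coef_variances` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 200, 1.0  # hypothetical sample size and error variance


def coef_variances(X, sigma2):
    """Diagonal of sigma^2 (X^T X)^{-1}: the variance of each coefficient."""
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))


ones = np.ones(n)
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                        # unrelated to x1
x2_collinear = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1

var_indep = coef_variances(np.column_stack([ones, x1, x2_indep]), sigma2)
var_collinear = coef_variances(np.column_stack([ones, x1, x2_collinear]), sigma2)

print("Var(beta_1), independent design:", var_indep[1])
print("Var(beta_1), collinear design:  ", var_collinear[1])

# With more columns than rows, rank(X^T X) = rank(X) <= n < p,
# so the p x p matrix X^T X cannot be full rank.
X_wide = rng.normal(size=(5, 8))
rank_wide = np.linalg.matrix_rank(X_wide)
print("rank of a 5 x 8 design:", rank_wide)
```

The variance of $\hat{\beta}_1$ should come out far larger in the nearly collinear design, even though the sample size and error variance are the same.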

## Multicollinearity

Multicollinearity means that the majority of the variability in a predictor $X_j$ can be explained by the other predictors.

To detect it, regress each predictor on all of the others (e.g. $X_1$ on $X_2$ and $X_3$, $X_2$ on $X_1$ and $X_3$, $X_3$ on $X_1$ and $X_2$) and compute the $R^2$ score of each regression. A high $R^2$ indicates potential multicollinearity. This is captured in the variance inflation factor (VIF) score:

$$VIF(\hat{\beta}_j) = \frac{1}{1 - R_j^2} \geq 1$$

$R_j^2 = 0.8$ means 80% of the variability in $X_j$ has been explained by the other predictors, which gives $VIF(\hat{\beta}_j) = 1 / (1 - 0.8) = 5$.
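The procedure of regressing each predictor on the others can be sketched as follows. This is a minimal illustration that assumes NumPy and simulated data. The helper names `r_squared` and `vif` are hypothetical and not taken from any particular library.

```python
import numpy as np


def r_squared(y, X):
    """R^2 of an OLS regression of y on X (an intercept is added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss


def vif(X):
    """VIF of each column of X, regressing it on all remaining columns."""
    p = X.shape[1]
    out = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        out[j] = 1.0 / (1.0 - r_squared(X[:, j], others))
    return out


rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.3, size=n)  # mostly explained by x1 and x2
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print(vifs)
```

In this simulated design all three scores should be well above 1, because each predictor is largely determined by the other two.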

## Metrics from the training dataset

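Consider a model with $d$ features. As $d$ increases, the training RSS decreases (adding a feature to a least-squares fit can never increase it), so the RSS alone always favors the largest model. The criteria below therefore add a penalty that grows with $d$. Here $\hat{\sigma}_\epsilon^2$ is an estimate of the error variance; when it is computed from the model's own residuals, it typically decreases as well.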

Mallows' $C_p$: 

$$C_p = \frac{1}{n} (RSS + 2 d \hat{\sigma}_\epsilon^2)$$

**AIC (smaller is better)**

$$AIC = \frac{1}{n \hat{\sigma}_\epsilon^2} (RSS + 2 d \hat{\sigma}_\epsilon^2)$$

**BIC (smaller is better)**

$$BIC = \frac{1}{n \hat{\sigma}_\epsilon^2} (RSS + \log(n) d \hat{\sigma}_\epsilon^2)$$

BIC tends to select the smaller model (in number of features), compared to AIC, because its penalty per feature, $\log(n)$, exceeds the AIC penalty of $2$ whenever $n > 7$.
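The three criteria can be compared on a small numerical example. The sketch below is an illustration under stated assumptions: the RSS values, $n$, and $\hat{\sigma}_\epsilon^2$ are hypothetical, and BIC is written with the usual $\log(n)\, d$ penalty.

```python
import math


def mallows_cp(rss, d, n, sigma2):
    """Mallows' C_p for a model with d features."""
    return (rss + 2 * d * sigma2) / n


def aic(rss, d, n, sigma2):
    """AIC in the scaled form used above (smaller is better)."""
    return (rss + 2 * d * sigma2) / (n * sigma2)


def bic(rss, d, n, sigma2):
    """BIC with a log(n) penalty per feature (smaller is better)."""
    return (rss + math.log(n) * d * sigma2) / (n * sigma2)


n, sigma2 = 100, 1.0  # hypothetical sample size and error variance estimate
# Hypothetical candidates: (number of features, training RSS).
small_model = (3, 120.0)
large_model = (6, 112.0)

for d, rss in (small_model, large_model):
    print(
        f"d={d}  Cp={mallows_cp(rss, d, n, sigma2):.3f}  "
        f"AIC={aic(rss, d, n, sigma2):.3f}  BIC={bic(rss, d, n, sigma2):.3f}"
    )
```

With these made-up numbers AIC prefers the larger model while BIC prefers the smaller one, which matches the tendency described above.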

### Model selection

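Forward selection may not always be able to find the best model. It is a greedy procedure that never removes a feature once it has been added, so if the best one-feature model uses a feature that is not part of the best two-feature model, forward selection cannot recover the latter.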
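This limitation can be reproduced on a small constructed example. The sketch below is an illustration, not a reference implementation: it assumes NumPy, the helper names `rss`, `forward_selection` and `best_subset` are hypothetical, and the data are simulated so that the third feature is a noisy copy of the sum of the first two.

```python
import itertools

import numpy as np


def rss(X, y, cols):
    """Training RSS of an OLS fit of y on an intercept plus the chosen columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def forward_selection(X, y, k):
    """Greedily add, k times, the feature that lowers the RSS the most."""
    chosen = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = min(remaining, key=lambda j: rss(X, y, chosen + [j]))
        chosen.append(best)
    return chosen


def best_subset(X, y, k):
    """Exhaustive search over all feature subsets of size k."""
    candidates = itertools.combinations(range(X.shape[1]), k)
    return list(min(candidates, key=lambda cols: rss(X, y, cols)))


rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + rng.normal(scale=0.01, size=n)   # y depends on x1 and x2 only
x3 = x1 + x2 + rng.normal(scale=0.5, size=n)   # noisy proxy for x1 + x2
X = np.column_stack([x1, x2, x3])

fwd = forward_selection(X, y, 2)
best = best_subset(X, y, 2)
print("forward selection:", fwd, rss(X, y, fwd))
print("best subset:      ", best, rss(X, y, best))
```

In this design the proxy `x3` is the best single predictor, so forward selection should pick it first and then remain stuck with it, while the exhaustive search should find the pair `x1`, `x2`. Exhaustive search is only feasible for small numbers of features, which is why greedy procedures are used despite this weakness.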
