# **Final Exam: Review**
**STAT 301 – Final lecture &#x1F62D;**

This review covers some of the key points from our lectures, but it is not exhaustive. Therefore, you should not study only this material.

## **1 - Model Assessment**

When we build a regression model, we want to know how well it fits the data and can predict new observations. We also want to compare different models and select the best. To do this, we need some criteria or metrics that can quantify the performance of a model, such as:

- $R^2$: the coefficient of determination that indicates how much of the response variability is explained by the model.
- Adj-$R^2$: The adjusted $R^2$ penalizes the model for adding irrelevant predictors and adjusts for the number of predictors.
- Nested models and $F$-test: a way to test whether a more complex model is significantly better than a simpler one.
- *MSE* and *RMSE*: the mean and root mean squared errors that measure the average deviation of the model predictions from the observed values.
- Sensitivity, specificity, precision, accuracy: measures that evaluate the performance of a binary classifier.
- Information criteria: measures that balance the trade-off between model fit and model complexity, such as $AIC$, $BIC$, and $C_p$.

These measures can help us answer questions such as: 

- How well does the model fit the data? 
- How much error does the model make in prediction? 
- Which model is the best among several candidates? 

### **1.1 - Information Criterion**

Information criteria compare the performance of different models by balancing each model's goodness-of-fit against its complexity. Two common information criteria are:

#### **1.1.1 - Akaike information criterion (AIC)**

The AIC evaluates how well a model fits the data, considering both the goodness-of-fit and the complexity of the model as indicated by the number of parameters. A lower AIC value suggests a better-fitting model. It can be used to compare models with different numbers of explanatory variables as long as they share the same sample size and response variable.

#### **1.1.2 - Bayesian information criterion (BIC)**

BIC is similar to AIC, but its penalty per parameter is $\ln(n)$ instead of $2$, so it penalizes the number of parameters more heavily whenever $n \geq 8$. BIC therefore generally prefers simpler models than AIC does. It can also be used to compare models with different numbers of explanatory variables, provided they have the same sample size and response variable.
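
As a rough illustration, for a linear model with Gaussian errors both criteria can be written, up to an additive constant that cancels when comparing models fitted to the same data, as $n \ln(\text{RSS}/n)$ plus a penalty. The sketch below assumes that form. The function name `gaussian_aic_bic` and the RSS values are made up for illustration, and statistical software may count parameters differently or include the constant, so the absolute values may not match.

```python
import math

def gaussian_aic_bic(rss, n, k):
    """AIC and BIC, up to an additive constant, for a Gaussian linear model.

    rss: residual sum of squares, n: sample size,
    k: number of estimated regression coefficients.
    """
    fit_term = n * math.log(rss / n)   # smaller RSS gives a smaller (better) value
    aic = fit_term + 2 * k             # AIC penalty: 2 per parameter
    bic = fit_term + math.log(n) * k   # BIC penalty: ln(n) per parameter
    return aic, bic

# Hypothetical comparison: the larger model lowers the RSS only slightly.
n = 100
aic_small, bic_small = gaussian_aic_bic(rss=250.0, n=n, k=3)
aic_large, bic_large = gaussian_aic_bic(rss=245.0, n=n, k=6)

print(f"small model: AIC={aic_small:.2f}, BIC={bic_small:.2f}")
print(f"large model: AIC={aic_large:.2f}, BIC={bic_large:.2f}")
```

In this made-up example both criteria prefer the smaller model, and the gap is wider for BIC because its penalty per extra parameter is larger.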

### **1.2 - Quantitative Response**

#### **1.2.1 - Mean squared error (MSE)**

The MSE is calculated using the formula 

$$\text{MSE} = \frac{\text{RSS}}{n}$$
where $\text{RSS}$ represents the residual sum of squares and $n$ is the sample size. It measures the average squared difference between the observed and predicted values of the response variable.

MSE is commonly used to evaluate the quality of a regression model, and it has the same units as the squared response variable. A lower MSE value indicates a better fit of the model to the data.

#### **1.2.2 - Root mean squared error (RMSE)**

RMSE (root mean squared error) is the square root of the MSE and has the same units as the response variable. It can be **roughly** interpreted as the typical distance between the observed and predicted values of the response variable. This interpretation is an approximation for easier understanding: the RMSE is not exactly the average absolute distance, because squaring gives larger errors more weight.

RMSE is commonly used to assess the performance of a linear regression model. A lower RMSE value indicates a better fit of the model to the data.
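
A minimal sketch of both calculations, using a small made-up dataset (the values in `y` and `y_hat` and the function name `mse_rmse` are illustrative, not from the course data):

```python
import math

def mse_rmse(observed, predicted):
    """Return (MSE, RMSE) for paired observed and predicted values."""
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    mse = rss / len(observed)        # MSE = RSS / n, in squared units of the response
    return mse, math.sqrt(mse)       # RMSE is back in the units of the response

# Made-up observed values and their fitted values from a simple linear regression.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

mse, rmse = mse_rmse(y, y_hat)
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```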

#### **1.2.3 - $R^2$ and Adjusted-$R^2$**

$R^2$, also known as the coefficient of determination, measures the proportion of the total variation in the response variable explained by the linear regression model. In simple terms, it compares the variability of the response data around the mean (total sum of squares) to the variability of the response data around the fitted values (residual sum of squares). The formula for $R^2$ is:

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$$

The $R^2$ value in linear regression represents the percentage of the response variable variation explained by the model. For example, if $R^2 = 0.8$, it means that $80\%$ of the response variable's variation can be attributed to the explanatory variables in the model, while the remaining $20\%$ is due to random error or other factors not considered in the model. <font color='darkred'>**This interpretation is only true if we use Least Squares to fit the line.**</font>

$R^2$ can be used to compare the performance of different linear regression models. A higher $R^2$ indicates a better fit of the model to the data, assuming the models have the <font color='darkred'>**same number of explanatory variables and number of observations**</font>. However, $R^2$ alone is not sufficient to evaluate the quality of a linear regression model, as it does not account for the bias-variance trade-off (which can lead to overfitting -- when the model becomes overly complex and captures noise in the data rather than the underlying pattern), the significance of the regression coefficients, or the assumptions of the linear regression analysis.

One limitation of $R^2$ is that it never decreases when more explanatory variables are added to the model, even if they are not relevant or useful for predicting the response variable. Another measure, called adjusted-$R^2$, addresses this issue by penalizing the number of parameters, so that it increases only when a new variable improves the fit enough to offset the penalty.

$$\bar{R}^2 = 1 - \frac{\text{RSS}/(n-p)}{\text{TSS}/(n-1)}$$

where $n$ is the sample size and $p$ is the number of parameters in the model, including the intercept.

Note that adjusted-$R^2$ is always lower than or equal to $R^2$ and can decrease when an irrelevant variable is added to the model. Therefore, adjusted-$R^2$ can be used to compare the performance of different linear regression models with different numbers of explanatory variables.
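
The two formulas can be sketched as follows. The data are made up, `y_hat` holds the least squares fitted values for a simple linear regression on those data, and `p` is taken here to be the number of parameters including the intercept (an assumption about notation; the function name `r_squared` is also illustrative).

```python
def r_squared(observed, fitted, p):
    """Return (R^2, adjusted R^2); p counts all parameters, intercept included."""
    n = len(observed)
    mean_y = sum(observed) / n
    tss = sum((y - mean_y) ** 2 for y in observed)                   # total sum of squares
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, fitted))  # residual sum of squares
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - p)) / (tss / (n - 1))
    return r2, adj_r2

# Made-up data: least squares fit of y on x = 1, ..., 5 gives y_hat = 2.2 + 0.6 x.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2, adj_r2 = r_squared(y, y_hat, p=2)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```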

### **1.3 - Nested Models and F-Test**

Nested models have a hierarchical relationship, where one model is a special case of another model. For example, a simple linear regression model $Y = \beta_0 + \beta_1 X_1 + \varepsilon$ is nested within the multiple linear regression model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$, because when $\beta_2 = \beta_3 = 0$, the two models are equivalent. Nested models can be compared using an F-test, which tests whether the more complex model provides a significantly better fit than the simpler model.

The $F$-test for nested models compares the fit of two models, one simpler and one more complex. Its test statistic is the reduction in the residual sum of squares (RSS) per additional parameter when moving from the simpler model to the more complex one, divided by the estimated residual variance of the more complex model.

The formula for the test statistic is: 
$$F = \frac{\frac{RSS_1 - RSS_2}{p_2 - p_1}}{\frac{RSS_2}{n - p_2}}$$

where $RSS_1$ and $p_1$ are the $RSS$ and number of parameters for the simpler model, and $RSS_2$ and $p_2$ are the $RSS$ and number of parameters for the more complex model. 

The test statistic follows an $F$-distribution with $p_2 - p_1$ and $n - p_2$ degrees of freedom under the hypothesis that the simpler model is adequate. A large $F$-value suggests that the more complex model fits the data significantly better than the simpler model. The significance of the test can be calculated using the $F$-distribution. Additionally, the $F$-test can be used to assess the significance of a subset of explanatory variables in a multiple linear regression model. For example, it can be used to test whether a set of beta coefficients (e.g., $\beta_2$ and $\beta_3$) are both zero in a model.
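
A minimal sketch of the test statistic, using made-up RSS values and parameter counts (the function name `nested_f_statistic` and all numbers are hypothetical):

```python
def nested_f_statistic(rss_simple, p_simple, rss_complex, p_complex, n):
    """F statistic and degrees of freedom for comparing two nested linear models."""
    df1 = p_complex - p_simple          # number of extra parameters
    df2 = n - p_complex                 # residual degrees of freedom of the complex model
    numerator = (rss_simple - rss_complex) / df1   # RSS reduction per extra parameter
    denominator = rss_complex / df2                # residual variance estimate
    return numerator / denominator, (df1, df2)

# Hypothetical example: adding two predictors reduces the RSS from 120 to 80.
f_stat, (df1, df2) = nested_f_statistic(
    rss_simple=120.0, p_simple=2, rss_complex=80.0, p_complex=4, n=54
)
print(f"F = {f_stat:.2f} on {df1} and {df2} degrees of freedom")

# The p-value is the upper tail of the F-distribution; if SciPy is installed,
# scipy.stats.f.sf(f_stat, df1, df2) would compute it.
```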

### **1.4 - Categorical Response**

Sensitivity, specificity, accuracy, and precision are common metrics used to evaluate the performance of a binary classifier. A binary classifier is a model that predicts whether an observation belongs to one of two classes, such as positive or negative. The evaluation is summarized using a confusion matrix, which compares the actual and predicted classes of the observations. The confusion matrix for a binary classifier contains four cells: true positives (TP) - correctly predicted positive observations, false positives (FP) - incorrectly predicted positive observations, true negatives (TN) - correctly predicted negative observations, and false negatives (FN) - incorrectly predicted negative observations.

|                 | Predicted Positive | Predicted Negative |
|-----------------|:------------------:|:------------------:|
| Actual Positive |       40 (TP)      |       10 (FN)      |
| Actual Negative |       20 (FP)      |       30 (TN)      |

#### **1.4.1 - Sensitivity**

Sensitivity, also known as recall or true positive rate, is the proportion of the actual positive observations that are correctly predicted as positive. Sensitivity measures how well the classifier identifies the positive cases. Sensitivity is calculated as:

$$\frac{TP}{TP + FN}$$

#### **1.4.2 - Specificity**

Specificity, also known as true negative rate, is the proportion of the actual negative observations that are correctly predicted as negative. Specificity measures how well the classifier avoids the false alarms. Specificity is calculated as 

$$\frac{TN}{TN + FP}$$

#### **1.4.3 - Precision**

Precision is the proportion of the predicted positive observations that are actually positive. Precision measures how reliable the classifier is when it predicts a positive case. Precision is calculated as 

$$\frac{TP}{TP + FP}$$

#### **1.4.4 - Accuracy**

Accuracy is the proportion of the total observations that are correctly predicted by the classifier. Accuracy measures the overall performance of the classifier. Accuracy is calculated as 

$$\frac{TP + TN}{TP + TN + FP + FN}$$

Based on the confusion matrix above, we can calculate the sensitivity, specificity, accuracy and precision of the classifier as follows:
- Sensitivity = TP / (TP + FN) = 40 / (40 + 10) = 0.8
- Specificity = TN / (TN + FP) = 30 / (30 + 20) = 0.6
- Accuracy = (TP + TN) / (TP + TN + FP + FN) = (40 + 30) / (40 + 30 + 20 + 10) = 0.7
- Precision = TP / (TP + FP) = 40 / (40 + 20) ≈ 0.67

This means that the classifier:
- correctly identifies 80% of the actual positive cases (e.g., patients who have the disease);
- correctly identifies 60% of the actual negative cases, raising a false alarm for the other 40%;
- correctly predicts 70% of all cases; and
- is correct about 67% of the time when it predicts a positive case.
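
The same calculations can be sketched in Python from the four cells of the confusion matrix above (the function name `classification_metrics` is illustrative):

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four metrics from the cells of a binary confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),              # share of actual positives caught
        "specificity": tn / (tn + fp),              # share of actual negatives caught
        "precision": tp / (tp + fp),                # share of predicted positives that are right
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Counts taken from the confusion matrix above.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```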

#### **1.4.5 - ROC and AUC**

ROC stands for receiver operating characteristic, and it is a graphical plot that shows how well a binary classifier performs as we change its discrimination threshold. The ROC curve plots the true positive rate (TPR) or sensitivity against the false positive rate (FPR) or 1 - specificity for different threshold values. We use the ROC curve to compare different classifiers or to choose the best threshold that balances sensitivity and specificity.

![image.png](attachment:b46cab48-1f62-4542-b81a-97fb8616f705.png)

AUC stands for area under the curve, and it provides a single numerical value to summarize the performance of a binary classifier. The AUC reflects the probability that a randomly selected positive instance will be ranked higher than a randomly selected negative instance by the classifier. 

The AUC ranges from 0 to 1, with 0.5 representing a random classifier and 1 representing a perfect classifier. We can calculate the AUC by finding the area under the ROC curve. 

The AUC does not depend on any single threshold value and is largely insensitive to the class distribution, so it helps us compare different classifiers or assess a classifier's stability when the data changes.
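
To make the ranking interpretation concrete, the sketch below computes one point of the ROC curve for a given threshold and the AUC as the proportion of (positive, negative) pairs in which the positive observation receives the higher score, counting ties as one half. The labels, scores, and function names are made up for illustration, and observations are classified as positive when their score is at least the threshold (an assumed convention).

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos

def auc_by_ranking(labels, scores):
    """AUC as the share of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for s_pos in pos:
        for s_neg in neg:
            if s_pos > s_neg:
                wins += 1.0      # positive ranked above negative
            elif s_pos == s_neg:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up predicted probabilities for six observations.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_point(labels, scores, threshold=0.5)
auc = auc_by_ranking(labels, scores)
print(f"threshold 0.5: FPR = {fpr:.2f}, TPR = {tpr:.2f}; AUC = {auc:.3f}")
```

Changing the threshold moves the classifier along the ROC curve, while the AUC stays the same because it depends only on how the scores rank the observations.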

-------------

#### **1.4.6 - Final Remarks**

- Remember, you don't want to assess the performance of your model on the training set (i.e., the same data that you used to fit your model). You want to see how your model deals with observations it has never seen before.

- Therefore, you should use the validation set to compare different models. 

- The test set is to evaluate your **final** model. **Do not** do model comparisons with the Test Set. 
    - Otherwise, your test set becomes effectively your validation set. 
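
A minimal sketch of such a three-way split on row indices. The 60/20/20 proportions, the seed, and the function name `train_validation_test_split` are arbitrary choices for illustration, not a recommendation from the lectures.

```python
import random

def train_validation_test_split(n_obs, frac_train=0.6, frac_valid=0.2, seed=301):
    """Randomly partition the row indices 0..n_obs-1 into three disjoint sets."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)   # shuffle once, reproducibly
    n_train = round(frac_train * n_obs)
    n_valid = round(frac_valid * n_obs)
    train = indices[:n_train]                     # fit candidate models here
    valid = indices[n_train:n_train + n_valid]    # compare candidate models here
    test = indices[n_train + n_valid:]            # evaluate only the final model here
    return train, valid, test

train_idx, valid_idx, test_idx = train_validation_test_split(100)
print(len(train_idx), len(valid_idx), len(test_idx))
```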

## **2 - Prediction Interval vs Confidence Interval for Prediction**

Understanding the difference between confidence intervals for prediction and prediction intervals is crucial in regression analysis. While both serve to quantify the uncertainty of predicting a new response value based on a fitted regression model, they have distinct interpretations and assumptions. A confidence interval for prediction is an interval for the true mean response value for given levels of the explanatory variables with a specified level of confidence; in other words, it is the confidence interval for the line.

![image.png](attachment:c4df8efd-db32-458b-88b3-b3e1c2b69d63.png)

A prediction interval, on the other hand, is an interval for the actual response value of a single new observation at given levels of the explanatory variables with a specified level of confidence; in other words, it is an interval for a single "point" rather than for the line.

![image.png](attachment:d51079c5-d423-4aed-ac45-ed8deeb2cace.png)

The key difference lies in the fact that the mean response value is less variable than the individual response values, leading to a narrower confidence interval for prediction than the prediction interval.
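
For simple linear regression with one explanatory variable, the standard formulas make the difference explicit. The notation is introduced here for illustration: $x_0$ is the value of the explanatory variable at which we predict, $\hat{y}_0$ is the fitted value at $x_0$, $s = \sqrt{\text{RSS}/(n-2)}$ is the residual standard error, and $t^*$ is the appropriate quantile of the $t$-distribution with $n-2$ degrees of freedom.

$$\text{Confidence interval for the mean response: } \quad \hat{y}_0 \pm t^* \, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

$$\text{Prediction interval: } \quad \hat{y}_0 \pm t^* \, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

The extra $1$ under the square root accounts for the variability of an individual observation around its mean, which is why the prediction interval is always the wider of the two.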

## **3 - Regularization: LASSO and Ridge**

LASSO stands for Least Absolute Shrinkage and Selection Operator, a linear regression method with regularization. Regularization is a technique that adds a penalty term to the loss function of the regression model to prevent overfitting and reduce the model's variance. The penalty term is usually a function of the regression model's coefficients. 


LASSO uses the $L_1$-norm as the penalty term, which is the sum of the absolute values of the coefficients. This penalty term has the effect of shrinking the coefficients towards zero and setting some of them precisely to zero when the tuning parameter that controls the strength of the penalty is sufficiently large. This means that LASSO performs variable selection and regularization, resulting in a sparse and interpretable model. LASSO can also handle multicollinearity (frequently discarding one of the collinear features).
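
In symbols, assuming $k$ explanatory variables and a tuning parameter $\lambda \geq 0$ (notation introduced here for illustration), the LASSO estimates minimize the penalized residual sum of squares:

$$\hat{\beta}^{\text{LASSO}} = \underset{\beta_0, \dots, \beta_k}{\arg\min} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \right\}$$

With $\lambda = 0$ this reduces to ordinary least squares; as $\lambda$ grows, the coefficients shrink and more of them are set exactly to zero. The intercept $\beta_0$ is typically left unpenalized.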


However, LASSO has some limitations: when there are more explanatory variables than observations, it can select at most $n$ variables (where $n$ is the sample size), and it may not perform well when some variables are very important but have small coefficients. To overcome these limitations, some extensions of LASSO have been proposed, such as the elastic net (a combination of LASSO and Ridge).

## **4 - Post-selection inference**


Post-selection inference, a crucial aspect of statistical analysis, addresses the challenge of making valid inferences after variable selection, such as LASSO. The process of variable selection can introduce bias and disrupt the standard assumptions of classical inference methods, like confidence intervals or hypothesis tests. For instance, using LASSO to select a subset of variables and then performing ordinary least squares regression on these variables can lead to overly optimistic and unreliable estimates and p-values. Therefore, it's vital to consider the impact of variable selection on the inference procedure and adjust the results accordingly.

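There are different ways to deal with this. We covered the simplest one, data splitting: one part of the data is used for variable selection and the other part for inference. The disadvantage is that only part of the dataset is available for inference, which increases the standard errors of our estimators.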

## **5 - A/B Testing**

A/B testing is a method of comparing two or more versions of a product, service, or treatment. It requires careful planning and execution to determine which version performs better according to specific criteria, such as user satisfaction, conversion rate, revenue, or health outcomes. For example, an online retailer might use A/B testing to compare two different website layouts and identify the one that leads to more sales or clicks. This method helps optimize the design and effectiveness of interventions and supports data-driven decisions.

One problem with A/B testing is that it often involves sequential testing, which means that the data are collected and analyzed in multiple stages, and the test can be stopped early if a significant difference is observed. Sequential testing can increase the probability of false positives (i.e., finding a difference when there is none -- Type I Error). To avoid this problem, sequential testing requires adjusting the significance level or the sample size to account for the multiple testing. However, this can also reduce the power or the ability to detect a true difference. Therefore, sequential testing involves a trade-off between validity and efficiency and should be planned and conducted carefully.

One common approach to controlling the false positive rate in sequential testing is to use group sequential methods, which divide the data into a fixed number of groups, analyze the accumulated data after each group, and specify a stopping rule for each analysis based on predefined boundaries. Two popular types of boundaries are the Pocock boundary and the O'Brien-Fleming boundary. The Pocock boundary uses the same critical value (and therefore the same nominal significance level) at every analysis. The O'Brien-Fleming boundary is very strict at the first analyses, requiring a very large test statistic to stop early, and becomes gradually less strict at the later ones. Both boundaries can be used in A/B testing to balance the validity and efficiency of the test, depending on the desired trade-off between early stopping and false positive control.
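
The effect of these boundaries can be checked with a small simulation under the null hypothesis (no true difference). The sketch below assumes five equally sized groups and a two-sided $z$-test at an overall 5% level. The constants 2.413 (Pocock) and 2.040 (O'Brien-Fleming) are the values usually tabulated for this setting and should be checked against a reference before being relied upon; the function name, seed, and number of simulations are arbitrary.

```python
import random

def simulate_type_i_error(critical_values, n_sims=20000, seed=301):
    """Estimate P(reject at any analysis) under H0 for a group sequential z-test."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look, critical in enumerate(critical_values, start=1):
            cumulative += rng.gauss(0.0, 1.0)   # standardized contribution of the new group
            z = cumulative / look ** 0.5        # z-statistic based on all data so far
            if abs(z) > critical:
                rejections += 1                 # false positive: stop the experiment early
                break
    return rejections / n_sims

K = 5                                           # number of analyses (assumed)
naive = [1.96] * K                              # unadjusted 5% test at every analysis
pocock = [2.413] * K                            # same critical value at every analysis
obrien_fleming = [2.040 * (K / k) ** 0.5 for k in range(1, K + 1)]  # strict early, looser later

naive_rate = simulate_type_i_error(naive)
pocock_rate = simulate_type_i_error(pocock)
obf_rate = simulate_type_i_error(obrien_fleming)
print(f"naive: {naive_rate:.3f}, Pocock: {pocock_rate:.3f}, O'Brien-Fleming: {obf_rate:.3f}")
```

Under these assumptions, the unadjusted rule should reject well above the nominal 5%, while both adjusted boundaries should stay close to it.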