## Potential problems

Which of the following potential problems in a linear regression model may cause statistically significant variables to appear insignificant?

A) Multicollinearity

B) Outliers

C) Overfitting

<span style='color:Blue'>**Answer**: A and B</span>

<span style='color:Blue'>**Explanation**:</span>

<span style='color:Blue'>**A) Multicollinearity:**</span>

<span style='color:Blue'>Recall that the estimated variance of the coefficient estimate $\hat{\beta_j}$ of the $j^{th}$ predictor $X_j$ can be expressed as:</span>

<span style='color:Blue'>$$\hat{var}(\hat{\beta_j}) = \frac{\hat{\sigma}^2}{(n-1)\hat{var}(X_j)} \cdot \frac{1}{1-R^2_{X_j|X_{-j}}}$$ </span>

<span style='color:Blue'>If the predictor $X_j$ is collinear with other predictors, $R^2_{X_j|X_{-j}}$ will be large, which in turn will inflate $\hat{var}(\hat{\beta_j})$. In other words, multicollinearity inflates the standard errors of the coefficients of the collinear variables. Since the $t$-statistic is calculated by dividing the estimated coefficient by its standard error, the $t$-statistic shrinks, and the corresponding $p$-value increases. Therefore, the hypothesis test loses power to reject the null hypothesis, and statistically significant variables may appear insignificant. </span>

<span style='color:Blue'>Another way to think about this is that if some predictors are collinear, it can be difficult to separate out the individual effects of these variables on the response, so significant variables may appear insignificant. </span>
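
<span style='color:Blue'>The inflation factor $\frac{1}{1-R^2_{X_j|X_{-j}}}$ can be illustrated with a minimal sketch. It assumes only two predictors, in which case $R^2_{X_j|X_{-j}}$ is just the squared correlation between them. The data and the helper names `pearson_r` and `variance_inflation_factor` are made up for illustration.</span>

```python
import math


def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def variance_inflation_factor(r_squared):
    """The factor 1 / (1 - R^2) from the variance formula above."""
    return 1.0 / (1.0 - r_squared)


# Made-up data: x2_collinear nearly duplicates x1, x2_unrelated does not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_collinear = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x2_unrelated = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]

vif_collinear = variance_inflation_factor(pearson_r(x1, x2_collinear) ** 2)
vif_unrelated = variance_inflation_factor(pearson_r(x1, x2_unrelated) ** 2)
print(f"collinear pair: {vif_collinear:.1f}, unrelated pair: {vif_unrelated:.2f}")
```

<span style='color:Blue'>The nearly duplicated predictor multiplies the coefficient variance by a factor in the hundreds, while the unrelated predictor leaves it close to 1.</span>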

<span style='color:Blue'>**B) Outliers**</span>

<span style='color:Blue'>Recall that, for simple linear regression, the estimate of the error variance is given by:</span>

<span style='color:Blue'>$$\hat{\sigma}^2 = {\frac{RSS}{n-2}},$$</span>
<span style='color:Blue'>where $RSS$ is the residual sum of squares (with $p$ predictors, the denominator is $n-p-1$). Outliers result in an increase in $RSS$, leading to an increase in the estimated error variance $\hat{\sigma}^2$, which in turn inflates $\hat{var}(\hat{\beta_j})$. The rest of the explanation follows from the previous explanation on multicollinearity.</span>
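
<span style='color:Blue'>As a rough illustration, the sketch below fits a simple linear regression to made-up data with and without a single outlying response and compares $\hat{\sigma}^2 = RSS/(n-2)$. The helper names `fit_simple_ols` and `error_variance` are hypothetical.</span>

```python
def fit_simple_ols(x, y):
    """Least-squares (intercept, slope) for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def error_variance(x, y):
    """sigma_hat^2 = RSS / (n - 2) for simple linear regression."""
    intercept, slope = fit_simple_ols(x, y)
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return rss / (len(x) - 2)


# Made-up data lying close to the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_clean = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.1]

# Same data, but one response is moved far away from the trend.
y_outlier = list(y_clean)
y_outlier[3] = 20.0

sigma2_clean = error_variance(x, y_clean)
sigma2_outlier = error_variance(x, y_outlier)
print(f"without outlier: {sigma2_clean:.3f}, with outlier: {sigma2_outlier:.3f}")
```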


<span style='color:Blue'>**C) Overfitting**</span>
    
<span style='color:Blue'>Overfitting shrinks $RSS$, which in turn shrinks $\hat{\sigma}^2$, thereby shrinking $\hat{var}(\hat{\beta_j})$. Thus, overfitting acts in the opposite direction to what we observe in (A) and (B). </span>

## Potential problems
Classify a data point as influential / outlier / high leverage in a linear regression model, based on the description.

A) The data point is likely to have a large effect on the model in terms of prediction: <span style='color:Blue'>**Influential point**</span>

B) The data point has the potential to have a large effect on the model in terms of prediction: <span style='color:Blue'>**High leverage point**</span>

C) The data point is likely to inflate the model R-squared: <span style='color:Blue'>**High leverage point that is not influential** </span>

D) The data point is unlikely to have a large effect on the model in terms of prediction: <span style='color:Blue'>**Outlier**</span>

<span style='color:Blue'>**Explanation:**</span>
    
<span style='color:Blue'>See the graphics in the class presentation on *Chapter3_Outliers_high_leverage_influential_points*. Think of influential points / high leverage points / outliers as a force (proportional to the residual corresponding to the point) pulling a cantilever beam. Depending on the position from which you pull the cantilever beam, you may move it a lot or only a little.</span>

<span style='color:Blue'>A) **Influential point** (high leverage & outlier): an outlier with respect to both the predictor and the response. It has a large effect on the regression line. As shown in the graphics, for the same leverage, influence is higher for more extreme outliers, and for a similar outlying distance, influence is higher for points with higher leverage.</span>

<span style='color:Blue'>B) **High leverage point**: Observations with high leverage have an unusual value for the predictor (i.e., they lie outside the range of most points). A high leverage point has the potential to have a large effect on the regression line. It is cause for concern if the least squares line is heavily affected by just a couple of observations, because any problems with these points may invalidate the entire fit.</span>

<span style='color:Blue'>C) If you have a **high leverage point that is not influential**: The variance of the response may increase in the presence of high leverage points, since an unusual set of predictor values may correspond to an unusual response, which may increase the total variation. However, as the point is not influential, the increase in the unexplained variation *(the squared residual)* will not be proportionate to the increase in total variation. As $R^2$ is one minus the ratio of unexplained variation to total variation, it is likely to increase.</span>

<span style='color:Blue'>D) **Outliers:** As shown in the graphics, outliers have a very small effect on prediction.</span>
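
<span style='color:Blue'>The three cases can be compared numerically. The sketch below is only an illustration on made-up data: it adds one extra point to a set of points lying exactly on $y = 2x$, then reports the leverage of the added point (for one predictor, $h = 1/n + (x - \bar{x})^2 / S_{xx}$) and the largest change in the fitted values at the original points. All function names are hypothetical.</span>

```python
def fit_simple_ols(x, y):
    """Least-squares (intercept, slope) for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leverage_of_new_point(x, new_x):
    """Leverage h = 1/n + (x_new - mean)^2 / Sxx, with the new point included."""
    xs = x + [new_x]
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    return 1 / n + (new_x - mean_x) ** 2 / sxx


def max_prediction_shift(x, y, new_x, new_y):
    """Largest change in fitted values at the original x after adding one point."""
    b0, b1 = fit_simple_ols(x, y)
    c0, c1 = fit_simple_ols(x + [new_x], y + [new_y])
    return max(abs((c0 + c1 * xi) - (b0 + b1 * xi)) for xi in x)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0 * xi for xi in x]  # points lying exactly on y = 2x

# (x, y) of the single added point in each made-up scenario.
cases = {
    "outlier, low leverage": (4.5, 19.0),
    "high leverage, on trend": (20.0, 40.0),
    "high leverage, off trend": (20.0, 10.0),
}
results = {
    name: (leverage_of_new_point(x, nx), max_prediction_shift(x, y, nx, ny))
    for name, (nx, ny) in cases.items()
}
for name, (h, shift) in results.items():
    print(f"{name}: leverage={h:.2f}, max shift in fitted values={shift:.2f}")
```

<span style='color:Blue'>In this toy setup, the low-leverage outlier shifts the fitted values only slightly, the high leverage point on the trend does not move them at all, and the high leverage point off the trend (the influential point) moves them the most.</span>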

## Autocorrelation

A linear regression model was developed to predict the number of passengers taking a flight per month. The data consists of number of passengers flying each month from January 1949 to December 1960. The autocorrelation plot below shows the correlation of the residuals with the lagged residuals of the model. Choose the most appropriate option.

```python
#| echo: false

# import image module
from IPython.display import Image

# get the image
Image(url="./Datasets/autocorr2.jpg", width=700)
```

A) The above plot shows the presence of autocorrelation. The 6-month lagged response is the most appropriate lag to be added as a predictor in the model to address autocorrelation

B) The above plot shows the presence of autocorrelation. The 12-month lagged response is the most appropriate lag to be added as a predictor in the model to address autocorrelation

C) The above plot shows the presence of autocorrelation. The 1-month lagged response is the most appropriate lag to be added as a predictor in the model to address autocorrelation

D) The above plot shows the absence of autocorrelation as the plot must have a cyclical pattern in the presence of autocorrelation

E) The above plot shows the absence of autocorrelation as the one month lagged residual must have the highest correlation with the residual in the presence of autocorrelation
 
<span style='color:Blue'>**Answer**: B</span>
 
<span style='color:Blue'>**Explanation**: As seen in the plot, the residuals are highly correlated (correlation of more than 60%) with the residuals lagged by 12 months. This shows the presence of autocorrelation. To address autocorrelation, the 12-month lagged response will be the most appropriate, as it has the highest correlation with the response. Thus, it will explain the most variation in the response. </span>

<span style='color:Blue'>A cyclical pattern is not required for autocorrelation. Even if only one of the lagged residuals is highly correlated with the residual, it shows the presence of autocorrelation.</span>
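
<span style='color:Blue'>The quantity shown in an autocorrelation plot can be computed directly. The sketch below uses synthetic residuals that repeat a made-up 12-month pattern (an assumption for illustration, not the passenger data from the question) and computes the sample autocorrelation at a few lags. The function name `lag_autocorrelation` is hypothetical.</span>

```python
def lag_autocorrelation(series, lag):
    """Sample autocorrelation between a series and itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom


# Synthetic residuals: a made-up monthly pattern repeated for 12 years (144 months).
monthly_pattern = [3, -1, 0, 1, -2, 0, 2, -1, 1, -3, 0, 0]
residuals = [float(monthly_pattern[t % 12]) for t in range(144)]

acf = {lag: lag_autocorrelation(residuals, lag) for lag in (1, 6, 12)}
best_lag = max(acf, key=acf.get)
for lag, value in acf.items():
    print(f"lag {lag:2d}: autocorrelation = {value:.3f}")
print("lag with the highest autocorrelation:", best_lag)
```

<span style='color:Blue'>Here the 1-month lag is not the most correlated one; the 12-month lag is, which is the situation described in option B.</span>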

## Logistic regression (goodness-of-fit)
 
Which of the following metrics can be used to assess the goodness-of-fit of a logistic regression model?

A) All of these 

B) LL-Null 

C) Log-Likelihood 

D) Df Model 

E) R-squared 

<span style='color:Blue'>**Answer**: Log-Likelihood</span> 

<span style='color:Blue'>**Explanation**:</span>
<span style='color:Blue'>In logistic regression, the response is assumed to follow a Bernoulli distribution, where the probability of success is a function of the predictors and their coefficients *(the model parameters)*. With this assumption, one can compute the joint probability of the observed data as a function of the model parameters. This creates a set of probability distributions *(based on different values of model parameters)* that could have generated the data. The algorithm finds the values of the model parameters *(the beta coefficients)* such that the probability of observing the data is maximized. This probability is the likelihood, and its logarithm is the log-likelihood. The higher the log-likelihood, the more probable it is to observe the data. Thus, log-likelihood is a way to measure the goodness-of-fit of the model.</span>

<span style='color:Blue'>LL-Null is the log-likelihood of the null model, i.e., the model with no predictors (intercept only). This is compared with the log-likelihood of the model with predictors to test if the regression is statistically significant. </span>

<span style='color:Blue'>Df Model is the number of predictors in the model.</span>

<span style='color:Blue'>R-squared cannot be used for logistic regression as there are no residuals.</span>
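
<span style='color:Blue'>As a minimal sketch, the Bernoulli log-likelihood can be computed by hand for one predictor. The data are made up, and the coefficients of the model with a predictor are chosen by hand rather than fitted, so the numbers are purely illustrative. The null model is taken to be the intercept-only model, whose fitted probability is the sample proportion of ones.</span>

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta0, beta1, xs, ys):
    """Bernoulli log-likelihood of 0/1 responses ys under p(x) = sigmoid(beta0 + beta1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta0 + beta1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Made-up data: larger x values tend to go with y = 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

# Null model: intercept only, so every observation gets the same probability.
p_bar = sum(ys) / len(ys)
ll_null = log_likelihood(math.log(p_bar / (1 - p_bar)), 0.0, xs, ys)

# Model using x, with hand-picked (not fitted) coefficients.
ll_model = log_likelihood(-4.5, 2.0, xs, ys)

print(f"LL-Null = {ll_null:.3f}, log-likelihood with predictor = {ll_model:.3f}")
```

<span style='color:Blue'>The model that uses the predictor has a higher (less negative) log-likelihood than the null model, meaning the observed data are more probable under it.</span>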

## Logistic regression (threshold probability)
 
For a logistic regression model, as we increase the decision threshold probability, 

A) None of these 

B) the recall will reduce or stay the same 

C) the ROC-AUC will increase or stay the same 

D) the precision will increase or stay the same 

E) the classification accuracy will increase or stay the same 

<span style='color:Blue'>**Answer**: B</span>

<span style='color:Blue'>**Explanation**: See class slide on the confusion matrix below. </span> 

```python
#| echo: false

# import image module
from IPython.display import Image

# get the image
Image(url="./Datasets/cm.jpg", width=600)
```

<span style='color:Blue'>Increasing the threshold probability means that fewer observations are predicted to be positive. Hence, some TP may turn into FN, reducing the recall (the recall stays the same if no actual positives have a predicted probability between the old and the new threshold). ROC-AUC is independent of the threshold probability. Precision and classification accuracy may either increase or decrease: raising the threshold turns some FP into TN, which helps both metrics, but it also turns some TP into FN, which hurts both. Precision decreases if TP falls proportionally more than FP, and accuracy decreases if more TP than FP are lost by the shift in the threshold.</span> 
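
<span style='color:Blue'>A small threshold sweep makes this concrete. The predicted probabilities and labels below are made up, and `confusion_counts` is a hypothetical helper; the sketch simply recomputes recall, precision, and accuracy at increasing thresholds.</span>

```python
def confusion_counts(probs, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for p > threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Made-up predicted probabilities and true classes.
probs = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.30, 0.20]
labels = [1, 0, 1, 1, 0, 1, 0, 0]

thresholds = [0.25, 0.5, 0.75, 0.9]
recalls, precisions, accuracies = [], [], []
for t in thresholds:
    tp, fp, tn, fn = confusion_counts(probs, labels, t)
    recalls.append(tp / (tp + fn))
    precisions.append(tp / (tp + fp))  # tp + fp > 0 for every threshold used here
    accuracies.append((tp + tn) / len(labels))
    print(f"threshold={t}: recall={recalls[-1]:.2f}, "
          f"precision={precisions[-1]:.2f}, accuracy={accuracies[-1]:.2f}")
```

<span style='color:Blue'>On this toy data the recall never goes up as the threshold rises, while precision and accuracy move in both directions.</span>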

## Decision threshold probability

Which of the following metrics is independent of the decision threshold probability?

A) None of these

B) ROC-AUC 

C) All of these (except the "None of these" option) 

D) Precision 

E) Recall 

<span style='color:Blue'>**Answer**: ROC-AUC </span>

<span style='color:Blue'>**Explanation**:</span>
<span style='color:Blue'>By changing the threshold, the number of points classified as negative and positive may change, and so TP, FP, TN and FN may change. Recall and precision may change as they are computed from these counts. The ROC curve, however, is a plot of TPR against FPR over all possible thresholds, and ROC-AUC is the area under the ROC curve, so its value is independent of the decision threshold probability. </span>
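
<span style='color:Blue'>One way to see this is that ROC-AUC equals the fraction of (positive, negative) pairs in which the positive observation receives the higher predicted probability, with ties counted as half. The sketch below uses that pairwise form on made-up data; note that the hypothetical function `roc_auc` takes no threshold argument at all.</span>

```python
def roc_auc(probs, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    negatives = [p for p, y in zip(probs, labels) if y == 0]
    score = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                score += 1.0
            elif p == q:
                score += 0.5
    return score / (len(positives) * len(negatives))


# Made-up predicted probabilities and true classes.
probs = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.30, 0.20]
labels = [1, 0, 1, 1, 0, 1, 0, 0]

auc = roc_auc(probs, labels)
# Squaring the probabilities changes their values but not their ordering.
auc_rescaled = roc_auc([p ** 2 for p in probs], labels)
print(auc, auc_rescaled)
```

<span style='color:Blue'>Because only the ordering of the predicted probabilities matters, the value is unchanged by any choice of threshold and by any order-preserving rescaling of the probabilities.</span>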

## Odds

Consider the following logistic regression model:

$$p(x) =\frac{1}{1+e^{-(\beta_0+\beta_1x)}}$$

Which of the following metrics will depend on the value of x?

A) Odds ratio when x increases by 2 units

B) Increase in log odds when x increases by 10 units

C) All of these 

D) Increase in predicted probability when x increases by 1 unit

E) None of these

<span style='color:Blue'>**Answer**: D</span>

<span style='color:Blue'>**Explanation**: </span>

<span style='color:Blue'>$$p(x) =\frac{1}{1+e^{-(\beta_0+\beta_1x)}}$$</span>

<span style='color:Blue'>$$\implies \log\bigg(\frac{p(x)}{1-p(x)}\bigg) = \beta_0 + \beta_1x$$</span>

<span style='color:Blue'>$$\implies \log\big(Odds(x)\big) = \beta_0 + \beta_1x$$</span>

<span style='color:Blue'>When $x$ increases by $c$ units,</span>

<span style='color:Blue'>$$p(x+c) - p(x) =\frac{1}{1+e^{-(\beta_0+\beta_1(x+c))}}-\frac{1}{1+e^{-(\beta_0+\beta_1x)}}$$</span>

<span style='color:Blue'>$$\log\big(Odds(x+c)\big) - \log\big(Odds(x)\big) = \beta_1 c$$</span>

<span style='color:Blue'>$$\frac{Odds(x+c)}{Odds(x)} = e^{\beta_1 c}$$</span>

<span style='color:Blue'>The increase in log odds and the odds ratio depend only on $\beta_1$ and $c$. Thus, only the increase in predicted probability when $x$ increases by 1 unit depends on $x$.</span>
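
<span style='color:Blue'>A quick numerical check of the three quantities, using hypothetical coefficients $\beta_0 = -1$ and $\beta_1 = 0.5$ (assumed values, not given in the question):</span>

```python
import math


def probability(x, beta0, beta1):
    """Predicted probability p(x) of the logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def odds(x, beta0, beta1):
    """Odds p(x) / (1 - p(x))."""
    p = probability(x, beta0, beta1)
    return p / (1.0 - p)


beta0, beta1 = -1.0, 0.5  # hypothetical coefficients for illustration

rows = []
for x in (0.0, 2.0, 4.0):
    odds_ratio = odds(x + 2, beta0, beta1) / odds(x, beta0, beta1)
    log_odds_increase = (math.log(odds(x + 10, beta0, beta1))
                         - math.log(odds(x, beta0, beta1)))
    prob_increase = probability(x + 1, beta0, beta1) - probability(x, beta0, beta1)
    rows.append((x, odds_ratio, log_odds_increase, prob_increase))
    print(f"x={x}: odds ratio (+2)={odds_ratio:.4f}, "
          f"log-odds increase (+10)={log_odds_increase:.4f}, "
          f"probability increase (+1)={prob_increase:.4f}")
```

<span style='color:Blue'>The odds ratio stays at $e^{2\beta_1}$ and the log-odds increase at $10\beta_1$ for every starting value of $x$, while the increase in predicted probability changes with $x$.</span>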

## Precision-recall

We develop a logistic regression model to predict whether someone will pay a loan back or not. Loans are "approved" by us only for those borrowers who are predicted to pay back. The positive class is the borrowers that pay back the loans. What would a recall of 81% mean?

A) 81% of the borrowers that would pay back the loan are approved by us: <span style='color:Blue'>Recall = TP/(TP + FN). TP here are those who are [approved by us] who [pay back the loan], while FN are those who were [not approved by us] but actually [pay back the loans]. The denominator is [all who pay back the loan]. Thus, Recall here means: among [all who pay back the loan], 81% are [approved by us].</span>

B) Of all the loans we approve, 81% pay us back: <span style='color:Blue'>This is Precision = TP/(TP + FP)</span>

C) Of all the loans we don't approve, 81% would not have paid us back if they were given the loan: <span style='color:Blue'>This is the proportion of predicted negatives that are actually negative, TN/(TN + FN), which is like precision for the negative class.</span>

D) Of all the loans we don't approve, 19% would not have paid us back if they were given the loan: <span style='color:Blue'>This is also a statement about the predicted negatives, TN/(TN + FN) = 19%, and not about recall.</span>

<span style='color:Blue'>**Answer**:</span>
<span style='color:Blue'>81% of the borrowers that would pay back the loan are approved by us.</span>

<span style='color:Blue'>**Explanation**:</span>
<span style='color:Blue'>Recall = True Positives/(True Positives + False Negatives).</span> 

<span style='color:Blue'>In this case, True positives are those who got approved and would pay back. False Negatives are those we didn't approve, but would pay back. Therefore, 81% Recall means 81% of the borrowers that would pay back the loan are approved by us.</span>
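
<span style='color:Blue'>To keep the options apart, the sketch below computes the quantities behind options A, B, and C from a hypothetical confusion matrix. The counts are invented so that the recall comes out to 81%; they do not come from the question.</span>

```python
# Hypothetical counts for 200 loan applicants (positive class = pays back).
tp = 81  # approved, and would pay back
fn = 19  # not approved, but would have paid back
fp = 30  # approved, but would not pay back
tn = 70  # not approved, and would not have paid back

recall = tp / (tp + fn)      # option A: share of payers that we approve
precision = tp / (tp + fp)   # option B: share of approved loans that pay back
npv = tn / (tn + fn)         # option C: share of rejected loans that would not pay back

print(f"recall={recall:.2f}, precision={precision:.2f}, "
      f"negative predictive value={npv:.2f}")
```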

## Variable selection

Which of the following algorithms can be used for variable selection?

A) Lasso

B) Ridge regression

C) Forward stepwise selection

D) Best subset selection

<span style='color:Blue'>**Answer**: A,C,D</span>

<span style='color:Blue'>**Explanation**: Both lasso and ridge regression are regularized least squares models, in which a shrinkage penalty is added to the ordinary least squares cost function. The shrinkage penalty in ridge regression shrinks the regression coefficient estimates towards zero, but not exactly to zero, while the shrinkage penalty in lasso tends to set some regression coefficients exactly to zero, leading to a sparse model. Therefore, lasso can be used for variable selection, but ridge regression cannot.</span>

<span style='color:Blue'>Forward stepwise selection and best subset selection are variable selection algorithms that fit multiple models with different combinations/numbers of predictors and choose the best model.</span>
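
<span style='color:Blue'>The difference between the two penalties is easiest to see in a simplified special case. The sketch below assumes an orthonormal design in which each least squares coefficient can be shrunk separately, with objectives $\sum_j (y_j - \beta_j)^2 + \lambda \sum_j \beta_j^2$ for ridge and $\sum_j (y_j - \beta_j)^2 + \lambda \sum_j |\beta_j|$ for lasso. The closed forms used hold only under that assumption, and the coefficient values are made up.</span>

```python
def ridge_estimate(ols_coef, lam):
    """Ridge solution in the orthonormal special case: proportional shrinkage."""
    return ols_coef / (1.0 + lam)


def lasso_estimate(ols_coef, lam):
    """Lasso solution in the orthonormal special case: soft-thresholding at lam / 2."""
    if ols_coef > lam / 2:
        return ols_coef - lam / 2
    if ols_coef < -lam / 2:
        return ols_coef + lam / 2
    return 0.0


ols_coefs = [3.0, -0.4, 0.2, -2.5]  # made-up least squares estimates
lam = 1.0                           # made-up penalty strength

ridge_coefs = [ridge_estimate(b, lam) for b in ols_coefs]
lasso_coefs = [lasso_estimate(b, lam) for b in ols_coefs]
print("ridge:", ridge_coefs)
print("lasso:", lasso_coefs)
```

<span style='color:Blue'>Ridge shrinks every coefficient but leaves all of them non-zero, whereas lasso sets the two small coefficients exactly to zero, which is what makes it usable for variable selection.</span>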

## Precision-recall

You are building a facial recognition model to allow people to unlock their phone. If the phone recognizes the person as the authorized user, it will unlock the phone. If it doesn't recognize the user, it will prompt them to try again or try an alternative method (such as a passphrase). The facial recognition model is a classification model that identifies if the person unlocking the phone is the authorized user (positive response) or not (negative response).

Assume that letting a stranger (unauthorized user) unlock the phone is more risky (or more expensive) than not letting the authorized user unlock the phone.

Which of the following metrics is the most important to optimize in the model?

A) Precision 
  
B) Classification accuracy 
  
C) Recall 
  
D) ROC-AUC 

<span style='color:Blue'>**Answer**:</span>
<span style='color:Blue'>Precision</span>

<span style='color:Blue'>**Explanation**:</span>

<span style='color:Blue'>A) Precision: Precision = TP/(TP + FP). Here, FP are those who are classified as the authorized user but are actually an unauthorized user. FN are those who are falsely classified as an unauthorized user when they are actually the authorized user. In this case, it’s important to optimize precision because it is more important to reduce the number of FP (strangers being recognized as the authorized user) than to reduce the number of FN (authorized user not being recognized).</span>

<span style='color:Blue'>B) Classification accuracy: This is incorrect because a model with high accuracy but a high FPR would be unacceptable since it would increase the risk of a stranger unlocking the phone. </span>

<span style='color:Blue'>C) Recall: This is incorrect because a high recall indicates that many of the positive cases are being detected. However, it does not measure the fraction of unauthorized users that the model identifies as authorized. A high FPR could lead to an unauthorized user unlocking the phone, which is a more expensive mistake than an FN.</span>

<span style='color:Blue'>D) ROC-AUC: This is incorrect because ROC-AUC does not take into account the cost of the positive and negative classes. It only measures how well the model can distinguish between authorized users and unauthorized users. </span>

## Logistic regression
Consider the following logistic regression model:

$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}$$

where the threshold probability for classifying observations is assumed to be 0.5. All observations with a predicted probability greater than 0.5 are classified as belonging to class $y=1$, while the others are classified as belonging to class $y=0$. 

Which of the following plots correctly visualizes the predicted class based on $x_1$ and $x_2$?

```python
#| echo: false

# import image module
from IPython.display import Image

# get the image
Image(url="./Datasets/logit.jpg", width=600)
```

<span style='color:Blue'>**Answer**:</span>
<span style='color:Blue'>D</span>

<span style='color:Blue'>**Explanation**:</span>


<span style='color:Blue'>$x_1$ will not have an impact on the outcome because its coefficient is 0. When $x_2>5$, $p(x)$ will be less than $0.5$ and the predicted class will be $y=0$, as the decision threshold probability is 0.5. When $x_2<5$, $p(x)$ will be greater than $0.5$ and the predicted class will be $y=1$. </span>
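
<span style='color:Blue'>The coefficient values are not shown in the question text above, so the sketch below uses hypothetical values $\beta_0 = 10$, $\beta_1 = 0$, $\beta_2 = -2$, chosen only because they reproduce the behaviour described in the explanation (a decision boundary at $x_2 = 5$ that does not depend on $x_1$). They are an assumption, not the values from the original question.</span>

```python
import math

# Hypothetical coefficients, assumed only to match the explanation above.
BETA0, BETA1, BETA2 = 10.0, 0.0, -2.0


def predicted_probability(x1, x2):
    """p(x) for the two-predictor logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(BETA0 + BETA1 * x1 + BETA2 * x2)))


def predicted_class(x1, x2, threshold=0.5):
    """Class 1 if the predicted probability exceeds the threshold, else class 0."""
    return 1 if predicted_probability(x1, x2) > threshold else 0


for x1 in (-3.0, 0.0, 7.0):
    for x2 in (4.0, 6.0):
        print(f"x1={x1}, x2={x2}: p={predicted_probability(x1, x2):.3f}, "
              f"class={predicted_class(x1, x2)}")
```

<span style='color:Blue'>Under these assumed coefficients the predicted class flips when $x_2$ crosses 5 and is the same for every value of $x_1$, so the boundary in the $(x_1, x_2)$ plane is the line $x_2 = 5$ with class 1 on the side where $x_2 < 5$.</span>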

## ROC-AUC

In which of the following cases will ROC-AUC be the most appropriate metric to optimize among all the performance metrics we have seen in this course?

A) There are wide disparities in the cost of false negatives vs. false positives, for example, predicting if the person has a serious disease.

B) The predicted probabilities will be used to rank observations, instead of classifying them, for example, the Google search engine using the predicted probabilities to rank pages in the decreasing order of relevance to the search query, instead of classifying the observations as 'relevant' and 'not relevant'.

C) We wish to maximize the overall classification accuracy, for example, predicting if a person will vote for the Democrat or the Republican candidate in the US Presidential elections. Here, you may assume that the cost of false positives is similar to the cost of false negatives.

<span style='color:Blue'>**Answer**:</span>
<span style='color:Blue'>(B) only </span>

<span style='color:Blue'>**Explanation**:</span>

<span style='color:Blue'>(A) is incorrect because in cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize the type of error associated with the higher cost. For example, when predicting if a person has a serious disease, a false positive could lead to expensive and unnecessary medical treatment. Conversely, a false negative could result in a delay in diagnosis and treatment, potentially leading to a worse outcome. Thus, we want to prioritize minimizing false negatives. Since ROC-AUC is decision-threshold invariant, it's not a useful metric for this type of optimization.</span>

<span style='color:Blue'>(B) is correct. ROC-AUC is scale-invariant. It measures how well predictions are ranked, rather than the absolute values of the predicted probabilities. Check the [link](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc). </span>

<span style='color:Blue'>(C) is incorrect because AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen. However, the overall accuracy changes with change in decision threshold probability. To maximize overall accuracy, we need to find the optimal decision threshold probability.</span>