# Directly estimate the test error using validation or cross-validation 

We can also directly estimate the test error using the validation set and cross-validation method. We can compute the validation set error or the cross-validation error for each model under consideration, and then select the model for which the resulting estimated test error is smallest. 

## MSE
The Mean Squared Error (MSE) is the residual sum of squares (RSS) divided by the number of observations, $MSE = RSS/n$, where

\begin{align}
RSS&=\sum_{i=1}^n(y_i-\hat{y}_i)^2 \\
&=\sum_{i=1}^n(y_i-\hat{\beta}_0-\hat{\beta}_1x_{i1}-\cdots-\hat{\beta}_px_{ip})^2
\end{align}


## RMSE
The Root Mean Squared Error (RMSE) is the square root of MSE

## RSE
The Residual Standard Error (RSE) is the square root of RSS/degrees of freedom

\begin{align}
RSE=\sqrt{\frac{RSS}{n-p-1}}
\end{align}

Models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.
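
The four quantities above can be computed side by side. The following is a minimal sketch in plain Python; the function name `regression_errors` and the toy data are illustrative assumptions, not part of the original notes.

```python
import math

def regression_errors(y, y_hat, p):
    """Return (RSS, MSE, RMSE, RSE) for predictions y_hat from a model with p predictors."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    mse = rss / n                                          # mean squared error
    rmse = math.sqrt(mse)                                  # root mean squared error
    rse = math.sqrt(rss / (n - p - 1))                     # residual standard error
    return rss, mse, rmse, rse

# Hypothetical observations and fitted values from a one-predictor model
y = [3.0, 5.0, 7.0, 9.0, 11.0]
y_hat = [2.5, 5.5, 7.0, 8.0, 12.0]
rss, mse, rmse, rse = regression_errors(y, y_hat, p=1)
print(rss, mse, round(rmse, 4), round(rse, 4))
```

Because RSE divides by $n-p-1$ rather than $n$, it is always at least as large as RMSE for the same fit.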



**Advantage over statistics evaluation metrics: $C_p, AIC, BIC$, and Adjusted $R^2$**: 
- Direct estimate of the test error, and makes fewer assumptions about the true underlying model. 
- Used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom or hard to estimate the error variance $\sigma^2$.

**One-standard-error rule**: We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
 - **Rationale**: if a set of models appear to **be more or less equally good**, then we might as well choose the **simplest model**, that is, the model with the smallest number of predictors. 
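
The one-standard-error rule can be sketched as follows. This assumes the per-fold cross-validation MSE values are already available for each model size; the function name `one_standard_error_rule` and the numbers in `cv_errors` are hypothetical.

```python
import math
import statistics

def one_standard_error_rule(cv_errors):
    """cv_errors maps model size -> list of per-fold test MSE estimates.

    Returns the smallest model size whose mean CV error is within one
    standard error of the lowest mean CV error.
    """
    mean = {k: statistics.mean(v) for k, v in cv_errors.items()}
    se = {k: statistics.stdev(v) / math.sqrt(len(v)) for k, v in cv_errors.items()}
    best = min(mean, key=mean.get)      # model size at the lowest point of the curve
    cutoff = mean[best] + se[best]      # one standard error above that point
    return min(k for k in mean if mean[k] <= cutoff)

# Hypothetical 5-fold CV errors for models with 1 to 4 predictors
cv_errors = {
    1: [10.0, 12.0, 11.0, 13.0, 9.0],
    2: [6.0, 7.0, 5.0, 8.0, 6.5],
    3: [5.5, 6.5, 5.0, 7.5, 6.0],
    4: [5.4, 6.6, 5.1, 7.4, 6.1],
}
chosen = one_standard_error_rule(cv_errors)
print(chosen)
```

In this toy data the three-predictor model has the lowest mean error, but the two-predictor model falls within one standard error of it, so the rule picks the simpler model.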

# Metrics to estimate the test error of a regression model

The training error can be a poor estimate of the test error. Therefore, RSS, MSE(RSS/n) and $R^2$ are not suitable for selecting the best model among a collection of models with different numbers of predictors.

**2 Methods**:

1. **Indirectly** estimate test error by making an **adjustment to the training error** to account for the bias due to overfitting.

2. **Directly** estimate the test error, using either a validation set approach or a cross-validation approach

## Statistical evaluation metrics that adjust the training error for bias: $C_p$, $AIC$, $BIC$, Adjusted $R^2$
These metrics indirectly estimate the test error by making an adjustment to the training error.

**Why adjusting training error?**
- The training set error is generally an underestimate of the test error. When we achieve a model with minimum training error, it doesn't guarantee that the test error will also be the smallest.
- Especially the training error will decrease as more variables are included in the model, but the test error may not. 
- Therefore, training set RSS and training set $R^2$ cannot be used for model selection.

### Mallows' $C_p$

For a fitted least squares model containing d predictors, $C_p$ estimate of test MSE:

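\begin{align}
C_p = \frac{1}{n}(RSS+2d\hat{\sigma}^2)
\end{align}

where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$. Mallows' $C_p$ is sometimes defined instead as $C_p' = \frac{RSS}{\hat{\sigma}^2}+2d-n$. The two forms are not equal, but $C_p = \frac{\hat{\sigma}^2}{n}(C_p'+n)$, so the model with the smallest $C_p$ also has the smallest $C_p'$.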

**Note**:
- The $C_p$ statistic adds a **penalty** of $2d\hat{\sigma}^2$ to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error.
- The **penalty increases as the number of predictors in the model increases**; this is intended to adjust
for the corresponding decrease in training RSS.
- If $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$, then $C_p$ is an unbiased estimate of test **MSE**. Typically, we estimate $\sigma^2$ using $MSE_{all}$, the mean squared error obtained from fitting the model containing **all of the candidate predictors**.

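**How to determine which set of models is best with the $C_p$ statistic?** The "near d" guidance in items 2 to 5 applies to the $\frac{RSS}{\hat{\sigma}^2}+2d-n$ form of the statistic, in which d is usually taken to count every fitted coefficient, intercept included.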
1. Choose the model with **the lowest $C_p$ value**.
2. Identify the model for which the $C_p$ value is **near d**.
> When the $C_p$ value is near d, the bias is small (next to none). When it's much greater than d, the bias is substantial. When it's below d, it is due to sampling error; interpret as no bias
3. The full model always yields $C_p$ = d, so don't select the full model based on $C_p$.
4. If **all models**, except the full model, yield a large $C_p$ not near d, it suggests some **important predictor(s) are missing**. In this case, we are well-advised to identify the predictors that are missing.
5. When more than one model has a small $C_p$ value near d, in general, choose **the simpler model** or the model that meets your research needs.

### AIC
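The AIC criterion is defined for a large class of models fit by maximum likelihood. For a linear model with Gaussian errors, maximum likelihood and least squares are the same thing.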
In this case AIC is given by

\begin{align}
AIC=\frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2)
\end{align}

For least squares models, $C_p$ and AIC are proportional to each other.

### BIC
For the least squares model with d predictors, the BIC is, up to irrelevant constants, given by

\begin{align}
BIC=\frac{1}{n}(RSS+\log(n)d\hat{\sigma}^2)
\end{align}

BIC will tend to take on a small value for a model with **a low test error**.
Since log(n) > 2 for any n > 7, the BIC statistic generally places a **heavier penalty** on models with many variables, and hence results in the selection of **smaller models** than $C_p$.
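
To see how the penalties differ in practice, the formulas given above for $C_p$, AIC and BIC can be evaluated on a few candidate models. This is a small sketch under stated assumptions: the helper names, the sample size, the RSS values and $\hat{\sigma}^2 = 1$ are all hypothetical.

```python
import math

def cp(rss, n, d, sigma2):
    """C_p = (RSS + 2 d sigma^2) / n"""
    return (rss + 2 * d * sigma2) / n

def aic(rss, n, d, sigma2):
    """AIC = (RSS + 2 d sigma^2) / (n sigma^2)"""
    return (rss + 2 * d * sigma2) / (n * sigma2)

def bic(rss, n, d, sigma2):
    """BIC = (RSS + log(n) d sigma^2) / n"""
    return (rss + math.log(n) * d * sigma2) / n

n, sigma2 = 100, 1.0
# Hypothetical training RSS for models with d = 1..4 predictors
rss_by_d = {1: 180.0, 2: 120.0, 3: 116.0, 4: 115.5}

best_cp = min(rss_by_d, key=lambda d: cp(rss_by_d[d], n, d, sigma2))
best_aic = min(rss_by_d, key=lambda d: aic(rss_by_d[d], n, d, sigma2))
best_bic = min(rss_by_d, key=lambda d: bic(rss_by_d[d], n, d, sigma2))
print(best_cp, best_aic, best_bic)
```

With these numbers $C_p$ and AIC agree on the three-predictor model, while the heavier $\log(n)$ penalty leads BIC to the two-predictor model.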

### Adjusted $R^2$ 

Recall:
\begin{align}
R^2=1 - \frac{RSS}{TSS} = 1-\frac{RSS}{\sum(y_i-\bar{y})^2}
\end{align}

**TSS**: total sum of squares for the response


**Why not $R^2$?**
Since RSS always decreases as more variables are added to the model, the $R^2$ always increases as more variables are added. 


For a least squares model with d variables, **the adjusted $R^2$** statistic is calculated as

\begin{align}
Adjusted  \, R^2=1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}
\end{align}


**How to determine which set of models is best with the adjusted $R^2$**:
- **A large value of adjusted $R^2$** indicates a model with a small test error. In theory, the model with the largest adjusted $R^2$ will have **only correct variables and no noise variables**. Maximizing the adjusted $R^2$ is equivalent to minimizing $\frac{RSS}{n-d-1}$, which may increase or decrease as variables are added, due to the presence of d in the denominator.
> Why: Once **all of the correct variables** have been included in the model, adding additional *noise* variables will lead to only a **very small decrease in RSS**. Since d also increases, such variables will lead to an increase in $\frac{RSS}{n-d-1}$, and hence a decrease in the adjusted $R^2$. Therefore, unlike the $R^2$ statistic, the adjusted $R^2$ statistic **pays a price for the inclusion of unnecessary variables** in the model.
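
A short numerical illustration of this point, using the two formulas above. The helper names and the RSS/TSS values are hypothetical; model B adds one noise variable to model A and lowers RSS only slightly.

```python
def r_squared(rss, tss):
    return 1 - rss / tss

def adjusted_r_squared(rss, tss, n, d):
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

n, tss = 50, 500.0
model_a = {"d": 3, "rss": 100.0}   # hypothetical model with the correct variables
model_b = {"d": 4, "rss": 99.5}    # same model plus one noise variable

r2_a = r_squared(model_a["rss"], tss)
r2_b = r_squared(model_b["rss"], tss)
adj_a = adjusted_r_squared(model_a["rss"], tss, n, model_a["d"])
adj_b = adjusted_r_squared(model_b["rss"], tss, n, model_b["d"])
print(round(r2_a, 4), round(r2_b, 4), round(adj_a, 4), round(adj_b, 4))
```

Here $R^2$ rises when the noise variable is added, while the adjusted $R^2$ falls.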

# Metrics for classification model

1. Training error rates will usually be lower than test error rates. The reason is that we specifically adjust the parameters of our model to do well on the training data. **The higher the ratio of parameters p to number of samples n, the more we expect this overfitting to play a role.**

2. Null accuracy is the accuracy that could be achieved by always predicting the most frequent class. Calculating **null accuracy** is a good way to know the minimum we should achieve with our models. In the credit card default example, only 3.33% of the individuals in the training sample defaulted, so a simple but useless classifier that always predicts that an individual will not default, regardless of his or her credit card balance and student status, has an error rate of 3.33%. The LDA training error rate is (23 + 252)/10,000 = 2.75%, so the trivial **null classifier** achieves an error rate that is only a bit higher than LDA's. In other words, LDA's overall accuracy is not impressive here, because it is close to that of a dumb model.

## Two Types of Error, Confusion Matrix


**Confusion Matrix**

Note: incorrect layout in this table (see the plot below).
<img src="./images/55.png" width=600>

**Prediction results**: The matrix table reveals that LDA predicted that a total of 104 people would default. Of these people, 81 actually defaulted and 23 did not. 

1. **False positive/Type I Error**: **A test result which incorrectly indicates that a particular condition or attribute is present.** In this case, it can incorrectly assign an individual who does not default to the default category.
> Only 23 out of 9,667 of the individuals who did not default were incorrectly labeled. This looks like a pretty low error rate! 

2. **False negative/Type II Error**: **A test result that incorrectly indicates that a condition does not hold, while in fact it does.** In this case, it can incorrectly assign an individual who defaults to the no default category. 
> Of the 333 individuals who defaulted, 252 (or 75.7%) were missed by LDA. So while the overall error rate is low, the error rate among individuals who defaulted is very high. **From the perspective of a credit card company** that is trying to identify high-risk individuals, an error rate of 75.7% among individuals who default may well be unacceptable.

<img src="./images/56.png" width=1000>

## Performance Evaluation Terms

1. **Plain Accuracy**: (TP+TN)/(P+N)


2. **Sensitivity/TPR**: TP/(TP+FN) = 1-FNR, the percentage of true defaulters that are identified, a low 24.3% in this case.
For an **imbalanced dataset**, if we care about the positive outcome and the positive class is also the minority class, sensitivity will be a very important metric to consider.


3. **Specificity/TNR**: TN/(TN+FP) = 1-FPR, the percentage of non-defaulters that are correctly identified, here (1 - 23/9667) × 100 = 99.8%.


4. **Brier score**: a way to verify the accuracy of a probability forecast. The best possible Brier score is 0, for total accuracy. The worst possible score is 1, which means the forecast was wholly inaccurate. $Brier = \frac{1}{N}\sum_{t=1}^N(f_t-o_t)^2$, where N is the number of observations, $f_t$ is the forecast probability (e.g., a 25% chance), and $o_t$ is the outcome (1 if it happened, 0 if it didn't).
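
The terms above can be checked against the LDA counts quoted in this section (81 true positives, 23 false positives, 252 false negatives, and 9,667 - 23 = 9,644 true negatives). The sketch below is plain Python; the function names and the small Brier score example are illustrative assumptions.

```python
def classification_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),                  # TPR
        "specificity": tn / (tn + fp),                  # TNR
        "null_accuracy": max(tp + fn, tn + fp) / total  # always predict the majority class
    }

def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

metrics = classification_metrics(tp=81, fp=23, fn=252, tn=9644)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Hypothetical forecasts for three events, of which only the first occurred
example_brier = brier_score([0.9, 0.2, 0.7], [1, 0, 0])
print(round(example_brier, 4))
```

The accuracy (97.25%) is only slightly above the null accuracy (96.67%), while the sensitivity is about 24.3%, which matches the figures discussed above.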


<img src="./images/57.png" width=400>
<img src="./images/58.png" width=400>

## Baseline Performance

**Good choices for baseline model for comparison:**

1. Classification: Majority classifier (a naive classifier that always chooses the majority class of the training dataset)
2. Regression: Predict the average value over the population (usually the mean or median).
3. Other simple or reduced-data models.

# Problems and Techniques of Classifier

## Imbalanced Classes: class-weight

If the interesting class is rare among the general population, the class distribution is **imbalanced** or **skewed**. 

- As the class distribution becomes more skewed, evaluation based on **accuracy breaks down**. Consider fraud detection where fraud cases appear in a 1 out of 100 ratio. A simple rule, **always choose the most prevalent class**, gives 99% accuracy, and this may tell us little about what data mining has really accomplished.


One of the simplest ways to address the class imbalance is to simply provide a weight for each class which **places more emphasis on the minority classes** such that the end result is a classifier which can learn equally from all classes.

- Assign a high weight to the minority class (i.e., higher misclassification cost). We can determine a class weight from the ratio between classes in the dataset
- The class weights are then incorporated into the algorithm. 
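
One common way to derive class weights from the class ratio is inverse-frequency weighting, $w_c = n / (k \cdot n_c)$ for $n$ samples, $k$ classes and $n_c$ samples in class $c$. This is an assumed choice for illustration (it is the same heuristic scikit-learn applies for `class_weight="balanced"`), and the function name and label counts below are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical dataset: 990 negatives and 10 positives
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight (500 here), so errors on the rare class are no longer drowned out by the majority class.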

## Unequal Costs and Benefits: lowering the threshold

The Bayes classifier, which LDA approximates, yields the **smallest possible total number of misclassified observations, regardless of which class the errors come from**. It works by assigning an observation to the class for which the posterior probability $p_k(X)$ is greatest. In the two-class case, this amounts to **assigning an observation to the default class if**

$Pr(default = Yes|X = x)>0.5$. In other words, by default, the Bayes classifier, and by extension LDA, uses a **threshold of 50%** for the posterior probability of default.

However, a credit card company might particularly wish to **avoid incorrectly classifying an individual** who will default, whereas incorrectly classifying an individual who will not default is less problematic. In that case we care most about the error rate among individuals who default, which is the **false negative rate**, and we can consider lowering the threshold. In other words, instead of using the 50% threshold, we could assign an observation to the default class if

$Pr(default = Yes|X = x)>0.2$, which means we can label any customer with a **posterior probability** of default **above 20%** to the **default class**.

The figure below shows the trade-off that results from modifying the threshold value for the posterior probability of default.

<img src="./images/59.png" width=700>

- Using a threshold of 0.5 minimizes the overall error rate, shown as a black solid line.
- But when a threshold of 0.5 is used, the error rate among the individuals who default is quite high (blue dashed line). 
- As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases. 

**How can we decide which threshold value is best? Such a decision must be based on domain knowledge, such as detailed information about the costs associated with default.**
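
The trade-off can be reproduced on a small scale by applying two thresholds to the same posterior probabilities. This is only a sketch: the function name `error_rates`, the probabilities and the labels are made up for illustration.

```python
def error_rates(probs, labels, threshold):
    """Overall error rate and error rates within each true class at a given threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    fn = sum(1 for pred, y in zip(preds, labels) if y == 1 and pred == 0)
    fp = sum(1 for pred, y in zip(preds, labels) if y == 0 and pred == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {
        "overall": (fn + fp) / len(labels),
        "among_positives": fn / n_pos,   # false negative rate
        "among_negatives": fp / n_neg,   # false positive rate
    }

# Hypothetical posterior probabilities of default and true labels (1 = default)
probs = [0.05, 0.10, 0.22, 0.25, 0.30, 0.45, 0.60, 0.80, 0.35, 0.10]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

at_50 = error_rates(probs, labels, threshold=0.5)
at_20 = error_rates(probs, labels, threshold=0.2)
print(at_50)
print(at_20)
```

Lowering the threshold from 0.5 to 0.2 removes the errors among true defaulters in this toy data, at the price of more errors among non-defaulters and a higher overall error rate.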

## Use Expected Value to Decide Threshold

The general form of an expected value calculation

\begin{align}
EV = p(o_1) \cdot v(o_1) + p(o_2) \cdot v(o_2) + p(o_3) \cdot v(o_3) + \cdots
\end{align}

Each ${o_i}$ is a possible decision outcome; $p(o_i)$ is its probability and $v(o_i)$ is its value. The probabilities often can be estimated from the data (ii), but the business values often need to be acquired from other sources (iii). 

In targeted marketing, for example, we may want to assign each consumer a class of *likely responder* versus *not likely responder*, then we could target the likely responders. Unfortunately, for targeted marketing often the probability of response for any individual consumer is very low, maybe **one or two percent**, so no consumer may seem like a likely responder. If we choose a "common sense" threshold of 50% for deciding what a likely responder is, we would probably not target anyone.

- **Responder's value**: Say that our average revenue from a responding consumer is 100 dollars. To target the consumer with the offer, we also incur a cost. Let's say that we mail some flashy marketing materials, and the overall cost including postage is 1 dollar, yielding a profit of 99 dollars if the consumer responds.

- **Non-responder's value**: We still mailed the marketing materials, incurring a cost of 1 or equivalently a benefit of -1.

- Target a given customer x only if:
\begin{align}
p_R(x)*99 - [1-p_R(x)]*1 > 0 \\ \rightarrow p_R(x) > 0.01
\end{align}
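
The inequality above generalizes to any profit and cost: targeting pays off when $p_R(x)$ exceeds cost / (profit + cost). The sketch below checks this with the numbers from the example; the function names are hypothetical.

```python
def expected_value(p_response, profit_if_response, cost_if_no_response):
    """Expected value of targeting one customer."""
    return p_response * profit_if_response - (1 - p_response) * cost_if_no_response

def break_even_probability(profit_if_response, cost_if_no_response):
    """Response probability at which the expected value of targeting is exactly zero."""
    return cost_if_no_response / (profit_if_response + cost_if_no_response)

threshold = break_even_probability(profit_if_response=99, cost_if_no_response=1)
ev_above = expected_value(0.02, 99, 1)    # a customer with a 2% response probability
ev_below = expected_value(0.005, 99, 1)   # a customer with a 0.5% response probability
print(threshold, ev_above, ev_below)
```

A customer with a 2% response probability is worth targeting, even though that probability is far below the "common sense" 50% threshold.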


## Use Expected Value to Evaluate Classifier

**Confusion Matrix --> Expected rates of TP/FP/FN/TN --> Cost and Benefit --> Expected Value**

Expected profit = p(TP) * Value_TP + p(FP) * Value_FP + p(FN) * Value_FN + p(TN) * Value_TN
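
As a sketch of this pipeline, the confusion-matrix counts are first turned into rates and then combined with the value of each outcome. The counts below are hypothetical; the values reuse the targeted-marketing example (99 for a targeted responder, -1 for a targeted non-responder, 0 when nobody is targeted).

```python
def expected_profit(counts, values):
    """Weight the value of each confusion-matrix cell by its observed rate."""
    total = sum(counts.values())
    return sum(counts[cell] / total * values[cell] for cell in counts)

# Hypothetical confusion matrix on a held-out set
counts = {"TP": 56, "FP": 7, "FN": 5, "TN": 42}
# Value per outcome, following the marketing example in this section
values = {"TP": 99.0, "FP": -1.0, "FN": 0.0, "TN": 0.0}

profit_per_customer = expected_profit(counts, values)
print(round(profit_per_customer, 2))
```

The result is an expected profit per customer scored, which allows classifiers with different confusion matrices to be compared on a single business-relevant number.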

# Model Performance Visualization

## Profit Curve to Compare Classifier and Threshold

**Y-axis**: Expected value. As the threshold is lowered, some instances that were considered negative may be reclassified as positive. So technically, each different threshold produces a different classifier. At each threshold we explore the classifier's confusion matrix and calculate the corresponding expected profit. 

**X-axis**: Percentage of target according to the classifier score. Some models give a score that ranks cases by their likelihood of belonging to the class of interest. 

> A logistic model gives a probability estimate for each observation. A tree-based model can also produce probabilities from frequency-based estimates. An SVM is closely related to logistic regression, and its score for each point (the distance to the hyperplane) can be converted into a probability as well. Neural networks are another possibility, if you use cross-entropy as the cost function with sigmoidal output units. You can perform probability calibration on the outputs of any classifier that gives some score. The most common example of this is Platt scaling.

With these scores, we can rank the cases of interest by probability and target only the subset with the highest scores. This is preferable when we have a budget for actions, such as a fixed marketing budget for a campaign, and want to target the most promising candidates.


<img src="./images/120.png" width=600>

- This should make sense because, at the left side, when no customers are targeted there are no expenses and zero profit; at the right side everyone is targeted, so every classifier performs the same. 

- In between, we'll see some differences depending on how the classifiers order the consumers. The random classifier performs worst because it has an even chance of choosing a responder or a nonresponder. Among the classifiers tested here, the one labeled Classifier 2 produces the maximum profit of 200 by targeting the top-ranked 50% of consumers.

- If we are constrained by a budget, we still want to target the highest-ranked people:
    - Say you have 100,000 total customers and a budget of 40,000 for the marketing campaign. Each offer costs 5 so you can target at most 40,000/5 = 8,000 customers. 
    - 8,000 customers is 8% of your total customer base, so check the performance curves at x=8%. The best-performing model at this performance point is Classifier 1. You should use it to score the entire population, then send offers to the highest-ranked 8,000 customers.
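
A profit curve can be sketched by sorting customers by score and accumulating profit as more of them are targeted. The function name, scores, outcomes and per-customer values below are hypothetical; the budget arithmetic at the end repeats the example above.

```python
def profit_curve(scores, responded, profit_if_response, cost_if_no_response):
    """Return (fraction targeted, cumulative profit) points, highest scores first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve = [(0.0, 0.0)]            # nobody targeted: no expenses and zero profit
    total = 0.0
    for k, i in enumerate(order, start=1):
        total += profit_if_response if responded[i] else -cost_if_no_response
        curve.append((k / len(scores), total))
    return curve

# Hypothetical classifier scores and actual responses (1 = responded)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
responded = [1, 1, 0, 1, 0, 0, 0, 0]
curve = profit_curve(scores, responded, profit_if_response=9, cost_if_no_response=1)
best_fraction, best_profit = max(curve, key=lambda point: point[1])
print(best_fraction, best_profit)

# Budget constraint from the example: 40,000 budget, 5 per offer, 100,000 customers
max_offers = 40_000 // 5
budget_fraction = max_offers / 100_000
print(max_offers, budget_fraction)
```

Each point on the curve corresponds to one threshold, so reading the curve at the budget fraction gives the profit of the classifier that targets exactly that share of customers.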

## Fitting Curve to Detect Overfitting

**y-axis**: a performance measure on both train set and test set (MSE, AUC, etc.)

**x-axis**: the flexibility of the model (Tree size / Number of tree nodes in tree model)


When performance on the test set starts to decrease, overfitting is occurring.



<img src="./images/126.png" width=600>

## ROC Curve & AUC

- ROC curve: a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. 
- AUC: The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). 

<img src="./images/60.png" width=700>


- An ideal ROC curve will **hug the top left corner**, so **the larger the AUC** the better the classifier. We expect a classifier that performs no better than chance to have an AUC of 0.5


- Diagonal: a **random classifier** moves back and forth on the diagonal based on the frequency with which it guesses the positive class.


- Point on the ROC curve: models with **different threshold**. Each threshold value produces a different point in ROC space.

<img src="./images/121.png" width=600>

> The bottom left corner represents the model with the **highest** threshold, where everything is classified as **Negative**. As we lower the threshold, more instances move from the negative to the positive class, and we take another step along the curve. Each time we pass a truly positive instance, we take a step upward (true positives increase); each time we pass a truly negative instance, we take a step rightward (false positives increase).


- ROC curves are useful for **comparing different classifiers**, since they take into account all possible thresholds.
 - The choice of a threshold **depends on the relative importance of TPR and FPR** in the classification problem.
 - If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes $TPR - FPR$, which is the point on the curve farthest above the diagonal and typically lies near the **top left corner** of the ROC space.
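
The step-by-step construction described above (up for each positive, right for each negative) can be sketched directly, with the AUC obtained by the trapezoidal rule. This is a minimal illustration that assumes no tied scores; the function names and data are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by lowering the threshold one instance at a time."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]           # highest threshold: everything classified negative
    for i in order:
        if labels[i] == 1:
            tp += 1                 # step upward
        else:
            fp += 1                 # step rightward
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores and true labels (1 = positive), no ties
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
print(points)
print(round(area, 4))
```

The resulting area, 8/9, equals the fraction of positive-negative pairs in which the positive instance receives the higher score, which is another common reading of the AUC.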

## Precision-Recall Curve

ROC curves can sometimes be misleading in some very imbalanced applications. A ROC curve can still look pretty good (i.e., better than random) while misclassifying most or all of the minority class, because the False Positive Rate (False Positives / Total Real Negatives) stays small even with many false positives when the Total Real Negatives is huge.

- Precision = $P(Y=1|\hat{Y}=1)$
- Recall/Sensitivity = $P(\hat{Y}=1|Y=1)$
- Specificity = $P(\hat{Y}=0|Y=0)$

The sensitivity/recall and specificity, which make up the ROC curve, are probabilities conditioned on the true class label. Therefore, they will be the same regardless of what $P(Y=1)$ is. Precision is a probability conditioned on your estimate of the class label and will thus vary if you try your classifier in different populations with different baseline $P(Y=1)$. So the PR curve in this case does reflect (amplify or zoom in on) the trade-off between TP and FP. But PR curves don't translate well to more balanced cases, or cases where negatives are rare.

A skillful model is represented by a curve that bows towards the coordinate (1,1). A no-skill classifier will be a horizontal line on the plot with a precision equal to the proportion of positive examples in the dataset. For a balanced dataset this will be 0.5.
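
The points of a precision-recall curve can be computed in the same threshold-lowering fashion as an ROC curve. The sketch below assumes no tied scores, and the function name and data are hypothetical.

```python
def precision_recall_points(scores, labels):
    """(recall, precision) after each step of lowering the threshold."""
    n_pos = sum(labels)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points

# Hypothetical scores and true labels (1 = positive), no ties
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
pr_points = precision_recall_points(scores, labels)
no_skill_precision = sum(labels) / len(labels)   # proportion of positives
print(pr_points)
print(no_skill_precision)
```

When every instance is classified as positive (the last point), recall is 1 and precision falls to the proportion of positives, which is the no-skill baseline.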

<img src="./images/124.png" width=600>

## Cumulative Gains Chart and Lift Chart


**Cumulative Gains Chart**

We compare the cumulative percentage of customers who are responders with the cumulative percentage of customers contacted in the marketing campaign across the groups. This describes the 'gain' in targeting a given percentage of the total number of customers using the highest modelled probabilities of responding, rather than targeting them at random.


<img src="./images/122.png" width=600>

The dashed line corresponds to "no gain", i.e., what we would expect to achieve by contacting customers at random. The closer the cumulative gains line is to the top-left corner of the chart, the greater the gain; the higher the proportion of the responders that are reached for the lower proportion of customers contacted.

Depending on the **costs** associated with sending each piece of direct mail and the expected revenue from each responder, the cumulative gains chart can be used to **decide upon the optimum number** of customers to contact. There will likely be a tipping point at which we have reached a sufficiently high proportion of responders, and where the costs of contacting a greater proportion of customers are too great given the diminishing returns. This will generally correspond with a **flattening-off** of the cumulative gains curve, where further contacts (corresponding with additional deciles) are not expected to provide many additional responders.


**Lift Chart**

Shows the actual lift.

To plot the chart: Calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model.

Example: For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders. The y-value of the lift curve at 10% is 30 / 10 = 3.

Ideally, we want the lift curve to extend as high as possible into the top-left corner of the figure, indicating that we have a large lift associated with contacting a small proportion of customers.
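
Both charts can be derived from the same ranking. The sketch below builds a hypothetical population of 100 customers with 10 responders, arranged so that the top 10% by score contains 3 of them, which reproduces the lift of 3 from the example above. The function name and data are assumptions for illustration.

```python
def gains_and_lift(scores, responded, fractions):
    """For each contacted fraction return (fraction, cumulative gain, lift)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_responders = sum(responded)
    rows = []
    for f in fractions:
        k = round(f * len(scores))                    # number of customers contacted
        captured = sum(responded[i] for i in order[:k])
        gain = captured / total_responders            # share of all responders reached
        rows.append((f, gain, gain / f))              # lift = gain / share contacted
    return rows

# Hypothetical data: scores decrease with the index, responders sit at these positions
n = 100
scores = [1 - i / n for i in range(n)]
responder_positions = {0, 3, 7, 12, 15, 21, 30, 44, 60, 85}
responded = [1 if i in responder_positions else 0 for i in range(n)]

rows = gains_and_lift(scores, responded, fractions=[0.1, 0.2, 0.5, 1.0])
for fraction, gain, lift in rows:
    print(f"contacted {fraction:.0%}: gain {gain:.0%}, lift {lift:.2f}")
```

The lift shrinks toward 1 as the contacted fraction approaches 100%, because contacting everyone reaches all responders with or without a model.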


<img src="./images/123.png" width=600>


**Why?**

- Cumulative gains and lift curves are a simple and useful approach to understand what returns you are likely to get from running a marketing campaign and how many customers you should contact, based on targeting the most promising customers using a predictive model.