#Evaluation of Classification Models
___

$$$$
###Evaluation outputs
___
[**ROC curve**](#ROC)

[**Lift chart**](#Lift Chart)

[**Calibration plot**](#Calibration)

[**Confusion Matrix**](#Confusion Matrix)
$$$$
###Metrics
___

[**Classification Accuracy**](#Classification Accuracy): What % of predictions were correct?

[**Precision**](#Precision): How pure is our pool of predicted Positives?

[**Recall**](#Recall):  How often do we identify True Positives?

[**F-Measure or F1 Score**](#F-measure)

**Sensitivity:** See Recall

[**Specificity**](#Specificity): How well do we identify True Negatives?

[**Brier Score**](#Brier Score): How well calibrated is the classifier?
$$$$
###Caveats
___
[**Accuracy Paradox**](#Accuracy Paradox)

<a id='ROC'></a>
##ROC curve

- independent of the P:N ratio (good for comparing classifiers evaluated on samples with different class ratios)
- each point represents a different classifier
- to get a classifier on the line between two points, use an ensemble method: choose classifier A's prediction with probability &alpha; and classifier B's prediction with probability 1 - &alpha;
 
\begin{equation*}
x = FPrate(t), y = TPrate(t)
\end{equation*}



**Binary classifiers** give us a point on the ROC graph

**Probabilistic classifiers** give us a curve by varying the threshold

<img src="ROC.png" width="300" height="500">

\begin{equation*}
A_{ROC} =\int_{0}^{1}\frac{\mathbf{TP}}{P}d{\frac{\mathbf{FP}}{N}} = \frac{1}{PN} \int_0^N {TP} d{FP}
\end{equation*}

\begin{equation*}
A_{ROC1} = P(\mbox{ random positive example } > \mbox{ random negative example})
\end{equation*}

\begin{equation*}
A_{ROC2} = P(\mbox{ random P } > \mbox{ random N }) + \frac{1}{2}P(\mbox{ random P } = \mbox{ random N })
\end{equation*}
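The two views of the AUC above (area under the curve, and probability that a random positive outranks a random negative, with ties counted as one half) can be checked against each other. The following is a minimal sketch in plain Python; the function names and the toy `scores`/`labels` are illustrative assumptions, not part of the original notes.

```python
def roc_points(scores, labels):
    """Return ROC points (FPrate, TPrate), one per distinct threshold."""
    P = sum(labels)
    N = len(labels) - P
    # Sort by descending score: lowering the threshold admits examples in this order.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last example in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / N, tp / P))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(scores, labels):
    """P(pos > neg) + 0.5 * P(pos == neg) over all positive/negative pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical), including one tie between a positive and a negative.
scores = [0.9, 0.8, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc_trapezoid(roc_points(scores, labels)), auc_pairwise(scores, labels))
```

On this toy sample both computations give 5/6, since a tied group is drawn as a single diagonal segment, which is worth half a win per tied pair.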

<img src="ROC1_vs_2.png" width="300" height="500">

<a id='Classification Accuracy'></a>
####Optimal classification accuracy####

**Q:** What percent of predictions were correct?  (TP or TN)

\begin{equation*}
Accuracy = \frac{\mathbf{TP}+\mathbf{TN}}{P+N}
\end{equation*}

So the equation of an iso-performance line (a line of constant accuracy in ROC space) is:

\begin{equation*}
\mathbf{TPrate} = \frac{N}{P}\mathbf{FPrate} + \frac{Accuracy \times (P+N)-N}{P}
\end{equation*}

where the slope is equal to the N:P ratio.

- So the point of optimal accuracy is where the highest iso-performance line touches the ROC curve (the point of the curve furthest toward the upper left for that slope)
- Therefore, two curves with the same AUC can differ greatly in Accuracy
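Because the slope of the iso-performance lines is N:P, the accuracy-optimal point on a fixed ROC curve moves as the class ratio changes. A small sketch, assuming the curve is given as a list of hypothetical `(FPrate, TPrate)` points:

```python
def best_accuracy_point(roc, P, N):
    """Pick the ROC point (FPrate, TPrate) with the highest accuracy for given P and N."""
    def accuracy(point):
        fp_rate, tp_rate = point
        # TP = TPrate * P and TN = (1 - FPrate) * N
        return (tp_rate * P + (1 - fp_rate) * N) / (P + N)

    best = max(roc, key=accuracy)
    return best, accuracy(best)


# Hypothetical ROC curve, evaluated under two different class ratios.
roc = [(0.0, 0.0), (0.1, 0.5), (0.4, 0.9), (1.0, 1.0)]
print(best_accuracy_point(roc, P=100, N=100))  # balanced classes
print(best_accuracy_point(roc, P=100, N=900))  # negatives dominate
```

With balanced classes the best point is (0.4, 0.9); with nine times as many negatives the steeper iso-performance lines push the optimum to (0, 0), i.e. predicting everything as Negative.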

<a id='Accuracy Paradox'></a>
- *Note:* the **Accuracy Paradox** says that you can "increase accuracy" even when you're decreasing a classifier's predictive power:
    - If TP < FP: you can "increase accuracy" by classifying everything as Negative 
    - If TN < FN: you can "increase accuracy" by classifying everything as Positive 
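A quick numeric illustration of the first case, using made-up confusion-matrix counts (10 Positives, 990 Negatives) where TP < FP:

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts for a classifier with some predictive power.
tp, fn, fp, tn = 6, 4, 20, 970

model_accuracy = accuracy(tp, fp, tn, fn)
# Classifying everything as Negative: every Positive becomes an FN, every Negative a TN.
all_negative_accuracy = accuracy(0, 0, tn + fp, tp + fn)

print(model_accuracy, all_negative_accuracy)
```

The trivial all-Negative classifier scores 0.99 against 0.976 for the model, even though it never finds a single Positive.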

<a id='Lift Chart'></a>
##Lift Chart
**Question answered:** If I predict a success, how likely is it that this is actually a success?
- Plots TP against all predicted P
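- If the chart follows the diagonal, the classifier is no better than random: the share of predicted successes that are actual successes equals the base rate $\frac{P}{P+N}$ (one half only when P = N)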
- unlike ROC, depends on ratio of P:N
    - This means that different samples could yield different Lift Charts with identical classification properties
- useful when P (and therefore $TPrate$, ROC curve) is unknown
- e.g. when serving ads, we don't know the total population of converters P

\begin{equation*}
x = Yrate(t) = \frac{\mathbf{TP}(t)+\mathbf{FP}(t)}{P+N}, \quad y = \mathbf{TP}(t)
\end{equation*}

where $Yrate$ is essentially the predicted rate of success
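As a rough sketch of how the chart's points could be computed, assuming the y-axis is the raw TP count at each threshold (the function name and toy data are illustrative):

```python
def lift_points(scores, labels):
    """Return lift chart points (Yrate, TP), one per distinct threshold."""
    total = len(labels)
    # Sort by descending score: the top-k examples are the predicted Positives.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0)]
    tp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        # i + 1 examples are predicted Positive here, so Yrate = (TP + FP) / (P + N).
        if i + 1 == total or ranked[i + 1][0] != score:
            points.append(((i + 1) / total, tp))
    return points


print(lift_points([0.9, 0.7, 0.5, 0.3], [1, 0, 1, 0]))
```

Note that the function never needs P on its own, only the size of the scored sample, which is why the chart remains usable when the total number of Positives is unknown.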

<img src="LiftChart.png" width = "300" height = "500">

####AUC for Lift Charts
- Random classifier will have an AUC of $\frac{P}{2}$, while a perfect classifier has an AUC of $P\left(1-\frac{P}{2(P+N)}\right)$, which approaches $P$ when Positives are rare
- Not meant to be used for optimal classification
- but can find point of maximal profit (related to weighted optimal classification)

    ####Profit
    - fixed benefit for every correct classification
    - reduced by fixed cost for every misclassification
    - profit is maximal where the expected benefit of one more positive prediction equals its expected cost
    - adding weights to positive and negative errors impacts the slope of the iso-performance lines
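A minimal sketch of finding the point of maximal profit by sweeping the threshold. It assumes a simplified profit model that is not spelled out in the notes: a fixed `benefit` per true positive and a fixed `cost` per false positive, with nothing gained or lost on predicted Negatives.

```python
def best_profit_threshold(scores, labels, benefit, cost):
    """Return (max_profit, threshold) under the assumed per-TP benefit and per-FP cost."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = (0.0, None)  # predicting nothing as Positive yields zero profit
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Evaluate profit once per distinct threshold value.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            profit = benefit * tp - cost * fp
            if profit > best[0]:
                best = (profit, score)
    return best


# Hypothetical scores, labels, and payoffs.
print(best_profit_threshold([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 0], benefit=5, cost=3))
```

Changing the ratio of `benefit` to `cost` plays the same role as re-weighting positive and negative errors: it tilts the iso-performance lines and can move the optimum to a different threshold.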

<a id='Calibration'></a>
##Calibration Plot
- Re-scales score allocation to reflect actual probabilities
- For Probabilistic Classifiers
- Dependent on P:N ratio
- Calibration is a measure of whether the predicted success rate = the actual success rate

**Calibration does not:** affect ROC or Lift Chart

**Calibration does:** re-map the probability scores so that they align with the actual probabilities

<img src="Calibration.png" width="300" height="500">

####In a Calibrated model, we expect...
- Of all examples with a probability score of 0.7, 70% are actually successes
- Examples with the highest theoretical score are successes close to 100% of the time
- The ranking order of examples is not changed, but their absolute probability scores change
- The model is unbiased (imperative for unbalanced data)

####Steps (for SVM) when the mapping function is known
1. Choose a subset of examples with the same probabilistic score (or bin samples with similar scores)
2. The proportion of Positives in each subset, $\frac{P}{P+N}$, estimates the true probability for that score
3. Since the relationship between SVM scores s(x) and actual probability P(c|x) is often sigmoidal, fit the score distribution to the following function:

    $\hat{P}(c|x) = \frac{1}{1+e^{As(x)+B}}$

4. Find the parameters A and B that minimize the Negative Log-Likelihood
5. Transform scores to calibrated curve
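Steps 3 to 5 can be sketched with plain gradient descent on the negative log-likelihood. This is only an illustration under simplifying assumptions: it fits on the raw 0/1 labels (no smoothed targets), uses a fixed hypothetical learning rate and step count, and the toy scores are made up.

```python
import math


def platt_probability(score, A, B):
    """Sigmoid mapping from a raw score s(x) to an estimated P(c|x)."""
    return 1.0 / (1.0 + math.exp(A * score + B))


def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit A and B by gradient descent on the mean negative log-likelihood."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for s, y in zip(scores, labels):
            p = platt_probability(s, A, B)
            # With z = A*s + B and p = 1 / (1 + e^z), dNLL/dz = y - p.
            grad_A += (y - p) * s
            grad_B += (y - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B


# Hypothetical SVM-style scores (not linearly separable, so the fit stays finite).
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
A, B = fit_platt(scores, labels)
print(A, B, platt_probability(2.0, A, B))
```

With this sign convention, A comes out negative whenever higher scores indicate the Positive class. In practice a library optimizer would usually replace the hand-written loop.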

####Steps when the mapping function is unknown
1. Sort training examples according to scores and divide into $b$ bins of equal size (number of bins should be chosen by cross-validation)
2. For each bin, find the lower and upper boundary scores
3. For each bin, record the actual proportion of training examples that are successes
4. Use this proportion to estimate the corrected probability score

*note: this does not work well for small or unbalanced datasets*
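The binning steps above could look roughly like this. The sketch assumes, for simplicity, that the number of training examples is divisible by `b`, and it assigns a new score to the first bin whose upper boundary is at least that score; both choices are illustrative.

```python
import bisect


def fit_bins(scores, labels, b):
    """Return (upper_bounds, rates): each bin's upper boundary score and success proportion."""
    ranked = sorted(zip(scores, labels))
    size = len(ranked) // b  # assumes len(scores) is divisible by b
    upper_bounds, rates = [], []
    for k in range(b):
        chunk = ranked[k * size:(k + 1) * size]
        upper_bounds.append(chunk[-1][0])
        rates.append(sum(y for _, y in chunk) / len(chunk))
    return upper_bounds, rates


def calibrate(score, upper_bounds, rates):
    """Map a raw score to the success proportion of the bin it falls into."""
    # Scores above every boundary fall back to the last bin.
    k = min(bisect.bisect_left(upper_bounds, score), len(rates) - 1)
    return rates[k]


# Hypothetical training scores and outcomes.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
upper_bounds, rates = fit_bins(scores, labels, b=3)
print(upper_bounds, rates, calibrate(0.55, upper_bounds, rates))
```

With only three examples per bin, each estimated proportion is very coarse, which illustrates why the method struggles on small datasets.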

####If all else fails
- use isotonic regression to learn mapping of predictions to actual probabilities
- one algorithmic example is Pair-adjacent violators (PAV)
- sklearn has an implementation of isotonic regression: http://scikit-learn.org/stable/modules/calibration.html
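To show what PAV does, here is a small pure-Python sketch (not the sklearn implementation). It takes the 0/1 outcomes ordered by ascending classifier score and returns a non-decreasing sequence of calibrated probabilities; the function name and example input are illustrative.

```python
def pav(labels_sorted_by_score):
    """Adjacent-violators fit: non-decreasing averages over labels ordered by ascending score."""
    blocks = []  # each block is [sum_of_labels, count]
    for y in labels_sorted_by_score:
        blocks.append([y, 1])
        # Merge while the previous block's mean exceeds the current block's mean
        # (compared by cross-multiplication to avoid division).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


print(pav([0, 1, 0, 1, 1, 0, 1]))
```

Each run of examples that violates the ordering is replaced by its average, so the ranking never inverts and the total number of expected successes is preserved.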

<a id='Brier Score'></a>
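**Brier Score:** How well calibrated is the classifier? It is the mean squared difference between the predicted probability $f_t$ and the actual outcome $o_t$ (0 or 1), where $N$ here is the number of predictions rather than the number of Negatives.
\begin{equation*}
BS = \frac{1}{N} \sum_{t=1}^{N}(f_t-o_t)^2
\end{equation*}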
\end{equation*}
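As a direct transcription of the formula, with hypothetical forecasts and outcomes:

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Two confident, correct forecasts and one confident miss.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 0]))
```

Lower is better: 0 means every forecast was exactly right, and the confident miss (0.7 for an outcome of 0) contributes most of the error here.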

####Theory:
Bianca Zadrozny, Charles Elkan @IBM: http://www.research.ibm.com/people/z/zadrozny/kdd2002-Transf.pdf

Miha Vuk, Tomaz Curk (2006) Metodoloski zvezki: http://www.stat-d.si/mz/mz3.1/vuk.pdf

<a id='Confusion Matrix'></a>
###Confusion Matrix

<a id='Precision'></a>
**Precision:** How many Positive Predictions are correct?  
\begin{equation*}
Precision = \frac{\mathbf{TP}}{\mathbf{TP}+FP}
\end{equation*}

<a id='Recall'></a>
**Recall aka Sensitivity:** How often do we identify True Positives?
\begin{equation*}
Recall = \frac{\mathbf{TP}}{\mathbf{TP}+FN}
\end{equation*}

<a id='F-measure'></a>
**F-measure:** Combines Precision and Recall
- the harmonic mean of Precision and Recall
- called F1 Score when $Cost_{precision} = Cost_{recall}$

\begin{equation*}
F = 2\times\frac{Precision \times Recall}{Precision + Recall}
\end{equation*}

<a id='Specificity'></a>
**Specificity:** How often do we identify True Negatives?

\begin{equation*}
Specificity = \frac{\mathbf{TN}}{\mathbf{TN}+FP}
\end{equation*}
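The four confusion-matrix counts and the metrics derived from them can be computed together. A short sketch, assuming labels and predictions are encoded as 1 (Positive) and 0 (Negative); the function names and sample data are illustrative.

```python
def confusion_counts(labels, predictions):
    """Return (TP, FP, TN, FN) for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, fp, tn, fn


def confusion_metrics(tp, fp, tn, fn):
    """Precision, Recall, Specificity and F1 from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Hypothetical labels and predictions.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 1, 0, 0, 0, 1, 0, 1]
print(confusion_metrics(*confusion_counts(labels, predictions)))
```

The sketch does not guard against empty denominators, for example when the classifier predicts no Positives at all and Precision is undefined.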