# Understanding Regression Error Metrics in Python

        
        Mean Absolute Error (MAE)
        Mean Squared Error (MSE)
        Root Mean Squared Error (RMSE)
        Root Mean Squared Logarithmic Error (RMSLE)
        Mean Squared Percentage Error (MSPE)
        Mean Absolute Percentage Error (MAPE)
        Mean Percentage Error (MPE)
        R Squared (R²)
        Adjusted R Squared (adjusted R²)

## What is Absolute Error?

Absolute Error is the amount of error in your measurements. It is the difference between the measured value and the “true” value. For example, if a scale states 90 pounds but you know your true weight is 89 pounds, then the scale has an absolute error of 90 lbs – 89 lbs = 1 lb.

This can be caused by your scale not measuring the exact amount you are trying to measure. For example, your scale may be accurate to the nearest pound. If you weigh 89.6 lbs, the scale may “round up” and give you 90 lbs. In this case the absolute error is 90 lbs – 89.6 lbs = 0.4 lbs.

                                    Δx = xi – x

Where:

    xi is the measurement,
    x is the true value.

Using the first weight example above, the absolute error formula gives the same result:

Δx = 90 lbs – 89 lbs = 1 lb.

Sometimes you’ll see the formula written with the absolute value symbol (these bars: | |). This is often used when you’re dealing with multiple measurements:

                                    Δx = |xi – x|
                                    
The absolute value symbol is needed because sometimes the measurement will be smaller, giving a negative number. For example, if the scale measured 89 lbs and the true value was 95 lbs then you would have a difference of 89 lbs – 95 lbs = -6 lbs. On its own, a negative value is fine (-6 just means “six units below”) but the problem comes when you’re trying to add several values, some of which are positive and some of which are negative. For example, let’s say you have:

89 lbs – 95 lbs = -6 lbs and

98 lbs – 92 lbs = 6 lbs

On their own, both measurements have absolute errors of 6 lbs. If you add them together, you should get a total of 12 lbs of error, but because of that negative sign you’ll actually get -6 lbs + 6 lbs = 0 lbs, which makes no sense at all — after all, there was a pretty big error (12 lbs) which has somehow become 0 lbs of error. We can solve this by taking the absolute value of the results and then adding:

|-6 lbs| + |6 lbs| = 12 lbs.
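
The cancellation described above is easy to reproduce. The following is a minimal sketch using the two scale readings from the example; the variable names are only illustrative.

```python
# Signed errors (measured - true) for the two scale readings, in lbs.
errors = [89 - 95, 98 - 92]

# Adding the signed errors lets the positive and negative values cancel.
signed_total = sum(errors)

# Taking the absolute value first keeps every error's magnitude.
absolute_total = sum(abs(e) for e in errors)

print(signed_total)    # 0
print(absolute_total)  # 12
```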


![BmBC8VW.jpg](https://i.imgur.com/BmBC8VW.jpg)



## Mean Absolute Error
The Mean Absolute Error (MAE) is the average of all absolute errors. The formula is:

![MAE.png](https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2016/10/MAE.png)

The picture below is a graphical description of the MAE. The green line represents our model’s predictions, and the blue points represent our data. 

![tqnei6J.jpg](https://i.imgur.com/tqnei6J.jpg)

__Because we use the absolute value of the residual, the MAE does not indicate whether the model underpredicts or overpredicts.__
Each residual contributes proportionally to the total amount of error, meaning that larger errors contribute linearly to the overall error.
__A small MAE suggests the model is great at prediction, while a large MAE suggests that your model may have trouble in certain areas.__

    A MAE of 0 means that your model is a perfect predictor of the outputs (but this will almost never happen).
    

Where:

    n = the number of errors,
    Σ = summation symbol (which means “add them all up”),
    |xi – x| = the absolute errors.
    
The formula may look a little daunting, but the steps are easy:

    Find all of your absolute errors, |xi – x|.
    Add them all up.
    Divide by the number of errors. For example, if you had 10 measurements, divide by 10.
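
The three steps can be sketched in a few lines of plain Python. This is an illustrative helper, not a library function; the sample values reuse the scale readings from the absolute error section.

```python
def mean_absolute_error(measured, true_values):
    """Average of |xi - x| over all measurement/true-value pairs."""
    if len(measured) != len(true_values):
        raise ValueError("both lists must have the same length")
    # Step 1: find all of the absolute errors.
    absolute_errors = [abs(m - t) for m, t in zip(measured, true_values)]
    # Steps 2 and 3: add them up and divide by the number of errors.
    return sum(absolute_errors) / len(absolute_errors)


measured = [90, 89, 98]
true_values = [89, 95, 92]
mae = mean_absolute_error(measured, true_values)
print(round(mae, 3))
```

Here the absolute errors are 1, 6 and 6, so the MAE is 13 / 3.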
    
__If there are outliers in your data, you may want to bring more attention to them or downplay them. The issue of outliers can play a major role in which error metric you use.__

In MAE, the error is calculated as the average of the absolute differences between the target values and the predictions. The MAE is a linear score, which means that all the individual differences are weighted equally in the average. For example, an error of 10 contributes exactly twice as much as an error of 5. The same is not true for RMSE.

MAE is widely used in finance, where a $10 error is usually exactly twice as bad as a $5 error. The MSE, on the other hand, treats a $10 error as four times worse than a $5 error. This makes MAE easier to justify than RMSE in such settings.




## Mean square error

The mean square error (MSE) is just like the MAE, but squares the difference before summing them all instead of using the absolute value. We can see this difference in the equation below. 

![vB3UAiH.jpg](https://i.imgur.com/vB3UAiH.jpg)

Because we are squaring the difference, __the MSE will almost always be bigger than the MAE__ (the exception is when the errors are smaller than 1). The two metrics are also expressed in different units, so we cannot directly compare the MAE to the MSE.

We can only compare our model’s error metrics to those of a competing model. The effect of the square term in the MSE equation is most apparent when there are outliers in our data.

While each residual in MAE contributes proportionally to the total error, __the error grows quadratically in MSE.__ This ultimately means that outliers in our data will contribute to much higher total error in the MSE than they would the MAE.  

![mLn8AeW.jpg](https://i.imgur.com/mLn8AeW.jpg)

Outliers will produce these quadratically larger differences, and it is our job to judge how we should approach them.

The higher this value, the worse the model is. It is never negative, since we’re squaring the individual prediction errors before summing them, and it would be zero for a perfect model.

__Advantage:__ Useful if there are unexpected values that we should care about, such as very high or very low values that deserve attention.

__Disadvantage:__ If we make a single very bad prediction, the squaring will make the error even worse and it may skew the metric towards overestimating the model’s badness. That is a particularly problematic behaviour if we have noisy data (that is, data that for whatever reason is not entirely reliable): even a “perfect” model may have a high MSE in that situation, so it becomes hard to judge how well the model is performing. On the other hand, if all the errors are smaller than 1, then the opposite effect is felt: we may underestimate the model’s badness.

Mean square error is never negative, and a value closer to 0 is better. Let’s see how it is calculated:

|Actual Value (y)|Predicted Value (y hat)|Error (difference)|Squared Error|
|---|---|---|---|
|100|130|-30|900|
|150|170|-20|400|
|200|220|-20|400|
|250|260|-10|100|
|300|325|-25|625|
|Mean|||485|

If we were to run models with different parameters or independent variables, the model with the lower MSE would be deemed better.
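
The table can be reproduced directly. This is a minimal sketch that recomputes the squared errors and their mean from the actual and predicted columns above.

```python
actual = [100, 150, 200, 250, 300]
predicted = [130, 170, 220, 260, 325]

# Square each difference so that positive and negative errors cannot cancel.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# The MSE is the mean of the squared errors.
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [900, 400, 400, 100, 625]
print(mse)             # 485.0
```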


### The problem of outliers
 Do we include the outliers in our model creation or do we ignore them? The answer to this question is dependent on the field of study, the data set on hand and the consequences of having errors in the first place.
 
1) For example, I know that some video games achieve superstar status and thus have disproportionately higher earnings. Therefore, it would be foolish of me to ignore these outlier games because they represent a real phenomenon within the data set. I would want to use the MSE to ensure that my model takes these outliers into account more.

2) If I wanted to downplay their significance, I would use the MAE, since the outlier residuals won’t contribute as much to the total error as they do in the MSE. Ultimately, the choice between MSE and MAE is application-specific and depends on how you want to treat large errors. Both are still viable error metrics, but they describe different nuances of your model’s prediction errors.

Another error metric you may encounter is the root mean squared error (RMSE). As the name suggests, it is the square root of the MSE. Because the MSE is squared, its units do not match those of the original output. Researchers will often use RMSE to convert the error metric back into the original units, making interpretation easier. Since the MSE and RMSE both square the residual, they are similarly affected by outliers. The RMSE is analogous to the standard deviation (as the MSE is to the variance) and is a measure of how widely your residuals are spread out.

Both MAE and MSE can range from 0 to positive infinity, so as both of these measures get higher, it becomes harder to interpret how well your model is performing. Another way we can summarize our collection of residuals is by using percentages, so that each prediction is scaled against the value it’s supposed to estimate.

Like MAE, we’ll calculate the MSE for our model. Thankfully, the calculation is just as simple as MAE.

        # sales, X and the fitted model lm are assumed to be defined earlier
        mse_sum = 0
        for sale, x in zip(sales, X):
            prediction = lm.predict(x)
            # square each residual before adding it to the running total
            mse_sum += (sale - prediction) ** 2
        # divide once, after the loop, by the number of observations
        mse = mse_sum / len(sales)
        print(mse)
        >>> [ 3.53926581 ]

With the MSE, we would expect it to be much larger than MAE due to the influence of outliers. We find that this is the case: the MSE is an order of magnitude higher than the MAE. The corresponding RMSE would be about 1.88, indicating that our model misses actual sale values by about $1.88M.

### Root Mean Squared Error

RMSE is one of the most popular evaluation metrics used in regression problems.

RMSE is just the square root of MSE. The square root is introduced so that the scale of the errors matches the scale of the targets.

![1*qz8jRMxmMEwNsFh0Cs5XfQ.png](https://cdn-images-1.medium.com/max/720/1*qz8jRMxmMEwNsFh0Cs5XfQ.png)

For example, if we have two sets of predictions, A and B, and the MSE of A is greater than the MSE of B, then we can be sure that the RMSE of A is greater than the RMSE of B. It also works in the opposite direction.

![1*qz8jRMxmMEwNsFh0Cs5XfQ.png](https://cdn-images-1.medium.com/max/720/1*e9NYGLz3a9wdKpYMuLTI0Q.png)



RMSE is highly affected by outlier values.
Squaring the errors prevents positive and negative errors from cancelling each other out.

__Model 1__

|Actual Value|Predicted Value|Error|
|------------|------------|------------|
|250|240|10|
|645|600|45|
|800|825|-25|

__Model 2__

|Actual Value|Predicted Value|Error|
|------------|------------|------------|
|250|280|-30|
|645|1200|-555|
|800|1600|-800|

__Result__

|Error Metric|Model 1|Model 2|
|------------|------------|------------|
|MAE|26.67|461.67|
|MSE|916.67|316308.33|
|RMSE|30.28|562.41|

            MAE, MSE and RMSE are all lower for Model 1, so the first model is better.
            MSE punishes large errors: the gap between the two models is far wider in MSE than in MAE.
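
As a cross-check, the three metrics can be recomputed from the Error columns of the Model 1 and Model 2 tables. The helper name `summarize` is only illustrative.

```python
import math


def summarize(errors):
    """Return (MAE, MSE, RMSE) for a list of signed errors."""
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    return mae, mse, math.sqrt(mse)


model_1_errors = [10, 45, -25]
model_2_errors = [-30, -555, -800]

for name, errors in (("Model 1", model_1_errors), ("Model 2", model_2_errors)):
    mae, mse, rmse = summarize(errors)
    print(f"{name}: MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```

Computed this way, the MAE values are about 26.67 and 461.67, so every metric favours Model 1, and the squared metrics separate the two models far more sharply than the MAE does.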
            
#### MAE and RMSE — Which Metric is Better?

__Mean Absolute Error (MAE)__

![1*OVlFLnMwHDx08PHzqlBDag.gif](https://cdn-images-1.medium.com/max/800/1*OVlFLnMwHDx08PHzqlBDag.gif)

__Root mean squared error (RMSE)__
![1*OVlFLnMwHDx08PHzqlBDag.gif](https://cdn-images-1.medium.com/max/800/1*9hQVcasuwx5ddq_s3MFCyw.gif)

The three tables below show examples where MAE is steady and RMSE increases as the variance associated with the frequency distribution of error magnitudes also increases.

![1*YTxb8K2XZIisC944v6rERw.png](https://cdn-images-1.medium.com/max/800/1*YTxb8K2XZIisC944v6rERw.png)

    1) [MAE] ≤ [RMSE]. The RMSE will always be larger than or equal to the MAE. If all of the errors have the same magnitude, then RMSE = MAE.

    2) [RMSE] ≤ [MAE * sqrt(n)], where n is the number of test samples. The difference between RMSE and MAE is greatest when all of the prediction error comes from a single test sample. The squared error then equals [MAE^2 * n] for that single test sample and 0 for all other samples. Taking the square root, RMSE then equals [MAE * sqrt(n)].
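
Both bounds can be checked on two small, made-up error lists: one where every error has the same magnitude, and one where all of the error sits in a single sample. The numbers are hypothetical and chosen only to hit the two extremes.

```python
import math


def mae_rmse(errors):
    """Return (MAE, RMSE) for a list of signed errors."""
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    return mae, rmse


equal_errors = [2, 2, 2, 2]   # same magnitude everywhere: RMSE equals MAE
single_error = [0, 0, 0, 8]   # all error in one sample: RMSE equals MAE * sqrt(n)

for errors in (equal_errors, single_error):
    mae, rmse = mae_rmse(errors)
    upper_bound = mae * math.sqrt(len(errors))
    print(mae, rmse, upper_bound)
```

Both lists have an MAE of 2, but the RMSE is 2 in the first case and 4 in the second, which is exactly MAE * sqrt(4).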

![1*HmnyRcMjgfW-Bo2_NKLYqg.png](https://cdn-images-1.medium.com/max/720/1*HmnyRcMjgfW-Bo2_NKLYqg.png)



https://www.dataquest.io/blog/understanding-regression-error-metrics/
https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d


### Root Mean Squared Logarithmic Error

When calculating RMSLE, 1 is added to both the actual and the predicted values, because they can be 0 and the log of 0 is undefined. Otherwise the formula is the same as for RMSE, applied to the logarithms. The standard notation for RMSLE is:


![rmsle-2.png](https://akhilendra.com/wp-content/uploads/2019/03/rmsle-2.png)


|Actual Value (y)|Predicted Value (y hat)|Actual + 1|Predicted + 1|log10(Actual + 1)|log10(Predicted + 1)|Error (difference)|Squared Error|
|---|---|---|---|---|---|---|---|
|100|130|101|131|2.004321374|2.117271296|-0.112949922|0.012757685|
|150|170|151|171|2.178976947|2.23299611|-0.054019163|0.00291807|
|200|220|201|221|2.303196057|2.344392274|-0.041196216|0.001697128|
|250|260|251|261|2.399673721|2.416640507|-0.016966786|0.000287872|
|300|325|301|326|2.478566496|2.5132176|-0.034651104|0.001200699|
|Mean|||||||0.003772291|
|Square root of mean|||||||0.061418977|
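
The worked table can be reproduced with a short function. Note that the logarithm values in the table are base-10; the natural logarithm is the more common convention for RMSLE, so this sketch takes the log function as a parameter (an assumption made here for illustration) and computes both.

```python
import math

actual = [100, 150, 200, 250, 300]
predicted = [130, 170, 220, 260, 325]


def rmsle(actual, predicted, log=math.log10):
    """Root mean squared difference of log(value + 1)."""
    squared_log_errors = [
        (log(a + 1) - log(p + 1)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


rmsle_base10 = rmsle(actual, predicted)              # matches the table
rmsle_natural = rmsle(actual, predicted, math.log)   # natural-log convention

print(round(rmsle_base10, 6))
print(round(rmsle_natural, 6))
```

The two results differ only by the constant factor ln(10), so the choice of base does not change which of two models scores better.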


|Mean Absolute Error (MAE)|Mean Square Error (MSE)|Root Mean Square Error (RMSE)|Root Mean Square Log Error (RMSLE)|
|-|-|-|-|
|Does not account for the direction of the error: the absolute value of each error is used.|Does not account for the direction of the error either: squaring removes the sign.|Same as MSE: the sign is lost when the errors are squared.|The sign of each log error is also lost when squaring, but the metric is asymmetric: underestimates are penalized more than overestimates of the same size.|
||MSE and RMSE share many properties because RMSE is simply the square root of MSE.|RMSE and MSE share many properties because RMSE is simply the square root of MSE.||
|MAE is less sensitive to large error values. It may not adequately reflect the performance when large errors matter.|MSE is highly sensitive to large error values.|RMSE is better at reflecting performance when large errors matter.||
|||RMSE is more useful when large errors are particularly undesirable.||
|MAE is less than or equal to RMSE, and the gap can grow as the sample size goes up.||RMSE tends to be higher than MAE as the sample size goes up.||
|MAE does not give extra weight to large errors.|MSE penalizes large errors.|RMSE penalizes large errors.|RMSLE measures relative error, so a large absolute error on a large value is not heavily penalized. It is usually used when you don’t want large errors on large values to dominate the result. RMSLE penalizes underestimates more than overestimates.|
|MAE is more useful when the overall impact is proportionate to the increase in error. For example, if the error goes up from 3 to 6, the impact on the result doubles. This is common in the financial industry, where a loss of 6 is twice as bad as a loss of 3.||RMSE is more useful when the overall impact is disproportionate to the increase in error. For example, if the error goes up from 3 to 6, the impact on the result more than doubles. This can be common in clinical trials, where the overall impact grows disproportionately as the error goes up.||
|||When actual and predicted values are low, RMSE and RMSLE are usually similar.|When actual and predicted values are low, RMSE and RMSLE are usually similar.|
|||When either the actual or the predicted values are high, RMSE > RMSLE.|When either the actual or the predicted values are high, RMSE > RMSLE.|


### Mean Absolute Percentage Error
__First find the Error:__
Subtract one value from the other. Ignore any minus sign.

Example: I estimated 260 people, but 325 came. 
    
    260 − 325 = −65, ignore the "−" sign, so my error is 65

__Then find the Percentage Error:__
Show the error as a percent of the exact value, so divide by the exact value and make it a percentage:

Example continued: 65/325 = 0.2 = 20%
This is the formula for "Percentage Error":

      |Approximate Value − Exact Value|  × 100%
      ---------------------------------
            |Exact Value|
        
(The "|" symbols mean absolute value, so negatives become positive)


__Example:__ I thought 70 people would turn up to the concert, but in fact 80 did!

            |70 − 80| / |80| × 100% = 10/80 × 100% = 12.5%
            I was in error by 12.5%


__Example:__ The report said the carpark held 240 cars, but we counted only 200 parking spaces.

            |240 − 200| / |200| × 100% = 40/200 × 100% = 20%
            The report had a 20% error.
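
The percentage error formula translates directly into code. This is a minimal sketch with an illustrative function name, applied to the three examples above.

```python
def percentage_error(approximate, exact):
    """|approximate - exact| / |exact|, expressed as a percentage."""
    if exact == 0:
        raise ValueError("percentage error is undefined when the exact value is 0")
    return abs(approximate - exact) / abs(exact) * 100


print(round(percentage_error(260, 325), 2))  # attendance estimate
print(round(percentage_error(70, 80), 2))    # concert
print(round(percentage_error(240, 200), 2))  # car park
```

The guard against a zero exact value matters later as well: any percentage-based metric breaks down when an actual value is 0.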


        Use Percentage Change when comparing an Old Value to a New Value

            Percent Change = (New Value − Old Value) / |Old Value| × 100%

        Use Percentage Error when comparing an Approximate Value to an Exact Value

            Percent Error = |Approximate Value − Exact Value| / |Exact Value| × 100%

        Use Percentage Difference when both values mean the same kind of thing (one value is not obviously older or better than the other).

            Percentage Difference = |First Value − Second Value| / ((First Value + Second Value) / 2) × 100%



__Example:__ fence (continued)

        Length = 12.5 ±0.05 m

        So: Absolute Error = 0.05 m

        And: Relative Error = 0.05 m / 12.5 m = 0.004

        And: Percentage Error = 0.4%

__Example:__ The thermometer measures to the nearest 2 degrees. The temperature was measured as 38° C The temperature could be up to 1° either side of 38° (i.e. between 37° and 39°)

        Temperature = 38 ±1°

        So:Absolute Error = 1°

        And:Relative Error = 1° / 38° = 0.0263...

        And:Percentage Error = 2.63...% 


The mean absolute percentage error (MAPE) is the percentage equivalent of MAE. The equation looks just like that of MAE, but with adjustments to convert everything into percentages.
![YYMpqUY.jpg](https://i.imgur.com/YYMpqUY.jpg)

![HPlrPmu.jpg](https://i.imgur.com/HPlrPmu.jpg)

The MAPE is biased towards predictions that are systematically less than the actual values themselves. That is to say, MAPE tends to be lower for a model that underestimates than for one that overestimates by comparable amounts: as long as predictions are non-negative, the percentage error of an underestimate can never exceed 100%, while that of an overestimate has no upper bound. The quick calculation below demonstrates this point.

![HPlrPmu.jpg](https://i.imgur.com/OBBvmIH.jpg)
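
The asymmetry can also be seen with two made-up cases that miss by the same absolute amount. The numbers here are hypothetical and are not taken from the figure.

```python
def absolute_percentage_error(actual, predicted):
    """|actual - predicted| / |actual|, expressed as a percentage."""
    return abs(actual - predicted) / abs(actual) * 100


# Both predictions miss by 10 units.
under = absolute_percentage_error(actual=20, predicted=10)  # prediction too low
over = absolute_percentage_error(actual=10, predicted=20)   # prediction too high

print(under)  # 50.0
print(over)   # 100.0
```

Because the error is divided by the actual value, the same 10-unit miss costs half as much when the prediction falls below the actual value as when it lands above it.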

We have a measure similar to MAPE in the form of the mean percentage error. While the absolute value in MAPE eliminates any negative values, the mean percentage error incorporates both positive and negative errors into its calculation.

    # sales, X and the fitted model lm are assumed to be defined earlier
    mape_sum = 0
    for sale, x in zip(sales, X):
        prediction = lm.predict(x)
        # absolute residual scaled by the actual value
        mape_sum += abs(sale - prediction) / sale
    # average once, after the loop, and express the result in percent
    mape = 100 * mape_sum / len(sales)
    print(mape)
    >>> [ 5.68377867 ]

We know for sure that there are no data points for which there are zero sales, so we are safe to use MAPE. Remember that we must interpret it in terms of percentage points. MAPE states that our model’s predictions are, on average, 5.68% off from the actual value.


### Mean percentage error

The mean percentage error (MPE) equation is exactly like that of MAPE. The only difference is that it lacks the absolute value operation.
![ndIXERr.jpg](https://i.imgur.com/ndIXERr.jpg)

The MPE lacks the absolute value operation, and it is precisely this absence that makes it useful. Since positive and negative errors will cancel out, we cannot make any statements about how well the model predictions perform overall. However, if there are more negative or more positive errors, this bias will show up in the MPE. Unlike MAE and MAPE, MPE is useful to us because it allows us to see whether our model __systematically underestimates or overestimates.__ Which sign corresponds to which depends on how the error is defined: with error = actual − predicted, as in the tables above, positive errors are underestimates and negative errors are overestimates.

![kTIYRBX.jpg](https://i.imgur.com/kTIYRBX.jpg)
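
A small sketch makes the contrast between MPE and MAPE concrete. It assumes the sign convention error = actual − predicted (so a negative MPE signals overestimation) and reuses the actual and predicted values from the MSE table; the function names are illustrative.

```python
def mean_percentage_error(actual, predicted):
    """Mean of (actual - predicted) / actual, as a percentage. Keeps the sign."""
    terms = [(a - p) / a for a, p in zip(actual, predicted)]
    return 100 * sum(terms) / len(terms)


def mean_absolute_percentage_error(actual, predicted):
    """Mean of |actual - predicted| / |actual|, as a percentage. Drops the sign."""
    terms = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(terms) / len(terms)


actual = [100, 150, 200, 250, 300]
predicted = [130, 170, 220, 260, 325]  # every prediction is too high

mpe = mean_percentage_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
print(round(mpe, 2), round(mape, 2))
```

Here every prediction overshoots, so the MPE is negative and equal in magnitude to the MAPE. With a mix of over- and underestimates, the MPE would shrink towards zero while the MAPE would not.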

|Acronym|Full name|Residual operation|Robust to outliers?|
|------------|------------|------------|------------|
|MAE|Mean Absolute Error|Absolute Value|Yes|
|MSE|Mean Squared Error|Square|No|
|RMSE|Root Mean Squared Error|Square|No|
|MAPE|Mean Absolute Percentage Error|Absolute Value|Yes|
|MPE|Mean Percentage Error|N/A|Yes|


### R-Squared

R-squared measures how much of the variation in the outcome is explained by the model.

R-squared (R²) is a statistical measure with the following properties:

1) It represents the proportion of the variance of a dependent variable that is explained by the independent variable or variables in a regression model.

2) Whereas correlation explains the strength of the relationship between an independent and a dependent variable, R-squared explains to what extent the variance of one variable explains the variance of the second variable.

3) So, if the R² of a model is 0.50, then approximately half of the observed variation can be explained by the model's inputs.
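
One common way to compute R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition and reuses the actual and predicted values from the MSE table earlier in these notes.

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, with SS_tot measured around the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [100, 150, 200, 250, 300]
predicted = [130, 170, 220, 260, 325]

r2 = r_squared(actual, predicted)
baseline = r_squared(actual, [200] * 5)  # always predicting the mean

print(round(r2, 3))  # 0.903
print(baseline)      # 0.0
```

A model that always predicts the mean explains none of the variation and scores 0, which is the reference point that R-squared measures against.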


![R2.png](R2.png)

Here R² is 0.6. If the points lie close to the fitted line, R² approaches 1, which indicates a near-perfect fit.
Conversely, if R² is very small, for example 0.02, the points are widely scattered around the line, as shown below.

R² ---> 0 { there is almost no relationship between the variables }

R² ---> 1 { there is a close relationship between the variables }

![R2_1.png](R2_1.png)

![R2_2.png](R2_2.png)

Example: if the R² of a housing model is 70%, what about the other 30%? That remaining 30% of the variation may depend on other variables,
such as location, repairs, style, number of bedrooms, etc.

![R2_3.png](R2_3.png)

1) Jane's value is estimated much higher, perhaps because of good performance.<br>
2) Bob's value is very low, perhaps because he hates his boss.

### Adjusted R-Squared
![R2_6.png](R2_6.png)
![R2_4.png](R2_4.png) ![R2_5.png](R2_5.png)

Is it good to have as many independent variables as possible? No. <br>
R-squared is deceptive: it never decreases when a new X variable is added to the model, even if that variable has no real relationship with the outcome. <br>
We need a better measure, or an adjustment to the original R-squared formula. <br>

__Used for__

    1) Comparing models with different numbers of independent variables.
    2) Selecting important predictors (independent variables) for a regression model.
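
The usual textbook form of the adjustment is 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples and p the number of predictors. The sketch below assumes that form (its notation may differ from the figures above), and the R² values, sample size and predictor counts are hypothetical.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need n_samples > n_predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical comparison: 7 extra predictors raise R^2 only from 0.70 to 0.71.
small_model = adjusted_r_squared(0.70, n_samples=50, n_predictors=3)
large_model = adjusted_r_squared(0.71, n_samples=50, n_predictors=10)

print(round(small_model, 4))
print(round(large_model, 4))
```

Although the larger model has the higher plain R², its adjusted value is lower, because the small gain does not justify the extra predictors.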



# For Classification

    Accuracy
    confusion matrix 

# Accuracy

__Classification Error__

                       Errors        FP + FN
                       ------  =  ----------------- = Classification Error
                       Total     TP + TN + FP +FN

__Accuracy__

Ratio of number of correct predictions to total number of samples.

            Number of correct predictions          TP + TN
            -----------------------------  =  -------------------  = (1 - error) = Accuracy
            Total number of predictions        TP + TN + FP + FN

Basic Measure of "goodness" of a classifier


    a positive example classified as positive. This is a true positive.
    a positive example misclassified as negative. This is a false negative.
    a negative example classified as negative. This is a true negative.
    a negative example misclassified as positive. This is a false positive.
    
    
__Balanced and Imbalanced__

1) Balanced Dataset - ~Equal number of +ve and -ve samples.<br>
2) Imbalanced dataset - one class significantly dominates others.

__When to use Accuracy__

When the target classes are nearly balanced.<br>
1) For example, if 55% of the samples in a fruit dataset are oranges and 45% are apples.

__When not to use Accuracy__

When one class makes up the large majority of the target variable.<br>
1) For example, in a cancer detection dataset of 100 people, only 5 people have cancer.

Let's say we are working on a classification problem where we are predicting whether a person has cancer or not.

|Cancer Detection Dataset||
|---|---|
|No. of people having cancer|5|
|No.of people not having cancer|95|
|Total People|100|

Assigning labels to our target variables.<br>
1: when person is having Cancer<br>
0: when person is not having Cancer

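Suppose the model has classified all patients as __not__ having cancer.
Accuracy of model = 95/100 × 100 = 95%, even though the model has not detected a single one of the 5 cancer cases. This is why accuracy is misleading on imbalanced data.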
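
The same scenario in code: a model that predicts "no cancer" for everyone in the 100-person dataset. This is a minimal sketch in plain Python, using the labels defined above (1 = cancer, 0 = no cancer).

```python
y_true = [1] * 5 + [0] * 95   # 5 patients with cancer, 95 without
y_pred = [0] * 100            # the model predicts "no cancer" for everyone

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual cancer cases that the model found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The accuracy looks excellent while the recall shows that the model is useless for the task it was built for.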


# Confusion Matrix

It is a table with 4 different combinations of predicted and actual values.<br>
A confusion matrix can only be calculated when the true values are known.
![1*Z54JgbS4DUwWSknhDCvNTQ.png](https://cdn-images-1.medium.com/max/720/1*Z54JgbS4DUwWSknhDCvNTQ.png)

It is extremely useful for measuring

    Recall, 
    Precision, 
    Specificity, 
    Accuracy and most importantly 
    AUC-ROC Curve.
    
Let’s understand TP, FP, FN, TN in terms of pregnancy analogy.
![1*7EYylA6XlXSGBCF77j_rOA.png](https://cdn-images-1.medium.com/max/720/1*7EYylA6XlXSGBCF77j_rOA.png) 

Suppose I work for Target and I want to detect pregnant teenagers based on their shopping patterns. I take a random sample of 500 female, teenage customers. Of these teenagers, 50 are actually pregnant. I predicted 100 total pregnant teenagers, 45 of which are actually pregnant.

Our task is two-fold:<br>
A) Identify the TP, TN, FP, FN, and construct a confusion matrix and <br>
B) Calculate the accuracy, misclassification, precision, sensitivity, and specificity

![1*8YioEcYGAYKbkQafKwaC4w.png](https://cdn-images-1.medium.com/max/720/1*8YioEcYGAYKbkQafKwaC4w.png)
Next, we can use our labelled confusion matrix to calculate our metrics.

1. Accuracy (all correct / all) = (TP + TN) / (TP + TN + FP + FN)
        (45 + 395) / 500 = 440 / 500 = 0.88 or 88% Accuracy

2. Misclassification (all incorrect / all) = (FP + FN) / (TP + TN + FP + FN)
        (55 + 5) / 500 = 60 / 500 = 0.12 or 12% Misclassification
        You can also just do 1 − Accuracy, so:
        1 − 0.88 = 0.12 or 12% Misclassification

3. Precision (true positives / predicted positives) = TP / (TP + FP)
        45 / (45 + 55) = 45 / 100 = 0.45 or 45% Precision

4. Sensitivity aka Recall (true positives / all actual positives) = TP / (TP + FN)
        45 / (45 + 5) = 45 / 50 = 0.90 or 90% Sensitivity

5. Specificity (true negatives / all actual negatives) = TN / (TN + FP)
        395 / (395 + 55) = 395 / 450 = 0.88 or 88% Specificity
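
All five metrics follow from the four counts, so they can be wrapped in one small helper. This is an illustrative sketch using the counts from the example (TP = 45, TN = 395, FP = 55, FN = 5); the function name is not from any library.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Common rates derived from the four cells of a binary confusion matrix."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


metrics = confusion_metrics(tp=45, tn=395, fp=55, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note the contrast between 90% sensitivity and 45% precision: the model finds most of the actual positives, but more than half of its positive predictions are wrong.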


![1*PBgDG3NtVMXFrkYgAG00Jw.png](https://cdn-images-1.medium.com/max/720/1*PBgDG3NtVMXFrkYgAG00Jw.png)

True Positive:<br>
Interpretation: You predicted positive and it’s true. You predicted that a woman is pregnant and she actually is.<br>
True Negative:<br>
Interpretation: You predicted negative and it’s true. You predicted that a man is not pregnant and he actually is not.<br>
False Positive: (Type 1 Error) <br>
Interpretation: You predicted positive and it’s false. You predicted that a man is pregnant but he actually is not.<br>
False Negative: (Type 2 Error) <br>
Interpretation: You predicted negative and it’s false. You predicted that a woman is not pregnant but she actually is.

![confusion_matrix_simple2.png](https://www.dataschool.io/content/images/2015/01/confusion_matrix_simple2.png)
What can we learn from this matrix?

1) There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.<br>
2) The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).<br>
3) Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.<br>
4) In reality, 105 patients in the sample have the disease, and 60 patients do not.<br>

Let's now define the most basic terms, which are whole numbers (not rates):

true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.<br>
true negatives (TN): We predicted no, and they don't have the disease.<br>
false positives (FP): We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")<br>
false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a "Type II error.")<br>

![confusion_matrix2.png](https://www.dataschool.io/content/images/2015/01/confusion_matrix2.png)

This is a list of rates that are often computed from a confusion matrix for a binary classifier:

__Accuracy:__ Overall, how often is the classifier correct?<br>
(TP+TN)/total = (100+50)/165 = 0.91<br>
__Misclassification Rate:__ Overall, how often is it wrong?<br>
(FP+FN)/total = (10+5)/165 = 0.09<br>
equivalent to 1 minus Accuracy<br>
also known as "Error Rate"<br>
__True Positive Rate:__ When it's actually yes, how often does it predict yes?<br>
TP/actual yes = 100/105 = 0.95<br>
also known as "Sensitivity" or "Recall"<br>
__False Positive Rate:__ When it's actually no, how often does it predict yes?<br>
FP/actual no = 10/60 = 0.17<br><br>
__True Negative Rate:__ When it's actually no, how often does it predict no?<br>
TN/actual no = 50/60 = 0.83<br>
equivalent to 1 minus False Positive Rate<br>
also known as "Specificity"<br>
__Precision:__ When it predicts yes, how often is it correct?<br>
TP/predicted yes = 100/110 = 0.91<br>
__Prevalence:__ How often does the yes condition actually occur in our sample?<br>
actual yes/total = 105/165 = 0.64<br>

![Accuracy.png](Accuracy.png)

![Accuracy_2.png](Accuracy_2.png)

# F-Score or  F-Measure

    It is a single measure of a classification procedure's usefulness.
    It considers both the Precision and the Recall of the procedure to compute the score.
    The higher the F-Score, the better the predictive power of the classification procedure.
    A score of 1 means the classification is perfect, and 0 is the lowest possible F-Score.

![F-Score.png](F-Score.png)
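
The most common variant, the F1 score, is the harmonic mean of precision and recall. The sketch below assumes that variant (the figure above may show a more general form) and plugs in the precision and recall from the pregnancy example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.45, recall=0.90)
print(round(f1, 2))  # 0.6
```

The harmonic mean is pulled towards the weaker of the two values, so a model cannot earn a high F1 score by excelling at only one of precision or recall.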



# ROC - AUC


__What is AUC - ROC Curve?__

The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two things (e.g. whether a patient has a disease or not). Better models can accurately distinguish between the two, whereas a poor model will have difficulty doing so.

1) The AUC - ROC curve is a performance measurement for classification problems at various threshold settings.<br>
2) ROC is a probability curve, and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes.<br>
3) The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with and without the disease.<br>
4) The ROC curve is plotted with TPR against FPR, where TPR is on the y-axis and FPR is on the x-axis.<br>

__Defining terms used in AUC and ROC Curve.__

TPR(True Positive rate) /Recall/ Sensitivity =  TP / (TP+ FN )


Specificity = TN / (TN+ FP)

FPR = 1 - Specificity = FP / (FP + TN)


    When the two curves don’t overlap at all, the model has an ideal measure of separability. It is perfectly able to distinguish between the positive class and the negative class.
    
    When AUC is 0.7, there is a 70% chance that the model will rank a randomly chosen positive example above a randomly chosen negative example.
    
    When AUC is approximately 0.5, the model has no capacity to distinguish between the positive class and the negative class.
    
    When AUC is approximately 0, the model is actually inverting the classes: it predicts the negative class as positive and vice versa.
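
One way to make these AUC values tangible is to compute AUC as a ranking probability: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below uses that pairwise definition on made-up labels and scores; it is quadratic in the number of samples and is meant for illustration only.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for six examples.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc = auc_by_ranking(labels, scores)
print(round(auc, 3))
```

Of the 9 positive/negative pairs, 8 are ranked correctly, so the AUC is 8/9. Perfectly separated scores give 1.0, and swapping every label gives 0.0, which matches the "inverted classes" case described above.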
   
   
__Sensitivity and Specificity trade off against each other: when we increase Sensitivity, Specificity decreases, and vice versa.__

When we __decrease the threshold__, we get more positive predictions, which increases the sensitivity and decreases the specificity.

Similarly, when we __increase the threshold__, we get more negative predictions, which gives higher specificity and lower sensitivity.
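
The trade-off can be observed by sweeping the threshold over a small set of made-up scores. In this sketch a score at or above the threshold counts as a positive prediction; the labels, scores and thresholds are all hypothetical.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are predicted positive."""
    tp = sum(y == 1 and s >= threshold for y, s in zip(labels, scores))
    fn = sum(y == 1 and s < threshold for y, s in zip(labels, scores))
    tn = sum(y == 0 and s < threshold for y, s in zip(labels, scores))
    fp = sum(y == 0 and s >= threshold for y, s in zip(labels, scores))
    return tp / (tp + fn), tn / (tn + fp)


labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

for threshold in (0.2, 0.5, 0.8):
    sensitivity, specificity = sensitivity_specificity(labels, scores, threshold)
    print(f"threshold={threshold}: sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

As the threshold rises from 0.2 to 0.8, sensitivity falls from 1.00 to 0.33 while specificity rises from 0.33 to 1.00. Plotting sensitivity against 1 − specificity for every threshold yields the ROC curve.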

![AUC_ROC_5.png](AUC_ROC_5.png)
__Step - 1__ Here, the red distribution represents all the patients who do not have the disease and the green distribution represents all the patients who have the disease.

__Step - 2__ Now we have to pick a cut-off, i.e. a threshold value, above which we will predict everyone as positive (they have the disease) and below which we will predict everyone as negative (they do not have the disease). We will set the threshold at “0.5”, as shown below:

__Step - 3__ All the positive values above the threshold will be “True Positives” and the negative values above the threshold will be “False Positives” as they are predicted incorrectly as positives.

__Step - 4__ All the negative values below the threshold will be “True Negatives” and the positive values below the threshold will be “False Negative” as they are predicted incorrectly as negatives.

__Step - 5__ Here, we have got a basic idea of the model predicting correct and incorrect values with respect to the threshold set. Before we move on, let’s go through two important terms: Sensitivity and Specificity.

__What is Sensitivity and Specificity?__

In simple terms, the proportion of patients correctly identified as having the disease (i.e. True Positives) out of the total number of patients who actually have the disease is called Sensitivity or Recall.

Similarly, the proportion of patients correctly identified as not having the disease (i.e. True Negatives) out of the total number of patients who do not have the disease is called Specificity.

    
    Trade-off between Sensitivity and Specificity
    
When we decrease the threshold, we get more positive values thus increasing the sensitivity. Meanwhile, this will decrease the specificity.

Similarly, when we increase the threshold, we get more negative values thus increasing the specificity and decreasing sensitivity.

As Sensitivity ⬇️ Specificity ⬆️

As Specificity ⬇️ Sensitivity ⬆️

    Area Under the Curve
    
The AUC is the area under the ROC curve. This score gives us a good idea of how well the model performs.

Let’s start with some examples.

1) The Y-axis has two categories: "Obese" and "Not Obese". <br>
2) Along the X-axis we have weight.

![AUC_ROC.png](AUC_ROC.png)
However, if we want to classify the mice as "Obese" or "Not Obese", then we need a way to turn probabilities into classifications. One way to classify mice is to set a threshold at 0.5 and classify all mice with a probability of being obese > 0.5 as "Obese" and all other mice as "Not Obese".

![AUC_ROC_2.png](AUC_ROC_2.png)
![AUC_ROC_3.png](AUC_ROC_3.png)
![AUC_ROC_4.png](AUC_ROC_4.png)