# Binary classifier
A binary classifier outputs one of only two values: "0" or "1", "true" or "false", etc.
Examples:
- a COVID test (checks whether someone is sick or not),
- an analysis of whether someone will be able to pay back a loan,
- finding out whether we are talking to a real person or an AI.

## Values Analysis with COVID example

|  | someone is sick | someone is healthy |
| --- | --- | --- |
| **result of PCR test is positive** | true positive (TP) | false positive (FP) |
| **result of PCR test is negative** | false negative (FN) | true negative (TN) |

False positives are called **type I errors**, while false negatives are called **type II errors**. Obviously, we want to maximize the rates of true positives and true negatives while minimizing the rates of both kinds of errors, that is, to keep FP small relative to TP and FN small relative to TN. However, sometimes it's not so simple.

What's wrong with a classifier that simply maximizes the TP and TN rates while minimizing the FN and FP rates? Imagine we have a test that detects some rare illness X. If someone is sick, the test always shows a (true) positive result. If someone is not sick, the test shows a (true) negative result in 99% of cases and a (false) positive result in 1% of cases. Seems pretty accurate? Nope. If only 1 in 10 000 people is sick with X, then after testing 10 000 people we will have about 101 people diagnosed as ill (1 true positive and about 100 false positives), while only 1 person is really sick. Conclusion: we need better tools.
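The arithmetic of the illness X example can be checked with a few lines of Python. This is only a sketch of the scenario described above; the variable names are made up for illustration.

```python
# Scenario from the text: 10,000 people tested, 1 of them is sick.
population = 10_000
sick = 1
healthy = population - sick

sensitivity = 1.0            # sick people always test positive
false_positive_rate = 0.01   # 1% of healthy people test positive

true_positives = sick * sensitivity
false_positives = healthy * false_positive_rate

total_positives = true_positives + false_positives
# Chance that a person with a positive result is really sick.
ppv = true_positives / total_positives

print(f"positive results: {total_positives:.0f}")
print(f"chance that a positive result means sick: {ppv:.2%}")
```

Even though the test is right for 99% of healthy people, a positive result means "sick" in only about 1% of cases, because healthy people vastly outnumber sick ones.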

# Our tools and their values on COVID tests
Here is a study that tests the quality of rapid antigen tests in comparison to PCR tests.

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2793365?utm_source=For_The_Media&utm_medium=referral&utm_campaign=ftm_links&utm_term=061522

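Results:

| | PCR positive | PCR negative |
| --- | --- | --- |
| **antigen test positive** | TP = 50 | FP = 1 |
| **antigen test negative** | FN = 27 | TN = 649 |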

- **sensitivity** (true positive rate) - probability of correctly diagnosing a sick person:
$$TPR = \frac{TP}{TP+FN} \approx 65\%$$
- **specificity** (true negative rate) - probability of correctly diagnosing a healthy person:
$$SPC = \frac{TN}{FP+TN} \approx 99.8\%$$
- **false positive rate** - probability of wrongly diagnosing a healthy person as a sick one:
$$FPR = \frac{FP}{FP+TN} \approx 0.2\%$$
- **false discovery rate** - if someone is diagnosed as sick, what is the chance that they are actually healthy:
$$FDR = \frac{FP}{FP+TP} \approx 2\%$$
- **positive predictive value** - if someone is diagnosed as sick, what is the chance that they are really sick:
$$PPV = \frac{TP}{FP+TP} \approx 98\%$$
- **negative predictive value** - if someone is diagnosed as healthy, what is the chance that they are really healthy:
$$NPV = \frac{TN}{TN+FN} \approx 96\%$$
- **accuracy** - probability of a correct diagnosis:
$$ACC = \frac{TP+TN}{TP+TN+FP+FN} \approx 96\%$$
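The formulas above translate directly into code. The helper below is a minimal sketch (the function name `confusion_metrics` is made up for this example) that recomputes every metric from the four counts in the results table.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return the metrics defined above as fractions between 0 and 1."""
    return {
        "TPR": tp / (tp + fn),                   # sensitivity
        "SPC": tn / (fp + tn),                   # specificity
        "FPR": fp / (fp + tn),                   # false positive rate
        "FDR": fp / (fp + tp),                   # false discovery rate
        "PPV": tp / (fp + tp),                   # positive predictive value
        "NPV": tn / (tn + fn),                   # negative predictive value
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # accuracy
    }

# Counts taken from the results table above.
metrics = confusion_metrics(tp=50, fp=1, fn=27, tn=649)
for name, value in metrics.items():
    print(f"{name} = {value:.1%}")
```

Note that specificity and false positive rate always sum to 1, and so do PPV and FDR, so in practice only one of each pair needs to be reported.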

# ROC curve

In the case of binary classifiers we often have a continuous input and a two-valued output. To get such an output we can add a threshold: if the input is below the threshold, then output = 0, and if it is above it, then output = 1. Example:

<img src="ROC_rozklady.png" alt="Threshold" width="800"/>

A ROC curve is a way of visualizing the effect of the threshold choice. We create a plot with TPR on the Y-axis and FPR on the X-axis. Each choice of threshold results in a single point in the plot. It's not only a nice way to visualize data, but it may also be helpful in choosing the right classifier.

<img src="Roc_curve.svg" alt="ROC" width="400"/>
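To make the "one threshold, one point" idea concrete, here is a small sketch in plain Python. The scores and labels are made up for illustration, and the function name `roc_points` is hypothetical. It assumes that a sample is predicted positive when its score is at or above the threshold.

```python
def roc_points(scores, labels):
    """Return one (FPR, TPR) pair per candidate threshold.

    A sample is predicted positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Try every distinct score as a threshold, plus one above the maximum
    # so that the curve starts at (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Made-up classifier scores and true labels (1 = sick, 0 = healthy).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Lowering the threshold can only add positive predictions, so both FPR and TPR grow (or stay the same) as we move along the list, from (0, 0) to (1, 1).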

For a quick judgement of whether a classifier is good, one can compute the area under the ROC curve (we call it **AUC**). For a perfect classifier it is equal to 1; for a random one it is equal to 0.5.
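One simple way to compute AUC from a list of (FPR, TPR) points is the trapezoidal rule. The sketch below assumes the points lie on a monotone ROC curve; the function name `auc` and the two example curves are chosen only for illustration.

```python
def auc(points):
    """Area under a ROC curve given as (FPR, TPR) points, by the trapezoidal rule."""
    points = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Width of the strip times the average height of its two ends.
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A perfect classifier reaches TPR = 1 while FPR is still 0.
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# A random classifier lies on the diagonal.
random_guess = [(0.0, 0.0), (1.0, 1.0)]

print(auc(perfect))
print(auc(random_guess))
```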

# Cross Validation

How do we judge which of two models is better? Simplest case: divide the data into **train data** and **test data**. Then we train both models on the training part and compare their errors ($RSS$ or $R^2$) on the testing part. Drawback: we "lose" a big part of the data, because the test data is not used to train the model. That's when cross validation comes in handy.

**How does cross validation work?**
1. We divide the data into $k$ parts (we'll call them folds). Usually $k = 5$ or $k = 10$.
2. We train the model $k$ times. Each time we choose a different fold to be our "test data", while we train the model on the other $k-1$ folds.
3. For each trained model we compute the test error (RSS). The mean of those errors is called the **validation error**.
4. The better the model, the smaller the validation error.
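The steps above can be sketched in plain Python. Since the text measures error with RSS, the example compares two toy regression models; for a classifier one would swap in a classification error instead. The data, the helper names and the way the folds are formed (every $k$-th sample, without shuffling) are assumptions made to keep the example short and deterministic.

```python
def k_fold_validation_error(xs, ys, fit, k=5):
    """Mean test RSS over k folds; fit(xs, ys) must return a predict function."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Every k-th sample, offset by the fold index, forms the test fold.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        predict = fit(train_x, train_y)
        rss = sum((ys[i] - predict(xs[i])) ** 2 for i in test_idx)
        fold_errors.append(rss)
    return sum(fold_errors) / k

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares line y = a + b * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

# Made-up data roughly following y = 2x + 1, for illustration only.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.2, 17.1, 18.9]

error_mean = k_fold_validation_error(xs, ys, fit_mean, k=5)
error_line = k_fold_validation_error(xs, ys, fit_line, k=5)
print(f"validation error, mean model: {error_mean:.2f}")
print(f"validation error, line model: {error_line:.2f}")
```

The model with the smaller validation error (here the line) is the one we would prefer. In practice the data is usually shuffled before being split into folds.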

<br>
<br>

<img src="cross_validation.png" alt="CV" width="800"/>