# COMP-2704: Supervised Machine Learning
### <span style="color:blue"> Week 8 </span>

## <span style="color:blue"> Chapter 5 continued</span>

### How to find a good classifier? The perceptron algorithm

The pseudocode is similar for all supervised learning algorithms:

1) Start with random parameters.
2) Iterate through training data and use the error function to update the parameters.
3) Measure error to decide when to stop training.
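
The three steps above can be sketched as a small training loop. This is only an illustrative sketch, not the course's reference solution: the function name `train_sketch`, the toy data, the stopping rule, and the choice to measure error as the number of misclassified points are all assumptions made for this example.

```python
import numpy as np

def train_sketch(X, y, eta=0.1, max_epochs=1000, seed=0):
    """Hypothetical perceptron-style training loop (illustration only)."""
    rng = np.random.default_rng(seed)
    # 1) Start with random parameters.
    w = rng.normal(size=X.shape[1])
    b = rng.normal()
    errors = []
    for _ in range(max_epochs):
        # 2) Visit the training points in random order; update on mistakes.
        for i in rng.permutation(len(X)):
            y_hat = 1 if X[i] @ w + b >= 0 else 0
            # (y - y_hat) is 0 for a correct prediction, so nothing changes then.
            w = w + eta * (y[i] - y_hat) * X[i]
            b = b + eta * (y[i] - y_hat)
        # 3) Measure the error (assumed here: count of misclassified points).
        preds = (X @ w + b >= 0).astype(int)
        errors.append(int(np.sum(preds != y)))
        if errors[-1] == 0:
            break
    return w, b, errors

# Made-up, linearly separable toy data (not the alien planet dataset).
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
w, b, errors = train_sketch(X, y)
print('misclassified points per epoch:', errors)
```

Because the toy data are linearly separable, the loop stops once an epoch ends with no misclassified points.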

We developed much of the code for the perceptron algorithm last week. We still need code to update the parameters during training.

<img src='Fig5.21.png' width='500'/>

* The 'perceptron trick' will only adjust the line when misclassified points are encountered.
* One way a misclassification happens is when $\hat{y} = 1$ and $y=0$.
    * In this case the score $z = \mathbf{w} \cdot \mathbf{x} + b \ge 0$, but should be less than zero.
* <span style="color:green">**Q: How can we use a learning rate $\eta$ to adjust $\mathbf{w}$ and $b$ to lower $z$?**</span>

    * We can adjust the bias by $b' = b - \eta$
    * We can adjust the weights by $\mathbf{w}' = \mathbf{w} - \eta \mathbf{x}$.
        * Making the adjustment proportional to $\mathbf{x}$ changes $w_i$ more when $|x_i|$ is larger.
        * When $x_i$ is negative, the update makes $w_i$ larger, which is correct: the product $w_i x_i$ still decreases, and the goal in this case is to lower the value of $z$.
        
* The other misclassification case is when $\hat{y} = 0$ and $y=1$.
    * In this case the score $z = \mathbf{w} \cdot \mathbf{x} + b < 0$, but should be greater than or equal to zero.
    * We can adjust the bias by $b' = b + \eta$
    * We can adjust the weights by $\mathbf{w}' = \mathbf{w} + \eta \mathbf{x}$.
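
To see why these updates move the score in the right direction, we can substitute the updated parameters into the score for the same point $\mathbf{x}$. A short worked check, assuming $\eta > 0$, with the upper sign for the case $\hat{y} = 1$, $y = 0$ and the lower sign for the case $\hat{y} = 0$, $y = 1$:

$$ z' = \mathbf{w}' \cdot \mathbf{x} + b' = (\mathbf{w} \mp \eta \mathbf{x}) \cdot \mathbf{x} + (b \mp \eta) = z \mp \eta \left( \mathbf{x} \cdot \mathbf{x} + 1 \right) $$

Since $\mathbf{x} \cdot \mathbf{x} + 1 \ge 1$, the score for this point drops by at least $\eta$ in the first case and rises by at least $\eta$ in the second. One update may not be enough to fix the classification, but it always moves $z$ the right way.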

#### The perceptron trick 
* Here we develop a function to update the weights and bias when a point is misclassified.  
* We will write this to work in general, and not just for the alien planet example we are considering.  
* The pseudocode is:
    * *Input*: weights $\mathbf{w}$, bias $b$, features of a sample $\mathbf{x}$, a label $y$, learning rate with default value $\eta = 0.01$.
    * *Output*: updated weights $\mathbf{w}$ and bias $b$.
    * *Procedure*:
        * Find the prediction $\hat{y}$
        * If $\hat{y}=1$ and $y=0$:
            * $\mathbf{w}' = \mathbf{w} - \eta \mathbf{x}$
            * $b' = b - \eta$
        * If $\hat{y}=0$ and $y=1$:
            * $\mathbf{w}' = \mathbf{w} + \eta \mathbf{x}$
            * $b' = b + \eta$
        * Otherwise the point is correctly classified: $\mathbf{w}' = \mathbf{w}$ and $b' = b$
        * Return $\mathbf{w}'$ and $b'$
            
* Let's write this function in the next cell.

In [1]:
import numpy as np

w = np.array([1, 2])
b = -4

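def predict(w, b, x):
    # Returns the raw score z = w.x + b; pass it through step() to get y_hat.
    return w.dot(x) + b

def step(z):
    # Step function: class 1 when the score is non-negative, class 0 otherwise.
    return 1 if z >= 0 else 0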

def trick_1(w, b, x, y, eta = 0.01):
    # add code here
    return w, b
    
# test the code using the data below
x = np.array([0, 1])
y = 1

print('prediction =', predict(w, b, x))
print('original parameters: w =', w, '\tb =', b)
w, b = trick_1(w, b, x, y)
print('updated parameters:  w =', w, '\tb =', b)

prediction = -2
original parameters: w = [1 2] 	b = -4
updated parameters:  w = [1 2] 	b = -4
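
One possible way to fill in the function, following the pseudocode above line by line. This is a hedged sketch rather than the official solution; the name `trick_1_sketch` is made up here so it does not clash with the exercise stub, and float arrays are used so the small updates are visible.

```python
import numpy as np

def step(z):
    # Step function: class 1 when the score is non-negative, class 0 otherwise.
    return 1 if z >= 0 else 0

def trick_1_sketch(w, b, x, y, eta=0.01):
    """Hypothetical perceptron trick with explicit conditionals."""
    y_hat = step(np.dot(w, x) + b)
    if y_hat == 1 and y == 0:
        # Score too high: lower it.
        return w - eta * x, b - eta
    if y_hat == 0 and y == 1:
        # Score too low: raise it.
        return w + eta * x, b + eta
    # Correctly classified: leave the parameters unchanged.
    return w, b

# Same test point as the cell above, written with floats.
w = np.array([1.0, 2.0])
b = -4.0
x = np.array([0.0, 1.0])
y = 1

w_new, b_new = trick_1_sketch(w, b, x, y)
print('updated parameters:  w =', w_new, '\tb =', b_new)
```

For this point the score is $-2$, so $\hat{y} = 0$ while $y = 1$, and the sketch raises the second weight and the bias by $\eta = 0.01$ each.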


<span style="color:green">**Q: Can you improve the code for trick_1 so that it does not require a conditional?**</span>

In [2]:
def trick_2(w, b, x, y, eta = 0.01):
    # add code here
    return w, b

# test the code using the data below
x = np.array([0, 1])
y = 1

print('prediction =', predict(w, b, x))
print('original parameters: w =', w, '\tb =', b)
w, b = trick_2(w, b, x, y)
print('updated parameters:  w =', w, '\tb =', b)

prediction = -2
original parameters: w = [1 2] 	b = -4
updated parameters:  w = [1 2] 	b = -4
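
One way to drop the conditional, offered as a sketch rather than the expected answer: the difference $y - \hat{y}$ is $-1$ for a false positive, $+1$ for a false negative, and $0$ for a correct prediction, so it can serve as the sign of the update. The name `trick_2_sketch` is hypothetical.

```python
import numpy as np

def step(z):
    # Step function: class 1 when the score is non-negative, class 0 otherwise.
    return 1 if z >= 0 else 0

def trick_2_sketch(w, b, x, y, eta=0.01):
    """Hypothetical perceptron trick without an if/else on the error type."""
    y_hat = step(np.dot(w, x) + b)
    delta = eta * (y - y_hat)  # -eta, 0, or +eta
    return w + delta * x, b + delta

w = np.array([1.0, 2.0])
b = -4.0

# False negative: score is -2, label is 1, so parameters move up.
w_up, b_up = trick_2_sketch(w, b, np.array([0.0, 1.0]), 1)
# Correct prediction: score is 5, label is 1, so nothing changes.
w_same, b_same = trick_2_sketch(w, b, np.array([3.0, 3.0]), 1)

print('false negative:  w =', w_up, '\tb =', b_up)
print('correct:         w =', w_same, '\tb =', b_same)
```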


* The 'trick' above is the kind of update used in gradient descent: each step nudges the parameters so that the error will (usually) be lower for the updated parameters.
* You will complete the perceptron algorithm code in this week's exercise.
* You will use the Turi Create implementation in this week's lab.

<span style="color:red">*Let us now review the textbook code Coding_perceptron_algorithm.ipynb.*</span>

## <span style="color:blue"> Chapter 7 </span>
### How do you measure classification models? Accuracy and its friends

<img src='cartoon7.png' />

* A simple and important metric for classification is **accuracy**: the percentage of predictions that are correct.
* However, as suggested by the cartoon above, accuracy is often not the best metric to use.
* We now discuss several metrics for classification using two example use cases: coronavirus and spam detection.

**Medical dataset: A set of patients diagnosed with coronavirus**
* 1,000 patients in total
* 10 have been diagnosed with coronavirus
* 990 have been diagnosed as healthy
* labels are “sick” (1) or “healthy” (0) corresponding to the diagnosis
* goal of a model would be to predict the diagnosis based on the features of each patient

**Email dataset: A set of emails labeled spam or ham**
* 100 emails in total
* 40 are spam
* 60 are ham
* labels are 'spam' (1) or 'ham' (0)
* goal of a model would be to predict the label based on the features of each email

### A super effective yet super useless model
* First, let's look at the medical dataset.
* As in the cartoon above, consider model 1, which predicts that everyone is healthy.
* All 1,000 patients are predicted healthy (0)
* We calculate $\mbox{accuracy} = \frac{990}{1000} = 99\%$

<span style="color:green">**Q: Is this a good model?**</span>

* Even though accuracy is high, this is a bad model.
* To develop better metrics, we need to look into the predictions for each class.
* Predictions can be correct or incorrect for each class, so there are four numbers to consider in binary classification:
    * *True positive*  (TP): label is 1 and prediction is 1
    * *False positive* (FP): label is 0 and prediction is 1
    * *True negative*  (TN): label is 0 and prediction is 0
    * *False negative* (FN): label is 1 and prediction is 0
* These are often presented in a table called the **confusion matrix**:

<img src='confusion.png' width='400'/>

* By combining these numbers in different ways, we can find other metrics to choose from.
* Depending on the use case, one type of error may be worse than the other, or they may be similarly bad.
* One should choose metrics that are most relevant to the use case.

For the coronavirus model 1 that predicts everyone as healthy, we get:

<img src='table7.2.png' width='600'/>
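
The four counts for model 1 can also be reproduced in code. The sketch below rebuilds the medical dataset's labels from the numbers stated earlier (10 sick, 990 healthy) and uses an all-zeros prediction vector; the helper name `confusion_counts` is an assumption made for this example.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Hypothetical helper: return (TP, FP, TN, FN) for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    TP = int(np.sum((y_true == 1) & (y_pred == 1)))
    FP = int(np.sum((y_true == 0) & (y_pred == 1)))
    TN = int(np.sum((y_true == 0) & (y_pred == 0)))
    FN = int(np.sum((y_true == 1) & (y_pred == 0)))
    return TP, FP, TN, FN

# Medical dataset: 10 sick (1) and 990 healthy (0) patients.
y_true = np.array([1] * 10 + [0] * 990)
# Model 1 predicts everyone as healthy.
y_pred = np.zeros(1000, dtype=int)

TP, FP, TN, FN = confusion_counts(y_true, y_pred)
accuracy = (TP + TN) / (TP + FP + TN + FN)
print('TP =', TP, 'FP =', FP, 'TN =', TN, 'FN =', FN)
print('accuracy =', accuracy)
```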

<span style="color:green">**Q: Which is worse here, a false negative or a false positive?**</span>

* The worst mistake is a *false negative*, predicting a patient is healthy when they are sick and need care.
* A false positive would result in a patient getting extra care that is not needed, which is not nearly as bad.
* For a use case where false negatives must be avoided, we define **recall**: Among the positive examples, how many did we correctly classify? 
$$ R = \frac{\mbox{TP}}{\mbox{TP} + \mbox{FN}} $$
* $R \in [0, 1]$ and higher values are better than lower values.
* For coronavirus model 1, we find $R = \frac{0}{10} = 0$, telling us how bad the model is.
* Consider coronavirus model 2, with the confusion matrix:
<img src='table7.3.png' width='600'/>

* Model 2 seems better, because it has fewer false negatives.
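
Recall is a one-line calculation once the counts are known. A minimal sketch follows, using the model 1 counts from the text (0 true positives, 10 false negatives); the function name `recall` and the choice to return 0 when there are no positive labels are assumptions. The counts for model 2 can be read from its confusion matrix and passed in the same way.

```python
def recall(TP, FN):
    """Hypothetical helper: fraction of positive examples classified correctly."""
    if TP + FN == 0:
        # Assumption: treat recall as 0 when the data has no positive labels.
        return 0.0
    return TP / (TP + FN)

# Coronavirus model 1: all 10 sick patients are predicted healthy.
R_model_1 = recall(TP=0, FN=10)
print('recall of model 1 =', R_model_1)
```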

<span style="color:green">**Q: What is the recall of model 2? What is the accuracy? Which model do you think is better?**</span>

Let us now turn to the spam detection use case. Consider the following two models:
<img src='table7.4.png' width='600'/>
<img src='table7.5.png' width='550'/>

They are both $85\%$ accurate, but which is better?

<span style="color:green">**Q: Which problem is worse in spam detection, false positives or false negatives?**</span>

* In spam detection, you must never delete ham, so *false positives* are a serious problem.
* False negatives result in spam getting in the inbox, which is not as bad.
* The metric we define for this case is **precision**: Among the examples we classified as positive, how many did we correctly classify? $$ P = \frac{ \mbox{TP}}{\mbox{TP} + \mbox{FP}} $$
* $P \in [0, 1]$ and higher values are better than lower values.
* For the models above we find $P_1 = \frac{30}{35} \approx 0.857$ and $P_2 = \frac{35}{45} \approx 0.778$, showing that model 1 does a better job of avoiding false positives.
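
The same precision values can be checked in code. The counts below are inferred from the fractions quoted above together with the dataset sizes (40 spam, 60 ham), so treat them as a reconstruction rather than a copy of the tables; the function name `precision` is hypothetical.

```python
def precision(TP, FP):
    """Hypothetical helper: fraction of positive predictions that are correct."""
    if TP + FP == 0:
        # Assumption: treat precision as 0 when nothing is predicted positive.
        return 0.0
    return TP / (TP + FP)

# Counts inferred from P_1 = 30/35, P_2 = 35/45, 40 spam and 60 ham emails.
model_1 = {'TP': 30, 'FP': 5, 'TN': 55, 'FN': 10}
model_2 = {'TP': 35, 'FP': 10, 'TN': 50, 'FN': 5}

P_1 = precision(model_1['TP'], model_1['FP'])
P_2 = precision(model_2['TP'], model_2['FP'])
acc_1 = (model_1['TP'] + model_1['TN']) / sum(model_1.values())
acc_2 = (model_2['TP'] + model_2['TN']) / sum(model_2.values())

print('precision:', round(P_1, 3), round(P_2, 3))
print('accuracy: ', acc_1, acc_2)
```

Both reconstructed models come out at 85% accuracy, which matches the text, while their precisions differ.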

### Combining recall and precision as a way to optimize both: The $F_\beta$-score
* Precision and recall can be combined in a way that makes both important; this is the $F_\beta$ score: $$ F_\beta = \frac{(1+\beta^2)PR}{\beta^2P + R} $$
* Notice this is zero if either $P$ or $R$ is zero.
* $\beta \in [0, \infty)$ is a number that you can choose.
* $\beta = 1$ weights $P$ and $R$ equally.
* $\beta > 1$ places more importance on recall and avoiding false negatives.
* $0 < \beta < 1$ places more importance on precision and avoiding false positives.
* For example, we might consider $\beta = 2$ for the coronavirus model; we might consider $\beta = 0.5$ for the spam detection model.

<span style="color:green">**Q: What are the $F_2$ scores for the two spam detection models? What are their $F_{0.5}$ scores?**</span>

In [9]:
def F_beta(TP, FP, TN, FN, beta=1):
    # write this code
    return 0

In [10]:
# find results here
print(F_beta(TP=1, FP=1, TN=1, FN=1, beta=1))

0
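
One way the stub above might be completed, given as a sketch under stated assumptions rather than the intended solution. The name `F_beta_sketch` is hypothetical, the function returns 0 whenever precision or recall is zero or undefined, and the spam model counts are inferred from the precision fractions in the text and the dataset sizes (40 spam, 60 ham). `TN` is accepted to match the stub's signature but is not needed by the formula.

```python
def F_beta_sketch(TP, FP, TN, FN, beta=1):
    """Hypothetical F_beta from confusion-matrix counts (TN is unused)."""
    if TP == 0:
        # Precision or recall is zero (or undefined), so the score is zero.
        return 0.0
    P = TP / (TP + FP)
    R = TP / (TP + FN)
    return (1 + beta**2) * P * R / (beta**2 * P + R)

# Spam model counts inferred from the text (assumption, see lead-in).
model_1 = dict(TP=30, FP=5, TN=55, FN=10)
model_2 = dict(TP=35, FP=10, TN=50, FN=5)

for beta in (2, 0.5):
    f1 = F_beta_sketch(**model_1, beta=beta)
    f2 = F_beta_sketch(**model_2, beta=beta)
    print('beta =', beta, ' model 1:', round(f1, 3), ' model 2:', round(f2, 3))
```

With these counts, model 2 scores higher when recall is emphasized ($\beta = 2$) and model 1 scores higher when precision is emphasized ($\beta = 0.5$), in line with the precision comparison in the text.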
