# Data Science 1 - Tutorial 5.2 - Classification Part 1

## Classification evaluation metrics

Table 1 lists the characteristics of tumor masses together with their labels (Benign or Malignant). A colleague proposed a cancer (malignant tumor) detector that outputs the class Malignant if Radius > 5 and Area > 5, and the class Benign otherwise. Calculate the accuracy, precision, recall, specificity, F1 score, and balanced accuracy of this detector.


<center><b>Table 1: Tumor data and labels</b></center>

| Radius | Perimeter | Area | Category |
| --- | --- | --- | --- |
| 5 | 1 | 1 | Benign |
| 5 | 4 | 5 | Benign |
| 3 | 1 | 1 | Benign |
| 6 | 8 | 1 | Benign |
| 4 | 1 | 3 | Benign |
| 8 | 10 | 8 | Malignant |
| 1 | 1 | 1 | Benign |
| 2 | 2 | 1 | Benign |
| 2 | 1 | 1 | Benign |
| 4 | 1 | 1 | Benign |
| 1 | 1 | 1 | Benign |
| 2 | 1 | 1 | Benign |
| 5 | 3 | 3 | Malignant |
| 1 | 1 | 1 | Benign |
| 8 | 5 | 10 | Malignant |
| 7 | 6 | 4 | Malignant |
| 4 | 1 | 1 | Benign |
| 4 | 1 | 1 | Benign |
| 10 | 7 | 6 | Malignant |
| 6 | 1 | 1 | Benign |
| 7 | 2 | 10 | Malignant |
| 10 | 5 | 3 | Malignant |
| 3 | 1 | 1 | Benign |


**A**: ...

In [42]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report


data = {
    "Radius": [5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4, 4, 10, 6, 7, 10, 3],
    "Perimeter": [1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 1, 1, 7, 1, 2, 5, 1],
    "Area": [1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, 10, 4, 1, 1, 6, 1, 10, 3, 1],
    "Category": ["Benign", "Benign", "Benign", "Benign", "Benign", "Malignant", "Benign", "Benign", "Benign", "Benign", "Benign", "Benign", "Malignant", "Benign", "Malignant", "Malignant", "Benign", "Benign", "Malignant", "Benign", "Malignant", "Malignant", "Benign"]
}

cancer_df = pd.DataFrame(data)


In [43]:
cancer_df.value_counts("Category")

Category
Benign       16
Malignant     7
Name: count, dtype: int64

In [44]:

classes = {"Benign": 0, "Malignant": 1}  # Malignant is the positive class

cancer_df["Category"] = cancer_df["Category"].map(classes)

pred_s = (cancer_df["Area"] > 5) & (cancer_df["Radius"] > 5)  # detector rule from the question
pred_s = pred_s.astype("int")

cancer_df["Prediction"] = pred_s

print("Accuracy: ", accuracy_score(cancer_df["Category"], pred_s))
print("Precision: ", precision_score(cancer_df["Category"], pred_s))
print("Recall: ", recall_score(cancer_df["Category"], pred_s))
print("F1: ", f1_score(cancer_df["Category"], pred_s))

print("Confusion Matrix: ")
print(confusion_matrix(cancer_df["Category"], pred_s))

Accuracy:  0.8695652173913043
Precision:  1.0
Recall:  0.5714285714285714
F1:  0.7272727272727273
Confusion Matrix: 
[[16  0]
 [ 3  4]]


In [45]:
cancer_df

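| | Radius | Perimeter | Area | Category | Prediction |
| --- | --- | --- | --- | --- | --- |
| 0 | 5 | 1 | 1 | 0 | 0 |
| 1 | 5 | 4 | 5 | 0 | 0 |
| 2 | 3 | 1 | 1 | 0 | 0 |
| 3 | 6 | 8 | 1 | 0 | 0 |
| 4 | 4 | 1 | 3 | 0 | 0 |
| 5 | 8 | 10 | 8 | 1 | 1 |
| 6 | 1 | 1 | 1 | 0 | 0 |
| 7 | 2 | 2 | 1 | 0 | 0 |
| 8 | 2 | 1 | 1 | 0 | 0 |
| 9 | 4 | 1 | 1 | 0 | 0 |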


In [46]:
pivot = pd.pivot_table(cancer_df, index="Category", columns="Prediction", aggfunc="count", values="Radius", fill_value=0, margins=True)

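true_positive = pivot.iloc[1,1]   # Category 1, Prediction 1
true_negative = pivot.iloc[0,0]   # Category 0, Prediction 0
false_negative = pivot.iloc[1,0]  # Category 1, Prediction 0
false_positive = pivot.iloc[0,1]  # Category 0, Prediction 1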

accuracy = (true_positive + true_negative)/len(cancer_df)

precision = (true_positive)/(false_positive + true_positive)

recall = (true_positive)/(true_positive + false_negative)

f1 = 2/(1/precision + 1/recall)

print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1: ", f1)

Accuracy:  0.8695652173913043
Precision:  1.0
Recall:  0.5714285714285714
F1:  0.7272727272727273
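
The question also asks for specificity and balanced accuracy, which the cells above do not print. A minimal sketch, assuming the counts are read off the confusion matrix printed earlier (TN = 16, FP = 0, FN = 3, TP = 4); the variable names are illustrative:

```python
# Counts taken from the confusion matrix above, laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 16, 0, 3, 4

recall = tp / (tp + fn)        # sensitivity (true positive rate)
specificity = tn / (tn + fp)   # true negative rate
balanced_accuracy = (recall + specificity) / 2

print("Specificity: ", specificity)
print("Balanced accuracy: ", balanced_accuracy)
```

scikit-learn also provides `balanced_accuracy_score` in `sklearn.metrics`, which should give the same value for binary labels.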


## Bayes' theorem

Printer failures are associated with three types of problems: hardware, software, and other (such as human error), with probabilities 0.1, 0.6, and 0.3, respectively. The probability of a printer failure given a hardware problem is 0.9, given a software problem is 0.2, and given any other problem is 0.5. If a customer enters the manufacturer’s Web site to diagnose a printer failure, what is the most likely cause of the problem?

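**A**: If the first probabilities given in the problem (0.1, 0.6, 0.3) are read as posteriors given a failure, then P(Software | Failure) = 0.6 is the highest and "software" is the most likely cause. However, if they are actually the prior probabilities over the classes of failure (the usual reading, since the problem also gives the probability of a failure for each class), then we need to compute the posterior probabilities and choose the maximum one:

- P(S|F) = P(F|S)P(S)/P(F) = 0.12/0.36 ≈ 0.333
- P(H|F) = P(F|H)P(H)/P(F) = 0.09/0.36 = 0.25
- P(O|F) = P(F|O)P(O)/P(F) = 0.15/0.36 ≈ 0.417

with P(F) = P(F|S)P(S) + P(F|H)P(H) + P(F|O)P(O) = 0.2 × 0.6 + 0.9 × 0.1 + 0.5 × 0.3 = 0.36. Under this reading, "other" (such as human error) is the most likely cause, even though software has the largest prior.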
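
The posterior computation can be checked numerically. The sketch below assumes the first set of probabilities are priors and the second set are likelihoods of a failure; the dictionary names are illustrative:

```python
# Priors over problem types and likelihoods of a failure, as given in the question
prior = {"hardware": 0.1, "software": 0.6, "other": 0.3}
likelihood = {"hardware": 0.9, "software": 0.2, "other": 0.5}

# Evidence P(F) by the law of total probability
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Posterior P(cause | F) for each cause via Bayes' theorem
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}
most_likely = max(posterior, key=posterior.get)

for cause, p in posterior.items():
    print(f"P({cause} | failure) = {p:.3f}")
print("Most likely cause:", most_likely)
```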

## Decision Tree

1. Is the (ID3) decision tree algorithm that we saw in the lecture a greedy algorithm? Explain.
2. Still using the dataset in Table 1, discretize all the numerical features into $\leq5$ and $>5$ and derive the decision tree using the ID3 algorithm.

**A**: ...
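
As a starting point for question 2, the first ID3 step can be sketched as follows. This assumes the lecture's ID3 selects splits by entropy-based information gain; the helper names (`entropy`, `information_gain`) are illustrative and only the root split is evaluated, so the remaining levels of the tree still need to be derived by repeating the step on each branch.

```python
from math import log2

# Table 1, row by row: (Radius, Perimeter, Area, is_malignant)
rows = [
    (5, 1, 1, 0), (5, 4, 5, 0), (3, 1, 1, 0), (6, 8, 1, 0), (4, 1, 3, 0),
    (8, 10, 8, 1), (1, 1, 1, 0), (2, 2, 1, 0), (2, 1, 1, 0), (4, 1, 1, 0),
    (1, 1, 1, 0), (2, 1, 1, 0), (5, 3, 3, 1), (1, 1, 1, 0), (8, 5, 10, 1),
    (7, 6, 4, 1), (4, 1, 1, 0), (4, 1, 1, 0), (10, 7, 6, 1), (6, 1, 1, 0),
    (7, 2, 10, 1), (10, 5, 3, 1), (3, 1, 1, 0),
]
FEATURES = ("Radius", "Perimeter", "Area")


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(rows, feature_index, threshold=5):
    """Entropy reduction from splitting a feature into <= threshold and > threshold."""
    labels = [r[-1] for r in rows]
    low = [r[-1] for r in rows if r[feature_index] <= threshold]
    high = [r[-1] for r in rows if r[feature_index] > threshold]
    remainder = sum(len(part) / len(rows) * entropy(part) for part in (low, high))
    return entropy(labels) - remainder


gains = {name: information_gain(rows, i) for i, name in enumerate(FEATURES)}
root = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: gain = {gain:.4f}")
print("Root split:", root)
```

Picking the single best split at each node without revisiting earlier choices is the behaviour question 1 asks you to reason about.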

## Support Vector Machine

In the lecture we saw the following standard quadratic optimization
problem:
$$
\begin{align}
\underset{\boldsymbol{w},w_{0}}{\text{minimize}}\quad & \frac{1}{2}\left\Vert \boldsymbol{w}\right\Vert ^{2}\\
\text{subject to}\quad & y^{(j)}\left(\boldsymbol{w}^{\intercal}\boldsymbol{x}^{(j)}+w_{0}\right)\geq1\quad\forall j.
\end{align}
$$

In finding the optimal hyperplane, we can convert the optimization
problem to an unconstrained problem using Lagrange multipliers $\alpha^{(j)}$:

$$
\begin{align}
L_{p} & =\frac{1}{2}\left\Vert \boldsymbol{w}\right\Vert ^{2}-\sum_{j}\alpha^{(j)}\left\{ y^{(j)}\left(\boldsymbol{w}^{\intercal}\boldsymbol{x}^{(j)}+w_{0}\right)-1\right\} \\
 & =\frac{1}{2}\left(\boldsymbol{w}^{\intercal}\boldsymbol{w}\right)-\boldsymbol{w}^{\intercal}\sum_{j}\alpha^{(j)}y^{(j)}\boldsymbol{x}^{(j)}-w_{0}\sum_{j}\alpha^{(j)}y^{(j)}+\sum_{j}\alpha^{(j)}
\end{align}
$$

for $j=1,\ldots,n$ samples. This should be minimized w.r.t. $\boldsymbol{w},w_{0}$
and maximized w.r.t. $\alpha^{(j)}\geq0$.

1. Find the partial derivatives of $ L_{p} $ with respect to $ \boldsymbol{w} $ and $ w_{0} $ and set them to zero.

2. Since the objective of the minimization problem and the linear constraints are convex, we can solve the dual problem instead. Plug the equations you obtained into $L_{p}$, which we now call the dual $L_{d}$, so that the terms $\boldsymbol{w}$ and $w_{0}$ no longer appear.

3. Research: What is the dual problem, and what are its constraints? What values do the $\alpha^{(j)}$ take for the support vectors, and what values do they take for the rest of the samples?

4. Consider two samples $\boldsymbol{x}^{(1)}=\left(2,2\right)$ and $\boldsymbol{x}^{(2)}=\left(5,6\right)$ with labels $y^{(1)}=+1$ and $y^{(2)}=-1$, respectively. Using the fact that the Lagrange multipliers are given by $$ \alpha^{(1)}=\alpha^{(2)}=\frac{2}{\left(x_{1}^{(1)}-x_{1}^{(2)}\right)^{2}+\left(x_{2}^{(1)}-x_{2}^{\left(2\right)}\right)^{2}},$$ find the optimal separating hyperplane.

5. In the lecture, we considered samples whose two classes are linearly separable. When no hyperplane separates the classes, we look for the one that incurs the least error. This is called the soft margin hyperplane, and we define slack variables $\xi^{(j)}\geq0$ for the samples $j=1,2,\ldots,n$. Research: Write down the new optimization problem. Which values of $\xi^{(j)}$ correspond to data points within the margin, and which correspond to misclassified data points?
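
For items 1 and 2, the stationarity conditions and the resulting dual take the following standard form. This is a sketch to check your own derivation against, written in the notation of $L_{p}$ above; the second summation index $k$ is introduced here only for the double sum.

$$
\begin{align}
\frac{\partial L_{p}}{\partial\boldsymbol{w}}=0 & \;\Rightarrow\;\boldsymbol{w}=\sum_{j}\alpha^{(j)}y^{(j)}\boldsymbol{x}^{(j)}\\
\frac{\partial L_{p}}{\partial w_{0}}=0 & \;\Rightarrow\;\sum_{j}\alpha^{(j)}y^{(j)}=0\\
L_{d} & =\sum_{j}\alpha^{(j)}-\frac{1}{2}\sum_{j}\sum_{k}\alpha^{(j)}\alpha^{(k)}y^{(j)}y^{(k)}\left(\boldsymbol{x}^{(j)}\right)^{\intercal}\boldsymbol{x}^{(k)}
\end{align}
$$

Substituting the first condition into $L_{p}$ turns the first two terms into $-\frac{1}{2}\boldsymbol{w}^{\intercal}\boldsymbol{w}$, and the second condition removes the $w_{0}$ term. $L_{d}$ is then maximized over $\alpha^{(j)}\geq0$ subject to $\sum_{j}\alpha^{(j)}y^{(j)}=0$.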

**A**: ...
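
For item 4, the numbers can be worked out with a short script. This is a sketch that assumes the standard relation $\boldsymbol{w}=\sum_{j}\alpha^{(j)}y^{(j)}\boldsymbol{x}^{(j)}$ and that both samples are support vectors (so each satisfies its constraint with equality); the helper `dot` is illustrative.

```python
# Two samples from item 4 and their labels
x1, y1 = (2.0, 2.0), +1
x2, y2 = (5.0, 6.0), -1

# Lagrange multipliers from the formula given in item 4
alpha = 2.0 / ((x1[0] - x2[0]) ** 2 + (x1[1] - x2[1]) ** 2)

# w = sum_j alpha^(j) y^(j) x^(j), with alpha^(1) = alpha^(2) = alpha
w = tuple(alpha * (y1 * a + y2 * b) for a, b in zip(x1, x2))


def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


# x1 is a support vector, so y1 * (w . x1 + w0) = 1; with y1 = +/-1 this gives w0
w0 = y1 - dot(w, x1)

print("alpha =", alpha)
print("w =", w, " w0 =", w0)
print("constraint values:", y1 * (dot(w, x1) + w0), y2 * (dot(w, x2) + w0))
```

Up to floating-point rounding this gives $\alpha=0.08$, $\boldsymbol{w}=(-0.24,-0.32)$ and $w_{0}=2.12$, so the hyperplane is $-0.24x_{1}-0.32x_{2}+2.12=0$, which passes through the midpoint $(3.5,4)$ of the two samples.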