# Metrics: Micro and Macro approaches
In this lab we review the metrics used in multiclass classification (precision, recall and F1 score) under the micro and macro averaging approaches.

# Metrics

We will be using precision, recall and F1 score. For more information, please see:
* [The scoring parameter](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)
* [Precision, recall and F-measures](http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics) for the precision, recall and F-measure metrics.

### Terms

* __True positives (TP):__ an outcome where the model correctly predicts the positive class.
* __True negatives (TN):__ an outcome where the model correctly predicts the negative class.
* __False positives (FP):__ an outcome where the model incorrectly predicts the positive class.
* __False negatives (FN):__ an outcome where the model incorrectly predicts the negative class.
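
As a minimal sketch of these four terms, the counts can be obtained by comparing labels and predictions element by element. The arrays `y_true` and `y_pred` below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical binary example: 1 = positive class, 0 = negative class
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, actually positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted negative, actually negative
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, actually negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, actually positive

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

Every sample falls into exactly one of the four groups, so the counts always add up to the number of samples.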

### Equation Precision
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.  
$
\text{precision} = \cfrac{TP}{TP + FP}
$


### Equation Recall
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.  
$
\text{recall} = \cfrac{TP}{TP + FN}
$


### Equation  $F_1$ score
F1 score is the harmonic mean of precision and recall. Therefore, this score takes both false positives and false negatives into account.  
$
F_1 = \cfrac{2}{\cfrac{1}{\text{precision}} + \cfrac{1}{\text{recall}}} = 2 \times \cfrac{\text{precision}\, \times \, \text{recall}}{\text{precision}\, + \, \text{recall}} = \cfrac{TP}{TP + \cfrac{FN + FP}{2}}
$
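
The three equations above can be checked numerically. This is a small sketch with made-up counts (`tp`, `fp` and `fn` are hypothetical values), showing that the three forms of the $F_1$ formula agree.

```python
# Hypothetical counts, chosen only for illustration
tp, fp, fn = 3, 1, 2

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# The three equivalent forms of the F1 score given above
f1_harmonic = 2 / (1 / precision + 1 / recall)
f1_product = 2 * precision * recall / (precision + recall)
f1_counts = tp / (tp + (fn + fp) / 2)

print("precision:", precision, "recall:", recall)
print("F1 (three forms):", f1_harmonic, f1_product, f1_counts)
```

The harmonic mean sits closer to the smaller of the two values, so a classifier cannot reach a high $F_1$ with only one of precision or recall being high.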

# Micro and macro approaches
## Micro precision
Micro averaging treats the entire set of data as an aggregate result, and calculates 1 metric rather than k metrics that get averaged together.

For precision, this works by calculating all of the true positive results for each class and using that as the numerator, and then calculating all of the true positive and false positive results for each class, and using that as the denominator.  
$
\text{precision micro }=\cfrac {TP_1+TP_2+…+TP_k}{(TP_1+TP_2+…+TP_k)+(FP_1+FP_2+…+FP_k)}
$   
In this case, rather than each class having equal weight, each observation gets equal weight. This gives the classes with the most observations more power.
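
The micro formula can be sketched by hand from a confusion matrix. The labels below are hypothetical; the sketch assumes scikit-learn's `confusion_matrix` convention (rows are true classes, columns are predicted classes).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical 3-class example
y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 1, 2, 3, 3, 1]

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
tp = np.diag(cm)             # TP_k: diagonal entries
fp = cm.sum(axis=0) - tp     # FP_k: predicted as class k but actually another class

# Pool the counts over all classes before dividing
micro_precision = tp.sum() / (tp.sum() + fp.sum())

print("micro precision (manual): ", micro_precision)
print("micro precision (sklearn):", precision_score(y_true, y_pred, average='micro'))
```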
## Macro precision
Macro averaging reduces your multiclass predictions down to multiple sets of binary predictions, calculates the corresponding metric for each of the binary cases, and then averages the results together. As an example, consider precision for the binary case.

 $
\text{precision} = \cfrac{TP}{TP + FP}
$

In the multiclass case, if there are classes 1, 2 and 3, macro averaging reduces the problem to multiple one-vs-all comparisons. The truth and estimate columns are recoded such that the only two classes are 1 and other, and then precision is calculated based on those recoded columns, with 1 being the “relevant” class. This process is repeated for the other 2 classes to get a total of 3 precision values. The results are then averaged together.

The formula representation looks like this. For k classes:  
$
\text{precision macro }=\cfrac {Pr_1+Pr_2+…+Pr_k}{k}
$  
where $Pr_1$ is the precision calculated from recoding the multiclass predictions down to just class 1 and other.
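
A minimal sketch of this recoding, using hypothetical labels: for each class, both the truth and the predictions are turned into "this class" (1) versus "other" (0), a binary precision is computed, and the values are averaged.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical 3-class example
y_true = np.array([1, 1, 1, 1, 2, 2, 3, 3])
y_pred = np.array([1, 1, 2, 1, 2, 3, 3, 1])

per_class = []
for c in [1, 2, 3]:
    true_c = (y_true == c).astype(int)   # recoded truth: class c vs other
    pred_c = (y_pred == c).astype(int)   # recoded predictions: class c vs other
    per_class.append(precision_score(true_c, pred_c))

macro_precision = np.mean(per_class)

print("per-class precision:", per_class)
print("macro precision (manual): ", macro_precision)
print("macro precision (sklearn):", precision_score(y_true, y_pred, average='macro'))
```

Each class contributes one term to the average regardless of how many samples it has.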



# Example: Multiclass classifier & micro and macro precision 
Suppose we have a multiclass classification problem with three classes where:

* 98% of the points belong to Class 1
* 1% of the points belong to Class 2
* 1% of the points belong to Class 3 
* We have a classifier that always predicts class 1.   
What will the precision be under the micro approach? What about the macro approach?


In [5]:
import numpy as np 
from sklearn.metrics import f1_score, precision_score, recall_score
#actual_labels: 98 elements class 1, 1 element class 2, 1 element class 3
actual_labels= np.ones(98)
actual_labels= np.append(actual_labels,[2,3])
#the predictions ( all ones)
predictions= np.ones(100)


## Calculating metrics using sklearn

In [6]:
print("Precision score for each class is :", precision_score(actual_labels, predictions, average=None))
print("Precision score using micro approach :", precision_score(actual_labels, predictions, average='micro'))
print("Precision score using macro approach :", np.round(precision_score(actual_labels, predictions, average='macro'),3))
print("Recall score using micro approach :", recall_score(actual_labels, predictions, average='micro'))
print("Recall score using macro approach :", np.round(recall_score(actual_labels, predictions, average='macro'),3))
print("F1 score using micro approach :", f1_score(actual_labels, predictions, average='micro'))
print("F1 score using macro approach :", np.round(f1_score(actual_labels, predictions, average='macro'),3))


Precision score for each class is : [0.98 0.   0.  ]
Precision score using micro approach : 0.98
Precision score using macro approach : 0.327
Recall score using micro approach : 0.98
Recall score using macro approach : 0.333
F1 score using micro approach : 0.98
F1 score using macro approach : 0.33


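The micro-averaged scores (0.98) are dominated by class 1 and hide the fact that the classifier never predicts classes 2 and 3, whereas the macro-averaged scores (about 0.33) expose it.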

## Calculating macro precision manually (Decomposing into 3 binary classifiers) 
In this section we reproduce the macro precision obtained with sklearn.
Macro precision reduces the multiclass predictions down to multiple sets of binary predictions, calculates precision for each of the binary cases, and then averages the results together.
With classes 1, 2 and 3, this means three one-vs-all comparisons. The truth and estimate columns are recoded such that the only two classes are 1 and other, and then precision is calculated based on those recoded columns, with 1 being the “relevant” class. This process is repeated for the other 2 classes to get a total of 3 precision values. The results are then averaged together.

In [7]:
# Classifier predictions (all ones, i.e. always class 1)
predictions = np.ones(100)
# One-vs-rest recoding is applied to the labels AND to the predictions:
# 1 means "this class", 0 means "any other class"
# First binary classifier (class 1 vs class 2 and 3)
# labels: 98 ones and 2 zeroes; predictions: 100 ones
labels1 = np.append(np.ones(98), [0, 0])
predictions1 = (predictions == 1).astype(int)
# Second binary classifier (class 2 vs class 1 and 3)
# labels: 99 zeroes and 1 one; predictions: 100 zeroes (class 2 is never predicted)
labels2 = np.append(np.zeros(98), [1, 0])
predictions2 = (predictions == 2).astype(int)
# Third binary classifier (class 3 vs class 1 and 2)
# labels: 99 zeroes and 1 one; predictions: 100 zeroes (class 3 is never predicted)
labels3 = np.append(np.zeros(99), 1)
predictions3 = (predictions == 3).astype(int)

# zero_division=0 returns 0 without a warning when a class is never predicted (TP + FP = 0)
precision1 = precision_score(labels1, predictions1, zero_division=0)
precision2 = precision_score(labels2, predictions2, zero_division=0)
precision3 = precision_score(labels3, predictions3, zero_division=0)
print("Precision score for  the first binary classifier (class 1 vs class 2 and 3) :", precision1)
print("Precision score for  the second binary classifier (class 2 vs class 1 and 3):", precision2)
print("Precision score for  the third binary classifier (class 3 vs class 1 and 2):", precision3)
print("Precision score macro (average of the 3 binary classifiers precision) :", np.round(np.mean([precision1, precision2, precision3]),3))
print("Precision score macro using sklearn:", np.round(precision_score(actual_labels, predictions, average='macro'),3))


Precision score for  the first binary classifier (class 1 vs class 2 and 3) : 0.98
Precision score for  the second binary classifier (class 2 vs class 1 and 3): 0.0
Precision score for  the third binary classifier (class 3 vs class 1 and 2): 0.0
Precision score macro (average of the 3 binary classifiers precision) : 0.327
Precision score macro using sklearn: 0.327


# Equal precision, recall and F1 score when using micro averaging with multi-class 

### Precision
 $
\text{precision } P = \cfrac{TP}{TP + FP}
$  
__Precision__ can be intuitively understood as the classifier’s ability to predict only truly positive samples as positive. For example, a classifier that classifies everything as positive would have a precision of 0.5 in a balanced test set (50% positive, 50% negative). One that has no false positives, i.e. classifies only the true positives as positive, would have a precision of 1.0. So basically, the fewer false positives a classifier gives, the higher its precision.  
### Recall
 $
\text{recall } R= \cfrac{TP}{TP + FN}
$  
__Recall__ can be interpreted as the fraction of positive test samples that were actually classified as positive. A classifier that just outputs positive for every sample, regardless of whether it is really positive, would get a recall of 1.0 but a lower precision. The fewer false negatives a classifier gives, the higher its recall.

So the higher precision and recall are, the better the classifier performs because it detects most of the positive samples (high recall) and does not detect many samples that should not be detected (high precision). In order to quantify that, we can use another metric called F1 score.
### F1 score
 $
\text{F1 score } F_1=2 \cfrac{P \times R}{P+R}
$  
This is the harmonic mean of precision and recall. The higher precision and recall are, the higher the F1 score is. You can directly see from this formula that if $P=R$, then $F_1=P=R$, because:  
$
F_1=2 \cfrac{P \times R}{P+R}=2\cfrac{P \times P}{P+P}=2\cfrac{P^2}{2P}=P
$  
So this already explains why the F1 score is the same as precision and recall, if precision and recall are the same. But why are recall and precision the same when using micro averaging? Let’s look at an example to understand this.

## Example

<img src="example.png" alt="example" style="width: 600px;"/>

* TP is the number of samples that were predicted to have the correct label.  
* __TP = 4 (all green cells)__
* FP is the number of labels that got a “vote” but shouldn’t have. For example, in the first column, 1 should have been predicted, but 2 was predicted. So there is a false positive for class 2 in this case. On the other hand, if the prediction is right (column 2), there is no FP counted. 
* __FP = 5 (all red cells)__
* FN is the number of labels that should have been predicted, but weren’t. Look at the first column again. 1 should have been predicted, but wasn’t. So there is a FN for class 1 in this case. As in the FP case, there is no FN counted if the prediction is correct (column 2).
* __FN = 5 (all red cells)__

* In other words, if there is a false positive, there will always also be a false negative and vice versa, because exactly one class is predicted for each sample. If class A is predicted and the true label is B, then there is a FP for A and a FN for B. If the prediction is correct, i.e. class A is predicted and A is also the true label, then there is neither a false positive nor a false negative but only a true positive. So there is no possibility that would increase only FP or FN but not both. That is why precision and recall are always the same when using the micro averaging scheme.
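
The counting argument above can be sketched with a confusion matrix. The labels and predictions are the nine samples of this example; the sketch assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels      = [1, 2, 3, 2, 3, 3, 1, 2, 2]
predictions = [2, 2, 1, 2, 1, 3, 2, 3, 2]

cm = confusion_matrix(labels, predictions, labels=[1, 2, 3])
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp   # predicted as the class, but the true label differs
fn = cm.sum(axis=1) - tp   # true label is the class, but another class was predicted

print("TP per class:", tp, "total:", tp.sum())
print("FP per class:", fp, "total:", fp.sum())
print("FN per class:", fn, "total:", fn.sum())
```

The per-class FP and FN counts differ, but every off-diagonal entry of the matrix is counted once as a FP (in its column) and once as a FN (in its row), so the totals must match.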

Now let’s actually calculate the values of precision, recall and F1 score.
 * $
\text{ precision } P = \cfrac{4}{4+5}= \cfrac{4}{9}= 0.44 
$   
 * $
\text{ recall  } R = \cfrac{4}{4+5}= \cfrac{4}{9}= 0.44 
$   
 * $
\text{ F1 score  } F_1 = 2\cfrac{4/9 \times 4/9}{4/9+4/9}= \cfrac{4}{9}= 0.44 
$

We can see that all metric values are identical.

Note: Micro averaging pools the TP, FP and FN counts over all samples instead of averaging per-class scores. A class with very few test samples therefore cannot skew the result, but with an unequally distributed test set (e.g. 3 classes and one of these has 98% of the samples) the score mostly reflects the largest class.

In [8]:
from sklearn.metrics import precision_score, recall_score, f1_score
# These values are the same as in the table above
labels      = [1,2,3,2,3,3,1,2,2]
predictions = [2,2,1,2,1,3,2,3,2]
print("Precision (micro): %f" % precision_score(labels, predictions, average='micro'))
print("Recall (micro):    %f" % recall_score(labels, predictions, average='micro'))
print("F1 score (micro):  %f" % f1_score(labels, predictions, average='micro'), end='\n\n')


print("Precision (macro): %f" % precision_score(labels, predictions, average='macro'))
print("Recall (macro):    %f" % recall_score(labels, predictions, average='macro'))
print("F1 score (macro):  %f" % f1_score(labels, predictions, average='macro'), end='\n\n')

Precision (micro): 0.444444
Recall (micro):    0.444444
F1 score (micro):  0.444444

Precision (macro): 0.366667
Recall (macro):    0.361111
F1 score (macro):  0.355556
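
To see where the macro values come from, the per-class scores can be inspected with `average=None` and averaged by hand. This is a sketch on the same nine samples; the dictionary name `per_class_scores` is only illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

labels      = [1, 2, 3, 2, 3, 3, 1, 2, 2]
predictions = [2, 2, 1, 2, 1, 3, 2, 3, 2]

per_class_scores = {}
for name, metric in [("Precision", precision_score),
                     ("Recall", recall_score),
                     ("F1 score", f1_score)]:
    # average=None returns one score per class, ordered by sorted label (1, 2, 3)
    per_class_scores[name] = metric(labels, predictions, average=None)
    macro = metric(labels, predictions, average='macro')
    print(name, "per class:", np.round(per_class_scores[name], 3),
          "| mean: %f | macro: %f" % (per_class_scores[name].mean(), macro))
```

Class 1 is never predicted correctly, so its precision, recall and F1 are all 0, which pulls every macro value below the micro value of 0.444.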



# Case study: Iris Data

## Load the data

In [44]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import  datasets
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.linear_model import LogisticRegression
import pandas as pd
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

## Adding noisy features

In [45]:

print("original X shape", X.shape)

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 5*n_features)]

print("X shape after adding noisy features", X.shape)
# Split into training, validation and test
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)



original X shape (150, 4)
X shape after adding noisy features (150, 24)
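
Because micro averaging weights samples and macro averaging weights classes, it helps to know how many samples of each class land in each split. The sketch below applies the same two `train_test_split` calls to placeholder data of the same shape (`X_demo` and `y_demo` are hypothetical stand-ins, not the iris features themselves) and prints the sizes and class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the noisy iris matrix (150 x 24);
# only the shapes and labels matter here, not the feature values
X_demo = np.zeros((150, 24))
y_demo = np.repeat([0, 1, 2], 50)  # 50 samples per class, as in iris

# Same two-step split as above: first carve out validation, then test
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_tr, y_tr, test_size=0.15, random_state=42)

for name, y_part in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    print(name, "size:", len(y_part), "class counts:", np.bincount(y_part, minlength=3))
```

The validation and test sets are small, so a single misclassified sample changes the scores noticeably, and the splits are not stratified, so the classes need not be equally represented in them.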


## Classification

In [46]:
# Run classifier
classifier = LogisticRegression(multi_class='multinomial',random_state=42,solver = 'lbfgs')
model= classifier.fit(X_train, y_train)
y_score_test= model.predict(X_test)
y_score_valid = model.predict(X_valid)




## Results

In [47]:
# Compute Precision and Recall

results = pd.DataFrame(columns=["set ", "Precision micro", "Precision macro","recall micro", "recall macro"])
results.loc[len(results)] = ["test set ", np.round(precision_score(y_test, y_score_test, average='micro'),3), 
                            np.round(precision_score(y_test, y_score_test,  average='macro'),3),np.round(recall_score(y_test, y_score_test, average='micro'),3),
                            np.round(recall_score(y_test, y_score_test, average='macro'),3)]

results.loc[len(results)] = ["validation set ", np.round(precision_score(y_valid, y_score_valid, average='micro'),3), 
                            np.round(precision_score(y_valid, y_score_valid,  average='macro'),3),np.round(recall_score(y_valid, y_score_valid, average='micro'),3),
                            np.round(recall_score(y_valid, y_score_valid, average='macro'),3)]


results


| | set | Precision micro | Precision macro | recall micro | recall macro |
|---|---|---|---|---|---|
| 0 | test set | 0.9 | 0.905 | 0.9 | 0.905 |
| 1 | validation set | 0.913 | 0.907 | 0.913 | 0.907 |
