## Metrics

#### Classification:
- Confusion Matrix
- Accuracy score
- Misclassification score
- Precision score (Positive Predictive Value, PPV)
- Recall score (Sensitivity, True Positive Rate)
- F1 score
- Classification Report

#### Regression:
- R2
- MSE
- Mean Absolute Error (MAE)

#### Clustering:
- Adjusted Rand Index
- Homogeneity
- V-measure
- Completeness

#### Confusion Matrix

- Returns: array, shape = [n_classes, n_classes]
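- Entry [i, j] counts the samples whose true class is i and whose predicted class is j (rows are true labels, columns are predicted labels).
- Binary case: flattening the 2x2 matrix with ravel() yields TN, FP, FN, TP, in that order.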

In [46]:
import pandas as pd

# Multiclass with no labels
y_test = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

from sklearn.metrics import confusion_matrix
cf = confusion_matrix(y_test, y_pred)

display(pd.DataFrame(cf))

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 2 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 2 |


In [47]:
# Multiclass with labels
y_test = ["C", "A", "C", "C", "A", "B"]
y_pred = ["A", "A", "C", "C", "A", "C"]
labels = ["A", "B", "C"]

from sklearn.metrics import confusion_matrix
cf = confusion_matrix(y_test, y_pred, labels=labels)

df = pd.DataFrame(cf, index=labels, columns=labels)
display(df)

| | A | B | C |
|---|---|---|---|
| A | 2 | 0 | 0 |
| B | 0 | 0 | 1 |
| C | 1 | 0 | 2 |


In [48]:
# Extracted binary confusion matrix:
y_test = [1, 1, 0, 0,   1, 1]
y_pred = [1, 0, 1, 0,   1, 1]

from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print('True Positive: {}'.format(tp))
print('True Negative: {}'.format(tn))
print('False Positive: {}'.format(fp))
print('False Negative: {}'.format(fn))

True Positive: 3
True Negative: 1
False Positive: 1
False Negative: 1


#### Accuracy score
Overall, how often is the classifier correct?

- Accuracy is the ratio of correctly predicted labels to the total number of samples.
- Accuracy = (TP + TN) / (TP + FP + FN + TN)
- A good measure when the numbers of false positives and false negatives are similar.
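The formula can be checked by hand against accuracy_score. This is a minimal sketch that reuses the binary example from the confusion-matrix section; the names manual_acc and sk_acc are illustrative only.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_test = [1, 1, 0, 0, 1, 1]  # ground truth
y_pred = [1, 0, 1, 0, 1, 1]  # predictions

# Unpack the 2x2 confusion matrix in sklearn's order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# Accuracy = (TP + TN) / (TP + FP + FN + TN)
manual_acc = (tp + tn) / (tp + fp + fn + tn)
sk_acc = accuracy_score(y_test, y_pred)

print(manual_acc, sk_acc)
```

Both values should agree, since accuracy_score counts exactly the diagonal of the confusion matrix.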


In [49]:
y_test      = ['B', 'A', 'A', 'C', 'B'] # Ground truth 
predictions = ['A', 'A', 'C', 'B', 'B'] # Predictions

# Ratio
from sklearn.metrics import accuracy_score
acc  = accuracy_score(y_test, predictions) # 2/5
print('{}'.format(acc))

0.4


In [56]:
y_test      = [2, 1, 1, 3, 2] # Ground truth 
predictions = [1, 1, 3, 2, 2] # Predictions

# Nr of correctly classified samples.
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, predictions, normalize=False)
print('{}'.format(acc))

2


In [98]:
y_test      = [2, 1, 1, 3, 2] # Ground truth 
predictions = [1, 1, 3, 2, 2] # Predictions

# Weighted.
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, predictions, sample_weight=[0,1,0,0,1])
print('{}'.format(acc))

1.0


#### Misclassification score (1 minus Accuracy)
Overall, how often is the classifier wrong?

- Misclassification rate = (FP + FN) / total
- Equivalent to 1 minus Accuracy
- Also known as the "Error Rate"
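A short sketch of the error rate, using the string-labelled example from the accuracy section. It computes 1 minus accuracy directly and compares it with zero_one_loss from sklearn.metrics, which returns the fraction (or, with normalize=False, the number) of misclassified samples. Variable names are illustrative.

```python
from sklearn.metrics import accuracy_score, zero_one_loss

y_test      = ['B', 'A', 'A', 'C', 'B']  # ground truth
predictions = ['A', 'A', 'C', 'B', 'B']  # predictions

# Error rate as 1 minus accuracy
error_rate = 1 - accuracy_score(y_test, predictions)

# The same quantity via zero_one_loss (fraction of misclassified samples)
loss = zero_one_loss(y_test, predictions)

# Number of misclassified samples instead of the fraction
n_wrong = zero_one_loss(y_test, predictions, normalize=False)

print(error_rate, loss, n_wrong)
```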

#### Precision score (Positive Predictive Value, PPV)

When it predicts yes, how often is it correct?

- Returns: float (if average is not None) or array of floats.
- Ratio of correctly predicted positive labels to all samples predicted positive.
- Of all emails classified as spam, how many actually were spam?
- Precision = TP / (TP + FP)
- Many false positives mean low Precision.
- Ability of the classifier not to label a negative sample as positive.
- The best value is 1 and the worst value is 0.

Averaging:
- Required for multiclass/multilabel targets.
- None: the scores for each class are returned.
- 'binary': only report results for the class specified by pos_label.
- 'micro': calculate globally by counting the total true positives, false negatives and false positives.
- 'macro': calculate the metric for each label and take the unweighted mean.
- 'weighted': calculate the metric for each label and take the mean weighted by support (the number of true instances of each label).
- 'samples': calculate the metric for each instance and take the mean (multilabel classification only).
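To make the averaging modes concrete, the sketch below recomputes per-class, macro, micro and weighted precision from the confusion matrix with NumPy. It is an illustration under the assumption that every class is predicted at least once (otherwise the per-class division would hit zero); the variable names are not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

y_test = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

cm = confusion_matrix(y_test, y_pred)   # rows: true labels, columns: predicted labels
tp = np.diag(cm).astype(float)          # true positives per class
predicted = cm.sum(axis=0)              # TP + FP per class (column sums)
support = cm.sum(axis=1)                # true instances per class (row sums)

per_class = tp / predicted                          # average=None
macro = per_class.mean()                            # unweighted mean over classes
micro = tp.sum() / predicted.sum()                  # pooled counts over all classes
weighted = np.average(per_class, weights=support)   # mean weighted by support

print(per_class, macro, micro, weighted)
```

Because every class has the same support here, the macro and weighted averages coincide; they differ once the classes are imbalanced.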


In [None]:
precision_score(y_true, y_pred, 
                labels=None, # set of labels to include if not binary
                pos_label=1, # class to report if average='binary' and the data is binary.
                average='binary', # averaging mode (options listed above)
                sample_weight=None)

In [64]:
from sklearn.metrics import precision_score

y_test = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

In [65]:
# Precision
ps = precision_score(y_test, y_pred, average=None)
print('{}'.format(ps))

[0.66666667 0.         0.        ]


In [66]:
# Precision with macro averaging:
ps = precision_score(y_test, y_pred, average='macro') 
print('{}'.format(ps))

0.2222222222222222


In [67]:
# Precision with micro averaging:
ps = precision_score(y_test, y_pred, average='micro')  
print('{}'.format(ps))

0.3333333333333333


In [68]:
# Precision with weighted averaging:
ps = precision_score(y_test, y_pred, average='weighted')
print('{}'.format(ps))

0.2222222222222222


#### Recall score (Sensitivity, True Positive Rate)
When it's actually yes, how often does it predict yes?

- Ability of the classifier to find all the positive samples.
- Ratio of correctly predicted positive labels to all samples in the actual positive class.
- Recall = TP / (TP + FN)
- The best value is 1 and the worst value is 0.
- Returns: float (if average is not None) or array of floats.

Averaging:
- Required for multiclass/multilabel targets.
- None: the scores for each class are returned.
- 'binary': only report results for the class specified by pos_label.
- 'micro': calculate globally by counting the total true positives, false negatives and false positives.
- 'macro': calculate the metric for each label and take the unweighted mean.
- 'weighted': calculate the metric for each label and take the mean weighted by support (the number of true instances of each label).
- 'samples': calculate the metric for each instance and take the mean (multilabel classification only).
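For binary targets, the pos_label parameter decides which class counts as "positive". The sketch below, on the binary example from the confusion-matrix section, shows that recall with the default pos_label=1 is the sensitivity, while recall with pos_label=0 is the specificity. The variable names are illustrative.

```python
from sklearn.metrics import recall_score

y_test = [1, 1, 0, 0, 1, 1]  # ground truth
y_pred = [1, 0, 1, 0, 1, 1]  # predictions

# Recall of the positive class: TP / (TP + FN)
sensitivity = recall_score(y_test, y_pred)

# Recall of the negative class: TN / (TN + FP)
specificity = recall_score(y_test, y_pred, pos_label=0)

print(sensitivity, specificity)
```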

In [None]:
recall_score(y_true, y_pred, 
             labels=None, # set of labels to include if not binary
             pos_label=1, # class to report if average='binary' and the data is binary.
             average='binary', # averaging mode (options listed above)
             sample_weight=None)

In [70]:
from sklearn.metrics import recall_score
y_test = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

In [71]:
# Recall score
rs = recall_score(y_test, y_pred, average=None)
print('{}'.format(rs))

[1. 0. 0.]


In [72]:
# Macro averaged recall score 
rs = recall_score(y_test, y_pred, average='macro')  
print('{}'.format(rs))

0.3333333333333333


In [73]:
# Micro averaged recall score 
rs = recall_score(y_test, y_pred, average='micro')  
print('{}'.format(rs))

0.3333333333333333


In [74]:
# Weighted averaged recall score 
rs = recall_score(y_test, y_pred, average='weighted')  
print('{}'.format(rs))

0.3333333333333333


#### F1 score

- Balance between precision and recall.
- The harmonic mean of Precision and Recall.
- F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
- Returns: float or array of float, shape = [n_unique_labels]
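The formula can be verified on a small binary case. The data below is made up for illustration (it does not appear elsewhere in these notes) and is chosen so that precision and recall differ, which shows that the harmonic mean sits below the arithmetic mean.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative binary data: TP=2, FN=2, FP=1, TN=1
y_test = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

p = precision_score(y_test, y_pred)  # TP / (TP + FP) = 2/3
r = recall_score(y_test, y_pred)     # TP / (TP + FN) = 1/2

manual_f1 = 2 * (r * p) / (r + p)    # harmonic mean of precision and recall
arithmetic_mean = (p + r) / 2        # for comparison only

print(manual_f1, f1_score(y_test, y_pred), arithmetic_mean)
```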

Averaging:
- Required for multiclass/multilabel targets.
- None: the scores for each class are returned.
- 'binary': only report results for the class specified by pos_label.
- 'micro': calculate globally by counting the total true positives, false negatives and false positives.
- 'macro': calculate the metric for each label and take the unweighted mean.
- 'weighted': calculate the metric for each label and take the mean weighted by support (the number of true instances of each label).
- 'samples': calculate the metric for each instance and take the mean (multilabel classification only).

In [None]:
f1_score(y_true, y_pred, 
         labels=None, # set of labels to include if not binary
         pos_label=1, # class to report if average='binary' and the data is binary.
         average='binary', # averaging mode (options listed above)
         sample_weight=None)

In [86]:
# Integer-encoded version of the labels ['B', 'A', 'A', 'C', 'B'] / ['A', 'A', 'C', 'B', 'B']
y_test      = [2, 1, 1, 3, 2] # Ground truth 
predictions = [1, 1, 3, 2, 2] # Predictions

In [87]:
# F1 score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, predictions, average=None)  # separate name so the function is not shadowed
print('{}'.format(f1))

[0.5 0.5 0. ]


In [90]:
# Macro averaged F1 score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, predictions, average='macro')  # separate name so the function is not shadowed
print('{}'.format(f1))

0.3333333333333333


In [91]:
# Micro averaged F1 score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, predictions, average='micro')  # separate name so the function is not shadowed
print('{}'.format(f1))

0.4000000000000001


In [92]:
# Weighted averaged F1 score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, predictions, average='weighted')  # separate name so the function is not shadowed
print('{}'.format(f1))

0.4


#### Classification Report

Text summary of the precision, recall and F1 score for each class.

- Builds a text report showing the main classification metrics
- Returns a string, or a dict if output_dict=True. The dict has the following structure:

{'label 1': {'precision':0.5,
             'recall':1.0,
             'f1-score':0.67,
             'support':1},
 'label 2': { ... },
  ...
}

The reported averages include:
- micro average (averaging the total true positives, false negatives and false positives),
- macro average (averaging the unweighted mean per label),
- weighted average (averaging the support-weighted mean per label),
- sample average (only for multilabel classification).

In binary classification, recall of the positive class is also known as "sensitivity"; recall of the negative class is "specificity".
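When the numbers are needed programmatically rather than as text, output_dict=True returns the report as a nested dict keyed by class name and by average name. The sketch below uses a small three-class example with illustrative display names; which average rows are present can depend on the scikit-learn version, so only the per-class entries and the 'macro avg' row are accessed here.

```python
from sklearn.metrics import classification_report

y_test = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['Amber', 'Blue', 'Cedar']  # display names for classes 0, 1, 2

report = classification_report(y_test, y_pred,
                               target_names=target_names,
                               output_dict=True)

# Per-class metrics are stored under the display name
cedar = report['Cedar']
print(cedar['precision'], cedar['recall'], cedar['f1-score'], cedar['support'])

# Averages are stored under keys such as 'macro avg' and 'weighted avg'
print(report['macro avg']['f1-score'])
```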


In [None]:
classification_report(y_test, y_pred, 
                      labels=None, # include list of labels in report
                      target_names=None, # display names for labels
                      sample_weight=None, # weights for samples
                      digits=2, # number of digits for rounding (ignored if output_dict=True)
                      output_dict=False) # if True, return the report as a dict

In [97]:
from sklearn.metrics import classification_report

y_test = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['Amber', 'Blue', 'Cedar']

cr = classification_report(y_test, y_pred, target_names=target_names)

print(cr)
# 0 class - Amber
# 1 class - Blue
# 2 class - Cedar

              precision    recall  f1-score   support

       Amber       0.50      1.00      0.67         1
        Blue       0.00      0.00      0.00         1
       Cedar       1.00      0.67      0.80         3

   micro avg       0.60      0.60      0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5



In [None]:
#False Positive Rate: When it's actually no, how often does it predict yes?
#FP/actual no = 10/60 = 0.17

#True Negative Rate: When it's actually no, how often does it predict no?
#TN/actual no = 50/60 = 0.83
#equivalent to 1 minus False Positive Rate
#also known as "Specificity"

#Prevalence: How often does the yes condition actually occur in our sample?
#actual yes/total = 105/165 = 0.64
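The rates described above can be derived from the four cells of a binary confusion matrix. This sketch reuses the small binary example from the confusion-matrix section rather than the counts quoted in the comments; the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

y_test = [1, 1, 0, 0, 1, 1]  # ground truth
y_pred = [1, 0, 1, 0, 1, 1]  # predictions

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

actual_no = tn + fp            # all samples whose true label is 0
total = tn + fp + fn + tp

fpr = fp / actual_no           # False Positive Rate
tnr = tn / actual_no           # True Negative Rate (Specificity) = 1 - FPR
prevalence = (tp + fn) / total # share of samples whose true label is 1

print(fpr, tnr, prevalence)
```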

https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
http://cs229.stanford.edu/section/evaluation_metrics.pdf

#### R^2

The coefficient of determination, a score for regression models.

- Returns float (or ndarray of floats if multioutput='raw_values')
- Best score: 1.0
- Can be negative, because a model can be arbitrarily worse than a constant prediction
- A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
- Not symmetric: swapping y_true and y_pred generally changes the score.
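These properties follow from the definition R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below recomputes the score with NumPy, scores a constant mean predictor, and swaps the arguments; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

y_test      = np.array([2, 1, 1, 3, 2])  # ground truth
predictions = np.array([2, 1, 2, 2, 1])  # predictions

ss_res = np.sum((y_test - predictions) ** 2)    # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

# A constant model that always predicts the mean of y scores 0.0
baseline = np.full(len(y_test), y_test.mean())
baseline_r2 = r2_score(y_test, baseline)

# Swapping the arguments changes SS_tot, so the score changes too
swapped_r2 = r2_score(predictions, y_test)

print(manual_r2, baseline_r2, swapped_r2)
```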

<b>Multioutput</b>. Defines how multiple output scores are aggregated; an array-like value defines the weights used to average the scores. (default='uniform_average')

- multioutput='raw_values': Returns a full set of scores in case of multioutput input.
- multioutput='uniform_average': Scores of all outputs are averaged with uniform weight.
- multioutput='variance_weighted': Scores of all outputs are averaged, weighted by the variances of each individual output.

In [143]:
y_test      = [2, 1, 1, 3, 2] # Ground truth 
predictions = [2, 1, 2, 2, 1] # Predictions


from sklearn.metrics import r2_score
r2score = r2_score(y_test, predictions)  
print('{}'.format(r2score))

y_test      = [[0.4, 10], [0.5, 11], [0.6, 12]]
predictions = [[0.4, 10], [0.5, 11], [0.3, 12]]

from sklearn.metrics import r2_score
r2score = r2_score(y_test, predictions, multioutput='variance_weighted')  
print('{}'.format(r2score))

y_test      = [[0.4, 10], [0.5, 11], [0.6, 12]]
predictions = [[0.4, 10], [0.5, 11], [0.3, 12]]

from sklearn.metrics import r2_score
r2score = r2_score(y_test, predictions, multioutput='raw_values')  
print('{}'.format(r2score))

-0.0714285714285714
0.9554455445544554
[-3.5  1. ]


#### Mean Squared Error (MSE)

- Mean squared error regression loss
- Returns: loss, a non-negative float (or an array of floats, one per target, if multioutput='raw_values'). The best value is 0.0.
- multioutput: how multiple output values are aggregated; an array-like value defines the weights used to average the errors (default='uniform_average'):

   a) multioutput='uniform_average': errors of all outputs are averaged with uniform weight.

   b) multioutput='raw_values': returns the full set of errors, one per output.
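As a sanity check on what the two multioutput modes return, the sketch below recomputes the MSE with NumPy: the per-output errors are the column-wise means of the squared differences, and the uniform average is the mean of those. Variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_test      = np.array([[0.5, 1], [-11, 13], [2, -4]])  # ground truth, two outputs
predictions = np.array([[0.1, 2], [-11, 22], [3, -3]])  # predictions

# One MSE per output column, as with multioutput='raw_values'
per_output = np.mean((y_test - predictions) ** 2, axis=0)

# Unweighted mean over outputs, as with multioutput='uniform_average'
uniform = per_output.mean()

print(per_output, uniform)
```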

In [167]:
y_test      = [3.0, -0.5, 2, 7, 3] # Ground truth
predictions = [2.5, -1.0, 2, 8, 3] # Predictions

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions)
print('{}'.format(mse))

y_test      = [[0.5, 1], [-11, 13], [2, -4]] # Ground truth
predictions = [[0.1, 2], [-11, 22], [3, -3]] # Predictions

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions) # uniform average
print('{}'.format(mse))

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions, multioutput=[0.5, 0.5])
print('{}'.format(mse))

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions, multioutput='raw_values')
print('{}'.format(mse))


0.3
14.026666666666667
14.026666666666667
[ 0.38666667 27.66666667]


#### Mean Absolute Error (MAE)

Mean absolute error regression loss.
- non-negative float
- the best value is 0.0

- multioutput (default='uniform_average'): how multiple output values are aggregated; an array-like value defines the weights used to average the errors.
- a) multioutput='raw_values': returns the full set of errors, one per output.
- b) multioutput='uniform_average': errors of all outputs are averaged with uniform weight.
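Passing an array of weights to multioutput gives a weighted average of the per-output errors. The weights below are hypothetical values picked only to show a non-uniform case; the manual value is computed with np.average for comparison.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_test      = [[0.5, 1, 1], [-11, 13, 43], [2, 43, -4]]  # ground truth, three outputs
predictions = [[0.1, 2, 1], [-11, 22, 12], [3, 37, -3]]  # predictions

weights = [0.5, 0.3, 0.2]  # hypothetical per-output weights

# One MAE per output column
per_output = mean_absolute_error(y_test, predictions, multioutput='raw_values')

# Weighted average of the per-output errors, computed by sklearn and by hand
weighted = mean_absolute_error(y_test, predictions, multioutput=weights)
manual = np.average(per_output, weights=weights)

print(per_output, weighted, manual)
```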


In [None]:
# Returns: loss, a float or ndarray of floats.
# If multioutput is 'raw_values', the mean absolute error is returned for each
# output separately. If multioutput is 'uniform_average' or an ndarray of weights,
# the weighted average of all output errors is returned.


In [163]:
y_test      = [3.5, -0.5, 1, 3, 3] # Ground truth
predictions = [2.5, -1.0, 1, 4, 5] # Predictions

from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, predictions)
print('{}'.format(mae))

y_test      = [[0.5, 1, 1], [-11, 13, 43], [2, 43, -4]] # Ground truth
predictions = [[0.1, 2, 1], [-11, 22, 12], [3, 37, -3]] # Predictions

# Return MAE score
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, predictions)
print('{}'.format(mae))

# Return the mean absolute error for each output separately.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, predictions, multioutput='raw_values')
print('{}'.format(mae))

# Weights
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, predictions, multioutput=[1, 1, 1])
print('{}'.format(mae))

0.9
5.488888888888888
[ 0.46666667  5.33333333 10.66666667]
5.488888888888888



## Clustering

#### Adjusted Rand Score

Rand index adjusted for chance.

The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.

The raw RI score is then “adjusted for chance” into the ARI score using the following scheme:

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)

The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).

ARI is a symmetric measure:

adjusted_rand_score(a, b) == adjusted_rand_score(b, a)

Parameters:
- labels_true: int array, shape = [n_samples]. Ground truth class labels to be used as a reference.
- labels_pred: array, shape = [n_samples]. Cluster labels to evaluate.

Returns:
- ari: float. Similarity score between -0.5 and 1.0. Random labelings have an ARI close to 0.0, and 1.0 stands for a perfect match.
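The adjustment can be sketched with the standard pair-counting form of the ARI, computed from the contingency table of the two labelings. The helper ari_from_pairs below is a hypothetical name, not part of scikit-learn, and it assumes the denominator is non-zero (it is zero, for example, when both labelings put every sample in a single cluster).

```python
from math import comb

from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix


def ari_from_pairs(labels_true, labels_pred):
    """ARI from pair counts; assumes a non-degenerate denominator."""
    table = contingency_matrix(labels_true, labels_pred)
    n = int(table.sum())
    # Pairs that share both a class and a cluster
    pairs_same = sum(comb(int(v), 2) for v in table.ravel())
    # Pairs that share a class, and pairs that share a cluster
    pairs_true = sum(comb(int(v), 2) for v in table.sum(axis=1))
    pairs_pred = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = pairs_true * pairs_pred / comb(n, 2)  # expected value under chance
    maximum = 0.5 * (pairs_true + pairs_pred)        # largest attainable value
    return (pairs_same - expected) / (maximum - expected)


a = [0, 0, 1, 2]
b = [0, 0, 1, 1]

ari_manual = ari_from_pairs(a, b)
ari_sklearn = adjusted_rand_score(a, b)
ari_swapped = adjusted_rand_score(b, a)  # symmetric: same value as ari_sklearn

print(ari_manual, ari_sklearn, ari_swapped)
```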

In [None]:
from sklearn.metrics import adjusted_rand_score

In [None]:
sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)

In [None]:
# Perfectly matching labelings have a score of 1, even when the label values are permuted:
>>> from sklearn.metrics.cluster import adjusted_rand_score
>>> adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0

# Labelings that assign all class members to the same clusters are complete but not always pure, hence penalized:
>>> adjusted_rand_score([0, 0, 1, 2], [0, 0, 1, 1])
0.57...

# ARI is symmetric, so labelings that have pure clusters with members coming from the same classes but unnecessary splits are penalized:
>>> adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 2])
0.57...

# If class members are completely split across different clusters, the assignment is totally incomplete, hence the ARI is very low:
>>> adjusted_rand_score([0, 0, 0, 0], [0, 1, 2, 3])
0.0

#### Homogeneity Score

Homogeneity metric of a cluster labeling given a ground truth.

A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is not symmetric: switching labels_true with labels_pred returns the completeness_score, which is different in general.


Parameters:
- labels_true: int array, shape = [n_samples]. Ground truth class labels to be used as a reference.
- labels_pred: array, shape = [n_samples]. Cluster labels to evaluate.

Returns:
- homogeneity: float. Score between 0.0 and 1.0; 1.0 stands for a perfectly homogeneous labeling.
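The asymmetry can be seen directly: homogeneity with the arguments in one order equals completeness with the arguments swapped. A small sketch with illustrative variable names:

```python
from sklearn.metrics import completeness_score, homogeneity_score

labels_true = [0, 0, 1, 2]
labels_pred = [0, 0, 1, 1]  # the second cluster mixes classes 1 and 2

h = homogeneity_score(labels_true, labels_pred)          # below 1: one cluster is impure
h_swapped = homogeneity_score(labels_pred, labels_true)  # 1.0: every "cluster" is pure
c_swapped = completeness_score(labels_pred, labels_true) # same value as h

print(h, h_swapped, c_swapped)
```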

In [None]:
# Perfect labelings are homogeneous:
>>> from sklearn.metrics.cluster import homogeneity_score
>>> homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0

# Non-perfect labelings that further split classes into more clusters can be perfectly homogeneous:
>>> print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 0, 1, 2]))
1.000000
>>> print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3]))
1.000000

# Clusters that include samples from different classes do not make for a homogeneous labeling:
>>> print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 1, 0, 1]))
0.0...
>>> print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))
0.0...

In [None]:
sklearn.metrics.homogeneity_score(labels_true, labels_pred)

#### V Measure Score

V-measure cluster labeling given a ground truth.

This score is identical to normalized_mutual_info_score with the 'arithmetic' option for averaging.

The V-measure is the harmonic mean between homogeneity and completeness:

v = 2 * (homogeneity * completeness) / (homogeneity + completeness)

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is furthermore symmetric: switching labels_true with labels_pred returns the same score value. This can be useful for measuring the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.


Parameters:
- labels_true: int array, shape = [n_samples]. Ground truth class labels to be used as a reference.
- labels_pred: array, shape = [n_samples]. Cluster labels to evaluate.

Returns:
- v_measure: float. Score between 0.0 and 1.0; 1.0 stands for a perfectly homogeneous and complete labeling.
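The harmonic-mean relationship can be checked with homogeneity_completeness_v_measure from sklearn.metrics, which returns all three scores at once. The sketch uses illustrative variable names and a labeling that is complete but not homogeneous.

```python
from sklearn.metrics import homogeneity_completeness_v_measure, v_measure_score

labels_true = [0, 0, 1, 2]
labels_pred = [0, 0, 1, 1]

h, c, v = homogeneity_completeness_v_measure(labels_true, labels_pred)

# Harmonic mean of homogeneity and completeness
manual_v = 2 * (h * c) / (h + c)

# Symmetry: swapping the arguments gives the same V-measure
v_swapped = v_measure_score(labels_pred, labels_true)

print(h, c, v, manual_v, v_swapped)
```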


In [None]:
from sklearn.metrics import v_measure_score

In [None]:
# Perfect labelings are both homogeneous and complete, hence have score 1.0:
>>> from sklearn.metrics.cluster import v_measure_score
>>> v_measure_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> v_measure_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0

# Labelings that assign all class members to the same clusters are complete but not homogeneous, hence penalized:
>>> print("%.6f" % v_measure_score([0, 0, 1, 2], [0, 0, 1, 1]))
0.8...
>>> print("%.6f" % v_measure_score([0, 1, 2, 3], [0, 0, 1, 1]))
0.66...

# Labelings that have pure clusters with members coming from the same classes are homogeneous, but unnecessary splits harm completeness and thus penalize the V-measure as well:
>>> print("%.6f" % v_measure_score([0, 0, 1, 1], [0, 0, 1, 2]))
0.8...
>>> print("%.6f" % v_measure_score([0, 0, 1, 1], [0, 1, 2, 3]))
0.66...

# If class members are completely split across different clusters, the assignment is totally incomplete, hence the V-measure is null:
>>> print("%.6f" % v_measure_score([0, 0, 0, 0], [0, 1, 2, 3]))
0.0...

# Clusters that include samples from totally different classes totally destroy the homogeneity of the labeling, hence:
>>> print("%.6f" % v_measure_score([0, 0, 1, 1], [0, 0, 0, 0]))
0.0...

#### Completeness

Completeness metric of a cluster labeling given a ground truth.

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is not symmetric: switching labels_true with labels_pred returns the homogeneity_score, which is different in general.


Parameters:
- labels_true: int array, shape = [n_samples]. Ground truth class labels to be used as a reference.
- labels_pred: array, shape = [n_samples]. Cluster labels to evaluate.

Returns:
- completeness: float. Score between 0.0 and 1.0; 1.0 stands for a perfectly complete labeling.

In [None]:
sklearn.metrics.completeness_score(labels_true, labels_pred)

In [None]:
# Perfect labelings are complete:
>>> from sklearn.metrics.cluster import completeness_score
>>> completeness_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0

# Non-perfect labelings that assign all class members to the same clusters are still complete:
>>> print(completeness_score([0, 0, 1, 1], [0, 0, 0, 0]))
1.0
>>> print(completeness_score([0, 1, 2, 3], [0, 0, 1, 1]))
0.999...

# If class members are split across different clusters, the assignment cannot be complete:
>>> print(completeness_score([0, 0, 1, 1], [0, 1, 0, 1]))
0.0
>>> print(completeness_score([0, 0, 0, 0], [0, 1, 2, 3]))
0.0