## Metrics

#### I. Classification:
1. Confusion Matrix
2. Accuracy score
3. Misclassification score
4. Precision score (Positive Predictive Value, PPV)
5. Recall score (Sensitivity, True Positive Rate)
6. Precision-Recall curve
7. F1 score
8. Classification Report
9. AUC ROC

#### II. Regression:
1. R2
2. MSE
3. Mean Absolute Error (MAE)
4. Explained Variance

#### III. Clustering:
1. Adjusted Rand Index
2. Homogeneity
3. V-measure
4. Completeness


To be updated.

## I. Classification

### 1. Confusion Matrix

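- Returns: array, shape = [n_classes, n_classes]
- Rows are the true classes and columns are the predicted classes: entry (i, j) counts the samples of class i that were predicted as class j (in the first example the classes are 0, 1 and 2).
- Binary case: the flattened matrix reads TN, FP, FN, TP.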
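
To make the layout concrete, the matrix can be tallied by hand. This is a minimal pure-Python sketch, not scikit-learn's implementation; the helper name confusion_counts is made up for this example, and it assumes rows are true classes and columns are predicted classes.

```python
def confusion_counts(y_true, y_pred, labels):
    """Tally a confusion matrix: rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

y_test = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
cm = confusion_counts(y_test, y_pred, labels=[0, 1, 2])
for row in cm:
    print(row)
```

The printed rows should agree with the confusion_matrix output for the same data in the first example of this section.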

#### Multiclass with no labels

In [3]:
import pandas as pd

y_test = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

In [7]:
from sklearn.metrics import confusion_matrix
cf = confusion_matrix(y_test, y_pred)

display(pd.DataFrame(cf))

| true \ predicted | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 2 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 2 |


#### Multiclass with labels

In [8]:
y_test = ["C", "A", "C", "C", "A", "B"]
y_pred = ["A", "A", "C", "C", "A", "C"]
labels = ["A", "B", "C"]

In [9]:
from sklearn.metrics import confusion_matrix
cf = confusion_matrix(y_test, y_pred, labels=labels)

df = pd.DataFrame(cf, index=labels, columns=labels)
display(df)

| true \ predicted | A | B | C |
|---|---|---|---|
| A | 2 | 0 | 0 |
| B | 0 | 0 | 1 |
| C | 1 | 0 | 2 |


#### Extracted binary confusion matrix

In [10]:
y_test = [1, 1, 0, 0,   1, 1]
y_pred = [1, 0, 1, 0,   1, 1]

In [11]:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print('True Positive: {}'.format(tp))
print('True Negative: {}'.format(tn))
print('False Positive: {}'.format(fp))
print('False Negative: {}'.format(fn))

True Positive: 3
True Negative: 1
False Positive: 1
False Negative: 1


### 2. Accuracy score
 Overall, how often is the classifier correct?


- Accuracy: the ratio of correctly predicted labels to the total number of samples.
- Accuracy = (TP + TN) / (TP + FP + FN + TN)
- A good measure when false positives and false negatives are similar in number and cost; on imbalanced data it can be misleading.
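
To see why accuracy alone can mislead, here is a small sketch that applies the accuracy formula to hand-picked counts. The imbalanced counts are invented for illustration and the function name accuracy is only a local helper.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)

# Counts from the binary confusion-matrix example: TP=3, TN=1, FP=1, FN=1
balanced = accuracy(tp=3, tn=1, fp=1, fn=1)

# Hypothetical imbalanced data: 95 negatives, 5 positives, and a model
# that predicts "negative" for every sample
always_negative = accuracy(tp=0, tn=95, fp=0, fn=5)

print(round(balanced, 4))   # 0.6667
print(always_negative)      # 0.95, although no positive sample is found
```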


#### Accuracy ratio

In [12]:
y_test      = ['B', 'A', 'A', 'C', 'B']
predictions = ['A', 'A', 'C', 'B', 'B']

In [13]:
from sklearn.metrics import accuracy_score
acc  = accuracy_score(y_test, predictions) # 2/5
print('{}'.format(acc))

0.4


#### Number of correctly classified samples

In [14]:
y_test      = [2, 1, 1, 3, 2]
predictions = [1, 1, 3, 2, 2]

In [15]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, predictions, normalize=False)
print('{}'.format(acc))

2


#### Weighted

In [16]:
y_test      = [2, 1, 1, 3, 2]
predictions = [1, 1, 3, 2, 2]

In [17]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, predictions, sample_weight=[0,1,0,0,1])
print('{}'.format(acc))

1.0


### 3. Misclassification score (1 minus Accuracy)
Overall, how often is it wrong?

- Misclassification rate = (FP + FN) / total
- Equivalent to 1 minus accuracy
- Also known as the "error rate"

In [18]:
from sklearn.metrics import accuracy_score
y_test      = [0, 1, 1, 0, 0]
predictions = [1, 1, 0, 0, 0]

acc = accuracy_score(y_test, predictions)
mc  = 1 - acc

print('Accuracy:          {}'.format(acc))
print('Misclassification: {}'.format(mc))

Accuracy:          0.6
Misclassification: 0.4


### 4. Precision score (Positive Predictive Value, PPV)

When it predicts yes, how often is it correct?

- Returns: float (if average is not None) or array of floats.
- Ratio of correctly predicted positive labels to all samples predicted positive.
- Of all emails classified as spam, how many actually were spam?
- Precision = TP / (TP + FP)
- Many false positives mean low precision.
- Ability of the classifier not to label as positive a sample that is negative.
- The best value is 1 and the worst value is 0.

Averaging:
- The average parameter is required for multiclass/multilabel targets.
- None: the scores for each class are returned.
- 'binary': only report results for the class specified by pos_label (binary targets only).
- 'micro': calculate globally by counting the total true positives, false negatives and false positives.
- 'macro': calculate the metric for each label and take the unweighted mean.
- 'weighted': calculate the metric for each label and take the mean weighted by support (the number of true instances of each label).
- 'samples': calculate the metric for each instance and take the mean (only meaningful for multilabel classification).
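
The averaging modes can be reproduced by hand from per-class counts. The following is a minimal pure-Python sketch (not scikit-learn's code; per_class_precision is a made-up helper name) using the same y_test and y_pred as the cells below. It assumes precision is taken as 0.0 for a label that is never predicted.

```python
def per_class_precision(y_true, y_pred, labels):
    """Return per-label lists: precision, true positives, predicted count, support."""
    precisions, tps, predicted, support = [], [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
        n_pred = sum(1 for p in y_pred if p == label)
        tps.append(tp)
        predicted.append(n_pred)
        support.append(sum(1 for t in y_true if t == label))
        # Assumption: precision is 0.0 when a label is never predicted
        precisions.append(tp / n_pred if n_pred else 0.0)
    return precisions, tps, predicted, support

y_test = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
prec, tps, predicted, support = per_class_precision(y_test, y_pred, labels=[0, 1, 2])

macro = sum(prec) / len(prec)                     # unweighted mean over labels
micro = sum(tps) / sum(predicted)                 # pooled TP / (TP + FP)
weighted = sum(p * s for p, s in zip(prec, support)) / sum(support)

print(prec, macro, micro, weighted)
```

With equal support for every class, as here, the macro and weighted averages coincide.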


In [None]:
precision_score(y_true, y_pred, 
                labels=None, # set of labels to include if not binary
                pos_label=1, # class to report if average and data are binary.
                average='binary', # below
                sample_weight=None)

In [21]:
from sklearn.metrics import precision_score

y_test = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

#### Precision

In [22]:
ps = precision_score(y_test, y_pred, average=None)
print('{}'.format(ps))

[0.66666667 0.         0.        ]


#### Precision with macro averaging

In [23]:
ps = precision_score(y_test, y_pred, average='macro') 
print('{}'.format(ps))

0.2222222222222222


#### Precision with micro averaging

In [24]:
ps = precision_score(y_test, y_pred, average='micro')  
print('{}'.format(ps))

0.3333333333333333


#### Precision with weighted averaging

In [25]:
ps = precision_score(y_test, y_pred, average='weighted')
print('{}'.format(ps))

0.2222222222222222


#### Average Precision score
Compute average precision (AP) from prediction scores


sklearn.metrics.average_precision_score(y_true, y_score, average='macro', pos_label=1, sample_weight=None)

- Returns: float
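- AP summarizes a precision-recall curve as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight: $\text{AP} = \sum_n (R_n - R_{n-1}) P_n$, where $P_n$ and $R_n$ are the precision and recall at the n-th threshold.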

Averaging:
- Used for multiclass/multilabel targets; ignored when y_true is binary.
- None: the scores for each class are returned.
- 'micro': calculate metrics globally by considering each element of the label indicator matrix as a label.
- 'macro': calculate the metric for each label and take the unweighted mean.
- 'weighted': calculate the metric for each label and take the mean weighted by support (the number of true instances of each label).
- 'samples': calculate the metric for each instance and take the mean.



In [26]:
y_true   = [0.0, 0.0, 1.00, 1.0]
y_scores = [0.1, 0.4, 0.35, 0.8]

In [27]:
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_scores)

0.8333333333333333
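
The value can be checked by hand. This sketch (plain Python, not the library's implementation; average_precision is a local helper name) ranks the samples by decreasing score, treats each score as a threshold, and sums the precision weighted by the increase in recall. It assumes binary 0/1 labels, at least one positive, and no tied scores.

```python
def average_precision(y_true, y_scores):
    """AP = sum over thresholds of (R_n - R_{n-1}) * P_n; assumes no tied scores."""
    n_pos = sum(y_true)
    ranked = sorted(zip(y_scores, y_true), reverse=True)  # highest score first
    tp = fp = 0
    ap = prev_recall = 0.0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

y_true   = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, y_scores)
print(ap)
```

Only the two steps where recall increases contribute: 0.5 * 1 at score 0.8 and 0.5 * 2/3 at score 0.35.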

### 5. Recall score (Sensitivity, True Positive Rate)
When it's actually yes, how often does it predict yes?

- Ability of the classifier to find all the positive samples.
- Ratio of correctly predicted positive labels to all samples that are actually positive.
- Recall = TP / (TP + FN)
- The best value is 1 and the worst value is 0.
- Returns: float (if average is not None) or array of floats.

Averaging:
- The average parameter is required for multiclass/multilabel targets.
- None: the scores for each class are returned.
- 'binary': only report results for the class specified by pos_label (binary targets only).
- 'micro': calculate globally by counting the total true positives, false negatives and false positives.
- 'macro': calculate the metric for each label and take the unweighted mean.
- 'weighted': calculate the metric for each label and take the mean weighted by support (the number of true instances of each label).
- 'samples': calculate the metric for each instance and take the mean (only meaningful for multilabel classification).
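
Per-class recall and precision can also be read off a confusion matrix: recall divides the diagonal by the row sums (true counts), precision by the column sums (predicted counts). A short sketch, assuming rows are true classes and columns are predicted classes; the matrix is written out by hand for the y_test and y_pred used in the cells below.

```python
# Confusion matrix for y_test = [0, 1, 2, 0, 1, 2], y_pred = [0, 2, 1, 0, 0, 1]
cm = [[2, 0, 0],
      [1, 0, 1],
      [0, 2, 0]]

n = len(cm)
row_sums = [sum(cm[i]) for i in range(n)]                        # true count per class
col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted count per class

recall = [cm[i][i] / row_sums[i] if row_sums[i] else 0.0 for i in range(n)]
precision = [cm[j][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]

print(recall)     # [1.0, 0.0, 0.0]
print(precision)
```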

In [None]:
recall_score(y_true, y_pred, 
             labels=None, # set of labels to include if not binary
             pos_label=1, # class to report if average and data are binary.
             average='binary', # below
             sample_weight=None)

In [28]:
from sklearn.metrics import recall_score
y_test = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

#### Recall score

In [29]:
rs = recall_score(y_test, y_pred, average=None)
print('{}'.format(rs))

[1. 0. 0.]


#### Macro averaged recall score 

In [30]:
rs = recall_score(y_test, y_pred, average='macro')  
print('{}'.format(rs))

0.3333333333333333


#### Micro averaged recall score 

In [31]:
rs = recall_score(y_test, y_pred, average='micro')  
print('{}'.format(rs))

0.3333333333333333


#### Weighted averaged recall score 

In [32]:
rs = recall_score(y_test, y_pred, average='weighted')  
print('{}'.format(rs))

0.3333333333333333


### 6. Precision-Recall curve
Compute precision-recall pairs for different probability thresholds

In [None]:
precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)


- The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.
- Parameters: targets of binary classification in range {-1, 1} or {0, 1} and estimated probabilities or decision function.

Returns:
- Precision: array, element i is the precision of predictions with score >= thresholds[i]. Last element is 1.
- Recall : array, element i is the recall of predictions with score >= thresholds[i] and the last element is 0.
- Thresholds : array, increasing thresholds on the decision function used to compute precision and recall.
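
What the curve records at each threshold can be sketched in a few lines. This is an illustration only, not the library routine: precision_recall_at is a made-up helper, the labels are hypothetical (chosen so that both classes occur), and precision is set to 1.0 by convention when nothing is predicted positive.

```python
def precision_recall_at(y_true, y_scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, y_scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention for "no positives predicted"
    recall = tp / (tp + fn)                           # assumes at least one true positive label
    return precision, recall

y_true   = [0, 0, 1, 1]            # hypothetical labels containing both classes
y_scores = [0.1, 0.4, 0.35, 0.8]

pairs = {t: precision_recall_at(y_true, y_scores, t) for t in sorted(set(y_scores))}
for threshold, (p, r) in pairs.items():
    print(threshold, round(p, 3), r)
```

Raising the threshold trades recall for precision: at 0.1 every sample is predicted positive (recall 1.0, precision 0.5), at 0.8 only the top-scored sample is (precision 1.0, recall 0.5).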

In [35]:
from sklearn.metrics import precision_recall_curve
y_true   = [1, 1, 1, 1]
y_scores = [0.1, 0.4, 0.8, 0.5]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

#(precision, recall, thresholds)
print('Precision:  {}'.format(precision))
print('Recall:     {}'.format(recall))
print('Thresholds: {}'.format(thresholds))

Precision:  [1. 1. 1. 1. 1.]
Recall:     [1.   0.75 0.5  0.25 0.  ]
Thresholds: [0.1 0.4 0.5 0.8]


### 7. F1 score

In [None]:
f1_score(y_true, y_pred, 
         labels=None, # set of labels to include if not binary
         pos_label=1, # class to report if average and data are binary.
         average='binary', # below
         sample_weight=None)

- Balance between precision and recall.
- The harmonic mean of precision and recall.
- F1 = 2 * (precision * recall) / (precision + recall)
- Returns: float or array of float, shape = [n_unique_labels]
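
The F1 formula is a harmonic mean, so it is pulled toward the smaller of the two values. A tiny sketch with made-up precision and recall values (the helper f1 is local to this example):

```python
def f1(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 1.0, 0.1          # made-up values for illustration
arithmetic_mean = (precision + recall) / 2
harmonic_f1 = f1(precision, recall)

print(arithmetic_mean)        # 0.55
print(round(harmonic_f1, 4))  # 0.1818
```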

Averaging:
- The average parameter is required for multiclass/multilabel targets.
- None: the scores for each class are returned.
- 'binary': only report results for the class specified by pos_label (binary targets only).
- 'micro': calculate globally by counting the total true positives, false negatives and false positives.
- 'macro': calculate the metric for each label and take the unweighted mean.
- 'weighted': calculate the metric for each label and take the mean weighted by support (the number of true instances of each label).
- 'samples': calculate the metric for each instance and take the mean (only meaningful for multilabel classification).

In [36]:
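# The same five samples with string labels ...
y_test      = ['B', 'A', 'A', 'C', 'B']
predictions = ['A', 'A', 'C', 'B', 'B']

# ... and with integer labels (A=1, B=2, C=3). This second assignment
# overrides the first one and is what the cells below use.
y_test      = [2, 1, 1, 3, 2]
predictions = [1, 1, 3, 2, 2]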

#### F1 score

In [37]:
from sklearn.metrics import f1_score
# Store the result under its own name so f1_score keeps referring to the function
f1 = f1_score(y_test, predictions, average=None)
print('{}'.format(f1))

[0.5 0.5 0. ]


#### Macro averaged F1 score

In [38]:
from sklearn.metrics import f1_score
f1_macro = f1_score(y_test, predictions, average='macro')
print('{}'.format(f1_macro))

0.3333333333333333


#### Micro averaged F1 score

In [39]:
from sklearn.metrics import f1_score
f1_micro = f1_score(y_test, predictions, average='micro')
print('{}'.format(f1_micro))

0.4000000000000001


#### Weighted averaged F1 score

In [40]:
from sklearn.metrics import f1_score
f1_weighted = f1_score(y_test, predictions, average='weighted')
print('{}'.format(f1_weighted))

0.4


### 8. Classification Report

In [None]:
classification_report(y_test, y_pred, 
                      labels=None, # include list of labels in report
                      target_names=None, # display names for labels
                      sample_weight=None, # weights for samples
                      digits=2, # round output (ignored if dict)
                      output_dict=False) #If True: return dict(output)

Text summary of the precision, recall, F1 score for each class.

- Build & show main classification metrics report
- Returns string/dict

The reported averages include
- micro average (averaging the total true positives, false negatives and false positives),
- macro average (averaging the unweighted mean per label),
- weighted average (averaging the support-weighted mean per label),
- sample average (only for multilabel classification).

In binary classification:
- recall of the positive class is also known as "sensitivity";
- recall of the negative class is "specificity".


In [42]:
from sklearn.metrics import classification_report

y_test = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['Amber', 'Blue', 'Cedar']

cr = classification_report(y_test, y_pred, target_names=target_names)

print(cr)
# 0 class - Amber
# 1 class - Blue
# 2 class - Cedar

              precision    recall  f1-score   support

       Amber       0.50      1.00      0.67         1
        Blue       0.00      0.00      0.00         1
       Cedar       1.00      0.67      0.80         3

   micro avg       0.60      0.60      0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
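
The macro and weighted rows of the report can be re-derived from the per-class rows. A small sketch in plain Python; the per-class values are copied from the report above (Cedar's recall is entered as 2/3, which the report rounds to 0.67), and macro and weighted are local helper names.

```python
# Per-class values from the report above, in the order Amber, Blue, Cedar
precision = [0.5, 0.0, 1.0]
recall    = [1.0, 0.0, 2 / 3]
support   = [1, 1, 3]

def macro(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)

def weighted(values, weights):
    """Mean over classes, weighted by support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(round(macro(precision), 2), round(macro(recall), 2))
print(round(weighted(precision, support), 2), round(weighted(recall, support), 2))
```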



### 9. AUC ROC

### ROC curve
The function roc_curve computes the receiver operating characteristic curve, or ROC curve. The notes below the signature follow the Wikipedia description.

In [None]:
roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)

- The scikit-learn implementation is restricted to the binary classification task.
- ROC: receiver operating characteristic.
- A graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied.
- Created by plotting the fraction of true positives out of the positives (TPR) against the fraction of false positives out of the negatives (FPR) at various threshold settings.
- TPR is also known as sensitivity, and FPR is one minus the specificity (the true negative rate).
- Input: the true binary labels and the target scores, which can be probability estimates of the positive class, confidence values, or binary decisions.

Returns:
- fpr : array, shape = [>2]: Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i].
- tpr : array, shape = [>2]: Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].
- thresholds : array, shape = [n_thresholds]: Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1 (np.inf in newer scikit-learn releases).

Since the thresholds are sorted from low to high values, they are reversed upon returning them to ensure they correspond to both fpr and tpr, which are sorted in reversed order during their calculation.
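
The points of the curve can be recomputed by hand, which makes the meaning of fpr, tpr and the thresholds concrete. A minimal sketch, not the library code: roc_points is a made-up helper, every distinct score is used as a threshold, and no intermediate points are dropped (unlike roc_curve with drop_intermediate=True).

```python
def roc_points(y_true, y_scores, pos_label):
    """Return (fpr, tpr) lists, one point per distinct score used as a threshold."""
    n_pos = sum(1 for t in y_true if t == pos_label)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]            # starting point: threshold above every score
    for threshold in sorted(set(y_scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_scores) if s >= threshold and t == pos_label)
        fp = sum(1 for t, s in zip(y_true, y_scores) if s >= threshold and t != pos_label)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

y_test      = [1, 1, 2, 2]
predictions = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_test, predictions, pos_label=2)
print(fpr)  # [0.0, 0.0, 0.5, 0.5, 1.0]
print(tpr)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```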



In [43]:
from sklearn import metrics

y_test      = [1, 1, 2, 2]
predictions = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions, pos_label=2)

print('False Positive rate: {}'.format(fpr))
print('True Positive rate:  {}'.format(tpr))
print('Thresholds:          {}'.format(thresholds))

False Positive rate: [0.  0.  0.5 0.5 1. ]
True Positive rate:  [0.  0.5 0.5 1.  1. ]
Thresholds:          [1.8  0.8  0.4  0.35 0.1 ]


### AUC

Compute Area Under the Curve (AUC) using the trapezoidal rule.

In [None]:
sklearn.metrics.auc(x, y, reorder='deprecated')

- This is a general function that computes the area under any curve given its points; to summarize a ROC curve, pass it the fpr and tpr arrays returned by roc_curve.
- Parameters:
  - x : array, shape = [n], x coordinates (monotonically increasing or monotonically decreasing).
  - y : array, shape = [n], y coordinates.
- Returns: float
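
The trapezoidal rule itself is short enough to write out. This sketch (trapezoid_area is a local helper name, not a library function) sums the area of the trapezoid between each pair of neighbouring points, using the fpr and tpr values from the ROC curve example.

```python
def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    area = 0.0
    for i in range(1, len(x)):
        width = x[i] - x[i - 1]
        mean_height = (y[i] + y[i - 1]) / 2
        area += width * mean_height
    return area

fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
print(trapezoid_area(fpr, tpr))  # 0.75
```

Vertical segments (equal x values) have zero width and add nothing, so only the two horizontal steps contribute: 0.5 * 0.5 + 0.5 * 1.0.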

In [44]:
from sklearn import metrics

y_test      = [1, 1, 2, 2]
predictions = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions, pos_label=2)
auc = metrics.auc(fpr, tpr)

print('AUC: {}'.format(auc))

AUC: 0.75


In [45]:
import matplotlib.pyplot as plt

plt.figure()

plt.plot(fpr, tpr, 
         color='orange',
         lw=4, 
         label='ROC curve (area = %0.2f)' % auc)

plt.plot([0, 1], [0, 1], 
         color='blue', 
         lw=4, 
         linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Curve')
plt.legend(loc="lower right")

plt.show()

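[Figure: ROC curve (orange, area = 0.75) plotted against the chance diagonal (dashed blue); x axis: false positive rate, y axis: true positive rate]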

### ROC AUC Score
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

In [None]:
roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)

- Returns: auc : float
- The scikit-learn implementation is restricted to the binary classification task or the multilabel classification task in label indicator format.
- Input: y_true, y_score: [n_samples] or [n_samples, n_classes]
- y_score holds the target scores, which can be probability estimates of the positive class, confidence values, or non-thresholded decision values.
- max_fpr : float > 0 and <= 1, optional: If not None, the standardized partial AUC over the range [0, max_fpr] is returned.
- In multi-label classification, the roc_auc_score function is extended by averaging over the labels as above.
- Compared to metrics such as the subset accuracy, the Hamming loss, or the F1 score, ROC doesn’t require optimizing a threshold for each label. 
- The roc_auc_score function can also be used in multi-class classification, if the predicted outputs have been binarized.
- In applications where a high false positive rate is not tolerable the parameter max_fpr of roc_auc_score can be used to summarize the ROC curve up to the given limit.

Averaging:
- None: the scores for each class are returned.
- 'micro': calculate metrics globally by considering each element of the label indicator matrix as a label.
- 'macro': calculate the metric for each label and take the unweighted mean.
- 'weighted': calculate the metric for each label and take the mean weighted by support (the number of true instances of each label).
- 'samples': calculate the metric for each instance and take the mean.



In [46]:
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

roc_auc_score(y_true, y_scores)

0.75
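
A well-known equivalent reading of this number, added here as background rather than taken from the notes above: ROC AUC is the fraction of (positive, negative) pairs in which the positive sample gets the higher score. A brute-force sketch (pairwise_auc is a made-up helper; ties are counted as half a win):

```python
def pairwise_auc(y_true, y_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, y_scores) if t == 1]
    neg = [s for t, s in zip(y_true, y_scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true   = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_scores))  # 0.75
```

Three of the four pairs are ordered correctly; only the positive scored 0.35 falls below the negative scored 0.4.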

## II. Regression

### 1. R2

The coefficient of determination. Scores regression function.

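- Returns float (or ndarray of floats if multioutput='raw_values')
- Best score: 1.0
- Can be negative, because a model can be arbitrarily worse than always predicting the mean.
- A constant model that always predicts the expected value of y, disregarding the input features, gets an R^2 score of 0.0.
- Not symmetric: r2_score(y_true, y_pred) generally differs from r2_score(y_pred, y_true).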
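
These properties follow directly from the definition R^2 = 1 - SS_res / SS_tot. A minimal sketch on made-up data (the helper r2 is local to this example and assumes y_true is not constant):

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [1, 2, 3, 4]             # made-up ground truth
pred = [1, 2, 3, 6]          # made-up predictions
constant = [2.5] * 4         # always predicts the mean of y

print(r2(y, pred))           # about 0.2
print(r2(y, constant))       # 0.0: the constant-mean model
print(r2(pred, y))           # about 0.71: swapping the arguments changes the score
print(r2(y, [4, 3, 2, 1]))   # -3.0: worse than the constant-mean model
```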

<b>Multioutput</b>. Defines how the scores of multiple outputs are aggregated; an array-like value gives the weights used to average the scores (default='uniform_average').

- multioutput='raw_values': returns the full set of scores, one per output.
- multioutput='uniform_average': scores of all outputs are averaged with uniform weight.
- multioutput='variance_weighted': scores of all outputs are averaged, weighted by the variance of each individual output.

In [47]:
y_test      = [11, 11, 10, 10, 11] # Ground truth 
predictions = [11, 11, 10, 10, 10] # Predictions

from sklearn.metrics import r2_score
r2score = r2_score(y_test, predictions)  
print('{}'.format(r2score))

0.16666666666666674


In [48]:
y_test      = [[0.4, 11], [0.5, 11], [0.6, 12]]
predictions = [[0.4, 10], [0.5, 11], [0.3, 12]]

from sklearn.metrics import r2_score
r2score = r2_score(y_test, predictions, multioutput='variance_weighted')  
print('{}'.format(r2score))

-0.587378640776699


In [49]:
y_test      = [[0.4, 10], [0.5, 11], [0.6, 12]]
predictions = [[0.4, 10], [0.5, 11], [0.3, 12]]

from sklearn.metrics import r2_score
r2score = r2_score(y_test, predictions, multioutput='raw_values')  
print('{}'.format(r2score))

[-3.5  1. ]


### 2. Mean Squared Error (MSE)


- Mean squared error regression loss
- Returns: loss, a non-negative float (or an ndarray of floats, one per target, if multioutput='raw_values').
- multioutput: how the errors of multiple outputs are aggregated; an array-like value gives the weights used to average the errors (default='uniform_average'):

   a) multioutput='uniform_average': errors of all outputs are averaged with uniform weight.

   b) multioutput='raw_values': returns the full set of errors, one per output.
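
The multioutput options amount to computing one MSE per output column and then averaging them in different ways. A pure-Python sketch (mse_per_output is a made-up helper, and the weights 0.3 and 0.7 are chosen only for illustration), using the two-output data from the cells below:

```python
def mse_per_output(y_true, y_pred):
    """Mean squared error of each output column."""
    n_samples, n_outputs = len(y_true), len(y_true[0])
    return [
        sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_samples)) / n_samples
        for j in range(n_outputs)
    ]

y_test      = [[0.5, 1], [-11, 13], [2, -4]]
predictions = [[0.1, 2], [-11, 22], [3, -3]]

raw = mse_per_output(y_test, predictions)   # corresponds to multioutput='raw_values'
uniform = sum(raw) / len(raw)               # corresponds to multioutput='uniform_average'
custom = 0.3 * raw[0] + 0.7 * raw[1]        # weighted average with weights [0.3, 0.7]

print(raw, uniform, custom)
```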

In [50]:
from sklearn.metrics import mean_squared_error

y_test      = [3.0, -0.5, 2, 7, 3] # Ground truth
predictions = [2.5, -1.0, 2, 8, 3] # Predictions

mse = mean_squared_error(y_test, predictions)

print('{}'.format(mse))

0.3


In [51]:
y_test      = [[0.5, 1], [-11, 13], [2, -4]] # Ground truth
predictions = [[0.1, 2], [-11, 22], [3, -3]] # Predictions

mse = mean_squared_error(y_test, predictions) # uniform average
print('{}'.format(mse))

14.026666666666667


In [52]:
mse = mean_squared_error(y_test, predictions, multioutput=[0.5, 0.5])
print('{}'.format(mse))

14.026666666666667


In [53]:
mse = mean_squared_error(y_test, predictions, multioutput='raw_values')
print('{}'.format(mse))

[ 0.38666667 27.66666667]


### 3. Mean Absolute Error (MAE)


$$\text{MAE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \left| y_i - \hat{y}_i \right|.$$

Mean absolute error regression loss.
- Returns a non-negative float, or an ndarray of floats if multioutput='raw_values'.
- The best value is 0.0.
- multioutput (default='uniform_average'): how the errors of multiple outputs are aggregated; an array-like value gives the weights used to average the errors.
  - multioutput='raw_values': returns the full set of errors, one per output.
  - multioutput='uniform_average': errors of all outputs are averaged with uniform weight.
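
The MAE formula translates almost word for word into code. The sketch below (mae and mse are local helper names) also compares it with the mean squared error on made-up data containing one large error, to show that MAE grows linearly with the error while MSE grows quadratically.

```python
def mae(y_true, y_pred):
    """Mean of |y_i - y_hat_i| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean of (y_i - y_hat_i)^2 over all samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_test      = [3.5, -0.5, 1, 3, 3]
predictions = [2.5, -1.0, 1, 4, 5]
print(mae(y_test, predictions))  # 0.9

# Made-up data: three perfect predictions and one that is off by 10
truth   = [0, 0, 0, 0]
outlier = [0, 0, 0, 10]
print(mae(truth, outlier), mse(truth, outlier))  # 2.5 25.0
```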


In [54]:
from sklearn.metrics import mean_absolute_error

y_test      = [3.5, -0.5, 1, 3, 3] # Ground truth
predictions = [2.5, -1.0, 1, 4, 5] # Predictions

mae = mean_absolute_error(y_test, predictions)
print('{}'.format(mae))

0.9


#### MAE for multiple outputs (uniform average)

In [55]:
y_test      = [[0.5, 1, 1], [-11, 13, 43], [2, 43, -4]] # Ground truth
predictions = [[0.1, 2, 1], [-11, 22, 12], [3, 37, -3]] # Predictions

mae = mean_absolute_error(y_test, predictions)
print('{}'.format(mae))

5.488888888888888


#### To return the mean absolute error for each output separately

In [56]:
mae = mean_absolute_error(y_test, predictions, multioutput='raw_values')
print('{}'.format(mae))

[ 0.46666667  5.33333333 10.66666667]


#### Weights

In [57]:
mae = mean_absolute_error(y_test, predictions, multioutput=[1, 1, 1])
print('{}'.format(mae))

5.488888888888888


### 4. Explained Variance
Explained variance regression score function

sklearn.metrics.explained_variance_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')

- Best possible score is 1.0
- Not a symmetric function.
- Returns: float or ndarray of floats

Multioutput: defines how the scores of multiple outputs are aggregated.
An array-like value gives the weights used to average the scores.
- raw_values: Returns a full set of scores in case of multioutput input.
- uniform_average: Scores of all outputs are averaged with uniform weight.
- variance_weighted: Scores of all outputs are averaged, weighted by the variances of each individual output.
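
As a sketch of what the score measures, explained variance can be written as 1 - Var(y - y_hat) / Var(y). Unlike R^2 it ignores a constant offset in the predictions, which the made-up "shifted" example below illustrates. The helpers variance, explained_variance and r2 are local to this example and assume that y_true is not constant.

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y); assumes Var(y) > 0."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(residuals) / variance(y_true)

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, for comparison."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true  = [1, 2, 3, 4]     # made-up data
shifted = [2, 3, 4, 5]     # every prediction is off by exactly +1

print(explained_variance(y_true, shifted))  # 1.0: the residuals have zero variance
print(r2(y_true, shifted))                  # about 0.2: the bias is penalized

# The single-output example from the cell below
print(explained_variance([3, -5.5, 5, 5], [4, 5.2, 6, 6]))
```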


In [58]:
from sklearn.metrics import explained_variance_score

In [59]:
y_true = [3, -5.5, 5, 5]
y_pred = [4,  5.2, 6, 6]
explained_variance_score(y_true, y_pred)  

0.06144638403990055

In [60]:
y_true = [[5, -14], [5, -5], [5, -5]]
y_pred = [[2, -12], [3, -2], [4, -3]]
explained_variance_score(y_true, y_pred, multioutput='uniform_average')

0.49382716049382713

## III. Clustering

### 1. Adjusted Rand Score
Rand index adjusted for chance.



- Returns: float
- Similarity score between -1.0 and 1.0. 
- Random labelings have an ARI close to 0.0; 1.0 stands for a perfect match.
- Computes a similarity measure between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in the predicted and true clusterings.
- The raw RI score is then "adjusted for chance" into the ARI score using the following scheme: ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)
- ARI is symmetric: adjusted_rand_score(a, b) == adjusted_rand_score(b, a)
- The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).
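
The adjustment can be carried out on pair counts. The sketch below uses the standard pair-counting form of the ARI (agreeing pairs, minus the number expected by chance, divided by the maximum minus that expectation); it is an illustration in plain Python, not scikit-learn's implementation, and adjusted_rand is a made-up name. Returning 1.0 when the denominator is zero is an assumption made here for the degenerate case. Requires Python 3.8+ for math.comb.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_true, labels_pred):
    """ARI from pair counts: (index - expected) / (maximum - expected)."""
    n = len(labels_true)
    # Pairs of samples that share both a class and a cluster
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:      # degenerate case, e.g. both labelings are one cluster
        return 1.0
    return (pairs_both - expected) / (maximum - expected)

print(adjusted_rand([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(adjusted_rand([0, 0, 1, 2], [0, 0, 1, 1]))
print(adjusted_rand([0, 0, 1, 1], [0, 0, 1, 2]))  # same value: the score is symmetric
print(adjusted_rand([0, 0, 0, 0], [0, 1, 2, 3]))  # 0.0
```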


In [61]:
from sklearn.metrics import adjusted_rand_score

#### Perfectly matching labelings have a score of 1

In [62]:
adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 1])

1.0

In [63]:
adjusted_rand_score([0, 1, 1, 1], [0, 1, 1, 1])

1.0

#### Labelings that assign all class members to the same clusters are complete but not always pure, hence penalized:


In [64]:
adjusted_rand_score([0, 0, 1, 2], [0, 0, 1, 1])  

0.5714285714285715

#### ARI is symmetric, so labelings that have pure clusters with members coming from the same classes but unnecessary splits are penalized:


In [65]:
adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 2])

0.5714285714285715

#### If class members are completely split across different clusters, the assignment is totally incomplete, hence the ARI is very low:

In [66]:
adjusted_rand_score([0, 0, 0, 0], [0, 1, 2, 3])

0.0

### 2. Homogeneity Score
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.

- Returns: float
- score between 0.0 and 1.0. 
- 1.0 stands for perfectly homogeneous labeling
- This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
- This metric is not symmetric: switching label_true with label_pred will return the completeness_score which will be different in general.
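
The definitions can be checked directly with a yes/no test: a labeling is perfectly homogeneous when no cluster mixes classes, and perfectly complete when no class is spread over several clusters. This is only a sketch of the two definitions (is_homogeneous and is_complete are made-up helpers), not the entropy-based score itself.

```python
from collections import defaultdict

def is_homogeneous(labels_true, labels_pred):
    """True if every cluster contains members of a single class."""
    classes_in_cluster = defaultdict(set)
    for c, k in zip(labels_true, labels_pred):
        classes_in_cluster[k].add(c)
    return all(len(classes) == 1 for classes in classes_in_cluster.values())

def is_complete(labels_true, labels_pred):
    """True if all members of each class fall in a single cluster."""
    return is_homogeneous(labels_pred, labels_true)   # same check with the roles swapped

print(is_homogeneous([0, 0, 1, 1], [0, 0, 1, 2]))  # True: extra splits keep clusters pure
print(is_complete([0, 0, 1, 1], [0, 0, 1, 2]))     # False: class 1 is split in two
print(is_homogeneous([0, 0, 1, 1], [0, 1, 0, 1]))  # False: each cluster mixes both classes
```

The swap inside is_complete mirrors the asymmetry noted above: exchanging the two arguments turns one property into the other.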

In [67]:
from sklearn.metrics import homogeneity_score

#### Perfect labelings are homogeneous

In [68]:
homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0])

1.0

In [69]:
print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 0, 1, 2]))

1.000000


In [70]:
print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3]))

1.000000


#### Clusters that include samples from different classes do not make for a homogeneous labeling

In [71]:
print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 1, 0, 1]))

0.000000


In [72]:
print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))

0.000000


### 3. V Measure Score
The V-measure is the harmonic mean between homogeneity and completeness.

- Returns: float
- score between 0.0 and 1.0.
- 1.0 stands for perfectly complete labeling
- This score is identical to normalized_mutual_info_score with the 'arithmetic' option for averaging.
- v = 2 * (homogeneity * completeness) / (homogeneity + completeness)
- This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
- This metric is symmetric: switching label_true with label_pred will return the same score value.
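
To show how the three scores fit together, here is a sketch that computes homogeneity, completeness and their harmonic mean from label entropies, using the usual definitions homogeneity = 1 - H(C|K) / H(C) and completeness = 1 - H(K|C) / H(K). These formulas are stated here as an assumption about the underlying definitions, not quoted from the notes above, and all function names are local to the example.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """H(labels | given)."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(c / n * log(c / given_counts[g]) for (g, _), c in joint.items())

def homogeneity(labels_true, labels_pred):
    h_c = entropy(labels_true)
    # Convention: a single class is trivially homogeneous
    return 1.0 if h_c == 0 else 1 - conditional_entropy(labels_true, labels_pred) / h_c

def completeness(labels_true, labels_pred):
    return homogeneity(labels_pred, labels_true)   # roles of classes and clusters swapped

def v_measure(labels_true, labels_pred):
    h = homogeneity(labels_true, labels_pred)
    c = completeness(labels_true, labels_pred)
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)

print("%.6f" % v_measure([0, 0, 1, 1], [0, 0, 1, 2]))  # 0.800000
print("%.6f" % v_measure([0, 0, 1, 2], [0, 0, 1, 1]))  # 0.800000 (symmetric)
print("%.6f" % v_measure([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.000000
```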




In [73]:
from sklearn.metrics import v_measure_score

#### Perfect labelings are both homogeneous and complete, hence have score 1.0:

In [74]:
v_measure_score([0, 0, 1, 1], [0, 0, 1, 1])
v_measure_score([0, 0, 1, 1], [1, 1, 0, 0])

1.0

#### Labelings that assign all class members to the same clusters are complete but not homogeneous, hence penalized

In [75]:
print("%.6f" % v_measure_score([0, 0, 1, 2], [0, 0, 1, 1]))
print("%.6f" % v_measure_score([0, 1, 2, 3], [0, 0, 1, 1]))

0.800000
0.666667


#### Labelings that have pure clusters with members coming from the same classes are homogeneous, but unnecessary splits harm completeness and thus penalize the V-measure as well:

In [76]:
print("%.6f" % v_measure_score([0, 0, 1, 1], [0, 0, 1, 2]))
print("%.6f" % v_measure_score([0, 0, 1, 1], [0, 1, 2, 3]))

0.800000
0.666667


#### If class members are completely split across different clusters, the assignment is totally incomplete, hence the V-measure is null:

In [77]:
print("%.6f" % v_measure_score([0, 0, 0, 0], [0, 1, 2, 3]))

0.000000


#### Clusters that include samples from totally different classes totally destroy the homogeneity of the labeling, hence:

In [78]:
print("%.6f" % v_measure_score([0, 0, 1, 1], [0, 0, 0, 0]))

0.000000


### 4. Completeness
Completeness metric of a cluster labeling given a ground truth.

- Returns: completeness : float
- Score between 0.0 and 1.0; 1.0 stands for perfectly complete labeling.
- A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
- Independent of the absolute values of the labels: a permutation of the class or cluster label values won't change the score value in any way.
- Not symmetric: switching label_true with label_pred returns the homogeneity_score, which will be different in general.

#### Perfect labelings: complete

In [79]:
from sklearn.metrics.cluster import completeness_score
completeness_score([0, 0, 1, 1], [1, 1, 0, 0])

1.0

#### Labelings that assign all class members to the same clusters: complete

In [80]:

print(completeness_score([0, 0, 1, 1], [0, 0, 0, 0]))
print(completeness_score([0, 1, 2, 3], [0, 0, 1, 1]))

1.0
0.9999999999999999


#### Class members split across different clusters: not complete

In [81]:
print(completeness_score([0, 0, 1, 1], [0, 1, 0, 1]))
print(completeness_score([0, 0, 0, 0], [0, 1, 2, 3]))

0.0
0.0


#### To do:
- More on Loss Functions
- Cross-Entropy (log loss) https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
- Hinge https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html#sklearn.metrics.hinge_loss
- Huber
- Hamming https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html
- Kullback-Leibler
- https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics
- https://scikit-learn.org/stable/modules/learning_curve.html


Resources: https://scikit-learn.org/


By Luke 23.02.2019.

In [82]:
# See also (the figures below come from the 165-sample example in the dataschool guide linked at the end):

#False Positive Rate: When it's actually no, how often does it predict yes?
#FP/actual no = 10/60 = 0.17

#True Negative Rate: When it's actually no, how often does it predict no?
#TN/actual no = 50/60 = 0.83
#equivalent to 1 minus False Positive Rate
#also known as "Specificity"

#Prevalence: How often does the yes condition actually occur in our sample?
#actual yes/total = 105/165 = 0.64

#https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
#http://cs229.stanford.edu/section/evaluation_metrics.pdf