There are 3 different APIs for evaluating the quality of a model's predictions:
* **Estimator score method**: Estimators have a ```score``` method that provides a default evaluation criterion for the problem they are designed to solve.
* **Scoring parameter**: Model-evaluation tools using cross-validation (such as ```model_selection.cross_val_score``` and ```model_selection.GridSearchCV```) rely on an internal *scoring* strategy.
* **Metric functions**: The metrics module implements functions assessing prediction error for specific purposes.

Finally, Dummy estimators are useful to get a baseline value of those metrics for random predictions.

# The ```scoring``` parameter: defining model evaluation rules

Model selection and evaluation tools, such as ```model_selection.GridSearchCV``` and ```model_selection.cross_val_score```, take a ```scoring``` parameter that controls which metric they apply to the estimators being evaluated.

Scoring | Function | Comment
--- | --- | ---
*Classification* |  | 
'accuracy' | ```metrics.accuracy_score``` | 
'average_precision' | ```metrics.average_precision_score``` |
'f1' | ```metrics.f1_score``` | for binary targets
'f1_micro' | ```metrics.f1_score``` | micro-averaged
'f1_macro' | ```metrics.f1_score``` | macro-averaged
'f1_weighted' | ```metrics.f1_score``` | weighted average
'f1_samples' | ```metrics.f1_score``` | by multilabel sample
'neg_log_loss' | ```metrics.log_loss``` | requires ```predict_proba``` support
'precision' etc. | ```metrics.precision_score``` | suffixes apply as with 'f1'
'recall' etc. | ```metrics.recall_score``` | suffixes apply as with 'f1'
'roc_auc' | ```metrics.roc_auc_score``` |
*Clustering* |  |
'adjusted_mutual_info_score' | ```metrics.adjusted_mutual_info_score``` |
'adjusted_rand_score' | ```metrics.adjusted_rand_score``` |
'completeness_score' | ```metrics.completeness_score``` |
'fowlkes_mallows_score' | ```metrics.fowlkes_mallows_score``` |
'homogeneity_score' | ```metrics.homogeneity_score``` |
'mutual_info_score' | ```metrics.mutual_info_score``` |
'normalized_mutual_info_score' | ```metrics.normalized_mutual_info_score``` |
'v_measure_score' | ```metrics.v_measure_score``` |
*Regression* |  |
'explained_variance' | ```metrics.explained_variance_score``` |
'neg_mean_absolute_error' | ```metrics.mean_absolute_error``` |
'neg_mean_squared_error' | ```metrics.mean_squared_error``` |
'neg_mean_squared_log_error' | ```metrics.mean_squared_log_error``` |
'neg_median_absolute_error' | ```metrics.median_absolute_error``` |
'r2' | ```metrics.r2_score``` |

In [4]:
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC(probability=True, random_state=0)
cross_val_score(clf, X, y, scoring='neg_log_loss')

array([-0.07490352, -0.16449405, -0.06685511])

The module ```sklearn.metrics``` also exposes a set of simple functions measuring a prediction error given ground truth and prediction:
* functions ending with ```_score``` return a value to maximize, the higher the better
* functions ending with ```_error``` or ```_loss``` return a value to minimize, the lower the better.

The simplest way to generate a callable object for scoring is by using ```make_scorer```.

In [10]:
from sklearn.metrics import fbeta_score, make_scorer

ftwo_scorer = make_scorer(fbeta_score, beta=2)

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)

In [13]:
import numpy as np

def my_custom_loss_func(ground_truth, predictions):
    diff = np.abs(ground_truth - predictions).max()
    return np.log(1 + diff)
loss = make_scorer(my_custom_loss_func, greater_is_better=False)
score = make_scorer(my_custom_loss_func, greater_is_better=True)

ground_truth = [[1], [1]]
predictions = [0, 1]

from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf = clf.fit(ground_truth, predictions)
loss(clf, ground_truth, predictions)

-0.6931471805599453

In [14]:
score(clf, ground_truth, predictions)

0.6931471805599453
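The two results differ only in sign because a scorer built with greater_is_better=False negates the metric, so that a larger scorer value is always better. The sketch below reproduces that convention in plain Python. The helper make_simple_scorer is a hypothetical stand-in for illustration, not scikit-learn's implementation, and it scores predictions directly instead of calling an estimator.

```python
import math

def my_custom_loss_func(ground_truth, predictions):
    # Largest absolute difference, squashed with log(1 + x).
    diff = max(abs(t - p) for t, p in zip(ground_truth, predictions))
    return math.log(1 + diff)

def make_simple_scorer(metric, greater_is_better=True):
    # Hypothetical helper: flip the sign of losses so higher is always better.
    sign = 1 if greater_is_better else -1
    def scorer(y_true, y_pred):
        return sign * metric(y_true, y_pred)
    return scorer

y_true = [0, 1]
y_pred = [0, 0]  # a constant prediction, as a dummy classifier would produce

loss = make_simple_scorer(my_custom_loss_func, greater_is_better=False)
score = make_simple_scorer(my_custom_loss_func, greater_is_better=True)
print(loss(y_true, y_pred), score(y_true, y_pred))
```

Both values have magnitude log(2); only the sign changes.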

In [26]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix

X, y = datasets.make_classification(n_classes=2, random_state=0)
svm = LinearSVC(random_state=0)

def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]
def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]

scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn), 'fp': make_scorer(fp), 'fn': make_scorer(fn)}
cv_results = cross_validate(svm.fit(X, y), X, y, scoring=scoring)
print(cv_results['test_tp'])
print(cv_results['test_fn'])

[16 14  9]
[1 3 7]


# Classification metrics

binary classification:

[]() |
---|---
```precision_recall_curve(y_true, probas_pred)```| Compute precision-recall pairs for different probability thresholds
```roc_curve(y_true, y_score[, pos_label, ...])``` | Compute Receiver operating characteristic (ROC)

multiclass:

[]() |
---|---
```cohen_kappa_score(y1, y2[, labels, weights, ...])```| Cohen's kappa: a statistic that measures inter-annotator agreement
```confusion_matrix(y_true, y_pred[, labels, …])``` |	Compute confusion matrix to evaluate the accuracy of a classification
```hinge_loss(y_true, pred_decision[, labels, …])``` |	Average hinge loss (non-regularized)
```matthews_corrcoef(y_true, y_pred[, …])``` | Compute the Matthews correlation coefficient (MCC)

multilabel:

[]() |
---|---
```accuracy_score(y_true, y_pred[, normalize, …])```|	Accuracy classification score.
```classification_report(y_true, y_pred[, …])``` |	Build a text report showing the main classification metrics
```f1_score(y_true, y_pred[, labels, …])``` |	Compute the F1 score, also known as balanced F-score or F-measure
```fbeta_score(y_true, y_pred, beta[, labels, …])``` |	Compute the F-beta score
```hamming_loss(y_true, y_pred[, labels, …])``` |	Compute the average Hamming loss.
```jaccard_similarity_score(y_true, y_pred[, …])``` |	Jaccard similarity coefficient score
```log_loss(y_true, y_pred[, eps, normalize, …])``` |	Log loss, aka logistic loss or cross-entropy loss.
```precision_recall_fscore_support(y_true, y_pred)``` |	Compute precision, recall, F-measure and support for each class
```precision_score(y_true, y_pred[, labels, …])``` |	Compute the precision
```recall_score(y_true, y_pred[, labels, …])``` |	Compute the recall
```zero_one_loss(y_true, y_pred[, normalize, …])``` |	Zero-one classification loss.

binary and multilabel (but not multiclass):

[]() |
---|---
```average_precision_score(y_true, y_score[, …])```|	Compute average precision (AP) from prediction scores
```roc_auc_score(y_true, y_score[, average, …])```|	Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

## Accuracy score

$$ accuracy(y, \hat{y}) = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}1(\hat{y_i}=y_i) $$

In [28]:
import numpy as np
from sklearn.metrics import accuracy_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)

0.5

In [29]:
accuracy_score(y_true, y_pred, normalize=False)

2

In the multilabel case with binary label indicators:

In [30]:
accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))

0.5

## Cohen's kappa

[Cohen's kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa)

In [31]:
from sklearn.metrics import cohen_kappa_score
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

cohen_kappa_score(y_true, y_pred)

0.4285714285714286
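Cohen's kappa compares the observed agreement $p_o$ with the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. As a rough check of the value above, the following plain-Python sketch computes both quantities by hand. The function name cohen_kappa is chosen here for illustration, and the sketch covers only the unweighted case.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
kappa = cohen_kappa(y_true, y_pred)
print(kappa)
```

Here $p_o = 4/6$ and $p_e = 15/36$, which gives $\kappa = 9/21$.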

## Confusion matrix

By definition, entry $i, j$ in a confusion matrix is the number of observations actually in group $i$, but predicted to be in group $j$.

In [32]:
from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

confusion_matrix(y_true, y_pred)

array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

For binary problems, we can get counts of true negatives, false positives, false negatives and true positives.

In [36]:
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(confusion_matrix(y_true, y_pred))
tn, fp, fn, tp

[[2 1]
 [2 3]]


(2, 1, 2, 3)
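To make the row and column convention concrete, here is a small sketch that builds the same matrix by counting (true, predicted) pairs. The helper confusion_counts is hypothetical and assumes every label appears in the labels list.

```python
def confusion_counts(y_true, y_pred, labels):
    # Row index = true label, column index = predicted label.
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
matrix = confusion_counts(y_true, y_pred, labels=[0, 1])

# For binary problems the rows unpack as (tn, fp) and (fn, tp).
(tn, fp), (fn, tp) = matrix
print(matrix, (tn, fp, fn, tp))
```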

## Classification report

The ```classification_report``` function builds a text report showing the main classification metrics.

In [37]:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.50      0.67         2

avg / total       0.67      0.60      0.59         5



## Hamming loss

$$ L_{Hamming}(y, \hat{y}) = \frac{1}{n_{labels}}\sum_{j=0}^{n_{labels}-1}1(\hat{y_j}\neq y_j) $$

In [39]:
from sklearn.metrics import hamming_loss
y_pred = [1, 2, 3, 4]
y_true = [2, 2, 3, 4]
hamming_loss(y_true, y_pred)

0.25

## Jaccard similarity coefficient score

$$ J(y_i, \hat{y_i}) = \frac{|y_i \cap \hat{y_i}|}{|y_i \cup \hat{y_i}|} $$

In binary and multiclass classification, the Jaccard similarity coefficient score is equal to the classification accuracy.

In [41]:
import numpy as np
from sklearn.metrics import jaccard_similarity_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]

jaccard_similarity_score(y_true, y_pred)

0.5

In [42]:
jaccard_similarity_score(y_true, y_pred, normalize=False)

2

## Precision, recall and F-measure

Intuitively, **precision** is the ability of the classifier not to label as positive a sample that is negative, and **recall** is the ability of the classifier to find all the positive samples.

Several functions allow you to analyze the precision, recall and F-measure score:

[]() | 
--- | ---
```average_precision_score```(y_true, y_score[, …]) |	Compute average precision (AP) from prediction scores
```f1_score```(y_true, y_pred[, labels, …])	 | Compute the F1 score, also known as balanced F-score or F-measure
```fbeta_score```(y_true, y_pred, beta[, labels, …]) |	Compute the F-beta score
```precision_recall_curve```(y_true, probas_pred) |	Compute precision-recall pairs for different probability thresholds
```precision_recall_fscore_support```(y_true, y_pred) |	Compute precision, recall, F-measure and support for each class
```precision_score```(y_true, y_pred[, labels, …])	| Compute the precision
```recall_score```(y_true, y_pred[, labels, …])	| Compute the recall

$$ precision = \frac{tp}{tp + fp} $$
$$ recall = \frac{tp}{tp + fn} $$
$$ F_\beta = (1+\beta^2)\frac{precision \times recall}{\beta^2precision + recall} $$

In [44]:
from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
metrics.precision_score(y_true, y_pred)

1.0

In [45]:
metrics.recall_score(y_true, y_pred)

0.5

In [46]:
metrics.f1_score(y_true, y_pred)

0.6666666666666666

In [47]:
metrics.fbeta_score(y_true, y_pred, beta=0.5)

0.8333333333333334
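The four values above follow directly from the formulas once the true positives, false positives and false negatives are counted. A minimal sketch, assuming label 1 is the positive class and ignoring the zero-division cases that a library implementation has to handle (the function name is illustrative only):

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-beta weights recall beta times as much as precision.
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, fbeta

y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
precision, recall, f1 = precision_recall_fbeta(y_true, y_pred, beta=1.0)
_, _, f_half = precision_recall_fbeta(y_true, y_pred, beta=0.5)
print(precision, recall, f1, f_half)
```

With beta below 1 the score moves toward precision (here 1.0), which is why the F0.5 value is higher than F1.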

In [48]:
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, threshold = precision_recall_curve(y_true, y_scores)
precision

array([0.66666667, 0.5       , 1.        , 1.        ])

In [49]:
recall

array([1. , 0.5, 0.5, 0. ])

In [50]:
threshold

array([0.35, 0.4 , 0.8 ])

The ```average_precision_score``` function computes the **average precision** (AP) from prediction scores. The value is between 0 and 1 and higher is better. AP is defined as

$$ AP = \sum_n(R_n-R_{n-1})P_n $$

In [51]:
average_precision_score(y_true, y_scores)

0.8333333333333333
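The AP value can be recovered from the precision and recall arrays printed earlier. In this sketch the arrays are copied from those outputs; since recall is listed in decreasing order, each term multiplies the precision at a threshold by the recall lost when moving to the next, stricter threshold.

```python
# Arrays as returned by precision_recall_curve in the cells above.
precision = [2 / 3, 0.5, 1.0, 1.0]
recall = [1.0, 0.5, 0.5, 0.0]

# AP = sum over thresholds of (change in recall) * (precision at that threshold).
ap = sum(
    (recall[n] - recall[n + 1]) * precision[n]
    for n in range(len(recall) - 1)
)
print(ap)
```

The middle term contributes nothing because recall does not change between the second and third thresholds.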

In multiclass and multilabel classification tasks, the notions of precision, recall and F-measures can be applied to each label independently.

average | Precision | Recall | F_beta
--- | --- | --- | ---
"micro" | $P(y, \hat{y})$ | $R(y, \hat{y})$ | $F_\beta(y, \hat{y})$
"samples" | $\frac{1}{\lvert S \rvert}\sum_{s\in S}P(y_s, \hat{y_s})$ | $\frac{1}{\lvert S \rvert}\sum_{s\in S}R(y_s, \hat{y_s})$ |$\frac{1}{\lvert S \rvert}\sum_{s\in S}F_\beta(y_s, \hat{y_s})$
"macro" | $\frac{1}{\lvert L \rvert}\sum_{l\in L}P(y_l, \hat{y_l})$ | $\frac{1}{\lvert L \rvert}\sum_{l\in L}R(y_l, \hat{y_l})$ |$\frac{1}{\lvert L \rvert}\sum_{l\in L}F_\beta(y_l, \hat{y_l})$
"weighted" | $\frac{1}{\sum_{l\in L}\lvert \hat{y_l} \rvert}\sum_{l\in L}\lvert \hat{y_l} \rvert P(y_l, \hat{y_l})$ |$\frac{1}{\sum_{l\in L}\lvert \hat{y_l} \rvert}\sum_{l\in L}\lvert \hat{y_l} \rvert R(y_l, \hat{y_l})$ |$\frac{1}{\sum_{l\in L}\lvert \hat{y_l} \rvert}\sum_{l\in L}\lvert \hat{y_l} \rvert F_\beta(y_l, \hat{y_l})$

In [52]:
from sklearn import metrics
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
metrics.precision_score(y_true, y_pred, average='macro')

0.2222222222222222

In [53]:
metrics.recall_score(y_true, y_pred, average='micro')

0.3333333333333333

In [54]:
metrics.f1_score(y_true, y_pred, average='weighted')

0.26666666666666666

In [55]:
metrics.fbeta_score(y_true, y_pred, average='macro', beta=0.5)

0.23809523809523805

In [56]:
metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5, average=None)

(array([0.66666667, 0.        , 0.        ]),
 array([1., 0., 0.]),
 array([0.71428571, 0.        , 0.        ]),
 array([2, 2, 2]))
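The averaging modes differ only in how per-label counts are combined. The sketch below recomputes three of the results above by hand. The helper names are hypothetical, undefined ratios are treated as 0 here, and the "weighted" average is assumed to weight each label by its support (the number of true instances).

```python
def per_label_counts(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn

def safe_div(a, b):
    # Undefined ratios (empty denominator) are treated as 0 in this sketch.
    return a / b if b else 0.0

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
labels = [0, 1, 2]
counts = [per_label_counts(y_true, y_pred, label) for label in labels]

# macro: unweighted mean of the per-label scores.
macro_precision = sum(safe_div(tp, tp + fp) for tp, fp, fn in counts) / len(labels)

# micro: pool the counts over all labels, then compute one score.
micro_recall = safe_div(
    sum(tp for tp, fp, fn in counts),
    sum(tp + fn for tp, fp, fn in counts),
)

# weighted: mean of per-label F1, weighted by support (tp + fn).
weighted_f1 = safe_div(
    sum((tp + fn) * safe_div(2 * tp, 2 * tp + fp + fn) for tp, fp, fn in counts),
    sum(tp + fn for tp, fp, fn in counts),
)
print(macro_precision, micro_recall, weighted_f1)
```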

For multiclass classification with a "negative class", it is possible to exclude some labels:

In [57]:
metrics.recall_score(y_true, y_pred, labels=[1, 2], average='micro')

0.0

Similarly, labels not present in the data sample may be accounted for in macro-averaging.

In [58]:
metrics.precision_score(y_true, y_pred, labels=[0, 1, 2, 3], average='macro')


0.16666666666666666

## Hinge loss

If the labels are encoded with +1 and -1, $y$ is the true value, and $w$ is the predicted decision as output by ```decision_function```, then the hinge loss is defined as:

$$ L_{Hinge}(y, w) = \max\{1-wy, 0\} = \lvert 1-wy \rvert_+ $$

In [59]:
from sklearn import svm
from sklearn.metrics import hinge_loss
X = [[0], [1]]
y = [-1, 1]
est = svm.LinearSVC(random_state=0)
est.fit(X, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
     verbose=0)

In [61]:
pred_decision = est.decision_function([[-2], [3], [0.5]])
pred_decision

array([-2.18173682,  2.36360149,  0.09093234])

In [62]:
hinge_loss([-1, 1, 1], pred_decision)

0.30302255420413554
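As a sanity check, the averaged hinge loss can be recomputed from the decision values printed above. The decision values in this sketch are copied from that output (rounded to eight decimals), so the result matches only up to rounding.

```python
y = [-1, 1, 1]
decisions = [-2.18173682, 2.36360149, 0.09093234]  # copied from the output above

# Per-sample hinge loss: zero once the margin w * y reaches 1.
losses = [max(1 - w * t, 0.0) for t, w in zip(y, decisions)]
average_hinge = sum(losses) / len(losses)
print(losses, average_hinge)
```

Only the third sample is inside the margin, so it alone contributes to the loss.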

## Log loss

Log loss, also called logistic regression loss or cross-entropy loss, is defined on probability estimates. It is commonly used in (multinomial) logistic regression and neural networks, as well as in some variants of expectation-maximization, and can be used to evaluate the probability outputs (```predict_proba```) of a classifier instead of its discrete predictions.

For binary classification with a true label $y \in \{0, 1\}$ and a probability estimate $p = Pr(y=1)$, the log loss per sample is the negative log-likelihood of the classifier given the true label:

$$ L_{\log}(y, p) = -\log Pr(y|p) = -(y\log(p) + (1-y)\log(1-p)) $$

This extends to the multiclass case as follows.

$$ L_{\log}(Y, P) = -\log Pr(Y|P) = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{k=0}^{K-1}y_{i,k}\log{p_{i,k}} $$

In [64]:
from sklearn.metrics import log_loss
y_true = [0, 0, 1, 1]
y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
log_loss(y_true, y_pred)

0.1738073366910675
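Because $y_{i,k}$ is 1 only for the true class, the double sum reduces to the mean negative log of the probability assigned to the true label. A minimal sketch of that reduction, without the probability clipping a library implementation would normally apply:

```python
import math

y_true = [0, 0, 1, 1]
y_prob = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]

# Pick the probability assigned to the true class of each sample.
true_class_probs = [row[label] for label, row in zip(y_true, y_prob)]
log_loss_value = -sum(math.log(p) for p in true_class_probs) / len(y_true)
print(true_class_probs, log_loss_value)
```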

## Matthews correlation coefficient

$$ MCC = \frac{tp\times tn - fp \times fn}{\sqrt{(tp+fp)(tp+fn)(tn+fp)(tn+fn)}} $$

In [65]:
from sklearn.metrics import matthews_corrcoef
y_true = [1, 1, 1, -1]
y_pred = [1, -1, 1, 1]
matthews_corrcoef(y_true, y_pred)

-0.3333333333333333
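The value above can be traced back to the four confusion counts. A small sketch, assuming +1 is the positive class and returning 0 when the denominator vanishes (the function name mcc is illustrative):

```python
import math

def mcc(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used in this sketch: return 0 when any marginal is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 1, -1]
y_pred = [1, -1, 1, 1]
mcc_value = mcc(y_true, y_pred)
print(mcc_value)
```

Here tp = 2, tn = 0, fp = 1 and fn = 1, so the numerator is -1 and the denominator is 3.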

## Receiver operating characteristic (ROC)

> “A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate.”


In [67]:
import numpy as np
from sklearn.metrics import roc_curve
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
fpr

array([0. , 0.5, 0.5, 1. ])

In [68]:
tpr

array([0.5, 0.5, 1. , 1. ])

In [69]:
thresholds

array([0.8 , 0.4 , 0.35, 0.1 ])

The ```roc_auc_score``` function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. By computing the area under the ROC curve, the curve information is summarized in one number.

In [71]:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)

0.75
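One way to read this number: for binary targets, the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below checks this by enumerating all positive/negative pairs. The function pairwise_auc is a hypothetical helper and is quadratic in the number of samples, so it is meant for illustration only.

```python
def pairwise_auc(y_true, y_score):
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # Count pairs ranked correctly; ties count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_scores)
print(auc)
```

Three of the four pairs are ordered correctly; only the pair (0.35, 0.4) is not.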

## Zero one loss

$$ L_{0-1}(y_i, \hat{y_i}) = 1(\hat{y_i} \neq y_i) $$

In [73]:
from sklearn.metrics import zero_one_loss
y_pred = [1, 2, 3, 4]
y_true = [2, 2, 3, 4]
zero_one_loss(y_true, y_pred)

0.25

In [74]:
zero_one_loss(y_true, y_pred, normalize=False)

1

## Brier score loss

$$ BS = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2 $$

In [76]:
import numpy as np
from sklearn.metrics import brier_score_loss
y_true = np.array([0, 1, 1, 0])
y_true_categorical = np.array(["spam", "ham", "ham", "spam"])
y_prob = np.array([0.1, 0.9, 0.8, 0.4])
y_pred = np.array([0, 1, 1, 0])
brier_score_loss(y_true, y_prob)

0.055

In [77]:
brier_score_loss(y_true, 1-y_prob, pos_label=0)

0.055

In [78]:
brier_score_loss(y_true_categorical, y_prob, pos_label="ham")

0.055

In [79]:
brier_score_loss(y_true, y_prob > 0.5)

0.0

# Multilabel ranking metrics

## Coverage error

$$ coverage(y, \hat{f}) = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}\max_{j:y_{i,j}=1}rank_{i,j} $$

with $$ rank_{i, j} = \lvert \{ k:\hat{f_{ik}} \ge \hat{f_{ij}} \} \rvert$$

In [80]:
import numpy as np
from sklearn.metrics import coverage_error
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
coverage_error(y_true, y_score)

2.5

## Label ranking average precision

$$ LRAP(y, \hat{f}) = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}\frac{1}{\lvert y_i \rvert}\sum_{j:y_{ij}=1}\frac{\lvert \mathcal{L_{ij}} \rvert}{rank_{ij}} $$

with $$ \mathcal{L_{ij}} = \{ k:y_{ik}=1, \hat{f_{ik}} \ge \hat{f_{ij}} \} $$

In [82]:
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
label_ranking_average_precision_score(y_true, y_score)

0.41666666666666663
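The formula can be followed step by step in plain Python. This sketch, with an illustrative function name, assumes every sample has at least one relevant label and uses the rank definition from the coverage error section (the number of labels scored at least as high as label $j$).

```python
def lrap(y_true, y_score):
    total = 0.0
    for labels, scores in zip(y_true, y_score):
        relevant = [j for j, flag in enumerate(labels) if flag == 1]
        sample_sum = 0.0
        for j in relevant:
            # rank_ij: labels (relevant or not) scored at least as high as j.
            rank = sum(s >= scores[j] for s in scores)
            # |L_ij|: relevant labels scored at least as high as j.
            relevant_above = sum(scores[k] >= scores[j] for k in relevant)
            sample_sum += relevant_above / rank
        total += sample_sum / len(relevant)
    return total / len(y_true)

y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1], [1, 0.2, 0.1]]
lrap_value = lrap(y_true, y_score)
print(lrap_value)
```

The first sample ranks its true label second (1/2) and the second sample ranks it third (1/3), so the mean is 5/12.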

## Ranking loss

$$ ranking\_loss(y, \hat{f})=\frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}\frac{1}{\lvert y_i \rvert(n_{labels}-\lvert y_i \rvert)}\lvert \{ (k, l): \hat{f_{ik}} < \hat{f_{il}}, y_{ik}=1, y_{il}=0 \} \rvert $$

In [1]:
import numpy as np
from sklearn.metrics import label_ranking_loss
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
label_ranking_loss(y_true, y_score) 

0.75

In [2]:
# With the following prediction, we have perfect and minimal loss
y_score = np.array([[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])
label_ranking_loss(y_true, y_score)

0.0
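Both results can be checked by counting incorrectly ordered (relevant, irrelevant) label pairs, as in the formula. A minimal sketch, assuming every sample has at least one relevant and one irrelevant label; the function name is chosen for illustration.

```python
def ranking_loss(y_true, y_score):
    total = 0.0
    for labels, scores in zip(y_true, y_score):
        relevant = [k for k, flag in enumerate(labels) if flag == 1]
        irrelevant = [l for l, flag in enumerate(labels) if flag == 0]
        # Pairs where a relevant label scores below an irrelevant one.
        bad_pairs = sum(
            scores[k] < scores[l] for k in relevant for l in irrelevant
        )
        total += bad_pairs / (len(relevant) * len(irrelevant))
    return total / len(y_true)

y_true = [[1, 0, 0], [0, 0, 1]]
first = ranking_loss(y_true, [[0.75, 0.5, 1], [1, 0.2, 0.1]])
second = ranking_loss(y_true, [[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])
print(first, second)
```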

# Regression metrics

The ```sklearn.metrics``` module implements several loss, score and utility functions to measure regression performance. Some of those have been enhanced to handle the multioutput case: ```mean_squared_error```, ```mean_absolute_error```, ```explained_variance_score``` and ```r2_score```.

These functions have a ```multioutput``` keyword argument which specifies the way the scores or losses for each individual target should be averaged. The default is ```'uniform_average'```, which specifies a uniformly weighted mean over outputs. If an ```ndarray``` of shape ```(n_outputs,)``` is passed, then its entries are interpreted as weights and the corresponding weighted average is returned. If ```multioutput='raw_values'``` is specified, then all unaltered individual scores or losses are returned in an array of shape ```(n_outputs,)```.

The ```r2_score``` and ```explained_variance_score``` accept an additional value ```'variance_weighted'``` for the ```multioutput``` parameter. This option leads to a weighting of each individual score by the variance of the corresponding target variable.
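The three aggregation modes can be illustrated with the mean absolute error on a small two-output example. This is a plain-Python sketch of the idea, not the library code, and column_mae is a hypothetical helper.

```python
def column_mae(y_true, y_pred):
    # One mean absolute error per output column.
    n_samples, n_outputs = len(y_true), len(y_true[0])
    return [
        sum(abs(t[j] - p[j]) for t, p in zip(y_true, y_pred)) / n_samples
        for j in range(n_outputs)
    ]

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]

raw_values = column_mae(y_true, y_pred)            # like multioutput='raw_values'
uniform = sum(raw_values) / len(raw_values)        # like 'uniform_average'
weights = [0.3, 0.7]                               # like multioutput=[0.3, 0.7]
weighted = sum(w * v for w, v in zip(weights, raw_values)) / sum(weights)
print(raw_values, uniform, weighted)
```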

## Explained variance score

$$ explained\_variance(y, \hat{y}) = 1 - \frac{Var\{y-\hat{y}\}}{Var\{y\}} $$
The best possible score is 1.0, lower values are worse.

In [6]:
from sklearn.metrics import explained_variance_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(explained_variance_score(y_true, y_pred))
explained_variance_score(y_true, y_pred, multioutput='raw_values')

0.9571734475374732


array([0.95717345])

In [7]:
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(explained_variance_score(y_true, y_pred, multioutput='raw_values'))
explained_variance_score(y_true, y_pred, multioutput=[0.3, 0.7])

[0.96774194 1.        ]


0.9903225806451612

## Mean absolute error

$$ MAE(y, \hat{y}) = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}\lvert y_i - \hat{y_i} \rvert $$

In [9]:
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)

0.5

In [10]:
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
mean_absolute_error(y_true, y_pred)

0.75

In [11]:
mean_absolute_error(y_true, y_pred, multioutput='raw_values')

array([0.5, 1. ])

In [12]:
mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])

0.85

## Mean squared error

$$ MSE(y, \hat{y}) = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}(y_i - \hat{y_i})^2 $$

In [16]:
from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_squared_error(y_true, y_pred)

0.375

In [17]:
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
mean_squared_error(y_true, y_pred)  

0.7083333333333334

## Mean squared logarithmic error

$$ MSLE(y, \hat{y})=\frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}(log_e(1+y_i)-log_e(1+\hat{y_i}))^2 $$

This metric is best suited to targets with exponential growth, such as population counts or average sales of a commodity over a span of years. Note that it penalizes an under-predicted estimate more than an over-predicted estimate.

In [19]:
from sklearn.metrics import mean_squared_log_error
y_true = [3, 5, 2.5, 7]
y_pred = [2.5, 5, 4, 8]
mean_squared_log_error(y_true, y_pred)  

0.03973012298459379

In [20]:
y_true = [[0.5, 1], [1, 2], [7, 6]]
y_pred = [[0.5, 2], [1, 2.5], [8, 8]]
mean_squared_log_error(y_true, y_pred) 

0.044199361889160536
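The asymmetry mentioned above is easy to see numerically. The sketch below implements the formula for a single output and compares an under-prediction and an over-prediction that miss by the same absolute amount. The target of 10 and the predictions of 5 and 15 are made-up values for illustration.

```python
import math

def msle(y_true, y_pred):
    # Squared difference of log(1 + y), averaged over samples.
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)

value = msle([3, 5, 2.5, 7], [2.5, 5, 4, 8])

# Same absolute error of 5, in opposite directions (illustrative values).
under = msle([10], [5])
over = msle([10], [15])
print(value, under, over)
```

The under-prediction yields the larger loss because the logarithm compresses large values more than small ones.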

## Median absolute error

$$ MedAE(y, \hat{y})=median(\lvert y_1-\hat{y_1} \rvert, ..., \lvert y_n - \hat{y_n} \rvert) $$

In [21]:
from sklearn.metrics import median_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
median_absolute_error(y_true, y_pred)

0.5

## $R^2$ score, the coefficient of determination

$$ R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n_{samples}-1}(y_i-\hat{y_i})^2}{\sum_{i=0}^{n_{samples}-1}(y_i-\bar{y})^2} $$

where $\bar{y} = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}y_i$ is the mean of the true values.

In [22]:
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
r2_score(y_true, y_pred)

0.9486081370449679

In [23]:
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
r2_score(y_true, y_pred, multioutput='variance_weighted')

0.9382566585956417

In [24]:
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
r2_score(y_true, y_pred, multioutput='uniform_average')

0.9368005266622779

In [25]:
r2_score(y_true, y_pred, multioutput='raw_values')

array([0.96543779, 0.90816327])

In [26]:
r2_score(y_true, y_pred, multioutput=[0.3, 0.7])

0.9253456221198156
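The following sketch recomputes some of these results from the residual and total sums of squares. It assumes that 'variance_weighted' weights each output by its total sum of squares, which is proportional to the variance of that target. The helper names are illustrative.

```python
def sums_of_squares(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return ss_res, ss_tot

# Single-output case.
ss_res, ss_tot = sums_of_squares([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
r2_single = 1 - ss_res / ss_tot

# Multi-output case: treat each column separately.
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
parts = [
    sums_of_squares([row[j] for row in y_true], [row[j] for row in y_pred])
    for j in range(2)
]
raw_values = [1 - res / tot for res, tot in parts]
uniform = sum(raw_values) / len(raw_values)
# Weight each output's R^2 by its total sum of squares.
variance_weighted = sum(tot * r for (_, tot), r in zip(parts, raw_values)) / sum(
    tot for _, tot in parts
)
print(r2_single, raw_values, uniform, variance_weighted)
```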

# Clustering metrics

# Dummy estimators

```DummyClassifier``` implements several such simple strategies for classification:
* ```stratified``` generates random predictions by respecting the training set class distribution.
* ```most_frequent``` always predicts the most frequent label in the training set.
* ```prior``` always predicts the class that maximizes the class prior.
* ```uniform``` generates predictions uniformly at random.
* ```constant``` always predicts a constant label that is provided by the user.

Note that with all these strategies, the ```predict``` method completely ignores the input data!

In [30]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
y[y != 1] = -1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [31]:
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test) 

0.631578947368421

In [32]:
clf = DummyClassifier(strategy='most_frequent',random_state=0)
clf.fit(X_train, y_train)

clf.score(X_test, y_test)  

0.5789473684210527

In [33]:
clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)  

0.9736842105263158

```DummyRegressor``` also implements four simple rules of thumb for regression:
* ```mean``` always predicts the mean of the training targets.
* ```median``` always predicts the median of the training targets.
* ```quantile``` always predicts a specified quantile of the training set, provided with the quantile parameter.
* ```constant``` always predicts a constant value that is provided by the user.

In all these strategies, the ```predict``` method completely ignores the input data.
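To show what such baselines compute, here is a plain-Python sketch of the most_frequent and mean strategies. The helper functions and the toy labels and targets are made up for illustration; they are not the iris split used above.

```python
from collections import Counter
from statistics import mean

def most_frequent_baseline(train_labels, test_labels):
    # Always predict the majority class seen during training.
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)

def mean_baseline_mse(train_targets, test_targets):
    # Always predict the mean of the training targets.
    constant = mean(train_targets)
    return sum((t - constant) ** 2 for t in test_targets) / len(test_targets)

# Hypothetical toy data.
baseline_accuracy = most_frequent_baseline([-1, -1, -1, 1, 1], [-1, 1, -1, -1])
baseline_mse = mean_baseline_mse([1.0, 2.0, 3.0], [2.0, 4.0])
print(baseline_accuracy, baseline_mse)
```

A real model is only informative if it clearly beats numbers like these on the same test data.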