# Scoring metrics

*In this notebook, we'll examine how the choice of scoring metric affects the way we evaluate and select models*

#### Central question: "How do I know if my model performs well?"

### Selecting the proper metric

How well you can judge a model's efficacy depends largely on the metric you select.

* Is it tied to dollars saved?
* How does it relate to decision-making ability?
* Is the class label imbalanced?

## Regression

__Example__: Boston housing

* Two models
* No model tuning
* No pre-processing
* No CV

In [1]:
# NOTE: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston = load_boston()

random_state = 42
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                    test_size=0.2, random_state=random_state)

### Regression fits


In [7]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import Lasso

# first model
reg_model1 = Lasso(random_state=random_state)
reg_model1.fit(X_train, y_train)

# second model
reg_model2 = AdaBoostRegressor(random_state=random_state)
reg_model2.fit(X_train, y_train)

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=42)

# Scoring interface:

All scikit-learn scoring methods are in the `metrics` submodule:

```python
from sklearn.metrics import some_metric
```

<br/>
<br/>
Most scikit-learn scoring functions follow this signature:

```python
# interface:
def some_metric(actual_target, predicted_target, *args, **kwargs):
    ...
    return float(some_val)
```

In [8]:
from sklearn.metrics import mean_squared_error, r2_score

# Print results
print("Regression model 1 MSE: %.4f" 
      % mean_squared_error(y_test, 
                           reg_model1.predict(X_test)))

# Print results
print("Regression model 2 MSE: %.4f\n" 
      % mean_squared_error(y_test, 
                           reg_model2.predict(X_test)))

print("Regression model 1 R^2: %.4f" 
      % r2_score(y_test, reg_model1.predict(X_test)))

print("Regression model 2 R^2: %.4f" 
      % r2_score(y_test, reg_model2.predict(X_test)))

Regression model 1 MSE: 24.4298
Regression model 2 MSE: 10.9930

Regression model 1 R^2: 0.6669
Regression model 2 R^2: 0.8501


__Beware error terms vs. "higher=better" scoring terms!!!__

Without knowing what the metrics mean, we might select model 1 because its MSE is higher, overlooking that MSE measures *error* (lower is better).

Likewise, we might select model 1 on the basis of $R^{2}$ if we mistook it for an error metric, when in fact a higher $R^{2}$ is better.

__In reality, model 2 wins on the basis of both metrics!__
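The error-versus-score distinction also matters when a metric is handed to scikit-learn's model selection tools, which assume that higher is better. The sketch below is an illustration only: it uses synthetic data from `make_regression` and a plain `LinearRegression` (both assumptions, not the models fitted above) to show how `make_scorer(..., greater_is_better=False)` negates an error metric so that it can be maximized.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the housing data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
preds = model.predict(X_te)

mse = mean_squared_error(y_te, preds)  # error: lower is better
r2 = r2_score(y_te, preds)             # score: higher is better

# greater_is_better=False makes the scorer return the negated error,
# so "higher is better" holds for the scorer's output
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
neg_mse = neg_mse_scorer(model, X_te, y_te)

print("MSE:           %.4f" % mse)
print("Scorer output: %.4f" % neg_mse)
print("R^2:           %.4f" % r2)
```

The built-in scoring string `"neg_mean_squared_error"` follows the same sign convention.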

## Classification

Different challenges than regression:
1. Consider false positives/negatives
2. Consider very imbalanced events

In [77]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=25, n_informative=13, 
                           n_redundant=2, n_repeated=0, n_classes=2, 
                           n_clusters_per_class=2, weights=[0.98], 
                           flip_y=0.01, class_sep=1.0, hypercube=True, 
                           shift=0.0, scale=1.0, shuffle=True, 
                           random_state=random_state)

# need to stratify the split due to our imbalance!
print("Negative class labels: %i" % (y == 0).sum())
print("Positive class labels: %i" % (y == 1).sum())

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=random_state,
                                                    stratify=y)

Negative class labels: 976
Positive class labels: 24


In [57]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score

# first model
clf1 = LogisticRegression(random_state=random_state)
clf1.fit(X_train, y_train)

# second model
clf2 = SGDClassifier(max_iter=25, random_state=random_state)
clf2.fit(X_train, y_train)

# print metrics
print("Clf 1 accuracy: %.5f" 
      % accuracy_score(y_test, clf1.predict(X_test)))

print("Clf 2 accuracy: %.5f" 
      % accuracy_score(y_test, clf2.predict(X_test)))

Clf 1 accuracy: 0.98000
Clf 2 accuracy: 0.97500


## Which model would you choose?

Model 1 appears to outperform model 2. But does that mean it's actually better?

In [58]:
def prediction_report(model, model_name):
    preds = model.predict(X_test)
    print("[%s] Num '0' predictions: %i" % (model_name, (preds == 0).sum()))
    print("[%s] Num '1' predictions: %i" % (model_name, (preds == 1).sum()))
    
    # more interesting...
    fn = (y_test == 1) & (preds == 0)
    fp = (y_test == 0) & (preds == 1)
    print("[%s] Num FN: %i" % (model_name, fn.sum()))
    print("[%s] Num FP: %i\n" % (model_name, fp.sum()))

prediction_report(clf1, "CLF 1")
prediction_report(clf2, "CLF 2")

[CLF 1] Num '0' predictions: 199
[CLF 1] Num '1' predictions: 1
[CLF 1] Num FN: 4
[CLF 1] Num FP: 0

[CLF 2] Num '0' predictions: 200
[CLF 2] Num '1' predictions: 0
[CLF 2] Num FN: 5
[CLF 2] Num FP: 0



*__Since the prior probability of a positive class label was roughly 0.02, estimator 1 reached 98% accuracy while predicting 1 only once, and estimator 2 reached 97.5% without ever predicting 1!!!__*
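One way to put these accuracies in context is to compare them against a model that ignores the features entirely. The sketch below is not part of the original analysis: it rebuilds labels with the same 195/5 split that the prediction report implies for the test set, and uses `DummyClassifier` (a swapped-in baseline) to always predict the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Labels mirroring the test set: 200 samples, 5 of them positive
y_demo = np.array([0] * 195 + [1] * 5)
X_demo = np.zeros((200, 1))  # placeholder features; the dummy model ignores them

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_preds = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, baseline_preds)
# Balanced accuracy averages the recall of each class
baseline_bal_acc = balanced_accuracy_score(y_demo, baseline_preds)

print("Baseline accuracy:          %.3f" % baseline_acc)
print("Baseline balanced accuracy: %.3f" % baseline_bal_acc)
```

A classifier that cannot beat this baseline on accuracy has not learned anything useful about the positive class, and balanced accuracy makes that visible.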

In [59]:
from sklearn.metrics import classification_report

print("CLF 1 report:")
print(classification_report(y_test, clf1.predict(X_test)))

CLF 1 report:
             precision    recall  f1-score   support

          0       0.98      1.00      0.99       195
          1       1.00      0.20      0.33         5

avg / total       0.98      0.98      0.97       200



In [60]:
print("CLF 2 report:")
print(classification_report(y_test, clf2.predict(X_test)))

CLF 2 report:
             precision    recall  f1-score   support

          0       0.97      1.00      0.99       195
          1       0.00      0.00      0.00         5

avg / total       0.95      0.97      0.96       200





#### While class imbalance fixes are outside the scope of this lesson, the lesson *is* meant to show how misleading some scoring metrics can be

__Recall__: The proportion of positives that are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition).
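As a minimal sketch of that definition, the arrays below are reconstructed from the CLF 1 counts reported earlier (1 true positive, 4 false negatives, 0 false positives among 200 test samples). They are stand-ins for the real predictions, not the predictions themselves.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# 195 negatives followed by 5 positives
y_true = np.array([0] * 195 + [1] * 5)
# All negatives predicted correctly; only the first positive is caught
y_pred = np.array([0] * 195 + [1] + [0] * 4)

tp = ((y_true == 1) & (y_pred == 1)).sum()
fn = ((y_true == 1) & (y_pred == 0)).sum()
fp = ((y_true == 0) & (y_pred == 1)).sum()

# Recall: share of actual positives that were found
recall_manual = tp / (tp + fn)
# Precision: share of positive predictions that were correct
precision_manual = tp / (tp + fp)

print("Recall:    %.2f (sklearn: %.2f)"
      % (recall_manual, recall_score(y_true, y_pred)))
print("Precision: %.2f (sklearn: %.2f)"
      % (precision_manual, precision_score(y_true, y_pred)))
```

The perfect precision hides the fact that four of the five positives were missed, which is exactly what recall exposes.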

## Custom scoring

Perhaps you care about dollars saved. Your business partners provide you with the following table:

* False positive = -$50

* False negative = -$250

* True positive = $2,500

* True negative = $0

In [66]:
import numpy as np

def dollars_saved(actual, predicted):
    fp = (actual == 0) & (predicted == 1)
    fn = (actual == 1) & (predicted == 0)
    tp = (actual == 1) & (predicted == 1)
    tn = (actual == 0) & (predicted == 0)
    
    return np.sum([fp.sum() * -50.,
                   fn.sum() * -250.,
                   tp.sum() * 2500.])  # don't need TN, since it's worth nothing

print("Dollars saved with CLF 1: ${:,}".format(
        dollars_saved(y_test, clf1.predict(X_test))))

print("Dollars saved with CLF 2: ${:,}".format(
        dollars_saved(y_test, clf2.predict(X_test))))

Dollars saved with CLF 1: $1,500.0
Dollars saved with CLF 2: $-1,250.0


In [75]:
from sklearn.metrics import make_scorer

custom_scorer = make_scorer(score_func=dollars_saved, greater_is_better=True)
custom_scorer

make_scorer(dollars_saved)
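A scorer built this way can be passed wherever scikit-learn accepts a `scoring` argument. The sketch below assumes a small synthetic dataset and a `LogisticRegression` with `max_iter=1000` (both chosen for illustration, not taken from the cells above) and evaluates the dollar metric with 5-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def dollars_saved(actual, predicted):
    """Dollar value of a set of predictions, using the cost table above."""
    fp = (actual == 0) & (predicted == 1)
    fn = (actual == 1) & (predicted == 0)
    tp = (actual == 1) & (predicted == 1)
    # True negatives are worth $0, so they are left out
    return float(fp.sum() * -50. + fn.sum() * -250. + tp.sum() * 2500.)

# Hypothetical data: roughly 90% negative, 10% positive
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.9], random_state=42)

dollar_scorer = make_scorer(dollars_saved, greater_is_better=True)

# Each fold is scored in dollars instead of accuracy
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=5, scoring=dollar_scorer)

print("Dollars saved per fold:", scores)
print("Mean dollars saved: ${:,.2f}".format(scores.mean()))
```

Because the metric sums over samples, its value depends on fold size, so per-fold dollar amounts are only comparable when the folds are about the same size.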

# Take-aways

* Know whether "higher is better" or not
* Be able to explain intuitively what a score represents
  - e.g., MSE means that the average squared residual is X
* Know whether your data makes a scoring metric useless (e.g., accuracy on imbalanced problems)
* Consider a custom scoring metric if it makes the most sense for your problem
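To make the MSE interpretation concrete, here is a small sketch with made-up targets and predictions (the four values are hypothetical). It computes the average squared residual by hand and checks it against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

residuals = y_true - y_pred           # [1, 0, -2, -1]
mse_manual = np.mean(residuals ** 2)  # (1 + 0 + 4 + 1) / 4
rmse = np.sqrt(mse_manual)            # back in the units of the target

print("Manual MSE:  %.4f" % mse_manual)
print("sklearn MSE: %.4f" % mean_squared_error(y_true, y_pred))
print("RMSE:        %.4f" % rmse)
```

Taking the square root gives a value in the same units as the target, which is often easier to explain than the squared quantity.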