In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.utils import shuffle

# Model evaluation

[How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/)

[Understanding and diagnosing your machine-learning models - Gaël Varoquaux](https://www.youtube.com/watch?v=kbj3llSbaVA)

Be the algorithm
- solve manually
- if you can't, can you expect a machine to?

Look at the samples you get wrong
- best, worst, uncertain (50% probability in classification)
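
A minimal sketch of pulling out the best, worst and most uncertain samples, assuming a binary classifier that exposes `predict_proba`. The dataset, the "is it a 3?" target and all variable names here are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative binary problem (assumption): is the digit a 3?
digits = load_digits()
features = digits.data / 16.0  # load_digits pixel values range from 0 to 16
labels = (digits.target == 3).astype(int)

f_tr, f_va, l_tr, l_va = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)
clf = LogisticRegression(max_iter=1000).fit(f_tr, l_tr)

# Probability assigned to the positive class for each validation sample
p_pos = clf.predict_proba(f_va)[:, 1]
# Probability assigned to the *true* class of each sample
p_true = np.where(l_va == 1, p_pos, 1 - p_pos)

order = np.argsort(p_true)
worst = order[:5]   # lowest probability on the true class
best = order[-5:]   # highest probability on the true class
uncertain = np.argsort(np.abs(p_pos - 0.5))[:5]  # closest to 50%

print("worst:", worst, "best:", best, "uncertain:", uncertain)
```

The indices can then be used to look at the raw samples, for example `f_va[worst]` reshaped to 8x8 images.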

## Train / test / validation / holdout

### Train

- used to find statistics & parameters of a model

### Validation

- aka development (dev) set, sometimes loosely called test
- used to find hyperparameters & select a model
- should be representative of new/future data

### Holdout

- aka test
- how well did my final model do
- should be representative of new/future data
- use a test set to evaluate the performance of the model selected using the previous steps
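
One way to get the three sets is to call `train_test_split` twice. This is a sketch on synthetic stand-in data. The 60/20/20 proportions are an assumption for illustration, not a recommendation from the notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 4))    # synthetic stand-in data
labels = rng.integers(0, 2, size=1000)

# First carve off the holdout set, then split the rest into train / validation
f_rest, f_hold, l_rest, l_hold = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
f_tr, f_va, l_tr, l_va = train_test_split(
    f_rest, l_rest, test_size=0.25, random_state=0
)

print(len(f_tr), len(f_va), len(f_hold))
```

Model parameters are fit on `f_tr`, hyperparameters are chosen on `f_va`, and `f_hold` is only touched once at the end.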

## Metrics

Align with a business objective / product
- possible to have a good ML model with a bad product

Most are **aggregates**
- lose/hide information

### Classification metrics

BLEU, cross entropy
- key question = can I take a gradient?

Zero one loss 
- hard to optimize

**Always look at class cardinality & imbalance**
- more classes = harder problem
- imbalance = deceptively easy (predicting the most common class already scores well on accuracy)
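
A quick way to check cardinality and imbalance is to count samples per class. The sketch below uses `load_digits` as an example dataset. The one-vs-rest "is it a 3?" target is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import load_digits

target = load_digits().target

# Samples per class: the length gives the cardinality, the ratios show imbalance
counts = np.bincount(target)
print(len(counts), counts / counts.sum())

# Turning a multi-class problem into one-vs-rest creates imbalance
is_three = (target == 3).astype(int)
print(is_three.mean())
```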

### Accuracy
- correct / all predictions
- not useful with strong imbalance
- LOC 2390 of building ml powered
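
To see why accuracy misleads under strong imbalance, here is a small sketch with synthetic labels (5% positives, an assumed ratio) and a "model" that always predicts the majority class.

```python
import numpy as np

# Synthetic labels: 5 positives, 95 negatives (assumed ratio for illustration)
y_true = np.array([1] * 5 + [0] * 95)
y_majority = np.zeros_like(y_true)  # always predict the most common class

accuracy = np.mean(y_true == y_majority)
# Share of the actual positives that were detected
recall_pos = np.sum((y_true == 1) & (y_majority == 1)) / np.sum(y_true == 1)

print(accuracy, recall_pos)  # 0.95 0.0
```

The accuracy looks strong even though not a single positive is found.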

### Confusion matrix

| predicted ↓ / actual → | 0 | 1 |
| --- | --- | --- |
| 0 | tn | fn |
| 1 | fp | tp |
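
A sketch of computing the same counts with `sklearn.metrics.confusion_matrix`. Note that scikit-learn puts actual classes on rows and predicted classes on columns, which is the transpose of the table above. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
actual = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
predicted = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1])

# sklearn layout for binary labels: [[tn, fp], [fn, tp]]
cm = confusion_matrix(actual, predicted)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 2 1 3

# Transposing gives rows = predicted, columns = actual
print(cm.T)
```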


### Precision
- tp / (tp + fp)
- how many positive predictions were correct
- false detection
- fraction of all predictions of class 1 that are correct
- minimize false positives

### Recall
- tp / (tp + fn)
- how many positives did I detect out of all the positives
- misses
- how many of the true class 1 did I predict

**Note that only one term changes in the definitions of precision & recall!**
- whether you are dealing with fp (precision) or fn (recall)
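
The two definitions written out as code, using hypothetical counts, to make the single differing term explicit.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were detected."""
    return tp / (tp + fn)


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives
tp, fp, fn = 30, 10, 20
print(precision(tp, fp))  # 0.75
print(recall(tp, fn))     # 0.6
```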

There is always a tradeoff between precision & recall

Predictive maintenance
- false positives = ok, false negatives = not ok

## True / false positive rates

Measuring precision without recall (or recall without precision) = nonsense
- can maximize one at the cost of the other

F1
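
F1 is the harmonic mean of precision and recall, so it stays low when one is maximized at the cost of the other. A minimal sketch with made-up values:

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r (0 if both are 0)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Perfect precision but almost no recall gives a poor F1
lopsided = f1(1.0, 0.01)
# Balanced precision and recall
balanced = f1(0.6, 0.6)

print(round(lopsided, 4), round(balanced, 4))  # 0.0198 0.6
```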

Area under ROC
- 1 = perfect, 0.5 = random
- use for classifiers that can modify a threshold
- summarizes tradeoffs in varying the threshold
- what does under the line of ROC mean?

Threshold tuning

Average precision
- averaged over all recalls
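
A sketch of threshold tuning and average precision using `precision_recall_curve` and `average_precision_score`. The scores are synthetic and the recall target of 0.9 is an assumed requirement, chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
# Synthetic scores: positives score higher on average
scores = rng.normal(loc=y_true.astype(float), scale=1.0)

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions / recalls have one more entry than thresholds, so drop the last point
target_recall = 0.9  # assumed requirement
ok = recalls[:-1] >= target_recall
# Among thresholds that meet the recall target, take the most precise one
best = int(np.argmax(np.where(ok, precisions[:-1], -1.0)))
chosen_threshold = thresholds[best]

# Average precision summarizes the whole precision-recall curve in one number
ap = average_precision_score(y_true, scores)
print(chosen_threshold, precisions[best], recalls[best], ap)
```

Predictions are then made with `scores >= chosen_threshold` instead of the default cut-off.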

Create an imbalanced dataset, by treating only the threes as the positive class:

In [None]:

data = load_digits()

y = data.target
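# load_digits pixel intensities range from 0 to 16
x = data.data / 16

# add uniform noise to make the problem harder
noise = np.random.random(size=x.shape)
x += noise * 0.5
# positive class = the digit 3 (roughly 10% of samples)
y = (y == 3).astype(int)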

In [None]:
np.mean(y)

In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs')

model.fit(x, y)

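# NOTE: scored on the training data, so the resulting AUC is optimistic
# probability of the positive class (decision_function would return raw scores)
probs = model.predict_proba(x)[:, 1]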

In [None]:
probs

In [None]:
fpr, tpr, thresholds = roc_curve(y, probs)

roc = auc(fpr, tpr)

plt.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
print(roc)

### Regression metrics

MAE

MSE

RMSE

MAPE

MASE

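R2 (infamous; the coefficient of determination, closely related to explained variance)
- the proportion of the variation (dispersion) in the data that the model accounts for
- scaled
- 0 = no better than always predicting the mean, 1 = perfect, negative = worse than predicting the mean
- only compare on the same dataset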
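
A sketch of most of the metrics above in plain NumPy, on made-up numbers. MASE is left out because it needs a naive baseline forecast on a training series.

```python
import numpy as np

act = np.array([3.0, 5.0, 2.0, 8.0])   # hypothetical actual values
pred = np.array([2.0, 5.0, 4.0, 7.0])  # hypothetical predictions
err = act - pred

mae = np.mean(np.abs(err))          # mean absolute error
mse = np.mean(err ** 2)             # mean squared error
rmse = np.sqrt(mse)                 # same units as the target
mape = np.mean(np.abs(err / act))   # undefined if any actual is 0

# R2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(err ** 2) / np.sum((act - act.mean()) ** 2)

print(mae, mse, rmse, mape, r2)
```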

MAPE is not symmetric
- for the same absolute error, puts a heavier penalty on over-predictions (negative errors, where actual < prediction)

In [None]:
def mape(pred, act):
    return abs((pred - act) / act)

In [None]:
print(mape(100, 90))
print(mape(100, 110))

In [None]:
import sys
sys.path.append('..')

from common import load_iris

from sklearn.model_selection import cross_validate, train_test_split
from sklearn.naive_bayes import GaussianNB

## Splitting datasets (train or validation)

Be careful with random sampling of datasets - **data leakage**
- predicting the past from the future
- duplicates
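
A sketch of guarding against both kinds of leakage with pandas: drop exact duplicates before splitting, and split on time so the model trains on the past and validates on the future. The DataFrame, its column names and the 80/20 cut are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with a timestamp column and one exact duplicate row
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-02",
        "2020-01-03", "2020-01-04", "2020-01-05",
    ]),
    "feature": [1.0, 2.0, 2.0, 3.0, 4.0, 5.0],
    "target": [0, 1, 1, 0, 1, 0],
})

# Remove exact duplicates so the same row cannot land in both train and validation
df = df.drop_duplicates().sort_values("timestamp").reset_index(drop=True)

# Split on time rather than at random
cutoff = int(len(df) * 0.8)
train, valid = df.iloc[:cutoff], df.iloc[cutoff:]

print(len(train), len(valid))
```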

In [None]:
ds = load_iris()

x, y = ds.features, ds.target

y = pd.DataFrame(data=np.argmax(y.values, axis=1), index=y.index)

## Test set

Let's follow a best practice and split off a test set.  Reasons for this:
- unseen data for a final measure of generalization error
- only ever evaluate on this dataset once

In [None]:
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.1)
assert x_tr.shape[0] > x_te.shape[0]

## Cross-validation

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

https://scikit-learn.org/stable/modules/naive_bayes.html

CV
- use for hyperparams
- use all your data for test & train
- large K = small test set sizes
- computationally expensive
- avoid fitting your test set

When not to randomly sample
- time series

Standard cross-validation mixes past & future samples across folds!
- for time series see [sklearn.model_selection.TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html)
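
A small sketch of what `TimeSeriesSplit` produces, on a stand-in array of 12 time-ordered samples: every training fold ends before its test fold begins.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(12).reshape(-1, 1)  # stand-in for time-ordered samples

splits = list(TimeSeriesSplit(n_splits=3).split(series))
for train_idx, test_idx in splits:
    # Training indices always come before the test indices
    print(train_idx, test_idx)
```

The resulting splitter can be passed as the `cv` argument of `cross_validate` in place of an integer.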

Pick a model and do cross-validation.  Reasons for this:
- use all data
- don't overfit holdout set

In [None]:
model = GaussianNB(priors=[1/3 for _ in range(3)])

#  defaults to stratified KFold
results = cross_validate(
    model, 
    x_tr, 
    y_tr.values.flatten(), 
    scoring='accuracy', 
    cv=5,
    return_train_score=True
)

In [None]:
pd.DataFrame(results)

## Grid search + cross-validation

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Most of the time these occur together

In [None]:
from sklearn.model_selection import GridSearchCV

def make_prior(logits):
    s = sum(logits)
    return [e / s for e in logits]

params = {
    'n_estimators': [1, 10, 100],
    'max_features': [1, 2, 3]
}

model = RandomForestClassifier()

gs = GridSearchCV(model, params, cv=5, return_train_score=True)
gs.fit(x_tr, y_tr.values.reshape(-1, ))

res = gs.cv_results_
print(res.keys())

res = pd.DataFrame(res)

cols = [r for r in res.columns if ('score' in r and 'mean' in r)]

print(np.max(res.loc[:, 'mean_test_score']))
res.loc[:, cols]

## Random search + cross-validation

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

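Random search often finds good hyperparameters in fewer trials than grid search, especially when only a few hyperparameters matter (so they say)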

In [None]:
from sklearn.model_selection import RandomizedSearchCV

import scipy.stats as stats

params = {
    'n_estimators': stats.randint(10, 1000),
    'max_features': stats.randint(1, 4),  # upper bound is exclusive: samples 1, 2 or 3
    'max_depth': stats.randint(1, 30)
}

model = RandomForestClassifier()

rs = RandomizedSearchCV(model, params, cv=5, return_train_score=True, n_iter=32)
rs.fit(x_tr, y_tr.values.reshape(-1, ))

res = rs.cv_results_
res = pd.DataFrame(res)
cols = [r for r in res.columns if ('score' in r and 'mean' in r)]
print(np.max(res.loc[:, 'mean_test_score']))
res.loc[:, cols]

## A closer look at cross-validation

In [None]:
import sys
sys.path.append('..')
from common import load_forest_fires

In [None]:
forest = load_forest_fires()

In [None]:
model = RandomForestRegressor(n_estimators=50)

x = forest.loc[:, ['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']]
y = forest.loc[:, 'area']

from sklearn.metrics import mean_absolute_error

res = []
for _ in range(10):
    x, y = shuffle(x, y)
    
    results = cross_validate(
        model, 
        x, 
        y.values.flatten(), 
        scoring='neg_mean_absolute_error',
        cv=3,
        return_train_score=True
    )

    res.append(results['test_score'])
 
res = pd.DataFrame(res)
res.loc[:, 'avg'] = res.loc[:, :].mean(axis=1)
res

In [None]:
res.min().min(), res.max().max()

This noise (from how the data is split & from the randomness in the model) is unavoidable!
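
The noise cannot be removed, but it can be measured and reported. A sketch using `RepeatedKFold` on synthetic regression data. The dataset and the linear model are stand-ins, not the forest fires setup above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data
X, target = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

# 3 folds repeated 10 times with different shuffles gives 30 scores
cv = RepeatedKFold(n_splits=3, n_repeats=10, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, target, scoring="neg_mean_absolute_error", cv=cv
)

# Report the spread alongside the mean
print(scores.mean(), scores.std())
```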

## Bias-variance tradeoff versus double descent

[Belkin et al. (2019) Reconciling modern machine learning practice and the bias-variance trade-off](https://arxiv.org/pdf/1812.11118.pdf)

Traditional wisdom is that beyond a certain amount of model capacity, additional capacity is used to overfit
- the classical regime
- sometimes, bigger models are worse

Modern deep learning often shows the opposite
- bigger models are better

In 2019 the **double descent** phenomenon was described
- second regime (interpolating region) that occurs after a high capacity model has memorized the training data (interpolation threshold)

Idea: larger models are smoother
- norms of coefficients are smaller

> For smaller data sets, these large neural networks would be firmly in the over-parametrized regime, and simply training to obtain zero training risk often results in good test performance

![](assets/belkin-f1.png)

![](assets/belkin-f2.png)

![](assets/belkin-f3.png)

[Nakkiran et al. (2019) Deep Double Descent: Where Bigger Models and More Data Hurt](https://arxiv.org/pdf/1912.02292.pdf)

Same thing but for deep neural nets (resnets + transformers)

![](assets/Nakkiran-f1.png)