# Practical Machine Learning in Python
Part of Parrot Prediction's ESCO Courses.

## Evaluate results
In this notebook you will see how to measure the quality of the algorithm's performance.

**You will learn how to:**
- <a href="#pmetrics">use predefined evaluation metrics</a>,
- <a href="#cmetrics">write your own evaluation metrics</a>,
- <a href="#earlystopping">use the early stopping feature</a>,
- <a href="#cv">cross validate results</a>

### Prepare data
Begin with loading all required libraries

In [None]:
import numpy as np
import xgboost as xgb

from pprint import pprint

# reproducibility
seed = 123
np.random.seed(seed)

Load agaricus dataset from file

In [None]:
# load Agaricus data
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

Specify the training parameters: we are going to use 5 decision stumps (trees of depth 1) with a moderate learning rate.

In [None]:
# specify general training parameters
params = {
    'objective':'binary:logistic',
    'max_depth':1,
    'silent':1,
    'eta':0.5
}

num_rounds = 5

Before training the model, let's specify a `watchlist` to observe its performance on both datasets.

In [None]:
watchlist  = [(dtest,'test'), (dtrain,'train')]

### Using predefined evaluation metrics<a name='pmetrics' />

#### What is already available?
There are already [some](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md) predefined metrics available. You can pass them as the value of the `eval_metric` parameter when training the model.

- `rmse` - [root mean square error](https://www.wikiwand.com/en/Root-mean-square_deviation),
- `mae` - [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error?oldformat=true),
- `logloss` - [negative log-likelihood](https://en.wikipedia.org/wiki/Likelihood_function?oldformat=true),
- `error` - binary classification error rate. It is calculated as `#(wrong cases)/#(all cases)`. Predictions with probability $p > 0.5$ are treated as positive,
- `merror` - multiclass classification error rate. It is calculated as `#(wrong cases)/#(all cases)`,
- `auc` - [area under curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic?oldformat=true),
- `ndcg` - [normalized discounted cumulative gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain?oldformat=true),
- `map` - [mean average precision](https://en.wikipedia.org/wiki/Information_retrieval?oldformat=true)
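
To make some of these definitions concrete, here is a minimal NumPy sketch that computes `rmse`, `mae`, `logloss` and `error` by hand from true labels and predicted probabilities. The helper names and the toy arrays are illustrative assumptions, not part of XGBoost, and the library's own implementations may differ in details such as clipping or instance weights.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def logloss(y_true, p, eps=1e-15):
    """Negative log-likelihood for binary labels (probabilities are clipped)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def error_rate(y_true, p, threshold=0.5):
    """#(wrong cases) / #(all cases), treating p > threshold as positive."""
    preds = (p > threshold).astype(int)
    return float(np.mean(preds != y_true))

# toy labels and probabilities, made up for illustration
y_true = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.4, 0.7, 0.6])

for name, fn in [('rmse', rmse), ('mae', mae),
                 ('logloss', logloss), ('error', error_rate)]:
    print(name, fn(y_true, p))
```

Two of the five toy predictions fall on the wrong side of the threshold, so `error_rate` evaluates to 2/5 here.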

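By default the `error` metric is used for this objective in the XGBoost version this notebook was written for. Newer releases (1.3 and later) default to `logloss` for `binary:logistic`.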

In [None]:
bst = xgb.train(params, dtrain, num_rounds, watchlist)

To change it, simply add the `eval_metric` key to the `params` dictionary.

In [None]:
params['eval_metric'] = 'logloss'
bst = xgb.train(params, dtrain, num_rounds, watchlist)

You can also use multiple evaluation metrics at the same time.

In [None]:
params['eval_metric'] = ['logloss', 'auc']
bst = xgb.train(params, dtrain, num_rounds, watchlist)

### Creating custom evaluation metric<a name='cmetrics' />

To create your own evaluation metric, write a function that takes two arguments, the predicted probabilities and the `DMatrix` object holding the data being evaluated, and returns a `(name, value)` pair.

In this example our classification metric simply counts the number of misclassified examples, treating examples with $p > 0.5$ as positive. You can raise this threshold if you want positive predictions to be more certain. The model improves as the number of misclassified examples decreases, so remember to set `maximize=False` while training.

In [None]:
# custom evaluation metric
def misclassified(pred_probs, dtrain):
    labels = dtrain.get_label() # obtain true labels
    preds = pred_probs > 0.5 # obtain predicted values
    return 'misclassified', np.sum(labels != preds)
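
If you want to experiment with the threshold, one option is to build the metric with a small factory function. This is only a sketch, not an XGBoost API: `make_misclassified` and the `LabelHolder` stand-in (which merely mimics `DMatrix.get_label()` so the metric can be tried without training a model) are hypothetical names introduced for illustration.

```python
import numpy as np

def make_misclassified(threshold=0.5):
    """Build a metric that counts misclassified examples at `threshold`."""
    def metric(pred_probs, dmatrix):
        labels = dmatrix.get_label()          # true labels
        preds = pred_probs > threshold        # predicted classes
        return 'misclassified', int(np.sum(labels != preds))
    return metric

class LabelHolder:
    """Hypothetical stand-in exposing only get_label(), like a DMatrix."""
    def __init__(self, labels):
        self._labels = np.asarray(labels, dtype=np.float32)

    def get_label(self):
        return self._labels

# toy probabilities and labels, made up for illustration
probs = np.array([0.9, 0.6, 0.4, 0.2])
data = LabelHolder([1, 0, 1, 0])

name_default, errors_default = make_misclassified()(probs, data)
name_strict, errors_strict = make_misclassified(0.8)(probs, data)
print(name_default, errors_default, errors_strict)
```

The function returned by, say, `make_misclassified(0.7)` has the same two-argument signature as `misclassified`, so it could be passed in its place through the `feval` argument used in this notebook.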

In [None]:
bst = xgb.train(params, dtrain, num_rounds, watchlist, feval=misclassified, maximize=False)

You can see that even though the `params` dictionary holds an `eval_metric` key, its values are ignored and overridden by `feval`.

### Extracting the evaluation results
You can get the evaluation scores by declaring a dictionary to hold the values and passing it as the `evals_result` argument.

In [None]:
evals_result = {}
bst = xgb.train(params, dtrain, num_rounds, watchlist, feval=misclassified, maximize=False, evals_result=evals_result)

Now you can reuse these scores (e.g. for plotting).

In [None]:
pprint(evals_result)
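
The dictionary is nested: the outer keys are the names given in `watchlist`, the inner keys are metric names, and each value is a list with one score per boosting round. The sketch below uses hand-written placeholder numbers (not real output from this notebook) to show how the best round could be located. `best_round` is a hypothetical helper, not an XGBoost function.

```python
import numpy as np

# Placeholder scores for illustration only, NOT real training results
example_result = {
    'test': {'misclassified': [70.0, 35.0, 40.0, 22.0, 22.0]},
    'train': {'misclassified': [300.0, 150.0, 160.0, 90.0, 85.0]},
}

def best_round(evals_result, dataset, metric, maximize=False):
    """Return (round index, score) of the best round for one dataset/metric."""
    scores = np.asarray(evals_result[dataset][metric])
    idx = int(np.argmax(scores) if maximize else np.argmin(scores))
    return idx, float(scores[idx])

round_idx, score = best_round(example_result, 'test', 'misclassified')
print("best test round: {} (score {})".format(round_idx, score))
```

When several rounds tie, `np.argmin` and `np.argmax` return the earliest one, which also corresponds to the smallest model.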

### Early stopping<a name='earlystopping' />
There is a nice optimization trick when fitting multiple trees. 

You can train the model until the validation score stops improving. The validation error needs to decrease at least once every `early_stopping_rounds` rounds for training to continue. This approach results in a simpler model, because training stops close to the best number of trees.

In the following example up to 1500 trees may be created, but we tell the trainer to stop if the monitored score does not improve for ten consecutive iterations. Note that XGBoost monitors the last entry in `watchlist`, which here is `train`.

In [None]:
params['eval_metric'] = 'error'
num_rounds = 1500

bst = xgb.train(params, dtrain, num_rounds, watchlist, early_stopping_rounds=10)
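
The stopping rule itself is easy to state in plain Python. The sketch below is a simplified model of the idea, not XGBoost's actual implementation: `early_stopping_point` is a hypothetical helper that walks through a list of per-round scores and stops once `patience` rounds have passed without improvement. The scores are made up for illustration.

```python
def early_stopping_point(scores, patience, maximize=False):
    """Return (best_iteration, best_score, last_iteration_run).

    `scores` must be a non-empty sequence with one score per round.
    """
    best_iter, best_score = 0, scores[0]
    for i in range(1, len(scores)):
        current = scores[i]
        improved = current > best_score if maximize else current < best_score
        if improved:
            best_iter, best_score = i, current
        elif i - best_iter >= patience:
            # no improvement for `patience` rounds: stop here
            return best_iter, best_score, i
    return best_iter, best_score, len(scores) - 1

# made-up validation errors, one per boosting round
val_errors = [0.10, 0.08, 0.07, 0.07, 0.09, 0.07, 0.08]

best_iter, best_score, stopped_at = early_stopping_point(val_errors, patience=3)
print(best_iter, best_score, stopped_at)
```

Ties do not count as improvement in this sketch, so the repeated score in later rounds does not reset the counter.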

When the `early_stopping_rounds` parameter is used, the resulting model will have 3 additional fields: `bst.best_score`, `bst.best_iteration` and `bst.best_ntree_limit`.

In [None]:
print("Booster best train score: {}".format(bst.best_score))
print("Booster best iteration: {}".format(bst.best_iteration))
print("Booster best number of trees limit: {}".format(bst.best_ntree_limit))

Also keep in mind that `train()` will return a model from the last iteration, not the best one.
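
To see why this matters, recall that a boosted model's raw score is a sum of per-tree contributions, so predicting with the "best" model amounts to summing only the first few trees. The toy sketch below illustrates this with NumPy. The contribution matrix is made up, the base score is ignored, and `predict_with_first_k` is a hypothetical helper rather than an XGBoost call.

```python
import numpy as np

def predict_with_first_k(tree_margins, k):
    """Sum the raw scores of the first k trees, then apply the sigmoid."""
    margin = tree_margins[:k].sum(axis=0)
    return 1.0 / (1.0 + np.exp(-margin))

# Hypothetical raw contributions: 4 trees (rows) x 3 samples (columns)
tree_margins = np.array([
    [0.80, -0.60,  0.20],
    [0.40, -0.30,  0.10],
    [0.10, -0.10, -0.20],
    [0.05,  0.00, -0.30],
])

best_iteration = 1  # assumption: validation was best after the second tree

p_best = predict_with_first_k(tree_margins, best_iteration + 1)
p_last = predict_with_first_k(tree_margins, len(tree_margins))
print("best-iteration model:", p_best)
print("last-iteration model:", p_last)
```

In this toy case the third sample ends up on different sides of 0.5 depending on how many trees are used. In the XGBoost versions this notebook targets, `bst.best_ntree_limit` is intended to be passed to `predict` for this purpose; the exact argument name depends on your installed version, so check its documentation.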

### Cross validating results<a name='cv' />
The native package provides an option for cross-validating results, although it is not as sophisticated as the one in the scikit-learn package. The next cell shows a basic run. Notice that we pass only a single `DMatrix`, so it would be good to merge train and test into one object to have more training samples.

In [None]:
num_rounds = 10
hist = xgb.cv(params, dtrain, num_rounds, nfold=10, metrics={'error'}, seed=seed)
hist

Notice that:

- by default we get a pandas data frame object (can be changed with `as_pandas` param),
- metrics are passed as an argument (multiple values are allowed),
- we can use our own evaluation metrics (params `feval` and `maximize`),
- we can use early stopping feature (param `early_stopping_rounds`)
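
For intuition about what `nfold` does, the sketch below splits sample indices into folds with NumPy so that every sample is used for validation exactly once. It is a simplified illustration under stated assumptions: `kfold_indices` is a hypothetical helper, and XGBoost's own fold construction (for example with stratification) may differ.

```python
import numpy as np

def kfold_indices(n_samples, nfold, seed=0):
    """Yield (train_idx, valid_idx) pairs for nfold >= 2 shuffled folds."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, nfold)   # folds may differ in size by one
    for i in range(nfold):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(nfold) if j != i])
        yield train_idx, valid_idx

splits = list(kfold_indices(n_samples=25, nfold=10, seed=123))

for train_idx, valid_idx in splits[:3]:
    print(len(train_idx), "training rows,", len(valid_idx), "validation rows")
```

A model is then trained on each `train_idx` and scored on the matching `valid_idx`, and the per-fold scores are averaged to produce the per-round mean and standard deviation that the cross-validation table reports.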