# Model Evaluation (Part 1): Parameter Selection

## Cross-validation: evaluating estimator performance

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.
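The "repeat the labels" failure mode can be made concrete with a toy sketch. Everything here is an assumption made for illustration: the `Memorizer` class is hypothetical, the label is simply the parity of `x`, and unseen inputs fall back to label 0.

```python
# Toy data: the label is the parity of x (an arbitrary choice for illustration).
train_x = list(range(8))        # samples the model gets to see
test_x = list(range(8, 12))     # yet-unseen samples


def label(x):
    return x % 2


class Memorizer:
    """Hypothetical 'model' that only repeats labels it has already seen."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        return self

    def predict(self, xs):
        # Unseen inputs fall back to an arbitrary default label (0).
        return [self.table.get(x, 0) for x in xs]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


train_y = [label(x) for x in train_x]
test_y = [label(x) for x in test_x]

model = Memorizer().fit(train_x, train_y)
train_acc = accuracy(train_y, model.predict(train_x))  # perfect on seen data
test_acc = accuracy(test_y, model.predict(test_x))     # no better than guessing
print(train_acc, test_acc)
```

Scoring on the training samples reports a perfect result, while the held-out samples reveal that nothing useful was learned.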

In [None]:
# Sklearn 0.18
sklearn.model_selection.train_test_split(*arrays, **options)

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
* A model is trained using k-1 of the folds as training data;
* the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
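The k-fold loop described above can be sketched in plain Python. This is not scikit-learn's implementation: `k_fold_indices`, `cross_validate` and the majority-class "model" are hypothetical helpers, and the ten labels are made up.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(fit, score, X, y, k):
    """Train on k-1 folds, score on the remaining fold, once per fold."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return scores


def fit_majority(X_train, y_train):
    # The "model" is just the most frequent training label.
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_test, y_test):
    return sum(label == model for label in y_test) / len(y_test)


X = list(range(10))
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1]
scores = cross_validate(fit_majority, accuracy, X, y, k=5)
mean_score = sum(scores) / len(scores)  # the value k-fold CV reports
print(scores, mean_score)
```

Every sample is used for validation exactly once and for training k-1 times, which is why no separate validation set has to be set aside.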

### Computing cross-validated metrics

In [None]:
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, 
                                        fit_params=None, pre_dispatch='2*n_jobs')

In [2]:
# sklearn.cross_validation is deprecated as of 0.18; use sklearn.model_selection
from sklearn.model_selection import cross_val_score
import pandas as pd 

In [3]:
df = pd.read_csv("adultTest.csv")

In [6]:
dfNew = pd.get_dummies(data = df,columns = ['workclass','education','marital-status','occupation',
                                           'relationship','race','sex','native-country'])

In [9]:
dfNew['class'] = df['class'].map(lambda s :s.strip(" "))

In [15]:
dfNew.loc[dfNew['class']=='<=50K','target'] = 0
dfNew.loc[dfNew['class']!='<=50K','target'] = 1

In [16]:
dfNew.target.value_counts()

0.0    24720
1.0     7841
Name: target, dtype: int64

In [14]:
dfNew['class'].value_counts()

<=50K    24720
>50K      7841
Name: class, dtype: int64

In [17]:
dfNew.drop("class",axis = 1,inplace = True)

In [18]:
X = dfNew.drop("target",axis = 1)
y = dfNew['target']

In [24]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate=0.01,n_estimators=50,max_depth=2)
cross_val_score(gbc,X,y,cv = 5,n_jobs=-1)

array([ 0.76708122,  0.76827396,  0.76765971,  0.76796683,  0.76858108])

In [25]:
gbc = GradientBoostingClassifier(learning_rate=0.1,n_estimators=50,max_depth=4)
cross_val_score(gbc,X,y,cv = 5,n_jobs=-1)

array([ 0.8601259 ,  0.85949017,  0.86517199,  0.86578624,  0.86701474])

### Cross validation iterators
1. K-fold
2. Stratified k-fold
3. Label k-fold
4. Leave-One-Out - LOO
5. Leave-P-Out - LPO

cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:

* None, to use the default 3-fold cross validation,
* integer, to specify the number of folds in a (Stratified)KFold,
* An object to be used as a cross-validation generator.
* An iterable yielding train, test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
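To illustrate what stratification means, here is a simplified plain-Python sketch that deals the indices of each class round-robin into the test folds. It is an assumption-laden stand-in (`stratified_folds` is a hypothetical helper) and does not reproduce the exact fold assignment of scikit-learn's StratifiedKFold.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k test folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return [sorted(fold) for fold in folds]


labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
folds = stratified_folds(labels, 2)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
for fold, counts in zip(folds, fold_counts):
    print(fold, dict(counts))
```

Each test fold ends up with the same 2:3 class ratio as the full label vector, which a plain contiguous split of these sorted labels would not give.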

In [26]:
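import numpy as np
from sklearn.model_selection import KFold, cross_val_score

# 2-fold split of 4 samples; split() yields (train, test) index arrays
kf = KFold(n_splits=2)
for train, test in kf.split(np.arange(4)):
    print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]


In [29]:
# 5 contiguous (unshuffled, unstratified) folds over the full dataset
kf1 = KFold(n_splits=5)
gbc = GradientBoostingClassifier(learning_rate=0.1, n_estimators=50, max_depth=4)
cross_val_score(gbc, X, y, cv=kf1, n_jobs=-1)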

array([ 0.86058652,  0.86148649,  0.8659398 ,  0.86624693,  0.86394349])

In [30]:
from sklearn.model_selection import StratifiedKFold

labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
# Each test fold keeps roughly the 4:6 class ratio of labels
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(np.zeros(len(labels)), labels):
    print("%s %s" % (train, test))

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]


In [31]:
skf1 = StratifiedKFold(n_splits=5)

In [None]:
# e.g. 1000 samples: 980 for training, 20 for testing



### Random permutations cross-validation a.k.a. Shuffle & Split

**ShuffleSplit**

The ShuffleSplit iterator will generate a user defined number of independent train / test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.

It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state pseudo random number generator.
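The shuffle-then-split idea, and the effect of seeding, can be sketched without scikit-learn. `shuffle_split` below is a hypothetical helper; its rounding of the test size is an assumption and may differ from the library's.

```python
import random


def shuffle_split(n_samples, n_iter=10, test_size=0.1, seed=None):
    """Yield n_iter independent (train, test) index lists."""
    rng = random.Random(seed)
    n_test = max(1, int(round(test_size * n_samples)))
    for _ in range(n_iter):
        indices = list(range(n_samples))
        rng.shuffle(indices)            # shuffle first ...
        yield indices[n_test:], indices[:n_test]  # ... then split


# The same seed reproduces the same sequence of splits.
splits_a = list(shuffle_split(20, n_iter=3, test_size=0.3, seed=0))
splits_b = list(shuffle_split(20, n_iter=3, test_size=0.3, seed=0))
print(splits_a == splits_b)
```

Unlike k-fold, the test sets of different iterations may overlap, because every iteration reshuffles all samples.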

In [None]:
class sklearn.model_selection.ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None)

In [34]:
from sklearn.model_selection import ShuffleSplit

# 10 independent random splits (the default), each holding out 30% of the samples
ss = ShuffleSplit(n_splits=10, test_size=0.3)
for train, test in ss.split(X):
    print(train)


[24255  3865 15177 ..., 28494 10767 17198]
[29492 18305   826 ..., 25407 10505 14903]
[11682  9477 14044 ..., 18960  1101  5806]
[31568 21708 23670 ..., 17945 10073 16038]
[ 5335 15985 11595 ..., 31148 10064 22052]
[ 6822  8093 17338 ..., 27021 23028  1316]
[10876  4771  8434 ..., 28849 19739  8584]
[ 4943 22970 29439 ..., 25699  2630 26840]
[ 9982 18995 12137 ...,  7182  1025 21787]
[ 5690 26026  7509 ...,  3540 23380 15510]


**A note on shuffling**

If the data ordering is not arbitrary (e.g. samples with the same label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.

Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:

1. This consumes less memory than shuffling the data directly.
2. By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
3. The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.
4. To ensure results are repeatable (on the same platform), use a fixed value for random_state.

## Grid Search: Searching for estimator parameters

A search consists of:
1. an estimator (regressor or classifier such as sklearn.svm.SVC());
2. a parameter space;
3. a method for searching or sampling candidates;
4. a cross-validation scheme; and
5. a score function.

**GridSearchCV exhaustively considers all parameter combinations**

**RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution**
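What "all parameter combinations" means can be sketched with `itertools.product`. The `iter_grid` helper is hypothetical, and the grid values are borrowed from the GridSearchCV cell later in this notebook.

```python
from itertools import product

param_grid = {
    "learning_rate": [0.001, 0.05],
    "n_estimators": [50, 100],
    "subsample": [0.7, 1.0],
}


def iter_grid(grid):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(iter_grid(param_grid))
n_fits = len(candidates) * 5  # each candidate is refit once per CV fold
print(len(candidates), n_fits)
```

With 2 x 2 x 2 values there are 8 candidates, so 5-fold cross-validation needs 40 fits. The count multiplies with every parameter added, which is the cost that randomized search avoids.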

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:
1. A budget can be chosen independent of the number of parameters and possible values.
2. Adding parameters that do not influence the performance does not decrease efficiency.

Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for GridSearchCV. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the n_iter parameter. For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified.

* scipy.stats

* numpy.random
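The sampling scheme can be sketched in plain Python. This is only an illustration: `sample_candidates` is a hypothetical helper, the ranges are made up, and each "distribution" is a callable taking a random generator, standing in for the distribution objects scikit-learn would accept.

```python
import random

param_distributions = {
    # Callables stand in for continuous distributions (an assumption).
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, -1),  # log-uniform
    "subsample": lambda rng: rng.uniform(0.5, 1.0),
    # A plain list is sampled uniformly.
    "n_estimators": [50, 100, 200],
}


def sample_candidates(distributions, n_iter, seed=None):
    """Draw n_iter parameter settings, one value per parameter each time."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        params = {}
        for name, spec in sorted(distributions.items()):
            params[name] = spec(rng) if callable(spec) else rng.choice(spec)
        candidates.append(params)
    return candidates


candidates = sample_candidates(param_distributions, n_iter=5, seed=0)
for params in candidates:
    print(params)
```

The budget `n_iter` stays at 5 no matter how many parameters or values are added, which is the first benefit listed above.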

In [None]:
class sklearn.grid_search.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, 
                                       verbose=0, pre_dispatch='2*n_jobs', error_score='raise')

In [None]:
class sklearn.grid_search.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, fit_params=None, n_jobs=1, 
                                             iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, 
                                             error_score='raise')

In [36]:
GradientBoostingClassifier?

In [46]:
# Exhaustive search over a 2 x 2 x 2 grid of GradientBoostingClassifier settings, scored by ROC AUC
from sklearn.grid_search import GridSearchCV
clf = GradientBoostingClassifier()
param_grid ={"learning_rate":[0.001,0.05],'n_estimators':[50,100],'subsample':[0.7,1.0]}
gscv = GridSearchCV(clf,param_grid,n_jobs= -1,verbose = 1,cv = 5,error_score = 0,scoring='roc_auc')

In [47]:
gscv.fit(X,y)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.0min finished


GridSearchCV(cv=5, error_score=0,
       estimator=GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'learning_rate': [0.001, 0.05], 'subsample': [0.7, 1.0], 'n_estimators': [50, 100]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=1)

In [48]:
gscv.grid_scores_

[mean: 0.86505, std: 0.00375, params: {'learning_rate': 0.001, 'subsample': 0.7, 'n_estimators': 50},
 mean: 0.85991, std: 0.00464, params: {'learning_rate': 0.001, 'subsample': 1.0, 'n_estimators': 50},
 mean: 0.86711, std: 0.00358, params: {'learning_rate': 0.001, 'subsample': 0.7, 'n_estimators': 100},
 mean: 0.86212, std: 0.00298, params: {'learning_rate': 0.001, 'subsample': 1.0, 'n_estimators': 100},
 mean: 0.90731, std: 0.00290, params: {'learning_rate': 0.05, 'subsample': 0.7, 'n_estimators': 50},
 mean: 0.90715, std: 0.00310, params: {'learning_rate': 0.05, 'subsample': 1.0, 'n_estimators': 50},
 mean: 0.91488, std: 0.00324, params: {'learning_rate': 0.05, 'subsample': 0.7, 'n_estimators': 100},
 mean: 0.91453, std: 0.00300, params: {'learning_rate': 0.05, 'subsample': 1.0, 'n_estimators': 100}]

[Comparing randomized search and grid search for hyperparameter estimation](http://scikit-learn.org/0.17/auto_examples/model_selection/randomized_search.html#example-model-selection-randomized-search-py)

#### Tips
1. Specifying an objective metric
2. Model selection: development and evaluation
3. Parallelism
4. Robustness to failure

#### Model specific cross-validation
1. linear_model.ElasticNetCV([l1_ratio, eps, ...]): Elastic Net model with iterative fitting along a regularization path
2. linear_model.LarsCV([fit_intercept, ...]): Cross-validated Least Angle Regression model
3. linear_model.LassoCV([eps, n_alphas, ...]): Lasso linear model with iterative fitting along a regularization path
4. linear_model.LassoLarsCV([fit_intercept, ...]): Cross-validated Lasso, using the LARS algorithm
5. linear_model.LogisticRegressionCV([Cs, ...]): Logistic Regression CV (aka logit, MaxEnt) classifier.
6. linear_model.MultiTaskElasticNetCV([...]): Multi-task L1/L2 ElasticNet with built-in cross-validation.
7. linear_model.MultiTaskLassoCV([eps, ...]): Multi-task L1/L2 Lasso with built-in cross-validation.
8. linear_model.OrthogonalMatchingPursuitCV([...]): Cross-validated Orthogonal Matching Pursuit model (OMP)
9. linear_model.RidgeCV([alphas, ...]): Ridge regression with built-in cross-validation.
10. linear_model.RidgeClassifierCV([alphas, ...]): Ridge classifier with built-in cross-validation.

#### Out of Bag Estimates
1. ensemble.RandomForestClassifier([...]): A random forest classifier.
2. ensemble.RandomForestRegressor([...]): A random forest regressor.
3. ensemble.ExtraTreesClassifier([...]): An extra-trees classifier.
4. ensemble.ExtraTreesRegressor([n_estimators, ...]): An extra-trees regressor.
5. ensemble.GradientBoostingClassifier([loss, ...]): Gradient Boosting for classification.
6. ensemble.GradientBoostingRegressor([loss, ...]): Gradient Boosting for regression.

In [49]:
clf = GradientBoostingClassifier()
clf.fit(X,y)

GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [52]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(oob_score=True)
rf.fit(X,y)

  warn("Some inputs do not have OOB scores. "


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)

In [54]:
rf.oob_score_

0.83354319584779335
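A rough sketch of where out-of-bag samples come from: each tree is trained on a bootstrap sample (drawn with replacement), and the samples it never drew can serve as its private validation set. `bootstrap_oob` is a hypothetical helper, and the sample and tree counts are chosen only for illustration.

```python
import random


def bootstrap_oob(n_samples, rng):
    """Draw a bootstrap sample; return (in-bag draws, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)
n_samples, n_trees = 1000, 10

bags = [bootstrap_oob(n_samples, rng) for _ in range(n_trees)]
oob_sets = [set(oob) for _, oob in bags]

# Roughly (1 - 1/n)^n, about 37%, of the samples are out-of-bag for a given tree.
oob_fraction = len(oob_sets[0]) / n_samples

# Samples drawn by every single tree never receive an out-of-bag prediction.
never_oob = set(range(n_samples)) - set().union(*oob_sets)
print(oob_fraction, len(never_oob))
```

With only a handful of trees, a few samples can be in-bag for all of them, which is the situation the "Some inputs do not have OOB scores" warning reports; using more trees makes it increasingly unlikely.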

# Model Evaluation (Part 2)

## Model evaluation: quantifying the quality of predictions
1. Estimator score method
2. Scoring parameter
3. Metric functions

#### [The scoring parameter: defining model evaluation rules](http://scikit-learn.org/0.17/modules/model_evaluation.html)

#### [Classification metrics](http://scikit-learn.org/0.17/modules/model_evaluation.html)
![precision](http://scikit-learn.org/0.17/_images/math/f8029f7b6c8fc80db737d850ed8e10ea8f27e410.png)
![recall](http://scikit-learn.org/0.17/_images/math/ca017d3d38a5a935ae8bee84d8143b44f1b32c9a.png)
![F](http://scikit-learn.org/0.17/_images/math/b6183c8fb10498f949131f2aa67eeb1256cdc68a.png)
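
For reference, the standard definitions that the three images depict are written out below, with $tp$, $fp$ and $fn$ denoting the numbers of true positives, false positives and false negatives (this notation is an assumption; the images may use slightly different symbols):

$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}$$

$$F_\beta = (1 + \beta^2)\,\frac{\text{precision} \cdot \text{recall}}{\beta^2\,\text{precision} + \text{recall}}$$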

### ROC
![roc](http://scikit-learn.org/0.17/_images/plot_roc_0011.png)

In [None]:
sklearn.metrics.precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)

In [None]:
sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
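
As a sketch of what an ROC curve contains, the following plain-Python code sweeps a decision threshold over the scores and records one (false positive rate, true positive rate) point per threshold. It assumes binary 0/1 labels with both classes present; `roc_points` and `auc` are hypothetical helpers, not the library functions, and the four samples are made up.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) points, one per distinct threshold, from strict to lax."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(y_score), reverse=True):
        predicted_pos = [score >= threshold for score in y_score]
        tp = sum(p and t == 1 for p, t in zip(predicted_pos, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted_pos, y_true))
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, y_score)
roc_auc = auc(points)
print(points, roc_auc)
```

An area of 0.5 corresponds to random ranking and 1.0 to a perfect one; here one negative outranks one positive, so the area is 0.75.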

In [None]:
sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None)
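
A minimal sketch of the counting behind a binary confusion matrix, with rows indexed by the true label and columns by the predicted label. `confusion_matrix_2x2` is a hypothetical helper that assumes labels are exactly 0 and 1; the eight samples are made up.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """matrix[i][j] counts samples with true label i and predicted label j."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
matrix = confusion_matrix_2x2(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
print(matrix)
```

The diagonal holds the correct predictions; the single off-diagonal count here is a positive sample that was predicted negative.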

In [None]:
sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2)

In [None]:
sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)

In [None]:
sklearn.metrics.fbeta_score(y_true, y_pred, beta, labels=None, pos_label=1, average='binary', sample_weight=None)
# The beta parameter determines the weight of recall relative to precision in the combined score. beta < 1 lends more weight to precision,
# while beta > 1 favors recall (beta -> 0 considers only precision, beta -> inf only recall).
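
The effect of beta can be checked on a small made-up example. The helpers below are hypothetical plain-Python stand-ins, not the library functions, and they assume binary 0/1 labels with 1 as the positive class.

```python
def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)


def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]  # no false positives, one missed positive

precision, recall = precision_recall(y_true, y_pred)
f_half = fbeta(precision, recall, beta=0.5)  # leans towards precision
f_one = fbeta(precision, recall, beta=1.0)   # balanced F1
f_two = fbeta(precision, recall, beta=2.0)   # leans towards recall
print(precision, recall, f_half, f_one, f_two)
```

Because precision (1.0) is higher than recall (0.75) here, the score falls as beta grows: the more weight recall gets, the more the missed positive costs.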