# Gradient Boosting Regressor

## Used in projects

* Vykonia

## Information Sources

Scikit-learn:
* [scikit API](http://scikit-learn.org/stable/modules/classes.html)
* [GradientBoostingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)
* [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)
* [learning curve sklearn](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html)
* [learning curve documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html#sklearn.model_selection.learning_curve)

Other sources:
* [parameter tuning](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)

## Parameters and their tuning

Good general guidance can be found in this [parameter tuning guide](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).


**Note:** the guide was written for classification, so the concrete numbers below may not transfer directly to regression.

The guide suggests the following approach:

1. Choose a relatively high learning rate. The default of 0.1 generally works, but values between 0.05 and 0.2 can suit different problems.
2. Determine the optimal number of trees for this learning rate, typically around 40-70. Choose a value your system can fit quickly, because it is reused while testing scenarios and tuning the tree parameters.
3. Tune the tree-specific parameters for the chosen learning rate and number of trees.
4. Lower the learning rate and increase the number of estimators proportionally to get a more robust model.
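
The four steps above can be sketched with `GridSearchCV`. This is a minimal illustration on synthetic data, not the procedure used in any project: the data set, the grids, the helper name `search`, and the halving factor in step 4 are all assumptions chosen to keep the run short. `neg_mean_squared_error` is used because this notebook deals with regression.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data; the sizes are illustrative assumptions.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)


def search(fixed_params, grid):
    """Grid-search `grid` while holding the other parameters at `fixed_params`."""
    gs = GridSearchCV(
        GradientBoostingRegressor(random_state=0, **fixed_params),
        param_grid=grid,
        scoring="neg_mean_squared_error",  # negated MSE, so higher is better
        cv=3,
    )
    gs.fit(X_demo, y_demo)
    return gs.best_params_


# Step 1: fix a relatively high learning rate.
params = {"learning_rate": 0.1}
# Step 2: find the number of trees for that learning rate.
params.update(search(params, {"n_estimators": [40, 50, 60, 70]}))
# Step 3: tune tree-specific parameters with steps 1 and 2 held fixed.
params.update(search(params, {"max_depth": [2, 3, 5], "min_samples_split": [2, 10]}))
# Step 4: lower the learning rate and raise the number of trees proportionally.
params["learning_rate"] /= 2
params["n_estimators"] *= 2

final_model = GradientBoostingRegressor(random_state=0, **params).fit(X_demo, y_demo)
print(params)
```

Tuning in stages keeps each grid small; a joint grid over all parameters would be more thorough but multiplies the number of fits.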


Loss function:
* The default is least squares (`loss="squared_error"` in current scikit-learn; older releases called it `"ls"`).

Boosting parameters:
* n_estimators
    - number of trees $B$
    - default: 100
* learning_rate
    - shrinkage $\lambda$
    - default: 0.1
    - typically 0.001-0.01 or 0.05-0.2
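
How $B$ and $\lambda$ interact can be written out for the least-squares loss. This is the textbook form of the algorithm, given as a sketch; the symbols $f_b$, $h_b$ and $r_{ib}$ are notation introduced here for illustration.

$$
\begin{aligned}
f_0(x) &= \bar{y}, \\
r_{ib} &= y_i - f_{b-1}(x_i), \qquad i = 1, \dots, n, \\
f_b(x) &= f_{b-1}(x) + \lambda \, h_b(x), \qquad b = 1, \dots, B,
\end{aligned}
$$

where $h_b$ is a regression tree fitted to the residuals $r_{ib}$. The final model is $f_B(x) = \bar{y} + \lambda \sum_{b=1}^{B} h_b(x)$, so a smaller $\lambda$ shrinks the contribution of every tree and usually requires a larger $B$ to reach the same fit.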
    
Tree-specific parameters:
* max_depth
    - controls the number of splits; a tree of depth `max_depth` has at most 2^(max_depth) leaves, i.e. 2^(max_depth) - 1 splits
    - choose according to the number of observations
* min_samples_split
    - 0.5-1% of the sample, but strongly task-dependent
* min_samples_leaf
    - chosen largely by intuition
    - smaller values for imbalanced classes
* max_features
    - usually the square root of the number of features
* subsample
    - values < 1 lead to stochastic gradient boosting
    - usually around 0.8
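
The rules of thumb above can be turned into a starting configuration. A minimal sketch, assuming `n_samples = 1000` as in this notebook and a hypothetical `n_features = 16`; the concrete choices (1% for `min_samples_split`, `min_samples_leaf=5`, `max_depth=3`) are illustrative starting points, not tuned results.

```python
from sklearn.ensemble import GradientBoostingRegressor

n_samples, n_features = 1000, 16  # hypothetical data set shape

max_depth = 3
max_leaves = 2 ** max_depth       # upper bound on leaves for a tree of this depth
max_splits = 2 ** max_depth - 1   # upper bound on internal splits

model = GradientBoostingRegressor(
    n_estimators=100,                             # B, library default
    learning_rate=0.1,                            # lambda, library default
    max_depth=max_depth,
    min_samples_split=max(2, n_samples // 100),   # about 1% of the sample, never below 2
    min_samples_leaf=5,                           # assumed value, chosen by intuition
    max_features="sqrt",                          # square root of the number of features
    subsample=0.8,                                # < 1 gives stochastic gradient boosting
    random_state=0,
)
print(model.get_params()["min_samples_split"], max_leaves, max_splits)
```

The `max(2, ...)` guard is there because scikit-learn rejects an integer `min_samples_split` below 2.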

Miscellaneous Parameters:



## Libraries

In [2]:
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

%matplotlib notebook

# model, cross-validation and metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV
from sklearn.metrics import mean_squared_error, explained_variance_score, mean_absolute_error, r2_score, make_scorer
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve

## Data

In [3]:
n = 1000       # number of samples
p = 1          # number of features
noise_std = 5  # standard deviation of the Gaussian noise added to the target
X, Y = make_regression(
    n_samples=n,
    n_features=p,
    noise=noise_std,
)

In [4]:
plt.scatter(X,Y)


## Learning Curve

In [5]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default cross-validation (5-fold since scikit-learn 0.22, 3-fold before),
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
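    # ylim is currently ignored: the y-axis shows mean squared error,
    # whose range depends on the data.
    #if ylim is not None:
    #    plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Mean squared error")
    # make_scorer defaults to greater_is_better=True, so the values are raw MSE (lower is better)
    my_scorer = make_scorer(mean_squared_error)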
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring = my_scorer)
    
    print(train_sizes)
    print(train_scores)
    print(test_scores)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
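
The helper above reports raw mean squared errors, where lower is better. A common alternative, sketched here, is scikit-learn's built-in `"neg_mean_squared_error"` scorer, which follows the library's "higher is better" convention and only needs its sign flipped for plotting. The data set and split settings below are illustrative assumptions, kept small so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, learning_curve

# Illustrative data and splitter (assumed sizes, smaller than in the notebook).
X_demo, y_demo = make_regression(n_samples=500, n_features=1, noise=5.0, random_state=0)
cv_demo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    GradientBoostingRegressor(random_state=0),
    X_demo, y_demo,
    cv=cv_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error",  # negated MSE, so higher is better
)

# Flip the sign to recover the mean squared error per training size.
train_mse = -train_scores.mean(axis=1)
test_mse = -test_scores.mean(axis=1)
print(sizes)
print(train_mse)
print(test_mse)
```

Using the built-in scorer keeps the same scoring string valid for `GridSearchCV`, which always maximizes its score.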

In [6]:
title = "Learning Curves (Gradient Boosting Regressor)"
estimator = GradientBoostingRegressor()
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

# ylim is not passed: the scores are mean squared errors, not accuracies in [0, 1]
plot_learning_curve(estimator, title, X, Y, cv=cv, n_jobs=4)


[ 80 260 440 620 800]
[[  2.53114874   1.75422148   2.29522349   0.8586068    2.28607708
    2.26419793   1.5092643    2.96848676   2.60520862   2.08167958
    2.36096427   0.8855471    1.62298771   2.74299878   4.80125026
    1.79212834   1.97969497   3.10499316   2.17078014   1.45641652
    2.34012851   1.94693664   1.89994532   1.1905884    1.10788444
    1.39344677   3.18539981   1.28802335   2.50483497   2.5255775
    1.57390415   1.28570953   1.33138749   0.78295841   1.58893342
    1.33511651   1.27549307   1.76751806   1.75214425   1.24409469
    1.6994461    1.80373621   2.41097061   2.35675066   1.03687719
    1.75537011   2.68636756   1.93360821   1.83575699   2.22621787
    1.80865271   2.50681205   2.1193367    1.44946368   1.6366535
    1.14560002   1.531135     2.20097272   1.17765455   2.18947278
    1.51778612   1.55310421   2.61503701   2.48631456   1.4705869
    3.03070879   1.82732479   1.20678007   2.63915392   1.57874554
    1.56798426   2.95048432   1.3501104    


## Model setting

In [7]:
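par_B = 500        # n_estimators, default 100
par_lambda = 0.1   # learning_rate, default 0.1
par_max_depth = 8  # max_depth, default 3

# 0.1% of the sample; evaluates to 1 for n = 1000, but scikit-learn requires an integer min_samples_split >= 2
par_min_samples_split = int(X.shape[0]/1000)
#par_min_sample_leaf = X.shape

cv_n_split = 6  # number of cross-validation folds

# make_scorer defaults to greater_is_better=True; pass greater_is_better=False
# before using this scorer for model selection, since a lower MSE is better
my_scorer = make_scorer(mean_squared_error)
#my_scorer = make_scorer(r2_score)

title = f"Learning Curve GBM {par_B} {par_lambda} {par_max_depth}"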

In [8]:
par_min_samples_split

1
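
One way to check whether `par_B = 500` trees are needed at `par_lambda = 0.1` is to track the held-out error after each boosting stage with `staged_predict`. The sketch below repeats the values from the settings cell so it is self-contained; the train/test split, the `random_state` values, and the names `X_demo`, `y_demo` and `best_B` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Values repeated from the settings cell so the example is self-contained.
par_B, par_lambda, par_max_depth = 500, 0.1, 8

X_demo, y_demo = make_regression(n_samples=1000, n_features=1, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=par_B,
    learning_rate=par_lambda,
    max_depth=par_max_depth,
    random_state=0,
).fit(X_train, y_train)

# staged_predict yields the prediction after 1, 2, ..., par_B trees.
test_mse = np.array(
    [mean_squared_error(y_test, y_pred) for y_pred in model.staged_predict(X_test)]
)
best_B = int(np.argmin(test_mse)) + 1  # number of trees with the lowest held-out MSE
print(best_B, test_mse[best_B - 1], test_mse[-1])
```

If `best_B` turns out to be much smaller than `par_B`, the extra trees are mostly fitting noise, which is a hint to lower `n_estimators` or `max_depth`.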