**Homework: "Decision Trees"**

For this homework, take the Boston house-prices dataset (sklearn.datasets.load_boston) and do the same thing for a regression task: try different algorithms, tune their hyperparameters, and report the final quality.

**Implementation:**

In [25]:
conda install -c conda-forge jupyterthemes

In [21]:
from sklearn import datasets
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso, Ridge, HuberRegressor, ElasticNet
from sklearn.tree import DecisionTreeRegressor
%matplotlib inline
import numpy as np
import pandas as pd
import random

from jupyterthemes import jtplot
jtplot.style()

In [22]:
random.seed(1)
np.random.seed(1)  # scikit-learn uses NumPy's global RNG when random_state is not set

In [23]:
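# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
boston = datasets.load_boston()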
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [5]:
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head(5)

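|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |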


In [25]:
df.shape

(506, 13)

In [26]:
X = boston['data']
y = boston['target']

In [27]:
# Split the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [8]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
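
A side note on the scaling step: because the scaler is fitted on the whole training set before cross-validation, every validation fold has already contributed to the scaling statistics. One way to avoid this is to wrap the scaler and the model in a `Pipeline`, so the scaler is refitted inside each fold. The sketch below is only an illustration under stated assumptions: `make_regression` stands in for the Boston data, and the alpha grid, fold count, and all `*_demo` / `pipe` / `search` names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston data (assumption: same shape, 506 x 13).
X_demo, y_demo = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# The scaler becomes part of the estimator, so GridSearchCV refits it per fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"model__alpha": np.logspace(-3, 1, 20)}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("test R^2:", search.score(X_te, y_te))
```

With this layout the test set is also scaled by the pipeline itself, so no separate `transform` call is needed.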

**Let's try several regression models:**

*Lasso regression*

In [53]:
lasso_reg = Lasso()

Let's tune two hyperparameters of this model: 'alpha' and 'selection'.

In [61]:
lasso_params = {
    'alpha': np.logspace(-3, 5, 2000),
    'selection' : ['cyclic', 'random']
}
grid_lasso = GridSearchCV(lasso_reg, lasso_params, cv=10, verbose=2, n_jobs=-1)
grid_lasso.fit(X_train, y_train)

Fitting 10 folds for each of 4000 candidates, totalling 40000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 4208 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 17200 tasks      | elapsed:    7.8s
[Parallel(n_jobs=-1)]: Done 35312 tasks      | elapsed:   13.9s
[Parallel(n_jobs=-1)]: Done 40000 out of 40000 | elapsed:   15.4s finished


GridSearchCV(cv=10, estimator=Lasso(), n_jobs=-1,
             param_grid={'alpha': array([1.00000000e-03, 1.00925754e-03, 1.01860077e-03, ...,
       9.81738896e+04, 9.90827380e+04, 1.00000000e+05]),
                         'selection': ['cyclic', 'random']},
             verbose=2)
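
The "4000 candidates" reported above is the product of the grid sizes (2000 alpha values times 2 selection modes), and each candidate is fitted once per fold. A quick check with `ParameterGrid`, which enumerates the same combinations; the `*_demo` names are introduced here only for the illustration:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the Lasso search, under a separate name.
lasso_params_demo = {
    'alpha': np.logspace(-3, 5, 2000),
    'selection': ['cyclic', 'random'],
}

n_candidates = len(ParameterGrid(lasso_params_demo))  # 2000 * 2
n_folds = 10
print(n_candidates, n_candidates * n_folds)
```

The same arithmetic explains the fit counts of the other searches, which is worth checking before launching a large grid.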

Print the best result of the search:

In [62]:
print("Best hyperparameters:", grid_lasso.best_params_, ". Best cross-validated R^2:", grid_lasso.best_score_)


Best hyperparameters: {'alpha': 0.0027556249611976015, 'selection': 'random'} . Best cross-validated R^2: 0.7124699906971095
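
The score that `GridSearchCV` reports here is the mean over the folds of the estimator's default score, which for scikit-learn regressors is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A small sketch that computes it by hand and compares it with `r2_score`; the arrays are made-up numbers, not values from this notebook:

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative targets and predictions (not taken from the Boston data).
y_true_demo = np.array([3.0, 5.0, 7.5, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

An R^2 of 1 means perfect predictions, and 0 corresponds to always predicting the mean of the targets.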


Let's run the same search for *Ridge regression*:

In [63]:
ridge_reg = Ridge()

In [64]:
ridge_params = {
    'alpha': np.logspace(-3, 5, 2000),
    'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}
grid_ridge = GridSearchCV(ridge_reg, ridge_params, cv=10, verbose=2, n_jobs=-1)
grid_ridge.fit(X_train, y_train)

Fitting 10 folds for each of 12000 candidates, totalling 120000 fits


GridSearchCV(cv=10, estimator=Ridge(), n_jobs=-1,
             param_grid={'alpha': array([1.00000000e-03, 1.00925754e-03, 1.01860077e-03, ...,
       9.81738896e+04, 9.90827380e+04, 1.00000000e+05]),
                         'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg',
                                    'sag', 'saga']},
             verbose=2)

In [65]:
print("Best hyperparameters:", grid_ridge.best_params_, ". Best cross-validated R^2:", grid_ridge.best_score_)

Best hyperparameters: {'alpha': 0.04373412180769153, 'solver': 'cholesky'} . Best cross-validated R^2: 0.712451410169829
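
For intuition about what alpha controls: without an intercept, Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2, and the minimizer is w = (X^T X + alpha * I)^(-1) X^T y. The sketch below checks this closed form against `Ridge` with the 'cholesky' solver. It runs on random data, and `fit_intercept=False` is a simplifying assumption made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Random illustrative data (not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

alpha = 0.5
# Closed-form ridge solution: solve (X^T X + alpha * I) w = X^T y.
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(5), X_demo.T @ y_demo)

ridge_demo = Ridge(alpha=alpha, fit_intercept=False, solver='cholesky').fit(X_demo, y_demo)
print(np.allclose(w_closed, ridge_demo.coef_))
```

Larger alpha values add more weight to the diagonal and shrink the coefficients toward zero.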


Now let's look at *Huber regression*:

In [67]:
huber_reg = HuberRegressor()

In [68]:
huber_params = {
    'alpha': np.logspace(-3, 5, 200),
    'epsilon': np.linspace(1, 2, 100)
}
grid_huber = GridSearchCV(huber_reg, huber_params, cv=10, verbose=2, n_jobs=-1)
grid_huber.fit(X_train, y_train)

Fitting 10 folds for each of 20000 candidates, totalling 200000 fits


GridSearchCV(cv=10, estimator=HuberRegressor(), n_jobs=-1,
             param_grid={'alpha': array([1.00000000e-03, 1.09698580e-03, 1.20337784e-03, 1.32008840e-03,
       1.44811823e-03, 1.58856513e-03, 1.74263339e-03, 1.91164408e-03,
       2.09704640e-03, 2.30043012e-03, 2.52353917e-03, 2.76828663e-03,
       3.03677112e-03, 3.33129479e-03, 3.65438307e-03, 4.00880633e-03,
       4.39760361e-03, 4.82410870e-...
       1.65656566, 1.66666667, 1.67676768, 1.68686869, 1.6969697 ,
       1.70707071, 1.71717172, 1.72727273, 1.73737374, 1.74747475,
       1.75757576, 1.76767677, 1.77777778, 1.78787879, 1.7979798 ,
       1.80808081, 1.81818182, 1.82828283, 1.83838384, 1.84848485,
       1.85858586, 1.86868687, 1.87878788, 1.88888889, 1.8989899 ,
       1.90909091, 1.91919192, 1.92929293, 1.93939394, 1.94949495,
       1.95959596, 1.96969697, 1.97979798, 1.98989899, 2.        ])},
             verbose=2)

In [69]:
print("Best hyperparameters:", grid_huber.best_params_, ". Best cross-validated R^2:", grid_huber.best_score_)

Best hyperparameters: {'alpha': 0.6517339604882427, 'epsilon': 1.3838383838383839} . Best cross-validated R^2: 0.691724541028457
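
The epsilon parameter tuned above sets the point at which the Huber loss switches from quadratic to linear, which is what makes the model less sensitive to outliers. A minimal sketch of the Huber function in the form given in the scikit-learn documentation, as far as I can tell: z^2 for |z| < epsilon and 2 * epsilon * |z| - epsilon^2 otherwise. The function name `huber` is hypothetical, and the scaling of residuals by the jointly estimated sigma is left out.

```python
import numpy as np

def huber(z, epsilon=1.35):
    """Huber function: quadratic near zero, linear in the tails."""
    z = np.asarray(z, dtype=float)
    quadratic = z ** 2
    linear = 2.0 * epsilon * np.abs(z) - epsilon ** 2
    return np.where(np.abs(z) < epsilon, quadratic, linear)

residuals_demo = np.array([0.5, 1.0, 3.0, 10.0])
print(huber(residuals_demo))   # linear growth for large residuals
print(residuals_demo ** 2)     # squared loss, for comparison
```

A smaller epsilon treats more points as outliers, and a very large one makes the loss behave like ordinary squared error.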


Next, *ElasticNet* regression:

In [70]:
elast_reg = ElasticNet()

In [72]:
elast_params = {
    'alpha': np.logspace(-3, 5, 200),
    'l1_ratio': np.linspace(0, 1, 50)
}
grid_elast = GridSearchCV(elast_reg, elast_params, cv=10, verbose=2, n_jobs=-1)
grid_elast.fit(X_train, y_train)

Fitting 10 folds for each of 10000 candidates, totalling 100000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 2192 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 15184 tasks      | elapsed:    8.8s
[Parallel(n_jobs=-1)]: Done 33296 tasks      | elapsed:   16.2s
[Parallel(n_jobs=-1)]: Done 56656 tasks      | elapsed:   25.0s
[Parallel(n_jobs=-1)]: Done 85136 tasks      | elapsed:   35.0s
[Parallel(n_jobs=-1)]: Done 100000 out of 100000 | elapsed:   40.0s finished


GridSearchCV(cv=10, estimator=ElasticNet(), n_jobs=-1,
             param_grid={'alpha': array([1.00000000e-03, 1.09698580e-03, 1.20337784e-03, 1.32008840e-03,
       1.44811823e-03, 1.58856513e-03, 1.74263339e-03, 1.91164408e-03,
       2.09704640e-03, 2.30043012e-03, 2.52353917e-03, 2.76828663e-03,
       3.03677112e-03, 3.33129479e-03, 3.65438307e-03, 4.00880633e-03,
       4.39760361e-03, 4.82410870e-03, 5...
       0.30612245, 0.32653061, 0.34693878, 0.36734694, 0.3877551 ,
       0.40816327, 0.42857143, 0.44897959, 0.46938776, 0.48979592,
       0.51020408, 0.53061224, 0.55102041, 0.57142857, 0.59183673,
       0.6122449 , 0.63265306, 0.65306122, 0.67346939, 0.69387755,
       0.71428571, 0.73469388, 0.75510204, 0.7755102 , 0.79591837,
       0.81632653, 0.83673469, 0.85714286, 0.87755102, 0.89795918,
       0.91836735, 0.93877551, 0.95918367, 0.97959184, 1.        ])},
             verbose=2)

In [74]:
print("Best hyperparameters:", grid_elast.best_params_, "Best cross-validated R^2:", grid_elast.best_score_)

Best hyperparameters: {'alpha': 0.0027682866303920667, 'l1_ratio': 1.0} Best cross-validated R^2: 0.7124696578426108
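
The selected `l1_ratio` of 1.0 switches off the L2 part of the penalty, so the chosen ElasticNet is in effect a Lasso model. That would explain why its alpha and score are so close to those of the Lasso search. A quick check of the equivalence on synthetic data; `make_regression` and the alpha value are assumptions made for the illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic illustrative data (not the Boston data).
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=5.0, random_state=0)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_demo = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)

# With l1_ratio=1.0 both models optimize the same pure-L1 objective.
print(np.allclose(lasso_demo.coef_, enet_demo.coef_))
```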


Finally, let's see how a *decision tree* regressor does:

In [75]:
tree_reg = DecisionTreeRegressor()

In [76]:
tree_params = {
    'max_depth': range(1, 11),
    'splitter': ['best', 'random'],
    # 'mse' and 'mae' were renamed 'squared_error' and 'absolute_error' in scikit-learn 1.0
    'criterion': ['mse', 'mae', 'friedman_mse'],
    'min_samples_leaf': [1, 2, 4, 8, 16]
}
grid_tree = GridSearchCV(tree_reg, tree_params, cv=10, verbose=2, n_jobs=-1)
grid_tree.fit(X_train, y_train)

Fitting 10 folds for each of 300 candidates, totalling 3000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 2133 tasks      | elapsed:    5.0s
[Parallel(n_jobs=-1)]: Done 3000 out of 3000 | elapsed:    5.5s finished


GridSearchCV(cv=10, estimator=DecisionTreeRegressor(), n_jobs=-1,
             param_grid={'criterion': ['mse', 'mae', 'friedman_mse'],
                         'max_depth': range(1, 11),
                         'min_samples_leaf': [1, 2, 4, 8, 16],
                         'splitter': ['best', 'random']},
             verbose=2)

In [79]:
print("Best hyperparameters:", grid_tree.best_params_, "Best cross-validated R^2:", grid_tree.best_score_)

Best hyperparameters: {'criterion': 'friedman_mse', 'max_depth': 8, 'min_samples_leaf': 2, 'splitter': 'best'} Best cross-validated R^2: 0.7684230889989335


**Overall quality of our models.**

First, let's see how the models do on the training data:

In [80]:
estimators = {
    'lasso': grid_lasso,
    'ridge': grid_ridge,
    'huber': grid_huber,
    'elasticNet': grid_elast,
    'tree': grid_tree
}

In [81]:
for est in estimators:
    m = estimators[est]
    print(est, "CV R^2:", m.best_score_, "Train R^2:", m.best_estimator_.score(X_train, y_train))

lasso CV R^2: 0.7124699906971095 Train R^2: 0.7523518608250437
ridge CV R^2: 0.712451410169829 Train R^2: 0.7523678795464146
huber CV R^2: 0.691724541028457 Train R^2: 0.6442575820846107
elasticNet CV R^2: 0.7124696578426108 Train R^2: 0.7523515593711791
tree CV R^2: 0.7684230889989335 Train R^2: 0.9608997774572935


The decision tree regressor scores best here, but its training R^2 (0.96) is far above its cross-validated R^2 (0.77), which suggests that it overfits the training set to some degree.

Now let's check the test data:

In [82]:
for est in estimators:
    m = estimators[est]
    print(est, "CV R^2:", m.best_score_, "Test R^2:", m.best_estimator_.score(X_test, y_test))

lasso CV R^2: 0.7124699906971095 Test R^2: 0.6660879915826171
ridge CV R^2: 0.712451410169829 Test R^2: 0.6655065226735672
huber CV R^2: 0.691724541028457 Test R^2: 0.5945262046525202
elasticNet CV R^2: 0.7124696578426108 Test R^2: 0.6660937581446583
tree CV R^2: 0.7684230889989335 Test R^2: 0.7611308402878274


On the test data the decision tree regressor also comes out ahead, with an R^2 of about 0.76 against 0.59 to 0.67 for the linear models.
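
One caveat when reading training scores for trees: a deep enough tree can fit the training data almost perfectly, so a high training R^2 on its own says little about how the model generalizes. The sketch below prints training and cross-validated R^2 for a few depths. It uses synthetic data from `make_regression` as a stand-in (an assumption; the Boston data is not reloaded here), and the depth values are chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic illustrative data (not the Boston data).
X_demo, y_demo = make_regression(n_samples=400, n_features=13, noise=20.0, random_state=0)

train_scores = {}
cv_scores = {}
for depth in (2, 4, 8, None):  # None lets the tree grow until it cannot split further
    tree_demo = DecisionTreeRegressor(max_depth=depth, random_state=0)
    cv_scores[depth] = cross_val_score(tree_demo, X_demo, y_demo, cv=5).mean()
    train_scores[depth] = tree_demo.fit(X_demo, y_demo).score(X_demo, y_demo)
    print(depth, round(train_scores[depth], 3), round(cv_scores[depth], 3))
```

The gap between the two columns is the quantity to watch; parameters such as `max_depth` and `min_samples_leaf` in the grid above are the usual way to narrow it.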