https://stackoverflow.com/questions/50265993/alternate-different-models-in-pipeline-for-gridsearchcv

https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/

In [30]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

**Loading the data**

In [2]:
# Note: load_boston was removed in scikit-learn 1.2; this notebook assumes an older release.
data = load_boston()

In [5]:
print(data.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [7]:
data.keys()

['filename', 'data', 'target', 'DESCR', 'feature_names']

**Preparing the data**

In [10]:
df = pd.DataFrame(data.data, columns = data.feature_names)
# MEDV is the price; this is our target y
df['Price'] = data.target

In [12]:
df.head()

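|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |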


In [24]:
df.shape

(506, 14)

In [19]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['Price'], axis = 1), df.Price, test_size = 0.33, random_state = 0)

In [23]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(339, 13)
(167, 13)
(339L,)
(167L,)
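
The shapes above follow from how `train_test_split` rounds a fractional `test_size`: the test set gets ceil(0.33 × 506) = 167 rows and the training set the remaining 339. A small check, assuming a zero-filled placeholder array of the same shape in place of the real features (the `*_dummy` names are illustrative only):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features = 506, 13
X_dummy = np.zeros((n_samples, n_features))  # placeholder, not the real data
y_dummy = np.zeros(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.33, random_state=0
)

# A fractional test_size is rounded up to a whole number of rows.
expected_test = int(math.ceil(0.33 * n_samples))
print(X_tr.shape, X_te.shape, expected_test)
```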


**Models and parameter grids**

In [25]:
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor

In [130]:
models_dict = {
    'Lasso regression': Lasso(random_state=42),
    'Ridge regression': Ridge(random_state=42),
    'ElasticNet regression': ElasticNet(random_state = 42),
    'Random forest regression': RandomForestRegressor(random_state = 42),
    'AdaBoost regression': AdaBoostRegressor(random_state=42),
    'XGBoost regression': XGBRegressor(silent = 1)
}

parameters_grid_dict = {
    'Lasso regression': {'alpha' : np.linspace(0, 1, 100)},
    'Ridge regression': {'alpha' : np.linspace(0, 1, 100)},
    'ElasticNet regression': {'alpha': np.linspace(0, 1, 100),
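                              # NB: np.arange(0, 1, 100) yields only [0], so l1_ratio is fixed at 0 here;
                              # np.linspace(0, 1, n) would sweep the range instead
                              'l1_ratio': np.arange(0, 1, 100)},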
    'Random forest regression': {'n_estimators': np.arange(100, 1000, 100)},
    'AdaBoost regression': {'n_estimators': np.arange(50, 1000, 50),
                            'loss': ('linear', 'square', 'exponential')},
    'XGBoost regression': {'gamma': np.linspace(0.01, 0.5, 100),
                           'max_depth': np.arange(3, 10, 1),
                           'subsample': np.linspace(0.5, 1, 5),
                           'colsample_bytree': np.linspace(0.5, 1,5)}
}

# this could be wrapped in a function
for model_name, model in models_dict.items():
    print(model_name)
    param_grid = parameters_grid_dict[model_name]
    # 5-fold cross-validated grid search over this model's parameter grid
    clf = GridSearchCV(model, param_grid, cv=5)
    best_model = clf.fit(X_train, y_train)  # fit() returns the fitted search object
    print(best_model.best_estimator_)
    print(best_model.best_score_)  # mean cross-validated score of the best setting
    print('')

XGBoost regression
XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1.0,
       gamma=0.2871717171717172, learning_rate=0.1, max_delta_step=0,
       max_depth=4, min_child_weight=1, missing=None, n_estimators=100,
       nthread=-1, objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=1, subsample=1.0)
0.8985062210707282

ElasticNet regression
ElasticNet(alpha=0.0, copy_X=True, fit_intercept=True, l1_ratio=0,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=42, selection='cyclic', tol=0.0001, warm_start=False)
0.7291438354168754
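
The `l1_ratio=0` in the result above is less a finding than a consequence of the grid: `np.arange(0, 1, 100)` treats 100 as the step, so it yields the single value 0, whereas `np.linspace(0, 1, 100)` yields 100 evenly spaced values. A quick comparison (the 11-point `l1_ratio_grid` at the end is only an illustrative alternative, not the grid used in this notebook):

```python
import numpy as np

arange_grid = np.arange(0, 1, 100)      # start=0, stop=1, step=100
linspace_grid = np.linspace(0, 1, 100)  # 100 points from 0 to 1 inclusive

print(arange_grid)         # a single value: 0
print(len(linspace_grid))  # 100

# Hypothetical coarser grid that actually sweeps l1_ratio.
l1_ratio_grid = np.linspace(0, 1, 11)
print(l1_ratio_grid)
```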

Lasso regression
Lasso(alpha=0.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=42,
   selection='cyclic', tol=0.0001, warm_start=False)
0.7291438354168754

AdaBoost regression
AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='square',
         n_estimators=150, random_state=42)
0.8518832098821766
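
The search loop could be wrapped in a function, as the comment in the cell suggests. A minimal sketch, assuming a hypothetical helper name `run_grid_searches` and synthetic data from `make_regression` in place of the Boston set so that the example is self-contained; the grids are deliberately tiny. `best_score_` is the mean cross-validated score of the best parameter setting, which for regressors defaults to R².

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV


def run_grid_searches(models, param_grids, X, y, cv=5):
    """Fit one GridSearchCV per model and return the fitted searches by name."""
    searches = {}
    for name, model in models.items():
        search = GridSearchCV(model, param_grids[name], cv=cv)
        search.fit(X, y)
        searches[name] = search
    return searches


# Synthetic stand-in data (assumption: not the Boston data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

demo_models = {
    'Lasso regression': Lasso(random_state=42),
    'Ridge regression': Ridge(random_state=42),
}
demo_grids = {
    'Lasso regression': {'alpha': [0.01, 0.1, 1.0]},
    'Ridge regression': {'alpha': [0.01, 0.1, 1.0]},
}

results = run_grid_searches(demo_models, demo_grids, X_demo, y_demo)
for name, search in results.items():
    print(name, search.best_params_, search.best_score_)
```

Returning the fitted search objects rather than printing inside the function keeps `best_estimator_`, `best_params_` and `cv_results_` available for later comparison.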
