### Homework 3:  Regression Challenge

Your homework assignment will be to synthesize the lessons taught in Unit 3, and present a coherent walk through of how you approached the modeling process.

**What You Will Turn In:**

A jupyter notebook with code and commentary that walks us through the following:

 - Exploratory Data Analysis on the original data
   - What are general patterns within the target variable? 
   - What relationships can we deduce from the features in X and how they relate to one another?  How they impact y?
   
 - What were some of the challenges in dealing with this dataset and why?
 - What cross validation strategy did you use and why?
   - How did you interpret their results?
   - What did you change as a result?
   - Did changing the number of folds have any measurable impact on your scores?
 - The use of pipelines to streamline your data processing and ensure correct alignment between training and test sets
 - Strategies you used to try and improve your score (it's okay if they didn't work -- just show us what you tried to do and why)
 - What features ended up having important causal impact on the target variable?  Can you demonstrate this graphically?
 - How did you choose your model parameters?
 - How did you validation predictions compare with your test set predictions?  Can you visualize this graphically?
  
The end result should be a coherent walk through of how you approached the problem and developed a coherent solution to model your data.

Some bonus tasks you could take on:  

 - The `bikeshare` dataset makes use of a date column.  There are many specialized versions of KFold, one of which is Time Series Splitting.  This would split the data in a way that was described in the cross-validation class, with each validation set coming after the previous training sets.  Could you make use of this in your modeling?  You can find it here:  https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

**Datasets You Can Work With:**

Here are the list of the current datasets you have at your disposal:

`housing.csv` **beginner difficulty** -- this is the boston housing dataset that was used in class.  It contains 13 columns, and the target variable would be `PRICE`.  It states the sale price, and columns associated with the house's location, zoning details, and physical characteristics.  It has no missing values, no time column, and no categorical variables, so it's the most straight forward to work with.

`insurance_premiums.csv` **beginner difficulty** -- this dataset is a collection of insurance customers with columns describing some of their characteristics (age, bmi, smoking status, etc) and how much they ended up paying in insurance premiums as a result.  It is slightly more involved than `housing.csv`, since it has categorical variables.  This dataset is a useful exercise in understanding how different variables can interact with one another to impact the outcome being studied.

`bikeshare.csv` **intermediate difficulty** -- this is the dataset that was part of your bonus assignment.  It represents the number of bike rentals every hour in Manhattan during the course of several years.  This dataset is a **time series**, so it's important to make judicious use of time-based data, and to make sure you cross-validate your results sequentially.

`iowa_mini.csv` **intermediate difficulty** -- this is the dataset that we've worked on throughout class for the past two weeks. It has a few outliers within it, as well as some missing values that make it a bit challenging. It's a good idea for people who feel most comfortable continuing what was done in class.

`iowa_full.csv` **advanced difficulty** -- this is the complete iowa dataset, which has a total of 80+ columns.  Most of these are redundant, but deciphering how to best make use of them is a lot more work than the other files listed here.  With this dataset, expect to spend a lot of time cleaning your data, and deciphering how different columns ought to be encoded.

In [341]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
import category_encoders as ce
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
from pdpbox import pdp, info_plots

In [363]:
df = pd.read_csv('./insurance_premiums.csv')
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [343]:
#age:  age of insurance payor
#sex:  sex of insurance payor
#bmi:  bmi of insurance payor
#children: # of children of insurance payor
#smoker:  whether or not the payor smokes
#region:  region of the country where the payor is located
#charges: total amount of insurance premiums paid by insurance payor

#Sorting the data age from lowest to higher
Sort = df.sort_values(by=['age'], ascending = True)
Sort

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1248,18,female,39.820,0,no,southeast,1633.96180
482,18,female,31.350,0,no,southeast,1622.18850
492,18,female,25.080,0,no,northeast,2196.47320
525,18,female,33.880,0,no,southeast,11482.63485
529,18,male,25.460,0,no,northeast,1708.00140
...,...,...,...,...,...,...,...
398,64,male,25.600,2,no,southwest,14988.43200
335,64,male,34.500,0,no,southwest,13822.80300
378,64,female,30.115,3,no,northwest,16455.70785
1265,64,male,23.760,0,yes,southeast,26926.51440


In [366]:
y = Sort['charges']
X = Sort.drop('charges', axis=1)

In [367]:
y

1248     1633.96180
482      1622.18850
492      2196.47320
525     11482.63485
529      1708.00140
           ...     
398     14988.43200
335     13822.80300
378     16455.70785
1265    26926.51440
635     14410.93210
Name: charges, Length: 1338, dtype: float64

In [368]:
X

Unnamed: 0,age,sex,bmi,children,smoker,region
1248,18,female,39.820,0,no,southeast
482,18,female,31.350,0,no,southeast
492,18,female,25.080,0,no,northeast
525,18,female,33.880,0,no,southeast
529,18,male,25.460,0,no,northeast
...,...,...,...,...,...,...
398,64,male,25.600,2,no,southwest
335,64,male,34.500,0,no,southwest
378,64,female,30.115,3,no,northwest
1265,64,male,23.760,0,yes,southeast


In [347]:
#average value of insurance charges across the entire data set approach 1 
Avg_value = y.mean()
Avg_value

13270.422265141273

In [348]:
#average value of insurance charges across the entire data set approach 2
Avg = df['charges'].mean()
Avg

13270.422265141257

In [349]:
pos_neg_error = (df['charges']-Avg)**2
avg_error = np.mean((df['charges']-Avg)**2)

In [350]:
Avg

13270.422265141257

In [351]:
gradient = y - Avg
gradient

1248   -11636.460465
482    -11648.233765
492    -11073.949065
525     -1787.787415
529    -11562.420865
            ...     
398      1718.009735
335       552.380735
378      3185.285585
1265    13656.092135
635      1140.509835
Name: charges, Length: 1338, dtype: float64

In [352]:
#Target Encoder (many unique categories in the column)
#OrdinalEncoder (assigns each unique category within a column an integer value, either specified or random)
#OneHotEncoder (assigns a binary value 0 and 1 for each category in the column)
te = ce.TargetEncoder()
ore = ce.OrdinalEncoder()
ohe = ce.OneHotEncoder(use_cat_names=True)

In [353]:
#OneHotEncoder turns each categorical data into binary column. it works for sex, and smoker column but region column can be simplified
ohe.fit_transform(df)

Unnamed: 0,age,sex_female,sex_male,bmi,children,smoker_yes,smoker_no,region_southwest,region_southeast,region_northwest,region_northeast,charges
0,19,1,0,27.900,0,1,0,1,0,0,0,16884.92400
1,18,0,1,33.770,1,0,1,0,1,0,0,1725.55230
2,28,0,1,33.000,3,0,1,0,1,0,0,4449.46200
3,33,0,1,22.705,0,0,1,0,0,1,0,21984.47061
4,32,0,1,28.880,0,0,1,0,0,1,0,3866.85520
...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50,0,1,30.970,3,0,1,0,0,1,0,10600.54830
1334,18,1,0,31.920,0,0,1,0,0,0,1,2205.98080
1335,18,1,0,36.850,0,0,1,0,1,0,0,1629.83350
1336,21,1,0,25.800,0,0,1,1,0,0,0,2007.94500


In [354]:
#this is most ideal because the categorical data was simplified for most part for region, smoker and sex. but it is unclear which category is numbered what)
ore.fit_transform(df)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,1,27.900,0,1,1,16884.92400
1,18,2,33.770,1,2,2,1725.55230
2,28,2,33.000,3,2,2,4449.46200
3,33,2,22.705,0,2,3,21984.47061
4,32,2,28.880,0,2,3,3866.85520
...,...,...,...,...,...,...,...
1333,50,2,30.970,3,2,3,10600.54830
1334,18,1,31.920,0,2,4,2205.98080
1335,18,1,36.850,0,2,2,1629.83350
1336,21,1,25.800,0,2,1,2007.94500


In [355]:
#male has higher charges avg
df.groupby('sex')['charges'].mean()

sex
female    12569.578844
male      13956.751178
Name: charges, dtype: float64

In [356]:
#southeast region has higher charges avg
df.groupby('region')['charges'].mean()

region
northeast    13406.384516
northwest    12417.575374
southeast    14735.411438
southwest    12346.937377
Name: charges, dtype: float64

In [357]:
#Smoker has higher charges avg
df.groupby('smoker')['charges'].mean()

smoker
no      8434.268298
yes    32050.231832
Name: charges, dtype: float64

In [358]:
#TargetEncoder fit here took an average of charges for each category in each column. 
#for Sex, the average of the charges for male and female, 
te.fit_transform(X,y)

Unnamed: 0,age,sex,bmi,children,smoker,region
1248,18,12569.578844,39.820,0,8434.268298,14735.411438
482,18,12569.578844,31.350,0,8434.268298,14735.411438
492,18,12569.578844,25.080,0,8434.268298,13406.384516
525,18,12569.578844,33.880,0,8434.268298,14735.411438
529,18,13956.751178,25.460,0,8434.268298,13406.384516
...,...,...,...,...,...,...
398,64,13956.751178,25.600,2,8434.268298,12346.937377
335,64,13956.751178,34.500,0,8434.268298,12346.937377
378,64,12569.578844,30.115,3,8434.268298,12417.575374
1265,64,13956.751178,23.760,0,32050.231832,14735.411438


In [359]:
#How many of each age group are in the DataFrame? 
df.age.value_counts().sort_index()

18    69
19    68
20    29
21    28
22    28
23    28
24    28
25    28
26    28
27    28
28    28
29    27
30    27
31    27
32    26
33    26
34    26
35    25
36    25
37    25
38    25
39    25
40    27
41    27
42    27
43    27
44    27
45    29
46    29
47    29
48    29
49    28
50    29
51    29
52    29
53    28
54    28
55    26
56    26
57    26
58    25
59    25
60    23
61    23
62    23
63    23
64    22
Name: age, dtype: int64

In [416]:
#create a training and test set data for age group <= 25 years old
train = Sort[Sort['age']<= 25].copy()
test  = Sort[Sort['age']>25].copy()

In [419]:
train0 = Sort.copy()
test0 = Sort.copy()

In [316]:
Sort

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1248,18,female,39.820,0,no,southeast,1633.96180
482,18,female,31.350,0,no,southeast,1622.18850
492,18,female,25.080,0,no,northeast,2196.47320
525,18,female,33.880,0,no,southeast,11482.63485
529,18,male,25.460,0,no,northeast,1708.00140
...,...,...,...,...,...,...,...
398,64,male,25.600,2,no,southwest,14988.43200
335,64,male,34.500,0,no,southwest,13822.80300
378,64,female,30.115,3,no,northwest,16455.70785
1265,64,male,23.760,0,yes,southeast,26926.51440


In [435]:
# create X and y training and test set for (25 or younger)
X_train = train.drop('charges', axis=1) 
X_test  = test.drop('charges', axis=1)

y_train = train['charges']
y_test = test['charges']

In [436]:
# create X and y training and test set
X_train0 = train0.drop('charges', axis=1) 
X_test0  = test0.drop('charges', axis=1)

y_train0 = train0['charges']
y_test0 = test0['charges']

In [424]:
#entire data set
ore     = ce.OrdinalEncoder()

X_train1 = ore.fit_transform(X_train0)
X_test1  = ore.transform(X_test0)

In [426]:
gbm = GradientBoostingRegressor()

gbm.fit(X_train1, y_train0)
gbm.score(X_train1, y_train0), gbm.score(X_test1, y_test0)

(0.8995461669277299, 0.8995461669277299)

In [437]:
#25 or younger 
ore     = ce.OrdinalEncoder()

X_train2 = ore.fit_transform(X_train)
X_test2  = ore.transform(X_test)

In [438]:
gbm = GradientBoostingRegressor()

gbm.fit(X_train2, y_train)
gbm.score(X_train2, y_train), gbm.score(X_test2, y_test)

(0.9387077886695129, 0.6286077481751334)

In [510]:
te = ce.TargetEncoder()

X_train3 = te.fit_transform(X_train, y_train)
X_test3  = te.transform(X_test, y_test)

In [512]:
gbm = GradientBoostingRegressor()

gbm.fit(X_train3, y_train)
gbm.score(X_train3, y_train), gbm.score(X_test3, y_test)
#why did it give me 

(0.9387077886695129, 0.6279478783228225)

In [513]:
te = ce.TargetEncoder()

X_train6 = te.fit_transform(X_train0, y_train0)
X_test6  = te.transform(X_test0, y_test0)

In [514]:
gbm = GradientBoostingRegressor()

gbm.fit(X_train6, y_train0)
gbm.score(X_train6, y_train0), gbm.score(X_test6, y_test0)

(0.8976952687334837, 0.8976952687334837)

In [440]:
ohe      = ce.OneHotEncoder()

X_train4 = ohe.fit_transform(X_train0)
X_test4  = ohe.transform(X_test0)

In [441]:
gbm = GradientBoostingRegressor()

gbm.fit(X_train4, y_train0)
gbm.score(X_train4, y_train0), gbm.score(X_test4, y_test0)

(0.8987356739004362, 0.8987356739004362)

In [442]:
ohe      = ce.OneHotEncoder()

X_train5 = ohe.fit_transform(X_train)
X_test5  = ohe.transform(X_test)

In [443]:
gbm = GradientBoostingRegressor()

gbm.fit(X_train5, y_train)
gbm.score(X_train5, y_train), gbm.score(X_test5, y_test)

(0.9411651393454947, 0.6167473071379835)

In [458]:
#to put the columns in the order of importance causal impact on the charges (target variable) 
feats = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': gbm.feature_importances_
}).sort_values(by='Importance', ascending=False)

In [459]:
feats

Unnamed: 0,Feature,Importance
4,smoker,0.741726
2,bmi,0.215682
3,children,0.017536
0,age,0.017419
5,region,0.006137
1,sex,0.001501


In [447]:
pipe = make_pipeline(ce.TargetEncoder(), gbm)

In [462]:
# use the pipeline to fit on the training set and score on the test set 
pipe.fit(X_train0, y_train0)

Pipeline(memory=None,
         steps=[('targetencoder',
                 TargetEncoder(cols=['sex', 'smoker', 'region'],
                               drop_invariant=False, handle_missing='value',
                               handle_unknown='value', min_samples_leaf=1,
                               return_df=True, smoothing=1.0, verbose=0)),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0,
                                           criterion='friedman_mse', init=None,
                                           learning_rate=0.1, loss='ls',
                                           max_depth=3, max_features=None,
                                           max_leaf_nodes=None,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
                                           min_samples_leaf=1,
                                           m

In [463]:
pipe.score(X_test0, y_test0)

0.8976952687334837

In [541]:
# use the pipeline to fit on the training set and score on the test set 25 or younger
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('targetencoder',
                 TargetEncoder(cols=['sex', 'smoker', 'region'],
                               drop_invariant=False, handle_missing='value',
                               handle_unknown='value', min_samples_leaf=1,
                               return_df=True, smoothing=1.0, verbose=0)),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0,
                                           criterion='friedman_mse', init=None,
                                           learning_rate=0.1, loss='ls',
                                           max_depth=3, max_features=None,
                                           max_leaf_nodes=None,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
                                           min_samples_leaf=1,
                                           m

In [542]:
pipe.score(X_test, y_test)

0.6285817214712055

In [469]:
#using specific values for each categorical value for region
col_mapping = {
    'southwest': 1,
    'southeast': 2,
    'northwest': 3,
    'norteast': 4,
}

In [470]:
pipe1 = make_pipeline(ce.TargetEncoder(cols=['age']),
                     ce.OrdinalEncoder(cols=['region'], mapping=[{'col': 'region', 'mapping': col_mapping}]),
                     ce.OneHotEncoder(use_cat_names=True)
                     ,GradientBoostingRegressor())

In [546]:
pipe1.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('targetencoder',
                 TargetEncoder(cols=['age'], drop_invariant=False,
                               handle_missing='value', handle_unknown='value',
                               min_samples_leaf=1, return_df=True,
                               smoothing=1.0, verbose=0)),
                ('ordinalencoder',
                 OrdinalEncoder(cols=['region'], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[{'col': 'region',
                                          'mapping':...
                                           learning_rate=0.1, loss='ls',
                                           max_depth=3, max_features=None,
                                           max_leaf_nodes=None,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
                     

In [547]:
pipe1.score(X_test, y_test)

0.5654341333178451

In [526]:
pipe1.fit(X_train0, y_train0)

Pipeline(memory=None,
         steps=[('targetencoder',
                 TargetEncoder(cols=['age'], drop_invariant=False,
                               handle_missing='value', handle_unknown='value',
                               min_samples_leaf=1, return_df=True,
                               smoothing=1.0, verbose=0)),
                ('ordinalencoder',
                 OrdinalEncoder(cols=['region'], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[{'col': 'region',
                                          'mapping':...
                                           learning_rate=0.1, loss='ls',
                                           max_depth=3, max_features=None,
                                           max_leaf_nodes=None,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
                     

In [527]:
pipe1.score(X_test0, y_test0)

0.8911206153028562

In [493]:
dtr = DecisionTreeRegressor(max_depth=3)

In [494]:
dtr

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=3,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [529]:
dtr.fit(X_train6, gradient)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=3,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [497]:
gbm.get_params()

{'alpha': 0.9,
 'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'ls',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'presort': 'deprecated',
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [553]:
gbm.fit(X_train5, y_train)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.15, loss='ls', max_depth=5,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=250,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [554]:
gbm.score(X_train5, y_train)

0.9937883317049375

In [535]:
#target encoder, entire data set 
n_estimators = [50, 100, 250]
learning_rate = [.05, .1, .15]
tree_depth    = [3, 4, 5]
cv_scores = []

for estimator_num in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
            print(f"Fitting model for:  rounds: {estimator_num}, learning_rate: {rate}, depth: {depth}")
            gbm = GradientBoostingRegressor(n_estimators=estimator_num, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train6, y_train0)
            score = gbm.score(X_test6, y_test0)
            print(f"Model score: {score}")
            cv_scores.append((score, estimator_num, rate, depth))

Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 3
Model score: 0.866925182055519
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 4
Model score: 0.8785855074843069
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 5
Model score: 0.8945566618767683
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 3
Model score: 0.8843384352294387
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 4
Model score: 0.8981655207977726
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 5
Model score: 0.9209690358258595
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 3
Model score: 0.8931244368835846
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 4
Model score: 0.9121976376697891
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 5
Model score: 0.9364657433594713
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 3
Model score: 0.8838080560583061
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 4


In [565]:
#target encoder, 25 or younger data set 
n_estimators = [50, 100, 250]
learning_rate = [.05, .1, .15]
tree_depth    = [3, 4, 5]
cv_scores1 = []

for estimator_num in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
            print(f"Fitting model for:  rounds: {estimator_num}, learning_rate: {rate}, depth: {depth}")
            gbm = GradientBoostingRegressor(n_estimators=estimator_num, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train3, y_train)
            score = gbm.score(X_test3, y_test)
            print(f"Model score: {score}")
            cv_scores1.append((score, estimator_num, rate, depth))

Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 3
Model score: 0.6192067305316509
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 4
Model score: 0.6164411436350309
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 5
Model score: 0.6224726679412299
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 3
Model score: 0.6327234408635065
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 4
Model score: 0.6167114219376124
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 5
Model score: 0.6255718508766117
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 3
Model score: 0.6281613529153507
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 4
Model score: 0.6031405741332527
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 5
Model score: 0.6139175613426984
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 3
Model score: 0.6367691241887727
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 4

In [562]:
#Ohe, entire data set 
n_estimators = [50, 100, 250]
learning_rate = [.05, .1, .15]
tree_depth    = [3, 4, 5]
cv_scores2 = []

for estimator_num in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
            print(f"Fitting model for:  rounds: {estimator_num}, learning_rate: {rate}, depth: {depth}")
            gbm = GradientBoostingRegressor(n_estimators=estimator_num, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train4, y_train0)
            score = gbm.score(X_test4, y_test0)
            print(f"Model score: {score}")
            cv_scores2.append((score, estimator_num, rate, depth))

Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 3
Model score: 0.8672413155879821
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 4
Model score: 0.879088065918477
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 5
Model score: 0.8952327437071513
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 3
Model score: 0.8850705879299852
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 4
Model score: 0.8968293444103841
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 5
Model score: 0.9233703422920159
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 3
Model score: 0.8928225950634511
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 4
Model score: 0.9132526787715323
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 5
Model score: 0.9400512267562456
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 3
Model score: 0.8849636856067044
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 4


In [570]:
#Ohe, 25 or younger data set 
n_estimators = [50, 100, 250]
learning_rate = [.05, .1, .15]
tree_depth    = [3, 4, 5]
cv_scores3 = []

for estimator_num in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
            print(f"Fitting model for:  rounds: {estimator_num}, learning_rate: {rate}, depth: {depth}")
            gbm = GradientBoostingRegressor(n_estimators=estimator_num, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train5, y_train)
            score = gbm.score(X_test5, y_test)
            print(f"Model score: {score}")
            cv_scores3.append((score, estimator_num, rate, depth))

Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 3
Model score: 0.616567333981817
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 4
Model score: 0.6122769713911312
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 5
Model score: 0.593427946815027
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 3
Model score: 0.6358792151797392
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 4
Model score: 0.6097499379216937
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 5
Model score: 0.5802958769756732
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 3
Model score: 0.6331544880792113
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 4
Model score: 0.6119698862572527
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 5
Model score: 0.5736284985252527
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 3
Model score: 0.6355682752800595
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 4
M

In [563]:
max(cv_scores)

(0.9935484630058999, 250, 0.15, 5)

In [566]:
max(cv_scores1)

(0.6367691241887727, 100, 0.05, 3)

In [567]:
max(cv_scores2)

(0.9935484630058999, 250, 0.15, 5)

In [569]:
max(cv_scores3)

(0.6360252173069932, 50, 0.1, 3)