# Housing prices regression

In [2]:
# load libraries
from e2eml.regression import regression_blueprints as rb
from e2eml.full_processing.postprocessing import save_to_production, load_for_production
from e2eml.test.regression_blueprints_test import load_housingprices_data
import pandas as pd
from sklearn.metrics import mean_absolute_error

[nltk_data] Downloading package punkt to /home/thomas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/thomas/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/thomas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/thomas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Feature engineering
Load & preprocess housing prices dataset.



In [3]:
# load Housing price data
test_df, test_target, val_df, val_df_target, test_categorical_cols = load_housingprices_data()

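The loader returns the data already split. Despite its `test_` prefix, `test_df` is the frame we train on and `test_target` is the name of its target column; `val_df` and `val_df_target` form the holdout set used at the end of this notebook.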
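
The internals of `load_housingprices_data` are not shown in this notebook. As a rough sketch only, a comparable train/holdout split could be produced with scikit-learn as below; the small frame, its values, and the names `train_df`, `holdout_df`, and `holdout_target` are hypothetical stand-ins.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (hypothetical columns and values).
data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120, 7420],
    "Street": ["Pave"] * 9 + ["Grvl"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000,
                  143000, 307000, 200000, 129900, 118000],
})
target = "SalePrice"

# Keep 20% of the rows aside as a holdout set that the pipeline never sees.
train_df, holdout_df = train_test_split(data, test_size=0.2, random_state=42)

# Store the holdout targets separately and drop them from the holdout frame,
# so the frame looks like genuinely new, unlabeled data at prediction time.
holdout_target = holdout_df[target].copy()
holdout_df = holdout_df.drop(columns=[target])

print(train_df.shape, holdout_df.shape)
```

The training frame keeps its target column because the blueprint class is given the column name rather than a separate target series.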


# Using e2eml - Run and save a pipeline
We only need a few steps to get our full pipeline:
- Instantiate class
- Run chosen blueprint
- Save blueprint for later usage

In [4]:
# Instantiate class
housing_ml = rb.RegressionBluePrint(
    datasource=test_df,
    target_variable=test_target,  # a string: the name of the target column within the dataframe
    categorical_columns=test_categorical_cols,  # optional: explicitly specify the categorical columns
    tune_mode='accurate',
    ml_task='regression',  # usually not needed, but sometimes has to be set explicitly
)

Ml task is regression
Preferred training mode auto has been chosen. e2eml will automatically detect, if LGBM and Xgboost can use GPU acceleration and optimize the workflow accordingly.


In [5]:
"""
In this case we chose Ngboost, which uses natural gradient boosting. It is really strong for regression problems,
but unfortunately has no GPU acceleration at all. Nevertheless, we recommend trying Ngboost whenever possible.
"""
housing_ml.ml_bp14_regressions_full_processing_ngboost()

Some target classes have less members than allowed. You can ignore this message, if you
            are running a blueprint without NLP transformers.
            
            In order to create a strong model e2eml splits the data into several folds. Please provide data with at least
             6 class members for each target class. Otherwise the model is likely to fail to a CUDA error on runtime. 
             You can use the following function on your dataframe before passing it to e2eml:
            
            def handle_rarity(all_data, threshold=6, mask_as='miscellaneous', rarity_cols=None, normalize=False):
                if isinstance(rarity_cols, list):
                    for col in rarity_cols:
                        frequencies = all_data[col].value_counts(normalize=normalize)
                        condition = frequencies < threshold
                        mask_obs = frequencies[condition].index
                        mask_dict = dict.fromkeys(mask_obs, mask_as)
  

Started Execute test train split at 15:19:57.
Started Apply datetime transformation at 15:19:57.
Started Start Spacy, POS tagging at 15:19:58.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 29 to 102
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             800 non-null    int64  
 1   MSSubClass     800 non-null    int64  
 2   MSZoning       800 non-null    object 
 3   LotFrontage    663 non-null    float64
 4   LotArea        800 non-null    int64  
 5   Street         800 non-null    object 
 6   Alley          50 non-null     object 
 7   LotShape       800 non-null    object 
 8   LandContour    800 non-null    object 
 9   Utilities      800 non-null    object 
 10  LotConfig      800 non-null    object 
 11  LandSlope      800 non-null    object 
 12  Neighborhood   800 non-null    object 
 13  Condition1     800 non-null    object 
 14  Condition2     800 non-null    object 
 15

is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead


Started Execute categorical encoding at 15:19:59.


is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead


Started  Delete columns with high share of NULLs at 15:20:00.
Started Fill nulls at 15:20:00.
Started Execute numerical binning at 15:20:00.
Started Handle outliers at 15:20:01.
Started Remove collinearity at 15:20:01.
Started Execute clustering as a feature at 15:20:02.
Started Scale data at 15:20:02.
Started Execute clustering as a feature at 15:20:02.
Started Execute clustering as a feature at 15:20:03.
Started Execute clustering as a feature at 15:20:04.
Started Execute clustering as a feature at 15:20:04.
Started Execute clustering as a feature at 15:20:05.
Started Execute clustering as a feature at 15:20:06.
Started Execute clustering as a feature at 15:20:06.
Started Execute clustering as a feature at 15:20:07.
Started Select best features at 15:20:08.
Features before selection are...Id
Features before selection are...MSSubClass
Features before selection are...MSZoning
Features before selection are...LotFrontage
Features before selection are...LotArea
Features before selection a

Round:  1  iteration:  10
Parameters: { "silent" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Round:  2  iteration:  1
Parameters: { "silent" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Round:  2  iteration:  2
Parameters: { "silent" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Round:  2  iteration:  3
Parameters: { "silent" } might not be used.

  

[I 2021-08-11 15:21:24,052] A new study created in memory with name: no-name-d2394aad-a8ef-4ad3-8a22-100ea6e1adaf


Started Sort columns alphabetically at 15:21:23.
Started Train Ngboost at 15:21:24.
[iter 0] loss=12.7239 val_loss=12.5511 scale=2.0000 norm=122749.4355
[iter 100] loss=11.6251 val_loss=11.8475 scale=2.0000 norm=26081.2832
== Early stopping achieved.
== Best iteration / VAL158 (val_loss=11.6608)
[iter 0] loss=12.7273 val_loss=12.5571 scale=2.0000 norm=121870.0809
[iter 100] loss=11.6261 val_loss=11.8129 scale=2.0000 norm=26801.7708
== Early stopping achieved.
== Best iteration / VAL169 (val_loss=11.5620)
[iter 0] loss=12.7366 val_loss=12.5597 scale=2.0000 norm=122737.8371
[iter 100] loss=11.6376 val_loss=11.8228 scale=2.0000 norm=27725.4771
== Early stopping achieved.
== Best iteration / VAL168 (val_loss=11.5804)
[iter 0] loss=12.6929 val_loss=12.5313 scale=2.0000 norm=119340.3685
[iter 100] loss=11.5830 val_loss=11.7714 scale=2.0000 norm=25765.2758
== Early stopping achieved.
== Best iteration / VAL173 (val_loss=11.5076)
[iter 0] loss=12.7076 val_loss=12.5535 scale=2.0000 norm=119673.

[I 2021-08-11 15:23:52,398] Trial 0 finished with value: -889824897.7742364 and parameters: {'base_learner': 'GradientBoost_depth5', 'Dist': 'Normal', 'n_estimators': 48198, 'minibatch_frac': 0.5210588095817861, 'learning_rate': 0.009437275576101205}. Best is trial 0 with value: -889824897.7742364.


== Early stopping achieved.
== Best iteration / VAL174 (val_loss=11.5358)
[iter 0] loss=12.5540 val_loss=12.4041 scale=2.0000 norm=1.2846
[iter 100] loss=11.5363 val_loss=11.6515 scale=2.0000 norm=0.8683
== Early stopping achieved.
== Best iteration / VAL179 (val_loss=11.4309)
[iter 0] loss=12.5371 val_loss=12.4022 scale=2.0000 norm=1.2697
[iter 100] loss=11.5342 val_loss=11.6346 scale=2.0000 norm=0.8665
== Early stopping achieved.
== Best iteration / VAL185 (val_loss=11.3830)
[iter 0] loss=12.5844 val_loss=12.4050 scale=2.0000 norm=1.3177
[iter 100] loss=11.5461 val_loss=11.6431 scale=2.0000 norm=0.8711
== Early stopping achieved.
== Best iteration / VAL184 (val_loss=11.3978)
[iter 0] loss=12.5557 val_loss=12.3982 scale=2.0000 norm=1.2946
[iter 100] loss=11.5189 val_loss=11.6087 scale=2.0000 norm=0.8700
== Early stopping achieved.
== Best iteration / VAL187 (val_loss=11.3367)
[iter 0] loss=12.5656 val_loss=12.4083 scale=2.0000 norm=1.2763
[iter 100] loss=11.5514 val_loss=11.6294 scale

[I 2021-08-11 15:25:20,882] Trial 1 finished with value: -841773693.6131566 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 35005, 'minibatch_frac': 0.5147808490055511, 'learning_rate': 0.00994760328924329}. Best is trial 1 with value: -841773693.6131566.


== Early stopping achieved.
== Best iteration / VAL195 (val_loss=11.3398)
[iter 0] loss=12.5714 val_loss=12.4181 scale=1.0000 norm=0.6541
[iter 100] loss=12.4204 val_loss=12.3342 scale=1.0000 norm=0.5375
[iter 200] loss=12.2690 val_loss=12.2549 scale=2.0000 norm=0.9528
[iter 300] loss=12.1154 val_loss=12.1675 scale=2.0000 norm=0.8960
[iter 400] loss=11.9922 val_loss=12.0952 scale=2.0000 norm=0.8825
[iter 500] loss=11.8892 val_loss=12.0244 scale=2.0000 norm=0.8825
[iter 600] loss=11.7720 val_loss=11.9554 scale=2.0000 norm=0.8810
[iter 700] loss=11.6649 val_loss=11.8890 scale=2.0000 norm=0.8781
[iter 800] loss=11.5439 val_loss=11.8289 scale=2.0000 norm=0.8767
[iter 900] loss=11.4384 val_loss=11.7780 scale=2.0000 norm=0.8740
[iter 1000] loss=11.3348 val_loss=11.7407 scale=2.0000 norm=0.8690
[iter 1100] loss=11.2065 val_loss=11.7216 scale=2.0000 norm=0.8650
== Early stopping achieved.
== Best iteration / VAL1132 (val_loss=11.7201)
[iter 0] loss=12.5700 val_loss=12.4169 scale=1.0000 norm=0.

[I 2021-08-11 15:27:14,078] Trial 2 finished with value: -971877667.4856697 and parameters: {'base_learner': 'DecTree_depth5', 'Dist': 'LogNormal', 'n_estimators': 30502, 'minibatch_frac': 0.7963646418605659, 'learning_rate': 0.0012676525847053462}. Best is trial 1 with value: -841773693.6131566.


[iter 0] loss=13.1317 val_loss=13.0547 scale=1.0000 norm=0.3322
== Early stopping achieved.
== Best iteration / VAL29 (val_loss=13.0054)
[iter 0] loss=13.1207 val_loss=13.0540 scale=1.0000 norm=0.3358
== Early stopping achieved.
== Best iteration / VAL29 (val_loss=13.0042)
[iter 0] loss=13.1296 val_loss=13.0559 scale=1.0000 norm=0.3386
== Early stopping achieved.
== Best iteration / VAL53 (val_loss=13.0048)
[iter 0] loss=13.1205 val_loss=13.0481 scale=2.0000 norm=0.6527
== Early stopping achieved.
== Best iteration / VAL60 (val_loss=13.0036)
[iter 0] loss=13.1217 val_loss=13.0460 scale=2.0000 norm=0.6587


[I 2021-08-11 15:27:25,495] Trial 3 finished with value: -909607496.5930477 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'Exponential', 'n_estimators': 35532, 'minibatch_frac': 0.9090002851415449, 'learning_rate': 0.0992613096587529}. Best is trial 1 with value: -841773693.6131566.


== Early stopping achieved.
== Best iteration / VAL25 (val_loss=13.0036)
[iter 0] loss=12.5803 val_loss=12.4162 scale=2.0000 norm=1.3178
[iter 100] loss=12.2427 val_loss=12.2317 scale=2.0000 norm=0.9283
[iter 200] loss=12.0435 val_loss=12.0991 scale=2.0000 norm=0.8897
[iter 300] loss=11.8752 val_loss=11.9772 scale=2.0000 norm=0.8935
[iter 400] loss=11.7232 val_loss=11.8652 scale=2.0000 norm=0.9085
[iter 500] loss=11.5654 val_loss=11.7662 scale=2.0000 norm=0.9210
[iter 600] loss=11.4076 val_loss=11.6848 scale=2.0000 norm=0.9319
[iter 700] loss=11.2580 val_loss=11.6272 scale=2.0000 norm=0.9374
[iter 800] loss=11.0977 val_loss=11.6036 scale=2.0000 norm=0.9405
== Early stopping achieved.
== Best iteration / VAL805 (val_loss=11.6034)
[iter 0] loss=12.5771 val_loss=12.4149 scale=2.0000 norm=1.3083
[iter 100] loss=12.2365 val_loss=12.2260 scale=2.0000 norm=0.9291
[iter 200] loss=12.0401 val_loss=12.0857 scale=2.0000 norm=0.8897
[iter 300] loss=11.8719 val_loss=11.9569 scale=2.0000 norm=0.8944

[I 2021-08-11 15:44:35,093] Trial 4 finished with value: -942595260.8079674 and parameters: {'base_learner': 'GradientBoost_depth5', 'Dist': 'LogNormal', 'n_estimators': 42483, 'minibatch_frac': 0.9351310305514687, 'learning_rate': 0.0016038139590983558}. Best is trial 1 with value: -841773693.6131566.


[iter 0] loss=12.5784 val_loss=12.4134 scale=2.0000 norm=1.3065
[iter 100] loss=12.1163 val_loss=12.1197 scale=2.0000 norm=0.9008
[iter 200] loss=11.7899 val_loss=11.8910 scale=2.0000 norm=0.8952
[iter 300] loss=11.5208 val_loss=11.6915 scale=2.0000 norm=0.8886
[iter 400] loss=11.3010 val_loss=11.5443 scale=2.0000 norm=0.8751
[iter 500] loss=11.1450 val_loss=11.4675 scale=2.0000 norm=0.8552
== Early stopping achieved.
== Best iteration / VAL546 (val_loss=11.4587)
[iter 0] loss=12.5745 val_loss=12.4117 scale=2.0000 norm=1.3161
[iter 100] loss=12.1135 val_loss=12.1100 scale=2.0000 norm=0.9001
[iter 200] loss=11.7938 val_loss=11.8739 scale=2.0000 norm=0.8905
[iter 300] loss=11.5260 val_loss=11.6688 scale=2.0000 norm=0.8876
[iter 400] loss=11.2983 val_loss=11.5137 scale=2.0000 norm=0.8691
[iter 500] loss=11.1307 val_loss=11.4262 scale=2.0000 norm=0.8509
== Early stopping achieved.
== Best iteration / VAL558 (val_loss=11.4108)
[iter 0] loss=12.5917 val_loss=12.4171 scale=2.0000 norm=1.3190


[I 2021-08-11 15:49:12,391] Trial 5 finished with value: -852791221.6538063 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 16345, 'minibatch_frac': 0.6927252002019952, 'learning_rate': 0.0031472832802177976}. Best is trial 1 with value: -841773693.6131566.


[iter 0] loss=12.5768 val_loss=12.3729 scale=1.0000 norm=0.6542
== Early stopping achieved.
== Best iteration / VAL25 (val_loss=11.6017)
[iter 0] loss=12.5723 val_loss=12.3739 scale=1.0000 norm=0.6555
== Early stopping achieved.
== Best iteration / VAL25 (val_loss=11.6551)
[iter 0] loss=12.5937 val_loss=12.4335 scale=1.0000 norm=0.6574
== Early stopping achieved.
== Best iteration / VAL26 (val_loss=11.6347)
[iter 0] loss=12.5822 val_loss=12.3712 scale=1.0000 norm=0.6655
== Early stopping achieved.
== Best iteration / VAL26 (val_loss=11.5452)
[iter 0] loss=12.5999 val_loss=12.3799 scale=1.0000 norm=0.6612


[I 2021-08-11 15:49:15,069] Trial 6 finished with value: -1014399278.4598761 and parameters: {'base_learner': 'DecTree_depth5', 'Dist': 'LogNormal', 'n_estimators': 34124, 'minibatch_frac': 0.7174860556671294, 'learning_rate': 0.05776598615807477}. Best is trial 1 with value: -841773693.6131566.


== Early stopping achieved.
== Best iteration / VAL28 (val_loss=11.5494)
[iter 0] loss=12.5660 val_loss=12.3612 scale=1.0000 norm=0.6481
== Early stopping achieved.
== Best iteration / VAL20 (val_loss=11.6038)
[iter 0] loss=12.5660 val_loss=12.3540 scale=1.0000 norm=0.6522
== Early stopping achieved.
== Best iteration / VAL19 (val_loss=11.6138)
[iter 0] loss=12.5877 val_loss=12.3840 scale=1.0000 norm=0.6557
== Early stopping achieved.
== Best iteration / VAL20 (val_loss=11.7205)
[iter 0] loss=12.5841 val_loss=12.3805 scale=1.0000 norm=0.6686
== Early stopping achieved.
== Best iteration / VAL19 (val_loss=11.7248)
[iter 0] loss=12.5858 val_loss=12.3850 scale=1.0000 norm=0.6549


[I 2021-08-11 15:49:17,221] Trial 7 finished with value: -995635562.5387728 and parameters: {'base_learner': 'DecTree_depth5', 'Dist': 'LogNormal', 'n_estimators': 30357, 'minibatch_frac': 0.6520306452412628, 'learning_rate': 0.07443859965572573}. Best is trial 1 with value: -841773693.6131566.


== Early stopping achieved.
== Best iteration / VAL21 (val_loss=11.5337)
[iter 0] loss=12.7383 val_loss=12.5604 scale=1.0000 norm=61992.1285
[iter 100] loss=11.7581 val_loss=11.9936 scale=2.0000 norm=33983.7433
== Early stopping achieved.
== Best iteration / VAL174 (val_loss=11.7593)
[iter 0] loss=12.7322 val_loss=12.5565 scale=1.0000 norm=61246.7926
[iter 100] loss=11.7533 val_loss=11.9337 scale=2.0000 norm=34702.7446
[iter 200] loss=11.0545 val_loss=11.6326 scale=2.0000 norm=18093.8642
== Early stopping achieved.
== Best iteration / VAL192 (val_loss=11.6246)
[iter 0] loss=12.7288 val_loss=12.5635 scale=1.0000 norm=61313.1602
[iter 100] loss=11.7579 val_loss=11.9539 scale=2.0000 norm=35731.2121
== Early stopping achieved.
== Best iteration / VAL188 (val_loss=11.6735)
[iter 0] loss=12.6785 val_loss=12.5330 scale=1.0000 norm=59716.2593
[iter 100] loss=11.7157 val_loss=11.9168 scale=2.0000 norm=33569.4674
[iter 200] loss=10.9931 val_loss=11.5854 scale=2.0000 norm=16484.9179
== Early stop

[I 2021-08-11 15:49:27,183] Trial 8 finished with value: -943536036.9203793 and parameters: {'base_learner': 'DecTree_depth5', 'Dist': 'Normal', 'n_estimators': 20638, 'minibatch_frac': 0.47782438481492934, 'learning_rate': 0.008777099102532467}. Best is trial 1 with value: -841773693.6131566.


== Early stopping achieved.
== Best iteration / VAL188 (val_loss=11.6343)
[iter 0] loss=12.7507 val_loss=12.5599 scale=2.0000 norm=123681.6000
[iter 100] loss=11.9333 val_loss=12.0695 scale=2.0000 norm=42888.3306
[iter 200] loss=11.4091 val_loss=11.7588 scale=2.0000 norm=23604.1355
== Early stopping achieved.
== Best iteration / VAL235 (val_loss=11.7276)
[iter 0] loss=12.7498 val_loss=12.5569 scale=2.0000 norm=122261.6448
[iter 100] loss=11.9366 val_loss=12.0503 scale=2.0000 norm=44484.6595
[iter 200] loss=11.4167 val_loss=11.6923 scale=2.0000 norm=24726.0934
== Early stopping achieved.
== Best iteration / VAL252 (val_loss=11.6231)
[iter 0] loss=12.7534 val_loss=12.5588 scale=2.0000 norm=122801.9737
[iter 100] loss=11.9376 val_loss=12.0661 scale=2.0000 norm=44412.5273
[iter 200] loss=11.4114 val_loss=11.7311 scale=2.0000 norm=24382.0396
== Early stopping achieved.
== Best iteration / VAL243 (val_loss=11.6825)
[iter 0] loss=12.7102 val_loss=12.5315 scale=2.0000 norm=119495.7304
[iter 10

[I 2021-08-11 15:51:45,745] Trial 9 finished with value: -898811956.2389376 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'Normal', 'n_estimators': 20164, 'minibatch_frac': 0.7013867394187225, 'learning_rate': 0.006402087918758834}. Best is trial 1 with value: -841773693.6131566.


== Early stopping achieved.
== Best iteration / VAL282 (val_loss=11.5640)
[iter 0] loss=13.1300 val_loss=13.0619 scale=1.0000 norm=0.3370
[iter 100] loss=13.0588 val_loss=13.0050 scale=2.0000 norm=0.0316
== Early stopping achieved.
== Best iteration / VAL115 (val_loss=13.0050)
[iter 0] loss=13.0959 val_loss=13.0597 scale=1.0000 norm=0.3350
[iter 100] loss=13.0411 val_loss=13.0075 scale=2.0000 norm=0.0368
== Early stopping achieved.
== Best iteration / VAL108 (val_loss=13.0074)
[iter 0] loss=13.1072 val_loss=13.0608 scale=2.0000 norm=0.6665
[iter 100] loss=13.0637 val_loss=13.0083 scale=2.0000 norm=0.0336
== Early stopping achieved.
== Best iteration / VAL108 (val_loss=13.0082)
[iter 0] loss=13.1050 val_loss=13.0610 scale=1.0000 norm=0.3260
[iter 100] loss=13.0407 val_loss=13.0064 scale=2.0000 norm=0.0348
== Early stopping achieved.
== Best iteration / VAL115 (val_loss=13.0063)
[iter 0] loss=13.1164 val_loss=13.0581 scale=2.0000 norm=0.6534


[I 2021-08-11 15:51:50,285] Trial 10 finished with value: -955726642.4597628 and parameters: {'base_learner': 'DecTree_depthNone', 'Dist': 'Exponential', 'n_estimators': 2199, 'minibatch_frac': 0.41127841025102024, 'learning_rate': 0.02392285850088981}. Best is trial 1 with value: -841773693.6131566.


== Early stopping achieved.
== Best iteration / VAL81 (val_loss=13.0064)
[iter 0] loss=12.5597 val_loss=12.4192 scale=1.0000 norm=0.6425
[iter 100] loss=12.3735 val_loss=12.3319 scale=1.0000 norm=0.5244
[iter 200] loss=12.1586 val_loss=12.2514 scale=1.0000 norm=0.4790
[iter 300] loss=12.0207 val_loss=12.1592 scale=1.0000 norm=0.4590
[iter 400] loss=11.8732 val_loss=12.0535 scale=1.0000 norm=0.4527
[iter 500] loss=11.7555 val_loss=11.9375 scale=1.0000 norm=0.4505
[iter 600] loss=11.5794 val_loss=11.8395 scale=2.0000 norm=0.8859
[iter 700] loss=11.4613 val_loss=11.7566 scale=1.0000 norm=0.4378
[iter 800] loss=11.3734 val_loss=11.7061 scale=1.0000 norm=0.4333
[iter 900] loss=11.2680 val_loss=11.6829 scale=1.0000 norm=0.4236
== Early stopping achieved.
== Best iteration / VAL959 (val_loss=11.6762)
[iter 0] loss=12.5370 val_loss=12.4161 scale=1.0000 norm=0.6301
[iter 100] loss=12.3589 val_loss=12.3055 scale=1.0000 norm=0.5208
[iter 200] loss=12.1602 val_loss=12.2135 scale=1.0000 norm=0.4783

[I 2021-08-11 15:52:24,331] Trial 11 finished with value: -957233326.0940228 and parameters: {'base_learner': 'DecTree_depth2', 'Dist': 'LogNormal', 'n_estimators': 9382, 'minibatch_frac': 0.5912806398283483, 'learning_rate': 0.0032738656888004775}. Best is trial 1 with value: -841773693.6131566.


[iter 0] loss=12.5727 val_loss=12.4124 scale=2.0000 norm=1.3095
[iter 100] loss=12.0591 val_loss=12.0858 scale=2.0000 norm=0.8973
[iter 200] loss=11.7221 val_loss=11.8337 scale=2.0000 norm=0.8888
[iter 300] loss=11.4350 val_loss=11.6287 scale=2.0000 norm=0.8785
[iter 400] loss=11.1940 val_loss=11.5083 scale=2.0000 norm=0.8609
== Early stopping achieved.
== Best iteration / VAL458 (val_loss=11.4878)
[iter 0] loss=12.5714 val_loss=12.4109 scale=2.0000 norm=1.3070
[iter 100] loss=12.0543 val_loss=12.0754 scale=2.0000 norm=0.8978
[iter 200] loss=11.7261 val_loss=11.8179 scale=2.0000 norm=0.8862
[iter 300] loss=11.4395 val_loss=11.6046 scale=2.0000 norm=0.8756
[iter 400] loss=11.1956 val_loss=11.4703 scale=2.0000 norm=0.8549
== Early stopping achieved.
== Best iteration / VAL487 (val_loss=11.4342)
[iter 0] loss=12.5990 val_loss=12.4154 scale=2.0000 norm=1.3206
[iter 100] loss=12.0656 val_loss=12.0923 scale=2.0000 norm=0.8982
[iter 200] loss=11.7336 val_loss=11.8321 scale=2.0000 norm=0.8853


[I 2021-08-11 15:56:00,568] Trial 12 finished with value: -865269061.8109213 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 14000, 'minibatch_frac': 0.8036426924778791, 'learning_rate': 0.0035865491785189974}. Best is trial 1 with value: -841773693.6131566.


== Early stopping achieved.
== Best iteration / VAL522 (val_loss=11.3731)
[iter 0] loss=12.5551 val_loss=12.3832 scale=2.0000 norm=1.2862
== Early stopping achieved.
== Best iteration / VAL74 (val_loss=11.4421)
[iter 0] loss=12.5423 val_loss=12.3869 scale=2.0000 norm=1.2696
== Early stopping achieved.
== Best iteration / VAL74 (val_loss=11.4008)
[iter 0] loss=12.5822 val_loss=12.3883 scale=2.0000 norm=1.3055
== Early stopping achieved.
== Best iteration / VAL76 (val_loss=11.4197)
[iter 0] loss=12.5575 val_loss=12.3790 scale=2.0000 norm=1.2866
== Early stopping achieved.
== Best iteration / VAL80 (val_loss=11.3316)
[iter 0] loss=12.5482 val_loss=12.3906 scale=2.0000 norm=1.2514


[I 2021-08-11 15:56:25,882] Trial 13 finished with value: -862744342.8312544 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 41882, 'minibatch_frac': 0.5693107410185092, 'learning_rate': 0.023449174740242787}. Best is trial 1 with value: -841773693.6131566.


== Early stopping achieved.
== Best iteration / VAL80 (val_loss=11.3411)
[iter 0] loss=12.5638 val_loss=12.3855 scale=2.0000 norm=1.3038
== Early stopping achieved.
== Best iteration / VAL81 (val_loss=11.4640)
[iter 0] loss=12.5393 val_loss=12.3853 scale=2.0000 norm=1.2780
== Early stopping achieved.
== Best iteration / VAL87 (val_loss=11.3779)
[iter 0] loss=12.5682 val_loss=12.3928 scale=2.0000 norm=1.3120
== Early stopping achieved.
== Best iteration / VAL82 (val_loss=11.4112)
[iter 0] loss=12.5398 val_loss=12.3831 scale=2.0000 norm=1.2805
== Early stopping achieved.
== Best iteration / VAL90 (val_loss=11.3593)
[iter 0] loss=12.5480 val_loss=12.3891 scale=2.0000 norm=1.2646


[I 2021-08-11 15:56:50,271] Trial 14 finished with value: -838944770.0473243 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 24667, 'minibatch_frac': 0.4841608783288508, 'learning_rate': 0.0212282965443222}. Best is trial 14 with value: -838944770.0473243.


== Early stopping achieved.
== Best iteration / VAL87 (val_loss=11.3607)
[iter 0] loss=13.1276 val_loss=13.0620 scale=1.0000 norm=0.3347
[iter 100] loss=13.0514 val_loss=13.0060 scale=2.0000 norm=0.0351
== Early stopping achieved.
== Best iteration / VAL128 (val_loss=13.0057)
[iter 0] loss=13.0941 val_loss=13.0598 scale=1.0000 norm=0.3327
[iter 100] loss=13.0370 val_loss=13.0068 scale=2.0000 norm=0.0395
== Early stopping achieved.
== Best iteration / VAL96 (val_loss=13.0067)
[iter 0] loss=13.1068 val_loss=13.0609 scale=2.0000 norm=0.6608
[iter 100] loss=13.0593 val_loss=13.0077 scale=2.0000 norm=0.0370
== Early stopping achieved.
== Best iteration / VAL104 (val_loss=13.0076)
[iter 0] loss=13.1030 val_loss=13.0606 scale=2.0000 norm=0.6499
[iter 100] loss=13.0376 val_loss=13.0069 scale=2.0000 norm=0.0379
== Early stopping achieved.
== Best iteration / VAL104 (val_loss=13.0068)
[iter 0] loss=13.1165 val_loss=13.0585 scale=2.0000 norm=0.6555
[iter 100] loss=13.0441 val_loss=13.0081 scale=2

[I 2021-08-11 15:56:55,122] Trial 15 finished with value: -908728780.6353728 and parameters: {'base_learner': 'DecTree_depthNone', 'Dist': 'Exponential', 'n_estimators': 24479, 'minibatch_frac': 0.41734369906144025, 'learning_rate': 0.02262216186594188}. Best is trial 14 with value: -838944770.0473243.


== Early stopping achieved.
== Best iteration / VAL126 (val_loss=13.0078)
[iter 0] loss=12.5734 val_loss=12.3691 scale=2.0000 norm=1.3158
== Early stopping achieved.
== Best iteration / VAL44 (val_loss=11.4791)
[iter 0] loss=12.5276 val_loss=12.3618 scale=2.0000 norm=1.2463
== Early stopping achieved.
== Best iteration / VAL50 (val_loss=11.3704)
[iter 0] loss=12.5604 val_loss=12.3713 scale=2.0000 norm=1.2819
== Early stopping achieved.
== Best iteration / VAL45 (val_loss=11.4240)
[iter 0] loss=12.5310 val_loss=12.3602 scale=2.0000 norm=1.2479
== Early stopping achieved.
== Best iteration / VAL50 (val_loss=11.3488)
[iter 0] loss=12.5337 val_loss=12.3716 scale=2.0000 norm=1.2291


[I 2021-08-11 15:57:09,527] Trial 16 finished with value: -862580789.7945347 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 26898, 'minibatch_frac': 0.47012721743628566, 'learning_rate': 0.03722742594440133}. Best is trial 14 with value: -838944770.0473243.


== Early stopping achieved.
== Best iteration / VAL50 (val_loss=11.3556)
[iter 0] loss=12.5709 val_loss=12.4194 scale=1.0000 norm=0.6445
[iter 100] loss=11.7169 val_loss=11.8971 scale=1.0000 norm=0.4324
== Early stopping achieved.
== Best iteration / VAL182 (val_loss=11.6511)
[iter 0] loss=12.5325 val_loss=12.4051 scale=1.0000 norm=0.6226
[iter 100] loss=11.7160 val_loss=11.8426 scale=2.0000 norm=0.8662
[iter 200] loss=11.1721 val_loss=11.5553 scale=1.0000 norm=0.4255
== Early stopping achieved.
== Best iteration / VAL201 (val_loss=11.5548)
[iter 0] loss=12.5703 val_loss=12.4258 scale=1.0000 norm=0.6416
[iter 100] loss=11.7650 val_loss=11.9029 scale=1.0000 norm=0.4327
== Early stopping achieved.
== Best iteration / VAL169 (val_loss=11.6739)
[iter 0] loss=12.5422 val_loss=12.4090 scale=1.0000 norm=0.6210
[iter 100] loss=11.7027 val_loss=11.8282 scale=1.0000 norm=0.4282
== Early stopping achieved.
== Best iteration / VAL189 (val_loss=11.5641)
[iter 0] loss=12.5632 val_loss=12.4136 scale=

[I 2021-08-11 15:57:15,415] Trial 17 finished with value: -956382167.5523866 and parameters: {'base_learner': 'DecTree_depth2', 'Dist': 'LogNormal', 'n_estimators': 37520, 'minibatch_frac': 0.4001954173753072, 'learning_rate': 0.01617171566083057}. Best is trial 14 with value: -838944770.0473243.


== Early stopping achieved.
== Best iteration / VAL189 (val_loss=11.6106)
[iter 0] loss=12.5627 val_loss=12.3956 scale=2.0000 norm=1.2930
[iter 100] loss=11.2520 val_loss=11.4848 scale=2.0000 norm=0.8410
== Early stopping achieved.
== Best iteration / VAL128 (val_loss=11.4355)
[iter 0] loss=12.5499 val_loss=12.3966 scale=2.0000 norm=1.2764
[iter 100] loss=11.2425 val_loss=11.4453 scale=2.0000 norm=0.8439
== Early stopping achieved.
== Best iteration / VAL130 (val_loss=11.3822)
[iter 0] loss=12.5900 val_loss=12.3986 scale=2.0000 norm=1.3120
[iter 100] loss=11.2411 val_loss=11.4522 scale=2.0000 norm=0.8402
== Early stopping achieved.
== Best iteration / VAL130 (val_loss=11.3966)
[iter 0] loss=12.5644 val_loss=12.3913 scale=2.0000 norm=1.2935
[iter 100] loss=11.2321 val_loss=11.4226 scale=2.0000 norm=0.8421
== Early stopping achieved.
== Best iteration / VAL130 (val_loss=11.3477)
[iter 0] loss=12.5527 val_loss=12.4015 scale=2.0000 norm=1.2521
[iter 100] loss=11.2693 val_loss=11.4266 scale

[I 2021-08-11 15:57:58,536] Trial 18 finished with value: -845962770.3142052 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 48119, 'minibatch_frac': 0.5582798872292, 'learning_rate': 0.0139163392804938}. Best is trial 14 with value: -838944770.0473243.


== Early stopping achieved.
== Best iteration / VAL134 (val_loss=11.3348)
[iter 0] loss=12.5667 val_loss=12.4095 scale=2.0000 norm=1.3061
[iter 100] loss=11.9073 val_loss=11.9501 scale=2.0000 norm=0.8865
[iter 200] loss=11.4526 val_loss=11.6115 scale=2.0000 norm=0.8699
[iter 300] loss=11.1248 val_loss=11.4394 scale=2.0000 norm=0.8644
== Early stopping achieved.
== Best iteration / VAL333 (val_loss=11.4260)
[iter 0] loss=12.5419 val_loss=12.4086 scale=2.0000 norm=1.2801
[iter 100] loss=11.9041 val_loss=11.9391 scale=2.0000 norm=0.8879
[iter 200] loss=11.4622 val_loss=11.5877 scale=2.0000 norm=0.8737
[iter 300] loss=11.1411 val_loss=11.3933 scale=2.0000 norm=0.8544
== Early stopping achieved.
== Best iteration / VAL352 (val_loss=11.3697)
[iter 0] loss=12.5710 val_loss=12.4129 scale=2.0000 norm=1.3144
[iter 100] loss=11.9318 val_loss=11.9544 scale=2.0000 norm=0.8914
[iter 200] loss=11.4634 val_loss=11.5987 scale=2.0000 norm=0.8654
[iter 300] loss=11.1431 val_loss=11.4049 scale=2.0000 norm

[I 2021-08-11 15:59:33,900] Trial 19 finished with value: -861329091.4097408 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 41930, 'minibatch_frac': 0.4809968955761645, 'learning_rate': 0.005350001440076019}. Best is trial 14 with value: -838944770.0473243.


== Early stopping achieved.
== Best iteration / VAL359 (val_loss=11.3373)
[iter 0] loss=12.5725 val_loss=12.3900 scale=2.0000 norm=1.2576
== Early stopping achieved.
== Best iteration / VAL87 (val_loss=11.3711)
Started Predict with Ngboost at 15:59:40.
The R2 score is 0.880364570773793
The MAE score is 15853.002350163648
The Median absolute error score is 9910.137334882864
The MSE score is 499622376.4758852
The RMSE score is 22352.23426138616
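
The scores printed at the end of the run are standard regression metrics. As a minimal sketch on made-up numbers (not the notebook's data), they can be recomputed with scikit-learn. The relationship worth keeping in mind is that RMSE is the square root of MSE, so RMSE is on the same scale as the target while MSE is in squared units.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

# Made-up true and predicted sale prices, for illustration only.
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 185000.0])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is defined as the square root of MSE

print(f"R2={r2:.4f} MAE={mae:.1f} MedAE={medae:.1f} MSE={mse:.1f} RMSE={rmse:.1f}")
```

Applied to the run above, the value near 22,352 is the square root of the value near 499.6 million, so the smaller number is the RMSE and the larger one the MSE.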


In [6]:
# Save pipeline
save_to_production(housing_ml, file_name='housing_automl_instance')

# Predict on new data
At the beginning we set aside a holdout dataset. We now use it to simulate prediction on completely new data.

In [7]:
# load stored pipeline
housing_ml_loaded = load_for_production(file_name='housing_automl_instance')
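
Saving and reloading matter because everything learned during training has to be reused unchanged when predicting on new data. How `save_to_production` and `load_for_production` store the object is not shown in this notebook. The sketch below only illustrates the general save/restore round trip with Python's standard `pickle` module; `fitted_state` and its contents are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical "fitted state": values learned during training that must be
# reused unchanged when predicting on new data.
fitted_state = {
    "fill_value": 70.0,
    "scale": 8.0,
    "columns": ["LotArea", "LotFrontage"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fitted_state.pkl"
    # Persist the state to disk ...
    with open(path, "wb") as fh:
        pickle.dump(fitted_state, fh)
    # ... and restore it, as a later session would.
    with open(path, "rb") as fh:
        restored_state = pickle.load(fh)

# The restored state is equal to the original but is a separate object.
print(restored_state == fitted_state, restored_state is fitted_state)
```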

In [8]:
# predict on new data
housing_ml_loaded.ml_bp14_regressions_full_processing_ngboost(val_df)

# access predicted values
val_y_hat = housing_ml_loaded.predicted_values['ngboost']

Started Execute test train split at 15:59:41.
Started Apply datetime transformation at 15:59:41.
Started Start Spacy, POS tagging at 15:59:41.
Started Handle rare features at 15:59:41.
Started Remove cardinality at 15:59:41.
Started Onehot + PCA categorical features at 15:59:41.
Started Execute categorical encoding at 15:59:41.
Started  Delete columns with high share of NULLs at 15:59:41.
Started Fill nulls at 15:59:41.
Started Execute numerical binning at 15:59:41.
Started Handle outliers at 15:59:41.
Started Remove collinearity at 15:59:41.
Started Execute clustering as a feature at 15:59:41.
Started Execute clustering as a feature at 15:59:41.
Started Execute clustering as a feature at 15:59:41.
Started Execute clustering as a feature at 15:59:42.
Started Execute clustering as a feature at 15:59:42.
Started Execute clustering as a feature at 15:59:42.
Started Execute clustering as a feature at 15:59:42.
Started Execute clustering as a feature at 15:59:42.
Started Execute clustering 

In [9]:
# Assess prediction quality on holdout data
mae = mean_absolute_error(val_df_target, val_y_hat)
print(mae)

15893.900047278938
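
The holdout MAE of about 15,894 is close to the MAE of about 15,853 reported at the end of the training run, which suggests the pipeline performs on unseen rows roughly as well as it scored internally. The short sketch below spells out what `mean_absolute_error` computes, using made-up numbers, and then quantifies the relative gap between the two scores copied from this notebook's output.

```python
import numpy as np

# MAE is the mean of the absolute differences (made-up numbers).
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])
mae_by_hand = np.mean(np.abs(y_true - y_pred))

# Scores copied from this notebook's output.
mae_training_run = 15853.002350163648
mae_holdout = 15893.900047278938
relative_gap = (mae_holdout - mae_training_run) / mae_training_run

print(mae_by_hand)
print(f"{relative_gap:.2%}")
```

A gap well below one percent is small, though a single holdout split is only one sample of how the model behaves on new data.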
