# Housing prices regression

In [1]:
# load libraries
from e2eml.regression import regression_blueprints as rb
from e2eml.full_processing.postprocessing import save_to_production, load_for_production
from e2eml.test.regression_blueprints_test import load_housingprices_data
import pandas as pd
from sklearn.metrics import mean_absolute_error

[nltk_data] Downloading package punkt to /home/thomas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/thomas/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/thomas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/thomas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Feature engineering
Load & preprocess housing prices dataset.



In [2]:
# load Housing price data
test_df, test_target, val_df, val_df_target, test_categorical_cols = load_housingprices_data()

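The loader returns the data already split: test_df (with target test_target) is used for training below, while val_df and val_df_target are kept aside as a holdout set.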
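How load_housingprices_data builds these splits is not shown in this notebook. As a rough illustration of the idea only, a holdout split sets some rows aside and separates their target, so that they can later play the role of unseen data. The helper name holdout_split, the toy rows and the "SalePrice" key are assumptions made for this sketch and are not taken from e2eml.

```python
import random


def holdout_split(rows, target_key, holdout_frac=0.2, seed=42):
    """Split a list of dict rows into training rows and a holdout set.

    Hypothetical helper for illustration; not part of e2eml.
    """
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list stays untouched
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    # Training rows keep the target column; holdout features and target are separated.
    holdout_target = [row[target_key] for row in holdout]
    holdout_features = [
        {key: value for key, value in row.items() if key != target_key}
        for row in holdout
    ]
    return train, holdout_features, holdout_target


# Toy data with made-up values
toy_rows = [{"LotArea": 8000 + 100 * i, "SalePrice": 150000 + 1000 * i} for i in range(10)]
train_rows, holdout_rows, holdout_target = holdout_split(toy_rows, "SalePrice")
print(len(train_rows), len(holdout_rows), len(holdout_target))
```

Keeping the holdout target in a separate object makes it harder to leak it into the features passed to the pipeline by accident.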


# Using e2eml - Run and save a pipeline
We only need a few steps to get our full pipeline:
- Instantiate class
- Run chosen blueprint
- Save blueprint for later usage

In [3]:
# Instantiate class
housing_ml = rb.RegressionBluePrint(datasource=test_df,
                                    target_variable=test_target,
                                    categorical_columns=test_categorical_cols,  # optional: explicitly name the categorical columns
                                    preferred_training_mode='auto',
                                    tune_mode='accurate',
                                    ml_task='regression')  # usually not needed, but sometimes has to be set explicitly

Ml task is regression
Preferred training mode auto has been chosen. e2eml will automatically detect, if LGBM and Xgboost can use GPU acceleration and optimize the workflow accordingly.


In [5]:
"""
In this case we chose Ngboost, which uses natural gradients. It is really strong for regression problems, but
unfortunately it has no GPU acceleration at all. Nevertheless, we always recommend trying Ngboost if possible.
"""
housing_ml.ml_bp14_regressions_full_processing_ngboost()

Started Execute test train split at 19:38:12.
Started Apply datetime transformation at 19:38:13.
Started Start Spacy, POS tagging at 19:38:13.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 29 to 102
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             800 non-null    int64  
 1   MSSubClass     800 non-null    int64  
 2   MSZoning       800 non-null    object 
 3   LotFrontage    663 non-null    float64
 4   LotArea        800 non-null    int64  
 5   Street         800 non-null    object 
 6   Alley          50 non-null     object 
 7   LotShape       800 non-null    object 
 8   LandContour    800 non-null    object 
 9   Utilities      800 non-null    object 
 10  LotConfig      800 non-null    object 
 11  LandSlope      800 non-null    object 
 12  Neighborhood   800 non-null    object 
 13  Condition1     800 non-null    object 
 14  Condition2     800 non-null    object 
 15

is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead


Started Execute categorical encoding at 19:38:13.


is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead


Started  Delete columns with high share of NULLs at 19:38:14.
Started Fill nulls at 19:38:14.
Started Execute numerical binning at 19:38:14.
Started Handle outliers at 19:38:14.
Started Remove collinearity at 19:38:14.
Started Execute clustering as a feature at 19:38:15.
Started Scale data at 19:38:15.
Started Execute clustering as a feature at 19:38:15.
Started Execute clustering as a feature at 19:38:15.
Started Execute clustering as a feature at 19:38:16.
Started Execute clustering as a feature at 19:38:16.
Started Execute clustering as a feature at 19:38:16.
Started Execute clustering as a feature at 19:38:17.
Started Execute clustering as a feature at 19:38:17.
Started Execute clustering as a feature at 19:38:17.
Started Select best features at 19:38:18.
Features before selection are...Id
Features before selection are...MSSubClass
Features before selection are...MSZoning
Features before selection are...LotFrontage
Features before selection are...LotArea
Features before selection a

[I 2021-08-01 19:38:18,552] A new study created in memory with name: no-name-54c51c2f-c5f3-441f-b75e-70bcc1384875


Started Train Ngboost at 19:38:18.
[iter 0] loss=12.5783 val_loss=12.4042 scale=2.0000 norm=1.3278
[iter 100] loss=11.6037 val_loss=11.7697 scale=2.0000 norm=0.9175
== Early stopping achieved.
== Best iteration / VAL178 (val_loss=11.5364)
[iter 0] loss=12.5283 val_loss=12.4056 scale=2.0000 norm=1.2518
[iter 100] loss=11.5907 val_loss=11.8594 scale=2.0000 norm=0.9193
== Early stopping achieved.
== Best iteration / VAL154 (val_loss=11.7491)
[iter 0] loss=12.5605 val_loss=12.4089 scale=2.0000 norm=1.2811
[iter 100] loss=11.6290 val_loss=11.8220 scale=2.0000 norm=0.9204
== Early stopping achieved.
== Best iteration / VAL162 (val_loss=11.6541)
[iter 0] loss=12.5306 val_loss=12.4027 scale=2.0000 norm=1.2464
[iter 100] loss=11.5859 val_loss=11.7849 scale=2.0000 norm=0.9215
== Early stopping achieved.
== Best iteration / VAL173 (val_loss=11.5796)
[iter 0] loss=12.5308 val_loss=12.4141 scale=2.0000 norm=1.2127
[iter 100] loss=11.6111 val_loss=11.8337 scale=2.0000 norm=0.9234


[I 2021-08-01 19:38:30,072] Trial 0 finished with value: -982517592.4782884 and parameters: {'base_learner': 'DecTree_depthNone', 'Dist': 'LogNormal', 'n_estimators': 39639, 'minibatch_frac': 0.45705911818108086, 'learning_rate': 0.007977509434108937}. Best is trial 0 with value: -982517592.4782884.


== Early stopping achieved.
== Best iteration / VAL163 (val_loss=11.6565)
[iter 0] loss=12.5683 val_loss=12.3998 scale=2.0000 norm=1.2922
== Early stopping achieved.
== Best iteration / VAL89 (val_loss=11.6469)
[iter 0] loss=12.5641 val_loss=12.4035 scale=2.0000 norm=1.3017
== Early stopping achieved.
== Best iteration / VAL80 (val_loss=11.8189)
[iter 0] loss=12.5824 val_loss=12.4008 scale=2.0000 norm=1.3075
== Early stopping achieved.
== Best iteration / VAL73 (val_loss=11.8582)
[iter 0] loss=12.5793 val_loss=12.3985 scale=2.0000 norm=1.3418
== Early stopping achieved.
== Best iteration / VAL78 (val_loss=11.8181)
[iter 0] loss=12.5896 val_loss=12.4139 scale=2.0000 norm=1.3233


[I 2021-08-01 19:38:37,701] Trial 1 finished with value: -1080547522.3364635 and parameters: {'base_learner': 'DecTree_depthNone', 'Dist': 'LogNormal', 'n_estimators': 37769, 'minibatch_frac': 0.6817067336254456, 'learning_rate': 0.013319665098625685}. Best is trial 0 with value: -982517592.4782884.


== Early stopping achieved.
== Best iteration / VAL82 (val_loss=11.8213)
[iter 0] loss=13.1231 val_loss=13.0630 scale=1.0000 norm=0.3314
[iter 100] loss=13.1467 val_loss=13.0548 scale=1.0000 norm=0.3006
[iter 200] loss=13.0841 val_loss=13.0469 scale=2.0000 norm=0.5054
[iter 300] loss=13.0806 val_loss=13.0402 scale=2.0000 norm=0.4414
[iter 400] loss=13.0596 val_loss=13.0347 scale=2.0000 norm=0.3937
[iter 500] loss=13.0884 val_loss=13.0299 scale=2.0000 norm=0.3420
[iter 600] loss=13.0711 val_loss=13.0256 scale=2.0000 norm=0.2988
[iter 700] loss=13.0582 val_loss=13.0223 scale=2.0000 norm=0.2856
[iter 800] loss=13.0671 val_loss=13.0195 scale=2.0000 norm=0.2581
[iter 900] loss=13.0744 val_loss=13.0172 scale=2.0000 norm=0.2351
[iter 1000] loss=13.0610 val_loss=13.0152 scale=2.0000 norm=0.2141
[iter 1100] loss=13.0423 val_loss=13.0136 scale=2.0000 norm=0.2021
[iter 1200] loss=13.0432 val_loss=13.0122 scale=2.0000 norm=0.1894
[iter 1300] loss=13.0647 val_loss=13.0110 scale=2.0000 norm=0.1934
[

[iter 1500] loss=13.0487 val_loss=13.0101 scale=2.0000 norm=0.1806
[iter 1600] loss=13.0284 val_loss=13.0096 scale=2.0000 norm=0.1734
[iter 1700] loss=13.0255 val_loss=13.0091 scale=2.0000 norm=0.1708
[iter 1800] loss=13.0417 val_loss=13.0087 scale=2.0000 norm=0.1701
[iter 1900] loss=13.0436 val_loss=13.0084 scale=1.0000 norm=0.0834
[iter 2000] loss=13.0149 val_loss=13.0081 scale=2.0000 norm=0.1679
[iter 2100] loss=13.0402 val_loss=13.0079 scale=2.0000 norm=0.1653
[iter 2200] loss=13.0449 val_loss=13.0077 scale=2.0000 norm=0.1576
[iter 2300] loss=13.0481 val_loss=13.0075 scale=1.0000 norm=0.0777
[iter 2400] loss=13.0441 val_loss=13.0074 scale=2.0000 norm=0.1604
[iter 2500] loss=13.0433 val_loss=13.0073 scale=2.0000 norm=0.1490
[iter 2600] loss=13.0414 val_loss=13.0072 scale=2.0000 norm=0.1626
[iter 2700] loss=13.0392 val_loss=13.0070 scale=2.0000 norm=0.1525
[iter 2800] loss=13.0429 val_loss=13.0069 scale=2.0000 norm=0.1485


[I 2021-08-01 19:39:46,168] Trial 2 finished with value: -1658465215.476712 and parameters: {'base_learner': 'DecTree_depth2', 'Dist': 'Exponential', 'n_estimators': 30050, 'minibatch_frac': 0.6244943204616877, 'learning_rate': 0.001342888372808318}. Best is trial 0 with value: -982517592.4782884.


== Early stopping achieved.
== Best iteration / VAL2885 (val_loss=13.0069)
[iter 0] loss=12.5698 val_loss=12.4123 scale=1.0000 norm=0.6524
[iter 100] loss=11.3604 val_loss=11.7615 scale=2.0000 norm=0.8729
== Early stopping achieved.
== Best iteration / VAL115 (val_loss=11.7398)
[iter 0] loss=12.5671 val_loss=12.4130 scale=1.0000 norm=0.6514
[iter 100] loss=11.3732 val_loss=11.8293 scale=2.0000 norm=0.8783
== Early stopping achieved.
== Best iteration / VAL110 (val_loss=11.8161)
[iter 0] loss=12.5923 val_loss=12.4131 scale=1.0000 norm=0.6573
[iter 100] loss=11.3979 val_loss=11.8869 scale=2.0000 norm=0.8811
== Early stopping achieved.
== Best iteration / VAL110 (val_loss=11.8802)
[iter 0] loss=12.5777 val_loss=12.4086 scale=1.0000 norm=0.6626
[iter 100] loss=11.3681 val_loss=11.7930 scale=2.0000 norm=0.8799
== Early stopping achieved.
== Best iteration / VAL115 (val_loss=11.7741)
[iter 0] loss=12.5915 val_loss=12.4180 scale=1.0000 norm=0.6558
[iter 100] loss=11.3987 val_loss=11.7981 scal

[I 2021-08-01 19:39:52,996] Trial 3 finished with value: -1016135254.5126966 and parameters: {'base_learner': 'DecTree_depth5', 'Dist': 'LogNormal', 'n_estimators': 48397, 'minibatch_frac': 0.815053931146642, 'learning_rate': 0.011620994281146262}. Best is trial 0 with value: -982517592.4782884.


== Early stopping achieved.
== Best iteration / VAL115 (val_loss=11.7732)
[iter 0] loss=12.5614 val_loss=12.4052 scale=2.0000 norm=1.2876
[iter 100] loss=11.7472 val_loss=11.7804 scale=2.0000 norm=0.8768
[iter 200] loss=11.1682 val_loss=11.4233 scale=2.0000 norm=0.8582
== Early stopping achieved.
== Best iteration / VAL237 (val_loss=11.3906)
[iter 0] loss=12.5386 val_loss=12.4071 scale=2.0000 norm=1.2627
[iter 100] loss=11.7351 val_loss=11.8115 scale=2.0000 norm=0.8796
[iter 200] loss=11.1800 val_loss=11.4814 scale=2.0000 norm=0.8692
== Early stopping achieved.
== Best iteration / VAL229 (val_loss=11.4595)
[iter 0] loss=12.5771 val_loss=12.4087 scale=2.0000 norm=1.2971
[iter 100] loss=11.7416 val_loss=11.8009 scale=2.0000 norm=0.8830
[iter 200] loss=11.1828 val_loss=11.4620 scale=2.0000 norm=0.8521
== Early stopping achieved.
== Best iteration / VAL228 (val_loss=11.4442)
[iter 0] loss=12.5677 val_loss=12.4027 scale=2.0000 norm=1.3094
[iter 100] loss=11.7147 val_loss=11.7829 scale=2.000

[I 2021-08-01 19:41:11,872] Trial 4 finished with value: -913623978.6309942 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 13401, 'minibatch_frac': 0.589057954626008, 'learning_rate': 0.007443427210302614}. Best is trial 4 with value: -913623978.6309942.


== Early stopping achieved.
== Best iteration / VAL245 (val_loss=11.4120)
[iter 0] loss=12.7401 val_loss=12.5298 scale=2.0000 norm=123010.2110
== Early stopping achieved.
== Best iteration / VAL69 (val_loss=11.4587)
[iter 0] loss=12.7404 val_loss=12.5344 scale=2.0000 norm=122229.1229
== Early stopping achieved.
== Best iteration / VAL59 (val_loss=11.6529)
[iter 0] loss=12.7474 val_loss=12.5332 scale=2.0000 norm=124106.7043
== Early stopping achieved.
== Best iteration / VAL67 (val_loss=11.5385)
[iter 0] loss=12.7061 val_loss=12.5121 scale=2.0000 norm=119328.6693
== Early stopping achieved.
== Best iteration / VAL70 (val_loss=11.4243)
[iter 0] loss=12.7258 val_loss=12.5314 scale=2.0000 norm=121297.9366


[I 2021-08-01 19:42:26,403] Trial 5 finished with value: -915836696.8694875 and parameters: {'base_learner': 'GradientBoost_depth5', 'Dist': 'Normal', 'n_estimators': 31356, 'minibatch_frac': 0.8279029318909668, 'learning_rate': 0.02287051529133464}. Best is trial 4 with value: -913623978.6309942.


== Early stopping achieved.
== Best iteration / VAL68 (val_loss=11.4725)
[iter 0] loss=12.7622 val_loss=12.5615 scale=1.0000 norm=61491.7830
[iter 100] loss=12.0063 val_loss=12.1268 scale=2.0000 norm=48276.5585
[iter 200] loss=11.4840 val_loss=11.7383 scale=2.0000 norm=22929.6241
== Early stopping achieved.
== Best iteration / VAL282 (val_loss=11.6009)
[iter 0] loss=12.7639 val_loss=12.5619 scale=1.0000 norm=61217.6751
[iter 100] loss=11.9995 val_loss=12.1880 scale=2.0000 norm=47427.8283
[iter 200] loss=11.4716 val_loss=11.8380 scale=2.0000 norm=22225.6771
== Early stopping achieved.
== Best iteration / VAL261 (val_loss=11.7524)
[iter 0] loss=12.7708 val_loss=12.5657 scale=1.0000 norm=61902.9648
[iter 100] loss=12.0291 val_loss=12.1899 scale=2.0000 norm=50287.9764
[iter 200] loss=11.4954 val_loss=11.8470 scale=2.0000 norm=23074.8481
== Early stopping achieved.
== Best iteration / VAL256 (val_loss=11.7772)
[iter 0] loss=12.7052 val_loss=12.5374 scale=1.0000 norm=59384.2091
[iter 100] lo

[I 2021-08-01 19:42:41,287] Trial 6 finished with value: -991800545.1810688 and parameters: {'base_learner': 'DecTree_depth5', 'Dist': 'Normal', 'n_estimators': 15267, 'minibatch_frac': 0.9540494129922555, 'learning_rate': 0.005775585261760808}. Best is trial 4 with value: -913623978.6309942.


== Early stopping achieved.
== Best iteration / VAL286 (val_loss=11.6351)
[iter 0] loss=13.1315 val_loss=13.0629 scale=1.0000 norm=0.3303
[iter 100] loss=13.0920 val_loss=13.0392 scale=2.0000 norm=0.4569
[iter 200] loss=13.0705 val_loss=13.0258 scale=2.0000 norm=0.3269
[iter 300] loss=13.0605 val_loss=13.0185 scale=2.0000 norm=0.2390
[iter 400] loss=13.0572 val_loss=13.0143 scale=2.0000 norm=0.1751
[iter 500] loss=13.0548 val_loss=13.0116 scale=2.0000 norm=0.1305
[iter 600] loss=13.0542 val_loss=13.0099 scale=2.0000 norm=0.0989
[iter 700] loss=13.0506 val_loss=13.0086 scale=2.0000 norm=0.0783
[iter 800] loss=13.0479 val_loss=13.0078 scale=2.0000 norm=0.0626
[iter 900] loss=13.0469 val_loss=13.0073 scale=2.0000 norm=0.0521
[iter 0] loss=13.1216 val_loss=13.0621 scale=1.0000 norm=0.3330
[iter 100] loss=13.0796 val_loss=13.0398 scale=2.0000 norm=0.4587
[iter 200] loss=13.0572 val_loss=13.0280 scale=2.0000 norm=0.3284
[iter 300] loss=13.0430 val_loss=13.0213 scale=2.0000 norm=0.2410
[iter 

[I 2021-08-01 19:51:07,173] Trial 7 finished with value: -938623276.1136224 and parameters: {'base_learner': 'GradientBoost_depth5', 'Dist': 'Exponential', 'n_estimators': 967, 'minibatch_frac': 0.9688419571551059, 'learning_rate': 0.0018208405046365333}. Best is trial 4 with value: -913623978.6309942.


[iter 0] loss=12.7258 val_loss=12.5276 scale=1.0000 norm=61012.2551
== Early stopping achieved.
== Best iteration / VAL59 (val_loss=11.6895)
[iter 0] loss=12.7199 val_loss=12.5783 scale=1.0000 norm=60305.6002
== Early stopping achieved.
== Best iteration / VAL32 (val_loss=12.1672)
[iter 0] loss=12.7375 val_loss=12.5435 scale=1.0000 norm=61329.2653
== Early stopping achieved.
== Best iteration / VAL63 (val_loss=11.8202)
[iter 0] loss=12.6911 val_loss=12.5069 scale=1.0000 norm=59861.7434
== Early stopping achieved.
== Best iteration / VAL53 (val_loss=11.7132)
[iter 0] loss=12.7080 val_loss=12.5294 scale=1.0000 norm=60186.8665


[I 2021-08-01 19:51:08,711] Trial 8 finished with value: -1154413557.0301154 and parameters: {'base_learner': 'DecTree_depth2', 'Dist': 'Normal', 'n_estimators': 1540, 'minibatch_frac': 0.49445317217518253, 'learning_rate': 0.05002839091595205}. Best is trial 4 with value: -913623978.6309942.


== Early stopping achieved.
== Best iteration / VAL39 (val_loss=12.0340)
[iter 0] loss=12.7350 val_loss=12.5644 scale=1.0000 norm=61646.1121
[iter 100] loss=12.3830 val_loss=12.4033 scale=2.0000 norm=88727.7258
[iter 200] loss=12.1646 val_loss=12.2429 scale=2.0000 norm=61571.6164
[iter 300] loss=11.9788 val_loss=12.0870 scale=2.0000 norm=45458.4435
[iter 400] loss=11.8078 val_loss=11.9404 scale=2.0000 norm=34833.9613
[iter 500] loss=11.6476 val_loss=11.8077 scale=2.0000 norm=29383.4492
[iter 600] loss=11.4607 val_loss=11.6955 scale=2.0000 norm=22527.0800
[iter 700] loss=11.3062 val_loss=11.6085 scale=2.0000 norm=20668.4343
[iter 800] loss=11.1707 val_loss=11.5539 scale=2.0000 norm=19615.0912
== Early stopping achieved.
== Best iteration / VAL881 (val_loss=11.5409)
[iter 0] loss=12.7365 val_loss=12.5627 scale=1.0000 norm=61005.9811
[iter 100] loss=12.3840 val_loss=12.4376 scale=2.0000 norm=88495.1351
[iter 200] loss=12.1742 val_loss=12.2919 scale=2.0000 norm=62182.1439
[iter 300] loss=1

[I 2021-08-01 19:51:41,766] Trial 9 finished with value: -955606041.3322674 and parameters: {'base_learner': 'DecTree_depth5', 'Dist': 'Normal', 'n_estimators': 9160, 'minibatch_frac': 0.5700684725641546, 'learning_rate': 0.001972024099562782}. Best is trial 4 with value: -913623978.6309942.


== Early stopping achieved.
== Best iteration / VAL872 (val_loss=11.5786)
[iter 0] loss=12.5707 val_loss=12.2929 scale=2.0000 norm=1.2897
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.4265)
[iter 0] loss=12.5273 val_loss=12.3051 scale=2.0000 norm=1.2503
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.4287)
[iter 0] loss=12.5622 val_loss=12.3286 scale=2.0000 norm=1.2836
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.4408)
[iter 0] loss=12.5311 val_loss=12.3179 scale=2.0000 norm=1.2376
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.4340)
[iter 0] loss=12.5558 val_loss=12.3138 scale=2.0000 norm=1.2370


[I 2021-08-01 19:51:47,557] Trial 10 finished with value: -883653368.6968629 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 18765, 'minibatch_frac': 0.4123721284410168, 'learning_rate': 0.09417560187424806}. Best is trial 10 with value: -883653368.6968629.


== Early stopping achieved.
== Best iteration / VAL20 (val_loss=11.4127)
[iter 0] loss=12.5709 val_loss=12.2922 scale=2.0000 norm=1.2889
== Early stopping achieved.
== Best iteration / VAL18 (val_loss=11.3673)
[iter 0] loss=12.5325 val_loss=12.3131 scale=2.0000 norm=1.2451
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.4382)
[iter 0] loss=12.5703 val_loss=12.3450 scale=2.0000 norm=1.2831
== Early stopping achieved.
== Best iteration / VAL21 (val_loss=11.4072)
[iter 0] loss=12.5422 val_loss=12.3041 scale=2.0000 norm=1.2420
== Early stopping achieved.
== Best iteration / VAL18 (val_loss=11.4089)
[iter 0] loss=12.5632 val_loss=12.3070 scale=2.0000 norm=1.2406


[I 2021-08-01 19:51:53,450] Trial 11 finished with value: -882124430.7774332 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 18044, 'minibatch_frac': 0.4008902175258032, 'learning_rate': 0.09147517256702006}. Best is trial 11 with value: -882124430.7774332.


== Early stopping achieved.
== Best iteration / VAL20 (val_loss=11.4070)
[iter 0] loss=12.5707 val_loss=12.2949 scale=2.0000 norm=1.2875
== Early stopping achieved.
== Best iteration / VAL18 (val_loss=11.3681)
[iter 0] loss=12.5300 val_loss=12.3213 scale=2.0000 norm=1.2477
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.4723)
[iter 0] loss=12.5652 val_loss=12.3303 scale=2.0000 norm=1.2815
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.4221)
[iter 0] loss=12.5373 val_loss=12.2946 scale=2.0000 norm=1.2410
== Early stopping achieved.
== Best iteration / VAL18 (val_loss=11.3396)
[iter 0] loss=12.5591 val_loss=12.2995 scale=2.0000 norm=1.2411


[I 2021-08-01 19:51:59,260] Trial 12 finished with value: -839782425.3162731 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 20872, 'minibatch_frac': 0.40746035676868947, 'learning_rate': 0.09725433680775422}. Best is trial 12 with value: -839782425.3162731.


== Early stopping achieved.
== Best iteration / VAL21 (val_loss=11.3886)
[iter 0] loss=12.5707 val_loss=12.2934 scale=2.0000 norm=1.2875
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.3652)
[iter 0] loss=12.5300 val_loss=12.3201 scale=2.0000 norm=1.2477
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.4939)
[iter 0] loss=12.5652 val_loss=12.3292 scale=2.0000 norm=1.2815
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.4009)
[iter 0] loss=12.5373 val_loss=12.2931 scale=2.0000 norm=1.2410
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.3469)
[iter 0] loss=12.5591 val_loss=12.2979 scale=2.0000 norm=1.2411


[I 2021-08-01 19:52:05,062] Trial 13 finished with value: -847551456.3812597 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 21890, 'minibatch_frac': 0.40716888076951757, 'learning_rate': 0.09889559588555788}. Best is trial 12 with value: -839782425.3162731.


== Early stopping achieved.
== Best iteration / VAL20 (val_loss=11.3993)
[iter 0] loss=12.5547 val_loss=12.3454 scale=2.0000 norm=1.2868
== Early stopping achieved.
== Best iteration / VAL36 (val_loss=11.4140)
[iter 0] loss=12.5320 val_loss=12.3492 scale=2.0000 norm=1.2620
== Early stopping achieved.
== Best iteration / VAL33 (val_loss=11.4364)
[iter 0] loss=12.5804 val_loss=12.3514 scale=2.0000 norm=1.3117
== Early stopping achieved.
== Best iteration / VAL36 (val_loss=11.4067)
[iter 0] loss=12.5525 val_loss=12.3467 scale=2.0000 norm=1.2873
== Early stopping achieved.
== Best iteration / VAL36 (val_loss=11.3889)
[iter 0] loss=12.5619 val_loss=12.3685 scale=2.0000 norm=1.2747


[I 2021-08-01 19:52:17,113] Trial 14 finished with value: -885386501.6433289 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 24367, 'minibatch_frac': 0.5044246268912617, 'learning_rate': 0.0473579561407795}. Best is trial 12 with value: -839782425.3162731.


== Early stopping achieved.
== Best iteration / VAL39 (val_loss=11.4047)
[iter 0] loss=12.5740 val_loss=12.3491 scale=2.0000 norm=1.2885
== Early stopping achieved.
== Best iteration / VAL42 (val_loss=11.3858)
[iter 0] loss=12.5333 val_loss=12.3637 scale=2.0000 norm=1.2487
== Early stopping achieved.
== Best iteration / VAL39 (val_loss=11.4387)
[iter 0] loss=12.5709 val_loss=12.3727 scale=2.0000 norm=1.2865
== Early stopping achieved.
== Best iteration / VAL39 (val_loss=11.4023)
[iter 0] loss=12.5429 val_loss=12.3535 scale=2.0000 norm=1.2456
== Early stopping achieved.
== Best iteration / VAL44 (val_loss=11.3846)
[iter 0] loss=12.5638 val_loss=12.3671 scale=2.0000 norm=1.2440


[I 2021-08-01 19:52:28,472] Trial 15 finished with value: -903438357.270965 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 26009, 'minibatch_frac': 0.40179068247417593, 'learning_rate': 0.04245773742181925}. Best is trial 12 with value: -839782425.3162731.


== Early stopping achieved.
== Best iteration / VAL42 (val_loss=11.3988)
[iter 0] loss=13.1160 val_loss=13.0464 scale=2.0000 norm=0.6560
== Early stopping achieved.
== Best iteration / VAL22 (val_loss=13.0035)
[iter 0] loss=13.0965 val_loss=13.0462 scale=2.0000 norm=0.6589
== Early stopping achieved.
== Best iteration / VAL26 (val_loss=13.0045)
[iter 0] loss=13.1092 val_loss=13.0462 scale=2.0000 norm=0.6616
== Early stopping achieved.
== Best iteration / VAL26 (val_loss=13.0039)
[iter 0] loss=13.1033 val_loss=13.0444 scale=2.0000 norm=0.6500
== Early stopping achieved.
== Best iteration / VAL26 (val_loss=13.0035)
[iter 0] loss=13.1103 val_loss=13.0446 scale=2.0000 norm=0.6446


[I 2021-08-01 19:52:33,412] Trial 16 finished with value: -919178361.884399 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'Exponential', 'n_estimators': 7569, 'minibatch_frac': 0.5258601707695173, 'learning_rate': 0.09876934600452908}. Best is trial 12 with value: -839782425.3162731.


== Early stopping achieved.
== Best iteration / VAL26 (val_loss=13.0035)
[iter 0] loss=12.5743 val_loss=12.3721 scale=2.0000 norm=1.3058
== Early stopping achieved.
== Best iteration / VAL66 (val_loss=11.4048)
[iter 0] loss=12.5711 val_loss=12.3750 scale=2.0000 norm=1.3077
== Early stopping achieved.
== Best iteration / VAL63 (val_loss=11.5017)
[iter 0] loss=12.5924 val_loss=12.3833 scale=2.0000 norm=1.3117
== Early stopping achieved.
== Best iteration / VAL63 (val_loss=11.4617)
[iter 0] loss=12.5803 val_loss=12.3719 scale=2.0000 norm=1.3278
== Early stopping achieved.
== Best iteration / VAL65 (val_loss=11.4277)
[iter 0] loss=12.5980 val_loss=12.3842 scale=2.0000 norm=1.3184


[I 2021-08-01 19:53:01,119] Trial 17 finished with value: -929319563.0802523 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 22875, 'minibatch_frac': 0.7235456091583856, 'learning_rate': 0.025626019034005947}. Best is trial 12 with value: -839782425.3162731.


== Early stopping achieved.
== Best iteration / VAL68 (val_loss=11.4282)
[iter 0] loss=12.5739 val_loss=12.4114 scale=2.0000 norm=1.3256
[iter 100] loss=12.0199 val_loss=12.0167 scale=2.0000 norm=0.8926
[iter 200] loss=11.6297 val_loss=11.7131 scale=2.0000 norm=0.8828
[iter 300] loss=11.3137 val_loss=11.4891 scale=2.0000 norm=0.8843
[iter 400] loss=11.1033 val_loss=11.3837 scale=2.0000 norm=0.8850
== Early stopping achieved.
== Best iteration / VAL438 (val_loss=11.3731)
[iter 0] loss=12.5226 val_loss=12.4121 scale=2.0000 norm=1.2477
[iter 100] loss=12.0057 val_loss=12.0333 scale=2.0000 norm=0.8946
[iter 200] loss=11.6455 val_loss=11.7387 scale=2.0000 norm=0.8793
[iter 300] loss=11.3282 val_loss=11.5221 scale=2.0000 norm=0.8806
[iter 400] loss=11.0858 val_loss=11.4290 scale=2.0000 norm=0.8786
== Early stopping achieved.
== Best iteration / VAL419 (val_loss=11.4253)
[iter 0] loss=12.5568 val_loss=12.4157 scale=2.0000 norm=1.2777
[iter 100] loss=12.0440 val_loss=12.0295 scale=2.0000 norm=

[I 2021-08-01 19:54:48,670] Trial 18 finished with value: -902679384.7894685 and parameters: {'base_learner': 'GradientBoost_depth2', 'Dist': 'LogNormal', 'n_estimators': 35777, 'minibatch_frac': 0.447676221891831, 'learning_rate': 0.004275250562539325}. Best is trial 12 with value: -839782425.3162731.


== Early stopping achieved.
== Best iteration / VAL433 (val_loss=11.3943)
[iter 0] loss=12.5646 val_loss=12.3411 scale=2.0000 norm=1.2950
== Early stopping achieved.
== Best iteration / VAL19 (val_loss=11.6639)
[iter 0] loss=12.5644 val_loss=12.3621 scale=2.0000 norm=1.3095
== Early stopping achieved.
== Best iteration / VAL16 (val_loss=11.8389)
[iter 0] loss=12.5865 val_loss=12.3386 scale=2.0000 norm=1.3169
== Early stopping achieved.
== Best iteration / VAL18 (val_loss=11.6800)
[iter 0] loss=12.5776 val_loss=12.3565 scale=2.0000 norm=1.3359
== Early stopping achieved.
== Best iteration / VAL19 (val_loss=11.7163)
[iter 0] loss=12.5717 val_loss=12.3611 scale=2.0000 norm=1.3024


[I 2021-08-01 19:54:51,109] Trial 19 finished with value: -1027735212.3609196 and parameters: {'base_learner': 'DecTree_depthNone', 'Dist': 'LogNormal', 'n_estimators': 21262, 'minibatch_frac': 0.6416818354150133, 'learning_rate': 0.061151242718555314}. Best is trial 12 with value: -839782425.3162731.


== Early stopping achieved.
== Best iteration / VAL18 (val_loss=11.8152)
[iter 0] loss=12.5958 val_loss=12.3067 scale=2.0000 norm=1.3252
== Early stopping achieved.
== Best iteration / VAL17 (val_loss=11.3779)
Started Predict with Ngboost at 19:54:53.
The R2 score is 0.8820978416974048
The MAE score is 14927.931999567378
The Median absolute error score is 10130.38403799045
The MSE score is 492383877.4498642
The RMSE score is 22189.724591573104
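
For reference, the five scores printed above can be computed by hand as in the following sketch. It uses the standard textbook definitions rather than e2eml's own code, and the function name regression_metrics and the toy values are assumptions made for illustration. Since RMSE is the square root of MSE, the two numbers 22189.72 and 492383877.45 reported above are linked by 22189.72² ≈ 492383877: the smaller one is on the scale of an RMSE (same unit as the target), the larger one on the scale of an MSE.

```python
import math
from statistics import median


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, median absolute error, MSE and RMSE (textbook definitions)."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "median_ae": median(abs(e) for e in errors),
        "mse": mse,
        "rmse": math.sqrt(mse),  # RMSE is always the square root of MSE
    }


# Toy values, not taken from the housing data
y_true = [200000.0, 150000.0, 320000.0, 275000.0]
y_pred = [210000.0, 140000.0, 300000.0, 280000.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, value)
```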


In [6]:
# Save pipeline
save_to_production(housing_ml, file_name='housing_automl_instance')

# Predict on new data
At the beginning we set aside a holdout dataset (val_df, val_df_target). We use it to simulate prediction on completely new data.

In [7]:
# load stored pipeline
housing_ml_loaded = load_for_production(file_name='housing_automl_instance')
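
How save_to_production and load_for_production store the instance is not shown in this notebook. The general save-and-restore pattern can be sketched with Python's standard pickle module. This is an assumption chosen for illustration, not a description of e2eml's implementation, and the state dictionary and file name are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical learned state of a fitted pipeline (made-up values)
state = {
    "selected_features": ["LotArea", "MSSubClass"],
    "fill_values": {"LotFrontage": 70.0},
    "offset": 1000,
}

# Save to disk ...
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
with open(path, "wb") as fh:
    pickle.dump(state, fh)

# ... and restore it later, e.g. in a separate prediction process
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == state)
```

Whatever the storage format, the restored object has to carry everything learned during training (encoders, fill values, selected features, the model itself) so that new data passes through the same preprocessing.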

In [8]:
# predict on new data
housing_ml_loaded.ml_bp14_regressions_full_processing_ngboost(val_df, preprocessing_type='full')

# access predicted values
val_y_hat = housing_ml_loaded.predicted_values['ngboost']

Started Execute test train split at 19:54:53.
Started Apply datetime transformation at 19:54:53.
Started Start Spacy, POS tagging at 19:54:53.
Started Handle rare features at 19:54:53.
Started Remove cardinality at 19:54:53.
Started Onehot + PCA categorical features at 19:54:53.
Started Execute categorical encoding at 19:54:53.
Started  Delete columns with high share of NULLs at 19:54:53.
Started Fill nulls at 19:54:53.
Started Execute numerical binning at 19:54:53.
Started Handle outliers at 19:54:53.
Started Remove collinearity at 19:54:53.
Started Execute clustering as a feature at 19:54:53.
Started Execute clustering as a feature at 19:54:53.
Started Execute clustering as a feature at 19:54:53.
Started Execute clustering as a feature at 19:54:53.
Started Execute clustering as a feature at 19:54:54.
Started Execute clustering as a feature at 19:54:54.
Started Execute clustering as a feature at 19:54:54.
Started Execute clustering as a feature at 19:54:54.
Started Execute clustering 

In [9]:
# Assess prediction quality on holdout data
mae = mean_absolute_error(val_df_target, val_y_hat)
print(mae)

15740.063154203399
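
A quick sanity check is to compare this holdout MAE with the MAE reported right after training. The sketch below only reuses the two numbers printed in this notebook; the helper name relative_gap is an assumption for illustration.

```python
internal_mae = 14927.931999567378  # "The MAE score is ..." printed after training
holdout_mae = 15740.063154203399   # MAE on the holdout data (val_df)


def relative_gap(reference, observed):
    """Relative difference of observed versus reference."""
    return (observed - reference) / reference


gap = relative_gap(internal_mae, holdout_mae)
print(f"Holdout MAE is {gap:.1%} higher than the MAE reported after training")
```

A gap of only a few percent suggests that the stored pipeline generalises to the holdout data about as well as it did on its internal test split.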
