# H2O AutoML Regression Demo


## Start H2O

Import the **h2o** Python module and `H2OAutoML` class and initialize a local H2O cluster.

In [None]:
!pip install h2o



In [None]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


|  |  |
|---|---|
| H2O_cluster_uptime: | 17 mins 57 secs |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.32.1.3 |
| H2O_cluster_version_age: | 22 days |
| H2O_cluster_name: | H2O_from_python_unknownUser_1zng51 |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 3.165 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |


## Load Data

For the AutoML regression demo, we use the [Combined Cycle Power Plant](http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant) dataset. The goal is to predict the energy output (in megawatts) from the temperature, ambient pressure, relative humidity, and exhaust vacuum values. In this demo, we will use H2O's AutoML to outperform the [state-of-the-art results](https://www.sciencedirect.com/science/article/pii/S0142061514000908) on this task.

In [None]:
# Use local data file or download from GitHub
data_path = "https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv"

# Load data into H2O
df = h2o.import_file(data_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


Let's take a look at the data.

In [None]:
df.describe()

Rows:9568
Cols:5




|  | TemperatureCelcius | ExhaustVacuumHg | AmbientPressureMillibar | RelativeHumidity | HourlyEnergyOutputMW |
|---|---|---|---|---|---|
| type | real | real | real | real | real |
| mins | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| mean | 19.651231187290957 | 54.3058037207358 | 1013.2590781772578 | 73.30897784280936 | 454.36500940635455 |
| maxs | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |
| sigma | 7.452473229611082 | 12.707892998326807 | 5.93878370581162 | 14.600268756728957 | 17.066994999803423 |
| zeros | 0 | 0 | 0 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 | 0 |
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |


Next, let's identify the response column and save the column name as `y`.  In this dataset, we will use all columns except the response as predictors, so we can skip setting the `x` argument explicitly.

In [None]:
y = "HourlyEnergyOutputMW"
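If you do want to spell out the predictors, one way is to build `x` from the column names. The sketch below uses a plain Python list that stands in for `df.columns` (the names are copied from the summary above), so it runs without a cluster.

```python
# Stand-in for df.columns; in the notebook you would use df.columns directly.
columns = [
    "TemperatureCelcius",
    "ExhaustVacuumHg",
    "AmbientPressureMillibar",
    "RelativeHumidity",
    "HourlyEnergyOutputMW",
]

y = "HourlyEnergyOutputMW"

# Every column except the response becomes a predictor.
x = [name for name in columns if name != y]
print(x)
```

The resulting list can then be passed along with the response, as in `aml.train(x = x, y = y, training_frame = train)`.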

Lastly, let's split the data into two frames, a `train` (80%) and a `test` frame (20%).  The `test` frame will be used to score the leaderboard and to demonstrate how to generate predictions using an AutoML leader model.

In [None]:
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
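Note that `split_frame` assigns rows probabilistically, so the two frames come out close to, but not exactly, 80% and 20% of the rows. The following pure-Python sketch illustrates that idea only; the function name `probabilistic_split` is hypothetical and this is not H2O's implementation.

```python
import random


def probabilistic_split(n_rows, train_ratio=0.8, seed=1):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < train_ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


# 9568 is the row count reported by df.describe() above.
train_idx, test_idx = probabilistic_split(9568)
print(len(train_idx), len(test_idx), len(train_idx) / 9568)
```

Because each row is drawn independently, the realized train fraction varies slightly around the requested ratio. In the notebook, `train.nrow` and `test.nrow` show the actual sizes.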

## Run AutoML 

Run AutoML with a time limit of 60 seconds, set through the `max_runtime_secs` argument. The `test` frame is passed explicitly to the `leaderboard_frame` argument, so the leaderboard is generated from test set metrics instead of cross-validated metrics.

In [None]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "powerplant")
aml.train(y = y, training_frame = train, leaderboard_frame = test)

AutoML progress: |████████████████████████████████████████████████████████| 100%


*Note: If you see the following error, it means that you need to install the pandas module.*
```
H2OTypeError: Argument `python_obj` should be a None | list | tuple | dict | numpy.ndarray | pandas.DataFrame | scipy.sparse.issparse, got H2OTwoDimTable 
``` 

## Leaderboard


A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally, and the leaderboard is sorted by that metric. For regression, the default ranking metric is mean residual deviance. To rank the leaderboard by a different metric, set the `sort_metric` argument of `H2OAutoML`.

In [None]:
aml.leaderboard.head()

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| XGBoost_grid__1_AutoML_20210611_000357_model_1 | 11.7239 | 3.42402 | 11.7239 | 2.34031 | 0.00751901 |
| XGBoost_1_AutoML_20210611_000357 | 12.2895 | 3.50565 | 12.2895 | 2.55107 | 0.00772691 |
| XGBoost_2_AutoML_20210611_000357 | 14.1029 | 3.75538 | 14.1029 | 2.75173 | 0.00826118 |
| XGBoost_grid__1_AutoML_20210611_000357_model_2 | 14.1326 | 3.75933 | 14.1326 | 2.73014 | 0.00825646 |
| GBM_grid__1_AutoML_20210611_000357_model_2 | 14.3148 | 3.7835 | 14.3148 | 2.78255 | 0.00832357 |
| XGBoost_3_AutoML_20210611_000357 | 14.8167 | 3.84924 | 14.8167 | 2.85931 | 0.0084692 |
| XRT_1_AutoML_20210611_000357 | 15.5593 | 3.94453 | 15.5593 | 2.8226 | 0.00869352 |
| GBM_3_AutoML_20210611_000357 | 15.6261 | 3.95299 | 15.6261 | 3.00367 | 0.00868289 |
| DRF_1_AutoML_20210611_000357 | 15.8346 | 3.97927 | 15.8346 | 2.83185 | 0.00875305 |
| GBM_grid__1_AutoML_20210611_000357_model_1 | 15.8912 | 3.98638 | 15.8912 | 2.96458 | 0.00876373 |
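To see how much the choice of ranking metric matters, the leaderboard rows above can be re-sorted by MAE in plain Python. This is only an illustration on the printed numbers and does not call H2O; H2O can rank by another metric itself through the `sort_metric` argument of `H2OAutoML`.

```python
# (model_id, mean_residual_deviance, mae), copied from the leaderboard above.
leaderboard = [
    ("XGBoost_grid__1_AutoML_20210611_000357_model_1", 11.7239, 2.34031),
    ("XGBoost_1_AutoML_20210611_000357", 12.2895, 2.55107),
    ("XGBoost_2_AutoML_20210611_000357", 14.1029, 2.75173),
    ("XGBoost_grid__1_AutoML_20210611_000357_model_2", 14.1326, 2.73014),
    ("GBM_grid__1_AutoML_20210611_000357_model_2", 14.3148, 2.78255),
    ("XGBoost_3_AutoML_20210611_000357", 14.8167, 2.85931),
    ("XRT_1_AutoML_20210611_000357", 15.5593, 2.8226),
    ("GBM_3_AutoML_20210611_000357", 15.6261, 3.00367),
    ("DRF_1_AutoML_20210611_000357", 15.8346, 2.83185),
    ("GBM_grid__1_AutoML_20210611_000357_model_1", 15.8912, 2.96458),
]

# Default order: ascending mean residual deviance.
by_deviance = sorted(leaderboard, key=lambda row: row[1])

# Alternative order: ascending mean absolute error.
by_mae = sorted(leaderboard, key=lambda row: row[2])

for rank, (model_id, _, mae) in enumerate(by_mae, start=1):
    print(rank, model_id, mae)
```

The leader is the same under both metrics here, but several models further down trade places, so the metric should match what matters for the task.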




This dataset comes from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant). The data was used in a [publication](https://www.sciencedirect.com/science/article/pii/S0142061514000908) in the *International Journal of Electrical Power & Energy Systems* in 2014. In the paper, the authors' best model achieved a mean absolute error (MAE) of 2.818 and a root mean squared error (RMSE) of 3.787. The AutoML leader above reaches an MAE of 2.340 and an RMSE of 3.424 on our test split, so with H2O's AutoML we have improved on the published figures in just 60 seconds of compute time!
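The size of the gap can be worked out from the published figures quoted above and the leader's row on the leaderboard. The paper's evaluation protocol is not reproduced in this demo, so treat the result as a rough comparison rather than a like-for-like benchmark.

```python
# Published best-model errors, as quoted in the text above.
paper = {"mae": 2.818, "rmse": 3.787}

# Leader model errors on the test split, copied from the leaderboard above.
leader = {"mae": 2.34031, "rmse": 3.42402}

# Relative reduction in error for each metric: (paper - leader) / paper.
relative_reduction = {
    metric: (paper[metric] - leader[metric]) / paper[metric] for metric in paper
}

for metric, reduction in relative_reduction.items():
    print(f"{metric}: {reduction:.1%} lower than the published value")
```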

## Predict Using Leader Model

If you need to generate predictions on a test set, you can make predictions on the `H2OAutoML` object directly, or on the leader model object (`aml.leader`).

In [None]:
pred = aml.predict(test)
pred.head()

xgboost prediction progress: |████████████████████████████████████████████| 100%


predict
485.695
473.756
467.413
450.41
448.128
468.133
442.936
466.101
442.655
431.161
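The second option, predicting with the leader model object, could look like the sketch below. `aml.leader` is the same attribute this demo uses for `model_performance()`; the helper name `predict_with_leader` is made up for illustration.

```python
def predict_with_leader(aml, frame):
    """Score `frame` with the top-ranked model of a finished AutoML run."""
    leader = aml.leader  # best model on the leaderboard
    return leader.predict(frame)
```

With the objects from this notebook, `predict_with_leader(aml, test)` should return the same kind of prediction frame as `aml.predict(test)`.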




If needed, the standard `model_performance()` method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

In [None]:
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegression: xgboost
** Reported on test data. **

MSE: 11.723888817434862
RMSE: 3.4240164744689623
MAE: 2.3403112773602754
RMSLE: 0.007519009129557612
Mean Residual Deviance: 11.723888817434862
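For reference, the metrics in this report can be reproduced by hand. The sketch below assumes the usual definitions, with RMSLE computed on `log(1 + value)` and mean residual deviance taken under a Gaussian (squared-error) deviance, in which case it equals the MSE. That assumption is consistent with the two identical values in the output above. The predicted values are made up for illustration.

```python
import math


def regression_metrics(actual, predicted):
    """Compute common regression metrics from two equal-length sequences."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # RMSLE compares log(1 + predicted) with log(1 + actual).
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)) / n
    )
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "rmsle": rmsle,
        # Assumption: Gaussian deviance, so mean residual deviance equals MSE.
        "mean_residual_deviance": mse,
    }


actual = [463.26, 444.37, 488.56]  # first three response values shown by df.describe()
predicted = [460.0, 447.0, 485.0]  # hypothetical predictions, not model output

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, value)
```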


