
In [1]:
pip install h2o

Collecting h2o
  Downloading h2o-3.46.0.9-py2.py3-none-any.whl.metadata (2.1 kB)
Downloading h2o-3.46.0.9-py2.py3-none-any.whl (266.0 MB)
Installing collected packages: h2o
Successfully installed h2o-3.46.0.9


In [2]:
pip install pandas numpy scikit-learn matplotlib



In [3]:
#=================================
# Setup: Import libraries
#=================================

In [4]:
import h2o
from h2o.automl import H2OAutoML
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

print("=== AutoML Demo with H2O ===")

=== AutoML Demo with H2O ===


In [5]:
# Step 1: Create sample data

In [6]:
print("1. Creating sample dataset...")
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    random_state=42
)

1. Creating sample dataset...


In [7]:
# Convert to DataFrame
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

print(f"Dataset shape: {df.shape}")
print(f"Target distribution:\n{df['target'].value_counts()}")

Dataset shape: (1000, 21)
Target distribution:
target
0    502
1    498
Name: count, dtype: int64


In [8]:
# Step 2: Initialize H2O

In [9]:
print("\n2. Initializing H2O...")
h2o.init()


2. Initializing H2O...
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "17.0.17" 2025-10-21; OpenJDK Runtime Environment (build 17.0.17+10-Ubuntu-122.04); OpenJDK 64-Bit Server VM (build 17.0.17+10-Ubuntu-122.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.12/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpsaeqfehz
  JVM stdout: /tmp/tmpsaeqfehz/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpsaeqfehz/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


key,value
H2O_cluster_uptime,05 secs
H2O_cluster_timezone,Etc/UTC
H2O_data_parsing_timezone,UTC
H2O_cluster_version,3.46.0.9
H2O_cluster_version_age,1 month and 14 days
H2O_cluster_name,H2O_from_python_unknownUser_5wlx4i
H2O_cluster_total_nodes,1
H2O_cluster_free_memory,3.147 Gb
H2O_cluster_total_cores,2
H2O_cluster_allowed_cores,2


In [10]:
# Convert pandas DataFrame to H2O Frame
h2o_df = h2o.H2OFrame(df)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [11]:
# Step 3: Define features and target

In [12]:
print("\n3. Setting up AutoML...")
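x = h2o_df.columns
y = "target"
x.remove(y)
# Note: "target" is still a numeric 0/1 column here, so AutoML treats this as a
# regression problem (see the regression metrics in the output further down).
# For binary classification, convert it first: h2o_df[y] = h2o_df[y].asfactor()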

print(f"Features: {len(x)}")
print(f"Target: {y}")


3. Setting up AutoML...
Features: 20
Target: target


In [13]:
# Step 4: Split the data

In [14]:
train, test = h2o_df.split_frame(ratios=[0.8], seed=42)
print(f"Training set: {train.shape[0]} rows")
print(f"Test set: {test.shape[0]} rows")

Training set: 788 rows
Test set: 212 rows
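The split above yields 788 and 212 rows rather than an exact 800/200 because `split_frame` assigns rows probabilistically instead of cutting at a fixed index. The following is a minimal pure-Python sketch of that idea, not H2O's actual implementation; the helper name `probabilistic_split` is hypothetical and its counts will not match H2O's for the same seed.

```python
import random

def probabilistic_split(n_rows, ratio=0.8, seed=42):
    """Assign each row to the training set with probability `ratio`.

    Hypothetical helper for illustration only; H2O's internals differ.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        # Each row is decided independently, so the final sizes vary.
        (train_idx if rng.random() < ratio else test_idx).append(i)
    return train_idx, test_idx

train_idx, test_idx = probabilistic_split(1000)
print(len(train_idx), len(test_idx))  # close to 800/200, but rarely exact
```

If exact sizes matter, one option is to split the pandas DataFrame first (for example with the already imported `train_test_split`) and convert each part to an H2OFrame.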


In [15]:
# Step 5: Run AutoML

In [17]:
print("\n4. Running AutoML ...")
aml = H2OAutoML(
    max_models=10,         # Maximum number of models
    seed=42,               # Reproducibility
    max_runtime_secs=300,  # 5 minutes max runtime
    verbosity="info"       # Show progress
)

aml.train(x=x, y=y, training_frame=train)


4. Running AutoML ...
AutoML progress: |
01:00:49.475: Project: AutoML_2_20260108_10049
01:00:49.475: 5-fold cross-validation will be used.
01:00:49.475: Setting stopping tolerance adaptively based on the training frame: 0.035623524993954825
01:00:49.475: Build control seed: 42
01:00:49.479: training frame: Frame key: AutoML_2_20260108_10049_training_py_2_sid_bab0    cols: 21    rows: 788  chunks: 1    size: 127609  checksum: -4011942677019520980
01:00:49.479: validation frame: NULL
01:00:49.479: leaderboard frame: NULL
01:00:49.479: blending frame: NULL
01:00:49.479: response column: target
01:00:49.479: fold column: null
01:00:49.479: weights column: null
01:00:49.480: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 90w), lr_search (7g, 30w)]}, {GLM : [def_1 (1g, 10w)]}, {DRF : [def_1 (2g, 10w), XRT (3g, 10w)]}, {GBM : [def_5 (1g, 10w), def_2 (2g, 10w), def_3 (2g, 10w), def_4 (2g, 10w), def_1 (3g, 10w), grid_1 (4g, 60w), lr_anneal

key,value
Stacking strategy,cross_validation
Number of base models (used / total),6/10
# GBM base models (used / total),3/4
# XGBoost base models (used / total),3/3
# DRF base models (used / total),0/2
# GLM base models (used / total),0/1
Metalearner algorithm,GLM
Metalearner fold assignment scheme,Random
Metalearner nfolds,5
Metalearner fold_column,

metric,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
aic,50.339706,25.124321,43.98503,14.2525,52.92315,56.462498,84.07536
loglikelihood,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mae,0.1945337,0.0176525,0.1879929,0.1697395,0.196789,0.2001011,0.218046
mean_residual_deviance,0.072927,0.0115591,0.068805,0.0578976,0.0728143,0.0752788,0.0898395
mse,0.072927,0.0115591,0.068805,0.0578976,0.0728143,0.0752788,0.0898395
null_deviance,39.506123,1.0986288,40.60454,39.01914,37.86994,40.32026,39.71674
r2,0.7065298,0.0465478,0.722974,0.7684095,0.7029985,0.6987804,0.6394868
residual_deviance,11.501791,1.9145478,11.0776,9.032029,10.994959,12.119881,14.284487
rmse,0.2693739,0.0213518,0.262307,0.2406192,0.2698412,0.2743698,0.2997324
rmsle,0.1924982,0.0150314,0.1944158,0.1693475,0.1918172,0.1955951,0.2113153
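The `mean` and `sd` columns summarize the five per-fold values. As a quick check on the `rmse` row only, the standard library reproduces both figures, assuming `sd` is the sample standard deviation (n - 1 denominator), which is what the reported numbers are consistent with.

```python
import statistics

# Per-fold RMSE values copied from the cross-validation table above.
fold_rmse = [0.262307, 0.2406192, 0.2698412, 0.2743698, 0.2997324]

mean_rmse = statistics.mean(fold_rmse)
sd_rmse = statistics.stdev(fold_rmse)  # sample standard deviation (n - 1)

print(f"mean={mean_rmse:.7f} sd={sd_rmse:.7f}")
```

A small `sd` relative to the `mean` suggests the ensemble behaves similarly across folds.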


In [18]:
# Step 6: Display results

In [19]:
print("\n5. AutoML Results:")
print("\nLeaderboard (Top Models):")
lb = aml.leaderboard
print(lb.head())


5. AutoML Results:

Leaderboard (Top Models):
model_id                                                    rmse        mse       mae     rmsle    mean_residual_deviance
StackedEnsemble_AllModels_1_AutoML_2_20260108_10049     0.27053   0.0731867  0.19523   0.193326                 0.0731867
StackedEnsemble_BestOfFamily_1_AutoML_2_20260108_10049  0.272437  0.0742217  0.198322  0.19522                  0.0742217
GBM_2_AutoML_2_20260108_10049                           0.278879  0.0777733  0.208588  0.198495                 0.0777733
GBM_4_AutoML_2_20260108_10049                           0.28377   0.0805256  0.208346  0.201539                 0.0805256
GBM_3_AutoML_2_20260108_10049                           0.285511  0.0815166  0.209818  0.202311                 0.0815166
XGBoost_3_AutoML_2_20260108_10049                       0.30581   0.09352    0.229059  0.218654                 0.09352
DRF_1_AutoML_2_20260108_10049                           0.309205  0.095608   0.249357  0.218733      

In [20]:
# Step 7: Get the best model

In [21]:
print(f"\nBest Model: {aml.leader.model_id}")
best_model = aml.leader


Best Model: StackedEnsemble_AllModels_1_AutoML_2_20260108_10049


In [22]:
# Step 8: Make predictions

In [23]:
print("\n6. Making predictions...")
predictions = best_model.predict(test)
print("Predictions sample:")
print(predictions.head())


6. Making predictions...
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
Predictions sample:
   predict
 0.862563
 0.960305
 1.02453
 0.114947
 0.977722
 0.319058
 1.02947
 0.503038
 0.965337
-0.0634541
[10 rows x 1 column]
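These predictions are continuous scores, some of them outside [0, 1], because the numeric 0/1 target led AutoML to fit regression models. One rough way to read them as class labels is to threshold them. The sketch below uses the ten values printed above; the 0.5 cutoff is an assumption made for illustration, not something H2O selected.

```python
# Sample predictions copied from the output above.
scores = [0.862563, 0.960305, 1.02453, 0.114947, 0.977722,
          0.319058, 1.02947, 0.503038, 0.965337, -0.0634541]

THRESHOLD = 0.5  # assumed cutoff, chosen only for illustration

# Map each continuous score to a 0/1 label.
labels = [1 if s >= THRESHOLD else 0 for s in scores]
print(labels)  # [1, 1, 1, 0, 1, 0, 1, 1, 1, 0]
```

Converting the target with `asfactor()` before training would instead make H2O return a predicted class plus per-class probabilities, which is usually preferable to thresholding after the fact.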



In [24]:
# Step 9: Model performance

In [25]:
print("\n7. Model Performance:")
performance = best_model.model_performance(test)
print(performance)


7. Model Performance:
ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 0.07628804881280456
RMSE: 0.27620291239015665
MAE: 0.19258573626725728
RMSLE: 0.2014162625369738
Mean Residual Deviance: 0.07628804881280456
R^2: 0.692631997683309
Null degrees of freedom: 211
Residual degrees of freedom: 205
Null deviance: 53.17662784405677
Residual deviance: 16.173066348314567
AIC: 72.10327280712396
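Several of the reported metrics are the same quantity in different forms. The numbers above are consistent with RMSE being the square root of MSE, and with the residual deviance being MSE multiplied by the number of test rows (212, from the split output). A short check, using only values copied from the report:

```python
import math

# Values copied from the performance report above.
mse = 0.07628804881280456
rmse = 0.27620291239015665
residual_deviance = 16.173066348314567
n_test_rows = 212  # test set size from the split step

rmse_from_mse = math.sqrt(mse)
mse_from_deviance = residual_deviance / n_test_rows

print(f"sqrt(MSE)           = {rmse_from_mse:.6f} (reported RMSE {rmse:.6f})")
print(f"deviance / n_rows   = {mse_from_deviance:.6f} (reported MSE {mse:.6f})")
```

This is also why the `Mean Residual Deviance` line repeats the MSE value exactly.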


In [26]:
# Step 10: Shutdown H2O

In [27]:
print("\n8. Shutting down H2O...")
h2o.cluster().shutdown()

print("\n=== AutoML Complete ===")


8. Shutting down H2O...
H2O session _sid_bab0 closed.

=== AutoML Complete ===
