### MLOPS Assignment 3
Submitted by Arushi Makraria

Imports

In [3]:
import pandas as pd
import h2o
import numpy as np
from sklearn.metrics import r2_score
from h2o.automl import H2OAutoML

Initializing H2O

In [4]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.25" 2024-10-15; OpenJDK Runtime Environment (build 11.0.25+9-post-Ubuntu-1ubuntu122.04); OpenJDK 64-Bit Server VM (build 11.0.25+9-post-Ubuntu-1ubuntu122.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.10/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpn5lugtt_
  JVM stdout: /tmp/tmpn5lugtt_/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpn5lugtt_/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime: 03 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.6
H2O_cluster_version_age: 18 days
H2O_cluster_name: H2O_from_python_unknownUser_zokfmn
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.170 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2


#### Splitting into train and test sets

In [11]:
#Loading dataset
athletes = h2o.import_file("athletes_clean.csv")
athletes = athletes.drop("athlete_id")
athletes.columns

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


['C1',
 'region',
 'age',
 'height',
 'weight',
 'pullups',
 'experience',
 'is_male',
 'experience_level',
 'diet',
 'athletic_background',
 'workout_schedule',
 'BMI',
 'total_lift']

In [12]:
# Train-Test Split
train, test = athletes.split_frame(ratios=[0.8], seed=1)

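# Setting target variable and features
target = "total_lift"
# list.remove() works in place and returns None, so build the predictor list explicitly
features = [col for col in athletes.columns if col != target]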

# Running AutoML for 30 minutes
h2o_automl = H2OAutoML(max_runtime_secs = 1800, sort_metric = "RMSE",
                include_algos = ["GLM", "DeepLearning", "DRF", "XGBoost"])
h2o_automl.train(x = features, y = target, training_frame = train)

# Model leaderboard from H2O AutoML
leaderboard = h2o.automl.get_leaderboard(h2o_automl, extra_columns = "ALL")
print(leaderboard)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
model_id                                             rmse      mse      mae     rmsle    mean_residual_deviance    training_time_ms    predict_time_per_row_ms  algo
XGBoost_grid_1_AutoML_1_20241120_225538_model_30  118.56   14056.6  88.0705  0.153432                   14056.6                5681                   0.004511  XGBoost
XGBoost_grid_1_AutoML_1_20241120_225538_model_32  118.68   14085    87.9775  0.153675                   14085                  2884                   0.005236  XGBoost
XGBoost_grid_1_AutoML_1_20241120_225538_model_28  118.69   14087.4  88.1365  0.154098                   14087.4               11628                   0.008608  XGBoost
XGBoost_grid_1_AutoML_1_20241120_225538_model_1   119.094  14183.4  88.3071  0.15419                    14183.4                4934                   0.005844  XGBoost
XGBoost_grid_1_AutoML_1_20241120_225538_model_35  119.189  14206    
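
When building the predictor list, note that Python's `list.remove` mutates the list in place and returns `None`, so its return value should not be assigned. A minimal sketch, using a shortened and purely illustrative column list rather than the full dataset schema:

```python
columns = ["C1", "region", "age", "pullups", "total_lift"]  # illustrative subset
target = "total_lift"

# list.remove edits the list in place and returns None
scratch = list(columns)
result = scratch.remove(target)

# A list comprehension builds the predictor list without mutating `columns`
features = [c for c in columns if c != target]

print(result)
print(features)
```

The comprehension also leaves `columns` untouched, which matters if the full column list is reused later.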

Calculating R^2 to evaluate the models

In [13]:
r2_scores = []
for model_id in leaderboard['model_id'].as_data_frame().values.flatten():
    model = h2o.get_model(model_id)
    predictions = model.predict(test)

    actual = test[target].as_data_frame().values.flatten()
    predicted = predictions.as_data_frame().values.flatten()

    mask = ~np.isnan(actual) & ~np.isnan(predicted)
    actual = actual[mask]
    predicted = predicted[mask]

    # Calculate R^2 score
    if len(actual) > 0 and len(predicted) > 0:
        r2 = r2_score(actual, predicted)
        r2_scores.append((model_id, r2))
    else:
        r2_scores.append((model_id, float('nan')))



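The R² loop above drops rows where either the actual or the predicted value is missing and then scores the rest. As a rough sketch of what that computes, here is the same idea in plain Python, with R² written out as 1 − SS_res / SS_tot. The helper name `r2_with_nan_mask` and the sample numbers are made up for illustration, and returning NaN for a constant target is a choice of this sketch, not necessarily what `sklearn.metrics.r2_score` does.

```python
import math

def r2_with_nan_mask(actual, predicted):
    """Return R^2 over pairs where both values are present, or NaN if none remain."""
    pairs = [(a, p) for a, p in zip(actual, predicted)
             if not (math.isnan(a) or math.isnan(p))]
    if not pairs:
        return float("nan")
    mean_actual = sum(a for a, _ in pairs) / len(pairs)
    ss_res = sum((a - p) ** 2 for a, p in pairs)        # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a, _ in pairs)  # total sum of squares
    if ss_tot == 0:
        return float("nan")  # R^2 is undefined for a constant target in this sketch
    return 1.0 - ss_res / ss_tot

nan = float("nan")
example_r2 = r2_with_nan_mask([100.0, 200.0, 300.0, nan], [110.0, 190.0, nan, 400.0])
print(example_r2)
```

Only the first two pairs survive the mask in this example; the other two are skipped because one side is missing.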
In [14]:
filtered_r2 = sorted(r2_scores, key=lambda x: x[1], reverse=True)[:1]

for model_id, r2 in filtered_r2:
    print(f"Model: {model_id}, R^2: {r2}")

# Take the ID from the sorted list instead of relying on the leaked loop variable
best = filtered_r2[0][0]

Model: XGBoost_grid_1_AutoML_1_20241120_225538_model_13, R^2: 0.8075001620358522
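
Because the scoring loop can append NaN placeholders, and every comparison involving NaN is false, `sorted` may place NaN entries unpredictably. A small sketch of a NaN-safe ranking follows; the model names and scores are made up for illustration and are not taken from the leaderboard.

```python
import math

# Hypothetical (model_id, r2) pairs, including one NaN placeholder
r2_scores_example = [("model_a", 0.79), ("model_b", float("nan")), ("model_c", 0.81)]

# Drop NaN scores first, then rank the remaining models from best to worst
valid = [(mid, r2) for mid, r2 in r2_scores_example if not math.isnan(r2)]
ranked = sorted(valid, key=lambda item: item[1], reverse=True)

best_id, best_r2 = ranked[0]
print(best_id, best_r2)
```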


In [15]:
best_model = h2o.get_model(best)

feature_importance = best_model.varimp(use_pandas=True)
features = feature_importance.head(5)
print(features)


  variable  relative_importance  scaled_importance  percentage
0  is_male          770429248.0           1.000000    0.400605
1       C1          400996992.0           0.520485    0.208509
2  pullups          288994816.0           0.375109    0.150271
3   weight          228979360.0           0.297210    0.119064
4      BMI          113457912.0           0.147266    0.058995
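
In the table above, `scaled_importance` appears to be each feature's `relative_importance` divided by the largest one. The check below reproduces that column from the printed values. The `percentage` column divides by the sum over all features, so it cannot be recomputed from these five rows alone.

```python
# relative_importance values copied from the printed table
relative = {
    "is_male": 770429248.0,
    "C1": 400996992.0,
    "pullups": 288994816.0,
    "weight": 228979360.0,
    "BMI": 113457912.0,
}

# Scale by the largest importance so the top feature gets 1.0
largest = max(relative.values())
scaled = {name: value / largest for name, value in relative.items()}

for name, value in scaled.items():
    print(f"{name:8s} {value:.6f}")
```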


# The top 5 features are: is_male, C1, pullups, weight, and BMI

### Top models with all features considered

In [16]:
top_3_all = sorted(r2_scores, key=lambda x: x[1], reverse=True)[:3]
for model_id, r2 in top_3_all:
    print(f"Model: {model_id}, R2: {r2}")

Model: XGBoost_grid_1_AutoML_1_20241120_225538_model_13, R2: 0.8075001620358522
Model: XGBoost_grid_1_AutoML_1_20241120_225538_model_38, R2: 0.8070909731218951
Model: XGBoost_grid_1_AutoML_1_20241120_225538_model_28, R2: 0.8063192792884217


In [32]:
top_model_ids = [model_id for model_id, r2 in top_3_all]
r2_scores_dict = dict(r2_scores)
model_info = []

for model_id in top_model_ids:
    model = h2o.get_model(model_id)
    model_params = model.params
    ntrees = model_params['ntrees']['actual']
    max_depth = model_params['max_depth']['actual']
    learn_rate = model_params['learn_rate']['actual']

    model_str = f"XGBoost(params: {{ n_estimators: {ntrees}, max_depth: {max_depth}, lambda: 1 , learning_rate: {learn_rate} }})"
    model_info.append((model_str, r2_scores_dict[model_id]))

top_3_models_df = pd.DataFrame(model_info, columns=['model', 'r2'])

print(top_3_models_df)

                                               model        r2
0  XGBoost(params: { n_estimators: 60, max_depth:...  0.807500
1  XGBoost(params: { n_estimators: 70, max_depth:...  0.807091
2  XGBoost(params: { n_estimators: 58, max_depth:...  0.806319


In [41]:
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', None)     # Show all rows
pd.set_option('display.max_colwidth', None) # Show full content of each cell

top_3_models_df

| | model | r2 |
|---|---|---|
| 0 | XGBoost(params: { n_estimators: 60, max_depth: 9, lambda: 1 , learning_rate: 0.3 }) | 0.8075 |
| 1 | XGBoost(params: { n_estimators: 70, max_depth: 6, lambda: 1 , learning_rate: 0.3 }) | 0.807091 |
| 2 | XGBoost(params: { n_estimators: 58, max_depth: 9, lambda: 1 , learning_rate: 0.3 }) | 0.806319 |


In [33]:
top_model_id = top_model_ids[0]
training_time_ms = leaderboard[leaderboard['model_id'] == top_model_id, 'training_time_ms']

print(f"Model ID: {top_model_id}")
print(f"Training Time (ms): {training_time_ms}")

Model ID: XGBoost_grid_1_AutoML_1_20241120_225538_model_13
Training Time (ms):   training_time_ms
              5258
[1 row x 1 column]



### Top 3 models based on speed

In [34]:
# Top 3 models per speed with all features
top_3_models_speed_all_features = leaderboard.sort('training_time_ms').head(3)
print("Top 3 Models per Speed (All Features):")
print(top_3_models_speed_all_features[['model_id', 'training_time_ms', 'algo']])

Top 3 Models per Speed (All Features):
model_id                                            training_time_ms  algo
GLM_1_AutoML_1_20241120_225538                                   301  GLM
XGBoost_grid_1_AutoML_1_20241120_225538_model_42                 542  XGBoost
XGBoost_grid_1_AutoML_1_20241120_225538_model_8                 1871  XGBoost
[3 rows x 3 columns]



In [43]:
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', None)     # Show all rows
pd.set_option('display.max_colwidth', None) # Show full content of each cell
top_3_models_speed_all_features[['model_id', 'training_time_ms', 'algo']]

| model_id | training_time_ms | algo |
|---|---|---|
| GLM_1_AutoML_1_20241120_225538 | 301 | GLM |
| XGBoost_grid_1_AutoML_1_20241120_225538_model_42 | 542 | XGBoost |
| XGBoost_grid_1_AutoML_1_20241120_225538_model_8 | 1871 | XGBoost |


### Considering top features only

In [22]:
# Defining top features
top_features = ['is_male', 'C1', 'pullups', 'weight', 'BMI']

# Getting the training and test dataset with top features
train_top_features = train[top_features + [target]]
test_top_features = test[top_features + [target]]

In [23]:
# Run AutoML again using only top features
h2o_automl_top_features = H2OAutoML(max_runtime_secs = 1800, sort_metric = "RMSE", include_algos = ["GLM", "DeepLearning", "DRF", "XGBoost"])
h2o_automl_top_features.train(x=top_features, y=target, training_frame=train_top_features)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


| number_of_trees |
|---|
| 51.0 |

| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| aic | | 0.0 | | | | | |
| loglikelihood | | 0.0 | | | | | |
| mae | 91.87331 | 1.1017212 | 90.63591 | 93.46596 | 92.32346 | 91.831505 | 91.10971 |
| mean_residual_deviance | 15232.893 | 568.80634 | 14496.057 | 15999.504 | 15475.364 | 15280.741 | 14912.796 |
| mse | 15232.893 | 568.80634 | 14496.057 | 15999.504 | 15475.364 | 15280.741 | 14912.796 |
| r2 | 0.786805 | 0.005757 | 0.7951784 | 0.7827382 | 0.7842023 | 0.7816117 | 0.7902947 |
| residual_deviance | 15232.893 | 568.80634 | 14496.057 | 15999.504 | 15475.364 | 15280.741 | 14912.796 |
| rmse | 123.404396 | 2.3040884 | 120.399574 | 126.48914 | 124.40002 | 123.615295 | 122.11796 |
| rmsle | 0.1561312 | 0.015338 | 0.1298019 | 0.1623236 | 0.1558475 | 0.1669947 | 0.1656884 |

| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2024-11-21 00:04:53 | 6 min 28.472 sec | 0.0 | 1083.503114 | 1050.0144437 | 1173978.9980825 |
| 2024-11-21 00:04:53 | 6 min 28.581 sec | 5.0 | 245.097474 | 199.7087913 | 60072.7717429 |
| 2024-11-21 00:04:53 | 6 min 28.673 sec | 10.0 | 138.4483278 | 103.6015782 | 19167.9394841 |
| 2024-11-21 00:04:54 | 6 min 28.813 sec | 15.0 | 124.168196 | 92.7056194 | 15417.7408873 |
| 2024-11-21 00:04:54 | 6 min 28.979 sec | 20.0 | 119.9300061 | 89.4751765 | 14383.2063513 |
| 2024-11-21 00:04:54 | 6 min 29.174 sec | 25.0 | 117.8249219 | 88.0006611 | 13882.7122126 |
| 2024-11-21 00:04:54 | 6 min 29.439 sec | 30.0 | 116.3822181 | 87.0210419 | 13544.8206826 |
| 2024-11-21 00:04:54 | 6 min 29.692 sec | 35.0 | 115.6966907 | 86.6012082 | 13385.7242497 |
| 2024-11-21 00:04:55 | 6 min 29.980 sec | 40.0 | 114.84756 | 86.0393814 | 13189.9620426 |
| 2024-11-21 00:04:55 | 6 min 30.295 sec | 45.0 | 114.2647272 | 85.6282678 | 13056.4278901 |

| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| C1 | 727109184.0 | 1.0 | 0.3948904 |
| weight | 401549696.0 | 0.552255 | 0.2180802 |
| is_male | 345286880.0 | 0.4748762 | 0.1875241 |
| pullups | 281659520.0 | 0.3873689 | 0.1529683 |
| BMI | 85688232.0 | 0.1178478 | 0.046537 |


In [24]:
leaderboard_h2o_automl_top_features = h2o.automl.get_leaderboard(h2o_automl_top_features, extra_columns = "ALL")
print(leaderboard_h2o_automl_top_features )

model_id                                             rmse      mse      mae     rmsle    mean_residual_deviance    training_time_ms    predict_time_per_row_ms  algo
XGBoost_grid_1_AutoML_2_20241120_235703_model_28  123.422  15232.9  91.8733  0.156732                   15232.9                2415                   0.00489   XGBoost
XGBoost_grid_1_AutoML_2_20241120_235703_model_26  123.808  15328.4  92.3805  0.156994                   15328.4                1453                   0.006516  XGBoost
XGBoost_grid_1_AutoML_2_20241120_235703_model_2   123.851  15339.1  92.5459  0.1568                     15339.1                1517                   0.003427  XGBoost
XGBoost_grid_1_AutoML_2_20241120_235703_model_40  124.003  15376.8  92.5004  0.156969                   15376.8                5806                   0.002836  XGBoost
XGBoost_grid_1_AutoML_2_20241120_235703_model_22  124.004  15376.9  92.3323  0.157186                   15376.9                2757                   0.006998  XGB

In [25]:
r2_top = []
for model_id in leaderboard_h2o_automl_top_features['model_id'].as_data_frame().values.flatten():
    model = h2o.get_model(model_id)
    predictions = model.predict(test_top_features)

    actual = test_top_features[target].as_data_frame().values.flatten()
    predicted = predictions.as_data_frame().values.flatten()

    mask = ~np.isnan(actual) & ~np.isnan(predicted)
    actual = actual[mask]
    predicted = predicted[mask]

    # Calculate R^2 score
    if len(actual) > 0 and len(predicted) > 0:
        r2 = r2_score(actual, predicted)
        r2_top.append((model_id, r2))
    else:
        r2_top.append((model_id, float('nan')))



### Finding top 3 models

In [26]:
# Top 3 models by R^2 with top features
top3_topfeatures = sorted(r2_top, key=lambda x: x[1], reverse=True)[:3]
for model_id, r2 in top3_topfeatures:
    print(f"Model: {model_id}, R^2: {r2}")


Model: XGBoost_grid_1_AutoML_2_20241120_235703_model_42, R^2: 0.7885028655180732
Model: XGBoost_grid_1_AutoML_2_20241120_235703_model_24, R^2: 0.7879775633924693
Model: XGBoost_grid_1_AutoML_2_20241120_235703_model_14, R^2: 0.7879542358921803


In [27]:
top_model_ids = [model_id for model_id, r2 in top3_topfeatures]
r2_scores_dict = dict(r2_top)
model_info = []

for model_id in top_model_ids:
    model = h2o.get_model(model_id)
    model_params = model.params
    ntrees = model_params['ntrees']['actual']
    max_depth = model_params['max_depth']['actual']
    learn_rate = model_params['learn_rate']['actual']

    model_str = f"XGBoost(params: {{ n_estimators: {ntrees}, max_depth: {max_depth}, lambda: 1, learning_rate: {learn_rate} }})"
    model_info.append((model_str, r2_scores_dict[model_id]))

top_3_models_top_features_df = pd.DataFrame(model_info, columns=['model', 'r2'])

print(top_3_models_top_features_df)

                                               model        r2
0  XGBoost(params: { n_estimators: 54, max_depth:...  0.788503
1  XGBoost(params: { n_estimators: 50, max_depth:...  0.787978
2  XGBoost(params: { n_estimators: 84, max_depth:...  0.787954


In [44]:
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', None)     # Show all rows
pd.set_option('display.max_colwidth', None) # Show full content of each cell
top_3_models_top_features_df

| | model | r2 |
|---|---|---|
| 0 | XGBoost(params: { n_estimators: 54, max_depth: 9, lambda: 1, learning_rate: 0.3 }) | 0.788503 |
| 1 | XGBoost(params: { n_estimators: 50, max_depth: 12, lambda: 1, learning_rate: 0.3 }) | 0.787978 |
| 2 | XGBoost(params: { n_estimators: 84, max_depth: 3, lambda: 1, learning_rate: 0.3 }) | 0.787954 |


In [28]:
top_model_id = top_model_ids[0]
training_time_ms1 = leaderboard_h2o_automl_top_features[leaderboard_h2o_automl_top_features['model_id'] == top_model_id, 'training_time_ms']

print(f"Model ID: {top_model_id}")
print(f"Training Time (ms): {training_time_ms1}")

Model ID: XGBoost_grid_1_AutoML_2_20241120_235703_model_42
Training Time (ms):   training_time_ms
              1256
[1 row x 1 column]



### Top models based on speed

In [29]:
# Top 3 models per speed with top features
top_3_models_speed_top_features = leaderboard_h2o_automl_top_features.sort('training_time_ms').head(3)
print("Top 3 Models per Speed (Top Features):")
print(top_3_models_speed_top_features[['model_id', 'training_time_ms', 'algo']])

Top 3 Models per Speed (Top Features):
model_id                                            training_time_ms  algo
GLM_1_AutoML_2_20241120_235703                                    83  GLM
XGBoost_grid_1_AutoML_2_20241120_235703_model_44                 820  XGBoost
XGBoost_grid_1_AutoML_2_20241120_235703_model_17                 834  XGBoost
[3 rows x 3 columns]



### Comparing best models from Assignments 1 and 2 and H2O AutoML

In [36]:
# Best model from Assignment 1
best_model_assignment_1 = {
    "Model": "XGBoost(params: {n_estimators: 100, max_depth: 6, learning_rate: 0.3, lambda: 1})", "R2" : 0.712, "Speed" : "363 ms" }

# Best model from Assignment 2
best_model_assignment_2 = {
    "Model": "XGBoost(params: {n_estimators: 120, max_depth: 7, learning_rate: 0.1, lambda: 3})", "R2" : 0.792, "Speed" : "971 ms" }

# Best AutoML model with all features.
# R2 and training time are read from the results computed above (In [32], In [33])
# instead of being typed in by hand, so the table matches this run.
best_model_all_features = {
    "Model" : top_3_models_df.iloc[0]['model'],
    "R2" : round(top_3_models_df.iloc[0]['r2'], 3),
    "Speed" : f"{int(training_time_ms.as_data_frame().iloc[0, 0])} ms" }

# Best AutoML model with only the top features (results from In [27], In [28])
best_model_top_features = {
    "Model": top_3_models_top_features_df.iloc[0]['model'],
    "R2" : round(top_3_models_top_features_df.iloc[0]['r2'], 3),
    "Speed" : f"{int(training_time_ms1.as_data_frame().iloc[0, 0])} ms"
    }

best_models = [
    best_model_assignment_1,
    best_model_assignment_2,
    best_model_all_features,
    best_model_top_features
]

df_best_models = pd.DataFrame(best_models)



In [38]:
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', None)     # Show all rows
pd.set_option('display.max_colwidth', None) # Show full content of each cell

df_best_models

| | Model | R2 | Speed |
|---|---|---|---|
| 0 | XGBoost(params: {n_estimators: 100, max_depth: 6, learning_rate: 0.3, lambda: 1}) | 0.712 | 363 ms |
| 1 | XGBoost(params: {n_estimators: 120, max_depth: 7, learning_rate: 0.1, lambda: 3}) | 0.792 | 971 ms |
| 2 | XGBoost(params: { n_estimators: 60, max_depth: 9, lambda: 1 , learning_rate: 0.3 }) | 0.808 | 5258 ms |
| 3 | XGBoost(params: { n_estimators: 54, max_depth: 9, lambda: 1, learning_rate: 0.3 }) | 0.789 | 1256 ms |


# Model Comparison and Platform Analysis

1. Comparison of Top Models with Previously Developed Models

- **Validation Score (R²):**
  - The best H2O AutoML model trained on all features reaches a test-set **R² of 0.8075**, ahead of the best models from Assignment 2 (**0.792**) and Assignment 1 (**0.712**). With only the top five features, the best AutoML model reaches **0.7885**: above Assignment 1 and just below Assignment 2.
  - The gain over the earlier assignments suggests that AutoML's grid search over XGBoost hyperparameters was effective; no additional feature engineering was applied in this notebook.

- **Speed (training time):**
  - The best AutoML model using **all features** took **5258 ms** to train according to the leaderboard, compared with the **971 ms** recorded for Assignment 2 and **363 ms** for Assignment 1. It is slower, but it gives the highest R².
  - Restricting the input to the **top features only** cuts the training time of the best model to **1256 ms**, at the cost of about 0.02 in R².
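
The accuracy and speed trade-off between the two AutoML runs can be quantified from outputs printed earlier in the notebook: the test R² of each run's best model and its `training_time_ms` from the leaderboard lookup. The short calculation below is only a worked example on those four numbers; the dictionary names are illustrative.

```python
# Values copied from the notebook outputs for the best model of each AutoML run
all_features = {"r2": 0.8075001620358522, "training_time_ms": 5258}
top_features = {"r2": 0.7885028655180732, "training_time_ms": 1256}

# Accuracy given up by dropping to the top five features
r2_drop = all_features["r2"] - top_features["r2"]

# How many times longer the all-features model took to train
time_ratio = all_features["training_time_ms"] / top_features["training_time_ms"]

print(f"R^2 drop: {r2_drop:.4f}")
print(f"Training-time ratio: {time_ratio:.2f}x")
```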


2. AutoML Platform Analysis

**Is H2O AutoML no-code/low-code/full-code?**
- **Classification:** **Low-Code**
- **Why?**
  - **Automation of Model Training:** H2O AutoML automates most of the model training process, including algorithm selection, hyperparameter tuning, cross-validation, and ranking the models on a leaderboard. Feature selection was still a manual step in this notebook: the top five features were picked from the best model's variable importances.
  - **Coding Required for Setup:** Only a few lines of code are needed to load the dataset, specify the target and predictor variables, and start the AutoML run. The rest of the pipeline is managed by H2O.
  - **Flexibility for Advanced Users:** While it simplifies model training, H2O AutoML still lets users restrict the algorithms, set time budgets, tune parameters, and integrate the pipeline into their workflow using Python or R.
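
To make the low-code claim concrete, the sketch below condenses the setup used in this notebook into one function. The function name `run_automl_sketch`, its parameters, and the 60-second default budget are hypothetical choices for illustration. The H2O calls are the ones already used above, plus the `leaderboard` attribute of the trained AutoML object. Running it requires the `h2o` package and a Java runtime.

```python
def run_automl_sketch(csv_path, target, max_runtime_secs=60):
    """Train H2O AutoML on a CSV file and return (leaderboard, held-out test frame)."""
    # Imported lazily so that defining the function needs neither h2o nor a JVM
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    frame = h2o.import_file(csv_path)

    # 80/20 split, mirroring the split used earlier in the notebook
    train, test = frame.split_frame(ratios=[0.8], seed=1)
    predictors = [col for col in frame.columns if col != target]

    # Everything after this point (model search, tuning, ranking) is handled by H2O
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, sort_metric="RMSE", seed=1)
    aml.train(x=predictors, y=target, training_frame=train)
    return aml.leaderboard, test
```

A call such as `run_automl_sketch("athletes_clean.csv", "total_lift")` would correspond to the first AutoML run in this notebook, with a much shorter time budget and without the `include_algos` restriction.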