# Homework 4: H2O Modeling with RF & GBM

Each question in this notebook is followed immediately by the code that answers it. Sections are grouped and ordered to follow the modeling workflow.

## 1. Setup

**1.1 Install & import libraries**

In [None]:
!pip install -q h2o

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid import H2OGridSearch

# Start a local H2O cluster, or connect to one that is already running
h2o.init()

## 2. Data Loading & Preprocessing

**2.1 Load the dataset**

In [None]:
dataset = h2o.import_file("/path/to/your/data.csv")  # <-- update with your file path

**2.2 Convert all columns to factors (categorical)**

In [None]:
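# asfactor() makes each column categorical; a categorical response
# tells H2O to treat the task as classification
for col in dataset.columns:
    dataset[col] = dataset[col].asfactor()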

**2.3 Split into train/validation sets**

In [None]:
train, valid = dataset.split_frame(ratios=[0.8], seed=1234)
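
H2O documents `split_frame` as a probabilistic split, so the resulting frames match the requested ratios only approximately. The sketch below illustrates that idea in plain Python. It is not H2O's implementation; `random_split` and its parameters are hypothetical names used only for this example.

```python
import random

def random_split(n_rows, ratio=0.8, seed=1234):
    """Assign each row index to train with probability `ratio`, else to valid."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for i in range(n_rows):
        # One independent uniform draw per row decides where the row goes.
        (train_idx if rng.random() < ratio else valid_idx).append(i)
    return train_idx, valid_idx

train_idx, valid_idx = random_split(1000)
# The sizes land near 800 and 200, but not necessarily exactly on them.
print(len(train_idx), len(valid_idx))
```

Because each row is assigned independently, the split sizes vary slightly from the nominal 80/20, which is worth remembering when checking `train.nrows` and `valid.nrows`.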

**2.4 Define response & feature list**

*Question 2.4:* Which column is our target?

In [None]:
response = "target_column_name"  # <-- set your target column name
features = train.col_names[:]
features.remove(response)

## 3. Random Forest (DRF) with Cross-Validation

**3.1 Train a distributed RF with 5-fold CV**

*Question 3.1:*

In [None]:
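# nfolds=5 turns on 5-fold cross-validation over the training frame;
# the validation frame is scored separately as a held-out check
drf = H2ORandomForestEstimator(nfolds=5, seed=42)
drf.train(x=features, y=response,
          training_frame=train,
          validation_frame=valid)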
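
To make the `nfolds=5` setting concrete, here is a minimal plain-Python sketch of k-fold partitioning. It assigns row `i` to fold `i % n_folds`, which is only one possible assignment scheme and is chosen here for determinism; it is an illustration, not H2O's internal code. The helper name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_rows, n_folds=5):
    """Yield (train_idx, holdout_idx) pairs, placing row i in fold i % n_folds."""
    for fold in range(n_folds):
        holdout_idx = [i for i in range(n_rows) if i % n_folds == fold]
        train_idx = [i for i in range(n_rows) if i % n_folds != fold]
        yield train_idx, holdout_idx

folds = list(kfold_indices(n_rows=10, n_folds=5))
for train_idx, holdout_idx in folds:
    # Each fold holds out a different slice and trains on the rest.
    print("holdout:", holdout_idx, "train size:", len(train_idx))
```

Every row is held out exactly once, so the cross-validated metrics are computed on predictions for rows the corresponding fold model never saw.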

**3.2 View DRF performance on validation set**

In [None]:
print(drf.model_performance(valid))

## 4. Grid Search for GBM

**4.1 Define GBM hyperparameter grid**

*Question 4.1:*

In [None]:
gbm_hyper_params = {
    "max_depth": [3, 5, 7],
    "ntrees":    [50, 100, 200],
    "learn_rate":[0.01, 0.1]
}
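
A Cartesian search trains one model per combination of values, so it helps to count the combinations before launching the grid. The sketch below enumerates them with `itertools.product`; the grid is repeated so the snippet runs on its own, and `combos` is a name introduced only for this example.

```python
from itertools import product

# Same grid as in 4.1, repeated here so the snippet is self-contained.
gbm_hyper_params = {
    "max_depth": [3, 5, 7],
    "ntrees": [50, 100, 200],
    "learn_rate": [0.01, 0.1],
}

# One dict per combination: the Cartesian product of all value lists.
names = list(gbm_hyper_params)
combos = [dict(zip(names, values))
          for values in product(*gbm_hyper_params.values())]

print(len(combos))  # 18
print(combos[0])    # {'max_depth': 3, 'ntrees': 50, 'learn_rate': 0.01}
```

With 3 × 3 × 2 = 18 models to train, adding values to any list multiplies the total, so larger grids grow quickly.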

**4.2 Run grid search with Cartesian strategy**

*Question 4.2:*

In [None]:
gbm_grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator,
    hyper_params=gbm_hyper_params,
    search_criteria={"strategy": "Cartesian"}
)

gbm_grid.train(x=features, y=response,
               training_frame=train,
               validation_frame=valid,
               seed=42)

**4.3 Inspect best GBM model by AUC**

*Question 4.3:*

In [None]:
# Sort the grid by AUC, best first, and take the top model
sorted_grid = gbm_grid.get_grid(sort_by="auc", decreasing=True)
best_gbm = sorted_grid.models[0]
print(best_gbm.model_performance(valid))

## 5. Final Evaluation & Comparison

**5.1 Compare DRF vs. best GBM by AUC**

In [None]:
perf_drf = drf.model_performance(valid)
perf_gbm = best_gbm.model_performance(valid)

print("DRF AUC:", perf_drf.auc())
print("GBM AUC:", perf_gbm.auc())
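
As a reminder of what the two numbers mean, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs on toy data. It illustrates the metric only and is not how H2O calculates AUC; `pairwise_auc` and the sample values are made up for this example.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the model with the higher validation AUC separates the two classes better on this data.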

**5.2 Shutdown H2O**

In [None]:
# h2o.shutdown() is deprecated; shut down through the cluster object instead
h2o.cluster().shutdown(prompt=False)