In this notebook I'm going to check different AutoML solutions:
* [H2O](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#training)
* [LightAutoML](https://github.com/sb-ai-lab/LightAutoML)
* [AutoGluon](https://github.com/autogluon/autogluon)
* [FEDOT](https://github.com/aimclub/FEDOT)

In [4]:
# !pip install -U lightautoml==0.3.7.3

# Imports

In [1]:
import os

import numpy as np
import pandas as pd

In [18]:
import h2o
from h2o.automl import H2OAutoML

from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task
from sklearn.metrics import mean_squared_error
import torch

In [7]:
SEED = 42

# Paths

Here we set a relative path so that we can easily export this notebook from Kaggle, change the path and run it locally.


In [8]:
RELATIVE_PATH = "../data"
ORIGINAL_PATH = "../data"

# Evaluation Metric

In this competition we are going to use Root Mean Squared Error (`RMSE`). 

RMSE is defined as:  
$\textrm{RMSE} =  \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }$,  

where (for each instance $i$):
- $\hat{y}_i$ is the predicted value
- $y_i$ is the true (observed) value.
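As a quick sanity check of the formula, here is a minimal NumPy sketch. The function name `rmse` and the toy arrays are illustrative assumptions, not part of the competition code.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values (made up for illustration): residuals are 0.5, -0.5, 0.0, 1.0
y_true = [2.0, 1.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.0, 3.0]
score = rmse(y_true, y_pred)
print(score)
```

Because the residuals are squared before averaging, a single large error affects RMSE more than several small ones.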

# Data

The dataset for this competition (both `train` and `test`) was generated from a deep learning model trained on the *California Housing Dataset*.  
Feature distributions are close to, but not exactly the same as, those of the original.  

We will use the original dataset as part of this competition, both to explore the differences and to see whether including the original data in training improves model performance.  

## Loading the data

In [9]:
data = pd.read_csv(os.path.join(ORIGINAL_PATH, "train_folds.csv"))
test = pd.read_csv(os.path.join(RELATIVE_PATH, "test.csv"))

## Setting the columns

All features in this dataset are numerical; they are listed below:

In [25]:
num_cols = [
    "MedInc",
    "HouseAge",
    "AveRooms",
    "AveBedrms",
    "Population",
    "AveOccup",
    "Latitude",
    "Longitude",
]
cat_cols = []

feature_cols = num_cols + cat_cols
target_col = "MedHouseVal"

# Columns that are neither features nor the target (defined last, since it uses both names above)
drop_cols = [col for col in data.columns if col not in [*feature_cols, target_col]]

## Dataset Overview

A quick look at the data can be taken with the `df.sample(n=10, random_state=SEED)` method, which returns `n` random rows; `random_state` makes the sample reproducible.  

I looked at 30 random rows but left 10 in the code for easier viewing. No missing values are visible in the sample; we will check this formally in the next section.
You can also look at the beginning and end of the dataset using the `df.head(n=10)` and `df.tail(n=10)` methods, respectively.

In [11]:
data.sample(n=10, random_state=SEED)

Unnamed: 0,Source,KFold,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
13267,competition,0,1.7297,22.0,3.819071,1.00489,2313.0,2.187919,33.9,-118.36,1.098
50068,original,0,5.259,13.0,6.733133,1.032984,1657.0,2.484258,38.64,-121.24,2.494
41330,original,0,3.1806,41.0,4.043333,1.003333,801.0,2.67,34.12,-118.24,2.042
54959,original,0,7.0735,14.0,8.056485,1.117155,2052.0,4.292887,37.38,-121.87,3.356
476,competition,0,4.7276,25.0,5.341176,1.0,769.0,2.735294,37.29,-121.87,2.25
49107,original,0,3.0606,16.0,5.276061,1.070194,5913.0,3.097433,33.98,-117.42,1.195
37342,original,0,2.0375,48.0,4.944606,1.154519,1481.0,4.317784,37.79,-122.23,1.225
47263,original,0,4.7216,20.0,4.961481,1.014815,1822.0,2.699259,33.94,-117.89,2.27
8435,competition,0,3.6094,14.0,4.586837,1.074153,1427.0,2.481735,33.96,-118.14,2.095
48229,original,0,3.7604,15.0,5.804143,1.009416,1268.0,2.387947,33.81,-117.87,2.801


## Dataset Close Look

In this section, let's look at which columns we have, what their types are, how many non-null values each contains, etc.

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57777 entries, 0 to 57776
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Source       57777 non-null  object 
 1   KFold        57777 non-null  int64  
 2   MedInc       57777 non-null  float64
 3   HouseAge     57777 non-null  float64
 4   AveRooms     57777 non-null  float64
 5   AveBedrms    57777 non-null  float64
 6   Population   57777 non-null  float64
 7   AveOccup     57777 non-null  float64
 8   Latitude     57777 non-null  float64
 9   Longitude    57777 non-null  float64
 10  MedHouseVal  57777 non-null  float64
dtypes: float64(9), int64(1), object(1)
memory usage: 4.8+ MB


In [13]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24759 entries, 0 to 24758
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          24759 non-null  int64  
 1   MedInc      24759 non-null  float64
 2   HouseAge    24759 non-null  float64
 3   AveRooms    24759 non-null  float64
 4   AveBedrms   24759 non-null  float64
 5   Population  24759 non-null  float64
 6   AveOccup    24759 non-null  float64
 7   Latitude    24759 non-null  float64
 8   Longitude   24759 non-null  float64
dtypes: float64(8), int64(1)
memory usage: 1.7 MB
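The non-null counts reported by `info()` can also be checked explicitly. Below is a minimal sketch that uses a small made-up frame in place of `data` (the name `toy` and its values are assumptions for illustration); in the notebook the same calls would be applied to `data` and `test`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `data`, with one missing value planted in AveRooms
toy = pd.DataFrame(
    {
        "MedInc": [1.7, 5.3, 3.2],
        "AveRooms": [3.8, np.nan, 4.0],
        "MedHouseVal": [1.1, 2.5, 2.0],
    }
)

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if the frame contains at least one missing value anywhere
has_missing = bool(toy.isna().any().any())
print(has_missing)
```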


In [14]:
data["Source"].value_counts()

competition    37137
original       20640
Name: Source, dtype: int64

In total, the training sample has 57777 rows (37137 from the synthetic competition dataset and 20640 from the original dataset). The test sample has 24759 rows.
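Since the two sources are expected to differ slightly, one simple way to compare them is to summarize the target per `Source`. Here is a minimal sketch on a made-up frame (the `toy` values are illustrative assumptions; in the notebook the same `groupby` would run on `data`):

```python
import pandas as pd

# Made-up stand-in for `data` with only the columns needed here
toy = pd.DataFrame(
    {
        "Source": ["competition", "competition", "original", "original"],
        "MedHouseVal": [1.0, 3.0, 2.0, 2.5],
    }
)

# Row count, mean and standard deviation of the target for each source
summary = toy.groupby("Source")["MedHouseVal"].agg(["count", "mean", "std"])
print(summary)
```

The same pattern works for any feature column, for example by swapping `"MedHouseVal"` for `"MedInc"`.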

# Modelling

## H2O

### Initialization

Let's start a local H2O instance:

In [64]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.17" 2022-10-18; OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu220.04); OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu220.04, mixed mode, sharing)
  Starting server from /opt/conda/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpoc5po2gm
  JVM stdout: /tmp/tmpoc5po2gm/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpoc5po2gm/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.38.0.3
H2O_cluster_version_age:,1 month and 17 days
H2O_cluster_name:,H2O_from_python_unknownUser_w130k8
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,7.500 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


### Reading data into H2O format

In [65]:
train_h2o = h2o.H2OFrame(data)
test_h2o = h2o.H2OFrame(test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


### Running AutoML

Run AutoML with at most 20 base models, using the precomputed `KFold` column for cross-validation:


In [None]:
aml = H2OAutoML(max_models=20, seed=SEED)
aml.train(x=feature_cols, y=target_col, training_frame=train_h2o, fold_column="KFold")

AutoML progress: |
17:22:33.110: Fold column KFold will be used for cross-validation. nfolds parameter will be ignored.

███████████

### View the AutoML Leaderboard

In [17]:
lb = aml.leaderboard
lb.head(rows=lb.nrows) # Print all rows instead of default (10 rows)

model_id,rmse,mse,mae,rmsle,mean_residual_deviance
StackedEnsemble_AllModels_1_AutoML_1_20230108_191800,0.520119,0.270524,0.356694,0.155985,0.270524
StackedEnsemble_BestOfFamily_1_AutoML_1_20230108_191800,0.5257,0.27636,0.36134,0.157718,0.27636
GBM_4_AutoML_1_20230108_191800,0.527322,0.278069,0.362273,0.158171,0.278069
GBM_3_AutoML_1_20230108_191800,0.527508,0.278265,0.363844,0.158578,0.278265
GBM_1_AutoML_1_20230108_191800,0.52804,0.278827,0.363532,0.158555,0.278827
GBM_2_AutoML_1_20230108_191800,0.528719,0.279544,0.365131,0.159082,0.279544
GBM_5_AutoML_1_20230108_191800,0.531871,0.282887,0.368424,0.160352,0.282887
GBM_grid_1_AutoML_1_20230108_191800_model_2,0.537084,0.288459,0.368735,0.160907,0.288459
GBM_grid_1_AutoML_1_20230108_191800_model_1,0.543424,0.29531,0.37821,0.163651,0.29531
XGBoost_3_AutoML_1_20230108_191800,0.544888,0.296903,0.380517,0.165127,0.296903


### Prediction

In [26]:
test_h2o

id,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
37137,1.7062,35,4.96637,1.09654,1318,2.84441,39.75,-121.85
37138,1.3882,22,4.18704,1.09823,2296,3.18022,33.95,-118.29
37139,7.7197,21,7.12944,0.959276,1535,2.88889,33.61,-117.81
37140,4.6806,49,4.7697,1.04848,707,1.74359,34.17,-118.34
37141,3.1284,25,3.76531,1.08163,4716,2.00383,34.17,-118.29
37142,5.7268,23,6.0625,1.14527,1039,2.3871,33.81,-118.11
37143,3.3583,25,5.06878,1.22727,949,3.60256,33.14,-117.12
37144,4.1302,35,5.94472,1.06236,1043,3.16592,34.09,-117.98
37145,1.7991,23,4.92836,1.17406,848,2.55801,37.3,-120.89
37146,1.7857,44,5.71712,1.10164,4276,2.37307,33.98,-117.33


In [19]:
preds = aml.predict(test_h2o)

stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%


#### Save submission file

In [32]:
preds_pd = h2o.as_list(preds, use_pandas=True)
len(preds_pd)  # should match the number of test rows

24759

In [39]:
submission = pd.DataFrame(data={'id': test["id"], 'MedHouseVal': preds_pd.values.flatten()})
submission.head()

Unnamed: 0,id,MedHouseVal
0,37137,0.620736
1,37138,1.104696
2,37139,3.949899
3,37140,3.446578
4,37141,2.489239


In [40]:
submission.to_csv('submission.csv', index=False)

## LightAutoML

### Parameters

In [55]:
N_THREADS = 4
N_FOLDS = 10
TIMEOUT = 900

### Imported models setup

In [56]:
np.random.seed(SEED)
torch.set_num_threads(N_THREADS)

### Setting Task

In [57]:
def root_mean_squared_error(y_true, y_pred, **kwargs):
    # np.sqrt(MSE) avoids `squared=False`, which is deprecated in newer scikit-learn releases
    return np.sqrt(mean_squared_error(y_true, y_pred, **kwargs))

In [58]:
task = Task('reg', metric = root_mean_squared_error, greater_is_better=False)

roles = {
    'target': target_col,
    'drop': drop_cols
}

In [59]:
automl = TabularAutoML(
    task = task,
    timeout = TIMEOUT,
    cpu_limit = N_THREADS,
    reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': SEED}
)

### Training

In [60]:
%%time

oof_pred = automl.fit_predict(data, roles = roles, verbose = 1)

[16:30:01] Stdout logging level is INFO.
[16:30:01] Task: reg

[16:30:01] Start automl preset with listed constraints:
[16:30:01] - time: 900.00 seconds
[16:30:01] - CPU: 4 cores
[16:30:01] - memory: 16 GB

[16:30:01] Train data shape: (57777, 11)

[16:30:06] Layer 1 train process start. Time left 894.96 secs
[16:30:08] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[16:30:12] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -0.6407789564335908
[16:30:12] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[16:30:12] Time left 888.92 secs

[16:30:18] Selector_LightGBM fitting and predicting completed
[16:30:19] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[16:31:27] Time limit exceeded after calculating fold 8

[16:31:27] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = -0.5251702381097954
[16:31:27] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[16:31:27] Start hyp
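
The log shows the time limit being hit during cross-validation, so some out-of-fold predictions may be missing. Here is a minimal sketch of scoring only the rows that have a prediction, assuming missing out-of-fold values are stored as NaN. The toy arrays stand in for `data[target_col].values` and `oof_pred.data[:, 0]`, and the helper name `oof_rmse` is hypothetical.

```python
import numpy as np


def oof_rmse(y_true, oof):
    """RMSE over rows whose out-of-fold prediction is not NaN."""
    y_true = np.asarray(y_true, dtype=float)
    oof = np.asarray(oof, dtype=float)
    mask = ~np.isnan(oof)  # rows that actually received a prediction
    if not mask.any():
        raise ValueError("no out-of-fold predictions available")
    return float(np.sqrt(np.mean((y_true[mask] - oof[mask]) ** 2)))


# Made-up values; the last row mimics a fold that was never scored
y_true = np.array([1.0, 2.0, 3.0, 4.0])
oof = np.array([1.0, 2.5, 2.5, np.nan])
oof_score = oof_rmse(y_true, oof)
print(oof_score)
```

Note that the scores in the log are negative because the task was created with `greater_is_better=False`; the value computed here is the plain, positive RMSE.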

### Prediction

In [61]:
preds = automl.predict(test)

#### Save submission file

In [62]:
submission = pd.DataFrame(data={'id': test["id"], 'MedHouseVal': preds.data[:, 0]})
submission.head()

Unnamed: 0,id,MedHouseVal
0,37137,0.692604
1,37138,0.988744
2,37139,4.033728
3,37140,3.12943
4,37141,2.493648


In [63]:
submission.to_csv('submission_light_automl_2.csv', index=False)