This repository was archived by the owner on Jun 29, 2019. It is now read-only.

H2O performance comparison has major flaws #42

@ledell

Description

Hi, I saw some of the benchmarks from a recent Strata presentation slide deck blogged about here.

There are major flaws in your benchmarking of H2O:

The point of using H2O's Sparkling Water (and rsparkling if you are using R) is to interact with data that is already in a Spark cluster. When your data is on disk, you should instead use the h2o.importFile() function (to do a parallel read from disk into the H2O cluster) and the h2o package for modeling. There is no need to use rsparkling at all.
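A minimal sketch of the direct path described above, with no Spark or rsparkling involved. The file path, the "response" column name, and the choice of h2o.gbm() are placeholders for illustration, not details from the benchmark:

```r
library(h2o)
h2o.init()

# Parallel read from disk straight into the H2O cluster
train <- h2o.importFile("path/to/train.csv")

# Fit a model with the h2o package directly (GBM used as an example)
fit <- h2o.gbm(x = setdiff(names(train), "response"),
               y = "response",
               training_frame = train)
```

This skips the Spark ingest step entirely, so the timing reflects H2O's own I/O and training performance.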

Loading the data from disk into Spark, and then from Spark into H2O, is an unnecessary step, and doing so misrepresents the computational efficiency of H2O relative to the other tools in this benchmark. In the interest of honest and accurate benchmarking practices, it would be great if you could revise the benchmark to reflect this. If you have any questions on how to do this, please let me know.

All you need to do is load the data from disk using h2o.importFile() and then execute those rows of the benchmark. You can also compute performance directly in H2O using h2o.performance() rather than generating predicted values with h2o.predict(). There is nothing wrong with generating the predictions and calculating performance metrics manually; it's just faster to use H2O's h2o.performance() function. To write the predictions back to disk most efficiently, you should use the h2o.exportFile() function.
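A hedged sketch of the scoring steps above. Here `fit` is assumed to be a previously trained H2O model and the test-file path is a placeholder:

```r
# Load the test set in parallel from disk into the H2O cluster
test <- h2o.importFile("path/to/test.csv")

# Compute performance metrics directly in H2O (faster than doing it manually)
perf <- h2o.performance(fit, newdata = test)
h2o.auc(perf)  # e.g. AUC for a binomial model

# Alternatively, generate predictions and write them back to disk in parallel
pred <- h2o.predict(fit, newdata = test)
h2o.exportFile(pred, path = "path/to/preds.csv")
```

Either route is valid; h2o.performance() simply avoids materializing and transferring the full prediction frame when only the metrics are needed.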
