This repository was archived by the owner on Jun 29, 2019. It is now read-only.

H2O performance comparison has major flaws #42

@ledell

Description

Hi, I saw some of the benchmarks from a recent Strata presentation slide deck blogged about here.

There are major flaws in your benchmarking of H2O:

The point of using H2O's Sparkling Water (and rsparkling if you are using R) is to interact with data that is already in a Spark cluster. When your data is on disk, you should instead use the h2o.importFile() function (to do a parallel read from disk into the H2O cluster) and the h2o package for modeling. There is no need to use rsparkling at all.
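A minimal sketch of the direct path described above, with no Spark or rsparkling involved. The file path, the "response" column name, and the choice of h2o.gbm() are placeholders for illustration, not details from the benchmark:

```r
library(h2o)
h2o.init()

# Parallel read from disk straight into the H2O cluster
train <- h2o.importFile("path/to/train.csv")

# Fit a model with the h2o package directly (GBM used as an example)
fit <- h2o.gbm(x = setdiff(names(train), "response"),
               y = "response",
               training_frame = train)
```

This skips the Spark ingest step entirely, so the timing reflects H2O's own I/O and training performance.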

Loading the data from disk into Spark, and then from Spark into H2O, is an unnecessary step, and doing so misrepresents the computational efficiency of H2O relative to the other tools in this benchmark. In the interest of honest and accurate benchmarking practices, it would be great if you could revise the benchmark to reflect this. If you have any questions on how to do this, please let me know.

All you need to do is load the data from disk using h2o.importFile() and then execute those rows of the benchmark. You can also compute performance directly in H2O using h2o.performance() rather than generating predicted values with h2o.predict(). There is nothing wrong with generating the predictions and calculating performance metrics manually; it's just faster to use H2O's h2o.performance() function. To write the predictions back to disk most efficiently, you should use the h2o.exportFile() function.
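A hedged sketch of the scoring steps above. Here `fit` is assumed to be a previously trained H2O model and the test-file path is a placeholder:

```r
# Load the test set in parallel from disk into the H2O cluster
test <- h2o.importFile("path/to/test.csv")

# Compute performance metrics directly in H2O (faster than doing it manually)
perf <- h2o.performance(fit, newdata = test)
h2o.auc(perf)  # e.g. AUC for a binomial model

# Alternatively, generate predictions and write them back to disk in parallel
pred <- h2o.predict(fit, newdata = test)
h2o.exportFile(pred, path = "path/to/preds.csv")
```

Either route is valid; h2o.performance() simply avoids materializing and transferring the full prediction frame when only the metrics are needed.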
