Add ml xgboost workload #638

Open, wants to merge 14 commits into base: master

bobjiang82

  • Reuse GradientBoostedTreeDataGenerator to generate dataset
  • Read dataset and convert to ml.LabeledPoint and to ml.DataFrame
  • Call XGBoost with the passed-in params for training
  • Call XGBoost prediction and print test error
  • Add XGBoost libs configuration doc
  • Use pipeline for training
  • Verified with Scala 2.12, Apache Spark 2.4, and XGBoost v1.1.

Note: based on Xiaochang's PR #628.

@xwu99
Contributor

xwu99 commented Jul 31, 2020

@bobjiang82 #628 is merged. Could you rebase the code to resolve the conflict?

@bobjiang82
Author

@xwu99 Done.

conf/hibench.conf (outdated)


### 8. Run xgboost workload ###

Contributor

Could you change xgboost to XGBoost here, and do the same in the rest of the doc?

```

#### 8.a latest xgboost release (default) ####

Contributor

There's no need to use 8.a, 8.b. Please also use correct capitalization in the titles.

Contributor

@xwu99 xwu99 Aug 5, 2020

I don't think you need to write this, since it's already covered above in section 4, "Run a workload".
I suggest you separate the doc out, merge only the code, and make sure it's runnable with the default HiBench process.

If you only have the xgboost jar files, just copy them to $SPARK_HOME/jars/ and update the relevant versions for xgboost4j and xgboost4j-spark in sparkbench/ml/pom.xml to get aligned.<br>
For example, if xgboost is built from source on a Linux platform, the jars will be generated and installed to ```~/.m2/repository/ml/dmlc/xgboost4j_<scala version>/<xgboost version>-SNAPSHOT/``` and ```~/.m2/repository/ml/dmlc/xgboost4j-spark_<scala version>/<xgboost version>-SNAPSHOT/``` respectively. To use them, copy the 2 jars to $SPARK_HOME/jars/ and update the relevant versions for xgboost4j and xgboost4j-spark in the pom.xml files.<br>
After that, build hibench, prepare data and run xgboost benchmark.

Contributor

Generally, the doc style is not consistent with the original doc, and it is too complicated to follow.
I suggest rewriting or removing it. We can merge the code first. It should be runnable with the default settings.

```
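
The jar-copy step in the doc text under review (copy the two locally built jars into $SPARK_HOME/jars/) could be scripted roughly as below. This is only a sketch: the function name `copy_xgboost_jars` and the versions in the example call are hypothetical, and the jar file name assumes Maven's usual `<artifactId>-<version>.jar` naming in the local repository. The version update in sparkbench/ml/pom.xml still has to be done separately.

```shell
# Sketch only: copy_xgboost_jars is a hypothetical helper, not part of HiBench.
# Arguments: <m2 repository root> <scala version> <xgboost version> <spark home>
copy_xgboost_jars() {
  local m2_repo="$1" scala_version="$2" xgboost_version="$3" spark_home="$4"
  local artifact dir jar
  for artifact in xgboost4j xgboost4j-spark; do
    # Directory layout as described in the doc text under review.
    dir="${m2_repo}/ml/dmlc/${artifact}_${scala_version}/${xgboost_version}-SNAPSHOT"
    # Assumed jar name, following Maven's <artifactId>-<version>.jar convention.
    jar="${dir}/${artifact}_${scala_version}-${xgboost_version}-SNAPSHOT.jar"
    if [ ! -f "${jar}" ]; then
      echo "missing jar: ${jar}" >&2
      return 1
    fi
    cp "${jar}" "${spark_home}/jars/"
  done
}

# Example call (the versions here are illustrative, not taken from the PR):
# copy_xgboost_jars "${HOME}/.m2/repository" 2.12 1.1.0 "${SPARK_HOME}"
```

The function stops at the first missing jar instead of copying a partial set, so a mismatch between the Scala or XGBoost version and the local build shows up before Spark is started.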

commit code first and continue to refine doc.
@bobjiang82
Author

Updated to merge the code first and continue to refine the doc.

@xwu99
Contributor

xwu99 commented Aug 10, 2020

Thanks! Could you add this to the benchmark list (conf/benchmarks.lst) and to CI (travis/benchmarks_ml.lst)?

@bobjiang82
Author

Added xgboost to conf/benchmarks.lst and travis/benchmarks_ml.lst

@xwu99
Contributor

xwu99 commented Aug 17, 2020

@bobjiang82 Could you modify bin/run_all.sh to mask out Hadoop, since this workload is for Spark only?

sync the forked repo with HiBench base
3 participants