How do I actually change the scale factor? #28

Open
v-olmedo opened this issue May 17, 2018 · 2 comments

Comments

@v-olmedo

I do not see any way to do that.

@dilipbiswal
Contributor

Hello @v-olmedo,
Thanks for trying out the code pattern. This pattern was initially aimed at developers, and the target platform was a laptop. My thinking was that data generated with a larger scale factor might be too large for a laptop running Spark, which is why I didn't expose the scale factor. Here is the line in the code that currently hard-codes it to 1G:

 "2")  gen_data $TPCDS_ROOT_DIR '1G' ;;

You can change that value to increase the scale factor. Please make sure to move the data to HDFS if you want parallelism in processing. You may also want to partition the data. I have touched upon this very briefly in the doc.
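
As a rough sketch of the change described above, the hard-coded value could be read from an environment variable instead. The names `SCALE_FACTOR` and `run_option` are hypothetical and not part of the script, and the `gen_data` function here is only a stand-in so the snippet runs on its own. Whether values other than `1G` are accepted depends on how the real `gen_data` parses its second argument, so treat `10G` as an assumption to verify.

```shell
#!/usr/bin/env bash
# Sketch only: SCALE_FACTOR and run_option are hypothetical names,
# not part of the repository's script.

# Stand-in for the script's real gen_data function, so this sketch is self-contained.
gen_data() {
  local root_dir="$1" scale="$2"
  echo "generating data under ${root_dir} at scale ${scale}"
}

# Assumed default location; the real script sets TPCDS_ROOT_DIR itself.
TPCDS_ROOT_DIR="${TPCDS_ROOT_DIR:-/tmp/tpcds}"
# Fall back to the currently hard-coded value when nothing is supplied.
SCALE_FACTOR="${SCALE_FACTOR:-1G}"

# Mirrors the menu branch quoted above, with the literal '1G' replaced by a variable.
run_option() {
  case "$1" in
    "2") gen_data "$TPCDS_ROOT_DIR" "$SCALE_FACTOR" ;;
    *)   echo "unknown option: $1" >&2; return 1 ;;
  esac
}
```

With something like this in place, setting `SCALE_FACTOR=10G` before launching the script would select a larger data set without editing the `case` branch again, and leaving it unset keeps the current 1G behavior.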

@HichamISIMA

Hello @dilipbiswal,
You stated: "Please make sure to move the data to HDFS". Does that mean that dsdgen can't generate the tables in a parallel, distributed manner across a cluster that doesn't use HDFS? Also, for query execution with dsqgen, I don't seem to get any distributed processing!
