
[Doc] add performance tuning page #494

Merged 2 commits on Nov 1, 2023

Conversation

@wbo4958 (Collaborator) commented Oct 31, 2023

No description provided.

Signed-off-by: Bobby Wang <wbo4958@gmail.com>

However, if you are using a spark-rapids-ml version earlier than 23.10.0 or a Spark
standalone cluster version below 3.4.0, you still need to set
`"spark.task.resource.gpu.amount"` equal to `"spark.executor.resource.gpu.amount"`. For example,
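The quoted example itself is elided in this thread. As a minimal sketch of the constraint being discussed (the config keys are real Spark properties; the values are illustrative assumptions, not taken from the PR):

```python
# Illustrative Spark conf for spark-rapids-ml < 23.10.0 or Spark standalone < 3.4.0:
# the per-task GPU amount is set equal to the per-executor GPU amount, so only
# one task runs per executor at a time.
conf = {
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "1",  # set equal to the executor amount
}

# The constraint the doc text describes:
assert conf["spark.task.resource.gpu.amount"] == conf["spark.executor.resource.gpu.amount"]
```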
Collaborator:

Why do they have to be equal? We just need to make sure there is one task per GPU, not necessarily one task per executor.

Collaborator (Author):

Yeah, you're right. The new commit fixes this part.

...
```

The above submit command specifies a request for 1 GPU and 12 CPUs per executor. So you can see,
Collaborator:

submit -> spark-submit

Collaborator (Author):

Done

```

The above submit command specifies a request for 1 GPU and 12 CPUs per executor. So you can see,
a total of 12 tasks per executor will be executed concurrently during the ETL phase.
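The arithmetic behind that claim can be sketched as follows (a minimal illustration; `spark.task.cpus` defaulting to 1 is standard Spark behavior, and the 12-CPU figure comes from the quoted submit command):

```python
# Back-of-the-envelope ETL concurrency for an executor with 12 CPUs,
# using Spark's default of 1 CPU per task (spark.task.cpus = 1).
executor_cores = 12
task_cpus = 1

etl_tasks = executor_cores // task_cpus
print(etl_tasks)  # 12 concurrent tasks per executor during the ETL phase
```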
Collaborator:

Maybe add that stage level scheduling is then used internal to the library to automatically carry out the ML training phases using the required 1 gpu per task.
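The effect of stage-level scheduling on per-executor concurrency can be illustrated with a small slot calculation (a hedged sketch, not the library's implementation; `concurrent_tasks` and all values are hypothetical, matching the 1-GPU / 12-CPU executor discussed above):

```python
# Sketch: how per-executor task concurrency differs between the ETL stage
# (CPU-bound) and the training stage (which requests 1 full GPU per task
# via stage-level scheduling).
def concurrent_tasks(executor_cores, executor_gpus, task_cpus, task_gpus):
    """Max concurrent tasks an executor can run given its resources."""
    slots = executor_cores // task_cpus
    if task_gpus > 0:
        # GPU requests further cap the number of simultaneous tasks.
        slots = min(slots, int(executor_gpus // task_gpus))
    return slots

etl = concurrent_tasks(12, 1, task_cpus=1, task_gpus=0)    # ETL: limited by CPUs
train = concurrent_tasks(12, 1, task_cpus=1, task_gpus=1)  # training: 1 task per GPU
```

So the same executor runs 12 tasks concurrently during ETL, but only 1 during GPU training.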

Collaborator (Author):

Done.

@eordentlich (Collaborator) left a review comment:

👍

@wbo4958 wbo4958 merged commit f0121ef into NVIDIA:branch-23.12 Nov 1, 2023
1 check passed
@wbo4958 wbo4958 deleted the performance branch November 1, 2023 02:43
@eordentlich (Collaborator):

@wbo4958 Can this be merged to 23.10 also?

wbo4958 added a commit to wbo4958/spark-rapids-ml that referenced this pull request Nov 6, 2023
* [Doc] add performance tuning page

Signed-off-by: Bobby Wang <wbo4958@gmail.com>

* comments

---------

Signed-off-by: Bobby Wang <wbo4958@gmail.com>