-
Notifications
You must be signed in to change notification settings - Fork 42
Extend DSE for Env Vars along with Cmd Args #408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
TaekyungHeo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's have two approvals. Please review my comments. Some are minor.
TaekyungHeo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's have two approvals.
TaekyungHeo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's remove pyright ignore gradually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Minor comment, which is not a blocker: You replaced
Dict[str, str]withUnion[str, List[str]]. You might consider introducing a new type, such asDSEableParam = Union[str, List[str]], or creating a new class to improve readability and maintainability.
I don;t think this is required for such a small change. This is basic python and no need to have a class definition. I infact think this makes it worse. |
Summary
Lot of workloads defined in the test toml also requires sweeps in environment variables (
extra_env_vars). Previous support for only exposed thecmd_argsas DSE'ble. This PR expands the scope to parameters defined inextra_env_varsfield defined in Test toml.Similar to how users can specify the any parameters defined under
cmd_argsas list which will be activate the DSE. Likewise, they can also expand any parameters inextra_env_varsas list and CloudAI will activate the DSE for these parameters and add them to the agents action space.Example
Note:
extra_env_varswhich is controlled by users in Test toml in this PR. If there is a requirement for system toml, we should be able to figure something out too.Auxiliary inconsistency fixes
It looks like most of the local dev uses Python 3.10. The Github CI uses python 3.9. One corner cases is when the local CI/CD passes and the upstream ones fail.
Example is the failed CI test due to a python 3.10 feature.
In python 3.10 this is a valid way to define things instead of
Unionfrom typing. Though this is fixed to use Union to ensure backward compatibility, in general saves lot of time to move this CI flow to 3.10 since this is what we use in private repo and as well as most clusters.In 3.9, this will result in the following error
@amaslenn FYI
Discussion on Match14th (with @amaslenn @TaekyungHeo ): Some older cluster might need 3.9. But cloudaix already uses 3.10. So not sure why are can't upgrade it given even cloudaix need to run on older cluster?
But need broader discussion on this separately as agreed.
Test Plan
Nemo2.0 Llama3-8b Model
NCCL Test All-Gather
NCCL AG with Extra Env Args
Output (representative). Please ping me offline for more details.
NCCL AG with Extra Env Args and CmdArgs
Results
Representative output. Ping me on slack for more details
Additional Notes
More unit test and workloads (e.g., NCCL, Nemo) will be tested. UCC will be tested after sync-up with Sergey. But can be outside of this PR. Note, the reward generation etc depends upon the objects. So the generic report will be useful. But can fix this once we understand the objective.