This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Azure RDMA] Merge Azure RDMA change into master branch #2091

Merged
merged 14 commits into from
Jan 29, 2019

ydye (Contributor) commented Jan 28, 2019

  • Cluster-level configuration switch that lets the admin enable the Azure RDMA feature.

  • User-level job parameter for requesting an Azure RDMA-capable container.

  • Code changes in restserver needed to pass the Azure RDMA environment to the job container.

  • SSH and SFTP-copy tools to help the admin maintain the cluster machines.

  • An example job running the Intel MPI benchmark over Azure RDMA, with a guide for users on running the MPI task.

  • A tutorial for admins on enabling Azure RDMA for the cluster.

  • Known issue:

  1. If a machine is added to or removed from the cluster, the admin must restart restserver to refresh the machine list. Automatic refresh is deferred as a feature for the future.
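
The first two bullets suggest that a job container gets the Azure RDMA environment only when the admin has switched the feature on for the cluster and the job also asks for it. A minimal sketch of that gating follows. It assumes the two settings combine with a logical AND, and every name in it (`ClusterConfig`, `az_rdma_enabled`, `JobRequest`, `wants_az_rdma`, `container_gets_az_rdma`) is hypothetical rather than taken from the PR.

```python
from dataclasses import dataclass


@dataclass
class ClusterConfig:
    # Hypothetical cluster-level switch set by the admin.
    az_rdma_enabled: bool = False


@dataclass
class JobRequest:
    # Hypothetical user-level job parameter.
    wants_az_rdma: bool = False


def container_gets_az_rdma(cluster: ClusterConfig, job: JobRequest) -> bool:
    """Return True only when both the cluster switch and the job parameter are on.

    Assumption: the two settings are combined with AND; the PR text does not
    spell this out.
    """
    return cluster.az_rdma_enabled and job.wants_az_rdma
```

Under this assumption, a job that requests RDMA on a cluster where the switch is off would simply receive an ordinary container.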

ydye and others added 10 commits January 15, 2019 13:56
…t passed in (#2010)

* add avg to singlestate panel (#2006)

* Add necessary rdma environment in azure to restserver's yarn container startup script.

* Issue Fix

* [Doc] update job tutorial doc about minFailedTaskCount and minSucceededTaskCount (#2009)

* update job tutorial doc

* fix comment

* fix comment

* fix min succeed task count

* Issue Fix

* fix_log_path (#2012)

* Issue Fix

* Issue Fix

* Issue Fix

* Issue Fix

* issue fix

* add more node related alerts (#2008)

* update virtual cluster doc (#1991)

* update virtual cluster doc

* change vc's definition

* add description of vc capacity and availability

* fix typo

* issue fix

* issue fix

* issue fix

* issue fix

* issue fix

* issue fix

coveralls commented Jan 28, 2019

Coverage increased (+0.2%) to 52.845% when pulling f254c7b on az-r into 9480bb0 on master.

logger.warning("3 Times......... Sorry, we will force stopping your operation.")
sys.exit(1)

def run(self):

Contributor

We'd better choose a more readable name.
For example, it could be uploade_xxx_ and so on.

Contributor Author

I don't think so. I think a unified entry point makes more sense.

Contributor

run is too common a name and should be avoided.

paictl.py (resolved)
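
The disagreement in this thread is between a descriptive, per-class method name and a single method name shared by every command. Below is a rough sketch of the second option. The command classes (`UploadCommand`, `CheckCommand`), the `COMMANDS` registry, and the `dispatch` helper are all hypothetical and do not come from paictl.py.

```python
class UploadCommand:
    """Hypothetical command; stands in for any maintenance action."""

    def run(self):
        return "upload done"


class CheckCommand:
    """Second hypothetical command exposing the same entry point."""

    def run(self):
        return "check done"


# Hypothetical registry mapping a command-line name to its class.
COMMANDS = {"upload": UploadCommand, "check": CheckCommand}


def dispatch(name):
    """Look up the command by name and call its shared run() method."""
    return COMMANDS[name]().run()
```

With a shared name, the dispatcher needs no per-command knowledge. The cost, as the reviewer points out, is that run alone says little about what each class does.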

ydye (Contributor, Author) commented Jan 28, 2019

Optimization of machine-list is done.

ydye (Contributor, Author) commented Jan 28, 2019

Depending on for...else... done.
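
For readers unfamiliar with the construct mentioned here: in Python, the else branch of a for loop runs only when the loop finishes without hitting break. The sketch below shows a three-attempt confirmation written that way. `confirm` and `read_answer` are hypothetical names, and the actual code in the PR may be structured differently.

```python
def confirm(read_answer, attempts=3):
    """Ask up to `attempts` times.

    Returns True or False on a valid answer, or None when every attempt
    was invalid. `read_answer` is a hypothetical callable returning a string.
    """
    for _ in range(attempts):
        answer = read_answer().strip().lower()
        if answer in ("y", "yes"):
            result = True
            break
        if answer in ("n", "no"):
            result = False
            break
    else:
        # Reached only when no break happened, i.e. every attempt was invalid.
        result = None
    return result
```

A caller could treat a None result as the point at which to log a warning and exit, similar to the warning followed by sys.exit(1) shown in the diff excerpt.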

ydye (Contributor, Author) commented Jan 28, 2019

Replaced == with ===.

deployment/utility/ssh.py (outdated, resolved)

ydye (Contributor, Author) commented Jan 29, 2019

Moved the machine list generation logic to paictl. Done.

@ydye ydye merged commit 6e6d41d into master Jan 29, 2019
@ydye ydye deleted the az-r branch January 29, 2019 06:49
5 participants