Experiment doc update #458
Conversation
…experiment-doc-update
…experiment-doc-update # Conflicts: # doc/autoscale/experiment/README.md
doc/autoscale/experiment/README.md
Outdated
> ## Environment
> To verify effectiveness of PaddlePaddle's fault-tolerance and auto-scaling mechanism.
-> "To verify the effectiveness of PaddlePaddle's fault-tolerance and auto-scaling mechanism."
doc/autoscale/experiment/README.md
Outdated
> How the effectiveness are measured.
are -> is
doc/autoscale/experiment/README.md
Outdated
> ## Experiment Metric
> 1. Cluster computing resource overall utilization.
>    - the higher the better.
>    - higher utilization means less resource are idle.
are -> is
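For reference, the utilization metric quoted above could be computed from periodic cluster samples roughly as follows. This is a minimal sketch; the function and data shapes are assumptions for illustration, not code from the experiment.

```python
# Hypothetical sketch: overall cluster utilization averaged over samples.
# `samples` is a list of (cpu_used, cpu_total) pairs collected periodically,
# e.g. from the cluster's metrics API. Names are illustrative only.

def average_utilization(samples):
    """Return the mean fraction of CPU in use across all samples (0.0-1.0)."""
    if not samples:
        return 0.0
    return sum(used / total for used, total in samples) / len(samples)

print(average_utilization([(60, 100), (80, 100), (100, 100)]))  # roughly 0.8
```

The same shape works for GPU or memory samples; "the higher the better" then reads directly off this number.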
doc/autoscale/experiment/README.md
Outdated
> 1. Task average pending time.
>    - the less the better.
>    - the less pending time the earlier developers and researchers can start seeing the training cost curve, and the better they can verify the training algorithm effectiveness.
>    - This is a common pain point of researchers with internal cloud.
internal -> the internal
doc/autoscale/experiment/README.md
Outdated
> 1. Task average execution time.
>    - the less the better.
>    - average execution time is another way of measuring computing resource utilization. the less execution time, the higher overall utilization.
>    - average execution time is also the way of measuring effectiveness of fault-tolerance. If the fault-tolerance is not working properly, the training job will simply fail or finish with significantly longer duration.
effectiveness -> the effectiveness
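The pending-time and execution-time metrics discussed above could both be derived from per-task timestamps. A hedged sketch, assuming each task record carries submit/start/finish times (the field names are illustrative, not from the experiment code):

```python
# Hypothetical sketch: average pending and execution time per task.
# Each task dict has "submit", "start", "finish" timestamps in seconds;
# pending = start - submit, execution = finish - start.

def average_times(tasks):
    """Return (avg_pending_seconds, avg_execution_seconds)."""
    pending = [t["start"] - t["submit"] for t in tasks]
    execution = [t["finish"] - t["start"] for t in tasks]
    n = len(tasks)
    return sum(pending) / n, sum(execution) / n

tasks = [
    {"submit": 0, "start": 30, "finish": 630},
    {"submit": 10, "start": 70, "finish": 370},
]
print(average_times(tasks))  # (45.0, 450.0)
```

Tracking both separately matters here: pending time reflects scheduling latency, while execution time reflects utilization and fault-tolerance overhead.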
doc/autoscale/experiment/README.md
Outdated
>    - the less the better.
>    - average execution time is another way of measuring computing resource utilization. the less execution time, the higher overall utilization.
>    - average execution time is also the way of measuring effectiveness of fault-tolerance. If the fault-tolerance is not working properly, the training job will simply fail or finish with significantly longer duration.
> 1. Quality of service with Hybrid cluster
"Hybrid" doesn't need capitalization (hybrid); maybe "general purpose cluster" is easier to understand?
corrected.
doc/autoscale/experiment/README.md
Outdated
> ### Resource utilization increased in both cases
>
> Utilization increased by XX% in case one, and XX% in case two
Utilization won't necessarily increase in case 1.
Just a placeholder for now; will update these conclusions when we have real numbers.
doc/autoscale/experiment/README.md
Outdated
> XX% in case one and XX% in case two.
>
> ### Average execution time reduced in both cases
Average execution time won't necessarily be reduced in case 1.
Just a placeholder for now.
doc/autoscale/experiment/README.md
Outdated
> XX% in case one and XX% in case two.
>
> ### Improved the service quality with Hybrid cloud
"Hybrid" doesn't need capitalization (hybrid); maybe "general purpose cluster" is easier to understand?
corrected, thanks!
doc/autoscale/experiment/README.md
Outdated
> ### Improved the service quality with Hybrid cloud
>
> As shown in test case two, Paddlepaddle yields resource to more important online services when QPS is get intensive.
Paddlepaddle -> PaddlePaddle
doc/autoscale/experiment/README.md
Outdated
> ### Improved the service quality with Hybrid cloud
>
> As shown in test case two, Paddlepaddle yields resource to more important online services when QPS is get intensive.
is get -> is getting
Maybe change "QPS" to "the load", since our test case 2 does not use QPS (uses pod count instead).
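The yielding behaviour referred to here (test case 2 uses pod count, not QPS, as the load signal) might look like the following sketch. All names and the threshold logic are assumptions for illustration, not the actual controller:

```python
# Hypothetical sketch: compute the CPU budget left for training after
# reserving capacity for the online service's pods (the load signal used
# in test case 2 is pod count, not QPS). Illustrative only.

def trainer_budget(total_cpu, online_pods, cpu_per_pod, min_trainer_cpu=0):
    """Return CPU available to trainers after the online service is served."""
    reserved = online_pods * cpu_per_pod
    return max(min_trainer_cpu, total_cpu - reserved)

print(trainer_budget(100, online_pods=8, cpu_per_pod=10))  # 20
```

As the online service scales out (more pods), the training job's budget shrinks, which is the "yield resource to more important online services" behaviour described in the conclusion.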
got it, thanks!
LGTM++
doc/autoscale/experiment/README.md
Outdated
> ## Experiment Metric
> 1. Cluster computing resource overall utilization.
>    - the higher the better.
>    - higher utilization means less resource is idle.
> higher utilization means less resource is idle.

This sentence is too simple. It could describe something like: "Autoscaling is intended to maximize overall cluster resource (CPU, GPU, memory) usage by first ensuring resource for production-level jobs/services, then fairly scaling the jobs that are scalable to use the resource left in the cluster."
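The allocation policy described in this suggestion (guarantee production jobs first, then fairly share leftover capacity among scalable jobs) could be sketched roughly as below. All names are assumptions; this is not the real PaddlePaddle autoscaler:

```python
# Hypothetical sketch of the described policy: production jobs get their
# full request first; scalable jobs then split remaining capacity evenly,
# each capped at its own request. Illustrative only.

def allocate(capacity, production, scalable):
    """production/scalable map job name -> requested CPU; returns job -> CPU."""
    alloc = dict(production)            # production requests are guaranteed
    left = capacity - sum(production.values())
    if scalable and left > 0:
        share = left / len(scalable)    # fair share of leftover capacity
        for job, req in scalable.items():
            alloc[job] = min(req, share)
    else:
        alloc.update({job: 0 for job in scalable})
    return alloc

print(allocate(100, {"serving": 40}, {"trainA": 50, "trainB": 20}))
# {'serving': 40, 'trainA': 30.0, 'trainB': 20}
```

A production scaler would also redistribute the share a capped job leaves unused; the sketch keeps only the two properties named in the comment (guaranteed production, fair scaling of the rest).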
doc/autoscale/experiment/README.md
Outdated
>    - This is a common pain point of researchers with the internal cloud.
> 1. Task average execution time.
>    - the less the better.
>    - average execution time is another way of measuring computing resource utilization. the less execution time, the higher overall utilization.
Jobs may take longer to run when the total resource requested is more than what we have in the cluster, because the scaler will scale down some low-priority jobs to guarantee higher-priority jobs. So when inspecting the results of the experiment, the less the average job running time increases, the better the scaler performs.
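The metric this comment suggests (how little average job running time increases under contention, relative to an uncontended baseline) could be expressed as, for example:

```python
# Hypothetical sketch: fractional increase of mean job running time under
# resource contention vs. an uncontended baseline. Lower is better; it
# indicates the scaler squeezed jobs less to satisfy high-priority work.

def running_time_increase(baseline_times, contended_times):
    """Return relative increase of mean running time (0.25 means +25%)."""
    base = sum(baseline_times) / len(baseline_times)
    cont = sum(contended_times) / len(contended_times)
    return (cont - base) / base

print(running_time_increase([100, 120], [121, 143]))  # 0.2
```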
LGTM!
update DOE with explanation of metrics and other tweaks.