Experiment doc update #458
Conversation
…experiment-doc-update
…experiment-doc-update # Conflicts: # doc/autoscale/experiment/README.md
doc/autoscale/experiment/README.md
Outdated
> ## Environment
> To verify effectiveness of PaddlePaddle's fault-tolerance and auto-scaling mechanism.
-> "To verify the effectiveness of PaddlePaddle's fault-tolerance and auto-scaling mechanism."
doc/autoscale/experiment/README.md
Outdated
> How the effectiveness are measured.
are -> is
doc/autoscale/experiment/README.md
Outdated
> ## Experiment Metric
> 1. Cluster computing resource overall utilization.
>    - the higher the better.
>    - higher utilization means less resource are idle.
are -> is
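For reference, the utilization metric quoted above could be computed from periodic cluster samples roughly as follows. This is a minimal sketch; the function and data shapes are assumptions for illustration, not code from the experiment.

```python
# Hypothetical sketch: overall cluster utilization averaged over samples.
# `samples` is a list of (cpu_used, cpu_total) pairs collected periodically,
# e.g. from the cluster's metrics API. Names are illustrative only.

def average_utilization(samples):
    """Return the mean fraction of CPU in use across all samples (0.0-1.0)."""
    if not samples:
        return 0.0
    return sum(used / total for used, total in samples) / len(samples)

print(average_utilization([(60, 100), (80, 100), (100, 100)]))  # roughly 0.8
```

The same shape works for GPU or memory samples; "the higher the better" then reads directly off this number.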
doc/autoscale/experiment/README.md
Outdated
> 1. Task average pending time.
>    - the less the better.
>    - the less pending time the earlier developers and researchers can start seeing the training cost curve, and the better they can verify the training algorithm effectiveness.
>    - This is a common pain point of researchers with internal cloud.
internal -> the internal
doc/autoscale/experiment/README.md
Outdated
> 1. Task average execution time.
>    - the less the better.
>    - average execution time is another way of measuring computing resource utilization. the less execution time, the higher overall utilization.
>    - average execution time is also the way of measuring effectiveness of fault-tolerance. If the fault-tolerance is not working properly, the training job will simply fail or finish with significantly longer duration.
effectiveness -> the effectiveness
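The pending-time and execution-time metrics discussed above could both be derived from per-task timestamps. A hedged sketch, assuming each task record carries submit/start/finish times (the field names are illustrative, not from the experiment code):

```python
# Hypothetical sketch: average pending and execution time per task.
# Each task dict has "submit", "start", "finish" timestamps in seconds;
# pending = start - submit, execution = finish - start.

def average_times(tasks):
    """Return (avg_pending_seconds, avg_execution_seconds)."""
    pending = [t["start"] - t["submit"] for t in tasks]
    execution = [t["finish"] - t["start"] for t in tasks]
    n = len(tasks)
    return sum(pending) / n, sum(execution) / n

tasks = [
    {"submit": 0, "start": 30, "finish": 630},
    {"submit": 10, "start": 70, "finish": 370},
]
print(average_times(tasks))  # (45.0, 450.0)
```

Tracking both separately matters here: pending time reflects scheduling latency, while execution time reflects utilization and fault-tolerance overhead.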
doc/autoscale/experiment/README.md
Outdated
>    - the less the better.
>    - average execution time is another way of measuring computing resource utilization. the less execution time, the higher overall utilization.
>    - average execution time is also the way of measuring effectiveness of fault-tolerance. If the fault-tolerance is not working properly, the training job will simply fail or finish with significantly longer duration.
> 1. Quality of service with Hybrid cluster
"Hybrid" doesn't need capitalization (hybrid); maybe "general purpose cluster" is easier to understand?
corrected.
doc/autoscale/experiment/README.md
Outdated
> ### Resource utilization increased in both cases
>
> Utilization increased by XX% in case one, and XX% in case two
Utilization won't necessarily increase in case 1.
Just a placeholder for now; will update these conclusions when we have real numbers.
doc/autoscale/experiment/README.md
Outdated
> XX% in case one and XX% in case two.
>
> ### Average execution time reduced in both cases
Average execution time won't necessarily be reduced in case 1.
Just a placeholder for now.
doc/autoscale/experiment/README.md
Outdated
> XX% in case one and XX% in case two.
>
> ### Improved the service quality with Hybrid cloud
"Hybrid" doesn't need capitalization (hybrid); maybe "general purpose cluster" is easier to understand?
corrected, thanks!
doc/autoscale/experiment/README.md
Outdated
> ### Improved the service quality with Hybrid cloud
>
> As shown in test case two, Paddlepaddle yields resource to more important online services when QPS is get intensive.
Paddlepaddle -> PaddlePaddle
doc/autoscale/experiment/README.md
Outdated
> ### Improved the service quality with Hybrid cloud
>
> As shown in test case two, Paddlepaddle yields resource to more important online services when QPS is get intensive.
is get -> is getting
Maybe change "QPS" to "the load", since our test case 2 does not use QPS (uses pod count instead).
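The yielding behaviour referred to here (test case 2 uses pod count, not QPS, as the load signal) might look like the following sketch. All names and the threshold logic are assumptions for illustration, not the actual controller:

```python
# Hypothetical sketch: compute the CPU budget left for training after
# reserving capacity for the online service's pods (the load signal used
# in test case 2 is pod count, not QPS). Illustrative only.

def trainer_budget(total_cpu, online_pods, cpu_per_pod, min_trainer_cpu=0):
    """Return CPU available to trainers after the online service is served."""
    reserved = online_pods * cpu_per_pod
    return max(min_trainer_cpu, total_cpu - reserved)

print(trainer_budget(100, online_pods=8, cpu_per_pod=10))  # 20
```

As the online service scales out (more pods), the training job's budget shrinks, which is the "yield resource to more important online services" behaviour described in the conclusion.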
got it, thanks!
LGTM++
doc/autoscale/experiment/README.md
Outdated
> ## Experiment Metric
> 1. Cluster computing resource overall utilization.
>    - the higher the better.
>    - higher utilization means less resource is idle.
> higher utilization means less resource is idle.

This sentence is too simple. It could describe something like: "Autoscaling is intended to maximize overall cluster resource (CPU, GPU, memory) usage by first ensuring resource for production-level jobs/services, then fairly scaling the jobs that are scalable to use the resource left in the cluster."
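The allocation policy described in this suggestion (guarantee production jobs first, then fairly share leftover capacity among scalable jobs) could be sketched roughly as below. All names are assumptions; this is not the real PaddlePaddle autoscaler:

```python
# Hypothetical sketch of the described policy: production jobs get their
# full request first; scalable jobs then split remaining capacity evenly,
# each capped at its own request. Illustrative only.

def allocate(capacity, production, scalable):
    """production/scalable map job name -> requested CPU; returns job -> CPU."""
    alloc = dict(production)            # production requests are guaranteed
    left = capacity - sum(production.values())
    if scalable and left > 0:
        share = left / len(scalable)    # fair share of leftover capacity
        for job, req in scalable.items():
            alloc[job] = min(req, share)
    else:
        alloc.update({job: 0 for job in scalable})
    return alloc

print(allocate(100, {"serving": 40}, {"trainA": 50, "trainB": 20}))
# {'serving': 40, 'trainA': 30.0, 'trainB': 20}
```

A production scaler would also redistribute the share a capped job leaves unused; the sketch keeps only the two properties named in the comment (guaranteed production, fair scaling of the rest).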
doc/autoscale/experiment/README.md
Outdated
>    - This is a common pain point of researchers with the internal cloud.
> 1. Task average execution time.
>    - the less the better.
>    - average execution time is another way of measuring computing resource utilization. the less execution time, the higher overall utilization.
Jobs may take longer to run when the total resource requested is more than what we have in the cluster, because the scaler will scale down some low-priority jobs to guarantee higher-priority jobs. So when inspecting the results of the experiment, the less the average job running time increases, the better the scaler performs.
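The metric this comment suggests (how little average job running time increases under contention, relative to an uncontended baseline) could be expressed as, for example:

```python
# Hypothetical sketch: fractional increase of mean job running time under
# resource contention vs. an uncontended baseline. Lower is better; it
# indicates the scaler squeezed jobs less to satisfy high-priority work.

def running_time_increase(baseline_times, contended_times):
    """Return relative increase of mean running time (0.25 means +25%)."""
    base = sum(baseline_times) / len(baseline_times)
    cont = sum(contended_times) / len(contended_times)
    return (cont - base) / base

print(running_time_increase([100, 120], [121, 143]))  # 0.2
```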
LGTM!
update DOE with explanation of metrics and other tweaks.