Experiment doc update #458

Merged
merged 9 commits into from
Oct 31, 2017

@putcn putcn commented Oct 30, 2017

update DOE with explanation of metrics and other tweaks.


## Environment Enviroment
To verify effectiveness of PaddlePaddle's fault-tolerance and auto-scaling mechanism.
Collaborator

-> "To verify the effectiveness of PaddlePaddle's fault-tolerance and auto-scaling mechanism."


How the effectiveness are measured.
Collaborator

are -> is

## Experiment Metric
1. Cluster computing resource overall utilization.
- the higher the better.
- higher utilization means less resource are idle.
Collaborator

are -> is

1. Task average pending time.
- the less the better.
- the less pending time the earlier developers and researchers can start seeing the training cost curve, and the better they can verify the training algorithm effectiveness.
- This is a common pain point of researchers with internal cloud.
Collaborator

internal -> the internal

1. Task average execution time.
- the less the better.
- average execution time is another way of measuring computing resource utilization. the less execution time, the higher overall utilization.
- average execution time is also the way of measuring effectiveness of fault-tolerance. If the fault-tolerance is not working properly, the training job will simply fail or finish with significantly longer duration.
Collaborator

effectiveness -> the effectiveness

1. Quality of service with Hybrid cluster
Collaborator

@helinwang helinwang Oct 30, 2017

"Hybrid" doesn't need capitalization (hybrid). Maybe "general purpose cluster" is easier to understand?

Author

corrected.
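
For reference, the first three metrics in the list could be computed from per-job records roughly as follows. This is only a sketch: the `Job` record, its field names, and the assumption that a job holds a constant number of CPUs while running are hypothetical and not taken from the experiment scripts (with auto-scaling the CPU count would vary over time).

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; times are seconds since the experiment started.
    submitted: float
    started: float
    finished: float
    cpus: float  # CPUs held while running (assumed constant for simplicity)


def average_pending_time(jobs):
    """Mean wait between submission and start; lower is better."""
    return sum(j.started - j.submitted for j in jobs) / len(jobs)


def average_execution_time(jobs):
    """Mean time between start and finish; lower is better."""
    return sum(j.finished - j.started for j in jobs) / len(jobs)


def overall_utilization(jobs, cluster_cpus, duration):
    """CPU-seconds used divided by CPU-seconds available; higher is better."""
    used = sum((j.finished - j.started) * j.cpus for j in jobs)
    return used / (cluster_cpus * duration)
```

For example, two 4-CPU jobs that each run for 100 seconds on an 8-CPU cluster observed for 200 seconds give a utilization of 0.5.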


### Resource utilization increased in both cases

Utilization increased by XX% in case one, and XX% in case two
Collaborator

Utilization in case 1 won't necessarily increase.

Author

Just a placeholder for now. Will update these conclusions when we have real numbers.


XX% in case one and XX% in case two.

### Average execution time reduced in both cases
Collaborator

The average execution time in case 1 won't necessarily be reduced.

Author

Just a placeholder for now.


XX% in case one and XX% in case two.

### Improved the service quality with Hybrid cloud
Collaborator

"Hybrid" doesn't need capitalization (hybrid). Maybe "general purpose cluster" is easier to understand?

Author

corrected, thanks!


### Improved the service quality with Hybrid cloud

As shown in test case two, Paddlepaddle yields resource to more important online services when QPS is get intensive.
Collaborator

Paddlepaddle -> PaddlePaddle


Collaborator

is get -> is getting

Maybe change "QPS" to "the load", since our test case 2 does not use QPS (uses pod count instead).

Author

got it, thanks!

helinwang previously approved these changes Oct 31, 2017
Collaborator

@helinwang helinwang left a comment

LGTM++

@putcn putcn requested review from helinwang and removed request for Yancey1989 October 31, 2017 00:02
## Experiment Metric
1. Cluster computing resource overall utilization.
- the higher the better.
- higher utilization means less resource is idle.
Collaborator

higher utilization means less resource is idle.

This sentence is too simple. It could be described like this:

Autoscaling is intended to maximize overall cluster resource (CPU, GPU, memory) usage by first ensuring resources for production-level jobs/services, then fairly scaling the jobs that are scalable to use the resources left in the cluster.
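
To make the suggested wording concrete, here is a rough sketch of that policy: reserve CPUs for production services first, then hand out what is left to the scalable jobs one instance at a time. The function name, the job tuple layout, and the round-robin notion of "fairly" are assumptions for illustration only; this is not the actual scaler code.

```python
def plan_instances(cluster_cpus, production_cpus, jobs):
    """Return {job name: instance count} for the scalable jobs.

    jobs maps a job name to (min_instances, max_instances, cpus_per_instance).
    All names and the tuple layout are hypothetical.
    """
    # Step 1: production-level services get their resources first.
    free = cluster_cpus - production_cpus

    # Step 2: every scalable job starts at its minimum instance count.
    plan = {name: lo for name, (lo, hi, per) in jobs.items()}
    remaining = free - sum(lo * per for (lo, hi, per) in jobs.values())
    if remaining < 0:
        raise ValueError("not enough CPUs for the minimum instance counts")

    # Step 3: round-robin, one extra instance per job per pass,
    # until no job can grow any further.
    grew = True
    while grew:
        grew = False
        for name, (lo, hi, per) in jobs.items():
            if plan[name] < hi and remaining >= per:
                plan[name] += 1
                remaining -= per
                grew = True
    return plan
```

With a 20-CPU cluster, 8 CPUs reserved for production, and two jobs that need 2 CPUs per instance, `plan_instances(20, 8, {"a": (1, 10, 2), "b": (1, 3, 2)})` gives both jobs 3 instances. If production later drops to 4 CPUs, job `a` grows to 5 while `b` stays at its maximum of 3.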

- This is a common pain point of researchers with the internal cloud.
1. Task average execution time.
- the less the better.
- average execution time is another way of measuring computing resource utilization. the less execution time, the higher overall utilization.
Collaborator

Jobs may take longer to run when the total resource requested exceeds what the cluster has, because the scaler will scale down some low-priority jobs to guarantee resources for higher-priority jobs. So when inspecting the results of the experiment, the less the average job running time increases, the better the scaler performs.
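
One possible way to report this is the relative increase in average running time against an uncontended baseline run. A small sketch, where the function name and the sample numbers are made up for illustration:

```python
def run_time_increase(baseline_times, contended_times):
    """Relative increase in average job running time (0.2 means 20% slower).

    baseline_times: running times when the cluster is not over-subscribed.
    contended_times: running times of the same jobs under contention.
    """
    base = sum(baseline_times) / len(baseline_times)
    contended = sum(contended_times) / len(contended_times)
    return (contended - base) / base
```

For instance, if jobs that took 100 s and 200 s uncontended take 120 s and 240 s under contention, the increase is 20%. A smaller value would indicate a better-performing scaler under the same load.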

Collaborator

@typhoonzero typhoonzero left a comment

LGTM!

@putcn putcn merged commit 3f8ae3c into PaddlePaddle:develop Oct 31, 2017
@putcn putcn deleted the experiment-doc-update branch October 31, 2017 07:30