This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

add vgg16_aws_dist #33

Merged
merged 3 commits into from
Jun 6, 2018

Contributor

@putcn putcn commented Jun 4, 2018

Add vgg16_aws_dist.
This test runs on AWS and uses aws_runner to dynamically allocate instances and network resources, collect metrics data, and save the results to CE for further analysis.
The models under test come from benchmark/fluid. This case runs the vgg16 distributed training test, but it is easy to add more rounds of tests with other models and cluster configs. These configs can be updated in run.xsh and continuous_evaluation.py.

Please note: nccl2 is not yet supported; it will be added in the next PR.

@putcn putcn mentioned this pull request Jun 4, 2018
@Superjomn
Collaborator

Looking at the code formatting, it seems pre-commit was not run?

@putcn
Contributor Author

putcn commented Jun 5, 2018

I did run it:

paddle-ce-latest-kpis git:(vgg16_aws_dist) ✗ pre-commit run -a
yapf.....................................................................Passed
Check for added large files..............................................Passed
Check for merge conflicts................................................Passed
Check for broken symlinks................................................Passed
Fix End of Files.........................................................Passed

writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for row in rows:
    writer.writerow(row)
Collaborator

add an empty line here.

Contributor

@Yancey1989 Yancey1989 left a comment

Do we have a proposal to reference fluid_benchmark.py and kube_gen_job.py directly from the Paddle repo (https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/fluid), so that we don't need to update them in two places?

@putcn
Contributor Author

putcn commented Jun 6, 2018

@Yancey1989 Currently we can't, because the metric retrieval part is different: we have to output metrics in a specific format so that aws_runner can identify and capture these values.
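
To illustrate the point about a fixed metric format, here is a minimal sketch of how a training script might print metrics in a line format that a log-scanning runner can pick out. The `METRIC name=value` marker and the helper names `format_metric` and `parse_metrics` are hypothetical and chosen only for illustration; the format aws_runner actually expects is not shown in this thread and may differ.

```python
import re

# Hypothetical marker format, for illustration only. The real format that
# aws_runner looks for is not stated in this conversation.
METRIC_PATTERN = re.compile(r"^METRIC (?P<name>\w+)=(?P<value>[-+0-9.eE]+)$")


def format_metric(name: str, value: float) -> str:
    """Render one metric as a single, easily recognizable log line."""
    return f"METRIC {name}={value}"


def parse_metrics(log_text: str) -> dict:
    """Scan log output and collect every line that matches the marker."""
    metrics = {}
    for line in log_text.splitlines():
        match = METRIC_PATTERN.match(line.strip())
        if match:
            metrics[match.group("name")] = float(match.group("value"))
    return metrics


if __name__ == "__main__":
    log = "\n".join([
        "pass 0, batch 10",
        format_metric("train_speed", 12.5),
        "some unrelated log line",
    ])
    print(parse_metrics(log))
```

Because the training script and the runner must agree on this line format, a benchmark script that prints metrics differently could not be reused unchanged, which is the constraint described above.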

Yancey1989 previously approved these changes Jun 6, 2018
Contributor

@Yancey1989 Yancey1989 left a comment

LGTM++ with a few small comments.

--online_mode yes \
--pserver_command $training_command \
--trainer_command $training_command \
--docker_image $fluid_benchmark_dockerhub_tag
Contributor

It seems a blank line is needed at the end of the file.

Contributor Author

sure, will update

@@ -0,0 +1 @@
[1.0]
Contributor

Can we remove these KPI result files?

Contributor Author

No, CE needs to have initial KPI data.
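
One way to picture why a baseline file is needed: a new measurement can only be judged against a previously recorded value. The sketch below assumes the record file holds a JSON list such as the `[1.0]` shown in the diff. The function name `within_tolerance`, the use of the last entry as the reference, and the 5% tolerance are all assumptions made for illustration, not a description of how CE evaluates KPIs.

```python
import json


def within_tolerance(baseline_text: str, new_value: float,
                     rel_tol: float = 0.05) -> bool:
    """Compare a new KPI value against the last recorded baseline value.

    baseline_text is the content of a record file, e.g. "[1.0]".
    rel_tol is a hypothetical relative tolerance, not a CE setting.
    """
    baseline = json.loads(baseline_text)  # "[1.0]" -> [1.0]
    reference = baseline[-1]
    return abs(new_value - reference) <= rel_tol * abs(reference)


if __name__ == "__main__":
    print(within_tolerance("[1.0]", 1.03))
    print(within_tolerance("[1.0]", 1.2))
```

Without an initial record there is no reference value to compare against, so the first run would have nothing to be checked against.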

Contributor

@Yancey1989 Yancey1989 left a comment

LGTM again.

@putcn putcn merged commit 07cc82f into PaddlePaddle:master Jun 6, 2018

3 participants