Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add convert to recordio format function #183

Merged
merged 17 commits into from
Jun 29, 2017

Conversation

gongweibao
Copy link
Collaborator

@gongweibao gongweibao commented Jun 26, 2017

Copy link
Collaborator

@typhoonzero typhoonzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need a Dockerfile and k8s yaml file and README.md.

path = output_path + "/" + name
mkdir_p(path)

mod.convert(path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we do convert and split at the same time? @Yancey1989

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert的时候已经split了。

Copy link
Collaborator

@Yancey1989 Yancey1989 Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the convert function here:https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/common.py#L167, it also needs some other required parameters such as reader and num_shards. So maybe the code can not be run correctly, or I missed sth. ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh,sorry,我应该把相关的PR都放上来的。其实,跟这个相关的PR是这个:
PaddlePaddle/Paddle#2608

raise

def convert(output_path, name):
print "proc " + name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use logging instead of printing, it will log the time and log level.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

print "proc " + name
mod = __import__("paddle.v2.dataset." + name, fromlist=[''])

path = output_path + "/" + name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use os.path.join(output_path, name) instead of hard code delimiter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done.

path = output_path + "/" + name
mkdir_p(path)

mod.convert(path)
Copy link
Collaborator

@Yancey1989 Yancey1989 Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the convert function here:https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/common.py#L167, it also needs some other required parameters such as reader and num_shards. So maybe the code can not be run correctly, or I missed sth. ?

@@ -0,0 +1,4 @@
FROM paddlepaddle/paddle
ADD ./convert.py /convert/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add multiple files using a single line to reduce the image layers

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import logging
import logging.config

logging.config.fileConfig('logging.conf')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use [dict_config], so we don't need another config file.

FROM paddlepaddle/paddle
ADD ./convert.py /convert/
ADD ./logging.conf /convert/
ADD .cache/paddle/dataset /root/.cache/paddle/dataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a default command for this dockerfile

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没有default command,路径是需要指定的。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't get your point. Default command could be CMD ["/program", "argument"], this can be overrided by k8s yaml definations

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.Done.

Copy link
Collaborator

@helinwang helinwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, could you let me know why users need to submit a k8s job to convert the public datasets?
I thought we will convert them locally (or by a k8s job), upload to the cluster public storage. And users don't need to convert them again?

@gongweibao
Copy link
Collaborator Author

gongweibao commented Jun 28, 2017

@helinwang 这里的users其实是我们自己。当我们有了新的集群或者需要重新生成一下数据,只需要改一下convert_app.yaml里边的路径,然后用kubectrl create -f convert_app.yaml 就可以把数据写到相应的位置。我还是把README改成中文吧,英文还是表达不清楚啊。

Copy link
Collaborator

@helinwang helinwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gongweibao 懂了,谢谢!LGTM.

@gongweibao gongweibao merged commit 19eff0c into PaddlePaddle:develop Jun 29, 2017
@gongweibao gongweibao deleted the convertdataset branch August 21, 2017 08:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants